
 



Networking Bug

One Failed NIC Strands 20,000 At LAX

The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.

  • by sigipickl ( 595932 ) on Wednesday August 15, 2007 @04:16PM (#20240949)
    This totally sounds like a token ring problem... Either network flooding or dropped packets (tokens). These issues used to be a bear to track down, going from machine to machine in series from the MAU...

    Ethernet and switching have made me fat; I never have to leave my desk to troubleshoot.
  • Re:You figure it out (Score:5, Informative)

    by GreggBz ( 777373 ) on Wednesday August 15, 2007 @04:21PM (#20241023) Homepage
    One not too unreasonable strategy is to set up SNMP traps on all your NICs. This is not unlike the cable modem watching software at most cable ISPs.

    At first, I can envision it being a PITA if you have a variety of NIC hardware, especially finding all those MIBs. But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes. You could script the results, sorting all the naughties and periodic non-responders to the top of the list. That would narrow things down a heck of a lot in a circumstance like this.

    No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems, without too much load on the SNMP trap server.
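The "script the results" idea above could look something like the following. This is only a rough sketch: the `PollResult` fields, the host names, and the ranking order are assumptions made up for illustration, not anything the poster or a real SNMP tool defines. It just sorts non-responders and high error counts to the top.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PollResult:
    host: str                  # hypothetical device name
    in_errors: Optional[int]   # error-counter delta since last poll; None = no answer
    missed_polls: int          # consecutive polls that got no answer


def rank_suspects(results: List[PollResult]) -> List[PollResult]:
    """Sort so non-responders come first, then the noisiest responders."""
    def badness(r: PollResult):
        responded = r.in_errors is not None
        # False sorts before True, so silent hosts float to the top;
        # negated counts put the largest values first within each group.
        return (responded, -r.missed_polls, -(r.in_errors or 0))
    return sorted(results, key=badness)


if __name__ == "__main__":
    sample = [
        PollResult("gate-a", 0, 0),
        PollResult("gate-b", None, 3),
        PollResult("gate-c", 500, 0),
        PollResult("gate-d", None, 1),
    ]
    for r in rank_suspects(sample):
        print(r.host, r.in_errors, r.missed_polls)
```

How the counters actually get collected (which MIB objects, which poller) is left out on purpose; that part depends entirely on the gear.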
  • by kschendel ( 644489 ) on Wednesday August 15, 2007 @04:24PM (#20241081) Homepage
    RTFA. This was a *Customs* system. Not LAX, not airlines. The only blame that the airlines can (and should) get for this is not shining the big light on Customs and Border Protection from the very start. I think it's time that the airlines started putting public and private pressure on CBP and TSA to get the hell out of the way. It's not as if they are actually securing anything.

    CBP deserves a punch in the nose for not having a proper network design with redundancy; and another punch in the nose for not having any clue what to do in an outage. They should have a reduced-service backup plan, and a manual backup plan, and a diversion backup plan. There's no excuse for federal officials to sit there like idiots waiting for things to magically get fixed. Oh wait, I guess some of them ARE idiots.

  • by camperdave ( 969942 ) on Wednesday August 15, 2007 @04:57PM (#20241375) Journal
    You're right to a point. An Ethernet frame, along with the source and destination addresses, has a checksum. A switch that is using a store-and-forward procedure is supposed to drop the frame if the checksum is invalid. If the NIC was throwing garbled frames onto the network, they would have to be garbled in such a way as to have a valid checksum (assuming they are using store-and-forward switches in the first place).
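The checksum-and-drop behavior described above can be illustrated in a few lines. This is a simplified sketch: a real switch checks the frame check sequence in hardware, and the function names and the frame bytes here are invented for the example. The one solid fact it leans on is that the Ethernet FCS is a CRC-32 with the same polynomial `zlib.crc32` uses, appended least significant byte first.

```python
import struct
import zlib
from typing import Optional


def add_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 frame check sequence (little-endian) to a frame."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC over the body and compare with the trailing 4 bytes."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == fcs


def store_and_forward(frame_with_fcs: bytes) -> Optional[bytes]:
    """Forward the frame only if its checksum is valid; otherwise drop it."""
    return frame_with_fcs if fcs_ok(frame_with_fcs) else None


if __name__ == "__main__":
    good = add_fcs(b"\xff" * 6 + b"\x00\x11\x22\x33\x44\x55" + b"payload")
    garbled = bytes([good[0] ^ 0x01]) + good[1:]   # flip one bit
    print(store_and_forward(good) is not None)     # forwarded
    print(store_and_forward(garbled) is None)      # dropped
```

Note this only covers garbled frames. A NIC that floods perfectly well-formed frames sails straight through this check.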
  • by Vengance Daemon ( 946173 ) on Wednesday August 15, 2007 @05:41PM (#20241887)
    Why are you assuming that this is an Ethernet network? As old as the equipment they are using is, it may be a Token Ring network - the symptoms that were described sound just like a "beaconing" token ring network.
  • Re:You figure it out (Score:5, Informative)

    by ctr2sprt ( 574731 ) on Wednesday August 15, 2007 @06:32PM (#20242413)

    One not too unreasonable strategy is to set up SNMP traps on all your NICs.

    That doesn't make much sense. If the NIC goes down or starts misbehaving, the chances of your NIC's SNMP traps arriving at their destination are effectively zero. You probably mean setting up traps on your switches with threshold traps on all the interfaces, the switch's CPU, CAM table size, etc. Which would be more useful. You could also use a syslog server, which is going to be considerably easier if you don't have a dedicated monitoring solution.

    But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes.

    You're not thinking of traps if you're talking about polling. Traps are initiated by the switch (or other device) and sent to your log monster. You can use SNMP polling of the sort that e.g. MRTG and OpenNMS do which, with appropriate thresholds, can get you most of the same benefits. But don't use it on Cisco hardware, not if you want your network to function, anyway. Their CPUs can't handle SNMP polling, not at the level you're talking about.

    No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems, without too much load on the SNMP trap server.

    I think you are underestimating exactly how much SNMP trap spam network devices send. You'll get a trap for the ambient temperature being too high. You'll get a trap if you send more than X frames per second ("threshold fired"), and another trap two seconds later when it drops below Y fps ("threshold rearmed"). You'll get at least four link traps whenever a box reboots (down for the reboot, up/down during POST, up when the OS boots; probably another up/down as the OS negotiates link speed and duplex), plus an STP-related trap for each link state change ("port 2/21 is FORWARDING"). You'll get traps when CDP randomly finds, or loses, some device somewhere on the network. You'll get an army of traps whenever you create, delete, or change a vlan. If you've got a layer 7 switch that does health checks, you'll get about ten traps every time one of your HA webservers takes more than 100ms to serve its test page, which happens about once per server per minute even when nothing is wrong.

    And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan. So when a failing NIC starts jabbering and the poor switch's CPU goes to 100%, you'll never see a trap. All you'll see are a bunch of boxes on the same vlan going up and down for no apparent reason. You might get a fps threshold trap from some gear on your distribution or core layers, assuming it's sufficiently beefy to handle a panicked switch screaming ARPs at a gig a second and have some brains left over, but that's about it. More likely you won't have a clue that anything is wrong until the switch kicks and 40 boxes go down for five minutes.

    Monitoring a network with tens of thousands of switch ports sucks hardcore, there's no way around it.
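The "threshold fired" / "threshold rearmed" pair mentioned above is a plain hysteresis pattern, and a toy version shows why a port hovering near the limit generates a steady stream of traps. The function name, sample values, and limits below are made up for illustration; no real device's trap logic is being reproduced here.

```python
from typing import Iterable, List, Tuple


def threshold_events(
    samples: Iterable[float], fire_above: float, rearm_below: float
) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs for a fire/rearm threshold."""
    events: List[Tuple[int, str]] = []
    armed = True
    for i, fps in enumerate(samples):
        if armed and fps > fire_above:
            # Crossed the upper limit while armed: one "fired" event.
            events.append((i, "threshold fired"))
            armed = False
        elif not armed and fps < rearm_below:
            # Fell below the lower limit: re-arm so it can fire again.
            events.append((i, "threshold rearmed"))
            armed = True
    return events


if __name__ == "__main__":
    frames_per_second = [100, 900, 850, 400, 950]
    for index, event in threshold_events(frames_per_second, 800, 500):
        print(index, event)
```

Multiply that by every interface on every switch and the trap volume adds up fast.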

  • by ScaredOfTheMan ( 1063788 ) on Wednesday August 15, 2007 @08:37PM (#20243575)
    Yes, NICs can go crazy and start blasting broadcasts or unicasts over your network. If you have a Cisco switch (or any other that supports storm-control-like features), you may want to enable it; it costs you nothing but the time it takes to update the config. On the access switch (the one connected to your PCs), get into config mode and type this on every interface that connects directly to a PC (use the interface range command to speed things up if you want): Switch(config-if)#storm-control unicast level X, where X is the percent of total interface bandwidth you specify as the threshold for cutting access to that port. It's measured every second, so if you have a 100 meg port and you set it to 30, then if the PC pushes more than 30 meg a second in unicasts, the switch kills the port until the PC calms down. If it's a 10 meg port, the 30 then equals 3 meg, and so on. You can also add a second line to control broadcasts by changing the word unicast to broadcast. If they had this in place, when the NIC went nuts the switch would have killed the port, and no outage (I assume a lot here, but you get the point).
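The percentage arithmetic in the comment above is easy to check. The helper names below are hypothetical, and this only models the math (level X as a percent of port speed). What the switch actually does once traffic exceeds the level depends on the platform and its configuration, which this sketch does not try to model.

```python
def storm_threshold_mbps(port_speed_mbps: float, level_percent: float) -> float:
    """Convert a storm-control level (percent of port speed) to Mbit/s."""
    if not 0 <= level_percent <= 100:
        raise ValueError("level must be between 0 and 100 percent")
    return port_speed_mbps * level_percent / 100.0


def over_threshold(
    observed_mbps: float, port_speed_mbps: float, level_percent: float
) -> bool:
    """True if traffic seen in the measurement interval exceeds the level."""
    return observed_mbps > storm_threshold_mbps(port_speed_mbps, level_percent)


if __name__ == "__main__":
    print(storm_threshold_mbps(100, 30))   # 100 meg port at level 30
    print(storm_threshold_mbps(10, 30))    # 10 meg port at level 30
    print(over_threshold(45, 100, 30))     # a PC pushing 45 meg on a 100 meg port
```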
  • Re:You figure it out (Score:2, Informative)

    by huge ( 52607 ) on Thursday August 16, 2007 @09:03AM (#20248039)

    And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan.
    In some cases it might be a better idea to use inform [cisco.com] instead of trap.

