One Failed NIC Strands 20,000 At LAX - Slashdot

Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

×

One Failed NIC Strands 20,000 At LAX 293

Posted by kdawson on Wednesday August 15, 2007 @03:56PM from the comp-dot-risks dept.

The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.

This discussion has been archived. No new comments can be posted.

One Failed NIC Strands 20,000 At LAX

Search 293 Comments Log In/Create an Account

Comments Filter:

Re:Whiskey Tango Foxtrot (Score:2, Informative)

by sigipickl ( 595932 ) writes: on Wednesday August 15, 2007 @04:16PM (#20240949)

This totally sounds like a token ring problem.... Either network flooding or dropped packets (tokens). These issues used to be a bear to track down- going from machine to machine in serial from the MAU...

Ethernet and switching has made me fat- I never have to leave my desk to troubleshoot.

Parent Share
twitter facebook
Re:You figure it out (Score:5, Informative)

by GreggBz ( 777373 ) writes: on Wednesday August 15, 2007 @04:21PM (#20241023) Homepage

One not to unreasonable strategy is to set up SNMP traps on all your NICs. This is not unlike the cable modem watching software at most Cable ISPs.

At first, I can envision it being a PITA if you have a variety of NIC hardware especially finding all those MIBs. But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes. You could script the results, sorting all the naughties and periodic non-responders to the top of the list. That would narrow things down a heck of a lot in a circumstance like this.

No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems, without to much load on the snmp trap server.

Parent Share
twitter facebook
Re:Head of IT for LAX should be fired... (Score:3, Informative)

by kschendel ( 644489 ) writes: on Wednesday August 15, 2007 @04:24PM (#20241081) Homepage

RTFA. This was a *Customs* system. Not LAX, not airlines. The only blame that the airlines can (and should) get for this is not shining the big light on Customs and Border Patrol from the very start. I think it's time that the airlines started putting public and private pressure on CBP and TSA to get the hell out of the way. It's not as if they are actually securing anything.

CBP deserves a punch in the nose for not having a proper network design with redundancy; and another punch in the nose for not having any clue what to do in an outage. They should have a reduced-service backup plan, and a manual backup plan, and a diversion backup plan. There's no excuse for federal officials to sit there like idiots waiting for things to magically get fixed. Oh wait, I guess some of them ARE idiots.

Parent Share
twitter facebook
It depends on the switch (Score:5, Informative)

by camperdave ( 969942 ) writes: on Wednesday August 15, 2007 @04:57PM (#20241375) Journal

You're right to a point. An ethernet frame, along with the source and destination addresses, has a checksum. A switch that is using a store and forward procedure is supposed to drop the frame if the checksum is invalid. If the nic was throwing garbled frames onto the network, it would have to be garbled in such a way as to have a valid checksum (assuming they are using store and forward switches in the first place).

Parent Share
twitter facebook
Re:That's all it takes (Score:4, Informative)

by Vengance Daemon ( 946173 ) writes: on Wednesday August 15, 2007 @05:41PM (#20241887)

Why are you assuming that this is an Ethernet network? As old as the equipment they are using is, it may be a Token Ring network - the symptoms that were described sound just like a "beaconing" token ring network.

Parent Share
twitter facebook
Re:You figure it out (Score:5, Informative)

by ctr2sprt ( 574731 ) writes: on Wednesday August 15, 2007 @06:32PM (#20242413)

One not to unreasonable strategy is to set up SNMP traps on all your NICs.

That doesn't make much sense. If the NIC goes down or starts misbehaving, the chances of your NIC's SNMP traps arriving at their destination is effectively zero. You probably mean setting up traps on your switches with threshold traps on all the interfaces, the switch's CPU, CAM table size, etc. Which would be more useful. You could also use a syslog server, which is going to be considerably easier if you don't have a dedicated monitoring solution.

But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes.

You're not thinking of traps if you're talking about polling. Traps are initiated by the switch (or other device) and sent to your log monster. You can use SNMP polling of the sort that e.g. MRTG and OpenNMS do which, with appropriate thresholds, can get you most of the same benefits. But don't use it on Cisco hardware, not if you want your network to function, anyway. Their CPUs can't handle SNMP polling, not at the level you're talking about.

No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems, without to much load on the snmp trap server.

I think you are underestimating exactly how much SNMP trap spam network devices send. You'll get a trap for the ambient temperature being too high. You'll get a trap if you send more than X frames per second ("threshold fired"), and another trap two seconds later when it drops below Y fps ("threshold rearmed"). You'll get at least four link traps whenever a box reboots (down for the reboot, up/down during POST, up when the OS boots; probably another up/down as the OS negotiates link speed and duplex), plus an STP-related trap for each link state change ("port 2/21 is FORWARDING"). You'll get traps when CDP randomly finds, or loses, some device somewhere on the network. You'll get an army of traps whenever you create, delete, or change a vlan. If you've got a layer 7 switch that does health checks, you'll get about ten traps every time one of your HA webservers takes more than 100ms to serve its test page, which happens about once per server per minute even when nothing is wrong.

And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan. So when a failing NIC starts jabbering and the poor switch's CPU goes to 100%, you'll never see a trap. All you'll see are a bunch of boxes on the same vlan going up and down for no apparent reason. You might get a fps threshold trap from some gear on your distribution or core layers, assuming it's sufficiently beefy to handle a panicked switch screaming ARPs at a gig a second and have some brains left over, but that's about it. More likely you won't have a clue that anything is wrong until the switch kicks and 40 boxes go down for five minutes.

Monitoring a network with tens of thousands of switch ports sucks hardcore, there's no way around it.

Parent Share
twitter facebook
A Cisco Config to prevent this (Score:3, Informative)

by ScaredOfTheMan ( 1063788 ) writes: on Wednesday August 15, 2007 @08:37PM (#20243575)

Yes NICs can go crazy and start blasting broadcasts or Unicasts over your network, if you have a Cisco switch (or any other that supports storm control like features) you may want to enable it, it costs you nothing but the time it takes you to update the config. on the access switch (the one connected to your PCs) get into config mode at type this on every interface that connects directly to a PC (use the interface range command to speed things up if you want). Switch(config-if)#storm-control unicast level X where X is the percent of total interface bandwidth you specify as the threshold for cutting access to that port. Its measure every second, so if you have 100 meg port and you set it to 30, if the PC pushes more than 30 meg a sec in unicasts the switch kills the port, till the pc calms down, if its a 10 meg port the 30 then equals 3 meg, etc etc. You can also add a second line to control broadcasts by changing the word unicast to broadcast. If that had this in place, when the NIC went nuts, the switch would have killed the port, and no outage (I assume a lot here, but you get the point).

Share
twitter facebook
Re:You figure it out (Score:2, Informative)

by huge ( 52607 ) writes: on Thursday August 16, 2007 @09:03AM (#20248039)

And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan.

In some cases it might be better idea to use inform [cisco.com] instead of trap.

Parent Share
twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Related Links Top of the: day, week, month.

390 comments32-Hour Workweek for America Proposed by Senator Bernie Sanders
358 commentsWhat Should Happen to Empty Downtown Office Spaces?
340 commentsHacktivism Erupts In Response To Hamas-Israel War
324 comments'Feedback' Is Now Too Harsh. The New Word is 'Feedforward'
248 commentsWorkers are Resisting Calls to Return to Offices

E = MC ** 2 +- 3db