One Failed NIC Strands 20,000 At LAX
The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.
That's all it takes (Score:2, Interesting)
Re: (Score:3, Interesting)
Re: (Score:3, Insightful)
Re:That's all it takes (Score:5, Interesting)
It would take a router to stop this from happening. I don't think there are many networks that use routers for internal partitioning. Even then, the entire network behind that router would be flooded.
It depends on the switch (Score:5, Informative)
Re:That's all it takes (Score:4, Informative)
Re: (Score:2)
damn tokens... (Score:3, Funny)
The worst thing is when a user decides to unplug the cable to move something or whatever. Then the token can fall out and you have to spend hours on your hands and knees with a magnifying glass trying to find the damn thing!
It's true! I saw it in a Dilbert cartoon!
Re: (Score:2)
We have *tons* of routers that separate the various subnets (which map 1-to-1 to VLANs) on our internal network. How do you get from one "broadcast domain" to another? Via the default gateway on your own subnet. That is passing through a router. (It may not be a physical box called a 'router' but the packets are still being routed).
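To make the parent's point concrete, here's a minimal Python sketch (hypothetical addresses) of the decision a host makes: a destination on the same subnet is delivered on-link at layer 2, while anything outside it is handed to the default gateway, i.e. a router:

```python
import ipaddress

def next_hop(src_iface_cidr: str, dst: str, default_gw: str) -> str:
    """Return where a host sends a packet: directly on-link, or via the gateway.

    src_iface_cidr: this host's interface address, e.g. "10.1.2.17/24"
    """
    iface = ipaddress.ip_interface(src_iface_cidr)
    dst_addr = ipaddress.ip_address(dst)
    # Same subnet (same broadcast domain): deliver directly at layer 2.
    if dst_addr in iface.network:
        return "on-link"
    # Different subnet: hand the packet to the default gateway (a router).
    return default_gw

print(next_hop("10.1.2.17/24", "10.1.2.99", "10.1.2.1"))  # on-link
print(next_hop("10.1.2.17/24", "10.1.9.5", "10.1.2.1"))   # 10.1.2.1
```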
References? (Score:2)
Re:That's all it takes (Score:5, Insightful)
If the IT folks were clueless about this machine's age or condition, then the blame lies solely with them for not knowing what the hell they were doing. However, if it was the other folks who shot the IT folks down about upgrading then "welcome to the current state of business", unfortunately.
Re:That's all it takes (Score:5, Insightful)
Re:That's all it takes (Score:5, Insightful)
That's very difficult to do, and your estimates of the costs will be called into question. It's often impossible to predict how long it'll take to diagnose and fix a problem unless you've already diagnosed and fixed a similar problem.
Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.
Re: (Score:3, Insightful)
Right, but that's why IT doesn't provide the numbers. It just provides the scenario and it's the bean-counters (BC) that provide the numbers.
IT: "We have some really old hardware that's going to fail any day now..."
BC: "So what?"
IT: "Well, that's a good question. We know it's going to cost $Bazillion to fix, so we need to find out whether it's worth it or not. Here's what will happen when it dies - LAX completely shuts down...
Re: (Score:3, Insightful)
Re:That's all it takes (Score:5, Insightful)
In contrast, technical staff hear a lot about the Tacoma Narrows Bridge, Liberty Ships, the Titanic, and similar long-ago disasters as illustrations of how things can go wrong, before they're even let out of their first year of training. Some managers discard those lessons as relics from the days of the dinosaurs, which is why we seem to see maintenance, infrastructure, and contingency plans cut to nothing every decade, then treated as important again in the years immediately following a string of expensive or deadly disasters.
Re:That's all it takes (Score:4, Insightful)
There's no reason you can't leave the almost-broken computer there and get a new one. You just build a backup system. Surely management understands that redundancy is good. Then, when the crappy one breaks, you can swap it out instantly. That way, you don't have to mess with things prematurely, but you're only down for hopefully a few minutes. (Of course, replacing it "intentionally", before it fails, is more reliable, but keeping a backup system is a viable alternative if nobody wants to touch the working system.)
Re: (Score:3, Insightful)
Re: (Score:3, Interesting)
No. In management's eyes, redundancy is bad. You're paying twice as much, but you're not getting any extra functionality in return.
Re: (Score:2)
Of course they're running old and outdated hardware. When things work, particularly in a mission-critical situation, you don't touch them! Even if the IT admins knew that computer was old and on the brink of dying, how were they supposed to convince the suits and bean counters of that?
And why shouldn't the bean counters expect the old, outdated hardware to work? Quite a bit of the air traffic control system is hardware that is decades old, though it is slowly being replaced. You have to go to the bean counters
The scope of the problem (Score:5, Interesting)
Americans are still designing systems (and I'm talking WHOLE systems, not just the computers) for the industrial revolution. Much the same way, we're educating our kids for the same purpose- to make them cogs for manufacturing.
The Japanese have a more 'cellular' structure, as opposed to the 'pyramid' designed back a couple of 'turns of the century' ago. One man on top drives five, who drive 200, who drive them all. But the Japanese model is more like object orientation: each unit has private parts. So long as the command it's given produces the proper results and stays within budget, who cares?
Assembly lines gather at their meetings and decide policy on their own. "Fred has been late 3 times this week; do we care?" - and the only people to whom it matters decide. There's no need for a strict, top-down policy, especially when each small unit does only one job.
Imagine the broken structures in a holding company: they own a newspaper, a car wash, and a grocery store. The top man can't say "We'll only use glass containers", because that would be a disaster at the car wash. He can't say "we choose leaded inks", which might be fine for the car wash but dangerous at the newspaper. Each unit has its own purpose.
So how about giving the network admins the power to do *whatever* it takes to keep the equipment up to date? As long as it runs, stays under budget, and doesn't get 'em in the newspapers, who cares about the specifics? Why not let the unused budget from each year sit in an account (not being taken back) and use THAT to improve infrastructure?
If these guys were able to have that kind of control, this discussion wouldn't be happening.
Re:That's all it takes (Score:5, Interesting)
I've seen what's running some government agencies, and it's frightening.
Re: (Score:3, Funny)
Re: (Score:3, Interesting)
Yeah, Access is a piece of shit. Unfortunately, it's a lot better than using Excel as a database, which is in many cases the alternative that I've witnessed.
There is also a lack of alternatives: you have FileMaker Pro, which is neat (I like it) but not very appealing to some because it has a significant learning curve compared to Access and is also proprietary and expensive; aside from that you have OO.org's Base, which is still immature; and then you've
Re: (Score:2)
So, no, running out-of-date hardware wouldn't surprise me at all.
Re: (Score:2)
Yes, I am glad to be out of that velvet lined rut and in a world where there are actual professionals.
Re: (Score:2)
Re: (Score:2)
For my money, this should never have happened from a problem with one machine. That's wholly unacceptable. My home network is robust enough to handle one bad machine without going down completely...Hell, I could lose a whole subnet and no one on the other subnet would notice a thing.
If this system or switch or whatever is critical, there should have been a fail over. Th
Whiskey Tango Foxtrot (Score:5, Insightful)
For that to have had any effect at all, that system must have been the lynchpin for a critical piece of the network...probably some Homeland Security abortion tacked on to the network, or some such crap...This is like the time I traced a network meltdown to a 4-port hub (not a switch, an unmanaged hub) that was plugged into (not a joke) a T-3 concentrator on one port, and three subnets of around 200 computers each on the other 3 ports. Every single one of the outbound cables from that $15.00 hub terminated in a piece of networking infrastructure costing not less than $10,000.
This is like that. Single point of failure in the worst possible way. Gross incompetence, shortsightedness, and general disregard for things like "uptime"; pretty much what we've come to expect from the airline industry these days. If I'm not flying myself, I'm going to be driving, sailing, or riding a goddamn bicycle before I fly commercial.
Re:Whiskey Tango Foxtrot (Score:4, Interesting)
Token ring sure used to fail like this! 1 bad station sending 10,000 ring-purge messages a second? Still, it was a truck. Files under 1Mb could be transferred, and this was TR/4, not 16!
Re: (Score:2, Informative)
Ethernet and switching has made me fat- I never have to leave my desk to troubleshoot.
Re: (Score:2)
Re: (Score:3, Funny)
Right?
[crickets chirping]
Right?
Re:Whiskey Tango Foxtrot (Score:4, Funny)
Token Ring upgrade (Score:2)
Re: (Score:3, Interesting)
Re:Whiskey Tango Foxtrot (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
I'll bet the sucker not only keeps his job but gets a commendation for finding the problem.
Re: (Score:3, Funny)
Re: (Score:3, Interesting)
And beyond that... how come there is no redundancy? After 9/11, every IT organization on the planet began making sure there was some form of fail-over to a backup system or disaster recovery site, to ensure that critical systems could not go down as the result of something similar or some other large-scale disaster. Not only was this system apparently cobbled together, there was no regard for the possibility of it failing for any reason.
Re: (Score:3, Insightful)
Re: (Score:2)
For instance, imagine a RAID 1 in which the data is becoming corrupted. Having redundancy doesn't help: you just have two copies of a corrupted file.
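A toy illustration of that point in Python (not any real RAID implementation): a mirror faithfully replicates whatever is written to it, corruption included, so redundancy alone buys you nothing here:

```python
class Mirror:
    """Toy RAID 1: every write goes to both disks, corrupted or not."""

    def __init__(self):
        self.disk_a = {}
        self.disk_b = {}

    def write(self, block: int, data: bytes) -> None:
        # The mirror has no idea whether `data` is good; it just copies it.
        self.disk_a[block] = data
        self.disk_b[block] = data

m = Mirror()
m.write(0, b"good data")
m.write(0, b"\xde\xad")  # a corrupted write is faithfully mirrored
print(m.disk_a[0] == m.disk_b[0])  # True: two identical bad copies
```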
In this instance, a network card started spewing out crap. Because it could fill its pipe, and most of the packets were rebroadcast down most of the other cables, they also filled those cables.
Re: (Score:2)
Business-continuity-style disaster recovery doesn't really treat public-facing websites as high priority. Usually it's payroll, accounts receivable, and the other things needed to keep a business moving forward in case of a disaster.
Letting customers visit your public website is probably the lowest priority in recovering from an actual disaster.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
According to the effing article, it wasn't even a server, but a goddamn desktop. How in the holy hell does a desktop take down the whole system? I can't even conceive of a situation where that could be the case on anything other than a network designed by chimps, especially through a hardware failure...A compromised system might be able to do it, but a system just going dark?
Los Angeles World Airports is a unique system of four airports owned and operated by the City of Los Angeles.
Any further questions?
Probably the lowest bidder union labor designing and setting it up. Shoulda called IBM.
Re: (Score:2)
Heh, I think you're starting to get the sensation I had when one tiny error in GRUB locked me out of my computer entirely, to the point where even having the Ubuntu Install CD couldn't gain me any access to any OS whatsoever.
Geez, what kind of chimp would allow such a damaging failure to occur along such a vital path, right?
Re: (Score:2)
The thing is, a network topology is wildly different from a computer. It should be designed for parts of it to drop off, and parts to go berserk...These things happen all the time. It should be designed with a minimum of
Re: (Score:2)
Yeah, then I could go to the local store, buy a part, replace it, and move on with my life, like I've done before (a power supply did in fact go out on me before).
I'd still expect you to be able to boot from CD though. You try knoppix? You should be able to boot to knoppix, then mount the
Hey! Capital idea! I'd download it on that computer I c
Re: (Score:2)
LOL. Perhaps they ran out of funding after buying all of the rest of the hardware?
Re: (Score:2)
Re: (Score:2)
Fricking stupid. People think it'll never come back to bite them, and it always does.
Re: (Score:2, Insightful)
I think just hiring idiots would be enough. No need to train them.
Re: (Score:2)
Do these people hire idiots with no training or experience or what?
Probably they do to some extent, but if it's like other places I've worked, they probably hire people who have a clue, but then tell them to do little bits and pieces, and never give them enough resources to actually do the job right.
It's a lot of "we'll pay you to come out and install this." They don't want to hear 'well, you should really re-think the architecture of your whole network' as a response. They just want the new piece grafted on, and if you don't do the job, they'll just find somebody who will.
Re: (Score:2)
Bunch of monkeys. The reason I don't fly commercial anymore has nothing to do with the planes. It has everything to do with the airports.
Re: (Score:2)
The point is ATC is not controlled by "their equipment". The airport authority and the FAA are two wholly separate agencies.
It would be like worrying about the strength of the US Army based on the fact that a prisoner escaped from a police car owned by the NYPD. Sure, both agencies involve guys carrying guns, but they otherwise have nothing to do with each other.
Re: (Score:2)
After all, preventing layer 2 loops is what Spanning Tree is all about, and I thought Cisco had some similar system for figuring out if a link was unidirectional (if you're sending packets down to something and not getting anything back, it can shut it down, to keep it from just
Re: (Score:2)
In other news... (Score:3, Insightful)
What makes them think they'll get another shot? Rank and file voters are ready with their own plan...should a 'similar incident' by the same fools happen again.
The backup plan (Score:5, Funny)
DHS's idea of a "backup plan" will probably be to build a huge fenced area into which to dump arriving passengers when their systems are down.
Re: (Score:2)
I hear EMA has several new/used camp trailers I'm sure DHS could avail themselves of.
No, a multi-front plan (Score:2)
Change to Wifi because that can't have NIC faults.
C'mon folk... help me out here!
Re: (Score:2)
Arrest all NIC designers, engineers, network stack developers, IT managers,... on suspicion of conspiring to cause the problem. Change to Wifi because that can't have NIC faults. C'mon folk... help me out here!
Print each package to be sent over the network, use the USPS first class mail to send it to the right destination on time, and hire a bunch of undocumented immigrants to enter the data again.
I'm sure they already have a nice database to use to find prospects that could do the data entry!
Re: (Score:2)
You figure it out (Score:4, Interesting)
First you see latency on a network, then you fire up a sniffer and hope to god you can get enough packets to deduce which is the flaky card without shutting down every NIC on your network.
Of course, I did write a paper on this behavior years ago in my CS networking class: take a Snort box and a series of custom scripts to notify admins of spikes on the network outside of normal operating ranges for that device's history. Implementing this successfully in an elegant fashion has been beyond me, however, and I just rely on Nagios to do a lot of my bidding.
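For what it's worth, the baseline-and-spike idea can be sketched in a few lines of Python. The device names, window size, and threshold factor here are made up, and a real deployment would feed it counters from Snort, SNMP, or NetFlow rather than hand-made samples:

```python
import statistics
from collections import defaultdict, deque

class SpikeDetector:
    """Flag a device whose traffic rate falls outside its own recent history.

    Hypothetical stand-in for the Snort-plus-scripts approach: keep a
    per-device baseline and alert when a sample exceeds mean + k * stdev.
    """

    def __init__(self, window: int = 60, k: float = 3.0):
        self.k = k
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, device: str, pkts_per_sec: float) -> bool:
        hist = self.history[device]
        alert = False
        if len(hist) >= 10:  # need some baseline before judging
            mean = statistics.mean(hist)
            stdev = statistics.pstdev(hist)
            # Floor the stdev so a perfectly flat baseline doesn't alert on noise.
            alert = pkts_per_sec > mean + self.k * max(stdev, 1.0)
        hist.append(pkts_per_sec)
        return alert

det = SpikeDetector()
for _ in range(30):
    det.observe("host-a", 100.0)        # normal chatter: no alerts
print(det.observe("host-a", 50_000.0))  # jabbering NIC -> True
```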
Re:You figure it out (Score:5, Informative)
At first, I can envision it being a PITA if you have a variety of NIC hardware especially finding all those MIBs. But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes. You could script the results, sorting all the naughties and periodic non-responders to the top of the list. That would narrow things down a heck of a lot in a circumstance like this.
No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems without too much load on the snmp trap server.
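A rough sketch of the sorting step in Python, with fabricated poll results standing in for real SNMP queries (a real script would pull something like ifInErrors from each switch and diff it between polls):

```python
# Hypothetical poll results: interface error-counter deltas since last poll,
# with None meaning the device did not answer SNMP at all.
polled = {
    "sw1/gi0/1": 0,
    "sw1/gi0/2": 12,
    "sw2/gi0/7": None,    # non-responder
    "sw2/gi0/9": 48_211,  # counter spiking: likely the flaky NIC
    "sw3/gi0/3": 3,
}

def badness(item):
    port, errs = item
    # Non-responders sort first, then big error deltas.
    return (errs is None, errs if errs is not None else 0)

report = sorted(polled.items(), key=badness, reverse=True)
for port, errs in report:
    print(port, "NO RESPONSE" if errs is None else f"{errs} errors")
```

Running that prints the naughties at the top of the list, which is exactly the narrowing-down the parent describes.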
Re:You figure it out (Score:5, Informative)
That doesn't make much sense. If the NIC goes down or starts misbehaving, the chances of your NIC's SNMP traps arriving at their destination are effectively zero. You probably mean setting up traps on your switches, with threshold traps on all the interfaces, the switch's CPU, CAM table size, etc., which would be more useful. You could also use a syslog server, which is going to be considerably easier if you don't have a dedicated monitoring solution.
You're not thinking of traps if you're talking about polling. Traps are initiated by the switch (or other device) and sent to your log monster. You can use SNMP polling of the sort that e.g. MRTG and OpenNMS do which, with appropriate thresholds, can get you most of the same benefits. But don't use it on Cisco hardware, not if you want your network to function, anyway. Their CPUs can't handle SNMP polling, not at the level you're talking about.
I think you are underestimating exactly how much SNMP trap spam network devices send. You'll get a trap for the ambient temperature being too high. You'll get a trap if you send more than X frames per second ("threshold fired"), and another trap two seconds later when it drops below Y fps ("threshold rearmed"). You'll get at least four link traps whenever a box reboots (down for the reboot, up/down during POST, up when the OS boots; probably another up/down as the OS negotiates link speed and duplex), plus an STP-related trap for each link state change ("port 2/21 is FORWARDING"). You'll get traps when CDP randomly finds, or loses, some device somewhere on the network. You'll get an army of traps whenever you create, delete, or change a vlan. If you've got a layer 7 switch that does health checks, you'll get about ten traps every time one of your HA webservers takes more than 100ms to serve its test page, which happens about once per server per minute even when nothing is wrong.
And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan. So when a failing NIC starts jabbering and the poor switch's CPU goes to 100%, you'll never see a trap. All you'll see are a bunch of boxes on the same vlan going up and down for no apparent reason. You might get a fps threshold trap from some gear on your distribution or core layers, assuming it's sufficiently beefy to handle a panicked switch screaming ARPs at a gig a second and have some brains left over, but that's about it. More likely you won't have a clue that anything is wrong until the switch kicks and 40 boxes go down for five minutes.
Monitoring a network with tens of thousands of switch ports sucks hardcore, there's no way around it.
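That said, even trap spam can be boiled down a little after the fact. A toy Python sketch, with invented trap text, that counts link flaps per port and surfaces the chronic flappers while ignoring the rest of the noise:

```python
from collections import Counter

# Hypothetical trap log lines of the kind described above; in practice
# these arrive over UDP port 162 and much of the stream is noise.
traps = [
    "sw4 port 2/21 link DOWN", "sw4 port 2/21 link UP",
    "sw4 port 2/21 link DOWN", "sw4 port 2/21 link UP",
    "sw4 temp threshold fired", "sw7 port 1/3 link DOWN",
    "sw4 port 2/21 link DOWN",
]

# Count link-state traps per port; everything else is treated as spam here.
flaps = Counter(
    " ".join(t.split()[:3]) for t in traps if "link" in t
)
noisy = [port for port, n in flaps.items() if n >= 3]
print(noisy)  # ['sw4 port 2/21']
```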
Re: (Score:2)
On linux, it's called bonding. This is a killer feature.
I had some very limited professional experience with LAWA in the last couple of years. (LAWA runs LAX.) I have no doubt there is quite a bit of the usual consultant chicanery going on, whereby they don't actually hire qualified IT people, just people an elected official or two or three may know
Re: Follow-up (Score:2)
http://www.lacity.org/ctr/press/ctrpress18616087_
Social not technical problem. (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
It works, and it works a lot more easily than anything else th
Sigh, ignorance is bliss (Score:2)
Head of IT for LAX should be fired... (Score:4, Insightful)
Simply amazing. Will someone in the press publish the names of these losers so they can be blacklisted?
Re:Head of IT for LAX should be fired... (Score:5, Funny)
Re: (Score:3, Informative)
CBP deserves a punch in the nose for not having a proper network design with redundancy; and another punch in the nose for not h
Re: (Score:2)
Re: (Score:2)
IF (and that's a big if) we accept your logic that the lack of a cover-up means the IT head/network admin (your post isn't terribly clear on this point) didn't realize there was anything wrong with the way things were being done, then yes, I suppose that person should be fired. However, I call that logic "bullshit". Maybe I'm too damn optimistic, but I'd pref
LACP (Score:2)
More info at http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol [wikipedia.org]
Systems seem to be shipping with multiple NICs more commonly (esp. servers), so maybe this will be used more and more. It is important to note that the network switch/router needs to support LACP (dumb/cheap switches do not, while expensive/managed ones do), so that might be a barrier. Cisco s
Let that be a lesson to you... (Score:4, Funny)
Now you see what happens when some joker thinks [s]he can get away with using chunky for something as critical as proper care and feeding of network cards. Pfft.
Bah! Kids these days... I tell ya. Probably the same folks that think the interwebnet is the same as the World Wide Web.
Great, Scott! What's next?!
The whole system is pointless anyway (Score:4, Insightful)
I guess the system exists to give the appearance that the feds actually give a shit.
And then the Pres and Congress wonder why their approval ratings are as small as their shoe sizes...
nic can take down a segment (Score:4, Interesting)
Re: (Score:2)
Re: (Score:3, Interesting)
It was during the debugging phase. We got it to occur, and then turned off one machine at a time. When all the machines on the segment were off and the switch was still jabber isolated we all went "WTF?!" and then started unplugging cables.
"A similar incident" (Score:3, Insightful)
Except in the future, the incident isn't going to be similar, aside from being similarly boneheaded. This attitude of "only defend yourself from things that have already happened to you before" is just plain dumb. Obviously their system was set up and administered by a boneheaded organization to begin with, and now that same boneheaded organization is rushing to convene a committee to discuss a committee to discuss how to prevent something that already happened from happening again. The root flaw is still in the organization.
Re: (Score:2)
Re: (Score:2)
Blaming the Wrong NIC (Score:2, Insightful)
The real problem NIC is the one that wasn't there as backup. Either a redundant one already online, or a hotswap one for a brief downtime, or just a spare that could be replaced after a quick diagnostic according to the system's exception handling runbook of emergency procedures.
Of course, we can't blame a NIC that doesn't exist, even if we're blaming it
Managed switches are FTW (Score:2, Insightful)
Having said that, since the managed switches are gigE uplinked and each port is only 10/100, I don't think we've ever had a problem wh
not too suprised (Score:2)
Most reservations are checked for problems automatically but pushed through by a person and moved from one queue to another. If the program that checks them crashes, it can back things up.
I remember a program crashing and a queue getting 2000+ reservations in it before someone figured out what was going on and it had things screwed up for abo
IT is not that advanced (Score:2)
This brings out an obvious point: despite the advances we have made in computing and IT, the field is still relatively young and not that robust.
This is the equivalent of your car stopping and the 'check engine' light not even coming on. At least some of the technology in cars is now getting to the point that it will find the problem for you. The same still cannot be said for large computer networks.
When people stop treating computers as flawless wonder machines, then we shall see some real progress
Re: (Score:2)
Where I come from is over a decade of hard-won experience dealing with network issu
sadly... this may be typical (Score:5, Insightful)
Then, as you work in more places, you start seeing that this is pretty far from the truth. Many "production" systems are held together by rubber bands and duct tape, if you're lucky (and not even the good kind). In my experience it can be a combination of poor funding, poor priorities, technical management that doesn't understand technology, or just a lack of experience or skills among the workers.
Not every place is a Google or Yahoo!, which I imagine look and smell like technology wherever you go on their fancy campuses. Most organizations are businesses first and tech shops last. If software and hardware appear to "work", it is hard to convince anybody in a typical business that anything should change - even if what is "working" is a one-off prototype running on desktop hardware. It often requires strong technical management and a good CIO/CTO to make sure that things happen like they should.
I suspect that a lot of things we consider "critical" in our society are a hell of a lot less robust under the hood than anything Google is running.
Are They Saying...? (Score:2)
Re: (Score:2)
Tom
Fragile System (Score:2)
Um, what about a paper backup?? (Score:3, Insightful)
It is laughable that there is no non-computerised backup for the system. (How about filling out the forms and scanning them in later?)
A Cisco Config to prevent this (Score:3, Informative)