One Failed NIC Strands 20,000 At LAX
The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.
Whiskey Tango Foxtrot (Score:5, Insightful)
For that to have had any effect at all, that system must have been the lynchpin for a critical piece of the network...probably some Homeland Security abortion tacked on to the network, or some such crap...This is like the time I traced a network meltdown to a 4-port hub (not a switch, an unmanaged hub) that was plugged into (not a joke) a T-3 concentrator on one port, and three subnets of around 200 computers each on the other 3 ports. Every single one of the outbound cables from the $15.00 hub terminated in a piece of networking infrastructure costing not less than $10,000.
This is like that. Single point of failure in the worst possible way. Gross incompetence, shortsightedness, and general disregard for things like "uptime"; pretty much what we've come to expect from the airline industry these days. If I'm not flying myself, I'm going to be driving, sailing, or riding a goddamn bicycle before I fly commercial.
Re:That's all it takes (Score:5, Insightful)
If the IT folks were clueless about this machine's age or condition, then the blame lies solely with them for not knowing what the hell they were doing. However, if it was the other folks who shot the IT folks down about upgrading, then "welcome to the current state of business," unfortunately.
In other news... (Score:3, Insightful)
What makes them think they'll get another shot? Rank and file voters are ready with their own plan...should a 'similar incident' by the same fools happen again.
Head of IT for LAX should be fired... (Score:4, Insightful)
Simply amazing. Will someone in the press publish the names of these losers so they can be blacklisted?
Re:You figure it out (Score:1, Insightful)
Why would anyone be stupid enough to have all hosts in a mission-critical setting on one subnet?
Maybe you meant it's a "large issue" if you're a complete moron and put everything on one subnet, but everything is an issue if you're a complete moron, so there's nothing special about NICs.
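For what it's worth, carving a flat address block into smaller broadcast domains isn't hard to plan even on paper. A minimal Python sketch using the standard ipaddress module (the 10.0.0.0/22 block and the segment names are made up purely for illustration, not anything from the article):

import ipaddress

# Hypothetical flat block that currently holds every host on one subnet.
flat = ipaddress.ip_network("10.0.0.0/22")

# Carve it into four /24s so a broadcast storm (or one sick NIC) stays
# confined to a single segment behind a router/VLAN boundary.
segments = list(flat.subnets(new_prefix=24))

for name, net in zip(["customs-terminals", "baggage", "admin", "spare"], segments):
    print(f"{name:18} {net}  ({net.num_addresses - 2} usable hosts)")

Of course the address plan only helps if the segments are actually separated by routers or VLANs; the point is that none of this is exotic.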
The whole system is pointless anyway (Score:4, Insightful)
I guess the system exists to give the appearance that the feds actually give a shit.
And then the Pres and Congress wonder why their approval ratings are as small as their shoe sizes...
"A similar incident" (Score:3, Insightful)
Except in the future, the incident isn't going to be similar, aside from being similarly boneheaded. This attitude of "only defend yourself from things that have already happened to you before" is just plain dumb. Obviously their system was set up and administered by a boneheaded organization to begin with, and now that same boneheaded organization is rushing to convene a committee to discuss a committee to discuss how to prevent something that already happened from happening again. The root flaw is still in the organization.
Blaming the Wrong NIC (Score:2, Insightful)
The real problem NIC is the one that wasn't there as backup: either a redundant one already online, a hot-swap one for a brief downtime, or just a spare on the shelf that could be swapped in after a quick diagnostic, per the system's runbook of emergency procedures (a rough link-monitoring sketch along those lines follows below).
Of course, we can't blame a NIC that doesn't exist, even if we're blaming it for not existing. We have to blame the people who designed and deployed the system with the single point of failure, and the managers and oversight staff who let the airport depend on that single point of failure.
But instead I'm sure we'll blame the dead NIC. Which gave its life in service to its country.
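In practice the "redundant one already online" case is what Linux interface bonding in active-backup mode is for, but even a spare-on-the-shelf plan needs something watching link state so the runbook actually gets opened. A rough sketch using the third-party psutil library (the interface names are hypothetical):

import time
import psutil  # third-party: pip install psutil

PRIMARY, STANDBY = "eth0", "eth1"   # hypothetical interface names

def link_is_up(iface: str) -> bool:
    stats = psutil.net_if_stats().get(iface)
    return bool(stats and stats.isup)

while True:
    if not link_is_up(PRIMARY):
        # A real deployment would page an operator and/or trigger the
        # runbook step (fail over to the standby, swap in the spare card)
        # rather than just printing.
        print(f"ALERT: {PRIMARY} link down; standby {STANDBY} up: {link_is_up(STANDBY)}")
    time.sleep(5)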
Managed switches are FTW (Score:2, Insightful)
Having said that, since the managed switches are gigE uplinked and each port is only 10/100, I don't think we've ever had a problem where a server was outbounding and brought down the switch/network (it just added some extra latency). We've had some really large inbounds occasionally take down a whole switch, and heaven forbid some idiot shuts the port off on an inbound attack instead of nulling it at the border, because then the ARP drops and the DoS gets forwarded to every port on the VLAN on a ton of switches... but a broken NIC packet-storming would not have been an issue.
OK, so maybe they don't have managed switches all the way down to the lowest point on the network. They should still have SOME further up the chain and be monitoring them, so that they know which direction the problem is coming from and can shut it off or look at it with a sniffer (a rough host-level approximation of that check is sketched below).
Infrastructure as important as an airport should have its own network, properly equipped and maintained with managed gear, making this nearly a non-issue and certainly one easily resolved.
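Even without managed gear at the edge, a host at the aggregation point can at least tell you that a storm is underway and roughly where it is coming from. A rough Linux-only sketch that reads the kernel's per-interface counters from sysfs (the interface name and the threshold are arbitrary placeholders):

import time
from pathlib import Path

IFACE = "eth0"           # hypothetical interface to watch
THRESHOLD_PPS = 50_000   # arbitrary "this looks like a storm" rate

def rx_packets(iface: str) -> int:
    # Linux exposes per-interface counters under /sys/class/net/.
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_packets").read_text())

prev = rx_packets(IFACE)
while True:
    time.sleep(1)
    cur = rx_packets(IFACE)
    if cur - prev > THRESHOLD_PPS:
        print(f"ALERT: {IFACE} receiving {cur - prev} packets/sec; possible storm")
    prev = cur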
Re:Whiskey Tango Foxtrot (Score:2, Insightful)
I think just hiring idiots would be enough. No need to train them.
sadly... this may be typical (Score:5, Insightful)
Then, as you work more places, you start seeing that this is pretty far from the actual truth. Many "production" systems are held together by rubber bands, and duct tape if you're lucky (but not even the good kind). In my experience it can be a combination of poor funding, poor priorities, technical management that doesn't understand technology, or just a lack of experience or skills among the workers.
Not every place is a Google or a Yahoo!, which I imagine look and smell like technology wherever you go on their fancy campuses. Most organizations are businesses first and tech shops last. If software and hardware appear to "work," it is hard to convince anybody in a typical business that anything should change, even if what is "working" is a one-off prototype running on desktop hardware. It often requires strong technical management and a good CIO/CTO to make sure that things happen like they should.
I suspect that a lot of things we consider "critical" in our society are a hell of a lot less robust under the hood than anything Google is running.
Re:That's all it takes (Score:4, Insightful)
There's no reason you can't leave the almost-broken computer there and get a new one. You just build a backup system; surely management understands that redundancy is good. Then, when the crappy one breaks, you can swap it out instantly. That way you don't have to mess with things prematurely, and you're hopefully only down for a few minutes. (Of course, replacing it "intentionally", before it fails, is more reliable, but keeping a backup system is a viable alternative if nobody wants to touch the working system.)
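A hot standby only helps if something notices the primary has died, so "swap it out instantly" implies at least a dumb heartbeat. A minimal sketch against a hypothetical service address (the host, port, and retry count are invented for illustration):

import socket
import time

PRIMARY = ("customs-primary.example", 443)   # hypothetical host and port
FAILURES_BEFORE_CUTOVER = 3

def alive(addr, timeout=2.0) -> bool:
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

misses = 0
while True:
    misses = 0 if alive(PRIMARY) else misses + 1
    if misses >= FAILURES_BEFORE_CUTOVER:
        # A real cutover would repoint DNS or a VIP, or start the standby
        # service; here we only record the decision.
        print("Primary unresponsive; cutting over to the standby box")
        break
    time.sleep(10)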
Re:That's all it takes (Score:5, Insightful)
That's very difficult to do, and your estimates of the costs will be called into question. It's often impossible to predict how long it'll take to diagnose and fix a problem unless you've already diagnosed and fixed a similar problem.
Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.
Um, what about a paper backup?? (Score:3, Insightful)
It is laughable that there is no non-computerised backup for the system. (How about filling out the forms and scanning them in later?)
Re:That's all it takes (Score:3, Insightful)
Right, but that's why IT doesn't provide the numbers. It just provides the scenario and it's the bean-counters (BC) that provide the numbers.
IT: "We have some really old hardware that's going to fail any day now..."
BC: "So what?"
IT: "Well, that's a good question, we know it's going to cost $Bazillion to fix so we need to find out if it's worth it or not. Here's what will happen when it dies - LAX completely shuts down. Would that hurt the bottom line enough to justify budgeting $Bazillion?"
BC: "OMFG!" [throws money]
Re:That's all it takes (Score:1, Insightful)
Rightfully so - it would be your mistake. Don't ever give a single number unless it's solid. Give a confidence interval (even a huge, rough, unscientific one) instead.
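That interval can be as crude as a few lines of arithmetic. A toy example (every figure except the 20,000 stranded passengers from the story is invented purely for illustration):

# All cost figures below are made-up placeholders, not estimates of the LAX incident.
stranded_passengers = 20_000
cost_per_passenger_hour = (50, 200)   # (low, high) in dollars
outage_hours = (4, 12)                # (low, high)

low = stranded_passengers * cost_per_passenger_hour[0] * outage_hours[0]
high = stranded_passengers * cost_per_passenger_hour[1] * outage_hours[1]

print(f"Estimated cost of one outage: ${low:,} to ${high:,}")
# Compare the range against the price of the redundant hardware and let
# the bean-counters decide; the point is the interval, not the midpoint.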
Re:That's all it takes (Score:5, Insightful)
In contrast, technical staff get to hear a lot about the Tacoma Narrows Bridge, Liberty Ships, the Titanic, and similar disasters from long ago as illustrations of how things can go wrong, before they even get let out of their first year of training. Some management would discard those lessons as things from the days of dinosaurs, which is why we seem to see maintenance, infrastructure, and contingency plans reduced to nothing every decade, then treated as important again only in the years immediately following a string of expensive or deadly disasters.