Networking Bug

One Failed NIC Strands 20,000 At LAX

The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.

  • by SatanicPuppy ( 611928 ) * <Satanicpuppy.gmail@com> on Wednesday August 15, 2007 @03:58PM (#20240711) Journal
    According to the effing article, it wasn't even a server, but a goddamn desktop. How in the holy hell does a desktop take down the whole system? I can't even conceive of a situation where that could be the case on anything other than a network designed by chimps, especially through a hardware failure...A compromised system might be able to do it, but a system just going dark?

    For that to have had any effect at all, that system must have been the lynchpin for a critical piece of the network...probably some Homeland Security abortion tacked on to the network, or some such crap...This is like the time I traced a network meltdown to a 4-port hub (not a switch, an unmanaged hub) that was plugged into (not a joke) a T-3 concentrator on one port, and three subnets of around 200 computers each on the other 3 ports. Every single one of the outbound cables from the $15.00 hub terminated in a piece of networking infrastructure costing not less than $10,000.

    This is like that. Single point of failure in the worst possible way. Gross incompetence, shortsightedness, and general disregard for things like "uptime"; pretty much what we've come to expect from the airline industry these days. If I'm not flying myself, I'm going to be driving, sailing, or riding a goddamn bicycle before I fly commercial.
  • by Svet-Am ( 413146 ) on Wednesday August 15, 2007 @04:00PM (#20240749) Homepage
    Of course they're running old and outdated hardware. When things work, particularly in a mission-critical situation, you don't touch them! Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that? Non-technical people take the approach that since computers are inherently binary (work or no-work), if the machine is up and running _right now_ then there is no problem and no sense in spending money to replace it.

    If the IT folks were clueless about this machine's age or condition, then the blame lies solely with them for not knowing what the hell they were doing. However, if it was the other folks who shot the IT folks down about upgrading then "welcome to the current state of business", unfortunately.
  • In other news... (Score:3, Insightful)

    by djupedal ( 584558 ) on Wednesday August 15, 2007 @04:01PM (#20240757)
    "...said airport and customs officials are discussing how to handle a similar incident should it occur in the future."

    What makes them think they'll get another shot? Rank and file voters are ready with their own plan...should a 'similar incident' by the same fools happen again.
  • by Glasswire ( 302197 ) on Wednesday August 15, 2007 @04:05PM (#20240797) Homepage
    ...for not firing the networking manager. The fact that they were NOT terrified that this news would get out, and were too stupid to cover it up, indicates he/she and their subordinates SIMPLY DON'T KNOW THEY DID ANYTHING WRONG by not putting in a sufficiently monitored switch architecture that would rapidly alert IT staff and lock out the offending node.
    Simply amazing. Will someone in the press publish the names of these losers so they can be blacklisted?
  • by MightyMartian ( 840721 ) on Wednesday August 15, 2007 @04:07PM (#20240821) Journal
    If the NIC starts broadcasting like nuts, it will overwhelm everything on the segment. If you have a flat network topology, then kla-boom, everything goes down the shits. A semi-decent switch ought to deal with a broadcast storm. The best way to deal with it is to split your network up, thus rendering the scope of such an incident significantly smaller.
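
    A minimal sketch, in Python with scapy, of the kind of broadcast-rate watchdog that catches this sort of storm; the "eth0" interface name and the 1,000 packets-per-second threshold are placeholders, and in practice you would watch the switch ports rather than a single host:

        # Count broadcast frames per second on one interface and complain when
        # the rate crosses a threshold. Requires scapy and root privileges;
        # the interface name and threshold below are placeholders.
        import time
        from collections import deque

        from scapy.all import Ether, sniff

        THRESHOLD_PPS = 1000     # alert above this broadcast rate
        WINDOW_SECONDS = 5       # sliding window for the rate estimate
        timestamps = deque()

        def on_packet(pkt):
            if Ether in pkt and pkt[Ether].dst.lower() == "ff:ff:ff:ff:ff:ff":
                now = time.time()
                timestamps.append(now)
                # drop samples that have fallen out of the window
                while timestamps and now - timestamps[0] > WINDOW_SECONDS:
                    timestamps.popleft()
                rate = len(timestamps) / WINDOW_SECONDS
                if rate > THRESHOLD_PPS:
                    print(f"possible broadcast storm: {rate:.0f} broadcast frames/s")

        sniff(iface="eth0", prn=on_packet, store=False)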
  • by COMON$ ( 806135 ) * on Wednesday August 15, 2007 @04:09PM (#20240847) Journal
    Apparently you are not familiar with what a bad NIC does to even the best of switches.
  • by Anonymous Coward on Wednesday August 15, 2007 @04:20PM (#20241013)

    Why would anyone be stupid enough to have all hosts in a mission-critical setting on one subnet?

    Maybe you meant it's a "large issue" if you're a complete moron and put everything on one subnet, but everything is an issue if you're a complete moron, so there's nothing special about NICs.

  • by Potent ( 47920 ) on Wednesday August 15, 2007 @04:23PM (#20241061) Homepage
    When the U.S. Government is letting millions of illegal aliens cross over from Mexico and live here with impunity, what the fuck is the point of stopping a few thousand document-carrying people getting off planes from entering the country?

    I guess the system exists to give the appearance that the feds actually give a shit.

    And then the Pres and Congress wonder why their approval ratings are as small as their shoe sizes...
  • by The One and Only ( 691315 ) * <[ten.hclewlihp] [ta] [lihp]> on Wednesday August 15, 2007 @04:30PM (#20241135) Homepage

    A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.

    Except in the future, the incident isn't going to be similar, aside from being similarly boneheaded. This attitude of "only defend yourself from things that have already happened to you before" is just plain dumb. Obviously their system was set up and administered by a boneheaded organization to begin with, and now that same boneheaded organization is rushing to convene a committee to discuss a committee to discuss how to prevent something that already happened from happening again. The root flaw is still in the organization.

  • by Doc Ruby ( 173196 ) on Wednesday August 15, 2007 @04:30PM (#20241141) Homepage Journal
    The NIC that failed isn't the part that's at fault. NICs fail, and can be counted on to do so inevitably, if relatively unpredictably (MTBF is statistical).

    The real problem NIC is the one that wasn't there as backup: either a redundant one already online, or a hot-swap one for a brief downtime, or just a spare on the shelf that could be swapped in after a quick diagnostic, per the system's runbook of emergency procedures.

    Of course, we can't blame a NIC that doesn't exist, even if we're blaming it for not existing. We have to blame the people who designed and deployed the system with the single point of failure, and the managers and oversight staff who let the airport depend on that single point of failure.

    But instead I'm sure we'll blame the dead NIC. Which gave its life in service to its country.
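
    As a back-of-envelope illustration of "MTBF is statistical" and what the missing backup buys you, here is a Python sketch assuming independent, exponentially distributed failures with no repair; the 500,000-hour MTBF is an invented placeholder, not a figure from the article:

        # With an exponential failure model, a single NIC still has a measurable
        # chance of dying in any given year; two independent NICs (ignoring
        # repairs and common-mode failures) very rarely both fail in that time.
        import math

        MTBF_HOURS = 500_000            # placeholder figure
        HOURS_PER_YEAR = 24 * 365

        p_single = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)  # P(one NIC fails this year)
        p_both = p_single ** 2   # P(both of a redundant pair fail in the same year)

        print(f"P(single NIC fails within a year) = {p_single:.4f}")
        print(f"P(both redundant NICs fail)       = {p_both:.6f}")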
  • by Sehnsucht ( 17643 ) on Wednesday August 15, 2007 @04:39PM (#20241211)
    Where I work, if there's a packet storm someplace (a server is getting attacked, a server is the attacker, or someone just has a really phat pipe on the other end and is moving a ton of data) we get an SNMP trap for the packet threshold on the offending port. BAM! You know where the problem is, and since we have managed switches you just shut off the port if you can't resolve the problem.

    Having said that, since the managed switches are GigE-uplinked and each port is only 10/100, I don't think we've ever had a problem where a server was outbounding and brought down the switch/network (it just added some extra latency). We've had some really large inbounds occasionally take down a whole switch, and heaven forbid some idiot shuts the port off on an inbound attack instead of nulling it at the border, because then the ARP entry drops and the DoS gets forwarded to every port on the VLAN on a ton of switches... but a broken NIC packet-storming would not have been an issue.

    OK, so maybe they don't have managed switches all the way down to the lowest point on the network. They should still have SOME further up the chain and be monitoring them, such that they know which direction the problem is coming from, and can shut it off or look at it with a sniffer, etc.

    Infrastructure as important as an airport should have its own network properly equipped and maintained with managed equipment, making this nearly a non-issue and certainly one easily resolved.
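
    A rough host-side stand-in for the per-port packet-threshold alerting described above, as a Python sketch that samples Linux's /proc/net/dev counters; the 50,000 packets-per-second threshold is an arbitrary placeholder, and a real deployment would poll or trap on the managed switches themselves rather than on one box:

        # Sample interface counters every few seconds and flag any interface
        # whose inbound packet rate jumps past the threshold.
        import time

        THRESHOLD_PPS = 50_000   # packets per second considered "storm-like"
        INTERVAL = 5             # seconds between samples

        def read_rx_packets():
            """Return {interface: cumulative received-packet count}."""
            counts = {}
            with open("/proc/net/dev") as f:
                for line in f.readlines()[2:]:        # skip the two header lines
                    name, data = line.split(":", 1)
                    fields = data.split()
                    counts[name.strip()] = int(fields[1])   # field 1 = rx packets
            return counts

        previous = read_rx_packets()
        while True:
            time.sleep(INTERVAL)
            current = read_rx_packets()
            for iface, rx in current.items():
                rate = (rx - previous.get(iface, rx)) / INTERVAL
                if rate > THRESHOLD_PPS:
                    print(f"{iface}: {rate:.0f} packets/s inbound - investigate or shut the port")
            previous = current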
  • by kylemonger ( 686302 ) on Wednesday August 15, 2007 @04:41PM (#20241239)
    Do these people hire idiots with no training or experience or what?

    I think just hiring idiots would be enough. No need to train them.

  • by dave562 ( 969951 ) on Wednesday August 15, 2007 @04:47PM (#20241295) Journal
    They concentrated all of the redundancy dollars into layer B of the OSI model... the bureaucracy. There wasn't anything left for the lower layers.
  • by EmperorKagato ( 689705 ) * <sakamura@gmail.com> on Wednesday August 15, 2007 @05:13PM (#20241565) Homepage Journal

    Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that?
    You show the suits and bean counters how much it would cost the company if the system failed and time had to be spent recovering it.
  • by bwy ( 726112 ) on Wednesday August 15, 2007 @05:17PM (#20241621)
    Sadly, many real-world systems are nothing like what people envision them to be. We all sit back in our chairs reading Slashdot and thinking everything is masterfully architected, fully HA, redundant, etc.

    Then as you work in more places you start seeing that this is pretty far from the truth. Many "production" systems are held together by rubber bands and duct tape if you're lucky (and not even the good kind). In my experience it can be a combination of poor funding, poor priorities, technical management that doesn't understand technology, or just a lack of experience or skills among the workers.

    Not every place is a Google or a Yahoo!, which I imagine look and smell like technology wherever you go on their fancy campuses. Most organizations are businesses first and tech shops last. If software and hardware appears to "work", it is hard to convince anybody in a typical business that anything should change - even if what is "working" is a one-off prototype running on desktop hardware. It often requires strong technical management and a good CIO/CTO to make sure that things happen like they should.

    I suspect that a lot of things we consider "critical" in our society are a hell of a lot less robust under the hood than anything Google is running.
  • by ThinkingInBinary ( 899485 ) <thinkinginbinary ... AGOom minus city> on Wednesday August 15, 2007 @05:31PM (#20241777) Homepage

    Of course they're running old and outdated hardware. When things work, particularly in a mission-critical situation, you don't touch them! Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that? Non-technical people take the approach that since computers are inherently binary (work or no-work), if the machine is up and running _right now_ then there is no problem and no sense in spending money to replace it.

    There's no reason you can't leave the almost-broken computer there and get a new one. You just build a backup system. Surely management understands that redundancy is good. Then, when the crappy one breaks, you can swap it out instantly. That way, you don't have to mess with things prematurely, but you're only down for hopefully a few minutes. (Of course, replacing it "intentionally", before it fails, is more reliable, but keeping a backup system is a viable alternative if nobody wants to touch the working system.)

  • by Greventls ( 624360 ) on Wednesday August 15, 2007 @06:04PM (#20242103)
    The new system is usually extremely expensive. Why spend all that money on a new system when the old one works? I know programmers who refuse to update their code from VB3.
  • by quanticle ( 843097 ) on Wednesday August 15, 2007 @06:53PM (#20242675) Homepage
    You show the suits and bean counters how much it would cost the company if the system failed and time had to be spent recovering it.

    That's very difficult to do, and your estimates of the costs will be called into question. It's often impossible to predict how long it'll take to diagnose and fix a problem unless you've already diagnosed and fixed a similar problem.

    Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.
  • by LeRandy ( 937290 ) on Wednesday August 15, 2007 @06:53PM (#20242677)
    Am I the only one laughing that back in old, antiquated Europe, our passport control has the ability to read the documents with their own eyes? Oh, I forgot - how are you supposed to treat your visitors like criminals if you can't take their photograph, fingerprints, and 30-odd other bits of personal data to make sure we aren't terrier-ists (fans of small dogs)? It doesn't help prevent terrorist attacks, but it does give you a nice big data mine (and how are you supposed to undermine people's rights effectively if you don't know everything about them?).

    It is laughable that there is no non-computerised backup for the system. (How about filling out the forms and scanning them in later?)

  • by dbIII ( 701233 ) on Wednesday August 15, 2007 @08:45PM (#20243657)
    Then they do not believe you until you can point at 20,000 people stranded at LAX. At that point you are fired, since you knew about the problem and made some fuss, but did not make enough fuss to actually convince the suits and bean counters. It does help others, who can then point at somebody else's problem and get their own suits and bean counters to pay attention. This is why infrastructure-failure disasters go in cycles determined by the attention span and age of management - each new generation has to see a major failure before they listen, while engineers have the benefit of written knowledge going back years.
  • by itwerx ( 165526 ) on Thursday August 16, 2007 @01:04AM (#20245661) Homepage
    That's very difficult to do, and your estimates of the costs will be called into question.

    Right, but that's why IT doesn't provide the numbers. It just provides the scenario and it's the bean-counters (BC) that provide the numbers.

    IT: "We have some really old hardware that's going to fail any day now..."

    BC: "So what?"

    IT: "Well, that's a good question, we know it's going to cost $Bazillion to fix so we need to find out if it's worth it or not. Here's what will happen when it dies - LAX completely shuts down. Would that hurt the bottom line enough to justify budgeting $Bazillion?"

    BC: "OMFG!" [throws money]
  • by Anonymous Coward on Thursday August 16, 2007 @02:17AM (#20245965)

    Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.

    Rightfully so - it would be your mistake. Don't ever give a single number unless it's solid. Give a confidence interval (even a huge, rough, unscientific one) instead.
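
    A toy Python sketch of that advice - hand over a range built from low and high estimates rather than a single number; every figure below is an invented placeholder, not an estimate of what the LAX outage actually cost:

        # Compare a rough outage-cost interval against the cost of fixing it now.
        low_hours, high_hours = 4, 14           # plausible outage duration range
        low_rate, high_rate = 20_000, 100_000   # cost per hour of downtime ($)
        replacement_cost = 15_000               # new hardware plus labour ($)

        best_case = low_hours * low_rate
        worst_case = high_hours * high_rate

        print(f"Estimated outage cost: ${best_case:,} to ${worst_case:,}")
        print(f"Replacement cost:      ${replacement_cost:,}")
        if replacement_cost < best_case:
            print("Even the optimistic outage estimate exceeds the fix - easy call.")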

  • by dbIII ( 701233 ) on Thursday August 16, 2007 @06:31AM (#20247079)
    No - it implies that a great deal of management has become a shallow oral tradition, with all the problems that implies. They are not learning from anything before them, and they react with great surprise when a Rupert Murdoch or a Bill Gates who does know how to learn from the mistakes of others leaves them with effectively nothing but their underwear. It's like Cortez in South America - he used the tactics of Roman generals that he had read about against people who did not have a written history.

    In contrast, technical staff hear a lot about the Tacoma Narrows Bridge, Liberty Ships, the Titanic and similar disasters from long ago as illustrations of how things can go wrong, before they are let out of their first year of training. Some managers discard those lessons as relics from the days of the dinosaurs, which is why maintenance, infrastructure and contingency planning seem to get cut to nothing every decade, only to be treated as important again in the years immediately following a string of expensive or deadly disasters.
