Dublin Air Traffic Control Brought Down By Faulty NIC
Not so very long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of NIC failure leading to a disruption of air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ...
Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ...
'[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
There's only one way to solve this (Score:5, Funny)
Re: (Score:2)
Put all those NICs on the terror watchlist!
Why would anyone listen to you? Somebody who was just put on the terror watchlist by a bad NIC.
More scary stories. (Score:2, Interesting)
Intermittent problems are the worst (Score:2, Interesting)
ten minutes (Score:2)
Re:ten minutes (Score:5, Informative)
there are plenty of examples of 10 minute failover
Older Cisco ATAs take 10 minutes to swing onto SRST if keepalives are lost to the CallManager cluster.
a complex routing protocol refresh (big BGP networks) can take many minutes
a faulty NIC can easily bring down a LAN segment, with or without redundant switching paths - and it makes it look like a router failure as the router overloads trying to deal with the broadcast storm
NICtzche (Score:4, Funny)
if this piece of hardware was capable of "overc[oming] the built-in system redundancy", perhaps its ilk ought to be patrolling the transistorized wunderplatz of interconnected morsels governing our most hubris means of transportation? I, for one, would certainly feel safer.
Well, it's a step above the old AppleTalk (Score:3, Interesting)
When I was administering a small network in Marin, every time we had a small earthquake, all of the AppleTalk connectors would come loose. Took hours to find the faults and push them together. I guess we should have used duct tape.
I suppose at an airport as each jet came in creating vibrations, those same connectors would have dislodged.
It's a success story. (Score:5, Funny)
"...an intermittent malfunctioning network card which consequently overcame the built-in system redundancy"
But it's one of the lucky ones.
Every year, thousands of NICs fall victim to built-in system redundancy; if you know a card whose activity indicators are darkened and lifeless, it may have a redundancy problem. With your support and donations, we at Ethernetics Anonymous can help more network cards beat the scourge of built-in system redundancy, and make them feel like a useful part of society again.
Re: (Score:2)
Should have added:
"With your donation of only two bits per day"
My idea of fault tolerance (Score:2)
Re: (Score:2, Insightful)
Unfortunately, this NIC's fault showed up as the radar not working. What were they supposed to fail-over to? Binoculars?
Re: (Score:3, Interesting)
Unfortunately, this NIC's fault showed up as the radar not working. What were they supposed to fail-over to? Binoculars?
I suppose so, if it's possible to do it that way. Also, have the planes do the old-fashioned "circle the airport and keep an eye out for other traffic" if that works with big, heavy planes. It sure gives you (the pilot) a nice sense of being a free and sovereign person anyway, like on small airfields. :-)
Re: (Score:2)
(This is coming from a future air traffic controller)
You're forgetting a few things.
1. Not all air traffic control is done from airport towers. There are also TRACONs and ARTCCs, which is the type of facility you see in the movie Pushing Tin. Basically big dark buildings filled with radar screens and strung out people completely messed up on caffeine.
2. "circle the airport and watch for traffic" doesn't work for airplanes at FL350 doing 500+ knots. Usually that's IFR traffic, so the planes would have no cha
Re: (Score:2)
Re: (Score:2)
What were they supposed to fail-over to? Binoculars?
And a giant relief model of the airport with young ladies pushing around little model aircraft with billiard cues. And a big glass panel with people marking up aircraft positions with wax crayons.
Re: (Score:2)
That is the stupidest plan ever, it is snooker cue _rests_ with which the ladies push the little model aircraft around.
In the queue (Score:4, Funny)
I was due to fly the evening it all went wrong. Here's a lesson: if you're standing in a three-hour queue for the Ryanair desk, and they tell people to rebook on the web, and you take out a laptop and 3G modem, be prepared for a stampede.
Re: (Score:2)
Re: (Score:2)
On top of that..
You are flying Ryanair, everything is an extra, I am surprised they don't charge to use the bathroom onboard (Having said that, they probably will now).
5 a person to rebook on your laptop, would have paid for a new laptop!
Re: (Score:2)
1) I wasn't leaving that queue. :-)
2) The thing is - and believe me, I do shop around - none of the other options are very much better. :( The national flag carrier has remodelled itself as a low-cost airline and now matches Ryanair feature for misfeature. BMI (Baby) are quite good where they fly, but to many other locations the options seem to be either very expensive, even accounting for ultimate cost including disasters like the above, or nonexistent.
Re: (Score:2)
Unfortunately not. The flight I was on appeared not to be formally cancelled on the system, so I wasn't allowed to rebook at the time - and of course their system was jammers anyway. I tried to help out the people either side of me but was pretty unsuccessful.
Re: (Score:2)
Actually, that's not true - it did do me good. I booked an Easyjet flight from Belfast for the next day. :-)
One card "overcame the redundancy"??? (Score:4, Insightful)
If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.
Re: (Score:2)
second that... sorry, I missed your post before I wrote mine. Whoever built the system goofed, and to screw up with flight control systems at this level should be grounds for termination and never ever to get work in mission critical systems again. There really is no room for error in systems like this.
I've worked a bit in the aerospace industry, specifically on software that would estimate the amount of fuel required for a flight taking into account alternative landing areas, winds and so on.
The amount of
Re: (Score:2)
Indeed. The open source angle is also critical for fast and conclusive accident investigation.
Come to think of it, I have never worked on a really critical system, but I am in IT security, which shares the thinking about ways to break a system. One difference is that our "malfunctions" are intelligent and malicious. On the other hand, they typically cannot kill large numbers of people. I think I prefer that. Having software out there that can kill, would probably give me bad dreams....
Speaking as a sysadmin of just such a network... (Score:2)
...it is not so black and white.
I administer a network like that. Pharmaceutical plant to be precise.
All machines on the production network have 2 independent PCI nics, connecting to 2 identical but separate networks, using separate routers and switches. The critical servers are stratus high availability servers which have dual redundant everything, driving all components in lock-step and correcting errors on the fly.
If something happens to cause a network switch over, there is a bulk of network traffic to
Re: (Score:2)
If something happens to cause a network switch over, there is a bulk of network traffic to deal with it, because sockets have to be opened and closed, state has to be transferred, system control message flow has to be restarted so that all controllers go back to the normal state, ... And at application level, everything is RPC and DCOM based, so this will cause a significant disruption for the running services, since COM objects and RPC marshalling have to be destroyed and recreated, reinitialized, ...
That
Was it running windows? (Score:4, Funny)
Blame it on the sales guy (Score:2)
I think this is a case of Sales guy vs web dude [techcrunch.com].
Why!? (Score:5, Funny)
I am flying to Florida tomorrow, it will only be my fifth plane flight in total and my first transatlantic flight. Despite being a rational scientist, who knows how safe it is statistically, I am having trouble suppressing my anxiety.
And at this point, fate sees fit to bombard me with horror stories about flying. This news about air traffic control comes on the heels of a headline I just saw on the front page of the Independent about pilots not reporting faults on aircraft and thus unsafe ones still flying about. I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"
Re:Why!? (Score:4, Funny)
Re: (Score:2)
Damn, that's cold
Your signature, however, gives me something else to focus on. Fucking software patents! Idiotic corporate pandering EU! Grrrrrr! I'm not afraid of flying, I'm angry about IP abuse!
Re: (Score:2)
I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"
Just to reassure you, you are far more likely to die from the impact, or failing that smoke inhalation, or failing that drowning. The passenger compartment is rather well insulated from flame.
Re: (Score:3, Funny)
--INT: PLANE COCKPIT
PILOT #1: Oh wow, I really hope we don't have a crash.
PILOT #2: Me too.
PILOT #1: But they say it's safer than crossing the road!
PILOT #2: Yes, but we have to do that too.
PILOT #1: Best not to think about it.
Re: (Score:2)
"TOMORROW YOU WILL DIE IN FLAMES"
it's not the flames that kill you, it's the long long LONG fall.
but don't think of it as an end; think of it as a really effective way to cut down on your living expenses.
Re: (Score:3, Funny)
Truth? (Score:2)
"[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
And if we edit it a little bit, can we have the truth?:
They confirmed the root caused the hardware system malfunction using an intermittent malfunctioning network card which consequently overcame the built-in system redundancy.
Confusing terminology (Score:5, Informative)
I work in aviation and wonder if the terminology being used by the newspaper articles is correct.
It appears to be talking about mode S IFF (Identification Friend or Foe) or SIFF radar systems, which identify aircraft and append height data. The speed is the only thing that needs calculating, as it isn't encoded in the pulse train.
This is weird because this data is normally carried over much older bus technologies, such as MIL-STD-1553 [wikipedia.org], rather than current network technology.
This makes me wonder if it was one of two things - a system inputting to an Ethernet PC system that calculates and displays the information, or more likely they are talking about a DLTU type stub connector (or remote terminal) used in such typical buses. That in turn seems unlikely, because on the bus systems where they are employed, the bus controller would have picked up on the failure during continuous built-in test and pulled in an alternative.
If it's the former then someone needs shooting. ATC is a realtime application and the overhead involved here would be unacceptable. I'm not even sure of the benefit of a network; multiple self-contained individual terminals would be safer.
Re: (Score:2)
A quick google turns up:
http://en.wikipedia.org/wiki/Avionics_Full-Duplex_Switched_Ethernet [wikipedia.org]
Which suggests that Ethernet-derived products are, indeed, used in critical systems (although this seems to be on-aircraft rather than in ATC). It (apparently) has seen wide deployment on common "famous" aircraft.
And the UK has been "upgrading" its air traffic control for years and years - so much so that they now appear to be nothing more than an office with some multi-head display if the footage shown on news-repor
Re: (Score:3, Interesting)
While you're right, the key phrase from the article you give is:
Specifically, this standard is aimed at use on aircraft not in ATC, in fact because of the weight reduction it offers.
Also, not to split hairs, but Dublin is not in the UK. This seems trite but is valid, as there are different agencies involved. Moreover, the appropriation of new technologies is
Irish Examiner, ha! (Score:3, Funny)
Everyone in Ireland knows that the Irish Examiner used to be the Cork Examiner - and they never miss an opportunity to point out how Dublin is doing a bad job.
This is because Cork thinks that it's the centre of the friggin' universe. The 'Real Capital', my arse! Just a bunch of thunderin' ejits, living in their little Blarney fantasy land. Sure they can't even talk right. What the hell is a 'langer', anyway. They wouldn't even know how to spell NIC.
The fact that they are right is quite beside the point.
(For a North American cultural equivalent, please see http://en.wikipedia.org/wiki/South_Park:_Bigger%2C_Longer_%26_Uncut [wikipedia.org])
Anyone who mods me down is from Cork - believe it!
Re: (Score:2)
but Cork is the only bit of Ireland that will still float if the country falls into the sea!
Zing Zang Zoom (Score:2)
It's a cover-up for Zing Zang Zoom [technet.com] rolling out a rootkit protection
But what is a "contol"? (Score:3, Funny)
What is a "contol" and why is this so important?
Oops! (Score:2)
Re:testing and QA (Score:5, Insightful)
Testing doesn't confer prescience.
Re:testing and QA (Score:5, Funny)
Only The Spice confers prescience.
Re: (Score:2)
Only The Spice confers prescience.
Actually, it confers "the ability to fold space. That is, travel to any part of universe without moving."
Well, at least they got the "not moving" part right.
Re: (Score:3, Informative)
That was in the movie. Read the book, it's much better.
Re: (Score:2)
Yes, the books are better, and the spice does confer prescience in a small number of instances.
Re:testing and QA (Score:4, Informative)
Actually, it confers "the ability to fold space. That is, travel to any part of universe without moving."
Actually actually, the space folding is done using the Holtzman drive, which is a perfectly ordinary machine. The Navigator merely navigates, plotting a safe path through the non-space/time foldspace. The spice grants the Navigator the limited prescience required to do this.
Eventually the Navigators become obsolete, replaced by Ixian semisentient machines known as Compilers that perform the same task without needing melange. A good thing too, because by that point Arrakis is rubble and sandworms are pretty much extinct.
Details courtesy of Wikipedia (and my lack of a social life).
Re:testing and QA (Score:5, Interesting)
One of the odd and very likable things about Dune is that there are occasionally implications that the society we read about is not the most advanced. Maybe their taboos are limiting them. Essentially the world we read about is actually in its own version of the Dark Ages, where progress has all but stopped and feudalism is the only system. The Tleilaxu and the Ixians aren't in a Dark Age though. But we don't hear too much about them, because they are outside the known world: they violate the taboos that govern the known world.
Essentially it's a bit like reading a history of Taliban-controlled Afghanistan, or unfortunately anywhere with an Islamic government. And I'm sure it's deliberate - Frank Herbert apparently was inspired by the Islamic uprisings against the British.
Or if you look at it another way, he wanted to write a hallucinogenic, retro sci fi epic, and he came up with a bunch of explanations - the Butlerian Jihad, the necessity of spice-based prescience for interstellar travel, and the incompatibility between directed energy weapons and shields - to explain why his universe was that way and not like conventional sci fi with ray guns, robots and open societies in the Popper sense.
Re: (Score:2)
Not according to the book. According to the book the spice is needed to predict what will happen when you arrive, that is: to ensure that you don't arrive inside a planet, or some other dangerous place. The point is that when you do something that amounts to traveling faster than light, the only way to know anything about where you arrive is to predict the future.
This is also a big reason that the guild never took over Dune. They were so conditioned to always seek the safe path (because that was what their ships
Re: (Score:3, Funny)
Yeah. Fortunately I just got back from going outside. OTOH, it was just raining, and I saw all the millions and millions of tiny water drops falling from the sky. Which made me think of Interrupt 80 and all those forked off-processes it would spawn with that code...
Re: (Score:3, Interesting)
So putting in a faulty NIC card and seeing what happened wouldn't have done anything at all, huh?
Part of testing systems is trying to emulate what happens when a portion goes down.
Re: (Score:3, Funny)
So putting in a faulty NIC card and seeing what happened wouldn't have done anything at all, huh?
You keep a bunch of 'faulty' NICs around?
Re: (Score:2, Interesting)
Re:testing and QA (Score:5, Insightful)
Re: (Score:3, Interesting)
Re: (Score:2)
If it works after 5 years. sure.
If not there always is a backup. isn't there? Well, in that case there is a backup of the backup.
Re:testing and QA (Score:4, Insightful)
The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I call "CYA kissass excuse maker" to the stand!
Someone screwed up big, and they're Covering Their Asses now.
Re:testing and QA (Score:4, Insightful)
The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card". i.e. the failover system must have thought the card was functioning.
Re: (Score:3, Insightful)
The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
If the card had failed completely the redundant one would probably have kicked in. What I think happened is the card malfunctioned in a way causing the system to still think that the card is fine and there is no need for the redundant one to kick in.
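To illustrate the failure mode described above, here is a minimal sketch of a health check that judges a path by the data that actually arrives instead of by whether the card reports itself as up. It is not how the Dublin system works; the frame layout (sequence number, payload, CRC32) and the names `seal`, `count_bad` and `path_healthy` are assumptions made up for the example.

```python
import zlib


def seal(seq: int, payload: bytes) -> tuple[int, bytes, int]:
    """Sender side: attach a sequence number and a CRC32 to a payload."""
    return seq, payload, zlib.crc32(payload)


def count_bad(frames, first_seq: int = 0) -> int:
    """Receiver side: count frames that are corrupted or out of sequence."""
    bad = 0
    expected = first_seq
    for seq, payload, crc in frames:
        # A frame is bad if one was lost before it or its bits were altered.
        if seq != expected or zlib.crc32(payload) != crc:
            bad += 1
        expected = seq + 1
    return bad


def path_healthy(frames, max_bad: int = 0) -> bool:
    """Treat the path as failed once more than max_bad frames are bad."""
    return count_bad(frames) <= max_bad
```

A card that still reports link-up but drops or mangles frames would fail `path_healthy`, which is the signal a failover controller would need in order to switch to the redundant path.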
Re: (Score:2)
I recall some DEC NICS that when they started to fail, all got the same MAC. Talk about a fun thing to troubleshoot on a network! If it was a plain old switch using MAC switching, you can cause havoc pretty easily.
Re: (Score:3, Informative)
Air traffic towers generally are not noisy or dusty. And in any case, disregarding the ports, the NIC card itself is practically eternal. Compared to the rest of the system, and the lifetime of the system that is.
Two lessons learned from years of technical support. The NIC isn't broken, unless the computer has been dragged from the network cable. And that the CPU is not broken as long as the system has not been overclocked, and the heatsink is still in place.
Re: (Score:2)
I always thought that CPUs could never be broken. We had an Athlon 64 processor 4600+, it was never overclocked and always used with a standard fan/heatsink, in a well ventilated case. After a year of work, it then started randomly crashing every few weeks. Replacing all the components except the CPU didn't fix the problem (different motherboard, memory, etc). Replacing the CPU did fix the problem. They can die randomly but it is very rare.
Re: (Score:2)
If you didn't apply the heatsink yourself, you don't know if it's been done correctly. On my first PC that I bought with my own money (well, half bought and my dad paid the rest), it kept locking up randomly, and after lots of IRQ and driver troubleshooting my dad removed the heatsink only to find that they hadn't applied it correctly. One reapplication of thermal paste and proper connection to the CPU later, and everything was fine (until that system got messed up in a lightning storm a few years later, bu
Re: (Score:2)
We did - we applied the heatsink several times, when we moved the CPU between different motherboards. Proper thermal transfer compound was used. The temperature of the CPU was fine.
Re: (Score:2)
If you did apply it yourself, you can't be sure that it's been done correctly! On the first system I built, I put the heat sink on pi radians rotated from where it was supposed to be, and without thermal paste at all. (How was I supposed to know?)
Talk about expensive mistake.
Re: (Score:2)
Re: (Score:2)
Lightning damage can come in through the NIC. On the network I saw, a Linksys router hooked to a satellite modem started sending damaging voltage down the ethernet after a storm. One computer lost its NIC, two others lost NIC + motherboard. After the storm was over, the Linksys box was still deadly to a laptop.
Re: (Score:2)
What has to be improved is the redundancy solution: it has to be able to detect intermittent faults and then do a complete failover.
And this isn't the first time this kind of problem has happened, and it's probably not the last time either. But in this case it was on a mission critical system.
Re:testing and QA (Score:4, Informative)
Re:testing and QA (Score:5, Funny)
Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
Re: (Score:2)
Sometimes, pure intuition can be more handy than maths.
Re: (Score:2)
No system, no possibility of system failure.
That was my point.
Re: (Score:2)
Success vs failure (Score:2)
No system, no possibility of system failure.
No possibility of success either. Brilliant! Let's all just go back to bed and forget about trying anything that might possibly fail...
Re: (Score:3, Funny)
So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
Actually, it would be more accurate to say that it would never fail. ;)
Re: (Score:3, Insightful)
If one fails with probability p, and you have n of them, the probability of a total system failure is p^n, not 1-p^n. Well, technically it's the product p1 * p2 * ... * pn, where p1 is the probability of the first failing, p2 the probability of the second, etc., multiplying them all together to get the chance of a total system failure.
The probability of at least one device in a redundant system failing is 1-(1-p)^n. This expression rapidly approaches 1 as n grows, so in larger setups failures will be a common occurrence, but they'll largely be h
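For anyone who wants to plug numbers into the two formulas above, a small sketch follows. It assumes every unit fails independently with the same probability p, which is exactly the assumption a flapping NIC tends to break; the function names and the example value of p are made up for illustration.

```python
def p_total_failure(p: float, n: int) -> float:
    """Probability that all n independent units are failed at once: p^n."""
    return p ** n


def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent units fails: 1-(1-p)^n."""
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    p = 0.01  # hypothetical per-unit failure probability
    for n in (1, 2, 4, 8, 64):
        print(n, p_total_failure(p, n), p_any_failure(p, n))
```

The first column of results shrinks as n grows while the second climbs toward 1: more redundancy means more individual failures to handle, but a smaller chance of losing everything at once.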
Re:testing and QA (Score:5, Funny)
Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.
Well, when we set up some cheap NAS boxes with redundant nics .. some load balancers and other goodies .. we tested it by yanking cables on the bonded nics and making sure everything still worked.
This was for an e-commerce site.. I would agree in hoping more testing with real failures would be done on systems that monitor air traffic.
Also, we were very drunk when yanking cables during our test .. so I don't think intoxication is really a factor. In fact, turning a drunken monkey loose in a data center with a clearance to pull cables is _very_ good fail over testing :)
Re:testing and QA (Score:5, Insightful)
The problem is not that redundancy wasn't implemented.
The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).
There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.
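One common way to cope with the flapping described above is to stop trusting a link once it has changed state too often in a short time, and keep it out of service until someone looks at it. The sketch below is only an illustration of that idea; the class name, the threshold of 3 transitions and the 60 second window are all made-up assumptions, not anything taken from the system in the article.

```python
from collections import deque


class FlapDetector:
    """Locks a link out once it changes state too often within a time window."""

    def __init__(self, max_transitions: int = 3, window: float = 60.0):
        self.max_transitions = max_transitions
        self.window = window
        self._transitions = deque()  # timestamps of recent up/down changes
        self._last_state = None
        self.locked_out = False

    def observe(self, link_up: bool, now: float) -> bool:
        """Record one health sample; return True if the link may be used."""
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(now)
        self._last_state = link_up
        # Forget transitions that are older than the window.
        while self._transitions and now - self._transitions[0] > self.window:
            self._transitions.popleft()
        if len(self._transitions) >= self.max_transitions:
            # Sticky: stays locked out until an operator builds a new detector.
            self.locked_out = True
        return link_up and not self.locked_out
```

Making the lockout sticky is a deliberate choice here: a card that comes back "up" a minute later is precisely the one that should not be allowed to trigger another switch-over.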
Re: (Score:3, Insightful)
The problem is not that redundancy wasn't implemented. The problem is that redundancy doesn't handle 'flapping' hardware very well.
The NIC intermittently failed, causing the redundancy to switch cards several times.
This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
That's what got me curious, it looked like they were using takeover instead of bonding devices.
The most well engineered system in the world can not hope to escape a ~9 minute ARP cache upstream, which makes me wonder why it was designed the way that it was.
I'm not thinking in an antagonistic sense, I'm more wondering what changed in the network _after_ the system was deployed.
Re: (Score:2)
Indeed.
I've seen Network Bonding in RedHat Enterprise Linux with HP hardware use a 'fake' MAC address that is bound to several interfaces to avoid just this problem.
Unfortunately, it may confuse the switch it is connected to, because of said ARP cache (CAM table, ours was 16 hours).
Really-HA systems require genuine engineers with tons of real-life experience, just to know what bits work and what bits you want to avoid. ;-)
I hope to become one, one day
Re: (Score:3, Informative)
The problem is, NICs can fail in all kinds of ways that yanking cables won't simulate. In this case it sounds like if they had yanked the cable, the backup system would have come online exactly like it was supposed to, but because the faulty NIC was kinda-sorta-almost-but-not-really working, it didn't. That's a difficult thing to test in the lab.
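It is hard to reproduce in hardware, but the "kinda-sorta working" case can at least be approximated in software by putting a deliberately unreliable link between the components under test. The sketch below is a hypothetical test double, not a model of any real NIC: the `FlakyLink` name, the drop and corruption rates and the single-bit-flip behaviour are all assumptions chosen for illustration.

```python
import random


class FlakyLink:
    """Test double for a link that mostly works: it drops or corrupts frames."""

    def __init__(self, drop_rate: float, corrupt_rate: float, seed: int = 0):
        self.drop_rate = drop_rate
        self.corrupt_rate = corrupt_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def transmit(self, payload: bytes):
        """Return the payload as received, or None if the frame was lost."""
        if self._rng.random() < self.drop_rate:
            return None
        if payload and self._rng.random() < self.corrupt_rate:
            data = bytearray(payload)
            index = self._rng.randrange(len(data))
            data[index] ^= 1 << self._rng.randrange(8)  # flip one random bit
            return bytes(data)
        return payload
```

Running the failover logic against something like `FlakyLink(drop_rate=0.05, corrupt_rate=0.01)` exercises the paths that pulling a cable never reaches, although it still only covers the failure modes somebody thought to simulate.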
Re: (Score:2)
why stop with the drunken monkey...
http://www.the5thwave.com/gallery/comp_misc/677.html [the5thwave.com]
Re:testing and QA (Score:5, Insightful)
The very best planned of redundant systems can be brought to its knees by hardware that "mostly works".
It's not hard to have system B check that system A is on/off line, and step in if the latter is the case. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.
So design a system that fails gracefully! That's what nature did.
Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.
Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.
It's the nature of the beast.
Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.
High availability - it's much, much, MUCH harder than you thought.
Re: (Score:3, Interesting)
Re:testing and QA (Score:5, Interesting)
Yes but if _one_ NIC can bring the entire system down what other single failures in a component could bring the entire system down? Obviously the system with the malfunctioning NIC can do any number of things that may result in a similar failure mode. Or what happens if the network switch it is attached to fails (I assume they use multiple paths... but if one nic can nuke it all, imagine if a switch went bonkers).
You don't need to bring the entire system down to cause havoc. What if there's a hitherto unknown bug in one of the CPUs which under some very specific set of circumstances causes aircraft altitude to be misreported on the operator's screen? As the GP said, most redundant systems only ensure that the components appear to be broadly working. They seldom check that all the components are doing something sensible.
Re: (Score:2, Insightful)
-- "The very best planned of redundant systems can be brought to its knees by hardware that "mostly works"."--
--NO, you are wrong there. What this indicates is that someone skimped. Techniques for processing and getting reliable signals through systems that only mostly work are very well known and used routinely. What this event means is that someone, either explicitly or implicitly assumed that NICs are binary - they either work or they don't, and designed accordingly.
What should have been used is multiple
Re: (Score:2)
And this is where the "cheap" part of my comment "cheap, redundant, high-availability system" comes into play.
See, the likelihood of failure in a redundant system goes *up* as the number of units increases. But as the number of units in a redundant system increases, the likelihood of a *complete* failure drops to a number never equalling zero. In other words, no matter how much redundancy you build in, you'll never achieve zero downtime over the long haul.
The human body achieves zero downtime over a few dec
Re: (Score:2)
You have dual NICs on system A, with an etherkiller [google.com] connected to the second card.
When B takes over, it then can make sure that A stays down
Re: (Score:2)
High availability - it's much, much, MUCH harder than you thought.
I would like nothing more than to be able to breed my way out of hardware failures.
"What? Another NIC failed? Honey, spread 'em!"
Re: (Score:2)
Quite likely it did work at the time of FAT, SAT, Shadow operation and when going into live operation.
If it breaks down later on is another issue, that's not possible to test for beforehand. Isn't that pretty obvious? It is like testing a car to see if it will ever be in an accident. You sir, are the drunk one :)
Re: (Score:2, Interesting)
Re: (Score:2)
Flew into Belfast instead last week because of this... when you're inconvenienced by technology it is very calming to know that what caused it was a real honest to goodness fuck-up, rather than a much less interesting case of human error :-)
Re: (Score:2, Redundant)
Re: (Score:2)
I'll have you know, my good man, that not all of slashdot's readers are American.
Re: (Score:3, Funny)
Lucky Charms never pissed me off so much as Trix did. I remember one commercial where the rabbit actually *bought his own cereal* and the kids took it because "trix are for kids". I wanted to see him mow the little fuckers down for home invasion or something...
Silly rabbit, supersonic lead is for thieving, speciesist little pricks.