State of Virginia Technology Centers Down
bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrop Grumman $2.4 billion over 10 years to manage the state's IT infrastructure."
Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."
It's always money (Score:2, Interesting)
Re:It's always schedule (Score:3, Interesting)
Heh, it shouldn't be about the money, though... they should have specified high availability from the very beginning. It often gets thrown out during the prototyping stage, with "Keep It Simple, Stupid" invoked just to get things working, but then the software is never designed to handle redundancy, and shoehorning it in later becomes pretty much like starting again from scratch.
Also, designing in redundancy is usually worse than having no redundancy at all if it's never tested. There should be a pretty simple test plan, where, say, the CTO comes in and is allowed to pull any single random wire or component out of the rack and see how the system reacts / recovers. But unfortunately people are usually using the system by that time, and it's too much of a hassle to come in off-hours and pay everyone overtime for such a test.
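You don't even need the CTO physically in the rack -- the same drill can be scripted. Here's a rough sketch of what I mean (the component list, the health-check URL, and the fail/restore hooks are all hypothetical placeholders; in real life they'd wrap your switch/PDU tooling, not print statements):

    import random
    import time
    import urllib.request

    COMPONENTS = ["san-controller-a", "san-controller-b", "core-switch-1",
                  "core-switch-2", "ups-feed-a"]               # hypothetical inventory
    HEALTH_URL = "http://dmv-frontend.example.gov/health"       # hypothetical app health check

    def service_up() -> bool:
        """True if the user-facing service answers its health check."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def fail(component: str) -> None:
        # In a real drill a human pulls the part; this is just a placeholder.
        print(f"DRILL: simulating failure of {component}")

    def restore(component: str) -> None:
        print(f"DRILL: restoring {component}")

    def run_drill(max_wait: int = 300) -> None:
        victim = random.choice(COMPONENTS)
        fail(victim)
        start = time.time()
        while time.time() - start < max_wait:
            if service_up():
                print(f"Service survived loss of {victim} "
                      f"(recovered in {time.time() - start:.0f}s)")
                break
            time.sleep(10)
        else:
            print(f"FAILOVER FAILED: service still down {max_wait}s after losing {victim}")
        restore(victim)

    if __name__ == "__main__":
        run_drill()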
Re:Redundancy (Score:3, Interesting)
What makes you think that the legislators expect redundancy? When that kind of money changes hands the only thing they care about is getting favors and campaign contributions.
Re:HA fail (Score:5, Interesting)
Also, this can happen when you hire an external firm to manage something that you should be managing yourself. External managers for projects like this are motivated by extracting as much money as possible from you. Internal departments of technology, by comparison, are motivated by convincing co-workers to not shout at them.
Re:HA fail (Score:2, Interesting)
Step 1) Design system so a single SAN controller is the only thing keeping the network running.
Step 2) Use money saved by not adding redundancy/designing the system correctly to give self money.
Step 3) Expect one component to last long enough for you to leave the job before it fails.
Step 4) ????
Step 5) Profit anyway because they don't get the concept of failures==bad things and keep paying you.
Re:It's always money (Score:3, Interesting)
I love how people can determine an MTBF of 50 years after testing a piece of hardware for a month.
For my money, the only computer that should be able to claim a 50 year MTBF is the Univac. And that's really, really not accurate.
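For what it's worth, here's where those numbers usually come from: run a big fleet for a short time, divide total unit-hours by failures, and call it an MTBF. A back-of-the-envelope version (all numbers made up):

    # Toy MTBF extrapolation: test many units briefly, divide unit-hours by failures.
    units = 1000        # units on the test bench
    hours = 730         # roughly one month of continuous testing
    failures = 2        # failures observed during the test

    mtbf_hours = units * hours / failures
    print(f"MTBF = {mtbf_hours:,.0f} hours (~{mtbf_hours / 8766:.0f} years)")
    # About 365,000 hours, call it 42 years -- yet no single unit ran longer
    # than a month, so this says nothing about wear-out failures down the road.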
Even funnier (Score:4, Interesting)
As a leftover from when Virginia-headquartered AOL was the king of connectivity, you see license plates here in Virginia touting us as the Internet Capital [virginia.gov].
Re:It's always money (Score:4, Interesting)
This is a private sector failure. NG is the culprit here, not the government.
This is why you should be very wary of bidding out work to a 3rd party. They don't care about your city. They are not thinking about how their decisions impact the city in 10, 20, or 50 years.
And while infrastructure is far more complex and expensive than people who don't deal with it realize, $2.4 billion over 10 years? $240 million a year? At that price they should have a tested, redundant system. A single-point SAN failure? Shame on NG.
I hate to burst your preconceived bubble, but my years in the private and public sectors have taught me that most government agencies are far better at maintaining their own infrastructure. More reliable and longer standing.
Re:Question. (Score:4, Interesting)
Depends on the SAN. The article (like most tech articles) is very short on scope & details. So "one chip" went bad. Should that bring everything to a screeching halt? The answer should be "no," but in practice we can all say that it's more often a case of "not usually." From TFA:
It was hailed as being able to suffer a failure to one part but continue uninterrupted service because standby parts or systems would take over. But when the memory card failed Wednesday, a fallback that attempted to shoulder the load began reporting multiple errors, Nixon said.
So Array Alpha shits the bed. You follow your failover procedures and start running on Array Zappa. That immediately starts throwing errors. Ok, armchair QBs, let me switch to my Keanu Reeves voice and ask "What do you do?" You built a pretty damned redundant system there and you're still down. Sure, it'd be nice if they had a backup in another DC they could fail over to, but they don't. Doesn't matter; eventually you're playing the double / triple / quadruple hulled oil tanker game. Either way, redundant SANs aren't cheap and aren't all that easy (it's not exactly a "the boss's nephew who 'knows all about computers' set it up last weekend" level of complexity). The TFA also has these points:
Full function may not be restored until Monday.
Experts who examined the system determined that no data were lost except for those being keyed into the system at the moment it failed, Nixon said.
Other than the fact that proofreading and proper grammar are no longer requirements for working at a Virginia newspaper, what do those points tell us? Sounds to me like they hit the last line in the DR procedures: restore from backup. Depending on what their backup strategy is (maybe they're splitting several terabytes across a tape robot that only supports 200/400 gig tapes because that robot is the only device the vendor supports) and how truly important the affected system is (this may be a system where the powers that be said "fsck it, they can process renewals by hand and we'll bring everything back up on Monday after we test on Saturday"), a return to business on Monday might be SOP. But that wouldn't sell newspapers (or make talking points with the voters...) now, would it?
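To put some rough numbers on the "restore from backup over the weekend" theory (mine, not theirs -- assuming an LTO-2-class robot and a few terabytes):

    # Back-of-the-envelope restore estimate; every figure here is an illustrative guess.
    data_tb = 4                # data to restore, in TB
    tape_capacity_gb = 200     # native capacity per tape (LTO-2 class)
    drive_mb_per_s = 25        # sustained native drive throughput
    tape_swap_minutes = 3      # robot load/unload + positioning per tape

    data_gb = data_tb * 1024
    tapes_needed = -(-data_gb // tape_capacity_gb)        # ceiling division
    stream_hours = data_gb * 1024 / drive_mb_per_s / 3600
    swap_hours = tapes_needed * tape_swap_minutes / 60

    print(f"{tapes_needed} tapes, ~{stream_hours + swap_hours:.0f} hours of restore "
          f"time on a single drive -- before verification and re-keying lost entries.")

That works out to roughly two days of just streaming data, which makes "back up on Monday" sound less like incompetence and more like physics.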
Maybe there was a major screwup here. Maybe they never tested their failovers, and maybe that 2nd SAN was bad out of the box. I'm a little more willing to cut some slack and say "man, that sucks. Glad it's not my ass on the line." Karma's a bitch like that. I like to take these stories as an opportunity to rethink where my own single points of failure are, rather than point & laugh and tell everyone how I'll never lose any data because I'm running RAID 5...
Re:HA fail (Score:5, Interesting)
How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
RTFA!
The problem was a dual (or worse) failure. What the article reveals is that while they may have had all of the right hardware in place and a mechanism for it to handle the most likely failures, they were missing the 'soft' components of a good HA system: routine testing of failover and a rapid repair plan. In the auto industry where failed systems can halt factories and rack up hundreds of thousands of dollars of cost per hour of downtime, it is the norm for HA systems to have frequent failover tests, to have on-site spares for critical components that can be replaced by on-site staff, and to have support arrangements that put a skilled human on-site with replacement hardware in a small amount of time. This is why traditional "enterprise class" systems are so expensive. They are designed for rapid diagnosis and repair, and a well-run enterprise that needs truly HA systems pays for expensive HUMAN support by their own staff and/or from IBM, Sun^WOracle, EMC, HP, etc. and monitoring systems on top of that. If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...) and have the right staff, processes, and support contracts in place, you will find nearly all of the latent failures and have them fixed before a true production failure exposes them.
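If anyone wants a concrete picture of that Sunday-02:00 exercise, here's a bare-bones sketch assuming a simple primary/standby pair; check_health, promote, and demote are placeholders for whatever your SAN or cluster vendor's tooling actually exposes, and the scheduling itself is left to cron:

    import datetime
    import logging

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("failover-drill")

    # Placeholder hooks -- in reality these wrap vendor tooling, not log lines.
    def check_health(node: str) -> bool:
        """Return True if the node passes its own diagnostics."""
        return True

    def promote(node: str) -> None:
        log.info("promoting %s to primary", node)

    def demote(node: str) -> None:
        log.info("demoting %s to standby", node)

    def weekly_failover(primary: str, standby: str) -> None:
        """Planned role swap: surfaces latent standby failures on your schedule,
        not in the middle of a production outage."""
        if not check_health(standby):
            log.error("standby %s is unhealthy -- open a ticket, do NOT fail over", standby)
            return
        demote(primary)
        promote(standby)
        if not check_health(standby):
            log.error("new primary %s failed post-promotion checks, rolling back", standby)
            demote(standby)
            promote(primary)
            return
        log.info("drill complete at %s; roles stay reversed until next week",
                 datetime.datetime.now().isoformat(timespec="seconds"))

    if __name__ == "__main__":
        weekly_failover("array-alpha", "array-bravo")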
The most appalling thing about this to me isn't the failure. Some systems don't have safe times for testing failovers, and I know from personal experience that a component in an HA system that was working perfectly Saturday and has been idle since Sunday can go tits-up when needed on Wednesday. The real problem is the long outage. If the clowns in the VA state government were doing their jobs, they would not have a system like this without vendor support contracts to fix well-defined hardware problems (e.g. "bad memory card" ) within a few hours at most. This was something I always loved about working in a shop with the top-grade EMC contract. The Symmetrix and its associated gadgetry would call EMC about failures and we'd have a tech show up at the DC with parts before we even noticed anything unusual: costly, but nowhere near as expensive as killing all of the SAN-reliant systems for a random day every 3 years. The 4th 9 is not cheap or simple, because it always requires humans.
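And since people throw around "four nines" without doing the arithmetic, here's the downtime budget each nine actually buys you over a year (plain arithmetic, nothing vendor-specific):

    # Downtime allowed per year at each availability level.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in (0.99, 0.999, 0.9999, 0.99999):
        downtime = (1 - nines) * MINUTES_PER_YEAR
        print(f"{nines:.3%} uptime -> {downtime:8.1f} minutes/year (~{downtime / 60:.1f} hours)")
    # 99.99% is roughly 53 minutes a year; a single multi-day SAN outage
    # blows that budget for decades.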
Re:HA fail (Score:2, Interesting)
Funny, I was unaware that Northrop Grumman were a scion of the free market. Could you name some of their non-government customers that provide more than 1% of their total revenue? It's called a Military-Congressional-Industrial Complex for a reason. But thanks for playing the strawman game.
Re:money saving measure! (Score:3, Interesting)
And sometimes that's exactly the right approach, except you should really build three or four or ten. One might argue that that's the very purpose of the government: to force inefficiency where short-term self-interest would result in long-term disaster - in other words, to avoid the tragedy of the commons.
Re:HA fail (Score:3, Interesting)
If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...)
(voice of tech ignorant executive)
"We can't be down then. We have remote workers that want to do things at that time."
"The overtime for that window is too expensive, and we can't do it during production hours. We'll just assume you planned carefully."
"You just told me part of the reason that system is so expensive is that it is much less likely to fail. Well, we're not paying for a spare."
And after hearing that, I want to duct tape those fucking executives to their $1500 chair and let them watch while I take a powder-actuated nailer to their precious Mercedes S550.
Why the rage? Just spell out very clearly (and in writing) exactly what will happen if component X fails, and the cost to implement redundancy now. When component X fails and the company loses Y dollars of revenue and the CEO comes to you, just pull out the email and say "I tried to design redundancy but he wouldn't spend the money".
It worked for me when I tried to get money for a spare battery cabinet on our primary UPS. I told my boss that if a single battery in the string failed during a power failure, we'd lose most of the server room. That happened during a power outage -- the UPS went offline and we lost power to most of the equipment (we had been able to scrounge up enough money for redundant power to some core network devices, but nearly all of the servers went down). Everything came up again after the generator restarted, then went down again when utility power came back because of a delay in the transfer from generator to utility. My boss screamed at me; I told her I had warned her it could happen and pulled out the email where I outlined exactly that scenario (except for the part where everything tripped off again during the switch back to utility power).
I had a new battery cabinet installed within 2 weeks, *and* a redundant UPS for most servers.
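For anyone wondering why a single bad battery can take out a whole UPS: the blocks are wired in series, so the string is only as strong as its weakest unit. A toy calculation (made-up numbers, assuming a single string with no parallel redundancy):

    # Why one failed battery takes down a series UPS string (illustrative numbers only).
    batteries_in_string = 40      # 12 V blocks wired in series
    nominal_volts = 12.0
    inverter_cutoff = 432.0       # assumed DC bus voltage below which the inverter shuts down

    healthy_string = batteries_in_string * nominal_volts         # 480 V
    one_shorted = (batteries_in_string - 1) * nominal_volts      # 468 V at rest
    one_open = 0.0                                               # open block breaks the circuit entirely

    print(f"healthy string: {healthy_string:.0f} V")
    print(f"one shorted block: {one_shorted:.0f} V at rest -- barely above the "
          f"{inverter_cutoff:.0f} V cutoff, and it sags further under load")
    print(f"one open block: {one_open:.0f} V -- no current path, the UPS drops the load")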