State of Virginia Technology Centers Down

Posted by Soulskill on Friday August 27, 2010 @12:00PM from the your-tax-dollars-at-work dept.

bswooden writes "Some rather important departments (DMV, Social Services, Taxation) in the state of Virginia are currently without access to documents and information as a technology meltdown has caused much of their infrastructure to be offline for over 24 hours now. State CIO Sam Nixon said, 'A failure occurred in one memory card in what is known as a "storage area network," or SAN, at Virginia's Information Technologies Agency (VITA) suburban Richmond computing center, one of several data storage systems across Virginia.' How does the IT for some of the largest departments in a state come to a screeching halt over a single memory card? Oh, and also, the state is paying Northrop Grumman $2.4 billion over 10 years to manage the state's IT infrastructure." Reader miller60 adds, "Virginia's IT systems drew scrutiny last fall when state agencies reported rolling outages due to the lack of network redundancy."

This discussion has been archived. No new comments can be posted.


HA fail (Score:4, Insightful)

by Anonymous Coward writes: on Friday August 27, 2010 @12:03PM (#33393622)

How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.

- Re:HA fail (Score:5, Interesting)
  
  by cgenman ( 325138 ) writes: on Friday August 27, 2010 @12:39PM (#33394116) Homepage
  
  Also, this can happen when you hire an external firm to manage something that you should be managing yourself. External managers for projects like this are motivated by extracting as much money as possible from you. Internal departments of technology, by comparison, are motivated by convincing co-workers to not shout at them.
  
  - Re: (Score:3, Funny)
    
    by Daniel Dvorkin ( 106857 ) * writes:
    
    Also, this can happen when you hire an external firm to manage something that you should be managing yourself. External managers for projects like this are motivated by extracting as much money as possible from you. Internal departments of technology, by comparison, are motivated by convincing co-workers to not shout at them.
    B-b-but you're saying that the bloated corrupt government that takes money from people at gunpoint and has no incentives for efficiency might have done a better job than a private contractor that works on the God-given free enterprise system that rewards efficiency and punishes waste! That's unpossible!
    - Re:HA fail (Score:4, Insightful)
      
      by ultranova ( 717540 ) writes: on Friday August 27, 2010 @04:15PM (#33397136)
      
      B-b-but you're saying that the bloated corrupt government that takes money from people at gunpoint and has no incentives for efficiency might have done a better job than a private contractor that works on the God-given free enterprise system that rewards efficiency and punishes waste!
      
      On the contrary, the free market did exactly as it was supposed to: it eliminated the inefficiency of redundant systems and a safety margin. Efficiency or the safety of redundancy, you can have one or the other but not both. That's why any important system should be managed by the government, and free enterprise should be limited to the role of logistical optimization it's actually good at.
      Unfortunately some people nowadays consider free market their religion, so we got deregulation and resulting financial crisis. Oh well...
      
      - money saving measure! (Score:2)
        
        by Thud457 ( 234763 ) writes:
        
        S.R. Hadden: [imdb.com] First rule in government spending: why build one when you can have two at twice the price?
        
        Re: (Score:3, Interesting)
        
        by ultranova ( 717540 ) writes:
        
        First rule in government spending: why build one when you can have two at twice the price?
        
        And sometimes that's exactly the right approach, except you should really build three or four or ten. One might argue that that's the very purpose of the government: to force inefficiency where short-term self-interest would result in long-term disaster - in other words, to avoid the tragedy of the commons.
    - Re: (Score:2, Interesting)
      
      by conspirator57 ( 1123519 ) writes:
      
      Funny, I was unaware that Northrop Grumman were a scion of the free market. Could you name some of their non-government customers that provide more than 1% of their total revenue? It's called a Military-Congressional-Industrial Complex for a reason. But thanks for playing the strawman game.
    - - Re:HA fail (Score:5, Insightful)
        
        by cgenman ( 325138 ) writes: on Friday August 27, 2010 @05:46PM (#33398322) Homepage
        
        If you're big enough that you're not just going to be scaling staff up and immediately down again, hire your people in-house. It's not a question of government vs private companies. It's a question of hiring your best people to be on staff, or outsourcing to someone who doesn't have the same motivations. This is true if you're a government, a corporation, a private entity, or a high school marching band. Plus the markup on external IT services is just obscene.
        Poorly managed projects will be poorly managed internally or externally. But externally poorly managed projects are a lot more expensive, and harder to rein back under control.
        
- Re: (Score:2, Interesting)
  
  by Even on Slashdot FOE ( 1870208 ) writes:
  
  Step 1) Design system so a single SAN controller is the only thing keeping the network running.
  Step 2) Use money saved by not adding redundancy/designing the system correctly to give self money.
  Step 3) Expect one component to last long enough for you to leave the job before it fails.
  Step 4) ????
  Step 5) Profit anyway because they don't get the concept of failures==bad things and keep paying you.
  - - Re: (Score:2, Insightful)
      
      by Anonymous Coward writes:
      
      Did the dude from the City of SF design this network so that if he wasn't there to SSH in with a modem he had hidden in his toaster oven, the RAM in a SAN would bring the whole network down?
      No, he asked them repeatedly to buy a spare, which was denied; then he refused to yank it out of the live production system when another department's boss said to give it to the chick he was banging so she could be a computer expert too.
- Re: (Score:2, Insightful)
  
  by g0bshiTe ( 596213 ) writes:
  
  I live in Virginia, it's more like business as usual for a Commonwealth.
- Re: (Score:2)
  
  by NotBornYesterday ( 1093817 ) writes:
  
  Because there was more than one failure. FTFA:
  The system was built with redundancies and backup storage. It was hailed as being able to suffer a failure to one part but continue uninterrupted service because standby parts or systems would take over. But when the memory card failed Wednesday, a fallback that attempted to shoulder the load began reporting multiple errors, Nixon said.
  Cheap solution problem? Possibly. Infrastructure design fail? Possibly, but not likely. Couldn't critique it without seeing their setup, but it sounds like they designed some redundancy in. I wonder what kind of "memory card" failed. From the description, it sounds like it might be a cache module.
  - Re: (Score:2)
    
    by Local ID10T ( 790134 ) writes:
    
    Because there was more than one failure. FTFA:
    The system was built with redundancies and backup storage. It was hailed as being able to suffer a failure to one part but continue uninterrupted service because standby parts or systems would take over. But when the memory card failed Wednesday, a fallback that attempted to shoulder the load began reporting multiple errors, Nixon said.
    Cheap solution problem? Possibly. Infrastructure design fail? Possibly, but not likely. Couldn't critique it without seeing their setup, but it sounds like they designed some redundancy in. I wonder what kind of "memory card" failed. From the description, it sounds like it might be a cache module.
    Regular testing of redundant systems is critical. Anyone who has done disaster planning knows this.
- Re: (Score:3, Informative)
  
  by Wyatt Earp ( 1029 ) writes:
  
  Sweet Zombie Jesus.
  If the RAM in our 8TB Netgear SAN fries it doesn't blow up my office, what the hell are they and Northrop Grumman doing?
  - - Re:HA fail (Score:4, Funny)
      
      by pnutjam ( 523990 ) writes: <slashdot@bo r o wicz.org> on Friday August 27, 2010 @01:35PM (#33394910) Homepage Journal
      
      I started working for a city government earlier this year, let's just say I was amazed, I won't qualify it as amazed in a good way or bad way, but, you know...
      
- Re: (Score:2)
  
  by donnyspi ( 701349 ) writes:
  
  Yeah really. Before we got away from traditional hardware (NAS, SAN, etc.) we had piece of crap Dot Hill arrays and they had redundant power supplies and redundant controllers. There must be more to this story.
  - Acronym Fail (Score:2)
    
    by Nefarious Wheel ( 628136 ) writes:
    
    Redundant Array of Inexpensive Disks. RAID. Ok, maybe that scared them. Redundant Array of Raid Controllers - RARC? Nope, sounds Chinese. How about Redundant Infrastructure Array Audits? Nope, that definitely will not do...
- Re: (Score:2)
  
  by CharlyFoxtrot ( 1607527 ) writes:
  
  How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
  On the plus side if the US government ever builds Skynet we know where to strike.
- Re:HA fail (Score:5, Interesting)
  
  by wkcole ( 644783 ) writes: on Friday August 27, 2010 @01:49PM (#33395102)
  
  How does a fault in a single SAN controller cause an outage of the entire data storage network? Expensive SAN solutions are expensive & highly redundant for a reason. This smells like a "Let's buy the cheaper solution" and/or an infrastructure design fail.
  RTFA!
  The problem was a dual (or worse) failure. What the article reveals is that while they may have had all of the right hardware in place and a mechanism for it to handle the most likely failures, they were missing the 'soft' components of a good HA system: routine testing of failover and a rapid repair plan. In the auto industry where failed systems can halt factories and rack up hundreds of thousands of dollars of cost per hour of downtime, it is the norm for HA systems to have frequent failover tests, to have on-site spares for critical components that can be replaced by on-site staff, and to have support arrangements that put a skilled human on-site with replacement hardware in a small amount of time. This is why traditional "enterprise class" systems are so expensive. They are designed for rapid diagnosis and repair, and a well-run enterprise that needs truly HA systems pays for expensive HUMAN support by their own staff and/or from IBM, Sun^WOracle, EMC, HP, etc. and monitoring systems on top of that. If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...) and have the right staff, processes, and support contracts in place, you will find nearly all of the latent failures and have them fixed before a true production failure exposes them.
  The most appalling thing about this to me isn't the failure. Some systems don't have safe times for testing failovers, and I know from personal experience that a component in an HA system that was working perfectly Saturday and has been idle since Sunday can go tits-up when needed on Wednesday. The real problem is the long outage. If the clowns in the VA state government were doing their jobs, they would not have a system like this without vendor support contracts to fix well-defined hardware problems (e.g. "bad memory card" ) within a few hours at most. This was something I always loved about working in a shop with the top-grade EMC contract. The Symmetrix and its associated gadgetry would call EMC about failures and we'd have a tech show up at the DC with parts before we even noticed anything unusual: costly, but nowhere near as expensive as killing all of the SAN-reliant systems for a random day every 3 years. The 4th 9 is not cheap or simple, because it always requires humans.
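To put rough numbers on "the 4th 9" and on "a random day every 3 years", here is a minimal arithmetic sketch. It assumes a 365.25-day year and treats the one-day-every-three-years outage as a long-run average; the function names are made up for illustration and are not from any monitoring tool.

```python
# Availability arithmetic: how much downtime each "nine" allows per year,
# and where one 24-hour outage every three years lands.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (4 -> 99.99%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year the system is up, given average yearly downtime."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


# One full day of outage every three years, averaged per year.
outage_minutes_per_year = 24 * 60 / 3

for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):8.1f} minutes/year allowed")

print(f"one day every 3 years: {availability(outage_minutes_per_year):.4%} available")
```

Under those assumptions four nines leaves roughly 53 minutes of downtime per year, while a day-long outage every three years averages 480 minutes per year, or about 99.91% availability. That is three nines, not four, which is the gap the expensive support contract is meant to close.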
  
  - Re: (Score:2)
    
    by Peeteriz ( 821290 ) writes:
    
    They had pretty much the most expensive support contracts possible. The problem is that apparently all this waste of taxpayers' money has bought nothing useful.
  - Re: (Score:2)
    
    by swb ( 14022 ) writes:
    
    If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...)
    (voice of tech ignorant executive)
    "We can't be down then. We have remote workers that want to do things at that time."
    "The overtime for that window is too expensive, and we can't do it during production hours. We'll just assume you planned carefully."
    "You just told me part of the reason that system is so expensive is that it is much less likely to fail. Well, we're not paying for a spare."
    And after hearing that, I want t
    - Re: (Score:3, Interesting)
      
      by hawguy ( 1600213 ) writes:
      
      If you fail over your HA systems every Sunday at 02:00 (or whatever time is safe...)
      (voice of tech ignorant executive)
      "We can't be down then. We have remote workers that want to do things at that time."
      "The overtime for that window is too expensive, and we can't do it during production hours. We'll just assume you planned carefully."
      "You just told me part of the reason that system is so expensive is that it is much less likely to fail. Well, we're not paying for a spare."
      And after hearing that, I want to duct tape those fucking executives to their $1500 chair and let them watch while I take a powder-actuated nailer to their precious Mercedes S550.
      Why the rage? Just spell out very clearly (and in writing) exactly what will happen if component X fails, and the cost to implement redundancy now. When component X fails and the company loses Y dollars of revenue and the CEO comes to you, just pull out the email and say "I tried to design redundancy but he wouldn't spend the money".
      It worked for me when I tried to get money for a spare battery cabinet on our primary UPS. I told my boss that if a single battery in the string fails during a power failur
- - Re: (Score:2)
    
    by Foobar of Borg ( 690622 ) writes:
    
    terrorism must be involved. From what I've heard, they are evacuating New Jersey and calling in the National Guard.
    No, that was the Martians.
  - Re: (Score:2)
    
    by Kymermosst ( 33885 ) writes:
    
    It is far worse than that. The summary says it is a meltdown! I don't know how IT could cause that, but terrorism must be involved. From what I've heard, they are evacuating New Jersey and calling in the National Guard.
    No. Their IT infrastructure is so power-hungry that they co-located a nuclear plant with their main data center.
    *That* is what is melting down.
- - Re: (Score:2)
    
    by Americano ( 920576 ) writes:
    
    Does the government not have responsibility to:
    1) Manage the delivery and implementation of the contracted items, and
    2) Verify that what was contracted for is actually delivered?
    Are you actually suggesting that a bunch of "average salary" mid-level IT drones would have done a better job at implementing a high availability / fault tolerant system than a private contractor that specializes in design and implementation of this type of system, and has done it dozens of times?
    I think it's far more likely that
It's always money (Score:2, Interesting)

by Anonymous Coward writes:

I'll tell you exactly how. Some manager somewhere said that it cost too much to add redundancy. It's happened over and over at my extremely large company, and it will continue to happen as long as money is the prime concern.
- Re: (Score:2)
  
  by jsnipy ( 913480 ) writes:
  
  at least now they can quantify their (bad) decision with their loss of productivity and perhaps loss of revenue.
- Re:It's always schedule (Score:3, Interesting)
  
  by rwa2 ( 4391 ) * writes:
  
  Heh, it shouldn't be about the money, though... they should have specified high availability from the very beginning. They often throw it out during the prototyping stage, saying they need to Keep It Simple Stupid just to get things working, but then all the software is never designed to be able to handle redundancy, and shoehorning it in later becomes pretty much like starting again from scratch.
  Also, designing in redundancy is usually worse than having no redundancy at all if it's never tested. There sh
  - Re: (Score:2)
    
    by sjames ( 1099 ) writes:
    
    Not putting all of your eggs in one basket, even a double walled basket with 2 handles and shock absorbers, would be a good start.
- Re:It's always money (Score:4, Insightful)
  
  by Daniel_Staal ( 609844 ) writes: <DStaal@usa.net> on Friday August 27, 2010 @12:14PM (#33393764)
  
  Add in politics: Get a couple of representatives arguing over where the money (if any) should be spent, and all possibility of real redundancy and fault-tolerance goes out the window.
  It's true in larger government organizations than this. The failures just haven't occurred yet.
  
  - Re:It's always money (Score:4, Interesting)
    
    by geekoid ( 135745 ) writes: <dadinportland@@@yahoo...com> on Friday August 27, 2010 @01:09PM (#33394528) Homepage Journal
    
    This is a private sector failure. NG is the culprit here, not the government.
    This is why you should be very wary of bidding out work to a 3rd party. They don't care about your city. They are not thinking about how their decisions impact the city in 10-20-50 years.
    And while infrastructure is far more complex and expensive than people who don't deal with it realize, 2.4 billion over 10 years? 240 million a year? That is a price where they should have a tested redundancy system. A single-point SAN failure? Shame on NG.
    I hate to burst your preconceive bubble, but as my years in the private sector and public sector have taught me, most government agencies are far better at keeping their own infrastructure. More reliable and long standing.
    
    - Re: (Score:3, Insightful)
      
      by Daniel_Staal ( 609844 ) writes:
      
      My 'preconceive bubble' is based on my current job for the US government, and the situation we have in our department.
      It might be true on average that government agencies are better at keeping their own infrastructure, especially if they can manage to keep their accounting and design of that infrastructure at a lower level. However, once those decisions pass the level from the internal to the external (or: From those hired for the job, to those elected/appointed into it), that long-term planning appears to
      - Re: (Score:2)
        
        by gclef ( 96311 ) writes:
        
        However, once those decisions pass the level from the internal to the external (or: From those hired for the job, to those elected/appointed into it), that long-term planning appears to break down, in favor of political squabbles.
        As someone who's worked both sides of the public/private line, allow me to assure you that this is not unique to government. I've seen plenty of boneheaded design decisions made by upper management for obscure/bizarre/just-plain-wrong reasons in both private and government gigs.
    - Re: (Score:2)
      
      by sjames ( 1099 ) writes:
      
      It MAY be a government failure as well. When you write the impossible into a bid, make the bidding process tremendously complex and make the cost of even bidding too high for most potential contractors (by expecting a complex analysis up-front for free) you eliminate all but the largest contractors with a fat legal department. If you then require acceptance of the lowest bid with no allowance for confidence level you set up a perfect storm for a ripoff. You assure that each bid you receive will be a lie bas
- Re: (Score:2)
  
  by joebok ( 457904 ) writes:
  
  Money is always the prime concern for a business. If the cost of adding redundancy is higher than the expected cost of dealing with network failure, then why would a business do it?
  That being said, I often see the cost of dealing with a significant network interruption being underestimated - either the $ cost or the probability of it happening.
  - Re:It's always money (Score:5, Insightful)
    
    by cgenman ( 325138 ) writes: on Friday August 27, 2010 @12:49PM (#33394250) Homepage
    
    Everyone seems to think that a network outage is no big deal, until the network goes down. That's when people start thinking of the burn rate of an entire organization sitting on their thumbs while that network of off-the-shelf Linksys routers is replaced by some kid at Best Buy. Or how that 5k dollars per year for a backup external line suddenly pales in comparison to the 5k dollars per hour your organization is wasting because you were a cheap bastard.
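The $5k-per-year versus $5k-per-hour comparison can be written out as a small expected-cost sketch. Only those two figures come from the comment; the outage probabilities and durations below are hypothetical, picked to show how much the answer depends on the estimate.

```python
# Compare the yearly cost of a backup line with the expected yearly cost
# of outages. Only the two $5k figures come from the comment above.

BACKUP_LINE_PER_YEAR = 5_000   # from the comment: backup external line
OUTAGE_COST_PER_HOUR = 5_000   # from the comment: idle organization burn rate


def break_even_outage_hours(redundancy_cost_per_year: float,
                            outage_cost_per_hour: float) -> float:
    """Expected outage hours per year at which redundancy pays for itself."""
    return redundancy_cost_per_year / outage_cost_per_hour


def expected_outage_cost(prob_per_year: float, outage_hours: float,
                         outage_cost_per_hour: float) -> float:
    """Expected yearly loss from one kind of outage (probability x impact)."""
    return prob_per_year * outage_hours * outage_cost_per_hour


break_even = break_even_outage_hours(BACKUP_LINE_PER_YEAR, OUTAGE_COST_PER_HOUR)

# Hypothetical estimates: a 10% yearly chance of an outage of each length.
optimistic = expected_outage_cost(0.10, 8, OUTAGE_COST_PER_HOUR)    # 8-hour outage
realistic = expected_outage_cost(0.10, 24, OUTAGE_COST_PER_HOUR)    # 24-hour outage

print(f"break-even: {break_even:.1f} expected outage hours per year")
print(f"8-hour estimate:  ${optimistic:,.0f}/year vs ${BACKUP_LINE_PER_YEAR:,}/year")
print(f"24-hour estimate: ${realistic:,.0f}/year vs ${BACKUP_LINE_PER_YEAR:,}/year")
```

With these made-up inputs, the backup line breaks even at one expected outage hour per year. Assuming an 8-hour outage makes skipping it look defensible, while assuming a 24-hour outage at the same probability makes it look cheap, which is the underestimation problem described in the parent comment.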
    
- - Re: (Score:2)
    
    by jeffmeden ( 135043 ) writes:
    
    What does mean mean again? Ah nevermind. Odds are 2 out of 3 it will fail outside of business hours anyway. And if that's the case, no one will notice!
  - Re: (Score:3, Interesting)
    
    by cgenman ( 325138 ) writes:
    
    I love how people can determine a MTBF of 50 years after testing a piece of hardware for a month.
    For my money, the only computer that should be able to claim a 50 year MTBF is the Univac. And that's really, really not accurate.
    - Re: (Score:2)
      
      by jeffmeden ( 135043 ) writes:
      
      What does mean mean again? Oh, that's right. If you want a MTBF of 50 years, you can either get one unit and run it for 50 years to prove yourself, or you can get 100 units and run them for 6 months... To be sure, it doesn't automatically take into account mechanical wear but any engineer worth their salt can extrapolate acceptable wear rates with 6 months of data (and that's only if you are talking about systems with moving parts)...
      - Re:It's always money (Score:4, Informative)
        
        by cgenman ( 325138 ) writes: on Friday August 27, 2010 @01:52PM (#33395138) Homepage
        
        "get 100 units and run them for 6 months..."
        Which works if you presume a linear fail rate, which is bonkers. Systems always run better at the beginning of their lifecycle. Static buildup, electrical interference, repeated heating and cooling cycles, etc all take a toll on the electronics. Would you really personally estimate a real-world MTBF of off-the-shelf SATA drives at 70 years? No, because they work perfectly well for the first year, start having trouble the second, and are all dead by the 8th. But if you presume linear dropoff using just that first year of testing, they look pretty damn bomb proof because that's when they work best. It's a stupid system that's only valid if you replace all of your hardware every year.
        And all systems have moving parts. Electrons move. The circuit boards expand and contract. Crap builds up on important components. Electroplating can move metals from one part of the design to another. Stuff gets plugged in and unplugged.
        I realize that MTBF has a very technical definition that differs from how marketing departments use it. I might agree with you that any engineer worth their salt can extrapolate a proper MTBF. But most of the MTBFs I've seen are just stupidly wrong. If people really believe those published fantasy numbers, no wonder they don't put enough redundancy in their systems.
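The disagreement over "100 units for 6 months" can be sketched in a few lines. The first function is the usual unit-hours-per-failure point estimate. The Weibull wear-out model and its parameters (shape 3, characteristic life 5 years) are assumptions chosen only to mimic hardware that is fine early and mostly dead by year 8; they are not measured drive data.

```python
# Why a short test can report a 50-year MTBF for hardware that wears out
# long before that. Model parameters below are illustrative assumptions.
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def mtbf_from_test(units: int, hours_per_unit: float, failures: int) -> float:
    """Point estimate of MTBF: total unit-hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return units * hours_per_unit / failures


def constant_rate_survival(t_years: float, mtbf_years: float) -> float:
    """Fraction still working at time t if the failure rate never changes."""
    return math.exp(-t_years / mtbf_years)


def weibull_survival(t_years: float, scale_years: float, shape: float) -> float:
    """Fraction still working at time t under a Weibull life model.
    shape > 1 means the failure rate rises with age (wear-out)."""
    return math.exp(-((t_years / scale_years) ** shape))


# 100 units run for 6 months with a single failure observed.
mtbf_years = mtbf_from_test(100, HOURS_PER_YEAR / 2, 1) / HOURS_PER_YEAR

print(f"estimated MTBF: {mtbf_years:.0f} years")
for t in (0.5, 2, 5, 8):
    flat = constant_rate_survival(t, mtbf_years)
    worn = weibull_survival(t, scale_years=5, shape=3)
    print(f"year {t:>3}: constant-rate {flat:.1%} alive, wear-out {worn:.1%} alive")
```

In the assumed wear-out model nearly every unit survives the six-month test window, so the test supports the 50-year figure, yet under 2% are still running at year 8. The constant-rate reading of the same MTBF predicts about 85% alive at that point, so the estimate only holds inside the period that was actually tested.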
        
  - - - Re: (Score:2, Funny)
        
        by Anonymous Coward writes:
        
        Oh yeah? Well YOUR momma so stank, she lay down on train tracks and nothing happened 'cause not even the train would hit that.
Northrop Grumman (Score:2)

by elrous0 ( 869638 ) * writes:

Northrop Grumman already runs the U.S. military. Might as well turn over IT to them too.
- - - - Re: (Score:2)
        
        by brainboyz ( 114458 ) writes:
        
        B2 Bomber?
        
        Re: (Score:2)
        
        by Amouth ( 879122 ) writes:
        
        Rephrase - Got any examples that aren't more than 20 years old?
        
        Re: (Score:2)
        
        by tsm_sf ( 545316 ) writes:
        
        They were both robots? That explains so much...
They need a better network admin (Score:4, Funny)

by Nemesisghost ( 1720424 ) writes: on Friday August 27, 2010 @12:07PM (#33393664)

Maybe they should hire Terry Childs, at least he won't let their network go down for something like this.

- - Re: (Score:2, Insightful)
    
    by Anonymous Coward writes:
    
    That's insane. Terry Childs failed (he was arrested and unable to make changes to the network)--and the city kept running.
Redundancy (Score:3, Funny)

by CmdrPorno ( 115048 ) writes: on Friday August 27, 2010 @12:10PM (#33393708)

Silly state, expecting to get redundancy for only $2.4 billion. Don't they realize they're going to have to pay a lot more than that to get a reliable network?

- Re: (Score:3, Interesting)
  
  by Wonko the Sane ( 25252 ) writes:
  
  What makes you think that the legislators expect redundancy? When that kind of money changes hands the only thing they care about is getting favors and campaign contributions.
- Even funnier (Score:4, Interesting)
  
  by SteveFoerster ( 136027 ) writes: <steve&stevefoerster,com> on Friday August 27, 2010 @12:59PM (#33394398) Homepage
  
  As a leftover from when Virginia-headquartered AOL was the king of connectivity, you see license plates here in Virginia touting us as the Internet Capital [virginia.gov].
  
- Offer (Score:2)
  
  by XanC ( 644172 ) writes:
  
  I'll do it for $2.3 billion!
- Re: (Score:2)
  
  by Darth_brooks ( 180756 ) * writes:
  
  Your sig makes that comment *that* much more hilarious.
  - Re: (Score:2)
    
    by CmdrPorno ( 115048 ) writes:
    
    Thankfully, I didn't pay $2.4 billion for AT&T's crappy network. (The reception is actually not that bad here, but I'm in a rural area with no 3G, which really sucks.)
Awful. (Score:4, Insightful)

by boneclinkz ( 1284458 ) writes: on Friday August 27, 2010 @12:11PM (#33393716)

Our primary concern should be a complete audit of World of Warcraft server hardware, to ensure that this vulnerability does not exist in other, more vital networks.

Sorry, has to be said... (Score:3, Funny)

by Omega Hacker ( 6676 ) writes: <omega@@@omegacs...net> on Friday August 27, 2010 @12:15PM (#33393780)

I think the id10ts who pulled off this stunt are rather DIMM....

Question. (Score:2, Insightful)

by U8MyData ( 1281010 ) writes:

Umm, so what's the point of having a SAN if it weren't redundant? Methinks there is more to this story.
- Re: (Score:3, Insightful)
  
  by MightyMartian ( 840721 ) writes:
  
  Probably involving executives vacationing in nice tropical locales by rewarding themselves with hefty bonuses. Meanwhile some poor IT guys weren't given the budget that reflected how much the State was paying out, and had to cobble together a SAN solution, or pick the cheapest one off the shelf. The IT guys will, of course, be the patsies for this whole episode, with the CEO and CTO all huffing and puffing and vowing to State officials and lawmakers that they're doing everything they can to get to the bot
  - - Re: (Score:2)
      
      by Necron69 ( 35644 ) writes:
      
      Ditto. I test SAN configurations for a living, and I'm stumped by this one. I'd love to know some details.
      Necron69
      - Re: (Score:3, Funny)
        
        by jeffmeden ( 135043 ) writes:
        
        "What could possibly be the difference between raid0 and raid1? Come on, who would put those radio button choices so close together if they really meant opposite things!"
      - Re:Question. (Score:4, Informative)
        
        by MightyMartian ( 840721 ) writes: on Friday August 27, 2010 @12:58PM (#33394380) Journal
        
        Well, as Sherlock Holmes' greatest axiom goes "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." Using that logic, the answer is simple. They're not using a SAN. Somewhere along the line someone is bullshitting, and my gut tells me it's management. A lot of folks who get government contracts pretty much view them as an opportunity to skim off the top. Why, take what should be a $50,000 solution and mock something up for $10,000, and that's $40,000 profit.
        
- Re: (Score:2)
  
  by Locke2005 ( 849178 ) writes:
  
  You'd think they'd at least do RAID 1 Mirroring. Then they could just hot swap in another drive, sync it, and be on their merry way. Why centralize your data services if you're not going to do it right?
- Re:Question. (Score:4, Interesting)
  
  by Darth_brooks ( 180756 ) * writes: <clipper377@ g m a il.com> on Friday August 27, 2010 @01:13PM (#33394582) Homepage
  
  Depends on the SAN. The article (as most tech articles are) is very short on scope & details. So "one chip" went bad. Should that bring everything to a screeching halt? The answer should be "no" but in practice we can all say that it's more often a case of "not usually." From TFA:
  It was hailed as being able to suffer a failure to one part but continue uninterrupted service because standby parts or systems would take over. But when the memory card failed Wednesday, a fallback that attempted to shoulder the load began reporting multiple errors, Nixon said.
  So Array Alpha shits the bed. You follow your failover procedures and start running on Array Zappa. That immediately starts throwing errors. OK, armchair QBs, let me switch to my Keanu Reeves voice and ask, "What do you do?" You built a pretty damned redundant system there and you're still down. Sure, it'd be nice if they had a backup in another DC they could fail to, but they don't. Doesn't matter; eventually you're playing the double / triple / quadruple hulled oil tanker game. Either way, redundant SANs aren't cheap and aren't all that easy (it's not exactly a "the boss's nephew who 'knows all about computers' set it up last weekend" level of complexity). TFA also has these points:
  Full function may not be restored until Monday.
  Experts who examined the system determined that no data were lost except for those being keyed into the system at the moment it failed, Nixon said.
  Other than the fact that proofreading and the usage of proper grammar are no longer requirements to work for a Virginia newspaper, what do those points tell us? Sounds to me like they hit the last line in the DR procedures: Restore from backup. Depending on what their backup strategy is (maybe they're splitting several terabytes across a tape robot that only supports 200/400 GB tapes because that robot is the only device the vendor supports) and how truly important the affected system is (this may be a system where the powers that be said "fsck it, they can process renewals by hand and we'll bring everything back up on Monday after we test on Saturday"), a return to business on Monday might be SOP. But that wouldn't sell newspapers (or make talking points with the voters...) now, would it?
  Maybe there was a major screwup here. Maybe they never tested their failovers and maybe that 2nd SAN was bad out of the box. I'm a little more willing to cut some slack and say "man, that sucks. Glad it's not my ass on the line." Karma's a bitch like that. I like to take these stories as an opportunity to rethink where my own single points of failure are rather than point & laugh and tell everyone how I'll never lose any data because I'm running RAID 5......
  
- Re: (Score:2)
  
  by geekoid ( 135745 ) writes:
  
  Here is an educated guess:
  When getting the bid, NG promised redundancy.
  NG stalled and then was behind schedule.
  The redundancy system became less 'important' due to time.
  NG went live.
  NG let a bunch of contractors go.
  NG says their in-house staff will take care of it.
  NG new hires get stuck at the end of the project, do enough to consider it 'done'. Several amateur mistakes were made.
  What's happening right now:
  People who work for the state IT are showing everyone the email they got from NG saying the system was
Typical liberal overreaction (Score:5, Funny)

by BitHive ( 578094 ) writes: on Friday August 27, 2010 @12:27PM (#33393930) Homepage

Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.

- Re: (Score:2)
  
  by idiotnot ( 302133 ) writes:
  
  They get all defensive (no pun intended) when you point out that this was one of Mark Warner's crowning achievements as governor......
- Re: (Score:2)
  
  by mounthood ( 993037 ) writes:
  
  Guys, accidents happen. This "Northrop Grumman", whoever they are, will no doubt be fired and not receive any more contracts once word of this gets out. This will put pressure on them to provide better services, or be out-competed by other entrepreneurs. Our free market system works, you just need to expect this kind of thing when it's government doing the hiring.
  The problem is that it's the government selecting the vendor. If the government would just get out of the vendor-hiring-business maybe the Free Market could fix this mess.
  - Re: (Score:2)
    
    by Daniel Dvorkin ( 106857 ) * writes:
    
    The problem is that it's the government selecting the vendor. If the government would just get out of the vendor-hiring-business maybe the Free Market could fix this mess.
    I'm not sure if you're joking or not. If you're serious ... um, who do you suggest should hire people to run the government's servers, other than the government?
- - Re: (Score:2)
    
    by Fjandr ( 66656 ) writes:
    
    Wooooooooosh!
  - Re: (Score:2)
    
    by Mr.Intel ( 165870 ) writes:
    
    Woooooosh!
- - Re: (Score:2)
    
    by oodaloop ( 1229816 ) writes:
    
    Yes, he was joking. I think there's a whoooosh around here somewhere for you.
  - Re: (Score:2)
    
    by sjames ( 1099 ) writes:
    
    The real IT outfits are deeply disadvantaged by feeling the need to actually deliver on the contract. That drives costs up and caps promises.
Ok, this really sucks!!!!!!! I know why and can (Score:2)

by Anon-Admin ( 443764 ) writes:

not say. The F***Ing NDA stops me from saying anything about the stuff I saw in NGC's IT.
Well, I guess I can say it is BROKE NOW and you have to fix it. Told you so!
- Re: (Score:2)
  
  by Ironhandx ( 1762146 ) writes:
  
  NDAs are such a bitch.
  I think you should talk to Julian Assange at Wikileaks so that those of us that want the juicy details can get them.
  P.S. There's a fat unmarked manila envelope in it for you. We all chipped in. It's a really nice envelope.
- Re: (Score:2)
  
  by NeutronCowboy ( 896098 ) writes:
  
  There's a Post Anonymously button for that reason. Given the state of their IT department, I doubt they'll be able to figure out who broke their NDA, even if the police manage to give them an IP.
  - - Re: (Score:2)
      
      by NeutronCowboy ( 896098 ) writes:
      
      This assumes smartness on the part of the department. Looking at the clusterfuck that is the current meltdown, I doubt they have the smarts for that.
Northrop Grumman? Thats why... (Score:2)

by Nadaka ( 224565 ) writes:

My company works on a project that NG lost on a re-compete bid. I cannot go much into details, but suffice it to say: I am not at all surprised that they screwed up maintenance and management, based on what I have had to deal with in the software they developed.
Northrup Grumman (Score:2)

by fermion ( 181285 ) writes:

This is what you get for hiring a military contractor to do a civilian persons job. All 2.5 billion gets you in the military is a manger and toilet seat. You don't start getting functional hardware until the budget reaches 100 billion.
What brand? (Score:2)

by CambodiaSam ( 1153015 ) writes:

Anyone know what brand of SAN went down? My company had a similar issue where our SAN had a major outage, and the vendor claimed it was "an error that never happens, we swear".
Northrop hiring event (Score:2)

by confused one ( 671304 ) writes:

Funny that I should receive an email today inviting me to a Northrop Grumman Information Systems Hiring Event. The event occurs on the 25th of August and I received the email on the afternoon of the 27th. Failed there too!
It's not always the bureaucracy (Score:2)

by roc97007 ( 608802 ) writes:

Ok, in this case it probably is the bureaucracy at fault. But it isn't in all cases. In my previous job we had an architect who would take it upon himself to "value engineer" a vendor's solution, with unpredictable results. I'm not sure why -- we had budget. Maybe it was his way of seeming more valuable? This led to "solutions" like a SAN cobbled together from disk arrays, controllers and switches from three different vendors that were not meant to work together, had never been tested in the chosen co
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:2)
    
    by roc97007 ( 608802 ) writes:
    
    > What makes you think Northrop Grumman had a choice? They still work for the state IT department at the end of the day.
    There are typically very high penalties for not meeting your service levels. A 24 hour unplanned outage can blow a half year's profits for the contract. Like any outsourcing company, NG did have a choice -- don't take that contract under those conditions.
- Re: (Score:2)
  
  by pnutjam ( 523990 ) writes:
  
  Hey now! I love to "value engineer", but I would never work back from a series of proprietary solutions.
  
  And I never have a budget. I've dreamed of having a budget, must be nice.
  I would stroke it and love it and call it George...
oblig. quote (Score:2)

by StripedCow ( 776465 ) writes:

There's something rotten in the state of Virginia?
It happens (Score:4, Informative)

by kilodelta ( 843627 ) writes: on Friday August 27, 2010 @01:30PM (#33394850) Homepage

But $2.4 billion over ten years comes out to $240,000,000 per YEAR! With that kind of money they could replace their infrastructure a few times over every year.

This is a clear example of the malfeasance that happens when government gets corrupted by corporate interests. Taxpayers in VA should be up in arms about this one.

Here's my story of state agency screw-ups. Two jobs ago I was working for the Secretary of State's office here. We had the opportunity and funding to get our IT infrastructure in order when the Help America Vote Act (HAVA) became law. We were able to build out a secure and redundant room to house our critical infrastructure.

Physical access by key and alarm code only. Redundant power, which included an APC Symmetra UPS system, backed up by a 125kW natural gas fired generator. Even made sure to extend tendrils from the redundant power out to the MDF so the ISP could use our power system. Also had redundant cooling tied to the generator.

The one Achilles Heel of the operation was DNS. Ours was provided from outside our space. I suggested they build a zone locally; that way we'd have DNS services if the state's went down. But they quashed it as being too difficult! Ut si!

Well one day there's a massive power outage in the city. They were still up and running, lights on, air conditioning on but couldn't get in or out of the internal network even though the ISP circuits were still up. Yup, DNS!
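
A rough sketch of the "local zone" idea described above, for illustration only. It assumes a small hand-maintained table of critical internal names (`LOCAL_ZONE`, with made-up hostnames and addresses) that is consulted when the upstream resolver cannot be reached. A real deployment would more likely run a local secondary or caching name server; the `resolve` helper here is hypothetical and only shows the fallback logic.

```python
import socket

# Hypothetical local copy of the zone: names the office must still be able
# to resolve when the outside (state-run) DNS servers are unreachable.
LOCAL_ZONE = {
    "intranet.example.internal": "10.0.0.10",
    "mail.example.internal": "10.0.0.25",
}


def resolve(name, local_zone=LOCAL_ZONE, upstream=socket.gethostbyname):
    """Return an IPv4 address for name.

    Tries the upstream resolver first; if that lookup fails, falls back
    to the local zone table. Re-raises the error if the name is not in
    the local table either.
    """
    key = name.rstrip(".").lower()  # normalise case and any trailing dot
    try:
        return upstream(key)
    except OSError:  # socket.gaierror is a subclass of OSError
        if key in local_zone:
            return local_zone[key]
        raise
```

With something like this in place, the machines inside the room could still find each other by name during an upstream outage, which is the gap the power failure exposed.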

- Re: (Score:2)
  
  by CAIMLAS ( 41445 ) writes:
  
  Are you kidding? That's a trivial amount to:
  * maintain tens if not hundreds if not thousands of proprietary (legacy) applications
  * maintain the many, many workstations
  * maintain the fabric for many, many workstations
  * maintain the servers which provide services, many of which are interconnected and do not cope with modern technologies well.
  * maintain the storage for all of that
  * SECURE all of the above
  * make it as fault tolerant as possible
  Shit, I suspect the Cisco contract is probably a good 3rd of that pe
Reminds me of the good ol' days (Score:2)

by Locke2005 ( 849178 ) writes:

In my first job, I changed the boot-up message on the VAX to "If only my girlfriend went down as often as this computer!" I kinda assumed it would scroll up off the terminal and nobody would see it. It, uh, didn't. One of our female programmers, who was famous for overreacting, came into work and threw a hissy fit. We fixed the message and decided to tell everyone we couldn't figure out who put it there. This is why you shouldn't give all developers administrator privileges!
What do they have that could be down? (Score:2)

by dazedNconfuzed ( 154242 ) writes:

Whenever I drive thru Virginia (up I-81) there's a sign announcing "Entering Virginia's Technology Corridor" which is followed by hundreds of miles of rolling green pastures.
What, there was a proliferation of cow-tipping?
Happened before? (Score:2)

by mr100percent ( 57156 ) writes:

Reminds me of a Classic TheDailyWTF: I'm Sure You Can Deal [thedailywtf.com]
Grrrr to the incredulous... (Score:2)

by bartwol ( 117819 ) writes:

To anybody who feels incredulous at the notion of a single point of failure taking down a purportedly redundant system: I suspect you have limited experience with the issues and challenges of managing a very large system infrastructure. The complexity of such systems goes well beyond the knowledge of any individual, so notions of fault tolerance across the enterprise are highly theoretical. Even with extensive planning and testing, the gotcha is in what you don't know. Sometimes, one of those What-You-Don't-
Typical State Job Interview (Score:2)

by hackus ( 159037 ) writes:

So tell us a little bit about your education Mr. X.
"Well, I have a certificate in Microsoft Administration and a computer science degree, plus I am Cisco certified."
Oh excellent!!! Thank you for your time.
So tell us a little bit about your education Mr. Y.
"Well, I have a degree in computer science and I ran several storage area networks for several years at my previous employer, Widgets are Us."
But, do you have any certifications?
"No."
Thanks Mr. Y. It has been nice speaking with you, don't call us, we
Having had... (Score:2)

by Rhys ( 96510 ) writes:

Memory go bad in a "SAN device" (I say in quotes because nobody in their right mind would actually think a single-pathed, non-redundant disk array is really SAN-grade hardware) from a fruit-flavored vendor before, I can actually have some pity for the guys responsible/working on it. Debugging it is a great time too, because your filesystem rebuild generally works. As does copying small amounts of data. It is only once you try to copy a couple of terabytes that things go to hell.
Filesystem data and inode corruption bo
- Re: (Score:3, Insightful)
  
  by snookerhog ( 1835110 ) writes:
  
  sounds like nobody in Virginia knows either
- Re:card? (Score:4, Informative)
  
  by Culture20 ( 968837 ) writes: on Friday August 27, 2010 @12:08PM (#33393666)
  
  A technically correct term, albeit against normal colloquialism which calls them memory chips. Memory chips are the black things on the cards.
  
  - Re: (Score:3, Informative)
    
    by NotBornYesterday ( 1093817 ) writes:
    
    From the awkward phrasing, my completely uninformed guess is they are referring to a cache module on a controller somewhere.
    - Re: (Score:2)
      
      by GaryOlson ( 737642 ) writes:
      
      Unless the unit is from Texas Memory Systems [ramsan.com] -- completely flash based storage. Note in the FAQ where scheduled down time is required to replace a faulty Flash drive.
      
      I wonder if some genius decided to use one of these units as primary storage instead of using it as caching storage.
- Re: (Score:3, Funny)
  
  by jeffmeden ( 135043 ) writes:
  
  HAHAHAHHAHAHHAHHA - stupids
  "This is supposed to be the best system you can buy, and it's never supposed to fail, but this one did," he said
  And I've got a bridge for sale in San Francisco...
  Throw in your city's cisco-powered WAN and I'll take it!
- Re: (Score:2)
  
  by Yunzil ( 181064 ) writes:
  
  And I've got a bridge for sale in San Francisco...
  No thanks. I got a sweet deal on one in Brooklyn.
- - Re: (Score:2)
    
    by pnutjam ( 523990 ) writes:
    
    48
- Re: (Score:2)
  
  by MightyMartian ( 840721 ) writes:
  
  Chuckle...
  My wife gets pissed when I have to stay late or go in on the weekend to replace a switch or move some wires around. "Plumbers don't do that... Electricians don't do that..." she says. "No, they don't, and everybody gets pissed off when you can't flush the toilet and all the lights are off."
