Anatomy of the VA's IT Meltdown 137
Lucas123 writes "According to a Computerworld story, a relatively simple breakdown in communications led to a day-long systems outage within the VA's medical centers. The ultimate result of the outage: the cancellation of a project to centralize IT systems at more than 150 medical facilities into four regional data processing centers. The shutdown 'left months of work to recover data to update the medical records of thousands of veterans. The procedural failure also exposed a common problem in IT transformation efforts: Fault lines appear when management reporting shifts from local to regional.'"
In other words.... (Score:5, Insightful)
Once again, the VA shows its true colors and mucks up another project funded by taxpayers for the well-being of our nation's Veterans. A more screwed-up organization one will not find.
Assumption junction, what's your function? (Score:4, Insightful)
DOH! Looks like it was all just due to someone's assumption that someone else would do their job.
From my experience, you can assume things happened, but if you don't verify that they actually happened - you are DOOMED.
my 2 cents. (Score:5, Insightful)
Unfortunately, one of the best ways to learn how well your disaster recovery system works is to have a disaster. The problem with scheduled drills is that the scenarios themselves are planned out and typically not run system-wide, i.e., you test one part of the system, then another part, etc. On RTFA, it seems much of the breakdown occurred because too many people assumed. There was also no centralized decision-making entity with access to all the information. Viewed from their individual perspectives, everyone seemed to have made the right decision. However, sometimes when implementing a global recovery plan, one system may have to be sacrificed for another.
Zonk, you retard (Score:5, Insightful)
Please, God, isn't there some kind of Editing 101 correspondence-school course we can send all these guys to? I mean, I love Slashdot to death, but please God, can you give the staff just one ounce of basic editorial skills: spelling, grammar, etc? Teach them to write for clarity, not just brevity? Maybe go for broke and touch on dupe-checking, fact-checking, changing links so they point to the original article instead of some guy's AdSense-laden blog page that says nothing more than "here's the story"?
You're EDITORS, for God's sake (even if in name only), you are indeed allowed to EDIT submissions.
VA Acronym? (Score:2, Insightful)
Why always centralizing? (Score:4, Insightful)
1) Trying to centralize gives us large, expensive computers that are made out of the same components as smaller ones and thus fail just as the smaller ones do; however, cramming ever more crap onto the same machine means that whenever it fails, it brings down everything at once.
2) Trying to centralize has the ultimate goal of eliminating jobs, but they need those people, since they know all the little details and hiccups their systems have. If people know a project is going to eliminate their jobs, they won't be cooperative. IT not being cooperative is very bad in a world where everything is computerized.
3) Eventually the same number of people is going to have to work on the centralized system, because you also centralize the problems: more problems bring more people, more people bring more overhead and inefficiency, and more inefficiency brings more people (at least that's the default in today's business world; throwing more people at an IT problem doesn't make it disappear faster).
4) More people on a project that was designed to be more cost-efficient means the managers will have to cut expenses. Cutting expenses brings underpaid people, underpaid people bring less or no experience and higher turnover, and higher turnover means more expense-cutting.
Therefore: keep your local IT guy(s) and infrastructure, even though you can't squeeze 100% of a workday out of them and it will cost a little more. The end users have a better relationship with the guy(s), and that makes for happier people. Centralizing brings more overhead and less customer interaction with IT, and thus more inefficiency throughout the business.
Re:In other words.... (Score:1, Insightful)
Re:In other words.... (Score:3, Insightful)
Re:In other words.... (Score:2, Insightful)
I won't say it's perfect, but it has quite low overhead (relative to private insurance), and if there were no debate about who was allowed on and who wasn't, it could be streamlined further.
Very few people want a single source of healthcare providing everything.
They messed up everything they could mess up. (Score:1, Insightful)
Couple of reasons: First, they're running Vista. I'm not trying to be all "You must only run Linux or ur a n00b" here -- you can run Windows servers just fine, but no reasonable IT planner should ever, *ever* consider using an OS that new for a mission-critical enterprise application. If it doesn't have two or three years in the field, don't even consider it.
Second, their failover plan sucked. Live data syncs are good for physical disasters (fires, earthquakes, zombie attacks) but, as the VA discovered, they leave you shitting your pants when you run into an issue that may or may not be data-related. The solution, of course, is to keep a day- or week-old copy someplace, along with an up-to-date (but not yet applied!) transaction log that you can use to update it once you've sanity-checked it.
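That "stale copy plus held-back transaction log" idea can be sketched in a few lines. This is a purely hypothetical illustration (the function names, transaction fields, and sanity check are all invented, not anything from VistA): the standby holds a day-old snapshot, and each queued transaction is vetted before it is ever applied.

```python
def sanity_check(txn):
    """Reject obviously corrupt transactions before replay (illustrative check)."""
    return (
        isinstance(txn.get("patient_id"), int)
        and txn.get("op") in {"insert", "update", "delete"}
        and "record" in txn
    )

def replay(snapshot, txn_log):
    """Apply only vetted transactions to the stale snapshot."""
    skipped = []
    for txn in txn_log:
        if not sanity_check(txn):
            skipped.append(txn)        # quarantine for manual review
            continue
        pid = txn["patient_id"]
        if txn["op"] == "delete":
            snapshot.pop(pid, None)
        else:                          # insert or update
            snapshot[pid] = txn["record"]
    return snapshot, skipped

snapshot = {1: {"name": "Smith"}}      # the day-old copy
log = [
    {"patient_id": 2, "op": "insert", "record": {"name": "Jones"}},
    {"patient_id": "bad", "op": "update", "record": {}},  # corrupt entry
]
current, quarantined = replay(snapshot, log)
print(len(current), len(quarantined))
```

The point of the design is that a corrupt transaction never touches the copy; it lands in a quarantine pile instead of propagating to your last good dataset.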
Third, letting the vendor run "tests" on your production system. Nobody, and I mean nobody, should ever get to touch any production system unless they're implementing a specific change that's been tested in an identical environment, passed QA and review by folks who know the system and then only with a published implementation, testing and backout plan. If a system needs "tests", you pull it out of production before you start messing with it.
Finally, their "virtualized team" approach (read: our people are scattered all over the place) is moronic -- you see this sort of thing, and without fail it's the result of political pressures rather than sane management. In this case, I'll bet my hat it was a situation where a bunch of middle managers were allowed to maneuver to keep their fingers in the pie when centralization took place, so instead of having everyone you need on hand and in one group, you're busy setting up conference calls.
Plus, now their solution is to bring in a bunch of consultants. Yeah, that always works. Good luck, guys! You're gonna need it.
Poor VMS. (Score:3, Insightful)
We hardly knew ye.
It happens (Score:4, Insightful)
Seems to me that things otherwise working well is a major accomplishment. They are still on the old system and are entering data back into that system and migrating it into the new one. But it seems things went well otherwise.
Anytime you do a major shift like this, it's hard. The users hate it because they can do their jobs very quickly on the system they are used to, but now they have to learn a new system and slow down.
Things happen.
Re:awesome! (Score:3, Insightful)
Re:Why always centralizing? (Score:3, Insightful)
Obviously not all of this data needs to be centralized, but its existence should be. We don't know to what level the VA was doing this, but I've met a large number of people who work in its IT branch, and they love what they do and are very good at it as well. Sometimes things just go wrong, and sometimes things get pushed out there for bureaucratic reasons, but most of the time the VA is very IT savvy.
Re:Why always centralizing? (Score:3, Insightful)
If that's how you're doing it, you're doing it wrong.
On how many smaller systems can you upgrade your disk controller's firmware without having to reboot or even stop access to the disks? Not a problem on a good SAN system.
And those systems only get economical when your data storage needs get big.
2) Trying to centralize has the ultimate goal of eliminating jobs, but they need those people, since they know all the little details and hiccups their systems have. If people know a project is going to eliminate their jobs, they won't be cooperative. IT not being cooperative is very bad in a world where everything is computerized.
It doesn't always have that ultimate goal, but very often does. And very often, if done correctly, it can achieve that goal.
Take 8 sites with 2 admins each that are only doing 50% duty running that service. (You need at least 2 so someone gets to have an occasional vacation).
That's 16 people, doing the workload of 8.
Bring that down to 1 site, and odds are you could do the exact same job with 8 people (since now there are 7 others to back you up)
And now you're all on one system, so you don't have lots of little variances, so you can be more efficient, etc...
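The staffing math above can be written out as a back-of-the-envelope model. The 50% utilization and minimum-of-two-admins-per-site figures come straight from the comment; the function names and the assumption that pooled staff can be sized to actual workload are my own illustration.

```python
def site_staff(min_admins=2):
    # each independent site needs at least two admins for vacation/sick coverage
    return min_admins

def distributed_headcount(sites):
    # per-site minimums dominate: every site pays for two people
    return sites * site_staff()

def centralized_headcount(sites, utilization=0.5):
    # pooled staff back each other up, so headcount can track the actual
    # workload (total people * utilization) instead of per-site minimums
    return max(site_staff(), round(sites * site_staff() * utilization))

print(distributed_headcount(8))   # 16 people, each at ~50% duty
print(centralized_headcount(8))   # 8 people, fully utilized
```

The caveat, as the rest of the thread points out, is that the model only holds if the consolidated system really does behave like one system rather than eight subtly different ones.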
Yes, we have lost some of the little details by losing those people, but in general you've got other problems if some information is only known by one person.
As it turned out, a lot of that "critical" information got passed along to other folks anyway, most of what was left turned out to be unimportant, and that small remaining percentage?
Well, the rest of us are smart, and the ones with the info weren't idiots... we were able to figure it out.
3) Eventually the same number of people is going to have to work on the centralized system, because you also centralize the problems: more problems bring more people, more people bring more overhead and inefficiency, and more inefficiency brings more people (at least that's the default in today's business world; throwing more people at an IT problem doesn't make it disappear faster).
Starting with bad assumptions.
A small focused skilled team can do pretty much anything. =-)
In fact some would say they're the only ones who do anything.
One example: We used to repeatedly run into situations where we had the same problem at x sites, so we had at least x people trying to solve it. We didn't realize others were duplicating our effort, so there was a lot of wasted effort, with solutions from different angles, so the sites ended up getting more and more out of sync in their setups.
4) More people on a project that was designed to be more cost-efficient means the managers will have to cut expenses. Cutting expenses brings underpaid people, underpaid people bring less or no experience and higher turnover, and higher turnover means more expense-cutting.
Every centralization project I've been on has had its hiccups, but in the end has resulted in reduced costs overall. We always started off with the people we had, plus a contractor or two who were "experts" in the field we were working in, just to make sure we had an outsider's view. We didn't always believe the contractors, but we'd at least use them for everything they were worth. We then "centralized" and kept most of the folks around to keep everything running everywhere... then the layoffs.
The main problem we have from our last centralization is that many on our small team are very shy about sharing issues before they know everything about them. They're afraid of looking bad, because then they won't be as valuable. (Hadn't run into that one before.)
Re:In other words.... (Score:5, Insightful)
If any one hospital or chain of hospitals performed as consistently lousily as the VA has, that hospital would have been sued into oblivion decades ago. Hundreds of thousands of vets who've used the VA's services can attest to that. But we can't necessarily sue the VA, because they're part of the government. Go to any VA hospital in the US. Odds are that after you pass through the pretty facade they've set up, you'll find patient after patient sitting in a wheelchair or bed lined up along some wall, waiting for some overworked, over-stressed, and under-staffed doctor and not getting the care they deserve.
The VA needs to take a lesson from the corporate world and change its face. Rename itself, start fresh. AND START DOING ITS G-D JOB! That's the best dismal chance it's got to make things right. As it is right now, there isn't a Vet in the US or abroad who thinks highly of the VA. And if there is, I'd find 100 who would refute any positive statement made about the VA.
And, yes - I'm a Vet. My Father is a Vet. My Grandfather is a Vet. My Uncle is a Vet. I don't recall them looking forward to communicating with the VA, either.
In closing, if the VA *did* do its job, US Veterans who couldn't re-adjust to civilian life after witnessing the horrors of war wouldn't make up 25% of the homeless!
http://www.cnn.com/2007/US/11/08/homeless.veterans/ [cnn.com]
http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/05/mia-in-plain-sight.html [cnn.com]
Re:awesome! (Score:3, Insightful)
I work at the heart of this... (Score:5, Insightful)
VISTA runs on HP's VMS, and on top of that it runs Caché from InterSystems. (And yes, it costs the taxpayers a lot! But a lot less since we've been centralizing it over the last 3 or 4 years.)
It is a HUGE system.
The centralization we're currently undergoing is massive; this problem was (IMHO) scapegoated onto a poor change control process.
I know what was changed, I know who changed it, and I know when they changed it. However, this 'meltdown' has happened three times... (Not with the same drastic outcome.) It comes down to VMS locking out logons because locks aren't being released properly. (Now you could argue that the reason locks got behind was this change... but I don't think that's the real reason, because of our previous problems.)
It's that simple. Ask the VISTA manager over lunch sometime. They weren't afraid of data corruption. They were afraid that if they moved the systems, the other system would lock up too under the extra user load.
There goes "VISTA". Everyone logged in is fine. Everyone not on... Isn't getting on.
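A toy model of that failure mode (purely illustrative Python, not VMS code, and every number here is invented): sessions take locks, a buggy path forgets to release some, and once the lock table fills up, users who are already in stay in, while nobody new can log on.

```python
class LockTable:
    """Minimal stand-in for a fixed-capacity OS lock table."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.held = 0

    def acquire(self):
        if self.held >= self.capacity:
            raise RuntimeError("lock table full: new logons refused")
        self.held += 1

    def release(self):
        self.held -= 1

table = LockTable(capacity=40)
for session in range(150):
    try:
        table.acquire()
        if session % 3:            # most sessions behave and release...
            table.release()
        # ...but every third session "leaks" its lock and never releases
    except RuntimeError:
        print(f"logon {session} refused; {table.held} leaked locks held")
        break
```

Note that nothing crashes and no data is touched; the system just quietly stops admitting anyone, which matches the "everyone logged in is fine, everyone not on isn't getting on" symptom.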
Now comes the bad part... No procedures!
We take 32 medical centers and throw their IT into a data center. You 'had' clear lines of who owned what and what happened when they went down. Now you centralize all that... Who raises the flag when something bad happens? Is it the site that has the problem? Is it someone who now controls the system at the data center? Who is responsible for what?
Oh wait... OI&T only has a dozen staff... and almost NONE of those people are technical. Everyone's pay was simply moved from one appropriation to another. But what about the IT systems?!?! We moved those too, but didn't hire any permanent staff to take care of them? We just rubber-banded together a bunch of people who work across the whole west coast, handed them a pager, and said good luck?
Suffice it to say, we have some REALLY REALLY hard working people... And some really bad management. (Congress forcing us to do things on a time table is really annoying. Especially since they expect results, but don't expect any documentation... What do you think is going to get skipped?)
Congress: How is that data center move going?
Howard: We've moved 28 sites!
Congress: Good Job!
Howard:
Then again... Howard doesn't even know everything we skip to get things done.
Bah
Like the budgie! (Score:3, Insightful)
First thing I learned in the military: your weapon was made by the lowest bidder.
Re:I work at the heart of this... (Score:1, Insightful)
We do have snapshots (I believe the brand naming is snap clones) as well as tape, as well as 'MANY' mirrors of the data.
There was never a data corruption question. VMS shut out the users. That's it... No VISTA logon because no VMS logon.
The cluster and consolidation were tested and ran for over a year on 8 sites before all 32 were planned and rolled into one. (But if you read the Congress transcripts, you would also know we're currently on hold.) We actually had this lock problem in the pilot rollout. However, the problem was never 'fixed' and still exists in the deployed solution. What causes it is a mystery... It can come and go depending on the lock backlog.
The current plan is to break up the cluster. Not to de-centralize, but to bust this puppy into smaller clusters to limit the number of locks on single files/devices. Not only that... but just running back the journal files, startup scripts, etc., takes WAY too long to be agile!
(The ultimate goal is to go open source... Shhh don't tell anyone we're trying to save you and me money!)
This problem escalated out of control because of several contributing factors. Lack of communication, lack of expectations, etc...
Doctors seem to think you can just flip a switch and move to the other data center. This is NOT the case. Between replaying log files, data checks, etc., it takes well over 4 hours just to 'ramp up' for the move. Then there is changing DNS and ensuring that every one of our 50k+ workstations is using all the proper DNS entries, etc., etc., etc.
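To see why "flip a switch" is wildly optimistic, you can lay the steps out in order. The step names and durations below are invented for illustration (this is not the VA's real runbook); the only thing taken from the comment is that the steps are strictly sequential and the ramp-up alone exceeds 4 hours.

```python
# Hypothetical sequential failover steps, with made-up durations in minutes.
FAILOVER_STEPS = [
    ("replay journal files on standby", 120),
    ("run data integrity checks",        90),
    ("repoint DNS records",              15),
    ("wait out DNS TTLs / client caches", 60),
    ("verify workstation connectivity",  45),
]

def total_failover_minutes(steps):
    # each step depends on the previous one, so durations simply add up;
    # there is no parallelism to exploit
    return sum(minutes for _, minutes in steps)

hours = total_failover_minutes(FAILOVER_STEPS) / 60
print(f"estimated failover: {hours:.1f} hours")
```

The DNS step is especially unforgiving at scale: even after the records are changed, tens of thousands of workstations keep using cached entries until their TTLs expire.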
It's a complex problem, and frankly... I think our VISTA managers are some of the hardest working underpaid Government employees. (Some of them are only GS-9 and yet equally charged with the care of this huge system.)
The VA exploits its talented employees, so it doesn't have to manage things that well. People care, and will MAKE it work.
I get worked up... Because this is SOOOO close to home.