Anatomy of the VA's IT Meltdown 137
Lucas123 writes "According to a Computerworld story, a relatively simple breakdown in communications led to a day-long systems outage within the VA's medical centers. The ultimate result of the outage: the cancellation of a project to centralize IT systems at more than 150 medical facilities into four regional data processing centers. The shutdown 'left months of work to recover data to update the medical records of thousands of veterans. The procedural failure also exposed a common problem in IT transformation efforts: Fault lines appear when management reporting shifts from local to regional.'"
In other words.... (Score:5, Insightful)
Once again, the VA shows its true colors and mucks up another project funded by taxpayers for the well-being of our nation's Veterans. A more screwed-up organization one will not find.
Assumption junction, what's your function? (Score:4, Insightful)
DOH! Looks like it was all just due to someone's assumption that someone else would do their job.
From my experience, you can assume things happened, but if you don't verify that they actually happened - you are DOOMED.
my 2 cents. (Score:5, Insightful)
Unfortunately, one of the best ways to learn how well your disaster recovery system works is to have a disaster. The problem with scheduled drills is that the scenarios themselves are planned out and typically not run system-wide, i.e., you test one part of the system, then another part, etc. On RTFA, it seems much of the breakdown occurred because too many people assumed. There was also no centralized decision-making entity with access to all the information. Viewed from their individual perspectives, everyone seemed to have made the right decision. However, sometimes when implementing a global recovery plan, one system may have to be sacrificed for another.
Zonk, you retard (Score:5, Insightful)
Please, God, isn't there some kind of Editing 101 correspondence-school course we can send all these guys to? I mean, I love Slashdot to death, but please God, can you give the staff just one ounce of basic editorial skills: spelling, grammar, etc? Teach them to write for clarity, not just brevity? Maybe go for broke and touch on dupe-checking, fact-checking, changing links so they point to the original article instead of some guy's AdSense-laden blog page that says nothing more than "here's the story"?
You're EDITORS, for God's sake (even if in name only), you are indeed allowed to EDIT submissions.
VA Acronym? (Score:2, Insightful)
Why always centralizing? (Score:4, Insightful)
1) Trying to centralize gives us large, expensive computers that are made out of the same components as smaller ones and thus fail just as the smaller ones do; however, cramming ever more crap onto the same machine means that whenever it fails, it brings down everything at once.
2) Trying to centralize has the ultimate goal of eliminating jobs, but they need those people, since they know all the little details and hiccups their systems have. If people know a project is going to eliminate their jobs, they won't be cooperative. IT not being cooperative is very bad in a world where everything is computerized.
3) Eventually the same number of people is going to have to work on the centralized system, because you also centralize the problems: more problems bring more people, more people bring more overhead and inefficiency, and more inefficiency brings more people (at least that's the default in today's business world; throwing more people at an IT problem doesn't make it disappear faster).
4) More people on a project that was designed to be more cost-efficient means the managers will have to cut expenses. Cutting expenses brings underpaid people, underpaid people bring less or no experience and higher turnover, and higher turnover means more expense-cutting.
Therefore: keep your local IT guy(s) and infrastructure, even though you can't squeeze 100% of a workday out of them and it will cost a little more. The end users have a better relationship with the guy(s), and that makes for happier people. Centralizing brings more overhead and less customer interaction with IT, and thus more inefficiency throughout the business.
Re:In other words.... (Score:1, Insightful)
Re:In other words.... (Score:3, Insightful)
Re:In other words.... (Score:2, Insightful)
I won't say it's perfect, but it has quite low overhead (relative to private insurance), and if there were no debate about who was allowed on and who wasn't, it could be streamlined further.
Very few people want a single source of healthcare providing everything.
They messed up everything they could mess up. (Score:1, Insightful)
Couple of reasons: First, they're running Vista. I'm not trying to be all "You must only run Linux or ur a n00b" here -- you can run Windows servers just fine, but no reasonable IT planner should ever, *ever* consider using an OS that new for a mission-critical enterprise application. If it doesn't have two or three years in the field, don't even consider it.
Second, their failover plan sucked. Live data syncs are good for physical disasters (fires, earthquakes, zombie attacks) but, as the VA discovered, they leave you shitting your pants when you run into an issue that may or may not be data-related. The solution, of course, is to keep a day- or week-old copy someplace, along with an up-to-date (but not yet applied!) transaction log that you can use to update it once you've sanity-checked it.
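That "stale copy plus held-back transaction log" idea can be sketched in a few lines. This is a purely hypothetical illustration (the function names, transaction fields, and sanity check are all invented, not anything from VistA): the standby holds a day-old snapshot, and each queued transaction is vetted before it is ever applied.

```python
def sanity_check(txn):
    """Reject obviously corrupt transactions before replay (illustrative check)."""
    return (
        isinstance(txn.get("patient_id"), int)
        and txn.get("op") in {"insert", "update", "delete"}
        and "record" in txn
    )

def replay(snapshot, txn_log):
    """Apply only vetted transactions to the stale snapshot."""
    skipped = []
    for txn in txn_log:
        if not sanity_check(txn):
            skipped.append(txn)        # quarantine for manual review
            continue
        pid = txn["patient_id"]
        if txn["op"] == "delete":
            snapshot.pop(pid, None)
        else:                          # insert or update
            snapshot[pid] = txn["record"]
    return snapshot, skipped

snapshot = {1: {"name": "Smith"}}      # the day-old copy
log = [
    {"patient_id": 2, "op": "insert", "record": {"name": "Jones"}},
    {"patient_id": "bad", "op": "update", "record": {}},  # corrupt entry
]
current, quarantined = replay(snapshot, log)
print(len(current), len(quarantined))
```

The point of the design is that a corrupt transaction never touches the copy; it lands in a quarantine pile instead of propagating to your last good dataset.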
Third, letting the vendor run "tests" on your production system. Nobody, and I mean nobody, should ever get to touch any production system unless they're implementing a specific change that's been tested in an identical environment, passed QA and review by folks who know the system and then only with a published implementation, testing and backout plan. If a system needs "tests", you pull it out of production before you start messing with it.
Finally, their "virtualized team" approach (read: our people are scattered all over the place) is moronic -- you see this sort of thing, and without fail it's the result of political pressures rather than sane management. In this case, I'll bet my hat it was a situation where a bunch of middle managers were allowed to maneuver to keep their fingers in the pie when centralization took place, so instead of having everyone you need on hand and in one group, you're busy setting up conference calls.
Plus, now their solution is to bring in a bunch of consultants. Yeah, that always works. Good luck, guys! You're gonna need it.
Poor VMS. (Score:3, Insightful)
We hardly knew ye.
It happens (Score:4, Insightful)
Seems to me that things otherwise working well is a major accomplishment. They are still on the old system and are entering data back into that system and migrating it into the new one. But it seems things went well otherwise.
Anytime you do a major shift like this, it's hard. The users hate it because they can do their jobs very quickly on the system they are used to, but now they have to learn a new system and slow down.
Things happen.
Re:awesome! (Score:3, Insightful)
Re:Why always centralizing? (Score:3, Insightful)
Obviously not all of this data needs to be centralized, but its existence should be. We don't know to what level the VA was doing this, but I've met a large number of people who work in its IT branch, and they love what they do and are very good at it as well. Sometimes things just go wrong, and sometimes things get pushed out there for bureaucratic reasons, but most of the time the VA is very IT savvy.
Re:Why always centralizing? (Score:3, Insightful)
If that's how you're doing it, you're doing it wrong.
On how many smaller systems can you upgrade your disk controller's firmware without having to reboot or even stop access to the disks? Not a problem on a good SAN system.
And those systems only get economical when your data storage needs get big.
2) Trying to centralize has the ultimate goal of eliminating jobs, but they need those people, since they know all the little details and hiccups their systems have. If people know a project is going to eliminate their jobs, they won't be cooperative. IT not being cooperative is very bad in a world where everything is computerized.
It doesn't always have that ultimate goal, but very often does. And very often, if done correctly, it can achieve that goal.
Take 8 sites with 2 admins each that are only doing 50% duty running that service. (You need at least 2 so someone gets to have an occasional vacation).
That's 16 people, doing the workload of 8.
Bring that down to 1 site, and odds are you could do the exact same job with 8 people (since now there are 7 others to back you up)
And now you're all on one system, so you don't have lots of little variances, so you can be more efficient, etc...
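The staffing math above can be written out as a back-of-the-envelope model. The 50% utilization and minimum-of-two-admins-per-site figures come straight from the comment; the function names and the assumption that pooled staff can be sized to actual workload are my own illustration.

```python
def site_staff(min_admins=2):
    # each independent site needs at least two admins for vacation/sick coverage
    return min_admins

def distributed_headcount(sites):
    # per-site minimums dominate: every site pays for two people
    return sites * site_staff()

def centralized_headcount(sites, utilization=0.5):
    # pooled staff back each other up, so headcount can track the actual
    # workload (total people * utilization) instead of per-site minimums
    return max(site_staff(), round(sites * site_staff() * utilization))

print(distributed_headcount(8))   # 16 people, each at ~50% duty
print(centralized_headcount(8))   # 8 people, fully utilized
```

The caveat, as the rest of the thread points out, is that the model only holds if the consolidated system really does behave like one system rather than eight subtly different ones.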
Yes, we have lost some of the little details by losing those people, but in general you've got other problems if some information is only known by one person.
As it turned out, a lot of that "critical" information got passed along to other folks anyway, most of what was left turned out to be unimportant, and that small remaining percentage?
Well, the rest of us are smart, and the ones with the info weren't idiots... we were able to figure it out.
3) Eventually the same number of people is going to have to work on the centralized system, because you also centralize the problems: more problems bring more people, more people bring more overhead and inefficiency, and more inefficiency brings more people (at least that's the default in today's business world; throwing more people at an IT problem doesn't make it disappear faster).
Starting with bad assumptions.
A small focused skilled team can do pretty much anything. =-)
In fact some would say they're the only ones who do anything.
One example: We used to repeatedly run into situations where we had the same problem at x sites, so we had at least x people trying to solve it. We didn't realize others were duplicating our effort, so there was a lot of wasted effort, with solutions from different angles, so the sites ended up getting more and more out of sync in their setups.
4) More people on a project that was designed to be more cost-efficient means the managers will have to cut expenses. Cutting expenses brings underpaid people, underpaid people bring less or no experience and higher turnover, and higher turnover means more expense-cutting.
Every centralization project I've been on has had its hiccups, but in the end has resulted in reduced costs overall. We always started off with the people we had, plus a contractor or two who were "experts" in the field we were working in, just to make sure we had an outsider's view. We didn't always believe the contractors, but we'd at least use them for everything they were worth. We then "centralized" and kept most of the folks around to keep everything running everywhere... then the layoffs.
The main problem we have from our last centralization is that many on our small team are very shy about sharing issues before they know everything about them. They're afraid of looking bad, because then they won't be as valuable. (Hadn't run into that one before.)
Re:In other words.... (Score:5, Insightful)
If any one hospital or chain of hospitals performed as consistently lousily as the VA has, that hospital would have been sued into oblivion decades ago. Hundreds of thousands of vets who've used the VA's services can attest to that. But we can't necessarily sue the VA, because they're part of the government. Go to any VA hospital in the US. Odds are that after you pass through the pretty facade they've set up, you'll find patient after patient sitting in a wheelchair or bed lined up along some wall, waiting for some overworked, over-stressed, and under-staffed doctor and not getting the care they deserve.
The VA needs to take a lesson from the corporate world and change its face. Rename itself, start fresh. AND START DOING ITS G-D JOB! That's the best dismal chance it's got to make things right. As it is right now, there isn't a Vet in the US or abroad who thinks highly of the VA. And if there is, I'd find 100 who would refute any positive statement made about the VA.
And, yes - I'm a Vet. My Father is a Vet. My Grandfather is a Vet. My Uncle is a Vet. I don't recall them looking forward to communicating with the VA, either.
In closing, if the VA *did* do its job, US Veterans who couldn't re-adjust to civilian life after witnessing the horrors of war wouldn't make up 25% of the homeless!
http://www.cnn.com/2007/US/11/08/homeless.veterans/ [cnn.com]
http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/05/mia-in-plain-sight.html [cnn.com]
Re:awesome! (Score:3, Insightful)
I work at the heart of this... (Score:5, Insightful)
VISTA runs on HP's VMS, and on top of that it runs Caché from InterSystems. (And yes, it costs the taxpayers a lot! But a lot less since we've been centralizing it over the last 3 or 4 years.)
It is a HUGE system.
The centralization we're currently undergoing is massive; this problem was (IMHO) scapegoated onto a poor change control process.
I know what was changed, I know who changed it, and I know when they changed it. However, this 'meltdown' has happened three times... (Not with the same drastic outcome.) It comes down to VMS locking out logons because locks aren't being released properly. (Now you could argue that the reason locks got behind was this change... but I don't think that's the real reason, because of our previous problems.)
It's that simple. Ask the VISTA manager over lunch sometime. They weren't afraid of data corruption. They were afraid that if they moved the systems, the other system would lock up too under the extra user load.
There goes "VISTA". Everyone logged in is fine. Everyone not on... Isn't getting on.
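A toy model of that failure mode (purely illustrative Python, not VMS code, and every number here is invented): sessions take locks, a buggy path forgets to release some, and once the lock table fills up, users who are already in stay in, while nobody new can log on.

```python
class LockTable:
    """Minimal stand-in for a fixed-capacity OS lock table."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.held = 0

    def acquire(self):
        if self.held >= self.capacity:
            raise RuntimeError("lock table full: new logons refused")
        self.held += 1

    def release(self):
        self.held -= 1

table = LockTable(capacity=40)
for session in range(150):
    try:
        table.acquire()
        if session % 3:            # most sessions behave and release...
            table.release()
        # ...but every third session "leaks" its lock and never releases
    except RuntimeError:
        print(f"logon {session} refused; {table.held} leaked locks held")
        break
```

Note that nothing crashes and no data is touched; the system just quietly stops admitting anyone, which matches the "everyone logged in is fine, everyone not on isn't getting on" symptom.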
Now comes the bad part... No procedures!
We take 32 medical centers and throw their IT into a data center. You 'had' clear lines of who owned what and what happened when they went down. Now you centralize all that... Who raises the flag when something bad happens? Is it the site that has the problem? Is it someone who now controls the system at the data center? Who is responsible for what?
Oh wait... OI&T only has a dozen staff... and almost NONE of those people are technical. Everyone's pay was simply moved from one appropriation to another. But what about the IT systems?!?! We moved those too, but didn't hire any permanent staff to take care of them? We just rubber-banded together a bunch of people who work across the whole west coast, handed them a pager, and said good luck?
Suffice it to say, we have some REALLY REALLY hard working people... And some really bad management. (Congress forcing us to do things on a time table is really annoying. Especially since they expect results, but don't expect any documentation... What do you think is going to get skipped?)
Congress: How is that data center move going?
Howard: We've moved 28 sites!
Congress: Good Job!
Howard:
Then again... Howard doesn't even know everything we skip to get things done.
Bah
Like the budgie! (Score:3, Insightful)
First thing I learned in the military: your weapon was made by the lowest bidder.
Re:I work at the heart of this... (Score:1, Insightful)
We do have snapshots (I believe the brand naming is snap clones) as well as tape, as well as 'MANY' mirrors of the data.
There was never a data corruption question. VMS shut out the users. That's it... No VISTA logon because no VMS logon.
The cluster and consolidation were tested and ran for over a year on 8 sites before all 32 were planned and rolled into one. (But if you read the Congress transcripts, you would also know we're currently on hold.) We actually had this lock problem in the pilot rollout. However, the problem was never 'fixed' and still exists in the deployed solution. What causes it is a mystery... It can come and go depending on the lock backlog.
The current plan is to break up the cluster. Not to de-centralize, but to bust this puppy into smaller clusters to limit the number of locks on single files/devices. Not only that... but just running back the journal files, startup scripts, etc., takes WAY too long to be agile!
(The ultimate goal is to go open source... Shhh don't tell anyone we're trying to save you and me money!)
This problem escalated out of control because of several contributing factors. Lack of communication, lack of expectations, etc...
Doctors seem to think you can just flip a switch and move to the other data center. This is NOT the case. Between replaying log files, data checks, etc., it takes well over 4 hours just to 'ramp up' for the move. Then there is changing DNS and ensuring that every one of our 50k+ workstations is using all the proper DNS entries, etc., etc., etc.
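To see why "flip a switch" is wildly optimistic, you can lay the steps out in order. The step names and durations below are invented for illustration (this is not the VA's real runbook); the only thing taken from the comment is that the steps are strictly sequential and the ramp-up alone exceeds 4 hours.

```python
# Hypothetical sequential failover steps, with made-up durations in minutes.
FAILOVER_STEPS = [
    ("replay journal files on standby", 120),
    ("run data integrity checks",        90),
    ("repoint DNS records",              15),
    ("wait out DNS TTLs / client caches", 60),
    ("verify workstation connectivity",  45),
]

def total_failover_minutes(steps):
    # each step depends on the previous one, so durations simply add up;
    # there is no parallelism to exploit
    return sum(minutes for _, minutes in steps)

hours = total_failover_minutes(FAILOVER_STEPS) / 60
print(f"estimated failover: {hours:.1f} hours")
```

The DNS step is especially unforgiving at scale: even after the records are changed, tens of thousands of workstations keep using cached entries until their TTLs expire.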
It's a complex problem, and frankly... I think our VISTA managers are some of the hardest working underpaid Government employees. (Some of them are only GS-9 and yet equally charged with the care of this huge system.)
The VA exploits its talented employees, so it doesn't have to manage things that well. People care, and will MAKE it work.
I get worked up... Because this is SOOOO close to home.