ISP Recovers in 72 Hours After Leveling by Tornado - Slashdot

Follow Slashdot stories on Twitter

×

ISP Recovers in 72 Hours After Leveling by Tornado 258

Posted by CmdrTaco on Thursday September 04, 2003 @02:13PM from the now-that-is-what-i-call-disaster-recovery dept.

aldheorte writes "Amazing story of how an ISP in Jackson, TN, whose main facility was completely leveled by a tornado, recovered in 72 hours. The story is a great recounting of how they executed their disaster recovery plan, what they found they had left out of that plan, data recovery from destroyed hard drives, and perhaps the best argument ever for offsite backups. (Not affiliated with the ISP in question)"

This discussion has been archived. No new comments can be posted.

ISP Recovers in 72 Hours After Leveling by Tornado

Search 258 Comments Log In/Create an Account

Comments Filter:

Compare and contrast... (Score:5, Interesting)

by ptomblin ( 1378 ) writes: <ptomblin@xcski.com> on Thursday September 04, 2003 @02:23PM (#6871169) Homepage Journal

A couple of friends of mine were badly burned because the web hosting company they were using lost all their data (customer and their own) in one humungous crash, and didn't have any backups. They didn't even have a spare copy of their customer database, so they couldn't even contact their customers to tell them what was going on. Nor could they tell what customers they had and how much service they'd paid for, etc.

Share
twitter facebook
Re:And then gets slashdotted (Score:4, Interesting)

by cindik ( 650476 ) writes: <solidusfullstop@ ... m ['dik' in gap]> on Thursday September 04, 2003 @02:37PM (#6871316) Homepage Journal

That's actually interesting - how many sites have contingency plans for the /. effect? How many businesses? It's not just /., but just about any media can refer people to a real business site. For small companies, this could bring them down for some time. Imagine the "Bruce Almighty" effect, only with some business with a small-to-medium capacity connection, bombarded just because someone used http://www.slashdotme.com/ or spam@.me.into.oblivion.org in their movie. The fact that so many sites are taken down by the /. effect causes me to believe that few sites and those who run them are truly prepared.

Parent Share
twitter facebook
Re:Amazing is an innapropriate adjective (Score:4, Interesting)

by Mr Krinkle ( 112489 ) writes: on Thursday September 04, 2003 @02:47PM (#6871404) Homepage

Wrong on SOOOOOO many levels.
Let me start with this line:
"I realize that slashdot is mostly populated by high-school educated "IT people", who give a shit about logs and backups"
You claim to be a programmer, I have been a programmer and am now a Sys Admin, as both the BEST way to troubleshoot was from the logs. Unless you are the supreme programmer whose code never needs debugging and whose users never mispunch something causing an error a log file will let you see and know what has happened.

Now for this line:
"and restoring backup tapes is exhillirating and exciting."
I have restored from tape backup. We had a "programmer" BS from Virginia Tec, Masters from UMass who was certain he knew exactly what he was doing when he blew away an entire production database. (Actually he was a really good guy who just made a simple mistake) Fortunately we had tapes to restore from. But if ANYONE thinks that a restore is "exhillirating" (yes I left your type/mistake in there) then they are just strange. That was one of the most tedious and boring things I have had to do. But we had been tedious in backing EVERYTHING up so production was not severely impacted.

Now for where you directly insult everyone:
"I fully expect the PHBs and army of cable monkeys to get the network up and running in our new location."
So as a systems admin do I become a cable monkey? or am I a PHB? Either way I would be VERY needed if a disaster strikes just as I am needed every day. As for the elitist attitude and your lack of knowledge and concern for the backend of systems I am glad you do not work anywhere near me as I hate IT personal that have to call me to run windows update on their system when the latest worm comes around or to show them how to NOT clik ignore when Norton tells them they have a virus.
In short, Please show some respect for your coworkers and realize that these guys were prepared and did what their plan stated they could do.
If not don't be alarmed if somehow your account gets disabled and everything blown away and surprisingly they won't have backups, cause you "just don't care" for them.

Parent Share
twitter facebook
Re:Truly stunning (Score:5, Interesting)

by HardCase ( 14757 ) writes: on Thursday September 04, 2003 @02:57PM (#6871495)

No, what amazes me is that this is news. The IT industry is so full of idiots and morons and MCSEs that taking basic precautions earns you a six-figure salary and news coverage. These folks didn't even have off-site backups, it was luck that they were able to resume business operations (ie: billing) so soon.

I agree, although maybe not so vehemently. For the IT managers who need a clue, the article is evidence that a sound disaster recovery plan works. Obviously, in the case of the ISP, the plan wasn't completely sound, but the other, possibly more important, point of the article is that the ISP's management recognized that their recovery plan was incomplete. Based on the lessons they learned, they made changes.

I work for a large (~20,000 employees) company, with about 10,000 employees at one site. The IT department (actually the entire company as well) has a disaster recovery plan in place. But beyond having a plan, we also have drills. As an example, we are in the flight path of the local airport (possibly not the best place in the world for a manufacturing site). What happens if a plane crashes smack in the middle of the plant? Hopefully we'll never know for sure, but the drills that we've run showed strong and weak points of the disaster plan. The strong points were emphasized, the weak points were revised and the disaster plan continues as a work in progress.

Specifics aside, and maybe this is just stating the obvious, but considering a disaster recovery plan to be a continuously evolving procedure could be one of its strongest points.

-h-

Parent Share
twitter facebook
Off topic, I know..... (Score:2, Interesting)

by BlabberMouth ( 672282 ) writes: on Thursday September 04, 2003 @03:00PM (#6871527)

but isn't the new moderation system leading to the first few good posts on any topic all getting modded up to 5 while the rest get ignored?

Share
twitter facebook
Re:Fire... (Score:3, Interesting)

by Zathrus ( 232140 ) writes: on Thursday September 04, 2003 @03:06PM (#6871589) Homepage

If a vault is destroyed in the same disaster, then there are probably more important things to worry about.

If the vault is destroyed, then you're probably right. But it doesn't take that to render the data unusable -- if the bank gets hit, the vault may survive but the keys may be destroyed (yeah, I'm sure they can get more made or have a locksmith come in, but that will take time). Or the vault is inaccessible for some amount of time due to damage. Even if the data is good, having it unavailable does you no good at all.

The data backup services are good, as is just going a bit further afield for a safe deposit box or other repository. As you say, if the data is important you do what it takes.

Parent Share
twitter facebook
I live in Jackson.... (Score:5, Interesting)

by Daniel Wood ( 531906 ) * writes: on Thursday September 04, 2003 @03:14PM (#6871686) Homepage Journal

I am also a former Aeneas customer.
Unless Aeneas has made some major changes they are quite certainly the worst ISP I have ever worked with. Aeneas has contracts with the Jackson-Madison County School System to provide internet service district wide. The quality of such service is, bar none, the worst I have experienced.
I did some volunteer work at a local Elementary school helping teachers work out any lingering computing problems they had(Virii, printer drivers, misconfigured ip settings, file transfer to a new computer, etc). The internet service I experienced while I was there lead me to believe I was on a 128k ISDN line. Not until I went to the server room did I realize that I was, infact, on a T1. Now this is during the middle of summer, mabye four other persons were in the building, three of which were in the same room as myself. The service was also intermittent, having several dead periods while I was working. Needless to say, I remained unimpressed by said experience.

When I was an Aeneas dialup customer, in 1998, the service provided by Aeneas was also subpar. The dialup speeds were averaging 21.6kbps, where as when I switched to U.S. Internet(now owned by Earthlink) my dialup speeds were always above 26.4kbps(Except on Mother's Day). There were frequent disconnections, and they had a limit of 150hrs/month.

I'm not supprised how easy it is to restore subpar service. All they had to do was tie together the strings that are their backbone.

Share
twitter facebook
Re:make sure off-site is far enough away (Score:4, Interesting)

by hawkbug ( 94280 ) writes: <psxNO@SPAMfimble.com> on Thursday September 04, 2003 @03:14PM (#6871689) Homepage

Exactly, that statement is very true - I had a buddy who worked for a company there in tower 2. He worked offsite in Iowa, and one day couldn't vpn in to continue his programming. Turned on the news, and you know the rest. The problem was, he had all his java source on their servers. Sure, they backed it up daily and had an offsite backup in the other tower... The bad news was he lost all his work, and a lot of coworkers. The good news is that the company survived, and simply contracted him on for another 2 years to complete the project. He had to start from scratch, but gets paid more as a result. I'm sure insurance covered the companies losses.

Parent Share
twitter facebook
Re:But... (Score:5, Interesting)

by BMonger ( 68213 ) writes: on Thursday September 04, 2003 @03:19PM (#6871751)

Hmmm... what if a website admin did become a subscriber. Could they theoretically take the RSS feed to know when a new post was made, pull the article text, scan it for their domain and if their domain was linked to just have a script auto-block referers from slashdot for like 24 hours or so? Somebody less lazy than me might look into that. Then you could sell it for like $100! It'd be like paying the mob not to beat you up! But only if somebody affiliated with slashdot wrote it I guess.

Parent Share
twitter facebook
Re:Amazing is an innapropriate adjective (Score:3, Interesting)

by wuice ( 71668 ) writes: on Thursday September 04, 2003 @03:34PM (#6871943) Homepage

Yep, thats the way it works. I dont crawl around on the floor plugging shit in and getting dirty.

...

They're just added beurocracy for the computer world, and I work to replace them each and every day with more sophisticated self-administrating softwares.

If you don't know how to crawl around on the floor plugging shit in and getting dirty, you do not have the perspective necessary to write software to replace the people who do. The best programmers are not arrogantly disconnected from the people in the trenches, especially if they're working on software directed towards their field. A good programmer needs at least to know what people commonly need support about in order to address it in future software. If your CTO is as out of touch and disconnected as you, I pity your fellow employees.

You're also a poor team player, which is a liability to you and your career unless you work solo. You're also incredibly stuck up and elitist, which unfortunately probably actually helps your career. You're also way off base: you obviously consider yourself "above" the type of people who enjoyed this article, and your comments have been way more of an advertisment of yourself than anything to do with the issue. Why don't you drop out of this conversation and let the high school kids who spend all day plugging shit in enjoy it. Believe it or not, there are a lot more nerds in high schools than in high-paying programming positions. That being the case, this site should have more stories about them than you.

Parent Share
twitter facebook
72 Hours is a little long.... (Score:3, Interesting)

by fuqqer ( 545069 ) writes: on Thursday September 04, 2003 @03:37PM (#6871977) Homepage

72 hours seems way too long to be out of business. That's 3 days of money that the ISP is not pulling in dough. Unless the whole internet is crippled, I'd ditch an ISP that was out for three days. One of the main selling points for ISP is connectivity rain, snow, shine, OR rabid squirrels...

The company (ISP/consulting/services hosting) I used to work for had a DR plan to be executed in 24 hours with 75% functionality. Offsite servers and backups of course...

More impressive to me is the World Trade Center folks like American Express and other companies that had DR plans situated across the river. A lot of datacenters and information services were functional again within 18-24 hours. That's PPP PPP (prior planning prevents piss-poor performance).

I write good sigs on my bathroom wall...but this is not a real sig.

Share
twitter facebook
How about (Score:3, Interesting)

by phorm ( 591458 ) writes: on Thursday September 04, 2003 @03:46PM (#6872072) Journal

1) Implement good disaster-recovery plan
2) ??? (aka mad-scramble to initiate plan)
3) Profit (or at least don't go under)

This must have been a pretty in depth recovery plan though. I mean, even with backups and a redundant connection elsewhere... I think that for myself processing the fact that my office had just been bowled over by wind-on-steroids would faze me for a little while (office...tornado...holy...shit...must...recover.. .data)

Now they're up and running, but what of their old office? It must be very interesting to have to deal with the stage of "step over rubble, salvage what we can" and the general amazement at nature's fury.

I'm in the process of configuring several of my servers to offload to a remote master. If the town gets levelled we're toast, but if an individual location bites it, then at least critical data (accounting records, home dirs, etc) is saved. This will still be a big bite out of the business.

Does insurance cover natural disasters such as tornado, would be a big question? A lot of insurance companies don't cover "act of god", etc

Parent Share
twitter facebook
tape backups? (Score:3, Interesting)

by Musashi Miyamoto ( 662091 ) writes: on Thursday September 04, 2003 @04:03PM (#6872286)

From the article, it looks as if the only thing they had to restore from tape/disk was their customer database, so that they could send out the next month's bills. So, the 72 hours was basically putting in new hardware and turning it on. They probably lost all their user's web sites and other "expendible" data.

How about talking about disaster recovery for a REAL company with tens to hundreds of terabytes of data sitting on disk? The kind of data that you cannot lose and must have back on-line asap?

This article is like congratulating them for putting up detour signs when a road is destroyed, or rerouting power when a power line goes down.

Just about everything that was destroyed was not-unique, manufactured items that could be recreated and repurchased. The only exception was the user data, which was pulled off of a nearly destroyed drive by a data recovery company. (Lucky for them!)

I would like to hear more about companies that lose tons of difficult to replace, unique items, such as TBs of user data, prototype designs, business records, etc.

I would bet that if a company were to permenantly lose these types of things, they would nearly go out of business.

Share
twitter facebook
Been there, done that, Northridge Quake (Score:5, Interesting)

by Tsu Dho Nimh ( 663417 ) writes: <abacaxi.hotmail@com> on Thursday September 04, 2003 @04:06PM (#6872320)
I was playing minute-person at a "disaster recovery" meeting (the first one) where high-level suits were figuring out what to do in case of a disaster at their multi-state bank. Their core assumptions were initially as follows:
- They would all survive whatever it was. (I was looking out the window, and seeing jetliners coming in for a landing ... a few feet too low and the meeting would have been over).
- All critical equipment would survive in repairable condition.
- Public services would not be affected over a wide area or for a long time.
- Critical personnel would be available as needed, as would the transportation to get them there.
- The disaster plan only needed to be distributed to managers, who would instruct people what to do to recover.
That was on a Monday. The next Monday was the Northridge quake.
- One critical person woke up with his armoir on top of him, and a 40-foot chasm between him and the freeway.
- One of their buildings was so badly damaged that they were banned from entering ... and there was mission-critical info on those desktop PCs. Had it not been a holiday, the casualty toll would have been horrendous.
- The building with their backups was on the same power grid as the one with no power and the generators could only power the computers, not the AC they also needed.
- None of the buildings had food or water for the staff who had to sleep over, nor did they have working toilets or even cots to nap on.
- One of the local competitors was back in business Tuesday morning, because their disaster plan worked. They rolled up the trailers, swapped some cables and were going again.
They came into the next meeting a couple of weeks after the quake with a whole new perspective on disaster planning and training:
- Anyone who survives knows what the disaster plan is and copies of it are all over the place.
- Critical equipment is redundant and "offsite" backups are out of the quake zone.
- They have generators and fuel enough to last a couple of weeks for the critical equipment and it's support, survival supplies for the critical staff. This is rotated regularly to keep it form going stale.
- They cross-trained like mad.
- They started testing the plan regularly.
Share
twitter facebook
Re:My ISP's disaster recovery plan (Score:3, Interesting)

by trybywrench ( 584843 ) writes: on Thursday September 04, 2003 @04:15PM (#6872462)

My friend works for what was UUNET in Richardson TX. His datacenter is on two seperate power feeds and has two or three massive generators with 30 days of fuel. When I asked him why 30 days he said that if the datacenter doesn't have power for >30 days then society is crumbling and Internet access/web sites are pretty low on the overall priority list.

Parent Share
twitter facebook
Re:Been there, done that, Northridge Quake (Score:4, Interesting)

by Zachary Kessin ( 1372 ) writes: <zkessin@gmail.com> on Thursday September 04, 2003 @04:22PM (#6872568) Homepage Journal

Well a solid disaster plan would (if you are big enough to afford it) have a second location far away. If you had a location in California and a second say in Boston you would be ok. Ofcourse that costs a lot of money and many small to mid sized firms could not afford it in the first place.

But one thing with disaster recovery is you need to figure out what is and is not a disaster you should worry about. I live in Jerusalem, terorism is something very real here but mostly hits soft targets. On the other hand major blizards are a non issue. In Boston we worried about Nor'easters and occationaly a huracane. If you live in Utica NY you probalby don't have to worry to much about terrorism. Fire can happen anywhere.

I don't know how you figure out what is or is not a probable event in your location. I suppose you talk to the insurance folks they have spent a lot of time figuring this out.

The other question is how much recovory can you afford? If your disaster recovory plans puts your company into chapter 11 it was not a very good plan.

I like saying "Utica"

Parent Share
twitter facebook
Re:Nice work! (Score:3, Interesting)

by Mooncaller ( 669824 ) writes: on Thursday September 04, 2003 @04:24PM (#6872587)

Right on! I have been involved with the design of at least 5 disaster recovery plans. The first one was while I was in the Air Force. I guess I was pretty luck, learning at the age of 19, that preparing for successfull disaster recovery is a continous process. The main output of a disaster recovery development process, are those binders. I guess thats why so many people confuse disaster recovery with the binders. But they are only the result of a process; just like a piece of software is created as part of a process. And like software, the binders need to be tested and reviewed regularly.
Some people learn by reading.
Some people learn by observing.
And some people have to piss on the electric fence for themselves.

Parent Share
twitter facebook
Re:Poor tech support (Score:2, Interesting)

by dswensen ( 252552 ) writes: on Thursday September 04, 2003 @04:43PM (#6872833) Homepage

I'm sorry to hear you've run into such rude tech support. Around here, we're polite enough up until the customer starts copping a serious attitude.

That said, we get dozens of calls a day accusing us (not asking politely, as characterized by your post) of having downtime when, in fact, the problem is on the client's side. I have outright been called a liar when I say our ISP is not "down." When there actually is an outage (which is rare, but happens), it's much worse.

Then, we are "always" down, and have had "dozens" of outages in the past week, and etc. etc. (usually this is a customer running Win 95 with an antiquated HSP modem who lives in the sticks and has a 400-foot phone cord going from his computer to the phone jack in the barn, but... nevermind. We're "always down.")

So yes, when you have hundreds of callers a day telling you you are ruining their business and costing them thousands of dollars and raping their grandchildren, patience sometimes runs a little thin. Because so many customers open with "are you guys down AGAIN?" rather than describing their problem, sometimes techs can get a little terse.

Nonetheless, if any of the techs here spoke to a customer the way you're characterizing it, he would pretty much be fired on the spot.

Parent Share
twitter facebook
Re:72 Hours is a little long.... (Score:3, Interesting)

by payslee ( 123537 ) writes: <payslee&yahoo,com> on Thursday September 04, 2003 @05:44PM (#6873554)

My Dad worked in the IT department at one of those banks, across the street from the WTC. I found it interesting that according to him, the year-2000 bug scare turned out to big a big help when the real disaster struck. Of course, their systems were orders of magnitude more complex than this ISP's, but then they, had that much more redundancy built in to everything.
Prior to 2000, they built an entirely new system and ran it in parallel with the current one, for six months. Every transaction went through both systems with the results compared to ensure compliance. They had run so many data recovery scenarios that even having to abandon their headquarters did not mean that service was interrupted for more than minute amount of time.

So the article has a good point when it says you may not know what disaster will hit, but a good plan has flexibility built in. Total system failure can happen in oh so many ways, these days.

Parent Share
twitter facebook
Young! (Score:2, Interesting)

by holzp ( 87423 ) writes: on Thursday September 04, 2003 @06:08PM (#6873814)

did anybody else notice these lines:

Meanwhile, Aeneas CIO and Operations Manager Josh Hart..

'It doesn't even look like there was an office here,'" remembers Hart, 25.

Aeneas launched its contingency plan when it was founded in 1996; since then, CIO Hart has enhanced the strategy gradually almost every year.

Seems to have gone unnoticed that this guy founded the company at 18...before the dot com boom!

Share
twitter facebook
Re:My ISP's disaster recovery plan (Score:4, Interesting)

by afidel ( 530433 ) writes: on Thursday September 04, 2003 @06:30PM (#6874009)

30 days may be a bit much but as I found out one day 48 hours comes close to being too little in some situations. We had a massive generator capable of running most of our 4 story suburban office building for a couple days including the datacenter, AC for the datacenter, lights, and desktops. It would not run AC for the rest of the building or the elevator. At the ~35% load we placed on it and its 500 gallon tank the engineer from Catapilar said it should run for around 48 hours. Well we called our fuel supplier to get some offroad diesel delivered the next morning, no can do, they no longer stock it!?!? WHAT! Then we tried every other listed company in the area, none of them could get to us the next day with fuel. We ended up getting a fuel company out to deliver 300 gallons from Detroit to our offices in Akron, Ohio paying a $500 delivery charge and 70 cents a mile. After that we made sure to get a contract with a fuel company that guarenteed 24 hour delivery of offroad diesel =)

Parent Share
twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Related Links Top of the: day, week, month.

390 comments32-Hour Workweek for America Proposed by Senator Bernie Sanders
358 commentsWhat Should Happen to Empty Downtown Office Spaces?
340 commentsHacktivism Erupts In Response To Hamas-Israel War
324 comments'Feedback' Is Now Too Harsh. The New Word is 'Feedforward'
248 commentsWorkers are Resisting Calls to Return to Offices

Understanding is always the understanding of a smaller problem in relation to a bigger problem. -- P.D. Ouspensky