America Online IT

AOL Creates Fully Automated Data Center 123

Posted by Unknown Lamer
from the tomorrow-system-architect-automates-himself dept.
miller60 writes with an excerpt from a Data Center Knowledge article: "AOL has begun operations at a new data center that will be completely unmanned, with all monitoring and management being handled remotely. The new 'lights out' facility is part of a broader updating of AOL infrastructure that leverages virtualization and modular design to quickly deploy and manage server capacity. 'These changes have not been easy,' AOL's Mike Manos writes in a blog post about the new facility. 'It's always culturally tough being open to fundamentally changing business as usual.'" Mike Manos's weblog post provides a look into AOL's internal infrastructure. It's easy to forget that AOL had to tackle scaling to tens of thousands of servers over a decade before the term Cloud was even coined.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward

    How long will it take for an engineer to get there to replace a card or server?

Seems like it may take time for anyone to come to the site for anything, vs. having a few people on site to get to stuff quicker.

      • by EdIII (1114411) on Tuesday October 11, 2011 @08:34PM (#37685646)

        The whole idea is not to need to get to stuff quicker at all.

        If you are:

        1) Completely virtualized.
2) Use power circuits that are monitored for load, on battery backup, with power conditioners and diesel generators for local utility backup.
        3) Use management devices to control all your bare metal as if you are standing there, complete with USB connected storage per device that you can swap out the iso for.
        4) Have redundancy in your virtualization setup that allows you to have high availability, live migration, automated backups, etc.

        What you get is an infrastructure that allows you to route around failures and schedule hardware swap outs on your own timetable, which can be far more economical.

        If you don't have that then it does involve costly emergency response at 2am to replace a bare metal server that went down. You either pay somebody you have retained locally to do it, or you are the one driving down to the datacenter at 2am to do the replacement yourself with who-the-heck-knows how long it will take with uptime monitoring solutions sending out emails like crazy to the rest of the admin staff, and heavens help you, some execs that demanded to be in the loop from now on due to an "incident".

Don't know about you..... but I would rather be able to relax at 10pm and have a few beers once in a while (to the point I can't drive) without worrying about bare metal servers going down all the time, or who is on call, etc.
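The "route around failures" idea from that list boils down to something like the following sketch. All the names here are hypothetical, and a real setup would drive live migration through the hypervisor's API rather than shuffling Python lists:

```python
# Rough sketch of "route around failures": when a host fails its health
# check, evacuate its VMs onto healthy hosts and queue the hardware swap
# for a convenient time instead of a 2am emergency.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    healthy: bool
    vms: list = field(default_factory=list)

def evacuate_failed(hosts):
    """Move VMs off unhealthy hosts; return the hosts to swap out later."""
    healthy = [h for h in hosts if h.healthy]
    if not healthy:
        raise RuntimeError("no healthy hosts left to migrate to")
    swap_list = []
    for h in hosts:
        if not h.healthy:
            healthy[0].vms.extend(h.vms)  # naive placement: first healthy host
            h.vms = []
            swap_list.append(h.name)      # replace on your own timetable
    return swap_list
```

A real scheduler would balance the evacuated VMs across hosts by capacity instead of dumping them on the first healthy one; the point is only that nothing in this loop requires anyone to be in the building.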

    • by errandum (2014454)

About as much time as it takes in most datacenters that are already monitored remotely. With news like this, some would think Nagios or Ganglia did not provide the admins with a web interface.

PS: They might want to at least man it with a security guard to sound the alarm in case of fire or robbery.

      • by Martin Blank (154261) on Tuesday October 11, 2011 @07:04PM (#37685020) Journal

        One of the major backbone providers has a lights-out data center not far from my work. I know a guy who has a hosting business there, and he's shown me around to the limits of his access. There is no one on-site from the company or its contractors--not even a security guard. They have biometrics plus PINs for access; it's laced with low-light/IR cameras (it wouldn't surprise me to learn they have microphones); it has motion detectors in case the cameras miss something; and the redundancy is incredible. They maintain contracts with local electricians, plumbers, and a few technical companies should a blade burn out. They manage the entire thing from a few states over, and as of a couple of years ago almost all of their data centers had been converted to run this way. Savings were good, something like a million dollars per DC per year even as unanticipated downtime decreased.

        I looked at it and saw the future of IT. I wasn't sure if I was more impressed or scared.

        • by mikael (484)

          It's more scary - every field of technology evolves that way.

Early valve computers used to require technicians to replace burnt-out valves on a daily basis. Each morning, the technicians would go round and replace any that had burnt out or were about to. Now your PC has about 2 billion transistors or more (CPU + GPU), and not one will burn out.

          100 years ago, it would take 25 minutes to make a long-distance call between San Francisco and New York due to all the operators involved. Now, it'

          • This isn't scary. This is things getting better.
            • by tehcyder (746570)

              This isn't scary. This is things getting better.

              It's scary if your job is manually maintaining servers.

              • by Grave (8234)

I'm not so sure. While individual reliability has increased dramatically, the sheer number of systems in use around the world has increased as well, probably at a similar rate. Will we eventually reach a point at which computer hardware simply does not fail without an external event (power surge, physical damage, etc.)? Maybe. But I don't see that happening until performance plateaus.

                • by mikael (484)

Reliability will become nearly 100% if everything moves to solid state. How many electric motors are there in a laptop these days? Cooling fans, hard disk drives, CD drives (auto-eject, play motor) - must be around five or six.

        • by X0563511 (793323)

          Those must be some fancy microphones to be of any use inside a DC...

          • It depends on the noise level of the DC. Where I work, microphones would be useless, but some of the computer rooms in other buildings are relatively quiet and we've used microphones on NetBotz devices when people have been in the room and we're monitoring what they're talking about while working. (It has sometimes saved a phone call when a configuration looked odd momentarily but they were doing it for a reason.)

        • You know, I've seen this idea floated time and time again, and I still have to say that I think this is all smoke and mirrors, at least from an infrastructure perspective. My opinion is that you will never see an industry-wide move toward Data Centers without onsite staffing simply because the risk is far too great. Even in the example you cite above, there is huge liability and potential for disaster. Biometric security devices can fail or be defeated, and all I can say about infrared cameras, mic
      • by kmoser (1469707)
        One word: RoboCop.
    • by PTBarnum (233319)

      The article states "failed equipment is addressed in a scheduled way using outsourced or vendor partners". They don't care if an individual server is down, they just move the workload elsewhere, and wait for a repair. So there actually will be people in their data center doing repairs, they just aren't AOL employees and aren't based in the data center. I could see making a decision that a longer wait time for repairs is justified by labor savings, but it isn't really obvious where those savings come from

    • by arbiter1 (1204146)
      What it sounds like everything hosted there will be a cloud type system, so if 1 machine dies you won't even notice.
      • by X0563511 (793323)

        No, but if a whole cabinet or row goes out because someone wasn't around to notice the funny smell or magic smoke coming out of the power equipment, or hear that ACU fan belt starting to come loose, you just might notice...

    • by Zocalo (252965) on Tuesday October 11, 2011 @07:29PM (#37685228) Homepage
Who cares? I'm guessing you don't have much experience of server clusters, but generally, long before you get to the kind of scale we are talking about here, you start treating servers the same way you might treat HDDs in a RAID array. When one fails, other servers in the cluster pick up the slack until you can either repair the broken unit or simply remote-install the appropriate image onto a standby server and bring that up until an engineer physically goes to site. Handling of the data is somewhat critical though; should a server die, you ideally need to be able to resume what it was working on seamlessly and without causing any data corruption; think transaction-based DB queries and timeout/retry.

If you have enough spare servers, you can easily get by with engineers only needing to go on site once a month or so, assuming you get your MTBF calculations right, that is. There's a good white paper [google.com] by Google on how 200,000-hour MTBF hard drive failure rates equate to drive failures every few hours when you have a few hundred thousand HDDs.
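The arithmetic behind that claim is worth spelling out; with many independent drives, the fleet's aggregate failure rate is simply the drive count divided by the per-drive MTBF (a sketch, using the numbers from the comment above):

```python
# Back-of-the-envelope fleet failure math: N independent drives, each
# rated at M hours MTBF, give the fleet a failure roughly every M / N hours.

def hours_between_failures(num_drives: int, mtbf_hours: float) -> float:
    """Expected hours between failures across the whole fleet."""
    return mtbf_hours / num_drives

# 100,000 drives at 200,000 hours MTBF: a failure about every 2 hours.
print(hours_between_failures(100_000, 200_000))
```

Which is exactly why "wait for the monthly site visit" only works if the cluster is sized to absorb a steady trickle of dead hardware in the meantime.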
    • by Grishnakh (216268)

      How long will it take for an engineer to get there to replace a card or server?

      Much less time than it'll take them to get a user.

      Honestly, I was surprised by this article; I thought AOL had already folded.

    • by Chapter80 (926879)

      ...over a decade before the term Cloud was even coined.

      You mean over a decade before you heard the term?

C'mon! HP was using the term Cloud five years before "America Online" existed in 1991.

      Just because your expertise doesn't extend back before you got that first AOL floppy and went online to type "a/s/l?", it doesn't mean it didn't happen.

  • by Anonymous Coward

    Is now hands-off?

  • by Anonymous Coward

    So they have a fully automated unmanned data center... For their fully unused unpopulated services?

    WIN!

  • Wow .. how '2000'ish (Score:4, Informative)

    by johnlcallaway (165670) on Tuesday October 11, 2011 @06:27PM (#37684610)
Wow ... we were doing this 10 years ago, before virtual systems were commonplace and 'computers on a card' were just coming out. The data center was 90 miles away. All monitoring and managing was done remotely. The only time we ever went to the physical data center was if a physical piece of hardware had to be swapped out. Multiple IP addresses were configured per server so any single server on one tier could act as a failover for another one on the same tier. We used firewalls to automate failovers; hardware failures were too infrequent to spend money on other methods. We could rebuild Sun servers in 10 minutes from saved images. All software updates were scripted and automated. A separate maintenance network was maintained. Logins were not allowed except on the maintenance network, and all ports were shut down except for ssh. A remote serial interface provided hard-console access to each machine if the networks to a system weren't available.
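The multiple-IP failover described above can be sketched roughly as a "floating service IP" that a standby claims when the primary stops answering health checks. The poster used firewall rules for the switchover; the commands and names below are illustrative guesses at the more common floating-IP variant, generated as a dry run rather than executed:

```python
# Dry-run sketch of floating-IP failover: if the primary stops answering
# health checks, a standby claims the service IP and announces it via
# gratuitous ARP. Interface and IP names are made up for illustration.

def takeover_commands(service_ip: str, iface: str) -> list[str]:
    """Commands a standby would run to claim the service IP."""
    return [
        f"ip addr add {service_ip}/24 dev {iface}",  # claim the address
        f"arping -U -I {iface} {service_ip}",        # refresh neighbors' ARP caches
    ]

def plan_failover(primary_alive: bool, service_ip: str, iface: str) -> list[str]:
    """Return the takeover commands, or nothing if the primary is healthy."""
    return [] if primary_alive else takeover_commands(service_ip, iface)
```

In practice you would require several consecutive failed health checks before triggering a takeover, to avoid flapping on a transient network blip.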

    Yawn ......
    • by johnlcallaway (165670) on Tuesday October 11, 2011 @06:32PM (#37684672)
      Thanks for not pointing to the actual blog in the original article. So what they are really blogging is their ability to move an entire DATA CENTER without having to send people to do it. Other than .. you know .. install the hardware to start with.

      Never mind........
    • by rubycodez (864176)
      virtual systems were commonplace in the 1960s. But finally these bus-oriented microcomputers, and PC wintel type "servers" have gotten into it. Young 'uns.......
      • by ebunga (95613)

        Eh, machines of that era required constant manual supervision, and uptime was measured in hours, not months or years. That doesn't negate the fact that many new tech fads are poor reimplementations of technology that died for very good reasons.

        • by timeOday (582209)
          And other new tech fads are good reimplementations of ideas that didn't pan out in the past but are now feasible due to advances in technology. You really can't generalize without looking at specifics - "somebody tried that a long time ago and it wasn't worth it" doesn't necessarily prove anything.
          • by rednip (186217)

            "somebody tried that a long time ago and it wasn't worth it" doesn't necessarily prove anything.

            Unless there is some change in technology or technique, past failures are a good indicator of continued inability.

            • by timeOday (582209)
              The tradeoff between centralized and decentralized computing is a perfect example of a situation where the technology is constantly evolving at a rapid pace. Whether it's better to have a mainframe, a cluster, a distributed cluster (cloud), or fully decentralized (peer-to-peer) varies from application to application and from year-to-year. None of those options can be ruled in or out by making generalizations from the year 2000, let alone the 1960's.
            • by Geminii (954348)
              Of course, in the IT industry, 'some change in technology' comes along every week.
        • by rubycodez (864176)
Depends what model you bought; the redundant, fault-tolerant systems stayed up while components were replaced.
        • by dwreid (966865)
          Actually, while 60s era mainframes did require significant maintenance by the time the late 70s came around up-time was much better. I still have a late 70s mini-computer that I keep around for laughs that routinely gets about a year and a half between reboots, running 11 users and multi-tasking for each user. As for features that come and go, the IBM 7030 had instruction pipe-lining and look-ahead (what Intel calls hyper-threading) way back in the 60s. In fact it could have as many as 11 instructions in th
        • by afabbro (33948)

          Eh, machines of that era required constant manual supervision, and uptime was measured in hours, not months or years.

          I'm not sure what datacenter you were working in, but in general that is quite untrue.

    • by mikael (484)

Telephone exchanges in rural areas are like that. The only time a technician has to enter the premises is to clear out old equipment. There is enough spare capacity in the exchanges that the only work required is to open the local cabinets on the street and pair up a new telephone line.

  • Seriously. AOL keeps my relative's PC experience safe; which, generally, keeps them from bugging me for help. :-)

  • Who? (Score:4, Insightful)

    by Jailbrekr (73837) <jailbrekr@digitaladdiction.net> on Tuesday October 11, 2011 @06:27PM (#37684624) Homepage

Seriously though, most telcomm operations operate like this. Their switching centers are all fully automated and unmanned, and usually in the basement of some nondescript building. This is nothing new.

    • by rickb928 (945187)

Um, I wouldn't be comfortable with my telcomm's switching centers in basements. These are most commonly the first room to flood when the water comes, and telcomm switches are everywhere their users are.

      I see telcomm switches housed above ground, in plain, sometimes unmarked buildings. There's one a quarter mile from my house, and I drive by two others to go to work. If they have basements, I bet that's where they keep stuff that doesn't matter as much.

      And the huge switch that used to work in my old hometown,

      • by h4rr4r (612664)

        The building I am in hosts one such setup in the basement. It never floods at my location.

      • by dwreid (966865)
        Actually that's not true. Equipment of this type was and is routinely stored in basements as well as entire buildings. I know because I've worked on them for years.
  • .. but there last geek quite, so now the data center must fend for itself.

    • by Anonymous Coward

      Spelling. You fail it.

      • by haus (129916)

        You do realize that this story is about AOL, correct spelling would simply be out of plase.

  • What (Score:4, Funny)

    by Dunbal (464142) * on Tuesday October 11, 2011 @06:28PM (#37684634)
    AOL still exists? Wow. Yeah ok I guess this is the result of years of beancounter thinking - the expensive part of running the service and the reason they were losing money was the IT staff, huh? Glad I closed my CompuServe account before giving these guys any money.
    • by jgotts (2785)

Instead of $15/hour techs working for AOL doing regular maintenance, they've switched to outside contractors billing at $100-200/hr when the shit hits the fan. I don't think this idea is going to work very well.

      • by Synerg1y (2169962)

The contractors warranty their work :) Sometimes that makes all the difference; the $15/h tech is usually just miserable.

      • Re:What (Score:5, Informative)

        by billcopc (196330) <vrillco@yahoo.com> on Tuesday October 11, 2011 @07:03PM (#37685012) Homepage

        How often does shit hit the fan in that sort of environment ?

        As a hybrid techie who does a lot of hardware work, I would much rather go in once a month, fix a batch of issues in one visit, collect my fat cheque and go back to the pub, than spend 40+ hours a week playing Bejeweled, waiting for stuff to break.

        I would expect AOL's strategy to greatly reduce costs, because that $15/hr rack monkey costs a lot more than $15/hr in the end. They have benefits, you have to "manage" them, they need human comforts like bathrooms, cleaning, seating, heating/air, lunch room. From an efficiency standpoint, the contractor route is more efficient in both money and time.

        • by hedwards (940851)

Depends, how confident are you that every eventuality has been planned for and provided for by the system? A significant outage can easily eat up an entire year's worth of $15-an-hour salaries if you hit an unforeseen condition which causes the whole data center to go down. Sure, it's unlikely if the people doing the planning know what they're doing, but I'm sure the folks in the WTC weren't expecting their records to be destroyed by a terrorist attack taking the entire building down.

Depends, how confident are you that every eventuality has been planned for and provided for by the system? A significant outage can easily eat up an entire year's worth of $15-an-hour salaries if you hit an unforeseen condition which causes the whole data center to go down. Sure, it's unlikely if the people doing the planning know what they're doing, but I'm sure the folks in the WTC weren't expecting their records to be destroyed by a terrorist attack taking the entire building down.

            Of course any number of $15/h techs in the WTC wouldn't have helped them with this problem anyway.

          • by billcopc (196330)

            So, how do you suggest one should plan against supposed terrorists razing the whole building ?

            More to the point: how is the $15 lackey going to make a difference in that scenario ? If nothing else, NOT having the lackey there saves the company from paying out death benefits :D

    • ...and Daddy Warbucks got some dough - in a manner of speaking, as it were, etc und so weiter.


  • I'm from Europe. What is AOL again? And what is its/their significance in 2011/2012 anyway?

    - Jesper
  • by mccrew (62494) on Tuesday October 11, 2011 @06:36PM (#37684708)
    In other news, the rest of AOL is expected to go "lights out" any time now.
  • by frisket (149522)
    AOL? Who they?
  • But I can't resist.
     
    ...In Soviet Russia, remote hands are YOURS!

  • It's pretty easy to automate a bunch of off switches. ;)
  • by 140Mandak262Jamuna (970587) on Tuesday October 11, 2011 @06:57PM (#37684950) Journal
    The new data center with 0 head count matches nicely the AOL user base with 0 head count!
  • Two points. (Score:4, Insightful)

    by rickb928 (945187) on Tuesday October 11, 2011 @06:58PM (#37684962) Homepage Journal

One - If there is redundancy and virtualization, AOL can certainly keep services running while a tech goes in, maybe once a week, and swaps out the failed blades that have already been remotely disabled and had their usual services relocated. This is not a problem. Our outfit here has a lights-out facility that sees a tech maybe every few weeks; other than that, a janitor keeps the dust bunnies at bay and makes sure the locks work daily. And yes, they've asked him to flip power switches and tell them what color the lights were. He's gotten used to this. That center doesn't have state-of-the-art stuff in it, either.

    Two - Didn't AOL run on a mainframe (or more than one) in the 90s? It predated anything useful, even the Web I think. Netscape was being launched in 1998, Berners-Lee was making a NeXT browser in 1990, and AOL for Windows existed in 1991. Mosaic and Lynx were out in 1993. AOL sure didn't need any PC infrastructure, it predated even Trumpet Winsock, I think, and Linux. I don't think I could have surfed the Web in 1991 with a Windows machine, but I could use AOL.

    • by laffer1 (701823)

      Netscape was founded in 1994. http://en.wikipedia.org/wiki/Netscape [wikipedia.org]

      • by rickb928 (945187)

        I was thinking of the browser, not the company.

        • by laffer1 (701823)

          Netscape didn't come out in 1998. Netscape Navigator 3 was out in 1997 for instance http://sillydog.org/narchive/full123.php [sillydog.org]

I was using Netscape Navigator 2.x with AOL in 1996. I remember because it was a big deal that AOL finally got 32-bit winsock support for Windows 95. Netscape was definitely out in 1995 as well. I remember "best viewed with netscape" buttons on websites when I first got on AOL in 1995.

          Are you talking about a specific browser version? Like Netscape Communicator 4.0 ?

          Both Internet E

How does redundancy help you when the main power switch goes down or catches fire and there is no one there? The firemen make a big mess and nobody is there to start the rebuild; or it does a safe shutdown, and you send someone out just to find out you need to call in this other guy to fix the switch or generator.

      • by ToddDTaft (170931)

How does redundancy help you when the main power switch goes down or catches fire and there is no one there?

        If you are a big enough operation, you have redundancy at the data center level. i.e. you can lose an entire data center and have no loss of service on your production applications. Other than a possible speed/performance degradation, your average customer has no knowledge that anything bad has happened.

      • This is why you have a duplicate data center in another city that is kept in standby and is just sitting there ready to take over. (Actually, you normally have a mix of services active at either location.)

        The company I work for makes telecom equipment, and supporting geo redundancy is a fairly key requirement for some major customers.

    • by evilviper (135110)

      It predated anything useful, even the Web I think. Netscape was being launched in 1998, Berners-Lee was making a NeXT browser in 1990, and AOL for Windows existed in 1991.

The web was around, and in force, MUCH earlier than you would imagine. Windows 98 had Internet Explorer version 4 inextricably linked to the OS. Not version 1, but version 4. Internet Explorer was conceived as a weapon against Netscape, so there's no way IEv4 predated Netscape...

      And before the WWW, the internet was quite useful. Newsgr

      • No, you couldn't because NOBODY had Windows in 91.

        What in the world are you talking about?

      • by dkf (304284)

        No, you couldn't because NOBODY had Windows in 91.

        '91 was when Win 3.1 came out, and that was when it was becoming obvious that Win really was evolving to becoming a full-time OS. (It wasn't there yet at the time, oh boy it wasn't there, but it was clear that was the way things were going.) Surfing the web at that time (well, info services like gopher) required third-party software, but it definitely existed. I remember using it.

        • by evilviper (135110)

          '91 was when Win 3.1 came out

          Nope, March of '92. Others have pointed out that Windows 3.0 was out at that time, but I still maintain practically nobody was running it. In 91 it was very much a DOS world.

    • by Jay L (74152)

      AOL initially ran on a network of Stratus fault-tolerant minicomputers, each running two to eight 680x0 CPUs. Later we added unix boxen, some beefy SGIs and HPs for servers, and Suns for front-end telco interfacing IIRC. By the mid-90s we grew a Tandem fault-tolerant cluster for our critical databases; it did hot component failover, multimaster replication, all
      the stuff that's common today, but
      with SQL down in the drive controller for blazing speeds. We didn't really
      start moving to a PC-based architecture

      • by Jay L (74152)

        Wow. I will never post from an iPhone again...

      • by rickb928 (945187)

Wow. We're still two years from decommissioning our Stratus servers. We're still 6 months from decom of SNA. I gotta talk to the other team about stepping it up.

        • by Jay L (74152)

          Are you running VOS or FTX? I don't know about FTX, but if you're running VOS, and you're (at least) two years out, I highly recommend upgrading to the V-Series. Stuff that used to compile overnight now takes seconds; we stopped building an inverted index of our source code because "display *.pl1 -match x" was instant. More on the port:

          http://newsgroups.derkeiler.com/Archive/Comp/comp.sys.stratus/2007-11/msg00005.html [derkeiler.com]

          • by rickb928 (945187)

We're killing all of them. They don't fit into the new software models, and are actually 3 years overdue for decommissioning. We have no redundancy on 75% of them, and their replacements are already online and in production. It's our users who are holding this up; some have put off their work for 5-6 years now, and we don't have the power to compel them to do it. Yet.

            Good while they worked, still there, but doomed. They mostly do file transforms and routing, much better on the RHEL system replacing them

  • by Trip6 (1184883) on Tuesday October 11, 2011 @07:24PM (#37685182)

    Oh yeah, to house all the dial-up modems...

  • I didn't know AOL even still existed!
  • AOWho?

  • n/t /obligatory

  • At least that way they won't need "heroic support"

  • by QuietLagoon (813062) on Tuesday October 11, 2011 @09:47PM (#37686100)
    What are they doing nowadays that requires multiple servers?
    • by archen (447353)

They still serve email. My boss (and much of his family) still uses (and pays for) AOL even though they have broadband and AOL provides them with nothing but an email address as far as I can figure. It's apparently hilariously bad, as he's always talking about how the website doesn't work much of the time, and the connection simply times out. I think they also distribute some software that goes with "AOL" but I have no idea what it does. I hear it still crashes a lot though.

    • by Geminii (954348)
      Counting all the money they make off suckers?
  • ... I say FUUUUUUUUUUUUUUUuuuuuu...
  • ....wait for it .... Smynet! (Someone typoed)

  • To start chewing through wires, causing power outages, starting fires, pooping in the mailbox, that kind of stuff.

  • One of the early search engines, I think Infoseek, worked this way. Machines were installed in blocks of 100 (this was before 1U servers) and never replaced individually. Failed machines were powered off remotely. When some fraction of the block had failed, about 20%, the whole cluster was replaced.

    There's a lot to be said for this. You have less maintenance-induced failure. Operating costs are low.

  • by 1s44c (552956)

    ...over a decade before the term Cloud was even coined.

    You mean back when it was called 'grid'?

  • What they did:
    * Modularize/Standardize Infrastructure, e.g. storage & computing power
    * Build provisioning systems
    * Virtualize everything

    When they say that they are flexible, they mean that they have a lot of dark hardware lying around.
