
GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk) 356

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
  • Yawn... (Score:5, Insightful)

    by Anonymous Coward on Wednesday February 01, 2017 @02:04AM (#53779121)

    No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

    This has been going on since the dawn of computing and it seems there's no end in sight.

     

    • by Anonymous Coward
    • Re: (Score:3, Insightful)

      Clearly their DR Plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!
    • by zifn4b ( 1040588 )

      No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

      This has been going on since the dawn of computing and it seems there's no end in sight.

      You'd think so, but the level of incompetence these days rivals the incompetence of 20 years ago. I just heard yesterday that a global multinational company that's been around for years lost a file because "another file from a different source came in too soon and overwrote it". At that point, I did a complete facepalm, because I was astounded that we still have software like that around running critical business operations, sometimes even global operations.

  • by sixdrum ( 4791263 ) on Wednesday February 01, 2017 @02:06AM (#53779129)
    A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

    rm -rf /home/user1 /home/user2 /home/ user3

    Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
    • by Anonymous Coward on Wednesday February 01, 2017 @02:28AM (#53779179)

      That's why you always always run ls first.

      ls -ld /home/user1 /home/user2 /home/ user3

      Then edit the command to rm. Always.

      • by AmiMoJo ( 196126 ) on Wednesday February 01, 2017 @04:07AM (#53779419) Homepage Journal

        mkdir ./trash
        mv file_to_delete ./trash

        If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.
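The move-to-trash habit above can be wrapped in a small shell function. This is only a sketch: the function name `trash` and the `TRASH_DIR` variable are made up for illustration, and it assumes a `mktemp` that accepts a template path (GNU and BSD both do).

```shell
#!/bin/sh
# trash: hypothetical helper, not a standard command.
# TRASH_DIR is an assumed location; point it at a disk with spare room.
TRASH_DIR="${TRASH_DIR:-./trash}"

trash() {
    [ "$#" -gt 0 ] || { echo "usage: trash FILE..." >&2; return 1; }
    mkdir -p -- "$TRASH_DIR" || return 1
    # Each call gets its own timestamped subdirectory, so two files with the
    # same name never overwrite each other inside the trash.
    dest=$(mktemp -d "$TRASH_DIR/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$dest"/
}
```

Usage would be `trash file_to_delete old_dir`. Emptying the trash stays a separate, deliberate step, which is the whole point of the approach.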

        • by jabuzz ( 182671 )

          Oh I wish that were really the case. Unfortunately where a single run of a job on an HPC facility can produce 1TB of files that is not actually the case in the real world for everyone.

      • That's why you always always run ls first.

        ls -ld /home/user1 /home/user2 /home/ user3

        Then edit the command to rm. Always.

        Or you use scripts.

        somescript user1 user2 user3
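For what such a script might look like, here is a rough sketch. The name `remove_homes` and the `HOME_ROOT` variable are invented for this example; nothing in the thread says what "somescript" actually does. The idea is that the script takes user names rather than paths, so a stray space can never turn into a bare /home/ argument, and it checks every name before deleting anything.

```shell
#!/bin/sh
# remove_homes: hypothetical stand-in for "somescript".
# HOME_ROOT is an assumed variable so the sketch can be tried on a scratch directory.
HOME_ROOT="${HOME_ROOT:-/home}"

remove_homes() {
    # First pass: validate every argument before touching anything.
    for user in "$@"; do
        case "$user" in
            ''|.|..|*[!A-Za-z0-9._-]*)
                # Empty names, dot directories, slashes, and spaces all end up here.
                echo "refusing suspicious name: '$user'" >&2
                return 1 ;;
        esac
        if [ ! -d "$HOME_ROOT/$user" ]; then
            echo "no such home directory: $HOME_ROOT/$user" >&2
            return 1
        fi
    done
    # Second pass: only reached when every name checked out.
    for user in "$@"; do
        ls -ld -- "$HOME_ROOT/$user"   # show what is about to go
        rm -rf -- "$HOME_ROOT/$user"
    done
}
```

Because validation runs over the whole argument list first, one bad name aborts the run before any directory is removed.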

      • by Megol ( 3135005 )

        Or perhaps the operating system (shell) should prevent these kinds of errors? I guess it isn't macho enough...

        • Re: (Score:2, Insightful)

          by Anonymous Coward

          Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?

      • by jez9999 ( 618189 )

        Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.

        • by quenda ( 644621 )

          GUI? We don't need no stinkin' GUI!

          # mkdir junk
          # mv file1 dir2 .... junk
          # ls -la junk

            Look carefully!!
          # rm -rf junk

        • I habitually shift-delete things because it saves a lot of time moving large folders with massive numbers of files into the recycle bin. I have been caught out by this once or twice over the years, but always had a recent backup and so have never lost anything that way.

      • This. I do this.
      • Also rm -rf /home/{user1,user2,user3} is safer: if you accidentally include a space, the braces don't get expanded at all:

        rm -rf /home/{user1,user2}

        is equivalent to

        rm -rf /home/user1 /home/user2

        but rm -rf /home/{user1, user2}

        is equivalent to

        rm -rf "/home/{user1," "user2}"

        so 'rm -ageddon' is avoided.
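The brace behaviour is easy to check without deleting anything. In this sketch `show_args` is a made-up helper that prints each argument it receives on its own line; the expansion in the first call assumes bash (a plain POSIX sh does no brace expansion at all).

```shell
#!/usr/bin/env bash
# show_args: illustrative helper that prints its arguments instead of removing them.
show_args() {
    printf '[%s]\n' "$@"
}

# No space: bash expands the braces into two separate paths.
show_args /home/{user1,user2}

# Stray space: no brace expansion happens. The shell splits on the space and
# passes two literal words, neither of which is /home/ itself.
show_args /home/{user1, user2}

# The typo from earlier in the thread: /home/ becomes an argument of its own.
show_args /home/user1 /home/user2 /home/ user3
```

Swapping rm for a harmless printer like this is just the "run ls first" advice in another form.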

        • Seriously, though, much more thought needs to be given to two things: one is making accidents harder, the other is making effective backups a no-brainer.

    • by arglebargle_xiv ( 2212710 ) on Wednesday February 01, 2017 @03:04AM (#53779279)

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      Actually: Check your privilege! (Especially if rm -rf is involved).

      • by sinij ( 911942 )

        Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

        Actually: Check your privilege!

        Sudo is a real victim here. Let's not make it worse by engaging in victim-blaming.

    • by Megane ( 129182 )
      This is when tab completion is your friend, especially when you have path names with spaces in them. Also, for me the big one is overwriting stuff with the mv command (tab completion can make this easier to do), so I have it aliased to "mv -i". I almost never want to delete a file by overwriting it with the mv command.
    • Moral: the command line is too powerful for puny humans who might not be totally attentive to every character being entered at all times.

    • My major whoops early in my career.

      Brand new install of Slackware with Kernel 1.2.8 (circa late 1994) which was a statically linked build. Thought I was in /usr/local/lib (shell only had current level directory not the full path) but was really in /lib. Proceeded to rm -rf * to get rid of a test build (or so I thought). Then, after about 10 seconds, I was wondering why the rm command was throwing errors. Seems that once the rm command hit libc.a any and all operations ceased.

      After that I always had the root

  • by Nkwe ( 604125 ) on Wednesday February 01, 2017 @02:12AM (#53779143)
    If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
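A restore test does not have to be elaborate to be worth running. The sketch below assumes a plain gzip-compressed tar archive created with `tar -czf backup.tar.gz -C "$source_dir" .`; the function name `verify_backup` is invented here, and a database would need its own dump and restore tooling rather than tar.

```shell
#!/bin/sh
# verify_backup: hypothetical helper. Restores ARCHIVE into a scratch
# directory and compares the result against SOURCE_DIR.
verify_backup() {
    source_dir=$1
    archive=$2
    scratch=$(mktemp -d) || return 1
    # Step 1: actually restore, rather than trusting that the file exists.
    if ! tar -xzf "$archive" -C "$scratch"; then
        echo "restore failed: $archive" >&2
        rm -rf -- "$scratch"
        return 1
    fi
    # Step 2: compare the restored tree with the source it was taken from.
    if diff -r "$source_dir" "$scratch" >/dev/null; then
        status=0
    else
        echo "restored data differs from $source_dir" >&2
        status=1
    fi
    rm -rf -- "$scratch"
    return "$status"
}
```

On a live system the source keeps changing after the backup is taken, so a byte-for-byte diff is only meaningful right after the backup or against a snapshot; comparing against checksums recorded at backup time would be one alternative. The sketch also says nothing about the "completely offline copy" half of the rule.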
    • Re: (Score:3, Informative)

      by dbIII ( 701233 )
      Good advice but it's a misleading headline above. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
      It is a good example that replication is not a backup and is often a way to just mirror mistakes.
    • Typical case of "we're unlikely to lose our data, and anyway we've got a backup which in turn is unlikely to fail; so why test an unlikely x unlikely event?"
      • I think backups are surprisingly likely to fail. Just like RAID is surprisingly likely to have more than one disk fail at a time, even though intuitively that seems extremely unlikely.
        • RAID fails because the hard disks (probably of the same type and batch) run together and accumulate wear at the same rate, so their failures are not independent events. Their failure correlation is therefore likely to be quite high. This explains why rebuilding a RAID array after a failure can be a very dangerous operation that could easily lead to total failure. Usually, taking (incremental) backups is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.

    • by Opportunist ( 166417 ) on Wednesday February 01, 2017 @03:36AM (#53779359)

      Sing with me, kids:

      One backup in my bunk
      One backup in my trunk
      One backup at the town's other end
      One backup on another continent

      All of them tested and verified sane
      now go to bed, you can sleep once again

    • by tonymercmobily ( 658708 ) on Wednesday February 01, 2017 @04:16AM (#53779449) Homepage Journal

      "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
      OK, now that I have repeated it, let me add.

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

      THEN you will see, for real, how your company reacts to real disasters.

      The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

      http://www.datacenterknowledge... [datacenterknowledge.com]

      Merc.

      • In other words: IT fire drills. Smart companies conduct them... but somehow I have never seen them done, or even seen companies asking their outsourcing partners to produce some proof of recovery procedures having been tested. No, "they are ISO-over-9000 and that is good enough for us". Good enough to cover your arse when things go south, sure.

        We had plenty of actual fire drills, though.
        • by cdrudge ( 68377 ) on Wednesday February 01, 2017 @07:37AM (#53779807) Homepage

          "they are ISO-over-9000 and that is good enough for us"

          Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

          1. Perform backup
          2. Pray nothing goes wrong.

          Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it like the standards require.

      • Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company

        You make the assumption that the CXOs want to save the company. Downtime costs happen this quarter. The benefit accrues to whoever is the CXO five years down the line. Why should the current CEO save the a** of the next CEO? Squeeze the company dry, show as much revenue/profit as possible, cash the stock options and skip town. By the time they discover that the shoddy backup vendor you hired to cut costs had been saving the data in the "1TB" thumbdrives bought in some flea market in outer Mongolia, you are already well into

      • As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events.

        As an IT professional, and occasional admin, you MUST have backup for your hardware to switch to, which mitigates the pain of live testing. The hardware is typically a small portion of the total cost of the business, even if you double it.

    • If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      Especially now that ransomware is overwriting online backups.

    • by tomhath ( 637240 )

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.

  • by subk ( 551165 ) on Wednesday February 01, 2017 @02:13AM (#53779145)
    Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
    • # rm `which rm`
    • by infolation ( 840436 ) on Wednesday February 01, 2017 @02:58AM (#53779265)

      Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      Don't tell the customer anything!! Geez... What's with these semi-pros?

    • Re:Don't use rm! (Score:4, Interesting)

      by Darinbob ( 1142669 ) on Wednesday February 01, 2017 @03:07AM (#53779289)

      Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.

      This is why sysadmins were created, because the people actually using the computers didn't want to manage them.

      • Re:Don't use rm! (Score:4, Insightful)

        by sodul ( 833177 ) on Wednesday February 01, 2017 @03:43AM (#53779375) Homepage

        Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is DevOps are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos driven with anything goes since Sales promised a feature to a prospect customer yesterday, every week.

    • > Don't tell the customer anything until the dust settles!

      That's one way to handle a major crisis, but if you're transparent about an issue, it puts a lot more minds at ease than it upsets, since then at least your customers know that you're aware of the problem, that you're working to fix it, and that they can communicate with you.

  • by djinn6 ( 1868030 ) on Wednesday February 01, 2017 @02:13AM (#53779147)
    Two things:
    1. Test your backups
    2. TEST your BACKUPS!
    • Re: (Score:2, Funny)

      by Anonymous Coward

      but NOT on your production hardware running live services.

      Methinks GitLab should have browsed their hosted repos for some backup software.

      • by asylumx ( 881307 )

        but NOT on your production hardware running live services.

        There are plenty who disagree with this. Right or wrong, their arguments have merit.

  • by jtara ( 133429 ) on Wednesday February 01, 2017 @03:02AM (#53779275)

    At least it wasn't github.com.

    So, it didn't break the Internet.

    And practically everything else.

  • by daid303 ( 843777 ) on Wednesday February 01, 2017 @03:06AM (#53779285)

    I've made this mistake, deleted all attachments on a live system once.

    After this, I made all the prompts for critical servers a different color:
    export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\]'

  • by Anonymous Coward on Wednesday February 01, 2017 @03:17AM (#53779313)

    I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?

      http://jefferai.org/2013/03/29/distillation/

    When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.

    • by Entrope ( 68843 ) on Wednesday February 01, 2017 @07:46AM (#53779847) Homepage

      KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.

      If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.

  • by HxBro ( 98275 ) on Wednesday February 01, 2017 @03:28AM (#53779333)

    Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this

    • by gweihir ( 88907 )

      Just imagine if you had actually read the story. The git-repos are not affected.

    • by AmiMoJo ( 196126 )

      Repos will be okay, it's all the ancillary stuff, i.e. the things that make them worth using over other git hosting companies. User management, wikis, release management, issue tracking etc.

  • All my sympathy... (Score:5, Insightful)

    by Gumbercules!! ( 1158841 ) on Wednesday February 01, 2017 @05:23AM (#53779553)
    I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
  • by gweihir ( 88907 ) on Wednesday February 01, 2017 @05:36AM (#53779589)

    Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.

    Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.

    • by Entrope ( 68843 )

      BCM? Bravo Company, manufacturer of firearm parts so you can shoot your servers? Buzzword-Centric Methodology? The SourceForge "BCM" project, a file compression utility? Baylor College of Medicine? Bear Creek Mining? Bacau International Airport? Broadcom?

  • Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?

    Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.

    I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and switching into one of those rock band roadie cases and take it to a business with the idea that

    • As a sysadmin, this sounds great (a bit 'brown trousers' for me personally, but great). However, one of my clients is entirely 'in the cloud', so no need for your truck of kit - just provide as many VMs as we like somewhere on t'internet. Ideally you'd be able to do this in a 'little internet' which has a VPN to get into it, has its own DNS servers, and maybe ways to 'bend' or alter requests to other cloudy services, such as Google or Amazon such that the app 'thinks' it's talking to the real, live producti

  • A lesson always learnt the hard way. Those of us who have learnt it the hard way have known the feeling before ("I'll trust that this is correct") and the feeling after ("Shiat!").

  • And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.

    What is frustrating is that, given all the progress in hardware reliab
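The hot-spare idea above takes only a few git commands. In this sketch the function name `push_to_mirror` is made up, and the remote name and URL are whatever second server you choose; `git remote get-url` needs a reasonably recent git.

```shell
#!/bin/sh
# push_to_mirror: hypothetical helper. NAME and URL are placeholders for a
# second hosting service; run it from inside the working repository.
push_to_mirror() {
    name=$1
    url=$2
    # Register the second server once; skip this if the remote already exists.
    git remote get-url "$name" >/dev/null 2>&1 || git remote add "$name" "$url" || return 1
    # Push every branch, then every tag, so the spare can stand in for the primary.
    git push --quiet "$name" --all && git push --quiet "$name" --tags
}
```

This only covers what lives in git. As the summary notes, the data lost here was issues and merge requests, which sit in the hosting service's database and are not mirrored by a push.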

  • That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!

    So don't blame the guy, praise him and thank him for saving us all!

  • by thesandbender ( 911391 ) on Wednesday February 01, 2017 @08:53AM (#53780117)
    Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".
    • I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.

      Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" are gone.

      But sometimes it's the entir

  • "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

    Or, more accurately, fewer than 5 backup/replication techniques were deployed.

    I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.

    I do not miss sysadmin work so much.
