
GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk) 356

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
  • Yawn... (Score:5, Insightful)

    by Anonymous Coward on Wednesday February 01, 2017 @03:04AM (#53779121)

    No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

    This has been going on since the dawn of computing and it seems there's no end in sight.

     

    • by Anonymous Coward
    • Re: (Score:3, Insightful)

      Clearly their DR plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!
    • by zifn4b ( 1040588 )

      No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

      This has been going on since the dawn of computing and it seems there's no end in sight.

      You'd think so, but the level of incompetence these days rivals the incompetence of 20 years ago. I just heard yesterday that a global multi-national company that's been around for years lost a file because "another file from a different source came in too soon and overwrote it". At that point I did a complete facepalm, astounded that we still have software like that running critical business operations, sometimes even global ones.

  • by sixdrum ( 4791263 ) on Wednesday February 01, 2017 @03:06AM (#53779129)
    A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

    rm -rf /home/user1 /home/user2 /home/ user3

    Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
    • by Anonymous Coward on Wednesday February 01, 2017 @03:28AM (#53779179)

      That's why you always always run ls first.

      ls -ld /home/user1 /home/user2 /home/ user3

      Then edit the command to rm. Always.

      • by AmiMoJo ( 196126 ) on Wednesday February 01, 2017 @05:07AM (#53779419) Homepage Journal

        mkdir ./trash
        mv file_to_delete ./trash

        If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.

        • by jabuzz ( 182671 )

          Oh, I wish that were really the case. Unfortunately, when a single run of a job on an HPC facility can produce 1TB of files, that is not actually the case in the real world for everyone.

      • That's why you always always run ls first.

        ls -ld /home/user1 /home/user2 /home/ user3

        Then edit the command to rm. Always.

        Or you use scripts.

        somescript user1 user2 user3
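
        For example, such a wrapper can refuse to touch anything that isn't a real home directory. A rough sketch ("somescript" is hypothetical here, and the archive path is just an example):

        #!/bin/sh
        # somescript: archive, then delete, the named users' home directories
        set -eu
        for user in "$@"; do
            dir="/home/$user"
            # only act on an existing directory directly under /home
            [ -d "$dir" ] || { echo "skipping $user: $dir is not a directory" >&2; continue; }
            tar -czf "/var/tmp/home-$user.tar.gz" -C /home "$user"   # keep a last-chance copy
            rm -rf -- "$dir"
        done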

      • by Megol ( 3135005 )

        Or perhaps the operating system (shell) should prevent these kinds of errors? I guess it isn't macho enough...

        • Re: (Score:2, Insightful)

          by Anonymous Coward

          Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?

      • by jez9999 ( 618189 )

        Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.

        • by quenda ( 644621 )

          GUI? We don't need no stinkin' GUI!

          # mkdir junk
          # mv file1 dir2 .... junk
          # ls -la junk

            Look carefully!!
          # rm -rf junk

        • I habitually shift-delete things because it saves a lot of time moving large folders with massive numbers of files into the recycle bin. I have been caught out by this once or twice over the years, but always had a recent backup and so have never lost anything that way.

      • This. I do this.
      • Also rm -rf /home/{user1,user2,user3} is safer: if you accidentally include a space, the braces don't get expanded at all:

        rm -rf /home/{user1,user2}

        is equivalent to

        rm -rf /home/user1 /home/user2

        but rm -rf /home/{user1, user2}

        is not brace-expanded at all: rm just gets the literal arguments /home/{user1, and user2}, neither of which exists,

        so 'rm -ageddon' is avoided.

        • Seriously, though, much more thought needs to be given to two things: one is making accidents harder, the other is making effective backups a no-brainer.

    • by mmell ( 832646 ) on Wednesday February 01, 2017 @03:35AM (#53779203)
      Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).
    • by arglebargle_xiv ( 2212710 ) on Wednesday February 01, 2017 @04:04AM (#53779279)

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      Actually: Check your privilege! (Especially if rm -rf is involved).

      • by sinij ( 911942 )

        Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

        Actually: Check your privilege!

        Sudo is a real victim here. Let's not make it worse by engaging in victim-blaming.

    • by Megane ( 129182 )
      This is when tab completion is your friend, especially when you have path names with spaces in them. Also, for me the big one is overwriting stuff with the mv command (tab completion can make this easier to do), so I have it aliased to "mv -i". I almost never want to delete a file by overwriting it with the mv command.
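
      For what it's worth, a minimal sketch of that habit in ~/.bashrc (the exact set of aliases is a matter of taste, and -I is GNU rm):

      alias mv='mv -i'   # prompt before overwriting on move
      alias cp='cp -i'   # prompt before overwriting on copy
      alias rm='rm -I'   # prompt once before bulk or recursive removals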
    • Moral: the command line is too powerful for puny humans who might not be totally attentive to every character being entered at all times.

    • My major whoops early in my career.

      Brand new install of Slackware with Kernel 1.2.8 (circa late 1994), which was a statically linked build. I thought I was in /usr/local/lib (the shell prompt only showed the current directory name, not the full path) but was really in /lib. Proceeded to rm -rf * to get rid of a test build (or so I thought). After about 10 seconds I started wondering why the rm command was throwing errors. Seems that once rm hit libc.a, any and all operations ceased.

      After that I always had the root

  • by Nkwe ( 604125 ) on Wednesday February 01, 2017 @03:12AM (#53779143)
    If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
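
    A restore drill doesn't have to be elaborate. A minimal sketch, assuming nightly custom-format pg_dump files land in /backups (the paths and the "projects" table are made-up examples):

    #!/bin/sh
    # restore-check: load the newest dump into a scratch database and run one sanity query
    set -eu
    latest=$(ls -t /backups/*.dump | head -n 1)
    dropdb --if-exists restore_check
    createdb restore_check
    pg_restore --no-owner --dbname=restore_check "$latest"
    psql -d restore_check -c 'SELECT count(*) FROM projects;'   # pick a table you expect to be non-empty
    dropdb restore_check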
    • Re: (Score:3, Informative)

      by dbIII ( 701233 )
      Good advice, but the headline above is misleading. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
      It is a good example of how replication is not a backup and is often just a way to mirror mistakes.
    • Typical case of "we're unlikely to lose our data, and anyway we've got a backup which in turn is unlikely to fail; so why test an unlikely x unlikely event?"
      • I think backups are surprisingly likely to fail. Just like RAID is surprisingly likely to have more than one disk fail at a time, even though intuitively that seems extremely unlikely.
        • RAID fails because hard disks (probably the same type and batch) running together get hit at the same rate, so they do not fail independently; their failure correlation tends to be quite high. That is why rebuilding a RAID array after a failure can be a very dangerous operation and can easily lead to total failure. Usually, taking an (incremental) backup is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.

    • by Opportunist ( 166417 ) on Wednesday February 01, 2017 @04:36AM (#53779359)

      Sing with me, kids:

      One backup in my bunk
      One backup in my trunk
      One backup at the town's other end
      One backup on another continent

      All of them tested and verified sane
      now go to bed, you can sleep once again

    • by tonymercmobily ( 658708 ) on Wednesday February 01, 2017 @05:16AM (#53779449) Homepage Journal

      "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
      OK, now that I have repeated it, let me add.

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

      THEN you will see, for real, how your company reacts to real disasters.

      The difference is that if anything goes _really_ wrong, you can plug the hard drive back in and fire a few people.

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

      http://www.datacenterknowledge... [datacenterknowledge.com]

      Merc.

      • In other words: IT fire drills. Smart companies conduct them... but somehow I have never seen them done, or even seen companies asking their outsourcing partners to produce some proof of recovery procedures having been tested. No, "they are ISO-over-9000 and that is good enough for us". Good enough to cover your arse when things go south, sure.

        We had plenty of actual fire drills, though.
        • by cdrudge ( 68377 ) on Wednesday February 01, 2017 @08:37AM (#53779807) Homepage

          "they are ISO-over-9000 and that is good enough for us"

          Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

          1. Perform backup
          2. Pray nothing goes wrong.

          Now hopefully they have something a lot more than that. But even if they don't test the backups, don't hold an "IT fire drill" to practice what to do when the feces hits the fan, and don't have disaster recovery backup servers and snapshots and whatever else they should have, they can still be compliant, as long as they have completely documented their process and follow it like the standards require.

      • Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company

        You make the assumption the CXO wants to save the company. Downtime costs happen this quarter. The benefit accrues to whoever is the CXO five years down the line. Why should the current CEO save the a** of the next CEO? Squeeze the company dry, show as much revenue/profit as possible, cash the stock options and skip town. By the time they discover the shoddy backup vendor you hired to cut costs had been saving the data on "1TB" thumbdrives bought in some flea market in outer Mongolia, you are already well into

      • As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events.

        As an IT professional, and occasional admin, you MUST have backup hardware to switch to, which mitigates the pain of live testing. The hardware is typically a small portion of the total cost of the business, even if you double it.

    • If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      Especially now that ransomware is overwriting online backups.

    • by tomhath ( 637240 )

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.

  • by subk ( 551165 ) on Wednesday February 01, 2017 @03:13AM (#53779145)
    Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
    • # rm `which rm`
    • by infolation ( 840436 ) on Wednesday February 01, 2017 @03:58AM (#53779265)

      Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      Don't tell the customer anything!! Geez... What's with these semi-pros?

    • Re:Don't use rm! (Score:4, Interesting)

      by Darinbob ( 1142669 ) on Wednesday February 01, 2017 @04:07AM (#53779289)

      Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.

      This is why sysadmins were created, because the people actually using the computers didn't want to manage them.

      • Re:Don't use rm! (Score:4, Insightful)

        by sodul ( 833177 ) on Wednesday February 01, 2017 @04:43AM (#53779375) Homepage

        Nowadays, since nobody wants to do sysadmin work and most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, and then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is DevOps who are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos-driven, with anything goes, since Sales promised a feature to a prospective customer yesterday, every week.

    • > Don't tell the customer anything until the dust settles!

      That's one way to handle a major crisis, but if you're transparent about an issue, it puts a lot more minds at ease than it upsets, since then at least your customers know that you're aware of the problem, that you're working to fix it, and that they can communicate with you.

  • by djinn6 ( 1868030 ) on Wednesday February 01, 2017 @03:13AM (#53779147)
    Two things:
    1. Test your backups
    2. TEST your BACKUPS!
    • Re: (Score:2, Funny)

      by Anonymous Coward

      but NOT on your production hardware running live services.

      me thinks gitlab should have browsed their hosted repos for some backup software.

      • by asylumx ( 881307 )

        but NOT on your production hardware running live services.

        There are plenty who disagree with this. Right or wrong, their arguments have merit.

  • by jtara ( 133429 ) on Wednesday February 01, 2017 @04:02AM (#53779275)

    At least it wasn't github.com.

    So, it didn't break the Internet.

    And practically everything else.

    • Github [ycombinator.com] goes down [theregister.co.uk] from time to time [ycombinator.com], too. Self-hosting code is so easy (that's what git was designed to do) that there's really no reason to have your company depend on Github. Unless you're an early-stage startup and don't even have an office or something.
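
      A self-hosted remote really is just a bare repository reachable over SSH; a minimal sketch (host and paths are placeholders):

      ssh git@myserver 'git init --bare /srv/git/myproject.git'
      git remote add internal git@myserver:/srv/git/myproject.git
      git push internal master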
  • by daid303 ( 843777 ) on Wednesday February 01, 2017 @04:06AM (#53779285)

    I've made this mistake: I deleted all attachments on a live system once.

    After this, I made all the prompts for critical servers a different color:
    export PS1='\e[41m\u@\h:\w\$\e[49m'
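
    A variant that also wraps the non-printing escape codes in \[ ... \] so bash's line editing doesn't miscount the prompt length (the colors and layout are just an example):

    export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\] '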

  • by Anonymous Coward on Wednesday February 01, 2017 @04:17AM (#53779313)

    I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?

      http://jefferai.org/2013/03/29/distillation/

    When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.

    • by Entrope ( 68843 ) on Wednesday February 01, 2017 @08:46AM (#53779847) Homepage

      KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.

      If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.

  • by HxBro ( 98275 ) on Wednesday February 01, 2017 @04:28AM (#53779333)

    Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this

    • by gweihir ( 88907 )

      Just imagine if you had actually read the story. The git-repos are not affected.

    • by AmiMoJo ( 196126 )

      Repos will be okay, it's all the ancillary stuff, i.e. the things that make them worth using over other git hosting companies. User management, wikis, release management, issue tracking etc.

  • Comment removed (Score:5, Interesting)

    by account_deleted ( 4530225 ) on Wednesday February 01, 2017 @04:56AM (#53779399)
    Comment removed based on user account deletion
  • All my sympathy... (Score:5, Insightful)

    by Gumbercules!! ( 1158841 ) on Wednesday February 01, 2017 @06:23AM (#53779553)
    I don't care if this is a mistake and screw-up of their own making (and it is, on every level) - if you've ever worked as a sysadmin you have got to feel for these guys.
  • by gweihir ( 88907 ) on Wednesday February 01, 2017 @06:36AM (#53779589)

    Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.

    Of course, in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.

    • by Entrope ( 68843 )

      BCM? Bravo Company, manufacturer of firearm parts so you can shoot your servers? Buzzword-Centric Methodology? The SourceForge "BCM" project, a file compression utility? Baylor College of Medicine? Bear Creek Mining? Bacau International Airport? Broadcom?

  • Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?

    Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.

    I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and some switching gear into one of those rock band roadie cases and take it to a business with the idea that

    • As a sysadmin, this sounds great (a bit 'brown trousers' for me personally, but great). However, one of my clients is entirely 'in the cloud', so no need for your truck of kit - just provide as many VMs as we like somewhere on t'internet. Ideally you'd be able to do this in a 'little internet' which has a VPN to get into it, has its own DNS servers, and maybe ways to 'bend' or alter requests to other cloudy services, such as Google or Amazon, such that the app 'thinks' it's talking to the real, live producti

  • A lesson always learnt the hard way. Those of us who have learnt it the hard way know the feeling before ("I'll trust that this is correct") and the feeling after ("Shiat!").

  • And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare (a sketch of that setup follows below).

    What is frustrating is that, given all the progress in hardware reliab
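
    Mirroring the pushes is a one-time remote tweak, something along these lines (the URLs are placeholders):

    git remote set-url --add --push origin git@github.com:me/project.git
    git remote set-url --add --push origin git@gitlab.com:me/project.git
    git push origin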

  • That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!

    So don't blame the guy, praise him and thank him for saving us all!

  • by thesandbender ( 911391 ) on Wednesday February 01, 2017 @09:53AM (#53780117)
    Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".
    • I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.

      Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" are gone.

      But sometimes it's the entir

  • "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

    Or, more accurately, less than 5 backup/replication techniques were deployed.

    I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.

    I do not miss sysadmin work so much.
