GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk) 356
An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.
Yawn... (Score:5, Insightful)
No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
Re: (Score:2)
http://knowyourmeme.com/memes/disaster-girl [knowyourmeme.com]
Re: (Score:3, Insightful)
Re: Yawn... (Score:4, Interesting)
There are two levels of redundancy. There's "oh my god, the database server is on fire! Promote the replicated server to master and fail over!" which, depending on the database, should take a few seconds to perform manually. Testing the automation for this (pull the plug and see what happens) depends on your setup, on how long it takes your heartbeat to decide that the server is dead, and on how aggressive that decision is (if we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds; it takes more than one failed connection for automation to decide the server is down).
Then, there's "oh my god the datacenter is on fire!". This is what people usually call "Disaster Recovery". One dead server isn't a disaster when you have failovers, but when your entire datacenter is dead, THAT's a disaster. It's tough as nails to automate too, since without having at least three datacenters, it's inherently a split-brain issue. If Datacenter A stops responding to Datacenter B, which one is actually down? If you aren't an AS and can't just republish your IPs at Datacenter B with a BGP routing change, that means you're going to have to publish new DNS records and wait one TTL for everyone to see them. If you had an authoritative DNS server at Datacenter A, then hopefully it was able to recognize that it's down and shot itself (or at least updated its zone files with B's IPs) or you can somehow get to it and kill it, otherwise when Datacenter A comes back online, it'll be serving up A's IPs again and conflict with the other DNS server. This is also setting aside replicating your data between datacenters and how much of that is lost when you switch back and forth.
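For what it's worth, a minimal sketch of that "more than one failed connection" rule, with every hostname, threshold and action invented:
#!/bin/bash
# Sketch only: count consecutive failed health checks before declaring the primary dead.
PRIMARY=db1.example.internal      # invented hostname
FAILS=0
for attempt in 1 2 3 4 5; do
    if ping -c1 -W2 "$PRIMARY" >/dev/null 2>&1; then   # Linux ping; -W is the reply timeout
        FAILS=0
    else
        FAILS=$((FAILS+1))
    fi
    sleep 10
done
if [ "$FAILS" -eq 5 ]; then
    # Five misses in a row over ~50 seconds: page a human (or kick off a well-tested promotion).
    echo "primary $PRIMARY unreachable for 5 consecutive checks" >&2
    exit 1
fi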
Re: (Score:2)
No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
You'd think so, but the level of incompetence these days rivals the incompetence of 20 years ago. I just heard yesterday that a global multi-national company that's been around for years lost a file because "another file from a different source came in too soon and overwrote it". At that point I did a complete facepalm, astounded that we still have software like that running critical business operations, sometimes even global ones.
Re: Yawn... (Score:5, Funny)
paki chimps in jungle
Someone failed geography class...
Re: (Score:3, Insightful)
No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.
He's just a racist.
Re: (Score:2)
What does GitHub have to do with Gitlab.com?
I feel that lone sysadmin's pain (Score:5, Insightful)
rm -rf
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
Re:I feel that lone sysadmin's pain (Score:5, Insightful)
That's why you always always run ls first.
ls -ld /home/user1 /home/user2 /home/ user3
Then edit the command to rm. Always.
Re:I feel that lone sysadmin's pain (Score:5, Interesting)
mkdir ./trash
mv file_to_delete ./trash
If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.
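If you want to make the habit stick, here's a rough sketch (every name below is invented), plus an optional cron purge for when space does matter:
#!/bin/bash
# trash -- move things into a dated holding area instead of deleting them (sketch only)
TRASH="/trash/$(date +%Y%m%d)"
mkdir -p "$TRASH"
mv -- "$@" "$TRASH"/

# And if space ever gets tight, a crontab entry can empty anything that has
# sat in /trash untouched for a month:
# 0 3 * * * find /trash -mindepth 1 -maxdepth 1 -mtime +30 -exec rm -rf -- {} +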
Re: (Score:3)
Oh I wish that were really the case. Unfortunately, when a single run of a job on an HPC facility can produce 1TB of files, that is not actually the case in the real world for everyone.
Or you use scripts (Score:3)
That's why you always always run ls first.
ls -ld /home/user1 /home/user2 /home/ user3
Then edit the command to rm. Always.
Or you use scripts.
somescript user1 user2 user3
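Something along these lines, say (purely illustrative, the script name and paths are made up, and it still shows you what it's about to delete):
#!/bin/bash
# remove-homedirs -- hypothetical version of "somescript" above
set -u
for u in "$@"; do
    dir="/home/$u"
    if [ ! -d "$dir" ]; then
        echo "skipping $dir: not a directory" >&2
        continue
    fi
    ls -ld "$dir"                                  # see exactly what is about to go
    read -r -p "really rm -rf $dir? [y/N] " answer
    if [ "$answer" = "y" ]; then
        rm -rf -- "$dir"
    fi
done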
Re: (Score:2)
Or perhaps the operating system (shell) should prevent these kinds of errors? I guess it isn't macho enough...
Re: (Score:2, Insightful)
Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?
Re:I feel that lone sysadmin's pain (Score:4, Insightful)
Then you have never worked on a repository with users of TortoiseSVN and the likes.
"Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."
Re: (Score:2)
Including that purist-hated trash can/recycle bin.
Re: (Score:2)
Accenture made exactly this blunder on the London Stock Exchange website root folder (running on IIS). Some nimrod came in and accidentally deleted all the files from that folder taking about 30 different financial products offline. We noticed pretty quick and scrambled to restore from a backup.
Funny thing is...some other nimrod or the same one did almost the same thing a month later, this time only removing a few key products :-)
Re: (Score:3)
Re: (Score:2)
Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.
Re: (Score:2)
GUI? We don't need no stinkin' GUI!
# mkdir junk
# mv file1 dir2 .... junk
# ls -la junk
Look carefully!!
# rm -rf junk
Re: (Score:2)
I habitually shift-delete things because it saves a lot of time moving large folders with massive numbers of files into the recycle bin. I have been caught out by this once or twice over the years, but always had a recent backup and so have never lost anything that way.
Re: (Score:2)
Re: (Score:2)
Also rm -rf /home/{user1,user2,user3} is safer: if you accidentally include a space, the braces don't get expanded at all:
rm -rf /home/{user1,user2}
is equivalent to
rm -rf /home/user1 /home/user2
but rm -rf /home/{user1, user2}
is equivalent to
rm -rf '/home/{user1,' 'user2}'
(two literal paths that almost certainly don't exist), so 'rm -ageddon' is avoided.
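You can check what either form will do, harmlessly, by sticking echo in front first:
$ echo rm -rf /home/{user1,user2}
rm -rf /home/user1 /home/user2
$ echo rm -rf /home/{user1, user2}
rm -rf /home/{user1, user2}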
Re: (Score:2)
Seriously, though, much more thought needs to be given to two things: one is making accidents harder, the other is making effective backups a no-brainer.
Re: (Score:3)
Exactly right.
Exactly wrong.
Learn to use "mv" instead of "rm -rf".
eg. Create a folder called /trash and move the files there.
When you see the system is still working and you need some disk space then you can empty the trash. Not before.
Comment removed (Score:5, Interesting)
Re:I feel that lone sysadmin's pain (Score:5, Funny)
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
Actually: Check your privilege! (Especially if rm -rf is involved).
Re: (Score:2)
Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
Actually: Check your privilege!
Sudo is a real victim here. Let's not make it worse by engaging in victim-blaming.
Re: (Score:2)
Re: (Score:2)
This is when tab completion is your friend...
This. Very first thing I thought of as well.
Re: (Score:2)
Moral: the command line is too powerful for puny humans who might not be totally attentive to every character being entered at all times.
Re: (Score:2)
Brand new install of Slackware with Kernel 1.2.8 (circa late 1994) which was a statically linked build. Thought I was in /usr/local/lib (the shell prompt only showed the current directory name, not the full path) but was really in /lib. Proceeded to rm -rf * to get rid of a test build (or so I thought). Then after about 10 sec I was wondering why the rm command was throwing errors. Seems that once the rm command hit libc.a, any and all operations ceased.
After that I always had the root
Re: (Score:3)
That's fine until he decides that typing "rm user1 user2 user3 user4..." is too much of a hassle and he replaces it with a script that lists the directories and removes them all. ...blissfully forgetting that there is a ".." directory. Oh .., how many well intended scripts have thee turned into the spawn of hell...
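A sketch of one way a cleanup script can avoid that trap entirely (paths invented): let find do the listing, so . and .. never show up in the first place:
# List first, then delete everything directly under the target and nothing above it
find /home/olddirs -mindepth 1 -maxdepth 1 -print
find /home/olddirs -mindepth 1 -maxdepth 1 -exec rm -rf -- {} +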
Re:I feel that lone sysadmin's pain (Score:5, Informative)
Correct pattern is:
> cd /home && rm ...
ie don't run rm unless cd worked.
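The failure mode it guards against, with a made-up path:
# With ';' the rm runs even when cd fails, in whatever directory you were already in:
cd /var/www/oldsite; rm -rf ./*     # dangerous if /var/www/oldsite doesn't exist
# With '&&' the rm is skipped entirely when cd fails:
cd /var/www/oldsite && rm -rf ./*   # nothing happens unless the cd succeeded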
Re: I feel that lone sysadmin's pain (Score:2, Informative)
Re: I feel that lone sysadmin's pain (Score:5, Insightful)
Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
Re: I feel that lone sysadmin's pain (Score:5, Interesting)
Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.
With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.
Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...
Re: (Score:3)
Or better yet, something that doesn't even have the string "rm" in it, like trash.
Re: I feel that lone sysadmin's pain (Score:5, Insightful)
I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.
If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.
Re: (Score:2)
This is a bad idea. rm is a sharp tool and you should never do anything to it that makes you think it isn't. One day you'll be working on somebody else's system but you'll have forgotten that rm can be dangerous and you'll merrily delete something career ending, go look for it in /trash and then have to commit ritual suicide.
Re: (Score:3)
There are many tricks. Personally, I like to tar stuff or do a ZIP-with-delete and keep it for a day or 3 before removal. For large quantities of data, that can take a while, though, so another possibility if one is working with snapshot-capable storage management is to snapshot it and work "offline" on the snapshot. I do this on VM images, for example.
Hot mirrors updated just infrequently enough that you can break the link before the damage propagates aren't a bad idea, either. Filesystems with "time machin
Re: (Score:3)
You might want to look into Squashfs. The archive command for a single directory (or file) is:
mksquashfs source_dir target_image.sqfs
If you want to do multiple directories or files, no problem:
mksquashfs source_dir1 source_dir2 source_file1 source_file2 target_image.sqfs
Squashfs generation time is comparable to that of tar.gz files. Not only does it do gzip compression natively, it can compress the inodes in the directory tree and also do fs level de-duplication. Squashfs is compatible with any kernel
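Getting the data back out, or just checking the image is sane, is equally simple (the mount point below is invented):
# List the contents without extracting anything:
unsquashfs -l target_image.sqfs
# Extract everything into ./squashfs-root:
unsquashfs target_image.sqfs
# Or mount it read-only and browse in place:
mount -t squashfs -o loop,ro target_image.sqfs /mnt/archive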
Re: (Score:2)
I think it was a patch to EVE Online that did the same thing, accidentally deleting / instead of some specific directory within the game.
Repeat after me (and others) (Score:5, Interesting)
Re: (Score:3, Informative)
It is a good example that replication is not a backup and is often a way to just mirror mistakes.
Re:Repeat after me (and others) (Score:5, Informative)
The six-hour-old snapshot was a fluke manual LVM snapshot run; normally they are taken every 24 hours. The SQL dumps weren't running at all because of misconfiguration, producing tiny little files and failing silently. Webhooks will need to be rolled back to the 24-hour backup since they were removed from the 6-hour one by a synchronization process (meaning at best 18 hours of updates will have no webhooks, but possibly all 24 hours at worst). Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3, for what I assume is vendor-agnostic redundancy, has sent no files at all ("the bucket is empty").
It's like they thought out everything but never made sure any of it was working.
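The "producing tiny little files and failing silently" part is exactly the sort of thing a dumb nightly sanity check catches. A sketch, with every path, size and threshold invented:
#!/bin/bash
# check_backup.sh -- scream if last night's dump is missing, stale, or suspiciously small
DUMP=$(ls -1t /backups/db_*.sql.gz 2>/dev/null | head -n1)
MIN_BYTES=$((1024*1024*1024))            # anything under 1 GB is assumed broken here

if [ -z "$DUMP" ]; then
    echo "no dump found at all" >&2; exit 1
fi
if [ "$(find "$DUMP" -mmin +1500 | wc -l)" -gt 0 ]; then
    echo "newest dump $DUMP is more than ~25 hours old" >&2; exit 1
fi
if [ "$(stat -c %s "$DUMP")" -lt "$MIN_BYTES" ]; then   # GNU stat
    echo "newest dump $DUMP is suspiciously small" >&2; exit 1
fi
echo "backup check OK: $DUMP"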
Re:Repeat after me (and others) (Score:5, Insightful)
No, and I got the wrong impression from skimming the article.
You are correct and I am not.
Re: (Score:2)
Re:Repeat after me (and others) (Score:4, Informative)
Re:Repeat after me (and others) (Score:4, Funny)
Look at his user ID. Give him time, he'll come around.
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
RAID fails because hard disks (probably of the same type and batch) running together wear at the same rate; the disks in an array do not fail independently, so their failure correlation tends to be quite high. This explains why rebuilding a RAID array after a failure can be a very dangerous operation and could easily lead to total failure. Usually, taking an (incremental) backup is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.
Re: (Score:2)
Re:Repeat after me (and others) (Score:5, Funny)
Sing with me, kids:
One backup in my bunk
One backup in my trunk
One backup at the town's other end
One backup on another continent
All of them tested and verified sane
now go to bed, you can sleep once again
Re:Repeat after me (and others) (Score:4, Funny)
Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)
(A: "Too many backups".)
--Actual question from the final exam for the Networking 100 class I took in 1998.
Re:Repeat after me (and others) (Score:5, Insightful)
"If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
OK, now that I have repeated it, let me add.
As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.
THEN you will see, for real, how your company reacts to real disasters.
The difference is that if anything goes _really_ wrong, you can turn the hard drive back on and fire a few people.
Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.
http://www.datacenterknowledge... [datacenterknowledge.com]
Merc.
Re: (Score:2)
We had plenty of actual fire drills, though.
Re:Repeat after me (and others) (Score:5, Insightful)
Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:
1. Perform backup
2. Pray nothing goes wrong.
Now hopefully they have something a lot more than that. But even if they don't test the backups, don't hold an "IT fire drill" to practice what to do when the feces hits the fan, and don't have disaster recovery backup servers and snapshots and whatever else they should have, they can still have completely documented their process and follow it, which is all the standards require.
Re: (Score:2)
Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company
You make the assumption that the CXO wants to save the company. Downtime costs happen this quarter. Benefit accrues to whoever is the CXO five years down the line. Why should the current CEO save the a** of the next CEO? Squeeze the company dry, show as much revenue/profit as possible, cash the stock options and skip town. By the time they discover that the shoddy backup vendor you hired to cut costs had been saving the data on "1TB" thumbdrives bought in some flea market in outer Mongolia, you are already well into
Re: (Score:2)
As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events.
As an IT professional, and occasional admin, you MUST have backup for your hardware to switch to, which mitigates the pain of live testing. The hardware is typically a small portion of the total cost of the business, even if you double it.
Re: (Score:3)
If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
Especially now that ransomware is overwriting online backups.
Re: (Score:3)
If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.
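A practice restore doesn't have to be fancy. Even something like this, run monthly, would have caught it (PostgreSQL assumed here, and the dump path and table names are guesses):
# Restore last night's dump into a scratch database and poke it
createdb restore_test
pg_restore --dbname=restore_test --jobs=4 /backups/db_latest.dump
# A couple of sanity queries beat "the file exists" every time:
psql -d restore_test -c "SELECT count(*) FROM projects;"
psql -d restore_test -c "SELECT max(created_at) FROM merge_requests;"
dropdb restore_test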
Don't use rm! (Score:3)
Re: (Score:2)
Re:Don't use rm! (Score:5, Funny)
Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
Don't tell the customer anything!! Geez... What's with these semi-pros?
Re:Don't use rm! (Score:4, Interesting)
Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.
This is why sysadmins were created, because the people actually using the computers didn't want to manage them.
Re:Don't use rm! (Score:4, Insightful)
Nowadays, since nobody wants to do sysadmin work and most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and Chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is that DevOps are Ops that develop scripts, or worse, a DevOps team/role sits between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos-driven, anything-goes affairs, since Sales promised a feature to a prospective customer yesterday, every week.
Re: (Score:2)
> Don't tell the customer anything until the dust settles!
That's one way to handle a major crisis, but if you're transparent about an issue, it puts a lot more minds at ease than it upsets, since then at least your customers know that you're aware of the problem, that you're working to fix it, and that they can communicate with you.
Re: (Score:2)
Yeah just mv foo /dev/null
No, you're missing the point. mv foo /some/safe/place, and when everything is working again and you're sure you don't need it... then and only then use rm.
Test your backups! (Score:3)
1. Test your backups
2. TEST your BACKUPS!
Re: (Score:2, Funny)
but NOT on your production hardware running live services.
me thinks gitlab should have browsed their hosted repos for some backup software.
Re: (Score:2)
There are plenty who disagree with this. Right or wrong, their arguments have merit.
At least it wasn't github.com (Score:3)
At least it wasn't github.com.
So, it didn't break the Internet.
And practically everything else.
Re: (Score:2)
Re: (Score:2)
GitHub isn't just code - there is a heck of a lot there which you don't get locally without lots of third-party tools and the hassle that comes with them.
Re: (Score:2)
That's only the systemd repo, most repos don't do that.
Made this mistake once... (Score:3)
I've made this mistake: I deleted all attachments on a live system once.
After this, I made all the prompts for critical servers a different color:
export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\] '
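For anyone copying the idea, a sketch of doing it conditionally in ~/.bashrc so only production boxes go red (the hostname patterns are obviously made up):
# Red background on anything whose hostname looks like production, plain elsewhere
case "$(hostname)" in
    prod-*|db-*)
        PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\] '   # red: think before you type
        ;;
    *)
        PS1='\u@\h:\w\$ '
        ;;
esac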
Re:Made this mistake once... (Score:5, Funny)
Good choice. But, I always use this prompt:
PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\\([^\\]\\{6\\}\\)[^\\]\\{2,\\}/\\1~1/g" ) >'
How can this keep happening? (Score:3, Interesting)
I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?
http://jefferai.org/2013/03/29/distillation/
When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.
Re:How can this keep happening? (Score:4, Informative)
KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.
If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.
If only there was another copy of the repo (Score:5, Funny)
Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
Re: (Score:3)
Just imagine if you had actually read the story. The git-repos are not affected.
Re: (Score:2)
Repos will be okay, it's all the ancillary stuff, i.e. the things that make them worth using over other git hosting companies. User management, wikis, release management, issue tracking etc.
Comment removed (Score:5, Interesting)
All my sympathy... (Score:5, Insightful)
Re: (Score:3)
An that is why you run BCM and recovery tests (Score:4, Interesting)
Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.
Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.
Re: (Score:2)
BCM? Bravo Company, manufacturer of firearm parts so you can shoot your servers? Buzzword-Centric Methodology? The SourceForge "BCM" project, a file compression utility? Baylor College of Medicine? Bear Creek Mining? Bacau International Airport? Broadcom?
Re: An that is why you run BCM and recovery tests (Score:2)
You first. If your head is so far up your ass that you can't tell when you're using a buzzword acronym with little exposure in the tech world and a lot of plausible meanings, you might be a tool.
DR Testing as a business model (Score:2)
Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?
Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.
I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and some network switching into one of those rock band roadie cases and take it to a business with the idea that
Re: (Score:2)
As a sysadmin, this sounds great (a bit 'brown trousers' for me personally, but great). However, one of my clients is entirely 'in the cloud', so no need for your truck of kit - just provide as many VMs as we like somewhere on t'internet. Ideally you'd be able to do this in a 'little internet' which has a VPN to get into it, has its own DNS servers, and maybe ways to 'bend' or alter requests to other cloudy services, such as Google or Amazon, such that the app 'thinks' it's talking to the real, live producti
Only perform reversible actions (Score:2)
A lesson always learnt the hard way. Those of us who have learnt it that way know the feeling before ("I'll trust that this is correct") and the feeling after ("Shiat!").
And of course (Score:2)
And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.
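Setting up such a hot spare is only a couple of commands once you have an account on the second service (the remote name and URL below are placeholders):
# One-time setup: add the second hosting service as an extra remote
git remote add backup git@gitlab.example.com:me/myproject.git
# Then mirror branches and tags to it whenever you push:
git push --all backup
git push --tags backup
# Or, from a bare clone kept just for this purpose, push absolutely everything in one go:
# git push --mirror backup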
What is frustrating is that, given all the progress in hardware reliab
Go ahead...yawn, but (Score:2)
That 4.5 GB of data happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!
So don't blame the guy, praise him and thank him for saving us all!
Six hours of loss is a "melt-down"? (Score:5, Insightful)
Re: (Score:3)
I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.
Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" are gone.
But sometimes it's the entir
Um, to clarify: (Score:2)
"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
Or, more accurately, fewer than 5 backup/replication techniques were actually deployed.
I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.
I do not miss sysadmin work so much.
Re: (Score:2)
The first sentence is true. The second one only achieves "should be true" status.