Inside the Longest Atlassian Outage of All Time (pragmaticengineer.com) 94
Gergely Orosz: We are in the middle of the longest outage Atlassian has had. Close to 400 companies and anywhere from 50,000 to 400,000 users had no access to JIRA, Confluence, OpsGenie, JIRA Status page, and other Atlassian Cloud services. The outage is in its 9th day, having started on Monday, 4th of April. Atlassian estimates many impacted customers will be unable to access their services for another two weeks. At the time of writing, 45% of companies have seen their access restored. For most of this outage, Atlassian has gone silent in communications across their main channels such as Twitter or the community forums. It took until Day 9 for executives at the company to acknowledge the outage.
While the company stayed silent, outage news started trending in niche communities. In these forums, people tried to guess the causes of the outage and wondered why there was full radio silence, and many took to mocking the company for how it is handling the situation. Atlassian did no better at communicating with customers during this time. Impacted companies received templated emails and no answers to their questions. After I tweeted about this outage, several Atlassian customers turned to me to vent about the situation, hoping I could offer more details. Customers said the company's statements made it seem they had received support, which they, in fact, had not. Several customers hoped I could help get the attention of the company, which had not given them any details beyond telling them to wait weeks until their data is restored.
Everything doesn't need to be on the cloud (Score:5, Insightful)
You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?
Re:Everything doesn't need to be on the cloud (Score:4, Funny)
Re: (Score:2)
Re:Everything doesn't need to be on the cloud (Score:5, Informative)
One of our standalone apps for Jira Service Management and Jira Software was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application.
However, the script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.
Re: (Score:2)
MOD THIS UP. What an amazing blunder.
I'll take "SELECT *before* DELETE" for 500, Alex.
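The "SELECT before DELETE" idea, applied to the failure described in the quoted post (wrong execution mode, wrong list of IDs), might look something like the sketch below. This is a hypothetical illustration only, not Atlassian's actual tooling: the `objects` table, its `kind` column, and the `delete_apps` function are all made-up names, and SQLite stands in for whatever datastore is really involved.

```python
import sqlite3


def delete_apps(conn, app_ids, execute=False):
    """Preview the rows matching app_ids; delete them only when execute=True.

    Hypothetical schema: objects(id TEXT, kind TEXT), where kind is
    'app' or 'site'. Dry run is the default execution mode.
    """
    ids = list(app_ids)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)

    # SELECT first: see exactly which rows the DELETE would touch.
    rows = conn.execute(
        f"SELECT id, kind FROM objects WHERE id IN ({placeholders})", ids
    ).fetchall()

    # Guard against the wrong list of IDs: refuse anything that is not an app.
    wrong_kind = [row for row in rows if row[1] != "app"]
    if wrong_kind:
        raise ValueError(f"refusing to delete non-app rows: {wrong_kind}")

    if execute:
        # The connection context manager commits on success, rolls back on error.
        with conn:
            conn.execute(f"DELETE FROM objects WHERE id IN ({placeholders})", ids)
    return rows
```

Calling `delete_apps(conn, ids)` only reports what would be removed; the destructive path requires passing `execute=True` explicitly, and a list containing site IDs is rejected before anything is deleted.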
Re:Everything doesn't need to be on the cloud (Score:5, Informative)
Company CTO, Apu, explained what is going on in a blog post yesterday: ...
Link: https://twitter.com/Atlassian/... [twitter.com]
This is from the @Atlassian twitter account, and has the same text as parent poster included.
Re: (Score:1)
Let me guess, Atlassian relies on DevOps/CI For Testing?
Re: (Score:2)
Company CTO, Apu, explained what is going on in a blog post yesterday:
One of our standalone apps for Jira Service Management and Jira Software was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application. However, the script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.
The Cloud is infallible, 100 percent secure, and simply has no possibility of failing. Only we can fail the cloud. I love this stuff. It was predictable when the suits bought into it as a way to get rid of those damn IT cost centers, and drank that cloud koolaid.
Re: (Score:2)
Re:Everything doesn't need to be on the cloud (Score:5, Insightful)
You don't need to use Atlassian. This is definitely a case where it's better to "roll your own" or write things down on paper.
Re: Everything doesn't need to be on the cloud (Score:2)
Mantis bug tracker has been sufficient for me.
Re:Everything doesn't need to be on the cloud (Score:5, Insightful)
They still have their on-prem stuff. Except it is called "Data Center", and expect to pay for 500 users minimum for all but BitBucket which has 25 users minimum, or $2500/year. The cloud stuff is still not cheap, especially if you have your own SSO.
Overall, I'm disappointed with Atlassian. Before they forced everyone to the cloud or Data Center, their products were a must have in most businesses. Need documentation? Stand up Confluence. Need to manage tickets? Jira Service Desk. Any/all IaC? Stuff that into BitBucket. Yes, their stuff was in Java, and you had to jump through a lot of hoops to get the TLS keys into the Java keystores, but it worked well. Backups were easy, as you could do both by the appliance, and have it dump its contents.
The annoying thing is migrating away. Confluence can be replaced by SharePoint... sort of. Jira can be replaced by GitHub Issues. Bitbucket can be replaced by GitHub Enterprise, which isn't cheap, but is the best in the industry, and extremely well supported.
IMHO, they made a money grab to lock people into cloud subscriptions, and now have a good amount of egg on their face. When looking at their cloud offerings, I was not assured that they were up to compliance standards such as HIPAA, FERPA, or others, and all I had were vague assurances from their web page. Looks like my apprehension was justified.
Maybe they need to go back to their pre-2019 pricing model. Not everyone can do a cloud solution, due to compliance issues, and not all companies need 500 licenses a year.
Re: Everything doesn't need to be on the cloud (Score:2)
In both cases I can be certain that their sanity is "in".
Re: (Score:2)
GitLab is a decent solution. At the enterprise end, it is half the price of GitHub Enterprise. However, one should have support for it eventually, just because it holds so much company critical data. The downside is that upgrading and maintaining it is harder than GHE, which is completely appliance based. However, it is a solid option.
Re:Everything doesn't need to be on the cloud (Score:4, Informative)
They still have their on-prem stuff. Except it is called "Data Center", and expect to pay for 500 users minimum for all but BitBucket which has 25 users minimum, or $2500/year.
According to The Register they're getting rid of the on-premises option. Sucks to be customers.
Re: Everything doesn't need to be on the cloud (Score:2)
Re:Everything doesn't need to be on the cloud (Score:5, Funny)
Critical infrastructure? 400 companies just had their most productive period in recent memory. Managers were seen on Zoom calls looking confused.
Re: (Score:2, Funny)
If it had been local it would have gone down to ransomware and the budget backup system that they grudgingly paid for wouldn't have worked.
Local often isn't any better.
Re:Everything doesn't need to be on the cloud (Score:4, Insightful)
I get economies of scale. For smaller orgs, a cloud solution gives a lot of value for the buck: they get a big system for less, and on average it will operate better in a cloud environment. However, for big orgs with thousands of people and a large full-time IT staff, cloud becomes more of a burden, as the org is already big enough to have its own economy of scale and can run the applications for less, because it doesn't need to pay for the provider's profit margins.
It is usually upper management knowing the cost of everything, but not the price. They see a million dollar piece of software or a $10,000 a month cloud solution, not realizing the provider is going to nickel-and-dime you for every extra feature, and/or that you will have to live with a bad default that is easy to change locally but that they won't bother to change in the cloud environment, which hinders your business workflows, plus support that really doesn't care if they fix your problem today or in 3 years. And the provider is often better practiced in writing airtight legal contracts that put your company at a disadvantage when you try to strong-arm them into changing.
Re: (Score:3)
You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?
Offload your critical infrastructure to a 3rd party and they might screw up and leave you non-operational.
Offload your critical infrastructure to in-house IT and they might screw up and leave you non-operational.
I'm sure most people think they'll do better keeping it in-house just like most people think they're above average drivers. But while the cloud stuff sometimes turns out to be smoke and mirrors so do many of the in-house failsafes, especially since very few organizations really have the resources to
Re: (Score:2)
Re: (Score:2)
I understand what you're saying, and I've lived that existence, as well, but what you're describing is just freakin' bad management. This problem could be avoided by better management.
Remember this org was only about 50 people, there's only so many IT people you can support. And they did have redundancy at that position... but redundancy sometimes means you're working on your 3rd backup (aka, me).
However, management should've been thinking at least two steps ahead, in case I got run off the Tappan Zee bridge into the Hudson, or won the lottery: same difference. Not planning for redundancy is planning for failure. Not requiring documentation simple enough for any pimply faced youth to follow on how to start a "touchy" server is planning for failure. It happens in small and large companies alike.
I feel like the large org should be better prepped for that (of course, they also have more systems to maintain). But even laying out a good recovery plan means the unfamiliar person responsible needs to locate the appropriate plan and figure out how to apply it.
Not saying a good org can't do it
Re: (Score:2)
Re: (Score:2)
Everything is bad. The question is "is it worse than doing it yourself?" The "other person's computer" is usually run by someone far more competent. You just think outages and catastrophes are new because you didn't read about them in the news in the past. Funny enough the news doesn't cover one company's internal problem, the only reason you hear about it is because multiple companies are affected.
A single company can't give a shit about other customers, so the "eggs in one basket" retort doesn't apply in
Re: (Score:2)
You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?
It's always a great idea to have a full set of local backups and storage, as well as an IT department to maintain it.
Ransomware (Score:2, Insightful)
Seems pretty obvious they got hit with a ransomware attack.
Re: (Score:2, Insightful)
And didn't have a robust backup system with practiced restoration.
Re:Ransomware (Score:5, Informative)
Seems pretty obvious they got hit with a ransomware attack.
Except they didn't. A failed upgrade deleted customer data.
Re: (Score:2)
Seems pretty obvious they got hit with a ransomware attack.
Except they didn't. Stupid, incompetent employees deleted customer data.
Fixed
Re: (Score:2)
Except they didn't. Stupid, incompetent employees deleted customer data.
Fixed
Pretty much. This thing can happen, but a) it needs to be very unlikely and b) you need to be able to recover fast when it happens. It looks to me like they never tried this out in a test environment. You know, like minimally professional people would.
Re: (Score:2)
Seems pretty obvious they got hit with a ransomware attack.
Everything seems like a ransomware attack when all you know is ransomware attacks. Put a bit more thought into your posts, or did someone ransomware your ability to think?
Worth checking (Score:5, Interesting)
Would be interesting to study whether productivity for those 400,000 users has gone up or down.
Re:Worth checking (Score:5, Funny)
but employees likely are feeling like they actually can do work right now.
Re: (Score:2)
well, the PM can't measure productivity because JIRA is down
Check the number of GitHub commits.
Re: (Score:3)
You mean BitBucket commits?
Re: (Score:2)
I meant Perforce submittals.
Re: (Score:1)
Exactly! Previous management switched us to Agile 2-week sprints on JIRA. The managers had set up all these graphs showing velocity etc. We had to cut up our work in very small pieces even when it did not make sense and try to rate their difficulty, in that whole planning process that took a day, then we'd do the work for about 7 work days, then, the last 2 days of the cycle we'd mostly sit around, drink coffee etc because we were not allowed to start new tickets, as they might not be finished which would s
Weeks until services restored? (Score:5, Insightful)
Re:Weeks until services restored? (Score:4, Interesting)
if the cloud is going to be unreliable what is the purpose of the cloud
To extract money from the technologically illiterate (e.g., most CEOs)
Re: (Score:3)
Re: (Score:2)
Re: (Score:1)
Re:Weeks until services restored? (Score:5, Insightful)
if the cloud is going to be unreliable what is the purpose of the cloud from a service stand point. Except you have the illusion of competence until the collapse. Whatever their clients were paying for they got screwed. Why would anyone use their cloud service now?
If you start a company/division, are you going to invest time and infrastructure and maintenance spinning up some on-prem system for tracking issues and project management? and train your staff? Or will you just say "use Atlassian", send around URLs, and everyone can get started immediately.
And once your company/division has a lot of its project-tracking data locked up in Atlassian, and a workforce who are familiar with it, but now you've grown big enough to pay people to set up an alternative project-tracking system -- would you really do that? and lose weeks of productivity across your entire (now fairly large) company/division?
I think that cloud service providers will always give the illusion of competence to the people they need to attract, and everyone will continue to sign up for them.
Re:Weeks until services restored? (Score:4, Interesting)
If you start a company/division, are you going to invest time and infrastructure and maintenance spinning up some on-prem system for tracking issues and project management?
Yep. Short-term pain for long-term gain, especially when a lot of on-prem stuff is open-source and fairly decent.
Re: (Score:2)
Even if you don't have an on-prem server to start with, it's not that hard to rent a virtual machine from an ISP and spin up a bug tracker, wiki, source control, and file server if you need that. When you get really rolling, you can buy a local server and move all the stuff locally. At least then you own your data.
I think the allure of Atlassian is the "SEP" phenomenon from Hitchhiker's Guide. Go with Atlassian, and it's "Somebody Else's Problem".
Re: (Score:2)
I like this approach. It's probably what I would do had it been my decision.
A company HAS to own and safeguard its data and its processes. It can delegate hosting, technology, and many other things, but not responsibility.
The problem with "SEP" is that companies or departments that rely on something like the Atlassian suite, and are impacted by an outage like this, might no longer be operating near capacity, or perhaps at all.
Their management would quite understandably ask why no one developed contingency p
Re: (Score:2)
And once your company/division has a lot of its project-tracking data locked up in Atlassian, and a workforce who are familiar with it, but now you've grown big enough to pay people to set up an alternative project-tracking system -- would you really do that?
Ditch Jira? I'll buy two!
Re: (Score:2)
When starting up? Absolutely.
But there should be, from the very start:
* Local backups of EVERYTHING crucial
* Contingency plans for what to do during outages.
* Consideration at regular intervals (perhaps each 1/2 year?) of whether the Atlassian suite continues to be the better option, as opposed to many alternatives, both cloud and self-hosted.
Re: (Score:2)
It can offer significant value for small- to mid-sized companies whose core competency doesn't happen to be running server farms.
The problem is that things that are critical need to be managed accordingly, and that's true whether they are in the cloud or not.
At a very minimum, there should be ways to back up cloud data locally, and vice versa.
There also should be processes in place for how to continue to stay in business if the cloud provider experiences outages.
The same would be true if one had one's own i
Re: (Score:2)
Re: (Score:2)
That's a fair point. Especially in this case. This wasn't a failure of the Internet (routing, DNS, etc.), but, by their own admission, a failure of Atlassian as a cloud hosting provider. It's not even so much as that some of their normal, "business as usual" processes failed. That happens to everyone from time to time. But normally there should be regular and robust disaster recovery planning, training, and testing. Clearly, that absolutely critical process also failed; it was either not sufficient, o
Re: (Score:2)
The Cloud (Score:5, Insightful)
The place every CEO wants to go, every CIO has a plan for, and the place most front line IT people are very skeptical of.
Re:The Cloud (Score:5, Informative)
I wouldn't say 'skeptical'. More like 'cautiously evaluating'. It really depends on what service you want to see cloud- or locally- hosted, and the competence of your cloud-hosting service vs your own staff.
Our experience has been that the cloud services we buy have been much less troublesome than anything we host onsite, and far less troublesome than what we develop ourselves. The software we develop and that our customers self-host accounts for most of my headaches, and we are frantically porting those services to the web so that we can host it on our customers' behalf.
When I started here, we had everything on-site. It was a nightmare. I constantly had to go in to the office in evenings, weekends, and even vacations. Since moving a few key services to cloud-based providers, I don't make unscheduled trips in to the office...at all.
Re: (Score:2)
Wow (Score:5, Funny)
This must be a stressful time at Atlassian.
I bet that right now their staff feels like they have the weight of the world on their shoulders.
Re:Wow (Score:5, Funny)
Re: (Score:2)
People really need to consider the whole situation objectively.
These used to allow self hosted Jira (Score:5, Interesting)
Re: These used to allow self hosted Jira (Score:4, Insightful)
Re: These used to allow self hosted Jira (Score:4, Insightful)
Jira is a bug tracker? I always thought of it as a soul-crushing, red-tape laden, unusable, unnavigable, expensive product that pretends to be a bug tracker but fails super hard at that task.
Re: (Score:2)
Jira is a bug tracker? I always thought of it as a soul-crushing, red-tape laden, unusable, unnavigable, expensive product that pretends to be a bug tracker but fails super hard at that task.
Preach it!
Re: These used to allow self hosted Jira (Score:2)
Preach! I can't stand it. They sold our company on it as "agile".
What a crock. Agile is designed to ensure you produce a mashed pile of bug-infested rotten spaghetti, topped with a third bolognaise sauce.
And JIRA is virtually unusable. I've yet to figure out how I can use it without maximising the browser window. And even then, there's barely more than a postage stamp of useful information on each screen.
Re: These used to allow self hosted Jira (Score:2)
Turd, not third.
Re: (Score:2)
I really tried my best to take Agile/Jira seriously, but Jira is such a turd on its own, and I saw the same level of results you described. Good work seemed to happen despite Jira, not because of it. And bad work, unforgivable bugs, and lengthy delays and backtracking (or forward-hacking) were normal.
Every time I'd ask why some bad decision was in place (how it got there or why it can't be fixed), the excuse was always "Agile". And a lot of reasonable suggestions (like "why don't we take inventory of requir
Re:These used to allow self hosted Jira (Score:5, Informative)
Re: (Score:2)
We still self-host Jira and Confluence onsite where I work and everything has continued to run smoothly through this debacle. I hope more customers demand they bring that model back as an option.
How is it so many people here can't even be bothered to google? They still have options to self-host - just google the data center version of whatever you think you need, I'm pretty sure it exists.
Re: (Score:2)
We’ve ended sales for new server licenses and will end support for server on February 15, 2024 PT. We’re continuing investment in Data Center with several key improvements. Learn what this means for you.
Link [atlassian.com]
Re: (Score:3)
Re: (Score:2)
self hosted Jira instance ... We aren't one of the affected customers...
Ummm, why would you think this outage would have any effect on a self hosted instance? Don't think you needed to point out you aren't affected. This was a cloud outage from a cloud infra script.
Re: These used to allow self hosted Jira (Score:1)
Shut down in favor of hosted, I am guessing.
Re: (Score:2)
Atlassian used to allow self hosting Jira and other products. They took that away not long ago,
So the fact that they still sell self-hosted versions of all their products (the "data center" versions) somehow ceased to exist in your reality?
They stopped selling the single node server licenses. They didn't stop selling self hosted solutions.
Could they fix those annoying emoticons as well? (Score:4, Interesting)
Every time I type a colon, I get these annoying emoticons and there is literally no way to disable those.
Hundreds of people are complaining about it on the Atlassian forum and the only thing that Atlassian can do is mark the topic as resolved.
All this hipster emoticon mayhem in a product for serious business applications doesn't make sense.
Time to ditch Atlassian.
Re: (Score:2)
lol this is the funniest thing I've seen so far today! That's so inexcusably janky and a direct obstacle to productivity. People have been complaining about it for years, and still are; examples abound.
Someone found an effective workaround using adblock to block these
https://.atlassian.net/gateway... [atlassian.net]
https://.atlassian.com/images/... [atlassian.com]
Yes Sir! (Score:1)
Where are the customers' yachts? (Score:3)