IT Technology

Inside the Longest Atlassian Outage of All Time (pragmaticengineer.com) 94

Gergely Orosz: We are in the middle of the longest outage Atlassian has had. Close to 400 companies and anywhere from 50,000 to 400,000 users had no access to JIRA, Confluence, OpsGenie, JIRA Status page, and other Atlassian Cloud services. The outage is in its 9th day, having started on Monday, 4th of April. Atlassian estimates many impacted customers will be unable to access their services for another two weeks. At the time of writing, 45% of companies have seen their access restored. For most of this outage, Atlassian has gone silent in communications across their main channels such as Twitter or the community forums. It took until Day 9 for executives at the company to acknowledge the outage.

While the company stayed silent, outage news started trending in niche communities. In these forums, people tried to guess causes of the outage, wondered why there was full radio silence, and many took to mocking the company for how it is handling the situation. Atlassian did no better with communicating with customers during this time. Impacted companies received templated emails and no answers to their questions. After I tweeted about this outage, several Atlassian customers turned to me to vent about the situation, hoping I could offer more details. Customers said the company's statements made it seem they had received support, which they, in fact, had not. Several customers hoped I could help get the attention of the company, which had not given them any details beyond telling them to wait weeks until their data is restored.

This discussion has been archived. No new comments can be posted.

  • by SoCalChris ( 573049 ) on Wednesday April 13, 2022 @01:46PM (#62443760) Journal

    You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?

    • by Anonymous Coward on Wednesday April 13, 2022 @01:50PM (#62443776)
      Apu went on vacation and his brother Rakesh accidentally unplugged something. Everything will be fine once Apu returns and plugs it back in.
      • Yes, "accidentally" *unplugs coffee maker and plugs server back in*
      • by rudy_wayne ( 414635 ) on Wednesday April 13, 2022 @01:57PM (#62443796)
        Company CTO, Apu, explained what is going on in a blog post yesterday:

        One of our standalone apps for Jira Service Management and Jira Software was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application.

        However, the script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

        • MOD THIS UP. What an amazing blunder.

          I'll take "SELECT *before* DELETE" for 500, Alex.
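The "SELECT before DELETE" habit can be sketched roughly as below. This is a minimal illustration, not Atlassian's script: the `sites` table, the `guarded_delete` helper, and the `expected_count` and `execute` parameters are all hypothetical names. The idea is that the function defaults to a dry run, previews exactly which rows the ID list matches, and refuses to delete unless the match count agrees with what the operator expected.

```python
import sqlite3


def guarded_delete(conn, ids, expected_count, execute=False):
    """Preview the rows matched by `ids`; delete only if explicitly asked.

    Hypothetical helper: the table and column names are assumptions made
    for illustration only.
    """
    ids = list(ids)
    if not ids:
        raise ValueError("refusing to run with an empty ID list")

    placeholders = ",".join("?" for _ in ids)

    # Step 1: SELECT first, so the operator sees what would be affected.
    matched = conn.execute(
        f"SELECT id, name FROM sites WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()

    # Step 2: sanity check against the count the operator expected.
    if len(matched) != expected_count:
        raise RuntimeError(
            f"expected {expected_count} rows, matched {len(matched)}; aborting"
        )

    # Step 3: the default mode is a dry run that changes nothing.
    if not execute:
        return {"mode": "dry-run", "matched": matched, "deleted": 0}

    # Step 4: the real delete, inside a transaction.
    with conn:
        cur = conn.execute(
            f"DELETE FROM sites WHERE id IN ({placeholders})", ids
        )
    return {"mode": "execute", "matched": matched, "deleted": cur.rowcount}


# Small in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
conn.commit()

preview = guarded_delete(conn, [1, 2], expected_count=2)
result = guarded_delete(conn, [1, 2], expected_count=2, execute=True)
remaining = conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
```

Making the destructive path opt-in means a wrong mode flag fails safe, and the count check catches at least some cases of a wrong ID list before anything is removed.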

        • by ljw1004 ( 764174 ) on Wednesday April 13, 2022 @02:29PM (#62443914)

          Company CTO, Apu, explained what is going on in a blog post yesterday: ...

          Link: https://twitter.com/Atlassian/... [twitter.com]

          This is from the @Atlassian twitter account, and has the same text as parent poster included.

        • by Anonymous Coward
          Why when I read this, does this sound like a similar situation as to what happened to Salesforce a few years ago (but with db permissions).

          Let me guess, Atlassian relies on DevOps/CI For Testing?
        • Company CTO, Apu, explained what is going on in a blog post yesterday:

          One of our standalone apps for Jira Service Management and Jira Software was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application. However, the script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

          The Cloud is infallible 100 percent secure, and simply has no possibility of failing. Only we can fail the cloud. I love this stuff, It was predictable when the suits bought into it as a way to get rid of those damn IT cost centers, and drank that cloud koolaid.

    • It's not like you have a choice anymore since they have discontinued their on premises products and forced their entire customer base to their (more expensive) cloud offerings.
      • by phantomfive ( 622387 ) on Wednesday April 13, 2022 @02:18PM (#62443878) Journal

        You don't need to use Atlassian. This is definitely a case where it's better to "roll your own" or write things down on paper.

      • by ctilsie242 ( 4841247 ) on Wednesday April 13, 2022 @03:01PM (#62443978)

        They still have their on-prem stuff. Except it is called "Data Center", and expect to pay for 500 users minimum for all but BitBucket which has 25 users minimum, or $2500/year. The cloud stuff is still not cheap, especially if you have your own SSO.

        Overall, I'm disappointed with Atlassian. Before they forced everyone to the cloud or Data Center, their products were a must have in most businesses. Need documentation? Stand up Confluence. Need to manage tickets? Jira Service Desk. Any/all IaC? Stuff that into BitBucket. Yes, their stuff was in Java, and you had to do a lot of hoops to get the TLS keys into the Java keystores, but it worked well. Backups were easy, as you could do both by the appliance, and have it dump its contents.

        The annoying thing is migrating away. Confluence can be replaced by SharePoint... sort of. Jira can be replaced by GitHub Issues. Bitbucket can be replaced by GitHub Enterprise, which isn't cheap, but the best in the industry, and extremely well supported.

        IMHO, they made a money grab to lock people into cloud subscriptions, and now have a good amount of egg on their face. When looking at their cloud offerings, I was not assured that they were up to compliance standards such as HIPAA, FERPA, or others, and all I had were vague assurances from their web page. Looks like my apprehension was justified.

        Maybe they need to go back to their pre-2019 pricing model. Not everyone can do a cloud solution, due to compliance issues, and not all companies need 500 licenses a year.

    • by phantomfive ( 622387 ) on Wednesday April 13, 2022 @02:16PM (#62443870) Journal

      Critical infrastructure? 400 companies just had their most productive period in recent memory. Managers were seen on Zoom calls looking confused.

    • Re: (Score:2, Funny)

      by AmiMoJo ( 196126 )

      If it had been local it would have gone down to ransomware and the budget backup system that they grudgingly paid for wouldn't have worked.

      Local often isn't any better.

    • by jellomizer ( 103300 ) on Wednesday April 13, 2022 @02:56PM (#62443954)

      I get economies of scale. For smaller orgs, a Cloud Solution gives them a lot of value for the buck, and they are able to get a big system for less that on average will operate better in a cloud environment. However, for big orgs with thousands of people and a large full time IT Staff... Cloud becomes more of a burden, as your org is already big enough to have the economy of scale, and can run the applications for less, because they don't need to pay for the profit margins.

      It is usually upper management knowing the cost of everything, but not the price. They see a million dollar piece of software or a $10,000 a month cloud solution. Not realizing they are going to nickel-and-dime you for every extra feature, and/or you live with a bad default that is easy to change locally but that they won't bother with in the cloud environment, which hinders your business workflows, with support that really doesn't care if they fix your problem today or in 3 years. And they are often better practiced in writing airtight legal contracts that put your company at a disadvantage when you try to strong-arm them to change.

    • You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?

      Offload your critical infrastructure to a 3rd party and they might screw up and leave you non-operational.

      Offload your critical infrastructure to in-house IT and they might screw up and leave you non-operational.

      I'm sure most people think they'll do better keeping it in-house just like most people think they're above average drivers. But while the cloud stuff sometimes turns out to be smoke and mirrors so do many of the in-house failsafes, especially since very few organizations really have the resources to

      • I understand what you're saying, and I've lived that existence, as well, but what you're describing is just freakin' bad management. This problem could be avoided by better management. I was that "sole IT guy" in one group of a large, multi-national corporation - German company, rhymes with Ciemans - and one day, I'd had enough, packed up all my shit, and left. The next week, coincidentally - and it truly was; it was no fault of mine - one of the servers went down, and the pimply faced youth that was pre
        • I understand what you're saying, and I've lived that existence, as well, but what you're describing is just freakin' bad management. This problem could be avoided by better management.

          Remember this org was only about 50 people, there's only so many IT people you can support. And they did have redundancy at that position... but redundancy sometimes means you're working on your 3rd backup (aka, me).

          However, management should've been thinking at least two steps ahead, in case I got run off the Tappan Zee bridge into the Hudson, or won the lottery: same difference. Not planning for redundancy is planning for failure. Not requiring documenting simple enough for any pimply faced youth to follow on how to start a "touchy" server is planning for failure. It happens in small and large companies alike.

          I feel like the large org should be better prepped for that (of course, they also have more systems to maintain). But even laying out a good recovery plan means the unfamiliar person responsible needs to locate the appropriate plan and figure out how to apply it.

          Not saying a good org can't do it

          • by bn-7bc ( 909819 )
            But management had planned multiple steps ahead: if it had become really bad, the entire CxO level would have triggered their parachutes and smiled all the way to the bank while someone else would have needed to deal with the trash fire that was threatening to consume the place.
    • Everything is bad. The question is "is it worse than doing it yourself?" The "other person's computer" is usually run by someone far more competent. You just think outages and catastrophes are new because you didn't read about them in the news in the past. Funny enough the news doesn't cover one company's internal problem, the only reason you hear about it is because multiple companies are affected.

      A single company can't give a shit about other customers, so the "eggs in one basket" retort doesn't apply in

    • You mean letting another company run your critical infrastructure can be bad? Who could have seen that coming?

      It's always a great idea to have a full set of local backups and storage, as well as an IT department to maintain it.

  • Ransomware (Score:2, Insightful)

    by samwichse ( 1056268 )

    Seems pretty obvious they got hit with a ransomware attack.

    • Re: (Score:2, Insightful)

      by Aighearach ( 97333 )

      And didn't have a robust backup system with practiced restoration.

    • Re:Ransomware (Score:5, Informative)

      by MikeDataLink ( 536925 ) on Wednesday April 13, 2022 @02:03PM (#62443822) Homepage Journal

      Seems pretty obvious they got hit with a ransomware attack.

      Except they didn't. A failed upgrade deleted customer data.

      • Seems pretty obvious they got hit with a ransomware attack.

        Except they didn't. Stupid, incompetent employees deleted customer data.

        Fixed

        • by gweihir ( 88907 )

          Except they didn't. Stupid, incompetent employees deleted customer data.

          Fixed

          Pretty much. This thing can happen, but it a) needs to be very unlikely and b) you need to be able to recover fast when it happens. It looks to me like they never tried this out in a test environment. You know, like minimally professional people would.

    • Seems pretty obvious they got hit with a ransomware attack.

      Everything seems like a ransomware attack when all you know is ransomware attacks. Put a bit more thought into your posts, or did someone ransomware your ability to think?

  • Worth checking (Score:5, Interesting)

    by boundary ( 1226600 ) on Wednesday April 13, 2022 @01:49PM (#62443772)

    Would be interesting to study whether productivity for those 400,000 users has gone up or down.

    • by Huitzil ( 7782388 ) on Wednesday April 13, 2022 @02:18PM (#62443876)
      well, the PM can't measure productivity because JIRA is down

      but employees likely are feeling like they actually can do work right now.
      • well, the PM can't measure productivity because JIRA is down

        Check the number of GitHub commits.

      • by Anonymous Coward

        Exactly! Previous management switched us to Agile 2-week sprints on JIRA. The managers had set up all these graphs showing velocity etc. We had to cut up our work in very small pieces even when it did not make sense and try to rate their difficulty, in that whole planning process that took a day, then we'd do the work for about 7 work days, then, the last 2 days of the cycle we'd mostly sit around, drink coffee etc because we were not allowed to start new tickets, as they might not be finished which would s

  • by oldgraybeard ( 2939809 ) on Wednesday April 13, 2022 @01:53PM (#62443782)
    if the cloud is going to be unreliable what is the purpose of the cloud from a service stand point. Except you have the illusion of competence until the collapse. Whatever their clients were paying for they got screwed. Why would anyone use their cloud service now?
    • by rudy_wayne ( 414635 ) on Wednesday April 13, 2022 @02:25PM (#62443894)

      if the cloud is going to be unreliable what is the purpose of the cloud

      To extract money from the technologically illiterate (e.g., most CEOs)

      • by GoTeam ( 5042081 )
        Or tech illiterate CIOs and CTOs believe their even more tech illiterate CEOs will be dazzled by the word CLOUD.
      • Any group can be organized in a distribution curve, under which you'll have some really smart, some really dumb, and some hovering around "average." Tech sales companies, like dictators, take advantage of the useful idiots to build their empires.
    • Cost mainly, and perhaps illusion of "cheap" scalability.
    • by ljw1004 ( 764174 ) on Wednesday April 13, 2022 @02:34PM (#62443924)

      if the cloud is going to be unreliable what is the purpose of the cloud from a service stand point. Except you have the illusion of competence until the collapse. Whatever their clients were paying for they got screwed. Why would anyone use their cloud service now?

      If you start a company/division, are you going to invest time and infrastructure and maintenance spinning up some on-prem system for tracking issues and project management? and train your staff? Or will you just say "use Atlassian", send around URLs, and everyone can get started immediately.

      And once your company/division has a lot of its project-tracking data locked up in Atlassian, and a workforce who are familiar with it, but now you've grown big enough to pay people to set up an alternative project-tracking system -- would you really do that? and lose weeks of productivity across your entire (now fairly large) company/division?

      I think that cloud service providers will always give the illusion of competence to the people they need to attract, and everyone will continue to sign up for them.

      • by dskoll ( 99328 ) on Wednesday April 13, 2022 @03:12PM (#62443996) Homepage

        If you start a company/division, are you going to invest time and infrastructure and maintenance spinning up some on-prem system for tracking issues and project management?

        Yep. Short-term pain for long-term gain, especially when a lot of on-prem stuff is open-source and fairly decent.

        • Double yep.

          Even if you don't have an on-prem server to start with, it's not that hard to rent a virtual machine from an ISP and spin up a bug tracker, wiki, source control, and file server if you need that. When you get really rolling, you can buy a local server and move all the stuff locally. At least then you own your data.

          I think the allure of Atlassian is the "SEP" phenomenon from Hitchhiker's Guide. Go with Atlassian, and it's "Somebody Else's Problem".
          • I like this approach. It's probably what I would do had it been my decision.

            A company HAS to own and safeguard its data and its processes. It can delegate hosting, technology, and many other things, but not responsibility.

            The problem with "SEP" is that companies or departments relying on something like the Atlassian suite, and are impacted by an outage like this, might no longer be operating near capacity, or perhaps at all.

            Their management would quite understandably ask why no one developed contingency p

      • by SendBot ( 29932 )

        And once your company/division has a lot of its project-tracking data locked up in Atlassian, and a workforce who are familiar with it, but now you've grown big enough to pay people to set up an alternative project-tracking system -- would you really do that?

        Ditch Jira? I'll buy two!

      • When starting up? Absolutely.

        But there should be, from the very start:

            * Local backups of EVERYTHING crucial

            * Contingency plans for what to do during outages.

            * Consideration at regular intervals (perhaps each 1/2 year?) of whether the Atlassian suite continues to be the better option, as opposed to many alternatives, both cloud and self-hosted.

    • It can offer significant value for small- to mid-sized companies whose core competency doesn't happen to be running server farms.

      The problem is that things that are critical need to be managed accordingly, and that's true whether they are in the cloud or not.

      At a very minimum, there should be ways to back up cloud data locally, and vice versa.

      There also should be processes in place for how to continue to stay in business if the cloud provider experiences outages.

      The same would be true if one had one's own i

      • "It can offer significant value for small- to mid-sized companies whose core competency doesn't happen to be running server farms." Bingo! You win the prize ;) If that is the argument shouldn't the cloud companies have a core competency running their systems? How about that "offer significant value" Now?
        • That's a fair point. Especially in this case. This wasn't a failure of the Internet (routing, DNS, etc.), but, by their own admission, a failure of Atlassian as a cloud hosting provider. It's not even so much as that some of their normal, "business as usual" processes failed. That happens to everyone from time to time. But normally there should be regular and robust disaster recovery planning, training, and testing. Clearly, that absolutely critical process also failed; it was either not sufficient, o

  • The Cloud (Score:5, Insightful)

    by MikeDataLink ( 536925 ) on Wednesday April 13, 2022 @02:05PM (#62443834) Homepage Journal

    The place every CEO wants to go, every CIO has a plan for, and the place most front line IT people are very skeptical of.

    • Re:The Cloud (Score:5, Informative)

      by Mark of the North ( 19760 ) on Wednesday April 13, 2022 @02:44PM (#62443932)

      I wouldn't say 'skeptical'. More like 'cautiously evaluating'. It really depends on what service you want to see cloud- or locally- hosted, and the competence of your cloud-hosting service vs your own staff.

      Our experience has been that the cloud services we buy have been much less troublesome than anything we host onsite, and far less troublesome than what we develop ourselves. The software we develop and that our customers self-host accounts for most of my headaches, and we are frantically porting those services to the web so that we can host it on our customers' behalf.

      When I started here, we had everything on-site. It was a nightmare. I constantly had to go in to the office in evenings, weekends, and even vacations. Since moving a few key services to cloud-based providers, I don't make unscheduled trips in to the office...at all.

    • It can be done well, but not without the same attention to planning, risk analysis, cost-benefit analysis, etc., as one would give toward any other decision of comparable significance.
  • Wow (Score:5, Funny)

    by Waffle Iron ( 339739 ) on Wednesday April 13, 2022 @02:26PM (#62443898)

    This must be a stressful time at Atlassian.

    I bet that right now their staff feels like they have the weight of the world on their shoulders.

  • by linuxguy ( 98493 ) on Wednesday April 13, 2022 @03:01PM (#62443980) Homepage
    Atlassian used to allow self hosting Jira and other products. They took that away not long ago, telling us that they can do a better job of managing the service. And that we should trust them with our data. Well?
    • by RegistrationIsDumb83 ( 6517138 ) on Wednesday April 13, 2022 @03:31PM (#62444042)
      The few people who stood up against this and switched to an alternative bug tracker are probably feeling pretty justified right now.
      • by bustinbrains ( 6800166 ) on Wednesday April 13, 2022 @04:12PM (#62444138)

        Jira is a bug tracker? I always thought of it as a soul-crushing, red-tape laden, unusable, unnavigable, expensive product that pretends to be a bug tracker but fails super hard at that task.

        • Jira is a bug tracker? I always thought of it as a soul-crushing, red-tape laden, unusable, unnavigable, expensive product that pretends to be a bug tracker but fails super hard at that task.

          Preach it!

        • Preach! I can't stand it. They sold our company on it as "agile".

          What a crock. Agile is designed to ensure you produce a mashed pile of bug-infested rotten spaghetti, topped with a third bolognaise sauce.

          And JIRA is virtually unusable. I've yet to figure out how I can use it without maximising the browser window. And even then, there's barely more than a postage stamp of useful information on each screen.

          • by SendBot ( 29932 )

            I really tried my best to take Agile/Jira seriously, but Jira is such a turd on its own, and I saw the same level of results you described. Good work seemed to happen despite Jira, not because of it. And bad work, unforgivable bugs, and lengthy delays and backtracking (or forward-hacking) were normal.

            Every time I'd ask why some bad decision was in place (how it got there or why it can't be fixed), the excuse was always "Agile". And a lot of reasonable suggestions (like "why don't we take inventory of requir

    • by RossGGG ( 963029 ) on Wednesday April 13, 2022 @03:45PM (#62444072)
      We still self-host Jira and Confluence onsite where I work and everything has continued to run smoothly through this debacle. I hope more customers demand they bring that model back as an option.
      • We still self-host Jira and Confluence onsite where I work and everything has continued to run smoothly through this debacle. I hope more customers demand they bring that model back as an option.

        How is it so many people here can't even be bothered to google. They still have options to self-host - just google the data center version of whatever you think you need, I'm pretty sure it exists.

        • Important changes to our server and Data Center products
          We’ve ended sales for new server licenses and will end support for server on February 15, 2024 PT. We’re continuing investment in Data Center with several key improvements. Learn what this means for you.

          Link [atlassian.com]
    • by atheos ( 192468 )
      I shut down our self hosted Jira instance just hours before this outage started. We aren't one of the affected customers, but it was interesting timing for us.
    • Atlassian used to allow self hosting Jira and other products. They took that away not long ago,

      So the fact that they still sell self-hosted versions of all their products (the "data center" versions) somehow ceased to exist in your reality?

      They stopped selling the single node server licenses. They didn't stop selling self hosted solutions.

  • by thesjaakspoiler ( 4782965 ) on Wednesday April 13, 2022 @06:59PM (#62444516)

    Every time I type a colon, I get these annoying emoticons and there is literally no way to disable those.
    Hundreds of people are complaining about it on the Atlassian forum and the only thing that Atlassian can do is mark the topic as resolved.
    All this hipster emoticon mayhem in a product for serious business applications doesn't make sense.
    Time to ditch Atlassian.

  • I just ran "sudo rm -rf /*" and they were gone.
  • by dsgrntlxmply ( 610492 ) on Thursday April 14, 2022 @09:42AM (#62446100)
    Atlassian believe that their shit doesn't stink. Their co-CEO Neville Throatwarbler-Mangrove has developed a 5 year old's fascination with industrial machinery and wants to buy electric power generation. Also he owns part of an American basketball team. He seems to be distracted by paying quite a lot of attention to expansive manifestations of wealth, and none to its source. Meanwhile the Confluence search function uses its amazing trade secret worst-first results ordering to conceal any first-person writings on crucial topics such as "how can I get Jira to take the fucking spurious bullet point markups off my latest comment?"
