IT

Optus Says Massive Australia Outage Was After Software Upgrade (reuters.com) 33

Australian telecoms provider Optus said on Monday that a massive outage which effectively cut off 40% of the country's population and triggered a political firestorm was caused by "changes to routing information" after a "routine software upgrade." From a report: More than 10 million Australians were hit by the 12-hour network blackout at the Singapore Telecommunications-owned telco on Nov. 8, triggering fury and frustration among customers and raising wider concerns about the telecommunications infrastructure.

Optus said in a statement that an initial investigation found the company's network was affected by "changes to routing information from an international peering network" early that morning, "following a routine software upgrade." It added: "These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves." The project to reconnect the routers was so large that "in some cases (it) required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia", it added.


  • Seems like those days of the internet are long gone. If they ever existed.

    • by The-Ixian ( 168184 ) on Monday November 13, 2023 @12:04PM (#64002455)

      If the end user had presence on two independent network providers, then the Internet could route around the problem.

      But if you only have a single ISP and the ISP has a routing problem, how are you going to "route around" that?

      • by Viol8 ( 599362 )

        I wouldn't expect one fat-fingered mistake to take their entire network offline. I'd expect some kind of backup routing in their systems.

        • It's always DNS :)

        • You can't have backups for routing, since that would create different routing information on the same network. The backups take the form of failsafes that cause networks to island when they receive bogus information, to avoid creating network storms. And precisely this failsafe is what caused the outage to take as long as it did. The initial problem was quickly identified and resolved. The exercise of then rejoining all the islanded networks - requiring techs to travel all over a very large country - is what too

          • by Viol8 ( 599362 )

            Perhaps they should consider having on site techs then. Lights out works great until it doesn't.

            • I don't think you understand how telecoms infrastructure works if you think having techs "on site" is a suitable option. Especially in a country like Australia. Hint: It's a bit bigger and more complex than your office building.

  • by Vlad_the_Inhaler ( 32958 ) on Monday November 13, 2023 @12:44PM (#64002567)

    In early October 2021 Faecebook / Meta suffered a major outage rather like this one (from what I remember) - a software update led to routing problems affecting all of their brands. The only one I use is WhatsApp and that was certainly down.
    This even happens to professionals, not just amateurs like Optus.

  • by Zarhan ( 415465 ) on Monday November 13, 2023 @12:45PM (#64002571)

    The real culprit appears to have been a wrongly configured max-prefix setting.

    Meaning that when a peer sent the nth prefix, the one that went over the limit, the routers tore down the BGP session => boom.

    See

    https://www.mail-archive.com/o... [mail-archive.com]
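    For anyone unfamiliar with max-prefix, here is a toy sketch of the behaviour described above. It is not Optus's or any router vendor's actual implementation: the `BgpSession` class, the `simulate` helper, the limit values and the prefixes are all made up for illustration, and it assumes the simplest policy (tear the session down and stay down until someone intervenes).

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Hypothetical toy model of one BGP session with a max-prefix limit."""
    peer: str
    max_prefixes: int
    established: bool = True
    prefixes: set = field(default_factory=set)

    def receive(self, prefix: str) -> bool:
        """Accept one announced prefix; return False if the session is down."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop the session and forget everything learned
            # from this peer. In this sketch it stays down until reset by hand.
            self.established = False
            self.prefixes.clear()
            return False
        return True


def simulate(limit: int, announced: int) -> BgpSession:
    """Feed `announced` distinct made-up /24 prefixes into a fresh session."""
    session = BgpSession(peer="upstream", max_prefixes=limit)
    for i in range(announced):
        session.receive(f"10.{i // 256}.{i % 256}.0/24")
    return session


if __name__ == "__main__":
    ok = simulate(limit=1000, announced=1000)
    print(ok.established, len(ok.prefixes))
    tripped = simulate(limit=1000, announced=1001)
    print(tripped.established, len(tripped.prefixes))
```

    The point of the sketch is that a single prefix over the limit is enough to lose every route learned from that peer, which is why the limit needs headroom above what the peer normally sends.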

    • You MIGHT want to re-read that, esp. the second post, where an Akamai employee says that no changes occurred on their part.
  • Translation (Score:5, Insightful)

    by jmccue ( 834797 ) on Monday November 13, 2023 @12:57PM (#64002599) Homepage

    "These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these," the company said.

    We did not want to spend money to perform a proper upgrade, so we hope this reason confuses the plebs in the Courts and Gov. enough not to punish us.

  • The company I worked for switched to a gigantic multi-billion dollar IT contractor in India. Shortly after, they did a firewall update in the middle of the day and took down 3 hospitals and 21 clinics because of a typo or logical issue. They didn't document it or put in a change order or get it approved. In fact, it turns out they didn't have real networking degrees.
    How does that relate to this story? You could probably guess.
  • by organgtool ( 966989 ) on Monday November 13, 2023 @01:22PM (#64002673)
    This seems like an attempt by Optus to blame a third-party vendor and absolve themselves of most, if not all, of the blame. But I'll counter that by asking how Optus didn't catch this issue in ANY of their test environments. They do have plenty of test environments and detailed test scenarios they run before deploying any changes, including third-party software, right???!!!
    • by HBI ( 10338492 )

      Why they didn't have an OOB network to reboot remote stuff is also hard to fathom.

    • In fact, assume that this was the case. Then why hide what happened? They would have been MORE than glad to show it off if this was as simple as is being said now.
      Nope.
      They off-shored the code/work and a backdoor was put in. They do not want to admit that.
    • by gweihir ( 88907 )

      Indeed. And worse: They had no rollback procedure! Obviously incompetent, greedy and grossly negligent cretins at work. There is no excuse for what happened. Hence, obviously, they try to lie by misdirection.

  • The first rule of computers.
  • I think that classifies as "gross negligence". They are likely unfit to operate any type of communication infrastructure.
