Optus Says Massive Australia Outage Was After Software Upgrade (reuters.com) 33
Australian telecoms provider Optus said on Monday that a massive outage which effectively cut off 40% of the country's population and triggered a political firestorm was caused by "changes to routing information" after a "routine software upgrade." From a report: More than 10 million Australians were hit by the 12-hour network blackout at the Singapore Telecommunications-owned telco on Nov. 8, triggering fury and frustration among customers and raising wider concerns about the telecommunications infrastructure.
Optus said in a statement that an initial investigation found the company's network was affected by "changes to routing information from an international peering network" early that morning, "following a routine software upgrade." It added: "These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves." The project to reconnect the routers was so large that "in some cases (it) required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia", it added.
So much for routing around problems (Score:2, Offtopic)
Seems like those days of the internet are long gone. If they ever existed.
Re:So much for routing around problems (Score:4, Insightful)
If the end user had presence on two independent network providers, then the Internet could route around the problem.
But if you only have a single ISP and the ISP has a routing problem, how are you going to "route around" that?
Re: (Score:3)
I wouldn't expect one fat-fingered mistake to take their entire network offline; I'd expect some kind of backup routing in their systems.
Re: (Score:2)
It's always DNS :)
Re:So much for routing around problems (Score:4, Insightful)
Except when it's BGP. Some mistakes are too big for DNS.
Re: (Score:3)
You can't have backups for routing, since that would create different routing information on the same network. The backups take the form of fail-safes that cause networks to island when they receive bogus information, to avoid creating network storms. And precisely this fail-safe is what caused the outage to take as long as it did. The initial problem was quickly identified and resolved. The exercise of then rejoining all the islanded networks - requiring techs to travel all over a very large country is what too
Re: (Score:2)
Perhaps they should consider having on site techs then. Lights out works great until it doesn't.
Re: (Score:2)
I don't think you understand how telecoms infrastructure works if you think having techs "on site" is a suitable option. Especially in a country like Australia. Hint: It's a bit bigger and more complex than your office building.
Re: (Score:1)
After you figure out how to confiscate 300 million plus firearms without a revolution, then you can eat crow when your expected result does not happen. What a fucking idiot. That you could possibly believe that means that you were dropped on your head at birth.
The only way out for the USA is a civil war and break-up of the 'country' into groupings of independent states. Really, the USA can't continue as a unified country for more than a few decades, a century at the outside.
Re: (Score:1)
Always fun to pull this one out at times like this. [twitter.com]
Of course we are not serious about mental health.
Is that similar to the Meta outage 13 months ago? (Score:4, Interesting)
Early October 2022 Faecebook / Meta suffered a major outage rather like this one (from what I remember) - a software update led to routing problems affecting all of their brands. The only one I use is Whatsapp and that was certainly down.
This even happens to professionals, not just amateurs like Optus.
Software upgrade, naah (Score:4, Informative)
The real culprit appears to have been a wrongly configured max-prefix setting.
Meaning that when a peer sent the prefix that pushed the received count over the limit, the routers tore down the BGP session => boom.
See
https://www.mail-archive.com/o... [mail-archive.com]
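For anyone who hasn't run into max-prefix before, here is a rough Python sketch of the idea. It is a toy model, not real router code and not Optus's configuration: the class name `BgpSession`, the helper `announce`, the limit of 5 and the example prefixes are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one BGP session with a max-prefix limit (hypothetical)."""
    peer: str
    max_prefixes: int
    prefixes: set = field(default_factory=set)
    established: bool = True

    def receive(self, prefix: str) -> bool:
        """Accept one announced prefix; return False once the session is down."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: tear the session down and forget every route
            # learned from this peer, which is what isolates the router.
            self.established = False
            self.prefixes.clear()
            return False
        return True


def announce(session: BgpSession, prefixes: list) -> bool:
    """Feed a batch of prefixes to the session and report whether it is still up."""
    for p in prefixes:
        session.receive(p)
    return session.established


# Assumed example data: a normal table, then a sudden flood of extra routes.
normal = [f"10.0.{i}.0/24" for i in range(3)]
flood = [f"172.16.{i}.0/24" for i in range(10)]

session = BgpSession(peer="upstream", max_prefixes=5)
print(announce(session, normal))  # True: 3 prefixes, under the limit
print(announce(session, flood))   # False: the 6th prefix trips the limit
```

In this sketch the session stays down until someone resets it by hand, which lines up with the "reconnect or reboot routers physically" part of the statement; real routers can often be configured to warn only or to restart the session after a timer instead.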
Re: (Score:2)
Translation (Score:5, Insightful)
"These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these," the company said.
We did not want to spend money to perform a proper upgrade, so we hope this reason confuses the plebs in the Courts and Gov. enough not to punish us.
Re: (Score:2)
No amount of money has ever eliminated human error. There's always a human somewhere.
They'll never admit it but... (Score:1)
How does that relate to this story? You could probably guess.
Attempt to Absolve Themselves (Score:3)
OOB (Score:3)
Why they didn't have an OOB network to reboot remote stuff is also hard to fathom.
Re: (Score:2)
Nope.
They off-shored the code/work and a backdoor was put in. They do not want to admit that.
Re: (Score:2)
Indeed. And worse: They had no rollback procedure! Obviously incompetent, greedy and grossly negligent cretins at work. There is no excuse for what happened. Hence, obviously, they try to lie by misdirection.
Never upgrade a running system (Score:2)
So no working rollback procedure? (Score:2)
I think that classifies as "gross negligence". They are likely unfit to operate any type of communication infrastructure.