Bug Communications Software

Software Bug Behind Biggest Telephony Outage In US History (bleepingcomputer.com) 106

An anonymous reader writes: A software bug in a telecom provider's phone number blacklisting system caused the largest telephony outage in US history, according to a report released by the US Federal Communications Commission (FCC) at the start of the month. The telco is Level 3, now part of CenturyLink, and the outage took place on October 4, 2016.

According to the FCC's investigation, the outage began after a Level 3 employee entered phone numbers suspected of malicious activity in the company's network management software. The employee wanted to block incoming phone calls from these numbers and had entered each number in fields provided by the software's GUI. The problem arose when the Level 3 technician left a field empty, without entering a number. Unbeknownst to the employee, the buggy software didn't ignore the empty field, like most software does, but instead viewed the empty space as a "wildcard" character. As soon as the technician submitted his input, Level 3's network began blocking all incoming and outgoing telephone calls — over 111 million in total.

  • Check the spec - perhaps it was by design or not called out to ignore empty entries?
    • Re:Bug or feature? (Score:5, Insightful)

      by geekmux ( 1040042 ) on Sunday April 01, 2018 @08:46PM (#56364649)

      Check the spec - perhaps it was by design or not called out to ignore empty entries?

      A null/blank input taken as a wildcard is certainly not a feature.

      Even labeling that as a mere bug is putting it mildly. More like gargantuan fuck-up.

      • by Anonymous Coward

        No, it's just a bug.

        Not finding it during testing is the gargantuan fuck-up.

        • Broken As Designed (Score:5, Insightful)

          by Anonymous Coward on Monday April 02, 2018 @03:10AM (#56365339)

          Ha ha. No, this is not just a bug. The fuckup goes much deeper than that. "An empty field acts as a wildcard" is the least of your problems. It may or may not be expected behaviour for a GUI. "Not finding it during testing" is par for the course for GUIs for this sort of thing. You're not supposed to give wrong input, even accidentally!

          The real problem is thinking a GUI is appropriate to feed lists of boring numbers through. By hand, no less. It's way too easy to accidentally leave a field empty or --if it's a micro-managing form like Windows IP address entry type things-- copy part of a number in the wrong line, shift it a sub-field, or something else similarly silly.

          What we have here is a mismatch between user interface and purpose, cooked up without thinking. This is the same mode that makes users stupid, but now it was the designer who wasn't thinking. The focus was on "getting some input fields done", not on "how will this be used and what might the consequences be?" The deeper problem is TFIing such lists. GUIs are entirely stupid for this.

          Compare "here, have a GUI" with this sequence: Check the list, then feed it to the system as a textfile. It gets queued. Then check the list as it appears in the system against your original list. The system probably should make explicit just what it will do with each entry, like "block one number" or "block a range of numbers". Possibly have someone else look over the proposed actions. THEN activate it.

          So the problem is that the workflow is entirely too stupid to live. And it was shaped into that form by a GUI.
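The queue-then-review sequence described above could look roughly like the following Python sketch. Everything in it is hypothetical: the function names `parse_blocklist` and `describe_plan`, the one-entry-per-line file format, and the trailing `*` for ranges are assumptions made for illustration, not a description of Level 3's tooling.

```python
import re

# Assumed format: one entry per line, digits only, with an optional trailing
# "*" to ask for a range explicitly. Blank lines are rejected, not guessed at.
ENTRY_RE = re.compile(r"^(\d{3,15})(\*?)$")


def parse_blocklist(text):
    """Return (entries, errors). Nothing is applied at this stage."""
    entries, errors = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        match = ENTRY_RE.match(line)
        if not match:
            errors.append(f"line {lineno}: rejected {raw!r}")
            continue
        digits, star = match.groups()
        entries.append((digits, bool(star)))
    return entries, errors


def describe_plan(entries):
    """Spell out what each queued entry would do, for a human to review."""
    plan = []
    for digits, is_range in entries:
        if is_range:
            plan.append(f"block a range of numbers starting with {digits}")
        else:
            plan.append(f"block one number: {digits}")
    return plan
```

Feeding it `"2125551212\n\n212555*\n"` yields two planned actions plus an error for the blank second line, so the operator (or a second reviewer) sees the problem before anything is activated.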

          • " You're not supposed to give wrong input, even accidentally!"

            I guess you never had a complaint from a user that your program behaved erratically, each time the secretary put heavy binders _on_ the keyboard?

            • You're misunderstanding him. He's saying that this GUI is entirely too fragile, where it should be fault-tolerant. In essence, you're preaching to the choir.
          • by bws111 ( 1216812 )

            Nothing about your textfile method is in any way superior to using a GUI. Nothing. A text file can have blank lines, long lines, short lines, lines with characters you didn't expect, and other such stuff. The requirement for checking inputs is no different with a text file than it is with a GUI. The only difference really is that a good GUI can provide more immediate feedback on incorrect entries.

            • by Anonymous Coward

              Of course you need to give it well-formed input, you always have to do that. But programs can be much better at communicating what they're going to do with it than they usually do. Your argument is that I'm saying textfiles are inherently superior to GUIs. That's a straw man: I'm giving a counter-example of how GUIs are not inherently better than other methods, like the dead simple one-number-per-line textfile approach. That "GUIs are just better"-attitude is so entrenched it's the status quo to the point o

      • by guruevi ( 827432 )

        It could be either, just depends on what the original spec said. I would find it useful to be able to cover an entire set of numbers simply by leaving them blank, very similar to eg. IP addresses. Even in some configuration files, you can type in 10/8 and it will recognize it as being "10.0.0.0/8".

        • No, if the spec said it was supposed to be a wildcard, then someone should have strongly questioned this. Ie, a bug in the spec. A dev saying "I just follow orders" isn't a good excuse. That's like saying you're just a programmer and thinking isn't a part of the job.

          • by Hognoxious ( 631665 ) on Monday April 02, 2018 @05:37AM (#56365587) Homepage Journal

            If it says "filter by" and you enter nothing you're saying filter by nothing, i.e. don't filter, i.e. give me everything.

            Plenty of software works like that. Otherwise the user is going to have to enter * in 47 different fields.

            Now if there's a minimum number of selections (filter by at least one of foo, bar and froblgobl) that should be enforced somewhere in the software, twice.

            It was probably created by one of these full-stack unicorns I keep hearing about.

        • Re:Bug or feature? (Score:5, Insightful)

          by arglebargle_xiv ( 2212710 ) on Monday April 02, 2018 @01:38AM (#56365207)
          It's a bug no matter what the spec says. Anything where you can shut down phone service for 110 million subscribers simply by hitting enter (without filling a field) isn't just a bug, it's a twelve-storey bug with a magnificent entrance hall, carpeting throughout, 24-hour portage, and an enormous sign on the roof, saying 'This Is a Massive Bug'.
      • Re:Bug or feature? (Score:4, Interesting)

        by rtb61 ( 674572 ) on Sunday April 01, 2018 @10:56PM (#56364939) Homepage

        Not even close. Under law, a professional is a professional and that ties to responsibility for actions. That design was professionally criminally negligent and should be treated as such, with the penalty to reflect the harm caused, and that means a possible custodial sentence along with a massive fine. Let's not get freaky on the custodial sentence though, probably sufficient to let them 'cool their jets' with no more than a 90 day sentence if no one died but at least 30 days, sort of put the wind up them, focus their attention, remind them there are real penalties for being a crap professional, being in a role you should not be in. If anyone dies though, manslaughter charges.

        Find those individually responsible, fine them, let them feel the weight of a custodial sentence, 30 days, and fine the company much more. Custodial sentences should be the norm for criminal negligence as a professional; start licensing coders because of the harm they can cause. Differing grades, low grade licences for low risk work, high grade licences for high risk work. If you do not force them to do a good job, they will continue to do a shitty job, with a meh, someone else's problem for the shitty work the coder has done.

        • Re:Bug or feature? (Score:5, Insightful)

          by johannesg ( 664142 ) on Monday April 02, 2018 @01:55AM (#56365237)

          Why are you blaming the programmer? The feature must have been designed; did the design call for empty being interpreted as a wildcard? It must have been tested; something as important as this has a testing budget associated with it, surely? Some company executive must have signed off on it. There will have been a formal handover from development to production. Did the programmer have the power to correct the design? Did he have the power to enforce testing? Did he have the power to stop deployment? Or was he just some underpaid wage slave who was paid by the hour to stamp out code as quickly as possible? Someone told by his manager he is just a warm body who can be replaced at a moment's notice? What was written in the user manual? Was there a procedure for blocking a number, and if so, was it followed? Was training given on how to use the software correctly? How can it be that the company has no liability, but somehow, someone who formed only a tiny part of the chain (and certainly not the best-paid part of it) should, according to you, now face prison time?

          • Re:Bug or feature? (Score:4, Insightful)

            by thegarbz ( 1787294 ) on Monday April 02, 2018 @02:57AM (#56365313)

            Why are you blaming the programmer? The feature must have been designed; did the design call for empty being interpreted as a wildcard?

            Don't assume it was designed. It's amazing how much gets "designed" at implementation if something isn't expressly stated in the specification. The only thing we know is that we don't know who to blame.

            That said the GP's assertion that someone should face prison time is completely stupid. Even by American "jail everyone for everything" standards.

            • Even by American "jail everyone for everything" standards.

              America does not have jail for everyone for everything standard.

              Usually if it looks like some ethnic minority is going to be on the receiving end, lots of internet warriors will post "jail time" posts. If it looks like it is going to be some white dude, the same posters will suddenly talk about onerous government tyranny, liberty and the founding fathers.

              Mercifully the actual courts are saner. Not perfect. But not as bad as these internet trolls.

              • America does not have jail for everyone for everything standard.

                You're either living in a bubble as to how your country works, or you're actually in a horrible shit hole of a country worse than other countries your president refers to as shit hole. https://en.wikipedia.org/wiki/... [wikipedia.org]

                And I've been to the USA, it's a lovely country so bubble it is. Step one is recognize you* have a problem.

                *Well not you specifically, your country has a problem, but you really should know about this problem.

        • 30 days in jail is a huge deal. I wouldn't recommend it for something like this. With so many people living paycheck to paycheck that's enough time to get behind on rent/mortgage, losing your home and if your marriage/relationship wasn't that strong that's an easy separation/divorce.

          You certainly wouldn't have a job after 30 days of not showing up, and your new job hunt doesn't happen while in jail.

          I really don't think you understand what you are recommending. Especially since I don't see you linking to any

        • If I ran the risk of getting imprisoned for something as easy to mistype as the notorious "goto fail" I'd steer well clear of software development.

        • by Nkwe ( 604125 )

          Find those individually responsible fine them, let them feel the weight of a custodial sentence, 30 days and fine the company much more.

          When looking for the "individual responsible", don't forget to consider whoever set the budget and didn't provide enough resources to create a design, review the design, create a prototype, test the prototype with real users, perform failure analysis, make changes to the design and prototype as necessary, test those changes, etc. The root cause of many software (or any project for that matter) failures is insufficient resources being applied to solve the problem. Granted you have to balance what you spe

      • A null/blank input taken as a wildcard is certainly not a feature.

        It usually is in filter boxes, isn't it?

      • The screw up wasn't this design decision. It was omitting basic double checks one should always have when making production changes, *especially* in a large environment like Level 3. Where was the review by a second operator? Where was the warning "this change will block over $threshold numbers"? It is ridiculous that one person at one point could make a large scale change like this.
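A minimal sketch of the two checks asked for above (a size threshold and a second operator), written in Python. The limits, the `ChangeRequest` class and the function names are all invented for illustration; nothing here is taken from the actual Level 3 or vendor software.

```python
from dataclasses import dataclass

# Assumed numbers, for illustration only.
NUMBER_LENGTH = 10               # digits in a full national number
MAX_NUMBERS_PER_CHANGE = 10_000  # the "$threshold" from the comment above


@dataclass
class ChangeRequest:
    prefixes: list
    author: str
    reviewer: str = ""


def numbers_affected(prefix):
    """Upper bound on how many full-length numbers a prefix rule covers."""
    return 10 ** max(0, NUMBER_LENGTH - len(prefix))


def check_change(change):
    """Return a list of reasons to refuse the change (empty means OK)."""
    problems = []
    total = sum(numbers_affected(p) for p in change.prefixes)
    if total > MAX_NUMBERS_PER_CHANGE:
        problems.append(
            f"this change will block up to {total} numbers "
            f"(limit {MAX_NUMBERS_PER_CHANGE})"
        )
    if not change.reviewer or change.reviewer == change.author:
        problems.append("needs review by a second operator")
    return problems
```

Under these assumptions a blank prefix counts as ten billion numbers, so a request containing one fails the threshold check regardless of who submits it.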
    • Check the spec - perhaps it was by design or not called out to ignore empty entries?

      The "by design" part is slightly plausible. But "not called out"? I haven't yet met either a programmer or a tester who wouldn't have at least tried out the 'null entry' scenario and flagged it as a problem. Heck, one of the most basic tests is to check what happens in the case of empty fields. This smacks more of somebody higher up ignoring test results and/or good advice.

      • Re:Bug or feature? (Score:5, Insightful)

        by magarity ( 164372 ) on Sunday April 01, 2018 @09:36PM (#56364783)

        But "not called out"? I haven't yet met either a programmer or a tester who wouldn't have at least tried out the 'null entry' scenario and flagged it as a problem.

        Have you never worked with offshore developers or testers? If it isn't itemized, they won't think to do it.

        • But "not called out"? I haven't yet met either a programmer or a tester who wouldn't have at least tried out the 'null entry' scenario and flagged it as a problem.

          Have you never worked with offshore developers or testers? If it isn't itemized, they won't think to do it.

          And thus they need an army of project managers and QA. Because cheaper!

      • That's why I think it may have been by design. Test a null entry, it does what it was supposed to - act as a wildcard...
    • by Greyfox ( 87712 )
      Well I'm pretty sure the desired number was blocked, so I'd call that a win!
  • by Anonymous Coward

    It was Linux.

    • by PPH ( 736903 )

      rm -rf / tmp/junk/

      • I did this once...on the first day of a new job.

        I was wondering why the delete was taking so long...until the other developers around started asking "what's going on with the server?".

        • I did this once...on the first day of a new job.

          Who gets root privileges on the first day? Even if you are a sysadmin you don't get root privileges on your first day around here.... But what system admin doesn't have a story like this? I once deleted ALL the E-Mail on a system with a wild card, but we had a full system backup that I made the night before so not much was lost beyond face.

          This is an example of why I NEVER run a console with admin privileges as the default. I do have "sudo", and any time I type "sudo" it's my reminder to pause and actual

  • by pirodude ( 54707 ) on Sunday April 01, 2018 @09:10PM (#56364703)

    I'm 99% sure they were using the Sonus EMS management software (L3 is a huge Sonus shop) to manage the PSX routing engine. The software works as longest match of the number. Since you have to always select the country, a blank entry would be treated as +1 and block everything after that or everything in the US.

    • The software works as longest match of the number.

      Why?

      • by pirodude ( 54707 ) on Sunday April 01, 2018 @11:19PM (#56364985)

        If you want to route all 212 area code numbers to a specific carrier you can just enter '212' and it will route them. If you want to do an NPA-NXX, just enter '212555'. Since it's longest match it will also work for a 'thousands block' (i.e., 2125551) and even down to the individual number (2125551212). US numbers don't mean a whole lot, but in other countries they specify specific geographic regions, carriers or number types. The backend database takes longest match for the most flexibility and the EMS UI is nothing more than a glorified frontend directly to the DB. There's little business logic actually protecting you.

        In a lot of cases, you want a wildcard match. I route a number of prefixes to different carriers with longer matches but I have a blank entry to default fall back directly to Level3 if I don't have any other carriers to handle calls.

        Everyone who uses Sonus knows this is how it works. It sounds like they gave a task to someone and only trained them on one piece of data entry. The fact that 800 people had access to this highly specialized software without higher level tooling that adds in the required business logic is the terrifying piece.
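To make the longest-match behaviour described above concrete, here is a small Python sketch. It is not the Sonus implementation; the `longest_match` function, the dictionary-based table and the carrier names are assumptions chosen to mirror the examples in the comment.

```python
def longest_match(table, number):
    """Return the rule stored under the longest prefix of `number`.

    `table` maps digit-string prefixes to rules. The empty string is a
    prefix of every number, so an entry keyed by "" matches everything.
    """
    for length in range(len(number), -1, -1):
        rule = table.get(number[:length])
        if rule is not None:
            return rule
    return None


routes = {
    "212": "carrier A",         # whole area code
    "212555": "carrier B",      # NPA-NXX
    "2125551": "carrier C",     # thousands block
    "2125551212": "carrier D",  # individual number
    "": "default carrier",      # blank entry: fallback for everything else
}
```

With this table, `longest_match(routes, "2125559999")` picks the NPA-NXX rule and `longest_match(routes, "3105550000")` falls through to the blank entry. The same property that makes a blank entry a handy default route makes it catastrophic in a block list: `longest_match({"": "block"}, n)` returns `"block"` for every `n`.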

        • If you want to route all 212 area code numbers to a specific carrier you can just enter '212' and it will route them. If you want to do an NPA-NXX, just enter '212555'. Since it's longest match it will also work for a 'thousands block' (i.e., 2125551) and even down to the individual number (2125551212).

          I may well be missing something, but I'm still not seeing why this scheme provides any benefit over one where you explicitly ask for a wildcard if that's what you want (using your examples, '212*', '212555*' or '212555????' and so on). A system where a blank entry means the rule will be applied to any and everything seems like it's just asking for the exact sort of trouble that arose here.
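The explicit-wildcard alternative suggested above can be sketched with Python's standard `fnmatch` module, whose `*` and `?` happen to behave like the patterns in the comment. The helper names `matches` and `is_blocked` are made up for this example.

```python
from fnmatch import fnmatchcase


def matches(pattern, number):
    """Shell-style match: '*' is any run of characters, '?' exactly one."""
    return fnmatchcase(number, pattern)


def is_blocked(patterns, number):
    """True if any explicit pattern in the list matches the number."""
    return any(matches(p, number) for p in patterns)
```

Under this scheme `is_blocked(["212555????"], "2125551212")` is true, while an accidentally blank pattern matches only the empty string, so `is_blocked([""], "2125551212")` is false. Blocking everything requires someone to type `*` on purpose.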

          • by Anonymous Coward on Monday April 02, 2018 @03:16AM (#56365345)

            I may well be missing something, but I'm still not seeing why this scheme provides any benefit over one where you explicitly ask for a wildcard if that's what you want

            Phone numbers are hierarchical and variable length (as opposed to e.g. IP addresses which are fixed length). a) The most common mistake to make is to route only a particular number or set of numbers as opposed to the hierarchy - using the shortest match by default avoids this mistake; b) the routing algorithm used normally also works on a hierarchy, so a wildcard match apart from the end of the number can be very costly and unwise; c) that's the way it's "always" been done and so doing something different would be confusing.

        • by Kjella ( 173770 )

          The backend database takes longest match for the most flexibility and the EMS UI is nothing more than a glorified frontend directly to the DB. There's little business logic actually protecting you. (...) Everyone who uses Sonus knows this is how it works. It sounds like they gave a task to someone and only trained them on one piece of data entry. The fact that 800 people had access to this highly specialized software without higher level tooling that adds in the required business logic is the terrifying piece.

          Well, is any API other than the EMS UI supported for creating DIY management tools, or do they want you writing directly to the database? Creating your own custom UI that calls the real UI behind the scenes seems excessively complex for what you get. Seems like EMS could fix this quite trivially with a warning and/or permissions, like you can only add blocks over a certain length, that blocking single numbers/area codes doesn't mean permission to pull the plug on entire countries. Even as a paid enhancement

      • It's just the simplest, most brain-dead way to code it that supports any form of multiple-matching entries. Probably not coincidentally it is also the method that takes the least amount of server load while still supporting the ability to test for any partial or full string equality whatsoever. I highly doubt whoever wrote that code even spent enough time thinking about it to realize this would be an inherent weakness to such an approach. If they did, they certainly didn't expect it would be handed over

  • 1) There should have been a warning: "Do you want to block all calls?" If not, then require the employee to enter a phone number.

    2) Or for a better solution, that form should not interpret a blank field as a wildcard. If all phone calls are to be blocked, then someone must sign on with a manager's user id, and fill out a special form that lets you block all phone calls.
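The two suggestions above amount to: never interpret a blank field, and put "block everything" behind a separate, deliberate path. A rough Python sketch follows, with hypothetical names (`normalise_entries`, `block_all_calls`, the `"manager"` role and the confirmation phrase are all invented for illustration).

```python
class BlankEntryError(ValueError):
    """Raised when a form field was submitted empty."""


def normalise_entries(fields):
    """Validate form fields; reject blanks instead of interpreting them."""
    cleaned = []
    for position, value in enumerate(fields, start=1):
        value = value.strip()
        if not value:
            raise BlankEntryError(
                f"field {position} is empty; enter a number or remove the field"
            )
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"field {position} is not a phone number: {value!r}")
        cleaned.append(value)
    return cleaned


def block_all_calls(user_role, confirmation):
    """Separate path for blocking everything, never reachable by accident."""
    if user_role != "manager":
        raise PermissionError("only a manager may block all calls")
    if confirmation != "BLOCK ALL CALLS":
        raise ValueError("confirmation phrase not entered")
    return "all calls blocked"
```

The point of the split is that the everyday form has no input that means "everything", so a slip of the hand can at worst produce an error message.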

  • According to the FCC's investigation, the outage began after a Level 3 employee entered phone numbers suspected of malicious activity in the company's network management software.

    No, the outage began years ago when someone created a process in which a human being manually enters data directly into a production system.

    I swear, if I had a nickel for every time a major fuckup was root caused to "human error in a process that should have never had a human factor to begin with", I could buy a house in the Bay Ar

  • by TheRealHocusLocus ( 2319802 ) on Monday April 02, 2018 @07:03AM (#56365787)

    In 1987 I had just taken a job at the local Telco and was hitting a steep learning curve. My experience to that point had been PC computers and networks, assembler, CBASIC, dBase and the like. This was an IBM System/38 and their billing software used RPG/III, which was a real structured language unlike its spaghetti-GOTO RPG/II cousin, but aspects were still position sensitive and opcodes were silly-simple compared to languages with which I was familiar. It was more like assembler than anything else. Most data flows consisted of running commands that generated a relational input stream sort of like an SQL query, through simple RPG programs.

    We had just installed an ITT 1210 switch and ITT had sent over a block of sample RPG code demonstrating how to parse the various fields and flags appearing on call tapes. My boss provided specs for the internal call ticket system they were using and the simple (!) task was to write a shim that generated a batch of call tickets from each tape. Pretty straightforward, tedious without being intricate. But one part of their code slapped me across the face when I examined it.

    The tape recorded end time and call duration in whole seconds; call start time would need to be calculated. They had supplied a routine to do this but it didn't make any sense because I could see no modulo 60 arithmetic in it; they were applying the simple RPG subtraction opcode on the zoned fields. I spent the most mystified HOUR of my LIFE searching the language manuals for the section that surely described RPG's 'magic' ops for manipulating times and dates, which I assumed had to be there because IBM is GREAT and I am STUPID... finding none. Forced to conclude that I was looking at concept code that was dashed off hurriedly in two minutes I confronted my boss with it (and my solution) but it was a hard sell at first, because my boss was incredulous too.
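For readers wondering why plain subtraction fails here, a short Python sketch of the base-60 arithmetic follows. It assumes the end time is stored as a decimal HHMMSS field (the post only says "zoned fields", so that layout is a guess), and the function names are invented for the example.

```python
def hhmmss_to_seconds(hhmmss):
    """Convert a decimal HHMMSS value (e.g. 100105) to seconds since midnight."""
    hours, rest = divmod(hhmmss, 10_000)
    minutes, seconds = divmod(rest, 100)
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hhmmss(total):
    """Convert seconds since midnight back to a decimal HHMMSS value."""
    total %= 86_400  # wrap calls that started before midnight
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours * 10_000 + minutes * 100 + seconds


def call_start(end_hhmmss, duration_seconds):
    """Start time = end time minus duration, done in seconds, not in decimal."""
    return seconds_to_hhmmss(hhmmss_to_seconds(end_hhmmss) - duration_seconds)
```

For a call ending at 10:01:05 that lasted 90 seconds, naive decimal subtraction gives `100105 - 90 = 100015` (10:00:15), whereas `call_start(100105, 90)` returns `95935`, i.e. 09:59:35.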

  • by 140Mandak262Jamuna ( 970587 ) on Monday April 02, 2018 @09:22AM (#56366351) Journal
    So these phone companies have the ability to block all incoming calls from malicious phones. All these days... All the complaints about spam callers... About scam artists posing as IRS employees .... They had the ability to block them. But they never did. Bastards.
