'Trojan Source' Bug Threatens the Security of All Code (krebsonsecurity.com) 88

"Virtually all compilers -- programs that transform human-readable source code into computer-executable machine code -- are vulnerable to an insidious attack in which an adversary can introduce targeted vulnerabilities into any software without being detected," warns cybersecurity expert Brian Krebs in a new report. An anonymous reader shares an excerpt: Researchers with the University of Cambridge discovered a bug that affects most computer code compilers and many software development environments. At issue is a component of the digital text encoding standard Unicode, which allows computers to exchange information regardless of the language used. Unicode currently defines more than 143,000 characters across 154 different language scripts (in addition to many non-script character sets, such as emojis). Specifically, the weakness involves Unicode's bi-directional or "Bidi" algorithm, which handles displaying text that includes mixed scripts with different display orders, such as Arabic -- which is read right to left -- and English (left to right). But computer systems need to have a deterministic way of resolving conflicting directionality in text. Enter the "Bidi override," which can be used to make left-to-right text read right-to-left, and vice versa.

"In some scenarios, the default ordering set by the Bidi Algorithm may not be sufficient," the Cambridge researchers wrote. "For these cases, Bidi override control characters enable switching the display ordering of groups of characters." Bidi overrides enable even single-script characters to be displayed in an order different from their logical encoding. As the researchers point out, this fact has previously been exploited to disguise the file extensions of malware disseminated via email. Here's the problem: Most programming languages let you put these Bidi overrides in comments and strings. This is bad because most programming languages allow comments within which all text -- including control characters -- is ignored by compilers and interpreters. Also, it's bad because most programming languages allow string literals that may contain arbitrary characters, including control characters.

"So you can use them in source code that appears innocuous to a human reviewer [that] can actually do something nasty," said Ross Anderson, a professor of computer security at Cambridge and co-author of the research. "That's bad news for projects like Linux and Webkit that accept contributions from random people, subject them to manual review, then incorporate them into critical code. This vulnerability is, as far as I know, the first one to affect almost everything." The research paper, which dubbed the vulnerability "Trojan Source," notes that while both comments and strings will have syntax-specific semantics indicating their start and end, these bounds are not respected by Bidi overrides. [...] Anderson said such an attack could be challenging for a human code reviewer to detect, as the rendered source code looks perfectly acceptable. "If the change in logic is subtle enough to go undetected in subsequent testing, an adversary could introduce targeted vulnerabilities without being detected," he said. Equally concerning is that Bidi override characters persist through the copy-and-paste functions on most modern browsers, editors, and operating systems.

This discussion has been archived. No new comments can be posted.

  • Fucking dumb (Score:3, Insightful)

    by mveloso ( 325617 ) on Monday November 01, 2021 @07:07PM (#61949427)

    This is one of the fucking dumbest discoveries ever.

    The corollary to this is, "if someone can insert shit into your source code you're fucked."

    "Text editors allow anyone to edit files, which could lead to a compromise."

    • Re:Fucking dumb (Score:5, Insightful)

      by Darinbob ( 1142669 ) on Monday November 01, 2021 @07:17PM (#61949461)

      It's not a compiler flaw, so "virtually all compilers" is wrong. Instead it's a flaw in editors. Since I still use Emacs, I am immune for the moment :-)

      • Yep, and 32-bit VB6 doesn't recognize Unicode... you need 64 bits to do that.

        • by vbdasc ( 146051 )

          First, there is no 64-bit VB6, and second, 32-bit Windows NT recognizes Unicode perfectly well.

      • I don't know about all languages, but most languages don't really advertise that they work with Unicode directly. The exceptions I know of are Julia, APL and Fortress (a Fortran modernization), which aspire to use mathematical symbols fluently. Perhaps other languages allow that, but I've never seen it. So for all those, one could simply say the compiler will only process ASCII symbols. And boom, you're done. You can still put Unicode escapes in your strings and you can still get fucked by this if your

        • Raise your hand if you have ever cut and pasted a code snippet from a web page. We all know that occasionally your source gets borked when the code contains some weird character you can't see, like a "smart" quote, tabs instead of spaces, or some oddball Unicode like a subtle italic.

          If you are careful, you always paste your code into something that only accepts ASCII, like a terminal or a restrictive editor, or at least an editor that shows invisibles and expands tabs. Then cut and paste that into th

          • Never really used an editor that would accept straight-up unicode without getting into a particular mode first. Word and word processors don't count because they stick crap in anyway and you can't compile the resulting cut and pasted code anyway. Maybe this "bug" is due to the rise of everyone using IDEs that all try to be too smart for you? I mean, who would write a programmer's editor, or an editing window in an IDE, that would accept formatting characters and the like? A good programming editor will

      • by gweihir ( 88907 )

        It's not a compiler flaw, so "virtually all compilers" is wrong. Instead it's a flaw in editors. Since I still use Emacs, I am immune for the moment :-)

        Indeed. The claim that it is a compiler problem is entirely wrong. This is solely a problem with the things used to display code. Simply set your locale to "C" when reviewing code and disallow Unicode in source code entirely. Code written in Unicode is insane anyways. I have claimed so since around 1999 (after I attended a talk on the tech behind it) and see no reason to change that stance. Also because you get other problems like different characters being rendered to the same or an indistinguishable displ

    • Re: (Score:2, Interesting)

      by Anonymous Coward

      The corollary to this is, "if someone can insert shit into your source code you're fucked."

      "Text editors allow anyone to edit files, which could lead to a compromise."

      True. But prior to this realization, you could attempt to audit your source code to prevent malicious code from getting into your product, and still be completely blind to malware hidden right there in your source files. Heck, you wouldn't even need to have an intentionally malicious developer -- if someone cut & pastes example code snippets from somewhere like Stackoverflow or something, would anybody even realize that malware is getting put in?

      • Perhaps text editors will add some mode that both flags the use of the bidirectional override and gives you an option to see the input text with the bidirectional override characters highlighted. That would be nice.

        Also, make sure to check the output binary on an airgapped computer with a disassembler that you wrote yourself in machine language on said airgapped computer.
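The "see the text with the override characters highlighted" mode suggested above can be sketched in a few lines. This is only an illustration, not how any particular editor does it: the function name `reveal` and both sample strings are made up here, and the sketch flags every Unicode format character (category Cf), which is broader than just the Bidi controls.

```python
import unicodedata

def reveal(text):
    """Return text with invisible format characters (category Cf) spelled out by name."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Fall back to the code point if the character has no name.
            label = unicodedata.name(ch, "U+%04X" % ord(ch))
            out.append("<" + label + ">")
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical file name: a Bidi-aware display would render the text after
# the override reversed, so the name appears to end in a harmless extension.
filename = "photo_\u202Egpj.exe"

# Hypothetical source line with controls hidden inside a comment.
line = "/*\u202E } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"

print(reveal(filename))
print(reveal(line))
```

The revealed form is in logical order, which is the order a compiler or the operating system works with, so the real `.exe` suffix and the contents of the comment are visible regardless of how a renderer would reorder them.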

          • But what about all the Unicode characters that look just like ASCII characters?

          Or, how about a font that shows characters with different shapes. I'll be happy with that.

          (I use Verdana, it is about the only one. I, 1 l 0 O all different.)

    • by The Evil Atheist ( 2484676 ) on Monday November 01, 2021 @07:18PM (#61949469)
      Yeah, security researchers shouldn't spend time on dumb discoveries, like out-of-bounds access. They're too simple! They should only go after the very interesting sounding exploits, because no hacker would ever try to use simple exploits for the greatest gain.
    • by Arnonyrnous Covvard ( 7286638 ) on Monday November 01, 2021 @07:27PM (#61949497)
      The attack is making malicious code pass code review, because the attacker hides a planted bug by encoding the source code in a way which gives the reviewer a different view than the compiler. The attacker could be a rogue coworker or a contributor to open source software. Compilers should probably refuse to process any file with an unencoded bidi override Unicode codepoint.
      • by HiThere ( 15173 )

        Well, perhaps not *quite* that strict. They should require a specific compiler switch to allow it to be processed. And issue a warning message when that switch is present.

        • The way rust did it is you have to escape bidi characters

          • by tlhIngan ( 30335 )

            The way rust did it is you have to escape bidi characters

            Only bidi characters? Because that's a big vulnerability.

            We mock /. for not accepting Unicode, but the reason is Unicode has a lot of issues. There are thousands of control codepoints in Unicode, and hundreds more keep getting added all the time. Any one of those can be used to hide or obfuscate code or screw up the formatting of a web page. In fact, the attacks were such everyday occurrences that the editors added a strict filter because it's impossible t

            • by OolimPhon ( 1120895 ) on Tuesday November 02, 2021 @04:48AM (#61950365)

              The end result is if you accept Unicode in anything other than comments and string literals, there's a chance for the attack to happen.

              Um, no. Look at the example given. If you have a bidi string inside a literal, that string could contain an end-literal mark, which means that whatever else is in that literal - or hidden in the bidi string - would get executed.

              Saying that literals or strings are safe is not good enough. This attack is about the same as Bobby Tables, just a little more sophisticated.

      • That's the way rust did it. It also adds a lint to warn about it, so it will be obvious in the editor as well, even if the editor has no protection against it.

    • by vbdasc ( 146051 )

      "Text editors allow anyone to edit files, which could lead to a compromise."

      This should be changed to:

        "Text editors allow anyone to edit files, and their actual content might not be the same as what you see on the screen, which could lead to a compromise.".

      And yes, it's indeed worrisome.

    • It's not a discovery. /. discovered it first, that's why it doesn't support Unicode. You know, to stay secure and such.

    • The Linux kernel is written in ASCII not Unicode, so unicode issues don't impact it.

  • Cool! (Score:5, Insightful)

    by jenningsthecat ( 1525947 ) on Monday November 01, 2021 @07:20PM (#61949477)

    I guess Slashdot is immune then, since it doesn't do Unicode!

    • Really, Slashdot is seeming old because it doesn't accept emoji.

    • Re:Cool! (Score:5, Informative)

      by phantomfive ( 622387 ) on Monday November 01, 2021 @08:50PM (#61949725) Journal

      The reason Slashdot doesn't do Unicode is specifically to prevent these kinds of hacks, which were kind of annoying. For a while, Slashdot did support unicode.

      • by gTsiros ( 205624 )

        Input sanitation? What IS that???

      • Unicode support falls into the dustbin that contains OMG Ponies. At least we don't have to worry about inline GIFs and emojis cluttering up everything. BRB have to go and yell at some punk kids on my lawn again!
      • I thought Slashdot didn't do Unicode specifically so that technical dingbats (not the font) didn't have to feel shame when presented with typographically correct curly quotes.

  • Ken Thompson 1984 (Score:5, Informative)

    by Volanin ( 935080 ) on Monday November 01, 2021 @07:21PM (#61949479)

    This is a tame variant of the Ken Thompson Hack from 1984, which was the "root password of all evil". While this obfuscates malicious code inside the compiler with special Unicode characters, the Ken Thompson Hack proposes a way to "eternally" hijack a compiler to insert malicious code without ANY TRACE in the source code at all. While the attack might arguably be impractical, the read is really interesting:

    https://wiki.c2.com/?TheKenTho... [c2.com]

    • by FeelGood314 ( 2516288 ) on Monday November 01, 2021 @09:00PM (#61949737)
      Ken Thompson's attack was a way to compromise a compiler in such a way that all compilers built from the compromised compiler would also be compromised. This attack is a very, very clever code obfuscation. Normally I would reject any code that was submitted that wasn't clear about how it worked, even if the code seemed to work or came from a reputable source. No human reader has a hope against this, and this could even fool automated checks since they may not parse the code the same way the compiler does. There are similar attacks against X509 certificates.

      One fix might be to only allow ASCII for coding, but that puts the non-English world at a disadvantage. Also f*#k punycode for URLs; that's another easy way to obfuscate something to fool humans.
      • by AmiMoJo ( 196126 )

        It's not that clever, the exact same attack was used on MacOS a few years back. By using Unicode direction mark characters filenames could be obscured, making the user think an executable was actually an innocuous .jpg file, for example.

        I've said it before, Unicode is a massive screw-up and needs to be replaced. It made too many mistakes that can't be reversed or worked around, including the use of direction marks.

        • "making the user think an executable was actually an innocuous .jpg file"

          Did it change the icon that was next to the file name? Did it display one for an image file and not an executable?

          Whenever I use a point and click file manager I have a habit of checking for this. Others should do the same.

          • I know that an executable can have its own custom icon. It would be a good idea to set the file manager to only display the generic icons.

          • This was quite common on file sharing platforms in the early 2000s e.g. Eminem - Lose Yourself.mp3.exe, with an icon from some random media player.
            • Hiding extensions and, more importantly, file types is one of the biggest blunders anybody could make. At the very least, a big red box around the icon or something to indicate that a file is an executable should always be the default, and require a manual reg edit to disable.

              I understand that some users prefer or even need the minimalist view, but removing all indications that a file is executable/possible malware is a very bad move.

          • On Windows by default settings, the Explorer will not even show the extension. Change the icon to something looking harmless, and you have a good chance to fool most users. You still need a way to get past UAC for admin access, but overall I think this is Microsoft's second biggest security mistake in the history of the company.
            BTW, the biggest one was that at some point in the 2000s, Outlook would execute VB attachments to emails without user confirmation. One did not even need to open the email, showing i

      • Simple ways to avoid this:
        Have your source vault reject Bidi control characters.
        Have your editor highlight any lines that have Bidi characters.

      • by Anonymous Coward

        Everyone knows English already, and if they don't they certainly aren't writing code in any popular languages. It might be easier to fix all the programmers by having them write entirely in English than fixing all the editors.

    • by whh3 ( 450031 )

      But because Krebs said it we all think it's breaking news!

  • I learned Pascal on a very old Mac (which was new at the time).
    Maybe universities will go back to teaching Pascal.
    It doesn't (or at least didn't) support Unicode.

    • by HiThere ( 15173 )

      see: https://wiki.freepascal.org/FP... [freepascal.org]

    • I remember they taught Modula-2 at uni when I was there.

      I bought a copy of Stony Brook Modula-2 whose IDE flew on my OS/2 box at the time - a 100MHz 486DX4 with a whole 4MB of video memory. That compiler, with its IDE, made writing code a breeze. Those were the days!

      --
      I used to have a few good .sigs. You're stuck with this one.

    • by ELCouz ( 1338259 )
      Delphi has supported Unicode since 2009. Free Pascal Unicode support was added a bit later.
  • Boustrophedon (Score:4, Interesting)

    by Jodka ( 520060 ) on Monday November 01, 2021 @07:45PM (#61949537)

    from the /. summary:

    "Cambridge researchers wrote. 'For these cases, Bidi override control characters enable switching the display ordering of groups of characters.'"

    Quite convenient for all the ancient Etruscans writing boustrophedonically. [wikipedia.org]

  • by mcswell ( 1102107 ) on Monday November 01, 2021 @09:08PM (#61949757)

    There's a *really* easy fix: run a character code scanner to ensure that these Unicode characters (IIRC there are 3 of them) are not present in your code files. (The problem is only visual, i.e. when displayed by certain kinds of rendering devices; the computer just treats the characters like any other random Unicode characters.) Don't people run automatic scanners to check for other kinds of bugs? Like checking that you don't have any U+29E3 "equals sign and slanted parallel", which looks much like '#'. Putting code after this character would also make it look like commented-out code when it wasn't. (Not sure what the compiler would do with this character; maybe some compilers would treat it like part of a variable name?)

    It's not clear whether this can also happen when specifying code points (like typing '\u202E'). If so, you can also scan for those sequences with a regex.
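A scanner along these lines might look like the sketch below. It is a minimal illustration under stated assumptions: it checks for nine Bidi embedding, override and isolate controls (U+202A to U+202E and U+2066 to U+2069) rather than three, the names `scan`, `RAW` and `ESCAPED` are made up for this example, and the escape pattern only covers the `\uXXXX` spelling, not every escape syntax a language might accept.

```python
import re

# Bidi embedding/override controls (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"

RAW = re.compile("[" + BIDI_CONTROLS + "]")
# The same code points written as a backslash-u escape in source text.
ESCAPED = re.compile(r"\\u(?:202[A-Ea-e]|206[6-9])")

def scan(source):
    """Return (line, column, description) for each raw or escaped Bidi control."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in RAW.finditer(line):
            findings.append((lineno, m.start() + 1, "raw U+%04X" % ord(m.group())))
        for m in ESCAPED.finditer(line):
            findings.append((lineno, m.start() + 1, "escaped " + m.group()))
    return findings

sample = 'a = 1\nb = "x\u202Ey"\nc = "\\u202E"'
for finding in scan(sample):
    print(finding)
```

Whether an escaped code point is actually a problem depends on where the string ends up being displayed, so the escaped findings are probably better treated as warnings than as hard failures.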

    • Industry moves slowly. I still am getting warning messages at my work about Python 2.7 being removed in 2022 while our development environment sticks with Python 3.6 which is already 5 years old.

      Fortunately most of the highest risk industries don't accept open source contributions and there's little motivation for an internal developer to harm systems that way. I'd probably put defense contractors as high on the list of groups that should be concerned. Government work has lots of red tape slowing down adopt
    • Just fix the editor and the code review tools. They should cancel any global string setting at the end of comments or raw strings. That way the reviewed code view would match the compiler's
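One way a review tool could act on this is to check whether a comment or string literal leaves any Bidi override, embedding or isolate open when it ends. The sketch below assumes the tool has already extracted the text of a single comment or literal; the function name `unclosed_bidi` is made up here, and the stack handling is a simplification of the real Unicode Bidirectional Algorithm.

```python
# Controls that open an embedding or override; closed by PDF (U+202C).
EMBEDDINGS = {"\u202A", "\u202B", "\u202D", "\u202E"}
# Controls that open an isolate; closed by PDI (U+2069).
ISOLATES = {"\u2066", "\u2067", "\u2068"}
PDF = "\u202C"
PDI = "\u2069"

def unclosed_bidi(segment):
    """Count Bidi controls still open at the end of one comment or string literal."""
    stack = []
    for ch in segment:
        if ch in EMBEDDINGS:
            stack.append("E")
        elif ch in ISOLATES:
            stack.append("I")
        elif ch == PDF:
            # PDF only closes an embedding/override that is innermost.
            if stack and stack[-1] == "E":
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest isolate and anything opened inside it.
            if "I" in stack:
                while stack.pop() != "I":
                    pass
    return len(stack)

print(unclosed_bidi("/* \u202E } \u2066 if (isAdmin) */"))
print(unclosed_bidi("\u2067\u05E9\u05DC\u05D5\u05DD\u2069"))
```

A nonzero count means the directional state would leak past the end of the comment or literal into the surrounding code, which is the case a reviewer would want flagged; balanced right-to-left text inside a string returns zero and passes.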

    • There's a *really* easy fix: run a character code scanner to ensure that these Unicode characters (IIRC there are 3 of them) are not present in your code files.

      Easier fix: Add an "is there Unicode in the source" test to the compiler. On, and a fatal error, by default.

      You'd need the editors too, for when you have to support an older compiler.

    • UNICODE is an extensible standard.
      Better to test for acceptable characters rather than scan for (what you currently think are) the bad ones.

      If you miss a character that you really need to have, easy enough to add it to the acceptable list.
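The allowlist approach described here can be sketched as follows. The baseline set (printable ASCII plus tab, carriage return and newline) and the names `ALLOWED` and `disallowed` are assumptions made for this illustration; a real project would pick its own baseline.

```python
import string

# Assumed baseline: ASCII letters, digits, punctuation and common whitespace.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\r\n")

def disallowed(source, extra_allowed=""):
    """Return (index, code point) for every character not on the allowlist."""
    allowed = ALLOWED | set(extra_allowed)
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(source) if ch not in allowed]

print(disallowed("x = 1\n"))
print(disallowed("\u03C0 = 3"))
print(disallowed("\u03C0 = 3", extra_allowed="\u03C0"))
```

Adding a character you really need is then a one-line change to the `extra_allowed` argument, while anything new that Unicode introduces later is rejected by default.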

  • by spitzak ( 4019 ) on Monday November 01, 2021 @10:07PM (#61949851) Homepage

    Almost any syntax highlighting, which is provided as an option by almost every program used to display source code, will highlight as the compiler sees the text, resulting in an obviously messed-up display if this is attempted.
    It would also be trivial, and perhaps useful, if such editors ignored all bidi instructions, at least outside quoted strings and comments.

  • Just write code in palindrome, that's not vulnerable...
  • Looks to me the solution is as simple as reviewing the code in ASCII only mode. Source code itself does not require Unicode, only the string literals do.
  • But can't an automated tool that removes these things (possibly together with any BiDi text inside comments) solve the problem?

  • by cunniff ( 264218 ) on Tuesday November 02, 2021 @09:17AM (#61950933) Homepage

    From The Standard [unicode.org] (in Chapter 3, Basic Display Algorithm):

    The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.

    It seems to me that compilers and code editors could define a "paragraph" as a string literal, a comment, or an identifier token, which would allow for reasonable display of mixed L2R/R2L text without allowing for it to obfuscate the surrounding code. Additionally, this would mean that attempting to hide a string or comment terminator inside a L2R/R2L/L2R/etc. sequence would fail.

    For command shells, I would also suggest that path delimiters and terminating file extensions also delimit paragraphs. And probably other characters as required by the command shell environment.

    • by tepples ( 727027 )

      It seems to me that compilers and code editors could define a "paragraph" as a string literal, a comment, or an identifier token

      Code editors would need to be updated to provide "paragraph" definitions for all programming languages for which they support syntax highlighting.

      • by Briareos ( 21163 )

        Code editors would need to be updated to provide "paragraph" definitions for all programming languages for which they support syntax highlighting.

        Actually, to be on the safe side they'd need to provide that also for all programming languages they don't even support syntax highlighting for...

  • Add a feature to all editors: "View as 7-bit ASCII." Code points 0-127 are displayed as usual. Everything above that is displayed as 8-bit hexadecimal values of the form "\x??" where ?? are hex digits (or some other fancy editor-specific rendering that makes them stand out).

    The idea is that you can view the code as the compiler will see it -- as a stream of bytes. This should make it easier to see if what appear to be string constants or comments are actually trying to escape into the code stream, part
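A rough sketch of such a "view as 7-bit ASCII" rendering, assuming the file is read as raw bytes (the function name `ascii_view` is made up for this example):

```python
def ascii_view(data: bytes) -> str:
    """Render bytes 0-127 as-is and every other byte as a \\xNN escape."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append("\\x%02X" % b)
    return "".join(out)

# Hypothetical source line with a right-to-left override inside a string literal.
source = 'level = "user\u202E"'
print(ascii_view(source.encode("utf-8")))
```

In UTF-8 the override U+202E is the three bytes E2 80 AE, so it shows up as `\xE2\x80\xAE` in the middle of the literal instead of silently reordering what the reviewer sees.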

  • Raku (ex Perl 6) allows UTF-8 in constant, variable and function names (pi with the Greek pi character, sum with the capital Greek sigma character, etc.).
    As it was designed specifically with Unicode in mind, they may have already solved the problem. Or not?

  • Every compiler I have looked at discards comments. A documentation compiler will look specifically at the comments. I fail to see how this could be a threat.
