'Trojan Source' Bug Threatens the Security of All Code (krebsonsecurity.com) 88
"Virtually all compilers -- programs that transform human-readable source code into computer-executable machine code -- are vulnerable to an insidious attack in which an adversary can introduce targeted vulnerabilities into any software without being detected," warns cybersecurity expert Brian Krebs in a new report. An anonymous reader shares an excerpt: Researchers with the University of Cambridge discovered a bug that affects most computer code compilers and many software development environments. At issue is a component of the digital text encoding standard Unicode, which allows computers to exchange information regardless of the language used. Unicode currently defines more than 143,000 characters across 154 different language scripts (in addition to many non-script character sets, such as emojis). Specifically, the weakness involves Unicode's bi-directional or "Bidi" algorithm, which handles displaying text that includes mixed scripts with different display orders, such as Arabic -- which is read right to left -- and English (left to right). But computer systems need to have a deterministic way of resolving conflicting directionality in text. Enter the "Bidi override," which can be used to make left-to-right text read right-to-left, and vice versa.
"In some scenarios, the default ordering set by the Bidi Algorithm may not be sufficient," the Cambridge researchers wrote. "For these cases, Bidi override control characters enable switching the display ordering of groups of characters." Bidi overrides enable even single-script characters to be displayed in an order different from their logical encoding. As the researchers point out, this fact has previously been exploited to disguise the file extensions of malware disseminated via email. Here's the problem: Most programming languages let you put these Bidi overrides in comments and strings. This is bad because most programming languages allow comments within which all text -- including control characters -- is ignored by compilers and interpreters. Also, it's bad because most programming languages allow string literals that may contain arbitrary characters, including control characters.
"So you can use them in source code that appears innocuous to a human reviewer [that] can actually do something nasty," said Ross Anderson, a professor of computer security at Cambridge and co-author of the research. "That's bad news for projects like Linux and Webkit that accept contributions from random people, subject them to manual review, then incorporate them into critical code. This vulnerability is, as far as I know, the first one to affect almost everything." The research paper, which dubbed the vulnerability "Trojan Source," notes that while both comments and strings will have syntax-specific semantics indicating their start and end, these bounds are not respected by Bidi overrides. [...] Anderson said such an attack could be challenging for a human code reviewer to detect, as the rendered source code looks perfectly acceptable. "If the change in logic is subtle enough to go undetected in subsequent testing, an adversary could introduce targeted vulnerabilities without being detected," he said. Equally concerning is that Bidi override characters persist through the copy-and-paste functions on most modern browsers, editors, and operating systems.
"In some scenarios, the default ordering set by the Bidi Algorithm may not be sufficient," the Cambridge researchers wrote. "For these cases, Bidi override control characters enable switching the display ordering of groups of characters." Bidi overrides enable even single-script characters to be displayed in an order different from their logical encoding. As the researchers point out, this fact has previously been exploited to disguise the file extensions of malware disseminated via email. Here's the problem: Most programming languages let you put these Bidi overrides in comments and strings. This is bad because most programming languages allow comments within which all text -- including control characters -- is ignored by compilers and interpreters. Also, it's bad because most programming languages allow string literals that may contain arbitrary characters, including control characters.
"So you can use them in source code that appears innocuous to a human reviewer [that] can actually do something nasty," said Ross Anderson, a professor of computer security at Cambridge and co-author of the research. "That's bad news for projects like Linux and Webkit that accept contributions from random people, subject them to manual review, then incorporate them into critical code. This vulnerability is, as far as I know, the first one to affect almost everything." The research paper, which dubbed the vulnerability "Trojan Source," notes that while both comments and strings will have syntax-specific semantics indicating their start and end, these bounds are not respected by Bidi overrides. [...] Anderson said such an attack could be challenging for a human code reviewer to detect, as the rendered source code looks perfectly acceptable. "If the change in logic is subtle enough to go undetected in subsequent testing, an adversary could introduce targeted vulnerabilities without being detected," he said. Equally concerning is that Bidi override characters persist through the copy-and-paste functions on most modern browsers, editors, and operating systems.
Fucking dumb (Score:3, Insightful)
This is one of the fucking dumbest discoveries ever.
The corollary to this is, "if someone can insert shit into your source code you're fucked."
"Text editors allow anyone to edit files, which could lead to a compromise."
Re:Fucking dumb (Score:5, Insightful)
It's not a compiler flaw, so "virtually all compilers" is wrong. Instead it's a flaw in editors. Since I still use Emacs, I am immune for the moment :-)
Re: (Score:2)
Yep, and 32-bit VB6 doesn't recognize Unicode... you need 64 bits to do that.
Re: (Score:2)
First, there is no 64-bit VB6, and second, 32-bit Windows NT recognizes Unicode perfectly well.
Julia, APL and Fortress (Score:2)
I don't know about all languages, but most languages don't really advertise that they work with Unicode directly. The exceptions I know of are Julia, APL and Fortress (a Fortran modernization), which aspire to use mathematical symbols fluently. Perhaps other languages allow that, but I've never seen it. So for all those one could simply say the compiler will only process ASCII symbols. And boom, you're done. You can still put Unicode escapes in your strings and you can still get fucked by this if your
The stack overflow attack (Score:3)
Raise your hand if you have ever cut and pasted a code snippet off a web page. We all know that occasionally your source gets borked when the code contains some weird character you can't see, like a "smart" quote or tabs instead of spaces or some oddball Unicode like a subtle italic.
If you are careful, you always paste your code into something that only accepts ASCII, like a terminal or a restrictive editor, or at least an editor that shows invisibles and expands tabs. Then cut and paste that into th
Re: (Score:2)
Never really used an editor that would accept straight-up unicode without getting into a particular mode first. Word and word processors don't count because they stick crap in anyway and you can't compile the resulting cut and pasted code anyway. Maybe this "bug" is due to the rise of everyone using IDEs that all try to be too smart for you? I mean, who would write a programmer's editor, or an editing window in an IDE, that would accept formatting characters and the like? A good programming editor will
Re: (Score:3)
It's not a compiler flaw, so "virtually all compilers" is wrong. Instead it's a flaw in editors. Since I still use Emacs, I am immune for the moment :-)
Indeed. The claim that it is a compiler problem is entirely wrong. This is solely a problem with the things used to display code. Simply set your locale to "C" when reviewing code and disallow Unicode in source code entirely. Code written in Unicode is insane anyways. I have claimed so since around 1999 (after I attended a talk on the tech behind it) and see no reason to change that stance. Also because you get other problems like different characters being rendered to the same or an indistinguishable displ
Re: (Score:2, Interesting)
The corollary to this is, "if someone can insert shit into your source code you're fucked."
"Text editors allow anyone to edit files, which could lead to a compromise."
True. But prior to this realization, you could attempt to audit your source code to prevent malicious code from getting into your product, and still be completely blind to malware hidden right there in your source files. Heck, you wouldn't even need to have an intentionally malicious developer -- if someone cut & pastes example code snippets from somewhere like Stackoverflow or something, would anybody even realize that malware is getting put in?
Re: (Score:3)
Perhaps text editors will add some mode that both flags the use of the bidirectional override and gives you an option to see the input text with the bidirectional override characters highlighted. That would be nice.
Also, make sure to check the output binary on an airgapped computer with a disassembler that you wrote yourself in machine language on said airgapped computer.
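A rough sketch of the "show me the hidden characters" mode suggested above. This is not how any particular editor does it; `reveal` is a made-up name, and the choice to expose every character in Unicode category Cf (format characters, which includes the Bidi controls but also things like zero-width joiners) is an assumption.

```python
import unicodedata


def reveal(text):
    """Replace each invisible format character (category Cf) with <ITS NAME>."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Fall back to the code point if the character has no name.
            out.append("<" + unicodedata.name(ch, f"U+{ord(ch):04X}") + ">")
        else:
            out.append(ch)
    return "".join(out)


# The override is written as an escape here so this snippet itself stays clean.
line = 'if level != "user\u202e"'
shown = reveal(line)
```

A reviewer looking at `shown` sees the override spelled out in place instead of having it silently reorder the rest of the line.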
Not just BiDi (Score:2)
But all the Unicode characters that look just like ASCII characters.
Or, how about a font that shows characters with different shapes. I'll be happy with that.
(I use Verdana, it is about the only one. I, 1 l 0 O all different.)
Re:Fucking dumb (Score:4, Funny)
Re: Fucking dumb (Score:1)
Re: (Score:2)
Re: (Score:1)
they discovered that most programmers don't "write" programs they maintain or copy something already running somewhere. this flaw is present because people need a way to interrupt a running process in memory to make "improvements" and the like.
Yes, dumb. That's what makes it dangerous. (Score:5, Informative)
Re: (Score:3)
Well, perhaps not *quite* that strict. They should require a specific compiler switch to allow it to be processed. And issue a warning message when that switch is present.
Re: Yes, dumb. That's what makes it dangerous. (Score:3)
The way Rust did it is you have to escape bidi characters.
Re: (Score:3)
Only bidi characters? Because that's a big vulnerability.
We mock /. for not accepting Unicode, but the reason is Unicode has a lot of issues. There are thousands of control codepoints in Unicode, and hundreds more keep getting added all the time. Any one of those can be used to hide or obfuscate code or screw up the formatting of a web page. In fact, the attacks were such everyday occurrences that the editors added a strict filter because it's impossible t
Re: Yes, dumb. That's what makes it dangerous. (Score:5, Informative)
The end result is if you accept Unicode in anything other than comments and string literals, there's a chance for the attack to happen.
Um, no. Look at the example given. If you have a bidi string inside a literal, that string could contain an end-literal mark, which means that whatever else is in that literal - or hidden in the bidi string - would get executed..
Saying that literals or strings are safe is not good enough. This attack is about the same as Bobby Tables, just a little more sophisticated.
Re: (Score:2)
It's almost like people should be sanitizing inputs, whether it's a literal being put into a SQL statement or otherwise.
Re: Yes, dumb. That's what makes it dangerous. (Score:3)
That's the way Rust did it. It also adds a lint to warn about it, so it will be obvious in the editor as well, even if the editor has no protection against it.
Re: (Score:3)
"Text editors allow anyone to edit files, which could lead to a compromise."
This should be changed to:
"Text editors allow anyone to edit files, and their actual content might not be the same as what you see on the screen, which could lead to a compromise.".
And yes, it's indeed worrisome.
Re: (Score:2)
It's not a discovery. /. discovered it first, that's why it doesn't support Unicode. You know, to stay secure and such.
Re: (Score:2)
The Linux kernel is written in ASCII not Unicode, so unicode issues don't impact it.
Re:Who uses UNICODE for source code? (Score:4, Insightful)
Lots of code is generated in utf-8. If you're using ASCII-7 text files, those are automatically handled by compilers that assume they are utf-8. And many languages presume that the input text is utf-8. ASCII-7 is a proper subset of utf-8 so you wouldn't notice. https://gcc.gnu.org/onlinedocs... [gnu.org]
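A small sketch of the subset claim above, using only Python's built-in codecs. `is_ascii_source` is a hypothetical helper name, and the C snippet is just sample input.

```python
def is_ascii_source(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


plain = b'int main(void) { return 0; }\n'

# Pure ASCII bytes decode to the same text under either codec, which is why
# a compiler that assumes UTF-8 input never notices the difference.
same = plain.decode("ascii") == plain.decode("utf-8")

# Anything outside ASCII becomes multi-byte in UTF-8 and fails the check.
accented = "caf\u00e9".encode("utf-8")
```

The flip side is that a file can quietly stop being ASCII without any tool complaining, unless something like `is_ascii_source` is run on it.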
Re: (Score:2)
This actually has some insight. It's why you learn never to use 'editors' like MS Word etc. to write code. Early on, in the 90s, I would see people sometimes doing that, then pasting it into an IDE. And then they couldn't figure out why the code wouldn't compile or would run weird when the control characters from Word would come across into the IDE, but wouldn't be displayed there, either. Of course that was benign fucktardery, but I never thought about someone trying to use it for evil. But yes, I avoided i
Re: (Score:2)
Many European countries (Spain, Germany, France, Hungary...) have some non-ASCII characters. To what extent programmers in those languages use those characters (or even write in their native languages--e.g. variable names or comments), I don't know.
Re: (Score:3)
In this particular case, it's Israel and the Arab countries.
Re: (Score:2)
In this particular case, it's Israel and the Arab countries.
Almost all code I have seen from around the world is ASCII. I did see some French in comments once, which had accents.
The point is, the words and syntax you use to write code are not English, or any other human language for that matter. I don't see much of an advantage in being able to write variable and function names in UTF-8, rather than the usual subset of ASCII typified by C. UTF-8 belongs in strings. Naturally, you might well want Hebrew or Arabic in string literals, even if the surrounding code
Re: (Score:2)
I do. I'm a computational linguist, so I have to do special processing of lots of Unicode characters, and make special rules to do this (usually in Python, sometimes in XML that gets translated into FST code).
I don't suppose everyone is a computational linguist, though...
Cool! (Score:5, Insightful)
I guess Slashdot is immune then, since it doesn't do Unicode!
Re: (Score:2)
As a linguist/ computational linguist, I use Unicode (outside the Basic Latin, i.e. ASCII) all the time. It is not a crock, it is a far better solution for scripts that use non-ASCII characters than anything that came before.
As for being stuck in Unicode, I would say that's an advantage, because fonts and other support for Latin-1 (ISO 8859-1) is getting less and less. But I don't know why you get stuck in there; I program in Python, and it's quite capable of reading and writing non-Unicode. Maybe you're
Re: (Score:2)
Really, Slashdot is seeming old because it doesn't accept emoji.
No emojis is good! (Score:3)
Last thing we want is flashing emojis!
Re: Cool! (Score:2)
Keeps the low IQ kiddies away.
Because who wants to see (big sausage string of emojis) "LOL! MEE TOO! ROFOLOHLOMOKOHPTERBALLESHIP!" on /.
Re:Cool! (Score:5, Informative)
The reason Slashdot doesn't do Unicode is specifically to prevent these kinds of hacks, which were kind of annoying. For a while, Slashdot did support unicode.
Re: (Score:2)
Input sanitation? What IS that???
Re: (Score:3)
That's what they're doing. It's overly aggressive input sanitation.
Re: (Score:2)
Re: (Score:2)
I thought Slashdot didn't do Unicode specifically so that technical dingbats (not the font) didn't have to feel shame when presented with typographically correct curly quotes.
Ken Thompson 1984 (Score:5, Informative)
This is a tame variant of the Ken Thompson Hack from 1984, which was the "root password of all evil". While this obfuscates malicious code inside the compiler with special Unicode characters, the Ken Thompson Hack proposes a way to "eternally" hijack a compiler to insert malicious code without ANY TRACE in the source code at all. While the attack might arguably be impractical, the read is really interesting:
https://wiki.c2.com/?TheKenTho... [c2.com]
Not the Ken Thompson attack (Score:4, Informative)
One fix might be to only allow ASCII for coding, but that puts the non-English world at a disadvantage. Also f*#k punycode for URLs, that's another easy way to obfuscate something to fool humans.
Re: (Score:3)
It's not that clever, the exact same attack was used on MacOS a few years back. By using Unicode direction mark characters filenames could be obscured, making the user think an executable was actually an innocuous .jpg file, for example.
I've said it before, Unicode is a massive screw-up and needs to be replaced. It made too many mistakes that can't be reversed or worked around, including the use of direction marks.
Re: Not the Ken Thompson attack (Score:2)
"making the user think an executable was actually an innocuous .jpg file"
Did it change the icon that was next to the file name? Did it display one for an image file and not an executable?
Whenever I use a point and click file manager I have a habit of checking for this. Others should do the same.
Re: Not the Ken Thompson attack (Score:2)
I know that an executable can have its own custom icon. It would be a good idea to set the file manager to only display the generic icons.
Re: (Score:2)
Re: Not the Ken Thompson attack (Score:2)
Hiding extensions and, more importantly, file types is one of the biggest blunders anybody could make. At the very least, a big red box around the icon or something to indicate that a file is an executable should always be the default, and require a manual reg edit to disable.
I understand that some users prefer or even need the minimalist view, but removing all indications that a file is executable/possible malware is a very bad move.
Re: (Score:1)
With default settings, Windows Explorer will not even show the extension. Change the icon to something looking harmless, and you have a good chance to fool most users. You still need a way to get past UAC for admin access, but overall I think this is Microsoft's second biggest security mistake in the history of the company.
BTW, the biggest one was that at some point in the 2000s, Outlook would execute VB attachments to emails without user confirmation. One did not even need to open the email, showing i
Re: (Score:3)
Simple ways to avoid this:
Have your source vault reject Bidi control characters.
Have your editor highlight any lines that have Bidi characters.
Re: (Score:1)
Everyone knows English already, and if they don't they certainly aren't writing code in any popular languages. It might be easier to fix all the programmers by having them write entirely in English than fixing all the editors.
Re: (Score:3)
But because Krebs said it we all think it's breaking news!
Pascal! (Score:2)
I learned Pascal on a very old Mac (which was new at the time).
Maybe universities will go back to teaching Pascal.
It doesn't (or at least didn't) support Unicode.
Re: (Score:3)
see: https://wiki.freepascal.org/FP... [freepascal.org]
Re: (Score:2)
I remember they taught Modula-2 at uni when I was there.
I bought a copy of Stony Brook Modula-2 whose IDE flew on my OS/2 box at the time - a 100MHz 486DX4 with a whole 4MB of video memory. That compiler, with its IDE, made writing code a breeze. Those were the days!
-- I used to have a few good .sigs. You're stuck with this one.
Re: (Score:2)
Boustrophedon (Score:4, Interesting)
from the /. summary:
"Cambridge researchers wrote. 'For these cases, Bidi override control characters enable switching the display ordering of groups of characters.'"
Quite convenient for all the ancient Etruscans writing boustrophedonically. [wikipedia.org]
Re:Boustrophedon (Score:5, Interesting)
Re:If it can't be expressed in ascii characters (Score:4, Funny)
You had ascii characters?
We had to make do with binary coded decimal. And we liked it!
Fix (Score:3)
There's a *really* easy fix: run a character code scanner to ensure that these Unicode characters (IIRC there are 3 of them) are not present in your code files. (The problem is only visual -- i.e. when displayed by certain kinds of rendering devices -- the computer just treats the characters like any other random Unicode characters.) Don't people run automatic scanners to check for other kinds of bugs? Like checking that you don't have any U+29E3 "equals sign and slanted parallel", which looks much like '#'. Putting code after this character would also make it look like commented-out code when it wasn't. (Not sure what the compiler would do with this character; maybe some compilers would treat it like part of a variable name?)
It's not clear whether this can also happen when specifying code points (like typing '\u202E'). If so, you can also scan for those sequences with a regex.
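The scanner described above could look roughly like the sketch below. The character list is an assumption about what the poster meant: it uses the nine explicit directional formatting characters from UAX #9 (embeddings, overrides, isolates and their terminators) rather than just three. The `scan` name and the escape regex are made up for illustration, and the regex only covers the `\uXXXX` spelling.

```python
import re

# Explicit directional formatting characters from the Unicode Bidi Algorithm.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

# The same code points written as \uXXXX escapes in the source text.
ESCAPED = re.compile(r"\\u(?:202[a-eA-E]|206[6-9])")


def scan(text):
    """Return (line, column, description) for each raw or escaped control."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
        for m in ESCAPED.finditer(line):
            hits.append((lineno, m.start() + 1, "escaped " + m.group(0)))
    return hits


sample = 'x = 1\ny = "a\u202eb"  # ok\nz = "\\u202E"\n'
found = scan(sample)
```

Whether an escaped code point is actually dangerous depends on where the string ends up being displayed, so a real check might report those at a lower severity than raw characters.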
Re: (Score:3)
Fortunately most of the highest risk industries don't accept open source contributions and there's little motivation for an internal developer to harm systems that way. I'd probably put defense contractors as high on the list of groups that should be concerned. Government work has lots of red tape slowing down adopt
Re: (Score:2)
Just fix the editor and the code review tools. They should cancel any global string setting at the end of comments or raw strings. That way the reviewed code view would match the compilers
Easier fix. (also more secure) (Score:2)
There's a *really* easy fix: run a character code scanner to ensure that these Unicode characters (IIRC there are 3 of them) are not present in your code files.
Easier fix: Add an "is there Unicode in the source" test to the compiler. On, and a fatal error, by default.
You need it in the editors too, for when you have to support an older compiler.
Re: (Score:2)
UNICODE is an extensible standard.
Better to test for acceptable characters rather than scan for (what you currently think are) the bad ones.
If you miss a character that you really need to have, easy enough to add it to the acceptable list.
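The allowlist idea above, as a minimal sketch. The baseline set (printable ASCII plus tab and newline) and the `extra_allowed` parameter are assumptions for illustration; a real project would pick its own list.

```python
# Hypothetical baseline: printable ASCII plus newline and tab.
ALLOWED = {chr(c) for c in range(0x20, 0x7F)} | {"\n", "\t"}


def disallowed(text, extra_allowed=""):
    """Return the code points in text that are not on the allowlist."""
    allowed = ALLOWED | set(extra_allowed)
    bad = sorted({ord(ch) for ch in text if ch not in allowed})
    return [f"U+{cp:04X}" for cp in bad]


source = 's = "caf\u00e9\u202e"\n'
report = disallowed(source)
# A character you really need can be added to the acceptable list.
report_with_accent = disallowed(source, extra_allowed="\u00e9")
```

Unlike a denylist, this keeps rejecting characters that get added to Unicode later, at the cost of some upkeep whenever a legitimate character turns up.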
Syntax highlighting will show this (Score:3)
Almost any syntax highlighting, which is provided as an option by almost every program used to display source code, will highlight as the compiler sees the text, resulting in obvious messed up display if this is attempted.
It would also be trivial, and perhaps useful, if such editors ignored all bidi instructions, at least outside quoted strings and comments.
Palindrome (Score:2)
Re: (Score:3)
Read in ASCII mode (Score:2)
Forgive my dumb question (Score:2)
But can't an automated tool that removes these things (possibly together with any BiDi text inside comments) solve the problem?
The standard suggests a workaround (Score:3)
From The Standard [unicode.org] (in Chapter 3, Basic Display Algorithm):
The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.
It seems to me that compilers and code editors could define a "paragraph" as a string literal, a comment, or an identifier token, which would allow for reasonable display of mixed L2R/R2L text without allowing for it to obfuscate the surrounding code. Additionally, this would mean that attempting to hide a string or comment terminator inside a L2R/R2L/L2R/etc. sequence would fail.
For command shells, I would also suggest that path delimiters and terminating file extensions also delimit paragraphs. And probably other characters as required by the command shell environment.
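One way to sketch the per-token idea: treat each string literal or comment as its own unit and check that every Bidi opener inside it is also closed inside it. The pairing rules follow UAX #9 (embeddings and overrides are closed by PDF, isolates by PDI, and a PDI also ends any embeddings opened inside its isolate). The `unterminated` function name is made up, and pulling the token text out of the source is assumed to happen elsewhere, for example in the editor's existing syntax highlighter.

```python
OPEN_EMBED = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
OPEN_ISOLATE = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
PDF, PDI = "\u202c", "\u2069"


def unterminated(token_text):
    """Count Bidi openers in one token that are never closed within it."""
    stack = []
    for ch in token_text:
        if ch in OPEN_EMBED:
            stack.append("embed")
        elif ch in OPEN_ISOLATE:
            stack.append("isolate")
        elif ch == PDF and stack and stack[-1] == "embed":
            stack.pop()
        elif ch == PDI and "isolate" in stack:
            # PDI closes the nearest isolate and any embeddings inside it.
            while stack.pop() != "isolate":
                pass
    return len(stack)


leaky = unterminated('"user\u202e"')            # override runs past the quote
contained = unterminated('"\u202eabc\u202c"')   # opened and closed in place
```

A token with a nonzero count is one whose directionality would bleed into the surrounding code if the whole line were laid out as a single paragraph.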
Re: (Score:2)
It seems to me that compilers and code editors could define a "paragraph" as a string literal, a comment, or an identifier token
Code editors would need to be updated to provide "paragraph" definitions for all programming languages for which they support syntax highlighting.
Re: (Score:2)
Code editors would need to be updated to provide "paragraph" definitions for all programming languages for which they support syntax highlighting.
Actually, to be on the safe side they'd need to provide that also for all programming languages they don't even support syntax highlighting for...
Possible Mitigation (Score:2)
Add a feature to all editors: "View as 7-bit ASCII." Code points 0-127 are displayed as usual. Everything above that is displayed as 8-bit hexadecimal values of the form "\x??" where ?? are hex digits (or some other fancy editor-specific rendering that makes them stand out).
The idea is that you can view the code as the compiler will see it -- as a stream of bytes. This should make it easier to see if what appear to be string constants or comments are actually trying to escape into the code stream, part
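A minimal sketch of that "View as 7-bit ASCII" mode, following the description above: bytes 0-127 pass through unchanged and everything else is shown as `\x??`. The `ascii_view` name is made up, and the sketch does not try to distinguish a rendered escape from a literal backslash sequence that was already in the file.

```python
def ascii_view(data: bytes) -> str:
    """Render a byte stream with every non-ASCII byte shown as \\x??."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))          # 7-bit ASCII displayed as usual
        else:
            out.append(f"\\x{b:02x}")   # everything above made visible
    return "".join(out)


# U+202E (right-to-left override) is three bytes in UTF-8: e2 80 ae.
raw = 'a = "\u202e"'.encode("utf-8")
view = ascii_view(raw)
```

Since the view works on raw bytes, it needs no knowledge of the file's encoding or of Unicode at all.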
Does Raku (ex Perl 6) have this UTF-8 problem? (Score:1)
Raku (ex Perl 6) allows UTF-8 in constant, variable and function names (pi with the Greek pi character, sum with the capital Greek sigma character, etc.).
As it was designed specifically with Unicode in mind, they may have already solved the problem. Or not?
Every compiler I have looked at (Score:1)
Every compiler I have looked at discards comments. A documentation compiler will look specifically at the comments. I fail to see how this could be a threat.