OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab 119
abhishekmdb writes: No browser is safe, as proved yesterday at Pwn2Own, but crashing one with just a single line of special code is slightly different. A developer has discovered a bug in Google Chrome which can crash a Chrome tab on a Mac. The trigger is a 13-character string which appears to be written in Assyrian script. Matt C has reported the bug to Google, which has marked the report as a duplicate. This means that Google is aware of the problem and is reportedly working on it.
Related poem (Score:2)
The Assyrian came down like the wolf on the fold,
And his cohorts were gleaming in purple and gold;
And the sheen of their spears was like stars on the sea,
When the blue wave rolls nightly on deep Galilee.
Byron [poetryfoundation.org]
Re: (Score:2)
what? no 'burma shave' ??
Re: (Score:2, Troll)
Re:Related poem (Score:5, Funny)
Now then, this particular Assyrian, the one whose cohorts were gleaming in purple and gold,
Just what does the poet mean when he says he came down like a wolf on the fold?
In heaven and earth more than is dreamed of in our philosophy there are great many things.
But I don't imagine that among them there is a wolf with purple and gold cohorts or purple and gold anythings.
Ogden Nash [blogspot.com]
Re: (Score:2)
So that is Google's way of fixing this problem?
Thank you, Neal Stephenson (Score:5, Funny)
Let us henceforth dub it the Snow Crash exploit.
Re: (Score:2)
Re: (Score:1)
It's the imitator language derivative that is still being used today in Old Persia. Those Iranians are fun guys!
It's the script to use when you don't want to write in Arabic.
Re: (Score:2)
Man bites dog (Score:1)
Stop the presses: a bug has been found in a large, complex program.
Re: (Score:2)
... which millions of people use to connect to the internet... and there are dozens (thousands) of bugs still hidden where that bug came from. Do you still think browsers should be allowed for serious stuff like online banking, home automation and online elections?
Re: (Score:3)
Complex software should be banned! Like the stuff that flies all the commercial aeroplanes and runs the nuclear reactors.
Re: (Score:1)
Yeah, because no one was ever shot in a real bank.
Re: (Score:2)
Stop the presses: a bug has been found in a large, complex program.
No Browser is safe : Chrome, Firefox, Internet Explorer, Safari all hacked at Pwn2Own contest [techworm.net]
It's not "a bug" in "a program". It's every major browser. And it's pretty much like this every time they do pwn2own. If a group of hackers are able to bring down every major browser with previously unknown* exploits every year just for a chance to win a laptop, what can better motivated (financed) groups do?
* unknown to the browser developers anyway... 17 seconds to pwn IE, yeah right... like they say on the cooki
Re: (Score:1)
Bridgekeeper: Stop. Who would cross the Bridge of Death must answer me these questions three, ere the other side he see.
Sir Lancelot: Ask me the questions, bridgekeeper. I am not afraid.
Bridgekeeper: What... is your name?
Sir Lancelot: My name is Sir Lancelot of Camelot.
Bridgekeeper: What... is your quest?
Sir Lancelot: To seek the Holy Grail.
Bridgekeeper: What... is your favourite colour?
Sir Lancelot: Blue.
Bridgekeeper: Go on. Off you go.
Sir Lancelot: Oh, thank you. Thank you very much.
Sir Robin: That's easy
Re: (Score:3)
Well, I don't know about *foolproof*, but most of the time when software does bad things because of specially crafted input, it's because someone didn't bother to do an input validation that they obviously ought to have done. This has been a leading cause of bugs since the 1974 edition of "The Elements of Programming Style", which devotes 2 out of 56 lessons to it:
#19 Test input for plausibility and validity.
#20 Make sure input doesn't violate the limits of the program.
If K&P were writing that today they'd probably have a rule "never hand a piece of non-literal data to an interpreter without escaping anythi
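A minimal sketch of those two rules plus the "escape before the interpreter" idea, in Python; the table, column and length limit are invented for illustration, and the parameterized query is what stands in for "never hand non-literal data to an interpreter unescaped":

import sqlite3

MAX_NAME_LEN = 64  # an assumed program limit (rule #20)

def store_name(conn, name):
    # Rule #19: test input for plausibility and validity.
    if not name or not name.isprintable():
        raise ValueError("implausible name")
    # Rule #20: make sure input doesn't violate the limits of the program.
    if len(name) > MAX_NAME_LEN:
        raise ValueError("name too long")
    # The parameterized query lets the SQL engine do the escaping,
    # so the string is stored as data, never executed as SQL.
    conn.execute("INSERT INTO users(name) VALUES (?)", (name,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(name TEXT)")
store_name(conn, "Robert'); DROP TABLE users;--")  # stored verbatim, not run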
Schneier got it right a decade and a half ago (Score:5, Informative)
Re: (Score:2)
At that time, Schneier was just one of many that held this opinion. None of us is surprised by what is happening. If you want to be secure, stay away from Unicode or process UTF-8 as ASCII. As soon as you try to render, parse or even only compare anything besides standard ASCII, you are screwed.
Re: (Score:2)
Unfortunately, Unicode is now woven into various Java string handling and database interactions, and it is far too complex to test all the possible input and storage scenarios. I've also noticed a strong tendency among current QA engineers to test only the new feature, and to avoid testing old components interacting with new features without _amazing_ pushback from their managers, who want to keep testing costs very small. The result is a fairly predictable string of failure modes, and of production failures
Re: (Score:2)
Indeed. That is why I usually add: stay away from Java if you want/need security. Testing is pretty much a non-starter for getting secure code though, unless the person doing the tests really understands the code, understands security, and has a generous testing budget. In usual industrial practice, none of the three are the case.
Re: (Score:2)
It's also aggravated by the "install the latest software, and build components, from arbitrary 3rd party repositories" habit. I'm afraid I just had a long discussion with some Java developers who were accustomed to building their software on their desktops, pulling in arbitrary, unknown versions of components and their dependencies, and using the resulting components to build the next round. I'm afraid it's reminding me, forcibly, of Perl developers saying "just use cpan build!", and ruby developers saying
Re: (Score:2)
Yes, Java and Python (3) and Qt are all causing enormous difficulties, as they followed Microsoft down the fantasy road and thought you had to convert strings on input to "unicode" or somehow it was impossible to use them. Since not all 8-bit strings can convert, there must either be a lossy conversion or there must be an error, neither of which is expected, especially if the software is intended to copy data from one point to another without change.
The original poster is correct in saying "stay away from U
Re: (Score:2)
UTF8 has nothing to do with it.
The problem commonly is: people try to "clean" input with some stupid regex, rather than treating all user-provided strings as permanently dirty. You can do anything you need to, risk-free, with this attitude. You have to understand the encoding you use for storage/transmission (if your framework doesn't provide a way to safely, blindly store/transmit any string, then just encode the string in some way first), but that's a much, much smaller world than the universe of possib
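For what it's worth, a small Python sketch of that attitude (the strings and template are made up): never regex-"clean" the input, just encode it at whatever boundary it crosses:

import html
import json

user_comment = '<script>alert("pwned")</script> & some \u00fcmlauts'

# Crossing into HTML: escape for HTML at output time.
page = "<p>{}</p>".format(html.escape(user_comment))

# Crossing into a JSON transport: let the serializer handle any string.
payload = json.dumps({"comment": user_comment}, ensure_ascii=False)

print(page)
print(payload)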
Re: (Score:2)
I don't know what your definition of "dirty" is, but there are going to be scenarios where you need your data cleaned.
Re: (Score:2)
No, actually the best advice is to not do any computations at all, i.e. pull the plug. Unfortunately, just like ignoring user input, that comes with the slight problem that your software cannot get any work done anymore.
Re: (Score:2)
In UTF-8 I'd be surprised if someone handled this wrong, because three-byte characters are common, and there is no good reason to be able to process three-byte but not four-byte UTF-8.
If they are using UTF-16 on the other hand, I wouldn't be surprised if someone assumes that characters are a single UTF-16 word.
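That assumption is easy to check; a quick Python sketch (the choice of U+1D11E, a character outside the Basic Multilingual Plane, is arbitrary):

ch = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF

print(len(ch.encode("utf-8")))       # 4 bytes in UTF-8
print(len(ch.encode("utf-16-le")))   # 4 bytes = two 16-bit units (a surrogate pair)
print(len(ch))                       # 1 code point as Python sees it

# Code that assumes "one character == one UTF-16 word" splits the pair:
units = ch.encode("utf-16-le")
print(hex(int.from_bytes(units[:2], "little")))   # 0xd834, a lone high surrogate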
Re: (Score:2)
You might be right, but it's such an old problem - it was a big deal 10 years ago in the Windows world as UCS-2 didn't handle it. C# was actually UTF-16 from the start, like Java, of course.
Still, crashing because of, what, a null in the input? I could certainly understand truncation (just like other incorrect display problems), but a crash?
Re: (Score:2)
Indeed. The problem is that Unicode is far too complex to still be understandable to the average programmer (and the good ones have to waste far too much time on it). Of course, you should always make your assumptions explicit and do explicit rejection of anything you are not prepared to process. But that would be a sound coding practice, and we cannot have that, now can we?
Re: (Score:2)
You miss my point: I basically said that as soon as you are interpreting the data as Unicode, you are screwed. As to treating input as permanently dirty, that would be effective if possible, but it is not. For many security-critical functionality, you just have to reject anything that is not 7-bit ASCII, because quite often you need to sanitize input and use it afterwards.
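A minimal Python sketch of that "reject anything that is not 7-bit ASCII" rule for a security-critical field (the function name and messages are hypothetical):

def require_ascii(value):
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("non-ASCII input rejected")
    return value

require_ascii("alice_01")             # passes through unchanged
try:
    require_ascii("al\u0131ce")       # U+0131, a dotless i smuggled in
except ValueError as exc:
    print(exc)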
Re: (Score:2)
Maybe I'm still not getting your point. Sure, if you need to understand the details of Unicode character composition and such because you're the one rendering the output glyphs, or you want to sort or search across different encodings of the same word, that's rough, but there's no excuse for a security failure while doing those tasks.
On your other point: the notion of "sanitizing input" is fundamentally flawed to begin with. You can never know what future framework that user data will be interacting with,
Re: (Score:2)
Have you ever developed any system more complicated than a college project?
One or two; one or two. Somehow I've never managed to develop one that would crash due to malformed input, however.
Re: (Score:2)
My point is that my first impression when I heard about Unicode a long time ago was "this is really dumb and it will kill security".
As to your Ad Hominem: You are an anonymous coward and have no standing.
Re: (Score:2)
Re: (Score:3)
Unicode is sort of complicated, or at least it's more complicated than might be expected. But the problem with Schneier saying "Unicode is too complex to ever be secure" is that he might as well just say "programming is too complex to ever be secure." Sure, Unicode is a little complicated. But it's hardly the most complicated thing you'll ever have to deal with as a programmer. If we can't even get that right, we might as well just quit.
Re: (Score:2)
If they had just stuck with 24 or 32 bits per character, instead of going with multiple variable-length character encodings, you might be right. When you can't be sure how many bytes any given character needs, you can't use simple maths to work out how big buffers need to be, or even be sure that you won't end up with odd spare bytes at the end.
It looks like this is what has happened here. Even supposedly well-debugged library code still has issues with it.
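The buffer-sizing point is easy to demonstrate in Python (the sample strings are arbitrary):

samples = ["abc", "\u00e4bc", "\u0710\u0712\u0713", "a\U0001D11Ec"]   # ASCII, Latin, Syriac, astral plane
for s in samples:
    print(len(s), "chars ->", len(s.encode("utf-8")), "bytes in UTF-8")
# Prints 3->3, 3->4, 3->6 and 3->6: sizing a buffer as "characters x constant"
# is either wasteful or too small, depending on the text.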
How well forethought of dice (Score:5, Funny)
to ditch unicode support. They recognized that experimental technology like this shouldn't be rolled out to this many users. Thank you dice for keeping slashdot safe!
Re: (Score:2)
Did Dice ditch unicode support? I thought the slash code always had issues/didn't support it, long before Dice acquired them.
Re: (Score:1)
perhaps i can draw the situation in pictures
/\
joke
0
\/ you
Re: (Score:1)
This is preformatted text
Well, that stinks. Let me try the <tt> tags, then:
This is preformatted text with tt &
Phooey. It ate all my extra spaces. I suppose you could use non-breaking spaces....
Nope. I guess trolls abused these features too much in the distant past, so I sort of understand that.
I'm still confused about the lack of Unicode, though. I thought Perl could handle it?
Re: (Score:2)
Slashcode always supported Unicode.
The reason it appears it doesn't is that a bunch of wankers decided to abuse Unicode to no end, and their abuse of control codes ended up screwing up the site layout.
So what was added was an input filter that limited what Unicode could come in - pretty much just ASCII at this point.
Unicode IS complex, and you really cannot blindly ha
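A rough Python sketch of what an input filter like that might look like (not the actual Slashcode filter): keep printable ASCII plus whitespace, drop everything else, including the bidi and control codes that wreck layout:

import re

_DISALLOWED = re.compile(r"[^\x20-\x7e\t\n]")

def filter_comment(text):
    return _DISALLOWED.sub("", text)

print(filter_comment("plain text \u202e with an RLO control code sneaked in"))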
Re: (Score:2)
Yeah. It's not like Slashdot.jp patched slashcode to support Unicode 10+ years ago.
Re: (Score:3)
Actually we are probably going to have to ditch Unicode at some point, at least in its current form. East Asian language support is badly broken. It could be fixed, but not in a non-breaking way.
CJK unification is one of the biggest screw-ups in the history of computing.
Re: (Score:1)
From what I understand, Unicode abandoned CJK unification a long time ago; there are now separate planes for each language.
Of course the old planes still exist, so you need to transpose those when you find them in a string.
Re: (Score:2)
In fairness... (Score:1)
Re: (Score:2)
Eventually their (Christian) descendants ended up speaking Aramaic like practically everyone else in the Near East at the time (it was the official language in the Western part of the Persian Empire); the modern Assyrian language is one of the many forms of modern Aramaic (now split into several different languages, much as Latin evolved into several different languages over much the same period) and this script is properly c
Re:In fairness... (Score:4, Informative)
The people in question call themselves Assyrians at the present day; there are some Akkadian words preserved in their Aramaic language even now, although Akkadian itself probably died out in the earlier part of the first millennium BC.
The name "Syriac" is itself from a worn-down version of the same name; it was once used pretty much as the equivalent of "Aramaic" but is now generallly used to describe only one particular version of Aramaic which was a major literary language of Western Asia in early Christian times, and is still used as a liturgical language by Nestorian Christians as far afield as India. The script is used to write several modern Aramaic languages spoken by Christians.
These ancient communities have suffered greatly in the Middle East wars of recent times, and a huge proportion have left as refugees.
Syriac not Assyrian (Score:5, Informative)
That script is the Syriac script not the Assyrian one: https://en.wikipedia.org/wiki/... [wikipedia.org].
Re: (Score:2)
Dupe (Score:2)
this report is a dupe: https://code.google.com/p/chro... [google.com]
Re: (Score:2)
Who says the Internet isn't educational?
Lotus Notes was like this too (Score:2)
I once had a small Notes web thing running for a bunch of people in Scandinavia. The thing crashed every time someone from Iceland worked with it. Turned out that the Icelandic character is not in some Middle European character set (this was before UTF-8) and wasted Notes every time. That was a total bastard of a problem to find.
Re: (Score:2)
Hah. Slashdot breaks too! It is the Icelandic 'thorn' character http://en.wikipedia.org/wiki/T... [wikipedia.org]
Re: (Score:1)
How long do you think it's going to take for said characters to be posted (inadvertently, of course) in a comment on this post?
Since Slashdot doesn't actually support Unicode, they wouldn't come in at all. They'd just disappear. Soviet Russia style.
Good news! (Score:3)
In related news, we don't need to worry about this bug being used by unscrupulous sorts of folks in the comments here. The one and only time a lack of unicode support has come in useful...
Re: (Score:2)
That's correct usage in British English, AC. Welcome to the Internet.
so does imgur (Score:2)
mtbf - 15 mins.
Sounds occult (Score:1)
Since Snowcrash ... (Score:2)
... we know that Assyrian, or more precisely Sumerian, is tricky.
Didn't Steve Jobs have Assyrian Heritage? (Score:2)
I know, Syrian, but still. I always knew he was going to be the death of Apple.
Might not be unicode ... (Score:2)
Re: (Score:2)
Re: (Score:2)
I've changed version control systems since then so I don't have the check-in history handy.
Re:Type "bush hid the facts" into Notepad. (Score:5, Informative)
http://blogs.msdn.com/b/oldnew... [msdn.com]
About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.
First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.
8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.
UTF-8. These usually begin with a BOM but not always.
Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.
Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.
If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:
D0 AE
Depending on which encoding you assume, you get very different results.
If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure, this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.
If you assume UTF-8, then the file consists of the single Cyrillic character U+042E (Ю).
If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE
If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0
Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.
Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,
cmd /u /c dir >results.txt
This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.
The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".
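For the curious, here is a toy Python version of the dilemma described above (this is not Notepad's actual algorithm): honour a BOM if present, otherwise prefer valid UTF-8, otherwise fall back to code page 1252 - and the two-byte file from the example still decodes two plausible ways:

import codecs

def guess_decode(data):
    for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return enc, data[len(bom):].decode(enc)
    try:
        return "utf-8", data.decode("utf-8")      # "looks like valid UTF-8" rule
    except UnicodeDecodeError:
        return "cp1252", data.decode("cp1252")    # 8-bit ANSI fallback

print(guess_decode(b"\xd0\xae"))    # ('utf-8', '\u042e') -- the cp1252 reading was just as plausible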
Re:Type "bush hid the facts" into Notepad. (Score:5, Funny)
My conclusion is that the unicode guys are assholes.
Re: (Score:3)
Unicode made three big mistakes.
1. Attempting to be backwards compatible with a subset of ASCII. A subset that breaks all the common encodings used outside the US.
2. Multiple encodings (8, 16 and 32 bit). Pick one, stick to it, and don't make everyone try to guess with stupid BOMs etc.
3. CJK unification. Trying to merge three distinct languages in a way that makes it impossible to mix them in a pure Unicode document.
So yeah, those guys are assholes.
Re: (Score:3)
Unicode and how it is represented in a file are two different things. Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
How to represent it in a file is different. UTF-8 is the obvious answer today, but other encodings were tried by different organisations first. The big win of UTF-8 is that you can have characters from very different regions on the same web page (or in the same file) - something that you cannot do if you adopt a p
Re: (Score:2)
The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.
C# and Java use UTF16 internally for strings.
Re: (Score:2)
I agree completely. There is no reason that a program cannot read UTF-8 and store it as UTF-32 internally. There is a trade-off between time and memory. Note that UTF-16 is also a variable-length encoding scheme, so you still need to start at the beginning of the string to find the nth character.
Re: novice programmer alert! (Score:2)
The big downside of UTF-8 is using it as an in-memory string. To find the nth character and you have to start at the beginning of the string.
And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.
C# and Java use UTF16 internally for strings.
And you are aware that UTF-16 is variable-length as well, and therefore you can't "find
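A small Python illustration of the byte-offset argument (the strings are arbitrary): positions discovered by scanning the UTF-8 bytes can be reused directly, so "give me the nth character" rarely has to be answered at all:

data = "na\u00efve caf\u00e9".encode("utf-8")

start = data.find("caf\u00e9".encode("utf-8"))   # a byte offset, found by scanning
print(start)                                     # 7, not a "character index"
print(data[start:].decode("utf-8"))              # still decodes cleanly: the slice lands on a character boundary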
Re: (Score:2)
Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
That's one of its biggest problems: it doesn't support all the characters in Chinese. In fact it doesn't really support any of them, because they tried to merge them with Japanese and Korean characters. The result is that Unicode contains a sort of amalgamation that can be used to approximate any of those three languages, but not represent them properly.
I listen to both Japanese and Chinese music. Unicode is broken for me. There is no way to tell if a character is a Chinese or a Japanese one. The character
Re: (Score:2)
Re: (Score:2)
I agree overall with your comment, but I think UTF-8's backwards compatibility with ASCII was genius and is the reason we have as much Unicode support as we do today. I consider UTF-8 to be one of the best hacks of all time. Without it, the software that existed at the time would have had to be thrown out or re-written. The fact that software can (often) process UTF-8 without even being aware that it isn't ASCII was exactly what was needed to get Unicode off the ground. UTF-8 allowed Unicode to be adopted i
Re: (Score:2)
The problem is that ASCII is only useful for US English. Other forms of English need symbols like the pound (£) sign. Other Latin derived languages need accented characters. Non-Latin languages already use some subset of ASCII plus extensions. Any software that has to support more than just 7-bit US ASCII and UTF-8 has to guess, and usually gets it wrong.
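Both points are easy to check in Python (sample strings invented): pure ASCII bytes are already valid UTF-8, byte for byte, and the pound sign is exactly where Latin-1 and UTF-8 stop agreeing - which is where the guessing starts:

text = "Price: 5"
assert text.encode("ascii") == text.encode("utf-8")    # the ASCII subset encodes identically

pound = "\u00a35"                                       # a pound sign followed by "5"
print(pound.encode("latin-1"))                          # b'\xa35'
print(pound.encode("utf-8"))                            # b'\xc2\xa35'
print(b"\xa35".decode("utf-8", errors="replace"))       # a lone 0xA3 is not valid UTF-8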
Re: (Score:2)
Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.
The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a vari
Re: (Score:1)
Unicode made one enormous mistake - existing in the first place.
If plain ASCII was good enough for Virgil, Newton & Shakespeare, it's good enough for you.
Re: (Score:3)
Re: (Score:2)
Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use
Yay! You actually got the answer partially correct. However you then badly stumble when you follow this up with:
8-bit ANSI, but under no circumstances UTF-16
The correct answer is "after knowing it is not UTF-8, use your complicated and error-prone encoding detectors".
The problem is a whole lot of stupid code, in particular from Windows programmers, basically tries all kinds of m
Re: (Score:2)
Re: (Score:2)
Re: (Score:1)
I've had a delightful time explaining to my trainees that *EVERY SERVER SHOULD ONLY BE RUN IN A LANG=C ENVIRONMENT*. Unicode is *bad*, *bad*, *bad* for systems work of any sort.
And in a related XKCD post:
https://xkcd.com/327/ [xkcd.com]
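A hedged sketch of what that advice looks like when launching a service from Python (the binary path and flag are made up); the point is just to force a C locale on the child process so locale-dependent parsing and collation cannot surprise it:

import os
import subprocess

env = dict(os.environ, LANG="C", LC_ALL="C")
subprocess.run(["/usr/local/bin/example-server", "--check-config"], env=env, check=False)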
That works, until your servers have to process any kind of foreign characters whatsoever. This is a fault that only affects OS X, only when using Google Chrome. It's not (to my knowledge) a weakness of Unicode.
Re: (Score:2)
Re: (Score:2)