OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab (119 comments)

Posted by timothy on Saturday March 21, 2015 @02:48PM from the cat-like-typing-detected dept.

abhishekmdb writes: No browsers are safe, as proved yesterday at Pwn2Own, but crashing one of them with just one line of special code is slightly different. A developer has discovered a bug in Google Chrome that can crash a Chrome tab on a Mac. The code is a 13-character string that appears to be written in Assyrian script. Matt C has reported the bug to Google, who have marked the report as a duplicate. This means that Google are aware of the problem and are reportedly working on it.

Related poem (Score:2)

by Tim the Gecko ( 745081 ) writes:

The Assyrian came down like the wolf on the fold,
And his cohorts were gleaming in purple and gold;
And the sheen of their spears was like stars on the sea,
When the blue wave rolls nightly on deep Galilee.
Byron [poetryfoundation.org]
- Re: (Score:2)
  
  by TheGratefulNet ( 143330 ) writes:
  
  what? no 'burma shave' ??
  - Re: (Score:2, Troll)
    
    by BarbaraHudson ( 3785311 ) writes:
    
    It's not Tuesday :-)
- Re:Related poem (Score:5, Funny)
  
  by Chris Mattern ( 191822 ) writes: on Saturday March 21, 2015 @03:41PM (#49309471)
  
  Now then, this particular Assyrian, the one whose cohorts were gleaming in purple and gold,
  Just what does the poet mean when he says he came down like a wolf on the fold?
  In heaven and earth more than is dreamed of in our philosophy there are great many things.
  But I don't imagine that among them there is a wolf with purple and gold cohorts or purple and gold anythings.
  Ogden Nash [blogspot.com]
  
- - Re: (Score:2)
    
    by Vlad_the_Inhaler ( 32958 ) writes:
    
    So that is Google's way of fixing this problem?
Thank you, Neal Stephenson (Score:5, Funny)

by Applehu Akbar ( 2968043 ) writes: on Saturday March 21, 2015 @03:01PM (#49309299)

Let us henceforth dub it the Snow Crash exploit.

- Re: (Score:2)
  
  by fuzzyfuzzyfungus ( 1223518 ) writes:
  
  Weren't the Snow-Crash-related fertile crescent dwellers Sumerians, the Xerox-PARC of Mesopotamian civilization, who invented more or less everything and then got massacred by their imitators?
  - Re: (Score:1)
    
    by Whiteox ( 919863 ) writes:
    
    It's the imitator language derivative that is still being used today in Old Persia. Those Iranians are fun guys!
    It's the script to use when you don't want to write in Arabic.
- Re: (Score:2)
  
  by fredgiblet ( 1063752 ) writes:
  
  That was my first thought as well.
Man bites dog (Score:1)

by Anonymous Coward writes:

Stop the presses a bug found in a large complex program.
- Re: (Score:2)
  
  by gnupun ( 752725 ) writes:
  
  ... which millions of people use to connect to the internet... and there are dozens (thousands) of bugs still hidden where that bug came from. Do you still think browsers should be allowed for serious stuff like online banking, home automation and online elections?
  - Re: (Score:3)
    
    by viperidaenz ( 2515578 ) writes:
    
    Complex software should be banned! Like the stuff that flies all the commercial aeroplanes and runs the nuclear reactors.
  - Re: (Score:1)
    
    by mSparks43 ( 757109 ) writes:
    
    Yeah, because no one was ever shot in a real bank.
- Re: (Score:2)
  
  by disambiguated ( 1147551 ) writes:
  
  Stop the presses a bug found in a large complex program.
  No Browser is safe : Chrome, Firefox, Internet Explorer, Safari all hacked at Pwn2Own contest [techworm.net]
  It's not "a bug" in "a program". It's every major browser. And it's pretty much like this every time they do pwn2own. If a group of hackers are able to bring down every major browser with previously unknown* exploits every year just for a chance to win a laptop, what can better motivated (financed) groups do?
  * unknown to the browser developers anyway... 17 seconds to pwn IE, yeah right... like they say on the cooki
- Re: (Score:1)
  
  by ArcadeMan ( 2766669 ) writes:
  
  Bridgekeeper: Stop. Who would cross the Bridge of Death must answer me these questions three, ere the other side he see.
  Sir Lancelot: Ask me the questions, bridgekeeper. I am not afraid.
  Bridgekeeper: What... is your name?
  Sir Lancelot: My name is Sir Lancelot of Camelot.
  Bridgekeeper: What... is your quest?
  Sir Lancelot: To seek the Holy Grail.
  Bridgekeeper: What... is your favourite colour?
  Sir Lancelot: Blue.
  Bridgekeeper: Go on. Off you go.
  Sir Lancelot: Oh, thank you. Thank you very much.
  Sir Robin: That's easy
- Re: (Score:3)
  
  by hey! ( 33014 ) writes:
  
  Well, I don't know about *foolproof*, but most of the time when software does bad things because of specially crafted input, it's because someone didn't bother to do an input validation that they obviously ought to have done. This has been a leading cause of bugs since the 1974 edition of "The Elements of Programming Style", which devotes 2 out of 56 lessons to it:
  #19 Test input for plausibility and validity.
  #20 Make sure input doesn't violate the limits of the program.
  If K&P were writing that today they'd probably have a rule "never hand a piece of non-literal data to an interpreter without escaping anythi
Schneier got it right a decade and a half ago (Score:5, Informative)

by Max Hyre ( 1974 ) * writes: <mh-slash&hyre,net> on Saturday March 21, 2015 @03:19PM (#49309355)

This exploit rang a bell, so I searched Bruce Schneier's website. And, sure enough, on July 15, 2000, he observed ``Unicode is just too complex to ever be secure.'' [schneier.com] Doesn't exactly warm the cockles of the paranoid's heart.

- Re: (Score:2)
  
  by gweihir ( 88907 ) writes:
  
  At that time, Schneier was just one of many that held this opinion. None of us is surprised by what is happening. If you want to be secure, stay away from Unicode or process UTF-8 as ASCII. As soon as you try to render, parse or even only compare anything besides standard ASCII, you are screwed.
  - Re: (Score:2)
    
    by Antique Geekmeister ( 740220 ) writes:
    
    Unfortunately, unicode is now woven into various Java string handling and database interactions, and it is far too complex to test all the possible input and storage scenarios. I've also noticed a strong tendency among current QA engineers to test only the new feature, and to avoid testing old components interacting with new features without _amazing_ pushback from their managers who want to keep testing costs very small. The result is a fairly predictable string of failure modes, and of production failures
    - Re: (Score:2)
      
      by gweihir ( 88907 ) writes:
      
      Indeed. That is why I usually add to stay away from Java if you want/need security. Testing is pretty much a non-starter to get secure code though, unless the person doing the tests really understands the code, security and has a generous testing budget. In usual industrial practice, none of the three are the case.
      - Re: (Score:2)
        
        by Antique Geekmeister ( 740220 ) writes:
        
        It's also aggravated by the "install the latest software, and build components, from arbitrary 3rd party repositories". I'm afraid that I just had a long discussion with some Java developers who were accustomed to building their software on their desktops, pulling in arbitrary, unknown versions of components and their dependencies, and using the resulting components to build the next round. I'm afraid it's reminding me, forcibly, of Perl developers saying "just use cpan build!", and ruby developers saying
    - Re: (Score:2)
      
      by spitzak ( 4019 ) writes:
      
      Yes, Java and Python (3) and Qt all are causing enormous difficulties as they followed Microsoft down the fantasy road and thought you had to convert strings on input to "unicode" or somehow it was impossible to use them. Since not all 8-bit strings can convert there must either be a lossy conversion or there must be an error, neither of which are expected, especially if the software is intended to copy data from one point to another without change.
      The original poster is correct in saying "stay away from U
  - Re: (Score:2)
    
    by lgw ( 121541 ) writes:
    
    UTF8 has nothing to do with it.
    The problem commonly is: people try to "clean" input with some stupid regex, rather than treating all user-provided strings as permanently dirty. You can do anything you need to, risk-free, with this attitude. You have to understand the encoding you use for storage/transmission (if your framework doesn't provide a way to safely, blindly store/transmit any string, then just encode the string in some way first), but that's a much, much smaller world than the universe of possib
    - Re: (Score:2)
      
      by countach ( 534280 ) writes:
      
      I don't know what your definition of "dirty" is, but there are going to be scenarios where you need your data cleaned.
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        No, actually the best advice is to not do any computations at all, i.e. pull the plug. Unfortunately, just like ignoring user input, that comes with the slight problem that your software cannot get any work done anymore.
    - Re: (Score:2)
      
      by gnasher719 ( 869701 ) writes:
      
      Well, Assyrian Unicode characters are in the range around U+12000. They require four bytes in UTF-8 and two 16-bit words in UTF-16.
      
      In UTF-8 I'd be surprised if someone handled this wrong, because three byte characters are common, and there is no good reason to be able to process three byte but not four byte UTF-8.
      
      If they are using UTF-16 on the other hand, I wouldn't be surprised if someone assumes that characters are a single UTF-16 word.
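As a rough illustration of the sizes described above, the Python sketch below encodes three sample code points, including U+12000 from the range the comment mentions, and reports how many UTF-8 bytes and UTF-16 code units each needs. The helper names `utf8_length` and `utf16_units` are made up for this example; nothing here is taken from Chrome's code.

```python
def utf8_length(ch):
    """Number of bytes needed to encode the string ch in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    """List of the 16-bit code units used to encode the string ch in UTF-16."""
    raw = ch.encode("utf-16-be")  # the "-be" codec writes no BOM
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# Sample code points chosen for this sketch (hypothetical test data).
samples = {
    "U+0061": "a",            # ASCII letter
    "U+20AC": "\u20ac",       # inside the Basic Multilingual Plane
    "U+12000": "\U00012000",  # outside the BMP, in the range the comment mentions
}

for label, ch in samples.items():
    units = utf16_units(ch)
    print(label, "UTF-8 bytes:", utf8_length(ch),
          "UTF-16 units:", [hex(u) for u in units])
```

Code that treats every 16-bit unit as a whole character would count two "characters" for U+12000, which is the kind of assumption the comment warns about.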
      - Re: (Score:2)
        
        by lgw ( 121541 ) writes:
        
        You might be right, but it's such an old problem - it was a big deal 10 years ago in the Windows world as UCS2 didn't handle it. C# was actually UTF from the start, like Java, of course.
        Still, crashing because of, what, a null in the input? I could certainly understand truncation (just like other incorrect display problems), but a crash?
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Indeed. The problem is that Unicode is far too complex to still be understandable to the average programmer (and the good ones have to waste far too much time on it). Of course, you should always make your assumptions explicit and do explicit rejection of anything you are not prepared to process. But that would be a sound coding practice, and we cannot have that, now can we?
    - Re: (Score:2)
      
      by gweihir ( 88907 ) writes:
      
      You miss my point: I basically said that as soon as you are interpreting the data as Unicode, you are screwed. As to treating input as permanently dirty, that would be effective if possible, but it is not. For much security-critical functionality, you just have to reject anything that is not 7-bit ASCII, because quite often you need to sanitize input and use it afterwards.
      - Re: (Score:2)
        
        by lgw ( 121541 ) writes:
        
        Maybe I'm still not getting your point. Sure, if you need to understand the details of Unicode character composition and such because you're the one rendering the output glyphs, or you want to sort or search across different encodings of the same word, that's rough, but there's no excuse for a security failure while doing those tasks.
        On your other point: the notion of "sanitizing input" is fundamentally flawed to begin with. You can never know what future framework that user data will be interacting with,
    - - Re: (Score:2)
        
        by lgw ( 121541 ) writes:
        
        Have you ever developed any system more complicated than a college project?
        One or two; one or two. Somehow I've never managed to develop one that would crash due to malformed input, however.
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        My point is that my first impression when I heard about Unicode a long time ago was "this is really dumb and it will kill security".
        As to your Ad Hominem: You are an anonymous coward and have no standing.
- Re: (Score:2)
  
  by gnasher719 ( 869701 ) writes:
  
  I'd say the things that Schneier mentions in this article are not actual problems. The first step is avoiding UTF-16 because it is much too tempting to assume that one 16-bit word = one character; nobody will make that assumption with UTF-8. The next step is cleaning UTF-8 and accepting only valid UTF-8; simply removing anything that isn't valid will do fine. What _must_ happen is that after this cleaning step nobody ever again accesses the original data, only the cleaned data. At that point handling the ch
  - Re: (Score:3)
    
    by disambiguated ( 1147551 ) writes:
    
    Unicode is sort of complicated, or at least it's more complicated than might be expected. But the problem with Schneier saying "Unicode is too complex to ever be secure" is that he might as well just say "programming is too complex to ever be secure." Sure, Unicode is a little complicated. But it's hardly the most complicated thing you'll ever have to deal with as a programmer. If we can't even get that right, we might as well just quit.
    - Re: (Score:2)
      
      by AmiMoJo ( 196126 ) * writes:
      
      If they had just stuck with 24 or 32 bits per character, instead of going with multiple variable length character encodings, you might be right. When you can't be sure how many bytes any given character needs you can't use simple maths to work out how big buffers need to be, or even be sure that you won't end up with odd spare bytes at the end.
      It looks like this is what has happened here. Even supposedly well debugged library code still has issues with it.
How well forethought of dice (Score:5, Funny)

by NotInHere ( 3654617 ) writes: on Saturday March 21, 2015 @03:19PM (#49309357)

to ditch unicode support. They recognized that experimental technology like this shouldn't be rolled out to this many users. Thank you dice for keeping slashdot safe!

- Re: (Score:2)
  
  by cdrudge ( 68377 ) writes:
  
  Did Dice ditch unicode support? I thought the slash code always had issues/didn't support it, long before Dice acquired them.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    perhaps i can draw the situation in pictures
    joke
    
    0
    \/ you /\
    - - Re: (Score:1)
        
        by PincushionMan ( 1312913 ) writes:
        
        No, it does, you just have to use the <pre> tags, like so...
        This is preformatted text
        Well, that stinks. Let me try the <tt> tags, then:
        This is preformatted text with tt & /tt tags. Phooey. It ate all my extra spaces. I suppose you could use   non-breaking spaces....
        
        Nope. I guess trolls abused these features too much in the distant past, so I sort of understand that.
        I'm still confused about the lack of Unicode, though. I thought Perl could handle it?
  - Re: (Score:2)
    
    by tlhIngan ( 30335 ) writes:
    
    Did Dice ditch unicode support? I thought the slash code always had issues/didn't support it, long before Dice acquired them.
    Slashcode always supported Unicode.
    The reason it appears it doesn't is that thanks to a bunch of wankers who decided to abuse Unicode to no end, it ended up screwing the site layout up thanks to abuse of control codes.
    So what was added was an input filter that limited what Unicode could come in - pretty much just ASCII at this point.
    Unicode IS complex, and you really cannot blindly ha
- Re: (Score:2)
  
  by wiredlogic ( 135348 ) writes:
  
  Yeah. It's not like Slashdot.jp patched slashcode to support Unicode 10+ years ago.
- Re: (Score:3)
  
  by AmiMoJo ( 196126 ) * writes:
  
  Actually we are probably going to have to ditch Unicode at some point, at least in its current form. East Asian language support is badly broken. It could be fixed, but not in a non-breaking way.
  CJK unification is one of the biggest screw-ups in the history of computing.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    From what I understand, unicode abandoned CJK unification a long time ago; there are now separate planes for each language.
    Of course the old planes still exist, so you need to transpose those when you find them in a string.
- Re: (Score:2)
  
  by Megane ( 129182 ) writes:
  
  The support is in there, it's just that it uses a whitelist, which happens to be very small, probably only to U+00FF if that much. There are also likely problems on the client side where the user's browser posts in the wrong encoding.
In fairness... (Score:1)

by fuzzyfuzzyfungus ( 1223518 ) writes:

If I were looking for a language to scare a program into submission with, Assyrian would be a pretty plausible choice. Even by the rather high standards of the rough neighborhood that is the near and middle east, they cut quite a swath of blood-soaked mayhem through their neighbors; and put out lots of cuneiform inscriptions and rather morbid art gloating about their efficiency at this.
- Re: (Score:2)
  
  by bargainsale ( 1038112 ) writes:
  
  Wrong Assyrians. The ones you're thinking of spoke Akkadian and wrote cuneiform.
  
  Eventually their (Christian) descendants ended up speaking Aramaic like practically everyone else in the Near East at the time (it was the official language in the Western part of the Persian Empire); the modern Assyrian language is one of the many forms of modern Aramaic (now split into several different languages, much as Latin evolved into several different languages over much the same period) and this script is properly c
  - Re:In fairness... (Score:4, Informative)
    
    by bargainsale ( 1038112 ) writes: on Saturday March 21, 2015 @04:46PM (#49309767)
    
    (They spoke Aramaic long before they became Christian, of course.)
    
    The people in question call themselves Assyrians at the present day; there are some Akkadian words preserved in their Aramaic language even now, although Akkadian itself probably died out in the earlier part of the first millennium BC.
    
    The name "Syriac" is itself from a worn-down version of the same name; it was once used pretty much as the equivalent of "Aramaic" but is now generally used to describe only one particular version of Aramaic which was a major literary language of Western Asia in early Christian times, and is still used as a liturgical language by Nestorian Christians as far afield as India. The script is used to write several modern Aramaic languages spoken by Christians.
    
    These ancient communities have suffered greatly in the Middle East wars of recent times, and a huge proportion have left as refugees.
    
Syriac not Assyrian (Score:5, Informative)

by seyyah ( 986027 ) writes: on Saturday March 21, 2015 @03:26PM (#49309401)

That script is the Syriac script not the Assyrian one: https://en.wikipedia.org/wiki/... [wikipedia.org].

- Re: (Score:2)
  
  by PJ6 ( 1151747 ) writes:
  
  Yes, but what does it say?
Dupe (Score:2)

by NotInHere ( 3654617 ) writes:

this report is a dupe: https://code.google.com/p/chro... [google.com]
- Re: (Score:2)
  
  by bargainsale ( 1038112 ) writes:
  
  It says "John, house of Ephraim."
  
  Who says the Internet isn't educational?
Lotus Notes was like this too (Score:2)

by tigersha ( 151319 ) writes:

I once had a small Notes web thing running for a bunch of people in Scandinavia. The thing crashed every time when someone from Iceland worked with it. Turned out that the Icelandic character is not in some middle european character set (this was before UTF-8) and wasted Notes every time. That was a total bastard of a problem to find.
- Re: (Score:2)
  
  by tigersha ( 151319 ) writes:
  
  Hah. Slashdot breaks too! It is the Icelandic 'thorn' character http://en.wikipedia.org/wiki/T... [wikipedia.org]
- Re: (Score:1)
  
  by EmeraldBot ( 3513925 ) writes:
  
  How long do you think it's going to take for said characters to be posted (inadvertently, of course) in a comment on this post?
  Since Slashdot doesn't actually support Unicode, they wouldn't come in at all. They'd just disappear. Soviet Russia style.
Good news! (Score:3)

by Anubis IV ( 1279820 ) writes: on Saturday March 21, 2015 @04:03PM (#49309573)

In related news, we don't need to worry about this bug being used by unscrupulous sorts of folks in the comments here. The one and only time a lack of unicode support has come in useful...

- Re: (Score:2)
  
  by mister_playboy ( 1474163 ) writes:
  
  That's correct usage in British English, AC. Welcome to the Internet.
so does imagur (Score:2)

by rs79 ( 71822 ) writes:

mtbf - 15 mins.
Sounds occult (Score:1)

by tanimislam ( 1452305 ) writes:

hmm, ancient and dead language from the time of reported magic. Just typing the words will crash your Mac. Imagine if one spoke them!
Since Snowcrash ... (Score:2)

by angel'o'sphere ( 80593 ) writes:

... we know that Assyrian or more precisely Sumerian is tricky.
Didn't Steve Jobs have Assyrian Heritage? (Score:2)

by nucrash ( 549705 ) writes:

I know, Syrian, but still. I always knew he was going to be the death of Apple.
- Might not be unicode ... (Score:2)
  
  by perpenso ( 1613749 ) writes:
  
  It might not be unicode. I once had a bug because I assumed a particular MacOSX/iOS API call was returning UTF8. It was actually returning old-school MacRoman by default. Worked for some locales, caused a crash on others.
  - Re: (Score:2)
    
    by gnasher719 ( 869701 ) writes:
    
    I'd be curious to know which iOS call would return MacRoman.
    - Re: (Score:2)
      
      by perpenso ( 1613749 ) writes:
      
      It was years ago (2010'ish). I was getting iOS to localize currency amounts and dates. Testing was done in English, French and German and things seemed fine -- yeah my bad for using such similar languages. The crash occurred with a Scandinavian user, I don't recall the particular language. The fix was simple, I believe I merely had to specify that I wanted UTF8 rather than the default.
      
      I've changed version control systems since then so I don't have the check-in history handy.
- Re:Type "bush hid the facts" into Notepad. (Score:5, Informative)
  
  by rudy_wayne ( 414635 ) writes: on Saturday March 21, 2015 @04:01PM (#49309567)
  
  http://blogs.msdn.com/b/oldnew... [msdn.com]
  About every ten months, somebody new discovers the Notepad file encoding problem. Let's see what else there is to say about it.
  First of all, can we change Notepad's detection algorithm? The problem is that there are a lot of different text files out there. Let's look just at the ones that Notepad supports.
  8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they just dive right in with bytes of text. They are also probably the most common type of text file.
  UTF-8. These usually begin with a BOM but not always.
  Unicode big-endian (UTF-16BE). These usually begin with a BOM but not always.
  Unicode little-endian (UTF-16LE). These usually begin with a BOM but not always.
  If a BOM is found, then life is easy, since the BOM tells you what encoding the file uses. The problem is when there is no BOM. Now you have to guess, and when you guess, you can guess wrong. For example, consider this file:
  D0 AE
  Depending on which encoding you assume, you get very different results.
  If you assume 8-bit ANSI (with code page 1252), then the file consists of the two characters U+00D0 U+00AE, or "Ð®". Sure this looks strange, but maybe it's part of the word VATNIÐ® which might be the name of an Icelandic hotel.
  If you assume UTF-8, then the file consists of the single Cyrillic character U+042E
  If you assume Unicode big-endian, then the file consists of the Korean Hangul syllable U+D0AE
  If you assume Unicode little-endian, then the file consists of the Korean Hangul syllable U+AED0
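The four readings of the two-byte file listed above can be checked with a short Python sketch. It assumes Python's `cp1252` codec as a stand-in for "8-bit ANSI (with code page 1252)"; the function name `interpretations` is invented for this example.

```python
def interpretations(data):
    """Decode the same bytes under each encoding the post lists.

    Returns a dict mapping the Python codec name to the decoded
    code points, written in U+XXXX notation.
    """
    result = {}
    for codec in ("cp1252", "utf-8", "utf-16-be", "utf-16-le"):
        text = data.decode(codec)
        result[codec] = ["U+%04X" % ord(c) for c in text]
    return result


# The two-byte file from the post: D0 AE
for codec, points in interpretations(bytes([0xD0, 0xAE])).items():
    print(codec, points)
```

All four decodes succeed without error, so nothing in the bytes themselves tells a program which reading was intended.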
  Some people might say that the rule should be "All files without a BOM are 8-bit ANSI." In that case, you're going to misinterpret all the files that use UTF-8 or UTF-16 and don't have a BOM. Note that the Unicode standard even advises against using a BOM for UTF-8, so you're already throwing out everybody who follows the recommendation.
  Okay, given that the Unicode folks recommend against using a BOM for UTF-8, maybe your rule is "All files without a BOM are UTF-8." Well, that messes up all 8-bit ANSI files that use characters above 127.
  Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use 8-bit ANSI, but under no circumstances should you treat the file as UTF-16LE or UTF-16BE." In other words, "never auto-detect UTF-16". First, you still have ambiguous cases, like the file above, which could be either 8-bit ANSI or UTF-8. And second, you are going to be flat-out wrong when you run into a Unicode file that lacks a BOM, since you're going to misinterpret it as either UTF-8 or (more likely) 8-bit ANSI. You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,
  cmd /u /c dir >results.txt
  This generates a UTF-16LE file without a BOM. If you poke around your Windows directory, you'll probably find other Unicode files without a BOM. (For example, I found COM+.log.) These files still "worked" under the old IsTextUnicode algorithm, but now they are unreadable. Maybe you consider that an acceptable loss.
  The point is that no matter how you decide to resolve the ambiguity, somebody will win and somebody else will lose. And then people can start experimenting with the "losers" to find one that makes your algorithm look stupid for choosing "incorrectly".
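One of the rules discussed above ("use the BOM if there is one; otherwise UTF-8 if the bytes are valid UTF-8; otherwise 8-bit ANSI; never auto-detect UTF-16") might be sketched in Python as follows. This is only an illustration of that rule, not Notepad's actual algorithm; `guess_encoding` is a hypothetical name, and `cp1252` is assumed as the ANSI code page.

```python
import codecs


def guess_encoding(data):
    """Guess a Python codec name for data using the BOM-then-UTF-8 rule."""
    # 1. A BOM, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # reads the BOM to pick the byte order
    # 2. No BOM: accept UTF-8 if the bytes are valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise fall back to 8-bit ANSI (code page 1252 assumed here).
        return "cp1252"


print(guess_encoding(b"\xd0\xae"))               # ambiguous file from the post
print(guess_encoding("dir".encode("utf-16-le")))  # UTF-16LE text without a BOM
```

The second call shows the losing case the post describes: BOM-less UTF-16LE text made of ASCII letters is also valid UTF-8 (the zero bytes decode as NUL characters), so this rule misreads it.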
  
  - Re:Type "bush hid the facts" into Notepad. (Score:5, Funny)
    
    by Pinky's Brain ( 1158667 ) writes: on Saturday March 21, 2015 @04:18PM (#49309629)
    
    My conclusion is that the unicode guys are assholes.
    
    - Re: (Score:3)
      
      by AmiMoJo ( 196126 ) * writes:
      
      Unicode made three big mistakes.
      1. Attempting to be backwards compatible with a subset of ASCII. A subset that breaks all the common encodings used outside the US.
      2. Multiple encodings (8, 16 and 32 bit). Pick one, stick to it, don't make people try to guess with stupid BOMs etc.
      3. CJK unification. Trying to merge three distinct languages in a way that makes it impossible to mix them in a pure Unicode document.
      So yeah, those guys are assholes.
      - Re: (Score:3)
        
        by Alain Williams ( 2972 ) writes:
        
        Unicode and how it is represented in a file are two different things. Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
        How to represent it in a file is different. UTF-8 is the obvious answer today, but other encodings were tried by different organisations first. The big win of UTF-8 is that you can have characters from very different regions on the same web page (or in the same file) - something that you cannot do if you adopt a p
        
        Re: (Score:2)
        
        by nogginthenog ( 582552 ) writes:
        
        The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.
        C# and Java use UTF16 internally for strings.
        
        Re: (Score:2)
        
        by Alain Williams ( 2972 ) writes:
        
        I agree completely. There is no reason that a program cannot read UTF-8 and store as UTF-32 internally. There is a trade-off between time and memory. Note that UTF-16 is also a variable length encoding scheme so you still need to start at the start of string to find the nth character.
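To make the indexing cost concrete, here is a minimal Python sketch of finding where the n-th code point starts in a UTF-8 byte string. It assumes the input is already valid UTF-8, and `byte_offset_of_code_point` is a name invented for this example.

```python
def byte_offset_of_code_point(data, n):
    """Return the byte offset where the n-th (0-based) code point starts.

    Assumes data is valid UTF-8. The scan is linear because earlier
    code points may each occupy anywhere from one to four bytes.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("fewer than %d code points" % (n + 1))


# Code points of 1, 2, 3, 4 and 1 bytes respectively.
data = "a\u00e9\u20ac\U00012000z".encode("utf-8")
print([byte_offset_of_code_point(data, i) for i in range(5)])
```

The same kind of scan is needed for UTF-16 once surrogate pairs are possible; only a fixed-width form such as UTF-32 lets the offset be computed as a simple multiple of n.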
        
        Re: novice programmer alert! (Score:2)
        
        by spitzak ( 4019 ) writes:
        
        The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.
        And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.
        C# and Java use UTF16 internally for strings.
        And you are aware that UTF-16 is variable-length as well, and therefore you can't "find
        
        Re: (Score:2)
        
        by AmiMoJo ( 196126 ) * writes:
        
        Unicode is a good idea, it solves many problems and contains all the (to me) strange characters used by: Greeks, Chinese, etc.
        That's one of its biggest problems: it doesn't support all the characters in Chinese. In fact it doesn't really support any of them, because they tried to merge them with Japanese and Korean characters. The result is that Unicode contains a sort of amalgamation that can be used to approximate any of those three languages, but not represent them properly.
        I listen to both Japanese and Chinese music. Unicode is broken for me. There is no way to tell if a character is a Chinese or a Japanese one. The character
      - Re: (Score:2)
        
        by jrumney ( 197329 ) writes:
        
        4. Inconsistent policy for character inclusion. After years of opposing addition of symbols commonly used in typesetting or web pages (such as a common symbol for indicating external links consisting of a box with a curved arrow coming out of it) on the basis that they are "not plain text and best represented by graphic images", we get emoji added. And they still won't add many of these symbols they've opposed in the past (they recently added the standard triangular recycling mark, but this was long after
      - Re: (Score:2)
        
        by disambiguated ( 1147551 ) writes:
        
        I agree overall with your comment, but I think UTF-8's backwards compatibility with ASCII was genius and is the reason we have as much Unicode support as we do today. I consider UTF-8 to be one of the best hacks of all time. Without it, the software that existed at the time would have had to be thrown out or re-written. The fact that software can (often) process UTF-8 without even being aware that it isn't ASCII was exactly what was needed to get Unicode off the ground. UTF-8 allowed Unicode to be adopted i
        
        Re: (Score:2)
        
        by AmiMoJo ( 196126 ) * writes:
        
        The problem is that ASCII is only useful for US English. Other forms of English need symbols like the pound (£) sign. Other Latin derived languages need accented characters. Non-Latin languages already use some subset of ASCII plus extensions. Any software that has to support more than just 7-bit US ASCII and UTF-8 has to guess, and usually gets it wrong.
        
        Re: (Score:2)
        
        by spitzak ( 4019 ) writes:
        
        Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.
        The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a vari
      - Re: (Score:1)
        
        by Hognoxious ( 631665 ) writes:
        
        Unicode made one enormous mistake - existing in the first place.
        If plain ASCII was good enough for Virgil, Newton & Shakespeare it's good enough for you.
      - Re: (Score:3)
        
        by jrumney ( 197329 ) writes:
        
        For UTF-16. "Only Windows uses BOMs" is pretty much correct for UTF-8, where the Unicode standard discourages it.
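For readers who have to cope with a UTF-8 BOM anyway, a small Python sketch follows. It uses only the standard library (`codecs.BOM_UTF8` and the `utf-8-sig` codec); the helper name `strip_utf8_bom` is hypothetical.

```python
import codecs

def strip_utf8_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a U+FEFF character in the text.
assert with_bom.decode("utf-8") == "\ufeffhello"

# "utf-8-sig" removes a leading BOM and is harmless when there is none.
assert with_bom.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"

print(strip_utf8_bom(with_bom))
```

Reading with `utf-8-sig` and writing with plain `utf-8` is one way to accept BOMs from other tools without emitting them yourself.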
  - Re: (Score:2)
    
    by spitzak ( 4019 ) writes:
    
    Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use
    Yay! You actually got the answer partially correct. However you then badly stumble when you follow this up with:
    8-bit ANSI, but under no circumstances UTF-16
    The correct answer is "after knowing it is not UTF-8, use your complicated and error-prone encoding detectors".
    The problem is a whole lot of stupid code, in particular from Windows programmers, basically tries all kinds of m
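The "try UTF-8 first" rule discussed above can be sketched in a few lines of Python. This is only an illustration: the fixed `cp1252` fallback is an assumption standing in for whatever detector you would really use once UTF-8 has been ruled out, and the name `decode_text` is hypothetical.

```python
def decode_text(raw: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Return (text, encoding_used): strict UTF-8 first, then a fallback."""
    try:
        # Strict decoding rejects anything that is not well-formed UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; errors="replace" avoids raising on undefined bytes.
        return raw.decode(fallback, errors="replace"), fallback

print(decode_text("£5".encode("utf-8")))   # well-formed UTF-8
print(decode_text("£5".encode("cp1252")))  # lone 0xA3 byte is not valid UTF-8
```

The order matters: text in a legacy 8-bit encoding that contains non-ASCII characters is rarely well-formed UTF-8 by accident, so the strict check is tried first and the guessing is left for whatever fails it.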
- Re: (Score:2)
  
  by thunderclap ( 972782 ) writes:
  
  Same thing happens when you type "Bill fed the goats". It's a Unicode error in Notepad for XP. You want something fun? Type that into Chrome for a Mac in an Apple Store. That's fun.
  - Re: (Score:2)
    
    by thunderclap ( 972782 ) writes:
    
    Since it deleted the word, here is an image of it: http://2.bp.blogspot.com/-_TfD... [blogspot.com]
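The usual explanation for the Notepad behaviour is that its encoding guess mistook the ASCII bytes for UTF-16LE. The Python sketch below does not reproduce Notepad's heuristic; it only shows what happens when those 18 ASCII bytes are deliberately decoded as UTF-16LE.

```python
text = b"Bill fed the goats"         # 18 ASCII bytes, an even count

# Pair the bytes up as little-endian 16-bit code units.
misread = text.decode("utf-16-le")

print(len(misread), [hex(ord(c)) for c in misread])

# Every resulting code point falls in the CJK Unified Ideographs block.
assert all(0x4E00 <= ord(c) <= 0x9FFF for c in misread)

# Nothing is lost: re-encoding gives back the original bytes.
assert misread.encode("utf-16-le") == text
```

Because two lowercase ASCII letters in a row always pair up into a code point in the U+6161 to U+7A7A range, short all-lowercase phrases with an even byte count are exactly the kind of input a statistical guess can get wrong.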
- Re: (Score:1)
  
  by EmeraldBot ( 3513925 ) writes:
  
  I've had a delightful time explaining to my trainees that *EVERY SERVER SHOULD ONLY BE RUN IN A LANG=C ENVIRONMENT*. Unicode is *bad*, *bad*, *bad* for systems work of any sort.
  And in a related XKCD post:
  https://xkcd.com/327/ [xkcd.com]
  That works, until your servers have to process any kind of foreign characters whatsoever. This is a fault that only affects OS X, only when using Google Chrome. It's not (to my knowledge) a weakness of Unicode.
- Re: (Score:2)
  
  by l0ungeb0y ( 442022 ) writes:
  
  Yeah, well, it's not too hard to escape from unicode hell...
- Re: (Score:2)
  
  by Half-pint HAL ( 718102 ) writes:
  
  Yeah, because people who speak funny foreign languages don't deserve to use our linguistically pure English-speaking servers, right?

