Unicode Encoding Flaw Widespread

Posted by kdawson on Tuesday May 22, 2007 @02:33AM from the sneaking-past-the-IDS dept.

LordNikon writes "According to this CERT advisory: 'Full-width and half-width encoding is a technique for encoding Unicode characters. Various HTTP content scanning systems fail to properly scan full-width/half-width Unicode encoded HTTP traffic. By sending specially-crafted HTTP traffic to a vulnerable content scanning system, an attacker may be able to bypass that content scanning system.' A proof of concept affecting IIS is already being posted to security mailing lists. Cisco IPS and other IDS products are also affected." The CERT advisory lists 93 systems, with 6 reported as vulnerable (including 3com, Cisco, and Snort), 5 known not vulnerable (including Apple and HP), and the rest unknown.
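To make the advisory's wording concrete, here is a minimal Python sketch of what "full-width" forms are. It assumes only the standard Unicode layout, where printable ASCII U+0021..U+007E has fullwidth twins at U+FF01..U+FF5E; the helper names `to_fullwidth` and `fold_width` are made up for illustration and are not taken from the advisory.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 minus U+0021


def to_fullwidth(text: str) -> str:
    """Replace printable ASCII characters with their fullwidth twins."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


def fold_width(text: str) -> str:
    """Fold compatibility variants (including fullwidth forms) back to ASCII."""
    return unicodedata.normalize("NFKC", text)


payload = "<script>"
wide = to_fullwidth(payload)

print([f"U+{ord(ch):04X}" for ch in wide[:2]])  # code points of the first two characters
print(wide == payload)                          # the raw strings differ
print(fold_width(wide) == payload)              # but they fold to the same text
```

A scanner that compares raw bytes sees two unrelated strings here, while any component that folds width variants sees the same one.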

Limited impact. (Score:3, Informative)

by shird ( 566377 ) writes: on Tuesday May 22, 2007 @02:38AM (#19217829) Homepage Journal

This appears to be limited to content scanning, and isn't really a vulnerability in itself. Relying on content scanning to prevent an exploit from reaching an exploitable system is a pretty bad idea; it's much better to fix the system than the extra layer of defense on the outside.

Content scanning is mostly useful for filtering known exploits, and is hardly meant to be your primary defense. Being able to bypass this scanning won't buy you much. If the content scanner is aware of an exploit it scans for, chances are the systems being targeted are aware of it too and have been patched to protect against it.

- Re: (Score:2)
  
  by KevMar ( 471257 ) writes:
  
  So this is another case of don't trust user input.
  
  I don't see anything new here, just another trick to look for. Most well tested systems should not be affected.
  
  Unless I'm overlooking something? I'm not, am I?
  - - Re:Limited impact. (Score:4, Insightful)
      
      by fatphil ( 181876 ) writes: on Tuesday May 22, 2007 @07:32AM (#19219111) Homepage
      
      I think you've missed his point. There are now two ways that, for example, a quote character can be passed as user input to your program: either as " or as %ublah.
      
      Your program, sitting below the layer performing the Unicode translations, doesn't need to do anything differently from before, as it doesn't matter which of the two methods was used. If you _relied on_ the layers above you to strip out, reject, escape, or whatever, quote characters, then you're writing teabag code, and should get a job selling flowers instead, as software engineering is beyond you.
      
      Always validate user input to your own specification. Never rely on something external to do it.
      
      This exploit hasn't changed the rules one little bit, it's just highlighted the fact that some idiots don't follow them.
      
      - Re: (Score:3, Insightful)
        
        by CastrTroy ( 595695 ) writes:
        
        Is this another problem with unescaped quotes? When will people learn? Not an hour goes by that a system doesn't get attacked by SQL injection attacks. Why do programmers continue to not use things like prepared statements, which are invulnerable to such attacks? I blame it on the people writing the tutorials. Every beginner tutorial on the web shows queries being constructed at runtime, and doesn't have any mention of how insecure doing things like this is. It's hard to break the habit once you've
      - Re: (Score:2, Interesting)
        
        by MikeB90 ( 857499 ) writes:
        
        The point is that you, in your own program, might have escaped or regexed items incorrectly and be open to this attack. Of course you don't blindly depend on some "magic" function. Duh! But you yourself are mortal too. And I doubt many people knew about fullwidth/halfwidth Unicode transforms. The fact that one of the articles linked to this says they did a successful SQL injection SHOWS there are issues. BTW insulting people is not normally a useful technique
    - Re: (Score:3, Insightful)
      
      by jZnat ( 793348 ) writes:
      
      Well, the way I see it, there are three ways to handle Unicode characters (one of which is wrong): store as full two-byte Unicode values (inefficient when using mostly ASCII characters like in English), store in a UTF character set such as UTF-8 (useful for primarily ASCII text as it is a superset of ASCII), or pretend it isn't Unicode and treat it as two (or three if input is in UTF-8 for example) separate ASCII characters (bad).
      
      So, perhaps if data was all stored and represented in UTF-8, for example, this
      - Re: (Score:2)
        
        by CoughDropAddict ( 40792 ) * writes:
        
        Two bytes is not enough for all Unicode characters. UTF-16, which stores characters under U+FFFF using two bytes, is still a variable-length encoding for characters higher than U+FFFF. If you want a fixed-length encoding, use UTF-32.
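The size differences described above can be checked with a short Python sketch. The three sample characters are arbitrary picks for illustration (an ASCII letter, the fullwidth less-than sign from the advisory, and a character above U+FFFF), and `utf16_units` is a hypothetical helper name.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2


samples = {
    "LATIN CAPITAL LETTER A": "A",               # U+0041
    "FULLWIDTH LESS-THAN SIGN": "\uff1c",        # U+FF1C
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",       # U+1D11E, above U+FFFF
}

for name, ch in samples.items():
    print(
        f"U+{ord(ch):04X} {name}: "
        f"UTF-8 bytes={len(ch.encode('utf-8'))}, "
        f"UTF-16 units={utf16_units(ch)}, "
        f"UTF-32 bytes={len(ch.encode('utf-32-be'))}"
    )
```

Only the UTF-32 column stays constant; the character above U+FFFF needs a surrogate pair in UTF-16.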
        
        I recommend checking out the Wikipedia article Comparison of Unicode encodings [wikipedia.org].
        
        So, perhaps if data was all stored and represented in UTF-8, for example, this wouldn't be a problem?
        
        You can't impose this on the whole world; a lot of the protocols and file formats that we use eve
- I don't think you know what you're talking about.. (Score:2)
  
  by msimm ( 580077 ) writes:
  
  $-$
  
  They've been trying to sell this kind of kit to us for years.
- Re: (Score:2)
  
  by jrumney ( 197329 ) writes:
  
  There have been many vulnerabilities in the past that were based on encoding a URL in some broken (or even non-broken) way to get past the first level of URL checking to a lower level where directory traversal is possible. On Unix based servers, the risk of this is mitigated by running your webserver in a chroot jail. On IIS, you just have to hope that IIS 6.0 is actually fundamentally secure down to its lowest levels, not just an insecure product with a thin veneer of security layered over it like previous
  - Unlimited impact. (Score:2)
    
    by dolmen.fr ( 583400 ) writes:
    
    On Unix based servers, the risk of this is mitigated by running your webserver in a chroot jail.
    chroot jail doesn't protect your application against XSS.
  - - Re: (Score:2)
      
      by jrumney ( 197329 ) writes:
      
      What does "no privs" mean on Windows? Clearly IIS 6.0 does have privileges. It has opened port 80 for listening for example, and it can read files and run scripts. So it cannot really mean no privileges.
      - Re:Limited impact. (Score:5, Informative)
        
        by TheRaven64 ( 641858 ) writes: on Tuesday May 22, 2007 @06:27AM (#19218865) Journal
        
        Windows makes no distinction between privileged and unprivileged ports, so any application that can open sockets can listen on port 80. That said, every port number (and every other object in the NT kernel) has an associated ACL, so it is possible to limit them on an individual basis. I've never seen this exposed to the UI though, so I've no idea how you'd go about doing it. Filesystem objects also have ACLs, so I'd imagine that IIS is not allowed access to the filesystem outside the tree it is sharing.
        The NT kernel provides a lot of facilities that are very useful for writing secure code. I often wonder if the application developers at Microsoft ever noticed that they weren't writing code on top of DOS anymore...
        
        
        Re: (Score:2)
        
        by SEMW ( 967629 ) writes:
        
        I've never seen this exposed to the UI though, so I've no idea how you'd go about doing it
        IIRC, in Windows XP, View -> Folder options -> untick "Use simple file sharing (recommended)" will let you see and edit an object's permissions through its properties dialogue.
        
        In Vista this is now enabled by default, which I suppose is inevitable since MS are making permissions so much more visible with UAC and such; but I do wonder how many people will go randomly clicking around to see what it does, click through the UAC dialogue, and end up doing something like removing permission to access th
        
        Re: (Score:2, Informative)
        
        by flydpnkrtn ( 114575 ) writes:
        
        I think he meant getting to "port object permissions" on a programmatic level... with an API. What you are describing are filesystem Access Control Lists. He's talking about using ACLs on ports. Everything being an object in NT, and being able to have ACLs applied to "everything," is a good idea. As the grandparent said, the application developers at MS just have to use them.
        
        Basically the "Security tab" you see for files could be applied to individual ports.
        
        Re: (Score:2)
        
        by Doctor Memory ( 6336 ) writes:
        
        I do wonder how many people will go randomly clicking around to see what it does, click through the UAC dialogue, and end up doing something like removing permission to access the C: drive for everyone but their pet dog...
        Oh, c'mon, nobody's that dumb. Well, except maybe this guy [bash.org]...
        
        Re:Limited impact. (Score:5, Informative)
        
        by rabtech ( 223758 ) writes: on Tuesday May 22, 2007 @11:11AM (#19221887) Homepage
        
        The NT kernel has a root namespace for everything in the system (from local filesystems to network drives to sockets to synchronization objects like mutexes), and in fact treats everything as a file (just like Unix) underneath.
        
        Using the Native (NT Executive) API you can read or set the ACL on any object in the namespace, assuming you have the appropriate user rights and you own the object (or the ACL allows you to modify the permissions). NT kernel objects can also be case-sensitive (though that can confuse some Win32 programs). Often, you can delete, move, etc files that are locked by the Win32 subsystem, which can be useful in certain situations (though in Vista they made the IO system capable of cancelling outstanding IOs on its own so the zombie process bug that ends up locking files doesn't happen anymore. It's unfortunate Vista is so DRM-laden, or I'd try upgrading.)
        
        The APIs are NtQuerySecurityObject and NtSetSecurityObject and I believe the devices are in \Device\Tcp, \Device\Ip, \Device\RawIp, \Device\Udp, etc. Check out http://undocumented.ntinternals.net/ [ntinternals.net] for more details on what is in the native API (ntdll). This API provides everything necessary to implement a full POSIX layer, which is exactly what Services for Unix does, installing itself as a new runtime subsystem right next to the Win32 subsystem. (With Server 2003 R2 SP2 they shipped it as an available component as part of the install; I've even got setuid support and GCC installed as part of the package.)
        
        
        MOD PARENT UP (Score:2)
        
        by TheRaven64 ( 641858 ) writes:
        
        Great reply, thank you. I haven't used Windows for a few years, but it's good to keep up with this kind of thing, and I'm sure others can benefit from this information.
        
        Re: (Score:2)
        
        by Foolhardy ( 664051 ) writes:
        
        Also, the object manager namespace can be browsed with winobj [microsoft.com] or winobjex.
        
        Actually, the IO system has always been able to cancel IO operations, including by terminating the thread owning the operation. However, IO can only be canceled when the drivers owning the operation allow it to be, and Vista got rid of many of the places IO could block but couldn't be canceled in the standard drivers. MUP (which does UNC network host lookups) in particular.
        
        I had the same idea about reaching the ACLs of objects w
        
        Re: (Score:2)
        
        by Foolhardy ( 664051 ) writes:
        
        One nitpick: while open sockets are indeed file objects, and starting with Server 2003 SP1 the endpoint drivers do support ACLs on open sockets [msdn.com], unopened sockets (i.e. the port numbers themselves) are not objects, and do not have ACLs. There are firewalls that can control access to socket operations on a per process basis, but they're implemented as special TDI filters with special rules, usually not standard ACLs.
        
        I've spent some time implementing a security descriptor editor [dyndns.org] designed to expose ALL object
      - Re: (Score:3, Insightful)
        
        by Ravnen ( 823845 ) writes:
        
        The Network Service account on Windows has similar privileges to a normal user, which means it can't access files owned by other users, but can of course read some files owned by the system. The notion of reserved ports doesn't exist on Windows, so no software makes security assumptions based on whether or not a port is below 1024, and the ability to open port 80 doesn't imply any higher privileges than the ability to open any other port.
        At any rate, running in a chroot jail is arguably better in some way
        
        Re: (Score:2)
        
        by Ravnen ( 823845 ) writes:
        
        Interesting, I didn't know about that. All I meant is it doesn't use the old BSD distinction of ports below 1024 being reserved for privileged users, with 1024 and above being open to unprivileged users. I suppose you could effectively set it up that way using port filtering.
Incident response (Score:4, Interesting)

by Anonymous Coward writes: on Tuesday May 22, 2007 @03:11AM (#19217961)

I work incident response in a large web company (hence anonymous posting, natch) and currently we're treating this as "interesting, but case not proven". We test that our web apps filter all input, so I'm adding double-width Unicode to our security regression test cases; however, I'm happy to let the FD posters lab it out between them in the short term. These alleged IIS exploits don't work for us - which is not to say that we don't have some system, somewhere, for which this is an issue. At the end of the day it's just a clear restatement of something that's obvious to anyone - you need to filter input carefully, and you need to be aware of issues around alternative encodings. But it's not a "BRB" (big-red-button, i.e. emergency stop and all hands to the pumps to fix a vulnerability) issue for us - yet. The last time we had one of those, it was the Microsoft DNS server remote root... because most of our internal domain controllers were also running DNS servers.
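A regression suite of the kind described could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the payload list, the helper names, and especially `is_rejected`, which stands in for whatever input filter the application under test really uses.

```python
import unicodedata

# Hypothetical sample payloads; a real suite would use its own corpus.
PAYLOADS = ["<script>", "' OR 1=1 --", "../"]


def fullwidth_variant(text: str) -> str:
    """Swap printable ASCII for the fullwidth forms at U+FF01..U+FF5E."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E else ch for ch in text
    )


def is_rejected(value: str) -> bool:
    """Stand-in for the filter under test: fold width variants, then check."""
    folded = unicodedata.normalize("NFKC", value)
    return any(bad in folded for bad in ("<", "'", "../"))


def regression_cases():
    """Yield each payload in its plain form and in its fullwidth form."""
    for payload in PAYLOADS:
        yield payload
        yield fullwidth_variant(payload)


for case in regression_cases():
    assert is_rejected(case), f"filter missed {case!r}"
print("all width variants rejected")
```

The point of pairing each payload with its variant is that a filter which only handles the plain form fails the suite immediately.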

"Not vunerable" (Score:3, Informative)

by iamacat ( 583406 ) writes: on Tuesday May 22, 2007 @03:20AM (#19218001)

According to the advisory, Apple products do not provide HTTP content filtering and are therefore not vulnerable. This will do nothing to help someone build a functioning protection system.

- Re: (Score:2)
  
  by KiloByte ( 825081 ) writes:
  
  Yeah, no "content filtering" is needed, why would it be? Any text is either the request (and thus not "content") or mere data, in the second case it shouldn't be filtered unless something is terribly broken.
  
  Trying to parse encapsulated data is a bad idea generally; as is trying to detect the same attack twice. Of course, unless you're a snakeoil^Wsecurity software salesman.
bypassing great firewall? (Score:2, Interesting)

by z-j-y ( 1056250 ) writes:

I'm wondering if the great firewalls (Cisco product?) are also vulnerable to this. At least it'll force them to do longer string matching.
Nothing to see, move along ... (Score:5, Insightful)

by udippel ( 562132 ) writes: on Tuesday May 22, 2007 @06:24AM (#19218849)

It is a vulnerability, in the strict sense.
It is a self-inflicted misbehaviour as in common sense.
It is like those silly Cisco content inspectors on port 25, that try to avoid attacks on flimsy MTAs.
It is like someone dying from a jab against measles: the jab protected that person from contracting measles, actually.
It is like those stupid anti-virus programs that are more vulnerable than the daemons they profess to protect.

When the attacker uses a codepage different from the one that you think she ought to use, she can circumvent your content filter. Which ought not be an attack vector, in any case.

As I said: nothing to see, move along ...

flawed design .. (Score:2)

by rs232 ( 849320 ) writes:

What kind of a flawed design is it where character encoding can impact security? The concept of scanning for unsafe strings is also flawed, as in the case of virus scanning, as it only knows about the stuff it knows about. This is another example of Ranum's enumerating badness [ranum.com]. If the SQL engine used only stored procedures then you wouldn't have to run a content scanner as the only thing coming over HTTP is DATA.
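The "only DATA" idea can be illustrated with a small Python sketch. Note the substitution: it uses parameterized queries through the standard `sqlite3` module rather than stored procedures (SQLite has none), since both keep user input out of the SQL text. The table, column, and `find_user` names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(conn: sqlite3.Connection, name: str) -> list:
    """Look up a user; the value is bound as data, never spliced into SQL."""
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


print(find_user(conn, "alice"))
# A classic injection string is just an odd-looking name that matches nothing.
print(find_user(conn, "' OR '1'='1"))
# The same string spelled with fullwidth apostrophes (U+FF07) is also just data.
print(find_user(conn, "\uff07 OR \uff071\uff07=\uff071"))
```

Because the quote character never reaches the SQL parser as syntax, it does not matter which encoding of the quote an attacker picks.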
- Stored procedure cross-compatibility? (Score:2)
  
  by tepples ( 727027 ) writes:
  
  If the SQL engine used only stored procedures then you wouldn't have to run a content scanner as the only thing coming over HTTP is DATA.
  Do the popular free software implementations of SQL (MySQL, PostgreSQL, Firebird SQL, etc.) implement stored procedures in any sort of standard manner?
  - Re: (Score:2)
    
    by rs232 ( 849320 ) writes:
    
    'Do the popular free software implementations of SQL (MySQL, PostgreSQL, Firebird SQL, etc.) implement stored procedures in any sort of standard manner?'
    
    I don't know what you mean by standard manner. According to this [postgresql.org] PostgreSQL uses something called procedural languages. But then again, since when was SQL ever implemented in a common standard? Remember when Microsoft 'extended' SQL so as to allow spaces in table names, you only have to wrap the name in square brackets [] or back-ticks ``.
    
    But my point
Another likely example of OSS? (Score:2)

by erroneus ( 253617 ) writes:

Back in the Win95 days, I recall a stupid little exploit that would lock up a Win95 machine. The root of the problem, however, was in the TCP/IP code from BSD's source. Microsoft had used BSD's TCP/IP stack code in building one for Win95. I'm not here to complain that big bad commercial vendors are "stealing" from the open source community. I'm just suggesting that perhaps this is yet another example of how OSS has made yet another important, though silent, contribution.

It's annoying to me when people
- Re: (Score:2)
  
  by El_Muerte_TDS ( 592157 ) writes:
  
  Just because you use "free" code doesn't mean you don't have to check it for correctness.
  If X works in system Y doesn't imply it works in system Z. Heck, the reason it works in Y could be because of a bug in Y.
- TCP/IP code from BSD .. (Score:2)
  
  by rs232 ( 849320 ) writes:
  
  'Back in the Win95 days, I recall a stupid little exploit that would lock up a Win95 machine. The root of the problem, however, was in the TCP/IP code from BSD's source'
  
  I assume you are referring to the ping of death [archive.org]. The root cause being a bug in the TCP protocol and occurred on other platforms not using the BSD code.
  
  was Another likely example of OSS?
- - Re: (Score:3, Informative)
    
    by Frankie70 ( 803801 ) writes:
    
    Apparently, Vista's networking stack has been rewritten from scratch -- which does make you wonder how much of the reason for that was technical, and how much was MS wanting to be seen to get rid of all the BSD/*nix code in Windows in preparation for their patent offensive...
    
    Why should using BSD code come in the way of their patent offensive?
    Using BSD code isn't infringing on BSD's or someone else's patent.
IIS's fault (Score:2)

by phasm42 ( 588479 ) writes:

After reading through this carefully, it seems the fault is really with the webserver software (in this case, IIS). The problem is that normally a full-width character (such as FF1C in the example) and the regular character "<" are not equivalent, but IIS is translating the full-width form of a character into the regular character, so although the two forms were distinct before reaching the frontline filters, they are no longer distinct by the time it reaches application code running under IIS.
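The mismatch described above can be reproduced in a few lines of Python. This is a sketch under stated assumptions, not a model of IIS internals: `frontline_filter_allows` is a made-up naive scanner, `%uXXXX` is the non-standard escape form, and NFKC normalization stands in for whatever folding the real server performs.

```python
import re
import unicodedata


def frontline_filter_allows(raw_query: str) -> bool:
    """Hypothetical scanner: blocks a literal '<' or its %3C escape only."""
    return "<" not in raw_query and "%3c" not in raw_query.lower()


def decode_percent_u(raw: str) -> str:
    """Decode non-standard %uXXXX escapes into characters."""
    return re.sub(
        r"%u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), raw
    )


def server_view(raw_query: str) -> str:
    """What the application sees if the server decodes, then folds width."""
    return unicodedata.normalize("NFKC", decode_percent_u(raw_query))


request = "q=%uFF1Cscript%uFF1E"
print(frontline_filter_allows(request))  # scanner verdict on the raw bytes
print(server_view(request))              # text handed to application code
```

The scanner and the server disagree about what the request contains, which is the whole bypass: the two forms are distinct at the filter and identical at the application.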

I guess
- Re: (Score:2)
  
  by DrVomact ( 726065 ) writes:
  
  "Full width" vs. "Half width" (or, as I prefer, "half-wit") characters exist for typographical convenience in rendering Japanese characters. (Take a look at the Unicode spec, section 10.3 for example http://www.unicode.org/book/ch10.pdf/ [unicode.org]). This does not, however, explain why certain symbols that are already defined in other parts of the Unicode standard, such as the less-than symbol (or left angle bracket) are duplicated there. I suspect that it has something to do with possible confusions that might arise
  - Re: (Score:3, Insightful)
    
    by phasm42 ( 588479 ) writes:
    
    here are 2 ways of producing the < glyph: you can use character code x8B or xFF1C.
    Shouldn't that be x3C?
    I'm not sure if that's right or wrong, if there is a right and wrong way to handle this issue (I suppose that means it's excellent grounds for a religious war)--it's just important that it be handled consistently.
    I thought about this a little more, and I think the difference will be in what it is used for. In HTML, the "<" glyph has a special meaning, so it makes sense that a different version (in
    - Re: (Score:2)
      
      by DrVomact ( 726065 ) writes:
      
      Shouldn't that be x3C?
      
      Er...yes, of course. Apparently x8B is one of those European-style single quotes (at least that's what I think the purpose of that character is) that looks like a small left angle bracket. (There's a double version as well.)
      
      That's what I get for posting from work, where I have to keep looking over my shoulder watching for my boss, who doesn't understand that posting to /. is research.
- Re: (Score:3, Informative)
  
  by spitzak ( 4019 ) writes:
  
  They are there for compatibility with some Japanese and Chinese character sets, which contained most of the ASCII characters in both "half" and "full width" forms. The full-width ones were twice as wide to match the square characters, which was useful for lining up columns.
  
  This is all pointless now with proportionally-spaced fonts (and multiple fonts, you could easily select the "wide" font to print those characters instead). However Unicode had as a design requirement that translating from any common encod
Don't Steal my WoW account! (Score:2)

by Evil W1zard ( 832703 ) writes:

So how long til we find out that there has been exploitation of this vulnerability for X number of months for the sole purpose of stealing our WoW accounts!!!

Why steal someone's real identity when you can steal their uber virtual Undead Priest identity and sell it for 16 bucks.
US-CERT != CERT (Score:2)

by mabu ( 178417 ) writes:

Am I the only one who has noticed that since CERT partnered with the US Government, the response time on advisories has been much slower, and the details and depth of reports are less comprehensive? CERT advisories used to be a critical part of our security strategy. Now by the time they hit the mailing list (if at all), they're more of an afterthought.

Is there a better alternative to CERT now? Because it just isn't cutting it. I am familiar with Bugtraq and Security Focus. By the time CERT mentions somet
half-wit encoding? (Score:2)

by DrVomact ( 726065 ) writes:

Full-width and half-width encoding is a technique for encoding Unicode characters.
That comes as a complete surprise to me, and I thought I knew at least a little about Unicode and other character encoding schemes. The usual methods of encoding Unicode character points are UTF-8 (variable-length scheme where characters may be represented by anything from one to six bytes), UTF-16 (fixed-width double byte encoding), UTF-32 (fixed-length 4 byte encoding), and well there's UTF-7 and other oddballs. But the cl
- Re: (Score:2, Insightful)
  
  by HeroreV ( 869368 ) writes:
  
  UTF-16 (fixed-width double byte encoding)
  UTF-16 is a variable-width encoding. Code points from plane 0 are encoded in 16 bits and code points from planes 1 through 16 are encoded as two 16 bit surrogates. Many developers, like you, aren't aware of this, so it's very common for software to choke on UTF-16 with surrogate pairs.
  I don't understand how mistaking one character for another is going to break anything
  scenario:
  1) You escape a Unicode string that contains fullwidth characters. The fullwidth characters
- Re:Send your claim in now (Score:5, Funny)
  
  by QuantumG ( 50515 ) writes: <qg@biodome.org> on Tuesday May 22, 2007 @02:47AM (#19217867) Homepage Journal
  
  IIS 6 hasn't had a public remotely exploitable bug in it. Ever.
  That's bullshit anyway, I've got dozens of remote exploits for IIS 6.
  
  Oh, you said public.. hehe, forget I said anything.
  
- Re: (Score:2)
  
  by tuxedobob ( 582913 ) writes:
  
  Exactly what I was thinking.
- - Re: (Score:2)
    
    by jrumney ( 197329 ) writes:
    
    It allows you to hide an exploit from first level scanners so it gets through to a deeper level.
    - Re: (Score:2)
      
      by LurkerXXX ( 667952 ) writes:
      
      Right, which is an exploit which allows you to claim $16,000 exactly how? Hint: It doesn't. This isn't an exploit at all.
- Re: (Score:2)
  
  by nmoog ( 701216 ) writes:
  
  Yeah. Actually, 7 or 8 bits per character really seems excessive to me, and opens the door to additional attack vectors. Surely if people can't take the time to learn to communicate in 1 bit they should not be allowed to use the internet.
- Re: (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  To think that even English fits in 7-bit ASCII is naïve.
  - Re:Smelly foreigners (Score:5, Funny)
    
    by ettlz ( 639203 ) writes: on Tuesday May 22, 2007 @04:57AM (#19218417) Journal
    
    To think that English doesn't fit in 7-bit ASCII is na\"ive.
    
    - Re: (Score:2)
      
      by earthbound kid ( 859282 ) writes:
      
      When you wrote \"i you used three bytes (assuming an ASCII-style one byte per character encoding) to represent one character. In contrast, Unicode represents ï as codepoint x00EF, which in UTF-8 ends up as two bytes, xC3 and xAF.
      
      You should amend your quote to "you can represent English in 7 bits... just so long as you're willing to use more than 7 bits to do it."
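The byte counts in this comparison are easy to verify with a short Python sketch. The helper `byte_cost` is a hypothetical name; the encodings are the standard `ascii` and `utf-8` codecs.

```python
def byte_cost(text: str, encoding: str) -> int:
    """Bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


i_diaeresis = "\u00ef"   # the single character i-with-diaeresis, U+00EF
tex_escape = '\\"i'      # the three ASCII characters: backslash, quote, i

print(byte_cost(tex_escape, "ascii"))                             # escape sequence in ASCII
print(" ".join(f"{b:02X}" for b in i_diaeresis.encode("utf-8")))  # UTF-8 bytes of U+00EF
print(byte_cost('na\\"ive', "ascii"), byte_cost("na\u00efve", "utf-8"))
```

For the whole word, the escaped ASCII spelling costs seven bytes and the UTF-8 spelling six, which is the trade the comment is pointing at.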
      - Re: (Score:2)
        
        by petermgreen ( 876956 ) writes:
        
        a few people trying to look posh may use the odd diacritic on a loanword or as a heavy metal umlaut but really they aren't necessary for English.
        
        you can get down to 6 bits per character if you are prepared to do away with either most punctuation or mixed case.
        
        Re: (Score:2)
        
        by The Warlock ( 701535 ) writes:
        
        How low can you go if you completely forgo proper spelling?
    - Re: (Score:2)
      
      by cortana ( 588495 ) writes:
      
      How na+AO8-ve!
  - - Re: (Score:3, Funny)
      
      by Hognoxious ( 631665 ) writes:
      
      There are no accent marks in English.
      è is sometimes used to indicate that the e in a past participle is pronounced, eg learnèd (rhymes with Bernard) as opposed to learned (rhymes with burned).
      When loan words with accent marks come into English, the accent marks are dropped.
      The umlaut in naïve is retained to indicate that it doesn't rhyme with glaive.
      Loan words that have been in English long enough even tend to have their pronunciations and/or spellings Anglicized.
      Yes, that's why I'm post
      - Re: (Score:2)
        
        by Nick Number ( 447026 ) writes:
        
        Brits tend to pronounce it BURN-urd, whereas Americans favor bur-NARD.
- Re: (Score:3, Interesting)
  
  by Hognoxious ( 631665 ) writes:
  
  Would some of the things that led to computers - morse code, telegraphy etc have been feasible using, say, Chinese in its normal written form? Are computers biased towards English (and other languages using the same or similar alphabets) because they were largely invented by English speakers, or is the language fundamentally more amenable to small, simple encoding?
  - Depends on alphabet size (Score:3)
    
    by Viol8 ( 599362 ) writes:
    
    If you want to represent a language on a computer (and not just numbers) then you need a way to enter and store all the characters that language uses. Obviously the fewer characters the better. The Latin alphabet with all its variations, the Cyrillic, Hebrew, Arabic & Korean all lend themselves to this quite easily since they all have a manageable number of letters. Languages such as Chinese and Japanese don't; they don't even use alphabets, they use characters for each object/concept which as you can
    - Re: (Score:2)
      
      by setagllib ( 753300 ) writes:
      
      Japanese is actually even more complicated than Chinese in that regard. On the one hand it *does* have definite limited alphabets (e.g. hiragana) but also imports a huge amount from Chinese characters. So not only do they have multiple base alphabets, all of which have large distinct character counts, they also have a character library. In doing so I think they have the worst of both worlds - a lot to remember, hard to encode, and not even very compatible with other languages.
      
      Chinese ideographs are so numer
      - Re: (Score:3, Informative)
        
        by TheRaven64 ( 641858 ) writes:
        
        Chinese ideographs are so numerous and difficult to remember that they are considered one of the reasons for China's incredibly low literacy rate.
        If you want some evidence of this, then take a look at what happened to Korea when it dropped the Chinese ideograms in favour of a new, home-grown phonogram-based alphabet.
        
        Re:Depends on alphabet size (Score:5, Interesting)
        
        by rabtech ( 223758 ) writes: on Tuesday May 22, 2007 @11:27AM (#19222151) Homepage
        
        IIRC, China was on its way to moving to an alphabet system (certain characters can be used for their alphabetic sounds in various circumstances) and so was Japan (look at Katakana/Hiragana).
        
        It is likely that the introduction of the printing press (and later mass media like TV/radio and computers) have "arrested" this natural evolution. It may also be possible that the development of a national identity and cohesive society tends to put the brakes on some developments as well - if a single unified language is mandated by culture or a central authority then local variations are much less important.
        
        Romaji (and to a certain extent English itself) is definitely influencing the Japanese; the younger generations even more so. Japan may end up using an alphabet for day to day needs almost exclusively within the next 100 years. The situation in China is much less clear but it will probably happen eventually.
        
        If we look into the past, nearly all societies with ideographic/logographic writing systems eventually moved to an alphabetic system. Hell, even Ancient Egyptian Hieroglyphs were partially syllabic much like Katakana. Much as previous posters have pointed out, changing to an alphabetic system from Chinese-characters has allowed Korea to dramatically raise literacy rates. There is only so much time for schooling and memorization, and only so much effort to expend on literacy. If a simpler writing system is more accessible then that is a net gain, even if there are a few things that logographic writing systems do better than alphabetic ones.
        
        
        Re: (Score:3, Informative)
        
        by ShakaUVM ( 157947 ) writes:
        
        You're missing the key roadblock to simply replacing characters with pinyin, or any other romanization: Chinese is a heavily overloaded language. While there are a few homophones in English, *every* word in Chinese is a homophone, with something like 13 different homophones per sound on some of them. We differentiate some of our homophones by writing them differently (layed, laid, etc.); Pinyin *cannot* differentiate these homophones -- it's an exact transcription of the sound. Chinese differentiate their wr
        
        Re: (Score:3, Interesting)
        
        by loyukfai ( 837795 ) writes:
        
        IIRC, China was on its way to moving to an alphabet system (certain characters can be used for their alphabetic sounds in various circumstances)...
        I'm Chinese, but I have never heard of this. Would you be so kind as to educate me on this...? Where did you hear such things?
        I'm serious.
      - Re: (Score:2)
        
        by ickoonite ( 639305 ) writes:
        
        I don't think it actually has much to do with the complexity of the script - Japanese is, as you say, more complicated, and yet Japan has long had some of the highest literacy rates in the world, even before its modern era. I think - as someone else has suggested here - it has far more to do with the lack of access to education due to poverty, etc. rather than the inherent complexity of hanzi.
        
        Besides, because of the vast number of homonyms in Chinese, an ideographic writing system makes discerning intende
  - Re: (Score:2)
    
    by kahei ( 466208 ) writes:
    
    Would some of the things that led to computers - morse code, telegraphy etc have been feasible using, say, Chinese in its normal written form?
    
    Well, they weren't feasible using English in its normal written form... so I'd guess they wouldn't be feasible using Chinese in its normal written form either.
    
    Offhand I can't think of any human script or language that's fundamentally suitable to telegraphy. Which isn't really all that surprising.
  - Re:Smelly foreigners (Score:4, Interesting)
    
    by TempeTerra ( 83076 ) writes: on Tuesday May 22, 2007 @06:51AM (#19218941)
    
    The notable difference between Chinese and English (or most other written languages) is that several English characters combine to form syllables, which combine to form words (i.e., we use an alphabet). In Chinese, each character corresponds directly with a word (each character is a logogram). If you're interested you can look up Alphabet on Wikipedia as a starting point, although I must admit I find the article hard to follow even though I know what it should be saying.
    
    The practical result of this is that English is normally encoded as a long sequence of 0-25 values (a-z), whereas Chinese would be encoded as a shorter sequence of 0-~100,000 values (Wikipedia reports Chinese dictionaries with 85,000 characters). Naturally, there would be fewer Chinese characters required for a message as each character corresponds to an entire word.
    
    I guess that since morse code is rather like binary and English letters can be encoded using 5 bits, Chinese morse codes would need to be... about 20 bits long? It's late at night, brain not work so good. It seems to me that morse codes using 20 dots/dashes would be extremely difficult to learn; but on the other hand it shouldn't be any more difficult than learning Chinese characters in the first place.
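
A quick check of the arithmetic above, as a minimal Python sketch. It assumes a fixed-width code (every symbol gets the same number of bits), which is a simplification: real Morse code is variable-length. `fixed_width_bits` is a made-up helper name, and 85,000 is simply the dictionary size quoted in the post.

```python
def fixed_width_bits(symbol_count: int) -> int:
    """Smallest number of bits that gives each of symbol_count symbols its own fixed-width code."""
    if symbol_count < 1:
        raise ValueError("need at least one symbol")
    # Codes run from 0 to symbol_count - 1, so we need enough bits for the largest code.
    return (symbol_count - 1).bit_length()


english_bits = fixed_width_bits(26)       # a-z
chinese_bits = fixed_width_bits(85_000)   # dictionary size quoted above

print(english_bits)  # 5
print(chinese_bits)  # 17
```

Under that assumption the Chinese figure comes out at 17 bits rather than 20, and restricting the code to a few thousand common characters would shrink it further.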
    
    I wouldn't be surprised if English morse codes were more robust against poor data, siny Englxsh is stvll reahible even if sew2eral cheracter; are wrong.
    
    Disclaimer: I don't know anything about the subject, I'm talking out of my elbow for the sake of discussion.
    
    - Re: (Score:3, Informative)
      
      by jc42 ( 318812 ) writes:
      
      The notable difference between Chinese and English (or most other written languages) is that several English characters combine to form syllables, which combine to form words (i.e., we use an alphabet). In Chinese, each character corresponds directly with a word (each character is a logogram).
      
      Actually, this is pretty much a myth that originated from people with very little knowledge of Chinese language and writing. In all the Chinese languages ("dialects";-), most of the vocabulary is two-syllable words, a
  - Re: (Score:3, Informative)
    
    by vtcodger ( 957785 ) writes:
    
    ***Would some of the things that led to computers - morse code, telegraphy etc have been feasible using, say, Chinese in its normal written form?***
    The answer would seem to be -- sort of ... maybe. See http://www.njstar.com/tools/telecode/jim-reeds-ctc.htm [njstar.com].
    Summary: For telegraphy, Chinese characters are assigned numeric codes in radical-stroke count order. That's the way that Japanese and -- I assume -- Chinese dictionaries are arranged.
    It may seem inefficient to use 20 bits (sort of) to encode
    - Re: (Score:3, Informative)
      
      by jc42 ( 318812 ) writes:
      
      [T]he classic Chinese numeric notation is not as convenient as 'arabic' notation. But it's much less unwieldy than say Roman numerals, so I don't think it would have been an insurmountable hurdle either.
      
      Actually, classical Chinese numbers are only slightly worse than Arabic notation (which apparently developed in India but was spread by Arab traders who knew a good accounting system when they saw it). The Chinese notation was far better than any of the Western number notations that the Arabic notation suppl
  - Re: (Score:2)
    
    by steelfood ( 895457 ) writes:
    
    First, only about five thousand characters are actually commonly used, with fewer than two thousand tones to represent those five thousand characters. That gives rise to my second point, which is that spoken Chinese can be highly contextual (hence the propensity for puns and other wordplay).
    
    My guess is that morse code would have evolved to be the same way that ASL simplifies language considerably. Each sequence would represent a different idea, or character, but every idea could pretty much be conveyed with a
- Re:Not a surprise... (Score:5, Insightful)
  
  by etnu ( 957152 ) writes: on Tuesday May 22, 2007 @04:42AM (#19218359) Homepage
  
  You'd prefer securing against vulnerabilities in dozens, if not hundreds of different encodings? The only people who are against Unicode are those that have never had to work with more than one written language in the same project. Yes, it's a lot easier to secure stuff when you only accept ASCII or ISO8859-1/Windows CP-1252, but then you're limiting your software to about a third of the world (if that). Crappy engineers are going to write crappy code no matter what the encoding. No sense compromising for the sake of poorly written software.
  
  - Re: (Score:2)
    
    by HopeOS ( 74340 ) writes:
    
    I have no problem with an encoding that is capable of encapsulating all the world's languages. I use it daily.
    
    I take issue with the fact that they implemented it so poorly.
    
    1. It is impossible to determine if a character is whitespace; you have to look it up in a table.
    2. It is impossible to determine if the character is even printable; you have to look it up in a table.
    3. It is impossible to determine if the character has another, more canonical presentation; you have to look it up in a table.
    
    That's a lot of
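
For illustration, the three table lookups listed above look roughly like this in Python, whose standard `unicodedata` module ships the Unicode character database. This is only a sketch: `describe` is a hypothetical helper, and NFKC is used here as one possible reading of "more canonical presentation".

```python
import unicodedata


def describe(ch: str) -> dict:
    """Look up the properties the post mentions for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),       # e.g. "Zs" = space separator
        "is_space": ch.isspace(),                   # point 1: whitespace?
        "is_printable": ch.isprintable(),           # point 2: printable?
        "nfkc": unicodedata.normalize("NFKC", ch),  # point 3: compatibility-folded form
    }


ideographic_space = describe("\u3000")  # IDEOGRAPHIC SPACE
fullwidth_a = describe("\uff21")        # FULLWIDTH LATIN CAPITAL LETTER A

print(ideographic_space)
print(fullwidth_a)
```

None of these answers can be read off the code point's bit pattern; each one comes from a data table, which is the cost the post is complaining about.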
- Re:Not a surprise... (Score:4, Insightful)
  
  by KiloByte ( 825081 ) writes: on Tuesday May 22, 2007 @05:23AM (#19218513)
  
  Wrong, the flaw in Cisco's "security" software and IIS is due to them converting things to 8-bit charsets, not due to Unicode. In fact, the whole idea of "code pages" is fundamentally broken, as it assumes that data only ever moves between places in the same region.
  
  The idea of double-width characters is broken too, yeah, and they are there only to appease the users of some broken Chinese/Japanese software -- but there's nothing wrong with having strange characters in file names. They don't match any file they are not supposed to unless you try to shoehorn them into a limited character set.
  
  So, it's a flaw in the software, not Unicode by itself.
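
To make the mechanism concrete, here is a hedged Python sketch of a scanner that compares raw characters while the system behind it folds fullwidth forms down to ASCII. NFKC normalization stands in for whatever conversion a real back end performs; that substitution, the blocked pattern and the function names are all assumptions for illustration, not details taken from the advisory.

```python
import unicodedata

BLOCKED = "../"  # pattern the hypothetical scanner is looking for


def naive_filter_allows(request_path: str) -> bool:
    """Scanner that only looks at the characters exactly as received."""
    return BLOCKED not in request_path


def normalizing_filter_allows(request_path: str) -> bool:
    """Scanner that first folds compatibility forms (fullwidth etc.) to their plain equivalents."""
    folded = unicodedata.normalize("NFKC", request_path)
    return BLOCKED not in folded


# U+FF0E is FULLWIDTH FULL STOP, U+FF0F is FULLWIDTH SOLIDUS.
fullwidth_path = "/scripts/\uff0e\uff0e\uff0fsecret.txt"

print(naive_filter_allows(fullwidth_path))        # True: slips past the scanner
print(normalizing_filter_allows(fullwidth_path))  # False: caught after folding
```

The point is the mismatch: the scanner and the system it protects have to agree on what the bytes mean, otherwise one of them is checking a different string.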
  
  - Re: (Score:2)
    
    by Teancum ( 67324 ) writes:
    
    There is the option of using UTF-8 instead of UTF-16 for the encoding of Unicode characters. Most implementations of Unicode insist upon UTF-16 (meaning all characters including Latin alphabets use 16 bits per letter). If you have some software that your anticipated audience is primarily Latin alphabet users but you want to make Unicode available, you can use UTF-8 to keep mostly 8-bit characters but allow the full Unicode code points (including 32-bit characters as well) if you need those non-Latin chara
    - Re: (Score:2)
      
      by TapeCutter ( 624760 ) writes:
      
      "due to poor implementation of existing standards and very lazy software developers"
      
      Tip of the day: Source code is like shit, everyone else's stinks.
      
      After two decades as a software developer (plus another as an amateur) I can tell you that 99% of the time both your design and implementation will be constrained by an existing code base. The whole thing is recursive: if your "bleeding edge" project becomes "leading edge", it will end up as a legacy system that will in its turn crush the ambition of a n
    - Re: (Score:2)
      
      by KiloByte ( 825081 ) writes:
      
      There is the option of using UTF-8 instead of UTF-16 for the encoding of Unicode characters. Most implementations of Unicode insist upon UTF-16 (meaning all characters including Latin alphabets use 16 bits per letter).
      "All characters"? I'm afraid that's only 1/17 of Unicode. And according to the law of mainland China, software which doesn't support codes over 16 bits can't be sold there -- well, the commies are nothing but lawful so it's mostly a paper requirement, but it's there.
      
      And UTF-16 has all the fl
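
The size claims in this exchange are easy to check. A minimal Python sketch follows; `encoded_sizes` is a made-up helper and the four sample characters are arbitrary picks. It shows how many bytes UTF-8 and UTF-16 need per character, and where the "1/17" figure comes from.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "LATIN CAPITAL A": "A",
    "LATIN SMALL E WITH ACUTE": "\u00e9",
    "CJK IDEOGRAPH U+4E2D": "\u4e2d",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",  # outside the first 65,536 code points
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# Unicode code points run from 0 to 0x10FFFF: 17 "planes" of 65,536 each.
planes = (0x10FFFF + 1) // 0x10000
print(planes)  # 17
```

A single 16-bit unit covers only the first plane; everything beyond it takes a surrogate pair (4 bytes) in UTF-16, so neither encoding is truly fixed-width.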
      - Re: (Score:2)
        
        by spitzak ( 4019 ) writes:
        
        Unfortunately Microsoft has completely fucked it up, and the "A" suffix functions are useless, too. What they do is use a "code page" to translate the bytes into that nasty utf16 and thus it will not pass utf8 through. Filenames on NTFS are actually stored in utf16, which means a nightmare of future compatibility. There is a third interface (the "multibyte" one) that could save us, but Microsoft oh-so-conveniently left out any way for a program to force the multibyte encoding to UTF8.
        
        Microsoft is not the only
- Re:Not a surprise... (Score:5, Insightful)
  
  by kahei ( 466208 ) writes: on Tuesday May 22, 2007 @06:19AM (#19218819) Homepage
  
  Down below this post, there's a troll writing something like 'lol if u cant just use ASCII u shud let ur language die u foreign creeps lol k thx'.
  
  And a whole bunch of people then jump on the troll and criticize him for his US-centrism, and so on, and the troll is at -1.
  
  Yet the post I'm replying to, which is at +4, really comes to the same thing as this troll; it's simply UNIX 8-bit centric rather than USA ASCII centric.
  
  The fact is, computers are used for text, and much if not most text is non-ASCII. How would you rather represent that text:
  
  --With Unicode
  --With KOI-8, KOI-8R, KOI-8RU, EBCDIC, EUC-KR, EUC-JP, shift-JIS, Shift-JIS-the-Jphone-version, ISCII, VISCII, ISO-2022-*, and the many many other encodings [hwacha.net] that have evolved in different times and environments.
  
  Seriously, which is going to be easier to secure (and otherwise manage) -- one encoding (which is HEAVILY documented and discussed) or a large number of encodings (the actual number being ever-changing and impossible to really know) many of which are not well documented and have forgotten ramifications and assumptions?
  
  Right -- so now you know why people use Unicode so much.
  
  But the interesting question is, why is one error ("All teh world is teh USA lol! Shouldn't you learn to speak English?") rightly jumped on and pounded flat, whereas another form that's actually more problematic ("All teh world is C on UNIX lolz!! Shouldn't you stop wanting dangerous extra features?") isn't?
  
  Actually, I see in another window that some people have indeed been pounding the parent poster flat, so perhaps my question isn't valid after all.
  
  - Re: (Score:2)
    
    by kalidasa ( 577403 ) writes:
    
    He's an idiot. I sometimes work with languages for which there simply IS NO OTHER ENCODING THAN UNICODE. Does he really want me to create new 8-bit encodings for each of them? Ones that won't be standardized, and so won't be easily exchangeable with other users?
    - Re: (Score:2)
      
      by Carewolf ( 581105 ) writes:
      
      How are you going to exchange your subset of unicode if no one else is using it anyway? Who is going to have the right fonts installed?
      
      The problem with unicode is that you assume people can decode all your data, but they actually can't. With small encodings people either have it installed or not. With unicode you have it, but it doesn't actually work for 99% of the symbols, because there are no complete fonts.
      - Re: (Score:2)
        
        by jZnat ( 793348 ) writes:
        
        You can have multiple font families installed that will cover most (if not all) of the Unicode characters when union'd together, so as long as the font manager grabs glyphs from a list of available fonts, all characters should be covered. Therefore, you don't really need a complete font family with all the Unicode glyphs.
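
The fallback idea described above can be sketched in a few lines of Python. Everything here is hypothetical: the font names, their coverage ranges and the `pick_font` helper are invented for illustration, and a real font manager would read coverage from the font files themselves.

```python
from typing import Optional

# Hypothetical coverage tables: font name -> range of code points it has glyphs for.
FONT_COVERAGE = {
    "LatinFont": range(0x0020, 0x0250),
    "CJKFont": range(0x4E00, 0xA000),
}

# Order in which the (imaginary) font manager tries the installed fonts.
FALLBACK_ORDER = ["LatinFont", "CJKFont"]


def pick_font(ch: str) -> Optional[str]:
    """Return the first font in the fallback list that covers ch, or None if none does."""
    code_point = ord(ch)
    for font in FALLBACK_ORDER:
        if code_point in FONT_COVERAGE[font]:
            return font
    return None  # no installed font has this glyph


print(pick_font("A"))       # LatinFont
print(pick_font("\u4e2d"))  # CJKFont
print(pick_font("\u0e01"))  # None (Thai is not covered by either made-up font)
```

Coverage is the union of the installed fonts, so no single font has to contain every glyph; anything outside the union still renders as a placeholder.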
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Oh, I don't dispute that Unicode is a good idea for Text representation. It just has no place in anything that is carrying executable code or commands. If you allow Unicode in command languages, then there is no way to secure them with human possible effort, since filters essentially stop working.
    - Re: (Score:2)
      
      by ultranova ( 717540 ) writes:
      
      If you allow Unicode in command languages, then there is no way to secure them with human possible effort, since filters essentially stop working.
      
      Um... Why ? Why is filtering command sequences made from 32-bit characters inherently any more difficult than filtering 7-bit characters ? It doesn't make any sense to me.
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        It is not, directly. But the problem is that unicode allows more than one representation for some characters. If one normalizer knows most, but not all of them, and another knows all, then the first one will see some strings as different that the second one will see as equal. Think variable names, function names and the like and you see the problem.
        
        Of course, if the normalizer is completely correct, then the problem does go away. Because of the complexity of Unicode, this is at the moment very hard to impossibl
        
        Re: (Score:2)
        
        by Intron ( 870560 ) writes:
        
        http://www.unicode.org/versions/ [unicode.org]
        
        Any time a standard has been changed, you will have some outdated, but perfectly correct software. Hence, two pieces of software may not agree on the meaning of a Unicode string even without a software error.
        
        Re: (Score:2)
        
        by gnasher719 ( 869701 ) writes:
        
        '' Any time a standard has been changed, you will have some outdated, but perfectly correct software. Hence, two pieces of software may not agree on the meaning of a Unicode string even without a software error. ''
        
        Actually, the normalisation functions are defined to be unaffected by future changes.
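
The "more than one representation" issue argued over in this sub-thread can be seen with a single accented letter. A minimal Python sketch, using the standard `unicodedata` module; the sample word and the `same_text` helper are made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # e-acute as one code point (U+00E9)
decomposed = "cafe\u0301"  # plain "e" followed by COMBINING ACUTE ACCENT (U+0301)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: equal once both are normalized
print(len(composed), len(decomposed))   # 4 5
```

Two components only disagree about such strings if one compares raw sequences while the other normalizes, or if they normalize differently, which is the scenario being debated here.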
    - Re: (Score:2)
      
      by gnasher719 ( 869701 ) writes:
      
      '' Oh, I don't dispute that Unicode is a good idea for Text representation. It just has no place in anything that is carrying executable code or commands. If you allow Unicode in command languages, then there is no way to secure them with human possible effort, since filters essentially stop working. ''
      
      Why would they stop working? As two examples, the bash shell and the Perl language don't assign any special meaning to any character with a code above 0x80, so Unicode using UTF8 encoding would be completely
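
The property this argument relies on is that UTF-8 never uses a byte below 0x80 to encode a non-ASCII character, so a byte-oriented parser cannot mistake part of one for an ASCII metacharacter. A small Python check, with arbitrary sample characters and a made-up helper name; UTF-16 is shown for contrast.

```python
def utf8_bytes_are_all_high(ch: str) -> bool:
    """True if every byte in the UTF-8 encoding of ch is 0x80 or above."""
    return all(byte >= 0x80 for byte in ch.encode("utf-8"))


non_ascii_samples = ["\u00e9", "\u4e2d", "\uff0f", "\u262f", "\U0001d11e"]

for ch in non_ascii_samples:
    print(hex(ord(ch)), utf8_bytes_are_all_high(ch))  # True for each sample

# Contrast: in UTF-16 (little-endian) U+262F is the two bytes 0x2F 0x26,
# which a byte-oriented tool would read as the ASCII characters "/" and "&".
print("\u262f".encode("utf-16-le"))  # b'/&'
```

This says nothing about lookalike characters such as the fullwidth solidus (U+FF0F); those only become a problem when some later layer converts them to their ASCII counterparts.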
  - I don't know Japanese law, so why support kanji? (Score:2)
    
    by tepples ( 727027 ) writes:
    
    The fact is, computers are used for text, and much if not most text is non-ASCII.
    In order to market my product in some other country, I have to familiarize myself with its laws. As of the foreseeable future, I have the time to do this only for the United States of America, for which ISO-8859-1 is "good enough" especially on a handheld device with 4 MB of RAM. It also costs money to license foreign fonts, unless you just want rectangles everywhere.
  - Re: (Score:2)
    
    by iabervon ( 1971 ) writes:
    
    The difference in this thread is that the OP claims that only ASCII should be used for "semantics-carrying containers", which is a confusing way of saying "control structures". The real flaw is that some systems will allow SQL string constants to be ended by non-ASCII double quote characters. In this case, the issue is the Unicode section for ASCII characters to be used in text where the normal characters have square space allotments. If the application behind a filter is using a human-meaning-preserving co
- Re: (Score:3, Informative)
  
  by gnasher719 ( 869701 ) writes:
  
  Unicode is of course not the problem at all.
  
  The problem is using character sets that can represent huge amounts of different characters, and among them characters that have similar looking glyphs. That is at the same time a feature that people really really want.
  
  So spam filters will have a problem. They filter out "Viagra" but they don't filter out sequences of letters that look the same. Well, tough. If you follow the rule not to follow any links in emails but type them in yourself, that gets you mostly ar
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    2. Why would you need to filter out anything at all? This is a completely brain-damaged approach in the first place, using user input to form commands that could potentially be dangerous and filtering out user input that would produce dangerous commands. Instead, there shouldn't be any commands that could be dangerous in the first place.
    
    And how do you propose to not have dangerous commands when you actually need them in places, just not coming in from that particular channel? Remove, e.g., "drop table" fr
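
One common way to keep a dangerous command available without letting it arrive through user input is to never splice that input into the command text at all. A hedged sketch using Python's standard `sqlite3` module with a bound parameter; the table, the data and the `find_user` helper are invented for illustration and are not taken from either poster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")


def find_user(connection: sqlite3.Connection, name: str) -> list:
    """Look up a user; name is passed as a bound parameter, never pasted into the SQL text."""
    cursor = connection.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


hostile = "x'; DROP TABLE users; --"           # classic injection attempt
hostile_fullwidth = "x\uff07; DROP TABLE users; --"  # same idea with a fullwidth apostrophe

print(find_user(conn, "alice"))            # [('alice',)]
print(find_user(conn, hostile))            # []
print(find_user(conn, hostile_fullwidth))  # []
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1: table untouched
```

With bound parameters the quoting character, in any encoding, is just data, so there is nothing for an input filter to miss on this path; `DROP TABLE` still works for code that issues it deliberately.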
- Re: (Score:2)
  
  by DrVomact ( 726065 ) writes:
  
  This comment does not make a bit of sense. What are "semantics carrying containers"? Why would Unicode be harder to secure than Shift-JIS or ANSI?
- - Re: (Score:2, Funny)
    
    by Gadget_Guy ( 627405 ) writes:
    
    Ooh, poor old BSD sounds really sick there. I hope that it doesn't die!
- Re: (Score:2)
  
  by udippel ( 562132 ) writes:
  
  I could see more exotic equipment like alcatel being untested yet but it seems there should be enough accessibility of d-link and fedora for example
  
  d-link doesn't do content filtering; at least not in your home.
  Fedora, is probably the same as other Debian/GNU/BSDs; depending on the applications performing the filtering.
  I fail to see the usefulness of the list of platforms mixed with trade names here.
  
  Am I the only one ?
  - Re: (Score:2)
    
    by aliquis ( 678370 ) writes:
    
    No, how can they know for sure that everything from HP is safe? And even less everything in any major dist with lots of packages.
    
    Must take a while to figure out, though I don't know how much software HP have made, but I guess many companies run small inhouse projects maybe written by someone as their ex-job or whatever.
  - - Re: (Score:2)
      
      by udippel ( 562132 ) writes:
      
      the firewall in my dlink router does SPI
      
      Don't want to quarrel with you, despite being on /., SPI isn't what you might think it was. It doesn't perform full Layer7 processing. It doesn't process content.
- Re: (Score:2, Funny)
  
  by Anonymous Coward writes:
  
  4) You are an idiot
  5) You are an asshole
- Re:Hmmmm.... (Score:5, Interesting)
  
  by peragrin ( 659227 ) writes: on Tuesday May 22, 2007 @06:55AM (#19218957)
  
  1) unicode is better than having a hundred other encodes to debug
  2) there are nearly two billion Chinese and Indians, who can't use your encoding.
  3)I get just as much spam from US companies as I do foreign ones
  
- - Re: (Score:2)
    
    by /ASCII ( 86998 ) writes:
    
    Misuse of wchar_t? Care to elaborate? My only complaint with wchar_t is that it is barely used at all. From what I've seen, programs that use wchar_t are shorter, more readable, and more secure.
    - - Re: (Score:2)
        
        by /ASCII ( 86998 ) writes:
        
        My experience with unicode and C is that it is painless when done right.
        
        You only need two things.
        
        1) Remember to call setlocale.
        2) Use wide character wrappers around all system functions that don't already have one, e.g. wopen, wrealpath, etc.. Never ever directly use narrow character strings for anything.
        
        Re: (Score:2)
        
        by Srin Tuar ( 147269 ) writes:
        
        About your point #2:
        
        On linux (any unix really) you want to avoid wchars and wide functions like the plague.
        
        The way to go for i18n is using utf-8 and bytes for character strings everywhere. (look into the gtk+ library for examples of this)
        
        The whole wchar experiment has been declared a failure, and is deprecated for any usage really.
