Spammers Using Soft Hyphen To Hide Malicious URLs
Trailrunner7 writes with this excerpt from ThreatPost illustrating the ongoing Spy-vs.-Spy battle between spammers and the rest of us:
"Spammers have jumped on the little-used soft hyphen (or SHY character) to fool URL filtering devices. According to researchers, spammers are larding up URLs for sites they promote with the soft hyphen character, which many browsers ignore. Spammers aren't shy about exploiting humans' flexible cognitive abilities to slip past the notice of spam filters (H3rb41 V14gr4, anyone?). ... The latest trend involves the use of an obscure character called the soft hyphen or 'SHY' character to obscure malicious URLs in spam messages. Writing on the Symantec Connect blog, researcher Samir Patil said that the company has seen recent spam messages that insert the HTML symbol for the soft hyphen to obfuscate URLs for Web pages promoted by the spammers."
Re:Why (Score:5, Informative)
Why don't modern browsers render this character?
The character isn't supposed to be rendered. Soft hyphen indicates where to break words if necessary. The hyphens are not rendered if the word doesn't need to be broken.
Re:Why (Score:3, Informative)
Why don't modern browsers render this character?
From Wikipedia:
"Since it is difficult for a computer program to automatically make good decisions on when to hyphenate a word, the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later re-flowed."
So a soft hyphen marks a position where you can hyphenate a word. If you don't do it, you of course shouldn't print anything at that position.
Re:So how often is it used legitimately? (Score:4, Informative)
Is there any good reason not to just treat the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?
Yes, there is: languages other than English. In German, for example, the use of soft hyphens, while not universal, is becoming more common, and for a reason: long words that the browser can't hyphenate automatically lead to ugly layout, especially when there's not a lot of horizontal space (e.g. on news sites, which often tend to emulate printed newspapers).
Re:Why (Score:1, Informative)
Re:Why (Score:5, Informative)
Why don't modern browsers render this character?
Two reasons, the first being that the HTML 4 specs [w3.org] call for it not to be rendered unless the line is actually broken at that point. Here is the full blurb:
The other reason is that the current Unicode standard basically doesn't pin down when and where it should be displayed as a hyphen, and leaves that open to the interpretation of whoever is coding for it. Here is the blurb from the Unicode standard on it:
SpamAssassin is not vulnerable to this (Score:5, Informative)
Just tested this in SpamAssassin with http ://exa ­ mple.com (spaced to evade slashdot's own obfuscation-eliminator) - Result: The URL domain (example.com) is properly extracted without the obfuscation.
That said, SA is fully capable of detecting the obfuscation attempt itself (using a rawbody rule)...
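For anyone who wants to see the idea outside SpamAssassin, here is a rough Python sketch of the two behaviours described above: drop the soft hyphen before extracting the domain, and separately flag raw text that contains one. This is not SpamAssassin's rule syntax or its implementation; the names `extract_domain` and `has_shy_obfuscation` and the list of entity spellings are my own assumptions.

```python
import html
import re
from urllib.parse import urlsplit

SOFT_HYPHEN = "\u00ad"

# Spellings of the soft hyphen that can appear in a raw HTML body:
# the named entity, decimal and hex numeric references, and the literal character.
# (Assumed list for illustration; a real filter may need more variants.)
SHY_IN_RAW = re.compile(r"&shy;|&#0*173;|&#x0*ad;|\u00ad", re.IGNORECASE)


def has_shy_obfuscation(raw_body):
    """Return True if the raw text contains a soft hyphen in any listed spelling."""
    return SHY_IN_RAW.search(raw_body) is not None


def extract_domain(raw_url):
    """Decode HTML entities, remove soft hyphens, then return the hostname."""
    decoded = html.unescape(raw_url)
    cleaned = decoded.replace(SOFT_HYPHEN, "")
    return urlsplit(cleaned).hostname or ""
```

Calling `extract_domain("http://exa&shy;mple.com/buy")` gives back the plain domain, while `has_shy_obfuscation` on the same string reports the attempt. Keeping the two checks separate means the filter can look up the real domain and still score the obfuscation itself.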
Re:Why (Score:5, Informative)
I'm a pretty IT-savvy guy, but WHAT IS that bloody character?
Say you're laying out a book. You have the word Sauerkraut at a line wrap, but it is broken into Sauerk-raut because your layout software doesn't know where to break it. You then put in a soft hyphen between r and k, which tells your software that the word should be broken there. It turns into Sauer-kraut, which is correct.
Later you get angry with the Sauerkraut and call it "bloody Sauerkraut". Now the whole word will be on the next line, and the soft hyphen won't show because your software doesn't need to break the word. Thus you can insert these freely without fretting about words containing a stray hyphen later on; they'll only be rendered when used as a hint.
HTH
Not always (Score:5, Informative)
It is only supposed to be rendered when the word is split across multiple lines.
For example, if your text was "super­cali­fragilistic­expialidocious", then all of the following are valid renderings, depending on where the renderer decides to start a new line:
supercalifragilisticexpialidocious
or
supercalifragilistic-
expialidocious
or
supercali-
fragilistic-
expialidocious
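The three renderings above can be reproduced with a toy line breaker. This is a minimal Python sketch, assuming a greedy strategy and a fixed column width; real layout engines are far more involved, and the function name `render` is made up for illustration.

```python
SOFT_HYPHEN = "\u00ad"


def render(word, width):
    """Greedily break one word at its soft hyphens so each line fits `width`.

    A soft hyphen that is not used as a break point produces no output.
    One that is used shows up as a visible "-" at the end of the line.
    A fragment longer than `width` is left to overflow.
    """
    parts = word.split(SOFT_HYPHEN)
    lines = []
    current = parts[0]
    for i, part in enumerate(parts[1:], start=1):
        is_last = i == len(parts) - 1
        # Reserve one column for a possible trailing hyphen unless this
        # fragment ends the word.
        needed = len(current) + len(part) + (0 if is_last else 1)
        if needed <= width:
            current += part              # no break: the soft hyphen vanishes
        else:
            lines.append(current + "-")  # break: the hyphen becomes visible
            current = part
    lines.append(current)
    return lines


word = "super\u00adcali\u00adfragilistic\u00adexpialidocious"
for width in (40, 21, 14):
    print("\n".join(render(word, width)), end="\n\n")
```

With widths of 40, 21 and 14 columns this prints the one-line, two-line and three-line versions shown above, in that order.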
Re:So how often is it used legitimately? (Score:4, Informative)
Re:Obligatory Kajagoogoo (Score:0, Informative)
Good job. Came here to blame Kajagoogoo for this.
Offtopic mod needs to hush hush.
Re:What is it? (Score:3, Informative)
Who modded this insightful?
No, domain registrars don't allow soft hyphens in domain name registrations. Give me a single example of a registered domain with a soft hyphen in it.
As the other user said, this is used for masking URLs in e-mails, and thus trying to thwart spam filters.
This will render as
Yet many spam filters will not trigger on the words "cheap" and "Viagra", and the e-mail has a greater chance of getting through filters.
A similar technique was used in the past for domain names and e-mail addresses, back when e-mail actually followed the standards, and no-one read their e-mail in an HTML browser.
is a legal e-mail address that is interpreted as cheapviagra@hotmail.com, but many newer programs don't follow the standards and will barf, which is why spammers seldom do this anymore.
Re:H3rb41 V14gr4? (Score:3, Informative)
I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).
There's the rub, so to speak. Most men using viagra don't need it, they just like using it, and nothing prevents them from enjoying it on their own.
Re:What is it? (Score:2, Informative)
So the problem is browsers silently removing them.
A browser should never modify a URL. In particular, it should not remove invalid characters. It should give an error if you try to go to an invalid URL. It should not try to be "helpful" here.
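A strict check of that kind is short to write. Here is a sketch in Python, assuming "invalid" means any character listed in RFC 3454 table B.1 ("commonly mapped to nothing"), which includes the soft hyphen, U+00AD. The names `strict_hostname` and `InvalidURL` are hypothetical, and no browser is claimed to work this way.

```python
import stringprep
from urllib.parse import urlsplit


class InvalidURL(ValueError):
    """Raised instead of silently repairing the URL (hypothetical name)."""


def strict_hostname(url):
    """Return the hostname, or raise if it contains a 'mapped to nothing' character."""
    host = urlsplit(url).hostname
    if not host:
        raise InvalidURL("no hostname in URL")
    for ch in host:
        # Table B.1 lists characters that stringprep profiles normally delete,
        # e.g. U+00AD SOFT HYPHEN and U+200B ZERO WIDTH SPACE.
        if stringprep.in_table_b1(ch):
            raise InvalidURL("invisible character U+%04X in hostname" % ord(ch))
    return host
```

`strict_hostname("http://example.com/")` returns the host unchanged, while the same URL with a soft hyphen inside the host raises `InvalidURL` rather than being quietly cleaned up.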