Spammers Using Soft Hyphen To Hide Malicious URLs
Trailrunner7 writes with this excerpt from ThreatPost illustrating the ongoing Spy-vs.-Spy battle between spammers and the rest of us:
"Spammers have jumped on the little-used soft hyphen (or SHY character) to fool URL filtering devices. According to researchers, spammers are larding up URLs for sites they promote with the soft hyphen character, which many browsers ignore. Spammers aren't shy about jumping humans flexible cognitive abilities to slip past the notice of spam filters (H3rb41 V14gr4, anyone?). ... The latest trend involves the use of an obscure character called the soft hyphen or 'SHY' character to obscure malicious URLs in spam messages. Writing on the Symantec Connect blog, researcher Samir Patil said that the company has seen recent spam messages that insert the HTML symbol for the soft hyphen to obfuscate URLs for Web pages promoted by the spammers."
H3rb41 V14gr4? (Score:5, Insightful)
I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?
Re:H3rb41 V14gr4? (Score:4, Insightful)
I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?
I don't know about you, but I can't stop trying to figure out what word they're trying to represent with the symbols. For example, I know the second word in your subject means viagra, but what is "H3rb41"? Oh..."herbal". It's naturally (perhaps unknowingly) targeted towards geeks and puzzle-solvers, which perhaps isn't the worst market to target available-without-human-contact penis drugs towards.
Re: (Score:3, Insightful)
I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).
Re: (Score:3, Informative)
I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).
There's the rub, so to speak. Most men using viagra don't need it, they just like using it, and nothing prevents them from enjoying it on their own.
Re:H3rb41 V14gr4? (Score:4, Funny)
I think this photograph is appropriate. And I'm happy to say: No I can't read it.
http://media.ebaumsworld.com/picture/strober/get_laid.jpg [ebaumsworld.com]
Re: (Score:2)
http://megatokyo.com/strip/9 [megatokyo.com]
"Does anyone here speak 133+?"
Probably MegaTokyo's finest moment, and blatantly ripped from "Airplane!" at that :P
Re: (Score:2)
"I need help. I need you to get the doctor. I got some bad pain in my chest, I need my pills."
Priceless. :p
Re: (Score:2)
Most British people should be able to read it, since substituting numbers for letters is very common on car number plates (driven by the kind of people who are willing to pay extra for this kind of thing). The pattern of letters and numbers is restricted -- if you buy a new car now, it will be __60 ___, where the blanks are letters. You might choose to pay extra for WE60 FST ('we go fast'). Yesterday I saw "MU51 CFX" -- "music fx". Pre-2000, there was a different format, so M4 TT = matt, M477 HEW = matthew
Re: (Score:3, Interesting)
I never understood how it actually worked, except as you suggested, the script kiddy crowd are heavily into giving money to strangers in exchange for uber zomg epic sexual prowess.
Maybe I'm old fashioned, but I'm kind of reluctant to whip out my credit card to buy something from a company that employs mittens-wearing illiterates to write their adverts. Sure I'll eat at a Chinese restaurant with an amusingly translated menu, but that's a little different.
Re: (Score:2)
Never watched late night cable channels, have we? Does the word "Extenze" ring a bell? Those ads are taking the word "ubiquitous" to a whole new level, and proving that "skank hoes" ain't just on the street corner anymore. Well, ok, they DO go out and do "man on the street" interviews, where amazingly enough, every man they com
Re: (Score:2)
You know how, even though only a tiny fraction of a percent of people actually respond to spam by buying the product, sending the spam is so cheap that it's still profitable to do so? I always assumed that the incomprehensible leetspeak just tacks on another factor of 0.1 or so but the resulting sales still justify the spamming. Or at least that's what the spammers think; who knows whether they're being economically rational.
Re: (Score:2)
Since I didn't see anyone mention it I'll take the chance you weren't just making a joke and give you the answer:
The point of the character substitutions / "Leet speek" is exactly the same as the URL mangling they are talking about here: getting around spam filters. When the spam filters know to search for anything with "Viagra" in it, you just change that to V1agra, problem solved. The next week go with V1@gra.
The people who buy this stuff are likely not to mind.
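To make the cat-and-mouse concrete, here is a rough Python sketch, assuming a toy substitution table. `LEET_MAP`, `naive_match` and `normalized_match` are names made up for illustration, and real filters are far more elaborate (note that "1" can stand for either "i" or "l", which this toy table ignores):

```python
# Hypothetical, deliberately tiny leet-to-letter table (an assumption for this sketch).
LEET_MAP = str.maketrans({"1": "i", "3": "e", "4": "a", "@": "a", "0": "o", "5": "s"})

def naive_match(text, keyword):
    """Plain substring search, the kind of rule leet speak slips past."""
    return keyword.lower() in text.lower()

def normalized_match(text, keyword):
    """Undo the common substitutions first, then search."""
    return keyword.lower() in text.lower().translate(LEET_MAP)

subject = "H3rb41 V14gr4"
print(naive_match(subject, "viagra"))       # the plain search misses it
print(normalized_match(subject, "viagra"))  # the normalized search finds it
```

Each new substitution (V1@gra next week) means another table entry, which is presumably why learning filters took over from fixed keyword lists.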
Re: (Score:2, Funny)
I like my hyphen hard, not soft. That's why I use H3rb41 V14gr4.
Re: (Score:2)
I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?
I figured someone that falls for that crap, bad spelling and all, sort of deserves losing their money.
Re: (Score:2)
I think it's worse than that. Their attempt to fool the pattern recognition algorithms on the scanners is understandable, but self-defeating for more reasons than their supposed target audience.
Even script kiddies have a modicum of intelligence to know not to have anything to do with spam. Unless they are trying to develop skills to use in that industry.....
In any case, normal people can read that stuff pretty easily, but at the same time it sets off alarms that it is unsafe. Ironically, the same pattern
Re: (Score:2)
I find it humorous and perhaps a bit ironic that spammers, in an attempt to bypass spam filters, will often render their message completely unreadable. Congratulations! You've beaten my spam filter. However, even if there was a sliver of a chance that you could fool me into giving you money, you blew it because I have no clue what your message says.
Why (Score:2)
Re:Why (Score:5, Informative)
Why don't modern browsers render this character?
The character isn't supposed to be rendered. Soft hyphen indicates where to break words if necessary. The hyphens are not rendered if the word doesn't need to be broken.
Not always (Score:5, Informative)
It is only supposed to be rendered when the word is split across multiple lines.
For example, if your text was "super&shy;cali&shy;fragilistic&shy;expialidocious", then all of the following are valid renderings, depending on where the renderer decides to start a new line:
supercalifragilisticexpialidocious
or
supercalifragilistic-
expialidocious
or
supercali-
fragilistic-
expialidocious
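The three renderings above can be reproduced with a small Python sketch. This is only an illustration of the idea, not how any browser actually does line breaking: `render` is a hypothetical helper, it handles a single word, and it greedily reserves one column for the visible hyphen whenever a later break is still possible.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

def render(word, width):
    """Greedily break one word at its soft hyphens to fit `width` columns.

    A soft hyphen becomes a visible "-" only where a break actually happens.
    A single piece longer than `width` simply overflows in this sketch.
    """
    pieces = word.split(SHY)
    lines, current = [], ""
    for i, piece in enumerate(pieces):
        # Keep one column free for "-" unless this is the final piece.
        reserve = 0 if i == len(pieces) - 1 else 1
        if current and len(current) + len(piece) + reserve > width:
            lines.append(current + "-")
            current = piece
        else:
            current += piece
    lines.append(current)
    return lines

word = SHY.join(["super", "cali", "fragilistic", "expialidocious"])
for width in (40, 25, 15):
    print(width, render(word, width))
```

With a width of 40 the word stays whole and no hyphen is shown; at 25 and 15 it breaks in the same places as the second and third renderings above.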
Re: (Score:2)
URLs are not words.
Re: (Score:2)
If you do not render, you should sanitize the underlying URL.
Re: (Score:3, Informative)
Why don't modern browsers render this character?
From Wikipedia:
"Since it is difficult for a computer program to automatically make good decisions on when to hyphenate a word, the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later re-flowed."
So a soft hyphen marks a position where you can hyphenate a word. If you don't do it, you of course shouldn't print anything at that position.
Re: (Score:2)
Exactly its purpose. It's never supposed to be shown, only to give the browser client an easy way to break the word for dynamic width.
Re: (Score:2)
Hyphenating a URL makes no sense. Ones containing this character should be invalid.
Re: (Score:2)
I understood pretty much everything from the summary. Everything BUT the character
Re:Why (Score:5, Informative)
I'm a pretty IT-savvy guy, but WHAT IS that bloody character?
Say you're laying out a book. You have the word Sauerkraut at a line wrap, but it is broken into Sauerk-raut because your layout software doesn't know where to break it. You then put in a soft hyphen between r and k, which tells your software that this word should be broken there. It turns into Sauer-kraut, which is correct.
Later you get angry with the Sauerkraut and call it "bloody Sauerkraut". Now the whole word will be at the next line, and the soft hyphen won't show because your software doesn't need to break the word. Thus you can insert these freely without fretting about words containing a hyphen later on, they'll only be rendered when used as a hint.
HTH
Re: (Score:3, Funny)
Speaking of bloody sauerkraut, I think there was some sort of hyphen-depression when the inventors of the German language decided it would be fun to glue adjectives and nouns together. i.e. when I see something like: unabhaengigkeitserklaerungen, I have a nigh-irresistible urge to shout Gesundheit!
I'm still not sure why the nazis went to all of the trouble of building cipher-machines. The language looks sufficiently jumbled from the start.
Re: (Score:2, Insightful)
Where one in English might use a series of adjectives plus a noun a German would use a single agglomerative word - what is your problem?
Deutsch is a sufficiently sophisticated language without your assistance.
It doesn't work the same as your native tongue - get a life and stop trolling my forum - twat.
Re: (Score:3, Insightful)
This is purely senseless and is a mark of poor language design.
Languages (in general) aren't designed, they evolve. Which makes your (all too long-winded) point quite moot.
Re: (Score:2)
thankfully, despite the relation, modern English mostly escapes this terrible behavior, compound words usually being limited to a combination of two separate words.
Unlike German, where we always use the pathological case.
Besides the fact that one can apparently stream together an arbitrary number of adjectives and nouns (the order of which often seems to be immaterial), German has the grammatical gender of Latin based languages, but without the normally sensible, almost always predictable rules for applying gender; however, unlike most modernized Latin based languages, German retains three genders, again complicating issues.
German isn't a Romance language and neither is English. Both are Germanic languages, although English has undergone extensive crossbreeding with French, which has led to a mix of Germanic and Romance elements. Germanic and Romance are completely different branches of the Indo-European language family.
Then, you have writers who have the unfortunate habit of including far too much information into a single sentence. Similar to poor English writers who have a tendency to include entire descriptive paragraphs into (parentheses), except those helpful punctuation, like the hyphen, are also omitted.
Overly long sentences are not language-specific and indeed the case can be made (and easily defended) that one can write impossibly long, but entirely c
Re: (Score:2)
Beautifully put. YIDH
Cheers
Jon
Re: (Score:2)
Italics isn't something that just 'happens' when laying out text. Hyphenation is. As one of the other replies said, this is used as a hint, as software doesn't know all of the syllables of words and how to break them.
(I hadn't heard of it before this article either.)
Re: (Score:2)
Well clearly you have to draw the line somewhere. I agree, stuff like \a, and \b obviously don't belong in the character set. But then newlines clearly do, and even spaces are 'display-control strings'. What about tabs?
I think you're probably right about this soft-hyphen though. It sounds like it is rarely used and creates more problems than it solves.
Re: (Score:2)
I don't think it's generally needed, at all. A line break might be desirous any-
where in a text; are authors supposed to figure out approx-
imately where they might be needed? Or should they simp-ly soft-hyphen-ate every-fuck-ing-thing so that it is actu-ally use-ful?
A browser-side diction-ary might be more bet-ter in most cases.
Re: (Score:2)
Software should keep track of hyphenation positions the same way it keeps track of other formatting positions.
Yes, a good hyphenation dictionary for every language would be nice. Along with special characters in the character set that indicate which language we are currently using (quotes, book titles and so on in a separate language from the main text is common in many settings).
Oh wait, that would never work well, it'd complicate the character set, increase application sizes by orders of magnitude in many cases, and we *don't* have good hyphenation dictionary for a great many languages.
In browsers it might be sup
Re: (Score:2)
So, in how many words do I have to add these characters because the layout software won't do it for me?
We use InDesign at our publishing company, and its hyphenation is usually quite good. It *will* miss in a few words for each book, especially in languages other than English; in those cases soft hyphens are very practical. It's OK, native English speakers tend to forget that we use other languages in most of the world :)
Re: (Score:2)
You read a news entry about "the guy who committed the crime". Never in the summary do they mention the guy's name or what crime he committed, but they emphasize how dangerous the guy is and how horrible the crime was. Now let me know if that sort of approach doesn't, um, I don't know, miss something essential.
This is not about me being lazy and not reading the article (I did), but about the summary missing some essential information (it does).
Re:Why (Score:5, Informative)
Why don't modern browsers render this character?
Two reasons, the first being that HTML 4 specs [w3.org] call for it to not be rendered unless it meets the criteria. Here is the full blurb:
The other reason is that the current unicode standard basically says it doesn't support when and where it should be displayed as a hyphen and leaves it open to interpretation of whoever is coding for it. Here is the blurb from the unicode standard on it:
Shy Spammers (Score:4, Funny)
Re: (Score:2)
Spammers are getting more shy? That's a relief!
Careful what you wish for...instead of getting rickrolled you might end up being Kajagoogled [youtube.com].
What is it? (Score:5, Funny)
Why didn't they just put the friggin character in the summary so I didn't have to read the article?
Anyways, according to the article it's &shy;, which looks "identical to a regular hyphen." Are you happy now slashdot? I had to read TFA to find that out.
Re: (Score:3, Insightful)
Are registrars accepting domain names with soft hyphens? And if so, why? It's rather obvious that such domain names would only be used for fraud.
IMHO registrars should not accept any non-printable character in domain names.
Re: (Score:3, Insightful)
Yes, they are. Otherwise this story wouldn't exist.
Why? Because they like money, and don't give a fuck.
Of course they should not accept any non-printable characters.
Registrars are pretty much only half a step above the spammers in terms of ethics / shittiness.
Re: (Score:2)
Do you have any evidence that registrars are accepting soft hyphens in domain names?
Soft hyphens are supposed to be eliminated in the Name Preparation phase [verisign.com].
The soft hyphen is being used by spammers to obfuscate their URLs in order to get past anti-spam rules.
This slashdot story appears to be misinformation and a plug for Symantec.
Re: (Score:3, Informative)
Who modded this insightful?
No, domain registrars don't allow soft hyphens in domain name registrations. Give me a single example of a registered domain with a soft hyphen in it.
As the other user said, this is used for masking URLs in e-mails, and thus trying to thwart spam filters.
This will render as
Yet many spam filters will
Re: (Score:2)
DNS isn't allowed to disallow certain characters.
Registrars most certainly can, and do.
TFA doesn't give an example of a domain with a soft hyphen in the name.
Again, give me just ONE example. That's all it takes to prove me wrong.
This is about using soft hyphens to hide the real domain name in e-mails, not about using a real domain name that actually contains a soft hyphen. It'd be rather useless for this purpose, because
Re: (Score:2, Informative)
So the problem is browsers silently removing them.
A browser should never modify a URL. In particular, it should not remove invalid characters. It should give an error if you try to go to an invalid URL. It should not try to be "helpful" here.
Re: (Score:2)
Luckily you read it and I garnered the information from your post.
Now we can all make informed opinions.
Re: (Score:2)
The soft hyphen is not a very outgoing character. Indeed, it experiences severe apprehension when surrounded by others and may hide itself from view.
However, like most others with acute diffidence, the shy character can be brought out when placed in a more comfortable position, such as the end of the line, instead of the middle.
So how often is it used legitimately? (Score:5, Interesting)
Is there any good reason not to just call the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?
Re: (Score:2, Funny)
Well, I know I've certainly never seen it!
Re:So how often is it used legitimately? (Score:4, Informative)
Is there any good reason not to just call the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?
Yes, there is: languages other than English. In e.g. German, the use of soft hyphens, while not universal, is becoming more common, and for a reason: longer words that can't automatically be hyphenated by the browser as necessary lead to ugly layout, especially when there's not a lot of horizontal space (e.g. on news sites, which often tend to emulate printed newspapers).
I'm there (Score:2)
Or I would have been, if that had been a real URL. Who could resist that?
Especially if you're into überwachungsaufgaben. Not that I am! No sir. Just saufgaben-curious.
Re: (Score:2)
that is an actual German word, says Google - their compound concoctions never cease to amaze me. :)
Re: (Score:2)
So, when I get an email with a link to www.Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.de, should I avoid clicking the link, or what?
No, just think about what you are having for dinner and be sure you prepare and eat it within the rules.
Re: (Score:2)
Can you explain how this is relevant to this soft hyphen issue? That is, I read the relevant part of the wikipedia article (http://en.wikipedia.org/wiki/Hyphenation), and it does mention different rules (e.g. "co-worker" in British English, but "coworker" in American English). However, that is not related to the soft hyphen issue, which is related to hyphenation for justification reasons.
Re: (Score:3, Interesting)
I would think most spam filters would do that automatically as they learn.
Symantec seems to think people still use character-for-character text matching spam filters that don't learn. Maybe Symantec products do.
Re: (Score:2)
FWIW I've been using &shy; in webpages for years...
Re: (Score:2)
I think this is a mistake. "goo gle.com" should lead to an error.
If you use the DNS servers of most ISPs, instead of error, you end up either going to a custom search page to which they are getting paid for the ads, or an offer to buy the domain.
Re: (Score:2)
I agree, but ISPs are making money off of it and not likely to give up that extra revenue, which is one more reason I just point to one of my own DNS servers at the office instead.
Re: (Score:3, Interesting)
Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs.
Good News! (Score:3, Interesting)
So now spam filters will pick up on soft hyphens used in URIs inside emails (when was the last time you saw one used legitimately?), making the spam easier to spot.
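A minimal sketch of that kind of check, in Python and using only the standard library. It assumes the mail body is HTML and flags soft hyphens only inside link targets, so legitimately hyphenated body text (German, say) is left alone. `ShyLinkFinder` and `links_with_soft_hyphens` are hypothetical names for this sketch, not part of any real spam filter:

```python
from html.parser import HTMLParser

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

class ShyLinkFinder(HTMLParser):
    """Collects <a href> targets that contain a soft hyphen."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # html.parser hands us attribute values with entities such as
            # &shy; or &#173; already decoded to the raw character.
            if name == "href" and value and SHY in value:
                self.suspicious.append(value.replace(SHY, ""))

def links_with_soft_hyphens(html_body):
    """Return the de-obfuscated targets of links that hid a soft hyphen."""
    finder = ShyLinkFinder()
    finder.feed(html_body)
    finder.close()
    return finder.suspicious

body = (
    '<p>Rind&shy;fleisch</p>'
    '<a href="http://exa&shy;mple.com/buy">click</a>'
    '<a href="http://example.org/">ok</a>'
)
print(links_with_soft_hyphens(body))
```

Only the first link is reported; the soft hyphen in the paragraph text and the clean second link are ignored.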
shy (Score:3, Funny)
No good shysters.
SpamAssassin is not vulnerable to this (Score:5, Informative)
Just tested this in SpamAssassin with http ://exa &shy; mple.com (spaced to evade slashdot's own obfuscation-eliminator) - Result: The URL domain (example.com) is properly extracted without the obfuscation.
That said, SA is fully capable of detecting the obfuscation attempt itself (using a rawbody rule)...
Re: (Score:2)
Erm, I don't think you know what "properly extracted" means.
exa­mple.com doesn't lead to example.com, it leads to xn--example-nka.com, so if you extract the former instead of the latter, you're doing it wrong.
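For what it's worth, both answers can be reproduced with Python's built-in codecs, which suggests the disagreement is about which layer does the conversion. This is only a sketch: it assumes Python's `idna` codec (IDNA 2003 with nameprep) is a fair stand-in for what a resolver-facing client does, and it says nothing about how any particular browser or SpamAssassin behaves.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Raw Punycode of the label with the soft hyphen left in place.
raw_punycode = f"exa{SHY}mple".encode("punycode")

# IDNA 2003 ToASCII: nameprep maps U+00AD to nothing before any Punycode step.
idna_name = f"exa{SHY}mple.com".encode("idna")

print(raw_punycode)  # b'example-nka'
print(idna_name)     # b'example.com'
```

So "xn--example-nka" is what you get if the soft hyphen is Punycode-encoded as-is, while a client that runs nameprep first ends up at plain example.com.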
PROOF that SpamAssassin is not vulnerable to this (Score:2)
I'm a SpamAssassin developer, both on the official project and on a commercial derivative. Others on my commercial team independently verified my claim as well. I highly doubt we're all wrong.
That said, I decided to FULLY dig into the issue to see what's going on under the hood. In addition to a careful analysis of the spamassassin debug output, I spun up Wireshark [wireshark.org] to look at the actual DNS queries. Since SA knows what example.com is ([84234] dbg: uridnsbl: domain example.com in skip list), I had to u
HTML 5 will save us (Score:2)
Note the use of the phrase "should be". I see this a lot when reading about HTML 5. Are people really that stupid and/or naive that they think all browsers will follow the HT
Re: (Score:2)
standards can only tell you how things should be.
they may tell you how things will break for you if you try to do things in a non-standard way, but they have no power to force you not to try.
Re: (Score:2)
Note the use of the phrase "should be".
Yes, "should be". SHOULD has a very different meaning from MUST in standards documents.
The Wrongest Part (Score:5, Funny)
The thing that really grates on the nerves, is using a soft-hyphen to sell Viagra.
Re: (Score:2)
That's why we have /.
Re: (Score:2)
I thought it was funny when I started receiving spams for "Viagra soft tabs". I thought that's what it was supposed to cure.
Why would soft-hyphen be legal in a URL? (Score:2)
I don't get how you can put a soft-hyphen in a URL and have it work? It's a formatting character, it shouldn't ever be legal to have a formatting character as part of a URL? Are they registering domain-names with soft-hyphens in the name? Or is this a case where the browser 'helpfully' replaces a soft hyphen with a regular hyphen when actually trying to connect to the web server, but for some reason does NOT render the hyphen when displaying it to the user? It seems like the browser should behave consisten
Re: (Score:2)
Mod parent up. Why in the hell is such a character allowed in URLs at all?
Re: (Score:2)
Indeed; it seems to be a good example why extending the DNS character set was not such a good idea. DNS should have readable domain names, and avoid using different characters with identical glyphs and non-printing characters.
My theory (Score:2)
terrible summary (Score:4, Funny)
SHY characters don't get noticed... (Score:2)
...unless they get a lucky break.
I think I learned that in junior high.
Very easy to block (Score:2)
This is so easy to defeat with a simple regular expression in your spam filter. I doubt spammers will continue with this tactic for long.
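Something along these lines, as a hedged Python sketch: decode HTML entities first so that `&shy;`, `&#173;` and `&#xAD;` all turn into the raw character, then strip it before comparing against a blocklist. `strip_soft_hyphens`, `mentions_blocked_domain` and the blocklist contents are assumptions for illustration only:

```python
import html
import re

# Matches the raw soft hyphen character (U+00AD).
SOFT_HYPHEN = re.compile("\u00ad")

def strip_soft_hyphens(text):
    """Decode HTML entities, then drop every soft hyphen."""
    return SOFT_HYPHEN.sub("", html.unescape(text))

def mentions_blocked_domain(text, blocked_domains):
    """Check the cleaned, lowercased text against a set of blocked domains."""
    cleaned = strip_soft_hyphens(text).lower()
    return any(domain in cleaned for domain in blocked_domains)

message = "Great deals at http://exa&shy;mple.com today"
print("example.com" in message)                           # plain search misses it
print(mentions_blocked_domain(message, {"example.com"}))  # cleaned search finds it
```

Stripping is enough to match the real domain; whether the mere presence of the character should also add to the spam score is a separate policy question.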
Not Rendered? (Score:2)
Alright, I did some testing in Chrome, Firefox, Internet Explorer and Opera (all latest versions).
A simple link, with a SHY character in the link. Depending on the format of the link (with an http, or without), all 4 browsers did the exact thing we expected them to do: the link either showed the hyphen and linked to a hyphened page correctly (when I say "showed", I mean that if you mouse over the link, you see the hyphen in the status bar) or just didn't show it and didn't link to a hyphened page.
So, I don't see
Re: (Score:2)
Ah I re-re-read the summary. It's only to go through the Spam filtering system... forget I said anything.
Re: (Score:2)
Maybe it fools Symantec's spam filters.