
Spammers Using Soft Hyphen To Hide Malicious URLs 162

Posted by timothy
from the conservative-in-what-you-accept dept.
Trailrunner7 writes with this excerpt from ThreatPost illustrating the ongoing Spy-vs.-Spy battle between spammers and the rest of us: "Spammers have jumped on the little-used soft hyphen (or SHY character) to fool URL filtering devices. According to researchers, spammers are larding up URLs for sites they promote with the soft hyphen character, which many browsers ignore. Spammers aren't shy about exploiting humans' flexible cognitive abilities to slip past the notice of spam filters (H3rb41 V14gr4, anyone?). ... The latest trend involves the use of an obscure character called the soft hyphen or 'SHY' character to obscure malicious URLs in spam messages. Writing on the Symantec Connect blog, researcher Samir Patil said that the company has seen recent spam messages that insert the HTML symbol for the soft hyphen to obfuscate URLs for Web pages promoted by the spammers."
This discussion has been archived. No new comments can be posted.

  • H3rb41 V14gr4? (Score:5, Insightful)

    by MrEricSir (398214) on Thursday October 07, 2010 @05:33PM (#33830008) Homepage

    I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?

    • Re:H3rb41 V14gr4? (Score:4, Insightful)

      by caffeinemessiah (918089) on Thursday October 07, 2010 @05:43PM (#33830136) Journal

      I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?

      I don't know about you, but I can't stop trying to figure out what word they're trying to represent with the symbols. For example, I know the second word in your subject means viagra, but what is "H3rb41"? Oh..."herbal". It's naturally (perhaps unknowingly) targeted towards geeks and puzzle-solvers, which perhaps isn't the worst market to target available-without-human-contact penis drugs towards.

      • Re: (Score:3, Insightful)

        by maxwell demon (590494)

        I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).

        • Re: (Score:3, Informative)

          by Abstrackt (609015)

          I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).

          There's the rub, so to speak. Most men using viagra don't need it, they just like using it, and nothing prevents them from enjoying it on their own.

      • Funny, I read it immediately as herbal viagra. I guess different people's brains may be handling the job of reading differently. Reminds me of Richard Feynman's "experiments" with reading and counting at the same time etc: http://www.youtube.com/watch?v=Cj4y0EUlU-Y [youtube.com]
      • by commodore64_love (1445365) on Thursday October 07, 2010 @06:21PM (#33830520) Journal

        I think this photograph is appropriate. And I'm happy to say: No I can't read it.

        http://media.ebaumsworld.com/picture/strober/get_laid.jpg [ebaumsworld.com]

        • by rwa2 (4391) *

          http://megatokyo.com/strip/9 [megatokyo.com]

          "Does anyone here speak 133+?"

          Probably MegaTokyo's finest moment, and blatantly ripped from "Airplane!" at that :P

          • by vux984 (928602)

            "I need help. I need you to get the doctor. I got some bad pain in my chest, I need my pills."

            Priceless. :p

        • by xaxa (988988)

          Most British people should be able to read it, since substituting numbers for letters is very common on car number plates (driven by the kind of people who are willing to pay extra for this kind of thing). The pattern of letters and numbers is restricted -- if you buy a new car now, it will be __60 ___, where the blanks are letters. You might choose to pay extra for WE60 FST ('we go fast'). Yesterday I saw "MU51 CFX" -- "music fx". Pre-2000, there was a different format, so M4 TT = matt, M477 HEW = matthew

    • Re: (Score:3, Interesting)

      I never understood how it actually worked, except as you suggested, the script kiddy crowd are heavily in to giving money to strangers in exchange for uber zomg epic sexual prowess.

      Maybe I'm old fashioned, but I'm kind of reluctant to whip out my credit card to buy something from a company that employs mittens-wearing illiterates to write their adverts. Sure I'll eat at a Chinese restaurant with an amusingly translated menu, but that's a little different.

      • by Beale (676138)
        I hear you can get better by grinding.
      • by Obfuscant (592200)
        I never understood how it actually worked, except as you suggested, the script kiddy crowd are heavily in to giving money to strangers in exchange for uber zomg epic sexual prowess.

        Never watched late night cable channels, have we? Does the word "Extenze" ring a bell? Those ads are taking the word "ubiquitous" to a whole new level, and proving that "skank hoes" ain't just on the street corner anymore. Well, ok, they DO go out and do "man on the street" interviews, where amazingly enough, every man they com

    • You know how, even though only a tiny fraction of a percent of people actually respond to spam by buying the product, sending the spam is so cheap that it's still profitable to do so? I always assumed that the incomprehensible leetspeak just tacks on another factor of 0.1 or so but the resulting sales still justify the spamming. Or at least that's what the spammers think; who knows whether they're being economically rational.

    • by bill_kress (99356)

      Since I didn't see anyone mention it I'll take the chance you weren't just making a joke and give you the answer:

      The point of the character substitutions / "leet speak" is exactly the same as the URL mangling they are talking about here: getting around spam filters. When the spam filters know to search for anything with "Viagra" in it, you just change that to V1agra, problem solved. The next week go with V1@gra.

      The people who buy this stuff are likely not to mind.
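
The cat-and-mouse game described above can be sketched in a few lines, assuming a deliberately naive keyword filter. The names `LEET_MAP`, `BLOCKLIST`, `naive_filter`, and `normalizing_filter` are hypothetical and exist only for illustration; real spam filters are far more elaborate.

```python
# Hypothetical substitution table. "1" is mapped to "i" here, although
# spammers also use it for "l" (as in "H3rb41"), so this mapping is lossy.
LEET_MAP = str.maketrans({"1": "i", "3": "e", "4": "a", "@": "a", "0": "o"})

# Assumed blocklist for the example; lowercase keywords only.
BLOCKLIST = ("viagra",)


def naive_filter(message: str) -> bool:
    """Return True if the message contains a blocklisted word verbatim."""
    text = message.lower()
    return any(word in text for word in BLOCKLIST)


def normalizing_filter(message: str) -> bool:
    """Undo common character substitutions before matching."""
    text = message.lower().translate(LEET_MAP)
    return any(word in text for word in BLOCKLIST)
```

The naive filter misses "V1@gra", while the normalizing one catches it. The catch is that a fixed table cannot resolve ambiguous substitutions, which is one reason learning filters tend to replace keyword lists.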

    • Re: (Score:2, Funny)

      by Anonymous Coward

      I like my hyphen hard, not soft. That's why I use H3rb41 V14gr4.

    • by Nyder (754090)

      I never got the leet speak in spam thing. Sure, it might get past the filter, but who can read it? Are they trying to sell drugs to script kiddies?

      I figured someone that falls for that crap, bad spelling and all, sort of deserves losing their money.

    • by EdIII (1114411)

      I think it's worse than that. Their attempt to fool the pattern recognition algorithms on the scanners is understandable, but self-defeating for more reasons than their supposed target audience.

      Even script kiddies have a modicum of intelligence to know not to have anything to do with spam. Unless they are trying to develop skills to use in that industry.....

      In any case, normal people can read that stuff pretty easily, but at the same time it sets off alarms that it is unsafe. Ironically, the same pattern

    • I find it humorous and perhaps a bit ironic that spammers, in an attempt to bypass spam filters, will often render their message completely unreadable. Congratulations! You've beaten my spam filter. However, even if there was a sliver of a chance that you could fool me into giving you money, you blew it because I have no clue what your message says.

  • Why don't modern browsers render this character?
    • Re:Why (Score:5, Informative)

      by TopSpin (753) on Thursday October 07, 2010 @05:40PM (#33830080) Journal

      Why don't modern browsers render this character?

      The character isn't supposed to be rendered. Soft hyphen indicates where to break words if necessary. The hyphens are not rendered if the word doesn't need to be broken.

    • Re: (Score:3, Informative)

      by maxwell demon (590494)

      Why don't modern browsers render this character?

      From Wikipedia:

      "Since it is difficult for a computer program to automatically make good decisions on when to hyphenate a word, the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later re-flowed."

      So a soft hyphen marks a position where you can hyphenate a word. If you don't do it, you of course shouldn't print anything at that position.

      • From Wikipedia:

        "Since it is difficult for a computer program to automatically make good decisions on when to hyphenate a word, the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later re-flowed."

        Exactly its purpose. It's never supposed to be shown, only to give the browser client an easy way to break the word for dynamic width.

      • Hyphenating a URL makes no sense. Ones containing this character should be invalid.

    • I'm a pretty IT-savvy guy, but WHAT IS that bloody character?
      I understood pretty much everything from the summary. Everything BUT the character :) - Fail. As far as the summary is concerned.
      • Re:Why (Score:5, Informative)

        by Man Eating Duck (534479) on Thursday October 07, 2010 @06:10PM (#33830432)

        I'm a pretty IT-savvy guy, but WHAT IS that bloody character?

        Say you're laying out a book. You have the word Sauerkraut at a line wrap, but it is broken into Sauerk-raut because your layout software doesn't know where to break it. You then put in a soft hyphen between r and k; this indicates to your software that this word should be broken there. It turns into Sauer-kraut, which is correct.

        Later you get angry with the Sauerkraut and call it "bloody Sauerkraut". Now the whole word will be at the next line, and the soft hyphen won't show because your software doesn't need to break the word. Thus you can insert these freely without fretting about words containing a hyphen later on, they'll only be rendered when used as a hint.

        HTH
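
A minimal sketch of the behaviour described above, assuming a toy greedy line wrapper rather than any real layout engine. The function name `wrap_with_soft_hyphens` and the constant `SHY` are hypothetical; the only fact relied on is that the soft hyphen is U+00AD.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def wrap_with_soft_hyphens(text: str, width: int) -> list[str]:
    """Greedy wrap: break words only at soft hyphens, and show a
    visible "-" only where a break actually happens."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        while True:
            visible = word.replace(SHY, "")
            candidate = f"{current} {visible}" if current else visible
            if len(candidate) <= width:
                current = candidate  # whole word fits, hint stays hidden
                break
            prefix = current + " " if current else ""
            cut = None
            # Find the last soft hyphen whose head plus "-" still fits.
            for i, ch in enumerate(word):
                if ch == SHY:
                    head = word[:i].replace(SHY, "") + "-"
                    if len(prefix + head) <= width:
                        cut = i
            if cut is not None:
                lines.append(prefix + word[:cut].replace(SHY, "") + "-")
                current, word = "", word[cut + 1:]
            elif current:
                lines.append(current)  # flush and retry on a fresh line
                current = ""
            else:
                lines.append(visible)  # too long, no usable break point
                break
    if current:
        lines.append(current)
    return lines
```

With a width of 20, "I really like Sauer(SHY)kraut" wraps as "I really like Sauer-" followed by "kraut", while "I really like bloody Sauer(SHY)kraut" pushes the whole word to the next line and the hyphen never appears.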

        • Re: (Score:3, Funny)

          by modecx (130548)

          Speaking of bloody sauerkraut, I think there was some sort of hyphen-depression when the inventors of the German language decided it would be fun to glue adjectives and nouns together. i.e. when I see something like: unabhaengigkeitserklaerungen, I have a nigh-irresistible urge to shout Gesundheit!

          I'm still not sure why the nazis went to all of the trouble of building cipher-machines. The language looks sufficiently jumbled from the start.

          • Re: (Score:2, Insightful)

            by JSG (82708)

            Where one in English might use a series of adjectives plus a noun a German would use a single agglomerative word - what is your problem?

            Deutsch is a sufficiently sophisticated language without your assistance.

            It doesn't work the same as your native tongue - get a life and stop trolling my forum - twat.

            • by srussia (884021)
              Hui!
        • by JSG (82708)

          Beautifully put. YIDH

          Cheers
          Jon

    • Re:Why (Score:5, Informative)

      by Tynin (634655) on Thursday October 07, 2010 @05:54PM (#33830274)

      Why don't modern browsers render this character?

      Two reasons, the first being that HTML 4 specs [w3.org] call for it to not be rendered unless it meets the criteria. Here is the full blurb:

      9.3.3 Hyphenation

      In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

      Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

      In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)

      The other reason is that the current unicode standard basically says it doesn't support when and where it should be displayed as a hyphen and leaves it open to interpretation of whoever is coding for it. Here is the blurb from the unicode standard on it:

      Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break point, where a line break is preferred if a word must be hyphenated or otherwise broken across lines. Such break points are generally determined by an automatic hyphenator. SHY can be used with any script, but its use is generally limited to situations where users need to override the behavior of such a hyphenator. The visible rendering of a line break at an intraword break point, whether automatically determined or indicated by a SHY, depends on the surrounding characters, the rules governing the script and language used, and, at times, the meaning of the word. The precise rules are outside the scope of this standard, but see Unicode Standard Annex #14, “Unicode Line Breaking Algorithm,” for additional information. A common default rendering is to insert a hyphen before the line break, but this is insufficient or even incorrect in many situations.

      Contrast this usage with U+2027 hyphenation point, which is used for a visible indication of the place of hyphenation in dictionaries. For a complete list of dash characters in the Unicode Standard, including all the hyphens, see Table 6-3.

      The Unicode Standard includes two nonbreaking hyphen characters: U+2011 non-breaking hyphen and U+0F0C tibetan mark delimiter tsheg bstar. See Section 10.2, Tibetan, for more discussion of the Tibetan-specific line breaking behavior.

  • by biryokumaru (822262) <biryokumaru@gmail.com> on Thursday October 07, 2010 @05:34PM (#33830020)
    Spammers are getting more shy? That's a relief!
  • What is it? (Score:5, Funny)

    by iONiUM (530420) on Thursday October 07, 2010 @05:34PM (#33830024) Homepage Journal

    Why didn't they just put the friggin character in the summary so I didn't have to read the article?

    Anyways, according to the article it's &shy;, which looks "identical to a regular hyphen." Are you happy now slashdot? I had to read TFA to find that out.

    • Re: (Score:2, Insightful)

      by mclearn (86140)
      And, as TFA points out, this is a valid tactic because "modern browsers" (ambiguously non-committal) do not render the character. I assume, spammers are writing URLs as: http://m&shy;i&shy;crosoft.com/ (eg. m-i-crosoft.com, but rendered onscreen as microsoft.com). This, of course, tricks folks into thinking that they are clicking on a valid microsoft.com URL.
      • by mclearn (86140)
        Nope. My bad. Since the SHY character is used as a way to dictate line breaks, it obviously isn't used to forge domains or anything similar. Presumably then, the SHY is used to ensure that patterns such as "Viagra" can be written as Via&shy;gra and not be caught by simple pattern matchers? TFA was light on actual examples.
    • If a word is wrapped to the next line, it shows a hyphen. Otherwise it's hidden. That's what a soft hyphen does.
    • Luckily you read it and I garnered the information from your post.

      Now we can all make informed opinions.

    • by c++0xFF (1758032)

      The soft hyphen is not a very outgoing character. Indeed, it experiences severe apprehension when surrounded by others and may hide itself from view.

      However, like most others with acute diffidence, the shy character can be brought out when placed in a more comfortable position, such as the end of the line, instead of the middle.

  • by JesseL (107722) * on Thursday October 07, 2010 @05:41PM (#33830094) Homepage Journal

    Is there any good reason not to just treat the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?

    • Re: (Score:2, Funny)

      by Cinder6 (894572)

      Well, I know I've certainly never seen it!

    • by Anonymous Coward on Thursday October 07, 2010 @05:48PM (#33830204)

      Is there any good reason not to just call the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?

      Yes, there is: languages other than English. In e.g. German, the use of soft hyphens, while not universal, is becoming more common, at least, and for a reason: longer words that can't automatically be hyphenated by the browser as necessary lead to ugly layout, especially when there's not a lot of horizontal space (e.g. on news sites, which often tend to emulate printed newspapers).

      • by treeves (963993) on Thursday October 07, 2010 @06:34PM (#33830634) Homepage Journal
        So, when I get an email with a link to www.Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.de, should I avoid clicking the link, or what?
        • Or I would have been, if that had been a real URL. Who could resist that?

          Especially if you're into überwachungsaufgaben. Not that I am! No sir. Just saufgaben-curious.

        • that is an actual German word, says Google - their compound concoctions never cease to amaze me. :)

        • So, when I get an email with a link to www.Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.de, should I avoid clicking the link, or what?

          No, just think about what you are having for dinner and be sure you prepare and eat it within the rules.

      • Re: (Score:2, Interesting)

        by TheRaven64 (641858)
        Hyphenating long words in German is pretty easy. Long words are usually compound words and they are correctly broken at the word boundaries. Hyphenating English automatically is actually a harder problem than hyphenating German, and is made harder by the fact that English and American have different rules for when you are supposed to hyphenate.
        • by mattack2 (1165421)

          is made harder by the fact that English and American have different rules for when you are supposed to hyphenate

          Can you explain how this is relevant to this soft hyphen issue? That is, I read the relevant part of the wikipedia article (http://en.wikipedia.org/wiki/Hyphenation), and it does mention different rules (e.g. "co-worker" in British English, but "coworker" in American English). However, that is not related to the soft hyphen issue, which is related to hyphenation for justification reasons.

          • It's relevant to the point that I was replying to - that soft hyphens are more common in German because it is harder to insert hyphens automatically. Soft hyphens are just hints to an automatic hyphenation system. It will attempt to find a place to break the word. In American, the correct place to do this is based on phonetics, while in English it is based on the derivation of the word. It's quite hard to do this correctly automatically (although it's quite easy to do it almost-correctly and use soft hy
    • Re: (Score:3, Interesting)

      by ceoyoyo (59147)

      I would think most spam filters would do that automatically as they learn.

      Symantec seems to think people still use character-for-character text matching spam filters that don't learn. Maybe Symantec products do.

    • Re: (Score:3, Insightful)

      by AltairDusk (1757788)
      Shouldn't be too hard for the spam filter to strip the soft hyphens then analyze the URL, I don't see this being useful to the spammers for too long unless I'm missing something.
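
The strip-then-analyze idea can be sketched as follows, assuming the filter sees the raw HTML body of the message. `extract_urls` and `URL_RE` are hypothetical names, and the URL pattern is intentionally crude; `html.unescape` is the standard-library call that turns `&shy;` into U+00AD.

```python
import html
import re

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Crude URL pattern, assumed for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(raw_html: str) -> list[tuple[str, bool]]:
    """Return (cleaned_url, was_obfuscated) pairs found in raw_html.

    Entities are decoded first so that "&shy;" and "&#173;" both become
    U+00AD, then the soft hyphens are stripped before any matching.
    """
    decoded = html.unescape(raw_html)
    results = []
    for match in URL_RE.finditer(decoded):
        url = match.group(0)
        results.append((url.replace(SHY, ""), SHY in url))
    return results
```

The boolean keeps the obfuscation attempt visible as a signal of its own. It should probably be a weak one, given that soft hyphens have legitimate uses in ordinary text.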
    • by Nicopa (87617)

      FWIW I've been using &shy; in webpages for years...

  • Good News! (Score:3, Interesting)

    by hardburn (141468) <hardburn@wumpus- ... OWnet minus city> on Thursday October 07, 2010 @05:48PM (#33830192)

    So now spam filters will pick up on soft hyphens used in URIs inside emails (when was the last time you saw one used legitimately?), making the spam easier to spot.

  • shy (Score:3, Funny)

    by Anonymous Coward on Thursday October 07, 2010 @05:50PM (#33830228)

    No good shysters.

  • by Khopesh (112447) on Thursday October 07, 2010 @06:02PM (#33830356) Homepage Journal

    Just tested this in SpamAssassin with http ://exa &shy; mple.com (spaced to evade slashdot's own obfuscation-eliminator) - Result: The URL domain (example.com) is properly extracted without the obfuscation.

    That said, SA is fully capable of detecting the obfuscation attempt itself (using a rawbody rule)...

    • by Zarel (900479)

      Erm, I don't think you know what "properly extracted" means.

      exa&shy;mple.com doesn't lead to example.com, it leads to xn--example-nka.com, so if you extract the former instead of the latter, you're doing it wrong.
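
Both readings of the hostname can be reproduced with Python's built-in codecs, which may help explain the disagreement in this sub-thread. This is a sketch under the assumption that Python's `idna` codec (IDNA 2003 with nameprep) stands in for what a resolver or filter does; the function names are hypothetical.

```python
def raw_punycode_label(label: str) -> str:
    """Encode a label with plain Punycode, with no normalization step."""
    return "xn--" + label.encode("punycode").decode("ascii")


def nameprep_hostname(hostname: str) -> str:
    """Encode a hostname with Python's 'idna' codec, which applies
    nameprep first; nameprep maps U+00AD to nothing."""
    return hostname.encode("idna").decode("ascii")


obfuscated = "exa\u00admple"  # "example" with a soft hyphen inside

without_prep = raw_punycode_label(obfuscated)
with_prep = nameprep_hostname(obfuscated + ".com")
```

Without normalization the label becomes `xn--example-nka`, as stated above. With nameprep applied first, the soft hyphen is dropped and the name is plain `example.com`, which is consistent with the extraction result reported in the parent comment.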

      • I'm a SpamAssassin developer, both on the official project and on a commercial derivative. Others on my commercial team independently verified my claim as well. I highly doubt we're all wrong.

        That said, I decided to FULLY dig into the issue to see what's going on under the hood. In addition to a careful analysis of the spamassassin debug output, I spun up Wireshark [wireshark.org] to look at the actual DNS queries. Since SA knows what example.com is ([84234] dbg: uridnsbl: domain example.com in skip list), I had to u

  • from the article:

    The advent of HTML 5 within the next couple years is expected to solve many of these problems, because that specification finally standardizes how HTML code should be parsed by Web browsers, rather than leaving it up to individual platform vendors to develop their own interpretations of how the code should be parsed.

    Note the use of the phrase "should be". I see this a lot when reading about HTML 5. Are people really that stupid and/or naive that they think all browsers will follow the HT

    • by blair1q (305137)

      standards can only tell you how things should be.

      they may tell you how things will break for you if you try to do things in a non-standard way, but they have no power to force you not to try.

    • by mortonda (5175)

      Note the use of the phrase "should be".

      Yes, "should be". SHOULD has a very different meaning from MUST in standards documents.

  • by SuperKendall (25149) on Thursday October 07, 2010 @06:08PM (#33830418)

    The thing that really grates on the nerves is using a soft-hyphen to sell Viagra.

  • I don't get how you can put a soft-hyphen in a URL and have it work? It's a formatting character, it shouldn't ever be legal to have a formatting character as part of a URL? Are they registering domain-names with soft-hyphens in the name? Or is this a case where the browser 'helpfully' replaces a soft hyphen with a regular hyphen when actually trying to connect to the web server, but for some reason does NOT render the hyphen when displaying it to the user? It seems like the browser should behave consisten

    • Mod parent up. Why in the hell is such a character allowed in URLs at all?

      • by elronxenu (117773)

        Indeed; it seems to be a good example why extending the DNS character set was not such a good idea. DNS should have readable domain names, and avoid using different characters with identical glyphs and non-printing characters.

  • by Laxori666 (748529) on Thursday October 07, 2010 @09:19PM (#33832106) Homepage
    Is it just me or is this summary terrible? Every sentence says the same thing, just slightly reworded. In the summary, it's as if each new sentence doesn't give any additional information, but it's worded as if it does. Researchers have found that this summary is repetitive. Some say this can indicate the repetitiveness of a summary.
    • by lousyd (459028)
      Yes, thank you. I was scanning the comments to this story just hoping that somebody would say that. 5 times they repeated the same thing!
  • ...unless they get a lucky break.

    I think I learned that in junior high.

  • This is so easy to defeat with a simple regular expression in your spam filter. I doubt spammers will continue with this tactic for long.

  • Alright I did some testing in Chrome, Firefox, Internet Explorer and Opera (all latest versions)

    A simple link, with a SHY character in the link. Depending on the format of the link (with an http, or without), all 4 browsers did exactly what we expected them to do: the link either showed the hyphen and linked to a hyphened page correctly (when I say "showed", I mean that if you mouse over the link, you see the hyphen in the task bar) or just didn't show it and didn't link to a hyphened page.

    So, I don't see

    • by Spez (566714)

      Ah I re-re-read the summary. It's only to go through the Spam filtering system... forget I said anything.
