
Spammers Using Soft Hyphen To Hide Malicious URLs

Trailrunner7 writes with this excerpt from ThreatPost illustrating the ongoing Spy-vs.-Spy battle between spammers and the rest of us: "Spammers have jumped on the little-used soft hyphen (or SHY character) to fool URL filtering devices. According to researchers, spammers are larding up URLs for sites they promote with the soft hyphen character, which many browsers ignore. Spammers aren't shy about jumping on humans' flexible cognitive abilities to slip past the notice of spam filters (H3rb41 V14gr4, anyone?). ... The latest trend involves the use of an obscure character called the soft hyphen or 'SHY' character to obscure malicious URLs in spam messages. Writing on the Symantec Connect blog, researcher Samir Patil said that the company has seen recent spam messages that insert the HTML symbol for the soft hyphen to obfuscate URLs for Web pages promoted by the spammers."

  • Re:Why (Score:5, Informative)

    by TopSpin ( 753 ) on Thursday October 07, 2010 @05:40PM (#33830080) Journal

    Why don't modern browsers render this character?

    The character isn't supposed to be rendered. Soft hyphen indicates where to break words if necessary. The hyphens are not rendered if the word doesn't need to be broken.

  • Re:Why (Score:3, Informative)

    by maxwell demon ( 590494 ) on Thursday October 07, 2010 @05:43PM (#33830140) Journal

    Why don't modern browsers render this character?

    From Wikipedia:

    "Since it is difficult for a computer program to automatically make good decisions on when to hyphenate a word, the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later re-flowed."

    So a soft hyphen marks a position where you can hyphenate a word. If you don't do it, you of course shouldn't print anything at that position.

  • by Anonymous Coward on Thursday October 07, 2010 @05:48PM (#33830204)

    Is there any good reason not to just treat the presence of soft hyphens as a reliable indicator of spam and use it as the basis of a spam filter?

    Yes, there is: languages other than English. In German, for example, the use of soft hyphens, while not universal, is becoming more common, and for a reason: longer words that can't automatically be hyphenated by the browser lead to ugly layout, especially when there's not a lot of horizontal space (e.g. on news sites, which often tend to emulate printed newspapers).

  • Re:Why (Score:1, Informative)

    by KillaGouge ( 973562 ) <gougec17@msRASPn.com minus berry> on Thursday October 07, 2010 @05:53PM (#33830254)
    According to this page [cs.tut.fi], the ISO 8859-1 standard calls for that specific character to be rendered.
  • Re:Why (Score:5, Informative)

    by Tynin ( 634655 ) on Thursday October 07, 2010 @05:54PM (#33830274)

    Why don't modern browsers render this character?

    Two reasons, the first being that HTML 4 specs [w3.org] call for it to not be rendered unless it meets the criteria. Here is the full blurb:

    9.3.3 Hyphenation

    In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

    Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

    In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)

    The other reason is that the current Unicode standard basically doesn't specify when and where it should be displayed as a hyphen, and leaves that open to the interpretation of whoever is coding for it. Here is the blurb from the Unicode standard on it:

    Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break point, where a line break is preferred if a word must be hyphenated or otherwise broken across lines. Such break points are generally determined by an automatic hyphenator. SHY can be used with any script, but its use is generally limited to situations where users need to override the behavior of such a hyphenator. The visible rendering of a line break at an intraword break point, whether automatically determined or indicated by a SHY, depends on the surrounding characters, the rules governing the script and language used, and, at times, the meaning of the word. The precise rules are outside the scope of this standard, but see Unicode Standard Annex #14, “Unicode Line Breaking Algorithm,” for additional information. A common default rendering is to insert a hyphen before the line break, but this is insufficient or even incorrect in many situations.

    Contrast this usage with U+2027 hyphenation point, which is used for a visible indication of the place of hyphenation in dictionaries. For a complete list of dash characters in the Unicode Standard, including all the hyphens, see Table 6-3.

    The Unicode Standard includes two nonbreaking hyphen characters: U+2011 non-breaking hyphen and U+0F0C tibetan mark delimiter tsheg bstar. See Section 10.2, Tibetan, for more discussion of the Tibetan-specific line breaking behavior.
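The HTML 4 rule quoted above, that the soft hyphen "should always be ignored" for searching and sorting, can be illustrated with a small Python sketch. The helper names `strip_shy` and `contains` are made up for this example; it only shows the idea of comparing text with U+00AD removed and is not how any particular browser implements it.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, what &shy; decodes to


def strip_shy(text):
    """Return text with every soft hyphen removed (hypothetical helper)."""
    return text.replace(SOFT_HYPHEN, "")


def contains(haystack, needle):
    """Substring search that ignores soft hyphens on both sides."""
    return strip_shy(needle) in strip_shy(haystack)


# Sorting on the stripped form keeps hyphenation hints from changing the order.
words = ["sauerland", "sauer\u00adkraut"]
sorted_words = sorted(words, key=strip_shy)
```

Without the `key`, the word containing U+00AD would sort after "sauerland", because the soft hyphen's code point is higher than that of "l".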

  • by Khopesh ( 112447 ) on Thursday October 07, 2010 @06:02PM (#33830356) Homepage Journal

    Just tested this in SpamAssassin with http ://exa &shy; mple.com (spaced to evade slashdot's own obfuscation-eliminator) - Result: The URL domain (example.com) is properly extracted without the obfuscation.

    That said, SA is fully capable of detecting the obfuscation attempt itself (using a rawbody rule)...
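For anyone wondering what that kind of handling might look like, here is a rough Python sketch of the two steps described: pulling the real host out of an obfuscated link, and flagging the obfuscation attempt itself. It is not SpamAssassin's code or rule syntax; `extract_hosts`, `looks_obfuscated` and the simple `URL_RE` pattern are assumptions made up for illustration.

```python
import html
import re
from urllib.parse import urlsplit

SOFT_HYPHEN = "\u00ad"
# Deliberately simple URL pattern, just for this sketch.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_hosts(raw_body):
    """Decode HTML entities, drop soft hyphens, then pull out URL hosts."""
    decoded = html.unescape(raw_body)  # "&shy;" and "&#173;" become U+00AD
    cleaned = decoded.replace(SOFT_HYPHEN, "")
    return [urlsplit(url).hostname for url in URL_RE.findall(cleaned)]


def looks_obfuscated(raw_body):
    """True if any URL in the decoded body contains a soft hyphen."""
    decoded = html.unescape(raw_body)
    return any(SOFT_HYPHEN in url for url in URL_RE.findall(decoded))


body = 'Click <a href="http://exa&shy;mple.com/">here</a>'
hosts = extract_hosts(body)
flagged = looks_obfuscated(body)
```

The design point is that the filter should look at the same string the browser ends up with, and can separately treat the presence of the character inside a URL as a signal.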

  • Re:Why (Score:5, Informative)

    by Man Eating Duck ( 534479 ) on Thursday October 07, 2010 @06:10PM (#33830432)

    I'm a pretty IT-savvy guy, but WHAT IS that bloody character?

    Say you're laying out a book. You have the word Sauerkraut at a line wrap, but it is broken into Sauerk-raut because your layout software doesn't know where to break it. You then put in a soft hyphen between r and k, which tells your software that the word should be broken there. It turns into Sauer-kraut, which is correct.

    Later you get angry with the Sauerkraut and call it "bloody Sauerkraut". Now the whole word will be at the next line, and the soft hyphen won't show because your software doesn't need to break the word. Thus you can insert these freely without fretting about words containing a hyphen later on, they'll only be rendered when used as a hint.

    HTH

  • Not always (Score:5, Informative)

    by pavon ( 30274 ) on Thursday October 07, 2010 @06:27PM (#33830572)

    It is only supposed to be rendered when the word is split across multiple lines.

    For example, if your text was "super&shy;cali&shy;fragilistic&shy;expialidocious" then all of the following are valid renderings, depending on where the renderer decides to start a new line:

    supercalifragilisticexpialidocious

    or

    supercalifragilistic-
    expialidocious

    or

    supercali-
    fragilistic-
    expialidocious
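The renderings above can be reproduced with a toy line breaker. This is a minimal Python sketch assuming a fixed-width display and a greedy "fit as many fragments as possible" policy; `render_word` is a hypothetical name, and real layout engines follow much more involved rules.

```python
SOFT_HYPHEN = "\u00ad"


def render_word(word, width):
    """Break one word at soft hyphens so each line fits in `width` columns.

    A visible "-" is emitted only where a break actually happens. A single
    fragment longer than `width` is left to overflow.
    """
    parts = word.split(SOFT_HYPHEN)
    lines = []
    i = 0
    while i < len(parts):
        # Try the longest run of fragments first and shrink until it fits.
        j = len(parts)
        while j > i + 1:
            candidate = "".join(parts[i:j]) + ("-" if j < len(parts) else "")
            if len(candidate) <= width:
                break
            j -= 1
        lines.append("".join(parts[i:j]) + ("-" if j < len(parts) else ""))
        i = j
    return lines


word = "super\u00adcali\u00adfragilistic\u00adexpialidocious"
wide = render_word(word, 40)
medium = render_word(word, 25)
narrow = render_word(word, 14)
```

With these widths, `wide` is the unbroken word, `medium` breaks once after "supercalifragilistic-", and `narrow` gives the three-line version.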

  • by treeves ( 963993 ) on Thursday October 07, 2010 @06:34PM (#33830634) Homepage Journal
    So, when I get an email with a link to www.Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.de, should I avoid clicking the link, or what?
  • by Anonymous Coward on Thursday October 07, 2010 @07:06PM (#33831004)

    Good job. Came here to blame Kajagoogoo for this.

    Offtopic mod needs to hush hush.

  • Re:What is it? (Score:3, Informative)

    by arth1 ( 260657 ) on Friday October 08, 2010 @08:25AM (#33834668) Homepage Journal

    Yes, they are. Otherwise this story wouldn't exist.

    Who modded this insightful?

    No, domain registrars don't allow soft hyphens in domain name registrations. Give me a single example of a registered domain with a soft hyphen in it.

    As the other user said, this is used for masking URLs in e-mails, and thus trying to thwart spam filters.

    Get your ch&shy;eap Vi&shy;agra at http://www.ch&shy;eapvi&shy;agra.com/

    This will render as

    Get your cheap Viagra at http://www.cheapviagra.com/

    Yet many spam filters will not trigger on the words "cheap" and "Viagra", and the e-mail has a greater chance of getting through filters.

    A similar technique was used in the past for domain names and e-mail addresses, back when e-mail actually followed the standards, and no-one read their e-mail in an HTML browser.
      is a legal e-mail address that is interpreted as cheapviagra@hotmail.com, but many newer programs don't follow the standards and will barf, which is why spammers seldom do this anymore.
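The keyword-evasion trick described in this comment is easy to demonstrate. Below is a minimal Python sketch, assuming a hypothetical filter that just looks for blocklisted words in the raw message; `BLOCKLIST`, `naive_hits` and `normalized_hits` are invented names, not taken from any real spam filter.

```python
import html

SOFT_HYPHEN = "\u00ad"
BLOCKLIST = ("cheap", "viagra")  # hypothetical keyword list


def naive_hits(raw):
    """Match keywords against the raw source, entities and all."""
    lowered = raw.lower()
    return [word for word in BLOCKLIST if word in lowered]


def normalized_hits(raw):
    """Decode entities and drop soft hyphens first, as the reader's mail client effectively does."""
    text = html.unescape(raw).replace(SOFT_HYPHEN, "").lower()
    return [word for word in BLOCKLIST if word in text]


message = "Get your ch&shy;eap Vi&shy;agra at http://www.ch&shy;eapvi&shy;agra.com/"
missed = naive_hits(message)
caught = normalized_hits(message)
```

The naive check finds nothing because "ch&shy;eap" is not the string "cheap", while the normalized check sees the same text a human would read.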

  • Re:H3rb41 V14gr4? (Score:3, Informative)

    by Abstrackt ( 609015 ) on Friday October 08, 2010 @09:24AM (#33835036)

    I thought the only situation where you need Viagra is exactly human contact (in the most literal meaning of the word).

    There's the rub, so to speak. Most men using viagra don't need it, they just like using it, and nothing prevents them from enjoying it on their own.

  • Re:What is it? (Score:2, Informative)

    by maxwell demon ( 590494 ) on Friday October 08, 2010 @09:49AM (#33835252) Journal

    So the problem is browsers silently removing them.

    A browser should never modify a URL. In particular, it should not remove invalid characters. It should give an error if you try to go to an invalid URL. It should not try to be "helpful" here.
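As a rough illustration of the "error out instead of cleaning up" behaviour argued for here, a strict check might look like the Python sketch below. `check_url` and `HOST_LABEL` are hypothetical, and the sketch assumes host names are plain ASCII letters, digits and hyphens (internationalized names would have to arrive already in their ASCII-encoded form).

```python
import re
from urllib.parse import urlsplit

# One DNS label: ASCII letters, digits and the ordinary hyphen only.
HOST_LABEL = re.compile(r"[A-Za-z0-9-]{1,63}")


def check_url(url):
    """Return the URL unchanged, or raise ValueError; never repair it."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError("no host in URL: %r" % url)
    for label in host.split("."):
        if not HOST_LABEL.fullmatch(label):
            raise ValueError("invalid character in host: %r" % host)
    return url
```

A link whose host contains U+00AD is rejected outright rather than being silently turned into a different, valid-looking address.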
