Security

Gmail's AI-Powered Spam Detection Is Its Biggest Security Upgrade in Years (arstechnica.com)

The latest post on the Google Security blog details a new upgrade to Gmail's spam filters that Google is calling "one of the largest defense upgrades in recent years." ArsTechnica: The upgrade comes in the form of a new text classification system called RETVec (Resilient & Efficient Text Vectorizer). Google says this can help understand "adversarial text manipulations" -- these are emails full of special characters, emojis, typos, and other junk characters that previously were legible by humans but not easily understandable by machines. Previously, spam emails full of special characters made it through Gmail's defenses easily.

[...] The reason emails like this have been so difficult to classify is that, while any spam filter could probably swat down an email that says "Congratulations! A balance of $1000 is available for your jackpot account," that's not what this email actually says. A big portion of the letters here are "homoglyphs" -- by diving into the endless depths of the Unicode standard, you can find obscure characters that look like they're part of the normal Latin alphabet but actually aren't.
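To make the homoglyph problem concrete, here is a minimal sketch in Python. The lookalike string is a made-up example that swaps in Cyrillic letters, and `non_latin_letters` is a hypothetical helper name, not anything from Gmail or RETVec. It only illustrates why an exact keyword match misses such text.

```python
import unicodedata

# Hypothetical spam word; the lookalike swaps Latin letters for Cyrillic ones.
plain = "Congratulations"
lookalike = "C\u043engr\u0430tul\u0430ti\u043ens"  # Cyrillic о (U+043E) and а (U+0430)

def non_latin_letters(text):
    """Return (char, Unicode name) for each letter whose name does not start with LATIN."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]

print(plain == lookalike)       # the two strings differ despite looking alike
print(plain in lookalike)       # so a naive keyword match does not fire
for ch, name in non_latin_letters(lookalike):
    print(repr(ch), name)       # lists the substituted characters by name
```

Checking character names like this is only a rough heuristic; it says nothing about how RETVec itself handles such text.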

This discussion has been archived. No new comments can be posted.

  • by MIPSPro ( 10156657 ) on Monday December 04, 2023 @02:26PM (#64053935)
    They have the first thing I'd describe as an "AI spell checker" that uses context, locale, and many other factors to influence the spell check logic. I don't know when it got "this good" but I've noticed significant upgrades to the spelling checker to the point that I'd call it really exceptionally good. I used to tease AI-fanboys by saying "Wake me up when they can write a decent spell checker." Well, they did, and I'm now awake and watching ChatGPT and friends closely.
    • Can you give me an example of where spell checker has been failing you and how context and locale would have yielded a different result?
      • Spell checkers formerly got very tripped up by homonyms, for example: "That sure is a plane looking plane." Also homophones almost always need context to be checked correctly. Consider: "Everyone accept me got to go." or "It's currently ate o-clock."
        • by dgatwood ( 11270 )

          Spell checkers formerly got very tripped up by homonyms, for example: "That sure is a plane looking plane." Also homophones almost always need context to be checked correctly. Consider: "Everyone accept me got to go." or "It's currently ate o-clock."

          Nit: That would be grammar checkers. There's nothing wrong with the spelling of either of those two sentences. They just make no sense because the wrong word was spelled.

          • Semantics. Technically you know which word you mean, but you don't know which spelling goes with it. It corrects the spelling of the word you mean to type from the word you did type.

          • That's one way to describe the situation. Another is that they are misused, not misspelled, but technically using a homophone that's the wrong word is still a misspelling. Most grammar checkers are aggressive and make bad decisions. The Google spelling checker (and perhaps it's a spelling/grammar checker) is much less annoying in this way and focuses on spelling issues; I haven't seen it go after other common grammatical issues beyond double words.
          • Would it be a grammar checker? The statement appears grammatically correct. Perhaps a better moniker would be a coherence checker.
        • Thanks for the example, now I get where you are coming from. As others have noted, that's not really a spelling issue, but it is something an LLM would likely be very good at finding and would be useful to the user regardless of name. I think coherence checker might be a better name, but whatever they call it, I agree that it should be rolled into spelling and grammar checking.
          • Yes, it's kind of a hybrid activity as you point out, not pure spell checking. I guess that's why it works a bit better.
        • like a wood plane and an aeroplane? They're both spelled correctly and are both grammatically correct as entered.

          Try again.

    • I believe they use a predictive-keyboard type of algorithm and run what you're typing through that to suggest corrections based on what words you're likely to use vs. what you actually type (and use that as a basis to decide if a homophone is the wrong word). Which, yes, is the precursor to something like ChatGPT. ChatGPT is essentially the same idea on a bigger scale, but it uses prompt text instead of trying to continue your own typing, and it uses an entirely different type of training data.

  • by dargaud ( 518470 ) <slashdot2@nOSpaM.gdargaud.net> on Monday December 04, 2023 @02:30PM (#64053953) Homepage
    I've said it 20 years ago here when the SEO craze started, search engine crawlers should do something like that:
    1 - do their usual stuff (crawl as googlebot, get the page)
    2 - crawl as Chrome or Firefox in plain user version, render the page in the browser, do a screenshot and OCR it...
    3 - compare the text of the 2 versions (now with AI!). If they differ too much, penalize the page/site.
    This would handle text in images, text with fucky unicode, white text on white background, etc...
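The comparison in step 3 could be sketched roughly as follows. This assumes the crawler's text and the OCR'd text of the rendered page are already available as strings. The function names and the 0.6 threshold are illustrative assumptions, and `difflib` stands in for whatever comparison (AI-based or otherwise) a real crawler would use.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())

def similarity(crawled_text, rendered_text):
    """Ratio in [0, 1]; 1.0 means the two versions match after normalization."""
    matcher = SequenceMatcher(None, normalize(crawled_text), normalize(rendered_text))
    return matcher.ratio()

def should_penalize(crawled_text, rendered_text, threshold=0.6):
    # The threshold is an arbitrary illustrative value, not a recommendation.
    return similarity(crawled_text, rendered_text) < threshold

# Usage: what the bot was served vs. what a person would see on screen.
print(should_penalize("Gardening tips for tomatoes", "gardening   tips for tomatoes"))
print(should_penalize("aaaa bbbb cccc", "xxxx yyyy zzzz"))
```

A character-level ratio is crude and OCR noise alone would lower it, so any real threshold would need tuning against known-good pages.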
  • by sinij ( 911942 )
    My trust in Google is nonexistent, so I see this new measure as a way to smuggle censorship into Gmail while maintaining plausible deniability. There is no way politically inconvenient explosive stories are not going to disproportionately likely get flagged as spam and blocked for anyone Google profiles as a swing voter.
  • Why does one need an advanced AI system to block emails with homoglyphs? Just block any email that uses more than a very small number of characters outside of Unicode 1.
    • AI has been made so politically correct it wouldn't dare penalize anything homo
    • by gweihir ( 88907 )

      I have a custom spamassassin rule that blocks subjects with those in it. Extending that to full emails is probably something like half an hour of Perl or Python scripting. I guess that is too hard for Google these days...

      • That's too simple. It wouldn't get anyone promoted, let alone justify the entire teams to have people lead.
        • by gweihir ( 88907 )

          As I said, "too hard". In this case, too hard for the organization, not too hard for the engineers.

          One of the signs of a tech company in decline is when they cannot do simple things right anymore.

    • That can be pretty tough when you're a multi-lingual system and some languages spell English loanwords with a Latin alphabet. I think it would be easier to normalize the glyphs to whichever form spells a real word and then use that in the filtering.
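The "normalize to whichever form spells a real word" idea might look something like this minimal sketch. `CONFUSABLES` and `KNOWN_WORDS` are tiny hypothetical stand-ins. A real system would need a much larger lookalike table (Unicode publishes confusables data) and per-language dictionaries.

```python
# Tiny, hypothetical lookalike table: confusable character -> Latin candidate.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
}

# Hypothetical word list standing in for a real dictionary.
KNOWN_WORDS = {"jackpot", "account", "balance"}

def to_latin(word):
    """Swap every character found in the lookalike table for its Latin candidate."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in word)

def normalize_word(word, known_words=KNOWN_WORDS):
    """Replace lookalikes only when the result spells a known word; otherwise keep the word."""
    if word.lower() in known_words:
        return word
    candidate = to_latin(word)
    return candidate if candidate.lower() in known_words else word

print(normalize_word("j\u0430ckp\u043et"))  # mixed Latin/Cyrillic spelling of a known word
print(normalize_word("привет"))             # genuine Cyrillic word is left untouched
```

Because substitution happens only when it yields a dictionary word, legitimate non-Latin text passes through unchanged, which is the multilingual concern raised above.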

  • by dskoll ( 99328 )

    In my experience, Bayes was very good at catching stuff like this, because words with homoglyphs in them were very quickly classified as very spammy.
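A minimal sketch of why a Bayesian filter picks this up: a token containing a homoglyph is a distinct string that shows up only in spam, so its per-token score climbs quickly. The class below is a simplified illustration with add-one smoothing and equal priors. It is not SpamAssassin's or any particular filter's implementation, and the training messages are made up.

```python
from collections import Counter

class TokenBayes:
    """Minimal per-token Bayesian scorer (illustrative only)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        # Count each token once per message.
        tokens = set(text.split())
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_msgs += 1

    def token_spamminess(self, token):
        """Estimate P(spam | token) with add-one smoothing and equal priors."""
        spam_rate = (self.spam_counts[token] + 1) / (self.spam_msgs + 2)
        ham_rate = (self.ham_counts[token] + 1) / (self.ham_msgs + 2)
        return spam_rate / (spam_rate + ham_rate)

bayes = TokenBayes()
for _ in range(3):
    bayes.train("claim your j\u0430ckpot now", is_spam=True)   # Cyrillic а inside the word
    bayes.train("meeting notes for the team", is_spam=False)

print(bayes.token_spamminess("j\u0430ckpot"))  # seen only in spam
print(bayes.token_spamminess("meeting"))       # seen only in ham
```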

  • While I believe AI may be able to filter spam more effectively, I also believe AI will be used to generate spam more effectively too. I believe the generative AI will be more effective than the protective AI, since all you have to do is set up a Gmail account to check if it was filtered, so that AI can get immediate, automatic feedback.

    The filter AI, by contrast, relies on people marking it as spam to fix any errors. They may also never even see any false positives. I personally welcome the day where AI gets bette

  • by schweini ( 607711 ) on Monday December 04, 2023 @02:57PM (#64054039)
    Strange. I would've thought that detecting homoglyphs should be relatively easy? There really aren't THAT many legitimate use-cases to use vastly different Unicode codepoints from different parts of Unicode in the same email? Or to use the lookalike Unicode characters mixed with regular ones?
    • Strange. I would've thought that detecting homoglyphs should be relatively easy? There really aren't THAT many legitimate use-cases to use vastly different Unicode codepoints from different parts of Unicode in the same email? Or to use the lookalike Unicode characters mixed with regular ones?

      Yeah, throwing "AI" at a simple task like this seems like they are just looking for nails to hit with their AI hammer. Also as other posters have mentioned, we've known for years that homoglyphs have been used a lot on spam - if they were serious about reducing spam, a human at Google should have written a script to deal with this simple vector many years ago.

  • Detecting AI generated content is pretty much impossible at this time, after all. So this is probably just the next step in the arms-race.

  • Gmail puts emails from my Linode-hosted server into SPAM folders. If they are sent from a VM hosted elsewhere, they don't go into the SPAM folders.

    My Linode server is on no blocklists. SPF and DKIM are set up and working properly. I run a small family mailserver: no spam sent ever, since I began using the IP address. There is absolutely no reason for my emails to be categorized as SPAM.

    Clearly, there is no intelligence involved in this categorization as SPAM, artificial or otherwise.

    Perhaps the new upgraded fil

    • Unlikely. Gmail doesn't want to play nice with independent self-hosted email. They want you to find the experience so sucky that you throw up your hands in despair and move to Gmail so they can mine all your family's data. Stay strong!

      • I want to reinforce that, if I route the email through a different VM, hosted by a different service, Gmail doesn't treat them as SPAM. The only difference is the IP address the emails are coming from.

        • You have proven how straightforward it is to get around IP-based spam filtering. You'd think Google would also know in 2023 how pointless it is to flag spam based on IP address. It seems so dumb that I'm inclined to ascribe it to malice rather than incompetence.

  • "by diving into the endless depths of the Unicode standard, you can find obscure characters that look like they're part of the normal Latin alphabet but actually aren't."

    So, filter every email that uses "lookalike" characters. Done and dusted.

    • by pjt33 ( 739471 )

      Not every e-mail that uses them: some of the lookalikes are e.g. Cyrillic letters. What it should do is look at the metadata to see what language the e-mail claims to be in and weight characters accordingly.
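One way to read the suggestion above is to score how many letters fall outside the script expected for the declared language. In this sketch the declared language is assumed to be available already (for instance from a Content-Language header). `EXPECTED_SCRIPT` is a hypothetical three-entry mapping, and the name-prefix check is only an approximation of script membership.

```python
import unicodedata

# Hypothetical mapping from a declared language tag to its expected script.
EXPECTED_SCRIPT = {"en": "LATIN", "ru": "CYRILLIC", "el": "GREEK"}

def unexpected_letter_ratio(text, declared_language):
    """Fraction of letters whose Unicode name does not start with the expected script."""
    expected = EXPECTED_SCRIPT.get(declared_language)
    letters = [ch for ch in text if ch.isalpha()]
    if expected is None or not letters:
        return 0.0  # nothing to judge against
    unexpected = sum(
        1 for ch in letters if not unicodedata.name(ch, "").startswith(expected)
    )
    return unexpected / len(letters)

print(unexpected_letter_ratio("привет", "ru"))          # Cyrillic text declared as Russian
print(unexpected_letter_ratio("j\u0430ckpot", "en"))    # one Cyrillic letter in "English" text
```

The ratio would be one weighted signal among many rather than a block rule, since the declared language is under the sender's control.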

  • by awwshit ( 6214476 ) on Monday December 04, 2023 @03:31PM (#64054183)

    GMail is going to apply this filtering to its outbound mail too, right? Right?

    Dear GMail, saving your users from spam while allowing them to relentlessly spam the rest of the world is evil. Be sure to apply your new filtering to your users' outbound mail too, maybe GMail would stop being the top spammer in the world.

    • by tlhIngan ( 30335 )

      Dear GMail, saving your users from spam while allowing them to relentlessly spam the rest of the world is evil. Be sure to apply your new filtering to your users' outbound mail too, maybe GMail would stop being the top spammer in the world.

      Most of my spam comes from Google Groups.

      Spammers import lists of email addresses and there's no way to remove yourself from the list (I've tried everything from "unsubscribe" to clicking the unsubscribe link) as you're not quite sure what address spammers used and either

  • by bobbutts ( 927504 ) <bobbutts@gmail.com> on Monday December 04, 2023 @03:31PM (#64054187)
    Once in a while I get obvious scam/virus emails with Russian characters and several obvious virus attachments marked as "Important". Meanwhile actual important things like Fedex shipping notifications and some major retailers always go to spam.
  • Their chart [googleusercontent.com] shows their "False Positive reduction" was -19.49% and "False Negative reduction" was -17.71%, but I doubt they'd be reporting on negative reductions (aka increases).

    Kidding aside, this serves to me (as a researcher in this exact space) as a reminder of how bad GMail is (was?) at threat detection; a spam-specific tokenizer like the (not at all recent) one in SpamAssassin is pretty good (see Paul Graham's 2003 Better Bayesian Filtering [paulgraham.com] writeup, which SA used extensively). There's lots of room f

  • This does nothing for image-based spam where the text is a vector in a graphic file.
