Distributed Spam Detection

Posted by CmdrTaco on Saturday December 01, 2001 @01:20PM from the interesting-ideas dept.

A reader writes "There's an interesting project at SourceForge called "Vipul's Razor" that uses a Gnutella-like system to let users exchange spam "signatures" to filter spam. I work at an ISP in Ottawa, and we have been using it for the last two weeks to stop the bulk of the spam coming to our POP3 accounts. More impressively, it hasn't tagged any valid mail as spam yet. Here's the scoop from its webpage: "Vipul's Razor is a distributed, collaborative, spam detection and filtering network. Razor establishes a distributed and constantly updating catalogue of spam in propagation. This catalogue is used by clients to filter out known spam. On receiving a spam, a Razor Reporting Agent (run by an end-user or a troll box) calculates and submits a 20-character unique identification of the spam (a SHA Digest) to its closest Razor Catalogue Server. The Catalogue Server echoes this signature to other trusted servers after storing it in its database. Prior to manual processing or transport-level reception, Razor Filtering Agents (end-users and MTAs) check their incoming mail against a Catalogue Server and filter out or deny transport in case of a signature match."" Cool idea. I'm up around 80% spam a day on my main mail account. Might be worth a try.
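To make the report/echo/filter flow quoted above concrete, here is a minimal sketch. It assumes the "SHA Digest" is SHA-1 (20 bytes, 40 hex characters) over the message body; the `CatalogueServer` class, its method names, and the two server variables are hypothetical stand-ins and are not Razor's actual code or protocol.

```python
import hashlib


def spam_signature(body):
    """Return the SHA-1 digest of a message body as 40 hex characters (20 bytes)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


class CatalogueServer:
    """Hypothetical in-memory stand-in for a catalogue server."""

    def __init__(self):
        self.signatures = set()
        self.peers = []  # other trusted servers that reports are echoed to

    def report(self, signature):
        # Store the signature first, then echo it to peers.
        # A server that already knows the signature stops the echo.
        if signature in self.signatures:
            return
        self.signatures.add(signature)
        for peer in self.peers:
            peer.report(signature)

    def is_spam(self, signature):
        return signature in self.signatures


server_a = CatalogueServer()
server_b = CatalogueServer()
server_a.peers.append(server_b)
server_b.peers.append(server_a)

spam = "MAKE MONEY FAST!!! Reply now."
server_a.report(spam_signature(spam))  # a reporting agent submits the digest

# A filtering agent that talks to the other server still sees the signature.
print(server_b.is_spam(spam_signature(spam)))
print(server_b.is_spam(spam_signature("Lunch at noon?")))
```

Only the digest travels between machines, never the message itself, which is why a report is so small.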


So... (Score:5, Interesting)

by DagSverre ( 223837 ) writes: on Saturday December 01, 2001 @01:29PM (#2641294) Homepage

...what stops this from being abused? Say I set up a box that automatically reports all mails on the most popular mailing lists as spam, effectively making ISPs around the world start to filter out the mailing lists...

It's a great initiative. I really hope no troll out there takes me up on this and actually does it.

Fabulous Idea! (Score:3, Interesting)

by under_score ( 65824 ) writes: <mishkin@@@berteig...com> on Saturday December 01, 2001 @01:30PM (#2641298) Homepage

The people who came up with this idea deserve to be considered heroes! This is one of the coolest uses of technology I have seen. (Not to be too gushing: SPAM is a rich man's problem - I hope someone comes up with some cool technological solutions to some of humanity's more basic problems.) I run a server which hosts mail for a number of domains. I haven't installed it yet, 'cause I just heard of it, but this will be used! There might be some interesting extensions based on possible problems. Certain kinds of spam interest certain people, so perhaps a categorization system would be useful so that spam can be filtered based on these categories (for example, some people might like receiving 100 MLM spam messages a day :-P ). Also, there is an (extremely) slim chance that a legit mail might be blocked because its hash matches. Although this is extremely unlikely, could it be fixed somehow? Finally, some spam comes with very slight differences but is essentially the same spam instance. Chain letters are in a grey area. It would be good to have some heuristic methods of filtering based on content too. I don't know the characteristics of the hashing algorithm used, but perhaps by doing three hashes: start of message, middle of message, and end of message, it may be possible to identify spam even if a small part has been changed. Anyway, just some random thoughts. Kudos again to those who have built this!
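The three-hash suggestion can be sketched as follows. This is only an illustration of the idea in the comment, not anything Razor does: the split into equal thirds, the use of SHA-1, and the function names are all assumptions.

```python
import hashlib


def three_part_signatures(body):
    """Hash the first, middle, and last thirds of the message separately."""
    data = body.encode("utf-8")
    third = len(data) // 3
    parts = [data[:third], data[third:2 * third], data[2 * third:]]
    return [hashlib.sha1(part).hexdigest() for part in parts]


def matching_parts(a, b):
    """Count how many of the three positional hashes agree (0 to 3)."""
    pairs = zip(three_part_signatures(a), three_part_signatures(b))
    return sum(x == y for x, y in pairs)


original = ("Buy our pills today. " * 3
            + "Visit example.invalid now. " * 3
            + "Offer ends soon. " * 3)

# Same length, only the last few characters differ.
tail_changed = original[:-5] + "SOON!"
# One extra character at the front shifts every boundary.
shifted = "X" + original

print(matching_parts(original, tail_changed))  # 2
print(matching_parts(original, shifted))       # 0
```

The second case shows the weakness: the split points are positional, so a single inserted character changes all three hashes. That is the problem the later comments about normalizing text and choosing content-based entry points try to address.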

Can it be abused? (Score:1, Interesting)

by LinuxGeek8 ( 184023 ) writes: on Saturday December 01, 2001 @01:32PM (#2641302) Homepage

That's quite interesting.
But the main question is, can it be abused?
I'd expect the senders of spam to want to render this project useless by submitting garbage to the database.

In return, I guess it is possible to have some sort of moderating system on the submitters of the data, which can filter out most of the abusers.

Yes I've posted this before but (Score:3, Interesting)

by 4444444 ( 444444 ) writes: <4444444444444444 ... 444444@lenny.com> on Saturday December 01, 2001 @01:41PM (#2641331) Homepage

I love costing spammers real money. Just go to
http://goto.com
and do a search for "bulk email". Each link you click will cost the scumbags that sell spam software or spamming services several dollars.
Also, I love this new technology. I wish all ISPs would use it.

and for more spam fighting ideas please check out
http://www.lenny.com/spam

Re:Great use of p2p -- Wont work. (Score:2, Interesting)

by VC ( 89143 ) writes: on Saturday December 01, 2001 @02:07PM (#2641402)

This won't work. All that will happen is that the spammers will modify their spam programs to slightly alter each message they send out. This will result in each message having a COMPLETELY different SHA signature.
Cool idea, but it won't work. Sorry. Maybe some kind of AI algorithm.
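The objection is easy to demonstrate. The sketch below assumes SHA-1 over the raw body; the message template and the per-copy serial number are made up for illustration.

```python
import hashlib

# Hypothetical spam template; the spammer stamps each copy with a serial number.
template = "Dear friend, earn $$$ from home! (ref {serial})"

digests = [
    hashlib.sha1(template.format(serial=n).encode("utf-8")).hexdigest()
    for n in range(3)
]

for digest in digests:
    print(digest)


def differing_hex_digits(a, b):
    """Count the positions at which two equal-length hex digests differ."""
    return sum(x != y for x, y in zip(a, b))


# The copies differ by one character, yet most digest positions change.
print(differing_hex_digits(digests[0], digests[1]), "of", len(digests[0]))
```

A cryptographic hash is designed so that similar inputs do not give similar outputs, so an exact-match catalogue only catches byte-identical copies.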

Re:Yes I've posted this before but (Score:2, Interesting)

by bleeeeck ( 190906 ) writes: on Saturday December 01, 2001 @02:07PM (#2641403)

I love costing spammers real money. Just go to http://goto.com and do a search for "bulk email". Each link you click will cost the scumbags that sell spam software or spamming services several dollars.
Here's the link [overture.com] for you lazy people.
The top few listings are more than $8 each.

Re:Great use of p2p (Score:2, Interesting)

by __aawsxp7741 ( 78632 ) writes: on Saturday December 01, 2001 @02:20PM (#2641429)

How about Freenet [freenetproject.org]? Can be (ab)used for piracy, of course, but neither is that its purpose, nor does it seem its current main use.

Re:Fighting spam (Score:2, Interesting)

by Thanatopsis ( 29786 ) writes: <despain.brian@nOSPaM.gmail.com> on Saturday December 01, 2001 @02:28PM (#2641443) Homepage

Not really. You simply change the order in which your filters get checked and filter out legitimate mailing list traffic before the SPAM check. For example, I am a member of various ZDNet lists and development lists. I filter those based on the sender or the From address into their own mailbox, and then I can read them at my leisure.

a good idea, but... (Score:3, Interesting)

by deander2 ( 26173 ) writes: <public@kered . o rg> on Saturday December 01, 2001 @02:29PM (#2641444) Homepage

What stops the spammer from including a unique identifier in each e-mail (such as a count variable), changing the SHA for each e-mail that goes out?

Just a thought...

Re:Have you looked at Hotmail's new spam filter? (Score:2, Interesting)

by Anonymous Coward writes: on Saturday December 01, 2001 @02:32PM (#2641449)

I wonder if Hotmail is using the same kind of logic. I mean, they allow the user to label which emails the user sees as spam. Then they can set some kind of threshold based on how many labels a signature has received.

When the threshold is crossed, then the signature will be categorized as spam to all other users.

It will work beautifully considering how many users they have.
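A threshold like the one described might look like the sketch below. Whether Hotmail works this way is the commenter's guess, and the class, its names, and the threshold value are hypothetical. Counting distinct reporters rather than raw reports also blunts the abuse scenario raised earlier in the thread, where a single box reports mailing-list traffic over and over.

```python
from collections import defaultdict


class ReportTally:
    """Counts distinct reporters per signature (hypothetical helper)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # signature -> set of user ids

    def report(self, signature, user_id):
        self.reporters[signature].add(user_id)

    def is_spam(self, signature):
        # .get avoids creating an entry for signatures nobody has reported
        return len(self.reporters.get(signature, ())) >= self.threshold


tally = ReportTally(threshold=3)

# One abusive account reporting the same list post 100 times counts once.
for _ in range(100):
    tally.report("mailing-list-post", "troll")

# Three different users reporting the same message cross the threshold.
for user in ("alice", "bob", "carol"):
    tally.report("real-spam", user)

print(tally.is_spam("mailing-list-post"))  # False
print(tally.is_spam("real-spam"))          # True
```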

I've managed to filter most spam (Score:3, Interesting)

by Rikardon ( 116190 ) writes: on Saturday December 01, 2001 @02:34PM (#2641453)

I found a clever way to defeat most spam on the webpage of an avid cyclist; unfortunately I can't remember his name or enough information about him to run a Google search and give this method proper attribution. But here goes anyway:

The key to this method is to realize that most spam has a spoofed "To" address -- RARELY is it addressed directly to you. If you dig in the message headers, you'll usually find it was mailed (or CC'd) to a whole bunch of people at once, for obvious reasons. So you set up your mail filters thusly:

First, set up a filter allowing any "legal" mailing lists you're on to go to your Inbox.

Next, a filter to allow any mail sent directly to you (i.e. you@domain.com is in the To or CC lines) to go to your Inbox.

Finally, a filter that deletes everything else.

You'd be amazed how effective this is. Since setting this up, I only get maybe one spam message past this system every three or four months.
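The three rules above could be written roughly like this. It is a sketch using Python's standard `email` package; the list address is made up, and matching mailing lists by their To/Cc address is an assumption (real lists are often better matched on other headers).

```python
from email import message_from_string
from email.utils import getaddresses

MY_ADDRESSES = {"you@domain.com"}
ALLOWED_LISTS = {"dev-list@lists.example.org"}  # hypothetical list address


def classify(raw_message):
    """Return "inbox" or "delete" by applying the three filters in order."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(headers)}

    if recipients & ALLOWED_LISTS:  # 1. mailing lists you actually joined
        return "inbox"
    if recipients & MY_ADDRESSES:   # 2. mail addressed directly to you
        return "inbox"
    return "delete"                 # 3. everything else


direct = "From: a@example.com\nTo: You <you@domain.com>\nSubject: hi\n\nHello\n"
listed = "From: b@example.com\nTo: dev-list@lists.example.org\nSubject: patch\n\nDiff\n"
bulk = "From: c@example.com\nTo: friend@public.example\nSubject: $$$\n\nBuy now\n"

for raw in (direct, listed, bulk):
    print(classify(raw))
```

The order matters, as noted elsewhere in the thread: the list rule runs first so that list traffic, which is never addressed to you personally, is not caught by the final delete rule.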

Mind you, I also have my email come in via Bigfoot, which has a pretty good spam filter itself. But this has nonetheless proven quite effective.

Re:Great use of p2p -- Wont work. (Score:5, Interesting)

by DLG ( 14172 ) writes: on Saturday December 01, 2001 @02:40PM (#2641459)

>> This wont work. All that will happen is that the spammers will just modify their spam programs to slightly modify each message they send out.

It will however require them to send each specific message separately rather than sending large cc's or using some sort of relay. That alone is a big step since right now most spammers can get away with sending a single email message and relying on an open relay to retransmit to a larger group.

Furthermore, I doubt that this project will concern spammers for the time being. In fact, I am pretty sure spammers are not really interested in wasting their own time trying to spam people who consider spam a violation. It is more convenient to ignore those people (which is why they don't bother to check if you want spam or not before they send it to you).

DLG

Virus Detection (Score:5, Interesting)

by doorbot.com ( 184378 ) writes: on Saturday December 01, 2001 @02:42PM (#2641465) Journal

This seems like it would be a great method for virus detection on a non-Windows machine. For those of you who run *nix mail servers whose mail eventually filters down to Windows clients, it would be nice to have mail tagged as viral denied immediately at the server. So I'm assuming all it would take is a smart admin to tag the email as spam, and then the tag will propagate around to the other servers (less than 1k would transfer!).

One flaw, depending on your perspective... (Score:4, Interesting)

by wirefarm ( 18470 ) writes: <jim&mmdc,net> on Saturday December 01, 2001 @02:51PM (#2641483) Homepage

I spent the last few days hacking together a bulk mailer in perl. I did so with a lot of sensitivity and a bit of trepidation and a lot of social engineering to my employer who wanted to put together a way to send invitations to a party via email, rather than the very expensive snail mail method that we had been using.

This was emailed to our real customers - our 'A list'. These are the people who get invited to these parties each time - people who come and enjoy the food and drinks, no strings attached.

And yet, technically, it *is* bulk email and, this first time, unsolicited. A very large percentage of the people responded enthusiastically that they want to remain on the list for this, but a few (8 out of 3500) asked to be removed from the list. One guy seemed annoyed and I typed him a personal apology. (In fact, I doubt that this guy read the email before sending off his remove request.)
What if that guy had submitted the email as spam to this system?
In that case, the rest would miss out on coming to a good party.

I hate spam as much as anyone on slashdot. I was asked to set up a bulk email and found that it could be done in a way that was not offensive in this case. Had it conflicted with my conscience, I would have refused.

Maybe the system needs some sort of moderation as a filter, too. At least that would allow valid bulk email to survive one trigger-happy end-user.

Ok, go ahead and tell me that I'm wrong in this...
Cheers,
Jim in Tokyo

Not necessarily such a Fabulous Idea! (Score:3, Interesting)

by marxmarv ( 30295 ) writes: on Saturday December 01, 2001 @03:02PM (#2641505) Homepage

The people who came up with this idea deserve to be considered heroes!

Wouldn't that be BrightLight?

I don't know the characteristics of the hashing algorithm used, but perhaps by doing three hashes: start of message, middle of message, and end of message, it may be possible to identify spam even if a small part has been changed.

HTML email provides too many places to hide garbage. Comment tags and unused X- attributes are the obvious ones; finely (or grossly) tweaking COLOR elements, or any number of things done to inlined images, provide an effectively infinite number of variations which will pass any filter based on the usual message digest algorithms.
Many such tricks can be defeated by only hashing words that appear in some standard dictionary and discarding all else, such that

<FONT COLOR="#FEFDFA"><BLINK X-515322451412135135>LIVE CO--ED NAKED DRESSED GIRLS, =46REE</BLINK></FONT>

gets reduced to LIVE NAKED DRESSED GIRLS before hashing. Even then, the smart thing to do is not to block matching mail but to blackhole the sources of matching mail, preferably permanently.
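A sketch of the dictionary-word reduction described above. The five-word `DICTIONARY` is a tiny stand-in for a real word list, and the tag-stripping regex and the function names are assumptions made for illustration.

```python
import hashlib
import re

# Tiny stand-in for "some standard dictionary".
DICTIONARY = {"LIVE", "NAKED", "DRESSED", "GIRLS", "FREE"}


def dictionary_words(html):
    """Drop markup, then keep only tokens that are dictionary words."""
    text = re.sub(r"<[^>]*>", " ", html)  # removes tags and their attributes
    words = []
    for token in text.split():
        token = token.strip(".,;:!?\"'()").upper()
        if token in DICTIONARY:
            words.append(token)
    return words


def content_digest(html):
    """Hash only the surviving dictionary words."""
    return hashlib.sha1(" ".join(dictionary_words(html)).encode("ascii")).hexdigest()


sample = ('<FONT COLOR="#FEFDFA"><BLINK X-515322451412135135>'
          'LIVE CO--ED NAKED DRESSED GIRLS, =46REE</BLINK></FONT>')
variant = ('<FONT COLOR="#FEFDFB"><BLINK X-99>'
           'LIVE C0-ED NAKED DRESSED GIRLS, =46REE</BLINK></FONT>')

print(dictionary_words(sample))  # ['LIVE', 'NAKED', 'DRESSED', 'GIRLS']
print(content_digest(sample) == content_digest(variant))  # True
```

Note that `=46REE` is dropped here only because this sketch does not decode quoted-printable; a filter that decoded it to FREE first would keep the word.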

(Not to be too gushing: SPAM is a rich man's problem - I hope someone comes up with some cool technological solutions to some of humanity's more basic problems.)

Humanity's more basic problem is the inability to cope with the concept of a world without scarcity. Would that technology fix that instead of providing the powerful with more ways to create unnatural scarcity.
-jhp

Re:Great use of p2p -- Wont work. (Score:4, Interesting)

by friscolr ( 124774 ) writes: on Saturday December 01, 2001 @03:31PM (#2641557) Homepage

Maybe some kind of AI algorithm
Every time spam gets mentioned on Slashdot, someone says this, and every time I respond with the work I've been doing:
pattern matching spam [blackant.net]
It uses word counts and phrase counts from known spam and known good mail to match against incoming mail. It requires a certain amount of known spam and non-spam, but otherwise it has a good rate of matching spam/not spam and doesn't require the incoming mail to be known at all beforehand.
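The general shape of count-based matching can be sketched as below. This is not the linked project's code: the feature choice (single words plus two-word phrases), the unweighted scoring, and the toy corpora are all assumptions.

```python
from collections import Counter


def features(text):
    """Single words plus adjacent two-word phrases, lowercased."""
    words = text.lower().split()
    phrases = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return words + phrases


def train(messages):
    """Count how often each feature appears across a corpus."""
    counts = Counter()
    for message in messages:
        counts.update(features(message))
    return counts


def spam_score(text, spam_counts, ham_counts):
    """Positive when the text shares more features with known spam than known good mail."""
    return sum(spam_counts[f] - ham_counts[f] for f in features(text))


spam_counts = train(["make money fast", "make money from home fast"])
ham_counts = train(["meeting moved to friday", "see you at the meeting"])

print(spam_score("make money now", spam_counts, ham_counts))         # 6
print(spam_score("friday meeting agenda", spam_counts, ham_counts))  # -3
```

Unlike a digest lookup, this scores a message nobody has reported yet, which is the property the comment emphasizes.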

Re:idea won't work if reaches critical mass (Score:3, Interesting)

by morzel ( 62033 ) writes: on Saturday December 01, 2001 @04:33PM (#2641726)

It is true that it is not always trivial to pick the pieces so that the fragments being hashed start at the same offset, but it isn't always necessary to add extra complexity. Given the sheer number of copies of the same message that spammers send, it would be quite difficult and time-consuming for them to create a lot of "slight variants" of it. Add to that the fact that spammers aren't the only resourceful people on this planet: we can make it difficult for them as well.

This is how I would do it:

Strip HTML/markup language, so that we get plain text of the message.

Strip all "meaningless" characters from the text, keep only alphabetic (or alphanumeric) characters, no spaces or punctuation.

Uppercase everything.

We now have one string, with all the meaningful characters of the email, which makes it quite hard for spammers to vary much without mutilating the message they're trying to convey.

Pick 8 entry points in this string based on the occurrence of a number of well-chosen, predefined two-character combinations that are likely to be found in English text(*). These need to be defined up front. There are lots of texts available from Project Gutenberg to analyze to arrive at such a set.

This is hard: we need to find a good balance between physical location in the string and the occurrence of the combinations we have defined, so that we can take a broad "sample" of the text. Luckily for us, spammers tend to send long messages :-)

Now we compute the hash of the fragments, defined by our entry-points and a fixed length. These hashes combined provide a "real big signature" of the spam message. Pick the last two bytes of every hash, and stick them together for a "small signature" that can be used for searching/matching. We need to define our protocol for searching the catalogue in such a way that when a partial match is found using the small signature, we can retrieve the full signature to check further.

Based on this we have a rating from 0/8 -> 8/8 for the probability of a mail being a spam message. End user settings can define what is destined for the bitbucket, and what goes in your mailbox.
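The steps above can be sketched as follows. Several choices are assumptions made only to get something runnable: the eight bigrams, the 16-character fragment length, SHA-1 as the hash, and using the first occurrence of each bigram as its entry point (which ignores the "balance with physical location" concern the comment raises).

```python
import hashlib
import re

BIGRAMS = ["TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT"]  # assumed anchor set
FRAGMENT_LENGTH = 16                                        # assumed


def normalize(message):
    """Strip markup, keep only alphanumerics, uppercase everything."""
    text = re.sub(r"<[^>]*>", "", message)
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


def fragment_hashes(message):
    """Map each bigram to the hash of the fragment starting at its first occurrence."""
    text = normalize(message)
    hashes = {}
    for bigram in BIGRAMS:
        start = text.find(bigram)
        if start != -1:
            fragment = text[start:start + FRAGMENT_LENGTH]
            hashes[bigram] = hashlib.sha1(fragment.encode("ascii")).digest()
    return hashes


def small_signature(message):
    """Last two bytes of every fragment hash, concatenated, for quick lookups."""
    return b"".join(h[-2:] for h in fragment_hashes(message).values())


def rating(a, b):
    """Number of fragment hashes (0 to 8) that two messages share."""
    ha, hb = fragment_hashes(a), fragment_hashes(b)
    return sum(1 for bigram, digest in ha.items() if hb.get(bigram) == digest)


spam = ("<b>Earn thousands</b> working from home! Another great internet "
        "opportunity. Reply at once to hear more.")
restyled = ("<i>EARN   thousands</i>, working from home!!! another GREAT "
            "internet opportunity -- reply at once to hear more")
unrelated = "Meeting notes attached; see the agenda for Thursday."

print(rating(spam, restyled))   # 8: markup, case and punctuation no longer matter
print(rating(spam, unrelated))  # 0
```

A client would look up the 16-byte small signature first and fetch the full hashes only on a partial match, then compare the resulting n/8 rating with the user's own cut-off.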

In the end, spammers can (and will) try to circumvent these measures, but it would be hard and (hopefully) time-consuming, and it will require them to mutilate their messages to be undetected. Of course, this system only works properly when people are willing to submit spam fingerprints to the catalogue servers.

Anyway, that's my 0.02 EURO...

(*) Of course, English isn't the only language being used in spam, but I guess it's the most prevalent here. You can of course apply the same principle to any language. Heck, if you really want to push the envelope, you can try to detect the language (character frequency analysis and checking for very common words).

Re:Great use of p2p -- Wont work. (Score:5, Interesting)

by kevinank ( 87560 ) writes: on Saturday December 01, 2001 @04:49PM (#2641773) Homepage

Interesting work, but I notice that you are only examining trigrams, and you are using an even weight factor. To improve selection you probably at least need to use variable weights (a fuzzy logic neural network rather than binary logic) and train the network with more sample spam.
I've been working on a similar project but using additional factors that help identify spam such as violations of the mail RFC's, and other header indicators, in addition to NLP. I have a prototype that I'm using to score all of my inbox e-mail and am using that to tune the weight factors and add in new factors as I encounter them. It would be interesting to combine your approach with mine I think, since I hadn't thought of analyzing trigrams.
Anyway, if you are interested send me an e-mail and I'll give you my current perl code.

Not Gnutella-like at all; it's Napster-like. (Score:2, Interesting)

by jordan ( 17131 ) writes: on Saturday December 01, 2001 @06:02PM (#2641963) Homepage

The comment made in the submission states that Razor is gnutella-like. That is BS too; if anything, it's Napster-like. Razor is a centralized, collaborative filtering system. One could argue that Razor's master servers are distributed and that the entire system is therefore not fully centralized, but this will change shortly to a master/slave model, which will allow the introduction of a reputation management system.

Keep your eyes peeled.

--jordan

Re:Fabulous Idea! (Score:2, Interesting)

by mmol_6453 ( 231450 ) writes: <short.circuit@ma ... m ['l.g' in gap]> on Saturday December 01, 2001 @06:21PM (#2642014) Homepage Journal

I own and operate an ISP, and I will not install this software on my servers, because I refuse to withhold my customers' mail.

However, I will recommend this software to my customers, so they can use it at their option. That way, they can do what they want. (And I don't get hit with a lawsuit on the off chance a very vital email gets blocked.)

Re:Some positivism and less bitching please... (Score:3, Interesting)

by Kris_J ( 10111 ) writes: on Saturday December 01, 2001 @06:53PM (#2642160) Homepage Journal

What it needs is for someone to set up free email accounts with "nospam" in the domain: myemail@nospam.com, or myemail@yahoo.nospam.com, etc. Then all these new harvest-bots that trim out "nospam" will either get it wrong or discount it completely.
Just a random thought early on a Sunday morning...

Here's a Perl script for hitting spammers links (Score:1, Interesting)

by Anonymous Coward writes: on Saturday December 01, 2001 @11:42PM (#2642820)

I just wrote a Spam Victims Revange v0.01 [geocities.com], it's a little Perl script which hits paid links found on Overture [overture.com] under "bulk email" queries etc. It acts like a real browser, in terms of HTTP_USER_AGENT and random "clicks" intervals, showing progress of total hits and total bucks. Enjoy.

Another spam system (Score:1, Interesting)

by Anonymous Coward writes: on Sunday December 02, 2001 @08:21AM (#2643442)

You can also check out this idea [linuxfr.org], which works as well.

Basically, you have a bunch of trusted people who can add entries to the spam list. When they receive a spam, they forward it to a list, signing the message with PGP/GnuPG. A Perl engine then verifies the signature to determine whether the person is allowed to add or remove entries. It then fetches the From: header from the forwarded email and adds it to a file that is available on the net. You just have to write a script that fetches the file every 10 minutes and adds the contents to your access list (postfix, sendmail, etc.) with REJECT.

Scripts are also available for Gnus/Emacs, so you hit F1 and it sends the mail the way it should. Reporting spam is one key away. It's important that reporting spam takes no time, or you won't do it, as you probably receive many per day.

You also can add [domain] in the subject line which will add the whole domain from the From: header. The [rbl:IP] will add it to a rbl table.
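The "fetch the From: header and add it with REJECT" step might look like the sketch below. It is not the linked project's Perl engine: the function name is hypothetical, signature verification is assumed to have happened already, the [rbl:IP] tag is not handled, and the "key, whitespace, REJECT" line layout is assumed to be what your MTA's access table expects.

```python
from email import message_from_string
from email.utils import parseaddr


def access_entry(forwarded_spam, subject_tag=""):
    """Build one access-list line from the From: header of a reported spam.

    Assumes the reporter's PGP/GnuPG signature was verified beforehand.
    """
    msg = message_from_string(forwarded_spam)
    _, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    if "[domain]" in subject_tag:
        key = address.split("@", 1)[-1]  # block the whole sending domain
    else:
        key = address                    # block only this sender
    return f"{key}\tREJECT"


reported = "From: Bulk Offers <Offers@Spam.example>\nSubject: $$$\n\nBuy now\n"

print(access_entry(reported))
print(access_entry(reported, subject_tag="[domain] fwd: $$$"))
```

A periodic job would append lines like these to the access file and rebuild the map, as the comment describes for the 10-minute fetch.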

Take a look, it's cool.
