Distributed Spam Detection
A reader writes "There's an interesting project at SourceForge called "Vipul's Razor" that uses a Gnutella-like system to let users exchange spam "signatures" to filter spam. I work at an ISP in Ottawa, and we have been using it for the last two weeks to stop the bulk of the spam coming to our POP3 accounts. More impressively, it hasn't tagged any valid mail as spam yet.
Here's the scoop from its webpage:
"Vipul's Razor is a distributed, collaborative, spam detection and
filtering network. Razor establishes a distributed and constantly updating
catalogue of spam in propagation. This catalogue is used by clients to
filter out known spam. On receiving a spam, a Razor Reporting Agent (run
by an end-user or a troll box) calculates and submits a 20-character
unique identification of the spam (a SHA Digest) to its closest Razor
Catalogue Server. The Catalogue Server echos this signature to other
trusted servers after storing it in its database. Prior to manual
processing or transport-level reception, Razor Filtering Agents (end-users
and MTAs) check their incoming mail against a Catalogue Server and filter
out or deny transport in case of a signature match.""
Cool idea. I'm up around 80% spam a day on my main mail account. Might be worth a try.
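The report-and-check flow described on the Razor page can be sketched roughly as below. This is only an illustration, not Razor's actual code or protocol: it assumes the "SHA Digest" is SHA-1 (which yields 20 bytes), and the `Catalogue` class is a hypothetical in-memory stand-in for a Catalogue Server.

```python
import hashlib


def spam_signature(body: str) -> bytes:
    """Return a 20-byte digest of the message body (SHA-1 is an assumption)."""
    return hashlib.sha1(body.encode("utf-8")).digest()


class Catalogue:
    """Hypothetical in-memory stand-in for a Razor Catalogue Server."""

    def __init__(self) -> None:
        self._signatures: set[bytes] = set()

    def report(self, body: str) -> None:
        # A Reporting Agent submits the signature of a message it saw as spam.
        self._signatures.add(spam_signature(body))

    def is_spam(self, body: str) -> bool:
        # A Filtering Agent checks incoming mail against the known signatures.
        return spam_signature(body) in self._signatures
```

A real deployment would send only the digest over the network, never the message itself.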
So... (Score:5, Interesting)
It's a great initiative. I really hope no troll out there takes my word on this and actually does this.
Fabulous Idea! (Score:3, Interesting)
Can it be abused? (Score:1, Interesting)
But the main question is, can it be abused?
I'd expect spammers to try to render this project useless by submitting garbage to the database.
In return, I guess it is possible to have some sort of moderating system on the submitters of the data, which can filter out most of the abusers.
Yes I've posted this before but (Score:3, Interesting)
http://goto.com
and do a search for "bulk email". Each link you click will cost the scumbags that sell spam software or spamming services several dollars.
Also, I love this new technology. I wish all ISPs would use it.
For more spam-fighting ideas, please check out
http://www.lenny.com/spam
Re:Great use of p2p -- Wont work. (Score:2, Interesting)
Cool idea, but it won't work. Sorry. Maybe some kind of AI algorithm.
Re:Yes I've posted this before but (Score:2, Interesting)
Here's the link [overture.com] for you lazy people.
The top few listings are more than $8 each.
Re:Great use of p2p (Score:2, Interesting)
Re:Fighting spam (Score:2, Interesting)
a good idea, but... (Score:3, Interesting)
What stops the spammer from including a unique identifier in each e-mail (such as a count variable), changing the SHA for each e-mail that goes out?
Just a thought...
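To make the concern concrete, here is a small sketch (assuming SHA-1 over the whole body, with a made-up message template) showing that a per-message counter is enough to give every copy a different digest:

```python
import hashlib


def digest(body: str) -> str:
    """Hex SHA-1 of the full message body (the hash choice is an assumption)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


# Hypothetical spam template; only the trailing counter differs per copy.
template = "Buy now! Limited offer."
variants = [f"{template}\nid: {n}" for n in range(3)]

# Each variant hashes to a different value, so an exact-match catalogue
# that has seen one copy would not recognise the others.
digests = {digest(v) for v in variants}
```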
Re:Have you looked at Hotmail's new spam filter? (Score:2, Interesting)
When the threshold is crossed, then the signature will be categorized as spam to all other users.
It will work beautifully considering how many users they have.
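A threshold scheme like the one described might look like the sketch below. The class name, the default threshold of 3, and the choice to count distinct reporters are all assumptions for illustration; nothing here reflects how Hotmail actually does it.

```python
from collections import defaultdict


class ReportThreshold:
    """Flag a signature as spam once enough distinct users have reported it."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # hypothetical default
        self._reporters: dict[str, set[str]] = defaultdict(set)

    def report(self, signature: str, user: str) -> None:
        # Using a set means repeat reports from one user count only once.
        self._reporters[signature].add(user)

    def is_spam(self, signature: str) -> bool:
        return len(self._reporters.get(signature, ())) >= self.threshold
```

Counting distinct users rather than raw reports means a single trigger-happy user cannot cross the threshold alone.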
I've managed to filter most spam (Score:3, Interesting)
The key to this method is to realize that most spam has a spoofed "To" address -- RARELY is it addressed directly to you. If you dig in the message headers, you'll usually find it was mailed (or CC'd) to a whole bunch of people at once, for obvious reasons. So you set up your mail filters thusly:
First, set up a filter allowing any "legal" mailing lists you're on to go to your Inbox.
Next, a filter to allow any mail sent directly to you (i.e. you@domain.com is in the To or CC lines) to go to your Inbox.
Finally, a filter that deletes everything else.
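Those three rules, in that order, could be sketched like this. The address `you@domain.com` comes from the post; the `MAILING_LISTS` set and the idea of recognising a list by its recipient address are assumptions (a real setup might key on a list header instead).

```python
from email import message_from_string
from email.utils import getaddresses

MY_ADDRESS = "you@domain.com"
MAILING_LISTS = {"list@example.org"}  # hypothetical list addresses


def classify(raw_message: str) -> str:
    """Return "inbox" or "delete" by applying the three rules in order."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(headers)}

    if recipients & MAILING_LISTS:  # rule 1: known mailing lists
        return "inbox"
    if MY_ADDRESS in recipients:    # rule 2: addressed directly to you
        return "inbox"
    return "delete"                 # rule 3: everything else
```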
You'd be amazed how effective this is. Since setting this up, I only get maybe one spam message past this system every three or four months.
Mind you, I also have my email come in via Bigfoot, which has a pretty good spam filter itself. But this has nonetheless proven quite effective.
Re:Great use of p2p -- Wont work. (Score:5, Interesting)
It will however require them to send each specific message separately rather than sending large cc's or using some sort of relay. That alone is a big step since right now most spammers can get away with sending a single email message and relying on an open relay to retransmit to a larger group.
Furthermore, I doubt that this project will concern spammers for the time being. In fact, I am pretty sure spammers are not really interested in wasting their own time trying to spam people who consider spam a violation. It is more convenient to ignore those people (which is why they don't bother to check if you want spam or not before they send it to you).
DLG
Virus Detection (Score:5, Interesting)
One flaw, depending on your perspective... (Score:4, Interesting)
This was emailed to our real customers - our 'A list'. These are the people who get invited to these parties each time - people who come and enjoy the food and drinks, no strings attached.
And yet, technically, it *is* bulk email and, this first time, unsolicited. A very large percentage of the people responded enthusiastically that they want to remain on the list for this, but a few (8 out of 3500) asked to be removed from the list. One guy seemed annoyed and I typed him a personal apology. (In fact, I doubt that this guy read the email before sending off his remove request.)
What if that guy had submitted the email as spam to this system?
In that case, the rest would miss out on coming to a good party.
I hate spam as much as anyone on slashdot. I was asked to set up a bulk email and found that it could be done in a way that was not offensive in this case. Had it conflicted with my conscience, I would have refused.
Maybe the system needs some sort of moderation as a filter, too. At least that would allow valid bulk email to survive one trigger-happy end-user.
Ok, go ahead and tell me that I'm wrong in this...
Cheers,
Jim in Tokyo
Not necessarily such a Fabulous Idea! (Score:3, Interesting)
Many such tricks can be defeated by only hashing words that appear in some standard dictionary and discarding all else, such that
gets reduced to LIVE NAKED DRESSED GIRLS before hashing. Even then, the smart thing to do is not to block matching mail but to blackhole the sources of matching mail, preferably permanently. Humanity's more basic problem is the inability to cope with the concept of a world without scarcity. Would that technology fix that instead of providing the powerful with more ways to create unnatural scarcity.
-jhp
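The dictionary-only hashing idea in this comment could be sketched as follows. The four-word `DICTIONARY` is a tiny stand-in for a real word list, SHA-1 is an assumed hash, and the noisy inputs in the usage note are made up for illustration.

```python
import hashlib
import re

# Tiny stand-in for a real standard dictionary (assumption).
DICTIONARY = {"live", "naked", "dressed", "girls"}


def dictionary_signature(text: str) -> str:
    """Hash only the dictionary words of a message, discarding all else."""
    words = re.findall(r"[A-Za-z]+", text)
    kept = [w.upper() for w in words if w.lower() in DICTIONARY]
    return hashlib.sha1(" ".join(kept).encode("ascii")).hexdigest()
```

With this, `dictionary_signature("LIVE xq7z NAKED jjkw DRESSED GIRLS")` and `dictionary_signature("live naked zzqp dressed girls!!")` give the same value, since the random filler never reaches the hash. It does nothing against obfuscation inside the words themselves.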
Re:Great use of p2p -- Wont work. (Score:4, Interesting)
every time spam gets mentioned on slashdot, someone says this, and every time i respond with the work i've been doing:
pattern matching spam [blackant.net]
it uses word counts and phrase counts from known spam and known good mail to match against incoming mail. it requires a certain amount of known spam/not spam, but otherwise it has a good rate of matching spam/not spam and doesn't require the incoming mail to be known at all beforehand.
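As a rough illustration of scoring by word counts (single words only, no phrases), here is a sketch. It is not the blackant.net code; the tokenizer, the voting rule, and the 0.5 "no evidence" value are all assumptions.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages: list[str]) -> Counter:
    """Count word occurrences over a corpus of known spam or known good mail."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(tokens(message))
    return counts


def spam_score(message: str, spam_counts: Counter, ham_counts: Counter) -> float:
    """Fraction of decided words that are relatively more common in spam."""
    spam_total = sum(spam_counts.values()) or 1
    ham_total = sum(ham_counts.values()) or 1
    spam_votes = ham_votes = 0
    for word in tokens(message):
        s = spam_counts[word] / spam_total
        h = ham_counts[word] / ham_total
        if s > h:
            spam_votes += 1
        elif h > s:
            ham_votes += 1
    decided = spam_votes + ham_votes
    return spam_votes / decided if decided else 0.5  # 0.5 = no evidence
```

Because the score depends only on word statistics, a message never seen before can still be rated.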
Re:idea won't work if reaches critical mass (Score:3, Interesting)
This is how I would do it:
Strip HTML/markup language, so that we get plain text of the message.
Strip all "meaningless" characters from the text, keep only alphabetic (or alphanumeric) characters, no spaces or punctuation.
Uppercase everything.
:-)
We now have one string, with all the meaningful characters of the email, which makes it quite hard for spammers to vary much without mutilating the message they're trying to convey.
Pick 8 entry points in this string based on the occurrence of a number of well-chosen, predefined two-character combinations that are likely to be found in English text(*) - these need to be defined upfront. There are lots of texts available in the Gutenberg project to analyze to get to such a set.
This is hard: we need to find a good balance between physical location in the string and the occurrence of the combinations we have defined, so that we can take a broad "sample" of the text. Luckily for us, spammers tend to send long messages.
Now we compute the hash of the fragments, defined by our entry-points and a fixed length. These hashes combined provide a "real big signature" of the spam message. Pick the last two bytes of every hash, and stick them together for a "small signature" that can be used for searching/matching. We need to define our protocol for searching the catalogue in such a way that when a partial match is found using the small signature, we can retrieve the full signature to check further.
Based on this we have a rating from 0/8 -> 8/8 for the probability of a mail being a spam message. End user settings can define what is destined for the bitbucket, and what goes in your mailbox.
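A simplified sketch of the scheme above follows. The eight digrams, the 32-character fragment length, SHA-1, and taking the first occurrence of each digram as the entry point are all assumptions; the post leaves the entry-point balancing open, and this does not try to solve it.

```python
import hashlib
import re

DIGRAMS = ("TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT")  # assumed set
FRAGMENT_LEN = 32                                           # assumed length


def normalize(text: str) -> str:
    """Strip markup, keep only alphanumerics, uppercase everything."""
    text = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


def fragment_hashes(text: str) -> list:
    """The "big signature": one hash per entry point (None if digram absent)."""
    s = normalize(text)
    hashes = []
    for digram in DIGRAMS:
        pos = s.find(digram)  # entry point = first occurrence (simplification)
        if pos == -1:
            hashes.append(None)
        else:
            fragment = s[pos:pos + FRAGMENT_LEN]
            hashes.append(hashlib.sha1(fragment.encode("ascii")).digest())
    return hashes


def small_signature(hashes: list) -> bytes:
    """Last two bytes of every fragment hash, stuck together for lookup."""
    return b"".join(h[-2:] if h is not None else b"\x00\x00" for h in hashes)


def rating(a: list, b: list) -> int:
    """Number of matching fragments, i.e. the 0/8 to 8/8 score."""
    return sum(1 for x, y in zip(a, b) if x is not None and x == y)
```

A client would look up the small signature first, fetch the full signature on a partial match, and compare the `rating` against the user's own cutoff.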
In the end, spammers can (and will) try to circumvent these measures, but it would be hard and (hopefully) time-consuming, and it will require them to mutilate their messages to be undetected. Of course, this system only works properly when people are willing to submit spam fingerprints to the catalogue servers.
Anyway, that's my 0.02 EURO...
(*)Of course, English isn't the only language being used in spam, but I guess it's the most prevalent here. You can of course apply the same principle to any language. Heck, if you really want to push the envelope, you can try to detect the language (character frequency analysis and checking for very common words).
Re:Great use of p2p -- Wont work. (Score:5, Interesting)
I've been working on a similar project, but using additional factors that help identify spam, such as violations of the mail RFCs and other header indicators, in addition to NLP. I have a prototype that I'm using to score all of my inbox e-mail, and I am using that to tune the weight factors and add in new factors as I encounter them. I think it would be interesting to combine your approach with mine, since I hadn't thought of analyzing trigrams.
Anyway, if you are interested send me an e-mail and I'll give you my current perl code.
Not Gnutella-like at all; it's Napster-like. (Score:2, Interesting)
Keep your eyes peeled.
--jordan
Re:Fabulous Idea! (Score:2, Interesting)
However, I will recommend this software to my customers, so they can use it at their option. That way, they can do what they want. (And I don't get hit with a lawsuit on the off chance a very vital email gets blocked.)
Re:Some positivism and less bitching please... (Score:3, Interesting)
Just a random thought early on a Sunday morning...
Here's a Perl script for hitting spammers links (Score:1, Interesting)
Another spam system (Score:1, Interesting)
Basically, you have a bunch of trusted people who can add entries to the spam list. When they receive a spam, they forward it to a list, signing the message with pgp/gnupg. A perl engine will then verify the signature to know if the person is allowed to add/remove entries. Then it will fetch the From: header from the forwarded email and add it to a file which is available on the net. You just have to write your script to fetch the file every 10 minutes and add the content to your access list (postfix, sendmail, etc) with REJECT.
Scripts are also available for Gnus/Emacs, so you hit F1 and it will send the mail the way it should; announcing spam is one key away. It's important that announcing spam doesn't take time, or you won't do it, as you probably receive many per day.
You can also add [domain] in the subject line, which will add the whole domain from the From: header. The [rbl:IP] tag will add it to an RBL table.
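The step that turns a forwarded spam into an access-list line might look like the sketch below. The function name is hypothetical, the PGP check and the [rbl:IP] handling are left out entirely, and the output assumes the simple "pattern REJECT" form used in a Postfix-style access table.

```python
from email import message_from_string
from email.utils import parseaddr


def access_entry(forwarded_spam: str, subject: str = "") -> str:
    """Build one access-list line from the From: header of a forwarded spam.

    Signature verification of the reporter is assumed to have happened
    before this function is called.
    """
    msg = message_from_string(forwarded_spam)
    _, sender = parseaddr(msg.get("From", ""))
    sender = sender.lower()
    if not sender:
        raise ValueError("forwarded message has no usable From: header")
    if "[domain]" in subject.lower():
        # Block the whole domain instead of the single address.
        sender = sender.split("@", 1)[-1]
    return f"{sender} REJECT"
```

The fetch script on each mail server would then append these lines to its local access list and rebuild the lookup table.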
Take a look, it's cool.