Distributed Spam Detection
Posted by
CmdrTaco
on Sat Dec 01, 2001 12:20 PM
from the interesting-ideas dept.
A reader writes "There's an interesting project at SourceForge called "Vipul's Razor" that uses a Gnutella-like system to let users exchange spam "signatures" to filter spam. I work at an ISP in Ottawa; we have been using it for the last two weeks to stop the bulk of the spam coming to our POP3 accounts. More impressively, it hasn't tagged any valid mail as spam yet. Here's the scoop from its webpage:
"Vipul's Razor is a distributed, collaborative, spam detection and
filtering network. Razor establishes a distributed and constantly updating
catalogue of spam in propagation. This catalogue is used by clients to
filter out known spam. On receiving a spam, a Razor Reporting Agent (run
by an end-user or a troll box) calculates and submits a 20-character
unique identification of the spam (a SHA Digest) to its closest Razor
Catalogue Server. The Catalogue Server echos this signature to other
trusted servers after storing it in its database. Prior to manual
processing or transport-level reception, Razor Filtering Agents (end-users
and MTAs) check their incoming mail against a Catalogue Server and filter
out or deny transport in case of a signature match."" Cool idea. I'm up around 80% spam a day on my main mail account. Might be worth a try.
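For illustration, the report-and-check flow described on the project page could be sketched roughly as below. This is a toy model, not Razor's actual code or protocol: the `Catalogue` class and its method names are hypothetical, the "server" is just an in-memory set, and hashing the raw body with no preprocessing is an assumption. The one detail taken from the description is the 20-byte SHA-1 digest.

```python
import hashlib


def spam_signature(body: str) -> bytes:
    """Return the 20-byte SHA-1 digest of a message body."""
    return hashlib.sha1(body.encode("utf-8")).digest()


class Catalogue:
    """Toy in-memory stand-in for a catalogue server (hypothetical)."""

    def __init__(self):
        self._signatures = set()

    def report(self, body: str) -> None:
        # A reporting agent submits the signature, not the message itself.
        self._signatures.add(spam_signature(body))

    def is_known_spam(self, body: str) -> bool:
        # A filtering agent checks incoming mail against stored signatures.
        return spam_signature(body) in self._signatures


catalogue = Catalogue()
catalogue.report("Make money fast!!!")
print(catalogue.is_known_spam("Make money fast!!!"))
print(catalogue.is_known_spam("Lunch on Friday?"))
```

Only digests travel between agents and the catalogue, so a client can ask "is this spam?" without sending the message text anywhere.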
SpamBouncer (Score:5, Informative)
Great use of p2p (Score:5, Insightful)
Are there any other innovative non-piracy p2p apps out there that we should know about?
Re:Great use of p2p (Score:5, Informative)
Re:Great use of p2p -- Wont work. (Score:5, Interesting)
It will however require them to send each specific message separately rather than sending large cc's or using some sort of relay. That alone is a big step since right now most spammers can get away with sending a single email message and relying on an open relay to retransmit to a larger group.
Furthermore, I doubt that this project will concern spammers for the time being. In fact, I am pretty sure spammers are not really interested in wasting their own time trying to spam people who consider spam a violation. It is more convenient to ignore those people (which is why they don't bother to check if you want spam or not before they send it to you).
DLG
Some positivism and less bitching please... (Score:3, Funny)
There's no perfect solution for spam (aside from killing every single individual who dares to spam people, which unfortunately is still illegal).
Legislators are too busy removing our civil rights right now to make our lives better (as they should). So right now, I'd say, ANY technology helping us to reduce spam should be welcomed and helped in a productive way instead of being bashed without even being given a try. It's an open project, which means that if you can contribute in a POSITIVE way, you should. Otherwise, people, please don't discourage programmers working on something that could eventually turn out to be a very good solution.
Re:Some positivism and less bitching please... (Score:3, Interesting)
Just a random thought early on a Sunday morning...
Re:Great use of p2p -- Wont work. (Score:4, Interesting)
every time spam gets mentioned on slashdot, someone says this, and every time i respond with the work i've been doing-
pattern matching spam [blackant.net]
uses word counts and phrase counts from known spam and known good mail to match against incoming mail. requires a certain amount of known spam/not spam, but otherwise it has a good rate of matching spam/not spam and doesn't require the incoming mail to be known at all beforehand.
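A rough idea of what count-based matching can look like, as a sketch only. The linked project's real method isn't described here, so this swaps in a plain naive-Bayes-style log-likelihood score over single words with add-one smoothing; the function names, the tokenizer and the tiny training sets are all made up for the example.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages):
    """Count how often each word appears across a set of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokens(message))
    return counts


def spam_score(text, spam_counts, good_counts):
    """Sum of per-word log-likelihood ratios; positive leans toward spam."""
    spam_total = sum(spam_counts.values())
    good_total = sum(good_counts.values())
    vocab = len(set(spam_counts) | set(good_counts))
    score = 0.0
    for word in tokens(text):
        # Add-one smoothing so unseen words don't produce log(0).
        p_spam = (spam_counts[word] + 1) / (spam_total + vocab)
        p_good = (good_counts[word] + 1) / (good_total + vocab)
        score += math.log(p_spam / p_good)
    return score


spam_counts = train(["buy cheap pills now", "cheap loans buy now"])
good_counts = train(["meeting notes attached", "lunch tomorrow at noon"])
print(spam_score("buy cheap pills", spam_counts, good_counts))
print(spam_score("lunch meeting tomorrow", spam_counts, good_counts))
```

Phrase counts would work the same way with word pairs or triples as the tokens instead of single words.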
Re:Great use of p2p -- Wont work. (Score:5, Interesting)
I've been working on a similar project, but using additional factors that help identify spam, such as violations of the mail RFCs and other header indicators, in addition to NLP. I have a prototype that I'm using to score all of my inbox e-mail, and I'm using that to tune the weight factors and add new factors as I encounter them. It would be interesting to combine your approach with mine, I think, since I hadn't thought of analyzing trigrams.
Anyway, if you are interested send me an e-mail and I'll give you my current perl code.
So... (Score:5, Interesting)
It's a great initiative; I really hope no troll out there takes my word on this and actually does it.
Re:So... (Score:4, Insightful)
Re:So... (Score:4, Insightful)
well, i would have to disagree with you on this point.. i work at a web hosting company as the technical support manager, and handling abuse complaints falls into my realm of responsibility... and i have found that a significant number of first time spammers do not KNOW that spam is "wrong", and get quite upset that they were "taken" by companies that send bulk messages on their behalf. i had one gentleman send me an apology letter that actually made me feel sorry for him. he, and many other people on our network, have never been repeat spammers.
i know that there are many people out there who don't care, but we can't automatically assume that all spammers are evil. some of them are just ignorant.
Authentication with servers? (Score:5, Insightful)
Bogus hashes won't tag valid mail (Score:4, Informative)
Injecting random hashes into the network won't result in valid emails being tagged, but can flood/DOS the catalogue machines.
It would be possible to create hashes for a number of "probable" emails, but messages are so diverse that the chances of actually stopping a legitimate mail are quite slim.
Fabulous Idea! (Score:3, Interesting)
Not necessarily such a Fabulous Idea! (Score:3, Interesting)
Many such tricks can be defeated by only hashing words that appear in some standard dictionary and discarding all else, such that
gets reduced to LIVE NAKED DRESSED GIRLS before hashing. Even then, the smart thing to do is not to block matching mail but to blackhole the sources of matching mail, preferably permanently. Humanity's more basic problems are the inability to cope with the concept of a world without scarcity. Would that technology fix that instead of providing the powerful with more ways to create unnatural scarcity.-jhp
How about a server frontend approach? (Score:3, Insightful)
Nothing truly insightful here, just speculation from a convenience freak.
Fighting spam (Score:5, Informative)
SpamCop [spamcop.net] is a great service for reporting spam; just paste the spam message into the web form, and it'll automatically figure out where the spam came from and send complaints off to the appropriate people.
The Spam Bouncer [spambouncer.org] is a procmail-based personal spam screening tool. It's got some interesting features, but I haven't used it in a long while.
The way I avoid spam is to have my mail client screen out any email which contains any of these phrases:
to be removed
to be permanently removed
to get removed
to get off the list
to get off this list
to be taken off
to remove yourself
removal instructions
remove in subject line
"remove" in subject line
remove in the subject
"remove" in the subject
'remove' in the subject
S.1618
S. 1618
This list by itself catches about 80% of the spam I get.
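The same phrase list is easy to try outside a mail client. A minimal sketch, assuming case-insensitive matching and collapsing whitespace so a phrase broken across lines still matches (neither detail is stated in the post); `looks_like_spam` is just a name picked for the example.

```python
BLOCK_PHRASES = [
    "to be removed",
    "to be permanently removed",
    "to get removed",
    "to get off the list",
    "to get off this list",
    "to be taken off",
    "to remove yourself",
    "removal instructions",
    "remove in subject line",
    '"remove" in subject line',
    "remove in the subject",
    '"remove" in the subject',
    "'remove' in the subject",
    "s.1618",
    "s. 1618",
]


def looks_like_spam(message_text):
    """Return True if any blocked phrase occurs in the message text."""
    # Lowercase and collapse runs of whitespace (including newlines).
    text = " ".join(message_text.lower().split())
    return any(phrase in text for phrase in BLOCK_PHRASES)


print(looks_like_spam("Click here TO BE\nREMOVED from our list"))
print(looks_like_spam("See you at the meeting"))
```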
Re:Fighting spam (Score:2, Informative)
one time mailing
Re:Fighting spam (Score:2, Informative)
Um, are you on any legitimate mailing lists? Don't those get filtered out? I'd imagine half of Slashdot's readership is on one or more of the Linux development lists. I'm on Yahoo! Groups mailing lists for any number of different interests....
Foreign spam removal (Score:5, Informative)
For the many /.ers who:
a. Use Outlook secretly
b. Receive loads of foreign spam
c. Don't know any foreign languages
d. Don't have any foreign friends
e. Don't have any friends
This Outlook rule is for you!
Apply this rule after the message arrives
with
Ô or ¾ or Ç or or É or ½ or Í or ò or Ë or ® or Ä or ã or Ï or Ö or Ô in the subject or body
delete it
and stop processing more rules.
This blocks 99% of foreign spam [spamhaus.org]. Sue Mosher wrote about other effective methods [slipstick.com] for killing spam in Outlook. Finally, before you reply saying "You dummy, that filter works in any client!" -- You're right.
idea won't work if reaches critical mass (Score:4, Insightful)
To get around this, all a spammer has to do is change or add at least one character in each spam. This would make all the hashes unique, and no spam would be detected.
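The point is easy to check with a plain SHA-1 over the message text. The two sample strings are invented for the demonstration, and hashing the raw text with nothing stripped first is an assumption.

```python
import hashlib

original = "Buy now! Limited offer."
tweaked = "Buy now! Limited offer. "  # one trailing space added

digest_original = hashlib.sha1(original.encode("utf-8")).hexdigest()
digest_tweaked = hashlib.sha1(tweaked.encode("utf-8")).hexdigest()

print(digest_original)
print(digest_tweaked)
print(digest_original == digest_tweaked)
```

A catalogue holding only the first digest would not recognize the second message, even though a human would call them the same spam.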
Re:idea won't work if reaches critical mass (Score:3, Interesting)
This is how I would do it:
Strip HTML/markup language, so that we get plain text of the message.
Strip all "meaningless" characters from the text, keep only alphabetic (or alphanumeric) characters, no spaces or punctuation.
Uppercase everything.
:-)
We now have one string, with all the meaningful characters of the email, which makes it quite hard for spammers to vary much without mutilating the message they're trying to convey.
Pick 8 entry points in this string based on the occurrence of a number of well-chosen, predefined two-character combinations that are likely to be found in English text(*) - these need to be defined up front. There are lots of texts available from Project Gutenberg to analyze to arrive at such a set.
This is hard: we need to find a good balance between physical location in the string and the occurrence of the combinations we have defined, so that we can take a broad "sample" of the text. Luckily for us, spammers tend to send long messages.
Now we compute the hash of the fragments, defined by our entry-points and a fixed length. These hashes combined provide a "real big signature" of the spam message. Pick the last two bytes of every hash, and stick them together for a "small signature" that can be used for searching/matching. We need to define our protocol for searching the catalogue in such a way that when a partial match is found using the small signature, we can retrieve the full signature to check further.
Based on this we have a rating from 0/8 -> 8/8 for the probability of a mail being a spam message. End user settings can define what is destined for the bitbucket, and what goes in your mailbox.
In the end, spammers can (and will) try to circumvent these measures, but it would be hard and (hopefully) time-consuming, and it will require them to mutilate their messages to be undetected. Of course, this system only works properly when people are willing to submit spam fingerprints to the catalogue servers.
Anyway, that's my 0.02 EURO...
(*)Of course, English isn't the only language being used in spam, but I guess it's the most prevalent here. You can of course apply the same principle to any language. Heck, if you really want to push the envelope, you can try to detect the language (character frequency analysis and checking for very common words).
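The scheme above could be prototyped along these lines. Treat it as one possible reading, not a worked-out design: the bigram list, the 32-character fragment length, the crude HTML stripping and the rule for placing entry points (look for the i-th bigram from the i-th eighth of the string onward, falling back to the start) are all assumptions chosen to keep the sketch short.

```python
import hashlib
import re

BIGRAMS = ["TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT"]  # assumed set
FRAGMENT_LEN = 32  # assumed fixed fragment length


def normalize(message):
    """Strip markup, keep only ASCII letters and digits, uppercase."""
    text = re.sub(r"<[^>]*>", "", message)  # crude tag removal
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


def fragment_hashes(message):
    """Return one SHA-1 digest per entry point (None if no entry point)."""
    text = normalize(message)
    hashes = []
    for i, bigram in enumerate(BIGRAMS):
        # Spread entry points over the string, then fall back to the start.
        start = text.find(bigram, i * len(text) // len(BIGRAMS))
        if start == -1:
            start = text.find(bigram)
        if start == -1:
            hashes.append(None)
            continue
        fragment = text[start:start + FRAGMENT_LEN]
        hashes.append(hashlib.sha1(fragment.encode("ascii")).digest())
    return hashes


def small_signature(hashes):
    """Last two bytes of every fragment hash, for cheap lookups."""
    return [h[-2:] if h is not None else None for h in hashes]


def match_score(hashes_a, hashes_b):
    """Number of fragment hashes (0 to 8) that agree between two messages."""
    return sum(
        1 for a, b in zip(hashes_a, hashes_b) if a is not None and a == b
    )


spam_1 = ("<b>Dear friend</b>, earn money at home in the internet business! "
          "There is another thing: reply on time.")
spam_2 = ("dear FRIEND... earn money at home, in the internet business? "
          "There is another thing - reply on time!!!")
print(match_score(fragment_hashes(spam_1), fragment_hashes(spam_2)))
print(match_score(fragment_hashes(spam_1), fragment_hashes("Lunch tomorrow?")))
```

Because markup, punctuation, spacing and case are thrown away before hashing, the two sample spams reduce to the same string. A message that differs only in some places would keep the fragments that fall outside the changed region, which is where the 0/8 to 8/8 rating comes from.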
Yes I've posted this before but (Score:3, Interesting)
http://goto.com
and do a search for "bulk email". Each link you click will cost the scumbags that sell spam software or spamming services several dollars.
Also, I love this new technology. I wish all ISPs would use it
and for more spam fighting ideas please check out
http://www.lenny.com/spam
Re:Yes I've posted this before but (Score:2)
Now I wonder whether they have any limitations for hits from a given IP address? One little perl script could put some of those companies out of business otherwise....
[TMB]
there are some scripts (Score:3, Informative)
http://www.lenny.com/spam
How do you compute a signature? (Score:5, Informative)
This could work very well, but we need some way of computing signatures which will be invariant across different copies of personalized spam for this to be effective.
Re:How do you compute a signature? (Score:2)
Open for abuse? (Score:2, Insightful)
Re: Distributed spam filter (Score:3, Insightful)
You can tell if the same email has been sent to hundreds of people (and if you use hashes, you can do that without revealing the email)
You can click a "this is spam" button when you read an email, and anyone who trusts you (i.e. has your public key in their "trusted filtering friends" list) can look for similar messages and filter them.
But, there do seem to be a load of problems:
- Personalised email, as someone already mentioned
- Privacy problems with letting others into the secrets of your mailbox
- If you have the original of a message, you can calculate the hash, then see who else got the message (i.e. works for personal mail as well as spam)
- Relatively easy for malicious users to wrongly label someone as a spammer
Well worth investigating, though...
SpamAssassin uses Razor (Score:5, Informative)
Sounds tres cool (Score:2)
One question about this system that I hope the poster (or someone else using this system) will answer: what's it like on server load? Right now, at the ISP I work at, we're using procmail to filter for spam (check the graphs here: http://selenium.dowco.com/spam/spam.html [dowco.com]). It's a good way of doing things, but there are some shortcomings: basically, since it runs on our mailserver, I can't run all the body searches I want; in fact, we had to cut out body searches recently because the load was getting too high and/or email was taking too long to get through. There are some workarounds that I haven't got around to putting in yet (body scanning only when 3k in size, etc), but you can see my point. Anyone?
brightmail? (Score:2)
This is just a temporary solution. (Score:5, Informative)
This is probably a 'fuzzy' hash function that should ignore minute variations. However, it goes without saying that if this hash-based spam filter becomes widespread, then the spammers will simply figure out how to hash-bust their way past it.
To have any hope of working over the long term, this kind of an approach must include the ability to distribute not just the hashes themselves, but the hash function as well, so that the hash function itself can be adjusted, when needed.
Heh, interesting idea (Score:2)
The big problem with that is, well, it's not easy
One way around potential abuse. (Score:5, Insightful)
The thought goes like this.
A person submits a signature of "identified" spam mail to a "supernode" for ex. and the submission gets a ranking of 1. Each additional submission (by other users) increases the score by a number.
This way, there are several classifications which could be used to filter incoming mail. For the mail providers, they could opt for only removing mail matching signatures with a very high score (thus very likely these will be actual spam) or they could filter anything reported.
The purpose of allowing the use of classifications is that it will take longer to reach higher scores, since more people have to report the specific spam mail. Some people wish to eliminate anything the least bit suspect, but mileage may vary.
Do you see a resemblance with the
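The ranking idea in this comment can be sketched as a counter per signature with a caller-chosen threshold. The class and method names are hypothetical, and counting each reporter only once per signature is an added assumption so that one user can't push a score up alone.

```python
from collections import defaultdict


class ScoredCatalogue:
    """Toy catalogue that ranks a signature by its distinct reporters."""

    def __init__(self):
        self._reporters = defaultdict(set)

    def report(self, signature, reporter_id):
        # Repeat reports from the same reporter do not raise the score.
        self._reporters[signature].add(reporter_id)

    def score(self, signature):
        return len(self._reporters.get(signature, ()))

    def should_filter(self, signature, threshold):
        """Filter only when enough different users have reported it."""
        return self.score(signature) >= threshold


catalogue = ScoredCatalogue()
for user in ["alice", "bob", "alice", "carol"]:
    catalogue.report("sig-123", user)

print(catalogue.score("sig-123"))
print(catalogue.should_filter("sig-123", threshold=1))
print(catalogue.should_filter("sig-123", threshold=10))
```

A cautious mail provider would pick a high threshold, while someone who wants anything reported gone would use a threshold of 1.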
Re:One way around potential abuse. (Score:3, Informative)
Mailwasher (Score:3, Informative)
X-YahooFilteredBulk (Score:4, Informative)
Unfortunately, a lot of anti-spam measures (including Exim 3's system filters) only take place after a message has been accepted for delivery. For me, this results in a lot of bounce messages frozen in the queue because they cannot be returned (Hotmail mailbox full, etc). I've switched on features like verifying the sender and the headers, but this doesn't catch them all, and in some cases might even stop some legitimate mail (one of my mailing lists uses incorrect syntax for the "RCPT TO:").
More effective anti-spam systems need to filter before the message has been accepted. If you wait until after that, it is already too late: the message is on your system. No, refusing to accept delivery is much more effective IMHO, and forces the MTAs further up the chain to deal with it. They shouldn't have accepted it in the first place! When you get spam, return 550 (or whatever the code is) and let the SMTP client deal with it. In an ideal world, every provider (ISP, or free service like Yahoo) will implement stricter MTAs. If the spam rejection can be pushed far enough up the chain, life for everyone will be easier.
BTW, according to Philip Hazel (in a reply I received to a question I posed on the Exim mailing list), Exim 4 will offer much more functionality along these lines, including the invocation of C functions after the DATA phase of the SMTP input. I guess this would be the spot to plug in Vipul's Razor, although I don't know what kind of performance hit that would lead to. Mr. Hazel also pointed out that some stupid clients are in contravention of the RFC and will continue to try to deliver a message if they receive a 5xx after the DATA phase... oh well: they'll be using my bandwidth but they won't be putting any crap on my server.
a good idea, but... (Score:3, Interesting)
What stops the spammer from including a unique identifier in each e-mail (such as a count variable), changing the SHA for each e-mail that goes out?
Just a thought...
I've managed to filter most spam (Score:3, Interesting)
The key to this method is to realize that most spam has a spoofed "To" address -- RARELY is it addressed directly to you. If you dig in the message headers, you'll usually find it was mailed (or CC'd) to a whole bunch of people at once, for obvious reasons. So you set up your mail filters thusly:
First, set up a filter allowing any "legal" mailing lists you're on to go to your Inbox.
Next, a filter to allow any mail sent directly to you (i.e. you@domain.com is in the To or CC lines) to go to your Inbox.
Finally, a filter that deletes everything else.
You'd be amazed how effective this is. Since setting this up, I only get maybe one spam message past this system every three or four months.
Mind you, I also have my email come in via Bigfoot, which has a pretty good spam filter itself. But this has nonetheless proven quite effective.
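Those three rules translate fairly directly into code. A sketch using Python's standard `email` package; recognizing a mailing list by its address appearing in To or CC is an assumption (real lists may need other headers), and `classify` and the sample addresses are made up for the example.

```python
from email import message_from_string
from email.utils import getaddresses


def classify(raw_message, my_address, list_addresses):
    """Return "inbox" or "delete" following the three rules in order."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(headers)}

    # Rule 1: mail for a known mailing list goes to the inbox.
    if recipients & {addr.lower() for addr in list_addresses}:
        return "inbox"
    # Rule 2: mail addressed directly to me (To or CC) goes to the inbox.
    if my_address.lower() in recipients:
        return "inbox"
    # Rule 3: everything else is deleted.
    return "delete"


direct = "From: a@example.org\nTo: you@domain.com\nSubject: hi\n\nHello"
bulk = "From: x@example.net\nTo: undisclosed@example.net\nSubject: $$$\n\nBuy"
print(classify(direct, "you@domain.com", ["some-list@example.org"]))
print(classify(bulk, "you@domain.com", ["some-list@example.org"]))
```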
Virus Detection (Score:5, Interesting)
One flaw, depending on your perspective... (Score:4, Interesting)
This was emailed to our real customers - our 'A list'. These are the people who get invited to these parties each time - people who come and enjoy the food and drinks, no strings attached.
And yet, technically, it *is* bulk email and, this first time, unsolicited. A very large percentage of the people responded enthusiastically that they want to remain on the list for this, but a few (8 out of 3500) asked to be removed from the list. One guy seemed annoyed and I typed him a personal apology. (In fact, I doubt that this guy read the email before sending off his remove request.)
What if that guy had submitted the email as spam to this system?
In that case, the rest would miss out on coming to a good party.
I hate spam as much as anyone on slashdot. I was asked to set up a bulk email and found that it could be done in a way that was not offensive in this case. Had it conflicted with my conscience, I would have refused.
Maybe the system needs some sort of moderation as a filter, too. At least that would allow valid bulk email to survive one trigger-happy end-user.
Ok, go ahead and tell me that I'm wrong in this...
Cheers,
Jim in Tokyo
List of server-based spam filter systems (Score:5, Funny)
another effective spam stopping method ? (Score:3, Insightful)
I receive too many real messages to try this out, and I think most spammers won't bother to actually remove an email address from their database if it doesn't exist. But has someone else tried this with any luck?
This p2p spam filter sounds really nice and I'm going to give it a try asap. I already "lost" another mail account in the flood of spam I got on it, so now it forwards all messages to msnbill@microsoft.com (microsoft domain billing address).
The death of SpamCop (Score:3, Informative)
Recently, though, SpamCop switched to a heuristic spam-filter, which is quite leaky. Not only does spam get through, messages from well-known viruses come through. It stops maybe half the spam now.
So SpamCop is now no more effective than typical procmail filters. So there's no point in paying for SpamCop service any more.
Anyone know of a good challenge/response alternative to SpamCop?
Answers to some questions raised on slashdot. (Score:5, Informative)
Some of you point out that Razor's use of SHA-1 signatures can be defeated by introducing randomness in the message. This is true; SHA-1 will eventually be phased out and replaced by a fuzzy hashing mechanism like nilsimsa in the future. [http://lexx.shinn.net/cmeclax/nilsimsa.html] [http://www.geocrawler.com/archives/3/2539/2001/7/0/6173567/]
The protocol is structured to aid change of
hashing algorithms seamlessly, without breaking
the existing system.
Regarding the possibility of poisoning the database, we are working on a reputation system
that will assign credit to honest reporters.
Once we have a critical mass of users, it would
be hard for dishonest reporters to even join
the reporting network, much less be able to
mount a DOS attack.
Some of these issues have been discussed on the
razor-users mailing list. The list archives
are located at
[http://www.geocrawler.com/archives/3/2539/2001/]
best,
vipul.
Re:Stopping bogus entries? (Score:2, Informative)
Re:Stopping bogus entries? (Score:3, Informative)
.derf
Re:Stopping bogus entries? (Score:2, Informative)
Re:spammer said stopping spam in un-american. (Score:3, Informative)
http://www.bbbsouthland.org/topic110.html [bbbsouthland.org]
for more information.