Response to Gordon Cormack's Study of Spam Detection (229 comments)

Posted by michael on Thursday June 24, 2004 @11:25AM from the even-stephen dept.

Nuclear Elephant writes "In light of Gordon Cormack's Study of Spam Detection recently posted on Slashdot, I felt compelled to architect an appropriate response to Cormack's technical errors in testing which ultimately explain why one of the world's most accurate spam filters (CRM114) could possibly end up at the bottom of the list, underneath SpamAssassin. I spend some time explaining what is a correct test process and keep my grievances simplified about the shortcomings of Cormack's research."

This discussion has been archived. No new comments can be posted.

How I do (Score:5, Interesting)

by mirko ( 198274 ) writes: on Thursday June 24, 2004 @11:32AM (#9518700) Journal

I set up many aliases for my official email address and gave them to spammers, and only to spammers.
So whenever I get a mail that is more than 95% similar to one I know is spam, I dump it.
Combine this with Apple's Mail.app Bayesian filter and only a few spams may be left.
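
As a rough illustration of the "95% similar" rule above, here is a minimal Python sketch. It assumes similarity is measured with the standard library's `difflib.SequenceMatcher` on message bodies; the post does not say how similarity was actually computed, and the function name and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def is_near_duplicate(body, known_spam, threshold=0.95):
    """Return True if body is at least `threshold` similar to any known spam body."""
    for spam in known_spam:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        if SequenceMatcher(None, body, spam).ratio() >= threshold:
            return True
    return False


# Hypothetical bodies collected from the spam-only aliases (assumption).
known = ["Buy cheap pills now, limited offer!"]

print(is_near_duplicate("Buy cheap pills now, limited offer!", known))
print(is_near_duplicate("Lunch at noon tomorrow?", known))
```

Comparing every incoming mail against every stored spam gets slow as the collection grows, so a real setup would probably hash or index the known spam first.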

- Re:How I do (Score:2)
  
  by lukewarmfusion ( 726141 ) writes:
  
  My mail provider has some filtering software which lets me customize the threshold that I want filtered. On my side, Thunderbird has filters. Finally, I use a wildcard email that forwards to my actual address. When I use my email somewhere that might spam me, I simply describe the potential spammer like this:
  
  slashdotspam@mydomain.com
  
  If I get mail from there, I know how it came in. The combination of all these keeps my total spam to probably three or four a week.
- Re:How I do (Score:4, Informative)
  
  by julesh ( 229690 ) writes: on Thursday June 24, 2004 @12:35PM (#9519448)
  
  Mail.app's filter isn't Bayesian. Please see previous slashdot article on how it works (I'm too lazy to find the reference right now).
  
Excellent review (Score:5, Informative)

by XMichael ( 563651 ) writes: on Thursday June 24, 2004 @11:33AM (#9518706) Homepage Journal

On the original forum, I was saying something similar (except not nearly as well written!! hehe)

DSPAM, IMHO, provides far better results than this report would lead you to believe. A Bayes filter properly trained by a somewhat intelligent person provides simply amazing results. I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!

DSPAM using a Bayes algorithm is by far the best filtering method I've used. And I've used a lot! (From SpamAssassin to SpamProbe and everything in between.) The only setback: DSPAM takes a couple of weeks to train...

Priceless Photos [pricelessphotos.org]

- False positives. (Score:3, Informative)
  
  by Christopher Thomas ( 11717 ) writes:
  
  I swear I can go weeks on end without a single spam getting through, no false positives -- and between 20 and 100 SPAM in my "spam" box per day!
  
  This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.
  
  I have a whitelist at work, and it works quite
  - Re:False positives. (Score:2, Insightful)
    
    by Donny Smith ( 567043 ) writes:
    
    Exactly - what's the point if you have to re-check it anyway?
    That is the main reason I don't use any spam filters.
    
    Without a filter I can check emails as they come rather than create myself a "homework" of having to check 50 messages at once...
    - Re:False positives. (Score:2)
      
      by juhaz ( 110830 ) writes:
      
      Checking emails as they come takes more time than quickly scanning over 50 messages at the end of day.
      - Re:False positives. (Score:3, Insightful)
        
        by SpaceLifeForm ( 228190 ) writes:
        
        No, you can scan your spam folder in seconds, because you will recognise the subject lines. The duration is not comparable. When you have a folder for spam, any non-spam sticks out, but if you need to think looking at alternating spam and non-spam messages, you spend more time thinking.
    - - Re:False positives. (Score:2)
        
        by cardshark2001 ( 444650 ) writes:
        
        The bayesian filters integrated into Mozilla-mail are not very effective. It only gets about 50% and that is after months of training
        I had the foresight to save all my junk mail, about 3 years ago. I used it to train the filters when I switched to mozilla, and I hit the ground running with about an 80% rate (trained with about 5000 spam mails and about 800 real mails).
        Since then, I have given my email address to any site that asks for it, because I figure the more spam, the better for my filter. This ha
  - Re:False positives. (Score:2)
    
    by Xentax ( 201517 ) writes:
    
    I've used a Bayesian plugin before that let you set thresholds - so a certain score would be marked as "probably good" and be left in your inbox, a range would be set as "probably spam" and put in a possible junk folder, and beyond that was "definitely" spam and went in a spam or trash folder.
    
    It defaulted to like 10/90 (I don't remember which score was more spamlike, so imagine less than 10 was almost certainly ok, and greater than 90 was almost certainly spam) - I set it much lower for awhile (50) until I
    - Re:False positives. (Score:2)
      
      by mAineAc ( 580334 ) writes:
      
      It seems to me that, with the way spam is nowadays, we are taking the wrong approach. Why don't they make a filter that assumes everything is spam and then just filters out the good email? I realize that a whitelist does just this with the email address, but to take it a step further, look at the content and filter according to words or phrases you want to see, like 'the kids are doing great.' I can't see this getting any more false positives than what we are using now :)
      - Re:False positives. (Score:3, Informative)
        
        by Xentax ( 201517 ) writes:
        
        This *is* already done - statistical filters are trained on both words that are 'spamlike' (words that show up only, or mostly, in lots of email marked by the user as spam), and words that are NOT (words that show up only, or mostly, in email marked not spam).
        
        This is (AFAIK) done against tokens in both the mail body and the headers, which pays dividends if the delivery paths are clustered (for example, if your whole family has accounts with MyISP.com, you'll probably get good filtering provided the spam is
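
The two-sided training described in the comment above can be sketched as follows. This is a minimal naive Bayes illustration in Python, assuming whitespace tokenization of the body only and add-one smoothing; it is not the tokenizer or scoring formula of any particular filter, and all names and sample messages are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """Count, per token, how many spam and how many ham messages contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        tokens = set(text.lower().split())
        if is_spam:
            spam_counts.update(tokens)
            n_spam += 1
        else:
            ham_counts.update(tokens)
            n_ham += 1
    return spam_counts, ham_counts, n_spam, n_ham


def spam_score(text, model):
    """Naive Bayes log-odds with add-one smoothing; above 0 leans spam, below 0 leans ham."""
    spam_counts, ham_counts, n_spam, n_ham = model
    score = math.log((n_spam + 1) / (n_ham + 1))
    for tok in set(text.lower().split()):
        p_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical user-labelled mail (assumption).
model = train([
    ("cheap pills online", True),
    ("cheap watches online", True),
    ("the kids are doing great", False),
    ("meeting notes from the kids school", False),
])

print(spam_score("cheap pills", model))
print(spam_score("the kids are great", model))
```

Words seen mostly in ham pull the score down just as spam words push it up, which is the "filter in the good mail" behaviour the parent comment asks for.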
  - Re:False positives. (Score:2)
    
    by fyonn ( 115426 ) writes:
    
    This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand, which for the most part defeats the purpose of a spam filter. If you don't do so, then you can't claim zero false positives - you can only claim that you haven't _noticed_ any false positives.
    
    I file spam in a spam box as I can easily scan across the contents in 10 seconds and hit delete before I go to bed, as opposed to the distraction when an email arrives and you go to check it i
  - Re:False positives. (Score:2)
    
    by firewood ( 41230 ) writes:
    
    This is what I don't get - in order to be sure you have no false positives, you have to comb through all of the spam by hand,
    True. But you can check if your false positive rate is low enough by statistical sampling.
    So once every few days I scan through a thousand or so items marked as spam by procmail. As long as I continue to find 0 or 1 false positives (which I add to my whitelists), I consider my filters 99.9% good. That error rate is probably better than my own human error rate for misfiling and/or
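
The sampling argument above can be made a little more precise. The Python sketch below computes an exact one-sided binomial upper bound on the misfiling rate given `k` misfiled items found in a random sample of `n`; the 95% confidence level and the function names are assumptions, not something the comment specifies. Note that the bound applies to the fraction of the spam folder that is really legitimate mail, and only if the sample is random.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def misfile_rate_upper_bound(k, n, alpha=0.05):
    """One-sided (1 - alpha) upper bound on the true rate, given k hits in n samples."""
    lo, hi = 0.0, 1.0
    # Bisection: the CDF decreases as p grows, so find where it crosses alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


for found in (0, 1):
    print(found, misfile_rate_upper_bound(found, 1000))
```

With zero misfiled items in a sample of 1000, the bound comes out near 0.3% (the familiar "rule of three", 3/n), so "99.9% good" is slightly on the optimistic side, though the order of magnitude holds.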
- Re:Excellent review (Score:2, Insightful)
  
  by mev ( 36558 ) writes:
  
  Unfortunately it seems like the author is too intent on slamming Cormack for his review to fit my description of an "Excellent Review". I wish he had toned this down as he could still have delivered the same technical message in a more credible fashion.
  
  "Excellent counterattack" might be more fitting.
Studies create discussion (Score:5, Insightful)

by Timesprout ( 579035 ) writes: on Thursday June 24, 2004 @11:39AM (#9518780)

I usually frown when I see many of these so-called studies offering conclusions, several of which differ radically from my own experience. The recent Java/C++ performance one was a classic example. It gets annoying when a pro-MS result is immediately decried as marketing FUD because it just can't be better, while a pro-Linux result is taken as gospel truth here on /. Usually I tend to take all results with a grain of salt, or just plain ignore them and focus on the debate around them.

The benefit of these studies, though, is that, fanatical crap aside, informed people will usually take the time to interpret results or suggest corrections/improvements that actually benefit developers and improve their knowledge base more than any information provided by the actual study.

- Re:Studies create discussion (Score:2, Insightful)
  
  by dasmegabyte ( 267018 ) writes:
  
  But the purpose of studies is to offer insight into the best tools for a specific set of dependencies. Decry the dependencies, and you're essentially eliminating the purpose of the study.
  
  For example, I am working on a GIS application. I looked at offerings from ArcView and MapInfo and found that while they do what I need to do out of the box, they are quite expensive and required a license for every seat of my application. So I looked to Open Source. There I found hundreds of tools, none of which did w
  - Re:Studies create discussion (Score:5, Insightful)
    
    by killjoe ( 766577 ) writes: on Thursday June 24, 2004 @01:59PM (#9520438)
    
    "This is defeatest bullshit. Ignoring your problems doesn't make them go away. "
    
    You miss an important point. This is not "our" problem, it's YOUR problem. I don't need a GIS program and neither do millions of other people. YOU need one, and too bad for you they cost tens of thousands of dollars. You have no right to complain that somebody else hasn't taken the time and effort required to give you a free equivalent.
    
    What you need to understand is that open source is nothing but scratching an itch. This is your itch and you need to scratch it.
    
    OPEN SOURCE ONLY WORKS IF PEOPLE CONTRIBUTE. This very simple and obvious point seems to be lost on most people. You are not supposed to sit around till somebody else does the work and give you something for nothing. You need to contribute.
    
    You need to start an organization and start raising money to fund an open source development effort or to accelerate an existing one. You need to get involved and contribute. BTW bitching on slashdot does not count as contributing.
    
    "This is like blaming McDonalds for your big, fat ass, or blaming Microsoft because you got a virus when you didn't run the patch they released to prevent it."
    
    Or blaming the open source community because they didn't give you something for free.
    
    - Re:Studies create discussion (Score:2)
      
      by Brandybuck ( 704397 ) writes:
      
      You are not supposed to sit around till somebody else does the work and give you something for nothing. You need to contribute.
      
      Which is why Open Source will probably always be for developers by developers. Unless of course the non-developing users decide to contribute cash...
      
      It's sort of like public television. You can sit around and watch it for free, or you can donate and help other people watch it for free. "Your generous donations will make this software free-beer for everyone!"
    - Hello? (Score:2)
      
      by Ayanami Rei ( 621112 ) * writes:
      
      He said he wasn't an expert. So of course he'd be forced to make that conclusion. He cannot scratch his itch because he cannot reach it.
      
      This is the kind of response he was talking about that does no good. Rather, you should acknowledge that the area is weak and that more focus needs to be given there in the future.
      
      (Incidentally, I'm interested in OSS in the GIS field. Any ideas/good pointers? Anyone?)
      - Re:Hello? (Score:3, Insightful)
        
        by killjoe ( 766577 ) writes:
        
        "He cannot scratch his itch because he cannot reach it."
        
        You don't have to be a developer. As I said you can start a campaign to ask for donations, you can write letters to companies asking for sponsorship, you can donate some of your own money, you can try to get like minded individuals together to solve the problem.
        
        OPEN SOURCE DOES NOT WORK UNLESS YOU CONTRIBUTE.
        
        " Rather, you should acknowledge that the area is weak and that more focus needs to be given there in the future."
        
        More focus needs to be give
You don't like my software so I'll flame you (Score:2, Insightful)

by ifreakshow ( 613584 ) * writes:

This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article. I'd like to know what makes his opinion any more valid than Gordon's.

Information on his professional career was very hard to find on the site.

This just seems like a flame because his software(dspam) didn't perform well in the test.
- Re:You don't like my software so I'll flame you (Score:2)
  
  by arkanes ( 521690 ) writes:
  
  I have to agree that the article has a very put-out and almost bitter feel to it, which makes me less inclined to take it seriously. That said, there are perfectly valid criticisms in it. For example, not releasing the configuration data is clearly improper. Testing the accuracy of the filters against SpamAssassin is totally incorrect methodology! It looks good to apply the filter to such a huge body of email, but a smaller set would have made it much easier to validate the results. Misconfiguration of the
- Re:You don't like my software so I'll flame you (Score:3, Insightful)
  
  by Threni ( 635302 ) writes:
  
  > This guy seems a little harsh and just a bit jealous of the success of Gordon
  > Cormack's article.
  
  Articles aren't 'successful' - they're either useful, or they're just fun to read. Perhaps his is the latter.
  
  From the response:
  ---
  It turned out that Cormack was using the wrong flags, didn't understand how to train correctly, and seemed very reluctant to fully read the documentation. I don't mean to ride on Cormack, but proper testing requires a significant amount of research, and research seems to be
- Re:You don't like my software so I'll flame you (Score:5, Insightful)
  
  by Otter ( 3800 ) writes: on Thursday June 24, 2004 @11:52AM (#9518934) Journal
  There are some technical objections in there (old versions of software, the fact that Spam Assassin was tested with a spam collection generated by spam assassin). But honestly, after wading through all the whining and sneering, I didn't have the energy to pick the points out of the overall flow.
  Jonathan, next time:
  
  Start by summarizing your technical objections.
  
  Continue by detailing your technical objections.
  
  Leave the nasty rants to the end, or better yet, leave them out entirely.
  
  Stop talking about "geeks" in every paragraph.
  
  Please stop referring to spam filter comparisons as "science".
  - Re:You don't like my software so I'll flame you (Score:3, Insightful)
    
    by ComputerSlicer23 ( 516509 ) writes:
    
    Please stop referring to spam filter comparisons as "science".
    
    I believe the author of the article would have two issues with that assertion.
    First off, you can have science about how fast grass grows. You have science about how many sexual partners a person has. You have science about how to manipulate people with irrational arguments. Science can be applied to anything that you apply scientific principles to. Science, in a lot of ways, is merely a matter of measuring in a controlled manner and then
- Re:You don't like my software so I'll flame you (Score:5, Insightful)
  
  by pclminion ( 145572 ) writes: on Thursday June 24, 2004 @11:56AM (#9518967)
  
  This guy seems a little harsh and just a bit jealous of the success of Gordon Cormack's article.
  Let me explain why he's irritated, as somebody who has conducted spam filter statistical tests and made publications on the topic.
  Yes, it is irritating when somebody demonstrates that his method is better than yours. However, most researchers are able to accept this, and continue improving their own work.
  However, what is far more irritating (by an order of magnitude at least) is when somebody "demonstrates" the inferiority of your work, and they do so in a completely scientifically bogus way.
  Let me give a concrete example. Suppose you were Galileo. You have just put forth the postulate that all objects fall at the same speed regardless of mass. A "debunker" attempts to demonstrate that this isn't true by dropping an iron ball and a feather. Obviously, the feather falls much more slowly.
  "Ha ha, neener, neener!" cries the debunker. Of course, Galileo knows his method is flawed. If people actually listen to this supposed debunker, Galileo might become very, very irritated indeed.
  
  - - Re:why can't we all just get along? (Score:2)
      
      by pclminion ( 145572 ) writes:
      
      But no, lashing out is not adult, it is not constructive.
      True, but neither is ignoring his points simply because he had some attitude.
      I do think he handled the stress a little poorly.
- Re:You don't like my software so I'll flame you (Score:4, Interesting)
  
  by julesh ( 229690 ) writes: on Thursday June 24, 2004 @12:37PM (#9519474)
  
  He made a few very good points, but the overall tone was a little too ranty.
  
  This was the most important point, I think, and was buried 2/3rds of the way down:
  
  The emails being 8 months old, heuristic rules were clearly updated during this time to detect spams from the past eight months. The tests perform no analysis of how well SpamAssassin would do up against emails received the next day, or the next eight months. Essentially, by the time the tests were performed, SpamAssassin had already been told (by a programmer) to watch for these spams. [...] What good is a test to detect spam filter accuracy when the filter has clearly been programmed to detect its test set?
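
The quoted objection amounts to saying that a filter should only be scored on mail that arrived after its rules were last updated. A minimal Python sketch of that kind of chronological evaluation follows; the cutoff date, the message fields, and the toy classifier are all hypothetical and are not taken from either Cormack's study or the response.

```python
from datetime import date


def evaluation_set(messages, rules_updated_on):
    """Keep only messages received after the filter's rules were last updated."""
    return [m for m in messages if m["received"] > rules_updated_on]


def accuracy(messages, classify):
    """Fraction of messages whose predicted label matches the true label."""
    if not messages:
        return None
    correct = sum(1 for m in messages if classify(m["text"]) == m["is_spam"])
    return correct / len(messages)


# Hypothetical corpus: one old spam the rules already know, two newer messages.
corpus = [
    {"received": date(2003, 9, 1), "text": "cheap pills", "is_spam": True},
    {"received": date(2004, 5, 2), "text": "ch3ap p1lls", "is_spam": True},
    {"received": date(2004, 5, 3), "text": "lunch tomorrow?", "is_spam": False},
]
rules_updated_on = date(2004, 4, 1)  # assumed release date of the rule set


def toy_classify(text):
    """Stand-in for a hand-written rule that was tuned on the old spam."""
    return "cheap" in text


unseen = evaluation_set(corpus, rules_updated_on)
print(accuracy(corpus, toy_classify), accuracy(unseen, toy_classify))
```

In this toy case the rule looks better on the full corpus than on the mail it had never seen, which is the effect the quoted passage is worried about.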
  
Spamassasin is good but not that good... (Score:5, Informative)

by Shoeler ( 180797 ) writes: on Thursday June 24, 2004 @11:41AM (#9518818)

For any users of spamassassin's 2.x branch (2.63 is current as of this writing), we all know how dated its signatures are right now. When the 2.6 branch was first released, I got zero spam and 100% ham for the first few weeks. Now that 3.x is being integrated as an ASF and being apache-ized, updates have been slow and 3.x is still awaiting deployment.

Point being - I was darn surprised to see SA at the top of his charts.

Now - if only mimedefang would easily use another spam-checker....

- Re:Spamassasin is good but not that good... (Score:2)
  
  by julesh ( 229690 ) writes:
  
  Well, of course it was. As stated in the article, he was using the latest version of SA to classify mail that was up to 8 months old. I'd expect it to be pretty close to perfect on that. It's just current stuff it ain't so hot on.
  - Re:Spamassasin is good but not that good... (Score:2)
    
    by iserlohn ( 49556 ) writes:
    
    SA gets a bad rap because it works even when the bayesian filter isn't activated. This leads to horrible results.
    
    We deployed SA on our own internal MX and we have over 99% accuracy over the past 3 months. Although the bayes filter is primitive compared to what other advanced filters are doing, with enough training and a bigger token DB, SA works very very well. Couple that with network checks (ie, Razor2, Pyzor, DCC) and the system is comparable to the best statistical filters.
Just read it - (Score:2, Informative)

by calebb ( 685461 ) * writes:

I just read the whole article - it does repeat itself a few times, but the author provides additional evidence each time his theses were reiterated:

1. Cormack is very inexperienced in the area of statistical filtering. Agreed!!!
2. Cormack went into the testing with many presuppositions. Also Agreed!!

And in case you're not familiar with the word presupposition:
1. To believe or suppose in advance.
2. To require or involve necessarily as an antecedent condition.

Overall, this is a very good articl
- Re:Just read it - (Score:4, Informative)
  
  by Henry Stern ( 30869 ) writes: <henry@stern.ca> on Thursday June 24, 2004 @02:03PM (#9520473) Homepage
  
  1. Cormack is very inexperienced in the area of statistical filtering.
  
  Disagreed. Gordon Cormack has been doing information retrieval for 20 years. He is fairly well known in the area. See his publication history at DBLP [uni-trier.de].
  
  A far more likely conclusion about what's going on here is that Zdziarski's ego has been hurt. Both he and Dr. Yerazunis engage in some very sketchy statistics in their papers and I think that it has caught up to them.
  
  1. Yerazunis' study of "human classification performance" is fundamentally flawed. He did a "user study" where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results "conclusive." There are several reasons why this is not a sound methodology:
  
  a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
  
  b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human's classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards "duplicate detection" when you've seen the data beforehand.
  
  c) He evaluates his own performance. When someone's own ego is on the line, you would expect that it would be very difficult to remain objective.
  
  2. Both Yerazunis and Zdziarski make use of "chained tokens" in their software. This is referred to in other circles as an "n-gram" model. As with many nonlinear models (the complexity of an n-gram model is exponential with n), it is very easy to over-fit the n-gram model to the training data. Natural language tends to follow the Pareto law (sometimes called the 80/20 rule) where the ranking of a term is inversely proportional to the frequency of occurrence of that term. The exponential complexity of the n-gram model contributes to the sparse distribution of text leading to a database with noisy probability estimates.
  
  3. Zdziarski uses a "noise reduction algorithm" called Dobly to smooth out probability estimates in the messages. Aside from his unsubstantiated claim of increased accuracy, I have never seen anything to suggest that it actually works as advertised.
  
  Considering these points, I was not surprised at all by the results of Dr. Cormack's study. While one may argue that his experimental configuration can use some improvement, his evaluation methods are logically and statistically sound. What I personally saw in the results of this paper was that two classifiers that use unproven technology did not perform as advertised. After all, every other Bayes-based spam filter performed acceptably well.
  
  Lastly, I won't really touch his flawed arguments about how using domain knowledge about spam (i.e. SpamAssassin's heuristic) somehow hinders the classifier over time when you are also using a personalised classifier. You'll notice that SpamAssassin still did acceptably well when all of the rules were disabled.
  
  Go read some more of Zdziarski's work and draw your own conclusions about his work. Pay careful attention to his use of personal attacks when comparing his filter to that of others.
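
To make the sparsity concern in point 2 above concrete, here is a small Python sketch that builds chained tokens (word n-grams) and measures what share of distinct tokens occur only once. It is an illustration on a made-up sentence, not the tokenizer of DSPAM or CRM114, and the function names are hypothetical.

```python
from collections import Counter


def chained_tokens(words, n):
    """Return the n-word chains (n-grams) of a word list, in order."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def singleton_fraction(tokens):
    """Fraction of distinct tokens that were seen exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Made-up training text (assumption).
words = "buy cheap pills buy cheap watches buy now".split()

for n in (1, 2):
    tokens = chained_tokens(words, n)
    print(n, len(set(tokens)), singleton_fraction(tokens))
```

Even on eight words, the two-word chains have more distinct entries and a larger share seen only once than the single words do. A token seen once gives a very noisy probability estimate, which is the database problem the comment describes. Whether that actually hurts accuracy in practice is an empirical question this sketch does not answer.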
  
- - Re:And to that... (Score:4, Insightful)
    
    by calebb ( 685461 ) * writes: on Thursday June 24, 2004 @12:28PM (#9519360) Homepage Journal
    
    "You mean like any other normal person who might be wanting to use such a product?"
    
    And to that, I would say... Someone writing an article for publication in a peer-reviewed journal should become experienced in their area of research before attempting to publish their results!
    
    For example, I'm sure you don't have much experience with Nuclear Magnetic Resonance imaging - And you might or might not have experience with X11 forwarding. But unless you are fluent with both of those topics, I would not expect you to attempt to publish a paper in a peer-reviewed journal discussing those topics!
    (Like I did, last December [wisc.edu])
    
    However, for the sake of presenting some evidence to back up what I'm saying here, I'll take your example of Consumer Reports.
    
    From their site: CR has the most comprehensive auto-test program and reliability survey data of any U.S. publication; its auto experts have decades of experience in driving, testing, and reporting on cars.
    
    ...nevermind, I don't need to say anything else.
    
I'm not saying we wouldn't get our hair mussed... (Score:3, Funny)

by VAXcat ( 674775 ) writes: on Thursday June 24, 2004 @12:04PM (#9519043)

I prefer using the original CRM114 discriminator and its host platform on spammers. If you're not familiar with the original CRM114 and its delivery platform, it was featured in the following movie... http://www.imdb.com/title/tt0057012/combined

I wouldn't take this critique too seriously (Score:5, Interesting)

by EsbenMoseHansen ( 731150 ) writes: on Thursday June 24, 2004 @12:12PM (#9519179) Homepage
There are several warning signs in this article.
1. The author spends a lot of time trying to discredit the author on such terms as impartialness and experience. While such can lead credence to a strong case, it bodes when mentioned as the very first points. Also note the beginning of the article: "Many misled CS student...".
2. The author has no statistical or published backing for his claims.
3. Most of the arguments are flawed, in my opinion. Yes, the corpus was trained on SpamAssassin, but the other filters' mistakes were, as far as I recall, examined for errors individually. Thus, any mistakes would be spotted or credit each filter equally.
4. I also always find it suspect when someone claims: "Yes, the program did not perform, but with a different configuration it might/in the latest version it might". While it could be true, such claims need backing.
5. He claims that X's email was atypical, even for geeks. I would like to state here that I have 3 email accounts, of which none lie near his "typical" spam quotient (60%): 2 with >90% spams and 1 with <1% spam.
That said, he does raise a few valid points, such as the timeline:
1. If filters expunge old data based on time, this would not work in the test. That gives SpamAssassin's static rules an edge.
2. Configurations should really have been published. I see no reason why not.
- Re:I wouldn't take this critique too seriously (Score:5, Interesting)
  
  by int2str ( 619733 ) writes: on Thursday June 24, 2004 @12:49PM (#9519621)
  
  Yes, I agree with your points. The author spends way too much time discrediting the study.
  
  I also have to say that my experience was much more along the lines of Cormack's. I tried DSPAM for a while on my server, starting from scratch and training on error with only new emails, on a small mail server with about 10 users of different types (geeks, businesses, moms etc).
  - DSPAM took way too long to produce any kind of results
  - 2500 emails before advanced features kick in is *a lot* for the average soccer mom
  - DSPAM produced way too many false positives early on
  - The spam filtering accuracy leveled off at about 80% (number from DSPAM's web interface)
  
  So this is not another overzealous CS student here, but real world testing.
  
  The DSPAM author does not address any of the real points and just rags on Cormack.
  
  Not much of a "rebuttal" in my book.
  
- Re:I wouldn't take this critique too seriously (Score:3, Funny)
  
  by jpetts ( 208163 ) writes:
  
  While such can lead credence to a strong case, it bodes when mentioned as the very first points.
  
  But does it bode well or ill?
- Re:I wouldn't take this critique too seriously (Score:2)
  
  by Glass of Water ( 537481 ) writes:
  
  I think you mean bodes ill. Bodes means something similar to predicts or foretells.
  Thank you, that is all.
- Re:I wouldn't take this critique too seriously (Score:2)
  
  by gurps_npc ( 621217 ) writes:
  
  I disagree entirely with 3. You can NOT test a device's accuracy by comparing its previous output to future output, even if you also backcheck possible errors using third machines. It is just BAD science and you should be graded F- for even attempting to do it.
  You ignore the change in relative accuracy.
  Assume for example that SpamAssassin is in fact the best around, but it has a 10% false spam rate. Every other program is slightly worse with an 11% false spam rate, always making the same mistake that Spa
What is typical (Score:4, Insightful)

by Anonymous Coward writes: on Thursday June 24, 2004 @12:13PM (#9519185)

Due to X's extremely high volume of traffic and the fact that X's email addresses were available to harvest bots on the Web and in newsgroups for 20 years, it is no surprise that X has an abnormally high spam ratio, 81.6%.

I'm not happy about this: first he says that this account has an abnormally high spam ratio and then says that a normal user can have 60%. Where do we get these figures from, I would like to know, as my average is pushing up against 100%. I don't think that there is such a thing as an average user; some people seem to get nearly no spam and the rest of us get almost complete spam.

Reviewing today's inbox reveals around 200 emails, of which 8 were legit. You do the maths, I would be making progress if it was only 81%.
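
Doing the maths for the figures above, as a short Python check (the function name is hypothetical; the 200 and 8 come from the comment and are approximate by its own account):

```python
def spam_ratio(total, legit):
    """Fraction of received mail that is spam."""
    return (total - legit) / total


# About 200 mails today, 8 of them legitimate.
print(f"{spam_ratio(200, 8):.1%}")
```

That works out to 96%, well above the 81.6% ratio the response calls abnormally high.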

To cut through the spam (Score:5, Insightful)

by NigelJohnstone ( 242811 ) writes: on Thursday June 24, 2004 @12:13PM (#9519188)

Oh boy he goes on and on, if ever you wanted to cut out the spam in an article...

His main points (at least the ones I agreed with):

1. No training period, many features only turn on after lots of real emails have been processed. Fair enough.

2. No purge window, stale emails get purged over time (e.g. 4 months), but in a test everything is shoved through at once (in minutes) and so nothing gets purged. Again fair.

The rest of it complains about the tester, or complains that it was less than ideal conditions & settings for the particular filter.
We call that 'the real world' here.

Sys admins are not experts in configuring filters.

Also he should realise that any new filter gets a better rating than the dominant filter. Spammers try to defeat the most popular filter of the day. So sure, a new filter might perform better than an existing one *initially*, simply because the spammers are targeting the existing one. Once the new filter becomes dominant, the spammers adjust the spam to defeat it.

So in the real world the data set will always be unusual because the spammers make it that way.
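
To make the purge-window point (item 2) concrete, here is a minimal Python sketch of what such a window does. Everything in it is an assumption for illustration: the names `PURGE_WINDOW`, `purge_stale` and `replay` are invented, the 120 days only approximates the "4 months" mentioned above, and none of it is taken from any particular filter or from the test harness. It keys expiry to each message's own date, which is one way a batch replay could still age data out.

```python
from datetime import datetime, timedelta

# Hypothetical window, roughly the "4 months" mentioned above.
PURGE_WINDOW = timedelta(days=120)


def purge_stale(last_seen, now):
    """Keep only tokens seen within PURGE_WINDOW of `now`."""
    return {tok: seen for tok, seen in last_seen.items()
            if now - seen <= PURGE_WINDOW}


def replay(messages):
    """Replay (message_date, tokens) pairs in chronological order.

    The message's own date acts as the clock, so it does not matter
    whether the replay takes minutes or months of wall-clock time.
    """
    last_seen = {}
    for message_date, tokens in messages:
        for tok in tokens:
            last_seen[tok] = message_date
        last_seen = purge_stale(last_seen, now=message_date)
    return last_seen


corpus = [
    (datetime(2003, 8, 1), ["viagra", "meeting"]),
    (datetime(2003, 9, 1), ["meeting"]),
    (datetime(2004, 3, 1), ["mortgage"]),
]
remaining = replay(corpus)
print(sorted(remaining))  # prints ['mortgage']
```

If the clock were the wall-clock time of the test run instead, all three tokens would survive, which is the situation point 2 describes.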

- - Re:SA vs SA... SA Wins! (Score:2)
    
    by NigelJohnstone ( 242811 ) writes:
    
    "Cormick builds a list of spam and a list of ham using SPAM Assasin. "
    
    I read that bit, but Cormack's words:
    
    "The test sequence contained 49,086 messages. Our gold standard classified 9,038
    (18.4%) as ham and 40,048 (81.6%) as spam.
    >>>>>>>The gold standard was derived from
    X's initial judgements, amended to correct errors that were observed as the result
    of disagreements between these judgements and the various runs."
    
    From this I am left with the impression that X was the judge, not Spam Ass
Main issue (Score:2)

by TheLink ( 130905 ) writes:

Zdziarski claims Cormack mainly used Spamassassin to classify the corpus into the ham and spam groups.

If this is true then to me this is a critical flaw in Cormack's methodology.

Not saying there are, or aren't other flaws. But this to me is the main one to consider. Zdziarski should have just put this at the top of his response, instead of putting a lot of waffle about stuff that does "not appear to have been a problem with Cormack's tests".
- But is it correct? (Score:2)
  
  by NigelJohnstone ( 242811 ) writes:
  
  To repeat a comment I made just above. From his original test paper:
  
  "The test sequence contained 49,086 messages. Our gold standard classified 9,038
  (18.4%) as ham and 40,048 (81.6%) as spam.
  The gold standard was derived from
  X's initial judgements, amended to correct errors that were observed as the result
  of disagreements between these judgements and the various runs."
  
  From this I got that:
  
  1. He had an initial set of Spam judged by person X. (e.g. 99.84% accurate).
  2. That he ran it through each test filter
Why not... (Score:2)

by Vadim Makarov ( 529622 ) writes:

postage-based email [vad1.com]?
- Re:Why not... (Score:2)
  
  by Kent Recal ( 714863 ) writes:
  
  No, thanks.
Constructing arguments (Score:5, Informative)

by cynicalmoose ( 720691 ) writes: <giles.robertson@westminster.org.uk> on Thursday June 24, 2004 @12:40PM (#9519507) Homepage

As far as I understand, Cormack accepted that he was testing only on one person's corpus, and qualified his findings as such.

This is something that is featured throughout the rebuttal - an argument that runs:
a) Such and such was done incorrectly
b) Therefore the system was inaccurate
c) Therefore CRM-114 is better than stated

The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable. If you discarded results each time they contradicted agreed wisdom we would still think of a geocentric universe.

- Re:Constructing arguments (Score:3, Insightful)
  
  by bourne ( 539955 ) writes:
  
  The ultimate point where I lost patience was where he claimed that the results were invalid because they didn't conform to accepted, real world knowledge. The study was empirical; it shows something, based on how it was set up; and what it shows is valuable.
  
  But without knowing how the test was set up, how can you trust the test's so-called empirical results?
  In medicine, research results aren't generally trusted unless 1) the study was sound, e.g., double-blind and 2) a separate team has recreated equi
Anyone got Gordon's email addy? (Score:2, Funny)

by bl8n8r ( 649187 ) writes:

I propose a little test of my own...
POPFile OTOH (Score:4, Informative)

by JohnGrahamCumming ( 684871 ) * writes: <slashdot AT jgc DOT org> on Thursday June 24, 2004 @12:47PM (#9519594) Homepage Journal

Actually publishes statistics from real users. If the user is willing POPFile sends back accuracy information to a central server and then a nightly cron job analyzes it and publishes information on the web for all to see.

No need to read a study, or even the author's opinion. No wild claims made, just real data.

Here it is:

http://www.usethesource.com/popfile_stats.html

Shows that POPFile has an _average accuracy_ over all users, including the training period, of 95%. After it's seen 500 emails it has an accuracy of 97%. And the average POPFile user has 5 categories of classification.

John.

DSPAM (Score:2, Interesting)

by Big Boss ( 7354 ) writes:

I don't claim to have done any scientific studies on the subject, but I have tried a number of different anti-spam solutions over the past few years. In my experience, the best solution is a multi-pronged approach that takes advantage of the strong points of a few setups.

If you want to talk about the results from a single filter in my current arsenal, I would give DSPAM the highest marks. I found it to catch more spams than a trained and customized SpamAssassin with no false positives. It's also very fast,
Obfuscated Hyperverbosity (Score:2)

by Andy_R ( 114137 ) writes:

The author 'architected an appropriate response'. Presumably this is a lot better than simply replying?

I'd advise the author not to use the word "percept", because he doesn't know what it means.

I'd advise the author not to use the word "someodd", because dictionary.com doesn't know what it means.

As for "very unique"...
The problem w/ Bayes (Score:3, Informative)

by king_ramen ( 537239 ) writes: on Thursday June 24, 2004 @01:12PM (#9519915)

As the author of this article states OVER and OVER, it is REALLY EASY to mess up your filters, and it is very tedious (with lots of permutations) to properly build your corpus.

For a centralized spam filtering solution, the goals are:

1. Insulate the users from spam

2. Insulate the users from "administration"

3. Do no harm (no false positives)

For these goals, I would take a "dumb" filter, set it conservatively, and hope for an 80% catch rate and zero false positives. DSpam has a complicated workflow that requires EACH AND EVERY end user to complete a feedback loop. This is WAY too much to expect from people who are barely capable of finding Google. Unless the ONLY access to the mail is web-based, with a VERY clear "This is Spam" button, Bayes is a sysadmin's nightmare.

My only gripe w/ SpamAssassin is performance. If I could get SPAMD to analyze headers in 25ms instead of 2000ms I'd never look back. As it is, DSPAM's performance has me very jealous.

Re: Response to Gordon Cormack's Study of Spam (Score:3, Funny)

by telstar ( 236404 ) writes: on Thursday June 24, 2004 @01:24PM (#9520043)

He launches rockets ... He develops 3D game engines ... He analyzes spam trends ... Is there anything this Carmack guy can't do?

What'd you say?
Cormack?

Nevermind...

SpamAssassin validation telling point (Score:2)

by gurps_npc ( 621217 ) writes:

I find the most telling point is that he used SpamAssassin to decide if the various spam detectors had made an error or were correct.
OBVIOUSLY, SpamAssassin is going to agree with SpamAssassin being the best.
What the test really did was determine how close to SpamAssassin the other spam detectors were, not how good they were at detecting spam.
Atypical, high volume of traffic? (Score:3, Informative)

by dougmc ( 70836 ) writes: <dougmc+slashdot@frenzied.us> on Thursday June 24, 2004 @01:36PM (#9520181) Homepage

This seems very atypical. The test subject does not represent typical email behavior, except among the most hardcore geeks. Even still, typical hardcore geeks will adjust this behavior in an attempt to curve spam. The typical technical user (someone who makes his living online) will have the same email address for perhaps five or more years, and the typical non-technical user (a majority of the users on the Internet, lest we forget) will change email addresses every couple of years. In either case, most sane users use one or two variants at the most.

Who is Jonathan to decide what constitutes sanity?
Maybe I'm a hardcore geek, but I do do exactly what Gordon does -- have several accounts feeding a `master' mail account, using addresses I've owned for over a decade. I also post to Usenet and mailing lists with my unobfuscated mailing address -- I want people to be able to reach me, and I refuse to let the spammers take that away from me.
And I think I'm very sane, thank you.

49,000 emails in eight months is also absurd.

I agree. That's an absurdly *small* amount. I personally receive over 1500 spams/day -- so I'd have 49,000 in under a month. Obviously the amount of spam I receive is because I set myself up as a target, but I'm hardly the only one. Even Jonathan's email address is clearly listed on his page, unobfuscated, so he's doing it too, at least to some degree.
(As a piece of anecdotal evidence, Spamassassin catches all but about 4/day of the spams I get, and false positives are extremely rare. Of course, I have spent a good deal of time tweaking SA to work best with my email, and it now works very well.)

A good test should have included independent tests with corpora from 10-15 different test subject, of all walks of life - geek, doctor, etc.

That sounds fine in theory, but in practice it's hard to do. How many people from all non-geek walks of life save *all* their email, including spam, and are willing to give it to you so you can analyze it?
And merely capturing all their email won't do it -- they need to categorize it for you, because they're the only ones who can reliably decide what's spam *for them* and what's not.
I do agree that the study had more than its share of issues, but this critique goes way over the top.

Crap writing (Score:3, Insightful)

by fuzzy12345 ( 745891 ) writes: on Thursday June 24, 2004 @01:40PM (#9520227)

I was turned off as soon as I hit that word "architect" being used as a verb. After our hero "architected" his response, did he assign the task of actually writing it to someone else? Nooo.
English does evolve, and good writers sometimes repurpose words to great effect. Alas, judging by the rest of the reviews here, our hero is NOT a good writer -- having built a shoddy and ramshackle outhouse, he proudly crowns himself the architect of it.
As for all those people who shout "prescriptive grammarian!", I often suspect they're just too lazy to learn to write well, and have decided that claiming that rules are passe is an effective workaround.

an important consideration left out (Score:2)

by mabu ( 178417 ) writes:

When self-proclaimed pundits do these studies, they should also take into account the exponential increase in resources needed to accept and filter the mail's content. This results in more memory, faster machines, slower mail service, more deferred mail, and reduced performance overall for everything else that might be done on that server.

Contrast this with the effectiveness of RBLs, which block spam based on the source and immediately cut off the huge resource requirement needed by these "filters".

By
- RBL (black lists) do not help with zombie systems (Score:3, Insightful)
  
  by wintermute42 ( 710554 ) writes:
  
  I have noticed that black lists are indeed effective. Many spammers now use "bullet proof" spam hosts, so they use static domain names. However, there has been a marked rise in zombie systems sending spams. These are systems that are infected by viruses and then used as spam hosts. Since these systems come on line rapidly (when they are infected) and then drop out (when they are cleared of the virus or booted off their ISP) it seems unlikely that black lists will help.
  
  At least in the spam stream I s
Cormack and Lynam re Zdziarski's factual errors (Score:5, Informative)

by gvc ( 167165 ) writes: on Thursday June 24, 2004 @02:48PM (#9521026)

We shall not respond to Mr. Zdziarski's attacks, except to identify the most outstanding factual errors and to note that ad hominem arguments are irrelevant in assessing the validity of our work.
We encourage interested parties to read our paper [uwaterloo.ca] and our points of fact re Zdziarski [uwaterloo.ca].
Thomas Lynam
Gordon Cormack
June 24, 2004

- Re:Cormack and Lynam re Zdziarski's factual errors (Score:2)
  
  by Trogre ( 513942 ) writes:
  
  It would be so much easier to believe you if you would just show us the code you used to perform the tests.
- Re:Cormack and Lynam re Zdziarski's factual errors (Score:2, Insightful)
  
  by EatAtJoes ( 102729 ) writes:
  
  While obviously Cormack and Lynam are central to this discussion, it's depressing that this is +4, Informative when instead they obviously resent any serious questioning of their work. Is there a '-1, Wussy' moderation?
  
  "We shall not respond" -- huh? Pull the log out of your ass guys. Like it or not, he's got legitimate beefs with your study. What's more, he's got cred: dude puts SERIOUS effort into GPL'd software that helps people, so his input is relevant and valid. Get over it.
  
  Besides, his questioning o
Collaborative filtering? (Score:2)

by WOV ( 652967 ) writes:

I am always confused by the omission from these tests of collaborative filters like Cloudmark's SpamNet [cloudmark.com], which I have used at work for a long time with a very high "catch" rate, no real processing time, and no false positives. Essentially, every email you get it hashes and checks with the server. If you get a spam, you right-click and report it as such. Then it pulls any messages from your inbox which enough credible people have marked before you. (A gross oversimplification, but close enough.)
I feel li
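
The hash-and-report loop described above can be sketched roughly as follows in Python. This is only an illustration of the general idea, not how Cloudmark's service actually works: the `ReportServer` class, the `REPORT_THRESHOLD` of two reporters, and the exact-match SHA-256 fingerprint are all assumptions, and the in-memory object stands in for a remote server.

```python
import hashlib
from collections import defaultdict

# Hypothetical: how many distinct reporters it takes to call a message spam.
REPORT_THRESHOLD = 2


class ReportServer:
    """In-memory stand-in for a shared reporting server."""

    def __init__(self):
        self._reporters = defaultdict(set)

    def report(self, fingerprint, user):
        # Each user counts once per fingerprint, however often they report.
        self._reporters[fingerprint].add(user)

    def is_spam(self, fingerprint):
        return len(self._reporters.get(fingerprint, ())) >= REPORT_THRESHOLD


def fingerprint(body):
    """Hash a message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_inbox(inbox, server):
    """Return the messages that have not been reported often enough."""
    return [msg for msg in inbox if not server.is_spam(fingerprint(msg))]


server = ReportServer()
spam = "Buy   CHEAP meds now"
server.report(fingerprint(spam), "alice")
server.report(fingerprint(spam), "bob")

inbox = ["buy cheap meds now", "Lunch on Friday?"]
kept = filter_inbox(inbox, server)
print(kept)  # prints ['Lunch on Friday?']
```

An exact hash is trivially defeated by changing one character, so a real system would need some fuzzier signature and a way of weighting reporters by credibility; both are left out here.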
CRM114 is impossible to get installed (Score:3, Insightful)

by Anonymous Coward writes: on Thursday June 24, 2004 @03:08PM (#9521263)

I remember going through the CRM114 installation docs, and vividly remember the 20 or so steps that I had to go through, and after about 3 or 4 hours of trying to get it installed, I finally gave up. I think part of the goal of software design is to make your software so that people will be able to quickly install and use it. The author of this program lost sight of this important point. I'm not going to sit there and reverse engineer some esoteric codebase just to get it working, and I'm sure a lot of other people feel the same way. Therefore, I use SpamAssassin among other things, and it works really well and was quick and relatively painless to get working. I didn't have to go through their source code to figure out how to get it installed.

the corpus was *not* classified by SA alone (Score:5, Informative)

by jmason ( 16123 ) writes: on Thursday June 24, 2004 @04:09PM (#9521919) Homepage
My $.02. disclaimer: I'm one of the SA developers.
- "The Corpus was Classified by SpamAssassin, for SpamAssassin", and "The Accuracy of the Test Subject's Corpus is Questionable":
  
  No, this is incorrect. Firstly, he states that he used user feedback to reclassify FNs and FPs (p. 4).
  
  The misunderstanding probably comes from p. 6, where he notes that he also ran SpamAssassin 2.63 over the "gold standard" corpus once it was complete, to verify his original classifications.
  
  However, in addition to that, he states 'all subsequent disagreements between the gold standard and later runs were also manually adjudicated, and all runs were repeated with the updated gold standard. The results presented here are based on this revised standard, in which all cases of disagreement have been vetted manually.' So in other words, the "gold standard" should be as near as possible to 100% accurate, since all the tested filters and the human classification have "had a shot" at classifying every mail, and the human has had final say on every misclassification.
  
  In other words, if any misclassifications remain in the "gold standard" corpus, every one of the tested filters agreed on that misclassification.
  
  IMO, that's as good as a hand-classified corpus can get.
- "old versions of software were used":
  
  It's unrealistic to expect the author to use the most up-to-date versions of filters available by the time the paper is made available to the public. That's the difference between results and a paper -- it takes time to analyze results, write it up and come to valid conclusions, once the testing results are obtained. IMO, the author can't be faulted for spending some time on that end of things.
  
  Given that, using 6-month old release versions of the software under test seems reasonable.
  
  SpamAssassin 2.60, when new SpamAssassin rules were last added to a released ruleset, is 9 months old (released 2003-09-22); so logically, in testing against DSPAM 2.8 (released 2003-11-26), DSPAM should therefore have had the edge. ;)
- "test started with untrained filters":
  
  IMO, that's the real world. People don't start with fully-trained filters.
  
  In addition, the graphs on pp. 15-20 show accuracy over the course of the entire 8 month period, so "post-training" accuracy can be viewed there.
- "spam in the test is as old as 14 months":
  
  Nope, he states (p. 4) that the corpus uses mail between August 2003 and March 2004.
- "it should purge old data":
  
  SpamAssassin purges its Bayes databases automatically, based on the age of messages in the corpus. We call it "expiry".
  
  In that test, the "SA-Standard" dataset would be using this, so stating "Cormack did not perform any purge simulation at all" is not accurate. However, that would not have increased SpamAssassin's accuracy figures, since we have generally found that while it keeps the overhead of Bayes database sizes and memory down, it marginally reduces accuracy instead of increasing it (at the default settings).
  
  (Also worth noting that it can deal with being run from an en-masse check over a static corpus, as it uses the timestamp information in the Received headers rather than the current system time. So even if this test was run in the course of 4 hours, it'd still be an accurate simulation of what would happen in "real world" use over the course of 8 months.)
And finally, what Henry said in comment 9520473 [slashdot.org].

--j.
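
The adjudication procedure described in the first point can be sketched in Python along these lines. It is a reading of the quoted description, not the paper's actual code: the function names, the toy message ids and the `rulings` table standing in for the human judge are all invented for illustration.

```python
def find_disagreements(gold, run):
    """Return ids of messages where a filter's verdict differs from gold."""
    return sorted(mid for mid, verdict in run.items() if gold[mid] != verdict)


def adjudicate(gold, runs, human_judge):
    """Return a revised gold standard.

    Every message on which any run disagrees with the current gold
    standard goes to the human judge, whose ruling is final.
    """
    disputed = set()
    for run in runs.values():
        disputed.update(find_disagreements(gold, run))
    revised = dict(gold)
    for mid in sorted(disputed):
        revised[mid] = human_judge(mid)
    return revised


# Toy data: message 2 was mislabelled in the initial judgements.
gold = {1: "spam", 2: "ham", 3: "spam"}
runs = {
    "filter_a": {1: "spam", 2: "spam", 3: "spam"},
    "filter_b": {1: "spam", 2: "ham", 3: "ham"},
}
rulings = {2: "spam", 3: "spam"}  # hypothetical human decisions

revised = adjudicate(gold, runs, rulings.__getitem__)
print(revised)  # prints {1: 'spam', 2: 'spam', 3: 'spam'}
```

Message 1 is never looked at again because no run disputed it, which is the sense in which any error left in the gold standard must be one that every tested filter agreed with. In the procedure as described, the runs are then repeated against the revised standard.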
DSPAM. (Score:2)

by asackett ( 161377 ) writes:

Honestly, the first time I read Cormack's paper I stopped partway through because his findings didn't jibe with my own experience. I've applied no scientific method to debunk his findings, and I don't care to -- I have other demands for my time.

I use and recommend DSPAM. Many of the accounts that are aggregated in my inbox have been exposed on the web and in Usenet for several years, so my spam load is probably about as high as anyone else's. No comparison testing analysis can change the fact that my inbox
- Re:Architect is not a verb. (Score:2, Insightful)
  
  by pclminion ( 145572 ) writes:
  
  I hope you're proud of your anal retentiveness.
  Haven't you ever Googled something? Haven't you ever input data into a computer? (The use of the word input as a verb is, of course, the result of verbing, and it's now considered acceptable usage.) In recent years it has become common in English to "verb" nouns. In fact, I just did it. English, like any other language, evolves over time.
  To deny this fact makes you just another prescriptivist language maven, completely disconnected from reality and any sens
  - Re:Architect is not a verb. (Score:3, Insightful)
    
    by corporatemutantninja ( 533295 ) writes:
    
    Well said. HOWEVER, I have to agree with the poster who pointed out that using "architect" as a verb in the context of writing is a little out of place. If we're going to help the language grow, let's at least do so in useful ways. "Architect a solution to an engineering problem", sure, "architect a whiny, defensive rebuttal", no. If we're going to make it a verb let's at least have it relate somewhat to the noun.
  - Re:Architect is not a verb. (Score:2)
    
    by Inda ( 580031 ) writes:
    
    Haven't you ever input data into a computer?
    Why is the readability of that sentence poor?
  - Re:Architect is not a verb. (Score:2)
    
    by donutz ( 195717 ) writes:
    
    In recent years it has become common in English to "verb" nouns.
    
    But "verb" isn't a verb, it's a noun! You can't "verb" something, or go around "verbing" things...check it out here [reference.com].
    - Re:Architect is not a verb. (Score:2)
      
      by Daniel ( 1678 ) writes:
      
      You can't "verb" something
      
      Sure you can, but verbing weirds language.
      
      Daniel
  - Re:Architect is not a verb. (Score:2)
    
    by perly-king-69 ( 580000 ) writes:
    
    No, he's promoting the correct use of English which promotes inclusivity. We all know where we stand. By designing (or should I say architecturizing) your own rules you begin to exclude groups of people, such as those whose first language is not English. It's elitism, nothing less.
    - Re:Architect is not a verb. (Score:2)
      
      by pclminion ( 145572 ) writes:
      
      By designing (or should I say architecturizing) your own rules you begin to exclude groups of people, such as those whose first language is not English.
      Then the Germans had better stop joining words together however they please. It creates these big, long words which are incomprehensible to non-native speakers. They seem to do it willy-nilly!
      That's what you're saying. Right?
      It's elitism, nothing less.
      Prescriptivism is the only thing elitist here.
      - Re:Architect is not a verb. (Score:2)
        
        by perly-king-69 ( 580000 ) writes:
        
        There's nothing unwrong about prescriptivisationism.
  - Re:Architect is not a verb. (Score:2)
    
    by Kphrak ( 230261 ) writes:
    
    I wouldn't mind verbing so much if the right usage hadn't been drilled into me as a kid...but "verbing" the word "architect" is not a language advancement. It's a sloppy shortcut normally used in buzz-speak (that's why you almost never hear it in everyday English, but so often in computer- and business-related fields). It's ambiguous and makes English even more difficult to understand than it is already. The fact that enough people complained about it for this thread to occur shows that in fact, it is not "
  - Verbing is not a verb (Score:2)
    
    by horza ( 87255 ) writes:
    
    I personally think to "architect" something 'sounds' right and it's obvious and unambiguous in what it means. The grammar nazi is right though and it is incorrect. Input *is* a transitive verb [reference.com]. However verbing sounds like something simply offensive and shouldn't be done in public.
    
    The language evolves, but slowly as everyone needs to be able to keep up. This is the problem with Open Standards: creating a stable API can sometimes slow or stifle innovation
    
    Phillip.
    - Re:Verbing is not a verb (Score:2)
      
      by pclminion ( 145572 ) writes:
      
      Input *is* a transitive verb.
      How do you think those usages get placed in dictionaries? They don't fall from the sky. The noun "input" got verbed. And the use of "verb" as a verb will also eventually be accepted in the dictionary.
      Acceptance in the dictionary, and acceptance as usage in language are two distinct things, however.
      - Re:Verbing is not a verb (Score:2)
        
        by cperciva ( 102828 ) writes:
        
        The noun "input" got verbed
        
        Nope; in fact, the verb "input" got nouned. The first known use of "input" as a verb in the context of computers was in 1946; the first known use of "input" as a noun in the same context was in 1948.
        
        Outside of the specific case of computers, the difference is even more distinct, with the verb "ynputt" pre-dating the noun "input" by almost four hundred years.
      - No. (Score:2)
        
        by Ayanami Rei ( 621112 ) * writes:
        
        You input data. You don't input input.
    - Re:Verbing is not a verb (Score:2)
      
      by Mournblade ( 72705 ) writes:
      
      The grammar Nazi would probably also point out that "I personally think" is redundant.
  - Re:Architect is not a verb. (Score:2)
    
    by nagora ( 177841 ) writes:
    
    I hope you're proud of your anal retentiveness.
    If you mean being proud of knowing that "architecting" was not even close to being the right word, then I'm proud, sure.
    Language does evolve over time and new words do come into usage, but how does that mean that just picking words at random and using them instead of already existing, perfectly adequate, words is not pointless, unclear, and pretentious?
    To deny this fact makes you just another prescriptivist language maven, completely disconnected from reali
    - Re:Architect is not a verb. (Score:2)
      
      by pclminion ( 145572 ) writes:
      
      You put a meaningless jumble of words together. "Architect" in this context was anything but meaningless. If you can't figure out what was meant, that indicates a lack of brain power on your part, nothing more.
      - Re:Architect is not a verb. (Score:2)
        
        by nagora ( 177841 ) writes:
        
        "Architect" in this context was anything but meaningless.
        Alright, genius: did he mean "write" or "design"? And why was not using one of those an appropriate choice?
        TWW
- It could be worse... (Score:2)
  
  by schon ( 31600 ) writes:
  
  He could have said someone "tasked" him to "architect" a response. :o)
  - Re:It could be worse... (Score:2)
    
    by gr ( 4059 ) writes:
    
    ... with the intention of "growing" the economy, no doubt.
- - - Re:Is that what your mom worded (Score:2)
      
      by Karamchand ( 607798 ) writes:
      
      First off: Your original posting was simply completely off topic. Where would we be if every message pointing out a grammatical mistake in a story got moderated +5? - slashdot would look like an English schoolbook.
      Secondly: Your second posting is not only off topic, but also insulting and purely flaming.
- - Re:Why use "architect" - why not "write" (Score:2)
    
    by pclminion ( 145572 ) writes:
    
    what's the difference between "writing" a response and err, "architecting" a response?
    You're being purposefully dense.
    To architect a response would imply careful consideration, artistic presentation, and stunning aesthetics. I don't necessarily agree that that's what he's done here, but obviously that is what he meant to convey with his choice of words.
    And if you disagree with verbing words, you have better stop "inputting" data into a computer, or "Googling" for answers, or "bookmarking" links, or "
    - definitely curious and also concerned (Score:2)
      
      by fantomas ( 94850 ) writes:
      
      Like other people, I found michael's choice of word curious: the first time I have ever seen the noun architect used as a verb. The curiosity of the expression took my attention away from his main argument.
      I feel the Plain English Campaign [plainenglish.co.uk] offers a useful guide "We define plain English as something that the intended audience can read, understand and act upon the first time they read it". So, perhaps you are right for the majority of people. But I had to pause a while and think about what michael meant.
      I
    - - Re:Why use "architect" - why not "write" (Score:2)
        
        by magefile ( 776388 ) writes:
        
        It's more established as a word, but take a look at the root of the word "crafted". Is it a noun? You tell me.
  - Re:Why use "architect" - why not "write" (Score:2)
    
    by nagora ( 177841 ) writes:
    
    You will have noticed by now that any attempt to express what you mean clearly is regarded as fascism by the /. crowd. In this case you were 100% right: "architect" was not even close to being the right word for the job. I can't imagine the trouble this guy gets into if he tries programming: "I know I typed 'print', but I meant 'close'!"
    TWW
- - Re:Confirmed: Architect not a verb (Score:2, Offtopic)
    
    by j_kenpo ( 571930 ) writes:
    
    http://dictionary.reference.com/search?q=google
    
    The World-Wide Web search engine that
    indexes the greatest number of web pages - over two billion by
    December 2001 and provides a free service that searches this
    index in less than a second.
    
    The site's name is apparently derived from "googol", but
    note the difference in spelling.
    
    The "Google" spelling is also used in "The Hitchhikers Guide
    to the Galaxy" by Douglas Adams, in which one of Deep
    Thought's designers asks, "And are you not," said Fook,
    leaning anxiously
  - Confirmed: Architect IS a verb (Score:5, Informative)
    
    by cperciva ( 102828 ) writes: on Thursday June 24, 2004 @12:24PM (#9519306) Homepage
    
    Quoth the OED:
    
    architect v. To design (a building). Also transf. and fig. Hence architected ppl. a., designed by an architect; architecting vbl. n. and ppl. a.
    
    The use of "architect" as a verb isn't even recently invented: Keats wrote "This was architected thus By the great Oceanus" in 1818.
    
  - It's a decent paper, but take it with some salt... (Score:5, Interesting)
    
    by Ayanami Rei ( 621112 ) * writes: <rayanami@gWELTYmail.com minus author> on Thursday June 24, 2004 @02:33PM (#9520858) Journal
    
    ...this guy seriously believes the earth is a scant 10000 years old [nuclearelephant.com]. And he dismisses all evidence to the contrary without a thorough explanation. I can't help but wonder if he treats other people's research with the same disregard.
    
- Re:????? Did you even... (Score:2, Insightful)
  
  by calebb ( 685461 ) * writes:
  
  RTA?
  
  Read the article, then post!
  
  There's really very little to be said in favor of Jonathan A. Zdziarski's "defence?"
  Now, I could start posting how ignorant that statement is, but then I'd just be rewriting Zdziarski's article. Cormack's entire test was flawed - He used SpamAssassin (95% accuracy) to create his 'ham' corpus. He used software versions that were 6+ months old. Even the email address he used for testing is incredibly unique and atypical! (He uses an address that he's had for 20+ years;
  - Re:????? Did you even... (Score:2)
    
    by Otter ( 3800 ) writes:
    
    Now, I could start posting how ignorant that statement is, but then I'd just be rewriting Zdziarski's article. Cormack's entire test was flawed - He used SpamAssassin (95% accuracy) to create his 'ham' corpus. He used software versions that were 6+ months old. Even the email address he used for testing is incredibly unique and atypical! (He uses an address that he's had for 20+ years; One that has been posted all over the WWW numerous times. An address that has many forwarders pointing to it. How is that typ
- Re:Special Pleading (Score:2)
  
  by julesh ( 229690 ) writes:
  
  As near as I can tell (I skimmed, admittedly, I didn't read every word carefully), his defense amounts to "please don't test the different filters because..."
  
  You must have been skimming very badly. I read it, and this kind of argument was never used at all. Basically, he pointed out flaws in the way the test was set up that biased it towards SpamAssassin. Particularly that the test was started with untrained filters, and that the version of SpamAssassin's ruleset used was more recent than the messages
- Re:architect (Score:3, Insightful)
  
  by psykocrime ( 61037 ) writes:
  
  For the love of Cthulu, people, "architect" is a noun, not a verb.
  
  Languages are dynamic, not static. If enough people begin to use 'architect' as a verb, then it is a verb. I have a strong hunch that 20 years from now, the verb form of architect will appear in Merriam-Webster...
- Re:architect (Score:2, Funny)
  
  by reynhout ( 89071 ) writes:
  
  > For the love of Cthulu, people, "architect" is a noun, not a verb.
  
  Ya.
  
  And for the love of Howard Phillips Lovecraft, "Cthulhu" is not spelled "Cthulu".
  
  Duh.

