Communications Spam The Internet

Eisenstadt's Analysis Of 8 Years' Worth Of Email

Hylton writes "Thought this might be of interest: Marc Eisenstadt's saved every email he's gotten over the past eight years, including spam, and run an analysis of it."
This discussion has been archived. No new comments can be posted.

  • Article Text (Score:5, Informative)

    by Anonymous Coward on Tuesday February 15, 2005 @09:15PM (#11684630)

    February 11, 2005
    Eight years of email stats, pass 1
    Posted by Marc Eisenstadt

    What's the reality behind the 'email overload' talk? Let's look at some numbers... personal numbers.

    To kick things off, I've got a huge email archive. I started emailing in the early ArpaNet days, around 1972, and haven't stopped since. My archive has been extremely thorough for at least the past 12 years (and, in case you think I'm nuts for keeping all of these, my actual regret from a scientific/archive perspective is that I don't have the earlier ones too!). Why? Let's just say that one day I planned to do an analysis of it all... types of mails, social networks, the whole works. But things got a little out of hand.... (anyone lookin' for some data, give me a shout... but first read on)...

    Most of this 'storage mania' was triggered by a casual comment in around 1992 or 1993 by Ron Baecker, of the University of Toronto, a longtime research colleague and acquaintance and someone whose work I have long admired and respected. Ron asked me, "given ultra-cheap storage and ultra-fast search, both clearly on their way, why would you ever need either to delete or indeed to accurately file/categorize your emails?"

    OK, so as a little personal experiment, I decided to keep 'em, and to see what happened. The quick story is that migrating across machines, operating systems, and preferred email clients, plus being a bit cavalier about the whole thing, has meant that although all the emails are 'there' in various archive files, it takes a little work to get 'em all back in a harmonious form, that is with all headers intact and no duplicates (the main formats are Vax mails, Unix mails, Mac Eudora, PC Eudora, Outlook Express, and Outlook).

    The longer story, with some data and preliminary analysis, begins like this:

    Even though I haven't had the time or motivation thus far to put in the harmonization work required to get all the data in one format and with duplicates eliminated, I nevertheless thought that a little 'first pass' set of totals (with my estimate of their accuracy) would be interesting, and maybe even provide a little coarse empirical support for Stowe's "Just Say No To Email" campaign.

    So I quickly eyeballed-and-tallied the most coherent of the archives, spanning eight years of emails, from January 1st 1997 to December 31st 2004. The totals are real enough, but the 'eyeballing' was needed to assess the approximate proportion of spam and duplication involved in the emails. A more detailed analysis later will enable me to do these more accurately. I've indicated my estimate of the margin for error in the third column, and my estimate for the percentage of spam received (and I mean real spam: i.e. either 'greedily-lookin-for-suckers' or 'low-down-mean-and-nasty spam', not conference announcements - you know what I'm talkin' about). For 2003, this number is precise, because I filtered off such spam using SpamAssassin, and counted them! 2004 spam numbers are an extrapolation, but the totals are accurate, as explained below. Here goes:

    TABLE 1: Eisenstadt's 1997-2004 email totals

    Year    Emails received    Est. Error    Est. Spam
    1997     4320              20%            2%
    1998     3996              20%            3%
    1999     6821              10%            5%
    2000     7580               5%            6%
    2001     6125               5%            7%
    2002     6497               5%           10%
    2003    13092               1%           37.6%
    2004    13889               1%           40%
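
    As a rough illustration (not part of the original post), the percentages in Table 1 can be turned into approximate message counts. This is a minimal Python sketch: the figures are copied from the table, the name split_spam is made up for this example, and the result ignores the "Est. Error" column, so the earlier years in particular are only ballpark numbers.

```python
# Table 1 rows: (year, emails received, estimated spam share in percent).
# Values are copied from the table above; nothing here is new data.
TABLE_1 = [
    (1997, 4320, 2.0),
    (1998, 3996, 3.0),
    (1999, 6821, 5.0),
    (2000, 7580, 6.0),
    (2001, 6125, 7.0),
    (2002, 6497, 10.0),
    (2003, 13092, 37.6),
    (2004, 13889, 40.0),
]


def split_spam(total, spam_percent):
    """Return (estimated spam count, estimated non-spam count) for one year.

    Hypothetical helper: it applies the spam percentage to the yearly total
    and rounds to whole messages, ignoring the stated margin of error.
    """
    spam = round(total * spam_percent / 100)
    return spam, total - spam


if __name__ == "__main__":
    for year, total, pct in TABLE_1:
        spam, ham = split_spam(total, pct)
        print(f"{year}: about {spam} spam and {ham} non-spam out of {total}")
```

    On these figures 2003 comes to roughly 4,923 spam messages out of 13,092, which suggests the non-spam volume also grew compared with 2002, not just the spam.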

    2003 is the most accurate, because (unlike earlier years when I was changing clients and machines) I have all emails in one clean format and all spam preserved, auto-filtered by SpamAssassin into a folder that I look at only a few times a year, scanning rapidly for false rejections. Incidentally, that falsely rejected email rate appears to be roughly 1 in 5000: good enough for me! By 2004, although I kept all emails, I got fed up keeping the spam even for analysis purposes, and can't even be bothered to scan it, so stuff auto-filtered by SpamAssassin is now deleted without my looking at it - so the column 4 '40% spam' in the lower

  • Re:Indeed (Score:5, Informative)

    by iced_773 ( 857608 ) on Tuesday February 15, 2005 @09:18PM (#11684662)
    I should point out that you shouldn't respond to spam under ANY circumstances - it just verifies to the spammer that your address exists.
  • by Vario ( 120611 ) on Tuesday February 15, 2005 @09:18PM (#11684669)
    This is the google cache linked with slashcode: http://64.233.183.104/search?q=cache:GshwWambHvEJ:www.corante.com/getreal/archives/2005/02/11/eight_years_of_email_stats_pass_1.php [64.233.183.104]

    It still tries to access the original site, so it is rather slow, but you can read the article.
  • by confusedneutrino ( 732640 ) on Tuesday February 15, 2005 @10:00PM (#11684982)
    I had a very similar setup going on for a while, but I lost it over a year ago. 6 years and 2 gigs of emails lost to a faulty power supply. Scouring turned up nothing usable and I didn't have backups of my emails.

    I felt like I lost a part of my past...

    Goes to show the value of backing up your data.
  • Re:Indeed (Score:3, Informative)

    by Anonymous Coward on Tuesday February 15, 2005 @10:22PM (#11685118)
    And also set your email client not to load images, or anything remote for that matter, off the net. They can just add an image.jpg?id=123456 and know that the email address in their db with the id of 123456 read their spam message.
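
    To make the tracking-image point concrete, here is a small Python sketch that scans an HTML message body for images a mail client would fetch over the network. It is only an illustration: the names RemoteImageFinder and find_remote_images, and the host spammer.example, are invented for this example, and a real client simply refuses to load such images rather than listing them.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RemoteImageFinder(HTMLParser):
    """Collects <img> tags whose src points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        # Each entry is (src, has_query); a query string such as ?id=123456
        # is where a per-recipient identifier would typically be carried.
        self.remote = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        parts = urlsplit(src)
        if parts.scheme in ("http", "https"):
            self.remote.append((src, bool(parts.query)))


def find_remote_images(html):
    """Return the list of (src, has_query) pairs for remote images in html."""
    finder = RemoteImageFinder()
    finder.feed(html)
    finder.close()
    return finder.remote


if __name__ == "__main__":
    body = (
        '<p>Great offer!</p>'
        '<img src="http://spammer.example/image.jpg?id=123456">'
        '<img src="cid:logo">'
    )
    for src, has_query in find_remote_images(body):
        print(src, "(carries a query string)" if has_query else "")
```

    The embedded cid: image is ignored because it travels inside the message itself; only the remote one would tell the sender that the mail was opened.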
  • by NanoGator ( 522640 ) on Tuesday February 15, 2005 @10:52PM (#11685268) Homepage Journal
    "...don't just do it because they think it is fun to piss off the world, they do it because they make lots and lots of money from it."

    Funny thing is, they don't necessarily make money from people buying it, but rather the people advertising it. "Give me $10,000, and I'll get your message out to 10,000 people!" "Okay! That's a lot cheaper than buying a banner on a big site!" (Note: The numbers are made up.)
  • by ticklish2day ( 575989 ) on Tuesday February 15, 2005 @10:55PM (#11685284)
    From the post...
    The big red splotch in August 2003 around the 100K mark is the Sobig virus.
  • Re:Not very much (Score:2, Informative)

    by baalz ( 458046 ) on Wednesday February 16, 2005 @10:25AM (#11688290)
    Yeah, this sort of analysis seems fairly frivolous. Everybody uses email differently. I've noticed fairly substantial "email culture" differences in the jobs I've worked at. At my current job I usually get about 10 emails a day from people I've never met in other departments telling the whole world they're stepping out early for a doctor's appointment, a dozen reminders every month to fill out your time cards (sent to everybody regardless of whether they're already filled out), etc. Like the parent poster, I do more than glance at less than 50% of my email, and this is internal email with no real spam. At previous jobs the email culture was such that managers would send time-sensitive requests by email, we discussed more detailed technical issues, and there were far fewer "worthless" mails. Email was used in a very different fashion. When I got far fewer emails, I spent more of my time on them. Not only did I have a much higher rate of reply, but I also had mail notification turned on, and you've got to figure on the cost of context switching from whatever I was already working on when that little chime goes off (more than 3 minutes if the thought required is not trivial).

  • Re:Indeed (Score:3, Informative)

    by Just Some Guy ( 3352 ) <kirk+slashdot@strauser.com> on Wednesday February 16, 2005 @11:15AM (#11688742) Homepage Journal
    My poor little dialup domain has been receiving around 50-60,000 spams a day to those bogus accounts. It hit 120,000 one day.

    Two words:

    1. DNSBL
    2. Greylisting

    Add those to your setup and see that drop to about 30-40. Let SpamAssassin clean up the rest and forget about it.
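
    For anyone who hasn't met the two techniques, here is a rough Python sketch of the idea behind each. It is not how any particular mail server implements them: the zone dnsbl.example, the 300 second delay, and the names dnsbl_query_name and Greylist are all assumptions made for illustration. A DNSBL check reverses the connecting IPv4 address and looks it up under the blocklist's zone (a successful lookup means "listed"); greylisting temporarily rejects the first attempt from an unseen (client IP, sender, recipient) triplet and accepts once the sender retries after a delay.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    Only the name is built here; the actual DNS query is left out so the
    sketch stays runnable offline. The zone is a placeholder, not a real list.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a bad address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


class Greylist:
    """In-memory greylist keyed on (client IP, envelope sender, recipient)."""

    def __init__(self, min_delay_seconds=300):
        # Assumed delay; real deployments choose their own value.
        self.min_delay = min_delay_seconds
        self.first_seen = {}

    def check(self, client_ip, sender, recipient, now):
        """Return "accept" or "tempfail" for a delivery attempt at time now."""
        key = (client_ip, sender, recipient)
        # Remember when this triplet was first seen, if it is new.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example"))

    greylist = Greylist()
    triplet = ("192.0.2.99", "a@sender.example", "b@rcpt.example")
    for t in (0, 60, 400):
        print(t, greylist.check(*triplet, now=t))
```

    The reason greylisting cuts so much spam is that a lot of bulk-mailing software never retries after a temporary failure, while legitimate mail servers do.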
