News for nerds, stuff that matters

Anonymity of Netflix Prize Dataset Broken

Posted by Zonk on Tue Nov 27, 2007 09:23 AM
from the there-are-degrees-of-anonymity dept.
KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXivblog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arxiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details"

Related Stories

[+] Your Rights Online: AOL, Netflix and the End of Open Research 85 comments
An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"
[+] Developers: Psychologist Beating Math Nerds in Race to Netflix Prize 205 comments
s1d writes "An almost-anonymous British psychologist named Gavin Potter has suddenly risen to the top of the Netflix prize charts. With his very first attempt, he got a score which took the BellKor team seven months to reach. Currently at a score of 8.07, he has only five teams ahead of him now in the race for the ultimate Netflix algorithm. 'Potter says his anonymity is mostly accidental. He started that way and didn't come out into the open until after Wired found him. "I guess I didn't think it was worth putting up a link until I had got somewhere," he says, adding that he'd been seriously posting under the name of his venture capital and consulting firm, Mathematical Capital, for two months before launching "Just a guy." When he started competing, he posted to his blog: "Decided to take the Netflix Prize seriously. Looks kind of fun. Not sure where I will get to as I am not an academic or a mathematician. However, being an unemployed psychologist I do have a bit of time."'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by tygerstripes (832644) on Tuesday November 27 2007, @09:29AM (#21491743)
    Who goes out of their way to rate "Anal Whores 3" online?
    • Bill Clinton?
    • by mh1997 (1065630) on Tuesday November 27 2007, @09:46AM (#21491931)

      Who goes out of their way to rate "Anal Whores 3" online?
      The good thing about porn flicks, as a general rule, is that they're too bland to have really bad plots. The search for good dialogue strays too far off the beaten path established by the social mores of the target market, be that old men, college students, or perverts out on dates. There are pornos with solid plots, just rarely pornos with complicated plots.

      What they generally aren't is full of capers designed by crackheads in search of sexual relief, or a dominatrix dying to destroy the gold market with a Da Vinci alchemy machine only a cat burglar from Hoboken could steal.

      Yes, the plot of Anal Whores 3 is as convoluted as it is kitschy. Mercedes and Veronica Diamond forcibly enlist the help of happy-go-lucky and half-a-second-out-of-prison pizza delivery man Hawk (Peter North) to steal the pieces to a machine that turns lead vibrators into gold. Hawk isn't halfway to a cup of coffee with his wise cracking cohort, Tommy (Johnny Cockring) when he finds himself back in the burglary game. Casing out a heist he meets nun/professional patron of the arts/double agent/love interest Jessie Jane (vows of bestiality can put the kibosh on even the best of cinematic love interests). When you throw in a CIA agent (Dick Coburn) and a couple of double dildos, you've managed to make the world's most convoluted porno....

  • Probabilities (Score:5, Insightful)

    by dj245 (732906) on Tuesday November 27 2007, @09:31AM (#21491765) Homepage
    The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details"

    This is a loaded statement. The most you can determine is that if a person likes movie A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a republican than not. You're still anonymous.

    Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.
    • Re:Probabilities (Score:4, Insightful)

      by Se7enLC (714730) on Tuesday November 27 2007, @09:39AM (#21491861) Homepage Journal
      I think they're on to something here. They cracked the anonymity by using the public movie ratings (and the dates those ratings were made) as a key. If the user has rated enough movies (especially some of the less-often-rated movies) you can uniquely identify which user they are. Once you know which user they are, you have now connected a username to the list of private ratings.

      Now, they go one step too far to say that you can determine anything but movie preferences out of a movie rating list. Just because somebody liked or disliked Brokeback Mountain doesn't mean they are gay or straight, just like their opinion of Michael Moore movies doesn't give political affiliation.

      It will tell you what movies they rented, though, and some people might not be happy having their movie-renting history publicly available.
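
The linking step described in the comment above can be sketched roughly as follows. This is a minimal illustration, not the researchers' actual algorithm: the toy records, the 14-day tolerance, and the `1 / log(support + 1)` rarity weight are all assumptions chosen for the example. It shows why agreement on a rarely rated film counts for more than agreement on a blockbuster.

```python
from datetime import date
from math import log

# Hypothetical toy data: record id -> {movie: (stars, rating date)}.
anonymized = {
    "user_1": {"Popular Hit": (4, date(2005, 3, 1)), "Obscure Doc": (5, date(2005, 3, 9))},
    "user_2": {"Popular Hit": (4, date(2005, 3, 2)), "Cult Film": (2, date(2005, 6, 1))},
    "user_3": {"Popular Hit": (3, date(2005, 1, 5))},
}

# What an outsider can read off a public profile (also made up).
public_profile = {
    "Popular Hit": (4, date(2005, 3, 3)),
    "Obscure Doc": (5, date(2005, 3, 10)),
}


def support(movie, records):
    """Number of records that rated this movie."""
    return sum(1 for ratings in records.values() if movie in ratings)


def score(profile, ratings, records, max_days=14):
    """Sum rarity weights over movies where stars and date roughly agree."""
    total = 0.0
    for movie, (stars, when) in profile.items():
        if movie not in ratings:
            continue
        other_stars, other_when = ratings[movie]
        if other_stars == stars and abs((other_when - when).days) <= max_days:
            # Rarely rated movies count for more; the +1 avoids log(1) == 0.
            total += 1.0 / log(support(movie, records) + 1)
    return total


def best_match(profile, records):
    """Return the highest-scoring record id and the full score table."""
    scores = {rid: score(profile, r, records) for rid, r in records.items()}
    return max(scores, key=scores.get), scores


best, scores = best_match(public_profile, anonymized)
print(best, scores)
```

Here `user_1` wins because it agrees on the obscure title as well as the popular one, while `user_2` only agrees on the film everybody rated.
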
      • Re: (Score:3, Insightful)

        one step too far to say that you can determine anything but movie preferences out of a movie rating list.

        Also, you're taking an aggregate of the household. Say a household (we'll call them the Chen'ys) has a gay kid, and the devil living in the same house with a saint... good luck figuring out when the gay kid updates the queue, and when the wife, or the devil, is at the keyboard.
        • Re: (Score:3, Interesting)

          Some tech-savvy households may enable profiles on Netflix, enabling each person to track their likes & dislikes independently. (I did this for my GF, who has wildly disparate tastes from me). I'm not sure what effect that would have on the data. It'd certainly be neat if the scientists could differentiate between individual and multiple users using a particular profile.
        • Or maybe they like cowboy films and are open minded, as well as liking expose material and documentaries (not sure about the others).

          Maybe they're not, but there's always the possibility.
    • From the paper:

      First, we can immediately find his political orientation based on his strong opinions about "Power and Terror: Noam Chomsky in Our Times" and "Fahrenheit 9/11." Strong guesses about his religious views can be made based on his ratings on "Jesus of Nazareth" and "The Gospel of John". He did not like "Super Size Me" at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, "Bent" and "Queer as folk" were rated one star out of five.

    • Re:Probabilities (Score:5, Insightful)

      by Chapter80 (926879) on Tuesday November 27 2007, @10:15AM (#21492277)
      I think you're missing the point.

      If you rate a handful of movies on IMDb, under the persona "MyNickname12345" and that can be traced to your personal MySpace page, you have made that choice. No problem.

      If you then submit 100 movie ratings to Netflix, assuming that it is PRIVATE information that will not be linked back to you, and then Netflix releases the data to the public, now the 100 movies can be correlated to you, and your name can be revealed. Researchers have shown how PRIVATE DATA released to the public can be linked to already public information. PROBLEM!

  • Privacy is becoming a fleeting thing in this interconnected world. Perhaps we should reanalyze our perspective on it all?
    • Re: (Score:3, Informative)

      Perhaps if we're obscure and pretentious enough, no one will want to spy on us! Brillant!

      The world changes. Learn to live with it.
      • by phobos13013 (813040) on Tuesday November 27 2007, @11:15AM (#21493125)
        Actually TFA seems to suggest that the more obscure and pretentious we are, the easier it is to track us. If we become homogeneous drones voting on the top 100 films, we are safe! Even so, I don't plan to become a homogeneous drone...
    • But then we'd have to re-analyze capitalism itself, and I don't think society is ready for *that*. Rich people would simply pay for organizations to falsify their data; it would be one-sided.
  • Do what now? (Score:5, Insightful)

    by faloi (738831) on Tuesday November 27 2007, @09:32AM (#21491785)
    It doesn't sound like the anonymity of the prize set was broken through any fault of NetFlix. It sounds like some sampling of users made the mistake of rating movies on a site where the info is publicly available, and a site where it's not. All they did was correlate the two.

    So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This sounds roughly like blaming the DMV for figuring out a car owner's likely political leanings by the bumper stickers on their car.
    • was broken through any fault of NetFlix.

      Just because someone chose to go public with liking "The Rise of Theodore Roosevelt" doesn't mean they should have known that the company would take some seemingly private data linking them to really liking "Brokeback Mountain" and the series "The L Word" and publicize it later, and that the combination of their public post and the released data now violates Netflix's privacy policy (in spirit).
      I.e., they say they will disclose anything but your reviews only "on an anonymous basis".

    • Re:Do what now? (Score:5, Insightful)

      by IBBoard (1128019) on Tuesday November 27 2007, @10:08AM (#21492169) Homepage
      Exactly - all they did was find that there was a correlation that might mean that the people are the same on IMDB and NetFlix. There's also the possibility that they're different people and that they just voted similarly in different places.

      Besides, this all relies on people voting for a) really obscure films so they can be easily identified and b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts.

      Just because two people from two different data sets both like (and are the only people in the data sets to like) lemon and custard jam as well as peanut butter with chips doesn't mean they're the same person, it just means they could be the same person and have similar tastes in obscure foods.
      • Re:Do what now? (Score:4, Insightful)

        by Peter Mork (951443) <Peter.Mork@gmail.com> on Tuesday November 27 2007, @10:14AM (#21492253) Homepage

        Exactly - all they did was find that there was a correlation that might mean that the people are the same on IMDB and NetFlix.

        Caveat: I haven't had a chance to pore over the statistical calculations. However, the paper notes that their similarity measure was 38 standard deviations from the norm. Assuming the math is valid, this seems on par with a DNA test, which also provides a correlation. I wouldn't be so quick to dismiss the results until you can find a serious methodological problem.

        • Re: (Score:3, Insightful)

          While yes, they did get a very perfect match on that record, the line about it is:

          ...our algorithm identified the records of two users in the Netflix Prize dataset with eccentricities of around 28 and 15, respectively.

          Granted they went for a small number of IMDB users due to their TOS, but that's still a tiny fraction. They mention finding a perfect match in IMDB and 1/8th of the NetFlix database towards the start of the report (although the sentence is a bit clunky and unclear). If that's their general accura
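
For readers wondering what a figure like "38 standard deviations" or an "eccentricity of 28" is measuring, here is a small sketch. It assumes eccentricity means the gap between the best and second-best candidate scores, divided by the standard deviation of all candidate scores; that definition is a reading of the paper as discussed in this thread, and the score lists are invented for illustration.

```python
from statistics import pstdev


def eccentricity(scores):
    """(best - second best) / standard deviation of all candidate scores."""
    if len(scores) < 2:
        return 0.0
    spread = pstdev(scores)
    if spread == 0:
        return 0.0  # every candidate looks the same, so nothing stands out
    ordered = sorted(scores, reverse=True)
    return (ordered[0] - ordered[1]) / spread


# Hypothetical score tables for two lookups.
clear_winner = [9.0, 1.0, 1.2, 0.8, 1.1]   # one record towers over the rest
ambiguous = [3.0, 2.9, 2.8, 3.1, 2.7]      # best and runner-up nearly tie

print(eccentricity(clear_winner))
print(eccentricity(ambiguous))
```

A matcher built this way would report a match only when the eccentricity clears some threshold, which is how it can decline to answer when two people merely have similar tastes.
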

      • Re:Do what now? (Score:4, Informative)

        by arvindn (542080) on Tuesday November 27 2007, @12:08PM (#21493827) Homepage Journal
        "Besides, this all relies on people voting for a) really obscure films so they can be easily identified "

        not true -- obscure films help a little bit but not too much. we put up a recent draft of our paper in which the dependence on obscure movies is much reduced.

        "b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts."

        again not true at all. one of the main claims of our paper is that our method is tolerant to an INCREDIBLE amount of noise. we have the math to back this up.

        --Arvind Narayanan

    • Re: (Score:3, Insightful)

      Their lesson is that it can take surprisingly little public information to identify you.

      For example, ratings on a scale of 1-5 for 2 movies, and a knowledge of when they were seen to within 14 days, was sufficient to identify the complete data histories of 40% of the Netflix clients. As the authors say, that's the kind of information colleagues give out every day around the water cooler.

      Repeating the experiment with a knowledge of 8 movies, 6 hits in the database would be sufficient to identify the per
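
The "two movies plus approximate dates" claim can be pictured as a simple filter over the released records. The sketch below uses three invented records and a hypothetical `candidates` helper; it does not reproduce the 40% figure, it only shows how each extra water-cooler fact shrinks the set of records that could be you.

```python
from datetime import date, timedelta

# Hypothetical released records: id -> {movie: (stars, rating date)}.
records = {
    "a": {"M1": (4, date(2005, 1, 10)), "M2": (5, date(2005, 2, 1))},
    "b": {"M1": (4, date(2005, 1, 12)), "M2": (2, date(2005, 2, 3))},
    "c": {"M1": (4, date(2005, 6, 1)), "M3": (1, date(2005, 6, 2))},
}


def candidates(records, known, max_days=14):
    """Ids of records consistent with every known (movie, stars, date) fact."""
    window = timedelta(days=max_days)
    hits = []
    for rid, ratings in records.items():
        consistent = all(
            movie in ratings
            and ratings[movie][0] == stars
            and abs(ratings[movie][1] - when) <= window
            for movie, stars, when in known
        )
        if consistent:
            hits.append(rid)
    return hits


one_fact = [("M1", 4, date(2005, 1, 15))]
two_facts = one_fact + [("M2", 5, date(2005, 2, 5))]

print(candidates(records, one_fact))   # still ambiguous
print(candidates(records, two_facts))  # narrowed down
```

With one known rating two records remain possible; adding the second leaves a single record, whose entire remaining history is then exposed.
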

    • Re: (Score:3, Insightful)

      So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right?

      Nope, it's more complicated than that.

      Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, linked to your real identity, you only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to

  • by CastrTroy (595695) on Tuesday November 27 2007, @09:36AM (#21491815) Homepage
    Seems like it was only broken because the identity of the people was posted somewhere else, along with the ratings. My only question is how they connected the rankings on Netflix to the rankings on IMDB. Does Netflix take the liberty of submitting all the users' rankings to IMDB for them, and also include their name with this data? If you just have anonymous dataset A, with anonymous dataset B, you could match up users from both and figure out which person in A is the same person in B, but you still wouldn't know who the person is. However, if you now have dataset B be not anonymous, then it's not too difficult to compare movie ratings and find out who the people are.
    • They are just saying it is likely a person rated a movie on Netflix and IMDb at roughly the same time. That is the correlation which is needed to connect the anonymous with the publicly posted information.

      While I do rate a few films on IMDb I usually do them in batches, where on Netflix I rate the movie as soon as I'm finished viewing it. So the time link wouldn't be there between my two accounts.
    • What NetFlix did that was stupid was include the names of the movies in their dataset. There was no need for this for the prize (unless anybody was using the names for prediction I suppose), anonymous identifiers would have been okay.
      • If a person liked season 1 of Stargate: SG1 it would be a good idea to recommend season 2 to them. Goes for sequels too. So yeah, titles are needed a little bit.
  • did it work? (Score:3, Interesting)

    by Speare (84249) on Tuesday November 27 2007, @09:39AM (#21491859) Homepage

    The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details

    {tongueincheek}Yeah, but the question is, will knowing those personal facts generate better movie recommendations?{/tongueincheek}

    When there's a significant prize at stake, researchers can try all sorts of slimy tricks to win. (I'm not saying that's the motive behind this report, but there are many "researchers" going for the prize.) And when there are significant profits at stake, a corporation will damn-fire-certainly use whatever means they can use to maximize those profits, regardless of whether it might be "ethical."

  • by Anonymous Coward
    For those who haven't rated movies on IMDB, such as myself - and I imagine a large proportion of subscribers.
  • by Anonymous Coward on Tuesday November 27 2007, @09:54AM (#21492005)
    There are two things going on here. One, many people are asking how you could identify any personal information about people based on their movie preferences. The answer is data-mining. Very sophisticated techniques exist to do things exactly like this, i.e. take a data set and find out about the people.

    The second problem is that by deanonymizing the NetFlix data, you can start to cheat on the NetFlix prize. The requirement to win $1 million is that your recommendation engine is 10% better than the one they are currently using. However, if you can learn the exact preferences of some users in the dataset (i.e. by finding the rest of their ratings on IMDB) then you can hardcode that into your recommendation engine and get the recommendations for these users exactly right. This can boost your score even though your actual system is no better than the existing one. This is known as over-fitting to the data.

    Finally, this paper is over a year old. Can we please have some new news?
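
The cheating scenario in the comment above can be made concrete with a toy score calculation. The Netflix Prize was scored on root mean squared error (RMSE); everything else here is an assumption for illustration: the four ratings, the "always guess 3" stand-in model, and the premise that the recovered ratings happen to overlap the ones being scored.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error over the (user, movie) pairs in `actual`."""
    squared = sum((predicted[key] - actual[key]) ** 2 for key in actual)
    return sqrt(squared / len(actual))


# Hypothetical held-out ratings the contest would score against.
actual = {("u1", "A"): 5, ("u1", "B"): 1, ("u2", "A"): 3, ("u2", "B"): 4}

# Stand-in recommender: always predicts 3 stars.
model = {key: 3.0 for key in actual}

# Ratings for u1 recovered from an outside source (made up).
leaked = {("u1", "A"): 5, ("u1", "B"): 1}

# "Cheating" predictions: use the leaked value wherever one exists.
cheat = {**model, **leaked}

print(rmse(model, actual), rmse(cheat, actual))
```

The model itself has not improved at all, yet its measured error drops, which is the sense in which hardcoding recovered ratings would game the leaderboard.
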
  • by Thanshin (1188877) on Tuesday November 27 2007, @09:55AM (#21492011)
    Every time you feel the need to vote 10 in Glitter, also vote 10 to The Godfather.
    Every time you cheer for Brokeback Mountain, also put a 10 in Huge Knockers MXII.
    Every time you want to express your love for Dersu Uzala, vote a 10 in Spice World, with added commentaries.

    That way, everybody will know you're a security conscious computer scientist. Or a schizophrenic moron.
  • by call -151 (230520) * on Tuesday November 27 2007, @09:55AM (#21492013) Homepage
    The summary is somewhat misleading- the only accounts that can be identified are those that belong to people who also rate on IMDB and who have thus chosen to make at least some of their ratings public. If person X rates 1000 movies on Netflix and has made 20 or so ratings on IMDB publicly available, then it is possible to infer with some small uncertainty which of the anonymized individuals in the NetFlix database they are. Thus you have possibly figured out their ratings of the other 980 movies they rated for Netflix but did not post on IMDB. Interesting, but not earth-shattering or a serious breach of privacy, I would say.
    • Interesting, but not earth-shattering or a serious breach of privacy, I would say.

      And who exactly are you to say so?
      Because it isn't a Credit Card # or SSN it isn't serious?

      A) Some people would rather go to jail or commit suicide than admit to something embarrassing they'd rather keep private. Privacy isn't (just) about hiding (illegal) things from the Government.

      B) Demographic information is something you can never take back and can never change.
      At least I can get a new credit card & SSN.

  • by puppetluva (46903) on Tuesday November 27 2007, @10:02AM (#21492089)
    This is total hyperbole.

    All the researchers are saying is that they can deduce some of your preferences based on your other preferences. Of COURSE you can do that, that was the whole point of the contest Netflix put up.

    What they are _not_ saying is that they now know who you are, where you live, or anything uniquely identifying about you. So basically, you are still anonymous.

    I'm starting to tire of news headlines that claim the world is on fire when someone actually just does something slightly derivative from the norm and thinks they are brilliant. The noise from these non-events masks actual brilliant achievements and makes it seem that everyone is doing banal work.

    • Re: (Score:3, Insightful)

      All the researchers are saying is that they can deduce some of your preferences based on your other preferences.

      The researchers are making a stronger claim. They are stating that based on actual public ratings (available from IMDB) they can generate actual private ratings published by Netflix under the guise of anonymity. As the paper notes, someone competing for the Netflix prize could use this data to improve the accuracy of their prediction algorithm. However, the point of this paper is to reveal t

    • Re: (Score:3, Informative)

      On the other hand, if somebody *already* knows who you are, the lesson is that it can take surprisingly little public information to identify your entire history of ratings at Netflix.

      For example, the authors found for 40% of individuals, accurate ratings on a scale of 1-5 for only *two* random movies, together with a knowledge to within 14 days of when they were seen, would be sufficient to identify an individual in the dataset. As they comment, that's the kind of information colleagues give out every day a
  • by SmallFurryCreature (593017) on Tuesday November 27 2007, @10:11AM (#21492221) Journal

    As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.

    One example, Schindler's List, great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".

    On the other hand I got movies I can watch any day of the week, but that I would NEVER rate as highly. Cannonball Run is one such movie. I watch it far too often, but I wouldn't call it a good movie. You can always find me ready for a Jackie Chan movie or a spaghetti western.

    Is the netflix rating system a "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?

    Granted some people get it confused, probably the same people that use the slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.

    I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.

    I don't like the Godfather movies, I can see they are high quality, I just don't like them. So my rating them would be fairly high as for quality, but low for 'I want to see more like this'.

    I thought that the netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and a "I like this" rating system are the same? Or am I completely in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not like it, or vice versa?

    • Re: (Score:3, Interesting)

      One example, Schindler's List, great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".

      Just out of curiosity, why don't you want to see those films again? both of them are really good films and although I would not see them every weekend (as for example Sin City), I enjoy watching them from time to time. The plot is interesting, the photography/drawing is nice and the screen writing is wel
      • The comment "favorite movie I never want to see again" is one I got from a review of Grave of the Fireflies that I just happened to totally agree with. Don't read the reviews, just watch it yourself, and if you are not into anime just set that aside for the duration of the movie, then ask yourself again if you can understand that comment.

        It is a powerful movie, like Schindler's List, but not a happy tale. I am not talking a tear jerker movie here, I am talking a "we will all burn in hell for this" movie. Te

    • Re: (Score:3, Insightful)

      As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.

      No. You give people way too much credit if you think their ratings on public sites are that nuanced or objective. I think most people just rate things on how well they like it themselves. A significant portion seem to even just give 10s to anything they like, too.

      I also find it amusing how the votes tend to congregate somewhere in the 3rd quartile a bit above average(e.g. 7 on
  • ...would be a lot more appreciative of this proof of concept if someone trawled Slashdot threads to see how often you feed trolls by responding to comments with a "-1" rating... :P
  • Wait - you mean if I enjoyed a movie with a gay theme, people are going to assume I'm gay?

    Anyone think the IMDB rating of Brokeback Mountain is going to plummet dramatically? (It is 7.8 today)

    And of course, if it does, we will be able to correlate the timing of the sudden drop with the publishing of this slashdot article, allowing us to link the slashdot readership with imdb users. Now we have your Netflix ratings, IMDB ratings, AND slashdot postings all correlated...
  • by AmiMoJo (196126) <mojo.world3@net> on Wednesday November 28 2007, @09:20AM (#21504043) Homepage
    None of the mainstream media picked up on it, but I remember thinking this sort of thing might be possible with the data lost by HMRC too. I bet Tesco would love to get their hands on it for planning where to put new stores and what to stock etc. Combined with their Clubcard database, of course.
    • by nagora (177841) on Tuesday November 27 2007, @10:40AM (#21492659)
      You're missing the point completely. Other people will be using "data mining" of this sort, and making serious decisions about whether you support terrorism, or are just generally not a "good citizen", and they won't be revealing their judgments to the public to let them know what might be going on.

      TWW

    • Re: (Score:3, Insightful)

      From TFA:

      He did not like "Super Size Me" at all; perhaps this implies something about his physical size?
      Or maybe he's a manager of a McDonalds. Or a part-time Ronald McDonald. Or...