Security Media Movies Privacy Your Rights Online

Anonymity of Netflix Prize Dataset Broken 164

KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXiv blog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arXiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details."
This discussion has been archived. No new comments can be posted.

  • Probabilities (Score:5, Insightful)

    by dj245 ( 732906 ) on Tuesday November 27, 2007 @10:31AM (#21491765) Homepage
    The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details.

    This is a loaded statement. The most you can determine is that if a person likes movies A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a Republican than not. You're still anonymous.

    Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.
  • Do what now? (Score:5, Insightful)

    by faloi ( 738831 ) on Tuesday November 27, 2007 @10:32AM (#21491785)
    It doesn't sound like the anonymity of the prize set was broken through any fault of NetFlix. It sounds like some sampling of users made the mistake of rating movies on a site where the info is publicly available, and a site where it's not. All they did was correlate the two.

    So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This sounds roughly like blaming the DMV for figuring out a car owner's likely political leanings by the bumper stickers on their car.
  • Re:Probabilities (Score:4, Insightful)

    by Se7enLC ( 714730 ) on Tuesday November 27, 2007 @10:39AM (#21491861) Homepage Journal
    I think they're on to something here. They cracked the anonymity by using the public movie ratings (and the dates those ratings were made) as a key. If the user has rated enough movies (especially some of the less-often-rated movies) you can uniquely identify which user they are. Once you know which user they are, you have now connected a username to the list of private ratings.

    Now, they go one step too far to say that you can determine anything but movie preferences out of a movie rating list. Just because somebody liked or disliked Brokeback Mountain doesn't mean they are gay or straight, just like their opinion of Michael Moore movies doesn't give political affiliation.

    It will tell you what movies they rented, though, and some people might not be happy having their movie-renting history publicly available.
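
A minimal sketch of the linking step described in the comment above, using made-up data. The record IDs, film titles, the `overlap` helper, and the 14-day tolerance are all illustrative assumptions, not the researchers' actual algorithm or data:

```python
from datetime import date

# Hypothetical toy data: (movie, stars, date rated). Not taken from any real dataset.
public_profile = [  # ratings one user posted openly under a known username
    ("Obscure Film A", 4, date(2005, 3, 1)),
    ("Obscure Film B", 2, date(2005, 3, 9)),
    ("Blockbuster X", 5, date(2005, 4, 2)),
]

anonymized = {  # anonymous record id -> private rating history
    1001: [
        ("Blockbuster X", 5, date(2005, 4, 1)),
        ("Blockbuster Y", 3, date(2005, 5, 5)),
    ],
    1002: [
        ("Obscure Film A", 4, date(2005, 3, 2)),
        ("Obscure Film B", 2, date(2005, 3, 10)),
        ("Blockbuster X", 5, date(2005, 4, 2)),
        ("Documentary Z", 1, date(2005, 6, 1)),
    ],
}


def overlap(public, private, max_days=14):
    """Count public ratings with a same-movie, same-stars, close-in-time private rating."""
    hits = 0
    for movie, stars, when in public:
        for p_movie, p_stars, p_when in private:
            close = abs((when - p_when).days) <= max_days
            if movie == p_movie and stars == p_stars and close:
                hits += 1
                break
    return hits


# Score every anonymous record against the public profile and keep the best one.
scores = {rid: overlap(public_profile, history) for rid, history in anonymized.items()}
best = max(scores, key=scores.get)

# Whatever the best record contains beyond the public ratings is what leaks.
public_movies = {movie for movie, _, _ in public_profile}
revealed = [r for r in anonymized[best] if r[0] not in public_movies]

print(scores)
print(best, revealed)
```

In this toy case the popular film matches both records, and only the two obscure films separate them, which is the point the summary makes about titles outside the top 100.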
  • by Anonymous Coward on Tuesday November 27, 2007 @10:41AM (#21491879)
    For those who haven't rated movies on IMDB, such as myself - and I imagine a large proportion of subscribers.
  • Re:Probabilities (Score:3, Insightful)

    by Dare nMc ( 468959 ) on Tuesday November 27, 2007 @10:53AM (#21491993)

    one step too far to say that you can determine anything but movie preferences out of a movie rating list.

    Also, you're taking an aggregate of the household. So say a household (we'll call them the Chen'ys) had a gay kid, and the devil living in the same house with a saint... good luck figuring out when the gay kid updates the queue, and when the wife or the devil is at the keyboard.
  • by puppetluva ( 46903 ) on Tuesday November 27, 2007 @11:02AM (#21492089)
    This is total hyperbole.

    All the researchers are saying is that they can deduce some of your preferences based on your other preferences. Of COURSE you can do that, that was the whole point of the contest Netflix put up.

    What they are _not_ saying is that they now know who you are, where you live, or anything uniquely identifying about you. So basically, you are still anonymous.

    I'm starting to tire of news headlines that claim the world is on fire when someone actually just does something slightly derivative from the norm and thinks they are brilliant. The noise from these non-events masks actual brilliant achievements and makes it seem that everyone is doing banal work.

  • Re:Do what now? (Score:5, Insightful)

    by IBBoard ( 1128019 ) on Tuesday November 27, 2007 @11:08AM (#21492169) Homepage
    Exactly - all they did was find that there was a correlation that might mean that the people are the same on IMDB and NetFlix. There's also the possibility that they're different people and that they just voted similarly in different places.

    Besides, this all relies on people a) voting for really obscure films so they can be easily identified and b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts.

    Just because two people from two different data sets both like (and are the only people in the data sets to like) lemon and custard jam as well as peanut butter with chips doesn't mean they're the same person, it just means they could be the same person and have similar tastes in obscure foods.
  • by SmallFurryCreature ( 593017 ) on Tuesday November 27, 2007 @11:11AM (#21492221) Journal

    As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.

    One example, Schindler's List, great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".

    On the other hand I got movies I can watch any day of the week, but that I would NEVER rate as highly. Cannonball Run is one such movie. I watch it far too often, but I wouldn't call it a good movie. You can always find me ready for a Jackie Chan movie or a spaghetti western.

    Is the netflix rating system a "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?

    Granted some people get it confused, probably the same people that use the slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.

    I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.

    I don't like the Godfather movies, I can see they are high quality, I just don't like them. So my rating them would be fairly high as for quality, but low for 'I want to see more like this'.

    I thought that the netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and a "I like this" rating system are the same? Or am I completely in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not like it, or vice versa?

  • Re:Do what now? (Score:4, Insightful)

    by Peter Mork ( 951443 ) <Peter.Mork@gmail.com> on Tuesday November 27, 2007 @11:14AM (#21492253) Homepage

    Exactly - all they did was find that there was a correlation that might mean that the people are the same on IMDB and NetFlix.

    Caveat: I haven't had a chance to pore over the statistical calculations. However, the paper notes that their similarity measure was 38 standard deviations from the norm. Assuming the math is valid, this seems on par with a DNA test, which also provides a correlation. I wouldn't be so quick to dismiss the results until you can find a serious methodological problem.
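
The "standard deviations" figure here reads like the paper's eccentricity measure. As a rough sketch, assuming eccentricity means the gap between the best and second-best candidate scores divided by the standard deviation of all candidate scores (that definition is an assumption on my part, and the scores below are invented), it could be computed like this:

```python
from statistics import pstdev


def eccentricity(scores):
    """Gap between the top two scores, in units of the scores' standard deviation.

    Assumed definition for illustration; expects at least two candidate scores.
    """
    ordered = sorted(scores, reverse=True)
    sigma = pstdev(scores)
    if sigma == 0:
        return 0.0  # all candidates tie, so no match stands out
    return (ordered[0] - ordered[1]) / sigma


# Hypothetical similarity scores of one public profile against six anonymous records.
clear_winner = [9.0, 1.0, 1.2, 0.8, 1.1, 0.9]
near_tie = [9.0, 8.8, 1.0, 1.2, 0.8, 1.1]

print(eccentricity(clear_winner))
print(eccentricity(near_tie))
```

A high value means one record stands far apart from every other candidate. A low value means the top two are hard to tell apart, which is the "two people with similar tastes" case raised earlier in the thread, and a matcher could decline to report a match in that situation.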

  • Re:Do what now? (Score:3, Insightful)

    by JPMH ( 100614 ) on Tuesday November 27, 2007 @11:15AM (#21492275)
    Their lesson is that it can take surprisingly little public information to identify you.

    For example, ratings on a scale of 1-5 for 2 movies, and a knowledge of when they were seen to within 14 days, was sufficient to identify the complete data histories of 40% of the Netflix clients. As the authors say, that's the kind of information colleagues give out every day around the water cooler.

    Repeating the experiment with a knowledge of 8 movies, 6 hits in the database would be sufficient to identify the personal histories of 99% of clients included in the Netflix data.
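
To make the idea of "how many known ratings pin down a record" concrete, here is a small sketch on invented data. The four records, the integer day numbers, and the `fraction_unique` helper are hypothetical, and the percentages quoted above are not reproduced by it:

```python
def consistent(known, record, max_days):
    """True if every known (movie, stars, day) has a matching entry in the record."""
    return all(
        any(m == rm and s == rs and abs(d - rd) <= max_days for rm, rs, rd in record)
        for m, s, d in known
    )


def fraction_unique(records, k, max_days=14):
    """Share of records that are the only match for their own first k ratings."""
    unique = 0
    for record in records:
        known = record[:k]  # stand-in for what an outsider might know
        matches = sum(consistent(known, other, max_days) for other in records)
        unique += (matches == 1)
    return unique / len(records)


# Hypothetical records: (movie, stars, day number on which it was rated).
records = [
    [("Popular 1", 5, 10), ("Popular 2", 4, 12), ("Rare A", 3, 40)],
    [("Popular 1", 5, 15), ("Popular 2", 4, 20), ("Rare B", 2, 41)],
    [("Popular 1", 5, 12), ("Popular 2", 2, 14), ("Rare C", 5, 90)],
    [("Popular 1", 3, 11), ("Popular 2", 4, 13), ("Rare D", 1, 70)],
]

for k in (1, 2, 3):
    print(k, fraction_unique(records, k))
```

Even in this tiny example the share of uniquely identified records climbs as more ratings are known, and tightening `max_days` narrows the candidates further.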

  • Re:Probabilities (Score:5, Insightful)

    by Chapter80 ( 926879 ) on Tuesday November 27, 2007 @11:15AM (#21492277)
    I think you're missing the point.

    If you rate a handful of movies on IMDb, under the persona "MyNickname12345" and that can be traced to your personal MySpace page, you have made that choice. No problem.

    If you then submit 100 movie ratings to Netflix, assuming that it is PRIVATE information that will not be linked back to you, and then Netflix releases the data to the public, now the 100 movies can be correlated to you, and your name can be revealed. Researchers have shown how PRIVATE DATA released to the public can be linked to already public information. PROBLEM!

  • by RocketJeff ( 46275 ) on Tuesday November 27, 2007 @11:18AM (#21492311) Homepage

    First, we can immediately find his political orientation based on his strong opinions about "Power and
    Terror: Noam Chomsky in Our Times" and "Fahrenheit 9/11." Strong guesses about his religious views can
    be made based on his ratings on "Jesus of Nazareth" and "The Gospel of John". He did not like "Super
    Size Me" at all; perhaps this implies something about his physical size? Both items that we found with
    predominantly gay themes, "Bent" and "Queer as folk" were rated one star out of five. He is a cultish
    follower of "Mystery Science Theater 3000". This is far from all we found about this one person, but having
    made our point, we will spare the reader further lurid details.


    Finding a paragraph like this in a research paper makes me call into question the motives and intentions of the 'researchers.' They seem sort of like the Jerry Springer of research (since he's just trying to help the families he has on his show...).

    They imply that the person didn't like "Super Size Me" because he's probably fat (or are they trying to imply that he has a problem with gaining weight and is jealous?).

    Also, they imply that because he rated two "predominantly gay theme" items as poor he must not be homosexual. Or are they implying that because he rented/rated these that he must be gay (because who would ever rent them otherwise).

    The fact that they use the "there's more juicy stuff about this guy, but we can't tell because we're serious researchers" line at the end is the pièce de résistance that really shows what motivates these researchers.
  • by Peter Mork ( 951443 ) <Peter.Mork@gmail.com> on Tuesday November 27, 2007 @11:21AM (#21492365) Homepage

    All the researchers are saying is that they can deduce some of your preferences based on your other preferences.

    The researchers are making a stronger claim. They are stating that based on actual public ratings (available from IMDB) they can generate actual private ratings published by Netflix under the guise of anonymity. As the paper notes, someone competing for the Netflix prize could use this data to improve the accuracy of their prediction algorithm. However, the point of this paper is to reveal that public ratings can be used to identify purportedly anonymous private ratings.

    As a comparison, imagine if the public information consisted of the dates that various people went to the doctor for a yearly physical. This is hardly sensitive information. Now imagine that your insurance company provided a list of (id, date, diagnosis) records. Ostensibly, the id field is an arbitrary (anonymous) identifier. The paper shows that based on limited background knowledge (a handful of (date, 'physical exam') records), an attacker could reverse engineer your diagnosis history.

  • by TubeSteak ( 669689 ) on Tuesday November 27, 2007 @11:38AM (#21492623) Journal

    Interesting, but not earth-shattering or a serious breach of privacy, I would say.
    And who exactly are you to say so?
    Because it isn't a Credit Card # or SSN it isn't serious?

    A) Some people would rather go to jail or commit suicide than admit to something embarrassing they'd rather keep private. Privacy isn't (just) about hiding (illegal) things from the Government.

    B) Demographic information is something you can never take back and can never change.
    At least I can get a new credit card & SSN.
  • by nagora ( 177841 ) on Tuesday November 27, 2007 @11:40AM (#21492659)
    You're missing the point completely. Other people will be using "data mining" of this sort, and making serious decisions about whether you support terrorism, or are just generally not a "good citizen", and they won't be revealing their judgments to the public to let them know what might be going on.

    TWW

  • by Danny Rathjens ( 8471 ) <slashdot2.rathjens@org> on Tuesday November 27, 2007 @12:08PM (#21493019)
    As far as I know in IMDB you are rating the overall quality of the movie, not I agree with it OR I want to see more like this.

    No. You give people way too much credit if you think their ratings on public sites are that nuanced or objective. I think most people just rate things on how well they like it themselves. A significant portion seem to even just give 10s to anything they like, too.

    I also find it amusing how the votes tend to congregate somewhere in the 3rd quartile a bit above average (e.g. 7 on a 1-10 scale) rather than 5.5 where it would be if people ranked things more fairly. (I wonder if this is associated with that effect where people always rank themselves above average despite evidence to the contrary, as well.)
  • by ps236 ( 965675 ) on Tuesday November 27, 2007 @12:22PM (#21493201)
    > I also find it amusing how the votes tend to congregate somewhere in the 3rd quartile a bit above average (e.g. 7 on a 1-10 scale) rather than 5.5 where it would be if people ranked things more fairly

    I'm not sure about that. People will tend to watch films they think/hope they will like. So, the ones where they think 'that'll be absolute poop' they won't bother watching, so, hopefully, won't bother rating.

    So, people should rate fewer films as 'poop' than as 'great', because they select only the 'hopefully good' films to review.

    If you forced people to go to see and review all films, even the ones where you have to drag them screaming through the door, then the average rating would almost certainly decrease considerably.
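
The selection effect argued above can be illustrated with a quick simulation. Everything here is an assumption made for the sketch: the uniform 1-10 "true enjoyment" scores, the noisy expectation, and the rule that a film only gets watched and rated when the viewer expects it to be above the midpoint:

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers


def simulate(n_films=10000, watch_threshold=5.5):
    """Return (mean rating if every film were rated, mean rating of films actually watched)."""
    all_ratings, watched_ratings = [], []
    for _ in range(n_films):
        quality = random.uniform(1, 10)               # how much the viewer would enjoy the film
        expectation = quality + random.gauss(0, 1.5)  # noisy guess made before watching
        all_ratings.append(quality)
        if expectation >= watch_threshold:            # only hopeful picks get watched and rated
            watched_ratings.append(quality)
    mean_all = sum(all_ratings) / len(all_ratings)
    mean_watched = sum(watched_ratings) / len(watched_ratings)
    return mean_all, mean_watched


mean_all, mean_watched = simulate()
print(round(mean_all, 2), round(mean_watched, 2))
```

Under these assumptions the average over all films sits near the 5.5 midpoint, while the average over self-selected films lands well above it, without any viewer rating unfairly.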

  • by amccaf1 ( 813772 ) on Tuesday November 27, 2007 @12:31PM (#21493345)
    From TFA:

    He did not like "Super Size Me" at all; perhaps this implies something about his physical size?
    Or maybe he's a manager of a McDonalds. Or a part-time Ronald McDonald. Or...
  • Re:Do what now? (Score:3, Insightful)

    by IBBoard ( 1128019 ) on Tuesday November 27, 2007 @12:42PM (#21493487) Homepage
    While yes, they did get a very perfect match on that record, the line about it is:

    ...our algorithm identified the records of two users in the Netflix Prize dataset with eccentricities of around 28 and 15, respectively.


    Granted they went for a small number of IMDB users due to their TOS, but that's still a tiny fraction. They mention finding a perfect match in IMDB and 1/8th of the NetFlix database towards the start of the report (although the sentence is a bit clunky and unclear). If that's their general accuracy then even if they can perfectly match some people (a statistical possibility) then they can't match enough to leave most people needing to worry.
  • Re:Do what now? (Score:3, Insightful)

    by yali ( 209015 ) on Tuesday November 27, 2007 @04:57PM (#21497043)

    So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right?

    Nope, it's more complicated than that.

    Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, linked to your real identity, you only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to take advantage of their recommendations. So you rate all the same movies that you rated on IMDb, but you also post your ratings of Fahrenheit 9/11, The Corporation, etc. With the method described in this paper, somebody could potentially link your supposedly anonymized political ratings back to your real identity.

  • by Minwee ( 522556 ) <dcr@neverwhen.org> on Tuesday November 27, 2007 @11:21PM (#21500681) Homepage
    Is that any more surreal than a form of "entertainment" in which people get shot at or blown up every five minutes or so?
