Anonymity of Netflix Prize Dataset Broken
KentuckyFC writes "The anonymity of the Netflix Prize dataset has been broken by a pair of computer scientists from the University of Texas, according to a report from the physics arXiv blog. It turns out that an individual's set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it's straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb) (abstract on the physics arXiv). The researchers used this method to find how individuals on the IMDb privately rated films on Netflix, in the process possibly working out their political affiliation, sexual preferences and a number of other personal details."
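The matching idea in the summary can be sketched roughly as follows. This is a toy illustration, not the researchers' actual algorithm: the function names, the 14-day window, the exact-rating requirement, and the `1 / log(1 + popularity)` weighting are all assumptions chosen to reflect the point that agreement on less popular films is more identifying.

```python
import math
from datetime import date

def match_score(public, candidate, popularity, max_days=14):
    """Score one anonymized record against a public profile.

    Each movie where the rating agrees and the dates fall within
    max_days adds a weight that is larger for rarely rated movies.
    The weighting scheme is a hypothetical choice for this sketch.
    """
    score = 0.0
    for movie, (rating, day) in public.items():
        if movie not in candidate:
            continue
        c_rating, c_day = candidate[movie]
        if c_rating == rating and abs((c_day - day).days) <= max_days:
            score += 1.0 / math.log(1 + popularity[movie])
    return score

def best_match(public, anonymized, popularity):
    """Return the id of the highest-scoring anonymized record and all scores."""
    scores = {uid: match_score(public, rec, popularity)
              for uid, rec in anonymized.items()}
    return max(scores, key=scores.get), scores

# Toy data: number of people who rated each movie (made up).
popularity = {"Blockbuster": 10000, "Obscure A": 12, "Obscure B": 7}

# Ratings someone posted publicly, as (stars, date).
public_profile = {
    "Blockbuster": (5, date(2005, 3, 1)),
    "Obscure A": (2, date(2005, 3, 10)),
    "Obscure B": (4, date(2005, 4, 2)),
}

# "Anonymized" records keyed by an opaque id.
anonymized = {
    1001: {"Blockbuster": (5, date(2005, 3, 2)),
           "Obscure A": (4, date(2005, 6, 1))},
    1002: {"Blockbuster": (5, date(2005, 2, 27)),
           "Obscure A": (2, date(2005, 3, 12)),
           "Obscure B": (4, date(2005, 4, 5))},
    1003: {"Obscure B": (1, date(2005, 4, 3))},
}

best_id, scores = best_match(public_profile, anonymized, popularity)
print(best_id, scores)
```

In this toy data, record 1001 agrees only on the blockbuster and scores low, while record 1002 agrees on both obscure films and scores far higher.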
Sexual preferences? (Score:5, Funny)
Re: (Score:2, Funny)
Re:Sexual preferences? (Score:5, Funny)
What they generally aren't is full of capers designed by crackheads in search of sexual relief, or a dominatrix dying to destroy the gold market with a Da Vinci alchemy machine only a cat burglar from Hoboken could steal.
Yes, the plot of Anal Whores 3 is as convoluted as it is kitschy. Mercedes and Veronica Diamond forcibly enlist the help of happy-go-lucky and half-a-second-out-of-prison pizza delivery man Hawk (Peter North) to steal the pieces to a machine that turns lead vibrators into gold. Hawk isn't halfway to a cup of coffee with his wise cracking cohort, Tommy (Johnny Cockring) when he finds himself back in the burglary game. Casing out a heist he meets nun/professional patron of the arts/double agent/love interest Jessie Jane (vows of bestiality can put the kibosh on even the best of cinematic love interests). When you throw in a CIA agent (Dick Coburn) and a couple of double dildos, you've managed to make the world's most convoluted porno....
Re:Sexual preferences? (Score:5, Interesting)
Re:Sexual preferences? (Score:4, Informative)
Re:Sexual preferences? (Score:5, Funny)
Re: (Score:3, Informative)
Andie MacDowell imitates a dolphin in it too.
OT: if you like Hudson Hawk, you'll also like.... (Score:2)
Both star Bruce Willis, interestingly enough.
"Hey mister, are you gonna die?"
"Do you know what it's like to be called Chlamydia for a year?"
"You are a slender reed compared to that guard"
Both HH and 5E are in my top 10 movies. And the commentary on Hudson Hawk is great - they talk about how they hired the narrator from Rocky & Bullwinkle, so that you'd know the tone they were taking. Fun stu
Re: (Score:2)
"Looks like Bunny's got today's balls balls."
Re: (Score:3, Funny)
I see what you've done there.....
Re: (Score:2)
It's kind of ............surreal.
Re: (Score:3, Insightful)
Probabilities (Score:5, Insightful)
This is a loaded statement. The most you can determine is that if a person liked movies A, B, C and D but hated E and F, there is a higher probability they are a guy. If they liked Z but didn't like X, there is a higher probability they might be a Republican than not. You're still anonymous.
Unless, of course, you're one of the three people that liked "Glitter". Then I think they might have something on you.
Re:Probabilities (Score:4, Insightful)
Now, they go one step too far in saying that you can determine anything but movie preferences from a movie rating list. Just because somebody liked or disliked Brokeback Mountain doesn't mean they are gay or straight, just as their opinion of Michael Moore movies doesn't give their political affiliation.
It will tell you what movies they rented, though, and some people might not be happy having their movie-renting history publicly available.
Re: (Score:3, Insightful)
Also, you're taking an aggregate of the household. Say a household (we'll call them the Chen'ys) has a gay kid and the devil living in the same house with a saint... good luck figuring out when the gay kid updates the queue, and when the wife or the devil is at the keyboard.
Re: (Score:3, Interesting)
Re: (Score:2)
Maybe they're not, but there's always the possibility.
Re: (Score:2)
From the paper (Score:2)
Re:Probabilities (Score:5, Insightful)
If you rate a handful of movies on IMDb under the persona "MyNickname12345", and that can be traced to your personal MySpace page, you have made that choice. No problem.
If you then submit 100 movie ratings to Netflix, assuming that it is PRIVATE information that will not be linked back to you, and then Netflix releases the data to the public, now the 100 movies can be correlated to you, and your name can be revealed. Researchers have shown how PRIVATE DATA released to the public can be linked to already public information. PROBLEM!
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
They released a lot more than some data: they published the algorithm. Anyone is free to write their own implementation of it. And anyone who is participating in the Netflix prize already has a copy of the database.
However, I do agree with you that they went about it in a responsible manner. They revealed it. Without their insight, we might have continued living in ignorance that some "unknown adversary" (external to Netflix) is already correlating our movie rental habit
Re: (Score:2)
most people underestimate the ease of correlating supposedly anonymized data, as has been shown time and time again.
Liked Brokeback Mountain == gay liberal cowboy (Score:2)
Re: (Score:2)
only a matter of time (Score:2)
Re: (Score:3, Informative)
The world changes. Learn to live with it.
Re:only a matter of time (Score:5, Interesting)
Re: (Score:2)
What a defeatist attitude. Why not try to change it into the world in which you want to live? You can be damn sure someone else is trying to change it into the world they want to live in, which may well be at odds with the world you want to live in, so if you just "learn to live with it" you're setting yourself up to be shat on from a great height.
If you don't like the idea of personal information being mined in this way, talk to your friends and write to your
Re: (Score:2)
Do what now? (Score:5, Insightful)
So the lesson is, basically, don't post stuff that you don't want to be public to a website that makes it public, right? This sounds roughly like blaming the DMV for figuring out a car owner's likely political leanings by the bumper stickers on their car.
Re: (Score:2)
Just because someone chooses to go public with liking "The Rise of Theodore Roosevelt" doesn't mean they should expect that the company will take some seemingly private data linking them to really liking "Brokeback Mountain" and the series "The L Word" and publicize it later, or that the combination of their public post and the released data now violates Netflix's privacy policy (in spirit).
I.e., they say they will disclose anything other than your reviews only "on an anonymous basis".
Re:Do what now? (Score:5, Insightful)
Besides, this all relies on people voting for a) really obscure films so they can be easily identified and b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts.
Just because two people from two different data sets both like (and are the only people in the data sets to like) lemon and custard jam as well as peanut butter with chips doesn't mean they're the same person, it just means they could be the same person and have similar tastes in obscure foods.
Re:Do what now? (Score:4, Insightful)
Caveat: I haven't had a chance to pore over the statistical calculations. However, the paper notes that their similarity measure was 38 standard deviations from the norm. Assuming the math is valid, this seems on par with a DNA test, which also provides a correlation. I wouldn't be so quick to dismiss the results until you can find a serious methodological problem.
Re: (Score:3, Insightful)
Granted they went for a small number of IMDB users due to their TOS, but that's still a tiny fraction. They mention finding a perfect match in IMDB and 1/8th of the NetFlix database towards the start of the report (although the sentence is a bit clunky and unclear). If that's their general accura
Re:Do what now? (Score:4, Informative)
not true -- obscure films help a little bit but not too much. we put up a recent draft of our paper in which the dependence on obscure movies is much reduced.
"b) voting similarly or identically on lots of films so that they can get a better idea as to whether it is the same person based on them liking the same films the same amounts."
again not true at all. one of the main claims of our paper is that our method is tolerant to an INCREDIBLE amount of noise. we have the math to back this up.
--Arvind Narayanan
Re: (Score:2)
NetFlix.dating anyone?
Re: (Score:3, Insightful)
For example, ratings on a scale of 1-5 for 2 movies, and a knowledge of when they were seen to within 14 days, was sufficient to identify the complete data histories of 40% of the Netflix clients. As the authors say, that's the kind of information colleagues give out every day around the water cooler.
Repeating the experiment with a knowledge of 8 movies, 6 hits in the database would be sufficient to identify the per
Re: (Score:2)
(unless I greatly misunderstand their method)
If you've only ever given your movie ratings to Netflix, then they still have no way to correlate that with any other source. (And even then, all they've shown is that User #1234 in the Netflix list is the same person as RandomPseudonym582 on IMDB, which personally I don't find to be terribly interesting.)
Re: (Score:2, Informative)
Re: (Score:3, Insightful)
Nope, it's more complicated than that.
Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, linked to your real identity, you only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to
Anonymity broken by stupidity (Score:3, Interesting)
Re: (Score:2)
While I do rate a few films on IMDb I usually do them in batches, where on Netflix I rate the movie as soon as I'm finished viewing it. So the time link wouldn't be there between my two accounts.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Including such data in the example sets would in turn make it possible to determine the correlation between movies and their internal ID numbers pretty quickly. Even i
Re: (Score:2)
But yeah, that introduces a data leak.
Re: (Score:2)
Are you sure? Are people using data from outside the training set? Because if what you say is true then they're essentially asking people to use some kind of probabilistic record linkage to include external databases, which would automatically include the personal identifiers. This would be highly dubious behaviour.
Re: (Score:2)
Which almost makes
did it work? (Score:3, Interesting)
{tongueincheek}Yeah, but the question is, will knowing those personal facts generate better movie recommendations?{/tongueincheek}
When there's a significant prize at stake, researchers can try all sorts of slimy tricks to win. (I'm not saying that's the motive behind this report, but there are many "researchers" going for the prize.) And when there's significant profits at stake, a corporation will damn-fire-certainly use whatever means they can use to maximize those profits, regardless of whether it might be "ethical."
Re: (Score:2)
When there's significant profits at stake, individual humans will damn-fire-certainly use whatever means they can use to maximize those profits, regardless of whether it might be "ethical".
How does this break anonymity? (Score:2, Insightful)
Data-mining and the actual problem (Score:4, Interesting)
The second problem is that by deanonymizing the NetFlix data, you can start to cheat on the NetFlix prize. The requirement to win $1 million is that your recommendation engine is 10% better than the one they are currently using. However, if you can learn the exact preferences of some users in the dataset (i.e. by finding the rest of their ratings on IMDB) then you can hardcode that into your recommendation engine and get the recommendations for these users exactly right. This can boost your score even though your actual system is no better than the existing one. This is known as over-fitting to the data.
Finally, this paper is over a year old. Can we please have some new news?
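A minimal sketch of the leak described above, assuming the score is a root-mean-square error (RMSE) over held-out ratings. The constant-3.0 "model", the toy ratings, and the `with_leak` helper are made up for illustration. Overwriting predictions with ratings recovered from elsewhere lowers the error without the model itself getting any better.

```python
import math

def rmse(predictions, truth):
    """Root-mean-square error over every (user, movie) pair in truth."""
    total = sum((predictions[key] - truth[key]) ** 2 for key in truth)
    return math.sqrt(total / len(truth))

def with_leak(model_predictions, leaked):
    """Overwrite model predictions with ratings learned from outside data."""
    patched = dict(model_predictions)
    patched.update(leaked)
    return patched

# Hidden test ratings (toy values).
truth = {
    ("u1", "m1"): 4, ("u1", "m2"): 2,
    ("u2", "m1"): 5, ("u2", "m3"): 1,
}

# A deliberately weak model that predicts 3.0 for everything.
baseline = {key: 3.0 for key in truth}

# Ratings for user u1 recovered by linking to a public profile (hypothetical).
leaked = {("u1", "m1"): 4, ("u1", "m2"): 2}

honest_rmse = rmse(baseline, truth)
leaky_rmse = rmse(with_leak(baseline, leaked), truth)
print(honest_rmse, leaky_rmse)
```

The second number is lower only because two answers were copied in, which is why a score improved this way says nothing about the quality of the recommender.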
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Sounds like a nice opportunity to learn something. I don't have any idea on how to implement any of the stuff I mentioned before, though.
Re: (Score:2)
Re: (Score:2)
This won't work because submitted algorithms are run against data that wasn't made public.
The data that wasn't made public might be further ratings from people who had data in the public test set. Imagine that both the public and private data sets are mirrored in the public rankings on IMDB. Anyone who was able to match your public data set against IMDB can "predict" all your other IMDB rankings 100%, and if any of those are in the private data set then they'll get those exactly right.
I'm curious how the prediction is being tested against the private data, though; it may not be possible.
Easy solution (Score:4, Funny)
Every time you cheer for Brokeback Mountain, also put a 10 in Huge Knockers MXII.
Every time you want to express your love for Dersu Uzala, vote a 10 in Spice World, with added commentaries.
That way, everybody will know you're a security-conscious computer scientist. Or a schizophrenic moron.
Re: (Score:2)
Really, one of the biggest problems with Netflix ever getting good recommendations is that they are not trying to make recommendations for individuals. They are making recommendations for a group of people whose tastes may not cross over at all. You joke in your post about what movies get a 10 (well, 5 anyways), but would it seem unreasonable that a family of 4 could come up with those very ratings?
Re: (Score:2)
Re: (Score:2)
Wait. So did I.
requires another (partial)public revealing to work (Score:4, Informative)
Re:requires another (partial)public revealing to w (Score:3, Insightful)
Interesting, but not earth-shattering or a serious breach of privacy, I would say.
And who exactly are you to say so?
Because it isn't a Credit Card # or SSN it isn't serious?
A) Some people would rather go to jail or commit suicide than admit to something embarrassing they'd rather keep private. Privacy isn't (just) about hiding (illegal) things from the Government.
B) Demographic information is something you can never take back and can never change.
At least I can get a new credit card & SSN.
Re: (Score:2)
OT: is there a way to escape greaterthan/lessthan signs?
Re: (Score:2)
ampersand-lt-semicolon results in <
ampersand-gt-semicolon results in >
(no spaces or dashes.)
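For anyone generating comment text programmatically, the same escaping can be done with Python's standard `html` module. This is a small sketch; the sample string is made up.

```python
import html

raw = "if a < b and b > c"

# html.escape replaces &, < and > (and quotes, by default) with entities.
escaped = html.escape(raw)
print(escaped)  # if a &lt; b and b &gt; c

# html.unescape reverses the substitution.
restored = html.unescape(escaped)
```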
Re: (Score:2)
&gt; => >, &lt; => <
Re: (Score:2)
You have to use the HTML escape codes, which are &lt; and &gt;.
Re:requires another (partial)public revealing to w (Score:2)
Imagine a pastor who uses a recognizable username for many sites, including both IMDB and his church's web forums. He uses Netflix as a way to feed his secret love of movies with sexual content which his church would publicly denounce. Now these researchers could link his username to ratings for all these movies, and post the information online.
All it would take then is for a curious church mem
The world is not on fire (Score:3, Insightful)
All the researchers are saying is that they can deduce some of your preferences based on your other preferences. Of COURSE you can do that, that was the whole point of the contest Netflix put up.
What they are _not_ saying is that they now know who you are, where you live, or anything uniquely identifying about you. So basically, you are still anonymous.
I'm starting to tire of news headlines that claim the world is on fire when someone actually just does something slightly derivative from the norm and thinks they are brilliant. The noise from these non-events masks actual brilliant achievements and makes it seem that everyone is doing banal work.
Re: (Score:3, Insightful)
The researchers are making a stronger claim. They are stating that based on actual public ratings (available from IMDB) they can generate actual private ratings published by Netflix under the guise of anonymity. As the paper notes, someone competing for the Netflix prize could use this data to improve the accuracy of their prediction algorithm. However, the point of this paper is to reveal t
Re: (Score:2)
In addition, I'd point out that this can probably be generalized to (a) any anonymous data set that's combined with (b) some other non-anonymized data set that will map onto it. Here, we have (a) a sample of anonymized Netflix data and (b) a sample of non-anonymized IMDB ratings. So a lot of the reactions are either "Well, duh, if you post publicly to IMDB, you've posted publicly, so you're stupid!" or "it's only movies, who cares?"
I care. Not just in the hypothetical of "what i
Re: (Score:3, Informative)
For example, the authors found that for 40% of individuals, accurate ratings on a scale of 1-5 for only *two* random movies, together with knowledge to within 14 days of when they were seen, would be sufficient to identify an individual in the dataset. As they comment, that's the kind of information colleagues give out every day a
Re: (Score:2)
All the researchers are saying is that they can deduce some of your preferences based on your other preferences. Of COURSE you can do that, that was the whole point of the contest Netflix put up.
What they are _not_ saying is that they now know who you are, where you live, or anything uniquely identifying about you. So basically, you are still anonymous.
Did you even read the summary?
They took anonymous ratings & discovered they can link some of them to IMDB usernames. We can argue over whether or not those IMDB usernames are "uniquely identifying" or "anonymous" but they definitely say something about who you are.
I'm sure a percentage of those IMDB usernames are easily linked to real people through a trivial google search. Does that break this alleged veil of anonymity? Datamining isn't that hard these days.
Re: (Score:2)
Linkage of that kind is only useful if the user populations of IMDB commenters and Netflix commenters are the same (at least 50%) and most people make the same comments and ratings on both systems in the same way _most_ of the time. Chances are that if the populations are _not_ the same and that the commenters don't mostly duplicate their ratings for every movie in each place. . . In that case, you probably get more fal
Re: (Score:2)
They said you couldn't identify a person's record in the dataset even if you know some (or all!) of their ratings.
We showed that that's not true. Even if there's a LOT of noise. That's all there is to it.
--Arvind Narayanan
What are you rating in IMDB vs Netflix (Score:5, Insightful)
As far as I know, on IMDB you are rating the overall quality of the movie, not "I agree with it" or "I want to see more like this".
One example: Schindler's List. Great movie, do NOT want to see it again. Same with Grave of the Fireflies. Some movies just ain't for multiple viewings. They are my "favorite movies I never want to see again".
On the other hand, I've got movies I can watch any day of the week but that I would NEVER rate as highly. Cannonball Run is one such movie. I watch it far too often, but I wouldn't call it a good movie. You can always find me ready for a Jackie Chan movie or a spaghetti western.
Is the Netflix rating system an "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?
Granted, some people get it confused, probably the same people that use the Slashdot moderation system to silence views they don't like, but that only makes basing conclusions on user ratings even more problematic.
I can rate a movie highly even if I do not agree with it, simply because it is good. And I can rate a movie I really like to watch as crap simply because I know I like watching crap.
I don't like the Godfather movies. I can see they are high quality; I just don't like them. So my rating for them would be fairly high for quality, but low for 'I want to see more like this'.
I thought that the Netflix system was "I want to see more like this" based. Surely nobody is so stupid as to think a quality rating and an "I like this" rating system are the same? Or am I completely in the wrong in seeing a difference between the two? Am I insane in thinking that you can see a movie as being a great artwork and still not like it, or vice versa?
Re: (Score:2)
I must be a pessimist, but I don't believe the average Joe would agree with that statement. I think most people would see the two statements as synonymous. That is, if they even think about the distinction. Mostly I think they'd just grab their "gut" feeling and go with it.
I suppose we could test the argument by comparing movies that are ranked high on quality with total movie rentals or some other
Re: (Score:3, Interesting)
Just out of curiosity, why don't you want to see those films again? both of them are really good films and although I would not see them every weekend (as for example Sin City), I enjoy watching them from time to time. The plot is interesting, the photography/drawing is nice and the screen writing is wel
Simple as you said, I do NOT enjoy watching them (Score:3, Interesting)
The comment "favorite movie I never want to see again" is one I got from a review of Grave of the Fireflies that I just happened to totally agree with. Don't read the reviews, just watch it yourself, and if you are not into anime just set that aside for the duration of the movie. Then ask yourself again if you can understand that comment.
It is a powerful movie, like Schindler's List, but not a happy tale. I am not talking a tear jerker movie here, I am talking a "we will all burn in hell for this" movie. Te
Re: (Score:3, Insightful)
No. You give people way too much credit if you think their ratings on public sites are that nuanced or objective. I think most people just rate things on how well they like it themselves. A significant portion seem to even just give 10s to anything they like, too.
I also find it amusing how the votes tend to congregate somewhere in the 3rd quartile a bit above average(e.g. 7 on
Re: (Score:2, Insightful)
I'm not sure about that. People will tend to watch films they think/hope they will like. So, the ones where they think 'that'll be absolute poop' they won't bother watching, so, hopefully, won't bother rating.
So, people should rate fewer films as 'poop' than as 'great', because they select only the 'hopeful
Re: (Score:2)
That means the above-average movies and the total flops get rated, but not the below-average movies.
Re: (Score:2)
However, when you're talking about dozens of movies, all you need is a correlation. Our algorithm is powerful enough to tolerate a large amount of noise. If you read the paper, we were able to match up users between IMDb and Netflix with a very high level of confidence, in the sense that the best match was 15-30 standard deviations away from the second best match. In statistics terms, that's an insanely close match.
--Arvind Narayanan
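The "standard deviations away from the second best match" test can be sketched like this. It is a simplified illustration based only on the description in the comment above: the function names, the population standard deviation, the threshold of 1.5, and the toy scores are assumptions, not values from the paper.

```python
import statistics

def eccentricity(scores):
    """Gap between the best and second-best score, in standard deviations.

    scores maps candidate ids to match scores and needs at least two
    entries. The standard deviation is taken over all candidate scores.
    """
    ordered = sorted(scores.values(), reverse=True)
    spread = statistics.pstdev(ordered)
    if spread == 0:
        return 0.0
    return (ordered[0] - ordered[1]) / spread

def confident_match(scores, threshold):
    """Return the best candidate only if it stands out from the runner-up."""
    if eccentricity(scores) < threshold:
        return None  # too ambiguous to claim a match
    return max(scores, key=scores.get)

# One candidate clearly stands out.
clear = {"a": 0.10, "b": 0.98, "c": 0.00, "d": 0.05}

# Two candidates are nearly tied, so no match should be claimed.
ambiguous = {"a": 0.90, "b": 0.88, "c": 0.10, "d": 0.05}

clear_result = confident_match(clear, threshold=1.5)
ambiguous_result = confident_match(ambiguous, threshold=1.5)
print(clear_result, ambiguous_result)
```

Refusing to answer when the top two candidates are close is what keeps two people with similar tastes from being mistaken for the same person.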
Re: (Score:2)
Is the Netflix rating system an "I liked this movie and want to see more like it" system or a "This movie was brilliant and I would highly recommend it to everyone else" type of rating system?
It's both. The system allows users to say how much they preferred a movie. This can then be used to predict what movies a user will prefer in the future. If an unseen movie is preferred by users that have expressed preferences similar to yours, then it will be recommended to you.
But your question about what the semantics of a rating are is a good one. The answer is, we don't really know, and it doesn't really matter from a practical standpoint. People like things of "high quality", whatever that means.
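The "preferred by users with similar preferences" idea can be sketched as a tiny user-based recommender. This is not Netflix's system: the similarity measure (one over one plus the mean rating gap on shared movies), the names, and the toy ratings are assumptions made for the example.

```python
def similarity(a, b):
    """Similarity in (0, 1] based on the mean rating gap over shared movies."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[m] - b[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)

def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        weight = similarity(ratings[user], theirs)
        weighted_sum += weight * theirs[movie]
        weight_total += weight
    if weight_total == 0:
        return None  # nobody comparable has rated this movie
    return weighted_sum / weight_total

# Toy ratings on a 1-5 scale.
ratings = {
    "alice": {"A": 5, "B": 1, "C": 4},
    "bob": {"A": 5, "B": 1, "C": 5, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 1},
}

alice_d = predict(ratings, "alice", "D")
alice_z = predict(ratings, "alice", "Z")
print(alice_d, alice_z)
```

Alice's predicted rating for D lands near Bob's 5 rather than Carol's 1 because her past ratings look like Bob's. Nothing in the calculation depends on whether a rating means "high quality" or "I want more like this".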
I think a lot of you naysayers... (Score:2)
Brokeback's decline (Score:2)
Anyone think the IMDB rating of Brokeback Mountain is going to plummet dramatically? (It is 7.8 today.)
And of course, if it does, we will be able to correlate the timing of the sudden drop with the publishing of this slashdot article, allowing us to link the slashdot readership with imdb users. Now we have your Netflix ratings, IMDB ratings, AND slashdot postings all correlated...
Wrong headline (Score:2)
Because really, that's all the mat their "research" has.
And in fact, if you rate movies in IMDB, and your handle can be tracked, who needs the netflix data?
This whole thing is a non-issue, and the paper is so content-slim I doubt it will be accepted anywhere (well, maybe "New Scientist" will print it...)
Endless media hyperbole (Score:2)
To the issue of your anonymity being shattered, puh-lease. If you post information in a public forum such as IMDB and it can be correlated to information from MySpace, it wasn't a giant leap into your privacy. It was just gathering already public information. What's the big deal?
You choose to post that stuff where it could be publicly viewed. The fact that it lines up with data from Netflix only proves that NF did in fact provide a quality dataset. Big deal.
More woe for HMRC then (Score:3, Interesting)
Re:This is a 'research' paper? (Score:4, Insightful)
TWW
Re: (Score:3, Insightful)
Re: (Score:2)
Finding a paragraph like this in a research paper makes me call into question the motives and intentions of the 'researchers.' They seem sort of like the Jerry Springer of research (since he's just trying to help the families he has on his show...).
It's clear you didn't read the paper. To be sure, the quoted paragraph did appear in the paper, which of course was selected for the summary because it was the most interesting. The full paper is 24 pages of substantially heavier research and analysis. The
Re: (Score:2)
Actually, I did read the entire paper - not in depth, but enough to get the basic picture of how they went about the task.
They presented what appears to be sound research before they d
Re: (Score:2)
Gee, you mean like a tabloid might make if such details were "accidentally" leaked to them in, say, the run-up to an election? You still don't think they were making a valid point? I would ask if you needed a map drawn, but you already had and appar