Carnegie Mellon CAPTCHA Digitization Project Now Underway
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie and one of which isn't. If the user correctly deciphers the known word, then the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
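For readers who want the mechanism spelled out, here is a minimal sketch of the two-word scheme as the summary describes it. It is not reCAPTCHA's actual code: the class name `WordPool`, the `submit` signature, the case-insensitive comparison, and the in-memory storage are all assumptions made for illustration.

```python
from collections import defaultdict


class WordPool:
    """Toy model of the two-word scheme (hypothetical, not the real service)."""

    def __init__(self, required_agreement=2):
        # How many different users must give the same answer (2 per the summary).
        self.required_agreement = required_agreement
        # unknown word id -> {normalized answer -> set of user ids who gave it}
        self.votes = defaultdict(lambda: defaultdict(set))
        # unknown word id -> accepted transcription
        self.resolved = {}

    def submit(self, user_id, known_expected, known_answer, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA."""
        # The known (control) word alone decides whether the user passes.
        if known_answer.strip().lower() != known_expected.strip().lower():
            return False  # failed the control word, so the other answer is ignored
        if unknown_id not in self.resolved:
            guess = unknown_answer.strip().lower()
            voters = self.votes[unknown_id][guess]
            voters.add(user_id)  # a set, so one user cannot vote twice
            if len(voters) >= self.required_agreement:
                # Enough different users agree: take the word off the list.
                self.resolved[unknown_id] = guess
                del self.votes[unknown_id]
        return True
```

In this sketch a user who types garbage for the unknown word still passes, but the garbage only becomes the accepted transcription if a second, different user submits exactly the same string.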
Fiery church? (Score:3, Funny)
Re: (Score:2)
Hear that? It's the sound of X number of spammers crying out in agony/frustration/pain/rage.
Rock on (Score:1)
Re: (Score:2, Insightful)
It gives you two words to enter in but you only have to get the right one correct in order to get through.
Spammers could fill the left word with nonsense and OCR the right one and the system would crumble.
Who cares if the OCR isn't 100% accurate? It'll be good enough to get a lot of spam through.
Re: (Score:1)
Re: (Score:1)
Re: (Score:2, Insightful)
Re: (Score:2)
I did it about 10 times putting garbage in the left. Every time I got it correct.
Re: (Score:3, Insightful)
Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone
Re: (Score:1)
All that is necessary is that a hash of the image is stored and the same garbage is sent both times the image appears.
What's more, the more images are attacked in this manner, the faster the attack would progress, as more of the known images would be absolutely known to the attacker as well.
Re: (Score:3, Interesting)
Also, storing the hashes for successfully identified images is useless... once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the a
Give it a go! (Score:2, Informative)
JS is almost unavoidable for logins now. (Score:3, Informative)
I want to participate... (Score:4, Interesting)
Re: (Score:2, Funny)
Re:I want to participate... (Score:5, Informative)
Re: (Score:1)
Also, they should really give whole sentences; this provides context and would yield better results. Otherwise many short words would be misinterpreted.
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
OLD NEWS... and a dupe (Score:2)
other than that, it is really nice
Re: (Score:2)
Re: (Score:2)
does that mean it's ok to spam now? (Score:1)
Time to get linking...
I'm not so sure this is a good idea. (Score:2)
And that's not even
Re:I'm not so sure this is a good idea. (Score:4, Insightful)
Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors are an overwhelmingly small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says: 3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
Re: (Score:2)
I plan to use at least the mailhide recaptcha on my site. I d
Re:I'm not so sure this is a good idea. (Score:5, Insightful)
For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.
I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.
Re: (Score:2)
For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.
That would defeat the point of the project. Words scanned from real books contain all manner of 'not a word' combinations of letters and numbers; the principle is the same. I came across several portions of words that had been hyphenated at the margin of a page. Many CAPTCHA-type systems use random strings of characters. Any non-English words that show up should be treated as a string of characters.
Re: (Score:2)
Re: (Score:2)
Re:I'm not so sure this is a good idea. (Score:5, Funny)
Congratulations,
you managed to fail the Turing test.
Re: (Score:2)
Re:I'm not so sure this is a good idea. (Score:5, Funny)
Re: (Score:1)
Re: (Score:2)
Re:I'm not so sure this is a good idea. (Score:5, Informative)
We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
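The acceptance rule in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `resolve_word`, the case-insensitive matching, and the fallback threshold of two agreeing humans (taken from the story summary) are not confirmed details of the real system, and the answers are assumed to come from distinct users.

```python
from collections import Counter


def resolve_word(ocr_guess, human_answers, required_agreement=2):
    """Return the accepted transcription, or None if the word is still open.

    ocr_guess: the OCR engine's low-confidence guess, or None if it has none.
    human_answers: answers typed by distinct users who passed the known word.
    """
    normalized = [answer.strip().lower() for answer in human_answers]
    ocr = ocr_guess.strip().lower() if ocr_guess else None

    # Shortcut described in the comment: a single human who agrees with the
    # OCR guess is enough to mark the word as read.
    if ocr is not None and ocr in normalized:
        return ocr

    # Otherwise fall back to agreement between humans.
    counts = Counter(normalized)
    if counts:
        answer, votes = counts.most_common(1)[0]
        if votes >= required_agreement:
            return answer
    return None
```

With this rule a word whose OCR guess was right all along is shown to only one person, which is why, as the comment says, many words need no further human confirmation.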
Re: (Score:2)
I'm sure you've got most bases covered, but intentional malice goes way beyond 'a few malicious people'. In this case, it involves at least 1 malicious person, a captcha breaker, a few thousand anonymous free proxies, and a lot of malice. I'll admit that I find this idea trivial because I'm a programmer, but I think most (non-script-kiddie) hackers will find it trivial as well.
I sincerely hope nobody tries to sabotage your project, but I'd f
Re:I'm not so sure this is a good idea. (Score:4, Informative)
In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
Re: (Score:2)
Re: (Score:2)
Wow, that seems like a major mistake if you're actually doing that. It's quite possible for a human to make a mistake on a word, for exactly the same reason the OCR makes a mistake. In fact, the most likely error for a human to make is the same one the OCR made. Which means you will be accepting as 'read' many errors simply because the human agreed.
Re: (Score:3)
I wonder whether, after this has been running for a while, most of the unknown words will be nonsense (jabberwocky, snickersnee), proper or made-up names (Elric of Melnibone? I saw Benoit in the third captcha I solved, and I now got one that looks like Visscher), numbers and other things people wouldn't work through.
The other problem is with common words that OCR gets wrong. I've/me are c
Re: (Score:2)
"I wonder, after this is running for a while, most of the unknown words will be nonsense"
It's already been running for a few months, and we're getting millions of solutions a da
Re: (Score:2)
I have to agree with this point. I tried about 20 of them and there were at least 4 that were impossible to be sure of because of the wavy line running through the critical part of a character - this was particularly an issue on numbers, where there is no possible context to give you the correct answer. I guess it all comes out in the wash because you just re-present the images until a consensus develops.
Re: (Score:2)
Re: (Score:2)
Still, as someone else noted, there -should- be a way to note that one of the words appears to be nonsense and that you'd like to flag it for a human to interpret instead
Problems (Score:3, Interesting)
Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).
Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.
Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects) but giving the same answer as another user is not difficult when they are using the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program) and so it will be answering the same question multiple times.
Am I right on these points? (I just woke up).
Re:Problems (Score:4, Insightful)
Re:Problems (Score:5, Insightful)
1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).
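Points 3 and 4 can be sketched together: once a word is resolved it joins the pool of known words, and the pool tracks how often users get it right. Everything here (the class `KnownWordPool`, its method names, the 0.5 threshold and the minimum of 10 showings) is a hypothetical illustration, not something the project has described.

```python
class KnownWordPool:
    """Hypothetical pool of resolved words reused as control words."""

    def __init__(self):
        # word -> [times answered correctly, times shown]
        self.stats = {}

    def promote(self, word):
        """Point 3: a resolved word becomes a new known word."""
        self.stats.setdefault(word, [0, 0])

    def record(self, word, was_correct):
        """Record one showing of a known word and whether the user matched it."""
        entry = self.stats[word]
        entry[1] += 1
        if was_correct:
            entry[0] += 1

    def pass_rate(self, word):
        """Point 4: fraction of users who matched the stored transcription."""
        correct, shown = self.stats[word]
        return correct / shown if shown else None

    def suspects(self, threshold=0.5, min_shown=10):
        """Words whose low pass rate hints that the stored answer is wrong."""
        return sorted(
            word
            for word, (correct, shown) in self.stats.items()
            if shown >= min_shown and correct / shown < threshold
        )
```

A word that most users "fail" after promotion is a candidate for having been transcribed incorrectly in the first place, and could be sent back for re-checking.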
Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.
In practice, these seem to get broken less often than people think.
Re: (Score:2)
I think the GP's worry is that the spammers use OCR and there are a lot of them, so the two challenges you are relying on for checking both get answered by the same OCR spambot code - so they could match even though they're wrong.
Re: (Score:1, Funny)
Re: (Score:2)
Please RTFA. How do you propose that the same bot gets the same word twice in one sitting, let alone with the same warping and strikethrough so as to guarantee the same word is typed both times?
Check out recaptcha.net [recaptcha.net] to test it out.
Re: (Score:2)
Re: (Score:2)
1) We've done some studies at CMU that show that recognizing and typing 2 real English words is much easier and faster than typing 6 or 7 random letters and numbers. Would you rather type "private much" (which is what just showed up for reCAPTCHA) or "KXd2cM" (which is what showed up for Yahoo's CAPTCHA)?
2) Any given CAPTCHA is only shown to a couple of users. We're getting millions of legitimate solutions a day, so even a relatively sophisticated bot would have little chance of seeing th
Privacy (Score:2)
Presentation about human computation (Score:1)
CATTTTCHA? (Score:2, Interesting)
> , was originally designed at Carnegie Mellon to help to keep out automated programs known as "bots."
Where did they get the "P" from?
Re: (Score:1)
Re: (Score:1)
Possible problem (Score:1)
Don't worry (Score:2)
`CowboyNeal' answer to all CAPTCHAs (Score:2, Interesting)
Re: (Score:1)
Re:`CowboyNeal' answer to all CAPTCHAs (Score:5, Informative)
We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
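A rough sketch of that kind of frequency check is below. The function name `flag_suspicious`, the data shapes, and the thresholds `jump_factor` and `min_count` are all invented for illustration; the comment does not say how the real check is tuned.

```python
from collections import Counter


def flag_suspicious(today, baseline, ocr_guesses, jump_factor=10, min_count=20):
    """Flag answers whose daily count suddenly jumps.

    today: list of (word_id, answer) pairs submitted today.
    baseline: dict mapping answer -> its usual daily count.
    ocr_guesses: dict mapping word_id -> the OCR's guess for that word.
    Returns {answer: fraction of its submissions that disagree with the OCR}.
    """
    counts = Counter(answer for _, answer in today)
    disagreements = Counter(
        answer for word_id, answer in today if ocr_guesses.get(word_id) != answer
    )
    flagged = {}
    for answer, count in counts.items():
        # Treat never-seen answers as having a baseline of 1 to avoid
        # comparing against zero.
        usual = max(baseline.get(answer, 0.0), 1.0)
        if count >= min_count and count >= jump_factor * usual:
            flagged[answer] = disagreements[answer] / count
    return flagged
```

An answer such as "cowboyneal" that appears dozens of times in one day, against a baseline near zero and always in disagreement with the OCR, would come back with a disagreement fraction of 1.0, which is the "especially suspicious" case the comment describes.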
Re: (Score:2)
Is there any word on how CAPTCHA decoders, like PWNtcha, perform against the current reCAPTCHA?
In case reCAPTCHA can be automatically deciphered efficiently, a slightly altered malevolent attack might still be feasible. Let D be a roughly complete list of English words (a dictionary), together with the relative frequencies of the words occurring in standard English texts. Generate a fixed mapping f from D to D such that words are going to be assigned to each other only in case their occurrence frequencies
Re: (Score:3)
"Turing" test (Score:3, Informative)
I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans in reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly infinite amount of such questions is a challenging problem by itself.
Re: (Score:2)
I think it is more than a challenge. I have introduced a system like this on a public forum that I administer. It's a phpBB mod that asks a question during the registration phase to which the registrant is required to give a correct answer.
The problem is that I have found it very hard to come
Re: (Score:2)
What language is this in?
What are the first five letters of the alphabet?
What are the five vowels?
Other stuff:
Are you a human or a computer program?
What is the name of this site? (see title bar)
Pick a number, any number. (Any number is taken as correct)
Leave the following space blank.
Of course, the biggest problem with a limited dictionary of questions like this is that a spammer can sit through them, answer them all, or at least a portion, and then put a script to replay the
Re: (Score:2)
Or if you mean that they would be too easy for a robot to answer
Re: (Score:1)
My point about trivia questions is that they are often very culturally-dependent. What is obvious and very easy for an average American (or English person) may not be at all obvious to someone from Burkina Faso (for example).
Re: (Score:2)
Re: (Score:2)
What colour is a ripe tomato?
Can be yellow, brown, purple or even green!
Peekaboom (Score:3)
Here's a nice video [google.com] on the subject.
MOD PARENT UP (Score:2)
MOD grandparent PARENT UP (Score:1)
Re: (Score:2)
Practical Use (Score:2)
A couple of months ago I switched to recaptcha.net's plugin for phpBB and it stemmed the tide. The number of spam bots getting through decreased greatly. As for those that did, I felt slightly better when I deleted their registration requests unfulfilled. Their Evil cpu cy
Drupal Module makes it simple (Score:4, Interesting)
I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.
Re: (Score:1)
http://www.testdesigner.com/about/contact/ [testdesigner.com]
Re: (Score:2)
Does it stop spam? (Score:2)
Re: (Score:2)
Re: (Score:2)
Say Foo! (Score:1, Funny)
Sadly (Score:2)
They can set up fake porn sites with registrations (collecting more email addresses to spam in the process), and when someone wants to 'register' for the free porn, the spammer's site scrapes a CAPTCHA from the site they want to get into with a bot, and shows it to the user trying to sign up for porn. The eager pornhound dutifully types in the answer, which the spammer's scripts can then supply to the site the CAPTCHA originally c
Re: (Score:2)
We have noticed one such "humans filling out CAPTCHAs for spammers" attack on reCAPTCHA, but in this case it was offshore workers being paid to solve CAPTCHAs. We shut them out of the
Re: (Score:2)
Not case sensitive? Ut oh (Score:2, Interesting)
It's not like they have NO idea what the word is.. (Score:1)
I could see it being a problem with 'Z' and 'z', or something like that. I'm sure they can parse the language, though, and intelligently decide if it is l
How if..... (Score:1)
Next Step: (Score:2)
Champollion is rolling in his grave in frustration because he didn't think of this...
Caps? (Score:1)
Caps aren't relevant (Score:1)
Minor problems but good overall (Score:3, Interesting)
1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise".
3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.
But it's a brilliant idea and for the majority of the text samples there was no ambiguity.
Re: (Score:1)
Re: (Score:2)
Context. If the text is difficult to read so that one or more letters are ambiguous, and you know that the word is a modern American English word, then you can fill in the blank(s). I failed to mention proper nouns (i.e. names), and that problem is more common because there are no standardized spellings of them. They are turning up quite often in the text.
Also some of the scanned text was a number with a fraction, and some had accent marks and the
Why? (Score:2)
Is there really a shortage of willing volunteer transcribers? I seem to remember Project Gutenberg getting far more volunteers than they could use, without even asking...
And speaking for myself, I'm sure I could transcribe a couple full sentences more quickly than I could two arbitrary words, so I'd call this a terrible use of the available volunteer resources as well.
Isn't this self-defeating? (Score:1)
There's a Wordpress plug-in... (Score:1)