Slashdot
Carnegie Mellon CAPTCHA Digitization Project Now Underway
Posted by
Zonk
on Tue Oct 02, 2007 08:44 AM
from the way-more-fun-than-the-usual-kind dept.
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie and one of which isn't. If the user correctly deciphers the known word, then the answer for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
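The two-word check in the summary can be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the class name `UnknownWordPool`, its method and parameter names, and the case-insensitive comparison are all assumptions made for the example.

```python
from collections import defaultdict


class UnknownWordPool:
    """Toy model of the known-word / unknown-word scheme (hypothetical names)."""

    def __init__(self, required_users=2):
        # Number of distinct users who must agree before a word is retired.
        self.required_users = required_users
        # unknown word id -> candidate answer -> set of user ids who gave it
        self.votes = defaultdict(lambda: defaultdict(set))
        # unknown word id -> accepted transcription
        self.resolved = {}

    def submit(self, user_id, known_truth, known_answer, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA, False otherwise."""
        # The known word is the control: fail it and the guess is discarded.
        if known_answer.strip().lower() != known_truth.strip().lower():
            return False
        if unknown_id not in self.resolved:
            guess = unknown_answer.strip().lower()
            voters = self.votes[unknown_id][guess]
            voters.add(user_id)  # a set, so one user cannot vote twice
            if len(voters) >= self.required_users:
                # Enough different users agree: take the word off the list.
                self.resolved[unknown_id] = guess
                del self.votes[unknown_id]
        return True
```

In this sketch a user who gets the control word wrong never contributes a vote, and a word only leaves the pool once two different user ids have typed the same thing.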
Related Stories
Making CAPTCHAs Even Harder With 3-D Models 326 comments
Michael G. Kaplan writes "CAPTCHAs (Completely Automated Public Turing Tests to Tell Computers and Humans Apart) are commonly used to prevent computers from filling out web forms. Computer vision experts have been able to design programs to foil CAPTCHAs with a high degree of success. I have designed a CAPTCHA that is based on the identification of attributes contained in an image generated by the grouping of easily recognized 3-D objects. I call this the Virtual Photographic CAPTCHA and it is likely to remain invulnerable to automated attack for many years to come. A novel anti-spam system necessitated its development."
How to Prevent Form Spam Without Captchas 272 comments
UnderAttack writes "Spam submitted to web contact forms and forums continues to be a huge problem. The standard way out is the use of captchas. However, captchas can be hard to read even for humans. And if implemented wrong, they will be read by the bots. The SANS Internet Storm Center covers a nice set of alternatives to captchas. For example, the use of style sheets to hide certain form fields from humans, but make them 'attractive' to bots. The idea of these methods is to increase the work a spammer has to do to spam the form without inconveniencing regular users."
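The hidden-field idea mentioned in this blurb is often called a honeypot. A minimal sketch in Python, assuming the form contains an extra text input that a style sheet hides from human visitors; the field name `website` and the function name are made up for the example and do not come from the SANS write-up.

```python
# Hypothetical name of a form field that CSS hides from human visitors
# (for example with "display: none"). Bots that fill in every field they
# find in the HTML tend to fill this one in as well.
HONEYPOT_FIELD = "website"


def looks_like_bot(form_data):
    """Return True if the hidden honeypot field came back non-empty."""
    value = form_data.get(HONEYPOT_FIELD, "")
    return bool(value.strip())
```

A human who never sees the field submits it empty (or not at all), so the check costs regular users nothing, which is the point the blurb makes.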
Fill Out CAPTCHAs, Digitize Books At The Same Time 121 comments
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
Have Spammers Overcome the CAPTCHA? 330 comments
thefickler writes "It appears that spammers have found a way to automatically create Hotmail and Yahoo email accounts. They have already generated more than 15,000 bogus Hotmail accounts, according to security company BitDefender. The company says that a new threat, dubbed Trojan.Spammer.HotLan.A, is using automatically generated Yahoo and Hotmail accounts to send out spam email, which suggests that spammers have found a way to overcome Microsoft's and Yahoo's CAPTCHA systems."
News: Games With A Purpose Help With Tasks That Tax Computers 61 comments
Falkkin writes "Luis von Ahn and his team at Carnegie Mellon University have launched GWAP, a new web site for 'Games With A Purpose.' By playing these online games, humans help provide data for problems that are hard for computers to solve, such as computer vision and sound classification. Slashdot has previously covered other human computation projects by Dr. von Ahn, including the ESP Game and reCAPTCHA. The new web site contains a re-vamping of the ESP Game as well as four completely new games." (Falkkin also points to an AP story on GWAP and to coverage at the BBC.)
Technology: reCAPTCHA Hard At Work, Rescuing Fading Texts 31 comments
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

Fiery church? (Score:3, Funny)
I want to participate... (Score:4, Interesting)
Re: (Score:2, Funny)
Re:I want to participate... (Score:5, Informative)
Re: (Score:3, Insightful)
Problems (Score:3, Interesting)
Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).
Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.
Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects), but giving the same answer as another user is not difficult when both are using the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program), and so it will be answering the same question multiple times.
Am I right on these points? (I just woke up).
Re:Problems (Score:4, Insightful)
Re:Problems (Score:5, Insightful)
1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).
Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.
In practice, these seem to get broken less often than people think.
"Turing" test (Score:3, Informative)
I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans at reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly unlimited number of such questions is a challenging problem by itself.
Peekaboom (Score:3)
Here's a nice video [google.com] on the subject.
Drupal Module makes it simple (Score:4, Interesting)
I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.
Minor problems but good overall (Score:3, Interesting)
1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise".
3) Numbers with commas/decimals. Is that thirteen thousand ("13,000") or a precise thirteen ("13.000") to three places?
4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.
But it's a brilliant idea and for the majority of the text samples there was no ambiguity.
Re: (Score:2, Insightful)
It gives you two words to enter in but you only have to get the right one correct in order to get through.
Spammers could fill the left word with nonsense and OCR the right one and the system would crumble.
Who cares if the OCR isn't 100% accurate? It'll be good enough to get a lot of spam through.
Re: (Score:3, Insightful)
Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone
Re: (Score:3, Interesting)
Storing the hashes for successfully identified images is also useless... once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the a
JS is almost unavoidable for logins now. (Score:3, Informative)
Re:I'm not so sure this is a good idea. (Score:4, Insightful)
Most people, when presented with a CAPTCHA, make an honest effort to get it right; otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined not to be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors together are a very small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says: 3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
Re:I'm not so sure this is a good idea. (Score:5, Insightful)
For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.
I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.
Re:I'm not so sure this is a good idea. (Score:5, Funny)
Congratulations,
you managed to fail the Turing test.
Re:I'm not so sure this is a good idea. (Score:5, Funny)
Re:I'm not so sure this is a good idea. (Score:5, Informative)
We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
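The OCR shortcut described in this comment might look something like the sketch below. It is a guess at the logic, not the real implementation: the function name `resolve_word`, the list-of-answers input, and the fallback of two matching human answers (taken from the story summary) are assumptions for the example, and the answers are assumed to come from different users who already passed the known word.

```python
def resolve_word(ocr_guess, human_answers, required_matches=2):
    """Return the accepted transcription, or None if the word stays open.

    ocr_guess: the OCR's low-confidence reading, or None if it has none.
    human_answers: answers in arrival order, one per distinct user.
    """
    ocr = ocr_guess.strip().lower() if ocr_guess else None
    counts = {}
    for raw in human_answers:
        answer = raw.strip().lower()
        # One human agreeing with the OCR is enough to mark the word as read.
        if ocr is not None and answer == ocr:
            return answer
        # Otherwise fall back to agreement between humans.
        counts[answer] = counts.get(answer, 0) + 1
        if counts[answer] >= required_matches:
            return answer
    return None
```

With this rule a word whose OCR reading is right is retired after a single human look, which matches the remark that many words are only ever shown to one person.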
Re:I'm not so sure this is a good idea. (Score:4, Informative)
In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
Re: (Score:3)
I wonder whether, after this has been running for a while, most of the unknown words will be nonsense (jabberwocky, snickersnee), proper or made-up names (Elric of Melnibone? I saw Benoit in the third captcha I solved, and I now got one that looks like Visscher), numbers, and other things people wouldn't work through.
The other problem is with common words that OCR gets wrong. I've/me are c
Re:`CowboyNeal' answer to all CAPTCHAs (Score:5, Informative)
We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
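A rough sketch of that kind of popularity check, in Python. Everything concrete here is an assumption for illustration: the function name `flag_spikes`, the dictionary inputs, and the thresholds `spike_factor` and `min_count` are invented, since the comment does not say how the flagging is actually tuned.

```python
def flag_spikes(today_counts, baseline_counts, ocr_disagreements=None,
                spike_factor=5.0, min_count=20):
    """Flag answers whose daily frequency jumps well above their baseline.

    today_counts: answer -> times submitted today.
    baseline_counts: answer -> typical daily count from earlier days.
    ocr_disagreements: answer -> times today it disagreed with the OCR guess.
    Returns a list of (answer, always_disagrees_with_ocr) tuples.
    """
    ocr_disagreements = ocr_disagreements or {}
    flagged = []
    for answer, count in sorted(today_counts.items()):
        # Treat unseen answers as having a baseline of 1 to avoid a zero threshold.
        baseline = max(baseline_counts.get(answer, 0), 1)
        if count >= min_count and count > spike_factor * baseline:
            # Extra suspicion if the answer never once matched the OCR's guess.
            always_disagrees = ocr_disagreements.get(answer, 0) == count
            flagged.append((answer, always_disagrees))
    return flagged
```

Common words such as "the" have a high baseline and are not flagged, while a string that suddenly shows up hundreds of times, and never matches the OCR, stands out.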
Re: (Score:3)