Carnegie Mellon CAPTCHA Digitization Project Now Underway

Posted by Zonk on Tue Oct 02, 2007 08:44 AM
from the way-more-fun-than-the-usual-kind dept.
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but that was before it got off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie Mellon and one of which isn't. If the user correctly deciphers the known word, then the answer given for the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
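The two-word scheme in the summary can be sketched in a few lines of Python. This is only a toy model of what the summary describes, not the actual reCAPTCHA code: the class name, the `submit` signature, and the case-insensitive comparison are all assumptions made for illustration.

```python
from collections import defaultdict

REQUIRED_AGREEMENT = 2  # "two different users", per the summary


class TwoWordCaptcha:
    """Toy model of the known-word / unknown-word check (hypothetical)."""

    def __init__(self):
        # unknown-word id -> answer -> set of users who gave that answer
        self.votes = defaultdict(lambda: defaultdict(set))
        # unknown-word id -> accepted transcription
        self.resolved = {}

    def submit(self, user, known_answer, known_truth, unknown_id, unknown_answer):
        """Return True if the user passes the CAPTCHA."""
        if known_answer.strip().lower() != known_truth.strip().lower():
            return False  # failed the control word, so the other answer is ignored
        if unknown_id in self.resolved:
            return True  # already taken off the list
        guess = unknown_answer.strip().lower()
        voters = self.votes[unknown_id][guess]
        voters.add(user)
        if len(voters) >= REQUIRED_AGREEMENT:
            # Two different users agree: accept the word and stop asking.
            self.resolved[unknown_id] = guess
            del self.votes[unknown_id]
        return True
```

Note that the user never learns which of the two words was the control word; the pass/fail result depends only on the known word.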

Related Stories

[+] Making CAPTCHAs Even Harder With 3-D Models 326 comments
Michael G. Kaplan writes "CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) are commonly used to prevent computers from filling out web forms. Computer vision experts have been able to design programs to foil CAPTCHA with a high degree of success. I have designed a CAPTCHA that is based on the identification of attributes contained in an image generated by the grouping of easily recognized 3-D objects. I call this the Virtual Photographic CAPTCHA and it is likely to remain invulnerable to automated attack for many years to come. A novel anti-spam system necessitated its development."
[+] How to Prevent Form Spam Without Captchas 272 comments
UnderAttack writes "Spam submitted to web contact forms and forums continues to be a huge problem. The standard way out is the use of captchas. However, captchas can be hard to read even for humans. And if implemented wrong, they will be read by the bots. The SANS Internet Storm Center covers a nice set of alternatives to captchas. For example, the use of style sheets to hide certain form fields from humans, but make them 'attractive' to bots. The idea of these methods is to increase the work a spammer has to do to spam the form without inconveniencing regular users."
[+] News: Fill Out CAPTCHAs, Digitize Books At The Same Time 121 comments
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
[+] Have Spammers Overcome the CAPTCHA? 330 comments
thefickler writes "It appears that spammers have found a way to automatically create Hotmail and Yahoo email accounts. They have already generated more than 15,000 bogus Hotmail accounts, according to security company BitDefender. The company says that a new threat, dubbed Trojan.Spammer.HotLan.A, is using automatically generated Yahoo and Hotmail accounts to send out spam email, which suggests that spammers have found a way to overcome Microsoft's and Yahoo's CAPTCHA systems."
[+] News: Games With A Purpose Help With Tasks That Tax Computers 61 comments
Falkkin writes "Luis von Ahn and his team at Carnegie Mellon University have launched GWAP, a new web site for 'Games With A Purpose.' By playing these online games, humans help provide data for problems that are hard for computers to solve, such as computer vision and sound classification. Slashdot has previously covered other human computation projects by Dr. von Ahn, including the ESP Game and reCAPTCHA. The new web site contains a re-vamping of the ESP Game as well as four completely new games." (Falkkin also points to an AP story on GWAP and to coverage at the BBC.)
[+] News: reCAPTCHA Hard At Work, Rescuing Fading Texts 112 comments
sciencehabit writes "Computer scientists have developed a program, called reCAPTCHA, which is being used in lieu of CAPTCHA by several sites, to help digitize old books and newspapers. The reCAPTCHA takes entries from old and faded texts that optical scanners and digital-text readers have trouble with. So every time you solve that string of crooked letters, you may actually be helping historians digitally reconstruct a page from the 1908 New York Times." The Science Now story links to the longer and more informative article at Ars Technica. (We last mentioned this program last year — and now it's good to get some sense of how well it's working.)
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by gEvil (beta) (945888) on Tuesday October 02 2007, @08:48AM (#20821635)
    Is this proof that Carnegie Mellon (and the BBC) support religious terrorism?
  • by DrWho520 (655973) on Tuesday October 02 2007, @08:50AM (#20821651) Journal
    Where can I sign up? Sounds like a great way to burn a few hours on a rainy, Saturday afternoon!
  • Problems (Score:3, Interesting)

    by David_Shultz (750615) on Tuesday October 02 2007, @09:07AM (#20821777)
    Interesting idea, but here are the immediate problems as I see them...

    Captchas are now twice as annoying for the user, since you have to type two words (but maybe the fact that there is some value in it will appease the user).

    Some algorithms these days are quite literally better than humans at detecting the hidden text in captchas. Pictures, not text, are better for this purpose.

    Testing the answer against another user's answer is a good idea in principle (it's how they make sure no one is cheating in distributed computing projects), but giving the same answer as another user is not difficult when both are using the same algorithm. We can assume that any algorithm being applied against this captcha is trying to do loads of work (that is, after all, why you write such a program), and so it will be answering the same question multiple times.

    Am I right on these points? (I just woke up).
    • Re:Problems (Score:4, Insightful)

      by AltGrendel (175092) <[ag-slashdot] [at] [exit0.us]> on Tuesday October 02 2007, @09:13AM (#20821833) Homepage
      I agree, but if you think about it, it's really a win-win for Carnegie Mellon. Either way, they get the text translated.
      • Re:Problems (Score:5, Insightful)

        by jsight (8987) on Tuesday October 02 2007, @09:22AM (#20821913) Homepage
        I agree... I don't understand why people find so many silly faults with this.

        1. It's not twice as annoying. Compared to how faded and scrambled many "one-word" captchas are, this is significantly less annoying.
        2. People seem to be acting like someone will fill out one word correctly and then intentionally scramble the other to screw up the project. Not many people are crazy enough to even want to do that. But even if they were, how do they know which word is the known, and which is the unknown?
        3. Endless Supply - Each word that is correctly translated is another word that is "known" and therefore can be safely used as a known in a new captcha.
        4. Verification - Thanks to #3, they could also potentially maintain the verification % rate for various words to later determine the accuracy or inaccuracy of past translations (assuming that they ever find that to be a problem).

        Yeah, we all know that captchas are not perfect, but this project is a better idea than most. And because it is centralized, they can update the image generation scheme centrally if it is broken.

        In practice, these seem to get broken less often than people think.
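Points 3 and 4 above (resolved words becoming new "known" words, and keeping a verification rate for each) could look roughly like this. It is a sketch under assumptions: the `KnownWord` record, the `suspicious` helper, and both thresholds are made up for illustration and are not from the project.

```python
from dataclasses import dataclass


@dataclass
class KnownWord:
    """A transcribed word that is now used as a control word (hypothetical)."""
    text: str
    shown: int = 0    # times it was used as the control word
    matched: int = 0  # times the user's answer matched `text`

    def record(self, answer: str) -> None:
        self.shown += 1
        if answer.strip().lower() == self.text.lower():
            self.matched += 1

    @property
    def verification_rate(self) -> float:
        return self.matched / self.shown if self.shown else 0.0


def suspicious(words, min_shown=20, min_rate=0.5):
    """Known words that users rarely reproduce; the stored transcription may be wrong."""
    return [w.text for w in words
            if w.shown >= min_shown and w.verification_rate < min_rate]
```

A word whose rate stays low after many showings is a candidate for being pulled from the known pool and re-checked.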
  • "Turing" test (Score:3, Informative)

    by DrLex (811382) on Tuesday October 02 2007, @09:49AM (#20822271) Homepage
    Well, this finally makes CAPTCHAs somewhat useful. I won't try to formulate it in some sugar-coated way: I personally hate CAPTCHAs. On some types (especially the ones from Digg), I fail about 50% of them, and that's getting quite annoying after a while. Especially when your code is rejected even if you believe there is no doubt about what you've read in the image.
    I believe CAPTCHAs are the wrong solution to the wrong problem. It's a bit exaggerated to call them a "Turing test", because I'm quite sure that OCR systems will be made in the near future that are better than humans at reading CAPTCHAs. A simple text-based question that requires actual intelligence is a much better Turing test, and also a much smaller nuisance for people with impaired vision. Of course, writing a foolproof system that can produce a nearly infinite number of such questions is a challenging problem by itself.
  • by EnsilZah (575600) <EnsilZah@@@Gmail...com> on Tuesday October 02 2007, @10:08AM (#20822541) Homepage
    Sounds like what they're doing at Peekaboom [peekaboom.org] and The ESP Game [espgame.org], harnessing humans to solve problems that are difficult for computers.
    Here's a nice video [google.com] on the subject.
  • by Slashdot Parent (995749) on Tuesday October 02 2007, @10:19AM (#20822691)
    For all of you Drupal admins out there, I just wanted to let you know that there is a reCAPTCHA module [drupal.org] that makes using reCAPTCHA a snap.

    I'm not affiliated with the project, other than as a happy, comment-spam-free user of it.
  • by MrKevvy (85565) on Tuesday October 02 2007, @12:58PM (#20825175)
    After doing a hundred or so, I can see several issues that may hurt accuracy even when the text is human-readable:

    1) Hyphenated word fragments broken over lines, e.g. "vances" where you can't see the "ad-" from the previous line.
    2) Dialectal spellings of English words, e.g. British spelling where "s" replaces "z" in verb forms such as "categorise".
    3) Numbers with commas/decimals. Is that thirteen-thousand "13,000" or a precise thirteen "13.000" to three places?
    4) Archaic spellings and outdated words. Because these are old books being digitized (only books before 1923 are out of copyright) this is quite common.

    But it's a brilliant idea and for the majority of the text samples there was no ambiguity.
    • Re: (Score:2, Insightful)

      I've found a flaw.

      It gives you two words to enter in but you only have to get the right one correct in order to get through.

      Spammers could fill the left word with nonsense and OCR the right one, and the system would crumble.
      Who cares if the OCR isn't 100% accurate? It'll be good enough to get a lot of spam through.
      • Re: (Score:3, Insightful)

        You don't know which word is known (and checked against) and which is unknown. This makes your OCR attack less effective because you must get BOTH words right in order to guarantee success.

        Also, if the first two people to decipher the unknown word don't agree, then the word is recycled back into the system until "a lot more people" submit the same answer. This greatly reduces the threat of a "garbage attack" because any random input is unlikely to be repeated by the second person to get that word, or anyone
          • Re: (Score:3, Interesting)

            Still won't work. It's safe to assume the distortion/noise added to the text to prevent simple OCR would be different for each instance of the image; that's the whole point, after all. Hashes of the image data are useless in that case.
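A small sketch of why hashing the image data does not help an attacker when every instance carries fresh distortion. The `render` function is a stand-in that simply appends random bytes to the word; a real CAPTCHA renderer is assumed to behave similarly in that two renderings of the same word differ at the byte level.

```python
import hashlib
import random


def render(word: str, rng: random.Random) -> bytes:
    """Stand-in for an image renderer: the word plus per-instance random noise."""
    noise = bytes(rng.randrange(256) for _ in range(16))
    return word.encode("utf-8") + noise


def hashes_for(word: str, count: int, seed: int = 0) -> list:
    """SHA-256 digests of `count` independent renderings of the same word."""
    rng = random.Random(seed)
    return [hashlib.sha256(render(word, rng)).hexdigest() for _ in range(count)]
```

Every rendering of the same word yields a different digest, so a lookup table keyed on image hashes would never get a hit.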

            Storing the hashes for successfully identified images is also useless: once a word is identified by at least two parties, it is removed from circulation. That means if the attacker IDs a word correctly, chances are it won't stay in the system much longer. Even if the a
      • Unfortunately I think most CAPTCHAs use JS; it's been a while since I've been to a site that didn't make me turn it on to get through login/registration. I have no idea why this is, since people have been doing login pages since before JS was around or popular, but now it seems like the way every idiot is doing it.
    • by necro81 (917438) on Tuesday October 02 2007, @09:26AM (#20821955) Journal

      There still needs to be a human reviewing the work before it's truly accepted, and that human might as well be doing it in the first place, with the context still there to help them.
      That is, for all intents and purposes, impractical, which was the entire point. The backlog of work was never going to get done in a reasonable timescale with dedicated humans correcting all the errors. A dedicated human, even with the context, will still make mistakes or get stumped.

      Most people, when presented with a CAPTCHA, make an honest effort to try and get it right - otherwise they can't get their precious Facebook account. The number of people who understand what's going on with this reCAPTCHA thing is probably pretty small. Finally, those who know what it is about are probably inclined to not be jackasses and purposefully screw it up. I'd say that honest errors and malicious errors are an overwhelmingly small portion of reCAPTCHA responses. While flawed, this system might still be, say, 95% correct. So, for accepting a certain amount of error, you are able to get as much character recognition done as you are able to supply. As the article says:

      Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible.
      3000 man-hours a day at 95% accuracy versus, maybe, a few dozen man-hours a day at slightly higher accuracy. You tell me which is better.
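The figures above can be turned into a rough solve count. The 10 seconds and the (rounded) 3000 man-hours come from the quoted article; the 95% accuracy is only the guess made in this comment, not a measured value.

```python
SECONDS_PER_SOLVE = 10     # from the quoted article
MAN_HOURS_PER_DAY = 3000   # "almost three thousand", rounded up
ASSUMED_ACCURACY_PCT = 95  # the commenter's guess, not a measurement

# Man-hours converted to seconds, divided by the time per solve.
solves_per_day = MAN_HOURS_PER_DAY * 3600 // SECONDS_PER_SOLVE
correct_per_day = solves_per_day * ASSUMED_ACCURACY_PCT // 100

print(solves_per_day)   # 1080000
print(correct_per_day)  # 1026000
```

So the quoted numbers imply on the order of a million solved reCAPTCHAs per day.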
    • by smallfries (601545) on Tuesday October 02 2007, @09:27AM (#20821967) Homepage
      Wouldn't the easy solution be to present the context as part of the reCAPTCHA? Rather than two single words from isolated contexts, present two "lines" with a word or two either side, and a slight colour change on the target words to indicate which ones the system is after. This would make your validation easier but wouldn't aid OCR in any way.

      For your other point, there should be a "not a word" button to hit in that case to flag up that the original OCR has screwed up the word boundary.

      I thought it was a really novel project, reminds me of the image tagging "games" that people came up with last year, but in a new problem domain.
    • by MrMr (219533) on Tuesday October 02 2007, @09:40AM (#20822143)
      I've already had words like 'Alau' and '45-618' in the few I've done, and since there's an ugly line through them, I can't be close to sure it's right... They make no sense, but they look like that.

      Congratulations,
      you managed to fail the Turing test.
    • by Falkkin (97268) on Tuesday October 02 2007, @10:07AM (#20822517) Homepage
      "And that's not even counting malice where people deliberately put wrong words in."

      We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
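The acceptance rule described here (one human agreeing with a low-confidence OCR guess is enough, otherwise humans must agree with each other) might be sketched as follows. The function name, the `required_humans` parameter, and the normalization are assumptions for illustration, not the project's code.

```python
from collections import Counter


def resolve(ocr_guess, human_answers, required_humans=2):
    """Return the accepted transcription, or None if still undecided."""
    counts = Counter(a.strip().lower() for a in human_answers)
    # Shortcut from the comment: a single human who agrees with the
    # low-confidence OCR guess is enough to mark the word as read.
    if ocr_guess is not None:
        guess = ocr_guess.strip().lower()
        if counts[guess] >= 1:
            return guess
    # Otherwise fall back to agreement between humans.
    for answer, n in counts.most_common():
        if n >= required_humans:
            return answer
    return None
```

For example, `resolve("morning", ["Morning"])` accepts the word after one human answer, while `resolve("rnorning", ["morning"])` stays undecided until a second human gives the same answer.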
        • by Falkkin (97268) on Tuesday October 02 2007, @10:38AM (#20822965) Homepage
          You said "people" putting in wrong words (à la the suggestion someone made below about "everyone fill in CowboyNeal!"), which is quite different from automated attacks. For that, we have numerous scripts that notice various forms of anomalous behavior from any given IP. We manually review these to make sure the answers are reasonable. We are also working with CERT, who have a large database of botnetted machines, to detect attacks. I'm not going to give complete details of everything we check, but rest assured that we are very active in preventing attacks -- our goal is to be the best CAPTCHA in the world, and we take security threats very seriously.

          In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
    • I got "derground". If they are getting this from digitized books, they have to work on undoing hyphenation before presenting it to the user.

      I wonder whether, after this has been running for a while, most of the unknown words will be nonsense (jabberwocky, snickersnee), proper or made-up names (Elric of Melnibone? I saw Benoit in the third captcha I solved, and I just got one that looks like Visscher), numbers, and other things people wouldn't work through.

      The other problem is with common words that OCR gets wrong. I've/me are c
    • by Falkkin (97268) on Tuesday October 02 2007, @10:21AM (#20822725) Homepage
      Sorry, but we've already thought of this attack :)

      We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
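A minimal sketch of the frequency check described above: count each answer per day, flag answers whose count suddenly jumps, and report how often the flagged answer disagrees with the OCR guess. The function name, the input shapes, and the `min_count` and `jump` thresholds are hypothetical.

```python
from collections import Counter


def flag_spikes(yesterday_counts, today_pairs, min_count=50, jump=10.0):
    """Flag answers whose daily frequency suddenly jumps.

    yesterday_counts: mapping of answer -> count for the previous day.
    today_pairs: list of (human_answer, ocr_guess) tuples for today.
    Returns (answer, count, disagreement_rate) rows, most frequent first.
    """
    today_counts = Counter(answer for answer, _ in today_pairs)
    disagreements = Counter(a for a, ocr in today_pairs if a != ocr)
    flagged = []
    for answer, n in today_counts.items():
        baseline = yesterday_counts.get(answer, 0) + 1  # +1 avoids division by zero
        if n >= min_count and n / baseline >= jump:
            flagged.append((answer, n, disagreements[answer] / n))
    return sorted(flagged, key=lambda row: row[1], reverse=True)
```

A common word like "the" stays unflagged because its count barely changes from day to day, while a coordinated "everyone type the same thing" answer shows up with a large jump and a disagreement rate near 1.0.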
        • PWNtcha does not defeat reCAPTCHA, nor are we aware of any existing OCR or CAPTCHA-breaking algorithms that do. We are working with research groups at a couple universities who are trying to break our CAPTCHA (and if they can, we'll obviously fix it). In case we do notice a break, it's trivial for us to switch to a completely different kind of CAPTCHA (using different distortions). Because our system is a web service, if there is a security breach, we can fix it for all sites at once by simply changing t