9th August 2007

reCAPTCHA, OCR, and the work of blog commenters

posted in blogs, books, creativity

I listened to the June NPR Technology report “Turning Verification Codes into Books?” several weeks ago, but not until this evening, when I was visiting Bob Sprankle’s Bit by Bit blog, did I become a willing worker in the creative reCAPTCHA project. This ingenious project uses the CAPTCHA words people type when commenting on a blog both to verify that they are, indeed, human beings and to help convert OCR-scanned copies of books into digital text! I installed the free reCAPTCHA plug-in for WordPress and configured it readily after creating an account on the reCAPTCHA website and generating a public and private “key.”

How does the reCAPTCHA project work? According to the project’s website:

To archive human knowledge and to make information more accessible to the world, multiple projects are currently digitizing physical books that were written before the computer age. The book pages are being photographically scanned, and then, to make them searchable, transformed into text using “Optical Character Recognition” (OCR). The transformation into text is useful because scanning a book produces images, which are difficult to store on small devices, expensive to download, and cannot be searched. The problem is that OCR is not perfect.

I certainly understand the problems with OCR technologies. The few times we used OCR to scan articles for faculty when I worked at Texas Tech University, the results were very disappointing. I’m sure the technologies have improved since then, but clearly OCR software cannot be perfect, and human intervention is needed to decipher words in many cases. That’s where blog commenters like you and me come in! Again, according to the project’s website:

About 60 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that’s not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort? reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into “reading” books.
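
The arithmetic behind that quote is easy to check. Here is a minimal Python sketch using only the two figures quoted above (60 million CAPTCHAs per day, roughly ten seconds each); the function and constant names are purely illustrative:

```python
# Figures quoted from the reCAPTCHA project website
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10  # "roughly ten seconds", so the result is an estimate


def daily_hours(captchas_per_day: int, seconds_each: float) -> float:
    """Convert a daily CAPTCHA count into hours of human effort per day."""
    return captchas_per_day * seconds_each / 3600  # 3600 seconds in an hour


hours = daily_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")
```

That works out to roughly 167,000 hours a day, which squares with the website’s “more than 150,000 hours.”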

Blog commenters help out whenever a blog requires reCAPTCHA words before a comment can be submitted. Again, according to the project website:

reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.
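
To picture the selection step described in that quote, here is a small Python sketch. It is only an illustration under assumptions of my own, not the project’s actual code: I am assuming the OCR program reports a confidence score between 0 and 1 for each word, and the `OcrWord` class, the `words_needing_humans` function, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str          # the OCR program's best guess at the word
    confidence: float  # hypothetical score: 0.0 (no idea) to 1.0 (certain)


def words_needing_humans(words, threshold=0.8):
    """Return the words the OCR program flagged as unreliable.

    The threshold is an assumed cutoff; a real OCR package may signal
    uncertainty in a different way.
    """
    return [w for w in words if w.confidence < threshold]


# A made-up line of scanned text with two badly recognized words
scanned_line = [
    OcrWord("The", 0.99),
    OcrWord("qu1ck", 0.41),
    OcrWord("brown", 0.97),
    OcrWord("f0x", 0.35),
]

# These are the words that would be placed on images and shown as CAPTCHAs
captcha_queue = words_needing_humans(scanned_line)
print([w.text for w in captcha_queue])
```

In this made-up example, only “qu1ck” and “f0x” fall below the cutoff, so only those two words would be handed to human readers.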

If you have problems leaving a comment on my blog now that I’ve activated this plug-in, please let me know. I’m hopeful it will not cause many problems, however, because the goal of the reCAPTCHA project is GREAT.

If you have your own blog, consider installing and using reCAPTCHA! Not only will you likely decrease blog spam by installing it, but you’ll also be helping out a worthy cause by getting all your blog commenters to pitch in with the work of OCR translations!


There are currently 2 responses to “reCAPTCHA, OCR, and the work of blog commenters”

  1. On August 9th, 2007, Rich said:

    FYI, Leo Laporte and Steve Gibson did a really good job explaining both CAPTCHA and reCAPTCHA a few weeks ago on the Security Now podcast, if you (or any other readers) want to hear more about it. Sounds like a very cool idea!

  2. On August 10th, 2007, Wesley Fryer said:

    Excellent, Rich, thanks for the heads-up on that episode. Security Now is one of my favorite podcasts, and I’ll listen to that episode on my iPod Sunday as our family drives back to Oklahoma from Lubbock! :-)