Spam sucks. Whether it is in our email inboxes or weblog comments, we don’t want useless messages advertising stuff we don’t need. We owners of weblogs and websites especially don’t want our home sweet home to become the advertising spot for these products, so we take measures to prevent it, almost all of them based on the most basic fact we know about all comment spam: it is never posted by humans.

The most effective way to prevent comment spam is also the simplest: ask the commenter to prove he or she is human before any comment can be posted. So the CAPTCHA was born: a “Completely Automated Public Turing test to tell Computers and Humans Apart”, where the human commenter is asked to decode an image representing a word or a combination of letters and digits, garbled to defeat automatic recognition (OCR).

Human beings worldwide solve an estimated 60 million of these CAPTCHAs every day, while organisations like the Internet Archive struggle to digitize books whose scanned text is hard to recognize. It was only a matter of time before someone put two and two together. And someone finally did: reCAPTCHA. By using words scanned from books for the Internet Archive for verification, instead of randomly generated garbage, the global effort of solving CAPTCHAs is channeled and put to good use in aiding the digitizing of books.

However, I do see one problem. Using only the non-OCRable words is impossible: the system would not work as a verification, since the checking end would not know the correct answer. reCAPTCHA solves this by using two words for each captcha: one that has already been correctly identified and one that is still unknown. The verification only uses the word that has already been identified, while the answer for the other word is passed on to (in this case) the Internet Archive to help them decode it. In other words: the system that has to challenge non-human commenters with a string computers cannot decode uses a word that has already been decoded by a computer! If a spammer’s OCR system ever reaches the same level of expertise as the system reCAPTCHA is teaching, every reCAPTCHA instantly becomes worthless. Moreover, the spammer would probably fill in the “unknown” part of the captcha with garbage, since it is not needed for the verification anyway, rendering the entire project useless.
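To make the concern concrete, here is a minimal Python sketch of a two-word check as described above. It is not reCAPTCHA’s actual code: the `verify` function, the `votes` store and the challenge fields are all hypothetical names invented for illustration, and the only assumption is the one stated in the text, namely that just the already-identified word is checked.

```python
from collections import Counter, defaultdict

# Hypothetical store of human guesses for each not-yet-decoded word image.
votes = defaultdict(Counter)

def verify(challenge, answer_known, answer_unknown):
    """Check only the known word; record the other answer as a guess."""
    expected = challenge["known_text"].lower()
    passed = answer_known.strip().lower() == expected
    if passed:
        # The unknown word cannot be checked, so anything typed here is accepted.
        votes[challenge["unknown_id"]][answer_unknown.strip().lower()] += 1
    return passed

# Hypothetical challenge: "morning" is the known word, "scan-0001" the unknown one.
challenge = {"known_text": "morning", "unknown_id": "scan-0001"}

ok_honest = verify(challenge, "morning", "overlooks")   # human typing both words
ok_garbage = verify(challenge, "Morning", "xxxx")       # bot that only reads the known word
ok_wrong = verify(challenge, "evening", "overlooks")    # wrong known word is rejected
```

In this sketch the garbage answer passes just as easily as the honest one, and both end up as equal votes for the unknown word, which is exactly the flooding scenario.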

Of course this scenario can be prevented. The “recognizable” part of the reCAPTCHA should have been decoded by humans as recently as possible, and ideally it should be random which of the two words in the reCAPTCHA is the “recognizable” one and which is the “unknown” one. I hope the guys at reCAPTCHA thought of this.
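The random placement could look something like the following Python sketch. Again, this is only an illustration of the idea, not how reCAPTCHA is known to work; `build_challenge` and `check` are hypothetical helpers, and the words are plain strings here where a real captcha would show distorted images.

```python
import random

def build_challenge(known_word, unknown_word, rng=random):
    """Place the known word at a random position (0 or 1).

    Returns the two words in display order plus the position of the known
    word, which the server must keep secret.
    """
    known_index = rng.randrange(2)
    if known_index == 0:
        words = [known_word, unknown_word]
    else:
        words = [unknown_word, known_word]
    return words, known_index

def check(answers, known_word, known_index):
    """Verify only the answer typed at the secret position."""
    return answers[known_index].strip().lower() == known_word.lower()

# Hypothetical usage with a seeded generator so the run is repeatable.
rng = random.Random(0)
words, secret_index = build_challenge("morning", "overlooks", rng)
human_passes = check(words, "morning", secret_index)  # a human reads both words
```

A bot that can only read one of the two words no longer knows which box it is allowed to fill with garbage, so guessing wrong about the position costs it the verification.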

