reCAPTCHA is a system originally developed at Carnegie Mellon University's main Pittsburgh campus. It uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas. On September 16, 2009, Google acquired reCAPTCHA. reCAPTCHA is currently digitizing the archives of The New York Times and books from Google Books. Twenty years of The New York Times have been digitized, and the project planned to complete the remaining years by the end of 2010.
reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.
The system is reported to display over 200 million CAPTCHAs every day, and among its subscribers are such popular sites as Facebook, TicketMaster, Twitter, 4chan, CNN.com, and StumbleUpon. Craigslist began using reCAPTCHA in June 2008. The U.S. National Telecommunications and Information Administration also used reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.
Origin
The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles."
Operation
Scanned text is analyzed by two different optical character recognition programs; where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word whose reading is already known. The system assumes that if the human types the control word correctly, the questionable word is also correct. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered called. Words that human judges consistently give a single identity are recycled as control words.
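The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated weights and threshold, not the project's actual code: the function name `resolve_word` and the constants are hypothetical, and the sketch assumes that only human answers whose control word was typed correctly are passed in.

```python
from collections import defaultdict

OCR_WEIGHT = 0.5    # each OCR program's reading counts as half a point
HUMAN_WEIGHT = 1.0  # each human reading counts as a full point
THRESHOLD = 2.5     # score at which a reading is considered called


def resolve_word(ocr_readings, human_readings):
    """Return the first reading to reach THRESHOLD, or None if none has yet.

    human_readings is assumed to contain only answers from users who
    typed the accompanying control word correctly.
    """
    scores = defaultdict(float)
    for reading in ocr_readings:
        scores[reading] += OCR_WEIGHT
    for reading in human_readings:
        scores[reading] += HUMAN_WEIGHT
        if scores[reading] >= THRESHOLD:
            return reading
    return None
```

Under this rule, a reading backed by one of the OCR programs needs two agreeing humans (0.5 + 1 + 1), while a reading that neither program produced needs three.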
Implementation
reCAPTCHA tests are taken from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API, with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment), but the reCAPTCHA software itself is not open source.
reCAPTCHA offers plugins for several web-application platforms, such as ASP.NET, Ruby, and PHP, to ease the implementation of the service.
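The flow described in this section (a challenge served from a central site, followed by a server-side callback to check the answer) can be illustrated with an in-memory stand-in. Everything here is hypothetical: `ChallengeService`, `handle_form_submission`, and the single-use rule are assumptions made for the sketch and do not reflect the actual reCAPTCHA API or its libraries.

```python
import secrets


class ChallengeService:
    """Hypothetical in-memory stand-in for the central CAPTCHA service."""

    def __init__(self, words):
        self._words = list(words)
        self._pending = {}  # challenge id -> expected answer

    def new_challenge(self):
        """Issue a challenge; a browser-side widget would request this."""
        challenge_id = secrets.token_hex(8)
        word = secrets.choice(self._words)
        self._pending[challenge_id] = word
        return challenge_id, word  # a real service would return an image

    def verify(self, challenge_id, response):
        """Check a response; each challenge is assumed to be single-use."""
        expected = self._pending.pop(challenge_id, None)
        if expected is None:
            return False
        return response.strip().lower() == expected.lower()


def handle_form_submission(service, challenge_id, response):
    """Server-side step: call back to the service before accepting a form."""
    if service.verify(challenge_id, response):
        return "accepted"
    return "rejected"
```

The point of the sketch is that the subscribing site never judges the answer itself; it forwards the challenge identifier and the user's response to the service and acts on the verdict.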
Security
The purpose of a CAPTCHA system is to prevent automated access to a system by computer programs, or "bots". On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed a solve rate of 18%. On August 1, 2010, Chad Houck gave a presentation at the DEF CON 18 hacking conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time. The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck then adapted his method to what he described as an "easier" CAPTCHA, determining a valid response 31.8% of the time. Houck also mentioned security defenses in the system, such as a high-security lockout if an invalid response is given 32 times in a row. reCAPTCHA frequently modifies its system, which would require the author of such a program to update the decoding method just as often and may frustrate potential abusers.
Mailhide
reCAPTCHA has also created project Mailhide, which protects email addresses on web pages from being harvested by spammers. By default, the email address is converted into a format that does not allow a crawler to see the full email address. For example, "mailme@example.com" would be converted to "mai...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address. One can also edit the popup code so that none of the address is visible.
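The masking step in the example above can be sketched as follows. The function `mask_email` and its `visible` parameter are hypothetical, and keeping exactly three leading characters is an assumption taken from the single example given; Mailhide's actual rule for how much of the address to show is not stated here.

```python
def mask_email(address, visible=3):
    """Hide all but the first `visible` characters of the local part.

    The default of three characters is an assumption based on the
    "mailme@example.com" example, not a documented Mailhide rule.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local[:visible]}...@{domain}"


print(mask_email("mailme@example.com"))  # mai...@example.com
```

Only the masked string would appear in the page source; in the scheme described above, the full address is revealed only after the visitor solves the CAPTCHA.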
External links
- The reCAPTCHA project
- Try reCAPTCHA at google.com
- ReCAPTCHA: The job you didn't even know you had, a two-page article in The Walrus magazine
- Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham and Manuel Blum. 2008. "reCAPTCHA: Human-Based Character Recognition via Web Security Measures". Science, 12 September 2008, Vol. 321, no. 5895, pp. 1465-1468. http://dx.doi.org/10.1126/science.1160379