Optical character recognition

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it.

OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.
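
The idea of programming a system with an image of each character can be illustrated with a minimal template-matching sketch. It is not the method of any particular historical machine: the 3x3 bitmaps, the `TEMPLATES` table and the function names are all hypothetical, chosen only to show how a scanned glyph can be compared pixel by pixel against stored character images.

```python
# Hypothetical 3x3 binary templates (1 = ink). Real systems would use far
# larger bitmaps and one template set per supported font.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def match_score(glyph, template):
    """Count the pixels on which the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the label of the template that agrees with the glyph on the most pixels."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# An "L" with one pixel missing from its base still matches "L" best.
noisy_l = ("100", "100", "110")
print(recognize(noisy_l))
```

Because the templates are tied to one set of character shapes, a sketch like this only works for the font it was built for, which is the limitation described above.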

History


In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Paul W. Handel, who obtained a US patent on OCR in 1933. In 1935 Tauschek was also granted a US patent on his method. Tauschek's machine was a mechanical device that used templates and a photodetector.

In 1949 RCA engineers worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration. Rather than only converting the printed characters to machine language, their device converted them to machine language and then spoke the letters. It proved far too expensive and was not pursued after testing.

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this, reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953 after his patent was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation.

In 1955, the first commercial system was installed at the Reader's Digest. The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes. Other systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell Telephone Company and a page scanner to the United States Air Force for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed on Shepard's OCR patents.

In about 1965, Reader's Digest and RCA collaborated to build an OCR document reader designed to digitise the serial numbers on Reader's Digest coupons returned from advertisements. The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid-state computers). This reader was followed by a specialised document reader installed at TWA, where it processed airline ticket stock. The readers processed documents at a rate of 1,500 documents per minute and checked each document, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process "turn-around documents" such as the utility and insurance bills returned with payments.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanised sorting center, and print a routing bar code on the envelope based on the postal code. To avoid confusion with the human-readable address field, which can be located anywhere on the letter, special ink (orange in visible light) is used that is clearly visible under ultraviolet light. Envelopes may then be processed with equipment based on simple barcode readers.

In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font. He decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976 the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products became a subsidiary of Xerox known as Scansoft, now Nuance Communications.

From 1992 to 1996, the Information Science Research Institute (ISRI), commissioned by the U.S. Department of Energy (DOE), conducted the most authoritative Annual Test of OCR Accuracy for five consecutive years. ISRI is a research and development unit of the University of Nevada, Las Vegas. It was established in 1990 with funding from the U.S. Department of Energy. Its mission is to foster the improvement of automated technologies for understanding machine-printed documents.

OCR software


Desktop & Server OCR Software

OCR software and ICR software technology are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases. Based on the analysis of sequential lines and curves, OCR and ICR make 'best guesses' at characters using database look-up tables to closely associate or match the strings of characters that form words.
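
One way to picture how per-character 'best guesses' are matched against a table of known words is the sketch below. It is an illustration under stated assumptions, not the algorithm of any specific product: the `candidates` scores, the `LEXICON` word list and both function names are made up for the example.

```python
from math import prod

# Hypothetical recognizer output: for each character position, the candidate
# characters and a confidence score for each (values invented for illustration).
candidates = [
    {"c": 0.6, "e": 0.4},
    {"l": 0.5, "1": 0.5},
    {"o": 0.7, "0": 0.3},
    {"c": 0.55, "e": 0.45},
    {"k": 0.9, "h": 0.1},
]

# Assumed look-up table of valid words.
LEXICON = ["clock", "cloak", "check", "elect"]


def word_score(word, candidates):
    """Multiply the confidences of the word's letters; 0 if a letter was never proposed."""
    if len(word) != len(candidates):
        return 0.0
    return prod(position.get(ch, 0.0) for position, ch in zip(candidates, word))


def best_word(candidates, lexicon=LEXICON):
    """Pick the lexicon word that the per-character guesses support most strongly."""
    return max(lexicon, key=lambda word: word_score(word, candidates))


print(best_word(candidates))
```

The look-up step resolves ties that the character analysis alone cannot, such as the equally scored "l" and "1" in the second position, because only one of the two readings forms a word in the table.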

WebOCR & OnlineOCR

As information technology has developed, the platform on which people use software has changed from a single PC to multiple platforms: PC, Web-based, cloud computing and mobile devices. After 30 years of desktop development, OCR software has started to adapt to these new application requirements. WebOCR, also known as OnlineOCR or Web-based OCR service, is a newer trend that serves larger volumes and larger groups of users. Internet and broadband technologies have made WebOCR and OnlineOCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors have offered WebOCR and online software, and a number of new entrants have seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.

Application-Oriented OCR

As OCR technology has been applied more and more widely in paper-intensive industries, it has had to cope with more complex image conditions in the real world, for example complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols and glossary words. All of these factors affect the stability of OCR products' recognition accuracy.

In recent years, the major OCR technology providers have begun to develop dedicated OCR systems, each for a special type of image. They combine various optimization methods related to the special image, such as business rules, standard expressions, glossary dictionaries and the rich information contained in color images, to improve recognition accuracy.

This strategy of customizing OCR technology is called "Application-Oriented OCR" or "Customized OCR". It is widely used in fields such as Business-card OCR, Invoice OCR, Screenshot OCR, ID card OCR, Driver-license OCR and Auto plant OCR.

Current state of OCR technology


Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71% to 98%; total accuracy can be achieved only by human review. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character), are still the subject of active research.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (essentially a lexicon of words) is not used to correct non-existent words produced by the software, a character error rate of 1% (99% accuracy) may result in a word error rate of 5% (95% accuracy) or worse when the measurement is based on whether each whole word was recognized with no incorrect letters.
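
The gap between character-level and word-level figures can be reproduced with a short calculation. The sketch assumes that character errors are independent and that every word has the same length; both are simplifications introduced here, not claims from the study cited above, and the function names are illustrative.

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character of a word is correct, assuming independent errors."""
    return char_accuracy ** word_length


def word_error_rate(char_accuracy, word_length):
    """Probability that at least one character of a word is wrong."""
    return 1.0 - word_accuracy(char_accuracy, word_length)


# Character accuracy of 99% measured on words of different lengths.
for length in (3, 5, 8):
    rate = word_error_rate(0.99, length)
    print(f"{length}-letter words: {rate * 100:.2f}% word error rate")
```

Under these assumptions a 1% character error rate on five-letter words gives a word error rate of about 4.9%, and longer words fare worse.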

On-line character recognition is sometimes confused with optical character recognition (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the PenPoint OS or the Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR.
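
The difference between the two kinds of input can be sketched in a few lines. The example assumes that an on-line system delivers each stroke as a time-ordered list of (x, y) pen samples with x increasing to the right; the data layout and the function name are hypothetical.

```python
def horizontal_direction(stroke):
    """Classify a stroke of time-ordered (x, y) samples by its net horizontal motion."""
    if len(stroke) < 2:
        return "undetermined"
    dx = stroke[-1][0] - stroke[0][0]  # last sample minus first sample
    if dx > 0:
        return "left-to-right"
    if dx < 0:
        return "right-to-left"
    return "undetermined"


# The same horizontal mark, drawn in opposite directions.
stroke_a = [(0, 5), (4, 5), (9, 5)]
stroke_b = list(reversed(stroke_a))

print(horizontal_direction(stroke_a))
print(horizontal_direction(stroke_b))

# An off-line (scanned) image keeps only which pixels carry ink, not their order,
# so the two strokes are indistinguishable there.
print(set(stroke_a) == set(stroke_b))
```

The ordering information used by `horizontal_direction` is exactly what a scanned page lacks, which is why off-line recognition must work from the static shape alone.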

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than that of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognise all handwritten cursive script.
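
The benefit of a smaller dictionary can be shown with a minimal sketch that maps a misread word to the nearest entries in a lexicon by edit distance. The word lists, the misreading "forly" and the function names are assumptions made for the example; real cheque readers are considerably more elaborate.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_words(raw, lexicon):
    """Return every lexicon word at the minimal edit distance from the raw reading."""
    distances = {word: edit_distance(raw, word) for word in lexicon}
    best = min(distances.values())
    return sorted(word for word, d in distances.items() if d == best)


# Hypothetical small lexicon for a cheque amount line.
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "twenty", "thirty", "forty", "fifty", "hundred", "thousand",
]

# A stand-in for a general dictionary: the same words plus unrelated ones.
GENERAL_WORDS = NUMBER_WORDS + ["folly", "early"]

raw = "forly"  # hypothetical misreading of a cursive "forty"
print(closest_words(raw, NUMBER_WORDS))
print(closest_words(raw, GENERAL_WORDS))
```

With the number-word lexicon the misreading has a single nearest match, while the broader word list leaves two equally close candidates, so the restricted dictionary removes an ambiguity that the character shapes alone cannot resolve.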

OCR is a basic technology that is also used in advanced scanning applications. As a result, an advanced scanning solution can be unique and patented, and not easily copied, despite being based on this basic OCR technology.

For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.

A technique which is having considerable success in recognising difficult words and character groups within documents generally amenable to computer OCR is to submit them automatically to humans in the reCAPTCHA system.

See also

  • List of optical character recognition software
  • Automatic number plate recognition
  • Book scanning
  • CAPTCHA
  • Computational linguistics
  • Computer vision
  • Digital library
  • Digital pen
  • Digital mailroom
  • Handwriting
  • Institutional repository
  • Machine learning
  • Music OCR
  • Optical mark recognition
  • Raster to vector
  • Raymond Kurzweil
  • Sketch recognition
  • Speech recognition
  • Voice recording
