Optical character recognition - AbsoluteAstronomy.com

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it.
OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.
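The early approach of storing an image of each character, one font at a time, can be illustrated with a small template-matching sketch. The 3x5 bitmaps, the `TEMPLATES` table and the `classify` function are hypothetical and chosen only for illustration; they do not describe any particular historical system.

```python
# Hypothetical 3x5 bitmaps for a single font; '#' is ink, '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def mismatches(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(glyph):
    """Return the label of the stored template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda label: mismatches(glyph, TEMPLATES[label]))

# A scanned "T" with one noisy pixel in the middle row.
noisy_t = ["###", ".#.", ".##", ".#.", ".#."]
print(classify(noisy_t))
```

Because every character must be present in the table, a sketch like this only works for the font it was built for, which is the limitation the paragraph above describes.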

History

In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Paul W. Handel, who obtained a US patent on OCR in 1933. In 1935 Tauschek was also granted a US patent on his method. Tauschek's machine was a mechanical device that used templates and a photodetector.

In 1949 RCA engineers worked on the first primitive computer-type OCR, intended to help blind people, for the US Veterans Administration. Rather than stopping at converting the printed characters to machine language, their device converted them to machine language and then spoke the letters. It proved far too expensive and was not pursued after testing.

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this, reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953 after his patent was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation.

In 1955, the first commercial system was installed at the Reader's Digest. The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes. Other systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell Telephone Company and a page scanner to the United States Air Force for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed on Shepard's OCR patents.

In about 1965, Reader's Digest and RCA collaborated to build an OCR document reader designed to digitise the serial numbers on Reader's Digest coupons returned from advertisements. The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid state computers). This reader was followed by a specialised document reader installed at TWA, where it processed airline ticket stock. The readers processed documents at a rate of 1,500 documents per minute and checked each document, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process "turn around documents" such as the utility and insurance bills returned with payments.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanised sorting center, and print a routing bar code on the envelope based on the postal code. To avoid confusion with the human-readable address field, which can be located anywhere on the letter, special ink (orange in visible light) is used that is clearly visible under ultraviolet light. Envelopes may then be processed with equipment based on simple barcode readers.

In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system: a computer program capable of recognizing text printed in any normal font. He decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976 the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products became a subsidiary of Xerox known as Scansoft, now Nuance Communications.

From 1992 to 1996, commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) conducted the Annual Test of OCR Accuracy, the most authoritative such test, for five consecutive years. ISRI is a research and development unit of the University of Nevada, Las Vegas. It was established in 1990 with funding from the U.S. Department of Energy. Its mission is to foster the improvement of automated technologies for understanding machine-printed documents.

OCR software

Desktop & Server OCR Software

OCR and ICR (intelligent character recognition) software are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases. Based on the analysis of sequential lines and curves, OCR and ICR make "best guesses" at characters, using database look-up tables to closely associate or match the strings of characters that form words.
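The look-up idea can be sketched as follows: a recognizer proposes several plausible characters for each position, and a word table keeps only the combinations that spell a known word. The `WORDS` table, the `best_guesses` function and the candidate lists are hypothetical, and a real system would also weigh how likely each candidate is.

```python
from itertools import product

# Hypothetical look-up table of known words.
WORDS = {"clock", "cloak", "block", "flock"}

def best_guesses(candidates):
    """Return, in sorted order, every known word that can be spelled by
    picking one candidate character from each position."""
    spelled = ("".join(chars) for chars in product(*candidates))
    return sorted(word for word in spelled if word in WORDS)

# Per-position guesses for a smudged scan of "clock": each string lists
# the characters the recognizer considers plausible at that position.
candidates = ["ce", "l1", "o0", "ce", "k"]
print(best_guesses(candidates))
```

Here the ambiguous shapes (`c` or `e`, `l` or `1`, `o` or `0`) are resolved by the table rather than by the character shapes alone; when more than one word survives, further context is needed.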

WebOCR & OnlineOCR

As information technology has developed, software has moved from a single PC platform to multiple platforms: PC, Web-based, cloud computing and mobile devices. After 30 years of development, OCR software began adapting to these new application requirements. WebOCR, also known as OnlineOCR or Web-based OCR service, is a newer trend that serves larger volumes and larger groups of users than desktop OCR. Internet and broadband technologies have made WebOCR and OnlineOCR practical for both individual users and enterprise customers. Since 2000, some major OCR vendors have offered WebOCR and OnlineOCR software, and a number of new entrants have seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.

Application-Oriented OCR

As OCR technology has been applied more and more widely in paper-intensive industries, it has had to cope with more complex real-world images. Examples include complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols and glossary words. All of these factors affect the stability of an OCR product's recognition accuracy.

In recent years, the major OCR technology providers have begun to develop dedicated OCR systems, each for a special type of image. They combine various optimization methods related to the special image, such as business rules, standard expressions, glossary dictionaries and the rich information contained in color images, to improve recognition accuracy.

This strategy of customizing OCR technology is called "Application-Oriented OCR" or "Customized OCR". It is widely used in fields such as Business-card OCR, Invoice OCR, Screenshot OCR, ID card OCR, Driver-license OCR and Auto plant OCR.
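As one illustration of how a business rule can improve accuracy for a specific document type, the sketch below post-corrects an invoice number field. The `INV-` plus six digits format, the confusion table and the `correct_invoice_number` function are all assumptions made for this example, not features of any named product.

```python
import re

# Assumed business rule: invoice numbers are "INV-" followed by six digits.
INVOICE_PATTERN = re.compile(r"INV-\d{6}")

# Illustrative table of letters that can be misread in place of digits.
LETTER_TO_DIGIT = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

def correct_invoice_number(raw):
    """Apply the digits-only rule to the part after the hyphen, then validate.
    Returns the corrected string, or None if it still breaks the rule."""
    prefix, sep, digits = raw.partition("-")
    fixed = prefix + sep + digits.translate(LETTER_TO_DIGIT)
    return fixed if INVOICE_PATTERN.fullmatch(fixed) else None

print(correct_invoice_number("INV-2O1S07"))  # letters O and S sit where digits belong
print(correct_invoice_number("INV-20X507"))  # X has no digit look-alike in the table
```

A field that still fails the rule after correction is returned as `None`, so it can be routed to human review instead of being accepted silently.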

Current state of OCR technology

Recognition of typewritten Latin-script text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71% to 98%; total accuracy can be achieved only by human review. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character), are still the subject of active research.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognized with no incorrect letters.
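The gap between character-level and word-level accuracy can be checked with a little arithmetic. The sketch below assumes that character errors are independent of one another and uses illustrative word lengths; neither assumption comes from the source.

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character in a word is correct, assuming
    character errors are independent of one another."""
    return char_accuracy ** word_length

for length in (3, 5, 8):
    acc = word_accuracy(0.99, length)
    print(f"{length}-letter words: {acc:.1%} correct, {1 - acc:.1%} in error")
```

Under these assumptions, 99% character accuracy gives roughly 95% accuracy on five-letter words, and the word error rate grows as words get longer.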

On-line character recognition is sometimes confused with optical character recognition (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the PenPoint OS or the Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and intelligent character recognition or ICR.
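The difference can be sketched with time-stamped pen samples, which an on-line system has and a scanned image does not. The `(t, x, y)` sample format and the `stroke_direction` function are hypothetical, and x is assumed to increase to the right.

```python
def stroke_direction(points):
    """Classify a roughly horizontal pen stroke from (t, x, y) samples by
    comparing the x position of its earliest and latest sample."""
    ordered = sorted(points)  # tuples sort by timestamp t first
    dx = ordered[-1][1] - ordered[0][1]
    if dx > 0:
        return "left-to-right"
    if dx < 0:
        return "right-to-left"
    return "undetermined"

# The same mark on the page, drawn in opposite directions.
forward = [(0.00, 10, 50), (0.05, 40, 51), (0.10, 80, 50)]
backward = [(0.00, 80, 50), (0.05, 40, 51), (0.10, 10, 50)]
print(stroke_direction(forward), stroke_direction(backward))
```

Both strokes would leave the same pixels on paper, so an off-line system that sees only the finished image could not tell them apart.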

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than that of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognise all handwritten cursive script.
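The cheque example can be sketched with a small lexicon of number words. The code below uses the string similarity from Python's standard `difflib` module as a stand-in for a real recognizer's scoring; the `AMOUNT_WORDS` list, the 0.6 cutoff and the `read_amount_line` function are assumptions made for illustration.

```python
import difflib

# Small lexicon: the only words expected on a written-out amount line.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety", "hundred", "thousand", "and",
]

def read_amount_line(raw_words):
    """Replace each noisy word with its closest lexicon entry, or '?'
    when nothing in the lexicon is similar enough."""
    result = []
    for word in raw_words:
        match = difflib.get_close_matches(word.lower(), AMOUNT_WORDS, n=1, cutoff=0.6)
        result.append(match[0] if match else "?")
    return result

print(read_amount_line(["thrce", "hundrcd", "and", "fifly"]))
```

With only a couple of dozen possible words, a misread letter or two rarely makes a word ambiguous, whereas the same errors checked against a full dictionary would match many more candidates.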

OCR is a basic technology that is also used in advanced scanning applications. As a result, an advanced scanning solution can be unique and patented, and not easily copied, despite being based on this basic OCR technology.

For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.

A technique which is having considerable success in recognising difficult words and character groups within documents generally amenable to computer OCR is to submit them automatically to humans in the reCAPTCHA system.

External links

Unicode OCR - Hex Range: 2440-245F Optical Character Recognition in Unicode

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
