OCRFeeder
OCRFeeder is a free software desktop OCR (optical character recognition) suite for GNOME. It converts paper documents to digital document files or makes them accessible to visually impaired users.

OCRFeeder is distributed as free software under the GNU General Public License (GPL) version 3 or later. It is available for Unix-like operating systems, either as source code or as pre-built binary packages for systems based on the Debian package management system, or as third-party builds for openSUSE and Slackware. In Debian-based Linux distributions it may be installed directly from the default software channels.

History

OCRFeeder was started as a master's thesis in computer science by Joaquim Rocha, who is now working for Igalia, S.L. and continuing development there.

The first version was published in March 2009. The OCRFeeder project was initially published and hosted on Google Code, temporarily used Gitorious and now uses the GNOME infrastructure. Since 5 April 2010 a software package has been included in the official Debian repositories.

Version 0.7 (July 30, 2010) brought image pre-processing features; version 0.7.1 (November 8, 2010) enabled scanner access from within OCRFeeder.

Features

OCRFeeder has a simple graphical user interface designed according to the GNOME Human Interface Guidelines. It performs document layout analysis and transfers the layout to output formats that support it. It searches for content areas, outlines them, guesses the content type (text or image) and processes text areas through the OCR backend. It can use virtually any command-line OCR engine as a backend and features auto-detection and auto-configuration for all popular free engines. OCR backends may be auto-configured, set up by entering the necessary command line in a GUI dialogue, or configured directly via an XML file. Scanned images can be post-processed, including deskewing. All recognition results can be reviewed and edited before saving to the desired output format. Sessions can be saved and loaded. The suite also includes a spell checker. OCRFeeder has built-in procedures for post-processing the raw OCR results returned by the OCR engine. It can remove the remaining segmentation into printed lines of text, including the removal of hyphenation.
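
The line-joining step described above can be illustrated with a minimal Python sketch. It is not OCRFeeder's own code: the function name `unwrap_lines` and the heuristic (a trailing hyphen followed by a lowercase letter marks a word split across lines, and blank lines separate paragraphs) are assumptions made for this example.

```python
import re


def unwrap_lines(raw_text: str) -> str:
    """Join the printed lines of each paragraph into a single line.

    Assumptions of this sketch: paragraphs are separated by blank lines,
    and a line ending in "-" followed by a line starting with a lowercase
    letter is a word that was hyphenated at the line break.
    """
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    result = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Hyphenated line break: drop the hyphen and glue the halves.
                joined = joined[:-1] + line
            elif joined:
                # Ordinary line break: replace it with a single space.
                joined += " " + line
            else:
                joined = line
        result.append(joined)
    return "\n\n".join(result)


if __name__ == "__main__":
    sample = "The recog-\nnition result\nis here.\n\nSecond para-\ngraph."
    print(unwrap_lines(sample))
```

A heuristic this simple would also merge genuine compounds that happen to break at their hyphen (for example "well-" followed by "known"), so a real implementation would likely need a dictionary check or manual review.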

Although OCRFeeder is a GUI tool, it can also run in command-line mode (as ocrfeeder-cli), which can be useful for automatic batch processing of documents. In this mode OCRFeeder uses the default OCR engine, which the user can set in the application's preferences.

The program is written in Python and uses the GTK+ library (via PyGTK). It acts as a graphical front-end for other existing tools. For example, it does not perform the character recognition itself but uses an external program, an “OCR engine”, that is installed on the system. It can automatically detect and configure CuneiForm, GOCR, Ocrad and Tesseract as backend OCR engines. Scanners are accessed via SANE. For post-processing of scanned images, the command-line tool “Unpaper” is integrated, among other things. PDF files are processed using Ghostscript in the backend.
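
As a rough illustration of how a front-end might detect installed engines, the following Python sketch looks for each engine's executable on the search path. It is an assumption-laden example rather than OCRFeeder's actual detection logic: the `KNOWN_ENGINES` table and the `detect_engines` function are hypothetical names, and the executable names are those the four engines are usually installed under.

```python
import shutil

# Executable names the four engines are usually installed under.
# This table is an assumption of the sketch, not taken from OCRFeeder.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "Ocrad": "ocrad",
    "CuneiForm": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return a dict mapping engine name to executable path.

    Only engines whose executable is found are included. The lookup
    function is a parameter so it can be replaced in tests.
    """
    found = {}
    for name, executable in KNOWN_ENGINES.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_engines().items():
        print(f"{name}: {path}")
```

Passing the lookup function as an argument keeps the sketch testable on machines where none of the engines is installed.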

Input and output

OCRFeeder can import data from PDF or graphic files. Since version 0.7.1a it supports grabbing images directly from a scanner device.

The results can be saved in HTML, OpenDocument or plain text file formats (initial formatting can be done directly in the program). hOCR file output is also planned.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.