Information extraction

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and concept extraction from images, audio, and video, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction of corporate mergers from news wire reports, where a formal relation between the companies involved is extracted from an online news sentence such as:
"Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
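
As a rough illustration of the merger example above, the following minimal sketch turns the quoted news sentence into a structured record using a single hand-written regular expression. The pattern, the record layout, and the function name `extract_acquisition` are assumptions made for this sketch; a practical system would need far broader patterns.

```python
import re

# Assumption: company names are capitalized words ending in "Inc." or "Corp."
COMPANY = r"(?:[A-Z][\w-]*\s)+(?:Inc|Corp)\."

# Hypothetical pattern for sentences of the form
# "<acquirer> announced their acquisition of <target>".
ACQUISITION = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of (?P<target>{COMPANY})"
)

def extract_acquisition(sentence):
    """Return a structured record for the acquisition, or None if no match."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return {
        "relation": "acquisition",
        "acquirer": match.group("acquirer"),
        "target": match.group("target"),
    }

sentence = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
record = extract_acquisition(sentence)
print(record)
# {'relation': 'acquisition', 'acquirer': 'Foo Inc.', 'target': 'Bar Corp.'}
```

Once the fact is held in a record like this, ordinary computation (filtering, counting, joining with other data) becomes possible, which is the broad goal described above.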


Information extraction dates back to the late 1970s, in the early days of NLP. An early commercial system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group with the aim of providing real-time financial news to financial traders.

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences (MUC). MUC was a competition-based conference series that focused on the following domains:
  • MUC-1 (1987), MUC-2 (1989): Naval operations messages.
  • MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
  • MUC-5 (1993): Joint ventures and microelectronics domain.
  • MUC-6 (1995): News articles on management changes.
  • MUC-7 (1998): Satellite launch reports.

Considerable support came from DARPA, the US defense agency, which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.

Present significance

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more of the content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.
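
The two target representations mentioned above can be sketched briefly. The example below stores one extracted fact as a row in a relational table and renders the same fact as XML. The fact itself is written by hand here, and the table name and field names (`acquisition`, `acquirer`, `target`) are assumptions chosen for illustration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

# One extracted fact, hard-coded for the sketch; in practice it would be
# produced by an IE step. Field names are illustrative assumptions.
fact = {"acquirer": "Foo Inc.", "target": "Bar Corp."}

# Relational form: the fact becomes one row in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acquisition (acquirer TEXT, target TEXT)")
conn.execute("INSERT INTO acquisition VALUES (:acquirer, :target)", fact)
rows = conn.execute("SELECT acquirer, target FROM acquisition").fetchall()
print(rows)

# XML form: the same fact as a small element tree.
root = ET.Element("acquisition")
ET.SubElement(root, "acquirer").text = fact["acquirer"]
ET.SubElement(root, "target").text = fact["target"]
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Either form can then be queried or reasoned over by software that could not have used the original sentence directly.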

IE tasks and subtasks

Applying information extraction to text is linked to the problem of text simplification, in order to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable, so that the sentences can be processed. Typical subtasks of IE include:
  • Named entity extraction, which could include:
    • Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions. For a sentence that mentions "M.Smith", for example, this means stating that the sentence refers to a specific "M.Smith" about whom extra information is available from other sentences and/or existing knowledge. Typically this involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without having any existing knowledge about the entity instances.
    • Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. Given the two sentences "M.Smith likes fishing. But he doesn't like biking", the task is to detect that "he" refers to the previously detected person "M.Smith".
    • Relationship extraction: identification of relations between entities, such as:
      • PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
      • PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
  • Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
    • Table extraction: finding and extracting tables from documents.
    • Comments extraction: extracting comments from the actual content of an article in order to restore the link between each sentence and its author.
  • Language and vocabulary analysis
    • Terminology extraction: finding the relevant terms for a given corpus

Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly agreed upon, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
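
To make the relationship extraction subtask more concrete, here is a minimal sketch that recovers the two relations listed above from their example sentences. It stands in for named entity recognition with a tiny hand-made lookup table and uses two hypothetical surface patterns; the names `GAZETTEER`, `PATTERNS`, and `extract_relations` are assumptions of this sketch, not part of any particular system.

```python
import re

# Toy lookup table standing in for named entity recognition (assumed entries).
GAZETTEER = {"Bill": "PERSON", "IBM": "ORGANIZATION", "France": "LOCATION"}

# Each entry: relation name, surface pattern, required (left, right) entity types.
PATTERNS = [
    ("works_for", re.compile(r"(\w+) works for (\w+)"), ("PERSON", "ORGANIZATION")),
    ("located_in", re.compile(r"(\w+) is in (\w+)"), ("PERSON", "LOCATION")),
]

def extract_relations(text):
    """Return (relation, left entity, right entity) triples found in text."""
    relations = []
    for name, pattern, (left_type, right_type) in PATTERNS:
        for left, right in pattern.findall(text):
            # Keep the match only if both arguments have the expected types.
            if GAZETTEER.get(left) == left_type and GAZETTEER.get(right) == right_type:
                relations.append((name, left, right))
    return relations

triples = extract_relations("Bill works for IBM. Bill is in France.")
print(triples)
# [('works_for', 'Bill', 'IBM'), ('located_in', 'Bill', 'France')]
```

The type check is what separates this from plain string matching: "Bill is in IBM." matches the second surface pattern but yields no relation, because "IBM" is not a LOCATION in the lookup table.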

IE on non-text documents is becoming an increasingly active research topic, and information extracted from multimedia documents can now be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

Information extraction and the World Wide Web

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people cope with the enormous amount of data available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and layout formats that are available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.
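
A hand-written wrapper of the kind described above might look like the following sketch, which relies only on the page's HTML tags rather than on linguistic analysis. The page content and the class name `CatalogWrapper` are invented for illustration, and the single rule ("every table row made of data cells is one record") is an assumption about this particular page layout.

```python
from html.parser import HTMLParser

class CatalogWrapper(HTMLParser):
    """Collects the text of <td> cells, grouped into one tuple per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:  # header rows contain only <th> cells and are skipped
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Invented sample page in the style of a product catalogue.
PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

wrapper = CatalogWrapper()
wrapper.feed(PAGE)
print(wrapper.rows)
# [('Widget', '9.99'), ('Gadget', '24.50')]
```

Because the rule is tied to one page layout, it is accurate on that layout and brittle elsewhere, which is why writing such wrappers by hand is costly and why rule induction is attractive.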

Wrappers typically handle highly structured collections of web pages, such as product catalogues and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent efforts on adaptive information extraction motivate the development of IE systems that can handle different types of text, from well-structured to almost free text, where common wrappers fail, including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured text.


Three standard approaches are now widely accepted for IE:
  • Hand-written regular expressions (perhaps stacked)
  • Using classifiers
    • Generative: naïve Bayes
    • Discriminative: Maxent models
  • Sequence models
    • Hidden Markov model
    • CMMs/MEMMs
    • Conditional random fields (CRFs), which are commonly used in conjunction with IE for tasks as varied as extracting information from research papers and extracting navigation instructions.

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.
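
As an illustration of the sequence-model approach listed above, the sketch below tags each token of a sentence as part of an organization name (ORG) or not (O) using a hidden Markov model decoded with the Viterbi algorithm. All probabilities are made-up values chosen for the example rather than estimates from data, and the three observation symbols produced by the hypothetical `shape` function are an assumption of this sketch.

```python
def shape(token):
    """Map a token to a coarse observation symbol (assumed feature set)."""
    if token in ("Inc.", "Corp."):
        return "suffix"
    return "cap" if token[0].isupper() else "lower"

STATES = ("O", "ORG")
# Illustrative, hand-picked probabilities (not learned from any corpus).
START = {"O": 0.8, "ORG": 0.2}
TRANS = {"O": {"O": 0.8, "ORG": 0.2}, "ORG": {"O": 0.4, "ORG": 0.6}}
EMIT = {
    "O": {"lower": 0.7, "cap": 0.25, "suffix": 0.05},
    "ORG": {"lower": 0.05, "cap": 0.55, "suffix": 0.4},
}

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty observation list."""
    # best[t][s]: probability of the best path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

tokens = "Foo Inc. announced their acquisition of Bar Corp.".split()
tags = viterbi([shape(tok) for tok in tokens], STATES, START, TRANS, EMIT)
print(list(zip(tokens, tags)))
```

With these particular numbers the decoder labels "Foo Inc." and "Bar Corp." as ORG and the remaining tokens as O. In a real system the probabilities would be estimated from annotated text, and discriminative sequence models such as MEMMs or CRFs would replace the generative HMM.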

Free or Open Source Information Extraction Software or Services

  • ReVerb is an open source unsupervised relation extraction system from the University of Washington
  • GExp is a rule-based open source information extraction toolkit
  • General Architecture for Text Engineering (GATE), which is bundled with a free information extraction system
  • OpenCalais: automated information extraction web service from Thomson Reuters (free limited version)
  • Machine Learning for Language Toolkit (Mallet) is a Java-based package for a variety of natural language processing tasks, including information extraction.
  • Apache Tika provides an information extraction framework to parse textual content and metadata from several document formats
  • DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.
  • See also CRF implementations

See also

  • Concept mining
  • DARPA TIPSTER Text program
  • Semantic translation
  • Faceted search
  • Named entity recognition
  • Web scraping
  • Nutch
  • Enterprise search
  • Knowledge extraction
