Named entity recognition - AbsoluteAstronomy.com

Named-entity recognition (NER), also known as entity identification and entity extraction, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text, such as this one:

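<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.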

In this example, the annotations have been done using so-called ENAMEX tags that were developed for the Message Understanding Conference in the 1990s.
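The step from plain text to inline annotation can be sketched in a few lines of Python. This is only an illustration, not the MUC tooling: the function name `annotate`, the span layout, and the TYPE values are assumptions chosen to match the example sentence.

```python
from typing import List, Tuple

# (start, end, tag, type); offsets are character positions, end exclusive.
Span = Tuple[int, int, str, str]

def annotate(text: str, spans: List[Span]) -> str:
    """Wrap each span in an SGML-style tag such as <ENAMEX TYPE="PERSON">."""
    out = []
    cursor = 0
    for start, end, tag, etype in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        out.append(text[cursor:start])  # untouched text before the span
        out.append(f'<{tag} TYPE="{etype}">{text[start:end]}</{tag}>')
        cursor = end
    out.append(text[cursor:])  # trailing text after the last span
    return "".join(out)

text = "Jim bought 300 shares of Acme Corp. in 2006."

# Hypothetical labels for the example; a real system would predict these.
labels = [
    ("Jim", "ENAMEX", "PERSON"),
    ("300", "NUMEX", "QUANTITY"),
    ("Acme Corp.", "ENAMEX", "ORGANIZATION"),
    ("2006", "TIMEX", "DATE"),
]
spans = [(text.index(s), text.index(s) + len(s), tag, etype)
         for s, tag, etype in labels]

annotated = annotate(text, spans)
print(annotated)
```

Keeping the spans as character offsets, separate from the text, makes it easy to compare a system's output with a human annotation of the same sentence.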

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%. Measured as 100% minus the F-measure, the system's error rate (6.61%) was roughly two to three times that of the human annotators (2.40% and 3.05%).
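The arithmetic behind these figures can be checked with a short sketch. It assumes the error rate is simply 100% minus the F-measure, and it uses the standard definition of F1 as the harmonic mean of precision and recall. The function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def error_rate(f_measure_pct: float) -> float:
    """Error rate taken as the complement of an F-measure given in percent."""
    return 100.0 - f_measure_pct

system = error_rate(93.39)                       # best MUC-7 system
humans = [error_rate(97.60), error_rate(96.95)]  # two human annotators
ratios = [system / h for h in humans]            # system error vs. each human

print(round(system, 2), [round(h, 2) for h in humans])
print([round(r, 2) for r in ratios])
```

Under that assumption the system's error rate comes out between two and three times that of either annotator.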

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data.
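The precision/recall trade-off of hand-crafted rules can be seen in a toy example. The sketch below uses two regular-expression rules standing in for a grammar. The rules, the names `RULES` and `tag_entities`, and the type labels are all hypothetical and far simpler than any real grammar-based system.

```python
import re
from typing import List, Tuple

# Hypothetical hand-written rules: (compiled pattern, entity type).
RULES = [
    # Capitalised words followed by a corporate suffix, e.g. "Acme Corp."
    (re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Corp\.|Inc\.|Ltd\.)"), "ORGANIZATION"),
    # A four-digit year between 1900 and 2099.
    (re.compile(r"\b(?:19|20)\d{2}\b"), "DATE"),
]

def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, entity type) pairs in order of appearance."""
    found = []
    for pattern, etype in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), etype))
    found.sort()  # order by position in the text
    return [(surface, etype) for _, surface, etype in found]

entities = tag_entities("Jim bought 300 shares of Acme Corp. in 2006.")
print(entities)
```

Both matches are correct, but the person name "Jim" is missed because no rule covers it. Closing such gaps one rule at a time is where the months of expert work go.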

Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains. Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products.

Named entity types

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.

There is general agreement to include temporal expressions and some numerical expressions (e.g., money and percentages) as instances of named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.

At least two hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes. Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.

Current Challenges and Research Trends

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of Named Entity Recognition is far from being solved. The main efforts are directed at reducing the annotation labor, achieving robust performance across domains, and scaling up to fine-grained entity types.

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan">Michael Jordan</ENTITY> is a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley">Berkeley</ENTITY>

Available Systems

Several systems are available online. For traditional NER, the most popular publicly available systems are the Illinois NER system, the Stanford NER system, and the Lingpipe NER system. The Illinois NER reports 90.6 F1 on the CoNLL03 NER shared task data and the Stanford NER reports 86.86 F1.

There are also several publicly available Wikification systems for identifying important expressions in the text and cross-linking them to Wikipedia. The most notable are the Illinois Wikification system, WM Wikifier, and TAGME.

NER evaluation forums

Evaluation of NER systems is critical to scientific progress of this field.

Most evaluation of these systems has been performed at conferences or contests put on by government organizations, sometimes acting in concert with contractors or academics.

Conference	Acronym	Language(s)	Year(s)	Sponsor	Archive Site
Message Understanding Conference	MUC	English	1987–1999	DARPA	http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
Multilingual Entity Task Conference	MET	Chinese and Japanese	1998	US	http://www-nlpir.nist.gov/related_projects/tipster/met.htm
Automatic Content Extraction Program	ACE	English	2000–2008	NIST	http://www.nist.gov/speech/tests/ace/
Conference on Computational Natural Language Learning	CoNLL	Spanish and Dutch / German and English	2002–2003		http://www.cnts.ua.ac.be/conll/
Evaluation contest for named entity recognizers in Portuguese	HAREM	Portuguese	2004–2008	Linguateca	http://www.linguateca.pt/HAREM/
Information Retrieval and Extraction Exercise	IREX	Japanese	1998–1999		http://portal.acm.org/citation.cfm?id=992814&dl=acm&coll=&CFID=15151515&CFTOKEN=6184618
ACL Special Interest Group in Chinese	SIGHan	Chinese	2006		http://sighan.cs.uchicago.edu/bakeoff2006/
TAC Knowledge Base Population Evaluation	TAC/KBP	English	2009–	NIST	http://www.nist.gov/tac/

External links

Named entity recognition for Arabic – Issues and challenges in morphologically rich languages such as Arabic

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.