The American National Corpus (ANC) is a text corpus of American English currently containing 22 million words of written and spoken data produced since 1990. The ANC may eventually include a range of genres comparable to that of the British National Corpus. It is currently annotated for part of speech and lemma, shallow parse, and named entities.
The ANC in its current size of 22 million words is available from the Linguistic Data Consortium. A 15-million-word subset of the corpus, called the Open American National Corpus (OANC), is freely available from the ANC Website with no restrictions on its use.
The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. Using a freely provided transduction tool, the corpus and user-chosen annotations can be rendered in multiple formats, including an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of concordance software.
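The idea of rendering a corpus plus user-chosen annotations into another format can be illustrated with a minimal Python sketch. It assumes a stand-off model in which annotations point at character offsets in the primary text; the `Annotation` class, the `to_inline` function, and the `word_TAG` output layout are hypothetical names chosen for illustration and are not the ANC transduction tool or any of its actual formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation over a span of the primary text."""
    start: int  # offset of the token's first character
    end: int    # offset one past the token's last character
    tag: str    # part-of-speech label (Penn-style tags, for illustration)


def to_inline(text, annotations):
    """Render stand-off annotations as 'word_TAG' tokens joined by spaces."""
    ordered = sorted(annotations, key=lambda a: a.start)
    return " ".join(f"{text[a.start:a.end]}_{a.tag}" for a in ordered)


# Toy primary text and annotations (invented data, not taken from the ANC).
primary = "The corpus grows"
pos = [
    Annotation(0, 3, "DT"),
    Annotation(4, 10, "NN"),
    Annotation(11, 16, "VBZ"),
]

inline = to_inline(primary, pos)
print(inline)
```

Keeping the annotations separate from the text is what would let a user pick one tag set and one output layout at rendering time without altering the primary data.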
The ANC differs from other corpora of English in that it is richly annotated, including different part-of-speech annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of named entities. Additional annotations are added to all or parts of the corpus as they become available, often through contributions from other projects. Unlike online searchable corpora, which because of copyright restrictions allow access only to individual sentences, the entire ANC is available, enabling research such as the development of statistical language models and full-text linguistic annotation.
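As a rough illustration of why access to the full text matters for statistical language modelling, the following Python sketch estimates bigram probabilities by simple counting. The function names, the whitespace tokenization, and the toy sentences are assumptions made for the example; a real model trained on the ANC would need proper tokenization and smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each token follows another, with boundary markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Toy data standing in for corpus text (invented, not taken from the ANC).
toy = [
    "the corpus is large",
    "the corpus is annotated",
    "the tool is free",
]

model = train_bigrams(toy)
p = probability(model, "the", "corpus")
print(p)
```

Estimates of this kind depend on counts accumulated over whole documents, which a service returning only individual matching sentences cannot supply.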
ANC annotations are automatically produced and unvalidated. A Manually Annotated Sub-Corpus (MASC), to be released in Fall 2009, will include validated annotations for the above-mentioned phenomena as well as Penn Treebank syntactic annotation, WordNet sense annotation, and FrameNet semantic frame annotations.
In Fall 2009, the OANC Ngram Search Engine, which will provide intra- and inter-sentential pattern-based searches, will become available on the ANC Website. In early 2010, the OANC will be expanded to include an additional 20-30 million words of written and spoken data.
See also
- British National Corpus
- Oxford English Corpus
- Corpus of Contemporary American English (COCA)
External links