Text corpus



 
 
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific universe of texts.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
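
As a rough illustration of what side-by-side alignment makes possible, the following Python sketch stores a sentence-aligned corpus as a list of pairs and retrieves every pair whose source sentence contains a given word. The AlignedPair type, the find_aligned helper and the English-French sentences are all invented for this example; real aligned corpora use richer formats.

```python
from typing import NamedTuple


class AlignedPair(NamedTuple):
    source: str  # sentence in the first language
    target: str  # the aligned sentence in the second language


# Toy English-French data, invented for illustration only.
PARALLEL = [
    AlignedPair("The cat sleeps.", "Le chat dort."),
    AlignedPair("The dog eats.", "Le chien mange."),
    AlignedPair("The cat eats.", "Le chat mange."),
]


def find_aligned(corpus, word):
    """Return every aligned pair whose source sentence contains `word`."""
    needle = word.lower()
    hits = []
    for pair in corpus:
        # Crude tokenization: split on whitespace, strip common punctuation.
        tokens = [t.strip(".,;:!?").lower() for t in pair.source.split()]
        if needle in tokens:
            hits.append(pair)
    return hits


for pair in find_aligned(PARALLEL, "cat"):
    print(pair.source, "|", pair.target)
```

Because each source sentence is stored next to its translation, a lookup in one language immediately yields the corresponding text in the other.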

In order to make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
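
A minimal Python sketch of such annotation is given here, assuming a hypothetical "form/POS/lemma" layout in which each token carries its part-of-speech tag and its lemma. The layout, the tag names (DET, NOUN, VERB) and the helper functions are assumptions made for illustration; actual corpora each define their own formats and tagsets.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag
    lemma: str  # base (dictionary) form


def parse_annotated(line):
    """Parse whitespace-separated 'form/POS/lemma' items into Token objects."""
    tokens = []
    for item in line.split():
        # rsplit keeps any "/" inside the word form itself intact.
        form, pos, lemma = item.rsplit("/", 2)
        tokens.append(Token(form, pos, lemma))
    return tokens


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]


# Hypothetical annotated sentence, invented for illustration.
SENTENCE = "The/DET/the cats/NOUN/cat were/VERB/be sleeping/VERB/sleep"

tokens = parse_annotated(SENTENCE)
print(lemmas_with_pos(tokens, "VERB"))
```

With the tags and lemmas in place, a query such as "all verbs, by base form" becomes a simple filter instead of a manual reading of the text.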

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
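
To make the idea of a parsed sentence concrete, the sketch below reads one sentence written in a bracketed tree notation into nested Python tuples and then recovers the words from the tree. The bracketed notation, the labels (S, NP, VP and so on) and the function names are assumptions chosen for this example, not the format of any particular treebank.

```python
def parse_tree(text):
    """Parse a bracketed tree such as '(S (NP ...) (VP ...))'.

    Returns nested (label, children) tuples, where each child is either
    another (label, children) tuple or a plain word string.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical parsed sentence, invented for illustration.
tree = parse_tree("(S (NP (DET the) (NOUN cat)) (VP (VERB sleeps)))")
print(tree[0], leaves(tree))
```

Every word in the sentence has to be placed somewhere in such a structure, which hints at why complete and consistent annotation is laborious.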

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
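
Both uses can be sketched in a few lines of Python. The example below derives a frequency list from a tiny tagged corpus, then estimates hidden Markov model counts from the same data and tags a new sentence with the Viterbi algorithm. The corpus, the tag names, the "<s>" start marker and the add-one smoothing are assumptions made for illustration; a practical tagger would be trained on far more data and tuned more carefully.

```python
import math
from collections import Counter, defaultdict

# Toy tagged corpus, invented for illustration: lists of (word, tag) pairs.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def frequency_list(sentences):
    """Words sorted by descending count, ties broken alphabetically."""
    counts = Counter(word for sent in sentences for word, _ in sent)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"              # assumed sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_p(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 leaves room for unseen words
    # best[i][tag] = (log probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = log_p(trans["<s>"], tag, n_tags) + log_p(emit[tag], words[0], n_words)
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = log_p(emit[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_p(trans[p], tag, n_tags) + e, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


trans, emit, vocab = train_hmm(TAGGED)
print(frequency_list(TAGGED)[:3])
print(viterbi(["the", "cat", "barks"], trans, emit, vocab))
```

The sentence "the cat barks" does not occur in the toy corpus, so the tagger has to combine what it has counted about individual words with what it has counted about tag sequences.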

Archaeological corpora


Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the Amarna letters texts (1350 BC), which span 15 to 30 years. The corpus of an ancient city (for example, the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find site dates.

Some notable text corpora

English language:
  • American National Corpus
  • Bank of English
  • British National Corpus
  • Corpus Juris Secundum
  • Corpus of Contemporary American English (COCA): 385 million words, 1990-present. Freely searchable online.
  • Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
  • International Corpus of English
  • Oxford English Corpus
  • Scottish Corpus of Texts & Speech
Other languages:
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research
  • Croatian National Corpus
  • Hamshahri Corpus: a contemporary Persian corpus for IR research
  • Neo-Assyrian Text Corpus Project
  • Russian National Corpus
  • Thesaurus Linguae Graecae (Ancient Greek)


See also

  • Concordance
  • Corpus linguistics
  • Linguistic Data Consortium
  • Natural language processing
  • Natural Language Toolkit
  • Parallel text alignment
  • Search engines: they access the "web corpus".
  • Translation memory
  • Treebank
  • Zipf's law


External links

  • Conventions for interlinear morpheme-by-morpheme glosses