Corpus linguistics
Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are now largely derived by an automated process.

The corpus approach runs counter to Noam Chomsky's view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting.

The problem of laboratory-selected sentences is similar to that facing lab-based psychology: researchers do not have any measure of the ethnographic representativity of their data.

Corpus linguistics does away with Chomsky's competence/performance split: adherents believe that reliable language analysis best occurs on field-collected samples, in natural contexts and with minimal experimental interference. Within corpus linguistics there are divergent views as to the value of corpus annotation, from John Sinclair advocating minimal annotation and allowing texts to 'speak for themselves', to others, such as the Survey of English Usage team (based at University College London), advocating annotation as a path to greater linguistic understanding and rigour.

History

A landmark in modern corpus linguistics was the publication by Henry Kucera and W. Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, language teaching, psychology, statistics, and sociology. A further key publication was Randolph Quirk's 'Towards a description of English Usage' (1960), in which he introduced the Survey of English Usage.

Shortly thereafter, Boston publisher Houghton-Mifflin approached Kucera to supply a million-word, three-line citation base for its new American Heritage Dictionary, the first dictionary to be compiled using corpus linguistics. The AHD made the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used).

Other publishers followed suit. The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English. The Survey of English Usage Corpus was used in the development of one of the most important corpus-based grammars, the Comprehensive Grammar of English (Quirk et al. 1985).

The Brown Corpus has also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties and modes, and include the International Corpus of English and the British National Corpus, a 100 million word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library. For contemporary American English, work has stalled on the American National Corpus, but the 400+ million word Corpus of Contemporary American English (1990–present) is now available through a web interface.

The first computerized corpus of transcribed spoken language was constructed in 1971 by the Montreal French Project, containing one million words, which inspired Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area.

Besides these corpora of living languages, computerized corpora have also been made of collections of texts in ancient languages. An example is the Andersen-Forbes database of the Hebrew Bible, developed since the 1970s, in which every clause is parsed using graphs representing up to seven levels of syntax, and every segment tagged with seven fields of information. The Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the Quran. This is a recent project with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar.

Methods

Corpus linguistics has generated a number of research methods that attempt to trace a path from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A perspective: Annotation, Abstraction and Analysis.
  • Annotation consists of the application of a scheme to texts. Annotations may include structural markup, part-of-speech tagging, parsing, and numerous other representations.

  • Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include e.g., rule-learning for parsers.

  • Analysis consists of statistically probing, manipulating and generalising from the dataset. Analysis might include statistical evaluations, optimisation of rule-bases or knowledge discovery methods.
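
The three stages above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-sentence corpus, the tag names (DET, ADJ, NOUN, VERB, PRON), and the choice of adjective + noun pairs as the abstraction are not taken from Wallis and Nelson, and a real study would use an established tagset and a far larger corpus.

```python
from collections import Counter

# Annotation: a toy corpus with hand-assigned part-of-speech tags.
# The tag names are an illustrative scheme, not a standard tagset.
ANNOTATED = [
    [("the", "DET"), ("strong", "ADJ"), ("tea", "NOUN"), ("cooled", "VERB")],
    [("she", "PRON"), ("drank", "VERB"), ("strong", "ADJ"), ("tea", "NOUN")],
    [("a", "DET"), ("cold", "ADJ"), ("wind", "NOUN"), ("blew", "VERB")],
]


def abstract_adj_noun(sentences):
    """Abstraction: map tagged tokens to the cases of interest,
    here every adjective immediately followed by a noun."""
    pairs = []
    for sent in sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                pairs.append((w1, w2))
    return pairs


def analyse(pairs):
    """Analysis: relative frequency of each pair in the abstracted dataset."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


pairs = abstract_adj_noun(ANNOTATED)
frequencies = analyse(pairs)
for pair, share in sorted(frequencies.items()):
    print(pair, round(share, 2))
```

Keeping the stages as separate functions mirrors the point of the 3A perspective: the same annotated data can be re-abstracted for a different research question without re-annotating it.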


Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus linguists who work with 'unannotated plain text' inevitably apply some method to isolate the terms they are interested in from the surrounding words. In such situations annotation and abstraction are combined in a lexical search.
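
A lexical search over plain text can be sketched as a small keyword-in-context (KWIC) routine. This is a simplified illustration rather than the method of any particular concordancer: the function name `kwic`, the `width` parameter, the sample text, and the crude tokenisation (lowercasing and splitting on non-word characters) are all assumptions.

```python
import re


def kwic(text, term, width=3):
    """Return (left context, keyword, right context) for each hit.

    Tokenising the text and matching the term is where annotation and
    abstraction happen together, even though the text carries no tags.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = term.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


sample = ("Strong tea is popular. Some people prefer weak tea, "
          "but strong tea keeps well.")
lines = kwic(sample, "tea", width=2)
for left, keyword, right in lines:
    print(f"{left:>12} [{keyword}] {right}")
```

Even this plain-text search embeds analytical decisions, such as what counts as a word and whether case matters, which is why it is not truly annotation-free.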

The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus. Linguists whose interests and perspectives differ from the originators' can exploit this work. By sharing data, corpus linguists are able to treat the corpus as a locus of linguistic debate, rather than as an exhaustive fount of knowledge.

See also

  • Concordance (KWIC)
  • Collocation
  • Collostructional analysis
  • Keyword (linguistics)
  • Lexical priming
  • Linguistic Data Consortium
  • Machine translation
  • Natural Language Toolkit
  • Pattern grammar
  • Search engines: they access the "web corpus".
  • Semantic prosody
  • Text corpus
  • Translation memory
  • Treebank
  • Xaira: a general purpose XML aware open-source corpus analysis tool

Journals

There are several international peer-reviewed journals dedicated to corpus linguistics, for example, Corpora, Corpus Linguistics and Linguistic Theory, ICAME Journal and the International Journal of Corpus Linguistics.

Book series

Book series in this field include Language and Computers, Studies in Corpus Linguistics and English Corpus Linguistics.

Other

  • Biber, D., Conrad, S., Reppen, R. Corpus Linguistics, Investigating Language Structure and Use. Cambridge: Cambridge UP, 1998. ISBN 0-521-49957-7
  • McCarthy, D., and Sampson, G. Corpus Linguistics: Readings in a Widening Discipline. Continuum, 2005. ISBN 0-826-48803-X
  • Facchinetti, R. Theoretical Description and Practical Applications of Linguistic Corpora. Verona: QuiEdit, 2007. ISBN 978-88-89480-37-3
  • Facchinetti, R. (ed.) Corpus Linguistics 25 Years on. New York/Amsterdam: Rodopi, 2007. ISBN 978-90-420-2195-2
  • Facchinetti, R. and Rissanen, M. (eds.) Corpus-based Studies of Diachronic English. Bern: Peter Lang, 2006. ISBN 3-03910-851-4

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.