British National Corpus
The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. It was compiled as a general corpus (collection of texts) in the field of corpus linguistics. The corpus covers British English of the late twentieth century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time.

The 10-million-word spoken corpus has two parts. One part is demographic, containing transcriptions of spontaneous natural conversations made by members of the public. The other is context-governed, containing transcriptions of recordings made at specific types of meeting and event. All the original recordings transcribed for inclusion in the BNC have been deposited at the British Library Sound Archive.

The corpus is marked up following the recommendations of the Text Encoding Initiative and includes full linguistic annotation and contextual information. The most recent edition, from March 2007, is distributed in XML format along with the Xaira software. It is freely available under a licence and is widely distributed.
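
As an illustration of what word-level annotation in an XML edition can look like, the following minimal Python sketch reads a small hand-written fragment. The element and attribute names used here (s, w, c, c5, hw, pos) and the tag values are assumptions modelled on BNC-style markup, and the sample sentence is invented rather than taken from the corpus; the corpus documentation should be consulted for the actual scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on BNC-style XML; the sentence, the tag
# values and the attribute names are illustrative assumptions only.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_words(xml_text):
    """Return (surface form, headword, part of speech) for each <w> element."""
    root = ET.fromstring(xml_text)
    return [
        ((w.text or "").strip(), w.get("hw"), w.get("pos"))
        for w in root.iter("w")  # punctuation in <c> elements is skipped
    ]


words = read_words(SAMPLE)
for form, headword, pos in words:
    print(form, headword, pos)
```

Keeping the surface form, the headword and the part of speech together in one tuple makes it straightforward to count lemmas or filter by word class in later steps.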

See also

  • Corpus of Contemporary American English

  • American National Corpus

  • Oxford English Corpus


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 