The Oxford English Corpus is a text corpus of the English language used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing over two billion words. The sources for these words are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of chatrooms, emails, and weblogs". This may be contrasted with similar databases that sample only a specific kind of writing.
The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software.
Each document in the OE Corpus is accompanied by metadata naming:
- title
- author (if known; many websites make this difficult to determine reliably)
- author gender (if known)
- language type (e.g. British English, American English)
- source website
- year (+ date, if known)
- date of collection
- domain + subdomain
- document statistics (number of tokens, sentences, etc.)
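As a rough illustration of how per-document metadata of this kind might be read from an XML file, the sketch below parses a made-up record with Python's standard library. The element and attribute names (`meta`, `language_type`, `date_collected`, and so on) and all sample values are assumptions invented for this example; the corpus's actual schema is not described here. The token and sentence counts are deliberately crude stand-ins for real document statistics.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every tag, attribute and value below is invented
# for illustration and does not reflect the corpus's real schema.
SAMPLE = """\
<document>
  <meta>
    <title>Example post</title>
    <author gender="unknown">unknown</author>
    <language_type>British English</language_type>
    <source_website>example.org</source_website>
    <year>2006</year>
    <date_collected>2006-05-01</date_collected>
    <domain subdomain="blogs">weblog</domain>
  </meta>
  <text>Corpora are large. They are structured sets of texts.</text>
</document>
"""


def read_metadata(xml_string):
    """Return a flat dict of metadata fields plus simple text statistics."""
    root = ET.fromstring(xml_string)
    meta = root.find("meta")

    # One entry per child element of <meta>, keyed by tag name.
    record = {child.tag: (child.text or "").strip() for child in meta}

    # Fields stored as attributes in this made-up layout.
    record["author_gender"] = meta.find("author").get("gender")
    record["subdomain"] = meta.find("domain").get("subdomain")

    # Crude statistics: whitespace-separated tokens and a count of
    # sentence-ending punctuation marks.
    text = root.findtext("text", default="")
    record["tokens"] = len(text.split())
    record["sentences"] = sum(text.count(mark) for mark in ".!?")
    return record


record = read_metadata(SAMPLE)
for field, value in record.items():
    print(f"{field}: {value}")
```

A flat dictionary keeps the example short; a real pipeline would more likely rely on a proper tokenizer and sentence splitter than on whitespace and punctuation counts.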
See also
- Oxford English Dictionary
- Corpus of Contemporary American English
- American National Corpus
- Frequency analysis