Survey of English Usage
The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.

History

The Survey of English Usage was founded in 1959 by Randolph (now Lord) Quirk. Many well-known linguists have spent time doing research at the Survey, including Bas Aarts, Valerie Adams, John Algeo, Dwight Bolinger, Noël Burton-Roberts, David Crystal, Derek Davy, Jan Firbas, Sidney Greenbaum, Liliane Haegeman, Robert Ilson, Ruth Kempson, Geoffrey Leech, Jan Rusiecki, Jan Svartvik, and Joe Taglicht.

The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. Transcriptions carried a detailed prosodic and paralinguistic annotation scheme developed by Crystal and Quirk (1964). Sets of paper cards were manually annotated for grammatical structures and filed, so that, for example, all noun phrases could be found in the noun phrase filing cabinet at the Survey. Corpus searches therefore required a visit to the Survey.

This corpus is now more widely known as the London-Lund Corpus (LLC), because co-workers in Lund, Sweden, were responsible for computerising it. Thirty-four of the spoken texts were published in book form as Svartvik and Quirk (1980), and the corpus served as the basis for A Comprehensive Grammar of the English Language (Quirk et al. 1985).

Constructing corpora

In 1988 Sidney Greenbaum proposed a new project, ICE, the International Corpus of English. ICE was to be an international project, carried out at research centres around the world, to compile corpora of English varieties from countries where English was the first or second official language. ICE texts would contain spoken and written English in a balanced sample of one million words per component, so that these samples could be compared in a wide variety of ways. The ICE project continues around the world to the present day.

ICE-GB, the British component of ICE, was compiled at the Survey. It was annotated in great detail, including a full grammatical analysis (parse) of every sentence in the corpus. The first release of ICE-GB took place in 1998 and was distributed with ICECUP, software for searching and exploring the parsed corpus. Release 2 of ICE-GB is now available on CD.

As well as contrasting varieties of English, many researchers are interested in language development and change over time. A recent project at the Survey parsed a large (400,000-word) selection of the spoken part of the LLC in a manner directly comparable with ICE-GB. Together with a comparable quantity of spoken material from ICE-GB, this forms a new 800,000-word diachronic corpus, the Diachronic Corpus of Present-Day Spoken English (DCPSE). DCPSE is now available on CD from the Survey.

These two corpora comprise the largest collection of parsed and corrected, orthographically transcribed spoken English language data in the world, with over one million words of spoken English in this form.

Exploring corpora

Parsed corpora are large databases containing detailed grammatical tree structures. One consequence of forming large collections of valuable linguistic data is a pressing need for methods and tools to help researchers and other users make the most of them. So, in parallel with the parsing of natural language data, the Survey team has carried out research and development of software tools to help linguists use these corpora. The ICECUP research platform uses an intuitive grammatical query representation called Fuzzy Tree Fragments (FTFs) to search parsed corpora.
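The idea of querying a parsed corpus with a partial tree can be illustrated with a small sketch. This is not ICECUP's implementation or the actual FTF formalism: the class names (Node, Fragment), the matching rule, and the category labels (CL, NP, VP, DET, N, V) are all assumptions chosen for illustration. The sketch treats a fragment as a pattern in which a missing label matches any category, and in which the fragment's children must appear among a node's children in the same order, with other children allowed in between.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One node of a toy parse tree: a category label plus child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Fragment:
    """A partial tree pattern (hypothetical); label=None matches any category."""
    label: Optional[str] = None
    children: List["Fragment"] = field(default_factory=list)


def matches(frag: Fragment, node: Node) -> bool:
    """Return True if the fragment fits at this node.

    Each fragment child must match a distinct child of the node, in the
    same left-to-right order; unmatched children may sit in between.
    """
    if frag.label is not None and frag.label != node.label:
        return False
    pos = 0
    for sub in frag.children:
        # Skip node children until one matches the current fragment child.
        while pos < len(node.children) and not matches(sub, node.children[pos]):
            pos += 1
        if pos == len(node.children):
            return False
        pos += 1
    return True


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def search(frag: Fragment, corpus: List[Node]) -> List[Node]:
    """Collect every node in every tree where the fragment matches."""
    return [hit for tree in corpus for hit in walk(tree) if matches(frag, hit)]


# Toy parse of a clause such as "the dog chased a cat" (labels are assumed).
tree = Node("CL", [
    Node("NP", [Node("DET"), Node("N")]),
    Node("VP", [Node("V"), Node("NP", [Node("DET"), Node("N")])]),
])

np_hits = search(Fragment("NP"), [tree])
vp_with_np = search(Fragment("VP", [Fragment("NP")]), [tree])
print(len(np_hits), len(vp_with_np))
```

Leaving labels and intervening children unspecified is what makes such a pattern loose: the same fragment retrieves every structure that shares its skeleton, without requiring the user to spell out the full tree.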

Linguistic research with corpora

As well as distributing corpora and tools to the corpus linguistics research community, the SEU carries out research into the English language. Recent projects include research on the English Noun Phrase, Subordination in Spoken and Written English, and the English Verb Phrase. The Survey also provides support for a small number of PhD students who carry out research into English language corpora.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 