Heaps' law
In linguistics, Heaps' law is an empirical law which describes the portion of a vocabulary which is represented by an instance document (or set of instance documents) consisting of words chosen from the vocabulary. This can be formulated as


$$V_R(n) = K n^{\beta}$$

where V_R is the size of the subset of the vocabulary V represented by an instance text of size n, and K and β are free parameters determined empirically.

With English text corpora, typically K is between 10 and 100, and β is between 0.4 and 0.6.
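As a rough illustration of how K and β might be determined empirically, the sketch below counts distinct tokens as a text grows and then fits a straight line to log V_R against log n by ordinary least squares. The function names (`vocabulary_growth`, `fit_heaps`, `heaps_estimate`) are hypothetical, and the log-log regression is only one simple fitting choice, not a method prescribed by the law itself.

```python
import math

def vocabulary_growth(tokens):
    """Return [(n, V_R(n))]: the distinct-token count after the first n tokens."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth

def fit_heaps(growth):
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumes at least two points with different n, so the slope is defined.
    """
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

def heaps_estimate(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n tokens."""
    return K * n ** beta

if __name__ == "__main__":
    # Illustrative parameter values chosen from inside the ranges quoted above.
    print(heaps_estimate(1_000_000, K=50, beta=0.5))
```

With the illustrative values K = 50 and β = 0.5, a text of one million words would be expected to contain about 50 × 1000 = 50,000 distinct words.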



Figure: A typical Heaps-law plot. The x-axis represents the text size, and the y-axis represents the number of distinct vocabulary elements present in the text. Compare the values of the two axes.


Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
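The diminishing returns can be read directly off the formula. As a sketch, treating n as a continuous variable (an approximation, since text size is really a whole number) and assuming 0 < β < 1, as in the range quoted for English:

```latex
\frac{dV_R}{dn} = K \beta \, n^{\beta - 1},
\qquad
\frac{V_R(2n)}{V_R(n)} = \frac{K (2n)^{\beta}}{K n^{\beta}} = 2^{\beta}
```

Because β − 1 is negative, the rate at which new vocabulary appears falls as n grows. Doubling the amount of text multiplies the observed vocabulary by only 2^β, which is about 1.41 for β = 0.5.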

Heaps' law also applies in the more general case where the "vocabulary" is any set of distinct types that are attributes of some collection of objects. For example, the objects could be people, and the types could be each person's country of origin. If people are selected at random (that is, without regard to country of origin), then Heaps' law says we will quickly obtain representatives from most countries (in proportion to their population), but covering the entire set of countries by continuing to sample in this way becomes increasingly difficult.
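The sampling scenario above can be tried out with a small simulation. This is a minimal sketch under made-up assumptions: the 200 "countries" and their populations (proportional to 1/(i+1)) are hypothetical, as is the helper name `distinct_types_seen`. It only illustrates the qualitative slowdown and does not show that the resulting curve follows an exact power law.

```python
import random

def distinct_types_seen(weights, num_samples, seed=0):
    """Draw num_samples objects with probability proportional to weights and
    return a list whose entry n-1 is the number of distinct types seen after n draws."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same curve
    types = range(len(weights))
    draws = rng.choices(types, weights=weights, k=num_samples)
    seen = set()
    curve = []
    for t in draws:
        seen.add(t)
        curve.append(len(seen))
    return curve

# Hypothetical populations: country i has population proportional to 1 / (i + 1).
weights = [1.0 / (i + 1) for i in range(200)]
curve = distinct_types_seen(weights, 5000)

if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        print(n, curve[n - 1])
```

Comparing the printed counts at increasing sample sizes should show most countries appearing early, with each later batch of samples adding fewer new ones.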
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 