Frequency list
Encyclopedia
In computational linguistics
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective....

, a frequency list is a sorted list of word
Word
In language, a word is the smallest free form that may be uttered in isolation with semantic or pragmatic content . This contrasts with a morpheme, which is the smallest unit of meaning but will not necessarily stand on its own...

s (word types) together with their frequency
Frequency
Frequency is the number of occurrences of a repeating event per unit time. It is also referred to as temporal frequency.The period is the duration of one cycle in a repeating event, so the period is the reciprocal of the frequency...

, where frequency here usually means the number of occurrences in a given corpus
Text corpus
In linguistics, a corpus or text corpus is a large and structured set of texts...

. A short example could be:




























the3789654
he 2098762
[...]
king 57897
boy 56975
[...]
outragious [sic] 76
[...]
stringyfy 5
[...]
transducionalify 1



It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a necessary prerequisite for building of an electronic dictionary, which is by itself a prerequisite for a wide range of applications in computational linguistics
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective....

.

German linguists define the häufigkeitsklasse (frequency class) of an item in the list using the base 2 logarithm
Binary logarithm
In mathematics, the binary logarithm is the logarithm to the base 2. It is the inverse function of n ↦ 2n. The binary logarithm of n is the power to which the number 2 must be raised to obtain the value n. This makes the binary logarithm useful for anything involving powers of 2,...

 of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. In the example list above, the misspelled word outragious has a ratio of 76/3789654 and belongs in class 16.
where is the floor function
Floor function
In mathematics and computer science, the floor and ceiling functions map a real number to the largest previous or the smallest following integer, respectively...

.

See also

  • Letter frequencies
    Letter frequencies
    The frequency of letters in text has often been studied for use in cryptography, and frequency analysis in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. Linotype machines sorted the letters' frequencies as etaoin shrdlu...

  • Most common words in English
    Most common words in English
    The list below of most common words in English cannot be definitive. It is based on an analysis of the Oxford English Corpus of over a billion words, and represents one study done by Oxford Online, associated with the Oxford English Dictionary...

  • The Long Tail
    The Long Tail
    The Long Tail or long tail refers to the statistical property that a larger share of population rests within the tail of a probability distribution than observed under a 'normal' or Gaussian distribution...

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK