Language model - AbsoluteAstronomy.com

A statistical language model assigns a probability $P(w_1, \ldots, w_m)$ to a sequence of $m$ words by means of a probability distribution.

Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.

In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.

When used in information retrieval, a language model is associated with a document in a collection. With query $Q$ as input, retrieved documents are ranked based on the probability that the document's language model $M_d$ would generate the terms of the query, $P(Q \mid M_d)$. This way of using language models in information retrieval is known as the query likelihood model.

In practice, unigram language models are most commonly used in information retrieval, as they are sufficient to determine the topic from a piece of text. A unigram model assigns a probability to each word in isolation, without considering any influence from the words before or after it. This corresponds to the bag-of-words model, and the result is a multinomial distribution over words.

Estimating the probability of sequences can become difficult in corpora, in which phrases or sentences can be arbitrarily long and hence some sequences are not observed during training of the language model (the data sparseness problem, which leads to overfitting). For that reason these models are often approximated using smoothed n-gram models.

Unigram models

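A unigram model used in information retrieval can be treated as a combination of several one-state finite automata. It splits the probabilities of different terms in a context, e.g. from

$$P(t_1 t_2 t_3) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)$$

to

$$P_{\text{uni}}(t_1 t_2 t_3) = P(t_1)\,P(t_2)\,P(t_3).$$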

In this model, the probability of each word depends only on that word itself, so the units are one-state finite automata. Each automaton has exactly one way to reach its single state, and that transition is assigned one probability. Over the whole model, these probabilities sum to 1. The following table illustrates a unigram model of a document.

Terms	Probability in doc
a	0.1
world	0.2
likes	0.05
we	0.05
share	0.3
...	...

The probability that the model generates a specific query is calculated as

$$P(\text{query}) = \prod_{\text{term} \,\in\, \text{query}} P(\text{term}).$$

Each document gets its own unigram model, with its own word probabilities. The same query therefore receives a different generation probability under each document's model, and documents can be ranked for the query by these probabilities. The following table shows the unigram models of two documents.

Terms	Probability in Doc1	Probability in Doc2
a	0.1	0.3
world	0.2	0.1
likes	0.05	0.03
we	0.05	0.02
share	0.3	0.2
...	...	...
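
As a minimal sketch of ranking by query likelihood, the snippet below scores a query against the two document models in the table, using only the terms that are actually listed there. The function names and the example query "we share" are illustrative assumptions, and a term missing from a model is given probability 0 because no smoothing is applied.

```python
# Unigram probabilities copied from the two-document table
# (only the listed terms; the "..." rows are left out).
doc_models = {
    "Doc1": {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3},
    "Doc2": {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2},
}


def query_likelihood(query_terms, model):
    """Multiply the unigram probabilities of the query terms."""
    prob = 1.0
    for term in query_terms:
        # Unsmoothed: a term absent from the model contributes probability 0.
        prob *= model.get(term, 0.0)
    return prob


def rank_documents(query_terms, models):
    """Return (document, score) pairs, highest query likelihood first."""
    scores = {name: query_likelihood(query_terms, m) for name, m in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical query, chosen only for illustration.
ranking = rank_documents(["we", "share"], doc_models)
for name, score in ranking:
    print(name, score)
```

For this query, Doc1 is ranked above Doc2 because both "we" and "share" are more probable under its model.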

In information retrieval contexts, unigram language models are often smoothed to avoid instances where $P(\text{term}) = 0$. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to create a smoothed document model.
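
The interpolation described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the toy documents, the function names and the mixing weight `lam = 0.5` are all made up for the example, and the text does not prescribe a particular weight.

```python
from collections import Counter


def mle_model(tokens):
    """Maximum-likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smoothed_prob(term, doc_model, collection_model, lam=0.5):
    """Linear interpolation of the document and collection models.

    lam is a hypothetical mixing weight in [0, 1]; it is an assumption
    of this sketch, not a value taken from the text.
    """
    return lam * doc_model.get(term, 0.0) + (1 - lam) * collection_model.get(term, 0.0)


# Toy collection, invented for illustration.
docs = {
    "d1": "we share a world".split(),
    "d2": "a world likes a world".split(),
}
collection_tokens = [tok for tokens in docs.values() for tok in tokens]
collection_model = mle_model(collection_tokens)
d1_model = mle_model(docs["d1"])

# "likes" never occurs in d1, yet its smoothed probability is non-zero.
p_likes_d1 = smoothed_prob("likes", d1_model, collection_model)
print(p_likes_d1)
```

Because both component models sum to 1 over the collection vocabulary, the interpolated model does as well.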

N-gram models

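In an n-gram model, the probability $P(w_1, \ldots, w_m)$ of observing the sentence $w_1, \ldots, w_m$ is approximated as

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}).$$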

Here, it is assumed that the probability of observing the $i$-th word $w_i$ in the context history of the preceding $i-1$ words can be approximated by the probability of observing it in the shortened context history of the preceding $n-1$ words (a Markov property of order $n-1$).

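The conditional probability can be calculated from n-gram frequency counts:

$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1})}.$$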
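
A small sketch of estimating a conditional probability from frequency counts is given below. The toy corpus and the function names are assumptions made for illustration. The estimate is the count of the context followed by the word, divided by the count of the context alone, and the sketch expects a non-empty context.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_conditional(word, context, tokens):
    """Unsmoothed estimate of P(word | context) from n-gram counts.

    context is a non-empty list of the preceding n-1 words.
    Returns 0.0 if the context never occurs in the corpus.
    """
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0


# Toy corpus, invented for illustration.
corpus = "i saw the red house and i saw the dog".split()

p_red_given_the = mle_conditional("red", ["the"], corpus)        # bigram estimate
p_the_given_i_saw = mle_conditional("the", ["i", "saw"], corpus)  # trigram estimate
p_cat_given_the = mle_conditional("cat", ["the"], corpus)        # unseen bigram
print(p_red_given_the, p_the_given_i_saw, p_cat_given_the)
```

The unseen bigram "the cat" receives probability 0 here, which is the weakness of unsmoothed counts.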

The terms "bigram language model" and "trigram language model" denote n-gram language models with n=2 and n=3, respectively.

Typically, however, the n-gram probabilities are not derived directly from the frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not explicitly been seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (add 1 to every n-gram count, so that unseen n-grams receive a count of 1) to more sophisticated models, such as Good-Turing discounting or back-off models.
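
As a rough illustration of the simplest of these methods, the sketch below applies add-one smoothing to bigram estimates. The toy corpus and the function name are assumptions. Adding 1 to every bigram count means the denominator grows by the vocabulary size, so the probabilities for a given context still sum to 1.

```python
from collections import Counter


def add_one_bigram_prob(word, prev, tokens):
    """Add-one (Laplace) smoothed estimate of P(word | prev).

    Every bigram count is incremented by 1, so the context count
    in the denominator is increased by the vocabulary size.
    """
    vocab_size = len(set(tokens))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)


# Toy corpus, invented for illustration (7 distinct words).
corpus = "i saw the red house and i saw the dog".split()

p_seen = add_one_bigram_prob("red", "the", corpus)      # "the red" occurs once
p_unseen = add_one_bigram_prob("house", "the", corpus)  # "the house" never occurs
print(p_seen, p_unseen)
```

The unseen bigram now gets a small non-zero probability, at the cost of taking probability mass away from the bigrams that were observed.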

Example

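In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle)\,P(\text{saw} \mid \text{I})\,P(\text{the} \mid \text{saw})\,P(\text{red} \mid \text{the})\,P(\text{house} \mid \text{red})\,P(\langle /s \rangle \mid \text{house})$$

whereas in a trigram (n=3) language model, the approximation is

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle, \langle s \rangle)\,P(\text{saw} \mid \langle s \rangle, \text{I})\,P(\text{the} \mid \text{I, saw})\,P(\text{red} \mid \text{saw, the})\,P(\text{house} \mid \text{the, red})\,P(\langle /s \rangle \mid \text{red, house})$$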

Note that the context of the first $n-1$ n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence such as "*I saw the" would always be higher than that of the longer sentence "I saw the red house".
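
The effect of the sentence markers can be sketched in a few lines. Everything specific here is an assumption made for illustration: the marker strings `<s>` and `</s>`, the function name, and above all the bigram probabilities, which are invented numbers rather than estimates from any corpus.

```python
def sentence_prob(tokens, cond_prob, n=2, use_end_marker=True):
    """Approximate a sentence probability with an n-gram model.

    cond_prob maps (context_tuple, word) to P(word | context).
    The sentence is padded with n-1 start markers and, optionally,
    one end marker. Missing entries count as probability 0.
    """
    padded = ["<s>"] * (n - 1) + list(tokens)
    if use_end_marker:
        padded.append("</s>")
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - (n - 1):i])
        prob *= cond_prob.get((context, padded[i]), 0.0)
    return prob


# Made-up bigram probabilities, purely illustrative.
bigram = {
    (("<s>",), "I"): 0.5,
    (("I",), "saw"): 0.4,
    (("saw",), "the"): 0.6,
    (("the",), "red"): 0.1,
    (("red",), "house"): 0.3,
    (("house",), "</s>"): 0.7,
    (("the",), "</s>"): 0.01,  # sentences rarely end in "the"
}

full = "I saw the red house".split()
prefix = "I saw the".split()

# Without the end marker, the prefix can never be less probable than the full sentence.
no_end = (sentence_prob(prefix, bigram, use_end_marker=False),
          sentence_prob(full, bigram, use_end_marker=False))
# With the end marker, the low P(</s> | the) penalizes the truncated sequence.
with_end = (sentence_prob(prefix, bigram), sentence_prob(full, bigram))
print(no_end, with_end)
```

With these invented numbers the truncated sequence scores higher when no end marker is used, and lower once the end marker is included.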

Other models
A positional language model is one that describes the probability of given words occurring close to one another in a text, not necessarily immediately adjacent.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.