A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.
Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.
When used in information retrieval, a language model is associated with a document in a collection.
With query
Q as input, retrieved documents are ranked by the probability that the document's language model M_d would generate the terms of the query, P(Q | M_d).
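The ranking described above can be sketched in a few lines. This is only an illustration under stated assumptions: each document model is taken to be an unsmoothed unigram model, so P(Q | M_d) is the product of the relative frequencies of the query terms in the document. The function names `query_likelihood` and `rank` are hypothetical, and documents are assumed to be non-empty token lists.

```python
from collections import Counter

def query_likelihood(query, document):
    """P(Q | M_d) under an unsmoothed unigram model of a non-empty document."""
    counts = Counter(document)
    total = len(document)
    prob = 1.0
    for term in query:
        # Relative frequency of the term in the document; 0 if it never occurs.
        prob *= counts[term] / total
    return prob

def rank(query, documents):
    """Return document indices sorted by descending query likelihood."""
    scores = [query_likelihood(query, doc) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

A query term that is missing from a document drives that document's score to zero in this sketch, which is one reason practical systems smooth the document model.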
Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, so that some sequences are not observed during training of the language model (the data sparseness problem of overfitting). For that reason these models are often approximated using smoothed n-gram models.
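To make the effect of smoothing concrete, here is a minimal sketch that uses add-one (Laplace) smoothing on word pairs. The text does not say which smoothing method is meant, so add-one is an assumption chosen for its simplicity, and the function name `add_one_bigram_prob` is hypothetical.

```python
from collections import Counter

def add_one_bigram_prob(word, prev, tokens):
    """Add-one smoothed estimate of P(word | prev) from a token list."""
    vocab = set(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    word_counts = Counter(tokens)
    # Adding 1 to every pair count gives unseen pairs a small non-zero probability.
    return (pair_counts[(prev, word)] + 1) / (word_counts[prev] + len(vocab))
```

With this estimate a pair that never occurred in training still receives a small probability instead of zero.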
N-gram models
In an n-gram model, the probability P(w_1, ..., w_m) of observing the sentence w_1, ..., w_m is approximated as

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

Here, it is assumed that the probability of observing the i-th word w_i in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words (a Markov property of order n-1).
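The conditional probability can be calculated from n-gram frequency counts:

$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$$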
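The count-based estimate can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names `ngram_counts` and `conditional_prob` are hypothetical, the corpus is assumed to be a single flat list of tokens, and an unseen context is simply given probability 0.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(word, context, tokens):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    numer = ngram_counts(tokens, n)[context + (word,)]
    if context:
        denom = ngram_counts(tokens, n - 1)[context]
    else:
        # Empty context (unigram case): normalize by the corpus length.
        denom = len(tokens)
    return numer / denom if denom else 0.0
```

For example, `conditional_prob("house", ["red"], tokens)` divides the number of times "red house" occurs by the number of times "red" occurs.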
The terms bigram and trigram language model denote n-gram language models with n=2 and n=3, respectively.
Example
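In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red})$$

whereas in a trigram (n=3) language model, the approximation is

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle, \langle s \rangle) \, P(\text{saw} \mid \langle s \rangle, \text{I}) \, P(\text{the} \mid \text{I, saw}) \, P(\text{red} \mid \text{saw, the}) \, P(\text{house} \mid \text{the, red})$$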
Note that the context of the first n-1 n-grams is filled with start-of-sentence markers, typically denoted <s>.
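The example can be reproduced with a short sketch that pads each sentence with n-1 start-of-sentence markers before counting. The names `START`, `train` and `sentence_prob` are hypothetical, and the toy corpus in the usage note is made up for illustration. One assumption to note: context counts are accumulated only at positions where a next word exists, which matches the count in the formula except for contexts that end a sentence.

```python
from collections import Counter

START = "<s>"  # start-of-sentence marker used for padding

def train(sentences, n):
    """Collect n-gram counts and context counts from tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for sentence in sentences:
        padded = [START] * (n - 1) + list(sentence)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])  # the preceding n-1 tokens
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def sentence_prob(sentence, ngrams, contexts, n):
    """Unsmoothed n-gram approximation of the probability of a sentence."""
    padded = [START] * (n - 1) + list(sentence)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        if contexts[context] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= ngrams[context + (padded[i],)] / contexts[context]
    return prob
```

Trained on the two sentences "I saw the red house" and "I saw the blue house", both the bigram and the trigram model give "I saw the red house" a probability of 0.5, because the only uncertain step is the choice between "red" and "blue".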