Markovian discrimination

Markovian discrimination in spam filtering is a method used in CRM114 and other spam filters to model the statistical behavior of spam and non-spam more accurately than simple Bayesian methods do. A simple Bayesian model of written text contains only the dictionary of legal words and their relative probabilities. A Markovian model adds the relative transition probabilities that, given one word, predict what the next word will be. It is based on the theory of Markov chains by Andrey Markov, hence the name. In essence, a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.
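
The contrast between the two models can be illustrated with a minimal sketch. It assumes whitespace tokenization and a toy corpus invented for illustration; the function names `train_unigram` and `train_bigram` are hypothetical and do not come from CRM114 or any other filter. The first function builds the word-probability table of a simple Bayesian model, and the second adds the transition probabilities of a first-order Markov chain.

```python
from collections import Counter, defaultdict

def train_unigram(tokens):
    """Relative frequency of each word: a dictionary of words and probabilities."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def train_bigram(tokens):
    """Transition probabilities P(next | current), estimated from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in pair_counts.items()
    }

# Toy corpus, invented for illustration only.
corpus = "buy cheap pills now buy cheap watches now".split()
unigram = train_unigram(corpus)
bigram = train_bigram(corpus)

print(unigram["buy"])   # how often "buy" occurs overall
print(bigram["cheap"])  # which words follow "cheap", and how often
```

The unigram table knows only how common each word is, whereas the bigram table also records which word tends to come next. A real filter would train one such model per class (spam and non-spam) and compare how well each explains a message.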

There are two types of Markov models: the visible Markov model and the hidden Markov model (HMM). The difference is that with a visible Markov model, the current word is considered to contain the entire state of the language model, while a hidden Markov model hides the state and presumes only that the current word is probabilistically related to the actual internal state of the language.
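
To make the idea of a hidden state concrete, the following sketch computes the probability of a word sequence under a tiny hidden Markov model using the standard forward algorithm. The two hidden states ("pitch" and "chatter"), the two-word vocabulary, and every probability value are assumptions invented for illustration; they are not taken from any real filter.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability of a non-empty word sequence under an HMM (forward algorithm)."""
    # alpha[s] = P(words seen so far, and the hidden state is now s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for word in observations[1:]:
        alpha = {
            s: emit_p[s][word] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model: the hidden state is never observed, only the words are.
states = ("pitch", "chatter")
start_p = {"pitch": 0.5, "chatter": 0.5}
trans_p = {
    "pitch": {"pitch": 0.8, "chatter": 0.2},
    "chatter": {"pitch": 0.2, "chatter": 0.8},
}
emit_p = {
    "pitch": {"buy": 0.7, "hello": 0.3},
    "chatter": {"buy": 0.1, "hello": 0.9},
}

print(forward_likelihood(["buy", "buy"], states, start_p, trans_p, emit_p))
print(forward_likelihood(["hello", "hello"], states, start_p, trans_p, emit_p))
```

Each word only shifts the probability mass between hidden states, and no single word determines the state outright. In a visible Markov model the `emit_p` table would be unnecessary, because the word itself would be the state.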

For example, in a visible Markov model the word "the" alone would have to predict the following word accurately, while in a hidden Markov model the entire prior text implies the actual state and predicts the following words, without guaranteeing that state or prediction. Since the latter case is what is encountered in spam filtering, hidden Markov models are almost always used. Because of storage limitations, the specific type of hidden Markov model called a Markov random field is particularly applicable, usually with a clique size of between four and six tokens.
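
One way to picture a bounded clique size is a sliding window over the token stream, in which each token is related only to the few tokens immediately before it. The sketch below is a hypothetical feature scheme written for illustration, not the algorithm of CRM114 or any particular filter; the function name `window_features` and the default window of five tokens are assumptions.

```python
def window_features(tokens, window=5):
    """Return (earlier_token, gap, current_token) triples within a sliding window."""
    features = []
    for i, current in enumerate(tokens):
        # Pair the current token with each of the up to (window - 1) tokens before it.
        for gap in range(1, window):
            j = i - gap
            if j < 0:
                break
            features.append((tokens[j], gap, current))
    return features

# Toy message, invented for illustration only.
message = "buy cheap pills online now today".split()
for feature in window_features(message, window=5):
    print(feature)
```

Limiting the window keeps the number of features per token constant (at most `window - 1` here), which is what makes the storage cost manageable compared with modeling whole sentences.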

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.