# Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, hidden Markov models. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. These algorithms belong to the realm of probability theory.

The algorithm makes a number of assumptions.
• First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time.
• Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event.
• Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t, and the most likely sequence at point t − 1.

These assumptions are all satisfied in a first-order hidden Markov model.

The terms "Viterbi path" and "Viterbi algorithm" are also applied to related dynamic programming algorithms that discover the single most likely explanation for an observation. For example, in statistical parsing a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is sometimes called the "Viterbi parse".

The Viterbi algorithm was conceived by Andrew Viterbi in 1967 as a decoding algorithm for convolutional codes over noisy digital communication links. For more details on the history of the development of the algorithm, see David Forney's article. The algorithm has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is now also commonly used in speech recognition, keyword spotting, computational linguistics, and bioinformatics. For example, in speech-to-text (speech recognition), the acoustic signal is treated as the observed sequence of events, and a string of text is considered to be the "hidden cause" of the acoustic signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal.

## Overview

The assumptions listed above can be elaborated as follows. The Viterbi algorithm operates on a state machine assumption. That is, at any time the system being modeled is in one of a finite number of states. While multiple sequences of states (paths) can lead to a given state, at least one of them is a most likely path to that state, called the "survivor path". This is a fundamental assumption of the algorithm because the algorithm will examine all possible paths leading to a state and only keep the one most likely. This way the algorithm does not have to keep track of all possible paths, only one per state.

A second key assumption is that a transition from a previous state to a new state is marked by an incremental metric, usually a number. This transition is computed from the event. The third key assumption is that the events are cumulative over a path in some sense, usually additive. So the crux of the algorithm is to keep a number for each state. When an event occurs, the algorithm examines moving forward to a new set of states by combining the metric of a possible previous state with the incremental metric of the transition due to the event, and chooses the best.

The incremental metric associated with an event depends on the transition possibility from the old state to the new state. For example, in data communications it may be possible to transmit only half the symbols from an odd-numbered state and the other half from an even-numbered state. Additionally, in many cases the state transition graph is not fully connected. A simple example is a car that has three states – forward, stop and reverse – and is not allowed to undergo a transition from forward to reverse without first entering the stop state.

After computing the combinations of incremental metric and state metric, only the best survives and all other paths are discarded. There are modifications to the basic algorithm which allow for a forward search in addition to the backwards one described here.
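The survivor-path bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration with made-up additive metrics (smaller is better, as with distances), not code from any particular decoder; the transition graph mirrors the forward/stop/reverse car example, with the forbidden forward-to-reverse transition simply absent.

```python
def viterbi_step(path_metric, survivors, branch_metric):
    """One trellis step: for each reachable new state, keep only the
    survivor path, i.e. the best (path metric + branch metric) over
    all allowed predecessor states. Smaller metric = better here.
    """
    new_metric, new_survivors = {}, {}
    for (old, new), m in sorted(branch_metric.items()):
        candidate = path_metric[old] + m
        if new not in new_metric or candidate < new_metric[new]:
            new_metric[new] = candidate
            new_survivors[new] = survivors[old] + [new]
    return new_metric, new_survivors

# Car example from the text: no direct forward -> reverse transition,
# so that pair is absent from branch_metric (made-up metric values).
metric = {"forward": 0.0, "stop": 1.0, "reverse": 4.0}
paths = {s: [s] for s in metric}
branch = {
    ("forward", "forward"): 1.0, ("forward", "stop"): 2.0,
    ("stop", "forward"): 2.0, ("stop", "stop"): 1.0, ("stop", "reverse"): 2.0,
    ("reverse", "stop"): 2.0, ("reverse", "reverse"): 1.0,
}
metric, paths = viterbi_step(metric, paths, branch)
# Only one survivor per state remains; e.g. the best way to be in
# "reverse" after this step is via "stop", never directly from "forward".
```

Note that the algorithm never stores more than one path per state, which is exactly why its memory use stays bounded by the number of states rather than the number of possible paths.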

Path history must be stored. In some cases, the search history is complete because the state machine at the encoder starts in a known state and there is sufficient memory to keep all the paths. In other cases, a programmatic solution must be found for limited resources: one example is convolutional encoding, where the decoder must truncate the history at a depth large enough to keep performance to an acceptable level. Although the Viterbi algorithm is very efficient and there are modifications that reduce the computational load, the memory requirements tend to remain constant.

## Algorithm

Suppose we are given a hidden Markov model (HMM) with state space $S$, initial probabilities $\pi_k$ of being in state $k$, and transition probabilities $a_{i,j}$ of transitioning from state $i$ to state $j$. Say we observe outputs $y_1, \dots, y_T$. The state sequence $x_1, \dots, x_T$ most likely to have produced the observations is given by the recurrence relations:

$$V_{1,k} = \mathrm{P}(y_1 \mid k) \cdot \pi_k$$
$$V_{t,k} = \mathrm{P}(y_t \mid k) \cdot \max_{x \in S} \left( a_{x,k} \cdot V_{t-1,x} \right)$$

Here $V_{t,k}$ is the probability of the most probable state sequence responsible for the first $t$ observations that has $k$ as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state $x$ was used in the second equation. Let $\mathrm{Ptr}(k,t)$ be the function that returns the value of $x$ used to compute $V_{t,k}$ if $t > 1$, or $k$ if $t = 1$. Then:

$$x_T = \arg\max_{x \in S} V_{T,x}$$
$$x_{t-1} = \mathrm{Ptr}(x_t, t)$$

The complexity of this algorithm is $O(T \times |S|^2)$.
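As a sanity check, the first two steps of the recurrence can be worked by hand for the weather example of the next section, writing $V_{t,k}$ for the probability of the best state sequence over the first $t$ observations that ends in state $k$:

$$V_{1,\text{Rainy}} = 0.6 \times 0.1 = 0.06, \qquad V_{1,\text{Sunny}} = 0.4 \times 0.6 = 0.24$$
$$V_{2,\text{Rainy}} = 0.4 \times \max(0.7 \times 0.06,\; 0.4 \times 0.24) = 0.4 \times 0.096 = 0.0384$$

The back pointer for $V_{2,\text{Rainy}}$ records Sunny, since the maximum came from $V_{1,\text{Sunny}}$.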

## Example

Alice talks to Bob three days in a row and discovers that on the first day he went for a walk, on the second day he went shopping, and on the third day he cleaned his apartment. Alice has a question: what is the most likely sequence of rainy/sunny days that would explain these observations? This is answered by the Viterbi algorithm.
```python
def print_dptable(V):
    """Helps visualize the steps of Viterbi."""
    print("    " + "".join("%7s" % ("%d" % i) for i in range(len(V))))
    for y in V[0]:
        print("%.5s: " % y +
              "".join("%.7s " % ("%f" % V[t][y]) for t in range(len(V))))

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]
    path = {}

    # Initialize base cases (t == 0)
    for y in states:
        V[0][y] = start_p[y] * emit_p[y][obs[0]]
        path[y] = [y]

    # Run Viterbi for t > 0
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}

        for y in states:
            (prob, state) = max((V[t - 1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0)
                                for y0 in states)
            V[t][y] = prob
            newpath[y] = path[state] + [y]

        # Don't need to remember the old paths
        path = newpath

    print_dptable(V)
    (prob, state) = max((V[len(obs) - 1][y], y) for y in states)
    return (prob, path[state])
```

The function `viterbi` takes the following arguments: `obs` is the sequence of observations, e.g. `['walk', 'shop', 'clean']`; `states` is the set of hidden states; `start_p` is the start probability; `trans_p` are the transition probabilities; and `emit_p` are the emission probabilities. For simplicity of code, we assume that the observation sequence `obs` is non-empty and that `trans_p[i][j]` and `emit_p[i][j]` are defined for all states i, j.

In the running example, the `viterbi` function is used as follows (the probability tables are the same as those in the Java implementation below):

```python
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def example():
    return viterbi(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability)

print(example())
```

This reveals that the observations `['walk', 'shop', 'clean']` were most likely generated by states `['Sunny', 'Rainy', 'Rainy']`, with score 0.01344 (to be normalized). In other words, given the observed activities, it was most likely sunny when Bob went for a walk and then it started to rain the next day and kept on raining.

The operation of Viterbi's algorithm can be visualized by means of a trellis diagram; the Viterbi path is essentially the shortest path through this trellis.

When implementing Viterbi's algorithm, it should be noted that many languages use floating-point arithmetic; because the path probabilities are products of many numbers smaller than one, this may lead to arithmetic underflow in the results. A common technique to avoid this is to take the logarithm of the probabilities and use it throughout the computation, the same technique used in the logarithmic number system. Once the algorithm has terminated, an accurate value can be obtained by performing the appropriate exponentiation.
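As a sketch of this technique, the recurrence can be carried out entirely in log space, with sums of log-probabilities replacing products; the function name `viterbi_log` and the inlined weather model are just for illustration, assuming all probabilities in the tables are strictly positive:

```python
import math

def viterbi_log(obs, states, start_p, trans_p, emit_p):
    # Same recurrence as viterbi() above, but sums of log-probabilities
    # replace products of probabilities, so long sequences cannot underflow.
    V = [{y: math.log(start_p[y]) + math.log(emit_p[y][obs[0]]) for y in states}]
    path = {y: [y] for y in states}
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            logprob, state = max(
                (V[t - 1][y0] + math.log(trans_p[y0][y]) + math.log(emit_p[y][obs[t]]), y0)
                for y0 in states)
            V[t][y] = logprob
            newpath[y] = path[state] + [y]
        path = newpath
    logprob, state = max((V[-1][y], y) for y in states)
    # Exponentiate only once, at the end, to recover the probability.
    return math.exp(logprob), path[state]

states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

prob, best_path = viterbi_log(observations, states, start_probability,
                              transition_probability, emission_probability)
```

Because the logarithm is monotonic, maximizing the sum of logs selects exactly the same path as maximizing the product of probabilities.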
## Java implementation

```java
import java.util.Hashtable;

public class Viterbi
{
    static final String RAINY = "Rainy";
    static final String SUNNY = "Sunny";

    static final String WALK = "walk";
    static final String SHOP = "shop";
    static final String CLEAN = "clean";

    public static void main(String[] args)
    {
        String[] states = new String[] {RAINY, SUNNY};

        String[] observations = new String[] {WALK, SHOP, CLEAN};

        Hashtable<String, Float> start_probability = new Hashtable<String, Float>();
        start_probability.put(RAINY, 0.6f);
        start_probability.put(SUNNY, 0.4f);

        // transition_probability
        Hashtable<String, Hashtable<String, Float>> transition_probability =
            new Hashtable<String, Hashtable<String, Float>>();
        Hashtable<String, Float> t1 = new Hashtable<String, Float>();
        t1.put(RAINY, 0.7f);
        t1.put(SUNNY, 0.3f);
        Hashtable<String, Float> t2 = new Hashtable<String, Float>();
        t2.put(RAINY, 0.4f);
        t2.put(SUNNY, 0.6f);
        transition_probability.put(RAINY, t1);
        transition_probability.put(SUNNY, t2);

        // emission_probability
        Hashtable<String, Hashtable<String, Float>> emission_probability =
            new Hashtable<String, Hashtable<String, Float>>();
        Hashtable<String, Float> e1 = new Hashtable<String, Float>();
        e1.put(WALK, 0.1f);
        e1.put(SHOP, 0.4f);
        e1.put(CLEAN, 0.5f);
        Hashtable<String, Float> e2 = new Hashtable<String, Float>();
        e2.put(WALK, 0.6f);
        e2.put(SHOP, 0.3f);
        e2.put(CLEAN, 0.1f);
        emission_probability.put(RAINY, e1);
        emission_probability.put(SUNNY, e2);

        Object[] ret = forward_viterbi(observations,
                                       states,
                                       start_probability,
                                       transition_probability,
                                       emission_probability);
        System.out.println(((Float) ret[0]).floatValue());
        System.out.println((String) ret[1]);
        System.out.println(((Float) ret[2]).floatValue());
    }

    public static Object[] forward_viterbi(String[] obs, String[] states,
            Hashtable<String, Float> start_p,
            Hashtable<String, Hashtable<String, Float>> trans_p,
            Hashtable<String, Hashtable<String, Float>> emit_p)
    {
        Hashtable<String, Object[]> T = new Hashtable<String, Object[]>();
        for (String state : states)
            T.put(state, new Object[] {start_p.get(state), state, start_p.get(state)});

        for (String output : obs)
        {
            Hashtable<String, Object[]> U = new Hashtable<String, Object[]>();
            for (String next_state : states)
            {
                float total = 0;
                String argmax = "";
                float valmax = 0;

                float prob = 1;
                String v_path = "";
                float v_prob = 1;

                for (String source_state : states)
                {
                    Object[] objs = T.get(source_state);
                    prob = ((Float) objs[0]).floatValue();
                    v_path = (String) objs[1];
                    v_prob = ((Float) objs[2]).floatValue();

                    float p = emit_p.get(source_state).get(output) *
                              trans_p.get(source_state).get(next_state);
                    prob *= p;
                    v_prob *= p;
                    total += prob;
                    if (v_prob > valmax)
                    {
                        argmax = v_path + "," + next_state;
                        valmax = v_prob;
                    }
                }
                U.put(next_state, new Object[] {total, argmax, valmax});
            }
            T = U;
        }

        float total = 0;
        String argmax = "";
        float valmax = 0;

        float prob;
        String v_path;
        float v_prob;

        for (String state : states)
        {
            Object[] objs = T.get(state);
            prob = ((Float) objs[0]).floatValue();
            v_path = (String) objs[1];
            v_prob = ((Float) objs[2]).floatValue();
            total += prob;
            if (v_prob > valmax)
            {
                argmax = v_path;
                valmax = v_prob;
            }
        }
        return new Object[] {total, argmax, valmax};
    }
}
```

## Extensions
With the algorithm called iterative Viterbi decoding one can find the subsequence of an observation that matches best (on average) to a given HMM. Iterative Viterbi decoding works by iteratively invoking a modified Viterbi algorithm, reestimating the score for a filler until convergence.

An alternative algorithm, the Lazy Viterbi algorithm, has been proposed more recently. It works by not expanding any nodes until it really needs to, and usually manages to do much less work (in software) than the ordinary Viterbi algorithm for the same result; however, it is not as easy to parallelize in hardware.

The Viterbi algorithm has been extended to operate with a deterministic finite automaton in order to quickly generate the trellis, with state transitions pointing back at a variable amount of history.