Belief propagation
Encyclopedia
Belief propagation is a message passing
Message-passing method
Message-passing methods are a set of algorithms in statistics/machine learning for doing inference through local computation. Belief propagation on Bayesian networks is a good example of a message-passing method.* Variational message passing...

 algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 for performing inference on graphical model
Graphical model
A graphical model is a probabilistic model for which a graph denotes the conditional independence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning....

s, such as Bayesian network
Bayesian network
A Bayesian network, Bayes network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph . For example, a Bayesian network could represent the probabilistic...

s and Markov random fields. It calculates the marginal distribution
Marginal distribution
In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. The term marginal variable is used to refer to those variables in the subset of variables being retained...

 for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence
Artificial intelligence
Artificial intelligence is the intelligence of machines and the branch of computer science that aims to create it. AI textbooks define the field as "the study and design of intelligent agents" where an intelligent agent is a system that perceives its environment and takes actions that maximize its...

 and information theory
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...

 and has demonstrated empirical success in numerous applications including low-density parity-check codes, turbo codes, free energy
Thermodynamic free energy
The thermodynamic free energy is the amount of work that a thermodynamic system can perform. The concept is useful in the thermodynamics of chemical or thermal processes in engineering and science. The free energy is the internal energy of a system less the amount of energy that cannot be used to...

 approximation, and satisfiability.

The algorithm was first proposed by Judea Pearl
Judea Pearl
Judea Pearl is a computer scientist and philosopher, best known for developing the probabilistic approach to artificial intelligence and the development of Bayesian networks ....

 in 1982, who formulated this algorithm on tree
Tree (graph theory)
In mathematics, more specifically graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree...

s, and was later extended to polytree
Polytree
In graph theory, a polytree is a directed graph with at most one undirected path between any two vertices. In other words, a polytree is a directed acyclic graph for which there are no undirected cycles either...

s. It has since been shown to be a useful approximate algorithm on general graphs.

If X=(Xv) is a set of discrete random variable
Random variable
In probability and statistics, a random variable or stochastic variable is, roughly speaking, a variable whose value results from a measurement on some type of random process. Formally, it is a function from a probability space, typically to the real numbers, which is measurable functionmeasurable...

s with a joint
Joint distribution
In the study of probability, given two random variables X and Y that are defined on the same probability space, the joint distribution for X and Y defines the probability of events defined in terms of both X and Y...

 mass function
Probability mass function
In probability theory and statistics, a probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value...

 p, the marginal distribution
Marginal distribution
In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. The term marginal variable is used to refer to those variables in the subset of variables being retained...

 of a single Xi is simply the summation of p over all other variables:


However this quickly becomes computationally prohibitive: if there are 100 binary variables, then one needs to sum over 299 ≈ 6.338 × 1029 possible values. By exploiting the graphical
Graph (mathematics)
In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...

 structure, belief propagation allows the marginals to be computed much more efficiently.

Description of the sum-product algorithm

Belief propagation operates on a factor graph
Factor graph
In probability theory and its applications, a factor graph is a particular type of graphical model, with applications in Bayesian inference, that enables efficient computation of marginal distributions through the sum-product algorithm...

: a bipartite graph
Bipartite graph
In the mathematical field of graph theory, a bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; that is, U and V are independent sets...

 containing nodes corresponding to variables V and factors U, with edges between variables and the factors in which they appear. We can write the joint mass function:


where xu is the vector of neighbouring variable nodes to the factor node u. Any Bayesian network
Bayesian network
A Bayesian network, Bayes network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph . For example, a Bayesian network could represent the probabilistic...

 or Markov random field can be represented as a factor graph.

The algorithm works by passing real valued functions called messages along the edges between the nodes. These contain the "influence" that one variable exerts on another. There are two types of messages:
  • A message from a variable node v to a factor node u is the product of the messages from all other neighbouring factor nodes (except the recipient; alternatively one can say the recipient sends the message "1"):


where N(v) is the set of neighbouring (factor) nodes to v. If is empty, then is set to the uniform distribution.

  • A message from a factor node u to a variable node v is the product of the factor with messages from all other nodes, marginalised over xv:


where N(u) is the set of neighbouring (variable) nodes to u. If is empty then .

The name of the algorithm is clear from the previous formula: the complete marginalisation is reduced to a sum of products of simpler terms than the ones appearing in the full joint distribution.

Exact algorithm for trees

The simplest form of the algorithm is when the factor graph is a tree
Tree (graph theory)
In mathematics, more specifically graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree...

: in this case the algorithm computes exact marginals, and terminates after 2 steps.

Before starting, the graph is orientated by designating one node as the root; any non-root node which is connected to only one other node is called a leaf.

In the first step, messages are passed inwards: starting at the leaves, each node passes a message along the (unique) edge towards the root node. The tree structure guarantees that it is possible to obtain messages from all other adjoining nodes before passing the message on. This continues until the root has obtained messages from all of its adjoining nodes.

The second step involves passing the messages back out: starting at the root, messages are passed in the reverse direction. The algorithm is completed when all leaves have received their messages.

Upon completion, the marginal distribution of each node is the product of all messages from adjoining factors:


Likewise, the joint marginal distribution of the set of variables belonging to one factor is the product of the factor and the messages from the variables:


These can be shown by mathematical induction
Mathematical induction
Mathematical induction is a method of mathematical proof typically used to establish that a given statement is true of all natural numbers...

.

Approximate algorithm for general graphs

Curiously, nearly the same algorithm is used in general graph
Graph (mathematics)
In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...

s. The algorithm is then sometimes called "loopy" belief propagation, because graphs typically contain cycle
Cycle (graph theory)
In graph theory, the term cycle may refer to a closed path. If repeated vertices are allowed, it is more often called a closed walk. If the path is a simple path, with no repeated vertices or edges other than the starting and ending vertices, it may also be called a simple cycle, circuit, circle,...

s, or loops. The procedure must be adjusted slightly because graphs might not contain any leaves. Instead, one initializes all variable messages to 1 and uses the same message definitions above, updating all messages at every iteration (although messages coming from known leaves or tree-structured subgraphs may no longer need updating after sufficient iterations). It is easy to show that in a tree, the message definitions of this modified procedure will converge to the set of message definitions given above within a number of iterations equal to the diameter
Diameter
In geometry, a diameter of a circle is any straight line segment that passes through the center of the circle and whose endpoints are on the circle. The diameters are the longest chords of the circle...

 of the tree.

The precise conditions under which loopy belief propagation will converge are still not well understood; it is known that graphs containing a single loop will converge to a correct solution. Several sufficient (but not necessary) conditions for convergence of loopy belief propagation to a unique fixed point exist. There exist graphs which will fail to converge, or which will oscillate between multiple states over repeated iterations. Techniques like EXIT chart
EXIT chart
An Extrinsic information transfer chart, commonly called an EXIT chart, is a technique to aid the construction of good iteratively-decoded error-correcting codes ....

s can provide an approximate visualisation of the progress of belief propagation and an approximate test for convergence.

There are other approximate methods for marginalization including variational methods and Monte Carlo method
Monte Carlo method
Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results. Monte Carlo methods are often used in computer simulations of physical and mathematical systems...

s.

One method of exact marginalization in general graphs is called the junction tree algorithm, which is simply belief propagation on a modified graph guaranteed to be a tree. The basic premise is to eliminate cycles by clustering them into single nodes.

Related algorithm and complexity issues

A similar algorithm is commonly referred to as the Viterbi algorithm
Viterbi algorithm
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events, especially in the context of Markov information sources, and more generally, hidden Markov models...

, but also known as the max-product or min-sum algorithm, which solves the related problem of maximization, or most probable explanation. Instead of attempting to solve the marginal, the goal here is to find the values that maximises the global function (i.e. most probable values in a probabilistic setting), and it can be defined using the arg max:


An algorithm that solves this problem is nearly identical to belief propagation, with the sums replaced by maxima in the definitions.

It is worth noting that inference
Inference
Inference is the act or process of deriving logical conclusions from premises known or assumed to be true. The conclusion drawn is also called an idiomatic. The laws of valid inference are studied in the field of logic.Human inference Inference is the act or process of deriving logical conclusions...

 problems like marginalization and maximization are NP-hard
NP-hard
NP-hard , in computational complexity theory, is a class of problems that are, informally, "at least as hard as the hardest problems in NP". A problem H is NP-hard if and only if there is an NP-complete problem L that is polynomial time Turing-reducible to H...

 to solve exactly and approximately (at least for relative error
Approximation error
The approximation error in some data is the discrepancy between an exact value and some approximation to it. An approximation error can occur because#the measurement of the data is not precise due to the instruments...

) in a graphical model. More precisely, the marginalization problem defined above is #P-complete and maximization is NP-complete
NP-complete
In computational complexity theory, the complexity class NP-complete is a class of decision problems. A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to the decision problem can be verified in polynomial time, and also in the set of NP-hard...

.

Relation to free energy

The sum-product algorithm is related to the calculation of free energy
Thermodynamic free energy
The thermodynamic free energy is the amount of work that a thermodynamic system can perform. The concept is useful in the thermodynamics of chemical or thermal processes in engineering and science. The free energy is the internal energy of a system less the amount of energy that cannot be used to...

 in thermodynamics
Thermodynamics
Thermodynamics is a physical science that studies the effects on material bodies, and on radiation in regions of space, of transfer of heat and of work done on or by the bodies or radiation...

. Let Z be the partition function
Partition function (mathematics)
The partition function or configuration integral, as used in probability theory, information science and dynamical systems, is an abstraction of the definition of a partition function in statistical mechanics. It is a special case of a normalizing constant in probability theory, for the Boltzmann...

. A probability distribution


(as per the factor graph representation) can be viewed as a measure of the internal energy
Internal energy
In thermodynamics, the internal energy is the total energy contained by a thermodynamic system. It is the energy needed to create the system, but excludes the energy to displace the system's surroundings, any energy associated with a move as a whole, or due to external force fields. Internal...

 present in a system, computed as


The free energy of the system is then


It can then be shown that the points of convergence of the sum-product algorithm represent the points where the free energy in such a system is minimized. Similarly, it can be shown that a fixed point of the iterative belief propagation algorithm in graphs with cycles is a stationary point of a free energy approximation.

Generalized belief propagation (GBP)

Belief propagation algorithms are normally presented as messages update equations on a factor graph, involving messages between variable nodes and their neighboring factor nodes and vice versa. Considering messages between regions in a graph is one way of generalizing the belief propagation algorithm. There are several ways of defining the set of regions in a graph that can exchange messages. One method uses ideas introduced by Kikuchi in the physics literature, and is known as Kikuchi's cluster variation method.

Improvements in the performance of belief propagation algorithms are also achievable by breaking the replicas symmetry in the distributions of the fields (messages). This generalization leads to a new kind of algorithm called survey propagation (SP), which have proved to be very efficient in NP-complete
NP-complete
In computational complexity theory, the complexity class NP-complete is a class of decision problems. A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to the decision problem can be verified in polynomial time, and also in the set of NP-hard...

 problems like satisfiability
and graph coloring
Graph coloring
In graph theory, graph coloring is a special case of graph labeling; it is an assignment of labels traditionally called "colors" to elements of a graph subject to certain constraints. In its simplest form, it is a way of coloring the vertices of a graph such that no two adjacent vertices share the...

.

The cluster variational method and the survey propagation algorithms are two different improvements to belief propagation. The name generalized survey propagation (GSP) is waiting to be assigned to the algorithm that merges both generalizations.

Gaussian belief propagation (GaBP)

Gaussian belief propagation is a variant of the belief propagation algorithm when the underlying distributions are Gaussian. The first work analyzing this special model was the seminal work of Weiss and Freeman

The GaBP algorithm solves the following marginalization problem:


where Z is a normalization constant, A is a symmetric positive definite matrix (inverse covariance matrix a.k.a precision matrix) and b is the shift vector.

Equivalently, it can be shown that using the Gaussian model, the solution of the marginalization problem is equivalent to the MAP
Maximum a posteriori
In Bayesian statistics, a maximum a posteriori probability estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data...

 assignment problem:


This problem is also equivalent to the following minimization problem of the quadratic form:


Which is also equivalent to the linear system of equations


Convergence of the GaBP algorithm is easier to analyze (relatively to the general BP case) and there are two known sufficient convergence conditions. The first one was formulated by Weiss et al. in the year 2000, when the information matrix A is diagonally dominant. The second convergence condition was formulated by Johnson et al. in 2006, when the spectral radius
Spectral radius
In mathematics, the spectral radius of a square matrix or a bounded linear operator is the supremum among the absolute values of the elements in its spectrum, which is sometimes denoted by ρ.-Matrices:...

 of the matrix


where D = diag(A).

The GaBP algorithm was linked to the linear algebra domain, and it was shown that the GaBP algorithm can be
viewed as an iterative algorithm for solving the linear system of equations
Ax = b where A is the information matrix and b is the shift vector. The known convergence conditions of the GaBP algorithm are
identical to the sufficient conditions of the Jacobi method
Jacobi method
In numerical linear algebra, the Jacobi method is an algorithm for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. Each diagonal element is solved for, and an approximate value plugged in. The process...

. Empirically, the GaBP algorithm is shown to converge faster than classical iterative methods like the Jacobi method, the Gauss–Seidel method, successive over-relaxation
Successive over-relaxation
In numerical linear algebra, the method of successive over-relaxation is a variant of the Gauss–Seidel method for solving a linear system of equations, resulting in faster convergence. A similar method can be used for any slowly converging iterative process.It was devised simultaneously by David...

, and others. Additionally, the GaBP algorithm is shown to be immune to numerical problems of the preconditioned Conjugate Gradient
Conjugate gradient method
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. The conjugate gradient method is an iterative method, so it can be applied to sparse systems that are too...

method

Recently, a double-loop technique was introduced to force convergence of the GaBP algorithm to the correct solution
even when the sufficient conditions for convergence do not hold. The double loop technique works for either positive definite
or column dependent matrices.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK