Maximum a posteriori
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

Description

Assume that we want to estimate an unobserved population parameter $\theta$ on the basis of observations $x$. Let $f$ be the sampling distribution of $x$, so that $f(x \mid \theta)$ is the probability of $x$ when the underlying population parameter is $\theta$. Then the function

$$\theta \mapsto f(x \mid \theta)$$

is known as the likelihood function and the estimate

$$\hat{\theta}_{\mathrm{ML}}(x) = \arg\max_{\theta} f(x \mid \theta)$$

is the maximum likelihood estimate of $\theta$.

Now assume that a prior distribution $g$ over $\theta$ exists. This allows us to treat $\theta$ as a random variable, as in Bayesian statistics. Then the posterior distribution of $\theta$ is as follows:

$$\theta \mapsto f(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta},$$

where $g$ is the density function of $\theta$ and $\Theta$ is the domain of $g$. This is a straightforward application of Bayes' theorem.

The method of maximum a posteriori estimation then estimates $\theta$ as the mode of the posterior distribution of this random variable:

$$\hat{\theta}_{\mathrm{MAP}}(x) = \arg\max_{\theta} \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \vartheta)\, g(\vartheta)\, d\vartheta} = \arg\max_{\theta} f(x \mid \theta)\, g(\theta).$$

The denominator of the posterior distribution (the so-called partition function) does not depend on $\theta$ and therefore plays no role in the optimization. Observe that the MAP estimate of $\theta$ coincides with the ML estimate when the prior $g$ is uniform (that is, a constant function). The MAP estimate is a limit of Bayes estimators under a sequence of 0-1 loss functions, but it is generally not a Bayes estimator per se, unless $\theta$ is discrete.
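
As a concrete illustration of the formula above, here is a minimal numerical sketch (not from the original article; the beta-binomial model, the Beta(2, 2) prior, and the data of 7 heads in 10 tosses are assumptions chosen for the example). It maximizes $f(x \mid \theta)\, g(\theta)$ numerically and compares the result with the closed-form posterior mode and the ML estimate.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Assumed example model: k heads in n coin tosses, Beta(a, b) prior on the
    # head probability theta. None of these numbers come from the article.
    n, k = 10, 7
    a, b = 2.0, 2.0

    def neg_log_posterior(theta):
        # -log[ f(x | theta) * g(theta) ], up to an additive constant
        return -((k + a - 1) * np.log(theta) + (n - k + b - 1) * np.log(1 - theta))

    # Numerical MAP estimate: maximize the unnormalized posterior.
    theta_map_numeric = minimize_scalar(neg_log_posterior,
                                        bounds=(1e-6, 1 - 1e-6),
                                        method="bounded").x

    # Closed forms for this conjugate model.
    theta_ml = k / n                               # mode of the likelihood: 0.7
    theta_map = (k + a - 1) / (n + a + b - 2)      # mode of the Beta posterior: ~0.667

    print(theta_ml, theta_map, theta_map_numeric)

With a uniform prior (a = b = 1) the two estimates coincide, matching the remark above about constant priors.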

Computing

MAP estimates can be computed in several ways:
  1. Analytically, when the mode(s) of the posterior distribution can be given in closed form. This is the case when conjugate priors are used.
  2. Via numerical optimization such as the conjugate gradient method or Newton's method. This usually requires first or second derivatives, which have to be evaluated analytically or numerically (see the sketch after this list).
  3. Via a modification of an expectation-maximization algorithm. This does not require derivatives of the posterior density.
  4. Via a Monte Carlo method using simulated annealing.
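
Approach 2 can be sketched as follows (an assumed toy model, not from the article): normally distributed data with unknown mean $\mu$ and standard deviation $\sigma$, a wide normal prior on $\mu$, and a standard normal prior on $\log\sigma$. A nonlinear conjugate gradient optimizer is applied to the negative log-posterior, with the gradient estimated numerically by finite differences.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # Assumed toy data set; the true values 1.5 and 2.0 are illustrative only.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.5, scale=2.0, size=50)

    def neg_log_posterior(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        log_lik = norm.logpdf(x, loc=mu, scale=sigma).sum()        # log f(x | mu, sigma)
        log_prior = norm.logpdf(mu, 0.0, 10.0) + norm.logpdf(log_sigma, 0.0, 1.0)
        return -(log_lik + log_prior)

    # Nonlinear conjugate gradient search for the posterior mode; since no
    # analytic gradient is supplied, derivatives are "evaluated numerically".
    result = minimize(neg_log_posterior, x0=np.array([0.0, 0.0]), method="CG")
    mu_map, sigma_map = result.x[0], np.exp(result.x[1])
    print(mu_map, sigma_map)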


Criticism

While MAP estimation is a limit of Bayes estimators (under the 0-1 loss function), it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals. This is both because these estimators are optimal under squared-error and linear-error loss respectively (which are more representative of typical loss functions) and because the posterior distribution may not have a simple analytic form: in this case, the distribution can be simulated using Markov chain Monte Carlo techniques, while optimization to find its mode(s) may be difficult or impossible.
In many types of models, such as mixture models, the posterior may be multi-modal. In such a case, the usual recommendation is that one should choose the highest mode: this is not always feasible (global optimization is a difficult problem), nor in some cases even possible (such as when identifiability issues arise). Furthermore, the highest mode may be uncharacteristic of the majority of the posterior.

Finally, unlike ML estimators, the MAP estimate is not invariant under reparameterization. Switching from one parameterization to another introduces a Jacobian factor that can shift the location of the maximum.
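
One way to see this, in the notation of the Description section above: if $\phi = h(\theta)$ is a smooth one-to-one reparameterization, the prior density of $\phi$ is $\tilde{g}(\phi) = g\!\left(h^{-1}(\phi)\right)\left|\tfrac{d}{d\phi} h^{-1}(\phi)\right|$, so the MAP estimate in the new parameterization satisfies

$$\hat{\phi}_{\mathrm{MAP}} = \arg\max_{\phi} f\!\left(x \mid h^{-1}(\phi)\right) \tilde{g}(\phi) \;\neq\; h\!\left(\hat{\theta}_{\mathrm{MAP}}\right) \quad \text{in general},$$

because the Jacobian factor varies with $\phi$ and moves the maximizer. The ML estimate, by contrast, transforms as $\hat{\phi}_{\mathrm{ML}} = h(\hat{\theta}_{\mathrm{ML}})$ because no density transformation is involved.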

As an example of the difference between the Bayes estimators mentioned above (mean and median estimators) and using an MAP estimate, consider the case where there is a need to classify inputs as either positive or negative (for example, loans as risky or safe). Suppose there are just three possible hypotheses about the correct method of classification, $h_1$, $h_2$ and $h_3$, with posteriors 0.4, 0.3 and 0.3 respectively. Suppose that, given a new instance $x$, $h_1$ classifies it as positive, whereas the other two classify it as negative. Using the MAP estimate for the correct classifier, $h_1$, $x$ is classified as positive, whereas the Bayes estimators would average over all hypotheses and classify $x$ as negative.
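
A small numeric check of this example (the hypothesis names and the $\pm 1$ label encoding are just for illustration; the posterior values 0.4, 0.3 and 0.3 come from the text above):

    # Posterior probability of each hypothesis, from the example above.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    # Prediction of each hypothesis for the new instance x: +1 positive, -1 negative.
    predictions = {"h1": +1, "h2": -1, "h3": -1}

    # MAP: pick the single most probable hypothesis and use its prediction.
    map_hypothesis = max(posteriors, key=posteriors.get)
    map_label = predictions[map_hypothesis]            # +1: classified as positive

    # Bayes model averaging: weight each hypothesis's prediction by its posterior.
    prob_positive = sum(p for h, p in posteriors.items() if predictions[h] == +1)
    bayes_label = +1 if prob_positive > 0.5 else -1    # P(positive) = 0.4, so -1

    print(map_hypothesis, map_label, bayes_label)      # h1, 1, -1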

Example

Suppose that we are given a sequence $(x_1, \dots, x_n)$ of IID $N(\mu, \sigma_v^2)$ random variables and that the a priori distribution of $\mu$ is given by $N(\mu_0, \sigma_m^2)$. We wish to find the MAP estimate of $\mu$.

The function to be maximized is then given by

$$g(\mu)\, f(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\!\left(-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_m}\right)^{2}\right) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_v} \exp\!\left(-\frac{1}{2}\left(\frac{x_j-\mu}{\sigma_v}\right)^{2}\right),$$

which is equivalent to minimizing the following function of $\mu$:

$$\sum_{j=1}^{n}\left(\frac{x_j-\mu}{\sigma_v}\right)^{2} + \left(\frac{\mu-\mu_0}{\sigma_m}\right)^{2}.$$
Thus, we see that the MAP estimator for $\mu$ is given by

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\sigma_m^{2}\, n}{\sigma_m^{2}\, n + \sigma_v^{2}}\left(\frac{1}{n}\sum_{j=1}^{n} x_j\right) + \frac{\sigma_v^{2}}{\sigma_m^{2}\, n + \sigma_v^{2}}\,\mu_0 = \frac{\sigma_m^{2}\left(\sum_{j=1}^{n} x_j\right) + \sigma_v^{2}\,\mu_0}{\sigma_m^{2}\, n + \sigma_v^{2}},$$

which turns out to be a linear interpolation between the prior mean and the sample mean weighted by their respective variances.

The case of $\sigma_m \to \infty$ is called a non-informative prior and leads to an ill-defined a priori probability distribution; in this case $\hat{\mu}_{\mathrm{MAP}} \to \hat{\mu}_{\mathrm{ML}}$.
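
The closed-form expression above can be checked against a direct numerical maximization of the posterior (the specific values of $\mu_0$, $\sigma_m$, $\sigma_v$ and the simulated data below are assumptions chosen for illustration):

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    # Assumed illustrative values; only the formula being checked comes from the text.
    rng = np.random.default_rng(1)
    mu0, sigma_m = 0.0, 1.0            # prior N(mu_0, sigma_m^2) on mu
    sigma_v = 2.0                      # known observation standard deviation
    x = rng.normal(loc=3.0, scale=sigma_v, size=20)
    n = x.size

    def neg_log_posterior(mu):
        # -log[ g(mu) * prod_j f(x_j | mu) ], up to additive constants
        return -(norm.logpdf(mu, mu0, sigma_m) + norm.logpdf(x, mu, sigma_v).sum())

    mu_map_numeric = minimize_scalar(neg_log_posterior).x

    # Closed form: interpolation between the sample mean and the prior mean.
    mu_map_closed = (sigma_m**2 * x.sum() + sigma_v**2 * mu0) / (sigma_m**2 * n + sigma_v**2)

    print(mu_map_numeric, mu_map_closed)   # the two values agree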