Prior probability
Encyclopedia
In Bayesian
Bayesian probability
Bayesian probability is one of the different interpretations of the concept of probability and belongs to the category of evidential probabilities. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with propositions, whose truth or falsity is...

 statistical inference
Statistical inference
In statistics, statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation...

, a prior probability distribution, often called simply the prior, of an uncertain quantity p (for example, suppose p is the proportion of voters who will vote for the politician named Smith in a future election) is the probability distribution
Probability distribution
In probability theory, a probability mass, probability density, or probability distribution is a function that describes the probability of a random variable taking certain values....

 that would express one's uncertainty about p before the "data" (for example, an opinion poll) is taken into account. It is meant to attribute uncertainty rather than randomness to the uncertain quantity. The unknown quantity may be a parameter
Parameter
Parameter from Ancient Greek παρά also “para” meaning “beside, subsidiary” and μέτρον also “metron” meaning “measure”, can be interpreted in mathematics, logic, linguistics, environmental science and other disciplines....

 or latent variable
Latent variable
In statistics, latent variables , are variables that are not directly observed but are rather inferred from other variables that are observed . Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models...

.

One applies Bayes' theorem
Bayes' theorem
In probability theory and applications, Bayes' theorem relates the conditional probabilities P and P. It is commonly used in science and engineering. The theorem is named for Thomas Bayes ....

, multiplying the prior by the likelihood function
Likelihood function
In statistics, a likelihood function is a function of the parameters of a statistical model, defined as follows: the likelihood of a set of parameter values given some observed outcomes is equal to the probability of those observed outcomes given those parameter values...

 and then normalizing, to get the posterior probability distribution, which is the conditional distribution of the uncertain quantity given the data.

A prior is often the purely subjective assessment of an experienced expert. Some will choose a conjugate prior
Conjugate prior
In Bayesian probability theory, if the posterior distributions p are in the same family as the prior probability distribution p, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood...

when they can, to make calculation of the posterior distribution easier.

Parameters of prior distributions are called hyperparameter
Hyperparameter
In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis...

s,
to distinguish them from parameters of the model of the underlying data. For instance, if one is using a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:
  • p is a parameter of the underlying system (Bernoulli distribution), and
  • α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.

Informative priors

An informative prior expresses specific, definite information about a variable.
An example is a prior distribution for the temperature at noon tomorrow.
A reasonable approach is to make the prior a normal distribution with expected value
Expected value
In probability theory, the expected value of a random variable is the weighted average of all possible values that this random variable can take on...

 equal to today's noontime temperature, with variance
Variance
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean . In particular, the variance is one of the moments of a distribution...

 equal to the day-to-day variance of atmospheric temperature,
or a distribution of the temperature for that day of the year.

This example has a property in common with many priors,
namely, that the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature); pre-existing evidence which has already been taken into account is part of the prior and as more evidence accumulates the prior is determined largely by the evidence rather than any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.

Uninformative priors

An uninformative prior expresses vague or general information about a variable.
The term "uninformative prior" may be somewhat of a misnomer; often, such a prior might be called a not very informative prior, or an objective prior, i.e. one that's not subjectively elicited.
Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit".

The simplest and oldest rule for determining a non-informative prior is the principle of indifference
Principle of indifference
The principle of indifference is a rule for assigning epistemic probabilities.Suppose that there are n > 1 mutually exclusive and collectively exhaustive possibilities....

, which assigns equal probabilities to all possibilities.

In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.

Some attempts have been made at finding a priori probabilities
A priori probability
The term a priori probability is used in distinguishing the ways in which values for probabilities can be obtained. In particular, an "a priori probability" is derived purely by deductive reasoning...

, i.e. probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by Edwin T. Jaynes.

As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B or C, but no other information is available about its location. In this case a uniform prior of seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge. It should be noted that this prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system.

As a more contentious example, Jaynes published an argument (Jaynes 1968) based on Lie group
Lie group
In mathematics, a Lie group is a group which is also a differentiable manifold, with the property that the group operations are compatible with the smooth structure...

s that
suggests that the prior representing complete uncertainty about a probability should be the Haldane prior p−1(1 − p)−1. The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior gives by far the most weight to and , indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical to dissolve in one experiment and not to dissolve in another experiment then this prior is updated to the uniform distribution
Uniform distribution
-Probability theory:* Discrete uniform distribution* Continuous uniform distribution-Other:* "Uniform distribution modulo 1", see Equidistributed sequence*Uniform distribution , a type of species distribution* Distribution of military uniforms...

 on the interval [0, 1]. This is obtained by applying Bayes' theorem
Bayes' theorem
In probability theory and applications, Bayes' theorem relates the conditional probabilities P and P. It is commonly used in science and engineering. The theorem is named for Thomas Bayes ....

 to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior has been criticized on the grounds that it yields an improper posterior distribution that puts 100% of the probability content at either p = 0 or at p = 1 if a finite number of observations have given the same result. The Jeffreys prior p−1/2(1 − p)−1/2 is therefore preferred (see below).

Priors can be constructed which are proportional to the Haar measure
Haar measure
In mathematical analysis, the Haar measure is a way to assign an "invariant volume" to subsets of locally compact topological groups and subsequently define an integral for functions on those groups....

 if the parameter space X carries a natural group structure which leaves invariant our Bayesian state of knowledge (Jaynes, 1968). This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the translation group on X, which determines the prior probability as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (i.e., it doesn't matter if we use centimeters or inches, we should get results that are physically the same). In such a case, the scale group is the natural group structure, and the corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine group
Affine group
In mathematics, the affine group or general affine group of any affine space over a field K is the group of all invertible affine transformations from the space into itself.It is a Lie group if K is the real or complex field or quaternions....

 are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.

Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy
Principle of maximum entropy
In Bayesian probability, the principle of maximum entropy is a postulate which states that, subject to known constraints , the probability distribution which best represents the current state of knowledge is the one with largest entropy.Let some testable information about a probability distribution...

 (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and variance unity is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitability constraints in the maximum-entropy sense.

A related idea, reference priors, was introduced by José-Miguel Bernardo
José-Miguel Bernardo
José-Miguel Bernardo is a Spanish mathematician and statistician. A noted Bayesian, he is currently a professor of Statistics at the University of Valencia.He is a founding co-President of the...

. Here, the idea is to maximize the expected Kullback–Leibler divergence
Kullback–Leibler divergence
In probability theory and information theory, the Kullback–Leibler divergence is a non-symmetric measure of the difference between two probability distributions P and Q...

 of the posterior distribution relative to the prior. This maximizes the expected posterior information about X when the prior density is p(x); thus, in some sense, p(x) is the "least informative" prior about X. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule
Jeffreys prior
In Bayesian probability, the Jeffreys prior, named after Harold Jeffreys, is a non-informative prior distribution on parameter space that is proportional to the square root of the determinant of the Fisher information:...

) may result in priors with problematic behavior.

Objective prior distributions may also be derived from other principles, such as information
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...

 or coding theory
Coding theory
Coding theory is the study of the properties of codes and their fitness for a specific application. Codes are used for data compression, cryptography, error-correction and more recently also for network coding...

 (see e.g. minimum description length
Minimum description length
The minimum description length principle is a formalization of Occam's Razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978...

) or frequentist statistics (see frequentist matching).

Philosophical problems associated with uninformative priors are associated with the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of the first prior. These are very different priors, but it is not clear which is to be preferred. Jaynes' often-overlooked method of transformation groups can answer this question in some situations.

Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the , which is the uniform prior on the logarithm of proportion. The Jeffreys prior
Jeffreys prior
In Bayesian probability, the Jeffreys prior, named after Harold Jeffreys, is a non-informative prior distribution on parameter space that is proportional to the square root of the determinant of the Fisher information:...

 attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion p is p−1/2(1 − p)−1/2, which differs from Jaynes' recommendation.

Priors based on notions of algorithmic probability
Algorithmic probability
In algorithmic information theory, algorithmic probability is a method of assigning a probability to each hypothesis that explains a given observation, with the simplest hypothesis having the highest probability and the increasingly complex hypotheses receiving increasingly small probabilities...

 are used in inductive inference
Inductive inference
Around 1960, Ray Solomonoff founded the theory of universal inductive inference, the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols...

 as a basis for induction in very general settings.

Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian
Bayesian probability
Bayesian probability is one of the different interpretations of the concept of probability and belongs to the category of evidential probabilities. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with propositions, whose truth or falsity is...

 would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule
Decision theory
Decision theory in economics, psychology, philosophy, mathematics, and statistics is concerned with identifying the values, uncertainties and other issues relevant in a given decision, its rationality, and the resulting optimal decision...

 based on the posterior distribution to be admissible
Admissible decision rule
In statistical decision theory, an admissible decision rule is a rule for making a decision such that there isn't any other rule that is always "better" than it, in a specific sense defined below....

 under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes model
Hierarchical Bayes model
The hierarchical Bayes model is a method in modern Bayesian statistical inference. It is a framework for describing statistical models that can capture dependencies more realistically than non-hierarchical models....

s; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.

Improper priors

If Bayes' theorem is written as
then it is clear that it would remain true if all the prior probabilities P(Ai) and P(Aj) were multiplied by a given constant; the same would be true for a continuous random variable. The posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors only need to be specified in the correct proportion.

Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior.

Some statisticians use improper priors as uninformative priors. For example, if they need a prior distribution for the mean and variance of a random variable, they may assume p(mv) ~ 1/v (for v > 0) which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample.)

Examples

Examples of improper priors include:
  • Beta(0,0), the beta distribution for α=0, β=0.
  • The uniform distribution
    Uniform distribution
    -Probability theory:* Discrete uniform distribution* Continuous uniform distribution-Other:* "Uniform distribution modulo 1", see Equidistributed sequence*Uniform distribution , a type of species distribution* Distribution of military uniforms...

     on an infinite interval (i.e., a half-line or the entire real line).
  • The logarithmic prior on the positive reals.

Other priors

The concept of algorithmic probability
Algorithmic probability
In algorithmic information theory, algorithmic probability is a method of assigning a probability to each hypothesis that explains a given observation, with the simplest hypothesis having the highest probability and the increasingly complex hypotheses receiving increasingly small probabilities...

provides a route to specifying prior probabilities based on the relative complexity of the alternative models being considered.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK