Maximum entropy probability distribution
In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is at least as great as that of all other members of a specified class of distributions.

According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy

If X is a discrete random variable with distribution given by

\Pr(X = x_k) = p_k \quad \text{for } k = 1, 2, \ldots,

then the entropy of X is defined as

H(X) = -\sum_{k} p_k \log p_k .

If X is a continuous random variable with probability density p(x), then the entropy of X is sometimes defined as

H(X) = -\int p(x) \log p(x) \, dx,
where p(x) log p(x) is understood to be zero whenever p(x) = 0. In connection with maximum entropy distributions, this form of the definition is often the only one given, or at least it is taken as the standard form. However, it is recognisable as the special case m(x) = 1 of the more general definition

H(X) = -\int p(x) \log \frac{p(x)}{m(x)} \, dx,

which is discussed in the articles Entropy (information theory) and Principle of maximum entropy.

The base of the logarithm is not important as long as the same one is used consistently: a change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nats (or nepers) for the entropy.
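
As a concrete illustration (a minimal sketch added here, not part of the original article; the function name and example distribution are arbitrary), the discrete entropy above can be computed directly, and changing the base of the logarithm only rescales the result:

    import math

    def entropy(probs, base=math.e):
        # Shannon entropy; terms with p = 0 are treated as 0, as in the definition above.
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    p = [0.5, 0.25, 0.125, 0.125]            # an example probability distribution
    h_nats = entropy(p)                      # natural logarithm: nats
    h_bits = entropy(p, base=2)              # base 2: bits
    print(h_bits, h_nats / math.log(2))      # identical: a change of base is a rescaling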

Examples of maximum entropy distributions

A table of examples of maximum entropy distributions is given in Park & Bera (2009).

Given mean and standard deviation: the normal distribution

The normal distribution N(μ, σ²) has maximum entropy among all real-valued distributions with specified mean μ and standard deviation σ. Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)
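
For reference, the entropy attained by this maximizer (a standard result, stated here for concreteness using the natural logarithm) is

h(N(\mu, \sigma^2)) = \tfrac{1}{2} \ln(2 \pi e \sigma^2),

so no real-valued distribution with standard deviation σ has differential entropy exceeding this value.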

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a, b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).

More generally, if we're given a subdivision a = a0 < a1 < ... < ak = b of the interval [a, b] and probabilities p1, ..., pk which add up to one, then we can consider the class of all continuous distributions such that

\Pr(a_{j-1} \le X < a_j) = p_j \quad \text{for } j = 1, \ldots, k.

The density of the maximum entropy distribution for this class is constant on each of the intervals [aj−1, aj); it looks somewhat like a histogram.
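
Explicitly (a standard consequence, added here for concreteness rather than taken from the original text), the maximizing density is

p(x) = \frac{p_j}{a_j - a_{j-1}} \quad \text{for } x \in [a_{j-1}, a_j), \quad j = 1, \ldots, k.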

The uniform distribution on the finite set {x1,...,xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
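
For reference (standard values, added for concreteness): the uniform distribution on [a, b] attains the differential entropy \log(b - a), and the uniform distribution on {x_1, ..., x_n} attains the entropy \log n; no other member of the respective class exceeds these values.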

Positive and given mean: the exponential distribution

The exponential distribution with mean 1/λ is the maximum entropy distribution among all continuous distributions supported in [0, ∞) that have a mean of 1/λ.
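
For reference (a standard result, added here for concreteness), the maximizing density and its differential entropy, in nats, are

p(x) = \lambda e^{-\lambda x} \quad (x \ge 0), \qquad h(X) = 1 - \ln \lambda .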

In physics, this occurs when gravity acts on a gas that is kept at constant pressure and temperature: if X describes the height of a molecule, then the variable X is exponentially distributed (which also means that the density of the gas depends on height proportionally to the exponential distribution). The reason: X is clearly positive and its mean, which corresponds to the average potential energy, is fixed. Over time, the system will attain its maximum entropy configuration, according to the second law of thermodynamics.
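
Concretely (the standard barometric formula, added here as an illustration rather than taken from the original text): for a molecule of mass m at temperature T, the potential energy at height x is mgx, so the density of heights is proportional to \exp(-mgx/(k_B T)), i.e. exponential with mean k_B T/(mg).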

Discrete distributions with given mean

Among all the discrete distributions supported on the set {x1, ..., xn} with mean μ, the maximum entropy distribution has the following shape:

\Pr(X = x_k) = C r^{x_k} \quad \text{for } k = 1, \ldots, n,

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, suppose a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1, ..., x6} = {1, ..., 6} and μ = S/N.
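
As an illustrative sketch (not part of the original article; the value of S/N and the helper name below are assumptions), the constants C and r for this dice example can be found numerically, since the mean of the pmf C r^{x_k} is monotone in r:

    import numpy as np
    from scipy.optimize import brentq

    xs = np.arange(1, 7)            # support {1, ..., 6}
    mu = 4.5                        # example target mean S/N (assumed value)

    # Normalization fixes C = 1 / sum(r**x_k); it remains to choose r so that
    # the mean of the pmf C * r**x_k equals mu.
    def mean_given_r(r):
        w = r ** xs
        return np.sum(xs * w) / np.sum(w)

    r = brentq(lambda r: mean_given_r(r) - mu, 1e-9, 1e9)
    C = 1.0 / np.sum(r ** xs)
    probs = C * r ** xs
    print(probs, probs @ xs)        # maximum entropy pmf and its mean (close to mu)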

Finally, among all the discrete distributions supported on the infinite set {x1, x2, ...} with mean μ, the maximum entropy distribution has the shape:

\Pr(X = x_k) = C r^{x_k} \quad \text{for } k = 1, 2, \ldots,

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that xk = k, this gives the geometric distribution

\Pr(X = k) = \frac{1}{\mu} \left( \frac{\mu - 1}{\mu} \right)^{k-1}, \quad k = 1, 2, \ldots

Circular random variables

For a continuous random variable distributed on the unit circle, the von Mises distribution maximizes the entropy when given the real and imaginary parts of the first circular moment or, equivalently, the circular mean and circular variance.
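
For reference (a standard formula, added here for concreteness), the von Mises density with mean direction μ and concentration κ is f(θ) = e^{κ \cos(θ - μ)} / (2 π I_0(κ)), where I_0 is the modified Bessel function of order zero.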

When given the mean and variance of the angles modulo 2π, the wrapped normal distribution maximizes the entropy.

A theorem by Boltzmann

All the above examples are consequences of the following theorem by Ludwig Boltzmann.

Continuous version

Suppose S is a closed subset of the real numbers R and we're given n measurable functions f1, ..., fn and n numbers a1, ..., an. We consider the class C of all continuous random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n expected value conditions

\operatorname{E}(f_j(X)) = a_j \quad \text{for } j = 1, \ldots, n.

If there is a member of C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the following shape:

p(x) = c \exp\left( \sum_{j=1}^{n} \lambda_j f_j(x) \right) \quad \text{for all } x \in S,

where the constants c and λj have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied.

Conversely, if such constants c and λj can be found, then p(x) is indeed the density of the (unique) maximum entropy distribution for our class C.

This theorem is proved with the calculus of variations and Lagrange multipliers.
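
A brief sketch of the argument (standard, included here for concreteness rather than quoted from the original): one extremizes the Lagrangian

J[p] = -\int_S p(x) \log p(x)\, dx + \lambda_0 \left( \int_S p(x)\, dx - 1 \right) + \sum_{j=1}^{n} \lambda_j \left( \int_S f_j(x) p(x)\, dx - a_j \right),

and setting the functional derivative with respect to p(x) to zero gives \log p(x) = \lambda_0 - 1 + \sum_j \lambda_j f_j(x), which is exactly the exponential shape stated above (with c = e^{\lambda_0 - 1}).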

Discrete version

Suppose S = {x1, x2, ...} is a (finite or infinite) discrete subset of the reals and we're given n functions f1, ..., fn and n numbers a1, ..., an. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n conditions

\operatorname{E}(f_j(X)) = a_j \quad \text{for } j = 1, \ldots, n.

If there exists a member of C which assigns positive probability to all members of S and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

\Pr(X = x_k) = c \exp\left( \sum_{j=1}^{n} \lambda_j f_j(x_k) \right) \quad \text{for } k = 1, 2, \ldots,

where the constants c and λj have to be determined so that the sum of the probabilities is 1 and the above conditions for the expected values are satisfied.

Conversely, if such constants c and λj can be found, then the above distribution is indeed the maximum entropy distribution for our class C.

This version of the theorem can be proved with the tools of ordinary calculus and Lagrange multipliers.
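
As a numerical sanity check of the discrete statement (a sketch under assumed values, not from the original article; the support, the constraint functions f1(x) = x and f2(x) = x², and the targets are arbitrary choices), one can maximize the entropy directly and confirm that the optimum has the predicted exponential-family shape:

    import numpy as np
    from scipy.optimize import minimize

    xs = np.arange(1, 7, dtype=float)   # support S = {1, ..., 6}
    a1, a2 = 3.5, 14.0                  # assumed targets for E(X) and E(X^2)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return np.sum(p * np.log(p))

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # probabilities sum to 1
        {"type": "eq", "fun": lambda p: p @ xs - a1},       # E(X)   = a1
        {"type": "eq", "fun": lambda p: p @ xs**2 - a2},    # E(X^2) = a2
    ]
    res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
                   constraints=constraints, method="SLSQP")
    p = res.x
    # The theorem predicts Pr(X = x_k) = c * exp(l1*x_k + l2*x_k**2), so log p
    # should be a quadratic polynomial in x_k; the residuals below should be small.
    coeffs = np.polyfit(xs, np.log(p), deg=2)
    print(np.log(p) - np.polyval(coeffs, xs))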

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all continuous distributions X on R with E(X) = 0 and E(X²) = E(X³) = 1).

It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.