Differential entropy

Differential entropy is a concept in information theory that extends the idea of (Shannon) entropy, a measure of the average surprisal of a random variable, to continuous probability distributions.

Definition

Let X be a random variable with a probability density function f whose support is a set \mathcal{X}. The differential entropy h(X), also written h(f), is defined as

    h(X) = -\int_{\mathcal{X}} f(x) \log f(x) \, dx.
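
For a concrete illustration of the definition, the integral can be evaluated numerically. The sketch below is a minimal example assuming NumPy and SciPy are available; the helper name differential_entropy is illustrative, and the Gaussian test density and integration limits are arbitrary choices. It compares the quadrature result with the closed form ½ ln(2πeσ²) that appears later in the article.

    import numpy as np
    from scipy import integrate

    def differential_entropy(pdf, lower, upper):
        """Numerically evaluate h(X) = -integral f(x) log f(x) dx (in nats)."""
        integrand = lambda x: -pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0
        value, _ = integrate.quad(integrand, lower, upper)
        return value

    sigma = 1.0
    gaussian_pdf = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    h_numeric = differential_entropy(gaussian_pdf, -20, 20)
    h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    print(h_numeric, h_closed)   # both approximately 1.4189 nats
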
As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint, conditional differential entropy, and relative entropy are defined in a similar fashion.

One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, the uniform distribution Uniform(0, 1/2) has negative differential entropy:

    h(X) = \int_0^{1/2} -2 \log 2 \, dx = -\log 2.

Thus, differential entropy does not share all properties of discrete entropy.
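
This sign behavior can be confirmed numerically. The brief sketch below is a minimal check assuming SciPy is available; its continuous distributions expose an entropy() method returning differential entropy in nats, and its uniform distribution is parameterized by loc and scale, the scale being the interval width.

    import numpy as np
    from scipy import stats

    # Uniform(0, 1/2): density is 2 on (0, 1/2), so h(X) = ln(1/2) = -ln 2 < 0.
    h = stats.uniform(loc=0, scale=0.5).entropy()
    print(h, -np.log(2))   # both approximately -0.6931 nats
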

Note that the continuous mutual information I(X; Y) has the distinction of retaining its fundamental significance as a measure of discrete information, since it is actually the limit of the discrete mutual information of partitions of X and Y as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous and uniquely invertible maps), including linear transformations of X and Y, and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.
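
The claim that continuous mutual information is the limit of discrete mutual information over ever finer partitions can be illustrated numerically. The sketch below is a minimal example assuming NumPy: it partitions the plane into square cells, computes the discrete mutual information of the induced partition of a correlated bivariate normal, and compares it with the known closed form −½ ln(1 − ρ²). The grid limits and bin counts are arbitrary illustrative choices.

    import numpy as np

    # Bivariate normal with correlation rho; its mutual information is -0.5*ln(1 - rho^2).
    rho = 0.8

    def joint_pdf(x, y):
        norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
        quad_form = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
        return norm * np.exp(-0.5 * quad_form)

    def discretized_mi(n_bins, lim=6.0):
        """Mutual information of the partition of [-lim, lim]^2 into n_bins^2 cells."""
        edges = np.linspace(-lim, lim, n_bins + 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        dx = edges[1] - edges[0]
        X, Y = np.meshgrid(centers, centers, indexing="ij")
        p_xy = joint_pdf(X, Y) * dx * dx        # approximate cell probabilities
        p_xy /= p_xy.sum()                      # renormalize the truncated grid
        p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
        p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, n)
        mask = p_xy > 0
        return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

    print(-0.5 * np.log(1 - rho**2))            # exact value, about 0.5108 nats
    for n in (8, 32, 128):
        print(n, discretized_mi(n))             # approaches the exact value as cells shrink
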

Properties of differential entropy

  • For two densities f and g, the Kullback–Leibler divergence D(f ‖ g) ≥ 0, with equality if f = g almost everywhere. Similarly, for two random variables X and Y, I(X; Y) ≥ 0 and h(X | Y) ≤ h(X), with equality if and only if X and Y are independent.
  • The chain rule for differential entropy holds as in the discrete case: h(X_1, …, X_n) = Σ_i h(X_i | X_1, …, X_{i−1}) ≤ Σ_i h(X_i).
  • Differential entropy is translation invariant, i.e., h(X + c) = h(X) for a constant c.
  • Differential entropy is in general not invariant under arbitrary invertible maps. In particular, for a constant a, h(aX) = h(X) + log|a|. For a vector-valued random variable X and an invertible (square) matrix A, h(AX) = h(X) + log|det A| (see the numerical check after this list).
  • In general, for a transformation m from a random vector X to a random vector Y = m(X) of the same dimension, the corresponding entropies are related via h(Y) ≤ h(X) + E[ log |det(∂m/∂x)| ], where ∂m/∂x is the Jacobian of the transformation m. Equality is achieved if the transform is a bijection, i.e., invertible.
  • If a random vector X has mean zero and covariance matrix K, then h(X) ≤ ½ log( det(2πeK) ), with equality if and only if X is jointly Gaussian.
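
Because an affine transform of a Gaussian is again Gaussian, the translation and scaling properties above can be checked directly against the closed-form Gaussian entropy. A minimal sketch, assuming SciPy (whose continuous distributions expose an entropy() method returning differential entropy in nats); the parameter values are arbitrary.

    import numpy as np
    from scipy import stats

    sigma, a, c = 2.0, 3.0, 5.0

    h_X = stats.norm(loc=0.0, scale=sigma).entropy()           # h(X), in nats
    h_aXc = stats.norm(loc=c, scale=abs(a) * sigma).entropy()  # h(aX + c)

    print(h_aXc - h_X)       # approximately log|a|
    print(np.log(abs(a)))    # = 1.0986...

    # The shift c alone does not change the entropy (translation invariance):
    print(stats.norm(loc=c, scale=sigma).entropy() - h_X)      # approximately 0
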


However, differential entropy does not have other desirable properties:
  • It is not invariant under change of variables.
  • It can be negative.

A modification of differential entropy that addresses this is the relative information entropy, also known as the Kullback–Leibler divergence, which includes an invariant measure factor (see limiting density of discrete points).
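
For intuition, the Kullback–Leibler divergence between two explicit densities can be computed directly from its defining integral. The sketch below is a minimal example assuming NumPy and SciPy; it evaluates D(f ‖ g) for two univariate Gaussian densities by numerical quadrature and compares the result with the known closed form for that special case. Parameter values and integration limits are arbitrary.

    import numpy as np
    from scipy import integrate, stats

    # Two univariate Gaussian densities f and g.
    mu1, s1 = 0.0, 1.0
    mu2, s2 = 1.0, 2.0
    f = stats.norm(mu1, s1).pdf
    g = stats.norm(mu2, s2).pdf

    # D(f || g) = integral of f(x) log(f(x)/g(x)) dx, always >= 0.
    kl_numeric, _ = integrate.quad(lambda x: f(x) * np.log(f(x) / g(x)), -20, 20)

    # Closed form for two univariate Gaussians.
    kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
    print(kl_numeric, kl_closed)   # both approximately 0.443 nats
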

Maximization in the normal distribution

With a normal distribution, differential entropy is maximized for a given variance. The following is a proof that a Gaussian variable has the largest entropy amongst all random variables of equal variance.

Let g(x) be a Gaussian PDF with mean μ and variance σ², and let f(x) be an arbitrary PDF with the same variance. Since differential entropy is translation invariant, we can assume that f(x) has the same mean μ as g(x).

Consider the Kullback–Leibler divergence between the two distributions:

    0 \le D_{KL}(f \| g) = \int f(x) \log\frac{f(x)}{g(x)} \, dx = -h(f) - \int f(x) \log g(x) \, dx.

Now note that

    \int f(x) \log g(x) \, dx = \int f(x) \left( -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2} \right) dx = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{\sigma^2}{2\sigma^2} = -\tfrac{1}{2}\log(2\pi e \sigma^2) = -h(g),

because the result does not depend on f(x) other than through the variance. Combining the two results yields

    h(g) - h(f) \ge 0,

with equality when f(x) = g(x) almost everywhere, following from the properties of the Kullback–Leibler divergence.
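
The conclusion of this proof can be illustrated numerically by comparing the Gaussian with other distributions scaled to the same variance. A minimal sketch, assuming SciPy (whose continuous distributions expose an entropy() method returning differential entropy in nats):

    import numpy as np
    from scipy import stats

    variance = 1.0

    # Gaussian, Laplace and uniform densities, all with variance 1.
    h_gauss = stats.norm(scale=np.sqrt(variance)).entropy()
    h_laplace = stats.laplace(scale=np.sqrt(variance / 2)).entropy()         # var = 2*b^2
    h_uniform = stats.uniform(loc=0, scale=np.sqrt(12 * variance)).entropy() # var = width^2/12

    print(h_gauss, h_laplace, h_uniform)
    # approximately 1.419 > 1.347 > 1.242 nats: the Gaussian entropy is largest.
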

This result may also be demonstrated using the calculus of variations. A Lagrangian function with two Lagrange multipliers may be defined as:

    L = \int_{-\infty}^{\infty} g(x) \ln g(x) \, dx - \lambda_0 \left( \int_{-\infty}^{\infty} g(x) \, dx - 1 \right) - \lambda \left( \int_{-\infty}^{\infty} g(x)(x-\mu)^2 \, dx - \sigma^2 \right),

where g(x) is some function with mean μ. When the entropy of g(x) is at a maximum and the constraint equations, which consist of the normalization condition \int g(x) \, dx = 1 and the requirement of fixed variance \int g(x)(x-\mu)^2 \, dx = \sigma^2, are both satisfied, then a small variation δg(x) about g(x) will produce a variation δL about L which is equal to zero:

    0 = \delta L = \int_{-\infty}^{\infty} \delta g(x) \left( \ln g(x) + 1 - \lambda_0 - \lambda (x-\mu)^2 \right) dx.

Since this must hold for any small δg(x), the term in brackets must be zero, and solving for g(x) yields:

    g(x) = e^{\lambda_0 - 1 + \lambda (x-\mu)^2}.

Using the constraint equations to solve for λ_0 and λ yields the normal distribution:

    g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.

Example: Exponential distribution

Let X be an exponentially distributed random variable with parameter λ, that is, with probability density function

    f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0.

Its differential entropy is then

    h_e(X) = -\int_0^{\infty} \lambda e^{-\lambda x} \log( \lambda e^{-\lambda x} ) \, dx
           = -\int_0^{\infty} \lambda e^{-\lambda x} ( \log \lambda - \lambda x ) \, dx
           = -\log \lambda \int_0^{\infty} f(x) \, dx + \lambda \, \mathrm{E}[X]
           = -\log \lambda + 1.

Here, h_e(X) was used rather than h(X) to make it explicit that the logarithm was taken to base e, to simplify the calculation.
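
As a sanity check, the result h_e(X) = 1 − ln λ can be reproduced numerically. A minimal sketch assuming NumPy and SciPy; the rate value and the finite integration limit (chosen large enough that the tail contribution is negligible) are arbitrary choices.

    import numpy as np
    from scipy import integrate, stats

    lam = 2.5   # rate parameter of the exponential distribution

    pdf = lambda x: lam * np.exp(-lam * x)
    h_numeric, _ = integrate.quad(lambda x: -pdf(x) * np.log(pdf(x)), 0, 50)

    h_closed = 1 - np.log(lam)                       # the result derived above
    h_scipy = stats.expon(scale=1 / lam).entropy()   # SciPy parameterizes by scale = 1/lambda
    print(h_numeric, h_closed, h_scipy)              # all approximately 0.0837 nats
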

Differential entropies for various distributions

In the table below, \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt is the gamma function, \psi(x) = \frac{d}{dx} \ln \Gamma(x) is the digamma function, B(p, q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)} is the beta function, and γ_E is the Euler–Mascheroni constant. Each distribution maximizes the entropy among all probability densities that satisfy a particular set of functional (moment) constraints and whose support is contained in the set listed at the end of each entry.
Table of differential entropies (in nats) and supports
  • Uniform on [a, b]: h = ln(b − a); support [a, b]
  • Normal with mean μ and variance σ²: h = ½ ln(2πeσ²); support (−∞, ∞)
  • Exponential with rate λ: h = 1 − ln λ; support [0, ∞)
  • Rayleigh with scale σ: h = 1 + ln(σ/√2) + γ_E/2; support [0, ∞)
  • Beta(α, β): h = ln B(α, β) − (α − 1)[ψ(α) − ψ(α + β)] − (β − 1)[ψ(β) − ψ(α + β)]; support [0, 1]
  • Cauchy with scale γ: h = ln(4πγ); support (−∞, ∞)
  • Chi with k degrees of freedom: h = ln(Γ(k/2)/√2) − ((k − 1)/2) ψ(k/2) + k/2; support [0, ∞)
  • Chi-squared with k degrees of freedom: h = ln(2 Γ(k/2)) + (1 − k/2) ψ(k/2) + k/2; support [0, ∞)
  • Erlang with shape k and rate λ: h = (1 − k) ψ(k) + ln(Γ(k)/λ) + k; support [0, ∞)
  • Gamma with shape k and scale θ: h = ln(θ Γ(k)) + (1 − k) ψ(k) + k; support [0, ∞)
  • Laplace with scale b: h = 1 + ln(2b); support (−∞, ∞)
  • Logistic with scale s: h = ln s + 2; support (−∞, ∞)
  • Lognormal with parameters μ, σ: h = μ + ½ ln(2πeσ²); support (0, ∞)
  • Maxwell–Boltzmann with scale a: h = ln(a√(2π)) + γ_E − ½; support [0, ∞)
  • Generalized normal with scale α and shape β: h = 1/β − ln( β / (2αΓ(1/β)) ); support (−∞, ∞)
  • Pareto with scale x_m and shape α: h = ln(x_m/α) + 1 + 1/α; support [x_m, ∞)
  • Student's t with ν degrees of freedom: h = ((ν + 1)/2)[ψ((ν + 1)/2) − ψ(ν/2)] + ln(√ν B(ν/2, ½)); support (−∞, ∞)
  • Triangular on [a, b]: h = ½ + ln((b − a)/2); support [a, b]
  • Weibull with shape k and scale λ: h = γ_E(1 − 1/k) + ln(λ/k) + 1; support [0, ∞)
  • Multivariate normal in N dimensions with covariance matrix Σ: h = ½ ln( (2πe)^N det Σ ); support ℝ^N

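Several of the tabulated entropies can be reproduced with SciPy, whose continuous distributions expose an entropy() method returning differential entropy in nats. The sketch below is a minimal check of a few rows; the parameter values are arbitrary, and SciPy's parameterizations (e.g., scale = 1/λ for the exponential) differ from the table's.

    import numpy as np
    from scipy import stats

    sigma, lam, b, xm, alpha = 1.5, 2.0, 0.7, 1.0, 3.0

    checks = [
        ("normal",      stats.norm(scale=sigma).entropy(),       0.5 * np.log(2 * np.pi * np.e * sigma**2)),
        ("exponential", stats.expon(scale=1 / lam).entropy(),    1 - np.log(lam)),
        ("laplace",     stats.laplace(scale=b).entropy(),        1 + np.log(2 * b)),
        ("pareto",      stats.pareto(alpha, scale=xm).entropy(), np.log(xm / alpha) + 1 + 1 / alpha),
        ("uniform",     stats.uniform(0, 2.0).entropy(),         np.log(2.0)),
    ]
    for name, from_scipy, from_table in checks:
        print(f"{name:12s} {float(from_scipy):.6f} {from_table:.6f}")   # the two columns agree
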

Variants

As described above, differential entropy does not share all properties of discrete entropy. A modification of differential entropy adds an invariant measure factor m(x) to correct this (see limiting density of discrete points). If m(x) is further constrained to be a probability density, the resulting notion is called relative entropy in information theory:

    D(f \| m) = \int f(x) \log \frac{f(x)}{m(x)} \, dx.
The definition of differential entropy above can be obtained by partitioning the range of X into bins of length Δ with associated sample points iΔ within the bins, for X Riemann integrable. This gives a quantized version of X, defined by X_Δ = iΔ if iΔ ≤ X < (i + 1)Δ. Then the entropy of X_Δ is

    H(X_\Delta) = -\sum_i \Delta f(i\Delta) \log f(i\Delta) - \sum_i \Delta f(i\Delta) \log \Delta.

The first term on the right approximates the differential entropy, while the second term is approximately −log(Δ). Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be infinite.
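
The quantization argument can be checked numerically. The sketch below, a minimal example assuming NumPy and SciPy, discretizes a standard normal variable into bins of width Δ and shows that the discrete entropy of the quantized variable behaves like h(X) − ln Δ, which grows without bound as Δ → 0.

    import numpy as np
    from scipy import stats

    h_X = stats.norm().entropy()            # 0.5*log(2*pi*e), about 1.4189 nats

    for delta in (1.0, 0.1, 0.01):
        edges = np.arange(-10, 10 + delta, delta)
        # Cell probabilities of the quantized variable X_delta.
        p = np.diff(stats.norm.cdf(edges))
        p = p[p > 0]
        H_delta = -np.sum(p * np.log(p))    # discrete entropy of X_delta
        print(delta, H_delta, h_X - np.log(delta))   # the two agree more closely, and diverge, as delta -> 0
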

See also

  • Information entropy
  • Information theory
  • Limiting density of discrete points
  • Self-information
  • Kullback–Leibler divergence
  • Entropy estimation
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 