Quantities of information
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, based on the binary logarithm. Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.
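
To illustrate how the choice of base sets the unit, here is a minimal Python sketch (the helper name `information` is illustrative, not part of the article) that evaluates $-\log p$ for an event of probability 1/2 in each of the three bases:

```python
import math

def information(p, base=2):
    """Information content -log_base(p) of an event with probability p."""
    return -math.log(p) / math.log(base)

p = 0.5
print(information(p, base=2))       # 1.0 bit
print(information(p, base=math.e))  # ~0.693 nats
print(information(p, base=10))      # ~0.301 hartleys
```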

In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever p is zero. This is justified because $\lim_{p \to 0^+} p \log p = 0$ for any logarithmic base.

Self-information

Shannon derived a measure of information content called the self-information or "surprisal" of a message m:

$$I(m) = \log\left(\frac{1}{p(m)}\right) = -\log(p(m)),$$

where $p(m) = \Pr(M = m)$ is the probability that message m is chosen from all possible choices in the message space $M$. The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of bits.

Information is transferred from a source to a recipient only if the recipient does not already have that information. A message that conveys something certain to happen, and therefore already known to the recipient, contains no real information. Infrequently occurring messages contain more information than frequently occurring messages. This is reflected in the equation above: a certain message, i.e. one of probability 1, has an information measure of zero. In addition, a compound message of two (or more) unrelated (mutually independent) messages has a quantity of information that is the sum of the measures of information of each message individually. That fact is also reflected in the above equation, supporting the validity of its derivation.
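
Both facts can be read off the definition of self-information: a certain message has $I(m) = -\log 1 = 0$, and for two independent messages the joint probability factorizes, so

$$I(m_1 m_2) = -\log\big(p(m_1)\, p(m_2)\big) = -\log p(m_1) - \log p(m_2) = I(m_1) + I(m_2).$$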

An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information, since snowstorms do not occur every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami.

Entropy

The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:

$$H(M) = \mathbb{E}\left[I(M)\right] = \sum_{m \in M} p(m)\, I(m) = -\sum_{m \in M} p(m) \log p(m),$$

where $\mathbb{E}[\cdot]$ denotes the expected value operation.

An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (e.g. $p(m) = 1/|M|$). In this case $H(M) = \log |M|$.

Sometimes the function H is expressed in terms of the probabilities of the distribution:

$$H(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,$$

where each $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$.

An important special case of this is the binary entropy function:

$$H_\mathrm{b}(p) = H(p, 1-p) = -p \log p - (1-p) \log(1-p).$$
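
As a concrete illustration (the function names and example distributions below are illustrative, not from the article), a short Python sketch computes the entropy of a discrete distribution and the binary entropy function, and checks that the uniform distribution over four messages attains the maximum of $\log_2 4 = 2$ bits:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p log p, using the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def binary_entropy(p, base=2):
    """Entropy of a two-message source with probabilities p and 1 - p."""
    return entropy([p, 1 - p], base)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: maximum for 4 messages
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357 bits: less uncertainty
print(binary_entropy(0.5))                # 1.0 bit
print(binary_entropy(0.9))                # ~0.469 bits
```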

Joint entropy

The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:

$$H(X, Y) = \mathbb{E}_{X,Y}\left[-\log p(x, y)\right] = -\sum_{x, y} p(x, y) \log p(x, y).$$

If $X$ and $Y$ are independent, then the joint entropy is simply the sum of their individual entropies.
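
This additivity is easy to verify numerically. The sketch below builds an assumed pair of independent binary variables (the distributions are arbitrary illustrative choices) and compares $H(X, Y)$ with $H(X) + H(Y)$:

```python
import math

def entropy(probs, base=2):
    """H = -sum p log p over the entries of a distribution (0 log 0 = 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Assumed example: X with p = (0.5, 0.5), Y with p = (0.75, 0.25), independent.
p_x = [0.5, 0.5]
p_y = [0.75, 0.25]
joint = [px * py for px in p_x for py in p_y]  # independence: p(x, y) = p(x) p(y)

print(entropy(joint))               # ~1.811 bits
print(entropy(p_x) + entropy(p_y))  # same value: H(X, Y) = H(X) + H(Y)
```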

(Note: The joint entropy should not be confused with the cross entropy, despite similar notations.)

Conditional entropy (equivocation)

Given a particular value $y$ of a random variable $Y$, the conditional entropy of $X$ given $Y = y$ is defined as:

$$H(X \mid y) = \mathbb{E}_{X \mid Y}\left[-\log p(x \mid y)\right] = -\sum_{x} p(x \mid y) \log p(x \mid y),$$

where $p(x \mid y) = p(x, y)/p(y)$ is the conditional probability of $x$ given $y$.

The conditional entropy of $X$ given $Y$, also called the equivocation of $X$ about $Y$, is then given by:

$$H(X \mid Y) = \mathbb{E}_Y\left[H(X \mid y)\right] = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log p(x \mid y) = \sum_{x, y} p(x, y) \log \frac{p(y)}{p(x, y)}.$$

A basic property of the conditional entropy is that:

$$H(X \mid Y) = H(X, Y) - H(Y).$$
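
This property can be checked numerically from the definitions; the joint distribution below is an illustrative example, not taken from the article:

```python
import math

def entropy(probs, base=2):
    """H = -sum p log p, skipping zero-probability entries."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Assumed joint distribution p(x, y) for X in {0, 1}, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

# H(X|Y) = sum_y p(y) H(X | Y = y), computed directly from the definition.
h_x_given_y = sum(
    p_y[y] * entropy([joint[(x, y)] / p_y[y] for x in (0, 1)])
    for y in (0, 1)
)

print(h_x_given_y)                                      # ~0.875 bits
print(entropy(joint.values()) - entropy(p_y.values()))  # same value: H(X,Y) - H(Y)
```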

Kullback–Leibler divergence (information gain)

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p, and an arbitrary probability distribution q. If we compress data in a manner that assumes q is the distribution underlying some data, when, in reality, p is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,

$$D_{\mathrm{KL}}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.$$

It is in some sense the "distance" from q to p, although it is not a true metric because it is not symmetric.
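
The coding interpretation can be checked numerically. In the sketch below (the two distributions are illustrative choices), the penalty for building a code for q when p is the true distribution equals the cross entropy minus the true entropy, which is exactly the Kullback–Leibler divergence:

```python
import math

def entropy(p, base=2):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2):
    """Average code length when p is true but the code is designed for q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum p log(p / q), in bits by default."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # "true" distribution
q = [1/3, 1/3, 1/3]    # assumed distribution

print(kl_divergence(p, q))               # ~0.085 extra bits per datum
print(cross_entropy(p, q) - entropy(p))  # same value
```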

Mutual information (transinformation)

It turns out that one of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of $X$ relative to $Y$ (which represents conceptually the average amount of information about $X$ that can be gained by observing $Y$) is given by:

$$I(X; Y) = \sum_{y} p(y) \sum_{x} p(x \mid y) \log \frac{p(x \mid y)}{p(x)} = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

A basic property of the mutual information is that:

$$I(X; Y) = H(X) - H(X \mid Y).$$

That is, knowing $Y$, we can save an average of $I(X; Y)$ bits in encoding $X$ compared to not knowing $Y$. Mutual information is symmetric:

$$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y).$$

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of X given the value of Y to the prior distribution on X:

$$I(X; Y) = \mathbb{E}_{p(y)}\big[D_{\mathrm{KL}}\big(p(X \mid Y = y) \,\|\, p(X)\big)\big].$$

In other words, this is a measure of how much, on average, the probability distribution on X will change if we are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(X, Y) \,\|\, p(X)\, p(Y)\big).$$
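
These identities can be verified numerically. The sketch below reuses the illustrative joint distribution from the conditional-entropy example and computes $I(X; Y)$ both from the definition (equivalently, the divergence between the joint distribution and the product of its marginals) and as $H(X) + H(Y) - H(X, Y)$:

```python
import math

# Assumed joint distribution p(x, y) for X in {0, 1}, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Definition: sum over (x, y) of p(x, y) log [ p(x, y) / (p(x) p(y)) ]
mi_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

# Entropy form: I(X; Y) = H(X) + H(Y) - H(X, Y)
mi_ent = H(p_x) + H(p_y) - H(joint)

print(mi_def)  # ~0.1245 bits
print(mi_ent)  # same value
```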


Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.

Differential entropy

See main article: Differential entropy.


The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does not imply identical properties; for example, differential entropy may be negative.

The differential analogies of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:

$$h(X) = -\int_X f(x) \log f(x)\, dx$$
$$h(X, Y) = -\int_Y \int_X f(x, y) \log f(x, y)\, dx\, dy$$
$$h(X \mid Y) = -\int_Y \int_X f(x, y) \log \frac{f(x, y)}{f(y)}\, dx\, dy$$
$$I(X; Y) = \int_Y \int_X f(x, y) \log \frac{f(x, y)}{f(x)\, f(y)}\, dx\, dy,$$

where $f(x, y)$ is the joint density function, $f(x)$ and $f(y)$ are the marginal distributions, and $f(x \mid y)$ is the conditional distribution.
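
As a numerical sanity check of the continuous definitions (the grid, truncation range, and integration scheme below are illustrative choices, not from the article), the differential entropy of a standard normal density can be approximated by a Riemann sum and compared against the known closed form $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$ nats:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# h(X) = -integral f(x) log f(x) dx, approximated on a truncated grid.
# Natural log is used, so the result is in nats.
dx = 0.001
xs = [i * dx for i in range(-10000, 10001)]
h_numeric = -sum(normal_pdf(x) * math.log(normal_pdf(x)) * dx for x in xs)

h_closed_form = 0.5 * math.log(2 * math.pi * math.e)

print(h_numeric)      # ~1.4189 nats
print(h_closed_form)  # 1.4189... nats
```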