Mutual information



 
 
In probability theory and information theory, the mutual information (sometimes known by the archaic term transinformation) of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, when logarithms to the base 2 are used.

Definition of mutual information

Formally, the mutual information of two discrete random variables X and Y can be defined as:

$$ I(X;Y) = \sum_{y} \sum_{x} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.

In the continuous case, we replace summation by a definite double integral:

$$ I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy $$

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.

These definitions are ambiguous because the base of the log function is not specified. To disambiguate, the function I could be parameterized as I(X,Y,b) where b is the base. Alternatively, since the most common unit of measurement of mutual information is the bit, a base of 2 could be specified.

Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X: clearly if X and Y are identical they have equal entropy).
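The discrete definition and the two extreme cases just described can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `mutual_information`, the nested-list layout `joint[x][y]`, and the two example tables are illustrative assumptions, and the `base` argument plays the role of the base b discussed above.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information of a joint probability table joint[x][y] (hypothetical layout)."""
    px = [sum(row) for row in joint]         # marginal p(x): row sums
    py = [sum(col) for col in zip(*joint)]   # marginal p(y): column sums
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                    # terms with p(x,y) = 0 contribute nothing
                total += pxy * math.log(pxy / (px[i] * py[j]), base)
    return total

# Two fair bits that are independent, and two fair bits that are always equal.
independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))             # no shared information
print(mutual_information(identical))               # equals the entropy of one fair bit
print(mutual_information(identical, base=math.e))  # same quantity measured in nats
```

With base 2 the identical case gives one bit, the entropy of a fair coin; changing the base only rescales the result.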

Mutual information quantifies how far the joint distribution of X and Y is from what the joint distribution would be if X and Y were independent. Mutual information is a measure of dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), and therefore:

$$ \log \frac{p(x,y)}{p(x)\,p(y)} = \log 1 = 0 $$

so every term of the defining sum vanishes. Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)).

Relation to other quantities

Mutual information can be equivalently expressed as

$$ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(X|Y) - H(Y|X) $$

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y. Since H(X) ≥ H(X|Y), this characterization is consistent with the nonnegativity property stated above.

Intuitively, if entropy H(X) is regarded as a measure of uncertainty about a random variable, then H(X|Y) is a measure of what Y does not say about X. This is "the amount of uncertainty remaining about X after Y is known", and thus the right side of the first of these equalities can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.

Note that in the discrete case H(X|X) = 0 and therefore H(X) = I(X;X). Thus I(X;X) ≥ I(X;Y), and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.
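The entropy identities in this section can be verified on a small example. The sketch below assumes a hypothetical 2×2 joint table and computes the conditional entropies directly as averages over the conditioning variable, so the agreement between the different expressions is a genuine check rather than a restatement.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical joint table joint[x][y]; rows index X, columns index Y.
joint = [[0.4, 0.1],
         [0.2, 0.3]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_x, h_y = entropy(px), entropy(py)
h_xy = entropy([p for row in joint for p in row])

# Conditional entropies, each an average of the entropy of a conditional distribution.
h_x_given_y = sum(py[j] * entropy([joint[i][j] / py[j] for i in range(len(px))])
                  for j in range(len(py)))
h_y_given_x = sum(px[i] * entropy([joint[i][j] / px[i] for j in range(len(py))])
                  for i in range(len(px)))

# Mutual information straight from the defining double sum.
mi = sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
         for i in range(len(px)) for j in range(len(py)) if joint[i][j] > 0.0)

expressions = {
    "definition": mi,
    "H(X) - H(X|Y)": h_x - h_x_given_y,
    "H(Y) - H(Y|X)": h_y - h_y_given_x,
    "H(X) + H(Y) - H(X,Y)": h_x + h_y - h_xy,
    "H(X,Y) - H(X|Y) - H(Y|X)": h_xy - h_x_given_y - h_y_given_x,
}
for name, value in expressions.items():
    print(f"{name:26s} {value:.6f}")
```

All five lines print the same value up to rounding, and that value does not exceed H(X) = I(X;X).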

Mutual information can also be expressed as a Kullback-Leibler divergence of the product p(x) × p(y) of the marginal distributions of the two random variables X and Y from the random variables' joint distribution p(x,y):

$$ I(X;Y) = D_{\mathrm{KL}}\big( p(x,y) \,\|\, p(x)\,p(y) \big) $$

Furthermore, let p(x|y) = p(x, y) / p(y). Then

$$ I(X;Y) = \sum_{y} p(y) \sum_{x} p(x|y) \log \frac{p(x|y)}{p(x)} = \sum_{y} p(y) \, D_{\mathrm{KL}}\big( p(x|y) \,\|\, p(x) \big) = E_Y\big[ D_{\mathrm{KL}}\big( p(x|y) \,\|\, p(x) \big) \big] $$

Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x), the greater the information gain.
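Both Kullback-Leibler readings of mutual information can be compared on a concrete table. This is a small sketch under the assumption of a hypothetical 2×2 joint distribution; `kl_divergence` is an illustrative helper, not a library function.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits between two aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical joint table joint[x][y].
joint = [[0.4, 0.1],
         [0.2, 0.3]]
nx, ny = len(joint), len(joint[0])
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Reading 1: divergence between the joint distribution and the product of the marginals.
flat_joint = [joint[i][j] for i in range(nx) for j in range(ny)]
flat_product = [px[i] * py[j] for i in range(nx) for j in range(ny)]
mi_as_divergence = kl_divergence(flat_joint, flat_product)

# Reading 2: expectation over Y of the divergence between p(x|y) and p(x).
mi_as_expectation = sum(
    py[j] * kl_divergence([joint[i][j] / py[j] for i in range(nx)], px)
    for j in range(ny)
)

print(mi_as_divergence, mi_as_expectation)
```

The two numbers agree up to floating-point rounding.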

Variations of the mutual information

Several variations on the mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.

Metric

Many applications require a metric, that is, a distance measure between points. The quantity

$$ d(X,Y) = H(X,Y) - I(X;Y) $$

satisfies the basic properties of a metric; most importantly, the triangle inequality, but also non-negativity, indiscernibility and symmetry. In addition, one also has d(X,Y) ≤ H(X,Y), and so

$$ D(X,Y) = \frac{d(X,Y)}{H(X,Y)} \le 1 $$

The metric D is a universal metric, in that if any other distance measure places X and Y close by, then D will also judge them close.
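As a rough illustration, the distance d(X,Y) = H(X,Y) - I(X;Y) and its normalized form D = d / H(X,Y) can be evaluated for the two extreme cases of identical and independent fair bits. The function name `information_distances` and the table layout are assumptions made for this sketch, which also assumes H(X,Y) > 0.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def information_distances(joint):
    """Return (d, D) for a joint table joint[x][y]; assumes H(X,Y) > 0."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    mi = entropy(px) + entropy(py) - h_xy   # I(X;Y) via the entropy identity
    d = h_xy - mi                           # unnormalized distance
    return d, d / h_xy                      # normalized distance lies in [0, 1]

identical = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]

print(information_distances(identical))    # variables that determine each other
print(information_distances(independent))  # variables that share nothing
```

Identical variables are at distance zero, while independent ones reach the maximum normalized distance of 1.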

Conditional mutual information

Sometimes it is useful to express the mutual information of two random variables conditioned on a third.

$$ I(X;Y|Z) = E_Z\big( I(X;Y) \mid Z \big) = \sum_{z} \sum_{y} \sum_{x} p(z)\, p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)} $$

which can be simplified as

$$ I(X;Y|Z) = \sum_{z} \sum_{y} \sum_{x} p(x,y,z) \log \frac{p(z)\,p(x,y,z)}{p(x,z)\,p(y,z)} $$

Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that I(X;Y|Z) ≥ 0 for discrete, jointly distributed random variables X, Y, Z. This result has been used as a basic building block for proving other inequalities in information theory.
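That conditioning can move the mutual information in either direction is easy to see on two toy distributions. The sketch below uses the simplified form, a sum over p(x,y,z) log [p(z) p(x,y,z) / (p(x,z) p(y,z))]; the dictionary layout keyed by (x, y, z) and both example distributions are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from itertools import product

def marginal(p_xyz, keep):
    """Marginalize a dict {(x, y, z): prob} onto the coordinate indices in keep."""
    out = defaultdict(float)
    for outcome, p in p_xyz.items():
        out[tuple(outcome[k] for k in keep)] += p
    return out

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits, using the simplified triple-sum form."""
    p_z = marginal(p_xyz, (2,))
    p_xz = marginal(p_xyz, (0, 2))
    p_yz = marginal(p_xyz, (1, 2))
    return sum(
        p * math.log2(p_z[(z,)] * p / (p_xz[(x, z)] * p_yz[(y, z)]))
        for (x, y, z), p in p_xyz.items() if p > 0.0
    )

def mutual_information_xy(p_xyz):
    """Unconditional I(X;Y) in bits, ignoring Z."""
    p_xy = marginal(p_xyz, (0, 1))
    p_x = marginal(p_xyz, (0,))
    p_y = marginal(p_xyz, (1,))
    return sum(
        p * math.log2(p / (p_x[(x,)] * p_y[(y,)]))
        for (x, y), p in p_xy.items() if p > 0.0
    )

# X and Y are independent fair bits and Z = X XOR Y: conditioning increases the value.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
# X = Y = Z is a single fair bit: conditioning decreases the value.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

for name, dist in (("xor", xor), ("copies", copies)):
    print(name, mutual_information_xy(dist), conditional_mutual_information(dist))
```

In the XOR case the pair (X, Y) looks independent until Z is revealed; in the copy case Z already tells everything that Y could say about X. Both conditional values remain nonnegative.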

Multivariate mutual information

Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation and interaction information. If Shannon entropy is viewed as a signed measure in the context of information diagrams, as explained in the article Information theory and measure theory, then the only definition of multivariate mutual information that makes sense is as follows:

$$ I(X_1) = H(X_1) $$

and for n > 1,

$$ I(X_1; \ldots; X_n) = I(X_1; \ldots; X_{n-1}) - I(X_1; \ldots; X_{n-1} \mid X_n) $$

where (as above) we define

$$ I(X_1; \ldots; X_{n-1} \mid X_n) = E_{X_n}\big( I(X_1; \ldots; X_{n-1}) \mid X_n \big) $$

(This definition of multivariate mutual information is identical to that of interaction information except for a change in sign when the number of random variables is odd.)

Applications
Some have criticized the blind application of information diagrams used to derive the above definition, and indeed it has found rather limited practical application, since it is difficult to visualize or grasp the significance of this quantity for a large number of random variables. It can be zero, positive, or negative for any n ≥ 3.
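The three possible signs can be exhibited for three variables. Expanding the recursive definition for n = 3, that is I(X;Y;Z) = I(X;Y) - I(X;Y|Z), gives the inclusion-exclusion form H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z), which is what the sketch below evaluates. The function names and the three example distributions are illustrative assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def entropy_of(p_joint, keep):
    """Entropy in bits of the marginal over the coordinate indices in keep."""
    marg = defaultdict(float)
    for outcome, p in p_joint.items():
        marg[tuple(outcome[k] for k in keep)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0.0)

def trivariate_mutual_information(p_xyz):
    """I(X;Y;Z) via inclusion-exclusion over single, pairwise, and joint entropies."""
    singles = sum(entropy_of(p_xyz, (k,)) for k in range(3))
    pairs = sum(entropy_of(p_xyz, pair) for pair in combinations(range(3), 2))
    return singles - pairs + entropy_of(p_xyz, (0, 1, 2))

# Z = X XOR Y for independent fair bits X and Y.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
# X = Y = Z, a single fair bit copied three times.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Three mutually independent fair bits.
independent = {bits: 0.125 for bits in product((0, 1), repeat=3)}

for name, dist in (("xor", xor), ("copies", copies), ("independent", independent)):
    print(name, trivariate_mutual_information(dist))
```

The XOR distribution yields a negative value, the copied bit a positive one, and the independent bits zero.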

One high-dimensional generalization scheme, which maximizes the mutual information between the joint distribution and other target variables, has been found to be useful in feature selection.

Normalized variants

Normalized variants of the mutual information are provided by the coefficients of constraint (Coombs, Dawes & Tversky 1970) or uncertainty coefficient (Press & Flannery 1988)

$$ C_{XY} = \frac{I(X;Y)}{H(Y)} \qquad \text{and} \qquad C_{YX} = \frac{I(X;Y)}{H(X)} $$

The two coefficients are not necessarily equal. A more useful and symmetric scaled information measure is the redundancy

$$ R = \frac{I(X;Y)}{H(X) + H(Y)} $$

which attains a minimum of zero when the variables are independent and a maximum value of

$$ R_{\max} = \frac{\min(H(X), H(Y))}{H(X) + H(Y)} $$

when one variable becomes completely redundant with the knowledge of the other. See also Redundancy (information theory). Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by

$$ U(X,Y) = 2R = 2\,\frac{I(X;Y)}{H(X) + H(Y)} $$

which represents a weighted average of the two uncertainty coefficients (Press & Flannery 1988).
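A short numerical sketch may help compare these coefficients. It assumes the conventions C_XY = I(X;Y)/H(Y), C_YX = I(X;Y)/H(X), R = I(X;Y)/(H(X)+H(Y)) and U = 2R; the function name `normalized_measures` and the example table are hypothetical, and both marginal entropies are assumed to be positive.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def normalized_measures(joint):
    """Normalized variants of I(X;Y) for a joint table joint[x][y]; assumes H(X), H(Y) > 0."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_x, h_y = entropy(px), entropy(py)
    h_xy = entropy([p for row in joint for p in row])
    mi = h_x + h_y - h_xy
    return {
        "C_XY": mi / h_y,                           # coefficient of constraint, scaled by H(Y)
        "C_YX": mi / h_x,                           # coefficient of constraint, scaled by H(X)
        "R": mi / (h_x + h_y),                      # redundancy
        "R_max": min(h_x, h_y) / (h_x + h_y),       # largest value R can take
        "U": 2.0 * mi / (h_x + h_y),                # symmetric uncertainty
        "H_X": h_x,
        "H_Y": h_y,
    }

# Hypothetical table with unequal marginal entropies, so the two coefficients differ.
measures = normalized_measures([[0.4, 0.1],
                                [0.2, 0.3]])
for name, value in measures.items():
    print(f"{name:6s} {value:.6f}")
```

The weighted-average reading of U can be confirmed from the output: (H(X)·C_YX + H(Y)·C_XY) / (H(X) + H(Y)) reproduces U.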

Other normalized versions are provided by the following expressions (Yao 2003, Strehl & Ghosh 2002).

$$ \frac{I(X;Y)}{\min(H(X), H(Y))}, \qquad \frac{I(X;Y)}{H(X,Y)}, \qquad \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}} $$

The quantity

$$ D'(X,Y) = 1 - \frac{I(X;Y)}{\max(H(X), H(Y))} $$

is a metric, i.e. satisfies the triangle inequality, etc. The metric D' is also a universal metric.

Weighted variants

In the traditional formulation of the mutual information,

$$ I(X;Y) = \sum_{y} \sum_{x} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$

each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.

For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation, which shows agreement on all variable values, be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977)

$$ I(X;Y) = \sum_{y} \sum_{x} w(x,y)\, p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$

which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like. There has been little mathematical work done on the weighted mutual information and its properties, however.
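The example of the two deterministic mappings can be worked through in a few lines. This sketch assumes each of the three pairs in a mapping occurs with probability 1/3; the weight functions `uniform` and `favor_agreement` (which doubles the weight of pairs with x = y) are arbitrary illustrative choices, not weights proposed in the cited literature.

```python
import math
from collections import defaultdict

def weighted_mutual_information(p_xy, weight):
    """Sum of w(x,y) p(x,y) log2[p(x,y) / (p(x) p(y))] over a dict {(x, y): prob}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return sum(
        weight(x, y) * p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in p_xy.items() if p > 0.0
    )

agreeing = {(1, 1): 1 / 3, (2, 2): 1 / 3, (3, 3): 1 / 3}
permuted = {(1, 3): 1 / 3, (2, 1): 1 / 3, (3, 2): 1 / 3}

def uniform(x, y):
    """Weight 1 everywhere: recovers the ordinary mutual information."""
    return 1.0

def favor_agreement(x, y):
    """Hypothetical weighting that counts agreeing pairs twice as much."""
    return 2.0 if x == y else 1.0

for name, dist in (("agreeing", agreeing), ("permuted", permuted)):
    print(name,
          weighted_mutual_information(dist, uniform),
          weighted_mutual_information(dist, favor_agreement))
```

With uniform weights both mappings score log2(3) bits; with the agreement-favoring weights only the first mapping's score doubles.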

Absolute mutual information

Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution:

$$ I_K(X;Y) = K(X) - K(X|Y) $$

To establish that this quantity is symmetric up to a logarithmic factor requires the chain rule for Kolmogorov complexity. Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences.
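One way to make the compression idea concrete is the normalized compression distance, which substitutes the compressed length from a real compressor for the uncomputable Kolmogorov complexity. The sketch below is an assumption-laden illustration: it uses zlib as the compressor, the helper names are hypothetical, and the result is only a crude proxy for a distance based on K, not the quantity I_K itself.

```python
import random
import zlib

def compressed_size(data):
    """Length in bytes of the zlib-compressed data, a rough stand-in for K(data)."""
    return len(zlib.compress(data, 9))

def normalized_compression_distance(x, y):
    """(C(xy) - min(C(x), C(y))) / max(C(x), C(y)) for byte strings x and y."""
    c_x, c_y = compressed_size(x), compressed_size(y)
    c_xy = compressed_size(x + y)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)

# A highly regular sequence and an unrelated pseudo-random one of the same length.
regular = b"the quick brown fox jumps over the lazy dog " * 40
rng = random.Random(0)  # fixed seed so the example is reproducible
noise = bytes(rng.getrandbits(8) for _ in range(len(regular)))

print(normalized_compression_distance(regular, regular))  # small: nothing new to encode
print(normalized_compression_distance(regular, noise))    # near 1: almost nothing shared
```

A matrix of such pairwise distances could then be passed to any standard hierarchical clustering routine.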

Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:

  • The channel capacity is equal to the mutual information, maximized over all input distributions.
  • Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
  • RNA secondary structure prediction from a multiple sequence alignment.
  • Mutual information has been used as a criterion for feature selection and feature transformations in machine learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection.
  • Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
  • Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.
  • Detection of phase synchronization in time series analysis.
  • In the infomax method for neural-net and other machine learning, including the infomax-based independent component analysis algorithm.
  • Average mutual information is used in delay embedding to determine the embedding delay parameter.
  • Mutual information between genes in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
  • Mutual information is used as a measure for comparing clusterings, and offers some advantages over other classical measures such as the Rand index and the adjusted Rand index.
  • The adjusted-for-chance version of the mutual information is the Adjusted Mutual Information (AMI). It is used for comparing clusterings. It corrects the effect of agreement solely due to chance between clusterings, similar to the way the adjusted Rand index corrects the Rand index. A Matlab program for calculating the Adjusted Mutual Information between two clusterings can be obtained from http://ee.unsw.edu.au/~nguyenv/Software.htm
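Several of the applications above, such as comparing clusterings or scoring features, start from observed label pairs rather than a known distribution. A minimal plug-in estimate replaces the probabilities with relative frequencies, as sketched below. The function name and the toy labelings are assumptions, and this is the plain (unadjusted) mutual information: it applies no correction for chance and is not the AMI.

```python
import math
from collections import Counter

def empirical_mutual_information(labels_a, labels_b):
    """Plug-in estimate of I(A;B) in bits from two equally long label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # (c/n) * log2[(c/n) / ((count_a/n) * (count_b/n))], with the ratio simplified.
    return sum(
        (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )

clustering_1 = [0, 0, 1, 1]
clustering_2 = [1, 1, 0, 0]   # same partition, cluster names swapped
clustering_3 = [0, 1, 0, 1]   # partition that cuts across clustering_1

print(empirical_mutual_information(clustering_1, clustering_2))
print(empirical_mutual_information(clustering_1, clustering_3))
```

The estimate depends only on how the items are grouped, not on the cluster names, which is why the relabelled partition scores as highly as an exact copy would. On small samples the plug-in estimate tends to overstate dependence, which is the effect the adjusted variant is meant to correct.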


See also

  • Pointwise mutual information