Information geometry is a branch of mathematics that applies the techniques of differential geometry to the field of probability theory. It derives its name from the fact that the Fisher information is used as the Riemannian metric when considering the geometry of probability distribution families that form a Riemannian manifold. Notably, information geometry has been used to prove the higher-order efficiency properties of the maximum-likelihood estimator.

Information geometry reached maturity through the work of Shun'ichi Amari and other Japanese mathematicians in the 1980s. Amari and Nagaoka's book, *Methods of Information Geometry*, is currently the *de facto* reference book of the relatively young field due to its broad coverage of significant developments attained using the methods of information geometry up to the year 2000. Many of these developments were previously only available in Japanese-language publications.

## Introduction

In information geometry, a family (similar collection) of probability distributions over the random variable (or vector), *X*, is viewed as forming a manifold, *M*, with coordinate system, $\theta$. One possible coordinate system for the manifold is the free parameters of the probability distribution family. Each point, *P*, in the manifold, *M*, with coordinate $\theta$, carries a function on the random variable (or vector), i.e. the probability distribution. We write this as $p(x; \theta)$. The set of all points, *P*, in the probability family forms the manifold, *M*.

For example, with the family of normal distributions, the ordered pair of the mean, $\mu$, and standard deviation, $\sigma$, forms one possible coordinate system, $\theta = (\mu, \sigma)$. Each particular point in the manifold, such as $\mu = \mu_0$ and $\sigma = \sigma_0$, carries a specific normal distribution with mean, $\mu_0$, and standard deviation, $\sigma_0$, so that the point carries the density $p(x; \mu_0, \sigma_0)$.
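The correspondence between a coordinate and the distribution it carries can be sketched in a few lines of Python. This is only an illustration: the function names `normal_pdf` and `point_on_manifold`, and the two sample coordinates, are assumptions made for the example and do not come from the text.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def point_on_manifold(theta):
    """Map a coordinate theta = (mu, sigma) to the density carried by that point."""
    mu, sigma = theta
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return lambda x: normal_pdf(x, mu, sigma)


# Two hypothetical points on the manifold of normal distributions.
p0 = point_on_manifold((0.0, 1.0))
p1 = point_on_manifold((2.0, 0.5))

print(p0(0.0), p1(2.0))
```

Each call to `point_on_manifold` returns a different function of *x*, which is the sense in which every point of the manifold carries a probability distribution.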

To form a Riemannian manifold on which the techniques of differential geometry can be applied, a Riemannian metric must be defined. Information geometry takes the Fisher information to be the "natural" metric, although it is not the only possible metric. In this role the Fisher information is called the "Fisher metric" and, significantly, it is invariant under coordinate transformation.

## Examples

The main tenet of information geometry is that many important structures in probability theory, information theory and statistics can be treated as structures in differential geometry by regarding a space of probability distributions as a differentiable manifold endowed with a Riemannian metric and a family of affine connections distinct from the canonical affine connection. The e-affine connection and m-affine connection geometrize expectation and maximization, as in the expectation-maximization algorithm.

For example,

- The Fisher information metric is a Riemannian metric.
- The Kullback-Leibler divergence is one of a family of divergences related to dual affine connections.
- An exponential family is a flat submanifold under the e-affine connection.
- The maximum likelihood estimate may be obtained through projection onto the chosen statistical model using the m-affine connection.
- The unique existence of the maximum likelihood estimate on exponential families is a consequence of the e- and m-connections being dual affine.
- The **em-algorithm** (em here stands for e-projection, m-projection), which behaves similarly to the canonical EM algorithm in most cases, is, under broad conditions, an iterative dual projection method via the e-connection and m-connection.
- The concepts of accuracy of estimators, in particular the first and third order efficiency of estimators, can be represented in terms of embedding curvatures of the manifold representing the statistical model and the manifold representing the estimator (the second order always equals zero after bias correction).
- The higher order asymptotic power of a statistical test can be represented using geometric quantities.

The importance of studying statistical structures as geometrical structures lies in the fact that geometric structures are invariant under coordinate transforms. For example, the Fisher information metric is invariant under coordinate transformation.
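The invariance can be checked numerically on the family of normal distributions. The sketch below is illustrative only: it assumes the usual definition of the Fisher metric as the expected product of score components, computes the metric independently in the coordinates (mean, standard deviation) and (mean, log standard deviation), and compares the squared length of the same tangent vector in both. All function names, the integration grid and the sample vector are assumptions chosen for the example.

```python
import math


def log_pdf_theta(x, theta):
    """Log-density of a normal distribution in coordinates theta = (mu, sigma)."""
    mu, sigma = theta
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def log_pdf_eta(x, eta):
    """The same family in coordinates eta = (mu, log sigma)."""
    mu, log_sigma = eta
    return log_pdf_theta(x, (mu, math.exp(log_sigma)))


def fisher_metric(log_pdf, coords, lo, hi, n=4001, eps=1e-5):
    """Approximate E[score_j * score_k] with the trapezoidal rule on [lo, hi].

    Score components are taken by central finite differences in the coordinates.
    """
    dim = len(coords)
    h = (hi - lo) / (n - 1)
    g = [[0.0] * dim for _ in range(dim)]
    for i in range(n):
        x = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        p = math.exp(log_pdf(x, coords))
        s = []
        for j in range(dim):
            up = list(coords)
            dn = list(coords)
            up[j] += eps
            dn[j] -= eps
            s.append((log_pdf(x, up) - log_pdf(x, dn)) / (2 * eps))
        for j in range(dim):
            for k in range(dim):
                g[j][k] += w * s[j] * s[k] * p
    return g


def squared_length(g, v):
    """Squared length of the tangent vector v under the metric g."""
    return sum(v[j] * g[j][k] * v[k] for j in range(len(v)) for k in range(len(v)))


mu, sigma = 1.0, 2.0
lo, hi = mu - 12 * sigma, mu + 12 * sigma

g_theta = fisher_metric(log_pdf_theta, (mu, sigma), lo, hi)
g_eta = fisher_metric(log_pdf_eta, (mu, math.log(sigma)), lo, hi)

# The same tangent vector expressed in both coordinate systems:
# d(log sigma) = d(sigma) / sigma.
v_theta = (0.3, -0.7)
v_eta = (v_theta[0], v_theta[1] / sigma)

len_theta = squared_length(g_theta, v_theta)
len_eta = squared_length(g_eta, v_eta)
print(len_theta, len_eta)
```

The matrix entries of the metric differ between the two coordinate systems, but the length assigned to the tangent vector agrees up to numerical error, which is what invariance means here.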

The statistician Fisher recognized in the 1920s that there is an intrinsic measure of the amount of information for statistical estimators. The Fisher information matrix was shown by Cramér and Rao to be a Riemannian metric on the space of probabilities, and became known as the Fisher information metric.

The mathematician Cencov (Chentsov) proved in the 1960s and 1970s that on the space of probability distributions on a sample space containing at least three points,

- There exists a unique intrinsic metric. It is the Fisher information metric.
- There exists a unique one-parameter family of affine connections. It is the family of $\alpha$-affine connections later popularized by Amari.

Both uniqueness results hold, of course, up to multiplication by a constant.

Amari and Nagaoka's study in the 1980s brought all these results together, with the introduction of the concept of dual-affine connections, and the interplay among metric, affine connection and divergence. In particular,

- Given a Riemannian metric *g* and a family of dual affine connections, there exists a unique set of dual divergences defined by them.
- Given the family of dual divergences, the metric and affine connections can be uniquely determined by second order and third order differentiations.
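The second-order part of this statement can be illustrated with the Kullback-Leibler divergence between two normal distributions, whose closed form is standard. The sketch below differentiates the divergence twice with respect to its second argument at the diagonal and recovers the Fisher metric of the normal family in (mean, standard deviation) coordinates, diag(1/σ², 2/σ²). The finite-difference helper `hessian_at` and the sample point are assumptions made for the example; the third-order step that yields the connections is not shown.

```python
import math


def kl_normal(theta0, theta1):
    """Kullback-Leibler divergence from N(mu0, sigma0^2) to N(mu1, sigma1^2)."""
    mu0, s0 = theta0
    mu1, s1 = theta1
    return math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5


def hessian_at(f, theta, eps=1e-4):
    """Second derivatives of f at theta by central finite differences."""
    dim = len(theta)
    H = [[0.0] * dim for _ in range(dim)]
    for j in range(dim):
        for k in range(dim):
            def shifted(dj, dk):
                t = list(theta)
                t[j] += dj * eps
                t[k] += dk * eps
                return f(t)
            H[j][k] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps * eps)
    return H


theta0 = (1.0, 2.0)  # hypothetical base point (mu, sigma)

# Differentiate the divergence in its second argument, evaluated at theta0.
metric = hessian_at(lambda t: kl_normal(theta0, t), theta0)

for row in metric:
    print(row)
```

For the base point used here the result is close to diag(0.25, 0.5), matching 1/σ² and 2/σ² with σ = 2.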

Also, Amari and Kumon showed that the asymptotic efficiency of estimates and tests can be represented by geometrical quantities.

## Fisher information metric as a Riemannian metric

Information geometry makes frequent use of the Fisher information metric:

$$g_{jk}(\theta) = \int \frac{\partial \log p(x; \theta)}{\partial \theta_j} \, \frac{\partial \log p(x; \theta)}{\partial \theta_k} \, p(x; \theta) \, dx$$

Substituting $i(x; \theta) = -\log p(x; \theta)$ from information theory, the formula becomes:

$$g_{jk}(\theta) = \int \frac{\partial i(x; \theta)}{\partial \theta_j} \, \frac{\partial i(x; \theta)}{\partial \theta_k} \, p(x; \theta) \, dx = \operatorname{E}\left[ \frac{\partial i(x; \theta)}{\partial \theta_j} \, \frac{\partial i(x; \theta)}{\partial \theta_k} \right]$$
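As a worked instance, the metric can be evaluated for the family of normal distributions, taking the metric to be the expected product of partial derivatives of the log-density with respect to the coordinates. The sketch below uses the analytic score in (mean, standard deviation) coordinates and a simple trapezoidal rule; the function names, grid size and integration width are assumptions for illustration. The result can be compared with the known closed form diag(1/σ², 2/σ²).

```python
import math


def log_pdf(x, mu, sigma):
    """Log-density of the normal distribution with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def score(x, mu, sigma):
    """Partial derivatives of log p with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    return (d_mu, d_sigma)


def fisher_metric(mu, sigma, n=4001, width=12.0):
    """Integrate score_j * score_k * p over mu +/- width * sigma (trapezoidal rule)."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / (n - 1)
    g = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(n):
        x = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)
        p = math.exp(log_pdf(x, mu, sigma))
        s = score(x, mu, sigma)
        for j in range(2):
            for k in range(2):
                g[j][k] += w * s[j] * s[k] * p
    return g


g = fisher_metric(1.0, 2.0)
for row in g:
    print(row)
```

Because the substitution only flips the sign of both factors, using the partial derivatives of −log p in `score` would give the same matrix.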

## History

The history of information geometry is associated with the discoveries of at least the following people, and many others:

- Sir Ronald Aylmer Fisher
- Harald Cramér
- Calyampudi Radhakrishna Rao
- Solomon Kullback
- Richard Leibler
- Claude Shannon
- Imre Csiszár
- Cencov
- Bradley Efron
- Vos
- Shun'ichi Amari
- Hiroshi Nagaoka
- Kass
- Shinto Eguchi
- Ole Barndorff-Nielsen
- Giovanni Pistone
- Bernard Hanzon
- Damiano Brigo

### Natural gradient

An important concept in information geometry is the natural gradient. The concept and theory of the natural gradient suggest an adjustment to the energy function of a learning rule. This adjustment takes into account the curvature of the (prior) statistical differential manifold, by way of the Fisher information metric.

This concept has many important applications in blind signal separation, neural networks, artificial intelligence, and other engineering problems that deal with information. Experimental results have shown that application of the concept leads to substantial performance gains.
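A common way to write the natural gradient update, assumed here, is to premultiply the ordinary gradient of the loss by the inverse of the Fisher metric before taking a step. The sketch below applies this to a toy problem, fitting a normal distribution to a small data set by minimizing the average negative log-likelihood in (mean, standard deviation) coordinates, where the Fisher metric is diag(1/σ², 2/σ²). The data, learning rate, step count and all function names are assumptions made for illustration; this is not a description of any particular published learning rule.

```python
import math


def nll_gradient(data, mu, sigma):
    """Ordinary gradient of the average negative log-likelihood in (mu, sigma)."""
    n = len(data)
    mean_resid = sum(x - mu for x in data) / n
    mean_sq = sum((x - mu) ** 2 for x in data) / n
    return (-mean_resid / sigma ** 2, 1.0 / sigma - mean_sq / sigma ** 3)


def inverse_fisher(mu, sigma):
    """Inverse of the Fisher metric diag(1/sigma^2, 2/sigma^2) of the normal family."""
    return ((sigma ** 2, 0.0), (0.0, sigma ** 2 / 2.0))


def natural_gradient_step(data, mu, sigma, lr):
    """One update: coordinates minus lr times inverse metric times gradient."""
    grad = nll_gradient(data, mu, sigma)
    inv = inverse_fisher(mu, sigma)
    nat = (inv[0][0] * grad[0] + inv[0][1] * grad[1],
           inv[1][0] * grad[0] + inv[1][1] * grad[1])
    return mu - lr * nat[0], sigma - lr * nat[1]


def fit(data, mu=0.0, sigma=1.0, lr=0.5, steps=100):
    """Run a fixed number of natural gradient steps from the given start."""
    for _ in range(steps):
        mu, sigma = natural_gradient_step(data, mu, sigma, lr)
    return mu, sigma


# Hypothetical observations.
data = [2.1, 1.9, 3.2, 2.8, 2.5, 1.7, 3.0, 2.4]
mu_hat, sigma_hat = fit(data)
print(mu_hat, sigma_hat)
```

In this example the rescaling by the inverse metric removes the dependence of the step size on the current value of σ, so the iteration settles on the sample mean and the (population) sample standard deviation, which are the maximum-likelihood estimates.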

### Nonlinear filtering

Other applications concern statistics of stochastic processes and approximate finite dimensional solutions of the filtering problem for stochastic processes. As the nonlinear filtering problem admits an infinite dimensional solution in general, one can use a geometric structure in the space of probability distributions to project the infinite dimensional filter into an approximate finite dimensional one, leading to the projection filters introduced in 1987 by Bernard Hanzon.

## External links