Jensen's inequality

For Jensen's inequality for analytic functions, see Jensen's formula.

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean taken after the convex transformation; it is a simple corollary that the opposite is true of concave transformations.

Jensen's inequality generalizes the statement that the secant line of a convex function lies above the graph of the function, which is Jensen's inequality for two points: the secant line consists of weighted means of the convex function,

t f(x_1) + (1-t) f(x_2),

while the graph of the function is the convex function of the weighted means,

f(t x_1 + (1-t) x_2).

There are also converses of Jensen's inequality, which give upper bounds on the integral of the convex function.

In the context of probability theory, it is generally stated in the following form: if X is a random variable and \varphi is a convex function, then

\varphi\left(\mathbb{E}\left[X\right]\right) \leq \mathbb{E}\left[\varphi(X)\right].
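As a quick numerical illustration of the probabilistic form (a sketch only; the distribution, the choice \varphi(x) = x^2, and the variable names are arbitrary and not part of the statement), both sides can be estimated by Monte Carlo sampling:

```python
# Monte Carlo check of phi(E[X]) <= E[phi(X)] for a convex phi.
# The exponential distribution and phi(x) = x**2 are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)   # any integrable random variable
phi = np.square                                # x -> x**2 is convex

lhs = phi(X.mean())     # phi(E[X]), estimated from the sample mean
rhs = phi(X).mean()     # E[phi(X)], estimated as the mean of phi(X)
print("phi(E[X]) =", lhs, "<=", "E[phi(X)] =", rhs)
assert lhs <= rhs + 1e-9
```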

Statements

The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using either the language of measure theory or (equivalently) probability. In the probabilistic setting, the inequality can be further generalized to its full strength.

Finite form

For a real convex function \varphi, numbers x_1, x_2, ..., x_n in its domain, and positive weights a_i, Jensen's inequality can be stated as:

\varphi\left(\frac{\sum a_i x_i}{\sum a_j}\right) \le \frac{\sum a_i \varphi (x_i)}{\sum a_j} \qquad\qquad (1)

and the inequality is reversed if \varphi is concave:

\varphi\left(\frac{\sum a_i x_i}{\sum a_j}\right) \geq \frac{\sum a_i \varphi (x_i)}{\sum a_j}. \qquad\qquad (2)

As a particular case, if the weights a_i are all equal, then (1) and (2) become

\varphi\left(\frac{\sum x_i}{n}\right) \le \frac{\sum \varphi (x_i)}{n} \qquad\qquad (3)

\varphi\left(\frac{\sum x_i}{n}\right) \geq \frac{\sum \varphi (x_i)}{n} \qquad\qquad (4)

For instance, the function log(x) is concave (note that, conversely, if the two-point form of the inequality holds for every pair of points and every weight, the function is convex or concave, since this is precisely the defining condition), so substituting \varphi(x) = \log(x) in formula (4) establishes the (logarithm of the) familiar arithmetic mean–geometric mean inequality:

\frac{x_1 + x_2 + \cdots + x_n}{n} \geq \sqrt[n]{x_1 x_2 \cdots x_n}.

The variable x may, if required, be a function of another variable (or set of variables) t, so that x_i = g(t_i). All of this carries over directly to the general continuous case: the weights a_i are replaced by a non-negative integrable function f(x), such as a probability density, and the summations are replaced by integrals.
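The finite form and its AM–GM consequence can be checked numerically. The following sketch uses arbitrary points and weights together with the illustrative choices \varphi(x) = e^x (convex) and \varphi(x) = \log x (concave); none of these are prescribed by the text:

```python
# Check the weighted finite form (1) and the AM-GM consequence of (4).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 10.0, size=8)     # arbitrary points in the domain of phi
a = rng.uniform(0.1, 1.0, size=8)      # positive weights (not necessarily normalized)

phi = np.exp                           # a convex function
lhs = phi(np.sum(a * x) / np.sum(a))   # phi of the weighted mean
rhs = np.sum(a * phi(x)) / np.sum(a)   # weighted mean of phi
assert lhs <= rhs

# With phi = log (concave) and equal weights, (4) is the AM-GM inequality.
am = x.mean()
gm = np.exp(np.log(x).mean())          # geometric mean computed via logarithms
print("geometric mean", gm, "<= arithmetic mean", am)
assert gm <= am
```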

Measure-theoretic and probabilistic form

Let (Ω, A, μ) be a measure space such that μ(Ω) = 1. If g is a real-valued function that is μ-integrable, and if \varphi is a convex function on the real line, then:

\varphi\left(\int_\Omega g\, d\mu\right) \le \int_\Omega \varphi \circ g\, d\mu.

In real analysis, we may require an estimate on

\varphi\left(\int_a^b f(x)\, dx\right)

where a, b are real numbers and f:[a,b]\to\mathbb{R} is a non-negative real-valued function that is Lebesgue-integrable. In this case, the Lebesgue measure of [a,b] need not be unity. However, by integration by substitution, the interval can be rescaled so that it has measure unity. Then Jensen's inequality can be applied to get

\varphi\left(\int_a^b f(x)\, dx\right) \le \int_a^b \varphi((b-a)f(x))\frac{1}{b-a} \,dx.

The same result can be stated equivalently in a probability theory setting, by a simple change of notation. Let (\Omega, \mathfrak{F},\mathbb{P}) be a probability space, X an integrable real-valued random variable, and \varphi a convex function. Then:

\varphi\left(\mathbb{E}\left[X\right]\right) \leq \mathbb{E}\left[\varphi(X)\right].

In this probability setting, the measure μ is interpreted as a probability \mathbb{P}, the integral with respect to μ as an expected value \mathbb{E}, and the function g as a random variable X.
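A minimal sketch of the rescaled-interval estimate, using a midpoint-rule quadrature; the interval, the function f, and the convex \varphi below are arbitrary illustrative choices:

```python
# Numerical check of phi(int_a^b f dx) <= int_a^b phi((b-a) f(x)) / (b-a) dx.
import numpy as np

a, b, n = 0.0, 2.0, 200_000
dx = (b - a) / n
xs = np.linspace(a, b, n, endpoint=False) + dx / 2   # midpoint-rule grid
f = np.exp(-xs)                  # a non-negative Lebesgue-integrable function on [a, b]
phi = np.square                  # a convex function

lhs = phi(np.sum(f) * dx)                        # phi( integral_a^b f(x) dx )
rhs = np.sum(phi((b - a) * f) / (b - a)) * dx    # rescaled right-hand side
print(lhs, "<=", rhs)
assert lhs <= rhs + 1e-9
```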

General inequality in a probabilistic setting

More generally, let T be a real topological vector space, and X a T-valued integrable random variable. In this general setting, integrable means that there exists an element \mathbb{E}[X] in T such that, for any element z in the dual space of T, \mathbb{E}|\langle z, X \rangle| < \infty and \langle z, \mathbb{E}[X]\rangle = \mathbb{E}[\langle z, X \rangle]. Then, for any measurable convex function φ and any sub-σ-algebra \mathfrak{G} of \mathfrak{F}:

\varphi\left(\mathbb{E}\left[X|\mathfrak{G}\right]\right) \leq \mathbb{E}\left[\varphi(X)|\mathfrak{G}\right].

Here \mathbb{E}[\,\cdot\,|\mathfrak{G}] stands for the expectation conditioned on the σ-algebra \mathfrak{G}. This general statement reduces to the previous ones when the topological vector space T is the real axis and \mathfrak{G} is the trivial σ-algebra \{\varnothing, \Omega\}. In the case where the sub-σ-algebra is generated by a measurable function Y, writing \mathbb{E}[X|Y] for the factorized conditional expectation (that is, the measurable function h such that \mathbb{E}[X|\sigma(Y)] = h \circ Y), the statement can be given as

\varphi\left(\mathbb{E}\left[X|Y\right]\circ Y\right) \leq \mathbb{E}\left[\varphi(X)|Y\right]\circ Y

or

\varphi\circ\mathbb{E}\left[X|Y\right] \leq \mathbb{E}\left[\varphi\circ X|Y\right].
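The conditional form can be illustrated when the sub-σ-algebra is generated by a discrete random variable Y, in which case conditioning amounts to grouping samples by the value of Y. The distributions and the choice \varphi(x) = |x| below are arbitrary:

```python
# Group-wise check of phi(E[X|Y]) <= E[phi(X)|Y] for a discrete conditioning variable Y.
import numpy as np

rng = np.random.default_rng(2)
Y = rng.integers(0, 3, size=60_000)              # a discrete conditioning variable
X = rng.normal(loc=Y.astype(float), scale=1.0)   # X depends on Y
phi = np.abs                                     # |x| is convex

for y in np.unique(Y):
    grp = X[Y == y]
    lhs = phi(grp.mean())    # phi(E[X | Y=y]), Monte Carlo estimate
    rhs = phi(grp).mean()    # E[phi(X) | Y=y], Monte Carlo estimate
    print("Y =", y, ":", lhs, "<=", rhs)
    assert lhs <= rhs + 1e-9
```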

Proofs

Jensen's inequality can be proved in several ways, and three proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where X is a real number. Assuming a hypothetical distribution of X values, one can immediately identify the position of \mathbb{E}[X] and its image \varphi(\mathbb{E}[X]) in the graph. Noticing that for convex mappings Y = \varphi(X) the corresponding distribution of Y values is increasingly "stretched out" for increasing values of X, it is easy to see that the distribution of Y is broader in the interval corresponding to X > X_0 and narrower in X < X_0 for any X_0; in particular, this is also true for X_0 = \mathbb{E}[X]. Consequently, in this picture the expectation of Y will always shift upwards with respect to the position of \varphi(\mathbb{E}[X]), and this "proves" the inequality, i.e.

\mathbb{E}[Y] = \mathbb{E}[\varphi(X)] \geq \varphi(\mathbb{E}[X]),

with equality when \varphi is not strictly convex, e.g. when it is a straight line, or when X follows a degenerate distribution (i.e. is a constant). The proofs below formalize this intuitive notion.

Proof 1 (finite form)

If λ_1 and λ_2 are two arbitrary positive real numbers such that λ_1 + λ_2 = 1, then convexity of \varphi implies

\varphi(\lambda_1 x_1+\lambda_2 x_2)\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2)\text{ for any }x_1,\,x_2.

This can be easily generalized: if λ_1, λ_2, ..., λ_n are positive real numbers such that λ_1 + ... + λ_n = 1, then

\varphi(\lambda_1 x_1+\lambda_2 x_2+\cdots+\lambda_n x_n)\leq \lambda_1\,\varphi(x_1)+\lambda_2\,\varphi(x_2)+\cdots+\lambda_n\,\varphi(x_n),

for any x_1, ..., x_n. This finite form of Jensen's inequality can be proved by induction: by the convexity hypothesis, the statement is true for n = 2. Suppose it is true for some n; we prove it for n + 1. At least one of the λ_i is strictly positive, say λ_1; therefore, by the convexity inequality:

\begin{align} \varphi\left(\sum_{i=1}^{n+1}\lambda_i x_i\right) & = \varphi\left(\lambda_1 x_1+(1-\lambda_1)\sum_{i=2}^{n+1} \frac{\lambda_i}{1-\lambda_1} x_i\right) \\ & \leq \lambda_1\,\varphi(x_1)+(1-\lambda_1) \varphi\left(\sum_{i=2}^{n+1} \frac{\lambda_i}{1-\lambda_1} x_i\right). \end{align}

Since \sum_{i=2}^{n+1} \lambda_i/(1-\lambda_1) = 1, one can apply the induction hypothesis to the last term in the previous formula to obtain the result, namely the finite form of Jensen's inequality.

In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be rewritten as:

\varphi\left(\int x\,d\mu_n(x) \right)\leq \int \varphi(x)\,d\mu_n(x),

where μ_n is a measure given by an arbitrary convex combination of Dirac deltas:

\mu_n=\sum_{i=1}^n \lambda_i \delta_{x_i}.

Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as can be easily verified), the general statement is obtained simply by a limiting procedure.
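The induction step of this proof can be mirrored by a short recursion that peels off the first weight and renormalizes the rest. This is only an illustrative sketch; the helper name weighted_phi_mean and the test data are ours:

```python
# Recursion mirroring the induction step: at each level, check two-point convexity
# applied to lambda_1 and the renormalized remainder, then recurse.
import numpy as np

def weighted_phi_mean(phi, lam, x):
    """Return sum(lam_i * phi(x_i)), checking the convexity step at each level."""
    lam, x = np.asarray(lam, dtype=float), np.asarray(x, dtype=float)
    if len(x) == 1:
        return phi(x[0])
    l1 = lam[0]
    rest_lam = lam[1:] / (1.0 - l1)                 # renormalized weights, summing to 1
    inner_mean = float(np.dot(rest_lam, x[1:]))
    # Two-point convexity (the n = 2 base case), as in the induction step:
    two_point = l1 * phi(x[0]) + (1.0 - l1) * phi(inner_mean)
    assert phi(l1 * x[0] + (1.0 - l1) * inner_mean) <= two_point + 1e-12
    # Induction hypothesis: phi(inner_mean) <= weighted mean of phi over the rest.
    return l1 * phi(x[0]) + (1.0 - l1) * weighted_phi_mean(phi, rest_lam, x[1:])

lam = [0.1, 0.2, 0.3, 0.4]
x = [1.0, 2.0, 3.0, 4.0]
lhs = np.exp(np.dot(lam, x))                 # phi of the convex combination
rhs = weighted_phi_mean(np.exp, lam, x)      # convex combination of phi values
print(lhs, "<=", rhs)
assert lhs <= rhs
```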

Proof 2 (measure-theoretic form)

Let g be a real-valued μ-integrable function on a probability space Ω, and let φ be a convex function on the real numbers. Since φ is convex, at each real number x we have a nonempty set of subderivatives, which may be thought of as lines touching the graph of φ at x but lying at or below the graph of φ at all points.

Now, if we define

x_0:=\int_\Omega g\, d\mu,

then, because of the existence of subderivatives for convex functions, we may choose a and b such that

ax + b \leq \varphi(x)

for all real x, and

ax_0+ b = \varphi(x_0).

But then we have that

\varphi\circ g (x) \geq ag(x)+ b

for all x. Since we have a probability measure, the integral is monotone with μ(Ω) = 1, so that

\int_\Omega \varphi\circ g\, d\mu \geq \int_\Omega (ag + b)\, d\mu = a\int_\Omega g\, d\mu + b = ax_0 + b = \varphi(x_0) = \varphi\left(\int_\Omega g\, d\mu\right),

as desired.
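The supporting-line argument can be made concrete numerically: for a smooth convex φ, the ordinary derivative at x_0 is a valid subderivative, so one can construct the line ax + b and check both of its properties. The specific φ, the samples playing the role of g, and the names below are illustrative assumptions:

```python
# Supporting line at x0 for a smooth convex phi, followed by the integral comparison.
import numpy as np

rng = np.random.default_rng(3)
phi = np.exp                                # a convex function
g = rng.uniform(-2.0, 2.0, size=50_000)     # sample values of g on a probability space

x0 = g.mean()                               # plays the role of x0 = integral of g d(mu)
a = np.exp(x0)                              # for this smooth phi, phi'(x0) is a subderivative
b = phi(x0) - a * x0                        # the line passes through (x0, phi(x0))

xs = np.linspace(-3.0, 3.0, 1001)
assert np.all(a * xs + b <= phi(xs) + 1e-9)   # the line lies at or below the graph

# Monotonicity of the integral with mu(Omega) = 1 then gives Jensen's inequality:
lower = (a * g + b).mean()                  # equals a*x0 + b = phi(x0)
assert phi(g).mean() >= lower - 1e-9
print(phi(x0), "<=", phi(g).mean())
```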

Proof 3 (general inequality in a probabilistic setting)

Let X be an integrable random variable that takes values in a real topological vector space T. Since \varphi: T \to \mathbb{R} is convex, for any x, y \in T, the quantity

\frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}

is decreasing as θ approaches 0+. In particular, the subdifferential of φ evaluated at x in the direction y is well-defined by

(D\varphi)(x)\cdot y:=\lim_{\theta \downarrow 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}=\inf_{\theta \neq 0} \frac{\varphi(x+\theta\,y)-\varphi(x)}{\theta}.

It is easily seen that the subdifferential is linear in y and, since the infimum taken on the right-hand side of the previous formula is smaller than the value of the same term for θ = 1, one gets

\varphi(x)\leq \varphi(x+y)-(D\varphi)(x)\cdot y.

In particular, for an arbitrary sub-σ-algebra \mathfrak{G} we can evaluate the last inequality with x = \mathbb{E}[X|\mathfrak{G}] and y = X-\mathbb{E}[X|\mathfrak{G}] to obtain

\varphi(\mathbb{E}[X|\mathfrak{G}])\leq \varphi(X)-(D\varphi)(\mathbb{E}[X|\mathfrak{G}])\cdot (X-\mathbb{E}[X|\mathfrak{G}]).

Now, if we take the expectation conditioned on \mathfrak{G} on both sides of the previous expression, we get the result, since

\mathbb{E}\left[(D\varphi)(\mathbb{E}[X|\mathfrak{G}])\cdot (X-\mathbb{E}[X|\mathfrak{G}])\,\big|\,\mathfrak{G}\right]=(D\varphi)(\mathbb{E}[X|\mathfrak{G}])\cdot \mathbb{E}\left[X-\mathbb{E}[X|\mathfrak{G}]\,\big|\,\mathfrak{G}\right]=0,

by the linearity of the subdifferential in the y variable and the following well-known property of conditional expectation:

\mathbb{E}\left[\mathbb{E}[X|\mathfrak{G}]\,\big|\,\mathfrak{G}\right]=\mathbb{E}[X|\mathfrak{G}].

Form involving a probability density function

Suppose Ω is a measurable subset of the real line and f(x) is a non-negative function such that

\int_{-\infty}^\infty f(x)\,dx = 1.

In probabilistic language, f is a probability density function. Then Jensen's inequality becomes the following statement about convex integrals: if g is any real-valued measurable function and φ is convex over the range of g, then

\varphi\left(\int_{-\infty}^\infty g(x)f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(g(x)) f(x)\, dx.

If g(x) = x, then this form of the inequality reduces to a commonly used special case:

\varphi\left(\int_{-\infty}^\infty x\, f(x)\, dx\right) \le \int_{-\infty}^\infty \varphi(x)\,f(x)\, dx.
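A numerical sketch of the density form, using a truncated grid as a stand-in for the real line; the density f, the function g, and the convex φ are arbitrary choices:

```python
# Check phi( integral g f dx ) <= integral phi(g) f dx for a standard normal density.
import numpy as np

xs = np.linspace(-8.0, 8.0, 400_001)
dx = xs[1] - xs[0]
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density (integrates to ~1)
g = xs**3                                      # values of a measurable function g(x) = x^3
phi = np.abs                                   # convex over the range of g

lhs = phi(np.sum(g * f) * dx)                  # phi( integral g(x) f(x) dx )
rhs = np.sum(phi(g) * f) * dx                  # integral phi(g(x)) f(x) dx
print(lhs, "<=", rhs)
assert lhs <= rhs + 1e-9
```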

Alternative finite form

If \Omega is some finite set \{x_1,x_2,\ldots,x_n\}, and if \mu is a counting measure on \Omega, then the general form reduces to a statement about sums:

\varphi\left(\sum_{i=1}^{n} g(x_i)\lambda_i \right) \le \sum_{i=1}^{n} \varphi(g(x_i))\lambda_i,

provided that \lambda_1 + \lambda_2 + \cdots + \lambda_n = 1 and \lambda_i \ge 0. There is also an infinite discrete form.

Statistical physics

Jensen's inequality is of particular importance in statistical physics when the convex function is an exponential, giving:

e^{\langle X \rangle} \leq \left\langle e^X \right\rangle,

where angle brackets denote expected values with respect to some probability distribution in the random variable X. The proof in this case is very simple (cf. Chandler, Sec. 5.5). The desired inequality follows directly, by writing

\left\langle e^X \right\rangle = e^{\langle X \rangle} \left\langle e^{X - \langle X \rangle} \right\rangle

and then applying the inequality e^X \geq 1+X to the final exponential.
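A brief numerical illustration of this form and of the factorization used in the proof; the distribution of X is an arbitrary choice:

```python
# Check exp(<X>) <= <exp(X)> and the factorization <e^X> = e^<X> <e^(X - <X>)>.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=200_000)

lhs = np.exp(X.mean())       # exp(<X>)
rhs = np.exp(X).mean()       # <exp(X)>
assert lhs <= rhs

# Factorization used in the proof; e^Y >= 1 + Y applied to Y = X - <X> gives the bound.
factor = np.exp(X - X.mean()).mean()
print(rhs, "=", lhs * factor, ">=", lhs)
```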

Information theory

If p(x) is the true probability density for x, and q(x) is another density, then applying Jensen's inequality for the random variable Y(x) = q(x)/p(x) and the convex function \varphi(y) = -\log(y) gives

\mathbb{E}[\varphi(Y)] \ge \varphi(\mathbb{E}[Y])

\Rightarrow \int p(x) \log \frac{p(x)}{q(x)} \, dx \ge - \log \int p(x) \frac{q(x)}{p(x)} \, dx

\Rightarrow \int p(x) \log \frac{p(x)}{q(x)} \, dx \ge 0

\Rightarrow - \int p(x) \log q(x) \, dx \ge - \int p(x) \log p(x) \, dx,

a result called Gibbs' inequality. It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities p rather than any other distribution q. The quantity that is non-negative is called the Kullback–Leibler divergence of q from p.
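A short sketch computing the non-negative quantity (the Kullback–Leibler divergence) for two arbitrarily chosen discrete distributions p and q:

```python
# Gibbs' inequality for discrete distributions: D(p || q) >= 0, i.e.
# the cross-entropy under q is at least the entropy under the true p.
import numpy as np

p = np.array([0.1, 0.4, 0.2, 0.3])        # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # any other distribution on the same support

kl = np.sum(p * np.log(p / q))            # Kullback-Leibler divergence D(p || q)
cross_entropy = -np.sum(p * np.log(q))    # expected code length under q
entropy = -np.sum(p * np.log(p))          # expected code length under the true p

print("KL divergence:", kl)
assert kl >= 0.0 and cross_entropy >= entropy
```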

Rao–Blackwell theorem

Main article: Rao–Blackwell theorem

If L is a convex function, then from Jensen's inequality we get

L(\mathbb{E}[\delta(X)]) \le \mathbb{E}[L(\delta(X))] \quad \Rightarrow \quad \mathbb{E}[L(\mathbb{E}[\delta(X)])] \le \mathbb{E}[L(\delta(X))].

So if δ(X) is some estimator of an unobserved parameter θ given a vector of observables X, and if T(X) is a sufficient statistic for θ, then an improved estimator, in the sense of having a smaller expected loss L, can be obtained by calculating

\delta_1 (X) = \mathbb{E}_{\theta}[\delta(X') \,|\, T(X')= T(X)],

the expected value of δ with respect to θ, taken over all possible vectors of observations X compatible with the same value of T(X) as that observed. This result is known as the Rao–Blackwell theorem.
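A toy illustration of the Rao–Blackwell improvement with Bernoulli data, where the sufficient statistic is the sum of the observations and the conditional expectation of the first observation given that sum is the sample mean; all modelling choices here are ours, for illustration only:

```python
# Rao-Blackwellizing the crude estimator delta = X_1 against T = sum(X_i)
# yields the sample mean, which has a smaller expected squared-error loss.
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.3, 10, 100_000
X = rng.binomial(1, theta, size=(reps, n))   # reps independent samples of size n

delta = X[:, 0].astype(float)    # crude unbiased estimator: the first observation only
delta1 = X.mean(axis=1)          # E[delta | T] with T = sum(X_i): the sample mean

squared_loss = lambda d: np.mean((d - theta) ** 2)   # a convex loss
print("E[L(delta)]  ~", squared_loss(delta))
print("E[L(delta1)] ~", squared_loss(delta1))
assert squared_loss(delta1) <= squared_loss(delta)
```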

See also

  • Karamata's inequality, for a more general inequality.
  • Law of averages
  • The operator Jensen inequality of Hansen and Pedersen.