Tikhonov regularization
Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, and, with multiple independent discoveries, it is also variously known as the Tikhonov-Miller method, the Phillips-Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg-Marquardt algorithm for non-linear least-squares problems.

When the problem

A\mathbf{x}=\mathbf{b}

is not well posed (either because of non-existence or non-uniqueness of \mathbf{x}),
then the standard approach is known as linear least squares and seeks to minimize the residual

\|A\mathbf{x}-\mathbf{b}\|^2,

where \left \| \cdot \right \| is the Euclidean norm. The ill-posedness may be due to the system being overdetermined or underdetermined (A may be ill-conditioned or singular). In the underdetermined case least squares is no better than the original problem, since the minimizer of the residual is still not unique. In order to give preference to a particular solution with desirable properties, a regularization term is included in this minimization:

\|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2
for some suitably chosen Tikhonov matrix \Gamma. In many cases, this matrix is chosen as the identity matrix \Gamma = I, giving preference to solutions with smaller norms. In other cases, highpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a numerical solution. An explicit solution, denoted by \hat{x}, is given by

\hat{x} = (A^{T}A+ \Gamma^{T} \Gamma )^{-1}A^{T}\mathbf{b}.

The effect of regularization may be varied via the scale of the matrix \Gamma. For \Gamma = \alpha I, when \alpha = 0 this reduces to the unregularized least-squares solution, provided that (A^{T}A)^{-1} exists.
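As a concrete illustration, the explicit solution above is straightforward to compute with NumPy. The following is a minimal sketch, not a prescribed implementation: the test matrix A, the data b, and the choice \Gamma = \alpha I are illustrative assumptions.

import numpy as np

def tikhonov_solve(A, b, Gamma):
    """Minimize ||Ax - b||^2 + ||Gamma x||^2 via the normal equations."""
    # x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

# Illustrative ill-conditioned system, regularized with Gamma = alpha * I.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10)) @ np.diag(10.0 ** -np.arange(10))
b = A @ rng.standard_normal(10) + 1e-3 * rng.standard_normal(20)
x_hat = tikhonov_solve(A, b, 0.01 * np.eye(10))

Note that forming the normal equations squares the condition number; in practice one would usually solve the equivalent stacked least-squares problem (A stacked on \Gamma against b stacked on zeros) with np.linalg.lstsq instead.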

Bayesian interpretation

Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix \Gamma seems rather arbitrary, the process can be justified from a Bayesian point of view. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a stable solution. Statistically, we might assume that a priori we know that x is a random variable with a multivariate normal distribution. For simplicity we take the mean to be zero and assume that each component is independent with standard deviation \sigma_x. Our data are also subject to errors, and we take the errors in b to be likewise independent with zero mean and standard deviation \sigma_b. Under these assumptions the Tikhonov-regularized solution is the most probable (maximum a posteriori) solution given the data and the a priori distribution of x, according to Bayes' theorem. The Tikhonov matrix is then \Gamma = \alpha I for Tikhonov factor \alpha = \sigma_b / \sigma_x. If the assumption of normality is replaced by assumptions of homoskedasticity and uncorrelatedness of errors, and if one still assumes zero mean, then the Gauss-Markov theorem entails that the solution is the minimum-variance unbiased linear estimate.
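The link between the MAP estimate and the regularized objective can be made explicit. The following short derivation is a sketch under the zero-mean Gaussian assumptions stated above, showing where the factor \alpha = \sigma_b / \sigma_x comes from:

-\log p(\mathbf{x} \mid \mathbf{b}) = \frac{1}{2\sigma_b^2}\|A\mathbf{x}-\mathbf{b}\|^2 + \frac{1}{2\sigma_x^2}\|\mathbf{x}\|^2 + \text{const},

which, up to the positive factor 2\sigma_b^2 and an additive constant, equals \|A\mathbf{x}-\mathbf{b}\|^2 + \frac{\sigma_b^2}{\sigma_x^2}\|\mathbf{x}\|^2. Maximizing the posterior therefore minimizes the Tikhonov objective with \|\Gamma \mathbf{x}\|^2 = \alpha^2 \|\mathbf{x}\|^2, i.e. \Gamma = \alpha I with \alpha = \sigma_b / \sigma_x.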

Generalized Tikhonov regularization

For general multivariate normal distributions for x and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an x to minimize

\|Ax-b\|_P^2 + \|x-x_0\|_Q^2,

where \left \| x \right \|_Q^2 stands for the weighted norm x^T Q x (cf. the Mahalanobis distance). In the Bayesian interpretation, P is the inverse covariance matrix of b, x_0 is the expected value of x, and Q is the inverse covariance matrix of x. The Tikhonov matrix is then given by a factorization of the matrix Q = \Gamma^T \Gamma (e.g. the Cholesky factorization) and is considered a whitening filter. This generalized problem can be solved explicitly using the formula

x_0 + (A^T P A + Q)^{-1} A^T P (b - A x_0).
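A direct translation of this closed form into NumPy might look as follows; the function name and the reduction noted in the comment are illustrative, not from the original article.

import numpy as np

def generalized_tikhonov(A, b, P, Q, x0):
    """Minimize ||Ax - b||_P^2 + ||x - x0||_Q^2 via its closed form."""
    lhs = A.T @ P @ A + Q
    rhs = A.T @ P @ (b - A @ x0)
    return x0 + np.linalg.solve(lhs, rhs)

# With P = I, Q = alpha**2 * I, and x0 = 0, this reduces to the ordinary
# Tikhonov solution (A^T A + alpha^2 I)^{-1} A^T b.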

Regularization in Hilbert space

Typically discrete linear ill-conditioned problems result from the discretization of integral equations, and one can formulate a Tikhonov regularization in the original infinite-dimensional context. In the above we can interpret A as a compact operator on Hilbert spaces, and x and b as elements in the domain and range of A. The operator A^* A + \Gamma^T \Gamma is then a self-adjoint bounded invertible operator.

Relation to singular value decomposition and Wiener filter

With \Gamma = \alpha I, this least-squares solution can be analyzed in a special way via the singular value decomposition. Given the singular value decomposition of A,

A = U \Sigma V^T,

with singular values \sigma_i, the Tikhonov-regularized solution can be expressed as

\hat{x} = V D U^T b,

where D has diagonal values

D_{ii} = \frac{\sigma_i}{\sigma_i^2 + \alpha^2}

and is zero elsewhere. This demonstrates the effect of the Tikhonov parameter on the condition number of the regularized problem. For the generalized case, a similar representation can be derived using a generalized singular value decomposition. Finally, it is related to the Wiener filter:

\hat{x} = \sum_{i=1}^q f_i \frac{u_i^T b}{\sigma_i} v_i,

where the Wiener weights are f_i = \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2} and q is the rank of A.
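The filter-factor view translates directly into code. This sketch (again with the illustrative choice \Gamma = \alpha I; all names and test data are assumptions) computes the regularized solution through the SVD and checks it against the normal-equations formula given earlier:

import numpy as np

def tikhonov_svd(A, b, alpha):
    """Tikhonov solution x_hat = V D U^T b with D_ii = s_i / (s_i^2 + alpha^2)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    d = s / (s**2 + alpha**2)
    return Vt.T @ (d * (U.T @ b))

# Consistency check against (A^T A + alpha^2 I)^{-1} A^T b.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
b = rng.standard_normal(8)
x_svd = tikhonov_svd(A, b, 0.1)
x_ne = np.linalg.solve(A.T @ A + 0.1**2 * np.eye(5), A.T @ b)
assert np.allclose(x_svd, x_ne)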

Determination of the Tikhonov factor

The optimal regularization parameter \alpha is usually unknown and often in practical problems is determined by an ad hoc method. A possible approach relies on the Bayesian interpretation described above. Other approaches include the discrepancy principle, cross-validation, the L-curve method, restricted maximum likelihood, and the unbiased predictive risk estimator. Grace Wahba proved that the optimal parameter, in the sense of leave-one-out cross-validation, minimizes

G = \frac{\operatorname{RSS}}{\tau^2} = \frac{\left \| X \hat{\beta} - y \right \|^2}{\left[ \operatorname{Tr} \left( I - X (X^T X + \alpha^2 I)^{-1} X^T \right) \right]^2},

where \operatorname{RSS} is the residual sum of squares and \tau is the effective number of degrees of freedom. Using the SVD above, this expression simplifies to

\operatorname{RSS} = \left \| y - \sum_{i=1}^q (u_i' b) u_i \right \|^2 + \left \| \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2} (u_i' b) u_i \right \|^2 = \operatorname{RSS}_0 + \left \| \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2} (u_i' b) u_i \right \|^2

and

\tau = m - \sum_{i=1}^q \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2}.
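The SVD-based simplification makes G cheap to evaluate over a grid of candidate values of \alpha. The following sketch (random test data and a simple grid search, both illustrative assumptions) shows one way this criterion can be used in practice:

import numpy as np

def gcv_score(A, b, alpha):
    """Wahba's G(alpha) = RSS / tau^2, evaluated via the SVD of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    ub = U.T @ b
    rss0 = b @ b - ub @ ub                      # residual outside range(A)
    rss = rss0 + np.sum((alpha**2 / (s**2 + alpha**2) * ub) ** 2)
    tau = len(b) - np.sum(s**2 / (s**2 + alpha**2))
    return rss / tau**2

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(30)
alphas = np.logspace(-4, 2, 200)
alpha_best = min(alphas, key=lambda a: gcv_score(A, b, a))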

Relation to probabilistic formulation

The probabilistic formulation of an inverse problem introduces (when all uncertainties are Gaussian) a covariance matrix C_M representing the a priori uncertainties on the model parameters, and a covariance matrix C_D representing the uncertainties on the observed parameters (see, for instance, Tarantola, 2004, http://www.ipgp.jussieu.fr/~tarantola/Files/Professional/SIAM/index.html). In the special case when these two matrices are diagonal and isotropic, C_M = \sigma_M^2 I and C_D = \sigma_D^2 I, the equations of inverse theory reduce to the equations above, with \alpha = \sigma_D / \sigma_M.
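Under these isotropic assumptions the reduction can be verified numerically. This sketch (all data illustrative) checks that the inverse-theory estimate with C_D = \sigma_D^2 I and C_M = \sigma_M^2 I matches ordinary Tikhonov regularization with \alpha = \sigma_D / \sigma_M:

import numpy as np

rng = np.random.default_rng(3)
m, n = 12, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
sigma_D, sigma_M = 0.5, 2.0

# Inverse-theory estimate: minimize ||Ax - b||^2/sigma_D^2 + ||x||^2/sigma_M^2.
P = np.eye(m) / sigma_D**2   # inverse data covariance C_D^{-1}
Q = np.eye(n) / sigma_M**2   # inverse model covariance C_M^{-1}
x_prob = np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ b)

# Ordinary Tikhonov estimate with alpha = sigma_D / sigma_M.
alpha = sigma_D / sigma_M
x_tik = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ b)
assert np.allclose(x_prob, x_tik)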

History

Tikhonov regularization was invented independently in many different contexts. It became widely known from its application to integral equations in the work of Andrey Tikhonov and David L. Phillips. Some authors use the term Tikhonov-Phillips regularization. The finite-dimensional case was expounded by Arthur E. Hoerl, who took a statistical approach, and by Manus Foster, who interpreted this method as a Wiener-Kolmogorov filter. Following Hoerl, it is known in the statistical literature as ridge regression.

See also

  • LASSO estimator, another regularization method in statistics.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 