Least squares

The method of least squares, or ordinary least squares (OLS), is used to solve overdetermined systems, that is, systems with more equations than unknowns. Least squares is often applied in statistical contexts, particularly regression analysis.

Least squares can be interpreted as a method of fitting data. The best fit in the least-squares sense is the instance of the model for which the sum of squared residuals has its least value, a residual being the difference between an observed value and the value given by the model. The method was first described by Carl Friedrich Gauss around 1795. Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution, and it can also be derived as a method of moments estimator. Regression analysis is available in most statistical software packages.

The discussion is presented in terms of polynomial functions, but any function can be used in least-squares data fitting. For example, a truncated Fourier series is the best approximation to a function, in the least-squares sense, among trigonometric sums of the same order.

History


Context

The method of least squares grew out of the fields of astronomy and geodesy as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was key to enabling ships to sail in open seas, where sailors had previously relied on land sightings to determine the positions of their ships.

The method was the culmination of several advances that took place during the course of the eighteenth century:

  • The combination of different observations taken under the same conditions, as opposed to simply trying one's best to observe and record a single observation accurately. This approach was notably used by Tobias Mayer while studying the librations of the Moon.
  • The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, an idea perhaps first expressed by Roger Cotes.
  • The combination of different observations taken under different conditions, as notably performed by Roger Joseph Boscovich in his work on the shape of the Earth and by Pierre-Simon Laplace in his work explaining the differences in motion of Jupiter and Saturn.
  • The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved, developed by Laplace in his Method of Situation.


The method itself

Carl Friedrich Gauss is credited with developing the fundamentals of least-squares analysis in 1795, at the age of eighteen. Adrien-Marie Legendre was the first to publish the method, however.

An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On January 1, 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the Sun. Based on these data, astronomers wanted to determine the location of Ceres after it emerged from behind the Sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed the Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

Gauss did not publish the method until 1809, when it appeared in volume two of his work on celestial mechanics, Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that, in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.

The idea of least-squares analysis was also independently formulated by the Frenchman Adrien-Marie Legendre in 1805 and the American Robert Adrain in 1808.

Problem statement

The objective consists of adjusting the parameters of a model function so as to best fit a data set. A simple data set consists of n points (data pairs) $(x_i, y_i)$, i = 1, ..., n, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \boldsymbol{\beta})$, where the m adjustable parameters are held in the vector $\boldsymbol{\beta}$. We wish to find those parameter values for which the model "best" fits the data. The least squares method defines "best" as the parameter values for which the sum, S, of squared residuals is a minimum:

$$S = \sum_{i=1}^{n} r_i^2 .$$

A residual is defined as the difference between the observed value of the dependent variable and the value predicted by the model:

$$r_i = y_i - f(x_i, \boldsymbol{\beta}).$$

An example of a model is that of the straight line. Denoting the intercept as $\beta_0$ and the slope as $\beta_1$, the model function is given by

$$f(x, \boldsymbol{\beta}) = \beta_0 + \beta_1 x .$$

See Linear least squares for a fully worked-out example of this model.
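
As a minimal sketch of the quantities defined in this section, the following Python code evaluates the residuals and the sum of squares S for the straight-line model. It assumes NumPy is available; the data values and the function names straight_line and sum_of_squares are made up for illustration and do not come from any particular library.

```python
import numpy as np

def straight_line(x, beta):
    """Straight-line model: intercept beta[0], slope beta[1]."""
    return beta[0] + beta[1] * x

def sum_of_squares(beta, x, y, model=straight_line):
    """S = sum of squared residuals, with residual = observed - predicted."""
    residuals = y - model(x, beta)
    return float(np.sum(residuals ** 2))

# Illustrative data (made up for this sketch), roughly following y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

s_good = sum_of_squares(np.array([1.0, 2.0]), x, y)  # parameters close to the data
s_poor = sum_of_squares(np.array([0.0, 1.0]), x, y)  # parameters far from the data
```

The parameter vector that tracks the data more closely gives the smaller value of S, which is the sense in which least squares ranks candidate fits.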

A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, x and z, say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.

Solving the least squares problem

Least squares problems fall into two categories, linear and non-linear. The linear least squares problem has a closed form solution, but the non-linear problem does not and is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, so the core calculation is similar in both cases.

The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains m parameters there are m gradient equations:

$$\frac{\partial S}{\partial \beta_j} = 2 \sum_{i} r_i \frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \ldots, m,$$

and since $r_i = y_i - f(x_i, \boldsymbol{\beta})$, the gradient equations become

$$-2 \sum_{i} r_i \frac{\partial f(x_i, \boldsymbol{\beta})}{\partial \beta_j} = 0, \quad j = 1, \ldots, m.$$

The gradient equations apply to all least squares problems. Each particular problem requires particular expressions for the model and its partial derivatives.
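
The gradient condition can be checked numerically. The sketch below assumes NumPy and a hypothetical straight-line model with made-up data. It evaluates the gradient of S, whose j-th component is minus twice the sum over i of the residual r_i times the partial derivative of f(x_i, β) with respect to β_j, at a poor parameter choice and at the least squares fit for these data (intercept 1.09, slope 1.94, worked out by hand for this example).

```python
import numpy as np

def gradient_of_S(beta, x, y, model, jacobian):
    """Gradient of S(beta) = sum_i r_i^2, where r_i = y_i - f(x_i, beta).

    Component j is -2 * sum_i r_i * d f(x_i, beta) / d beta_j.
    """
    r = y - model(x, beta)
    return -2.0 * jacobian(x, beta).T @ r

# Hypothetical straight-line model f(x, beta) = beta[0] + beta[1] * x.
def line(x, beta):
    return beta[0] + beta[1] * x

def line_jacobian(x, beta):
    # Columns hold the partial derivatives with respect to beta[0] and beta[1].
    return np.column_stack([np.ones_like(x), x])

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

grad_away = gradient_of_S(np.array([0.0, 1.0]), x, y, line, line_jacobian)
grad_at_fit = gradient_of_S(np.array([1.09, 1.94]), x, y, line, line_jacobian)
```

Away from the fit the gradient is clearly nonzero, while at the fitted parameters both components vanish up to rounding error.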

Linear least squares

A regression model is a linear one when the model comprises a linear combination of the parameters, i.e.

$$f(x_i, \boldsymbol{\beta}) = \sum_{j=1}^{m} \beta_j \phi_j(x_i),$$

where the coefficients, $\phi_j$, are functions of $x_i$.

Letting

$$X_{ij} = \frac{\partial f(x_i, \boldsymbol{\beta})}{\partial \beta_j} = \phi_j(x_i),$$

we can then see that in that case the least squares estimate (or estimator, if we are in the context of a random sample) is given by

$$\hat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y}.$$

For a derivation of this estimate, see Linear least squares.
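
The closed-form linear solution can be sketched in a few lines. The example assumes NumPy and uses made-up data with a straight-line model, so the design matrix X has a column of ones and a column of x values. It solves the normal equations (XᵀX)β = Xᵀy directly and compares the result with NumPy's own least squares routine.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix with X[i, j] = phi_j(x_i); here phi_1(x) = 1 and phi_2(x) = x.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) beta = X^T y without forming an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same sum of squares using an SVD-based solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same intercept and slope here. For ill-conditioned design matrices, a routine such as np.linalg.lstsq is generally the safer choice, because forming XᵀX squares the condition number.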

Non-linear least squares

There is no closed-form solution to a non-linear least squares problem. Instead, numerical algorithms are used to find the values of the parameters which minimize the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:

$$\beta_j^{k+1} = \beta_j^{k} + \Delta\beta_j ,$$

where k is an iteration number and the vector of increments, $\Delta\boldsymbol{\beta}$, is known as the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order Taylor series expansion about $\boldsymbol{\beta}^{k}$:

$$f(x_i, \boldsymbol{\beta}) \approx f(x_i, \boldsymbol{\beta}^{k}) + \sum_{j} \frac{\partial f(x_i, \boldsymbol{\beta}^{k})}{\partial \beta_j} \left(\beta_j - \beta_j^{k}\right) = f(x_i, \boldsymbol{\beta}^{k}) + \sum_{j} J_{ij}\, \Delta\beta_j .$$

The Jacobian, J, is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. The residuals are given by

$$r_i = \Delta y_i - \sum_{l=1}^{m} J_{il}\, \Delta\beta_l , \qquad \Delta y_i = y_i - f(x_i, \boldsymbol{\beta}^{k}),$$

and the gradient equations become

$$-2 \sum_{i=1}^{n} J_{ij} \left( \Delta y_i - \sum_{l=1}^{m} J_{il}\, \Delta\beta_l \right) = 0,$$

which, on rearrangement, become m simultaneous linear equations, the normal equations:

$$\sum_{i=1}^{n} \sum_{l=1}^{m} J_{ij} J_{il}\, \Delta\beta_l = \sum_{i=1}^{n} J_{ij}\, \Delta y_i , \quad j = 1, \ldots, m.$$

The normal equations are written in matrix notation as

$$(J^{\mathsf{T}} J)\, \Delta\boldsymbol{\beta} = J^{\mathsf{T}} \Delta\mathbf{y}.$$

These are the defining equations of the Gauss–Newton algorithm.
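
The iteration can be sketched as follows. This is a bare-bones illustration rather than a production solver: it assumes NumPy, a hypothetical exponential model f(x, β) = β₀·exp(β₁x), noise-free synthetic data, and a starting point fairly close to the solution. At each step it solves (JᵀJ)Δβ = JᵀΔy for the shift vector and adds it to the current parameters; there is no step-size control or damping.

```python
import numpy as np

def model(x, beta):
    """Hypothetical non-linear model: beta[0] * exp(beta[1] * x)."""
    return beta[0] * np.exp(beta[1] * x)

def jacobian(x, beta):
    """J[i, j] = partial derivative of f(x_i, beta) with respect to beta[j]."""
    e = np.exp(beta[1] * x)
    return np.column_stack([e, beta[0] * x * e])

def gauss_newton(x, y, beta0, max_iter=50, tol=1e-12):
    """Plain Gauss-Newton iteration without damping or line search."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        delta_y = y - model(x, beta)          # residuals at the current parameters
        J = jacobian(x, beta)                 # Jacobian changes every iteration
        shift = np.linalg.solve(J.T @ J, J.T @ delta_y)  # normal equations
        beta = beta + shift
        if np.linalg.norm(shift) < tol:       # stop once the shift is negligible
            break
    return beta

# Noise-free synthetic data generated from known parameters (2, -1).
x = np.linspace(0.0, 2.0, 9)
y = model(x, np.array([2.0, -1.0]))

beta_hat = gauss_newton(x, y, beta0=[1.8, -0.9])
```

With a poor starting point the undamped iteration may diverge, which is one reason damped variants such as the Levenberg–Marquardt algorithm are often preferred in practice.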

Differences between linear and non-linear least squares

  • The model function, f, in LLSQ (linear least squares) is a linear combination of parameters of the form $f = X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots$. The model may represent a straight line, a parabola or any other polynomial-type function. In NLLSQ (non-linear least squares) the parameters appear as functions, such as $\beta^2$, $e^{\beta x}$ and so forth. If the derivatives $\partial f / \partial \beta_j$ are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise the model is non-linear.
  • Many solution algorithms for NLLSQ require initial values for the parameters; LLSQ does not.
  • Many solution algorithms for NLLSQ require that the Jacobian be calculated. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain, the partial derivatives must be calculated by numerical approximation.
  • In NLLSQ, non-convergence (failure of the algorithm to find a minimum) is a common phenomenon, whereas the LLSQ objective is globally convex, so non-convergence is not an issue.
  • NLLSQ is usually an iterative process. The iterative process has to be terminated when a convergence criterion is satisfied. LLSQ solutions can be computed using direct methods, although problems with large numbers of parameters are typically solved with iterative methods, such as the Gauss–Seidel method.
  • In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
  • Under the condition that the errors are uncorrelated with the predictor variables, LLSQ yields unbiased estimates, but even under that condition NLLSQ estimates are generally biased.
These differences must be considered whenever the solution to a non-linear least squares problem is being sought.
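
When analytical partial derivatives are not available, the Jacobian can be approximated by finite differences. The sketch below assumes NumPy and uses central differences with a hypothetical step size h; the function name numerical_jacobian and the exponential test model are illustrative choices, not part of any library.

```python
import numpy as np

def numerical_jacobian(model, x, beta, h=1e-6):
    """Approximate J[i, j] = d f(x_i, beta) / d beta_j by central differences."""
    beta = np.asarray(beta, dtype=float)
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        # Perturb one parameter at a time in both directions.
        J[:, j] = (model(x, beta + step) - model(x, beta - step)) / (2.0 * h)
    return J

# Hypothetical model with a known analytical Jacobian, used for comparison.
def exp_model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

x = np.linspace(0.0, 2.0, 5)
beta = np.array([2.0, -1.0])

J_numeric = numerical_jacobian(exp_model, x, beta)
J_exact = np.column_stack([np.exp(beta[1] * x), beta[0] * x * np.exp(beta[1] * x)])
```

The step size trades truncation error against rounding error, so a value that works for one model may need adjusting for another.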

Least squares, regression analysis and statistics

The methods of least squares and regression analysis are conceptually different. However, the method of least squares is often used to generate estimators and other statistics in regression analysis.

Consider a simple example drawn from physics. A spring should obey Hooke's law, which states that the extension of a spring is proportional to the force, F, applied to it. $f(F_i, k) = k F_i$ constitutes the model, where F is the independent variable. To estimate the force constant, k, a series of n measurements with different forces will produce a set of data, $(F_i, y_i)$, i = 1, ..., n, where $y_i$ is a measured spring extension. Each experimental observation will contain some error. If we denote this error $\varepsilon_i$, we may specify an empirical model for our observations,

$$y_i = k F_i + \varepsilon_i .$$



There are many methods we might use to estimate the unknown parameter k. Noting that the n equations, one for each observation in our data, comprise an overdetermined system with one unknown and n equations, we may choose to estimate k using least squares. The sum of squares to be minimized is

$$S = \sum_{i=1}^{n} \left( y_i - k F_i \right)^2 .$$

The least squares estimate of the force constant, k, is given by

$$\hat{k} = \frac{\sum_i F_i y_i}{\sum_i F_i^2} .$$

Here it is assumed that applying the force causes the spring to extend. Once the force constant has been derived by least squares fitting, the extension can be predicted from Hooke's law.
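
For the spring example, the one-parameter estimate is the sum of the products F_i·y_i divided by the sum of the squared forces. A minimal sketch, assuming NumPy and made-up measurements in arbitrary units:

```python
import numpy as np

# Made-up measurements for illustration (arbitrary units).
F = np.array([1.0, 2.0, 3.0, 4.0])        # applied forces
y = np.array([0.52, 0.98, 1.55, 1.97])    # measured extensions

# Closed-form least squares estimate for the model y = k * F (no intercept).
k_hat = np.sum(F * y) / np.sum(F ** 2)

# The same estimate from a general solver, treating F as a one-column design matrix.
k_lstsq = np.linalg.lstsq(F[:, None], y, rcond=None)[0][0]

# Predicted extension for a new, hypothetical force of 5 units.
y_pred = k_hat * 5.0
```

The general solver reproduces the closed-form value, which illustrates that the spring problem is simply linear least squares with a single parameter.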

In regression analysis the researcher specifies an empirical model. For example, a very common model is the straight-line model, which is used to test whether there is a linear relationship between the dependent and independent variables. If a linear relationship is found to exist, the variables are said to be correlated. However, correlation does not prove causation: both variables may be correlated with other, hidden variables, the dependent variable may "reverse" cause the independent variables, or the variables may be otherwise spuriously correlated. For example, suppose there is a correlation between deaths by drowning and the volume of ice cream sales at a particular beach. Both the number of people going swimming and the volume of ice cream sales increase as the weather gets hotter, and presumably the number of deaths by drowning is correlated with the number of people going swimming. Ice cream sales and drownings would then be correlated without either causing the other.

In order to make statistical tests on the results it is necessary to make assumptions about the nature of the experimental errors. A common (but not necessary) assumption is that the errors belong to a normal distribution. The central limit theorem supports the idea that this is a good assumption in many cases.
  • The Gauss–Markov theorem. In a linear model in which the errors have expectation zero conditional on the independent variables, are uncorrelated and have equal variances, the best linear unbiased estimator of any linear combination of the observations is its least-squares estimator. "Best" means that the least squares estimators of the parameters have minimum variance. The assumption of equal variance is valid when the errors all belong to the same distribution.
  • In a linear model, if the errors belong to a normal distribution, the least squares estimators are also the maximum likelihood estimators.


However, if the errors are not normally distributed, a central limit theorem often nonetheless implies that the parameter estimates will be approximately normally distributed so long as the sample is reasonably large. For this reason, given the important property that the error is mean independent of the independent variables, the distribution of the error term is not an important issue in regression analysis. Specifically, it is not typically important whether the error term follows a normal distribution.

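In a least squares calculation with unit weights, or in linear regression, the variance on the jth parameter, denoted $\operatorname{var}(\hat{\beta}_j)$, is usually estimated with

$$\operatorname{var}(\hat{\beta}_j) = \sigma^2 \left( [X^{\mathsf{T}} X]^{-1} \right)_{jj} \approx \frac{S}{n - m} \left( [X^{\mathsf{T}} X]^{-1} \right)_{jj} ,$$

where the true error variance $\sigma^2$ is replaced by an estimate based on the minimized value of the sum of squares, S.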
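
A minimal sketch of this variance estimate, assuming NumPy, unit weights, a straight-line model and made-up data: the error variance is estimated as S/(n − m), where S is the minimized sum of squares, n the number of observations and m the number of parameters, and is then multiplied by the diagonal of the inverse of XᵀX.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])   # straight-line design matrix

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
S = residuals @ residuals                   # minimized sum of squares

n, m = X.shape                              # n observations, m parameters
sigma2_hat = S / (n - m)                    # estimate of the error variance

# Variance of each parameter: sigma^2 times the diagonal of (X^T X)^{-1}.
var_beta = sigma2_hat * np.diag(np.linalg.inv(X.T @ X))
std_err = np.sqrt(var_beta)                 # standard errors of the parameters
```

The square roots of these variances are the standard errors commonly reported alongside fitted parameters.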

Confidence limits can be found if the probability distribution of the parameters is known, or an asymptotic approximation is made, or assumed. Likewise, statistical tests on the residuals can be made if the probability distribution of the residuals is known or assumed. The probability distribution of any linear combination of the dependent variables can be derived if the probability distribution of experimental errors is known or assumed. Inference is particularly straightforward if the errors are assumed to follow a normal distribution, which implies that the parameter estimates and residuals will also be normally distributed conditional on the values of the independent variables.

Weighted least squares

See also: Weighted mean

The expressions given above are based on the implicit assumption that the errors are uncorrelated with each other and with the independent variables and have equal variance. The Gauss–Markov theorem shows that, when this is so, $\hat{\boldsymbol{\beta}}$ is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, $\hat{\boldsymbol{\beta}}$ is BLUE if each weight is equal to the reciprocal of the variance of the measurement:

$$S = \sum_{i=1}^{n} W_{ii}\, r_i^2 , \qquad W_{ii} = \frac{1}{\sigma_i^2} .$$

The gradient equations for this sum of squares are

$$-2 \sum_{i} W_{ii} \frac{\partial f(x_i, \boldsymbol{\beta})}{\partial \beta_j}\, r_i = 0, \quad j = 1, \ldots, m,$$

which, in a linear least squares system, give the modified normal equations

$$\sum_{i=1}^{n} \sum_{k=1}^{m} X_{ij} W_{ii} X_{ik}\, \hat{\beta}_k = \sum_{i=1}^{n} X_{ij} W_{ii}\, y_i , \quad j = 1, \ldots, m,$$

or

$$(X^{\mathsf{T}} W X)\, \hat{\boldsymbol{\beta}} = X^{\mathsf{T}} W \mathbf{y}.$$

When the observational errors are uncorrelated the weight matrix, W, is diagonal. If the errors are correlated, the resulting estimator is BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

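When the errors are uncorrelated, it is convenient to simplify the calculations by factoring the weight matrix as $w_{ii} = \sqrt{W_{ii}}$. The normal equations can then be written as

$$(X'^{\mathsf{T}} X')\, \hat{\boldsymbol{\beta}} = X'^{\mathsf{T}} \mathbf{y}' ,$$

where

$$X' = wX, \qquad \mathbf{y}' = w\mathbf{y}.$$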



For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:

$$(J^{\mathsf{T}} W J)\, \Delta\boldsymbol{\beta} = J^{\mathsf{T}} W\, \Delta\mathbf{y}.$$

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used.
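
A weighted linear fit can be sketched in two equivalent ways. The example assumes NumPy, made-up data, and hypothetical measurement standard deviations sigma that are treated as known. The first route solves the weighted normal equations (XᵀWX)β = XᵀWy with a diagonal W whose entries are 1/σ_i². The second scales each row of X and y by the square root of its weight and then solves an ordinary least squares problem.

```python
import numpy as np

# Made-up data and hypothetical measurement standard deviations.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
sigma = np.array([0.1, 0.1, 0.5, 0.5])

X = np.column_stack([np.ones_like(x), x])   # straight-line design matrix

# Route 1: weighted normal equations with W_ii = 1 / sigma_i^2.
W = np.diag(1.0 / sigma ** 2)
beta_weighted = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Route 2: scale rows by w_ii = sqrt(W_ii) = 1 / sigma_i, then solve unweighted.
w = 1.0 / sigma
beta_scaled, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
```

The two routes agree, and the row-scaling form avoids building the full n-by-n weight matrix, which matters when there are many observations.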

Lasso method

In some contexts a regularized version of the least squares solution may be preferable. The LASSO algorithm, for example, finds a least-squares solution with the constraint that $\|\boldsymbol{\beta}\|_1$, the L1-norm of the parameter vector, is no greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha \|\boldsymbol{\beta}\|_1$ added, where $\alpha$ is a constant. (This is the Lagrangian form of the constrained problem.) This problem may be solved using quadratic programming or more general convex optimization methods. The L1-regularized formulation is useful in some contexts due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution is dependent.
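
One simple convex-optimization approach to the unconstrained form is iterative soft-thresholding (a proximal-gradient method, often called ISTA). The sketch below is not the quadratic programming route mentioned above. It assumes NumPy, a fixed number of iterations, synthetic data generated from a sparse parameter vector, and the objective ‖y − Xβ‖² + α‖β‖₁. The function names soft_threshold and lasso_ista are illustrative only.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each entry of z toward zero by threshold, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def lasso_ista(X, y, alpha, n_iter=2000):
    """Minimize ||y - X beta||^2 + alpha * ||beta||_1 by iterative soft-thresholding."""
    beta = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * lambda_max(X^T X) is the Lipschitz
    # constant of the gradient of the squared-error term.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    for _ in range(n_iter):
        gradient = -2.0 * X.T @ (y - X @ beta)   # gradient of the squared-error term
        beta = soft_threshold(beta - step * gradient, step * alpha)
    return beta

# Synthetic data from a sparse parameter vector (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_lasso = lasso_ista(X, y, alpha=5.0)
```

Larger values of α shrink more parameters exactly to zero, while α = 0 reduces the iteration to plain gradient descent on the ordinary least squares objective.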

See also

  • L2 norm
  • Least absolute deviation
  • Measurement uncertainty
  • Root mean square
  • Squared deviations
  • Iteratively re-weighted least squares
  • Total least squares
  • Levenberg–Marquardt algorithm
  • Regression analysis
  • Partial least squares regression

