Errors and residuals in statistics - AbsoluteAstronomy.com

Statistics

Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

and optimization

Optimization (mathematics)

In mathematics, computational science, or management science, mathematical optimization refers to the selection of a best element from some set of available alternatives....

, statistical errors and residuals are two closely related and easily confused measures of the deviation

Deviation (statistics)

In mathematics and statistics, deviation is a measure of difference for interval and ratio variables between the observed value and the mean. The sign of deviation , reports the direction of that difference...

of a sample from its "theoretical value". The error of a sample is the deviation of the sample from the (unobservable) true function value, while the residual of a sample is the difference between the sample and the estimated function value.

The distinction is most important in regression analysis

Regression analysis

In statistics, regression analysis includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables...

, where it leads to the concept of studentized residual

Studentized residual

In statistics, a studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. Typically the standard deviations of residuals in a sample vary greatly from one data point to another even when the errors all have the same standard...

Introduction

Suppose there is a series of observations from a univariate distribution

Univariate distribution

In statistics, a univariate distribution is a probability distribution of only one random variable. This is in contrast to a multivariate distribution, the probability distribution of a random vector.-Further reading:...

and we want to estimate the mean

Mean

In statistics, mean has two related meanings:* the arithmetic mean .* the expected value of a random variable, which is also called the population mean....

of that distribution (the so-called location model). In this case the errors are the deviations of the observations from the population mean, while the residuals are the deviations of the observations from the sample mean.

A statistical error is the amount by which an observation differs from its expected value

Expected value

In probability theory, the expected value of a random variable is the weighted average of all possible values that this random variable can take on...

, the latter being based on the whole population

Statistical population

A statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest...

from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean

Arithmetic mean

In mathematics and statistics, the arithmetic mean, often referred to as simply the mean or average when the context is clear, is a method to derive the central tendency of a sample space...

of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.

A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men's heights and suppose we have a random sample of n people. The sample mean could serve as a good estimator of the population mean. Then we have:

The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas
The difference between the height of each man in the sample and the observable sample mean is a residual.

Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent
Statistical independence
In probability theory, to say that two events are independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other occurs...

. The statistical errors on the other hand are independent, and their sum within the random sample is almost surely

Almost surely

In probability theory, one says that an event happens almost surely if it happens with probability one. The concept is analogous to the concept of "almost everywhere" in measure theory...

not zero.

One can standardize statistical errors (especially of a normal distribution) in a z-score (or "standard score"), and standardize residuals in a t-statistic, or more generally studentized residuals.

Example with some mathematical theory

If we assume a normally distributed population with mean μ and standard deviation

Standard deviation

Standard deviation is a widely used measure of variability or diversity used in statistics and probability theory. It shows how much variation or "dispersion" there is from the average...

σ, and choose individuals independently, then we have

and the sample mean

Arithmetic mean

In mathematics and statistics, the arithmetic mean, often referred to as simply the mean or average when the context is clear, is a method to derive the central tendency of a sample space...

is a random variable distributed thus:

The statistical errors are then

whereas the residuals are

(As is often done, the "hat

Caret

Caret usually refers to the spacing symbol ^ in ASCII and other character sets. In Unicode, however, the corresponding character is , whereas the Unicode character named caret is actually a similar but lowered symbol: ....

" over the letter ε indicates an observable estimate of an unobservable quantity called ε.)

The sum of squares of the statistical errors, divided by σ², has a chi-squared distribution with n degrees of freedom

Degrees of freedom (statistics)

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.Estimates of statistical parameters can be based upon different amounts of information or data. The number of independent pieces of information that go into the...

This quantity, however, is not observable. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ² has a chi-squared distribution with only n − 1 degrees of freedom:

It is remarkable that the sum of squares of the residuals

Squared deviations

In probability theory and statistics, the definition of variance is either the expected value , or average value , of squared deviations from the mean. Computations for analysis of variance involve the partitioning of a sum of squared deviations...

and the sample mean can be shown to be independent of each other. That fact and the normal and chi-squared distributions given above form the basis of calculations involving the quotient

The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means that even though we do not know σ, we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval

Confidence interval

In statistics, a confidence interval is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval , in principle different from sample to sample, that frequently includes the parameter of interest, if the...

for μ.

Regressions

In regression analysis

Regression analysis

, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residual

Studentized residual

s.

Given an unobservable function that relates the independent variable to the dependent variable – say, a line – the deviations of the dependent variable observations from this function are the errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals.

However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression

Linear regression

In statistics, linear regression is an approach to modeling the relationship between a scalar variable y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple regression...

where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain: linear regressions fit endpoints better than the middle.
This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.

Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers: a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.

Stochastic error

The stochastic error in a measurement is the error that is random from one measurement to the next. Stochastic errors tend to be gaussian

GAUSSIAN

Gaussian is a computational chemistry software program initially released in 1970 by John Pople and his research group at Carnegie-Mellon University as Gaussian 70. It has been continuously updated since then...

, or normal, in their distribution. That's because the stochastic error is most often the sum of many random errors, and when we add many random errors together, the distribution of their sum looks gaussian, as shown by the Central Limit Theorem

Central limit theorem

In probability theory, the central limit theorem states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed. The central limit theorem has a number of variants. In its common...

.
A stochastic error is added to a regression equation to introduce all of the variation in Y that cannot be explained by the included Xs. It is, in effect, a symbol of the inability to model all the movements of the dependent variable.

Alternative uses of "error" in statistics

The use of the term "error" as discussed in the sections above is in the sense of a deviation of a value from a hypothetical unobserved value. At least two other uses also occur in statistics, both referring to observable prediction errors:

Mean square error or mean squared error (abbreviated MSE) and root mean square error (RMSE) refer to the amount by which the values predicted by an estimator differ from the quantities being estimated (typically outside the sample from which the model was estimated).

Sum of squared errors, typically abbreviated SSE or SS_e, refers to the residual sum of squares

Residual sum of squares

In statistics, the residual sum of squares is the sum of squares of residuals. It is also known as the sum of squared residuals or the sum of squared errors of prediction . It is a measure of the discrepancy between the data and an estimation model...

(the sum of squared residuals) of a regression; this is the sum of the squares of the deviations of the actual values from the predicted values, within the sample used for estimation. Likewise, the sum of absolute errors (SAE) refers to the sum of the absolute values of the residuals, which is minimized in the least absolute deviations

Least absolute deviations

Least absolute deviations , also known as Least Absolute Errors , Least Absolute Value , or the L1 norm problem, is a mathematical optimization technique similar to the popular least squares technique that attempts to find a function which closely approximates a set of data...

approach to regression.

External links