In statistics and optimization, statistical errors and residuals are two closely related and easily confused measures of the deviation of a sample from its "theoretical value". The error of a sample is the deviation of the sample from the (unobservable) true function value, while the residual of a sample is the difference between the sample and the estimated function value.
The distinction is most important in regression analysis, where it leads to the concept of studentized residuals.
Introduction
Suppose there is a series of observations from a univariate distribution and we want to estimate the mean of that distribution (the so-called location model). In this case the errors are the deviations of the observations from the population mean, while the residuals are the deviations of the observations from the sample mean.
A statistical error is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.
A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men's heights and suppose we have a random sample of n men. The sample mean could serve as a good estimator of the population mean. Then we have:
- The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas
- The difference between the height of each man in the sample and the observable sample mean is a residual.
Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The statistical errors, on the other hand, are independent, and their sum within the random sample is almost surely not zero.
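The contrast can be sketched with a short simulation. This is only an illustration: the population standard deviation of 0.07 meters, the sample size, and the helper name errors_and_residuals are assumptions made for the example, and in practice the population mean would not be known.

```python
import random
import statistics


def errors_and_residuals(sample, population_mean):
    """Return (errors, residuals) for a sample under the location model."""
    sample_mean = statistics.fmean(sample)
    # Errors use the (normally unobservable) population mean.
    errors = [x - population_mean for x in sample]
    # Residuals use the observable sample mean.
    residuals = [x - sample_mean for x in sample]
    return errors, residuals


rng = random.Random(0)
POP_MEAN = 1.75   # population mean height in meters (from the example)
POP_SD = 0.07     # assumed value, not given in the text
heights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(10)]

errors, residuals = errors_and_residuals(heights, POP_MEAN)
print("sum of residuals:", sum(residuals))  # zero up to rounding
print("sum of errors:   ", sum(errors))     # generally not zero
```

The sum of the errors equals n times the gap between the sample mean and the population mean, which is why it vanishes only by coincidence.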
One can standardize statistical errors (especially of a normal distribution) in a z-score (or "standard score"), and standardize residuals in a t-statistic, or more generally studentized residuals.
Example with some mathematical theory
If we assume a normally distributed population with mean μ and standard deviation σ, and choose individuals independently, then we have
$$X_1, \dots, X_n \sim N(\mu, \sigma^2)$$
and the sample mean
$$\overline{X} = \frac{X_1 + \cdots + X_n}{n}$$
is a random variable distributed thus:
$$\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
The statistical errors are then
$$\varepsilon_i = X_i - \mu,$$
whereas the residuals are
$$\widehat{\varepsilon}_i = X_i - \overline{X}.$$
(As is often done, the "hat" over the letter ε indicates an observable estimate of an unobservable quantity called ε.)
The sum of squares of the statistical errors, divided by σ², has a chi-squared distribution with n degrees of freedom:
$$\frac{1}{\sigma^2} \sum_{i=1}^{n} \varepsilon_i^2 \sim \chi^2_n.$$
This quantity, however, is not observable. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ² has a chi-squared distribution with only n − 1 degrees of freedom:
$$\frac{1}{\sigma^2} \sum_{i=1}^{n} \widehat{\varepsilon}_i^{\,2} \sim \chi^2_{n-1}.$$
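The loss of one degree of freedom can be checked empirically, since a chi-squared variable with k degrees of freedom has mean k. The following is a rough sketch, not a proof; the values of n, μ, σ, the number of trials, and the function name are arbitrary choices for illustration.

```python
import random
import statistics


def scaled_sums_of_squares(n, mu, sigma, rng):
    """Draw one normal sample; return the scaled sums of squares of errors and residuals."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(xs)
    ss_errors = sum((x - mu) ** 2 for x in xs) / sigma ** 2       # uses the true mean
    ss_residuals = sum((x - xbar) ** 2 for x in xs) / sigma ** 2  # uses the sample mean
    return ss_errors, ss_residuals


rng = random.Random(42)
N, MU, SIGMA, TRIALS = 5, 1.75, 0.07, 20000  # illustrative values

draws = [scaled_sums_of_squares(N, MU, SIGMA, rng) for _ in range(TRIALS)]
mean_ss_errors = statistics.fmean(d[0] for d in draws)
mean_ss_residuals = statistics.fmean(d[1] for d in draws)

print("average scaled SS of errors:   ", mean_ss_errors)     # close to n
print("average scaled SS of residuals:", mean_ss_residuals)  # close to n - 1
```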
It is remarkable that the sum of squares of the residuals and the sample mean can be shown to be independent of each other. That fact and the normal and chi-squared distributions given above form the basis of calculations involving the quotient
$$T = \frac{\overline{X} - \mu}{S / \sqrt{n}}, \qquad \text{where } S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$$
is the sample variance, that is, the sum of squares of the residuals divided by n − 1.
The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means that even though we do not know σ, we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval for μ.
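The cancellation of σ can be illustrated numerically: if the same standard normal draws are rescaled by two different values of σ, the quotient comes out the same. This is a minimal sketch; the function name t_quotient and the particular values of μ and σ are assumptions made for the example.

```python
import math
import random
import statistics


def t_quotient(sample, mu):
    """Compute (sample mean - mu) / (S / sqrt(n)), with S the sample standard deviation."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (xbar - mu) / (s / math.sqrt(n))


rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(10)]  # shared standard normal draws
MU = 1.75

# Same underlying draws, two very different population standard deviations.
t_small_sigma = t_quotient([MU + 0.05 * zi for zi in z], MU)
t_large_sigma = t_quotient([MU + 5.0 * zi for zi in z], MU)

print(t_small_sigma, t_large_sigma)  # equal up to floating-point rounding
```

Because the quotient does not change with σ, its distribution cannot depend on σ either, which is what makes it usable when σ is unknown.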
Regressions
In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals.
Given an unobservable function that relates the independent variable to the dependent variable (say, a line), the deviations of the dependent variable observations from this function are the errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals.
However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain: linear regressions fit endpoints better than the middle. This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.
Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers: a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.
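The difference in variability can be made concrete for simple linear regression with an intercept. A standard textbook result, not stated in this article, is that the residual at input x_i has variance σ²(1 − h_i), where the leverage is h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)². The sketch below assumes that formula and uses made-up, equally spaced inputs.

```python
def leverages(xs):
    """Leverage h_i of each input in a simple linear regression with intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]


xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative inputs
h = leverages(xs)

# Residual variance is sigma^2 * (1 - h_i): largest in the middle, smallest at the ends.
variance_factor = [1 - hi for hi in h]

for x, hi, vf in zip(xs, h, variance_factor):
    print(f"x={x:.1f}  leverage={hi:.2f}  residual variance / sigma^2={vf:.2f}")
```

Studentizing a residual then amounts to dividing it by an estimate of σ multiplied by the square root of its variance factor, so that residuals at different inputs become comparable.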
Stochastic error
The stochastic error in a measurement is the error that is random from one measurement to the next. Stochastic errors tend to be Gaussian, or normal, in their distribution. That is because the stochastic error is most often the sum of many random errors, and when many random errors are added together, the distribution of their sum looks approximately Gaussian, as shown by the central limit theorem.
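As a rough illustration of this effect, one can build each stochastic error as the sum of several small uniform errors and check that about 68% of the results fall within one standard deviation of zero, as they would for a normal distribution. The number of components (12), the uniform range, and the function name are arbitrary choices for this sketch.

```python
import random
import statistics


def stochastic_error(rng, parts=12):
    """One stochastic error modeled as a sum of small independent uniform errors."""
    return sum(rng.uniform(-0.5, 0.5) for _ in range(parts))


rng = random.Random(7)
simulated = [stochastic_error(rng) for _ in range(20000)]

sd = statistics.pstdev(simulated)
share_within_one_sd = sum(abs(e) <= sd for e in simulated) / len(simulated)

print("standard deviation:", sd)                   # near 1 for 12 such uniforms
print("share within one sd:", share_within_one_sd)  # near 0.68, as for a normal
```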
A stochastic error term is added to a regression equation to account for all of the variation in Y that cannot be explained by the included Xs. It is, in effect, a symbol of the inability to model all the movements of the dependent variable.
Alternative uses of "error" in statistics
The use of the term "error" as discussed in the sections above is in the sense of a deviation of a value from a hypothetical unobserved value. At least two other uses also occur in statistics, both referring to observable prediction errors:
- Mean square error or mean squared error (abbreviated MSE) and root mean square error (RMSE) refer to the amount by which the values predicted by an estimator differ from the quantities being estimated (typically outside the sample from which the model was estimated).
- Sum of squared errors, typically abbreviated SSE or SS_e, refers to the residual sum of squares (the sum of squared residuals) of a regression; this is the sum of the squares of the deviations of the actual values from the predicted values, within the sample used for estimation. Likewise, the sum of absolute errors (SAE) refers to the sum of the absolute values of the residuals, which is minimized in the least absolute deviations approach to regression.
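These quantities are simple to compute once actual and predicted values are available. The numbers below are made up for illustration, and the sketch applies all four measures to the same small set of pairs, whereas MSE and RMSE are typically evaluated on data outside the estimation sample.

```python
import math

actual = [3.0, 5.0, 7.0, 10.0]    # illustrative observed values
predicted = [2.5, 5.5, 7.0, 9.0]  # illustrative predictions from some model

deviations = [a - p for a, p in zip(actual, predicted)]

sse = sum(d ** 2 for d in deviations)   # sum of squared errors
sae = sum(abs(d) for d in deviations)   # sum of absolute errors
mse = sse / len(deviations)             # mean squared error
rmse = math.sqrt(mse)                   # root mean square error

print(f"SSE={sse}  SAE={sae}  MSE={mse}  RMSE={rmse:.4f}")
```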
See also
- Absolute deviation
- Consensus forecasts
- Deviation (statistics)
- Explained sum of squares
- Lack-of-fit sum of squares
- Error detection and correction
- Innovation (signal processing)
- Innovations vector
- Margin of error
- Mean absolute error
- Propagation of error
- Root mean square deviation
- Sampling error
- Studentized residual
- Type I and type II errors
External links