Errors and residuals in statistics



 
 








In statistics and optimization, statistical errors and residuals are two closely related and easily confused measures of the "deviation of an observation from the mean": the error of an observation is its deviation from the (unobservable) population mean or actual function, while the residual of an observation is the difference between the observation and the (observed) sample mean or regressed (fitted) function. The distinction is most important in regression analysis, where the subtle behavior of residuals leads to the concept of studentized residuals.

Univariate explanation

For a univariate distribution, the distinction between errors and residuals is just the difference between deviations from the population mean and deviations from the sample mean.

A statistical error is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. The expected value, being for instance the mean of the entire population, is typically unobservable. If the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The nomenclature arose from random measurement errors in astronomy. It is as if the measurement of the man's height were an attempt to measure the population mean, so that any difference between the man's height and the mean would be a measurement error.

A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. The simplest case involves a random sample of n men whose heights are measured. The sample mean is used as an estimate of the population mean. Then we have:

  • The difference between the height of each man in the sample and the unobservable population mean is a statistical error, and
  • The difference between the height of each man in the sample and the observable sample mean is a residual.


Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The sum of the statistical errors within a random sample need not be zero; the statistical errors are independent random variables if the individuals are chosen from the population independently.

In sum:
  • Residuals are observable; statistical errors are not.
  • Statistical errors are often independent of each other; residuals are not (at least in the simple situation described above, and in most others).
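
The height example can be sketched in a few lines of Python. This is only an illustration: the four heights are made-up values, and the population mean of 1.75 meters is treated as known purely so that the errors can be computed, which is not possible in practice.

```python
# Hypothetical data: four measured heights in meters (illustrative values only).
heights = [1.80, 1.70, 1.78, 1.66]

# Assumption: the population mean is known here only so the errors can be shown.
POPULATION_MEAN = 1.75

# The sample mean is observable and serves as the estimate of the population mean.
sample_mean = sum(heights) / len(heights)

# Statistical errors: deviations from the (normally unobservable) population mean.
errors = [h - POPULATION_MEAN for h in heights]

# Residuals: deviations from the observable sample mean.
residuals = [h - sample_mean for h in heights]

print("sum of errors:   ", sum(errors))     # need not be zero
print("sum of residuals:", sum(residuals))  # zero up to floating-point rounding
```

Because the residuals are forced to sum to zero, knowing any n − 1 of them determines the last one, which is why they cannot be independent.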


Example with some mathematical theory


If we assume a normally distributed population with mean μ and standard deviation σ, and choose individuals independently, then we have

$$X_1, \dots, X_n \sim N(\mu, \sigma^2)$$

and the sample mean

$$\overline{X} = \frac{X_1 + \cdots + X_n}{n}$$

is a random variable distributed thus:

$$\overline{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

The statistical errors are then

$$e_i = X_i - \mu,$$

whereas the residuals are

$$\widehat{e}_i = X_i - \overline{X}.$$

(As is often done, the "hat" over the letter e indicates an observable estimate of an unobservable quantity called e.)

The sum of squares of the statistical errors, divided by σ², has a chi-square distribution with n degrees of freedom:

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} e_i^2 \sim \chi^2_n.$$

This quantity, however, is not observable. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ² has a chi-square distribution with only n − 1 degrees of freedom:

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} \widehat{e}_i^{\,2} \sim \chi^2_{n-1}.$$
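
The loss of one degree of freedom can be checked numerically. The sketch below is a rough Monte Carlo illustration, not a proof: it relies on the fact that a chi-square variable with k degrees of freedom has mean k, and the function name, sample size, parameter values and seed are all arbitrary choices made for this example.

```python
import random


def mean_scaled_sums_of_squares(n, mu, sigma, trials, seed=0):
    """Average, over many simulated samples, of the scaled sums of squares
    of the errors and of the residuals (hypothetical helper for illustration)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total_errors = 0.0
    total_residuals = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(xs) / n
        # Errors use the true mean mu; residuals use the sample mean x_bar.
        total_errors += sum((x - mu) ** 2 for x in xs) / sigma ** 2
        total_residuals += sum((x - x_bar) ** 2 for x in xs) / sigma ** 2
    return total_errors / trials, total_residuals / trials


avg_errors, avg_residuals = mean_scaled_sums_of_squares(
    n=5, mu=1.75, sigma=0.07, trials=20000
)
# The two averages should be close to n = 5 and n - 1 = 4 respectively.
print(avg_errors, avg_residuals)
```

In every simulated sample the residual sum of squares is at most the error sum of squares, since the two differ by n times the squared deviation of the sample mean from μ.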

It is remarkable that the sum of squares of the residuals and the sample mean can be shown to be independent of each other. That fact and the normal and chi-square distributions given above form the basis of calculations involving the quotient

$$\frac{\overline{X} - \mu}{S_n / \sqrt{n}},$$

where $S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2$ is the sample variance.

The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval for μ.
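
The cancellation of σ can be seen in a small sketch. Two samples are built from the same standard normal draws but with very different values of σ, and the quotient comes out the same for both. The function names are hypothetical, the parameter values are arbitrary, and the critical value 2.776 (the 97.5th percentile of the t-distribution with 4 degrees of freedom) is taken from standard tables rather than computed.

```python
import math
import random


def t_statistic(sample, mu):
    """Quotient (sample mean - mu) / (S_n / sqrt(n)) for a given mu."""
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # sample variance
    return (x_bar - mu) / math.sqrt(s2 / n)


def confidence_interval(sample, t_crit):
    """Interval for mu: sample mean plus or minus t_crit * S_n / sqrt(n)."""
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
    half_width = t_crit * math.sqrt(s2 / n)
    return x_bar - half_width, x_bar + half_width


rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(5)]  # shared standard normal draws

MU = 1.75  # assumed population mean, for illustration only
sample_small_sigma = [MU + 0.07 * zi for zi in z]
sample_large_sigma = [MU + 7.0 * zi for zi in z]

t_small = t_statistic(sample_small_sigma, MU)
t_large = t_statistic(sample_large_sigma, MU)
print(t_small, t_large)  # agree up to rounding: sigma has cancelled

T_CRIT_95_DF4 = 2.776  # tabulated value for a 95% interval with n - 1 = 4
low, high = confidence_interval(sample_small_sigma, T_CRIT_95_DF4)
print(low, high)
```

Note that the interval itself uses only observable quantities: the sample mean, the residual sum of squares and a tabulated constant.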

Regressions


In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals.

Given the actual function that relates the independent variable to the dependent variable (say, a line), the deviations of the observations from this function are the errors. If one runs a regression on some data, then the deviations of the observations from the fitted function are the residuals.

However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain: linear regressions fit endpoints better than the middle. This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.

Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers: a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.
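
As a rough sketch of this adjustment for a straight-line fit, the code below uses the standard result that the residual at input x_i has variance σ²(1 − h_i), where the leverage is h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². That formula is not stated in the text above and is brought in here as an assumption; the data and function names are made up, and the version shown is the internally studentized residual, which estimates σ² from all n residuals.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverages(xs):
    """Leverage h_i of each input in a straight-line regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


def studentized_residuals(xs, ys):
    """Residuals divided by their estimated standard deviations."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n = len(xs)
    s2 = sum(r * r for r in residuals) / (n - 2)  # estimate of sigma^2
    return [r / math.sqrt(s2 * (1.0 - h)) for r, h in zip(residuals, leverages(xs))]


# Hypothetical data: five equally spaced inputs with noisy responses.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

h = leverages(xs)
residual_variance_factors = [1.0 - hi for hi in h]  # largest in the middle
print(h)
print(residual_variance_factors)
print(studentized_residuals(xs, ys))
```

The leverages are largest at the two ends of the domain, so the factor 1 − h_i is smallest there: the same raw residual counts as more extreme at an endpoint than in the middle.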

See also

  • Absolute deviation
  • Deviation (statistics)
  • Error detection and correction
  • Margin of error
  • Mean absolute error
  • Propagation of error
  • Root mean square deviation
  • Sampling error
  • Studentized residual


External links

  • Residuals from the humorous perspective.