Regression analysis



 
 
In statistics, regression analysis is a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors). The dependent variable in the regression equation is modeled as a function of the independent variables, corresponding parameters ("constants"), and an error term. The error term is treated as a random variable. It represents unexplained variation in the dependent variable. The parameters are estimated so as to give a "best fit" of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.

Regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. These uses of regression rely heavily on the underlying assumptions being satisfied. Regression analysis has been criticized as being misused for these purposes in many cases where the appropriate assumptions cannot be verified to hold. One factor contributing to the misuse of regression is that it can take considerably more skill to critique a model than to fit a model.

History of regression analysis

The earliest form of regression was the method of least squares, which was published by Legendre in 1805, and by Gauss in 1809. The term “least squares” is from Legendre’s term, moindres carrés. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton, a cousin of Charles Darwin, in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average. For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. At the present time, the term "regression" is often synonymous with "least squares curve fitting".

Underlying assumptions


Classical assumptions for regression analysis include:
  • The sample must be representative of the population for the inference or prediction.
  • The error is assumed to be a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are error-free. If this is not so, modeling may be done using errors-in-variables model techniques.
  • The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others. See Multicollinearity.
  • The errors are uncorrelated, that is, the variance-covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
  • The variance of the error is constant across observations (homoscedasticity). If not, weighted least squares or other methods might be used.
These are sufficient (but not all necessary) conditions for the least-squares estimator to possess desirable properties. In particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. Many of these assumptions may be relaxed in more advanced treatments.

Assumptions include the geometrical support of the variables (Cressie, 1996). Independent and dependent variables often refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions of regression. Geographically weighted regression is one technique to deal with such data (Fotheringham et al., 2002). Also, variables may include values aggregated by areas. With aggregated data, the Modifiable Areal Unit Problem can cause extreme variation in regression parameters (Fotheringham and Wong, 1991). When analyzing data aggregated by political boundaries, postal codes or census areas, results may be very different with a different choice of units.

Regression equation

It is convenient to assume an environment in which an experiment is performed: the dependent variable is then the outcome of a measurement.

The regression equation deals with the following variables:
  • The unknown parameters, denoted as β. This may be a scalar or a vector of length k.
  • The independent variables, X.
  • The dependent variable, Y.

The regression equation is a function of the variables X and β:

$$Y \approx f(X, \beta)$$

The user of regression analysis must make an intelligent guess about this function. Sometimes the form of this function is known; sometimes the user must find it by trial and error.

Assume now that the vector of unknown parameters, β, is of length k. In order to perform a regression analysis the user must provide information about the dependent variable Y:
  • If the user performs the measurement N times, where N < k, regression analysis cannot be performed: not enough information has been provided.
  • If the user performs N independent measurements, where N = k, then the problem reduces to solving a set of N equations with N unknowns β.
  • If, on the other hand, the user provides results of N independent measurements, where N > k, regression analysis can be performed. Such a system is also called an overdetermined system.


In the last case the regression analysis provides the tools for:
  1. finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y (also known as the method of least squares);
  2. using the surplus of information, under certain statistical assumptions, to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y.
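
As an illustration of the first item, the least squares criterion can be written out explicitly. This is only a sketch in notation the list itself does not define: $x_i$ and $y_i$ are assumed to denote the $i$-th measured values of X and Y, and $f$ is the regression function chosen by the user.

```latex
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \bigl( y_i - f(x_i, \beta) \bigr)^2
```

Here "distance" is taken to mean the sum of squared differences between measured and predicted values; other distance measures lead to other estimators.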


Independent measurements

Quantitatively, this is explained by the following example: Consider a regression model with, say, three unknown parameters β0, β1 and β2. An experimenter performed 10 repeated measurements at exactly the same value of independent variables X. In this case regression analysis fails to give a unique value for the three unknown parameters: the experimenter did not provide enough information. The best one can do is to calculate the average value of the dependent variable Y and its standard deviation.

If the experimenter had performed five measurements at X1, four at X2 and one at X3, where X1, X2 and X3 are different values of the independent variable X then regression analysis would provide a unique solution to unknown parameters β.

In the case of general linear regression (see below) the above statement is equivalent to the requirement that the matrix $X^{\mathsf T}X$ is regular (that is, it has an inverse matrix).
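
The two experiments described above can be checked numerically. The sketch below assumes, purely for illustration, that the three-parameter model is the quadratic y = β0 + β1·x + β2·x², and it uses hypothetical x values (2 for the repeated design; 1, 2 and 3 for X1, X2 and X3). It builds the matrix X^T X for each design and computes its determinant, which is zero exactly when the matrix has no inverse.

```python
def design_row(x):
    """Row of the design matrix for the assumed model y = b0 + b1*x + b2*x**2."""
    return [1.0, x, x * x]


def gram_matrix(xs):
    """Return X^T X for the quadratic model evaluated at the points xs."""
    rows = [design_row(x) for x in xs]
    return [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Ten repeated measurements at exactly the same (hypothetical) x value.
repeated = [2.0] * 10
# Five measurements at X1 = 1, four at X2 = 2 and one at X3 = 3 (hypothetical values).
spread = [1.0] * 5 + [2.0] * 4 + [3.0]

det_repeated = det3(gram_matrix(repeated))  # zero: X^T X is singular
det_spread = det3(gram_matrix(spread))      # nonzero: X^T X is invertible

print("repeated design:", det_repeated)
print("spread design:  ", det_spread)
```

A zero determinant for the repeated design corresponds to the statement that the experimenter did not provide enough information; the nonzero determinant for the second design corresponds to a unique solution.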

Statistical assumptions

When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors $\varepsilon_i$ (see below) are normally distributed, then the excess of information contained in the (N − k) measurements is used to make statistical inferences about the unknown parameters, such as:
  • confidence intervals of the unknown parameters.


Linear regression

In linear regression, the model specification is that the dependent variable, $y_i$, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling $N$ data points there is one independent variable, $x_i$, and two parameters, $\beta_0$ and $\beta_1$:

straight line: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, N.$

In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in $x_i^2$ to the preceding regression gives:

parabola: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \quad i = 1, \dots, N.$


This is still linear regression; although the expression on the right hand side is quadratic in the independent variable $x_i$, it is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$.

In both cases, $\varepsilon_i$ is an error term and the subscript $i$ indexes a particular observation. Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + e_i.$$

The term $e_i$ is the residual, $e_i = y_i - \hat y_i$, where $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ is the fitted value. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE:

$$SSE = \sum_{i=1}^{N} e_i^2.$$

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, $\hat\beta_0$ and $\hat\beta_1$. See regression coefficients for statistical properties of these estimators.

In the case of simple regression, the formulas for the least squares estimates are

$$\hat\beta_1 = \frac{\sum_{i=1}^{N} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{N} (x_i - \bar x)^2} \quad \text{and} \quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x,$$

where $\bar x$ is the mean (average) of the $x$ values and $\bar y$ is the mean of the $y$ values. See linear least squares (straight line fitting) for a derivation of these formulas and a numerical example. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

$$\hat\sigma_\varepsilon^2 = \frac{SSE}{N - 2}.$$

Its square root, $\hat\sigma_\varepsilon$, is called the root mean square error (RMSE) of the regression. The standard errors of the parameter estimates are given by

$$\hat\sigma_{\beta_0} = \hat\sigma_\varepsilon \sqrt{\frac{1}{N} + \frac{\bar x^2}{\sum_{i=1}^{N} (x_i - \bar x)^2}}, \qquad \hat\sigma_{\beta_1} = \hat\sigma_\varepsilon \sqrt{\frac{1}{\sum_{i=1}^{N} (x_i - \bar x)^2}}.$$

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
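
The simple regression estimates and their standard errors can be computed in a few lines. The following is a minimal sketch using the textbook closed-form expressions for simple least squares; the function name `simple_ols` and the five data points are hypothetical and chosen only for illustration.

```python
import math


def simple_ols(xs, ys):
    """Least squares fit of y = b0 + b1 * x with standard errors (needs N > 2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b1 = sxy / sxx              # slope estimate
    b0 = y_bar - b1 * x_bar     # intercept estimate

    # Residuals and their sum of squares (SSE).
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)

    # Estimated error variance uses N - 2 degrees of freedom; its square root is the RMSE.
    rmse = math.sqrt(sse / (n - 2))

    # Standard errors of the intercept and slope estimates.
    se_b0 = rmse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_b1 = rmse / math.sqrt(sxx)
    return {"b0": b0, "b1": b1, "sse": sse, "rmse": rmse,
            "se_b0": se_b0, "se_b1": se_b1}


# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(xs, ys)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

The standard errors returned here are the quantities a researcher would combine with a t distribution to form confidence intervals, under the normality assumption stated above.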

General linear data model

In the more general multiple regression model, there are $p$ independent variables:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i.$$

The least squares parameter estimates are obtained from $p$ normal equations. The residual can be written as

$$e_i = y_i - \hat\beta_1 x_{i1} - \cdots - \hat\beta_p x_{ip}.$$

The normal equations are

$$\sum_{i=1}^{N} \sum_{k=1}^{p} x_{ij} x_{ik} \hat\beta_k = \sum_{i=1}^{N} x_{ij} y_i, \quad j = 1, \dots, p.$$

In matrix notation, the normal equations are written as

$$(X^{\mathsf T} X)\,\hat{\boldsymbol\beta} = X^{\mathsf T} Y,$$

where the $ij$ element of $X$ is $x_{ij}$ and the $i$ element of the column vector $Y$ is $y_i$.

For a numerical example see linear regression (example).
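
The normal equations can be formed and solved directly. The sketch below is one possible illustration, written with plain Python lists rather than a numerical library; the helper names `solve_linear_system` and `ols_normal_equations` are hypothetical, and the data are generated without noise from an assumed parabola y = 1 + 2x + 3x², with the intercept handled by a column of ones in X.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols_normal_equations(X, y):
    """Return the least squares estimates solving (X^T X) beta = X^T y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data from y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0]
X = [[1.0, x, x * x] for x in xs]          # columns: constant, x, x^2
y = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
beta_hat = ols_normal_equations(X, y)
print(beta_hat)
```

Solving the normal equations directly is convenient for small, well-conditioned problems; statistical software commonly relies on more numerically stable matrix decompositions instead.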

Regression diagnostics

Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analyses of the pattern of residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.

Interpretations of these diagnostic tests rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.
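
Two of the checks mentioned above, the R-squared and the F statistic of the overall fit, can be computed from sums of squares. The sketch below assumes a simple linear regression with an intercept, for which R² = 1 − SSE/SST and the F statistic has 1 and N − 2 degrees of freedom; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def goodness_of_fit(xs, ys):
    """Return (R-squared, overall F statistic) for a fitted straight line."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    y_bar = sum(ys) / n
    # SSE: variation left unexplained by the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: total variation of y about its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1.0 - sse / sst
    # Explained variation per model degree of freedom (1) over
    # unexplained variation per residual degree of freedom (n - 2).
    f_stat = (sst - sse) / (sse / (n - 2))
    return r_squared, f_stat


# Hypothetical data lying close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r_squared, f_stat = goodness_of_fit(xs, ys)
print(f"R-squared = {r_squared:.4f}, F = {f_stat:.1f}")
```

Turning the F statistic into a p-value requires the F distribution, which is not in the Python standard library, so that step is left out of this sketch.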

Regression with limited dependent variables

The response variable may be non-continuous ("limited" to lie on some subset of the real line). For binary (zero or one) variables, if analysis proceeds with least-squares linear regression, the model is called the linear probability model. Nonlinear models for binary dependent variables include the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. Censored regression models may be used when the dependent variable is only sometimes observed, and Heckman correction type models may be used when the sample is not randomly selected from the population of interest. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable takes small non-negative integer values and counts the occurrences of an event, count models such as Poisson regression or the negative binomial model may be used.

Interpolation and extrapolation

Regression models predict a value of the Y variable given known values of the X variables. If the prediction is to be done within the range of values of the X variables used to construct the model, this is known as interpolation. Prediction outside the range of the data used to construct the model is known as extrapolation, and it is riskier.
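
For a single independent variable, the distinction amounts to a range check on the point at which a prediction is requested. The helper below is a hypothetical illustration, not part of any particular package; with several independent variables the notion of "within the range of the data" is less simple than a per-variable check.

```python
def prediction_kind(x_new, xs):
    """Classify a prediction point relative to the range of the fitting data."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


# Hypothetical x values that were used to construct a model.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
kinds = [prediction_kind(x, xs) for x in (2.5, 5.0, 7.0)]
print(kinds)
```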

Nonlinear regression


When the model function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which are summarized in Differences between linear and non-linear least squares.
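
One common iterative procedure is the Gauss–Newton method, which repeatedly linearizes the model around the current parameter values and solves the resulting normal equations. The sketch below is only an illustration under stated assumptions: the model y = a·exp(b·x), the starting values and the noise-free data are all hypothetical, and a simple step-halving safeguard is added so that the sum of squares does not increase from one iteration to the next.

```python
import math


def sum_of_squares(a, b, xs, ys):
    """Sum of squared residuals for the assumed model y = a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))


def gauss_newton(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Fit y = a * exp(b * x) by Gauss-Newton iterations with step halving."""
    for _ in range(max_iter):
        # Accumulate J^T J and J^T r, where J holds the partial derivatives
        # of the model with respect to a and b, and r holds the residuals.
        jaa = jab = jbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            da, db = e, a * x * e
            r = y - a * e
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        det = jaa * jbb - jab * jab
        if det == 0.0:
            raise ValueError("linearized normal equations are singular")
        # Solve the 2x2 linearized normal equations for the parameter step.
        step_a = (jbb * ga - jab * gb) / det
        step_b = (jaa * gb - jab * ga) / det
        # Halve the step while it would increase the sum of squares.
        current = sum_of_squares(a, b, xs, ys)
        scale = 1.0
        while scale > 1e-6 and sum_of_squares(
                a + scale * step_a, b + scale * step_b, xs, ys) > current:
            scale /= 2.0
        a += scale * step_a
        b += scale * step_b
        if abs(scale * step_a) + abs(scale * step_b) < tol:
            break
    return a, b


# Hypothetical noise-free data generated with a = 2 and b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = gauss_newton(xs, ys, a=1.5, b=0.4)
print(a_hat, b_hat)
```

Unlike the linear case, the result can depend on the starting values, and convergence is not guaranteed for every model and data set.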


Other methods

Although the parameters of a regression model are usually estimated using the method of least squares, other methods which have been used include:
  • Bayesian methods, e.g. Bayesian linear regression.
  • Minimization of absolute deviations, leading to quantile regression.
  • Nonparametric regression. This approach requires a large number of observations, as the data are used to build the model structure as well as estimate the model parameters. Such methods are usually computationally intensive.
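
The difference between the least squares criterion and the absolute deviation criterion can be seen in the simplest possible model, one that contains only a constant term. The sketch below uses a hypothetical data set with one extreme value and a coarse grid search (for illustration only; it is not how quantile regression is computed in practice). The constant minimizing the sum of squares is the mean, while the constant minimizing the sum of absolute deviations is the median.

```python
import statistics


def sum_abs_dev(c, ys):
    """Sum of absolute deviations of the data from the constant c."""
    return sum(abs(y - c) for y in ys)


def sum_sq_dev(c, ys):
    """Sum of squared deviations of the data from the constant c."""
    return sum((y - c) ** 2 for y in ys)


# Hypothetical data with one extreme observation.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_abs = min(candidates, key=lambda c: sum_abs_dev(c, ys))
best_sq = min(candidates, key=lambda c: sum_sq_dev(c, ys))

print("least absolute deviations:", best_abs, "median:", statistics.median(ys))
print("least squares:            ", best_sq, "mean:  ", statistics.mean(ys))
```

The absolute deviation fit is far less affected by the extreme observation than the least squares fit is.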


See also

  • Confidence interval
  • Confidence region
  • Extrapolation
  • Kriging (a linear least squares estimation algorithm)
  • Forecasting
  • Prediction interval
  • Statistics
  • Trend estimation
  • Robust regression
  • Multivariate normal distribution
  • Important publications in regression analysis
  • Multivariate adaptive regression splines
  • Segmented regression


Software

All major statistical software packages perform the common types of regression analysis correctly and in a user-friendly way. Simple linear regression can be done in some spreadsheet applications. There are a number of software programs that perform specialized forms of regression, and experts may choose to write their own code using statistical programming languages or numerical analysis software.
