Linear regression

 


In statistics, linear regression is used for two purposes:
  • to construct a simple formula that predicts the value of a quantity of interest when other related variables take given values.
  • to test whether a given variable has an effect on a quantity of interest in situations where there may be many related variables.


In both cases, several sets of outcomes are available for the quantity of interest together with the related variables.

Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a least squares function, called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line when the predicted value (from the regression equation) is plotted against the independent variable: this is called a simple linear regression. However, note that "linear" does not refer to this straight line, but rather to the way in which the regression coefficients occur in the regression equation. The results are subject to statistical analysis.

Introduction


Theoretical model


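A linear regression model assumes, given a random sample $(Y_i, X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, a possibly imperfect relationship between $Y_i$, the regressand, and the regressors $X_{i1}, \ldots, X_{ip}$. A disturbance term $\varepsilon_i$, which is also a random variable, is added to this assumed relationship to capture the influence of everything on $Y_i$ other than $X_{i1}, \ldots, X_{ip}$. Hence, the multiple linear regression model takes the following form:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n.$$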

Note that the regressors are also called independent variables, exogenous variables, covariates, input variables or predictor variables. Similarly, regressands are also called dependent variables, response variables, measured variables, or predicted variables.

Models which do not conform to this specification may be treated by nonlinear regression. A linear regression model need not be a linear function of the independent variable: linear in this context means that the conditional mean of $Y_i$ is linear in the parameters $\beta$. For example, the model $Y_i = \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$ is linear in the parameters $\beta_1$ and $\beta_2$, but it is not linear in $X_i$, because it involves $X_i^2$, a nonlinear function of $X_i$. An illustration of this model is shown in the example, below.

Data and estimation


It is important to distinguish the model formulated in terms of random variables from the observed values of these random variables. Typically, the observed values, or data, denoted by lower case letters, consist of $n$ values $(y_i, x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$.

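In general there are $p + 1$ parameters to be determined, $\beta_0, \ldots, \beta_p$. In order to estimate the parameters it is often useful to use the matrix notation

$$Y = X\beta + \varepsilon,$$

where $Y$ is a column vector that includes the observed values of $Y_1, \ldots, Y_n$, $\varepsilon$ includes the unobserved stochastic components $\varepsilon_1, \ldots, \varepsilon_n$, and the matrix $X$ includes the observed values of the regressors:

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$

$X$ typically includes a constant column, that is, a column which does not vary across observations, which is used to represent the intercept term $\beta_0$.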

If there is any linear dependence among the columns of $X$, then the vector of parameters $\beta$ cannot be estimated by least squares unless $\beta$ is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of $\beta$ may still be uniquely estimable in such cases. For example, the model $Y_i = \beta_0 + \beta_1 X_i + \beta_2 (2X_i) + \varepsilon_i$ cannot be solved for $\beta_1$ and $\beta_2$ independently, as the matrix of observations has the reduced rank 2. In this case the model can be rewritten as $Y_i = \beta_0 + (\beta_1 + 2\beta_2) X_i + \varepsilon_i$ and can be solved to give a value for the composite entity $\beta_1 + 2\beta_2$.
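The rank deficiency described above can be checked numerically. The following is a minimal sketch using NumPy, assuming a design with a constant column, one regressor, and a copy of that regressor scaled by two; the data values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: the third column is exactly twice the second.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x, 2.0 * x])
y = 1.0 + 3.0 * x  # noise-free response, chosen for illustration

# X has three columns but only two are linearly independent.
rank = np.linalg.matrix_rank(X)

# Drop the redundant column and estimate the composite coefficient instead.
X_reduced = X[:, :2]
coef, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
intercept, composite = coef

print("rank of X:", rank)
print("intercept:", intercept, "composite slope:", composite)
```

Only the composite slope is identified here; any split of it between the two collinear columns fits the data equally well.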

Note that to perform only a least squares estimation of $\beta$, it is not necessary to consider the sample as random variables. It may even be conceptually simpler to consider the sample as fixed, observed values, as we have done thus far. However, in the context of hypothesis testing and confidence intervals, it will be necessary to interpret the sample as random variables that will produce estimators which are themselves random variables. Then it will be possible to study the distribution of the estimators and draw inferences.

Classical assumptions


Classical assumptions for linear regression include the assumptions that the sample is selected at random from the population of interest, that the dependent variable is continuous on the real line, and that the error terms follow identical and independent normal distributions, that is, that the errors are i.i.d. and Gaussian. Note that these assumptions imply that the error term does not statistically depend on the values of the independent variables, that is, that $\varepsilon_i$ is statistically independent of the predictor variables. This article adopts these assumptions unless otherwise stated. Note that in more advanced treatments all of these assumptions may be relaxed. In particular, the assumption that the error terms are normally distributed is of little consequence unless the sample is very small, because central limit theorems imply that, so long as the error terms have finite variance and are not too strongly correlated, the parameter estimates will be approximately normally distributed even when the underlying errors are not.

Under these assumptions, an equivalent formulation of simple linear regression that explicitly shows the linear regression as a model of conditional expectation can be given as

$$E(Y_i \mid X_i = x_i) = \alpha + \beta x_i.$$

The conditional expected value of $Y_i$ given $X_i$ is an affine function of $X_i$. Note that this expression follows from the assumption that the mean of $\varepsilon_i$ is zero conditional on $X_i$.

Least-squares analysis


Least squares estimates


The first objective of regression analysis is to best-fit the data by estimating the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a very powerful one. This estimate (or estimator, if we are in the context of a random sample) is given by

$$\hat\beta = (X^T X)^{-1} X^T y.$$

For a full derivation see Linear least squares.
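As a minimal sketch of the least squares estimate, the normal equations can be solved directly with NumPy. The function name `ols` and the data are hypothetical; in practice a routine such as `numpy.linalg.lstsq` is usually preferred over forming $X^T X$ explicitly, because it is numerically more stable.

```python
import numpy as np

def ols(X, y):
    """Least squares estimate from the normal equations (X'X) b = X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data generated from y = 2 + 0.5 x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # constant column for the intercept
y = 2.0 + 0.5 * x

beta_hat = ols(X, y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same fit, different solver

print(beta_hat)
```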

Regression inference


The estimates can be used to test various hypotheses.

Denote by $\sigma^2$ the variance of the error term $\varepsilon$ (recall we assume that $\varepsilon_i \sim N(0, \sigma^2)$ for every $i = 1, \ldots, n$). With $p$ regressors plus a constant, an unbiased estimate of $\sigma^2$ is given by

$$\hat\sigma^2 = \frac{S}{n - p - 1},$$

where $S = \sum_{i=1}^{n} \hat\varepsilon_i^2$ is the sum of squared residuals. The relation between the estimate and the true value is

$$\hat\sigma^2 \cdot \frac{n - p - 1}{\sigma^2} \sim \chi^2_{n-p-1},$$

where $\chi^2_{n-p-1}$ denotes the chi-square distribution with $n - p - 1$ degrees of freedom.

The solution to the normal equations can be written as

$$\hat\beta = (X^T X)^{-1} X^T y.$$

This shows that the parameter estimators are linear combinations of the dependent variable. It follows that, if the observational errors are normally distributed, the parameter estimators will follow a joint normal distribution. Under the assumptions here, the estimated parameter vector is exactly distributed,

$$\hat\beta \sim N\left(\beta,\ \sigma^2 (X^T X)^{-1}\right),$$

where $N(\cdot, \cdot)$ denotes the multivariate normal distribution.

The standard error of a parameter estimator is given by

$$\hat\sigma_j = \sqrt{\hat\sigma^2 \left[(X^T X)^{-1}\right]_{jj}}.$$

The $100(1-\alpha)\%$ confidence interval for the parameter, $\beta_j$, is computed as follows:

$$\hat\beta_j \pm t_{\alpha/2,\,n-p-1}\, \hat\sigma_j,$$

where $n - p - 1$ is the number of observations minus the number of estimated parameters.
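The variance estimate, standard errors and confidence limits can be sketched as follows. The function `ols_inference`, its parameter `t_crit` and the data are hypothetical names and values introduced for illustration; the Student t quantile is passed in by the caller rather than computed, so that only NumPy is needed.

```python
import numpy as np

def ols_inference(X, y, t_crit):
    """Return estimates, standard errors and confidence limits.

    t_crit is the Student t quantile for the chosen level with n - k degrees
    of freedom, where k is the number of columns of X (constant included).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # unbiased error variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard error of each estimate
    return beta, se, beta - t_crit * se, beta + t_crit * se

# Hypothetical data: six observations, two parameters, so 4 degrees of freedom.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.0])
X = np.column_stack([np.ones_like(x), x])

# 2.776 is the two-sided 95% Student t quantile for 4 degrees of freedom.
beta, se, lower, upper = ols_inference(X, y, t_crit=2.776)
print(beta, se, lower, upper)
```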

The residuals can be expressed as

$$\hat\varepsilon = y - X\hat\beta = y - X(X^T X)^{-1} X^T y = (I - H)\,y.$$

The matrix $H = X(X^T X)^{-1} X^T$ is known as the hat matrix and has the useful property that it is idempotent. Using this property it can be shown that, if the errors are normally distributed, the residuals will follow a normal distribution with covariance matrix $\sigma^2 (I - H)$. Studentized residuals are useful in testing for outliers.
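The idempotence of the hat matrix is easy to verify numerically. This is a small sketch with hypothetical data; forming $H$ explicitly is done here only for illustration, since it is an $n \times n$ matrix and is rarely built for large samples.

```python
import numpy as np

# Hypothetical data for a straight-line fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X' maps observed values to fitted values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
residuals = (np.eye(len(y)) - H) @ y

idempotent = np.allclose(H @ H, H)                     # H H = H
trace_is_k = np.isclose(np.trace(H), X.shape[1])       # trace = number of parameters

print(idempotent, trace_is_k)
```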

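Given a value of the independent variable, $x_d$, the predicted response is calculated as

$$y_d = \sum_{j} x_{dj}\, \hat\beta_j.$$

Writing the elements $x_{dj}$ as the vector $z$, the mean response confidence interval for the prediction is given, using error propagation theory, by:

$$z^T \hat\beta \pm t_{\alpha/2,\,n-p-1}\, \hat\sigma \sqrt{z^T (X^T X)^{-1} z}.$$

The predicted response confidence intervals for the data are given by:

$$z^T \hat\beta \pm t_{\alpha/2,\,n-p-1}\, \hat\sigma \sqrt{1 + z^T (X^T X)^{-1} z}.$$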

Univariate linear case

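We consider here the case of the simplest regression model, $Y = \alpha + \beta X + \varepsilon$. In order to estimate $\alpha$ and $\beta$, we have a sample $(y_i, x_i)$, $i = 1, \ldots, n$, of observations which are, here, not seen as random variables and are denoted by lower case letters. As stated in the introduction, however, we might want to interpret the sample in terms of random variables in contexts other than least squares estimation.

The idea of least squares estimation is to minimize the following unknown quantity, the sum of squared errors:

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

Taking the derivative of the preceding expression with respect to $\alpha$ and $\beta$ and setting it to zero yields the normal equations:

$$n\,\alpha + \beta \sum_{i} x_i = \sum_{i} y_i,$$

$$\alpha \sum_{i} x_i + \beta \sum_{i} x_i^2 = \sum_{i} x_i y_i.$$

This is a linear system of equations which can be solved using Cramer's rule:

$$\hat\beta = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},$$

$$\hat\alpha = \frac{\sum_i x_i^2 \sum_i y_i - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} = \bar y - \hat\beta\, \bar x.$$

The covariance matrix is

$$\frac{\hat\sigma^2}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}.$$

The mean response confidence interval is given by

$$(\hat\alpha + \hat\beta x_d) \pm t_{\alpha/2,\,n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_d - \bar x)^2}{\sum_i (x_i - \bar x)^2}}.$$

The predicted response confidence interval is given by

$$(\hat\alpha + \hat\beta x_d) \pm t_{\alpha/2,\,n-2}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_d - \bar x)^2}{\sum_i (x_i - \bar x)^2}}.$$

The term $t_{\alpha/2,\,n-2}$ refers to a quantile of Student's t-distribution with $n - 2$ degrees of freedom (in this subscript $\alpha$ denotes the significance level, not the intercept). $\hat\sigma$ is the standard error, the square root of the estimated error variance.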
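The Cramer's rule solution of the two normal equations can be written out directly. The sketch below uses only the Python standard library; the function name and the data are hypothetical.

```python
def simple_linear_regression(xs, ys):
    """Solve the two normal equations for (alpha, beta) by Cramer's rule."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of the normal equation matrix
    if det == 0:
        raise ValueError("all x values are equal; the slope is not identified")
    beta = (n * sxy - sx * sy) / det
    alpha = (sxx * sy - sx * sxy) / det
    return alpha, beta

# Hypothetical data lying exactly on the line y = 1 + 2x.
alpha_hat, beta_hat = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha_hat, beta_hat)  # 1.0 2.0
```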

Analysis of variance

In analysis of variance (ANOVA), the total sum of squares is split into two or more components.

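The "total (corrected) sum of squares" is

$$SST = \sum_{i=1}^{n} (y_i - \bar y)^2,$$

where

$$\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i$$

("corrected" means $\bar y$ has been subtracted from each y-value). Equivalently

$$SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2.$$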

The total sum of squares is partitioned as the sum of the "regression sum of squares" SSReg (or RSS, also called the "explained sum of squares") and the "error sum of squares" SSE, which is the sum of squares of residuals.

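The regression sum of squares is

$$SSReg = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 = \hat\beta^T X^T y - \frac{1}{n}\left(y^T u\, u^T y\right),$$

where $u$ is an n-by-1 vector in which each element is 1. Note that

$$y^T u = u^T y = \sum_{i=1}^{n} y_i,$$

and

$$\frac{1}{n}\, y^T u\, u^T y = \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2.$$

The error (or "unexplained") sum of squares SSE, which is the sum of squares of residuals, is given by

$$SSE = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = y^T y - \hat\beta^T X^T y.$$

The total sum of squares SST is

$$SST = \sum_{i=1}^{n} (y_i - \bar y)^2 = y^T y - \frac{1}{n}\left(y^T u\, u^T y\right) = SSReg + SSE.$$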

The coefficient of determination, $R^2$, is then given as

$$R^2 = \frac{SSReg}{SST} = 1 - \frac{SSE}{SST}.$$

If the errors are independent and normally distributed with expected value 0 and they all have the same variance, then under the null hypothesis that all of the elements of $\beta$ are 0 except the constant, the statistic

$$\frac{R^2 / (m - 1)}{(1 - R^2) / (n - m)}$$

follows an F-distribution with $(m-1)$ and $(n-m)$ degrees of freedom, where $m$ is the number of estimated parameters, including the constant. If that statistic is too large, then one rejects the null hypothesis. How large is too large depends on the level of the test, which is the tolerated probability of type I error; see statistical significance.
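The partition of the total sum of squares and the resulting F statistic can be sketched as follows. The function name `anova_table` and the data are hypothetical; the sketch assumes that `X` contains a constant column, and it computes only the statistic, not the tail probability, which would need an F-distribution routine from a statistics library.

```python
import numpy as np

def anova_table(X, y):
    """Return SST, SSReg, SSE, R^2 and the F statistic for a fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape                      # m counts the constant column as well
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    sst = np.sum((y - y.mean()) ** 2)       # total (corrected) sum of squares
    ssreg = np.sum((fitted - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - fitted) ** 2)         # error sum of squares
    r2 = ssreg / sst
    f_stat = (r2 / (m - 1)) / ((1.0 - r2) / (n - m))
    return sst, ssreg, sse, r2, f_stat

# Hypothetical data for a straight-line fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
X = np.column_stack([np.ones_like(x), x])

sst, ssreg, sse, r2, f_stat = anova_table(X, y)
print(sst, ssreg, sse, r2, f_stat)
```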

Example


To illustrate the various goals of regression, we give an example. The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
Height (m)   Weight (kg)
1.47         52.21
1.50         53.12
1.52         54.48
1.55         55.84
1.57         57.20
1.60         58.57
1.63         59.93
1.65         61.29
1.68         63.11
1.70         64.47
1.73         66.28
1.75         68.10
1.78         69.92
1.80         72.19
1.83         74.46

A plot of weight against height shows that the data are not well modeled by a straight line, so a regression is performed by modeling the data by a parabola,

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i,$$

where the dependent variable $Y_i$ is weight and the independent variable $X_i$ is height.

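Place the observations $1, x_i, x_i^2$, $i = 1, \ldots, n$, in the rows of the matrix $X$:

$$X = \begin{pmatrix} 1 & 1.47 & 2.16 \\ 1 & 1.50 & 2.25 \\ \vdots & \vdots & \vdots \\ 1 & 1.83 & 3.35 \end{pmatrix}.$$

The values of the parameters are found by solving the normal equations

$$(X^T X)\,\hat\beta = X^T y.$$

Element $ij$ of the normal equation matrix, $X^T X$, is formed by summing the products of column $i$ and column $j$ of $X$. Element $i$ of the right-hand side vector is formed by summing the products of column $i$ of $X$ with the column of dependent variable values.

Thus, the normal equations are

$$\begin{pmatrix} 15 & 24.76 & 41.05 \\ 24.76 & 41.05 & 68.37 \\ 41.05 & 68.37 & 114.35 \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} 931.17 \\ 1548.25 \\ 2585.54 \end{pmatrix}.$$

Solving them gives the parameter estimates $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$, each reported as value ± standard deviation.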

The calculated values are given by $\hat y = X\hat\beta$. The observed and calculated data are plotted together and the residuals, $y_i - \hat y_i$, are calculated and plotted. Standard deviations are calculated using the sum of squares, $S = \sum_i (y_i - \hat y_i)^2$.

The confidence intervals are computed using:

$$\left[\hat\beta_j - \hat\sigma_j\, t_{\alpha/2,\,n-3},\ \ \hat\beta_j + \hat\sigma_j\, t_{\alpha/2,\,n-3}\right]$$

with $\alpha = 5\%$ and $t_{\alpha/2,\,n-3} = 2.2$, which gives the 95% confidence intervals for the three parameters.
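The estimates, standard deviations and intervals for this example can be recomputed from the tabulated data. The following is a sketch using NumPy; the variable names are illustrative, and the value 2.2 for the t quantile is the one quoted in the text rather than one computed here.

```python
import numpy as np

# Data from the table above (heights in metres, weights in kilograms).
height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Design matrix for the parabola: columns 1, x, x^2.
X = np.column_stack([np.ones_like(height), height, height ** 2])
n, k = X.shape

xtx = X.T @ X          # normal equation matrix
xty = X.T @ weight     # right-hand side vector
beta = np.linalg.solve(xtx, xty)

resid = weight - X @ beta
sigma2 = resid @ resid / (n - k)                      # error variance estimate
std = np.sqrt(sigma2 * np.diag(np.linalg.inv(xtx)))   # standard deviations

t_crit = 2.2  # value quoted in the text for alpha = 5%
lower, upper = beta - t_crit * std, beta + t_crit * std

for j in range(k):
    print(f"beta_{j} = {beta[j]:.2f} +/- {std[j]:.2f}, "
          f"95% interval [{lower[j]:.1f}, {upper[j]:.1f}]")
```

The columns of this design matrix are strongly correlated, so the normal equation matrix is poorly conditioned; centring the heights before squaring is a common remedy when more precision is needed.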

Examining results of regression models


Checking model assumptions

Some of the model assumptions can be evaluated by calculating the residuals and plotting or otherwise analyzing them. The following plots can be constructed to test the validity of the assumptions:
  1. Residuals against the explanatory variables in the model, as illustrated above. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
  2. Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  3. Residuals against the fitted values, $\hat y$.
  4. A time series plot of the residuals, that is, plotting the residuals as a function of time.
  5. Residuals against the preceding residual.
  6. A normal probability plot of the residuals to test normality. The points should lie along a straight line.


In all but the last plot, there should be no noticeable pattern in the data.
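Some of these checks have simple numerical counterparts. The sketch below, with a hypothetical function name and data, fits a straight line to data that actually follow a parabola and then computes the correlation between each residual and the preceding one; a value far from zero signals the kind of pattern the plots are meant to reveal.

```python
import numpy as np

def residual_checks(X, y):
    """Return fitted values, residuals and the lag-1 residual correlation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    # Correlation between each residual and the preceding residual.
    lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    return fitted, resid, lag1

# Hypothetical data: a parabola fitted with a straight line (misspecified model).
x = np.arange(10, dtype=float)
y = x ** 2
X = np.column_stack([np.ones_like(x), x])

fitted, resid, lag1 = residual_checks(X, y)
print("lag-1 correlation of residuals:", lag1)
```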

Assessing goodness of fit


  1. The coefficient of determination gives what fraction of the observed variance of the response variable can be explained by the given variables.
  2. Examine the observational and prediction confidence intervals. In most contexts, the smaller they are the better.


Other procedures


Generalized least squares

Generalized least squares, which includes weighted least squares as a special case, can be used when the observational errors have unequal variance or serial correlation.
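For the weighted special case, a minimal sketch is shown below. It assumes the weights are supplied by the caller, typically as the reciprocals of the error variances; the function name and data are hypothetical, and serial correlation (the general case) is not handled.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Solve (X' W X) b = X' W y with W = diag(weights)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    xtw = X.T * w  # scales the contribution of observation i by w[i]
    return np.linalg.solve(xtw @ X, xtw @ y)

# Hypothetical noise-free data on the line y = 1 + 2x with unequal weights.
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

beta_wls = weighted_least_squares(X, y, weights=[1.0, 4.0, 0.5, 2.0])
beta_equal = weighted_least_squares(X, y, weights=np.ones(4))  # reduces to OLS
print(beta_wls, beta_equal)
```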

Errors-in-variables model

An errors-in-variables model, or total least squares, can be used when the independent variables are subject to error.

Generalized linear model


The generalized linear model is used when the distribution function of the errors is not a normal distribution. Examples include the exponential distribution, gamma distribution, inverse Gaussian distribution, Poisson distribution, binomial distribution, and multinomial distribution.


Robust regression


A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the normality assumption on the errors, this is not true when the variance or mean of the error distribution is not bounded, or when an analyst who can identify outliers is unavailable.

Among Stata users, robust regression is frequently taken to mean linear regression with Huber-White standard error estimates, due to the naming conventions for regression commands. This procedure relaxes the assumption of homoscedasticity for variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion; Stata users sometimes believe that linear regression is a robust method when this option is used, although it is actually not robust in the sense of outlier-resistance.
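The distinction can be made concrete with a sketch of heteroscedasticity-consistent standard errors. The version below is the basic "sandwich" form (often labelled HC0); the function name and data are hypothetical, and a given statistical package may apply an additional small-sample correction that is not reproduced here. Note that the coefficients are the ordinary least squares ones in both cases; only the standard errors differ.

```python
import numpy as np

def ols_with_hc0(X, y):
    """Return OLS coefficients with classical and HC0 sandwich standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    e = y - X @ beta
    # Classical standard errors assume a common error variance.
    classical = np.sqrt(np.diag(xtx_inv) * (e @ e) / (n - k))
    # Sandwich form: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}.
    meat = (X.T * e ** 2) @ X
    robust = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, classical, robust

# Hypothetical data whose scatter grows with x.
x = np.arange(1.0, 9.0)
noise = np.array([0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8])
y = 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])

beta, classical, robust = ols_with_hc0(X, y)
print(beta, classical, robust)
```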

Instrumental variables and related methods

The assumption that the error term in the linear model can be treated as uncorrelated with the independent variables will frequently be untenable, as omitted-variables bias, "reverse" causation, and errors-in-variables problems can generate such a correlation. Instrumental variable and other methods can be used in such cases.

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

The trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis).


A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.
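A straight trend line can be fitted with a degree-one polynomial fit. The sketch below uses `numpy.polyfit` on a hypothetical yearly series; a higher value of `deg` would give the curved variants mentioned above.

```python
import numpy as np

# Hypothetical yearly series, used only for illustration.
years = np.arange(2000, 2010)
values = np.array([10.2, 10.9, 11.1, 12.0, 12.4, 13.1, 13.3, 14.2, 14.8, 15.1])

# Centre the time axis so that the fit stays well conditioned.
t = years - years.mean()
slope, intercept = np.polyfit(t, values, deg=1)  # coefficients, highest power first

trend = intercept + slope * t
direction = "increasing" if slope > 0 else "decreasing or flat"
print(f"slope per year: {slope:.3f} ({direction})")
```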

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than correlational analysis using linear regression. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables and other methods may be used to attempt to estimate causal relationships from observational data.

Finance

The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed to provide the volatility of an investment relative to the volatility of the market as a whole. This would require that both these variables be treated in the same way when estimating the slope, whereas regression treats all variability as being in the investment returns variable; that is, it only considers residuals in the dependent variable.

Environmental science

Linear regression finds application in a wide range of environmental science applications. For example, recent work published in the Journal of Geophysical Research used regression models to identify data contamination, which led to an overstatement of global warming trends over land. Using the regression model to filter extraneous, nonclimatic effects reduced the estimated 1980–2002 global average temperature trends over land by about half.

