Stepwise regression

In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests, adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows' Cp, or the false discovery rate.

Main Approaches

The main approaches are:
  • Forward selection, which involves starting with no variables in the model, trying out the variables one by one and including them if they are 'statistically significant'.
  • Backward elimination, which involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant.
  • Methods that are a combination of the above, testing at each stage for variables to be included or excluded.
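
Forward selection based on partial F-tests can be sketched as follows. This is a minimal illustration in plain NumPy, not a reference implementation; the F-to-enter threshold of 4.0 is an assumed rule-of-thumb value, and the function and helper names are invented for this example:

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares of an OLS fit of y on the columns of X
    # (an intercept column is always included).
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection: repeatedly add the candidate variable
    with the largest partial F-statistic, stopping when no candidate
    exceeds the F-to-enter threshold."""
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    while remaining:
        rss_cur = rss(X[:, chosen], y)
        best, best_f = None, 0.0
        for j in remaining:
            rss_new = rss(X[:, chosen + [j]], y)
            df_resid = n - len(chosen) - 2  # intercept + chosen + candidate
            f = (rss_cur - rss_new) / (rss_new / df_resid)
            if f > best_f:
                best, best_f = j, f
        if best is None or best_f < f_to_enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Backward elimination is the mirror image: start with all candidates and repeatedly delete the variable with the smallest partial F until every remaining variable exceeds an F-to-remove threshold.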

A widely used algorithm was first proposed by Efroymson (1960). This is an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. The procedure is used primarily in regression analysis, though the basic approach is applicable in many forms of model selection. It is a variation on forward selection: at each stage in the process, after a new variable is added, a test is made to check whether some variables can be deleted without appreciably increasing the residual sum of squares (RSS). The procedure terminates when the measure of fit is (locally) maximized, or when the available improvement falls below some critical value.
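
The deletion step that distinguishes Efroymson's procedure from plain forward selection can be sketched as follows. This is a minimal NumPy sketch under assumed conventions: the F-to-remove threshold of 3.9 and the helper names are illustrative, not part of Efroymson's original description:

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares of an OLS fit (intercept always included).
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def deletion_check(X, y, chosen, f_to_remove=3.9):
    """After a new variable has entered, drop any already-chosen variable
    whose removal does not appreciably increase the RSS, i.e. whose
    partial F-statistic falls below the F-to-remove threshold."""
    n = len(y)
    changed = True
    while changed and chosen:
        changed = False
        rss_full = rss(X[:, chosen], y)
        df_resid = n - len(chosen) - 1  # intercept + chosen variables
        for j in list(chosen):
            reduced = [k for k in chosen if k != j]
            rss_red = rss(X[:, reduced], y)
            f = (rss_red - rss_full) / (rss_full / df_resid)
            if f < f_to_remove:
                chosen.remove(j)
                changed = True
                break
    return chosen
```

In the full procedure this check runs after every forward step, so a variable admitted early can later be discarded once better predictors have entered.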

Selection criteria

One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in sample than it does on new out-of-sample data. This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough. The key line in the sand is at what can be thought of as the Bonferroni point: namely, how significant the best spurious variable should be based on chance alone. On a t-statistic scale, this occurs at about √(2 log p), where p is the number of predictors. Unfortunately, this means that many variables which actually carry signal will not be included. This fence turns out to be the right trade-off between over-fitting and missing signal. If we look at the risk of different cutoffs, then using this bound will be within a 2 log p factor of the best possible risk. Any other cutoff will end up having a larger such risk inflation.
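
The Bonferroni point can be illustrated by simulation: the largest absolute "t-value" among p pure-noise predictors concentrates near √(2 log p). This sketch makes the simplifying assumption that the t-statistics behave like independent standard normals:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000  # number of candidate predictors, all pure noise

# Largest absolute "t-statistic" among p null predictors,
# repeated over many simulated datasets.
max_abs = np.array([np.abs(rng.standard_normal(p)).max() for _ in range(200)])

bonferroni_point = np.sqrt(2 * np.log(p))  # about 3.72 for p = 1000
print(max_abs.mean(), bonferroni_point)
```

A variable entering below this threshold could plausibly be the best of p spurious candidates, which is why a lax F-to-enter criterion overfits.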

Model Accuracy

A way to test for errors in models created by stepwise regression is to not rely on the model's F-statistic, significance, or multiple R, but instead to assess the model against a set of data that was not used to create it. This is often done by building a model on a sample of the available dataset (e.g. 70%) and using the remaining 30% to assess the accuracy of the model. Accuracy is then often measured as the actual standard error (SE), MAPE, or mean error between the predicted value and the actual value in the hold-out sample. This method is particularly valuable when data are collected in different settings (e.g. time, social) or when models are assumed to be generalizable.
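
The hold-out check described above can be sketched as follows. This is a minimal NumPy sketch on synthetic data; the 70/30 split and the error measures follow the text, while the data-generating model and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.standard_normal((n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)  # noise sd = 1

# 70/30 split: fit on the first part, assess on the hold-out.
idx = rng.permutation(n)
train, holdout = idx[:int(0.7 * n)], idx[int(0.7 * n):]

A_train = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

A_hold = np.column_stack([np.ones(len(holdout)), X[holdout]])
errors = y[holdout] - A_hold @ beta

se = np.sqrt(np.mean(errors ** 2))  # standard error of prediction on hold-out
mean_error = errors.mean()          # average bias on the hold-out sample
```

If the in-sample fit statistics look much better than `se` on the hold-out, the selected model is likely overfit.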


Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made:
  • A sequence of F-tests is often used to control the inclusion or exclusion of variables, but these are carried out on the same data and so there will be problems of multiple comparisons for which many correction criteria have been developed.
  • It is difficult to interpret the p-values associated with these tests, since each is conditional on the previous tests of inclusion and exclusion (see "dependent tests" in false discovery rate).

  • The tests themselves are biased, since they are based on the same data (Rencher and Pun, 1980; Copas, 1983). Wilkinson and Dallal (1981) computed percentage points of the multiple correlation coefficient by simulation and showed that a final regression obtained by forward selection, said by the F-procedure to be significant at 0.1%, was in fact only significant at 5%.

  • When estimating degrees of freedom, counting only the independent variables in the final fit understates the number effectively used, since many candidate variables were searched during selection; this causes the fit to appear better than it is when the r2 value is adjusted for degrees of freedom. It is important to consider how many degrees of freedom have been used in the entire selection process, not just to count the number of independent variables in the resulting fit.

  • Models that are created may be oversimplifications of the real models of the data.

Critics regard the procedure as a paradigmatic example of data dredging, intense computation often being an inadequate substitute for subject-area expertise.