In statistics, imputation is the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can be analysed using standard techniques for complete data. Ideally, however, the analysis should take into account that there is more uncertainty than if the imputed values had actually been observed, which generally requires some modification of the standard complete-data methods. Many imputation techniques are available.
A once-common method of imputation was hot-deck imputation, in which a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
Cold-deck imputation, by contrast, selects donors from another dataset. Since computer power has advanced rapidly and punched cards are no longer used, more sophisticated methods of imputation, such as nearest neighbour hot-deck imputation and the approximate Bayesian bootstrap, have generally superseded the original random and sorted hot-deck techniques.
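As a rough illustration of the random hot-deck idea, the following minimal Python sketch fills each missing value with the value of a randomly chosen donor from the same dataset that matches on a set of classifying variables. The function name `hot_deck_impute`, the record layout, and the use of `None` to mark missing values are assumptions made for this example only, and "similar" is simplified here to exact agreement on the matching variables.

```python
import random


def hot_deck_impute(records, target, match_on, rng=None):
    """Random hot-deck sketch: fill missing `target` values from a donor
    in the same dataset that agrees on every variable in `match_on`.

    `records` is a list of dicts; a missing value is represented by None
    (an assumption of this example).
    """
    rng = rng or random.Random()

    # Collect observed values of the target, grouped by matching class.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            key = tuple(rec[k] for k in match_on)
            donors.setdefault(key, []).append(rec[target])

    completed = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input records are not modified
        if new_rec[target] is None:
            pool = donors.get(tuple(rec[k] for k in match_on))
            if pool:  # stays missing when no similar donor exists
                new_rec[target] = rng.choice(pool)
        completed.append(new_rec)
    return completed


# Hypothetical survey records, invented for illustration.
records = [
    {"region": "north", "income": 30},
    {"region": "north", "income": None},
    {"region": "south", "income": 50},
    {"region": "south", "income": 55},
    {"region": "south", "income": None},
    {"region": "east", "income": None},
]
completed = hot_deck_impute(records, "income", ["region"], random.Random(0))
```

A cold-deck variant would differ only in building the donor pool from a separate dataset rather than from `records` itself.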
Since standard analysis techniques do not reflect the additional uncertainty due to imputing for missing data, further adjustments (such as multiple imputation or a Rao–Shao correction) are necessary to account for this.
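For multiple imputation, the adjustment is commonly made with combining rules usually attributed to Rubin, which are not spelled out in the text above. The sketch below assumes that the same analysis has been run on each of m imputed datasets, giving m point estimates and m squared standard errors; the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool results from m imputed datasets (sketch of Rubin's rules).

    estimates: point estimate from each imputed dataset
    variances: squared standard error of each estimate
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations of equal length")

    pooled = mean(estimates)        # average of the m estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed datasets.
pooled, total = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is how the extra uncertainty due to imputation enters the analysis.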
Alternatives to imputing missing data
Imputation is not the only method available for handling missing data. It usually gives better results than listwise deletion (in which all subjects with any missing values are omitted from the analysis) and may be competitive with a maximum likelihood approach in many circumstances. The expectation-maximization algorithm is a method for finding maximum likelihood estimates that has been widely applied to missing data problems. Other successful methods include computational intelligence methods.
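To make the contrast between listwise deletion and a maximum likelihood approach concrete, here is a minimal sketch of the expectation-maximization algorithm for one simple case. It assumes, purely for illustration, a bivariate normal model in which x is fully observed and y is missing for some cases (marked `None`) in a way that depends only on x. The function name `em_bivariate_normal` and the data are hypothetical.

```python
def em_bivariate_normal(x, y, iterations=500):
    """EM sketch for a bivariate normal with some y values missing (None).

    Returns maximum likelihood estimates (1/n denominators) of
    mu_x, mu_y, s_xx, s_yy, s_xy. Assumes x is not constant and at
    least one y is observed.
    """
    n = len(x)
    observed = [i for i in range(n) if y[i] is not None]

    # x is fully observed, so its mean and variance never change.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n

    # Starting values taken from the complete cases.
    mu_y = sum(y[i] for i in observed) / len(observed)
    s_yy = sum((y[i] - mu_y) ** 2 for i in observed) / len(observed)
    s_xy = sum((x[i] - mu_x) * (y[i] - mu_y) for i in observed) / len(observed)

    for _ in range(iterations):
        beta = s_xy / s_xx               # slope of y on x under current estimates
        resid_var = s_yy - beta * s_xy   # conditional variance of y given x

        # E-step: expected sufficient statistics for y.
        sum_y = sum_yy = sum_xy = 0.0
        for i in range(n):
            if y[i] is not None:
                e_y, e_yy = y[i], y[i] ** 2
            else:
                e_y = mu_y + beta * (x[i] - mu_x)
                e_yy = e_y ** 2 + resid_var
            sum_y += e_y
            sum_yy += e_yy
            sum_xy += x[i] * e_y

        # M-step: update the parameters that involve y.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y

    return mu_x, mu_y, s_xx, s_yy, s_xy


# Hypothetical data: y is missing for the two largest x values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, None, None]

# Listwise deletion keeps only the four complete cases.
complete = [yi for yi in y if yi is not None]
listwise_mean_y = sum(complete) / len(complete)

mu_x, mu_y, s_xx, s_yy, s_xy = em_bivariate_normal(x, y)
```

In this toy example listwise deletion estimates the mean of y from the four complete cases only, whereas the EM estimate also uses the two x values whose y is missing, so it is pulled upward along the fitted relationship between x and y.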
In machine learning, it is sometimes possible to train a classifier directly on the original data without imputing it first. This has been shown to yield better performance when the missing data are structurally absent rather than missing because of measurement noise.
See also
- Bootstrapping (statistics)
- Censoring (statistics)
- Geo-imputation
- Regression estimation