Akaike information criterion
The Akaike information criterion is a measure of the relative goodness of fit of a statistical model. It was developed by Hirotsugu Akaike, under the name of "an information criterion" (AIC), and was first published by Akaike in 1974. It is grounded in the concept of information entropy, in effect offering a relative measure of the information lost when a given model is used to describe reality. It can be said to describe the tradeoff between bias and variance in model construction, or, loosely speaking, between the accuracy and complexity of the model.

Given a data set, several candidate models may be ranked according to their AIC values. From the AIC values one may also infer, for example, that the top two models are roughly tied and the rest are far worse. Thus, AIC provides a means for comparison among models: a tool for model selection. AIC does not provide a test of a model in the usual sense of testing a null hypothesis; i.e. AIC can tell nothing about how well a model fits the data in an absolute sense. Hence, if all the candidate models fit poorly, AIC will not give any warning of that.

Definition

In the general case, the AIC is

$\mathrm{AIC} = 2k - 2\ln(L)$
where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model.

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Hence AIC not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of estimated parameters. This penalty discourages overfitting (increasing the number of free parameters in the model improves the goodness of the fit, regardless of the number of free parameters in the data-generating process).
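
The following is a minimal sketch in Python of the definition above; the example model (a normal distribution fitted by maximum likelihood) and all names are illustrative assumptions, not part of the original text:

    import numpy as np
    from scipy import stats

    def aic(log_likelihood, k):
        # AIC = 2k - 2 ln(L), where L is the maximized likelihood
        return 2 * k - 2 * log_likelihood

    # Illustrative data and model: fit a normal distribution by maximum likelihood.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=100)
    mu_hat, sigma_hat = x.mean(), x.std()                # ML estimates of mean and sd
    log_l = stats.norm.logpdf(x, mu_hat, sigma_hat).sum()
    print(aic(log_l, k=2))                               # k = 2 parameters: mean and sd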

AIC is founded in information theory. Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: $g_1$ and $g_2$. If we knew f, then we could find the information lost from using $g_1$ to represent f by calculating the Kullback–Leibler divergence, $D_{\mathrm{KL}}(f, g_1)$; similarly, the information lost from using $g_2$ to represent f would be found by calculating $D_{\mathrm{KL}}(f, g_2)$. We would then choose the candidate model that minimized the information loss.
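
For reference, a standard form of the Kullback–Leibler divergence for continuous distributions (this formula is textbook material, not quoted from the article itself) is

$D_{\mathrm{KL}}(f \parallel g) = \int f(x)\,\ln\frac{f(x)}{g(x)}\,dx$,

i.e. the expected extra log-loss incurred by describing data from f as if it came from g.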

We cannot choose with certainty, because we do not know f. Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by $g_1$ than by $g_2$. It is remarkable that such a simple formula for AIC results. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc, below).

How to apply AIC in practice

AIC estimates relative support for a model. To apply this in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using one of the candidate models to represent the "true" model. We wish to select, from among R candidate models, the model that minimizes the information loss. We cannot do this exactly, but we can minimize the estimated information loss.

Denote the AIC values of the candidate models by $\mathrm{AIC}_1, \mathrm{AIC}_2, \mathrm{AIC}_3, \ldots, \mathrm{AIC}_R$. Let $\mathrm{AIC}_{\min}$ be the minimum of those values. Then $\exp((\mathrm{AIC}_{\min} - \mathrm{AIC}_i)/2)$ can be interpreted as the relative probability that the ith model minimizes the (estimated) information loss.

As an example, suppose that there were three models in the candidate set, with AIC values 100, 102, and 110. Then the second model is $\exp((100-102)/2) = 0.368$ times as probable as the first model to minimize the information loss, and the third model is $\exp((100-110)/2) = 0.007$ times as probable as the first model to minimize the information loss. In this case, we might omit the third model from further consideration and take a weighted average of the first two models, with weights 1 and 0.368, respectively. Statistical inference would then be based on the weighted multimodel.
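
A short Python sketch reproducing the numbers in this example; the normalization of the relative likelihoods into weights is a common convention (often called "Akaike weights"), not something prescribed above:

    import numpy as np

    aic_values = np.array([100.0, 102.0, 110.0])
    rel_lik = np.exp((aic_values.min() - aic_values) / 2)  # relative likelihoods
    print(rel_lik.round(3))            # [1.    0.368 0.007]

    weights = rel_lik / rel_lik.sum()  # normalize to get model weights
    print(weights.round(3))            # roughly [0.727 0.268 0.005]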

If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC (and AICc) has no such restriction.

The quantity $\exp((\mathrm{AIC}_{\min} - \mathrm{AIC}_i)/2)$ is the relative likelihood of model i.

AICc

AICc is AIC with a correction for finite sample sizes:

$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$

where n denotes the sample size and k denotes the number of model parameters. Thus, AICc is AIC with a greater penalty for extra parameters.

Burnham & Anderson (2002) strongly recommend using AICc, rather than AIC, if n is small or k is large. Since AICc converges to AIC as n gets large, AICc generally should be employed regardless. Using AIC, instead of AICc, when n is not many times larger than k2, increases the probability of selecting models that have too many parameters, i.e. of overfitting. The probability of AIC overfitting can be substantial, in some cases.

Brockwell & Davis (p. 273) advise using AICc as the primary criterion in selecting the orders of an ARMA
Arma
-Businesses, institutions and organizations:* Agung Rai Museum of Art located in Ubud, Bali - also related to the ARMA resort* American Rock Mechanics Association, a geoscience organization...

 model for time series. McQuarrie & Tsai ground their high opinion of AICc on extensive simulation work with regression and time series.

AICc was first proposed by Hurvich & Tsai (1989). Different derivations of it are given by Brockwell & Davis, Burnham & Anderson, and Cavanaugh. All the derivations assume a univariate linear model with normally-distributed errors (conditional upon regressors); if that assumption does not hold, then the formula for AICc will usually change. Further discussion of this, with examples of other assumptions, is given by Burnham & Anderson (2002, ch.7). In particular, bootstrap
Bootstrapping (statistics)
In statistics, bootstrapping is a computer-based method for assigning measures of accuracy to sample estimates . This technique allows estimation of the sample distribution of almost any statistic using only very simple methods...

 estimation is usually feasible.

Note that when all the models in the candidate set have the same k, then AICc and AIC will give identical (relative) valuations. In that situation, AIC can always be used.

Relevance to chi-squared fitting

Often, one wishes to select amongst competing models where the likelihood functions assume that the underlying errors are normally distributed and independent. This assumption leads to $\chi^2$ model fitting.

For $\chi^2$ fitting, the likelihood is given by

$L = \prod_{i=1}^{n} \left(2\pi\sigma_i^2\right)^{-1/2} \exp\!\left(-\sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{2\sigma_i^2}\right)$

so $\ln L = C - \chi^2/2$, where C is a constant independent of the model used, and dependent only on the use of particular data points; i.e. it does not change if the data do not change.

The AIC is therefore given by $\mathrm{AIC} = 2k - 2\ln L = \chi^2 + 2k - 2C$. As only differences in AIC are meaningful, the constant C can be ignored, allowing us to take $\mathrm{AIC} = \chi^2 + 2k$ for model comparisons. This form is often convenient, because most model-fitting programs produce $\chi^2$ as a statistic for the fit.

Another convenient form arises if the $\sigma_i$ are assumed to be identical and the residual sum of squares (RSS) is available. Then we get $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2k + C$, where again C can be ignored in model comparisons.
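
As an illustrative sketch, assuming identical error variances, the RSS form can be used to compare least-squares polynomial fits in Python (the data, degrees, and the parameter count k, which here includes the error variance, are our assumptions):

    import numpy as np

    def aic_rss(rss, n, k):
        # AIC = n ln(RSS/n) + 2k, dropping the ignorable constant C
        return n * np.log(rss / n) + 2 * k

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 1.0, 50)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)  # truly linear data

    for degree in (1, 2, 3):
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
        k = degree + 2  # degree+1 coefficients plus the error variance
        print(degree, aic_rss(rss, x.size, k))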

Bayesian information criterion

The AIC penalizes the number of parameters less strongly than does the Bayesian information criterion (BIC), which was independently developed by Akaike and by Schwarz in 1978, using Bayesian formalism. Akaike's version of BIC was originally denoted ABIC (for "a Bayesian Information Criterion") or referred to as Akaike's Bayesian Information Criterion.
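
For reference, the usual form of BIC (standard, though not reproduced in the text above) is

$\mathrm{BIC} = k\ln(n) - 2\ln(L)$,

so for $n > e^2 \approx 7.4$ its per-parameter penalty $\ln(n)$ exceeds AIC's penalty of 2.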

A comparison of AIC/AICc and BIC is given by Burnham & Anderson (2002, sect. 6.4). The authors argue that AIC/AICc has theoretical advantages over BIC: first, because AIC/AICc is derived from principles of information; second, because the (Bayesian) derivation of BIC uses a prior of 1/R (where R is the number of candidate models), which is "not sensible", since the prior should be a decreasing function of k. The authors also show that AIC and AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. Additionally, they present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC. See also Burnham & Anderson (2004).

Further comparison of AIC and BIC, in the context of regression, is given by Yang (2005). In particular, AIC is asymptotically optimal in selecting the model with the least mean squared error, under the assumption that the exact "true" model is not in the candidate set (as is virtually always the case in practice); BIC is not asymptotically optimal. Yang further shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.

See also

  • Deviance information criterion
  • Focused information criterion
  • Occam's razor


External links

  • Hirotogu Akaike comments on how he arrived at the AIC, in This Week's Citation Classic (21 December 1981)
  • AIC (Aalto University)
  • Example Calculation (University of Georgia)
  • Model Selection (University of Iowa)