# Logistic regression

In statistics, logistic regression (sometimes called the logistic model or logit model) is used to predict the probability of occurrence of an event by fitting data to a logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription.

## Definition

An explanation of logistic regression begins with an explanation of the logistic function, which, like probabilities, always takes on values between zero and one:

$$f(z) = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}}$$

A graph of the function is shown in figure 1. The input is z and the output is ƒ(z). The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable z represents the exposure to some set of independent variables, while ƒ(z) represents the probability of a particular outcome, given that set of explanatory variables. The variable z is a measure of the total contribution of all the independent variables used in the model and is known as the logit.

The variable z is usually defined as

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k,$$

where $\beta_0$ is called the "intercept" and $\beta_1$, $\beta_2$, $\beta_3$, and so on, are called the "regression coefficients" of $x_1$, $x_2$, $x_3$, respectively. The intercept is the value of z when the values of all independent variables are zero (e.g. the value of z in someone with no risk factors). Each of the regression coefficients describes the size of the contribution of that risk factor. A positive regression coefficient means that the explanatory variable increases the probability of the outcome, while a negative regression coefficient means that the variable decreases the probability of that outcome; a large regression coefficient means that the risk factor strongly influences the probability of that outcome, while a near-zero regression coefficient means that the risk factor has little influence on the probability of that outcome.

Logistic regression is a useful way of describing the relationship between one or more independent variables (e.g., age or sex) and a binary response variable, that is, one with only two values, such as cancer status ("has cancer" or "doesn't have cancer"). The response is expressed as a probability.
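As a minimal sketch of the definitions in this section, the following Python snippet evaluates the linear combination z and passes it through the logistic function. The function names `logistic` and `linear_predictor`, and the coefficient values in the usage lines, are illustrative assumptions and do not come from any particular dataset.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)).

    The two branches are algebraically equal; splitting on the sign of z
    avoids overflow in math.exp for large negative inputs.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_predictor(intercept, coefficients, values):
    """z = intercept + sum of (coefficient * value) over the independent variables."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical intercept, coefficients and inputs, chosen only for illustration.
z = linear_predictor(0.5, [1.0, -2.0], [3.0, 1.0])  # 0.5 + 3.0 - 2.0 = 1.5
p = logistic(z)                                      # a value strictly between 0 and 1
```

Whatever value z takes, the output stays between 0 and 1, which is what allows it to be read as a probability.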

## Sample size-dependent efficiency

Logistic regression tends to systematically overestimate odds ratios or beta coefficients when the sample size is less than about 500. With increasing sample size, the magnitude of overestimation diminishes and the estimated odds ratio asymptotically approaches the true population value. In a single study, overestimation due to small sample size might not have any relevance for the interpretation of the results, since it is much lower than the standard error of the estimate. However, if a number of small studies with systematically overestimated effects are pooled together without consideration of this effect, an effect may be perceived when in reality it does not exist.

A minimum of 10 events per independent variable has been recommended. For example, in a study where death is the outcome of interest, and 50 of 100 patients die, the maximum number of independent variables the model can support is 50/10 = 5.
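The arithmetic of this rule of thumb can be written as a short helper. This is only a sketch: the function name `max_independent_variables` is hypothetical, and the default of 10 events per variable is simply the recommendation quoted above.

```python
def max_independent_variables(n_events, events_per_variable=10):
    """Largest number of independent variables supported under the rule of thumb."""
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    # Integer division: a partial group of events does not support another variable.
    return n_events // events_per_variable

# The example from the text: 50 deaths among 100 patients.
supported = max_independent_variables(50)
print(supported)  # 5
```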

## Example

The application of a logistic regression may be illustrated using a fictitious example of death from heart disease. This simplified model uses only three risk factors (age, sex, and blood cholesterol level) to predict the 10-year risk of death from heart disease. These are the parameters that the data fit:

• $\beta_0 = -5.0$ (the intercept)
• $\beta_1 = +2.0$
• $\beta_2 = -1.0$
• $\beta_3 = +1.2$
• $x_1$ = age in years, less 50
• $x_2$ = sex, where 0 is male and 1 is female
• $x_3$ = cholesterol level, in mmol/L above 5.0

The model can hence be expressed as

$$\text{risk of death} = \frac{1}{1 + e^{-z}}, \quad \text{where } z = -5.0 + 2.0\,x_1 - 1.0\,x_2 + 1.2\,x_3.$$

In this model, increasing age is associated with an increasing risk of death from heart disease (z goes up by 2.0 for every year over the age of 50), female sex is associated with a decreased risk of death from heart disease (z goes down by 1.0 if the patient is female), and increasing cholesterol is associated with an increasing risk of death (z goes up by 1.2 for each 1 mmol/L increase in cholesterol above 5 mmol/L).

We wish to use this model to predict a particular subject's risk of death from heart disease: he is 50 years old and his cholesterol level is 7.0 mmol/L. The subject's risk of death is therefore

$$\frac{1}{1 + e^{-z}}, \quad \text{where } z = -5.0 + (2.0)(50 - 50) - (1.0)(0) + (1.2)(7.0 - 5.0) = -2.6.$$

This means that by this model, the subject's risk of dying from heart disease in the next 10 years is 0.07 (or 7%).
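The calculation can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it uses the coefficients quoted above (2.0 per year over 50, −1.0 for female sex, 1.2 per mmol/L above 5) together with an intercept of −5.0, the value that reproduces the 7% figure; the function name `ten_year_risk` is hypothetical.

```python
import math

INTERCEPT = -5.0      # assumed intercept; it reproduces the 7% result in the text
B_AGE = 2.0           # per year of age above 50
B_SEX = -1.0          # applied when the subject is female
B_CHOLESTEROL = 1.2   # per mmol/L of cholesterol above 5.0

def ten_year_risk(age_years, is_female, cholesterol_mmol_l):
    """Fictitious 10-year risk of death from heart disease."""
    z = (INTERCEPT
         + B_AGE * (age_years - 50)
         + B_SEX * (1 if is_female else 0)
         + B_CHOLESTEROL * (cholesterol_mmol_l - 5.0))
    return 1.0 / (1.0 + math.exp(-z))

# The subject from the text: male, 50 years old, cholesterol 7.0 mmol/L (z = -2.6).
risk = ten_year_risk(50, False, 7.0)
print(round(risk, 2))  # 0.07
```

Changing only the sex argument to `True` lowers z by 1.0 and therefore lowers the predicted risk, in line with the negative coefficient for female sex.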

## Formal mathematical specification

Logistic regression analyzes binomially distributed data of the form

$$Y_i \sim B(n_i, p_i), \quad i = 1, \dots, m,$$

where the numbers of Bernoulli trials $n_i$ are known and the probabilities of success $p_i$ are unknown. An example of this distribution is the fraction of seeds ($p_i$) that germinate after $n_i$ are planted.

The model proposes that for each trial i there is a set of explanatory variables that might inform the final probability. These explanatory variables can be thought of as being in a k-dimensional vector $X_i$, and the model then takes the form

$$p_i = \operatorname{E}\left(\left.\frac{Y_i}{n_i}\,\right|\,X_i\right).$$

The logits, that is, the natural logs of the odds, of the unknown binomial probabilities are modeled as a linear function of the $X_i$:

$$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}.$$

Note that a particular element of $X_i$ can be set to 1 for all i to yield an intercept in the model. The unknown parameters $\beta_j$ are usually estimated by maximum likelihood using a method common to all generalized linear models. The maximum likelihood estimates can be computed numerically by using iteratively reweighted least squares.
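To make the estimation step concrete, here is a minimal pure-Python sketch of iteratively reweighted least squares (equivalently, Newton-Raphson on the log-likelihood). It is deliberately restricted to a simplified setting that is an assumption of this example: one explanatory variable plus an intercept, and one Bernoulli trial per observation ($n_i = 1$). The function name `fit_logistic_irls` and the toy data are hypothetical.

```python
import math

def fit_logistic_irls(xs, ys, iterations=25, tol=1e-10):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson / IRLS."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Current fitted probabilities and IRLS weights w = p * (1 - p).
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        ws = [p * (1.0 - p) for p in ps]
        # Score vector X^T (y - p).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix X^T W X (2 x 2 and symmetric).
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * delta = g in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data, invented for illustration; the two outcomes overlap in x,
# so the maximum likelihood estimate is finite.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic_irls(xs, ys)
```

At convergence the score vector is zero, so the fitted probabilities sum to the observed number of successes. If the outcomes were perfectly separated by x, the estimates would grow without bound and this sketch would not converge.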

The interpretation of the $\beta_j$ parameter estimates is as the additive effect on the log of the odds for a unit change in the jth explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^{\beta_j}$ is the estimate of the odds ratio of having the outcome for, say, males compared with females.
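This interpretation can be checked numerically. In the sketch below the intercept and coefficient are hypothetical values chosen for illustration; the point is only that a unit change in the explanatory variable adds the coefficient to the log-odds and multiplies the odds by its exponential.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favour of an event that has probability p."""
    return p / (1.0 - p)

# Hypothetical intercept and coefficient for a 0/1 explanatory variable.
beta_0, beta_j = -1.0, 0.7

p_at_0 = logistic(beta_0)            # probability when the variable is 0
p_at_1 = logistic(beta_0 + beta_j)   # probability when the variable is 1

odds_ratio = odds(p_at_1) / odds(p_at_0)
log_odds_change = math.log(odds(p_at_1)) - math.log(odds(p_at_0))

# Up to rounding, odds_ratio equals exp(beta_j) and log_odds_change equals beta_j.
```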

The model has an equivalent formulation

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}.$$

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of $p_i$ with respect to $X = (x_1, \dots, x_k)$ is computed from the general form

$$y = \frac{1}{1 + e^{-f(X)}},$$

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

$$\frac{dy}{dX} = y(1 - y)\frac{df}{dX}.$$
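The identity dy/dX = y(1 − y) df/dX can be checked against a finite-difference approximation. The sketch below assumes a single input and a linear f(x) = B0 + B1·x, so that df/dx = B1; the coefficient values, the evaluation point and the function names are all hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients of a linear f(x) = B0 + B1 * x.
B0, B1 = -1.0, 0.8

def f(x):
    return B0 + B1 * x

def y(x):
    return logistic(f(x))

def dy_dx_analytic(x):
    """Chain rule: dy/dx = y * (1 - y) * df/dx, with df/dx = B1."""
    return y(x) * (1.0 - y(x)) * B1

def dy_dx_numeric(x, h=1e-6):
    """Central finite-difference approximation of dy/dx."""
    return (y(x + h) - y(x - h)) / (2.0 * h)

x0 = 0.5
gap = abs(dy_dx_analytic(x0) - dy_dx_numeric(x0))  # should be very small
```

Because the derivative is expressed in terms of the output y itself, it can be obtained during training without evaluating the exponential a second time.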

## Extensions

Extensions of the model cope with dependent variables with more than two values, also called polytomous regression. Ordered logistic regression handles ordinal dependent variables (ordered values). Multinomial logistic regression handles nominal dependent variables (unordered values, also called "classification"). An extension of the logistic model to sets of interdependent variables is the conditional random field.

## Model accuracy

One way to test for errors in models created by stepwise regression is not to rely on the model's F-statistic, significance, or multiple R, but instead to assess the model against a set of data that was not used to create it. This class of techniques is called cross-validation.

Accuracy is measured as correctly classified records in the holdout sample. There are four possible classifications:
1. prediction of 0 when the holdout sample has a 0 (true negative)
2. prediction of 0 when the holdout sample has a 1 (false negative)
3. prediction of 1 when the holdout sample has a 0 (false positive)
4. prediction of 1 when the holdout sample has a 1 (true positive)

The percentage of correctly classified observations in the holdout sample is referred to as the assessed model accuracy. Accuracy can additionally be expressed as the model's ability to correctly classify 0s, or its ability to correctly classify 1s, in the holdout dataset. The holdout model assessment method is particularly valuable when data are collected in different settings (e.g., at different times or places) or when models are assumed to be generalizable.
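The four classifications and the accuracy measures described above can be tallied as follows. This is a minimal sketch: the function name `confusion_counts` is hypothetical and the holdout labels and predictions are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four possible outcomes for 0/1 labels in a holdout sample."""
    tn = fn = fp = tp = 0
    for a, p in zip(actual, predicted):
        if p == 0 and a == 0:
            tn += 1   # true negative
        elif p == 0 and a == 1:
            fn += 1   # false negative
        elif p == 1 and a == 0:
            fp += 1   # false positive
        else:
            tp += 1   # true positive
    return tn, fn, fp, tp

# Invented holdout data: actual outcomes and the model's 0/1 predictions.
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fn, fp, tp = confusion_counts(actual, predicted)
accuracy = (tn + tp) / len(actual)   # share of all records classified correctly
accuracy_on_0 = tn / (tn + fp)       # ability to correctly classify 0s
accuracy_on_1 = tp / (tp + fn)       # ability to correctly classify 1s
print(accuracy)  # 0.75
```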

## See also

• Logistic function
• Sigmoid function
• Artificial neural network
• Data mining
• Jarrow–Turnbull model
• Limited dependent variable
• Linear discriminant analysis
• Multinomial logit model
• Ordered logit
• Perceptron
• Principle of maximum entropy
• Probit model
• Variable rules analysis
• Hosmer–Lemeshow test
• Separation (statistics)