In statistics and decision theory, a loss function is a function that maps an event onto a real number intuitively representing some "cost" associated with the event. Typically it is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. In the context of economics, for example, this is usually economic cost or regret. In machine learning, it is the penalty for an incorrect classification of an example.
Definition
Formally, we begin by considering some family of distributions for a random variable X that is indexed by some θ.
More intuitively, we can think of X as our "data", perhaps $X = (X_1, \ldots, X_n)$, where $X_i \sim F_\theta$ i.i.d. X is the set of things the decision rule will be making decisions on. There exist some number of possible ways $F_\theta$ to model our data X, which our decision function can use to make decisions. For a finite number of models, we can thus think of θ as the index to this family of probability models. For an infinite family of models, it is a set of parameters to the family of distributions.
On a more practical note, it is tempting to think of loss functions as necessarily parametric, since they seem to take θ as a "parameter". However, θ need not be finite-dimensional, which is incompatible with this notion; for example, if the family of probability functions is uncountably infinite, θ indexes an uncountably infinite space.
From here, given a set A of possible actions, a decision rule is a function $\delta : \mathcal{X} \to A$, where $\mathcal{X}$ denotes the set of possible values of X.
A loss function is a real, lower-bounded function L on Θ × A. For θ ∈ Θ, the value L(θ, δ(X)) is the cost of action δ(X) under parameter θ.
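The definitions above can be made concrete with a minimal sketch. It assumes a squared-error loss and a sample-mean decision rule, neither of which is prescribed by the definition; the data values and the "true" parameter are hypothetical.

```python
from statistics import mean


def squared_error_loss(theta: float, action: float) -> float:
    """L(theta, a): cost of taking action a when the parameter is theta."""
    return (theta - action) ** 2


def sample_mean_rule(x: list[float]) -> float:
    """delta(x): map observed data x to an action (here, a point estimate)."""
    return mean(x)


x = [1.0, 2.0, 6.0]            # hypothetical observed data
theta_true = 2.0               # hypothetical true parameter
action = sample_mean_rule(x)   # delta(x)
loss = squared_error_loss(theta_true, action)  # L(theta, delta(x))
```

Here the action space A is the real line, and the loss is bounded below by zero, as the definition requires.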
Decision rules
A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:
- Minimax: Choose the decision rule with the lowest worst loss, that is, minimize the worst-case (maximum possible) loss: $\arg\min_{\delta} \max_{\theta \in \Theta} R(\theta, \delta)$, where $R(\theta, \delta)$ is the risk (expected loss) of the rule δ under θ.
- Invariance: Choose the optimal decision rule which satisfies an invariance requirement.
- Minimize the expected value of the loss function.
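Two of these criteria can be compared on a small finite problem. The sketch below is illustrative only: the risk table, the rule names, and the weights used to average over θ are all hypothetical, and averaging over θ with fixed weights is one assumed way of reading "minimize the expected value".

```python
# Hypothetical risk table: risk[rule][theta] is the risk of a rule under theta.
risk = {
    "rule_a": {"theta_1": 1.0, "theta_2": 9.0},
    "rule_b": {"theta_1": 4.0, "theta_2": 5.0},
}
# Hypothetical weights over theta, used only for the averaging criterion.
weights = {"theta_1": 0.9, "theta_2": 0.1}


def worst_case(rule: str) -> float:
    """Maximum risk of a rule over all values of theta."""
    return max(risk[rule].values())


def weighted_average(rule: str) -> float:
    """Risk of a rule averaged over theta with the assumed weights."""
    return sum(weights[t] * r for t, r in risk[rule].items())


minimax_rule = min(risk, key=worst_case)         # smallest worst-case risk
average_rule = min(risk, key=weighted_average)   # smallest average risk
```

With these numbers the two criteria disagree: `rule_b` has the smaller worst case (5 versus 9), while `rule_a` has the smaller weighted average (1.8 versus 4.1).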
Expected loss
The value of the loss function itself is a random quantity because it depends on the outcome of a random variable X. Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.
Frequentist risk
The expected loss in the frequentist context is obtained by taking the expected value with respect to the probability distribution, $P_\theta$, of the observed data, X. This is also referred to as the risk function of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by
$$R(\theta, \delta) = \mathbb{E}_\theta\, L(\theta, \delta(X)) = \int_{\mathcal{X}} L(\theta, \delta(x)) \, dP_\theta(x).$$
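As a sketch of this expectation, the risk can be computed exactly when X takes finitely many values. The example assumes that X is the number of successes in n Bernoulli(θ) trials, that the decision rule is the sample proportion, and that the loss is squared error; these choices and the function names are illustrative assumptions.

```python
from math import comb


def proportion_rule(x: int, n: int) -> float:
    """delta(x): estimate theta by the observed proportion of successes."""
    return x / n


def frequentist_risk(theta: float, n: int) -> float:
    """Expected squared-error loss of proportion_rule under Binomial(n, theta)."""
    total = 0.0
    for x in range(n + 1):
        p_x = comb(n, x) * theta**x * (1 - theta) ** (n - x)  # P_theta(X = x)
        total += (theta - proportion_rule(x, n)) ** 2 * p_x
    return total


r = frequentist_risk(0.3, 10)
```

For this rule and loss the sum equals the variance of the sample proportion, θ(1 − θ)/n, which is 0.021 for θ = 0.3 and n = 10.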
Bayesian expected loss
In a Bayesian approach, the expectation is calculated using the posterior distribution π* of the parameter θ:
$$\rho(\pi^*, a) = \int_\Theta L(\theta, a) \, d\pi^*(\theta).$$
One then should choose the action a* which minimizes the expected loss. Although this will result in choosing the same action as would be chosen using the Bayes risk, the emphasis of the Bayesian approach is that one is only interested in choosing the optimal action under the actual observed data, whereas choosing the actual Bayes optimal decision rule, which is a function of all possible observations, is a much more difficult problem.
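When the posterior is supported on finitely many values of θ, the expected loss becomes a weighted sum and a* can be found by direct search. The posterior weights, the candidate actions, and the use of absolute loss in this sketch are hypothetical choices made only for illustration.

```python
# Hypothetical discrete posterior: theta -> posterior probability pi*(theta).
posterior = {0.2: 0.1, 0.5: 0.6, 0.8: 0.3}
actions = [0.2, 0.5, 0.8]  # hypothetical candidate actions


def loss(theta: float, a: float) -> float:
    """Absolute loss, chosen only for illustration."""
    return abs(theta - a)


def posterior_expected_loss(a: float) -> float:
    """Sum of L(theta, a) weighted by the posterior probability of theta."""
    return sum(p * loss(theta, a) for theta, p in posterior.items())


best_action = min(actions, key=posterior_expected_loss)
```

Only the observed data enter, through the posterior; no rule for unobserved data sets has to be constructed.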
Selecting a loss function
Sound statistical practice requires selecting an estimator consistent with the actual loss experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances, which results in the introduction of an element of teleology into problems of scientific decision-making.
A common example involves estimating "location." Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the Taguchi or squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
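An empirical analogue of this statement can be checked numerically: over a sample, the total squared error is smallest at the sample mean and the total absolute error is smallest at the sample median. The data set and the search grid below are hypothetical, and a grid search stands in for an analytic argument.

```python
from statistics import mean, median

data = [1.0, 2.0, 3.0, 4.0, 100.0]             # hypothetical sample with one outlier
candidates = [i / 10 for i in range(0, 1001)]  # grid of location estimates, 0.0 to 100.0


def total_squared_error(c: float) -> float:
    """Sum of squared differences between the data and the estimate c."""
    return sum((x - c) ** 2 for x in data)


def total_absolute_error(c: float) -> float:
    """Sum of absolute differences between the data and the estimate c."""
    return sum(abs(x - c) for x in data)


best_squared = min(candidates, key=total_squared_error)
best_absolute = min(candidates, key=total_absolute_error)
```

On this sample the squared-error minimizer coincides with `mean(data)` and the absolute-error minimizer with `median(data)`, so the choice of loss function changes the location estimate considerably.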
In economics, when an agent is risk neutral, the loss function is simply expressed in monetary terms, such as profit, income, or end-of-period wealth. But for risk-averse (or risk-loving) agents, loss is measured as the negative of a utility function, which represents satisfaction and is usually interpreted in ordinal terms rather than in cardinal (absolute) terms.
Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.
For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.
Two very commonly used loss functions are the squared loss, $L(a) = a^2$, and the absolute loss, $L(a) = |a|$. However, the absolute loss has the disadvantage that it is not differentiable at $a = 0$. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of values $a_i$ (as in $\sum_{i=1}^{n} L(a_i)$), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.
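The effect of a single outlier on the two sums can be seen in a short sketch. The error values are hypothetical; the point is only the share of each total that comes from the largest value.

```python
errors = [1.0, -1.0, 2.0, -2.0, 20.0]  # hypothetical a-values, one outlier

squared_total = sum(a**2 for a in errors)     # sum of squared losses
absolute_total = sum(abs(a) for a in errors)  # sum of absolute losses

outlier = max(errors, key=abs)
squared_share = outlier**2 / squared_total     # fraction of squared total due to outlier
absolute_share = abs(outlier) / absolute_total # fraction of absolute total due to outlier
```

With these values the outlier accounts for 400 of the 410 units of squared loss, but only 20 of the 26 units of absolute loss.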
Loss functions in Bayesian statistics
One of the consequences of Bayesian inference is that the experimental data and the loss function do not by themselves wholly determine a decision. What is important is the relationship between the loss function and the prior probability. So it is possible to have two different loss functions which lead to the same decision when the prior probability distributions associated with each compensate for the details of each loss function.
Combining the three elements of the prior probability, the data, and the loss function then allows decisions to be based on maximizing the subjective expected utility, a concept introduced by Leonard J. Savage.
Regret
Savage also argued that, when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been taken had the underlying circumstances been known and the decision that was in fact taken before they were known.
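Regret can be computed from a table of losses by subtracting, for each underlying state, the loss of the best action for that state. The table, the action names, and the state names in this sketch are hypothetical.

```python
# Hypothetical loss table: loss[action][state].
loss = {
    "a1": {"s1": 0.0, "s2": 10.0},
    "a2": {"s1": 6.0, "s2": 7.0},
    "a3": {"s1": 9.0, "s2": 5.0},
}
states = ["s1", "s2"]

# Loss of the best possible action in each state, had the state been known.
best_loss = {s: min(loss[a][s] for a in loss) for s in states}

# Regret: loss of the action actually taken minus the best achievable loss.
regret = {a: {s: loss[a][s] - best_loss[s] for s in states} for a in loss}

minimax_loss_action = min(loss, key=lambda a: max(loss[a].values()))
minimax_regret_action = min(regret, key=lambda a: max(regret[a].values()))
```

In this table minimax applied to the raw losses selects `a2` (worst case 7), whereas minimax applied to regret selects `a1` (worst-case regret 5), so basing the loss on regret can change the decision.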
Quadratic loss function
The use of a quadratic loss function is common, for example when using least squares techniques or Taguchi methods. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is $L(x) = C(t - x)^2$ for some constant C > 0; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1.
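One way to see the role of variances is to take the expectation of the quadratic loss when the outcome is a random variable. The following is a standard identity, written here with an outcome $X$ having mean $\mu$ and variance $\sigma^2$; these symbols are introduced only for this illustration.

```latex
\begin{aligned}
\mathbb{E}\!\left[ C (t - X)^2 \right]
  &= C \, \mathbb{E}\!\left[ \big( (X - \mu) + (\mu - t) \big)^2 \right] \\
  &= C \left( \mathbb{E}\!\left[ (X - \mu)^2 \right]
      + 2 (\mu - t) \, \mathbb{E}[X - \mu]
      + (\mu - t)^2 \right) \\
  &= C \left( \sigma^2 + (\mu - t)^2 \right),
\end{aligned}
```

since $\mathbb{E}[X - \mu] = 0$. The expected loss therefore splits into a variance term and a squared deviation of the mean from the target, and a positive constant C rescales both terms without changing which choice minimizes them.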
Many common statistical methods, including t-tests, regression models, and design of experiments, use least squares methods from linear model theory, which is based on the quadratic (Taguchi) loss function.
The quadratic loss function is also used in linear-quadratic optimal control problems.