*This article is about frequentist hypothesis testing, which is taught in introductory statistics. For Bayesian hypothesis testing, see Bayesian inference.*

A **statistical hypothesis test** is a method of making decisions using data, whether from a controlled experiment or an observational study (not controlled). In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "*test of significance*" was coined by Ronald Fisher: "Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first."

Hypothesis testing is sometimes called **confirmatory data analysis**, in contrast to exploratory data analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests (i.e., tests that answer the question *Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?*). One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.

A result that was found to be statistically significant is also called a **positive result**; conversely, a result that is not unlikely under the null hypothesis is called a **negative result** or a **null result**.

Statistical hypothesis testing is a key technique of frequentist statistical inference. The Bayesian approach to hypothesis testing is to base rejection of the hypothesis on the posterior probability. Other approaches to reaching a decision based on data are available via decision theory and optimal decisions.

The *critical region* of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. The critical region is usually denoted by the letter *C*.

### Example 1 – Courtroom trial

A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough incriminating evidence is the defendant convicted.

At the start of the procedure, there are two hypotheses, $H_0$: "the defendant is not guilty", and $H_1$: "the defendant is guilty". The first one is called the *null hypothesis*, and is for the time being accepted. The second one is called the *alternative (hypothesis)*. It is the hypothesis one tries to prove.

The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn't want to convict an innocent defendant. Such an error is called an *error of the first kind* (i.e. the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, the probability of an *error of the second kind* (acquitting a person who committed the crime) is often rather large.

| | Null hypothesis ($H_0$) is true: he or she truly is not guilty | Alternative hypothesis ($H_1$) is true: he or she truly is guilty |
|---|---|---|
| Accept null hypothesis: acquittal | Right decision | Wrong decision: Type II error |
| Reject null hypothesis: conviction | Wrong decision: Type I error | Right decision |

A criminal trial can be regarded as either or both of two decision processes:
guilty vs not guilty or evidence vs a threshold ("beyond a reasonable
doubt"). In one view, the defendant is judged; in the other view the
performance of the prosecution (which bears the burden of proof) is
judged. A hypothesis test can be regarded as either a judgment of a
hypothesis or as a judgment of evidence.

### Example 2 – Clairvoyant card game

A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called *X*.

As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is, of course: the person is (more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly *p*. The hypotheses, then, are:

- null hypothesis: $H_0: p = \tfrac14$ (just guessing)

and

- alternative hypothesis: $H_1: p > \tfrac14$ (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. The same holds for 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, *c*, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value *c*? It is obvious that with the choice *c* = 25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with *c* = 10. In the first case almost no test subjects will be recognized to be clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind (a false positive, or Type I error). With *c* = 25 the probability of such an error is:

$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge 25 \mid p = \tfrac14) = \left(\tfrac14\right)^{25} \approx 10^{-15},$
and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.

Being less critical, with *c* = 10, gives:

$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge 10 \mid p = \tfrac14) = \sum_{k=10}^{25} P(X = k \mid p = \tfrac14) \approx 0.07.$

Thus, *c* = 10 yields a much greater probability of false positive.

Before the test is actually performed, the desired probability of a Type I error is determined. Typically, values in the range of 1% to 5% are selected. Depending on this desired Type I error rate, the critical value *c* is calculated. For example, if we select an error rate of 1%, *c* is calculated thus:

$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge c \mid p = \tfrac14) \le 0.01.$
From all the numbers *c* with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select $c = 13$, because $P(X \ge 12 \mid p = \tfrac14) \approx 0.0107$ still exceeds 1%, whereas $P(X \ge 13 \mid p = \tfrac14) \approx 0.0034$ does not.

But what if the subject did not guess any cards correctly at all? Having zero correct answers is clearly an oddity too. The probability of guessing incorrectly once is equal to *p*' = (1 − *p*) = 3/4. Using the same approach we can calculate that the probability of randomly calling all 25 cards wrong is:

$P(X = 0 \mid H_0 \text{ is valid}) = P(\text{25 misses} \mid p' = \tfrac34) = \left(\tfrac34\right)^{25} \approx 0.00075.$
This is highly unlikely (less than a 1 in 1000 chance). While the subject can't guess the cards correctly, dismissing $H_0$ in favour of $H_1$ would be an error. In fact, the result would suggest a trait on the subject's part of avoiding calling the correct card. A test of this could be formulated: for a selected 1% error rate the subject would have to answer correctly at least twice for us to believe that card calling is based purely on guessing.
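The tail probabilities in this example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binomial_tail` and `critical_value` are illustrative choices, not part of the original exposition.

```python
from math import comb

def binomial_tail(n: int, c: int, p: float) -> float:
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(n: int, p: float, alpha: float) -> int:
    """Smallest c with P(X >= c) <= alpha under the null hypothesis."""
    for c in range(n + 1):
        if binomial_tail(n, c, p) <= alpha:
            return c
    return n + 1  # no cutoff reaches alpha, so the test never rejects

n, p0 = 25, 0.25
print(binomial_tail(n, 25, p0))     # all 25 correct by pure guessing
print(binomial_tail(n, 10, p0))     # about 0.07
print(critical_value(n, p0, 0.01))  # smallest cutoff meeting the 1% level
```

Scanning upward from zero and stopping at the first cutoff that meets the level gives the smallest critical value, which is the choice that keeps the Type II error rate as low as the chosen level allows.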

### Example 3 – Radioactive suitcase

As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute and a standard deviation of 1 count per minute, then we say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute and a standard deviation of 1 count per minute, then the suitcase is not compatible with the null hypothesis, and there are likely other factors responsible for producing the measurements.

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
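The two suitcase scenarios can be compared with a rough calculation. The sketch below assumes the background count is approximately normally distributed with the stated mean and standard deviation; that model is an assumption made here for illustration (the text only gives a mean and a standard deviation), and `upper_tail_p_value` is a hypothetical helper name.

```python
from statistics import NormalDist

def upper_tail_p_value(observed: float, null_mean: float, null_sd: float) -> float:
    """P(count >= observed) under a normal model of the background (an assumption)."""
    z = (observed - null_mean) / null_sd
    return 1.0 - NormalDist().cdf(z)

# Background of 9 counts/min: the reading is 1 standard deviation above the mean.
print(upper_tail_p_value(10, 9, 1))  # about 0.16, compatible with the null hypothesis
# Background of 3 counts/min: the reading is 7 standard deviations above the mean.
print(upper_tail_p_value(10, 3, 1))  # vanishingly small, not compatible
```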

### Example 4 – Lady tasting tea

The following example is summarized from Fisher, and is known as the *Lady tasting tea* example.

Fisher thoroughly explained his method in a proposed experiment to test a Lady's claimed ability to determine the means of tea preparation by taste. The article is less than 10 pages in length and is notable for its simplicity and completeness regarding terminology, calculations and design of the experiment. The example is loosely based on an event in Fisher's life. The Lady proved him wrong.

- The experiment provided the Lady with 8 randomly ordered cups of tea: 4 prepared by first adding milk, 4 prepared by first adding the tea. She was to select the 4 cups prepared by one method.
- This offered the Lady the advantage of judging cups by comparison.
- The Lady was fully informed of the experimental method.
- The null hypothesis was that the Lady had no such ability.
- The test statistic was a simple count of the number of successes in selecting the 4 cups.
- The null hypothesis distribution was computed by the number of permutations. The number of selected permutations equaled the number of unselected permutations.

Tea-tasting distribution

| Success count | Permutations of selection | Number of permutations |
|---|---|---|
| 0 | oooo | 1 × 1 = 1 |
| 1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
| 2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
| 3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
| 4 | xxxx | 1 × 1 = 1 |
| Total | | 70 |

- The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (< 5%; 1 of 70 ≈ 1.4%).
- Fisher asserted that no alternative hypothesis was (ever) required.

If and only if the Lady properly categorized all 8 cups was Fisher willing to reject the null hypothesis, effectively acknowledging the Lady's ability with > 98% confidence (but without quantifying her ability). Fisher later discussed the benefits of more trials and repeated tests.
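The counts in the tea-tasting distribution can be reproduced by counting combinations. This is a small sketch with the standard library; the variable names are illustrative.

```python
from math import comb

cups_per_method = 4  # 4 milk-first cups and 4 tea-first cups

# Ways to pick k correct cups and (4 - k) incorrect cups.
counts = [comb(cups_per_method, k) * comb(cups_per_method, cups_per_method - k)
          for k in range(cups_per_method + 1)]
total = sum(counts)

print(counts)              # [1, 16, 36, 16, 1]
print(total)               # 70
print(counts[-1] / total)  # about 0.014, the chance of 4 successes by guessing
```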

## The testing process

In the statistical literature, statistical hypothesis testing plays a fundamental role. The usual line of reasoning is as follows:

- We start with a research hypothesis of which the truth is unknown.
- The first step is to state the relevant **null and alternative hypotheses**. This is important as mis-stating the hypotheses will muddy the rest of the process. Specifically, the null hypothesis should be chosen in such a way that it allows us to conclude whether the alternative hypothesis can either be accepted or stays undecided as it was before the test.
- The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
- Decide which test is appropriate, and state the relevant **test statistic** $T$.
- Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example the test statistic may follow a Student's t distribution or a normal distribution.
- The distribution of the test statistic partitions the possible values of $T$ into those for which the null hypothesis is rejected, the so-called critical region, and those for which it is not.
- Compute from the observations the observed value $t_{\text{obs}}$ of the test statistic $T$.
- Decide to either **fail to reject** the null hypothesis or **reject** it in favor of the alternative. The decision rule is to reject the null hypothesis $H_0$ if the observed value $t_{\text{obs}}$ is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.
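The steps above can be sketched for one concrete case. The example below assumes an upper-tailed one-sample z-test with a known population standard deviation and independent, normally distributed observations; the function name, the sample values, and the chosen level of 5% are all hypothetical and serve only to illustrate the procedure.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Upper-tailed z-test of H0: mu = mu0 against H1: mu > mu0, sigma known."""
    n = len(sample)
    # Observed value of the test statistic T.
    t_obs = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Under H0 (independent normal observations), T is standard normal,
    # so the critical region is everything above the (1 - alpha) quantile.
    critical_value = NormalDist().inv_cdf(1 - alpha)
    decision = "reject H0" if t_obs > critical_value else "fail to reject H0"
    return t_obs, critical_value, decision

# Hypothetical measurements, invented for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.4, 5.7, 5.2]
print(one_sample_z_test(sample, mu0=5.0, sigma=0.5))
```

Fixing the critical value before looking at the observed statistic mirrors the order of the steps: the critical region depends only on the null distribution and the chosen level, not on the data.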

It is important to note the philosophical difference between accepting the null hypothesis and simply failing to reject it. The "fail to reject" terminology highlights the fact that the null hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it, it simply continues to be assumed true. The phrase "accept the null hypothesis" may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where its meaning is well understood.

Alternatively, if the testing procedure forces us to reject the null hypothesis (H-null), we can accept the alternative hypothesis (H-alt) and we conclude that the research hypothesis is supported by the data. This conclusion remains probabilistic: we accept that using another set of data could lead us to a different conclusion.

## Definition of terms

The following definitions are mainly based on the exposition in the book by Lehmann and Romano:
- **Statistical hypothesis**: A statement about the parameters describing a population (not a sample).
- **Statistic**: A value calculated from a sample, often to summarize the sample for comparison purposes.
- **Simple hypothesis**: Any hypothesis which specifies the population distribution completely.
- **Composite hypothesis**: Any hypothesis which does *not* specify the population distribution completely.
- **Null hypothesis**: A simple hypothesis associated with a contradiction to a theory one would like to prove.
- **Alternative hypothesis**: A hypothesis (often composite) associated with a theory one would like to prove.
- **Statistical test**: A decision function that takes its values in the set of hypotheses.
- **Region of acceptance**: The set of values of the test statistic for which we fail to reject the null hypothesis.
- **Region of rejection / Critical region**: The set of values of the test statistic for which the null hypothesis is rejected.
- **Power of a test** (1 − *β*): The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, *β*.
- **Size / Significance level of a test** (*α*): For simple hypotheses, this is the test's probability of *incorrectly* rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis.
- **p-value**: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
- **Statistical significance test**: A predecessor to the statistical hypothesis test. An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version which is now part of statistical hypothesis testing.
- **Similar test**: When testing hypotheses concerning a subset of the parameters describing the distribution of the observed random variables, a similar test is one whose distribution, under the null hypothesis, is independent of the nuisance parameters (the ones not being tested).

A statistical hypothesis test compares a test statistic (*z* or *t*, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
- **Most powerful test**: For a given *size* or *significance level*, the test with the greatest power.
- **Uniformly most powerful test (UMP)**: A test with the greatest *power* for all values of the parameter being tested.
- **Consistent test**: When considering the properties of a test as the sample size grows, a test is said to be consistent if, for a fixed size of test, the power against any fixed alternative approaches 1 in the limit.
- **Unbiased test**: For a specific alternative hypothesis, a test is said to be **unbiased** when the probability of rejecting the null hypothesis is not less than the significance level when the alternative is true *and* is less than or equal to the significance level when the null hypothesis is true.
- **Conservative test**: A test is conservative if, when constructed for a given nominal significance level, the true probability of *incorrectly* rejecting the null hypothesis is never greater than the nominal level.
- **Uniformly most powerful unbiased (UMPU)**: A test which is UMP in the set of all unbiased tests.
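To make size and power concrete, the sketch below evaluates both for a binomial test like the card-guessing example: 25 trials, rejecting the null hypothesis *p* = 1/4 when there are at least 13 hits. The cutoff of 13 and the alternative *p* = 1/2 are assumptions chosen here for illustration, and `rejection_probability` is a hypothetical helper name.

```python
from math import comb

def rejection_probability(n: int, c: int, p: float) -> float:
    """P(X >= c) for X ~ Binomial(n, p): the chance the test rejects H0."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

n, c = 25, 13                               # critical region: X >= 13
size = rejection_probability(n, c, 0.25)    # alpha, evaluated under H0: p = 1/4
power = rejection_probability(n, c, 0.50)   # 1 - beta, at the assumed alternative p = 1/2
beta = 1 - power                            # false negative rate at that alternative

print(f"size={size:.4f} power={power:.4f} beta={beta:.4f}")
```

Because the alternative hypothesis is composite, the power is a function of the true *p*; the single value printed here describes only the one alternative that was assumed.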

## Interpretation

The direct interpretation is that if the p-value is less than the required significance level, then we say the null hypothesis is rejected at the given level of significance. Criticism of this interpretation can be found in the corresponding section.

## Common test statistics

In the table below, the symbols used are defined at the bottom of the table. Many other tests can be found in other articles.

| Name | Formula | Assumptions or notes |
|---|---|---|
| One-sample z-test | $z = \frac{\overline{x} - \mu_0}{\sigma}\sqrt{n}$ | (Normal population **or** *n* > 30) **and** σ known. (*z* is the distance from the mean in relation to the standard deviation of the mean.) For non-normal distributions it is possible to calculate a minimum proportion of a population that falls within *k* standard deviations for any *k* (see *Chebyshev's inequality*). |
| Two-sample z-test | $z = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ | Normal population **and** independent observations **and** $\sigma_1$ and $\sigma_2$ are known |
| One-sample t-test | $t = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}$, $df = n - 1$ | (Normal population **or** *n* > 30) **and** σ unknown |
| Paired t-test | $t = \frac{\overline{d} - d_0}{s_d / \sqrt{n}}$, $df = n - 1$ | (Normal population of differences **or** *n* > 30) **and** σ unknown |
| Two-sample pooled t-test, equal variances\* | $t = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, $df = n_1 + n_2 - 2$ [NIST handbook: Two-Sample t-Test for Equal Means] | (Normal populations **or** $n_1 + n_2 > 40$) **and** independent observations **and** $\sigma_1 = \sigma_2$ unknown |
| Two-sample unpooled t-test, unequal variances\* | $t = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, $df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}$ | (Normal populations **or** $n_1 + n_2 > 40$) **and** independent observations **and** $\sigma_1 \ne \sigma_2$ both unknown |
| One-proportion z-test | $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)}}\sqrt{n}$ | $n p_0 > 10$ **and** $n(1 - p_0) > 10$ **and** it is a SRS (simple random sample), see notes. |
| Two-proportion z-test, pooled for $d_0 = 0$ | $z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ | $n_1 p_1 > 5$ **and** $n_1(1 - p_1) > 5$ **and** $n_2 p_2 > 5$ **and** $n_2(1 - p_2) > 5$ **and** independent observations, see notes. |
| Two-proportion z-test, unpooled for $\lvert d_0 \rvert > 0$ | $z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$ | $n_1 p_1 > 5$ **and** $n_1(1 - p_1) > 5$ **and** $n_2 p_2 > 5$ **and** $n_2(1 - p_2) > 5$ **and** independent observations, see notes. |
| Chi-squared test for variance | $\chi^2 = (n - 1)\frac{s^2}{\sigma_0^2}$ | |
| Chi-squared test for goodness of fit | $\chi^2 = \sum^k \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ | *df = k − 1 − # parameters estimated*, and one of these must hold: all expected counts are at least 5; or all expected counts are > 1 and no more than 20% of expected counts are less than 5. [citation needed] |
| \*Two-sample F test for equality of variances | $F = \frac{s_1^2}{s_2^2}$ | Arrange so $s_1^2 \ge s_2^2$ and reject $H_0$ for $F > F(\alpha/2, n_1 - 1, n_2 - 1)$ |
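As an illustration of how two of the tabulated formulas translate into code, the sketch below computes the pooled and unpooled two-sample *t* statistics with their degrees of freedom. It uses only the Python standard library; the function names and the sample data are hypothetical, and the code only evaluates the statistics (it does not look up critical values).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x1, x2, d0=0.0):
    """Two-sample pooled t statistic and degrees of freedom (equal variances)."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def unpooled_t(x1, x2, d0=0.0):
    """Two-sample unpooled t statistic and approximate df (unequal variances)."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2  # s_i^2 / n_i
    t = (mean(x1) - mean(x2) - d0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples, invented for illustration only.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pooled_t(a, b))
print(unpooled_t(a, b))
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, which makes this pair of samples a convenient check on the formulas.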

In general, the subscript 0 indicates a value taken from the null hypothesis, H_{0}, which should be used as much as possible in constructing its test statistic. *... Definitions of other symbols:*

- $\alpha$, the probability of Type I error (rejecting a null hypothesis when it is in fact true)
- $n$ = sample size
- $n_1$ = sample 1 size
- $n_2$ = sample 2 size
- $\overline{x}$ = sample mean
- $\mu_0$ = hypothesized population mean
- $\mu_1$ = population 1 mean
- $\mu_2$ = population 2 mean
- $\sigma$ = population standard deviation
- $\sigma^2$ = population variance
- $\sum$ = sum (of k numbers)
- $s$ = sample standard deviation
- $s^2$ = sample variance
- $s_1$ = sample 1 standard deviation
- $s_2$ = sample 2 standard deviation
- $t$ = t statistic
- $df$ = degrees of freedom
- $\overline{d}$ = sample mean of differences
- $d_0$ = hypothesized population mean difference
- $s_d$ = standard deviation of differences
- $\hat{p}$ = *x/n* = sample proportion, unless specified otherwise
- $p_0$ = hypothesized population proportion
- $p_1$ = proportion 1
- $p_2$ = proportion 2
- $d_p$ = hypothesized difference in proportion
- $\min\{n_1,n_2\}$ = minimum of *n*_{1} and *n*_{2}
- $x_1 = n_1 p_1$
- $x_2 = n_2 p_2$
- $\chi^2$ = Chi-squared statistic
- $F$ = F statistic
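
Some of the statistics tabulated above can be sketched in code. The following is a minimal illustration using only the Python standard library, not a reference implementation: the function names and the example counts are hypothetical, and the two-sided p-value assumes the z statistic is compared against a standard normal distribution.

```python
import math
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """One-proportion z statistic: (p_hat - p0) / sqrt(p0 (1 - p0)) * sqrt(n)."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0)) * math.sqrt(n)


def two_proportion_z(x1, n1, x2, n2, d0=0.0):
    """Two-proportion z statistic: pooled when d0 == 0, unpooled otherwise."""
    p1, p2 = x1 / n1, x2 / n2
    if d0 == 0:
        p = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - d0) / se


def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the unpooled two-sample t-test (unequal variances)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def two_sided_p(z):
    """Two-sided p-value, assuming a standard normal reference distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 60 successes in 100 trials, tested against p0 = 0.5.
z = one_proportion_z(60, 100, 0.5)
print(z, two_sided_p(z))
```

With equal sample standard deviations and equal sample sizes, `welch_df` reduces to 2(*n* − 1), which is a quick sanity check on the formula.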

## Origins

Hypothesis testing is largely the product of Ronald Fisher, Jerzy Neyman, Karl Pearson and his son Egon Pearson. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an (extended) hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.

## Importance

Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992), in a review of the fundamental paper by Neyman and Pearson (1933), says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".
Significance testing has been the favored statistical tool
in some experimental social sciences (over 90% of articles in the
Journal of Applied Psychology during the early 1990s).
Other fields have favored the estimation of parameters.
Editors often consider significance as a criterion for the publication
of scientific conclusions based on experiments with statistical
results.

## Controversy

Since significance tests were first popularized, many objections
have been voiced by prominent and respected statisticians. The
volume of criticism and rebuttal has filled books with language seldom
used in the scholarly debate of a dry subject.
Much of the criticism was published more than 40 years ago.
The fires of controversy have burned hottest in the field of experimental
psychology. Nickerson surveyed the issues in the year 2000. He
included 300 references and reported 20 criticisms and
almost as many recommendations, alternatives and supplements.
The following section greatly condenses Nickerson's discussion,
omitting many issues.

### Selected criticisms

- There are numerous persistent misconceptions regarding the test and its results.
- The test is a flawed application of probability theory.
  - While the data can be unlikely given the null hypothesis, the alternative hypothesis can be even more unlikely. (Nobody can be that lucky. vs. Clairvoyance is impossible.)
- The test result is a function of sample size.
- The test result is uninformative.
- Statistical significance does not imply practical significance.
- Using statistical significance as a criterion for publication results in problems collectively known as publication bias.
  - Published Type I errors are difficult to correct.
  - Published effect sizes are biased upward.
  - Meta-studies are biased by the invisibility of tests which failed to reach significance.
  - Type II errors (false negatives) are common.

Each criticism has merit, but is subject to discussion.
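
The criticism that the test result is a function of sample size can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it holds a hypothetical observed proportion fixed at 0.51, tests it against p_0 = 0.50 with the one-proportion z-test, and uses a standard normal reference distribution; the function name is not from any library.

```python
import math
from statistics import NormalDist


def two_sided_p_value(p_hat, p0, n):
    """Two-sided p-value of the one-proportion z-test for an observed proportion."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical scenario: the same small observed difference at growing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(n, two_sided_p_value(0.51, 0.50, n))
```

The observed difference of one percentage point is the same in every row, yet the p-value falls as *n* grows. The size of the effect, and therefore its practical significance, has not changed.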

### Misuses and abuses

The characteristics of significance tests can be abused. When the
test statistic is close to the critical value set by the chosen significance level, the
temptation to carefully treat outliers, to adjust the chosen
significance level, to pick a better statistic or to replace a
two-tailed test with a one-tailed test can be powerful. If the goal
is to produce a significant experimental result:

- Conduct a few tests with a large sample size.
- Rigorously control the experimental design.
- Publish the successful tests; hide the unsuccessful tests.
- Emphasize the statistical significance of the results if the practical significance is doubtful.

If the goal is to fail to produce a significant effect:

- Conduct a large number of tests with inadequate sample size.
- Minimize experimental design constraints.
- Publish the number of tests conducted that show "no significant result".
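
The role of sample size in both lists can be made concrete with a power calculation. The following sketch assumes a two-sided one-sample z-test with known population standard deviation; the function name, the effect size and the sample sizes are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Probability that a two-sided z-test rejects H0 when the true mean
    differs from the hypothesized mean by `effect`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = effect * math.sqrt(n) / sigma   # true effect in standard-error units
    return nd.cdf(-z_crit + shift) + nd.cdf(-z_crit - shift)


# Hypothetical true effect of half a standard deviation.
for n in (5, 10, 30, 100):
    print(n, z_test_power(0.5, 1.0, n))
```

With a small sample the test usually fails to reach significance even though the effect is real, and with a large sample it rarely fails. When the true effect is zero, the function returns the significance level itself.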

### Results of the controversy

The controversy has produced several results. The American
Psychological Association has strengthened its statistical
reporting requirements after review; medical journal publishers have
recognized the obligation to publish some results that are not
statistically significant, to combat publication bias; and a
journal, the *Journal of Articles in Support of the Null Hypothesis* (JASNH), has been created to publish such results exclusively.
Textbooks have added some cautions and increased coverage of the tools
necessary to estimate the size of the sample required to produce
significant results. Major organizations have not abandoned use of
significance tests, although they have discussed doing so.

### Alternatives to significance testing

The numerous criticisms of significance testing do not lead to a
single alternative or even to a unified set of alternatives. A
unifying position of critics is that statistics should not lead to a
conclusion or a decision but to a probability or to an estimated value
with confidence bounds. The Bayesian statistical philosophy is
therefore congenial to critics who believe that an experiment should
simply alter probabilities and that conclusions should only be reached
on the basis of numerous experiments.
One strong critic of significance testing suggested a list of reporting
alternatives:
effect sizes for importance, prediction intervals for confidence,
replications and extensions for replicability, meta-analyses for
generality. None of these suggested alternatives produces a
conclusion/decision.
Lehmann said that hypothesis testing theory can be presented in terms
of conclusions/decisions, probabilities, or confidence intervals.
"The distinction between the ... approaches is largely one of
reporting and interpretation."
On one "alternative" there is no disagreement: Fisher himself said,
"In relation to the test of significance, we may say that a
phenomenon is experimentally demonstrable when we know how to conduct
an experiment which will rarely fail to give us a statistically
significant result." Cohen, an influential critic of significance
testing, concurred,
"...don't look for a magic alternative to NHST *[null hypothesis significance testing]* ... It doesn't exist." "...given
the problems of statistical induction, we must finally rely,
as have the older sciences, on replication." The "alternative" to
significance testing is repeated testing. The easiest way to decrease
statistical uncertainty is by more data, whether by increased sample
size or by repeated tests. Nickerson claimed to have never
seen the publication of a literally replicated experiment in
psychology.
While Bayesian inference is a possible alternative to significance
testing, it requires information that is seldom available in the
cases where significance testing is most heavily used.

### Future of the controversy

It is unlikely that this controversy will be resolved in the near
future. The flaws and unpopularity of significance testing
do not eliminate the need for an objective and transparent means
of reaching conclusions regarding experiments that produce statistical
results. Critics have not unified around an alternative. Other
forms of reporting confidence or uncertainty will probably grow in
popularity.

## Improvements

Jones and Tukey suggested a modest improvement in the original
null-hypothesis formulation to formalize handling of one-tail tests. They conclude that, in the "Lady Tasting Tea" example, Fisher ignored the 8-failure case (equally improbable as the 8-success case), which altered the claimed significance by a factor of 2.

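The factor of 2 can be checked by direct counting. The sketch below assumes Fisher's classic design, which is not spelled out in the text above: eight cups, four prepared milk-first, with the taster asked to pick out the four milk-first cups. The function name is hypothetical.

```python
from math import comb


def tea_tasting_significance():
    """One- and two-tailed probabilities for the assumed 8-cup, 4/4 design."""
    arrangements = comb(8, 4)          # 70 equally likely ways to choose 4 of 8 cups
    p_all_correct = 1 / arrangements   # exactly one selection gets every cup right
    p_all_wrong = 1 / arrangements     # exactly one selection gets every cup wrong
    return p_all_correct, p_all_correct + p_all_wrong


one_tailed, two_tailed = tea_tasting_significance()
print(f"one-tailed: {one_tailed:.4f}, two-tailed: {two_tailed:.4f}")
# one-tailed: 0.0143, two-tailed: 0.0286
```

Counting only the 8-success case gives 1/70. Counting the equally improbable 8-failure case as well gives 2/70, twice the one-tailed value.
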
## See also

{{Portal|Statistics}}
For a reconstruction and defense of Neyman–Pearson testing, see Mayo and Spanos (2006), "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction," BJPS, 57: 323–57.

## Further reading

- Lehmann, E.L. (1970). *Testing Statistical Hypotheses* (5th ed.). New York: Wiley.
- Lehmann, E.L. (1992). "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: *Breakthroughs in Statistics, Volume 1* (Eds. Kotz, S., Johnson, N.L.), Springer-Verlag. ISBN 0-387-94037-5 (followed by a reprinting of the paper).

{{Statistics}}
{{DEFAULTSORT:Statistical Hypothesis Testing}}