Statistical hypothesis testing

 

A statistical hypothesis test is a method of making statistical decisions using experimental data. It is sometimes called confirmatory data analysis, in contrast to exploratory data analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests; that is, ones that answer the question: "Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?" One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.
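
The question a null-hypothesis test answers can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a hypothetical experiment of 20 coin flips with a null hypothesis of a fair coin, and it reads "at least as extreme" as "at least as many heads" (a one-sided reading). The function name `upper_tail_p_value` and the numbers are assumptions chosen for the illustration.

```python
from math import comb

def upper_tail_p_value(n_trials, observed_successes, p_null=0.5):
    """P(X >= observed_successes) when X ~ Binomial(n_trials, p_null)."""
    return sum(
        comb(n_trials, k) * p_null**k * (1 - p_null) ** (n_trials - k)
        for k in range(observed_successes, n_trials + 1)
    )

# Hypothetical data: 15 heads in 20 flips of a coin assumed fair under the null.
p_value = upper_tail_p_value(20, 15)
print(f"p-value = {p_value:.4f}")  # about 0.0207
```

A small value says only that the observed count would be unusual if the null hypothesis were true; it is not the probability that the null hypothesis is true.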

It is a key technique of frequentist statistical inference, and is widely used, but also much criticized. The main alternative is Bayesian inference.

The critical region of a hypothesis test is the set of all outcomes which, if they occur, cause the null hypothesis to be rejected and the alternative hypothesis accepted. It is usually denoted by C.
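
As a rough illustration of a critical region, the sketch below builds C for a hypothetical one-sided test on the number of successes in a fixed number of trials. It assumes a binomial null distribution with success probability 0.5 and a 5% threshold; the function name `critical_region` and all numbers are assumptions made for the example.

```python
from math import comb

def critical_region(n_trials, alpha, p_null=0.5):
    """Largest upper-tail set {k, ..., n_trials} whose null probability is <= alpha."""
    tail = 0.0
    region = []
    # Walk down from the most extreme outcome, adding outcomes while the
    # total null probability of the region stays within alpha.
    for k in range(n_trials, -1, -1):
        tail += comb(n_trials, k) * p_null**k * (1 - p_null) ** (n_trials - k)
        if tail > alpha:
            break
        region.append(k)
    return sorted(region)

C = critical_region(20, 0.05)
print(C)  # [15, 16, 17, 18, 19, 20]
```

Any observed count that falls in C leads to rejection of the null hypothesis; any other count does not.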

Example

As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that the null hypothesis produces 10 counts per minute. If the null hypothesis predicts (say) on average 9 counts per minute with a standard deviation of 1 count per minute, then we say that the suitcase is compatible with the null hypothesis. (This does not guarantee that there is no radioactive material, just that we have no reason to believe there is.) On the other hand, if the null hypothesis predicts 3 counts per minute with a standard deviation of 1 count per minute, then the suitcase is not compatible with the null hypothesis, and other factors are likely responsible for the measurements.
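
The two suitcase scenarios can be compared numerically. The sketch below assumes, purely for illustration, that the count rate under the null hypothesis is approximately normally distributed with the stated mean and standard deviation; the article does not specify a distribution, and the helper name `upper_tail_p` is hypothetical.

```python
from math import erfc, sqrt

def upper_tail_p(observed, null_mean, null_sd):
    """Return (z, P(Z >= z)) under an assumed normal null distribution."""
    z = (observed - null_mean) / null_sd
    return z, 0.5 * erfc(z / sqrt(2))

# Scenario 1: the null hypothesis predicts 9 +/- 1 counts per minute.
z_a, p_a = upper_tail_p(10, 9, 1)
# Scenario 2: the null hypothesis predicts 3 +/- 1 counts per minute.
z_b, p_b = upper_tail_p(10, 3, 1)

print(f"null 9 +/- 1: z = {z_a:.0f}, p = {p_a:.3f}")  # z = 1, p about 0.159
print(f"null 3 +/- 1: z = {z_b:.0f}, p = {p_b:.1e}")  # z = 7, p far below 0.001
```

In the first scenario the reading is one standard deviation above the expected background, which is unremarkable; in the second it is seven standard deviations above, which the background alone would almost never produce.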

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis is a conjecture made solely to be falsified by the sample. Statistical significance is a possible finding of the test: that the sample is unlikely to have occurred by chance given the truth of the null hypothesis. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: reject or do not reject (which is not the same as accept). A calculated value is compared to a threshold, which is determined from the tolerable risk of error.

Definition of terms

The following definitions are mainly based on the exposition in Lehmann and Romano:

  • Simple hypothesis: Any hypothesis which specifies the population distribution completely.
  • Composite hypothesis: Any hypothesis which does not specify the population distribution completely.
  • Statistical test: A decision function that takes its values in the set of hypotheses.
  • Region of acceptance: The set of values for which we fail to reject the null hypothesis.
  • Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected.
  • Power of a test (1 − β): The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, β.
  • Size / Significance level of a test (α): For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis.
  • Most powerful test: For a given size or significance level, the test with the greatest power.
  • Uniformly most powerful test (UMP): A test with the greatest power for all values of the parameter being tested.
  • Consistent test: When considering the properties of a test as the sample size grows, a test is said to be consistent if, for a fixed size of test, the power against any fixed alternative approaches 1 in the limit.
  • Unbiased test: For a specific alternative hypothesis, a test is said to be unbiased when the probability of rejecting the null hypothesis is not less than the significance level when the alternative is true and is less than or equal to the significance level when the null hypothesis is true.
  • Uniformly most powerful unbiased (UMPU): A test which is UMP in the set of all unbiased tests.
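
Several of these terms (critical region, size α, power 1 − β, consistency) can be seen together in one small calculation. The sketch below assumes a one-sided z-test of a mean with known standard deviation and two simple hypotheses; the function name `one_sided_z_test` and every numeric value are assumptions chosen for illustration, not taken from Lehmann and Romano.

```python
from statistics import NormalDist

def one_sided_z_test(mu0, mu1, sigma, n, alpha):
    """Critical value, size and power for H0: mean = mu0 vs H1: mean = mu1 > mu0.

    The test statistic is the sample mean of n observations with known sigma.
    """
    se = sigma / n ** 0.5  # standard error of the sample mean
    # Reject H0 when the sample mean exceeds this threshold.
    critical_value = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Size: probability of rejecting when H0 is true (false positive rate).
    size = 1 - NormalDist(mu0, se).cdf(critical_value)
    # Power: probability of rejecting when H1 is true (1 - beta).
    power = 1 - NormalDist(mu1, se).cdf(critical_value)
    return critical_value, size, power

crit_25, size_25, power_25 = one_sided_z_test(0.0, 0.5, 1.0, 25, 0.05)
crit_100, size_100, power_100 = one_sided_z_test(0.0, 0.5, 1.0, 100, 0.05)
print(f"n=25:  reject if mean > {crit_25:.3f}, size={size_25:.3f}, power={power_25:.3f}")
print(f"n=100: reject if mean > {crit_100:.3f}, size={size_100:.3f}, power={power_100:.3f}")
```

With the size held at 0.05, the power against the fixed alternative rises toward 1 as n grows, which is the behaviour the definition of a consistent test describes.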

Common test statistics

See legend defining symbols at bottom of table. The statistics for some other tests have their own articles, including the Wald test and the likelihood ratio test.


Origins


Hypothesis testing is largely the product of Ronald Fisher, Jerzy Neyman, Karl Pearson and his son Egon Pearson. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an (extended) hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.

Example


The following example is summarized from Fisher. Fisher thoroughly explained his method in a proposed experiment to test a Lady's claimed ability to determine the means of tea preparation by taste. The article is less than 10 pages in length and is notable for its simplicity and completeness regarding terminology, calculations and design of the experiment. The example is loosely based on an event in Fisher's life. The Lady proved him wrong.

  1. The null hypothesis was that the Lady had no such ability.
  2. The test statistic was a simple count of the number of successes in 8 trials.
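  3. The distribution associated with the null hypothesis followed from the design rather than from the binomial distribution of coin flipping: the Lady knew that 4 of the 8 cups had been prepared each way, so under the null hypothesis each of the 70 possible selections of 4 cups was equally likely.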
  4. The critical region was the single case of 8 successes in 8 trials based on a conventional probability criterion (< 5%).
  5. Fisher asserted that no alternative hypothesis was (ever) required.


If, and only if, the 8 trials produced 8 successes was Fisher willing to reject the null hypothesis, effectively acknowledging the Lady's ability with > 98% confidence (but without quantifying her ability). Fisher later discussed the benefits of more trials and repeated tests.
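
The probabilities behind the critical region can be checked directly. The sketch below compares two readings of the experiment: independent guesses with success probability 1/2 (the coin-flipping model), and Fisher's published design, in which 4 of the 8 cups are prepared each way and the taster, knowing this, picks the 4 she believes share one preparation. The function names are hypothetical and the code is an illustration, not Fisher's own calculation.

```python
from math import comb

def binomial_upper_tail(n_trials, min_successes, p=0.5):
    """P(at least min_successes correct) for independent guesses."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(min_successes, n_trials + 1)
    )

def fixed_split_upper_tail(cups_per_kind, min_correct_picks):
    """P(at least min_correct_picks of the picked cups are right) when
    cups_per_kind cups of each kind exist and cups_per_kind are picked at random."""
    total = comb(2 * cups_per_kind, cups_per_kind)  # 70 selections for 4 + 4 cups
    favourable = sum(
        comb(cups_per_kind, j) * comb(cups_per_kind, cups_per_kind - j)
        for j in range(min_correct_picks, cups_per_kind + 1)
    )
    return favourable / total

all_correct = fixed_split_upper_tail(4, 4)       # 1/70, about 0.014
one_pair_swapped = fixed_split_upper_tail(4, 3)  # 17/70, about 0.243
coin_all_correct = binomial_upper_tail(8, 8)     # 1/256, about 0.004
coin_seven_plus = binomial_upper_tail(8, 7)      # 9/256, about 0.035

print(f"fixed 4+4 design, all correct:   {all_correct:.4f}")
print(f"fixed 4+4 design, <= 1 swap:     {one_pair_swapped:.4f}")
print(f"coin-flip model, 8 of 8 correct: {coin_all_correct:.4f}")
print(f"coin-flip model, 7 or 8 correct: {coin_seven_plus:.4f}")
```

Under the fixed 4 + 4 design, only a perfect classification falls below the 5% criterion, and 1 − 1/70 is about 98.6%, which matches the "> 98%" figure. Under the coin-flipping model, 7 or more successes would already fall below 5%, so that model does not give a single-outcome critical region.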

Criticism


Some statisticians have commented that pure "significance testing" has what is actually a rather strange goal of detecting the existence of a "real" difference between two populations. In practice a difference can almost always be found given a large enough sample; what is typically the more relevant goal of science is a determination of causal effect size. The amount and nature of the difference, in other words, is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer. In practice a single statistical test in a single study never "proves" anything.

Rejection of the null hypothesis at some effect size has no bearing on the practical significance at the observed effect size. A statistically significant finding may not be relevant in practice due to other, larger effects of more concern, whilst a true effect of practical significance may not appear statistically significant if the test lacks the power to detect it. Appropriate specification of both the hypothesis and the test of said hypothesis is therefore important to provide inference of practical utility.
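
The dependence of significance on sample size, independent of practical importance, can be illustrated with a short calculation. The sketch below assumes a two-sided z-test of a mean with known standard deviation and a fixed, very small true effect (one hundredth of a standard deviation), and it evaluates the p-value that would result if the sample mean landed exactly on that effect. The function name `two_sided_p` and the numbers are assumptions made for the example.

```python
from math import erfc, sqrt

def two_sided_p(effect, sigma, n):
    """Two-sided p-value when the sample mean of n observations equals `effect`
    and the null hypothesis says the true mean is 0 (known sigma)."""
    z = abs(effect) * sqrt(n) / sigma
    return erfc(z / sqrt(2))

# Same tiny effect, increasingly large samples.
for n in (100, 10_000, 40_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sided_p(0.01, 1.0, n):.3g}")
```

The effect size never changes, yet the result moves from clearly non-significant to overwhelmingly significant as n grows, which is why the effect size itself has to be reported and judged separately.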

Meta-criticism


Little criticism of the technique appears in introductory statistics texts. Criticism is of the application, or of the interpretation, rather than of the method.

Criticism of null-hypothesis significance testing is available in other articles ("Null-hypothesis" and "Statistical significance") and their references. Attacks and defenses of the null-hypothesis significance test are collected in Harlow et al.

The original purpose of Fisher's formulation, as a tool for the experimenter, was to plan the experiment and to easily assess the information content of the small sample. There is little criticism, Bayesian in nature, of the formulation in its original context.

In other contexts, complaints focus on flawed interpretations of the results and over-dependence/emphasis on one test.

Numerous attacks on the formulation have failed to supplant it as a criterion for publication in scholarly journals. The most persistent attacks originated from the field of psychology. After review, the American Psychological Association did not explicitly deprecate the use of null-hypothesis significance testing, but adopted enhanced publication guidelines which implicitly reduced the relative importance of such testing. The International Committee of Medical Journal Editors recognizes an obligation to publish negative (not statistically significant) studies under some circumstances. The applicability of null-hypothesis testing to the publication of observational (as contrasted to experimental) studies is doubtful.

Philosophical criticism


Philosophical criticism of hypothesis testing includes consideration of borderline cases.

Any process that produces a crisp decision from uncertainty is subject to claims of unfairness near the decision threshold. (Consider close election results.) The premature death of a laboratory rat during testing can impact doctoral theses and academic tenure decisions.

"... surely, God loves the .06 nearly as much as the .05"

The statistical significance required for publication has no mathematical basis, but is based on long tradition.

"It is usual and convenient for experimenters to take 5% as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results."

Fisher, in the cited article, designed an experiment to achieve a statistically significant result based on sampling 8 cups of tea.

Ambivalence attacks all forms of decision making. A mathematical decision-making process is attractive because it is objective and transparent. It is repulsive because it allows authority to avoid taking personal responsibility for decisions.

Pedagogic criticism


Pedagogic criticism of null-hypothesis testing includes the counter-intuitive formulation, the terminology and confusion about the interpretation of results.

"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions."

Students find it difficult to understand the formulation of statistical null-hypothesis testing. In rhetoric, examples often support an argument, but a mathematical proof "is a logical argument, not an empirical one". A single counterexample results in the rejection of a conjecture. Karl Popper defined science by its vulnerability to disproof by data. Null-hypothesis testing shares the mathematical and scientific perspective rather than the more familiar rhetorical one. Students expect hypothesis testing to be a statistical tool for illumination of the research hypothesis by the sample; it is not. The test asks indirectly whether the sample can illuminate the research hypothesis.

Students also find the terminology confusing. While Fisher disagreed with Neyman and Pearson about the theory of testing, their terminologies have been blended. The blend is not seamless or standardized. While this article teaches a pure Fisher formulation, even it mentions Neyman and Pearson terminology (Type II error and the alternative hypothesis). The typical introductory statistics text is less consistent. The Sage Dictionary of Statistics would not agree with the title of this article, which it would call null-hypothesis testing. "...there is no alternate hypothesis in Fisher's scheme: Indeed, he violently opposed its inclusion by Neyman and Pearson." In discussing test results, "significance" often has two distinct meanings in the same sentence; one is a probability, the other is a subject-matter measurement (such as currency). The significance (meaning) of (statistical) significance is significant (important).

There is widespread and fundamental disagreement on the interpretation of test results.

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is almost always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (The above criticism only applies to point hypothesis tests. If one were testing, for example, whether a parameter is greater than zero, it would not apply.)

"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?"

Null-hypothesis testing just answers the question of "how well the findings fit the possibility that chance factors alone might be responsible."

Null-hypothesis significance testing does not determine the truth or falseness of claims. It determines whether confidence in a claim based solely on a sample-based estimate exceeds a threshold. It is a research quality assurance test, widely used as one requirement for publication of experimental research with statistical results. It is uniformly agreed that statistical significance is not the only consideration in assessing the importance of research results. Rejecting the null hypothesis is not a sufficient condition for publication.

"Statistical significance does not necessarily imply practical significance!"

Practical criticism


Practical criticism of hypothesis testing includes the sobering observation that published test results are often contradicted. Mathematical models support the conjecture that most published medical research test results are flawed. Null-hypothesis testing has not achieved the goal of a low error probability in medical journals.

Improvements


Jones and Tukey suggested a modest improvement in the original null-hypothesis formulation to formalize handling of one-tail tests. Fisher ignored the 8-failure case (as improbable as the 8-success case) in the example test involving tea, which altered the claimed significance by a factor of 2.

Killeen proposed an alternative statistic that estimates the probability of duplicating an experimental result. It "provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference."
