Statistical significance

Statistical significance

Discussion
Ask a question about 'Statistical significance'
Start a new discussion about 'Statistical significance'
Answer questions from other users
Full Discussion Forum
 
Encyclopedia
In statistics
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

, a result is called statistically significant if it is unlikely to have occurred by chance
Randomness
Randomness has somewhat differing meanings as used in various fields. It also has common meanings which are connected to the notion of predictability of events....

. The phrase test of significance
Statistical hypothesis testing
A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study . In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold...

was coined by Ronald Fisher
Ronald Fisher
Sir Ronald Aylmer Fisher FRS was an English statistician, evolutionary biologist, eugenicist and geneticist. Among other things, Fisher is well known for his contributions to statistics by creating Fisher's exact test and Fisher's equation...

.
As used in statistics, significant does not mean important or meaningful, as it does in everyday speech. Research analysts who focus solely on significant results may miss important response patterns which individually may fall under the threshold set for tests of significance. Many researchers urge that tests of significance should always be accompanied by effect-size
Effect size
In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity...

 statistics, which approximate the size and thus the practical importance of the difference.

The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value
P-value
In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α ,...

: in traditional Fisherian
Ronald Fisher
Sir Ronald Aylmer Fisher FRS was an English statistician, evolutionary biologist, eugenicist and geneticist. Among other things, Fisher is well known for his contributions to statistics by creating Fisher's exact test and Fisher's equation...

 statistical hypothesis testing
Statistical hypothesis testing
A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study . In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold...

, the p-value is the probability of observing data at least as extreme as that observed, given that the null hypothesis is true. If the obtained p-value is small then it can be said either the null hypothesis
Null hypothesis
The practice of science involves formulating and testing hypotheses, assertions that are capable of being proven false using a test of observed data. The null hypothesis typically corresponds to a general or default position...

 is false or an unusual event has occurred. It is worth stressing that p-values do not have any repeat sampling interpretation.

An alternative statistical hypothesis testing framework is the Neyman–Pearson frequentist school which requires both a null and an alternative hypothesis to be defined and investigates the repeat sampling properties of the procedure, i.e. the probability that a decision to reject the null hypothesis will be made when it is in fact true and should not have been rejected (this is called a "false positive" or Type I error) and the probability that a decision will be made to accept the null hypothesis when it is in fact false (Type II error).

It is worth stressing that Fisherian p-values are philosophically different from Neyman–Pearson Type I errors. This confusion is unfortunately propagated by many statistics textbooks.

Use in practice


The significance level is usually denoted by the Greek symbol α (lowercase alpha
Alpha (letter)
Alpha is the first letter of the Greek alphabet. In the system of Greek numerals it has a value of 1. It was derived from the Phoenician letter Aleph...

). Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance
Statistical hypothesis testing
A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study . In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold...

 gives a p-value lower than the significance level α, the null hypothesis is rejected. Such results are informally referred to as 'statistically significant'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence," a 0.001 level of statistical significance is being implied. The lower the significance level, the stronger the evidence required. Choosing level of significance is an arbitrary task, but for many applications, a level of 5% is chosen, for no better reason than that it is conventional.

In some situations it is convenient to express the statistical significance as 1 − α. In general, when interpreting a stated significance, one must be careful to note what, precisely, is being tested statistically.

Different levels of α trade off countervailing effects. Smaller levels of α increase confidence in the determination of significance, but run an increased risk of failing to reject a false null hypothesis (a Type II error, or "false negative determination"), and so have less statistical power
Statistical power
The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is actually false . The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis...

. The selection of the level α thus inevitably involves a compromise between significance and power, and consequently between the Type I error and the Type II error. More powerful
Statistical power
The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is actually false . The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis...

 experiments – usually experiments with more subjects or replications – can obviate this choice to an arbitrary degree.

In terms of σ (sigma)


In some fields, for example nuclear and particle physics, it is common to express statistical significance in units of the standard deviation
Standard deviation
Standard deviation is a widely used measure of variability or diversity used in statistics and probability theory. It shows how much variation or "dispersion" there is from the average...

 σ of a normal distribution. A statistical significance of "" can be converted into a value of α by use of the cumulative distribution function Φ of the standard normal distribution, through the relation:


or via use of the error function
Error function
In mathematics, the error function is a special function of sigmoid shape which occurs in probability, statistics and partial differential equations...

:


However, values might more easily be found using tabulated values which are often found in text books: see standard normal table
Standard normal table
A standard normal table also called the "Unit Normal Table" is a mathematical table for the values of Φ, the cumulative distribution function of the normal distribution....

. The use of σ implicitly assumes a normal distribution of measurement values. For example, if a theory predicts a parameter to have a value of, say,  109 ± 3, and one measures the parameter to be 100, then one might report the measurement as a "3σ deviation" from the theoretical prediction. In terms of α, this statement is equivalent to saying that "assuming the theory is true, the likelihood of obtaining the experimental result by coincidence is 0.27%" (since 1 − erf(3/√2) = 0.0027).

Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, modern statistical advice is that, where the outcome of a test is essentially the final outcome of an experiment or other study, the p-value should be quoted explicitly. And, importantly, it should be quoted whether the p-value is judged to be significant. This is to allow maximum information to be transferred from a summary of the study into meta-analyses
Meta-analysis
In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses. In its simplest form, this is normally by identification of a common measure of effect size, for which a weighted average might be the output of a meta-analyses. Here the...

.

Pitfalls and criticism


The scientific literature contains extensive discussion of the use of the concept of statistical significance and in particular of its potential misuse and criticism of its use: see potential misuse and criticism for details of these opinions.

Signal–noise ratio conceptualisation of significance


Statistical significance can be considered to be the confidence one has in a given result. In a comparison study, it is dependent on the relative difference between the groups compared, the amount of measurement and the noise associated with the measurement. In other words, the confidence one has in a given result being non-random (i.e. it is not a consequence of chance
Randomness
Randomness has somewhat differing meanings as used in various fields. It also has common meanings which are connected to the notion of predictability of events....

) depends on the signal-to-noise ratio
Signal-to-noise ratio
Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power. A ratio higher than 1:1 indicates more signal than noise...

 (SNR) and the sample size.

Expressed mathematically, the confidence that a result is not by random chance is given by the following formula by Sackett:


For clarity, the above formula is presented in tabular form below.

Dependence of confidence with noise, signal and sample size (tabular form)
Parameter Parameter increases Parameter decreases
Noise Confidence decreases Confidence increases
Signal Confidence increases Confidence decreases
Sample size Confidence increases Confidence decreases


In words, the dependence of confidence is high if the noise is low and/or the sample size is large and/or the effect size
Effect size
In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity...

 (signal) is large. The confidence of a result (and its associated confidence interval
Confidence interval
In statistics, a confidence interval is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval , in principle different from sample to sample, that frequently includes the parameter of interest, if the...

) is not dependent on effect size alone. If the sample size is large and the noise is low a small effect size can be measured with great confidence. Whether a small effect size is considered important is dependent on the context of the events compared.

In medicine, small effect sizes (reflected by small increases of risk) are often considered clinically relevant and are frequently used to guide treatment decisions (if there is great confidence in them). Whether a given treatment is considered a worthy endeavour is dependent on the risks, benefits and costs.

See also



  • Statistical hypothesis testing
    Statistical hypothesis testing
    A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study . In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold...

  • A/B testing
    A/B testing
    A/B testing, split testing or bucket testing is a method of marketing testing by which a baseline control sample is compared to a variety of single-variable test samples in order to improve response rates...

  • ABX test
    ABX test
    An ABX test is a method of comparing two kinds of sensory stimuli to identify detectable differences. A subject is presented with two known samples , and one unknown sample X, for three samples total. X is randomly selected from A and B, and the subject identifies X as being either A or B...

  • Fisher's method
    Fisher's Method
    In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" . It was developed by and named for Ronald Fisher...

     for combining independent
    Statistical independence
    In probability theory, to say that two events are independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other occurs...

     test
    Statistical hypothesis testing
    A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study . In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold...

    s of significance
  • Reasonable doubt

Further reading

  • Ziliak, Stephen, and McCloskey, Deirdre, (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, University of Michigan Press
    University of Michigan Press
    The University of Michigan Press is part of the University of Michigan Library and serves as a primary publishing unit of the University of Michigan, with special responsibility for the creation and promotion of scholarly, educational, and regional books and other materials in digital and print...

    , 2009.
  • Thompson, Bruce, (2004). The "significance" crisis in psychology and education. Journal of Socio-Economics, 33, pp. 607–613.
  • Chow, Siu L. , (1996). Statistical Significance: Rationale, Validity and Utility, Volume 1 of series Introducing Statistical Methods, Sage Publications Ltd, ISBN 978-0-76195205-3 – argues that statistical significance is useful in certain circumstances.
  • Kline, Rex, (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research Washington, DC: American Psychological Association.

External links