Tajima's D
Tajima's D is a statistical test created by and named after the Japanese researcher Fumio Tajima. The purpose of the test is to distinguish between a DNA sequence evolving randomly ("neutrally") and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called "neutral", while mutations under selection are "non-neutral". For example, you would expect a mutation which causes prenatal death or severe disease to be under selection. When looking at the human population as a whole, we say that the population frequency of a neutral mutation fluctuates randomly through genetic drift (i.e. the percentage of people in the population with the mutation changes from one generation to the next, and this percentage is equally likely to go up or down).

The strength of genetic drift depends on the population size. If a population is at a constant size with a constant mutation rate, the population will reach an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites $S$ and the number of nucleotide differences between pairs sampled (these are called pairwise differences). To standardize the pairwise differences, the mean or 'average' number of pairwise differences is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is signified by $\pi$.
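As a minimal sketch of these two summary statistics, the following Python counts segregating sites and averages the pairwise differences for a small set of aligned sequences. The function names and the three toy sequences are invented for illustration, and the sketch assumes equal-length, gap-free sequences.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns in which at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_differences(seqs):
    """Average number of differing positions over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

# Hypothetical toy alignment (not real data)
toy = ["AAGT", "AAGA", "ACGA"]
print(segregating_sites(toy))          # columns 2 and 4 vary
print(mean_pairwise_differences(toy))  # (1 + 2 + 1) / 3
```

Comparing every pair costs time proportional to the square of the sample size, which is acceptable for small samples like the ones discussed here.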

The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift. In order to perform the test on a DNA sequence or gene, you need to sequence homologous DNA for at least 3 individuals. Tajima's statistic compares a standardized measure of the total number of segregating sites (these are DNA sites that are polymorphic) in the sampled DNA with the average number of mutations between pairs in the sample. The two quantities whose values are compared are both method-of-moments estimates of the population genetic parameter theta, and so are expected to be equal. If these two numbers differ only by as much as one could reasonably expect by chance, then the null hypothesis of neutrality cannot be rejected. Otherwise, the null hypothesis of neutrality is rejected.

Hypothetical example

Let's say that you are a genetics researcher who finds two mutations: a mutation in a gene which causes prenatal death, and a mutation in DNA which has no effect on human health or survival. You publish your findings in a scientific journal, identifying the first mutation as "under negative selection" and the second as "neutral". The neutral mutation gets passed on from one generation to the next, while the mutation under negative selection disappears, since anyone with the mutation cannot reproduce and pass it on to the next generation.


In order to back your discovery with more scientific evidence, you gather DNA samples from 100 people and determine the exact DNA sequence for the gene in each of these 100. Using all 100 DNA samples as input, you determine Tajima's D on both the detrimental mutation and the 'neutral' DNA. If your hypothesis is correct, then Tajima's Test will output "neutral" for the neutral mutation and "non-neutral" for the pre-natal death allele.

Scientific explanation

Under the neutral theory model, for a population at constant size at equilibrium:

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu$$

for diploid DNA, and

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu$$

for haploid.

In the above formulas, $S$ is the number of segregating sites, $n$ is the number of samples, $N$ is the effective population size, $\mu$ is the mutation rate, and $i$ is the index of summation. But selection, demographic fluctuations and other violations of the neutral model (including rate heterogeneity and introgression) will change the expected values of $S$ and $\pi$, so that the two estimates of $\theta$ derived from them are no longer expected to be equal. The difference in the expectations for these two variables (which can be positive or negative) is the crux of Tajima's D test statistic.
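The estimate of theta based on segregating sites can be sketched in a few lines of Python. The function names are hypothetical, and the input values (4 segregating sites in a sample of 5) are chosen only for illustration.

```python
def harmonic_sum(n):
    """a1 = 1 + 1/2 + ... + 1/(n - 1) for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))

def theta_from_segregating_sites(s, n):
    """Estimate theta by dividing the number of segregating sites by a1."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return s / harmonic_sum(n)

# Illustrative values: 4 segregating sites in a sample of 5 sequences
print(round(theta_from_segregating_sites(4, 5), 2))  # 4 / (25/12)
```

Because the harmonic sum grows only logarithmically with the sample size, adding more sequences increases the expected number of segregating sites slowly.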

$D$ is calculated by taking the difference between the two estimates of the population genetics parameter $\theta$. This difference is called $d$, and $D$ is calculated by dividing $d$ by the square root of its variance, $\sqrt{\hat{V}(d)}$ (its standard deviation, by definition).

$$D = \frac{d}{\sqrt{\hat{V}(d)}}$$

Fumio Tajima demonstrated by computer simulation that the $D$ statistic described above could be modeled using a beta distribution. If the $D$ value for a sample of sequences is outside the confidence interval, then one can reject the null hypothesis of neutral mutation for the sequence in question.

Statistical test

When performing a statistical test such as Tajima's D, the critical question is whether the value calculated for the statistic is unexpected under a null process. For Tajima's D, the magnitude of the statistic is expected to increase the more the history of the population deviates from a history expected under neutrality. In the example below, the calculation of this statistic is shown for some data.

In Tajima's test, the null hypothesis is neutral evolution in a population at equilibrium between mutation and genetic drift.

Mathematical details


$$D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - \frac{S}{a_1}}{\sqrt{e_1 S + e_2 S (S - 1)}}$$

where $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, and $e_1$ and $e_2$ are constants that depend only on the sample size $n$ (Tajima, 1989).

$\hat{k}$ (also commonly written $\pi$) and $\frac{S}{a_1}$ are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample of size $n$ from an effective population size $N$.

The first estimate is the average number of SNPs found in (n choose 2) pairwise comparisons of sequences $(i, j)$ in the sample,

$$\hat{k} = \frac{\sum_{i<j} k_{ij}}{\binom{n}{2}}$$

The second estimate is derived from the expected value of $S$, the total number of polymorphisms in the sample,

$$E(S) = a_1 M$$

Tajima defines $M = 4N\mu$, whereas Hartl & Clark use a different symbol to define the same parameter, $\theta = 4N\mu$.
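The statistic can be sketched in Python from the three summary values it needs. The constants a2, b1, b2, c1, c2, e1 and e2 are not spelled out in this article; the sketch uses the standard definitions from Tajima (1989), which should be checked against the original paper before being relied on. The function name is hypothetical. The sketch requires at least four sequences, because with three sequences every two-allele site splits 1:2, so both d and its estimated variance are zero and the ratio is undefined.

```python
import math

def tajimas_d(k_hat, s, n):
    """Tajima's D from mean pairwise differences k_hat,
    number of segregating sites s, and sample size n."""
    if n < 4:
        raise ValueError("D is degenerate (0/0) for fewer than 4 sequences")
    if s < 1:
        raise ValueError("D is undefined when no site is polymorphic")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = k_hat - s / a1                           # difference of the two estimates
    variance = e1 * s + e2 * s * (s - 1)         # estimated variance of d
    return d / math.sqrt(variance)

# Illustrative values: k_hat = 2, S = 4, n = 5
print(round(tajimas_d(2.0, 4, 5), 3))
```

Working from summary values rather than raw sequences keeps the function independent of how the alignment is stored.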

Historical example

The genetic mutation which causes sickle-cell anemia is non-neutral because it affects survival and fitness. People homozygous for the mutation have sickle-cell disease, while those without the mutation (homozygous for the wild-type allele) do not have the disease. People with one copy of the mutated allele (heterozygous) do not have the disease, but instead are resistant to malaria. Thus in Africa, where the malaria parasite Plasmodium falciparum, which is transmitted by Anopheles mosquitoes, is prevalent, there is a selective advantage for heterozygous individuals. Meanwhile, in countries such as the USA where the risk of malaria infection is low, the population frequency of the mutation is lower.

Example

Suppose you are a geneticist studying an unknown gene. As part of your research you get DNA samples from four random people (plus yourself). For simplicity, you label your sequence as a string of zeroes, and for the other four people you put a zero when their DNA is the same as yours and a one when it is different. (For this example, the specific type of difference is not important.)


Position 12345 67890 12345 67890
Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010

Notice the four polymorphic sites (positions where someone differs from you, at 3, 7, 13 and 19 above). Now compare each pair of sequences and get the average number of polymorphisms between two sequences. There are "five choose two" (ten) comparisons that need to be done.



Person Y is you!

You vs A: 3 polymorphisms

Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010

You vs B: 2 polymorphisms

Person Y 00000 00000 00000 00000
Person B 00000 00000 00100 00010

You vs C: 2 polymorphisms

Person Y 00000 00000 00000 00000
Person C 00000 01000 00000 00010

You vs D: 3 polymorphisms

Person Y 00000 00000 00000 00000
Person D 00000 01000 00100 00010

A vs B: 1 polymorphism

Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010

A vs C: 3 polymorphisms

Person A 00100 00000 00100 00010
Person C 00000 01000 00000 00010

A vs D: 2 polymorphisms

Person A 00100 00000 00100 00010
Person D 00000 01000 00100 00010

B vs C: 2 polymorphisms

Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010

B vs D: 1 polymorphism

Person B 00000 00000 00100 00010
Person D 00000 01000 00100 00010

C vs D: 1 polymorphism

Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010




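The average number of polymorphisms is $\hat{k} = \frac{3 + 2 + 2 + 3 + 1 + 3 + 2 + 2 + 1 + 1}{10} = 2$.

The lower-case d described above is the difference between two estimates: the average number of polymorphisms found in pairwise comparison (2), and the total number of polymorphic sites (4) divided by $a_1 = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} = \frac{25}{12}$, which gives $4 / \frac{25}{12} = 1.92$. Thus $d = 2 - 1.92 = 0.08$.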
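The arithmetic for this example can be cross-checked with a short Python sketch. Instead of comparing every pair, it uses an equivalent shortcut (an assumption of this sketch, valid for 0/1-coded sites): a site where c of the n sequences carry a 1 contributes c(n - c) pairwise differences. The variable names are illustrative only.

```python
from math import comb

# The five 0/1-coded sequences from the example, copied with their spacing
raw = {
    "Y": "00000 00000 00000 00000",
    "A": "00100 00000 00100 00010",
    "B": "00000 00000 00100 00010",
    "C": "00000 01000 00000 00010",
    "D": "00000 01000 00100 00010",
}
seqs = [s.replace(" ", "") for s in raw.values()]
n = len(seqs)

# Number of sequences carrying a 1 at each position
counts = [sum(int(seq[pos]) for seq in seqs) for pos in range(len(seqs[0]))]
polymorphic = [c for c in counts if 0 < c < n]

S = len(polymorphic)                                         # segregating sites
k_hat = sum(c * (n - c) for c in polymorphic) / comb(n, 2)   # mean pairwise differences
a1 = sum(1 / i for i in range(1, n))                         # 1 + 1/2 + 1/3 + 1/4
d = k_hat - S / a1

print(S, k_hat, round(d, 2))  # prints: 4 2.0 0.08
```

The per-site counts here are 1, 2, 3 and 4, giving 4 + 6 + 6 + 4 = 20 pairwise differences over 10 pairs, which matches the ten comparisons listed above.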

Since this is a statistical test, you need to assess the significance of this value. A discussion of how to do this is provided below.

Estimating significance

A negative Tajima's D signifies an excess of low-frequency polymorphisms, indicating population size expansion (e.g., after a bottleneck or a selective sweep) and/or purifying selection. A positive Tajima's D signifies low levels of both low- and high-frequency polymorphisms, indicating a decrease in population size and/or balancing selection. However, calculating a conventional "p-value" associated with any Tajima's D value that is obtained from a sample is impossible. Briefly, this is because there is no way to describe the distribution of the statistic that is independent of the true, and unknown, theta parameter (no pivotal quantity exists). To circumvent this issue, several options have been proposed.

Tajima (1989) found an empirical similarity between the distribution of the test statistic and a beta distribution with mean zero and variance one. He estimated theta by taking Watterson's estimator and dividing it by the number of samples. Simulations have shown this distribution to be conservative (Fu and Li, 1991), and now that computing power is more readily available, this approximation is not frequently used.

A more nuanced approach was presented in a paper by Simonsen et al. These authors advocated constructing a confidence interval for the true theta value, and then performing a grid search over this interval to obtain the critical values at which the statistic is significant below a particular alpha value. An alternative approach is for the investigator to perform the grid search over the values of theta which they believe to be plausible based on their knowledge of the organism under study. Bayesian approaches are a natural extension of this method.
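The grid-search idea can be illustrated with a small simulation in Python. This is not the procedure of Simonsen et al.; it is a simplified stand-in that simulates neutral genealogies with a standard coalescent model under the infinite-sites assumption, and takes the most extreme critical values over a grid of theta values. The grid, the sample size of 10, the number of replicates and all function names are assumptions made for illustration, and the constants in `tajimas_d` follow the standard definitions from Tajima (1989).

```python
import math
import random

def tajimas_d(k_hat, s, n):
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (k_hat - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_d(n, theta, rng):
    """One neutral replicate; returns D, or None if no site is polymorphic."""
    lineages = [1] * n            # sampled descendants under each lineage
    s, pair_diffs = 0, 0
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2.0)      # time to next coalescence
        for size in lineages:
            m = poisson(rng, theta / 2.0 * t)       # mutations on this branch
            s += m
            pair_diffs += m * size * (n - size)     # pairs separated by each mutation
        i, j = rng.sample(range(k), 2)              # two lineages merge
        merged = lineages[i] + lineages[j]
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    if s == 0:
        return None
    return tajimas_d(pair_diffs / (n * (n - 1) / 2.0), s, n)

def critical_values(n, theta, reps=500, alpha=0.05, seed=1):
    """Empirical lower and upper alpha/2 quantiles of D for one theta."""
    rng = random.Random(seed)
    draws = (simulate_d(n, theta, rng) for _ in range(reps))
    ds = sorted(d for d in draws if d is not None)
    return ds[int(len(ds) * alpha / 2)], ds[int(len(ds) * (1 - alpha / 2)) - 1]

theta_grid = [1.0, 5.0, 10.0]     # hypothetical plausible values of theta
bounds = [critical_values(10, theta) for theta in theta_grid]
lower = min(lo for lo, _ in bounds)
upper = max(hi for _, hi in bounds)
print(f"reject neutrality if D < {lower:.2f} or D > {upper:.2f}")
```

Taking the minimum lower bound and maximum upper bound across the grid makes the test conservative with respect to the unknown theta, at the cost of some power.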

A very rough rule of thumb for significance is that values greater than +2 or less than -2 are likely to be significant. This rule is based on an appeal to asymptotic properties of some statistics, and thus ±2 does not actually represent a critical value for a significance test.

Finally, genome-wide scans of Tajima's D in sliding windows along a chromosomal segment are often performed. With this approach, those regions that have a value of D that greatly deviates from the bulk of the empirical distribution of all such windows are reported as significant. This method does not assess significance in the traditional statistical sense, but is quite powerful given a large genomic region, and is unlikely to falsely identify interesting regions of a chromosome if only the greatest outliers are reported.
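The outlier step of such a scan can be sketched in Python as follows. The sketch assumes the per-window D values have already been computed; the values shown are made up for illustration, and the function name and the tail fraction are arbitrary choices, not a recommended threshold.

```python
def outlier_windows(window_d, tail=0.01):
    """Indices of windows whose D falls in the lowest or highest `tail` fraction."""
    ranked = sorted(window_d)
    k = max(1, int(len(ranked) * tail))      # number of windows kept in each tail
    low_cut, high_cut = ranked[k - 1], ranked[-k]
    return [i for i, d in enumerate(window_d) if d <= low_cut or d >= high_cut]

# Hypothetical per-window D values (not real data)
window_d = [0.1, -0.2, 0.3, -2.5, 0.0, 0.2, 2.4, -0.1, 0.15, -0.05]
print(outlier_windows(window_d, tail=0.1))   # prints: [3, 6]
```

Because the cutoffs come from the windows themselves, some windows are always reported, even when the whole region is evolving neutrally; the result is a ranking of candidates rather than a significance test.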


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 