Distance matrices are used in phylogeny as
non-parametric distance methods were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree (a phylogram, with informative branch lengths). The
distance matrixIn mathematics, a distance matrix is a matrix containing the distances, taken pairwise, of a set of points. It is therefore a symmetric N×N matrix containing non-negative reals as elements, given N points in Euclidean space...
can come from a number of different sources, including measured distance (for example from
immunological studiesImmunology is a broad branch of biomedical science that covers the study of all aspects of the immune system in all organisms. It deals with, among other things, the physiological functioning of the immune system in states of both health and disease; malfunctions of the immune system in...
) or morphometric analysis, various pairwise distance formulae (such as
euclidean distanceIn mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...
) applied to discrete morphological characters, or
genetic distanceGenetic distance includes a variety of parameters used to measure the genetic divergence between species or between populations within a species. Smaller genetic distances indicate a close genetic relationship whereas large genetic distances indicate a more distant genetic relationship...
from sequence,
restriction fragmentA restriction fragment is a DNA fragment resulting from the cutting of a DNA strand by a restriction enzyme , a process called restriction. Each restriction enzyme is highly specific, recognising a particular short DNA sequence, or restriction site, and cutting both DNA strands at specific points...
, or
allozymeVariant forms of an enzyme that are not coded for by different alleles at the same locus are called allozymes as opposed to isozymes which are enzymes that perform the same function, but which are coded for by genes located at different loci....
data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states (Manhattan distance).
Distance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA (multiple sequence alignment) as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of
multiple sequence alignmentA multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor...
. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.
Neighbor-joining
Neighbor-joining methods apply general
data clusteringCluster analysis or clustering is the assignment of a set of observations into subsets so that observations in the same cluster are similar in some sense...
techniques to sequence analysis using genetic distance as a clustering metric. The simple
neighbor-joiningIn bioinformatics, neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees. Usually used for trees based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa in the tree.- The algorithm...
method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a
molecular clockThe molecular clock is a technique in molecular evolution which uses fossil constraints and rates of molecular change to deduce the time in geologic history when two species or other taxa diverged. It is used to estimate the time of occurrence of events called speciation or radiation...
) across lineages. Its relative,
UPGMAUPGMA is a simple agglomerative or hierarchical clustering method used in bioinformatics for the creation of phylogenetic trees...
(Unweighted Pair Group Method with Arithmetic mean) produces rooted trees and requires a constant-rate assumption - that is, it assumes an ultrametric tree in which the distances from the root to every branch tip are equal.
Fitch-Margoliash method
The Fitch-Margoliash method uses a weighted
least squaresThe method of least squares is applied to approximate solutions of overdetermined systems, i.e. systems of equations in which there are more equations than unknowns. Least squares is often applied in statistical contexts, particularly regression analysis....
method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be
linearThe word linear comes from the Latin word linearis, which means created by lines.In mathematics, a linear map or function f is a function which satisfies the following two properties......
; the linearity criterion for distances requires that the
expected valueIn probability theory and statistics, the expected value of a random variable is the integral of the random variable with respect to its probability measure....
s of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a
substitution matrixIn bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time...
such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.
The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is
NP-completeIn computational complexity theory, the complexity class NP-complete , is a class of problems having two properties...
, so
heuristicHeuristic is an adjective for experience-based techniques that help in problem solving, learning and discovery. A heuristic method is particularly used to rapidly come to a solution that is hoped to be close to the best possible answer, or 'optimal solution'...
search methods like those used in maximum-parsimony analysis are applied to the search through tree space.
Using outgroups
Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one
outgroupIn cladistics, an outgroup is a group of organisms that serves as a reference group for determination of the evolutionary relationship between three ormore monophyletic groups of organisms....
sequence known to be only distantly related to the sequences of interest in the query set. This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater
genetic distanceGenetic distance includes a variety of parameters used to measure the genetic divergence between species or between populations within a species. Smaller genetic distances indicate a close genetic relationship whereas large genetic distances indicate a more distant genetic relationship...
and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds
noiseIn science, and especially in physics and telecommunication, noise is fluctuations in and the addition of external factors to the stream of target information being received at a detector. In communications, it may be deliberate as for instance jamming of a radio or TV signal, but in most cases it...
to the analysis. Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly
conservedConservation may refer to:* Conservation genetics - "an interdisciplinary science that aims to apply genetic methods to the conservation and restoration of biodiversity."...
across lineages.
Horizontal gene transferHorizontal gene transfer , also Lateral gene transfer , is any process in which an organism incorporates genetic material from another organism without being the offspring of that organism. By contrast, vertical transfer occurs when an organism receives genetic material from its ancestor, e.g...
, especially between otherwise divergent
bacteriaThe bacteria are a large group of unicellular microorganisms. Typically a few micrometres in length, bacteria have a wide range of shapes, ranging from spheres to rods and spirals...
, can also confound outgroup usage.
In general, pairwise distance data are an underestimate of the path-distance between taxa on a phylogram. Pairwise distances effectively "cut corners" in a manner analogous to geographic distance: the distance between two cities may be 100 miles "as the crow flies," but a traveler may actually be obligated to travel 120 miles because of the layout of roads, the terrain, stops along the way, etc. Between pairs of taxa, some character changes that took place in ancestral lineages will be undetectable, because later changes have erased the evidence (often called multiple hits and back mutations in sequence data). This problem is common to all phylogenetic estimation, but it is particularly acute for distance methods, because only two samples are used for each distance calculation; other methods benefit from evidence of these hidden changes found in other taxa not considered in pairwise comparisons. For
nucleotideNucleotides are molecules that, when joined together, make up the structural units of RNA and DNA. In addition, nucleotides play central roles in metabolism...
and
amino acidAmino acids are molecules containing an amine group, a carboxylic acid group and one of the twenty R-groups. These molecules are particularly important in biochemistry, where this term refers to alpha-amino acids with the general formula H
2NCHRCOOH, where R is an organic substituent...
sequence data, the same stochastic models of nucleotide change used in maximum likelihood analysis can be employed to "correct" distances, rendering the analysis "semi-parametric."
Several simple algorithms exist to construct a tree directly from pairwise distances, including
UPGMAUPGMA is a simple agglomerative or hierarchical clustering method used in bioinformatics for the creation of phylogenetic trees...
and neighbor joining (NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data, minimum evolution (ME) and
least squares inferenceLeast squares inference in phylogeny generates aphylogenetic tree based on anobserved matrix of pairwise genetic distances andoptionally a weightmatrix. The goal is to find a tree which satisfies the distance constraints asbest as possible....
. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data.
Phylogeny estimation using distance methods has produced a number of controversies.
UPGMAUPGMA is a simple agglomerative or hierarchical clustering method used in bioinformatics for the creation of phylogenetic trees...
assumes an ultrametric tree (a tree where all the path-lengths from the root to the tips are equal). If the rate of evolution were equal in all sampled lineages (a
molecular clockThe molecular clock is a technique in molecular evolution which uses fossil constraints and rates of molecular change to deduce the time in geologic history when two species or other taxa diverged. It is used to estimate the time of occurrence of events called speciation or radiation...
), and if the tree were completely balanced (equal numbers of taxa on both sides of any split, to counter the node density effect), UPGMA should not produce a biased result. These expectations are not met by most datasets, and although UPGMA is somewhat robust to their violation, it is not commonly used for phylogeny estimation.
Neighbor-joiningIn bioinformatics, neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees. Usually used for trees based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa in the tree.- The algorithm...
is a form of star decomposition and, as a
heuristicHeuristic is an adjective for experience-based techniques that help in problem solving, learning and discovery. A heuristic method is particularly used to rapidly come to a solution that is hoped to be close to the best possible answer, or 'optimal solution'...
method, is generally the least computationally intensive of these methods. It is very often used on its own, and in fact quite frequently produces reasonable trees. However, it lacks any sort of tree search and optimality criterion, and so there is no guarantee that the recovered tree is the one that best fits the data. A more appropriate analytical procedure would be to use NJ to produce a starting tree, then employ a tree search using an optimality criterion, to ensure that the best tree is recovered.
Many scientists eschew distance methods. In some cases, this is for esoteric philosophical reasons. A commonly-cited reason is that distances are inherently phenetic rather than phylogenetic, in that they do not distinguish between ancestral similarity (
symplesiomorphyA symplesiomorphy or symplesiomorphic character is in cladistics a trait which is shared between two or more taxa, but which is also shared with other taxa which have an earlier last common ancestor with the taxa under consideration...
) and derived similarity (
synapomorphyIn evolutionary biology, a synapomorphy is a derived character state shared by two or more terminal groups and inherited from their most recent common ancestor....
). This criticism is not entirely fair: most currently implementations of parsimony, likelihood, and Bayesian phylogenetic inference use time-reversible character models, and thus accord no special status to derived or ancestral character states. Under these models, the tree is estimated unrooted; rooting, and consequently determination of polarity, is performed after the analysis. The primary difference between these methods and distances is that parsimony, likelihood, and Bayesian methods fit individual characters to the tree, whereas distance methods fit all the characters at once. There is nothing inherently less phylogenetic about this approach.
More practically, distance methods are avoided because the relationship between individual characters and the tree is lost in the process of reducing characters to distances. Since these methods do not use character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Also, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all due to failure to sample some species due to either experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using
maximum likelihoodMaximum likelihood estimation is a popular statistical method used for fitting a statistical model to data, and providing estimates for the model's parameters....
. This is because pairwise distances are not independent; each branch on a tree is represented in the distance measurements of all taxa it separates. Error resulting from any characteristic of that branch that might confound phylogeny (stochastic variability, change in evolutionary parameters, an abnormally long or short branch length) will be propagated through all of the relevant distance measurements. The resulting distance matrix may then better fit an alternate (presumably less optimal) tree.
Despite these potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as DNA-DNA hybridization assays. They also permit analyses that account for the possibility that the rate at which particular nucleotides are incorporated into sequences may vary over the tree, using LogDet distances. For some network-estimation methods (notably NeighborNet), the abstraction of information about individual characters in distance data are an advantage. When considered character-by character, conflict between character and a tree due to reticulation cannot be told from conflict due either to homoplasy or error. However, pronounced conflict in distance data, which represents an amalgamation of many characters, is less likely due to error or homoplasy unless the data are strongly biased, and is thus more likely to be a result of reticulation.
Distance methods are extremely popular among some molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly-instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, network estimation methods, and the occasional need to summarize relationships in with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.