Substitution model
Encyclopedia
In biology, a substitution model describes the process from which a sequence of characters changes into another set of traits. For example, in cladistics
Cladistics
Cladistics is a method of classifying species of organisms into groups called clades, which consist of an ancestor organism and all its descendants . For example, birds, dinosaurs, crocodiles, and all descendants of their most recent common ancestor form a clade...

, each position in the sequence might correspond to a property of a species
Species
In biology, a species is one of the basic units of biological classification and a taxonomic rank. A species is often defined as a group of organisms capable of interbreeding and producing fertile offspring. While in many cases this definition is adequate, more precise or differing measures are...

 which can either be present or absent. The alphabet could then consist of "0" for absence and "1" for presence. Then the sequence 00110 could mean, for example, that a species does not have feathers or lay eggs, does have fur, is warm-blooded, and cannot breathe underwater. Another sequence 11010 would mean that a species has feathers, lays eggs, does not have fur, is warm-blooded, and cannot breathe underwater. In phylogenetics
Phylogenetics
In biology, phylogenetics is the study of evolutionary relatedness among groups of organisms , which is discovered through molecular sequencing data and morphological data matrices...

, sequences are often obtained by firstly obtaining a nucleotide or protein sequence alignment
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...

, and then taking the bases
Nucleotide
Nucleotides are molecules that, when joined together, make up the structural units of RNA and DNA. In addition, nucleotides participate in cellular signaling , and are incorporated into important cofactors of enzymatic reactions...

 or amino acid
Amino acid
Amino acids are molecules containing an amine group, a carboxylic acid group and a side-chain that varies between different amino acids. The key elements of an amino acid are carbon, hydrogen, oxygen, and nitrogen...

s at corresponding positions in the alignment as the characters. Sequences achieved by this might look like AGCGGAGCTTA and GCCGTAGACGC.

Substitution models are used for a number of things:
  1. Constructing evolutionary trees in phylogenetics or cladistics.
  2. Simulating sequences to test other methods and algorithms.

Neutral, independent, finite sites models

Most substitution models used to date are neutral, independent, finite sites models.
Neutral
Neutral theory
Neutral theory may refer to one of these two related theories:* the neutral theory of molecular evolution; or* the unified neutral theory of biodiversity....

 : Selection
Natural selection
Natural selection is the nonrandom process by which biologic traits become either more or less common in a population as a function of differential reproduction of their bearers. It is a key mechanism of evolution....

 does not operate on the substitutions, and so they are unconstrained.
Independent
Statistical independence
In probability theory, to say that two events are independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other occurs...

 : Changes in one site do not affect the probability of changes in another site.
Finite Sites : There are finitely many sites, and so over evolution, a single site can be changed multiple times. This means that, for example, if a character has value 0 at time 0 and at time t, it could be that no changes occurred, or that it changed to a 1 and back to a 0, or that it changed to a 1 and back to a 0 and then to a 1 and then back to a 0, and so on.

The molecular clock and the units of time

Typically, a branch length of a phylogenetic tree is expressed as the expected number of substitutions per site; if the evolutionary model indicates that each site within an ancestral sequence will typically experience x substitutions by the time it evolves to a particular descendant's sequence then the ancestor and descendant are considered to be separated by branch length x.

Sometimes a branch length is measured in terms of geological years. For example, a fossil record may make it possible to determine the number of years between an ancestral species and a descendant species. Because some species evolve at faster rates than others, these two measures of branch length are not always in direct proportion. The expected number of substitutions per site per year is often indicated with the Greek letter mu (μ).

A model is said to have a molecular clock
Molecular clock
The molecular clock is a technique in molecular evolution that uses fossil constraints and rates of molecular change to deduce the time in geologic history when two species or other taxa diverged. It is used to estimate the time of occurrence of events called speciation or radiation...

 if the expected number of substitutions per year μ is constant regardless of which species' evolution is being examined. An important implication of a molecular clock is that the number of expected substitutions between an ancestral species and any of its present-day descendants must be independent of which descendant species is examined.

Note that the assumption of a molecular clock is often unrealistic, especially across long periods of evolution. For example, even though rodent
Rodent
Rodentia is an order of mammals also known as rodents, characterised by two continuously growing incisors in the upper and lower jaws which must be kept short by gnawing....

s are genetically very similar to primate
Primate
A primate is a mammal of the order Primates , which contains prosimians and simians. Primates arose from ancestors that lived in the trees of tropical forests; many primate characteristics represent adaptations to life in this challenging three-dimensional environment...

s, they have undergone a much higher number of substitutions in the estimated time since divergence in some regions of the genome
Genome
In modern molecular biology and genetics, the genome is the entirety of an organism's hereditary information. It is encoded either in DNA or, for many types of virus, in RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA....

. This could be due to their shorter generation time, higher metabolic rate, increased population structuring, increased rate of speciation
Speciation
Speciation is the evolutionary process by which new biological species arise. The biologist Orator F. Cook seems to have been the first to coin the term 'speciation' for the splitting of lineages or 'cladogenesis,' as opposed to 'anagenesis' or 'phyletic evolution' occurring within lineages...

, or smaller body size. When studying ancient events like the Cambrian explosion
Cambrian explosion
The Cambrian explosion or Cambrian radiation was the relatively rapid appearance, around , of most major phyla, as demonstrated in the fossil record, accompanied by major diversification of other organisms, including animals, phytoplankton, and calcimicrobes...

 under a molecular clock assumption, poor concurrence between cladistic and phylogenetic data is often observed. There has been some work on models allowing variable rate of evolution (see for example and ).

Time-reversible and stationary models

Many useful substitution models are time-reversible
Time reversibility
Time reversibility is an attribute of some stochastic processes and some deterministic processes.If a stochastic process is time reversible, then it is not possible to determine, given the states at a number of points in time after running the stochastic process, which state came first and which...

; in terms of the mathematics, the model does not care which sequence is the ancestor and which is the descendant so long as all other parameters (such as the number of substitutions per site that is expected between the two sequences) are held constant.

When an analysis of real biological data is performed, there is generally no access to the sequences of ancestral species, only to the present-day species. However, when a model is time-reversible, which species was the ancestral species is irrelevant. Instead, the phylogenetic tree can be rooted using any of the species, re-rooted later based on new knowledge, or left unrooted. This is because there is no 'special' species, all species will eventually derive from one another with the same probability.

A model is time reversible if and only if it satisfies the property
or, equivalently, the detailed balance
Detailed balance
The principle of detailed balance is formulated for kinetic systems which are decomposed into elementary processes : At equilibrium, each elementary process should be equilibrated by its reverse process....

 property,
for every i, j, and t. The notation is explained below.

Time-reversibility should not be confused with stationarity. A model is stationary if Q does not change with time. The analysis below assumes a stationary model.

The mathematics of substitution models

Stationary, neutral, independent, finite sites models (assuming a constant rate of evolution) have two parameters, , an equilibrium vector of base (or character) frequencies and a rate matrix, Q, which describes the rate at which bases of one type change into bases of another type; element for i ≠ j is the rate at which base i goes to base j. The diagonals of the Q matrix are chosen so that the rows sum to zero:

The equilibrium row vector π must be annihilated by the rate matrix Q:

The transition matrix function is a function from the branch lengths (in some units of time, possibly in substitutions), to a matrix
Matrix (mathematics)
In mathematics, a matrix is a rectangular array of numbers, symbols, or expressions. The individual items in a matrix are called its elements or entries. An example of a matrix with six elements isMatrices of the same size can be added or subtracted element by element...

 of conditional probabilities. It is denoted . The entry in the ith column and the jth row, , is the probability, after time t, that there is a base j at a given position, conditional on there being a base i in that position at time 0. When the model is time reversible, this can be performed between any two sequences, even if one is not the ancestor of the other, if you know the total branch length between them.

The asymptotic properties of Pij(t) are such that Pij(0) = δij, where δij is the Kronecker delta function. That is, there is no change in base composition between a sequence and itself. At the other extreme, or, in other words, as time goes to infinity the probability of finding base j at a position given there was a base i at that position originally goes to the equilibrium probability that there is base j at that position, regardless of the original base. Furthermore, it follows that for all t.

The transition matrix can be computed from the rate matrix via matrix exponentiation
Matrix exponential
In mathematics, the matrix exponential is a matrix function on square matrices analogous to the ordinary exponential function. Abstractly, the matrix exponential gives the connection between a matrix Lie algebra and the corresponding Lie group....

:
where Qn is the matrix Q multiplied by itself enough times to give its nth power.

If Q is diagonalizable
Diagonalizable matrix
In linear algebra, a square matrix A is called diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix P such that P −1AP is a diagonal matrix...

, the matrix exponential can be computed directly: let Q = U−1 Λ U be a diagonalization of Q, with
where Λ is a diagonal matrix and where are the eigenvalues of Q, each repeated according to its multiplicity. Then
where the diagonal matrix eΛt is given by

GTR: Generalised time reversible

GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986.

The GTR parameters for nucleotides consist of an equilibrium base frequency vector, , giving the frequency at which each base occurs at each site, and the rate matrix


Because the model must be time reversible and must approach the equilibrium nucleotide (base) frequencies at long times, each rate below the diagonal equals the reciprocal rate above the diagonal multiplied by the equlibrium ratio of the two bases. As such, the nucleotide GTR requires 6 substitution rate parameters and 4 equilibrium base frequency parameters. Since the 4 frequency parameters must sum to 1, there are only 3 free frequency parameters. The total of nine free parameters is often further reduced to 8 parameters plus , the overall number of substitutions per unit time. When measuring time in substitutions (=1) only 8 free parameters remain.

In general, to compute the number of parameters, you count the number of entries above the diagonal in the matrix, i.e. for n trait values per site , and then add n-1 for the equilibrium frequencies, and subtract 1 because is fixed. You get


For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), you would find there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are codons, resulting in 2078 free parameters, but when the rates for transitions between codons which differ by more than one base are assumed to be zero, then there are only parameters.

Mechanistic vs. empirical models

A main difference in evolutionary models is how many parameters are estimated every time for the data set under
consideration and how many of them are estimated once on a large data set.
Mechanistic models describe all substitution as a function of a number of parameters which are estimated for every
data set analyzed, preferably using maximum likelihood
Maximum likelihood
In statistics, maximum-likelihood estimation is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters....

. This has the advantage that the model can be adjusted to the particularities of a specific data set (e.g. different composition biases in DNA). Problems can arise when too many
parameters are used, particularly if they can compensate for each other. Then it is often the case that the data set is too small to yield enough information to estimate all parameters accurately.

Empirical models are created by estimating many parameters (typically all entries of the rate matrix and the character frequencies, see the GTR model above) from a large data set. These parameters are then fixed and will be
reused for every data set. This has the advantage that those parameters can be estimated more accurately. Normally, it
is not possible to estimate all entries of the substitution matrix from the current data set only. On the downside,
the estimated parameters might be too generic and do not fit a particular data set well enough.

With the large-scale genome sequencing still producing very large amounts of DNA and protein sequences, there is
enough data available to create empirical models with any number of parameters. Because of the problems mentioned above, the two approaches are often combined, by estimating most of the parameters once on large-scale data, while a few remaining parameters are then adjusted to the data set under consideration. The following sections give an overview of the different approaches taken for DNA, protein or codon-based models.

Models of DNA substitution

See main article: Models of DNA evolution
Models of DNA evolution
A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses...

for more formal descriptions of the DNA models.

Models of DNA evolution were first proposed in 1969 by Jukes and Cantor, assuming equal transition rates as well as equal equilibrium frequencies for all bases. In 1980 Kimura introduced a model with two parameters: one for the transition and one for the transversion rate and in 1981, Felsenstein made a model in which the substitution rate corresponds to the equilibrium frequency of the target nucleotide. Hasegawa, Kishino and Yano (HKY) unified the two last models to a six parameter model. In the 1990s, models similar to HKY were developed and refined by several researchers.

For DNA substitution models, mainly mechanistic models (as described above) are employed. The small number of parameters
to estimate makes this feasible, but also DNA is often highly optimized for specific purposes (e.g. fast expression
Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA , transfer RNA or small nuclear RNA genes, the product is a functional RNA...

 or stability) depending on the organism and the type of gene, making it necessary to adjust the model to these circumstances.

Models of amino acid substitutions

For many analyses, particularly for longer evolutionary distances, the evolution is modeled on the amino acid level. Since not
all DNA substitution also alter the encoded amino acid, information is lost when looking at amino acids instead of nucleotide
bases. However, several advantages speak in favor of using the amino acid information: DNA is much more inclined to show
compositional bias than amino acids, not all positions in the DNA evolve at the same speed (non-synonymous
Missense mutation
In genetics, a missense mutation is a point mutation in which a single nucleotide is changed, resulting in a codon that codes for a different amino acid . This can render the resulting protein nonfunctional...

 mutations
are more likely to become fixed in the population than synonymous ones), but probably most important, because of those
fast evolving positions and the limited alphabet size (only four possible states), the DNA suffers much more from back
substitutions, making it difficult to accurately estimate longer distances.

Unlike the DNA models, amino acid models traditionally are empirical models. They were pioneered in the 1970s by Dayhoff and co-workers
, by estimating replacement rates from protein alignments with at least 85% identity. This minimized the chances of observing multiple substitutions at a site. From the estimated rate matrix, a series of replacement probability matrices were derived, known under names such as PAM
Point accepted mutation
Point accepted mutation , is a set of matrices used to score sequence alignments. The PAM matrices were introduced by Margaret Dayhoff in 1978 based on 1572 observed mutations in 71 families of closely related proteins...

250. The Dayhoff model was used to assess the significance of homology search results, but also for phylogenetic analyses. The Dayhoff PAM matrices were based on relatively few alignments (since not more were available at that time), but in the 1990s, new matrices were estimated using almost the same methodology, but based on the large protein databases available then .
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK