All Topics  
Multiple sequence alignment

 
Multiple Sequence Alignment

   Email Print
   Bookmark   Link






 

Multiple sequence alignment



 
 
A multiple sequence alignment (MSA) is a sequence alignment
Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, or evolutionary relationships between the sequences....
 of three or more biological sequences, generally protein
Protein

Proteins are organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid Residue ....
, DNA
DNA

Deoxyribonucleic acid is a nucleic acid that contains the genetics instructions used in the development and functioning of all known living organisms and some viruses....
, or RNA
RNA

Ribonucleic acid is a type of molecule that consists of a long chain of nucleotide units. Each nucleotide consists of a nucleobase, a ribose sugar, and a phosphate....
. In general, the input set of query sequences are assumed to have an evolution
Evolution

In biology, evolution is change in the heritability trait of a population of organisms from one generation to the next. These changes are caused by a combination of three main processes: variation, reproduction, and selection....
ary relationship by which they share a lineage and are descended from a common ancestor.






Discussion
Ask a question about 'Multiple sequence alignment'
Start a new discussion about 'Multiple sequence alignment'
Answer questions from other users
Full Discussion Forum



Recent Posts









Encyclopedia


Rplp0 90 Clustalw Aln
A multiple sequence alignment (MSA) is a sequence alignment
Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, or evolutionary relationships between the sequences....
 of three or more biological sequences, generally protein
Protein

Proteins are organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid Residue ....
, DNA
DNA

Deoxyribonucleic acid is a nucleic acid that contains the genetics instructions used in the development and functioning of all known living organisms and some viruses....
, or RNA
RNA

Ribonucleic acid is a type of molecule that consists of a long chain of nucleotide units. Each nucleotide consists of a nucleobase, a ribose sugar, and a phosphate....
. In general, the input set of query sequences are assumed to have an evolution
Evolution

In biology, evolution is change in the heritability trait of a population of organisms from one generation to the next. These changes are caused by a combination of three main processes: variation, reproduction, and selection....
ary relationship by which they share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology
Homology (biology)

In evolutionary biology, homology refers to any similarity between characteristics that is due to their common descent. The word homologous derives from the ancient Greek ??????e??, 'to agree'....
 can be inferred and phylogenetic analysis
Molecular phylogeny

Molecular phylogenetics, also known as molecular systematics, is the use of the structure of molecules to gain information on an organism's evolutionary relationships....
 can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation
Mutation

In biology, mutations are changes to the nucleotide sequence of the genetic material of an organism. Mutations can be caused by copying errors in the genetic material during cell division, by exposure to ultraviolet or ionizing radiation, chemical mutagens, or virus , or can be induced by the organism, itself, by cellular processes such as s...
 events such as point mutations (single amino acid
Amino acid

In chemistry, an amino acid is a molecule containing both amine and carboxyl functional groups. These molecules are particularly important in biochemistry, where this term refers to alpha-amino acids with the general formula H2NCHRCOOH, where R is an organic substituent....
 or nucleotide
Nucleotide

Nucleotides are molecules that comprise the structural units of RNA and DNA. Additionally, nucleotides play central roles in metabolism. In that capacity, they serve as sources of chemical energy , participate in cell signaling , and are incorporated into important cofactors of enzymatic reactions ....
 changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (indel
Indel

The term ??indel?? has different definitions in different fields. In evolutionary studies, ??indel?? is used to mean an insertion or a deletion and ??indels?? simply refers to the mutation class that includes both insertions, deletions, and the combination thereof , including insertion and deletion events that may be separated by many years....
s or gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation
Conservation (genetics)

Conservation may refer to:* Conservation genetics - "an interdisciplinary science that aims to apply genetic methods to the conservation and restoration of biodiversity."...
 of protein domain
Protein domain

A protein domain is a part of protein sequence and tertiary structure that can biological evolution, function, and exist independently of the rest of the protein chain....
s, tertiary
Tertiary structure

In biochemistry and chemistry, the tertiary structure of a protein or any other macromolecule is its three-dimensional structure, as defined by the atomic coordinates....
 and secondary
Secondary structure

In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids ....
 structures, and even individual amino acids or nucleotides.

Multiple sequence alignment also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length can be difficult and are almost always time-consuming to align by hand, computational algorithm
Algorithm

In mathematics, computing, linguistics and related subjects, an algorithm is a sequence of finite instructions, often used for calculation and data processing....
s are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment
Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, or evolutionary relationships between the sequences....
 because they are more computationally complex (naively, O(LengthNseqs) to produce. Most multiple sequence alignment programs use heuristic
Heuristic

Heuristic is an adjective for methods that help in problem solving, in turn leading to learning and discovery. These methods in most cases employ experimentation and trial-and-error techniques....
 methods rather than global optimization
Global optimization

Global optimization is a branch of applied mathematics and numerical analysis that deals with the optimization of a function or a Set of functions to some criteria....
 because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.

Dynamic programming and computational complexity

The most direct method for producing an MSA uses the dynamic programming
Dynamic programming

In mathematics and computer science, dynamic programming is a method of solving problems that exhibit the properties of overlapping subproblems and optimal substructure ....
 technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty
Gap penalty

Gap penalties are used during sequence alignment. Gap penalties contribute to the overall score of alignments, and therefore, the size of the gap penalty relative to the entries in the similarity matrix affects the alignment that is finally selected....
 and a substitution matrix
Substitution matrix

In evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time....
 assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment. In the latter case it is essential that the average score be less than 0.

For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment
Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, or evolutionary relationships between the sequences....
. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. To find the global optimum for n sequences this way has been shown to be an NP-complete
NP-complete

In computational complexity theory, the complexity class NP-complete is a class of problems having two properties:* Any given solution to the problem can be verified quickly ; the set of problems with this property is called NP ....
 problem. in 1989, Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space. In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called sum of pair score) and has been implemented in the program. But MSA is still impractical for many multiple sequence alignment applications that require the simultaneous alignment of more than about 20 sequences.

Progressive alignment construction

The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method), that builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree
Phylogenetic tree

A phylogenetic tree or evolutionary tree is a tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common descent....
, called a guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering
Clustering

Clustering can refer to the following:In information technology:* Forming a computer cluster - the connection of many computers to be used as one larger computer....
 method such as neighbor-joining
Neighbor-joining

In bioinformatics, neighbor-joining is a bottom-up clustering method used for the construction of phylogeny tree data structures. Usually used for trees based on DNA or protein primary structure data, the algorithm requires knowledge of the distance between each pair of taxa in the tree....
 or UPGMA
UPGMA

UPGMA is a simple Cluster analysis or bottom-up data cluster analysis used in bioinformatics for the creation of phylogenetic trees. UPGMA assumes a constant rate of evolution , and is not a well-regarded method for inferring phylogenetic trees unless this assumption has been tested and justified for the data set being used....
, and may use distances based on the number of identical two letter sub-sequences (as in FASTA
FASTA

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985 in the article ....
 rather than a dynamic programming alignment)

Progressive alignments cannot be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program.

Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. Progressive alignment services are commonly available on publicly accessible web servers so users need not locally install the applications of interest. The most popular progressive alignment method has been the Clustal
Clustal

Clustal is a widely used multiple sequence alignment computer program. The latest version is 2.0. There are two main variations:*ClustalW: command line interface...
 family, especially the weighted variant ClustalW to which access is provided by a large number of web portals including , , and . Different portals or implementations can vary in user interface and make different parameters accessible to the user. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for protein structure prediction
Protein structure prediction

Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as the structures of related proteins....
 by homology modeling.

Another common progressive alignment method called T-Coffee
T-Coffee

T-Coffee is a multiple sequence alignment software using a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment....
 is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.

Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A 2006, semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in polynomial time
Polynomial time

In computational complexity theory, polynomial time refers to the computation time of a problem where the run time, , is no greater than a polynomial function of the problem size, n....
 has been implemented in the program .

Iterative methods

A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result - that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general objective function such as finding a high-quality alignment score.

A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique. The software package uses a hill-climbing algorithm to optimize its MSA alignment score and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA. PRRP performs best when refining an alignment previously constructed by a faster method.

Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or sequence motif
Sequence motif

In genetics, a sequence motif is a nucleotide or amino acid sequence pattern that is widespread and has, or is conjectured to have, a biology significance....
s without introducing a gap penalty. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or "seeds" for a slower global-alignment procedure is implemented in the suite.

A third popular iteration-based method called (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled).

Hidden Markov models

Hidden Markov model
Hidden Markov model

A hidden Markov model is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data....
s are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.

Typical HMM-based methods work by representing an MSA as a form of directed acyclic graph
Directed acyclic graph

In computer science and mathematics, a directed acyclic graph, also called a DAG, is a with no ; that is, for any Vertex v, there is no nonempty directed path that starts and ends on v....
 known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, known as the Viterbi algorithm
Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm for finding the most likelihood function sequence of hidden states – called the Viterbi path – that results in a sequence of observed events, especially in the context of Markov information sources, and more generally, hidden Markov models....
, is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA. This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.

Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is (Partial-Order Alignment); a similar but more generalized method is implemented in the package (Sequence Alignment and Modeling System). SAM has been used as a source of alignments for protein structure prediction
Protein structure prediction

Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as the structures of related proteins....
 to participate in the CASP
CASP

CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment for protein structure prediction taking place every two years since 1994....
 structure prediction experiment and to develop a database of predicted proteins in the yeast
Yeast

Yeasts are eukaryote microorganisms classified in the Kingdom fungus, with about 1,500 species currently described; they dominate fungal diversity in the oceans....
 species S. cerevisiae. HMM methods can also be used for database search with HMMER
HMMER

HMMER is a free software and commonly used software package for protein sequence analysis written by Sean Eddy. It is used for sensitive database search using Hidden Markov model ....
.

Genetic algorithms and simulated annealing

Standard optimization techniques in computer science - both of which were inspired by, but do not directly reproduce, physical processes - have also been used in an attempt to more efficiently produce quality MSAs. One such technique, genetic algorithm
Genetic algorithm

A genetic algorithm is a Search algorithm wikt:technique used in computing to find exact or approximate solutions to Optimization and Search algorithm problems....
s, has been used for MSA production in an attempt to broadly simulate the hypothesized evolutionary process that gave rise to the divergence in the query set. The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at varying positions. A general objective function is optimized during the simulation, most generally the "sum of pairs" maximization function introduced in dynamic programming-based MSA methods. A technique for protein sequences has been implemented in the software program SAGA (Sequence Alignment by Genetic Algorithm) and its equivalent in RNA is called RAGA.

The technique of simulated annealing
Simulated annealing

Simulated annealing is a generic probabilistic algorithm metaheuristic for the global optimization problem of applied mathematics, namely locating a good approximation to the global optimum of a given function in a large search space....
, by which an existing MSA produced by another method is refined by a series of rearrangements designed to find more optimal regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).

Motif finding

Caspase Motif Alignment
Motif finding, also known as profile analysis, is a method of locating sequence motif
Sequence motif

In genetics, a sequence motif is a nucleotide or amino acid sequence pattern that is widespread and has, or is conjectured to have, a biology significance....
s in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps. Alternatively, statistical pattern-finding algorithms can identify motifs as a precursor to an MSA rather than as a derivation. In many cases when the query set contains only a small number of sequences or contains only highly related sequences, pseudocount
Pseudocount

A pseudocount is a count added to observation data in order to change the probability in a model of those data, which is known not to be 0 , to being negligible rather than being zero....
s are added to normalize the distribution reflected in the scoring matrix. In particular, this corrects zero-probability entries in the matrix to values that are small but nonzero.

Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families. Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. The server provides an interactive method to locate such motifs in unaligned sequences.

Statistical pattern-matching has been implemented using both the expectation-maximization algorithm
Expectation-maximization algorithm

An expectation-maximization algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables....
 and the Gibbs sampler. One of the most common motif-finding tools, known as MEME, uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion MAST in the combined suite .

See also

  • Cladistics
    Cladistics

    Cladistics is the hierarchical classification of species based on evolutionary ancestry. Cladistics is distinguished from other taxonomic systems because it focuses on evolution rather than similarities between species, and because it places heavy emphasis on objective, quantitative analysis....
  • Phylogenetics
    Phylogenetics

    In biology, phylogenetics is the study of evolutionary relatedness among various groups of organisms , which is discovered through molecular sequencing data and morphological data matrices....
  • Sequence alignment software
    Sequence alignment software

    This list of sequence alignment software is a compilation of software tools and web portals used in pairwise sequence alignment and multiple sequence alignment....
  • Structural alignment
    Structural alignment

    Structural alignment is a form of sequence alignment that is based on comparison of shape. These alignments attempt to establish equivalences between two or more polymer structures based on their shape and three-dimensional tertiary structure....


Survey articles


External links

  • from the Virtual School of Natural Sciences
  • from Pôle Bioinformatique Lyonnais
  • from the Max Planck Institute for Molecular Genetics