Sequence analysis - AbsoluteAstronomy.com

In bioinformatics, the term sequence analysis refers to the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others. Since the development of methods of high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases has increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of the organism from which the new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Nowadays, there are many tools and techniques that perform sequence comparisons (sequence alignment) and analyze the alignment product to understand its biology.

Sequence analysis in molecular biology includes a very wide range of relevant topics:

The comparison of sequences in order to find similarity, often to infer whether they are related (homologous)
Identification of intrinsic features of the sequence such as active sites, post-translational modification sites, gene structures, reading frames, distributions of introns and exons, and regulatory elements
Identification of sequence differences and variations such as point mutations and single nucleotide polymorphisms (SNPs), in order to obtain genetic markers
Revealing the evolution and genetic diversity of sequences and organisms
Identification of molecular structure from sequence alone
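
Two of the topics above, reading frames and point differences between sequences, can be illustrated with a minimal Python sketch. The function names `reading_frames` and `point_differences` are hypothetical and chosen for this illustration; the second function assumes the two sequences are already aligned and of equal length.

```python
def reading_frames(sequence: str) -> list[list[str]]:
    """Split a nucleotide sequence into codons for the three forward frames.

    Incomplete trailing codons are dropped.
    """
    frames = []
    for offset in range(3):
        codons = [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]
        frames.append(codons)
    return frames


def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumption: both sequences are pre-aligned and have equal length.
    Positions are 0-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


if __name__ == "__main__":
    print(reading_frames("ATGGCCTAA"))
    print(point_differences("GATTACA", "GACTATA"))
```

Whether a reported difference is a point mutation or a polymorphism depends on its frequency in a population, which this sketch does not model.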

In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing".

In marketing, sequence analysis is often used in analytical customer relationship management applications, such as NPTB models (Next Product to Buy).

History

Since the very first sequences of the insulin protein were characterised by Fred Sanger in 1951, biologists have been trying to use this knowledge to understand the function of molecules.

Sequence Alignment

There are millions of protein and nucleotide sequences known. These sequences fall into many groups of related sequences known as protein families or gene families. Relationships between these sequences are usually discovered by aligning them together and assigning this alignment a score. There are two main types of sequence alignment: pairwise sequence alignment compares only two sequences at a time, and multiple sequence alignment compares many sequences at once. Two important algorithms for aligning pairs of sequences are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. Popular tools for sequence alignment include:

Pair-wise alignment - BLAST
Multiple alignment - ClustalW, PROBCONS, MUSCLE, MAFFT, and T-Coffee
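
To make the idea of aligning two sequences and scoring the alignment concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration; practical tools typically use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> tuple[int, str, str]:
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GCATGCG")
    print(total)
    print(top)
    print(bottom)
```

Several alignments can share the optimal score; the traceback here returns only one of them, preferring diagonal moves when there is a tie.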

A common use for pairwise sequence alignment is to take a sequence of interest and compare it to all known sequences in a database to identify homologous sequences. In general, the matches in the database are ordered to show the most closely related sequences first, followed by sequences with diminishing similarity. These matches are usually reported with a measure of statistical significance such as an expectation value (E-value).
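
The ordering of database matches can be sketched as follows. This toy example scores a query against every entry with a Smith-Waterman local alignment score and sorts the hits from best to worst. It is only an illustration under stated assumptions: the scoring parameters and the in-memory `database` dictionary are hypothetical, no statistical significance is computed, and tools such as BLAST rely on faster heuristics rather than exhaustive dynamic programming.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query: str, database: dict[str, str]) -> list[tuple[str, int]]:
    """Score the query against every entry and sort by descending score."""
    scored = [(name, smith_waterman_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database; names and sequences are made up.
    database = {
        "partial": "TTACGATT",
        "none": "CCCCCCCC",
        "exact": "TTACGTTT",
    }
    for name, score in rank_hits("ACGT", database):
        print(name, score)
```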

Profile comparison

In 1987, Michael Gribskov, Andrew McLachlan and David Eisenberg introduced the method of profile comparison for identifying distant similarities between proteins. Rather than using a single sequence, profile methods use a multiple sequence alignment to encode a profile which contains information about the conservation level of each residue. Profiles are also known as Position Specific Scoring Matrices (PSSMs). In 1993, a probabilistic interpretation of profiles was introduced by David Haussler and colleagues using hidden Markov models.
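
A minimal sketch of how a profile can be derived from a multiple sequence alignment is shown below. It builds a log-odds position-specific scoring matrix and uses it to score a candidate sequence. The assumptions are loud ones: the alignment is gap-free, the background distribution is uniform, a pseudocount of 1 is added per symbol, and a four-letter nucleotide alphabet is used only to keep the example short (a protein profile would pass a 20-letter alphabet). This is not the weighting scheme of the 1987 method.

```python
import math


def build_pssm(alignment: list[str], alphabet: str = "ACGT",
               pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build a log-odds PSSM (one dict per column) from a gap-free alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(length):
        column = [row[col] for row in alignment]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / total
            scores[symbol] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm: list[dict[str, float]], sequence: str) -> float:
    """Sum the per-column scores of a sequence of the same length as the PSSM."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the profile length")
    return sum(column[symbol] for column, symbol in zip(pssm, sequence))


if __name__ == "__main__":
    profile = build_pssm(["ACGT", "ACGA", "ACGT", "TCGT"])
    print(score_sequence(profile, "ACGT"))  # resembles the family
    print(score_sequence(profile, "TTTT"))  # mostly unlike the family
```

Strongly conserved columns contribute large positive or negative scores, while variable columns contribute little, which is how a profile captures the conservation level of each position.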

Sequence assembly

Sequence assembly refers to the reconstruction of a DNA sequence by aligning and merging small DNA fragments. It is an integral part of modern DNA sequencing. Since presently available DNA sequencing technologies are ill-suited for reading long sequences, large pieces of DNA (such as genomes) are often sequenced by (1) cutting the DNA into small pieces, (2) reading the small fragments, and (3) reconstituting the original DNA by merging the information from the various fragments.
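
The merging step (3) can be sketched with a simple greedy strategy: repeatedly find the pair of fragments with the longest suffix-prefix overlap and merge them. This is a toy illustration that assumes error-free reads from a single strand and no repeated regions; the function names and the `min_length` threshold are hypothetical, and production assemblers use considerably more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_length.
    """
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_length: int = 3) -> list[str]:
    """Greedily merge fragments by longest overlap; return remaining contigs."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = ["ACCTGA", "ATGCGTA", "CGTACCT"]
    print(greedy_assemble(reads))
```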

Gene prediction

Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. In general, the prediction of bacterial genes is significantly simpler and more accurate than the prediction of genes in eukaryotic species, which usually have complex intron/exon patterns.
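
One reason bacterial gene prediction is simpler is that, without introns, a candidate protein-coding gene appears as an uninterrupted open reading frame (ORF). The sketch below scans the three forward reading frames for a start codon followed in-frame by a stop codon. It is a simplification under stated assumptions: only the forward strand is searched, only ATG is treated as a start codon, and the `min_codons` threshold is a hypothetical parameter; real gene finders also use statistical models of coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Find forward-strand ORFs.

    Returns (start, end, frame) tuples with 0-based start and exclusive end;
    the reported span includes the stop codon. min_codons counts the codons
    before the stop codon, including the start codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGGCTTAAGG"
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end])
```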

Protein Structure Prediction

The 3D structures of molecules are of great importance to their functions in nature. Since structural prediction of large molecules at an atomic level is a largely intractable problem, some biologists introduced ways to predict 3D structure at the primary sequence level. These include biochemical or statistical analysis of amino acid residues in local regions and structural inference from homologs (or other potentially related proteins) with known 3D structures.

There have been a large number of diverse approaches to solving the structure prediction problem. In order to determine which methods were most effective, a structure prediction competition called CASP (Critical Assessment of Structure Prediction) was founded.

Methodology

The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include:

Artificial Neural Network
Hidden Markov Model
Support Vector Machine
Clustering
Bayesian Network
Regression Analysis