Bioinformatics



 
 






Bioinformatics is the branch of life science concerned with the application of information technology to the field of molecular biology. The term bioinformatics was coined by Paulien Hogeweg in 1978 for the study of informatic processes in biotic systems. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Over the past few decades, rapid developments in genomic and other molecular research technologies, combined with developments in information technology, have produced a tremendous amount of information related to molecular biology. Bioinformatics is the name given to the mathematical and computing approaches used to glean understanding of biological processes from this information. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

The primary goal of bioinformatics is to increase our understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., data mining and machine learning algorithms) to achieve this goal. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.

Introduction


At the beginning of the "genomic revolution", bioinformatics was applied to the creation and maintenance of databases to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces through which researchers could both access existing data and submit new or revised data.

In order to study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
  • the development and implementation of tools that enable efficient access to, and use and management of, various types of information,
  • the development of new algorithms and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences.

Major research areas


Sequence analysis



Since the phage Φ-X174 was sequenced in 1977, the DNA sequences of hundreds of organisms have been decoded and stored in databases. The information is analyzed to determine genes that encode polypeptides, as well as regulatory sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs are used to search the genomes of thousands of organisms, containing billions of nucleotides. These programs compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence in order to identify sequences that are related but not identical.

A variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long). The ends of these fragments overlap and, when aligned in the right way, make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. In the case of the Human Genome Project, it took several days of CPU time (on one hundred Pentium III desktop machines clustered specifically for the purpose) to assemble the fragments. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
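The idea of assembling overlapping fragments can be illustrated with a toy greedy merger. This is only a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the example fragments are hypothetical, and real assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three made-up fragments covering the made-up sequence ATGCGTACGTTAGC
print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))
# prints ['ATGCGTACGTTAGC']
```

The pairwise comparison in each round is quadratic in the number of fragments, which hints at why assembly becomes expensive for larger genomes.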

Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all of the nucleotides within a genome are part of genes. Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequences for protein identification.
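As a rough illustration of an automatic gene search, the sketch below scans the forward strand for open reading frames (an ATG start codon followed in the same frame by a stop codon). It is a simplified assumption of how such a search might begin, not a real gene finder: it ignores the reverse strand, introns, and regulatory signals, and the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each forward-strand open reading frame.

    start is the index of the ATG, end is one past the stop codon, and
    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


seq = "CCATGAAATTTGGGTAACC"  # made-up sequence
for start, end, frame in find_orfs(seq):
    print(start, end, frame, seq[start:end])
# prints: 2 17 2 ATGAAATTTGGGTAA
```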

See also: sequence analysis, sequence profiling tool, sequence motif.

Genome annotation


In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.

Computational evolutionary biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
  • trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
  • more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation,
  • build complex computational models of populations to predict the outcome of the system over time,
  • track and share information on an increasingly large number of species and organisms.
Future work endeavours to reconstruct the now more complex tree of life.
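One simple way to "measure changes in DNA" between organisms is the proportion of differing sites between sequences that have already been aligned. The sketch below assumes pre-aligned, equal-length sequences with "-" marking gaps; the function names and the example sequences are hypothetical, and phylogenetic studies typically use more refined evolutionary models.

```python
def p_distance(a, b):
    """Fraction of compared (non-gap) alignment columns where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in compared if x != y) / len(compared)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping name -> aligned sequence."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m in names for n in names}


# Made-up aligned sequences, for illustration only
aligned = {
    "species_a": "ACGTACGT",
    "species_b": "ACGTACGA",
    "species_c": "ACCTTCGA",
}
dist = distance_matrix(aligned)
print(dist[("species_a", "species_b")])  # 0.125
print(dist[("species_a", "species_c")])  # 0.375
```

A matrix like this could serve as the input to a distance-based tree-building method.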

The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are unrelated.

Measuring biodiversity

The biodiversity of an ecosystem might be defined as the total genomic complement of a particular environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop of soil, or the entire biosphere of the planet Earth. Databases are used to collect the species names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species. Specialized software programs are used to find, visualize, and analyze the information and, most importantly, to communicate it to other people. Computer simulations model such things as population dynamics, or calculate the cumulative genetic health of a breeding pool (in agriculture) or endangered population (in conservation). One very exciting potential of this field is that entire DNA sequences, or genomes, of endangered species can be preserved, allowing the results of Nature's genetic experiment to be remembered in silico, and possibly reused in the future, even if that species is eventually lost.

Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
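The comparison of cancerous and non-cancerous expression data can be sketched as a log2 fold-change calculation. This is an illustrative assumption about one common first step, not a complete analysis: the gene names, intensity values, pseudocount, and threshold below are all hypothetical, and because the measurements are noisy, real studies rely on replicates and statistical testing rather than a single ratio.

```python
import math


def log2_fold_changes(tumor, normal, pseudocount=1.0):
    """log2 ratio of tumor to normal expression for genes present in both."""
    return {
        gene: math.log2((tumor[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in tumor
        if gene in normal
    }


def classify(fold_changes, threshold=1.0):
    """Split genes into up- and down-regulated lists by a log2 threshold."""
    up = sorted(g for g, fc in fold_changes.items() if fc >= threshold)
    down = sorted(g for g, fc in fold_changes.items() if fc <= -threshold)
    return up, down


# Made-up intensity values for three made-up genes
tumor = {"GENE_A": 799, "GENE_B": 99, "GENE_C": 24}
normal = {"GENE_A": 99, "GENE_B": 99, "GENE_C": 199}

fc = log2_fold_changes(tumor, normal)
print(fc)            # {'GENE_A': 3.0, 'GENE_B': 0.0, 'GENE_C': -3.0}
print(classify(fc))  # (['GENE_A'], ['GENE_C'])
```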

Analysis of regulation

Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.
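The search for over-represented elements in the promoters of co-expressed genes can be sketched by counting short words (k-mers) and comparing their frequency with a background set. This is a simplified assumption of the approach: the function names, the add-one smoothing, the ratio threshold, and all sequences are hypothetical, and practical motif finders use proper statistical models.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(cluster, background, k=7, min_ratio=2.0):
    """Return (k-mer, ratio) pairs over-represented in cluster vs background."""
    fg = kmer_presence(cluster, k)
    bg = kmer_presence(background, k)
    results = []
    for kmer, n in fg.items():
        fg_frac = n / len(cluster)
        # Add-one smoothing so k-mers absent from the background do not divide by zero
        bg_frac = (bg[kmer] + 1) / (len(background) + 1)
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            results.append((kmer, ratio))
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Made-up promoter fragments: the co-expressed set shares the word TGACTCA
cluster = ["AATGACTCATT", "CCTGACTCAGG", "GTGACTCATC"]
background = ["AAAAAAAAAAA", "CCCCCCCCCCC", "GGGGGGTTTTT", "ACACACACACA"]
print(enriched_kmers(cluster, background)[0][0])  # TGACTCA
```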

Analysis of protein expression

Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data. The former approach faces problems similar to those of microarrays targeted at mRNA; the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete, peptides from each protein are detected.
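Matching measured masses against masses predicted from a sequence database can be sketched as follows. The example assumes, beyond what the text states, that proteins are digested into peptides by cleaving after lysine (K) or arginine (R), as trypsin is commonly described as doing, and that neutral monoisotopic peptide masses are compared within a fixed tolerance. The two-protein "database", the observed mass, and the function names are hypothetical.

```python
# Monoisotopic residue masses (in daltons) for the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def digest(protein):
    """Split a protein after every K or R (a simplified cleavage rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def match_masses(observed, database, tol=0.02):
    """Return (observed mass, protein name, peptide) for every match within tol."""
    hits = []
    for name, protein in database.items():
        for pep in digest(protein):
            predicted = peptide_mass(pep)
            for mass in observed:
                if abs(mass - predicted) <= tol:
                    hits.append((mass, name, pep))
    return hits


database = {"protein_1": "GAKSR", "protein_2": "VVK"}  # made-up sequences
print(match_masses([274.17], database))
# prints [(274.17, 'protein_1', 'GAK')]
```

Because only some peptides of each protein are usually observed, a single match like this is weak evidence; scoring many partial matches per protein is where the statistical analysis mentioned above comes in.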

Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again, the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise, and thus Hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.
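A very small illustration of change-point analysis is shown below: given noisy values ordered along a chromosome, it picks the single split that minimizes the squared error around the two segment means. This is only a sketch of the general idea under the assumption of exactly one change; it is not a Hidden Markov model, the function name and the example values are hypothetical, and methods used in practice handle multiple segments and assess significance.

```python
def best_change_point(values):
    """Return (index, left_mean, right_mean) for the best single split.

    The split at index k separates values[:k] from values[k:] and is chosen
    to minimize the total squared deviation from the two segment means.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    best = None
    for k in range(1, len(values)):
        left, right = values[:k], values[k:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, k, left_mean, right_mean)
    return best[1], best[2], best[3]


# Made-up log-ratio signal: roughly 0 for five probes, then roughly 1 (a gain)
signal = [0.1, -0.1, 0.0, 0.1, -0.1, 0.9, 1.1, 1.0, 1.1, 0.9]
index, left_mean, right_mean = best_change_point(signal)
print(index)  # 5
```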

Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.

Prediction of protein structure


Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy, or mad cow disease, prion.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.

One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
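
The gene A / gene B inference described above can be sketched as a small lookup: compare a query sequence against annotated sequences and borrow the annotation of the closest one. This is only a toy stand-in for real homology detection. It assumes the sequences are already aligned to equal length, and the 40% identity threshold, the function names and the example sequences are all hypothetical.

```python
def percent_identity(a, b):
    """Percent of aligned positions holding the same residue (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def transfer_annotation(query, annotated, threshold=40.0):
    """Return (name, function, identity) for the best hit at or above threshold.

    `annotated` maps a gene name to a (sequence, function) pair.
    Returns None when no annotated sequence is similar enough.
    """
    best = None
    for name, (sequence, function) in annotated.items():
        identity = percent_identity(query, sequence)
        if identity >= threshold and (best is None or identity > best[2]):
            best = (name, function, identity)
    return best


# Made-up sequences and annotations, for illustration only.
known = {
    "geneA": ("MKTAYIAKQR", "kinase"),
    "geneC": ("MLSDEWWPQG", "transporter"),
}
hit = transfer_annotation("MKTAYLAKQR", known)
```

Here `hit` names geneA, whose sequence differs from the query at a single position. A high identity only suggests a shared function; real pipelines align the sequences first and judge hits by statistical significance rather than a fixed cut-off.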

One example of this is the homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though the two proteins have very different amino acid sequences, their structures are virtually identical, which reflects their near-identical purposes.

Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.

See also: structural motif and structural domain.

Comparative genomics


The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.

Many of these studies are based on homology detection and the computation of protein families.

Modeling biological systems


Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life, or virtual evolution, attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
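
To give a sense of what simulating a cellular subsystem can mean at the smallest scale, the sketch below integrates a single enzyme-catalysed reaction that converts a substrate S into a product P at the Michaelis-Menten rate vmax*S/(km+S). The function name, parameter values and the choice of forward Euler integration are assumptions made for illustration; real metabolic models couple many such reactions and typically use more robust solvers.

```python
def simulate_michaelis_menten(s0, vmax, km, dt=0.01, steps=1000):
    """Integrate dS/dt = -vmax*S/(km+S) and dP/dt = +vmax*S/(km+S).

    Uses forward Euler with step dt and assumes km > 0.
    Returns a list of (time, substrate, product) tuples.
    """
    s, p = float(s0), 0.0
    trajectory = [(0.0, s, p)]
    for i in range(1, steps + 1):
        rate = vmax * s / (km + s)
        # Never convert more substrate than is left in this step.
        converted = min(rate * dt, s)
        s -= converted
        p += converted
        trajectory.append((i * dt, s, p))
    return trajectory


# Hypothetical parameters: 10 units of substrate, vmax = 1, km = 0.5.
run = simulate_michaelis_menten(s0=10.0, vmax=1.0, km=0.5)
final_time, final_s, final_p = run[-1]
```

Substrate and product always sum to the starting amount, which is a quick sanity check on any simulation of this kind.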

High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:
  • high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology)
  • morphometrics
  • clinical image analysis and visualization
  • determining the real-time air-flow patterns in breathing lungs of living animals
  • quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
  • making behavioral observations from extended video recordings of laboratory animals
  • infrared measurements for metabolic activity determination
  • inferring clone overlaps in DNA mapping, e.g. the Sulston score


Protein-protein docking

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. A variety of methods have been developed to tackle the protein-protein docking problem, though there is still much work to be done in this field.

Software and tools

Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web services available from various bioinformatics companies or public institutions. The computational biology tool best known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other sequences, possibly from curated databases of protein or DNA sequences. BLAST is one of a number of generally available programs for doing sequence alignment. The NCBI provides a popular web-based implementation that searches its databases.
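
To make the idea of sequence alignment concrete, here is a minimal sketch of Smith-Waterman local alignment, the exact dynamic-programming method that BLAST approximates with faster heuristics. It is not BLAST itself, and the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions rather than a standard scoring scheme.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below zero, which keeps the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("TTTACGTACGTTT", "GGACGTACGGG")
```

For these two made-up sequences the best local alignment is the shared stretch ACGTACG. The table costs time and memory proportional to the product of the sequence lengths, which is why database search tools rely on heuristics instead.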

Web services in bioinformatics

SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics applications, allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantage lies in the end user not having to deal with software and database maintenance overheads. Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis). The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems.

External links

  • Major Organizations
    • Swiss Institute of Bioinformatics
    • Wellcome Trust Sanger Institute


  • Major Journals
    • at Bioinformatics.fr
    • at EMBnet.org
    • International Journal of Computational Biology and Drug Design (IJCBDD)
    • International Journal of Functional Informatics and Personalized Medicine (IJFIPM)


  • Other sites
    • at Bioinformatics.fr
    • Tutorials / Resources / Primers
    • — by NCBI