Gene prediction
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.

Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

Extrinsic approaches

In extrinsic (or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive the unique DNA sequence from which it must have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
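The reverse-translation step described above can be illustrated with a minimal Python sketch. It is not how BLAST works: it performs only exact matching, ignores introns, and assumes the standard genetic code. The function names (`reverse_translate_regex`, `count_candidates`, `find_coding_matches`) are hypothetical and chosen for illustration.

```python
import re

# Standard genetic code, written as DNA codons grouped by one-letter amino acid.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"], "D": ["GAT", "GAC"], "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"], "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"], "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"], "M": ["ATG"], "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"], "W": ["TGG"],
    "Y": ["TAT", "TAC"], "V": ["GTT", "GTC", "GTA", "GTG"],
}

def reverse_translate_regex(protein):
    """Build a regex matching every DNA sequence that could encode `protein`."""
    return "".join("(?:" + "|".join(CODONS[aa]) + ")" for aa in protein)

def count_candidates(protein):
    """Size of the family of coding sequences (product of codon counts)."""
    n = 1
    for aa in protein:
        n *= len(CODONS[aa])
    return n

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_coding_matches(genome, protein):
    """Return (strand, start) pairs; start is the leftmost forward-strand index."""
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile("(?=(" + reverse_translate_regex(protein) + "))")
    span = 3 * len(protein)
    hits = [("+", m.start()) for m in pattern.finditer(genome)]
    for m in pattern.finditer(reverse_complement(genome)):
        hits.append(("-", len(genome) - (m.start() + span)))
    return hits
```

For a real genome, inexact matches and spliced alignments matter, which is why dedicated alignment tools are used in practice rather than a pattern search like this one.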

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.

Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

Ab initio approaches

Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20–25 codons, or 60–75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequences of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
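The open reading frame argument above can be made concrete with a short Python sketch. It assumes a uniform, independent distribution over the 64 codons (a simplification), defines an ORF as an ATG followed in frame by a stop codon, and scans all six reading frames. The function names and the `min_codons` threshold are illustrative assumptions, not part of any particular gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Under the uniform-codon assumption, a stop codon is expected once
# every 64/3 (about 21) codons in a random sequence.
EXPECTED_CODONS_PER_STOP = 64 / 3

def prob_no_stop(n_codons):
    """Chance that n random codons in a row contain no stop codon."""
    return (61 / 64) ** n_codons

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end) of ATG...stop stretches in the three frames of seq."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # codons before the stop
                if n_codons >= min_codons:
                    found.append((start, i + 3))
                start = None                   # look for the next ORF
    return found

def find_orfs(dna, min_codons=100):
    """Scan both strands; coordinates refer to the strand that was scanned."""
    orfs = [("+", s, e) for s, e in orfs_in_strand(dna, min_codons)]
    rc = reverse_complement(dna)
    orfs += [("-", s, e) for s, e in orfs_in_strand(rc, min_codons)]
    return orfs
```

Because `prob_no_stop` shrinks geometrically with length, long stop-free frames are unlikely to arise by chance under this model, which is the reason a long ORF is informative on its own.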

Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.
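As a rough illustration of how a CpG island signal might be scored, the sketch below computes GC content and the observed/expected CpG ratio in sliding windows. The thresholds (window of 200 bp, GC content above 50%, ratio above 0.6) are commonly cited criteria and are an assumption here, not something specified in this article; the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")     # CpG dinucleotides on this strand
    expected = c * g / n              # expectation if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio

def candidate_cpg_windows(seq, window=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Start positions of windows passing both (assumed) thresholds."""
    starts = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_ratio:
            starts.append(start)
    return starts
```

A practical implementation would also merge adjacent passing windows into a single island and handle ambiguous bases; both are omitted here for brevity.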

Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. The SNAP gene finder is HMM-based like GENSCAN and attempts to be more adaptable to different organisms, addressing problems related to using a gene finder on a genome sequence that it was not trained against. A few recent approaches, such as mSplicer, CONTRAST and mGene, also use machine learning techniques like support vector machines for gene prediction. They build a discriminative model using hidden Markov support vector machines or conditional random fields to learn an accurate gene prediction scoring function.
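To give a feel for how a hidden Markov model labels sequence, here is a toy two-state sketch decoded with the Viterbi algorithm. It is far simpler than the models used by the programs named above: the two states ("n" for non-coding, "c" for coding), the GC-rich emission probabilities for coding sequence, and the transition probabilities are all invented for illustration.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty seq, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
               for s in states}]
    backptrs = []
    for symbol in seq[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            cur[s] = (prev[best] + math.log(trans_p[best][s])
                      + math.log(emit_p[s][symbol]))
            ptr[s] = best
        scores.append(cur)
        backptrs.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Toy parameters (assumptions): coding regions are GC-rich, states are sticky.
STATES = ("n", "c")
START = {"n": 0.5, "c": 0.5}
TRANS = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
EMIT = {
    "n": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "c": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
```

Calling `viterbi(seq, STATES, START, TRANS, EMIT)` returns one label per base. Real gene finders use many more states (for example, separate states for exons, introns and codon positions) and parameters estimated from training data.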

Other signals

Among the derived signals used for prediction are sub-sequence statistics such as k-mer statistics, the Fourier transform of pseudo-number-coded DNA, Z-curve parameters and certain run features.
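Two of these derived signals can be sketched in a few lines of Python. The k-mer counter is straightforward. For the Fourier signal, the sketch assumes one particular numeric coding, namely four binary indicator sequences (one per base), and evaluates the spectrum only at frequency 1/3, where protein-coding DNA tends to show a peak. Both the coding and the function names are illustrative assumptions.

```python
import cmath
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over base indicators.

    Each base gets an indicator sequence (1 where the base occurs, else 0);
    the Fourier coefficient at frequency 1/3 is computed for each and the
    squared magnitudes are summed, then normalized by sequence length.
    """
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / len(seq)
```

In practice such a score would be computed in sliding windows along the genome and compared against a background level; a single number for a whole sequence, as here, is only a demonstration of the calculation.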

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported. In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.

Comparative genomics approaches

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.
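The conservation principle can be illustrated with a very simple Python sketch that scores percent identity in sliding windows over a pairwise alignment. This is not how the programs named above work; they use considerably richer models. The input is assumed to be two already-aligned sequences of equal length with "-" for gaps, and the window size and identity threshold are arbitrary illustrative values.

```python
def window_identity(aln_a, aln_b, window, step):
    """Return (start, identity) for each window of an aligned sequence pair."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        # Gap columns never count as matches.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results

def conserved_windows(aln_a, aln_b, window=50, step=10, min_identity=0.8):
    """Start positions of windows whose identity meets the threshold."""
    return [start
            for start, identity in window_identity(aln_a, aln_b, window, step)
            if identity >= min_identity]
```

Windows returned by `conserved_windows` are only candidates: conservation also marks non-coding functional elements, so a comparative gene finder still has to combine this evidence with signal and content measures.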

Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 