GenBank
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). NCBI is part of the National Institutes of Health in the United States. GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. In the more than 20 years since its establishment, GenBank has become the most important and most influential database for research in almost all biological fields, and its data have been accessed and cited by millions of researchers around the world. GenBank continues to grow at an exponential rate, doubling every 18 months. Release 155, produced in August 2006, contained over 65 billion nucleotide bases in more than 61 million sequences. GenBank is built from direct submissions by individual laboratories, as well as from bulk submissions by large-scale sequencing centers.

Submissions

Only original sequences can be submitted to GenBank. Direct submissions are made to GenBank using BankIt, which is a Web-based form, or the stand-alone submission program, Sequin. Upon receipt of a sequence submission, the GenBank staff examines the originality of the data, assigns an accession number to the sequence, and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data most often come from large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.
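
As a minimal sketch of the retrieval step, the snippet below builds a request URL for a released entry through the `efetch` endpoint of NCBI's Entrez E-utilities. The helper names `efetch_url` and `fetch_record`, the choice of the `nuccore` database, and the example accession are illustrative assumptions and do not come from the text above. Bulk download by FTP is not shown.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the Entrez E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an efetch URL for one accession (hypothetical helper)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_record(accession):
    """Download the flat-file text of one entry; requires network access."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Example accession chosen for illustration only.
print(efetch_url("U49845"))
```

Only the URL is printed here, so the sketch runs without a network connection. Calling `fetch_record` would perform the actual download and should respect NCBI's usage guidelines.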

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank. Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.

In the mid-1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL. As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet news groups to promote open access communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information.

Growth

The GenBank release notes for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months". Plotted on a semi-log scale, exponential growth of this kind appears as a straight line.
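
To make the doubling claim concrete, here is a small sketch that assumes idealized growth with a fixed 18-month doubling time, N(t) = N0 · 2^(t/18) with t in months. The function names are hypothetical, and the starting figure of 65 billion bases is a rounded value used purely for illustration. Taking log2 of both sides gives a linear function of t, which is why such growth is a straight line on a semi-log plot.

```python
import math


def projected_bases(n0, months, doubling_months=18.0):
    """Bases after `months`, assuming a constant doubling time."""
    return n0 * 2.0 ** (months / doubling_months)


def doubling_time(n0, n1, months):
    """Doubling time implied by growth from n0 to n1 over `months`."""
    return months * math.log(2) / math.log(n1 / n0)


start = 65e9  # rounded illustrative starting count of bases
for t in (0, 18, 36):
    n = projected_bases(start, t)
    # log2(n) rises by one per doubling period: a straight line in t.
    print(t, n, round(math.log2(n) - math.log2(start), 6))
```

`doubling_time` runs the same relation in reverse: given base counts from two releases and the number of months between them, it returns the doubling period those two points imply.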

GenBank release 186.0 has 144,458,648 loci and 132,067,413,372 bases, from 144,458,648 reported sequences.

The GenBank database includes additional data sets which are constructed mechanically from the main sequence data collection, and therefore are excluded from this count.

See also

  • Ensembl
  • HPRD
  • Sequence analysis
  • Sequence profiling tool
  • Sequence motif
  • UniProt
  • List of sequenced eukaryotic genomes
  • List of sequenced archaeal genomes
  • RefSeq - the Reference Sequence Database

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.