





 

GenBank



 
 


The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. The database is produced at the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration, or INSDC. GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling every 18 months. Release 155, produced in August 2006, contained over 65 billion nucleotide bases in more than 61 million sequences. GenBank is built from direct submissions by individual laboratories and from bulk submissions by large-scale sequencing centers.

Submissions

Direct submissions are made to GenBank using BankIt, a Web-based form, or the stand-alone submission program Sequin. Upon receipt of a sequence submission, the GenBank staff assigns an accession number to the sequence and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data most often come from large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.
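As a rough illustration of retrieving a released entry by its accession number, the sketch below builds a request URL for NCBI's E-utilities EFetch service, the programmatic interface to Entrez. It only constructs the URL and does not contact the network. The helper name `genbank_efetch_url` is hypothetical, and the accession `U49845` is used purely as an example; neither comes from the text above.

```python
from urllib.parse import urlencode

# Public E-utilities EFetch endpoint (programmatic access to Entrez).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession: str) -> str:
    """Build an EFetch URL for one nucleotide record in GenBank flat-file format.

    Hypothetical helper for illustration; it performs no network access.
    """
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number assigned by GenBank staff
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Example accession, chosen only for illustration.
    print(genbank_efetch_url("U49845"))
```

Passing the resulting URL to any HTTP client should return the record as plain text, subject to NCBI's usage guidelines for automated requests.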

History

Walter Goad of Los Alamos National Laboratory (LANL) and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank. Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.

In the mid-1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL. As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet news groups to promote open access communications among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information.

Growth

The GenBank release notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months." Plotted on a semi-log scale, on which a straight line represents exponential change, the number of bases over time shows this exponential growth.

The GenBank database includes additional data sets that are constructed mechanically from the main sequence data collection and are therefore excluded from this count.
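The doubling claim can be made concrete with a small calculation. The sketch below assumes an idealized model in which the base count doubles exactly every 18 months, N(t) = N0 * 2^(t/18) with t in months. The function names are hypothetical, and the results are projections under that assumption, not actual release figures. It also shows why a semi-log plot is a straight line: log10 N(t) increases by a constant amount each month.

```python
import math


def projected_bases(start_bases: float, months_elapsed: float,
                    doubling_months: float = 18.0) -> float:
    """Project a base count forward, assuming a fixed doubling time."""
    return start_bases * 2.0 ** (months_elapsed / doubling_months)


def log10_slope_per_month(doubling_months: float = 18.0) -> float:
    """Constant monthly increase of log10(bases) under the same model.

    A constant slope is what makes the curve a straight line on a semi-log plot.
    """
    return math.log10(2.0) / doubling_months


if __name__ == "__main__":
    start = 65e9  # "over 65 billion" bases in Release 155 (August 2006)
    # One full doubling period later the model gives twice the starting count.
    print(projected_bases(start, 18))
    print(log10_slope_per_month())
```

Starting from the 65 billion bases quoted for Release 155, the model gives 130 billion bases 18 months later.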

See also


  • Ensembl
  • HPRD
  • Sequence analysis
  • Sequence profiling tool
  • Sequence motif
  • UniProt
  • List of sequenced eukaryotic genomes
  • List of sequenced archaeal genomes
  • RefSeq - the Reference Sequence Database

