UniProt

UniProt is a comprehensive, high-quality and freely accessible database of protein sequence and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.

The UniProt Consortium


The UniProt Consortium comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy (Expert Protein Analysis System) servers that are a central resource for proteomics tools and databases. PIR, hosted by the National Biomedical Research Foundation (NBRF) at the Georgetown University Medical Center in Washington, DC, USA, is heir to the oldest protein sequence database, Margaret Dayhoff's Atlas of Protein Sequence and Structure, first published in 1965. In 2002, EBI, SIB, and PIR joined forces as the UniProt Consortium.

The roots of UniProt databases


Each consortium member is heavily involved in protein database maintenance and annotation. Until recently, EBI and SIB together produced the Swiss-Prot and TrEMBL databases, while PIR produced the Protein Sequence Database (PIR-PSD). These databases coexisted with differing protein sequence coverage and annotation priorities.

Swiss-Prot was created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics, and subsequently by Rolf Apweiler at the European Bioinformatics Institute. Swiss-Prot aimed to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recognizing that sequence data were being generated at a pace exceeding Swiss-Prot's ability to keep up, TrEMBL (Translated EMBL Nucleotide Sequence Data Library) was created to provide automated annotations for those proteins not in Swiss-Prot. Meanwhile, PIR maintained the PIR-PSD and related databases, including iProClass, a database of protein sequences and curated families.

The consortium members pooled their overlapping resources and expertise, and launched UniProt in December 2003.

UniProtKB


UniProt Knowledgebase (UniProtKB) is a protein database curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries). In release 2010_09 of 10 August 2010, UniProtKB/Swiss-Prot contained 519,348 entries, and UniProtKB/TrEMBL contained 11,636,205 entries.

UniProtKB/Swiss-Prot


UniProtKB/Swiss-Prot is a high-quality, manually annotated, non-redundant protein sequence database. It combines information extracted from scientific literature and biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein. Annotation is regularly reviewed to keep up with current scientific findings. The manual annotation of an entry involves detailed analysis of the protein sequence and of the scientific literature.

Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause documented (for example alternative splicing, natural variation, incorrect initiation sites, incorrect exon boundaries, frameshifts, unidentified conflicts). A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification.

Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry. Annotation arising from the scientific literature includes, but is not limited to:
  • Protein and gene names
  • Function
  • Enzyme-specific information such as catalytic activity, cofactors and catalytic residues
  • Subcellular location
  • Protein-protein interactions
  • Pattern of expression
  • Locations and roles of significant domains and sites
  • Ion-, substrate- and cofactor-binding sites
  • Protein variant forms produced by natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and post-translational modification


Annotated entries undergo quality assurance before inclusion into UniProtKB/Swiss-Prot. When new data becomes available, entries are updated.

UniProtKB/TrEMBL


UniProtKB/TrEMBL contains high-quality computationally analyzed records, which are enriched with automatic annotation. It was introduced in response to increased dataflow resulting from genome projects, as the time- and labour-consuming manual annotation process of UniProtKB/Swiss-Prot could not be broadened to include all available protein sequences. The translations of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide sequence database are automatically processed and entered in UniProtKB/TrEMBL.
UniProtKB/TrEMBL also contains sequences from PDB, and from gene prediction, including Ensembl, RefSeq and CCDS.

UniParc


UniProt Archive (UniParc) is a comprehensive and non-redundant database, which contains all the protein sequences from the main, publicly available protein sequence databases. Proteins may exist in several different source databases, and in multiple copies in the same database. In order to avoid redundancy, UniParc stores each unique sequence only once. Identical sequences are merged, regardless of whether they are from the same or different species. Each sequence is given a stable and unique identifier (UPI), making it possible to identify the same protein from different source databases. UniParc contains only protein sequences, with no annotation. Database cross-references in UniParc entries allow further information about the protein to be retrieved from the source databases. When sequences in the source databases change, these changes are tracked by UniParc and a history of all changes is archived.
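The store-once behaviour described above can be illustrated with a minimal Python sketch. It is not UniParc's implementation: the class names, the in-memory dictionary, the counter-based identifier and the example accessions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    upi: str                                      # illustrative stable identifier
    sequence: str
    cross_refs: set = field(default_factory=set)  # (database, accession) pairs


class SequenceArchive:
    """Toy non-redundant archive: each distinct sequence is stored once."""

    def __init__(self):
        self._by_sequence = {}   # normalised sequence -> ArchiveEntry
        self._by_upi = {}        # identifier -> ArchiveEntry
        self._next_id = 1

    def add(self, sequence, database, accession):
        """Register a sequence seen in a source database and return its identifier."""
        seq = sequence.strip().upper()
        entry = self._by_sequence.get(seq)
        if entry is None:
            # First time this exact sequence is seen: assign a new identifier.
            entry = ArchiveEntry(upi=f"UPI{self._next_id:010X}", sequence=seq)
            self._next_id += 1
            self._by_sequence[seq] = entry
            self._by_upi[entry.upi] = entry
        # Identical sequences are merged; only the cross-reference is added.
        entry.cross_refs.add((database, accession))
        return entry.upi

    def get(self, upi):
        return self._by_upi[upi]

    def __len__(self):
        return len(self._by_sequence)


# Hypothetical accessions, used only to exercise the sketch.
archive = SequenceArchive()
first = archive.add("MKTAYIAK", "UniProtKB/Swiss-Prot", "P00001")
second = archive.add("mktayiak", "RefSeq", "NP_000001")
third = archive.add("MKTAYIAR", "PDB", "1ABC")
```

Here the first two calls return the same identifier because the sequences are identical, while the cross-references record both source databases. The species is deliberately not part of the key, matching the statement that identical sequences are merged regardless of origin.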

Source databases


Currently UniParc contains protein sequences from the following publicly available databases:
  • INSDC EMBL-Bank/DDBJ/GenBank nucleotide sequence databases
  • Ensembl
  • European Patent Office (EPO)
  • FlyBase
  • H-Invitational Database (H-Inv)
  • International Protein Index (IPI)
  • Japan Patent Office (JPO)
  • Protein Information Resource (PIR-PSD)
  • Protein Data Bank (PDB)
  • Protein Research Foundation (PRF) http://www.prf.or.jp/index-e.html
  • RefSeq
  • Saccharomyces Genome Database (SGD)
  • The Arabidopsis Information Resource (TAIR)
  • TROME [ftp://ftp.isrec.isb-sib.ch/pub/databases/trome]
  • US Patent Office (USPTO)
  • UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
  • Vertebrate and Genome Annotation Database (VEGA)
  • WormBase


UniRef


The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein sequences from UniProtKB and selected UniParc records. The UniRef100 database combines identical sequences and sequence fragments (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are clustered using the CD-HIT algorithm to build UniRef90 and UniRef50. Each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence. Clustering sequences significantly reduces database size, enabling faster sequence searches.
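The idea of clustering against the longest sequence can be sketched as a greedy pass in Python. This is only an illustration of the principle, not CD-HIT itself: the `identity` function below uses `difflib.SequenceMatcher` as a crude stand-in for a real sequence alignment, and the function names and example sequences are assumptions.

```python
from difflib import SequenceMatcher


def identity(query, representative):
    """Fraction of the query's residues found in matching blocks.

    A rough stand-in for alignment-based sequence identity.
    """
    if not query:
        return 0.0
    matcher = SequenceMatcher(None, query, representative, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(query)


def greedy_cluster(sequences, threshold):
    """Cluster sequences greedily, longest first.

    The longest unassigned sequence becomes a cluster representative; each
    later (shorter or equal-length) sequence joins the first cluster whose
    representative it matches at or above the threshold.
    """
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences chosen so the two thresholds give different results.
sequences = [
    "MKTAYIAKQRQISFVK",
    "MKTAYIAKQRQISFVR",  # one residue differs from the first
    "GGGGGGGGGGGGGGGG",  # unrelated
    "MKTAYIAKQRWWWWWW",  # shares only the first ten residues
]
clusters_90 = greedy_cluster(sequences, 0.9)
clusters_50 = greedy_cluster(sequences, 0.5)
```

With these inputs the 90% threshold keeps the fourth sequence in its own cluster, while the 50% threshold folds it into the first cluster, which shows why a lower threshold produces a smaller database.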

UniMES


The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. The predicted proteins from this dataset are combined with automatic classification by InterPro to enhance the original information with further analysis.

UniProtKB contains protein sequences from known species. Data arising from metagenomics studies is from environmental (i.e. uncultured) samples, so the species may not be known or identified. UniMES was developed for this data. Data from UniMES is not included in UniProtKB or UniRef, but is included in UniParc. UniMES includes data from the Global Ocean Sampling Expedition (GOS).

UniMES is available from the UniProt FTP site (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/unimes/).

Funding for UniProt


UniProt is funded by grants from the National Human Genome Research Institute, the National Institutes of Health (NIH), the European Commission, the Swiss Federal Government through the Federal Office of Education and Science, NCI-caBIG, and the Department of Defense.