Structural Classification of Proteins

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different "superfamilies", and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.

The SCOP database is freely accessible on the internet. SCOP was created in 1994. It is maintained by Alexei G. Murzin and his colleagues at the Laboratory of Molecular Biology in Cambridge, England. The current version of SCOP is 1.75, released in June 2009. A snapshot of the next release of SCOP, which is called pre-SCOP (for preview), is accessible from the main SCOP page.

Hierarchical structure

The source of protein structures is the Protein Data Bank. The unit of classification of structure in SCOP is the protein domain. What the SCOP authors mean by "domain" is suggested by their statement that small proteins and most medium sized ones have just one domain, and by the observation that human hemoglobin, which has an α2β2 structure, is assigned two SCOP domains, one for the α and one for the β subunit.

The shapes of domains are called "folds" in SCOP. Domains belonging to the same fold have the same major secondary structures in the same arrangement with the same topological connections. 1195 folds are given in SCOP version 1.75. Short descriptions of each fold are given. For example, the "Globin-like" fold is described as core: 6 helices; folded leaf, partly opened. The fold to which a domain belongs is determined by inspection, rather than by software.

The levels of SCOP are as follows.
  1. Class: Types of folds, e.g., beta sheets.
  2. Fold: The different shapes of domains within a class.
  3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
  4. Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor.
  5. Protein domain: The domains in families are grouped into protein domains, which are essentially the same protein.
  6. Species: The domains in "protein domains" are grouped according to species.
  7. Domain: part of a protein. For simple proteins, it can be the entire protein.

The folds are grouped into "classes". The classes are the top level, or "root" of the SCOP hierarchical classification. The classes are displayed something like this:
a. All alpha proteins [46456] (284)
domains consisting of α-helices
b. All beta proteins [48724] (174)
domains consisting of β-sheets
c. Alpha and beta proteins (a/b) [51349] (147)
Mainly parallel beta sheets (beta-alpha-beta units)
d. Alpha and beta proteins (a+b) [53931] (376)
Mainly antiparallel beta sheets (segregated alpha and beta regions)
e. Multi-domain proteins (alpha and beta) [56572] (66)
Folds consisting of two or more domains belonging to different classes
f. Membrane and cell surface proteins and peptides [56835] (58)
Does not include proteins in the immune system
g. Small proteins [56992] (90)
Usually dominated by metal ligand, heme, and/or disulfide bridges
h. coiled-coil proteins [57942] (7)
Not a true class
i. Low resolution protein structures [58117] (26)
Peptides and fragments. Not a true class
j. Peptides [58231] (121)
peptides and fragments. Not a true class.
k. Designed proteins [58788] (44)
Experimental structures of proteins with essentially non-natural sequences. Not a true class

The number in brackets, called a "sunid", is a SCOP unique integer identifier for each node in the SCOP hierarchy. The number in parentheses indicates how many elements are in each category. For example, there are 284 folds in the "All alpha proteins" class. Each member of the hierarchy is a link to the next level of the hierarchy.
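As an illustration of the notation just described, the following minimal sketch splits one displayed line into its label, name, sunid and element count. The names `ScopNode` and `parse_node_line` are hypothetical and are not part of any SCOP tool. The pattern assumes only the display layout shown in this article, not an official SCOP file format.

```python
import re
from typing import NamedTuple, Optional


class ScopNode(NamedTuple):
    label: str            # list label as displayed, e.g. "a" or "7"
    name: str             # node name, e.g. "All alpha proteins"
    sunid: int            # number in brackets
    count: Optional[int]  # number in parentheses, if the line shows one


# Label, dot, name, "[sunid]", then an optional "(count)".
_NODE_LINE = re.compile(
    r"^\s*(?P<label>\w+)\.\s+(?P<name>.+?)\s+\[(?P<sunid>\d+)\]"
    r"(?:\s+\((?P<count>\d+)\))?"
)


def parse_node_line(line: str) -> ScopNode:
    """Parse a line such as 'a. All alpha proteins [46456] (284)'."""
    match = _NODE_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognised SCOP display line: {line!r}")
    count = match.group("count")
    return ScopNode(
        label=match.group("label"),
        name=match.group("name"),
        sunid=int(match.group("sunid")),
        count=int(count) if count is not None else None,
    )


if __name__ == "__main__":
    print(parse_node_line("a. All alpha proteins [46456] (284)"))
```

Lines that show no count, such as "7. Leghemoglobin [46481]", are returned with `count` set to `None`.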

The first few of the 284 folds in the "All alpha proteins" class are displayed something like the following.
1. Globin-like [46457] (2)
core: 6 helices; folded leaf, partly opened
2. Long alpha-hairpin [46556] (20)
2 helices; antiparallel hairpin, left-handed twist
3. Type I dockerin domain [63445] (1)
tandem repeat of two calcium-binding loop-helix motifs, distinct from the EF-hand

Each fold is followed by a description of that fold.

The domains within a fold are further classified into superfamilies, which, in turn, are classified into families. Within a fold, domains belonging to the same superfamily are assumed to have a common ancestor. However, this ancestor is presumed to be distant, because the different members of a superfamily have low sequence identities. The two superfamilies of the "Globin-like" fold are displayed something like the following:
  1. Globin-like [46458] (4)
  2. alpha-helical ferredoxin [46548] (2) contains two Fe4-S4 clusters

No description is given for the "Globin-like" superfamily, presumably because its description is very like that of its fold, which has the same name.

Families are more closely related than superfamilies. Domains within a fold are placed in the same family if
  1. they have at least 30% similarity in sequence, or, failing that,
  2. they have some similarity in sequence, e.g., 15%, and perform the same function.
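The two criteria above can be written as a small predicate. This is only a sketch of the rule of thumb as stated here: SCOP assignments are made largely by hand, the function name `same_family` is hypothetical, and the 15% lower bound is the article's example value rather than a fixed cutoff.

```python
def same_family(similarity_pct: float,
                same_function: bool,
                high_threshold: float = 30.0,
                low_threshold: float = 15.0) -> bool:
    """Return True if two domains would share a family under the stated rule.

    similarity_pct -- sequence similarity between the two domains, in percent
    same_function  -- whether the two domains perform the same function
    low_threshold  -- assumed example value for "some similarity" (15%)
    """
    # Criterion 1: high sequence similarity is sufficient on its own.
    if similarity_pct >= high_threshold:
        return True
    # Criterion 2: weaker similarity counts only together with shared function.
    return similarity_pct >= low_threshold and same_function


if __name__ == "__main__":
    print(same_family(42.0, same_function=False))  # criterion 1 applies
    print(same_family(18.0, same_function=True))   # criterion 2 applies
    print(same_family(18.0, same_function=False))  # neither applies
```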

The similarity in sequence and structure is evidence that these proteins have a closer evolutionary relationship than do proteins in the same superfamily. Sequence tools, such as BLAST, are used to assist in placing domains into superfamilies and families. The four families in the "Globin-like" superfamily of the "Globin-like" fold are displayed something like the following.
  1. Truncated hemoglobin [46459] (6) lack the first helix (A)
  2. Nerve tissue mini-hemoglobin (neural globin) [74660] (1) lack the first helix but otherwise is more similar to conventional globins than the truncated ones
  3. Globins [46463] (81) Heme-binding protein
  4. Phycocyanin-like phycobilisome proteins [46532] (26) oligomers of two different types of globin-like subunits containing two extra helices at the N-terminus binds a bilin chromophore

The families in SCOP may also be referred to using a SCOP concise classification string, sccs, which looks like, e.g., a.1.1.2 for the "Globin" family. The letter identifies the class to which the domain belongs; the following integers identify the fold, superfamily, and family, respectively.
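A minimal sketch of how such a string can be taken apart, assuming the four-part family-level form described above (class letter, then fold, superfamily and family numbers). The names `Sccs`, `parse_sccs` and `same_superfamily` are hypothetical helpers introduced for illustration.

```python
from typing import NamedTuple


class Sccs(NamedTuple):
    scop_class: str   # single letter, e.g. "a"
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Split a family-level sccs such as 'a.1.1.2' into its four parts."""
    parts = sccs.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    scop_class, fold, superfamily, family = parts
    if len(scop_class) != 1 or not scop_class.isalpha():
        raise ValueError(f"class must be a single letter, got {scop_class!r}")
    # int() raises ValueError if a numeric part is malformed.
    return Sccs(scop_class, int(fold), int(superfamily), int(family))


def same_superfamily(a: str, b: str) -> bool:
    """True if two sccs strings agree on class, fold and superfamily."""
    return parse_sccs(a)[:3] == parse_sccs(b)[:3]


if __name__ == "__main__":
    print(parse_sccs("a.1.1.2"))
    print(same_superfamily("a.1.1.2", "a.1.1.1"))
```

Because each position corresponds to one level of the hierarchy, comparing prefixes of two strings shows the deepest level at which two families are still classified together.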

Within a family are protein domains. Proteins are placed in the same protein domain if they are isoforms of each other, or if they are essentially the same protein, but from different species. This is apparently done manually. The "protein domains" are further subdivided into species. ("Protein domains" are not on separate pages in the current release of SCOP; in pre-SCOP, they are on separate pages.) Here is how some of the 81 protein domains of the "Globins" family are displayed.
Protein Domains:
7. Leghemoglobin [46481]
     1. Yellow lupin (Lupinus luteus) [TaxId: 3873] [46482] (17)
     2. Soybean (Glycine max), isoform A [TaxId: 3847] [46483] (2)
8. Non-symbiotic plant hemoglobin [46484]
     1. Rice (Oryza sativa) [TaxId: 4530] [46485] (1)
9. Hemoglobin, alpha-chain [46486]
     1. Human (Homo sapiens) [TaxId: 9606] [46487] (192)
     2. Human (Homo sapiens), zeta isoform [TaxId: 9606] [68937] (1)
     3. Horse (Equus caballus) [TaxId: 9796] [46488] (19)
     4. Deer (Odocoileus virginianus) [TaxId: 9874] [46489] (1)

The "TaxId" is the taxonomy ID number; it is also a link to the NCBI taxonomy browser, which provides more information about the species to which the protein belongs.

Clicking on a species or isoform brings up a list of domains. Here is how some of the 192 domains of the "Hemoglobin, alpha-chain from Human (Homo sapiens)" protein are displayed.
PDB Entry Domains:
1. 2dn3
   automatically matched to d1abwa1
   complexed with cmo, hem
     1. region a:2-141 [131583]
2. 1ird
   complexed with cmo, hem
     1. chain a [66286]
3. 2dn1
   automatically matched to d1abwa1
   complexed with hem, mbn, oxy
     1. region a:2-141 [131577]
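
The domain lines in this listing come in two forms: a whole chain ("chain a") or a residue range within a chain ("region a:2-141"), each followed by the domain's sunid. The sketch below reads both forms. It assumes only the layout shown in this listing, and the name `parse_domain_line` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Optional list number, then "region <chain>:<start>-<end>" or
# "chain <chain>", then the sunid in brackets.
_DOMAIN_LINE = re.compile(
    r"^\s*(?:\d+\.\s+)?"
    r"(?:region\s+(?P<region_chain>\w+):(?P<start>-?\d+)-(?P<end>-?\d+)"
    r"|chain\s+(?P<whole_chain>\w+))"
    r"\s+\[(?P<sunid>\d+)\]"
)


def parse_domain_line(line: str) -> Tuple[str, Optional[int], Optional[int], int]:
    """Return (chain, start, end, sunid); start and end are None for a whole chain."""
    match = _DOMAIN_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognised domain line: {line!r}")
    sunid = int(match.group("sunid"))
    if match.group("whole_chain") is not None:
        return match.group("whole_chain"), None, None, sunid
    return (
        match.group("region_chain"),
        int(match.group("start")),
        int(match.group("end")),
        sunid,
    )


if __name__ == "__main__":
    print(parse_domain_line("1. region a:2-141 [131583]"))
    print(parse_domain_line("1. chain a [66286]"))
```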

Clicking on the PDB numbers is supposed to display the structure of the molecule, but the links are currently broken. (The links do work in pre-SCOP.)


Most pages in SCOP contain a search box. Entering "trypsin +human" retrieves several proteins, including the protein trypsinogen from humans. Selecting that entry displays a page that includes the "lineage", which is at the top of most SCOP pages. The page includes the following information.
1. Root: scop
2. Class: All beta proteins [48724]
3. Fold: Trypsin-like serine proteases [50493]
barrel, closed; n=6, S=8; greek-key
duplication: consists of two domains of the same fold
4. Superfamily: Trypsin-like serine proteases [50494]
5. Family: Eukaryotic proteases [50514]
6. Protein: Trypsin(ogen) [50515]
7. Species: Human (Homo sapiens) [TaxId: 9606] [50519]

Searching for "Subtilisin" brings up the protein, "Subtilisin from Bacillus subtilis, carlsberg", with the following lineage.
1. Root: scop
2. Class: Alpha and beta proteins (a/b) [51349]
Mainly parallel beta sheets (beta-alpha-beta units)
3. Fold: Subtilisin-like [52742]
3 layers: a/b/a, parallel beta-sheet of 7 strands, order 2314567; left-handed crossover connection between strands 2 & 3
4. Superfamily: Subtilisin-like [52743]
5. Family: Subtilases [52744]
6. Protein: Subtilisin [52745]
7. Species: Bacillus subtilis, carlsberg [TaxId: 1423] [52746]
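
The two lineages above can be compared level by level using their sunids. The sketch below is illustrative only: the dictionaries are typed in by hand from the listings in this article, and `deepest_shared_level` is a hypothetical helper, not a SCOP function.

```python
from typing import Dict, Optional

# Levels below the root, from most general to most specific.
LEVELS = ["Class", "Fold", "Superfamily", "Family", "Protein", "Species"]

# sunids copied from the lineages shown in this article.
TRYPSIN_HUMAN = {
    "Class": 48724, "Fold": 50493, "Superfamily": 50494,
    "Family": 50514, "Protein": 50515, "Species": 50519,
}
SUBTILISIN_CARLSBERG = {
    "Class": 51349, "Fold": 52742, "Superfamily": 52743,
    "Family": 52744, "Protein": 52745, "Species": 52746,
}
HEMOGLOBIN_ALPHA_HUMAN = {
    "Class": 46456, "Fold": 46457, "Superfamily": 46458,
    "Family": 46463, "Protein": 46486, "Species": 46487,
}
HEMOGLOBIN_ALPHA_HORSE = {
    "Class": 46456, "Fold": 46457, "Superfamily": 46458,
    "Family": 46463, "Protein": 46486, "Species": 46488,
}


def deepest_shared_level(a: Dict[str, int], b: Dict[str, int]) -> Optional[str]:
    """Return the most specific level at which both lineages have the same sunid.

    Returns None when the lineages share only the root.
    """
    shared = None
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


if __name__ == "__main__":
    print(deepest_shared_level(TRYPSIN_HUMAN, SUBTILISIN_CARLSBERG))
    print(deepest_shared_level(HEMOGLOBIN_ALPHA_HUMAN, HEMOGLOBIN_ALPHA_HORSE))
```

For the two proteases the function returns `None`, since their lineages already differ at the class level. The human and horse hemoglobin alpha chains listed earlier agree down to the "Protein" level.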

Although both of these proteins are proteases, they do not even belong to the same fold, which is consistent with them being an example of convergent evolution.

Comparison to other classification systems

This classification relies more heavily on human expertise than the semi-automatic CATH, its chief rival. Human expertise is needed to decide whether certain proteins are evolutionarily related and therefore should be assigned to the same superfamily, or whether their similarity is a result of structural constraints and therefore they belong only to the same fold. Another database, FSSP, is purely automatically generated (including regular automatic updates) but offers no classification, allowing the user to draw their own conclusion as to the significance of structural relationships based on the pairwise comparisons of individual protein structures.

Wikilinks to SCOP

To insert a link in Wikipedia to a particular SCOP page, use the template of the form , where xxxxxx is a SCOP sunid, for instance .

See also

  • Structural alignment
  • CATH
  • FSSP
  • SUPERFAMILY

External links