Threading (protein sequence)
Protein threading, also known as fold recognition, is a method of protein modeling (i.e. computational protein structure prediction) used to model proteins that have the same fold as proteins of known structure but have no homologous proteins of known structure. It differs from the homology modeling method of structure prediction in that threading is used for proteins that have no homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for proteins that do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein to be modeled.

The prediction is made by "threading" (i.e. placing, aligning) each amino acid in the target sequence to a position in the template structure, and evaluating how well the target fits the template. After the best-fit template is selected, the structural model of the sequence is built based on the alignment with the chosen template. Protein threading is based on two basic observations: that the number of different folds in nature is fairly small (approximately 1300); and that 90% of the new structures submitted to the PDB in the past three years have similar structural folds to ones already in the PDB (according to the CATH release notes).

Classification of protein structure

The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the structural and evolutionary relationships of proteins of known structure. Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, as described below.

Family (clear evolutionary relationship): Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% or greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
Superfamily (probable common evolutionary origin): Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
Fold (major structural similarity): Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.

Method

A general paradigm of protein threading consists of the following four steps:

The construction of a structure template database: Select protein structures from protein structure databases such as PDB, FSSP, SCOP, or CATH as structural templates, after removing structures with high sequence similarity.
The design of the scoring function: Design a good scoring function to measure the fitness between target sequences and templates based on the knowledge of the known relationships between the structures and the sequences. A good scoring function should contain mutation potential, environment fitness potential, pairwise potential, secondary structure compatibilities, and gap penalties. The quality of the scoring (energy) function is closely related to the prediction accuracy, especially the alignment accuracy.
Threading alignment: Align the target sequence with each of the structure templates by optimizing the designed scoring function. When the scoring function takes the pairwise contact potential into account, this step is one of the major tasks of a threading-based structure prediction program; otherwise, a dynamic programming algorithm can solve it.
Threading prediction: Select the threading alignment that is statistically most probable as the threading prediction. Then construct a structure model for the target by placing the backbone atoms of the target sequence at their aligned backbone positions of the selected structural template.
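
The alignment and selection steps can be illustrated with a minimal Python sketch for the simpler case in which the pairwise potential is ignored, so that dynamic programming applies. Everything here is an assumption made for illustration: the template is reduced to a string of burial labels ("B" for buried, "E" for exposed), the scoring function is a toy environment fitness term with a linear gap penalty, and the names env_score, thread_by_dp and best_template are hypothetical. The raw alignment score also stands in for the statistical significance that real programs use to pick a template.

```python
from typing import Dict, List, Tuple

# Assumption for illustration: a crude set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def env_score(residue: str, env: str) -> float:
    """Toy environment fitness: +1 if hydrophobicity matches burial, else -1."""
    buried = env == "B"
    return 1.0 if (residue in HYDROPHOBIC) == buried else -1.0


def thread_by_dp(target: str, template_env: str,
                 gap: float = -1.0) -> Tuple[float, List[Tuple[int, int]]]:
    """Globally align target residues to template positions by dynamic programming."""
    n, m = len(target), len(template_env)
    # best[i][j]: best score for target[:i] against template positions [:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1]
                 + env_score(target[i - 1], template_env[j - 1]), "diag"),
                (best[i - 1][j] + gap, "up"),      # target residue left unaligned
                (best[i][j - 1] + gap, "left"),    # template position left empty
            ]
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back to recover the (target index, template index) pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


def best_template(target: str, templates: Dict[str, str]) -> str:
    """Pick the template whose optimal alignment has the highest raw score."""
    scores = {name: thread_by_dp(target, env)[0] for name, env in templates.items()}
    return max(scores, key=scores.get)
```

Calling best_template("LLKKVV", {"t1": "BBEEBB", "t2": "EEEEEE"}) compares the target against two hypothetical templates. A realistic scoring function would add the mutation, secondary structure and pairwise terms listed above, and the pairwise term is what makes the alignment step hard.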

Comparison with homology modeling

Homology modeling and protein threading are both template-based methods, and there is no rigorous boundary between them in terms of prediction techniques. However, their targets differ. Homology modeling is for targets that have homologous proteins of known structure (usually of the same family), while protein threading is for targets for which only fold-level homology is found. In other words, homology modeling is for "easier" targets and protein threading is for "harder" targets.

Homology modeling treats the template in an alignment as a sequence, and only sequence homology is used for prediction. Protein threading treats the template in an alignment as a structure, and both sequence and structure information extracted from the alignment are used for prediction. When there is no significant homology found, protein threading can make a prediction based on the structure information. That also explains why protein threading may be more effective than homology modeling in many cases.

In practice, when the sequence identity in a sequence-sequence alignment is low (i.e. below 25%), homology modeling may not produce a significant prediction. In this case, if distant homology is found for the target, protein threading can generate a good prediction.
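
As a rough illustration of the 25% rule of thumb, the following Python sketch computes the percent identity of a gapped pairwise alignment and suggests a method accordingly. The function names are hypothetical, and the choice of denominator (columns in which neither sequence has a gap) is an assumption; other definitions of sequence identity are in use and give somewhat different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def suggest_method(identity: float, threshold: float = 25.0) -> str:
    """Apply the rule of thumb from the text; the threshold is not a hard limit."""
    return "homology modeling" if identity >= threshold else "threading"


identity = percent_identity("ACDE-G", "ACDFHG")
print(identity, suggest_method(identity))
```

The threshold is only a guide: as noted above, there is no rigorous boundary between the two approaches.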

More about threading

Fold recognition methods can be broadly divided into two types: 1, those that derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles; and 2, those that consider the full 3-D structure of the protein template. A simple example of a profile representation would be to take each amino acid in the structure and simply label it according to whether it is buried in the core of the protein or exposed on the surface. More elaborate profiles might take into account the local secondary structure (e.g. whether the amino acid is part of an alpha helix) or even evolutionary information (how conserved the amino acid is). In the 3-D representation, the structure is modeled as a set of inter-atomic distances, i.e. the distances are calculated between some or all of the atom pairs in the structure. This is a much richer and far more flexible description of the structure, but is much harder to use in calculating an alignment. The profile-based fold recognition approach was first described by Bowie, Lüthy and Eisenberg in 1991. The term threading was first coined by Jones, Taylor and Thornton in 1992, and originally referred specifically to the use of a full 3-D structure atomic representation of the protein template in fold recognition. Today, the terms threading and fold recognition are frequently (though somewhat incorrectly) used interchangeably.
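
The two template representations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any published method: each residue is represented by a single coordinate (for example its alpha carbon), burial is approximated by counting neighbors within a cutoff radius rather than by computing solvent accessibility, and the function names, the 8.0 cutoff and the neighbor threshold are all hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Coord]) -> List[List[float]]:
    """3-D representation: all pairwise distances between residue coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def burial_profile(coords: Sequence[Coord], radius: float = 8.0,
                   min_neighbors: int = 3) -> str:
    """1-D profile: label each residue 'B' (buried) or 'E' (exposed).

    Assumption: a residue with at least min_neighbors other residues
    within radius is treated as buried.
    """
    labels = []
    for i, row in enumerate(distance_matrix(coords)):
        neighbors = sum(1 for j, d in enumerate(row) if j != i and d <= radius)
        labels.append("B" if neighbors >= min_neighbors else "E")
    return "".join(labels)


# Hypothetical coordinates: four residues packed together, one far away.
toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
       (0.0, 0.0, 1.0), (50.0, 0.0, 0.0)]
print(burial_profile(toy))
```

The profile is a string with one label per residue, which a sequence can be aligned to position by position, whereas the distance matrix keeps the pairwise information that makes alignment harder.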

Fold recognition methods are widely used and effective because it is believed that there are a strictly limited number of different protein folds in nature, mostly as a result of evolution but also due to constraints imposed by the basic physics and chemistry of polypeptide chains. There is, therefore, a good chance (currently 70-80%) that a protein which has a similar fold to the target protein has already been studied by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy and can be found in the PDB. Currently there are nearly 1300 different protein folds known (see CATH database statistics for latest view), but new folds are still being discovered every year due in significant part to the ongoing structural genomics projects.

Many different algorithms have been proposed for finding the correct threading of a sequence onto a structure, though many make use of dynamic programming in some form. For full 3-D threading, the problem of identifying the best alignment is very difficult (it is an NP-hard problem for some models of threading). Researchers have made use of many combinatorial optimization methods, such as conditional random fields, simulated annealing, branch and bound, and linear programming, in searching for heuristic solutions.
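
To make the heuristic search concrete, here is a minimal Python sketch of simulated annealing applied to a toy threading model with a pairwise contact term. The model is an assumption chosen for brevity: an alignment is a strictly increasing assignment of target residues to template positions, the contact potential rewards only hydrophobic pairs, and the cooling schedule, the move set and all function names are hypothetical. It returns a good alignment found during the search, with no guarantee of optimality.

```python
import math
import random
from typing import List, Sequence, Tuple

HYDROPHOBIC = set("AVLIMFWC")  # assumption: crude hydrophobic residue set


def pair_score(a: str, b: str) -> float:
    """Toy contact potential: reward hydrophobic-hydrophobic contacts."""
    return 1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def threading_score(target: str, positions: Sequence[int],
                    contacts: Sequence[Tuple[int, int]]) -> float:
    """Sum the contact potential over template contacts with both ends occupied."""
    placed = {pos: res for res, pos in zip(target, positions)}
    return sum(pair_score(placed[p], placed[q])
               for p, q in contacts if p in placed and q in placed)


def anneal(target: str, template_len: int, contacts: Sequence[Tuple[int, int]],
           steps: int = 2000, t_start: float = 2.0, t_end: float = 0.05,
           seed: int = 0) -> Tuple[float, List[int]]:
    if not 0 < len(target) <= template_len:
        raise ValueError("target must be non-empty and fit in the template")
    rng = random.Random(seed)
    positions = list(range(len(target)))  # start with no gaps, flush left
    current = threading_score(target, positions, contacts)
    best, best_positions = current, positions[:]
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        k = rng.randrange(len(positions))
        new_pos = positions[k] + rng.choice((-1, 1))
        lo = positions[k - 1] + 1 if k > 0 else 0
        hi = positions[k + 1] - 1 if k < len(positions) - 1 else template_len - 1
        if not lo <= new_pos <= hi:
            continue  # the move would break the residue order or leave the template
        old = positions[k]
        positions[k] = new_pos
        candidate = threading_score(target, positions, contacts)
        # Metropolis criterion: always accept improvements, sometimes accept losses.
        if candidate >= current or rng.random() < math.exp((candidate - current) / temp):
            current = candidate
            if current > best:
                best, best_positions = current, positions[:]
        else:
            positions[k] = old
    return best, best_positions
```

A call such as anneal("LKKL", 6, [(0, 5)]) searches for a placement in which the two leucines occupy the contacting positions 0 and 5. Because each move changes the score through contacts with residues placed elsewhere, the score cannot be decomposed position by position, which is why plain dynamic programming no longer applies.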

It is interesting to compare threading methods to methods which attempt to align two protein structures (protein structural alignment), and indeed many of the same algorithms have been applied to both problems.

Protein threading software

  • HHpred is a popular threading server which runs HHsearch, a widely used software for remote homology detection based on pairwise comparison of hidden Markov models.
  • RAPTOR is integer-programming-based protein threading software. The original developer of RAPTOR has designed a new protein threading program, RaptorX, which employs a very different methodology. RaptorX significantly outperforms RAPTOR and is especially good at aligning proteins with sparse sequence profiles. The RaptorX server is freely available to the public.
  • Phyre is a popular threading server combining HHsearch with ab initio and multiple-template modelling.
  • MUSTER is a standard threading algorithm based on dynamic programming and sequence profile-profile alignment. It also combines multiple structural resources to assist the sequence profile alignment.

See also

  • Homology modeling
  • Protein structure prediction
  • Protein structure prediction software
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 