Protein fragment library - AbsoluteAstronomy.com

Protein backbone fragment libraries have been used successfully in a variety of structural biology applications, including homology modeling, de novo structure prediction, and structure determination. By reducing the complexity of the search space, these fragment libraries enable more rapid search of conformational space, leading to more efficient and accurate models.

Motivation

Proteins can adopt an exponential number of states when modeled discretely. Typically, a protein's conformations are represented as sets of dihedral angles, bond lengths, and bond angles between all connected atoms. The most common simplification is to assume ideal bond lengths and bond angles. However, this still leaves the phi-psi angles of the backbone and up to four dihedral angles for each side chain, that is, up to six dihedral angles per residue. This leads to a worst-case complexity of k^(6n) possible states of the protein, where n is the number of residues and k is the number of discrete states modeled for each dihedral angle. To reduce the conformational space, one can use protein fragment libraries rather than explicitly model every phi-psi angle.

Fragments are short segments of the peptide backbone, typically from 5 to 15 residues long, and do not include the side chains. They may specify the location of just the C-alpha atoms in a reduced-atom representation, or of all the backbone heavy atoms (N, C-alpha, carbonyl C, O). To model discrete states of a side chain, one could use a rotamer library approach instead.

This approach operates under the assumption that local interactions play a large role in stabilizing the overall protein conformation. In any short sequence, the molecular forces constrain the structure, leading to only a small number of possible conformations, which can be modeled by fragments. Indeed, according to Levinthal's paradox, a protein could not possibly sample all possible conformations within a biologically reasonable amount of time. Locally stabilized structures would reduce the search space and allow proteins to fold on the order of milliseconds.

Construction

Libraries of these fragments are constructed from an analysis of the Protein Data Bank (PDB). First, a representative subset of the PDB is chosen, which should cover a diverse array of structures, preferably at a good resolution. Then, for each structure, every set of n consecutive residues is taken as a sample fragment. The samples are then clustered into k groups, based upon how similar they are to each other in spatial configuration, using algorithms such as k-means clustering. The parameters n and k are chosen according to the application (see discussion on complexity below). The centroids of the clusters are then taken as the fragments of the library. Because each centroid is derived by averaging other geometries, further optimization can be performed to ensure that it possesses ideal bond geometry.
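The construction steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular library: the function names (`windows`, `features`, `kmeans`) are hypothetical, each fragment is reduced to its C-alpha coordinates, and similarity in "spatial configuration" is approximated by comparing pairwise C-alpha distances, which avoids having to superimpose fragments first. Real pipelines usually compare fragments by RMSD after superposition.

```python
import math
import random

def windows(ca_coords, n):
    """Every run of n consecutive residues (as C-alpha coordinates) in one chain."""
    return [ca_coords[i:i + n] for i in range(len(ca_coords) - n + 1)]

def features(fragment):
    """Pairwise C-alpha distances (assumed similarity measure).

    These values do not change when the fragment is rotated or translated.
    """
    return [math.dist(a, b)
            for i, a in enumerate(fragment)
            for b in fragment[i + 1:]]

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on feature vectors; returns the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Average each cluster; keep the old centroid if a cluster is empty.
        centroids = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids
```

A library would then be built by collecting `features(w)` for every window `w` of every chosen structure and passing the combined list to `kmeans`. Since a centroid here is an averaged distance vector rather than a set of coordinates, a practical version would pick the sample fragment closest to each centroid as the library entry.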

Because the fragments are derived from structures that exist in nature, the segment of backbone they represent will have realistic bonding geometries. This helps avoid having to explore the full space of conformation angles, much of which would lead to unrealistic geometries.

The clustering above can be performed without regard to the identities of the residues, or it can be residue-specific. That is, for any given input sequence of amino acids, a clustering can be derived using only those samples found in the PDB whose n-residue fragment has the same sequence. This requires more computational work than deriving a sequence-independent fragment library but can potentially produce more accurate models. However, a larger sample set is required, and one may not achieve full coverage.

Example use: loop modeling

In homology modeling, a common application of fragment libraries is to model the loops of the structure. Typically, the alpha helices and beta sheets are threaded against a template structure, but the loops in between are not specified and need to be predicted. Finding the loop with the optimal configuration is NP-hard. To reduce the conformational space that needs to be explored, one can model the loop as a series of overlapping fragments. The space can then be sampled, or if the space is now small enough, exhaustively enumerated.

One approach for exhaustive enumeration goes as follows. Loop construction begins by aligning all possible fragments to overlap with the three residues at the N terminus of the loop (the anchor point). Then all possible choices for a second fragment are aligned to (all possible choices of) the first fragment, ensuring that the last three residues of the first fragment overlap with the first three residues of the second fragment. This ensures that the fragment chain forms realistic angles both within each fragment and between fragments. The step is repeated until a loop with the correct number of residues is constructed.

The loop must both begin at the anchor on the N side and end at the anchor on the C side. Each candidate loop must therefore be tested to see if its last few residues overlap with the C-terminal anchor. Of the exponentially many candidate loops, very few will close. After filtering out loops that don't close, one must then determine which loop has the optimal configuration, defined as the one with the lowest energy under some molecular mechanics force field.
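The enumeration and closure test described above can be illustrated with a deliberately simplified model. In this sketch, which is an assumption made for illustration only, each residue conformation is a single letter, a fragment is a string, and two fragments "overlap" when the last three letters of one equal the first three of the next. A real implementation would superimpose 3D coordinates and accept overlaps within an RMSD tolerance. The function name `enumerate_loops` is hypothetical.

```python
def enumerate_loops(library, n_anchor, c_anchor, loop_length, overlap=3):
    """Return every fragment chain of loop_length residues that starts on
    n_anchor and ends on c_anchor (toy model: one letter per residue)."""
    # First fragments must sit on the N-terminal anchor.
    partial = [f for f in library if f.startswith(n_anchor)]
    closed = []
    while partial:
        chain = partial.pop()
        if len(chain) == loop_length:
            # Closure test: the last residues must match the C-terminal anchor.
            if chain.endswith(c_anchor):
                closed.append(chain)
            continue
        for frag in library:
            # The next fragment must overlap the chain's last `overlap` residues.
            if frag.startswith(chain[-overlap:]):
                extended = chain + frag[overlap:]
                if len(chain) < len(extended) <= loop_length:
                    partial.append(extended)
    return sorted(closed)


library = ["AAABB", "ABBCC", "ABBCD", "BCCDD", "BCDDD"]
loops = enumerate_loops(library, n_anchor="AAA", c_anchor="CDD", loop_length=9)
print(loops)  # ['AAABBCCDD']
```

Two candidate loops of nine residues are built here, but only one ends on the C-terminal anchor. The final selection step would then be something like `min(loops, key=energy)`, where `energy` stands in for whatever force-field scoring function is used.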

Complexity

The state space remains exponential in the number of residues even when fragment libraries are used, but the exponent is reduced. For a library of L fragments, each F residues long, used to model a chain of N residues with consecutive fragments overlapping by 3 residues, there will be L^([N/(F-3)]+1) possible chains. This is far fewer than the K^N possibilities that arise when the phi-psi angles of each residue are modeled explicitly as K possible combinations, because the exponent grows more slowly than N.
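A small calculation makes the difference concrete. The sketch below reads the "+1" as part of the exponent and reads the square brackets as rounding down; both readings are assumptions, as are the function names and the example values L = 100, F = 6, N = 12, and K = 36.

```python
def fragment_chain_count(L, N, F, overlap=3):
    """Number of fragment chains: L ** ([N / (F - overlap)] + 1).

    The square brackets are taken to mean integer division (an assumption).
    """
    return L ** (N // (F - overlap) + 1)

def explicit_count(K, N):
    """Number of states when each residue takes one of K phi-psi combinations."""
    return K ** N


print(fragment_chain_count(L=100, N=12, F=6))  # 10000000000
print(explicit_count(K=36, N=12))
```

With these illustrative values the fragment-based search covers 100^5 = 10^10 chains, while the explicit phi-psi model covers 36^12 states, which is more than 10^18.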

The complexity increases with L, the size of the fragment library. However, libraries with more fragments capture a greater diversity of fragment structures, so there is a trade-off between the accuracy of the model and the speed of exploring the search space. This choice governs the number of clusters, k, used when performing the clustering, since each cluster centroid becomes one library fragment.

Additionally, for any fixed L, the diversity of structures that can be modeled decreases as the length of the fragments increases. Shorter fragments are more capable of covering the diverse array of structures found in the PDB than longer ones. Recently, it was shown that libraries of fragments up to 15 residues long are capable of modeling 91% of the fragments in the PDB to within 2.0 angstroms.
