A generative, probabilistic model of local protein structure

Edited by David Baker, University of Washington, Seattle, WA, and approved March 14, 2008 (received for review February 27, 2008)
Abstract
Despite significant progress in recent years, protein structure prediction maintains its status as one of the prime unsolved problems in computational biology. One of the key remaining challenges is an efficient probabilistic exploration of the structural space that correctly reflects the relative conformational stabilities. Here, we present a fully probabilistic, continuous model of local protein structure in atomic detail. The generative model makes efficient conformational sampling possible and provides a framework for the rigorous analysis of local sequence–structure correlations in the native state. Our method represents a significant theoretical and practical improvement over the widely used fragment assembly technique by avoiding the drawbacks associated with a discrete and nonprobabilistic approach.
Protein structure prediction remains one of the greatest challenges in computational biology. The problem itself is easily posed: predict the three-dimensional structure of a protein given its amino acid sequence. Significant progress has been made in the last decade, and, especially, knowledge-based methods are becoming increasingly accurate in predicting structures of small globular proteins (1). In such methods, an explicit treatment of local structure has proven to be an important ingredient. The search through conformational space can be greatly simplified by restricting the angular degrees of freedom in the protein backbone to angles that are known to appear in the native structures of real proteins. In practice, the angular preferences are typically enforced by using a technique called fragment assembly. The idea is to select a set of small structural fragments with strong sequence–structure relationships from the database of solved structures and subsequently assemble these building blocks to form complete structures. Although the idea was originally conceived in crystallography (2), it had a great impact on the protein structure-prediction field when it was first introduced a decade ago (3). Today, fragment assembly stands as one of the most important single steps forward in tertiary structure prediction, contributing significantly to the progress we have seen in this field in recent years (4, 5).
Despite their success, fragment-assembly approaches generally lack a proper statistical foundation or, equivalently, a consistent way to evaluate their contributions to the global free energy. When a fragment-assembly method is used, structure prediction normally proceeds by a Markov chain Monte Carlo (MCMC) algorithm, where candidate structures are proposed by the fragment assembler and then accepted or rejected based on an energy function. The theoretical basis of MCMC is the existence of a stationary probability distribution dictating the transition probabilities of the Markov chain. In the context of statistical physics, this stationary distribution is given by the conformational free energy through the Boltzmann distribution. The problem with fragment-assembly methods is that it is not possible to evaluate the proposal probability of a given structure, which makes it difficult to ensure unbiased sampling (which requires the property of detailed balance). Local free energies could, in principle, be assigned to individual fragments, but there is no systematic way to combine them into a local free energy for an assembly of fragments. In fact, because of edge effects, the assembly process often introduces spurious local structural motifs that are not themselves present in the fragment library (3).
Significant progress has been made in the probabilistic modeling of local protein structure. With HMMSTR, Bystroff and coworkers (6) introduced a method to turn a fragment library into a probabilistic model but used a discretization of angular space, thereby sacrificing geometric detail. Other studies focused on strictly geometric models (7, 8). For these methods, the prime obstacle is their inability to condition the sampling on a given amino acid sequence. In general, it seems that none of these models has been sufficiently detailed or accurate to constitute a competitive alternative to fragment assembly. This is reflected in the latest CASP (critical assessment of techniques for protein structure prediction) exercise, where the majority of best performing de novo methods continue to rely on fragment assembly for local structure modeling (5).
Recently, we showed that a first-order Markov model forms an efficient probabilistic, generative model of the Cα geometry of proteins in continuous space (9). Although this model allows sampling of Cα traces, it is of limited use in high-resolution de novo structure prediction, because this requires the representation of the full atomic detail of a protein's backbone, and the mapping from Cα to backbone geometry is one-to-many. Consequently, this model also cannot be considered a direct alternative to the fragment-assembly technique.
In the present study, we propose a continuous probabilistic model of the local sequence–structure preferences of proteins in atomic detail. The backbone of a protein can be represented by a sequence of dihedral angle pairs, φ and ψ (Fig. 1), that are well known from the Ramachandran plot (10). Two angles, both with values ranging from −180° to 180°, define a point on the torus. Hence, the backbone structure of a protein can be fully parameterized as a sequence of such points. We use this insight to model the angular preferences in their natural space using a probability distribution on the torus and thereby avoid the traditional discretization of angles that characterizes many other models. The sequential dependencies along the chain are captured by using a dynamic Bayesian network (a generalization of a hidden Markov model), which emits angle pairs, amino acid labels, and secondary structure labels. This allows us to directly sample structures compatible with a given sequence and resample parts of a structure while maintaining consistency along the entire chain. In addition, the model makes it possible to evaluate the likelihood of any given structure. Generally, the sampled structures will not be globular, but will have realistic local structure, and the model can thus be used as a proposal distribution in structure-prediction simulations. The probabilistic and generative nature of the model also makes it directly applicable in the framework of statistical mechanics. In particular, because the probability before and after any resampling can be evaluated, unbiased sampling can be ensured.
We show that the proposed model accurately captures the angular preferences of protein backbones and successfully reproduces previously identified structural motifs. Finally, through a comparison with one of the leading fragment-assembly methods, we demonstrate that our model is highly accurate and efficient, and we conclude that our approach represents an attractive alternative to the use of fragment libraries in de novo protein-structure prediction.
Results and Discussion
TorusDBN—A Model of Protein Local Structure.
Considering only the backbone, each residue in a protein chain can be represented by using two angular degrees of freedom, the φ and ψ dihedral bond angles (Fig. 1). The bond lengths and all remaining angles can be assumed to have fixed values (11). Even with this simple representation, the conformational search space is extremely large. However, as Ramachandran and coworkers (10) noted in 1963, not all values of φ and ψ are equally frequent, and many combinations are never observed because of steric constraints. In addition, strong sequential dependencies exist between the angle pairs along the chain. Our goal is to model precisely these local preferences.
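The φ/ψ representation can be made concrete: each dihedral is the signed angle between the two planes spanned by four consecutive backbone atoms. A minimal pure-Python sketch of that computation (the helper function and the coordinates used below are illustrative, not code from the paper):

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four 3D points."""
    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)          # plane normals
    b2u = tuple(x / math.sqrt(dot(b2, b2)) for x in b2)
    m = cross(n1, b2u)                             # frame vector for the sign
    return math.degrees(math.atan2(dot(m, n2), dot(n1, n2)))
```

For four coplanar atoms the function returns 0° (cis-like) or ±180° (trans-like), matching the two values at which the ω angle is fixed in the model.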
We begin by stating a few necessary conditions for the model. First, we require that, given an amino acid sequence, our model should produce protein backbone chains with plausible local structure. In particular, the parameterization used in our model should be sufficiently accurate to allow direct sampling and the construction of complete protein backbones. Note that we do not expect sampled structures to be correctly folded globular proteins—we only require them to have realistic local structure. Secondly, it should be possible to seamlessly replace any stretch of a protein backbone with an alternative segment, thus making a small step in conformational space. Finally, we require that it is possible to compare the probability of a newly sampled candidate segment with the probability of the original segment, which is needed to enforce the property of detailed balance in MCMC simulations.
The resulting model is presented in Fig. 2. Formulated as a dynamic Bayesian network (DBN), it is a probabilistic model that ensures sequential dependencies through a sequence of hidden nodes. A hidden node represents a residue at a specific position in a protein chain. It is a discrete node that can adopt 55 states (see Methods). Each of these states, or h values, corresponds to a distinct emission distribution over dihedral angles [d = (φ,ψ)], amino acids (a), secondary structure (s), and the cis or trans conformation of the peptide bond (c). The angular emissions are modeled by bivariate von Mises distributions, whereas the ω dihedral angle (Fig. 1) is fixed at either 180° or 0°, depending on the trans/cis flag. Note that this model can also be regarded as a hidden Markov model with multiple outputs.
The joint probability of the model is a sum over each possible hidden node sequence h = {h_{1}, … , h_{N}}, where N denotes the length of the protein:

$$P(\mathbf{d}, \mathbf{a}, \mathbf{s}, \mathbf{c}) = \sum_{\mathbf{h}} \prod_{i=1}^{N} P(h_i \mid h_{i-1})\, P(d_i \mid h_i)\, P(a_i \mid h_i)\, P(s_i \mid h_i)\, P(c_i \mid h_i),$$

where P(h_1 | h_0) denotes the initial state distribution P(h_1). The four types of emission nodes (d, a, s, and c) can each be used either as input or output. In most cases, some input information is available (e.g., the amino acid sequence), and the corresponding emission nodes are subsequently fixed to specific values. These nodes are referred to as observed nodes. Sampling from the model then involves two steps: (i) sampling a hidden node sequence conditioned on the set of observed nodes and (ii) sampling emission values for the unobserved nodes conditioned on the hidden-node sequence. The first step is most efficiently solved by using the forward–backtrack algorithm (12, 9) [see supporting information (SI) Text], which allows for the resampling of any segment of a chain. This resembles fragment insertion in fragment-assembly-based methods, but the forward–backtrack approach has the advantage that it ensures a seamless resampling that correctly handles the transitions at the ends of the segment. Once a particular sequence of hidden node values has been obtained, emission values for the unobserved nodes are drawn from the corresponding conditional probability distributions (step ii). This is illustrated in Fig. 2, where the emission probability distributions for a particular h value are highlighted.
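Step (i) can be illustrated on a toy discrete HMM. The sketch below uses made-up transition (T) and emission (E) tables and a discrete observation sequence; the actual model conditions on several emission types at once, but the forward-filtering/backward-sampling logic is the same idea:

```python
import random

def sample_hidden_path(T, E, obs, init, rng=None):
    """Draw a hidden-state path h ~ P(h | obs) for a toy HMM.

    T[j][k] : transition probability j -> k
    E[k][o] : emission probability of symbol o in state k
    init    : initial state distribution
    """
    rng = rng or random.Random(0)
    K, N = len(T), len(obs)

    # Forward pass: alpha[i][k] is proportional to P(obs[0..i], h_i = k)
    alpha = [[init[k] * E[k][obs[0]] for k in range(K)]]
    for i in range(1, N):
        prev = alpha[-1]
        alpha.append([E[k][obs[i]] * sum(prev[j] * T[j][k] for j in range(K))
                      for k in range(K)])

    def draw(weights):
        z = sum(weights)
        r, acc = rng.random() * z, 0.0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                return k
        return K - 1

    # Backward sampling: draw h_{N-1} from alpha, then h_i given h_{i+1}
    path = [draw(alpha[N - 1])]
    for i in range(N - 2, -1, -1):
        path.append(draw([alpha[i][j] * T[j][path[-1]] for j in range(K)]))
    path.reverse()
    return path
```

Because every state is drawn from an exact conditional, the sampled path is an unbiased draw from the posterior, which is what makes the probability of any proposal evaluable afterwards.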
The parameters of the model were estimated from the SABmark 1.65 (13) dataset (see Methods). From the 1,723 proteins, 276 were excluded during training and used for testing purposes (test set).
We conducted a series of experiments to evaluate the model's performance. Throughout this article, we will be comparing the results obtained with our model (TorusDBN) to the results achieved with one of the most successful fragment-assembly-based methods currently available, the Rosetta fragment assembler (3). Because our interest in this study is limited to modeling local protein structure, we exclusively enabled Rosetta's initial fragment-assembly phase, disabling any energy evaluations apart from clash detection. In all cases, as input to Rosetta, we used the amino acid sequence of the query structure, multiple sequence information from PSI-BLAST (14), and a predicted secondary structure sequence using PSIPRED (15).
Angular Preferences.
As a standard quality check of protein structure, a Ramachandran plot is often used by crystallographers to detect possible angular outliers. We investigated how closely the Ramachandran plot of samples from our model matched the Ramachandran plot for the corresponding native structures.
For each protein in the test set, we extracted the amino acid sequence and calculated a predicted secondary structure labeling using PSIPRED. We then sampled a single structure using the sequence and secondary structure labels as input and summarized the sampled angle pairs in a 2D histogram. Fig. 3 shows the histograms for the test set and the samples, respectively. The results are strikingly similar. Although the experiment reveals little about the detailed sequence–structure signal in our model, it provides a first indication that a mixture of bivariate von Mises distributions is an appropriate choice for modeling the angular preferences of the Ramachandran plot.
We proceeded with a comparison to Rosetta. For each protein in the test set, we created a single structure using Rosetta's fragment assembler and compared the resulting histogram to that of the test set. Also in this case, the produced plot is visually indistinguishable from the native one (plot not shown). However, by using the Kullback–Leibler (KL) divergence, a standard measure of distance between probability distributions, it becomes clear that the Ramachandran plot produced by the TorusDBN is closer to native than the plot produced by Rosetta (see SI Text and Table S1).
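A KL comparison of two binned Ramachandran plots might look as follows (a sketch: bin counts are flattened into lists, and the small eps smoothing for empty bins is our assumption, not the paper's procedure):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) in nats for two histograms given as equal-length
    lists of non-negative counts; eps guards against empty bins."""
    zp, zq = float(sum(p)), float(sum(q))
    return sum((pi / zp) * math.log((pi / zp + eps) / (qi / zq + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

The divergence is zero only when the two normalized histograms coincide, so a smaller value against the native histogram indicates a closer match.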
Structural Motifs.
The TorusDBN models the sequential dependencies along the protein backbone through a first-order Markov chain of hidden states. In such a model, we expect longer-range dependencies to be modeled as distinct high-probability paths through the model.
By manually inspecting the paths of length 4 with the highest probability according to the model (based on their transition probabilities), we indeed recovered several well-known structural motifs. Fig. 4 demonstrates how eight well-known structural motifs appear as such paths in the model. Both the emitted angle pairs (Fig. 4) and the amino acid preferences (Fig. S1) correspond well with the literature (16, 17) (see SI Text). All reported paths are among the 0.25% most probable 4-state paths in the model (out of the 55^{4} possible paths).
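The probability of a specific hidden-state path under a first-order Markov chain is simply a product of transition probabilities; a minimal sketch with toy numbers (the real model has 55 states and a trained transition matrix):

```python
def path_probability(path, init, T):
    """Probability of a hidden-state path under a first-order Markov
    chain with initial distribution init and transition matrix T."""
    p = init[path[0]]
    for j, k in zip(path, path[1:]):
        p *= T[j][k]
    return p
```

Ranking all K^4 length-4 paths by this quantity is feasible even for K = 55 (about 9 million products), which is how the most probable paths can be enumerated exhaustively.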
Often, structural motifs arise from combinations of several hidden node paths. By summing over the contributions of all possible paths [posterior decoding (18)], it is possible to extract this information from the model. To illustrate, we reversed the analysis of the structural motifs by giving the ideal angles and secondary structure labeling of a motif as input to the model and calculating the posterior distribution over amino acids at each position. Table 1 lists the top three preferred amino acids for each position in the different β-turn motifs. All of these amino acids have previously been reported to have high propensities at their specific positions (17).
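Posterior decoding can be sketched with the standard forward-backward recursions on a toy HMM (made-up tables; the paper computes posteriors over amino acids given angles and secondary structure, but the machinery is the same):

```python
def posterior_decoding(T, E, obs, init):
    """Return P(h_i = k | obs) for every position i via forward-backward."""
    K, N = len(T), len(obs)
    # Forward: fwd[i][k] proportional to P(obs[0..i], h_i = k)
    fwd = [[init[k] * E[k][obs[0]] for k in range(K)]]
    for i in range(1, N):
        fwd.append([E[k][obs[i]] * sum(fwd[-1][j] * T[j][k] for j in range(K))
                    for k in range(K)])
    # Backward: bwd[i][k] = P(obs[i+1..N-1] | h_i = k)
    bwd = [[1.0] * K]
    for i in range(N - 1, 0, -1):
        bwd.append([sum(T[k][j] * E[j][obs[i]] * bwd[-1][j] for j in range(K))
                    for k in range(K)])
    bwd.reverse()
    # Combine and normalize per position
    post = []
    for f, b in zip(fwd, bwd):
        w = [fi * bi for fi, bi in zip(f, b)]
        z = sum(w)
        post.append([x / z for x in w])
    return post
```

Summing the per-state posteriors against each state's emission table would then yield position-wise distributions over an unobserved emission, which is the reversal described above.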
Sampling Structures.
We conclude with a demonstration of the model's performance beyond the scope of well-defined structural motifs. In the context of de novo structure prediction, the role of the model is that of a proposal distribution, where repeated resampling of angles should lead to an efficient exploration of conformational space. In this final experiment, we therefore sampled dihedral angles for the proteins in our test set and investigated how closely the sampled angles match those of the native state.
For each protein in the test set, 100 structures were sampled, and the average angular deviation was recorded (see SI Text ). This was done for an increasing amount of input to the model. Initially, samples were generated without using input information, resulting in unrestricted samples from the model. We then included the amino acid sequence of the protein, a predicted secondary structure labeling (using PSIPRED), and, finally, a combination of both. We ran the same test with Rosetta's fragment assembler for comparison.
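The exact deviation measure is defined in SI Text; a plausible sketch, assuming only that the deviation respects the 360° periodicity of the angles, is:

```python
def angular_distance(a, b):
    """Smallest absolute difference (degrees) between two angles,
    respecting the 360-degree periodicity of the torus."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mean_angular_deviation(sampled, native):
    """Average per-position angular deviation between two angle lists."""
    return sum(angular_distance(s, n)
               for s, n in zip(sampled, native)) / len(sampled)
```

The wrap-around matters: 170° and −170° are only 20° apart on the torus, although a naive subtraction would report 340°.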
Fig. 5 shows the distribution of the average angular distance over all proteins in the test set. Clearly, as more information becomes available, the samples lie more closely around the native state. When both amino acid and secondary structure information is used, the performance of the TorusDBN approaches that of the fragment assembler in Rosetta. Recall that Rosetta also uses both amino acid and secondary structure information in its predictions but, in addition, incorporates multiple sequence information directly, which TorusDBN does not. In this light, our model performs remarkably well in this comparison. The time necessary to generate a single sample, averaged over all of the proteins in the test set, was 0.08 s for our model and 1.30 s for Rosetta's fragment assembler. All experiments were run on a 2,800 MHz AMD Opteron processor.
To illustrate the effect of the different degrees of input, we include a graphical view of two representative fragments extracted from the samples on the test set (Fig. 6). Note how the sequence and secondary structure input provide distinct signals to the model. In the hairpin motif, the sequence-only signal creates structures with an excess of coil states around the hairpin, whereas the inclusion of only secondary structure input gets the secondary structure elements right but fails to make the turn correctly. Finally, with both inputs, the secondary structure boundaries of the motif are correct, and the quality of the turn is enhanced through the knowledge that the sequence motif Asp-Gly is found at the two coil positions, which is common for a type I′ hairpin (17).
Additional Evaluations.
We conducted several additional experiments to evaluate other aspects of the model. First, we performed a detailed evaluation of TorusDBN's performance on local structure motifs using the I-sites library (19) (SI Text and Figs. S2–S4). Second, we compared TorusDBN directly with HMMSTR in discriminating decoy structures from native ones (SI Text and Tables S2 and S3). Finally, the length distributions of secondary structure elements in samples were analyzed (SI Text and Fig. S5). All of these studies lend further support to the quality of the model.
Potential Applications.
In closing, we list a few potential applications for the described model. First and foremost, it is in the context of de novo predictions that we expect the greatest benefits from our model. Seamless resampling and probability evaluations of proposed structures should provide a better sampling of conformational space, allowing calculations of thermodynamic averages in MCMC simulations (20). There are, however, several other potential areas of application: (i) homology modeling, where the model is potentially useful as a proposal distribution for loop closure tasks; (ii) quality verification of experimentally determined protein structures, where the sequential signal in our model likely constitutes an advantage over the current widespread use of Ramachandran plots to detect outliers; and (iii) protein design, where the model might be used to predict or sample amino acid sequences that are locally compatible with a given structure (as was demonstrated for short motifs in Table 1).
Methods
Parameter Estimation.
The model was trained by using the Mocapy DBN toolkit (21). As training data, we used the SABmark 1.65 twilight protein dataset, which provides, for each SCOP fold, a set of structures with low sequence similarity (13). Training was done on structures from 180 randomly selected folds (1,447 proteins, 226,338 observations), whereas the remaining 29 folds (276 proteins, 42,654 observations) were used as a test set. Amino acid, trans/cis peptide bond, and angle pair information was extracted directly from the training data, whereas secondary structure was computed by using DSSP (22).
Because the hidden node values are inherently unobserved, an algorithm capable of dealing with missing data is required. Here, we used a stochastic version of the well-known expectation-maximization (EM) algorithm (23, 24). The idea behind stochastic EM (25, 26) is to first fill in plausible values for all unobserved nodes (E-step) and then update the parameters as if the model were fully observed (M-step). Just as with classic EM, these two steps are repeated until the algorithm converges. In our case, for each observation in the training set, we sampled a corresponding h value using a single sweep of Gibbs sampling: in random order, all h values were resampled based on their current left and right neighboring h values and the observed emission values at that residue. Computationally, stochastic EM is more efficient than classic EM. Furthermore, on large datasets, stochastic EM is known to avoid convergence to local maxima (26).
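A single Gibbs sweep of the stochastic E-step might be sketched as follows on a toy chain (T, E, and init are hypothetical tables, not trained TorusDBN parameters):

```python
import random

def gibbs_sweep(h, T, E, obs, init, rng):
    """One sweep of the stochastic E-step: resample every hidden value,
    in random order, given its neighbours and its observed emission."""
    K, N = len(T), len(h)
    order = list(range(N))
    rng.shuffle(order)
    for i in order:
        # Conditional weight of each state k at position i
        w = []
        for k in range(K):
            p = E[k][obs[i]]
            p *= T[h[i - 1]][k] if i > 0 else init[k]
            if i < N - 1:
                p *= T[k][h[i + 1]]
            w.append(p)
        # Draw from the normalized weights
        z, r, acc = sum(w), rng.random(), 0.0
        for k in range(K):
            acc += w[k] / z
            if r < acc:
                h[i] = k
                break
    return h
```

The M-step would then re-estimate T and E by simple counting over the filled-in h values, as if the data were fully observed.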
The optimal size of the hidden node (i.e., the number of states that it can adopt) is a hyperparameter that is not automatically estimated by the EM procedure. We optimized this parameter by training models for a range of sizes and evaluating the likelihood of each model using the forward algorithm (18). Because the training procedure is stochastic in nature, we repeated this procedure several times. The best model was selected by using the Bayesian Information Criterion (BIC) (27), a likelihood-based score that penalizes an excess of parameters and thereby avoids overfitting (see SI Text). As displayed in Fig. 7, the BIC reaches a maximum at a hidden node size of ≈55. The model, however, appears to be quite stable with regard to the choice of this parameter. Several of the experiments in our study were repeated with models of different h sizes (40–80) without substantially affecting the results.
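A sketch of BIC-based model selection (the log-likelihoods and the parameter-count formula below are hypothetical placeholders for illustration, not the values from the actual training runs):

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """BIC written so that higher is better (matching the maximum in
    Fig. 7): log-likelihood minus a complexity penalty."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def n_params(h):
    # Rough, hypothetical count: h*h transition parameters plus
    # per-state free emission parameters (5 angular, 19 amino acid,
    # 2 secondary structure, 1 cis/trans).
    return h * h + h * (5 + 19 + 2 + 1)

# Hypothetical log-likelihoods for three candidate hidden-node sizes
candidates = {40: -9.80e5, 55: -9.60e5, 70: -9.65e5}
best = max(candidates, key=lambda h: bic_score(candidates[h], n_params(h), 226338))
```

The penalty grows with the square of the hidden-node size, so beyond some point additional states cost more in parameters than they gain in likelihood.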
Angular Probability Distribution.
The Ramachandran plot is well known in crystallography and biochemistry. The plot is usually drawn as a projection onto the plane, but because of the periodicity of the angular degrees of freedom, the natural space for these angle pairs is the torus. To capture the angular preferences of protein backbones, a mixture of Gaussian-like distributions on this surface is therefore an appropriate choice. We turned to the field of directional statistics for a bivariate angular distribution with Gaussian-like properties that allows for efficient sampling and parameter estimation. From the family of bivariate von Mises distributions, we chose the cosine variant, which was developed especially for this purpose by Mardia et al. (28). The density function is given by

$$f(\phi, \psi) = c(\kappa_1, \kappa_2, \kappa_3) \exp\left[\kappa_1 \cos(\phi - \mu) + \kappa_2 \cos(\psi - \nu) - \kappa_3 \cos(\phi - \mu - \psi + \nu)\right],$$

where c(κ_1, κ_2, κ_3) is a normalizing constant. The distribution has five parameters: μ and ν are the respective means for φ and ψ, κ_1 and κ_2 their concentrations, and κ_3 is related to their correlation (Fig. 8). The parameters can be efficiently estimated by using a moment-estimation technique. Efficient sampling from the distribution is achieved by rejection sampling, using a mixture of two von Mises distributions as a proposal distribution (see SI Text).
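For reference, the unnormalized cosine-model density can be evaluated directly (a sketch: the normalizing constant and the rejection sampler are omitted, and angles are in radians):

```python
import math

def cosine_vm_unnorm(phi, psi, mu, nu, k1, k2, k3):
    """Unnormalized density of the bivariate von Mises cosine model.

    mu, nu : mean angles for phi and psi
    k1, k2 : concentrations; k3 couples the two angles
    """
    return math.exp(k1 * math.cos(phi - mu)
                    + k2 * math.cos(psi - nu)
                    - k3 * math.cos(phi - mu - psi + nu))
```

With κ_3 = 0 the density factorizes into two independent univariate von Mises distributions peaked at (μ, ν); a nonzero κ_3 tilts the density, introducing correlation between φ and ψ.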
Availability.
The TorusDBN model is implemented as part of the backboneDBN package, which is freely available at http://sourceforge.net/projects/phaistos/.
Acknowledgments
We thank Mikael Borg, Jes Frellsen, Tim Harder, Kasper Stovgaard, and Lucia Ferrari for valuable suggestions to the paper; John Kent for discussions on the angular distributions; Christopher Bystroff for help with HMMSTR and the newest version of Isites; and the Bioinformatics Centre and the Zoological Museum, University of Copenhagen, for use of their cluster computer. W.B. was supported by the Lundbeck Foundation, and T.H. was funded by Forskningsrådet for Teknologi og Produktion (“Data Driven Protein Structure Prediction”).
Footnotes
^{§}To whom correspondence should be addressed. E-mail: thamelry@binf.ku.dk

Author contributions: W.B. and T.H. designed research; W.B. performed research; K.V.M. and C.C.T. contributed new reagents/analytic tools; and W.B., J.F.B., A.K., and T.H. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0801715105/DCSupplemental.

Freely available online through the PNAS open access option.
 © 2008 by The National Academy of Sciences of the USA
References
Chikenji G, Fujitsuka Y, Takada S
Edgoose T, Allison L, Dowe DL
Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S
Van Walle I, Lasters I, Wyns L
Altschul SF, et al.
Durbin R, Eddy SR, Krogh A, Mitchison G
Hamelryck T
Dempster AP, Laird NM, Rubin DB
Diebolt J, Ip EHS, in Gilks WR, Richardson S, Spiegelhalter DJ
DeLano WL