Biophysical Journal. 2011 Nov 2;101(9):2251–2259. doi: 10.1016/j.bpj.2011.09.036

Smoothing Protein Energy Landscapes by Integrating Folding Models with Structure Prediction

Ari Pritchard-Bell, M Scott Shell
PMCID: PMC3207166  PMID: 22067165

Abstract

Decades of work have investigated the energy landscapes of simple protein models, but what do the landscapes of real, large, atomically detailed proteins look like? We explore an approach to this problem that systematically extracts simple funnel models of actual proteins using ensembles of structure predictions and physics-based atomic force fields and sampling. Central to our effort are calculations of a quantity called the relative entropy, which quantifies the extent to which a given set of structure decoys and a putative native structure can be projected onto a theoretical funnel description. We examine 86 structure prediction targets and one coupled folding-binding system, and find that in a majority of cases the relative entropy robustly signals which structures are nearest to native (i.e., which appear to lie closest to a funnel bottom). Importantly, the landscape model improves substantially upon purely energetic measures in scoring decoys. Our results suggest that physics-based models, including both folding theories and all-atom force fields, may be successfully integrated with structure prediction efforts. Conversely, detailed predictions of structures and the relative entropy approach enable one to extract coarse topographic features of protein landscapes that may enhance the development and application of simpler folding models.

Introduction

Protein structure prediction has seen major advances in the past few decades, but outstanding challenges remain for proteins that are large, multimeric, transmembrane, or, in particular, that have novel folds or distant sequence homology to databases (1–3). Many current algorithms involve bioinformatics components due to use of structure databases in training or statistical potentials, and there has long been interest in improving these by incorporating more physics-based theories of folding and all-atom physicochemical force fields that have not been trained to structure statistics. In particular, physics-based approaches may offer a more transferable picture of protein energetics and suggest sampling strategies inspired by folding mechanisms (4–6). Unfortunately, direct folding of proteins using all-atom force fields remains computationally challenging for all but very small systems, although a growing number of successes for miniproteins increasingly suggests accuracy of many of the models themselves (6–13).

A central question is: how can physical folding theories and all-atom models be made relevant to practical structure prediction efforts, and vice versa? We suggest that a potentially powerful new route lies with a quantity called the “relative entropy”—a metric that we borrow from recent ideas in the coarse-graining and multiscale simulation literature (14–17). In general terms, the relative entropy measures the amount of information one loses when a detailed description of a system is transformed into a coarse-grained version. Here we propose that detailed sets of actual structure predictions for a given protein can be coarse-grained into a physically motivated toy energy landscape model. We construct these landscapes using interstructure distances and energies measured from short all-atom molecular dynamics (MD) runs with a physiochemical force field. By calculating the relative entropy for this coarse-graining, we are able to identify what topographic properties of the landscape best match the prediction structures, including which are closest to the native. In doing so, the relative entropy also becomes a surrogate for scoring the structures themselves—allowing us to select among or rank them in terms of near-nativeness.

Our work is certainly not the first to attempt to use physics-based models to score or refine structure predictions. A number of studies have found that approximate conformational free energies based on molecular mechanics force fields and implicit solvation models can discriminate native from decoy structures (18–23), and that inaccurate decoys are less stable in MD simulations than native structures (22,24). Yet while a strong correlation between free energy and root mean-square deviation (RMSD) exists for near-native structures (18,19), these free energies can fail to correctly rank high-quality decoys far from the native structure (20–22). Short (<5 ns) MD simulations have also shown mixed success in refining predictions: in some cases, decoys slightly improved (19), whereas in others, the input structures moved away from the native (20,22,25,26). It has been proposed that much longer simulations are required for refinement success, as significant conformational fluctuations contributing to real improvement occur over 10–100-ns timescales (25). More recent efforts leverage sophisticated sampling strategies such as replica exchange molecular dynamics (27,28) and accurate implicit solvation models that smooth the conformational landscape (26,28). Configurational clustering techniques have also been identified as useful indicators of near-native structural families (22,24,27,28).

One important difference between our approach and these earlier efforts is that it does not consider each structure decoy separately (for example, scoring based on energies or conformational fluctuations). Instead, it attempts to integrate multiple predictions together into a more global picture of the folding landscape. Our work in particular is inspired by the efforts of Stumpff-Kane and Feig (29), who found that the correlation between a structure's distance to others in a decoy set and the corresponding energy differences with them is a more robust predictor of proximity to the native structure than the energies themselves. Implicit in this approach is the assumption that the native structure lies near the minimum of a funneled energy landscape. The basic idea is that as one moves away from near-native structures, one expects a systematic increase in energy and hence a stronger energy-distance correlation.

In our work here, rather than measure correlation coefficients, we explicitly develop a simple funneled landscape model that predicts relationships between energies and structure distances. The relative entropy connects the model to a set of actual structure predictions and serves as a similar scoring metric. Our work is distinct, however, in that we are able to ascertain topographic measures of the landscapes through use of the model, and in that we find a “metabasin” picture is necessary for success of the approach. We now describe both aspects below.

Modeling Approach

We consider a protein in terms of a coarse folding energy landscape, i.e., its multidimensional free energy surface projected as a function of the protein configurational degrees of freedom. Here, the solvent degrees of freedom and energetics are included implicitly through standard Boltzmann averaging. The first level of coarsening is in viewing the landscape as composed of distinct metabasins, each of which corresponds to a collection of structurally similar configurations that can be accessed through small energy barriers and short timescales (30). The second level is that we characterize the topography of these features by a very simple model. We measure the distance between a given metabasin and the native one in terms of the rotationally and translationally invariant distance-based root mean-squared deviation (dRMSD),

$$D^2 = \frac{2}{N(N-1)} \sum_{i<j} \left(d_{ij} - d_{ij,0}\right)^2,$$

where N is the chain length, and dij and dij, 0 are the current and native (or putative native) distances between the centroids of residues i and j. We then make the Ansatz that the landscape is globally funnel-like, with the energy increasing to first-order (linearly) with dRMSD, and that roughness can be captured with random Gaussian energy fluctuations. The energy of a protein structure in this coarse-grained model is given by

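$$U = U_0 + \alpha N D + U_{\mathrm{fluct}}, \qquad \wp(U_{\mathrm{fluct}}) = \left(2\pi N^2 \sigma^2\right)^{-1/2} \exp\left(-\frac{U_{\mathrm{fluct}}^2}{2 N^2 \sigma^2}\right), \qquad (1)$$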

where U0 is the native energy, α is a landscape slope coefficient with units of energy per distance, Ufluct is a random variable describing energy fluctuations, ℘ gives their distribution, and N²σ² characterizes their variance.
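As a minimal sketch of the two ingredients defined above, the following Python computes the dRMSD between two structures and draws a single energy from the funnel model of Eq. 1. The function names, the representation of a structure as a list of residue-centroid coordinates, and the default random generator are assumptions made for illustration; they are not taken from the authors' code.

```python
import math
import random

def drmsd(coords_a, coords_b):
    """Distance-based RMSD between two structures.

    coords_a, coords_b: equal-length lists of (x, y, z) residue centroids
    (an assumed input format). Implements
    D^2 = 2 / (N (N - 1)) * sum_{i<j} (d_ij - d_ij,0)^2.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two structures with the same N >= 2 residues")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = math.dist(coords_a[i], coords_a[j])    # current distance
            d_ij0 = math.dist(coords_b[i], coords_b[j])   # reference distance
            total += (d_ij - d_ij0) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))

def funnel_energy(u0, alpha, sigma, n_res, d, rng=random):
    """Draw one energy from Eq. 1: U = U0 + alpha*N*D + U_fluct,
    where U_fluct is Gaussian with zero mean and standard deviation N*sigma."""
    return u0 + alpha * n_res * d + rng.gauss(0.0, n_res * sigma)
```

Because only internal pair distances enter, `drmsd` needs no superposition step: translating or rotating either structure leaves the result unchanged.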

Equation 1 is motivated on several physical grounds. It bears some similarity to Gō-like approaches that assign favorable energies to native amino-acid contacts (31–33); here, deviations from native contacts are explicitly penalized through the dRMSD term that characterizes the multidimensional space of contact distances. (By comparison, the conventional RMSD requiring a global least-squares superposition lacks this direct connection to pairwise interactions.) Gō models give rise to landscapes that are inherently too smooth, and the fluctuation term is intended to add in roughness due to both inter- and intrametabasin energy variations, e.g., due to the many local minima and saddle points. In principle, this roughness arises from the heterogeneity and detailed form of the many atomic interactions, and the Gaussian distribution may be viewed as a central-limit approximation when these are combined. It is also reminiscent of random energy models that have been used extensively to study protein folding landscapes and other frustrated systems (34–36).

On the whole, Eq. 1 assumes that the native structure lies at a minimum in a very simple funneled free energy landscape, with solvent and internal amino-acid degrees of freedom averaged out, and with a statistical model of ruggedness. The most significant assumption here is that the conformational entropy of any given specific structure—due, say, to localized backbone or side-chain fluctuations—is roughly the same. That is, we take structure populations to be dominated by intraprotein potential and solvent free energies. For the compact, well-folded conformations of the structure predictions that we consider later, this approximation is not likely to be severe and in fact the ultimate success of the approach supports this view. Certainly, however, future work could address this issue by appending approximate conformational entropies calculated, for example, in a harmonic, normal-mode manner.

To integrate this kind of simple model with actual, detailed protein structures, one requires a way to quantify the relevance of the coarse-grained to the all-atom picture. We use the relative entropy (Kullback-Leibler divergence) for this purpose, a quantity that we recently introduced as the basis of a new coarse-graining strategy (14–17); in this context, Srel quantifies the thermodynamic information lost due to coarse-graining, and better coarse models are those that minimize the relative entropy with respect to a reference all-atom target system. The general form of the relative entropy is

$$S_{\mathrm{rel}} = \sum_i p_{\mathrm{AA}}(i) \ln\left[\frac{p_{\mathrm{AA}}(i)}{p_{\mathrm{CG}}(i)}\right], \qquad (2)$$

where the summation proceeds over all all-atom configurational states of the system, and pAA and pCG give the respective all-atom and coarse-grained ensemble probabilities of configuration i (which may be transformed to a corresponding coarse version). The relative entropy looks similar to the usual thermal entropy expression except for a negative sign and the presence of the second, coarse-grained probability distribution. It is bounded below by zero, a value reached only by perfect coarse-grained models that correctly recapitulate the entire ensemble probability distribution. Earlier work suggests that the relative entropy may be a kind of universal quality metric for simplified models that, in some cases, actually predicts the magnitude of coarse-graining errors incurred in thermodynamic properties (16,17).

In this article, we instead conceptualize the summation as proceeding over near-native landscape metabasins, which in practice we approximate by a much smaller, compact set of actual structure predictions. Our test cases stem from targets in the biennial CASP competition (2), and for a given protein we utilize the set of predictions generated by all automated web server submissions (typically 260–280 models per target). For each model, we perform a short molecular dynamics (MD) simulation of it using the physics-based AMBER all-atom force field. These simulations only explore the immediate metabasin surrounding the models, associated with small conformational fluctuations of typically 3–5 Å all-atom RMSD. The results are used to compute the relative entropy for each target by summing over all models i. The coarse-grained probability distribution follows from Eq. 1,

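$$p_{\mathrm{CG}}(i) \propto \exp\left[-\frac{\left(\langle U_i \rangle - U_0 - \alpha N \langle D_i \rangle\right)^2}{2 N^2 \sigma^2}\right], \qquad (3)$$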

where 〈Ui〉 and 〈Di〉 give the average energy and average dRMSD from a putative native structure computed in the MD simulation of model i. For pAA we consider an ensemble of low-lying parts of the landscape, letting pAA(i) ∝ 1 if 〈Ui〉 is <75% of the lowest average MD energy found among all models for a given protein, and pAA(i) ∝ 0 otherwise. Though such a criterion is certainly arbitrary, our tests show substantial insensitivity to cutoffs between 50% and 80%.
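The evaluation of Eq. 2 for a fixed set of funnel parameters can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name and argument layout are hypothetical, and the low-energy criterion defining pAA is passed in as a precomputed list of flags (`selected`), so the sketch takes no position on exactly how the energy cutoff is applied.

```python
import math

def relative_entropy(mean_u, mean_d, selected, u0, alpha, sigma, n_res):
    """S_rel of Eq. 2 with p_CG from Eq. 3.

    mean_u, mean_d: per-model average MD energy and average dRMSD to the
                    putative native structure.
    selected:       per-model flags; True marks the low-energy models over
                    which p_AA is uniform (p_AA = 0 for the rest).
    """
    var = (n_res * sigma) ** 2
    # Unnormalized log-weights of Eq. 3 for every model.
    log_w = [-(u - u0 - alpha * n_res * d) ** 2 / (2.0 * var)
             for u, d in zip(mean_u, mean_d)]
    # Normalize p_CG over all models; log-sum-exp avoids underflow.
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(w - top) for w in log_w))
    n_sel = sum(1 for s in selected if s)
    if n_sel == 0:
        raise ValueError("at least one model must be selected for p_AA")
    p_aa = 1.0 / n_sel
    # Models with p_AA = 0 contribute nothing to the sum in Eq. 2.
    return sum(p_aa * (math.log(p_aa) - (w - log_z))
               for w, s in zip(log_w, selected) if s)
```

Two limiting cases are useful checks: if every model is selected and all energies fall exactly on the funnel line, pCG equals pAA and the result is zero; if two models fit the line equally well but only one is selected, the result is ln 2.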

The utility of this approach is that the relative entropy can quantify how well the simple funnel picture describes an actual collection of structures for a given protein. Importantly, the funnel model can be improved in two ways by tuning it so as to minimize Srel: First, one can find values of U0, α, and σ² through its numerical minimization, and hence in some sense uncover general topographic descriptors of the landscape. Second, one can choose which structure is considered native in the model, which affects the calculations of the distances Di. In the spirit of Stumpff-Kane and Feig (29), one can pick putative native states among the original collection of structure predictions. That is, each model can be considered in turn as the native, and a value of the relative entropy can be assigned to it after minimizing with respect to U0, α, and σ². To the extent that the simple landscape picture is accurate, the structure with the minimum value of Srel should then lie nearest to the true native state. In other words, the relative entropy can serve to score the extent to which each structure appears to lie near the bottom of a folding funnel, with lower values presumably corresponding to better models.

We term this approach “landscape smoothing” because projection onto the coarse model in effect filters out energy ruggedness (and hence can signal near-native structures that are in high-lying local minima), facilitating a simpler assessment of funnel-like behavior. Fig. 1 shows a schematic of the entire procedure.

Figure 1.

Schematic of the landscape smoothing approach. Structures from an initial collection of webserver predictions are first energy-minimized using an all-atom physiochemical force field. Short (40 ps) molecular dynamics runs then serve to explore the immediate metabasin surrounding each. The average energies and interstructure distances are projected onto a simple, analytical funnel-shaped model that is tuned so as to minimize the relative entropy. Finally, structures are scored by the value of the relative entropy when the funnel minimum is coincident with them.

Methods

We use the AMBER package (37) with the ff96 force field (38) and the implicit solvation model of Onufriev et al. (39). This particular combination has been shown in several studies to correctly fold a variety of small peptides and miniproteins (4,41,42). Protein test cases are CASP8 and CASP9 single-domain targets, with structure sets taken from all submitted webserver models as available on the CASP website. All calculations for CASP9 targets were performed in conjunction with that event and before release of the experimental structures. After the event, 85 of the targets were identified as single-domain and analyzed, of which 10 were designated as free-modeling targets and the remaining template-based modeling. Missing residues that appear in a small subset of models are rebuilt in extended form before simulation using the AMBER tleap program. Our runs employ a time step of 2 fs with all hydrogen atoms constrained with rigid bonds. All structures are first energy-minimized (200 steepest-descent and 50 conjugate-gradient steps). Subsequently, MD runs are performed entailing 20 ps of equilibration followed by 20 ps of production time for calculating averages. Our earlier tests found 20 ps to perform as well as 100 ps averaging time, although longer runs gave lengthy analyses given the number of predictions; we view the 20-ps time as a reasonable compromise. All energies analyzed and reported include both the interatomic potential energy as well as the approximate solvation free energy given by the generalized-Born solvation model.

For each structure in a protein's set of predictions, an Srel value is computed using custom Python code that takes the following steps: dRMSD distances to all other models of the same protein are computed; values of U0, α, and σ² in Eq. 1 are determined through minimization of Srel (Eqs. 2 and 3) and use of the MD-determined energies; and this minimal Srel is recorded. Each time Srel is computed, we first evaluate and then normalize the probabilities pCG (Eq. 3) and pAA for each structure; these are then directly used in Eq. 2. During minimization, multiple initial parameter starting points always led to the same minima. For some structures, however, we find α < 0, indicating that these sit near a maximum rather than a minimum (funnel basin) in the landscape. Because the relative entropy does not distinguish such cases, we set their value of Srel equal to the maximum among the structure set. Finally, in evaluating Srel for native structures, we compute dRMSD values using only those residues that are resolved in the experimental structures. The native structure is not, however, used in computing Srel for the webserver models (only for its own relative entropy). We did not compute energies for native structures because in many cases they had missing residues relative to the models, making comparison difficult.
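The scoring loop described above can be sketched in Python as follows. The text does not say which numerical minimizer was used, so a simple derivative-free compass search stands in for it here. That choice, along with all function names, the starting guesses, the step sizes, and the use of log σ as the search variable, is an assumption made for illustration. `dist[j][i]` is taken to hold the dRMSD of model i from putative native j, and `selected` flags the low-energy models that define pAA.

```python
import math

def srel(params, mean_u, d, selected, n_res):
    """Eq. 2 with p_CG from Eq. 3; params = (U0, alpha, log sigma)."""
    u0, alpha, log_sigma = params
    sigma = math.exp(max(-20.0, min(20.0, log_sigma)))  # keep sigma positive, finite
    var = (n_res * sigma) ** 2
    log_w = [-(u - u0 - alpha * n_res * x) ** 2 / (2.0 * var)
             for u, x in zip(mean_u, d)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(w - top) for w in log_w))
    n_sel = sum(1 for s in selected if s)
    return sum((-math.log(n_sel) - (w - log_z)) / n_sel
               for w, s in zip(log_w, selected) if s)

def pattern_search(f, x0, steps, tol=1e-6, max_iter=2000):
    """Compass search (a stand-in minimizer): try +/- moves along each
    coordinate and halve the move size whenever none of them helps."""
    x, fx, scale = list(x0), f(x0), 1.0
    for _ in range(max_iter):
        if scale < tol:
            break
        improved = False
        for k in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[k] += sign * scale * steps[k]
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            scale *= 0.5
    return x, fx

def score_models(mean_u, dist, selected, n_res):
    """Treat each model in turn as the putative native structure.

    Returns (scores, alphas): the minimized S_rel and fitted slope per model.
    """
    if not any(selected):
        raise ValueError("at least one model must be selected for p_AA")
    spread = (max(mean_u) - min(mean_u)) or 1.0
    scores, alphas = [], []
    for j in range(len(mean_u)):
        d = dist[j]
        d_max = max(d) or 1.0
        x0 = [min(mean_u), 0.0, math.log(spread / n_res)]   # assumed starting point
        steps = [spread, spread / (n_res * d_max), 1.0]     # assumed step sizes
        best, s = pattern_search(
            lambda p: srel(p, mean_u, d, selected, n_res), x0, steps)
        scores.append(s)
        alphas.append(best[1])
    # Models with alpha < 0 sit near a landscape maximum; as in the text,
    # they receive the largest S_rel found in the set.
    worst = max(scores)
    return [worst if a < 0 else s for s, a in zip(scores, alphas)], alphas
```

A single compass search can stall in a local minimum, so in practice one would repeat it from several starting points, in the spirit of the multiple-start check mentioned in the text.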

To assess the success of the relative entropy approach in locating near-native structures, we compute the percentage fx of the predicted top structures (lowest Srel values) that are truly among the top models as scored by dRMSD to native. Here “top” denotes a best fraction of the total number of predictions; fx is computed each for x = 10, 20, and 30% of the structure dataset for a protein. We also introduce an enhancement factor, χx, that compares this fraction to that which one would expect for a random sampling of x percent of the dataset, χx = fx/x. Values of χx > 1 suggest that relative entropy scoring of the structures is better than naïve, random selection. As a point of comparison, we also compute values of fx and χx for cases in which a quantity different from Srel is used to score and rank structures. In one case, we use the MD average 〈EMD〉, a purely energetic measure. In a second case, we rank using an entirely distance-based metric, nlocal, which gives the number of other structures in the dataset that are within a 4 Å dRMSD range from the considered one. The point of nlocal is to score in a similar manner to widely used clustering algorithms that select structures using the hypothesis that near-native structures are likely to have many structurally similar decoys generated.
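The assessment metrics defined in this paragraph can be sketched in Python as follows. The function names are hypothetical, and two details the text leaves open are filled in by assumption: the number of "top" models is the rounded value of x times the dataset size (at least one), and ties are broken by input order.

```python
def top_indices(values, x):
    """Indices of the lowest-valued fraction x (0 < x <= 1) of the entries.
    Rounding and tie-breaking are assumptions, not taken from the text."""
    k = max(1, round(x * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:k])

def fraction_found(scores, drmsd_to_native, x):
    """f_x: share of the top-x models by dRMSD to native that the score
    also places in its own top x (lower score = better)."""
    by_score = top_indices(scores, x)
    by_drmsd = top_indices(drmsd_to_native, x)
    return len(by_score & by_drmsd) / len(by_drmsd)

def enhancement(scores, drmsd_to_native, x):
    """chi_x = f_x / x; values above 1 beat random selection on average."""
    return fraction_found(scores, drmsd_to_native, x) / x

def n_local(dist, cutoff=4.0):
    """Number of other models within `cutoff` (Å dRMSD) of each model,
    given a square matrix of model-to-model distances."""
    return [sum(1 for j, d in enumerate(row) if j != i and d <= cutoff)
            for i, row in enumerate(dist)]
```

Since a larger nlocal is the favorable direction, ranking by it with these helpers means passing the negated counts as `scores`. For example, `enhancement([-n for n in n_local(dist)], drmsd_to_native, 0.10)`.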

For the binding study described later, we simulate the 11-residue hirudin fragment DFEEIPEEYLQ using 20-ns replica exchange MD simulations. Details of the methodological approach are described in Lin and Shell (42). The final 5 ns of the 270 K trajectory is clustered to extract 10 conformers using a modified K-means algorithm. Each of these is manually positioned ∼20 Å away from the binding site (random orientation) in the holo thrombin structure (PDB: 3C27), and submitted to the RosettaDock webserver (43) for docking. Ten docked conformations for each cluster give a total of 100 structures for use in the relative entropy approach as described above.

Results

Funnel landscape of an αβ protein

We first discuss detailed results on an illustrative case, the CASP8 target T0471, a 131-residue globular αβ protein. We analyze 256 structure models submitted by 59 separate groups. The AMBER-evaluated minimized and average MD energies for these are shown in Fig. 2 as a function of dRMSD to the native structure (PDB: 2K4M). The models span a wide range of proximity to native, from 2.5 to >15 Å, yet there is weak correlation with the force-field-based energy metrics. In fact, the lowest energy structures appear to lie at ∼8–9 Å. Interestingly, the average MD energies do show a slight improvement over the minimized ones, in terms of correlation with dRMSD (R2 = 0.2 vs. 0.0). This may suggest the benefit of averaging over local landscape features, and possible high-energy trapping minima, in the metabasin view.

Figure 2.

Landscape smoothing for the αβ target T0471. Minimized (top) and average MD energies (middle) do not show a strong correlation with near-nativeness. On the other hand, the relative entropy (bottom) does and successfully picks out the top structure, without any knowledge of the native one. For convenience, all Srel values are shifted such that the minimum is zero.

On the other hand, the computed relative entropies for the structures show a remarkably strong correspondence with dRMSD (R2 = 0.7), particularly close to the native. In fact, that with the lowest Srel has the lowest overall dRMSD, and all models with Srel < 2 have dRMSD values <5 Å. Here for clarity, we have normalized the relative entropy calculations such that the minimum found is zero, shifting each by a constant amount; we continue this practice throughout. It is important to recognize that, although the calculation of the relative entropy involves the mutual dRMSDs among the predicted structures, the correlation described here is with a different, independent set of dRMSDs, those with respect to the native structure. Fig. 3 compares the best models that would be picked on the basis of the MD energy and relative entropy scores. Clearly the latter is a much better approximation of the native, and its differences from native lie largely in loop regions that show increased conformational fluctuations among the multiple NMR structures. The relative entropy's success as a scoring metric here is likely due to the fact that it integrates energy and distance information from all structures to assess how well a given one fits the simple funnel picture; in the process, therefore, it is able to detect and filter out abnormally low or high landscape energy fluctuations by comparison to nearby structures.

Figure 3.

Best predictions for T0471 as selected by minimum average energies and minimum relative entropy. The latter is in much better agreement with the native structure at 3.5/2.5 Å vs. 15.0/9.0 Å overall backbone RMSD/dRMSD. Interestingly, the parts of the top Srel structure in worst agreement with the native tend to lie in loop regions that display increased flexibility across multiple NMR models.

What elements of the relative entropy approach enable these results? The metabasin perspective of using averaged MD energies in Eq. 3 turns out to be critical. If these are replaced with the minimized energies and Srel is reevaluated, there is significantly less correlation with native proximity (R2 = 0.1, see also Fig. S1 in the Supporting Material), suggesting that coarse funnel models of this kind are only descriptive when high-frequency ruggedness is averaged out. One might also question the use of dRMSD rather than some other distance metric or reaction coordinate. Replotting the results in Fig. 2 versus the usual RMSD to native does not change this picture, nor the quality of Srel as a scoring metric (see Fig. S2). Indeed, there is a close correspondence of the two distances, particularly at low values (see Fig. S3). However, if the RMSD replaces the interdecoy dRMSD as a surrogate for Di in the funnel model itself—i.e., in Eqs. 1 and 3—the correlation of Srel with native proximity is notably reduced (see Fig. S4). One interpretation may be that dRMSD is a more energetically relevant structure distance through its relationship to pairwise interactions. RMSD may be less relevant because localized structural perturbations can cause global shifts in structure superposition and similarity. Better distance metrics may yet exist; we leave these for future work.

Application to CASP9 targets

We applied the landscape smoothing technique to 85 single-domain proteins as a part of the CASP9 event; these ranged from 54 to 506 residues and contained an overall average of 34% α- and 27% β-secondary structure content. For each protein, we assessed whether or not the relative entropy is a good predictor of its near-nativeness by computing f10, f20, and f30, the fraction of the lowest 10, 20, and 30% dRMSD structures that are found by picking the same percentage of lowest Srel ones. Table 1 shows that the relative entropy scoring strategy is able to locate 33% of the top 10% and 63% of the top 30% of models, on average. Importantly, when these are compared to the average fraction that one would expect from completely random selection, the relative entropy has an enhancement factor χ of 2.1–3.3; i.e., it picks this many times more models correctly. The distributions of these statistics show a wide range of performance across the entire protein set, with some instances attaining >80% of top models and enhancement factors reaching 8.0 (see Fig. S5).

Table 1.

Results of landscape smoothing for 85 CASP9 proteins

Entries give fx, the fraction of the top x% of models found (enhancement factor χx in parentheses).

| Targets considered | Count | f10 (χ10) | f20 (χ20) | f30 (χ30) |
|---|---|---|---|---|
| Ranked using Srel: | | | | |
| All | 85 | 33% (3.3) | 50% (2.5) | 63% (2.1) |
| <10 missing residues | 25 | 40% (4.0) | 57% (2.8) | 68% (2.3) |
| <4.0 Å median dRMSD | 40 | 38% (3.8) | 52% (2.6) | 64% (2.1) |
| <10 missing residues and <4.0 Å dRMSD | 14 | 46% (4.6) | 63% (3.1) | 73% (2.4) |
| Same, but ranked using 〈EMD〉: | | | | |
| All | 85 | 29% (2.9) | 44% (2.2) | 54% (1.8) |
| Same, but ranked using nlocal: | | | | |
| All | 85 | 26% (2.6) | 42% (2.1) | 54% (1.8) |

Shown at the top are the fractions of top models found using the relative entropy scoring; the bottom rows show analogous results when the energy and distance metrics 〈EMD〉 and nlocal are used to rank structures instead of Srel. The corresponding enhancement factors give the ratio of the fraction to what one would expect for entirely random sampling. For the relative entropy ranking, both the fraction of top models and the enhancement factor improve with subsets of the targets that have fewer missing residues (relative to the originally published CASP9 sequence) and lower median dRMSD-to-native values among webserver predictions.

What makes the relative entropy strategy effective? Table 1 also shows that scoring structures using average MD energies alone is not nearly as good. This suggests that the process of coarse-graining the landscape, which draws information from energy-distance relationships rather than mere energies, adds value to the selection procedure. Moreover, it appears that the success of the Srel approach is independent of clustering and related scoring methods. When decoys are ranked according to the number of structurally close predictions (using the metric nlocal as defined above), the ability to select top structures is diminished and even slightly worse than scoring based on MD energies. Thus, it is clear that the landscape-based energetic picture emphasized by the relative entropy ranking contributes to its scoring success, and its results are not simply artifacts of enhanced near-native structure populations.

A selection of five representative successful cases is shown in Fig. 4, showing dramatic correlations between Srel and native distance. In three of these, the lowest Srel structure coincides with the lowest dRMSD structure; in the other two, it appears among the top five models. By comparison, the average MD energies show much weaker correlation with native proximity and in all but one instance the minimum energy model is not among the top 10%. We also compute what the value of the relative entropy would have been for the actual native structure, had it been originally included as a model. It is nearly zero for four of these proteins and less than one for the other, consistent with the idea that the true native should signal the best fit to the funnel model. Though just a small sample, these cases do illustrate three features of Srel scoring: First, the approach appears to work for a variety of chain lengths and secondary structure balance. Second, the lowest Srel structures are not necessarily those with the most immediate neighbors (similar models), as the predictions in many of these cases mostly populate more intermediate dRMSD values. This marks an important advantage over clustering-based model selection methods, as discussed more generally by Stumpff-Kane and Feig (29). Here, the relative entropy can mark a rare prediction as a good fit to a funnel minimum because the model of Eq. 1 implies relationships beyond close structure neighbors. Third, the approach succeeds both in cases where the set of structures spans small (0–5 Å) deviations from native and in cases where it spans large (5–15 Å) ones. We do find often, however, that the correlation between Srel and dRMSD seems to improve with an increase in the number of near-native models, as appears visually in Fig. 4.

Figure 4.

Landscape smoothing for five selected CASP9 targets. (Left panels) Average MD energies. (Right panels) Corresponding relative entropy values. A dramatic improvement in correlation with near-nativeness is found for the relative entropy, across proteins of a variety of secondary structure content, size, and quality of predictions. (Dotted red lines) Value of the relative entropy for the true native structure; it is close to zero, as expected. The native structures depicted do not fall on top of any data points.

Still, there are a few cases where the relative entropy fails to be an effective scoring metric. In 13 of the 85 cases, the enhancement factor is <1, a sign that relative entropy ranking is actually worse than random selection. We hypothesize that poor performance may be due to two possible features of a particular prediction scenario that make it difficult to apply the funnel model: 1) cases in which the predicted sequence and models contain a substantial number of residues that are not resolved in the experimental structure, and 2) those in which all predictions lie too far away from the native structure. The average enhancement factors for subsets of the proteins that have few missing residues and low median dRMSDs to native are indeed higher, up to 4.6, supporting the likely relevance of these issues (Table 1). The character of the native structure may also be important; some failures involve proteins with unusual or elongated structures, or proteins that formed multimeric units, which naturally calls into question the suitability of the simple funnel model (see Fig. S7).

Other potential indicators, such as native α- and β-content and the values of the optimized landscape parameters α and σ, are not statistically different for proteins with higher values of χ10 and thus do not seem to predict how well Srel will score models (see Table S1 in the Supporting Material). The degree to which the average MD energies are lower for more nativelike structures is also uninformative, as it is only very weakly correlated with enhancement (see Fig. S6). In addition, there is no significant difference between the free-modeling and template-based modeling target groups. In any case, the relative entropy procedure is ignorant of templates and would only be affected by this distinction indirectly, through the quality of predicted structures generated by the webservers.

For the more successful cases, the relative-entropy-optimal values of model parameters σ, α, and U0 may prove suggestive of actual characteristics of these proteins' underlying energy landscapes, and hence, offer a natural way to understand the relationship between real protein structures and folding models. In particular, we evaluate the ruggedness parameter σ for native structures of proteins for which Srel scoring is effective (i.e., those for which f10 > 40%). We find that it is surprisingly constant: an average value of 0.43 ± 0.07 kcal/mol. The steepness of the folding funnel, measured by α, is less consistent but still reveals characteristic values: with the exception of three outliers, it averages at 0.017 ± 0.009 kcal/mol/Å. Neither of these parameters shows a significant correlation with secondary structure content or chain length (see Fig. S8). These findings may suggest the existence of certain universal, coarse-grained characteristics of the folding landscapes of these proteins.
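To make the roles of α, σ, and U0 concrete, the following minimal sketch fits a simplified funnel to a set of decoys. It assumes, purely for illustration, that energies are Gaussian-distributed about a mean that rises linearly with distance from the putative native structure, E ≈ U0 + α·d, with spread σ. The paper's actual functional form and optimization are defined in its Methods and may differ; the function name is hypothetical.

```python
import math


def fit_linear_funnel(distances, energies):
    """Fit E ~ Normal(U0 + alpha * d, sigma^2) by least squares.

    ASSUMPTION: a linear-mean, constant-width Gaussian funnel. For this
    simplified model the least-squares fit is also the maximum-likelihood fit.
    Returns (alpha, u0, sigma): alpha in energy per distance unit,
    u0 and sigma in energy units.
    """
    n = len(distances)
    if n < 3 or n != len(energies):
        raise ValueError("need at least 3 matched (distance, energy) pairs")

    d_mean = sum(distances) / n
    e_mean = sum(energies) / n

    # Spread of the distances and their covariance with the energies.
    sxx = sum((d - d_mean) ** 2 for d in distances)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((d - d_mean) * (e - e_mean) for d, e in zip(distances, energies))

    alpha = sxy / sxx              # funnel steepness
    u0 = e_mean - alpha * d_mean   # energy at the funnel bottom (d = 0)

    # Ruggedness: root-mean-square scatter of energies about the funnel line.
    residuals = [e - (u0 + alpha * d) for d, e in zip(distances, energies)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    return alpha, u0, sigma
```

In this picture a positive α means that energies fall on average as structures approach the reference, and σ measures how strongly individual decoys scatter about that trend.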

Folding landscape coupled to binding

We also test the coarse-graining approach on a system involving folding coupled with binding. In principle, such processes can also be described in funneled landscape terms, enabling us to evaluate the generality of the strategy here. Specifically, we examine binding to the coagulant protein thrombin of an 11-residue core fragment of the natural peptide inhibitor hirudin. In this case, the relevant energy landscape is that describing both the conformational states of hirudin and its mutual orientation and distance from thrombin. The native binding pose and the hirudin sequence suggest that favorable interactions are stabilized by hydrophobic and electrostatic forces.

To sample the landscape for this flexible binding problem, we first extract multiple unbound hirudin conformations using molecular simulations (see Methods). These show that hirudin adopts a variety of largely extended structures and is inherently quite flexible, particularly near its N-terminus. Remarkably, some of these structures are quite close to the bound conformer, reaching 1.8 Å backbone RMSD ∼20% of the time (see Fig. S8); that it samples boundlike conformations in solution may suggest a mechanism by which hirudin is able to achieve high binding affinity. Subsequently, we use RosettaDock (43) to dock conformers to holo-thrombin, generating an ensemble of bound structures. These unconstrained runs place hirudin at many different locations on thrombin's surface, both near and far from the native binding site and spanning 4.4–30 Å RMSD from it (Fig. 5). As previously, we then energy-minimize and perform short MD simulations for each.

Figure 5.

Docking predictions and selection for hirudin-thrombin binding. (Left) All 100 predictions of hirudin-thrombin complex obtained using the RosettaDock server. Ten hirudin structures are initially generated by replica exchange MD before docking. (Right) The lowest Srel structure (teal) finds the appropriate binding pocket and is close to the native (green) at 5.4 Å RMSD.

When the docked structures are evaluated using the relative entropy approach, we find results that are strikingly similar to the earlier purely folding studies. Namely, there is an excellent correlation between the value of Srel and near-nativeness (Fig. 6), and the lowest relative entropy prediction picks the second-best model. On the other hand, the energetic measures are much less informative, exhibiting a near-constant range of sampled energies with distance from the native binding site. Interestingly, the relative entropy procedure finds a large fraction of structures (32%) for which the optimal value of the funnel slope α is negative, suggesting that these conformers lie near a sort of maximum in the energy landscape. When these particular structures are viewed separately (see Fig. S9), they actually cluster around a convex, nonpocketed region of the thrombin structure, suggesting that indeed the negative α may be a realistic indication of the nature of the local energy surface.

Figure 6.

Landscape smoothing for hirudin-thrombin binding. Similar to what was found for purely folding cases, this coupled folding-binding problem shows that minimized (top) and average MD energies (middle) do not signal proximity to the true binding conformation, whereas the relative entropy strategy (bottom) does. The large numbers of structures with Srel near a value of 13 are cases in which the energy landscape was found to have a maximum rather than a funnel structure (α < 0), and the largest positive-α relative entropy was instead assigned. These structures turn out to cluster near the same region of the thrombin surface (Fig. S10).
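The treatment of negative-α structures described in this caption can be sketched as follows. This is one possible reading, stated as an assumption: every structure whose optimal α is not positive receives the largest Srel found among the structures with positive α, so that it ranks no better than the worst funnel-like candidate. The function name is hypothetical.

```python
def reassign_negative_slope(srel_values, alpha_values):
    """Replace Srel for structures whose optimal funnel slope is not positive.

    ASSUMED rule (one reading of the Fig. 6 caption): such structures are
    given the largest Srel observed among structures with alpha > 0.
    """
    if len(srel_values) != len(alpha_values):
        raise ValueError("srel_values and alpha_values must have equal length")

    # Srel values of the structures that do look funnel-like.
    funnel_like = [s for s, a in zip(srel_values, alpha_values) if a > 0]
    if not funnel_like:
        raise ValueError("no structure has a positive alpha")

    ceiling = max(funnel_like)
    return [s if a > 0 else ceiling for s, a in zip(srel_values, alpha_values)]
```

With this rule all non-funnel structures share a single Srel value, which would be consistent with the cluster of points near 13 in the bottom panel.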

Conclusions

In this work we introduced an approach to smooth the energy landscapes of detailed protein structure predictions by projecting them onto a simple analytical folding funnel model. At the heart of the method is a coarse-graining quantity called the relative entropy that quantifies the success of the projection, and that can be used to optimize and hence infer the coarse properties of the analytical landscape. Importantly, the relative entropy also shows which putative structures sit near a funnel minimum, and hence are most nativelike, as we found in tests of 86 proteins. It is important to note that the relative entropy is calculated from a sampling of putative folds, without knowledge of the native structure; it therefore may offer a new tool in selecting and directing structure predictions.

The approach also appears encouraging for coupled folding and binding problems. Because we used a physiochemical force field to evaluate energies, this work suggests that possible errors in such energy functions might not be nearly as bad as thought for structure prediction (26,44). Instead, the inherent ruggedness of their energy landscapes may mask near-nativeness unless some type of smoothing approach is applied. On the other hand, the relative entropy approach may be equally effective if a bioinformatics energy scoring function is used, to the extent that the corresponding landscapes are funnel-like.

This general strategy appears not only to offer a new route to score protein structure predictions, but may also help develop new simple folding models by more directly characterizing properties of energy landscapes in realistic protein structures. Certainly more sophisticated funnel models could also be used, including Gō-like ones that may require explicit simulations to evaluate energy distributions. In addition, here we considered cases in which there are only a few hundred structures to characterize the landscape and relative entropy; however, one might imagine that increasing this number could improve the approach dramatically. Structure refinement and perturbation strategies might therefore offer a natural complement to this approach, perhaps iterating between generation of large decoy sets and identification of the best models by smoothing.

Acknowledgments

We gratefully acknowledge the support of the National Science Foundation (Award No. CBET-0845074) and of the Hellman Family Faculty Fellows Program.

Supporting Material

Document S1. Ten figures and a table
mmc1.pdf (674.2KB, pdf)

References

  • 1. Baker D., Sali A. Protein structure prediction and structural genomics. Science. 2001;294:93–96. doi: 10.1126/science.1065659.
  • 2. Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 2005;15:285–289. doi: 10.1016/j.sbi.2005.05.011.
  • 3. Moult J., Fidelis K., Tramontano A. Critical assessment of methods of protein structure prediction—round VIII. Proteins Struct. Funct. Bioinformat. 2009;77:1–4. doi: 10.1002/prot.22589.
  • 4. Shell M.S., Ozkan S.B., Dill K.A. Blind test of physics-based prediction of protein structures. Biophys. J. 2009;96:917–924. doi: 10.1016/j.bpj.2008.11.009.
  • 5. DeBartolo J., Colubri A., Sosnick T.R. Mimicking the folding pathway to improve homology-free protein structure prediction. Proc. Natl. Acad. Sci. USA. 2009;106:3734–3739. doi: 10.1073/pnas.0811363106.
  • 6. Dill K.A., Ozkan S.B., Weikl T.R. The protein folding problem. Annu. Rev. Biophys. 2008;37:289–316. doi: 10.1146/annurev.biophys.37.092707.153558.
  • 7. Lei H., Wu C., Duan Y. Folding processes of the B domain of protein A to the native state observed in all-atom ab initio folding simulations. J. Chem. Phys. 2008;128:235105–235109. doi: 10.1063/1.2937135.
  • 8. Lei H., Duan Y. Ab initio folding of albumin binding domain from all-atom molecular dynamics simulation. J. Phys. Chem. B. 2007;111:5458–5463. doi: 10.1021/jp0704867.
  • 9. Voelz V.A., Bowman G.R., Pande V.S. Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J. Am. Chem. Soc. 2010;132:1526–1528. doi: 10.1021/ja9090353.
  • 10. Pande V.S., Sorin E.J., Rhee Y.M. Chapter 8: Computer simulations of protein folding. In: Muñoz V., editor. Protein Folding, Misfolding and Aggregation. Royal Society of Chemistry; Cambridge, UK: 2008. pp. 161–187.
  • 11. Kim E., Jang S., Pak Y. Direct folding studies of various α- and β-strands using replica exchange molecular dynamics simulation. J. Chem. Phys. 2008;128:175104–175110. doi: 10.1063/1.2909561.
  • 12. Paschek D., Hempel S., García A.E. Computing the stability diagram of the Trp-cage miniprotein. Proc. Natl. Acad. Sci. USA. 2008;105:17754–17759. doi: 10.1073/pnas.0804775105.
  • 13. Freddolino P.L., Harrison C.B., Schulten K. Challenges in protein folding simulations: timescale, representation, and analysis. Nat. Phys. 2010;6:751–758. doi: 10.1038/nphys1713.
  • 14. Shell M.S. The relative entropy is fundamental to multiscale and inverse thermodynamic problems. J. Chem. Phys. 2008;129:144108. doi: 10.1063/1.2992060.
  • 15. Chaimovich A., Shell M.S. Anomalous waterlike behavior in spherically-symmetric water models optimized with the relative entropy. Phys. Chem. Chem. Phys. 2009;11:1901–1915. doi: 10.1039/b818512c.
  • 16. Chaimovich A., Shell M.S. Coarse-graining errors and numerical optimization using a relative entropy framework. J. Chem. Phys. 2011;134:094112. doi: 10.1063/1.3557038.
  • 17. Chaimovich A., Shell M.S. Relative entropy as a universal metric for multiscale errors. Phys. Rev. E. 2010;81:060104. doi: 10.1103/PhysRevE.81.060104.
  • 18. Lazaridis T., Karplus M. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 1999;288:477–487. doi: 10.1006/jmbi.1999.2685.
  • 19. Lee M.R., Baker D., Kollman P.A. 2.1 and 1.8 Å average C(α) RMSD structure predictions on two small proteins, HP-36 and s15. J. Am. Chem. Soc. 2001;123:1040–1046. doi: 10.1021/ja003150i.
  • 20. Lee M.R., Kollman P.A. Free-energy calculations highlight differences in accuracy between x-ray and NMR structures and add value to protein structure prediction. Structure. 2001;9:905–916. doi: 10.1016/s0969-2126(01)00660-8.
  • 21. Vorobjev Y.N., Hermans J. Free energies of protein decoys provide insight into determinants of protein stability. Protein Sci. 2001;10:2498–2506. doi: 10.1110/ps.15501.
  • 22. Lee M.R., Tsai J., Kollman P.A. Molecular dynamics in the endgame of protein structure prediction. J. Mol. Biol. 2001;313:417–430. doi: 10.1006/jmbi.2001.5032.
  • 23. Kmiecik S., Gront D., Kolinski A. Towards the high-resolution protein structure prediction. Fast refinement of reduced models with all-atom force field. BMC Struct. Biol. 2007;7:43. doi: 10.1186/1472-6807-7-43.
  • 24. Taly J.-F., Marin A., Gibrat J.-F. Can molecular dynamics simulations help in discriminating correct from erroneous protein 3D models? BMC Bioinformatics. 2008;9:6. doi: 10.1186/1471-2105-9-6.
  • 25. Fan H., Mark A.E. Refinement of homology-based protein structures by molecular dynamics simulation techniques. Protein Sci. 2004;13:211–220. doi: 10.1110/ps.03381404.
  • 26. Chopra G., Summa C.M., Levitt M. Solvent dramatically affects protein structure refinement. Proc. Natl. Acad. Sci. USA. 2008;105:20239–20244. doi: 10.1073/pnas.0810818105.
  • 27. Chen J., Brooks C.L., III Can molecular dynamics simulations provide high-resolution refinement of protein structure? Proteins Struct. Funct. Bioinformat. 2007;67:922–930. doi: 10.1002/prot.21345.
  • 28. Ishitani R. Refinement of comparative models of protein structure by using multicanonical molecular dynamics simulations. Mol. Simul. 2008;34:327–336.
  • 29. Stumpff-Kane A.W., Feig M. A correlation-based method for the enhancement of scoring functions on funnel-shaped energy landscapes. Proteins Struct. Funct. Bioinformat. 2006;63:155–164. doi: 10.1002/prot.20853.
  • 30. Heuer A. Exploring the potential energy landscape of glass-forming systems: from inherent structures via metabasins to macroscopic transport. J. Phys. Condens. Matter. 2008;20:373101. doi: 10.1088/0953-8984/20/37/373101.
  • 31. Ueda Y., Taketomi H., Gō N. Studies on protein folding, unfolding, and fluctuations by computer simulation. II. A three-dimensional lattice model of lysozyme. Biopolymers. 1978;17:1531–1548.
  • 32. Clementi C., Nymeyer H., Onuchic J.N. Topological and energetic factors: what determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? An investigation for small globular proteins. J. Mol. Biol. 2000;298:937–953. doi: 10.1006/jmbi.2000.3693.
  • 33. Tozzini V. Coarse-grained models for proteins. Curr. Opin. Struct. Biol. 2005;15:144–150. doi: 10.1016/j.sbi.2005.02.005.
  • 34. Derrida B. Random-energy model: limit of a family of disordered models. Phys. Rev. Lett. 1980;45:79.
  • 35. Bryngelson J.D., Wolynes P.G. Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. USA. 1987;84:7524–7528. doi: 10.1073/pnas.84.21.7524.
  • 36. Gutin A.M., Shakhnovich E.I. Ground state of random copolymers and the discrete random energy model. J. Chem. Phys. 1993;98:8174.
  • 37. Case D.A., Cheatham T.E., 3rd, Woods R.J. The AMBER biomolecular simulation programs. J. Comput. Chem. 2005;26:1668–1688. doi: 10.1002/jcc.20290.
  • 38. Kollman P.A., Dixon R., Pohorille A. The development/application of a minimalist organic/biochemical molecular mechanic force field using a combination of ab initio calculations and experimental data. In: Wilkinson A., Weine P., van Gunsteren W.F., editors. Computer Simulation of Biomolecular Systems. Elsevier; New York: 1997. pp. 83–96.
  • 39. Onufriev A., Bashford D., Case D.A. Exploring protein native states and large-scale conformational changes with a modified generalized Born model. Proteins Struct. Funct. Bioinformat. 2004;55:383–394. doi: 10.1002/prot.20033.
  • 40. Reference deleted in proof.
  • 41. Shell M.S., Ritterson R., Dill K.A. A test on peptide stability of AMBER force fields with implicit solvation. J. Phys. Chem. B. 2008;112:6878–6886. doi: 10.1021/jp800282x.
  • 42. Lin E., Shell M.S. Convergence and heterogeneity in peptide folding with replica exchange molecular dynamics. J. Chem. Theory Comput. 2009;5:2062–2073. doi: 10.1021/ct900119n.
  • 43. Lyskov S., Gray J.J. The RosettaDock server for local protein-protein docking. Nucleic Acids Res. 2008;36(Web Server issue):W233–W238. doi: 10.1093/nar/gkn216.
  • 44. Summa C.M., Levitt M. Near-native structure refinement using in vacuo energy minimization. Proc. Natl. Acad. Sci. USA. 2007;104:3177–3182. doi: 10.1073/pnas.0611593104.

Articles from Biophysical Journal are provided here courtesy of The Biophysical Society