Proceedings of the National Academy of Sciences of the United States of America. 2006 Nov 9;103(47):17747–17752. doi: 10.1073/pnas.0605580103

Understanding ensemble protein folding at atomic detail

Isaac A Hubner, Eric J Deeds, Eugene I Shakhnovich
PMCID: PMC1635542  PMID: 17095606

Abstract

It has long been known that a protein's amino acid sequence dictates its native structure. However, despite significant recent advances, an ensemble description of how a protein achieves its native conformation from random coil under physiologically relevant conditions remains incomplete. Here we present a detailed all-atom model with a transferable potential that is capable of ab initio folding of entire protein domains using only sequence information. The computational efficiency of this model allows us to perform thousands of microsecond-time-scale folding simulations of the engrailed homeodomain and to observe thousands of complete independent folding events. We apply a graph-theoretic analysis to this massive data set to elucidate which intermediates and intermediary states are common to many trajectories and thus important for the folding process. This method provides an atomically detailed and complete picture of a folding pathway at the ensemble level. The approach that we describe is quite general and could be used to study the folding of proteins on time scales orders of magnitude longer than currently possible.

Keywords: engrailed homeodomain, folding pathway, graph theory, Monte Carlo, protein folding


One of the longest-standing goals of computational biophysics is the creation of a model for protein folding that is predictive of both protein structure and folding mechanism at the atomic level. Despite numerous advances in recent years (1–5), no method has yet been able to follow the folding of an entire protein domain from random coil to its native state over realistic time scales using an atomically detailed representation and a fully transferable potential. Significant progress has been made on the computational prediction of native-state structures for many proteins (6), but such algorithms do not use a single realistic representation of the polypeptide chain and its dynamics from start to finish and therefore cannot shed light on the mechanistic details of the folding process. For instance, the highly successful ROSETTA structure-prediction algorithm (6) uses a C-β representation of the protein for many of its early calculations and, to search conformational space efficiently, is based on dynamics that move entire segments of the chain to instantly adopt complete structural motifs drawn from a database.

Attempts to access the important features of the physical folding process using explicit approximations of physical forces and molecular dynamics simulation protocols have encountered significant barriers in computational efficiency (7). Individual simulations using such protocols are currently limited to tens to hundreds of nanoseconds, and although rare fast-folding events can be observed in such simulations (7), these methods cannot access folding events or intermediates that occur on longer time scales (7). Given that even the fastest-folding protein domains fold on the order of tens of microseconds, much of this work has focused on protein fragments or designed peptide sequences (7) that can fold on shorter time scales, and it has therefore not allowed for specific understanding or dissection of the folding pathways of entire protein domains. A major exception is the Villin headpiece (7–9), for which a single ≈1-μs trajectory was conducted in 1998. Recently, tens of thousands of short (≈25-ns) molecular dynamics (MD) trajectories for the Villin headpiece were conducted and combined into ≈500 μs of total simulated time (10), a feat that highlights the current state of the art in MD simulations. That work relies on the analysis of very many short trajectories, and it is unclear to what extent the details observed in those simulations fully characterize the ensemble of complete folding events characteristic of Villin. Others have applied the powerful tool of high-temperature (above physiological) unfolding simulations (2, 7); however, questions remain regarding how accurately the transition-state ensemble (TSE) and energy landscape are represented at elevated temperatures (7).

Methods based on Metropolis Monte Carlo (MC) have been proposed to overcome the time-scale limitations of MD. Several researchers have used Go models at atomic detail (11, 12) and, although much may be learned from such models, they are inherently limited by their dependence on knowledge of the native state and, potentially, by their lack of nonnative interactions (11, 12). Transferable “knowledge-based” potentials derived from atom–atom contact frequencies in protein-structure databases have been used with some success (13, 14) and are detailed in a number of recent reviews (1–4, 7). Although these methods have been applied to questions of structure prediction (13, 15, 16), the sampling techniques and potential design procedures used in many such studies have limited applicability to the ab initio simulation of ensemble protein-folding mechanisms.

In the present work, we describe a computational approach that overcomes the limitations outlined above and provides a detailed account of the entire folding process of an ensemble of protein molecules. This study is based on an all-atom model that can accurately fold multiple protein domains using a single potential that includes no knowledge of the native structure beyond the primary amino acid sequence (14). Using this model, we map the complete folding mechanism of the engrailed homeodomain (ENH), which is an excellent specimen for study because of the wealth of existing experimental data (17–23). We perform 4,000 independent ≈10-μs simulations (40 ms of total simulated time), each of which is initiated from a different random coil conformation and is carried out at a fixed temperature (≈25°C). Our simulations represent an increase in time scale of 2–3 orders of magnitude over comparable MD studies, both in terms of the simulation time of individual trajectories (micro- vs. nanosecond) and total simulated time (milli- vs. microsecond). This computational advance allows us to observe hundreds of successful and complete folding events and thus gain a comprehensive picture of the folding process.

The unprecedented length and number of simulations that we report raise an interesting question: how can such data be analyzed and synthesized into a coherent description of protein folding? Here we present a graph-theoretic clustering technique to overcome this challenge. Although this work builds on a long history of clustering in computational protein folding (14, 16, 24–28), we introduce the concept of trajectory “flux,” which allows for the combination of multiple folding-simulation pathways and the evaluation of intermediate and intermediary states in folding. The flexibility of this structural cluster analysis, combined with the detail of our simulations, allows us to provide a completely coherent and ordered look at the ensemble pathway that quantitatively and accurately predicts all experimental observations of ENH folding.

Results

Model and Simulations.

A realistic protein model begins with a realistic spatial representation of the polypeptide chain. Therefore, we model all heavy atoms and rotational degrees of freedom explicitly. To represent conformational dynamics and the physical interactions between these atoms in a way that avoids the time-scale limitations of the MD simulations discussed above, we apply the recently developed knowledge-based μ-potential combined with a hydrogen-bonding potential and propagate the simulations by MC dynamics (see Methods and refs. 14 and 29). It is important to note that the move set used in these simulations, although extensively validated (11, 12, 14, 30–32), is nonetheless an MC move set and thus lacks the temporal resolution to make exquisitely detailed statements regarding dynamics. We therefore confine our dynamical analysis to very coarse-grained features of the folding trajectories. Using this model, 4,000 independent folding simulations are initiated from different random coils at a single physiological temperature. Then, based on our energy function and graph-theoretic analysis (14), we objectively select the best 100 trajectories constituting complete folding events to the structurally cohesive global free energy minimum (i.e., the true native state of the present potential) for use in the analysis discussed below.

Structural Cluster Analysis.

The ability to collect a large ensemble of complete folding trajectories creates a need for a method to organize and analyze the data. To understand the complete folding pathway of ENH as an ensemble process, it is necessary to combine information from a large number of folding runs and determine the dynamic relation between structural states common to all trajectories. This fundamental problem has been approached from a number of different perspectives, some of which involve a means of clustering structures within and between trajectories (10, 14, 16, 24–28, 33). These methods were first developed by Brooks and colleagues (24, 26), who clustered conformations belonging to the same long folding trajectory of a short peptide. Conformational clustering has since been used extensively in studies of protein folding. Daggett and Li (27) introduced a pairwise distance matrix (usually based on the Cα rmsd between conformations) and used the features of that matrix to cluster structurally “close” conformations belonging to the same folding trajectory and thereby identify the transition-state ensemble. Jayachandran et al. (10) recently developed a method in which a structural clustering procedure is used to create a Markov State Model (MSM) of Villin headpiece folding. In contrast to refs. 24 and 27, but as in this work and refs. 14 and 25, conformations belonging to different trajectories were clustered in the MSM (10).

The clustering we use represents a graph theoretic analysis (Fig. 1), based on the concept of a “structural graph,” which is constructed by comparing the different conformations collected from each trajectory to one another using several complementary measures of structural distance to organize these data. The nodes in the graph represent individual structures, and edges are placed between these nodes based on a cutoff of structural similarity; if two conformations exhibit a structural distance that is less than the cutoff, they are connected with an edge. This form of graph may be based on a number of different similarity criteria depending on the structural properties in which one is interested. Unlike kinetic protein-folding networks (28, 33), this method does not rely on dynamic information to create edges and therefore represents a way of combining information from a large group of trajectories (something an individual trajectory graph cannot do, because there is inherently no dynamic relationship between independent runs).
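The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `build_structural_graph` and `connected_clusters` and the toy distance matrix are assumptions introduced here, and any of the structural distance measures could supply the matrix.

```python
from itertools import combinations

def build_structural_graph(distances, cutoff):
    """Adjacency sets: nodes i and j share an edge when distances[i][j] < cutoff."""
    n = len(distances)
    adjacency = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if distances[i][j] < cutoff:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

def connected_clusters(adjacency):
    """Split the graph into disjoint clusters (connected components), largest first."""
    unvisited = set(adjacency)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, stack = {seed}, [seed]
        while stack:  # depth-first walk over everything reachable from the seed
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    cluster.add(neighbor)
                    stack.append(neighbor)
        clusters.append(cluster)
    return sorted(clusters, key=len, reverse=True)

# Hypothetical symmetric distance matrix (in Å) for five conformations.
toy_distances = [
    [0.0, 1.0, 1.5, 8.0, 9.0],
    [1.0, 0.0, 1.2, 7.5, 8.5],
    [1.5, 1.2, 0.0, 7.0, 8.0],
    [8.0, 7.5, 7.0, 0.0, 2.0],
    [9.0, 8.5, 8.0, 2.0, 0.0],
]
graph = build_structural_graph(toy_distances, cutoff=2.5)
clusters = connected_clusters(graph)
```

Note that no edge in this sketch depends on the order in which conformations were visited, which is what lets structures from independent runs share a graph.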

Fig. 1.

The concept of a structural graph and flux (F). Each node (colored oval) represents a single protein conformation, and the edges (solid lines) represent connections based on structural similarity. The different colors indicate conformations from independent simulations, and the colored arrows indicate a dynamic relation (increasing step number) within each run. Each cluster is composed of multiple nodes from a single trajectory and/or from multiple trajectories. All trajectories are observed to converge to the GC (rightmost cluster), which contains the native conformations and has a flux of one. The smaller cluster with a flux of one corresponds to an obligate intermediate.

The nature of the network represented by the structural graph depends upon the cutoff (r) used in its construction. At the limit of very large r, every conformation will be connected, whereas at the limit of very small r, there will be no edges in the graph. At intermediate cutoffs, however, the structural graphs we create fragment into disjoint clusters of varying sizes, where a cluster is simply defined as a set of nodes that are connected by a path within the graph. As with many graphs (34, 35), this fragmentation process is described by a decrease in the size of the giant component (GC), or largest cluster, and in the case of these structural graphs, we observe a pronounced transition in the size of the GC over a fairly limited range of cutoffs (34, 35). At intermediate r, one can treat each individual cluster as a set of potentially interesting structural states (i.e., states more closely related to one another than to those in other clusters). The role of a given cluster in protein-folding kinetics can be determined by measuring the flux of that cluster (Fig. 1). The flux of a cluster is defined as the fraction of the total number of trajectories in the ensemble that exhibit at least one structure in that cluster. A cluster with a flux of one (obligatory microstate) will thus contain at least one structure from every trajectory in the ensemble, whereas a cluster with a flux of one-third contains structures from only one-third of the trajectories.
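The flux definition translates directly into code. The following is a minimal sketch under stated assumptions: clusters are given as sets of node indices, `trajectory_of` is a hypothetical lookup from node index to the run that produced it, and the nine-node example is invented for illustration.

```python
def cluster_flux(cluster, trajectory_of, n_trajectories):
    """Fraction of trajectories that contribute at least one structure to the cluster."""
    visiting = {trajectory_of[node] for node in cluster}
    return len(visiting) / n_trajectories

def giant_component(clusters):
    """The GC is simply the largest cluster at the chosen cutoff."""
    return max(clusters, key=len)

# Hypothetical example: nine conformations, three from each of three runs.
trajectory_of = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters = [{2, 4, 5, 8}, {0, 3, 6}, {1}, {7}]

fluxes = [cluster_flux(c, trajectory_of, n_trajectories=3) for c in clusters]
gc = giant_component(clusters)
```

Here the first two clusters each contain a structure from every run (flux of one), whereas each singleton is visited by only one of the three runs (flux of one-third).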

In addition to r, the results of structural cluster analysis rely upon the choice of distance metrics for structural similarity. We term these measures order parameters (OP), because they allow the monitoring of the structural progression of folding. In choosing a set of OP to study folding, it is useful to choose complementary measures that work at various levels of detail. The rmsd in Cα coordinates is a relatively high-resolution measure of the global topological similarity between two conformations, which is best suited to discriminating near native states. We also use distance rms (drms; see Eqs. 3 and 4, in Supporting Text, which is published on the PNAS web site), a metric that is somewhat more coarse-grained than rmsd, and that is well suited to comparing semiordered states with regions of large variability. To compare structures at various stages of nonspecific and specific collapse, we use the difference in radius of gyration (ΔRg; Eq. 1 in Supporting Text) between conformations. Each OP contributes a unique perspective on the folding process and, by analyzing the high-flux clusters obtained from each method, one can develop a complete picture of the folding process.
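As a rough illustration of two of these order parameters, the sketch below uses the standard textbook definitions of the radius of gyration (unweighted by mass) and of the distance rms over all atom pairs. These are assumptions: the exact forms used in this work are given in Supporting Text and may differ in detail. The Cα rmsd is omitted because it additionally requires an optimal superposition step.

```python
import math
from itertools import combinations

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their (unweighted) centroid."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in coords) / n)

def delta_rg(coords_a, coords_b):
    """Absolute difference in radius of gyration between two conformations."""
    return abs(radius_of_gyration(coords_a) - radius_of_gyration(coords_b))

def drms(coords_a, coords_b):
    """RMS difference between corresponding intramolecular pair distances."""
    pairs = list(combinations(range(len(coords_a)), 2))
    total = sum(
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))

# Hypothetical three-point "conformations": a triangle and the same triangle doubled in size.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
expanded = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in triangle]

rg_gap = delta_rg(triangle, expanded)
drms_gap = drms(triangle, expanded)
```

Because drms compares internal distances only, it needs no superposition and is unchanged by rigid translation or rotation of either structure.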

A folding trajectory should exhibit a number of intermediary states, the dynamic relationships between which define the folding mechanism. Clusters of conformations that are common to most or all of the trajectories (high flux) represent states that are obligatory in folding. This analysis thus allows for the completely objective identification of intermediate and intermediary states as high-flux structural clusters. As mentioned above, we treat the dynamics between these clusters in a coarse-grained way appropriate to the MC dynamics on which our simulations are based. The mean first passage time (MFPT) of trajectories to these clusters, as well as their duration (mean last exit time, or MLET), indicates the relative positions of these intermediates during folding, allowing for the organization of a complete and coherent picture of the folding process from structural cluster analysis. The concept of trajectory flux differentiates our approach from the clustering methods used previously (10, 14, 16, 24–28). Flux could readily be applied to several structural clustering protocols, especially ones like the Markov State Model (MSM), where conformations generated by different trajectories are clustered according to their structural similarity. As in this work, multiple structural measures can be applied in the construction of an MSM.
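A minimal sketch of how MFPT and MLET might be computed for one cluster is given below. It assumes that each trajectory has been reduced to a list of cluster labels, one per saved frame, and that the averages run only over trajectories that visit the cluster; both are assumptions of this illustration, as are the function name `passage_times` and the toy labels.

```python
def passage_times(trajectories, cluster_id, steps_per_frame=1_000_000):
    """Return (MFPT, MLET, flux) for one cluster, with times in MC steps.

    trajectories: one list of cluster labels per run, one label per saved frame.
    Only runs that visit the cluster contribute to the two averages.
    """
    first_entries, last_exits = [], []
    for labels in trajectories:
        frames = [i for i, label in enumerate(labels) if label == cluster_id]
        if frames:
            first_entries.append(frames[0] * steps_per_frame)
            last_exits.append(frames[-1] * steps_per_frame)
    if not first_entries:
        return None  # no run ever visits this cluster
    n_visiting = len(first_entries)
    mfpt = sum(first_entries) / n_visiting
    mlet = sum(last_exits) / n_visiting
    return mfpt, mlet, n_visiting / len(trajectories)

# Hypothetical labels: U = unfolded, C = collapsed, I = intermediate, N = native.
toy_runs = [
    ["U", "C", "C", "I", "N"],
    ["U", "U", "C", "N", "N"],
    ["U", "C", "I", "C", "N"],
]
collapsed = passage_times(toy_runs, "C", steps_per_frame=1)
intermediate = passage_times(toy_runs, "I", steps_per_frame=1)
```

In the third toy run the chain leaves the collapsed cluster and returns to it, which illustrates why the interval between MFPT and MLET can exceed the time actually spent in a cluster.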

Early Events and Intermediate States in Folding.

Laser T-jump experiments monitoring Trp fluorescence at 25°C reveal two-step folding kinetics with a fast phase of t1/2 ≈ 1.5 μs and a slower phase of t1/2 ≈ 15 μs (19). When Trp burial is monitored as a function of folding time in simulation, two distinct kinetic phases are also observed (Fig. 6, which is published as supporting information on the PNAS web site), although it is important to note that, in the absence of similar long folding trajectories for many proteins, it is unclear how generic this two-phase behavior is. These two relaxation time scales are separated by an order of magnitude and, when fit to experimental time, reveal a correspondence between simulated and experimental time of ≈10⁸ MC steps to 10 μs. Based on the distribution of folding times, we expect (and observe) a fraction of trajectories that do not fold in the simulated time. As in experiment (18, 19), the folding transition is cooperative, an interpretation supported by both simulated melting curves and energy distributions at different temperatures (Fig. 7, which is published as supporting information on the PNAS web site). The simulated melting curves indicate that our simulation temperature is ≈25°C (Fig. 8, which is published as supporting information on the PNAS web site). The question naturally follows: to what do these kinetic and thermodynamic ensemble measurements correspond, and how are the corresponding microstates dynamically related?
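The step-to-time correspondence quoted above amounts to a simple linear rescaling, sketched below. The linear mapping is an assumption of this illustration and should be read only as a coarse calibration, in keeping with the limited temporal resolution of MC dynamics; the constant and function names are introduced here for convenience.

```python
STEPS_PER_RUN = 10**8        # MC steps in one simulation
MICROSECONDS_PER_RUN = 10.0  # fitted correspondence: ~10^8 steps to ~10 microseconds
N_RUNS = 4000                # independent simulations in the ensemble

def steps_to_microseconds(steps):
    """Coarse linear conversion from MC steps to approximate physical time."""
    return steps * MICROSECONDS_PER_RUN / STEPS_PER_RUN

one_run_us = steps_to_microseconds(STEPS_PER_RUN)   # a full run
early_us = steps_to_microseconds(10**7)             # a tenth of a run
total_ms = N_RUNS * MICROSECONDS_PER_RUN / 1000.0   # ensemble total in milliseconds
```

Under this mapping a single MC step corresponds to roughly 0.1 ps, and 4,000 runs of ≈10 μs each give the 40 ms of total simulated time.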

The earliest MFPT to a flux 1 cluster, revealed by drms clustering of trajectories (Table 1 and Fig. 2), occurs at 8.4·10⁶ steps. This reveals a nonspecifically collapsed state, which represents rapid quenching of the nonequilibrium denatured state from which the simulation began. The next series of high-flux clusters, revealed through Rg clustering, have MFPT values of 14.0·10⁶, 18.2·10⁶, and 19.9·10⁶ steps, respectively (Table 1 and Fig. 2). Much of the phenomenology early in protein folding, before specific structure formation, may be conceptualized as that of a random heteropolymer: in such cases, Rg is more useful than rmsd or drms. The first two of these collapse phases are rapidly completed and correspond to globules of increasing compaction and helical content. The third collapse phase is more mechanistically complicated and spans specific collapse and rearrangement to near native states.

Table 1.

Summary of the structural and dynamic properties of clusters identified by different order parameters

| State | OP | MFPT | ⟨Step⟩ | MLET | Energy | Rg, Å |
|---|---|---|---|---|---|---|
| Native | rmsd | | 79.7 | | −754.28 | 10.03 |
| Near native | Rg | 70.1 | 73.4 | 78.3 | −610.88 | 9.76 |
| TSE | rmsd | | 57.4 | | −546.64 | 11.33 |
| Intermediate | drms | 28.8 | 44.7 | 56.3 | −521.36 | 15.33 |
| Collapse 3 | Rg | 19.9 | 63.5 | 99.6 | −570.16 | 10.39 |
| Collapse 2 | Rg | 18.2 | 38.1 | 69.8 | −445.19 | 11.14 |
| Collapse 1 | Rg | 14.0 | 32.4 | 45.1 | −451.09 | 11.49 |
| Initial relaxation | drms | 8.4 | 57.5 | 100 | −539.01 | 10.64 |

Values are averaged over the entire cluster. MFPT, average step (⟨Step⟩), and MLET are in 10⁶ MC steps.

Fig. 2.

The results of structural cluster analysis with different OP. Each bar represents a high-flux cluster with start and end values equal to its MFPT and MLET. Colors indicate flux values of 1, 0.9, 0.8, 0.7, 0.6, and 0.5 for purple, blue, green, yellow, orange, and red, respectively (clusters with F < 0.5 were not plotted). The results from each individual OP's clustering are plotted against a separate y axis, in each case representing the <Rg> of the specified cluster (with an aligned x axis representing the MC time step). Although clusters identified by different OP may overlap in time, a single structure is never found in more than one cluster of any single graph. Also, a single trajectory may move between clusters; the average dwell time is often less than the difference between MLET and MFPT. Nevertheless, the order of events in folding (as represented by clusters under different OP) may be determined through the clusters' MFPT. The nature of these events can also be described through factors such as their MLET and their structural and energetic characteristics.

Although drms clustering reveals no further mechanistic detail regarding compact states, it does reveal a cluster corresponding to an on-path obligate intermediate, which is virtually indistinguishable from the L16A mutant model (36), as well as from simulated thermal denaturing (unfolding) models (19) of the folding intermediate proposed by Fersht and coworkers (19, 36). The intermediate is characterized by an undocked first helix, helicity in the first turn region, and contact between the last two helices (Fig. 3). Because the first helix is the most stable in isolation (20), it is intuitive that it would be the most likely to undock while retaining helical content. This intermediate appears with an MFPT of 28.8·10⁶ and an MLET of 56.3·10⁶ steps, which falls within the collapsed phase identified by Rg and drms clustering. The aligned regions (residues 28–53) of the 25 NMR structures (PDB ID code 1ZTR; ref. 36) of this intermediate exhibit an average rmsd of 4.69 Å, with some structures as low as 2.06 Å from the simulated intermediate (Fig. 3). The intermediate structures have an average drms of 2.94 Å, with some as low as 1.57 Å, and if the 1ZTR structures are added to the trajectories for drms clustering analysis, every conformation groups with the simulated intermediate cluster. These similarities are striking, especially when considering that the intermediate, unlike the native state, has one of three helices lacking long-range NOE restraints, and that the comparison is between two models of the intermediate: in experiment, the NMR structure of the L16A mutant, and in simulation, a computational model that includes a wider range of dynamically related structures from different folding trajectories.

Fig. 3.

The ENH folding intermediate. (a and b) The mutational model of the ENH folding intermediate (a) (36) is indistinguishable from the simulated intermediate (b) identified by structural cluster analysis. (c) A representative superposition of one experimental and simulated model. As in experiment, the N terminus is largely helical but lacks long-range order. The aligned (ordered) regions in the mutational and computational models span residues 28–53, corresponding to the red and green colored regions. The peptide chain is colored from N (blue) to C (red) terminus.

Transition State.

Because the rmsd clustering giant component (GC) represents the energy landscape surrounding the native basin of attraction (14), crossing the transition barrier is analogous to entering the GC. Therefore, we hypothesize that conformations dynamically preceding entry into the GC in each trajectory, as well as conformations in the GC with very low connectivity (k = 1), are enriched in TSE conformations. However, this subset of structures is not presented as the TSE. Rather, this putative TSE is subjected to pfold analysis (37) to identify conformations that are true TS (see Supporting Text). Graph properties enrich the sampling of TSE conformations from 1% (in a random subset) to 7.3% (in the pre-GC and k = 1 subset). The simulated (31) and experimental φ values (Tables 2 and 3, which are published as supporting information on the PNAS web site) correlate with R = 0.70 (P value, 0.015); we do not observe a statistically significant correlation between experimental φ and simulated φ from the control ensemble of structures as compact as the TSE.
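The pfold idea can be illustrated with a generic estimator: launch many short independent runs from a candidate conformation and count the fraction that commit to the folded state before unfolding; a true transition state gives a value near 0.5. The sketch below is not the protocol of this work (see Supporting Text for that). The estimator signature, the committing criteria, and the one-dimensional random walk standing in for protein dynamics are all assumptions made for illustration.

```python
import random

def estimate_pfold(start, step, is_folded, is_unfolded,
                   n_runs=2000, max_steps=10_000, seed=0):
    """Fraction of short runs from `start` that reach the folded state first.

    Runs that reach neither state within max_steps are counted as not folded.
    """
    rng = random.Random(seed)
    folded = 0
    for _ in range(n_runs):
        state = start
        for _ in range(max_steps):
            if is_folded(state):
                folded += 1
                break
            if is_unfolded(state):
                break
            state = step(state, rng)
    return folded / n_runs

# Toy stand-in for folding dynamics: an unbiased walk on the integers 0..10,
# where 0 plays the role of the unfolded state and 10 the folded state.
def walk(state, rng):
    return state + rng.choice((-1, 1))

def reached_folded(state):
    return state >= 10

def reached_unfolded(state):
    return state <= 0

p_mid = estimate_pfold(5, walk, reached_folded, reached_unfolded)   # exact value 0.5
p_late = estimate_pfold(9, walk, reached_folded, reached_unfolded)  # exact value 0.9
```

In this toy model only the midpoint behaves like a transition state; a start near the folded boundary commits almost every time.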

In the native state, A43 points outward to solvent at the first turn of the third helix. But, as we see from the model of the TSE, this region is disordered and free to make contacts with a number of neighboring aliphatic residues. A43 makes a number of nonnative contacts with residues such as L40 and I45 (which also have high φ), resulting in an attractive energetic contribution in the TSE that is 49% greater than in the native state. Experimental data in which A43 exhibits φ > 1 further reinforce this interpretation. The model TSE has ≈13% more solvent-accessible surface area (SASA) than native, which corresponds well with the experimental observation of ≈17% (21). The high experimental and simulated φ values, an average step number of 57.4·10⁶, and the small increase in SASA are suggestive of a late TSE.

Native State.

This physical simulation of folding arrives at a top E-k prediction (see Methods) of 2.44-Å Cα rmsd from the crystal structure (Fig. 4 Left). Structural studies have noted that the buried aromatic side chains corresponding to F8, F20, W48, and W49 form a highly conserved cluster in which each residue contacts at least one of the others, which is a central feature of the homeodomain fold (17). Fig. 4 Upper Right illustrates the conserved aromatic core packing in an E-k predicted structure. The extensive network of salt bridges on the surface of helices 1 and 2 centered around residues R15, E19, R30, and E37 is also a defining structural feature of the homeodomain fold (17). Many E-k predictions (Fig. 4 Lower Right) exhibit strong salt bridges between intrahelix residues R15:E19 and interhelix residues R15:E37 and E19:R30. The simulated helical content is within 2% of experimentally observed values (38), a difference attributable to native-state fluctuations. Taken together, these results indicate that the model is representative of physical interactions and predictive at the level of backbone and side-chain atomic coordinates.

Fig. 4.

ENH structure prediction. (Left) The superimposed native and top E-k prediction colored from N (blue) to C (red) termini (only backbone shown). (Upper Right) Predicted side-chain packing for core aromatic residues. (Lower Right) Predicted side-chain packing for surface salt bridges (native in black and predicted in CPK color).

Conclusions

Structural cluster analysis provides a clear, ordered picture of ensemble protein folding (Fig. 5), detailed to the level of atomic coordinates. This model accurately and quantitatively predicts all experimental observables, as well as the TSE structure, the structural assignment of the first kinetic phase, the dynamic placement and experimentally verified structural details of the intermediate, the complete folding mechanism, and the structure of all experimentally inaccessible microstates. The atomically detailed and time-resolved analysis presented in this work synthesizes many theoretical aspects of protein-folding mechanisms that have emerged from studying coarse-grained models of protein folding, such as the diffusion-collision (39), nucleation-condensation (40, 41), and hydrophobic-collapse (42) mechanisms, into a coherent, physically meaningful, and complete scenario of protein folding.

Fig. 5.

Folding from a denatured (D) state, which rapidly undergoes nonspecific collapse (C). There are several C states, characterized by increasing compaction and helical content (see Fig. 9, which is published as supporting information on the PNAS web site, for a plot of the OPs as a function of time). After the protein becomes sufficiently helical, the chain extends through fluctuations to an expanded intermediate (I) state, which allows rearrangement of the helices and is followed by the TS. A final collapse to a near-native (NN) state ensues, which proceeds through specific side-chain packing and energetic relaxation to the native (N) state. C1, C2, C3, and I may undergo rapid conversion (as indicated by overlap in Fig. 3). The sequence of events in this representative trajectory is identical to the ordering of events in structural cluster analysis of the ensemble of folding trajectories.

Helix formation occurs upon compaction (stages C1–3; Fig. 5) only when stabilized by intraprotein interactions and screened from the solvent. This finding is physically intuitive, because isolated ENH helices, although more stable than isolated helices from other proteins (18), are still only marginally (13–33%) helical (18, 21). However, the pathway leading to the intermediate and TS is kinetically facilitated by helix formation, in the spirit of framework mechanisms. The rate-limiting step in each trajectory is crossing the TS, which involves the formation of a specific set of contacts (some of which are nonnative) among particular residues, in accordance with the nucleation mechanism proposed by Abkevich et al. (40). Although certain structural features of this folding pathway have been predicted through unfolding simulations (19), the complete ordering of the pathway described above and diagrammed in Fig. 5 highlights that our approach accesses the process of folding at a much higher level of detail.

The unprecedented computational efficiency and accuracy of this model allow for thousands of all-atom, physiological-temperature, effectively microsecond-time-scale folding simulations of an entire protein domain from random coil to the native state. The combination of this computational model and structural cluster analysis presents a generalized method for organizing structural data from ensembles of simulations and making objective assignments of intermediary, intermediate, transition, and native states. Although this method has been applied to a single fast-folding protein domain in this work, the model and the principles of structural cluster analysis could be easily extended to study the folding of other proteins, especially smaller or larger proteins that fold on even longer time scales. If one were to look for rare folding events in ensembles with ≈40 ms of total simulation time [i.e., use our model in the framework of Pande and coworkers (7)], it should be possible to access millisecond folding events, a feat far beyond the capacity of current folding simulations.

Methods

Protein Model.

The model (11, 12, 14, 29) represents all nonhydrogen atoms as impenetrable hard spheres. The move set, which includes global and localized backbone moves and side-chain torsions while maintaining bond length, connectivity, and excluded volume, has been described in detail and has been shown to behave ergodically and satisfy detailed balance (11, 12). We apply a transferable knowledge-based two-body contact potential that has also been used to introduce physically realistic interactions in pfold simulations of SH3 domains (32) and to fold seven different protein domains from random coil (14, 29). A backbone hydrogen bonding function is also included to ensure proper secondary structure formation. The model and potential are detailed in a recent work (14) and in Supporting Text.

Simulation Protocol.

The engrailed homeodomain (1ENH) was chosen for its small size and fast folding, which allow for a large ensemble of independent full-folding simulations, and for the abundance of detailed experimental data available for comparison. Reported rmsd values are calculated by using residues F8-I56, which correspond to the helical and turn regions of the protein. Excluded residues correspond to crystallographically disordered regions at the termini (17). This fragment is experimentally shown to be structurally and thermodynamically similar to the full-length domain (17). Simulations were initiated by using atomic coordinates generated from primary sequence using Swiss PDB viewer's “Load Raw Sequence to Model” function. This structure was unfolded for 10⁶ random steps to create a unique fully unfolded starting conformation for each run of folding simulations. Each starting conformation was subsequently propagated at T = 1.75 for 10⁸ steps using a Metropolis MC procedure based on a micropotential energy function for atom–atom contacts and a geometrical hydrogen bonding function, as described (14). The move set and potential parameters were identical to those used in previous work (14) and included local backbone moves, global backbone moves, and side-chain rotational angle moves. This protocol was applied to create 4,000 independent simulations. Thus, each trajectory was started from a different unique random coil (which did not exhibit high structural similarity in clustering or rmsd to any other conformation in any simulation). Because of the expected distribution of folding rates around the t₁/₂ of 15 μs (19), not all trajectories folded in the simulated time. Although this fact might create the appearance of misfolded trajectories (and imply gross errors in the potential), that is not the case. This conclusion is strongly supported by the observation that energy-based criteria (E-k, discussed below) consistently and objectively identify the most native structures and best folding trajectories. Critically, structures and trajectories are selected based on the global free energy and graph properties.
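The sampling scheme described above can be illustrated with a minimal sketch of a Metropolis MC loop that saves snapshots at a fixed interval. This is not the authors' code: the one-dimensional "conformation", the quadratic `toy_energy`, the `toy_propose` move, and the step counts are all hypothetical stand-ins for the all-atom move set and potential of ref. 14. Only the standard Metropolis acceptance rule and the temperature T = 1.75 are taken as given.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Standard Metropolis rule: always accept downhill moves,
    accept uphill moves with probability exp(-delta_e / T)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def run_metropolis(energy, propose, state, n_steps, temperature, save_every, seed=0):
    """Propagate `state` for n_steps and save one snapshot per `save_every` steps.

    The starting state is saved as step 0, so the run yields
    n_steps // save_every + 1 snapshots.
    """
    rng = random.Random(seed)
    e = energy(state)
    snapshots = [(0, state, e)]
    for step in range(1, n_steps + 1):
        trial = propose(state, rng)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, temperature, rng):
            state, e = trial, e_trial
        if step % save_every == 0:
            snapshots.append((step, state, e))
    return snapshots


# Hypothetical toy system (assumption): a single coordinate in a quadratic well.
def toy_energy(x):
    return x * x


def toy_propose(x, rng):
    return x + rng.uniform(-0.5, 0.5)


# Scaled-down run: 10,000 steps saved every 100 steps, mirroring the
# ratio of 10^8 steps saved once per 10^6 steps.
snaps = run_metropolis(toy_energy, toy_propose, 5.0, 10_000, 1.75, 100)
print(len(snaps))
```

With the starting conformation counted as the first snapshot, a run saved once per 1/100 of its length yields 101 snapshots, which matches the 101 conformations per trajectory used in the graph analysis.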

Graph-Theoretical Analysis.

Each graph is constructed from the set of pairwise OP distances between conformations by considering each structure as a node and connecting any two nodes that exhibit a distance less than a particular cutoff (r). The disjoint clusters in this graph are defined as the sets of nodes within which a path exists in the graph between every pair of nodes. At any value of r, the GC is defined as the largest cluster. Structure predictions are made by using k (the “degree” or number of neighbors) values from an rmsd graph of minimum-energy structures from each of the 4,000 independent folding runs, where the graph is drawn at the r at which the GC contains half of the structures (the midpoint of the transition). The most connected (highest k) node is then used as the structure prediction. As previously demonstrated, this E-k criterion is superior to energy alone at identifying the native fold (14). Using this criterion, and based on the minimum-energy conformation of each run, we objectively select the 100 trajectories that encountered conformations with the lowest energy and highest k. These 100 objectively selected simulations represent complete folding events to the cohesive global free-energy minimum. One hundred one conformations from each run are collected (one per 10⁶ steps), and the set of 10,100 is clustered to construct a structural folding graph. Graph analysis to identify intermediates is performed by using Rg, drms, or rmsd as OP (see Eq. 2 in Supporting Text) and analyzing the flux results for each cluster at varying values of r.
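The graph construction described above can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: the integer `values` stand in for a one-dimensional order parameter, the function names are hypothetical, and the midpoint is taken to be the smallest scanned cutoff at which the GC holds at least half of the nodes. The energy part of the E-k criterion is omitted; only the degree k is computed.

```python
from collections import deque


def build_graph(dist, r):
    """Adjacency lists: nodes i and j are connected when dist[i][j] < r."""
    n = len(dist)
    return [[j for j in range(n) if j != i and dist[i][j] < r] for i in range(n)]


def clusters(adj):
    """Disjoint clusters (connected components), largest first, via breadth-first search."""
    seen = [False] * len(adj)
    found = []
    for start in range(len(adj)):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        found.append(component)
    return sorted(found, key=len, reverse=True)


def midpoint_cutoff(dist, cutoffs):
    """Smallest scanned r at which the giant component holds at least half the nodes."""
    n = len(dist)
    for r in sorted(cutoffs):
        giant = clusters(build_graph(dist, r))[0]
        if 2 * len(giant) >= n:
            return r
    return None


def most_connected(dist, r):
    """Node of highest degree k within the giant component at cutoff r."""
    adj = build_graph(dist, r)
    giant = clusters(adj)[0]
    return max(giant, key=lambda i: len(adj[i]))


# Hypothetical toy data (assumption): five similar structures and five outliers.
values = [10, 11, 12, 13, 14, 30, 50, 70, 90, 110]
dist = [[abs(a - b) for b in values] for a in values]

r_mid = midpoint_cutoff(dist, [1, 3, 5])
print(r_mid, most_connected(dist, r_mid))
```

In this toy set the five closely spaced structures merge into the GC at the middle cutoff, and the central one has the most neighbors, so it would be the node chosen as the structure prediction.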

Supplementary Material

Supporting Information

Acknowledgments

We thank Alan Fersht and Peter Kutchukian for stimulating discussion. I.A.H. and E.J.D. were supported by the Howard Hughes Medical Institute. This research is supported by National Institutes of Health Grant GM52126.

Abbreviations

MD, molecular dynamics
ENH, engrailed homeodomain
MC, Monte Carlo
GC, giant component
OP, order parameter
drms, distance rms
MFPT, mean first passage time
MLET, mean last exit time
TSE, transition state ensemble

Footnotes

The authors declare no conflict of interest.

This article is a PNAS direct submission.

References

  • 1. Mirny L, Shakhnovich E. Annu Rev Biophys Biomol Struct. 2001;30:361–396. doi: 10.1146/annurev.biophys.30.1.361.
  • 2. Daggett V, Fersht A. Nat Rev Mol Cell Biol. 2003;4:497–502. doi: 10.1038/nrm1126.
  • 3. Gnanakaran S, Nymeyer H, Portman J, Sanbonmatsu KY, Garcia AE. Curr Opin Struct Biol. 2003;13:168–174. doi: 10.1016/s0959-440x(03)00040-x.
  • 4. Shakhnovich EI. Chem Rev. 2006;106:1559–1588. doi: 10.1021/cr040425u.
  • 5. Thirumalai D, Hyeon C. Biochemistry. 2005;44:4957–4970. doi: 10.1021/bi047314+.
  • 6. Bradley P, Misura KM, Baker D. Science. 2005;309:1868–1871. doi: 10.1126/science.1113801.
  • 7. Snow CD, Sorin EJ, Rhee YM, Pande VS. Annu Rev Biophys Biomol Struct. 2005;34:43–69. doi: 10.1146/annurev.biophys.34.040204.144447.
  • 8. Duan Y, Kollman PA. Science. 1998;282:740–744. doi: 10.1126/science.282.5389.740.
  • 9. Zagrovic B, Snow CD, Shirts MR, Pande VS. J Mol Biol. 2002;323:927–937. doi: 10.1016/s0022-2836(02)00997-x.
  • 10. Jayachandran G, Vishal V, Pande VS. J Chem Phys. 2006;124:164902. doi: 10.1063/1.2186317.
  • 11. Shimada J, Kussell EL, Shakhnovich EI. J Mol Biol. 2001;308:79–95. doi: 10.1006/jmbi.2001.4586.
  • 12. Shimada J, Shakhnovich EI. Proc Natl Acad Sci USA. 2002;99:11175–11180. doi: 10.1073/pnas.162268099.
  • 13. Papoian GA, Ulander J, Eastwood MP, Luthey-Schulten Z, Wolynes PG. Proc Natl Acad Sci USA. 2004;101:3352–3357. doi: 10.1073/pnas.0307851100.
  • 14. Hubner IA, Deeds EJ, Shakhnovich EI. Proc Natl Acad Sci USA. 2005;102:18914–18919. doi: 10.1073/pnas.0502181102.
  • 15. Liwo A, Khalili M, Scheraga HA. Proc Natl Acad Sci USA. 2005;102:2362–2367. doi: 10.1073/pnas.0408885102.
  • 16. Herges T, Wenzel W. Biophys J. 2004;87:3100–3109. doi: 10.1529/biophysj.104.040071.
  • 17. Clarke ND, Kissinger CR, Desjarlais J, Gilliland GL, Pabo CO. Protein Sci. 1994;3:1779–1787. doi: 10.1002/pro.5560031018.
  • 18. Mayor U, Johnson CM, Daggett V, Fersht AR. Proc Natl Acad Sci USA. 2000;97:13518–13522. doi: 10.1073/pnas.250473497.
  • 19. Mayor U, Guydosh NR, Johnson CM, Grossmann JG, Sato S, Jas GS, Freund SM, Alonso DO, Daggett V, Fersht AR. Nature. 2003;421:863–867. doi: 10.1038/nature01428.
  • 20. Mayor U, Grossmann JG, Foster NW, Freund SM, Fersht AR. J Mol Biol. 2003;333:977–991. doi: 10.1016/j.jmb.2003.08.062.
  • 21. Gianni S, Guydosh NR, Khan F, Caldas TD, Mayor U, White GW, DeMarco ML, Daggett V, Fersht AR. Proc Natl Acad Sci USA. 2003;100:13286–13291. doi: 10.1073/pnas.1835776100.
  • 22. Stollar EJ, Mayor U, Lovell SC, Federici L, Freund SM, Fersht AR, Luisi BF. J Biol Chem. 2003;278:43699–43708. doi: 10.1074/jbc.M308029200.
  • 23. White GW, Gianni S, Grossmann JG, Jemth P, Fersht AR, Daggett V. J Mol Biol. 2005;350:757–775. doi: 10.1016/j.jmb.2005.05.005.
  • 24. Karpen ME, Tobias DJ, Brooks CL 3rd. Biochemistry. 1993;32:412–420. doi: 10.1021/bi00053a005.
  • 25. Shortle D, Simons KT, Baker D. Proc Natl Acad Sci USA. 1998;95:11158–11162. doi: 10.1073/pnas.95.19.11158.
  • 26. Boczko EM, Brooks CL 3rd. Science. 1995;269:393–396. doi: 10.1126/science.7618103.
  • 27. Li A, Daggett V. Proc Natl Acad Sci USA. 1994;91:10430–10434. doi: 10.1073/pnas.91.22.10430.
  • 28. Rao F, Caflisch A. J Mol Biol. 2004;342:299–306. doi: 10.1016/j.jmb.2004.06.063.
  • 29. Hubner IA, Shakhnovich EI. Phys Rev E. 2005;72:022901. doi: 10.1103/PhysRevE.72.022901.
  • 30. Kussell E, Shimada J, Shakhnovich EI. Proc Natl Acad Sci USA. 2002;99:5343–5348. doi: 10.1073/pnas.072665799.
  • 31. Hubner IA, Shimada J, Shakhnovich EI. J Mol Biol. 2004;336:745–761. doi: 10.1016/j.jmb.2003.12.032.
  • 32. Hubner IA, Edmonds KA, Shakhnovich EI. J Mol Biol. 2005;349:424–434. doi: 10.1016/j.jmb.2005.03.050.
  • 33. Andrec M, Felts AK, Gallicchio E, Levy RM. Proc Natl Acad Sci USA. 2005;102:6801–6806. doi: 10.1073/pnas.0408970102.
  • 34. Dokholyan NV, Shakhnovich B, Shakhnovich EI. Proc Natl Acad Sci USA. 2002;99:14132–14136. doi: 10.1073/pnas.202497999.
  • 35. Albert R, Barabasi A-L. Rev Mod Phys. 2002;74:47–97.
  • 36. Religa TL, Markson JS, Mayor U, Freund SM, Fersht AR. Nature. 2005;437:1053–1056. doi: 10.1038/nature04054.
  • 37. Du R, Pande VS, Grosberg A, Tanaka T, Shakhnovich EI. J Chem Phys. 1998;108:334–350.
  • 38. Kabsch W, Sander C. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
  • 39. Karplus M, Weaver DL. Protein Sci. 1994;3:650–668. doi: 10.1002/pro.5560030413.
  • 40. Abkevich VI, Gutin AM, Shakhnovich EI. Biochemistry. 1994;33:10026–10036. doi: 10.1021/bi00199a029.
  • 41. Fersht AR. Proc Natl Acad Sci USA. 1995;92:10869–10873. doi: 10.1073/pnas.92.24.10869.
  • 42. Sali A, Shakhnovich E, Karplus M. Nature. 1994;369:248–251. doi: 10.1038/369248a0.

Supplementary Materials

Supporting Information
pnas_0605580103_1.pdf (25KB, pdf)
pnas_0605580103_2.pdf (54.1KB, pdf)
pnas_0605580103_3.pdf (30.3KB, pdf)
pnas_0605580103_4.pdf (34KB, pdf)

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
