Biophysical Journal. 2010 Jun 2;98(11):2671–2681. doi: 10.1016/j.bpj.2010.02.048

A Comprehensive Multidimensional-Embedded, One-Dimensional Reaction Coordinate for Protein Unfolding/Folding

Rudesh D Toofanny, Amanda L Jonsson, Valerie Daggett
PMCID: PMC2877366  PMID: 20513412

Abstract

The goal of the Dynameomics project is to perform, store, and analyze molecular dynamics simulations of representative proteins, of all known globular folds, in their native state and along their unfolding pathways. To analyze unfolding simulations, the location of the protein along the unfolding reaction coordinate (RXN) must be determined. Properties such as the fraction of native contacts and radius of gyration are often used; however, these properties are degenerate, in that native and nonnative species can overlap. Here, we used 15 physical properties of the protein to construct a multidimensional-embedded, one-dimensional RXN coordinate that faithfully captures the complex nature of unfolding. The unfolding RXN coordinates for 188 proteins (1534 simulations and 22.9 μs in explicit water) were calculated. Native, transition, intermediate, and denatured states were readily identified with the use of this RXN coordinate. A global native ensemble based on the native-state properties of the 188 proteins was created. This ensemble was shown to be effective for calculating RXN coordinates for folds outside the initial 188 targets. These RXN coordinates enable high-throughput assignment of conformational states, which represents an important step in comparing protein properties across fold space as well as characterizing the unfolding of individual proteins.

Introduction

As with any reaction, protein folding/unfolding progresses through multiple (e.g., native, transition, intermediate, and denatured) states. The process by which the protein travels from the unfolded state to the native state is one of the most important problems in molecular biology. How does a seemingly disordered chain of amino acids self-associate to form its three-dimensional (3D) native structure? Protein folding and unfolding molecular dynamics (MD) simulations are becoming quite common, as there is currently no experimental technique that can reveal the level of atomic detail and dynamics modeled in such simulations. Due to timescale limitations (microseconds or less), however, the study of protein folding by MD simulations is limited to small proteins. Fully solvated, thermal denaturation simulations offer an attractive alternative because they describe the unfolding pathway of a protein, occur on a computationally reasonable timescale (nanoseconds), and have been shown to be reversible (1,2), that is, the unfolding pathway mimics the folding pathway in reverse.

A major difficulty in analyzing unfolding and folding simulations is determining the position of the protein along the reaction coordinate (RXN). RXN coordinates are often calculated using one or two properties. The radius of gyration, Cα root mean-squared deviation (RMSD), and fraction of native contacts (Q) are frequently used (3–6); however, these properties can lead to degeneracy. For example, compact unfolded structures with a low radius of gyration confound the use of such global metrics to track progress (Fig. 1). Furthermore, two structures may have the same number of contacts, but one can trade native for nonnative contacts during unfolding.

Figure 1. Degeneracy in the use of single properties to describe RXNs. Structures taken from an unfolding trajectory of Fyn SH3 (one at 57 ps and one at 27,961 ps) both report a Cα radius of gyration of 11.1 Å. Using this metric, the separation between native and nonnative values is not always distinguishable.

Consequently, describing the folding/unfolding reaction by means of individual global properties can be misleading and may not reveal the full complexity of the process. To address this issue, we developed a simple but robust approach to more faithfully represent the folding/unfolding process. In previous work (7–9), we developed a property space-based description of the unfolding process composed of analytical and physical properties derived from the time-dependent Cartesian coordinates of the protein. Boczko and Brooks (3) developed a similar metric for protein A. The general idea is to filter the data from MD simulations and reduce the information to a set of properties that capture the important features of the unfolding process while creating a simple but informative multidimensional-embedded, one-dimensional (1D) RXN coordinate. Here, we generalize and apply our method to 1534 simulations of 188 proteins (22.9 μs in explicit water) from our Dynameomics database (10–12). These 188 proteins represent 181 structurally different metafolds, which in turn represent ∼67% of all structures (10) in the Protein Data Bank (PDB).

To develop a more comprehensive property-space description of unfolding, we selected various properties, such as the helical and β-structure content, core RMSD, and solvent-accessible surface area (SASA), for the particular protein system being investigated. In investigating the unfolding of the engrailed homeodomain (EnHD; PDB ID: 1enh), we considered 32 properties and found that 10 of these properties were sufficient (9). Our application of the property-space method to this much larger set of 188 proteins revealed that 15 of the 32 general properties capture the main features of unfolding for this structurally diverse set of proteins. The 1D RXN coordinate we define is derived from these 15 properties.

To validate our RXN coordinate, we investigated a number of proteins whose transition-state (TS) ensembles have been identified through protein engineering and Φ-value techniques (13). In the past 15 years, our laboratory has validated MD simulations of protein unfolding by predicting experimental Φ-values using MD-derived TS structures identified through a conformational clustering technique employing multidimensional scaling (MDS) (14–20). Here, we compare our new RXN coordinate method with our MDS representation of the all-versus-all Cα RMSD matrix, in which structures along the unfolding trajectory are compared with each other. Although the MDS method generally works well, it is subjective, and it can be difficult to pick the TS ensemble when the protein unfolds rapidly or the denatured state remains very compact. We used both methods to identify TS ensembles for these 188 proteins and compared our MD-derived TS ensemble using the two different methods with experimental Φ-values for five of these proteins for which Φ-values are available (FKBP12, α-spectrin SH3, Fyn SH3 domain, engrailed homeodomain, and immunity protein 7 (Im7)). Having validated both methods for TS ensemble assignment, we proceeded to assign TS ensembles for the unfolding simulations in our Dynameomics database (10–12) in a high-throughput fashion.

We also performed a principal component analysis (PCA) on our multiproperty description of the unfolding process, which is another way to filter high-dimensional data into a simple and descriptive 2D or 3D representation (7). These representations of the underlying property space data reveal clusters of time points with similar overall properties. Of importance, both methods provide a means of comparing multiple trajectories and allow important unfolding species along the RXN coordinate to be identified, which ultimately is the purpose of a good RXN coordinate (21). The unfolding trajectories of different folds can also be compared with one another. Given that the 188 Dynameomics proteins used in this study represent 67% of the known globular proteins (10), we also tested whether the native-state properties of these proteins could be used to create a global, general native ensemble that could serve as a reference for other folds for a general RXN coordinate for protein folding.

Materials and Methods

MD simulations

The MD simulations were performed using in lucem molecular mechanics (ilmm) (22) following the Dynameomics protocol described by Beck et al. (10). Each of the 188 proteins (see Table 1 of Beck et al. (10)) was subjected to at least one native-state 298 K simulation of at least 31 ns, and five to eight simulations at 498 K, with two of these simulations being at least 31 ns long. Structures were saved every 0.2 ps for the shorter runs and every 1 ps for the longer simulations. We considered a standard set of 32 analyses for each simulation. All calculations were carried out on simulation data (coordinates and results of analyses) and stored in our Dynameomics database (http://www.dynameomics.org).

Property-space analysis

Fifteen of the 32 analytical properties calculated as standard for all Dynameomics targets (10) were included in the property space: native contacts, nonnative contacts, radius of gyration, end-to-end distance, main-chain SASA, side-chain SASA, polar SASA, nonpolar SASA, main-chain polar SASA, main-chain nonpolar SASA, side-chain polar SASA, side-chain nonpolar SASA, total SASA, fraction of helix, and fraction of β. SASA was calculated according to the algorithm of Lee and Richards (23) with a probe radius of 1.4 Å. The 15 properties were chosen so that there would be minimal comparison to a reference structure (e.g., Cα RMSD and CONGENEAL scores (24) were discarded because they are relative to the starting structure), and protein-wide descriptors were preferred over residue-wide (e.g., φ and ψ angles) or atomic-level descriptors. It was necessary for the properties to be general enough to be applicable to all the protein metafolds within the 188 proteins so that the general features of the RXN coordinate could be determined.

The 15 properties were compiled from each simulation of a particular protein. Each of the properties was then normalized by the range of values across all of the protein's simulations. A table was created and stored in the Dynameomics database (11) for each protein, its associated simulations, and its 15 normalized properties.

We calculated the distance in property space (dprop) between two structures (i and j) using the data in the aforementioned table, in a C# stored procedure we call the Comparator, by the following equation:

$$d_{prop} = \sqrt{\frac{(a_i - a_j)^2 + (b_i - b_j)^2 + \ldots + (N_i - N_j)^2}{N_p}} \qquad (1)$$

where Np is the number of properties, and a, b, etc. are different normalized properties.

For every structure in the unfolding simulations, we calculated dprop to every structure in the native reference (the average dprop is the mean distance to reference). This calculation was repeated for each structure in the 498 K simulation. A histogram was calculated from all of the mean distances, resulting in a 1D RXN coordinate.
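The mean-distance calculation can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' C# Comparator: it assumes Eq. 1 is the root-mean-square difference over the Np range-normalized properties, the function names are hypothetical, and the toy data are synthetic (three properties instead of 15).

```python
import numpy as np

def normalize_by_range(props):
    """Scale each property column to [0, 1] by its range over all frames."""
    lo = props.min(axis=0)
    span = props.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant columns
    return (props - lo) / span

def mean_distance_to_reference(frames, reference):
    """Mean d_prop from each frame to every reference frame (assumed form of Eq. 1)."""
    n_props = frames.shape[1]
    # Pairwise differences with shape (n_frames, n_reference, n_props)
    diff = frames[:, None, :] - reference[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2) / n_props)
    return d.mean(axis=1)

# Synthetic toy data: a tight "native" cloud and a drifting "unfolding" run
rng = np.random.default_rng(0)
native = rng.normal(0.0, 0.05, size=(200, 3))
unfold = rng.normal(0.0, 0.05, size=(300, 3)) + np.linspace(0, 1, 300)[:, None]

# Normalize over the pooled native and unfolding frames of the same protein
pooled = normalize_by_range(np.vstack([native, unfold]))
nat_n, unf_n = pooled[:200], pooled[200:]

rxn_native = mean_distance_to_reference(nat_n, nat_n)
rxn_unfold = mean_distance_to_reference(unf_n, nat_n)

# The histogram of mean distances is the 1D RXN coordinate
counts, edges = np.histogram(rxn_unfold, bins=50)
```

The all-pairs difference array grows as frames × reference frames × properties, so a real trajectory would likely need to be processed in chunks.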

TS assignments from MDS of conformational clustering

TS ensembles were identified from unfolding simulations through conformational clustering (25,26). For each unfolding simulation, the Cα-RMSD was calculated between all structures in the first 2 ns (2000 structures per simulation). Next, MDS of the resulting matrix was performed to reduce the data to three dimensions. We visually inspected the 3D representation to identify the exit from the first (and presumably native-like) cluster. The exit from the first cluster and the preceding 5 ps were designated as the putative TS ensemble (Fig. 2 a).
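To illustrate the dimensionality-reduction step, the sketch below embeds an all-versus-all distance matrix in three dimensions with classical (Torgerson) MDS. The text does not say which MDS variant was used, so the choice of classical MDS is an assumption, and the function name is hypothetical. A Cα-RMSD matrix is not exactly Euclidean, so for real data the embedding only approximates the input distances; the toy check uses true 3D points, for which the recovery is exact.

```python
import numpy as np

def classical_mds(dist, n_dims=3):
    """Embed a symmetric distance matrix in n_dims via classical (Torgerson) MDS."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred Gram matrix built from squared distances
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigval, eigvec = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_dims]      # keep the largest n_dims
    top = np.clip(eigval[order], 0.0, None)        # drop small negative eigenvalues
    return eigvec[:, order] * np.sqrt(top)

# Toy check with synthetic 3D points standing in for trajectory frames
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
emb = classical_mds(dist, n_dims=3)
emb_dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)
```

The cluster-exit decision itself was made by visual inspection in the paper and is not automated here.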

Figure 2. TS identification methods for Fyn SH3. (a) The 3D MDS representation of the Cα-RMSD all-versus-all space of the first 2 ns of a representative unfolding simulation. Each point represents a structure from the simulation, and the distance between two points is proportional to the Cα-RMSD between the two structures. The native cluster is colored red and the TS is shown. (b) Histogram of the native state and the same unfolding simulation shown in panel a. The native state is colored in red. The 498 K simulation is colored in green. The TS ensemble is identified as the valley between the native and denatured regions. The TS time points identified from conformational clustering are shown as magenta triangles. (c) Zoomed-in view of the TS valley shown in panel b to better show the population in the valley. For comparison, the magenta triangles show the positions of the TS identified through MDS. (d) Approximate 1D free-energy RXN coordinate calculated by taking the negative natural log (ln) of the count at each mean distance to the reference bin.

TS assignments from the RXN coordinate

Once the 1D RXN coordinate has been calculated, the histograms are used to define the TS ensemble region. For all the trajectories, there is a clear distinction in the RXN coordinate between the native reference and denatured species (Fig. 2 b). The TS ensemble region is defined as those structures in the unfolding trajectory that do not overlap the native reference and are contiguous in time in the first valley after the native state and before the denatured state. In practice, the valley has a minimum distance to the reference greater than the maximum distance from the native-state simulation. The maximum distance to reference for the valley is identified from the histogram (Fig. 2 c). The distances are converted to time points in the trajectory, and the earliest contiguous time points are then defined (structures are defined as contiguous if they are within 10 ps) as the TS ensemble. One can also determine the TS ensemble from the approximate free-energy RXN coordinate by picking structures that are in the region leaving the native state and cross the free-energy maximum of the RXN coordinate (Fig. 2 d).
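The time-window selection described above can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function and parameter names are hypothetical, the lower bound `native_max` would be the largest mean distance seen in the native-state simulation, and the upper bound `valley_max` is assumed to have been read from the histogram by the user.

```python
import numpy as np

def ts_time_window(times_ps, dist_unfold, native_max, valley_max, max_gap_ps=10.0):
    """Return (start, end) of the earliest contiguous run of frames in the valley.

    A frame is in the valley if native_max < distance <= valley_max.
    Frames are contiguous if consecutive hits are at most max_gap_ps apart.
    """
    times_ps = np.asarray(times_ps, dtype=float)
    dist_unfold = np.asarray(dist_unfold, dtype=float)
    in_valley = (dist_unfold > native_max) & (dist_unfold <= valley_max)
    hits = times_ps[in_valley]
    if hits.size == 0:
        return None
    # Index k in `breaks` means the gap between hits[k] and hits[k + 1] is too large
    breaks = np.where(np.diff(hits) > max_gap_ps)[0]
    end = breaks[0] if breaks.size else hits.size - 1
    return hits[0], hits[end]

# Synthetic trajectory: distance rises steadily, with one late revisit of the valley
times = np.arange(0, 100, 1.0)
dist = times / 100.0
dist[60] = 0.2  # isolated late frame inside the valley; should be excluded
window = ts_time_window(times, dist, native_max=0.1, valley_max=0.25)
```

Only the first contiguous run is returned, so later revisits of the valley (such as the frame at 60 ps in the toy data) do not extend the window.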

PCA

PCA was also used to reduce the dimensionality of the property space for a particular protein (typically five simulations per protein). PCA was run on each set of simulations and the associated normalized properties, as described by Kazmirski et al. (7). Projections of the first two or three principal components (PCs) reveal clusters of structures that can be attributed to the different species along the unfolding pathway. PCA was run in Mathematica (27) using a Java database connection, developed in-house, to the Dynameomics database (11). This analysis requires a data matrix with each column containing one of the 15 properties and each row representing a different time point in the simulation. The data matrix used for this analysis contains all the property values for each time point for the 298 K simulation and each of the 498 K simulations. Typically, loadings are calculated, and these are the coefficients of an eigenvector that reflect the contribution of the original variables to the vector. PCA was run over a data matrix from all the 298 K and the longer 498 K simulations.
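As a minimal sketch of this analysis, the code below runs PCA on a (time points × properties) matrix through the covariance eigenproblem. It is written in NumPy rather than Mathematica, the function name is hypothetical, and the toy matrix is synthetic; the columns of the returned eigenvector matrix play the role of the loadings described above.

```python
import numpy as np

def pca(data):
    """PCA of a (time points x properties) matrix via the covariance eigenproblem."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigval)[::-1]           # sort components by variance, descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()          # fraction of variance per component
    scores = centered @ eigvec                 # projections of each time point
    return scores, eigvec, explained

# Synthetic data: three strongly correlated properties plus small noise
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 500)
data = np.column_stack([t, 2 * t, -t]) + rng.normal(0, 0.01, size=(500, 3))
scores, loadings, explained = pca(data)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the kind of 2D projection in which clusters of time points can be inspected.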

Global native-state ensemble

A global native-state ensemble was created from the native-state simulations of the 188 proteins. Several properties had to be scaled so that they could be comparable between proteins; for example, the number of native contacts was converted to the fraction of native contacts, and SASAs were scaled by the number of residues. Once scaled, the properties were normalized by the range of values in both the 298 K and 498 K simulations for all proteins. RXN coordinates were then calculated for both of the longer (31 ns) unfolding simulations for all 188 proteins using the global ensemble as a generic native-state reference. In creating this global RXN coordinate, we chose the largest distance to the reference to represent the most denatured protein in our set. Hence, we could evaluate the extent to which other unfolding simulations unfolded relative to this reference to determine whether they needed to be run longer, or they had substantial amounts of residual structure in their denatured states. These RXN coordinates were then compared with RXN coordinates originally calculated using a protein's own native-state simulation as a reference. To improve the efficiency of the calculation, the sampling of the composite global native ensemble was reduced from 1 ps to 100 ps granularity. Since these were native-state simulations, the granularity could be reduced with little effect on the overall average properties. To ensure that the full features of the rapidly changing unfolding RXN coordinates were captured, a 1 ps granularity was maintained. A number of other proteins that were not in the original set of 188 proteins were also compared with the native reference. To achieve this, the properties were scaled by the number of residues (when appropriate). The scaled properties were then added to the pool of 298 and 498 K properties created by the 188 proteins, and the entire set of properties was normalized by the range of each property. 
The unfolding trajectory was then compared with the generic pooled native state, as described above. For those proteins not in the 188 representatives, single protein RXN coordinates were also calculated, i.e., the unfolding trajectories were compared with their own native simulation as a control. We carried out this procedure to show that the native ensemble of this set of 188 proteins captured the features of the native state in general.
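The scaling and pooled normalization can be illustrated with a short sketch. The function names, column layout, and numbers below are made up for illustration; the only steps taken from the text are dividing size-dependent properties by the number of residues and normalizing by the range of the pooled data.

```python
import numpy as np

def scale_by_residues(props, n_residues, per_residue_cols):
    """Divide size-dependent columns (e.g., SASA terms) by the chain length."""
    scaled = np.array(props, dtype=float)
    scaled[:, per_residue_cols] /= n_residues
    return scaled

def pooled_range_normalize(blocks):
    """Normalize several property tables by the range of the pooled data."""
    pooled = np.vstack(blocks)
    lo = pooled.min(axis=0)
    span = pooled.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant columns
    return [(b - lo) / span for b in blocks]

# Made-up tables: column 0 is a SASA-like term, column 1 a fraction of native contacts
small = scale_by_residues([[600.0, 0.9], [660.0, 0.8]], n_residues=60, per_residue_cols=[0])
large = scale_by_residues([[1200.0, 0.95], [1500.0, 0.4]], n_residues=120, per_residue_cols=[0])
small_n, large_n = pooled_range_normalize([small, large])

# Thinning a native reference from 1 ps to 100 ps granularity is a simple stride
reference_thinned = small_n[::100]
```

After scaling, the two proteins share a common per-residue range, so a new protein's frames can be added to the pool and renormalized in the same way.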

Results

Previous works using property space-based descriptions of the unfolding process have successfully identified unfolding species (7–9), but the method has only been applied to five proteins. Here, we used a property space composed of 15 general properties to construct a 1D RXN coordinate for 1534 simulations of 188 proteins. The mean property-space distance to reference histograms for each simulation were also calculated, resulting in 1534 RXN coordinate profiles. When we compare the histograms of the 15 properties for one 498 K run of the Fyn SH3 domain and its native-state simulation (see Fig. S1 in the Supporting Material), certain properties show no distinction between the native and nonnative simulations (e.g., the fraction of α-helix, polar SASA, and side-chain polar SASA). Conversely, the remaining properties show at least some distinction between the native and nonnative conformations. As expected, those properties that show a distinction between conformations are the most heavily weighted components in the PCA (Table S1).

Generalized RXNs

Data from all of the RXN coordinates were compiled for an overall view of the 188 proteins (Fig. 3 a). The overall histogram has a distinct native peak (dprop < 0.1), a low-population TS region (dprop = 0.1–0.25), and a denatured-state region (dprop > 0.25). Obviously, it is possible that compilation of these data may mask some of the finer details observed for individual simulations; indeed, the histogram in Fig. 3 a contains >22.9 μs of data, or 2.29 × 10^7 individual data points. Fig. 3 b contains the corresponding free-energy map over both the 298 K and 498 K simulations. As expected, the TS is higher in free energy than both the native and denatured states.
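The approximate free-energy profile is the negative natural log of the histogram counts, which can be sketched as below. The function name is hypothetical, the result is dimensionless (no temperature factor is applied), and the sample values are synthetic, chosen so that a sparsely populated valley sits between two well-populated states.

```python
import numpy as np

def free_energy_profile(mean_dist, bins):
    """Approximate profile: -ln(count) in each mean-distance bin."""
    counts, edges = np.histogram(mean_dist, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        energy = -np.log(counts.astype(float))  # empty bins become +inf
    return centers, energy

# Synthetic sample: native-like peak, sparse valley, denatured-like peak
mean_dist = np.concatenate([
    np.full(1000, 0.05),
    np.full(10, 0.25),
    np.full(500, 0.45),
])
centers, energy = free_energy_profile(mean_dist, bins=np.linspace(0.0, 0.5, 6))
```

Bins with few frames map to high values, which is why the low-population TS region appears as a barrier between the native and denatured basins.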

Figure 3. General RXN for protein unfolding derived from 188 proteins. (a) Mean distance histogram from the pooled data of 188 proteins and 1534 simulations. Native-state simulations are shown in red and 498 K simulations are shown in green. (b) Free-energy profile calculated by taking the negative log of the 298 K and 498 K counts of the mean distance to the reference. (c) Contour plot of PC1 and PC2 space for all 188 protein simulations.

In Fig. 3 c a contour plot is generated from a 2D histogram plotting PC1 against PC2 for the normalized properties of all 188 proteins at 100 ps granularity. The plot can be broken down into three regions of interest: the native-state ensemble, TS, and denatured-state ensemble. All states are distinguishable even though they were constructed from 1534 simulations. The native state generally occupies a smaller region of PC1 and PC2 space than the much broader denatured-state ensemble. Table S1 shows the loadings of each property in the first three PCs and the percentage of the variance captured. The first two PCs captured 77% of the variance, and the first three captured 83%. To capture 92% of the variance, the first five PCs are required. The loadings in the first PC for nonpolar SASA, side-chain nonpolar SASA, total SASA, main-chain SASA, side-chain SASA, native contacts, and main-chain nonpolar SASA are all high and approximately equal (±0.9), indicative of a high correlation with the original properties. The polar SASA, radius of gyration, fraction of β-sheet, nonnative contacts, and fraction of α-helix are the next-highest loadings, whereas the side-chain polar SASA and end-to-end distance have very low loadings, indicative of a low correlation with the original variables, and are of low importance.

TS assignment and comparison with experimental Φ-values

We identified TS ensembles using MDS and the RXN method from unfolding simulations for 188 proteins, and good experimental data are available for five of them (FKBP12 (PDB ID: 1fkb), α-spectrin SH3 domain (PDB ID: 1shg), Fyn SH3 domain (PDB ID: 1shf), engrailed homeodomain (EnHD; PDB ID: 1enh), and the immunity protein Im7 (PDB ID: 1unk)). We assessed the amount of structure at each residue over the identified TS ensemble in each simulation by calculating the S-values (15). S-values are a product of two terms, S2° and S3°, that measure local secondary and tertiary structure, respectively. S2° is the fraction of native secondary structure. S3° is the ratio of the number of contacts in the TS ensemble versus the number of contacts in the reference structure. We also calculated the average S-value across all simulations of the same protein. The structures in the TS ensembles are heterogeneous, with an average Cα RMSD of 3.49 ± 0.52 Å across different simulations of the same protein. All sets of S-values were compared with the available experimental Φ-values. For each protein, we considered only Φ-values with a ΔΔGD-N > 0.7 kcal/mol and conservative hydrophobic mutations. We considered only one Φ-value at each residue position and discarded negative Φ-values. The correlation coefficients between S- and Φ-values are reported in Table S2.
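The S-value comparison reduces to a per-residue product followed by a correlation coefficient, sketched below. The variable names `s2` and `s3` (secondary- and tertiary-structure terms) are hypothetical, the numbers are made up for four residues, and the use of the Pearson coefficient is an assumption, since the text does not name the correlation measure.

```python
import numpy as np

def s_values(s2, s3):
    """Per-residue S-value as the product of secondary and tertiary terms."""
    return np.asarray(s2, dtype=float) * np.asarray(s3, dtype=float)

def pearson_r(x, y):
    """Pearson correlation coefficient between two per-residue series."""
    return float(np.corrcoef(x, y)[0, 1])

# Made-up values for four mutated positions
s2 = [1.0, 0.8, 0.5, 0.2]   # fraction of native secondary structure
s3 = [0.9, 0.5, 0.4, 0.5]   # TS contacts relative to the reference structure
phi = [0.8, 0.5, 0.2, 0.1]  # stand-in experimental values

s = s_values(s2, s3)
r = pearson_r(s, phi)
```

Only residues that pass the experimental filters described above (stability change, mutation type, nonnegative value) would enter the correlation.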

Fyn SH3 domain

The TS ensembles picked from the Fyn SH3 domain simulations using both MDS and the RXN coordinate show excellent agreement with the experimental Φ-values (28), with correlation coefficients > 0.90 (Table S2). The location of the MDS TS ensemble picks falls within the TS valley (Fig. 2 c). The TS ensemble picked by MDS overlaps the RXN time window. Both the MDS and RXN coordinate TS ensemble structures show a loss of structure in the terminal strands (Fig. S2 a).

We used PCA to reduce the 15-dimensional data to a comprehensive and quantitative description of the unfolding trajectory. Projecting the first two PCs for native and 498 K simulations of Fyn SH3 (Fig. 4 a) shows that the protein has two well-defined states: when the protein is native-like, it is located within the native cluster; it is then in a broad but transiently populated region (TS ensemble) before it enters the very broad and heavily populated denatured state. A comparison with the 1D RXN coordinate (Fig. 4 b) leads to the same underlying observation: the protein is native-like before it enters a TS region with dprop = 0.12–0.28, and after the TS, the protein enters the denatured state. Representative structures are shown in Fig. 4 c. A RXN coordinate was also constructed from the fraction of native contacts (Q) and radius of gyration (Fig. 4 d) to compare with the 1D and 2D RXNs derived from the 15 properties (Fig. 4, a and b). A comparison of Fig. 4, a, b, and d, suggests that the different approaches are in agreement.

Figure 4. PCA and 1D projections of unfolding simulations of Fyn SH3 and Im7. (a) Contour plot of the projections of PC1 and PC2 from the PCA of the 15 properties of the unfolding (498 K run 1) and native simulations of Fyn SH3. Unfolding ensembles are highlighted. (b) 1D RXN coordinate for the same simulations of Fyn SH3 as in panel a, with the native state, TS, and denatured state highlighted. (c) Representative structures from the unfolding of Fyn SH3. (d) Contour maps of Q versus radius of gyration for the native state and unfolding simulation of Fyn SH3. A two-state process is clearly defined here, comparable to the observation in both the 1D RXN coordinate (b) and projections of PC1 and PC2 (a). (e) A contour plot of the projections of PC1 and PC2 from the PCA of the 15 properties for the same simulations of Im7 as in panel f. Unfolding ensembles are highlighted. (f) 1D RXN coordinate for the unfolding (498 K run 1) and native simulations of Im7. (g) Representative structures from the unfolding of Im7 with the native state (with annotated helices), TS1, intermediate state, TS2, and denatured state highlighted. The intermediate state is in agreement with experiment and shows loss of packing of helix 3 with helices 1 and 2. (h) Contour maps of Q versus radius of gyration for the native state and unfolding simulation of Im7. Using these measures, the process appears to be a two-state one. However, the 1D RXN coordinate (f) and the projections of PC1 and PC2 (e) indicate that it is a three-state process.

α-Spectrin SH3 domain

Another member of this fold family, α-spectrin SH3 domain, showed poor agreement between the S- and Φ-values (29), with correlation coefficients of 0.27 and 0.09 for the RXN and MDS methods, respectively (Table S2). Experimentally, Fyn and α-spectrin SH3 domain share low Φ-values in the N- and C-terminal strands and medium to high Φ-values in the other three strands. The exact location of high Φ-values varies in each protein. A single unfolding trajectory, run 3, showed good agreement (R = 0.70) with experiment for the MDS method (Table S2). There is more heterogeneity in the α-spectrin SH3 domain MD-derived TS structures than in the Fyn SH3 structures. Consistently, α-spectrin SH3 loses structure in strand 5, the C-terminal strand, similar to Fyn SH3 domain. However, structure is also frequently lost in the middle strands (2 and 4) in TS structures of α-spectrin SH3 domain. We attempted to increase the correlation with experiment by choosing earlier TS time points using both MDS and the RXN coordinate methods, or using the MDS cluster identified in run 3 to guide the choice of clusters in the other unfolding runs, but these approaches were unsuccessful.

Immunity 7 protein

In contrast to Fyn SH3, the projections of the first two PCs of the colicin E7 immunity protein (Im7; Fig. 4 e) indicate that there are three well-defined species (native, intermediate, and denatured states), which are also evident in the 1D RXN coordinate (Fig. 4 f). This observation is in agreement with experimental evidence for an on-pathway intermediate (30).

Good agreement with experimental Φ-values (30) was also obtained for the MD-derived TS ensemble of Im7 using both the RXN coordinate and conformational clustering methods (Table S2). The TS ensemble time windows from MDS and RXN coordinate overlap in three of the eight unfolding trajectories (Table S2). For the other five unfolding simulations, although the MDS TS time fell outside the first contiguous time range, the dprop of the MDS TS time points fell within the RXN coordinate TS valley. Both MDS and RXN coordinate TS structures show deformity in helix 2 and loss of structure in helix 3 (Fig. 4 g). S-values and Φ-values are both highest in the C-terminal helix (Fig. S2 b).

We also selected structures from the intermediate cluster, as shown in Fig. 4 e, and calculated S-values to compare with the experimental ΦI-values (30). The correlation was R = 0.55. Representative structures for these states are provided in Fig. 4 g. In the second TS ensemble, there is overlap between the intermediate and denatured states. A RXN coordinate constructed of the fraction of native contacts (Q) and radius of gyration is shown in Fig. 4 h. As with Fyn SH3, this RXN coordinate contains two well-defined and highly populated regions indicative of the native and denatured states based on Q and radius of gyration. This projection (Fig. 4 h) does not agree with our PCA projections and the 1D RXN coordinate (Fig. 4, e and f), or with experiment, all of which indicate that there is a well-populated intermediate.

FKBP12

For FKBP12, the initial comparison between S- and Φ-values (17) from TS ensembles identified by conformational clustering had a correlation coefficient of 0.54 (Fig. S3). However, the initial conformational clustering TS ensemble time points fell outside the ensemble identified from the RXN coordinate and outside the TS valley. Clustering over a shorter time span (500 ps versus the original 2 ns) for each unfolding simulation helped to reveal earlier, previously hidden clusters in the 3D projection from conformational clustering (Fig. S3 b). TS ensembles identified from these earlier clusters gave better agreement with the experimental data (R = 0.70; Table S2). In all eight simulations, the MDS TS time points fell within the RXN coordinate TS valley.

Engrailed homeodomain

The correlation between the average S- and Φ-values (18) for EnHD using the RXN coordinate and conformational clustering methods was 0.33 and 0.28, respectively. However, as we previously reported (31), S-values do not always correctly represent the amount of structure present in loop regions of the protein, and this is a known problem with EnHD. The issue is that although side-chain interactions are often maintained in loops, S-values are usually very low because of the more dynamic character of these regions. The Φ-values probe differences primarily in side-chain interactions. Therefore, we used only the tertiary-contact term, S3°, for residues in loops, with the exception of residue Y25, which maintains contacts in the simulation TS ensembles but deviates from the native backbone (φ, ψ) angles. With this combination of S- and S3°-values, the correlation coefficients increase to 0.71 and 0.75 for the RXN and conformational clustering methods, respectively (Table S2 and Fig. S4). The MD-generated TS structures maintain the helical content, but helix 3 pulls away from the core.

Dynameomics targets with Φ-values not considered

Φ-Values are available for three other proteins in our Dynameomics set: CheY (32), ubiquitin (33–35), and the ribosomal protein S6 (36). CheY has a complicated unfolding pathway, as determined from experiment, where the reported Φ-values and the major TS ensemble may be after the intermediate on the unfolding pathway. Additional work is needed to fully compare our CheY simulations with available experimental data. Inconsistent experimental Φ-values for S6 and ubiquitin obtained by different laboratories make it difficult to compare results from simulation and experiment. Single-residue positions of ubiquitin with different mutations show a wide range of Φ-values (33–35). S6 has a different pattern of Φ-values under different experimental conditions (36). Since the experimental Φ-values are a moving target, these two cases are not good for validation purposes.

Determining RXN coordinates using the global native ensemble

Vav SH3 domain

We sought to determine whether the native-state properties of the 188 proteins could serve as a reference for other folds, i.e., whether we could define a global native-state ensemble. This putative global native-state ensemble was constructed as a reference and the RXN coordinates were calculated for all 188 proteins. To illustrate the results, projections of Fyn SH3 and Im7 are provided in Fig. 5, a and b. This approach is in contrast to the procedure described above in which the protein's own native-state simulation was used as the reference. The RXN coordinate was also calculated for a number of proteins that were not in the original set of 188, and hence not in the general native-state reference. One of these proteins, Vav SH3 (PDB ID: 1gcp; not in the reference but still in the fold list) is shown in Fig. 5 c. The aim here was to test the generality of our RXN coordinate and determine whether it is sufficiently all-encompassing and applicable to describe the unfolding of other globular, monomeric proteins. Fig. 5 a shows the 1D RXN coordinate determined for Fyn SH3 (498 K run 1), which is comparable to (albeit shifted from) the 1D RXN coordinate that was calculated for Fyn SH3 498 K run 1 using its own native state as the reference state (Fig. 4 b). The main features of the RXN coordinate, namely, a short-lived population of native protein and a TS region before the denatured state, are captured. Fig. 5 b shows the 1D RXN coordinate determined for Im7 (498 K run 1), which is comparable to Fig. 4 f. Again, the main features of the 1D RXN coordinate are captured. In Fig. 5 c the RXN coordinates for two 498 K simulations of Vav SH3 (a protein not in the global native-state ensemble) are shown. Fig. 5 d shows the same Vav SH3 simulations compared with the Vav SH3 native 298 K simulation as the reference. Here, it is apparent that using either the global native-state ensemble, of which Vav SH3 is not a member (although SH3 domains are represented), or the Vav SH3 native-state simulation as the reference captures the same unfolding features.

Figure 5.

RXN coordinates derived using the general native-state ensemble as a reference for property-space calculations. (a) The 1D RXN coordinate of Fyn SH3 calculated using the general native-state ensemble shows features comparable to those observed in Fig. 3 b. (b) The 1D RXN coordinate of Im7 calculated using the general native-state ensemble shows features comparable to those observed in Fig. 4 d. (c) 1D RXN coordinates of Vav SH3 calculated for two 498 K trajectories using the general native-state ensemble. (d) 1D RXN coordinates of Vav SH3 calculated using the protein's own native-state ensemble.

Methane monooxygenase component B

We also examined the RXN coordinate for a protein whose fold is not among the original 188: methane monooxygenase component B (MMcB; PDB ID: 2mob; rank 221 in our Dynameomics 2003 fold list (37)). The global native-state ensemble was again used as the reference in these calculations (Fig. S5 a), and the results were compared with the RXN coordinates obtained when the unfolding trajectories were compared with the protein's own native-state simulation (Fig. S5 b). Again, the main features of the RXN coordinates are similar in both cases, although the curves are shifted.

Discussion

Simple RXN coordinates based on two or fewer properties are often inadequate for distinguishing between different protein conformational states. Here we have described the construction of a multidimensional-embedded, property-space (15 physical properties of the proteins; Table S1) RXN coordinate of the unfolding process, which we believe provides a simple but more faithful and global view of the unfolding process. These properties were monitored for 1534 simulations of 188 proteins representing distinct fold families from our Dynameomics database (10–12). Unfolding simulations for a particular protein were compared with its native simulation, and the dprop was calculated between every structure in the unfolding trajectory and every structure in the native ensemble. A histogram was then constructed from the dprop, yielding a very simple RXN coordinate for unfolding. Although the resulting 1D RXN coordinate is simple, it captures the complexity of the process because of the multidimensional-embedded properties. With this method, the distinction between time points representing native, TS, intermediate, and denatured state ensembles is readily observable (Fig. 4 f).
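The calculation summarized above can be illustrated with a minimal sketch. It assumes that each structure is described by a vector of properties, that the properties are min-max scaled over the pooled data, and that the distance is Euclidean and divided by the square root of the number of properties. These choices, and the names `mean_property_distance` and `rxn_histogram`, are illustrative assumptions and not necessarily the definition of dprop used in this work.

```python
import numpy as np

def mean_property_distance(unfolding, native):
    """Mean property-space distance of each unfolding frame to a native ensemble.

    unfolding: array of shape (n_frames, n_props)
    native:    array of shape (n_native, n_props)
    ASSUMPTION: min-max scaling and a Euclidean metric normalized by
    sqrt(n_props); this is a stand-in for the paper's own dprop definition.
    """
    unfolding = np.asarray(unfolding, dtype=float)
    native = np.asarray(native, dtype=float)
    pooled = np.vstack([unfolding, native])
    lo = pooled.min(axis=0)
    span = pooled.max(axis=0) - lo
    span[span == 0.0] = 1.0  # a constant property contributes zero distance
    u = (unfolding - lo) / span
    n = (native - lo) / span
    # Distance between every unfolding frame and every native frame.
    diff = u[:, None, :] - n[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2) / u.shape[1])
    # Average over the native ensemble: one value per unfolding frame.
    return dist.mean(axis=1)

def rxn_histogram(d_prop, bins=50):
    """Histogram of the distances, i.e., the 1D view of the trajectory."""
    counts, edges = np.histogram(d_prop, bins=bins, range=(0.0, 1.0))
    return counts, edges
```

With this scaling every distance lies between 0 and 1, so histograms from different trajectories share the same axis. Passing a pooled native ensemble from many proteins as `native` would correspond to the global-reference variant, subject to the same assumptions.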

We have introduced a new method for TS ensemble assignment using a property-space-based 1D RXN coordinate. Assignments made through the RXN coordinate method are in good agreement with experimental data and our previously established MDS method. For four proteins in particular (engrailed homeodomain, Fyn SH3, Im7, and FKBP12), the correlation is good. There is poor agreement with experiment for the fifth protein, α-spectrin SH3, and we have not been able to pinpoint the reason for this, i.e., whether there are problems with the simulations or the experimental values.

The unfolding RXN coordinates for the 188 proteins were compiled to create a global composite view of unfolding (Fig. 3 a). Although the precise position of the TS region for a protein is specific to that protein, compiling all RXN coordinates shows that the TS region generally has dprop = 0.1–0.25. This observation also suggests that dprop < 0.1 for the native state and > 0.25 for the denatured state, although more unfolded structures have dprop ∼ 0.6. The generality of this metric across such a wide variety of protein folds suggests that it can be used in a predictive manner to screen simulations.
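As a rough illustration of how such ranges might be used to screen simulations, the sketch below labels each structure from its dprop value. The cutoffs are the composite values quoted above; because the TS position is protein-specific, they are illustrative defaults rather than universal boundaries, and the function names are hypothetical. A simple threshold of this kind cannot separate intermediates from the denatured state.

```python
from collections import Counter

# Composite ranges quoted in the text. The TS position is protein-specific,
# so these defaults are illustrative, not universal cutoffs.
TS_LOWER = 0.10
TS_UPPER = 0.25

def assign_state(d_prop, ts_lower=TS_LOWER, ts_upper=TS_UPPER):
    """Label one structure from its mean property-space distance."""
    if d_prop < ts_lower:
        return "native"
    if d_prop <= ts_upper:
        return "TS region"
    return "denatured"  # intermediates are not distinguished here

def screen_trajectory(d_props):
    """Label every frame and count how many frames fall in each state."""
    labels = [assign_state(d) for d in d_props]
    return labels, Counter(labels)
```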

We performed a PCA of the 15 properties, and this proved to be a useful way to define unfolding species. With PCA, the RXN is defined quantitatively based on the variance of the properties and there is no bias toward certain user-defined properties (although there may be bias in terms of which properties are included in the first place). The loadings derived from the PCA calculation describe the correlation between the time-dependent variance of a property and each PC. For example, for EnHD (a three-helix bundle), the helical structure has a high loading (0.91), whereas the β-structure term does not (−0.05). It is important to look at the percentage of variance captured in the first three PCs, since these are the components that can be readily visualized. Typically, the first two PCs account for at least ∼75% of the variance. We found that even 2D plots of the PCs can be very helpful (Fig. 3 c). For the 188 proteins (Table S1), five dimensions are required to capture 92% of the variance, which is not trivial to visualize. The contour plot of PC1 and PC2 space for all proteins (Fig. 3 c) shows a native region separated by a TS region before the broad denatured-state region is entered. In property space, structures begin folding in the denatured state with a large variance in properties. Moving closer to the native state, the variance in properties becomes smaller.
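The quantities discussed here (scores, loadings, and the fraction of variance captured) can be obtained as in the following sketch. It assumes that the properties are standardized before the decomposition, so that a loading equals the correlation between a property and a PC. The standardization and the function names are assumptions made for illustration, not a description of the software used in this work.

```python
import numpy as np

def pca(properties):
    """PCA of an (n_structures, n_props) property matrix via SVD.

    ASSUMPTION: properties are standardized (zero mean, unit variance),
    which makes each loading the correlation between a property and a PC.
    """
    X = np.asarray(properties, dtype=float)
    X = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0  # leave constant properties at zero
    Z = X / std
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s ** 2 / (Z.shape[0] - 1)   # variance along each PC
    explained = eigvals / eigvals.sum()   # fraction of total variance
    scores = U * s                        # structures projected onto PCs
    loadings = Vt.T * np.sqrt(eigvals)    # rows: properties, columns: PCs
    return scores, loadings, explained

def n_components_for(explained, target):
    """Smallest number of PCs whose cumulative variance reaches target."""
    cum = np.cumsum(explained)
    return min(int(np.searchsorted(cum, target)) + 1, len(cum))
```

Plotting the first two columns of `scores` against each other gives a 2D view of the kind described in the text, and `n_components_for(explained, 0.92)` reports how many dimensions would be needed to reach a given fraction of the variance.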

The fact that the property-space description includes very general properties that are not necessarily protein-specific may raise the question as to whether such properties can capture the fine detail and features of a particular protein's unfolding. Indeed, it is true that for a more detailed analysis, one should use more protein-specific measures. For example, interhelical distances in EnHD can more precisely define the structure, but they are not included in our RXN coordinate because we want a general expression that covers all globular protein folds. In this high-throughput method, the properties appear to be descriptive enough to capture the general features of unfolding. Furthermore, the scope of the properties is broad enough that even if a number of properties are redundant, the remaining properties will capture the detail. The property space we have created can also capture details of the denatured-state ensemble, and in several cases we observed potential intermediate states during unfolding. For example, Im7 populates an intermediate state (Fig. 4, e–g) in its unfolding, which is in agreement with experiment (30). Furthermore, the MD-generated intermediate is consistent with experiment: helix 3 is not docked to helix 1 or 2, and there are many nonnative interactions. A number of other proteins show a potential intermediate state. Using just two properties does not always provide enough sensitivity to detect intermediate states, as evidenced by the failure of the radius of gyration and fraction of native contacts to pick up the Im7 intermediate (Fig. 4 h).

The ability to distinguish between conformational states is highly useful, and since the measurement in our RXN coordinate is the mean distance to the native ensemble, we can determine with a certain degree of confidence how native-like a structure is. This becomes particularly useful when studying temperature-quenched refolding simulations. For example, we can use a more tailored RXN coordinate to determine when EnHD refolds as high-temperature structures are quenched (M. E. McCully, A. R. Fersht, and V. Daggett, unpublished). Using the 1D RXN coordinate, it is trivial to determine when the protein becomes native-like, as this is the region that is bounded in property space by the properties of the entire native ensemble. In the EnHD study, the property space is constructed from 35 properties and includes properties specific to this system; in particular, interhelical distances are sensitive metrics. Similarly, one can identify native-like structures by inspecting the first two PCs after PCA has been run over the property-space matrix, since native-like structures will fall in the native ensemble cluster. Both methods can be used for scoring as an alternative to more traditional measures such as Cα RMSD, which is not always robust and requires knowledge of the target structure. Structures that are native-like in property space (i.e., have properties that are within the bounds of the native-state ensemble) are not necessarily close to each other in Cartesian space. This effect is also seen with real experimental probes, which tend to report on the general properties of an ensemble of protein molecules. Multiple conformational pathways may follow the same route in property space. We can reduce the degeneracy of the property space by including as many properties as possible, since structures with similar properties should also be conformationally similar. A more realistic representation of native-like structures would comprise structures that are close to native-like structures but also are close to each other in property space. Therefore, simulations of protein folding could have an alternative end point: to fold to a structure that is native-like in property space, and thus capture the dynamic nature of the native state.
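One simple reading of "within the bounds of the native-state ensemble" is a per-property range check, sketched below. The per-property minimum and maximum, the optional tolerance, and the function names are assumptions made for illustration; this is not the scoring procedure used in the refolding study mentioned above.

```python
def native_bounds(native):
    """Per-property (min, max) over a native ensemble of property vectors."""
    return [(min(col), max(col)) for col in zip(*native)]

def is_native_like(frame, bounds, tol=0.0):
    """True if every property of the frame lies within the native bounds.

    tol is a hypothetical absolute tolerance added to both ends of each range.
    """
    return all(lo - tol <= x <= hi + tol for x, (lo, hi) in zip(frame, bounds))

def first_native_like(trajectory, bounds, tol=0.0):
    """Index of the first native-like frame, or None if there is none."""
    for i, frame in enumerate(trajectory):
        if is_native_like(frame, bounds, tol):
            return i
    return None
```

Applied to a quenched trajectory, `first_native_like` would give a candidate time point at which the structure re-enters the native region of property space. Two frames that both pass this check may still differ in Cartesian space, as noted in the text.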

We validated our TS assignments using these RXN coordinates by comparison with experimental Φ-values. For FKBP12, Fyn SH3 domain, EnHD, and Im7, the correlation is good. In an ideal world, we would be able to compare all of the proteins in the set of 188 with experiment, but the experimental data are not available.

An important goal of the Dynameomics project is to define a general RXN coordinate based on information gleaned from a huge set of protein simulations, as well as analyses of those simulations and properties. Here, a global native-state ensemble was created from the 15 aforementioned properties for 188 proteins with distinct folds. Unfolding trajectories were compared in property space with the global native-state ensemble, and for proteins not contained in the 188 list, the global native-state ensemble recapitulated the features seen when an unfolding trajectory was compared with its own native-state simulation. Consequently, the general global native reference can be applied to other globular monomeric proteins outside of the existing 188 proteins. In some ways this is not so surprising, considering that these 188 proteins represent 67% of all known protein structures.

Conclusions

Choosing an RXN coordinate for a complex process such as protein folding is challenging. A good RXN coordinate should be able to differentiate among unfolded, near-native, and folded ensembles (9,21). The RXN coordinate described here appears to be robust. We are able to distinguish unfolding species with relative ease, and the TS ensembles agree with experiment. We are now proceeding to assign TS ensembles for all proteins in the Dynameomics database, and have already begun to characterize the global properties of this state (38).

The properties from which the RXN coordinates are derived capture the general features of unfolding. We can also assess when a structure becomes native-like, which is particularly useful for studying simulations of refolding or folding. The ability to assign structures in an unfolding trajectory to conformational states is highly desirable, especially for our large-scale, high-throughput Dynameomics endeavor. Ultimately, the ability to distinguish and characterize the overall properties of the states along the unfolding/folding RXN coordinate should provide insight into the general mechanisms of protein folding.

Supporting Material

Five figures and two tables are available at http://www.biophysj.org/biophysj/supplemental/S0006-3495(10)00328-0.

Acknowledgments

We thank Dr. David A. C. Beck for writing the Comparator program, and Dr. Kathryn Scott for helpful discussions.

This study was supported by the External Research Program of Microsoft Research (www.microsoft.com/science) and the National Institutes of Health (grant GM 50789 to V.D.). Most simulations for Dynameomics were performed using resources at the National Energy Research Supercomputing Center, which is supported by the Office of Science of the U.S. Department of Energy under contract No. DE-AC02-05CH11231, with the support of the Office of Biological and Environmental Research.

References

  • 1. McCully M.E., Beck D.A., Daggett V. Microscopic reversibility of protein folding in molecular dynamics simulations of the engrailed homeodomain. Biochemistry. 2008;47:7079–7089. doi: 10.1021/bi800118b.
  • 2. Day R., Daggett V. Direct observation of microscopic reversibility in single-molecule protein folding. J. Mol. Biol. 2007;366:677–686. doi: 10.1016/j.jmb.2006.11.043.
  • 3. Boczko E.M., Brooks C.L. First-principles calculation of the folding free energy of a three-helix bundle protein. Science. 1995;269:393–396. doi: 10.1126/science.7618103.
  • 4. Sheinerman F.B., Brooks C.L. Calculations on folding of segment B1 of streptococcal protein G. J. Mol. Biol. 1998;278:439–456. doi: 10.1006/jmbi.1998.1688.
  • 5. Sheinerman F.B., Brooks C.L. Molecular picture of folding of a small α/β protein. Proc. Natl. Acad. Sci. USA. 1998;95:1562–1567. doi: 10.1073/pnas.95.4.1562.
  • 6. García A.E., Onuchic J.N. Folding a protein in a computer: an atomic description of the folding/unfolding of protein A. Proc. Natl. Acad. Sci. USA. 2003;100:13898–13903. doi: 10.1073/pnas.2335541100.
  • 7. Kazmirski S.L., Li A., Daggett V. Analysis methods for comparison of multiple molecular dynamics trajectories: applications to protein unfolding pathways and denatured ensembles. J. Mol. Biol. 1999;290:283–304. doi: 10.1006/jmbi.1999.2843.
  • 8. Scott K.A., Randles L.G., Clarke J. The folding pathway of spectrin R17 from experiment and simulation: using experimentally validated MD simulations to characterize states hinted at by experiment. J. Mol. Biol. 2006;359:159–173. doi: 10.1016/j.jmb.2006.03.011.
  • 9. Beck D.A.C., Daggett V. A one-dimensional reaction coordinate for identification of transition states from explicit solvent P(fold)-like calculations. Biophys. J. 2007;93:3382–3391. doi: 10.1529/biophysj.106.100149.
  • 10. Beck D.A., Jonsson A.L., Daggett V. Dynameomics: mass annotation of protein dynamics and unfolding in water by high-throughput atomistic molecular dynamics simulations. Protein Eng. Des. Sel. 2008;21:353–368. doi: 10.1093/protein/gzn011.
  • 11. Simms A.M., Toofanny R.D., Daggett V. Dynameomics: design of a computational lab workflow and scientific data repository for protein simulations. Protein Eng. Des. Sel. 2008;21:369–377. doi: 10.1093/protein/gzn012.
  • 12. Kehl C., Simms A.M., Daggett V. Dynameomics: a multi-dimensional analysis-optimized database for dynamic protein data. Protein Eng. Des. Sel. 2008;21:379–386. doi: 10.1093/protein/gzn015.
  • 13. Fersht A.R., Matouschek A., Serrano L. The folding of an enzyme. I. Theory of protein engineering analysis of stability and pathway of protein folding. J. Mol. Biol. 1992;224:771–782. doi: 10.1016/0022-2836(92)90561-w.
  • 14. Daggett V., Li A.J., Fersht A.R. Combined molecular dynamics and Φ-value analysis of structure-reactivity relationships in the transition state and unfolding pathway of barnase: structural basis of Hammond and anti-Hammond effects. J. Am. Chem. Soc. 1998;120:12740–12754.
  • 15. Daggett V., Li A.J., Fersht A.R. Structure of the transition state for folding of a protein derived from experiment and simulation. J. Mol. Biol. 1996;257:430–440. doi: 10.1006/jmbi.1996.0173.
  • 16. Day R., Daggett V. Sensitivity of the folding/unfolding transition state ensemble of chymotrypsin inhibitor 2 to changes in temperature and solvent. Protein Sci. 2005;14:1242–1252. doi: 10.1110/ps.041226005.
  • 17. Fulton K.F., Main E.R.G., Jackson S.E. Mapping the interactions present in the transition state for unfolding/folding of FKBP12. J. Mol. Biol. 1999;291:445–461. doi: 10.1006/jmbi.1999.2942.
  • 18. Gianni S., Guydosh N.R., Fersht A.R. Unifying features in protein-folding mechanisms. Proc. Natl. Acad. Sci. USA. 2003;100:13286–13291. doi: 10.1073/pnas.1835776100.
  • 19. Li A.J., Daggett V. Molecular dynamics simulation of the unfolding of barnase: characterization of the major intermediate. J. Mol. Biol. 1998;275:677–694. doi: 10.1006/jmbi.1997.1484.
  • 20. Petrovich M., Jonsson A.L., Fersht A.R. Φ-Analysis at the experimental limits: mechanism of β-hairpin formation. J. Mol. Biol. 2006;360:865–881. doi: 10.1016/j.jmb.2006.05.050.
  • 21. Scheraga H.A., Khalili M., Liwo A. Protein-folding dynamics: overview of molecular simulation techniques. Annu. Rev. Phys. Chem. 2007;58:57–83. doi: 10.1146/annurev.physchem.58.032806.104614.
  • 22. Beck, D.A.C., D.O.V. Alonso, and V. Daggett. 2000–2009. In lucem Molecular Mechanics (ilmm). University of Washington, Seattle.
  • 23. Lee B., Richards F.M. The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 1971;55:379–400. doi: 10.1016/0022-2836(71)90324-x.
  • 24. Yee D.P., Dill K.A. Families and the structural relatedness among globular proteins. Protein Sci. 1993;2:884–899. doi: 10.1002/pro.5560020603.
  • 25. Li A.J., Daggett V. Characterization of the transition state of protein unfolding by use of molecular dynamics: chymotrypsin inhibitor 2. Proc. Natl. Acad. Sci. USA. 1994;91:10430–10434. doi: 10.1073/pnas.91.22.10430.
  • 26. Li A.J., Daggett V. Identification and characterization of the unfolding transition state of chymotrypsin inhibitor 2 by molecular dynamics simulations. J. Mol. Biol. 1996;257:412–429. doi: 10.1006/jmbi.1996.0172.
  • 27. Wolfram Research, Inc. Mathematica. Wolfram Research; Champaign, IL: 2007.
  • 28. Northey J.G.B., Di Nardo A.A., Davidson A.R. Hydrophobic core packing in the SH3 domain folding transition state. Nat. Struct. Biol. 2002;9:126–130. doi: 10.1038/nsb748.
  • 29. Martínez J.C., Serrano L. The folding transition state between SH3 domains is conformationally restricted and evolutionarily conserved. Nat. Struct. Biol. 1999;6:1010–1016. doi: 10.1038/14896.
  • 30. Capaldi A.P., Kleanthous C., Radford S.E. Im7 folding mechanism: misfolding on a path to the native state. Nat. Struct. Biol. 2002;9:209–216. doi: 10.1038/nsb757.
  • 31. White G.W., Gianni S., Daggett V. Simulation and experiment conspire to reveal cryptic intermediates and a slide from the nucleation-condensation to framework mechanism of folding. J. Mol. Biol. 2005;350:757–775. doi: 10.1016/j.jmb.2005.05.005.
  • 32. López-Hernández E., Serrano L. Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI-2. Fold. Des. 1996;1:43–55.
  • 33. Went H.M., Jackson S.E. Ubiquitin folds through a highly polarized transition state. Protein Eng. Des. Sel. 2005;18:229–237. doi: 10.1093/protein/gzi025.
  • 34. Krantz B.A., Dothager R.S., Sosnick T.R. Discerning the structure and energy of multiple transition states in protein folding using ψ-analysis. J. Mol. Biol. 2004;337:463–475. doi: 10.1016/j.jmb.2004.01.018.
  • 35. Sosnick T.R., Dothager R.S., Krantz B.A. Differences in the folding transition state of ubiquitin indicated by ϕ and ψ analyses. Proc. Natl. Acad. Sci. USA. 2004;101:17377–17382. doi: 10.1073/pnas.0407683101.
  • 36. Otzen D.E., Oliveberg M. Conformational plasticity in folding of the split β-α-β protein S6: evidence for burst-phase disruption of the native state. J. Mol. Biol. 2002;317:613–627. doi: 10.1006/jmbi.2002.5423.
  • 37. Day R., Beck D.A., Daggett V. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 2003;12:2150–2160. doi: 10.1110/ps.0306803.
  • 38. Jonsson A.L., Scott K.A., Daggett V. Dynameomics: a consensus view of the protein unfolding/folding transition state ensemble across a diverse set of protein folds. Biophys. J. 2009;97:2958–2966. doi: 10.1016/j.bpj.2009.09.012.

Articles from Biophysical Journal are provided here courtesy of The Biophysical Society
