Abstract
Characterizing the conformations of proteins in the transition state ensemble (TSE) is important for studying protein folding. A promising approach to studying the TSE, pioneered by Vendruscolo et al. [Nature (London) 409, 641 (2001)], is to generate conformations that satisfy all constraints imposed by the experimentally measured ϕ values, which provide information about the native likeness of the transition states. Faísca et al. [J. Chem. Phys. 129, 095108 (2008)] generated conformations of the TSE through Markov chain Monte Carlo (MCMC) simulations, based on the criterion that, starting from a TS conformation, the probabilities of folding and unfolding are about equal. In this study, we use the constrained sequential Monte Carlo method [Lin et al., J. Chem. Phys. 129, 094101 (2008); Zhang et al., Proteins 66, 61 (2007)] to generate TSE conformations of acylphosphatase, a protein of 98 residues, that satisfy the ϕ-value constraints as well as the criterion that each conformation has a folding probability of 0.5 in Monte Carlo simulations. We adopt a two-stage process and first generate 5000 contact maps satisfying the ϕ-value constraints. Each contact map is then used to generate 1000 properly weighted conformations. After clustering similar conformations, we obtain a set of properly weighted samples of 4185 candidate clusters. A representative conformation of each of these clusters is then selected, and 50 runs of MCMC simulation are carried out using a regrowth move set. We then select a subset of 1501 conformations that have equal probabilities to fold and to unfold as the set of TSE. These 1501 samples characterize well the distribution of transition state ensemble conformations of acylphosphatase. Compared with previous studies, our approach can access a much wider conformational space and can objectively generate conformations that satisfy the ϕ-value constraints and the criterion of 0.5 folding probability without bias.
In contrast to previous studies, our results show that transition state conformations are very diverse and are far from nativelike when measured in Cartesian root-mean-square deviation (cRMSD): the average cRMSD between TSE conformations and the native structure is 9.4 Å for this short protein, instead of the 6 Å reported in previous studies. In addition, we find that the average fraction of native contacts in the TSE is 0.37, with enrichment in native-like β-sheets and a shortage of long-range contacts, suggesting that such contacts form at a later stage of folding. We further calculate the first passage time of folding of TSE conformations by assigning a physical time to the regrowth moves in the MCMC simulation: such moves are mapped to a Markovian state model whose transition times are obtained from Langevin dynamics simulations. Our results indicate that, despite the large structural diversity of the TSE, its conformations are characterized by similar folding times. Our approach is general and can be used to study the TSE of other macromolecules.
INTRODUCTION
While the native conformation of a protein provides the structural basis of its biological function, it is important to understand how proteins fold to their native states.8, 29, 37 Protein folding is a complex process that involves many different molecular and cellular machineries. Protein conformations are inherently heterogeneous and, in many cases, misfolded proteins can cause diseases such as Alzheimer’s disease, Parkinson’s disease, and type II diabetes.1 Characterizing the conformations of the transition state ensemble (TSE) of protein folding has been a major focus in protein folding studies.2, 15, 17, 35, 38 The transition state ensemble is usually understood to consist of those conformations around the saddle point of the landscape of protein folding.35 These conformations have about the same probability to either fold or unfold. Because the transition states are transient in nature, can contain a wide range of conformations, and are often dynamic with a significant amount of structural fluctuation, it is challenging to study them with experimental techniques.
An important approach to study TSE is the ϕ-value analysis.16 By measuring the changes of free energy of activation and free energy of folding upon mutating a residue, this technique provides a measure of the extent of formation of structure relative to denatured and native states of the TSE. Experimental ϕ-value analysis can also provide information on the degree of formation of secondary and tertiary structures,36 backbone–backbone hydrogen-bonding interactions,7 and movement around the transition state in the folding energy landscape.19
Computational studies have also led to important insights into how proteins fold. Among these, lattice models and molecular dynamics have been successfully applied to study protein folding and to characterize partially unfolded structures.14 Klimov and Thirumalai used exhaustive simulations of lattice models with side-chains to study the transition state ensemble of two-state folders.22 Day and Daggett ran multiple molecular dynamics simulations at different temperatures and solvent environments to study the folding∕unfolding transition state ensemble of chymotrypsin inhibitor 2.6 Ding et al. reconstructed the TSE of the src-SH3 protein domain from molecular dynamics simulations.9 Prompers and Brüschweiler combined molecular dynamics with NMR relaxation spectroscopy to study the dynamics of folded and unfolded proteins.31 Zagrovic et al. found that the mean structures averaged over the unfolded ensembles of small proteins of three different folds are nativelike.42 Experimental information, such as NMR residual dipolar couplings, can be used as constraints to select unfolded state structures.27 Other information from NMR spectroscopy can also be incorporated to define partially folded intermediate states.23, 32 Richter et al. provided a solution to the overfitting and underfitting problems that arise when calculating ensembles of structures with NMR constraints.33
To generate explicit conformations of the TSE, Vendruscolo et al. used information from experimental ϕ values.28, 40 The ϕ value at individual residue position is defined as the ratio of stability change to the transition state upon mutation versus stability change of the native folded state upon the same mutation.24, 25 ϕ values can be measured experimentally and provide rich information about the native likeness of protein structures in the TSE.13, 41 Following Li and Daggett’s work,25 Vendruscolo et al. defined TSE as the conformations that satisfy
| (1) |
for the i-th residue with experimentally measured ϕ value $\phi_i^{\mathrm{exp}}$. Here the calculated ϕ value of residue i is defined as the ratio of the number of native contacts formed by the residue in the transition state to the number of contacts formed by the residue in the native state. Using the Markov chain Monte Carlo (MCMC) method with crankshaft moves, the authors generated a set of TSE conformations based on this model for acylphosphatase (AcP), a protein with 98 residues.
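The contact-ratio definition of the calculated ϕ value can be illustrated with a short Python sketch. This is not the authors' code: the function names `contact_map` and `calculated_phi`, the use of one coordinate per residue, and the `min_sep` sequence-separation filter are assumptions made for illustration. The 8.5 Å cutoff is the contact distance quoted for the lattice model in this paper.

```python
import numpy as np

def contact_map(coords, cutoff=8.5, min_sep=3):
    """Boolean contact map: residues i and j are in contact if closer than cutoff.

    min_sep (assumed, not from the paper) excludes pairs that are too close
    in sequence to be informative.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    idx = np.arange(len(coords))
    sep = np.abs(idx[:, None] - idx[None, :])
    return (dist < cutoff) & (sep >= min_sep)

def calculated_phi(ts_coords, native_coords, cutoff=8.5, min_sep=3):
    """Per-residue ratio of native contacts present in the TS conformation
    to contacts present in the native structure."""
    native = contact_map(native_coords, cutoff, min_sep)
    ts = contact_map(ts_coords, cutoff, min_sep)
    n_native = native.sum(axis=1)
    n_preserved = (native & ts).sum(axis=1)
    phi = np.full(len(n_native), np.nan)  # undefined where no native contacts
    mask = n_native > 0
    phi[mask] = n_preserved[mask] / n_native[mask]
    return phi
```

Under these assumptions, a conformation identical to the native structure gives ϕ = 1 at every residue that has native contacts, and a fully extended chain gives ϕ = 0.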
Faísca et al.12 used a different approach, based on the folding probability pfold, to identify conformations in the TSE; the idea is that conformations in the TSE have equal probability to either fold or unfold.11 Starting from a random conformation, independent Monte Carlo (MC) simulations are carried out. If the structure folds before it unfolds in half of these independent MC runs, the initial conformation is identified as a member of the TSE.
In this work, we generate TSE conformations of AcP under the combined constraints of the experimental ϕ values, as studied by Vendruscolo et al.,40 and the pfold criterion,11 as implemented by Faísca et al.12 We use constrained sequential Monte Carlo to generate candidate conformations that satisfy all ϕ-value constraints. Markov chain Monte Carlo simulations are then carried out for each of the candidate conformations, and only the conformations with a folding probability of 0.5 are selected. Our main contribution is that, through further development of the constrained sequential Monte Carlo method first reported in Ref. 26, we ensure rigorous and efficient sampling of the whole space of the TSE under stringent constraints from both the ϕ values and the pfold model. This avoids the bias toward native conformations caused by inadequate sampling in molecular dynamics simulations, as well as the unsolved difficulty of assessing adequate mixing when applying Metropolis-type Monte Carlo sampling techniques.
This paper is organized as follows. In Sec. 2, we describe our method for generating conformations in the TSE of the protein AcP. Findings and interpretations of the generated TSE are reported in Sec. 3, followed by the conclusion section.
MODEL AND METHOD
Generating candidate conformations of TSE
We first generate a set of candidate conformations of the transition state ensemble of AcP that satisfy the constraints of all experimentally measured ϕ values at different positions of amino acid residues. Here we follow Vendruscolo et al.’s model of ϕ-value constraints.40 Specifically, our goal is to generate a proper set of conformations that are uniformly distributed in the model constrained space
| (2) |
where xn=(x1,...,xn) denotes a conformation of the protein, which has n residues; xi is the location of the i-th residue; $\phi_i^{\mathrm{exp}}$ and $\phi_i^{\mathrm{calc}}$ are the experimentally measured ϕ value and the calculated ϕ value of the i-th residue, respectively; and I is the set of residues whose ϕ values have been measured experimentally.
We consider a three-dimensional cubic lattice model, in which residues in conformation xn are located on the lattice sites with a unit length of 1.3 Å and satisfy the self-avoiding, bond-length, bond-angle, and torsion-angle constraints. It is based on an off-lattice four-state model, and on average there are 23 candidate positions for placing an additional residue to a partial chain.26 Two residues are defined to be in contact if the distance between them is less than 8.5 Å . Details of this lattice model and constraints are described in Refs. 26, 44. In this lattice model for protein AcP, the conformation that is closest to the native structure in terms of cRMSD has 88% native contacts preserved.
We use the sequential Monte Carlo technique to generate AcP structures. It is a growth-based method that can generate samples $x_n^{(j)}$, j = 1, …, m, that are properly weighted with respect to a given target distribution π(xn). The weights are calculated as $w^{(j)} = \pi(x_n^{(j)})/q(x_n^{(j)})$, where $q(x_n^{(j)})$ is the probability of generating the sample $x_n^{(j)}$. If this sampling distribution satisfies q(xn)>0 for all xn∈{xn|π(xn)>0}, the expectation of any function h(xn) under the target distribution π(xn) can be estimated by
$$E_\pi[h(x_n)] \approx \frac{\sum_{j=1}^{m} w^{(j)}\, h(x_n^{(j)})}{\sum_{j=1}^{m} w^{(j)}} \qquad (3)$$
In addition, the normalizing constant of the target distribution π(xn) in any set Ω, namely, the partition function in the case when the target distribution is the Boltzmann distribution, can be estimated using
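$$\sum_{x_n \in \Omega} \pi(x_n) \approx \frac{1}{m} \sum_{j=1}^{m} w^{(j)}\, I(x_n^{(j)} \in \Omega) \qquad (4)$$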
where I(·) is the indicator function: I(·)=1 if the statement represented by ( · ) is true, 0 otherwise.
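The two importance-sampling estimators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: expectations use the weight-normalized average, the normalizing constant uses the sample mean of weight times indicator (which requires the sampling distribution q to be normalized), and the function names are hypothetical.

```python
import numpy as np

def weighted_estimate(h_values, weights):
    """Self-normalized importance-sampling estimate of E_pi[h]."""
    h = np.asarray(h_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.sum(w * h) / np.sum(w)

def normalizing_constant(weights, in_set):
    """Estimate of the sum of pi(x) over a set Omega.

    weights: pi(x_j) / q(x_j) for each sample (q assumed normalized).
    in_set:  boolean indicator I(x_j in Omega) for each sample.
    """
    w = np.asarray(weights, dtype=float)
    ind = np.asarray(in_set, dtype=bool)
    return np.mean(w * ind)
```

As a sanity check, if the four states x = 0, 1, 2, 3 with unnormalized target exp(−x) are each drawn once from a uniform proposal (q = 1/4), both estimators reproduce the exact values.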
Lin et al.26 used a two-stage sequential Monte Carlo method to efficiently generate conformation samples properly weighted with respect to the uniform distribution in Ωϕ, that is, π(xn)∝I(xn∈Ωϕ). At the first stage, 5000 contact maps are sampled from the uniform distribution of all contact maps satisfying the ϕ-value constraints. Here each sample is a realization of an n × n symmetric contact map C={cij}n×n, where cij = 1 if residue i and residue j are in contact, and cij = 0 otherwise. At the second stage, for each contact map sample, 1000 properly weighted conformational samples satisfying this contact map are generated. For protein AcP, Fig. 1 shows the experimentally measured ϕ values and the weighted average of the calculated ϕ values of the generated conformation samples.
Figure 1.
Reproducing ϕ values of acylphosphatase (AcP). Experimentally measured ϕ values (Ref. 4) and calculated ϕ values obtained from conformation samples properly weighted with respect to the uniform distribution in Ωϕ are shown.
To reduce the number of candidate conformations, we cluster similar conformations together. First, we arrange all the conformations in a random order. Starting from an empty set of clusters, we add one conformation at a time, from the first conformation to the last. Each conformation is compared with all the current cluster representatives. If its cRMSD to every existing cluster representative is larger than a cutoff value, it becomes the first member of a new cluster; otherwise, it is grouped with the nearest cluster. The cutoff value for clustering used in this study is 2 Å. The weight of each cluster is the sum of the weights of all conformations in that cluster, and the representative structure of each cluster is chosen as the conformation with the largest weight in that cluster. For protein AcP, we obtained a total of 4185 clusters. The fraction of native contacts preserved in these clusters lies within a narrow range of width 0.15, namely, from 0.26 to 0.41. This is not surprising, given the strong ϕ-value constraints imposed.
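The one-pass clustering procedure can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation. Two points are assumptions: during the pass each new conformation is compared against the first member of each cluster (the text does not say which member serves as the representative while clusters are still growing), and the distance function is passed in by the caller (in the paper it would be cRMSD). The names `leader_cluster` and `distance` are hypothetical.

```python
import random

def leader_cluster(items, weights, distance, cutoff=2.0, seed=0):
    """One-pass clustering of weighted items in random order.

    An item starts a new cluster if it is farther than `cutoff` from every
    existing cluster leader; otherwise it joins the nearest cluster.
    Returns one dict per cluster with the summed weight and the index of
    the largest-weight member as the final representative.
    """
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    leaders = []   # index of the first member of each cluster (assumed leader)
    members = []   # member indices of each cluster
    for idx in order:
        dists = [distance(items[idx], items[lead]) for lead in leaders]
        if not dists or min(dists) > cutoff:
            leaders.append(idx)
            members.append([idx])
        else:
            members[dists.index(min(dists))].append(idx)
    clusters = []
    for mem in members:
        clusters.append({
            "representative": max(mem, key=lambda i: weights[i]),
            "weight": sum(weights[i] for i in mem),
            "members": mem,
        })
    return clusters
```

Because the pass is order dependent, different random orders can give different clusterings when clusters are not well separated; the fixed `seed` only makes a given run reproducible.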
Identifying conformations in TSE using Markov chain Monte Carlo
pfold estimated by Markov chain Monte Carlo
In addition to the constraints from measured ϕ values, we further adopt the pfold model introduced in Refs. 11 and 12, in which a transition state conformation has about equal probability to fold or unfold. According to Refs. 11 and 12, in a system with two stable states (the folded state and the unfolded state), the folding probability pfold of any conformation is defined as the probability that it will reach the folded state before reaching the unfolded state. pfold can be regarded as a measure of the kinetic distance between the given conformation and the folded state. It is therefore reasonable to assume that the conformations in the TSE have pfold = 0.5. Starting from a specific conformation, Faísca et al. calculate pfold of the conformation by recording the fraction of Markov chain Monte Carlo runs that reach the folded state before reaching the unfolded state.12 The conformations of the TSE are then obtained by selecting those conformations with pfold = 0.5. We follow this strategy to compute pfold for candidate conformations that satisfy the ϕ-value constraints.
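Briefly, we construct a Markov chain whose target equilibrium distribution π(zn) is the Boltzmann distribution defined by the Gō-potential as follows:34 We start with $z_n^{(0)} = x_n$, where xn is one of the candidate conformations. At each step t, a random move selected from a primitive move set is applied to $z_n^{(t)}$ to obtain a new conformation $z_n'$. $z_n'$ is accepted as $z_n^{(t+1)}$ with probability
$$\min\left\{1,\ \frac{\pi(z_n')\, T(z_n' \to z_n^{(t)})}{\pi(z_n^{(t)})\, T(z_n^{(t)} \to z_n')}\right\},$$
and we let $z_n^{(t+1)} = z_n^{(t)}$ otherwise. Here $T(z_n^{(t)} \to z_n')$ is the probability of moving from the current conformation $z_n^{(t)}$ to the new conformation $z_n'$.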
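The acceptance step described above has the form of a Metropolis-Hastings update, which can be sketched in Python as follows. This is a generic illustration, not the authors' code: it assumes the standard Metropolis-Hastings acceptance ratio with a Boltzmann target exp(−H/τ), and the names `mh_accept_prob`, `mh_step`, and the `propose` callback (which returns the new state together with the forward and backward proposal probabilities) are hypothetical.

```python
import math
import random

def mh_accept_prob(energy_cur, energy_new, q_forward, q_backward, tau):
    """min{1, [pi(new) q(new->cur)] / [pi(cur) q(cur->new)]}, pi ~ exp(-H/tau)."""
    log_ratio = (-(energy_new - energy_cur) / tau
                 + math.log(q_backward) - math.log(q_forward))
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

def mh_step(state, energy, propose, tau, rng):
    """One Markov chain step: propose a move, then accept or keep the state."""
    new_state, q_fwd, q_bwd = propose(state, rng)
    p = mh_accept_prob(energy(state), energy(new_state), q_fwd, q_bwd, tau)
    return new_state if rng.random() < p else state
```

Working with the logarithm of the ratio avoids overflow when the energy difference is large relative to τ.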
Regrowth move set
We use the primitive move set developed by Zhang et al.43 in this study. The primitive move randomly removes a fragment of the current conformation and regenerates the removed fragment to obtain a new conformation. The fragment is regenerated using sequential Monte Carlo under the constraint that the two ends of the fragment are fixed. If the removed fragment is at the tail of the conformation, only one end is fixed. The starting position of the fragment to be replaced is uniformly distributed along the full chain, and the fragment length is uniformly distributed between 5 and 12.
Folded and unfolded state
We assess whether a Markov chain at time t has reached the folded state or the unfolded state by criteria based on the number of native contacts preserved in the conformation at that time. We set two thresholds, Nfold and Nunfold, for the number of native contacts in a conformation. If the number of native contacts preserved in the conformation is larger than Nfold, the conformation is considered to be folded. If it is less than Nunfold, the conformation is considered to be unfolded.
The values of Nfold and Nunfold are determined as follows. For Nfold, we sample uniformly from the set of near native conformations (NNS) ΩNNS. Here we follow Ref. 44 and define the set of NNS as those conformations within 3 Å in cRMSD from the native structure. Nfold is defined as the threshold value of the number of native contacts such that only 5% of the conformations in ΩNNS have fewer than Nfold native contacts. For Nunfold, we sample uniformly from the set of denatured conformations ΩD, defined as the set of conformations with >10 Å in cRMSD from the native structure. Nunfold is defined as the threshold value of the number of native contacts such that only 5% of the conformations in ΩD have more than Nunfold native contacts. Since the majority of the conformations in the set of all possible conformations have cRMSD to the native structure >12 Å, the choice of the value of 10 Å for deriving Nunfold is not critical.
We use the sequential Monte Carlo method described in Ref. 44 to generate sets of properly weighted conformations in both ΩNNS and ΩD. Figure 2a shows the values of the two thresholds Nfold and Nunfold for protein AcP. They satisfy Nfold∕N = 0.65 and Nunfold∕N = 0.15, where N is the number of contacts in the native structure.
Figure 2.
Defining folded and unfolded states and selecting conformations with 0.5 probability of folding. (a) The thresholds (vertical dashed lines) of the fraction of native contacts preserved for the folded and unfolded states. (b) Number counts of Markov chain Monte Carlo runs that reach the folded state for the set of 4185 conformations. Each point represents one conformation. Only the conformations between the two horizontal lines are included in TSE.
The energy function in Markov chain Monte Carlo
In the Markov chain Monte Carlo runs, the energy function we use is the Gō-potential18
$$H(x_n) = \sum_{i<j} U(x_i, x_j),$$
where U(xi, xj) = −1 only if residues i and j are in contact, namely, |xi − xj| < 8.5 Å, in both conformation xn and the native structure, and U(xi, xj) = 0 otherwise. The equilibrium distribution of the Markov chain is π(xn)∝exp{−H(xn)∕τ}, where τ is the temperature parameter. Here we use slightly different definitions of the set of NNS and the set of denatured conformations, based on Nfold and Nunfold: ΩNNS is the set of conformations with more than Nfold native contacts, and ΩD is the set of conformations with fewer than Nunfold native contacts.
Following Ref. 12, the folding temperature is selected so that the folded structures and the denatured structures have equal probabilities in the equilibrium distribution. That is,
$$\sum_{x_n \in \Omega_{NNS}} \exp\{-H(x_n)/\tau\} = \sum_{x_n \in \Omega_{D}} \exp\{-H(x_n)/\tau\} \qquad (5)$$
For a given temperature τ, we again use the sequential Monte Carlo technique of Ref. 44 to estimate the values of both sides of Eq. 5. For protein AcP, the temperature is set to τ = 1.654, which makes the two sides of Eq. 5 equal.
We carry out 50 Markov chain Monte Carlo runs for each of the 4185 conformations that satisfy the ϕ-value constraints. We then test the null hypothesis pfold = 0.5. This null hypothesis is rejected if the statistical p value of the number of runs that lead to the folded state is less than 5%. Or equivalently, if the number of folded runs is less than 17 or larger than 33 among the 50 independent runs starting from xn, we reject the null hypothesis that pfold = 0.5, and this conformation xn is not included in the TSE. Otherwise, xn is included in the TSE. Figure 2b shows the number of runs that lead to the folded state for 4185 conformations. A total of 1501 conformations are included in the TSE.
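The selection rule above can be sketched in Python. The cutoffs of 17 and 33 folded runs out of 50 are taken directly from the text rather than derived. The p-value helper uses one common two-sided convention for an exact binomial test (doubling the smaller tail probability); this convention is an assumption and may not match the authors' test exactly. The function names are hypothetical.

```python
from math import comb

def binomial_two_sided_p(k, n=50, p=0.5):
    """Exact two-sided binomial p-value: twice the smaller tail, capped at 1.

    The doubling convention is an assumption made for this sketch.
    """
    pmf = [comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(n + 1)]
    lower = sum(pmf[:k + 1])   # P(X <= k)
    upper = sum(pmf[k:])       # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

def in_tse(n_folded, low=17, high=33):
    """Keep a conformation if its folded-run count lies in [low, high]."""
    return low <= n_folded <= high
```

For example, `in_tse(25)` keeps a conformation whose runs split evenly between folding and unfolding, while `in_tse(40)` discards one that folds in most runs.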
RESULTS
In this section, we study the physical properties of the conformations that form the TSE of the protein AcP, obtained using the aforementioned procedure. The generated samples representing the TSE of AcP consist of 1501 clusters of conformations, and each conformation is associated with a properly calculated weight with respect to the Boltzmann distribution π(xn)∝exp{−H(xn)∕τ}. Because the weights obtained in Sec. 2A are with respect to the uniform distribution in the constrained space Ωϕ, they have been adjusted by a multiplication factor exp{−H(xn)∕τ}.
TSE can be far away from the native state
We plot the distribution of cRMSD between TS conformations and the native state and the distribution of the fraction of native contacts preserved in TS conformations in Figs. 3a and 3b, respectively. The unweighted transition state ensemble has a large variation, with the cRMSD ranging from 6 to 14 Å and the fraction of native contacts preserved varying from 0.28 to 0.38. In contrast, the transition state ensemble weighted with respect to the Boltzmann distribution is much more homogeneous: the majority of conformations have a cRMSD of 9.4 Å and a fraction of native contacts preserved of 0.37. The difference between the unweighted and weighted TSE demonstrates that TS conformations are structurally diverse. Although the weighted TSE is much more homogeneous than the unweighted one, the average cRMSD remains large compared with the value of 6 Å reported in a previous work.40
Figure 3.
The distributions of cRMSD values of conformations satisfying the ϕ-value constraints and the distributions of the fraction of native contacts preserved. The distributions of cRMSDs for (a) the transition state ensemble ΩTSE of 1501 clusters of conformations, (c) the denatured-side ensemble ΩDS, and (e) the native-side ensemble ΩNS to the native conformation of protein acylphosphatase at different cRMSD distance intervals, and the distributions of the fraction of native contacts preserved at different intervals for (b) ΩTSE, (d) ΩDS, and (f) ΩNS. Both unweighted (white bar) and weighted (black bar) distributions are shown.
This difference is likely due to the fact that our method can access a much wider region of the severely constrained conformational space. As a result, TS conformations that are far away from the native state are successfully identified, and are represented proportionately with correct importance weights that adjust for the bias of using a sampling distribution that differs from the target distribution. The weights ensure that the importance of the TSE conformations that are far away from the native state is neither exaggerated nor underestimated. In fact, the characterization of the TSE is accurate for conformations of any nature, including those that are close to the native state.
To compare TS conformations with other conformations satisfying ϕ-value constraints, we divide the 4185 candidate conformations generated into three groups
That is, if the number of folded runs among the 50 independent Markov chain Monte Carlo simulations starting from xn is between 17 and 33, the conformation xn is considered to be in set of transition state ensemble ΩTSE. If the number of folded runs is less than 17, xn is considered to be in set ΩDS of denatured side (DS). If the number of folded runs is larger than 33, xn is considered to be in set ΩNS of native side (NS).
We plot the distribution of cRMSDs between the conformations in these three sets and the native conformation in Figs. 3a, 3c, 3e, and the distributions of the fraction of native contacts preserved in these three sets in Figs. 3b, 3d, 3f.
It is not surprising that the conformations in ΩDS have larger cRMSD to the native structure and fewer native contacts preserved than the conformations in ΩTSE. Similarly, as expected, we find that conformations in ΩNS have smaller cRMSD to the native structure and contain more native contacts than those in ΩTSE.
Although it appears that many conformations with lower cRMSD have small weights, as the weighted mean cRMSD is larger [e.g., Fig. 3e], and that there are low-energy conformations with large cRMSD that dominate the calculation of the mean cRMSD, we cannot conclude that conformations with higher cRMSD have lower energy in general. The conformations generated are from a strongly constrained region with both ϕ value and folding rate constraints imposed. As a result, the energies of conformations in this set are not significantly correlated with cRMSD. Figure 4 shows the plot of energy against cRMSD to the native state for the TSE conformations. There is little correlation between the energy and cRMSD of the TSE. The estimated correlation coefficient is −0.035, with a p-value of 0.173 for a two-sided t-test of zero correlation.
Figure 4.
Lack of correlation between energy and cRMSD of conformations in the TSE.
It is informative to examine possible residual secondary structures in the transition state ensemble. AcP protein contains the following secondary structures: β1 (residues 7–13), α1 (residues 22–33), β2 (residues 36–42), β3 (residues 46–53), α2 (residues 55–66), β4 (residues 77–85), and β5 (residues 93–97).
Figure 5 shows the distribution of cRMSDs between fragments of secondary structures in the weighted TSE and in the native conformation. We find that, although the native secondary structures are in general not well preserved in the TSE, fragments of native β-sheets are more enriched in the TSE than α-helices. This is consistent with a previous study.40
Figure 5.
The distributions of cRMSD between the secondary structures in the weighted TS conformations and in the native state of protein acylphosphatase. (a) Helix α1 (residues 22–33, white bar) and helix α2 (residues 55–66, black bars); (b) Strand β3 (residues 46–53, white bar), strand β4 (residues 77–85, black bars).
It has been suggested that the topology of the transition state of AcP is defined by the relative positions of just three “key” residues, Y11, P54, and F94.40 We have carried out an additional study using only the ϕ values at these three key residues as constraints. We find that the ϕ values of the other residues can be largely recovered from conformations generated using constraints at the three key residues alone [Fig. 6a]. The correlation coefficient between the calculated ϕ values of all residues obtained using constraints at the three key residues and those obtained using constraints at all 24 residues is 0.79. However, the ensemble of conformations generated has overall much larger cRMSD to the native conformation when only three constraints are used [Fig. 6b].
Figure 6.
The recovery of overall ϕ-values and resulting larger cRMSD values of conformations generated with ϕ-values constrained only at three key residues of Y11, P54, and F94. (a) Experimentally measured ϕ-values and calculated ϕ-values obtained from conformation samples satisfying the ϕ-value constraints of three key residues only. (b) The distributions of cRMSD values of conformations satisfying the ϕ-value constraints of three key residues (white bar) and 24 residues (black bar).
Correlation between point-wise distances and ϕ values
We define the point-wise distance of residue i between a conformation and the native conformation as the Euclidean distance between the locations of residue i after optimal rigid superposition of these two conformations. The average point-wise distance of each residue between the weighted TSE and the native state conformations is shown in Fig. 7.
Figure 7.
The average point-wise distances of residues between the weighted TSE and the native conformation of protein acylphosphatase. The three circles mark the three key residues identified by Vendruscolo et al. (Ref. 40), which have large experimentally measured ϕ values. They have overall small point-wise distances.
For the 24 residues in AcP with experimentally measured ϕ values, the correlation between the ϕ values and the corresponding point-wise distances is –0.574, with a p-value of 0.0017 for testing zero correlation by a one-sided t-test. The correlation between the calculated ϕ values of all residues and the corresponding point-wise distances is –0.502, with a p-value of 6.93 × 10⁻⁸. These observations can be rationalized by the physical model of the ϕ values. If the ϕ value is large, the structure of the TSE around the residue is close to the native state, and thus the corresponding point-wise distance is small, with many physical contact constraints reflected by the high ϕ value. If the ϕ value is small, the structure of the TSE around the residue is disrupted, and the corresponding point-wise distance is therefore large.
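The point-wise distance defined in this section requires an optimal rigid superposition of a conformation onto the native structure. The Python sketch below uses the Kabsch algorithm for that step. The choice of algorithm and the function name `pointwise_distances` are assumptions for illustration; the source does not say how the superposition was computed.

```python
import numpy as np

def pointwise_distances(conf, native):
    """Per-residue distance after optimal rigid superposition (Kabsch).

    conf, native: (n, 3) arrays with one coordinate per residue.
    """
    p = np.asarray(conf, dtype=float)
    q = np.asarray(native, dtype=float)
    p0 = p - p.mean(axis=0)              # remove translation
    q0 = q - q.mean(axis=0)
    h = p0.T @ q0                        # 3 x 3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the transform is a proper rotation.
    d = 1.0 if np.linalg.det(u @ vt) > 0 else -1.0
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = p0 @ rot                   # rotate conf onto native
    return np.linalg.norm(aligned - q0, axis=1)
```

The cRMSD of the whole conformation follows from the same quantities as `np.sqrt(np.mean(pointwise_distances(conf, native) ** 2))`.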
Contact order of TSE
Contact order has been widely used to study the correlation between protein native structures and protein folding rates.30, 45 It is defined as the average sequence separation between residues in contact. We examine the distribution of all native contacts preserved in the weighted TSE at different residue separations in Fig. 8a. For comparison, the distribution of residue separation for the native conformation is also shown in Fig. 8b.
Figure 8.
The fractions of preserved native contacts with different sequence separations of protein acylphosphatase for (a) the weighted TSE and (b) the native conformation. Bins 1–11 correspond to sequence separations of 4, 5, 6–10, 11–20, 21–30, 31–40, 41–50, 51–60, 61–70, 71–80, and 81–90, respectively.
We find that the average contact order of native contacts preserved in the weighted TSE is 33.2, while the contact order for the native state is 37.3. Our result shows that there are fewer long-range contacts in the TSE. That is, long-range native contacts often form after protein chains have departed from the transition state.
Paci et al. provided a detailed study of the contact order of the TSE for ten proteins.28 They added an energy term based on the RMSD in ϕ values to the molecular mechanics energy function. The contact order of the TSE reported here is somewhat different. This is likely due to the difference in the potential function used. Detailed information on how the contact order of the TSE is related to the protein folding rate can be found in Ref. 28. A study based on a modified concept called geometric contact showed that both two-state and multistate protein folding rates are well correlated with the native state topology.45 A detailed theoretical study we have carried out on enumerated 2D hydrophobic-hydrophilic (HP) sequences suggests that the folding rates of model proteins with the same native state can differ by a factor of 1000, and that the observed correlation between folding rate and native state topology in real proteins may be a consequence of evolutionary selection.20
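The contact order used here, the average sequence separation over contacts, can be computed from a contact map as in the following Python sketch. The function names are hypothetical, and the sketch treats a single conformation; how the per-conformation values are combined into the weighted ensemble average is not specified in the text and is left out.

```python
import numpy as np

def contact_order(contacts):
    """Average sequence separation |i - j| over contact pairs (i < j)."""
    c = np.asarray(contacts, dtype=bool)
    i, j = np.nonzero(np.triu(c, k=1))   # count each pair once
    if i.size == 0:
        return float("nan")
    return float(np.mean(j - i))

def preserved_native_contacts(ts_contacts, native_contacts):
    """Contacts present in both the TS conformation and the native structure."""
    return np.asarray(ts_contacts, dtype=bool) & np.asarray(native_contacts, dtype=bool)
```

A lower value for the preserved native contacts than for the full native map indicates that the missing contacts are predominantly long range.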
The first passage time
We now estimate the first passage time (FPT), which is defined as the average time required for a conformation in the transition state to fold into its native state. Because the number of Markov moves required for a conformation to fold depends on the specific details of the move set, it usually does not reflect the true physical time required for folding. To estimate the time required for a transition state conformation to fold, we use Langevin dynamics simulations to estimate the physical time that each Markov move takes.
Given the number of residues (L = 5, …, 12) in the regrown fragment and the end-to-end distance r of the fragment ends, we perform simulations to estimate the traveling time between different fragment configurations that have the same number of residues L and the same end-to-end distance r. Here we discretize the end-to-end distance r into bins with boundaries at r = 1.5, 2.0, 2.5 Å, ….
Simulation of physical movement of fragment
For a fragment x of length L and end-to-end distance r, we run Langevin dynamics simulations to sample its conformations and calculate the transition time between different conformational clusters. That is, we aim to provide a physically relevant time scale for each elementary Monte Carlo move that transforms the conformation of a fragment. Since our goal is to assess the physical time of the movement or diffusion of a fragment, we fix its two ends and measure the time required to transform the conformation of the fragment from xL(t1) to xL(t2). Here we use a simplified model in which the residues in the fragment are treated as connected beads that are allowed to move freely in space, subject to the constraints imposed by the other residue beads in the fragment through several types of interactions, including the bond interaction, the angle interaction, and the van der Waals interaction. The motion of the system is simulated using Langevin dynamics, where the equation governing the motion of all residues in the fragment is21
$$\ddot{x}(t) = f(x,t) - \gamma\,\dot{x}(t) + \alpha\,\varepsilon(t) \qquad (6)$$
where x(t) is the position vector of the residues at time t, γ is the friction constant, f(x,t) is the conformational force per unit mass, and α is a constant defined as $\alpha = \sqrt{2\gamma k_B T/m}$, in which T is the temperature and m is the mass. ε(t) is the Gaussian random force at time t, with autocorrelation function ⟨ε(t),ε(t′)⟩=δ(t−t′), where δ(t) is the delta function. Here we have γ = 0.05 τ⁻¹, with $\tau = \sqrt{m l^2/e}$ being the time unit of the simulation; m = 1 is the mass unit, l = 3.8 Å is the length unit, and e = 1 is the energy unit. Veitshans et al. provided a discussion on the choice of the value of the friction constant γ.39
For each combination of L and r, we start from a chain in an extended initial conformation and an initial velocity vector in Gaussian form, in which each of the 3L vector components is sampled from the Gaussian distribution N(0,1), which is then scaled by a factor of . The simulation is run for 10⁹ time steps, where each time step is set to , with mass m = 1, length scale a = 3.8 Å, and the reference energy scale ε0 = 1.
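As a minimal sketch of how such a trajectory could be generated, the Python fragment below integrates a Langevin equation of the form ẍ = f − γẋ + αε for a bead chain whose two ends are held fixed. Several ingredients are illustrative assumptions rather than the settings used in this work: the Euler-Maruyama update, the harmonic bond force that stands in for the bond, angle, and van der Waals terms, the time step `dt`, the temperature `kT`, the reduced unit bond length, and the initial velocity scale sqrt(kT/m).

```python
import numpy as np

def bond_force(x, bond_length=1.0, k_bond=100.0):
    """Harmonic bond force between consecutive beads (placeholder force field)."""
    f = np.zeros_like(x)
    d = x[1:] - x[:-1]                              # bond vectors, shape (L-1, 3)
    r = np.linalg.norm(d, axis=1, keepdims=True)    # bond lengths
    pull = k_bond * (r - bond_length) * d / r       # force on bead l toward bead l+1
    f[:-1] += pull
    f[1:] -= pull
    return f

def langevin_step(x, v, dt, gamma, kT, rng, m=1.0):
    """One Euler-Maruyama step of x'' = f - gamma * x' + alpha * eps."""
    alpha = np.sqrt(2.0 * gamma * kT / m)           # noise amplitude (assumed form)
    noise = rng.standard_normal(x.shape)
    accel = bond_force(x) / m - gamma * v
    v_new = v + accel * dt + alpha * np.sqrt(dt) * noise
    x_new = x + v_new * dt
    # Keep the two fragment ends fixed in space.
    x_new[0], x_new[-1] = x[0], x[-1]
    v_new[0] = 0.0
    v_new[-1] = 0.0
    return x_new, v_new

def run_fragment(L=8, n_steps=1000, dt=0.005, gamma=0.05, kT=1.0, seed=0):
    """Simulate an initially extended chain of L beads; return all positions."""
    rng = np.random.default_rng(seed)
    x = np.zeros((L, 3))
    x[:, 0] = np.arange(L)                          # extended chain, unit bond length
    v = rng.standard_normal((L, 3)) * np.sqrt(kT)   # N(0,1) scaled, with m = 1
    v[0] = v[-1] = 0.0
    traj = []
    for _ in range(n_steps):
        x, v = langevin_step(x, v, dt, gamma, kT, rng)
        traj.append(x.copy())
    return np.array(traj)
```

A call such as `run_fragment(L=8, n_steps=1000)` returns an array of shape `(n_steps, L, 3)`; in a production setting the frames after the burn-in period would be passed on to the clustering step.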
The first 2 × 10⁸ steps are treated as the burn-in period, and the fragment conformations generated during this period are discarded. The fragment conformations beyond the burn-in period are clustered as follows. Each time a conformation is sampled, it is compared with all the cluster representatives generated in previous steps. If its distance from every representative exceeds a cutoff threshold, it becomes the representative of a new cluster; otherwise, it is assigned to its nearest cluster. Here the distance between two fragments xL(t1) at time t1 and xL(t2) at time t2 is calculated as
| (7) |
in which |xl(t1) − xl(t2)| is the Euclidean distance between the two position vectors xl(t1) and xl(t2) of the l-th residue. The cutoff used in the clustering is 5 Å.
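The sequential clustering rule described above can be sketched as follows. The function names are hypothetical, and the distance function is an assumption: it takes the root mean square of the per-residue Euclidean distances without superposition (the fragment ends are fixed), which may differ from the exact form of Eq. (7).

```python
import math

def fragment_distance(a, b):
    """Assumed distance: RMS of per-residue Euclidean distances, no superposition."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))

def online_cluster(conformations, cutoff=5.0):
    """Assign each conformation, in order, to the nearest representative,
    or open a new cluster if every representative is farther than `cutoff`."""
    representatives = []   # first member of each cluster
    labels = []            # cluster index of every conformation
    for conf in conformations:
        dists = [fragment_distance(conf, rep) for rep in representatives]
        if not dists or min(dists) > cutoff:
            representatives.append(conf)
            labels.append(len(representatives) - 1)
        else:
            labels.append(dists.index(min(dists)))
    return representatives, labels
```

Because representatives are fixed once created, each new conformation costs one distance evaluation per existing cluster, which keeps the procedure usable on long trajectories.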
Markovian assumption and the estimation of traveling time
Suppose S clusters are obtained. We treat each cluster as a state, and use state i to denote the i-th cluster, which is characterized by its representative structure. Let I be the total number of time steps of the trajectory beyond the burn-in period, Ii be the number of times state i is observed, and Iij be the number of times state i is immediately followed by state j in the next time step. We define $p_i = I_i/I$, which represents the probability of the fragment being in state i, and $p_{ij} = I_{ij}/I_i$, which represents the transition probability from state i to j.
The average duration ξi that the state sequence of the simulation trajectory {xL(t)} stays in state i can be estimated as
$$\xi_i = \frac{1}{1 - p_{ii}}, \tag{8}$$
where $1 - p_{ii}$ represents the probability that the fragment moves away from state i.
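These quantities can be estimated directly from the sequence of state labels. The sketch below is illustrative only: it computes the occupancy Ii/I, the transition frequency Iij/Ii, and a mean dwell time taken as 1/(1 − p_ii), the mean of a geometric waiting time measured in time steps. That dwell-time form and the function name are assumptions.

```python
from collections import Counter

def state_statistics(states):
    """Estimate occupancies, transition probabilities, and mean dwell times
    (in time steps) from a sequence of state labels."""
    total = len(states)
    visits = Counter(states)                       # I_i
    pairs = Counter(zip(states[:-1], states[1:]))  # I_ij
    p_state = {i: n / total for i, n in visits.items()}
    # Normalized by I_i as in the text; the last frame has no successor,
    # which is negligible for long trajectories.
    p_trans = {(i, j): n / visits[i] for (i, j), n in pairs.items()}
    dwell = {}
    for i in visits:
        p_stay = p_trans.get((i, i), 0.0)
        if p_stay < 1.0:                           # skip states never left
            dwell[i] = 1.0 / (1.0 - p_stay)
    return p_state, p_trans, dwell
```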
To estimate the average time ξji that the state sequence, having entered state j (j ≠ i), takes to travel from state j to state i, we analyze the time trajectory. If the state sequence {xL(t)} leaves state i at step t0 and then re-enters state i at step t1, we record the first time that {xL(t)} enters state j after t0 but before t1 as t(j). The traveling time is then recorded as t1 − t(j). As many such traveling times can be recorded from one simulation trajectory, we take their average value as the travel time ξji. An illustration of the counting is shown in Fig. 9a.
Figure 9.
Estimating the first passage time (FPT) to folded structure and correlation between FPT and fraction of native contacts among TSE. (a) An illustration of counting the first passage time as t1 − t(j). (b) The average FPT of conformations in TSE of AcP. Each point represents a transition state conformation.
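The counting procedure for the traveling time can be sketched as below. This is an illustrative reading of the description, not the code used in this work: for every excursion away from a destination state, the first entry time t(j) of each visited state j is stored, and t1 − t(j) is recorded when the trajectory returns at step t1. Excursions that begin before the first visit to the destination state are ignored, which is an assumption.

```python
from collections import defaultdict

def counted_travel_times(states, target):
    """Average counted traveling time from each state j to `target`,
    measured in time steps along the state sequence."""
    samples = defaultdict(list)
    first_entry = None   # None until the trajectory has visited `target` once
    for t, s in enumerate(states):
        if s == target:
            if first_entry:                       # an excursion just ended at t1 = t
                for j, t_j in first_entry.items():
                    samples[j].append(t - t_j)    # t1 - t(j)
            first_entry = {}                      # start watching for a new excursion
        elif first_entry is not None:
            first_entry.setdefault(s, t)          # keep only the first entry into j
    return {j: sum(v) / len(v) for j, v in samples.items()}
```

For the short sequence `[0, 1, 1, 2, 0, 2, 0]` with destination state 0, state 1 is first entered at step 1 and the return occurs at step 4, giving a traveling time of 3; state 2 contributes two samples of 1 each.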
If we assume that the sequence {xL(t)} obtained from simulation after clustering is a Markov chain, we can alternatively calculate ξji, j ≠ i, for each state i by solving a system of linear equations.
Figure 10 shows the frequency of different states in the MD simulation and the comparison between the transition time calculated through counting the simulated MD sequence, namely, the counted traveling time, and that through solving the linear equations, namely, the calculated traveling time. From Fig. 10 we observe that the ratio of the counted and the calculated traveling times is close to one, except for those states with very few observations. The generally good agreement between these two approaches suggests that a Markovian state model is reasonable for the majority of state transitions.
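For illustration, the calculated traveling time can be obtained from a transition matrix with the standard first-passage relation of a discrete-time Markov chain: for a fixed destination state i, ξ_ji = 1 + Σ_{k≠i} p_jk ξ_ki for all j ≠ i. Whether this matches the exact form of the linear equations used here is an assumption; the function name is hypothetical and times are in units of time steps.

```python
import numpy as np

def calculated_travel_times(P, target):
    """Mean first-passage times to `target` from every other state,
    obtained by solving (I - Q) xi = 1, where Q is the transition
    matrix P with the row and column of `target` removed."""
    P = np.asarray(P, dtype=float)
    others = [j for j in range(P.shape[0]) if j != target]
    Q = P[np.ix_(others, others)]
    xi = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    return dict(zip(others, xi))
```

Dividing the counted traveling time by the value returned here for the same pair of states gives the kind of ratio examined in Fig. 10; rarely visited states have poorly estimated rows of P, so larger deviations are expected for them.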
Figure 10.
The agreement of counted and calculated traveling times between states. For the fixed number of residues L = 11 and the end-to-end distance r = 5 Å, this figure shows: (a) The frequency of different states in the trajectory of the simulation; and the ratio of the counted traveling time to the calculated traveling time for fixed destination state (b) 5, (c) 40, and (d) 75. Except for rarely observed states, counted and calculated traveling times agree well with each other.
Physical time for the regrowth moves
After obtaining the traveling time ξij, the time each Markov move takes is estimated as follows. Suppose the Markov chain moves from the current fragment to a new fragment. First, we assign the current fragment to the state i whose representative structure is the closest to it in terms of cRMSD. If the proposed move is rejected, this move takes time ξi. If the move is accepted, we assign the new fragment to the state j whose representative structure is the closest to it in terms of Euclidean distance. This successful move takes time ξij − ξi, as we assume that the fragment stays in state i for time ξi on average before it moves to state j.
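The bookkeeping for a single regrowth move can be sketched as follows. The function names and the dictionary layout of `dwell` (ξi) and `travel` (ξij) are hypothetical. Charging ξi when an accepted move stays within the same state is an assumption, since that case is not specified above.

```python
def nearest_state(fragment, representatives, distance):
    """Index of the representative structure closest to `fragment`."""
    return min(range(len(representatives)),
               key=lambda i: distance(fragment, representatives[i]))

def move_time(i, j, accepted, dwell, travel):
    """Physical time charged to one Monte Carlo regrowth move.

    i, j     -- states of the current and the proposed fragment
    accepted -- whether the proposed move was accepted
    dwell    -- dict mapping state i to its mean dwell time xi_i
    travel   -- dict mapping (i, j) to the traveling time xi_ij
    """
    if not accepted or i == j:       # rejected move (or same state: assumption)
        return dwell[i]
    return travel[(i, j)] - dwell[i]
```

Summing `move_time` over all moves of a Monte Carlo run then yields an estimate of the physical time elapsed along that run.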
Conformations in TSE have diverse structures but share similar characteristic folding time
Figure 9b plots the average FPT for each conformation in the TSE against the fraction of native contacts in this conformation. The correlation coefficient between the folding time and the fraction of native contacts preserved for the conformations in the TSE is −0.068 (the p-value for testing zero correlation is 0.0041). Hence the folding time and the nativeness of the conformation are not strongly correlated.
For the unweighted TSE, the standard deviation of the average FPT between different conformations is 1.26 × 10⁸ units of time. For comparison, we computed the standard deviation of the FPT for each conformation over different Markov chain Monte Carlo runs, and the average is 3.77 × 10⁸ units of time. For the weighted TSE, these values are 1.98 × 10⁸ and 2.59 × 10⁸, respectively. The variation of the average FPT among different TS conformations is therefore small: it is comparable with, and in both cases smaller than, the variation of the FPT among different Monte Carlo runs starting from the same TS conformation. This result shows that, for the protein AcP, although the conformations in the TSE are structurally diverse and far away from the native state, they have very similar physical folding times. One possible reason is that, as demonstrated by Fig. 3, all conformations in the TSE have relatively high energy, and therefore may quickly fold to conformations of low energy. As a result, these structurally diverse conformations exhibit similar folding times. Figure 11a plots the average first passage time of TS conformations in different intervals of cRMSD distance to the native structure. It shows that, for TS conformations, the average first passage time does not change much as the cRMSD distance to the native structure increases.
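The comparison made above between the spread of the average FPT across conformations and the run-to-run spread for a single conformation can be computed as in the following sketch. It covers only the unweighted case, uses the sample standard deviation, and assumes that the FPT values of all runs are available per conformation; the function name and data layout are hypothetical.

```python
import statistics

def fpt_spread(fpt_runs):
    """Compare FPT variation between and within conformations.

    fpt_runs -- list with one entry per TS conformation, each entry being
                the list of FPT values from its independent MCMC runs.
    Returns (between, within): the standard deviation of the per-conformation
    mean FPT, and the average of the per-conformation standard deviations.
    """
    means = [statistics.fmean(runs) for runs in fpt_runs]
    between = statistics.stdev(means)
    within = statistics.fmean([statistics.stdev(runs) for runs in fpt_runs])
    return between, within
```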
Figure 11.
First passage time to folded structures and distance in cRMSD to the native structure. (a) The average first passage time of transition state conformations with different cRMSD distance to the native structure. For comparison, (b), (c), and (d) plot the distributions of the first passage time of the conformations in ΩTSE, ΩDS, and ΩNS, respectively. Both unweighted (white bar) and weighted (black bar) distributions are shown.
We compare the first passage time for the conformations in ΩDS and ΩNS with that of the TS conformations. Note that these groups of conformations are defined by whether they will first fold or unfold according to the pfold criterion, without consideration of their kinetic behavior. The distributions of the first passage time for the conformations in these three sets are plotted in Figs. 11b, 11c, and 11d. As expected, compared with the conformations in the TSE, the average folding time of the conformations in ΩDS is longer, and the average folding time of the conformations in ΩNS is shorter.
DISCUSSION AND CONCLUSIONS
In this study, we have further developed the constrained sequential Monte Carlo method for sampling conformations of the transition state ensemble of protein folding. Our approach can generate rigorously unbiased samples for a specified target distribution satisfying experimentally measured parameters such as ϕ values. It can access a much wider space of conformations than other methods, and hence leads to the generation of more diverse conformations. When combined with Markov chain Monte Carlo with physically mapped transition time, it allows us to explicitly generate conformations of the TSE satisfying both the ϕ-value measurements and the pfold criterion.
Our method was applied to study the TSE of the protein acylphosphatase, which has 98 residues. We found that the transition state conformations are diverse, and can be far away from the native state. Although in general native secondary structures are not well-conserved, fragments of native beta sheets are more enriched in the TSE than alpha helices. In addition, we found that long range native contacts are formed only after the formation of TSE. Despite the significant diversity in structures, all TS conformations have similar folding time.
As demonstrated by Cavalli et al.,3 there is a strong tendency for the outcome of pfold analysis to depend on the potential function. It is expected that the Gō potential may introduce a strong bias toward the native state. This would enable structures far away from the native state to have pfold = 0.5. However, the more heterogeneous nature of the TSE found in this study is most likely due to the improved simulation method employed, and less likely a consequence of the Gō potential used. The Gō approach is also used in the study of Vendruscolo et al.,40 in which the potential is a function of the RMS deviation of ϕ values from the native state. The current results are obtained under settings comparable to those of these prior studies. Hence the more heterogeneous nature of the TSE is indeed a novel finding of this study.
A challenge in constrained sequential Monte Carlo sampling is to identify an efficient approximating trial distribution q(xn) in a high-dimensional and strongly constrained space. To reduce the variance of the estimates, we use the carefully designed growth potential described in Ref. 26 to generate conformations. In addition, a large sample size (5,000,000) is used to improve the accuracy of estimation.
Since our goal is to access the wide conformational space that satisfies all ϕ-value constraints, we use the uniform distribution in the constrained space as the target distribution π(x). The growth potential is also designed for the uniform target distribution, and is well suited for this purpose. Nevertheless, when we reweight the conformations by the Boltzmann factor under the Gō potential, the weights of the generated samples become skewed. As Gō potential models themselves are artificial constructs, it is appropriate to study the natural underlying shapes of the transition state ensemble, which follow the uniform distribution. With this goal in mind, the TSE conformations are generated uniformly from the space with constrained ϕ values.
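To illustrate how Boltzmann reweighting can skew a set of weights, the sketch below multiplies the weights of samples drawn for a uniform target by exp(−E/kT) and reports the effective sample size (Σw)²/Σw², a common diagnostic of weight degeneracy. The diagnostic, the function names, and the value of `kT` are illustrative assumptions and are not taken from this study.

```python
import math

def boltzmann_reweight(base_weights, energies, kT=1.0):
    """Reweight samples from a uniform target by the Boltzmann factor.
    Energies are shifted by their minimum for numerical stability;
    the returned weights are normalized to sum to one."""
    e_min = min(energies)
    raw = [w * math.exp(-(e - e_min) / kT)
           for w, e in zip(base_weights, energies)]
    total = sum(raw)
    return [w / total for w in raw]

def effective_sample_size(weights):
    """Effective number of samples: equals n for equal weights and
    approaches 1 when a single sample dominates."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

With equal energies the effective sample size equals the number of samples, while a single low-energy conformation among high-energy ones drives it toward one, which is the kind of skew described above.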
In our clustering method, the choice of the representative structures is important because the distance of a conformation to the representatives is used for classification. Although the clustering results may depend on the order in which the conformations are generated, the representative structures are always chosen as those with the largest weights in the clusters, regardless of the ordering of the conformations. In addition, by carefully choosing the criterion of cluster distance, conformations are all well separated. We therefore expect that our clustering method is not overly sensitive to differences in the ordering of the conformations. To confirm this, we carried out the following study. We first order the conformations by their weights, then perform clustering sequentially from the conformation with the largest weight to the one with the smallest weight. This approach resulted in 3897 clusters, compared to the 4185 clusters obtained with random ordering. Figure 12 reports the unweighted distributions of cRMSD values and fractions of native contacts preserved for clusters obtained under both orderings. It can be seen that the two orderings produce very similar results.
Figure 12.
Effects of different orderings of conformations on clustering. (a) The distributions of cRMSD values of representative conformations of clusters obtained when conformations are ordered by weights (white bar) and when they are randomly ordered (black bar). (b) The distributions of fractions of native contacts for clusters obtained using conformations ordered by weights (white bar) and using randomly ordered conformations (black bar). Overall, these distributions are very similar.
ACKNOWLEDGMENTS
We thank Dr. Eugene Shakhnovich for suggesting the pfold model, Drs. Ulrich Hansmann and Hagai Meirovich for helpful discussions, and two anonymous referees for their constructive comments. This work is supported by NIH Grants GM079804 and GM081682, and NSF Grants DBI-0646035, DMS-0800257, DMS-0800183, DMS-0915139, and DMS-0905763.
References
- Dill K. A., Ozkan S. B., Shell M. S., and Weikl T. R., Annu. Rev. Biophys. 37, 289 (2008). 10.1146/annurev.biophys.37.092707.153558
- Pande V. S., Grosberg A. Yu., Tanaka T., and Rokhsar D. S., Curr. Opin. Struct. Biol. 9, 68 (1998). 10.1016/S0959-440X(98)80012-2
- Shakhnovich E., Chem. Rev. 106, 1559 (2006). 10.1021/cr040425u
- Bartolini M. and Andrisano V., ChemBioChem 11, 1018 (2010). 10.1002/cbic.200900666
- Calosci N., Chi C. N., Richter B., Camilloni C., Engstrom A., Eklund L., Travaglini-Allocatelli C., Gianni S., Vendruscolo M., and Jemth P., Proc. Natl. Acad. Sci. U.S.A. 105, 19241 (2008). 10.1073/pnas.0804774105
- Fersht A. R., Itzhaki L. S., elMasry N. F., Matthews J. M., and Otzen D. E., Proc. Natl. Acad. Sci. U.S.A. 91, 10426 (1994). 10.1073/pnas.91.22.10426
- Geierhaas C. D., Salvatella X., Clarke J., and Vendruscolo M., Prot. Eng. Des. Sel. 21, 215 (2008). 10.1093/protein/gzm092
- Royer C. A., Arch. Biochem. Biophys. 469, 34 (2008). 10.1016/j.abb.2007.08.022
- Varnai P., Dobson C. M., and Vendruscolo M., J. Mol. Biol. 377, 575 (2008). 10.1016/j.jmb.2008.01.012
- Fersht A. R. and Sato S., Proc. Natl. Acad. Sci. U.S.A. 101, 7976 (2004). 10.1073/pnas.0402684101
- Sato S., Religa T. L., Daggett V., and Fersht A. R., Proc. Natl. Acad. Sci. U.S.A. 101, 6952 (2004). 10.1073/pnas.0401396101
- Deechongkit S., Nguyen H., Powers E. T., Dawson P. E., Gruebele M., and Kelly J. W., Nature (London) 430, 101 (2004). 10.1038/nature02611
- Hedberg L. and Oliveberg M., Proc. Natl. Acad. Sci. U.S.A. 101, 7606 (2004). 10.1073/pnas.0308497101
- Fersht A. R. and Daggett V., Cell 108, 573 (2002). 10.1016/S0092-8674(02)00620-7
- Klimov D. K. and Thirumalai D., Proteins: Struct., Funct., Genet. 43, 465 (2001). 10.1002/prot.1058
- Day R. and Daggett V., Protein Sci. 14, 1242 (2005). 10.1110/ps.041226005
- Ding F., Guo W. H., Dokholyan N. V., Shakhnovich E. I., and Shea J. E., J. Mol. Biol. 350, 1035 (2005). 10.1016/j.jmb.2005.05.017
- Prompers J. J. and Brüschweiler R., J. Am. Chem. Soc. 124, 4522 (2002). 10.1021/ja012750u
- Zagrovic B., Snow C., Khaliq S., Shirts M., and Pande V. S., J. Mol. Biol. 323, 153 (2002). 10.1016/S0022-2836(02)00888-4
- Mueller G. A., Choy W. Y., Yang D., Forman-Kay J. D., Venters R. A., and Kay L. E., J. Mol. Biol. 300, 197 (2000). 10.1006/jmbi.2000.3842
- Korzhnev D. M., Salvatella X., Vendruscolo M., Di Nardo A. A., Davidson A. R., Dobson C. M., and Kay L. E., Nature (London) 430, 586 (2004). 10.1038/nature02655
- Religa T. L., Markson J. S., Mayor U., Freund S. M., and Fersht A. R., Nature (London) 437, 1053 (2005). 10.1038/nature04054
- Richter B., Gsponer J., Varnai P., Salvatella X., and Vendruscolo M., J. Biomol. NMR 37, 117 (2007). 10.1007/s10858-006-9117-7
- Paci E., Lindorff-Larsen K., Dobson C. M., Karplus M., and Vendruscolo M., J. Mol. Biol. 352, 495 (2005). 10.1016/j.jmb.2005.06.081
- Vendruscolo M., Paci E., Dobson C. M., and Karplus M., Nature (London) 409, 641 (2001). 10.1038/35054591
- Lazaridis T. and Karplus M., Science 278, 1928 (1997). 10.1126/science.278.5345.1928
- Li A. and Daggett V., Proc. Natl. Acad. Sci. U.S.A. 91, 10430 (1994). 10.1073/pnas.91.22.10430
- Fersht A. R., Leatherbarrow R. J., and Wells T. N., Biochemistry 26, 6030 (1987). 10.1021/bi00393a013
- Winter G., Fersht A. R., Wilkinson A. J., Zoller M., and Smith M., Nature (London) 299, 756 (1982). 10.1038/299756a0
- Faísca P. F. N., Travasso R. D. M., Ball R. C., and Shakhnovich E. I., J. Chem. Phys. 129, 095108 (2008). 10.1063/1.2973624
- Du R., Pande V. S., Grosberg A. Y., Tanaka T., and Shakhnovich E. I., J. Chem. Phys. 108, 334 (1998). 10.1063/1.475393
- Lin M., Lu H., Chen R., and Liang J., J. Chem. Phys. 129, 094101 (2008). 10.1063/1.2968605
- Zhang J., Lin M., Chen R., Liang J., and Liu J. S., Proteins 66, 61 (2007). 10.1002/prot.21203
- Chiti F., Taddei N., White P. M., Bucciantini M., Magherini F., Stefani M., and Dobson C. M., Nat. Struct. Biol. 6, 1005 (1999). 10.1038/14890
- Cho S. S., Levy Y., and Wolynes P. G., Proc. Natl. Acad. Sci. U.S.A. 103, 586 (2006). 10.1073/pnas.0509768103
- Robert C. P. and Casella G., Monte Carlo Statistical Methods (Springer-Verlag, New York, 1999).
- Zhang J., Kou S. C., and Liu J. S., J. Chem. Phys. 126, 225101 (2007). 10.1063/1.2736681
- Go N. and Taketomi H., Proc. Natl. Acad. Sci. U.S.A. 75, 559 (1978). 10.1073/pnas.75.2.559
- Plaxco K. W., Simons K. T., Ruczinski I., and Baker D., Biochemistry 39, 11177 (2000). 10.1021/bi000200n
- Zheng O. and Liang J., Protein Sci. 17, 1256 (2008). 10.1110/ps.034660.108
- Kachalo S., Lu H. M., and Liang J., Phys. Rev. Lett. 96, 058106 (2006). 10.1103/PhysRevLett.96.058106
- Kaya H. and Chan H. S., J. Mol. Biol. 326, 911 (2003). 10.1016/S0022-2836(02)01434-1
- Veitshans T., Klimov D., and Thirumalai D., Folding Des. 2, 1 (1996). 10.1016/S1359-0278(97)00002-3
- Cavalli A., Vendruscolo M., and Paci E., Biophys. J. 88, 3158 (2005). 10.1529/biophysj.104.055335












