Abstract
We simulate the aggregation thermodynamics and kinetics of proteins L and G, each of which self-assembles to the same α/β topology through distinct folding mechanisms. We find that the aggregation kinetics of both proteins at an experimentally relevant concentration exhibit both fast and slow aggregation pathways, although a greater proportion of protein G aggregation events are slow relative to those found for protein L. These kinetic differences are correlated with the amount and distribution of intrachain contacts formed in the denatured state ensemble (DSE), or an intermediate state ensemble (ISE) if it exists, as well as the folding timescales of the two proteins. Protein G aggregates more slowly than protein L due to its rapidly formed folding intermediate, which exhibits native intrachain contacts spread across the protein, suggesting that certain early folding intermediates may be selected for by evolution due to their protective role against unwanted aggregation. Protein L shows only localized native structure in the DSE with timescales of folding that are commensurate with the aggregation timescale, leaving it vulnerable to domain swapping or nonnative interactions with other chains that increase the aggregation rate. Folding experiments that characterize the structural signatures of the DSE, ISE, or the transition state ensemble (TSE) under nonaggregating conditions should be able to predict regions where interchain contacts will be made in the aggregate, and to predict slower aggregation rates for proteins with contacts that are dispersed across the fold. Since proteins L and G can both form amyloid fibrils, this work also provides mechanistic and structural insight into the formation of prefibrillar species.
Keywords: aggregation, protein folding, denatured state, folding intermediates, protein function
In order to perform their biological function, proteins adopt a three-dimensional structure that represents a global or very low-lying minimum on their free-energy surface. Through molecular events not fully understood, proteins can sacrifice these stabilizing intrachain contacts in favor of configurations that promote intermolecular interactions leading to the formation of aggregates. These aggregates range from amorphous structures without order to highly structured fibrils, each arising by distinct aggregation mechanisms. The resulting structure of protein aggregates and the kinetics of their formation will depend on protein sequence, protein concentration, and solution conditions.
For example, aggregates can take on a common morphology of unbranched fibrils that are several micrometers in length, ~10 nm in diameter, and rich in β-sheets orthogonal to the fibril axis (Dobson 2003b). These aggregates, termed amyloid fibrils, are believed to be responsible for a number of diseases, including Alzheimer’s disease and Huntington’s disease (Dobson 2003b). The first experimentally identifiable nucleating species are known as prefibrillar aggregates, which resemble small bead structures, and these prefibrillar species then go on to further organize into “protofilaments,” which are thought to later organize into the mature amyloid fibrils. It is also noteworthy that many non-disease proteins can be induced to form amyloid fibrils. Alternatively, when engineered proteins are expressed in a bacterial host for industrial production, the product often accumulates in the form of inclusion bodies and must be solubilized and subsequently refolded to the native state while avoiding aggregation (Clark 2001). The morphologies of inclusion body formation are less structurally distinct, and this seeming nonspecificity makes it unclear whether inclusion body formation shares common molecular origins of aggregation with amyloid formation.
Increased aggregation propensity has been noted for proteins that fold through both obligatory and nonobligatory kinetic intermediates (King et al. 1996) or through molten globule states (Safar et al. 1994). This is attributed to attraction between interchain hydrophobic patches that resemble the folded monomer intrachain contacts and promote further interchain stabilization upon aggregation. It is thought that random coils are less susceptible to this association because of the reduction in native-like hydrophobic patterns (Fink 1998; Uversky et al. 1998). However, it has also been shown that the refolding of protein monomers competes with the formation of transient oligomeric protein aggregates that instead derive from association of random coil states of the protein (Silow et al. 1999). While the aggregates are short-lived and do dissociate so that individual chains fold, their folding rate is fractionally slower than the normal folding reaction (Silow et al. 1999). Chiti and coworkers were able to show that stabilizing local elements of secondary structure in the denatured state ensemble of the small α/β protein AcP prevented fibril formation (Chiti et al. 1999).
This work examines the question of aggregation kinetics and mechanism using simulation of coarse-grained models of two α/β proteins, Ig-binding proteins L and G (see Fig. 1). Proteins L and G make excellent targets for theoretical study of aggregation since their folding characteristics have been extensively studied by experiment (Gu et al. 1995, 1997; Park et al. 1997, 1999; Scalley et al. 1997; Kim et al. 1998; McCallister et al. 2000; Taddei et al. 2000). Experimental evidence indicates that protein L folds in a two-state manner through a transition-state ensemble involving a native-like β-hairpin 1. Protein G, on the other hand, folds through an early intermediate state, followed by a rate-limiting step that involves formation of β-hairpin 2. Our most recent theoretical studies also differentiate the folding mechanisms of proteins L and G as seen experimentally (Brown and Head-Gordon 2004). We find that their folding is consistent with a nucleation-condensation mechanism, described as helix-assisted β-hairpin 1 formation for protein L and helix-assisted β-hairpin 2 formation for protein G (Brown and Head-Gordon 2004). We determine that protein G exhibits an early intermediate that draws together misaligned secondary-structure elements that are stabilized by hydrophobic core contacts involving the third β-strand, and the later transition state ensemble (TSE) corrects the strand alignment of these same secondary-structure elements (Brown and Head-Gordon 2004). The kinetic data for protein G folding were fit to a two-step first-order reversible reaction, indicating that protein G folding involves an on-pathway early intermediate that should be populated and therefore observable by experiment (Brown and Head-Gordon 2004).
Figure 1.

Ribbon drawing of the protein L and protein G model used in this study. Figure created by PyMOL (DeLano Scientific).
The purpose of our study is to demonstrate that the different folding characteristics of proteins L and G, both of which form the same native topology, explain their different rates of aggregation. Furthermore, given the concentration and temperature used in our study corresponding to partially denaturing conditions, and the fact that protein L and protein G form amyloid fibrils, we would argue that these simulations are most directly relevant to the earliest aggregation events involving prefibrillar formation. We show that proteins L and G provide a good contrast for understanding features of aggregation that arise from different mechanisms of folding and/or due to protein-folding intermediates, while controlling for size, topology, and stability. We find that protein G aggregates more slowly than protein L due to its rapidly formed folding intermediate, which exhibits native intrachain contacts spread across the protein. Protein L shows more localized native structure in the denatured state ensemble (DSE) with timescales of folding that are commensurate with the aggregation timescale, leaving it vulnerable to domain swapping or nonnative interactions with other chains that increase the aggregation rate.
This computational study suggests that experiments that can characterize the structural signatures of the DSE (or transition state ensemble) under nonaggregating conditions should be able to predict where interchain contacts will be made in the aggregate, and to predict slower aggregation rates for proteins with contacts that are dispersed across the protein fold. A corollary of this work is that early intermediates in folding may be evolutionarily selected for their protective role against unwanted aggregation, and thus could be useful in reengineered sequences to slow aggregation and increase refolding yield. Finally, given that proteins L and G can both form amyloid fibrils under certain solution conditions (Ramirez-Alvarado et al. 2000; T. Cellmer and H. Blanch, pers. comm.), this work also provides mechanistic and structural insight into the formation of the earliest prefibrillar species.
Results
All aggregation simulations we report for proteins L and G were done at the midpoint of their temperature denaturation curve. This corresponds to a folding temperature for protein L of Tf = 0.42, while the aggregation simulations for protein G were done at its folding temperature of Tf = 0.41, as determined in previous work (Brown and Head-Gordon 2004). Aggregation simulations performed at the protein’s folding temperature correspond to partially denaturing conditions typically used by experimentalists to promote aggregation (Dobson 2003a). The use of the two (but only slightly different) temperatures for the aggregation studies allows us to remove the trivial effect of greater or lesser stability of the native state of one sequence as a factor in the comparison of their aggregation rates. By simulating aggregation at each of their folding temperatures, the native and unfolded states are of equal stability.
Because protein aggregation is a second-order or higher reaction process, aggregation rates will be concentration-dependent. All simulations were performed with three identical chains, of either the L or G sequence, set at maximal distances apart in their unfolded state in a periodically replicated box with a length of 32 distance units. A box length of 32 units with three chains corresponds to a protein solution concentration of about 20 mg/mL (assuming a length scale of a bead–bead distance of 3.8 Å). This concentration is on the same order of magnitude as the 5 mg/mL used in experimental aggregation kinetics studies performed by Dobson and coworkers (de Laureto et al. 2003) for SH3, another small protein. We also performed comparable simulations at a higher concentration (~80 mg/mL) and determined the same qualitative features of the mechanism of aggregation (N. Fawzi, V. Chubukov, and T. Head-Gordon, unpubl.), which we reveal and discuss more fully below.
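The conversion from box length to concentration can be checked with a short calculation. The sketch below is illustrative only: the function name and the chain mass of 6200 Da (roughly that of a domain of about 56 residues) are assumptions that do not come from the text, while the three chains, the box length of 32 units, and the 3.8 Å length scale are taken from the paragraph above.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def concentration_mg_per_ml(n_chains, box_length_units,
                            angstrom_per_unit=3.8, chain_mass_da=6200.0):
    """Mass concentration of n_chains in a cubic periodic box.

    chain_mass_da is an ASSUMED molecular mass per chain; it is not
    stated in the text and only sets the overall scale of the result.
    """
    box_cm = box_length_units * angstrom_per_unit * 1e-8  # 1 Angstrom = 1e-8 cm
    volume_ml = box_cm ** 3                                # 1 cm^3 = 1 mL
    mass_g = n_chains * chain_mass_da / AVOGADRO
    return 1e3 * mass_g / volume_ml                        # g/mL -> mg/mL


if __name__ == "__main__":
    print(concentration_mg_per_ml(3, 32))
```

With these assumed inputs the result is about 17 mg/mL, in line with the quoted value of about 20 mg/mL. Because the concentration scales with the inverse cube of the box length, halving the box length raises it eightfold.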
Instead of defining an aggregation event through a defined oligomer state, we chose to define it in terms of a critical number of interchain contacts, χinter, and a number of intrachain contacts, χintra (see Materials and Methods). This had the benefit, from our view, of no a priori assumptions about aggregation structure. We quantified these contact numbers by evaluating free energy surfaces as projected onto χinter and χintra for protein L and protein G at their respective folding temperatures (Fig. 2A,B ▶). The free energy projection shows a free energy minimum corresponding to aggregated states for each protein, and while their χintra position is the same, they differ in their position with respect to the χinter reaction coordinate. Based on these free energy projections, we use χinter = 28 for protein L and χinter = 38 for protein G as the measure for when three protein chains have aggregated. Using these definitions, aggregates of L and G are comprised of both dimers and trimers.
Figure 2.

Free energy projection of interchain contact number, χinter, vs. the number of intrachain contacts, χintra, for protein L (A) and protein G (B) at their respective folding temperatures, Tf.
In Figure 3A ▶ we plot the unaggregated population, Punaggregated, versus unitless time t/τ (where τ is the Langevin timestep), for protein L and protein G at their respective folding temperatures. The kinetic data for both proteins are fit by a double exponential (the parameters are shown in Table 1). It is clear from the figure and parameter fit that protein G aggregates more slowly than protein L. Just for clarification, Figure 3B ▶ compares the kinetic profiles for both proteins at χinter = 25, and we can see that protein G still aggregates more slowly, so that the difference in definition of χinter for the two proteins is not biased toward slower aggregation rates for protein G. Therefore, we focus on the aggregation kinetics using the χinter definition based on the free energy projections for protein L and protein G, which corresponds to the minimum in each basin. In Table 1 we also report timescales of folding for protein L and for protein G, including an estimate for the timescale for forming the intermediate, which will be important for later analysis. For protein L and the fast aggregation pathway for protein G, the timescales for folding are comparable to the aggregation timescale, whereas the protein G intermediate forms on timescales that are an order of magnitude faster than the fastest timescale for early aggregation (Table 1).
Figure 3.

Fraction of unaggregated states Punaggregated as a function of unitless time t/τ for protein L (squares) and protein G (circles) at their respective folding temperatures. (A) Based on the aggregation definition of χinter = 38 for protein G and χinter = 28 for protein L. The fit is shown as a solid line, and the best fit parameters are given in Table 1. (B) Based on the same definition of aggregation for both proteins, χinter = 25. This shows that the different definitions of χinter used for the two proteins in Figure 3A do not bias the comparison toward slower aggregation rates for protein G.
Table 1.
Parameters obtained from fits to aggregation kinetic data of proteins L and G from this study
| | T | A0 | 1 − A0 | τ0 | τ1 | χ²/10⁻² |
| Aggregation rates | | | | | | |
| L | 0.42 | 0.43 | 0.57 | 2175 | 33,768 | 5.0 |
| G | 0.41 | 0.27 | 0.73 | 3554 | 42,750 | 2.0 |
| Folding rates | | | | | | |
| L | 0.42 | 1.0 | 0.0 | 15,700 | 0 | 0.034 |
| G | 0.41 | 0.81 | 0.19 | 13,700 | 46,400 | 0.035 |
| G (intermediate) | 0.41 | 1.0 | 0.0 | 600 | 0 | |
The data are fit to the equation A0 exp(−t/τ0) + (1 − A0) exp(−t/τ1). We also include the kinetic parameters fit to the folding of proteins L and G as reported in Table 1 of Brown and Head-Gordon (2004) and an estimate of the timescale for the formation of protein G’s intermediate from Figure 8 in Brown and Head-Gordon (2004). The χ² values for these data indicate a good fit.
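To make the fitted form concrete, the following minimal sketch evaluates the double exponential with the aggregation parameters listed in Table 1. The function names are hypothetical. The amplitude-weighted mean lifetime (the time integral of the fitted curve) is a derived quantity introduced here for illustration and is not reported in this study.

```python
import math

# (A0, tau0, tau1) for the aggregation fits, copied from Table 1
AGGREGATION_FITS = {
    "L": (0.43, 2175.0, 33768.0),
    "G": (0.27, 3554.0, 42750.0),
}


def unaggregated_fraction(t, a0, tau0, tau1):
    """Double-exponential fit: A0 exp(-t/tau0) + (1 - A0) exp(-t/tau1)."""
    return a0 * math.exp(-t / tau0) + (1.0 - a0) * math.exp(-t / tau1)


def mean_lifetime(a0, tau0, tau1):
    """Integral of the fit from t = 0 to infinity (illustrative summary only)."""
    return a0 * tau0 + (1.0 - a0) * tau1


if __name__ == "__main__":
    for name, fit in AGGREGATION_FITS.items():
        print(name, unaggregated_fraction(10000.0, *fit), mean_lifetime(*fit))
```

With these parameters the fitted curve for protein G lies above that for protein L at every positive time, and the illustrative mean lifetimes are roughly 20,000 and 32,000 time units for L and G, respectively.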
From the fit to the aggregation data, we find that a greater proportion of protein G chains (73%) than protein L chains (57%) aggregate through their slow aggregation pathways. From these kinetic fits, we can separate the raw aggregation trajectories into two subpopulations—fast and slow aggregation—for each protein (see Materials and Methods). We then can analyze these kinetic subpopulations for proteins L and G by evaluating contact maps of the native state, intermediate state (if it exists), and denatured state ensembles of individual folding protein chains, as well as contact maps for the intrachain and interchain aggregated ensembles, to explain differences in aggregation rates. The simulation and numerical procedures for obtaining these data are described in the Materials and Methods section, and in previous work (Brown and Head-Gordon 2004).
The contact maps presented in Figures 4 ▶ through 6 ▶ illustrate which areas of the protein chain are in contact in the various states examined. The secondary structural arrangements of proteins L and G (β-strand 1, β-strand 2, helix, β-strand 3, β-strand 4) are presented along the bottom and the side of each map. Diagonal arrangements of contacts going up and to the right in the β regions correspond to a parallel β-sheet, down and to the right correspond to an anti-parallel β-sheet. For all contact maps, a contact is formed if two beads are within 2.5 distance units (corresponding to ~9 Å), roughly the center of mass distance between side chains. In all figures, reference lines indicating native state contacts of folded chains are in black. Unlike contacts in the native state that are always formed, contacts in the DSE, ISE, and aggregated states that are significant are formed with a certain probability. Therefore, these states are contoured at various percentage levels described below to bring out significant contacts. For example, a contour level of 60% was chosen for the intrachain contacts, to bring out contacts that are formed by 60% of the chains, more than half the chains in the aggregates. The frequency of the most populated interchain contacts is lower than that of intrachain contacts, due to a greater number of possibilities for a particular contact to form involving three chains. Trivial contacts between neighboring residues found along the main diagonal are ignored in intrachain contours (native state, DSE, ISE, TSE, and aggregated intrachain contacts). Interchain contours have no such trivial contacts, and the area along the main diagonal is treated no differently from contacts in other regions. Interchain contour levels for protein G are adjusted by a factor of (38/28) to correct for the increased number of contacts by definition included in the protein G aggregates.
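The contouring procedure described above can be sketched as follows. This is an illustrative reading rather than the code used in this study: the function names are hypothetical, and `min_separation` (how many neighbors along the chain count as trivial contacts) is an assumption, since the text does not give the exact exclusion width. Only the 2.5-unit cutoff and the 60% contour level are taken from the text.

```python
import math

CUTOFF = 2.5  # contact distance in reduced units, as stated in the text


def intrachain_contacts(coords, cutoff=CUTOFF, min_separation=3):
    """Return the set of bead pairs (i, j), i < j, closer than the cutoff.

    Pairs with j - i < min_separation are skipped as trivial near-diagonal
    contacts; the value 3 is an ASSUMPTION made for illustration.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def contact_frequencies(ensemble, **kwargs):
    """Fraction of structures in the ensemble in which each contact is formed."""
    counts = {}
    for coords in ensemble:
        for pair in intrachain_contacts(coords, **kwargs):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: c / len(ensemble) for pair, c in counts.items()}


def contour(frequencies, level):
    """Contacts formed in at least `level` (e.g., 0.60) of the ensemble."""
    return {pair for pair, f in frequencies.items() if f >= level}


if __name__ == "__main__":
    bent = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]      # beads 0 and 3 touch
    straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]  # beads 0 and 3 apart
    freqs = contact_frequencies([bent] * 4 + [straight])
    print(contour(freqs, 0.60))
```

The same bookkeeping applies to interchain maps if the pair loop runs over beads on different chains, in which case no near-diagonal exclusion is needed.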
Figure 4.

Contact maps comparing native conformation to intermediate states and denatured states. Native-state contacts (represented by the area lying within the black contours) compared to contacts that are present in at least 60% of the ensemble of denatured-state structures (contoured in red) for protein L (A) and protein G (B). (C) Contact map comparing native state (black) and contacts that are present across at least 60% of the intermediate state ensemble (red) for protein G.
Figure 6.

Contact map comparing native state (black) and interchain contacts made in at least 15% (green) and 8% (blue) of the aggregated ensemble for protein L’s fast aggregation pathway (A), protein L’s slow aggregation pathway (B), protein G’s fast aggregating pathway (C), and protein G’s slow aggregation pathway (D), respectively. For reference, intrachain contacts that are present in at least 60% of the aggregated ensemble of protein L are contoured in red in A and B, and in at least 60% of the aggregated ensemble of protein G are contoured in red in C and D.
Figure 4, A and B, provides a comparison of the DSEs of protein L and protein G, respectively. In Figure 4C we display the intermediate ensemble for the slow pathway of protein G at the same contour level, which in previous work we have shown is characterized as an assembly of misaligned β-strands that are corrected in the later TSE (Brown and Head-Gordon 2004). These figures show that native contacts made in the DSE of protein L are more localized than those in the DSE of protein G, which exhibits stable native structural elements dispersed over the entire protein chain. In addition, there is a greater population of nonnative elements in the DSE of protein L relative to the DSE of protein G, while the ISE of protein G is the most native-like. The nonnative element of protein L involves parallel association of β-strands 2 and 4, but does contribute to a greater delocalization of intrachain contacts. As shown below, the structural signatures of the DSE of each protein, together with the ISE of protein G and the timescale for its formation, provide insight into the aggregation pathways and kinetics.
In Figure 5 we display the intrachain contacts made by 60% of the population (i.e., more than one and a half full chains per three-chain aggregate) in the aggregated ensemble against the native and DSE reference for the fast and slow aggregation pathways, for protein L (Fig. 5, A and B, respectively) and protein G (Fig. 5, C and D, respectively). For each protein it is evident that the intrachain contacts of the aggregated ensembles resemble contacts formed in the denatured state ensemble. The fast aggregation pathway for protein L (Fig. 5A) protects only the localized first β-hairpin region, consistent with its folding pathway, but leaves a majority of the residues vulnerable to entanglement with other chains. The aggregation is slowed down by chains that exploit both a more extensively formed first β-hairpin and contacts more greatly dispersed across the fold arising from association of β-strands 2 and 4 in the DSE (Fig. 5B). Baker and coworkers’ experiments confirm the localized structure in hairpin 1 of the denatured state of protein L (Yi et al. 2000), and, although no long-range associations like that between strands 2 and 4 were detected, the authors note that their use of 2 M guanidine denaturant might disrupt long-range structure. The fast aggregation pathway for protein G protects some of the first β-hairpin and more extensively the second β-hairpin region, consistent with the folding pathway of protein G, but these two regions are still relatively localized, i.e., they do not provide sufficient pinning sites throughout the fold (Fig. 5C). In contrast, the slowest aggregation pathway for protein G has a more extensive network of stabilizing native contacts across the protein, more consistent with the ISE that protects the sticky third β-strand (Fig. 5D).
Figure 5.

Contact map comparing native state (black) and intrachain contacts made in at least 60% of the aggregated ensemble (green) for protein L’s fast aggregation pathway (A), protein L’s slow aggregation pathway (B), protein G’s fast aggregation pathway (C), and protein G’s slow aggregation pathway (D). For reference, contacts that are present in at least 60% of the denatured state ensemble of protein L are contoured in red in A and B, and present in at least 60% of the intermediate ensemble of protein G are contoured in red in C and D.
Finally, in Figure 6 ▶ we display contact maps for inter-chain contacts made in the aggregated ensemble for 15% (green) and 8% (blue) of the proteins for the fast aggregating pathway and slow aggregation pathway, for protein L (Fig. 6A,B ▶) and protein G (Fig. 6C,D ▶), respectively. We also show a snapshot of aggregated chains for protein G in Figure 7 ▶. The 15% contour is comparable to the level of significance of the intrachain contact maps, which is the reference ensemble used in this comparison (red). The point of this comparison is to show that protection afforded by the intrachain contacts reduces their representation in the inter-chain contacts that can be made. The more permissive contact level of 8% emphasizes that protein L gives rise to aggregates with more interchain contacts, and exhibits a greater degree of domain swapping, especially between strands 2 and 4′ (2′ and 4), as well as interchain association of same strands, i.e., 2 and 2′ as well as 4 and 4′. It is clear that the greater protection factor afforded by the stable structural elements dispersed over the entirety of the DSE of protein G, with the ISE viewed as an especially structured DSE, results in a much sparser interchain contact map. Protein G has a much reduced propensity for domain swapping, and largely exhibits only interchain association of same strands 3 and 3′. In fact, the third β-strand is the stickiest region of protein G, and therefore potentially more harmful with respect to unwanted aggregation, but its rapid protection in the folding mechanism as an early intermediate potentially minimizes this destructive tendency.
Figure 7.

Ribbon diagram of a snapshot from the aggregation simulation of protein G that illustrates both the native intrachain and the interchain contacts made. Figure created by PyMOL (DeLano Scientific).
Discussion
The purpose of our study is to demonstrate that the different folding characteristics of proteins L and G, both of which form the same native topology, explain their different rates of aggregation. The aggregated ensembles for proteins L and G show intrachain contact maps that strongly resemble the DSE, or the intermediate ensemble if it exists. Characterization of the DSE, early intermediates of the folding pathways, or even transition-state ensembles of folding under nonaggregating conditions (low concentration) could therefore provide information that will help explain the slower aggregation kinetics, and possibly the morphologies of aggregates, for different protein sequences.
The aggregation for protein G is slower than for protein L due to the presence of an intermediate in its folding pathway that quickly protects a number of regions of the sequence dispersed throughout the protein. While a number of studies have shown that intermediates can play a deleterious role by increasing protein aggregation (Safar et al. 1994; King et al. 1996; Fink 1998; Horwich 2002), this work provides evidence that early on-pathway intermediates in folding could also play a protective role in abating unwanted aggregation. Correspondingly, the faster aggregation rate for L arises from the localization of stable structural elements in its DSE. The unstructured part of the chain leaves it vulnerable to domain-swapping interactions with other chains that increase the aggregation rate and contribute to a greater number of interchain contacts. Therefore, proteins that have localized structure in the DSE will be more aggregation-prone than proteins with more diffuse elements of stable structure. In fact, mutations that stabilize some elements of native structure in the second and/or fourth strand of protein L could be made such that its folding pathway is not perturbed, although it should diminish its aggregation propensity relative to wild type.
It is a difficult experimental problem to determine the small populations of structure in the DSE at equilibrium, and the few experimental studies that exist have primarily focused on the DSE resemblance to the TSE (Mok et al. 1999; Kortemme et al. 2000; Crowhurst et al. 2003). In Figure 8, A and B ▶, we show the resemblance between the DSE and TSE obtained from our model for protein L and protein G, respectively. It is interesting that the TSE population for protein G contains very similar or the same elements of structure populated in the DSE. Therefore, information about the TSE does show correlation with aggregation propensity in this case. The DSE of protein L, however, exhibits association of β-strands 2 and 4 in both native and nonnative configurations that are altogether absent in the TSE—an important difference since the extra “pinning” site provides an alternative pathway that slows down the rate of protein L aggregation. In general, a more delocalized TSE should correlate with a reduction in the aggregation rate and amount of interchain contacts formed in the aggregate, relative to sequences with a localized TSE, as we see when comparing protein L and G. Thus the “easier” characterization of the TSE through φ-value analysis may be a good guide to aggregation kinetics among different protein sequences, although the DSE would be more directly informative.
Figure 8.

Contact map comparing native state (black), the denatured state ensemble (red), and the transition state ensemble (blue) for protein L (A) and protein G (B).
The equilibrium experiments that report similarity in structure between the TSE and DSE are only suggestive of the role that it plays in the kinetics of folding (Mok et al. 1999; Kortemme et al. 2000). We note that the similarities seen between the TSE and DSE of our models of proteins L and G do support the description of the DSE in the folding kinetics of proteins as proposed by Fersht (1995). Protein L does exhibit a faster folding rate than protein G due to more localized contacts (helix and β-hairpin 1) formed in the DSE, and because it lacks a folding intermediate, all of which supports the connection between fast folding and minimal residual structure in the DSE. We might also suggest that residual nonnative structure may also contribute to a faster folding rate, since protein L’s nonnative association of β-strands 2 and 4 provides weak contacts (relative to native-like interactions) and yet reduces the entropy of folding by drawing together disparate regions of the structure.
It is noteworthy that Capaldi and coworkers (Capaldi et al. 2001, 2002) have determined that the folding mechanism of one member of a set of immunity proteins, Im7, involves an early intermediate, while the homologous members Im2, Im8, and Im9 are simple two-state folders. It would be interesting to see whether these and other structurally homologous protein families that show both two-state kinetics as well as on-pathway intermediates in folding correlate with differences in aggregation kinetics for the reasons that we have shown here for proteins L and G.
The primary conclusion of this computational study is that if experiments can characterize the structural signatures of the DSE, or possibly the TSE, then native contacts that are delocalized across the protein fold should correlate with slower aggregation rates. We suggest that there may be a functional advantage to a diffuse transition-state ensemble or DSE not only to prevent misfolding (Lindberg et al. 2002; Sanchez and Kiefhaber 2003; Wright et al. 2003), but also to aid aggregation resistance. This emphasizes that evolution has optimized protein sequences for functional robustness, not simply for folding rate; protein sequences that prefer slower pathways and/or folding intermediates may be evolutionarily selected for, in part, due to their aggregation resistance (Lindberg et al. 2002; Wright et al. 2003). A further question is whether differences in folding pathways of homologous proteins might delineate differences in their functional role—i.e., that sequence differences in a given fold class have evolved to provide protection against aggregation depending on the specifics of their protein interaction partners in the cell. A corollary of this work is that early intermediates in folding may be evolutionarily selected for their protective role against unwanted aggregation, and thus could be useful to employ in reengineered sequences to slow aggregation and increase folding yield in industrial protein production. Furthermore, given the concentration conditions used in our study, and the fact that protein L and protein G form amyloid fibrils (Ramirez-Alvarado et al. 2000; T. Cellmer and H. Blanch, pers. comm.), we would argue that these simulations are most directly relevant to the earliest aggregation events involving prefibrillar formation.
Materials and methods
This work examines the question of aggregation kinetics and mechanism through computational coarse-grained models of two members of an α/β protein fold class, Ig-binding proteins L and G. We justify the use of coarse-grained models for the following reasons: They capture the correct spatial distribution of local and nonlocal contacts (Fig. 1 ▶) of the most relevant native-state features (Sorenson and Head-Gordon 2000; Head-Gordon and Brown 2003; Brown and Head-Gordon 2004) that most influences the overall kinetics of protein folding (Plaxco et al. 1998; Alm et al. 2002). Coarse-graining in sequence should also be highly appropriate for aggregation studies, since it is clear that hydrophobic/hydrophilic amino acid sequence patterning plays a crucial role in determining a protein’s ability to aggregate. Broome and Hecht (2000) reported that alternating patterns of hydrophilic and hydrophobic amino acid residues occur significantly less often than other patterns, and Schwartz et al. (2001) identify a more specific rule that blocks of three or more hydrophobic residues are disfavored among wild-type proteins surveyed, indicating that there are sequence patterns that are particularly conducive to the formation of amyloid fibrils (West et al. 1999).
These minimalist models enable a sequence-driven connection to experimental protein folding mechanisms that is not reproducible by Gō topology models. Since our models are based on physical potentials, we can engineer sequences that fold into α-helical, β-sheet, and mixed α/β protein topologies, and distinguish folding rates and mechanisms between members within the same protein fold family. Therefore, our computational model is also appropriate for protein engineering studies that have proven critical in understanding some basic aspects of aggregation phenomena.
The protein chain is modeled as a sequence of beads of three types—hydrophilic, hydrophobic, and neutral—designated by L, B, and N, respectively (Sorenson and Head-Gordon 2000; Brown and Head-Gordon 2004). The pairwise interaction between beads is attractive for hydrophobic–hydrophobic (B–B) interactions, and repulsive for all other bead pairs (although the strength of the repulsive interactions depends on the bead types involved). In addition to pairwise nonbonded interactions, the other contributions to the potential energy function include bending and torsional degrees of freedom.
The total potential energy function is given by
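$$
V = \sum_{\theta} \frac{1}{2} k_{\theta} (\theta - \theta_0)^2 + \sum_{\phi} \left\{ A[1 + \cos\phi] + B[1 - \cos\phi] + C[1 + \cos 3\phi] + D\left[1 + \cos\left(\phi + \frac{\pi}{4}\right)\right] \right\} + \sum_{i,\, j \ge i+3} 4 \varepsilon_H S_1 \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - S_2 \left(\frac{\sigma}{r_{ij}}\right)^{6} \right] \qquad (1)
$$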
where ɛH determines the energy scale and sets the strength of the hydrophobic interactions. The bond-angle energy term is a stiff harmonic potential with force constant kθ = 20ɛH/rad² and θ0 = 105°. The second term in the potential energy designates the torsional, or dihedral, potential and is given by one of the following: helical (H), with A = 0, B = C = D = 1.2ɛH; extended (E), favoring β-strands, with A = 0.9ɛH, C = 1.2ɛH, B = D = 0; or turn (T), with A = B = D = 0, C = 0.2ɛH. The nonbonded interactions are determined by the bead types: S1 = S2 = 1 gives a Lennard-Jones potential with a short-range attractive minimum, representing the energetically favorable burial of hydrophobic groups, for B–B interactions; S1 = 1/3 and S2 = −1 gives a repulsive interaction for L–L and L–B interactions; and S1 = 1 and S2 = 0 gives a softer repulsive interaction, mimicking smaller amino acids, for all N–L, N–B, and N–N interactions. For convenience, all simulations are performed in reduced units, with mass m, length σ, energy ɛH, and kB all set equal to unity. Note that while the nonbonded potential is symmetric with respect to inversion, this is not true for the dihedral interactions. Thus, the total energy function is not symmetric with respect to index permutations, and we do not find mirror image states. Full details of the model can be found in Sorenson and Head-Gordon (2000) and Brown and Head-Gordon (2004).
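As a minimal sketch of the nonbonded term only, the following Python fragment evaluates the pair energy for each bead-type combination in reduced units (σ = ɛH = 1). It assumes the functional form 4ɛH·S1[(σ/r)^12 − S2(σ/r)^6] with the S1 and S2 values quoted above; the function names `lj_coefficients` and `pair_energy` are illustrative and do not come from the original model code.

```python
# Reduced units, as stated in the text: sigma = epsilon_H = 1.
SIGMA = 1.0
EPS_H = 1.0


def lj_coefficients(a, b):
    """Return (S1, S2) for bead types a and b, each one of 'B', 'L', 'N'."""
    pair = {a, b}
    if "N" in pair:              # N-L, N-B, N-N: softer pure repulsion
        return 1.0, 0.0
    if pair == {"B"}:            # B-B: attractive Lennard-Jones
        return 1.0, 1.0
    return 1.0 / 3.0, -1.0       # L-L and L-B: repulsive


def pair_energy(a, b, r):
    """Assumed form: 4 * eps_H * S1 * [(sigma/r)**12 - S2 * (sigma/r)**6]."""
    s1, s2 = lj_coefficients(a, b)
    x6 = (SIGMA / r) ** 6
    return 4.0 * EPS_H * s1 * (x6 * x6 - s2 * x6)
```

Under these assumptions, `pair_energy("B", "B", 1.28)` and `pair_energy("L", "L", 1.28)` can be compared directly with the contact energies quoted for the 1.28 contact distance in the aggregation protocol.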
Folding and aggregation simulations
We perform constant-temperature simulations using Langevin dynamics in the low-friction limit for three protein chains when characterizing the thermodynamics and kinetics of aggregation. Low-friction stochastic dynamics enables the sampling of long timescale events (on the order of milliseconds or longer) such as folding and aggregation, but makes quantitative comparison to experimentally measured absolute times difficult. Therefore, we restrict our analysis to comparing timescales between proteins L and G. We performed 600 aggregation trajectories for both protein L and protein G at their reduced folding temperatures of Tf = 0.42 and Tf = 0.41, respectively. We use a reduced temperature for numerical convenience, since it eliminates small constants that accumulate error in an MD simulation. It is defined as T* = kBT/ɛH, where ɛH = kB = 1. A Langevin timestep of 0.005 reduced time units was used for all simulations.
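To make the propagation step concrete, here is a minimal sketch of one Langevin timestep in reduced units (kB = 1, m = 1). The text does not specify the integrator or the friction coefficient, so the BAOAB splitting used here and the `gamma` parameter are assumptions chosen for illustration; only the timestep of 0.005 and the reduced temperatures come from the text.

```python
import numpy as np


def baoab_step(x, v, force, dt, gamma, temperature, rng, mass=1.0):
    """Advance positions x and velocities v by one Langevin step (BAOAB).

    force is a callable returning the force array for positions x.
    gamma is a hypothetical friction coefficient (not given in the text).
    """
    v = v + 0.5 * dt * force(x) / mass            # B: half kick
    x = x + 0.5 * dt * v                          # A: half drift
    c1 = np.exp(-gamma * dt)                      # O: friction and thermal noise
    c2 = np.sqrt((1.0 - c1 * c1) * temperature / mass)
    v = c1 * v + c2 * rng.standard_normal(np.shape(x))
    x = x + 0.5 * dt * v                          # A: half drift
    v = v + 0.5 * dt * force(x) / mass            # B: half kick
    return x, v
```

With `gamma = 0` the noise term vanishes and the step reduces to a deterministic, energy-conserving scheme, which gives a simple sanity check before running at small but nonzero friction.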
In order to determine the kinetics of aggregation, we sought a thermodynamic definition of an aggregate based on the number of contacts by constructing a free energy surface describing aggregation. To this end, we collected multidimensional histograms (Kumar et al. 1992; Ferguson and Garrett 1999) over a number of different order parameters, including energy V, radius of gyration Rg, and various native-state or aggregation-state similarity parameters χ. We collected histograms at 13 different temperatures: 0.90, 0.62, 0.60, 0.55, 0.50, 0.48, 0.46, 0.44, 0.42, 0.41, 0.40, 0.39, and 0.38. We ran 10 independent trajectories at each temperature, and collected 5000 data points per trajectory. For each trajectory, the three chains start in an arbitrary conformation at maximum separation in the periodic box. The chains are initially propagated at a high temperature of 1.6 for 750,000 steps to randomize the starting configuration. The simulation is then quickly cooled to the target temperature (5000 steps for target temperatures of 0.70 and 0.90; 10,000 steps for 0.50 ≤ T ≤ 0.62; 20,000 steps for T ≤ 0.48) and equilibrated for long times (500,000 steps for 0.70 and 0.90; 1,000,000 steps for 0.50 ≤ T ≤ 0.62; and 4,000,000 steps for T ≤ 0.48) to ensure that the simulation represents the equilibrium ensemble at the target temperature.
The free energy landscape is characterized using the multiple, multidimensional weighted histogram analysis technique. From the histogram analysis, we constructed a projection of the free energy surface on the parameters χinter and χintra. χinter is the number of bead pairs from different chains that are in close contact (within 1.28 distance units, corresponding to ~5 Å). χintra is the number of bead pairs from the same chain in close contact. We selected the short contact distance of 1.28 to limit the interactions counted as contacts to those that are very likely to be energetically favorable B–B interactions. At a contact length of 1.28 distance units, B–B interactions are −0.70ɛH, 70% of the minimum potential energy (−1.0ɛH at a distance of 1.122), and L–L interactions are unfavorable (+0.37ɛH). With these parameters, we identified the center of the free energy basin for aggregation based on the projection of the free energy surface onto the χinter and χintra parameters.
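The contact definitions above can be sketched as a short counting routine. This is an illustration under stated assumptions, not the original analysis code: beads are assumed to be stored in sequence order within each chain, `min_seq_sep` is a hypothetical parameter that excludes bonded near-neighbors from the intrachain count (the text does not say how these were handled), and `box` is an optional cubic box length for the minimum-image convention.

```python
import numpy as np


def count_contacts(coords, chain_ids, cutoff=1.28, min_seq_sep=3, box=None):
    """Return (chi_inter, chi_intra) for bead coordinates of shape (n, 3).

    chain_ids[i] is the chain index of bead i. Each unordered bead pair
    closer than cutoff is counted once.
    """
    coords = np.asarray(coords, dtype=float)
    chain_ids = np.asarray(chain_ids)
    delta = coords[:, None, :] - coords[None, :, :]
    if box is not None:
        delta -= box * np.round(delta / box)      # minimum-image convention
    dist = np.linalg.norm(delta, axis=-1)
    i, j = np.triu_indices(len(coords), k=1)      # pairs with i < j
    close = dist[i, j] < cutoff
    same_chain = chain_ids[i] == chain_ids[j]
    far_in_sequence = (j - i) >= min_seq_sep      # assumed exclusion rule
    chi_inter = int(np.sum(close & ~same_chain))
    chi_intra = int(np.sum(close & same_chain & far_in_sequence))
    return chi_inter, chi_intra
```

Counting all close pairs, rather than only B–B pairs, follows the definition given in the text; the short cutoff is what biases the count toward favorable B–B contacts.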
The kinetics of the aggregation process can be characterized by calculating a large number of first-passage times, where the first-passage time is the time required for an aggregation trajectory to reach the designated value of χinter. In kinetics simulations, three chains start in an arbitrary conformation at maximum separation in the periodic box. The chains are initially propagated at a high temperature of 1.2 for 750,000 steps to randomize the starting configuration and ensure that each chain is in the unfolded conformation. The chains are then cooled extremely quickly (in 200 steps) to the folding temperature, and an equilibration period of 30,000 timesteps follows at the folding temperature. We subtract off this initial correlation time, in which the high-temperature chain is briefly equilibrated at the target temperature (this is the computational dead time during the kinetics run). The chains are then propagated at the folding temperature, and χinter is measured every 1000 steps until its rolling block average reaches the designated interchain contact number, χinter = 38 for G and χinter = 28 for L. The rolling block average is a short block average of the number of contacts for the current and nine previous sample points (the last 10,000 steps of simulation), implemented to reduce the noise in the number of contacts. The time at which this χinter is reached is recorded as the first-passage time. Trajectories are truncated at 6 million steps if χinter is not reached, and all kinetic fits are generated on trajectory data out to 5 million steps.
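The first-passage criterion described above can be sketched as follows. The sampling interval of 1000 steps and the window of 10 samples follow the text; the requirement that a full window be available before the threshold test is applied, the `dead_steps` offset, and the function name are assumptions made for this illustration.

```python
from collections import deque


def first_passage_step(chi_samples, threshold, sample_interval=1000,
                       window=10, dead_steps=0):
    """Return the step at which the rolling mean of chi_inter first reaches
    threshold, or None if it never does (a truncated trajectory).

    chi_samples[k] is the contact count measured at step
    (k + 1) * sample_interval. dead_steps is subtracted from the result.
    """
    recent = deque(maxlen=window)                 # current + previous samples
    for k, chi in enumerate(chi_samples):
        recent.append(chi)
        if len(recent) == window and sum(recent) / window >= threshold:
            return (k + 1) * sample_interval - dead_steps
    return None
```

For example, a series that jumps from 0 to 30 contacts only crosses a threshold of 28 once all ten samples in the window are at 30, which illustrates how the block average delays the recorded time relative to the first noisy excursion.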
In order to examine structural differences in aggregates that form quickly and slowly, the population of kinetics runs was split at a point t*. At t*, assuming a two-independent-pathways model, P(trajectory is a member of the fast pathway | aggregated at t*) = P(trajectory is a member of the slow pathway | aggregated at t*). Trajectories aggregating before t* are more likely to be from the fast pathway and represent the fast exponential timescale; trajectories aggregating after t* are more likely to represent the slow exponential timescale. Any contamination of the populations where trajectories near t* are incorrectly assigned will be small due to the order of magnitude separation in timescales.
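One way to compute t* is sketched below, assuming the unaggregated fraction follows a biexponential form a_fast·exp(−k_fast·t) + a_slow·exp(−k_slow·t). The first-passage density of each pathway is then a·k·exp(−k·t), and the two conditional probabilities are equal where the two densities are equal, giving t* = ln(a_fast·k_fast / (a_slow·k_slow)) / (k_fast − k_slow). The amplitudes and rates are placeholders for fitted values; none are taken from the paper.

```python
import math


def crossover_time(a_fast, k_fast, a_slow, k_slow):
    """Time t* at which the fast and slow first-passage densities are equal."""
    if not k_fast > k_slow > 0:
        raise ValueError("expected k_fast > k_slow > 0")
    ratio = (a_fast * k_fast) / (a_slow * k_slow)
    return math.log(ratio) / (k_fast - k_slow)


def prob_fast(t, a_fast, k_fast, a_slow, k_slow):
    """P(fast pathway | aggregated at time t) under the biexponential model."""
    fast = a_fast * k_fast * math.exp(-k_fast * t)
    slow = a_slow * k_slow * math.exp(-k_slow * t)
    return fast / (fast + slow)
```

By construction, `prob_fast` evaluated at the value returned by `crossover_time` is 0.5, exceeds 0.5 at earlier times, and falls below 0.5 at later times.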
To connect the aggregation properties to chain characteristics, findings from previous work characterizing the L and G single-chain stationary points (transition and intermediate states) were included. Briefly, the structures along the folding pathway were isolated, and for each structure we determined Pfold, the probability that the structure will reach the native state before unfolding (Du et al. 1998). Structures with Pfold near 0.5 (0.4 ≤ Pfold ≤ 0.6) were identified as transition states (see Brown and Head-Gordon 2004 for full details). An ISE was postulated for protein G by observing the trajectories starting from Pfold ≈ 0.5 that did not unfold. To characterize the denatured state ensemble for the folding of individual chains, we simulated 16 independent single-chain trajectories at constant temperature to collect states at the folding temperature that reside in the unfolded basin, defined to be 0.0 < χ < 0.4 and 2.5 < Rg < 4.5, as estimated from the partition function at Tf. When we compare the resulting distribution to the histogram partition function to validate that we are sampling the proper distribution of states, we find good agreement.
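The selection criteria in this paragraph can be sketched as simple filters. The thresholds are those quoted in the text; the helper names are illustrative, the Rg calculation assumes equal bead masses (m = 1 in reduced units), and `estimate_pfold` assumes Pfold is estimated as the fraction of trial trajectories that fold before unfolding.

```python
import numpy as np


def radius_of_gyration(coords):
    """Radius of gyration for equal-mass beads with coordinates (n, 3)."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))


def in_unfolded_basin(chi, rg):
    """Unfolded basin as defined in the text: 0.0 < chi < 0.4, 2.5 < Rg < 4.5."""
    return 0.0 < chi < 0.4 and 2.5 < rg < 4.5


def estimate_pfold(outcomes):
    """Fraction of trial trajectories (True = folded first) that folded."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)


def is_transition_state(pfold):
    """Transition-state window used in the text: 0.4 <= Pfold <= 0.6."""
    return 0.4 <= pfold <= 0.6
```

A structure would be assigned to the TSE when `is_transition_state(estimate_pfold(outcomes))` is true for its set of trial trajectories.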
Acknowledgments
We acknowledge financial support from UC Berkeley and a subcontract award under NSF grant no. CHE-0205170. N.J.F. gratefully acknowledges the Whitaker Foundation for a graduate research fellowship. We thank Troy Cellmer and Harvey Blanch for communicating their amyloid aggregation data on protein L.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.041177505.
Supplemental material: see www.proteinscience.org
References
- Alm, E., Morozov, A.V., Kortemme, T., and Baker, D. 2002. Simple physical models connect theory and experiment in protein folding kinetics. J. Mol. Biol. 322 463–476.
- Broome, B.M. and Hecht, M.H. 2000. Nature disfavors sequences of alternating polar and non-polar amino acids: Implications for amyloidogenesis. J. Mol. Biol. 296 961–968.
- Brown, S. and Head-Gordon, T. 2004. Intermediates and the folding of proteins L and G. Protein Sci. 13 958–970.
- Capaldi, A.P., Shastry, M.C.R., Kleanthous, C., Roder, H., and Radford, S.E. 2001. Ultrarapid mixing experiments reveal that Im7 folds via an on-pathway intermediate. Nat. Struct. Biol. 8 68–72.
- Capaldi, A.P., Kleanthous, C., and Radford, S.E. 2002. Im7 folding mechanism: Misfolding on a path to the native state. Nat. Struct. Biol. 9 209–216.
- Chiti, F., Webster, P., Taddei, N., Clark, A., Stefani, M., Ramponi, G., and Dobson, C.M. 1999. Designing conditions for in vitro formation of amyloid protofilaments and fibrils. Proc. Natl. Acad. Sci. 96 3590–3594.
- Clark, E.D. 2001. Protein refolding for industrial processes [Review]. Curr. Opin. Biotech. 12 202–207.
- Crowhurst, K.A., Choy, W.Y., Mok, Y.K., and Forman-Kay, J. 2003. Corrigendum to the paper by Mok et al. 1999. ‘NOE data demonstrating a compact unfolded state for an SH3 domain under non-denaturing conditions.’ J. Mol. Biol. 329 185–187.
- DeLano, W.L. 2002. The PyMOL user’s manual. DeLano Scientific, San Carlos, CA.
- de Laureto, P.P., Taddei, N., Frare, E., Capanni, C., Costantini, S., Zurdo, J., Chiti, F., Dobson, C.M., and Fontana, A. 2003. Protein aggregation and amyloid fibril formation by an SH3 domain probed by limited proteolysis. J. Mol. Biol. 334 129–141.
- Dobson, C.M. 2003a. Protein folding and disease: A view from the first Horizon Symposium. Nat. Rev. Drug Discov. 2 154–160.
- Dobson, C.M. 2003b. Protein folding and misfolding [Review]. Nature 426 884–890.
- Du, R., Pande, V.S., Grosberg, A.Y., Tanaka, T., and Shakhnovich, E.S. 1998. On the transition coordinate for protein folding. J. Chem. Phys. 108 334–350.
- Ferguson, D.M. and Garrett, D.G. 1999. Simulated annealing: Optimal histogram methods. Monte Carlo Methods Chem. Phys. 105 311–336.
- Fersht, A. 1995. Optimization of rates of protein folding: The nucleation–condensation mechanism and its implications. Proc. Natl. Acad. Sci. 92 10869–10873.
- Fink, A.L. 1998. Protein aggregation: Folding aggregates, inclusion bodies and amyloid [Review]. Fold. Des. 3 R9–R23.
- Gu, H.D., Yi, Q.A., Bray, S.T., Riddle, D.S., Shiau, A.K., and Baker, D. 1995. A phage display system for studying the sequence determinants of protein folding. Protein Sci. 4 1108–1117.
- Gu, H., Kim, D., and Baker, D. 1997. Contrasting roles for symmetrically disposed β-turns in the folding of a small protein. J. Mol. Biol. 274 588–596.
- Head-Gordon, T. and Brown, S. 2003. Minimalist models for protein folding and design. Curr. Opin. Struct. Biol. 13 160–167.
- Horwich, A. 2002. Protein aggregation in disease: A role for folding intermediates forming specific multimeric interactions. J. Clin. Invest. 110 1221–1232.
- Kim, D.E., Gu, H.D., and Baker, D. 1998. The sequences of small proteins are not extensively optimized for rapid folding by natural selection. Proc. Natl. Acad. Sci. 95 4982–4986.
- King, J., Haasepettingell, C., Robinson, A.S., Speed, M., and Mitraki, A. 1996. Thermolabile folding intermediates: Inclusion body precursors and chaperonin substrates [Review]. FASEB J. 10 57–66.
- Kortemme, T., Kelly, M.J.S., Kay, L.E., Forman-Kay, J., and Serrano, L. 2000. Similarities between the spectrin SH3 domain denatured state and its folding transition state. J. Mol. Biol. 297 1217–1229.
- Kumar, S., Bouzida, D., Swendsen, R., Kollman, P.A., and Rosenberg, J.M. 1992. The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13 1011–1021.
- Lindberg, M., Tangrot, J., and Oliveberg, M. 2002. Complete change of the protein folding transition state upon circular permutation. Nat. Struct. Biol. 9 818–822.
- McCallister, E.L., Alm, E., and Baker, D. 2000. Critical role of β-hairpin formation in protein G folding. Nat. Struct. Biol. 7 669–673.
- Mok, Y.K., Kay, C.M., Kay, L.E., and Forman-Kay, J. 1999. NOE data demonstrating a compact unfolded state for an SH3 domain under non-denaturing conditions. J. Mol. Biol. 289 619–638.
- Park, S.H., O’Neil, K.T., and Roder, H. 1997. An early intermediate in the folding reaction of the B1 domain of protein G contains a native-like core. Biochemistry 36 14277–14283.
- Park, S.H., Shastry, M.C.R., and Roder, H. 1999. Folding dynamics of the B1 domain of protein G explored by ultrarapid mixing. Nat. Struct. Biol. 6 943–947.
- Plaxco, K.W., Simons, K.T., and Baker, D. 1998. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277 985–994.
- Ramirez-Alvarado, M., Merkel, J.S., and Regan, L. 2000. A systematic exploration of the influence of the protein stability on amyloid fibril formation in vitro. Proc. Natl. Acad. Sci. 97 8979–8984.
- Safar, J., Roller, P.P., Gajdusek, D.C., and Gibbs, C.J. 1994. Scrapie amyloid (prion) protein has the conformational characteristics of an aggregated molten globule folding intermediate. Biochemistry 33 8375–8383.
- Sanchez, I.E. and Kiefhaber, T. 2003. Hammond behavior versus ground state effects in protein folding: Evidence for narrow free energy barriers and residual structure in unfolded states. J. Mol. Biol. 327 867–884.
- Scalley, M.L., Yi, Q., Gu, H.D., McCormack, A., Yates, J.R., and Baker, D. 1997. Kinetics of folding of the IgG binding domain of peptostreptococcal protein L. Biochemistry 36 3373–3382.
- Schwartz, R., Istrail, S., and King, J. 2001. Frequencies of amino acid strings in globular protein sequences indicate suppression of blocks of consecutive hydrophobic residues. Protein Sci. 10 1023–1031.
- Silow, M., Tan, Y.J., Fersht, A.R., and Oliveberg, M. 1999. Formation of short-lived protein aggregates directly from the coil in two-state folding. Biochemistry 38 13006–13012.
- Sorenson, J. and Head-Gordon, T. 2000. Matching simulation and experiment: A new simplified model for protein folding. J. Comput. Biol. 7 469–481.
- Taddei, N., Chiti, F., Fiaschi, T., Bucciantini, M., Capanni, C., Stefani, M., Serrano, L., Dobson, C.M., and Ramponi, G. 2000. Stabilisation of α-helices by site-directed mutagenesis reveals the importance of secondary structure in the transition state for acylphosphatase folding. J. Mol. Biol. 300 633–647.
- Uversky, V.N., Segel, D.J., Doniach, S., and Fink, A.L. 1998. Association-induced folding of globular proteins. Proc. Natl. Acad. Sci. 95 5480–5483.
- West, M.W., Wang, W.X., Patterson, J., Mancias, J.D., Beasley, J.R., and Hecht, M.H. 1999. De novo amyloid proteins from designed combinatorial libraries. Proc. Natl. Acad. Sci. 96 11211–11216.
- Wright, C.F., Lindorff-Larsen, K., Randles, L.G., and Clarke, J. 2003. Parallel protein-unfolding pathways revealed and mapped. Nat. Struct. Biol. 10 658–662.
- Yi, Q., Scalley-Kim, M.L., Alm, E.J., and Baker, D. 2000. NMR characterization of residual structure in the denatured state of protein L. J. Mol. Biol. 299 1341–1351.



