The Journal of Chemical Physics. 2009 Jun 18;130(23):235106. doi: 10.1063/1.3152842

Generic coarse-grained model for protein folding and aggregation

Tristan Bereau 1,a), Markus Deserno 1,b)
PMCID: PMC3910140  PMID: 19548767

Abstract

A generic coarse-grained (CG) protein model is presented. The intermediate level of resolution (four beads per amino acid, implicit solvent) allows for accurate sampling of local conformations. It relies on simple interactions that emphasize structure, such as hydrogen bonds and hydrophobicity. Realistic α/β content is achieved by including an effective nearest-neighbor dipolar interaction. Parameters are tuned to reproduce both local conformations and tertiary structures. The thermodynamics and kinetics of a three-helix bundle are studied. We check that the CG model is able to fold proteins with tertiary structures and amino acid sequences different from the one used for parameter tuning. By studying both helical and extended conformations we make sure the force field is not biased toward any particular secondary structure. The accuracy achieved in folding not only the test protein but also other ones shows strong evidence for amino acid cooperativity embedded in the model. Without any further adjustments or bias a realistic oligopeptide aggregation scenario is observed.

INTRODUCTION

Proteins are the building blocks of biology. They are evolutionarily optimized heteropolymers, whose physical and material properties more often than not exceed what can be readily understood from conventional polymer physics reasoning, which derives much of its strength from uniformity, randomness, and the law of large numbers. In contrast, the complexity of proteins rests on the different physical and chemical properties of their monomers, the 20 physiological amino acids, and their intricate combination into what at cursory inspection only seems to be a random heteropolymer sequence. Moreover, the main interactions that drive their folding into intricate secondary, tertiary, and quaternary physiological structures are weak, comparable to thermal energy. The overall stability of a protein is perilously marginal,1, 2 so proteins very often rely on cooperative effects to keep them in their native structure—one appealing reason for why they might be so much bigger than what their comparatively small active centers would make one suspect. This aspect also makes them very hard to model and to coarse grain, because it is extremely difficult to understand from first principles which interactions are essential and which can be approximated.

As for many other soft matter systems, the success of atomistic protein simulations is limited by available computer power. It is not so much the sheer number of atoms involved that poses the main challenge but rather the long equilibration time associated with a system that in a highly nontrivial phase space can so easily get stuck. Coarse-grained (CG) simulations intend to address this problem by lowering the level of resolution.3 A smaller number of quasiatoms, or “beads,” decreases the computational requirements and accelerates the speed of Monte Carlo (MC) or molecular dynamics (MD) simulations. It also smoothens out the free energy landscape by reducing molecular friction, which artificially accelerates the dynamics and makes phase space both smaller and more navigable. However, compared to many other examples in soft matter, it is often precisely these small local interactions that contribute to the overall stability of the native state which makes the process of “throwing away detail for the benefit of the greater good” so much more daring. One might thus prefer to wait until computers have become even faster, and the progress in atomistic simulations looks indeed promising. Yet, the undeniable need to access really big systems of crucial relevance as well as the insatiable scientific interest to find what really matters in these systems both drive the development of new CG protein models.

The field of CG protein modeling is very diverse and has a rich history owing to a wide variety of problems to tackle, as well as length and time scales to look at.4, 5, 6, 7, 8, 9, 10, 11, 12, 13 Various levels of resolution have been designed to study many different problems. On the coarser side of particle-based simulations, conformational effects of hydrophobic interactions were studied using lattice simulations.14 This is a very powerful tool that is still widely used when looking at large-scale cooperativity effects. Soon, off-lattice simulations were developed using one bead per amino acid with implicit solvent; famous examples are Gō models.15 This level of resolution allows for much more conformational freedom, which is key to structural studies. One underlying constraint in such models is that structure is biased toward the native configuration of the protein because the remaining degrees of freedom do not suffice to accurately represent the system’s phase space, including secondary structure motifs. Intermediate resolution models (more than one bead per amino acid) have been designed to investigate structural properties of proteins while emphasizing certain aspects. For instance, the recently introduced MARTINI force field16 opts for a high resolution on the protein’s side chains, while the backbone is represented by only one bead per amino acid. The force field was parametrized using partitioning coefficients between water and a (similarly CG) lipid membrane. By doing so, structural properties in transmembrane proteins can be accurately investigated (see, e.g., Ref. 17). Other models with a comparable overall resolution shift the emphasis (in terms of modeling detail) to the backbone instead of the side chain in order to look at structure and conformational properties without biasing the force field to the native configuration. Several force fields (see, e.g., Refs. 18, 19, 20) have been reported to fold de novo helical proteins. These models incorporate only a subset of amino acids, emphasizing their chemical effects (e.g., hydrophobic, polar, glycine residue).

Intermediate level resolution models have shown promising results in capturing local conformations and reproducing basic aspects of secondary structure recognition while gaining much computational efficiency compared to atomistic models. This is partly due to the removal of solvent, which allows for significant speedup, as water typically represents the bulk of a simulation in such systems. As a result, it is necessary to treat important solvent effects implicitly, as they are determining factors in a protein’s conformation.

While α-helices are comparatively easy to obtain in such models, β-sheets and structures are more difficult to stabilize. There are several reasons for this. First, the enthalpic gain per amino acid is weaker than in α-helices.21 Second, Yang and Honig21 showed that side chain–side chain interactions have a decisive role in sheet formation. And third, the stabilization energy contains a contribution from interactions between dipoles of successive peptide bonds that is usually neglected in simple models, yet it favors the β- over the α-structure.22 Apart from these local effects, the stability of extended conformations therefore also depends greatly on cooperativity. Other than stabilizing folds, this can also lead to peptide aggregation. Besides being an interesting physical problem, peptide aggregation is associated with countless biological processes. It also plays a crucial role in many diseases, ranging from sickle cell anemia23 to Alzheimer's disease.24

In this work, we present a CG model with four beads per amino acid in implicit solvent. It differs from previously mentioned intermediate level force fields in several ways. First, by improving on full amino acid specificity it provides a more detailed free energy landscape. Second, protein folding is quantitatively probed by comparing our MD simulations with experimental data instead of the lowest energy structure that was sampled. Third, after tuning our force field with respect to one protein (in terms of tertiary structure reproduction), others are tested to understand how reliable this procedure is. Fourth, an important design criterion for our model is its ability to sample extended conformations. By requiring a balanced proportion between α-helical and β-extended conformations, our model aims to avoid a bias toward any particular secondary structure. Finally, we monitor the aggregation of small peptides (into β-sheets) to test whether a realistic aggregation scenario in the long-time and large length-scale regime can be achieved.

In order to parametrize and test our force field as finely as possible, we systematically compare the performance of our CG model with experimental data. We hasten to add, though, that refining CG models is no attempt to compete with atomistic force fields. Such an endeavor strikes us as neither likely to succeed nor to be in line with the reasons one pursues coarse graining in the first place, namely, to gain a physical understanding of fundamental mechanisms and universals of complex molecular structures. However, in systems as delicate as marginally stable proteins a subtle local interaction can have a substantial global impact, and uncovering causations of this type is well within the scope of CG studies.

The paper is divided into several parts: the mapping scheme will explain how atomistic details were coarse grained out, the different interactions as well as parameter tuning and simulation methods will be described, and finally several applications will show to what extent the model can reproduce structural properties.

MAPPING SCHEME

Overall geometry

An amino acid is modeled by three or four beads (Fig. 1). These beads represent the amide group N, central carbon Cα, carbonyl group C, and (for nonglycine residues) a side chain Cβ. The first three beads belong to the backbone of the protein chain, whereas the last one represents the side chain and is responsible for amino acid specificity. This high level of backbone resolution is necessary to account for the characteristic conformational properties underlying secondary protein structure. As far as reducing the number of degrees of freedom is concerned, this high resolution is regrettable, as the backbone is represented almost atomistically. Indeed, models that do not require the CG protein to represent local structure generally do away with most (if not all) backbone beads (e.g., Ref. 16). However, here we explicitly aim at a model that is capable of finding secondary structure by itself. This is, for instance, necessary in applications where this structure is known to change (e.g., misfolding, spontaneous aggregation) or not known at all.

Figure 1.

Schematic figure of the local geometry of the protein chain. The solid beads represent one amino acid. Neighboring amino acid beads are represented in dashed lines.

Parameter values

Geometric parameters were taken from existing peptide models9, 18, 19 and are reported in Table 1. Even though the spatial arrangement of the beads was fixed beforehand, the van der Waals radii were left as free parameters. Following the abovementioned references, Cβ was set at the location of the first carbon of the side chain (hence our nomenclature), directly connected to the backbone. Its location will generally not coincide with the center of mass of the atomistic side chain (which for larger and flexible side chains has no fixed position with respect to the backbone), but the concomitant substantial reduction in tuning parameters is necessary for our parametrization scheme, as we will see below.

Table 1.

Bonded interaction parameters used in the model. The dihedrals denoted with an asterisk were determined during parameter tuning (see Sec. 4). All parameters are expressed in terms of the intrinsic units of the system. k represents the interaction strength of Fourier mode n (see main text), with equilibrium value φ0. ωpro refers to the ω dihedral around the peptide bond for a proline residue. The sign of the improper dihedral angle φ0 is linked to the chirality of the isomer; the L form requires a negative sign. For each angular potential, only a single mode n was used.

Bond lengths
                    NCα     CαC     CN      CαCβ
r0 (Å)              1.455   1.510   1.325   1.530
k_bond (E/Å²)       300     300     300     300

Bond angles
                    NCαCβ   CβCαC   NCαC    CαCN    CNCα
θ0 (deg)            108     113     111     116     122
k_angle (E/deg²)    300     300     300     300     300

Dihedrals
                    ϕ       ψ       ω       ωpro    Improper
k (E)               −0.3    −0.3    67.0    3.0     17.0
n                   1       1       1       2       1
φ0 (deg)            0       0       180     0       120

All side chain beads have been given the same van der Waals radius, except for glycine, which is modeled without a side chain. This accounts for the biggest difference in the Ramachandran plot of amino acids, namely, the large flexibility of an achiral glycine residue, as opposed to the substantial chiral steric clashes between all the others.25 On the other hand, it does not represent the size differences between nonglycine residues and will thus likely cause problems if packing issues are important, e.g., inside globular proteins.

Both the location and the size of the side chain are thus modeled in an approximate and highly simplified way. Why not be more sophisticated? Since these degrees of freedom are accounted for, one might as well give them the best possible parameter values. Ideally this is indeed what one would like to do, but the catch is that the necessary tuning is very difficult. Having 20 different amino acids gives, in the worst case, $20^3 = 8000$ local Ramachandran plots for the (ϕ,ψ) angles between three consecutive amino acids. These would first need to be determined atomistically and then, via some suitable matching procedure, translated into CG side chain properties. Clearly, many obvious simplifications would be possible and the task is not nearly as daunting. The number of free parameters would nevertheless be substantially increased and their tuning would require both automated techniques and enormous computing resources. In contrast, in the present model we aim to keep the number of free parameters as low as possible, such that judicious tuning by hand is still a viable option. We will see below that it is also successful. While optimization of side chain parameters will remain a long term goal, this is certainly not the point where to start.

Finally, amino acids that are in the middle of a protein chain form peptide bonds with their neighbors. This is not so at the ends of the chain, and the structure is slightly different. Nonetheless, we model the end beads identically.

Units

All lengths are measured in units of L, which we choose to be 1 Å. For the energies we found it convenient to relate them to the thermal energy, since it is this balance which determines the overall protein conformation. We thus define the energy unit $E = k_B T_r = 1.38\times10^{-23}\ \mathrm{J\,K^{-1}} \times 300\ \mathrm{K} \approx 4.1\times10^{-21}\ \mathrm{J} \approx 0.6\ \mathrm{kcal\,mol^{-1}}$ as the thermal energy at room temperature.

Masses will be measured in the unit “M,” which is the mass of a single CG bead. We will assume all beads to have the same mass. An amino acid weighs on average 110 Da. By distributing mass equally among the four beads N, Cα, Cβ, and C, this gives an average mass of $M \approx 4.6\times10^{-26}$ kg.

The natural time unit in our simulation is $\tau = L\sqrt{M/E}$. Using the length, energy, and mass mappings from above, we find $\tau \sim 0.1$ ps. This unit of time correctly describes the instantaneous dynamics of a fictitious CG bead-spring system (e.g., it leads to a value of the instantaneous velocity and associated kinetic energy that satisfies the equipartition theorem). However, it is crucial to understand that it does not measure the time which the real protein system requires to undergo the same conformational changes as observed in the simulation. The reason is that the reduction in degrees of freedom removes friction (smoothes the free energy landscape) and speeds up the motion through phase space. Translating τ into a reasonable measure for actual dynamics requires the determination of the associated speedup factor, which is typically accomplished by mapping an easily observable dynamic process between the experimental system and the CG simulation (such as diffusion). However, in the case of the conformational dynamics of proteins the identification of a suitable dynamic process is much less obvious. We defer this task to future work. It should be recalled that as far as equilibrium questions are concerned the precise time mapping is, of course, irrelevant.
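
The unit mappings above are plain arithmetic and can be checked in a few lines. The sketch below assumes CODATA values for the physical constants and the relation τ = L√(M/E); the variable names are illustrative and not part of the model.

```python
import math

# Physical constants in SI units (CODATA values, assumed here)
K_B = 1.380649e-23        # Boltzmann constant, J/K
N_A = 6.02214076e23       # Avogadro constant, 1/mol
DALTON = 1.66053907e-27   # atomic mass unit, kg
J_PER_KCAL = 4184.0       # joules per kilocalorie

T_ROOM = 300.0            # K, room temperature used in the text
LENGTH_UNIT = 1.0e-10     # L = 1 Angstrom, expressed in m

energy_unit = K_B * T_ROOM                          # E in J
energy_unit_kcal = energy_unit * N_A / J_PER_KCAL   # E in kcal/mol
mass_unit = 110.0 * DALTON / 4.0                    # M in kg: 110 Da shared by 4 beads
time_unit = LENGTH_UNIT * math.sqrt(mass_unit / energy_unit)  # tau in s

print(f"E   = {energy_unit:.2e} J = {energy_unit_kcal:.2f} kcal/mol")
print(f"M   = {mass_unit:.2e} kg")
print(f"tau = {time_unit * 1e12:.2f} ps")
```

With these inputs the time unit comes out at a few tenths of a picosecond, the same order of magnitude as the value quoted in the text.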

INTERACTIONS

Bonded interactions

The local structure is constrained by bonded interactions. Bonds and angle potentials are chosen to be harmonic:

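$$V_{\mathrm{bond}}(r) = \tfrac{1}{2}\,k_{\mathrm{bond}}\,(r - r_0)^2, \tag{1a}$$
$$V_{\mathrm{angle}}(\theta) = \tfrac{1}{2}\,k_{\mathrm{angle}}\,(\theta - \theta_0)^2. \tag{1b}$$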

The spring constants kbond and kangle are set high enough to keep these coordinates close to their minimum (within 5%). Table 1 reports these parameters.
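
As a rough consistency check on the stiffness of the bonds, the sketch below evaluates Eq. (1a) with the bond parameters of Table 1 and estimates the thermal spread of each bond length from equipartition. Reading "within 5%" as a one-standard-deviation spread at room temperature is an assumption made here, not a statement from the text, and the function and variable names are illustrative.

```python
import math

K_BOND = 300.0   # bond spring constant, E / Angstrom^2 (Table 1)
KT = 1.0         # thermal energy at room temperature, in units of E

# Equilibrium bond lengths in Angstrom (Table 1)
BOND_LENGTHS = {"N-CA": 1.455, "CA-C": 1.510, "C-N": 1.325, "CA-CB": 1.530}

def bond_energy(r, r0, k=K_BOND):
    """Harmonic bond potential, Eq. (1a), in units of E."""
    return 0.5 * k * (r - r0) ** 2

# Equipartition for a harmonic coordinate: <V> = kT/2, so sigma = sqrt(kT/k)
sigma_r = math.sqrt(KT / K_BOND)

# Relative one-sigma fluctuation of each bond length
relative_spread = {name: sigma_r / r0 for name, r0 in BOND_LENGTHS.items()}

for name, spread in relative_spread.items():
    print(f"{name}: {100 * spread:.1f} % of r0")
```

Under this reading every bond fluctuates by roughly 4% of its equilibrium length.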

Up to thermal fluctuations bonds and angles are thus fixed. Flexibility of the overall structure enters through the dihedrals, the possibility to rotate around a chemical bond. In the case of proteins, two out of three backbone dihedrals are very flexible and are responsible for the diverse set of local conformations. These dihedrals are the ϕ and ψ coordinates, defined by the sets of beads CNCαC and NCαCN, respectively (see Fig. 1). They describe the angle between two planes (e.g., ϕ is the angle between the planes CNCα and NCαC) and obey the following convention: taking any four beads 1, 2, 3, and 4 and looking along the vector from bead 2 to bead 3, the angle “0” will correspond to the conformation in which beads 1 and 4 point into the same direction (i.e., when they visually overlap). The rotation of plane 1, 2, 3 with respect to plane 2, 3, 4 away from this state defines the angle; the counterclockwise sense counts positive. Because the potential of rotation around the bond between sp³- and sp²-hybridized atoms has a rather low barrier compared to thermal energy at room temperature, we let the beads rotate freely. However, we will later include a contribution to the coordinates ϕ and ψ accounting for an effective nonbonded dipolar interaction (see below).

The third dihedral along the backbone chain, ω, defined by CαCNCα, is located at the peptide bond (see Fig. 1). This bond corresponds to the rotation around two sp²-hybridized atoms, which involves a symmetric potential with two minima, separated by a rather high barrier. The two conformations, cis and trans, have angles of 0° and 180°, respectively. The cis conformation tends to be sterically unfavored for most amino acids, except for proline where there is no specific preference due to its special side chain linkage.

Generally, dihedrals can be written as a Fourier series in the rotation angle. Here we will restrict ourselves to a single mode and describe the interaction as

$$V_{\mathrm{dih}}(\varphi) = k_n\left[1 - \cos(n\varphi - \varphi_{n,0})\right], \tag{2}$$

with coefficient $k_n$ and phase $\varphi_{n,0}$. In our model we represent the peptide bond using only one minimum ($n = 1$) centered around the trans conformation. In this case $\varphi_0 \equiv \varphi_{1,0}$ is the equilibrium orientation of the dihedral and $k \equiv k_1$ is the stiffness describing deviations from the equilibrium angle. For a peptide bond located right before a proline residue, we model the isomerization by a dihedral potential with two minima ($n = 2$, $k \equiv k_2$), one at the cis conformation and the other one at trans. This allows for a more natural representation of the different conformations proline can take. Depending on the problem one is interested in, the energy barrier can be tuned to either freeze the isomerization or set to a low value to allow efficient sampling. We chose the latter in this work. This choice will of course affect the kinetics of the system.
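
To make the single-mode form concrete, the sketch below evaluates V_dih(φ) = k_n[1 − cos(nφ − φ_{n,0})] with the ω and ωpro parameters of Table 1. It is a minimal illustration; the function name and the dictionary layout are assumptions introduced here.

```python
import math

def dihedral_energy(phi_deg, k, n, phi0_deg):
    """Single-mode dihedral potential k * [1 - cos(n*phi - phi0)], in units of E.

    Angles are given in degrees and converted internally to radians.
    """
    phi = math.radians(phi_deg)
    phi0 = math.radians(phi0_deg)
    return k * (1.0 - math.cos(n * phi - phi0))

# Parameters from Table 1 (energies in units of E, angles in degrees)
OMEGA = dict(k=67.0, n=1, phi0_deg=180.0)      # regular peptide bond
OMEGA_PRO = dict(k=3.0, n=2, phi0_deg=0.0)     # peptide bond preceding proline

for angle in (0.0, 90.0, 180.0):
    print(angle,
          round(dihedral_energy(angle, **OMEGA), 3),
          round(dihedral_energy(angle, **OMEGA_PRO), 3))
```

With n = 1 the only minimum sits at trans (180°), and cis costs 2k. With n = 2 the cis and trans states are degenerate and separated by a barrier of 2k at 90°, which is the low barrier chosen here to allow efficient sampling.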

The central carbon Cα not only links the backbone to the side chain; its sp³ hybridization imposes a tilted orientation of the CαCβ vector compared to the NCαC plane. Its four bonds are located at the vertices of a tetrahedron, linking the backbone atoms N and C, as well as the Cβ side chain and an extra hydrogen (not modeled by us). This has an important consequence, because a carbon atom with four different substituents is chiral and hence optically active. All amino acids except glycine exist as two different stereoisomers. The L form is realized in native amino acids: looking at the central carbon Cα, with the hydrogen atom pointing away, the isomer has L form if the three other substituents C, Cβ, and N are arranged in a counterclockwise fashion (“CORN rule”). This amino acid chirality is a central feature in proteins and their secondary structure, and we account for it by including an “improper dihedral” between the beads NCαCCβ. This keeps a tilt between the backbone plane, NCαC, and the plane intersecting the side chain with two backbone beads, CαCCβ, such that all angles are correct and the CORN rule is satisfied. The interaction has the same form as other dihedrals, given by Eq. 2. The two stereoisomers only differ in the sign of the dihedral equilibrium angle φ0 and can thus both be modeled.

Nonbonded interactions

Probably the biggest challenge in any coarse graining scheme is determining the nonbonded interactions. Unlike bonded interactions, their form is not intrinsically obvious and the system behavior depends very sensitively on them. In the following section every interaction introduced will require at least one free parameter that has to be determined by tuning. The key technical difficulty of this enterprise is that all parameters are typically highly correlated. Optimization is thus an intrinsically multidimensional problem and we therefore intend to limit the number of free parameters as much as possible. While one might envision “hands-off” tuning schemes in which optimization occurs in an automated fashion,26 for the present problem we found this difficult to implement for two reasons: First, parameter variations often have a rather inconspicuous impact on target observables, and the determination of the right gradient in parameter space thus can require very substantial computer time. And second, some optimization aims are hard to quantify in numbers and rather require judgment and choice—e.g., the question how one balances the quality of a local Ramachandran plot against global folding characteristics. While we are aware of several obvious extensions and improvements of our present model that would ultimately benefit from such automated fine-tuning, this is not the point where we wish to start.

Backbone

Steric interactions are closely linked to secondary and tertiary structures for two reasons: first, local interactions along the protein chain will shape the Ramachandran plot; second, contact between distant parts of the amino acid chain will determine protein packing on larger scales. In order to model a local excluded volume, we use a purely repulsive Weeks–Chandler–Andersen (WCA) potential

$$V_{\mathrm{bb}}(r) = \begin{cases} 4\epsilon_{\mathrm{bb}}\left[\left(\dfrac{\sigma_{ij}}{r}\right)^{12} - \left(\dfrac{\sigma_{ij}}{r}\right)^{6} + \dfrac{1}{4}\right], & r \le r_c,\\[2mm] 0, & r > r_c, \end{cases} \tag{3}$$

where $r_c = 2^{1/6}\sigma_{ij}$ and $\sigma_{ij}$ is the arithmetic mean between the two bead sizes involved, following the Lorentz–Berthelot mixing rule. Just like the bead sizes, the energy $\epsilon_{\mathrm{bb}}$ is a free parameter, though we use only one parameter for all backbone-backbone and backbone–side chain interactions, since for the WCA potential the energy scale is largely immaterial. Following the practice in atomistic simulations, we do not calculate excluded volume interaction between beads that are less than three bonds apart.
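
The WCA form is short enough to write out directly. The sketch below implements the purely repulsive potential with the arithmetic (Lorentz–Berthelot) mean for the cross diameter and the backbone values of Table 3; the function and dictionary names are illustrative, and the exclusion of beads fewer than three bonds apart is not handled here.

```python
# Backbone bead diameters in Angstrom and energy scale in units of E (Table 3)
SIGMA = {"N": 2.9, "CA": 3.7, "C": 3.5}
EPS_BB = 0.02

def wca(r, sigma_i, sigma_j, eps=EPS_BB):
    """Purely repulsive WCA potential between two beads, in units of E."""
    sigma = 0.5 * (sigma_i + sigma_j)          # Lorentz-Berthelot arithmetic mean
    r_c = 2.0 ** (1.0 / 6.0) * sigma           # cutoff at the Lennard-Jones minimum
    if r > r_c:
        return 0.0
    x6 = (sigma / r) ** 6
    return 4.0 * eps * (x6 * x6 - x6 + 0.25)   # shifted so that V(r_c) = 0

# Example: N and C beads of residues far apart along the chain
sigma_nc = 0.5 * (SIGMA["N"] + SIGMA["C"])
print(wca(sigma_nc, SIGMA["N"], SIGMA["C"]))        # equals eps at r = sigma
print(wca(1.5 * sigma_nc, SIGMA["N"], SIGMA["C"]))  # beyond r_c: zero
```

The shift by 1/4 makes the potential vanish continuously at r_c, so no attraction is left between backbone beads.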

Side chain interactions

Amino acids differ in their water solubility. This can be quantified experimentally by measuring the partitioning of residues between water and a hydrophobic environment (e.g., Ref. 27). The ratio of densities of a residue in the two environments can be translated into a free energy of transfer from one medium to another. Hydrophobicity is one prominent cause for certain amino acids to attract. However, there are other reasons why residues interact (e.g., charges or hydrogen bonds between side chains) and this combination can be probed by statistical analyses of residue-residue contacts in proteins.28, 29, 30, 31, 32 One then arrives at a phenomenological interaction energy between any two residues A and B that depends on the number of close AB contacts that are found in a pool of protein structures. This mean-field approach (it averages over all neighboring contacts) not only contains information on the relative hydrophobicity of amino acids but also partially incorporates effects coming from additional interactions (e.g., salt bridges or side chain hydrogen bonds). In the absence of explicit solvent we represent this phenomenological cohesion by introducing an effective attraction (of standard Lennard-Jones 12-6 type) between Cβ side chain beads, whose strength is mapped to such a statistical analysis of residue-residue contacts. Specifically, we used Miyazawa and Jernigan’s (MJ) statistical analyses28 to extract a relative attraction strength between residues. To translate this into an absolute scale, one additional free parameter ϵhp is needed.

Miyazawa and Jernigan analyzed residue-residue contacts in crystallized proteins. By modeling interactions via square-well potentials, they obtained interaction strengths $\epsilon_{ij}^{\mathrm{MJ}}$ for every i-j pair of residues. We reduced the resulting 20×20 interaction matrix further by deconvolving it into 20 interaction parameters $\epsilon_i$ (one for each amino acid), which approximately recreate all interactions as the geometric mean of the two amino acids involved, $\epsilon_{ij}^{\mathrm{MJ}} \approx \epsilon_{ij} = \sqrt{\epsilon_i \epsilon_j}$, following the Lorentz–Berthelot mixing rule. Each term is then normalized,

$$\epsilon'_i = \frac{\epsilon_i - \min_k \epsilon_k}{\max_k \epsilon_k - \min_k \epsilon_k}, \tag{4}$$

such that the most hydrophilic residue has a weight of 0 and the most hydrophobic a weight of 1, and the normalized interaction contact is denoted $\epsilon'_{ij} = \sqrt{\epsilon'_i \epsilon'_j}$. Finally, we multiply this term by the overall interaction scale $\epsilon_{\mathrm{hp}}$. One limitation in varying the interaction strength of a Lennard-Jones potential is that a low $\epsilon'_{ij}$ will tend to flatten out the repulsive part of the interaction. This will, as a result, fade the excluded volume effect for certain side chain beads, which is likely to exacerbate packing problems in dense regions. To overcome this issue and keep the same excluded volume for all side chain beads, we model the overall interaction by using a Lennard-Jones potential for the attractive part linked to a purely repulsive WCA potential for smaller distances. We join the two potentials at the minimum value of the interaction in such a way that both the potential and its first derivative are continuous. Overall, the interaction will have the following form:

$$V_{\mathrm{hp}}(r) = \begin{cases} 4\epsilon_{\mathrm{hp}}\left[\left(\dfrac{\sigma_{C_\beta}}{r}\right)^{12} - \left(\dfrac{\sigma_{C_\beta}}{r}\right)^{6}\right] + \epsilon_{\mathrm{hp}}\left(1 - \epsilon'_{ij}\right), & r \le r_c,\\[2mm] 4\epsilon_{\mathrm{hp}}\,\epsilon'_{ij}\left[\left(\dfrac{\sigma_{C_\beta}}{r}\right)^{12} - \left(\dfrac{\sigma_{C_\beta}}{r}\right)^{6}\right], & r_c \le r \le r_{\mathrm{hp,cut}},\\[2mm] 0, & r > r_{\mathrm{hp,cut}}. \end{cases} \tag{5}$$
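
The piecewise side chain potential can be sketched as follows, using the Table 3 values for the Cβ diameter, the scale ϵhp, and the cutoff. The constant shift in the inner branch is taken to be ϵhp(1 − ϵ′ij), which is the value implied by the requirement that the two branches meet continuously at the Lennard-Jones minimum r_c = 2^{1/6}σ; function and constant names are illustrative.

```python
SIGMA_CB = 5.0      # side chain bead diameter, Angstrom (Table 3)
EPS_HP = 4.5        # overall hydrophobic scale, units of E (Table 3)
R_HP_CUT = 10.0     # interaction cutoff, Angstrom (Table 3)
R_C = 2.0 ** (1.0 / 6.0) * SIGMA_CB   # position of the Lennard-Jones minimum

def lennard_jones(r, eps, sigma):
    """Standard 12-6 Lennard-Jones potential."""
    x6 = (sigma / r) ** 6
    return 4.0 * eps * (x6 * x6 - x6)

def v_hp(r, eps_ij):
    """Side chain interaction for a normalized pair weight eps_ij in [0, 1]."""
    if r <= R_C:
        # Full-strength repulsive core, shifted to meet the attractive branch
        return lennard_jones(r, EPS_HP, SIGMA_CB) + EPS_HP * (1.0 - eps_ij)
    if r <= R_HP_CUT:
        # Attractive tail scaled by the residue-specific weight
        return lennard_jones(r, EPS_HP * eps_ij, SIGMA_CB)
    return 0.0

print(v_hp(R_C, 0.5))   # well depth: -EPS_HP * eps_ij
print(v_hp(7.0, 0.0))   # fully hydrophilic pair: no attraction
```

For ϵ′ij = 0 the expression reduces to a purely repulsive WCA core, and for ϵ′ij = 1 to a plain Lennard-Jones potential, so the excluded volume is the same for every pair while only the well depth varies.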

Relative (un-normalized) coefficients ϵi were calculated by minimizing the expression

$$\chi^2 = \frac{1}{N}\sum_{i,\,j \ge i} \chi_{ij}^2, \tag{6}$$

where $\chi_{ij} = \epsilon_{ij}^{\mathrm{MJ}} - \sqrt{\epsilon_i \epsilon_j}$, $N$ is the number of matrix coefficients (210 independent elements in a 20×20 symmetric matrix), and the sum goes over all such elements. The normalized coefficients $\epsilon'_i$ that were obtained by simulated annealing followed by proper scaling [Eq. 4] are reported in Table 2.
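
The deconvolution can be illustrated on a small synthetic example. The sketch below builds a toy 4×4 matrix from known per-residue values (not the MJ data), minimizes the mean squared deviation over the upper triangle with a basic Metropolis-style annealing loop, and then applies the min-max normalization. The cooling schedule, move size, and all names are assumptions for illustration and are not taken from the paper.

```python
import math
import random

def chi2(matrix, eps):
    """Mean squared deviation over the upper triangle (i <= j)."""
    n = len(eps)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i, n):
            total += (matrix[i][j] - math.sqrt(eps[i] * eps[j])) ** 2
            count += 1
    return total / count

def normalize(eps):
    """Min-max scaling: smallest value maps to 0, largest to 1."""
    lo, hi = min(eps), max(eps)
    return [(e - lo) / (hi - lo) for e in eps]

def anneal(matrix, eps0, steps=20000, t0=0.01, max_move=0.05, seed=0):
    """Simulated annealing on the per-residue values; returns the best set found."""
    rng = random.Random(seed)
    eps, cost = list(eps0), chi2(matrix, eps0)
    best, best_cost = list(eps), cost
    for s in range(steps):
        temp = t0 * (1.0 - s / steps) + 1e-12          # linear cooling (assumed)
        k = rng.randrange(len(eps))
        trial = list(eps)
        trial[k] = max(0.0, trial[k] + rng.uniform(-max_move, max_move))
        trial_cost = chi2(matrix, trial)
        # Metropolis acceptance: always downhill, sometimes uphill
        if trial_cost <= cost or rng.random() < math.exp((cost - trial_cost) / temp):
            eps, cost = trial, trial_cost
            if cost < best_cost:
                best, best_cost = list(eps), cost
    return best, best_cost

# Synthetic stand-in for the contact matrix (NOT the MJ values)
TOY_EPS = [0.2, 0.5, 0.9, 1.4]
TOY_MATRIX = [[math.sqrt(a * b) for b in TOY_EPS] for a in TOY_EPS]
START = [0.5, 0.6, 0.7, 0.8]

fit, fit_cost = anneal(TOY_MATRIX, START)
fit_normalized = normalize(fit)
print(fit_cost, fit_normalized)
```

Because the toy matrix is exactly of geometric-mean form, the residual can in principle be driven to zero; for real contact data a finite residual remains.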

Table 2.

Normalized scale of amino acid hydrophobicities $\epsilon'_i$ using the Lorentz–Berthelot mixing rule for the cross terms, as well as relative and absolute error, $\Delta\epsilon_{ii}$ and $\chi_{ii}$, from the diagonal elements of the MJ matrix (see text for definition). Note that the side chain of glycine (marked with an asterisk in the table) is not modeled.

Residue   Code   ε'_i (E)   Δε_ii (E)   χ_ii (E)
Lys       K      0.00       4.00        −0.48
Glu       E      0.05       0.50        −0.45
Asp       D      0.06       0.16        −0.19
Asn       N      0.10       0.01        −0.02
Ser       S      0.11       0.05        −0.08
Arg       R      0.13       0.20        −0.31
Gln       Q      0.13       0.20        −0.31
Pro       P      0.14       0.10        −0.17
Thr       T      0.16       −0.01       0.02
Gly*      G      0.17       −0.05       0.11
His       H      0.25       −0.11       0.35
Ala       A      0.26       0.00        0.01
Tyr       Y      0.49       0.03        −0.14
Cys       C      0.54       −0.14       0.76
Trp       W      0.64       0.05        −0.24
Val       V      0.65       −0.02       0.12
Met       M      0.67       0.01        −0.05
Ile       I      0.84       0.02        −0.12
Phe       F      0.97       0.04        −0.32
Leu       L      1.00       0.05        −0.38

Let us quantify the quality of our deconvolution and the suitability of our amino acid specific hydrophobic strength ϵi. Recall that the correlation coefficient c between two data sets {Xi} and {Yi} is defined as

$$c = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma_X}\right)\left(\frac{Y_i - \bar{Y}}{\sigma_Y}\right), \tag{7}$$

where $n$ is the number of data points in each set, $\bar{X}$ and $\bar{Y}$ are their averages, and $\sigma_X$ and $\sigma_Y$ are their standard deviations, respectively. Our inferred 210 $\epsilon_{ij}$ values and their original $\epsilon_{ij}^{\mathrm{MJ}}$ counterparts have a correlation coefficient of 98%, which decreases by only three points when comparing the MJ matrix to the normalized interaction contacts $\epsilon'_{ij}$. Moreover, the 20 individual values $\epsilon_i$ as well as the $\epsilon'_i$ have an 87% correlation with the experimental hydrophobic scale measured by Fauchere and Pliska.27 Since the MJ matrix accounts for more than hydrophobicity, this further drop in the correlation coefficient is expected. However, its still relatively large value suggests that the hydrophobic effect is the dominant contribution to the MJ energies. This is the reason why we refer to the interactions of Eq. 5 summarily as “hydrophobicity.” The fitting procedure gave a $\chi^2$ value of 0.064, which translates into an average relative error $\overline{\Delta\epsilon} = 0.25$ between coefficients along the diagonal of the MJ matrix, where this deviation is defined by $\Delta\epsilon_{ii} = \chi_{ii}/\epsilon_{ii}^{\mathrm{MJ}}$. Even though most coefficients did not deviate more than 15% from the MJ matrix, lysine, the most hydrophilic residue, is off by a factor of 4. Various sets of parameters with a comparable $\chi^2$ value showed equivalent correlation properties, even though deviations were located on different amino acids. This rules out the hypothesis of a systematic failure of our $N^2 \to N$ deconvolution.
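
The correlation coefficient used above is the usual Pearson coefficient with population (1/n) standard deviations. A minimal sketch follows; the data are made up purely to exercise the function, and the function name is illustrative.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient of two equally long data sets (1/n normalization)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sigma_x) * ((y - mean_y) / sigma_y)
               for x, y in zip(xs, ys)) / n

# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0]
print(correlation(xs, [2.0, 4.0, 6.0, 8.0]))   # perfectly correlated
print(correlation(xs, [8.0, 6.0, 4.0, 2.0]))   # perfectly anticorrelated
```

The coefficient is invariant under shifting or positively rescaling either data set, which is why the normalization of Eq. 4 only changes the correlation through the subsequent square root, not through the rescaling itself.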

It is possible to account for solvent effects in even further detail, for instance, by including the layering of water molecules around the solute into the effective potentials.33 In our attempt to develop a simple force field and only keep a few important aspects of protein interactions and in view of the approximation already made, we decided against such local details.

Hydrogen bonds

Since our model does not contain any electrostatics, it is necessary to model hydrogen bonds implicitly as well. The interaction depends on the relative distance and orientation of an amide and a carbonyl group. A real amide group is composed of a nitrogen with a hydrogen, whereas the carbonyl group has a carbon double-bonded to an oxygen. The hydrogen bond is favored when the N, H, and O atoms are aligned. Several interaction potentials for hydrogen bonding have been proposed in the literature.11, 18, 19, 34, 35, 36 For its simplicity and corresponding CG mapping, we follow Irbäck et al.18 by using a radial 12-10 Lennard-Jones potential combined with an angular term,

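$$V_{\mathrm{hb}}(r,\theta_N,\theta_C) = \epsilon_{\mathrm{hb}}\left[5\left(\frac{\sigma_{\mathrm{hb}}}{r}\right)^{12} - 6\left(\frac{\sigma_{\mathrm{hb}}}{r}\right)^{10}\right] \times \begin{cases} \cos^2\theta_N \cos^2\theta_C, & |\theta_N|, |\theta_C| < 90^\circ,\\ 0, & \text{otherwise}, \end{cases} \tag{8}$$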

where r is the distance between the two beads N and C, σhb is the equilibrium distance (Table 3), and θN is the angle formed by the atoms HNC and θC corresponds to the angle NCO (Fig. 2). The main motivation for using a power of 10 instead of 6 in the Lennard-Jones potential is a narrower confinement of the hydrogen bond length. Since our model does not represent hydrogens and oxygens, these particle positions were calculated via the local geometry of the backbone. Any NC pair can form a hydrogen bond, except if N belongs to proline, since its side chain connects to the preceding amide on the backbone. The hydrogen bond leads to one more free parameter, its interaction strength ϵhb.
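
A short sketch of the hydrogen bond potential, with the Table 3 values, shows how the 12-10 radial part and the angular factor combine. Truncating the interaction at the tabulated cutoff r_hb,cut is an assumption about how that parameter is used; the reconstruction of the H and O positions from the backbone is not shown, and the function name is illustrative.

```python
import math

SIGMA_HB = 4.11   # equilibrium N-C distance, Angstrom (Table 3)
EPS_HB = 6.0      # hydrogen bond strength, units of E (Table 3)
R_HB_CUT = 8.0    # cutoff, Angstrom (Table 3); assumed to truncate the potential

def v_hb(r, theta_n_deg, theta_c_deg):
    """12-10 hydrogen bond potential with cos^2 angular weighting, in units of E."""
    if r > R_HB_CUT:
        return 0.0
    if abs(theta_n_deg) >= 90.0 or abs(theta_c_deg) >= 90.0:
        return 0.0
    x = SIGMA_HB / r
    radial = EPS_HB * (5.0 * x ** 12 - 6.0 * x ** 10)
    angular = (math.cos(math.radians(theta_n_deg)) ** 2
               * math.cos(math.radians(theta_c_deg)) ** 2)
    return radial * angular

print(v_hb(SIGMA_HB, 0.0, 0.0))    # aligned bond at equilibrium distance: -EPS_HB
print(v_hb(SIGMA_HB, 60.0, 0.0))   # misaligned by 60 degrees: a quarter of the depth
```

The prefactors 5 and 6 place the minimum exactly at r = σhb with depth ϵhb, and the cos² factors switch the bond off smoothly as either angle approaches 90°.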

Table 3.

Nonbonded interactions. The length σ represents the diameter of a bead. Most parameters were determined after parameter tuning, except the ones denoted by an asterisk. See Sec. 4.

Backbone excluded volume
σ_N (Å)     σ_Cα (Å)    σ_C (Å)        ϵ_bb (E)
2.9         3.7         3.5            0.02

Hydrophobicity
σ_Cβ (Å)    ϵ_hp (E)    r_hp,cut (Å)
5.0         4.5         10

Hydrogen bonding
σ_hb (Å)    ϵ_hb (E)    r_hb,cut (Å)
4.11        6           8

Figure 2.

Schematic figure of the hydrogen bond interaction. The light beads (H and O) are not explicitly modeled in the simulation; their positions are inferred from their bonded neighbors.

Electrostatics

There is no explicit treatment of side chain charges in the force field. Specifically, we do not model the interaction between charged residues. However, this piece of information is partially included in the MJ matrix, as the method is based on statistical analysis of residue-residue distances. The electrostatic interaction involved between two charged residues will be implicitly sampled, and its effect reflected in the interaction coefficient. Nevertheless, an explicit treatment of charged residues would allow one to look into properties that depend on the environment’s pH or ionic strength. For a solution that has a high salt concentration (e.g., under physiological conditions), ions are able to screen most of the electrostatics, such that a Debye–Hückel potential would be appropriate to model this interaction. By compensating for the difference in binding energy for all the coefficients involved, one could disentangle charge effects from the MJ matrix. This, however, has not been done in the present model.

Dipole interaction

The interactions described above were sufficient to fold and stabilize α-helices but not β-sheets. Chen et al.22 pointed out that there is an important contribution usually neglected in generic models: carbonyl and amide groups at the peptide bond form dipoles that interact with each other. Mu and Gao34 showed that the nearest-neighbor interaction is enough to favor β over α content. Effectively, all dipoles along a helix are parallel compared to more favorable antiparallel neighboring dipoles on a β-sheet.

From a computational standpoint, a dipole-dipole interaction,

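$$V_{\mathrm{dd}}(\mathbf{p}_i,\mathbf{p}_j)=\frac{\epsilon_{\mathrm{dd}}}{r^3}\left[\mathbf{p}_i\cdot\mathbf{p}_j-3(\mathbf{p}_i\cdot\hat{\mathbf{r}})(\mathbf{p}_j\cdot\hat{\mathbf{r}})\right], \tag{9}$$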

between two dipoles pi and pj at a distance r from each other is inconvenient because it is long ranged. However, nearest-neighbor dipoles are all separated by roughly the same distance, as all amino acids have the same backbone geometry. All dipoles also have the same magnitude, as they are formed from the same atoms. Therefore, the key component of the interaction lies in the relative orientation between dipoles and not in their magnitude or relative distance. Successive dipoles therefore capture the orientation of the local backbone geometry. Two neighboring dipoles will effectively measure the angle difference between the two planes $C_{i-1}N_iC_{\alpha,i}$ and $C_{\alpha,i}C_iN_{i+1}$, where the index keeps track of the amino acid involved (see Fig. 1). As the effect is completely localized and only affects the conformation of the amino acid backbone, we treat this interaction as a bonded one by effectively biasing the dihedral potentials of ϕ and ψ. To do so, we first calculated Eq. 9 for all combinations of dihedral angles with a 5° resolution. The result is plotted on the upper part of Fig. 3. The (sterically forbidden) central part of the plot was removed to emphasize local differences in allowed regions.
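To make the orientation dependence of Eq. 9 concrete, the following minimal sketch evaluates the dipole-dipole energy for two point dipoles. The function name `v_dd`, the unit prefactor, and the test geometries are illustrative assumptions; they are not the backbone geometries used to produce Fig. 3.

```python
import math

def v_dd(p_i, p_j, r_vec, eps_dd=1.0):
    """Dipole-dipole energy of Eq. 9 for 3-component dipoles p_i and p_j.

    r_vec is the separation vector between the two dipoles.
    eps_dd = 1.0 is a placeholder prefactor, not a value from the paper.
    """
    r = math.sqrt(sum(c * c for c in r_vec))
    r_hat = [c / r for c in r_vec]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return eps_dd / r**3 * (dot(p_i, p_j) - 3.0 * dot(p_i, r_hat) * dot(p_j, r_hat))

# Two unit dipoles perpendicular to their separation vector:
up, down, sep = (0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (2.0, 0.0, 0.0)
e_parallel = v_dd(up, up, sep)        # repulsive
e_antiparallel = v_dd(up, down, sep)  # attractive
```

For this side-by-side arrangement, the antiparallel pair is lower in energy than the parallel pair by 2ϵdd/r³, which is the kind of orientational preference the nearest-neighbor term is meant to capture.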

Figure 3.

Map of the nearest-neighbor dipole-dipole interaction for all sets of dihedral angles ϕ and ψ (top) and the decoupled Fourier series approximation (bottom). The central part of the upper plot was not reproduced in order to emphasize local differences in other regions of the plot (as can be seen in Fig. 4, this is a sterically hindered region anyway). Sterically favored regions of the plot are circumscribed by a thick line and carry labels for the α and β regions. The two graphs were shifted and scaled for comparison.

In order to be efficient, the potential should decouple along the two coordinates, i.e., it must be expressible as a sum U(ϕ,ψ) ≈ U(ϕ) + U(ψ). We use a single cosine function centered around ϕ,ψ=0 with identical amplitude along both coordinates to approximate the neighboring dipole potential (Fig. 3, bottom). Higher modes in the series have been shown to be negligible:

$$V_{\mathrm{dip}}(\phi,\psi)=k_{\mathrm{dip}}\left[(1-\cos\phi)+(1-\cos\psi)\right]. \tag{10}$$

The value of the optimally tuned free parameter kdip is reported in Table 1. The discrepancy between the plots is due to the enforced decoupling of the two coordinates ϕ and ψ. Even though the final result looks rather inaccurate on the whole domain of the function, it nevertheless recreates the one important effect of the interaction: the β region is more favored than the α region (see labels in Fig. 3). Moreover, the quality of the fit should only be tested along the physically relevant domains of the Ramachandran plot, most notably the α and β regions. In this sense, Eq. 10 makes for a good approximation of the dipole interaction and is enough to recreate the physics that favors β regions.
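The relative weighting of the α and β regions under Eq. 10 can be checked with a short calculation. In the sketch below, the value of `K_DIP` is a placeholder chosen for illustration only (the tuned value is in Table 1, which is not reproduced here), and the representative angles are the α and β positions quoted for the Ramachandran plot. With the functional form as written, the energy difference between the two regions is kdip[cos(−60°) − cos(130°)] ≈ 1.14 kdip, so the β region is lower in energy only for a negative kdip.

```python
import math

def v_dip(phi_deg, psi_deg, k_dip):
    """Decoupled dipole bias of Eq. 10; angles in degrees, result in units of k_dip."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return k_dip * ((1.0 - math.cos(phi)) + (1.0 - math.cos(psi)))

K_DIP = -0.3  # hypothetical placeholder, in units of E; see Table 1 for the tuned value

alpha = v_dip(-60.0, -60.0, K_DIP)   # representative alpha-helix angles
beta = v_dip(-60.0, 130.0, K_DIP)    # representative beta-sheet angles
delta = beta - alpha                 # negative means beta is favored
```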

PARAMETER TUNING

There are various ways CG force fields can be parametrized. For instance: only allowing nonbonded interactions between native contacts (Gō-type models),15 partitioning measurements of amino acids between water and a hydrophobic medium,16 structure-based coarse graining based on all-atom simulations,4 and knowledge-based potentials which intend to optimize parameters by using large pools of existing structures.12

Parameter tuning in top-down CG models aims at reproducing a selected subset of structural or energetic system properties. Since these parameters tend to be correlated, a given set needs to be tested at all scales. In our force field, local conformations are tuned to reproduce probability distribution functions of dihedral angles, which by a slight extension of standard terminology we also call Ramachandran plots (see Sec. 4A). Large-scale (global) properties are targeted by studying folding events of a helical peptide (see Sec. 4B). The final set of parameters was identified as the one we felt most capable of reproducing properties on both levels. The physical conditions (temperature, density, etc.) of the force field will be set by the systems we try to match.

Note that on the global level we tune our parameters using only one protein. Of course, adding more proteins into the “training set” would incorporate more information, presumably leading to a better founded force field. There exist various successful parametrization schemes that rest on large ensembles of data.37, 12, 38 This, however, needs to be balanced against the need to test how reliably a given force field handles proteins that were not part of its training set, a point we deemed more relevant.

Table 4 lists the eight free parameters that need to be determined. Because of time constraints and to obtain some intuition for each interaction involved, we made a point of having our model tunable by hand, which is why we required the number of free parameters to be as low as possible. As explained above, this is the main reason why we decided against amino acid specific bead sizes. Adding even more free parameters, on top of being time consuming, would make it difficult to obtain a consistent set of parameters that would correctly describe both local and global conformations. Different bead sizes would involve different Ramachandran plots, and all backbone parameters would need to be consistent throughout.

Table 4.

Table of free parameters in this CG model. The main test that was used to determine a given parameter is denoted in the second column.

Free parameters Tuning method
σN,σCα,σC,σCβ,ϵbb Ramachandran plot
ϵhp,ϵhb,kdip Folding characteristics

The free parameters were tuned by trying to constrain parameter space as much as possible, for instance, by eliminating unphysical behavior (e.g., sterically hindered β region in the Ramachandran plot or too much helicity in secondary structures). Combining both local and global tests was enough to settle on a satisfactory set of parameters using the constraint that the dipole interaction strength was maximized. Even though this may sound arbitrary when looking for a realistic α/β content ratio, it turned out to be very difficult to use β structures as tests because they are so weakly stabilized. Indeed, we have found that the final set point is still not strong enough to fully stabilize β-sheets during folding events (see below). This shows that maximizing the dipole interaction strength in this model does not lead to oversampling of β content but merely to sampling as many extended conformations as possible before the force field can no longer stabilize helical structures. Other simple tests can be used to exclude regions of parameter space. For example, a hydrogen bond interaction that is too strong will lead to proteins that fold into one long helix. Hydrophobic interactions that are too strong will collapse proteins into globules, even native elongated helical structures. Bead size parameters were initially taken from other CG models (e.g., Refs. 9, 18, 19) and tuned as little as possible to recreate enough sampling of α/β content while suppressing sterically hindered regions.

As for any physical system, the representative sampling of its phase space is prerequisite to obtaining accurate thermodynamic information. Different schemes have been developed to characterize and estimate the population of thermodynamic states.39, 40, 41, 42 In the present case, thermodynamic calculations were performed by combining parallel tempering43 with the weighted histogram analysis method (WHAM).44, 45, 46 The main idea is to combine energy histograms from canonical simulations at various temperatures in order to reconstruct the density of states of the system. The information contained in these histograms is used to calculate a consistent set of free energy differences between simulations. Converging these free energies was done by using a recently developed highly efficient algorithm.47 Once the density of states is reconstructed, one can obtain continuous approximations to all thermodynamic observables. By combining WHAM with parallel tempering, we effectively improve sampling by reducing correlations between data points.
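Once a density of states is available, canonical averages at any temperature follow from a Boltzmann-weighted sum. The sketch below illustrates only this last step, not the WHAM iteration itself or the accelerated algorithm of Ref. 47. The function name `canonical_averages` and the discretized input format (a list of energies with the corresponding ln g(E)) are assumptions made for illustration.

```python
import math

def canonical_averages(energies, ln_g, kT):
    """Mean energy and specific heat from a discretized density of states.

    energies: energy bin centers (units of E)
    ln_g:     natural log of the density of states in each bin
    kT:       temperature in units of E
    Returns (<E>, C_v) with C_v in units of k_B.
    """
    beta = 1.0 / kT
    log_w = [lg - beta * e for e, lg in zip(energies, ln_g)]
    shift = max(log_w)  # subtract the maximum to avoid overflow in exp
    w = [math.exp(x - shift) for x in log_w]
    z = sum(w)
    mean_e = sum(wi * e for wi, e in zip(w, energies)) / z
    mean_e2 = sum(wi * e * e for wi, e in zip(w, energies)) / z
    c_v = (mean_e2 - mean_e**2) / kT**2  # energy fluctuation formula
    return mean_e, c_v

# Toy check: a two-level system with equal degeneracy.
mean_e, c_v = canonical_averages([0.0, 1.0], [0.0, 0.0], kT=1.0)
```

Because ln g(E) does not depend on temperature, the same arrays can be reused on an arbitrarily fine temperature grid, which is what yields continuous curves for the observables.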

Local conformations: Ramachandran plot

The Ramachandran plot48 records the occurrence and frequency of successive (ϕ,ψ) angles in a protein. Since backbone flexibility is almost exclusively due to these two coordinates, the Ramachandran plot is an ideal reporter of local (secondary) structure: α-helices and β-sheets belong to peaks in different regions of the plot. And since proteins are highly constrained systems, low energy points on the Ramachandran plot are rather well localized. Their accurate sampling is therefore prerequisite to the formation and stabilization of reliable structures on larger scales. In the following we will be concerned with the (thermal) distribution of the (ϕ,ψ) angles surrounding some particular amino acid and, in a slight stretch of standard terminology, also refer to this probability density as a Ramachandran plot.

The free parameters that most directly constrain the Ramachandran plot are the different bead sizes (σN,σCα,σC,σCβ) and, to a lesser extent, the excluded volume energy prefactor ϵbb. We disentangled hydrogen bond and hydrophobicity effects from the Ramachandran plot by studying systems made of only three amino acids. From a steric point of view we only distinguish between glycine and nonglycine amino acids, by either not having a side chain bead at all (Gly) or by using a generic bead representing the 19 other amino acids (Ala, for the sake of concreteness). It is then sufficient to study the two Ramachandran plots of Gly-Gly-Gly and Gly-Ala-Gly tripeptides, the smallest systems that contain relevant information on successive dihedral angles ϕ and ψ. The reason why we surround the amino acid of interest with two Gly is to avoid hydrophobic interactions between neighboring side chains. As a result, we solely probe steric effects. The Ramachandran plots derived from the final set of parameters are shown in Fig. 4 as free energy plots obtained from using parallel tempering at temperatures kBT/E ∈ {0.5,0.7,1.0,1.3,1.6,1.9,2.2,2.5} and reconstructing the density of states with WHAM. The free energy plot is calculated at our reference temperature kBT/E=1. The shading represents the free energy difference with respect to the lowest conformation, in units of kBT. Notice the inherent asymmetry in the Gly-Ala-Gly system, which reflects the chirality of the α-carbon. Both α-helix (−60°, −60°) and β-sheet (−60°, 130°) regions are well populated, in agreement with Ho et al.49 Proper balance and connectivity between the two regions are crucial for protein folding. This is tuned by the bead sizes and excluded volume energy but also depends on the dipole interaction kdip (see below). The achiral glycine, on the other hand, has no side chain and permits many more conformations. One therefore often finds glycine residues at the ends of helices.

Figure 4.

Free energy plots of tripeptides Gly-Gly-Gly (top) and Gly-Ala-Gly (bottom) as a function of successive dihedrals ϕ and ψ, calculated at our reference temperature T=1E/kB. The coloring represents the free energy difference with the lowest conformation, in units of kBT.

A particular challenge was the fact that we model neither the amide-hydrogen nor the carbonyl-oxygen explicitly, yet their steric effects strongly shape the Ramachandran plot.49 This required subtle adjustments of the bead sizes of the N and C atoms compared to their conventional van der Waals radii.

A poor sampling of local conformations can thwart the formation of realistic secondary structure. Moreover, the relative weight of characteristic regions of the Ramachandran plot determines to a large extent the α/β content. Even though the analysis of abovementioned tripeptides accounts for steric effects and the dipole interaction, it does not consider hydrogen bonds and side chain interactions which are also important to stabilize secondary structure. For this reason it is difficult to ascertain the quality of conformational distributions without studying larger structures.

Folding of a three-helix bundle

In this section we study full size proteins to parametrize large-scale interactions. We used proteins found in the Protein Data Bank that were resolved experimentally in aqueous solvent.

Our choice of reference protein is constrained by the limitations of our model. For instance, salt or disulfide bridges cannot yet be represented and should thus play no role in the reference protein either. Also, it was important to start with a simple structure rather than a globular protein for which packing and cooperativity are more important. Following Irbäck et al.18 and Takada et al.,19 we also tuned our force field on a three-helix bundle. Direct comparisons with their models are difficult, though. First, these authors do not incorporate specificity on every amino acid and only represent a few amino acid types (e.g., hydrophobic, polar, glycine residue). Second, they only compared their simulations to the lowest energy structure found during the simulation rather than experimental data. In contrast, we use the de novo protein 2A3D (73 residues) and systematically compare our results with the real structure resolved experimentally (using NMR).50 The amino acid sequence is given in Table 5. A similar protocol was followed by Favrin et al.20 in order to study a different three-helix bundle (1BDD).

Table 5.

Structure and amino acid sequence of all proteins studied in this paper.

PDB ID Structure Sequence
2A3D Three-helix bundle MGSWAEFKQRLAAIKTRLQALGGSEAELAAFEKEIAAFESELQAYKGKGNPEVEALRKEAAAIRDELQAYRHN
1LQ7 Three-helix bundle GSRVKALEEKVKALEEKVKALGGGGRIEELKKKWEELKKKIEELGGGGEVKKVEEEVKKLEEEIKKL
1P68 Four-helix bundle MYGKLNDLLEDLQEVLKNLHKNWHGGKDNLHDVDNHLQNVIEDIHDFMQGGGSGGKLQEMMKEFQQVLDELNNHLQGGKHTVHHIEQNIKEIFHHLEELVHR
2JUA Four-helix bundle MYGKLNDLLEDLQEVLKHVNQHWQGGQKNMNKVDHHLQNVIEDIHDFMQGGGSGGKLQEMMKEFQQVLDEIKQQLQGGDNSLHNVHENIKEIFHHLEELVHR
1R69 Five short helices SISSRVKSKRIQLGLNQAELAQKVGTTQQSIEQLENGKTKRPRFLPELASALGVSVDWLLNGTSDSNVR
1K8B Two helices and a four-stranded β-sheet EILIEGNRTIIRNFRELAKAVNRDEEFFAKYLLKETGSAGNLEGGRLILQRR
1K43 β-hairpin RGKWTYNGITYEGR
  β-hairpin VVVVVDPGVVVVV

A first attempt in tuning parameters consisted of simulating proteins starting from their native structure. Testing for stability is a rapid means to constrain parameter space but not sufficiently so as to actually determine their values. This is consistent with the picture of a deep funnel-like free energy landscape:6 the free energy minimum of a native state is sufficiently deep compared to unfolded states that a folded protein is very stable against force field parameter variations. Further tuning was therefore mainly achieved by studying folding events using a set of trial runs with different parameters. Observation of three-dimensional structures with VMD (Ref. 51) was well suited to characterize simulations. The software was also used to render protein images in this paper.

Folding was studied in the following way: The only input into our simulations was the sequence of amino acids and the temperature. The initial conformation (determined by the collection of dihedral angles ϕ and ψ) was chosen randomly, and the integration started by warming up nonbonded interactions to relax high energy steric clashes. We used parallel tempering for all simulations to avoid kinetic traps. Structural observables were measured at kBT=1E, the temperature at which the force field was tuned. Simulations were set at eight different temperatures: kBT/E ∈ {1.0,1.1,1.2,1.3,1.4,1.6,1.9,2.2}. MC swaps between different temperatures were attempted every 10τ; the average acceptance rate was around 10%. We tested convergence to a global minimum by checking that different initial conditions consistently equilibrate to the same structure. A combination of thermodynamic and kinetic studies (see below) will allow us to show two important features. First, the temperature used for parameter tuning, kBT=1E, is below the folding temperature Tf of 2A3D, above which the unfolded conformation becomes the most stable state. Second, kBT=1E is above the glass transition temperature Tg, below which the energy landscape becomes very rugged and creates severe kinetic traps. It was indeed possible to observe folding events in conventional (i.e., not using parallel tempering) simulations within this range of temperature.
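The swap step of parallel tempering can be sketched as follows. The text does not spell out the acceptance rule, so the code assumes the standard Metropolis criterion for replica exchange, in which a swap between replicas at inverse temperatures βi and βj is accepted with probability min{1, exp[(βi − βj)(Ei − Ej)]}. The function names and the example energies are hypothetical; only the temperature ladder is taken from the text.

```python
import math
import random

LADDER = [1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 1.9, 2.2]  # k_B T / E, as listed in the text

def swap_probability(kT_i, kT_j, e_i, e_j):
    """Acceptance probability for exchanging configurations of two replicas.

    Assumes the standard replica-exchange Metropolis rule (not stated in the text).
    e_i and e_j are the current potential energies at temperatures kT_i and kT_j.
    """
    delta = (1.0 / kT_i - 1.0 / kT_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def attempt_swap(kT_i, kT_j, e_i, e_j, rng=random.random):
    """Return True if the swap is accepted; rng must return a float in [0, 1)."""
    return rng() < swap_probability(kT_i, kT_j, e_i, e_j)

# Example with made-up energies for the two coldest replicas:
p_up = swap_probability(LADDER[0], LADDER[1], -50.0, -52.0)    # always accepted
p_down = swap_probability(LADDER[0], LADDER[1], -52.0, -50.0)  # accepted with p < 1
```

The acceptance rate depends on the overlap of the energy distributions of neighboring replicas, which is why the ladder is denser at low temperature.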

Quantitative comparison between the CG and the experimental structures can be made by calculating the root-mean-square deviation (RMSD) between corresponding α-carbons on the two chains (after optimal mutual alignment). Figure 5 reports the RMSD of a protein in the lowest (kBT=1E) replica of a parallel tempering MD run as a function of time, using the RMSD trajectory tool within the VMD package.51 These results were obtained with the parameters reported in Tables 1, 2, 3. The average error between the equilibrated simulation and the NMR structure is around 4 Å after about 100000τ and at kBT/E=1, the temperature at which the native conformation represents the free energy minimum. A superposition of the simulated structure with the experimental one is shown in Fig. 6. The STRIDE algorithm52 was used to assign secondary structure. Overall the conformation is very well reproduced considering that we have a resolution of only four beads per amino acid and that no a priori knowledge of secondary/tertiary structure was provided to the force field. Helix regions had formed at the right place, and amino acids were arranged in order to bury hydrophobic beads between the three helices, away from the implicit solvent.

Figure 5.

RMSD of the CG proteins 2A3D (full line) and 1P68 (dashed line) compared with experimentally resolved structures. Both simulations were run at T=1E/kB.

Figure 6.

Equilibrated structures of (a) 2A3D and (b) 1P68 sampled at T=1E/kB. Superposition of simulated structure (opaque) with experimental data (transparent) is displayed. The STRIDE algorithm (Ref. 52) was used for secondary structure assignment (thick ribbons represent α-helices on the figure).

To characterize the stability of this protein, we also performed thermodynamic calculations using WHAM and parallel tempering at the temperatures kBT/E ∈ {0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.6,1.9,2.2}. By reconstructing the density of states, we can estimate the folding temperature kBTf ≈ 1.2E, the point at which the folded and unfolded states are equally populated. This gives a measure of the stability of the system: below Tf the native state is the most likely conformation. In Fig. 7, we plot the free energy below, at, and above the folding temperature as a function of the nativeness order parameter Q as introduced by Takada et al.19 It compares the distances rij between pairs i and j of Cα beads in the NMR data and in the CG simulations:

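$$Q=\left\langle\exp\left[-\frac{1}{9\sigma^2}\left(r_{ij}^{\mathrm{NMR}}-r_{ij}^{\mathrm{CG}}\right)^2\right]\right\rangle_{ij}, \tag{11}$$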

where the average goes over all pairs ij. The folded conformation lies in the basin Q ≈ 0.6 whereas all unfolded conformations (in which not all three helices have properly formed) occur for Q ≲ 0.5. It should be noted that all three curves in the graph have been calculated by using the same reference point, meaning that the vertical shift between curves accounts for the free energy difference in going from one temperature to another. The folding temperature is close to 1.2E/kB. To make sure the model is also able to sample this important part of phase space in conventional simulations, we provide a stability run at the folding temperature starting from a random conformation. It can be seen that the system repeatedly switches between folded and unfolded states and spends roughly as much time in either one (Fig. 8).
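The order parameter of Eq. 11 is simple to compute from two sets of Cα coordinates. The sketch below is a minimal version under stated assumptions: the average runs over all pairs i < j, as the text describes, and the width `sigma` is left as a free argument with a placeholder default because its value is not given in this section. The function name and the three-bead test geometry are hypothetical.

```python
import math
from itertools import combinations

def nativeness_q(ref_coords, cg_coords, sigma=1.0):
    """Nativeness order parameter of Eq. 11.

    ref_coords, cg_coords: sequences of (x, y, z) C-alpha positions in Angstrom,
    in the same residue order. sigma=1.0 is a placeholder, not the paper's value.
    """
    if len(ref_coords) != len(cg_coords):
        raise ValueError("both structures need the same number of beads")
    pairs = list(combinations(range(len(ref_coords)), 2))
    total = 0.0
    for i, j in pairs:
        d_ref = math.dist(ref_coords[i], ref_coords[j])
        d_cg = math.dist(cg_coords[i], cg_coords[j])
        total += math.exp(-((d_ref - d_cg) ** 2) / (9.0 * sigma**2))
    return total / len(pairs)

# Toy example: a straight three-bead segment versus a bent copy.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 5.0, y, z) for x, y, z in straight]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

q_same = nativeness_q(straight, shifted)  # rigid translation leaves Q at 1
q_bent = nativeness_q(straight, bent)     # one pair distance changes, so Q < 1
```

Because Q depends only on internal distances, it needs no structural alignment, unlike the RMSD.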

Figure 7.

Free energy profile as a function of a nativeness order parameter Q below (T=1.1E/kB), at, and above (T=1.3E/kB) the folding temperature Tf=1.2E/kB.

Figure 8.

Conventional (i.e., not using parallel tempering) simulation of 2A3D at T=1.2E/kB. The nativeness parameter Q is plotted against time. The protein alternates between folded (Q ≈ 0.6) and unfolded conformations (Q ≲ 0.5).

In 13 out of 15 independent parallel tempering simulations the protein folded to the native state at a temperature T=1E/kB. However, the folding time varied substantially between different simulations. The kinetics of folding of this protein was studied by running conventional simulations at various temperatures. For each temperature kBT/E ∈ {0.7,0.8,0.9,1.0,1.1,1.2} we ran ten simulations and measured the average time it took to fold the protein to its native conformation, if it ever did within the time scale of the simulation (2×10^6 τ). The results are reported in Fig. 9. Temperatures 0.7E/kB and 0.8E/kB did not yield a single folding event, suggesting the onset of glassy behavior.6, 53 The glass transition temperature Tg can be estimated following a simple pragmatic scheme suggested by Socci and Onuchic:53 it is the temperature where the mean folding time is the average of the minimum folding time τmin (lowest point in the graph) and the largest time scale one is willing to invest in the simulation, τmax (highest boundary in the graph): τg=(τmin+τmax)/2. This average time is plotted as a horizontal line in the graph. One can then estimate what temperature this folding time corresponds to (Fig. 9). In our case, we can safely assume that Tg<0.9E/kB, meaning that the protein does not experience glassy behavior when simulating at our reference temperature T=1E/kB. Moreover, combining results from thermodynamic calculations and kinetic studies shows that there is a range of temperatures Tg<T<Tf in which the system is not experiencing glassy behavior but is still “cold” enough such that the native state is the most stable conformation.
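The estimate τg=(τmin+τmax)/2 and the temperature it corresponds to can be sketched in a few lines. Everything numerical below is hypothetical: the folding times are invented for illustration and are not the data of Fig. 9. Two further assumptions are made: temperatures without a folding event are assigned τmax, and the crossing temperature is found by linear interpolation on the low-temperature branch.

```python
def estimate_tg(temps, mean_fold_times, tau_max):
    """Estimate the glass transition temperature from mean folding times.

    temps:           temperatures in ascending order (units of E/k_B)
    mean_fold_times: mean folding time at each temperature (units of tau);
                     runs that never folded are entered as tau_max (assumption)
    Returns (tau_g, T_g); T_g is None if the curve never crosses tau_g.
    """
    tau_min = min(mean_fold_times)
    tau_g = 0.5 * (tau_min + tau_max)
    i_min = mean_fold_times.index(tau_min)
    # Walk from the fastest-folding temperature toward colder temperatures.
    for k in range(i_min, 0, -1):
        t_warm, t_cold = mean_fold_times[k], mean_fold_times[k - 1]
        if t_warm <= tau_g <= t_cold:
            if t_cold == t_warm:
                return tau_g, temps[k]
            frac = (tau_g - t_warm) / (t_cold - t_warm)
            return tau_g, temps[k] + frac * (temps[k - 1] - temps[k])
    return tau_g, None

# Hypothetical folding times, for illustration only:
temps = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
times = [2.0e6, 2.0e6, 6.0e5, 2.0e5, 3.0e5, 5.0e5]
tau_g, t_g = estimate_tg(temps, times, tau_max=2.0e6)
```

With these invented numbers the crossing falls between 0.8 and 0.9 E/kB. The interpolation scheme is a convenience and other choices, such as reading the value off the plot, are equally valid.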

Figure 9.

Kinetic studies of the 2A3D three-helix bundle CG protein. The average folding time tf is plotted against temperature. For temperatures ranging from T=0.7E/kB to T=1.2E/kB, about ten simulations were run and we measured the first passage time to the native state. The line represents the average between the minimum folding time and the time scale of the simulation. This can be used to estimate the glass transition temperature (see text).

Irbäck et al.18 as well as Takada et al.19 reported a degeneracy in the CG structures of their helix bundles: there are two ways three helices can pack (see Fig. 10), and their models were not able to discriminate the two different tertiary structures. NMR experiments on 2A3D found a ratio between clockwise and counterclockwise topologies of several percent, leading to a free energy difference of a few kBT at room temperature.54 Of the 15 independent simulations we ran, one did not fold within 300000τ, 13 converged to the NMR structure, a counterclockwise topology [Fig. 10a], and only one had the other topology [illustrated in Fig. 10b]. While it is encouraging to see that our model is able to distinguish these topologies, it is not guaranteed that this will work equally well for other proteins.

Figure 10.

Schematic figure of the two possible topologies in forming a three-helix bundle. (a) The native fold of protein 2A3D corresponds to a counterclockwise topology and (b) that of 1LQ7 is clockwise.

SIMULATION DETAILS

MD simulations were performed with the ESPRESSO package.55 Simulations in the canonical ensemble (NVT) were achieved by using a Langevin thermostat with friction constant Γ = τ⁻¹. The temperature was expressed in terms of the intrinsic unit of energy, E. The force field is parametrized in order to reproduce a temperature of T=300 K. The integration time step used for all simulations is δt=0.01τ.
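For orientation, the reduced units can be related to laboratory units with a short calculation. The sketch below assumes that the reference temperature kBT = 1E corresponds to 300 K, which is how we read the statement above. The linear rescaling of other reduced temperatures is a naive assumption, since temperature scales in CG models need not map linearly onto experimental ones. All function names are hypothetical.

```python
K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol K)
T_REF_K = 300.0          # assumed mapping: k_B T = 1 E corresponds to 300 K

def energy_unit_kcal_per_mol():
    """Size of the intrinsic energy unit E under the assumed mapping."""
    return K_B_KCAL * T_REF_K

def reduced_to_kelvin(kT_over_E):
    """Naive linear conversion of a reduced temperature to Kelvin."""
    return kT_over_E * T_REF_K

def steps_for(duration_tau, dt_tau=0.01):
    """Number of integration steps needed to cover a duration given in tau."""
    return round(duration_tau / dt_tau)

e_unit = energy_unit_kcal_per_mol()  # about 0.6 kcal/mol
n_steps = steps_for(100000)          # steps needed to cover 100000 tau
```

Under these assumptions, a run of 100000τ corresponds to 10^7 integration steps at δt=0.01τ.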

APPLICATIONS AND TESTS

Folding

All simulations mentioned from this point onward have not been part of the parameter tuning training set. They come out as independent checks and features of the force field. Thermodynamic and kinetic studies were not performed for the different proteins of this section. Here, we study the equilibrium conformations of various sequences at a temperature of kBT/E=1, which lies between Tg and Tf for our reference protein, 2A3D. In this respect, we expect to avoid glassy behavior for similarly complex proteins whose native state is folded.

In order to test the folding features of the model, we first studied another de novo three-helix bundle, 1LQ7. Even though the fold is very similar to 2A3D, it has 67 amino acids and a completely different primary sequence. Also, the native structure, obtained from NMR,56 has the opposite topology (clockwise) compared to 2A3D. From ten independent parallel tempering runs, 300000τ long each, one of them did not fold within this amount of time (helices formed but did not arrange properly). Out of the nine remaining structures, five folded consistently to the native clockwise topology [Fig. 10b] and four to the other one [Fig. 10a]. It should be noted that this sequence had been designed such that its native structure leads to favorable salt-bridge interactions.56 As we do not incorporate electrostatics (and thus salt bridges) explicitly, we expect the CG model to have difficulties in discriminating between the two tertiary structures.

In order to further probe the folding features of different α-helical rich folds, we studied a four-helix bundle, 1P68, consisting of 102 amino acids.57 Even though the secondary structure is overall rather similar to the abovementioned three-helix bundle, the tertiary structure and amino acid sequence are completely different. Again, the reference structure is taken from experimental data.57 From six independent parallel tempering runs, each 600000τ long, our force field successfully folded the protein into a four-helix bundle for every simulation except one, which did not have time to properly align its fourth helix. The RMSD is shown in Fig. 5 for a simulation which converged to the right topology. As can be seen, the RMSD went below 4 Å, which, again, is very satisfactory considering the level of resolution and the complete absence of structure bias in the force field. It should be noted that what appears as large fluctuations on the graph are actually frequent MC swaps between replicas of the parallel tempering ladder. Even though their potential energies are comparable (which is the reason why they swap temperatures), their structures are fairly different, as can be seen on the RMSD plot. Just as in the three-helix bundle case, this protein can fold into several different topologies. Out of the five simulations which converged to a four-helix bundle tertiary structure, two of them represented the NMR topology. RMSD values for other topologies ranged between 5 and 8 Å. A snapshot of the equilibrated structure is shown in Fig. 6b.

A second de novo four-helix bundle, 2JUA, was also used to test the force field. Even though the tertiary structure resembles the abovementioned 1P68, the amino acid sequence of 2JUA is completely independent (though it also has 102 amino acids) and the topology is different. In all three independent runs the protein folded into a four-helix bundle structure within 600000τ, as judged by qualitative comparison of the CG protein with the NMR structure.58 However, none of them converged to the right topology.

Our model has proven very efficient in finding the equilibrium conformation of various helical structures, up to small deviations, and independent of their tertiary structure (i.e., number of helices) or sequence of amino acids. The fact that none of these proteins was part of the parameter tuning strongly indicates that our CG model captures important aspects of protein physics.

The limits of the model were reached when simulating globular proteins, such as 1R69 (Ref. 59) and 1K8B.60 The chain collapsed into a molten globule, but the arrangement of secondary structures (collections of α-helices and β-sheets) was not accurately reproduced, leading to an incorrect tertiary structure. This suggests a missing sufficiently deep free energy minimum, most likely due to the limitations of the CG model in terms of cooperativity and realistic packing (recall that all side chains have the same bead size). The RMSD values did not drop below 10 Å.

Stabilizing a single β-hairpin in small proteins is difficult because this relies on very weak interactions. We simulated the de novo 1K43 peptide for 300000τ. It consists of 14 residues and forms a β-hairpin in water.61 Our model is not able to stabilize it. The simulation shows a high tendency to form an α-helix, where 40% of all conformations are helical, whereas only 2% are extended (β-sheet-like). However, the CG model can successfully fold a designed β-hairpin, sequence V5DPGV5, which contains a D-proline in order to sterically favor hairpin formation.62 This peptide has been recently characterized using atomistic63 and structure-based CG simulations.64

Aggregation

Gsponer et al.65 recently reported atomistic simulations of small aggregation events in water. Heptapeptides GNNQQNY from the yeast prion protein Sup35 were shown to form β-sheet aggregates. These authors did a quantitative analysis of the number of aggregates of two and three peptides in the system at room temperature.

We studied the abovementioned scenario by simulating 15 identical peptides in a box of size 40 Å, without matching density with the atomistic run. Indeed, while Gsponer et al. simulated their system in a restricted sphere of 150 Å diameter and applied forces to constrain the system in the center, we set periodic boundary conditions in a cubic box. Even though this represents a rather dense system in order to drive aggregation, we checked that similar structures were sampled when simulating more dilute systems. Initial configurations were chosen randomly, and we ran parallel tempering simulations at temperatures kBT/E ∈ {0.7,0.8,0.85,0.9,0.95,1.0,1.1,1.2} for 500000τ each. We used WHAM to calculate the specific heat of the system (Fig. 11). A clear peak occurs between lower temperatures, with formation of long-range fibrillar structures (Fig. 12), and higher temperatures where the system mostly samples random coil monomers.
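Under periodic boundary conditions, distances between beads are measured to the nearest periodic image. The sketch below shows the usual minimum-image convention for a cubic box, together with the peptide number density implied by the box size quoted above. It is a generic illustration with hypothetical function names and does not reproduce how the simulation package handles periodicity internally.

```python
import math

BOX = 40.0       # cubic box edge in Angstrom, as quoted in the text
N_PEPTIDES = 15

def minimum_image_distance(a, b, box=BOX):
    """Distance between points a and b in a cubic periodic box of edge `box`."""
    d2 = 0.0
    for x, y in zip(a, b):
        d = x - y
        d -= box * round(d / box)  # fold the component into [-box/2, box/2]
        d2 += d * d
    return math.sqrt(d2)

def number_density(n=N_PEPTIDES, box=BOX):
    """Peptides per cubic Angstrom."""
    return n / box**3

# Two beads near opposite faces of the box are close through the boundary.
d_wrapped = minimum_image_distance((1.0, 1.0, 1.0), (39.0, 1.0, 1.0))
rho = number_density()
```

Any cluster analysis of the aggregates has to use such wrapped distances, since a fibril can extend across the box boundary.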

Figure 11.

Specific heat of 15 GNNQQNY peptides in a cubic box of size of 41 Å. The peak around T=0.95E/kB separates a low temperature phase, rich in high-β content aggregates, from a high temperature phase where no aggregates form.

Figure 12.

Snapshot of a typical cluster that forms by peptide aggregation in the low temperature phase (T=0.8E/kB) of Fig. 11.

At lower temperatures, where aggregation occurs, we mostly observe parallel rather than antiparallel sheets. Interestingly, this agrees with the study of Gsponer et al. and could be due to the hydrophobic interactions of the C-terminal tyrosine. To test this, we performed single point mutations in order to create a symmetric sequence. In this case parallel β-sheets also turned out to be more stable than antiparallel ones, which is unexpected since antiparallel β-sheets are generally believed to be lower in free energy.25 One possible explanation is that the model lacks electrostatic interactions at the N and C termini of the chains, which would favor antiparallel sheets because the two ends carry opposite charges.
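
One simple way to label a pair of neighboring strands as parallel or antiparallel is to compare their N-to-C direction vectors. The sketch below is an illustration only, not the analysis used in this work: the function names are hypothetical, and it assumes that the two strands have already been identified as partners in a sheet and that their Cα coordinates are listed from the N to the C terminus.

```python
import numpy as np


def strand_direction(ca_coords):
    """Unit vector from the first to the last C-alpha of a strand."""
    ca = np.asarray(ca_coords, dtype=float)
    v = ca[-1] - ca[0]
    return v / np.linalg.norm(v)


def sheet_orientation(strand_a, strand_b):
    """Classify two paired strands by the sign of the cosine between them."""
    cos_angle = float(np.dot(strand_direction(strand_a), strand_direction(strand_b)))
    return "parallel" if cos_angle > 0.0 else "antiparallel"
```

Counting both labels over all paired strands in the low-temperature trajectories would give the parallel-to-antiparallel ratio discussed above.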

These parallel GNNQQNY β-sheets also tend to align within a plane, with the C termini facing each other. This evidently results from the attraction between the C-terminal tyrosines, tyrosine being the most hydrophobic amino acid in this peptide.

To show that their force field was not biased toward aggregation, Gsponer et al. also simulated a water-soluble control peptide SQNGNQQRG and found a difference in the amount of β-sheets formed. We compared the phase behavior of GNNQQNY and SQNGNQQRG by using WHAM on both sequences but did not find statistically significant differences over the studied temperature range. This suggests that some of the details necessary to distinguish the thermodynamics of these two peptides are too subtle for our force field to represent. Since the simulation temperature of Gsponer et al. in our case maps to T=1E/kB, which is where we essentially find the phase transition (Fig. 11), effects only captured by the atomistic force field can indeed be expected to lead to substantial differences. Previous studies have shown how differences in CG force field parameters affect structure, β-sheet propensity, and aggregation behavior of different sequences.66
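
As a rough worked conversion, the temperature mapping can be made explicit. The value of 300 K for the room temperature of the atomistic run is an assumption made here for illustration; with it, T = 1 E/kB fixes the energy unit E, and the specific-heat peak of Fig. 11 can be expressed in kelvin:

```latex
\begin{align}
E &= k_B T_{\text{room}}
   \approx 1.987\times10^{-3}\ \frac{\text{kcal}}{\text{mol K}} \times 300\ \text{K}
   \approx 0.60\ \text{kcal/mol}, \\
T_{\text{peak}} &\approx 0.95\, E/k_B \approx 0.95 \times 300\ \text{K} \approx 285\ \text{K}.
\end{align}
```

Under this assumption the transition lies only about 15 K below the temperature of the atomistic simulation, which is why small force field differences can shift a peptide from one phase to the other.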

All of these aggregation results were obtained using the same force field with no additional parameter adjustment. Other CG models have previously demonstrated aggregation events on a larger scale.67, 68 Here our goal was to show that we can study aggregation events using a force field that is tuned to reproduce simple folding features without biasing secondary or tertiary structure. This is important when looking at spontaneous aggregation or misfolding pathways, where one aims to reproduce general behavior without constraining the protein’s structure toward a certain state that might not even be known or well defined.

CONCLUSION

We have presented a new CG implicit solvent peptide model. Its intermediate resolution of four beads per amino acid permits accurate sampling of local conformations and thus secondary structure. Following cautious parameter tuning, the CG model is able to fold simple proteins such as helix bundles. Folding of a three-helix bundle was used to incorporate large-scale aspects of the force field, whereas the successful folding of other helical bundles provided independent checks of reliability. Thermodynamic and kinetic studies of the three-helix bundle were carried out to make sure the folding temperature Tf was above the glass transition temperature Tg for this protein. The model was systematically compared to NMR data in order to optimize parameter tuning and precisely determine how much fine-scale information this CG model still contains. Of course, our model is not intended to compete with atomistic simulations, which is not the point of CG models; yet, carefully balancing several key contributions to the force field is a prerequisite to perform meaningful studies involving secondary and tertiary structure formation. Globular shaped proteins have proven more difficult to stabilize, presumably because accurate packing and strong cooperativity are not well enough captured. We also observe aggregation events of small β-sheets without retuning the force field. A realistic α/β balance, coupled with basic folding features, makes the CG model very suitable for the large-scale and long-term regime that many biological processes require. Indeed, a force field that is not biased toward the protein’s native conformation will likely give rise to insightful thermodynamic and kinetic studies when the structure is not known, not well defined, strongly perturbed from the native state, or adjusts during aggregation events.

ACKNOWLEDGMENTS

We thank Zunjing Wang, Christine Peter, Cem Yolcu, and Bill DeGrado for valuable discussions and useful comments.

References

  1. Pace C. N., Trends Biochem. Sci. 15, 14 (1990). doi:10.1016/0968-0004(90)90124-T
  2. Pace C. N., Shirley B. A., McNutt M., and Gajiwala K., FASEB J. 10, 75 (1996).
  3. Coarse-Graining of Condensed Phase and Biomolecular Systems, edited by Voth G. A. (Taylor & Francis, New York, 2008).
  4. Ayton G. S., Noid W. G., and Voth G. A., Curr. Opin. Struct. Biol. 17, 192 (2007). doi:10.1016/j.sbi.2007.03.004
  5. Tozzini V., Curr. Opin. Struct. Biol. 15, 144 (2005). doi:10.1016/j.sbi.2005.02.005
  6. Bryngelson J. D., Onuchic J. N., Socci N. D., and Wolynes P. G., Proteins: Struct., Funct., Genet. 21, 167 (1995). doi:10.1002/prot.340210302
  7. Arkhipov A., Freddolino P. L., and Schulten K., Structure (London) 14, 1767 (2006). doi:10.1016/j.str.2006.10.003
  8. Arkhipov A., Yin Y., and Schulten K., Biophys. J. 95, 2806 (2008). doi:10.1529/biophysj.108.132563
  9. Ding F., Borreguero J. M., Buldyrev S. V., Stanley H. E., and Dokholyan N. V., Proteins: Struct., Funct., Genet. 53, 220 (2003). doi:10.1002/prot.10468
  10. Sorenson J. M. and Head-Gordon T., J. Comput. Biol. 7, 469 (2000). doi:10.1089/106652700750050899
  11. Voegler Smith A. and Hall C. K., Proteins: Struct., Funct., Genet. 44, 344 (2001). doi:10.1002/prot.1100
  12. Fujitsuka Y., Takada S., Luthey-Schulten Z. A., and Wolynes P. G., Proteins: Struct., Funct., Genet. 54, 88 (2004). doi:10.1002/prot.10429
  13. Head-Gordon T. and Brown S., Curr. Opin. Struct. Biol. 13, 160 (2003). doi:10.1016/S0959-440X(03)00030-7
  14. Lau K. F. and Dill K. A., Macromolecules 22, 3986 (1989). doi:10.1021/ma00200a030
  15. Go N., Annu. Rev. Biophys. Bioeng. 12, 183 (1983). doi:10.1146/annurev.bb.12.060183.001151
  16. Monticelli L., Kandasamy S. K., Periole X., Larson R. G., Tieleman D. P., and Marrink S.-J., J. Chem. Theory Comput. 4, 819 (2008). doi:10.1021/ct700324x
  17. Thøgersen L., Schiøtt B., Vosegaard T., Nielsen N. C., and Tajkhorshid E., Biophys. J. 95, 4337 (2008). doi:10.1529/biophysj.108.133330
  18. Irbäck A., Sjunnesson F., and Wallin S., Proc. Natl. Acad. Sci. U.S.A. 97, 13614 (2000). doi:10.1073/pnas.240245297
  19. Takada S., Luthey-Schulten Z., and Wolynes P. G., J. Chem. Phys. 110, 11616 (1999). doi:10.1063/1.479101
  20. Favrin G., Irbäck A., and Wallin S., Proteins: Struct., Funct., Genet. 47, 99 (2002). doi:10.1002/prot.10072
  21. Yang A. S. and Honig B., J. Mol. Biol. 252, 366 (1995). doi:10.1006/jmbi.1995.0503
  22. Chen N. Y., Su Z. Y., and Mou C. Y., Phys. Rev. Lett. 96, 078103 (2006). doi:10.1103/PhysRevLett.96.078103
  23. Lodish H., Berk A., Zipursky S. L., Matsudaira P., Baltimore D., and Darnell J., Molecular Cell Biology (Freeman, New York, 2000).
  24. Lansbury P. T. and Lashuel H. A., Nature (London) 443, 774 (2006). doi:10.1038/nature05290
  25. Finkelstein A. V. and Ptitsyn O. B., Protein Physics (Academic, New York, 2002).
  26. Meyer H., Biermann O., Faller R., Reith D., and Müller-Plathe F., J. Chem. Phys. 113, 6264 (2000). doi:10.1063/1.1308542
  27. Fauchere J. L. and Pliska V., Eur. J. Med. Chem. 18, 369 (1983).
  28. Miyazawa S. and Jernigan R. L., J. Mol. Biol. 256, 623 (1996). doi:10.1006/jmbi.1996.0114
  29. Skolnick J., Jaroszewski L., Kolinski A., and Godzik A., Protein Sci. 6, 676 (1997).
  30. Miyazawa S. and Jernigan R. L., Proteins: Struct., Funct., Genet. 36, 357 (1999).
  31. Betancourt M. R. and Thirumalai D., Protein Sci. 8, 361 (1999).
  32. Wang Z. H. and Lee H. C., Phys. Rev. Lett. 84, 574 (2000). doi:10.1103/PhysRevLett.84.574
  33. Cheung M. S., Garcia A. E., and Onuchic J. N., Proc. Natl. Acad. Sci. U.S.A. 99, 685 (2002). doi:10.1073/pnas.022387699
  34. Mu Y. and Gao Y. Q., J. Chem. Phys. 127, 105102 (2007). doi:10.1063/1.2768062
  35. Yap E. H., Fawzi N. L., and Head-Gordon T., Proteins: Struct., Funct., Bioinf. 70, 626 (2008). doi:10.1002/prot.21515
  36. Guo C. L., Cheung M. S., Levine H., and Kessler D. A., J. Chem. Phys. 116, 4353 (2002). doi:10.1063/1.1448493
  37. Mirny L. A. and Shakhnovich E. I., J. Mol. Biol. 264, 1164 (1996). doi:10.1006/jmbi.1996.0704
  38. Matysiak S. and Clementi C., J. Mol. Biol. 363, 297 (2006). doi:10.1016/j.jmb.2006.07.088
  39. Das P., Moll M., Stamati H., Kavraki L. E., and Clementi C., Proc. Natl. Acad. Sci. U.S.A. 103, 9885 (2006). doi:10.1073/pnas.0603553103
  40. Micheletti C., Laio A., and Parrinello M., Phys. Rev. Lett. 92, 170601 (2004). doi:10.1103/PhysRevLett.92.170601
  41. Torrie G. M. and Valleau J. P., J. Comput. Phys. 23, 187 (1977). doi:10.1016/0021-9991(77)90121-8
  42. Park S. and Schulten K., J. Chem. Phys. 120, 5946 (2004). doi:10.1063/1.1651473
  43. Swendsen R. H. and Wang J. S., Phys. Rev. Lett. 57, 2607 (1986). doi:10.1103/PhysRevLett.57.2607
  44. Ferrenberg A. M. and Swendsen R. H., Phys. Rev. Lett. 61, 2635 (1988). doi:10.1103/PhysRevLett.61.2635
  45. Kumar S., Bouzida D., Swendsen R. H., Kollman P. A., and Rosenberg J. M., J. Comput. Chem. 13, 1011 (1992). doi:10.1002/jcc.540130812
  46. Kumar S., Rosenberg J. M., Bouzida D., Swendsen R. H., and Kollman P. A., J. Comput. Chem. 16, 1339 (1995). doi:10.1002/jcc.540161104
  47. Bereau T. and Swendsen R. H., “Optimized convergence for multiple histogram analysis,” J. Comput. Phys. (to be published). doi:10.1016/j.jcp.2009.05.011
  48. Ramachandran G. N., Ramakrishnan C., and Sasisekharan V., J. Mol. Biol. 7, 95 (1963). doi:10.1016/S0022-2836(63)80023-6
  49. Ho B. K., Thomas A., and Brasseur R., Protein Sci. 12, 2508 (2003). doi:10.1110/ps.03235203
  50. Walsh S. T. R., Cheng H., Bryson J. W., Roder H., and DeGrado W. F., Proc. Natl. Acad. Sci. U.S.A. 96, 5486 (1999). doi:10.1073/pnas.96.10.5486
  51. Humphrey W., Dalke A., and Schulten K., J. Mol. Graphics 14, 33 (1996). doi:10.1016/0263-7855(96)00018-5
  52. Frishman D. and Argos P., Proteins: Struct., Funct., Genet. 23, 566 (1995). doi:10.1002/prot.340230412
  53. Socci N. D. and Onuchic J. N., J. Chem. Phys. 101, 1519 (1994). doi:10.1063/1.467775
  54. DeGrado W. F. (personal communication).
  55. Limbach H. J., Arnold A., Mann B. A., and Holm C., Comput. Phys. Commun. 174, 704 (2006). doi:10.1016/j.cpc.2005.10.005
  56. Dai Q. H., Tommos C., Fuentes E. J., Blomberg M. R. A., Dutton P. L., and Wand A. J., J. Am. Chem. Soc. 124, 10952 (2002). doi:10.1021/ja0264201
  57. Wei Y. N., Kim S., Fela D., Baum J., and Hecht M. H., Proc. Natl. Acad. Sci. U.S.A. 100, 13270 (2003). doi:10.1073/pnas.1835644100
  58. Go A., Kim S., Baum J., and Hecht M. H., Protein Sci. 17, 821 (2008). doi:10.1110/ps.073377908
  59. Mondragon A., Subbiah S., Almo S. C., Drottar M., and Harrison S. C., J. Mol. Biol. 205, 189 (1989). doi:10.1016/0022-2836(89)90375-6
  60. Cho S. and Hoffman D. W., Biochemistry 41, 5730 (2002). doi:10.1021/bi011984n
  61. Pastor M. T., de la Paz M. L., Lacroix E., Serrano L., and Perez-Paya E., Proc. Natl. Acad. Sci. U.S.A. 99, 614 (2002). doi:10.1073/pnas.012583999
  62. Gellman S. H., Curr. Opin. Chem. Biol. 2, 717 (1998). doi:10.1016/S1367-5931(98)80109-9
  63. Ferrara P., Apostolakis J., and Caflisch A., J. Phys. Chem. B 104, 5000 (2000). doi:10.1021/jp994157t
  64. Thorpe I. F., Zhou J., and Voth G. A., J. Phys. Chem. B 112, 13079 (2008). doi:10.1021/jp8015968
  65. Gsponer J., Haberthur U., and Caflisch A., Proc. Natl. Acad. Sci. U.S.A. 100, 5154 (2003). doi:10.1073/pnas.0835307100
  66. Bellesia G. and Shea J.-E., J. Chem. Phys. 126, 245104 (2007). doi:10.1063/1.2739547
  67. Peng S., Ding F., Urbanc B., Buldyrev S. V., Cruz L., Stanley H. E., and Dokholyan N. V., Phys. Rev. E 69, 041908 (2004). doi:10.1103/PhysRevE.69.041908
  68. Fawzi N. L., Okabe Y., Yap E. H., and Head-Gordon T., J. Mol. Biol. 365, 535 (2007). doi:10.1016/j.jmb.2006.10.011

Articles from The Journal of Chemical Physics are provided here courtesy of American Institute of Physics