PLOS ONE. 2011 Jul 1;6(7):e20853. doi: 10.1371/journal.pone.0020853

A Coarse-Grained Approach to Protein Design: Learning from Design to Understand Folding

Ivan Coluzza 1,*
Editor: Annalisa Pastore 2
PMCID: PMC3128589  PMID: 21747930

Abstract

Computational studies have contributed greatly to our current understanding of the complex behavior of protein molecules; nevertheless, a complete characterization of their free energy landscape still represents a major challenge. Here, we introduce a new coarse-grained approach that allows for an extensive sampling of the conformational space of a large number of sequences. We explicitly discuss its application in protein design, and by studying four representative proteins, we show that the method generates sequences with a relatively smooth free energy surface directed towards the target structures.

Introduction

Protein molecules play a central role in the large majority of biochemical reactions in living organisms [1]. Performance of these functions generally requires folding of the proteins into a specific three-dimensional structure, the so-called native state [2], [3] (although a number of exceptions involving the so-called “disordered” proteins have also been discovered [4]). Computer simulations combined with experiments have contributed greatly to our current understanding of the complex behavior of protein molecules and of the mechanism by which folding takes place [3], [5], [6]. Advances have been made through the use of atomistic models, which are capable of providing detailed descriptions of protein dynamics [7], [5], [6], and through the development of coarse-grained representations, which enable more comprehensive sampling of the conformational space [3], [8], [9], [10].

A common approach to protein folding involves the use of Go-models [11]. Go-models are non-transferable potentials tailored to the native structure such that each amino acid interacts selectively with a subset of residues and only when in the native configuration. Hence, Go-proteins are hypothetical proteins with an arbitrary variety of pair interactions among the residues (alphabet) that are nevertheless able to fold successfully and have a smooth free-energy landscape with a single global minimum at the native structure. However, if the size of the alphabet is reduced, for instance, to the Inline graphic letter alphabet of real proteins, it becomes more and more difficult to observe folding for a random sequence, as the landscape most of the time changes from the smoothness of Go-models to a rugged surface with many local minima. Hence, folding becomes more complex and requires an extensive search in the space of possible sequences to obtain a folding chain. For this reason these methods are often referred to as “protein design”. Protein design was originally developed for lattice heteropolymers by Wolynes [12], and recently has been extended by Coluzza et al. [13]. By using lattice models it was possible not only to design heteropolymers with a large variety of target configurations, but also to generate lattice proteins with more complex self-assembly properties [14], [15]. The solution of the design problem is of considerable interest in biotechnology as it holds promise for the engineering of proteins with new functional properties. Some successful designs of novel artificial enzymes have been obtained by introducing residues expected to play a catalytic role in a specific reaction [16] into sequences with known folds.

In this work we will go beyond lattice models by introducing a novel design procedure that can produce realistic amino acid sequences able to fold into protein structures taken directly from experimental data. In what follows we will demonstrate for the first time that an accurate representation of the protein backbone is a necessary condition for successful protein design, as such constraints confine the possible configurations of proteins to the structural space of real proteins. Our hypothesis is based on the observation that the design procedure developed for lattice proteins was unable to produce folding sequences when applied to simple off-lattice representations (e.g. a flexible chain of particles), as indicated by our earlier simulations. In order to understand the importance of constraints, let us ignore for a moment long range correlations in the system (i.e., we make a mean field approximation). Energy minimization can then be viewed as a local optimization of the residue-residue pair interactions. Under this condition, sequence mutation (design) is guaranteed to find the same minimum as configuration changes (folding) [12], provided that the number of possible sequences is larger than or equal to the number of all possible configurations. Hence, for real proteins with an alphabet limited to Inline graphic letters, it becomes clear that one needs to introduce constraints that limit the size of the configurational space (e.g. a cubic lattice). Of course, in order to reproduce the space of real proteins, a specific set of constraints is needed that, contrary to Go potentials [11], does not vary from protein to protein.

Recently, Maritan and co-workers [17], [18], [19] have introduced a novel protein coarse-graining procedure by representing a typical protein as a flexible self-avoiding tube (hence the name “Tube” model) with a radius of Inline graphic and effective hydrogen bond interactions along the tube. The configurations of the tube model are controlled by just two parameters, the total hydrophobicity and the bending rigidity, that drive the tube into all secondary structures and many known protein tertiary structures. Hence, the results obtained with the tube model strongly suggest that the typical protein structures are inherent in the geometrical constraints of the backbone, as the latter are the main features of the tube model. To put it in the words of the authors, the tube “pre-sculpts” the free energy landscape. So far, a design method for the tube model has not been introduced, and when hydrophilic/hydrophobic patterns of typical proteins were tried, the tube model could not systematically fold to the native structures [20]. However, we believe that the tube model highlighted the important types of constraints necessary to design sequences for real protein structures, namely the self-avoidance of the backbone and the hydrogen bonds.

In order to support our hypothesis, we have developed a new model taking inspiration from the work of Maritan and co-workers. Unlike in the tube model, the physico-chemical properties of individual amino acids are represented by an effective spherical potential centered on the Inline graphic atoms, and a more realistic potential is used to represent the hydrogen bonding interactions. We refer to this model as the caterpillar model because of the image created by the spheres that follow the backbone (Fig. 1). The behavior of the caterpillar model depends on the balance between the spherical and hydrogen bond potentials. The main differences between the caterpillar and the tube model are that our model considers an arbitrary alphabet of amino acids and has a more detailed structure of the backbone that represents the hydrogen bonding interactions more faithfully. However, we retain the tube nature of the protein via the self-avoiding core of the spheres centred on the Inline graphic atoms [21]. We therefore expect the constraints resulting from the spherical and the hydrogen bonding potentials to confine the polypeptide chains to regions of the conformational space with realistic protein-like structure elements. It is important to notice that the higher level of description of the caterpillar model allows not only for a higher precision in the representation of structures, but also for a direct transfer of the results obtained with the caterpillar model to further refinement by full atomistic simulations. In fact, in order to further study the results of the caterpillar model with full atomistic simulations, we only need to add the atoms of the side chains of each amino acid directly on the backbone configurations of the coarse-grained simulations. Moreover, the use of spheres to account for self-avoidance is computationally more efficient [21] than the three-body interaction rules used in the tube model [18].

Figure 1. Illustration of the caterpillar model.


The large transparent spheres represent the self-avoidance volume, which has a radius of Inline graphic, associated with an amino acid and centered on the position of the Inline graphic atoms. The backbone degrees of freedom are the torsional angles Inline graphic and Inline graphic. In order to describe hydrogen bonds, the backbone amide (NH) and carboxyl (CO) groups are also explicitly represented.

In this paper, we will show that the caterpillar model satisfies the two conditions mentioned above for foldability and designability, as it retains the elements of the polypeptide chain essential for the folding of designed sequences, and at the same time is simple enough to allow for an extensive exploration of the configurational space. Below we describe the novel design procedure based on the caterpillar model and we discuss the design of four representative protein structures taken directly from the Protein Data Bank (PDB) [22], [38]. We show that with our model we are able to design all test structures, and generate a large number of sequences with the target configurations sitting at the bottom of a global free energy minimum. Finally, to further support the tangible link with real proteins, we show that the hydrophobic/philic profile of designed sequences agrees with that typical of real sequences, and more importantly we demonstrate that the caterpillar model can refold the sequence of one of the four test proteins to its corresponding native structure.

Methods

Model

As outlined above, the caterpillar model is a 5-bead model with the Inline graphic augmented by the full main atomic positions to introduce directional hydrogen bonds. The degrees of freedom of the model are the torsional angles Inline graphic and Inline graphic; all other structural parameters are kept fixed at values from the literature [23]. The C, O, N, H positions were determined from the Inline graphic atoms as shown in Fig. 1.

The side chain interactions are represented by an effective Inline graphic-Inline graphic sphere-sphere interaction energy given by

graphic file with name pone.0020853.e016.jpg (1)

where Inline graphic is the distance between the Inline graphic atoms at the centers of spheres Inline graphic and Inline graphic and Inline graphic (Inline graphic) is the distance at which Inline graphic; Inline graphic is a scale factor; see below. This expression provides a continuous square well form for the sphere-sphere interaction energy [39]. To determine the parameter Inline graphic we made use of the model of Betancourt and Thirumalai (BT) [24], in which the interaction energies were derived from a calculation of the contact frequency in the PDB. This potential had been used primarily for lattice proteins, but it is also appropriate for the caterpillar model, which employs a square-well-like potential. Backbone hydrogen bonds were modeled with a 10–12 Lennard-Jones type potential using the expression [25]

graphic file with name pone.0020853.e026.jpg (2)

where Inline graphic is the distance between the hydrogen atom of the amide group (NH) and the oxygen atom of the carboxyl group (CO) of the main chain. We set Inline graphic, Inline graphic, and Inline graphic; the values are given in [25].
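As an illustration of the radial part of the hydrogen bond term, the following is a minimal sketch of a generic 10–12 Lennard-Jones-type potential. It assumes the common textbook form with prefactors 5 and 6, which places the minimum at the reference distance; the angular factor is omitted, and the names `hbond_10_12`, `epsilon` and `sigma`, together with their default values, are hypothetical placeholders rather than the parameters of [25].

```python
def hbond_10_12(r, epsilon=1.0, sigma=2.0):
    """Radial part of a generic 10-12 Lennard-Jones-type potential.

    Assumed form: epsilon * (5 * (sigma / r)**12 - 6 * (sigma / r)**10).
    With these prefactors the minimum sits at r = sigma with depth -epsilon.
    epsilon and sigma are illustrative placeholders, not the paper's values.
    """
    if r <= 0.0:
        raise ValueError("distance must be positive")
    x = sigma / r
    return epsilon * (5.0 * x**12 - 6.0 * x**10)


if __name__ == "__main__":
    # Scan the H...O distance around the (placeholder) reference distance.
    for r in (1.8, 1.9, 2.0, 2.1, 2.5, 3.0):
        print(f"r = {r:.1f}  E = {hbond_10_12(r):+.4f}")
```

Compared with a 6–12 form, the 10–12 well is narrower, so the interaction is short ranged and appreciable only close to the reference distance.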

To complete the parametrization, we need to determine Inline graphic and Inline graphic. Since BT is a contact potential, there is no cutoff value. Here, we considered the Inline graphic-Inline graphic pair-correlation function g(Inline graphic) of several proteins and found that it begins to decay at approximately Inline graphic (see Figure S1). This behavior can be interpreted as the range of the effective interactions among amino acids. For larger values of Inline graphic, the system tends to acquire a mean field behavior, where every particle interacts with all the others, regardless of the geometry. By contrast, for smaller values of Inline graphic, correlations that are crucial for the stability of the target structure can be missed. The parameter Inline graphic was chosen to balance the contributions of Inline graphic (Eq. (1)) and Inline graphic (Eq. (2)). With Inline graphic, Inline graphic and Inline graphic provide approximately the same contributions to the energy per particle. If Inline graphic is too small, all sequences form Inline graphic-helices, while if it is too large all sequences fail to self-assemble and collapse into random glassy structures. In Eq. (2), the directionality of the hydrogen bonds is accounted for by multiplying the Lennard-Jones term by a factor containing the Inline graphic and Inline graphic angles between the atoms COH and OHN, respectively. (Figures S2 and S3 show the angular and distance dependence of Inline graphic 0). The directionality of the hydrogen bonds is essential to favor the regions of conformational space characterized by the secondary structure elements typical of proteins. The spheres centered on the position of the Inline graphic atoms ensure that only the maximum of the term in Eq. (2) for angles close to Inline graphic is accessible; that at Inline graphic corresponds to configurations that are not allowed by the self-avoiding volumes of the spheres.

The energy function, Eq. (1), does not take the effects of the solvent into account explicitly. Although the designed sequences are able to fold to their respective target structures, their surface exposure profiles do not necessarily reproduce those of actual proteins. To improve this aspect of the design, we added an energy term Inline graphic that penalizes the surface exposure of hydrophobic amino acids; the expression has the form

graphic file with name pone.0020853.e054.jpg (3)

where Inline graphic is a threshold for the number of contacts in the native structure above which the amino acid is considered to be fully buried and Inline graphic is the Doolittle hydrophobicity index [26], rescaled by Inline graphic to make this term match the contributions from the other energy terms. The number of contacts for the amino acids in the native state varies between Inline graphic and Inline graphic; the value 24 was chosen for Inline graphic (see Fig. 5).

Figure 5. Hydrophobic/philic profile of the protein L7/L12 (PDB ID 1CTF) designed with and without the solvation term.


In the top frame we plot the number of contacts that each amino acid along the chain has with all the other non-consecutive amino acids within the range of Inline graphic defined by our potential in Eq. (1). Large numbers indicate amino acids that are buried in the core of the protein, while low numbers correspond to residues that are highly solvated. The dashed horizontal line refers to the value Inline graphic in Eq. (3). In the bottom frame we compare the hydrophobic/philic (HP) profiles averaged over the designed sequences, with (W.S., blue continuous line) and without (Wo.S., red point-dash line) the solvation term in Eq. (3), to the average profile obtained from the Pfam alignment data (PF00542, black dashed line) corresponding to the structure L7/L12. The W.S. sequences capture many of the features of the HP profile of PF00542 and follow more closely the contact profile in the top frame, indicating that we design proteins with a hydrophobic core surrounded by hydrophilic amino acids, which overall is more realistic. Note that the discrepancies between the designed and the real proteins (between residues 20 and 30 and around residue 45) occur in regions where structurally one would expect hydrophilic amino acids. The unexpected hydrophobic patches present in the wild-type proteins may well be involved in the in vivo function of the protein, which we do not take into account during the design procedure. Inset, from left to right: comparison of the designed (W.S.) and the native distributions of hydrophilic (blue) and hydrophobic (red) amino acids for L7/L12.

The designs described in this paper were done mainly using only Eqs. (1) and (2) for the energy. A comparison calculation was then made for one protein including the solvation energy term of Eq. (3).

Design procedure

Given the potential function for the caterpillar model, there are two steps in the design procedure. First, a large number (Inline graphic) of sequences with a low energy and high sequence heterogeneity are generated using the target structure. Second, a selected subset is studied to determine its free energy surface and folding properties.

Several methods have been proposed to design the sequence of proteins such that they fold into a specific target conformation [27], [28], [13], [29]. We use here a modified version of a method that we described recently [13], which generates sequences by minimizing the energy of the target configuration and, at the same time, maximizes the number of amino acid permutations to increase the sequence heterogeneity. With this procedure the distribution of possible sequences remains large, which is necessary to generate sequences with a free energy minimum low enough to stabilize the folded state [27]. The search in sequence space is carried out by a parallel tempering Monte Carlo procedure with single point mutation moves. As in the conventional Metropolis scheme, the acceptance of trial moves depends on the ratio of the Boltzmann weights at a design temperature Inline graphic of the new and old states [30]. However, if this were the only criterion, there would be a tendency to generate homopolymer chains with a low energy, rather than chains that fold selectively into a specific target structure. To ensure an amino acid composition far from the homopolymer region of the sequence space, we impose the following acceptance criterion for a single mutation

graphic file with name pone.0020853.e071.jpg (4)

where Inline graphic is the difference of the energy before and after the mutation attempt, Inline graphic is a scale factor for the relative value of the two terms in the equation, and Inline graphic is the number of permutations that are possible for a given set of amino acids; Inline graphic is given by the multinomial coefficient

graphic file with name pone.0020853.e076.jpg (5)

where Inline graphic is the total number of monomers and Inline graphic, etc. are the numbers of amino acids of type 1, 2, etc. While sampling the sequence space with the Monte Carlo scheme, we set Inline graphic to a high enough value Inline graphic to generate sequences with a heterogeneous composition. To adequately sample the sequence space, we generated Inline graphic sequences with the native structure as the template using the parallel tempering scheme [13] with a set of temperatures Inline graphic in units of Inline graphic. From these we selected the ones most likely to yield stable structures for the native state. For this purpose, we used the Landau free energy Inline graphic, defined by

graphic file with name pone.0020853.e085.jpg (6)

and generated the two-dimensional normalized histogram Inline graphic of the distribution of the pair (Inline graphic and Inline graphic) collected over the ensemble of the Inline graphic generated sequences. For further study we chose a small number of sequences with low Landau free energy, i.e., ensembles of sequences that have a reasonably low energy and a high probability of being observed. The rationale for this choice is that such sequences are robust against point mutations, a property that is correlated with the overall thermodynamic stability ([31], [32]; see also [33]). Our criterion can be understood with a simple argument in the mean-field approximation, where we consider only short range correlations between the amino acids in the chains. Under these conditions point mutations are equivalent to small structural distortions, as both perturbations only have a local effect. Hence, proteins that are robust against point mutations are most probably resistant to small deformations induced by thermal fluctuations.
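The selection step described above can be illustrated with a minimal sketch. It assumes the standard definitions: the logarithm of the multinomial permutation count of a sequence, and a Landau free energy equal to minus the temperature times the logarithm of the normalized two-dimensional histogram over energy and log-permutations. The function names, bin widths and toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def log_permutations(sequence):
    """ln of the multinomial count N! / (n_1! n_2! ...) for a residue string."""
    counts = Counter(sequence)
    n_total = len(sequence)
    # lgamma(n + 1) = ln(n!), which avoids overflow for long chains.
    return math.lgamma(n_total + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )


def landau_free_energy(samples, e_bin, p_bin, temperature=1.0):
    """Assumed form F = -T ln P on a 2D grid of (energy, ln N_P) samples.

    samples: iterable of (energy, log_permutations) pairs.
    Returns a dict mapping integer bin indices (i, j) to F in reduced units.
    """
    hist = Counter(
        (math.floor(e / e_bin), math.floor(p / p_bin)) for e, p in samples
    )
    total = sum(hist.values())
    return {
        cell: -temperature * math.log(n / total) for cell, n in hist.items()
    }


if __name__ == "__main__":
    print(log_permutations("AAAA"))  # homopolymer: a single arrangement
    print(log_permutations("AABB"))  # ln(6)
    # Toy ensemble: the lowest-energy cell is not the most populated one.
    toy = [(-1.0, 2.0)] * 3 + [(-5.0, 1.0)]
    print(landau_free_energy(toy, e_bin=1.0, p_bin=1.0))
```

In the toy ensemble the cell with the lowest energy has the higher Landau free energy because it is visited less often, which is the distinction between low-energy and high-probability sequences drawn in the text.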

For each selected sequence, we computed the free energy Inline graphic as a function of the order parameter Inline graphic, where Inline graphic is defined by

graphic file with name pone.0020853.e093.jpg (7)

where Inline graphic denotes a normalized histogram of the number of sampled conformations with order parameter Inline graphic, and Inline graphic is the distance root mean square difference (DRMSD) from the native structure. In practice, a direct calculation of this histogram is not efficient, since even the caterpillar model tends to be trapped in local minima, especially at low temperatures. To induce escape from these local minima, we made use of the Virtual Move Parallel Tempering Monte Carlo sampling scheme proposed by Coluzza and Frenkel [34], based on the Waste Recycling approach [35]. This scheme is very efficient in sampling both high and low free energy states (see Supporting Information). We find that on 4 quad-core dual Xeon (Harpertown) compute nodes the calculation of Inline graphic as a function of Inline graphic for a single sequence requires 336 hours of CPU time, while generation of the Inline graphic sequences requires only 2 hours of CPU time.
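To make the order parameter concrete, the sketch below computes a DRMSD between two conformations and a free energy profile from a list of sampled DRMSD values. It assumes the usual definition of DRMSD as the root mean square difference of all intramolecular pair distances, and a plain histogram estimate with the free energy taken as minus the temperature times the logarithm of the bin probability; it does not implement the virtual-move parallel tempering scheme of [34]. All function names and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def pair_distances(coords):
    """All i < j distances for a list of (x, y, z) positions."""
    n = len(coords)
    return [
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def drmsd(coords, native):
    """Root mean square difference of pair distances (assumed: all pairs)."""
    d, d0 = pair_distances(coords), pair_distances(native)
    if len(d) != len(d0) or not d:
        raise ValueError("need two structures with the same size (>= 2)")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d, d0)) / len(d))


def free_energy_profile(q_values, bin_width, temperature=1.0):
    """Assumed form F(Q) = -T ln P(Q) from a histogram of sampled Q values.

    Returns a dict mapping the lower edge of each occupied bin to F.
    """
    hist = Counter(math.floor(q / bin_width) for q in q_values)
    total = sum(hist.values())
    return {
        b * bin_width: -temperature * math.log(n / total)
        for b, n in sorted(hist.items())
    }


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    print(drmsd(native, native), drmsd(stretched, native))
    print(free_energy_profile([0.1, 0.2, 0.3, 1.5], bin_width=1.0))
```

Because DRMSD depends only on internal distances, it needs no structural superposition, unlike the RMSD values quoted alongside it.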

We used the native conformations of four representative proteins as target structures (see Fig. 2), the B1 immunoglobulin-binding domain of streptococcal protein G (PDB ID 1PGB), the C-terminal domain of the ribosomal protein L7/L12 of E. coli (PDB ID 1CTF), a putative lipoprotein from Pseudomonas syringae (Gene Locus PSPTO2350, PDB code 2K57), and the UBA domain of Tap/NXF1 (PDB ID 1OAI).

Figure 2. Comparison of the designed (yellow) and the target (red) structures for the four proteins analyzed in this work, from top to bottom.


(a) Protein G (PDB ID 1PGB) Inline graphic DRMSD ( Inline graphic RMSD 0); (b) L7/L12 (PDB ID 1CTF) Inline graphic DRMSD ( Inline graphic RMSD 0); (c) lipoprotein (PDB ID 2K57) Inline graphic DRMSD ( Inline graphic RMSD 0); (d) UBA domain of Tap/NXF1 (PDB ID 1OAI) Inline graphic DRMSD ( Inline graphic RMSD 0).

Results

The Landau free energy diagram Inline graphic for protein 1CTF, which we studied in detail, is shown in Fig. 3. As is evident from the diagram, the lowest energy sequences and the lowest Landau free energy are not directly correlated; i.e., there are numerous very low energy structures with sequences that have a low probability of being observed. We then calculated the free energy as a function of the DRMSD from the native structure (Eq. (7)) for five selected low free energy, high heterogeneity sequences of protein 1CTF; they correspond to the point indicated by the arrow labeled “LowF” in Fig. 3. Figure 4 shows the free energy surfaces for these proteins at a low temperature where the proteins are stable with the present energy function. The surface is relatively smooth, with the minima of the free energy at a DRMSD in the range Inline graphic to Inline graphic; the breadth of the surface can be argued to reflect the structural fluctuations present in the native state. Because of the definition of DRMSD, structures that are long lived would appear as free energy minima at high values of DRMSD. Hence, the smoothness of the free energy profiles in Fig. 4 indicates that the folding process of our artificial sequences occurs spontaneously with no long lived metastable states. An important result is that in the low temperature simulations, the free energy surface shows no misfolded states with free energies below that of the target structure. It is important to notice that, in order to have a single free energy minimum, we did not explicitly require the design process to disfavor particular conformations of the chain. Similar results for the other three systems are given in the Supporting Information (Figure S4), and overall we get a structure prediction precision between Inline graphic and Inline graphic in DRMSD (Inline graphic and Inline graphic in RMSD [40]) as shown in Fig. 2.

Figure 3. Plot of the design free energy surface Inline graphic for protein L7/L12 (PDB ID 1CTF) as a function of the total Inline graphic energy and the logarithm of the number of possible letter permutations Inline graphic.


For small values of Inline graphic the sequences will tend to be more and more homopolymeric. The most stable sequences correspond to the lowest free energy point (indicated by the LowF arrow), and the folding capacity deteriorates moving away from that point even if the total energy is lower (e.g. the point indicated by the LowE arrow). The boundaries are determined by the limits in the computational power but also by the fact that some combinations of Inline graphic and Inline graphic are not possible.

Figure 4. Comparison of the folding free energies Inline graphic of 6 designed sequences and of the real sequence for L7/L12 as a function of the root mean square distance (Inline graphic) from the target structure.


The profile of Inline graphic (black dashed line) for 5 sequences selected from the ensemble of those with the lowest free energy in sequence space (LowF in Fig. 3) is compared with the profile (red line) obtained for a sequence with lower energy (LowE) than the previous ones. The free energy has been calculated at the same temperature Inline graphic. The folding efficiency of the LowF sequences is very different from that of LowE, as the latter cannot reach a properly folded structure. Finally, we also plot the folding free energy for the real sequence (Real) of the same protein L7/L12 (point-dash blue line). At Inline graphic, we found the minimum of Inline graphic to be around 1.6 Inline graphic (Inline graphic RMSD), indicating that the designed proteins are folded correctly on their targets.

We tested whether the sequence selection mechanism, based on the Landau free energy, performed better than simply taking a low energy sequence, as was done previously for lattice proteins [27]. Figure 4 also shows the results for a sequence selected for its low energy (LowE) with a relatively high number of permutations; see Fig. 3. In order to show how important it is to select sequences from the most probable ensemble, we chose the LowE sequence not too far from the global sequence free energy minimum. Nevertheless, the folding of “LowE” is significantly less reliable than that of the LowF sequences, as the equilibrium configuration of LowE differs dramatically from the native structure.

We finally introduced the solvation term in Eq. (3). With this term included, we repeated the design procedure for L7/L12 and the refolding of the natural sequence of L7/L12 as taken from the PDB (1CTF). We set Inline graphic. In Fig. 5 we plot the hydrophobic/philic (HP) profile of the protein 1CTF designed with and without the solvation term in Eq. (3). The first important observation is that even with our Inline graphic ranged potential (Eq. (1)), we are able to distinguish between buried amino acids and surface residues, as is demonstrated by the large variation in the number of contacts (top frame). Moreover, the HP profiles averaged over the sequences designed with the solvation term (W.S.) follow the contact profile much better than the profile of sequences designed without it (Wo.S.), indicating that our “artificial” proteins have a hydrophobic core surrounded by hydrophilic amino acids, as expected for molecules that live in aqueous solutions [36]. Finally, we compared the artificial HP profiles to the average profile obtained from the Pfam alignment data (PF00542) for protein 1CTF; the curve for W.S. sequences is qualitatively comparable to that of the real proteins, as the discrepancies (between residues 20 and 30 and around residue 45) occur in regions where the wild-type proteins express hydrophobic residues even if highly exposed to the solvent, which could be the result of functionalities that we did not include in the design procedure. At this point it is natural to ask if the caterpillar model, with the solvation term, is able to reproduce the folded structures of real proteins, since we have shown that designed sequences refold to the target structure, and the design now produces protein-like sequences. In Fig. 4 we plot the folding free energy profile of the natural sequence of protein L7/L12. The profile is qualitatively similar to the one obtained from the folding of the artificial sequences, and the distance of the global free energy minimum from the X-ray structure is still small (1.6 Å DRMSD, 3.4 Å RMSD). The quality of this result is again striking considering that the only parameters we had to adjust in the model are the range of the potential, the scaling factor of the Inline graphic interaction and the threshold Inline graphic of the solvation term.

We conclude that a carefully tuned external field can produce a protein-like hydrophobicity profile (Fig. 5) and closely predict the native structure of the real sequence (Fig. 4). The proposed framework is therefore fully self-consistent, since the design procedure is able to produce natural-like sequences, while the folding properties of the caterpillar model are compatible with the folding of real natural sequences. Moreover, the estimation of all free parameters is based on the condition that designed sequences must refold into their respective target structures, and as a result we recover fundamental properties of real proteins that we did not impose on the system. The model and methodology are therefore an important step forward in bridging the crucial gap between a coarse-grained representation and a fully atomistic description of proteins.

Discussion

In this work we introduce a fundamental criterion for the designability of coarse-grained models of proteins. With the caterpillar model we are able to design protein sequences for various proteins representative of the typical combinations of protein secondary structures. Each of the tested sequences reached the target structure with a very high precision considering the simplicity of the model, demonstrating that the procedure is universal for proteins with different proportions of alpha helices and beta sheets. With our model we could characterize in detail the free energy of the folding process, and we showed that each of the free energy landscapes has a global free energy minimum near the target structure. Moreover, the landscapes are relatively smooth, indicating that our designed proteins can fold spontaneously without remaining trapped for a long time in metastable states.

The caterpillar model provides strong evidence to support our hypothesis that a minimum number of constraints is necessary in order to successfully perform protein design. By applying an accurate representation of the backbone we demonstrated that design and folding of real proteins is possible to a degree of accuracy that could not have been anticipated given the level of coarse-graining applied. To the best of our knowledge, a direct analysis of the importance of constraints for the design of protein-like structures has never been done before. Our results, then, not only extend protein design beyond lattice proteins but also further extend the important work of Maritan and co-workers [17], [18], [19] on the tube model. With the tube model, the authors showed that the protein structure universe is largely determined by the particular geometry imposed by the backbone, independently of the accuracy used to represent the amino acid pair interactions. With the caterpillar model we not only verify the results of Maritan and co-workers, but also extend the function of the backbone geometry to the crucial role of enforcing the minimal set of constraints responsible for the protein design property.

It is important to stress that the three free parameters of the model have been adjusted only on the refolding ability of the designed sequences. As a result, the artificial sequences resemble real proteins in their hydrophilic/hydrophobic profiles, and the folding of real sequences predicts the correct native structure with surprisingly high accuracy. This last result suggests that it is possible to determine a universal set of parameter values valid for all proteins, which we intend to make the center of further investigation. Moreover, given its computational efficiency, we anticipate that the caterpillar model will be useful for studying other important aspects of protein behaviour such as folding, misfolding and aggregation, especially considering that, thanks to the high detail of the backbone, the results of our model can be easily integrated into full atomistic simulations by adding the side chains of each amino acid.

Supporting Information

Figure S1

Inline graphic Radial distribution function Inline graphic of three of the target proteins tested in our work. The solid lines are spline interpolations of the data points to guide the eye. The plots show features common to all three proteins; in particular, the positions of the major peaks fall within the Inline graphic radial distance. This alone is not enough to prove that the effective potential between Inline graphic pairs should have such a wide range, but it supports our phenomenological observation that shorter or longer ranges do not give the caterpillar model the same universal refolding properties.

(EPS)
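As an illustration of the kind of analysis summarized in Figure S1, the following minimal sketch computes a radial histogram of pair distances for a single structure. It is not the procedure used in this work: the function name `radial_distribution`, the normalization by shell volume and by the number of pairs, and the choice of bin range are all assumptions made for the example, and the input is any array of site coordinates.

```python
import numpy as np

def radial_distribution(coords, r_max, n_bins):
    """Histogram of pair distances, normalized per pair and per shell volume.

    coords : (N, 3) array of site coordinates (hypothetical input).
    Returns bin centers and the normalized histogram.
    """
    coords = np.asarray(coords, dtype=float)
    # Indices of all unique pairs i < j.
    i, j = np.triu_indices(len(coords), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    # Volume of each spherical shell between consecutive bin edges.
    shell_volume = (4.0 / 3.0) * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Assumed normalization: fraction of pairs per unit shell volume.
    g = counts / (shell_volume * len(dist))
    return centers, g
```

With this normalization, summing the histogram times the shell volume over all bins gives the fraction of pairs closer than `r_max`, which makes curves for proteins of different length directly comparable.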

Figure S2

Angular dependence of the potential used to model hydrogen bonds in Eq.(2).

(EPS)

Figure S3

Radial dependence of the potential used to model hydrogen bonds in Eq.(2).

(EPS)

Figure S4

Free energies Inline graphic(DRMSD) of the designed sequences as a function of the root mean square distance (Inline graphic) from their target structures for the four cases that we considered in this work: (a) the B1 immunoglobulin-binding domain of streptococcal protein G (PDB ID 1PGB), (b) the C-terminal domain of the ribosomal protein, (c) a putative lipoprotein from Pseudomonas syringae (Gene Locus PSPTO2350, PDB code 2K57), and (d) the UBA domain of Tap/NXF1 (PDB ID 1OAI). The free energy is shown for two temperatures, the first (Inline graphic) slightly below the folding temperature (Inline graphic) and the second (Inline graphic) slightly above (all temperatures are in reduced units). At low temperatures, for all the target structures that we considered, we found the minima of Inline graphic to be between 1.0 and 1.5 Inline graphic, indicating that the designed proteins are folded correctly on their targets. At Inline graphic the native state is in equilibrium with the unfolded state. The exact determination of the folding temperature requires a fine analysis of the temperature dependence of the folding process, and is beyond the scope of our work. Our estimate is based on the observation that just above Inline graphic the protein is unfolded, while below it the native state is the most stable state.

(EPS)
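To make the quantity plotted in Figure S4 concrete, the sketch below shows one common way to obtain a free energy profile from sampled configurations. It is a minimal illustration under stated assumptions, not the sampling scheme used in this work: DRMSD is taken here as the root mean square difference between all pair distances of a configuration and of the target, the free energy is estimated from a plain histogram as F = -T ln P in reduced units (k_B = 1), and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(coords):
    """Distances between all unique pairs i < j of an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)
    return np.linalg.norm(coords[i] - coords[j], axis=1)

def drmsd(coords, target):
    """Assumed definition: RMS difference between the two sets of pair distances."""
    d = pairwise_distances(coords)
    d0 = pairwise_distances(target)
    return float(np.sqrt(np.mean((d - d0) ** 2)))

def free_energy_profile(values, temperature, bins=20):
    """Estimate F(x) = -T ln P(x) from samples of an order parameter x.

    Reduced units with k_B = 1 are assumed. Empty bins are dropped and the
    profile is shifted so that its minimum is zero.
    """
    hist, edges = np.histogram(values, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = hist > 0
    f = -temperature * np.log(hist[mask])
    return centers[mask], f - f.min()
```

In practice one would evaluate `drmsd` for every sampled configuration at a given temperature and pass the resulting values to `free_energy_profile`. A plain histogram only resolves regions that are visited often, so barriers and rarely populated states generally require an enhanced sampling method.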

Text S1

Supporting information.

(PDF)

Acknowledgments

The author acknowledges the help provided by Prof M. Vendruscolo through enlightening discussions, and numerous suggestions from Prof A. Cacciuto, Dr B. Capone, Dr M. Miller and Prof C. Dellago that improved the quality of the manuscript. The figures were made with VMD. VMD is developed with NIH support by the Theoretical and Computational Biophysics group at the Beckman Institute, University of Illinois at Urbana-Champaign [37].

Footnotes

Competing Interests: The author has declared that no competing interests exist.

Funding: The author acknowledges financial support from the Austrian Science Foundation (FWF) within the SFB ViCoM (F 41), and from the Intra-European mobility grant FP6 Project No.: 24496. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Fersht A. New York: W. H. Freeman & Co; 1999. Structure and Mechanism in Protein Science.
  • 2. Anfinsen CB. Principles that govern folding of protein chains. Science. 1973;181:223–230. doi: 10.1126/science.181.4096.223.
  • 3. Dobson C, Šali A, Karplus M. Protein folding: A perspective from theory and experiment. Angewandte Chemie International Edition. 1998;37:868–893. doi: 10.1002/(SICI)1521-3773(19980420)37:7<868::AID-ANIE868>3.0.CO;2-H.
  • 4. Dunker AK, Brown C, Obradovic Z. Identification and functions of usefully disordered proteins. Advances in Protein Chemistry. 2002;62:25–49. doi: 10.1016/s0065-3233(02)62004-2.
  • 5. Fersht A, Daggett V. Protein folding and unfolding at atomic resolution. Cell. 2002;108:573–582. doi: 10.1016/s0092-8674(02)00620-7.
  • 6. Karplus M, Kuriyan J. Molecular dynamics and protein function. P Natl Acad Sci USA. 2005;102:6679–6685. doi: 10.1073/pnas.0408930102.
  • 7. Das R, Baker D. Macromolecular modeling with Rosetta. Annu Rev Biochem. 2008;77:363–382. doi: 10.1146/annurev.biochem.77.062906.171838.
  • 8. Mirny L, Shakhnovich E. Evolutionary conservation of the folding nucleus. J Mol Biol. 2001;308:123–129. doi: 10.1006/jmbi.2001.4602.
  • 9. Onuchic JN, Wolynes PG. Theory of protein folding. Curr Opin Struc Biol. 2004;14:70–75. doi: 10.1016/j.sbi.2004.01.009.
  • 10. Tozzini V. Coarse-grained models for proteins. Curr Opin Struc Biol. 2005;15:144–150. doi: 10.1016/j.sbi.2005.02.005.
  • 11. Go N, Taketomi H. Respective roles of short-range and long-range interactions in protein folding. P Natl Acad Sci USA. 1978;75:559–563. doi: 10.1073/pnas.75.2.559.
  • 12. Bryngelson JD, Wolynes PG. Spin-glasses and the statistical-mechanics of protein folding. P Natl Acad Sci USA. 1987;84:7524–7528. doi: 10.1073/pnas.84.21.7524.
  • 13. Coluzza I, Muller H, Frenkel D. Designing refoldable model molecules. Phys Rev E. 2003;68:046703. doi: 10.1103/PhysRevE.68.046703.
  • 14. Coluzza I, Frenkel D. Monte Carlo study of substrate-induced folding and refolding of lattice proteins. Biophys J. 2007;92:1150–1156. doi: 10.1529/biophysj.106.084236.
  • 15. Coluzza I, Frenkel D. Designing specificity of protein-substrate interactions. Phys Rev E. 2004;70:051917. doi: 10.1103/PhysRevE.70.051917.
  • 16. Rothlisberger D, Khersonsky O, Wollacott AM, Jiang L, Dechancie J, et al. Kemp elimination catalysts by computational enzyme design. Nature. 2008;453:190–U4. doi: 10.1038/nature06879.
  • 17. Maritan A, Micheletti C, Trovato A, Banavar JR. Optimal shapes of compact strings. Nature. 2000;406:287–290. doi: 10.1038/35018538.
  • 18. Hoang T, Trovato A, Seno F, Banavar J, Maritan A. Geometry and symmetry presculpt the free-energy landscape of proteins. P Natl Acad Sci USA. 2004;101:7960–7964. doi: 10.1073/pnas.0402525101.
  • 19. Magee JE, Victor V, Lue L. Helical structures from an isotropic homopolymer model. Phys Rev Lett. 2006;96:207802. doi: 10.1103/PhysRevLett.96.207802.
  • 20. Hoang T, Marsella L, Trovato A, Seno F, Banavar J, et al. Common attributes of native-state structures of proteins, disordered proteins, and amyloid. P Natl Acad Sci USA. 2006;103:6883–6888. doi: 10.1073/pnas.0601824103.
  • 21. Banavar JR, Cieplak M, Hoang TX, Maritan A. First-principles design of nanomachines. P Natl Acad Sci USA. 2009;106:6900–6903. doi: 10.1073/pnas.0901429106.
  • 22. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, et al. Announcing the worldwide protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 23. Creighton TE. W. H. Freeman & Co Ltd, 2nd revised edition; 1992. Proteins.
  • 24. Betancourt M, Thirumalai D. Pair potentials for protein folding: Choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Science. 1999;8:361–369. doi: 10.1110/ps.8.2.361.
  • 25. Irbäck A, Sjunnesson F, Wallin S. Three-helix-bundle protein in a Ramachandran model. P Natl Acad Sci USA. 2000;97:13614–13618. doi: 10.1073/pnas.240245297.
  • 26. Doolittle R. Springer; 1989. In: Predictions of Protein Structure and the Principles of Protein Conformation. pp. 599–623.
  • 27. Shakhnovich E, Gutin A. Engineering of stable and fast-folding sequences of model proteins. P Natl Acad Sci USA. 1993;90:7195–7199. doi: 10.1073/pnas.90.15.7195.
  • 28. Deutsch J, Kurosky T. New algorithm for protein design. Phys Rev Lett. 1996;76:323–326. doi: 10.1103/PhysRevLett.76.323.
  • 29. Seno F, Trovato A, Banavar JR, Maritan A. Maximum entropy approach for deducing amino acid interactions in proteins. Phys Rev Lett. 2008;100:1–4. doi: 10.1103/PhysRevLett.100.078102.
  • 30. Frenkel D, Smit B. Academic Press; 2002. Understanding Molecular Simulation. 389.
  • 31. Vendruscolo M. Modified configurational bias Monte Carlo method for simulation of polymer systems. J Chem Phys. 1997;106:2970–2976.
  • 32. Wilke C, Bloom J, Drummond D, Raval A. Predicting the tolerance of proteins to random amino acid substitution. Biophys J. 2005;89:3714–3720. doi: 10.1529/biophysj.105.062125.
  • 33. Gutin A, Shakhnovich E. Ground-state of random copolymers and the discrete random energy-model. J Chem Phys. 1993;98:8174–8177.
  • 34. Coluzza I, Frenkel D. Virtual-move parallel tempering. ChemPhysChem. 2005;6:1779–1783. doi: 10.1002/cphc.200400629.
  • 35. Frenkel D. Speed-up of Monte Carlo simulations by sampling of rejected states. P Natl Acad Sci USA. 2004;101:17571–17575. doi: 10.1073/pnas.0407950101.
  • 36. Abeln S, Frenkel D. Accounting for protein-solvent contacts facilitates design of non-aggregating lattice proteins. Biophys J. 2011;100:693–700. doi: 10.1016/j.bpj.2010.11.088.
  • 37. Humphrey W, Dalke A, Schulten K. VMD - visual molecular dynamics. J Molec Graphics. 1996;14:33–38. doi: 10.1016/0263-7855(96)00018-5.
  • 38. Protein Data Bank website. Available: www.pdb.org. Accessed 2011 May 20.
  • 39. See the supplementary information about the methods.
  • 40. The values of RMSD have been calculated using “RMSD Calculator” in VMD excluding the first and the last 5 residues.


Articles from PLoS ONE are provided here courtesy of PLOS
