Proceedings of the National Academy of Sciences of the United States of America. 2001 Feb 20;98(5):2329–2333. doi: 10.1073/pnas.041609598

Recent improvements in prediction of protein structure by global optimization of a potential energy function

Jarosław Pillardy *, Cezary Czaplewski *,†, Adam Liwo *,†, Jooyoung Lee *, Daniel R Ripoll §, Rajmund Kaźmierkiewicz *,†, Stanisław Ołdziej, William J Wedemeyer *, Kenneth D Gibson *, Yelena A Arnautova *, Jeff Saunders *, Yuan-Jie Ye *, Harold A Scheraga *
PMCID: PMC30138  PMID: 11226239

Abstract

Recent improvements of a hierarchical ab initio or de novo approach for predicting both α and β structures of proteins are described. The united-residue energy function used in this procedure includes multibody interactions from a cumulant expansion of the free energy of polypeptide chains, with their relative weights determined by Z-score optimization. The critical initial stage of the hierarchical procedure involves a search of conformational space by the conformational space annealing (CSA) method, followed by optimization of an all-atom model. The procedure was assessed in a recent blind test of protein structure prediction (CASP4). The resulting lowest-energy structures of the target proteins (ranging in size from 70 to 244 residues) agreed with the experimental structures in many respects. The entire experimental structure of a cyclic α-helical protein of 70 residues was predicted to within 4.3 Å α-carbon (Cα) rms deviation (rmsd) whereas, for other α-helical proteins, fragments of roughly 60 residues were predicted to within 6.0 Å Cα rmsd. Whereas β structures can now be predicted with the new procedure, the success rate for α/β- and β-proteins is lower than that for α-proteins at present. For the β portions of α/β structures, the Cα rmsd's are less than 6.0 Å for contiguous fragments of 30–40 residues; for one target, three fragments (of length 10, 23, and 28 residues, respectively) formed a compact part of the tertiary structure with a Cα rmsd less than 6.0 Å. Overall, these results constitute an important step toward the ab initio prediction of protein structure solely from the amino acid sequence.


Important progress has been made in recent years toward the physics-based computation of protein structure based solely on knowledge of the amino acid sequence. This approach, commonly referred to as an ab initio or de novo method (1–3), is based on the thermodynamic hypothesis formulated by Anfinsen (4), according to which the native structure of a protein corresponds to the global minimum of its free energy under given conditions. Protein structure prediction by using ab initio methods is accomplished by a search for a conformation corresponding to the global minimum of an appropriate potential energy function without use of secondary structure prediction, homology modeling, threading, etc.

Until recently, ab initio protein structure prediction based solely on the thermodynamic hypothesis was considered unfeasible (5–7), mainly because of the inaccuracy of the potential functions used to describe protein conformational energy and the lack of powerful global optimization methods for exploring the energy landscapes represented by those functions. Other types of knowledge-based methodologies, such as homology modeling (8–13) or threading methods (9, 12, 14), have been considered to be the most successful approaches. However, the success of these methods depends on the presence of sequentially or structurally homologous proteins in the databases. Furthermore, they do not provide a general understanding of the role of particular interactions in the formation of protein structure and the mechanisms of protein folding. This understanding can be achieved only through the development of force fields based completely on the physics of interactions for which the native structure is the lowest-energy minimum.

United-residue models of polypeptide chains (14–21) have been the subject of special attention for many years. In particular, because a global minimum search of single-domain proteins of typical size (30–250 residues) is practically unfeasible at the all-atom level, a united-residue representation of the protein reduces the number of variables, making this optimization problem tractable with current computers. During the last few years, we have developed a physics-based united-residue force field (UNRES) (19–21) for off-lattice simulations (22). Initial predictive applications of the UNRES force field were carried out on helical proteins, as assessed during the CASP3 experiment (23; see Footnotes); however, this initial version was unable to model β-structures (19). During the past 2 years, we continued to develop the force field and, with the aid of a cumulant expansion of the free energy and Z-score optimization, we determined the terms in the restricted free energy (RFE) function that are responsible for formation of β-structure (21). Thus, the current version of UNRES can treat proteins with both α and β structures.

General Form of the UNRES Force Field.

In the UNRES model (19–21), a polypeptide chain is represented by a sequence of α-carbon (Cα) atoms linked by virtual bonds with attached united side chains (SC) and united peptide groups (p). Each united peptide group is located in the middle between two consecutive α-carbons. Only these united peptide groups and the united side chains serve as interaction sites, the α-carbons serving only to define the chain geometry (see figure 1 of ref. 24). All virtual bond lengths (i.e., Cα–Cα and Cα–SC) are fixed; the distance between neighboring Cαs is 3.8 Å, corresponding to trans peptide groups, whereas the side-chain angles (αSC and βSC), and virtual-bond angles (θ and γ) can vary. The energy of the virtual-bond chain is expressed by Eq. 1.

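$$
U = \sum_{i<j} U_{SC_iSC_j} + w_{SCp}\sum_{i \neq j} U_{SC_ip_j} + w_{el}\sum_{i<j-1} U_{p_ip_j} + w_{tor}\sum_i U_{tor}(\gamma_i) + w_{loc}\sum_i \left[ U_b(\theta_i) + U_{rot}(\alpha_{SC_i}, \beta_{SC_i}) \right] + \sum_{m=1}^{N_{corr}} w_{corr}^{(m)} U_{corr}^{(m)} \tag{1}
$$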

The term USCiSCj represents the mean free energy of the hydrophobic (hydrophilic) interactions between the side chains as an orientation-dependent Gay-Berne potential (25); it implicitly contains the contributions from the interactions of the side chains with the solvent. The term USCipj denotes the excluded-volume potential of the side-chain–peptide-group interactions. The interaction potential (Upipj) accounts mainly for the electrostatic interactions (i.e., the tendency to form backbone hydrogen bonds) between peptide groups pi and pj. Utor, Ub, and Urot represent the energies of virtual-dihedral angle torsions, virtual-bond angle bending, and side-chain rotamers, respectively; these terms account for the local propensities of the polypeptide chain. Details of the parameterization of all of these terms are provided in earlier publications (19). Finally, the terms U_corr^(m), m = 1, 2, …, Ncorr, are the correlation or multibody contributions from the cumulant expansion of the RFE, and the w's are the weights of the energy terms.
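The geometric quantities on which these terms act can all be derived from the Cα trace. The following is a minimal sketch, not the authors' code: the function names and the four-point test chain are illustrative assumptions, and the sign convention of the dihedral angle γ is one common choice that the text does not specify.

```python
import math

CA_CA_BOND = 3.8  # virtual-bond length in Å for trans peptide groups (from the text)


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def peptide_centers(ca):
    """United peptide groups: midpoints of consecutive Cα atoms."""
    return [tuple((u + v) / 2.0 for u, v in zip(ca[i], ca[i + 1]))
            for i in range(len(ca) - 1)]


def virtual_bond_angles(ca):
    """Virtual-bond angles θ (degrees) at each interior Cα."""
    angles = []
    for i in range(1, len(ca) - 1):
        u, v = sub(ca[i - 1], ca[i]), sub(ca[i + 1], ca[i])
        cos_t = dot(u, v) / (norm(u) * norm(v))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    return angles


def virtual_dihedrals(ca):
    """Virtual dihedral angles γ (degrees) for each four consecutive Cα atoms."""
    out = []
    for i in range(len(ca) - 3):
        b1 = sub(ca[i + 1], ca[i])
        b2 = sub(ca[i + 2], ca[i + 1])
        b3 = sub(ca[i + 3], ca[i + 2])
        n1, n2 = cross(b1, b2), cross(b2, b3)
        out.append(math.degrees(math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))))
    return out


# Hypothetical four-residue trace with 3.8 Å virtual bonds along x, y, z.
ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
print(peptide_centers(ca_trace))
print(virtual_bond_angles(ca_trace))
print(virtual_dihedrals(ca_trace))
```

In this representation a chain of N residues has N − 1 peptide-group sites, N − 2 angles θ, and N − 3 angles γ, which is why the united-residue search space is so much smaller than the all-atom one.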

The UNRES force field was derived as an RFE function, by averaging the all-atom energy over the degrees of freedom that are neglected in the united-residue model (19–21); these include solvent degrees of freedom, side-chain rotation angles, and the dihedral angles λ for rotation of the peptide groups about the Cα⋅⋅⋅Cα virtual bonds (26). The RFE function can be expressed as a sum of single-body, pairwise, and, generally, multibody contributions of various order in the framework of the so-called “cumulant” expansion (27). These cumulant terms are parameterized by fitting them to the free-energy surfaces of model systems such as tetra- and hexapeptides, as calculated from our all-atom potential, empirical conformational energy program for peptides (ECEPP/3; ref. 28).
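The averaging described above can be written compactly as a potential of mean force. The notation below is introduced here for illustration and is an assumption rather than the paper's own: X denotes the united-residue coordinates that are kept, Y the degrees of freedom that are averaged out, E(X, Y) the all-atom energy, and β = 1/RT. The second line is the generic single-variable cumulant expansion in the sense of ref. 27; the force field itself relies on a more elaborate decomposition into single-body, pairwise, and multibody cluster terms.

```latex
F(X) = -RT \ln \left\{ \frac{1}{V_Y} \int_{\Omega_Y} \exp\left[ -\beta E(X,Y) \right] dV_Y \right\},
\qquad V_Y = \int_{\Omega_Y} dV_Y

F(X) = -RT \sum_{n \ge 1} \frac{(-\beta)^n}{n!} \kappa_n
     = \langle E \rangle - \frac{\beta}{2} \left( \langle E^2 \rangle - \langle E \rangle^2 \right) + \cdots
```

Here κ_n is the n-th cumulant of E over Y at fixed X, and the angle brackets denote unweighted averages over Y. The first cumulant is the mean energy, and higher cumulants carry the correlations that give rise to multibody terms.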

Finally, the weights of the different terms in the UNRES energy function (Eq. 1) were determined by maximizing both the energy gap between the native-like and non-native conformations (ΔE) and the Z-score value (Z), both quantities being treated as functions of weights, as expressed by Eqs. 2 and 3.

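[Eq. 2: the energy gap ΔE as a function of the weights w; equation image not reproduced in this text.]
[Eq. 3: the Z-score Z as a function of the weights w; equation image not reproduced in this text.]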

where nat and non-nat indicate the sets of native-like and non-native conformations, respectively [the criterion being the rms deviation (rmsd) from the experimental structure], and Nnat and Nnonnat denote the number of native-like and non-native structures, respectively.
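To make the two target quantities concrete, the sketch below computes an energy gap and a Z-score for a set of conformations split by an rmsd cutoff. It is only an illustration under stated assumptions: the sign conventions are chosen here so that larger values are better (consistent with the word "maximizing" in the text), the 4.0 Å cutoff and the four energies are invented, and the exact forms of Eqs. 2 and 3 may differ.

```python
import math


def split_by_rmsd(conformations, cutoff):
    """Split (energy, rmsd) pairs into native-like and non-native energy lists."""
    nat = [e for e, r in conformations if r <= cutoff]
    non = [e for e, r in conformations if r > cutoff]
    return nat, non


def energy_gap(e_nat, e_nonnat):
    """Lowest non-native energy minus lowest native-like energy.

    Assumed sign convention: positive when a native-like conformation is
    the global minimum, so larger is better.
    """
    return min(e_nonnat) - min(e_nat)


def z_score(e_nat, e_nonnat):
    """Difference of mean energies in units of the non-native standard deviation.

    Same assumed sign convention: larger is better.
    """
    mean_nat = sum(e_nat) / len(e_nat)          # average over N_nat structures
    mean_non = sum(e_nonnat) / len(e_nonnat)    # average over N_nonnat structures
    var_non = sum((e - mean_non) ** 2 for e in e_nonnat) / len(e_nonnat)
    return (mean_non - mean_nat) / math.sqrt(var_non)


# Hypothetical (energy, Cα rmsd in Å) pairs; the 4.0 Å cutoff is also an assumption.
confs = [(-10.0, 2.0), (-8.0, 3.5), (-6.0, 9.0), (-2.0, 12.0)]
nat, non = split_by_rmsd(confs, cutoff=4.0)
print(energy_gap(nat, non), z_score(nat, non))
```

Because every energy is a weighted sum of the terms in Eq. 1, both quantities become functions of the weights, and an optimizer can adjust the weights to push them upward.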

To obtain a force field that can be applied to α-helical-, β-, and α/β-structures, the weights were optimized by using two proteins simultaneously: the 10–55 fragment of the 60-residue B domain of staphylococcal protein A (hereafter referred to as protein A) (29), which has a three-helix bundle structure, and the 20-residue betanova (30), whose native structure is a three-stranded antiparallel β-sheet. The optimization procedure involves iterative cycles in each of which the conformational space annealing (CSA) method (22, 31, 32) is used to carry out a global search with the current set of weights. Details of the procedure to optimize the UNRES model will be presented elsewhere.
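The text does not spell out how CSA works, so the following is a heavily simplified caricature of the bank-based idea described in refs. 22, 31, and 32, applied to a toy two-dimensional function instead of UNRES. Everything specific is an assumption made for illustration: the test function, the crude local minimizer, the way trial points are generated, and the annealing schedule for the distance cutoff.

```python
import math
import random


def energy(x):
    """Toy rugged landscape (2-D Rastrigin function); a stand-in, not UNRES."""
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)


def local_minimize(x, step=0.2, tol=1e-6):
    """Crude coordinate pattern search used as the local minimizer."""
    x, best = list(x), energy(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                e = energy(trial)
                if e < best:
                    x, best, improved = trial, e, True
        if not improved:
            step *= 0.5
    return x, best


def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def csa(bank_size=20, rounds=200, seed=0):
    rng = random.Random(seed)
    # The bank holds locally minimized conformations only.
    bank = [local_minimize([rng.uniform(-5, 5), rng.uniform(-5, 5)])
            for _ in range(bank_size)]
    initial_best = min(e for _, e in bank)
    d_avg = (sum(distance(a[0], b[0]) for a in bank for b in bank)
             / (bank_size * (bank_size - 1)))
    d_cut, d_min = d_avg / 2.0, d_avg / 5.0   # cutoff is annealed from d_cut to d_min
    for _ in range(rounds):
        seed_x = rng.choice(bank)[0]
        partner_x = rng.choice(bank)[0]
        # Trial: copy the seed, take one coordinate from a partner, perturb it.
        trial = list(seed_x)
        i = rng.randrange(len(trial))
        trial[i] = partner_x[i] + rng.gauss(0.0, 0.3)
        trial, e_trial = local_minimize(trial)
        # Bank update: compete with the nearest member if it is close enough,
        # otherwise with the highest-energy member of the bank.
        nearest = min(range(bank_size), key=lambda k: distance(trial, bank[k][0]))
        if distance(trial, bank[nearest][0]) < d_cut:
            target = nearest
        else:
            target = max(range(bank_size), key=lambda k: bank[k][1])
        if e_trial < bank[target][1]:
            bank[target] = (trial, e_trial)
        d_cut = max(d_min, d_cut * 0.98)
    return initial_best, min(bank, key=lambda m: m[1])


initial_best, (x_best, e_best) = csa()
print(initial_best, x_best, e_best)
```

The design point worth noting is the distance cutoff: while it is large, the bank is forced to stay diverse and cover the landscape; as it shrinks, the search concentrates on refining the low-energy regions already found. In the weight optimization described above, a search of this kind is repeated in each cycle with the current weights.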

The resulting UNRES energy function was tested with a designed 28-residue peptide that contains the minimal α/β fold (33), identified in the Protein Data Bank (PDB) as 1fsd. It should be stressed that 1fsd was not used in the force-field optimization. In a series of global optimization runs with the CSA method, a structure with an rmsd for the Cα atoms of 3.4 Å from the average NMR structure (33) was obtained as the one with the lowest energy (see Fig. 1).
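Since Cα rmsd after optimal superposition is the yardstick used throughout this paper, a self-contained sketch of how it can be computed may be useful. This uses Horn's quaternion formulation with a small Jacobi eigenvalue solver so that only the standard library is needed; it is one standard technique, not necessarily the procedure the authors used, and the coordinates are invented.

```python
import math


def centered(points):
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in points]


def largest_eigenvalue(a, sweeps=50):
    """Largest eigenvalue of a small symmetric matrix (cyclic Jacobi rotations)."""
    n = len(a)
    a = [row[:] for row in a]
    total = sum(v * v for row in a for v in row)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off <= 1e-30 * total:
            break
        for p in range(n):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4.0
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return max(a[i][i] for i in range(n))


def ca_rmsd(mobile, reference):
    """Minimum rmsd over rigid-body rotations and translations."""
    x, y = centered(mobile), centered(reference)
    n = len(x)
    s = [[sum(x[i][a] * y[i][b] for i in range(n)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    key = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    g = sum(v * v for p in x for v in p) + sum(v * v for p in y for v in p)
    return math.sqrt(max(0.0, (g - 2.0 * largest_eigenvalue(key)) / n))


# Hypothetical coordinates: the second set is the first rotated by 90° about z
# and translated, so the optimal-superposition rmsd should be essentially zero.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
moved = [(1.0, 2.0, 3.0), (1.0, 5.8, 3.0), (-2.8, 5.8, 3.0), (-2.8, 5.8, 6.8)]
print(ca_rmsd(moved, ref))
```

The largest eigenvalue of the 4×4 matrix gives the best attainable overlap between the two centered coordinate sets, so the rotation itself never has to be constructed when only the rmsd value is needed.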

Figure 1.

Superposition of the predicted (red) structure of 1fsd on a family of experimental NMR structures (green) (33). The Cα atoms agree to within an rmsd of 3.4 Å.

Increased accuracy and speed of convergence are obtained by treating α, β and α/β proteins separately, with separate weights determined for each category by Eqs. 2 and 3. Further, by using only the lowest order of cumulants, an additional force field (α0) to treat α-type proteins was developed. The latter force field is less accurate than the one that includes higher-order cumulants (α), but, despite the small loss in accuracy, we are able to treat α proteins of up to 250 residues with a 3-fold speed-up in the computations.

The CASP4 Exercise in Protein Structure Prediction.

The newly developed force field has recently been used in blind predictions of some of the target proteins provided for the Fourth Critical Assessment of Techniques for Protein Structure Prediction (CASP4). The three-dimensional structures of these targets were being determined by NMR spectroscopy or x-ray crystallography at the same time that the predictions were made. Our laboratory submitted predictions for 16 of the 43 targets that were volunteered by experimental structural biologists. The length of the target sequences that we considered varied from 70 to 244 amino acids. In all cases, five predictions per target were submitted. The models correspond to the lowest-energy UNRES conformations of the five lowest-energy families obtained from a clustering analysis. Each model was then converted to an all-atom structure by using the dipole-path method (34) and later refined by using the electrostatically driven Monte Carlo (EDMC) method (35, 36) and ECEPP/3 (28).
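The selection of five models from families of conformations can be sketched as follows. The clustering algorithm is not specified in the text, so the greedy scheme, the cutoff, and the one-dimensional stand-in for structural distance are all assumptions chosen to keep the example short.

```python
def cluster_families(conformations, distance, cutoff):
    """Greedy clustering: visit conformations from lowest to highest energy;
    each joins the first family whose lowest-energy member lies within
    `cutoff`, otherwise it founds a new family."""
    families = []
    for conf in sorted(conformations, key=lambda c: c["energy"]):
        for fam in families:
            if distance(conf, fam[0]) <= cutoff:
                fam.append(conf)
                break
        else:
            families.append([conf])
    return families


def pick_models(conformations, distance, cutoff, n_models=5):
    """Lowest-energy member of each of the n_models lowest-energy families."""
    families = cluster_families(conformations, distance, cutoff)
    # Families are created in order of increasing energy of their first member.
    return [fam[0] for fam in families[:n_models]]


# Hypothetical data: "x" is a 1-D stand-in for a structure, and the absolute
# difference in x stands in for a pairwise Cα rmsd.
confs = [
    {"name": "a", "energy": -50.0, "x": 0.0},
    {"name": "b", "energy": -49.0, "x": 0.5},
    {"name": "c", "energy": -48.0, "x": 10.0},
    {"name": "d", "energy": -47.0, "x": 10.4},
    {"name": "e", "energy": -46.0, "x": 20.0},
    {"name": "f", "energy": -45.0, "x": 30.0},
    {"name": "g", "energy": -44.0, "x": 40.0},
    {"name": "h", "energy": -43.0, "x": 50.0},
]
models = pick_models(confs, lambda p, q: abs(p["x"] - q["x"]), cutoff=2.0)
print([m["name"] for m in models])
```

Picking one representative per family, instead of the five lowest-energy conformations overall, keeps the submitted models structurally distinct: here "b" and "d" are skipped even though their energies are lower than those of "e", "f", and "g".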

The analysis of our results for the α-helical targets shows reasonably accurate predictions. Our best α-helical prediction corresponds to target T0102 (bacteriocin AS-48), which is a 70-residue cyclic polypeptide from Enterococcus faecalis (PDB code: 1e68). The structure (37) consists of five α-helices arranged in a structural motif analogous to that of NK-lysin, but this information about a homologous structure was not used in the prediction. Our simulations were carried out by assuming an open chain (i.e., no loop-closing term was used in the energy function to force the N- and C-termini to come together). A version of our force field (viz., α) parameterized for α-helical structures that uses higher order correlation terms was used. Secondary structure information for this protein was available but was not used in our simulations. This information was used only to generate four additional sequences by cyclic permutations of the termini in such a manner that sequence cuts fell outside of the α-helical regions. Five CSA runs were carried out, one for each different sequence. Low-energy conformations, in which the N- and C-termini were in close proximity, were selected, and loop closure was imposed during the refinement at the all-atom level representation. Fig. 2 shows the superposition of model 1 for T0102 onto the experimental structure with an rmsd of 4.3 Å for the Cα atoms.
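The construction of the permuted sequences can be illustrated with a short sketch. The rule used here, cutting at the midpoint of each loop between consecutive helices, is an assumption; the text states only that the cuts fell outside the helical regions. The 20-letter string and helix ranges are placeholders, not the sequence or secondary structure of bacteriocin AS-48.

```python
def loop_cut_points(helices):
    """Midpoint of each loop between consecutive helices (0-based indices)."""
    helices = sorted(helices)
    return [(end_prev + 1 + start_next) // 2
            for (_, end_prev), (start_next, _) in zip(helices, helices[1:])]


def cyclic_permutations(sequence, helices):
    """Original sequence plus one cyclic permutation per inter-helix loop.

    Each entry is (cut, permuted), where the permuted chain starts at
    residue index `cut` of the original sequence.
    """
    cuts = [0] + loop_cut_points(helices)
    return [(k, sequence[k:] + sequence[:k]) for k in cuts]


def inside_helix(cut, helices):
    """True if starting the chain at residue `cut` would split a helix."""
    return any(start < cut <= end for start, end in helices)


# Placeholder sequence and helix ranges (inclusive, 0-based); both hypothetical.
sequence = "abcdefghijklmnopqrst"
helices = [(1, 4), (7, 10), (14, 17)]
for cut, permuted in cyclic_permutations(sequence, helices):
    print(cut, permuted, inside_helix(cut, helices))
```

With five helices this rule yields four inter-helix loops and therefore four additional sequences, matching the five CSA runs mentioned above. Every permutation describes the same cyclic molecule, so each open-chain run simply leaves a different loop unconnected.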

Figure 2.

Superposition of the crystal (red) and predicted (yellow) structures of the 70-residue protein bacteriocin AS-48 (target T0102). The Cα atoms were superposed with an rmsd of 4.3 Å.

For other all-α target proteins, our predicted structures reproduced several features of the experimental structure. For example, the predicted structures of targets T0096 (PDB code: 1e2x), T0097 (PDB code: 1g7d), T0098 (PDB code: 1fc3), T0106, and T0124 match the experimental structures to within 6.0 Å Cα rmsd for fragments varying in length from 52 to 68 residues (Figs. 3, 4, and 5). It should be noted that simulation studies (38) have demonstrated that it is extremely unlikely to obtain a predicted structure with a 6-Å rmsd by a random search for a chain of at least 60 residues and, hence, that a prediction with a 6-Å rmsd should be considered as a successful one. For the 121-residue target T0098, which represents a novel protein fold, our protocol reproduced a 68-residue fragment (model 3) with a 5.9-Å Cα rmsd (residues 146–213) (Fig. 3). For the 105-residue target T0097, a 66-residue fragment (residues V158-E223) superposed with a Cα rmsd of 5.9 Å (Fig. 4), whereas for the 128-residue target T0106, the 64-residue fragment between I8 and E71 superimposed with a Cα rmsd of 6.0 Å (Fig. 5).

Figure 3.

Superposition of the crystal (red) and predicted (yellow) structures of T0098. The Cα atoms of the 68-residue fragment included between residues D146 to E213 superimposed with an rmsd of 5.9 Å. This fragment is shown as colored ribbons.

Figure 4.

Superposition of the crystal (red) and predicted (yellow) structures of T0097. The Cα atoms of the 66-residue fragment included between residues V158 to E223 superimposed with an rmsd less than 6.0 Å. This fragment is shown as colored ribbons.

Figure 5.

The predicted structure of T0106. The Cα atoms of the illustrated 64-residue fragment included between residues I8 to E71 agree with the experimental structure with an rmsd of 6.0 Å. This fragment is shown as a yellow ribbon.

Predictions of α/β- and β-targets were in general less successful than those for α-helical targets. Nonetheless, some of them are quite encouraging, especially because our new procedure is now capable of predicting β structure whereas our older one was not. For the 163-residue target T0126, fragments involving residues 66–82 and 88–122 (52 residues) of model 1 (not shown) match the experimental structure within 6.8 Å Cα rmsd. Similarly, the fragments including residues 45–54, 60–82, and 95–122 of model 3 (61 residues) match the experimental structure within 6.0 Å Cα rmsd (Fig. 6) and correctly predicted the contact between noncontiguous strands involving residues 77–82 and 104–111.

Figure 6.

The predicted structure of T0126. The Cα atoms of three fragments defined by residues L45 to R54, Q60 to G82, and M95 to A122 agree with the experimental structure with an rmsd of 6.0 Å. These fragments are shown as yellow ribbons. The remaining residues are shown as a Cα trace.

Concluding Remarks.

We have shown that a reasonably accurate united-residue potential function for proteins can be developed by including multibody terms derived from a cumulant expansion of the restricted free energy. Even though further improvement of our approach is necessary, the results presented here demonstrate that prediction of the three-dimensional structures of proteins solely from the amino acid sequence (without the aid of knowledge-based information from secondary-structure prediction, multiple-sequence alignment, or fold recognition) is feasible.

Acknowledgments

This research was supported by grants from the National Institutes of Health (GM-14312), the National Science Foundation (MCB95–13167), the Fogarty Foundation (R03 TW1064), the National Institutes of Health National Center for Research Resources (P41RR-04293), and the Polish State Committee for Scientific Research, KBN (3 T09A 111 17). Support was also received from the National Foundation for Cancer Research. A large part of the computations in this work was carried out (i) at the Cornell Theory Center, which receives funding from Cornell University, New York State, the National Center for Research Resources at the National Institutes of Health, and members of the Theory Center's Corporate Partnership Program; (ii) at the computing resources provided by the National Partnership for Advanced Computational Infrastructure at the San Diego Supercomputer Center supported in part by the National Science Foundation cooperative agreement ACI-9619020; (iii) with the resources of the Informatics Center of the Metropolitan Academic Network (IC MAN) in Gdańsk; and (iv) with our own array of 55 dual-processor PC computers.

Abbreviations

CSA: conformational space annealing
ECEPP: empirical conformational energy program for peptides
PDB: Protein Data Bank
rmsd: rms deviation
UNRES: united-residue
RFE: restricted free energy
SC: side chain

Footnotes

Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction, December 13–17, 1998, Pacific Grove, CA, http://predictioncenter.llnl.gov/casp3/Casp3.html.

References

  • 1. Scheraga H A. Int J Quant Chem. 1992;42:1529–1536.
  • 2. Vásquez M, Némethy G, Scheraga H A. Chem Rev. 1994;94:2183–2239.
  • 3. Scheraga H A. Biophys Chem. 1996;59:329–339. doi: 10.1016/0301-4622(95)00126-3.
  • 4. Anfinsen C B. Science. 1973;181:223–230. doi: 10.1126/science.181.4096.223.
  • 5. Jones D T. Curr Opin Struct Biol. 1997;7:377–387. doi: 10.1016/s0959-440x(97)80055-3.
  • 6. Mirny L A, Shakhnovich E I. J Mol Biol. 1998;283:507–526. doi: 10.1006/jmbi.1998.2092.
  • 7. Fersht A. Structure and Mechanism in Protein Science. New York: Freeman; 1999. p. 536.
  • 8. Warme P K, Momany F A, Rumball S V, Tuttle R W, Scheraga H A. Biochemistry. 1974;13:768–782. doi: 10.1021/bi00701a020.
  • 9. Jones T A, Thirup S. EMBO J. 1986;5:819–822. doi: 10.1002/j.1460-2075.1986.tb04287.x.
  • 10. Clark D A, Shirazi J, Rawlings C J. Prot Eng. 1991;4:751–760. doi: 10.1093/protein/4.7.751.
  • 11. Rooman M J, Wodak S J. Biochemistry. 1992;31:10239–10249. doi: 10.1021/bi00157a010.
  • 12. Johnson M S, Overington J P, Blundell T L. J Mol Biol. 1993;231:735–752. doi: 10.1006/jmbi.1993.1323.
  • 13. Fischer D, Rice D, Bowie J U, Eisenberg D. FASEB J. 1996;10:126–136. doi: 10.1096/fasebj.10.1.8566533.
  • 14. Sippl M J. J Comput Aided Mol Des. 1993;7:473–501. doi: 10.1007/BF02337562.
  • 15. Levitt M, Warshel A. Nature (London). 1975;253:694–698. doi: 10.1038/253694a0.
  • 16. Pincus M R, Scheraga H A. J Phys Chem. 1977;81:1579–1583.
  • 17. Godzik A, Koliński A, Skolnick J. J Comput Aided Mol Des. 1993;7:397–438. doi: 10.1007/BF02337559.
  • 18. Crippen G M. J Mol Biol. 1996;260:467–475. doi: 10.1006/jmbi.1996.0414.
  • 19. Liwo A, Pillardy J, Kaźmierkiewicz R, Wawak R J, Groth M, Czaplewski C, Ołdziej S, Scheraga H A. Theor Chem Acc. 1999;101:16–20.
  • 20. Liwo A, Lee J, Ripoll D R, Pillardy J, Scheraga H A. Proc Natl Acad Sci USA. 1999;96:5482–5485. doi: 10.1073/pnas.96.10.5482.
  • 21. Liwo A, Pillardy J, Czaplewski C, Lee J, Ripoll D R, Groth M, Rodziewicz-Motowidło S, Kaźmierkiewicz R, Wawak R J, Ołdziej S, Scheraga H A. In: RECOMB 2000: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology. Shamir R, Miyano S, Istrail S, Pevzner P, Waterman M, editors. New York: ACM; 2000. pp. 193–200.
  • 22. Lee J, Liwo A, Scheraga H A. Proc Natl Acad Sci USA. 1999;96:2025–2030. doi: 10.1073/pnas.96.5.2025.
  • 23. Orengo C A, Bray J E, Hubbard T, LoConte L, Sillitoe I. Proteins Struct Funct Genet. 1999;Suppl. 3:149–170. doi: 10.1002/(sici)1097-0134(1999)37:3+<149::aid-prot20>3.3.co;2-8.
  • 24. Liwo A, Ołdziej S, Pincus M R, Wawak R J, Rackovsky S, Scheraga H A. J Comput Chem. 1997;18:849–873.
  • 25. Gay J G, Berne B J. J Chem Phys. 1981;74:3316–3319.
  • 26. Nishikawa K, Momany F A, Scheraga H A. Macromolecules. 1974;7:797–806. doi: 10.1021/ma60042a020.
  • 27. Kubo R. J Phys Soc Japan. 1962;17:1100–1120.
  • 28. Némethy G, Gibson K D, Palmer K A, Yoon C N, Paterlini G, Zagari A, Rumsey S, Scheraga H A. J Phys Chem. 1992;96:6472–6484.
  • 29. Gouda H, Torigoe H, Saito A, Sato M, Arata Y, Shimada I. Biochemistry. 1992;31:9665–9672. doi: 10.1021/bi00155a020.
  • 30. Kortemme T, Ramirez-Alvarado M, Serrano L. Science. 1998;281:253–256. doi: 10.1126/science.281.5374.253.
  • 31. Lee J, Scheraga H A, Rackovsky S. J Comput Chem. 1997;18:1222–1232.
  • 32. Lee J, Scheraga H A. Int J Quant Chem. 1999;75:255–265.
  • 33. Dahiyat B I, Mayo S L. Science. 1997;278:82–87. doi: 10.1126/science.278.5335.82.
  • 34. Liwo A, Pincus M R, Wawak R J, Rackovsky S, Scheraga H A. Protein Sci. 1993;2:1697–1714. doi: 10.1002/pro.5560021015.
  • 35. Ripoll D R, Scheraga H A. Biopolymers. 1988;27:1283–1303. doi: 10.1002/bip.360270808.
  • 36. Ripoll D R, Liwo A, Scheraga H A. Biopolymers. 1998;46:117–126. doi: 10.1002/(SICI)1097-0282(199808)46:2<117::AID-BIP6>3.0.CO;2-P.
  • 37. González C, Langdon G M, Bruix M, Gálvez A, Valdivia E, Maqueda M, Rico M. Proc Natl Acad Sci USA. 2000;97:11221–11226. doi: 10.1073/pnas.210301097. (First published September 26, 2000.)
  • 38. Reva B A, Finkelstein A V, Skolnick J. Fold Des. 1998;3:141–147. doi: 10.1016/s1359-0278(98)00019-4.
