Abstract
We present a method for predicting folding rates of proteins from their amino acid sequences only, or rather, from their chain lengths and their helicity predicted from their sequences. The method achieves 82% correlation with experiment over all 64 “two-state” and “multistate” proteins (including two artificial peptides) studied up to now.
Proteins fold at very different rates. Some fold within microseconds (1), whereas others need an hour (2). Small proteins usually (but far from always) fold faster than larger ones (3). The correlation between folding rate and protein size is not very high, however: 64% (it is stronger for “multistate” proteins, which have folding intermediates when folded in water, and weaker for “two-state” proteins, which have no such intermediates) (4).
Folding rates are predicted more accurately when the “contact order” of the 3D structure is taken into account (5, 6) in addition to the chain length: the correlation then reaches 74% for the totality of proteins (6).
It has been noticed that a high helical content is the main structural feature that decreases the contact order and accelerates folding of two-state proteins (7), and, for this group of proteins, a folding rate prediction method based on the content of secondary structure in their 3D structures has been suggested recently (8). [However, examining the whole set of proteins studied up to now, we found that the reported equation, which works well for the two-state proteins (8), is much worse at predicting folding rates of multistate proteins and short peptides.]
An empirical dependence of folding rate on some features of amino acid sequences has also been reported for some small groups of two-state proteins (9, 10), but no method to predict folding rates from sequences for the totality of proteins has been suggested so far.
The present work shows that folding rates for proteins of all kinds (as well as for short peptides) are well estimated from the lengths of their amino acid sequences and from secondary structure predictions based on these sequences. This estimate has a clear physical meaning, and it achieves 82% correlation with experiment over all 62 two-state and multistate proteins and 2 peptides studied up to now [i.e., it works even better than estimates based on whole 3D structures (5, 6)]. As a result, we can suggest a general method that predicts the rate of in-water folding of a protein directly from its primary structure and does not need any information on its 3D fold.
Materials and Methods
List of Proteins. This list (Table 1, which is published as supporting information on the PNAS web site) includes all peptides and single-domain proteins that have no S–S bonds or covalently bound ligands and whose in-water folding rates (kf) and primary and 3D structures have been established experimentally. It includes all 57 proteins and peptides from table 1 of Ivankov et al. (6) and, in addition, 7 recently studied proteins: Trp-cage [Protein Data Bank (ref. 11, www.pdb.org) ID code 1L2Y, 20 residues, kf = 2.5 × 10^5 sec^–1] (1); villin headpiece (1VII, 36 residues, kf = 9.8 × 10^4 sec^–1) (12); B domain of staphylococcal protein A (1BDD, 58 residues, kf = 1.2 × 10^5 sec^–1) (13); engrailed homeodomain (1ENH, 61 residues, kf = 3.6 × 10^4 sec^–1) (14); HypF-N (1GXT, 91 residues, kf = 81 sec^–1) (10); common-type acylphosphatase (2ACY, 98 residues, kf = 2.5 sec^–1) (15); and VlsE (1L8W, 341 residues, kf = 4.9 sec^–1) (16).
Secondary Structure Assignment. Secondary structure was assigned from Protein Data Bank (11) coordinates of proteins by using the program dssp (ref. 17, www.sander.ebi.ac.uk/dssp), which marks helical residues by symbols H and β-structural residues by symbols E.
Secondary Structure Prediction. Secondary structure was predicted by using the programs psipred (ref. 18, www.psipred.net) and alb (ref. 19, http://i2o.protres.ru/alb). The residues predicted as helical are marked by H by psipred and by H and & by alb, and those predicted as β-structural are marked by E by psipred and by S and B by alb.
Results and Discussion
Both analytical theory (20, 21) and off-lattice computer simulations (22) of folding suggest that the logarithm of folding rate (kf) decreases in proportion to some power of the protein chain length [although the value of this power for in-water folding of proteins is still determined rather crudely: from two-thirds to one-half (see refs. 6 and 20–22), and maybe even below that (23)].
Theoretically, the protein chain length is defined as the number of chain links (“folding units”) (20, 21), which is usually (4, 6, 20–23) taken to be the number L of chain residues. However, if the folding chain contains some preformed blocks (or blocks that form rapidly, and independently of the rest of the chain, during folding), the effective length of the folding chain (Leff) should be smaller than the number of residues L, the reduction being proportional to the total number of residues involved in these blocks.
Because α-helices are natural candidates for the role of the internally stable and/or rapidly and independently folding blocks, the effective length of the folding chain can be taken in the form
$$L_{\mathrm{eff}} = L - L_{\mathrm{H}} + l_1 \times N_{\mathrm{H}} \qquad [1]$$
where LH is the number of residues in helical conformation, NH is the number of helices, and l1 means that we consider the whole block (a helix) as l1 chain residues [from a physical point of view, l1 should not exceed a length of one turn of α-helix (4 residues); the l1 value should be optimized from comparison with experiment].
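As a minimal sketch, Eq. 1 can be evaluated from a per-residue secondary structure string such as those produced by psipred, alb, or dssp. The function name `effective_length` is hypothetical, and counting every maximal run of consecutive helical symbols as one helix is an assumption of this sketch (no minimal helix length is imposed here).

```python
from itertools import groupby


def effective_length(ss: str, helix_symbols: str = "H", l1: int = 3) -> int:
    """Return L_eff = L - L_H + l1 * N_H (Eq. 1) for a secondary structure string.

    ss            one symbol per residue (e.g. psipred or dssp output)
    helix_symbols symbols treated as helical ("H" for psipred and dssp, "H&" for alb)
    l1            number of chain residues that one whole helix is counted as
    """
    is_helix = [symbol in helix_symbols for symbol in ss]
    L = len(ss)              # total number of residues
    L_H = sum(is_helix)      # number of residues in helical conformation
    # Assumption: each maximal run of helical residues is one helix.
    N_H = sum(1 for helical, _ in groupby(is_helix) if helical)
    return L - L_H + l1 * N_H


# Hypothetical 22-residue string with two helices (6 and 4 residues long).
example = "CCHHHHHHCCCEEEECCHHHHC"
print(effective_length(example))          # L = 22, L_H = 10, N_H = 2, l1 = 3
print(effective_length(example, l1=1))
```

For the hypothetical string above, L = 22, LH = 10, and NH = 2, so Eq. 1 gives Leff = 22 – 10 + 3×2 = 18 at l1 = 3.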
Now, according to refs. 6 and 20–23, we can consider the following dependence of the folding rate on the effective chain length:
$$\log(k_f) \sim -(L_{\mathrm{eff}})^{P} \qquad [2]$$
where the value of the power P is to be fitted by comparison with experimental data. It is noteworthy that the case P = 0 corresponds to a correlation of log(kf) with log(Leff), because const – L^P = const – exp(P×ln(L)) ≈ (const – 1) – P×2.3×log(L) ∝ const′ – log(L) when P → 0. A correlation of log(kf) with log(L) has been suggested by Gutin et al. (24) (on the basis of in silico folding of simplified models of protein chains) for the ambient conditions that are most favorable for folding. The correlation of log(kf) with L^(2/3) concerns [according to the analytical theory of Finkelstein and Badretdinov (21)] the other extreme, “the minimal ambient folding conditions,” i.e., the midtransition between the native structure and the coil. Finally, intermediate P values in the region 0 < P < 2/3 may be expected for various intermediate “moderately folding” conditions, including folding in water.
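The P → 0 limit can be written out step by step. The following is a sketch of the first-order expansion only; the symbols c and c′ stand for the constants called const and const′ in the text.

```latex
\begin{align*}
c - L^{P} &= c - e^{P \ln L} \\
          &= c - \bigl(1 + P \ln L + O(P^{2}\ln^{2} L)\bigr) \\
          &\approx (c - 1) - P \,\ln(10)\, \log_{10} L ,
            \qquad \ln(10) \approx 2.3 .
\end{align*}
% A linear correlation is unchanged by a shift and by a positive rescaling,
% so for small P > 0 the quantity c - L^P correlates with log(k_f)
% exactly as c' - log10(L) does.
```

The constant term 1 in the expansion is also why the fitted equations for P > 0 are conveniently written with (Leff)^P – 1, which vanishes together with log(Leff) at Leff = 1.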
The value of L is obtained directly from the protein sequence, whereas the values of LH and NH can be estimated from the same sequence by using a good program for secondary structure prediction, e.g., the most effective program psipred (18), based on local sequence similarity, or the physics-based program alb (19). In addition (as a control test), we can compute LH and NH from known 3D protein structures by using dssp (17).
Having L, LH, and NH, we can estimate the value of log(kf) by using Eqs. 1 and 2. In each case, we varied the P and l1 values so as to maximize the correlation between log(kf) and –(Leff)^P.
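The optimization just described can be sketched as a plain grid scan over P and l1. Everything named below is an assumption of this sketch: the function names, the grid, and above all the data, which are synthetic placeholders generated from an exact logarithmic law and are not taken from Table 1. Following the text, P = 0 is treated as the logarithmic case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def scan_power_and_l1(proteins, log_kf, powers, l1_values):
    """Return (r, P, l1) maximizing the correlation of log(kf) with -(Leff)^P.

    proteins is a list of (L, L_H, N_H) tuples; Leff follows Eq. 1.
    """
    best = None
    for l1 in l1_values:
        leff = [L - L_H + l1 * N_H for L, L_H, N_H in proteins]
        for P in powers:
            # P = 0 is the limiting case of a correlation with log(Leff).
            x = [-math.log10(v) if P == 0 else -(v ** P) for v in leff]
            r = pearson(x, log_kf)
            if best is None or r > best[0]:
                best = (r, P, l1)
    return best


# Synthetic placeholder chains (L, L_H, N_H); NOT experimental data.
toy = [(30, 20, 2), (60, 10, 1), (90, 45, 4),
       (120, 0, 0), (200, 80, 6), (300, 150, 9)]
# Synthetic "rates" built from an exact log law with l1 = 3, so the scan
# should recover P = 0 and l1 = 3 by construction.
toy_log_kf = [12.4 - 5.7 * math.log10(L - L_H + 3 * N_H) for L, L_H, N_H in toy]

powers = [i / 10 for i in range(8)]   # 0.0, 0.1, ..., 0.7
print(scan_power_and_l1(toy, toy_log_kf, powers, [1, 2, 3, 4]))
```

With real, noisy data the maximum is expected to be much flatter than in this toy case, so the position of the optimum should be read together with its statistical error.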
Fig. 1 presents the results for the case when the secondary structure is predicted by psipred. One can see that the correlation between log(kf) and –(Leff)^P exceeds 80% and is nearly the same (within small statistical errors) for all P values below 0.7 (cf. ref. 23). The maximal correlation, 82 ± 4%, is formally achieved at P = 0.1 and l1 = 3, but the whole region P = 0.0–0.5 and l1 = 1–4 is actually equally good. The insensitivity to l1 shows that, in effect, only the helical content is important.
Fig. 1.
Correlation between the logarithm of protein folding rate in water and the value (L – LH + 3×NH)^P at P = 0.1 [the number of helical residues (LH) and the number of helices (NH) are predicted for each protein from its sequence by using psipred; L is the total number of chain residues]. •, Two-state proteins; □, multistate proteins; ▵, short artificial peptides (α-helix and β-hairpin) without tertiary structure. The straight heavy line represents the best linear fit, log(kf, sec^–1) = 10.7 – 16.6×[(L – LH + 3×NH)^0.1 – 1]. The standard deviation between predicted and observed log(kf) is ± 1.07. From a practical point of view, the following equations, which have essentially equal predictive power, may be suggested to estimate the folding rate: log(kf, sec^–1) = 12.4 – 5.7×log(L – LH + 3×NH) (line “log,” correlation 81.8%); log(kf, sec^–1) = 10.7 – 16.6×[(L – LH + 3×NH)^0.1 – 1] (straight line “0.1,” correlation 81.8%); log(kf, sec^–1) = 8.2 – 2.4×[(L – LH + 3×NH)^0.3 – 1] (line “0.3,” correlation 81.4%); log(kf, sec^–1) = 6.6 – 0.6×[(L – LH + 3×NH)^(1/2) – 1] (line “1/2,” correlation 80.4%); and log(kf, sec^–1) = 5.7 – 0.2×[(L – LH + 3×NH)^(2/3) – 1] (line “2/3,” correlation 79.1%). It is hardly possible to distinguish the quality of these approximating equations because, as the figure shows, the divergence of even the most deviating of the above functions, log and 2/3, is much smaller than the standard deviation (≈± 1.1) of experimental points from any of them, and the correlation of log(X) and X^(2/3) over the interval covered by experimental points ([L – LH + 3×NH]min = 8 ≤ X ≤ [L – LH + 3×NH]max = 270) is very high, 96% (much higher than the correlation of experimental points with each of the approximating equations). (Inset) Correlation coefficient between log(kf) and –(L – LH + l1 × NH)^P depending on the power P (the lines are drawn for l1 values equal to 1, 2, 3, and 4; they virtually coincide). The standard error bars (which are virtually equal in all of the cases) are shown for the line with l1 = 3.
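The five approximating equations of the Fig. 1 legend can be collected in one small function. This is a sketch only: the function name `predict_log_kf` and the `form` labels are hypothetical, the coefficients are copied from the legend, and the input values in the usage lines are illustrative rather than a real protein.

```python
import math

# (intercept, slope, power) for the power-law lines of the Fig. 1 legend.
POWER_FORMS = {
    "0.1": (10.7, 16.6, 0.1),
    "0.3": (8.2, 2.4, 0.3),
    "1/2": (6.6, 0.6, 0.5),
    "2/3": (5.7, 0.2, 2.0 / 3.0),
}


def predict_log_kf(L, L_H, N_H, form="log"):
    """Estimate log10 of the in-water folding rate (kf in sec^-1).

    L, L_H, N_H: chain length, number of helical residues, number of helices.
    form: "log", "0.1", "0.3", "1/2", or "2/3" (lines of the Fig. 1 legend).
    """
    x = L - L_H + 3 * N_H            # effective length with l1 = 3
    if form == "log":
        return 12.4 - 5.7 * math.log10(x)
    intercept, slope, power = POWER_FORMS[form]
    return intercept - slope * (x ** power - 1)


# Illustrative input: a 100-residue chain with no predicted helices.
for form in ("log", "0.1", "0.3", "1/2", "2/3"):
    print(form, round(predict_log_kf(100, 0, 0, form), 2))
```

The legend states that the experimental points cover 8 ≤ L – LH + 3×NH ≤ 270, so estimates outside that interval are extrapolations. Within it, the five forms are expected to differ by less than the ≈1.1 standard deviation of the experimental points.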
Nearly the same results are obtained when the secondary structure is predicted from sequence by alb (then the correlation achieves 78 ± 5%) or extracted from 3D structures by dssp (then the correlation achieves 81 ± 4%). Thus, the folding rate prediction obtained from the amino acid sequence alone (with the help of a secondary structure prediction done by psipred) is at least not worse than the predictions done from known 3D structures.
It should be noted that, although the obtained dependence works a little better for the totality of proteins than for their subgroups taken separately, it is not very sensitive to the set of proteins used. If we exclude the two short artificial peptides (which are represented as two triangles at the left of Fig. 1 and are not true proteins, having no tertiary structure), the maximal correlation between log(kf) and –(Leff)^P for the remaining proteins is as high as 78 ± 5%. If, in addition, we exclude tryptophan synthase β2-subunit [which is represented as the rightmost square in Fig. 1 and, unlike the other proteins used in this study, is not a true single-domain protein, according to the Structural Classification of Proteins (ref. 25, http://scop.mrc-lmb.cam.ac.uk/scop)], the correlation between log(kf) and –(Leff)^P decreases by only an additional 2–3%. If we consider only the two-state proteins, the maximal correlation between log(kf) and –(Leff)^P is 74 ± 8%; for the multistate proteins alone, it is 77 ± 10%.
Some remarks in conclusion:
We did not manage to improve the predictions by consideration of β-structure, maybe because α and β contents are strongly anticorrelated (at the level of 87%), and because there is, as yet, no method to predict internally stable and therefore rapidly folding (cf. ref. 26) β-hairpins.
The suggested theory is simple, but inevitably approximate, because it does not take into account those sequence mutations that do not change the secondary structure but can sometimes change the folding rate by two orders of magnitude (27). Also, the theory does not take into account a solvent-induced change in protein stability, which can change the folding rate manyfold (21, 27, 28). Therefore, it is only natural that this theory, which predicts the in-water folding rates, makes this prediction with a precision of plus or minus an order of magnitude (cf. Fig. 1); however, this is a relatively small error against the background of the 10 orders of magnitude difference in observed protein folding rates.
α-Helices may lead to effective shortening of the folding protein chain either because some preformed helices already exist in the unfolded state of the chain [which, indeed, may be the case for in-water conditions (26)], or because the helices are rapidly formed in the course of folding. Therefore, we would like to avoid any speculations on hierarchic or nonhierarchic mechanism of protein folding that may arise from the presented results.
The presented theory reveals a high (≈80%) correlation between the folding rate and the number (Leff) of nonhelical residues in the protein chain. However, the correlation between the folding rate and the content (= Leff/L) of nonhelical residues in the protein chain is poor: only ≈25% (results are not shown). This poor correlation shows that the main determinant of the protein folding rate is the number of degrees of freedom that are to be fixed during the rate-limiting step of folding (cf. refs. 3, 4, 6, and 21–24).
Acknowledgments
We are grateful to O. V. Galzitskaya and S. O. Garbuzynskiy for help and seminal discussions. This work was supported by the Russian Academy of Sciences (Program “Physical and Chemical Biology” and Grant “Scientific Schools” 1968.2003.4), by the Russian Foundation for Basic Research, and by an International Research Scholar's Award from the Howard Hughes Medical Institute (to A.V.F.).
References
- 1. Qiu, L., Pabit, S. A., Roitberg, A. E. & Hagen, S. J. (2002) J. Am. Chem. Soc. 124, 12952–12953.
- 2. Goldberg, M. E., Semisotnov, G. V., Friguet, B., Kuwajima, K., Ptitsyn, O. B. & Sugai, S. (1990) FEBS Lett. 263, 51–56.
- 3. Galzitskaya, O. V., Ivankov, D. N. & Finkelstein, A. V. (2001) FEBS Lett. 489, 113–118.
- 4. Galzitskaya, O. V., Garbuzynskiy, S. O., Ivankov, D. N. & Finkelstein, A. V. (2003) Proteins 51, 162–166.
- 5. Plaxco, K. W., Simons, K. T. & Baker, D. (1998) J. Mol. Biol. 277, 985–994.
- 6. Ivankov, D. N., Garbuzynskiy, S. O., Alm, E., Plaxco, K. W., Baker, D. & Finkelstein, A. V. (2003) Protein Sci. 12, 2057–2062.
- 7. Mirny, L. & Shakhnovich, E. I. (2001) Annu. Rev. Biophys. Biomol. Struct. 30, 361–396.
- 8. Gong, H., Isom, D. G., Srinivasan, R. & Rose, G. D. (2003) J. Mol. Biol. 327, 1149–1154.
- 9. Shao, H., Peng, Y. & Zeng, Z.-H. (2003) Protein Pept. Lett. 10, 277–280.
- 10. Calloni, G., Taddei, N., Plaxco, K. W., Ramponi, G., Stefani, M. & Chiti, F. (2003) J. Mol. Biol. 330, 577–591.
- 11. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rogers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) Eur. J. Biochem. 80, 319–324.
- 12. Islam, S. A., Karplus, M. & Weaver, D. L. (2002) J. Mol. Biol. 318, 199–215.
- 13. Myers, J. K. & Oas, T. G. (2001) Nat. Struct. Biol. 8, 552–558.
- 14. Mayor, U., Johnson, C. M., Daggett, V. & Fersht, A. R. (2000) Proc. Natl. Acad. Sci. USA 97, 13518–13522.
- 15. Taddei, N., Chiti, F., Paoli, P., Fiaschi, T., Bucciantini, M., Stefani, M., Dobson, C. M. & Ramponi, G. (1999) Biochemistry 38, 2135–2142.
- 16. Jones, K. & Wittung-Stafshede, P. (2003) J. Am. Chem. Soc. 125, 9606–9607.
- 17. Kabsch, W. & Sander, C. (1983) Biopolymers 22, 2577–2637.
- 18. Jones, D. T. (1999) J. Mol. Biol. 292, 195–202.
- 19. Ptitsyn, O. B. & Finkelstein, A. V. (1983) Biopolymers 22, 15–25.
- 20. Thirumalai, D. (1995) J. Phys. (Orsay, Fr.) 5, 1457–1469.
- 21. Finkelstein, A. V. & Badretdinov, A. Ya. (1997) Folding Des. 2, 115–121.
- 22. Koga, N. & Takada, S. (2001) J. Mol. Biol. 313, 171–180.
- 23. Li, M. S., Klimov, D. K. & Thirumalai, D. (2003) Polymer 45, 573–579.
- 24. Gutin, A. M., Abkevich, V. I. & Shakhnovich, E. I. (1996) Phys. Rev. Lett. 77, 5433–5436.
- 25. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540.
- 26. Finkelstein, A. V. & Ptitsyn, O. B. (2002) Protein Physics: A Course of Lectures (Academic, New York), pp. 103–116.
- 27. Fersht, A. (1999) Structure and Mechanism in Protein Science: A Guide to Enzyme Catalysis and Protein Folding (Freeman, New York), pp. 540–572.
- 28. Gunasekaran, K., Eyles, S. J., Hagler, A. T. & Gierasch, L. M. (2001) Curr. Opin. Struct. Biol. 11, 83–93.