Abstract
Protein tertiary structures are known to be encoded in amino acid sequences, but the problem of structure prediction from sequence continues to be a challenge. With this question in mind, recent simulations have shown that atomic burials, as expressed by atom distances to the molecular geometrical center, are sufficiently informative for determining native conformations of small globular proteins. Here we use a simple computational experiment to estimate the amount of this required burial information and find it to be surprisingly small, actually comparable with the stringent limit imposed by sequence statistics. Atomic burials appear to satisfy, therefore, minimal requirements for a putative dominating property in the folding code because they provide an amount of information sufficiently large for structural determination but, at the same time, sufficiently small to be encodable in sequences. In a simple analogy with human communication, atomic burials could correspond to the actual “language” encoded in the amino acid “script” from which the complexity of native conformations is recovered during the folding process.
Keywords: protein folding, structure prediction, information theory, folding code
During the last two decades, a physical picture of the folding process has emerged with the advent of energy landscape theory (1–5) but, despite many recent advances, a general solution to the problem of structure prediction from sequence has remained elusive. Most attempts in this direction have assumed sequences to encode partial information about many structural properties, such as likelihood of tertiary contacts or secondary structure propensities, that could eventually be combined to provide a general predictive algorithm (6–10). An alternative scheme would assume a single (or few) conformational property to be directly encoded in sequences, resulting in a small number of sequence-dependent parameters, whereas other conformational features would arise from sequence-independent constraints. The importance of such constraints has been recently emphasized by Banavar and collaborators (11).
The amount of information provided by a putative single property dominating the code should satisfy two conditions: It should be sufficiently large for structural determination but sufficiently small for being encodable in sequences (12). The widely recognized importance of hydrophobic interactions on protein structure formation (13, 14) suggests atomic burials to constitute a natural candidate for this putative dominant property. There has been some discussion, in the simplified context of lattice models, on the possibility that intrinsically unspecific hydrophobicity could satisfy the first condition (15), including a dependence on the choice of native conformation (16, 17). Encouraging results from recent Monte Carlo simulations, on the other hand, indicate that the first condition is satisfied by atomic burials, as measured by distances from the molecular geometrical center, for small globular proteins represented by off-lattice, geometrically realistic, all-heavy-atom models (12). In the present study, we estimate the amount of this required burial information and suggest that the second condition may also be satisfied.
We perform molecular dynamics simulations of an all-heavy-atom protein model with a chosen number of natively constrained central distances in order to enforce the burial condition. Each atom i receives information about the native structure in the form of a radial force toward the center of a flat-bottomed, spherically symmetric, burial-potential well. The total distribution of central distances is divided in a chosen number L of equiprobable disjoint “layers”. Every atom in each layer is governed by the same burial potential centered at ri*±δi, whose native-dependent parameters are the layer-central distance and width, respectively. Simulations were performed for L = 10, L = 5, L = 3 and L = 2. The limiting case of exact burials (ri* = rinat, δi = 0) was also investigated. An upper bound, in bits/atom, for the provided information (18, 19) is log2(L). The actual value must be smaller because of unavoidable correlations between burials in a chain of covalently linked atoms. Viable compact conformations are enforced by native-independent additional constraints in the form of geometrically realistic covalent bonds, atomic excluded volume, side-chain chirality, and a strong penalty for backbone oxygen and nitrogen atoms to be buried unless geometrically defined hydrogen bonds are formed, independently of partners. The general procedure is described in the Materials and Methods section and summarized in Fig. 1.
Results and Discussion
The Upper Left frame of Fig. 2 shows results for the small α-helical engrailed homeodomain [Protein Data Bank (PDB) code 1ENH]. Conformations with Cα root mean square deviation, (RMSD), from the PDB file smaller than 1 Å are sampled in the majority of trajectories when exact or L = 10 burials are used. Minimal RMSDs are around 1 Å for L = 5, between 1 Å and 2 Å for L = 3, and between 2 Å and 4 Å for L = 2. Conformations clearly similar to the PDB structure are therefore sampled even for L = 2, whereas virtually native conformations are sampled for L ≥ 3. When a sufficiently large amount of information is provided, most trajectories are actually uniformly close to the native structure. This closeness is apparent from the the small average RMSD and corresponding standard deviation for exact burials and L = 10. When the amount of provided information is reduced, the behavior of different trajectories is less uniform, and RMSD averages become significantly larger than their minimal values (for example L = 3). In the low-information regime, information increase is reflected almost completely on a decrease of RMSD and almost no change in the average burial energy. In the high-information regime, on the other hand, burial-energy increase is the major response to additional provided information.
An apparent correlation between RMSD and average burial energy suggests that a simple-selection criterium based on burial energy could eventually improve structural predictions. For the small α-helical engrailed homeodomain, this expectation is corroborated by the Lower Left frame of Fig. 2. This frame compares the average RMSD over all conformations in each group, independent of trajectory, to the same quantity restricted to the subgroup with the lowest 5 % burial energy. The lowest-energy subgroups reflect more closely the general behavior of minimal RMSD values. This is true even if the actual global RMSD minimum is not in this subgroup. For L = 3, for example, although the overall average lies between the overall averages for L = 5 and L = 2, the lowest energy subgroup and minimal RMSD are very similar to the corresponding values for L = 5 and much smaller than for L = 2. These results suggest that the minimal amount of provided information required to recover identifiable native-like conformations lies somewhere between L = 2 and L = 3. The Right frames of Fig. 2 show analogous results, with exactly the same parameters for all native-independent constraints, for the α+β monomeric version of the cro factor (PDB code 1ORC). These additional results suggest that this surprisingly small amount of required information is not an exclusive property of simple α-helical domains. In negative controls shown in Fig. 3 for both proteins, with all atoms subjected to identical burial potentials shaped as a single, central well with width ranging from 15–18 Å, average and minimal RMSDs never go below 10 Å and 5 Å, respectively. Furthermore, as shown in the same figure, no specific structural information appears to be provided by the sequence-specific excluded volume interactions present in these simulations, as indicated by similar RMSDs for 1ENH and 1ORC trajectories with respect both to 1ENH and 1ORC native structures.
In order to compare our results with the information of amino acid sequences, we estimate the actual amount of provided burial information by its corresponding Shannon entropy (18, 19) for different numbers of layers, HL(B), in bits/residue. Because burial correlations between covalently linked atoms are present, smaller values than the maximal HLmax(B) = log2(L) bits/atom can be anticipated. Furthermore, an expected increase in the correlation range with protein globule size suggests a possible dependence of provided information on chain length. Fig. 4 shows entropy estimates obtained from burial frequencies in a set of small (between 50 and 100 residues) globular proteins. In this kind of analysis [e.g. Brenner and collaborators (20)], we compute entropies for “fragments”, or “blocks”, from sequences of burials of linearly connected atoms and estimate the entropy per atom from the resulting dependence on block size. The entropy per backbone atom is computed from burials of linearly connected backbone atoms. Nitrogens, α-carbons, and carbonyl carbons are considered indistinguishable. Assuming correlations along the backbone direction to be on average similar to correlations along side-chain directions, we then estimate the entropy per residue simply by multiplying the backbone value by the average number of atoms per residue.
Block entropies for L = 2, L = 3, and L = 5 are shown in Fig. 4 for our set of small proteins. An appropriate chain size range varying between 50 and 100 residues was used. Measuring the slope of the linear fits to regions with intermediate block sizes, we determine backbone entropies to be H2(B) ≈ 0.4 and H3(B) ≈ 0.6 bits/atom, or H2(B) ≈ 3.1 and H3(B) ≈ 4.7 bits/residue. These values can now be compared with the entropy of protein sequences H(Q) ≈ 4.2 bits/residue (20, 21). It must be noted that the actual amount of structural information obtainable from sequences is not given by sequence entropy itself but by the somewhat smaller mutual information between sequences and structures (18, 19). Although this difference has not been estimated, evidence for its significance comes from several empirical observations. For example, many sequences fold into the same 3D structure (22, 23), and also, foldable chains can be constructed from reduced amino acid alphabets (24–26). Even considering a reasonable, more restrictive value—like something around 3 bits/residues—together with unavoidable uncertainties on all estimates, burial and sequence entropies are found to be strikingly similar. Furthermore, our values for required burial information must be considered as upper estimates because future improvements on native-independent model constraints cannot be ruled out.
These surprising results not only confirm that native conformations are recoverable from burial information alone, but they also indicate that the required amount of information could be sufficiently small to be encoded by the linear sequence of amino acids. Other conformational properties, such as secondary structures or even pairwise contact interactions, would arise as a consequence of crucial but sequence-independent constraints. The resulting connection between burials and sequence suggests a simple analogy with human communication. Atomic burials could correspond to the actual “language” directly encoded in the “script” of amino acids. Specific atomic burials would constitute the “literature” transmitted by sequences whereas other conformational properties would arise from the “grammar” governing the language. The practical prediction problem could in principle be solved by a sequence-dependent, probably knowledge-based, potential capable of “reading” burial patterns in the amino acid script. Sequence-independent constraints, similar to the native-independent terms used in the present study, would convert burials into a 3D structure, “speaking” the protein folding burial language.
Materials and Methods
We perform standard molecular dynamics simulations of an all-heavy-atom protein model. Native-independent standard constraints are enforced by harmonic potentials on distances, angles, rigid dihedrals (in peptide bonds and aromatic rings), and side-chain chirality. Minima positions are taken from an extended conformation generated with program MOLMOL (27), by using standard residue and peptide geometries. The side-chain chiral angle is always constrained to 2.5 rad. Excluded volumes arise from a repulsive potential for distances smaller than 2.5 Å between any pair of atoms except for Cβ and backbone O atoms, in which case a larger distance of 3.0 Å is used. All simulations were performed with an adaptation of the molecular dynamics program previously used with structure-based models (28). Nonstandard terms, detailed below, correspond to the native-dependent atomic burial potential and to additional native-independent constraints resulting from hydrogen bond formation.
Native-Dependent Atomic Burial Potential.
Different atoms i contribute to the potential energy function with a simple burial term that depends on its central distance, Eibur(r), attaining its minimal value of zero everywhere inside the 2δi long interval (ri*−δi,ri* + δi) and increasing linearly outside this interval with slope ±ki, except for small quadratic sections of length δq required to maintain differentiability at every point, including r = 0:
with atom-dependent r1,…,r5,a1,…,a5,b1,b2,b5 ≥ 0 defined in terms of ri*, δi, ki and δq. Usually, therefore, ri*−r3 = r4 − ri* = δi, r1 = r3 − r2 = r5 − r4 = δq and a2 = a5 = ki, and remaining terms are obtained from continuity and differentiability requirements. For sufficiently small values of ri*, appropriate modifications of r1, r2, r3, and a2 might also be necessary. In the present study, we divide the total distribution of central distances in a chosen number L of equiprobable disjoint “layers” and assign to every atom in each layer the same ri* and δi values, given respectively by the central position and half-width of the layer. We use L = 10, L = 5, L = 3, and L = 2 layers, in addition to the limiting cases of L = 1, corresponding to simple compaction with no native-dependent burial information, and of exact burial (ri* = rinat and δi = 0), which is the continuous function corresponding to the burial potential of ref. 12. ki for all atoms gradually increases during the simulation from 0 to 1, in energy units per angstrom, where the energy unit is defined by the constant temperature of the simulation, except for L = 1, in which case ki was kept constant at 1. We have used δq = 0.5 Å. Fig. 1 shows the three final burial-energy wells used for simulations with three layers, L = 3, superimposed to the actual burial distribution.
Native-Independent Hydrogen Bonds.
For each of the N1 × N2 combinations of N1 possible donors and N2 possible acceptors, we have a putative hydrogen bond quantified by the product of three Fermi functions
with F(α) = 1/(1 + exp(βα(α−μα))), which changes abruptly but continuousy from 1 to 0 as any of three controlling variables exceeds a specific threshold. These three controlling variables are computed from the coordinates of the following five atoms: the acceptor carbonyl oxygen, the donor nitrogen, the two atoms adjacent to this nitrogen, and the carbon adjacent to the acceptor oxygen, in this order. These coordinates define three convenient vectors: , , and . In terms of these vectors, is the norm of , η is the angle between and , or , and θ is the angle between and or . In the present study, the energetic contributions of the donor and acceptor in a given hydrogen bond is allowed to depend explicitly on their coordinates , in addition to λ, as
In the present scheme, therefore, hydrogen bond formation does not result in a simple decrease in energy because it is only within a region around the center, defined by the inflection position of the Fermi function F(r), that it is unfavorable for potential donors and acceptors not to form hydrogen bonds. The resulting continuous dependence on local geometry and global burial is similar in spirit to previous discrete environment-dependent hydrogen bond models (12, 29), which are motivated by the increased probability for exposed groups of hydrogen bond formation with the solvent. We have used ɛhb = 10, μr = 13 Å and βr = 10 Å −1, with the resulting penalty function shown in Fig. 1 superimposed to the actual burial-distribution and burial-potential wells. All backbone nitrogen and oxygen atoms were considered possible hydrogen bond donors and acceptors, respectively. The same parameters were used for all putative bonds: ɛh = 3 Å, βh = 100 Å −1, ɛη = 0.5 rad, βη = 100 rad −1, ɛθ = 0.6 rad, and βθ = 100 rad −1.
Computation of Burial Entropies.
Shannon entropies for burial blocks of size N, or N-blocks, of linearly connected backbone atoms for a given number of layers L, HL(BN) are computed from the basic Shannon entropy equation (18, 19):
where the probabilities PL(BN) of different N -blocks, for given number of layers L, are estimated from their frequencies in the data bank. Frequencies were computed in a set of 392 small globular proteins, with sizes varying between N = 50 and N = 100 residues, derived from a recent release (November 2008) of the PDBSLECT database (30). Globular structures were selected from the condition Å (31). Because the probability for any burial level at each position in the N-block is simply 1/L, statistical independence between burials at different positions would result in equal block probabilities of (1/L)N, and the block entropy would reduce to its maximal value Nlog2(L). The entropy per atom or entropy density, HL(B), formally corresponds to the limit of the ratio between block entropy and block length as length increases (18, 19):
In the present study, this ratio is estimated from the slope of HL(BN) as a function of N for intermediate values of N, as shown in Fig. 4, to avoid artifacts from the finite size of the data bank on probability and entropy estimates for large N. Correlations between burials, as expected in a chain of connected atoms, are reflected in smaller block entropies and reduced entropy density when compared with the maximal value corresponding to uncorrelated burials, or HLmax(B) = log2(L) bits/atom.
Acknowledgments.
Work at the Center for Theoretical Biological Physics was sponsored by National Science Foundation Grant PHY-0822283 with additional support from National Science Foundation Grant MCB-0543906. This study was developed during A.F.P.d.A.'s visiting period to the Center for Theoretical Biological Physics, with partial support from the Conselho Nacional de Pesquisa.
Footnotes
The authors declare no conflict of interest.
References
- 1.Bryngelson JD, Onuchic J, Socci ND, Wolynes PG. Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins Struct Funct Genet. 1995;21:167–195. doi: 10.1002/prot.340210302. [DOI] [PubMed] [Google Scholar]
- 2.Wolynes PG, Onuchic JN, Thirumalai D. Navigating the folding routes. Science. 1995;267:1619–1620. doi: 10.1126/science.7886447. [DOI] [PubMed] [Google Scholar]
- 3.Onuchic JN, Wolynes PG. Theory of protein folding. Curr Opin Struct Biol. 2004;14:70–75. doi: 10.1016/j.sbi.2004.01.009. [DOI] [PubMed] [Google Scholar]
- 4.Shakhnovich E. Protein folding thermodynamics and dynamics: Where physics, chemistry, and biology meet. Chem Rev. 2006;106:1559–1588. doi: 10.1021/cr040425u. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Dill KA, Ozcan SB, Shell MS, Weikl TR. The protein folding problem. Annu Rev Biophys. 2008;37:289–316. doi: 10.1146/annurev.biophys.37.092707.153558. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Goldstein RA, Luthey–Schulten ZA, Wolynes PG. Optimal protein-folding codes from spin-glass theory. Proc Natl Acad Sci USA. 1992;89:4918–4922. doi: 10.1073/pnas.89.11.4918. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Hardin C, Eastwood MP, Prentiss MC, Luthey–Schulten Z, Wolynes PG. Associative memory Hamiltonians for structure prediction without homology: α/β proteins. Proc Natl Acad Sci USA. 2003;100:1679–1684. doi: 10.1073/pnas.252753899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Rohl C, Strauss CE, Misura KM, Baker D. Protein structure prediction using Rosetta. Methods Enzymol. 2004;383:66–93. doi: 10.1016/S0076-6879(04)83004-0. [DOI] [PubMed] [Google Scholar]
- 9.Liwo A, Khalili M, Scheraga HA. Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains. Proc Natl Acad Sci USA. 2005;102:2362–2367. doi: 10.1073/pnas.0408885102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Yang JS, Chen WW, Skolnick J, Shakhnovich EI. All-atom ab initio folding of a diverse set of proteins. Structure. 2007;15:53–63. doi: 10.1016/j.str.2006.11.010. [DOI] [PubMed] [Google Scholar]
- 11.Hoang TX, Trovato A, Seno F, Banavar JR, Maritan A. Geometry and symmetry presculpt the free-energy landscape of proteins. Proc Natl Acad Sci USA. 2004;101:7960–7964. doi: 10.1073/pnas.0402525101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Pereira de Araújo AF, Gomes ALC, Bursztyn AA, Shakhnovich EI. Native atomic burials, supplemented by physically motivated hydrogen bond constraints, contain sufficient information to determine the tertiary structure of small globular proteins. Proteins Struct Funct Bioinf. 2008;70:971–983. doi: 10.1002/prot.21571. [DOI] [PubMed] [Google Scholar]
- 13.Kauzmann W. Some factors in the interpretation of protein denaturation. Adv Prot Chem. 1959;14:1–63. doi: 10.1016/s0065-3233(08)60608-7. [DOI] [PubMed] [Google Scholar]
- 14.Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29:7133–7155. doi: 10.1021/bi00483a001. [DOI] [PubMed] [Google Scholar]
- 15.Yue K, et al. A test of lattice protein folding algorithms. Proc Natl Acad Sci USA. 1995;92:325–329. doi: 10.1073/pnas.92.1.325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Pereira de Araújo AF. Folding protein models with a simple hydrophobic energy function: The fundamental importance of monomer inside/outside segregation. Proc Natl Acad Sci USA. 1999;96:12482–12487. doi: 10.1073/pnas.96.22.12482. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Garcia LG, Treptow WL, Pereira de Araújo AF. Folding simulations of a three-dimensional protein model with a non-specific hydrophobic energy function. Phys Rev E. 2001;64:011912. doi: 10.1103/PhysRevE.64.011912. [DOI] [PubMed] [Google Scholar]
- 18.Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27:379–623. [Google Scholar]
- 19.Cover TM, Thomas JA. Elements of Information Theory. Hoboken, NJ: Wiley; 1991. [Google Scholar]
- 20.Crooks GE, Brenner SE. Protein structure prediction: Entropy, correlations and prediction. Bioinformatics. 2004;20:1603–1611. doi: 10.1093/bioinformatics/bth132. [DOI] [PubMed] [Google Scholar]
- 21.Weiss O, Jimenez-Montano M, Herzel H. Information content of protein sequences. J Theor Biol. 2000;206:379–386. doi: 10.1006/jtbi.2000.2138. [DOI] [PubMed] [Google Scholar]
- 22.Koehl P, Levitt M. Protein topology and stability define the space of allowed sequences. Proc Natl Acad Sci USA. 2002;99:1280–1285. doi: 10.1073/pnas.032405199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Larson SM, England JL, Desjarlais JR, Pande VS. Throughly sampling sequence space: Large-scale protein design of structural ensembles. Protein Sci. 2002;11:2804–2813. doi: 10.1110/ps.0203902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Riddle DS, et al. Functional rapidly folding proteins from simplified amino acid sequences. Nat Struct Biol. 1997;4:805–809. doi: 10.1038/nsb1097-805. [DOI] [PubMed] [Google Scholar]
- 25.Wolynes PG. As simple as can be? Nat Struct Biol. 1997;4:871–874. doi: 10.1038/nsb1197-871. [DOI] [PubMed] [Google Scholar]
- 26.Dokholyan NV. What is the protein design alphabet? Proteins Struct Funct Bioinf. 2004;54:622–628. doi: 10.1002/prot.10633. [DOI] [PubMed] [Google Scholar]
- 27.Koradi R, Billeter M, Wüthrich K. Molmol: A program for display and analysis of macromolecular structures. J Mol Graphics. 1996;14:29–32. doi: 10.1016/0263-7855(96)00009-4. [DOI] [PubMed] [Google Scholar]
- 28.Whitford PC, Miyashita O, Levy Y, Onuchic JN. Conformational transitions of adenylate kinase: Switching by cracking. J Mol Biol. 2007;366:1661–1671. doi: 10.1016/j.jmb.2006.11.085. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Ding F, Tsao D, Nie H, Dokholyan NV. Ab initio folding of proteins with all-atom discrete molecular dynamics. Structure. 2008;16:1010–1018. doi: 10.1016/j.str.2008.03.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Hobohm U, Scharf M, Schneider R, Sander C. Selection of representative protein data sets. Protein Sci. 1992;1:409–417. doi: 10.1002/pro.5560010313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Gomes ALC, de Rezende JR, Pereira de Araújo AF, Shakhnovich EI. Description of atomic burials in compact globular proteins by Fermi-Dirac probability distributions. Proteins Struct Funct Bioinf. 2007;66:304–320. doi: 10.1002/prot.21137. [DOI] [PubMed] [Google Scholar]