Significance
Most proteins must be folded to perform their function, which typically involves binding another molecule. This property enables protein traits of stable folding and strong binding to emerge through adaptation even when the trait itself does not increase the organism’s fitness. Such traits, known as evolutionary “spandrels,” have been investigated across many organisms and scales of biology. Here we show that proteins can spontaneously evolve strong but nonfunctional binding interactions. Moreover, evolution of such interactions can be highly unpredictable, with chance mutation events determining the outcome. In contrast, evolution of functional interactions follows a more predictable pattern of first gaining folding stability and then losing it as the new binding function improves.
Keywords: protein evolution, evolutionary spandrels, fitness landscapes, protein interactions, folding stability
Abstract
Binding interactions between proteins and other molecules mediate numerous cellular processes, including metabolism, signaling, and gene regulation. These interactions often evolve in response to changes in the protein’s chemical or physical environment (such as the addition of an antibiotic). Several recent studies have shown the importance of folding stability in constraining protein evolution. Here we investigate how structural coupling between folding and binding—the fact that most proteins can only bind their targets when folded—gives rise to an evolutionary coupling between the traits of folding stability and binding strength. Using a biophysical and evolutionary model, we show how these protein traits can emerge as evolutionary “spandrels” even if they do not confer an intrinsic fitness advantage. In particular, proteins can evolve strong binding interactions that have no functional role but merely serve to stabilize the protein if its misfolding is deleterious. Furthermore, such proteins may have divergent fates, evolving to bind or not bind their targets depending on random mutational events. These observations may explain the abundance of apparently nonfunctional interactions among proteins observed in high-throughput assays. In contrast, for proteins with both functional binding and deleterious misfolding, evolution may be highly predictable at the level of biophysical traits: adaptive paths are tightly constrained to first gain extra folding stability and then partially lose it as the new binding function is developed. These findings have important consequences for our understanding of how natural and engineered proteins evolve under selective pressure.
Proteins carry out a diverse array of chemical and mechanical functions in the cell, ranging from metabolism to gene expression (1). Thus, proteins serve as central targets for natural selection in wild populations, as well as a key toolbox for engineering novel molecules with medical and industrial applications (2, 3). Most proteins must fold into their native state, a unique 3D conformation, to perform their function, which typically involves binding a target molecule such as a small ligand, DNA, or another protein (1). Misfolded proteins not only fail to perform their function but also may form toxic aggregates and divert valuable protein synthesis and quality control resources (4–7). It is therefore imperative that the folded state be stable against typical thermal fluctuations. However, biophysical experiments and computational studies reveal that most random mutations in proteins destabilize the folded state (8, 9), including mutations that improve function (9–11). As a result, many natural proteins tend to be only marginally stable, mutationally teetering at the brink of unfolding (12, 13). With proteins in such a precarious position, how can they evolve new functions while maintaining sufficient folding stability?
Directed evolution experiments have offered a window into the dynamics of this process (2, 3), indicating the importance of compensatory mutations, limited epistasis, and mutational robustness. Theoretical efforts to describe protein evolution in biophysical terms have focused on evolvability (14), distributions of protein stabilities and evolutionary rates (13, 15–17), and global properties of protein interaction networks (18, 19). However, a subtle but key property of proteins has not been explored in this context: structural coupling of folding and binding (the fact that folding is required for function) implies evolutionary coupling of folding stability and binding strength. Thus, selection acting directly on only one of these traits may produce apparent, indirect selection for the other. The importance of this effect was popularized by Gould and Lewontin in their influential paper on evolutionary “spandrels” (20), defined as traits that evolve as byproducts in the absence of direct selection. Since then the importance of spandrels and coupled traits has been explored in many areas of evolutionary biology (21), including various molecular examples (12, 22, 23).
How do coupled traits affect protein evolution? We consider a biophysical model that describes evolution of a new binding interaction in response to a change in the protein’s chemical or physical environment, including availability and concentrations of various ligands (24, 25), or in response to artificial selection in a directed evolution experiment (3). We postulate a fitness landscape as a function of two biophysical traits: folding stability and binding affinity to a target molecule. We then use an exact numerical algorithm (26, 27) to characterize adaptive paths on this fitness landscape, focusing on how coupled protein traits affect evolutionary fates, epistasis, and predictability.
Results
Model of Protein Energetics.
We consider a protein with two-state folding kinetics (1). In the folded state, the protein has an interface that binds a target molecule. Assuming thermodynamic equilibrium (valid when protein folding and binding are faster than typical cellular processes), the probabilities of the protein’s structural states are given by their Boltzmann weights:
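$$
p_{\text{folded,bound}} = \frac{e^{-\beta(E_f + E_b)}}{Z}, \qquad
p_{\text{folded,unbound}} = \frac{e^{-\beta E_f}}{Z}, \qquad
p_{\text{unfolded,unbound}} = \frac{1}{Z}. \tag{1}
$$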
Here β is the inverse temperature, $E_f$ is the free energy of folding (also known as $\Delta G$), and $E_b = E_b' - \mu$, where $E_b'$ is the binding free energy and μ is the chemical potential of the target molecule. For simplicity we will refer to $E_b$ as the binding energy. The partition function is $Z = e^{-\beta(E_f + E_b)} + e^{-\beta E_f} + 1$. Note that $E_f < 0$ for intrinsically stable proteins and $E_b < 0$ for favorable binding interactions; binding can thus be made more favorable either by improving the intrinsic affinity between the protein and the target (decreasing $E_b'$) or by increasing the concentration of the target molecule (increasing μ). Because the protein can bind only when it is folded, the binding and folding processes are structurally coupled. This gives rise to binding-mediated stability: strong binding stabilizes a folded protein, such that even intrinsically unstable proteins ($E_f > 0$) may attain their folded and bound state if binding is strong enough ($E_f + E_b < 0$). Different physical mechanisms are possible for the kinetics of the structural coupling (28, 29), but in thermodynamic equilibrium only the free energy differences matter.
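As a minimal sketch of the three-state thermodynamics described above, the following Python function evaluates the Boltzmann weights of the folded-and-bound, folded-but-unbound, and unfolded states, taking the unfolded, unbound state as the zero of energy. The function name and the default β (1/kT with kT of roughly 0.593 kcal/mol near room temperature) are illustrative assumptions, not values taken from the model's parameterization.

```python
import math

def state_probabilities(e_f, e_b, beta=1.0 / 0.593):
    """Return (p_folded_bound, p_folded_unbound, p_unfolded_unbound).

    e_f: folding free energy (kcal/mol); negative means intrinsically stable.
    e_b: binding energy (kcal/mol); negative means favorable binding.
    beta: inverse temperature in (kcal/mol)^-1 (assumed room-temperature value).
    """
    w_fb = math.exp(-beta * (e_f + e_b))  # folded and bound
    w_fu = math.exp(-beta * e_f)          # folded, unbound
    w_uu = 1.0                            # unfolded, unbound (reference state)
    z = w_fb + w_fu + w_uu                # partition function
    return w_fb / z, w_fu / z, w_uu / z

# Binding-mediated stability: an intrinsically unstable protein (e_f > 0)
# is still mostly folded and bound when binding is strong enough.
rescued = state_probabilities(e_f=2.0, e_b=-6.0)
# The same protein with unfavorable binding is mostly unfolded.
unrescued = state_probabilities(e_f=2.0, e_b=6.0)
```

The two calls differ only in the binding energy, which illustrates why folding and binding cannot be treated as independent traits in this model.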
The folding and binding energies depend on the protein’s genotype (amino acid sequence) σ. We assume that adaptation only affects “hotspot” residues at the binding interface (30, 31); the rest of the protein does not change on relevant time scales because it is assumed to be already optimized for folding. We consider L hotspot residues which, to a first approximation, make additive contributions to the total folding and binding free energies (32) (SI Materials and Methods):
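$$
E_f(\sigma) = E_f^{\text{ref}} + \sum_{i=1}^{L} \epsilon_f(i, \sigma^i), \qquad
E_b(\sigma) = E_b^{\min} + \sum_{i=1}^{L} \epsilon_b(i, \sigma^i), \tag{2}
$$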
where $\epsilon_f(i, \sigma^i)$ and $\epsilon_b(i, \sigma^i)$ capture the energetic contributions of amino acid $\sigma^i$ at position i (equivalent to $\Delta\Delta G$). The reference energy $E_f^{\text{ref}}$ is the fixed contribution to the folding energy from all other residues in the protein, and the parameter $E_b^{\min}$ is the minimum binding energy among all genotypes (Materials and Methods). Amino acid energies $\epsilon_f$ and $\epsilon_b$ are randomly sampled from distributions constructed using available data and other biophysical considerations (Materials and Methods); we will consider properties of the model averaged over these distributions. The exact shape of the distributions is not important for large enough L due to the central limit theorem.
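The additive genotype-to-energy map can be sketched in a few lines of Python. The function and variable names, the two-site, two-letter example, and all numerical values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not parameters of the model.

```python
def genotype_energies(sigma, eps_f, eps_b, e_f_ref, e_b_min):
    """Additive folding and binding energies of genotype sigma.

    sigma: tuple of amino acid indices, one per hotspot position.
    eps_f, eps_b: per-position lists of per-amino-acid energy contributions.
    e_f_ref: fixed folding contribution from the rest of the protein.
    e_b_min: minimum binding energy over all genotypes.
    """
    e_f = e_f_ref + sum(eps_f[i][a] for i, a in enumerate(sigma))
    e_b = e_b_min + sum(eps_b[i][a] for i, a in enumerate(sigma))
    return e_f, e_b

# Hypothetical energy matrices for L = 2 positions and k = 2 amino acids.
eps_f = [[0.0, 1.5], [0.0, -0.5]]  # reference genotype is (0, 0)
eps_b = [[0.0, 2.0], [1.0, 0.0]]   # best-binding genotype is (0, 1)

e_f, e_b = genotype_energies((1, 1), eps_f, eps_b, e_f_ref=-3.0, e_b_min=-5.0)
best_binding = genotype_energies((0, 1), eps_f, eps_b, e_f_ref=-3.0, e_b_min=-5.0)
```

Because every contribution to the binding energy is non-negative and the best-binding genotype has zero contribution at every site, its binding energy equals `e_b_min` by construction.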
Fitness Landscape.
We construct a fitness landscape based on the molecular traits $E_f$ and $E_b$. Without loss of generality, we assume that the protein contributes fitness 1 to the organism if it is always folded and bound. Let $f_{ub}, f_{uf} \in [0, 1]$ be the multiplicative fitness penalties for being unbound and unfolded, respectively: the total fitness is $f_{ub}$ if the protein is unbound but folded, and $f_{ub} f_{uf}$ if the protein is both unbound and unfolded. Then the fitness of the protein averaged over all three possible structural states in Eq. 1 is given by
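$$
\mathcal{F}(E_f, E_b) = p_{\text{folded,bound}} + f_{ub}\, p_{\text{folded,unbound}} + f_{ub} f_{uf}\, p_{\text{unfolded,unbound}}
= \frac{e^{-\beta(E_f + E_b)} + f_{ub}\, e^{-\beta E_f} + f_{ub} f_{uf}}{e^{-\beta(E_f + E_b)} + e^{-\beta E_f} + 1}. \tag{3}
$$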
This fitness landscape is divided into three nearly flat plateaus corresponding to the three protein states of Eq. 1, separated by steep thresholds corresponding to the folding and binding transitions (Fig. 1A). The heights of the plateaus are determined by the values of $f_{ub}$ and $f_{uf}$, leading to three qualitative regimes of the global landscape structure (Fig. 1 B–D).
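The plateau structure can be checked numerically with a short sketch. It assumes the fitness is the average over the three structural states with weights 1 (folded and bound), `f_ub` (folded, unbound), and `f_ub * f_uf` (unfolded), as described in the text; the function name, the default β, and the example penalties are illustrative assumptions.

```python
import math

def fitness(e_f, e_b, f_ub, f_uf, beta=1.0 / 0.593):
    """State-averaged fitness for folding energy e_f and binding energy e_b.

    f_ub: multiplicative penalty for being unbound (0 <= f_ub <= 1).
    f_uf: multiplicative penalty for being unfolded (0 <= f_uf <= 1).
    beta: inverse temperature in (kcal/mol)^-1 (assumed room-temperature value).
    """
    w_fb = math.exp(-beta * (e_f + e_b))  # folded and bound, fitness 1
    w_fu = math.exp(-beta * e_f)          # folded, unbound, fitness f_ub
    z = w_fb + w_fu + 1.0                 # unfolded state has weight 1
    return (w_fb + f_ub * w_fu + f_ub * f_uf) / z

# Deep inside each region of energy space the fitness sits on a plateau.
f_ub, f_uf = 0.5, 0.2
plateau_bound = fitness(-10.0, -10.0, f_ub, f_uf)     # close to 1
plateau_unbound = fitness(-10.0, 10.0, f_ub, f_uf)    # close to f_ub
plateau_unfolded = fitness(10.0, 10.0, f_ub, f_uf)    # close to f_ub * f_uf
```

Setting `f_ub = 0` or `f_uf = 1` merges the two lower plateaus (selection for binding only), whereas `f_ub = 1` merges the two upper plateaus (selection for folding only).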
Fig. 1.
Fitness, selection, and epistasis in energy trait space. (A) Phase diagram of protein structural states. Dashed lines separate structural states of the protein corresponding to plateaus on the fitness landscape; arrows represent the folding transition (green), binding transition (red), and the coupled folding–binding transition (blue). Fitness landscapes with direct selection (B) for binding only , (C) for folding only (, ), and (D) for both binding and folding (, ). Black contours indicate constant fitness values. The contours are uniformly sampled in energy space, with unequal fitness differences between adjacent contours. Streamlines indicate the direction of the selection “force” , with color showing its magnitude (decreasing from red to blue). (E) Example genotype distribution and mutational network for and . (F) Blue arrows indicate the same mutation on different genetic backgrounds. When the fitness contours are straight, the mutation is beneficial regardless of the background ( or ). With curved contours, the same mutation can become deleterious , indicative of sign epistasis. Sign epistasis can give rise to multiple local fitness maxima (e.g., AA and BB in E).
In the first case (Fig. 1B), a protein that is perfectly folded but unbound has no fitness advantage over a protein that is both unbound and unfolded: $f_{ub} = f_{ub} f_{uf}$. Thus, selection acts directly on the binding trait only. This regime requires that either $f_{ub} = 0$ [binding is essential, e.g., in the context of conferring antibiotic resistance to the cell (24)] or $f_{uf} = 1$ (misfolded proteins are not toxic). The latter case also includes directed evolution experiments where only function is artificially selected for in vitro. In contrast, when $f_{ub} = 1$ and $f_{uf} < 1$ (Fig. 1C), a perfectly folded and bound protein has no fitness advantage over a folded but unbound protein, and thus this case entails direct selection for folding only; binding is nonfunctional because it provides no intrinsic fitness advantage. Such proteins may have other, functional binding interfaces that are already adapted to their targets. Thus, the fitness penalty $f_{uf}$ reflects both the intrinsic misfolding toxicity [e.g., due to aggregation (4–7)] and the loss of the other functional interactions. Finally, in the most general case there are distinct selection pressures on both binding and folding. This occurs when $0 < f_{ub} < 1$ and $f_{uf} < 1$, resulting in a landscape that is a hybrid of the previous two cases (Fig. 1D).
It is straightforward to generalize our three-state model to proteins with additional structural states (such as metastable partially folded configurations) and allow for simultaneous adaptation at multiple binding interfaces. Furthermore, the fitness landscape in Eq. 3 can be made an arbitrary nonlinear function of state probabilities. However, these more complex scenarios would still share the essential features of our basic model: coupling between folding and binding traits and sharp fitness thresholds between bound/unbound and folded/unfolded states. Thus, our qualitative conclusions do not depend on the specific model in Eq. 3.
Epistasis and Local Fitness Maxima.
For protein sequences of length L with an alphabet of size k, each of the $k^L$ possible genotypes is projected onto the 2D trait space of $E_f$ and $E_b$ (Eq. 2) and connected to immediate mutational neighbors, forming a network of states that the population must traverse (a simple example is shown in Fig. 1E). Adaptive dynamics are determined by the interplay between the fitness function (Eq. 3) and the distribution of genotypes in trait space (Eq. 2).
This interplay gives rise to the possibility of epistasis and multiple local fitness maxima. For the energy traits our model is nonepistatic (Eq. 2). At the level of fitness, magnitude epistasis is widespread owing to the nonlinear dependence of fitness on folding and binding energies (Eq. 3). Sign epistasis in fitness cannot occur when the fitness contours in energy space are straight parallel lines (Fig. 1F). However, curved fitness contours, which occur near folding or binding thresholds in our model (Fig. 1 B–D), can produce sign epistasis, giving rise to multiple local fitness maxima in the genotype space (Fig. 1E).
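The link between sign epistasis and multiple local maxima can be illustrated with a brute-force search over a small genotype space. This is only a sketch: the helper names and the two toy fitness functions (one additive, one with sign epistasis on a two-site, two-letter space) are hypothetical and are not derived from the energy model.

```python
from itertools import product

def mutational_neighbors(sigma, k):
    """Yield all genotypes that differ from sigma at exactly one site."""
    for i, current in enumerate(sigma):
        for a in range(k):
            if a != current:
                yield sigma[:i] + (a,) + sigma[i + 1:]

def local_maxima(fitness_of, L, k):
    """Return genotypes whose fitness exceeds that of every single-site mutant."""
    maxima = []
    for sigma in product(range(k), repeat=L):
        f = fitness_of(sigma)
        if all(fitness_of(n) < f for n in mutational_neighbors(sigma, k)):
            maxima.append(sigma)
    return maxima

def additive(sigma):
    # No epistasis: each mutation to allele 1 adds the same fitness increment.
    return 1.0 + 0.1 * sum(sigma)

EPISTATIC_TABLE = {(0, 0): 1.0, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}

def epistatic(sigma):
    # Sign epistasis: each single mutant is worse than both (0, 0) and (1, 1).
    return EPISTATIC_TABLE[sigma]

single_peak = local_maxima(additive, L=2, k=2)
two_peaks = local_maxima(epistatic, L=2, k=2)
```

In the additive case the search finds one maximum, whereas the toy landscape with sign epistasis has two, mirroring the pair of maxima labeled AA and BB in Fig. 1E.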
Evolutionary Dynamics.
We assume that a population encoding the protein of interest evolves in the monomorphic limit: , where L is the number of residues, N is an effective population size, and u is the per-residue probability of mutation per generation (33). In this limit, the entire population has the same genotype at any given time, and the rate of substitution from the current genotype to one of its mutational neighbors is given by Eq. S1 in SI Materials and Methods. We use the strong-selection limit of the substitution rate (Eq. S2), in which the effective population size enters only as an overall time scale. We verify the validity of this assumption in SI Materials and Methods (Fig. S1). In this regime, deleterious mutations never fix and adaptive paths have a finite number of steps, terminating at a global or local fitness maximum. For compact genomic units such as proteins, the monomorphic condition is generally met in multicellular species, although it may be violated in rapidly mutating unicellular organisms (34, 35). Sequential fixation of single mutants is also a typical mode of adaptation in directed evolution experiments (3). For simplicity, we neglect more complex mutational dynamics such as indels and recombination.
Quantitative Description of Adaptation.
Although our model can describe many adaptive scenarios, for concreteness we focus on a specific but widely applicable case. A population begins as perfectly adapted to binding an original target molecule characterized by an energy matrix with minimum binding energy (defining a fitness landscape ). The population is then subjected to a selection pressure that favors binding a new target, with energy matrix and minimum binding energy (fitness landscape ). For simplicity we assume that the original binding target is replaced by a new one with uncorrelated binding energies, although it is straightforward to extend our results to the case where the original target simply changes concentration and hence chemical potential ( but ). The adaptive paths are first-passage paths leading from the global maximum on to a local or global maximum on , with fitness increasing monotonically along each path.
Each adaptive path φ with probability $\Pi[\varphi]$ is a sequence of genotypes connecting initial and final states. Using an exact numerical algorithm (26, 27) (SI Materials and Methods), we determine the path length distribution $\rho(\ell)$, which gives the probability of taking an adaptive path with $\ell$ total amino acid substitutions, and the mean adaptation time $\bar{t}$. We also consider $S_{\text{path}}$, the entropy of the adaptive paths:
$$
S_{\text{path}} = -\sum_{\varphi} \Pi[\varphi] \log \Pi[\varphi]. \tag{4}
$$
The path entropy is maximized when evolution is neutral, resulting in all paths of a given length being accessible and equally likely. In that case (27), where is the mean path length.
We also consider the path density $\psi(\sigma)$, which gives the total probability of reaching a state σ at any point along a path. When σ is a final state (a local fitness maximum on the new fitness landscape), the path density is the commitment probability. We calculate the entropy of the commitment probabilities as
$$
S_{\text{com}} = -\sum_{\text{final states } \sigma} \psi(\sigma) \log \psi(\sigma). \tag{5}
$$
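Both entropies are Shannon entropies of a probability distribution, over whole paths in one case and over final states in the other. A minimal sketch in Python, with a hypothetical function name and made-up commitment probabilities, is:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution.

    Zero-probability entries contribute nothing and are skipped.
    """
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Hypothetical commitment probabilities for landscapes with two final states.
evenly_split = shannon_entropy([0.5, 0.5])    # equals log 2
strongly_biased = shannon_entropy([0.95, 0.05])
single_outcome = shannon_entropy([1.0])       # fully predictable outcome
```

For m final states the entropy is bounded above by log m, which is reached only when all final states are equally likely; values well below that bound indicate that one maximum dominates.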
Direct Selection for Binding Only.
We first focus on the case in Eq. 3. The geometry of the fitness contours is invariant under overall shifts in the binding energy (Fig. 1B); equivalently, the direction (but not the magnitude) of the selection force does not depend on . Thus, without loss of generality, we set in this section.
The contours of constant fitness are parallel to the $E_f$ axis when $E_f$ is low, indicating that, as expected, selection acts only on binding when proteins are sufficiently stable. However, when stability is marginal [which describes most natural proteins (12, 13)], the fitness contours begin to curve downward, indicating indirect selection for folding, even though selection acts directly only on the binding trait. Thus, adaptation will produce a trait (more stability) that is neutral at the level of the fitness function simply because it is coupled with another trait (binding) that is under selection. Folding stability can therefore be considered an evolutionary spandrel (20) caused by structural coupling between folding and binding. For intrinsically unstable proteins ($E_f > 0$), binding-mediated stability causes the fitness contours to approach diagonal lines: selection effectively acts to improve both binding and folding equally (Fig. 1B).
An example realization of evolutionary dynamics in the marginally stable regime is shown in Fig. 2 A and B (see Fig. S2 for stable and intrinsically unstable examples and Fig. S3 for distributions of initial, intermediate, and final states averaged over multiple landscape realizations). There are typically just one or two fitness maxima; all maxima are usually accessible (Fig. 2C). For stable proteins, the global maximum almost always coincides with the best-binding genotype and is about as far as a randomly chosen genotype from the best-folding genotype (Fig. 2D; two random genotypes are separated by a per-residue Hamming distance of $1 - 1/k = 0.8$ for an alphabet of size $k = 5$). However, as the protein becomes less stable, the average distance between the maxima and the best-binding genotype increases, while the average distance between the maxima and the best-folding genotype decreases, until they meet halfway for intrinsically unstable proteins (Fig. 2D). In general the maxima lie on or near the Pareto front (36) (Fig. 2A and Fig. S2), defined here as the set of genotypes such that either $E_f$ or $E_b$ cannot be decreased further without increasing the other (the global maximum is always on the front, whereas local maxima may not be).
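The Pareto front defined above can be extracted from a cloud of genotype energies with a simple quadratic-time filter. The sketch below uses a hypothetical function name and made-up energy pairs; it treats lower folding energy and lower binding energy as the two objectives.

```python
def pareto_front(points):
    """Return the (e_f, e_b) pairs not dominated by any other pair.

    A point is dominated if another point is no worse in both energies
    and strictly lower in at least one of them.
    """
    front = []
    for i, (e_f, e_b) in enumerate(points):
        dominated = any(
            other_f <= e_f and other_b <= e_b and (other_f < e_f or other_b < e_b)
            for j, (other_f, other_b) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((e_f, e_b))
    return sorted(front)

# Hypothetical genotype energies in kcal/mol.
energies = [(-3.0, 0.0), (-2.0, -1.0), (-1.0, -3.0), (-1.0, -1.0), (0.0, 0.0)]
front = pareto_front(energies)
```

The filter only identifies which genotypes are candidates for a trade-off between the two energies; which of them are actual fitness maxima still depends on the shape of the fitness function.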
Fig. 2.
Adaptation with direct selection for binding only. (A) Distribution of folding and binding energies for all genotypes (small gray points) in a single realization of the model with a marginally stable protein ( kcal/mol). The black star indicates the initial state for adaptation (global maximum on ); red triangles indicate local fitness maxima on , shaded according to their commitment probabilities ; and blue crosses show best-folding and best-binding genotypes. The magenta line connects genotypes on the Pareto front, and black contours indicate constant fitness . (B) The region of energy space accessible to adaptive paths, zoomed in from A. Example paths are shown in blue and green; black circles indicate intermediate states along paths, sized proportional to their path density ; small gray circles are genotypes inaccessible to adaptation. (C) Average number m of local fitness maxima (solid, green) and average number of local maxima accessible to adaptation (dashed, blue) versus . (D) Average per-residue Hamming distance between the maxima and the best-folding genotype (; solid, green) and the best-binding genotype (; dashed, blue) versus . (E) Average distributions of path lengths (number of substitutions) for stable proteins ( kcal/mol; solid, green), marginally stable proteins ( kcal/mol; dashed, blue), and intrinsically unstable proteins ( kcal/mol; dotted, red). (F) Average per-substitution path entropy (solid, green) and average entropy of commitment probabilities (dashed, blue) versus . Averages are taken over realizations of the energy matrices (, , ) in E and realizations otherwise. In all panels, and kcal/mol.
As protein stability decreases ( becomes higher), the average distance between initial and final states for adaptation decreases. As a result the mean path length (number of substitutions) decreases as well, although the variance of path lengths is relatively constant over all energies (Fig. 2E). The path entropy per substitution also decreases with , reflecting greater constraints on adaptive paths (note that for neutral evolution with and ). Finally, in the marginally stable regime (Fig. 2F). Because the average number of maxima is in this regime (Fig. 2C), the maximum value of is roughly , indicating that multiple maxima are usually not equally accessible.
Direct Selection for Folding Only.
In this regime, $f_{ub} = 1$ and $f_{uf} < 1$ in Eq. 3, which means that the binding interaction under consideration is nonfunctional. By analogy with the previous case, the geometry of the fitness contours, and thus most landscape properties, is now independent of overall shifts in the folding energy (Fig. 1C).
When the nonfunctional binding is weak, the fitness contours are parallel to the $E_b$ axis, indicating that selection acts only on folding (Fig. 1C). This regime yields a single fitness maximum due to the lack of sign epistasis; the maximum predominantly coincides with the best-folding genotype (Fig. 3A). However, with increasing binding strength the fitness contours curve such that the effective selection force attempts to improve both binding and folding equally, due to binding-mediated stability. Thus, binding emerges as an evolutionary spandrel in this case. There is also an increased likelihood of multiple local maxima, located between the best-folding and best-binding genotypes (Fig. 3A).
Fig. 3.
Adaptation with direct selection for folding only. (A) The average number of local maxima m (solid, green) and their average per-residue Hamming distances from the best-folding (; dashed, blue) and the best-binding (; dotted, red) genotypes versus . (B) Probability that adaptation occurs (i.e., the initial state is not coincident with any of the final states) as a function of and . (C) Example landscape with divergent binding fates: there are two accessible local maxima, one with (favorable binding, ) and the other with (negligible binding, ). All symbols are the same as in Fig. 2B. (D) Average distribution of local maxima weighted by their commitment probabilities. In C and D, kcal/mol. Averages are taken over realizations of the energy matrices in D and realizations otherwise. In all panels, , , and kcal/mol.
Depending on the abundance of the old and new ligands in the cell and their binding properties, several adaptive scenarios may take place. First, the best-binding strengths and of the old and new targets may be similar in magnitude. If both are weak, initial and final states are likely to be the best-folding genotype or close to it (Fig. 3A); in this case, there is a high probability that no adaptation will occur (Fig. 3B). When and are both low, adaptation usually occurs to accommodate the binding specificity of the new ligand (Fig. 3B and Fig. S4A). Surprisingly, we see that proteins frequently evolve stronger binding at the expense of folding (Fig. S4A, Bottom). This happens due to the constraints of the genotype–phenotype map: not enough genotypes are available to optimize both traits simultaneously.
It is also possible to gain or lose a nonfunctional binding interface through adaptation. In the first case, the new target is more abundant or has stronger binding than the old one . Thus, the initial state is the best-folding genotype or close to it, and the protein adapts toward a genotype with intermediate folding and binding (Fig. S4B). As before, adaptation is tightly constrained by the genotype–phenotype map, sacrificing the trait (folding stability) under direct selection to affect the spandrel (nonfunctional binding interaction). Effectively, the protein switches from being a self-reliant folder to needing a binding partner to stabilize folding. In the second case (), the dynamics is opposite: the protein loses its nonfunctional binding interface and becomes self-reliant (Fig. S4C). Thus, proteins may acquire or lose binding interfaces depending on the availability of ligands that can participate in binding-mediated stability.
Divergent Evolutionary Fates.
Near $E_b = 0$ the selection streamlines diverge in Fig. 1C, creating the possibility of multiple local maxima with at least one having negative $E_b$ (strong binding) and at least one having positive $E_b$ (negligible binding); Fig. 3C shows an example. Such a protein has two qualitatively different fates available to it: one in which it evolves to bind the target and another in which it does not. The eventual fate of the protein is determined by random mutation events, making evolution inherently unpredictable. Indeed, the distribution of final states can be strongly bimodal (Fig. 3D), and there is a sizable probability of divergent fates across a range of binding energies (Fig. S5).
Direct Selection for Both Binding and Folding.
Here we consider the most general case in which $0 < f_{ub} < 1$ and $f_{uf} < 1$ in Eq. 3 (Fig. 1D). The fitness landscape is divided into two regions by a straight diagonal contour with fitness $f_{ub}$ and slope $-1$. Below this contour, the landscape is qualitatively similar to the case of selection for binding only (Fig. 1B), whereas above the contour the landscape resembles that of the folding-only selection scenario (Fig. 1C). Thus, evolutionary dynamics for proteins with favorable binding and folding energies will largely resemble the case of selection for binding only. However, a qualitatively different behavior will occur if the distribution of genotypes straddles the diagonal contour (Fig. 4). This happens when folding stability is marginal and initial binding is unfavorable. In this case, selection streamlines around the diagonal contour (Fig. 1D) and the genotype–phenotype map tightly constrain the adaptive paths to gain extra folding stability first, and then lose it as the binding function is improved (Fig. 4).
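The straight diagonal contour can be verified with a short calculation. The derivation below assumes that the fitness in Eq. 3 is the state-probability-weighted average with plateau heights 1, $f_{ub}$, and $f_{ub} f_{uf}$, written as $\mathcal{F}$, and that $0 < f_{ub} < 1$ and $f_{uf} < 1$ so that the logarithm is well defined.

```latex
\begin{align*}
\mathcal{F}(E_f, E_b) = f_{ub}
&\;\Longleftrightarrow\;
  e^{-\beta(E_f+E_b)} + f_{ub}\,e^{-\beta E_f} + f_{ub} f_{uf}
  = f_{ub}\left(e^{-\beta(E_f+E_b)} + e^{-\beta E_f} + 1\right) \\
&\;\Longleftrightarrow\;
  (1-f_{ub})\,e^{-\beta(E_f+E_b)} = f_{ub}\,(1-f_{uf}) \\
&\;\Longleftrightarrow\;
  E_f + E_b = -\frac{1}{\beta}\,\ln\frac{f_{ub}\,(1-f_{uf})}{1-f_{ub}} .
\end{align*}
```

The terms proportional to $e^{-\beta E_f}$ cancel only at the fitness value $f_{ub}$, so this is the one contour on which $E_f + E_b$ is constant, that is, a straight line of slope $-1$ in the $(E_f, E_b)$ plane.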
Fig. 4.
Adaptation with direct selection for both binding and folding. (A and B) Example distribution of folding and binding energies for a marginally stable and marginally bound protein; all symbols are the same as in Fig. 2 A and B. (C) Distribution of initial states in green, intermediate states in blue (weighted by their path densities), and final states in red (weighted by their commitment probabilities) for realizations of the energy matrices. In all panels, , , and kcal/mol.
Tempo and Rhythm of Adaptation.
The strength of selection is the primary determinant of the average adaptation time . If the selection coefficient s is small (but ), the substitution rate in Eq. S1 is proportional to s. Thus, as selection becomes exponentially weaker for lower energies (Fig. S1), adaptation becomes exponentially slower. The distribution of the total adaptation time over an adaptive path is highly nonuniform. For example, in the case of selection for binding only and a marginally stable protein, the adaptation time is concentrated at the end of the path, one mutation away from the final state (Fig. S6 A and B). Substitutions at the beginning of the path occur quickly because there are many possible beneficial substitutions and because selection is strong; in contrast, at the end of the path adaptation slows down dramatically as beneficial mutations are depleted and selection weakens. This behavior is observed in most of the other model regimes as well.
The exception to this pattern occurs in the case of selection for both binding and folding in marginally stable and marginally bound proteins, owing to the unique contour geometry (Figs. 1D and 4). As the adaptive paths wrap around the diagonal contour in the region of high and low , the landscape flattens, making selection weaker and substitutions slower (Fig. S6C). Thus, most of the time is spent in the middle of the path rather than the end (Fig. S6D). Adaptation accelerates toward the end of the path as the strength of selection increases again. If the intermediate slowdown is significant enough, a protein may not have time to complete the second half of its path before environmental conditions change, so that it will never evolve the new binding function.
Discussion
Protein Folding and Binding As Evolutionary Spandrels.
In the decades since Gould and Lewontin’s paper (20), evolutionary spandrels have become a central concept in evolutionary biology. There are many possible scenarios in which spandrels can emerge through evolution (20, 21), although two key mechanisms are indirect selection (arising from coupled traits) and neutral processes (such as genetic drift and biases in mutation and recombination) (37). Here we have focused on the former, which we expect to be more important on short time scales.
Previous studies have argued that the marginal stability of most proteins may be an evolutionary spandrel that evolved due to mutation–selection balance (3, 12, 13). We suggest more broadly that having folding stability at all may be a spandrel for proteins with no misfolding toxicity. Even more striking is the possibility that some binding interactions may be spandrels that evolved solely to stabilize protein structures. Previously the role of binding-mediated stability has been mainly discussed in the context of intrinsically disordered proteins (28, 29) and therapies for protein misfolding diseases (38); here we show that it is a general phenomenon with significant implications for our interpretation of data on proteome-wide interactions (39). In particular, our results suggest that less stable proteins should have a greater number of nonfunctional interactions. Protein stability was previously argued to correlate positively with abundance to explain the observed negative correlation of abundance with evolutionary rate (16, 17), whereas models of protein–protein interaction networks imply that protein abundance also correlates negatively with the intrinsic number of interactions (e.g., as measured by surface hydrophobicity) (19, 40). Together these observations argue that stability should indeed be negatively correlated with the number of interactions (41). Moreover, the possibility of evolving new binding interactions solely to stabilize protein structure suggests how chaperones and protein quality control machinery may have first evolved (42).
Pareto Optimization of Proteins.
The Pareto front is a useful concept in problems of multiobjective optimization (36). The Pareto front in our model consists of the protein sequences along the low-$E_f$, low-$E_b$ edge of the genotype distribution (e.g., see Fig. 2A). Pareto optimization assumes that all states on the front are valid final states for adaptation; however, nonlinear fitness functions with saturation effects (as expected from biophysical considerations, e.g., Eq. 3) will confound this assumption. Our model shows how this nonlinearity leads to a small subset of true final states on or even off the front. Thus, Pareto optimization does not capture a key feature of the underlying biophysics, providing only a rough approximation to the true dynamics.
Epistasis and Evolutionary Predictability.
Our results also shed light on the role of epistasis (the correlated effects of mutations at different sites) in protein evolution. Epistasis underlies the ruggedness of fitness landscapes (43, 44) and determines the predictability of evolution, an issue of paramount importance in biology (24, 45, 46). In most cases considered here, limited sign epistasis gives rise to less-predictable intermediate pathways (high $S_{\text{path}}$) but highly predictable final outcomes (low $S_{\text{com}}$). However, there are two major exceptions to this pattern. First, proteins with a nonfunctional binding interaction may have multiple local maxima, some with strong binding and others with weak binding (Fig. 3 C and D). Here both the intermediate pathways and the final states are unpredictable: pure chance, in the form of random mutations, drives the population to one binding fate or the other. The second exception occurs in proteins with direct selection for both binding and folding. Here there is usually a single maximum, but the adaptive paths are tightly constrained in energy space (Fig. 4), making protein evolution highly predictable at the level of biophysical traits.
Materials and Methods
Energetics of Protein Folding and Binding.
Folding energetics are probed experimentally and computationally by measuring the changes in folding free energy resulting from point mutations of a reference sequence. We choose an arbitrary genotype as a reference and set its folding energy matrix entries to zero at all sites, such that its total folding energy equals the reference value. Then we sample the remaining entries of the folding energy matrix from a Gaussian distribution with mean 1.25 kcal/mol and standard deviation 1.6 kcal/mol, consistent with available data showing these energies to be universally distributed over many proteins (8). We also choose an arbitrary genotype to have the minimum binding energy: its binding energy matrix entries are zero at all sites, and its total binding energy equals the minimum value. Because binding hotspot residues typically have a 1–3 kcal/mol penalty for mutations away from the wild-type amino acid (30, 31), we sample the other entries of the binding energy matrix from an exponential distribution defined over a restricted range of energies, with mean 2 kcal/mol. This distribution is consistent with alanine-scanning experiments that probe energetics of amino acids at the binding interface (47). Note that the two binding energy matrices are sampled independently in all figures. We consider a fixed number of hotspot residues and a reduced alphabet of five amino acid classes (negative, positive, polar, hydrophobic, and other); the number of genotypes is the alphabet size raised to the power of the number of hotspot residues. Our population genetics model and the algorithm for exact calculation of adaptive path statistics are described in SI Materials and Methods.
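The sampling procedure described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' implementation: the additive energy model (total energy as a sum of per-site matrix entries), the lower bound `BIND_MIN` of the exponential distribution, the number of sites `n_sites`, and all function and variable names are hypothetical choices made here. Only the Gaussian parameters, the exponential mean, and the five amino acid classes come from the text.

```python
import random

FOLD_MEAN, FOLD_SD = 1.25, 1.6  # kcal/mol, Gaussian parameters quoted in the text
BIND_MEAN = 2.0                 # kcal/mol, exponential mean quoted in the text
BIND_MIN = 1.0                  # ASSUMPTION: lower bound of the exponential's range


def sample_energy_matrix(n_sites, n_types, zero_genotype, draw):
    """Build an n_sites x n_types matrix; entries of zero_genotype are 0, others drawn."""
    return [[0.0 if a == zero_genotype[i] else draw() for a in range(n_types)]
            for i in range(n_sites)]


def total_energy(eps, genotype, offset=0.0):
    """ASSUMPTION: additive model, energy = offset + sum of per-site entries."""
    return offset + sum(eps[i][a] for i, a in enumerate(genotype))


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n_sites, n_types = 4, 5  # n_sites is hypothetical; 5 matches the listed classes

# Arbitrary reference genotype (folding) and minimum-energy genotype (binding).
ref_fold = [rng.randrange(n_types) for _ in range(n_sites)]
best_bind = [rng.randrange(n_types) for _ in range(n_sites)]

eps_f = sample_energy_matrix(
    n_sites, n_types, ref_fold, lambda: rng.gauss(FOLD_MEAN, FOLD_SD))
# Shifted exponential: lower bound BIND_MIN, overall mean BIND_MEAN.
eps_b = sample_energy_matrix(
    n_sites, n_types, best_bind,
    lambda: BIND_MIN + rng.expovariate(1.0 / (BIND_MEAN - BIND_MIN)))

n_genotypes = n_types ** n_sites  # alphabet size to the power of the number of sites
print(n_genotypes, total_energy(eps_f, ref_fold), total_energy(eps_b, best_bind))
```

Because every non-reference binding entry is positive under these assumptions, the chosen genotype `best_bind` is guaranteed to have the lowest total binding energy, mirroring the construction in the text. Folding entries, by contrast, can be negative, so mutations away from the reference may stabilize or destabilize the protein.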
Supplementary Material
Acknowledgments
A.V.M. acknowledges support from an Alfred P. Sloan Research Fellowship.
Footnotes
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1415895112/-/DCSupplemental.
References
- 1. Creighton TE. Proteins: Structures and Molecular Properties. Freeman; New York: 1992.
- 2. Campbell RE, et al. A monomeric red fluorescent protein. Proc Natl Acad Sci USA. 2002;99(12):7877–7882. doi: 10.1073/pnas.082243699.
- 3. Bloom JD, Arnold FH. In the light of directed evolution: Pathways of adaptive protein evolution. Proc Natl Acad Sci USA. 2009;106(Suppl 1):9995–10000. doi: 10.1073/pnas.0901522106.
- 4. Bucciantini M, et al. Inherent toxicity of aggregates implies a common mechanism for protein misfolding diseases. Nature. 2002;416(6880):507–511. doi: 10.1038/416507a.
- 5. Chiti F, Stefani M, Taddei N, Ramponi G, Dobson CM. Rationalization of the effects of mutations on peptide and protein aggregation rates. Nature. 2003;424(6950):805–808. doi: 10.1038/nature01891.
- 6. Drummond DA, Wilke CO. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell. 2008;134(2):341–352. doi: 10.1016/j.cell.2008.05.042.
- 7. Geiler-Samerotte KA, et al. Misfolded proteins impose a dosage-dependent fitness cost and trigger a cytosolic unfolded protein response in yeast. Proc Natl Acad Sci USA. 2011;108(2):680–685. doi: 10.1073/pnas.1017570108.
- 8. Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. The stability effects of protein mutations appear to be universally distributed. J Mol Biol. 2007;369(5):1318–1332. doi: 10.1016/j.jmb.2007.03.069.
- 9. Tokuriki N, Stricher F, Serrano L, Tawfik DS. How protein stability and new functions trade off. PLOS Comput Biol. 2008;4(2):e1000002. doi: 10.1371/journal.pcbi.1000002.
- 10. Wang X, Minasov G, Shoichet BK. Evolution of an antibiotic resistance enzyme constrained by stability and activity trade-offs. J Mol Biol. 2002;320(1):85–95. doi: 10.1016/S0022-2836(02)00400-X.
- 11. Sun SB, et al. Mutational analysis of 48G7 reveals that somatic hypermutation affects both antibody stability and binding affinity. J Am Chem Soc. 2013;135(27):9980–9983. doi: 10.1021/ja402927u. [Retracted]
- 12. Taverna DM, Goldstein RA. Why are proteins marginally stable? Proteins. 2002;46(1):105–109. doi: 10.1002/prot.10016.
- 13. Zeldovich KB, Chen P, Shakhnovich EI. Protein stability imposes limits on organism complexity and speed of molecular evolution. Proc Natl Acad Sci USA. 2007;104(41):16152–16157. doi: 10.1073/pnas.0705366104.
- 14. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. Protein stability promotes evolvability. Proc Natl Acad Sci USA. 2006;103(15):5869–5874. doi: 10.1073/pnas.0510098103.
- 15. DePristo MA, Weinreich DM, Hartl DL. Missense meanderings in sequence space: A biophysical view of protein evolution. Nat Rev Genet. 2005;6(9):678–687. doi: 10.1038/nrg1672.
- 16. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA. 2005;102(40):14338–14343. doi: 10.1073/pnas.0504070102.
- 17. Serohijos AWR, Rimas Z, Shakhnovich EI. Protein biophysics explains why highly abundant proteins evolve slowly. Cell Reports. 2012;2(2):249–256. doi: 10.1016/j.celrep.2012.06.022.
- 18. Johnson ME, Hummer G. Nonspecific binding limits the number of proteins in a cell and shapes their interaction networks. Proc Natl Acad Sci USA. 2011;108(2):603–608. doi: 10.1073/pnas.1010954108.
- 19. Heo M, Maslov S, Shakhnovich E. Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions. Proc Natl Acad Sci USA. 2011;108(10):4258–4263. doi: 10.1073/pnas.1009392108.
- 20. Gould SJ, Lewontin RC. The spandrels of San Marco and the Panglossian paradigm: A critique of the adaptationist programme. Proc R Soc Lond B Biol Sci. 1979;205(1161):581–598. doi: 10.1098/rspb.1979.0086.
- 21. Pigliucci I, Kaplan I. The fall and rise of Dr Pangloss: Adaptationism and the Spandrels paper 20 years later. Trends Ecol Evol. 2000;15(2):66–70. doi: 10.1016/s0169-5347(99)01762-0.
- 22. Weiss MA, et al. Protein structure and the spandrels of San Marco: Insulin’s receptor-binding surface is buttressed by an invariant leucine essential for its stability. Biochemistry. 2002;41(3):809–819. doi: 10.1021/bi011839+.
- 23. Barrett RDH, Hoekstra HE. Molecular spandrels: Tests of adaptation at the genetic level. Nat Rev Genet. 2011;12(11):767–780. doi: 10.1038/nrg3015.
- 24. Weinreich DM, Delaney NF, Depristo MA, Hartl DL. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science. 2006;312(5770):111–114. doi: 10.1126/science.1123539.
- 25. Chou HH, Chiu HC, Delaney NF, Segrè D, Marx CJ. Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science. 2011;332(6034):1190–1192. doi: 10.1126/science.1203799.
- 26. Manhart M, Morozov AV. Path-based approach to random walks on networks characterizes how proteins evolve new functions. Phys Rev Lett. 2013;111(8):088102. doi: 10.1103/PhysRevLett.111.088102.
- 27. Manhart M, Morozov AV. In: First-Passage Phenomena and Their Applications. Metzler R, Oshanin G, Redner S, editors. World Scientific; Singapore: 2014.
- 28. Wright PE, Dyson HJ. Linking folding and binding. Curr Opin Struct Biol. 2009;19(1):31–38. doi: 10.1016/j.sbi.2008.12.003.
- 29. Habchi J, Tompa P, Longhi S, Uversky VN. Introducing protein intrinsic disorder. Chem Rev. 2014;114(13):6561–6588. doi: 10.1021/cr400514h.
- 30. Clackson T, Wells JA. A hot spot of binding energy in a hormone-receptor interface. Science. 1995;267(5196):383–386. doi: 10.1126/science.7529940.
- 31. Moreira IS, Fernandes PA, Ramos MJ. Hot spots—a review of the protein-protein interface determinant amino-acid residues. Proteins. 2007;68(4):803–812. doi: 10.1002/prot.21396.
- 32. Wells JA. Additivity of mutational effects in proteins. Biochemistry. 1990;29(37):8509–8517. doi: 10.1021/bi00489a001.
- 33. Champagnat N. A microscopic interpretation for adaptive dynamics trait substitution sequence models. Stoch Proc Appl. 2006;116(8):1127–1160.
- 34. Lynch M. The Origins of Genome Architecture. Sinauer; Sunderland, MA: 2007.
- 35. Charlesworth B. Fundamental concepts in genetics: Effective population size and patterns of molecular evolution and variation. Nat Rev Genet. 2009;10(3):195–205. doi: 10.1038/nrg2526.
- 36. Shoval O, et al. Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science. 2012;336(6085):1157–1160. doi: 10.1126/science.1217405.
- 37. Lynch M. The evolution of genetic networks by non-adaptive processes. Nat Rev Genet. 2007;8(10):803–813. doi: 10.1038/nrg2192.
- 38. Rodrigues JV, Henriques BJ, Lucas TG, Gomes CM. Cofactors and metabolites as protein folding helpers in metabolic diseases. Curr Top Med Chem. 2012;12(22):2546–2559. doi: 10.2174/1568026611212220009.
- 39. Stark C, et al. BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 2006;34(Database issue):D535–D539. doi: 10.1093/nar/gkj109.
- 40. Yang JR, Liao BY, Zhuang SM, Zhang J. Protein misinteraction avoidance causes highly expressed proteins to evolve slowly. Proc Natl Acad Sci USA. 2012;109(14):E831–E840. doi: 10.1073/pnas.1117408109.
- 41. Dixit PD, Maslov S. Evolutionary capacitance and control of protein stability in protein-protein interaction networks. PLOS Comput Biol. 2013;9(4):e1003023. doi: 10.1371/journal.pcbi.1003023.
- 42. Hartl FU, Bracher A, Hayer-Hartl M. Molecular chaperones in protein folding and proteostasis. Nature. 2011;475(7356):324–332. doi: 10.1038/nature10317.
- 43. Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ. Empirical fitness landscapes reveal accessible evolutionary paths. Nature. 2007;445(7126):383–386. doi: 10.1038/nature05451.
- 44. Szendro IG, Schenk MF, Franke J, Krug J, de Visser JA. Quantitative analyses of empirical fitness landscapes. J Stat Mech. 2013;2013:P01005.
- 45. Gould SJ. Wonderful Life: The Burgess Shale and the Nature of History. Norton; New York: 1990.
- 46. Lobkovsky AE, Koonin EV. Replaying the tape of life: Quantification of the predictability of evolution. Front Genet. 2012;3:246. doi: 10.3389/fgene.2012.00246.
- 47. Thorn KS, Bogan AA. ASEdb: A database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001;17(3):284–285. doi: 10.1093/bioinformatics/17.3.284.