Skip to main content
Genes logoLink to Genes
. 2018 Aug 21;9(9):423. doi: 10.3390/genes9090423

Selection for Protein Stability Enriches for Epistatic Interactions

Anna Posfai 1, Juannan Zhou 1, Joshua B Plotkin 2, Justin B Kinney 1, David M McCandlish 1,*
PMCID: PMC6162820  PMID: 30134605

Abstract

A now classical argument for the marginal thermodynamic stability of proteins explains the distribution of observed protein stabilities as a consequence of an entropic pull in protein sequence space. In particular, most sequences that are sufficiently stable to fold will have stabilities near the folding threshold. Here, we extend this argument to consider its predictions for epistatic interactions for the effects of mutations on the free energy of folding. Although there is abundant evidence to indicate that the effects of mutations on the free energy of folding are nearly additive and conserved over evolutionary time, we show that these observations are compatible with the hypothesis that a non-additive contribution to the folding free energy is essential for observed proteins to maintain their native structure. In particular, through both simulations and analytical results, we show that even very small departures from additivity are sufficient to drive this effect.

Keywords: thermodynamic stability, epistasis, molecular evolution, purifying selection

1. Introduction

The relationship between protein sequence, stability, and function has been a subject of intense investigation for decades. A combination of biophysical and evolutionary models and, more recently, high-throughput mutagenesis experiments have dramatically advanced our understanding of this complex relationship [1,2,3,4,5]. A consensus view has emerged on some aspects of protein function and evolution—e.g., what accounts for the distribution of thermodynamic stabilities observed in nature. However, other questions—e.g., whether genetic interactions play a dominant or minor role in protein sequence evolution—remain actively debated, with apparently contradictory empirical and theoretical evidence [1,2,3,4,5].

A nuanced appreciation of the high-dimensional nature of protein sequence space has been essential for resolving questions about protein structure, function, and evolution. The observation that naturally occurring proteins are only marginally, as opposed to maximally, stable was first interpreted as an adaptive feature to permit increased protein flexibility and functionality [6]. However, with some exceptions [7], this view has been largely replaced with a more parsimonious explanation based on the high dimensionality of sequence space: Marginal stabilities are observed because, simply, far more sequences are marginally stable than maximally stable [8]. Essential to the development of this explanation was the concept of sequence entropy [9,10]—the idea that the sheer number of protein sequences that map to a given phenotype exerts a strong entropic pull on the distribution of observed phenotypes in viable proteins [11,12,13]. The field today has mostly settled on a synthetic understanding of how simple biophysical models of energy and folding, along with the structure of sequence space, conspire to explain the distribution of protein stabilities observed in nature [1,2,3,4,5].

In contrast to observed stabilities, the role of epistasis in protein evolution and function remains a topic of active debate with unresolved ambiguities. The same biophysical models that can parsimoniously explain observed distributions of stabilities have been reported to show either only weak context-dependence of mutational effects on stability [14], or, alternatively, have been reported to show very strong context-dependence of mutational effects [13,15,16]. Likewise, experimental studies on the fitness effects of mutations in divergent sequence backgrounds have reported very weak epistasis [14], whereas comparative analysis of divergent proteins has implicated an overriding role for epistasis in shaping sequence evolution [17]. How are we to resolve this significant discrepancy about the role of epistasis for protein stability and sequence evolution?

In this paper, we address this discrepancy by analyzing simple models for the relationship between amino acid sequence and the Gibbs free energy (ΔG) of folding. Under selection to maintain a minimum degree of stability, these models predict distributions of folding energies that are roughly consistent with those observed in nature. Moreover, the models predict very weak interactions between pairs of mutations. These predictions are consistent with biophysical measurements of nearly additive mutational effects on stability [18], and with reports of consistent effects over both short [14] and long [19] evolutionary timescales. However, at the same time, we show that a non-additive contribution to the folding free energy is essential for allowing proteins to fold stably in our model, for reasons attributable to sequence entropy. These results may help to resolve striking discrepancies in the literature on the importance of epistasis for protein stability and evolution [1,2,3,4,5].

2. Methods

Simulations

We consider two simple models for the relationship between amino acid sequences and ΔG of folding. All amino acid sequences considered are of length l=400. Both of these models have the form:

ΔG=ΔGadd+ΔGepi, (1)

where ΔGadd describes the independent and additive contributions from each amino acid at each position, while ΔGepi describes contribution of interactions between positions.

First, we consider a model where epistasis arises solely from pairwise interactions between sites. For example, ΔGadd in such a model might account for the independent energetic effects of individual residues being removed from the aqueous environment upon protein folding, whereas ΔGepi might account for physical interactions between pairs of residues in the folded conformation. For this model, we assume that the additive effect on stability for each possible amino acid in each position in the primary sequence is drawn from a Gaussian distribution with mean μadd and variance σadd2. In addition, we allow pairwise interactions between amino acid sites, with the magnitudes of these interactions drawn from a Gaussian distribution with zero mean and variance σepi2. Moreover, these pairwise interactions are specified in such a way that the pairwise interaction terms have no impact on the average effect of any given amino acid substitution, so that the magnitudes of the additive and epistatic effects can be modified independently. That is, the model is equivalent to a “random field model” from the fitness landscape literature [20,21] where the only non-zero terms are the constant, linear, and pairwise interaction terms. See Appendix A for details on the mathematical features and practical implementation of this model.

Second, we consider a model where epistasis is modeled as a random deviation from additivity drawn independently for each genotype, meaning each sequence of amino acids. This model is similar to the “rough Mount Fuji” model of fitness landscapes [22,23], and might arise if ΔGepi accounts for highly cooperative energetic contributions, e.g., resulting from global changes in the folded protein conformation. In this case, we again draw the additive effect of each amino acid in each position from a Gaussian distribution with mean μadd and variance σadd2, but in addition the folding energy of each genotype is perturbed by an independent draw from a zero-mean Gaussian with variance σHOC2 (where HOC denotes “house of cards”, since this component is completely uncorrelated between mutationally adjacent genotypes, similar to the house of cards model of fitness landscapes [24]). Because protein sequence space is too large to store in computer memory, we implement a hashing scheme so that, in the simulations, these epistatic effects remain consistent for previously observed genotypes, but are drawn anew for genotypes that have not yet been encountered.

The simulations of protein sequence evolution under selection are based on a threshold model for thermodynamic stability: proteins with a negative ΔG of folding are deemed viable and all other sequences are deemed inviable. At each step in the simulation, a random position in the protein sequence is chosen and changed to a random alternative amino acid. This new sequence is accepted if it is viable and rejected if it is inviable; if the mutation is rejected, then the protein sequence remains unaltered for that step of the simulation. These simulations are initialized at the sequence predicted to be most stable based on its additive effects and allowed to equilibrate for 5000 proposed mutations, a time sufficient for the distribution of folding stabilities to become approximately stationary for the conditions considered. After this relaxation period, the simulations continue for an additional 5000 proposed mutations to produce the results shown here. All simulations and all calculations presented were implemented in Mathematica (Wolfram Research) and the corresponding Mathematica notebook is included as supplemental information.

The models analyzed here are simpler and less realistic than other commonly used models for protein evolution based on force fields [25], contact energies [26], or lattice proteins [27]. However, we employ these models because their simple structure facilitates a variety of exact and approximate analytical results, and thus provides a clearer illumination of the theoretical issues involved than the more realistic but less tractable alternatives.

3. Results

3.1. Epistasis Is Essential for Proper Folding of Evolved Sequences

We simulated the evolution of a protein of length 400 under a model where each amino acid at each position makes an additive contribution to the free energy of folding, and where in addition we allow pairwise stability interactions between sites. We imposed truncation selection for spontaneous folding so that only sequences with a negative ΔG of folding are considered viable. The parameters of the simulations were chosen to be roughly consistent with the observed distribution of folding stabilities and mutational effects on stability reported in the literature, e.g., [28,29,30,31].

Figure 1a shows the distribution of folding energies observed in these simulations after the process was allowed to reach stationarity. The mean of this distribution is only slightly negative, indicating that the evolved proteins are marginally stable, as predicted by theory and observed in nature [8,11,29,31,32]. Examining the effects of random single amino acid substitutions for sequences drawn from this distribution (Figure 1b), we observe that the mean effect is +1.15 kcal/mol with standard deviation 1.5 kcal/mol, consistent with empirical observations [28,29,30]. The distribution of energetic effects of mutations that are fixed over the course of the simulations (Figure 1c) is shifted to have approximately zero mean (−0.0007 kcal/mol) and a smaller standard deviation (1.1 kcal/mol) as observed previously [16,33,34].

Figure 1.

Figure 1

Free energy of folding, stability effects of mutations, and contribution of additive effects to folding stability under the pairwise epistatic effects model: (a) Distribution of free energy of folding for evolved sequences; (b) distribution of stability effects of random mutations, i.e., distribution of ΔΔG values for a random single mutant generated from a random evolved sequence, where ΔΔG is the change in the ΔG of folding due to the mutation; (c) distribution of stability effects of fixed mutations, i.e., distribution of ΔΔG values corresponding to two distinct neighboring sequences along the simulated trajectory; (d) stability effects of double mutations versus the sum of the stability effects of the two single mutations. 500 random double mutants are shown, R2=0.9997; (e) effects of single mutations that fixed along the trajectory in two evolved backgrounds that differ by 50% sequence divergence, R2=0.87; (f) the stability effect of a random mutation on ΔG is highly correlated with the stability effect of the mutation on the additive contribution ΔGadd, R2=0.89; (g) free energy of folding versus additive contribution to free energy of folding for evolved sequences. The additive contribution to folding is not a good indicator of the total free energy of folding (R2=0.06) and observed sequences cannot fold spontaneously based on the additive contribution alone. The dashed gray curve is derived from our bivariate normal approximation and is predicted to contain 95% of the evolved sequences. Simulations conducted under the pairwise epistasis model with μadd=1kcal/mol, σadd2=1kcal2/mol2, and σepi2=0.0003kcal2/mol2.

Interactions between mutations also have a similar magnitude to those observed in previous studies. In an evolved background, the combined mutational effect of two random mutations is nearly exactly predicted by the sum of the mutational effects of the constituent single mutations (Figure 1d, R2=0.9997). Furthermore, the effects of mutations that fix along our simulated evolutionary trajectories remain relatively consistent over time (Figure 1e), with a root mean square change of only 0.5 kcal/mol at 50% sequence divergence, consistent with the empirical measurements of Risso et al. [19], who observed a root mean square change of 0.67 kcal/mol among mutations that have fixed between two sequences at a similar level of divergence. Because we are working with simulated data, we can also assess how closely the observed effects of mutations reflect the effects of these mutations on the additive component of the folding energy. Figure 1f shows that the observed effects of random single amino acid substitutions, i.e., the ΔΔG of single amino acid substitutions, are highly correlated with the underlying additive effects of these mutations, i.e., with the ΔΔGadd of the same mutations (R2=0.89).

To summarize, our simulations are qualitatively similar to both previous empirical and theoretical investigations of long-term evolution under selection for protein folding stability, and, on the face of it, based on Figure 1d–f, they suggest that epistasis plays only a minor role in protein folding. However, when we actually compute the additive contribution to protein stability observed in our simulations, a very different picture emerges as shown in Figure 1g. Shockingly, we find that the additive contribution to folding stability is not nearly sufficient to allow spontaneous folding (mean of ΔGadd is 25.5 kcal/mol), and so epistatic interactions are required for folding of all sequences observed at stationarity. We also observe that the additive contribution to folding stability ΔGadd is almost completely uncorrelated with the actual folding stability ΔG, with R2 = 0.06. Thus, in our model, epistasis plays an essential role in protein folding, despite the fact that mutational effects can be well predicted based solely on the additive contributions of each amino acid to the sequence’s free energy of folding.

3.2. Enrichment for Epistasis Observed under the Pairwise Model, but Not the Independent Model

In order to better understand the causes of these counter-intuitive results, we considered an alternative model with an identical additive component but with epistasis modeled as an independent random draw for each sequence (from a zero-mean Gaussian distribution with variance σHOC2=0.01kcal2/mol2, see Methods). The results of these simulations are shown in Figure 2. The distribution of folding stabilities, the distribution of mutational effects, and the extent of observed epistasis observed in mutational effects are qualitatively unchanged from the results observed under the prior, pairwise interaction model (Figure 2a–f vs. Figure 1a–f). However, in this case, the paradoxical contribution of epistasis to folding stability is absent, so that the additive contribution to stability is sufficient for spontaneous folding for most evolved sequences, and the observed folding energy is highly correlated with the additive contribution (R2 = 0.99, Figure 2d). We therefore conclude that enrichment for epistasis under stabilizing selection occurs with pairwise epistatic interactions but not with fully random interactions.

Figure 2.

Figure 2

Free energy of folding, stability effects of mutations, and contribution of additive effects to folding stability under the independent epistatic effects model: (a) Distribution of free energy of folding for evolved sequences; (b) distribution of stability effects of random mutations, i.e., distribution of ΔΔG values for a random single mutant generated from a random evolved sequence; (c) distribution of stability effects of fixed mutations, i.e., distribution of ΔΔG values corresponding to two distinct neighboring sequences along the simulated trajectory; (d) stability effects of double mutations versus the sum of the stability effects of the two single mutations. 500 random double mutants are shown, R2=0.993; (e) effects of single mutations that fixed along the trajectory in two evolved backgrounds that differ by 50% sequence divergence, R2=0.98; (f) the stability effect of a random mutation on ΔG is highly correlated with the stability effect of the mutation on the additive contribution ΔGadd, R2=0.997; (g) the additive contribution to folding is a good indicator of the total free energy of folding (R2=0.99) and 95% of observed sequences can fold spontaneously based on the additive contribution alone. The dashed gray curve is derived from our bivariate normal approximation and is predicted to contain 95% of the evolved sequences. Simulations conducted under the independent epistatic effects model with μadd=1kcal/mol, σadd2=1kcal2/mol2, and σHOC2=0.01kcal2/mol2.

What explains this difference in behavior between the model with pairwise epistatic interactions and the model with independent epistatic effects for each sequence? In order to address this question, we conducted a mathematical analysis of the random field model with pairwise epistatic interactions (see Appendix B). What we came to understand was that, in this model, the amount of epistasis observed in double mutants vastly underestimates the total amount of epistasis in the energy landscape. This occurs because making a double mutant only results in changes to relatively few interaction pairs (i.e., those interaction pairs involving the site of either single mutant). However, as additional mutations are added to the sequence, more pairs are perturbed, which unleashes additional epistasis.

More precisely, in the mathematical analysis, we considered the expected magnitude of the observed epistasis as a function of the number of mutations from an arbitrary focal sequence. That is, we calculated the expected variance in the epistatic contribution among the set of all sequences at a given distance d from the focal sequence. The results are shown in Figure 3 where the variance at d=2 is set to 1, so that the variance is expressed relative to the variance observed in a double mutant analysis. We see that for small d this variance increases roughly linearly, and eventually saturates at almost 50 times the expected variance at d=2. In contrast, the independent random epistasis model is essentially constant at all positive distances. Thus, similar levels of observed epistasis in double mutants make vastly different predictions for the total amount of epistasis under the two models.

Figure 3.

Figure 3

Expected epistatic variance as a function of distance from the focal sequence for amino acid sequences of length l=400. Results for the pairwise epistasis model shown in black, results for the independent epistasis model shown in gray. All variances are normalized relative to the expected variance at Hamming distance d=2 which is set to 1. Notice that the epistatic variance at large distances is much larger than the epistatic variance at distance d=2 for the pairwise epistasis model but not for the independent epistasis model.

3.3. Bivariate Normal Approximation for the Joint Distribution of Additive and Epistatic Contributions to the Free Energy of Folding Captures Impact of Sequence Entropy

Our results on the surprising implications of small observed epistatic effects under the pairwise interaction model make the results in Figure 1g appear somewhat more plausible because more epistasis is present in the landscape than is apparent from the double mutants. However, this observation still does not provide a definite explanation for the large contribution of epistatic interactions to folding stability.

We now provide such an explanation, based on considering the fraction of random sequences that have any given pair of additive and epistatic contributions (see Appendix C for details). In particular, we assume that the distribution of additive contributions to the free energy of folding for random sequences is normally distributed with mean μ1=lμadd and variance σ12=lσadd2. We further assume that the distribution of epistatic contributions is normally distributed with mean 0 and variance σ22=l2σepi2. The additive and epistatic contributions are uncorrelated, so the total folding energy of a random sequence ΔG=ΔGadd+ΔGepi is normally distributed, with mean μ=μ1 and variance σ2=σ12+σ22. These normal approximations are reasonable considering that ΔGadd is calculated by adding up the energy contribution of each site in the sequence, and ΔGepi is calculated by adding up the energy contribution of each pair of sites in the sequence, for a large number l=400 of sites. We also note that, in the mutation-limited regime depicted in our simulations, the stationary distribution of the simulated random walk will be uniform on the set of genotypes with negative folding energies that are path-connected with our choice of starting genotype [35]. Under the assumption that almost all genotypes with negative folding energies are path-connected, picking a sequence from the stationary distribution is equivalent to picking a sequence from the uniform distribution on sequences with negative folding energies, and so our problem reduces to understanding the distribution of additive and epistatic folding contributions among all sequences with negative free energies of folding.

Under the above approximation, we now consider how—for a typical viable sequence—the additive and epistatic energies jointly produce a negative free energy of folding. The key idea is that there are so many more sequences with positive additive contributions to folding than there are sequences with negative additive contributions that most sequences that fold have a positive additive contribution despite the fact that any particular sequence with a positive additive contribution to the free energy of folding has only a minuscule chance of actually folding.

More precisely, let us fix the value of the additive energy at ΔGadd=x, and count the number of sequences, with this given ΔGadd, that fold. The number of sequences with ΔGadd=x is proportional to the probability density function (PDF) for the distribution of ΔGadd, PDF(N(μ1,σ12))(x). Adding the epistatic energy to the additive energy, the sequences with ΔGadd=x that fold are the sequences for which ΔGepi<x, i.e., their number is proportional to the cumulative distribution function (CDF) of ΔGepi evaluated at x, CDF(N(0,σ22))(x). Putting the two pieces together, the number of sequences that have ΔGadd=x and, at the same time fold, is proportional to

PDF(N(μ1,σ12))(x)×CDF(N(0,σ22))(x). (2)

Figure 4 shows this calculation for x values near the viability threshold 0. We see that, over this range of folding energies, the number of sequences is growing extremely rapidly (Figure 4a) so that, roughly speaking, the number of sequences with a given additive energy increases 10-fold for every additional 0.45 kcal/mol. Now, Figure 4b shows the fraction of sequences with a given additive contribution that spontaneously fold. This is near 1 for most sequences with a negative contribution, but decreases exponentially for positive additive contributions. The net result (Figure 4c) is that a typical additive contribution for a sequence that folds is often around 22 or 23 kcal/mol. While only a tiny fraction of sequences with additive energies in this range fold (roughly 1 in a million), there are roughly 10 billion times more sequences in this 1 kcal/mol range than there are all sequences that would spontaneously fold based on their additive contribution (i.e., with ΔGadd<0), so that, in the end, most sequences that fold have substantially positive additive contributions to the free energy of folding.

Figure 4.

Figure 4

Illustration of the main mechanism behind the essentiality of epistatic interactions for spontaneous folding: (a) density of random sequences with given additive free energy PΔGadd=x=PDF(N(μ1,σ12))(x); (b) fraction of sequences that fold given additive free energy P(ΔG<0|ΔGadd=x)=CDF(N(0,σ22))(x); (c) density of random sequences that fold and have the given additive free energy PΔG<0ΔGadd=x. Parameters are identical to those in Figure 1.

The above argument leads to a simple prediction for the joint distribution of the free energy of folding and additive contribution to folding shown in Figure 1g and Figure 2g: since the joint distribution for random sequences is bivariate normal, the distribution of observed energies should simply be this bivariate normal distribution truncated at ΔG=0 kcal/mol. This approximation is shown in Figure 1g and Figure 2g by a dashed gray curve that is predicted to contain 95% of the observations, and we see that this prediction is in reasonable agreement with our simulations.

Moreover, under this bivariate normal approximation, the average contribution of epistasis to the mean free energy of folding observed in our simulations can be calculated in a manner exactly analogous to Galton’s classical results on regression to the mean [36], or the difference between the selection differential and the response to selection in the breeder’s equation from quantitative genetics [37,38]. In particular, we find that the mean additive energy of viable sequences is approximately E(ΔGadd|ΔG<0)μσ22/(σ12+σ22) (see Appendix C for details), so that the mean contribution of epistasis is approximately μσ22/(σ12+σ22), or equivalently μ(1R2), where R2 is given by the squared correlation coefficient of additive and total folding energies taken over all of sequence space. As a result, even if the mapping from sequence to folding energy is nearly additive, in the sense that R2 is almost 1, the predicted epistatic contribution to the folding stability can still be substantial provided that the expected folding energy μ of a random sequence is sufficiently disfavorable.

4. Discussion

The role of epistasis in long-term protein evolution remains a topic of active debate [1,2,3,4,5]. Here, we have explored a surprising phenomenon where the effects of mutations on the ΔG of folding appear to combine nearly additively, and nonetheless what little epistasis is present plays a critical role, to the extent that observed sequences would not be able to fold in the absence of these epistatic interactions. We showed that this phenomenon occurs in a model where interactions occur between pairs of sites but not in a model where each sequence differs from its additive prediction by an independent draw from a normal distribution. The difference between the two models arises because pairwise interactions can appear nearly additive in double mutants while still producing a substantial amount of epistasis over sequence space as a whole. We also present simple analytical approximations that predict the extent of the epistatic contribution to stability in our simulations. Furthermore, these approximations suggest that this phenomenon occurs due to sequence entropy: Many more sequences can fold due to a combination of epistatic and additive contributions than can fold based on the additive contributions to stability alone, and so the epistatic contribution to stability is typically essential when one observes a random sequence that folds. These results add to a growing literature demonstrating that natural selection can enrich for epistatic interactions in both adaptive [39,40,41] and nearly neutral [13,16] evolution, such that the mutations that fix during evolution can have a very different pattern of epistasis than random mutations.

Our simulations (Figure 1) recapitulate the known qualitative features of protein evolution under purifying selection for folding stability to a surprising degree, with the exception of matching the observed stability margin, which is smaller in our simulations than for experimentally measured folding energies [29] (free energy of folding is typically −5 to −10 kcal/mol versus −1 kcal/mol in our simulations). However, this unnaturally small stability margin is a well-known artifact of our decision to model fitness as a step function in stability [8] rather than a more realistic logistic function [11,30], and the fact that our simulations do not include any of the other factors that would tend to increase the stability margin such as selection for mutational robustness [8,32,35] or selection to prevent misfolding due to errors in translation [42]. Nonetheless the simple sequence-to-fitness mapping employed in our simulations allows us to provide a relatively simple and complete theory for the observed phenomenon. Moreover, we emphasize that it is easy to find realistic parameters where the mean additive contribution to stability is far less stable than shown in Figure 1, so we anticipate that the possibility that most sequences fold only due to epistasis would be robust even if sequences experienced a much larger stability margin.

A different limitation of our results concerns the assumption, in our truncated bivariate normal approximation, that the set of sequences with negative folding energies is mutationally connected, and hence is accessible to an evolving population. In particular, the theory breaks down if a large fraction of sequences that fold appear as isolated peaks or small isolated clusters of sequences. Figure 5 shows an example of this limitation for the case of the independent model with parameters chosen so that the bivariate normal approximation is identical to the bivariate normal approximation for the pairwise model shown in Figure 1. The figure shows some enrichment for epistasis but not as much as predicted by our bivariate normal approximation. Using the crude percolation-theory argument that the connected network of sequences can extend only up to the additive energy at which each sequence has on average one neighbor that folds due to epistasis [43], we can derive the approximate upper limit of the distribution of additive energies as σHOCΨ1(1/(400×19))=18.8, where Ψ1 is the inverse cumulative distribution function of a standard normal distribution. This approximate upper limit is shown by the dashed vertical line in Figure 5. We see that the cloud of observed sequences is primarily to the left of this line, with a notable absence of sequences with substantially more positive additive contributions. This analysis of connectivity of the set of sequences that fold highlights that pairwise interactions have several special features: not only can they appear locally non-epistatic while harboring a substantial amount of epistasis at greater distances, but as long as the individual coefficients remain small they produce energy landscapes that change smoothly over sequence space, producing the enormous connected networks of sequences whose traversal allows the evolution of a sizable epistatic contribution to folding.

Figure 5.

Figure 5

Joint distribution of ΔG of folding and the additive contribution to ΔG of folding for the independent epistasis model with σHOC2 chosen so that the bivariate normal approximation matches the bivariate normal approximation shown in Figure 1g. Simulations conducted under the independent epistasis model with μadd=1kcal/mol, σadd2=1kcal2/mol2, σHOC2=21.6kcal2/mol2. A dashed curve shows the area predicted to include 95% of sequences at stationarity under the bivariate normal approximation; the dashed vertical line at ΔGadd=18.8 reflects a value, derived using a crude percolation theory argument, for the right-most edge of the region where the bivariate normal approximation is expected to be valid.

A final limitation of our approach concerns long-term dynamics at individual sites. Our analysis addresses the possibility that the observed, nearly additive effects of mutations on stability may be compatible with a critical role for epistasis in most sequences that fold. However, our mathematical results do not yet provide a detailed view of how the stability effects at individual sites change over time. In particular, previous simulation studies have described complex temporal dynamics at individual sites, so that the set of tolerable amino acid mutations at any given site changes over time [13,15,16,44,45], a process that has been called an “evolutionary Stokes shift”, and which is also related to the phenomena of “contingency and entrenchment” and the expectation that reversion rates will decrease over time [15,45,46,47]. However, mathematical results closely related to the approach taken here [13,48] suggest that extensions of our theory to address the temporal variation in stability effects, as well as the corresponding distributions of site-specific amino acid preferences and correlations between sites, may also be possible.

It is natural to ask whether the essentiality of epistatic interactions for the functionality of evolved sequences is likely to hold in other contexts where additivity is thought to prevail, such as the DNA binding sites of transcription factors (TFs). However, this effect is unlikely to occur in such cases because the sequences are much shorter, the alphabet is smaller, and thus the set of functional sequences makes up a much larger proportion of genotypic space. In particular, it is helpful to consider the z-score of functional sequences relative to random sequences, since the regression to the mean effect observed here is proportional to the absolute value of the z-score. For instance, a typical TF binding motif in bacteria has an information content of ∼23 bits [49], corresponding to a p-value of ∼10−7 or a z-score of roughly −5. Eukaryotic TFs have even smaller information content and therefore smaller absolute value z-scores. In contrast, the z-scores of the spontaneously folding sequences observed in our simulations are on the order of −20, which we would expect to result in a roughly 4-fold larger contribution of epistasis to binding energy at stationarity than for a bacterial transcription factor binding site. Such extreme z-scores are not even possible in short DNA elements, e.g., the most extreme z-score possible in a TF binding site of length 20 is only −7. Thus, the essentiality of epistatic interactions observed here is likely possible only because protein sequence space is very large compared to the space of TF binding sites.

Finally, it is important to emphasize that the key question of whether epistatic interactions for protein stability are essential for protein folding in naturally evolved sequences remains open. Our contribution only demonstrates that such an effect is qualitatively consistent with empirical observations on the thermodynamic effects of mutations and the results of prior simulation studies, and suggests that the overall importance of epistasis for stability depends on the precise form of epistasis involved. Intriguingly, the experimental observation that pairwise correlations between site-specific amino acid usages are sometimes necessary for folding [50] provides evidence for both the presence of the low-order epistatic interactions that result in a substantial contribution of epistasis to protein folding and also for the possible essentiality of these interactions. Thus, determining whether epistasis is essential for folding of observed sequences is a key question for the field, from both theoretical and empirical perspectives. Importantly, our analysis shows that most standard designs for examining the extent of epistasis for protein stability cannot adjudicate this question because they examine how the stability effects of mutations change at a only single distance from a reference genotype. For instance, the analysis of double mutants considers the change in the effect of a mutation in a sequence at distance 1; and comparison of the effects of mutations on two diverged backgrounds, e.g., [14,19], can only determine the extent of epistasis at that one level of divergence. Rather, the two theories analyzed here differ in how the extent of epistasis changes with distance (e.g., Figure 3 and Appendix B.1). Thus, the critical experiment is to measure how the energetic effects of individual mutations change across several different levels of sequence divergence (c.f. [16]).

Acknowledgments

We thank Ashley Teufel and David Liberles for organizing this Special Issue of Genes and two anonymous reviewers for helpful comments.

Appendix A

Appendix A.1. Model for the Free Energy of Folding

Given an alphabet A={0,1,,a1} and a sequence length l, let S be the set all possible strings of length l built from alphabet A. The free energy of folding ΔG(x) for a sequence xS is defined as the sum of (1) an additive component that measures the energy contribution of each allele at each position in the sequence, and (2) an epistatic component that describes the energy contribution of pairwise interactions among alleles for the pairwise model, or a random draw from a normal distribution for the independent epistasis model:

ΔG(x):=ΔGadd(x)+ΔGepi(x). (A1)

To specify each of the terms ΔGadd(x) and ΔGepi(x), let β{k},α be the additive contribution to the free energy of folding of having allele α at position k, and β{k1,k2},α1α2 be the contribution of the interaction between allele α1 at site k1 and allele α2 at site k2 in the pairwise interaction model. Then,

ΔGadd(x):=k=1lβ{k},xk (A2)

is the total additive contribution to the folding energy. For the pairwise interaction model, we let

ΔGepi(x):=k1=1l1k2=k1+1lβ{k1,k2},xk1xk2 (A3)

be the epistatic contribution to the free energy of folding. For the independent epistasis model, we instead let ΔGepi(x) be an independent random draw from a normal distribution with mean 0 and variance σHOC2.

It now remains to specify a procedure for assigning values to the β{k},α and β{k1,k2},α1α2. In drawing these coefficients, we need to ensure that ΔGepi(x) is a pure epistatic contribution, that is, that the average epistatic effect of any given point mutation over sequence space is zero, and would also like to set the average energy over all of sequence space to lμadd. To do this, we draw these coefficients from multivariate normal distributions with covariance matrices chosen to enforce the necessary constraints.

In particular, for each site k, we choose the β{k},α from an a-dimensional normal distribution that has mean vector (μadd,,μadd) and covariance

E(β{k},αμadd)(β{k},αμadd)=σadd2,ifα=α,σadd2,1a1ifαα. (A4)

These coefficients are drawn independently for each site k, and the marginal distribution for each β{k},α is a normal distribution with mean μadd and variance σadd2 as required.

Turning to the epistatic component of the free energy of folding for the pairwise interactions model, for each pair of sites {k1,k2} with k1<k2, we draw the β{k1,k2},α1α2 from an a2-dimensional multivariate normal distribution with mean (0,,0) and covariance

E(β{k1,k2},α1α2β{k1,k2},α1α2)=σepi2,ifα1=α1andα2=α2,σepi21(a1),if(α1=α1andα2α2)or(α1α1andα2=α2),σepi21(a1)2,ifα1α1andα2α2. (A5)

These coefficients are then drawn independently for each such pair of sites {k1,k2}. Furthermore, the marginal distribution for each β{k1,k2},α1α2 is normal with mean 0 and variance σepi2 as required.

Appendix A.2. Properties of the β{k},α and β{k1,k2},α1α2

As mentioned above, the covariance matrices for β{k},α and β{k1,k2},α1α2 are chosen to enforce two sets of constraints. First, for each k,

1aα=0a1β{k},α=μadd (A6)

must be satisfied, so that the average contribution to the free energy of folding across all the alleles at any given site is μadd. This implies that the mean additive folding energy over all possible sequences is lμadd, i.e., ΔGadd(x)x=lμadd.

Second, for each pair of sites {k1,k2}, we specify that

α1=0a1β{k1,k2},α1α2=0,α2A,andα2=0a1β{k1,k2},α1α2=0,α1A. (A7)

This second set of constraints ensures that the average epistatic effect of any given point mutation over sequence space is zero, i.e., that ΔGepi is a pure epistatic term with no additive component. It also follows that the mean epistatic energy over all possible sequences is 0, i.e., ΔGepi(x)x=0.

To see that the constraint in Equation (A6) is satisfied, we calculate

E1aα=0a1β{k},α=1aα=0a1E(β{k},α)=μadd, (A8)

and using Equation (A4),

Var1aα=0a1β{k},α=E1aα=0a1β{k},αμadd2=1a2Eα=0a1(β{k},αμadd)2=1a2α=0a1E((β{k},αμadd)2)+2α=0a1α=0ααa1E(β{k},αμadd)(β{k},αμadd)=1a2aσadd2a(a1)21a1σadd=0, (A9)

so that the constraint in Equation (A6) is satisfied with probability 1.

Similarly, to see that the constraint in Equation (A7) is satisfied, we calculate

Eα1=0a1β{k1,k2},α1α2=α1=0a1E(β{k1,k2},α1α2)=0, (A10)

and using Equation (A5),

Varα1=0a1β{k1,k2},α1α2=Eα1=0a1β{k1,k2},α1α22=α1=0a1E(β{k1,k2},α1α22)+2α1=0a1α1=0α1α1a1E(β{k1,k2},α1α2β{k1,k2},α1α2)=aσepi22a(a1)2σepi2(a1)=0. (A11)

It follows from Equations (A10) and (A11) that the first set of constraints in Equation (A7) is satisfied with probability 1. By symmetry between α1 and α2, the same follows for the second set of constraints as well.

Appendix B

Appendix B.1. Expected Variance of Epistatic Energy of Distance Classes under the Random Field Model

In this section, we derive an analytical formula for the expectation of the realized variance in folding energy due to epistasis at a given distance from a focal genotype. Figure 3 was generated using this result. We arbitrarily fix a wild type sequence wt and look at all sequences that are at fixed Hamming distance d from this wild type, where the Hamming distance d(x,x) between two sequences x and x is defined as the number of sites where the two sequences differ. Let S(wt,d)={xS|d(x,wt)=d} denote the set of sequences at distance d from the wild type wt. For some random function f defined on sequence space S, e.g., a random field model, we want to understand how variable the function will be among sequence at distance d from the wild-type. We quantify this variability in terms of the sample variance at distance d:

V(d)V({f(x)|xS(wt,d)}):=1ndxS(wt,d)(f(x)f(x)S(wt,d))2, (A12)

with ·T denoting the mean taken over the set T and

nd:=|S(wt,d)|=ld(a1)d. (A13)

Because V(d) is itself a random variable and will take a different value for each realization of the random function, we quantify the typical behavior at distance d by the expected value of V(d), i.e., E(v(d)). We will derive our results for f=ΔGepi, but they hold for any random field f that has the same covariance structure as ΔGepi (i.e., isotropic, pure pairwise interactions).

We start by calculating the covariance function of

f(x)=p1=1l1p2=p1+1lβ{p1,p2},xp1xp2. (A14)

We obtain

σf(x)f(x):=Ef(x)f(x)=Ep1<p2β{p1,p2},xp1xp2p1<p2β{p1,p2},xp1xp2=Ep1<p2β{p1,p2},xp1xp2β{p1,p2},xp1xp2=σepi2ld(x,x)2ld(x,x)1d(x,x)11(a1)+d(x,x)21(a1)2, (A15)

where first we used the fact that the β{p1,p2} coefficients for different positions {p1,p2} are independent, then broke up the sum according to position pairs at which x and x agree in 0,1, or 2 sites, and used the formula for the covariance between the coefficients β{p1,p2},α1α2 given in Equation (A5). We note that σf(x)f(x) is a function of d(x,x), and we will use the notation

σf(x)f(x)=σepi2w(d(x,x)). (A16)

Now, turning to calculating the expected sample variance, we apply the well-known formula for possibly correlated random variables [51]

E[V(d)]=nd1nd(σ2¯d+σd,μ2σ¯d), (A17)

where the three quantities:

σ2¯d=1ndxS(wt,d)σf(x)2,σd,μ2=1nd1xS(wt,d)(E[f(x)]1ndxS(wt,d)E[f(x)])2,σ¯d=1nd(nd1)xS(wt,d)xS(wt,d)xxσf(x)f(x), (A18)

are the mean of the variances, the variance of the means, and the mean covariances, respectively. We now derive each of these three quantities individually.

First, we address the mean of variances σ2¯d. Since, by Equation (A16), we have σf(x)2=l2σepi2 for all xS, we also have

σ2¯d=l2σepi2. (A19)

Next, for the variance of the mean energy σd,μ2, since the mean energy E[f(x)] is constant across all xS, we have

σd,μ2=0. (A20)

Finally, for the mean covariance σ¯d, since the covariance σf(x)f(x) only depends on d(x,x) by Equation (A16), the second sum in the definition of σ¯d is the same for all x, thus we can arbitrarily fix a sequence x and write

σ¯d=1nd1xS(wt,d)xxσf(x)f(x),=1nd1d=1min{l,2d}N(d,d)σf(x)f(x),=1nd1σepi2d=1min{l,2d}N(d,d)w(d), (A21)

where x is an arbitrary sequence at distance d to wt and at distance d to x,

N(d,d):=|{x|d(wt,x)=d,d(x,x)=d}|=|S(wt,d)S(x,d)|, (A22)

is the number of such sequences, and we used Equation (A16) in the last equation. Note that

d=1min{l,2d}N(d,d)=nd1 (A23)

is the total number of sequences in S(wt,d) minus the focal sequence x.

Our final task is to count N(d,d). We do this by enumerating the ways that we can change a sequence x at d positions in such a way that it maintains the same distance d from wt as x has. Now, x differs from wt at d sites and is the same as wt at ld sites. For the special case case where a=2, changing any of the d sites where x is the same as wt requires changing the same number of sites among the ld sites where x is identical to wt in order to remain at distance d from wt. Thus N(d,d) is only non-zero for even d less than or equal to twice the minimum of d and ld, in which case we have

N(d,d)=dd/2ldd/2, (A24)

since we need to choose d/2 sites out of the d where x differs from wt and d/2 sites out of the ld sites where they are identical. Now for a>2, we first divide the d sites at which x and wt differ into s sites that we choose to be identical to wt in x and ds sites that will be different from wt. There are ds ways to do this for each choice of s (ds0). Second, since d(x,wt)=d, we must choose s sites out of the ld sites where x and wt are identical and set them in x to be one of the a1 other states. The number of ways to do this is lds(a1)s. Third, now we have x and x differing on 2s sites and, since d(x,x)=d, we need to choose d2s sites out of the ds sites whose states we have not decided yet and set the states of x to be one of the a2 states that is different from both x and wt. The number of ways to do this is dsd2s(a2)d2s. Putting these together, we obtain

N(d,d)=s=0min{d,d2}dslds(a1)sdsd2s(a2)d2s. (A25)

Although cumbersome, this expressions can be readily evaluated using a computer.

Returning to Equation (A17), Equations (A19)–(A21) yield

E[V(d)]=nd1ndσepi2l21nd1d=1min{l,2d}N(d,d)w(d). (A26)

Appendix C

Appendix C.1. Bivariate Normal Approximation

We approximate (ΔGadd,ΔGepi) with a bivariate normal distribution with mean vector

(μ1,0):=lμadd,0 (A27)

and covariance matrix

σ1200σ22, (A28)

where σ12=lσadd2 and σ22=l2σepi2 for the pairwise epistasis model and σ22=σHOC2 for the independent epistasis model. Thus, in this approximation, the total folding energy ΔG=ΔGadd+ΔGepi is also normally distributed, with mean μ:=μ1 and variance σ2:=σ12+σ22. Using this normal approximation, we give an analytical justification for the phenomenon observed in Figure 1g, that, although the effect of epistasis is small, it is nonetheless crucial for folding. We shall use two quantities to measure the strength of this phenomenon. For the smallness of the epistatic effect, we use the measure σ22/σ2, the fraction of the energy variance across all sequences accounted for by the variance of the epistatic energy. For the importance of the epistatic effect, we use the measure E(ΔGadd|ΔG<0), the mean additive energy of viable sequences. If this mean is far above the viability threshold 0, it indicates that, on average, epistasis must make a substantial contribution to the ability of viable sequences to fold.

We now analytically approximate the conditional expectation E(ΔGadd|ΔG<0). We use a classical result that is referred to as the regression towards the mean formula for a pair of normally distributed random variables. If (X,Y) has normal distribution with mean (μX,μY) and covariance matrix

σX2ρXYσXσYρXYσXσYσY2, (A29)

then the regression towards the mean formula describes how the means change if we place a condition on one of the variables as being below some cutoff value c:

E(X|Y<c)E(X)=ρXYσXσYE(Y|Y<c)E(Y). (A30)

Applying this formula to ΔG and ΔGadd, and the condition that a sequence is viable, i.e., ΔG<0, we obtain

E(ΔGadd|ΔG<0)μ1=R2E(ΔG|ΔG<0)μ, (A31)

where

R2:=Cov(ΔGadd,ΔG)σ1σσ1σ=σ12σ2 (A32)

is calculated using the fact that ΔG=ΔGadd+ΔGepi and that the additive and epistatic energies are uncorrelated by Equation (A28). In addition, by Equation (A28), μ=μ1, hence we can express the mean additive folding energy of viable sequences from Equation (A31) as

E(ΔGadd|ΔG<0)=1R2μ1+R2E(ΔG|ΔG<0). (A33)

The conditional mean on the right-hand side of the equation above can be calculated as

E(ΔG|ΔG<0)=μσψμσΨμσ, (A34)

where ψ and Ψ are the PDF and CDF of the standard normal distribution, respectively. As μ becomes large compared to σ, E(ΔG|ΔG<0) approaches 0, therefore, returning to Equation (A33), we obtain the estimate

E(ΔGadd|ΔG<0)1R2μ1=σ22σ12+σ22μ1. (A35)

This means that no matter how small the epistatic effect is, measured by σ22/(σ12+σ22), if the mean of the additive energy μ1 is large enough in comparison, the role of the epistatic energy in protein folding is crucial.

Plugging in our model parameters as given in Equations (A27) and (A28), and using the approximation l1l, the estimate in Equation (A35) becomes

E(ΔGadd|ΔG<0)l2μaddσepi2lσepi2+2σadd2. (A36)

The choice of parameters μadd=1, σadd2=1, and σepi2=0.0003 used in our simulations then yield E(ΔGadd|ΔG<0)22.6, which is very close to 22.5, the mean additive energy of sequences observed in the simulation.

Author Contributions

Conceptualization, A.P., J.B.P., J.B.K. and D.M.M.; Software, A.P., J.Z. and D.M.M.; Formal Analysis A.P., J.Z., J.B.K. and D.M.M.; Investigation A.P. and D.M.M.; Writing—Original Draft Preparation, A.P., J.Z., J.B.P., J.B.K. and D.M.M.; Writing—Review and Editing, A.P., J.B.P., J.B.K. and D.M.M.; Funding Acquisition, J.B.P. and J.B.K.

Funding

J.B.K. and A.P. were supported in part by a grant from the Cold Spring Harbor Laboratory (CSHL) / Northwell Health alliance. J.B.K. is a member of the CSHL Cancer Center, which is supported by NIH Cancer Center Support Grant 5P30CA045508. J.B.P. acknowledges support from the David and Lucile Packard Foundation and the U.S. Army Research Office (W911NF-12-R-0012-04).

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

References

  • 1.Sikosek T., Chan H.S. Biophysics of protein evolution and evolutionary protein biophysics. J. R. Soc. Interface. 2014;11:20140419. doi: 10.1098/rsif.2014.0419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Bastolla U., Dehouck Y., Echave J. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr. Opin. Struct. Biol. 2017;42:59–66. doi: 10.1016/j.sbi.2016.10.020. [DOI] [PubMed] [Google Scholar]
  • 3.Starr T.N., Thornton J.W. Epistasis in protein evolution. Protein Sci. 2016;25:1204–1218. doi: 10.1002/pro.2897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Echave J., Wilke C.O. Biophysical models of protein evolution: Understanding the patterns of evolutionary sequence divergence. Ann. Rev. Biophys. 2017;46:85–103. doi: 10.1146/annurev-biophys-070816-033819. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Storz J.F. Compensatory mutations and epistasis for protein function. Curr. Opin. Struct. Biol. 2018;50:18–25. doi: 10.1016/j.sbi.2017.10.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Li H., Tang C., Wingreen N.S. Are protein folds atypical? Proc. Natl. Acad. Sci. USA. 1998;95:4987–4990. doi: 10.1073/pnas.95.9.4987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Tokuriki N., Tawfik D.S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 2009;19:596–604. doi: 10.1016/j.sbi.2009.08.003. [DOI] [PubMed] [Google Scholar]
  • 8.Taverna D.M., Goldstein R.A. Why are proteins marginally stable? Proteins Struct. Funct. Bioinform. 2002;46:105–109. doi: 10.1002/prot.10016. [DOI] [PubMed] [Google Scholar]
  • 9.Shakhnovich E.I. Protein design: A perspective from simple tractable models. Fold. Des. 1998;3:R45–R58. doi: 10.1016/S1359-0278(98)00021-2. [DOI] [PubMed] [Google Scholar]
  • 10.Govindarajan S., Goldstein R.A. On the thermodynamic hypothesis of protein folding. Proc. Natl. Acad. Sci. USA. 1998;95:5545–5549. doi: 10.1073/pnas.95.10.5545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Goldstein R.A. The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins Struct. Funct. Bioinform. 2011;79:1396–1407. doi: 10.1002/prot.22964. [DOI] [PubMed] [Google Scholar]
  • 12.Serohijos A.W., Shakhnovich E.I. Merging molecular mechanism and evolution: Theory and computation at the interface of biophysics and evolutionary population genetics. Curr. Opin. Struct. Biol. 2014;26:84–91. doi: 10.1016/j.sbi.2014.05.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Goldstein R.A., Pollock D.D. Sequence entropy of folding and the absolute rate of amino acid substitutions. Nat. Ecol. Evol. 2017;1:1923. doi: 10.1038/s41559-017-0338-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Ashenberg O., Gong L.I., Bloom J.D. Mutational effects on stability are largely conserved during protein evolution. Proc. Natl. Acad. Sci. USA. 2013;110:21071–21076. doi: 10.1073/pnas.1314781111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Pollock D.D., Thiltgen G., Goldstein R.A. Amino acid coevolution induces an evolutionary Stokes shift. Proc. Natl. Acad. Sci. USA. 2012;109:E1352–E1359. doi: 10.1073/pnas.1120084109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Shah P., McCandlish D.M., Plotkin J.B. Contingency and entrenchment in protein evolution under purifying selection. Proc. Natl. Acad. Sci. USA. 2015;112:E3226–E3235. doi: 10.1073/pnas.1412933112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Breen M.S., Kemena C., Vlasov P.K., Notredame C., Kondrashov F.A. Epistasis as the primary factor in molecular evolution. Nature. 2012;490:535–538. doi: 10.1038/nature11510. [DOI] [PubMed] [Google Scholar]
  • 18.Wells J.A. Additivity of mutational effects in proteins. Biochemistry. 1990;29:8509–8517. doi: 10.1021/bi00489a001. [DOI] [PubMed] [Google Scholar]
  • 19.Risso V.A., Manssour-Triedo F., Delgado-Delgado A., Arco R., Barroso-delJesus A., Ingles-Prieto A., Godoy-Ruiz R., Gavira J.A., Gaucher E.A., Ibarra-Molero B., et al. Mutational studies on resurrected ancestral proteins reveal conservation of site-specific amino acid preferences throughout evolutionary history. Mol. Biol. Evol. 2015;32:440–455. doi: 10.1093/molbev/msu312. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Stadler P.F., Happel R. Random field models for fitness landscapes. J. Math. Biol. 1999;38:435–478. doi: 10.1007/s002850050156. [DOI] [Google Scholar]
  • 21.Neidhart J., Szendro I.G., Krug J. Exact results for amplitude spectra of fitness landscapes. J. Theor. Biol. 2013;332:218–227. doi: 10.1016/j.jtbi.2013.05.002. [DOI] [PubMed] [Google Scholar]
  • 22.Aita T., Uchiyama H., Inaoka T., Nakajima M., Kokubo T., Husimi Y. Analysis of a local fitness landscape with a model of the rough Mt. Fuji-type landscape: Application to prolyl endopeptidase and thermolysin. Biopolymers. 2000;54:64–79. doi: 10.1002/(SICI)1097-0282(200007)54:1&#x0003c;64::AID-BIP70&#x0003e;3.0.CO;2-R. [DOI] [PubMed] [Google Scholar]
  • 23.Neidhart J., Szendro I.G., Krug J. Adaptation in tunably rugged fitness landscapes: The rough Mount Fuji model. Genetics. 2014;198:699–721. doi: 10.1534/genetics.114.167668. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Kingman J. A simple model for the balance between selection and mutation. J. Appl. Probab. 1978;15:1–12. doi: 10.2307/3213231. [DOI] [Google Scholar]
  • 25.Guerois R., Nielsen J.E., Serrano L. Predicting changes in the stability of proteins and protein complexes: A study of more than 1000 mutations. J. Mol. Biol. 2002;320:369–387. doi: 10.1016/S0022-2836(02)00442-4. [DOI] [PubMed] [Google Scholar]
  • 26.Miyazawa S., Jernigan R.L. Residue—Residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 1996;256:623–644. doi: 10.1006/jmbi.1996.0114. [DOI] [PubMed] [Google Scholar]
  • 27.Chan H., Bornberg-Bauer E. Perspectives on protein evolution from simple exact models. Appl. Bioinform. 2002;1:121–144. [PubMed] [Google Scholar]
  • 28.Tokuriki N., Stricher F., Schymkowitz J., Serrano L., Tawfik D.S. The stability effects of protein mutations appear to be universally distributed. J. Mol. Biol. 2007;369:1318–1332. doi: 10.1016/j.jmb.2007.03.069. [DOI] [PubMed] [Google Scholar]
  • 29.Zeldovich K.B., Chen P., Shakhnovich E.I. Protein stability imposes limits on organism complexity and speed of molecular evolution. Proc. Natl. Acad. Sci. USA. 2007;104:16152–16157. doi: 10.1073/pnas.0705366104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wylie C.S., Shakhnovich E.I. A biophysical protein folding model accounts for most mutational fitness effects in viruses. Proc. Natl. Acad. Sci. USA. 2011;108:9916–9921. doi: 10.1073/pnas.1017572108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Miyazawa S. Selection maintaining protein stability at equilibrium. J. Theor. Biol. 2016;391:21–34. doi: 10.1016/j.jtbi.2015.12.001. [DOI] [PubMed] [Google Scholar]
  • 32.Bloom J.D., Raval A., Wilke C.O. Thermodynamics of neutral protein evolution. Genetics. 2007;175:255–266. doi: 10.1534/genetics.106.061754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Serrano L., Day A.G., Fersht A.R. Step-wise mutation of barnase to binase: A procedure for engineering increased stability of proteins and an experimental analysis of the evolution of protein stability. J. Mol. Biol. 1993;233:305–312. doi: 10.1006/jmbi.1993.1508. [DOI] [PubMed] [Google Scholar]
  • 34.Serohijos A.W., Shakhnovich E.I. Contribution of selection for protein folding stability in shaping the patterns of polymorphisms in coding regions. Mol. Biol. Evol. 2013;31:165–176. doi: 10.1093/molbev/mst189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Van Nimwegen E., Crutchfield J.P., Huynen M. Neutral evolution of mutational robustness. Proc. Natl. Acad. Sci. USA. 1999;96:9716–9720. doi: 10.1073/pnas.96.17.9716. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Galton F. Regression towards mediocrity in hereditary stature. J. Anthropol. Inst. G. B. Irel. 1886;15:246–263. doi: 10.2307/2841583. [DOI] [Google Scholar]
  • 37.Lynch M., Walsh B. Genetics and Analysis of Quantitative Traits. Volume 1 Sinauer; Sunderland, MA, USA: 1998. [Google Scholar]
  • 38.Kimura M., Crow J.F. Effect of overall phenotypic selection on genetic change at individual loci. Proc. Natl. Acad. Sci. USA. 1978;75:6168–6171. doi: 10.1073/pnas.75.12.6168. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Draghi J.A., Plotkin J.B. Selection biases the prevalence and type of epistasis along adaptive trajectories. Evolution. 2013;67:3120–3131. doi: 10.1111/evo.12192. [DOI] [PubMed] [Google Scholar]
  • 40.Greene D., Crona K. The changing geometry of a fitness landscape along an adaptive walk. PLoS Comput. Biol. 2014;10:e1003520. doi: 10.1371/journal.pcbi.1003520. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Blanquart F., Achaz G., Bataillon T., Tenaillon O. Properties of selected mutations and genotypic landscapes under Fisher’s geometric model. Evolution. 2014;68:3537–3554. doi: 10.1111/evo.12545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Drummond D.A., Wilke C.O. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell. 2008;134:341–352. doi: 10.1016/j.cell.2008.05.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Gavrilets S., Gravner J. Percolation on the fitness hypercube and the evolution of reproductive isolation. J. Theor. Biol. 1997;184:51–64. doi: 10.1006/jtbi.1996.0242. [DOI] [PubMed] [Google Scholar]
  • 44.Goldstein R.A., Pollock D.D. The tangled bank of amino acids. Protein Sci. 2016;25:1354–1362. doi: 10.1002/pro.2930. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Teufel A.I., Wilke C.O. Accelerated simulation of evolutionary trajectories in origin-fixation models. J. R. Soc. Interface. 2017;14:20160906. doi: 10.1098/rsif.2016.0906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Naumenko S.A., Kondrashov A.S., Bazykin G.A. Fitness conferred by replaced amino acids declines with time. Biol. Lett. 2012;8:825–828. doi: 10.1098/rsbl.2012.0356. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.McCandlish D.M., Shah P., Plotkin J.B. Epistasis and the dynamics of reversion in molecular evolution. Genetics. 2016;203:1335–1351. doi: 10.1534/genetics.116.188961. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Berg O.G., von Hippel P.H. Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987;193:723–743. doi: 10.1016/0022-2836(87)90354-8. [DOI] [PubMed] [Google Scholar]
  • 49.Wunderlich Z., Mirny L.A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 2009;25:434–440. doi: 10.1016/j.tig.2009.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Socolich M., Lockless S.W., Russ W.P., Lee H., Gardner K.H., Ranganathan R. Evolutionary information for specifying a protein fold. Nature. 2005;437:512. doi: 10.1038/nature03991. [DOI] [PubMed] [Google Scholar]
  • 51.Bradley R.S. Estimation of bias and variance of measurements made from tomography scans. Meas. Sci. Technol. 2016;27 doi: 10.1088/0957-0233/27/9/095402. [DOI] [Google Scholar]

Articles from Genes are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES