Molecular Biology and Evolution. 2018 Aug 2;35(10):2345–2354. doi: 10.1093/molbev/msy141

Biophysical Inference of Epistasis and the Effects of Mutations on Protein Stability and Function

Jakub Otwinowski
Editor: Claus Wilke
PMCID: PMC6188545  PMID: 30085303

Abstract

Understanding the relationship between protein sequence, function, and stability is a fundamental problem in biology. The essential function of many proteins that fold into a specific structure is their ability to bind to a ligand, which can be assayed for thousands of mutated variants. However, binding assays do not distinguish whether mutations affect the stability of the binding interface or the overall fold. Here, we introduce a statistical method to infer a detailed energy landscape of how a protein folds and binds to a ligand by combining information from many mutated variants. We fit a thermodynamic model describing the bound, unbound, and unfolded states to high quality data of protein G domain B1 binding to IgG-Fc. We infer distinct folding and binding energies for each mutation providing a detailed view of how mutations affect binding and stability across the protein. We accurately infer the folding energy of each variant in physical units, validated by independent data, whereas previous high-throughput methods could only measure indirect changes in stability. While we assume an additive sequence–energy relationship, the binding fraction is epistatic due to its nonlinear relation to energy. Despite having no epistasis in energy, our model explains much of the observed epistasis in binding fraction, with the remaining epistasis identifying conformationally dynamic regions.

Keywords: genotype-phenotype map, protein evolution, protein-protein interface, deep mutational scanning

Introduction

Deep mutational scanning (DMS) studies have produced detailed maps of how proteins and regulatory sequences are related to function by assaying up to millions of mutated variants (Fowler and Fields 2014), and have had many applications, from identifying viral epitopes (Kowalsky et al. 2015; Doud et al. 2017; Haddox et al. 2018) to protein engineering (Wrenbeck et al. 2017). While these studies aim to understand molecular function and evolution by collecting large numbers of sequence–function pairs, the full sequence–function map is very difficult to determine due to the enormity of sequence space. Different sites in a sequence may not contribute to molecular function independently, and the effect of a mutation at one site may depend on the genetic background. This nonadditivity, or epistasis, means that the entire space of possible sequences may have to be explored to understand molecular function.

Given the limited data, mathematical modeling is necessary to make any progress. However, purely statistical inferences are difficult to interpret in terms of known biology, and can be too flexible to make reliable predictions (Otwinowski and Plotkin 2014; du Plessis et al. 2016). On the other hand, biophysical systems are not arbitrarily complex as they follow physical laws and structural constraints. In other words, there is hope that biophysical knowledge can help explain sequence–function relationships. A powerful assumption is that in between sequence and function lie relevant intermediate phenotypes that have relatively simple relations to sequence and function (Bershtein et al. 2017; Otwinowski et al. 2018).

The stability of a protein’s fold is a fundamental molecular phenotype under selection. Studying how mutations affect stability is a fundamental challenge in protein science (Magliery et al. 2011), and is the aim of some DMS studies (Araya et al. 2012; Traxlmayr et al. 2012; Kim et al. 2013; Olson et al. 2014). However, assaying molecular function, such as binding to a ligand, is a necessary and insufficient measure of stability, in that most proteins must be folded to function, but do not necessarily function if folded. High-throughput techniques, such as proteolysis assays (Rocklin et al. 2017), typically do not measure free energy, but measure scores that are correlated with stability.

Binding to a ligand is an essential function of many proteins, but studies which aim to study binding affinity, for example, to determine viral epitopes (Doud et al. 2017), also confound binding and folding. Dissociation constants measured by varying the ligand concentration do not distinguish between changes in fold stability and changes in stability of the binding interface. In addition, scores from high-throughput binding assays often do not measure affinity in meaningful units (although see Adams et al. 2016). Some studies use heuristic measures of sequence conservation to determine which sites are relevant to the binding interface (Kowalsky et al. 2015), but these methods are not precise and do not separate the effects of binding and stability at each site.

While binding assays often confound binding and stability, these can be separated with a thermodynamic approach. Thermodynamic approaches have been at the heart of biophysical models applied to data to quantify the evolution of regulatory sequences (Mustonen and Lässig 2005; Mustonen et al. 2008; Kinney et al. 2010; Haldane et al. 2014; Lagator et al. 2017), and proteins (Bloom et al. 2005; Wylie and Shakhnovich 2011; Echave and Wilke 2017; Schmiedel and Lehner 2018). In the context of proteins, there are typically a few relevant conformational states, and the kinetics are fast enough to reach thermal equilibrium, with a free energy determining the probability of each state. At a minimum, a protein has folded and unfolded states, and other states may be due to binding, misfolding, or some other conformational changes. Two-state models have been important in understanding observed patterns of substitutions in protein evolution (Liberles et al. 2012; Starr and Thornton 2016; Bastolla et al. 2017), and in general, the ensemble of protein conformations generates epistasis that makes protein evolution difficult to predict (Sailer and Harms 2017b). A powerful simplification is to approximate the total energy by a sum over site-specific energies (additivity), which has been observed in most changes to fold stability (Wells 1990; Sandberg and Terwilliger 1993). However, even with additivity in energy, the probability of a protein being in a particular state is nonlinear with respect to energy and therefore epistatic with regard to sequence.

In this work, we infer a thermodynamic model that separates binding and stability in a small bacterial protein, protein G domain B1 (GB1), a model system of folding and stability, where a recent high-throughput assay of functional binding to an immunoglobulin fragment (IgG-Fc) described the epistasis between nearly all pairs of sites (Olson et al. 2014). We infer a thermodynamic model with two states, bound and unbound, and another model with three states: bound/folded, unbound/folded, and unbound/unfolded. The approximation of additivity in energy allows us to separate how mutations destabilize the binding interface and how they destabilize the overall fold. We validate these approximations by predicting independently measured changes in fold stability. We describe the folding and binding landscape of the protein, identify which sites contribute most to binding, and explain much of the observed epistasis without assuming any nonadditivity in how mutations affect binding and folding energy.

Results

In Vitro Selection of Protein Variants

Olson et al. (2014) mutagenized GB1 to create a library of protein variants which contained all single mutants (1045 variants) and nearly all double mutants (536k variants) of a wild-type sequence. The library was sequenced before and after an in vitro binding assay, and therefore, the fraction of bound protein to IgG-Fc for a variant with sequence $\sigma$ is $p_{fb}(\sigma)=\frac{n_1(\sigma)}{n_0(\sigma)\,r}$, where $n_0(\sigma)$ and $n_1(\sigma)$ are the sequence counts before and after selection, and $r$ is a global factor that accounts for overall differences between initial and final sequencing (see Materials and Methods for a maximum likelihood derivation of $p_{fb}$). Since $r$ is not known, we define fitness as the logarithm of the binding fraction normalized by the wild-type $\sigma^W$

$$f(\sigma)=\log\left(\frac{n_1(\sigma)}{n_0(\sigma)}\,\frac{n_0(\sigma^W)}{n_1(\sigma^W)}\right), \tag{1}$$

as an analogy to the growth rate of an exponentially growing population, although we do not imply that this is the (relative) growth rate of an organism with this variant.
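As a minimal sketch of equation (1), the log enrichment can be computed directly from the four counts; the unknown factor r cancels in the ratio and never needs to be specified. The function name `fitness`, the optional `pseudocount` argument, and the example counts are illustrative assumptions and are not taken from the data set.

```python
import math

def fitness(n0, n1, n0_wt, n1_wt, pseudocount=0.0):
    """Log enrichment of a variant relative to wild type, as in equation (1)."""
    # pseudocount is a hypothetical knob for regularizing very small counts;
    # with the default of 0.0 a variant with n0 == 0 raises ZeroDivisionError.
    variant_ratio = (n1 + pseudocount) / (n0 + pseudocount)
    wild_type_ratio = (n1_wt + pseudocount) / (n0_wt + pseudocount)
    return math.log(variant_ratio / wild_type_ratio)

# Hypothetical counts: the variant is enriched four times less than wild type.
example = fitness(n0=200, n1=50, n0_wt=1000, n1_wt=1000)
```

By construction the wild type has fitness zero, so negative values indicate variants that bind less than wild type.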

With single and double mutants it is possible to study interactions between sites, or epistasis. Pairwise epistasis in fitness is the difference between the fitness of the double mutants and the expectation of additivity, that is, the sum of the fitness of two single mutants:

$$J_{ij}^{ab}=f\left(\sigma^W_{/(i,a)/(j,b)}\right)-f\left(\sigma^W_{/(i,a)}\right)-f\left(\sigma^W_{/(j,b)}\right), \tag{2}$$

where /(i,a) and /(j,b) indicate mutations at positions i, j with amino acids a, b. Note that this is the standard definition of epistasis since fitnesses are defined relative to the wild-type.
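Equation (2) amounts to one subtraction per pair of mutations. The short sketch below uses hypothetical fitness values only to make the sign convention concrete: a double mutant that is worse than the sum of its single mutants has negative epistasis.

```python
def pairwise_epistasis(f_double, f_single_a, f_single_b):
    """Equation (2): double-mutant fitness minus the additive expectation.

    All fitnesses are relative to wild type, whose fitness is zero.
    """
    return f_double - f_single_a - f_single_b

# Hypothetical fitnesses: singles at -1.0 and -0.5, double at -3.0.
j_example = pairwise_epistasis(f_double=-3.0, f_single_a=-1.0, f_single_b=-0.5)
```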

While changes in fitness show where a protein is sensitive to binding, such changes are not informative of whether mutations are destabilizing the binding interface or the overall fold, as changes in either one influence the fraction of bound protein. A thermodynamic model is necessary to separate these effects.

Thermodynamic Models

Proteins fold into complicated structures and interact with other molecules depending on the free energy of their different states or conformations. Under natural conditions, protein states reach thermodynamic equilibrium very quickly and the Boltzmann distribution relates the probability of each state with the free energy of that state (Phillips et al. 2012). In a two state bound/unbound model, the fraction of bound protein is $\frac{1}{1+e^{G/RT}}$, with free energy that is related to the dissociation constant $K_d$ by $G=RT\log(K_d/c)$, where $T$ is temperature, $R$ is the gas constant, and $c$ is the concentration of the ligand. However, this model does not distinguish between changes in fold stability and stability of the binding interface since it only considers the free energy difference between all bound and unbound states. In order to separate binding and stability, we define three states: unfolded and unbound (uu), folded and unbound (fu), and folded and bound (fb). The probabilities of these states according to the Boltzmann distribution are

$$p_{uu}=\frac{1}{Z},\qquad p_{fu}=\frac{e^{-G_f/RT}}{Z},\qquad p_{fb}=\frac{e^{-(G_f+G_b)/RT}}{Z}=\frac{1}{1+e^{G_b/RT}\left(1+e^{G_f/RT}\right)}, \tag{3}$$

with normalization $Z=1+e^{-G_f/RT}+e^{-(G_f+G_b)/RT}$, folding energy $G_f$ and binding energy $G_b$, which implicitly depend on sequence $\sigma$. While the folding energy is the energy difference between the fu and uu states, we define the binding energy as the difference between the fb and fu states, which also includes a constant that depends on the concentration of ligand (the chemical potential). Therefore, this marginal binding energy measures only the destabilization of the binding interface, and is related to the free energy of a two state model, $G$, by

$$e^{G/RT}=e^{G_b/RT}\left(1+e^{G_f/RT}\right). \tag{4}$$

The probability of the unfolded and bound state is negligible, and the absence of this state means that pfb does not factorize into separate probabilities of binding and folding. The physical causes of the nonindependence of binding and folding are irrelevant in equilibrium, where only the differences in free energy determine the probabilities.
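The three state model can be checked numerically. The sketch below computes the Boltzmann probabilities from their unnormalized weights, compares the bound fraction with the closed form of equation (3), and evaluates the effective two state energy of equation (4). It assumes energies in kcal/mol and T = 297 K as quoted in Materials and Methods; the function names and the example energies are illustrative.

```python
import math

RT = 1.987204e-3 * 297.0  # gas constant (kcal/mol/K) times T = 297 K

def state_probabilities(g_fold, g_bind):
    """Return (p_uu, p_fu, p_fb) from Boltzmann weights, energies in kcal/mol."""
    w_uu = 1.0                                   # reference state
    w_fu = math.exp(-g_fold / RT)                # folded, unbound
    w_fb = math.exp(-(g_fold + g_bind) / RT)     # folded, bound
    z = w_uu + w_fu + w_fb
    return w_uu / z, w_fu / z, w_fb / z

def bound_fraction(g_fold, g_bind):
    """Closed form of p_fb in equation (3)."""
    return 1.0 / (1.0 + math.exp(g_bind / RT) * (1.0 + math.exp(g_fold / RT)))

def two_state_energy(g_fold, g_bind):
    """Effective two state free energy G defined by equation (4)."""
    return g_bind + RT * math.log1p(math.exp(g_fold / RT))

# Hypothetical variant: stable fold (G_f = -2) and weak binding (G_b = +1).
p_uu, p_fu, p_fb = state_probabilities(-2.0, 1.0)
```

Plugging `two_state_energy` into the two state expression 1/(1+e^{G/RT}) gives the same bound fraction, which is why a binding measurement alone cannot tell the two energies apart.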

We only consider pfb, as this is what is measured in the experiment. Intuitively, low binding and folding energy leads to large pfb, and the shaded areas in figure 1A show regions in energy space with each label indicating the most likely state. This model also accounts for binding mediated stability in the region where folding alone is unfavorable, Gf > 0, but the folded and bound state is still likely, Gf+Gb<0.

Fig. 1.

Thermodynamic model of three protein states: unfolded and unbound, folded and unbound, and folded and bound, described by equation (3). Shaded areas correspond to regions in energy space where the labeled state is dominant. (A) Given sequences and binding fractions $p_{fb}$, the nonlinear Boltzmann form (eq. 3) imposes constraints on the possible parameters (additive energies). Solid lines are the energies compatible with $p_{fb}$ for four hypothetical sequences: wild-type (WT), two single mutants (Mut A, Mut B), and a double mutant (Mut AB). Dashed lines represent additive energies, and connect the wild-type to the single mutants and to the double mutant, which has lengths equal to the sum of additive energies. (B) Folding versus binding energy for single mutants (additive effects). Gray dot is the wild-type energy, dashed line is where binding fraction is the same as wild-type, $p_{fb}=p_{fb}^W$, and below which binding fraction is higher.

Given an experimentally measured pfb of a single variant, there are many values Gb and Gf that match pfb, and it is not possible to identify these energies, as shown by the contour lines of equal pfb in figure 1A. However, with the approximation of additivity in energy over sites, it is possible to combine information from many sequences to estimate energies. Additivity means that the energy is a sum over energies specific to the amino acid at each site

$$G_f(\sigma)=G_f^W+\sum_i g_f(i,\sigma_i) \tag{5}$$
$$G_b(\sigma)=G_b^W+\sum_i g_b(i,\sigma_i) \tag{6}$$

where subscripts $f$ and $b$ indicate folding and binding, respectively, $\sigma_i$ is the amino acid at position $i$, and $g(i,\sigma_i^W)=0$ so that the wild-type energies are $G_f^W$ and $G_b^W$. While additivity has been observed in many experimental measurements of changes in fold stability (Wells 1990; Sandberg and Terwilliger 1993), it is a local approximation that is likely to be violated for highly mutated sequences.
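A sketch of the additive energy model of equations (5) and (6): the energy of a variant is the wild-type energy plus one term per mutated site. The dictionary layout keyed by (position, amino acid), the toy four-residue sequences, and the numerical values are all hypothetical.

```python
def additive_energy(sequence, wild_type, g_wild_type, effects):
    """Sum site-specific energies as in equations (5) and (6).

    effects maps (position, amino_acid) -> additive energy; wild-type
    residues contribute zero, so they need no entry in the dictionary.
    """
    total = g_wild_type
    for position, (aa, aa_wt) in enumerate(zip(sequence, wild_type), start=1):
        if aa != aa_wt:
            total += effects[(position, aa)]
    return total

# Toy example with made-up folding effects (kcal/mol).
toy_effects = {(2, "G"): 1.5, (4, "L"): -0.5}
g_double = additive_energy("AGDL", "ACDE", g_wild_type=-3.0, effects=toy_effects)
```

The same function serves for folding and binding; only the wild-type energy and the table of effects differ.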

To illustrate how multiple data points constrain this nonlinear model, consider a hypothetical quartet of sequences and their measured binding fraction pfb: the wild-type, two single mutants, and a double mutant. The lines in figure 1A are the energy coordinates consistent with the given pfb, and the dashed lines are the additive energies that connect the wild-type (red line) to the single mutants (black lines), and to the double mutant (blue line). The parameters are not constrained given a wild-type pfb and single mutant pfb, as a rectangle can be placed anywhere between two lines as long as the opposing corners land on them. However, when considering all four sequences the largest rectangle must have lengths that are the sum of the smaller rectangle lengths due to additivity, and the nonlinearity of the curves constrains the parameters (additive energies) that can fit the data, with more data providing more constraints.
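The contour lines of constant binding fraction can be traced by inverting equation (3) for the binding energy at a chosen folding energy. The sketch below, with hypothetical names and values, shows that a single measured binding fraction is matched by a whole family of energy pairs, which is the degeneracy that the additivity assumption is used to break.

```python
import math

RT = 1.987204e-3 * 297.0  # kcal/mol at T = 297 K

def bound_fraction(g_fold, g_bind):
    """p_fb from equation (3)."""
    return 1.0 / (1.0 + math.exp(g_bind / RT) * (1.0 + math.exp(g_fold / RT)))

def binding_energy_on_contour(p_fb, g_fold):
    """Binding energy that reproduces p_fb at a given folding energy."""
    return RT * (math.log(1.0 / p_fb - 1.0) - math.log1p(math.exp(g_fold / RT)))

# Three different energy pairs that all give a bound fraction of 0.3.
contour = [(g_f, binding_energy_on_contour(0.3, g_f)) for g_f in (-4.0, -1.0, 0.5)]
```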

We use all sequences and associated counts in a maximum likelihood framework to infer all additive and wild-type folding and binding energies in kcal/mol. In Materials and Methods, we describe a Poisson likelihood for the counts, use equations (3), (5), and (6), modified to account for nonspecific background binding, and describe a procedure to find the parameters that maximize this likelihood and overcome local optima via bootstrapping. For comparison, we also infer parameters of the two state model with additive energies and a more naive estimate of two state free energy based on $p(\sigma)=\frac{n_1(\sigma)}{n_0(\sigma)\,r}$ (see Materials and Methods).

Inferred Energy Landscape

We compare the inferred additive folding energies to a set of folding energy measurements of 81 single mutants from the literature in figure 2A (various sources, see Table S4 in Olson et al. 2014). The three state model accurately predicts gf in physical units with a root mean squared error of 0.39 kcal/mol and a correlation of ρ=0.91, which is better than computational methods (∼0.6 to ∼0.7) and close to the amount of correlation between replicates of low-throughput methods (∼0.86) (Potapov et al. 2009). Six variants with two to six mutations are also predicted with comparable accuracy (fig. 2A inset, various sources, see Table S5 in Olson et al. 2014). However, a variant computationally designed to be very stable, Gβ1-c3b4 with seven mutations (Malakauskas and Mayo 1998), is 2.1 kcal/mol more stable than predicted, and a variant selected to be stable by in vitro evolution, M2 with four mutations (Wunderlich and Schmid 2006), is 5.3 kcal/mol more stable than predicted, suggesting the presence of significant synergistic epistasis in the folding energy between the mutations in these variants. There is no comparable literature set of binding energies. As discussed below, most of the single mutants have excess stability and weak binding, and therefore many of the binding energies are approximately equal to the two-state free energy difference.

Fig. 2.

(A) Prediction of changes in folding energy $g_f$ (eq. 5, commonly referred to as ΔΔG) by fitting a three state thermodynamic model to deep mutational scanning data. Predicted energies have a root mean squared error of 0.39 kcal/mol and ρ=0.91 compared with a literature set of folding energies of 81 single mutants (table S4, Olson et al. 2014). In the inset are eight variants with two to seven mutations, including two highly stable engineered variants which are poorly predicted, suggesting nonadditivity in stability between the mutated sites (table S5, Olson et al. 2014). (B) Marginal binding energy is well approximated by the two state energy when binding is weak and fold stability is strong: $G_f<0<G_b\approx G$. Main plot shows all single mutants with $G_f<0$ (black, ρ=0.96) and $G_f>0$ (gray). Inset shows the subset of 81 single mutants with measured stabilities, as in (A), and ρ=0.99. All lines have a slope of unity.

The two state model with additive energies fits the data similarly well to the three state model, with a correlation between predicted fitness $\hat{f}(\sigma)=\log\left(\hat{p}(\sigma)/\hat{p}(\sigma^W)\right)$ and measured fitness of 96.4% and 97.1% for two and three state models, respectively (see supplementary fig. S1, Supplementary Material online). However, the additive energies inferred by the two state model have no relation to the literature set of $g_f$ (supplementary fig. S3, Supplementary Material online, ρ=0.10, P value 0.31). Clearly a three state model is necessary to predict folding energy, and has the added benefit of estimating binding energy.

To assess how much sampling noise influences our results, we calculate 95% confidence intervals of gf and gb from the bootstrapped estimates (supplementary fig. S7, Supplementary Material online), and find that they are very narrow compared with the range of effect sizes for most of the 2092 energy parameters. Examining gf and gb across sites and amino acids (fig. 3) reveals a detailed picture of how folding and binding are sensitive to mutations. The energies have striking differences in their patterns, and gf and gb are uncorrelated (ρ=0.03, P value 0.28), indicating no systematic trade-off between binding and folding energy. Some mutations appear to have strong antagonistic effects, such as at positions 23 and 41, neither of which is at the binding interface. The additive folding energies with the most positive and most negative values (at the limit imposed on the parameters, see Materials and Methods) are 54G and 41L, respectively. The double mutant 54G/41L was noted to have large sign epistasis due to a physical interaction (Olson et al. 2014), and the corresponding inferred additive energies are most likely skewed by this specific epistasis that our model cannot account for.

Fig. 3.

Inferred additive binding gb and folding gf energies show strikingly different patterns. Three of the binding sites (27, 31, 43) have strong effects on binding. Many mutations at positions 23 and 41 are beneficial for folding and deleterious for binding, although overall gb and gf are uncorrelated (ρ=0.03, P value 0.28).

A top down view of folding versus binding energy of single and double mutants (single mutants fig. 1B and double mutants supplementary fig. S2, Supplementary Material online) shows where the variants fall into each of the three states. The absolute location of the wild-type is not very meaningful as $G_f^W$ can vary with experimental conditions and $G_b^W$ depends on the concentration of ligand. Relative energies are more reproducible, and figure 1B shows the distribution of single mutants, that is, additive effects around the wild-type. Most variants fall in the region of excess fold stability and weak binding energy, $G_f<0<G_b$, which means that binding fraction is more sensitive to changes in binding and equation (3) can be approximated by $p_{fb}\approx\frac{1}{1+e^{G_b/RT}}$ for these variants. More naive estimates of the two-state free energy $G$ based on counts (see Materials and Methods) of the single mutants are indeed very close to $G_b$ for variants with $G_f<0$ (fig. 2B, correlation 0.96). This is also true of the literature set of folding energies (fig. 2B inset, correlation 0.99), and the correction in the estimate, $G-G_b=RT\log\left(1+e^{G_f/RT}\right)$ (eq. 4) using inferred $G_f^W$ and measured $g_f$, is very small with a median value of −0.002. This explains the lack of correlation between the additive energies of the two state model and the literature set of $g_f$ (supplementary fig. S3, Supplementary Material online), as well as the lack of correlation between changes in fitness and folding energies in Olson et al. (2014). Notably, there is a lack of variants with binding mediated stability in the region bounded by $G_f>0$ and $G_f+G_b<0$, with 54G and 41L as outliers in the single mutants.
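The size of the two state approximation can be gauged with a few lines of arithmetic. The sketch below evaluates the term RT log(1 + e^{G_f/RT}) from equation (4) and compares the exact bound fraction of equation (3) with the approximation that ignores folding; the energies are hypothetical values chosen to lie in the stable, weakly binding regime, and T = 297 K is assumed as in Materials and Methods.

```python
import math

RT = 1.987204e-3 * 297.0  # kcal/mol at T = 297 K

def two_state_correction(g_fold):
    """Difference between the two state energy and G_b implied by equation (4)."""
    return RT * math.log1p(math.exp(g_fold / RT))

def bound_fraction_exact(g_fold, g_bind):
    """p_fb from equation (3)."""
    return 1.0 / (1.0 + math.exp(g_bind / RT) * (1.0 + math.exp(g_fold / RT)))

def bound_fraction_stable_fold(g_bind):
    """Approximation valid when the fold has excess stability (G_f well below 0)."""
    return 1.0 / (1.0 + math.exp(g_bind / RT))

# Hypothetical variant with G_f = -3 and G_b = +1 kcal/mol.
gap = abs(bound_fraction_exact(-3.0, 1.0) - bound_fraction_stable_fold(1.0))
```

The correction shrinks exponentially as the fold becomes more stable, so for strongly folded variants the binding fraction carries almost no information about folding energy.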

The energy landscape of a protein is determined by its structure, and inferred energies are likely to be related to structural features. Residue depth measures distance to the bulk solvent, and in other studies correlates better with changes in stability than accessible surface area (Tan et al. 2013). Supplementary figure S4, Supplementary Material online, shows that root mean squared gf for each site correlates with depth (ρ=0.81), and root mean squared gb has a weaker relation to depth (ρ=0.27). Only three sites at the binding interface (Sauer-Eriksson et al. 1995; Sloan and Hellinga 1999) (27, 31, 43) are very sensitive to binding.

Patterns of Epistasis

While the energies in our models are nonepistatic, the binding fraction and fitness are epistatic due to the nonlinearity between binding fraction and energy (eq. 3), and inferred pairwise epistasis $\hat{J}_{ij}^{ab}$, analogous to equation (2), can be compared with the observed epistasis. We filter out nonbiological epistasis due to experimental limits on measured fitness, that is, nonspecific background binding (see Materials and Methods). The epistasis averaged over amino acids in each pair of sites (fig. 4A and B, full nonaveraged epistasis in supplementary fig. S8, Supplementary Material online) shows that the predictions from the three-state model reproduce the biological epistasis better than the two-state model, with a 16% improvement in correlation, and an 18% improvement in accuracy of predicting the sign. While the two state model vastly underestimates the magnitude of epistasis across the protein, the three state model predicts much of the negative epistasis, but misses clusters of positive epistasis. Clusters of positive epistasis are more visible in figure 4C after filtering out all but the most underestimated epistasis (lowest 5% of normalized errors, see Materials and Methods). As noted in Olson et al. (2014), the residues in these positions had correlated conformational dynamics in NMR and molecular dynamics studies (positions 7, 9, 11, 12, 14, 16, 33, 37, 38, 40, 54, 56, Clore and Schwieters 2004; Lange et al. 2005; Markwick et al. 2007). This suggests that this unexplained epistasis is due to systematic errors not accounted for in the model, such as epistasis in energy or additional conformations, and therefore large prediction errors can identify sites that should be studied in more detail.
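To see how purely additive energies produce epistasis in fitness, the sketch below combines equation (3) with the definition in equation (2). It ignores background binding, and the wild-type energies and mutational effects are hypothetical: two mutations that each destabilize the fold by 2 kcal/mol are individually tolerated by a stable wild type but together push the protein past the folding threshold.

```python
import math

RT = 1.987204e-3 * 297.0  # kcal/mol at T = 297 K

def bound_fraction(g_fold, g_bind):
    """p_fb from equation (3), without background binding."""
    return 1.0 / (1.0 + math.exp(g_bind / RT) * (1.0 + math.exp(g_fold / RT)))

def predicted_epistasis(gf_wt, gb_wt, dgf_a, dgb_a, dgf_b, dgb_b):
    """Fitness epistasis (eq. 2) of two mutations whose energies simply add."""
    p_wt = bound_fraction(gf_wt, gb_wt)

    def fit(g_fold, g_bind):
        return math.log(bound_fraction(g_fold, g_bind) / p_wt)

    f_a = fit(gf_wt + dgf_a, gb_wt + dgb_a)
    f_b = fit(gf_wt + dgf_b, gb_wt + dgb_b)
    f_ab = fit(gf_wt + dgf_a + dgf_b, gb_wt + dgb_a + dgb_b)
    return f_ab - f_a - f_b

# Hypothetical wild type (G_f = -3, G_b = +1) and two destabilizing mutations.
j_fold = predicted_epistasis(-3.0, 1.0, 2.0, 0.0, 2.0, 0.0)
```

In this example the epistasis is negative although no interaction term appears anywhere in the energies.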

Fig. 4.

Patterns of pairwise epistasis observed in the data and predicted by the two (A) and three (B) state thermodynamic models (above the diagonal). Below the diagonal in (A) and (B) are observed pairwise fitness epistasis (eq. 2) averaged across amino acids for each pair of sites. The correlation between observed and predicted epistasis is ρ=0.50 and 0.66 for two and three state models, respectively, and the accuracy of predicting the sign, that is, number of predictions that have same sign as observed divided by total, is 0.54 and 0.73 for two and three state models. Nonbiological epistasis that is a consequence of the experimental limits on measured fitness due to nonspecific background binding was filtered out in all calculations and plots (see Materials and Methods). (C) Above the diagonal is the observed epistasis for sites where the three state model overestimates, $\hat{J}_{ij}>J_{ij}$ (top 5% of normalized errors, see Materials and Methods), and below the diagonal is where it underestimates, $\hat{J}_{ij}<J_{ij}$ (bottom 5% of normalized errors). Underestimated epistasis is largely positive and corresponds to a dynamically correlated network of residues.

A follow up study by Wu et al. (2016) targeted four highly epistatic sites in GB1, and assayed binding for nearly all combinations of mutations, that is, $20^4$ variants. Most of the variants have three or four mutations and very low fitness. Comparing fitness predictions from the three state model trained on the Olson et al. data to the observed fitness shows many deviations (fig. 5), with a correlation as low as 0.48 for the quadruple mutants (supplementary table S1, Supplementary Material online). This is partly due to the increased amount of noise in this experiment, in which quadruple mutants had an average variance in fitness of 0.71, compared with 0.27 for double mutants in Olson et al. (2014). However, the distribution of prediction errors is skewed with ∼20% of variants having fitness underestimated much more than expected from a normal distribution (supplementary fig. S5, Supplementary Material online), suggesting that some unaccounted epistasis is restoring binding or stability for these variants.

Fig. 5.

Fitness is less predictable in a follow-up study that targeted all combinations at four sites (Wu et al. 2016). Panels show true and inferred fitness for one to four mutations from wild-type. A substantial fraction of functional variants are underestimated, suggesting some unaccounted epistasis. See also supplementary figure S5 and table S1, Supplementary Material online.

Supplementary table S1, Supplementary Material online, describes how the three state model predicts whether variants in Wu et al. (2016) are functional with a threshold of $f>-2.5$. The model is good at predicting variants that are not functional, misclassifying only 0.14% of nonfunctional quadruple mutants as functional. However, it is rather poor at predicting functional variants, classifying only 11% of functional quadruple mutants correctly, which again suggests the model is missing some form of positive epistasis that restores functionality.

Discussion

Many DMS studies have measured binding affinity of single mutants; however, such measurements often confound how mutations affect the stability of the interface with the stability of the overall fold. With a large set of double mutants and a few biophysical assumptions, that is, a small number of thermodynamic states and additivity of energy, it is possible to extract a detailed folding and binding energy landscape of a protein by combining information from many variants.

Recent heuristic approaches also use such data to infer statistical models of genotype–phenotype maps (Sarkisyan et al. 2016; Sailer and Harms 2017a; Otwinowski et al. 2018). In contrast to this work, these approaches assume a single additive intermediate trait and a monotonic nonlinear relation between this trait and the measurement. The only thermodynamic model that fits into this framework is a two state model with an additive energy, whereas a three state model is necessary to predict fold stability and explain the observed epistasis of GB1. In our model the nonlinearity was fixed as the Boltzmann distribution, and allowed for inference of energy in physical units. In other experiments, where a simple biophysical relation is not known, a flexible nonlinearity may be more appropriate.

The inferred folding energies match a set of literature measurements very closely and demonstrate that it is possible to produce accurate stability measurements in physical units at high throughput. Several DMS studies have indirectly measured stability. Araya et al. (2012) applied a metric, based on the rescue effect of double mutations, to identify stabilizing mutations, and Rocklin et al. (2017) used proteolysis assays that correlate with stability. Olson et al. (2014) extracted stability measures from their data by searching for single mutation fitness changes in different genetic backgrounds that correlated with the validation set of stability measurements. Further refinements were developed with clustering methods that use structural information and physicochemical properties (Wu et al. 2015). However, these ad hoc methods can suffer from overfitting and require extraneous knowledge to work effectively. In contrast, the thermodynamic model directly infers stability in physical units and does not require any benchmark stability measurements or structural information.

The three state model also infers binding energies that measure the destabilization of the binding interface, in that they are the marginal energy difference between the folded/unbound and folded/bound states. In contrast, dissociation constants, as measured by titration curves, do not account for differences in fold stability. The recently developed tite-seq method (Adams et al. 2016) infers a saturation constant to account for sequence dependent differences in stability and expression, but does not compensate for stability effects in the dissociation constant itself.

The three state model shows good agreement with patterns of epistasis, and deviations from our model identify a network of residues that have correlated conformational dynamics. Since the deviations and measurement noise itself can be rather large, sign epistasis, path accessibility, and other geometric features of the inferred genotype–phenotype map are likely to be distorted. More complex models, which include pairwise energetic interaction terms or more conformational states, may be able to describe the remaining unexplained epistasis. The three state model works well for the relatively small GB1, but larger proteins may need additional states, such as misfolded conformations, to accurately model their properties.

Inferred energy landscapes from DMS studies may also be useful in understanding protein structure. Inferred energy parameters may be useful for calibrating potential functions used in structure prediction (Lee et al. 2017). Recently, heuristic methods (Rollins et al. 2018; Schmiedel and Lehner 2018), similar to those applied to multiple sequence alignments of homologous proteins (Morcos et al. 2014), were able to predict some of the protein contacts in GB1 from the Olson et al. (2014) data, enough to predict its structure. Refined thermodynamic models with pairwise epistatic energy may be able to infer many more protein contacts to improve structure prediction. Thermodynamic models coupled with DMS also provide a way to study intrinsically disordered proteins, which fold and bind simultaneously, but have no persistent structure while unbound (Wright and Dyson 2009). Since many conformations can correspond to these states, free energy differences are a natural way to quantify the properties of disordered proteins.

Despite having a detailed genotype–phenotype map of GB1, the consequences for evolution are not clear since they depend on how the binding fraction affects organismal fitness. Manhart and Morozov (2015) explored the evolutionary dynamics of a fitness function that is a linear combination of the three protein states. This leads to an evolutionary coupling between binding and folding, where selection on folding can drive changes in binding, and vice versa. It may be possible to infer selection coefficients associated with each state, as well as evolutionary trajectories in energy space, from multiple sequence alignments.

Materials and Methods

Poisson Likelihood for an In Vitro Binding Assay

In an in vitro binding assay with one round, the library of protein variants is sequenced before and after binding, and therefore the count or multiplicity of each sequence carries information on the binding. For each variant σ, with initial and final counts n0 and n1, the joint likelihood is a product of two Poisson distributions,

$$\prod_{i=0,1} \lambda_i^{n_i} e^{-\lambda_i} / n_i!,$$

with intensities $\lambda_0 = N_0$ and $\lambda_1 = N_0 p_{fb} r$, where $p_{fb}$ is the fraction of bound (and folded) protein and $r$ is the overall difference between initial and final sequencing. There is an implicit dependence of all variables (except $r$) on $\sigma$. The resulting log-likelihood is

$$LL = -N_0 (1 + p_{fb} r) + n_0 \log(N_0) + n_1 \log(N_0 p_{fb} r), \tag{7}$$

omitting terms that do not depend on parameters. $N_0$ can be eliminated by solving for the maximum likelihood estimate given the other parameters, $N_0^* = \frac{n_0 + n_1}{1 + p_{fb} r}$, plugging it into the likelihood and again dropping terms that do not depend on the parameters:

$$LL = -(n_0 + n_1) \log(1 + p_{fb} r) + n_1 \log(p_{fb} r). \tag{8}$$

The maximum likelihood estimate of the binding fraction is $p_{fb}^* = \frac{n_1}{n_0 r}$, which depends on the unknown parameter $r$. Since some of the counts can be very small, we add a pseudocount of 1/2 to $n_0$ and $n_1$ to slightly regularize the fitness estimates.
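As an illustration of equation (8) and its closed-form maximizer, the following minimal sketch evaluates the per-variant log-likelihood and the estimate $n_1/(n_0 r)$. The function names and the example counts are hypothetical and are not taken from the published code.

```python
import math


def profile_log_likelihood(n0, n1, p_fb, r):
    """Equation (8): per-variant log-likelihood after eliminating N0."""
    x = p_fb * r
    return -(n0 + n1) * math.log(1.0 + x) + n1 * math.log(x)


def binding_fraction_mle(n0, n1, r, pseudocount=0.5):
    """Closed-form maximizer n1 / (n0 * r), with a pseudocount added to both counts."""
    return (n1 + pseudocount) / ((n0 + pseudocount) * r)


# Hypothetical counts and global factor r, for illustration only.
n0, n1, r = 120, 40, 9.0
p_star = binding_fraction_mle(n0, n1, r, pseudocount=0.0)
ll_at_max = profile_log_likelihood(n0, n1, p_star, r)
```

Setting `pseudocount=0.0` recovers the unregularized estimate, so that `ll_at_max` is the largest value equation (8) can take for these counts.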

Three State Thermodynamic Inference

We use the likelihood in equation (8), and parameterize $p_{fb}$ as a thermodynamic model following the Boltzmann distribution. We modify $p_{fb}$ to account for a low level of nonspecific background binding $p_0$:

$$p_{fb} = \frac{1}{1 + e^{G_b/RT} \left(1 + e^{G_f/RT}\right)} (1 - p_0) + p_0, \tag{9}$$

and the energies have an additive relation to sequence defined by eqs. 5 and 6. The total log-likelihood is the sum of the per variant log-likelihoods, $LL = \sum_{\sigma} LL(\sigma)$.
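The three state binding fraction can be sketched as follows. This is a minimal illustration that assumes the additive energies take the form of a wild-type energy plus one term per site, by analogy with the two state model in this section. The function names, the dictionary layout of the site effects, and the example values are hypothetical.

```python
import math

# Gas constant in kcal/(mol K) times T = 297 K.
RT = 0.0019872 * 297.0


def additive_energy(wild_type_energy, site_effects, sequence):
    """Wild-type energy plus one additive term per site; missing entries count as zero."""
    return wild_type_energy + sum(
        site_effects.get((i, aa), 0.0) for i, aa in enumerate(sequence)
    )


def three_state_binding_fraction(g_fold, g_bind, p0, rt=RT):
    """Bound (and folded) fraction with background binding p0, energies in kcal/mol."""
    z = 1.0 + math.exp(g_bind / rt) * (1.0 + math.exp(g_fold / rt))
    return (1.0 - p0) / z + p0


# Hypothetical energies: a stable fold, and the same fold with a destabilizing mutation.
p_stable = three_state_binding_fraction(g_fold=-3.0, g_bind=-1.0, p0=0.0007)
p_destabilized = three_state_binding_fraction(g_fold=1.0, g_bind=-1.0, p0=0.0007)
```

Because the binding fraction saturates at both ends, equal additive changes in folding energy can produce very different changes in binding, which is the source of epistasis in this model.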

This log-likelihood is nonconvex, and is optimized using the NLopt library (Johnson 2018), which implements the method of moving asymptotes algorithm (Svanberg 2002), and uses the log-likelihood gradients with respect to $r$, $p_0$, and the energies. The initial parameters were $r = 1$, $p_0 = 0.01$, and all energies set to zero. $r$ and $p_0$ were reparameterized as $e^{r}$ and $e^{p_0}$ inside the optimization function, so that the original parameters are nonnegative. In the optimization algorithm, upper and lower bounds on the energy parameters are set to keep the optimization from stopping prematurely due to very small gradients. The value of ±15, in units of $RT$, was chosen by optimizing with different bounds, ±10, ±15, ±20, and ±25, and choosing the result with the highest likelihood. In addition, a bootstrapping procedure, described below, is used to overcome local optima. $R$ is the standard gas constant and $T = 297$ K, and therefore the bounds are ±8.85 kcal/mol. The inferred global parameters are $r = 8.97$ and $p_0 = 0.0007$.
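The conversion of the dimensionless bound into physical units can be checked with a short worked calculation, assuming the gas constant is expressed in kcal/(mol K):

```latex
RT = 1.987 \times 10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}} \times 297\ \mathrm{K}
   \approx 0.590\ \mathrm{kcal/mol},
\qquad
15 \times RT \approx 8.85\ \mathrm{kcal/mol}.
```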

To validate the inference procedure, we generate artificial data utilizing the same sequences of the GB1 data set, but with different counts based on randomized parameters (see supplementary fig. S6, Supplementary Material online, for details).

Two State Model Thermodynamic Inference

We infer a two state model analogously with additive energies $G_2(\sigma) = G_2^W + \sum_i g_2(i, \sigma_i)$ and

$$p_{fb} = \frac{1}{1 + e^{G_2/RT}} (1 - p_0) + p_0. \tag{10}$$

The inferred global parameters are $r = 8.23$ and $p_0 = 0.0008$.

A more naive inference of two state free energy can be based directly on the counts. The free energy is

$$G = RT \log\left(\frac{1 - p_{fb}}{p_{fb} - p_0}\right),$$

which can be estimated by using $p_{fb} = \frac{n_1}{n_0 r}$ and the values of $r$ and $p_0$ inferred from the three state model. This quantity does not assume additivity in energy and is related to the observed fitness (eq. 1), except that it depends on inferred global parameters. A small number of variants with $p_{fb} < p_0$ were omitted in figure 2B, since $G$ was then undefined and the binding fraction is presumably too low to estimate accurately.
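The naive estimate amounts to inverting the two state binding curve, which the following minimal sketch illustrates. The function names are hypothetical. Returning `None` when the estimated binding fraction is at or above 1 is an added safeguard that the text does not discuss; the text only mentions omitting variants below $p_0$.

```python
import math

# Gas constant in kcal/(mol K) times T = 297 K.
RT = 0.0019872 * 297.0


def two_state_binding_fraction(g2, p0, rt=RT):
    """Two state binding fraction with background binding p0."""
    return (1.0 - p0) / (1.0 + math.exp(g2 / rt)) + p0


def free_energy_from_fraction(p_fb, p0, rt=RT):
    """Invert the two state curve; None where the logarithm is undefined."""
    if not (p0 < p_fb < 1.0):
        return None
    return rt * math.log((1.0 - p_fb) / (p_fb - p0))


def naive_free_energy(n0, n1, r, p0, rt=RT, pseudocount=0.5):
    """Free energy estimated directly from counts via p_fb = n1 / (n0 * r)."""
    p_fb = (n1 + pseudocount) / ((n0 + pseudocount) * r)
    return free_energy_from_fraction(p_fb, p0, rt)


# Round trip with a hypothetical energy of -2 kcal/mol.
g_recovered = free_energy_from_fraction(two_state_binding_fraction(-2.0, 0.0007), 0.0007)
```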

Overcoming Local Optima and Bootstrapping Confidence Intervals

Since the optimization algorithm can get stuck in local optima, we use a bootstrap restarting procedure to overcome local optima related to sampling noise (Wood 2001), and to generate a bootstrap distribution of parameters that quantifies their uncertainty due to sampling noise. The maximum likelihood parameters from the fit described earlier, $\theta^*$, are the initial parameters in an iterative procedure that alternates between optimizing on the original and the bootstrapped data. Each iteration consists of: 1) Creating bootstrapped data with counts drawn from Poisson distributions with means $n_0$ and $n_1$. 2) Searching for the maximum likelihood parameters $\theta'$ on the bootstrapped data, with initial parameters $\theta^*$. 3) Searching for the maximum likelihood parameters $\theta''$ on the original data, with initial parameters $\theta'$. 4) If the likelihood is no better than the best optimization within a small threshold, $LL(\theta'') \le LL(\theta^*) + \eta$ ($\eta = 0.0001$), adding the bootstrapped parameters $\theta'$ to a list. 5) If the likelihood is better than the previous best, $LL(\theta'') > LL(\theta^*) + \eta$, updating the best parameters, $\theta^* \leftarrow \theta''$, and deleting the list of bootstrapped parameters. 6) Terminating the procedure once 100 bootstraps have been accumulated in the list.
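The restart loop can be sketched on a toy problem. This is an illustration under strong simplifying assumptions, not the published implementation: the only free parameter is $\log r$, the binding fractions are held fixed, a damped Newton ascent stands in for the NLopt optimizer, and the counts are hypothetical. All function names are made up for this example.

```python
import math
import random


def log_likelihood(theta, data):
    """Sum of equation (8) over variants, with r = exp(theta) and p_fb held fixed."""
    total = 0.0
    for n0, n1, p in data:
        x = p * math.exp(theta)
        total += -(n0 + n1) * math.log(1.0 + x) + n1 * math.log(x)
    return total


def local_fit(data, theta, steps=50):
    """Damped Newton ascent started from theta (stand-in for a local optimizer)."""
    for _ in range(steps):
        grad = hess = 0.0
        for n0, n1, p in data:
            s = 1.0 / (1.0 + math.exp(-(theta + math.log(p))))
            grad += n1 - (n0 + n1) * s
            hess -= (n0 + n1) * s * (1.0 - s)
        theta += max(-1.0, min(1.0, -grad / hess))  # cap the step size at 1
    return theta


def poisson(mean, rng):
    """Knuth's sampler; adequate for the moderate means used in this toy example."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def bootstrap_restart(data, theta_init, n_boot=100, eta=1e-4, seed=0):
    rng = random.Random(seed)
    best = local_fit(data, theta_init)
    accepted = []
    while len(accepted) < n_boot:
        # 1) resample counts around the observed counts
        boot = [(poisson(n0, rng), poisson(n1, rng), p) for n0, n1, p in data]
        # 2) fit the bootstrapped data, starting from the current best
        theta_boot = local_fit(boot, best)
        # 3) refit the original data, starting from the bootstrapped solution
        theta_new = local_fit(data, theta_boot)
        if log_likelihood(theta_new, data) > log_likelihood(best, data) + eta:
            # 5) a better optimum was found: restart the bootstrap list
            best, accepted = theta_new, []
        else:
            # 4) no improvement: keep the bootstrapped parameters
            accepted.append(theta_boot)
    return best, accepted


# Hypothetical (n0, n1, p_fb) triples, constructed so that n1 = n0 * p_fb * 9.
data = [(100, 45, 0.05), (80, 144, 0.2), (120, 540, 0.5), (60, 54, 0.1)]
best, boots = bootstrap_restart(data, theta_init=0.0, n_boot=20)
```

In this toy problem the objective has a single optimum, so the reset branch is not expected to trigger; in a nonconvex fit it is the branch that lets a bootstrap replicate carry the search out of a local optimum. The spread of `boots` gives the bootstrap distribution of the parameter.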

Sample Variance of Fitness and Epistasis

Since estimated fitness is asymptotically Gaussian, the sample variance of fitness is equal to the inverse of the curvature of the log-likelihood surface. Replacing $p_{fb}$ with $e^{f}$ in equation (8), the asymptotic variance of $f$ is $\left(-\frac{\partial^2 LL}{\partial f^2}\right)^{-1}$. The variance of the fitness estimate, as defined in equation (1), is the sum of the focal and wild-type variances

$$v_f(\sigma) = \frac{1}{n_0(\sigma)} + \frac{1}{n_1(\sigma)} + \frac{1}{n_0(\sigma^W)} + \frac{1}{n_1(\sigma^W)}, \tag{11}$$

and the variance in epistasis is the sum of the single and double mutant variances

$$v_{J_{ij}^{ab}} = v_f\left(\sigma^W_{/(i,a)/(j,b)}\right) + v_f\left(\sigma^W_{/(i,a)}\right) + v_f\left(\sigma^W_{/(j,b)}\right).$$
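The count-based variances can be checked numerically. The sketch below compares the closed form $1/n_0 + 1/n_1$ for a single variant with a finite-difference estimate of the inverse curvature of the per-variant log-likelihood at its maximum. The function names and counts are hypothetical, and the counts are assumed to be nonzero (for example after adding the pseudocount).

```python
import math


def fitness_variance(n0, n1, n0_wt, n1_wt):
    """Focal plus wild-type variances of the fitness estimate."""
    return 1.0 / n0 + 1.0 / n1 + 1.0 / n0_wt + 1.0 / n1_wt


def epistasis_variance(v_double, v_single_i, v_single_j):
    """Sum of the double mutant and the two single mutant fitness variances."""
    return v_double + v_single_i + v_single_j


def inverse_curvature(n0, n1, r, h=1e-4):
    """Numerical -1 / (d^2 LL / d f^2) at the maximum, with the binding fraction exp(f)."""
    def ll(f):
        x = r * math.exp(f)
        return -(n0 + n1) * math.log(1.0 + x) + n1 * math.log(x)

    f_star = math.log(n1 / (n0 * r))
    second = (ll(f_star + h) - 2.0 * ll(f_star) + ll(f_star - h)) / (h * h)
    return -1.0 / second


# Hypothetical counts: the numerical value should be close to 1/120 + 1/40.
v_numeric = inverse_curvature(120, 40, 9.0)
v_closed = 1.0 / 120 + 1.0 / 40
```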

Pairwise Epistasis

A minimal amount of nonspecific background binding, $p_0$, imposes a lower bound on the measured binding fraction in the experiment, estimated to be $f_0 = \log(p_0 / p(\sigma^W)) = -5.69$ by our three state model. This threshold effect produces large amounts of nonbiological positive pairwise epistasis, for example, when the double mutant has the same level of binding as one of the single mutants at the background level. Therefore, the data shown in figure 4 and supplementary figure S8, Supplementary Material online, exclude $\hat{J}_{ij}^{ab}$ for variants near this threshold, that is, $\min\left(f(\sigma^W_{/(i,a)/(j,b)}), f(\sigma^W_{/(i,a)}), f(\sigma^W_{/(j,b)})\right) < -4.5$. Figure 4 shows epistasis averaged over amino acids $a$, $b$, weighted by the inverse of the epistasis variance

$$J_{ij} = \frac{\sum_{ab} \hat{J}_{ij}^{ab} / v_{J_{ij}^{ab}}}{\sum_{ab} 1 / v_{J_{ij}^{ab}}}.$$

To filter out under- and overestimated epistasis in figure 4C, we calculate the normalized error in epistasis

$$z_{ij} = (\hat{J}_{ij} - J_{ij}) / v_{J_{ij}},$$

where $v_{J_{ij}} = \sum_{ab} v_{J_{ij}^{ab}}$, and find the $J_{ij}$ where $z_{ij}$ is in the top fifth or bottom fifth percentiles.
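The threshold filter and the variance-weighted average over amino acid pairs can be sketched as follows. This is a minimal illustration with hypothetical function names and input values. It assumes the threshold is $-4.5$ on the log scale, since the background level is a log ratio below zero, and that the weights are inverse variances.

```python
def above_background(f_double, f_single_i, f_single_j, threshold=-4.5):
    """True when none of the three fitness values falls below the background threshold."""
    return min(f_double, f_single_i, f_single_j) >= threshold


def weighted_mean_epistasis(j_hat, variances):
    """Inverse-variance weighted average of epistasis values for one pair of sites."""
    weights = [1.0 / v for v in variances]
    return sum(w * j for w, j in zip(weights, j_hat)) / sum(weights)


# Hypothetical entries for one site pair: (epistasis, variance, f_double, f_i, f_j).
entries = [
    (1.0, 1.0, -2.0, -1.0, -0.5),
    (3.0, 3.0, -3.0, -2.0, -1.5),
    (8.0, 0.5, -5.6, -5.0, -1.0),  # near the background level, so excluded
]
kept = [(j, v) for j, v, fd, fi, fj in entries if above_background(fd, fi, fj)]
j_mean = weighted_mean_epistasis([j for j, _ in kept], [v for _, v in kept])
```

Values with larger variance contribute less to the average, so poorly sampled double mutants do not dominate the site-level summary.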

Code Availability

Code, data, inferred models, and analysis are available online at github.com/jotwin/ProteinGthermo, last accessed July 23, 2018.

Supplementary Material

Supplementary Data

Acknowledgments

Research was supported by NIH grant T32AI055400. The author thanks Armita Nourmohammad, Joachim Krug, Michael Laessig, Joshua Plotkin, David McCandlish and Nicholas Wu for helpful comments.

References

  1. Adams RM, Mora T, Walczak AM, Kinney JB. 2016. Measuring the sequence-affinity landscape of antibodies with massively parallel titration curves. eLife 5:e23156.
  2. Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, Colwell LJ, de Koning APJ, Dokholyan NV, Echave J, et al. 2012. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21(6):769–785.
  3. Araya CL, Fowler DM, Chen W, Muniez I, Kelly JW, Fields S. 2012. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proc Natl Acad Sci USA. 109(42):16858.
  4. Bastolla U, Dehouck Y, Echave J. 2017. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr Opin Struct Biol. 42:59–66.
  5. Bershtein S, Serohijos AWR, Shakhnovich EI. 2017. Bridging the physical scales in evolutionary biology: from protein sequence space to fitness of organisms and populations. Curr Opin Struct Biol. 42:31–40.
  6. Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH, Fersht AR. 2005. Thermodynamic prediction of protein neutrality. Proc Natl Acad Sci USA. 102(3):606–611.
  7. Clore GM, Schwieters CD. 2004. Amplitudes of protein backbone dynamics and correlated motions in a small alpha/beta protein: correspondence of dipolar coupling and heteronuclear relaxation measurements. Biochemistry 43(33):10678–10691.
  8. Doud MB, Hensley SE, Bloom JD. 2017. Complete mapping of viral escape from neutralizing antibodies. PLoS Pathog. 13(3):e1006271.
  9. du Plessis L, Leventhal GE, Bonhoeffer S. 2016. How good are statistical models at approximating complex fitness landscapes. Mol Biol Evol. 33(9):2454–2468.
  10. Echave J, Wilke CO. 2017. Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu Rev Biophys. 46(1):85–103.
  11. Fowler DM, Fields S. 2014. Deep mutational scanning: a new style of protein science. Nat Methods 11(8):801–807.
  12. Haddox HK, Dingens AS, Hilton SK, Overbaugh J, Bloom JD. 2018. Mapping mutational effects along the evolutionary landscape of HIV envelope. eLife 7:e34420.
  13. Haldane A, Manhart M, Morozov AV. 2014. Biophysical fitness landscapes for transcription factor binding sites. PLoS Comput Biol. 10(7):e1003683.
  14. Johnson SG. 2018. The NLopt nonlinear-optimization package. Available from: http://ab-initio.mit.edu/nlopt. Last accessed July 23, 2018.
  15. Kim I, Miller CR, Young DL, Fields S. 2013. High-throughput analysis of in vivo protein stability. Mol Cell Proteomics. 12(11):3370–3378.
  16. Kinney JB, Murugan A, Callan CG, Cox EC. 2010. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA. 107(20):9158–9163.
  17. Kowalsky CA, Faber MS, Nath A, Dann HE, Kelly VW, Liu L, Shanker P, Wagner EK, Maynard JA, Chan C, Whitehead TA. 2015. Rapid fine conformational epitope mapping using comprehensive mutagenesis and deep sequencing. J Biol Chem. 290(44):26457–26470.
  18. Lagator M, Paixão T, Barton NH, Bollback JP, Guet CC. 2017. On the mechanistic nature of epistasis in a canonical cis-regulatory element. eLife 6:e25192.
  19. Lange OF, Grubmüller H, de Groot BL. 2005. Molecular dynamics simulations of protein G challenge NMR-derived correlated backbone motions. Angew Chem Int Ed Engl. 44(22):3394–3399.
  20. Lee J, Freddolino PL, Zhang Y. 2017. Ab initio protein structure prediction. Dordrecht: Springer; p. 3–35.
  21. Magliery TJ, Lavinder JJ, Sullivan BJ. 2011. Protein stability by number: high-throughput and statistical approaches to one of protein science’s most difficult problems. Curr Opin Chem Biol. 15(3):443–451.
  22. Malakauskas SM, Mayo SL. 1998. Design, structure and stability of a hyperthermophilic protein variant. Nat Struct Mol Biol. 5(6):470–475.
  23. Manhart M, Morozov AV. 2015. Protein folding and binding can emerge as evolutionary spandrels through structural coupling. Proc Natl Acad Sci USA. 112(6):1797–1802.
  24. Markwick PRL, Bouvignies G, Blackledge M. 2007. Exploring multiple timescale motions in protein GB3 using accelerated molecular dynamics and NMR spectroscopy. J Am Chem Soc. 129(15):4724–4730.
  25. Morcos F, Hwa T, Onuchic JN, Weigt M. 2014. Direct coupling analysis for protein contact prediction. In: Kihara D, editor. Protein structure prediction. Vol. 1137. New York (NY): Springer New York; p. 55–70.
  26. Mustonen V, Kinney J, Callan CG, Lässig M. 2008. Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci USA. 105(34):12376–12381.
  27. Mustonen V, Lässig M. 2005. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA. 102(44):15936–15941.
  28. Olson CA, Wu NC, Sun R. 2014. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr Biol. 24(22):2643–2651.
  29. Otwinowski J, McCandlish DM, Plotkin J. 2018. Inferring the shape of global epistasis. Proc Natl Acad Sci USA. 201804015; DOI: 10.1073/pnas.1804015115.
  30. Otwinowski J, Plotkin JB. 2014. Inferring fitness landscapes by regression produces biased estimates of epistasis. Proc Natl Acad Sci USA. 111(22):E2301–E2309.
  31. Phillips R, Kondev J, Theriot J, Garcia H. 2012. Physical biology of the cell. 2nd ed. New York: Garland Science.
  32. Potapov V, Cohen M, Schreiber G. 2009. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel. 22(9):553–560.
  33. Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, Carter L, Ravichandran R, Mulligan VK, Chevalier A, et al. 2017. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357(6347):168–175.
  34. Rollins NJ, Brock KP, Poelwijk FJ, Stiffler MA, Gauthier NP, Sander C, Marks DS. 2018. 3D protein structure from genetic epistasis experiments. bioRxiv 320721.
  35. Sailer ZR, Harms MJ. 2017a. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205(3):1079–1088.
  36. Sailer ZR, Harms MJ. 2017b. Molecular ensembles make evolution unpredictable. Proc Natl Acad Sci USA. 114(45):11938–11943.
  37. Sandberg WS, Terwilliger TC. 1993. Engineering multiple properties of a protein by combinatorial mutagenesis. Proc Natl Acad Sci USA. 90(18):8367–8371.
  38. Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, Ivankov DN, Bozhanova NG, Baranov MS, Soylemez O, et al. 2016. Local fitness landscape of the green fluorescent protein. Nature 533(7603):397–401.
  39. Sauer-Eriksson AE, Kleywegt GJ, Uhlén M, Jones TA. 1995. Crystal structure of the C2 fragment of streptococcal protein G in complex with the Fc domain of human IgG. Structure 3(3):265–278.
  40. Schmiedel J, Lehner B. 2018. Determining protein structures using genetics. bioRxiv 303875.
  41. Sloan DJ, Hellinga HW. 1999. Dissection of the protein G B1 domain binding site for human IgG Fc fragment. Protein Sci. 8(8):1643–1648.
  42. Starr TN, Thornton JW. 2016. Epistasis in protein evolution. Protein Sci. 25(7):1204–1218.
  43. Svanberg K. 2002. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM J Optimization 12(2):555–573.
  44. Tan KP, Nguyen TB, Patel S, Varadarajan R, Madhusudhan MS. 2013. Depth: a web server to compute depth, cavity sizes, detect potential small-molecule ligand-binding cavities and predict the pKa of ionizable residues in proteins. Nucleic Acids Res. 41(W1):W314–W321.
  45. Traxlmayr MW, Hasenhindl C, Hackl M, Stadlmayr G, Rybka JD, Borth N, Grillari J, Rüker F, Obinger C. 2012. Construction of a stability landscape of the CH3 domain of human IgG1 by combining directed evolution with high throughput sequencing. J Mol Biol. 423(3):397–412.
  46. Wells JA. 1990. Additivity of mutational effects in proteins. Biochemistry 29(37):8509–8517.
  47. Wood SN. 2001. Minimizing model fitting objectives that contain spurious local minima by bootstrap restarting. Biometrics 57(1):240–244.
  48. Wrenbeck EE, Faber MS, Whitehead TA. 2017. Deep sequencing methods for protein engineering and design. Curr Opin Struct Biol. 45:36–44.
  49. Wright PE, Dyson HJ. 2009. Linking folding and binding. Curr Opin Struct Biol. 19(1):31–38.
  50. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R. 2016. Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5:e16965.
  51. Wu NC, Olson CA, Sun R. 2015. High-throughput identification of protein mutant stability computed from a double mutant fitness landscape. Protein Sci. 25(2):530–539.
  52. Wunderlich M, Schmid FX. 2006. In vitro evolution of a hyperstable Gβ1 variant. J Mol Biol. 363(2):545–557.
  53. Wylie CS, Shakhnovich EI. 2011. A biophysical protein folding model accounts for most mutational fitness effects in viruses. Proc Natl Acad Sci USA. 108(24):9916–9921.


Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
