Abstract
There are more ways to synthesize a 100-amino acid (aa) protein (20100) than there are atoms in the universe. Only a very small fraction of such a vast sequence space can ever be experimentally or computationally surveyed. Deep neural networks are increasingly being used to navigate high-dimensional sequence spaces1. However, these models are extremely complicated. Here, by experimentally sampling from sequence spaces larger than 1010, we show that the genetic architecture of at least some proteins is remarkably simple, allowing accurate genetic prediction in high-dimensional sequence spaces with fully interpretable energy models. These models capture the nonlinear relationships between free energies and phenotypes but otherwise consist of additive free energy changes with a small contribution from pairwise energetic couplings. These energetic couplings are sparse and associated with structural contacts and backbone proximity. Our results indicate that protein genetics is actually both rather simple and intelligible.
Subject terms: Genomics, Biophysics, Computational biology and bioinformatics
By experimentally sampling from sequence spaces larger than 1010 and using thermodynamic models, the genetic structure of at least some proteins can be well described, indicating that protein genetics is simpler than anticipated.
Main
Massively parallel experiments allow the effects of single aa changes in proteins to be comprehensively quantified2,3. Similarly, experimental analysis of double mutants is feasible, at least for small proteins4,5. The analysis of higher-order mutants, however, quickly becomes infeasible owing to the combinatorial explosion of possible genotypes. For example, the number of ways to combine one mutation at 34 different sites in a protein is 234 ≈ 1.7 × 1010. Experimental exploration of such a large number of genotypes is extremely challenging6 given current technology, which—so far—has experimentally analysed sequence spaces up to about 106 (refs. 4,7).
Moreover, combining random mutations in even moderate numbers nearly always results in non-functional proteins8,9. For example, only 2–8% of 5 aa variants and <0.2% of 10 aa variants in a small protein domain are expected to be folded if energies combine additively (n = 2 domains; Fig. 1a and Extended Data Fig. 1a). Sampling even tens of millions of random combinatorial genotypes in most proteins will therefore provide almost no information about genetic architecture—the set of rules that govern how mutations combine to determine phenotypes—and will not be useful for training and evaluating predictive models beyond testing the trivial prediction that most genotypes are unfolded.
One strategy for exploring high-dimensional sequence spaces is to use deep learning. Deep neural networks with millions of fitted parameters have proved successful for diverse prediction and protein design tasks, including predicting the effects of combinatorial mutants10–21. However, these models have extremely complicated and difficult to interpret architectures.
It could be that protein genotype–phenotype landscapes are complex, with many interactions between mutations required for accurate prediction. Alternatively, these landscapes might be much simpler, as suggested by energy measurements22 and inferences23,24 and the use of statistical models25,26. For us, a simple model is one with few parameters (so providing a large data compression) and parameters that are interpretable (so providing understanding).
Here we use an experimental design that enriches functional protein sequences to explore the genetic architecture of high-dimensional protein sequence spaces with more than 30 dimensions and more than 1010 genotypes. We find that protein architectures are remarkably simple, with additive energy models providing very good predictive performance. Quantifying the pairwise energetic couplings between mutations further increases predictive power, providing excellent performance in high-dimensional genotype spaces. These couplings are sparse and related to protein 3D structures. The genetic architecture of at least some proteins is therefore very simple, with additive energetics and a small contribution from sparse pairwise structural couplings.
Sampling a 1010 sequence space
We previously showed that the energetic effects of thousands of individual mutations on the stability of a protein can be measured en masse using pooled variant synthesis, selection and sequencing experiments23,24. In these experiments, the effect of each mutation on the cellular abundance of a protein is measured in the wild-type protein and also in a small number of variants with different fold stabilities. For example, using a shallow double-mutant library, we could infer the changes in Gibbs free energy of folding (ΔΔGf) for nearly all mutations (1,056/1,064 = 99%) in the C-terminal SH3 domain of the adaptor protein GRB2 (ref. 23). Similar massively parallel measurements of single-mutant fold stabilities have now been made for other signalling domains, including the oncoprotein KRAS24, and, in vitro, for many, mostly prokaryotic, small domains7.
Combining random mutations in the GRB2-SH3 domain very quickly results in unfolded proteins, with about 98% and more than 99.9% of genotypes with five and ten mutations expected to be unfolded (based on additive energies; see Fig. 1a). This rapid decay of stability as mutations are combined is consistent with experimental measurements of the activity of other proteins8,9. To experimentally explore folded genotypes in high-dimensional sequence spaces, we therefore used a heuristic technique to enrich for conserved fold and function in combinatorial variants. For each possible starting single aa substitution, we iteratively selected further substitutions—one per residue position—that simultaneously maximizes the resulting combinatorial mutant’s predicted abundance and binding to an interaction partner (see Methods). For GRB2-SH3, the largest set of mutations predicted to preserve both molecular phenotypes consisted of 34 single aa substitutions: 25 in surface residues (relative solvent-accessible surface area (RSASA) ≥ 0.25), three in the protein core (RSASA < 0.25) and six mutations in the GAB2 ligand binding interface (ligand distance < 5 Å; Fig. 1b, right).
We synthesized a library (‘library 1’) containing all combinations of these 34 mutants and quantified the cellular abundance of a sample of the 234 ≈ 1.7 × 1010 genotypes using a highly validated pooled selection, abundance protein fragment complementation assay (AbundancePCA23,24,27). In total, we obtained triplicate abundance measurements for 129,320 variants, which is 0.0007% of the sequence space. The measurements were highly reproducible (Pearson’s r > 0.91; Fig. 1d and see Methods).
The symmetrical pod-like shape of the genotype frequency landscape, with the number of genotypes peaking at the intermediate Hamming distance of 17—that is, equidistant from the wild type (zeroth-order) and 34th-order mutant—is recapitulated in the experimentally sampled library (Fig. 1e). Median abundance measurements decrease with increasing number of aa substitutions, but there are still thousands of genotypes with many mutations that, nevertheless, maintain abundance scores that are indistinguishable from that of the wild-type protein (n = 2,706 with more than 20 mutations, two-sided z-test nominal P > 0.05; Fig. 1f).
Genetic prediction with energy models
Quantifying the effects of a large number of multi-mutants allowed us to test the predictive performance of genotype–phenotype models in regions of the genetic landscape beyond the local neighbourhood used for training. For model building and evaluation, we restricted all analyses to variants with quantitative measurements in all three biological replicates (n = 71,233). Notably, our original energy model (Fig. 2a) trained on abundance and ligand binding selections (doubledeepPCA, ddPCA) quantifying the effects of single and double aa mutants only23 explains as much as half of the fitness variance in combinatorial multi-mutants (R2 = 0.5; Fig. 2b, lower-right panel), for which most (94%) include at least 13 aa substitutions in the wild-type sequence. The only trained parameters in this simple model are Gibbs free energy terms for the wild type (ΔGf) and single aa substitutions (ΔΔGf) and a two-parameter (affine) transformation relating the fraction of folded molecules to the AbundancePCA score (fitness; Fig. 2a). That such a large proportion of phenotypic variance is explained by an additive energy model (no specific epistasis/genetic interactions) trained on genotypes containing only one or two genetic changes suggests that the energetic effects of mutations in proteins are largely context-independent.
On the other hand, a linear model—which implicitly assumes that mutation effects combine additively at the phenotypic level in multi-mutants—trained on the same ddPCA data performs much worse (R2 = 0.32). The linear model also systematically underestimates the observed phenotypic effects of mutant combinations (Fig. 2b, lower-left panel), a consequence of not accounting for the scaling of mutational effects owing to protein thermodynamics (global epistasis28–30). For example, introducing a destabilizing mutation in an already-unfolded protein has no effect on the fraction of folded molecules (lower plateau of model in Fig. 2c), which is not captured by a linear model. These results demonstrate a key advantage of fitting energy models: accounting for global epistasis improves the generalizability of predictions beyond the local neighbourhood of the training data.
Fitting linear and energy models to the combinatorial data improves the variance explained by 30% and 13%, respectively (Fig. 2b, upper panels), probably because of the greater amount of training data and (relevant) genetic backgrounds in which the effects of each single aa are quantified: about 50% of all variants in the library (n ≈ 30,000) include—and therefore report on the effects of—any given one of the 34 single aa substitutions, that is, almost three orders of magnitude more measurements per single mutant compared with the relatively shallow (ddPCA) library. Although the fraction of variance explained by first-order linear and energy models is comparable (R2 = 0.62 and 0.63; Fig. 2b, upper panels), the biased regression residuals in the case of the linear model show that this model is less appropriate (Extended Data Fig. 2a). The energy model provides an excellent fit to the data, faithfully capturing the global nonlinear relationship (global epistasis) between observed AbundancePCA fitness and inferred changes in free energy of folding (ΔGf) (Fig. 2c). There is also very good agreement between free energy changes (model parameters) inferred from combinatorial and ddPCA datasets (Pearson’s r = 0.87), but the former tend to be more extreme, once again demonstrating the use of assaying the effects of mutations in greater numbers of genetic backgrounds, thereby allowing their energetic effects to be more accurately estimated (Fig. 2d).
Couplings improve genetic prediction
We next tested whether quantifying non-additive energetic couplings between mutations improved predictive performance. In our combinatorial dataset, each pair of mutations is present in a median of 17,923 genotypes, allowing robust measurement of second-order genetic interaction terms (energetic couplings, ΔΔΔGf (refs. 31,32)). Including all second-order energetic couplings improves model performance by an extra 9% (R2 = 0.72), consistent with expectations that pairwise effects are an important source of specific epistasis in proteins32 (Fig. 2e). Whereas first-order terms are stronger in magnitude and biased towards destabilizing effects, second-order energetic couplings tend to have milder effects centred on zero (Fig. 2f).
Physical contacts and backbone proximity
Measured at the phenotypic level, genetic interactions in proteins have previously been shown to reflect—at least in part—protein structures4,33–36. Combining combinatorial deep mutational scanning with thermodynamic modelling allowed us to infer a total of 561 pairwise energetic couplings, providing an opportunity to interrogate their mechanistic origins and relationship with protein structure. Comparing coupling energy magnitude (absolute folding ΔΔΔGf) with the 3D distance separating mutation pairs in the folded structure (minimal side-chain heavy-atom distance) reveals an L-shaped distribution with the strongest energetic couplings occurring between structurally proximal residues (Fig. 3a; see also Extended Data Fig. 3a). The top five energetic couplings all involve pairs of residues within 5.5 Å and 15 of the top 20 (75%) energetic couplings involve pairs separated by less than 8 Å. Although there is a weak anticorrelation between contact distance and coupling energy strength (Spearman’s ρ = −0.12; Fig. 3a), this trend breaks down for pairs that are not proximal in the primary sequence (Spearman’s ρ = −0.02, backbone distance >5 residues).
On the other hand, comparing coupling strength with separation distance between residues in the primary sequence (along the peptide backbone) reveals a marked inverse relationship that extends over quite large distances (Spearman’s ρ = −0.28) and is robust to the exclusion of direct physical contacts between residues (<5 Å, Spearman’s ρ = −0.27; Fig. 3b and see also Extended Data Fig. 3b). The interaction matrix in Fig. 3c summarizes these observations: the strongest energetic couplings coincide with direct physical contacts (black circles; see also Fig. 3d) and energetic coupling strength decays along the protein backbone (Fig. 3c, near-diagonal versus far off-diagonal cells). The matrix also highlights physical interactions between secondary structural elements as hotspots for strong energetic couplings.
To disentangle the relative importance of these different potential structural determinants of energetic coupling strength, we gathered a collection of quantitative features describing both the number and the type of chemical bonds or interactions existing between the atoms of pairs of residues, as well as their relative positions in the folded structure (Fig. 3e). A linear regression model based on these 12 structural features is predictive of coupling strength (Fig. 3f; see Methods). Notably, the same model performs similarly well when evaluated on a held out, non-overlapping set of inferred energetic couplings derived from an independent combinatorial mutagenesis experiment (‘library 3’, which is described below; Pearson’s r = 0.46, R2 = 0.21). This suggests that, despite its simplicity, the integrated model captures structural determinants of energetic coupling strength and that energetic couplings are caused by structural interactions.
Couplings decay along the peptide chain
To directly test the hypothesis that inter-residue backbone distance is associated with energetic coupling strength independently of 3D contact distance, we designed a combinatorial saturation mutagenesis library involving all possible mutations at four physically proximal surface residues in the same secondary structure element (‘library 2’; Extended Data Figs. 4a and 5a and see Methods). We reasoned that energetic couplings owing to the propagation of perturbations along the protein backbone should also be apparent among solvent-facing residues. In total, we obtained abundance measurements for 138,157 variants (86% of the sequence landscape) and the measurements were highly reproducible (Pearson’s r > 0.89; Extended Data Fig. 5b and see Methods).
The single-mutant effects at these four residues have a larger range than those of the combinatorial library that was designed to conserve fold and function (Extended Data Fig. 5c,d). Therefore, when combined in double, triple and quadruple mutants—the most numerous class—the result is a larger fraction of unfolded variants (Extended Data Fig. 5c–e). A two-state thermodynamic model that includes all first-order and second-order coefficients provides an excellent fit to the data (R2 = 0.93; Extended Data Fig. 5e–g) and inferred folding free energy changes (first-order terms) are highly correlated (Pearson’s r = 0.94) with those obtained previously using an independent shallow double-mutant library (Extended Data Fig. 5h). Although the four mutated residues are physically proximal in 3D space, with all except one pair (H26:T44) separated by less than 5 Å (3.8–8.4 Å; Extended Data Fig. 5i), their relative positions in the primary peptide sequence cover a large range (2–18 residues; Extended Data Fig. 5j). There is no relationship between contact distance and folding coupling strength for these contacting residues (Spearman’s ρ = −0.05; Extended Data Fig. 5i), whereas the relationship for backbone distance is significant (Spearman’s ρ = −0.41; Extended Data Fig. 5j). Indeed, backbone distance is very well correlated with coupling strength when averaging energy terms per residue pair (Spearman’s ρ = −0.94; Extended Data Fig. 5j). The relative position of aa residues in the primary protein sequence is therefore associated with coupling strength independently of their proximity in 3D space.
Higher-order mutants fold and function
Our experiments identified a large number of GRB2-SH3 genotypes containing many mutations that have high cellular abundance (for example, 25,564 genotypes containing more than five mutations; Fig. 1f). To further confirm that abundant multi-mutants are correctly folded and functional, we performed a third combinatorial mutagenesis experiment in which we also tested the ability of GRB2-SH3 variants to bind to a peptide ligand using a protein–protein interaction assay (BindingPCA23,24,37; Fig. 4a). Recognition of the peptide ligand can only occur if the protein adopts its native conformation38 (Fig. 1b,c). We designed a library (library 3) consisting of all combinations of 15 single aa substitutions occurring within a 22-aa residue window, avoiding mutations in our original library in binding interface residues (minimal side-chain heavy-atom distance to the ligand < 5 Å; Extended Data Fig. 4b and also see Methods). The library contains 215 (=32,768) variants and shares six single aa substitutions with our original 234 library. In total, we obtained binding measurements for 25,967 variants and abundance measurements for 31,936 variants (79% and 97% of the sequence landscape, respectively). The measurements were highly reproducible (Pearson’s r > 0.85 and 0.94 for binding and abundance, respectively; Fig. 4c, Extended Data Fig. 6a–f and also see Methods).
Plotting the changes in binding against the changes in abundance for third-order, sixth-order and ninth-order variants shows that most mutations altering binding also alter the concentration of the isolated domain, consistent with previous results and the expectation that changes in protein stability are a main cause of mutational effects on binding23,24,39 (Fig. 4d and Extended Data Fig. 6g). Notably, however, most higher-order mutants that have high abundance scores also bind the GAB2 ligand, indicating that they are correctly folded (Fig. 4d and Extended Data Fig. 6g). For example, 4% (204/4,805) of variants containing nine mutations have abundance indistinguishable from that of the wild-type protein (nominal P > 0.05) and 96% (177/184) of these also bind the ligand (predicted fraction bound molecules > 0.5). Most of the abundant higher-order GRB2-SH3 mutants are thus correctly folded.
Multi-phenotype genetic prediction
The large number of genetic backgrounds in which both single and double aa mutant effects were measured for these two related molecular phenotypes is a rich source of data for thermodynamic modelling. First, considering only the abundance phenotype, we observe that an additive two-state thermodynamic model—with unfolded and folded energetic states—outperforms a linear model when evaluated on held-out variants (R2 = 0.93 versus 0.87; Extended Data Fig. 7a,b). To attain similar predictive performance as the first-order energy model requires inclusion of both second-order and third-order genetic interaction terms in the linear model (Extended Data Fig. 7c,d), representing a massive increase in model complexity (715 versus 16 parameters; that is, greater than 40-fold more). This greater complexity of models that use many specific pairwise and higher-order genetic interaction terms to capture global nonlinearities in data (global epistasis) has been referred to as ‘phantom epistasis’40.
Next, extending previous work23, we used a neural network implementation of a three-state equilibrium model41—with unfolded, folded and bound energetic states (Fig. 4e)—to simultaneously infer the underlying causal free energy changes of both folding and binding (ΔΔGf and ΔΔGb), as well as folding and binding energetic couplings (ΔΔΔGf and ΔΔΔGb) (Fig. 4f). The model fits the data extremely well (Fig. 4g), explaining virtually all of the fitness variance (Fig. 4h), and the inferred folding and binding free energy changes (first-order terms) are well correlated (Pearsons’s r = 0.9 and 0.7) with those obtained previously using an independent shallow double-mutant library23 (Fig. 4i, double mutants). This is the first time, to our knowledge, that a large number of folding (ΔΔΔGf) and binding (ΔΔΔGb) energetic coupling terms have been measured for any protein.
Allostery in ligand-proximal residues
We observe that mutational effects on folding energy tend to be larger than those on binding energy (Fig. 5a), recapitulating previous results23,24. Energetic couplings show the same pattern, with folding coupling energies tending to be larger in magnitude than binding energetic couplings (area under the curve = 0.7, n = 210, P = 3.6 × 10−7, two-sided Mann–Whitney U test, |ΔΔΔGf| mean = 0.087, s.d. = 0.084, |ΔΔΔGb| mean = 0.038, s.d. = 0.035; Fig. 5b). As none of the mutations in this library occur in the binding interface, any notable effects on binding affinity must be through an allosteric mechanism23,24. Plotting absolute free energy changes against the 3D distance to the ligand shows a negative correlation as previously reported23,24 (Spearman’s ρ = −0.46), with mutations in second-shell residues and residues adjacent (in the sequence) to binding interface residues highly enriched for strong allosteric effects on binding affinity (Fig. 5a). Consistent with previous observations23,24, mutations at distal glycine residues have among the strongest effects on binding affinity.
Whereas the mutations with the strongest folding coupling energies are near-diagonal (closely spaced in the primary sequence), particularly between pairs of residues in the beta strand, the strongest binding coupling in the dataset is an interaction between residues P11 and G18 (Fig. 5b). These two residues are proximal in 3D space (<8 Å) and constitute one of only two long-range physical contacts between the mutated residues (backbone distance > 5 residues; Fig. 5b), suggesting that allosteric energetic couplings are also driven by structural contacts.
SRC kinase combinatorial mutagenesis
Finally, to further test the generality of our conclusions, we used the same greedy approach to design a library containing 215 (=32,768) variants in an unrelated and larger protein, the human proto-oncogene tyrosine-protein kinase Src (SRC). We obtained triplicate abundance measurements for 31,557 variants and the measurements were highly reproducible (r > 0.86; Fig. 6b–d). As for our three GRB2-SH3 combinatorial libraries, a second-order energy model was highly predictive of abundance changes (R2 = 0.87; Fig. 6e,f), with energetic couplings predicted by both 3D spatial proximity (Fig. 6g) and backbone proximity (Fig. 6h). The consistency of these results in an unrelated full-length protein further supports their generality.
Discussion
By experimentally quantifying protein fold stability in samples from sequence spaces greater than 1010 in size, we have shown here that the fundamental genetic architecture of at least some proteins is remarkably simple. Thermodynamic models in which the energetic effects of mutations are summed provide very good prediction of fold stability when tens of mutations are combined. Quantifying the pairwise energetic couplings between mutations further increases predictive power, providing very good performance in high-dimensional genotype spaces. The large number of energetic couplings quantified here reveals important principles about their origins: couplings are strongest between structurally contacting residues and coupling strength also decays along the protein backbone.
The energy models used here are very sparse and represent very large data compressions: up to about 108 (234/34) for the additive models and up to about 107 (234/596) for the models with energetic couplings. Analyses of previously published combinatorial protein mutagenesis datasets8,41–43, mutagenesis of a protein interaction interface44, hydrophobic protein cores45, an intrinsically disordered region46, a tRNA43,47 and an alternatively spliced exon40 suggest that this simplicity of genotype–phenotype landscapes is widely observed and probably a general principle of macromolecules and their molecular interactions.
Energy models are grounded in our understanding of protein thermodynamics and their simplicity and interpretability contrasts with the complexity and lack of mechanistic insight provided by deep neural networks. Predictive energy models are likely to have many applications, including for clinical variant effect interpretation48, pathogen and pandemic forecasting49 and protein engineering for biotechnology1. An important challenge moving forward is how to efficiently quantify the free energy changes and energetic couplings for all mutations in proteins of interest. Quantifying mutational effects across diverse genetic backgrounds and homologous sequences may be an efficient way to achieve this50.
Our data do not rule out the importance of higher-order genetic interactions for protein stability. Rather, they show that, when global nonlinearities owing to cooperative protein folding are properly accounted for and measurements are averaged across genetic backgrounds, first-order and pairwise energetic couplings provide sufficient information for many prediction tasks. An important question to address in future work will be the extent to which higher-order energetic interactions become important in even larger sequence spaces, including in the ‘twilight zone’ of structurally homologous proteins with very low sequence identity. Indeed, the superior performance of our models in 215-sized compared with in 234-sized sequence spaces hints that higher-order interactions become increasingly important as sequences diverge. Simple experimental designs should be able to definitively address this question for a diversity of protein folds.
Methods
Combinatorial mutagenesis library designs
Combinatorial library 1
Library 1 was designed using a computationally efficient greedy strategy to search for the largest number of single aa substitutions that, when combined, preserve both fold and function even in the highest-order mutants (Fig. 1b). The algorithm used previously published ddPCA data and thermodynamic modelling results for GRB2-SH3, including inferred single aa substitution free energy changes of folding and binding for this protein23. We showed previously that this model—which assumes that individual inferred folding and binding free energy changes (ΔΔGf and ΔΔGb) combine additively in multi-mutants—accurately predicts the effects of double aa substitutions23. Therefore, this same additive model was used to make predictions about the energetic and phenotypic effects of higher-order mutants explored in the greedy search.
First, the set of candidate single aa mutations was restricted to those with confident free energy changes, defined as those with 95% confidence intervals < 1 kcal mol−1 and whose effects were measured in at least 20 genetic backgrounds (that is, double aa mutations). Candidate mutations were further restricted to those reachable by single-nucleotide substitutions in the wild-type sequence to simplify synthesis of the resulting combinatorial mutagenesis library. The algorithm begins from an arbitrary starting mutation and iteratively selects further mutations at other residue positions until all residues in the protein have been mutated. The heuristic works by selecting further mutations at each step that maximize the fold and function of the current highest-order mutant combination, that is, the geometric mean of model-predicted AbundancePCA and BindingPCA growth rates. This procedure is then repeated for all possible starting mutations.
To visualize and compare the resulting solutions, we also simulated the median AbundancePCA and BindingPCA growth rates of all candidate combinatorial libraries, calculated using a random sample of 10,000 variants. Although the algorithm is not guaranteed to produce the optimal solution at each Hamming distance from the wild-type sequence, the greedy approach nevertheless achieves solutions in which both phenotypes are predicted to be preserved in variants with more than 30 mutations (Extended Data Fig. 1b), beyond which one or both phenotypes are lost. Defining viable libraries as those preserving both molecular phenotypes above 70% of the maximal value (that is, the geometric mean of simulated median AbundancePCA and BindingPCA growth rates) resulted in the largest candidate combinatorial library consisting of all combinations of 34 single aa mutations (Fig. 1 and Extended Data Fig. 1b–d).
Combinatorial library 2
We clustered the contact map (minimal side-chain heavy-atom distance < 5 Å) comprising all GRB2-SH3 surface residues (RSASA ≥ 0.25) existing in secondary structure elements (Extended Data Fig. 4) and selected the following four physically proximal residues for saturation combinatorial mutagenesis: H26, M28, A39 and T44 (see Extended Data Fig. 5).
Combinatorial library 3
This library was designed to include all combinations of 15 single aa substitutions with mild effects (within one-third of the AbundancePCA fitness interquartile range of the wild type23) in close proximity in the primary sequence and reachable by single-nucleotide substitutions while avoiding mutations in binding interface residues (minimal side-chain heavy-atom distance to the ligand < 5 Å). We used a sliding window approach to determine the number of candidate mutant residues in stretches of 20, 21 and 22 consecutive residues in GRB2-SH3 (Extended Data Fig. 4b). Only one window with a width of 22 aa (starting at residue position 10) includes 15 candidate positions (Extended Data Fig. 4b). The final library consisted of all combinations of the following randomly selected candidate mutations at these positions: D10N, P11A, D14N, G15E, G18C, R20S, R21Q, D23E, F24I, H26L, V27I, M28K, D29E, N30T and S31T (see Fig. 4).
Combinatorial library 4: SRC
This library was designed using the same greedy algorithm from data and thermodynamic modelling results for SRC51, including inferred single aa substitution free energy changes of folding and activity for this protein. The design includes 15 single aa substitutions reachable by single nt substitution in a 22 aa window, located in the N-lobe of the SRC kinase domain, avoiding mutations in the activation loop, subsetting folding and activity ddGs to confident energies (95% confidence interval < 1 kcal mol−1) and associated with singles observed in at least seven backgrounds. The final library consisted of all combinations of the following randomly selected candidate mutations at these positions: V329G, G344S, F349V, K343M, E331K, V337A, E332A, M341K, S330N, I336L, T338S, S345T, L346V, P333T and Y340S (see Fig. 6).
Mutagenesis library construction and selection assays
Media and buffers used
LB: 10 g l−1 bacto-tryptone, 5 g l−1 yeast extract, 10 g l−1 NaCl. Autoclaved 20 min at 120 °C.
YPDA: 20 g l−1 glucose, 20 g l−1 peptone, 10 g l−1 yeast extract, 40 mg l−1 adenine sulphate. Autoclaved 20 min at 120 °C.
SORB: 1 M sorbitol, 100 mM LiOAc, 10 mM Tris pH 8.0, 1 mM EDTA. Filter sterilized (0.2-mm nylon membrane, Thermo Scientific).
Plate mixture: 40% PEG3350, 100 mM LiOAc, 10 mM Tris-HCl pH 8.0, 1 mM EDTA pH 8.0. Filter sterilized.
Recovery medium: YPD (20 g l−1 glucose, 20 g l−1 peptone, 10 g l−1 yeast extract) + 0.5 M sorbitol. Filter sterilized.
SC-URA: 6.7 g l−1 yeast nitrogen base without aa, 20 g l−1 glucose, 0.77 g l−1 complete supplement mixture drop-out without uracil. Filter sterilized.
SC-URA/MET/ADE: 6.7 g l−1 yeast nitrogen base without aa, 20 g l−1 glucose, 0.74 g l−1 complete supplement mixture drop-out without uracil, adenine and methionine. Filter sterilized.
Competition medium: SC-URA/MET/ADE + 200 μg ml−1 methotrexate (Merck Life Science), 2% DMSO.
DNA extraction buffer: 2% Triton-X, 1% SDS, 100 mM NaCl, 10 mM Tris-HCl pH 8.0, 1 mM EDTA pH 8.0.
Plasmid construction
For libraries 1–3: GRB2 mutagenesis plasmid pGJJ286: wild-type GRB2-SH3 was digested from pGJJ046 (described previously23) with the restriction enzymes AvrII and HindIII and cloned into the digested plasmid pGJJ191 (described previously24) using T4 ligase (New England Biolabs). AbundancePCA pGJJ046 and pGJJ045 plasmids and BindingPCA pGJJ034 and pGJJ001 plasmids were previously described23. For library 4: pTB043 plasmid containing full-length SRC was described previously51. pTB043 is based on the same backbone as the AbundancePCA plasmids. The difference is that full-length SRC is fused to the DHFR[3] fragment at its N terminus and to the DHFR[1,2] fragment at its C terminus, so DHFR is reconstituted following correct folding of SRC, whereas unfolded SRC genotypes result in degradation of the fusion protein.
Libraries construction
Libraries 1–3: libraries were constructed in two steps. First, an IDT primer containing the chosen combination of mutations was assembled by Gibson into the mutagenesis plasmid pGJJ286. Libraries were then cloned into the yeast plasmids AbundancePCA pGJJ045 and BindingPCA pGJJ001 by digestion/ligation. For the first step, the libraries into the mutagenesis plasmid were assembled by Gibson reaction (in-house preparation) of two fragments. The vector fragment was obtained by polymerase chain reaction (PCR) amplification of pGJJ286 with the oligos shown in Supplementary Tables 1 and 2, incubated with DpnI to remove the template and gel purified using QIAquick gel extraction kit (Qiagen). The insert fragment was obtained by mixing equimolar amounts of IDT mutation primer (Supplementary Tables 1 and 2) and a reverse elongation primer (Supplementary Tables 1 and 2) and incubating for one cycle of annealing/extension with Q5 polymerase (New England Biolabs). dsDNA product was then incubated with ExoSAP-IT (Applied Biosystems) to remove the remaining ssDNA and purified with MinElute columns (Qiagen). 100 ng of vector in a molar ratio of 1:5 with the insert was incubated for 3 h at 50 °C with a Gibson mix 2× prepared in-house. The reaction was desalted by dialysis with membrane filters (MF-Millipore) for 1 h and concentrated 4× using a SpeedVac concentrator (Thermo Scientific). DNA was then transformed into NEB 10-beta High Efficiency Electrocompetent E. coli. Cells were allowed to recover in SOC medium (NEB 10-beta Stable Outgrowth Medium) for 30 min and later transferred to LB medium with spectinomycin overnight. A fraction of cells was also plated into spectinomycin + LB + agar plates to estimate the total number of transformants. 100 ml of each saturated E. coli culture was collected the next morning to extract the mutagenesis plasmid library using the QIAfilter Plasmid Midi Kit (QIAGEN). To obtain the final libraries into the yeast plasmids, libraries in pGJJ286 plasmid were digested with NheI and HindIII, gel purified (MinElute Gel Extraction Kit, QIAGEN) and cloned into pGJJ045 or pGJJ034 digested plasmids with T4 ligase (New England Biolabs) by temperature-cycle ligation following the manufacturer’s instructions, 67 fmol of backbone and 200 fmol of insert in a 33.3-μl reaction. The ligation was desalted by dialysis using membrane filters for 1 h, concentrated 4× using a SpeedVac concentrator (Thermo Scientific) and transformed into NEB 10-beta High Efficiency Electrocompetent E. coli cells.
Library 4: this library was constructed in one step by Gibson reaction of two fragments. The vector fragment was obtained by amplification of pTB043 plasmid with the oligos shown in the Supplementary Tables 1 and 2. The second fragment was obtained with ten cycles of PCR using mutated IDT primer as template (Supplementary Tables 1 and 2).
Methotrexate yeast selection assay
The yeast selection assay was previously described23. The high-efficiency yeast transformation protocol described below (adjusted to a pre-culture of 200 ml of YPDA) was scaled up or down, depending on the number of transformants for each library (Supplementary Table 2). Three independent pre-cultures of BY4742 were grown in 20 ml of standard YPDA at 30 °C overnight. The next morning, the cultures were diluted into 200 ml of pre-warmed YPDA at an OD600nm = 0.3 and incubated at 30 °C for 4 h. Cells were then collected and centrifuged for 5 min at 3,000g, washed with sterile water and SORB medium, resuspended in 8.6 ml of SORB and incubated at room temperature for 30 min. After incubation, 175 μl of 10 mg ml−1 boiled salmon sperm DNA (Agilent Genomics) and 3.5 μg of plasmid library were added to each tube of cells and mixed gently. 35 ml of plate mixture was added to each tube to be incubated at room temperature for a further 30 min. 3.5 ml of DMSO was added to each tube and the cells were then heat shocked at 42 °C for 20 min (inverting tubes from time to time to ensure homogenous heat transfer). After heat shock, cells were centrifuged and resuspended in approximately 50 ml of recovery media and allowed to recover for 1 h at 30 °C. Cells were then centrifuged, washed with SC-URA medium and resuspended in 200 ml SC-URA. 10 μl was plated on SC-URA Petri dishes and incubated for about 48 h at 30 °C to measure the transformation efficiency. The independent liquid cultures were grown at 30 °C for about 48 h until saturation. Saturated cells were diluted again to OD600nm = 0.1 in SC-URA/MET/ADE media and allowed to grow four generations until OD600nm = 1.6 at 30 °C and 200 rpm. A fraction of the culture was then used to inoculate 200 ml of competition media containing methotrexate at a starting OD600nm = 0.05 and the rest was collected and pellets frozen and stored as input. Cells in competition media were allowed to grow for 3–5 generations (Supplementary Table 2), collected and frozen and stored as output.
DNA extractions and plasmid quantification
The DNA extraction protocol used was previously described23. The protocol below is for 100 ml of collected culture at OD600nm ≈ 1.6. Protocols were scaled up or down, depending on the library (Supplementary Table 2). Cell pellets (one for each experiment input/output replicate) were resuspended in 1 ml of DNA extraction buffer, frozen by dry ice/ethanol bath and incubated at 62 °C in a water bath twice. Subsequently, 1 ml of phenol/chloro/isoamyl in a ratio of 25:24:1 (equilibrated in 10 mM Tris-HCl, 1 mM EDTA, pH 8.0) was added, together with 1 g of acid-washed glass beads (Sigma Aldrich) and the samples were vortexed for 10 min. Samples were centrifuged at room temperature for 30 min at 4,000 rpm and the aqueous phase was transferred into new tubes. The same step was repeated twice. 0.1 ml of NaOAc 3 M and 2.2 ml of pre-chilled absolute ethanol were added to the aqueous phase. The samples were gently mixed and incubated at −20 °C for at least 30 min. After that, they were centrifuged for 30 min at full speed at 4 °C to precipitate the DNA. The ethanol was removed and the DNA pellet was allowed to dry overnight at room temperature. DNA pellets were resuspended in 0.6 ml TE 1X and treated with 5 μl of RNase A (10 mg ml, Thermo Scientific) for 30 min at 37 °C. To desalt and concentrate the DNA solutions, the QIAEX II Gel Extraction Kit was used (50 µl of QIAEX II beads, QIAGEN). The samples were washed twice with PE buffer and eluted twice by 125 µl of 10 mM Tris-HCI buffer, pH 8.5. Finally, plasmid concentrations in the total DNA extract (which also contained yeast genomic DNA) were quantified by quantitative PCR using the primer pair oGJJ152–oGJJ153 that binds to the ori region of the plasmids.
Sequencing library preparation
Libraries 1–3: this was shown in ref. 23. Briefly, the sequencing libraries were constructed in two consecutive PCR assays. The first PCR (PCR1) was designed to amplify the mutated protein of interest and to increase the nucleotide complexity of the first sequenced bases by introducing frame-shift bases between the adapters and the sequencing region of interest (Supplementary Tables 1 and 2). The second PCR (PCR2) was necessary to add the remainder of the Illumina adapter and demultiplexing indexes. PCR2 reactions were run for each sample independently using Hot Start High-Fidelity DNA Polymerase. In this second PCR, the remaining parts of the Illumina adapters were added to the library amplicon. The forward primer (5′ P5 Illumina adapter) was the same for all samples (GJJ_1J), whereas the reverse primer (3′ P7 Illumina adapter) differed by the barcode index (Supplementary Table 3) to be subsequently pooled and demultiplexed after deep sequencing. All samples were pooled in an equimolar ratio and gel purified using the QIAEX II Gel Extraction Kit. The purified amplicon library pools were subjected to Illumina 150-bp paired-end NextSeq500 sequencing at the CRG Genomics Core Facility.
Library 4: the method for preparing the library for sequencing was the same as for the other libraries but in the second PCR step, we used a barcoded index in the forward primer as well (5′ P5 Illumina adapter). The purified amplicon library pool was sequenced with an Illumina paired-end NextSeq2000 machine this time.
Sequencing data processing
FastQ files from paired-end sequencing of all AbundancePCA and BindingPCA experiments were processed with DiMSum v1.3 (ref. 52) using default settings with small adjustments (https://github.com/lehner-lab/DiMSum). Supplementary Table 4 contains DiMSum fitness estimates and associated errors for all experiments. Experimental design files and command-line options required for running DiMSum on these datasets are available on GitHub (https://github.com/lehner-lab/archstabms). Variants with fewer than ten input read counts in any replicate were discarded (‘fitnessMinInputCountAll’ option), that is, only variants observed in all replicates above this threshold were retained. For library 1, we also included fitness estimates that derived from a subset of replicates whose input read counts exceeded this threshold (‘fitnessMinInputCountAny’ option; see Fig. 1).
For library 1, we also included a wild-type-only sample for sequencing using pGJJ046 as template to derive empirical estimates of sequencing error rates. The FastQ file for this sample was processed identically to those of the replicate input/output samples in the first-pass analysis with DiMSum with permissive base quality thresholds (‘vsearchMinQual = 5’ and ‘vsearchMaxee = 1000’). Read counts for all variants were then adjusted by subtracting the expected number of sequencing errors derived from the wild-type-only sample and proportional to the total sequencing library size of each sample. Finally, fitness estimates and associated errors for library 1 were then obtained from the resulting corrected variant counts with DiMSum (‘countPath’ option).
Thermodynamic modelling with MoCHI
We used MoCHI43 (https://github.com/lehner-lab/MoCHI) to fit all thermodynamic models to combinatorial DMS data using default settings with small adjustments. The software is based on our previously described genotype–phenotype modelling approach23, with extra functionality and improvements for ease of use and flexibility24,43. Models fit to shallow (double-mutant) libraries and used in the analyses described in this work (for example, combinatorial mutagenesis library designs) were obtained using the original software implementation23.
We model protein folding as an equilibrium between two states: unfolded (u) and folded (f), and protein binding as an equilibrium between three states: unfolded and unbound (uu), folded and unbound (fu) and folded and bound (fb). We assume that the probability of the unfolded and bound state (ub) is negligible and free energy changes of folding and binding are additive, that is, the total binding and folding free energy changes of an arbitrary variant relative to the wild-type sequence is simply the sum over residue-specific energies corresponding to all constituent single aa substitutions.
We configured MoCHI parameters to specify a neural network architecture consisting of additive trait layers (free energies) for each biophysical trait to be inferred (folding or folding and binding for AbundancePCA or BindingPCA, respectively), as well as one linear transformation layer per observed phenotype. The specified nonlinear transformations ‘TwoStateFractionFolded’ and ‘ThreeStateFractionBound’ derived from the Boltzmann distribution function relate energies to proportions of folded and bound molecules, respectively (see Figs. 2a and 4e,f). The target (output) data to fit the neural network comprise fitness scores for the wild-type and aa substitution variants of all mutation orders. The inclusion of both first-order and second-order (pairwise energetic coupling) model coefficients in the models was specified using the ‘max_interaction_order’ option.
A random 30% of aa substitution variants of all mutation orders was held out during model training, with 20% representing the validation data and 10% representing the test data. Validation data were used to evaluate training progress and optimize hyperparameters (batch size). Optimal hyperparameters were defined as those resulting in the smallest validation loss after 100 training epochs. Test data were used to assess final model performance.
MoCHI optimizes the parameters θ of the neural network using stochastic gradient descent on a loss function based on a weighted and regularized form of mean absolute error:
in which yn and σn are the observed fitness score and associated standard error, respectively, for variant n, ŷn is the predicted fitness score, N is the batch size and λ2 is the L2 regularization penalty. To penalize very large free energy changes (typically associated with extreme fitness scores), we set λ2 to 10−6, representing light regularization. The mean absolute error is weighted by the inverse of the fitness error () to downweight the contribution of less confidently estimated fitness scores to the loss. Furthermore, to capture the uncertainty in fitness estimates, the training data were replaced with a random sample from the fitness error distribution of each variant. The validation and test data were left unaltered.
Models were trained with default settings, that is, for a maximum of 1,000 epochs using the Adam optimization algorithm with an initial learning rate of 0.05 (except for library 1, for which we used an initial learning rate of 0.005). MoCHI reduces the learning rate exponentially (γ = 0.98) if the validation loss has not improved in the most recent ten epochs compared with the preceding ten epochs. Also, MoCHI stops model training early if the wild-type free energy terms over the most recent ten epochs have stabilized (standard deviation ≤ 10−3).
Free energies are calculated directly from model parameters as follows: ΔGb = θbRT and ΔGf = θfRT, in which T = 303 K and R = 0.001987 kcal K−1 mol−1. We estimated the confidence intervals of model-inferred free energies using a Monte Carlo simulation approach. The variability of inferred free energy changes was calculated between ten separate models fit using data from: (1) independent random training–validation–test splits and (2) independent random samples of fitness estimates from their underlying error distributions. Confident inferred free energy changes are defined as those with Monte Carlo simulation-derived 95% confidence intervals < 1 kcal mol−1. Supplementary Table 5 contains inferred binding and folding free energy changes and energetic couplings from all second-order models.
Linear model to predict energetic coupling strength
We built a linear model to predict energetic coupling strength (absolute value of energetic coupling terms) from 12 features (see Fig. 3e), comprising five distance metrics for residue pairs or positions thereof in the protein structure: backbone distance (linear 1D distance separating residue pairs along the primary aa sequence), inter-residue distance (minimal side-chain heavy-atom distance in 3D space), number of core residues (0, 1 or both residues in the pair with RSASA < 0.25), number of binding interface residues (0, 1 or both with minimal side-chain heavy-atom distance to the ligand < 5 Å), number of beta-sheet residues (0, 1 or both in beta strands) and seven features describing the number of chemical bonds or interactions between the atoms of pairs of residues as calculated using the GetContacts software tool (https://getcontacts.github.io/): backbone to backbone hydrogen bonds, side chain to backbone hydrogen bonds, side chain to side chain hydrogen bonds, pi–cation interactions, pi-stacking interactions, salt-bridge interactions and van der Waals interactions. Before running GetContacts, we used PyMOL to fill missing hydrogens (‘h_add’ command), FoldX53 to restore the wild-type proline at position 54 that is mutated in the reference crystal structure (PDB: 2VWF; ‘PositionScan’ command) and removed GAB2 ligand atoms. The training dataset comprised energetic couplings inferred from library 1 and the test set comprised independently inferred energetic couplings from library 3 (see Fig. 3f).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-024-07966-0.
Supplementary information
Acknowledgements
This work was funded by a European Research Council Advanced Grant (883742), the Spanish Ministry of Science and Innovation (LCF/PR/HR21/52410004, EMBL Partnership, Severo Ochoa Centre of Excellence), the Bettencourt Schueller Foundation, the AXA Research Fund, Agencia de Gestio d’Ajuts Universitaris i de Recerca (AGAUR, 2017 SGR 1322) and the CERCA Program/Generalitat de Catalunya. A.J.F. was funded by a Ramón y Cajal fellowship (RYC2021-033375-I) financed by the Spanish Ministry of Science and Innovation (MCIN/AEI/10.13039/501100011033) and the European Union (NextGenerationEU/PRTR). A.M.-A. was funded by a fellowship from ”laCaixa” Foundation (ID 100010434, fellowship code B006052). We thank all members of the Lehner Lab for helpful discussions and suggestions.
Extended data figures and tables
Author contributions
B.L. and A.J.F. conceived the project, motivated in part by preliminary unpublished analyses of evolutionary couplings performed by J.M.S. B.L., A.J.F. and A.M.-A. designed the combinatorial mutagenesis libraries and experiments. A.B. provided unpublished data and advice for the SRC experiments. A.M.-M. and C.H.-C. performed the experiments, with help from A.J.F. A.J.F. led the data analysis, with help from A.M.-A. A.J.F. and B.L. wrote the manuscript, with input from A.M.-A. and C.H.-C.
Peer review
Peer review information
Nature thanks Willow Coyote-Maestas, Elena Kuzmin and Gabriel Rocklin for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
All DNA sequencing data have been deposited in the Gene Expression Omnibus (GEO) with accession number GSE246322. Associated fitness measurements and free energies are provided in Supplementary Tables 4 and 5. Shallow double-mutant ddPCA DNA sequencing data for GRB2-SH3 and PSD95-PDZ3 are available in the GEO with accession number GSE184042, and the processed data used in this study can be found in Supplementary Tables 6 and 7 of the corresponding publication (10.1038/s41586-022-04586-4). Protein structures are available from the Protein Data Bank for GRB2-SH3 (entry ID: 2VWF) and SRC (entry ID: 2SRC).
Code availability
Source code for fitting thermodynamic models (MoCHI) is available at https://github.com/lehner-lab/MoCHI. Source code for all downstream analyses, including DiMSum and MoCHI configuration files, and to reproduce all figures described here is available at https://github.com/lehner-lab/archstabms. An archive of this repository is also publicly available on Zenodo at 10.5281/zenodo.11671164 (ref. 54).
Competing interests
A.J.F. and B.L. are founders, employees and shareholders of ALLOX. J.M.S. is a founder, employee and shareholder of factorize.bio.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Andre J. Faure, Email: andre.faure@crg.eu
Ben Lehner, Email: bl11@sanger.ac.uk.
Extended data
is available for this paper at 10.1038/s41586-024-07966-0.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-024-07966-0.
References
- 1.Notin, P., Rollins, N., Gal, Y., Sander, C. & Marks, D. Machine learning for functional protein design. Nat. Biotechnol.42, 216–228 (2024). [DOI] [PubMed] [Google Scholar]
- 2.Kinney, J. B. & McCandlish, D. M. Massively parallel assays and quantitative sequence–function relationships. Annu. Rev. Genom. Hum. Genet.20, 99–127 (2019). [DOI] [PubMed] [Google Scholar]
- 3.Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods11, 801–807 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Olson, C. A., Wu, N. C. & Sun, R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. Biol.24, 2643–2651 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Nedrud, D., Coyote-Maestas, W. & Schmidt, D. A large-scale survey of pairwise epistasis reveals a mechanism for evolutionary expansion and specialization of PDZ domains. Proteins89, 899–914 (2021). [DOI] [PubMed] [Google Scholar]
- 6.Dryden, D. T. F., Thomson, A. R. & White, J. H. How much of protein sequence space has been explored by life on Earth? J. R. Soc. Interface5, 953–956 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Tsuboyama, K. et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature620, 434–444 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Poelwijk, F. J., Socolich, M. & Ranganathan, R. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat. Commun.10, 4213 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol.39, 691–696 (2021). [DOI] [PubMed] [Google Scholar]
- 11.Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods18, 389–396 (2021). [DOI] [PubMed] [Google Scholar]
- 12.Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol.41, 1099–1106 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Hsu, C. et al. Learning inverse folding from millions of predicted structures. Preprint at bioRxiv10.1101/2022.04.10.487779 (2022).
- 14.Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun.13, 4348 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. Preprint at https://arxiv.org/abs/2205.13760 (2022).
- 16.Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA118, e2016239118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature599, 91–95 (2021). [DOI] [PubMed] [Google Scholar]
- 18.Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science381, eadg7492 (2023). [DOI] [PubMed] [Google Scholar]
- 19.Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science378, 49–56 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Ingraham, J., Garg, V. K., Barzilay, R. & Jaakkola, T. in Proc. Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (eds Wallach, H. et al.) (2019).
- 21.Freschlin, C. R., Fahlberg, S. A., Heinzelman, P. & Romero, P. A. Neural network extrapolation to distant regions of the protein fitness landscape. Nat. Commun.15, 6405 (2024). [DOI] [PMC free article] [PubMed]
- 22.Wells, J. A. Additivity of mutational effects in proteins. Biochemistry29, 8509–8517 (1990). [DOI] [PubMed] [Google Scholar]
- 23.Faure, A. J. et al. Mapping the energetic and allosteric landscapes of protein binding domains. Nature604, 175–183 (2022). [DOI] [PubMed] [Google Scholar]
- 24.Weng, C., Faure, A. J., Escobedo, A. & Lehner, B. The energetic and allosteric landscape for KRAS inhibition. Nature626, 643–652 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Zhou, J. & McCandlish, D. M. Minimum epistasis interpolation for sequence-function relationships. Nat. Commun.11, 1782 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science369, 440–445 (2020). [DOI] [PubMed] [Google Scholar]
- 27.Levy, E. D., Kowarzyk, J. & Michnick, S. W. High-resolution mapping of protein concentration reveals principles of proteome architecture and adaptation. Cell Rep.7, 1333–1340 (2014). [DOI] [PubMed] [Google Scholar]
- 28.Domingo, J., Baeza-Centurion, P. & Lehner, B. The causes and consequences of genetic interactions (epistasis). Annu. Rev. Genom. Hum. Genet.20, 433–460 (2019). [DOI] [PubMed] [Google Scholar]
- 29.Sailer, Z. R. & Harms, M. J. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics205, 1079–1088 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Johansson, K. E., Lindorff-Larsen, K. & Winther, J. R. Global analysis of multi-mutants to improve protein function. J. Mol. Biol.435, 168034 (2023). [DOI] [PubMed] [Google Scholar]
- 31.Carter, P. J., Winter, G., Wilkinson, A. J. & Fersht, A. R. The use of double mutants to detect structural changes in the active site of the tyrosyl-tRNA synthetase (Bacillus stearothermophilus). Cell38, 835–840 (1984). [DOI] [PubMed] [Google Scholar]
- 32.Horovitz, A. Double-mutant cycles: a powerful tool for analyzing protein structure and function. Fold. Des.1, R121–R126 (1996). [DOI] [PubMed] [Google Scholar]
- 33.Schmiedel, J. M. & Lehner, B. Determining protein structures using deep mutagenesis. Nat. Genet.51, 1177–1186 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Rollins, N. J. et al. Inferring protein 3D structure from deep mutation scans. Nat. Genet.51, 1170–1176 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Stiffler, M. A. et al. Protein structure from experimental evolution. Cell Syst.10, 15–24.e5 (2020). [DOI] [PubMed] [Google Scholar]
- 36.Rajasekaran, N., Suresh, S., Gopi, S., Raman, K. & Naganathan, A. N. A general mechanism for the propagation of mutational effects in proteins. Biochemistry56, 294–305 (2017). [DOI] [PubMed] [Google Scholar]
- 37.Diss, G. & Lehner, B. The genetic landscape of a physical interaction. eLife7, e32472 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Harkiolaki, M. et al. Distinct binding modes of two epitopes in Gab2 that interact with the SH3C domain of Grb2. Structure17, 809–822 (2009). [DOI] [PubMed] [Google Scholar]
- 39.Wei, X. et al. A massively parallel pipeline to clone DNA variants and examine molecular phenotypes of human disease mutations. PLoS Genet.10, e1004819 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Baeza-Centurion, P., Miñana, B., Schmiedel, J. M., Valcárcel, J. & Lehner, B. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell176, 549–563.e23 (2019). [DOI] [PubMed] [Google Scholar]
- 41.Otwinowski, J. Biophysical inference of epistasis and the effects of mutations on protein stability and function. Mol. Biol. Evol.35, 2345–2354 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Park, Y., Metzger, B. P. H. & Thornton, J. W. The simplicity of protein sequence-function relationships. Nat. Commun.15, 7953 (2024). [DOI] [PMC free article] [PubMed]
- 43.Faure, A. J. & Lehner, B. MoCHI: neural networks to fit interpretable models and quantify energies, energetic couplings, epistasis and allostery from deep mutational scanning data. Preprint at bioRxiv10.1101/2024.01.21.575681 (2024).
- 44.Ding, D. et al. Protein design using structure-based residue preferences. Nat. Commun.15, 1639 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Escobedo, A., Voigt, G., Faure, A. J. & Lehner, B. Genetics, energetics and allostery during a billion years of hydrophobic protein core evolution. Preprint at bioRxiv10.1101/2024.05.11.593672 (2024).
- 46.Zarin, T. & Lehner, B. A complete map of specificity encoding for a partially fuzzy protein interaction. Preprint at bioRxiv10.1101/2024.04.25.591103 (2024).
- 47.Domingo, J., Diss, G. & Lehner, B. Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature558, 117–121 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Livesey, B. J. & Marsh, J. A. Updated benchmarking of variant effect predictors using deep mutational scanning. Mol. Syst. Biol.19, e11474 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Stockdale, J. E., Liu, P. & Colijn, C. The potential of genomics for infectious disease forecasting. Nat. Microbiol.7, 1736–1743 (2022). [DOI] [PubMed] [Google Scholar]
- 50.Beltran, A., Jiang, X., Shen, Y. & Lehner, B. Site saturation mutagenesis of 500 human protein domains reveals the contribution of protein destabilization to genetic disease. Preprint at bioRxiv10.1101/2024.04.26.591310 (2024).
- 51.Beltran, A., Faure, A. J. & Lehner, B. The allosteric landscape of the Src kinase. Preprint at bioRxiv10.1101/2024.04.26.591297 (2024).
- 52.Faure, A. J., Schmiedel, J. M., Baeza-Centurion, P. & Lehner, B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biol.21, 207 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Delgado, J., Radusky, L. G., Cianferoni, D. & Serrano, L. FoldX 5.0: working with RNA, small molecules and a new graphical interface. Bioinformatics35, 4168–4169 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Faure, A. et al. The genetic architecture of protein stability. Zenodo10.5281/zenodo.11671164 (2024).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All DNA sequencing data have been deposited in the Gene Expression Omnibus (GEO) with accession number GSE246322. Associated fitness measurements and free energies are provided in Supplementary Tables 4 and 5. Shallow double-mutant ddPCA DNA sequencing data for GRB2-SH3 and PSD95-PDZ3 are available in the GEO with accession number GSE184042, and the processed data used in this study can be found in Supplementary Tables 6 and 7 of the corresponding publication (10.1038/s41586-022-04586-4). Protein structures are available from the Protein Data Bank for GRB2-SH3 (entry ID: 2VWF) and SRC (entry ID: 2SRC).
Source code for fitting thermodynamic models (MoCHI) is available at https://github.com/lehner-lab/MoCHI. Source code for all downstream analyses, including DiMSum and MoCHI configuration files, and to reproduce all figures described here is available at https://github.com/lehner-lab/archstabms. An archive of this repository is also publicly available on Zenodo at 10.5281/zenodo.11671164 (ref. 54).