Abstract
Mutations in protein active sites can dramatically improve function. The active site, however, is densely packed and extremely sensitive to mutations. Therefore, some mutations may only be tolerated in combination with others in a phenomenon known as epistasis. Epistasis reduces the likelihood of obtaining improved functional variants and dramatically slows natural and lab evolutionary processes. Research has shed light on the molecular origins of epistasis and its role in shaping evolutionary trajectories and outcomes. In addition, sequence- and AI-based strategies that infer epistatic relationships from mutational patterns in natural or experimental evolution data have been used to design functional protein variants. In recent years, combinations of such approaches and atomistic design calculations have successfully predicted highly functional combinatorial mutations in active sites. These were used to design thousands of functional active-site variants, demonstrating that, while our understanding of epistasis remains incomplete, some of the determinants that are critical for accurate design are now sufficiently understood. We conclude that the space of active-site variants that has been explored by evolution may be expanded dramatically to enhance natural activities or discover new ones. Furthermore, design opens the way to systematically exploring sequence and structure space and mutational impacts on function, deepening our understanding and control over protein activity.
Keywords: protein function, machine learning, protein evolution, protein engineering, epistasis
Protein activity is determined by the precise positioning of constellations of amino acids in the active site (1, 2). These constellations are stabilized by networks of molecular interactions within the active site and with the remainder of the protein. Due to a high density of interactions in the active site, most mutations disrupt the positioning of the catalytic groups and are detrimental to protein foldability, stability, and activity (3–6). Nevertheless, active-site mutations are critical to modifying protein activity. Accordingly, clarifying the factors that determine whether active-site mutations are tolerated (4, 5) and enhance activity has been an area of intense research over decades with wide-ranging implications for understanding evolutionary processes (7–11) and our ability to design and engineer new and enhanced molecular activities (12–14).
Epistasis is one of the main factors that determine the functional outcome of mutations (7, 9, 15–17). This phenomenon is observed when the phenotype of a combinatorial mutant is not a simple additive combination of the effects of the constituent mutations (16). Consider, for instance, a multipoint mutant that enhances protein activity even though some of its constituent point mutations are deleterious if introduced individually (Fig. 1) (6, 18, 19). Because evolution typically accumulates mutations in a stepwise manner, and each mutation must be at least tolerated to be selected, epistatic relationships between mutations may block the emergence of such beneficial mutants or severely restrict the order in which mutations accumulate (9, 16, 20). Therefore, epistasis may dramatically lengthen the course of evolution of new or significantly improved activities (21). Additionally, epistasis may change the tolerance of homologs to mutations. For instance, some disease-causing mutations in humans do not exhibit a deleterious phenotype in other species. In human triosephosphate isomerase, the mutation Gly122Arg forms unfavorable steric overlaps with nearby Trp90, but several bacterial orthologs accommodate an Arg at position 122 against the background of the Trp90Lys mutation (22). Thus, epistasis shapes evolutionary processes in proteins and other biomolecules (23) and may reduce the chances of the emergence of some desirable mutants (15, 16, 19–21).
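The non-additivity described above can be made concrete with a double-mutant cycle. The following is a minimal sketch and is not taken from the cited studies: the function name `pairwise_epistasis` and all numerical values are illustrative assumptions, and mutational effects are assumed to be expressed on an additive scale (for example, log activity) relative to the wild type.

```python
def pairwise_epistasis(effect_a, effect_b, effect_ab):
    """Deviation of a double mutant from additivity.

    All effects are measured on an additive scale (e.g., log activity or
    free-energy change) relative to the wild type. A value of zero means
    the two mutations combine additively; any other value indicates
    epistasis.
    """
    return effect_ab - (effect_a + effect_b)


# Hypothetical values: each point mutation is deleterious on its own,
# but the double mutant is beneficial.
single_a = -1.0
single_b = -1.5
double_ab = 2.0

epsilon = pairwise_epistasis(single_a, single_b, double_ab)
print(epsilon)  # 4.5
```

In this toy case, both single mutants would likely be purged by selection even though the double mutant outperforms the wild type, which is the situation in which stepwise evolution is blocked.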
Fig. 1.
The impact of mutations on stability or activity may depend on other mutations in a phenomenon known as epistasis. (Left) Schematics of two favorable interactions (Top and Bottom) in which the intermediates (center) are incompatible and thus unlikely to be tolerated in evolution. (Right) Two amino acid residues across the light–heavy chain interface of an antibody variable domain (Top; based on PDB entry: 2fjg) were designed to form a hydrogen bond (Bottom). Both intermediates (center) are strained and predicted not to be tolerated in evolution (adapted from ref. 24). This figure presents only one form of epistasis. Others are described in the main text.
In protein engineering and design, epistasis limits our ability to predict the functional impact of combinatorial mutations (25, 26). For instance, modern laboratory procedures, such as deep mutational scanning, provide detailed information on the impact of all point mutations and even low-order combinations of mutations (13, 27). If mutational effects were completely additive, such procedures would enable fast and effective protein optimization by suggesting combinations of function-enhancing mutations that may exert a large gain-of-function phenotypic change (12). Due to the pervasiveness of epistasis, however, even armed with complete knowledge of the mutational impact of all single-point mutations, one cannot guarantee the functional outcome of double mutants, let alone higher-order combinations (18, 19, 25). In fact, it is estimated that due to epistasis, only half of the beneficial mutations that accumulate during laboratory evolution experiments can be explained on the basis of their impact on the starting point protein (25). This unpredictability is one of the main reasons why protein optimization workflows typically require multiple experimental iterations and large-scale experimental screening (28) and yet sometimes fail to reach the desired optimization goals (11, 29). Furthermore, the observations that epistasis dramatically slows evolutionary processes imply that the natural repertoire of protein functions is far from exhausted by contemporary proteins in nature (21), providing an impetus for exploring nonevolutionary strategies for protein optimization.
Until recently, epistasis was often addressed through experimental protein engineering approaches (30, 31). With the accumulation of large datasets of protein homologs, methods that use deep learning and other strategies to search for covariation in the evolution of protein families have made important contributions to detecting epistatic mutations and to design capabilities (32–35). Our perspective focuses on understanding the molecular basis of epistasis and how this understanding has guided recent computational design strategies that combine sequence, structure, and machine-learning approaches to introduce gain-of-function mutations in active sites. In this perspective, epistasis presents the most profound challenge to protein engineering, design, and evolution of new molecular activities; however, epistasis does not only restrict the emergence of new or enhanced function. Rather, the nonlinear impact that epistatic mutations exert on function allows the emergence of new or substantially improved activities by combining a relatively small number of strategic mutations (10, 18, 36). Understanding and prediction of beneficial epistatic interactions are therefore essential stepping stones toward a greater level of control over design processes.
The Molecular Origins of Epistasis
To understand what determines whether mutations are epistatic and what would be their functional outcome, we consider how structure and interresidue interactions may affect mutational tolerance (9, 22, 37). Notably, epistasis can occur between positions that are in direct contact with one another. Direct epistasis is an especially acute problem in protein active sites as these are almost universally densely packed with molecular interactions that are critical for protein stability, foldability, and activity (Fig. 2). Epistasis can also occur indirectly, and even among positions in distant parts of the protein (25, 38), further complicating the ability to rationalize or predict the functional impact of combinatorial mutations.
Fig. 2.
Active sites in enzymes and binders comprise preorganized constellations of amino acids. (A) Interaction between protein A (purple) and the crystallizable fragment (Fc) of human IgG1 antibodies (blue) (PDB entry: 1l6x). (Middle) Positions on protein A that form critical interactions with Fc are marked in spheres. The amino acid residues in these positions form stabilizing van der Waals and hydrogen bond interactions within and between the monomers (Left and Right). (B) The molecular structure of a glycoside hydrolase and its substrate xylan (white) (PDB entry: 4pue). (Left) An intricate hydrogen-bond network stabilizes the substrate in the active site. Hydrogen bonds are indicated using dashed yellow lines. (Right) A space-filling model of the active site shows that the amino acids that line the substrate-binding pocket form a dense interaction network. Uncompensated mutations in such densely packed regions are unlikely to be tolerated. Spheres according to atomic radii.
Direct epistasis arises from physical contacts such as electrostatics and van der Waals interactions (37) (Fig. 3A). Due to the high density of interactions in an active site (Fig. 2), mutations that improve activity may undermine stability (4, 5, 22). For instance, a large-to-small mutation may be necessary to improve the fit between the protein and a bulky substrate. At the same time, however, the newly introduced cavity may destabilize the protein. In such cases, a second (or several) compensatory small-to-large mutation(s) may be needed (22). Similarly, changes in activity may demand a new network of polar/charged interactions in a hydrophobic environment. Introduced individually, the mutations that comprise this network may destabilize the protein (22) or temporarily reduce its activity (6), and the gain-of-function effect may only materialize when all mutations are introduced (37) (Fig. 3A). These structural and molecular considerations highlight that evolving mutants that exhibit large gain-of-function may require going through intermediate variants in which activity is reduced due to epistasis.
Fig. 3.
Examples of epistatic function-altering mutations and their structural impact. (A) Several ultrahigh affinity and specificity endonuclease-immunity pairs (purple and blue, respectively) are known in Escherichia coli. Pairs Im2/E2 and Im9/E9 are homologous but exhibit orders of magnitude specificity preference for their cognate partners over the noncognate ones (Im2/E9 and Im9/E2). Three key positions (indicated in spheres on the Left) form highly epistatic interactions that can be exchanged between the two pairs in the specific order shown on the Right. For instance, Asn34Val enables Arg98Met which in turn enables Asp33Leu. Other mutational orders lead to steric or electrostatic incompatibilities and nonfunctional intermediates. These mutations completely change binding specificity. Adapted from ref. 39. (B) A hormone-regulated transcription factor (Middle) evolved a specificity switch mainly by two key mutations: Ser106Pro (Left) introduces a kink in the backbone repositioning a nearby helix, including position 111. Leu111Gln (Right) introduces a hydrogen bond with a hydroxyl group present only in the new hormone. Adapted from ref. 40, models generated using ColabFold (41). (C) Hydrogen bonds define the structure of a long loop that caps the active site and interacts with the substrate (Left and Middle, PDB entry: 1uqz). Mutations of the hydrogen bonding residues (Right) resulted in substantially lower activity levels in enzyme designs that incorporated this structural fragment despite the large distance between the mutated positions and the active site (42).
Indirect epistatic interactions, or conformational epistasis, may result from either local or long-range backbone changes, for instance by altering the positioning and orientation of nearby residues (40) (Fig. 3B). Backbone changes can also result from eliminating hydrogen bonds that are responsible for the local conformation (42, 43) (Fig. 3C). In a remarkable example of the functional impact of indirect epistasis, a histidine-to-proline mutation observed in two mammalian hemoglobins eliminates a hydrogen bond to a nearby helix, thereby reorienting the four protein subunits and increasing their affinity for oxygen (44). In long-range epistasis, mutations impact the functional outcome of distant mutations (25, 38). For example, mutations outside a protein binding site may simultaneously increase affinity for multiple binding partners, inducing multispecific binding. Additional binding-site mutations can then increase the specificity for several new binding partners (45).
Mutations that are distant from an active site can also exert epistatic effects by changing protein stability (38). Proteins may be stabilized by mutations throughout the protein, and such mutations may buffer the destabilizing effect of active-site mutations, reducing the likelihood that the protein misfolds or aggregates in response to the active-site mutations (7, 8, 11, 46–48). Indeed, strong selection pressure for the native function may lead to accumulating mutations that promote foldability, thereby potentiating the evolution of a new function. Conversely, mutations that reduce stability may exhibit what is known as a threshold-robustness effect, in which function-enhancing mutations reduce stability and accumulate only to the point at which further mutations lower stability below the threshold required for protein expression (7).
Despite advances in our understanding of the molecular underpinnings of epistasis, in many cases, mutations far from the active site may have a large functional impact that cannot be rationalized with current understanding (11, 49, 50). Such remote mutations may not change the molecular structure of the protein significantly, and yet, they exert a significant impact on function. These observations suggest that analyzing protein dynamics or other molecular properties that are typically not taken into consideration in protein modeling may make important contributions to our understanding of epistasis (38). In this connection, we note that methods that assess coevolutionary patterns in homologous proteins (reviewed below) do not depend on a mechanistic framework, allowing them to empirically identify correlated mutations that result from epistatic relationships.
Epistasis in Protein Evolution and Design
In evolutionary studies, a fitness landscape is a conceptual framework for considering mutational trajectories and the impact of mutations on fitness (51). In this framework, the distance between mutants is proportional to the likelihood that one mutant may have derived from another (e.g., point mutants are close to one another), and their relative heights are proportional to relative fitness values. To be selected by evolution, a mutation must be beneficial or at least not harmful (26, 51). Thus, in an ideal scenario for the rapid evolution of new activities, the fitness landscape would be smooth and exhibit a monotonic path of increase in function among mutants. Epistasis, however, implies that the fitness landscape may be rugged (6, 9, 16, 52, 53), significantly reducing the number of paths that lead from one functional mutant to another without reducing fitness (15–17, 26, 54). Indeed, many putative intermediates that connect naturally occurring protein variants are nonfunctional due to epistasis (6, 20, 55–57); such intermediates populate fitness valleys between peaks (58, 59).
Functionally neutral mutations can help avoid fitness valleys (45, 48) and promote functional innovation in several ways (60). First, neutral mutations can enhance promiscuous functions that may be useful under different conditions or environments (39, 61). For example, mutations in TEM-1 β-lactamase that do not impact the native enzyme activity (ampicillin resistance) increase a different activity (resistance to cefotaxime) (62). Second, neutral mutations may provide multiple genetic backgrounds some of which may permit mutations that are excluded in a functionally equivalent protein with a different sequence due to epistasis (6, 48). Last, recombination events between functionally equivalent proteins may enable large leaps in the fitness landscape by simultaneously introducing multiple mutations that cannot be introduced one by one (63–65).
Epistasis may impose a specific mutational order in which functionally important mutations are only introduced following permissive ones (25). For example, most evolutionary trajectories that connect two natural β-lactamase variants that are separated by five mutations were shown to go through a fitness valley (19). This loss in activity is due to the fact that most of the intermediates encode a destabilizing mutation that is nonetheless critical to the functional transition. To compensate for the stability loss, a compensatory stabilizing mutation that slightly reduces activity must be introduced before the mutation that promotes the functional transition. Such strict ordering in the accumulation of mutations is a hallmark of epistasis (16), slows the evolution of new activities (21), and reduces the likelihood of reaching a new fitness peak through random mutational processes.
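The effect of mutational order on evolutionary accessibility can be illustrated by counting the stepwise paths along which fitness never decreases. The sketch below is a toy model under stated assumptions: the function `count_accessible_paths` and both two-site landscapes are hypothetical and are not derived from the β-lactamase data (for five mutations, the same procedure would examine 5! = 120 orders).

```python
from itertools import permutations


def count_accessible_paths(fitness, n_sites):
    """Count mutational orders from the all-0 to the all-1 genotype
    along which fitness never decreases at any single step.

    `fitness` maps genotype tuples of 0/1 (wild type/mutated) to values.
    """
    start = (0,) * n_sites
    accessible = 0
    for order in permutations(range(n_sites)):
        genotype = list(start)
        current = fitness[start]
        tolerated = True
        for site in order:
            genotype[site] = 1
            step = fitness[tuple(genotype)]
            if step < current:  # this intermediate would be selected against
                tolerated = False
                break
            current = step
        if tolerated:
            accessible += 1
    return accessible


# Hypothetical additive landscape: both orders are accessible.
additive = {(0, 0): 1.0, (1, 0): 1.5, (0, 1): 1.2, (1, 1): 1.7}
# Hypothetical epistatic landscape: mutation 0 is deleterious unless
# mutation 1 (a permissive mutation) is introduced first.
epistatic = {(0, 0): 1.0, (1, 0): 0.5, (0, 1): 1.2, (1, 1): 2.0}

print(count_accessible_paths(additive, 2))   # 2
print(count_accessible_paths(epistatic, 2))  # 1
```

If both single mutants had lower fitness than the starting point, no order would be accessible, corresponding to a fitness valley that stepwise evolution is unlikely to cross.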
Another critical impact of epistasis is that it may lead to functional irreversibility following the emergence of a new activity (66, 67). In this scenario, once the mutations that confer a new activity have been fixed, additional mutations may entrench the function-altering mutations (6, 39, 68). Although the mutations that impose irreversibility may be neutral or slightly beneficial against the background of the variant with the new activity, they may negatively impact function when implemented on the background of the ancestral state, thus reducing the chances of returning to that state (6). For example, a new function may emerge through a change in the backbone structure that enables additional mutations that are incompatible with the previous structure (Fig. 3B) (40, 69).
In some cases, irreversibility may materialize in just a few mutational steps. For example, by exchanging a salt bridge to a hydrophobic pair in a binding site, the original polar amino acids cannot be reintroduced through single-point mutations without significant destabilization (e.g., the evolutionary trajectory that reverses the order shown in Fig. 3A is strongly disfavored) (39). Due to functional irreversibility, enzymes are highly adapted to carry out their specific functions, while evolutionary ancestors of proteins often exhibit a broader substrate scope (these are known as “generalist” or multispecific ancestors) (6, 61, 70). Functional irreversibility also explains why evolving or designing contemporary proteins toward new activities, even ones that are observed in homologous proteins, is often challenging: The contemporary proteins may have accumulated a large number of mutations that are incompatible with ones that are needed for the new activity (6). Instead, protein engineers may exploit promiscuous enzyme activities and enhance them through gradual accumulation of mutations into a new function (48, 61).
Experimental screens of mutational variants rarely isolate combinations of epistatic mutations in one experimental cycle (28, 31). In most cases, mutations accumulate one at a time and thus must be individually tolerated (25, 71, 72). Moreover, the possible number of mutation combinations, including insertion and deletion events, is astronomical, and due to epistasis, most combinations of mutations, including combinations of beneficial ones, are nonfunctional (19). Therefore, experimental methods often resort to high-throughput screening to isolate functional variants. Ultrahigh-throughput screening methods, however, can screen up to 10⁸ variants (73) [or 10¹⁴ in extreme cases (74)], a very small fraction of sequence space. High-throughput screening can also sample multiple mutations at once but due to the large combinatorial space, such methods are restricted to a small number of positions (10, 75). Due to these technical limitations and the rareness of gain-of-function combinations, high-throughput screening approaches cannot adequately cover the space of favorable combinations of mutations.
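A short calculation shows how quickly the combinatorial space outgrows screening capacity. This is a back-of-the-envelope sketch: the choice of 10 mutable positions is a hypothetical example, only substitutions among the 20 canonical amino acids are counted (insertions and deletions are ignored), and the screening capacity is taken as 10**8 variants, the ultrahigh-throughput figure mentioned above.

```python
from math import comb


def n_variants(n_positions, n_mutations, alphabet=20):
    """Number of sequences carrying exactly `n_mutations` substitutions
    among `n_positions` mutable positions (substitutions only)."""
    return comb(n_positions, n_mutations) * (alphabet - 1) ** n_mutations


n_positions = 10          # hypothetical number of active-site positions
screen_capacity = 10**8   # assumed ultrahigh-throughput capacity

for k in (1, 2, 5):
    print(k, n_variants(n_positions, k))
# 1 190
# 2 16245
# 5 623976948

total = sum(n_variants(n_positions, k) for k in range(n_positions + 1))
print(total)                    # 10240000000000, i.e., 20**10
print(screen_capacity / total)  # fraction of the space one screen can cover
```

Under these assumptions, quintuple mutants at ten positions alone already exceed the assumed capacity, and the full space of 20**10 sequences exceeds it by about five orders of magnitude.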
Thus, despite often being necessary for introducing large changes in activity, epistasis imposes significant constraints on the design and engineering of protein activity. Protein design and engineering methods use some of the mechanisms that are also seen in natural evolution to promote functional change, such as introducing stabilizing mutations prior to introducing function-altering ones (43, 48, 76) or recombining fragments from homologous proteins (64, 77). However, due to the unpredictability that results from epistasis and the large space of combinatorial mutants, such methods cover only a very small part of the space of potentially functional mutants. A critical advantage of protein design approaches (reviewed below) relative to evolutionary ones is that the former are not restricted to stepwise accumulation of individually beneficial (or tolerated) mutations. Design methods may, in principle, escape fitness valleys that hinder evolutionary strategies. In the following, we describe how combinations of mutations are selected by sequence and structure-based design methods.
Data-Driven Design Approaches
Computational methods that use evolutionary and experimental data can screen large regions of the sequence space and focus experimental efforts on the parts that are most likely to harbor functional proteins. Homologous sequences represent functionally related proteins that have evolved in different environments. Such data can be statistically analyzed to guide the design of new proteins, including by introducing epistatic mutations (78, 79). For instance, covariation in the substitution patterns observed in aligned positions in homologs likely reflects an epistatic relationship and implies an interaction among the positions (80). Methods for inferring pairwise interactions based on sequence covariation reached a high level of accuracy approximately a decade ago (81, 82), but they were limited in their ability to infer high-order relationships that are necessary for accurate design (17, 55). With the advent of AI methods, patterns that reflect higher-order relationships between mutations may be inferred more reliably (83), and such methods have had a transformative impact on modeling (84) and design. Most dramatically, these methods have enabled reliable ab initio structure prediction in widely used methods such as AlphaFold and RoseTTAFold (85, 86), prediction of mutation effects and function annotation (35, 83), and the design of new proteins, including binders (87) and enzymes with dozens of mutations from the natural starting point yet with comparable activities (34, 81, 88). Remarkably, the breakthroughs in AI-based protein design were based on sequence and structure data without requiring a physical model of mutational impact or a detailed understanding of the mode of action of the proteins.
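The idea of detecting covariation between aligned positions can be illustrated with mutual information between two alignment columns. This is a deliberately simple sketch and not the method used in the cited studies, which rely on more elaborate statistical and deep-learning models: the function `column_mutual_information` and the four-sequence toy alignment are illustrative assumptions, and raw mutual information does not correct for phylogenetic bias or indirect couplings.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of a
    multiple-sequence alignment given as equal-length strings."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error
        mi += p_ab * log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return mi


# Toy alignment (hypothetical): columns 0 and 1 change together
# (G with W, R with K), whereas column 2 varies independently.
toy_alignment = ["GWA", "GWC", "RKA", "RKC"]

print(column_mutual_information(toy_alignment, 0, 1))  # 1.0
print(column_mutual_information(toy_alignment, 0, 2))  # 0.0
```

A high score for a pair of columns suggests, but does not prove, that the two positions are coupled; real alignments require many more sequences for such statistics to be meaningful.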
Sequence-based design methods that are applied to natural proteins require very large sets of homologous sequences and may have limited applications in protein families that do not exhibit sufficient diversity. If the number of homologs is small, the learned distribution may be unreliable (89). Thus, these models were demonstrated in privileged cases where they could be fine-tuned on tens of thousands of homologous sequences (90–94). A recent study indicated that when enough sequence data are available, a simple bioinformatic technique called ancestral sequence reconstruction (95), in which putative evolutionary ancestors of contemporary sequences are inferred using statistical methods, surpasses sophisticated generative neural network models in designing functional proteins (96). Thus, AI-based methods can quickly generate potential protein sequences, but other data sources, such as physics-based modeling methods (see below), may be necessary to select promising ones for experimental characterization.
As a particularly relevant example of using additional data sources in design, laboratory evolution campaigns and deep mutational scanning generate large datasets that relate mutations to protein function (13, 28, 75, 97, 98). AI-based methods can exploit such annotated datasets to predict biological function. By inferring patterns of functional mutants in the training data, such models can quickly evaluate millions of potential variants and suggest promising ones for experimental testing, thus substantially reducing the experimental effort (99–101). Models based on experimentally determined activity data may escape fitness valleys by suggesting combinations of mutations that are likely to be stable and functional (100). Training effective models, however, requires large amounts of data (102), which come at the cost of increased experimental effort. Instead, AI strategies can leverage the information encoded in large unlabeled sequence datasets (for instance, the sequences of all known proteins) to augment small, functionally labeled datasets (103–110). Models can be subsequently fine-tuned on a target protein family using homologous sequences to further lower the number of functionally labeled sequences that are required for reliable prediction. In one remarkable example, based on such fine-tuning using a few dozen experimentally labeled sequences, highly functional green fluorescent protein (GFP) variants were designed (93).
Despite these recent achievements, it should be borne in mind that data-driven design approaches rely on data quality and struggle to generalize outside the distribution of the training data (111). Protein functional sites are dense with epistatic interactions, and mutating epistatic sites often results in nonfunctional proteins. Consequently, epistatic interactions, particularly in active sites, are underrepresented in natural and experimental datasets (112). Therefore, data-driven methods are less reliable on active-site positions (113), particularly proximal ones, and perform best on distal positions where mutational effects are more likely to be additive (114–116). In other words, sequence-based AI methods mainly learn the coevolutionary patterns in the data and are effective in tasks, such as structure prediction, that rely on such patterns; but they are less effective in designing new or enhanced functions for which data are not present in the natural sequence diversity (117).
Another limitation of natural evolutionary data as the main basis for design is that natural proteins are typically not optimized for stability or activity. In fact, many natural proteins are only marginally stable and may exhibit low functional expression (118). Furthermore, protein activity levels or selectivity may differ even among close homologs. For instance, lysozyme and α-lactalbumin exhibit high sequence and structure homology but very different functions (119). Therefore, exchanging mutations among homologous active sites often leads to nonfunctional proteins due to incompatibility among the mutations and due to evolutionary irreversibility (6). Thus, data-driven methods are strongly biased toward natural sequences (89) and may have limited ability to propose variants with improved features (91, 94). In practice, to generate sequences that may exhibit functions that are substantially different from natural ones, these methods must sample regions of the sequence space that are far from the data on which they were trained, regions where model reliability is, by definition, low (111).
Atomistic Design of Epistatic Interactions
In atomistic design calculations, mutants are ranked according to physics-based energy terms, including van der Waals packing and electrostatics (120). Typically, a few dozen designs are experimentally evaluated for the desired function, and the data are then used to fine-tune the atomistic model to enhance the activity of the designs (121–126). Protein stability and active-site preorganization are required for high-efficiency function and both rely on low system energy. Thus, the challenge of designing gain-of-function mutations in epistatic positions can be considered from the point-of-view of biomolecular energetics, postulating that epistatic interactions should form low-energy and stable constellations (116). Until recently, however, atomistic design calculations were not accurate enough to propose stable combinatorial mutants in an active site, and successful engineering of functional variants demanded cycles of computational and experimental screening often aided by protein crystallography and high-throughput functional screening methods (72, 127, 128).
Thanks to a dramatic improvement in the reliability of atomistic design calculations over recent years (129, 130), atomistic design methods can now address some of the challenges that epistasis poses for design of function. For instance, reliable methods for designing stable protein variants have been applied to a variety of challenging proteins. The ability to design stable protein variants, in itself, opens the way to enhancing protein activity (18, 76, 131) because stable proteins are more likely to occupy their native state than marginally stable ones (118). Additionally, preorganization of the amino acids that are responsible for function in their functionally competent state may increase catalytic efficiency (2).
The FuncLib computational design method diversifies functional sites by explicitly addressing the challenge that epistasis presents for designing dense regions (18). In this approach, hundreds of thousands of active-site mutants are modeled atomistically and ranked according to their energy; then, a few dozen low-energy designs each encoding several active-site mutations are tested experimentally. This approach models mutants that harbor multiple simultaneous active-site mutations and does not rely on the gradual accumulation of mutations as evolutionary strategies do. Furthermore, designs are assessed by energy to verify that combinations of mutations (which may only be observed in functionally different homologs) are compatible with one another. Resulting designs can, in principle, escape fitness valleys (18, 36, 131) and, in many cases, form new hydrogen-bond networks that would be difficult to generate through the stepwise accumulation of tolerated mutations (Fig. 1) (24). In one example, a high-efficiency designed quadruple mutant in a phosphotriesterase showed strong epistatic relationships among the designed mutations (18), suggesting that evolutionary processes would be slow or unlikely to give rise to such a variant (52).
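The enumerate-and-rank logic described above can be sketched with a toy energy function. This is not the FuncLib implementation: atomistic modeling is replaced here by a hypothetical lookup of single-residue and pairwise energy terms, and the function `rank_combinations`, the position numbers, and all energy values are assumptions chosen only to show how simultaneous evaluation of multipoint mutants can identify a combination whose single-mutant intermediates are unfavorable.

```python
from itertools import product


def rank_combinations(choices, single_energy, pair_energy, top_n=3):
    """Enumerate every combination of allowed residues and return the
    `top_n` lowest-energy variants as (energy, {position: residue}).

    Missing entries in the energy tables are treated as zero.
    """
    positions = sorted(choices)
    scored = []
    for combo in product(*(choices[p] for p in positions)):
        variant = dict(zip(positions, combo))
        energy = sum(single_energy.get((p, aa), 0.0)
                     for p, aa in variant.items())
        for idx, p in enumerate(positions):
            for q in positions[idx + 1:]:
                key = ((p, variant[p]), (q, variant[q]))
                energy += pair_energy.get(key, 0.0)
        scored.append((energy, variant))
    scored.sort(key=lambda item: item[0])
    return scored[:top_n]


# Hypothetical two-position site: a wild-type D/R salt bridge that can
# be replaced by an L/M hydrophobic pair (toy energies, arbitrary units).
choices = {10: ["D", "L"], 20: ["R", "M"]}
single_energy = {(10, "L"): 0.5, (20, "M"): 0.5}
pair_energy = {
    ((10, "D"), (20, "R")): -2.0,  # salt bridge
    ((10, "L"), (20, "M")): -4.0,  # hydrophobic packing
}

for energy, variant in rank_combinations(choices, single_energy, pair_energy):
    print(energy, variant)
# -3.0 {10: 'L', 20: 'M'}
# -2.0 {10: 'D', 20: 'R'}
# 0.5 {10: 'D', 20: 'M'}
```

In this toy case, the double mutant ranks best, although each single mutant is less favorable than the wild-type pair, so a stepwise search would have to pass through a less stable intermediate.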
In several unrelated studies, the success rate of isolating highly active enzymes among FuncLib designs exceeded 50%. Applied to different proteins, it increased enzyme specific activity (18, 132, 133), changed selectivity (36, 134), broadened the enzyme substrate scope (135), altered enantioselectivity (136), and increased stability (137). FuncLib was also applied to protein interfaces to increase affinity (138) and stability (24, 139). The fact that the same design strategy has been applied successfully to enzymes belonging to different functional classes and to different obligatory and transient protein–protein interfaces is an encouraging indication that energy ranking is a general strategy for the design of function in the epistatic environments that characterize active sites.
Natural evolution repeatedly uses a limited set of protein folds and recombines sequences in order to generate novel proteins (140–142). Inspired by nature, protein designers can recombine mutations or modular substructures to generate novel proteins (64, 143–147), but such changes are sensitive to epistasis among the combined substructures. Energy considerations can identify compatible sets of substructures or mutations, improving the likelihood of successful design and increasing the number of mutations that can be tolerated. A method called Epistasis Neural Network (EpiNNet) finds a space of mutations or substructures that are mutually compatible and likely to form low-energy structures when combined (42). It employs a machine-learning analysis of a large number of atomistically designed models to find mutational patterns that are common among computationally generated, low-energy designs and rare among the high-energy ones. Those mutations or substructures that lead to low-energy designs can then be freely combined to form a library of mutants for experimental screening. This approach echoes data-driven strategies that look for tolerated combinations of mutations (115), but unlike those strategies, EpiNNet analyzes computer-designed sequences and structures rather than ground-truth experimental data. By relying on simulated structures, EpiNNet is less sensitive to the number of known homologs compared to current AI-based strategies. Thus, this approach simulates natural recombination, but unlike recombination of natural genes, it designs and selects the mutants according to their likelihood to combine into low-energy and therefore foldable and potentially functional designs.
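The selection of mutations that are common among low-energy designs and rare among high-energy ones can be illustrated with a simple enrichment ratio. This sketch substitutes a frequency comparison for the machine-learning analysis that EpiNNet performs and is not the published method: the function `enriched_mutations`, the mutation labels, the energy cutoff, and the thresholds are all hypothetical.

```python
from collections import Counter


def enriched_mutations(designs, energy_cutoff, min_ratio=1.5, pseudocount=1.0):
    """Return mutations whose smoothed frequency among low-energy designs
    is at least `min_ratio` times their frequency among high-energy ones.

    `designs` is a list of (set_of_mutations, energy) pairs.
    """
    low = [muts for muts, energy in designs if energy <= energy_cutoff]
    high = [muts for muts, energy in designs if energy > energy_cutoff]
    low_counts = Counter(m for muts in low for m in muts)
    high_counts = Counter(m for muts in high for m in muts)
    selected = set()
    for mutation in set(low_counts) | set(high_counts):
        # The pseudocount avoids division by zero for unseen mutations.
        f_low = (low_counts[mutation] + pseudocount) / (len(low) + pseudocount)
        f_high = (high_counts[mutation] + pseudocount) / (len(high) + pseudocount)
        if f_low / f_high >= min_ratio:
            selected.add(mutation)
    return selected


# Hypothetical computed designs: (mutations, modeled energy).
designs = [
    ({"A10L", "R20M"}, -5.0),
    ({"A10L"}, -4.0),
    ({"A10L", "R20M"}, -4.5),
    ({"A10L", "G30W"}, 3.0),
    ({"G30W"}, 4.0),
    ({"G30W", "S40P"}, 2.5),
]

print(sorted(enriched_mutations(designs, energy_cutoff=0.0)))
# ['A10L', 'R20M']
```

The selected mutations would then define the library to be combined and screened; a per-mutation ratio like this one ignores interactions among mutations, which is one reason a learned model may be preferable.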
EpiNNet was applied to two protein families, glycoside hydrolases (42) and GFP (148), using backbone fragment combination and combinatorial sequence mutations, respectively. In both cases, large libraries were synthesized and screened for activity. Remarkably, in each case more than 10,000 functional designs were obtained using high-throughput screening. Unlike typical mutational libraries, which randomize the mutated positions, these libraries contained computationally selected mutations, allowing a greater space of positions to be sampled. For instance, in the enzyme design study, more than 100 designed fragments were recombined, and in GFP, 14 core positions were targeted for mutagenesis, thus enabling systematic exploration of the space of epistatic mutations. Due to the high concentration of mutations in the active site, a large fraction of the functional designs exhibited changes in activity profile. These results suggest that energy-based ranking of combinatorial designs may provide a key to improving protein activity and even to discovering new activities that are not observed in the natural diversity. Furthermore, the high enrichment for active-site structure and sequence diversity among the functional variants in atomistic design relative to other approaches provides an encouraging sign that energy-based ranking may embody some of the critical aspects that determine the functional outcome of epistatic mutations.
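To see why computational preselection allows more positions to be sampled, consider the arithmetic of library size. The sketch below compares full randomization of 14 positions with a focused library; the figure of three allowed identities per position is an illustrative assumption, not a number taken from the GFP study, and `library_size` is a hypothetical helper.

```python
import math


def library_size(options_per_position):
    """Number of distinct variants when positions are combined freely."""
    return math.prod(options_per_position)


n_positions = 14

# Full randomization: all 20 amino acids at every position.
full = library_size([20] * n_positions)

# Focused library: an assumed 3 preselected identities at every position.
focused = library_size([3] * n_positions)

print(f"full randomization: {full:.2e} variants")
print(f"focused library:    {focused:,} variants")
```

Under these assumptions, the fully randomized library has about 1.6 × 10^18 members, far beyond any screening capacity, whereas the focused library has fewer than five million.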
Discussion
How functional innovation is generated in protein evolution or engineering is one of the most fundamental and intriguing questions in protein science (28, 30, 149). In most proteins, functional sites exhibit a high density of molecular interactions and are among the most evolutionarily conserved regions, with a low tolerance to mutations (21). Furthermore, due to epistasis, the impact of mutations on activity may differ when introduced in combination with others. Epistasis is both a curse and a blessing for functional innovation: On the one hand, it frustrates protein engineering efforts because the functional impact of mutations depends on other mutations; on the other hand, due to the nonlinear impact of mutations on function, a handful of strategic mutations can lead to a dramatic functional change to a desirable profile.
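The dependence of a mutation's effect on its background can be made concrete with the common additive definition of pairwise epistasis. The sketch below assumes function is measured on a log scale, so that independent effects add; the function names and all numerical values are hypothetical and serve only as illustration.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the sum of single-mutant effects.

    All values are assumed to be on a log scale (e.g., log10 activity).
    Zero means the two mutations combine additively.
    """
    observed = f_ab - f_wt
    expected = (f_a - f_wt) + (f_b - f_wt)
    return observed - expected


def shows_sign_epistasis(f_wt, f_a, f_b, f_ab):
    """True if mutation A changes sign of effect depending on mutation B."""
    effect_alone = f_a - f_wt          # A introduced into the wild type
    effect_with_b = f_ab - f_b         # A introduced into the B background
    return effect_alone * effect_with_b < 0


# Hypothetical log10 activities: each single mutant is worse than wild type,
# but the double mutant is ten times better.
f_wt, f_a, f_b, f_ab = 0.0, -1.0, -0.5, 1.0

epsilon = pairwise_epistasis(f_wt, f_a, f_b, f_ab)
sign = shows_sign_epistasis(f_wt, f_a, f_b, f_ab)
print(epsilon, sign)
```

In this made-up case, mutation A is deleterious on its own yet beneficial once B is present, which is the situation that blocks stepwise optimization but rewards the right combination.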
Despite considerable progress in understanding the molecular basis of epistasis, we are still taking first steps toward addressing its implications for protein design. Data-driven approaches have successfully generated functional sequences with many mutations and without requiring a mechanistic or theoretical framework. The generative capabilities of these approaches are limited, however, by the extent and quality of the data available for training. For instance, the success of structure prediction algorithms relied on vast sequence and structure databases like the PDB. In contrast, comprehensive functional annotations are sparse (150). Future studies that generate large sets of standardized and high-quality experimental data in a variety of proteins may address this critical challenge, but it remains to be seen whether data-driven approaches can extrapolate to unseen regions of sequence space in order to predict mutants that exhibit a substantially different activity.
By contrast, energy-based methods can make useful predictions where data are sparse. Combined with evolutionary data, energy-based methods have demonstrated high reliability and generality in designing highly efficient enzymes and binders that are difficult for evolutionary strategies to reach. Crucially, in most cases, the resulting proteins do not exhibit frustrating trade-offs that are often observed in conventional protein optimization strategies where higher activity comes at the cost of lower stability (11, 76). Although we are at an early stage of developing methods that address epistasis in design, we cautiously suggest that such methods may be able to generate thousands of functional active-site variants starting from any natural protein. This suggests exciting opportunities for enhancing existing activities or for discovering new ones by systematically exploring sequence and structure space in regions that are difficult for evolutionary processes to sample efficiently.
Finally, we note an important general trend in the development of protein design methodology. For most of the field's history, protein design methods were developed using evolutionary (151), atomistic (120), and machine-learning approaches (115), mostly in isolation from one another. However, protein design is without a doubt one of the most complex problems in computational biology. This complexity is due to the diversity of protein folds and functions, the multidimensionality of sequence and conformation space, and, perhaps most crucially, epistasis, that is, the nonlinear dependence of function on mutations in different parts of the protein. An interdisciplinary approach may therefore be crucial to address the long-standing challenge of developing methods for general and reliable design of function. Accordingly, over the past decade, breakthroughs in protein design methodology have combined concepts and methods from multiple disciplines (18, 42, 123, 131). In addressing epistasis, however, such synthesis has only been attempted quite recently, and we note that there is much room for greater methodological sophistication. In particular, even in recent design studies, fine-tuning the design method based on experimental results of designed libraries resulted in dramatic gains in success rate (42, 128, 152). Nevertheless, the fraction of functional proteins in library screening is still small, and we are optimistic that analyzing large sets of functional designs will clarify obscure determinants of epistasis and improve design reliability. Given that the design of protein folds is now routine (87, 152, 153), design of function is a critical next frontier that may make large gains through interdisciplinary efforts that involve protein biochemists, data scientists, and evolutionary biologists.
Acknowledgments
We thank Ziv Avizemer for providing structure models that were used to generate Fig. 3A. R.L.-S. was supported by a fellowship from the Arianne de Rothschild Women Doctoral Program. Work in the Fleishman lab was funded by the Volkswagen Foundation grant 9474, the Israel Science Foundation grant 1844, the European Research Council through a Consolidator Award grant 815379, the Dr. Barry Sherman Institute for Medicinal Chemistry, and a donation in memory of Sam Switzer.
Author contributions
R.L.-S. and S.J.F. designed research; R.L.-S. performed research; and R.L.-S. and S.J.F. wrote the paper.
Competing interests
R.L.-S. and S.J.F. are named inventors on patents relating to methods and designs described in the manuscript, and S.J.F. consults on the application of protein design methods.
Footnotes
This article is a PNAS Direct Submission.
Contributor Information
Rosalie Lipsh-Sokolik, Email: rosalipsh@gmail.com.
Sarel J. Fleishman, Email: sarel@weizmann.ac.il.
Data, Materials, and Software Availability
All study data are included in the main text.
References
- 1. Jencks W. P., Binding energy, specificity, and enzymic catalysis: The Circe effect. Adv. Enzymol. Relat. Areas Mol. Biol. 43, 219–410 (1975).
- 2. Warshel A., Electrostatic origin of the catalytic power of enzymes and the role of preorganized active sites. J. Biol. Chem. 273, 27035–27038 (1998).
- 3. Yue P., Li Z., Moult J., Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. 353, 459–473 (2005).
- 4. Shoichet B. K., Baase W. A., Kuroki R., Matthews B. W., A relationship between protein stability and protein function. Proc. Natl. Acad. Sci. U.S.A. 92, 452–456 (1995).
- 5. Meiering E. M., Serrano L., Fersht A. R., Effect of active site residues in barnase on activity and stability. J. Mol. Biol. 225, 585–589 (1992).
- 6. Wellner A., Raitses Gurevich M., Tawfik D. S., Mechanisms of protein sequence divergence and incompatibility. PLoS Genet. 9, e1003665 (2013).
- 7. Bershtein S., Segal M., Bekerman R., Tokuriki N., Tawfik D. S., Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932 (2006).
- 8. Bloom J. D., et al., Thermodynamic prediction of protein neutrality. Proc. Natl. Acad. Sci. U.S.A. 102, 606–611 (2005).
- 9. Starr T. N., Thornton J. W., Epistasis in protein evolution. Protein Sci. 25, 1204–1218 (2016).
- 10. Metzger B. P. H., Park Y., Starr T. N., Thornton J. W., Epistasis facilitates functional evolution in an ancient transcription factor. eLife 12, RP88737 (2024).
- 11. Klesmith J. R., Bacik J.-P., Wrenbeck E. E., Michalczyk R., Whitehead T. A., Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc. Natl. Acad. Sci. U.S.A. 114, 2265–2270 (2017).
- 12. Whitehead T. A., et al., Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nat. Biotechnol. 30, 543–548 (2012).
- 13. Fowler D. M., et al., High-resolution mapping of protein sequence-function relationships. Nat. Methods 7, 741–746 (2010).
- 14. Listov D., et al., Assessing and enhancing foldability in designed proteins. Protein Sci. 31, e4400 (2022).
- 15. Breen M. S., Kemena C., Vlasov P. K., Notredame C., Kondrashov F. A., Epistasis as the primary factor in molecular evolution. Nature 490, 535–538 (2012).
- 16. Weinreich D. M., Watson R. A., Chao L., Perspective: Sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59, 1165–1174 (2005).
- 17. Weinreich D. M., Lan Y., Wylie C. S., Heckendorn R. B., Should evolutionary geneticists worry about higher-order epistasis? Curr. Opin. Genet. Dev. 23, 700–707 (2013).
- 18. Khersonsky O., et al., Automated design of efficient and functionally diverse enzyme repertoires. Mol. Cell 72, 178–186.e5 (2018), 10.1016/j.molcel.2018.08.033.
- 19. Weinreich D. M., Delaney N. F., Depristo M. A., Hartl D. L., Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114 (2006).
- 20. Poelwijk F. J., Kiviet D. J., Weinreich D. M., Tans S. J., Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445, 383–386 (2007).
- 21. Povolotskaya I. S., Kondrashov F. A., Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926 (2010).
- 22. Barešić A., Hopcroft L. E. M., Rogers H. H., Hurst J. M., Martin A. C. R., Compensated pathogenic deviations: Analysis of structural effects. J. Mol. Biol. 396, 19–30 (2010).
- 23. Meer M. V., Kondrashov A. S., Artzy-Randrup Y., Kondrashov F. A., Compensatory evolution in mitochondrial tRNAs navigates valleys of low fitness. Nature 464, 279–282 (2010).
- 24. Warszawski S., et al., Optimizing antibody affinity and stability by the automated design of the variable light-heavy chain interfaces. PLoS Comput. Biol. 15, e1007207 (2019).
- 25. Miton C. M., Tokuriki N., How mutational epistasis impairs predictability in protein evolution and design. Protein Sci. 25, 1260–1272 (2016).
- 26. de Visser J. A. G. M., Krug J., Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490 (2014).
- 27. Araya C. L., Fowler D. M., Deep mutational scanning: Assessing protein function on a massive scale. Trends Biotechnol. 29, 435–442 (2011).
- 28. Arnold F. H., Innovation by evolution: Bringing new chemistry to life (Nobel Lecture). Angew. Chem. Int. Ed. Engl. 58, 14420–14426 (2019).
- 29. Tokuriki N., et al., Diminishing returns and tradeoffs constrain the laboratory optimization of an enzyme. Nat. Commun. 3, 1257 (2012).
- 30. Tóth-Petróczy A., Tawfik D. S., The robustness and innovability of protein folds. Curr. Opin. Struct. Biol. 26, 131–138 (2014).
- 31. Goldsmith M., Tawfik D. S., Enzyme engineering: Reaching the maximal catalytic efficiency peak. Curr. Opin. Struct. Biol. 47, 140–150 (2017).
- 32. Poelwijk F. J., Socolich M., Ranganathan R., Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat. Commun. 10, 4213 (2019).
- 33. Salinas V. H., Ranganathan R., Coevolution-based inference of amino acid interactions underlying protein function. eLife 7, e34300 (2018).
- 34. Russ W. P., et al., An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).
- 35. Hopf T. A., et al., Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
- 36. Ospina F., et al., Selective biocatalytic N-methylation of unsaturated heterocycles. Angew. Chem. Int. Ed. Engl. 61, e202213056 (2022).
- 37. Ivankov D. N., Finkelstein A. V., Kondrashov F. A., A structural perspective of compensatory evolution. Curr. Opin. Struct. Biol. 26, 104–112 (2014).
- 38. Wilding M., Hong N., Spence M., Buckle A. M., Jackson C. J., Protein engineering: The potential of remote mutations. Biochem. Soc. Trans. 47, 701–711 (2019).
- 39. Avizemer Z., Martí-Gómez C., Hoch S. Y., McCandlish D. M., Fleishman S. J., Evolutionary paths that link orthogonal pairs of binding proteins. Res Sq [Preprint] (2023). 10.21203/rs.3.rs-2836905/v1 (Accessed 19 October 2023).
- 40. Ortlund E. A., Bridgham J. T., Redinbo M. R., Thornton J. W., Crystal structure of an ancient protein: Evolution by conformational epistasis. Science 317, 1544–1548 (2007).
- 41. Mirdita M., et al., ColabFold: Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
- 42. Lipsh-Sokolik R., et al., Combinatorial assembly and design of enzymes. Science 379, 195–201 (2023).
- 43. Goldenzweig A., et al., Automated structure- and sequence-based design of proteins for high bacterial expression and stability. Mol. Cell 63, 337–346 (2016).
- 44. Natarajan C., et al., Epistasis among adaptive mutations in deer mouse hemoglobin. Science 340, 1324–1327 (2013).
- 45. Starr T. N., Picton L. K., Thornton J. W., Alternative evolutionary histories in the sequence space of an ancient protein. Nature 549, 409–413 (2017).
- 46. Zheng J., Guo N., Wagner A., Selection enhances protein evolvability by increasing mutational robustness and foldability. Science 370, eabb5962 (2020).
- 47. Bloom J. D., Labthavikul S. T., Otey C. R., Arnold F. H., Protein stability promotes evolvability. Proc. Natl. Acad. Sci. U.S.A. 103, 5869–5874 (2006).
- 48. Bershtein S., Goldin K., Tawfik D. S., Intense neutral drifts yield robust and evolvable consensus proteins. J. Mol. Biol. 379, 1029–1044 (2008).
- 49. Ding D., et al., Co-evolution of interacting proteins through non-contacting and non-specific mutations. Nat. Ecol. Evol. 6, 590–603 (2022).
- 50. Boder E. T., Midelfort K. S., Wittrup K. D., Directed evolution of antibody fragments with monovalent femtomolar antigen-binding affinity. Proc. Natl. Acad. Sci. U.S.A. 97, 10701–10705 (2000).
- 51. Smith J. M., Natural selection and the concept of a protein space. Nature 225, 563–564 (1970).
- 52. Kondrashov D. A., Kondrashov F. A., Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31, 24–33 (2015).
- 53. Franke J., Klözer A., de Visser J. A. G. M., Krug J., Evolutionary accessibility of mutational pathways. PLoS Comput. Biol. 7, e1002134 (2011).
- 54. Sailer Z. R., Harms M. J., High-order epistasis shapes evolutionary trajectories. PLoS Comput. Biol. 13, e1005541 (2017).
- 55. Starr T. N., Thornton J. W., Exploring protein sequence-function landscapes. Nat. Biotechnol. 35, 125–126 (2017).
- 56. Podgornaia A. I., Laub M. T., Pervasive degeneracy and epistasis in a protein-protein interface. Science 347, 673–677 (2015).
- 57. Lozovsky E. R., et al., Stepwise acquisition of pyrimethamine resistance in the malaria parasite. Proc. Natl. Acad. Sci. U.S.A. 106, 12025–12030 (2009).
- 58. Sailer Z. R., et al., Inferring a complete genotype-phenotype map from a small number of measured phenotypes. PLoS Comput. Biol. 16, e1008243 (2020).
- 59. Szendro I. G., Schenk M. F., Franke J., Krug J., de Visser J. A. G. M., Quantitative analyses of empirical fitness landscapes. J. Stat. Mech. 2013, P01005 (2013).
- 60. Payne J. L., Wagner A., The causes of evolvability and their evolution. Nat. Rev. Genet. 20, 24–38 (2019).
- 61. Khersonsky O., Tawfik D. S., Enzyme promiscuity: A mechanistic and evolutionary perspective. Annu. Rev. Biochem. 79, 471–505 (2010).
- 62. Stiffler M. A., Hekstra D. R., Ranganathan R., Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882–892 (2015).
- 63. Stemmer W. P., DNA shuffling by random fragmentation and reassembly: In vitro recombination for molecular evolution. Proc. Natl. Acad. Sci. U.S.A. 91, 10747–10751 (1994).
- 64. Voigt C. A., Martinez C., Wang Z.-G., Mayo S. L., Arnold F. H., Protein building blocks preserved by recombination. Nat. Struct. Biol. 9, 553–558 (2002).
- 65. Afriat-Jurnou L., Jackson C. J., Tawfik D. S., Reconstructing a missing link in the evolution of a recently diverged phosphotriesterase by active-site loop remodeling. Biochemistry 51, 6047–6055 (2012).
- 66. Wagner G. P., The logical structure of irreversible systems transformations: A theorem concerning Dollo’s law and chaotic movement. J. Theor. Biol. 96, 337–346 (1982).
- 67. Kaltenbach M., Jackson C. J., Campbell E. C., Hollfelder F., Tokuriki N., Reverse evolution leads to genotypic incompatibility despite functional and active site convergence. eLife 4, e06492 (2015).
- 68. Shah P., McCandlish D. M., Plotkin J. B., Contingency and entrenchment in protein evolution under purifying selection. Proc. Natl. Acad. Sci. U.S.A. 112, E3226–E3235 (2015).
- 69. Bridgham J. T., Ortlund E. A., Thornton J. W., An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461, 515–519 (2009).
- 70. Levin K. B., et al., Following evolutionary paths to protein-protein interactions with high affinity and selectivity. Nat. Struct. Mol. Biol. 16, 1049–1055 (2009).
- 71. Chao G., et al., Isolating and engineering human antibodies using yeast surface display. Nat. Protoc. 1, 755–768 (2006).
- 72. Whitehead T. A., Baker D., Fleishman S. J., Computational design of novel protein binders and experimental affinity maturation. Methods Enzymol. 523, 1–19 (2013).
- 73. Gantz M., Neun S., Medcalf E. J., van Vliet L. D., Hollfelder F., Ultrahigh-throughput enzyme engineering and discovery in in vitro compartments. Chem. Rev. 123, 5571–5611 (2023).
- 74. Hanes J., Plückthun A., In vitro selection and evolution of functional proteins by using ribosome display. Proc. Natl. Acad. Sci. U.S.A. 94, 4937–4942 (1997).
- 75. Fowler D. M., Fields S., Deep mutational scanning: A new style of protein science. Nat. Methods 11, 801–807 (2014).
- 76. Goldsmith M., et al., Overcoming an optimization plateau in the directed evolution of highly efficient nerve agent bioscavengers. Protein Eng. Des. Sel. 30, 333–345 (2017).
- 77. Bedbrook C. N., Yang K. K., Rice A. J., Gradinaru V., Arnold F. H., Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS Comput. Biol. 13, e1005786 (2017).
- 78. Siedhoff N. E., Schwaneberg U., Davari M. D., Machine learning-assisted enzyme engineering. Methods Enzymol. 643, 281–315 (2020).
- 79. Li G., Dong Y., Reetz M. T., Can machine learning revolutionize directed evolution of selective enzymes? Adv. Synth. Catal. 361, 2377–2386 (2019).
- 80. Göbel U., Sander C., Schneider R., Valencia A., Correlated mutations and residue contacts in proteins. Proteins 18, 309–317 (1994).
- 81. Levy R. M., Haldane A., Flynn W. F., Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol. 43, 55–62 (2017).
- 82. Marks D. S., et al., Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766 (2011).
- 83. Riesselman A. J., Ingraham J. B., Marks D. S., Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
- 84. Wang S., Sun S., Li Z., Zhang R., Xu J., Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
- 85. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 86. Baek M., et al., Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
- 87. Watson J. L., et al., De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023), 10.1038/s41586-023-06415-8.
- 88. Lewis R. D., France S. P., Martinez C. A., Emerging technologies for biocatalysis in the pharmaceutical industry. ACS Catal. 13, 5571–5577 (2023).
- 89. Wu Z., Johnston K. E., Arnold F. H., Yang K. K., Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65, 18–27 (2021).
- 90. Hawkins-Hooker A., et al., Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
- 91. Repecka D., et al., Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
- 92. Wu Z., et al., Signal peptides generated by attention-based neural networks. ACS Synth. Biol. 9, 2154–2161 (2020).
- 93. Biswas S., Khimulya G., Alley E. C., Esvelt K. M., Church G. M., Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
- 94. Madani A., et al., Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023), 10.1038/s41587-022-01618-2.
- 95. Pauling L., Zuckerkandl E., Henriksen T., Lövstad R., Chemical paleogenetics. Molecular “restoration studies” of extinct forms of life. Acta Chem. Scand. 17, 9–16 (1963).
- 96. Johnson S. R., et al., Computational scoring and experimental evaluation of enzymes generated by neural networks. bioRxiv [Preprint] (2023). 10.1101/2023.03.04.531015 (Accessed 19 October 2023).
- 97. Yang A., et al., Deploying synthetic coevolution and machine learning to engineer protein-protein interactions. Science 381, eadh1720 (2023).
- 98. Hanning K. R., Minot M., Warrender A. K., Kelton W., Reddy S. T., Deep mutational scanning for therapeutic antibody engineering. Trends Pharmacol. Sci. 43, 123–135 (2022).
- 99. Romero P. A., Krause A., Arnold F. H., Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U.S.A. 110, E193–E201 (2013).
- 100. Wu Z., Kan S. B. J., Lewis R. D., Wittmann B. J., Arnold F. H., Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. U.S.A. 116, 8852–8858 (2019), 10.1073/pnas.1901979116.
- 101. Saito Y., et al., Machine-learning-guided mutagenesis for directed evolution of fluorescent proteins. ACS Synth. Biol. 7, 2014–2022 (2018).
- 102. Aghazadeh A., et al., Epistatic net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun. 12, 5225 (2021).
- 103. Wittmann B. J., Johnston K. E., Wu Z., Arnold F. H., Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).
- 104. Alley E. C., Khimulya G., Biswas S., AlQuraishi M., Church G. M., Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019), 10.1038/s41592-019-0598-1.
- 105. Yang K. K., Wu Z., Bedbrook C. N., Arnold F. H., Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).
- 106. Rao R., et al., Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
- 107. Wittmann B. J., Yue Y., Arnold F. H., Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045.e7 (2021).
- 108. Rives A., et al., Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021).
- 109. Hsu C., Nisonoff H., Fannjiang C., Listgarten J., Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).
- 110. Minot M., Reddy S. T., Meta learning improves robustness and performance in machine learning-guided protein engineering. bioRxiv [Preprint] (2023). 10.1101/2023.01.30.526201 (Accessed 19 October 2023).
- 111. Fannjiang C., Listgarten J., Is novelty predictable? arXiv [Preprint] (2023). 10.48550/arXiv.2306.00872 (Accessed 19 October 2023).
- 112. Bashton M., Chothia C., The generation of new protein functions by the combination of domains. Structure 15, 85–99 (2007).
- 113. Ding D., et al., Protein design using structure-based residue preferences. bioRxiv [Preprint] (2023). 10.1101/2022.10.31.514613 (Accessed 19 October 2023).
- 114. Ma E. J., et al., Machine-directed evolution of an imine reductase for activity and stereoselectivity. ACS Catal. 11, 12433–12445 (2021).
- 115. Fox R. J., et al., Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344 (2007).
- 116. Chen L., et al., Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst. 14, 706–721.e5 (2023).
- 117. Li F.-Z., Amini A. P., Yue Y., Yang K. K., Lu A. X., Feature reuse and scaling: Understanding transfer learning with protein language models. bioRxiv [Preprint] (2024). 10.1101/2024.02.05.578959 (Accessed 19 October 2023).
- 118. Goldenzweig A., Fleishman S. J., Principles of protein stability and their application in computational design. Annu. Rev. Biochem. 87, 105–129 (2018).
- 119. Permyakov E. A., α-Lactalbumin, amazing calcium-binding protein. Biomolecules 10, 1210 (2020).
- 120. Kuhlman B., et al., Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
- 121.Wijma H. J., Fürst M. J. L. J., Janssen D. B., "A computational library design protocol for rapid improvement of protein stability: FRESCO" in Protein Engineering: Methods and Protocols, Bornscheuer U. T., Höhne M., Eds. (Springer, New York, 2018), pp. 69–85. [DOI] [PubMed] [Google Scholar]
- 122.Pavelka A., Chovancova E., Damborsky J., HotSpot Wizard: A web server for identification of hot spots in protein engineering. Nucleic Acids Res. 37, W376–W383 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Musil M., et al. , FireProt: Web server for automated design of thermostable proteins. Nucleic Acids Res. 45, W393–W399 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Cherny I., et al. , Engineering V-type nerve agents detoxifying enzymes using computationally focused libraries. ACS Chem. Biol. 8, 2394–2403 (2013). [DOI] [PubMed] [Google Scholar]
- 125.Froning K. J., et al. , Computational design of a specific heavy chain/κ light chain interface for expressing fully IgG bispecific antibodies. Protein Sci. 26, 2021–2038 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Lewis S. M., et al. , Generation of bispecific IgG antibodies by structure-based design of an orthogonal Fab interface. Nat. Biotechnol. 32, 191–198 (2014). [DOI] [PubMed] [Google Scholar]
- 127.Fleishman S. J., et al. , Computational design of proteins targeting the conserved stem region of influenza hemagglutinin. Science 332, 816–821 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Chevalier A., et al. , Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017), 10.1038/nature23912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129.Alford R. F., et al. , The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.O’Meara M. J., et al. , A combined covalent-electrostatic model of hydrogen bonding improves structure prediction with Rosetta. J. Chem. Theory Comput. 11, 609–622 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Weinstein J., Khersonsky O., Fleishman S. J., Practically useful protein-design methods combining phylogenetic and atomistic calculations. Curr. Opin. Struct. Biol. 63, 58–64 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Barber-Zucker S., et al. , Designed high-redox potential laccases exhibit high functional diversity. ACS Catal. 12, 13164–13173 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Risso V. A., et al. , Enhancing a de novo enzyme activity by computationally-focused ultra-low-throughput screening. Chem. Sci. 11, 6134–6148 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134. Bengel L. L., et al., Engineered enzymes enable selective N-alkylation of pyrazoles with simple haloalkanes. Angew. Chem. Int. Ed. Engl. 60, 5554–5560 (2021).
- 135. Klaus M., Buyachuihan L., Grininger M., Ketosynthase domain constrains the design of polyketide synthases. ACS Chem. Biol. 15, 2422–2432 (2020).
- 136. Gomez de Santos P., et al., Repertoire of computationally designed peroxygenases for enantiodivergent C-H oxyfunctionalization reactions. J. Am. Chem. Soc. 145, 3443–3453 (2023).
- 137. Leonard A. C., et al., Stabilization of the SARS-CoV-2 receptor binding domain by protein core redesign and deep mutational scanning. Protein Eng. Des. Sel. 35, gzac002 (2022).
- 138. Netzer R., et al., Ultrahigh specificity in a network of computationally designed protein-interaction pairs. Nat. Commun. 9, 5286 (2018).
- 139. VanDrisse C. M., Lipsh-Sokolik R., Khersonsky O., Fleishman S. J., Newman D. K., Computationally designed pyocyanin demethylase acts synergistically with tobramycin to kill recalcitrant Pseudomonas aeruginosa biofilms. Proc. Natl. Acad. Sci. U.S.A. 118, e2022012118 (2021).
- 140. Teichmann S. A., Chothia C., Gerstein M., Advances in structural genomics. Curr. Opin. Struct. Biol. 9, 390–399 (1999).
- 141. Chothia C., One thousand families for the molecular biologist. Nature 357, 543–544 (1992), 10.1038/357543a0.
- 142. Zhang J., Evolution by gene duplication: An update. Trends Ecol. Evol. 18, 292–298 (2003).
- 143. Lipsh-Sokolik R., Listov D., Fleishman S. J., The AbDesign computational pipeline for modular backbone assembly and design of binders and enzymes. Protein Sci. 30, 151–159 (2021).
- 144. Jacobs T. M., et al., Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).
- 145. Ferruz N., et al., Identification and analysis of natural building blocks for evolution-guided fragment-based protein design. J. Mol. Biol. 432, 3898–3914 (2020).
- 146. Eisenbeis S., et al., Potential of fragment recombination for rational design of proteins. J. Am. Chem. Soc. 134, 4019–4022 (2012).
- 147. Höcker B., Claren J., Sterner R., Mimicking enzyme evolution by generating new (βα)8-barrels from (βα)4-half-barrels. Proc. Natl. Acad. Sci. U.S.A. 101, 16448–16453 (2004).
- 148. Weinstein J. Y., et al., Designed active-site library reveals thousands of functional GFP variants. Nat. Commun. 14, 2890 (2023).
- 149. Dellus-Gur E., Toth-Petroczy A., Elias M., Tawfik D. S., What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs. J. Mol. Biol. 425, 2609–2621 (2013).
- 150. Chica R. A., Ferruz N., What does it take for an “AlphaFold Moment” in functional protein engineering and design? Nat. Biotechnol. 42, 173–174 (2024).
- 151. Steipe B., Schiller B., Pluckthun A., Steinbacher S., Sequence statistics reliably predict stabilizing mutations in a protein domain. J. Mol. Biol. 240, 188–192 (1994).
- 152. Rocklin G. J., et al., Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
- 153. Tsuboyama K., et al., Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444 (2023), 10.1038/s41586-023-06328-6.
Data Availability Statement
All study data are included in the main text.



