Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 1048. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
Protein engineering and design can create proteins with optimized functions for various applications in biotechnology, medicine, and synthetic biology1–3. The fundamental challenge of protein engineering is to understand and manipulate the protein fitness landscape, which is a high-dimensional and complex space that contains a vast number of possible sequences and functions4,5.
Although there have been considerable attempts over the past several decades to search this space for high-fitness sequences, we have only scratched the surface of understanding the rules and features of the space6–9.
Experimental methods using directed evolution techniques, such as deep mutational scanning10,11, site-saturated mutagenesis12,13, and random library construction14,15, can provide valuable information, but they are laborious and time-consuming to scale up and typically must trade off accuracy and precision with sequence space coverage. These experimental methods are also usually restricted to low-dimensional mutations that do not take into account the natural selection pressure that shapes the protein fitness landscape in high-dimensional space. Advanced directed evolution tools that support the necessary scale, such as phage-assisted continuous evolution (PACE)16,17 or OrthoRep18, provide information primarily about trajectories that lead to high-fitness variants, which is insufficient to model the fitness landscape in its entirety. Computational methods, such as structure or sequence-based modeling of the protein fitness landscape19–25, can evaluate larger sequence spaces but are limited by the availability and quality of training data, especially for proteins with few homologs or no structure information. These computational methods also typically do not account for other biological factors that affect the protein function, such as in vivo interactions or post-translational modifications.
An ideal approach to understanding and navigating this space for design and engineering purposes would use comprehensive high-throughput experimental data to inform efficient computational models. It was shown that high-throughput short sequencing data from directed evolution experiments can enable machine learning methods to reconstruct the full-length genotype and identify high-fitness variants26. Furthermore, it has been demonstrated that deep learning models for protein design can benefit from even a limited number of functionally characterized variants27. A recent work demonstrated that the protein fitness landscape is rugged with many local peaks but still easily navigable28. We view these functional variants or local peaks as key “anchor” points that capture the features of high-fitness genotype space. We hypothesize that the design space for high-fitness genotypes can be effectively compressed by identifying a sufficient number of these “anchor” points to capture all the essential features, which can then instruct deep learning models to reconstruct and explore the whole space. However, no existing method can generate these anchors in a rapid and comprehensive way, especially for anchors from the high-dimensional space. Such a method would need to capture functional information about variants evenly distributed across protein sequence space in a very high throughput manner.
Here, we present EvoAI, a novel approach to empirically interrogate, then model, compress, and reconstruct, the sequence space. Our approach combines high-throughput experimental evolution and computational methods to capture and learn from the essential features of the space. We first developed an evolutionary scanning method that adapts phage-assisted non-continuous evolution (PANCE)17 by incorporating a segmented mutagenesis system based on EvolvR29. Compared to traditional methods, this method enabled rapid and thorough evolutionary scanning from low to high dimensions and captured valuable fitness anchors. We then developed a deep learning and large language model to reconstruct the sequence space from these anchors and design new proteins with more than 10-fold improved activity compared to wild-type. For a repressor protein, we demonstrated that this vast design space can be extremely compressed by a factor of 1048 to 82 points.
Results
The evolutionary scanning method
The M13 bacteriophage has a single-stranded DNA genome, but it generates a double-stranded form after infecting the host cell30 (Fig. 1a). We reasoned that this should allow the targeted CRISPR-guided DNA polymerase mutagenesis system (TP) to introduce mutations into the M13 phage genome for selection and evolution29. Here, the expression of the nCas9-PolI complex was controlled by the vanillic acid induced VanR-pVanA expression system that has a large induction fold change and low background expression, and is suitable for expressing large and highly toxic proteins31,32. The evolution target was inserted into the M13 genome in place of gIII (the major coat protein of M13) to generate the selection phage (SP). The accessory plasmid (AP) expresses guide RNAs (gRNAs) that target different regions of the gene of interest for mutagenesis. The AP also contains gIII under the control of a genetic circuit that links the function of the gene of interest to the expression of gIII. This allows the selection of phages with improved and high-fitness protein function during phage propagation, while phages with non-functional genes are eliminated after dilution (Fig. 1a). We named this system EvoScan (Evolutionary Scanning).
Figure 1.
EvoScan scheme, development, and validation on a protein-protein interaction evolution.
(a) Overview of the Evolution Scanning system. (b) Testing scheme for the EvoScan system on EGFP-nanobody interaction. (c) Predicted interaction between EGFP and nanobody. The structures of EGFP and nanobody were predicted by Alphafold2, and the interaction was modeled by ZDOCK (method). The position of the E103 site is labeled in red. (d) Propagation assays of combinatorial AP designs. Several ribosome binding sites (RBS) were tested, including: P2 and P3 for PhlF, and sd2, sd5 and B0064 for gIII. (e) Propagation assays of WT-Nano phage, the E103K mutant phage, and empty phage (ΔgIII) under different concentrations of IPTG (0, 200 μM and 1 mM) to control cI434-EGFP expression levels. (f) EvoScan of the EGFP nanobody. Three different gRNAs targeted different CDR regions of the E103K nanobody. An off-target gRNA with no target sequence in the nanobody was used as a control group. (g) Phage propagation and mutations of EvoScan and PANCE for EGFP nanobody. Initial titers of phage E103K were 3×109 p.f.u. /ml for EvoScan and 5×108 p.f.u. /ml for PANCE. Dilution factor was 100 for each passage. Sequences of sgRNA and genetic parts are provided in Supplementary Table 4 and Table 5. Data are mean ± SD of three experiments, except for phage titers.
EvoScan can explore specific regions of the fitness landscape to generate valuable anchors. These anchors are obtained by using different gRNAs to divide the target gene into defined segments, thus reducing the dimensionality of the fitness space. Moreover, the combination of different gRNAs through serial propagation on host cells bearing different APs enables the scanning and identification of anchors in higher dimensions, which can capture more details of the protein sequence space. To investigate and scan the protein sequence space, we validated and used this system to study three proteins with diverse functions: an EGFP-specific nanobody for protein-protein interaction; SARS-CoV-2 Mpro and its inhibitors for protein-ligand interaction; and AmeR and its DNA operator for protein-nucleic acid interaction.
Validation of EvoScan and rapid identification of anchors in nanobody
To validate EvoScan and apply this system to proteins involved in protein-protein interaction, we chose antigen-antibody interaction, in this case, EGFP and its cognate nanobody33. We first established a reverse two-hybrid system (RTHS) that coupled the nanobody-EGFP interaction to the expression of gIII. We fused EGFP to the cI434 repressor, and its nanobody to cIp22, which can interact with cI434 but not with itself34. The gene encoding nanobody-cIp22 was inserted on phage to replace gIII. The gene encoding EGFP-cI434 was integrated on the AP and transformed into E. coli. After phage infection, interaction between EGFP and nanobody will enable the interaction between cI434 and cIp22 to form a tetramer complex and inhibit the p434 promoter (Fig. 1b, 1c). In the AP, a transcriptional repressor PhlF was placed downstream of the p434 promoter, and gIII was placed under the control of the pPhlF promoter, such that interaction between EGFP and the nanobody will eventually induce the expression of gIII and allow phage propagation (Fig. 1b). We tested several combinations of ribosome binding sites and chose P3 RBS for PhlF and B0064 for gIII (Fig. 1d). This circuit propagated phage carrying EGFP nanobody while limiting the propagation of empty phage.
To test whether EvoScan could quickly identify a fitness-increasing protein variant “anchor” site, we artificially disrupted the interaction between EGFP and nanobody by introducing the E103K mutation in the CDR3 of the nanobody, which is essential for binding to its target (Fig. 1c, Fig. 1e). We designed four different gRNAs targeting different segments of the nanobody gene, with gRNA3 designed to target the segment containing the E103K mutation site of the nanobody (Fig. 1f). After two passages in EvoScan, we observed that only the group with gRNA3 targeting the E103K segment showed increased phage titer, while the other three groups all decreased (Fig. 1g). Sequencing results of the phage supernatant confirmed that in the gRNA3 group, the E103K mutation had reverted back to glutamate. This validated that EvoScan can successfully and efficiently identify anchors that play important roles in protein function.
For comparison, we also implemented a traditional phage-assisted non-continuous evolution system (PANCE) using the same E103K phage (Extended Data Fig. 1a). The two systems differed only in the use of targeted (EvoScan, TP) or non-targeted (PANCE, MP6) mutagenesis. After 8 passages in PANCE, no consensus mutations were found in the nanobody gene. Interestingly, a N29D single mutation appeared on cIp22 (Extended Data Fig. 1b, Fig. 1g), which disrupted the selection pressure on nanobody function due to the strong self-interaction between the two cI repressors (Extended Data Fig. 1c, 1d). These results further demonstrated that EvoScan can rapidly guide the evolution for precise searching of target proteins, even in the context of a more likely background mutation that could interfere with the desired evolution process.
Thorough identification of anchors reveals novel Mpro drug resistant variants
We next applied EvoScan to investigate protein-ligand interaction. In this case, we chose Mpro, a crucial protease in the SARS-CoV-2 virus35,36. Several Mpro inhibitors have been developed and used to treat COVID-19 patients, such as GC37637 and PF-0732133238, which is a key component of Paxlovid. However, the rapid mutation of SARS-CoV-2 may reduce or even eliminate the efficacy of these drugs. Previous studies have identified mutational hotspots for drug resistance but have not comprehensively profiled the Mpro drug resistance fitness landscape39,40. It is important to thoroughly study possible escape mechanisms of Mpro in order to inform future drug development efforts. Here, we used EvoScan to systematically identify and extract key anchors from different regions of Mpro that affect its interaction with small molecule inhibitors.
To couple the protease activity of Mpro to the expression of downstream reporter genes, we fused the two cI repressors, cI434 and cIp22, with a linker that contains the specific sequence motif recognized by Mpro, such that only functional Mpro will cleave and deactivate the fused cI repressor (Fig. 2a). We also used a previously reported inactive Mpro mutant (C145A) to validate this system35,36. Our results demonstrated that this selection circuit can accurately and sensitively report on Mpro activity and the inhibition efficiency of small molecules (Fig. 2b, 2c). In addition, we found that this genetic circuit can be used for proteases from other viruses such as HCV (Extended Data Fig. 2a, 2b), demonstrating its robustness and broad applications. Compared to previously reported selections used for protease evolution in PACE41–43, our circuit represents an alternative and improved strategy with better response properties (Extended Data Fig. 2c, 2d).
Figure 2.
Thorough segment scanning for protein-ligand interaction evolution.
(a) A schematic of the Mpro activity fluorescence reporter system. The Mpro substrate peptide CAAVLQSGFRKK was cloned into the linker between cI434 and cIp22 so that protease activity controlled the output (eYFP) from the p434 promoter. (b) Flow cytometry assays for Mpro and the C145A mutant under different inducer concentrations (0, 50 and 200 μM vanillic acid, and 200 μM IPTG). (c) Flow cytometry assays for Mpro activity in the presence of 50 μM inhibitors (without inhibitor (w/o), GC376 (GC) and PF-07321332 (PF)). The concentration of IPTG was 200 μM and the concentration of vanillic acid was 100 μM. (d) Genetic circuit design for Mpro evolution. 32 different APs carrying 32 different gRNAs tiling the Mpro gene sequence were designed. (e) Phage propagation assays for Mpro, the C145A mutant, and the empty phage at 1 mM IPTG. (f) Phage propagation assays in the presence of 0, 20, or 40 μM inhibitors. (g) Schematic diagram of the Mpro EvoScan process. (h) EvoScan of Mpro using two different inhibitors and 32 different gRNAs. The initial titer of Mpro phage was 5×106 p.f.u. /ml. (i) The resistance index (RI) against GC376 (50 μM) and PF-07321332 (50 μM) of different variants. (j, k) Crystal structure of WT Mpro interacting with GC376 (j, PDB ID: 7CB7) and PF-07321332 (k, PDB ID: 7VLO). Mutated residue sites obtained in EvoScan are highlighted in red. Interactions between key residues and the ligand are indicated with dash lines in the enlarged figures. Sequences of gRNAs and genetic parts are provided in Supplementary Table 4 and Table 5. Data are mean ± SD of three experiments, except for phage titers.
We then used our genetic circuit to couple the cleavage activity of Mpro, encoded on SP, to the expression of gIII, which was controlled by the p434 promoter (Fig. 2d). This circuit enables selection for Mpro variants that can escape inhibition by small molecules. Our results showed that wild-type Mpro supported robust phage propagation, while the C145A mutant behaved like empty phage (Fig. 2e, Extended Data Fig. 2e). We tested phage propagation at various concentrations of the inhibitors GC376 and PF-07321332, and selected 20 μM as the initial concentration for evolution (Fig. 2f, Extended Data Fig. 2f).
We designed 32 different gRNAs to systematically cover the Mpro gene and performed EvoScan with two inhibitors (Fig. 2g, 2h). Surprisingly, we found that escaping mutations can occur across the whole Mpro gene (Fig. 2h). Some of these mutations, such as F140L, E166V, and S144A, have also been reported in previous studies on drug resistance against PF-0732133244,45, proving the effectiveness and reliability of our system. Most other mutations were not observed in previous works, demonstrating that EvoScan can successfully identify novel key mutations. We also identified conserved mutation sites for both inhibitors, such as S62, L75, N119, S144, T169, A191, P241, and G302 (Fig. 2h). Interestingly, we observed that the phage propagation trajectories of the 32 segments targeted for mutagenesis varied during the evolution process, and more than 10 segments showed no overall enrichment during serial passaging, suggesting that mutations within each of these segments taken individually cannot enable drug resistance, which may serve as regions for future drug development studies (Fig. 2h).
We further verified the ability of these mutations to confer inhibition resistance (Fig. 2i, Extended Data Fig. 2g, 2h). Nearly all of these mutations showed increased resistance against inhibitors compared to wild-type Mpro. In group I mutations, we found that A191V had a strong resistance effect against both inhibitors, while N119D had a moderate resistance effect, and other mutations had relatively weak resistance effects on their specific inhibitors. Strikingly, we found a set of group II mutations (such as E166K), of which the enzyme activities were even improved by inhibitors (Fig. 2i). Similar to a previously reported mechanism where GC376 increased the catalytic activity of Mpro mutants46, E166K has a different interaction with the inhibitors compared to WT, which may then improve the dimerization of Mpro and thus the enzyme activity (Fig. 2j, Fig. 2k). The same phenomenon was observed with other mutations such as I136V, T169P, F140L, and S144A. However, how these mutation sites increase the enzyme activities when inhibitors were added is not clear, as they are located far from the active pocket.
As a comparison, we also evolved Mpro using the PANCE system (Extended Data Fig. 3a, 3b). With only the mutagenesis method changed, Mpro SP failed to accumulate any consensus mutations after 36 passages. After 96 passages, 4 dominant variants with escaping abilities emerged in the four groups in total (Extended Data Fig. 3c-f). All these variants have the N119D or A191V mutation, which appeared after only 8 passages in EvoScan. These results further showed that EvoScan can effectively explore protein-ligand interaction and identify novel key anchor mutations related to small molecule interactions.
Systematic searching for anchors in high-dimensional space
Having demonstrated that EvoScan can rapidly and thoroughly explore the sequence space and generate more diverse functional variants than traditional methods, we next applied this approach to protein-nucleic acid interaction and systematically searched the space from low to high dimensions. We selected AmeR, a transcriptional regulator from the TetR family which plays important roles in many biological processes and synthetic biology47,48. AmeR has few known sequence homologs, making it challenging to use traditional methods to explore its sequence-function relationship, especially in high-dimensional space (Fig. 3a). We planned to first carry out a rapid scan of all gRNAs that cover the full sequence of AmeR, then select only those that generated enriched mutations for further use. Several different evolution routes could then be designed using the remaining APs. Serial passaging of phage across hosts containing different APs would identify anchors in high dimensions – that is, combining multiple mutations in different segments – that thoroughly and representatively sampled the AmeR sequence space (Fig. 3a).
Figure 3.
Applying EvoScan on AmeR for protein-DNA interaction evolution.
(a) Schematic diagram of EvoScan of AmeR. A series of gRNAs was used to divide the AmeR gene into segments. An initial set of passages identified gRNAs that resulted in mutations, and several different evolution routes were designed with these APs. WT AmeR phages were passaged through these routes sequentially to scan and collect anchors. (b) Genetic circuit design for AmeR evolution. 13 different APs carrying 13 different gRNAs were designed. (c) Phage propagation assays of SP bearing the AmeR gene and the empty phage (ΔgIII). (d) Schematic diagram of one step in each route during evolution. (e) EvoScan of AmeR and properties of the collected variants. For each step in each route, the dominant mutations observed from supernatant were shown. Mutation number distribution and the top 10 mutation types of the 82 variants were shown. A comparison of the evolution results between EvoScan and PANCE was shown, including variant numbers, mutation diversities, and mutated sites of AmeR. (f) Distribution of mutations on AlphaFold2 predicted structure of AmeR. Red regions are mutation sites and the blue region is the typical Helix-Turn-Helix (HTH) Domain of TetR family proteins. (g) Mutation relation map among the 82 variants and WT AmeR. Each circle is a variant and its size is the mutation number. Variants are colored based on their log(fold repression). Variants with less than 3 amino acid difference were linked together. (h) Schematic and mutant information of four evolution paths from the mutation relation map. Sequences of gRNAs and genetic parts are provided in Supplementary Table 4 and Table 5. Data are mean ± SD of three experiments, except for phage titers.
To link AmeR interaction with its operator to gIII expression, we inserted a PhlF repressor after the pAmeR promoter, such that the repression ability of AmeR is positively correlated with gIII expression (Fig. 3b). We tested several combinations of plasmid origins, ribosome binding sites (RBS) and repressor types49 to optimize the circuit. The optimal combination resulted in 73-fold propagation of SP carrying AmeR (Fig. 3c, Extended Data Fig. 4b).
To start the scanning process, we selected 13 gRNA sites that cover both the N-terminal and C-terminal domains of AmeR, which are involved in DNA binding and dimerization, respectively. We measured phage titers after each of the 4 passages and found that most groups enriched ≥ 50-fold. Of the 13 different groups, 8 generated dominant mutations in the phage supernatant. These mutations were observed within the targeting segment corresponding to each gRNA (Fig. 4b, 4e). These results provide one-dimensional information about the protein sequence space. We next designed 8 evolutionary routes to sample the high-dimensional space, in which SPs were passaged across all these 8 APs in different orders (Fig. 3d, 3e). For each route, we sequenced the supernatant and 2 single plaques from each round (Fig. 3e). After the full evolutionary scanning process, we obtained 82 anchor variants encompassing 52 different mutations at 39 residue sites (Fig. 3e). Among all the variants, a large portion (~ 83%) of variants had more than 2 mutations, demonstrating the successful exploration and even sampling of the high-dimensional space. We measured the fold repression of the 82 variants, and nearly all of them showed improved function compared to WT, demonstrating again the effectiveness of EvoScan in searching for high-fitness sequences (Extended Data Fig. 5a, Supplementary Table 1).
Figure 4.
Genetic relationships and features of the 82 anchors generated by EvoScan.
(a) Fold repression of WT AmeR and the S57R variant in different systems, E. coli and HEK293T mammalian cells. (b) Leakages and circuit scores of different genetic circuits (IMPLY, NIMPLY, and NAND) using WT AmeR or the S57R variant. (c) Fold repression of AmeR variants with different mutation numbers. (d) Evolution paths from WT to the S57R I80V P94L variant. (e) Phylogenetic tree of the 82 variants. The tree is divided into sub-regions by branch distances. Variants with less than 2 amino acid difference are linked by curves. Curves across sub-regions are in bold black. (f) Magnitude epistasis for D33E and [P94L V188F] in the S57R genetic background (upper plane), and reciprocal sign epistasis for P94L and [G83V V188F A199S G212S] in the S57R genetic background (lower plane). (g) Epistasis values of D33E, S57R, P94L, R43S, I80V, and D119N with combinations of different mutations. Data are mean ± SD of three experiments, except for phage titers.
For comparison, we also applied PANCE to AmeR evolution (Extended Data Fig. 4a, 4b, 4c). After 16 passages, only R43S and S57R single mutants and the R43S S57R double mutant appeared (Fig. 3e, Extended Data Fig. 4d), all of which appeared during EvoScan within 8 passages. That only the variants from the low-dimensional space were observed in PANCE again illustrated how allowing competition between variants from all parts of sequence space can suppress and obscure many functionally informative mutation sites and high-fitness variants from the high-dimensional space, which were systematically captured by EvoScan.
Anchors capture key features of the design space
Alignment between mutations and predicted structure by AlphaFold2 suggested that these beneficial mutations accumulated not only on the helix-turn-helix domain near the N terminus that interacts with DNA, but also on regions related to dimerization of AmeR near the C terminus (Fig. 3f). To investigate the mutation relationship between variants, we drew a relation map linking variants that contained less than three different residues (Fig. 3g). The evolution paths leading to different mutants were connected with complexity, indicating the complex interactive nature of protein evolution in high-dimensional space.
We were able to identify four evolution paths from the complex map, leading to different mutants that shared the same intermediates or reached the same destination (Fig. 3h). This suggested the existence of shared local peaks in the landscape, consistent with a recent study demonstrating the simultaneous accessibility of multiple peaks during evolution27. These mutations usually contained one or more of D33E, R43S, S57R, P94L, and variants containing these mutations appeared to be fitter than WT AmeR (Fig. 3g, Extended Data Fig. 5a), indicating that these mutation sites provide important information about the sequence-function relationships for high-fitness genotypes.
The best-performing single mutant, S57R, outperformed the wild-type AmeR repression ability in both bacteria and mammalian cell systems (Fig. 4a, Extended Data Fig. 6a, 6b). Repressors with better properties are crucial for robust gates and genetic circuits construction in synthetic biology (e.g., low leakage, high circuit score)50. We next incorporated it into several genetic circuit contexts such as IMPLY, NIMPLY, and NAND49 (Fig. 4b, Extended Data Fig. 6c-h). The S57R variant significantly increased the circuit score of all these genetic circuits and reduced the circuit leakage at the same time (Fig. 4b). These results show that the identified mutations affected the protein-DNA interaction directly and captured essential features of the protein itself, rather than increasing fitness only in the context of our evolution selection.
Anchors capture complex epistasis interactions in the high-dimensional space
We also found that, in these anchors, mutational combinations had synergistically enhanced repression abilities in both E. coli and HEK293T, demonstrating that exploring the higher dimensions is vital for identifying proteins with improved functions (Fig. 4c, Extended Data Fig. 6a, 6b). However, we found that the order of introducing mutations significantly affected the evolvability, even if the start point and end point were the same (Fig. 4d). For example, S57R P94L double-mutants had lower fitness than S57R, which suggested that it was more difficult for natural evolution to reach the final genotype (I80V P94L S57R) if S57R was introduced first (Fig. 4d). We further built a phylogenetic tree to investigate the evolvability among these variants (Fig. 4e). The results revealed that, by designing different routes (Fig. 3e), EvoScan likely bypassed these evolvability limitations and achieved long genetic distance searching to obtain these anchors by “jumping” between domains in different orders (APs) in the high-dimensional space. These results further highlighted the need for high-throughput targeting methods to effectively explore the sequence space.
These non-additive interactions between two or more mutations are known as epistasis, which has profound impacts on the landscape in the high-dimensional space. We next systematically investigated the epistasis effect in these anchors and calculated the epistasis value (ε) using fold repression as the fitness value of different genotypes (Supplementary Table 2). We identified both negative (such as D33E and S57R, R43S and [D33E S57R A75T C93R]) and positive epistasis (such as [S57R P94L] and V188F, [P94L S57R] and [G83V V188F A199S G212S]) for different mutation combinations in both low dimensions and high dimensions (Supplementary Table 2). We also studied the magnitude and sign epistasis of different mutations (Fig. 4f), which can create rugged fitness landscapes. Interestingly, we identified reciprocal sign epistasis in the high-dimensional space, such as P94L and [G83V V188F A199S G212S] in the S57R genetic background (Fig. 4f). We also found that, even for the same mutation, such as D33E, P94L, and D119N, ε can be either positive or negative when combined with different mutations, indicating the complex and idiosyncratic epistasis relationship between different mutations (Fig. 4g).
EvoAI enables sequence space reconstruction and prediction of new proteins
Given the complex interaction of mutations in the high-dimensional space, we next aimed to use deep learning to extract the latent features of these anchors obtained from EvoScan to represent and reconstruct the design space of AmeR for high-fitness genotypes with high accuracy, enabling design of new proteins with multiple mutations not represented in the experimental outcomes. We name this hybrid experimental-computational method EvoAI.
We combined a pre-trained GeoFitness model and the Protein Language Model (ESM-2), followed by a Multi-Layer Perceptron (MLP) to enhance the accuracy of predicting protein mutation effects (Fig. 5a). The pre-trained GeoFitness model was trained on a large dataset of ~ 300,000 protein fitness values from various experimental cases and indicators to enable prediction of protein fitness of single mutations (Extended Data Fig. 7). We used the 82 anchor points for both training and validation with a 10-fold cross-validation approach to obtain the final model (Extended Data Fig. 8). Spearman correlation coefficients were 0.91 and 0.84 for the training set and the test set, respectively, demonstrating a high level of consistency in training effectiveness (Fig. 5b). These results demonstrated that our deep learning model accurately predicted the multi-interaction of mutations and complex epistasis in higher dimensional space.
Figure 5.
Anchors and deep learning reconstruct the design space for high-fitness genotypes.
(a) Schematic of the deep learning model. WT AmeR and 82 anchors from EvoScan were the data set for model training. (b) Training and test results of the 82 variants. Training data are in blue and test data are in red. (c) Experimental fold repression of the designed variants using model trained by EvoScan anchors or deep mutational scanning (DMS) information. The dashed line is WT AmeR fold repression. (d) Comparison between DMS and EvoAI. DMS can only search the low-dimensional space, while EvoScan comprehensively segments and scans the high-dimensional space to enable accurate design space reconstruction by deep learning.
We further validated the accuracy of the reconstructed space by designing, predicting, and testing new variants different from the 82 anchors. To reduce the computational load, we chose 13 mutations from the top 11 mutation sites with high prediction certainties for novel protein design (Extended Data Fig. 9b). We then computationally traversed all possible combinations of 6 total mutations and calculated the predicted fold repression by our model (1093 predictions in total). The 10 top-scoring protein sequences were cloned and experimentally tested for their fold repression. All 10 sequences showed significantly improved activities compared to WT with 10- to 38-fold repression abilities (Supplementary Table 3). Furthermore, although we chose only the top predictions and all of them have very close prediction scores, these variants still showed a high Spearman correlation coefficient between prediction scores and experimental results (Extended Data Fig. 9c, 9d). For comparison, we tested the predicted sequence space without using these anchors information but only using low-dimensional deep mutational scanning (DMS) information, and also generated 10 variants with 6 mutations each (Extended Data Fig. 9e). In striking contrast to the high-performing EvoAI-predicted variants, all 10 variants generated by DMS had worse activity relative to wild-type AmeR (Fig. 5c, Supplementary Table 3).
These results validated that, with these compressed anchors, our deep learning model can accurately reconstruct the design space for high-fitness genotypes in high-dimensional space, and design new protein sequences with improved functions. We identified 39 mutation sites in AmeR (Fig. 3e) that could potentially generate high-fitness genotypes, with a theoretical design space of ~ 1050 (2039). Our EvoAI approach therefore effectively demonstrated that this vast design space of AmeR for high-fitness genotypes can be compressed by ~ 1048 times to 82 anchor points.
Discussion
Navigating the complexity and scope of a protein fitness landscape is a long-standing challenge for protein design. We developed EvoScan, a novel system that combines EvolvR mutagenesis and phage selection to explore the protein sequence space in different dimensions. EvoScan can identify valuable anchors, which are variants with critical mutations that represent the sequence space. We showed that these anchor points can accurately reconstruct the space and design new proteins when coupled to deep learning methods (EvoAI), demonstrating the extreme compressibility of this space. Previous methods did not capture this insight likely because they only explored either the low-dimensional space by measuring single or double mutations, or a small region of the sequence space by saturating mutations. These methods thus might not capture the whole picture, especially the high-dimensional space (Fig. 5d).
Our approach has several important advantages over existing methods. First, it balances realistic fitness optimization and even sampling of sequence space, which can rapidly explore high dimensions and generate more diverse and functional variants, and provide richer information about sequence-function relationships. Second, by integrating empirical evolutionary scanning and deep learning models in EvoAI, we can leverage the strengths of both approaches. We could use the properties learned by deep learning to dynamically guide the scanning process. Future advances of explainable deep learning could uncover the underlying rules or patterns, and provide insights into how proteins adapt and overcome evolutionary constraints or trade-offs. Third, our method can evolve and investigate proteins that lack structural information, or that involve challenging interactions. We showed that EvoScan can capture anchors for proteins with diverse functions, such as protein-protein, protein-ligand, and protein-nucleic acid interactions. Our approach should be compatible with any biomolecular function that can be coupled to a transcriptional output (e.g., enzymes through small molecule sensors), and thus could be applied to study the sequence spaces of diverse biomolecules.
Our approach could be further improved in the future. We could use Cas9 variants with more PAM options to increase the guide RNA tiling and mutation-targeted segment selection. We could also modify the editing system to introduce mutations at multiple sites at once, avoiding host switching and speeding up the exploration process. Furthermore, incorporation of the target mutagenesis approach of EvoScan into PACE could potentially lead to deeper sampling of sequence space segments. In addition, integration of EvoScan with genotype reconstruction methods, such as Evoracle, could enable more systematic and intelligent exploration of the sequence space26. Moreover, the modularity of our system makes it highly suitable for automation, such as with the recently reported PRANCE method51, and could be scaled up to provide more comprehensive fitness landscape profiling data for different protein targets, illustrating whether the extreme compressibility of the design space for high-fitness genotypes is universal or unusual, or if the whole protein fitness landscape is compressible.
We also hope that our method will inspire new insights into the relationship between genotype and phenotype and the evolution of biological systems. The compressibility of the design space may suggest that nature somehow finds a way to search through the seemingly infinite space in the relatively short period of life time on earth by Darwinian evolution, possibly by “jumping” between these anchors instead of searching every possibility (Fig. 5d). Genetic recombination in large sexual populations could possibly enable this “jumping” and boost evolution rates53,54. Our approach would enable the investigation of such path dependence of evolutionary outcomes of biological systems in high-throughput experiments51,52 and provide valuable insights for evolution and protein design in biotechnology and biomedical applications.
Materials and Methods
General methods.
The following working concentrations of antibiotics were used: carbenicillin (Solarbio, 50 μg/ml), kanamycin (Solarbio, 50 μg/ml), spectinomycin (Macklin, 50 μg/ml), chloramphenicol (Macklin, 25 μg/ml). PHANTA 2x mix (Vazyme) was used for cloning PCR, and Flash 2x mix (Vazyme) was used for verification PCR and Sanger sequencing (Tsingke Bioscience). All cloning fragments were assembled by Golden Gate assembly (New England Biolabs) or ClonExpress assembly (Vazyme) methods. Plasmids were cloned in DH5α competent cells (HT Health). Synthetic genes were ordered from Tsingke Bioscience. Cloned plasmids were extracted by Tiangen DNA extraction kit. E. coli strain S206055 was used in all aspects of the EvoScan process, including system construction, evolution, and plaque assays. The DH5α strain was used for flow cytometry experiments. Detailed information on the plasmids and selection phage (SP) used in this work is given in Supplementary Table 6.
Phage propagation assay.
Competent S2060 cells were transformed with corresponding accessory plasmid (AP) in each experiment. Overnight cultures of single colonies inoculated in LB medium with proper antibiotics were diluted 50 or 100 times and grown at 37 °C in 220 rpm shaker (ZQZY-B8, cultured in shake tubes, 5 ml system) or 1000 rpm shaker (HUXI HW-400TG, cultured in 96-deep well plate, 500 μl system) to log phase (OD600 ~ 0.4–0.6). These cells were then infected with SP at an initial titer of 5 × 106 plaque-forming units (p.f.u.) per ml. The mixture was further cultured overnight (16–20 h) at 37 °C in the shakers as described above, and was centrifuged at 4000 rpm for 10 min. Phages in the supernatant was filtered by 0.22 μm bacterial filter and stored at 4 °C for further use.
Plaque assay.
A single colony of chemically competent S2208 cells55 (S2060 cells transformed with plasmid pJC175e) was cultured overnight in LB medium added with proper antibiotics. The saturated bacteria culture was diluted 50 or 100 times into LB medium with proper antibiotics and grown at 37 °C in 220 rpm shaker to log phase (OD ~0.4–0.8) before use. Phages were serially diluted 6 to 8 times with a dilution ratio of 10-fold in each step in LB medium. Then, 10 μl of each phage dilution was mixed with 45 μl S2208 cells, and then 180 μl of liquid (50–65°C) soft agar (LB medium and 0.5% agar) supplemented with 2% Bluo-gal (Inalco S.p.A.) was added and mixed by pipetting. The whole mixture was immediately added onto 500 μl of bottom agar (LB medium and 1.5% agar) previously prepared in 24-well plate. Then the plates were incubated in 37 °C for overnight growth (14–18 h).
Calculation of fold propagation.
For fold propagation measurement of the selection phage, initial phage titer and final phage titer were measured by plaque assays. We defined the ratio of final phage titer versus initial phage titer as the fold propagation of the phage.
Basic process of evolutionary scanning (EvoScan).
Target mutagenesis plasmid (TP) was first transformed to chemically competent S2060 cells, and then the prepared S2060-TP cells were used to prepare super chemically competent cell by Inoue method56. Chemically competent S2060-TP cells were transformed with corresponding APs. The resulting S2060-TP-AP bacteria were cultured overnight and diluted 50–100 times into 500 μl LB medium with antibiotics and inducers, and grown in 37 °C 1000 rpm shaker to OD ~0.5. The phage titer for the first infection was around 5×106–5×108 p.f.u./ml, and for the following passages the phages were subjected to a 1:50 or 1:100 dilution. Vanillic acid (Sigma-Aldrich, ethanol dissolved) at a final concentration of 50 μM was added to induce the expression of nCas9-PolIM5 complex. The mixture was then cultured in 37 °C 1000 rpm shaker overnight. The next day the mixture was centrifuged at 4000 rpm for 10 min and the phage content of the collected supernatant was verified by PCR (Flash 2x mix) and Sanger Sequencing. The supernatant was then used for plaque assay as described above. Single plaques from plaque assay were picked and further verified by PCR (Flash 2x mix). The PCR product was sent for Sanger Sequencing.
Searching steps in each route.
For each step of EvoScan in a route, 10 μl supernatant with evolved phages was added into 1 ml log-phase S2208 bacteria culture (OD ~0.4–0.8), and propagated overnight in 96-deep well plate. The mixture was centrifuged at 4000 rpm for 10 min and filtered by 0.22 μm bacterial filter. The obtained phages were then diluted and infected another host cell containing a different AP with an infection titer of 5×106 p.f.u./ml.
Basic process of phage-assisted non-continuous evolution (PANCE).
Accessory plasmid with the designed genetic circuit and the mutagenesis plasmid MP6 were co-transformed into super chemically competent S2060 cells. The S2060-MP6-AP bacteria were cultured overnight and diluted 50–100 times into 500 μl LB medium with antibiotics and inducers in 96-deep well plate, and grown in 37 °C 1000 rpm shaker to OD ~0.5. The initial phage titer was around 5×106–5×108 p.f.u./ml, and the phages were subjected to a 1:10–1:100 dilution in the following passages in a 500 μl system. 1% (m/v) arabinose dissolved in ddH2O was added as the inducer of MP6. Phages were then collected to obtain mutations following the same procedures in EvoScan.
Induced expression assay.
Single colonies of strains to be tested were cultured in LB medium overnight. Saturated bacterial culture was diluted 100 times in LB medium with proper antibiotics and inducers, and cultured in 37 °C 1000 rpm shaker for 2 h (OD ~ 0.4). Then LB with proper antibiotics and inducers was prepared and 2 μl log phase bacteria culture was added together to a whole volume of 500 μl. The mixture was cultured for 5 hours in the 96-deep well plate.
Flow cytometry assay.
10 μl of the culture was added into 190 μl PBS with 2 g/L kanamycin in the 96-well U-bottom plate to stop the cell growth. The plate was stored in 4 °C until used. The flow cytometer (Beckman Coulter Cytoflex S) was used to quantify the expression levels of fluorescent protein. The software FlowJo v10 was used to gate the events (at least 10000 events) and calculated the median of each sample.
Mpro drug resistance index.
In the RTHS protease activity assay, the fluorescence of the experimental group carrying eYFP was measured with or without addition of Mpro inhibitor GC376 or PF-07321332. The ratio of fluorescence FITC-A median with inhibitor versus fluorescence FITC-A median without inhibitor was defined as the resistance index (RI) to evaluate the drug resistance abilities of different Mpro variants.
Structure display and interaction prediction.
Schrodinger 2017 was used for structural display. ZDOCK57 was used for interaction structure prediction between EGFP and its nanobody. The interaction between Mpro and inhibitors within 3 angstrom was shown in the figure.
Fold repression calculation.
The background fluorescence of cells, which is the median of the fluorescence of the bacteria carrying an empty plasmid with only the backbone, was measured and subtracted from all the experimental groups. The subtracted fluorescence values of the uninduced group (no repressor expression) were divided by the induced group (repressor expression) to obtain the fold repression.
Relative expression level calculation.
Using flow cytometry assay, we measured the FITC-A median of the strain carrying the empty plasmid and set this value as the background value. The FITC-A median of the strain carrying the standard plasmid expressing eYFP through the open reading frame J23101-B0064-YFP was measured the same way and set as the standard value. The FITC-A median of the strain containing a specific variant was measured the same way, and the relative expression level was defined as: (variant value – background value)/standard value.
Circuit score calculation.
Thestrain carrying the plasmid with a specific genetic circuit was prepared for flow cytometry assay. IPTG (1 mM) and vanillic acid (100 μM) were used as the input signals. YFP was used as the output reporter of the circuit and the FITC-A median of each state was measured. The lowest ON signal (lowest FITC-A median in “ON” states of the circuit) was divided by the highest OFF signal (highest FITC-A median in “OFF” states of the circuit) to obtain the circuit score.
AmeR phylogenetic tree construction.
Protein sequences of the 82 variants and the WT were collected as a fasta file and the file was input into MEGA11 for multiple sequence alignment (MSA)58. After MSA and phylogenetic analysis, neighbor-joining tree was selected as the method of tree construction. The output tree was decorated by iTOL59, and all the parameters were set using default values.
Epistasis calculation.
Epistasis between two different mutations, A and B, could be calculated as ε = fab + fAB – fAb – faB. f is the fitness of wild-type, double-mutant and single-mutant genotypes, respectively. ε > 0 means positive epistasis, while ε < 0 means negative epistasis.
Mammalian Cell culture and transfections.
HEK293T cells (CRL-3216, ATCC) were cultured in Dulbecco’s modified Eagle’s medium (DMEM, Gibco) supplemented with 10% (v/v) fetal bovine serum (FBS, Biological Industries) and 1% (v/v) penicillin/streptomycin solution (Beyotime) at 37 °C, 100% humidity and 5% CO2. In transfection experiments, 60,000–80,000 HEK293T cells in 0.2 ml of DMEM complete medium were seeded into each well of 48-well plastic plates (NEST) and grown for ~24 h. M5 HiPer Lipo2000 Transfection Reagent (Lipo2000, Mei5bio) was used in all transfection experiments following the manufacturer’s protocol. Briefly, a sample mixture was prepared by mixing 150 ng repressor plasmid or 150 ng control plasmid (repressor-deficiency) with 150 ng reporter plasmid in 0.7 μl Lipo2000. The mixture was incubated at room temperature for 20 min before adding to cells. Transfections were supplemented with 0.2 mL DMEM complete medium 24 h post-transfection. Cells were cultured for 2 days post-transfection before flow cytometry analysis.
Mammalian cell flow cytometry assay.
Cells were trypsinized 48 h after transfection and were then centrifuged at 250 × g for 10 min at room temperature. The supernatant was removed, and the cells were resuspended in 1 × PBS. Fluorescence values were measured with a Cytoflex flow cytometer (Beckman Coulter, Inc.). PB450-A and ECD-A channels were chosen for BFP and mCherry measurement, respectively. Data were processed using FlowJo (TreeStar), gated by the area of the forward scatter and the side scatter (FSC-A/SSC-A) and then cell populations were selected by gating out the background BFP signal of untransfected cells to obtain the median of fluorescence. The median of fluorescence was calculated for >20,000 transfected cells for each sample. To reduce expression noises between samples, the mCherry : BFP fluorescence ratio was used to report the repressor activity60. The mCherry : BFP fluorescence ratio was calculated by (mCherry - mCherry0)/(BFP - BFP0), mCherry0 and BFP0 were the fluorescence values from untransfected HEK293T cells. The fold-repression was calculated by (mCherry : BFP)unrepressed/(mCherry : BFP)repressed. (mCherry : BFP)unrepressed and (mCherry : BFP)repressed were the fluorescence values of the states co-transfected with control plasmid or repressor plasmid.
Feature generation.
Our initial step entails querying the UniRef30_2021_03 and bfd multiple sequence alignment (MSA) databases. Subsequently, we employ AlphaFold2 to construct the structural representation of the wild-type protein. For this endeavor, we deploy the GeoFitness-Seq variant of the pre-training model. In the case of mutated proteins, structural configurations are generated using FoldX 5. The sequence features are extracted from the large-scale protein language model ESM-2, for the purpose of capturing global context information. Consequently, each node in the Geometric Encoder is initialized by the embedding of the corresponding residue derived from the ESM-2. Unlike conventional methodologies that rely upon inter-residue distances and contacts to establish edges, each edge in the Geometric Encoder is initialized by the relative geometric relationship between a pair of residues derived from the protein 3D structure61.
Cross-validation.
We employed a 10-fold cross-validation approach to find the hyperparameters of the model. The dataset, comprising 82 mutational data points, was divided into three parts: a training set (59 samples), a validation set (7 samples), and a test set (16 samples). Model evaluation was performed using the Spearman correlation coefficient (ρ) as the primary assessment metric.
Model training details.
The model employs the Soft Rank Loss as its loss function, with a learning rate of 10−3, Adam optimizer, and a decay rate for the learning rate. The training spans across 50 epochs. Subsequently, the learning rate of the upstream GeoFitness model is set to 10−4, while the learning rate of the downstream model is adjusted to 5×10−4 for further ne-tuning.
Acknowledgements
This study was supported by Ministry of Science and Technology of China grant 2021YFA0911000 (S.Z.), National Natural Science Foundation of China grant 32171416 (S.Z.) and U22A20552 (S.Z.), Tsinghua University Dushi Plan Foundation (S.Z.), U.S. NIH R01 EB022376/EB031172 (D.R.L.), and R35 GM118062 (D.R.L.). We thank J. Zheng (Westlake University) for helpful discussions. We thank C. Zhang (Tsinghua University) for the kind gift of the EvolvR gene. We apologize to authors whose work cannot be cited owing to referencing restrictions.
Footnotes
Competing interests
S.Z. and Z.M. have filed a patent application based on this work.
Supplementary Files
Contributor Information
Shuyi Zhang, Tsinghua University.
Ziyuan Ma, Tsinghua University.
Wenjie Li, Tsinghua University.
Yunhao Shen, Tsinghua University.
Yunxin Xu, Tsinghua University.
Gengjiang Liu, Tsinghua University.
Jiamin Chang, Tsinghua University.
Zeju Li, Tsinghua University.
Hong Qin, Tsinghua University.
Boxue Tian, Tsinghua University.
Haipeng Gong, Tsinghua University.
David Liu, Broad Institute.
B Thuronyi, Williams College.
Christopher Voigt, Massachusetts Institute of Technology.
References
- 1.Lovelock S. L. et al. The road to fully programmable protein catalysis. Nature 606, 49–58 (2022). [DOI] [PubMed] [Google Scholar]
- 2.Labanieh L. & Mackall C. L. CAR immune cells: design principles, resistance and the next generation. Nature 614, 635–648 (2023). [DOI] [PubMed] [Google Scholar]
- 3.Dumontet C., Reichert J. M., Senter P. D., Lambert J. M. & Beck A. Antibody–drug conjugates come of age in oncology. Nat. Rev. Drug Discov. 22, 641–661 (2023). [DOI] [PubMed] [Google Scholar]
- 4.Macken C. A. & Perelson A. S. Protein evolution on rugged landscapes. Proc. Natl Acad. Sci. USA 86, 6191–6195 (1989). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Lutz S. Beyond directed evolution—semi-rational protein engineering and design. Curr. Opin. Biotechnol. 21, 734–743 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ding X., Zou Z. & Brooks C. L. III Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 5644 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Tian P. & Best R. B. Exploring the sequence fitness landscape of a bridge between protein folds. PLoS Comput. Biol. 16, e1008285 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Fernandez-de-Cossio-Diaz J., Uguzzoni G. & Pagnani A. Unsupervised inference of protein fitness landscape from deep mutational scan. Mol. Biol. Evol. 38, 318–328 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.D’Costa S., Hinds E. C., Freschlin C. R., Song H. & Romero P. A. Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput. Biol. 19, e1010956 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Fowler D. M. & Fields S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Stiffler M. A., Hekstra D. R. & Ranganathan R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882–892 (2015). [DOI] [PubMed] [Google Scholar]
- 12.Zheng L., Baumann U. & Reymond J.-L. An efficient one-step site-directed and site-saturation mutagenesis protocol. Nucleic Acids Res. 32, e115 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.McLaughlin R. N. Jr, Poelwijk F. J., Raman A., Gosal W. S. & Ranganathan R. The spatial architecture of protein function and adaptation. Nature 491, 138–142 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Cadwell R. C. & Joyce G. F. Randomization of genes by PCR mutagenesis. Genome Res. 2, 28–33 (1992). [DOI] [PubMed] [Google Scholar]
- 15.Vanhercke T., Ampe C., Tirry L. & Denolf P. Reducing mutational bias in random protein libraries. Anal. Biochem. 339, 9–14 (2005). [DOI] [PubMed] [Google Scholar]
- 16.Esvelt K. M., Carlson J. C. & Liu D. R. A system for the continuous directed evolution of biomolecules. Nature 472, 499–503 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Miller S. M., Wang T. & Liu D. R. Phage-assisted continuous and non-continuous evolution. Nat. Protoc. 15, 4101–4127 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ravikumar A., Arzumanyan G. A., Obadi M. K. A., Javanpour A. A. & Liu C. C. Scalable, Continuous Evolution of Genes at Mutation Rates above Genomic Error Thresholds. Cell 175, 1946–1957.e1913 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Sarkisyan K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Hopf T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Riesselman A. J., Ingraham J. B. & Marks D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Yang K. K., Wu Z. & Arnold F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019). [DOI] [PubMed] [Google Scholar]
- 23.Luo Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Wu Z., Johnston K. E., Arnold F. H. & Yang K. K. Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65, 18–27 (2021). [DOI] [PubMed] [Google Scholar]
- 25.Somermeyer L. G. et al. Heterogeneity of the GFP fitness landscape and data-driven protein design. Elife 11, e75842 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Shen M. W., Zhao K. T. & Liu D. R. Reconstruction of evolving gene variants and fitness from short sequencing reads. Nat. Chem. Biol. 17, 1188–1198 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Biswas S., Khimulya G., Alley E. C., Esvelt K. M. & Church G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021). [DOI] [PubMed] [Google Scholar]
- 28.Papkou A., Garcia-Pastor L., Escudero J. A. & Wagner A. A rugged yet easily navigable fitness landscape. Science 382, eadh3860 (2023). [DOI] [PubMed] [Google Scholar]
- 29.Halperin S. O. et al. CRISPR-guided DNA polymerases enable diversification of all nucleotides in a tunable window. Nature 560, 248–252 (2018). [DOI] [PubMed] [Google Scholar]
- 30.Baas P. DNA replication of single-stranded Escherichia coli DNA phages. Biochim. Biophys. Acta, Gene Struct. Expression 825, 111–139 (1985). [DOI] [PubMed] [Google Scholar]
- 31.Jinek M. et al. A programmable dual-RNA–guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Ran F. A. et al. Genome engineering using the CRISPR-Cas9 system. Nat. Protoc. 8, 2281–2308 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Dietsch F. et al. Small p53 derived peptide suitable for robust nanobodies dimerization. J. Immunol. Methods 498, 113144 (2021). [DOI] [PubMed] [Google Scholar]
- 34.Di Lallo G., Castagnoli L., Ghelardini P. & Paolozzi L. A two-hybrid system based on chimeric operator recognition for studying protein homo/heterodimerization in Escherichia coli. Microbiology 147, 1651–1656 (2001). [DOI] [PubMed] [Google Scholar]
- 35.Gao K. et al. Perspectives on SARS-CoV-2 main protease inhibitors. J. Med. Chem. 64, 16922–16955 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Li J. et al. Structural basis of the main proteases of coronavirus bound to drug candidate PF-07321332. J. Virol. 96, e02013–02021 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Fu L. et al. Both Boceprevir and GC376 efficaciously inhibit SARS-CoV-2 by targeting its main protease. Nat. Commun. 11, 4417 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Owen D. R. et al. An oral SARS-CoV-2 Mpro inhibitor clinical candidate for the treatment of COVID-19. Science 374, 1586–1593 (2021). [DOI] [PubMed] [Google Scholar]
- 39.Iketani S. et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host Microbe 30,1354–1362 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Iketani S. et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature 14, 1716–1726 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Dickinson B. C., Packer M. S., Badran A. H. & Liu D. R. A system for the continuous directed evolution of proteases rapidly reveals drug-resistance mutations. Nat. Commun. 5, 5352 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Packer M. S., Rees H. A. & Liu D. R. Phage-assisted continuous evolution of proteases with altered substrate specificity. Nat. Commun. 8, 956 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Blum T. R. et al. Phage-assisted evolution of botulinum neurotoxin proteases with reprogrammed specificity. Science 371, 803–810 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Iketani S. et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host Microbe 30, 1354–1362. e1356 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Iketani S. et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature 613, 558–564 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Nashed N. T., Aniana A., Ghirlando R., Chiliveri S. C. & Louis J. M. Modulation of the monomerdimer equilibrium and catalytic activity of SARS-CoV-2 main protease by a transition-state analog inhibitor. Commun. Biol. 5, 160 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Stanton B. C. et al. Genomic mining of prokaryotic repressors for orthogonal logic gates. Nat. Chem. Biol. 10, 99–105 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Ramos J. L. et al. The TetR family of transcriptional repressors. Microbiol. Mol. Biol. Rev. 69, 326–356 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Nielsen A. A. et al. Genetic circuit design automation. Science 352, aac7341 (2016). [DOI] [PubMed] [Google Scholar]
- 50.Brophy J. A. N. & Voigt C. A. Principles of genetic circuit design. Nat. Methods 11, 508–520 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.DeBenedictis E. A. et al. Systematic molecular evolution enables robust biomolecule discovery. Nat. Methods 19, 55–64 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Dickinson B. C., Leconte A. M., Allen B., Esvelt K. M. & Liu D. R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage-assisted continuous evolution. Proc. Natl Acad. Sci. USA 110, 9007–9012 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Weinreich D. M. & Chao L. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution 59, 1175–1182 (2005). [PubMed] [Google Scholar]
- 54.Weissman D. B., Feldman M. W. & Fisher D. S. The Rate of Fitness-Valley Crossing in Sexual Populations. Genetics 186, 1389–1410 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Carlson J. C., Badran A. H., Guggiana-Nilo D. A. & Liu D. R. Negative selection and stringency modulation in phage-assisted continuous evolution. Nat. Chem. Biol. 10, 216–222 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Green M. R. & Sambrook J. The Inoue Method for Preparation and Transformation of Competent Escherichia coli: “Ultracompetent” Cells. Cold Spring Harb Protoc. 2020, 101196 (2020). [DOI] [PubMed] [Google Scholar]
- 57.Chen R., Li L. & Weng Z. ZDOCK: an initial-stage protein docking algorithm. Proteins: Struct., Funct., Bioinf. 52, 80–87 (2003). [DOI] [PubMed] [Google Scholar]
- 58.Tamura K., Stecher G. & Kumar S. MEGA11: molecular evolutionary genetics analysis version 11. Mol. Biol. Evol. 38, 3022–3027 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Letunic I. & Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Liang J. C., Chang A. L., Kennedy A. B. & Smolke C. D. A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. Nucleic Acids Res. 40, e154 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Xu Y., Liu D. & Gong H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. bioRxiv, 2023.2005. 2028.542668 (2023). [DOI] [PubMed] [Google Scholar]





