Summary
Deep learning has rapidly emerged as a promising toolkit for protein optimization, yet its success remains limited, particularly in the realm of activity. Moreover, most algorithms lack rigorous iterative evaluation, a crucial aspect of protein engineering exemplified by classical directed evolution. This study introduces DeepDE, a robust iterative deep learning-guided algorithm leveraging triple mutants as building blocks and a compact library of ∼1,000 mutants for training. Triple mutants allow for the exploration of a much greater sequence space compared to single or double mutants in each iteration. When applied to GFP from Aequorea victoria, DeepDE achieved a remarkable 74.3-fold increase in activity over four rounds of evolution, far surpassing the benchmark superfolder GFP. Our study suggests that limited screening involving experimentally affordable ∼1,000 variants significantly enhances the performance of DeepDE, likely by mitigating the constraints imposed by the intractable data sparsity problem in protein engineering.
Subject areas: Protein, Cell, Structural biology, Biocomputational method, Artificial intelligence
Highlights
• DeepDE enables iterative protein evolution via supervised learning on ∼1,000 mutants
• A mutation radius of three allows efficient exploration of vast sequence space
• Limited screening greatly enhances evolutionary performance of DeepDE
• DeepDE achieves a 74.3-fold increase in GFP488nm activity in just four rounds
Protein; Biocomputational method; Artificial intelligence; Synthetic biology
Introduction
Protein engineering is a foundational driver for advancing biotechnology applications across vital sectors, including industry, medicine, and agriculture. However, the sequence space of a target protein is extraordinarily vast.1,2 For example, an average protein with 300 residues yields roughly $3.1 \times 10^{10}$ possible variants from just three substitutions. This immense combinatorial complexity presents a formidable challenge to exploring the functional landscape of a target protein and severely impedes the practice of protein engineering.3,4
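The figure quoted above can be checked by direct counting, assuming that only substitutions to the 19 other standard amino acids are considered: choose 3 of the 300 positions, then one of 19 alternatives at each chosen position.

```latex
\[
\binom{300}{3} \times 19^{3}
  = 4{,}455{,}100 \times 6{,}859
  = 30{,}557{,}530{,}900
  \approx 3.1 \times 10^{10}
\]
```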
The classical approach to protein engineering has been directed evolution, which typically involves three key steps: generating a library of random mutations for the target protein, screening thousands of mutants for improved function, and selecting the top-performing variant as the template for the next round of evolution.5,6,7 While more powerful than rational design, this iterative process is often labor-intensive, time-consuming, and inefficient. In recent years, artificial intelligence (AI), particularly deep learning, has rapidly emerged as a promising toolkit for protein optimization.8,9,10,11,12,13 Currently, leading AI-guided algorithms for protein engineering utilize models trained on various protein datasets,14 with top models like ridge regression15 or random forest16 employed to forecast quantitative functions. The training methods can broadly be classified into unsupervised, weak-positive only (including active learning), and supervised learning.14,16,17 Unsupervised learning leverages large, diverse datasets of unlabeled sequences, such as UniRef,14 enabling the extraction of general features from over 20 million protein sequences.18,19,20 Weak-positive only learning utilizes limited sets of evolutionarily related proteins that lack experimentally assayed fitness labels (the exception is active learning, which was not used in our work: there, the model begins with a small set of labeled examples, progressively queries labels for additional unlabeled data, and incorporates these into the training dataset).21,22,23 In contrast, supervised learning is trained on variant sequences of a specific target protein with associated fitness labels.24,25,26
So far, however, the success of AI-guided protein engineering remains limited, particularly in the realm of activity.27 Two critical hurdles remain. First, many methods lack robust experimental validation,18,19,20,21,22,23,24 relying predominantly on in-distribution testing, where both training and testing data originate from the same dataset. Such tests are inherently interpolative and can be misleading when addressing complex biological problems such as protein engineering, which require extrapolation. Second, very few methods have been implemented in an iterative framework,28,29,30 a crucial feature of protein engineering, as seen in classical directed evolution.5,6,7
For instance, the recently introduced EVOLVEpro algorithm,16 which combines active learning with random forest, generates only single mutations per design cycle, and inherently lacks the capacity for iteration. Another notable example is the low-N algorithm,15 which employs ridge regression and was applied to the green fluorescent protein from Aequorea victoria (avGFP), a protein comprising 238 amino acids.31 Despite being trained on just 24 avGFP variants in the supervised learning phase, the algorithm introduced mutations with a large radius of 7 or 15, and was again evaluated in a single design cycle. Although it reported several significantly improved mutants, we noticed that 77 out of the top 100 mutants carried mutations at the well-known chromophore site S65, an unusually high occurrence. S65T alone is known to enhance GFP activity by 4- to 6-fold.32,33 Upon randomly selecting four of these mutants and reverting the changes at site S65 to the wild type, we found that all four returned to wild-type level activity (Table S1).
In this study, we utilized the three deep learning methods (unsupervised, weak-positive only, and supervised learning) to devise an algorithm referred to hereinafter as DeepDE (Figure S1). It incorporates three key features: (1) We used a supervised training dataset of 1,000 single or double mutants. (2) We set the mutation radius to three for each round of evolution. This radius presents a significant challenge, as it results in a combinatorial library of approximately $1.5 \times 10^{10}$ variants, compared to just $1.0 \times 10^{7}$ for double mutants and $4.5 \times 10^{3}$ for single mutants. Such a vast number of variants exceeds the practical limits for both computational modeling and experimental screening. Additionally, this radius was selected to mitigate the difficulties associated with exploring larger theoretical mutation spaces that would arise with a larger radius. Also critically, this radius allowed us to utilize a standard mutagenesis kit34 for experimentally exploring a focused set of mutants based on the predicted triple mutation sites as outlined below. (3) We implemented two design strategies: the “mutagenesis by direct prediction” approach (DM), which involves direct prediction of beneficial triple mutants with specific amino acid substitutions, and the “mutagenesis coupled with screening” approach (SM), where potential beneficial triple mutation sites are predicted, followed by the experimental construction of 10 libraries of triple mutants for screening to identify the best mutants (see STAR Methods for more details). This allowed us to evaluate the performance of three evolution paths composed of these two approaches (DM only, SM/DM, and SM only).
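The library sizes quoted for avGFP follow from a simple counting argument. The sketch below reproduces them, assuming 238 mutable positions and 19 alternative amino acids per position; the function name `variant_count` is illustrative and not part of DeepDE.

```python
from math import comb


def variant_count(length: int, radius: int, alphabet: int = 20) -> int:
    """Number of variants carrying exactly `radius` substitutions.

    Choose which positions to mutate, then one of the (alphabet - 1)
    alternative residues at each chosen position.
    """
    return comb(length, radius) * (alphabet - 1) ** radius


AVGFP_LENGTH = 238  # residues in avGFP, as stated in the text

for radius in (1, 2, 3):
    # Prints the count in scientific notation for comparison with the text.
    print(radius, f"{variant_count(AVGFP_LENGTH, radius):.2e}")
```

The same function with `length=300` and `radius=3` gives the generic estimate used in the Introduction.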
As a proof of concept, we selected avGFP as the model protein35 (see the Results section for details). We implemented an iterative cycle of evolution spanning five rounds. For the first four rounds, we used the same training dataset to explore the extent of optimization attainable from this dataset, and for Round 5, we transitioned to a different training dataset. Among the three evolution paths explored, Path III (SM only) displayed the most promising and steadily improving results, outperforming Path I (DM only) and Path II (SM/DM). Incorporating the known mutation S65T in superfolder GFP (sfGFP)15,36 with Path III yielded the best performing mutant with a 74.3-fold increase in activity at Round 4, significantly surpassing the 40.2-fold increase seen in sfGFP, which was achieved through a multi-year engineering effort.
Results
Choice of the model protein GFP
As a proof of concept, we selected avGFP as the model protein.35 In doing so, we extensively surveyed available mutant libraries of proteins for training and evolution. These libraries generally fall into two primary categories37: (1) deep mutational scanning libraries, where each site of a target protein is mutated to a specific amino acid (such as alanine), or to all other 19 alternatives. However, the former approach is overly restrictive, while the latter can be cost-prohibitive. (2) Random libraries, where the entire sequence of a target protein is mutated via error-prone PCR. Since this class is more affordable, we decided to leverage such libraries, ultimately selecting the meticulously curated avGFP library for our study.
Computational evaluation of deep learning-guided directed protein evolution for GFP
We employed two evaluation metrics: Spearman rank correlation ($\rho$)38,39 between actual and predicted values, and normalized discounted cumulative gain (NDCG),40 which assigns high scores to accurately predicted high-fitness outcomes. To assess model performance, we utilized six sets of single or double mutants with a 1/9 split (N = 24, 96, 240, 400, 1,000, and 2,000), and evaluated the trained models on all triple mutants from the Sarkisyan dataset (a total of 12,337 sequences), which provided a more rigorous out-of-distribution test.41,42 As illustrated in Table S2, we observed a positive correlation between algorithm performance and training dataset size, with Spearman’s correlation coefficients increasing from 0.30 to 0.74 and NDCG values rising from 0.34 to 0.81. These results underscore the importance of data size in supervised learning for enhancing protein design. Based on this finding, we selected the dataset of 1,000 mutants for model training and design of GFP.
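For readers who want to reproduce this kind of evaluation, the two metrics can be sketched in plain Python. The exact NDCG variant used in the study is not spelled out here, so the version below is an assumption: it uses the measured fitness as gain and a log2 position discount. All function names are illustrative.

```python
from math import log2, sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(actual, predicted):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


def ndcg(actual, predicted):
    """DCG of measured values ordered by prediction, divided by the ideal DCG.

    Assumes non-negative measured values that are not all zero.
    """
    def dcg(gains):
        return sum(g / log2(pos + 2) for pos, g in enumerate(gains))

    by_prediction = [a for _, a in sorted(zip(predicted, actual), key=lambda t: -t[0])]
    return dcg(by_prediction) / dcg(sorted(actual, reverse=True))


# Toy values, made up for illustration only.
measured = [0.2, 1.0, 0.6, 1.3]
predicted = [0.1, 0.9, 0.7, 1.1]
print(spearman(measured, predicted), ndcg(measured, predicted))
```

Spearman's coefficient rewards getting the whole ordering right, whereas NDCG weights the top of the predicted ranking most heavily, which is why the two are reported together.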
To examine the impact of dataset composition on model performance, we varied the ratio of single to double mutants in the training dataset, adjusting it from 1/9 to 9/1, while keeping the mutations unchanged. Interestingly, DeepDE predicted the same set of triple mutant outcomes. This suggests that the limited representation of double mutants in the training dataset (900 out of 10,181,283 possible double mutations, or just 0.009%) was likely insufficient to provide additional information for the model.
Notably, when we evaluated the trained algorithm on the remaining double mutants (11,877 sequences), it yielded lower performance metrics (Spearman $\rho$ of 0.58 and NDCG of 0.74). This outcome was consistent with our experimental data (see below).
Design and validation of GFP
The randomly selected training dataset of 1,000 mutants is detailed in Data S1, with the corresponding activity profile illustrated in Figure S2. These mutants cover 219 of the 238 sites in avGFP. As it is very costly to compute all the possible triple mutants ($\approx 1.5 \times 10^{10}$), we took two approaches to search for beneficial triple mutants. For the DM approach, we computed all possible double mutants, and identified 258 top mutants with predicted activity greater than 125% of wild-type. Mutation frequency was then calculated for each site, and the top 20 mutation sites along with their respective amino acid substitutions were selected. Triple mutants were generated by considering all subset combinations of the double mutants. The trained DeepDE model was applied to predict the activity for each triple mutant candidate, and the top 10 ranked triple mutants were selected for direct synthesis and assay.
For the SM approach, we simulated the landscape of all possible double mutants, predicting the activity values for all double site combinations (28,203 = $\binom{238}{2}$) by calculating the mean activity of the top three mutants for each site combination. For each triple site combination (2,218,636 = $\binom{238}{3}$), we calculated the mean values of the corresponding double site combinations and selected the top 10 ranked triple site combinations. The trained DeepDE model was then applied to predict the activity values of all triple mutants corresponding to these top 10 combinations. The top 5 ranked mutants were selected for encoding with degenerate codons and experimental validation using a GeneArt Site-Directed Mutagenesis PLUS Kit.34 For each cycle, approximately 1,040 triple mutants were screened (see STAR Methods for more details). Both the directly designed and synthesized sequences, as well as the libraries of sequences constructed with the mutagenesis kit, were cloned into E. coli DH5α-T1R using the plasmid pACYC184, and then assayed in a 96-well plate. Since the evolution of avGFP typically leads to an extension of its excitation wavelength from 385-405 nm to 488 nm,43 for clarity we focused primarily on activity at the 488 nm excitation wavelength.
First round of evolution for GFP
As shown in Figure 1A and Table S3, all the top 10 predicted sequences exhibited higher fluorescence intensity at both excitation wavelengths (398 nm and 488 nm), achieving a perfect hit rate of 100%. The most active mutant, DM1 (R73H/V163G/D190N), displayed a 2.1-fold increase in GFP488nm activity (and a 1.7-fold increase in GFP398nm activity). Following this, we predicted the top 10 triple site combinations and constructed 10 combinatorial mutant libraries. As shown in Figure 1B and Tables S4 and S5, 436 out of 1,003 colonies from the combinatorial libraries exhibited higher fluorescence intensity than the wild type in terms of GFP488nm activity, yielding a hit rate of 43.5% (with a hit rate of 34.9% for GFP398nm activity). The most active mutant, SM1 (E5N/R73H/V163G), showed a 2.1-fold increase in GFP488nm activity (and a 1.8-fold increase in GFP398nm activity). Notably, only one of the top ten designed sequences was among the top ten screened sequences.
Figure 1.
Activities of avGFP mutants designed by DeepDE and ECNet in the first round
(A) GFP398nm and GFP488nm activities for the top 10 mutants directly predicted by DeepDE. Fold changes in avGFP mutant activities compared to wild-type avGFP (avGFP-WT) are indicated by red numbers. The most active mutant, DM1, is highlighted with a dotted box.
(B) Heatmap displaying GFP488nm activities for mutants from the 10 combinatorial libraries predicted by DeepDE. The most active mutant, SM1, is marked with a red star. Mutants not successfully obtained in column L1 are left blank.
(C) GFP398nm and GFP488nm activities for the top 10 mutants directly predicted by ECNet. Fold changes in avGFP mutant activities compared to wild-type avGFP are indicated by red numbers.
(D) Heatmap displaying GFP488nm activities for mutants from the 10 combinatorial libraries predicted by ECNet. All assays were performed in triplicate with biological replicates (with the exception of the heatmaps). See also Figures S1 and S2; Tables S3–S8.
Next, we applied the same training dataset to challenge ECNet44 (see STAR Methods for more details), a different supervised learning model that uses a bidirectional long short-term memory network (BiLSTM) combined with a self-attention mechanism to predict mutagenic effects in specific proteins by leveraging general protein evolutionary context. Similarly, we requested the output of the top 10 predicted sequences and 10 top triple site combinations for constructing and screening the best mutants. As shown in Figure 1C and Table S6, all the 10 predicted sequences from ECNet performed poorly at both excitation wavelengths (398 nm and 488 nm), resulting in a hit rate of 0%. As shown in Figure 1D and Tables S7 and S8, for the combinatorial mutant libraries from ECNet, only 104 out of 1,038 colonies exhibited higher fluorescence intensity than the wild type in GFP488nm activity, resulting in a hit rate of just 10.0% (with a hit rate of 8.3% for GFP398nm activity). The most active mutant, Q157T/K162S/K214L, exhibited a modest 1.2-fold increase in both GFP488nm and GFP398nm activities.
In summary, DeepDE demonstrated its potential to forecast high-fitness mutants, significantly outperforming ECNet.
Subsequent three rounds of iterative design and validation of GFP
Using DM1 and SM1 as templates along two separate paths (Path I or DM only, and Path II or SM-DM-DM-DM), we embarked on three additional rounds of iterative evolution using the direct design approach with the same training dataset. For both paths, the best mutant from each round served as the template for the next iteration. As shown in Figures 2, S3, and Tables S9–S14, for Path I, GFP488nm activity peaked at Round 3 with a 4.7-fold increase, and for Path II, GFP488nm activity continued to rise, reaching a 7.6-fold increase by Round 4.
Figure 2.
Results from three distinct evolution paths of DeepDE
Fold changes in GFP488nm activities are shown for the top-performing mutants from Path I (PI), Path II (PII), and Path III (PIII), as well as for the top-performing mutants from Path III upon incorporating S65T. All assays were performed in triplicate with biological replicates. Error bars (ranging from 0.13% to 2.73%) are too small to be readily visible in the figure. See also Figures S1–S6 and Tables S9–S29.
Given that Path II outperformed Path I, we explored whether additional limited screening could lead to further optimization beyond SM1. To this end, we established a third path (Path III or SM-SM-SM-SM), involving three additional rounds of combinatorial libraries, again utilizing the same training dataset. As shown in Figures 2, S4, and Tables S15–S21, Path III consistently outperformed Path II, achieving a 25.8-fold increase in GFP488nm activity by Round 4. It is noteworthy that PIII-3 contained an additional mutation (G4S) and a synonymous mutation (E90). Because such errors can occur in combinatorial library construction using oligonucleotides, this outcome was considered acceptable.30 Furthermore, by incorporating the S65T mutation into the top-performing mutants from various rounds of Path III (Figure 2), we achieved a 74.3-fold increase in GFP488nm activity for the best mutant PIII-4, significantly surpassing the 40.2-fold increase observed for sfGFP.
For comparison, the S65T mutation was also integrated into the top-performing mutants from various rounds of Paths I and II (Table S22). We achieved a 17.6-fold increase in GFP488nm activity for the best mutant from Path I (PI-3) and a 23.4-fold increase for the best mutant from Path II (PII-2).
We also conducted a short evolution path (PIa) of two rounds using double mutants as building blocks. In this approach, we computed all possible beneficial double mutants, and the top 10 sequences were selected for direct synthesis and assay. The best mutant was then used as the template for a second round of design and assay. This path yielded the lowest performing mutant, with a 2.7-fold increase in GFP488nm activity (Figure S5; Tables S23 and S24). The results align with the computational evaluation of DeepDE for GFP on double mutants, leading us to abandon this evolutionary path.
In conclusion, for GFP, the additional limited screening approach significantly enhanced the performance of DeepDE, highlighting its potential for optimizing protein variants.
The fifth and final round of design and validation of GFP using new training datasets
For this final round of design, we selected a new set of 1,000 single or double mutants for each of three separate paths (Path I, Path II, and Path III), as detailed in Figure S6 and Data S2, S3, and S4. These sets of mutants cover 222, 209, and 224 of the 238 sites for avGFP, respectively. As illustrated in Figure 2, the three best mutants, PI-5, PII-5, and PIII-5, performed slightly better than PI-4, PII-4, and PIII-4 in terms of GFP488nm activities. The incorporation of S65T further enhanced GFP488nm activity for PI-5, PII-5, and PIII-5, reaching 9.0-fold, 22.7-fold, and 64.0-fold increases, respectively (Tables S25–S28). However, PIII-4 with S65T incorporated remains the most active variant in terms of GFP488nm activity.
Discussion
In this study, we introduce a deep learning-guided algorithm for protein evolution, which was evaluated on GFP. Our findings indicate that expanding the training dataset to a practically feasible size (on the order of 1,000 variants) significantly enhances the theoretical performance metrics of DeepDE (Table S2), in contrast to the claims by the low-N algorithm.15 This improvement is conceptually sound for two reasons: (1) proteins are highly diverse, with each protein or protein family exhibiting distinct structural and functional properties, and (2) the inherently vast combinatorial mutation space imposes a substantial data sparsity challenge, rendering low N training datasets severely insufficient for generalization.
In a direct, head-to-head design and testing comparison, with a mutation radius of three, DeepDE demonstrated exceptional robustness, achieving hit rates of 100% for the “mutagenesis by direct prediction” mode and 43.5% for the “mutagenesis coupled with limited screening” mode, far surpassing the performance of the comparative algorithm ECNet.44 Further iterative testing revealed that DeepDE yielded significantly better outcomes when combined with limited screening, in contrast to relying solely on direct prediction. Over just four rounds of evolution, utilizing a compact training dataset of 1,000 variants and a modest screening library of 1,000 mutants per cycle (4,000 total), DeepDE yielded a top-performing mutant, PIII-4 (+S65T), which achieved nearly double the activity of the hallmark sfGFP. Interestingly, among the 12 mutations in PIII-4 (+S65T), only three (V163, I171, A206) overlap with sfGFP aside from S65T, which carries 14 mutations (Table S29). This mutant also outperformed a recently reported sfGFP variant, mChartreuse,45 which incorporates six additional mutations derived from two other avGFP mutants and exhibits a 66.6-fold increase in GFP488nm activity (Figure 2).
Importantly, this mutant emerged before we had to transition to a second training dataset, as its activity already surpassed that of sfGFP. The best mutant from the next round along Path III, utilizing the second training dataset (PIII-5+S65T), actually performed slightly worse. This observation suggests that DeepDE may have rapidly approached the upper limit of GFP488nm activity for avGFP. However, it remains to be determined whether this efficiency is specific to GFP or if similar performance can be achieved for proteins with different evolutionary landscapes.
Our study provides two key insights into AI-guided protein evolution. First, limited screening significantly improves the performance of DeepDE. This likely stems from the limited capability of deep-learning models trained on a modest mutant dataset of a specific protein, which may struggle to navigate the extensive and potentially unique mutant landscape of the target protein. Limited screening provides diverse trajectories for the algorithm to traverse the protein landscape more adeptly, thereby mitigating the formidable challenge imposed by data sparsity in protein engineering.46,47 Limited screening has become increasingly viable with the advent of automated workstations.48 Second, while Path III (SM only), which is coupled with limited screening, showed a steady increase in the activity profile, we noticed that for Path I (DM only) and Path II (SM/DM), both of which rely on direct prediction starting from Round 2, the activity profiles started to plateau as early as Round 2, with Path I even experiencing a subsequent decline. This observation strongly suggests that algorithm performance for protein engineering must be assessed iteratively, a critical issue that is not sufficiently addressed in the current literature.
Our work supports a general DeepDE approach for proteins that currently lack available mutant datasets for training (Figure S1): (1) Generate a limited library of single or double mutants (in the thousands) in a single batch using error-prone PCR, and curate them based on a consistent standard for use in multiple rounds of evolution. (2) Select approximately 1,000 mutants to train the DeepDE model, compute triple site combinations, and use these to experimentally construct a limited combinatorial library of around 1,000 mutants. (3) Choose the best mutant as the launch point to design and screen for the next best-performing mutant. If necessary, shift to a new training dataset.
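The design, screen, and select cycle in this workflow can be written as a short loop. The following is only a schematic sketch: `design_library` and `screen` are hypothetical callables standing in for the DeepDE library design and the wet-lab assay, and the toy versions below use made-up per-site effects. Switching to a new training dataset is not modeled.

```python
from itertools import combinations


def iterative_evolution(template, rounds, design_library, screen):
    """Repeat design -> screen -> select, never revisiting mutated sites."""
    used_sites = set()
    history = []
    for _ in range(rounds):
        # design_library returns (variant, newly_mutated_sites) pairs.
        library = design_library(template, used_sites)
        measured = [(screen(variant), variant, sites) for variant, sites in library]
        activity, best, sites = max(measured, key=lambda m: m[0])
        history.append((best, activity))
        template = best           # best mutant seeds the next round
        used_sites.update(sites)  # exclude its sites from later rounds
    return history


# Toy stand-ins (assumptions for illustration, not the DeepDE model or assay).
TOY_EFFECTS = {site: 0.1 * site for site in range(1, 10)}


def toy_design(template, used_sites):
    free = [s for s in TOY_EFFECTS if s not in used_sites]
    return [(template + triple, triple) for triple in combinations(free, 3)]


def toy_screen(variant):
    return 1.0 + sum(TOY_EFFECTS[s] for s in variant)


history = iterative_evolution((), rounds=2, design_library=toy_design, screen=toy_screen)
print(history)
```

Keeping the design and screening steps as interchangeable callables mirrors the comparison of paths in this work, where the same loop is run with either direct prediction or library screening in each round.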
In conclusion, the DeepDE algorithm, coupled with limited screening as outlined in this work, presents a more pragmatic and scalable approach for AI-guided protein engineering. Clearly, the algorithm can be further optimized, particularly by updating the models for unsupervised and weak-positive only learning, as well as by devising better strategies for iterative evolution.
Limitations of the study
The DeepDE algorithm has so far been validated only on avGFP, and further testing is needed to assess its generalizability across proteins with diverse structures, functions, and evolutionary landscapes. In addition, there is much room for further refinement and optimization for each of the three modules in DeepDE (unsupervised learning, weak positive-only learning, and iterative evolution).
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact Zhanglin Lin (zhanglinlin@gdut.edu.cn).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• The GFP datasets used as training sequences in this study were obtained from Sarkisyan et al.35 Detailed descriptions of the datasets are provided in Data S1, S2, S3, and S4.
• Code supporting the findings of this study is available in the open-access repository Zenodo: https://doi.org/10.5281/zenodo.15959236.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This study was supported by National Key R&D Program of China (2018YFA0901000), Guangdong S&T Program (2024B1111160003), Science and Technology Projects in Guangzhou (2025A04J7027), and Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2019ZT08Y318).
Author contributions
Z.L., X.Y., and X.L. designed research; X.L., Q.W., J.L., and F.T. performed research; Z.L., X.Y., X.L., Q.W., and J.L. analyzed data; Z.L., X.Y., X.L., Q.W., and J.L. wrote the paper.
Declaration of interests
The authors declare that the DeepDE method and its associated source code are protected under copyright (registration number: 2023SR0456882).
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, we used ChatGPT (https://chatgpt.com/) solely to improve the readability of the manuscript draft. After using ChatGPT, we reviewed and edited the language as needed and take full responsibility for the content of the manuscript.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Bacterial and virus strains | ||
| E. coli DH5α-T1R | Thermo Fisher Scientific | Cat#12297016, Carlsbad, CA, USA https://www.thermofisher.cn/order/catalog/product/12297016?SID=srch-hj-12297-016 |
| Critical commercial assays | ||
| The GeneArt® Site-Directed Mutagenesis PLUS Kit | Thermo Fisher Scientific | Cat#A14604 https://www.thermofisher.cn/order/catalog/product/A14604?SID=srch-hj-A14604 |
| Deposited data | ||
| GFP datasets | Sarkisyan et al.35 | https://doi.org/10.6084/m9.figshare.3102154 |
| For detailed information about training dataset (Set Ⅰ/Set Ⅱa/Set Ⅱb/Set Ⅱc) used in this study, see Data S1, S2, S3, and S4 | This study | N/A |
| Oligonucleotides | ||
| For a list of primers used in this study, see Table S30 | This paper | N/A |
| Recombinant DNA | ||
| Plasmid: pACYC184 | The plasmid pACYC184 was a kind gift from Prof. Yang Sheng. | N/A |
| avGFP | Sarkisyan et al.35 | Protein Data Bank (PDB): 2WUR |
| sfGFP | Biswas et al.15 | https://doi.org/10.1038/s41592-021-01100-y |
| For the three best-performing mutants from PⅠ/PⅡ/PⅢ with S65T incorporated in this study, see Table S29 | This study | N/A |
| Software and algorithms | ||
| Low-N algorithm | Biswas et al.15 | GitHub: https://github.com/churchlab/low-N-protein-engineering |
| ECNet algorithm | Luo et al.44 | GitHub: https://github.com/luoyunan/ECNet |
| DeepDE algorithm | This study | Zenodo: https://doi.org/10.5281/zenodo.15959236 |
| Other | ||
| Ubuntu 16.04.7 | Powerleader | PR4908P |
| NVIDIA GeForce RTX 2080Ti | NVIDIA | GeForce RTX 2080Ti |
Method details
Model training
GFP datasets
The Sarkisyan dataset35 comprises functionally characterized sequences from the local fitness landscape of avGFP. This publicly available dataset, processed from Sarkisyan et al., served as the source for sampling training sequences in our experiments. To investigate out-of-distribution generalization, we randomly sampled six subsets of varying sizes (N = 24, 96, 240, 400, 1,000, 2,000), each containing single or double mutants (with a 1/9 split), from Sarkisyan dataset to construct the training dataset. For prospective avGFP design, a subset of size N = 1,000 single or double mutants was randomly sampled from the Sarkisyan dataset and utilized for design in Rounds 1 to 4 (Set I in Data S1). For Round 5 of each path, three new subsets of size N = 1,000 single or double mutants were generated by random sampling (Set IIa, Set IIb and Set IIc in Data S2, S3, and S4, respectively).
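The subset construction described above amounts to stratified random sampling. A minimal sketch follows, assuming the 1/9 split means one single mutant for every nine double mutants and that counts are rounded to the nearest integer; the variable names and the placeholder variant labels are hypothetical.

```python
import random


def sample_training_set(singles, doubles, size, single_fraction=0.1, seed=0):
    """Draw `size` variants with the requested share of single mutants."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_single = round(size * single_fraction)
    n_double = size - n_single
    return rng.sample(singles, n_single) + rng.sample(doubles, n_double)


# Placeholder pools standing in for the single and double mutants of a dataset.
singles = [f"s{i}" for i in range(200)]
doubles = [f"d{i}" for i in range(2000)]

for n in (24, 96, 240, 400, 1000, 2000):
    subset = sample_training_set(singles, doubles, n)
    print(n, sum(v.startswith("s") for v in subset), "single mutants")
```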
Model architecture
To facilitate the iterative design of avGFP, we employed a multiplicative long short-term memory (mLSTM) model based on eUniRep.15 Initially, eUniRep trains an unsupervised protein language model using sequences from UniRef50. Subsequently, the model undergoes fine-tuning with homologous sequences of the target protein, a process referred to as 'evotuning'. This fine-tuning results in the generation of a vector representation for each protein sequence. The representation serves as input for a top supervised model, utilized to predict the fitness of mutants. The open-source eUniRep code in TensorFlow aided our implementation (GitHub: https://github.com/churchlab/low-N-protein-engineering). Additionally, we generated augmented eUniRep representation of the protein sequences according to Hsu et al.17 In brief, this involved concatenating the one-hot encoding with the evolutionary density scores to create the augmented eUniRep representation (Figure S1).
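The concatenation step can be illustrated with a small sketch. It assumes the evolutionary density score is a single scalar per sequence (for example, a sequence log-likelihood from the evotuned model); the value passed in below is made up, and the function names are illustrative rather than taken from the DeepDE code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Flatten a sequence into a 20-per-residue one-hot vector."""
    vector = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1.0
        vector.extend(row)
    return vector


def augmented_representation(sequence, density_score):
    """Append a scalar evolutionary density score to the one-hot vector."""
    return one_hot(sequence) + [density_score]


# A five-residue fragment and a placeholder score, for illustration only.
features = augmented_representation("MSKGE", -1.5)
print(len(features))
```

The resulting feature vector is what a top supervised model, such as a regression on labeled variants, would take as input.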
Training details
In the process of the avGFP design, we used the previously described evotuned weights19 and repeated the evotuning process to ensure its robustness. Similar to UniRep,19 the avGFP target sequence, along with a selection of related fluorescent proteins (FPs), was subjected to a search using JackHMMER49 until convergence. Edit distance was calculated between the search result sequences and the avGFP target sequence. The resulting sequence set was filtered based on length (retaining those with <500 amino acids) and Levenshtein distance from avGFP (keeping all <400), while sequences with non-standard amino acids were eliminated. Training involved 10,000 gradient steps at a constant learning rate of $1 \times 10^{-5}$, in accordance with the methodology outlined by Biswas et al.15 To prevent overfitting during model training, we employed an early stopping strategy: training was stopped if the validation loss failed to decrease before reaching 10,000 steps. The model outputs normalized fluorescence values for the predicted sequences, with log10(relative fluorescence) values scaled according to the formula (x - min_val)/(wt_val - min_val), as described in Biswas et al.15 In this formula, min_val represents the fluorescence of the least fluorescent sequence (1.283), and wt_val denotes the fluorescence of the WT (wild-type) sequence (3.719). Consequently, post-transformation, the fluorescence of the WT sequence corresponds to a value of 1, while an entirely nonfunctional sequence is assigned a fluorescence value of 0.
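The scaling formula and its inverse are easy to check numerically. The sketch below uses the two constants given in the text and assumes that x is a base-10 logarithm of fluorescence, so that a difference in x converts to a linear fold change over wild type; the function names are illustrative.

```python
MIN_VAL = 1.283  # log10 fluorescence of the least fluorescent sequence (from the text)
WT_VAL = 3.719   # log10 fluorescence of wild-type avGFP (from the text)


def normalize(log10_fluorescence):
    """Scale so that wild type maps to 1 and the dimmest sequence to 0."""
    return (log10_fluorescence - MIN_VAL) / (WT_VAL - MIN_VAL)


def fold_over_wild_type(normalized):
    """Invert the scaling, then convert the log10 difference to a linear ratio."""
    log10_fluorescence = normalized * (WT_VAL - MIN_VAL) + MIN_VAL
    return 10 ** (log10_fluorescence - WT_VAL)


print(normalize(WT_VAL), normalize(MIN_VAL))
print(round(fold_over_wild_type(1.04), 2))
```

Under these assumptions, a normalized score of 1.04 corresponds to roughly 1.25 times wild-type fluorescence, in line with the 125% threshold quoted in the Results.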
Prospective design
Direct sequence design
In this application scenario, we predicted the values of all possible double mutants and used the beneficial ones to directly design triple mutant candidates. The key steps are as follows.
(1) Prediction of all possible double mutants. Using the training subset (Set I), we trained the DeepDE model and used it to predict the values of all possible double mutants (10,181,283 = $\binom{238}{2} \times 19^{2}$).
(2) Selection of top-performing mutants. From the simulated landscape of all possible double mutants, 258 mutants with predicted values exceeding 1.04 (corresponding to avGFP activity about 25% higher than that of the wild type) were identified. For each site, the mutation frequency was calculated, and the top 20 mutation sites along with their respective amino acid substitutions were selected for design. After each round, the three mutated sites in the template were excluded from further consideration.
(3) Combination of triple mutants. Based on each of the top 20 mutation sites, we generated new triple mutants by considering all subset combinations of mutated sites and enumerating amino acid changes that have appeared in these sites within the subset. The trained DeepDE model was applied to predict the value for each candidate mutant. Mutants were ranked based on their predicted values, and the top 10 mutants were selected for direct synthesis.
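The three steps above can be sketched in a few functions. This is one possible reading of the procedure, not the authors' implementation: the function names are hypothetical, `predict` stands in for the trained DeepDE model, and the sequence length of 238 is an inference from the count of 10,181,283 double mutants quoted in step (1).

```python
from collections import Counter
from itertools import combinations, product
from math import comb

SEQ_LEN = 238  # assumed number of mutable positions (inferred from the counts)
N_SUBS = 19    # alternative amino acids per position

# The quoted size of the double-mutant landscape matches C(238, 2) * 19^2.
assert comb(SEQ_LEN, 2) * N_SUBS ** 2 == 10_181_283


def top_sites(double_mutants, n_sites=20):
    """Rank sites by their frequency among the selected double mutants.

    double_mutants: iterable of ((pos1, aa1), (pos2, aa2)) tuples.
    Returns {site: set of substitutions observed at that site}.
    """
    freq = Counter()
    subs = {}
    for mutant in double_mutants:
        for pos, aa in mutant:
            freq[pos] += 1
            subs.setdefault(pos, set()).add(aa)
    chosen = [pos for pos, _ in freq.most_common(n_sites)]
    return {pos: subs[pos] for pos in chosen}


def enumerate_triples(site_subs):
    """Yield every triple mutant built from three distinct selected sites."""
    for sites in combinations(sorted(site_subs), 3):
        for aas in product(*(sorted(site_subs[s]) for s in sites)):
            yield tuple(zip(sites, aas))


def pick_top(candidates, predict, k=10):
    """Rank candidates by predicted value and keep the k best."""
    return sorted(candidates, key=predict, reverse=True)[:k]
```

In practice `double_mutants` would hold the 258 mutants above the 1.04 threshold, and the output of `pick_top` would be the ten sequences sent for synthesis.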
Combinatorial library design
In this scenario, we aimed to find more functional mutants on average, and therefore predicted triple site combinations rather than individual triple mutants. The key steps are as follows.
(1) Prediction of all possible double site combinations. Based on the values of all possible double mutants (10,181,283 = $\binom{238}{2} \times 19^{2}$), we predicted the value of each double site combination (28,203 = $\binom{238}{2}$) by calculating the mean of the three mutants with the highest predicted values for that site combination.
(2) Prediction of all possible triple site combinations. For each triple site combination (2,218,636 = $\binom{238}{3}$), we calculated the mean value of the corresponding double site combinations (for example, the triple site combination 89–101–137 can be split into 89–101, 89–137, and 101–137). After ranking, the top 10 triple site combinations were chosen for design. After each round, the three mutated sites in the template were excluded from further consideration.
(3) Combination of mutant libraries. Each triple site combination contains 6,859 mutants ($19^{3}$). The DeepDE model was applied to predict the values of all mutants corresponding to these top 10 triple site combinations. Mutants were ranked based on predicted values, and the top 5 mutants for each combination were selected for encoding with degenerate codons for experimental validation.
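The site-level scoring in steps (1) and (2) can be sketched as below. The function names and the dictionary layout are hypothetical; only the two averaging rules (mean of the top three double mutants per site pair, then mean of the three site pairs per triple) come from the text.

```python
from itertools import combinations
from math import comb
from statistics import mean

# Library sizes quoted in the text, assuming 238 positions and 19 substitutions.
assert comb(238, 3) == 2_218_636
assert 19 ** 3 == 6_859


def pair_scores(double_predictions, top_n=3):
    """Score each site pair by the mean of its top_n predicted double mutants.

    double_predictions: {(site_i, site_j): [predicted values]}, site_i < site_j.
    """
    return {
        pair: mean(sorted(values, reverse=True)[:top_n])
        for pair, values in double_predictions.items()
    }


def triple_site_score(sites, pair_score):
    """Mean of the three pair scores contained in a triple of sites."""
    return mean(pair_score[p] for p in combinations(sorted(sites), 2))


def rank_triple_sites(all_sites, pair_score, k=10):
    """Return the k triple site combinations with the highest scores."""
    triples = combinations(sorted(all_sites), 3)
    return sorted(
        triples, key=lambda t: triple_site_score(t, pair_score), reverse=True
    )[:k]
```

Scoring site triples from pair scores avoids evaluating all 6,859 mutants for each of the roughly 2.2 million triple site combinations; the full model is only applied to the top 10 combinations in step (3).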
ECNet
The ECNet algorithm44 is a deep learning architecture designed for sequence-to-function prediction. It integrates two key components: a local evolutionary context and a global evolutionary context. Input protein sequences are encoded in a one-hot format, and the predicted functional measurements of proteins are generated as output. The global evolutionary context employs the transformer model from TAPE,25 generating a vector representation for each amino acid. Meanwhile, the local evolutionary context is derived from the multiple sequence alignment (MSA) of homologous sequences of the target protein. The pre-trained weights of the ECNet model are available for download at GitHub: https://github.com/luoyunan/ECNet. We applied the model to the avGFP dataset, which includes 1,084 single mutants, 12,777 double mutants, 12,337 triple mutants, as well as single-to-triple mutants, to predict 9,386 quadruple mutants of avGFP. The Spearman's correlation coefficients (ρ) obtained were in line with those reported in the original publication. To further evaluate ECNet's performance, we retrained the model using the same training dataset (Set I), consisting of 1,000 single or double mutants. The model was tasked with directly predicting beneficial triple mutants and identifying possible beneficial triple mutation site combinations for experimental validation.
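For reference, Spearman's correlation between predicted and measured values can be computed as the Pearson correlation of their ranks. The sketch below is a dependency-free illustration with hypothetical function names; it is not the evaluation script used with ECNet.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(predicted, measured):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))
```

Because only ranks enter the calculation, the metric is insensitive to any monotonic rescaling of the predicted values, such as the fluorescence normalization described under Training details.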
Experimental validation
Strains, plasmids, and materials
The E. coli strain DH5α-T1R was obtained from Thermo Fisher Scientific (Carlsbad, CA, USA). The plasmid pACYC184 was a kind gift from Prof. Yang Sheng. DNA sequences encoding avGFP,35 sfGFP,15 and mChartreuse45 were optimized for expression in E. coli and synthesized by Ruibiotech (Beijing, China). Oligonucleotides for cloning were synthesized by Sangon (Shanghai, China) and are listed in Table S30. DNA sequencing was performed by Ruibiotech (Beijing, China). Restriction enzymes and DNA polymerases were purchased from New England Biolabs (Beverly, MA, USA). The GeneArt Site-Directed Mutagenesis PLUS Kit was obtained from Thermo Fisher Scientific (A14604, Carlsbad, CA, USA). The 96-well microtiter plates were procured from Corning (3904, New York, USA), and the microplate reader was sourced from Tecan (Infinite M200 Pro, Switzerland). The high-speed oscillating incubator was purchased from Zhichu Instrument (Shanghai, China).
Construction of plasmids
The plasmid pACYC184 was employed as the vector for cloning avGFP, sfGFP, and mChartreuse. DNA fragments containing avGFP, sfGFP, and mChartreuse were amplified by PCR and assembled with the specified pACYC-J23100 DNA fragment using the Gibson assembly method.50 To generate libraries of avGFP plasmids with specific mutations, we employed the GeneArt Site-Directed Mutagenesis PLUS Kit along with synthetic mutagenic oligonucleotides. Mutated plasmids were then transformed into DH5α-T1R competent cells. Details of the synthetic mutagenic oligonucleotides used can be found in Table S30. Synthetic DNA segments featuring various mutations in avGFP were generated through PCR with the following primers: 5′AAGAGGAGAAAGGATCCATGAGCAAGGGCGAGGAGCTGT3′ and 5′CTCGAGAAGCTTTCTAGATTATCACTTGTACAGCTCGTCCATGCC3′. These PCR products were subsequently assembled with the specified pACYC-J23100 DNA fragment using the Gibson assembly method. To construct plasmids containing the desired 65T mutation in the best sequences identified from each round, two pairs of PCR primers were employed to prepare the DNA segments. The resulting assembly products were chemically transformed into DH5α-T1R competent cells. The reading frames of the aforementioned plasmids were confirmed through DNA sequencing.
Expression of GFP designs
E. coli DH5α-T1R competent cells containing the target plasmids were plated onto lysogeny broth (LB) agar plates supplemented with 34 mg/L chloramphenicol and incubated at 37°C overnight. Colonies were subsequently picked using toothpicks and transferred into 96-well microtiter plates (flat-bottomed, polystyrene plates; Beyotime, Shanghai, China) containing LB medium (150 μL; 34 mg/L chloramphenicol). After 16 h of cultivation in a microtiter plate shaker (37°C, 800 rpm; Zhichu Instrument, Shanghai, China), each well was replicated with a replicator into a second series of 96-well microtiter plates (transparent flat bottom, black wall; Corning, New York, USA) containing LB (200 μL, 34 mg/L chloramphenicol). The first set of plates was stored at −80°C after addition of 60% glycerol. The clones in the second set of plates were cultivated for 10 h in a microtiter plate shaker (37°C, 800 rpm; Zhichu Instrument, Shanghai, China). After expression, the cultures were used for screening as described below. During the expression process, 104 clones were evaluated for each combinatorial library, except for L1 predicted by DeepDE, where only 67 clones were assessed due to the low efficiency of long primers (Figure 1B).
Characterization of designed GFP
Fluorescence intensities of the mutants were measured using an Infinite M200 Pro microplate reader. The instrument was first preheated to 37°C. For each mutant, fluorescence was recorded with excitation at 398 nm (ex: 398 nm, em: 512 nm, gain: 80%) and at 488 nm (ex: 488 nm, em: 525 nm, gain: 60%), together with the absorbance at 600 nm. For the characterization of fluorescence intensity, error bars represent three independent experiments.
Published: August 7, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.113324.
Contributor Information
Xiaofeng Yang, Email: biyangxf@scut.edu.cn.
Zhanglin Lin, Email: zhanglinlin@gdut.edu.cn.
References
1. Qiu Y., Wei G.-W. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief. Bioinform. 2023;24 doi: 10.1093/bib/bbad289.
2. Osadchy M., Kolodny R. How deep learning tools can help protein engineers find good sequences. J. Phys. Chem. B. 2021;125:6440–6450. doi: 10.1021/acs.jpcb.1c02449.
3. Wu Z., Kan S.B.J., Lewis R.D., Wittmann B.J., Arnold F.H. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. USA. 2019;116:8852–8858. doi: 10.1073/pnas.1901979116.
4. Chen L., Zhang Z., Li Z., Li R., Huo R., Chen L., Wang D., Luo X., Chen K., Liao C., Zheng M. Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst. 2023;14:706–721.e5. doi: 10.1016/j.cels.2023.07.003.
5. Chen K.Q., Arnold F.H. Enzyme engineering for nonaqueous solvents: random mutagenesis to enhance activity of subtilisin E in polar organic media. Biotechnology. 1991;9:1073–1077. doi: 10.1038/nbt1191-1073.
6. Arnold F.H. Design by directed evolution. Acc. Chem. Res. 1998;31:125–131.
7. Zhao H., Giver L., Shao Z., Affholter J.A., Arnold F.H. Molecular evolution by staggered extension process (StEP) in vitro recombination. Nat. Biotechnol. 1998;16:258–261. doi: 10.1038/nbt0398-258.
8. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
9. Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A.J., Bambrick J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi: 10.1038/s41586-024-07487-w.
10. Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D., et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
11. Krishna R., Wang J., Ahern W., Sturmfels P., Venkatesh P., Kalvet I., Lee G.R., Morey-Burrows F.S., Anishchenko I., Humphreys I.R., et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science. 2024;384 doi: 10.1126/science.adl2528.
12. Cui Y., Chen Y., Sun J., Zhu T., Pang H., Li C., Geng W.-C., Wu B. Computational redesign of a hydrolase for nearly complete PET depolymerization at industrially relevant high-solids loading. Nat. Commun. 2024;15:1417. doi: 10.1038/s41467-024-45662-9.
13. Jiang F., Li M., Dong J., Yu Y., Sun X., Wu B., Huang J., Kang L., Pei Y., Zhang L., et al. A general temperature-guided language model to design proteins of enhanced stability and activity. Sci. Adv. 2024;10 doi: 10.1126/sciadv.adr2641.
14. Suzek B.E., Wang Y., Huang H., McGarvey P.B., Wu C.H., UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi: 10.1093/bioinformatics/btu739.
15. Biswas S., Khimulya G., Alley E.C., Esvelt K.M., Church G.M. Low-N protein engineering with data-efficient deep learning. Nat. Methods. 2021;18:389–396. doi: 10.1038/s41592-021-01100-y.
16. Jiang K., Yan Z., Bernardo M.D., Sgrizzi S.R., Villiger L., Kayabolen A., Kim B.J., Carscadden J.K., Hiraizumi M., Nishimasu H., et al. Rapid in silico directed evolution by a protein language model with EVOLVEpro. Science. 2024;387 doi: 10.1126/science.adr6006.
17. Hsu C., Nisonoff H., Fannjiang C., Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 2022;40:1114–1122. doi: 10.1038/s41587-021-01146-5.
18. Rives A., Meier J., Sercu T., Goyal S., Lin Z., Liu J., Guo D., Ott M., Zitnick C.L., Ma J., Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA. 2021;118 doi: 10.1073/pnas.2016239118.
19. Alley E.C., Khimulya G., Biswas S., AlQuraishi M., Church G.M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods. 2019;16:1315–1322. doi: 10.1038/s41592-019-0598-1.
20. Meier J., Rao R., Verkuil R., Liu J., Sercu T., Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021;34:29287–29303.
21. Hopf T.A., Ingraham J.B., Poelwijk F.J., Schärfe C.P.I., Springer M., Sander C., Marks D.S. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 2017;35:128–135. doi: 10.1038/nbt.3769.
22. Riesselman A.J., Ingraham J.B., Marks D.S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods. 2018;15:816–822. doi: 10.1038/s41592-018-0138-4.
23. Shihab H.A., Gough J., Cooper D.N., Stenson P.D., Barker G.L.A., Edwards K.J., Day I.N.M., Gaunt T.R. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 2013;34:57–65. doi: 10.1002/humu.22225.
24. Wittmann B.J., Yue Y., Arnold F.H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 2021;12:1026–1045.e7. doi: 10.1016/j.cels.2021.07.008.
25. Rao R., Bhattacharya N., Thomas N., Duan Y., Chen X., Canny J., Abbeel P., Song Y.S. Evaluating protein transfer learning with TAPE. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1906.08230.
26. Shanehsazzadeh A., Belanger D., Dohan D. Is transfer learning necessary for protein landscape prediction? Preprint at arXiv. 2020. doi: 10.48550/arXiv.2011.03443.
27. Ding F., Steinhardt J. Protein language models are biased by unequal sequence sampling across the tree of life. Preprint at bioRxiv. 2024. doi: 10.1101/2024.03.07.584001.
28. Saito Y., Oikawa M., Sato T., Nakazawa H., Ito T., Kameda T., Tsuda K., Umetsu M. Machine-learning-guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration. ACS Catal. 2021;11:14615–14624.
29. Rapp J.T., Bremer B.J., Romero P.A. Self-driving laboratories to autonomously navigate the protein fitness landscape. Nat. Chem. Eng. 2024;1:97–107. doi: 10.1038/s44286-023-00002-4.
30. Patsch D., Schwander T., Voss M., Schaub D., Hüppi S., Eichenberger M., Stockinger P., Schelbert L., Giger S., Peccati F., et al. Enriching productive mutational paths accelerates enzyme evolution. Nat. Chem. Biol. 2024;20:1662–1669. doi: 10.1038/s41589-024-01712-3.
31. Cody C.W., Prasher D.C., Westler W.M., Prendergast F.G., Ward W.W. Chemical structure of the hexapeptide chromophore of the Aequorea green-fluorescent protein. Biochemistry. 1993;32:1212–1218. doi: 10.1021/bi00056a003.
32. Heim R., Cubitt A.B., Tsien R.Y. Improved green fluorescence. Nature. 1995;373:663–664. doi: 10.1038/373663b0.
33. Heim R., Prasher D.C., Tsien R.Y. Wavelength mutations and posttranslational autoxidation of green fluorescent protein. Proc. Natl. Acad. Sci. USA. 1994;91:12501–12504. doi: 10.1073/pnas.91.26.12501.
34. Walquist M.J., El-Gewely M.R. Mutagenesis: Site-Directed. In: Encyclopedia of Life Sciences. John Wiley & Sons, Ltd.; 2018. pp. 1–14.
35. Sarkisyan K.S., Bolotin D.A., Meer M.V., Usmanova D.R., Mishin A.S., Sharonov G.V., Ivankov D.N., Bozhanova N.G., Baranov M.S., Soylemez O., et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533:397–401. doi: 10.1038/nature17995.
36. Pédelacq J.-D., Cabantous S., Tran T., Terwilliger T.C., Waldo G.S. Engineering and characterization of a superfolder green fluorescent protein. Nat. Biotechnol. 2006;24:79–88. doi: 10.1038/nbt1172.
37. Notin P., Kollasch A.W., Ritter D., van Niekerk L., Paul S., Spinner H., Rollins N., Shaw A., Weitzman R., Frazer J., et al. ProteinGym: large-scale benchmarks for protein design and fitness prediction. Preprint at bioRxiv. 2023. doi: 10.1101/2023.12.07.570727.
38. Russ W.P., Figliuzzi M., Stocker C., Barrat-Charlaix P., Socolich M., Kast P., Hilvert D., Monasson R., Cocco S., Weigt M., Ranganathan R. An evolution-based model for designing chorismate mutase enzyme. Science. 2020;369:440–445. doi: 10.1126/science.aba3304.
39. Barrat-Charlaix P., Figliuzzi M., Weigt M. Improving landscape inference by integrating heterogeneous data in the inverse Ising problem. Sci. Rep. 2016;6 doi: 10.1038/srep37812.
40. Järvelin K., Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002;20:422–446.
41. Caro M.C., Huang H.-Y., Ezzell N., Gibbs J., Sornborger A.T., Cincio L., Coles P.J., Holmes Z. Out-of-distribution generalization for learning quantum dynamics. Nat. Commun. 2023;14:3751. doi: 10.1038/s41467-023-39381-w.
42. Fu Y., Yu S., Li J., Lao Z., Yang X., Lin Z. DeepMineLys: Deep mining of phage lysins from human microbiome. Cell Rep. 2024;43 doi: 10.1016/j.celrep.2024.114583.
43. Cormack B.P., Valdivia R.H., Falkow S. FACS-optimized mutants of the green fluorescent protein (GFP). Gene. 1996;173:33–38. doi: 10.1016/0378-1119(95)00685-0.
44. Luo Y., Jiang G., Yu T., Liu Y., Vo L., Ding H., Su Y., Qian W.W., Zhao H., Peng J. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 2021;12:5743. doi: 10.1038/s41467-021-25976-8.
45. Fraikin N., Couturier A., Mercier R., Lesterlin C. A palette of bright and photostable monomeric fluorescent proteins for bacterial time-lapse imaging. Sci. Adv. 2025;11 doi: 10.1126/sciadv.ads6201.
46. Romero P.A., Arnold F.H. Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol. 2009;10:866–876. doi: 10.1038/nrm2805.
47. Acevedo-Rocha C.G., Li A., D'Amore L., Hoebenreich S., Sanchis J., Lubrano P., Ferla M.P., Garcia-Borràs M., Osuna S., Reetz M.T. Pervasive cooperative mutational effects on multiple catalytic enzyme traits emerge via long-range conformational dynamics. Nat. Commun. 2021;12:1621. doi: 10.1038/s41467-021-21833-w.
48. Feng L., Gao L., Besirlioglu V., Essani K., Wittwer M., Kurkina T., Ji Y., Schwaneberg U. A flow cytometry-based ultrahigh-throughput screening method for directed evolution of oxidases. Angew. Chem. Int. Ed. 2023;62 doi: 10.1002/anie.202214999.
49. Potter S.C., Luciani A., Eddy S.R., Park Y., Lopez R., Finn R.D. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46:W200–W204. doi: 10.1093/nar/gky448.
50. Gibson D.G., Young L., Chuang R.-Y., Venter J.C., Hutchison C.A., Smith H.O. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods. 2009;6:343–345. doi: 10.1038/nmeth.1318.
Data Availability Statement
• The GFP datasets used as training sequences in this study were obtained from Sarkisyan et al.35 Detailed descriptions of the datasets are provided in Data S1, S2, S3, and S4.
• Code supporting the findings of this study is available in the open-access repository Zenodo: https://doi.org/10.5281/zenodo.15959236.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.


