Abstract
Designing protein mutants with both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model, which can suggest protein mutants with improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive ability compared to current state-of-the-art models on the public mutagenesis dataset across 283 protein assays. Furthermore, we validated PRIME’s predictions on five proteins, examining the impact of the top 30 to 45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize nonnatural nucleic acid or resilience to extreme alkaline conditions. More than 30% of PRIME-recommended mutants exhibited superior performance compared to their premutation counterparts across all proteins and desired properties. We developed an efficient and effective method based on PRIME to rapidly obtain multisite mutants with enhanced activity and stability. Hence, PRIME demonstrates broad applicability in protein engineering.
The PRIME model improves protein mutants’ key properties with a 30% success rate, using machine learning without prior data.
INTRODUCTION
Proteins are fundamental constituents of living systems, playing crucial roles in a vast array of biological processes, spanning from enzyme catalysis (1) and cellular metabolism (2) to immune responses (3), signal transduction (4), and transport (5), among others. Beyond their biological significance, proteins are critical to numerous industries. In biomedicine, they serve as therapeutic agents and targets; in the food industry, they play roles in food processing and preservation; in brewing, they are essential to the production process; and in chemical engineering, they act as key catalysts for various reactions. In addition, proteins are the cornerstone of in vitro diagnostic tests, instrumental in the detection and monitoring of numerous diseases. However, proteins extracted from biological organisms, known as “wild type,” often require modifications to make them suitable for industrial applications. This is primarily because the physicochemical conditions (e.g., temperature) in which these proteins need to function in industrial settings are often drastically different from their native biological contexts (6, 7). Therefore, to meet the demands of these diverse application scenarios, the proteins need to be engineered through mutations to improve their physicochemical properties (8–10). These modifications may aim to enhance stability under extreme temperature (11) or pH conditions or to increase enzymatic activity and specificity. The process of optimizing proteins for such industrial applications typically involves iterative cycles of mutation, screening, and selection—a labor-intensive and time-consuming endeavor.
As computational simulations and related technologies continue to advance, various software tools have emerged to enhance protein thermostability, including Rosetta (12), ABACUS (13), and FoldX (14), which use physical or statistical potential functions. While these computational methods often provide relatively accurate stability predictions, their capacity to predict protein biological activity is limited. Typically, modifying the biological activity of proteins requires long-term (~years of) meticulous experimental research into their working mechanisms, which is the primary route of rational protein design. However, mechanistic research is time consuming and labor intensive, and it increasingly fails to meet the modification needs of many important industrial enzymes commonly used in everyday applications. In recent years, deep learning has been extensively applied in protein engineering. Large-scale protein language models (PLMs) (15–19), which use self-supervised learning on protein sequences to understand protein sequence semantics and grammar, have demonstrated appreciable predictive performance for protein fitness (20), even in zero-shot settings (19, 21, 22). A zero-shot setting here means that the model can predict the mutation sites of a protein to improve its properties without relying on any prior experimental mutagenesis data. However, the predictions of most PLMs pretrained on extensive protein sequences often do not achieve sufficiently high accuracy for protein stability, which is crucial for protein engineering (23). Other supervised deep learning methods exhibit high accuracy in predicting protein fitness but rely on high-throughput experiments to generate hundreds or even thousands of data points (24, 25). This approach is not practical for many proteins because of resource limitations. In this study, we used a comprehensive dataset comprising 96 million protein sequences paired with the optimal growth temperatures (OGTs) of their host bacterial strains (26). The OGT of host bacterial strains has been shown to strongly correlate with properties such as a protein’s optimal enzymatic activity temperature and melting temperature (27). Leveraging this dataset, we developed a deep learning–based methodology, termed PRIME, which stands for Protein language model for Intelligent Masked pretraining and Environment (temperature) prediction. During its pretraining process, PRIME uses a masked language modeling (MLM) task, inspired by transformer-based language models (28). This task involves artificially modifying protein sequences based on the natural probability distribution of amino acids and then attempting to restore the sequences to their original state. This procedure enables PRIME to learn and comprehend the semantic and grammatical features inherent in protein sequences. Alongside this, PRIME capitalizes on a multitask learning paradigm to capture the temperature traits associated with these sequences. This approach fosters an inherent predisposition in PRIME to assign higher scores to protein sequences exhibiting enhanced temperature tolerance and conforming to natural biological principles. PRIME is trained with the objective of predicting OGTs across a wide range of bacterial strains. As a result, PRIME naturally correlates higher scores with sequences more likely to contribute to robustness and survivability in varied environmental conditions, including extreme temperatures.
Therefore, PRIME proves particularly proficient in the design and optimization of industrial enzymes and proteins that often require high-temperature tolerance and resilience for practical applications. Our model demonstrated much better predictive performance than other state-of-the-art models in forecasting the thermostability [change of melting temperature (Tm)] and fitness of mutated protein sequences.
To further evaluate the efficacy of our model, we applied it to five distinct proteins and subjected the results to wet-lab experimental validation. The proteins studied included LbCas12a, T7 RNA polymerase, creatinase, nonnatural nucleic acid polymerase, and the variable domain of the heavy chain of a nano-antibody against growth hormone (VHH). Without any prior experimental mutagenesis data, we used the PRIME model to select several top-ranking single-site mutants for experimental testing. Our results revealed that more than 30% of these mutants displayed notable improvements in physicochemical properties, such as thermostability, catalytic activity, and antigen-antibody binding affinity, or even in nonnatural properties, e.g., the ability to polymerize nonnatural nucleic acid or resilience to extreme alkaline conditions.
Protein engineering for various pharmaceutical and industrial applications is confronted by two major challenges. The first is the identification of beneficial single-site mutations, and the second is the combination of multiple single-site mutations into a deep mutant. The latter is particularly challenging because combining two positive single-site mutations often results in a two-site mutant with inferior performance compared to each single-site mutant before the combination. As shown in (29) for high-throughput screening of green fluorescent protein, the probability of observing a negative epistatic effect, where the fluorescence intensity of a mutant combining two single-site mutations is worse than the linear addition of the fitness of the two before combination, is ~100 times higher than that of observing a positive epistatic effect. Building on this foundation, we introduce a multisite stacking strategy based on the PRIME model. For example, in the case of T7 RNA polymerase, after three rounds of AI-experiment iterations with fewer than 100 mutants in total, we successfully developed a mutant with 12-site mutations that surpasses the thermostable counterpart offered by the leading commercial biotechnology company, New England Biolabs. We also applied a similar strategy to LbCas12a, which contains multiple domains and 1228 amino acids. After three rounds of AI-experiment iterations with fewer than 100 mutants, we achieved an 8-site mutant with the best thermostability to date, whose Tm is 6.5°C higher than that of the wild type while maintaining comparable or higher trans-cleavage activity at the desired condition.
Furthermore, in the cases of T7 RNA polymerase and LbCas12a, we found that PRIME can automatically combine negative single-site mutations from different functional domains into a multisite deep mutant to further improve the latter’s fitness. This could be a very important finding, as it opens a route for protein engineers, who can now make use of negative mutations to improve the fitness of proteins. These negative mutations, which are more common than positive ones, were traditionally pre-excluded in conventional protein engineering.
RESULTS
PRIME architecture
PRIME is a pretrained model based on the Transformer architecture (30), as illustrated in Fig. 1A. PRIME consists of three main components. The first is the encoder module for sequence feature extraction, a Transformer encoder that extracts the latent representation of the sequence. The second is the MLM module, which is designed to prompt the encoder to learn the contextual representation of amino acids; the MLM module can also be applied to mutant scoring. The third is the OGT prediction module, which predicts, on the basis of the latent representation, the OGT of the organism in which the protein is located. The model and training details of PRIME are described in Methods.
The pretraining objectives of PRIME
PRIME has three learning objectives: the MLM objective, the OGT prediction objective, and the correlation objective. The details of these objectives are as follows:
Masked language modeling
MLM is often used as a pretraining method for sequential data representation. In this objective, noised protein sequences serve as the input, wherein some tokens are masked as “<mask>” or substituted with alternative tokens. The training objective is to reconstruct these noised tokens. This approach facilitates the model’s ability to capture dependencies among amino acids as well as contextual information along the sequence. The details can be found in Methods. Moreover, we can use this reconstruction process to score mutations.
OGT prediction
The second training objective is optimized under supervised conditions. We use a dataset containing 96 million protein sequences annotated with OGT to train the PRIME model. The input of this objective is a protein sequence, and the OGT module generates a temperature value ranging between 0° and 100°C. Notably, the OGT and MLM modules share the same encoder. This architecture enables the model to simultaneously capture amino acid contextual information and temperature-related sequence characteristics (Fig. 1B).
Correlation objective
We introduce a learning objective that aligns these two metrics to facilitate feedback from the predicted OGTs to the MLM scores. For a group of single-site mutant sequences, the OGT prediction module outputs their OGTs, and the MLM module scores these mutants. Subsequently, we maximize the Pearson correlation between these mutant scores and the predicted OGT values, serving to align the mutant OGTs with their corresponding mutant scores. The goal of this objective is thus the maximization of the Pearson correlation coefficient. We use the Pearson correlation as our learning objective because it is differentiable (allowing backpropagation), in contrast to the nondifferentiable Spearman correlation.
We have conducted experiments using mean square error (MSE) loss to align the MLM and OGT predictions (table S1). We found that this approach yielded inferior results compared to using Pearson correlation as a loss function. The possible reason is that MSE loss aligns the MLM and OGT values for a single sequence, resulting in unstable loss for individual data, and the absolute value of the MLM score holds limited significance for us. In contrast, correlation loss is calculated for a set of mutated sequences and better reflects the relative magnitude of values within a set, which aligns more closely with our specific application scenario of protein engineering and evaluating a set of mutated data.
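For illustration, a minimal PyTorch sketch of this correlation objective is given below; the tensor names (mlm_scores, pred_ogt) are ours and not taken from the released PRIME code.

import torch

def correlation_loss(mlm_scores, pred_ogt, eps=1e-8):
    """Negative Pearson correlation between MLM mutant scores and predicted OGTs.

    Both inputs are 1-D tensors over the same batch of single-site mutants.
    Every step (centering, covariance, standard deviations) is differentiable,
    unlike a rank-based Spearman correlation, so the objective can be optimized
    by backpropagation; minimizing the loss maximizes the correlation.
    """
    s = mlm_scores - mlm_scores.mean()
    t = pred_ogt - pred_ogt.mean()
    cov = (s * t).mean()
    return -cov / (s.std(unbiased=False) * t.std(unbiased=False) + eps)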
Zero-shot single-site mutation scoring
Models trained with the MLM objective can output the likelihood of amino acids appearing at a specific position based on the surrounding context. We use this to score single-site mutations. Given a mutation, we treat the amino acid in the wild-type protein as a reference and compare its likelihood to that of the mutated amino acid. The mutations are then scored using the log-odds ratio at the mutated position. (See Fig. 1C; the details can be found in Methods.)
Augmentation of single-site mutation prediction performance in PRIME through fine-tuning on homologous sequences via the MLM learning objective
While PRIME exhibits commendable performance in zero-shot mutant effect prediction, we observed that additional unsupervised fine-tuning of the language modeling module on homologous protein sequences of target proteins yields improved results, without adding supervision from experimental data. Specifically, for the fine-tuning process, we deploy homologous sequences of the proteins of interest as an unsupervised dataset, optimizing both the encoder and MLM modules of PRIME and of ESM-2 650M. Evaluation results substantiate that this method improves PRIME’s and ESM-2’s predictive accuracy for mutant effect prediction.
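The sketch below outlines this unsupervised fine-tuning loop under stated assumptions: model stands in for the PRIME (or ESM-2) encoder plus MLM head returning per-position logits, homolog_loader iterates over batches of tokenized homologous sequences, and mask_fn is a BERT-style noising routine (such as the one sketched in Methods); none of these names come from the released code.

import torch

def finetune_on_homologs(model, homolog_loader, mask_fn, epochs=3, lr=1e-5):
    """Unsupervised MLM fine-tuning of the encoder and MLM head on homologous
    sequences of the target protein; no experimental labels are used."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in homolog_loader:            # batch: LongTensor of token ids (B, L)
            noised, targets = mask_fn(batch)    # BERT-style noising; -100 marks untouched positions
            logits = model(noised)              # assumed output shape (B, L, vocab)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=-100,              # only noised positions contribute to the loss
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model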
PRIME outperforms state-of-the-art methods in predicting fitness of mutated protein sequence
We conducted a comparison of the zero-shot prediction capacity on thermostability between our model, PRIME, and several current state-of-the-art models, including the deep learning models ESM-1v (21), ESM-2 (19), MSA-transformer (17), Tranception-EVE (31), CARP (32), MIF-ST (33), SaProt (34), and Stability Oracle (35), as well as the traditional computational methods GEMME (36) and Rosetta (12). Notably, among these methods, MIF-ST, SaProt, and Rosetta incorporate protein structure information, whereas the others rely solely on protein sequence. Our analysis used a dataset derived from MPTherm (37), FireProtDB (38), and ProThermDB (39), featuring single-site mutations in proteins with ΔTm, i.e., the change of melting temperature relative to the wild type, collected under the same experimental pH and with a minimum of 10 data points per protein, amassing a total of 66 assays. Concurrently, the analysis also incorporated assays from deep mutational scanning (DMS), specifically those housed within ProteinGym (31). ProteinGym presents a meticulously constructed substitution benchmark, characterized by the experimental delineation and assessment of ~2.5 million missense variants. These variants are dispersed across 217 distinct DMS assays and encompass a range of protein properties including, but not limited to, enzymatic catalysis, binding affinity, stability, and fluorescence intensity. Such a comprehensive assembly of missense variants makes ProteinGym a robust and expansive repository for the systematic examination and interpretation of the diverse and intricate landscape of protein mutations and their associated properties.
These comprehensive datasets enabled a systematic investigation of the impact of specific mutations on protein fitness and thermostability, supporting the development and validation of advanced predictive models such as PRIME. The comparison provides valuable insights into the relative performance of different modeling approaches and highlights the potential of PRIME for predicting protein mutations in a zero-shot setting. The results are illustrated in Fig. 2A and table S2. As can be seen, PRIME demonstrates better performance than all the other methods in predicting protein fitness and stability. In the ProteinGym benchmark, PRIME outperforms the second-best model, SaProt, registering a score of 0.486 against 0.457 (P = 1 × 10−4, Wilcoxon). In the ΔTm dataset, PRIME’s performance surpasses that of the next-best model, Stability Oracle, with scores of 0.437 and 0.412, respectively (P = 9 × 10−3, Wilcoxon). We also compared PRIME with other methods on the Stability dataset, i.e., ProteinGym-Stability, a subdataset of ProteinGym; PRIME still outperforms all of the other methods. It is crucial to note that the OGT used by PRIME is not a direct representation of protein Tm. Instead, a correlation exists between them (27); some enzymes from thermophiles turn out to be not very thermostable (40). However, even when leveraging the slightly imprecise OGT as a stand-in for protein sequences’ Tm attribute, PRIME markedly outshines models that do not incorporate OGT. For instance, the similar-architecture counterpart, ESM-2, achieves only 0.330 in the ΔTm dataset. We posit that PRIME’s performance would witness a significant boost with access to a vast dataset of accurate Tm values for natural proteins. These findings underscore PRIME’s potential in protein engineering endeavors, particularly in crafting protein sequences with enhanced thermostability and other fitness attributes. Across the board, PRIME outclasses both traditional computational strategies and other deep learning models, underscoring its effectiveness.
Recent efforts, such as SaProt, which integrates protein structural information into PLMs, show enhanced prediction capabilities on stability. However, SaProt and other structural models, including MIF-ST or Stability Oracle, require protein structure data as input, which inherently carries noise and is limited by the availability of high-precision structures either from wet-lab experimental resolution or predictions like those from AlphaFold. This makes their application somewhat restricted. PRIME, which only requires sequence input, has already outperformed the current leading model SaProt when using the latest complete version of ProteinGym (217 datasets) as a benchmark. As a purely sequence-based model, PRIME not only substantially improves prediction capabilities within stability datasets compared to other PLMs, such as ESM-2, but also achieves superior performance in nonstability datasets, particularly those involving activity, as shown in table S2.
In addition to the zero-shot assignment, we also tested the representational capacity and transferability of PRIME. Specifically, we conducted supervised fine-tuning on two temperature-related downstream tasks with global fine-tuning (Fig. 1B). As the pretraining of PRIME incorporates the optimal growth temperature of the bacteria in which the protein resides, it is anticipated that PRIME can also perform better in predicting other temperature-associated properties of proteins. As shown in Fig. 2 (B to E) (table S3), PRIME also outperforms other supervised methods in the tasks of predicting the melting temperature (Tm) of a native protein and its optimal enzymatic activity temperature (Topt). Considering the importance of Tm and Topt, PRIME’s ability to rapidly label a large volume of native protein sequences with thermal properties, using only sequence input, is of notable utility for native protein annotation in practical engineering applications.
Furthermore, we delved deeper into understanding the individual contributions of the three core modules within PRIME: the OGT prediction module, the MLM module, and the correlation term. Our findings, detailed in table S1, highlight that relying solely on either the OGT prediction or the MLM module leads to a dip in PRIME’s performance. Among these, the MLM module stands out as having the most pronounced effect across all zero-shot benchmarks. The OGT module plays a pivotal role in ΔTm prediction, with the standard PRIME achieving a score of 0.437, in contrast to PRIME/-OGT, which scores 0.362 (P = 3 × 10−2, Wilcoxon). Similarly, the correlation term significantly influences ΔTm prediction, with PRIME/-correlation registering a score of 0.429 (P = 4 × 10−2, Wilcoxon). In the context of the ProteinGym benchmark, both the OGT and correlation terms continue to exert a significant influence. This finding highlights the significance of combining the OGT prediction, MLM, and correlation modules in the PRIME model to achieve optimal performance. The synergistic effect of these three modules allows the model to better understand the complex relationships between protein sequences and their thermostability properties, ultimately resulting in improved predictive capabilities. The integration of all these modules in the PRIME model ensures a more comprehensive understanding of the protein sequence information, which in turn contributes to its superior performance compared to other state-of-the-art models.
Further, we assessed PRIME’s performance in other supervised protein engineering tasks. Specifically, in the Fitness Landscape Inference for Proteins (FLIP) benchmark (41), which consists of 12 tasks, PRIME leads in all of these tasks over ESM-1b, ESM-1v, ESM-2, and CARP, demonstrating its strong extrapolation capability, particularly in predicting high-complexity mutational effects (table S4). We note that, among the 12 tasks in FLIP, 2 of them (AAV and GB1) correspond to predicting the fitness of multisite deep mutants when the fitness of the constituent single-site mutations is known, which is crucial for identifying the final product in protein engineering. One plausible explanation for this capability is that during PRIME’s pretraining, there is an alignment between the token-level MLM and the sequence-level OGT attributes of mutant sequences. This alignment allows the model to learn the thermal properties of native sequences and the thermal stability ranking of mutant sequences. Because protein thermal stability, binding affinity, and other extremophilic tolerances follow similar physical principles reflecting structural stability, PRIME exhibits superior extrapolation capability in tasks related to native protein thermal stability (Meltome) and mutated protein binding affinity (AAV and GB1) within the FLIP benchmark. This is why PRIME demonstrates a stronger performance in these tasks compared to the ESM series. Moreover, in the Meltome (42) dataset task of FLIP, which involves predicting the Tm of human-derived proteins, PRIME, integrated with OGT information, consistently surpassed models with similar architectures such as ESM-2. This indicates that although PRIME’s pretraining only learned the OGT information of bacterial-derived protein sequences, it still excels in predicting the Tm of proteins from other species, demonstrating PRIME’s generalizable capabilities.
Wet-lab experimental testing of PRIME-designed single-site mutants of various proteins for different engineering purposes
In practical applications of protein engineering, the prevailing approach involves identifying positive single-site mutations that enhance the protein’s performance (making it more active or more stable) and then combining them, often through a greedy search, to form multisite mutants with the desired properties (25). Thus, the successful identification of these positive single-site mutations forms the cornerstone of successful protein engineering. To further substantiate the effectiveness and generality of our methodology, we tested the PRIME model on designing single-site mutants for five distinct proteins, namely, LbCas12a, T7 RNA polymerase, creatinase, nonnatural nucleic acid polymerase (Tgo-D4K), and the variable domain of the heavy chain of a nano-antibody against growth hormone (VHH). Briefly, we fine-tuned PRIME on a set of 30,000 homologous sequences for each target protein, sourced from the Uniclust30 database (43). This fine-tuning was executed with five distinct random seeds for each target protein. By averaging the prediction outcomes from these five models for single-site saturation mutations, we generated a single-site mutation score table for every protein. PRIME was then used to rank all single-site mutants within the landscape on the basis of the likelihood of the mutated sequences relative to their wild-type counterparts (refer to the mutated protein sequence scoring strategy). Subsequently, we selected the top 30 to 45 mutants lying outside the 6-Å range of the catalytic active sites or binding pockets for further experimentation. Considering that mutations within the catalytic active site or binding pockets could profoundly affect the protein’s function, direct mutation of the active site presents both risks and opportunities (44); in this study, we adopted a conservative approach aimed at averting potential drastic disruption of the protein’s catalytic capabilities. Notably, we were initially unsure about the specific effects of PRIME’s suggested mutations on the properties of the five proteins under study. However, each protein had distinct enhancement needs, either in stability or activity. For the five distinct proteins, the engineering objectives varied: for LbCas12a, T7 RNA polymerase, and creatinase, the goal was to enhance thermostability; for the nonnatural nucleic acid polymerase, the target was to accelerate the polymerization rate of nontraditional nucleic acids, specifically 2′-fluoroarabino nucleic acid (FANA); and for VHH, the objective was to increase stability in highly alkaline conditions (pH > 13). Comprehensive outcomes of these experiments are elaborated in the subsequent sections.
PRIME can be used to rank single mutants on the basis of both activity and stability. However, from the ablation study of PRIME, we found that the zero-shot performance with only the OGT module (PRIME/-MLM) is quite poor on both the ProteinGym benchmark and ΔTm. Therefore, we do not use the OGT module to select single-site mutations for stability. Instead, we suggest using the large language model (LLM) likelihood of PRIME, which is obtained when OGT prediction is used as an additional pretraining task. Drawing on the previous research experience of biologists (7, 44), one can choose mutations located on the surface of the protein to improve protein stability and mutate amino acids around the protein pocket to enhance protein catalytic activity.
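A minimal sketch of the selection procedure described above is shown below, assuming per-seed score tables of shape (L, 20), Cα coordinates for the 6-Å distance filter, and a list of catalytic/binding-pocket residue indices; these inputs and names are illustrative rather than the exact pipeline used in the study.

import numpy as np

def select_single_mutants(score_tables, ca_coords, active_site_idx, top_k=45, min_dist=6.0):
    """Rank single-site mutants by the mean score over models fine-tuned with
    different random seeds, excluding positions within min_dist angstroms of
    catalytic or binding-pocket residues.

    score_tables    : list of (L, 20) arrays, one score table per seed
    ca_coords       : (L, 3) C-alpha coordinates of the wild-type structure
    active_site_idx : indices of catalytic / binding-pocket residues
    """
    mean_scores = np.mean(score_tables, axis=0)                      # (L, 20)
    dists = np.linalg.norm(
        ca_coords[:, None, :] - ca_coords[None, active_site_idx, :], axis=-1
    )                                                                # (L, n_active)
    allowed = dists.min(axis=1) > min_dist                           # keep distal positions only
    candidates = [
        (pos, aa, mean_scores[pos, aa])
        for pos in range(mean_scores.shape[0]) if allowed[pos]
        for aa in range(20)
    ]
    candidates.sort(key=lambda c: c[2], reverse=True)                # highest score first
    return candidates[:top_k]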
LbCas12a
It is well known that engineering proteins with multiple functions is challenging because of the trade-offs between different protein properties (45). Moreover, these multidomain proteins often have substantial conformational differences between their functional states and their crystal structures, which poses a substantial challenge to traditional rational design methods that rely on structure. Thus, we sought to challenge our model with a large, multidomain protein whose activity requires cross-talk between multiple functional domains, and we set out to engineer the Tm of Lachnospiraceae bacterium Cas12a (LbCas12a). Cas12a is an RNA-guided endonuclease belonging to the type V-A CRISPR-Cas system (46). LbCas12a contains 1228 amino acids with multiple functional domains (Fig. 3A). During the catalytic process, CRISPR RNA (crRNA) guides Cas12a to bind to and cleave double-stranded DNA substrates. Upon target DNA recognition, the recognition domain lobe of Cas12a undergoes conformational changes to unleash its trans-activity to cleave nonspecific single-stranded DNA (47). This feature makes LbCas12a particularly useful in in vitro diagnostic applications (48). We used PRIME to perform a round of single-site mutation prediction and tested 30 single-site mutations, of which 9 had a Tm not lower than that of the wild type (V936F, I976L, S962K, M957L, M456I, L59K, Y549K, G49K, and C1090D) (Fig. 3B).
T7 RNA polymerase
T7 RNA polymerase is a monomeric enzyme derived from T7 bacteriophage, comprising a total of 883 amino acids. Since its initial utilization in RNA synthesis in the early 1980s, T7 RNA polymerase has become a crucial tool in the fields of molecular biology and genetic engineering (49). It is now commonly used in various applications such as in vitro transcription (IVT) experiments, mRNA vaccine production (50), isothermal amplification detection techniques (51, 52), etc. However, T7 RNA polymerase also presents some application drawbacks. For example, it produces immunostimulatory by-products, such as double-stranded RNA, during the transcription process (53), which necessitates complex purification processes for mRNA vaccine production. Recent studies have indicated that increasing the reaction temperature to above 48°C effectively reduces the by-products (54). Nevertheless, the wild-type T7 RNA polymerase unfolds at temperatures around 45°C, resulting in decreased enzymatic activity and an inability to transcribe the desired target products at higher temperatures. Therefore, there is a critical need to enhance the thermal stability of T7 RNA polymerase.
In this study, we used PRIME to predict mutation sites in T7 RNA polymerase and directly selected the top 45 single-site mutants for subsequent experimental verification. As depicted in Fig. 3D, our experimental results indicate that 57.7% (26 of 45) of the mutants have a Tm value higher than the wild type (Y846R, C125A, H772K, S606V, V687E, F481W, C216L, N601E, A881F, P657K, S633P, K642G, M306K, A703T, P476E, A468F, T375K, I217L, R792M, S397W, P865L, C515P, W797L, S430P, L446F, and Q786L).
Creatinase
Creatinase, a dimeric proteinase, is widely used in enzymatic assays for measuring creatinine levels (55, 56). It is primarily derived from microorganisms such as Pseudomonas, Bacillus, and Alcaligenes. Creatinase is crucial in medical diagnostics and plays a role in quantifying creatinine in serum and urine (57). Elevated creatinine levels indicate impaired kidney or muscle function. Nevertheless, the optimal catalytic temperature for creatinase typically falls within the range of 30° to 40°C, which constrains both industrial and clinical diagnostic applications. Enhancing the thermal stability of creatinase not only improves the efficiency of clinical creatinine detection but also facilitates enzyme production, storage, and transportation.
Here, we used the PRIME model to predict single-site mutations in creatinase obtained from Alcaligenes faecalis (58). In the end, 28 single-site mutants were selected for experimental validation. As depicted in Fig. 3F, 32% (9 of 28) of the mutants exhibited improved thermal stability (Q151V, H193Y, V283L, A180K, Y310L, E170T, S19L, H74Q, and D17V).
Nonnatural nucleic acid polymerase
Tgo is a DNA polymerase identified in the hyperthermophilic archaeon Thermococcus gorgonarius, which was isolated from a geothermal vent in New Zealand (59). Tgo has been found to accurately replicate FANA, a genetic polymer in which the deoxyribose of DNA is replaced by 2′-fluoroarabinose (60, 61). However, Tgo DNA polymerase can only catalyze the synthesis of FANA on a DNA template at a rate of ~15 nt/min (61), which is much lower than the rate of Tgo for DNA synthesis (~400 nt/min) (62), limiting the application of FANA as a substitute for DNA in information storage (63), disease treatment (64, 65), and other fields. The evolution of a xeno nucleic acid (XNA) polymerase necessitates a comprehensive evaluation of not only binding affinity but also catalytic activity and processivity. This is due to the unique chemical and biophysical properties of XNA, which differ from those of DNA and RNA, making prediction by traditional in silico methods challenging. Furthermore, the distinct sugar pucker of XNAs may result in conformations of XNA that differ from those of DNA, RNA, and base-modified nucleic acids, thereby influencing the polymerase’s recognition of XNA. Consequently, the in silico prediction and directed evolution of XNA polymerases remain formidable challenges. To date, the evolution of XNA polymerases has relied solely on random mutation methods in vitro, such as the compartmentalized self-tagging method. Pinheiro et al. (66) constructed a high-throughput mutation library and conducted at least two rounds of screening to identify the currently fastest FANA polymerase, Tgo-D4K (TgoT: L403P, P657T, E658Q, K659H, Y663H, E664K, D669A, K671N, and T676I). This polymerase extends FANA on a DNA template at a rate of ~80 ± 27 nt/min, while its rate of DNA extension on a DNA template is reduced to 16 ± 3 nt/min (62). However, the synthesis rate of Tgo-D4K for FANA is still lower than that of Tgo for DNA synthesis. Therefore, methods are required to modify existing polymerases to screen for polymerases with higher FANA synthesis rates.
In the present study, we commenced our investigation with Tgo-D4K as the starting point. Using PRIME, we systematically screened potential mutation sites across various domains of Tgo-D4K. Ultimately, we selected 27 promising mutations for subsequent experimental validation. The polymerase kinetic profiling (PKPro) strategy was used to measure the FANA synthesis rate of the mutants as previously described (62). The experimental results (Fig. 3H) showed that more than 40% (12 of 27) of the mutants had a higher FANA synthesis rate (P716G, R460E, I528A, H659E, K465E, A546V, I471E, D29V, Y481G, T55L, A217P, and I693W); notably, the single-site mutation I693W increased the extension rate to ~3.2-fold that of the Tgo-D4K enzyme.
VHH
The VHH antibody is the antigen-binding fragment of heavy-chain-only antibodies (67). Because of its small size, monomeric state, robust structure, and easy tailoring, VHH has become an important tool in medical research and clinical antibody drug development (68), and VHHs have been developed as affinity ligands to selectively purify biopharmaceuticals, for example, prothrombin, tetrabromobisphenol A, and intercellular adhesion molecule 1 (69–71). In the practical production of biological products, the most widely used clean-in-place method is cleaning with 0.5 M NaOH for 24 hours. Hence, VHH antibodies used for biopharmaceutical purification need mutational engineering to tolerate this harsh alkaline condition, which is rarely encountered in nature (72, 73).
In this study, we used our PRIME model to predict mutation sites for a VHH antibody against growth hormone that we selected from an immunized camelid. The top 29 mutants were chosen for further testing; 11 of 29 (~38%) mutants showed enhanced stability after incubation in 0.3 M NaOH for 24 hours, as shown in Fig. 3J (A57D, P29T, A15P, V113D, P117Q, R20T, R110E, T58K, D114Y, W112F, and L12K). Among these, the A57D mutation displayed a remarkable 12-fold enhancement in alkali tolerance. In addition, ~31% (9 of 29) of the mutants showed increased affinity for the antigen before the alkaline treatment (P29T, A15P, A57D, P117Q, Q83D, R20T, T119V, L12K, and V113D).
Benchmark of different strategies for selecting single-site mutations
To evaluate the efficiency of PRIME and our single-site mutation selection strategy, we incorporated a benchmark comparison of different strategies for selecting single-site mutations. We conducted comparisons both in silico and through wet-lab experiments. From the ProteinGym dataset, we used a subset with saturated single-site mutation data (comprising five datasets with wild-type sequence identity to the PRIME pretraining dataset below 30%) for this analysis. We compared the top 15 single-site mutations selected by four different strategies: (i) the strategy used in this paper, using homologous sequences of the target protein to fine-tune the PRIME model; (ii) fine-tuning ESM-2 on the same homologous sequences; (iii) the ESM vote strategy from (74); and (iv) random single mutations. Our single-site selection strategy consistently outperformed the other methods across three evaluation metrics: the number of positive single-site mutations, the maximum fitness, and the median fitness of the mutants. The specific results are presented in table S5. Furthermore, we compared the performance of the top 15 single-site mutations selected by the different methods through wet-lab experiments. We limited our comparison to two proteins: T7 RNA polymerase and the nonnatural nucleic acid polymerase Tgo-D4K. We validated the top 15 single-site mutations selected by our strategy, the ESM vote strategy, the strategy of fine-tuning ESM-2 on homologous sequences, and the strategy of scoring saturated single-site mutations with Rosetta, which ranks mutations on the basis of predicted unfolding free energy; the energy function used to calculate this unfolding free energy includes all energy terms referenced in the literature (75). The results, shown in Fig. 4 (A and B) (detailed in table S6), demonstrate that the single-site mutations selected by our strategy comprehensively outperform those selected by the other strategies.
Enhanced multisite mutagenesis through PRIME-driven protein engineering
Traditional protein engineering and directed evolution techniques often use an incremental approach, reminiscent of greedy algorithms, accumulating mutations from single-site mutants to construct multisite variants. Such a strategy, while prevalent, is prone to pitfalls, notably converging to local optima. Specifically, the most effective multisite mutant does not always emerge from the aggregation of the top-performing single-site mutants. Harnessing the capabilities of PRIME, we unveil an advanced multisite mutation stacking strategy. This purely data-driven method evaluates the entire landscape of 2^N potential mutants (where N represents the count of single-site mutations available for combination), bypassing the pitfalls of traditional directed evolution that might settle for local optima through incremental mutations. Our strategy simplifies the prediction of high-performing multisite mutations, reducing the need for extensive experimental iterations, as depicted in Fig. 1D. Our methodology includes a zero-shot prediction pipeline based on homologous sequences, with PRIME fine-tuned for specific proteins. Previous studies have indicated that PLMs fine-tuned on homologous sequences can achieve substantially better performance in low-N scenarios (76). A comparative analysis with the ftMLDE method from (77), using simulated directed evolution on the GB1 dataset, demonstrates that our PRIME-based workflow more effectively identifies multisite mutants with enhanced maximum fitness (Fig. 4C) or mean fitness (Fig. 4D). We examined both the top-K sampling used by PRIME and the tiered sampling used by ftMLDE (77). Our findings indicate that the iterative multipoint mutation strategy based on PRIME outperforms ftMLDE in terms of both maximum and average fitness across multiple rounds of iteration; the results of top-K sampling were comparable to tiered sampling, with top-K sampling showing a slight advantage (detailed results are shown in table S7). This in silico–directed evolution was conducted 100 times, with each iteration comprising two rounds. In each round, the top 50 mutants or tiered-50 samples identified from the preceding round were used as the training dataset for the following round, using a multilayer perceptron (MLP) layer as the regression model for ESM-2 and PRIME to score all remaining mutants. For the implementation of ftMLDE, we executed the code as described in (77) and used MSA-transformer as the variant encoding model, with a regression module that is an ensemble of ARDRegression, BaggingRegressor, and KNeighborsRegressor.
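The sketch below illustrates one round of this top-K iteration under simplifying assumptions: embed_fn stands in for a frozen PRIME/ESM-2 embedding of a mutant sequence, and a small MLP head is refit from scratch each round; the actual training schedule and hyperparameters of the study may differ.

import torch
import torch.nn as nn

def directed_evolution_round(embed_fn, labeled, unlabeled, top_k=50, epochs=200):
    """One round of top-K iteration: fit a small MLP regression head on the
    mutants measured so far, score every remaining candidate in the
    combinatorial landscape, and return the top-K candidates for the next
    round of wet-lab testing.

    embed_fn  : maps a mutant sequence to a fixed-size embedding tensor
    labeled   : list of (sequence, measured_fitness) pairs from earlier rounds
    unlabeled : list of candidate multisite mutant sequences
    """
    X = torch.stack([embed_fn(seq) for seq, _ in labeled])
    y = torch.tensor([fit for _, fit in labeled], dtype=torch.float32)
    head = nn.Sequential(nn.Linear(X.size(-1), 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):                      # fit the regression head
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    with torch.no_grad():                        # score the remaining landscape
        scores = torch.stack([head(embed_fn(seq)).squeeze() for seq in unlabeled])
    order = torch.argsort(scores, descending=True)
    return [unlabeled[int(i)] for i in order[:top_k]]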
LbCas12a
In the case of LbCas12a, we trained the PRIME model on all 30 single-site mutation data points and predicted the Tm of multisite mutation combinations. The top 10 scored mutants were then selected from each of the two- to four-site mutation pools for the second round of experimental validation. In the third round of stability evolution, the 30 multisite mutants were added to the training set to further fine-tune PRIME. We then selected the top 5 mutants from each of the 3-, 4-, 5-, and 6-site mutant collections, together with the top 10 mutants overall from the 7- to 10-site mutant collections, for experimental characterization.
As shown in Fig. 5C, 17 of 30 multisite mutants in the second round exhibited a higher Tm than the wild type. Furthermore, all 30 multisite mutants in the third round had a higher Tm than the wild type. The best mutant was an eight-site mutant (R2-26) with a Tm of 48.15°C, which is 6.25°C higher than that of the wild type (details can be found in the Supplementary Materials).
Among the Tm-enhancing multisite mutants in the second round, many of the mutants recommended by PRIME contain negative single-site mutations (which individually decrease Tm). For example, C10L in the R1-3 (C10L; S962K) mutant has a Tm lower than the wild type, yet it participates in forming a double-site mutant with a Tm higher than either single-site mutant. Moreover, the three-site mutant R1-15 (C10L; S962K; I976L), formed by adding the C10L mutation to the R1-9 (S962K; I976L) double-site mutant, also has a Tm higher than the parent double-site mutant. Furthermore, the positive multisite mutants containing C10L combine mutations from different functional domains. As shown in Fig. 5A, for instance, C10L is in the WED-I domain of Cas12a, while S962 and I976 are in the RuvC-II domain. This demonstrates the remarkable generalization ability of PRIME, which has learned the epistatic effects between mutations from different domains using only sequence information and can combine negative single-site mutations into excellent multisite mutants. This is challenging to achieve with traditional directed evolution methods, which use an incremental, greedy-algorithm-like approach that accumulates mutations from single-site positive mutants to construct multisite variants and is therefore unlikely to directly combine negative single-site mutations into multisite mutants.
T7 RNA polymerase
Taking T7 RNA polymerase as another example, we built on the foundation of previously identified single-site mutations and used the PRIME model, fine-tuned with homologous sequences of T7 RNA polymerase, to perform supervised regression prediction tasks (details can be found in Methods).
We used the Tm data from all single-site mutations in the first round, including the wild-type protein, as the training set and then used the trained PRIME models to predict the multisite mutation sequences. Subsequently, from the sequences with two, three, or four mutation sites, we selected the 5 sequences with the highest predicted Tm from each group, plus the top 10 sequences with eight mutation sites, resulting in a total of 25 multisite mutants for the second round of wet-lab validation. As shown in Fig. 5D, after two rounds of mutagenesis, all 25 multisite mutants exhibited a Tm higher than that of the wild type. The standout mutant had eight mutation sites, R1-21 (Q786L; S430P; W797L; P657K; N601E; L446F; P476E; T375K), with its Tm being 7.4°C higher than that of the wild type.
However, compared to the commercial thermostable T7 RNA polymerase (Hi-T7, Tm = 56.8°C) available from New England Biolabs, our eight-site mutant still has a Tm that is 4°C lower. To acquire a mutant with a higher Tm, we further tested 10 additional single-site mutants from the first round of zero-shot prediction by PRIME, as shown in Fig. 5D. We then combined the data from all single-site mutants and the previous multisite mutants, a total of 80 mutants, to train the PRIME model. Subsequently, we used the trained PRIME model to directly predict the Tm of multisite mutants, ranging from 9 to 14 mutation sites, formed from these single-site mutations. The top 15 multisite mutants predicted by PRIME were then selected for the following round of wet-lab testing. Five of the 15 deep mutants showed unambiguously higher Tm than Hi-T7, with the best mutant carrying 12 mutation sites (Q786L; S430P; L446F; S606V; K642G; S633P; I217L; S397W; L534V; A124N; G618E; L665D) and a Tm 12.8°C higher than that of the wild type. Notably, the enzymatic activity of the five most thermostable mutants was also higher than that of the wild type, as illustrated in Fig. 5D.
Furthermore, we found that this 12-site mutant contains several negative single-site mutations, such as A124N, G618E, and L665D. When applied to the wild type individually, these mutations lead to a decrease in Tm, as shown in Fig. 5D. The amalgamation of negative mutations poses a formidable challenge, as these mutation sites are often preemptively excluded from further combination into deep mutants in conventional protein engineering because of the paucity of domain knowledge on their effective utilization. Given that negative mutations are far more common than positive ones, our finding that protein LLMs can use them as ingredients to form better deep mutants could be exciting to the protein engineering community, both for further mechanistic study and for industrial applications.
Without any prior experimental data or high-throughput screening technique, after three rounds of mutagenesis and wet-lab validation of 95 mutants, we successfully obtained a T7 RNA polymerase variant with up to 12-site mutations that surpasses the commercial enzyme. This achievement not only attests to the precision and efficiency of PRIME’s single-site prediction and multisite stacking but also highlights its potential in notably reducing the financial overheads associated with wet experiments. This accomplishment remains elusive in the realm of traditional protein engineering and rational design.
DISCUSSION
We present PRIME, an advanced deep learning approach that leverages an extensive dataset of sequence-host bacterial strain OGTs. By tailoring an MLM for OGT prediction, PRIME captures the semantic, grammatical, and temperature-related nuances of protein sequences. Rigorous in silico evaluations consistently underscore PRIME’s preeminence over other leading models, including ESM-1v, ESM-2, MSA-transformer, Tranception-EVE, CARP, MIF-ST, SaProt, Stability Oracle, GEMME, and Rosetta, in predicting thermostability and the overall fitness of protein mutants. Through PRIME, we have engineered single-site mutants of five proteins, achieving substantial enhancements in their physicochemical attributes, with a success rate of over 30% among the 30 to 45 AI-conceptualized mutants for each protein. This highlights PRIME’s transformative potential in the realm of protein engineering.
Historically, protein engineering strategies have pivoted around either directed evolution or rational design. The former, while effective, hinges on high-throughput experimental screenings, making it resource intensive in terms of both time and capital. For numerous pivotal proteins, the practicality of high-throughput experimental methodologies is debatable, rendering low-throughput assays a more viable alternative. Conversely, rational design demands an in-depth comprehension of the biophysical attributes pertinent to the target protein’s operational mechanism. With a profound understanding of this mechanism, rational design can occasionally identify high-performing mutants with limited wet experiments. Yet, for many proteins with limited mechanistic insights or for modifications of unconventional activities, such as the polymerization activity toward nonnatural nucleic acids highlighted in our study, rational design often encounters limitations. In these scenarios, AI-centric predictions, epitomized by PRIME, stand out. Without necessitating extensive wet experimental data or a deep understanding of the protein’s modus operandi, PRIME offers invaluable insights, streamlining the protein engineering trajectory.
Traditional protein engineering often adopts a strategy akin to greedy algorithms, incrementally accumulating mutations from single-site to multisite mutants. While effective, this process can be labor intensive and time consuming. Moreover, it occasionally results in suboptimal outcomes, as the optimal multisite mutant does not necessarily comprise the most beneficial single-site mutants. Our model, PRIME, introduces a paradigm shift in this field. It offers a refined strategy for multisite mutation accumulation, overcoming the limitations of conventional tactics and expediting the creation of superior multisite protein mutants. PRIME can automatically group negative single-site mutations into a deep mutant, notably enhancing its fitness. This finding could be pivotal, opening a pathway for protein engineers. They can now use negative mutations, which are more prevalent than positive ones and were previously excluded in traditional design, to enhance the fitness of proteins. By reducing the reliance on exhaustive experimental screenings, computational tools like PRIME could revolutionize the protein engineering landscape, potentially expanding the range of proteins amenable to skilled engineering. This approach holds promise for a wide array of applications in pharmaceutical and industrial sectors.
Furthermore, PRIME’s versatile modeling framework holds promise for diverse predictive tasks, such as deducing the melting temperature (Tm) or the optimal enzymatic activity temperature (Topt) of native proteins. PRIME streamlines the prerequisites for protein modification, facilitating enhancements in protein stability and activity while eliminating the need for comprehensive mechanistic probes. In addition, PRIME’s multitask learning modality, which aligns OGT with MLM, considerably boosts the model’s predictive accuracy on temperature-associated downstream tasks when compared with other training techniques. Moreover, this does not compromise its predictive efficacy on tasks unrelated to temperature. This suggests that while enhancing the model’s predictive capability for specific tasks, this pretraining method also maintains the model’s generalization capability on other, unrelated tasks. This pretraining approach could pave the way for a fresh learning paradigm, embedding specialized domain insights into foundational AI frameworks, and could be instrumental in bridging the gap between deep learning and conventional scientific wisdom. PRIME’s predictive prowess extends to pinpointing mutation sites that bolster protein properties, even those seldom observed in nature. Instances include fortifying antibody resilience in extreme alkaline environments or amplifying a polymerase’s polymerization velocity on non-native nucleic acids, underscoring PRIME’s broad applicability in protein engineering.
METHODS
Details of PRIME architecture
PRIME consists of a common Transformer-based encoder and two different components: one for performing MLM pretraining and another for pretraining OGT prediction. In this section, we first introduce the common Transformer encoder, followed by a detailed description of the MLM module and the OGT prediction module.
Transformer encoder
For the Transformer encoder, we use the same architecture as ESM-2, a widely used Transformer-based pretrained language model. Compared to the standard Transformer architecture, it replaces the absolute position embedding with rotary position embedding, uses pre-layer normalization as in RoBERTa, and uses a GELU activation rather than ReLU. We also use FlashAttention to accelerate training and inference. The code can be found in our code repository. Conceptually, the Transformer encoder acts as a parameterized transformation function, converting a protein sequence into a sequence of dense vectors
(h1, h2, …, hL) = Encoder(x1, x2, …, xL)
Here, L is the length of the protein sequence, (x1, x2, …, xL) represents the discrete one-hot encoded amino acids of the protein sequence, and the continuous vectors (h1, h2, …, hL) are the outputs of the Transformer encoder, representing the protein sequence in latent space.
MLM module
This module has the same architecture as in ESM-2. It acts as a reverse function of the Transformer encoder, mapping a sequence of hidden vectors into the one-hot encodings of protein sequences. During pretraining, the MLM module is trained to recover the noised protein sequence. The noised sequence is generated heuristically from the original sequence by randomly masking 20% of the tokens in a protein sequence. Of these masked tokens, 70% are replaced with a special <mask> token, accounting for 14% of the entire sequence. In addition, 20% of the masked tokens, or 4% of the entire sequence, are substituted with other amino acids. These substitutions are based on their natural occurrence frequencies in the UniProtKB database, ensuring that more common amino acids have a higher likelihood of being chosen. The objective of this task is to let the encoder understand the relationships between tokens and to learn the contextual information necessary for understanding the primary structure of protein sequences. It has been shown that the probability distribution generated by the model for a given masked position in a protein sequence, over all possible amino acids, is positively correlated with the mutant score (78). The mutant score is a measure of how likely it is that a given amino acid substitution at that position will result in a functional change in the protein. The fact that the probability distribution generated by the model is correlated with the mutant score indicates that the model has learned to capture important features of protein sequences, such as the effects of amino acid substitutions on protein function. Formally, the MLM module is a point-wise parameterized function, converting a sequence of dense vectors into a sequence of probability distributions over the protein vocabulary
pj = MLM(hj)
where j denotes the noised position and hj is its latent representation, while pj ∈ R20 is the probability distribution (20 is the vocab size).
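A minimal PyTorch sketch of this noising procedure follows; it assumes amino acid token ids run from 0 to 19 and, following standard BERT practice, leaves the remaining 10% of selected tokens unchanged (an assumption, since the text above specifies only the 70% and 20% fractions).

import torch

def noise_sequence(tokens, mask_id, aa_freqs, mask_frac=0.20):
    """BERT-style noising for the MLM objective.

    20% of positions are selected; of these, 70% become <mask> and 20% are
    replaced by an amino acid drawn from its natural UniProtKB frequency
    (the remaining 10% are left unchanged here). Amino acid token ids are
    assumed to be 0-19. Returns the noised tokens and the targets
    (-100 at positions excluded from the loss).
    """
    tokens = tokens.clone()
    targets = torch.full_like(tokens, -100)
    selected = torch.rand(tokens.shape) < mask_frac          # ~20% of positions
    targets[selected] = tokens[selected]

    roll = torch.rand(tokens.shape)
    to_mask = selected & (roll < 0.70)                       # ~14% of the sequence
    to_swap = selected & (roll >= 0.70) & (roll < 0.90)      # ~4% of the sequence
    tokens[to_mask] = mask_id
    n_swap = int(to_swap.sum())
    if n_swap:
        tokens[to_swap] = torch.multinomial(aa_freqs, n_swap, replacement=True)
    return tokens, targets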
OGT prediction module
The original MLM for natural languages is actually jointly trained with an additional supervised task that learns to decide whether two given sentences follow each other or not. However, this supervised part is usually ignored in protein-based models. To fill this gap, we added a supervised module to our model that learns to predict the OGT of the organism to which a protein belongs. This module contains an attention-based pooling layer, two MLP layers, and a residual connection. The attention pooling layer takes the latent representations of the protein sequence (h1, h2, …, hN) as input, uses a projection-softmax layer to compute the attention weights, and produces a weighted vector c
c = Attention(h1, h2, …, hN) = Σi αi hi,  with αi = softmax(W hi + b)
where W and b are the learnable parameters of the attention pooling layer.
Then, an MLP layer with two fully connected layers and GELU activation is used to transform the weighted vector c. The first fully connected layer maps c to the same dimension as the feed-forward network layer of the Transformer, which in our implementation is four times the size of the hidden layer. The second fully connected layer maps the output of the first layer back to the original dimension. Between the first and second fully connected layers, there is a GELU activation function. In addition, there is a residual connection between the output of the second fully connected layer and the output of the attention layer
r = c + FC2(g(FC1(c)))
where FC1 and FC2 are learnable fully connected layers, and g is the GELU activation function. In particular, the output vector r can be viewed as a representation feature of the whole sequence, which can be used in transfer learning for downstream tasks.
Last, another MLP layer with two fully connected layers and a tanh activation function is used to map the sequence representation r to the OGT of the protein sequence
ŷOGT = FC4(tanh(FC3(r)))

where FC3 and FC4 are trainable fully connected layers and ŷOGT is the predicted OGT. We use the MSE criterion as the loss function.
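The following is a minimal PyTorch sketch of the OGT prediction module as described above: projection-softmax attention pooling, a two-layer MLP with GELU and a residual connection, and a two-layer regression head with tanh. Layer names and the exact parameterization of the pooling are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class OGTHead(nn.Module):
    """Attention pooling + MLP (GELU, residual) + regression head, as sketched in the text."""

    def __init__(self, d_model=1280, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model                 # feed-forward size = 4x hidden size
        self.attn_proj = nn.Linear(d_model, 1)     # projection-softmax pooling (W, b)
        self.fc1 = nn.Linear(d_model, d_ff)        # FC1
        self.fc2 = nn.Linear(d_ff, d_model)        # FC2
        self.fc3 = nn.Linear(d_model, d_model)     # FC3
        self.fc4 = nn.Linear(d_model, 1)           # FC4 -> scalar OGT
        self.gelu = nn.GELU()

    def forward(self, h):                          # h: (B, N, d_model) encoder outputs
        alpha = torch.softmax(self.attn_proj(h), dim=1)   # (B, N, 1) attention weights
        c = (alpha * h).sum(dim=1)                        # weighted sequence vector
        r = self.fc2(self.gelu(self.fc1(c))) + c          # MLP with residual connection
        ogt = self.fc4(torch.tanh(self.fc3(r)))           # regression to OGT
        return ogt.squeeze(-1), r                         # r is reusable for transfer learning
```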
Zero-shot prediction of the effects of single-point mutations
According to (18, 21), PLMs, which are trained using the MLM objective, are capable of predicting the likelihood of an amino acid occurring at a specific position in a protein based on the surrounding context. This prediction ability can be used to evaluate sequence mutant effects. Figure 1C shows how to predict the mutant effect using the MLM module. For a given mutation, the amino acid in the wild-type protein serves as a reference state. The effect of the mutation is ascertained by comparing the predicted probability of the mutated amino acid against that of the original (wild-type) amino acid. Formally, the effect of the mutation is quantified through the log-odds ratio at the mutated position, as
Score(i, m ∣ w) = log P(xi = m ∣ X−i) − log P(xi = w ∣ X−i)

where Score(i, m ∣ w) represents the score of the single-point mutant in which the ith wild-type amino acid w has been mutated to mutant type m, and X−i denotes the wild-type sequence X = (x1, …, xL) with the amino acid at position i masked; xi indicates the amino acid at position i, and L is the sequence length. Note that this process can also be applied to multipoint mutants, where the fitness of a multisite mutation is taken as the sum of the fitness values of its constituent single-site mutations. This method is used to evaluate multipoint mutants in the ProteinGym benchmark.
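A sketch of this masked log-odds scoring is shown below, assuming a model that returns per-position logits over the 20 amino acids; the function names, the tokenization, and the masking of the mutated position are illustrative assumptions rather than the released code.

```python
import torch

@torch.no_grad()
def score_mutation(model, tokens, pos, wt_idx, mut_idx, mask_idx):
    """Log-odds score of mutating position `pos` from wt_idx to mut_idx.

    tokens : LongTensor (1, L) wild-type sequence; the model is assumed to
             return logits of shape (1, L, 20).
    """
    masked = tokens.clone()
    masked[0, pos] = mask_idx                          # mask the mutated position
    logits = model(masked)                             # (1, L, 20)
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)
    return (log_probs[mut_idx] - log_probs[wt_idx]).item()

def score_multi(model, tokens, mutations, mask_idx):
    # Multi-point mutants: sum of the single-site scores (as used for ProteinGym).
    return sum(score_mutation(model, tokens, p, w, m, mask_idx) for p, w, m in mutations)
```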
Training details
Pretraining
As shown in Fig. 1A, PRIME incorporates three distinct loss functions as optimization objectives during pretraining: the MLM loss, the OGT prediction loss, and the correlation loss. Below, we provide detailed formulas for these three functions. The training and validation curves during the pretraining process are depicted in fig. S1.
MLM loss
To compute the MLM loss, we use the cross-entropy loss. For each masked amino acid in a protein sequence, the model computes the probability distribution over its vocabulary (20 naturally occurring amino acids) and compares it to the actual amino acid (AA) distribution (represented as a one-hot vector). The loss is the negative log-likelihood of the correct amino acid
LMLM = −∑j∈M log pj(xj)

where M is the set of masked positions and pj(xj) represents the predicted probability of the true amino acid xj at masked position j.
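In PyTorch this amounts to a cross-entropy restricted to the corrupted positions; the sketch below assumes logits of shape (batch, length, 20) and uses the standard ignore_index mechanism, which is an implementation choice rather than the released code.

```python
import torch.nn.functional as F

def mlm_loss(logits, targets, masked):
    """Cross-entropy over masked positions only.

    logits  : (B, L, 20) predicted distributions over amino acids
    targets : (B, L) true amino-acid indices
    masked  : (B, L) boolean mask of corrupted positions
    """
    labels = targets.masked_fill(~masked, -100)          # ignore unmasked positions
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```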
OGT prediction loss
This loss function is used to quantify the difference between the predicted OGT and the actual OGT. We use the MSE as the loss function, which can be expressed as follows
LOGT = (1/N) ∑i (ŷi − yi)²

Here, N represents the number of training samples, ŷi signifies the predicted OGT, and yi represents the true OGT.
Correlation loss
This loss function aligns the mutation scores generated by the MLM module with the predicted OGTs of the corresponding mutants. Given a protein sequence, we randomly generate N single-point mutants M = (M1, M2, …, MN). Using the MLM module, we obtain MLM scores S = (S1, S2, …, SN) for these N mutants. In addition, the OGT module is used to predict the temperatures T = (T1, T2, …, TN) of these mutants. The Pearson correlation coefficient is then used to align T and S, with the specific formula given by
ρ(S, T) = cov(S, T) / (σS σT)

where cov(S, T) represents the covariance between S and T, and σS and σT are the SDs of S and T, respectively.
The final loss is the sum of the three losses above. We observed that the OGT prediction loss has a markedly different magnitude from the other two losses, with values ranging from 0 to 1000 initially and stabilizing between 0 and 100 later in training. To maintain numerical stability, we multiplied this loss by 0.01.
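A compact sketch of the correlation term and the combined objective is given below. Using the negative Pearson coefficient as the correlation loss is an assumption about the sign convention (the text only states that the scores and predicted OGTs are aligned); the 0.01 weight on the OGT term follows the description above.

```python
import torch

def pearson(s, t, eps=1e-8):
    """Pearson correlation coefficient between two 1D tensors."""
    s, t = s - s.mean(), t - t.mean()
    return (s * t).mean() / (s.std(unbiased=False) * t.std(unbiased=False) + eps)

def total_loss(mlm_loss, ogt_loss, scores, pred_ogts):
    # Correlation term: encourage MLM mutant scores to align with predicted OGTs.
    # Taking the negative coefficient as the loss is an assumed sign convention.
    corr_loss = -pearson(scores, pred_ogts)
    # The OGT term is down-weighted by 0.01 for numerical stability, as described above.
    return mlm_loss + 0.01 * ogt_loss + corr_loss
```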
Implementation details
We used PyTorch to implement PRIME. The Transformer encoder is composed of 33 layers and 20 attention heads, with 650 million parameters and an embedding size of 1280. The learning rate was set to 1 × 10−4, the micro-batch size per GPU was 4096 tokens, and the number of gradient accumulation steps was 32. The models were trained for 200k update steps on 8 × A100 80G GPUs. After pretraining, the root mean square error of the OGT prediction task was 3.5 on the 50,000-sequence held-out validation set, and the perplexity of the MLM reached 3.52. The average correlation loss during pretraining reached 0.1623. We initialized all layers of the Transformer encoder and the MLM module from (19).
Alternating training
Because the three loss functions have disparate input requirements (the MLM loss operates on noised protein sequences, the OGT loss on complete sequences, and the correlation loss on N random single-point mutants of a sequence), we use an alternating training strategy to optimize these distinct objectives. Specifically, we use mini-batch gradient descent with the Adam optimizer, alternating tasks with each mini-batch iteration. The training regimen is delineated in Python- and PyTorch-style pseudocode in table S8. After training, we compared the predicted and actual OGT, as shown in fig. S2.
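The full pseudocode is provided in table S8; the condensed sketch below only illustrates the round-robin idea, with a hypothetical per-task loss interface (model.compute_loss) and dataloaders standing in for the actual training code.

```python
import itertools
import torch

def alternating_train(model, mlm_loader, ogt_loader, corr_loader, steps, lr=1e-4):
    """Cycle through the three objectives, one mini-batch per update step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loaders = {"mlm": mlm_loader, "ogt": ogt_loader, "corr": corr_loader}
    iterators = {name: iter(loader) for name, loader in loaders.items()}
    tasks = itertools.cycle(["mlm", "ogt", "corr"])
    for _ in range(steps):
        task = next(tasks)
        try:
            batch = next(iterators[task])
        except StopIteration:                       # restart an exhausted dataloader
            iterators[task] = iter(loaders[task])
            batch = next(iterators[task])
        loss = model.compute_loss(task, batch)      # hypothetical per-task loss interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```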
Effect of different weights of the multitask loss function on the performance of zero-shot prediction
We explored multitask loss functions with varying weights to account for differences in data volume and task difficulty across the three objectives. To minimize computational cost, we randomly selected 500,000 sequences from the full pretraining dataset of 96 million entries to serve as the training dataset for these ablation studies. The weight of each loss was chosen by grid search over the values [0.01, 0.05, 0.5, 1, 2], resulting in a total of 125 combinations. We found that a 1:1:1 weight ratio is an optimal setting for the zero-shot mutation prediction task on the ProteinGym and ΔTm datasets. The specific results are documented in table S9.
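The grid itself is straightforward to enumerate; the sketch below assumes a hypothetical pretrain_and_evaluate helper that pretrains on the 500,000-sequence subset with the given weights and returns the zero-shot performance, so it is an illustration of the search rather than the actual ablation code.

```python
from itertools import product

WEIGHT_GRID = [0.01, 0.05, 0.5, 1, 2]

results = {}
# 5 x 5 x 5 = 125 combinations of (MLM, OGT, correlation) loss weights.
for w_mlm, w_ogt, w_corr in product(WEIGHT_GRID, repeat=3):
    # pretrain_and_evaluate is a hypothetical helper: pretrain on the 500k subset
    # with these weights and report zero-shot performance on ProteinGym / ΔTm.
    results[(w_mlm, w_ogt, w_corr)] = pretrain_and_evaluate(w_mlm, w_ogt, w_corr)

best_weights = max(results, key=results.get)   # best-performing weight combination
```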
Fine-tuning the MLM module on homologous sequences
To improve the performance of PRIME and ESM-2 in zero-shot mutant effect prediction, we explored fine-tuning on homologous sequences, training only the MLM objective. Fine-tuning on homologous sequences adapts a pretrained model to a specific protein by leveraging knowledge gained from similar protein sequences (78). We applied this fine-tuning strategy to the ProteinGym and ΔTm datasets. Using Jackhmmer, a widely used sequence search tool, we identified homologous sequences of the proteins in these datasets from the Uniclust30 database (43). For proteins with more than 30,000 homologous sequences, the first 30,000 sequences were selected; for those with fewer than 30,000, all sequences were retained for fine-tuning. The fine-tuning process used the same hyperparameter settings as the pretraining phase of the MLM module. Specifically, the noised sequence is generated heuristically from the original sequence by randomly masking 20% of the tokens in a protein sequence: 70% of these masked tokens are replaced with a special <mask> token (14% of the entire sequence), and 20% of the masked tokens (4% of the entire sequence) are substituted with amino acids sampled according to their natural occurrence frequencies in the UniProtKB database, so that more common amino acids are more likely to be chosen. Our objective in fine-tuning on these sequences is to harness the shared attributes among homologous proteins and thereby improve mutation effect predictions. This tailored approach adapts the pretrained model to the context of a specific protein.
Transfer learning of PRIME on temperature-related benchmarks and FLIP
PRIME is trained on both temperature-related supervised and unsupervised tasks. To assess the transferability of PRIME's representations, we used a Tm prediction benchmark and an optimal catalytic temperature (Topt) prediction benchmark. The assessments used the encoder component and the OGT module, with all parameters of the Transformer encoder allowed to be fine-tuned. The batch size was set to 256 and the learning rate to 0.0001 in the Adam optimizer. Early stopping was applied with a patience of 20 epochs, and the maximum number of training epochs was set to 200. The loss function is again the MSE. To ensure robustness, the experiments were run with fivefold cross-validation, and no information from the test set was used during training or validation. The mean of the results was used as the final performance metric, and the variance was used for the error bars.
Transfer learning of PRIME on supervised mutant effect prediction
PRIME can also be applied to supervised mutant effect prediction tasks, which we use in our strategy for generating multisite mutants (Fig. 1D). Given a training set of mutated sequences with experimental fitness labels, we fine-tune PRIME on the training set and then predict the fitness of new mutated sequences. In this task, we use only the Transformer encoder and the OGT module; the MLM module is dropped. The parameters of the last two fully connected layers in the OGT module, FC3 and FC4, are reinitialized, and all other parameters remain frozen; this configuration is called the regression module in Fig. 1D. We again use the MSE loss to minimize the difference between the predicted and true fitness. During this training, the learning rate is 1 × 10−4 and the batch size is 16. The number of training epochs is decided dynamically. We begin by splitting the dataset into five folds. In each iteration, we use four folds for training and the remaining one for validation, and we record the number of epochs needed on the validation set, yielding five epoch counts. The final number of training epochs for the entire dataset is the average of these five counts. In the final training phase, we do not use a validation set; instead, we train on the entire dataset for the previously determined number of epochs. After training, the model can be used to predict the fitness of unseen mutated sequences. For proteins without any labeled mutant data, we first use the zero-shot capability of PRIME, referred to as PRIME (Zero-shot), to select the top-K single-point mutants, which are then experimentally labeled. In the first round of design, we use these labeled data to train PRIME, called PRIME (Round 1). This trained model is used to predict fitness scores for multisite mutants combined from all single-site mutants identified in the zero-shot round, and the top-K mutants are selected on the basis of these scores and experimentally characterized. The labeled multisite mutant data from this first round are added to the initial training set, and PRIME is retrained on the updated set, called PRIME (Round 2). Using PRIME (Round 2), we predict fitness scores for all multisite mutants combined from both the zero-shot round and Round 1, select the top-K mutants, and experimentally determine their fitness. If the results do not meet the requirements, we add these labeled top mutants to the training set and repeat the process.
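The dynamic epoch-selection procedure described above can be sketched as follows; build_model and the fit/history interfaces are hypothetical placeholders standing in for the actual training code of the regression module.

```python
import numpy as np
from sklearn.model_selection import KFold

def choose_num_epochs(X, y, build_model, max_epochs=200):
    """Average the best-epoch counts over five CV folds, then retrain on all data."""
    best_epochs = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
        model = build_model()                       # hypothetical: fresh regression module
        history = model.fit(X[train_idx], y[train_idx],
                            val=(X[val_idx], y[val_idx]), max_epochs=max_epochs)
        best_epochs.append(int(np.argmin(history["val_loss"])) + 1)  # epoch with lowest val MSE
    n_epochs = int(round(np.mean(best_epochs)))     # average the five epoch counts
    final_model = build_model()
    final_model.fit(X, y, val=None, max_epochs=n_epochs)             # retrain without a validation set
    return final_model, n_epochs
```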
Dataset
Pretraining dataset
By integrating publicly accessible data from UniProt and protein sequences from metagenomic projects (79–81), we curated ProteomeAtlas, a vast repository of natural protein sequences containing 4.7 billion entries. We filtered these sequences, retaining only those that are full length, and used MMseqs2 with a sequence identity threshold of 50% for redundancy reduction. We then annotated sequences with the OGTs (26) of their host bacterial strains. Ultimately, 96 million sequences were annotated in this manner, providing a rich resource for exploring protein sequence-temperature relationships.
Benchmark datasets for zero-shot mutation scoring
The dataset of changes in melting temperature (ΔTm) was sourced from MPTherm (37), FireProtDB (38), and ProThermDB (39), ensuring that all experiments were conducted under the same pH conditions. The ProteinGym dataset was taken from (31). The datasets and data splits for predicting the melting temperature (Tm) and the optimal enzymatic activity temperature (Topt) of native protein sequences were drawn from (82).
Different strategies of selecting single-site mutations for different engineering purposes
PRIME can be used to rank single mutants on the basis of both activity and stability. However, in the ablation study of PRIME, we found that the zero-shot performance of the OGT module alone (PRIME/-MLM) is quite poor on both the ProteinGym and ΔTm benchmarks. Therefore, we do not use the OGT module to select single-site mutations for stability. Instead, we suggest using the language-model likelihood of PRIME, which is obtained from pretraining with OGT prediction as an additional task. Drawing on the accumulated experience of biologists (7, 44), one can choose mutations located on the protein surface to improve stability without greatly altering activity, and mutate amino acids around the protein pocket to enhance catalytic activity. This empirical knowledge can be applied in a specific protein engineering assignment and might further increase the success rate.
Engineering of high stability or activity in five proteins
Prediction of single-site mutations by PRIME
First, we used Jackhmmer to identify sequences homologous to each target protein within the Uniclust30 database (43). For proteins with more than 30,000 homologous sequences, we randomly selected a subset of 30,000 for fine-tuning the PRIME model; for proteins with fewer than 30,000 homologous sequences, we used all available sequences in the fine-tuning process. This fine-tuning was executed five times for every target protein, each run initialized with a distinct random seed. By combining the predictions of these five fine-tuned models over all single-site saturation mutations, we compiled a comprehensive mutation scorecard for each protein. Mutants whose scores exceeded that of the wild type were flagged as potential candidates. In the final phase, we manually selected ~30 to 45 mutants per protein, requiring that they lie beyond a 6-Å radius of pivotal regions such as the catalytic active sites or binding pockets, for subsequent experimental evaluation.
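A sketch of how such a scorecard could be combined and filtered is shown below; averaging as the way of combining the five models, the use of Cα coordinates for the 6-Å distance check, and the array layout are all assumptions for illustration rather than the pipeline actually used.

```python
import numpy as np

def select_candidates(score_tables, wt_scores, hotspot_coords, ca_coords, min_dist=6.0):
    """Combine five fine-tuned models and filter candidate single-site mutants.

    score_tables   : list of 5 arrays, each (L, 20), per-model mutation scores
    wt_scores      : array (5, L), per-model score of the wild-type residue at each position
    hotspot_coords : (M, 3) coordinates of catalytic / binding-site atoms
    ca_coords      : (L, 3) C-alpha coordinates of the protein (assumption)
    """
    mean_scores = np.mean(score_tables, axis=0)        # consensus scorecard, (L, 20)
    mean_wt = np.mean(wt_scores, axis=0)               # (L,)
    # Minimum distance of every residue to any hotspot atom.
    dists = np.linalg.norm(ca_coords[:, None, :] - hotspot_coords[None, :, :], axis=-1).min(axis=1)
    candidates = []
    for i in range(mean_scores.shape[0]):
        if dists[i] <= min_dist:                       # skip residues within 6 Å of key regions
            continue
        for aa in range(20):
            if mean_scores[i, aa] > mean_wt[i]:        # keep mutants scoring above wild type
                candidates.append((i, aa, float(mean_scores[i, aa])))
    return sorted(candidates, key=lambda x: -x[2])
```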
T7 RNA polymerase
Preparation of T7 RNA polymerase variants
The T7 RNA polymerase (UniProt ID: P00573) gene and the genes of its mutants were cloned into the pQE-80L expression vector and transformed into Escherichia coli BL21(DE3) cells. The cells were cultured in Luria-Bertani (LB) media until reaching an optical density at 600 nm (OD600) of ~0.6 to 0.8, followed by induction with 1 mM isopropyl-β-d-thiogalactopyranoside (IPTG) for a 6-hour growth period at 37°C. After collection, the bacteria were resuspended in a binding buffer [50 mM tris-HCl (pH 8.0), 300 mM NaCl, 3 mM imidazole, and 0.1 mM EDTA] and lysed via sonication. The resulting lysate underwent centrifugation at 4°C and 12,000 rpm for 30 min. The lysate was then applied to a nickel–nitrilotriacetic acid (Ni-NTA) gravity column and washed with a washing buffer [50 mM tris-HCl (pH 8.0), 300 mM NaCl, 10 mM imidazole, 0.1 mM EDTA, and 10% glycerol]. Elution was performed using an elution buffer [50 mM tris-HCl (pH 8.0), 300 mM NaCl, 250 mM imidazole, 0.1 mM EDTA, and 10% glycerol]. Concentration was achieved using a final ultrafiltration buffer [50 mM tris-HCl (pH 8.0), 100 mM NaCl, and 0.1 mM EDTA], and the T7 RNA polymerase was diluted with a storage buffer [50 mM tris-HCl (pH 8.0), 100 mM NaCl, 0.1 mM EDTA, 1 mM dithiothreitol (DTT), and 75% glycerol] (83).
Thermal melt measurements
The protein dye SYPRO Orange was added to a final concentration of 5× and mixed with the protein sample (~0.2 mg/ml) in an eight-tube polymerase chain reaction (PCR) strip. Each sample was prepared in a final volume of 20 μl and tested in triplicate. Denaturation curves were generated using a PCR instrument (Analytik Jena qTower3) equipped with appropriate optical filters [FAM (470 nm) and ROX (625 nm) for excitation and emission, respectively]. The temperature was increased in 0.5°C steps from 25° to 65°C, with a 5-s hold for equilibration at each step. The thermal unfolding curves were analyzed by fitting the Boltzmann equation to estimate the Tm (58).
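As an illustration of the Tm estimation step, the sketch below fits a standard Boltzmann sigmoid to a melt curve with SciPy; the exact parameterization and fitting settings used in the study are not specified in the text, so this is only a plausible reconstruction.

```python
import numpy as np
from scipy.optimize import curve_fit

def boltzmann(T, F_min, F_max, Tm, slope):
    """Boltzmann sigmoid commonly used to fit thermal unfolding curves."""
    return F_min + (F_max - F_min) / (1.0 + np.exp((Tm - T) / slope))

def fit_tm(temps, fluorescence):
    """Estimate Tm from temperatures (°C) and fluorescence readings."""
    p0 = [fluorescence.min(), fluorescence.max(), np.median(temps), 1.0]  # rough initial guess
    params, _ = curve_fit(boltzmann, temps, fluorescence, p0=p0, maxfev=10000)
    return params[2]    # fitted Tm
```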
IVT assays
The IVT reaction buffer was prepared, containing 200 mM Hepes (pH 7.5), 30 mM MgCl2, 20 mM DTT, ribonuclease inhibitor (0.4 U/μl), 5 mM nucleoside triphosphate mix, and 100 nM iSpinach DNA template (84). The buffer was incubated at 52°C for 10 min, and then T7 RNAP (0.04 mg/ml) was added to initiate the reaction. After the mixture was incubated for 1 hour, 100 mM EDTA was added to stop the reaction. Last, 100 μM DFHBI was introduced, and fluorescence was measured with excitation at 470 nm and emission at 512 nm.
Creatinase
Preparation of creatinase variants
The creatinase (UniProt ID: Q9RHU9) gene was cloned into a pET-28a expression vector and transformed into E. coli BL21(DE3) cells. The cells expressing creatinase were cultivated in LB medium at 37°C while agitating the culture at 220 rpm. To induce the expression of creatinase, IPTG was added at a final concentration of 1 mM when the OD600 of the culture reached 0.8 to 1.0. The cells were then further cultured at a reduced temperature of 18°C for 16 hours. After collection, the cells were resuspended in a binding buffer [25 mM tris-HCl (pH 8.0), 200 mM NaCl, and 20 mM imidazole] and subjected to sonication for cell disruption. The resulting lysate was centrifuged at 4°C and 12,000 rpm for 30 min, and the supernatant was collected. The supernatant was loaded onto a pre-equilibrated Ni-NTA gravity column, and protein elution was performed using an imidazole gradient ranging from 20 to 200 mM. The purity of the fractions obtained was analyzed using SDS–polyacrylamide gel electrophoresis (SDS-PAGE).
The fractions containing the purified target protein were combined and desalted using an ultrafiltration unit. The purified protein was then concentrated and stored in 1× PBS at a temperature of −80°C to maintain its stability and activity (58).
Differential scanning fluorimetry
The thermal stability testing was also carried out using a PCR instrument (Analytik Jena qTower3). All proteins were diluted in 1× PBS to a final concentration of 0.3 mg/ml and mixed with SYPRO Orange at a final concentration of 5× in an eight-row PCR tube. The protein unfolding process was initiated by subjecting the samples to a thermal treatment ranging from 25 to 85°C (with a temperature increment of 0.5°C per step) with each step holding for 5 s.
Subsequently, the thermal unfolding curves were obtained, and the data were analyzed using the Boltzmann equation to determine the Tm (58).
Activity measurements
Creatinase hydrolyzes creatine into sarcosine and urea. The resulting urea reacts with p-dimethylaminobenzaldehyde to form a yellow-colored dye, and the urea concentration can be determined by measuring the absorbance of the dye at 435 nm using a spectrophotometer (58). Consequently, the specific activity of the protein can be calculated. The experimental procedure is as follows:
1) Incubate a PBS buffer solution (280 μl) containing 100 mM creatine at 37°C for 5 min.
2) Incubate the mixture with 20 μl of protein solution (1 mg/ml) for 22 min.
3) Stop the reaction by adding p-dimethylaminobenzaldehyde solution (600 μl) prepared by dissolving 2 g of p-dimethylaminobenzaldehyde in 100 ml of dimethyl sulfoxide and 15 ml of concentrated hydrochloric acid.
4) Measure the absorbance at 435 nm using a spectrophotometer.
VHH
Protein expression and purification
The gene of the VHH was cloned into the pET29a plasmid with an N-terminal His-tag. The expression plasmid was transformed into E. coli BL21(DE3) cells. A single colony of each recombinant E. coli strain was inoculated into 30 ml of LB medium with kanamycin (50 μg/ml) for seed culture at 37°C for 12 to 16 hours. The seed culture (10 ml) was transferred to 1 liter of LB medium with kanamycin (50 μg/ml) at 37°C and 220 rpm until the OD600 reached 0.6 to 0.8. The culture was cooled to 16°C and then induced with 0.5 mM IPTG for 20 to 24 hours at 16°C. Cells were harvested from the fermentation culture by centrifugation for 30 min at 4000 rpm, and the cell pellets were collected for later purification. The cell pellets were resuspended in buffer A [20 mM Na2HPO4 and NaH2PO4 and 0.5 M NaCl (pH 8.0)] and then lysed via ultrasonication. The lysates were centrifuged for 30 min at 12,000 rpm at 4°C, after which the supernatants were subjected to Ni-NTA affinity purification with elution buffer [20 mM Na2HPO4 and NaH2PO4, 0.5 M NaCl, and 250 mM imidazole (pH 8.0)]. The purity of the fractions obtained was analyzed using SDS-PAGE. The fractions containing the purified target protein were combined and desalted using an ultrafiltration unit. The purified protein was then concentrated and stored in buffer A with 10% glycerol at −80°C.
Alkaline treatment of proteins
Wild-type and mutant VHH were incubated in 0.3 or 0.5 M NaOH for 3, 6, and 24 hours. Subsequently, hydrochloric acid was added to terminate the alkali treatment, and the samples were stored at −80°C.
Alkaline pH stability test (ELISA)
Ninety-six-well plates were coated with growth hormone protein at 5 ng per well at 4°C overnight. The plates were washed three times with 1× PBST (phosphate-buffered saline with Tween 20) and blocked with 1% BSA in 1× PBST at 25°C for 2 hours. After washing three times with 1× PBST, the plates were incubated with serial dilutions of VHH proteins at 100 μl per well (1:2, 1:4, 1:8, 1:16, 1:32, 1:64, 1:128, 1:256, 1:512, 1:1024, and 1:2048) for 1 hour at 25°C. After washing three times with 0.5% PBST, horseradish peroxidase (100 μl per well; 1:5000) was added and incubated at 25°C for 1 hour. The plates were washed four times with 1× PBST, and TMB (100 μl per well) was added and incubated at 25°C for 15 min in the dark. Last, 2 M H2SO4 (100 μl per well) was added to stop the reaction, and absorbance was measured at 450 nm (TECAN, Switzerland).
Log(agonist)-versus-response curves with variable slope (four parameters) were fitted to calculate the median effective concentration (EC50), which was used to assess the stability of VHH after alkaline treatment.
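A sketch of such a four-parameter logistic (variable-slope) fit is given below, following the standard log(agonist)-versus-response parameterization; the actual analysis software and fitting settings are not specified in the text, so this is an illustrative reconstruction only.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(log_conc, bottom, top, log_ec50, hill):
    """log(agonist) vs. response -- variable slope (four parameters)."""
    return bottom + (top - bottom) / (1.0 + 10.0 ** ((log_ec50 - log_conc) * hill))

def fit_ec50(log_conc, response):
    """Estimate EC50 from log-transformed dilutions and ELISA absorbance readings."""
    p0 = [response.min(), response.max(), np.median(log_conc), 1.0]  # rough initial guess
    params, _ = curve_fit(four_pl, log_conc, response, p0=p0, maxfev=10000)
    return 10.0 ** params[2]    # EC50 on the linear concentration scale
```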
Nonnatural nucleic acid polymerase
Polymerase expression and purification
Polymerases were expressed and purified as previously reported (85). Briefly, the genes of Tgo-D4K and its mutants were cloned into the pGDR11 vector and transformed into E. coli BL21 cells. The cultures were grown in 50 ml of LB medium containing ampicillin (100 μg/ml) at 37°C with shaking at 240 rpm until the OD600 reached 0.6 to 1.0. Then, the cultures were induced by adding IPTG (0.5 mM) and incubated at 16°C with shaking at 240 rpm for 20 hours. The cells were harvested by centrifugation, and the pellet was lysed by sonication in buffer [10 mM tris-HCl (pH 8.0), 500 mM KCl, and 10% glycerol]. The lysate was centrifuged for 30 min at 13,300 rpm at 4°C, and the clarified supernatant was heated for 1 hour at 80°C and then immediately cooled for 30 min on ice. The lysate was clarified again by centrifugation for 30 min at 4°C and 13,300 rpm. Polyethyleneimine (0.5%, v/v) was added to precipitate the nucleic acids, and the lysate was centrifuged for 30 min at 13,300 rpm at 4°C. Ammonium sulfate (60%, w/v) was added to precipitate the polymerase; after incubation for 1 hour at 4°C, the sample was centrifuged for 30 min at 13,300 rpm at 4°C. Protein pellets were resuspended in 4°C precooled buffer [10 mM tris-HCl (pH 8.0), 50 mM KCl, and 10% glycerol]. The supernatant was loaded onto Ni-NTA resin. Protein eluted at 200 mM imidazole was dialyzed at 4°C against buffer [10 mM tris-HCl (pH 8.0), 50 mM KCl, and 10% glycerol]. The purity of the fractions obtained was verified by SDS-PAGE, and the protein was stored at −80°C.
Measurement of synthesis rates of polymerase
To measure the synthesis rates of the polymerase, kinetic measurements were performed as previously reported (62). Each measurement (10 μl) contained 1 μM 30-mer template, 100 μM of each nucleotide triphosphate, 1× ThermoPol buffer, 2× LC Green Plus fluorescent dye, and 20 nM polymerase. Reactions were denatured for 3 min at 95°C and extended for 30 min at 55°C, with fluorescence intensity recorded at 6-s intervals. Fluorescence data for each polymerase were normalized and converted to nucleotides per polymerase. The synthesis rate was determined by linear fitting of nucleotides per polymerase over reaction time. The reported values are the average of three independent replicates.
LbCas12a
Plasmid construction
LbCas12a mutants were constructed by overlap PCR using a previously described pET28a plasmid harboring wild-type LbCas12a as the template and oligonucleotides carrying the desired mutations. The expression plasmid contained a C-terminal 10× His tag for downstream affinity purification. The recombinant plasmids were transformed into E. coli Trelief 5α cells (Tsingke, Beijing, China). The sequences of all the plasmid constructs were confirmed via Sanger sequencing (Tsingke).
Protein expression and purification
All the LbCas12a proteins were expressed in E. coli BL21(DE3) cells cultured in LB medium supplemented with kanamycin (50 μg/ml). Single colonies were picked from the LB agar plates and grown in a starter culture overnight. The next day, the culture was inoculated into fresh LB medium supplemented with kanamycin (50 μg/ml) at a ratio of 1:100 and incubated at 37°C until OD600 reached 0.6. Protein expression was induced with 1 mM IPTG at 37°C for 4 hours. The cells were harvested by centrifugation at 5000 rcf for 15 min at 4°C.
Collected cells were resuspended in lysis buffer (pH 8.0) containing 100 mM sodium phosphate, 600 mM NaCl, 0.05% Tween 20, 30 mM imidazole, 1 mM DTT, and 0.5 mM phenylmethylsulfonyl fluoride. After disruption by sonication and centrifugation for 1 hour at 12,000 rcf at 4°C, HisPur Ni-NTA Magnetic Beads (Thermo Fisher Scientific, Waltham, MA, USA) were used to purify the proteins according to the manufacturer's protocol. The harvested protein was concentrated into storage buffer containing 50 mM tris-HCl (pH 7.5), 500 mM NaCl, 10% (v/v) glycerol, and 2 mM DTT using Pierce Protein Concentrators (Thermo Fisher Scientific) and stored at −80°C.
crRNA preparation
All the DNA oligos used in this study were purchased from Tsingke Biotechnology Co. For crRNA preparation, the IVT template was generated by annealing a T7 promoter–carrying oligonucleotide with a complementary oligonucleotide containing the antisense T7 promoter, the crRNA direct repeat motif, and the spacer sequence. crRNA transcription was performed in a 30-μl reaction using the above IVT templates and the HiScribe T7 Quick High Yield RNA Synthesis Kit (New England Biolabs) at 37°C overnight. The residual DNA templates in the IVT reactions were removed by treatment with deoxyribonuclease I (0.08 U/μl), and the RNA product was purified by TRIzol (Invitrogen).
Differential scanning fluorimetry assays
All the LbCas12a proteins were diluted to a final concentration of 0.5 mg/ml in reaction buffer containing 50 mM tris-HCl (pH 7.5) and 500 mM NaCl and added into Standard Capillaries (NanoTemper). All experiments were carried out at temperatures ranging from 20° to 95°C with a heating rate of 1°C/min using a Prometheus NT.48 instrument and PR.ThermControl software (NanoTemper, Munich, Germany).
In vitro cleavage assays
The Cas12a trans-cleavage reaction was performed as previously described (86) with minor modifications. Target DNA was PCR amplified from a plasmid with specific primers or generated from annealed oligonucleotides and then purified. Briefly, the reaction was carried out with 50 nM LbCas12a protein, 2.5 ng of substrate DNA, 100 nM crRNA, 0.5 mM DTT, 1.25 μM single-stranded DNA reporter (5′-FAM-CCCCC-BHQ1-3′), and 1× Buffer 2.1 (New England Biolabs) in a 10-μl reaction. Each sample was run with three biological replicates and loaded onto 384-well plates. After incubation for 15 min at 42°C, the fluorescence intensity was monitored using a SpectraMax iD3 Multi-Mode Microplate Reader with an excitation wavelength of 485 nm and an emission wavelength of 535 nm. The fluorescence signal was recorded at 2-min intervals and processed in subsequent analyses.
Acknowledgments
We acknowledge Shanghai Artificial Intelligence Laboratory for computing resources.
Funding: This work was supported by the National Natural Science Foundation of China (grant nos. 12104295, 11974239, and 32471536), the National Key Research and Development Program of China (grant no. 2021YFF1200200), the Innovation Program of Shanghai Municipal Education Commission (2019-01-07-00-02-E00076), Shanghai Jiao Tong University Scientific and Technological Innovation Funds (21X010200843), the Computational Biology Key Program of Shanghai Science and Technology Commission (23JS1400600), Science and Technology Innovation Key R&D Program of Chongqing (CSTB2022TIAD-STX0017), the Student Innovation Center at Shanghai Jiao Tong University, and Shanghai Artificial Intelligence Laboratory. The engineering of VHH was supported by Changchun Genscience Pharmaceuticals Co., Ltd.
Author contributions: Conceptualization: P.T. and L.H. Methodology: P.T., F.J., and M.L. Investigation: F.J., M.L., J.D., Y.Y., B.W., X.S., J.H., Liang Z., Y.T., S.W., W.X., Y.P., J.S., L.K., Y.X., W.O., Z.H., G.F., Lirong Z., Y.F., G.Y., J.L., Q.L., and J.X. Visualization: F.J. and M.L. Supervision: P.T., L.H., J.L., and J.S. Writing—original draft: P.T. and L.H. Writing—review and editing: P.T., L.H., F.J., and M.L.
Competing interests: The proteins mentioned in this article have both published patents and pending patent applications in China. These include LbCas12a, filed by ShanghaiTech University and Shanghai Jiao Tong University (application no. 2023111716279); T7 RNA Polymerase and Creatinase, filed by Shanghai Matwings Technology Co., Ltd. (with both published and pending patents; publication numbers CN117070493A, CN116694608A, and CN117448306A); and VHH, filed by Changchun Genscience Pharmaceuticals Co., Ltd. The authors involved in these patent applications are F.J., Y.Y., B.W., L.K., P.T., J.L., and L.H. The authors declare that they have no other competing interests.
Data and materials availability: All data needed to evaluate the conclusions in this paper are present in the paper and/or the Supplementary Materials. The source code, pretraining checkpoint file, and checkpoint files related to the five proteins discussed in this study for PRIME are accessible at https://doi.org/10.5281/zenodo.12819415. Comprehensive results from the benchmarks for each assay from ProteinGym and ΔTm datasets are also provided at https://doi.org/10.5281/zenodo.12819415. The single-point mutation scores for the five proteins emphasized in this study and all the detailed wet-lab experimental data are included in an Excel file in the supplementary information. The maintenance of the subsequent code is in this GitHub repository https://github.com/ai4protein/Pro-Prime.
Supplementary Materials
REFERENCES AND NOTES
- 1.W. P. Jencks, Catalysis in Chemistry and Enzymology (Courier Corporation, 1987). [Google Scholar]
- 2.C. M. O’Connor, J. U. Adams, J. Fairman, Essentials of Cell Biology (NPG Education 1, 54, 2010). [Google Scholar]
- 3.Chaplin D. D., Overview of the immune response. J. Allergy Clin. Immunol. 125, S3–S23 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Cohen G. B., Ren R., Baltimore D., Modular binding domains in signal transduction proteins. Cell 80, 237–248 (1995). [DOI] [PubMed] [Google Scholar]
- 5.van Vliet C., Thomas E. C., Merino-Trigo A., Teasdale R. D., Gleeson P. A., Intracellular sorting and transport of proteins. Prog. Biophys. Mol. Biol. 83, 1–45 (2003). [DOI] [PubMed] [Google Scholar]
- 6.Xia Y., Li X., Yang L., Luo X., Shen W., Cao Y., Peplowski L., Chen X., Development of thermostable sucrose phosphorylase by semi-rational design for efficient biosynthesis of alpha-D-glucosylglycerol. Appl. Microbiol. Biotechnol. 105, 7309–7319 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.M. T. Reetz, Z. Sun, G. Qu, Enzyme Engineering: Selective Catalysts for Applications in Biotechnology, Organic Chemistry, and Life Science (John Wiley & Sons, 2023). [Google Scholar]
- 8.Lovelock S. L., Crawshaw R., Basler S., Levy C., Baker D., Hilvert D., Green A. P., The road to fully programmable protein catalysis. Nature 606, 49–58 (2022). [DOI] [PubMed] [Google Scholar]
- 9.Tokuriki N., Tawfik D. S., Protein dynamism and evolvability. Science 324, 203–207 (2009). [DOI] [PubMed] [Google Scholar]
- 10.Lutz S., Iamurri S. M., Protein engineering: Past, present, and future. Methods Mol. Biol. 1685, 1–12 (2018). [DOI] [PubMed] [Google Scholar]
- 11.Reetz M. T., Soni P., Fernández L., Knowledge-guided laboratory evolution of protein thermolability. Biotechnol. Bioeng. 102, 1712–1717 (2009). [DOI] [PubMed] [Google Scholar]
- 12.Das R., Baker D., Macromolecular modeling with rosetta. Annu. Rev. Biochem. 77, 363–382 (2008). [DOI] [PubMed] [Google Scholar]
- 13.Xiong P., Chen Q., Liu H., Computational protein design under a given backbone structure with the ABACUS statistical energy function. Methods Mol. Biol. 1529, 217–226 (2017). [DOI] [PubMed] [Google Scholar]
- 14.Schymkowitz J., Borg J., Stricher F., Nys R., Rousseau F., Serrano L., The FoldX web server: An online force field. Nucleic Acids Res. 33, W382–W388 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Madani A., Krause B., Greene E. R., Subramanian S., Mohr B. P., Holton J. M., Olmos J. L., Xiong C., Sun Z. Z., Socher R., Fraser J. S., Naik N., Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Hie B., Zhong Ellen D., Berger B., Bryson B., Learning the language of viral evolution and escape. Science 371, 284–288 (2021). [DOI] [PubMed] [Google Scholar]
- 17.R. M. Rao, J. Liu, R. Verkuil, J. Meier, J. Canny, P. Abbeel, T. Sercu, A. Rives, MSA transformer, in Proceedings of the 38th International Conference on Machine Learning (2021), vol. 139, pp. 8844–8856. [Google Scholar]
- 18.Rives A., Meier J., Sercu T., Goyal S., Lin Z., Liu J., Guo D., Ott M., Zitnick C. L., Ma J., Fergus R., Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y., dos Santos Costa A., Fazel-Zarandi M., Sercu T., Candido S., Rives A., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). [DOI] [PubMed] [Google Scholar]
- 20.W. Jin, S. Sarkizova, X. Chen, N. Hacohen, C. Uhler, Unsupervised protein-ligand binding energy prediction via neural Euler’s rotation equation. arXiv:2301.10814 [q-bio.BM] (2023).
- 21.J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, A. Rives, Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv 450648 [Preprint] (2021). 10.1101/2021.07.09.450648. [DOI]
- 22.Hsu C., Nisonoff H., Fannjiang C., Listgarten J., Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022). [DOI] [PubMed] [Google Scholar]
- 23.Notin P., Kollasch A., Ritter D., Van Niekerk L., Paul S., Spinner H., Rollins N., Shaw A., Orenbuch R., Weitzman R., ProteinGym: Large-scale benchmarks for protein fitness prediction and design, in 37th Conference on Neural Information Processing Systems (NeurIPS 2023). [Google Scholar]
- 24.Luo Y., Jiang G., Yu T., Liu Y., Vo L., Ding H., Su Y., Qian W. W., Zhao H., Peng J., ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Wu Z., Kan S. B. J., Lewis Russell D., Wittmann Bruce J., Arnold Frances H., Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. U.S.A. 116, 8852–8858 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Engqvist M. K. M., Correlating enzyme annotations with a large set of microbial growth temperatures reveals metabolic adaptations to growth at diverse temperatures. BMC Microbiol. 18, 177 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Li G., Rabe K. S., Nielsen J., Engqvist M. K. M., Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth. Biol. 8, 1411–1420 (2019). [DOI] [PubMed] [Google Scholar]
- 28.J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186 (Association for Computational Linguistics, 2019). [Google Scholar]
- 29.Sarkisyan K. S., Bolotin D. A., Meer M. V., Usmanova D. R., Mishin A. S., Sharonov G. V., Ivankov D. N., Bozhanova N. G., Baranov M. S., Soylemez O., Bogatyreva N. S., Vlasov P. K., Egorov E. S., Logacheva M. D., Kondrashov A. S., Chudakov D. M., Putintseva E. V., Mamedov I. Z., Tawfik D. S., Lukyanov K. A., Kondrashov F. A., Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I., Attention is all you need, in 31st Conference on Neural Information Processing Systems (NIPS 2017).
- 31.P. Notin, M. Dias, J. Frazer, J. M. Hurtado, A. N. Gomez, D. Marks, Y. Gal, Tranception: Protein fitness prediction with autoregressive transformers and inference-time retrieval, in Proceedings of the 39th International Conference on Machine Learning (PMLR, 2022), vol. 162, pp. 16990–17017. [Google Scholar]
- 32.Yang K. K., Fusi N., Lu A. X., Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst. 15, 286–294.e2 (2024). [DOI] [PubMed] [Google Scholar]
- 33.K. K. Yang, N. Zanichelli, H. Yeh, Masked inverse folding with sequence transfer for protein representation learning. bioRxiv 493516 [Preprint] (2023). 10.1101/2022.05.25.493516. [DOI] [PubMed]
- 34.J. Su, C. Han, Y. Zhou, J. Shan, X. Zhou, F. Yuan, SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv 560349 [Preprint] (2024). 10.1101/2023.10.01.560349. [DOI]
- 35.Diaz D. J., Gong C., Ouyang-Zhang J., Loy J. M., Wells J., Yang D., Ellington A. D., Dimakis A. G., Klivans A. R., Stability Oracle: A structure-based graph-transformer framework for identifying stabilizing mutations. Nat. Commun. 15, 6170 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Laine E., Karami Y., Carbone A., GEMME: A simple and fast global epistatic model predicting mutational effects. Mol. Biol. Evol. 36, 2604–2619 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kulandaisamy A., Sakthivel R., Gromiha M. M., MPTherm: Database for membrane protein thermodynamics for understanding folding and stability. Brief. Bioinform. 22, 2119–2125 (2021). [DOI] [PubMed] [Google Scholar]
- 38.Stourac J., Dubrava J., Musil M., Horackova J., Damborsky J., Mazurenko S., Bednar D., FireProtDB: Database of manually curated protein stability data. Nucleic Acids Res. 49, D319–D324 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Nikam R., Kulandaisamy A., Harini K., Sharma D., Gromiha M. M., ProThermDB: Thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Res. 49, D420–D424 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Turner P., Mamo G., Karlsson E. N., Potential and utilization of thermophiles and thermostable enzymes in biorefining. Microb. Cell Fact. 6, 1–23 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.C. Dallago, J. Mou, K. E. Johnston, B. J. Wittmann, N. Bhattacharya, S. Goldman, A. Madani, K. K. Yang, FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv 467890 [Preprint] (2022). 10.1101/2021.11.09.467890. [DOI]
- 42.Jarzab A., Kurzawa N., Hopf T., Moerch M., Zecha J., Leijten N., Bian Y., Musiol E., Maschberger M., Stoehr G., Becher I., Daly C., Samaras P., Mergner J., Spanier B., Angelov A., Werner T., Bantscheff M., Wilhelm M., Klingenspor M., Lemeer S., Liebl W., Hahne H., Savitski M. M., Kuster B., Meltome atlas—Thermal proteome stability across the tree of life. Nat. Methods 17, 495–503 (2020). [DOI] [PubMed] [Google Scholar]
- 43.Mirdita M., Von Den Driesch L., Galiez C., Martin M. J., Söding J., Steinegger M., Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Sun Z., Liu Q., Qu G., Feng Y., Reetz M. T., Utility of B-factors in protein science: Interpreting rigidity, flexibility, and internal motion and engineering thermostability. Chem. Rev. 119, 1626–1665 (2019). [DOI] [PubMed] [Google Scholar]
- 45.Goldsmith M., Tawfik D. S., Enzyme engineering: Reaching the maximal catalytic efficiency peak. Curr. Opin. Struct. Biol. 47, 140–150 (2017). [DOI] [PubMed] [Google Scholar]
- 46.Fonfara I., Richter H., Bratovic M., Le Rhun A., Charpentier E., The CRISPR-associated DNA-cleaving enzyme Cpf1 also processes precursor CRISPR RNA. Nature 532, 517–521 (2016). [DOI] [PubMed] [Google Scholar]
- 47.Swarts D. C., Jinek M., Mechanistic insights into the cis- and trans-acting DNase activities of Cas12a. Mol. Cell 73, 589–600.e4 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Zavvar T. S., Khoshbin Z., Ramezani M., Alibolandi M., Abnous K., Taghdisi S. M., CRISPR/Cas-engineered technology: Innovative approach for biosensor development. Biosens. Bioelectron. 214, 114501 (2022). [DOI] [PubMed] [Google Scholar]
- 49.Borkotoky S., Murali A., The highly efficient T7 RNA polymerase: A wonder macromolecule in biological realm. Int. J. Biol. Macromol. 118, 49–56 (2018). [DOI] [PubMed] [Google Scholar]
- 50.Dousis A., Ravichandran K., Hobert E. M., Moore M. J., Rabideau A. E., An engineered T7 RNA polymerase that produces mRNA free of immunostimulatory byproducts. Nat. Biotechnol. 41, 560–568 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Leone G., van Gemen B., Schoen C. D., van Schijndel H., Kramer F. R., Molecular beacon probes combined with amplification by NASBA enable homogeneous, real-time detection of RNA. Nucleic Acids Res. 26, 2150–2155 (1998). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Cui Z., Wang Y., Fang L., Zheng R., Huang X., Liu X., Zhang G., Rui D., Ju J., Hu Z., Novel real-time simultaneous amplification and testing method to accurately and rapidly detect Mycobacterium tuberculosis complex. J. Clin. Microbiol. 50, 646–650 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Nelson J., Sorensen E. W., Mintri S., Rabideau A. E., Zheng W., Besin G., Khatwani N., Su S. V., Miracco E. J., Issa W. J., Impact of mRNA chemistry and manufacturing process on innate immune activation. Sci. Adv. 6, eaaz6893 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Wu M. Z., Asahara H., Tzertzinis G., Roy B., Synthesis of low immunogenicity RNA with high-temperature in vitro transcription. RNA 26, 345–360 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Killard A. J., Smyth M. R., Creatinine biosensors: Principles and designs. Trends Biotechnol. 18, 433–437 (2000). [DOI] [PubMed] [Google Scholar]
- 56.Zhi Q., Kong P., Zang J., Cui Y., Li S., Li P., Yi W., Wang Y., Chen A., Hu C., Biochemical and molecular characterization of a novel high activity creatine amidinohydrolase from Arthrobacter nicotianae strain 02181. Process Biochem. 44, 460–465 (2009). [Google Scholar]
- 57.Berberich J. A., Yang L. W., Bahar I., Russell A. J., A stable three enzyme creatinine biosensor. 2. Analysis of the impact of silver ions on creatine amidinohydrolase. Acta Biomater. 1, 183–191 (2005). [DOI] [PubMed] [Google Scholar]
- 58.Jiang F., Bian J., Liu H., Li S., Bai X., Zheng L., Jin S., Liu Z., Yang G.-Y., Hong L., Creatinase: Using increased entropy to improve the activity and thermostability. J. Phys. Chem. B 127, 2671–2682 (2023). [DOI] [PubMed] [Google Scholar]
- 59.Bult C. J., White O., Olsen G. J., Zhou L., Fleischmann R. D., Sutton G. G., Blake J. A., FitzGerald L. M., Clayton R. A., Gocayne J. D., Kerlavage A. R., Dougherty B. A., Tomb J.-F., Adams M. D., Reich C. I., Overbeek R., Kirkness E. F., Weinstock K. G., Merrick J. M., Glodek A., Scott J. L., Geoghagen N. S. M., Weidman J. F., Fuhrmann J. L., Nguyen D., Utterback T. R., Kelley J. M., Peterson J. D., Sadow P. W., Hanna M. C., Cotton M. D., Roberts K. M., Hurst M. A., Kaine B. P., Borodovsky M., Klenk H.-P., Fraser C. M., Smith H. O., Woese C. R., Venter J. C., Complete genome sequence of the methanogenic archaeon. Science 273, 1058–1073 (1996). [DOI] [PubMed] [Google Scholar]
- 60.Kois P., Tocik Z., Spassova M., Ren W.-Y., Rosenberg I., Soler J. F., Watanabe K. A., Synthesis and some properties of modified oligonucleotides. II. Oligonucleotides containing 2′-deoxy-2′-fluoro-β-D-arabinofuranosyl pyrimidine nucleosides. Nucleosides Nucleotides 12, 1093–1109 (1993). [Google Scholar]
- 61.Wang Y., Ngor A. K., Nikoomanzar A., Chaput J. C., Evolution of a general RNA-cleaving FANA enzyme. Nat. Commun. 9, 5067 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Nikoomanzar A., Dunn M. R., Chaput J. C., Evaluating the rate and substrate specificity of laboratory evolved XNA polymerases. Anal. Chem. 89, 12622–12625 (2017). [DOI] [PubMed] [Google Scholar]
- 63.Yan S., Li X., Zhang P., Wang Y., Chen H.-Y., Huang S., Yu H., Direct sequencing of 2′-deoxy-2′-fluoroarabinonucleic acid (FANA) using nanopore-induced phase-shift sequencing (NIPSS). Chem. Sci. 10, 3110–3117 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Harshe R. P., Xie A., Vuerich M., Frank L. A., Gromova B., Zhang H., Robles R. J., Mukherjee S., Csizmadia E., Kokkotou E., Cheifetz A. S., Moss A. C., Kota S. K., Robson S. C., Longhi M. S., Endogenous antisense RNA curbs CD39 expression in Crohn’s disease. Nat. Commun. 11, 5894 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Pelisch N., Rosas Almanza J., Stehlik K. E., Aperi B. V., Kroner A., Use of a self-delivering anti-CCL3 FANA oligonucleotide as an innovative approach to target inflammation after spinal cord injury. eNeuro 8, ENEURO.0338-0320.2021 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Pinheiro V. B., Taylor A. I., Cozens C., Abramov M., Renders M., Zhang S., Chaput J. C., Wengel J., Peak-Chew S.-Y., McLaughlin S. H., Herdewijn P., Holliger P., Synthetic genetic polymers capable of heredity and evolution. Science 336, 341–344 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Tang H., Gao Y., Han J., Application progress of the single domain antibody in medicine. Int. J. Mol. Sci. 24, 4176 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Muyldermans S., Applications of nanobodies. Annu. Rev. Anim. Biosci. 9, 401–421 (2021). [DOI] [PubMed] [Google Scholar]
- 69.Wang J., Bever C. R., Majkova Z., Dechant J. E., Yang J., Gee S. J., Xu T., Hammock B. D., Heterologous antigen selection of camelid heavy chain single domain antibodies against tetrabromobisphenol A. Anal. Chem. 86, 8296–8302 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Zettl I., Ivanova T., Zghaebi M., Rutovskaya M. V., Ellinger I., Goryainova O., Kollárová J., Villazala-Merino S., Lupinek C., Weichwald C., Drescher A., Eckl-Dorna J., Tillib S. V., Flicker S., Generation of high affinity ICAM-1-specific nanobodies and evaluation of their suitability for allergy treatment. Front. Immunol. 13, 1022418 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Pabst T. M., Wendeler M., Wang X., Bezemer S., Hermans P., Hunter A. K., Camelid V(H) H affinity ligands enable separation of closely related biopharmaceuticals. Biotechnol. J. 12, 1600357 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Palmer B., Angus K., Taylor L., Warwicker J., Derrick J. P., Design of stability at extreme alkaline pH in streptococcal protein G. Biotechnol. J. 134, 222–230 (2008). [DOI] [PubMed] [Google Scholar]
- 73.Laughlin T. M., Horn J. R., Engineering pH-sensitive single-domain antibodies. Methods Mol. Biol. 2446, 269–298 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Hie B. L., Shanker V. R., Xu D., Bruun T. U. J., Weidenbacher P. A., Tang S., Wu W., Pak J. E., Kim P. S., Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Alford R. F., Leaver-Fay A., Jeliazkov J. R., O’Meara M. J., DiMaio F. P., Park H., Shapovalov M. V., Renfrew P. D., Mulligan V. K., Kappel K., The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Biswas S., Khimulya G., Alley E. C., Esvelt K. M., Church G. M., Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021). [DOI] [PubMed] [Google Scholar]
- 77.Wittmann B. J., Yue Y., Arnold F. H., Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045.e7 (2021). [DOI] [PubMed] [Google Scholar]
- 78.Meier J., Rao R., Verkuil R., Liu J., Sercu T., Rives A., Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021). [Google Scholar]
- 79.Richardson L., Allen B., Baldi G., Beracochea M., Bileschi M. L., Burdett T., Burgin J., Caballero-Pérez J., Cochrane G., Colwell L. J., Curtis T., Escobar-Zepeda A., Gurbich T. A., Kale V., Korobeynikov A., Raj S., Rogers A. B., Sakharova E., Sanchez S., Wilkinson D. J., Finn R. D., MGnify: The microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Steinegger M., Mirdita M., Söding J., Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019). [DOI] [PubMed] [Google Scholar]
- 81.Steinegger M., Söding J., Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Li G., Buric F., Zrimec J., Viknander S., Nielsen J., Zelezniak A., Engqvist M. K. M., Learning deep representations of enzyme thermal adaptation. Protein Sci. 31, e4480 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Meyer A. J., Garry D. J., Hall B., Byrom M. M., McDonald H. G., Yang X., Yin Y. W., Ellington A. D., Transcription yield of fully 2′-modified RNA can be increased by the addition of thermostabilizing mutations to T7 RNA polymerase mutants. Nucleic Acids Res. 43, 7480–7488 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Qin W., Li L., Yang F., Wang S., Yang G.-Y., High-throughput iSpinach fluorescent aptamer-based real-time monitoring of in vitro transcription. Bioresour. Bioprocess. 9, 112 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Nikoomanzar A., Dunn M. R., Chaput J. C., Engineered polymerases with altered substrate specificity: Expression and purification. Curr. Protoc. Nucleic Acid Chem. 69, 4.75.1–4.75.20 (2017). [DOI] [PubMed] [Google Scholar]
- 86.Lin H., Zheng W., Li S., Wang Y., Wei D., Xie L., Lu W., Tian Z., Wang S., Qu J., Liu J., Internet of medical things-enabled CRISPR diagnostics for rapid detection of SARS-CoV-2 variants of concern. Front. Microbiol. 13, 1070940 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A. W., Kavukcuoglu K., Kohli P., Hassabis D., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]