Nucleic Acids Research. 2026 Jan 28;54(3):gkaf1529. doi: 10.1093/nar/gkaf1529

Computational evolution of poly(U) polymerase for efficient and controlled RNA oligonucleotide synthesis

Lixiang Yang 1,#, Yi He 2,#, Fuyan Cao 3,#, Yanjia Qin 4,#, Yi Wang 5, Huijun Zhang 6, Weiwei Han 7, Meng Yang 8,9
PMCID: PMC12848935  PMID: 41603735

Abstract

Template-independent polymerases such as poly(U) polymerase (PUP) hold promise for enzymatic RNA synthesis but are limited by inefficient incorporation of modified nucleotides. Here, we describe a multi-round, closed-loop workflow integrating Gaussian accelerated molecular dynamics (GaMD), machine learning (ML), and generative artificial intelligence (AI) to engineer PUP variants with enhanced activity and stability. Our engineering strategy commenced with a deep mechanistic analysis of PUP using GaMD simulations. This provided the blueprint for our first key step: engineering PUPdel, a truncated variant that achieved a pivotal breakthrough by incorporating 3′-terminally blocked nucleotides and enabling controlled template-independent synthesis. Subsequently, we screened single-point mutations using protein language models (e.g. ESM1v) combined with Rosetta-based stability predictions, yielding a 47.78% hit rate for functionally active variants. Iterative ML models predicted synergistic multi-mutant combinations, increasing success rates to 63%. Finally, ESM3-based generative design produced PUPdel2, with 16 mutations conferring 3.4°C higher thermostability, 3.7-fold improved expression, and up to 5.4-fold enhanced catalytic efficiency for 3′-O-allyl-UTP. Structural analyses revealed that mutations enhance β-trapdoor flexibility and substrate binding via electrostatic and dynamic mechanisms. This AI-driven approach navigates vast sequence space efficiently, enabling superior enzymes for biotechnological applications in RNA therapeutics and beyond.

Graphical Abstract


Introduction

RNA oligonucleotides hold immense promise for diverse biomedical applications [1], ranging from enabling technologies such as next-generation sequencing [2] and mRNA vaccine development [3] to therapeutic modalities such as antisense oligonucleotides (ASOs) and short-interfering RNAs (siRNAs) for treating genetic disorders such as spinal muscular atrophy (SMA) [4, 5]. Furthermore, they serve as crucial components in biosensors and signaling pathways for rapid disease diagnostics, including viral pathogen detection [6].

Currently, the in vitro production of RNA oligonucleotides predominantly relies on two strategies: chemical synthesis [7] and enzymatic synthesis via in vitro transcription [8]. Although well established, conventional chemical methods, notably solid-phase phosphoramidite synthesis [9], have inherent limitations: efficiency and fidelity decline with increasing length, which makes highly pure long RNAs difficult to produce [10, 11], and the process generates chemical waste. Enzymatic synthesis using RNA polymerases presents an attractive alternative, offering potential advantages in generating high-quality, diverse RNA sequences [12], simplifying downstream purification, and potentially reducing production costs. However, achieving efficient and precisely controlled enzymatic RNA synthesis faces persistent challenges. Template-dependent transcription often yields heterogeneous products, while template-independent polymerases typically exhibit lower processivity or struggle with incorporating non-canonical or modified nucleotides necessary for specific applications [13, 14].

Poly(U) polymerase (PUP), also known as Cid1, is a member of the non-canonical polymerase X family within the broader polymerase β superfamily [15, 16, 17]. Its architecture comprises an N-terminal catalytic domain structurally interwoven with a C-terminal central domain. The N-terminal domain adopts a characteristic fold featuring a mixed β-sheet scaffold supported by α-helices H2 and H3, while the C-terminal domain is predominantly α-helical, consisting of six α-helices interspersed with two short β-strands [15–17, 18]. Functionally, these domains cooperate to form the active site cleft, with specific structural motifs acting analogously to “finger” subdomains, governing substrate nucleotide recognition and binding kinetics, and facilitating the subsequent polymerization reaction. Detailed biochemical characterization has revealed Cid1’s pronounced substrate specificity. It exhibits a strong preference for UTP incorporation (Km ~12 μM) compared with ATP (Km ~312 μM), while displaying significantly lower catalytic efficiency towards CTP and GTP [18]. This inherent bias towards uridylation is remarkably robust, as Cid1 preferentially incorporates UTP even when challenged with a 10-fold molar excess of ATP [18]. While advantageous for poly(U) tail synthesis in vivo, this narrow substrate scope and the potentially suboptimal catalytic rates for non-preferred nucleotides limit its use in versatile in vitro RNA synthesis applications. Consequently, engineering PUP to enhance its catalytic efficiency and potentially modulate its substrate specificity is a key goal. A significant advancement was recently reported by Wiegand et al., who developed a template-independent enzymatic RNA synthesis platform leveraging engineered variants of Schizosaccharomyces pombe Cid1 (CID1) PUP, specifically H336R and H336R-N171A-T172S [12]. 
By employing optimized 3′-O-blocking groups and cyclic reaction conditions, their system demonstrated efficient and controllable synthesis, offering a novel paradigm for scalable RNA therapeutic manufacturing. Building upon this foundational work, the present study focuses on further engineering the core PUP enzyme through strategic truncation and targeted mutagenesis. Our objective is to develop PUP variants exhibiting enhanced catalytic efficiency (e.g. improved kcat/Km), increased thermostability, higher recombinant protein yield, and potentially broadened substrate compatibility, particularly towards nucleotides relevant for controlled synthesis. Success in this endeavor aims to refine this enzymatic platform for more robust and efficient production of functional RNA oligonucleotides, thereby providing superior tools and expanded options for RNA drug discovery and development.

Addressing such protein engineering challenges through traditional experimental methods, such as extensive site-directed mutagenesis or directed evolution, often proves laborious and resource-intensive. Fortunately, recent advancements in computational biology have ushered in powerful in silico strategies. Structure-based rational design, machine learning (ML) models, and burgeoning protein language models now offer predictive power to significantly accelerate the enzyme design–build–test–learn cycle, drastically reducing the reliance on large-scale, often low-yield, experimental screening campaigns [19–21, 22].

From a broader evolutionary perspective, PUPs belong to the non-canonical polymerase X family within the polymerase β superfamily, with homologs distributed across diverse eukaryotic lineages. Comparative phylogenetic analysis of PUP family members (Supplementary Fig. S1) reveals distinct clades corresponding to species-specific adaptations, conserved catalytic motifs, and domain architectures. This evolutionary context not only underscores the structural conservation underpinning UTP preference but also provides a rational framework for selecting candidate sequences or ancestral variants as scaffolds in engineering efforts aimed at enhancing catalytic performance and substrate versatility.

Recent advances in protein large language models (Prot-LLMs) offer unprecedented capabilities for protein science, providing powerful computational frameworks to predict, analyze, and design protein structures and functions, thereby accelerating experimental validation and downstream applications [23]. Capitalizing on these developments, this study leverages Prot-LLM methodologies, particularly models from the evolutionary scale modeling (ESM) family, to engineer novel Cid1 variants with desired properties. We employ an iterative design–predict–test strategy for both de novo generation and targeted variant optimization [24–29]. Distinguishing our approach, we utilize the state-of-the-art ESM3 model [30] for generative design, integrating information across multiple biological scales: evolutionary context from sequence families, spatial constraints from predicted or known structures, and fine-grained details from active site regions. Generated sequences are computationally pre-screened using tools such as COMPSS [21] to predict potential catalytic activity, alongside various established methods for evaluating protein solubility and expression propensity, aiming to prioritize candidates possessing both high activity and favorable biophysical characteristics. For the design of Cid1 variants, we use ESM1v [31] for supervised classification learning, expanding the training set through an iterative prediction–experiment process to enable more accurate predictions. Solubility and expressibility assessments are similarly applied during this variant selection process. Top-scoring candidates emerging from both the generative and optimization pipelines will be subjected to rigorous biochemical characterization, including detailed enzyme kinetic assays to quantify their activity towards modified nucleotide substrates. 
Subsequently, molecular dynamics (MD) simulations will be employed to elucidate the structural and dynamic mechanisms underpinning any observed alterations in substrate specificity or catalytic efficiency. This research endeavors not only to enhance Cid1’s performance metrics (e.g. catalytic efficiency and substrate tolerance for modified nucleotides) but also to demonstrate and refine a rapid, computationally driven workflow for enzyme design and screening. Ultimately, this work aims to provide valuable tools for synthetic biology and establish transferable strategies for protein engineering endeavors.

Materials and methods

Materials

Oligonucleotides were purchased from Azenta (Suzhou, China). The Mut Express II Fast Mutagenesis Kit V2 was purchased from Vazyme (Suzhou, China); rNTPs and 3′-O-allyl-NTPs were purchased from syngenebio (Jiangsu, China); Ni-NTA resin and 2× TBE–urea sample buffer were purchased from Sangon (Shanghai, China); and 10× Bugbuster and Amicon Ultra centrifugal filters [molecular weight cut-off (MWCO) 10 kDa] were purchased from Sigma-Aldrich (St. Louis, MO, USA). SYBR Gold was purchased from Thermo Fisher Scientific (Waltham, MA, USA). Pyrophosphatase and phosphate sensor were purchased from Beyotime (Shanghai, China). The Protein Thermal Shift™ Dye Kit was purchased from Thermo Fisher Scientific (Waltham, MA, USA). The Oligonucleotide Clean and Concentrator Kit was purchased from Zymo Research (Irvine, CA, USA). Sodium tetrachloropalladate (II) (Na2PdCl4) and triphenylphosphine-3,3′,3′′-trisulfonic acid trisodium salt [P(PhSO3Na)3] were purchased from Sigma-Aldrich (St. Louis, MO, USA).

PUP crude enzyme screening system

The DNA sequence of wild-type (WT) PUP protein was synthesized and codon-optimized by Azenta, and then cloned into the pET28a(+) expression vector. The plasmid pET28a(+)-PUP was transformed into BL21 (DE3) competent cells. To make the plasmid pET28a(+)-PUPdel, the fragments were PCR amplified, digested, and ligated using the Mut Express II Fast Mutagenesis Kit V2. The plasmids of single or multiple PUPdel variants were prepared using the Mut Express II Fast Mutagenesis Kit V2, and the mutagenic primers were designed using the Vazyme CE Design Primer Design website (https://crm.vazyme.com/cetool/en-us/simple.html).

Single colonies were picked from overnight cultures and inoculated into 150 μl of LB liquid medium (50 μg ml−1 kanamycin) in a 96-well plate. The culture was grown overnight at 37°C with shaking at 220 rpm. Then, 10 μl of the overnight culture was transferred to 150 μl of LB liquid medium (50 μg ml−1 kanamycin) in a new 96-well plate and incubated for ~2 h at 37°C with shaking at 220 rpm. When the OD600 reached 0.6–0.8, 0.5 mM isopropyl-β-d-thiogalactopyranoside (IPTG) was added to induce protein expression, and the culture was incubated at 37°C with shaking at 220 rpm for 4 h. After induction, 3 μl of 10× Bugbuster was added to the 150 μl culture in each well, mixed thoroughly, and incubated at room temperature for 20 min to obtain crude enzyme extract. The 10 μl crude enzyme reaction system included 1 μM poly(T)/mU (TTTTTTTTTTTTTTTTmUmUmUmU, m: 2′-OMe modification; the 5′ end was labeled with Alexa Fluor 488), 1× PUP reaction buffer [10 mM Tris–HCl, 50 mM NaCl, 10 mM MgCl2, 1 mM dithiothreitol (DTT) pH 7.9], 2 mM MnCl2, 0.125 mM 3′-O-allyl-A/GTP or 0.125 mM 3′-O-allyl-U/CTP, and 3 μl of crude enzyme. The reaction was conducted at 37°C for 40 min, followed by addition of an equal volume of 2× TBE–urea sample buffer to quench the reaction. The products were analyzed using denaturing 20% polyacrylamide gel electrophoresis (PAGE). To optimize the n + 2 phenomenon, the 10 μl crude enzyme reaction system included 1 μM poly(T)/mU, 1× PUP reaction buffer, 2 mM MnCl2, 0.25 mM 3′-O-allyl-CTP, and 4 μl of crude enzyme. The reaction was carried out at 37°C for 3 h, followed by the addition of an equal volume of 2× TBE–urea sample buffer to stop the reaction. The products were analyzed using denaturing 20% PAGE and stained with SYBR Gold.

PUP expression and purification

A single colony was inoculated into LB medium (50 μg ml−1 kanamycin) and cultured overnight at 37°C. The overnight culture was then subcultured at a 1:100 dilution into fresh LB medium (50 μg ml−1 kanamycin) and incubated at 37°C. When the cells reached the log phase (OD600 = 0.6–0.8), 0.5 mM IPTG was added to induce protein expression, and the culture was incubated overnight at 16°C. Afterwards, the cells were harvested by centrifugation, resuspended in Buffer 1 (50 mM Tris–HCl, 500 mM NaCl, pH 7.0), and lysed by high-pressure homogenization. After centrifugation, the supernatant was collected and the PUP protein was purified from the supernatant using Ni-NTA resin. The protein was analyzed by sodium dodecyl sulfate (SDS)–PAGE, and then concentrated using an Amicon Ultra centrifugal filter (MWCO 10 kDa) in Storage Buffer 1 (25 mM Tris–HCl, 150 mM NaCl, pH 7.0). The purified protein was stored at −80°C.

PUP activity assays

The 10 μl reaction mixture included 1 μM poly(U) (UUUUUUUUUUUUUUUUUUUU), 1× PUP reaction buffer, 2 mM MnCl2, 0.25 mM 3′-O-allyl-NTP, and 0.1 mg ml−1 purified enzyme. The reaction was carried out at 37°C for 1 min. Afterwards, an equal volume of 2× TBE–urea sample buffer was added to quench the reaction. The products were analyzed by denaturing 20% PAGE and stained with SYBR Gold.

Allyl ether deblocking reactions and n + 2 reactions

The 10 μl n + 1 reaction mixture included 1 μM poly(U), 1× PUP reaction buffer, 2 mM MnCl2, 0.25 mM 3′-O-allyl-NTP, and 0.2 mg ml−1 purified enzyme. The reaction was carried out at 37°C for 2 min. Afterwards, the oligonucleotide was purified using the Oligonucleotide Clean and Concentrator Kit. The allyl ether deblocking reaction mixture included 10 mM Tris–HCl (pH 6.5), 1.15 mM Na2PdCl4, 8.8 mM P(PhSO3Na)3, and the purified product. All deblocking reactions were carried out at 65°C for 15 min. The deblocked oligonucleotide was then purified using the Oligonucleotide Clean and Concentrator Kit. The purified deblocked oligonucleotide from the n + 1 reaction was used as the primer for the n + 2 reaction. The n + 2 reaction was carried out as described above.

Kinetic assays

The 20 μl standard curve reaction mixture included 1 μM poly(U), 1× PUP reaction buffer, 2 mM MnCl2, different concentrations of Na4P2O7, 0.2 U ml−1 pyrophosphatase, and 2 μM phosphate sensor. The reaction was performed in a 384-well plate. The parameters for the microplate reader were as follows: fluorescence excitation/emission wavelengths of 425/450 nm, reaction temperature at 37°C, and detection interval of 5 s. The stable fluorescence values were plotted against the corresponding pyrophosphate (PPi) concentrations to generate the PPi standard curve. The 20 μl kinetic reaction mixture included 1 μM poly(U), 1× PUP reaction buffer, 2 mM MnCl2, different concentrations of 3′-O-allyl-NTP, 0.2 U ml−1 pyrophosphatase, 2 μM phosphate sensor, and 0.1 mg ml−1 purified enzyme. The reaction was conducted in a 384-well plate with the same microplate reader parameters as for the standard curve. The amount of PPi released was calculated from the standard curve and plotted against time for each substrate concentration. The slope of each curve represents the enzyme reaction rate at that substrate concentration, and these rates were fitted to the Michaelis–Menten equation as a function of substrate concentration.
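The rate extraction and Michaelis–Menten fit described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: it assumes the rate is taken as the least-squares slope of PPi versus time, and it uses the Hanes–Woolf linearization ([S]/v = [S]/Vmax + Km/Vmax) in place of whatever fitting routine was actually used. All function names and the synthetic data are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def initial_rate(times_s, ppi_um):
    """Reaction rate (uM per s) as the slope of released PPi versus time."""
    slope, _ = linear_fit(times_s, ppi_um)
    return slope


def michaelis_menten_fit(substrate_um, rates):
    """Estimate (Vmax, Km) from the Hanes-Woolf line [S]/v versus [S]."""
    s_over_v = [s / v for s, v in zip(substrate_um, rates)]
    slope, intercept = linear_fit(substrate_um, s_over_v)
    vmax = 1.0 / slope          # slope of the line is 1 / Vmax
    km = intercept * vmax       # intercept of the line is Km / Vmax
    return vmax, km


# Synthetic, noise-free data generated from assumed Vmax = 2.0 and Km = 50.0.
conc = [10.0, 25.0, 50.0, 100.0, 250.0]
rates = [2.0 * s / (50.0 + s) for s in conc]
vmax, km = michaelis_menten_fit(conc, rates)

# One synthetic time course sampled every 5 s, rising by 0.5 uM per reading.
rate_example = initial_rate([0, 5, 10, 15], [0.0, 0.5, 1.0, 1.5])
```

With real, noisy data a direct non-linear least-squares fit of v = Vmax[S]/(Km + [S]) is generally preferred, because linearizations weight the measurement error unevenly.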

Tm assays

The melting temperature (Tm) of PUP was determined by the Protein Thermal Shift™ Dye Kit. The 20 μl melt reaction mixture included 5.0 µl of Protein Thermal Shift™ Buffer, 12.5 µl of protein in buffer (containing 3 μg of PUP), and 2.5 µl of Diluted Protein Thermal Shift™ Dye (8×). The thermal profile was as follows: Step 1, temperature 25°C; time, 2 min; ramp rate, 1.6°C s–1; Step 2, temperature, 99°C; time, 2 min; ramp rate, 0.1°C s–1. In the targets section, in the reporter list, ROX was selected for each target. The Tm of PUP was analyzed by the melt curve.

Acquisition of mutation sites (pseudo-log-likelihood ratio)

We quantified the effect of in-frame insertions and deletions on protein sequences by defining an effect score as the pseudo-log-likelihood ratio (PLLR) between the mutated and WT amino acid sequences. This scoring method allows us to predict how protein function may be impacted, in particular by identifying regions of high variability where mutations are less likely to disrupt core functions. To estimate the pseudo-log-likelihoods, we used a combination of pre-trained language models designed for protein sequences, specifically the ESM1b and ESM1v models [32]. With the WT amino acid sequence (PUPdel) as input, these models output the log-likelihood of each of the 20 standard amino acids (including the WT amino acid) at each position of the protein sequence. The PLLR score of each mutation is the difference between the log-likelihood of the missense and WT amino acids at that position. These models have been shown to effectively capture the biophysical properties of proteins and predict the functional impact of mutations.
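The per-substitution score defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-position logits are toy numbers standing in for the output of a single protein language model, and no ESM library call is shown. The text does not specify how ESM1b and ESM1v outputs were combined, so that step is omitted. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Convert raw logits over the 20 amino acids into log-likelihoods."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def pllr_scores(wt_seq, per_position_logits):
    """Score every substitution as log P(mutant aa) - log P(WT aa) at its position.

    Keys use the usual notation, e.g. 'K2R' = Lys at position 2 replaced by Arg.
    """
    scores = {}
    for i, (wt_aa, logits) in enumerate(zip(wt_seq, per_position_logits)):
        logp = dict(zip(AMINO_ACIDS, log_softmax(logits)))
        for aa in AMINO_ACIDS:
            if aa != wt_aa:
                scores[f"{wt_aa}{i + 1}{aa}"] = logp[aa] - logp[wt_aa]
    return scores


# Toy two-residue "protein" with made-up logits (hypothetical values).
wt = "MK"
toy_logits = [[0.0] * 20, [0.0] * 20]
toy_logits[0][AMINO_ACIDS.index("M")] = 3.0   # model strongly favours WT Met
toy_logits[1][AMINO_ACIDS.index("K")] = 1.0
toy_logits[1][AMINO_ACIDS.index("R")] = 2.0   # model prefers Arg over WT Lys
scores = pllr_scores(wt, toy_logits)
```

A negative score means the model considers the substitution less likely than the WT residue; a threshold can then be applied to these scores to shortlist candidates.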

Thermodynamic stability and binding affinity predictions

Changes in thermodynamic stability and binding affinity (ΔΔGstability, ΔΔGbinding) were predicted using Rosetta (GitHub SHA1 99d33ec59ce9fcecc5e4f3800c778a54afdf8504) with the Cartesian ddG protocol on PUPdel structures [32, 33]. The ΔΔG values obtained from Rosetta were divided by 2.9 to convert them from Rosetta energy units onto a scale corresponding to kcal mol–1.

Structure model preparation

The 3D structures of the PUP–NTP complex were constructed using AlphaFold3 [34]. Since the AlphaFold3 model does not include the UTP and CTP substrates, the UTP substrate was modeled by superimposing it with the PUP crystal structure (code: 4E80). The CTP substrate was obtained by locally modifying the PUP–UTP complex using Discovery Studio 2019 software [35]. Since the experiments employed 3′-O-terminal blocking modifications, the 3′-O of the substrates in all four complexes was modified with an allyl group. The allyl variants were constructed using Chimera Software 1.13.1 (developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, CA, USA) [35].

Molecular dynamics simulations

To understand the mechanism, MD simulations were conducted for the reaction systems of apo PUPdel and the PUPdel–3′-O-allyl-NTP complexes. These simulations were performed using the PMEMD engine of the Amber22 software [36], with detailed system specifications provided in the accompanying table. The modified substrate 3′-O-allyl-NTP was optimized at the B3LYP/def2-SVP level, and its atomic charges were derived by fitting to the electrostatic potential (ESP) using the restrained ESP (RESP) method at the B3LYP/def2-TZVP level [37]. The Amber FF19SB force field [38] was employed for the protein, while the four modified nucleotide substrates were described with the GAFF2 force field [39]. The MCPB.py tool [40] was used to process ions and construct the force field for metal ions. Non-bonded electrostatic interactions were handled using the Particle Mesh Ewald (PME) algorithm [41]. The simulation box was set with an 8 Å distance from the solute surface and filled with the OPC water model [42]. Sodium ions were added randomly to neutralize the system, and energy minimization was performed using the steepest descent and conjugate gradient algorithms to resolve atomic collisions in the initial structure. Following this, 50 ps NVT and 50 ps NPT simulations [43] were conducted to stabilize the system, with the temperature maintained at 300 K using the Langevin thermostat [44] and the pressure controlled using the Langevin barostat [45]. Once thermodynamic parameters were stabilized, an initial 50 ns conventional MD simulation was run to calculate GaMD acceleration parameters, followed by an additional 50 ns for GaMD equilibration [46]. Production GaMD simulations were then performed for 200 ns with a time step of 2 fs, saving coordinates every 0.1 ns, which yielded a total of 2000 GaMD frames for subsequent analysis.
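As a quick consistency check on the production-run settings above, the short calculation below relates the stated run length, time step, and sampling interval to the number of saved frames. It is plain arithmetic on the numbers given in the text; the variable names are illustrative only.

```python
from fractions import Fraction

production_ns = 200                  # length of the GaMD production run
save_interval_ns = Fraction(1, 10)   # coordinates saved every 0.1 ns
timestep_fs = 2                      # integration time step

# Number of saved frames: 200 ns / 0.1 ns per frame.
frames = int(Fraction(production_ns) / save_interval_ns)

# Number of integration steps: 1 ns = 1,000,000 fs.
total_steps = production_ns * 1_000_000 // timestep_fs

# Integration steps between two saved frames.
steps_per_frame = total_steps // frames
```

The 2000 saved frames correspond to one frame every 50,000 integration steps over 100 million steps in total.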

Round 2: double-mutant combination

A total of 108 mutants were obtained, of which 51 were negative and 57 were positive. These, together with the positive WT (a total of 109 samples), were used as the training set for the second round. To make full use of this dataset, we performed protein feature embedding on the training set using ESM1v and ESM2b [47], and then evaluated with 5-fold cross-validation using Support Vector Machine (SVM), Random Forest (RF), and Multi-Layer Perceptron (MLP) [48–50]. The specific training strategies are as follows.

For SVM and RF, we used the scikit-learn (sklearn) package with grid search and 5-fold cross-validation. The RF parameters were as follows: {"n_estimators": [25, 50, 100, 200, 400], "min_samples_split": [2, 4, 8], "min_samples_leaf": [1, 2, 4, 8], "max_features": ["sqrt", "log2", None]}. The SVM parameters were searched as three kernel-specific sub-grids: params = [{"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]}, {"kernel": ["poly"], "C": [0.01, 0.1, 1, 10, 100], "degree": [2, 3]}, {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10, 100, 1000], "gamma": [1, 0.1, 0.01, 0.001]}].
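To make the size of these searches concrete, the sketch below expands the two grids into explicit candidate settings using only the standard library. It assumes the SVM grid is meant as three kernel-specific sub-grids: a single Python dict would keep only the last value of a repeated key such as "kernel", and a list of dicts is the form that scikit-learn's GridSearchCV accepts for this purpose. The helper `expand` is hypothetical and only mimics how such a grid is enumerated.

```python
from itertools import product

rf_grid = {
    "n_estimators": [25, 50, 100, 200, 400],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", None],
}

# One sub-grid per kernel, so each kernel only sees its own hyperparameters.
svm_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]},
    {"kernel": ["poly"], "C": [0.01, 0.1, 1, 10, 100], "degree": [2, 3]},
    {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10, 100, 1000],
     "gamma": [1, 0.1, 0.01, 0.001]},
]


def expand(grid):
    """Yield every parameter combination from a dict or a list of dicts."""
    sub_grids = grid if isinstance(grid, list) else [grid]
    for sub in sub_grids:
        keys = sorted(sub)
        for values in product(*(sub[k] for k in keys)):
            yield dict(zip(keys, values))


rf_candidates = list(expand(rf_grid))
svm_candidates = list(expand(svm_grid))

# Each candidate is fitted once per fold in 5-fold cross-validation.
n_folds = 5
rf_fits = len(rf_candidates) * n_folds
svm_fits = len(svm_candidates) * n_folds
```

Under these assumptions the RF search covers 180 settings and the SVM search 39, which is small enough to run exhaustively on a training set of about a hundred samples.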

For MLP, the model structure included three fully connected layers with ReLU activation, and the final layer used a sigmoid activation to map to probability. The binary cross-entropy loss function (BCELoss) was used as the objective function, and the Adam optimizer was used for parameter optimization with early stopping to prevent overfitting.

We then predicted the saturation mutations for the positive sites and positive double-point combinations, and selected the top 20 for experimental validation.

Round 3: multisite combination

We added the second-round experimental results to the second-round dataset to build the third-round dataset, which contains 129 samples. Because the dataset had grown, we used additional methods for protein feature embedding, including ESM3 [51], ESM1v [52], COMPSS [55], aaindex [53], Peptides (https://github.com/althonos/peptides.py), and PSSM [54]; the model was the MLP from the second round. Subsequently, we applied four different combination schemes of positive single-point and positive double-point variants: 1 + 1 + 1, 1 + 1 + 1 + 1, 2 + 1, and 2 + 2. The sequences generated from each combination scheme were scored using the best-performing model, and the highest-scoring sequences from each scheme, i.e. 9, 4, 3, and 3 sequences, respectively, were selected for experimental validation.
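The four combination schemes above can be enumerated as in the following sketch. It is an illustration rather than the authors' code: the mutation names are made up, the rule that no position may be mutated twice is an assumption, and `top_k` takes any scoring function as a stand-in for the trained MLP.

```python
from itertools import combinations

# Scheme name -> (number of single-point variants, number of double-point variants)
SCHEMES = {"1+1+1": (3, 0), "1+1+1+1": (4, 0), "2+1": (1, 1), "2+2": (0, 2)}


def position(mutation):
    """Residue number of a mutation written like 'A10V'."""
    return int(mutation[1:-1])


def enumerate_scheme(singles, doubles, n_singles, n_doubles):
    """All multi-mutants built from n_singles singles and n_doubles doubles."""
    variants = []
    for s_pick in combinations(singles, n_singles):
        for d_pick in combinations(doubles, n_doubles):
            muts = list(s_pick) + [m for pair in d_pick for m in pair]
            positions = [position(m) for m in muts]
            # Assumption: skip combinations that hit the same position twice.
            if len(set(positions)) == len(positions):
                variants.append(tuple(sorted(muts, key=position)))
    return variants


def top_k(variants, score_fn, k):
    """Keep the k highest-scoring variants under an arbitrary scoring function."""
    return sorted(variants, key=score_fn, reverse=True)[:k]


# Hypothetical positive singles and doubles, for illustration only.
singles = ["A10V", "K20R", "S30T", "L40M"]
doubles = [("A10V", "K20R"), ("S30T", "D50E")]
counts = {
    name: len(enumerate_scheme(singles, doubles, n_s, n_d))
    for name, (n_s, n_d) in SCHEMES.items()
}
```

In the workflow described above, the per-scheme quotas (9, 4, 3, and 3) would be passed as `k` together with the model's predicted probability as `score_fn`.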

Sequence generation

We used the latest ESM3 to generate new PUPs, following the Gibbs sampling generation strategy from ESM-PGEN [55]. Specifically, we constructed a seed sequence set that included the 63 positive sequences from the second-round variants. We generated 500 sequences using the Gibbs sampling strategy [55]; detailed procedures can be found in the “Sequence Generation Method” section of the Supplementary Data.

After removing sequences containing “X”, we obtained 495 sequences. We then used Colab AlphaFold2 [56] to generate 3D structures for these sequences and computed a weighted score over all COMPSS and ProtSoluCollect metrics, with each item weighted at 0.05, except for ESM1v and ProteinMPNN [57], which were weighted at 0.1 because of their importance as reported for COMPSS. For ProtSoluCollect, we included four solubility prediction metrics for Escherichia coli expression: GATSol [58], NetSolP [59], SoluProt [60], and Protein-Sol [61]. We then performed Cd-Hit [62] clustering on the 495 sequences and selected the top sequences from the top 10 clusters with the highest scores as candidate sequences for experimental validation.
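The weighting and cluster-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: every metric is taken to be already normalized so that higher is better, the metric names in the example are placeholders, and "top sequences from the top 10 clusters" is read as the single best-scoring sequence from each of the 10 clusters whose best member scores highest. The text does not confirm that reading.

```python
# Weights from the text: 0.1 for ESM1v and ProteinMPNN, 0.05 for every other item.
WEIGHTS = {"ESM1v": 0.1, "ProteinMPNN": 0.1}
DEFAULT_WEIGHT = 0.05


def weighted_score(metrics):
    """Weighted sum of normalized metric values for one sequence."""
    return sum(WEIGHTS.get(name, DEFAULT_WEIGHT) * value
               for name, value in metrics.items())


def pick_cluster_representatives(scores, clusters, n_clusters=10):
    """Best-scoring sequence of each cluster, for the n_clusters best clusters."""
    best = {}
    for seq_id, score in scores.items():
        cluster_id = clusters[seq_id]
        if cluster_id not in best or score > best[cluster_id][1]:
            best[cluster_id] = (seq_id, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return [seq_id for seq_id, _ in ranked[:n_clusters]]


# Hypothetical normalized metrics for one generated sequence.
example = weighted_score({"ESM1v": 1.0, "ProteinMPNN": 0.5, "NetSolP": 0.8})

# Hypothetical scores and cluster assignments for three sequences.
scores = {"seq1": 0.9, "seq2": 0.7, "seq3": 0.8}
clusters = {"seq1": 0, "seq2": 0, "seq3": 1}
picked = pick_cluster_representatives(scores, clusters, n_clusters=2)
```

Taking one representative per cluster keeps the experimentally tested set diverse, since near-identical high scorers from the same cluster are not all selected.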

Results

Design and engineering workflow for a PUP enzyme

In this work, we engineered PUP variants using a multi-round, closed-loop workflow that integrates rational computational modeling, ML, and generative artificial intelligence (AI) to efficiently navigate the vast protein sequence space (Fig. 1).

Figure 1.


Computational evolution: forging a superior polymerase for controlled RNA synthesis. (a) A multi-round computational pipeline for engineering a superior enzyme. (b) A typical cycle of enzymatic synthesis begins with the enzymatic extension of the initiator oligonucleotide. Synthesis occurs in the 5′-to-3′ direction with a reversible terminator NTP and enzyme. A deblocking step then occurs to remove the reversible terminator group from the extended oligonucleotide, allowing the next cycle of synthesis to commence. When the desired length and composition have been reached, the final oligonucleotide product is isolated.

We initiated our study by performing GaMD simulations on the WT PUP to analyze its stability and catalytic mechanism. These insights guided the design of a truncated variant, PUPdel, which exhibited proficient incorporation of four modified nucleotides and served as the starting template for further optimization. Subsequently, we explored the mutational landscape via a high-throughput in silico screen. This process involved using a pre-trained protein language model (e.g. ESM1v) to predict the functional impact of saturation mutagenesis, followed by physics-based stability calculations (Rosetta ΔΔG) to filter for structurally viable candidates.

To navigate the complex combinatorial search space, we developed a supervised ML model. This model was trained on a sparsely sampled experimental dataset of single-mutant activities, using high-dimensional ESM1v-derived embeddings as feature vectors. The trained model enabled high-fidelity prediction of synergistic effects within thousands of unexplored multi-mutant variants, guiding the selection of promising combinatorial candidates.

The final stage transitioned from predictive screening to de novo generative design. A generative protein language model (ESM3) proposed novel, potentially high-activity sequences, which were then subjected to a stringent multi-objective optimization pipeline. This pipeline integrated diverse computational metrics, spanning evolutionary conservation (MSA), predicted structure (AlphaFold2), biophysical properties (e.g. solubility via Protein-Sol), and sequence-based fitness scores, to down-select a diverse set of superior enzyme candidates for experimental validation and application in enzymatic RNA synthesis.

Rational design and computational characterization of a truncated PUP

Seeking to rationally engineer a template-independent RNA polymerase with potentially enhanced catalytic activity and stability, we first performed GaMD simulations on the WT full-length CID1 PUP. Analysis of Cα atom temperature factors (B-factors) identified regions of significant flexibility at both the N-terminus (residues 1–12) and C-terminus (residues 378–405) (Fig. 2a). Based on these findings, these terminal segments were excised in silico to generate a truncated variant, designated PUPdel. The PUPdel mutant thus contains an N-terminal deletion of residues 1–12 and a C-terminal deletion of residues 378–405, both of which are in-frame deletions, removing flexible extremities while preserving the core catalytic domain.

Figure 2.


Design of a highly active truncated PUP. (a) Residue-wise Cα B-factor profile of WT PUP derived from MD simulations, highlighting regions of high backbone flexibility. (b) Root mean square deviation (RMSD) values calculated for backbone atoms of the PUP and PUPdel protein with the initial structure as the reference. (c) Conformational analysis of PUP and PUPdel with 3′-O-allyl-NTPs. The bar chart compares a key dihedral angle (θ) in the full-length enzyme (PUP, purple) and the truncated variant (PUPdel, blue) upon binding to 3′-O-allyl-NTPs. The inset illustrates the structural definition of the measured dihedral angle θ. Data are presented as the mean ± standard deviation (SD). (d) Comparison of calculated binding free energies for 3′-O-allyl-NTPs with WT PUP and PUPdel. (e) Structural superposition highlighting key conformational differences, particularly in the β-trapdoor region and active site residue positioning, between representative stable states of WT PUP and PUPdel from GaMD simulations. (f) Representative binding mode of the modified substrate with PUPdel. The complex conformation was derived from GaMD simulations and selected from the most populated cluster in the trajectory analysis to illustrate the predominant binding state. (g) Gel electrophoresis analysis comparing the incorporation capability of four 3′-O-allyl-NTPs by WT PUP versus PUPdel. (h) Michaelis–Menten plots depicting the reaction velocity of PUPdel as a function of varying concentrations for each 3′-O-allyl-NTP substrate.

GaMD simulations were conducted on PUP and PUPdel to evaluate their interaction profiles with 3′-O-allyl-modified nucleotides. These simulations revealed that PUPdel exhibits significantly smaller fluctuations in backbone root mean square deviation (RMSD) than the full-length PUP, a finding indicative of its superior conformational stability (Fig. 2b). Furthermore, the catalytic pocket cleft of PUPdel displayed a smaller average dihedral angle than its WT counterpart (Fig. 2c). This more compact architecture is hypothesized to promote tighter substrate engagement, thereby enhancing catalytic efficiency. Binding free energy calculations predicted stronger affinity between PUPdel and its target substrates, 3′-O-allyl-UTP and 3′-O-allyl-ATP (Fig. 2d). This enhanced binding is suggestive of a higher propensity for their subsequent incorporation. Although binding affinity for 3′-O-allyl-GTP and 3′-O-allyl-CTP appeared slightly attenuated in PUPdel simulations, substantial binding contacts were maintained (Fig. 2f).

Structural comparisons between representative stable conformations of PUP and PUPdel revealed distinct features in the latter. Notably, the β-trapdoor region in PUPdel consistently adopted a stable, closed β-sheet conformation. Concomitantly, His324 (corresponding to His336 in full-length PUP) was observed to reorient and flip into the active site cavity (Fig. 2e), a conformation potentially facilitating substrate stabilization [16]. Intriguingly, detailed analysis of the 3′-O-allyl-UTP complex highlighted a shift in key stabilizing interactions. The hydrogen bond occupancy between the catalytic His324 and the uracil base was relatively low (18.19%), suggesting a diminished contribution to binding energy compared with canonical UTP interactions. Conversely, a robust hydrogen bond (81.83% occupancy; Supplementary Table S1) formed between the uracil base and Asn159. This observation points towards a potential compensatory mechanism where Asn159 assumes a more prominent role in stabilizing the modified nucleotide within the PUPdel active site. We hypothesize that this altered interaction network is induced by the presence of the 3′-O-allyl blocking group, leading to a distinct substrate binding mode in PUPdel (Fig. 2f). These computational findings provided a structural rationale for investigating PUPdel’s utility in enzymatic RNA synthesis involving modified nucleotides.

Subsequent activity assays of PUP and PUPdel with the substrates 3′-O-allyl-NTPs (Fig. 2g) demonstrated a significant enhancement in substrate incorporation efficiency by PUPdel. All four modified substrates were almost completely incorporated within 1 min. To further understand PUPdel’s substrate selectivity for 3′-O-allyl-NTPs, we evaluated the in vitro kinetic activity of PUPdel across different substrates and observed no significant preference among three of the modified nucleotides, 3′-O-allyl-ATP, 3′-O-allyl-GTP, and 3′-O-allyl-CTP (Fig. 2h). However, a strong preference was observed for 3′-O-allyl-UTP, indicating distinct selectivity towards this substrate. Given PUPdel’s superior performance in activity assays, subsequent mutagenesis efforts were based on this truncated variant.

Screening and design of superior PUPdel variants

Round 1: screening of single-point variants 

To identify beneficial single-point mutations within the optimized PUPdel scaffold, we initiated a multi-stage computational screening process coupled with experimental validation. First, we calculated PLLR scores using protein language models (ESM1v and ESM1b) for all possible single amino acid substitutions across the 365 residues of PUPdel. Higher PLLR scores generally indicate substitutions deemed more plausible or non-conservative within the protein family’s sequence context. Our primary filtering criteria selected mutations with PLLR scores > −5.0, while also considering the introduction of diverse physicochemical properties. Additionally, as a comparison group, 13 mutations at relatively conserved positions predicted to be potentially disruptive (PLLR score < −5.0, i.e. R127A, D89L, A345W, A126Y, L173W, E72W, S255V, T167Q, S81C, H324Y, A71T, L14M, and I322R) were specifically included. This initial filtering yielded 108 candidate mutations (Supplementary Table S1). Subsequently, these 108 mutations were subjected to further in silico evaluation using the Rosetta cartesian_ddg method to predict the change in folding stability and the change in binding affinity towards the 3′-O-allyl-UTP substrate. Based primarily on their predicted stability scores, the top 90 single-point variants were selected for experimental characterization. All 90 selected variants were successfully expressed recombinantly. Overall, 47.8% (43/90) of the tested variants exhibited detectable incorporation for the four 3′-O-allyl-NTPs (denoted AUGC). When assessed with specific substrate pairs, the success rates were 54.4% (49/90) for the 3′-O-allyl-UTP/CTP mix (UC) and 50.0% (45/90) for the 3′-O-allyl-ATP/GTP mix (AG). Analysis revealed a clear correlation between the PLLR score and experimental success (Fig. 3a). The 77 variants selected with PLLR scores > −5.0 demonstrated an activity success rate of 51.9% (40/77) for AUGC incorporation. In contrast, the 13 variants selected from conserved sites with low PLLR scores (< −5.0) exhibited a significantly lower success rate of only 23.1% (3/13), validating the utility of the PLLR score as an initial filter. Furthermore, a retrospective analysis indicated that applying stricter combined criteria, namely favorable predicted stability (ΔΔGstability < 0) together with higher PLLR scores (> −2.0; n = 25, Fig. 3b), enriched for active mutants, achieving an AUGC incorporation success rate of 64.0% (16/25). This suggests that integrating sequence-based language model predictions with structure-based stability calculations enhances the identification of functional PUPdel variants.
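The success rates reported for the different filtering groups follow directly from the active/tested counts given in the text. The short sketch below recomputes them and expresses the stricter retrospective criterion as a predicate. The function names and the example score values passed to the predicate are illustrative assumptions, not part of the original pipeline.

```python
# (active, tested) counts for AUGC incorporation, as reported in the text.
reported_counts = {
    "all tested variants": (43, 90),
    "PLLR > -5.0": (40, 77),
    "PLLR < -5.0 (conserved-site comparison group)": (3, 13),
    "PLLR > -2.0 and ddG_stability < 0": (16, 25),
}


def success_rate(active, tested):
    """Percentage of tested variants with detectable incorporation."""
    return 100.0 * active / tested


def passes_strict_filter(pllr, ddg_stability, pllr_cutoff=-2.0, ddg_cutoff=0.0):
    """Retrospective combined criterion: higher PLLR and predicted stabilizing."""
    return pllr > pllr_cutoff and ddg_stability < ddg_cutoff


for group, (active, tested) in reported_counts.items():
    print(f"{group}: {active}/{tested} = {success_rate(active, tested):.1f}%")

# Hypothetical scores for two candidate substitutions (not measured values).
print(passes_strict_filter(pllr=-1.0, ddg_stability=-0.5))
print(passes_strict_filter(pllr=-3.0, ddg_stability=-0.5))
```

Keeping the thresholds as keyword arguments makes it straightforward to test how the enrichment changes as the cutoffs are tightened or relaxed.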

Figure 3.

Screening of single-point variants. (a) Distribution plots comparing calculated PLLR scores and predicted folding stability changes (Rosetta ΔΔGstability) between experimentally determined active and inactive single-point PUPdel variants from the initial screen. (b) Analysis of substrate incorporation success rates for PUPdel variants grouped according to different computational filtering criteria (combined PLLR and ΔΔGstability criteria). (c) Representative gel electrophoresis analysis showing 3′-O-allyl-NTP incorporation activity for a subset of the screened PUPdel single-point variants. (d) Structural basis for key beneficial mutations. The central panel shows the binding mode of 3′-O-allyl-UTP (ball-and-stick model) within the active site of PUPdel. Inset views provide detailed comparisons between the WT (cyan) and mutated (magenta) residues, illustrating their interactions with the surrounding environment or the substrate. Potential hydrogen bond interactions are indicated by yellow dashed lines. (e) Comparison of experimentally determined Tms for selected purified PUPdel variants relative to the WT PUPdel. (f) Steady-state kinetic analysis of PUPdel with different 3′-O-allyl-NTP substrates. The plot shows the initial reaction velocity as a function of concentration for four different modified substrates (3′-O-allyl-ATP, -UTP, -GTP, and -CTP). Data points represent the mean ± SD from three independent replicates, and the solid lines are fits to the Michaelis–Menten equation.

To further validate the activity and stability of these variants, we selected a representative subset of seven single-point variants (C63I, I45P, K132R, I128V, N159F, D97P, and L223Q), chosen to sample different predicted stability profiles and PLLR scores, for recombinant expression and purification (incorporation activity results are shown in Fig. 3c). The spatial locations of these seven variants within the protein are illustrated in Fig. 3d. Among them, the N159F mutation is positioned in the catalytic site. The introduction of a phenylalanine residue at this position is proposed to enable a π-stacking interaction with the pyrimidine ring of the 3′-O-allyl-UTP substrate, thereby stabilizing and anchoring it. The L223Q mutation introduces a hydrogen bond to Pro319, while K132R enhances electrostatic attraction to Asp148. Three of the remaining variants (C63I, I45P, and D97P) contribute to increased local protein stability. Thermal stability analysis, conducted by determining the Tm, revealed distinct effects of the mutations. While the K132R, I128V, and L223Q variants exhibited thermal stability comparable with the WT PUPdel, the C63I, I45P, and D97P variants demonstrated significantly enhanced stability relative to PUPdel (Fig. 3e). Furthermore, we randomly selected the I45P mutant for kinetic assays (Table 1). Compared with PUPdel, the I45P variant displayed a notable ~4.0-fold increase in catalytic efficiency (kcat/Km = 7.9 × 10⁴ M⁻¹ min⁻¹) for incorporating 3′-O-allyl-CTP. Additionally, it exhibited moderately improved catalytic efficiency (~1.5-fold) towards the other three modified substrates (3′-O-allyl-ATP, -UTP, and -GTP). To gain further insights into the structural mechanism underlying the enhanced catalytic efficiency of the I45P mutant, we performed 200 ns MD simulations and binding energy analyses (Supplementary Fig. S9). These results revealed structural stabilization and improved electrostatic complementarity around the active site, which probably contribute to the observed activity enhancement. Detailed simulation procedures and analyses are provided in the Supplementary Data.

Table 1.

Kinetic parameters for PUPdel and variants against four 3′-O-allyl-NTPs

Parameter             Variant    3′-O-Allyl-ATP   3′-O-Allyl-UTP   3′-O-Allyl-GTP   3′-O-Allyl-CTP
Km (μM)               PUP full   3.50 ± 0.75      2.29 ± 0.51      6.64 ± 1.41      89.62 ± 19.78
Km (μM)               PUPdel     8.66 ± 0.88      0.72 ± 0.14      7.17 ± 1.21      14.88 ± 4.62
Km (μM)               MT1        6.39 ± 1.40      0.92 ± 0.22      5.24 ± 1.40      5.72 ± 1.72
Km (μM)               MT2        5.47 ± 0.89      2.98 ± 0.73      1.34 ± 0.36      11.08 ± 2.64
Km (μM)               MT21       5.51 ± 2.18      0.85 ± 0.24      4.75 ± 0.94      11.72 ± 3.00
Km (μM)               MT32       5.28 ± 1.66      1.06 ± 0.26      8.05 ± 1.26      5.25 ± 1.26
Km (μM)               MT41       6.60 ± 1.19      1.24 ± 0.25      5.62 ± 0.65      11.00 ± 1.40
Km (μM)               PUPdel2    4.95 ± 0.82      0.74 ± 0.13      9.98 ± 1.62      11.58 ± 3.76
kcat (min⁻¹)          PUP full   0.35 ± 0.02      0.24 ± 0.01      0.25 ± 0.02      0.10 ± 0.01
kcat (min⁻¹)          PUPdel     0.82 ± 0.04      0.67 ± 0.02      0.49 ± 0.03      0.30 ± 0.03
kcat (min⁻¹)          MT1        0.95 ± 0.06      1.07 ± 0.06      0.57 ± 0.04      0.45 ± 0.03
kcat (min⁻¹)          MT2        0.53 ± 0.03      0.66 ± 0.04      0.42 ± 0.02      0.43 ± 0.03
kcat (min⁻¹)          MT21       0.73 ± 0.1       1.10 ± 0.06      0.66 ± 0.04      0.48 ± 0.04
kcat (min⁻¹)          MT32       0.53 ± 0.05      1.19 ± 0.07      0.77 ± 0.03      0.20 ± 0.01
kcat (min⁻¹)          MT41       0.67 ± 0.04      1.13 ± 0.06      0.76 ± 0.03      0.46 ± 0.02
kcat (min⁻¹)          PUPdel2    1.63 ± 0.07      3.66 ± 0.11      1.02 ± 0.06      0.26 ± 0.02
kcat/Km (M⁻¹ min⁻¹)   PUP full   1.0 × 10⁵        1.1 × 10⁵        3.8 × 10⁴        1.1 × 10³
kcat/Km (M⁻¹ min⁻¹)   PUPdel     9.5 × 10⁴        9.3 × 10⁵        6.9 × 10⁴        2.0 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   MT1        1.5 × 10⁵        1.2 × 10⁶        1.1 × 10⁵        7.9 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   MT2        9.8 × 10⁴        2.2 × 10⁵        3.1 × 10⁵        3.9 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   MT21       1.3 × 10⁵        1.3 × 10⁶        1.4 × 10⁵        4.1 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   MT32       1.0 × 10⁵        1.1 × 10⁶        9.5 × 10⁴        3.7 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   MT41       1.0 × 10⁵        9.1 × 10⁵        1.3 × 10⁵        4.2 × 10⁴
kcat/Km (M⁻¹ min⁻¹)   PUPdel2    3.3 × 10⁵        5.0 × 10⁶        1.0 × 10⁵        2.3 × 10⁴

MT1 = I45P, MT2 = N159F, MT21 = I45L/L223Q, MT32 = I45P/L223Q/P273Y, MT41 = I45P/I128V/S165T/L223Q.
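The catalytic efficiencies in Table 1 can be recovered from the tabulated Km and kcat means once Km is converted from μM to M. The sketch below does this for 3′-O-allyl-UTP and also writes out the Michaelis–Menten rate law used for the fits. Function names are illustrative. Because the inputs are the rounded table means, the recomputed values may differ from the tabulated efficiencies in the last digit.

```python
def catalytic_efficiency(kcat_per_min, km_micromolar):
    """kcat/Km in M^-1 min^-1, given kcat in min^-1 and Km in micromolar."""
    return kcat_per_min / (km_micromolar * 1e-6)


def michaelis_menten(substrate, vmax, km):
    """Initial velocity v = Vmax * [S] / (Km + [S]); [S] and Km share units."""
    return vmax * substrate / (km + substrate)


# Mean values for 3'-O-allyl-UTP from Table 1 (uncertainties ignored here).
pupdel_utp = catalytic_efficiency(kcat_per_min=0.67, km_micromolar=0.72)
pupdel2_utp = catalytic_efficiency(kcat_per_min=3.66, km_micromolar=0.74)
fold_improvement = pupdel2_utp / pupdel_utp

print(f"PUPdel  kcat/Km: {pupdel_utp:.2e} M^-1 min^-1")
print(f"PUPdel2 kcat/Km: {pupdel2_utp:.2e} M^-1 min^-1")
print(f"fold improvement: {fold_improvement:.2f}")
```

A useful sanity check on any fitted curve is that the velocity at [S] = Km equals half of Vmax, which `michaelis_menten` reproduces by construction.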

In addition, we evaluated the ability of the PUPdel and I128V variants to extend natural RNA nucleotides lacking 3′-end-blocking modifications (ATP, CTP, UTP, and GTP) under standard extension conditions (Supplementary Fig. S2). The results showed that both variants efficiently catalyzed the extension of unmodified nucleotides, with the I128V variant exhibiting slightly higher elongation efficiency compared with PUPdel.

In summary, assays of thermodynamic stability and incorporation activity, performed on purified enzymes, demonstrated that these variants exhibited significantly improved incorporation efficiency and enhanced stability compared with PUPdel. The synergistic application of PLLR and Rosetta computational methodologies has proven to be a highly effective and rapid zero-shot strategy for identifying optimized variants with superior functional attributes.

Previous structural and biochemical studies identified Asn159 (corresponding to Asn171 in PUP full) as a key active site residue in PUP. Its side chain conformation changes upon UTP binding, forming hydrogen bonds with the ribose 2′-hydroxyl and the pyrimidine carbonyl, contributing to substrate recognition, π-stacking interactions with Tyr200 (Tyr212 in PUP full), and selection for ribonucleotides over deoxyribonucleotides (Fig. 4c) [16, 18].

Figure 4.

Analysis of the catalytic mechanism of the N159F variant. (a) Comparison of Cα RMSF profiles for the N159F variant and PUPdel derived from GaMD simulations. The N159F variant exhibits significantly increased flexibility compared with the parental enzyme, particularly in the active site lid loop (residues 296–310). (b) Time-evolution analysis of the secondary structure for key flexible loops (116–152 and 296–310). (c) Representative conformations of the active site in the PUPdel (left) and the N159F variant (right) when complexed with 3′-O-allyl-UTP. The N159F mutation introduces a phenylalanine residue that establishes a favorable π–π stacking interaction with the uracil base of the substrate, anchoring it in a more stabilized binding conformation. Key interacting residues are shown as sticks. (d) Average number of hydrogen bonds formed between the enzyme’s active site and the nucleotide substrate. The N159F variant consistently forms a greater number of stable hydrogen bonds with both 3′-O-allyl-UTP and 3′-O-allyl-CTP, indicative of enhanced substrate binding affinity and stability. (e and f) Dynamic analysis of the active site cleft, monitored by the distance between the Cα atoms of I130 and D310, in the presence of (e) 3′-O-allyl-UTP and (f) 3′-O-allyl-CTP. The N159F variant sustains a significantly smaller and more stable inter-residue distance, suggesting a more compact and catalytically competent “closed” conformation of the active site compared with the parental PUPdel. Representative snapshots illustrate this conformational difference.

To probe the functional plasticity of this critical site, we performed saturation mutagenesis at position N159 in PUPdel. Initial screening revealed that variants N159F, N159I, N159Q, and N159M retained the ability to incorporate the 3′-O-allyl-NTP substrates. We selected the N159F variant for further detailed characterization, including purification and activity assays. Remarkably, the purified N159F variant catalyzed near-quantitative incorporation (>95%) of 3′-O-allyl-NTPs within 1 min under standard assay conditions (Fig. 3c). Moreover, according to the kinetic data presented in Table 1, N159F exhibited substantially enhanced catalytic activity toward 3′-O-allyl-UTP and 3′-O-allyl-CTP compared with its parent PUPdel enzyme, whereas no significant improvement was observed for 3′-O-allyl-ATP or 3′-O-allyl-GTP. Compared with the amide group of Asn, the aromatic ring of phenylalanine is more rigid and has a larger steric footprint, so substituting the conserved residue N159 with phenylalanine is likely to trigger local structural rearrangements.

To elucidate the structural basis for this enhanced activity, we performed 200 ns GaMD simulations on the N159F variant in complex with substrates. After generating the dynamics trajectory, we calculated the RMSF of the system’s Cα atoms (Fig. 4a). Compared with PUPdel, the N159F variant displayed higher RMSF values after substrate polymerization, especially in residues 116–152 and 296–310, with β-fold shortening, indicating increased flexibility (Fig. 4b; Supplementary Fig. S4). Additionally, when 3′-O-allyl-UTP was used as the substrate, new hydrogen bonds were observed between the oxygen atoms in residues D89, D91, and the phosphate group. When 3′-O-allyl-CTP was used as the substrate, the number of hydrogen bonds between D91 and the 3′ oxygen atom of the sugar ring increased. These additional hydrogen bonds may have stabilized the enzyme–substrate complex during polymerization (Fig. 4c, d; Supplementary Fig. S3). The differences in binding energy due to hydrogen bonds and hydrophobic interactions may contribute to tighter substrate binding, increasing molecular stability and stabilizing the catalytic pocket of the variant enzyme (Supplementary Fig. S5). Further analysis of the catalytic pocket opening was performed (Fig. 4e, f). Distance analysis showed that after substrate polymerization, the distance between residues I130 and D310 in the variant was shortened, which may make it more difficult for the substrate to be released under physiological conditions, thereby promoting more efficient substrate polymerization.
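Both quantities used in this analysis, the per-residue Cα RMSF and the inter-residue Cα distance, are simple functions of the trajectory coordinates. The sketch below computes them in plain Python for a toy two-atom trajectory. It assumes that frames have already been aligned to a common reference, and the coordinates, atom indices, and function names are hypothetical illustrations rather than the analysis code used in this study.

```python
import math


def rmsf(trajectory):
    """Per-atom RMSF; trajectory[t][i] is the (x, y, z) of atom i in frame t.

    Frames are assumed to be pre-aligned to a common reference structure.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for i in range(n_atoms):
        # Time-averaged position of atom i.
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        # Mean squared displacement from that average position.
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


def distance_series(trajectory, i, j):
    """Distance between atoms i and j in every frame."""
    return [math.dist(frame[i], frame[j]) for frame in trajectory]


# Toy trajectory: atom 0 is static, atom 1 oscillates by 1 angstrom along x.
toy_trajectory = [
    [(0.0, 0.0, 0.0), (9.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
]
fluctuations = rmsf(toy_trajectory)
cleft_distances = distance_series(toy_trajectory, 0, 1)
print(fluctuations, cleft_distances)
```

For the cleft analysis, `i` and `j` would be the Cα indices of I130 and D310, and the mean and spread of the resulting series summarize how closed and how stable the pocket is.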

To further understand the polymerization mechanism of the engineered enzyme, we conducted a detailed investigation of the I45L/L223Q and R127D variants. Full results and analyses are provided in the Supplementary Data (Supplementary Figs S6–S8).

Round 2/3: screening and designing multi-point variants across single-point libraries

Building upon the experimentally validated single-point mutations from Round 1, we implemented an iterative ML strategy involving cycles of prediction and experimental testing to identify superior multi-point PUPdel variants.

Round 2: the initial training dataset comprised the characterized single-point variants from Round 1. We extracted protein sequence embeddings using both ESM1v and ESM2b models. These embeddings served as input features for training traditional ML classifiers—SVM, RF, and MLP—to predict variant activity (active/inactive). Model performance was evaluated using 5-fold cross-validation (Supplementary Fig. S13). The combination of ESM1v embeddings with an RF classifier (ESM1v–RF) yielded the highest cross-validation accuracy of 0.70 (Fig. 5b). This optimal model was then employed to predict the activity of a library of potential double-mutant combinations (derived from promising Round 1 hits). Based on the model’s ranking, the top 20 predicted double-mutant candidates were selected for experimental validation using the 3′-O-allyl-NTP incorporation assay. This resulted in 11 active variants, achieving an experimental hit rate of 55% (Supplementary Table S2).
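The double-mutant library scored in Round 2 is built by pairing single-point hits. A minimal sketch of that enumeration step is shown below, assuming that two substitutions at the same position cannot be combined. The hit list is a small illustrative subset of mutations named in this article, and the function names are hypothetical; the embedding and classifier scoring steps are not reproduced here.

```python
import re
from itertools import combinations


def parse_mutation(label):
    """Split a label such as 'I45P' into (wild-type residue, position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label}")
    wild_type, position, new_residue = match.groups()
    return wild_type, int(position), new_residue


def double_mutants(single_hits):
    """All pairs of single mutations that affect two different positions."""
    pairs = []
    for first, second in combinations(single_hits, 2):
        if parse_mutation(first)[1] != parse_mutation(second)[1]:
            pairs.append((first, second))
    return pairs


# Illustrative subset of single-point substitutions mentioned in the text.
hits = ["I45P", "I45L", "I128V", "L223Q"]
library = double_mutants(hits)
for first, second in library:
    print(f"{first}/{second}")
```

Each pair would then be converted to a full-length sequence, embedded, and ranked by the trained classifier before the top candidates are sent for experimental testing.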

Figure 5.

Iterative performance enhancement of PUPdel guided by ML and experimental validation. (a and b) Performance evaluation of various ML models in predicting mutant activity during Round 2 and Round 3. Models are compared based on precision, accuracy, F1-score, and area under the curve (AUC) metrics. (c) Pearson correlation matrix of different computational features in the Round 3 dataset. The color bar indicates the correlation coefficient, with warm and cool colors representing positive and negative correlations, respectively. (d) Scatter plot showing the distribution of active (orange) and non-active (blue) variants based on the two computational scores (MIF-ST and ESM-1v mask6) most correlated with the activity label. Dashed lines represent the thresholds used to distinguish between the two classes. (e) Linear regression plot between the two most correlated scores (CARP-640m and ESM-1v mask6), demonstrating their high degree of correlation. (f) Cumulative distribution function (CDF) curves for active and non-active variants across four computational scores highly correlated with the activity label. The separation between the curves indicates the discriminative power of the score. (g) Tms for PUPdel, the newly generated sequence (PUPdel2), and MT21 (I45L_L223Q), MT31 (I45P_I128V_L223Q), MT32 (I45P_L223Q_P273Y), and MT41 (I45P_I128V_S165T_L223Q) variants. Higher Tm values correspond to greater thermal stability. Data are presented as the mean ± SD from multiple replicates. (h) Steady-state kinetic analysis of PUPdel2 with four different 3′-O-allyl-modified substrates. The plot shows initial reaction velocities as a function of substrate concentration. Data points are the mean ± SD from three independent replicates, and solid lines represent fits to the Michaelis–Menten equation. (i) RMSF of residues for PUPdel (blue) and the optimized PUPdel2 (orange) from MD simulations. PUPdel2 exhibits significantly reduced fluctuations in two key flexible loop regions (highlighted in light cyan), indicating enhanced conformational stability. The inset shows the overall structure of PUPdel with the corresponding flexible loops highlighted. (j) Representative binding modes of PUPdel2 with four 3′-O-allyl-modified substrates. The left panel displays a structural superposition of the four binding modes. The right panels provide detailed views of the interaction network between key residues (e.g. S78, K181, Y200, and H324) and each modified substrate.

Round 3: the experimental outcomes (both active and inactive variants) from Rounds 1 and 2 were incorporated into the training dataset (Supplementary Table S3), expanding it to 129 labeled data points. For this Round 3, we explored a broader range of five protein input features (ESM1v, ESM3, COMPSS, PSSM, and Peptides) for activity prediction using an MLP classifier. Evaluating these feature sets individually with MLP via 5-fold cross-validation (Supplementary Fig. S14), the ESM1v–MLP combination achieved the best classification accuracy (0.67). Subsequently, 19 candidate sequences (multi-point variants) predicted as active by ESM1v–MLP were selected for experimental testing. Twelve of these exhibited detectable polymerase activity, corresponding to an improved experimental hit rate of 63.16% (Supplementary Table S4). Specific information on the 19 selected multi-mutant candidates can be found in the Supplementary Data.

Overall, these results highlight the effectiveness of the iterative ML-guided approach. While cross-validation metrics fluctuated slightly, the progressive incorporation of experimental data led to an enhanced success rate in identifying active combination mutants in successive rounds, demonstrating the model’s ability to learn increasingly complex sequence–function relationships relevant to PUPdel activity. Detailed statistics for each round are provided in Table 2.

Table 2.

Result and training dataset of three-round iteration

Round     Train positive   Train negative   Experiment positive   Experiment negative   Hit rate
Round 1   n/a              n/a              43                    47                    0.48
Round 2   52               57               11                    9                     0.55
Round 3   63               66               12                    7                     0.63
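The bookkeeping behind Table 2 is simple arithmetic: each hit rate is the fraction of tested variants that were active, and each round's experimental labels are added to the training set for the next round. The sketch below checks both relationships using only the counts in the table; the function names are illustrative.

```python
def round_hit_rate(active, inactive):
    """Fraction of experimentally tested variants that were active."""
    return active / (active + inactive)


def next_training_set(train, active, inactive):
    """Add one round's experimental labels to the (positive, negative) training counts."""
    positives, negatives = train
    return positives + active, negatives + inactive


# Counts taken from Table 2.
round1_rate = round_hit_rate(43, 47)
round2_rate = round_hit_rate(11, 9)
round3_rate = round_hit_rate(12, 7)
round3_training = next_training_set((52, 57), active=11, inactive=9)

print(f"{round1_rate:.2f} {round2_rate:.2f} {round3_rate:.2f}")
print(round3_training, sum(round3_training))
```

The Round 3 training counts sum to the 129 labeled data points stated in the text.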

To further understand the sequence determinants of activity, we conducted a systematic COMPSS evaluation of all experimental samples (Fig. 5c; Supplementary Table S5). Four metrics, ESM1v, CARP 640M, ESM1v mask6, and MIF-ST, were strongly correlated with one another (ESM1v mask6 and CARP 640M reached the highest pairwise correlation, 0.87; Fig. 5e), and all four correlated significantly with the experimental result labels. These four indicators therefore distinguish PUPdel incorporation activity effectively, showing distinct cumulative distribution patterns between positive and negative samples (Fig. 5f). We visualized the two features most strongly correlated with the labels, ESM1v mask6 and MIF-ST (Fig. 5c). Active proteins were found to cluster in the region where MIF-ST > −0.23 and ESM1v mask6 > −2.1, with a positive sample proportion of 75.68%, compared with 44.59% for the negative group (Fig. 5d). These indicators could inform future strategies for the optimization and rational design of enzymes.
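The two operations described here, a pairwise Pearson correlation between scores and a two-threshold gate on MIF-ST and ESM1v mask6, can be sketched in a few lines. The cutoffs are those quoted in the text; the function names and the toy score vectors are hypothetical and are not data from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def in_active_region(mif_st, esm1v_mask6, mif_cutoff=-0.23, esm_cutoff=-2.1):
    """True when a variant falls in the region where active variants cluster."""
    return mif_st > mif_cutoff and esm1v_mask6 > esm_cutoff


# Hypothetical scores for four variants (illustration only).
toy_mif_st = [-0.10, -0.30, -0.15, -0.50]
toy_esm1v_mask6 = [-1.5, -2.5, -1.9, -3.0]
selected = [in_active_region(m, e) for m, e in zip(toy_mif_st, toy_esm1v_mask6)]
print(selected)
print(round(pearson(toy_mif_st, toy_esm1v_mask6), 3))
```

In practice the gate would be applied to candidate sequences before experimental testing, with the cutoffs re-estimated as new labeled data accumulate.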

Subsequently, to rigorously characterize the performance improvements achieved through iterative design, we randomly selected four combinatorial variants with incorporation activity for expression and purification. All four purified variants demonstrated robust incorporation activity with the 3′-O-allyl-NTP substrate mixture (Supplementary Fig. S10). Thermal stability assessments further confirmed that these engineered variants maintained or exceeded the stability of the parent PUPdel (Fig. 5g). Notably, variant MT42 (I45P/L223Q/K270Q/P273Y) displayed a significant 6.4°C increase in Tm compared with PUPdel, indicating substantially enhanced thermostability. Furthermore, detailed kinetic assays were conducted for three selected multi-point variants: I45L/L223Q, I45P/L223Q/P273Y, and I45P/I128V/S165T/L223Q (Table 1). These analyses revealed significant enhancements in catalytic efficiency (kcat/Km) for the incorporation of 3′-O-allyl-CTP and 3′-O-allyl-GTP. Specifically, the I45L/L223Q and I45P/I128V/S165T/L223Q variants exhibited ~2-fold greater catalytic efficiency towards these two substrates (3′-O-allyl-CTP and 3′-O-allyl-GTP) compared with PUPdel. In contrast, no substantial improvement in efficiency was observed for 3′-O-allyl-ATP and 3′-O-allyl-UTP incorporation, probably attributable to the already high efficiency of the parent PUPdel for these specific analogs.

To evaluate the continuous-extension capability of the engineered MT41 variant (I45P/I128V/S165T/L223Q), we performed iterative primer extension reactions using 3′-allyl-modified NTP substrates. Starting from a poly(U) primer (sequence: UUUUUUUUUUUUUUUUUUUU), the MT41 enzyme catalyzed efficient n + 1 extension with the 3′-allyl-blocked nucleotide. After removal of the allyl-blocking group, a subsequent n + 2 extension was successfully achieved under identical reaction conditions (37°C, 2 min). The results, shown in Supplementary Fig. S11, demonstrate complete incorporation of the modified nucleotides at each step. To further characterize primer sequence generality, we tested the enzyme using a mixed-sequence RNA primer (sequence: UAUAACAAGCACACUAAAUU). Under the same reaction conditions (37°C, 2 min), MT41 again catalyzed nearly complete incorporation of the modified substrate (Supplementary Fig. S11). These results confirm that the MT41 enzyme can efficiently extend a range of RNA primer sequences and perform successive stepwise incorporation of 3′-allyl-modified nucleotides after deprotection, indicating strong potential for controlled, template-independent RNA elongation.

Sequence generation

To generate targeted candidate sequences, we employed the state-of-the-art ESM3 protein language model integrated with a Gibbs sampling strategy, referred to as ESM3PGEN (Supplementary Fig. S12). Using five different strategies (Fig. 6a–c), we iteratively generated 500 sequences based on PUPdel or the previously characterized functional variants (Supplementary Table S6). After removing sequences containing “X”, a total of 495 sequences remained. These candidates subsequently underwent in silico evaluation, involving 3D structure prediction via Colab AlphaFold2 and assessment using weighted scores from COMPSS and the ProtSoluCollect suite (incorporating GATSol, NetSolP, SoluProt, and Protein-Sol). For efficient candidate prioritization, CD-HIT clustering was performed on the 495 sequences, with the top-scoring representative from each of the 10 highest ranked clusters selected for experimental validation (Supplementary Table S7). Encouragingly, all 10 selected candidates demonstrated robust, high-level expression (Fig. 6d), confirming the efficacy of our combined computational strategy in designing experimentally tractable variants. Functional screening, however, identified only a single sequence, designated PUPdel2, exhibiting the desired catalytic function, namely the polymerization of all four 3′-O-allyl-NTPs (Fig. 6e, f). Compared with the PUPdel benchmark, PUPdel2 displayed significant enhancements across key metrics: a 3.4°C improvement in thermal stability, a 3.7-fold increase in expression yield, and markedly superior incorporation efficiencies for 3′-O-allyl-UTP (5.4-fold), 3′-O-allyl-ATP (3.5-fold), 3′-O-allyl-GTP (1.4-fold), and 3′-O-allyl-CTP (1.2-fold) (Table 1). These findings underscore the success of our guided engineering approach and highlight the considerable potential of PUPdel2 for advancing future applications.
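The difference between the one-by-one and one-time sampling strategies, and the subsequent removal of sequences containing “X”, can be illustrated with a toy sampler. In this sketch a uniform random draw stands in for the language-model distribution at a masked site, which is a deliberate simplification and not the ESM3 interface. The seed sequence, site indices, and all function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_sampler(context, site, rng):
    """Stand-in for a model draw at one masked site (ignores context and site)."""
    return rng.choice(AMINO_ACIDS + "X")


def one_by_one(sequence, sites, rng, sampler=toy_sampler):
    """Resample sites sequentially; each draw is conditioned on earlier updates."""
    residues = list(sequence)
    for site in sites:
        residues[site] = sampler("".join(residues), site, rng)
    return "".join(residues)


def one_time(sequence, sites, rng, sampler=toy_sampler):
    """Resample every site against the same starting sequence in a single pass."""
    residues = list(sequence)
    for site in sites:
        residues[site] = sampler(sequence, site, rng)
    return "".join(residues)


def drop_ambiguous(sequences):
    """Remove generated sequences that contain the unknown-residue symbol 'X'."""
    return [s for s in sequences if "X" not in s]


rng = random.Random(0)
seed_sequence = "MKTAYIAKQR"  # arbitrary toy sequence, not PUPdel
masked_sites = [2, 5, 7]
generated = [one_by_one(seed_sequence, masked_sites, rng) for _ in range(20)]
generated += [one_time(seed_sequence, masked_sites, rng) for _ in range(20)]
clean = drop_ambiguous(generated)
print(len(generated), len(clean))
```

With a real model the two strategies differ in what the sampler conditions on: one-by-one sampling sees the partially updated sequence, whereas one-time sampling draws every site from the original context.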

Figure 6.

Sequence generation strategies and experimental validation. (a) Schematic illustration of different sequence generation workflows. The left panel shows seed positive sequences undergoing site masking and subsequent sampling through two strategies: one-by-one sampling and one-time sampling. Up to eight iterations were performed to expand sequence diversity. The right panel illustrates a WT sequence-based mutation strategy, in which the first iteration was guided by the joint probability distribution of positive sequences, followed by seven iterative sampling rounds identical to those on the left. (b) One-time sampling: mutations at all selected sites are introduced in a single iteration to produce a new sequence. (c) One-by-one sampling: mutations are introduced site by site in a sequential manner within one iteration, generating alternative variants. By clustering and ranking 500 generated sequences, 10 representative sequences were selected for experimental validation. The assays included testing the expressibility of the proteins (d) and evaluating the polymerization capacity of AG (e) and UC (f) for the 10 generated proteins.

To understand the mechanistic basis for this dramatic enhancement, we combined structural analysis, MD, and substrate binding simulations. PUPdel2 incorporates 16 mutations that are strategically dispersed throughout its structure, suggesting a global remodeling rather than localized active site tuning (Supplementary Fig. S15). The functional consequence of this global remodeling becomes evident in the enzyme’s dynamics. GaMD simulations revealed a striking increase in the conformational flexibility of the β-trapdoor region (residues 295–312), a critical element controlling substrate access (Fig. 5i). This heightened dynamism is driven by three key mutations within the loop (A297R, S303R, and Q306S), two of which (A297R and S303R) introduce bulky, charged arginine residues. These substitutions locally increase solvent accessibility and create a positive electrostatic patch, transforming the β-trapdoor into a dynamic, “attractive” gateway. We propose a two-pronged mechanism: the enhanced flexibility facilitates the opening and closing motions required for accommodating the RNA primer–template duplex, while the positive charge acts as an electrostatic beacon, guiding the negatively charged RNA into the active site. Once the substrate is engaged, its binding is stabilized by a sophisticated and highly organized network of interactions, as revealed by GaMD simulations (Fig. 5j). A conserved binding motif, involving a central Mg²⁺ ion and residues K181, S199, and Y200, tightly coordinates the triphosphate tail of all 3′-O-allyl-NTPs. The nucleobase recognition is also exquisitely specific, with H324 playing a key stabilizing role. Notably, the binding of 3′-O-allyl-CTP is distinguished by a unique hydrogen bond to R328, showcasing the enzyme’s nuanced recognition capabilities. In conclusion, our work demonstrates that engineering distal sites to modulate the dynamics of key structural motifs, in this case the β-trapdoor, is a powerful strategy for improving polymerase function. The superior performance of PUPdel2 arises from a synergistic combination of enhanced protein dynamics that facilitate primer loading and a fine-tuned active site that ensures stable and efficient nucleotide incorporation.

Discussion

In this work, we developed an integrated workflow for the effective design and optimization of PUP, enabling the efficient and controllable template-independent enzymatic synthesis of RNA oligonucleotides. Central to our success was the initial rational design of PUPdel, informed by GaMD simulations that highlighted terminal flexibility in WT PUP and predicted enhanced substrate binding in the truncated form. This truncation stabilized the catalytic pocket, as evidenced by reduced RMSFs and a compact dihedral angle, facilitating tighter engagement with 3′-O-allyl-NTPs. Experimental validation confirmed that PUPdel exhibited excellent incorporation activity toward all four modified substrates. Building on this scaffold, our high-throughput in silico screening, which used ESM1b/ESM1v-derived PLLR scores and Rosetta ΔΔG calculations, enriched for functional single-point mutants, achieving a 64% success rate under stringent criteria. This outperforms conventional saturation mutagenesis by leveraging sequence plausibility and structural stability as orthogonal filters, as retrospective analyses demonstrated a clear correlation between PLLR thresholds and activity. The transition to multi-mutant design via iterative ML models further amplified efficiency. By training on sparse experimental data with ESM embeddings, our ESM1v–RF and ESM1v–MLP classifiers predicted synergistic effects in double- and higher-order mutants, progressively improving hit rates from 55% to 63%. This iterative refinement mirrors active learning strategies in protein engineering, where incorporating new data refines predictions of epistatic interactions, a longstanding challenge in combinatorial libraries. While ML is powerful for optimizing within a known chemical space (combinations of existing hits), the ESM3-based generative approach proposed truly novel sequences, breaking free from the local fitness landscape defined by the initial single-mutant library. The discovery of PUPdel2, a high-performance variant with 16 mutations, from this de novo pool underscores the transformative potential of generative AI in protein design, allowing access to distant, highly functional regions of sequence space that are inaccessible through incremental evolution.

Perhaps our most significant finding is the elucidation of PUPdel2’s unique catalytic mechanism, which hinges on the allosteric modulation of protein dynamics rather than direct modification of the catalytic residues. Our simulations reveal that PUPdel2’s performance is primarily driven by the enhanced flexibility of its β-trapdoor region. The introduction of three key mutations (A297R, S303R, and Q306S) remodels this structural motif into a “dynamic, attractive gateway”. This gateway serves a dual purpose: its increased flexibility lowers the kinetic barrier for the entry of the RNA primer, while the newly introduced positive charges create an electrostatic funnel, actively guiding the negatively charged nucleic acid into the active site. This “dynamic capture” mechanism stands in stark contrast to the more subtle compensatory binding mode observed in the parent PUPdel, where N159 played a key role. The success of PUPdel2 demonstrates that engineering protein function can be achieved more effectively by tuning the dynamics of distal, allosteric sites to control substrate access and positioning, a principle of broad applicability in enzyme design.

The implications of our work are twofold. First, the high efficiency of PUPdel2 in incorporating all four 3′-O-allyl-NTPs can facilitate the production of novel RNA-based therapeutics, diagnostics, and research tools with greater sequence diversity, yield, and purity. Second, and more broadly, our integrated AI-driven workflow serves as a blueprint for tackling other challenging protein engineering problems. By strategically layering rational design, predictive ML, and de novo generation, researchers can efficiently move from an initial protein scaffold to a highly optimized biocatalyst. While our generative model yielded a single functional hit out of 10 candidates, indicating room for improvement in downstream filtering algorithms, the success of PUPdel2 validates the overall strategy. Future work could focus on refining the multi-objective selection pipeline by incorporating more sophisticated dynamics-based or experimental high-throughput metrics to improve the hit rate of generative models.

In conclusion, our research showcases a powerful synergy between AI and molecular engineering. We have not only created an enzyme of significant practical value but also uncovered a sophisticated dynamic mechanism governing its function. This work charts a course for the future of biocatalyst design, where AI-guided exploration of sequence space, combined with deep mechanistic understanding, will unlock enzymes with unprecedented capabilities.

Supplementary Material

gkaf1529_Supplemental_Files

Acknowledgements

Author contributions: Lixiang Yang (Conceptualization [supporting], Methodology [equal], Molecular dynamics simulations [equal], Mutant design [lead], Visualization [lead], Writing – original draft [equal], Writing – review & editing [supporting]), Yi He (Methodology [equal], Machine learning model design [lead], Software [equal], Formal analysis [equal], Writing – original draft [equal], Writing – review & editing [supporting]), Fuyan Cao (Methodology [equal], Molecular dynamics simulations [equal], Formal analysis [equal], Literature review [equal], Writing – original draft [equal], Writing – review & editing [supporting]), Yanjia Qin (Investigation – biological experiments [lead], Methodology [equal], Data curation [equal], Validation [equal], Writing – original draft [equal], Writing – review & editing [supporting]), Yi Wang (Investigation [supporting], Data curation [supporting], Writing – review & editing [supporting]), Huijun Zhang (Investigation [supporting], Validation [supporting], Writing – review & editing [supporting]), Weiwei Han (Supervision [equal], Methodology [supporting], Writing – review & editing [lead], Project administration [supporting]), Meng Yang (Conceptualization [lead], Methodology [lead], Supervision [lead], Funding acquisition [lead], Project administration [lead], Writing – review & editing [lead])

Contributor Information

Lixiang Yang, MGI Tech, Shenzhen 518083, China.

Yi He, Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Qianjin Road 2699, Changchun 130012, China.

Fuyan Cao, Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Qianjin Road 2699, Changchun 130012, China.

Yanjia Qin, MGI Tech, Shenzhen 518083, China.

Yi Wang, MGI Tech, Shenzhen 518083, China.

Huijun Zhang, MGI Tech, Shenzhen 518083, China.

Weiwei Han, Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Qianjin Road 2699, Changchun 130012, China.

Meng Yang, MGI Tech, Shenzhen 518083, China; Graduate Affairs, Faculty of Medicine, Chulalongkorn University, 10330 Bangkok, Thailand.

Supplementary data

Supplementary Data are available at NAR online.

Conflict of interest

The authors have submitted patent applications based on the results reported in this paper. F.M. declares stock holdings in MGI.

Funding

The Ministry of Science and Technology of the People’s Republic of China’s “National Key Research and Development Program of China” [2022YFF1202200 and 2022YFF1202203]; the Science, Technology, Innovation Commission of Shenzhen Municipality [JSGGZD20220822095802006]; and the Natural Science Foundation of China [32471313].

Code availability

The code for protein generation is available at https://github.com/heyigacu/esm3pgen. ProtSoluCollect, used for protein expression assessment, is available at https://github.com/heyigacu/ProtSoluCollect. Both are archived at https://zenodo.org/records/17921806.

Data availability

All data can be found in the supplementary data.

References

  • 1. Khorkova O, Stahl J, Joji A et al. Amplifying gene expression with RNA-targeted therapeutics. Nat Rev Drug Discov. 2023;22:539–61. 10.1038/s41573-023-00704-7.
  • 2. Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94:441–8. 10.1016/0022-2836(75)90213-2.
  • 3. Chaudhary N, Weissman D, Whitehead KA. mRNA vaccines for infectious diseases: principles, delivery and clinical translation. Nat Rev Drug Discov. 2021;20:817–38. 10.1038/s41573-021-00283-5.
  • 4. Davidson BL, McCray PB Jr. Current prospects for RNA interference-based therapies. Nat Rev Genet. 2011;12:329–40. 10.1038/nrg2968.
  • 5. Wang F, Zuroske T, Watts JK. RNA therapeutics on the rise. Nat Rev Drug Discov. 2020;19:441–2. 10.1038/d41573-020-00078-0.
  • 6. Lee H, Xie T, Kang B et al. Plug-and-play protein biosensors using aptamer-regulated in vitro transcription. Nat Commun. 2024;15:7973. 10.1038/s41467-024-51907-4.
  • 7. Molina AG, Sanghvi YS. Liquid-phase oligonucleotide synthesis: past, present, and future predictions. Curr Protoc Nucleic Acid Chem. 2019;77:e82. 10.1002/cpnc.82.
  • 8. Moody ER, Obexer R, Nickl F et al. An enzyme cascade enables production of therapeutic oligonucleotides in a single operation. Science. 2023;380:1150–4. 10.1126/science.add5892.
  • 9. Caruthers MH. A brief review of DNA and RNA chemical synthesis. Biochem Soc Trans. 2011;39:575–80. 10.1042/BST0390575.
  • 10. Beaucage SL, Caruthers MH. Deoxynucleoside phosphoramidites—a new class of key intermediates for deoxypolynucleotide synthesis. Tetrahedron Lett. 1981;22:1859–62. 10.1016/S0040-4039(01)90461-7.
  • 11. Andrews BI, Antia FD, Brueggemeier SB et al. Sustainability challenges and opportunities in oligonucleotide manufacturing. J Org Chem. 2020;86:49–61. 10.1021/acs.joc.0c02291.
  • 12. Wiegand DJ, Rittichier J, Meyer E et al. Template-independent enzymatic synthesis of RNA oligonucleotides. Nat Biotechnol. 2025;43:762–72. 10.1038/s41587-024-02244-w.
  • 13. Krienke C, Kolb L, Diken E et al. A noninflammatory mRNA vaccine for treatment of experimental autoimmune encephalomyelitis. Science. 2021;371:145–53. 10.1126/science.aay3638.
  • 14. Mateus J, Dan JM, Zhang Z et al. Low-dose mRNA-1273 COVID-19 vaccine generates durable memory enhanced by cross-reactive T cells. Science. 2021;374:eabj9853. 10.1126/science.abj9853.
  • 15. Munoz-Tello P, Gabus C, Thore S. Functional implications from the Cid1 poly(U) polymerase crystal structure. Structure. 2012;20:977–86. 10.1016/j.str.2012.04.006.
  • 16. Yates LA, Fleurdépine S, Rissland OS et al. Structural basis for the activity of a cytoplasmic RNA terminal uridylyl transferase. Nat Struct Mol Biol. 2012;19:782–7. 10.1038/nsmb.2329.
  • 17. Balbo PB, Bohm A. Mechanism of poly(A) polymerase: structure of the enzyme–MgATP–RNA ternary complex and kinetic analysis. Structure. 2007;15:1117–31. 10.1016/j.str.2007.07.010.
  • 18. Lunde BM, Magler I, Meinhart A. Crystal structures of the Cid1 poly(U) polymerase reveal the mechanism for UTP selectivity. Nucleic Acids Res. 2012;40:9815–24. 10.1093/nar/gks740.
  • 19. Patop IL, Wüst S, Kadener S. Past, present, and future of circ RNAs. EMBO J. 2019;38:e100836. 10.15252/embj.2018100836.
  • 20. Arnold FH. Directed evolution: bringing new chemistry to life. Angew Chem Int Ed. 2017;57:4143–8. 10.1002/anie.201708408.
  • 21. Johnson SR, Fu X, Viknander S et al. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol. 2025;43:396–405. 10.1038/s41587-024-02214-2.
  • 22. Orsi E, Schada von Borzyskowski L, Noack S et al. Automated in vivo enzyme engineering accelerates biocatalyst optimization. Nat Commun. 2024;15:3447. 10.1038/s41467-024-46574-4.
  • 23. Shroff R, Cole AW, Diaz DJ et al. Discovery of novel gain-of-function mutations guided by structure-based deep learning. ACS Synth Biol. 2020;9:2927–35. 10.1021/acssynbio.0c00345.
  • 24. Lu H, Diaz DJ, Czarnecki NJ et al. Machine learning-aided engineering of hydrolases for PET depolymerization. Nature. 2022;604:662–7. 10.1038/s41586-022-04599-z.
  • 25. Madani A, Krause B, Greene ER et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023;41:1099–106. 10.1038/s41587-022-01618-2.
  • 26. Shen J, Yu Q, Chen S et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. Nat Comput Sci. 2024;4:29–42. 10.1038/s43588-023-00576-2.
  • 27. Ruffolo JA, Madani A. Designing proteins with language models. Nat Biotechnol. 2024;42:200–2. 10.1038/s41587-024-02123-4.
  • 28. Kotsiliti E. De novo protein design with a language model. Nat Biotechnol. 2022;40:1433. 10.1038/s41587-022-01518-5.
  • 29. Shin J-E, Riesselman AJ, Kollasch AW et al. Protein design and variant prediction using autoregressive generative models. Nat Commun. 2021;12:2403. 10.1038/s41467-021-22732-w.
  • 30. Hayes T, Rao R, Akin H et al. Simulating 500 million years of evolution with a language model. Science. 2025;387:850–8. 10.1126/science.ads0018.
  • 31. Zhou Z, Zhang L, Yu Y et al. Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning. Nat Commun. 2024;15:5566. 10.1038/s41467-024-49798-6.
  • 32. Brandes N, Goldman G, Wang CH et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. 2023;55:1512–22. 10.1038/s41588-023-01465-0.
  • 33. Grønbæk-Thygesen M, Voutsinos V, Johansson KE et al. Deep mutational scanning reveals a correlation between degradation and toxicity of thousands of aspartoacylase variants. Nat Commun. 2024;15:4026. 10.1038/s41467-024-48481-0.
  • 34. Abramson J, Adler J, Dunger J et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. 10.1038/s41586-024-07487-w.
  • 35. Pettersen EF, Goddard TD, Huang CC et al. UCSF Chimera—a visualization system for exploratory research and analysis. J Comput Chem. 2004;25:1605–12. 10.1002/jcc.20084.
  • 36. Case DA, Aktulga HM, Belfon K et al. AMBER 2022. University of California, San Francisco, 2022.
  • 37. Wang Y, Tang H, Huang L et al. Self-play reinforcement learning guides protein engineering. Nat Mach Intell. 2023;5:845–60. 10.1038/s42256-023-00691-9.
  • 38. Tian C, Kasavajhala K, Belfon KAA et al. ff19SB: amino-acid-specific protein backbone parameters trained against quantum mechanics energy surfaces in solution. J Chem Theory Comput. 2019;16:528–52. 10.1021/acs.jctc.9b00591.
  • 39. Wang J, Wolf RM, Caldwell JW et al. Development and testing of a general amber force field. J Comput Chem. 2004;25:1157–74. 10.1002/jcc.20035.
  • 40. Li P, Merz KM Jr. MCPB.py: a Python based metal center parameter builder. J Chem Inf Model. 2016;56:599–604.
  • 41. Essmann U, Perera L, Berkowitz ML et al. A smooth particle mesh Ewald method. J Chem Phys. 1995;103:8577–93. 10.1063/1.470117.
  • 42. Davidchack RL, Handel R, Tretyakov MV. Langevin thermostat for rigid body dynamics. J Chem Phys. 2009;130:144114. 10.1063/1.3149788.
  • 43. Bosko JT, Todd BD, Sadus RJ. Molecular simulation of dendrimers and their mixtures under shear: comparison of isothermal–isobaric (NpT) and isothermal–isochoric (NVT) ensemble systems. J Chem Phys. 2005;123:34905. 10.1063/1.1946749.
  • 44. Bussi G, Parrinello M. Stochastic thermostats: comparison of local and global schemes. Comput Phys Commun. 2008;179:26–9. 10.1016/j.cpc.2008.01.006.
  • 45. Grønbech-Jensen N, Farago O. Constant pressure and temperature discrete-time Langevin molecular dynamics. J Chem Phys. 2014;141:194108. 10.1063/1.4901303.
  • 46. Wang J, Arantes PR, Bhattarai A et al. Gaussian accelerated molecular dynamics: principles and applications. Wiley Interdiscip Rev Comput Mol Sci. 2021;11:e1521.
  • 47. Lin Z, Akin H, Rao R et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30. 10.1126/science.ade2574.
  • 48. Xiang W, Xiong Z, Chen H et al. FAPM: functional annotation of proteins using multimodal models beyond structural modeling. Bioinformatics. 2024;40:btae680. 10.1093/bioinformatics/btae680.
  • 49. Amancio DR, Comin CH, Casanova D et al. A systematic comparison of supervised classifiers. PLoS One. 2014;9:e94137. 10.1371/journal.pone.0094137.
  • 50. Mohammed A, Kora R. A comprehensive review on ensemble deep learning: opportunities and challenges. J King Saud Univ Comput Inform Sci. 2023;35:757–74. 10.1016/j.jksuci.2023.01.014.
  • 51. Hayes T, Rao R, Akin H et al. Simulating 500 million years of evolution with a language model. Science. 2025;387:850–8. 10.1126/science.ads0018.
  • 52. Meier J, Rao R, Verkuil R et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv Neural Inform Proc Syst. 2021;34:29287–303.
  • 53. Menke MJ, Ao Y-F, Bornscheuer UT. Practical machine learning-assisted design protocol for protein engineering: transaminase engineering for the conversion of bulky substrates. ACS Catal. 2024;14:6462–9. 10.1021/acscatal.4c00987.
  • 54. Khanh Le NQ, Nguyen QH, Chen X et al. Classification of adaptor proteins using recurrent neural networks and PSSM profiles. BMC Genomics. 2019;20:966. 10.1186/s12864-019-6335-4.
  • 55. Johnson SR, Monaco S, Massie K et al. Generating novel protein sequences using Gibbs sampling of masked language models. bioRxiv. 2021;2021–01., 27 January 2021, preprint: not peer reviewed.
  • 56. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. 10.1038/s41586-021-03819-2.
  • 57. Dauparas J, Anishchenko I, Bennett N et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science. 2022;378:49–56. 10.1126/science.add2187.
  • 58. Li B, Ming D. GATSol, an enhanced predictor of protein solubility through the synergy of 3D structure graph and large language modeling. BMC Bioinf. 2024;25:204. 10.1186/s12859-024-05820-8.
  • 59. Thumuluri V, Martiny H-M, Almagro Armenteros JJ et al. NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics. 2022;38:941–6. 10.1093/bioinformatics/btab801.
  • 60. Hon J, Marusiak M, Martinek T et al. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021;37:23–8. 10.1093/bioinformatics/btaa1102.
  • 61. Hebditch M, Carballo-Amador MA, Charonis S et al. Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017;33:3098–100. 10.1093/bioinformatics/btx345.
  • 62. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9. 10.1093/bioinformatics/btl158.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press