Significance
Our study uses deep learning for enzyme design, creating thermally stable variants that are well expressed in Escherichia coli and retain high enzymatic activity, as validated by thermal and kinetic assays. Traditional computational design pipelines demand expert insight into protein structures. Our pipeline treats enzyme structures as modular components, simplifying structure creation and bypassing expert curation. Where traditional methods have low success rates and require extensive screening, our method produces designs that surpass natural enzymes in expression and thermostability while preserving wild-type activity levels, with just a few dozen tested variants. These findings are of high practical importance and offer a simple-to-use yet transformative approach to backbone remodeling in enzymes and other proteins.
Keywords: protein design, bioinformatics, biochemistry
Abstract
The potential of engineered enzymes in industrial applications is often limited by their expression levels, thermal stability, and catalytic diversity. De novo enzyme design faces challenges due to the complexity of enzymatic catalysis. An alternative approach involves expanding natural enzyme capabilities for new substrates and parameters. Here, we introduce CoSaNN (Conformation Sampling using Neural Network), an enzyme design strategy using deep learning for structure prediction and sequence optimization. CoSaNN controls enzyme conformations to expand chemical space beyond simple mutagenesis. It employs a context-dependent approach for generating enzyme designs, considering non-linear relationships in sequence and structure space. We also developed SolvIT, a graph NN predicting protein solubility in Escherichia coli, optimizing enzyme expression selection from larger design sets. Using this method, we engineered enzymes with superior expression levels, with 54% expressed in E. coli, and increased thermal stability, with over 30% having higher Tm than the template, with no high-throughput screening. Our research underscores AI’s transformative role in protein design, capturing high-order interactions and preserving allosteric mechanisms in extensively modified enzymes, and notably enhancing expression success rates. This method’s ease of use and efficiency streamlines enzyme design, opening broad avenues for biotechnological applications and broadening field accessibility.
Natural evolution in proteins, through mechanisms like mutations, gene duplication, and homologous recombination, leads to new structures and functions (1–3). This is evident in multidomain proteins, where fused domains create new functions, and within individual domains, where secondary structure elements form novel protein folds (4–6). Techniques like DNA shuffling have exploited this to develop new protein variants, such as β-lactamase enzymes (7), and further work has combined random DNA segments to create unique protein folds (8). These methods have evolved to include structural considerations, improving the chances of producing functional proteins (9, 10). However, these methods, relying on rational or random segmentation, often result in unstable proteins and require extensive high-throughput screening due to their limited predictive capacity (11). Computational protein design has advanced these techniques by selecting and modeling recombined segments and optimizing amino acid sequences in silico. Notable examples include the SEWING algorithm by Jacobs et al. (12) and AbDesign by Lapidoth et al. (13, 14), both implemented within the Rosetta suite (15). These methods face two challenges, both of which result in inaccurate backbone conformations: the “lever effect,” where slight variations in anchor regions between graft and scaffold alignments can cause significant shifts in distal regions, and the neglect of the structural plasticity of the grafted segment induced by its new structural environment (10, 16, 17). Rosetta uses a Monte Carlo-based algorithm to search the sequence space for low-energy sequences for a given protein backbone. Besides being very computationally intensive, its energy function does not reliably capture all aspects of protein thermodynamics and kinetics (18). The requirement for Rosetta’s energy function to be pairwise decomposable means that higher-order interactions between residues are missed.
This can generate protein designs that, while having favorable Rosetta scores (analogous to folding energy), do not fold well. Moreover, this presents a particular challenge when attempting to computationally design allosteric proteins (19). Here we present CoSaNN (Conformation Sampling using Neural Network), a design strategy for creating novel enzyme conformations by grafting segments sampled from natural proteins, which takes into account the surrounding structural and chemical context in which the segment is placed. The algorithm leverages the latest advances in deep-learning-based structure prediction, specifically DeepMind’s AlphaFold2 (20). CoSaNN begins with a template structure. The structure is then segmented along structurally conserved points within the protein family. Unlike previous backbone sampling algorithms that require computationally intensive sampling of protein structures (13), new conformations are generated by creating sequence-based chimeras, swapping sequence segments in the scaffold enzyme with sequence segments from donor proteins. The chimeric sequences are then modeled using AlphaFold2. Since naive sequence chimeras tend to aggregate and misfold (11), we perform an additional sequence optimization step, which modifies the sequence while keeping the input conformation fixed. Illustratively, in the first step, the algorithm samples a point in protein conformation/sequence space, and in the second step, the algorithm moves along the sequence dimension while keeping the conformation coordinate fixed until it reaches a region of high foldability (Fig. 6). AlphaFold and other similar neural network (NN) structure prediction methods are trained to map sequences to their corresponding structures. Through this process, these models effectively learn to recognize sequence–structure pattern motifs.
In practice, this means that if a sequence segment from protein A is inserted into an equivalent position in a homologous protein B, AlphaFold is capable of accurately folding the segment such that the conformation of both accepting template and grafted segment changes only slightly. We hypothesize that this ability is attributed to the contextual dependency learned by the model during training and from the combination of sequences mapped to the graft and the accepting scaffold. This feature explains why when presented with sequences containing mutations known to cause experimental misfolding, AlphaFold tends to predict the “native” or typical fold and also why AlphaFold seems to produce different conformations with different multiple sequence alignment (MSA) clusters (21). For the sequence optimization step, we chose to use the newly developed ProteinMPNN as well as RosettaDesign (15) for comparison. During the sequence optimization step, we keep the catalytic residues unchanged. This decision was based on previous design iterations, which demonstrated a propensity for both RosettaDesign and ProteinMPNN to occasionally mutate these residues. This tendency is not unexpected, given the well-documented tradeoff between enzymatic functions and protein stability (22). It is worth noting that both algorithms are primarily optimized for ensuring foldability and are not trained explicitly to increase or maintain soluble expression. Previous methods required a time-consuming process of manual inspection of designs which is highly subjective and not quantitative, or relying on rough approximations such as the Rosetta total score. To address this limitation, we introduced into our design pipeline a graph NN classifier trained from the ground up to predict soluble expression in Escherichia coli from the AlphaFold models of the designed proteins.
Fig. 6.
CoSaNN traversal of the conformation-sequence space. The contour map represents a two-dimensional projection of the possible sequence–structure protein space. Darker regions represent regions with high foldability. During the CoSaNN design flow (1) we first move along the Conformation coordinate using protein structure prediction NN. This step potentially places us outside a foldable region (2). The next step uses an inverse folding NN to move back to a foldable region along the same Structure coordinate (3). Last, we apply SolvIT to optimize for expression within the foldable region.
In the present study, we applied CoSaNN to design enzymes that undergo substantial conformational changes upon substrate binding. This scenario poses a rigorous test for the robustness of the design strategy we have developed. Not only do we obtain highly active designs, without laboratory evolution, but we maintain the fold’s allosteric mechanism, demonstrating that our design approach accounts for long-range, high-order interactions both in sequence and structure space. We also demonstrate that our approach works well not only in scenarios that involve the grafting of a single segment, but also in grafting of multiple different segments, some radically different than the acceptor segment; finally, we show that our classifier-based ranking system significantly increases the likelihood of finding well-expressed proteins from the millions of generated in silico designs.
Methods
Computational.
Selection of template.
Polyphosphate glucokinase (ppgmk) from Arthrobacter sp. (strain KM, Uniprot Id: Q7WT42) was selected as the template onto which all backbone segments are grafted. The 3D structure of the protein was previously determined by X-ray crystallography (23) in the bound conformation with glucose and a pair of phosphates adjacent to the glucose substrate (PDB Id: 1WOQ).
Selection of anchor positions.
To determine the most structurally invariant anchors across the entire protein fold, we developed an algorithm that creates distance maps for all AlphaFold models of proteins within the same PFam fold (Fig. 1 A–C) as ppgmk (PF00480), clustered to 50% sequence identity. The distance maps are then modified to only include structurally conserved residues across the fold family, generating a “consensus” distance map of the fold family. We employed mTm-Align (24) to create a multiple structural alignment. To identify invariant spatial position pairs across all family members, we measured the variance across all residue pairs in the consensus distance map across all proteins.
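The variance-based search for invariant position pairs can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the structures have already been reduced to the same structurally aligned positions (for example, from a multiple structural alignment), with one representative coordinate per position, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import pvariance


def pairwise_distance_variance(structures):
    """Variance of each inter-position distance across family members.

    structures: list of coordinate lists, one per family member. Each inner
    list holds (x, y, z) tuples for the SAME aligned positions, in the same
    order (an assumption of this sketch).
    Returns a dict mapping (i, j) to the variance of the i-j distance.
    """
    n_positions = len(structures[0])
    if any(len(s) != n_positions for s in structures):
        raise ValueError("all structures must cover the same aligned positions")
    variances = {}
    for i, j in combinations(range(n_positions), 2):
        distances = [dist(s[i], s[j]) for s in structures]
        variances[(i, j)] = pvariance(distances)
    return variances


def most_invariant_pairs(structures, top_n=1):
    """Return the top_n position pairs with the lowest distance variance."""
    variances = pairwise_distance_variance(structures)
    return sorted(variances, key=variances.get)[:top_n]
```

Pairs with the lowest variance are candidates for anchors; in practice one would also weigh where the pair sits relative to the segment of interest, as described in the text.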
Fig. 1.
The CoSaNN design pipeline. (A) Protein family sequences are sourced from PFam and clustered using a 50% sequence identity cutoff. (B) The structures were modeled using AlphaFold2. (C) The resulting 3D models of protein structures are aligned. (D) Desired insertion points of each protein are identified, and relevant segments are extracted from the aligned stem residues. (E) The extracted sequences replace the corresponding wild-type segment, resulting in the creation of chimeric proteins. These chimeras are subsequently modeled using AlphaFold2 once more. (F) Inverse folding NNs (or Rosetta) are utilized to optimize the sequence of each chimera. (G) Finally, SolvIT, a NN designed to predict protein expression, is used to score each stabilized chimera.
The following anchor positions were used in this work: residues 81 and 122 for the first segment and residues 159 and 194 for the second segment (numbering according to ppgmk). These positions were selected based on their spatial invariance across the entire fold family and their contribution to substrate recognition and catalysis (25). Enzymes featuring successfully grafted segments that play a role in substrate recognition and catalysis would provide compelling evidence of our method’s robustness.
Generating chimeric enzymes from segments of homologous proteins.
Anchor positions were derived by obtaining the consensus anchor points (as described above) and mapping them to the template and donor proteins. The amino-acid sequence between the two anchors was then extracted (Fig. 1D) for each of the entries in the PFam fold family of ppgmk (PF00480) and inserted into the template sequence, replacing the original template’s equivalent sequence.
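The sequence-level graft described above amounts to a string splice. The sketch below assumes 1-based residue numbering and that the anchor residues themselves are retained from the template; the text does not state whether anchors are inclusive, so this convention and the function name are assumptions.

```python
def graft_segment(template_seq, donor_seq, template_anchors, donor_anchors):
    """Build a chimeric sequence by swapping the stretch between two anchors.

    Anchors are (start, end) pairs of 1-based residue numbers. Residues
    strictly between the anchors are replaced; the anchor residues are kept
    from the template (an assumption of this sketch).
    """
    t_start, t_end = template_anchors
    d_start, d_end = donor_anchors
    if not 1 <= t_start < t_end <= len(template_seq):
        raise ValueError("template anchors out of range")
    if not 1 <= d_start < d_end <= len(donor_seq):
        raise ValueError("donor anchors out of range")
    # Donor residues d_start+1 .. d_end-1 in 1-based numbering.
    donor_segment = donor_seq[d_start:d_end - 1]
    # Template residues 1 .. t_start, then the graft, then t_end .. end.
    return template_seq[:t_start] + donor_segment + template_seq[t_end - 1:]
```

Because the donor segment may differ in length from the one it replaces, the chimera length can differ from the template length, which is why the chimera is subsequently re-modeled rather than threaded onto the template backbone.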
Modeling chimeric enzymes.
Chimeric enzymes were modeled using the ColabFold (26) implementation of AlphaFold2 (Fig. 1E). The MSAs required by AlphaFold2 as input were generated using mmseqs (27).
Sequence design of modeled enzymes.
Sequence optimization for the chimeric enzymes was performed using either ProteinMPNN, with the settings described in the ProteinMPNN paper (28) except for the inference temperature, which was set to 0.15 instead of 0.1, or RosettaDesign (Fig. 1F). A Position-Specific Scoring Matrix (PSSM) was utilized to constrain the sequence space in both cases. For ProteinMPNN, the PSSM was incorporated as a bias to guide the model’s logit predictions. For RosettaDesign, we used the PSSM-generated log-likelihood values to eliminate any amino acids that yielded log-likelihood ratio values less than zero.
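The two ways of applying the PSSM can be illustrated per position. This is a schematic sketch only: it does not reproduce ProteinMPNN's or Rosetta's internal code paths, and the additive form of the bias, the `bias_weight` parameter, and the function names are assumptions made for illustration.

```python
import math


def allowed_residues(pssm_row):
    """Hard filter: keep amino acids whose log-likelihood ratio is not below zero.

    pssm_row: dict mapping one-letter amino acid codes to PSSM scores for a
    single position.
    """
    return {aa for aa, score in pssm_row.items() if score >= 0}


def biased_probabilities(logits, pssm_row, temperature=0.15, bias_weight=1.0):
    """Soft bias: add the PSSM score to each logit, then apply a
    temperature-scaled softmax. bias_weight is a hypothetical knob."""
    scaled = {
        aa: (logit + bias_weight * pssm_row.get(aa, 0.0)) / temperature
        for aa, logit in logits.items()
    }
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    weights = {aa: math.exp(value - peak) for aa, value in scaled.items()}
    total = sum(weights.values())
    return {aa: weight / total for aa, weight in weights.items()}
```

The hard filter removes disfavored residues from consideration entirely, while the soft bias only shifts sampling probabilities, so a strongly preferred residue can still be chosen against the PSSM.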
Training an E. coli–based heterologous expression predictor.
The eSol dataset (29, 30) consisting of 3,173 translated E. coli proteins with their respective soluble fraction titers was downloaded from the author’s website. The structure of each protein in the dataset was modeled using AlphaFold2. Subsequently, a graph representation of the protein was created from the structure such that the protein’s amino acids form the graph vertices and an edge exists between two amino acids if the distance between them is <5 Å. Twenty percent of the samples were set aside to serve as the test set, while the rest of the samples were partitioned into five equally sized sets for a fivefold cross-validation training process. The solubility classifier is an ensemble of five models trained on the five folds. Each of the models is composed of two graph attention layers (30) and a ReLU (Rectified Linear Unit) activation function. Following the attention layers, the graph representation is pooled by a Graph Multiset Transformer pooling operator (31). Finally, a linear layer is applied, followed by a sigmoid activation function. The network was implemented in PyTorch (32) and PyTorch-Geometric (33).
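The graph construction and data split described above can be sketched without any deep-learning dependencies. The sketch assumes one representative coordinate per residue (the text does not say which atoms define the inter-residue distance), and the shuffling seed and function names are hypothetical; the network itself is not reproduced here.

```python
import random
from itertools import combinations
from math import dist


def contact_graph(coords, cutoff=5.0):
    """Undirected edges (i, j), i < j, between residues closer than cutoff (in Å).

    coords: list of (x, y, z) tuples, one representative point per residue.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if dist(coords[i], coords[j]) < cutoff
    ]


def split_indices(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Hold out a test set, then partition the remainder into n_folds folds
    whose sizes differ by at most one sample."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return test, folds
```

In a fivefold setup, each of the five models would be trained on four folds and validated on the remaining one, with the held-out test set touched only for the final evaluation.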
The solubility classifier (SolvIT) can then be used for ranking and prioritizing designs for subsequent experimental characterization (Fig. 1G).
The full CoSaNN computational protocol is illustrated in Fig. 1.
Experimental.
Protein expression.
NiCo21(DE3) E. coli competent cells (New England Biolabs, Ipswich, MA) were transformed using the heat shock method with a pET-28a plasmid that was cloned with our designed enzymes, with an addition of an N-terminal 6xHis-tag. The transformation was conducted in a 96-well plate (Cat: 701011, Wuxi-Nest, Jiangsu, China). After transformation, the cells were cultivated in their respective wells at 37 °C for 16 h, in a 2xYT medium supplemented with 30 µg/mL of kanamycin for selection.
The cells were subsequently diluted in a 1:60 ratio into a deep-well 96-plate (Labcon, USA, Cat: LC 3909-525), reaching a final volume of 0.5 mL per well. This dilution was made using a fresh 2xYT medium containing 30 µg/mL of kanamycin. The cultures were then incubated at 37 °C until they attained an absorbance of 0.6 at 600 nm.
Protein expression was initiated by the addition of isopropylthio-β-galactoside at a concentration of 0.2 mM. The cultures were then induced for approximately 20 h at 20 °C. Following this induction period, the cells were harvested through centrifugation at 4,000 g at a temperature of 4 °C for 15 min.
Protein isolation and quantification.
The cells were resuspended in 100 µL of lysis buffer from the Qproteome Bacterial Protein Prep kit (Qiagen, Germany, cat. 37900). First, 1 mg/mL lysozyme, 1:100 (v/v) Protease Inhibitor Cocktail (EDTA-Free, 100× in DMSO) (APExBIO, USA, cat. K1010), and 1:1,000 (v/v) benzonase were added to the lysis buffer. Then, after 30 min on ice, the lysate was centrifuged (4,500 g; 4 °C; 30 min) and the soluble fraction was incubated in a shaker with Ni-NTA Charged MagBeads (GenScript, USA, cat. L00295) for 1 h at 25 °C. Prior to incubation with the cell supernatant, the beads were pre-washed with a binding buffer (50 mM Tris-HCl pH 8.0, 300 mM KCl, 0.1 mM ZnCl2, and 10 mM imidazole). Subsequently, the beads were washed three times with 200 µL wash buffer (50 mM Tris-HCl pH 8.0, 300 mM KCl, 0.1 mM ZnCl2, and 30 mM imidazole). His-tagged proteins were eluted with 50 µL elution buffer (50 mM Tris-HCl pH 8.0, 100 mM KCl, 0.1 mM ZnCl2, and 500 mM imidazole). The eluted proteins were then inspected using standard sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS-PAGE) with Coomassie staining, and quantified using the Coomassie Plus Bradford Assay kit (Thermo Scientific, USA, cat. 23236).
Protein thermal stability.
Protein thermal stability was assayed using the Applied Biosystems Protein Thermal Shift Dye Kit according to the manufacturer’s protocol and analyzed using a Bio-Rad C1000 Touch thermal cycler and its software. The SYPRO Orange dye was excited at 470 nm, and the change in fluorescence emission at 570 nm (relative fluorescence units) was measured at temperatures ranging from 25 °C to 95 °C, with an increase of 1 °C per minute.
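A melting temperature is commonly read from such a curve as the temperature of steepest fluorescence increase. The sketch below shows that idea with a simple finite-difference derivative; it is an illustrative assumption and not a description of the instrument software used in this work.

```python
def melting_temperature(temps, fluorescence):
    """Estimate Tm as the midpoint of the temperature interval with the
    largest finite-difference slope dF/dT.

    temps: strictly increasing temperatures (°C).
    fluorescence: readings at those temperatures (relative fluorescence units).
    """
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need two equally long series with at least two points")
    best_slope = None
    best_tm = None
    for k in range(len(temps) - 1):
        slope = (fluorescence[k + 1] - fluorescence[k]) / (temps[k + 1] - temps[k])
        if best_slope is None or slope > best_slope:
            best_slope = slope
            best_tm = (temps[k] + temps[k + 1]) / 2
    return best_tm
```

Real melt curves are noisy and often decline after the transition as the protein aggregates, so smoothing or fitting a sigmoid before differentiating would usually be needed.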
End-point glucokinase activity.
Endpoint enzymatic activity was assayed at 37 °C unless indicated otherwise. The reaction mixture contained 50 mM Tris-HCl buffer (pH 8.0), 100 mM NaCl, 0.1 mM ZnCl2, 1.6 µg/mL polyphosphate, and 100 mM glucose. The reactions were initiated by the addition of the enzyme and terminated by freezing after 6 min. To determine the concentration of glucose-6-phosphate, the PicoProbe™ Fructose-6-Phosphate Assay Kit (GenWay Biotech, Inc.) was used as instructed by the manufacturer with minor adaptations: 1) glucose-6-phosphate was used as a standard and 2) glucose-6-phosphate isomerase was omitted from the reaction. Fluorescence was measured in a Synergy HT-T1 plate reader (Biotek, USA), Ex. 528, Em. 590. The optimal temperature was determined by running the activity assay for 6 min at varying temperatures between 25 °C and 95 °C.
In order to evaluate the enzymes’ stability at different temperatures, the enzymes were incubated at varying temperatures ranging from 25 °C to 80 °C for indicated times (0 to 22 h) in the enzymatic reaction buffer. Then, the temperature was changed to 37 °C, glucose and polyphosphate were added, and the enzymatic activity was measured.
Glucokinase kinetic measurements.
Enzyme kinetics were assayed for 20 min at 37 °C using flat-bottom UV-transparent plates (UV-Star 96 wells, Greiner Bio-One #655801). The reaction mixture contained 50 mM Tris-HCl buffer (pH 8.0), 100 mM NaCl, 0.1 mM ZnCl2, 10 mM MgCl2, 66 µM nicotinamide adenine dinucleotide phosphate (NADP), and Converter Mix containing glucose-6-phosphate dehydrogenase from the PicoProbe™ Fructose-6-Phosphate Assay Kit, diluted 1,500-fold. Km and Vmax values for glucose were determined by varying the glucose concentration from 0 to 166 mM while keeping the concentration of polyphosphate at 8 µg/mL. Km and Vmax values for polyphosphate were determined by varying the polyphosphate concentration from 0 to 320 µg/mL while keeping the concentration of glucose at 33 mM. The formation of glucose-6-phosphate was monitored by coupling with the 1,500-fold dilution of glucose-6-phosphate dehydrogenase Converter Mix mentioned above and measuring the formation of NADPH at 340 nm. Kinetic parameters were calculated by measuring the initial rates of reactions and fitting the data to the Michaelis–Menten model using GraphPad Prism version 9.4.0 for Windows (GraphPad Software, USA).
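For readers without access to a commercial fitting package, the Michaelis–Menten fit $v = V_{max}[S]/(K_m + [S])$ can be sketched in plain Python. This is not the procedure used in this work: the sketch exploits the fact that, for a fixed Km, the least-squares Vmax has a closed form, and then searches Km on a log scale with a golden-section search, which assumes the error profile has a single minimum within the (hypothetical) bounds.

```python
import math

GOLDEN = (math.sqrt(5) - 1) / 2


def fit_michaelis_menten(substrate, rates, km_bounds=(1e-6, 1e3), iterations=100):
    """Least-squares fit of v = Vmax * S / (Km + S). Returns (Km, Vmax).

    substrate: concentrations (at least one must be positive).
    rates: initial rates measured at those concentrations.
    """
    def best_vmax(km):
        # For fixed Km the model is linear in Vmax, so Vmax has a closed form.
        x = [s / (km + s) for s in substrate]
        return sum(v * xi for v, xi in zip(rates, x)) / sum(xi * xi for xi in x)

    def sse(km):
        vmax = best_vmax(km)
        return sum((v - vmax * s / (km + s)) ** 2 for s, v in zip(substrate, rates))

    # Golden-section search over log10(Km).
    lo, hi = math.log10(km_bounds[0]), math.log10(km_bounds[1])
    for _ in range(iterations):
        c = hi - GOLDEN * (hi - lo)
        d = lo + GOLDEN * (hi - lo)
        if sse(10 ** c) < sse(10 ** d):
            hi = d
        else:
            lo = c
    km = 10 ** ((lo + hi) / 2)
    return km, best_vmax(km)
```

Km is returned in the same units as the substrate concentrations and Vmax in the units of the rates. With noisy data, confidence intervals would still need to be estimated separately.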
Results
The ROK Carbohydrate Kinase.
We chose to test our design method on the Repressor Open reading frame Kinase (ROK) fold family (PFam family Id: PF00480), due to its complex conformational dynamics (34) (SI Appendix, Fig. S1B) and functional diversity (35). This family consists of both carbohydrate-dependent transcriptional repressors and sugar kinases (36), sharing a common fold with an N-terminal small α/β domain and a C-terminal large α/β domain. The binding of carbohydrates to these kinases triggers a significant rearrangement of these domains, forming a catalytically active conformation. Most ROK kinases are selective for specific sugars and prefer Adenosine triphosphate (ATP) as a phosphate donor, often requiring a magnesium ion (25). An exception is the inorganic polyphosphate/ATP-glucomannokinase from Arthrobacter sp. strain KM (ppgmk) (37), which lacks the ATP binding motif and instead exhibits a different phosphate binding mode, suggesting a unique activity mechanism (SI Appendix, Fig. S1 A, C, and D). We chose this structurally distinctive kinase as a template for grafting in our design methodology, offering a non-trivial test case.
Comparing Design Strategies.
To ensure a robust and comprehensive evaluation of our design strategy, we separated the design protocol into four distinct elements. These include two design tasks—backbone conformation sampling and sequence optimization—and two design strategies used for these tasks—NN–based methods and Rosetta-based methods. Previous attempts at backbone conformation sampling have demonstrated that sequence optimization is a critical step that improves overall stability, by both introducing stabilizing amino acid interactions between the new segment and the template protein, and by introducing stabilizing mutations to the chimeric protein in general (14, 38–40).
To further test the generality of our design method, we tested both single segment replacement as well as multiple segment replacements (Table 1).
Table 1.
Computational design strategies
| Design strategy | Grafted segment | Backbone grafting method | Sequence design method | # of designs generated |
|---|---|---|---|---|
| 1 | 81–122 | Rosetta based (14) | Rosetta Design | 40 |
| 2 | 81–122 | AlphaFold sequence grafting | Rosetta Design | 119 |
| 3 | 81–122, 159–194 | AlphaFold sequence grafting | Rosetta Design | 95 |
| 4 | 81–122, 159–194 | AlphaFold sequence grafting | ProteinMPNN | 94 |
This comparative approach provides a meaningful assessment of each method’s strengths and weaknesses, allowing us to evaluate the overall effectiveness of our design strategy in a variety of scenarios. Furthermore, by segregating the design tasks, we can more accurately pinpoint areas of success and necessary improvement within each procedural element.
We decided to focus our design on two regions within the protein scaffold: one segment spanning residues 81 to 122 and a second spanning residues 159 to 194 (numbering corresponding to the template protein ppgmk). The 81 to 122 segment, part of the N-terminal small domain, predominantly contains the sugar specificity motif (25) (Fig. 2C). In the template protein ppgmk, this segment is unique among ROK family members as it contains a higher density of positively charged residues. These residues play a crucial role in stabilizing the negatively charged polyphosphate molecule (41). The segment spanning 159 to 194 contains the characteristic zinc finger motif (Fig. 2D). The preference of the template protein, ppgmk, for polyphosphate over ATP as a phosphate donor is reflected in the absence of this motif, highlighting its distinct structural features compared to other ROK family members.
Fig. 2.
Comparison between ATP-dependent and polyphosphate-dependent ROK family sugar kinase and depiction of backbone sampling. (A) Representative conformations of the N-terminal segment (colored blue-magenta) grafted onto ppgmk (colored white) created using Design Strategies 1 and 2. (B) Representative conformations of the N-terminal segment (colored blue-magenta) and the C-terminal segment (colored green-yellow) grafted onto ppgmk (colored white) created using Design Strategies 3 and 4. The glucose substrate and phosphate ions are shown as magenta and red spheres and orange and red spheres respectively.
In the following section, we elaborate on the characteristics of each design strategy.
Design Strategy 1. As a control set, we opted to replace the conformation of segment 81 to 122 using Rosetta as previously described (13, 14, 42). In short, the dihedral angles of the donor segment are first recorded. The acceptor segment is mutated to polyalanine and its number of residues is altered to match the length of the donor segment. A main-chain cut is randomly introduced in the acceptor segment and the dihedral angles previously recorded from the donor segment are applied. Following this step, a Cyclic Coordinate Descent algorithm is applied to bring the two parts of the segment together and optimize the donor conformation. Following the conformation replacement, we proceed to the sequence optimization phase, utilizing RosettaDesign. This method relies on the Rosetta energy function and Monte Carlo sampling to generate sequences with minimal Rosetta energy (R.e.u) for the newly derived conformations.
Design Strategy 2. The second approach, simpler in its implementation, involves creating sequence chimeras by replacing the target segment’s amino-acid sequence with the donor segment sequence and modeling the resulting structure using AlphaFold2 (Methods). Subsequently, we again use RosettaDesign to optimize the sequence.
Design Strategy 3. Follows the same methodology as Strategy 2 (using AlphaFold for modeling and RosettaDesign for sequence optimization), but with increased complexity, by introducing an additional conformational change within the segment spanning residues 159 to 194. This approach not only requires accommodating two independent conformational changes within the same protein scaffold, but also any potential interactions between the two segments.
Design Strategy 4. Also introduces two concurrent segment changes. However, it differs from Design Strategy 3 by using the recently developed ProteinMPNN model to perform sequence optimization rather than RosettaDesign. In our implementation, we increased ProteinMPNN's sampling temperature to 0.15 from the original 0.1 to better explore sequence–structure space for non-natural chimera backbones.
Soluble Expression Classifier.
Since ProteinMPNN and RosettaDesign were trained for sequence recovery and not for optimizing high-yield heterologous expression, their final scores for designs do not necessarily indicate the likelihood of robust expression. Most design pipelines provide either by-proxy measurements for bacterial expression (such as Rosetta total score) or involve visual inspection of design for downstream characterization. To simplify and automate this task, we sought to develop an independent classifier trained specifically on the task of predicting soluble protein expression in E. coli.
Several machine learning models have been developed to predict protein solubility in various organisms, demonstrating varying degrees of accuracy. One particularly notable model, GraphSol, achieved state-of-the-art performance on the eSol dataset, a solubility database of ensemble E. coli proteins (29). However, GraphSol is computationally intensive, requires additional evolutionary information (generated on the fly), and does not take into account high-resolution structural information.
Our solubility classifier (see Methods for the detailed implementation) addresses these limitations. In brief, the classifier constructs a graph representation of the AlphaFold protein structure, wherein the amino acids are the vertices and an edge is defined between two vertices if the distance between them is less than 5 Å. This classifier is an ensemble of five models wherein each of the models is a graph NN composed of two graph attention layers activated by a ReLU function. Following attention, the graph representation is pooled via a Graph Multiset Transformer pooling operation. A final linear layer, followed by a sigmoid activation function, completes the process.
Our classifier exhibits exceptional precision and accuracy on a hold-out dataset, not utilized during training and validation, on par with the performance of the current state-of-the-art classifier, GraphSol [Fig. 3; AUC (area under the curve) = 0.86 and Pearson’s R2 = 0.41 when compared to the normalized expression values documented in the eSol dataset]. Notably, our classifier is computationally efficient, with the entire pipeline requiring 30 to 45 s per protein sequence (depending on length) on a p2.xlarge AWS instance. This runtime includes the structure prediction step using the ColabFold implementation of AlphaFold.
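The AUC quoted above has a simple probabilistic reading: it is the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that computation follows; the binary labels are assumed to come from thresholding the solubility values, and that threshold is not specified here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth classes.
    scores: classifier outputs, where higher means more likely positive.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking, which gives context for the values near 0.5 reported for the sequence-based baselines in Fig. 3.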
Fig. 3.

SolvIT performance metrics. (A) Scatter plot showing the normalized soluble expression values of test-set entries in the eSol dataset as a function of classifier predictions scaled from [0,1] to [0,100]. (B) ROC curve of classifier predictions on the eSol test set vs. ground truth, showing an AUC of 0.86 for SolvIT vs. 0.49 and 0.58 for ProteinMPNN and ESM, respectively. (C) ROC curve for retrospective prediction of design expression (a protein is considered expressed when more than 10 µg total protein was produced per 0.5 mL induced culture) for SolvIT, ProteinMPNN, and ESM.
Sequence analysis and structural diversity of the generated designs.
A key metric we used to assess design quality is the structural conservation of the grafted segment with respect to its source. This was assessed by A) measuring the rmsd of each grafted segment relative to its original conformation and B) calculating the scaffold’s rmsd, excluding the segment, relative to the original, native scaffold conformation (Fig. 4).
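The two rmsd measurements can be sketched as below. The sketch assumes the two structures are already superimposed and share a one-to-one residue correspondence with one coordinate per residue; the superposition step itself and the choice of atoms are not specified here, and the function names are hypothetical.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))


def segment_and_scaffold_rmsd(design, reference, segment_indices):
    """Return (segment rmsd, scaffold rmsd), where the scaffold is every
    position not listed in segment_indices (0-based)."""
    segment = set(segment_indices)
    inside = [k for k in range(len(design)) if k in segment]
    outside = [k for k in range(len(design)) if k not in segment]
    return (
        rmsd([design[k] for k in inside], [reference[k] for k in inside]),
        rmsd([design[k] for k in outside], [reference[k] for k in outside]),
    )
```

Separating the two values distinguishes a graft that deforms internally from one that stays intact but perturbs the rest of the scaffold.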
Fig. 4.

Structural and sequence diversity of generated designs grouped by strategy. (A) The distribution of rmsd between grafted segments and the original segment (from the donor protein), all-structure rmsd (between design and the AlphaFold chimera before design), and segment 159 to 194 (which was grafted only in Strategies 3 and 4). (B) Distribution of sequence identity (over the entire protein sequence) for each of the designs relative to the template enzyme. (C) Mutation preference across different protein regions. Each mutation was assigned to one of the regions: grafted segment (81 to 122, 159 to 194), sphere of 4.5 Å around the segments, buried residues, or surface-exposed residues. (D) Association between the number of mutated residues in the core and expression or melting temperature. (E) Pairwise rmsd distribution per strategy. (F) Pairwise sequence identity distribution per strategy. (G) t-SNE projection of chimeras before and after design, with the scaffold ppgmk. (H) Fraction of charged/polar residues in protein cores per strategy.
For the segment consisting of residues 159 to 194, which remained unaltered in Strategies 1 and 2, we compared these to the native segment of the acceptor scaffold (ppgmk). This comparison served as a baseline for Strategies 3 and 4, representing the minimal rmsd observed when the segment spanning residues 159 to 194 is not derived from other proteins. Our analysis revealed that Strategy 1 introduced the greatest rmsd variation compared to the donor segments, affecting even the scaffold itself. The other strategies exhibited only minor deviations from the original segment’s conformation.
In terms of sequence diversity (Fig. 4B), the designs share between 56% and 89% sequence identity with the template enzyme. This range translates into more than 100 mutations relative to the template enzyme, indicating the breadth of sequence space sampled by our method.
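Percent identity figures depend on how gaps are treated, so it helps to make the convention explicit. The sketch below computes identity over columns where both aligned sequences have a residue; this convention is an assumption, as the text does not state which denominator was used.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns containing a gap in either sequence are excluded from the
    denominator (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not columns:
        raise ValueError("no aligned residue pairs to compare")
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)
```

Under this convention, the number of substitutions is the number of compared columns multiplied by (1 − identity/100); insertions and deletions introduced by segments of different length are counted separately.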
Not surprisingly, designs that were modified with multiple segments (Design Strategies 3 and 4) demonstrated higher sequence diversity relative to the template enzyme, with an average sequence identity of 65% and 60%, respectively. To analyze mutation preferences across different protein regions, we divided each protein design into six distinct areas: the swapped segments (residues 81 to 122, 159 to 194), the immediate vicinity of these segments defined as residues within a 4.5-Å radius, surface-exposed residues, and buried residues. Our analysis revealed a clear tendency for mutations to occur within the swapped segments themselves rather than in the adjacent areas, as depicted in Fig. 4C. We hypothesize that modifying the amino acids within the swapped segments is simpler than altering the amino acids in their surroundings, as the latter could lead to a complex cascade effect where more amino acids would be modified in the scaffold to accommodate the changes in the immediate vicinity of the swapped segments. Additionally, in Strategies 2 and 3, which both utilize RosettaDesign, there is a significant propensity to mutate not only the grafted segments but also the surface-exposed residues. Altering these surface regions, especially with polar or charged residues, can lower the overall energy associated with solvation, possibly compensating for any structural instability introduced by the grafting process.
Strategy 4 which employs the ProteinMPNN design algorithm demonstrates similar behavior (in absolute number of mutations) but also modifies the core residues to a much larger extent. We argue that this change is attributed to ProteinMPNN’s design process which ignores the structure’s input sequence and therefore is not biased by it.
We tested whether the number of mutated residues in the core has any effect on protein expression or melting temperature (Fig. 4D). Surprisingly, we found no statistically significant relationship between the number of mutated buried residues and whether a protein was well expressed (more than 10 µg protein per 0.5 mL elution volume) or had a melting temperature higher than the wild type (KS-test P-values of 0.18 and 0.08, respectively). The total diversity generated by each of the strategies was also analyzed by calculating for each pair of designs both the rmsd and sequence identity (Fig. 4 E and F).
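The two-sample Kolmogorov-Smirnov comparison used above can be illustrated with a minimal sketch. It computes only the KS statistic (the largest gap between the two empirical cumulative distributions) and not the P-value. The function name and the example counts are hypothetical and are not taken from the study data.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum of the ECDF difference is always reached at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical numbers of mutated buried residues per design (illustrative only).
well_expressed = [3, 5, 6, 8, 9, 11]
poorly_expressed = [4, 5, 7, 9, 10, 12]
print(ks_statistic(well_expressed, poorly_expressed))
```

A small statistic, as expected for two heavily overlapping samples, corresponds to the absence of a detectable difference between the groups.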
Examining the rmsd of the inserted segments with respect to the original segment conformation (in the context of the donor protein) reveals that Design strategy 3 has a higher rmsd compared to Design strategy 2. We attribute this difference to the fact that we introduce 2 segments in strategy 3 and therefore more modifications are required (both in structure and in sequence) to accommodate both segments with respect to each other and the template scaffold. Design Strategy 4 shows even higher rmsd compared to Design strategy 3. Here, we hypothesize that since ProteinMPNN introduced more sequence diversity compared to Strategy 3, this higher mutation load is reflected in the final model’s conformation.
We further explored the differences between the various strategies by applying Evolutionary Scale Modeling v2 (ESMv2) (43), a deep contextual language model based on NNs, to extract high-dimensional embeddings of the sequences we generated using the different methods. These embeddings were projected onto a two-dimensional plane using t-distributed Stochastic Neighbor Embedding (t-SNE) (44) (Fig. 4G). The resulting distribution illustrates distinct clustering for each strategy. Notably, Strategy 1 (blue) and Strategy 2 (orange) cluster closely, aligning near the chimeras post-AlphaFold and pre-sequence design. In contrast, Strategy 3 (green), which involved simultaneous grafting of two different segments, is markedly more distant yet still clusters with the pre-design chimeras from which it is derived. This is contrary to Strategy 4 (red), which diverges from the pre-design chimeras used in its generation, despite some being shared with Strategy 3. This divergence suggests that ProteinMPNN, even under low-temperature conditions, can effectively explore a broader sequence–structure space for each of the chimeras and still yield well-expressed designs (Soluble Expression and Thermal Melting Temperature Quantification section).
Computational experimentation of the different design strategies.
To better understand the strengths and weaknesses of each of the strategies used to generate the designs, we ran a comprehensive computational analysis of all the combinations of backbone sampling and sequence optimization (4 in total).
In each of the strategies described above, there are two main tasks: backbone conformation sampling using either AlphaFold or Rosetta (using the Splice algorithm) and sequence optimization using either RosettaDesign or ProteinMPNN.
Although the combination of Splice and ProteinMPNN was not used in any of the 4 strategies that were tested experimentally, we included it in this analysis for completeness.
Our dataset consisted of 61 homologs from the same Pfam fold family as ppgmk, which were used as the fragment donors to create the experimentally tested designs. Relative to the template protein ppgmk, the sequence similarity of the donor proteins ranges from 24% to 64% and the rmsd from 1 Å to 3.3 Å (SI Appendix, Fig. S3). The design schema involved grafting segments corresponding to residues 81 to 122 of ppgmk as described earlier, followed by sequence optimization. For each donor segment and design method, we generated 10 designs and assessed their performance by remodeling each design using AlphaFold. We recorded the predicted local distance difference test (pLDDT) scores (45) for both the scaffold and segment residues; the predicted Template Modeling score (pTM), a metric created by AlphaFold to assess the TM-Score (20, 46); and the rmsd and sequence identity of both the scaffold backbone atoms of ppgmk and the grafted segment relative to the original segment from which it was sourced. We then computed the average of these metrics for each protocol combination (AlphaFold/Splice + RosettaDesign/ProteinMPNN; see Table 2 and SI Appendix, Fig. S4).
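Two of the metrics listed above, sequence identity and rmsd, can be sketched as follows. This is a minimal illustration under stated assumptions: the sequences are already aligned, the coordinates are already superposed, and identity is counted over columns without gaps. The function names are hypothetical and the exact definitions used in the study may differ.

```python
import math


def sequence_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def rmsd(coords_a, coords_b):
    """RMSD (in the input units) between two equally sized, already superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy inputs (illustrative only, not study data).
print(sequence_identity("ACDE-G", "ACDFHG"))
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]))
```

In practice the superposition step (for example on scaffold backbone atoms) would precede the rmsd calculation; it is omitted here to keep the sketch short.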
Table 2.
Computational benchmark summary
| Grafting method | AlphaFold | AlphaFold | Splice | Splice |
|---|---|---|---|---|
| Sequence optimization | ProteinMPNN | RosettaDesign | ProteinMPNN | RosettaDesign |
| Segment rmsd (Å) | 1.15 | 1.28 | 1.18 | 1.37 |
| Scaffold rmsd (Å) | 0.93 | 0.99 | 0.79 | 1.01 |
| Segment pLDDT | 94.91 | 89.75 | 90.95 | 89.19 |
| All Structure pLDDT | 95.96 | 94.87 | 95.23 | 94.6 |
| pTM | 0.87 | 0.86 | 0.86 | 0.86 |
| Segment Sequence Identity (to source segment) (%) | 48 | 57 | 79 | 41 |
| Scaffold (without segment) Sequence Identity (%) | 92 | 93 | 97 | 91 |
The pairing of AlphaFold and ProteinMPNN performed better than the other combinations, with pLDDT scores standing out as the property with the highest and most statistically significant difference. For both AlphaFold and ProteinMPNN, we computed the average of each of the metrics across 10 designs for each of the chimeras generated from ppgmk and each of the 61 graft donors. The mean pLDDT of segment residues was 94.9, in contrast to 89.7 (KS-test P-value of 8.8e-11), 90.9 (KS-test P-value of 1.31e-6), and 89.1 (KS-test P-value of 8.8e-11) for AlphaFold+Rosetta, Splice+ProteinMPNN, and Splice+Rosetta, respectively (Table 2). Calculating the rmsd with respect to the original segment conformation revealed no statistically significant differences between the different approaches (assuming a significance level of alpha = 0.05), with averages of 1.15 Å, 1.28 Å, 1.18 Å, and 1.37 Å for AlphaFold+ProteinMPNN, AlphaFold+Rosetta, Splice+ProteinMPNN, and Splice+Rosetta, respectively. The significantly higher pLDDT suggests that the combination of AlphaFold+ProteinMPNN has the highest agreement between the model and the actual protein structure. While the rmsd values for the inserted segments across all methods are similar, taking pLDDT scores into account as well suggests that the rmsd values of the AlphaFold+ProteinMPNN are the most reliable. We hypothesized that AlphaFold’s conformation prediction accuracy leverages signals from sequences in the input MSA mapped to both the scaffold and the grafted segment to infer the correct context-dependent conformation. To investigate this hypothesis, we used ESMv2 (43) to test the effect of different MSA depths. We used K-Means to cluster the high-dimensional vectors derived from the ESM embeddings of the fold family sequences. For this analysis, we used different cluster numbers (from 2 up to 70).
A similar approach has been shown to successfully elucidate multiple conformational states of the circadian rhythm protein KaiB, the transcription factor RfaH, and the spindle checkpoint protein Mad2 (21). Our analysis revealed that the pLDDT scores of the grafted segment residues were significantly impacted by the number of clusters used, which is inversely correlated with MSA depth, indicating a correlation between MSA depth and structural prediction accuracy (SI Appendix, Fig. S5). Specifically, we observed a substantial decrease in pLDDT scores, averaging 14 points, even when clustering into two major clusters (K = 2). While this decline varied across different structures, it demonstrated a minor yet consistent association with K, the number of clusters (Pearson’s R2 = −0.14; P-value = 0.001). In contrast, the mean pLDDT score for the entire structure changed only marginally: from an initial average of 94.87 without clustering to 93 after the first subdivision (K = 2), and down to 90 for the shallowest MSAs, albeit with a higher statistical confidence (R2 = −0.36; P-value = 6.95e-17). The rmsd values of the grafted segment relative to the donor segment increased as the MSA became shallower, rising from an average of 1.15 Å to 1.58 Å. This increase in rmsd, however, was not statistically significant when tested for correlation with the number of clusters (Pearson’s R2 = 0.08; P-value = 0.06). We conclude from these findings that while relatively shallow MSAs can suffice when modeling natural proteins, deeper MSAs are required in order to accurately model chimeric proteins where structural elements are fitted together in “evolutionary jumps”. We argue that the reason for this is that natural proteins lie in a continuous region of protein sequence-conformation space, and the chimeric proteins extend the boundaries of these regions.
While shallow MSAs still contain enough information to capture the representations of the natural proteins within the same fold family, chimeric proteins present a harder challenge since the NN model needs to fill in the evolutionary gap between the set of natural proteins and the novel protein.
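The MSA-depth experiment described above clusters sequence embeddings with K-Means, so that a larger number of clusters yields shallower groups of sequences. A minimal sketch of that partitioning step follows. It uses plain Lloyd's algorithm with a deterministic initialization, toy two-dimensional vectors in place of ESM embeddings, and hypothetical function and sequence names. How each cluster was then passed to AlphaFold is not shown, and treating each cluster as one sub-MSA is our reading of the procedure.

```python
import math


def kmeans(vectors, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster index per input vector."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]  # deterministic init: the first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_msa(sequence_ids, labels):
    """Group sequence identifiers by cluster label: one shallower sub-MSA per cluster."""
    groups = {}
    for seq_id, lab in zip(sequence_ids, labels):
        groups.setdefault(lab, []).append(seq_id)
    return groups


# Toy embeddings and identifiers (illustrative only).
embeddings = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
ids = ["seqA", "seqB", "seqC", "seqD"]
labels = kmeans(embeddings, k=2)
print(split_msa(ids, labels))
```

Increasing `k` for a fixed set of sequences reduces the average number of sequences per group, which is the sense in which the number of clusters is inversely related to MSA depth.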
Soluble expression and thermal melting temperature quantification.
We cloned and expressed the designs along with two wild-type ROK family enzymes serving as controls: ppgmk, the template used for all designs, and gmuE, an ATP-dependent fructokinase (47).
Out of the 348 designs, 92 (26%) yielded more than 10 µg total protein per 0.5 mL induced culture volume, an additional 97 (28%) designs yielded detectable soluble protein amounts, and the remaining 159 (46%) designs showed no soluble expression.
Analyzing the results with respect to the different design strategies used, we found that none of the designs produced using Design Strategy 1 (backbone conformation sampling using Rosetta) showed detectable expression. In contrast, we observed soluble expression in 66% (78 out of 119), 33% (31 out of 95), and 85% (80 out of 94) of the designs created using Design Strategies 2, 3, and 4, respectively (summarized in Table 3).
Table 3.
Summary of enzymes’ measurements across tested strategies
| | Strategy 1 | Strategy 2 | Strategy 3 | Strategy 4 | Total |
|---|---|---|---|---|---|
| # of designs generated | 40 | 119 | 95 | 94 | 348 |
| Detectable expression | 0 | 78 | 31 | 80 | 189 |
| Well expressed (> 10 µg total protein per 0.5 mL induced culture volume) | 0 | 50 | 3 | 39 | 92 |
| Tm > 56.7 °C | 0 | 35 | 8 | 70 | 113 |
| Active (ATP+Glc) | 0 | 42 | 0 | 4 | 46 |
| Active (HexaP+Glc) | 0 | 54 | 2 | 0 | 56 |
| Active (ATP and HexaP) | 0 | 42 | 0 | 0 | 42 |
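The per-strategy expression rates can be recomputed directly from the counts in Table 3. The short sketch below does this and also derives the number of designs active with at least one phosphate donor by inclusion-exclusion over the Total column. The variable and function names are illustrative.

```python
# Counts copied from Table 3 (designs generated and designs with detectable expression).
designs = {"Strategy 1": 40, "Strategy 2": 119, "Strategy 3": 95, "Strategy 4": 94}
detectable = {"Strategy 1": 0, "Strategy 2": 78, "Strategy 3": 31, "Strategy 4": 80}


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


for strategy, n in designs.items():
    print(strategy, f"{detectable[strategy]}/{n}", f"{percent(detectable[strategy], n)}%")

total_designs = sum(designs.values())
total_detectable = sum(detectable.values())
print("All", f"{total_detectable}/{total_designs}", f"{percent(total_detectable, total_designs)}%")

# Designs active with either phosphate donor: |ATP| + |HexaP| - |both| (Total column).
active_atp, active_hexap, active_both = 46, 56, 42
active_any = active_atp + active_hexap - active_both
print("Active with ATP or HexaP:", active_any)
```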
Interestingly, we noted a distinct difference in the expression rates between Strategies 3 (RosettaDesign) and 4 (ProteinMPNN). Both strategies involve the simultaneous grafting of two different segments, with one of the grafted segments including the zinc finger motif, which is absent from the ppgmk template (48). This outcome suggests that the simultaneous grafting of multiple segments, particularly when these segments, which are separated in the linear sequence, interact in the three-dimensional protein structure, necessitates a nuanced understanding of both the grafted segments and the accepting template within their specific structural context.
While Strategy 1, similar to Strategy 2, uses RosettaDesign for sequence optimization and only grafts a single segment, it failed to generate expressible designs, in contrast to Strategy 2. A closer inspection of the designs created by both strategies reveals a number of key findings: A) The backbones that were generated by Strategy 1 differed significantly relative to the donor conformation (Fig. 4A). B) The sequence diversity of the designs generated by Strategy 1 is higher than that of the designs generated by Strategy 2 (Fig. 4B). C) Strategy 1 has a higher fraction of buried polar residues compared to Strategy 2 (KS-test P-value < 1e-10, Fig. 4 C and H). Taking these findings together, we hypothesize that the significant changes in backbone conformation created by Strategy 1 led both to erroneous sequence optimization and to potential exposure of core regions in the protein, leading to additional detrimental mutations.
Thermal Melting Temperature Quantification.
To evaluate the thermal stability of the designs, thermal melting temperature (Tm) was measured using Differential Scanning Fluorimetry with SyproOrange as a probe (49). A total of 135 designs (38% of all designs) underwent successful thermal stability evaluation, including the control wild-type enzymes ppgmk (the template enzyme) and gmuE, which exhibited Tm values of 56.7 °C and 62.8 °C, respectively.
Remarkably, 83% of the assessed designs (113 out of 135) demonstrated a higher Tm than that of ppgmk, the original template enzyme. Notably, Design Strategy 4, while being the most divergent in sequence and conformation relative to the template, yielded the highest average Tm among all tested proteins, with 70 (78%) designs surpassing the Tm of ppgmk. See Fig. 5F for a comprehensive distribution of the designs’ melting temperatures, grouped by design strategy and contrasted against the ppgmk and gmuE controls.
Fig. 5.
Enzymatic activities of selected enzyme designs. Each measurement was performed in duplicate. (A) The relative activity of glucose phosphorylation with polyphosphate for SKFe and ppgmk, the wild-type enzyme. Measurements were normalized according to the amount of product formed by each enzyme at 25 °C. (B and C) Thermal stability of ppgmk and SKFe: activity as a function of incubation time and temperature. (D) SKFe and ppgmk Michaelis–Menten kinetics on glucose. (E) SKFe and ppgmk Michaelis–Menten kinetics on hexametaphosphate. Error bars represent the SD of two independent experiments. The datasets were significantly different according to the extra sum of squares F-test (P = 0.006, n = 2). (F) Comparison of expression rates and Tm across different strategies. Non- or poorly expressing variants were assigned a Tm of 0; Strategy 1 is not shown due to lack of expressed variants. gmuE (a fructokinase, in magenta) and ppgmk (polyphosphate glucokinase, in red) expression and Tm are given as references.
Solubility classifier evaluation on backbone designed enzymes.
Major backbone perturbations and large sequence deviations from the wild-type have the potential to destabilize the protein, increase its tendency to misfold, and hence, decrease its total soluble expression (50). Therefore, we wanted to test whether classification to soluble/insoluble expression can be explained by trivial parameters such as sequence identity or rmsd to the original template enzyme.
Each measured variable (sequence identity and rmsd for both modified segments) exhibited a modest, yet statistically significant, correlation with soluble protein yield. Specifically, sequence identity demonstrated a Pearson’s R2 value of 0.22 (P = 2e-5), the rmsd value for the segment between residues 81 to 122 had an R2 value of −0.22 (P = 4e-5), and the rmsd value for the segment between residues 159 to 194 exhibited an R2 value of −0.31 (P = 1e-5). These measurements, however, show poor binary predictability for whether the expression yield is higher than 10 µg total protein per 0.5 mL induced culture volume, achieving AUCs of 0.54, 0.58, and 0.53 for sequence identity, segment 81 to 122 rmsd, and segment 159 to 194 rmsd, respectively (SI Appendix, Fig. S2). These findings indicate that the likelihood of expression in yields higher than 10 µg total protein per 0.5 mL induced culture volume cannot be explained by simple observations like the mutation load or the degree of structural perturbation to the template structure, which motivated us to develop a dedicated classifier that can capture more intricate interactions between protein conformation, sequence, and predicted expressibility.
The classifier’s training pipeline is described in the Methods section under “Training an E. coli–based heterologous expression predictor”. We evaluated the accuracy of our soluble expression classifier by predicting the solubility of each respective design. Samples that yielded in excess of 10 µg total protein per 0.5 mL induced culture volume were labeled as positive, while those with lesser yields were labeled as negative. We plotted an ROC graph (Fig. 3B), and calculated the AUC, attaining a value of 0.79. Establishing a probability threshold greater than 0.7 yielded an accuracy of 70%, a positive predictive value (PPV) of 43%, and a negative predictive value of 93%.
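The evaluation metrics reported above (AUC, accuracy, PPV, and negative predictive value at a score threshold) can be computed as in the following minimal sketch. The scores and labels are invented toy values, not the study's predictions, and the function names are hypothetical. The AUC is computed from its rank interpretation, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
def threshold_metrics(scores, labels, threshold):
    """Accuracy, PPV and NPV when samples scoring above the threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return accuracy, ppv, npv


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = yield above 10 µg per 0.5 mL, 0 = below (illustrative only).
toy_scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.2]
toy_labels = [1, 0, 1, 1, 0, 0]
print(threshold_metrics(toy_scores, toy_labels, threshold=0.7))
print(roc_auc(toy_scores, toy_labels))
```

Raising the threshold typically trades recall for PPV, which is the property exploited when the classifier is used to shortlist designs for experimental testing.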
Earlier research (51) has suggested that both ESMv2 and ProteinMPNN, despite not being explicitly trained to predict protein fitness, thermal stability (Tm), or expression yield, are capable of offering reliable unsupervised predictions regarding these experimental properties. However, when these metrics were applied to our designs, their predictive capacity was mediocre, both on the eSol dataset, and on our designs (Fig. 3 B and C). The ESMv2 mean likelihood calculated on the eSol dataset, when used as a predictor, achieved an AUC of 0.58, while on the designed sequences it achieved an AUC of 0.7. ProteinMPNN similarly, when using the mean negative log-likelihood scores for classification, achieved a classification AUC of 0.49 on the eSol dataset and 0.66 on the designed sequences.
When using our classifier to predict the solubility of designed enzymes (which were not part of the classifier’s training dataset), the calculated PPV suggests that experimentally testing designs with a prediction score higher than 0.7 would result in over 50% of the designs being well expressed, compared to 26% when selecting designs generated using the same design pipeline but without the additional solubility classification step.
Enzyme activity measurement.
The successful design of active enzymes, requiring sub-angstrom precision in configuring catalytic residues, strongly validates our method’s accuracy. We screened the soluble designs for glucokinase activity. We incubated the enzymes with glucose and either ATP or hexametaphosphate for 16 h and measured the concentration of glucose-6-phosphate created using a modified picoProbe assay as described in the Methods section. In total, 60 designs were active with either hexametaphosphate or ATP (Table 3); 54 of the active designs were created using Design Strategy 2 (90% of active enzymes), 2 designs were created using Design Strategy 3 (3.3% of active enzymes), and 4 designs were created using Design Strategy 4 (6.7% of active enzymes).
Interestingly, 8 (13%) designs showed obligatory polyphosphate-dependent activity, while the template enzyme can utilize both ATP and inorganic polyphosphate. Obligatory polyphosphate activity is not common in natural enzymes and has been documented in only a handful of cases (48, 52). The emergence of this feature in our designs underscores the potential of our method to introduce non-natural enzymatic traits.
We selected 2 designs with high soluble expression rates, Tm, and activity (design ids: SKFe, 1rvg) for additional biochemical and biophysical characterization. Both designs were created using Strategy 2; they had Tm values of 63 °C and 53 °C and yielded 105 µg and 110 µg protein per 0.5 mL elution volume, respectively. We defined maximal activity (100%) as the amount of product that 1rvg synthesized at a temperature of 25 °C within a span of 6 min. We then measured the relative activity of each of the designs and compared it to the ppgmk template, as a function of reaction temperature (Fig. 5A). Both designs show overall higher activity than the ppgmk template across all measured temperatures above 35 °C, with SKFe retaining close to 60% relative activity at 85 °C and more than 40% relative activity at 95 °C for the measured time duration. Conversely, the ppgmk template enzyme loses activity completely at 95 °C.
We next compared the capability of SKFe and the template enzyme, ppgmk, to sustain catalytic activity after prolonged temperature stress at different temperatures, by measuring the relative remaining activity after several incubation times and temperatures (Fig. 5B; see Methods for additional details).
Briefly, the enzymes were incubated without their respective substrates at varying temperatures for distinct time intervals, after which they were cooled to 37 °C. Subsequently, substrates were introduced and product formation was measured after a duration of 6 min. In contrast to ppgmk, which became inactive after 4 h at 55 °C and after only 30 min at 65 °C, SKFe retained over 80% of its activity even after 22 h at 55 °C, and more than 40% after 30 min at 65 °C.
We proceeded to characterize and compare the kinetics of SKFe and ppgmk using polyphosphate and glucose as substrates at 37 °C (Fig. 5 D and E). We calculated the Km of SKFe to be 3.15 mM with respect to glucose and 2.9 μg/mL with respect to polyphosphate. The equivalent parameters for ppgmk were calculated to be 0.75 mM and 12.26 μg/mL respectively. We hypothesized that SKFe’s higher affinity toward polyphosphate may come at the cost of lower affinity to ATP as SKFe had no detectable activity when ATP was used as the phosphate donor.
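To make the reported Km values more concrete, the sketch below evaluates the fraction of the maximal rate predicted by the standard Michaelis-Menten equation, v/Vmax = [S]/(Km + [S]), for each enzyme. It assumes simple single-substrate Michaelis-Menten behavior with the other substrate saturating. The 1 mM glucose and 5 µg/mL polyphosphate concentrations are arbitrary illustrative choices, and the function and constant names are hypothetical.

```python
def fractional_rate(substrate, km):
    """v / Vmax from the Michaelis-Menten equation v = Vmax * [S] / (Km + [S])."""
    if substrate < 0 or km <= 0:
        raise ValueError("substrate must be non-negative and Km must be positive")
    return substrate / (km + substrate)


# Km values as reported in the text.
KM_GLUCOSE_MM = {"SKFe": 3.15, "ppgmk": 0.75}
KM_POLYP_UG_PER_ML = {"SKFe": 2.9, "ppgmk": 12.26}

# Arbitrary example concentrations (assumptions for illustration only).
for enzyme, km in KM_GLUCOSE_MM.items():
    print(enzyme, "glucose 1 mM:", round(fractional_rate(1.0, km), 2))
for enzyme, km in KM_POLYP_UG_PER_ML.items():
    print(enzyme, "polyphosphate 5 ug/mL:", round(fractional_rate(5.0, km), 2))
```

By definition, each enzyme runs at half of its maximal rate when the substrate concentration equals its Km, so a lower Km means that saturation is approached at lower substrate concentrations.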
Discussion
Enzyme design presents one of the last frontiers in computational protein design, necessitating the optimization of multiple objectives, including expression, stability, conformational plasticity, and catalytic function. The last is particularly challenging, as it requires the modeling of quantum phenomena that remain poorly understood. The ability to design new-to-nature enzymes not only poses a stringent test of our understanding of the “protein rule set” but also presents a transformative technology, as it will enable new manufacturing processes that are orders of magnitude more efficient than current ones.
While designing completely de novo catalytic machinery still presents a daunting challenge in computational protein design, our approach circumvents this challenge and focuses on expanding the catalytic repertoire of natural enzymes by taking inspiration from evolutionary processes which generate diversity by conserving the core catalytic machinery while introducing changes to peripheral conformations and sequences. While this engineering concept by itself is not new and is used in experimental techniques such as directed evolution, these techniques are extremely inefficient in exploring the inconceivably large conformation-sequence space. Natural evolution explores the conformation-sequence space in a stepwise manner. This conservative approach reflects a biological constraint: Introducing too many mutations at once to a protein increases the risk of generating a non-foldable or non-functional protein.
The advent of computational design tools like Rosetta has made it possible to “leapfrog” through the conformation-sequence space, allowing for a more expansive exploration. However, these techniques also have their shortcomings. The reliance on random moves to explore the protein space and the use of a limited scoring function that fails to capture high-order amino-acid cooperativity mean that many design trajectories that explore the protein space end up in non-foldable regions.
In this work, we demonstrate an approach that leverages the latest advances in NNs and directly addresses these challenges. We treat the conformation-sequence design process as two separate steps moving through the protein sequence-conformation space. First, we move along the conformation dimension by exploiting the “forced-to-predict” feature of trained structure prediction deep NNs, meaning that even for input sequences that are “unfoldable” in the real world, these models would still produce a “native like” protein conformation. Following this step, we would most likely land in a region of conformation-sequence space that is non-foldable (Fig. 6). To rectify this, we apply an inverse folding NN to move us along the sequence dimension while keeping the conformation fixed (Fig. 6), putting us back in a foldable region of the protein space.
Using this approach, CoSaNN is able to generate novel enzymes that, while being extremely divergent from natural homologues in conformation and sequence space, still maintain wild-type level activity and demonstrate superior stability and expressibility, which are fundamental properties for any commercially relevant enzyme. Not only that, but we directly tackle the challenge of designing enzymes with elaborate activity mechanisms that undergo significant conformational changes upon substrate binding, underscoring our ability to capture intricate amino-acid interactions.
In its current implementation, CoSaNN relies on AlphaFold for modeling the chimeric protein conformations and requires an MSA as input. We observed in our analysis a correlation between the accuracy of the grafted segments and the depth of the input MSA. Specifically, our analysis showed that shallow MSAs result in grafts that have poor confidence (as is evidenced by the significant decrease in pLDDT). That observation led us to hypothesize that intermediary sequences, located between the donor and acceptor in sequence space, may provide crucial guidance to AlphaFold for accurately integrating the graft and adjusting the scaffold. These intermediary sequences in the MSA may act as bridges, offering reference points that facilitate a more precise grafting process. This consideration could potentially enhance the accuracy and efficacy of our computational designs by strategically enriching the MSA with these bridging sequences. Another observation was the low rate of active enzymes created using Strategy 4, even though this strategy excelled in producing well-expressed and thermally stable enzymes. One possible explanation could be the high mutation load of Strategy 4, specifically in the protein core compared to Strategy 3. These mutations could have led to subtle, yet significant, changes in the active site of the enzymes. Our results could indicate that ProteinMPNN would benefit from additional fine tuning specifically for enzyme design tasks as there might be additional subtle features that dictate correlations between active site residues and other positions in the protein structure.
We also investigated the use of ProteinMPNN and ESM likelihood scores as metrics for protein stability. We found that these scores were not ideal for selecting designs for experimental characterization and therefore developed our own classifier, SolvIT, which greatly simplifies the task of design selection.
In the process of developing CoSaNN, we tested different design strategies. We acknowledge that some of the differences observed between the design strategies could be caused by a convolution of methods (e.g. combining AlphaFold backbone sampling with RosettaDesign). To further elucidate the contribution of each method we performed in silico experiments that provided interesting insights into the effect each method had on the design outcome and explain why we believe that combining AlphaFold with ProteinMPNN provided superior results while at the same time highlighting areas that could benefit from further optimization.
CoSaNN offers a streamlined and resource-efficient approach for enzyme design, eliminating the need for extensive computing resources or domain expertise. Complemented by SolvIT, our soluble expression prediction model, it simplifies design selection and lowers the entry barrier for researchers pursuing new enzymatic activities.
Supplementary Material
Appendix 01 (PDF)
Dataset S01 (XLSX)
Acknowledgments
Author contributions
L.Z., I.L., and G.D.L. designed research; L.Z., N.A., I.L., A.K., N.S., and C.B. performed research; L.Z. and C.B. analyzed data; and L.Z. and I.L. wrote the paper.
Competing interests
All contributing authors are employees of Enzymit Ltd. or were employees of the company at the time the study was conducted. Gideon Lapidoth is the CEO, Co-Founder, and member of the Board of Enzymit Ltd. and has interests exceeding $5,000 or 5% equity in a company.
Footnotes
This article is a PNAS Direct Submission. L.J.C. is a guest editor invited by the Editorial Board.
Data, Materials, and Software Availability
Previously published data were used for this work (53). All study data are included in the article and/or SI Appendix.
References
- 1.Hughes A. L., Gene duplication and the origin of novel proteins. Proc. Natl. Acad. Sci. U.S.A. 102, 8791–8792 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Blake C. C. F., Do genes-in-pieces imply proteins-in-pieces? Nature 273, 267 (1978), 10.1038/273267a0. [DOI] [Google Scholar]
- 3.Bashton M., Chothia C., The generation of new protein functions by the combination of domains. Structure 15, 85–99 (2007). [DOI] [PubMed] [Google Scholar]
- 4.Grishin N. V., Fold change in evolution of protein structures. J. Struct. Biol. 134, 167–185 (2001). [DOI] [PubMed] [Google Scholar]
- 5.Fernandez-Fuentes N., Dybas J. M., Fiser A., Structural characteristics of novel protein folds. PLoS Comput. Biol. 6, e1000750 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Söding J., Lupas A. N., More than the sum of their parts: On the evolution of proteins from peptides. Bioessays 25, 837–846 (2003). [DOI] [PubMed] [Google Scholar]
- 7.Stemmer W. P., Rapid evolution of a protein in vitro by DNA shuffling. Nature 370, 389–391 (1994). [DOI] [PubMed] [Google Scholar]
- 8.Riechmann L., Winter G., Novel folded protein domains generated by combinatorial shuffling of polypeptide segments. Proc. Natl. Acad. Sci. U.S.A. 97, 10068–10073 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Otey C. R., et al. , Structure-guided recombination creates an artificial family of cytochromes P450. PLoS Biol. 4, e112 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Bharat T. A. M., Eisenbeis S., Zeth K., Höcker B., A beta alpha-barrel built by the combination of fragments from different folds. Proc. Natl. Acad. Sci. U.S.A. 105, 9942–9947 (2008).
- 11. Khersonsky O., Fleishman S. J., Why reinvent the wheel? Building new proteins based on ready-made parts. Protein Sci. 25, 1179–1187 (2016).
- 12. Jacobs T. M., et al., Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).
- 13. Lapidoth G. D., et al., AbDesign: An algorithm for combinatorial backbone design guided by natural conformations and sequences. Proteins 83, 1385–1406 (2015).
- 14. Lapidoth G., et al., Highly active enzymes by automated combinatorial backbone assembly and sequence design. Nat. Commun. 9, 2780 (2018).
- 15. Leman J. K., et al., Macromolecular modeling and design in Rosetta: Recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
- 16. Schellenberg M. J., et al., Context-dependent remodeling of structure in two large protein fragments. J. Mol. Biol. 402, 720–730 (2010).
- 17. de Bono S., Riechmann L., Girard E., Williams R. L., Winter G., A segment of cold shock protein directs the folding of a combinatorial protein. Proc. Natl. Acad. Sci. U.S.A. 102, 1396–1401 (2005).
- 18. Faver J. C., et al., The energy computation paradox and ab initio protein folding. PLoS One 6, e18868 (2011).
- 19. Khersonsky O., Fleishman S. J., Incorporating an allosteric regulatory site in an antibody through backbone design. Protein Sci. 26, 807–813 (2017).
- 20. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 21. Wayment-Steele H. K., Ovchinnikov S., Colwell L., Kern D., Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv [Preprint] (2022). 10.1101/2022.10.17.512570 (Accessed 12 December 2023).
- 22. Tokuriki N., Stricher F., Serrano L., Tawfik D. S., How protein stability and new functions trade off. PLoS Comput. Biol. 4, e1000002 (2008).
- 23. Mukai T., Kawai S., Mori S., Mikami B., Murata K., Crystal structure of bacterial inorganic polyphosphate/ATP-glucomannokinase. J. Biol. Chem. 279 (48), 50591–50600 (2004), 10.1074/jbc.m408126200.
- 24. Dong R., Peng Z., Zhang Y., Yang J., mTM-align: An algorithm for fast and accurate multiple protein structure alignment. Bioinformatics 34, 1719–1725 (2018).
- 25. Larion M., Moore L. B., Thompson S. M., Miller B. G., Divergent evolution of function in the ROK sugar kinase superfamily: Role of enzyme loops in substrate specificity. Biochemistry 46, 13564–13572 (2007).
- 26. Mirdita M., et al., ColabFold: Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
- 27. Steinegger M., Söding J., MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
- 28. Dauparas J., et al., Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
- 29. Niwa T., et al., Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc. Natl. Acad. Sci. U.S.A. 106, 4201–4206 (2009).
- 30. Brody S., Alon U., Yahav E., How attentive are graph attention networks? arXiv [Preprint] (2021). 10.48550/arXiv.2105.14491 (Accessed 1 May 2023).
- 31. Baek J., Kang M., Hwang S. J., Accurate learning of graph representations with graph multiset pooling. arXiv [Preprint] (2021). 10.48550/arXiv.2102.11533 (Accessed 1 May 2023).
- 32. Paszke A., et al., PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
- 33. Fey M., Lenssen J. E., Fast graph representation learning with PyTorch Geometric. arXiv [Preprint] (2019). 10.48550/arXiv.1903.02428 (Accessed 1 May 2023).
- 34. Miyazono K.-I., et al., Substrate recognition mechanism and substrate-dependent conformational changes of an ROK family glucokinase from Streptomyces griseus. J. Bacteriol. 194, 607–616 (2011), 10.1128/jb.06173-11.
- 35. Conejo M. S., Thompson S. M., Miller B. G., Evolutionary bases of carbohydrate recognition and substrate discrimination in the ROK protein family. J. Mol. Evol. 70, 545–556 (2010).
- 36. Titgemeyer F., Reizer J., Reizer A., Saier M. H., Evolutionary relationships between sugar kinases and transcriptional repressors in bacteria. Microbiology 140, 2349–2354 (1994).
- 37. Hansen T., Schönheit P., ATP-dependent glucokinase from the hyperthermophilic bacterium Thermotoga maritima represents an extremely thermophilic ROK glucokinase with high substrate specificity. FEMS Microbiol. Lett. 226, 405–411 (2003).
- 38. Park H.-S., et al., Design and evolution of new catalytic activity with an existing protein scaffold. Science 311, 535–538 (2006).
- 39. Claren J., Malisi C., Höcker B., Sterner R., Establishing wild-type levels of catalytic activity on natural and artificial (beta alpha)8-barrel protein scaffolds. Proc. Natl. Acad. Sci. U.S.A. 106, 3704–3709 (2009).
- 40. Sperl J. M., Rohweder B., Rajendran C., Sterner R., Establishing catalytic activity on an artificial (βα)8-barrel protein designed from identical half-barrels. FEBS Lett. 587, 2798–2805 (2013).
- 41. Kawai S., Mukai T., Mori S., Mikami B., Murata K., Hypothesis: Structures, evolution, and ancestor of glucose kinases in the hexokinase family. J. Biosci. Bioeng. 99, 320–330 (2005).
- 42. Canutescu A. A., Dunbrack R. L. Jr., Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12, 963–972 (2003).
- 43. Rives A., et al., Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021).
- 44. Hinton G. E., Roweis S., Stochastic neighbor embedding. Adv. Neural Inf. Process. Syst. 15, 857–864 (2002).
- 45. Mariani V., Biasini M., Barbato A., Schwede T., lDDT: A local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
- 46. Zhang Y., Skolnick J., Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
- 47. Nocek B., et al., Structural studies of ROK fructokinase YdhR from Bacillus subtilis: Insights into substrate binding and fructose specificity. J. Mol. Biol. 406, 325–342 (2011).
- 48. Tanaka S., et al., Strictly polyphosphate-dependent glucokinase in a polyphosphate-accumulating bacterium, Microlunatus phosphovorus. J. Bacteriol. 185, 5654–5656 (2003).
- 49. Hołubowicz R., et al., Effect of calcium ions on structure and stability of the C1q-like domain of otolin-1 from human and zebrafish. FEBS J. 284, 4278–4297 (2017).
- 50. Tokuriki N., Stricher F., Schymkowitz J., Serrano L., Tawfik D. S., The stability effects of protein mutations appear to be universally distributed. J. Mol. Biol. 369, 1318–1332 (2007).
- 51. Johnson S. R., et al., Computational scoring and experimental evaluation of enzymes generated by neural networks. bioRxiv [Preprint] (2023). 10.1101/2023.03.04.531015 (Accessed 10 May 2023).
- 52. Klemke F., et al., All1371 is a polyphosphate-dependent glucokinase in Anabaena sp. PCC 7120. Microbiology 160, 2807–2819 (2014).
- 53. Delaney J. S., ESOL: Estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000–1005 (2004), 10.1021/ci034243x.
Supplementary Materials
Appendix 01 (PDF)
Dataset S01 (XLSX)
Data Availability Statement
Previously published data were used for this work (53). All study data are included in the article and/or SI Appendix.




