Abstract
Disulfide bonds, covalently formed by sulfur atoms in cysteine residues, play a crucial role in protein folding and structure stability. Considering their significance, artificial disulfide bonds are often introduced to enhance protein thermostability. Although an increasing number of tools can assist with this task, significant amounts of time and resources are often wasted owing to inadequately considered designs. To enhance the accuracy and efficiency of designing disulfide bonds for protein thermostability improvement, we first collected disulfide bond and protein thermostability data from extensive literature sources. Thereafter, we extracted various sequence‐ and structure‐based features and constructed machine‐learning models to predict whether disulfide bonds can improve protein thermostability. Among all models, the neighborhood context model based on the Adaboost‐DT algorithm performed the best, yielding an area under the receiver operating characteristic curve of 0.773 and an accuracy of 0.714. Furthermore, we found that AlphaFold2 performed well in predicting disulfide bonds and that, to some extent, the coevolutionary relationship between residue pairs can guide artificial disulfide bond design. Moreover, several mutants of imine reductase 89 (IR89) with artificially designed thermostable disulfide bonds were experimentally proven to be considerably efficient for substrate catalysis. The SS‐bond data have been integrated into an online server, namely, ThermoLink, available at guolab.mpu.edu.mo/thermoLink.
Keywords: enzyme, enzyme fitness, machine learning, protein thermostability, SS bond
1. INTRODUCTION
Thermostability is an essential requirement for industrial enzymes, facilitating efficient reactions at high temperatures and prolonging enzyme half‐life (Wu et al., 2023). Although ongoing improvements in enzyme activity and stability are imperative, natural evolution does not guarantee optimal states in terms of both thermostability and activity. This is because enzymes evolve under varying environmental conditions and selective pressures, potentially prioritizing certain aspects of enzyme function over others (Senior et al., 1976). Consequently, enzymes may possess inherent trade‐offs or suboptimal characteristics that limit their overall performance. Through directed evolution (Kuchner & Arnold, 1997) or engineering‐based approaches (Bloom et al., 2005; Mazurenko et al., 2019), the performance of enzymes in industrial applications can be improved by enhancing their catalytic efficiency, substrate specificity, resistance to denaturation, and tolerance to harsh conditions (such as a high concentration of organic molecules). This optimization process involves modifying key amino acids by introducing beneficial mutations, short peptide tags, or occasionally large sequence substitutions to enhance enzyme properties (Han et al., 2020; Nixon et al., 1998). Through an iterative process of exploring and refining enzyme variants, researchers can unlock the previously unexploited capabilities of natural enzymes, thus helping enhance their catalytic efficiency across diverse applications.
To develop more stable enzymes for industrial applications, various strategies, such as reverting to consensus mutations, mutating to thermophilic homologs, rigidifying flexible regions, introducing salt or disulfide bridges, and incorporating proline mutations, have been proposed and implemented (Ban et al., 2021; Prakash & Jaiswal, 2010; Suplatov et al., 2015; Yu et al., 2022; Yu & Huang, 2014). In particular, the introduction of disulfide bonds (SS bonds) is widely considered a viable and effective strategy to improve enzyme thermostability in protein engineering. These covalent bonds, formed between pairs of proximal cysteines, act as structural scaffolds by establishing intra‐ or inter‐molecular connections, reinforce protein structure, and therefore play crucial roles in protein folding, stability, and function (Creighton et al., 1995; Gihaz et al., 2020; Hogg, 2009; Ning et al., 2014). However, whether all the designed SS bonds contribute to favorable thermostability improvement is yet to be fully elucidated.
Currently, with the development of structural biology and the emergence of artificial intelligence (AI)‐driven tools such as AlphaFold2 (AF2) (Jumper et al., 2021), rational protein design is accurately and efficiently facilitated by rich structural information. Meanwhile, certain SS‐bond design tools, including Disulfide by Design 2.0 (DBD2) (Craig & Dombkowski, 2013), MODIP (Dani et al., 2003), and Yosshi (Suplatov et al., 2019), among others, have been developed. However, even with the aid of these tools, undesired designs still lead to negative results, including inferior thermostability, lower activity, or even loss of function. Therefore, further investigations into the stability enhancement or functional gain of enzymes upon introducing new SS bridges are imperative.
Machine learning (ML) has recently been used to accelerate protein evolution (Kunka et al., 2023; Siedhoff et al., 2021). By training on a large dataset of enzyme sequences and their corresponding characteristics, ML algorithms can learn the relationship between them, allowing accurate predictions of the characteristics of novel enzyme variants. Different training data can stimulate the exploration of diverse regions of sequence space in ML‐guided directed evolution, potentially resulting in varying levels of functional improvement (Saito et al., 2021). Through ML, new enzyme variants with improved thermostability and functionality can be designed by exploring different sequence spaces and SS bond patterns. These ML‐driven engineering approaches may provide an efficient pathway for developing thermally stable enzymes to address challenges in industrial and biotechnological applications.
Increasing instances of engineered SS bonds, including successful implementations and failures as well as undetectable ones, have been documented (Li et al., 2020, 2022; Pang et al., 2020; Tanghe et al., 2017). However, the limited data in the literature potentially hinder our ability to perform further analyses, draw robust conclusions, identify patterns, or make accurate predictions. To address the gaps in this rapidly evolving field, we curated a database named ThermoLink, based on an extensive literature search, to investigate the specific impact of different types of SS bridges on enzyme thermostability. Sequence‐ and structure‐based features were extracted for in‐depth analysis, which uncovered several intriguing and significant findings on SS bonds and their capacity to enhance thermostability. Based on these features, ML models were generated to predict the thermostability effect of introducing an SS bridge. The overall workflow is shown in Figure 1.
FIGURE 1.

Study workflow chart. Experimental data on the SS bonds that influence enzyme thermostability and function were collected from the literature. Sequence information was derived from the National Center for Biotechnology Information (NCBI) and UniProt databases, and the structures of wild‐type (WT) and cysteine mutants were predicted using an AI‐powered tool, namely, AF2. Both sequence‐ and structure‐based features were considered for the construction of the prediction model using multiple ML methods, such as Support Vector Machine (SVM), XGBoost (eXtreme Gradient Boosting), Random Forest (RF), Adaptive Boosting Decision Tree (Adaboost‐DT), and Adaptive Boosting Random Forest (Adaboost‐RF).
To demonstrate the effectiveness of the SS bond‐guided protein engineering protocol, we designed several mutants for imine reductase (IR89), which synthesizes a key pharmaceutical intermediate of tofacitinib, a Janus kinase (JAK) inhibitor used for several diseases, including arthritis, ankylosing spondylitis, and ulcerative colitis. Encouragingly, with the aid of AF2 and our models, eight of the thirteen SS‐bond mutants of IR89 that we tested were positive, possessing substantially higher activities. Overall, we identified patterns and regularities that provide data and theoretical support for a more practical design of SS bonds for enzyme engineering in the future.
2. RESULTS
2.1. SS‐bond database for enzyme thermostability engineering
In this work, we collected data on SS linkages, both artificial and native, along with corresponding enzyme thermostability information. For each item, we compiled the sequence ID, source, and enzyme category as well as the designed SS‐bond sites and consequent impact on enzyme thermostability. In our study, we classified “enhanced thermostability” as positive data (55%) and other situations as negative data (Figure 2a,b). According to the IUBMB classification system (McDonald & Tipton, 2014), enzymes can be grouped into six categories: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases, which are involved in oxidation–reduction, group transfer, hydrolysis, condensation, isomerization, and ligation reactions, respectively (Schomburg & Schomburg, 2010). Our collected data had relatively substantial proportions of hydrolases and oxidoreductases, accounting for 60% and 32%, respectively (Figure 2d). Hydrolases, including lipases, esterases, amylases, and proteases, among others, serve an essential role in industrial production as important biological components (Victorino da Silva Amatto et al., 2022). SS bonds are often introduced into lipases to increase their thermostability, predominantly by rigidifying flexible lid domains.
FIGURE 2.

Dataset information. (a and b) Distribution of the functional SS bond types collected in this work. (c–e) Distributions of source organisms (c), enzyme classifications (d), and disulfide bond types and positions in enzyme structures within the database, including artificial intra‐ and interchain and native intrachain (e).
Numerous native SS linkages use alanine and serine mutations to confirm their functions, and our data included eight of these cases. Among the artificial SS linkages, 92.5% were intrachain (within the same chain) and 5.7% interchain (between different chains) (Figure 2e). Additionally, 103 enzyme entries were specifically utilized to analyze intrachain SS bonds, while 11 were specifically employed to design and investigate interchain SS bonds.
2.2. Key features derived from the mutation sites
To identify the critical factors underlying the stabilizing or destabilizing effects of engineered SS linkages on enzymes, we extracted features based on the sequences and structures of these engineered enzymes, including evolutionary information, micro‐environment information, and other geometric properties of mutation sites. The ranking of the features' importance scores is shown in Figure S1, which displays the top 10 factors retrieved from the mutation site and neighborhood context. The most important feature in both models is the mutation‐induced Gibbs free energy change. Other features, such as the geometric features (distance and dihedral angles), evolutionary characteristics (position‐specific scoring matrix [PSSM] and BLOSUM62), local interactions (hydrogen bonds), residue type, solvent‐accessible surface area (SASA), and secondary structure (assigned using Define Secondary Structure of Proteins [DSSP]), among others, are also important. Thereafter, we comprehensively analyzed the importance of these features in improving protein thermostability (Figure 3).
FIGURE 3.

Distribution of key features, including the PSSM scores (a), correlation scores between mutation sites calculated using the BLOSUM62 matrix (b), mutation energy upon cysteine mutation (c), SASA of design sites in WT structures (d), and non‐covalent interactions between mutation sites and their surrounding residues in the WT structure, including hydrogen bonds (e) and salt bridges (f).
2.2.1. Conservation of engineering sites
Zhu et al. (2010) found that conservation characteristics played an important role in post‐translational modifications, such as disulfide formation. Hence, for SS‐bond design, taking conservation into account may reduce the impact of the double mutations on protein function. Herein, we initially attempted to investigate the evolutionary relationship between the mutated site pairs involved in SS‐bond formation.
The PSSM is a widely used computational tool in bioinformatics and protein analysis (Altschul et al., 1997; Klesmith et al., 2017; Xing et al., 2024). Based on multiple sequence alignments (MSAs) of related proteins, the PSSM can provide valuable information on the conservation of residues at each position after alignment. In this study, PSSM scores were calculated for each sequence and incorporated into multiple mutation prediction algorithms (Nagar et al., 2023). As shown in Figure 3a, the PSSM scores were calculated for cysteine at the engineering sites. For each sample, the score was the sum of the cysteine scores at the two SS‐bond sites. In the distribution plot, the PSSM scores of the positive samples tended to be higher. A higher PSSM score indicates a greater likelihood of or preference for a particular amino acid at a specific position in a protein sequence.
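The per-sample score described above can be illustrated with a minimal Python sketch. The data layout (one dictionary of scores per sequence position), the function name `cysteine_pair_score`, and the toy values are assumptions made for illustration; they are not taken from the study's code or data.

```python
from typing import Dict, List, Tuple

# Assumed layout: one dict per sequence position, mapping each
# amino-acid letter to its PSSM log-odds score.
Pssm = List[Dict[str, float]]


def cysteine_pair_score(pssm: Pssm, site_pair: Tuple[int, int]) -> float:
    """Sum the cysteine PSSM scores at the two designed sites (1-based)."""
    i, j = site_pair
    return pssm[i - 1]["C"] + pssm[j - 1]["C"]


# Toy three-position PSSM with made-up scores (illustration only).
toy_pssm: Pssm = [
    {"A": 2.0, "C": -1.0},
    {"A": 0.0, "C": 3.0},
    {"A": -2.0, "C": 1.0},
]

# Score for a hypothetical SS bond designed between positions 1 and 3.
pair_score = cysteine_pair_score(toy_pssm, (1, 3))
```

Under this reading, a pair whose two sites both tolerate cysteine in the alignment receives a higher summed score.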
Furthermore, the correlation scores of mutation site pairs were calculated using the BLOSUM62 matrix (Eddy, 2004; Song et al., 2014), a commonly used substitution matrix for the alignment of protein sequences. The BLOSUM62 matrix can be used to evaluate the substitution frequencies between different residues. It is based on extensive protein sequence alignment data and reflects the evolutionary relationships and co‐occurrence patterns between different residues. As illustrated in Figure 3b, the positive SS bonds exhibited stronger correlations between site pairs. Hence, the positive sites have a slightly stronger evolutionary relationship.
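How the pair score is derived from BLOSUM62 is not spelled out in the text, so the following Python sketch shows two plausible readings, both of which are assumptions: the substitution score between the two WT residues of a pair, and the summed score for substituting each WT residue with cysteine. Only a five-residue subset of the standard BLOSUM62 matrix is hard-coded to keep the example self-contained; the function names are hypothetical.

```python
# Subset of the standard BLOSUM62 matrix (residues A, S, T, G, C only).
_BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "S"): 1, ("A", "T"): 0, ("A", "G"): 0, ("A", "C"): 0,
    ("S", "S"): 4, ("S", "T"): 1, ("S", "G"): 0, ("S", "C"): -1,
    ("T", "T"): 5, ("T", "G"): -2, ("T", "C"): -1,
    ("G", "G"): 6, ("G", "C"): -3,
    ("C", "C"): 9,
}


def blosum62(a: str, b: str) -> int:
    """Look up a symmetric BLOSUM62 score for two residues in the subset."""
    key = (a, b) if (a, b) in _BLOSUM62_SUBSET else (b, a)
    return _BLOSUM62_SUBSET[key]


def wt_pair_score(wt_i: str, wt_j: str) -> int:
    """Reading 1 (assumption): score between the two wild-type residues."""
    return blosum62(wt_i, wt_j)


def cys_substitution_score(wt_i: str, wt_j: str) -> int:
    """Reading 2 (assumption): summed score of mutating each site to C."""
    return blosum62(wt_i, "C") + blosum62(wt_j, "C")
```

For a hypothetical A/S site pair, `wt_pair_score("A", "S")` is 1 and `cys_substitution_score("A", "S")` is -1, since A→C scores 0 and S→C scores -1 in BLOSUM62.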
2.2.2. Change in Gibbs free energy upon mutation
In protein systems, the Gibbs free energy difference potentially reflects the variation in stability (folding free energy) resulting from a mutation. It represents the energy contribution associated with altering the protein's structure, interactions, or conformational dynamics. For rational enzyme design, energy shifts from WT to mutant have been considered (Duan et al., 2020; Fuxreiter & Mones, 2014). Several protocols aimed at calculating free energy differences have been developed, and they have grown in popularity over the past years. An assessment of energy change is typically based on thermodynamic parameters, such as ΔΔG. A highly positive value indicates that the mutation may lead to enzyme instability, whereas a smaller value suggests that the mutation is potentially tolerable.
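For reference, the free energy change upon mutation is commonly written as the difference between the folding free energies of the mutant and the WT. The sign convention below is the widely used one and is an assumption here, as it is not stated explicitly in the text:

```latex
\Delta\Delta G \;=\; \Delta G_{\mathrm{fold}}^{\mathrm{mutant}} \;-\; \Delta G_{\mathrm{fold}}^{\mathrm{WT}}
```

Under this convention, a positive value means that the mutant is predicted to be less stable than the WT, which matches the interpretation given above.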
In this work, the energy for each pair of residues involved in an SS bond was calculated using the FoldX suite. Figure 3c displays the distinct variations in the peak values of the distribution curve presenting the changes in energy for increased thermostability as well as those of other data types. The SS bonds with enhanced thermostability are characterized by smaller energy changes, further demonstrating the importance of energy calculations in improving thermostability based on the SS‐bond design.
2.2.3. Exposure levels of engineering sites
The SASA is a measure of the contact extent between residues located on the protein and the surrounding solvent. According to the distribution curves in Figure 3d, the positive samples yielded higher SASA values before the introduction of SS bonds, suggesting that incorporating SS bonds at solvent‐exposed positions may help enhance protein stability. However, the introduction of SS bonds may restrict enzyme flexibility and impede conformational changes in the active site, potentially impairing enzyme function. Therefore, caution is required when designing SS bonds, particularly in proximity to essential functional sites.
2.2.4. Non‐covalent interactions around mutation sites
A diverse range of non‐covalent interactions serve an essential role in achieving and maintaining protein structure stability. These interactions include hydrogen bonds, salt bridges, and Van der Waals interactions, among others. Herein, in terms of polar interactions, we quantified the native hydrogen bonds and salt bridges around the mutation sites and compared their distribution between the positive and negative samples. As shown in Figure 3e,f, the strength of the native interactions around engineering sites is weaker in the positive samples than in the negative ones. When the introduction of an SS bond fails to improve the thermostability of an enzyme, it tends to preserve the original conformation, resulting in a relatively minor impact on its stability. Hence, considering the native interactions around the WT sites, the design process can be strategically optimized to preserve or improve the protein's stability and functionality, enhancing the probability of achieving the desired modifications with precision and efficacy.
2.2.5. Amino‐acid types and positions of the engineering sites
To determine the impact of residue substitutions on enzyme thermostability, we specifically analyzed residue frequency distribution at the cysteine mutation sites within each of the two categories (Figure 4a). Alanine yielded the highest frequency in both categories, exhibiting consistency with the BLOSUM62 matrix and indicating that A has a high probability of substitution with C. Interestingly, mutations that are more likely to be associated with improved enzyme thermostability involve conservative replacements, such as S and T, which potentially exert less influence on local polarity and volume. Typically, numerous proteins possess unconstrained termini, and the stabilization of these short unconstrained regions can tremendously improve the enzyme's overall thermostability. For example, Zhou et al. (2016) successfully introduced an N‐terminal disulfide to improve the thermostability of Aspergillus niger‐derived GH11 xylanase. As illustrated in Figure 4b, our compiled data indicate that SS bonds involving terminal residues are particularly effective at increasing thermostability. Hence, when designing disulfide bonds, prioritizing A, S, or T at mutation sites and modifying terminal regions are advisable to ensure their substantial impact on thermostability.
FIGURE 4.

Key characteristics of mutation sites: (a) residue‐type distribution at mutation sites; (b) location of the SS bond design site at the terminal or non‐terminal region, with a terminal SS bond being one in which at least one residue lies within the first or last five residues of the entire sequence; (c and d) heatmap plots of the secondary structure information of the mutation sites for negative and positive samples. Abbreviations: H, helix; E, sheet; L, loop.
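The terminal/non‐terminal rule stated in the caption of Figure 4b can be expressed as a short Python sketch. The function name, the 1-based numbering, and the `window` parameter are assumptions introduced for illustration.

```python
def is_terminal_bond(site_i: int, site_j: int, seq_len: int, window: int = 5) -> bool:
    """Return True if at least one site (1-based) lies within the first
    or last `window` residues of a sequence of length `seq_len`."""

    def near_terminus(pos: int) -> bool:
        return pos <= window or pos > seq_len - window

    return near_terminus(site_i) or near_terminus(site_j)
```

For a hypothetical 100-residue enzyme, a bond between positions 3 and 50 counts as terminal, whereas one between positions 6 and 95 does not.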
2.3. Key features of residues near the mutation sites
The micro‐environment formed by different residues plays a central role in forming protein three‐dimensional (3D) structures (Johansson et al., 2010). As SS bonds significantly contribute to protein structural stability, the micro‐environment surrounding an SS bond should also affect its functionality. Most natural SS bonds, preserved throughout long‐term evolution, are exceedingly likely to contribute toward enzyme thermostability. Analyzing the micro‐environment difference between the artificial SS bonds collected in the present work and natural ones may provide clues for improving protein thermostability by introducing extra SS bonds. Our subsequent analysis considered both sequence‐ and structure‐based micro‐environments (Figure 5).
FIGURE 5.

Key features of residues around SS‐bond design sites. (a and b) residue‐type and secondary structure distributions of sequential neighbor residues for positive (a) and negative (b) data. The residue types of all samples were clustered by probability value. Abbreviations: H, helix; E, sheet; L, loop; −, not predicted. (c) Amino acid preference for spatial adjacency residues around the engineered (negative vs. positive) SS bonds. (d) Amino acid preference for the spatial adjacency residues around the native SS bonds compared with that for all protein sequences. (e–h) Comparisons of the two‐by‐two linear regression curves of the four datasets mentioned above. Native SS bond information and the corresponding full‐length sequences were retrieved from the CullPDB database.
First, we extracted length‐21 amino‐acid sequences centered on the mutation sites. Subsequently, a statistical analysis was performed to determine the residue‐type distribution at each position. This sequence‐based information was compared between the positive and negative samples of artificial SS bonds. As illustrated in Figure 5a,b, the residue‐type distributions in both datasets were highly similar and were classified into two primary groups. Residues A, S, T, and G were more prevalent than others, especially A and G. In both datasets, A was the most frequently observed residue type at the mutated sites, while G was common at positions other than the engineering sites. Although S and T were abundant in the positive samples, they were relatively scarce in the negative samples. The occurrence of C at the mutated sites indicates that some SS bonds require only one mutation, which pairs with a native C. Single mutations generally exert a lesser effect than double mutations. Consistent with the situation shown in Figure 4c,d, mutation sites located on the beta‐sheet are more likely to improve enzyme thermostability.
Thereafter, the residue distributions in the spatial vicinities of the positive and negative SS bonds were examined. As illustrated in Figure 5c, the distributions of positive and negative data are similar. To verify whether this characteristic is specific to residues in close proximity to SS bonds or is related to the overall frequency of amino acids in the protein sequence, we compared the frequency of amino acids in the spatial vicinity of natural SS bond sites with that of all amino acids in the protein sequence (Figure 5d). Interestingly, the distribution trends were remarkably similar.
Subsequently, pairwise linear regressions were performed between these four distributions (Figure 5e–h). The correlation between the positive and negative datasets was approximately 0.86, which was extremely close to that between natural disulfide bond sites and all sequences (0.87). However, the positive data were considerably more similar to the natural SS bonds than the negative data were (0.92 vs. 0.80). Although the local SS‐bond sites and overall sequences exhibited a strong correlation (0.87), a residue preference remained within the specified SS‐bond samples. G, with its minimal side chain, predominantly occurs around natural and positive SS bonds. L, with a larger hydrophobic side chain, is less likely to be found near SS bonds. T and S, which resemble C, frequently appear around positive and natural SS bonds. Based on the aforementioned analysis, utilizing amino acid S as a substitution site for C is more likely to yield desired outcomes. Basic amino acids, including K and R, are less frequently observed near positive samples.
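The window extraction and per‐position statistics described above can be sketched in Python as follows. This is an illustration rather than the study's implementation: the function names are hypothetical, and padding with "-" where a window runs past a sequence end is an assumption, as the handling of such cases is not described.

```python
from collections import Counter
from typing import Dict, List


def window_around(seq: str, site: int, half_width: int = 10, pad: str = "-") -> str:
    """Return the (2 * half_width + 1)-residue window centered on a 1-based
    site, padding with `pad` where the window runs past a sequence end."""
    idx = site - 1
    left = seq[max(0, idx - half_width):idx]
    right = seq[idx + 1:idx + 1 + half_width]
    return (pad * (half_width - len(left)) + left + seq[idx]
            + right + pad * (half_width - len(right)))


def position_frequencies(windows: List[str], pad: str = "-") -> List[Dict[str, float]]:
    """For each window position, compute residue-type frequencies across
    all samples, ignoring padding characters."""
    freqs = []
    for pos in range(len(windows[0])):
        counts = Counter(w[pos] for w in windows if w[pos] != pad)
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()} if total else {})
    return freqs
```

With the default `half_width=10`, each window has 21 positions, and position 11 of every window corresponds to the mutation site itself.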
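A comparison of this kind can be sketched as a Pearson correlation between two amino‐acid frequency profiles. Whether the reported values are correlation coefficients or coefficients of determination is not stated, so the choice of Pearson's r below is an assumption, as are the function name and the toy frequencies.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up frequencies for four residue types in two datasets
# (illustration only; both vectors use the same residue order).
freq_dataset_1 = [0.12, 0.09, 0.07, 0.03]
freq_dataset_2 = [0.11, 0.10, 0.06, 0.04]
similarity = pearson_r(freq_dataset_1, freq_dataset_2)
```

In practice, each vector would hold the 20 amino‐acid frequencies of one dataset, and the calculation would be repeated for every pair of datasets.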
2.4. Performance of the binary classification models
To address the current challenge of engineering enzymes, we developed innovative prediction models focused on the influence of introduced SS bonds on enzyme thermostability. Our model leverages two sets of features: one set solely based on mutation site information and the other incorporating information on the mutation site and its surrounding properties. To construct the prediction model, we employed a diverse range of ML algorithms, including SVM, XGBoost, RF, Adaboost‐DT, and Adaboost‐RF.
Initially, we extracted a comprehensive set of features based on the mutation sites. These features comprise crucial information, such as mutation type, conservation information, and structure information, among others. To ensure optimal utilization of these features, we employed feature‐engineering techniques, including feature selection and normalization. In addition, we considered the properties surrounding the mutation sites. We extracted contextual features that encompassed sequence information on neighboring residues, structural characteristics, and other relevant attributes. These contextual features provided a more comprehensive description of the mutations. Based on the two sets of features, our prediction model was constructed using various ML algorithms.
To evaluate the performance of the models, we assessed key evaluation metrics, such as accuracy, recall, the F1 score, specificity, the area under the curve (AUC), and precision. The results (Table 1) support the efficacy of our proposed model in predicting the relationship between an introduced SS bond and the change in protein thermostability. The highest accuracy was 0.714, as shown in the mutation site model based on RF and Adaboost‐DT as well as the neighborhood context model based on Adaboost‐DT. In particular, among the mutation site models, RF exhibited superior performance on multiple metrics, with an AUC score of 0.737, while the neighborhood context model excelled, with an AUC of 0.773 based on Adaboost‐DT. These findings suggest that enriching the features with neighborhood information potentially enhances the model's ability to discriminate between positive and negative samples.
TABLE 1.
Prediction performance by different methods.
| Model | Algorithm | Accuracy | Recall | F1 score | Specificity | AUC | Precision |
|---|---|---|---|---|---|---|---|
| Mutation Site Model | SVM | 0.5 | 0.526 | 0.487 | 0.478 | 0.505 | 0.500 |
| Mutation Site Model | XGBoost | 0.692 | 0.579 | 0.629 | 0.782 | 0.76 | 0.688 |
| Mutation Site Model | Random forest (RF) | 0.714 | 0.737 | 0.7 | 0.70 | 0.737 | 0.66 |
| Mutation Site Model | Adaboost‐DT | 0.714 | 0.736 | 0.700 | 0.737 | 0.735 | 0.666 |
| Mutation Site Model | Adaboost‐RF | 0.642 | 0.684 | 0.634 | 0.609 | 0.666 | 0.591 |
| Neighborhood Context Model | SVM | 0.453 | 0.421 | 0.410 | 0.478 | 0.474 | 0.470 |
| Neighborhood Context Model | XGBoost | 0.594 | 0.474 | 0.514 | 0.696 | 0.574 | 0.563 |
| Neighborhood Context Model | Random forest (RF) | 0.569 | 0.474 | 0.529 | 0.739 | 0.691 | 0.714 |
| Neighborhood Context Model | Adaboost‐DT | 0.714 | 0.789 | 0.714 | 0.652 | 0.773 | 0.684 |
| Neighborhood Context Model | Adaboost‐RF | 0.595 | 0.684 | 0.604 | 0.522 | 0.584 | 0.542 |
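For readers less familiar with the threshold‐based metrics reported in Table 1, the following Python sketch shows their standard definitions in terms of confusion‐matrix counts. The function name and the example counts are hypothetical and are not derived from the study's test set; the AUC is omitted because it requires continuous prediction scores rather than counts.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=8, fp=4, tn=6, fn=2)
```

Here, a "positive" is an SS bond labeled as enhancing thermostability, so recall measures how many truly stabilizing designs a model recovers, and specificity measures how many non‐stabilizing designs it correctly rejects.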
Above all, our prediction models integrate the characteristics of the mutation sites and surrounding properties, providing satisfactory results by using various ML algorithms. The models constructed in this study may serve as a valuable reference for further advances in sequence‐ and structure‐based prediction tasks.
2.5. Enhancing IR89 fitness by introducing disulfide bonds
IR89 is an imine reductase with a dimeric structure, which can be used to convert 1‐benzyl‐4‐methylpiperidin‐3‐one hydrochloride to (3R,4R)‐1‐benzyl‐N,4‐dimethylpiperidin‐3‐amine, a key pharmaceutical intermediate of tofacitinib. As a JAK inhibitor, tofacitinib is a medication used to treat multiple diseases, including different types of arthritis, ankylosing spondylitis, and ulcerative colitis. To efficiently synthesize the chiral amine group, IR89 has been developed to selectively convert the imine group into an S‐type amine (Lenz et al., 2017). However, the substrate concentration and conversion rate are not high enough for industrial applications.
To showcase the effectiveness of SS‐bond engineering in enzyme design and enhance the performance of IR89, we integrated DBD2, AF2, and our models to strategically introduce SS bonds that enhance enzyme fitness and activity. The detailed pipeline is displayed in Figure 6a. Initially, AF2 was employed to predict the structure of IR89, which was subsequently used as a DBD2 input to identify potential SS bond sites. Subsequent steps involved screening the potential cysteine pairs using AF2 alongside our models. Intrachain SS bonds were modeled using AF2, while interchain SS bonds between monomers were predicted using AlphaFold2‐Multimer (Evans et al., 2021). Candidate residue pairs for SS bond formation were identified from the predicted structures, and their likelihood to enhance enzyme fitness was evaluated using our prediction models. A majority vote was taken across our three best models (each with an accuracy of 0.714): an SS bond predicted as positive by at least two of them was retained as a candidate for further validation. Finally, 13 candidates were adopted for the experimental validation of enzymatic fitness.
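The two‐out‐of‐three voting step in this pipeline can be sketched in Python as follows. The function names and the candidate labels are hypothetical placeholders, and the boolean votes stand in for the outputs of the three trained models.

```python
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[bool], min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` models predict 'positive'."""
    return sum(1 for p in predictions if p) >= min_votes


def select_candidates(votes_by_pair: Dict[str, Sequence[bool]],
                      min_votes: int = 2) -> List[str]:
    """Keep the cysteine pairs that pass the majority vote."""
    return [pair for pair, votes in votes_by_pair.items()
            if majority_vote(votes, min_votes)]


# Hypothetical votes from three models for three made-up candidate pairs.
votes = {
    "pair_1": (True, True, False),
    "pair_2": (False, True, False),
    "pair_3": (True, True, True),
}
kept = select_candidates(votes)
```

Requiring agreement between at least two models is a simple way to reduce the influence of any single model's false positives before committing to experiments.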
FIGURE 6.

Enhanced fitness of IR89 owing to the introduction of SS bonds. (a) Overview of the pipeline employed to predict and screen SS bonds. (b and c) Comparative conversion rates of WT and mutated IR89 at 30°C at substrate concentrations of 25 mM (b) and 50 mM (c) after 24‐h incubation. Interchain SS bonds are highlighted in light‐pink shading.
Through the pipeline described above, eight high‐activity mutants, including four interchain (A261C‐A261C, S180C‐S180C, W176C‐G187C, and A168C‐T194C) and four intrachain (V23C‐G44C, G39C‐V47C, H246C‐S263C, and F249C‐E256C) SS bonds, were obtained upon introducing new SS bonds (Figure 6b). In particular, three mutants exhibited a conversion rate of approximately 100% at 25 mM substrate concentration, and that of F249C–E256C also approximated 100% at 50 mM substrate concentration (Figure 6c).
The interchain SS bonds we designed significantly enhanced the enzyme fitness of IR89, a phenomenon largely attributable to the enzyme's distinctive dimeric structure. As a dimeric enzyme, IR89's active site is situated between its two monomers. The strategic introduction of interchain SS bonds may effectively stabilize this active pocket, leading to the observed improvements in enzyme performance. Notably, these interchain SS bonds have demonstrated a higher success rate in augmenting enzyme fitness than intrachain ones. This IR89 case underscores the inherent value of our strategy for identifying SS bonds with superior effects.
2.6. Usage of ThermoLink
ThermoLink catalogs both natural and engineered SS bonds that influence enzyme thermostability. The effects of these SS bonds on thermostability are categorized into four scenarios: no change, increased thermostability, decreased thermostability, or other effects. Global and fuzzy searches of ThermoLink can be performed using “Protein”, “Organism”, “Protein ID”, and “Thermostability” as search fields (Figure 7). Each entry provides information regarding the enzyme's name, organism, SS‐bond introduction sites, pre‐mutation residue types, thermostability change description, and the reference. Users can click entries under the “Index” title to navigate to the 3D PyMOL pse files highlighting the artificial intra‐SS bonds in the mutant structure for further study; protein structures are colored based on the prediction confidence scores provided by fastAF2. Furthermore, users can click on entries under the titles “Protein ID” and “References” to access external resources, such as UniProt, NCBI, PubMed, and other online servers. The website also offers a download option on the download page for the entire database content in spreadsheet format.
FIGURE 7.

Graphical interface for ThermoLink.
3. DISCUSSION
3.1. Superiority of AF2 in SS‐bond prediction
AI tools, such as AF2, have recently been used to predict 3D protein folds with high confidence (Duan et al., 2020; Willems et al., 2023). Research conducted by Willems et al. (2023) demonstrated a high level of consistency (96%) between the SS bonds predicted by AF2 and those observed in X‐ray and nuclear magnetic resonance protein structures. Traditional SS prediction algorithms perform predictions mainly based on geometric distances in 3D structures, which potentially lead to the loss of valuable data. Considering the incompleteness of the structure database (Berman et al., 2000; Jumper et al., 2021) and high precision of AF2, we submitted sequences collected from the literature to fastAF2 (Hong et al., 2021; Zheng et al., 2022) to predict the structure of WT enzymes and mutants.
For our dataset, we calculated the distances between the designed SS sites, measured between side‐chain atoms in the WT structures and between sulfur atoms (S–S) in the mutants (Figure 8a,b). The results suggest that some WT distances were far beyond reasonable limits for SS‐bond formation, while some pairs exhibited a remarkable decrease in distance upon mutation, especially those with enhanced thermostability. For certain entries, the designed sites in the WT structures did not meet the geometric requirements of SS bridges, although SS bonds had been successfully introduced into these structures. Notably, these long‐distance engineering sites were located at the protein terminus (Figure 8c); introducing an SS bond there can anchor the relatively unconstrained terminus to the protein surface, thereby enhancing conformational rigidity. To a certain extent, the use of AF2 can improve the accuracy of SS‐bond prediction and thus support the improvement of enzyme thermostability. Therefore, during SS‐bond design, terminal engineering sites may serve as preferred candidates, as they are more likely to increase enzyme stability. Meanwhile, AF2 shows potential for predicting SS bonds formed by spatially distant residues in enzymes. In summary, terminal SS bonds can be designed and subsequently filtered using our model, presenting an efficient strategy for improving enzyme thermostability, especially for enzymes lacking experimentally determined structures. However, the introduction of such non‐native cross‐links may perturb the delicate structure–function relationships of a protein, and the underlying mechanisms warrant in‐depth investigation.
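The distance check described above can be sketched in a few lines. This is a minimal illustration rather than the script used in this study: it assumes standard fixed‐column PDB ATOM records, and the function names (`atom_coords`, `site_distance`) and the default atom choice (`SG`) are hypothetical.

```python
import math


def atom_coords(pdb_lines, chain, resseq, atom_name):
    """Return (x, y, z) of one atom from fixed-column PDB records, or None."""
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if (line[21] == chain
                and int(line[22:26]) == resseq
                and line[12:16].strip() == atom_name):
            return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return None


def site_distance(pdb_lines, chain, res_i, res_j, atom_name="SG"):
    """Distance in angstroms between the same-named atom of two residues."""
    a = atom_coords(pdb_lines, chain, res_i, atom_name)
    b = atom_coords(pdb_lines, chain, res_j, atom_name)
    if a is None or b is None:
        return None  # e.g. a WT residue that has no SG atom
    return math.dist(a, b)
```

For a WT structure, a different `atom_name` would be passed, since non‐cysteine residues carry no sulfur atom at the designed sites.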
FIGURE 8.

(a) Side‐chain distances between the engineering sites before (WT) and after (S–S, mutant) the introduction of SS bonds and (b) representative examples of successfully engineered SS bonds captured by AF2. (c) Structural superposition between AF2‐predicted structures for WT (terminus in magenta) and mutant forms (terminus in light purple) (Li et al., 2022; Yin et al., 2015; Zhou et al., 2016). The UniProt ID is provided for each mutant.
3.2. The importance of co‐evolution information
Disulfide design should take evolutionarily interrelated residue pairs into account, as this may mitigate adverse effects on protein folding or conformation. Notably, the stability and activity of lytic polysaccharide monooxygenases (LPMOs) were enhanced upon designing new disulfide bonds guided by coevolution information (Zhou et al., 2022). In the aforementioned analysis, we found a distinctly positive correlation between the conservation of residues involved in SS‐bond formation and the enhancement of protein thermostability. We therefore hypothesized that the co‐evolutionary patterns of the residue pairs are also conducive to SS‐bond engineering. To test this hypothesis, we used the thermostability modification data from Wang et al. (2021) as an example to evaluate the potential relationship between co‐evolution and the thermostability of SS residue pairs. A Python framework named EVcouplings (https://evcouplings.org/), proposed by Hopf et al. (2019), was employed. This tool performs co‐evolutionary sequence analysis for the de novo prediction of the structure and function of RNA, proteins, and protein complexes. EVcouplings identifies evolutionary couplings (ECs) between sites, which often correspond to physical contacts in the molecule's 3D structure. As shown in Table 2, a positive correlation existed between EC scores and thermostability changes for the SS mutants. Although not every case in our dataset follows this pattern, the conservation and co‐evolutionary information of the residue pairs should still be considered key factors in the design of SS linkages.
TABLE 2.
EVcouplings scores and thermostability changes of SS‐bond mutants of endo‐polygalacturonase from Talaromyces leycettanus JCM 12802 (UniProt ID: A0A6M9BP13) (Wang et al., 2021).
| Mutation pairs | Thermostability | EVcouplings score |
|---|---|---|
| T316C‐G344C | Remarkably enhanced | 16.991 |
| N253C‐G282C | Slightly enhanced | 12.522 |
| G220C‐S249C | Slightly enhanced | 12.406 |
| I99C‐S140C | Slightly enhanced | 3.415 |
| N54C‐N78C | Decreased | / |
| E86C‐T127C | Decreased | / |
| K128C‐V152C | Decreased | / |
| T182C‐N202C | Decreased | / |
| S211C‐D238C | Decreased | / |
3.3. Current stage and challenge of engineering SS bonds for enzyme thermostability improvement
SS‐bond design aimed at modifying enzyme thermostability is of considerable importance. Several tools, such as DBD2, SSBONDPredict (Gao et al., 2020), and Yosshi, have been proposed for the prediction of disulfides. DBD2 is a second‐generation design tool derived from Disulfide by Design (DBD) (Dombkowski, 2003); it incorporates conformational constraints and introduces the B‐factor for the design of SS bonds. SSBONDPredict is a neural network‐based model that predicts amino acid pairs suitable for constructing engineered SS bonds. Another application, Yosshi, integrates evolutionary information via MSA for SS‐bond prediction (Suplatov et al., 2019). All these computational tools have made significant progress in identifying potential SS‐bond sites. However, from a practical viewpoint, enhancing thermostability remains a challenge because these tools predict whether an SS bond can be formed rather than whether it improves thermostability, and additional efforts are required to improve the accuracy and reliability of the prediction to bridge this gap.
Here, we present an extended design strategy that starts with the existing SS‐bond prediction tools and subsequently uses our models to refine the selection of potential variants for improving enzyme thermostability (Figure 9a). First, a set of potential SS‐bond sites is generated in the target enzyme using existing prediction tools. Considering the portability of DBD2 and SSBONDPredict, we selected them to predict the SS‐bond sites. Although the performance of DBD2 on our test dataset slightly surpassed that of SSBONDPredict, the accuracy of both tools remained relatively low (Figure 9b). Some SS bonds may fail to form owing to the structural rigidity of proteins. Among the SS bonds detected by DBD2, less than half positively affected enzyme thermostability, indicating the need for further screening. The potential sites can subsequently be screened using our model, which is specifically designed to detect intricate patterns and characteristics relating SS bonds to enzyme thermostability.
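The two‐stage idea can be sketched as a simple filter. This is only an illustration under stated assumptions: `candidate_pairs` stands in for the sites proposed by a first‐stage tool, `predict_proba` stands in for a trained thermostability classifier, and the 0.5 threshold is an arbitrary choice; none of these names, values, or scores comes from the study.

```python
def screen_candidates(candidate_pairs, predict_proba, threshold=0.5):
    """Score each candidate pair and keep those at or above the threshold,
    sorted from most to least promising."""
    scored = [(pair, predict_proba(pair)) for pair in candidate_pairs]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical stage-1 output (residue-number pairs) and a toy stand-in
# for the stage-2 classifier; these are made-up values, not real data.
candidates = [(12, 87), (23, 44), (40, 52)]
toy_scores = {(12, 87): 0.81, (23, 44): 0.35, (40, 52): 0.62}
shortlist = screen_candidates(candidates, toy_scores.get)
```

Only the pairs that pass the second stage would then be taken forward to experimental testing.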
FIGURE 9.

An extended design strategy for SS‐bond design incorporating the existing tools and our models. (a) Relationship between our work and other preexisting SS‐bond prediction tools. Potential SS bonds could be predicted using existing tools, such as DBD2 and SSBONDPredict, while our models aimed to determine whether the introduction of an SS bond can improve enzyme thermostability. AF2 could also assist with this assignment. The integrated application of these two kinds of tools potentially enhanced enzyme thermostability by facilitating the introduction of SS bonds. (b) The performance of the existing models and our models on the testing dataset, including individual and combined strategies.
Furthermore, on the test set in this study, our extended strategy produced predictions that were highly accurate in identifying positive SS‐bond sites and substantially improved on the original predictions. This accuracy is significant for enzyme engineering, as it provides a more reliable method for guiding enzyme design and optimization, further advancing the development of biotechnology and industrial applications. Indeed, using this strategy, we successfully identified eight IR89 mutants with enhanced enzyme performance upon introducing artificial SS bonds.
4. CONCLUSION
SS bridges, as a feasible and powerful modification strategy in enzyme engineering, are often introduced to enhance enzyme thermostability. However, the results are not always as expected. To address this, we prepared a manually curated database of enzymes with experimentally determined thermostability data upon the introduction of inter‐ and intrachain SS bridges. ML models were constructed based on the intrachain data so that SS bonds can be introduced more accurately and efficiently. Sequence‐, structure‐, conservation‐, and energy‐based features were extracted, and the conservation‐ and energy‐based features exhibited distinct distributions between the two categories. Overall, this suggests that the prediction model presented herein can effectively identify residue pairs that are potentially useful for introducing SS bonds to improve enzyme stability with fewer failed attempts. Furthermore, compared with existing tools, our model performs better in identifying thermostability‐enhancing SS bonds. Notably, with the aid of AF2 and our models, we uncovered several positive SS bonds that effectively enhanced IR89 fitness with fewer false positives. In summary, our work, featuring an improved model for the more accurate and efficient prediction of SS bonds, is expected to advance enzyme engineering in industrial applications.
5. MATERIALS AND METHODS
5.1. Data collection and processing
5.1.1. Data collection
We gathered relevant data by comprehensively reviewing the literature and searching reputable online databases, such as PubMed, NCBI, UniProt, and other academic sources. Our search criteria included specific keywords and filters to obtain datasets directly relevant to our research question. The collected data were diligently cleaned and preprocessed to eliminate outliers, address missing values, and ensure data consistency and usability.
5.1.2. Structure prediction using AF2
AF2 can provide high‐quality predicted 3D protein structures that contain richer information than 1D amino acid sequences. The sequences collected before and after SS‐bond modification were used for AF2‐based structure prediction. AF2 relies on MSAs to extract co‐evolutionary information and on structure templates from the Protein Data Bank. Among the five predicted structure models, only the best one (with the highest pLDDT score) was used for further analysis. In this study, we adopted a faster version of AF2 (fastAF2) (Hong et al., 2021; McBride et al., 2023; Zheng et al., 2022) for structure prediction, which accelerates prediction while maintaining precision. FastAF2 has been shown to be beneficial in predicting the structure of protein–ligand complexes (Tao et al., 2023) in the 15th Critical Assessment of Structure Prediction.
5.1.3. Natural SS‐bond structure data acquisition
In this section, the Protein Sequence Screening Server was used to obtain protein chains with <50% sequence similarity to each other from the CullPDB database (updated as of September 23, 2023) (Wang & Dunbrack Jr, 2003, 2005). Thereafter, we extracted the residue numbers of cysteines associated with SS bonds by searching for “SSBOND” records in the PDB files and matching them with the corresponding chain IDs. Finally, 2,777 protein chains with intrachain SS bonds were identified and downloaded.
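The SSBOND extraction step can be sketched as follows. This is a minimal sketch, not the script used here: it assumes the fixed‐column layout of SSBOND records in the legacy PDB format (chain IDs in columns 16 and 30, residue numbers in columns 18–21 and 32–35), ignores symmetry operators, and the function name `intrachain_ssbonds` is hypothetical.

```python
def intrachain_ssbonds(pdb_lines):
    """Collect (chain, resnum_1, resnum_2) for SSBOND records within one chain."""
    bonds = []
    for line in pdb_lines:
        if not line.startswith("SSBOND"):
            continue
        chain_1, chain_2 = line[15], line[29]          # columns 16 and 30
        res_1, res_2 = int(line[17:21]), int(line[31:35])  # columns 18-21, 32-35
        if chain_1 == chain_2:  # skip interchain bridges
            bonds.append((chain_1, res_1, res_2))
    return bonds
```

Each returned tuple names the chain and the two cysteine residue numbers, which can then be matched against the chain IDs of interest.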
5.2. Feature extraction
5.2.1. Sequence‐based feature extraction
First, one‐hot encoding (Villegas‐Morcillo et al., 2021) was used to represent the residue types at the SS‐bond engineering sites. In addition to the mutation site itself, the surrounding N (where N = 5) residues were also considered to capture the local environment of the mutation site. The PSSM has emerged as a valuable source of information in protein bioinformatics, as it helps gain insight into evolutionary conservation and make predictions related to protein folding, RNA binding sites, and protein functional analysis (Liang et al., 2015). In this work, the PSSM score was calculated to evaluate the sequence conservation of the residue pairs forming SS linkages. To collect local information regarding the specified sites, we performed feature extraction directly on the PSSM profiles of the designed sites. A higher score indicates a greater likelihood of finding a given amino acid at that position, while a lower value indicates otherwise.
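A neighborhood one‐hot encoding of this kind can be sketched as below. This is an illustration only: it assumes that N = 5 means five residues on each side of the site, that positions beyond the termini are zero‐padded, and that the 20 standard amino acids form the alphabet; the function names are hypothetical and the actual feature layout used in this study may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-dimensional one-hot vector; all zeros for padding or unknown residues."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def encode_site(sequence, position, flank=5):
    """Concatenate one-hot vectors for a site (1-based position) and
    `flank` residues on each side, zero-padding beyond the termini."""
    features = []
    for i in range(position - 1 - flank, position + flank):
        residue = sequence[i] if 0 <= i < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features
```

With `flank=5`, each site yields 11 × 20 = 220 binary features; the vectors of the two sites in a pair could then be concatenated.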
5.2.2. Structure‐based feature extraction
In this work, we calculated the mutation‐induced Gibbs free energy change of each residue pair using the FoldX suite (Schymkowitz et al., 2005). The free energy change associated with unfolding ($\Delta G$) of a target protein and the energy difference between the WT and the mutant ($\Delta\Delta G$) were calculated using Equations (1) and (2):
$$\Delta G = G_{\mathrm{unfolded}} - G_{\mathrm{folded}} \tag{1}$$
$$\Delta\Delta G = \Delta G_{\mathrm{mutant}} - \Delta G_{\mathrm{WT}} \tag{2}$$
Then, the solvent‐accessible surface area (SASA), which measures the surface area of atoms accessible to the solvent, was calculated for residues involved in disulfide bonds using MDTraj (McGibbon et al., 2015). This measurement provides valuable information on the exposure of the residues to the surrounding environment.
Subsequently, the angle and distance between the selected atoms of residues involved in the SS bonds were calculated. These geometric parameters provide information regarding the spatial arrangement and conformation of the SS bonds. Non‐covalent interactions, such as hydrogen bonds and salt bridges, play crucial roles in stabilizing protein structures. To account for polar interactions around the SS bond, we identified and quantified the hydrogen bonds and salt bridges formed between the mutation sites and the surrounding amino acids; the enzyme structures were read using the Biopython module (Cock et al., 2009). A hydrogen bond was defined when the distance between a hydrogen atom and an oxygen or nitrogen atom was less than 3.5 Å, and a salt bridge was defined as an interaction between charged residues within a cutoff distance of 4 Å (Marqusee & Sauer, 1994). These interactions may contribute to the overall or partial stability of the protein structure.
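The distance‐based counting described above can be sketched as follows. This is a simplified illustration that takes plain coordinate tuples as input instead of Biopython objects; the cutoffs follow the text, whereas the function name `count_contacts` and the atom selections passed to it are assumptions.

```python
import math

HBOND_CUTOFF = 3.5        # angstroms, hydrogen to O/N distance (from the text)
SALT_BRIDGE_CUTOFF = 4.0  # angstroms, between charged groups (from the text)


def count_contacts(atoms_a, atoms_b, cutoff):
    """Count atom pairs (one atom from each list) closer than `cutoff`.

    Each atom is an (x, y, z) tuple in angstroms.
    """
    return sum(1 for a in atoms_a for b in atoms_b if math.dist(a, b) < cutoff)
```

For hydrogen bonds, `atoms_a` would hold hydrogen coordinates near the mutation site and `atoms_b` the surrounding oxygen and nitrogen atoms; for salt bridges, the two lists would hold atoms of oppositely charged side chains.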
Furthermore, the secondary structure information (i.e., helix or strand) of the SS‐bond sites and their surrounding residues was determined using DSSP (Perticaroli et al., 2013), providing insights into the local structural context of the SS bonds. The eight secondary structure categories obtained from the DSSP calculations, namely, 3₁₀‐helix (G), α‐helix (H), π‐helix (I), β‐bridge (B), β‐strand (E), high‐curvature loop (S), turn (T), and coil (C), were grouped into three main divisions: helix (G, H, and I), β‐sheet (B and E), and others (S, T, C, and undetected secondary structures).
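The eight‐to‐three state reduction can be written as a small lookup. The mapping follows the grouping stated above; the label strings and the function name `reduce_dssp` are illustrative choices.

```python
# Helix (G, H, I) and sheet (B, E) codes; every other code, including
# S, T, C and blanks for undetected structure, falls back to "other".
THREE_STATE = {"G": "helix", "H": "helix", "I": "helix",
               "B": "sheet", "E": "sheet"}


def reduce_dssp(dssp_string):
    """Map a string of per-residue DSSP codes to three-state labels."""
    return [THREE_STATE.get(code, "other") for code in dssp_string]
```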
5.3. Development and validation of ML models
5.3.1. Training and testing datasets
The collected SS bonds were classified into two groups. Those conferring favorable thermostability were labeled “positive”, while the others were labeled “negative”. The labeled datasets were subsequently used to train the model to predict the effects of SS bonds on protein thermostability. To enhance the model's robustness and reduce the risk of overfitting, enzymes were clustered according to a sequence similarity threshold of 0.6 using CD‐HIT (version 4.8.1) (Fu et al., 2012). The resulting clusters were then randomly divided into training and testing sets for the train–test split (ratio = 9:1), ensuring that samples from the same cluster were present in the same dataset, either in the training or testing set. In the testing dataset, the negative‐to‐positive ratio was 1:1, and in the training dataset, it approximated 1:1.
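A cluster‐aware split of this kind can be sketched as below. This is a minimal sketch, not the procedure used in this study: it assumes the CD‐HIT clusters have already been parsed into a dictionary mapping cluster IDs to sample lists, it does not balance the positive and negative classes, and the function name and seed are hypothetical.

```python
import random


def cluster_split(cluster_to_samples, test_fraction=0.1, seed=0):
    """Split samples so that every cluster lands wholly in train or in test."""
    clusters = sorted(cluster_to_samples)
    random.Random(seed).shuffle(clusters)
    total = sum(len(samples) for samples in cluster_to_samples.values())
    train, test = [], []
    for cluster_id in clusters:
        # Fill the test set until it reaches the requested share, then
        # send all remaining clusters to the training set.
        target = test if len(test) < test_fraction * total else train
        target.extend(cluster_to_samples[cluster_id])
    return train, test
```

Because whole clusters are assigned at once, the realized test share can exceed `test_fraction` when a large cluster is drawn early.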
5.3.2. Construction of ML models
We utilized the positive and negative datasets described in the preceding section to develop a classification model that distinguishes the effects of artificial SS bonds on thermal stability. Five supervised algorithms were implemented and trained on the labeled data: RF (Rigatti, 2017), XGBoost (Ogunleye & Wang, 2019; Torlay et al., 2017), Adaboost‐RF (Lu et al., 2019), Adaboost‐DT (Qu et al., 2021), and support vector machine (SVM) (Noble, 2006). RF, Adaboost‐RF, XGBoost, and Adaboost‐DT are all ensemble algorithms that combine multiple weak learners, such as decision trees, into a strong learner, thus improving overall predictive performance. In the SVM, a kernel function (here, a linear kernel) is used to identify an optimal separating hyperplane; nonlinear kernels additionally map the data implicitly into a higher‐dimensional feature space. These ML algorithms were implemented in Python using the scikit‐learn package (Pedregosa et al., 2011). The optimal parameters used in our models were obtained via a grid search strategy. For the SVM model, the key configuration options were the penalty parameter (C), kernel, and gamma. For the tree‐based models, the learning rate, tree depth, and number of decision trees were the main parameters considered.
5.3.3. Performance evaluation
To verify the model quality, the following parameters were applied:
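Accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
Recall:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
Precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$
F1 score:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
Specificity:
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{7}$$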
AUC:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{Recall}\;\mathrm{d}\left(1 - \mathrm{Specificity}\right) \tag{8}$$
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. Recall reflects the model's coverage of actual positive instances. Precision indicates the accuracy of the model's positive predictions. The F1 score offers a balanced measure that considers both recall and precision. Specificity reflects the proportion of actual negative instances that are correctly identified. Accuracy quantifies the overall proportion of the model's correct predictions. Moreover, the areas under the receiver operating characteristic and precision–recall curves, denoted as AUC, were computed, with higher AUC values signifying better model performance.
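These metrics can be computed directly from the confusion‐matrix counts. The sketch below is illustrative: the function names are hypothetical, ratios with a zero denominator are returned as 0.0 by assumption, and the ROC AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half) rather than by integrating a curve.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, specificity, and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1}


def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```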
5.4. Experimental validation of the designed IR89 mutants with artificial SS bonds
5.4.1. Experimental materials and chemicals
The compound 1‐benzyl‐4‐methylpiperidin‐3‐one hydrochloride (1, substrate) was purchased from Shanghai Haohong Biomedical Technology Co., Ltd. Methylamine hydrochloride and (3R,4R)‐N,4‐dimethyl‐1‐benzyl‐3‐piperidinamine (2, the product, which is a key pharmaceutical intermediate of tofacitinib) were procured from Aladdin Industrial Corporation, Shanghai. PrimeSTAR® GXL DNA Polymerase was acquired from TaKaRa Biotechnology Co., Ltd. FastDigest DpnI was purchased from Thermo Fisher Scientific Co., Ltd. (China). The Hieff Clone® Plus One Step Cloning Kit was obtained from Yisheng Biotechnology Co., Ltd. (Shanghai, China). Escherichia coli BL21(DE3) competent cells were procured from WeiDi Chemical Co. Ltd. (China). Isopropyl β‐D‐1‐thiogalactopyranoside (IPTG) was acquired from Shanghai Meryer Biochemical Technology Co., Ltd. PCR instruments (Biometra TRIO 48) were purchased from Analytik Jena (Beijing) Instruments Co., Ltd. The high‐performance liquid chromatography (HPLC; LC‐2010C) system was acquired from Shimadzu Co., Ltd. (China). An ultrasonic cell breaker (DH92‐IIN) was procured from Ningbo Lawson Smarttech Co., Ltd.
5.4.2. IR89 cloning and mutant construction
The IR89 sequence was adopted from a previous study (Zhan et al., 2022). DNAWorks was used for codon optimization for an E. coli expression system, and the full‐length DNA molecule was synthesized and cloned into pET‐28a by Tongyong Co., Ltd. (Shanghai). The DNA sequence of IR89 used in this study is provided in Table S1. The plasmid pET‐28a(+)‐IR89 was used as the template for amplification with PrimeSTAR GXL DNA polymerase. The PCR workflow was as follows: initial denaturation (94°C, 2 min); 25 cycles of denaturation (98°C, 10 s), annealing (50–60°C, 15 s), and extension (68°C, 6 min 20 s); and a final extension (68°C, 10 min). The mutants were constructed using the method described above and included F249C–E256C, H246C–S263C, Q219C–A280C, S136C–A166C, A100C–S167C, A95C–S167C, G39C–V47C, V23C–G44C, G10C–V65C, A261C, S180C, W176C–G187C, and A168C–T194C; the primers are listed in Table S2.
5.4.3. Protein expression, purification, and enzyme activity testing
The BL21 strain of E. coli was employed as the host organism for the expression of IR89 and its mutant proteins. The expression cultures were incubated at 37°C until the optical density at 600 nm reached 0.4–0.6. Induction was subsequently performed by adding IPTG to a final concentration of 1 mM, and the cultures were allowed to express for 16–20 h at 16°C. Following induction, cells were harvested via centrifugation at 4°C and 10,500 × g for 5 min and resuspended in a buffer comprising 50 mM Tris–HCl (pH 7.5), 150 mM NaCl, and 10% glycerol. Cell lysis was performed via sonication for 30 min, and the lysate was subsequently centrifuged at 4°C and 10,000 rpm for 20 min. The supernatant was collected as the crude enzyme extract.
To assess the catalytic activity of IR89 and its mutants, in vitro enzymatic reactions were set up in a 5 mL total volume consisting of 100 mM Tris–HCl buffer (pH 8.5), 10% DMSO, 2 mL of crude enzyme extract, 2.5 mg of glucose dehydrogenase, 25 mM substrate (1‐benzyl‐4‐methylpiperidin‐3‐one hydrochloride), 250 mM methylamine hydrochloride, 66 mM D‐Glucose, and 1.2 mM NADPH. The reactions were incubated at 30°C for 24 h, followed by analysis using an LC‐2010C High‐Performance Liquid Chromatography (HPLC) system, and the results are shown in Figures S2, S3 and S4.
AUTHOR CONTRIBUTIONS
Ran Xu: Data curation; writing – review and editing; writing – original draft; investigation; validation; formal analysis; software. Qican Pan: Writing – original draft; methodology; formal analysis. Guoliang Zhu: Methodology; writing – original draft; formal analysis. Yilin Ye: Methodology; software; formal analysis; visualization; writing – original draft. Minghui Xin: Writing – original draft; writing – review and editing; data curation. Zechen Wang: Writing – original draft; writing – review and editing. Sheng Wang: Writing – original draft; writing – review and editing; formal analysis. Weifeng Li: Writing – review and editing; writing – original draft. Yanjie Wei: Writing – original draft; writing – review and editing. Jingjing Guo: Conceptualization; methodology; writing – review and editing; project administration; supervision; formal analysis; funding acquisition; resources; software. Liangzhen Zheng: Conceptualization; methodology; writing – review and editing; project administration; formal analysis; funding acquisition.
FUNDING INFORMATION
This research was supported by the National Key R&D Program of China (2023YFA0915500) to Liangzhen Zheng and by National Science Foundation of China under grant no. 62272449 to Yanjie Wei. This work was financially supported by Macao Polytechnic University (RP/CAI‐01/2023) to Jingjing Guo.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
Supporting information
Data S1.
Xu R, Pan Q, Zhu G, Ye Y, Xin M, Wang Z, et al. ThermoLink: Bridging disulfide bonds and enzyme thermostability through database construction and machine learning prediction. Protein Science. 2024;33(9):e5097. 10.1002/pro.5097
Ran Xu and Qican Pan contributed equally to this work.
Review Editor: Nir Ben‐Tal
Contributor Information
Jingjing Guo, Email: jguo@mpu.edu.mo.
Liangzhen Zheng, Email: zhenglz@zelixir.com.
REFERENCES
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ban X, Xie X, Li C, Gu Z, Hong Y, Cheng L, et al. The desirable salt bridges in amylases: distribution, configuration and location. Food Chem. 2021;354:129475. [DOI] [PubMed] [Google Scholar]
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The protein data bank. Nucleic Acids Res. 2000;28(1):235–242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bloom JD, Meyer MM, Meinhold P, Otey CR, MacMillan D, Arnold FH. Evolving strategies for enzyme engineering. Curr Opin Struct Biol. 2005;15(4):447–452. [DOI] [PubMed] [Google Scholar]
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Craig DB, Dombkowski AA. Disulfide by design 2.0: a web‐based tool for disulfide engineering in proteins. BMC Bioinformatics. 2013;14(1):1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Creighton TE, Zapun A, Darby NJ. Mechanisms and catalysts of disulphide bond formation in proteins. Trends Biotechnol. 1995;13(1):18–23. [DOI] [PubMed] [Google Scholar]
- Dani VS, Ramakrishnan C, Varadarajan R. MODIP revisited: re‐evaluation and refinement of an automated procedure for modeling of disulfide bonds in proteins. Protein Eng. 2003;16(3):187–193. [DOI] [PubMed] [Google Scholar]
- Dombkowski AA. Disulfide by design: a computational method for the rational design of disulfide bonds in proteins. Bioinformatics. 2003;19(14):1852–1853. [DOI] [PubMed] [Google Scholar]
- Duan J, Lupyan D, Wang L. Improving the accuracy of protein thermostability predictions for single point mutations. Biophys J. 2020;119(1):115–127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004;22(8):1035–1036. [DOI] [PubMed] [Google Scholar]
- Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold‐multimer. BioRxiv . 2021. 10.1101/2021.10.04.463034 [DOI]
- Fu L, Niu B, Zhu Z, Wu S, Li W. CD‐HIT: accelerated for clustering the next‐generation sequencing data. Bioinformatics. 2012;28(23):3150–3152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fuxreiter M, Mones L. The role of reorganization energy in rational enzyme design. Curr Opin Chem Biol. 2014;21:34–41. [DOI] [PubMed] [Google Scholar]
- Gao X, Dong X, Li X, Liu Z, Liu H. Prediction of disulfide bond engineering sites using a machine learning method. Sci Rep. 2020;10(1):10330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gihaz S, Bash Y, Rush I, Shahar A, Pazy Y, Fishman A. Bridges to stability: engineering disulfide bonds towards enhanced lipase biodiesel synthesis. ChemCatChem. 2020;12(1):181–192. [Google Scholar]
- Han X, Ning W, Ma X, Wang X, Zhou K. Improving protein solubility and activity by introducing small peptide tags designed with machine learning models. Metab Eng Commun. 2020;11:e00138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hogg P. Contribution of allosteric disulfide bonds to regulation of hemostasis. J Thromb Haemost. 2009;7:13–16. [DOI] [PubMed] [Google Scholar]
- Hong L, Sun S, Zheng L, Tan Q, Li Y. fastMSA: accelerating multiple sequence alignment with dense retrieval on protein language. BioRxiv . 2021. https://www.biorxiv.org/content/early/2021/12/21/2021.12.20.473431
- Hopf TA, Green AG, Schubert B, Mersmann S, Schärfe CP, Ingraham JB, et al. The EVcouplings python framework for coevolutionary sequence analysis. Bioinformatics. 2019;35(9):1582–1584. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johansson J, Nerelius C, Willander H, Presto J. Conformational preferences of non‐polar amino acid residues: an additional factor in amyloid formation. Biochem Biophys Res Commun. 2010;402(3):515–518. [DOI] [PubMed] [Google Scholar]
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Klesmith JR, Bacik JP, Wrenbeck EE, Michalczyk R, Whitehead TA. Trade‐offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc Natl Acad Sci. 2017;114(9):2265–2270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuchner O, Arnold FH. Directed evolution of enzyme catalysts. Trends Biotechnol. 1997;15(12):523–530. [DOI] [PubMed] [Google Scholar]
- Kunka A, Marques SM, Havlasek M, Vasina M, Velatova N, Cengelova L, et al. Advancing enzyme's stability and catalytic efficiency through synergy of force‐field calculations, evolutionary analysis, and machine learning. ACS Catal. 2023;13(19):12506–12518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lenz M, Borlinghaus N, Weinmann L, Nestl BM. Recent advances in imine reductase‐catalyzed reactions. World J Microbiol Biotechnol. 2017;33:1–10. [DOI] [PubMed] [Google Scholar]
- Li C, Ban X, Zhang Y, Gu Z, Hong Y, Cheng L, et al. Rational design of disulfide bonds for enhancing the thermostability of the 1, 4‐α‐glucan branching enzyme from Geobacillus thermoglucosidans STB02. J Agric Food Chem. 2020;68(47):13791–13797. [DOI] [PubMed] [Google Scholar]
- Li L, Wu W, Deng Z, Zhang S, Guan W. Improved thermostability of lipase Lip2 from Yarrowia lipolytica through disulfide bond design for preparation of medium‐long‐medium structured lipids. LWT. 2022;166:113786. [Google Scholar]
- Liang Y, Liu S, Zhang S, et al. Prediction of protein structural classes for low‐similarity sequences based on consensus sequence and segmented PSSM. Comput Math Methods Med. 2015;2015:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu H, Gao H, Ye M, Wang X. A hybrid ensemble algorithm combining AdaBoost and genetic algorithm for cancer classification with gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2019;18(3):863–870. [DOI] [PubMed] [Google Scholar]
- Marqusee S, Sauer RT. Contributions of a hydrogen bond/salt bridge network to the stability of secondary and tertiary structure in repressor. Protein Sci. 1994;3(12):2217–2225. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mazurenko S, Prokop Z, Damborsky J. Machine learning in enzyme engineering. ACS Catal. 2019;10(2):1210–1223. [Google Scholar]
- McBride J, Polev K, Abdirasulov A, Reinharz V, Grzybowski B, Tlusty T. AlphaFold2 can predict single‐mutation effects. Phys Rev Lett. 2023;11:131. [DOI] [PubMed] [Google Scholar]
- McDonald AG, Tipton KF. Fifty‐five years of enzyme classification: advances and difficulties. FEBS J. 2014;281(2):583–592. [DOI] [PubMed] [Google Scholar]
- McGibbon RT, Beauchamp KA, Harrigan MP, Klein C, Swails JM, Hernández CX, et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J. 2015;109(8):1528–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nagar N, Tubiana J, Loewenthal G, Wolfson HJ, Tal NB, Pupko T. EvoRator2: predicting site‐specific amino acid substitutions based on protein structural information using deep learning. J Mol Biol. 2023;435:168155. [DOI] [PubMed] [Google Scholar]
- Ning L, Guo J, Jin N, Liu H, Yao X. The role of Cys179–Cys214 disulfide bond in the stability and folding of prion protein: insights from molecular dynamics simulations. J Mol Model. 2014;20:1–8. [DOI] [PubMed] [Google Scholar]
- Nixon AE, Ostermeier M, Benkovic SJ. Hybrid enzymes: manipulating enzyme design. Trends Biotechnol. 1998;16(6):258–264. [DOI] [PubMed] [Google Scholar]
- Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24(12):1565–1567. [DOI] [PubMed] [Google Scholar]
- Ogunleye A, Wang QG. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans Comput Biol Bioinform. 2019;17(6):2131–2140. [DOI] [PubMed] [Google Scholar]
- Pang B, Zhou L, Cui W, Liu Z, Zhou Z. Improvement of the thermostability and activity of pullulanase from Anoxybacillus sp. WB42. Appl Biochem Biotechnol. 2020;191:942–954. [DOI] [PubMed] [Google Scholar]
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit‐learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830. [Google Scholar]
- Perticaroli S, Nickels JD, Ehlers G, O'Neill H, Zhang Q, Sokolov AP. Secondary structure and rigidity in model proteins. Soft Matter. 2013;9(40):9548–9556.
- Prakash O, Jaiswal N. α‐amylase: an ideal representative of thermostable enzymes. Appl Biochem Biotechnol. 2010;160:2401–2414.
- Qu H, Wu W, Chen C, Yan Z, Guo W, Meng C, et al. Application of serum mid‐infrared spectroscopy combined with an ensemble learning method in rapid diagnosis of gliomas. Anal Methods. 2021;13(39):4642–4651.
- Rigatti SJ. Random forest. J Insur Med. 2017;47(1):31–39.
- Saito Y, Oikawa M, Sato T, Nakazawa H, Ito T, Kameda T, et al. Machine‐learning‐guided library design cycle for directed evolution of enzymes: the effects of training data composition on sequence space exploration. ACS Catal. 2021;11(23):14615–14624.
- Schomburg D, Schomburg I. Databases. Methods in Molecular Biology. Totowa, NJ: Humana Press; 2010. p. 113–128.
- Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005;33(suppl_2):W382–W388.
- Senior E, Bull A, Slater J. Enzyme evolution in a microbial community growing on the herbicide Dalapon. Nature. 1976;263(5577):476–479.
- Siedhoff NE, Illig AM, Schwaneberg U, Davari MD. PyPEF—an integrated framework for data‐driven protein engineering. J Chem Inf Model. 2021;61(7):3463–3476.
- Song D, Chen J, Chen G, Li N, Li J, Fan J, et al. Parameterized BLOSUM matrices for protein alignment. IEEE/ACM Trans Comput Biol Bioinform. 2014;12(3):686–694.
- Suplatov D, Timonina D, Sharapova Y, Švedas V. Yosshi: a web‐server for disulfide engineering by bioinformatic analysis of diverse protein families. Nucleic Acids Res. 2019;47(W1):W308–W314.
- Suplatov D, Voevodin V, Švedas V. Robust enzyme design: bioinformatic tools for improved protein stability. Biotechnol J. 2015;10(3):344–355.
- Tanghe M, Danneels B, Last M, Beerens K, Stals I, Desmet T. Disulfide bridges as essential elements for the thermostability of lytic polysaccharide monooxygenase LPMO10C from Streptomyces coelicolor. Protein Eng Des Select. 2017;30(5):401–408.
- Tao S, Fuxu L, Zechen W, Jinyuan S, Yifan B, Jintao M, et al. zPoseScore model for accurate and robust protein‐ligand docking pose scoring in CASP15. Proteins. 2023;91(12):1837–1849.
- Torlay L, Perrone‐Bertolotti M, Thomas E, Baciu M. Machine learning–XGBoost analysis of language networks to classify patients with epilepsy. Brain Inform. 2017;4(3):159–169.
- Victorino da Silva Amatto I, Gonsales da Rosa‐Garzon N, Antonio de Oliveira Simoes F, Santiago F, Pereira da Silva Leite N, Raspante Martins J, et al. Enzyme engineering and its industrial applications. Biotechnol Appl Biochem. 2022;69(2):389–409.
- Villegas‐Morcillo A, Makrodimitris S, van Ham RC, Gomez AM, Sanchez V, Reinders MJ. Unsupervised protein embeddings outperform hand‐crafted sequence and structure features at predicting molecular function. Bioinformatics. 2021;37(2):162–170.
- Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003;19(12):1589–1591.
- Wang G, Dunbrack RL Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005;33(suppl_2):W94–W98.
- Wang S, Meng K, Su X, Hakulinen N, Wang Y, Zhang J, et al. Cysteine engineering of an endo‐polygalacturonase from Talaromyces leycettanus JCM 12802 to improve its thermostability. J Agric Food Chem. 2021;69(22):6351–6359.
- Willems P, Huang J, Messens J, Van Breusegem F. Functionally annotating cysteine disulfides and metal binding sites in the plant kingdom using AlphaFold2 predicted structures. Free Radic Biol Med. 2023;194:220–229.
- Wu H, Chen Q, Zhang W, Mu W. Overview of strategies for developing high thermostability industrial enzymes: discovery, mechanism, modification and challenges. Crit Rev Food Sci Nutr. 2023;63(14):2057–2073.
- Xing R, Zhang H, Wang Q, Hao Y, Wang Y, Chen J, et al. Improved thermostability of maltooligosyl trehalose hydrolase by computer‐aided rational design. Systems Microbiology and Biomanufacturing. 2024;1–10.
- Yin X, Hu D, Li JF, He Y, Zhu TD, Wu MC. Contribution of disulfide bridges to the thermostability of a type A feruloyl esterase from Aspergillus usamii. PLoS One. 2015;10(5):e0126864.
- Yu H, Huang H. Engineering proteins for thermostability through rigidifying flexible sites. Biotechnol Adv. 2014;32(2):308–315.
- Yu Z, Yu H, Xu J, Wang Z, Wang Z, Kang T, et al. Enhancing thermostability of lipase from Pseudomonas alcaligenes for producing l‐menthol by the CREATE strategy. Cat Sci Technol. 2022;12(8):2531–2541.
- Zhan Z, Xu Z, Yu S, Feng J, Liu F, Yao P, et al. Stereocomplementary synthesis of a key intermediate for tofacitinib via enzymatic dynamic kinetic resolution‐reductive amination. Adv Synth Catal. 2022;364(14):2380–2386.
- Zheng L, Meng J, Lin M, Lv R, Cheng H, Zou L, et al. Structure prediction of the entire proteome of monkeypox variants. Acta Mater Med. 2022;1(2):260–264.
- Zhou CY, Li TB, Wang YT, Zhu XS, Kang J. Exploration of an N‐terminal disulfide bridge to improve the thermostability of a GH11 xylanase from Aspergillus niger. J Gen Appl Microbiol. 2016;62(2):83–89.
- Zhou X, Xu Z, Li Y, He J, Zhu H. Improvement of the stability and activity of an LPMO through rational disulfide bonds design. Front Bioeng Biotechnol. 2022;9:815990.
- Zhu L, Yang J, Song JN, Chou KC, Shen HB. Improving the accuracy of predicting disulfide connectivity by feature selection. J Comput Chem. 2010;31(7):1478–1485.
Supplementary Materials
Data S1.
