Synthetic and Systems Biotechnology. 2025 May 15;10(3):1038–1049. doi: 10.1016/j.synbio.2025.05.005

From reactants to products: computational methods for biosynthetic pathway design

Shaozhen Ding a,1, Dongliang Liu a,1, Yu Tian a, Dachuan Zhang c, HuaDong Xing d, Junni Chen e, Zhiguo Liu a, Qian-Nan Hu b
PMCID: PMC12159830  PMID: 40510533

Abstract

One of the main goals in synthetic biology is to produce value-added compounds from available precursors using enzymatic approaches. The construction of biosynthetic pathways for synthesizing target molecules plays a crucial role in this process. However, it is challenging and time-consuming for researchers to design efficient pathways manually. In recent decades, pathway design has advanced through data- and algorithm-driven approaches. In this article, we review key computational tools involved in biosynthetic pathway design, covering: 1) Biological Big-Data including compounds, reactions/pathways and enzymes. 2) Retrosynthesis methods leveraging multi-dimensional biosynthesis data to predict potential pathways for target compounds synthesis. 3) Enzyme engineering relying on data mining to identify/de novo design enzymes with desired functions. Integrating these three key components can significantly enhance the efficiency and accuracy of biosynthetic pathway design in synthetic biology.

Keywords: Biological big-data, Methods of retrosynthetic analysis, Enzyme engineering

1. Introduction

The key concept of engineering is the ability to assemble simple, standardized modules into systems of increasing complexity. Guided by this principle, synthetic biology draws on multiple disciplines to design and build biological systems that solve economically valuable tasks. It not only modifies naturally occurring biological systems but also rationally constructs novel systems from well-understood components [1]. More specifically, these tasks include engineering bacteria to invade and kill cancer tumors [2], producing value-added chemicals by metabolic engineering [3], biodegrading toxic and harmful substances [4], engineering biosensors [5], and rationally designing enzymes that catalyze novel reactions [6]. Among these tasks, metabolic engineering enables researchers to engineer microorganisms to synthesize valuable molecules (e.g., renewable biofuels or anticancer drugs) from available precursors, which has an increasingly positive impact on society.

Because of the complexity of biological systems and the unknown interactions among their myriad components, numerous rounds of the design-build-test-learn (DBTL) cycle must be performed to obtain the desired solutions to problems in metabolic engineering [7]. Biosynthetic pathway design faces challenges including a massive search space, complex metabolic pathways, and the uncertainties of biological systems [8]. Constructing a novel biological system that satisfies the desired specifications (e.g., a particular titer, rate, or yield) is time-consuming and error-prone [9]. For example, it took 150 person-years of effort to produce the antimalarial precursor artemisinin, and 575 person-years of effort to produce propanediol [10]. Therefore, using emerging technologies to enhance the efficiency and accuracy of biosynthetic pathway design plays a crucial role in synthetic biology. In recent years, computational methods have been extensively applied in synthetic biology, notably to accelerate biosynthetic pathway design [11–15]. In this article, we discuss how computational methods enable metabolic pathway design for producing value-added molecules in three parts: biological big-data, retrosynthetic analysis, and enzyme engineering (Fig. 1).

Fig. 1.


The framework of biosynthetic pathway design from the three parts.

2. Biological big-data

The effectiveness of computational methods for biosynthetic pathway design depends on the quality and diversity of available biological data from several categories, including compounds, reactions/pathways, and enzymes (Table 1).

Table 1.

Biological databases encompassing various categories.

2.1. Compound databases

Compound databases such as PubChem [16], ChEBI [17], ChEMBL [18], ZINC [19] and ChemSpider [20] store information on chemical compounds, including their structures, properties, and biological activities, and serve as the foundation for reaction and pathway databases. Specifically, PubChem [16], funded by the NIH, is publicly accessible and contains 119 million compound records, 327 million substance records, 295 million bioactivity experiment data points and 41 million literature references. ChEBI [17], maintained by EBI, focuses on small molecular compounds and provides freely accessible, detailed information (e.g., structures, properties, and biological activities). ChEMBL [18] is a curated database of bioactive drug-like small molecules for drug discovery, containing over 2.5 million compounds with bioactivity information. ZINC [19], provided by the University of California, San Francisco, is a database of commercially available compounds for virtual screening, offering over 230 million purchasable compounds with 3D structures. ChemSpider is a free chemical structure database providing fast text and structure search access to over 130 million structures from hundreds of data sources [20]. NPAtlas (Natural Products Atlas) is a curated repository of natural products with annotated structures, sources, and bioactivity data, facilitating drug discovery and biosynthetic studies [21]. LOTUS (The Natural Products Online Database) integrates chemical, taxonomic, and spectral data of natural products to accelerate research in metabolomics and drug discovery [22]. COCONUT (Collection of Open Natural Products) is an open repository of natural product structures with metadata, designed to facilitate drug discovery and computational exploration [23]. NPASS (Natural Product Activity and Species Source) links natural products to their biological sources and pharmacological activities to accelerate drug development research [24].
These databases not only cover general small molecules but also specialize in natural products and drug-like compounds, which serve as essential resources for biosynthetic pathway design by providing comprehensive chemical structures, biological activities, and taxonomic origins of diverse molecules.

2.2. Reaction/pathway databases

By providing abundant information about the molecular events, interactions, and regulatory mechanisms that govern biological processes, biological reaction/pathway databases play a crucial role in deepening our understanding of complex biological systems. Furthermore, they enable researchers to explore the interconnectedness of different pathways and identify key components that drive cellular function. Specifically, KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive database that integrates genomic, chemical, and systemic functional information [25]. By providing valuable data on pathways, diseases, drugs, and organisms, KEGG is a key resource for bioinformatics and systems biology studies. BKMS-react [26] is an integrated and non-redundant biochemical reaction database containing known enzyme-catalyzed and spontaneous reactions collected from BRENDA [27], KEGG [25], MetaCyc [28] and SABIO-RK [29]. Rhea [30] is a database of biochemical reactions that offers detailed information on enzyme-catalyzed reactions, including reaction equations, chemical structures, and enzyme annotations. It is a useful resource for studying metabolic pathways and enzyme function in various organisms. Reactome [31] is a curated database of biological pathways that covers a wide range of biological processes and provides comprehensive information on molecular events and interactions. PathBank is a database of metabolic pathways that offers information on the metabolites, enzymes, and reactions involved in various metabolic processes [32]; it is a useful resource for studying metabolic networks and identifying potential drug targets for metabolic diseases. MetaCyc is a database of metabolic pathways and enzymes that provides detailed information on biochemical reactions and pathways in various organisms [28]; it is a valuable resource for studying metabolic diversity and evolution across species. SABIO-RK is a manually curated database containing data about biochemical reactions and their reaction kinetics [29].

In addition to general reaction/pathway databases, specialized databases focusing on drug metabolism provide critical insights into xenobiotic transformations, toxicity, and pharmacokinetics. For instance, DrugBank combines drug structures, target interactions, and metabolic pathways to support pharmaceutical research and precision medicine [33]. HMDB (The Human Metabolome Database) is a curated knowledgebase of human metabolites with their biological roles, disease associations, and metabolic pathways, providing insights into metabolic health and disease mechanisms [34]. STITCH is a comprehensive interaction database that integrates chemical-protein networks by aggregating experimental, predicted, and curated data from multiple sources, enabling systematic exploration of drug-target interactions, metabolic pathways, and polypharmacology effects across species [35]. Overall, these databases are essential resources for studying the intricate networks in life processes and for advancing research in fields such as bioinformatics, systems biology, and drug discovery.

2.3. Enzyme databases

The compilation and systematic organization of enzymatic data covering diverse domains of life (e.g., bacteria, archaea, eukaryotes, and viruses) is critical for providing comprehensive enzyme information (e.g., enzyme functions, structural characteristics, catalytic mechanisms, substrate specificity, and inhibitor interactions), which effectively promotes the construction of biosynthetic pathways. Over the past several decades, several groups have constructed comprehensive enzyme databases. For example, UniProt is a protein information database managed by the UniProt Consortium, containing information on protein structure, function, and evolution across various organisms [36]. PDB archives 3D structural information on proteins and other biological molecules, obtained through techniques such as X-ray crystallography and NMR [37]. BRENDA is an enzyme database providing detailed data on enzyme functions, structures, and mechanisms of action [27]. AlphaFold DB is a database of protein structures predicted by the deep learning-based AlphaFold algorithm [38]. With these enzyme databases, researchers can accelerate the study of enzyme-catalyzed reactions and enhance our understanding of biological processes at the molecular scale.

3. Methods of retrosynthetic analysis

The significance of functional molecules has been confirmed in various fields, including energy, drugs, food, and cosmetics [42]. One goal of synthetic biology is to identify biosynthetic pathways for producing functional molecules from suitable starting materials [43]. Computer-aided retrosynthetic analysis for producing target molecules was first proposed by Corey in the 1960s [44]. In recent years, with great improvements in biosynthetic data, algorithms, and computing power, Segler et al. drew on the success of artificial intelligence algorithms in the game of Go and developed an “AlphaGo” for retrosynthesis [45], which indicates that synthetic pathway planning has entered the era of artificial intelligence. Unlike games such as chess that follow fixed rules, identifying feasible reaction candidates is nontrivial because there are thousands of possible candidates at each step, which leads to exponential growth in the number of reactions with distance from the target molecule. To address this challenge, computational approaches have been proposed to assist in planning biosynthetic pathways. Based on the mechanism of precursor generation, retrosynthetic planning tools can be divided into three categories: template-based, template-free, and semi-template-based methods (Fig. 2). Detailed descriptions of each category and a comparative analysis are presented below.

Fig. 2.


Three kinds of methods in one-step retrosynthetic prediction: template-based approach, template-free approach, and semi-template-based approach.

3.1. Template-based pathway design

The template-based approach obtains reaction rules from biochemical reactions by manual or automatic extraction at a specified radius, which is the maximum bond distance from the reaction center within which neighboring atoms are included in the rule. The reaction rules can then be used to infer precursors from the target molecule by finding disconnection sites [42]. Several template-based retrosynthesis tools are described below.
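The role of the radius can be illustrated with a minimal sketch. It assumes a molecule is given as a plain list of bonds between atom indices and that the reaction-center atoms are already known; the function name and the ethyl acetate numbering are hypothetical choices for illustration, and real tools operate on full molecular graphs (typically encoded as SMARTS) rather than bare index lists.

```python
from collections import deque


def atoms_within_radius(bonds, center_atoms, radius):
    """Return the atoms at most `radius` bonds away from any reaction-center atom."""
    # Build an undirected adjacency map from the bond list.
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Multi-source breadth-first search starting from every center atom.
    distance = {atom: 0 for atom in center_atoms}
    queue = deque(center_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for nxt in neighbors.get(atom, ()):
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)
    return set(distance)


# Heavy atoms of ethyl acetate, CC(=O)OCC (illustrative numbering):
# 0=C, 1=C (carbonyl), 2=O (carbonyl), 3=O (ester), 4=C, 5=C
ETHYL_ACETATE_BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]
REACTION_CENTER = [1, 3]  # the C-O ester bond

for r in (0, 1, 2):
    print(r, sorted(atoms_within_radius(ETHYL_ACETATE_BONDS, REACTION_CENTER, r)))
```

A small radius yields a general rule that matches many substrates, whereas a large radius yields a specific rule that matches few; this is the trade-off between overly general and overly specific templates.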

Delépine et al. built RetroPath2.0 [46], an automated open-source workflow for retrosynthesis based on generalized reaction rules that performs the retrosynthesis search from chassis to target. It requires three inputs: a set of compounds (the source), a set of compounds (the sink) and a set of reaction templates. The workflow produces a network linking the source set to the sink set, where each line corresponds to a reaction rule.
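The three-input design (source, sink, rules) can be sketched as a breadth-first expansion. This is an illustrative toy under stated assumptions, not the RetroPath2.0 implementation: compounds are plain string labels, each "rule" is a hypothetical lookup function returning precursor tuples, and `build_network` is an invented name.

```python
from collections import deque


def build_network(source, sink, rules, max_steps=3):
    """Expand source compounds backwards until sink compounds are reached.

    Returns a list of edges (product, rule_name, precursors).
    """
    edges = []
    seen = set(source)
    frontier = deque((compound, 0) for compound in source)
    while frontier:
        compound, depth = frontier.popleft()
        # Sink compounds need no further expansion; also cap the search depth.
        if compound in sink or depth == max_steps:
            continue
        for rule_name, apply_rule in rules.items():
            for precursors in apply_rule(compound):
                edges.append((compound, rule_name, precursors))
                for precursor in precursors:
                    if precursor not in seen:
                        seen.add(precursor)
                        frontier.append((precursor, depth + 1))
    return edges


# Hypothetical rules: each maps a compound label to candidate precursor sets.
RULES = {
    "rule_1": lambda c: {"target": [("A", "B")]}.get(c, []),
    "rule_2": lambda c: {"A": [("S1",)], "B": [("S2", "S1")]}.get(c, []),
}

network = build_network(source=["target"], sink={"S1", "S2"}, rules=RULES)
for product, rule_name, precursors in network:
    print(f"{product} <-[{rule_name}]- {' + '.join(precursors)}")
```

In a real workflow the lookup functions would be replaced by applying reaction templates to molecular structures, and the depth cap keeps the exponentially growing search space bounded.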

Using a prime factorization-based encoding technique (rePrime), Kumar et al. tracked and codified all reaction centers as rules and built a pathway-searching algorithm (novoStoic) to trace both metabolites and moieties through balanced bio-conversion strategies [47]. novoStoic has been shown to bypass steps in existing pathways through putative transformations, assemble complex pathways blending both known and putative steps toward pharmaceuticals, and postulate ways to biodegrade xenobiotics.

Based on a reaction-filling framework, we expanded the bio-reaction space more than 10-fold by extracting reaction rules from KEGG [25] and Rhea [30], and then provided a user-friendly webserver (novoPathFinder) for biosynthetic pathway design [48]. It has three main features: (i) enumerating novel pathways between two specified molecules without considering hosts; (ii) constructing heterologous pathways with known or putative reactions for producing a target molecule in Escherichia coli or yeast without a given precursor; (iii) evaluating novel pathways according to several categories of criteria.

By extracting reaction rules from Reaxys, Segler et al. combined Monte Carlo Tree Search (MCTS) with an expansion policy network that guides the search and a filter network that pre-selects the most promising retrosynthetic steps [45]. As a result, this system solved almost twice as many molecules, thirty times faster, than traditional computer-aided search methods.

By extracting reaction rules from MetaNetX, Koch et al. proposed an open-source and modular command-line tool (RetroPath RL) to explore the bioretrosynthesis space using a Monte Carlo Tree Search (MCTS) reinforcement learning method [49]. RetroPath RL has been validated on a golden data set of 20 manually curated experimental pathways, of which 75 % could be found.
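The selection step that lets MCTS balance exploring new disconnections against exploiting promising ones can be sketched with the generic UCB1 score. This is a textbook formula used here as an assumption; the tools above use their own scoring variants (for example, policy-network priors), and the rule names and statistics below are invented.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Generic UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """Pick the child (candidate rule) with the highest UCB1 score.

    `children` maps a rule name to a (total_reward, visits) tuple.
    """
    return max(children, key=lambda name: ucb1(*children[name], parent_visits))


# Hypothetical statistics for three candidate retrosynthetic rules.
stats = {"rule_a": (3.0, 5), "rule_b": (1.0, 1), "rule_c": (0.0, 0)}
print(select_child(stats, parent_visits=6))
```

A rarely visited rule can outrank a rule with a better average reward, which is what drives the search into unexplored parts of the reaction tree.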

Zhang et al. proposed a framework called BioRetro to predict biosynthetic pathways for producing natural products, which combines a one-step synthesis network (HybridMLP) with AND-OR tree heuristic search [50]. In one-step bioretrosynthesis prediction experiments on the MetaNetX dataset, HybridMLP achieves top-1, top-5, and top-10 accuracies of 46.5 %, 74.6 %, and 81.6 %, respectively.
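Because top-k accuracy is the metric reported for most one-step models in this section, a minimal sketch of how it is computed may help. The function name and the placeholder labels are assumptions for illustration; in practice the comparison is usually made on canonicalized reactant SMILES rather than raw strings.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of targets whose true precursor appears among the first k predictions."""
    hits = sum(
        1
        for predictions, truth in zip(ranked_predictions, ground_truth)
        if truth in predictions[:k]
    )
    return hits / len(ground_truth)


# Placeholder labels standing in for predicted precursor sets, best guess first.
predictions = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["X", "Y", "Z"],
]
truths = ["A", "A", "A", "A"]

for k in (1, 2, 3):
    print(f"top-{k}: {top_k_accuracy(predictions, truths, k):.2f}")
```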

Instead of extracting reaction rules automatically, RetroBioCat uses a set of expertly encoded reaction rules encompassing the enzyme toolbox to identify promising biosynthetic pathways for producing target molecules [51]. The system employs a rule set comprising 99 unique reactions, represented by 135 reaction SMARTS patterns. As a result, except for C–H oxidation by P450 enzymes, all of the reactions in the test set (52 pathways reported in the literature) were correctly predicted by RetroBioCat.

When the global information of the target molecule is not considered, reaction rules that are too specific or too general lead to predictions that are overly conservative or unrealistic, respectively. To address this issue, Chen et al. presented a data-driven retrosynthesis model [52,53], LocalRetro, which suggests possible synthesis pathways by locally learning the chemical reactivity (local reaction templates) together with global reactivity attention to account for the remaining nonlocal effects.

Generally speaking, reaction templates are either summarized manually by experts or extracted from reaction databases automatically, and template-based methods are more likely to infer the stable structure of precursors through the breaking and formation of chemical bonds. Results from template-based methods have strong interpretability and high prediction accuracy, since all templates are grounded in known biological reactions and are summarized through a large number of experiments and studies. However, the main limitation of those methods is that they cannot predict reactions beyond the templates, which will limit innovation.

3.2. Template-free pathway design

By contrast, template-free methods use no predefined reaction templates and instead treat the prediction of precursors from target molecules as a machine translation problem solved by a trained ML model [54]. One of the powerful models applied to language translation tasks in natural language processing (NLP) is the end-to-end approach, which was first described for forward reaction prediction by Nam and Kim [55]. Since then, several groups have developed template-free methods for retrosynthesis prediction.
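To make the translation analogy concrete, the sketch below splits SMILES strings into tokens, the "words" a sequence model would read and emit. The regular expression is a simplified pattern written for this illustration and covers only common organic-subset symbols; it is an assumption, not the tokenizer of any specific tool.

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter halogens,
# single-letter atoms, bond/branch symbols, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\-+\\/():.@~*$]|%\d{2}|\d"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


# Retrosynthesis as translation: product tokens in, reactant tokens out.
product = "CC(=O)OCC"      # ethyl acetate
reactants = "CC(=O)O.OCC"  # acetic acid + ethanol

print("source:", " ".join(tokenize(product)))
print("target:", " ".join(tokenize(reactants)))
```

A sequence-to-sequence model is trained on many such source/target pairs; because it emits tokens one at a time, nothing guarantees that the output is a syntactically valid SMILES string, which is why some methods add a separate correction step.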

Liu et al. developed a fully data-driven model for retrosynthetic reaction prediction with an encoder-decoder architecture that consists of two recurrent neural networks [56]. For a given target molecule and a specified reaction type, the model predicts the most likely reactants that can react in the specified reaction type to produce the target molecule. Liu's model achieves a top-1 accuracy of 37.5 %, compared with 35.4 % for the baseline model (a template-based expert system).

To raise accuracy, Zheng et al. developed a template-free self-correct retrosynthesis predictor (SCROP) to predict retrosynthesis planning using transformer neural networks [57]. By coupling with a neural network-based syntax corrector, SCROP achieves an accuracy of 59.0 % on a standard benchmark data set, which outperforms other deep learning methods by >21 % and template-based methods by >6 %.

By combining end-to-end transformer neural networks with an AND-OR tree-based planning algorithm, Zheng et al. built BioNavi-NP, a navigable and user-friendly toolkit to predict the biosynthetic pathways for both natural products (NPs) and NP-like compounds [14,15]. Extensive evaluation reveals that BioNavi-NP can identify biosynthetic pathways for 90.2 % of 368 test compounds and recover the reported building blocks for 72.8 % of them, which is 1.7 times more accurate than existing conventional rule-based approaches.
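The AND-OR structure used in multi-step planning can be sketched as a short recursion: a molecule (OR node) is solved if it is a building block or if any one of its candidate reactions is solved, while a reaction (AND node) is solved only if all of its precursors are solved. The labels and the `is_solved` helper are hypothetical; actual planners add heuristic cost estimates to decide which node to expand next.

```python
def is_solved(molecule, reactions, building_blocks, depth=0, max_depth=5):
    """Check whether a molecule can be traced back to building blocks."""
    if molecule in building_blocks:
        return True
    if depth == max_depth:
        return False
    # OR over candidate reactions, AND over the precursors of each reaction.
    return any(
        all(
            is_solved(p, reactions, building_blocks, depth + 1, max_depth)
            for p in precursors
        )
        for precursors in reactions.get(molecule, [])
    )


# Hypothetical one-step predictions: molecule -> list of precursor tuples.
REACTIONS = {
    "NP": [("X", "Y"), ("Z",)],
    "X": [("bb1",)],
    "Z": [("bb1", "bb2")],
}

print(is_solved("NP", REACTIONS, {"bb1", "bb2"}))
print(is_solved("NP", REACTIONS, {"bb1"}))
```

In the first call the route through "Z" succeeds even though the route through "X" and "Y" fails, because "Y" has no known precursors; removing "bb2" from the building blocks leaves no complete route.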

Probst et al. presented forward and backward prediction models based on the Molecular Transformer, trained on enzyme-catalyzed reactions extended with EC (Enzyme Commission) numbers. Their results demonstrate that the Molecular Transformer performs well for both forward reaction and retrosynthetic pathway prediction [58], with a top-1 accuracy of 49.6 % for the forward prediction model and 39.6 % for the single-step retrosynthetic model.

By utilizing the multi-head attention-based transformer architecture and Monte Carlo Tree Search with a heuristic scoring function, Lai et al. constructed an automatic data-driven end-to-end retrosynthetic route planning system (AutoSynRoute), which achieves top-1 predictive accuracy (63.0 %, with the reaction class provided) and top-1 molecular validity (99.6 %) in one-step retrosynthetic task [59]. According to the result, AutoSynRoute successfully reproduced published synthesis routes for the four case products.

Lee et al. proposed READRetro as a practical bio-retrosynthesis tool for planning the biosynthetic pathways of natural products. It combines an ensemble of two deep learning-based chemical reaction prediction models, Retroformer and Graph2SMILES, with a reaction retriever, which effectively resolves the tradeoff between generalizability and memorability [60]. READRetro was demonstrated to outperform existing models by a large margin in terms of both generalizability and memorability. The ensemble of Retroformer and Graph2SMILES achieved a top-1 accuracy of 23.4 % and a top-10 accuracy of 59.3 % on the BioChem-USPTO (clean) dataset.

Generally speaking, template-free approaches treat one-step retrosynthesis prediction as machine translation from one language (products) to another (reactants). These methods have two main advantages: 1) without constraints from predefined reaction templates, they have strong generalization ability and are able to mine hidden patterns from large amounts of biological data, so they can explore a wider space containing novel reactions by considering the global information of target molecules; 2) they are able to learn the relationship between reactants and products automatically without any manual intervention, which saves time and labor compared with template-based methods. However, template-free methods also have some limitations: 1) novel reactions generated by these methods may lack good interpretability and cannot be traced back to known biological reactions with an identical reaction center; 2) the predictions may contain thermodynamically unstable structures, syntax errors in SMILES, or infeasible reactions when the training dataset is small, and the scarcity of biosynthetic data compared with chemical data may also lead to over-fitting and reduced robustness.

3.3. Semi-template-based pathway design

Instead of directly converting the product to reactants, semi-template-based methods approach retrosynthesis as chemists do, dividing the process into two steps: 1) identify the reaction center to break the product into synthons; 2) complete the synthons to generate reactants. Several semi-template-based retrosynthesis tools are described below.
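The two steps can be illustrated on a toy molecular graph. This is a minimal sketch under strong assumptions: the reaction-center bond and the groups used for completion are hard-coded here, whereas the methods in this section learn both steps from data, and the function names and atom numbering are invented for the example.

```python
def split_into_synthons(atoms, bonds, center_bond):
    """Step 1: remove the reaction-center bond and return the resulting fragments."""
    remaining = [b for b in bonds if set(b) != set(center_bond)]
    neighbors = {index: set() for index in atoms}
    for a, b in remaining:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, synthons = set(), []
    for start in sorted(atoms):
        if start in seen:
            continue
        # Depth-first traversal collects one connected fragment (synthon).
        stack, component = [start], []
        while stack:
            atom = stack.pop()
            if atom in seen:
                continue
            seen.add(atom)
            component.append(atom)
            stack.extend(neighbors[atom] - seen)
        synthons.append(sorted(component))
    return synthons


def complete_synthons(synthons, attachments):
    """Step 2: attach a group to each synthon at its reaction-center atom."""
    completed = []
    for synthon in synthons:
        groups = {atom: attachments[atom] for atom in synthon if atom in attachments}
        completed.append({"atoms": synthon, "added_groups": groups})
    return completed


# Ethyl acetate, CC(=O)OCC: 0=C, 1=C (carbonyl), 2=O (carbonyl), 3=O (ester), 4=C, 5=C
ATOMS = {0: "C", 1: "C", 2: "O", 3: "O", 4: "C", 5: "C"}
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]

synthons = split_into_synthons(ATOMS, BONDS, center_bond=(1, 3))
# Hard-coded completion: OH on the acyl carbon gives acetic acid, H on the oxygen gives ethanol.
reactants = complete_synthons(synthons, attachments={1: "OH", 3: "H"})
print(synthons)
print(reactants)
```

Because the second step consumes the output of the first, a wrongly chosen center bond cannot be repaired during completion.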

Shi et al. formulated the retrosynthesis task as a one-to-many graph-to-graphs translation problem and developed G2Gs, which transforms a target molecular graph into a set of reactant molecular graphs [61]. First, G2Gs splits the target molecular graph into a set of synthons by identifying the reaction centers, and then it translates the synthons into the final reactant graphs using a variational graph translation framework. As a result, G2Gs outperforms many template-free approaches by up to 63 % in terms of top-1 accuracy on the USPTO-50K dataset.

Zhong et al. developed Graph2Edits [62] based on a graph neural network. It predicts edits to the product graph in an auto-regressive manner and sequentially generates transformation intermediates and final reactants according to the predicted edit sequence. Graph2Edits was reported to achieve state-of-the-art performance for semi-template-based retrosynthesis, with a top-1 accuracy of 55.1 % on USPTO-50K.

Based on the idea that the graph topology of precursor molecules is largely unaltered during a chemical reaction, Somnath et al. proposed a graph-based approach for retrosynthesis prediction [63], which first converts the target molecule into fragments called synthons and then expands the synthons into complete molecules. The model achieves a top-1 accuracy of 53.7 %.

Yan et al. developed RetroXpert [64], which is inspired by how chemists approach retrosynthesis prediction and disassembles retrosynthesis into two steps: i) identify the potential reaction center of the target molecule through a novel graph neural network and generate intermediate synthons, and ii) generate the reactants associated with synthons via a robust reactant generation model. As a result, RetroXpert achieves top-1 accuracy of 70.4 % and 65.5 % with the reaction types given or unknown respectively, in the USPTO-50k dataset. For the USPTO-full dataset, RetroXpert achieves top-1 accuracy of 49.4 %.

In addition to the graph-based semi-template-based methods described above, Wang et al. developed a method for retrosynthesis prediction called RetroPrime [65], in which two stages (generate synthons and complete synthons) are accomplished with versatile transformer models with top-1 accuracy of 64.8 % and 51.4 %, when the reaction types are known and unknown, respectively, in the USPTO-50k dataset.

Motivated by the fact that molecular changes usually occur locally during reactions, Zhong et al. also proposed a transformer-based autoencoder for synthesis prediction, called R-SMILES, which can be applied to different synthesis tasks, including reactant-to-product, product-to-reactant, product-to-synthon and synthon-to-reactant [66]. By specifying a tightly aligned one-to-one mapping between the product and reactant SMILES and thereby reducing the edit distance, R-SMILES is largely relieved from learning complex syntax and can focus on learning the chemistry of reactions. As a result, R-SMILES achieves a top-1 accuracy of 49.1 % ± 0.42 % for product-to-synthon-to-reactant prediction on the USPTO-50K dataset.
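The effect of alignment on edit distance can be checked with a small sketch. The Levenshtein function is the standard dynamic-programming algorithm; the two reactant strings are hand-written SMILES for the same molecules (acetic acid and ethanol) chosen for illustration, and they are an assumption rather than output of the R-SMILES procedure itself.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


product = "CC(=O)OCC"        # ethyl acetate
unaligned = "CCO.CC(=O)O"    # ethanol + acetic acid, written independently of the product
aligned = "CC(=O)O.OCC"      # same reactants, written to follow the product's atom order

print("unaligned:", edit_distance(product, unaligned))
print("aligned:  ", edit_distance(product, aligned))
```

With the aligned form, the reactant string differs from the product only by the inserted ".O", so the model mostly copies its input and needs to learn just the local change.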

Generally speaking, semi-template-based methods treat one-step retrosynthesis the way chemists think about how a reaction happens: they first identify the reaction center and break the product into synthons, and then complete the synthons to generate reactants. These methods scale better than template-based methods because they do not rely on a database of predefined chemical reaction templates, and they are more accurate than template-free methods because the P2S (product-to-synthons) stage identifies the reaction center and some inherent laws of chemical reactions are taken into account. However, semi-template methods also have limitations. For example, most semi-template approaches are not end-to-end: recognition of the reaction center and completion of the synthons are independent of each other, so an incorrect prediction in the first step leads to errors in the second step and in the final result.

In summary, retrosynthetic pathway planning methods can be divided into three categories: template-based methods, template-free methods and semi-template-based methods. We have made detailed comparative analysis for each of them from several aspects (Table 2), including characteristics in technical level, advantages & limitations, computational efficiency, interpretability & biological feasibility and real-world application example.

Table 2.

Comparative analysis of three kinds of retrosynthesis methods.

Characteristics in technical level
Template-based method:
  • It relies on predefined reaction templates extracted from reaction/pathway databases (e.g., KEGG, Rhea) manually or automatically.
  • Reaction templates represent structure changes between reactants and products in reaction centers along with neighboring atoms at a customized radius, often represented by SMARTS.
Template-free method:
  • It is a kind of end-to-end method, which formulates retrosynthesis as a sequence generation problem.
  • It often regards retrosynthesis as a machine translation problem by representing a molecule as a series of SMILES tokens (sequence-based method).
Semi-template-based method:
  • It divides retrosynthesis into two steps: 1) transform the target molecule into synthons; 2) complete the synthons to reactants.

Advantages & Limitations
Template-based method:
  • Advantages: The template-based method has strong interpretability and high prediction accuracy, since all templates are grounded in known biological reactions and are summarized through a large number of experiments and studies.
  • Limitations: It cannot predict reactions beyond predefined templates, which will limit innovation.
Template-free method:
  • Advantages: The template-free method has strong generalization ability and can predict a wider space with novel reactions by considering the global information of target molecules. It can learn the relationship between reactants and products automatically, which reduces manual intervention.
  • Limitations: 1) Novel reactions generated by this kind of method may lack good interpretability and cannot be traced back to known biological reactions with an identical reaction center; 2) Predictions may contain thermodynamically unstable structures, syntax errors in SMILES or infeasible reactions when the training dataset is small, and the scarcity of biosynthetic data compared with chemical data may also lead to over-fitting and reduced robustness.
Semi-template-based method:
  • Advantages: The semi-template-based method conforms to the thinking of chemists and has good interpretability. It also has good scalability compared with the template-based method and higher accuracy compared with the template-free method.
  • Limitations: Most semi-template approaches are not end-to-end; recognition of the reaction center and completion of the synthons are independent of each other, so an incorrect prediction in the first step will lead to errors in the second step and the final result.

Computational efficiency
Template-based method:
  • It usually has high computational efficiency, since it can quickly find possible retrosynthetic steps by matching the structural features of the target molecule with predefined reaction templates.
Template-free method:
  • Its computational efficiency is often low. This kind of method requires exhaustive exploration of the reaction space through deep learning architectures (e.g., Transformer) without the constraints of predefined templates. While enabling novel pathway discovery, this unguided search demands intensive computational resources to evaluate thousands of potential reaction permutations, particularly for complex molecules with multiple functional groups.
Semi-template-based method:
  • Its computational efficiency is between the first two kinds of methods. It decomposes retrosynthetic prediction into two sub-problems: reaction center identification and synthon completion. It automatically obtains templates from the dataset, reduces manual intervention, simplifies the complexity of reactant generation, and relatively reduces model complexity and computational cost.

Interpretability & Biological feasibility
Template-based method:
  • Strong interpretability/Strong biological feasibility: It utilizes reaction templates to characterize the mechanism of reactions.
Template-free method:
  • Medium interpretability/Medium biological feasibility: It considers retrosynthesis prediction as a "molecular translation" task; its interpretability and biological feasibility are limited by its black-box nature, the missing reaction mechanism and the risk of syntax errors.
Semi-template-based method:
  • Good interpretability/Good biological feasibility: It solves retrosynthesis like chemists, dividing the process into two steps (generate/complete synthons).

Representative tools
Template-based method:
  • RetroPath2.0 [46]; rePrime & novoStoic [47]; novoPathFinder [48]; RetroPath RL [49]; RetroBioCat [51] and so on.
Template-free method:
  • SCROP [57]; BioNavi-NP [14,15]; AutoSynRoute [59] and so on.
Semi-template-based method:
  • G2Gs [61]; RetroXpert [64]; Somnath et al. [63]; Graph2Edits [62]; RetroPrime [65]; R-SMILES [66] and so on.

Real-world application example
Template-based method:
  • RetroPath2.0 was used to create potential pathways for the production of eugenol from ferulic acid [67].
Template-free method:
  • BioNavi-NP [14,15] was used to predict biosynthetic pathways to produce glutarate. The predicted pathways are consistent with the experimental results reported in several wet labs [68–70].
Semi-template-based method:
  • R-SMILES [66] was used to predict two pathways to produce febuxostat. The first route is consistent with the result in the literature [71]. The second route is more advantageous in terms of reaction effects (such as avoiding thermal decomposition and reducing side reactions) and raw material costs.

4. Methods for enzyme engineering

A significant challenge in biosynthesizing value-added molecules is the lack of efficient enzymes to catalyze the biological reactions in pathways [72]. To optimize key enzyme properties (e.g., expression, stability, substrate range and catalytic efficiency), or even to unlock new catalytic activities not found in nature, enzymes must be engineered at the level of the amino acid sequence [73]. Enzyme engineering is challenging because a protein of length N has ∼20^N possible sequences, which makes the search space of potential proteins beyond astronomically large [74]. Therefore, computer-assisted enzyme engineering approaches have been developed to discover enzymes with desired functions. Here, we describe these methods from four aspects (Fig. 3): 1) selection of enzymes when the structure of the substrate is clear: for a specified substrate, computational tools in this part focus on screening appropriate enzymes from sequence candidates whose functions are known. 2) discovery of functional enzymes from protein sequences: because a large number of amino acid sequences have not been annotated, it is important to identify the function of enzymes and expand the set of biological reactions that they can potentially catalyze. 3) generation of novel enzymes with desired functions: natural enzymes may not meet practical requirements, so novel enzymes with specific functions need to be generated. 4) optimization of enzyme key properties: both enzyme selection and de novo enzyme design face efficiency challenges, requiring further optimization of enzyme properties. We provide case studies to demonstrate their applications in metabolic engineering (Table 3).
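
To make the scale of this search space concrete, the short sketch below computes the number of possible sequences (20 raised to the power N) and its order of magnitude for a few protein lengths. It is an illustration only; the function names and the chosen lengths are assumptions, not part of any cited tool.

```python
import math

AMINO_ACIDS = 20  # the 20 canonical amino acids


def sequence_space_size(length: int, alphabet_size: int = AMINO_ACIDS) -> int:
    """Exact number of sequences of a given length: alphabet_size ** length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return alphabet_size ** length


def log10_sequence_space(length: int, alphabet_size: int = AMINO_ACIDS) -> float:
    """Order of magnitude of the sequence space: N * log10(alphabet_size)."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    # Illustrative lengths only (hypothetical choices).
    for n in (10, 100, 300):
        print(f"N = {n:>3}: about 10^{log10_sequence_space(n):.0f} sequences")
```

For a protein of only 100 residues the space already holds roughly 10^130 sequences, which is why exhaustive experimental screening is not an option and computational pre-selection is needed.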

Fig. 3.

Computational methods of enzyme engineering in key functional categories (A: selection of enzymes when the structure of the substrate is clear; B: discovery of functional enzymes from protein sequences; C: generation of novel enzymes with specified functions; D: optimization of enzyme key properties).

Table 3.

Applications of enzyme engineering tools in metabolic engineering.

Category | Tool | Case study
Enzyme selection for non-natural substrates | Selenzyme [43,75] | Selenzyme correctly selected the enzyme (EC 1.14.15.4) for the transformation of 11-deoxycortisol to cortisol [43,75].
Enzyme selection for non-natural substrates | E-yme [76] | E-yme was used to assign KO entries for 3,865 pairs in KEGG [76].
Enzyme selection for non-natural substrates | BridgIT [77] | BridgIT correctly identified the EC number (4.2.1.114) for a biological reaction (R03444) in KEGG [77].
Enzyme selection for non-natural substrates | SPEPP [78] | SPEPP was used to screen enzymes for the conversion of succinic acid to 1,4-butanediol [78].
Enzyme selection for non-natural substrates | PUEPP [79] | PUEPP was used to identify 15 new degrading enzymes specific for the mycotoxins ochratoxin A and zearalenone, of which six could degrade >90 % of the mycotoxin content within 3 h [79].
Enzyme selection for non-natural substrates | DLKcat [80] | DLKcat predicted genome-scale kcat values for 343 yeast/fungi species, involving about 3 million enzyme-substrate pairs [80].
Enzyme selection for non-natural substrates | ESP [81] | ESP predicted substrates of bacterial nitrilases, using input features based on the 3D structures and active sites of the enzymes [81].
Enzyme selection for non-natural substrates | EnzymeCAGE [82] | EnzymeCAGE predicted enzymes for 194 orphan reactions that had no enzyme information before 2018 and received enzyme annotations after 2023 [82].
Discovery of functional enzymes from protein sequences | COFACTOR [90,91] | COFACTOR was used to predict the EC numbers of 318 non-homologous enzymes, with the benchmark EC numbers extracted from PDB entries [90,91].
Discovery of functional enzymes from protein sequences | CLEAN [94] | CLEAN was used to identify 36 incompletely annotated halogenases from UniProt, followed by experimental validation in vitro [94].
Discovery of functional enzymes from protein sequences | DeepEC [92] | The enzyme YgbJ was assigned two EC numbers (1.1.1.411 and 1.1.1.60) by DeepEC, and its promiscuity was validated by enzyme assay [92].
Discovery of functional enzymes from protein sequences | GraphEC [93] | The EC numbers of two enzymes (acyl-protein thioesterase 2 and proline racemase) were accurately predicted by GraphEC [93].
Discovery of functional enzymes from protein sequences | ECRECer [95] | The enzyme iron/alpha-ketoglutarate-dependent dioxygenase AusU was assigned the EC number 1.14.11.38 by ECRECer, and the prediction was supported by further protein structure analysis [95].
De novo enzyme design | AlphaFold3 [99] | AlphaFold3 was used to predict the structure of the key enzyme Fut3Bc, which assisted in screening mutation sites to improve the production of difucosyllactose (DFL) [124].
De novo enzyme design | Rosetta Design [101] | Rosetta Design was used to reshape the active center of carboxylic acid reductase, significantly improving the enzyme's activity and substrate specificity; the enzyme was successfully applied to the biosynthesis of nylon 6 and nylon 66 monomers [125].
De novo enzyme design | RFdiffusion [96] | David Baker's team used RFdiffusion combined with the PLACER deep neural network to successfully design a new serine hydrolase [126].
De novo enzyme design | ProteinMPNN [102] | ProteinMPNN was employed to generate protein sequences (monomer or oligomer) that were more soluble, with a significant success rate [102].
De novo enzyme design | ProtGPT2 [104] | The proportion of globular domains in the sequences generated by ProtGPT2 was 88 %. Because many enzymes are globular proteins, this tool can facilitate the design of enzymes in metabolic engineering [104].
De novo enzyme design | ProteinGAN [105] | ProteinGAN was used to generate malate dehydrogenase (MDH) sequences, of which 24 % (13 out of 55) were soluble in experimental tests [105].
Enzyme optimization | FireProt-ASR [116] | The thermal stability of haloalkane dehalogenase (DhaA, UniProt ID P0A3G2) was optimized by FireProt-ASR [116].
Enzyme optimization | ProSNEx [122] | ProSNEx constructed the weighted protein structure network of TEM-1 β-lactamase from Escherichia coli, highlighting the relationship between sequence evolution and dynamic interactions within the β-lactamase structure [122].
Enzyme optimization | AlloSigMA 2 [123] | AlloSigMA 2 was used to explore the regulatory mechanism of phosphofructokinase (PFK), which catalyzes the phosphorylation of fructose-6-phosphate (F6P) to produce fructose-1,6-bisphosphate, thereby promoting glycolysis [123].
Enzyme optimization | TKSA-MC [117] | The TKSA-MC server analyzed the charge-charge interactions of ubiquitin, which is involved in the ubiquitination degradation pathway of proteins, and predicted mutation sites that can improve thermal stability [117].

4.1. Selection of enzymes when the structure of substrate is clear

Selecting enzymes that can perform their natural chemical transformation on non-natural substrates would dramatically advance the ability to design synthetic routes involving enzymatic catalysis. Many bioinformatic tools have been developed to predict compound-protein interactions (CPIs) at the enzyme-substrate level, and they can be divided into two groups: 1) Reaction similarity-based methods. Selenzyme was developed to search BRENDA [27] and Expasy [41] for enzymes capable of catalyzing novel or orphan reactions [43,75]. Moriya et al. developed a method (E-yme) to identify candidate enzymes for a specific reaction using the chemical structures of substrate-product pairs [76]. This method searches a reference database for similar reactant pairs and returns ortholog groups that possibly catalyze the given reaction. BridgIT assesses the similarity between an orphan reaction and a well-characterized non-orphan reaction, using the substrate reactive sites, their surrounding structure and the structures of the generated products, and suggests the enzymes that catalyze the most similar non-orphan reactions as candidates for the orphan one [77]. 2) Machine learning-based methods. Xing et al. developed the Substrate-product Pair-based Enzyme Promiscuity Prediction (SPEPP) model, which outputs a score indicating the likelihood that an enzyme catalyzes a given substrate-product reaction [78]. Zhang et al. proposed a robust model (PUEPP) for predicting enzyme substrate promiscuity based on positive-unlabeled learning, which has been used to identify 15 new degrading enzymes specific for two hazardous substances [79]. Li et al. developed DLKcat, which takes substrate structure and protein sequence as inputs [80] and predicts kcat, the parameter that defines the maximum chemical conversion rate of a reaction. Kroll et al. presented a general model (ESP) for predicting the substrate scope of enzymes [81], which achieves an accuracy of 89 % even for enzymes with very low sequence identity (<40 %) to proteins in the training set. By integrating enzyme structures, evolutionary insights, and reaction center transformations, EnzymeCAGE was built as a foundation framework that connects enzymes and reactions [82]. These advanced computational tools can provide accurate function predictions of enzyme-substrate interactions, which enable the discovery of enzyme functions and a deeper understanding of orphan reactions.
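
The idea behind reaction similarity-based selection can be sketched in a few lines: represent the query (orphan) reaction and each characterized reaction as a set of structural features, score each pair with the Tanimoto coefficient, and propose the enzymes of the most similar reactions. The sketch below is a toy illustration under stated assumptions: the feature tokens, the enzyme labels and the function names are hypothetical, and it does not reproduce the fingerprints or scoring used by Selenzyme, E-yme or BridgIT.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidate_enzymes(
    query: FrozenSet[str],
    reference: Dict[str, FrozenSet[str]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank enzymes of characterized reactions by similarity to the query reaction."""
    scored = [(tanimoto(query, features), enzyme) for enzyme, features in reference.items()]
    # Highest similarity first; ties are broken alphabetically by enzyme label.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(enzyme, round(score, 3)) for score, enzyme in scored[:top_k]]


# Hypothetical reference reactions, each described by toy feature tokens.
REFERENCE = {
    "enzyme_A": frozenset({"alcohol", "carbonyl", "NAD", "aromatic_ring"}),
    "enzyme_B": frozenset({"carboxyl", "amine", "ATP"}),
    "enzyme_C": frozenset({"alcohol", "carbonyl", "NAD"}),
}
# Hypothetical orphan reaction.
QUERY = frozenset({"alcohol", "carbonyl", "NAD", "methyl"})

if __name__ == "__main__":
    for enzyme, score in rank_candidate_enzymes(QUERY, REFERENCE):
        print(enzyme, score)
```

In this toy case enzyme_C ranks first because it shares three of four features with the query. Real tools derive the features from reaction centers and atom environments rather than hand-written tokens.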

4.2. Discovery of functional enzymes from protein sequences

With the development of DNA sequencing technologies, particularly genomic and metagenomic tools, vastly more sequence data than functional data are now available for enzymes. For example, UniProt contains approximately 250 million protein sequences, of which only about 0.3 % (∼572,970) have been annotated [36]. It is important to identify the functions of enzymes and expand the biological reactions they can potentially catalyze. Experimental approaches for enzyme function characterization are both time-intensive and costly, so the development of computational approaches has become imperative. These computational methods can be categorized into homology-based, structure-based, and machine learning-based approaches. Specifically, by assuming that highly similar enzymes have similar functions, homology-based alignment tools were proposed to annotate enzyme function. These tools include BLAST [83], UBLAST [84], LAMBDA [85], LAST [86], DIAMOND [87], BLAT [87], RAPSearch2 [88], SANSparallel [89] and so on. However, homology-based methods have limited coverage when no similar sequence is available. To improve coverage, COFACTOR was developed as a platform for structure-based, multiple-level protein function prediction that scans structurally similar protein templates to identify consensus functions [90,91]. Both homology-based and structure-based methods are limited by the lack of high-quality templates; therefore, machine learning-based approaches were developed to relax the dependence on similar sequences and structures. For example, DeepEC uses three convolutional neural networks (CNNs) as its major engine for predicting EC numbers, and also implements homology analysis for EC numbers that cannot be classified by the CNNs [92]. GraphEC was proposed as a geometric graph learning-based EC number predictor that uses ESMFold-predicted structures and a pretrained protein language model [93]. Zhao et al. reported a contrastive learning-enabled enzyme annotation model (CLEAN) for enzyme function prediction, which has been used to identify 36 incompletely annotated halogenases from UniProt, followed by experimental validation in vitro [94]. ECRECer is a deep learning-based cloud platform that significantly improves EC number prediction accuracy by 70 % through protein language modeling and a multi-agent hierarchical framework, enabling precise enzyme function annotation [95]. Enzymes play essential roles in diverse biological processes. Computational approaches for predicting unannotated enzyme functions and identifying their active sites have become increasingly significant in synthetic biology, genomics, and related fields. However, the performance of these methods depends heavily on the availability and quality of protein data. Therefore, sustained efforts should be encouraged to develop comprehensive and reliable datasets through long-term research commitments.
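
The homology-based principle, and its coverage limitation, can be illustrated with a minimal annotation-transfer sketch: the annotation of the most similar characterized sequence is transferred to the query only if the similarity exceeds a threshold, otherwise no call is made. Everything here is an assumption for illustration: the sequences and function labels are made up, the 0.6 threshold is arbitrary, and Python's `difflib.SequenceMatcher` is only a crude stand-in for a real aligner such as BLAST or DIAMOND.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Crude sequence similarity in [0, 1]; a stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotation(
    query: str,
    annotated: Dict[str, str],
    min_similarity: float = 0.6,  # arbitrary illustrative threshold
) -> Tuple[Optional[str], float]:
    """Return (label, score) of the best hit, or (None, score) below the threshold."""
    best_label: Optional[str] = None
    best_score = 0.0
    for sequence, label in annotated.items():
        score = similarity(query, sequence)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score  # no sufficiently similar sequence: no annotation
    return best_label, best_score


# Hypothetical annotated sequences (not real enzymes).
ANNOTATED = {
    "MKTAYIAKQRQISFVKSHFSRQ": "function_X",
    "GSHMLEDPVVLQRRDWENPGVTQ": "function_Y",
}

if __name__ == "__main__":
    print(transfer_annotation("MKTAYIAKQRQLSFVKSHFSRQ", ANNOTATED))  # close homolog
    print(transfer_annotation("WWWWWWWWWW", ANNOTATED))              # no homolog
```

The second query returns no annotation, which mirrors the coverage gap that structure-based and machine learning-based predictors aim to close.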

4.3. Generation of novel enzymes with specific functions

Methods for enzyme annotation and screening at the enzyme-substrate level have greatly aided enzyme discovery during the construction of biosynthetic pathways. However, only a tiny subset of the possible protein landscape has been explored in nature. De novo protein design allows researchers to derive new proteins with new functions and desirable attributes, which could provide greater opportunities for biosynthesizing value-added molecules [[96], [97], [98]]. The workflow of protein design can be divided into five steps. 1) Backbone generation: The protein backbone is essential, as it forms the structural scaffold that dictates the protein's three-dimensional shape and enables its biological function. Several computational tools have been developed to generate protein scaffolds, such as RFdiffusion [96], AlphaFold3 [99], RoseTTAFold [100] and so on. 2) Sequence design: The amino acid sequence is designed to stabilize the target backbone, or is generated following the principles of natural amino acid sequences for specific functions. Computational tools such as RosettaDesign [101], ProteinMPNN [102], EvoDesign [103], ProtGPT2 [104] and ProteinGAN [105] can assist in this stage. 3) Structure optimization: Designed structures are refined for stability and functional accuracy. Computational tools such as RosettaRelax [106] are used in this stage. 4) Experimental validation: This stage involves gene analysis, protein expression and structural characterization via X-ray crystallography, cryo-EM, or NMR. 5) Iterative optimization: Computational and experimental feedback is looped repeatedly to refine the design. Advances in machine learning, such as AlphaFold3 [99] and ProteinMPNN [102], have significantly accelerated the design process, enabling the creation of proteins with unprecedented precision and function.

4.4. Optimization of enzyme key properties

Both enzyme selection and de novo enzyme design face efficiency challenges, requiring further optimization of enzyme properties [107]. Many computational methods have been developed to narrow down the space of possible mutations and alleviate the experimental burden. These tools can be grouped by function. 1) Enzyme solubility. Solubility is a crucial factor, because poor solubility leads to aggregation and the formation of inactive clumps. Several approaches that rely on sequence and structure properties provide solutions for solubility prediction and optimization, such as SOLart [108], Aggrescan3D 2.0 [109], AggreRATE-Pred [110] and SoluPro [111]. 2) Enzyme activity and specificity. Engineering the activity and selectivity toward a specified substrate plays an important role in the biosynthesis of target molecules. Several groups have developed tools that introduce mutations in the active site and optimize it toward the targeted substrate, engineer access tunnels, modify dynamic properties and edit recognition elements. These include the Rosetta toolbox [112], the Rosetta-based web tool FuncLib [113], CaverDock [114], DaReUS-Loop [115] and so on. 3) Enzyme stability. Temperature, co-solvents, pH and other general conditions affect enzyme stability, and it is desirable for an engineered enzyme to survive longer in harsher conditions than its native variants. Many tools have been developed for this purpose, such as FireProt-ASR [116], TKSA-MC [117], pStab [118] and so on. 4) Enzyme dynamics. Enzyme dynamics is crucial for achieving a desired activity and also affects protein solubility and stability. Many tools have been developed to assess and engineer enzyme dynamics, such as DynaMut2 [119], CABS [120,121], ProSNEx [122], AlloSigMA 2 [123] and so on.
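
As a simple illustration of what "narrowing down the space of possible mutations" means, the sketch below enumerates all single-point mutants of a sequence, either at every position or only at a chosen subset (for example, positions suggested by a stability or active-site tool). For a sequence of length L there are 19 × L single mutants, and 19 × k when only k positions are considered. The sequence, positions and function name are hypothetical, and the sketch assumes the input contains only the 20 canonical residues.

```python
from typing import Iterator, Optional, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, one-letter codes


def single_point_mutants(
    sequence: str,
    positions: Optional[Sequence[int]] = None,  # 0-based indices; None means all
) -> Iterator[Tuple[str, str]]:
    """Yield (mutation label, mutant sequence) for every single substitution.

    Labels use the usual 1-based convention, e.g. 'T3A' for Thr3 -> Ala.
    """
    indices = range(len(sequence)) if positions is None else positions
    for i in indices:
        wild_type = sequence[i]
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                yield f"{wild_type}{i + 1}{residue}", sequence[:i] + residue + sequence[i + 1:]


if __name__ == "__main__":
    seq = "MKTAYIAK"  # hypothetical 8-residue fragment
    print(len(list(single_point_mutants(seq))))          # all positions
    print(len(list(single_point_mutants(seq, [2, 5]))))  # two selected positions
```

Restricting the scan to two positions reduces the candidates from 152 to 38 in this toy case; the same reduction matters far more for double and triple mutants, whose number grows combinatorially.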

5. Conclusions and further perspectives

The development of computational tools for biosynthetic pathway design has revolutionized synthetic biology, enabling the systematic engineering of organisms to produce valuable compounds. In this review, we have highlighted how the integration of biological big-data, retrosynthetic algorithms, and enzyme engineering approaches can accelerate the design-build-test-learn (DBTL) cycle. Key advances include: (1) Construction of multidimensional biosynthetic databases, which serve as the cornerstone of biosynthetic pathway design and enzyme engineering; (2) Prediction of biosynthetic pathways through template-based, template-free, and semi-template-based methods, each offering unique trade-offs between innovation and interpretability; and (3) Rational enzyme discovery, design and optimization, empowered by machine learning and structural modeling tools such as AlphaFold and RFdiffusion.

However, several persistent challenges remain to be addressed in future research: 1) In terms of data: The performance of computational methods is significantly influenced by biological data, which are often limited in availability and quality, making it difficult to accurately model and predict complex biological processes. In addition, data collection lacks standardization, especially for data from failed experiments, leading to inaccuracies in analysis and interpretation. Moreover, many metabolic pathways in living organisms are still not fully understood, leaving gaps in our knowledge of how biological molecules are synthesized. 2) In terms of retrosynthesis planning: Template-based methods are limited in predicting transformations not covered by existing reaction rules. Template-free approaches often generate SMILES strings with grammatical errors, leading to invalid outputs. Semi-template-based methods, though a middle ground, lack end-to-end integration: because reaction center identification and synthon completion are decoupled, an error in the first step propagates to the second, ultimately compromising the final result. Moreover, to prevent the exponential growth of possible reaction combinations during single-step design, all three approaches must incorporate filtering mechanisms that eliminate implausible reactions by accounting for toxicity, thermodynamic feasibility, and economic viability. 3) In terms of enzyme engineering: Discovery and optimization of enzymes to perform desirable functions is another hurdle to be overcome for the successful construction of the predicted pathways. Key strategies of enzyme engineering span several aspects, including selection of enzymes from well-characterized protein databases, optimization of enzymes toward good activity, solubility and cofactor availability, an optimal balance between substrate specificity and enzyme promiscuity, and generation of non-natural enzyme sequences using artificial intelligence technologies. To improve both the performance and applicability of these methods, several aspects can be addressed. These include the development of efficient and user-friendly tools, the integration of protein dynamics into machine learning pipelines, and a more comprehensive understanding of protein perturbations at the amino acid level under varying environmental conditions such as temperature, pH, and ionic strength.
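
The SMILES grammar problem mentioned above suggests one cheap filtering step: discard generated strings that are syntactically broken before any expensive scoring. The sketch below is a deliberately coarse check, written under our own assumptions, that only verifies balanced branches, closed bracket atoms and paired ring-closure labels. It does not check valence, aromaticity or chemistry, and in practice a full cheminformatics toolkit parser would be used instead; the function name and the example strings are illustrative.

```python
def looks_like_valid_smiles(smiles: str) -> bool:
    """Coarse syntax check: parentheses, bracket atoms and ring closures must pair up."""
    if not smiles:
        return False
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # inside a bracket atom such as [NH4+]
    open_rings = set()   # ring-closure labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            if ch == "[":
                return False          # nested bracket atom
            if ch == "]":
                in_bracket = False    # digits inside brackets are not ring labels
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False              # closing bracket without an opening one
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False          # branch closed before it was opened
        elif ch == "%":
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False          # '%' must be followed by two digits
            open_rings ^= {int(label)}  # toggle: open on first use, close on second
            i += 2
        elif ch in "0123456789":
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


if __name__ == "__main__":
    candidates = ["c1ccccc1", "CC(=O)O", "C1CCCC", "C(=O)(O", "[NH4+"]
    print([s for s in candidates if looks_like_valid_smiles(s)])
```

Only the first two candidates pass: the third leaves a ring open, the fourth leaves a branch open, and the fifth never closes its bracket atom.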

Looking ahead, we propose three priority directions for the field: 1) Unified knowledge graphs: Developing structured representations that link metabolites, reactions, enzymes, and host-specific constraints to reduce design-test gaps; 2) Generative AI for novel biochemistry: Combining physics-based modeling with large language models (e.g., for enzyme active site design) to explore uncharted reaction spaces; 3) Automated experimental validation: Coupling computational predictions with high-throughput robotic platforms to close the DBTL loop faster, as seen in emerging "self-driving lab" initiatives. The convergence of these advances (open data, efficient algorithms, and automated experimentation) will democratize biosynthetic pathway engineering. Future tools must prioritize interpretability to guide wet-lab researchers, and scalability to handle complex pathways. By integrating computational predictions with biological reality, synthetic biology transforms biotechnology from manual trial-and-error approaches into rational design, enabling groundbreaking applications, including the production of valuable compounds and targeted drug development.

CRediT authorship contribution statement

Shaozhen Ding: Writing – review & editing. Dongliang Liu: Writing – original draft. Yu Tian: Writing – original draft. Dachuan Zhang: Writing – original draft. HuaDong Xing: Writing – original draft. Junni Chen: Data curation. Zhiguo Liu: Writing – original draft. Qian-Nan Hu: Writing – review & editing.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Junni Chen is currently employed by Wuhan LifeSynther Science and Technology Co. Limited.

Acknowledgments

This project received funding from the National Key Research and Development Program of China (2020YFA0908300). We thank Lizhen Ding, a graduate student in the English Department of Wuhan University, for English language editing.

Footnotes

Peer review under the responsibility of Editorial Board of Synthetic and Systems Biotechnology.

Contributor Information

Shaozhen Ding, Email: 23113176@whpu.edu.cn.

Qian-Nan Hu, Email: qnhu@whu.edu.cn.

References

  • 1. MacDonald J.T., Barnes C., Kitney R.I., Freemont P.S., Stan G.B. Computational design approaches and tools for synthetic biology. Integr Biol. 2011;3(2):97–108. doi: 10.1039/c0ib00077a.
  • 2. Anderson J.C., Clarke E.J., Arkin A.P., Voigt C.A. Environmentally controlled invasion of cancer cells by engineered bacteria. J Mol Biol. 2006;355(4):619–627. doi: 10.1016/j.jmb.2005.10.076.
  • 3. Karaca H., Kaya M., Kapkac H.A., Levent S., Ozkay Y., Ozan S.D., Nielsen J., Krivoruchko A. Metabolic engineering of Saccharomyces cerevisiae for enhanced taxadiene production. Microb Cell Fact. 2024;23(1):241. doi: 10.1186/s12934-024-02512-z.
  • 4. Zhang D., Xing H., Liu D., Han M., Cai P., Lin H., Tian Y., Guo Y., Sun B., Le Y., et al. Discovery of toxin-degrading enzymes with positive unlabeled deep learning. ACS Catal. 2024;14(5):3336–3348.
  • 5. Della Corte D., van Beek H.L., Syberg F., Schallmey M., Tobola F., Cormann K.U., Schlicker C., Baumann P.T., Krumbach K., Sokolowsky S., et al. Engineering and application of a biosensor with focused ligand specificity. Nat Commun. 2020;11(1):4851. doi: 10.1038/s41467-020-18400-0.
  • 6. Buller R., Lutz S., Kazlauskas R.J., Snajdrova R., Moore C., Bornscheuer U.T. From nature to industry: harnessing enzymes for biocatalysis. Science. 2023;382(6673). doi: 10.1126/science.adh8615.
  • 7. Carbonell P., Jervis A.J., Robinson C.J., Yan C., Dunstan M., Swainston N., Vinaixa M., Hollywood K.A., Currin A., Rattray N.J.W., et al. An automated Design-Build-Test-Learn pipeline for enhanced microbial production of fine chemicals. Commun Biol. 2018;1:66. doi: 10.1038/s42003-018-0076-9.
  • 8. Heinemann M., Sauer U. Systems biology of microbial metabolism. Curr Opin Microbiol. 2010;13(3):337–343. doi: 10.1016/j.mib.2010.02.005.
  • 9. Carbonell P., Radivojevic T., Garcia Martin H. Opportunities at the intersection of synthetic biology, machine learning, and automation. ACS Synth Biol. 2019;8(7):1474–1477. doi: 10.1021/acssynbio.8b00540.
  • 10. Hodgman C.E., Jewett M.C. Cell-free synthetic biology: thinking outside the cell. Metab Eng. 2012;14(3):261–269. doi: 10.1016/j.ymben.2011.09.002.
  • 11. Lawson C.E., Martí J.M., Radivojevic T., Jonnalagadda S.V.R., Gentz R., Hillson N.J., Peisert S., Kim J., Simmons B.A., Petzold C.J. Machine learning for metabolic engineering: a review. Metab Eng. 2021;63:34–60. doi: 10.1016/j.ymben.2020.10.005.
  • 12. Radivojević T., Costello Z., Workman K., Garcia Martin H. A machine learning Automated Recommendation Tool for synthetic biology. Nat Commun. 2020;11(1):4879. doi: 10.1038/s41467-020-18008-4.
  • 13. Zampieri G., Vijayakumar S., Yaneske E., Angione C. Machine and deep learning meet genome-scale metabolic modeling. PLoS Comput Biol. 2019;15(7). doi: 10.1371/journal.pcbi.1007084.
  • 14. Zeng T., Jin Z., Zheng S., Yu T., Wu R. Developing BioNavi for hybrid retrosynthesis planning. JACS Au. 2024;4(7):2492–2502. doi: 10.1021/jacsau.4c00228.
  • 15. Zheng S., Zeng T., Li C., Chen B., Coley C.W., Yang Y., Wu R. Deep learning driven biosynthetic pathways navigation for natural products with BioNavi-NP. Nat Commun. 2022;13(1):3342. doi: 10.1038/s41467-022-30970-9.
  • 16. Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B.A., Thiessen P.A., Yu B., et al. PubChem 2023 update. Nucleic Acids Res. 2023;51(D1):D1373–D1380. doi: 10.1093/nar/gkac956.
  • 17. Hastings J., Owen G., Dekker A., Ennis M., Kale N., Muthukrishnan V., Turner S., Swainston N., Mendes P., Steinbeck C. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res. 2016;44(D1):D1214–D1219. doi: 10.1093/nar/gkv1031.
  • 18. Mendez D., Gaulton A., Bento A.P., Chambers J., De Veij M., Félix E., et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47(D1):D930–D940. doi: 10.1093/nar/gky1075.
  • 19. Tingle B.I., Tang K.G., Castanon M., Gutierrez J.J., Khurelbaatar M., Dandarchuluun C., Moroz Y.S., Irwin J.J. ZINC-22: a free multi-billion-scale database of tangible compounds for ligand discovery. J Chem Inf Model. 2023;63(4):1166–1176. doi: 10.1021/acs.jcim.2c01253.
  • 20. Pence H.E., Williams A. ChemSpider: an online chemical information resource. J Chem Educ. 2010;87:1123–1124.
  • 21. Poynton E.F., van Santen J.A., Pin M., Contreras M.M., McMann E., Parra J., Showalter B., Zaroubi L., Duncan K.R., Linington R.G. The Natural Products Atlas 3.0: extending the database of microbially derived natural products. Nucleic Acids Res. 2025;53(D1):D691–D699. doi: 10.1093/nar/gkae1093.
  • 22. Rutz A., Sorokina M., Galgonek J., Mietchen D., Willighagen E., Gaudry A., Graham J.G., Stephan R., Page R., Vondrášek J. The LOTUS initiative for open knowledge management in natural products research. Elife. 2022;11. doi: 10.7554/eLife.70780.
  • 23. Sorokina M., Merseburger P., Rajan K., Yirik M.A., Steinbeck C. COCONUT online: collection of open natural products database. J Cheminform. 2021;13(1):2. doi: 10.1186/s13321-020-00478-9.
  • 24. Zhao H., Yang Y., Wang S., Yang X., Zhou K., Xu C., Zhang X., Fan J., Hou D., Li X. NPASS database update 2023: quantitative natural product activity and species source database for biomedical research. Nucleic Acids Res. 2023;51(D1):D621–D628. doi: 10.1093/nar/gkac1069.
  • 25. Kanehisa M., Furumichi M., Sato Y., Kawashima M., Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023;51(D1):D587–D592. doi: 10.1093/nar/gkac963.
  • 26. Lang M., Stelzer M., Schomburg D. BKM-react, an integrated biochemical reaction database. BMC Biochem. 2011;12:42. doi: 10.1186/1471-2091-12-42.
  • 27. Chang A., Jeske L., Ulbrich S., Hofmann J., Koblitz J., Schomburg I., Neumann-Schaal M., Jahn D., Schomburg D. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 2021;49(D1):D498–D508. doi: 10.1093/nar/gkaa1025.
  • 28. Caspi R., Billington R., Keseler I.M., Kothari A., Krummenacker M., Midford P.E., Ong W.K., Paley S., Subhraveti P., Karp P.D. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Res. 2020;48(D1):D445–D453. doi: 10.1093/nar/gkz862.
  • 29. Wittig U., Rey M., Weidemann A., Kania R., Müller W. SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res. 2018;46(D1):D656–D660. doi: 10.1093/nar/gkx1065.
  • 30. Bansal P., Morgat A., Axelsen K.B., Muthukrishnan V., Coudert E., Aimo L., Hyka-Nouspikel N., Gasteiger E., Kerhornou A., Neto T.B., et al. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res. 2022;50(D1):D693–D700. doi: 10.1093/nar/gkab1016.
  • 31. Fabregat A., Sidiropoulos K., Viteri G., Forner O., Marin-Garcia P., Arnau V., D'Eustachio P., Stein L., Hermjakob H. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinf. 2017;18(1):142. doi: 10.1186/s12859-017-1559-2.
  • 32. Wishart D.S., Kruger R., Sivakumaran A., Harford K., Sanford S., Doshi R., Kehrtarpal N., Fatokun O., Doucet D., Zubkowski A., et al. PathBank 2.0-the pathway database for model organism metabolomics. Nucleic Acids Res. 2024;52(D1):D654–D662. doi: 10.1093/nar/gkad1041.
  • 33. Knox C., Wilson M., Klinger C.M., Franklin M., Oler E., Wilson A., Pon A., Cox J., Chin N.E., Strawbridge S.A. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024;52(D1):D1265–D1275. doi: 10.1093/nar/gkad976.
  • 34. Wishart D.S., Guo A., Oler E., Wang F., Anjum A., Peters H., Dizon R., Sayeeda Z., Tian S., Lee B.L. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res. 2022;50(D1):D622–D631. doi: 10.1093/nar/gkab1062.
  • 35. Kuhn M., von Mering C., Campillos M., Jensen L.J., Bork P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 2007;36(suppl_1):D684–D688. doi: 10.1093/nar/gkm795.
  • 36. UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 2025;53(D1):D609–D617. doi: 10.1093/nar/gkae1010.
  • 37. wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019;47(D1):D520–D528. doi: 10.1093/nar/gky949.
  • 38. Varadi M., Bertoni D., Magana P., Paramval U., Pidruchna I., Radhakrishnan M., Tsenkov M., Nair S., Mirdita M., Yeo J. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024;52(D1):D368–D375. doi: 10.1093/nar/gkad1011.
  • 39. Hastings J., Owen G., Dekker A., Ennis M., Kale N., Muthukrishnan V., Turner S., Swainston N., Mendes P., Steinbeck C. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res. 2016;44(D1):D1214–D1219. doi: 10.1093/nar/gkv1031.
  • 40. Norsigian C.J., Pusarla N., McConn J.L., Yurkovich J.T., Dräger A., Palsson B.O., King Z. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Res. 2020;48(D1):D402–D406. doi: 10.1093/nar/gkz1054.
  • 41.Duvaud S., Gabella C., Lisacek F., Stockinger H., Ioannidis V., Durinx C. Expasy, the Swiss bioinformatics resource portal, as designed by its users. Nucleic Acids Res. 2021;49(W1):W216–W227. doi: 10.1093/nar/gkab225. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Yu T., Boob A.G., Volk M.J., Liu X., Cui H., Zhao H. Machine learning-enabled retrobiosynthesis of molecules. Nat Catal. 2023;6(2):137–151.
  • 43.Carbonell P., Wong J., Swainston N., Takano E., Turner N.J., Scrutton N.S., Kell D.B., Breitling R., Faulon J.L. Selenzyme: enzyme selection tool for pathway design. Bioinformatics. 2018;34(12):2153–2154. doi: 10.1093/bioinformatics/bty065.
  • 44.Corey E.J., Wipke W.T. Computer-assisted design of complex organic syntheses. Science. 1969;166(3902):178–192. doi: 10.1126/science.166.3902.178.
  • 45.Segler M.H.S., Preuss M., Waller M.P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature. 2018;555(7698):604–610. doi: 10.1038/nature25978.
  • 46.Delepine B., Duigou T., Carbonell P., Faulon J.L. RetroPath2.0: a retrosynthesis workflow for metabolic engineers. Metab Eng. 2018;45:158–170. doi: 10.1016/j.ymben.2017.12.002.
  • 47.Kumar A., Wang L., Ng C.Y., Maranas C.D. Pathway design using de novo steps through uncharted biochemical spaces. Nat Commun. 2018;9(1):184. doi: 10.1038/s41467-017-02362-x.
  • 48.Ding S., Tian Y., Cai P., Zhang D., Cheng X., Sun D., Yuan L., Chen J., Tu W., Wei D.Q., et al. novoPathFinder: a webserver of designing novel-pathway with integrating GEM-model. Nucleic Acids Res. 2020;48(W1):W477–W487. doi: 10.1093/nar/gkaa230.
  • 49.Koch M., Duigou T., Faulon J.L. Reinforcement learning for bioretrosynthesis. ACS Synth Biol. 2020;9(1):157–168. doi: 10.1021/acssynbio.9b00447.
  • 50.Zhang X., Liu J., Yang F., Zhang Q., Yang Z., Shah H.A. Planning biosynthetic pathways of target molecules based on metabolic reaction prediction and AND-OR tree search. Comput Biol Chem. 2024;111 doi: 10.1016/j.compbiolchem.2024.108106.
  • 51.Finnigan W., Hepworth L.J., Flitsch S.L., Turner N.J. RetroBioCat as a computer-aided synthesis planning tool for biocatalytic reactions and cascades. Nat Catal. 2021;4(2):98–104. doi: 10.1038/s41929-020-00556-z.
  • 52.Chen S., Jung Y. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au. 2021;1(10):1612–1620. doi: 10.1021/jacsau.1c00246.
  • 53.Chen S., Noh J., Jang J., Kim S., Gu G.H., Jung Y. Reaction templates: bridging synthesis knowledge and artificial intelligence. Acc Chem Res. 2024;57(14):1964–1972. doi: 10.1021/acs.accounts.4c00261.
  • 54.Zheng S., Rao J., Zhang Z., Xu J., Yang Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J Chem Inf Model. 2020;60(1):47–55. doi: 10.1021/acs.jcim.9b00949.
  • 55.Nam J., Kim J. Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv:1612.09529 [Preprint] 2016.
  • 56.Liu B., Ramsundar B., Kawthekar P., Shi J., Gomes J., Luu Nguyen Q., Ho S., Sloane J., Wender P., Pande V. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent Sci. 2017;3(10):1103–1113. doi: 10.1021/acscentsci.7b00303.
  • 57.Zheng S., Rao J., Zhang Z., Xu J., Yang Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J Chem Inf Model. 2020;60(1):47–55. doi: 10.1021/acs.jcim.9b00949.
  • 58.Probst D., Manica M., Nana Teukam Y.G., Castrogiovanni A., Paratore F., Laino T. Biocatalysed synthesis planning using data-driven learning. Nat Commun. 2022;13(1):964. doi: 10.1038/s41467-022-28536-w.
  • 59.Lin K., Xu Y., Pei J., Lai L. Automatic retrosynthetic route planning using template-free models. Chem Sci. 2020;11(12):3355–3364. doi: 10.1039/c9sc03666k.
  • 60.Kim T., Lee S., Kwak Y., Choi M.S., Park J., Hwang S.J., Kim S.G. READRetro: natural product biosynthesis predicting with retrieval‐augmented dual‐view retrosynthesis. New Phytol. 2024;243(6):2512–2527. doi: 10.1111/nph.20012.
  • 61.Shi C., Xu M., Guo H., Zhang M., Tang J. A graph to graphs framework for retrosynthesis prediction. In: ICML’20: Proceedings of the 37th International Conference on Machine Learning. JMLR.org; 2020. pp. 8818–8827.
  • 62.Zhong W., Yang Z., Chen C.Y.-C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nat Commun. 2023;14(1):3009. doi: 10.1038/s41467-023-38851-5.
  • 63.Somnath V.R., Bunne C., Coley C.W., Krause A., Barzilay R. Learning graph models for retrosynthesis prediction. In: NIPS’21: Proceedings of the 35th International Conference on Neural Information Processing Systems. Curran Associates; 2021. pp. 9405–9415.
  • 64.Yan C., Ding Q., Zhao P., Zheng S., Yang J., Yu Y., Huang J. RetroXpert: decompose retrosynthesis prediction like a chemist. In: NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates; 2020. pp. 11248–11258.
  • 65.Wang X., Li Y., Qiu J., Chen G., Liu H., Liao B., Hsieh C.-Y., Yao X. RetroPrime: a diverse, plausible and transformer-based method for single-step retrosynthesis predictions. Chem Eng J. 2021;420.
  • 66.Zhong Z., Song J., Feng Z., Liu T., Jia L., Yao S., Wu M., Hou T., Song M. Root-aligned SMILES: a tight representation for chemical reaction prediction. Chem Sci. 2022;13(31):9023–9034. doi: 10.1039/d2sc02763a.
  • 67.Hanko E.K., Valdehuesa K.N.G., Verhagen K.J., Chromy J., Stoney R.A., Chua J., Yan C., Roubos J.A., Schmitz J., Breitling R. Carboxylic acid reductase-dependent biosynthesis of eugenol and related allylphenols. Microb Cell Fact. 2023;22(1):238. doi: 10.1186/s12934-023-02246-4.
  • 68.Park S.J., Kim E.Y., Noh W., Park H.M., Oh Y.H., Lee S.H., Song B.K., Jegal J., Lee S.Y. Metabolic engineering of Escherichia coli for the production of 5-aminovalerate and glutarate as C5 platform chemicals. Metab Eng. 2013;16:42–47. doi: 10.1016/j.ymben.2012.11.011.
  • 69.Parthasarathy A., Pierik A.J., Kahnt J., Zelder O., Buckel W. Substrate specificity of 2-hydroxyglutaryl-CoA dehydratase from Clostridium symbiosum: toward a bio-based production of adipic acid. Biochemistry. 2011;50(17):3540–3550. doi: 10.1021/bi1020056.
  • 70.Wang J., Wu Y., Sun X., Yuan Q., Yan Y. De novo biosynthesis of glutarate via α-keto acid carbon chain extension and decarboxylation pathway in Escherichia coli. ACS Synth Biol. 2017;6(10):1922–1930. doi: 10.1021/acssynbio.7b00136.
  • 71.Cao Q., Ma X., Xiong J., Guo P., Chao J. The preparation of febuxostat by Suzuki reaction. Chin J New Drugs. 2016;25:1057–1060.
  • 72.Jang W.D., Kim G.B., Kim Y., Lee S.Y. Applications of artificial intelligence to enzyme and pathway design for metabolic engineering. Curr Opin Biotechnol. 2022;73:101–107. doi: 10.1016/j.copbio.2021.07.024.
  • 73.Kouba P., Kohout P., Haddadi F., Bushuiev A., Samusevich R., Sedlar J., Damborsky J., Pluskal T., Sivic J., Mazurenko S. Machine learning-guided protein engineering. ACS Catal. 2023;13(21):13863–13895. doi: 10.1021/acscatal.3c02743.
  • 74.Yang J., Li F.Z., Arnold F.H. Opportunities and challenges for machine learning-assisted enzyme engineering. ACS Cent Sci. 2024;10(2):226–241. doi: 10.1021/acscentsci.3c01275.
  • 75.Stoney R.A., Hanko E.K., Carbonell P., Breitling R. SelenzymeRF: updated enzyme suggestion software for unbalanced biochemical reactions. Comput Struct Biotechnol J. 2023;21:5868–5876. doi: 10.1016/j.csbj.2023.11.039.
  • 76.Yamanishi Y., Hattori M., Kotera M., Goto S., Kanehisa M. E-zyme: predicting potential EC numbers from the chemical transformation pattern of substrate-product pairs. Bioinformatics. 2009;25(12):i179–i186. doi: 10.1093/bioinformatics/btp223.
  • 77.Hadadi N., MohammadiPeyhani H., Miskovic L., Seijo M., Hatzimanikatis V. Enzyme annotation for orphan and novel reactions using knowledge of substrate reactive sites. Proc Natl Acad Sci. 2019;116(15):7298–7307. doi: 10.1073/pnas.1818877116.
  • 78.Xing H., Cai P., Liu D., Han M., Liu J., Le Y., Zhang D., Hu Q.-N. High-throughput prediction of enzyme promiscuity based on substrate–product pairs. Briefings Bioinf. 2024;25(2) doi: 10.1093/bib/bbae089.
  • 79.Zhang D., Xing H., Liu D., Han M., Cai P., Lin H., Tian Y., Guo Y., Sun B., Le Y. Discovery of toxin-degrading enzymes with positive unlabeled deep learning. ACS Catal. 2024;14(5):3336–3348.
  • 80.Li F., Yuan L., Lu H., Li G., Chen Y., Engqvist M.K., Kerkhoven E.J., Nielsen J. Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nat Catal. 2022;5(8):662–672.
  • 81.Kroll A., Ranjan S., Engqvist M.K., Lercher M.J. A general model to predict small molecule substrates of enzymes based on machine and deep learning. Nat Commun. 2023;14(1):2787. doi: 10.1038/s41467-023-38347-2.
  • 82.Liu Y., Hua C., Zeng T., Rao J., Zhang Z., Wu R., Coley C.W., Zheng S. EnzymeCAGE: a geometric foundation model for enzyme retrieval with evolutionary insights. bioRxiv. 2024 doi: 10.1101/2024.12.15.628585.
  • 83.Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
  • 84.Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. doi: 10.1093/bioinformatics/btq461.
  • 85.Hauswedell H., Singer J., Reinert K. Lambda: the local aligner for massive biological data. Bioinformatics. 2014;30(17):i349–i355. doi: 10.1093/bioinformatics/btu439.
  • 86.Kielbasa S.M., Wan R., Sato K., Horton P., Frith M.C. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21(3):487–493. doi: 10.1101/gr.113985.110.
  • 87.Buchfink B., Xie C., Huson D.H. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12(1):59–60. doi: 10.1038/nmeth.3176.
  • 88.Zhao Y., Tang H., Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–126. doi: 10.1093/bioinformatics/btr595.
  • 89.Somervuo P., Holm L. SANSparallel: interactive homology search against Uniprot. Nucleic Acids Res. 2015;43(W1):W24–W29. doi: 10.1093/nar/gkv317.
  • 90.Roy A., Yang J., Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 2012;40(Web Server issue):W471–W477. doi: 10.1093/nar/gks372.
  • 91.Zhang C., Freddolino P.L., Zhang Y. COFACTOR: improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 2017;45(W1):W291–W299. doi: 10.1093/nar/gkx366.
  • 92.Ryu J.Y., Kim H.U., Lee S.Y. Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proc Natl Acad Sci U S A. 2019;116(28):13996–14001. doi: 10.1073/pnas.1821905116.
  • 93.Song Y., Yuan Q., Chen S., Zeng Y., Zhao H., Yang Y. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat Commun. 2024;15(1):8180. doi: 10.1038/s41467-024-52533-w.
  • 94.Yu T., Cui H., Li J.C., Luo Y., Jiang G., Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358–1363. doi: 10.1126/science.adf2465.
  • 95.Shi Z., Yuan Q., Wang R., Li H., Liao X., Ma H. ECRECer: enzyme commission number recommendation and benchmarking based on multiagent dual-core learning. arXiv:2202.03632 [Preprint] 2022.
  • 96.Watson J.L., Juergens D., Bennett N.R., Trippe B.L., Yim J., Eisenach H.E., Ahern W., Borst A.J., Ragotte R.J., Milles L.F. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620(7976):1089–1100. doi: 10.1038/s41586-023-06415-8.
  • 97.Winnifrith A., Outeiral C., Hie B.L. Generative artificial intelligence for de novo protein design. Curr Opin Struct Biol. 2024;86 doi: 10.1016/j.sbi.2024.102794.
  • 98.Wu Z., Johnston K.E., Arnold F.H., Yang K.K. Protein sequence design with deep generative models. Curr Opin Chem Biol. 2021;65:18–27. doi: 10.1016/j.cbpa.2021.04.004.
  • 99.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. doi: 10.1038/s41586-021-03819-2.
  • 100.Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–876. doi: 10.1126/science.abj8754.
  • 101.Kuhlman B., Dantas G., Ireton G.C., Varani G., Stoddard B.L., Baker D. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003;302(5649):1364–1368. doi: 10.1126/science.1089427.
  • 102.Dauparas J., Anishchenko I., Bennett N., Bai H., Ragotte R.J., Milles L.F., Wicky B.I., Courbet A., de Haas R.J., Bethel N. Robust deep learning–based protein sequence design using ProteinMPNN. Science. 2022;378(6615):49–56. doi: 10.1126/science.add2187.
  • 103.Mitra P., Shultis D., Zhang Y. EvoDesign: de novo protein design based on structural and evolutionary profiles. Nucleic Acids Res. 2013;41(W1):W273–W280. doi: 10.1093/nar/gkt384.
  • 104.Ferruz N., Schmidt S., Höcker B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun. 2022;13(1):4348. doi: 10.1038/s41467-022-32007-7.
  • 105.Repecka D., Jauniskis V., Karpus L., Rembeza E., Rokaitis I., Zrimec J., Poviloniene S., Laurynenas A., Viknander S., Abuajwa W. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell. 2021;3:324–333.
  • 106.Bhardwaj G., Mulligan V.K., Bahl C.D., Gilmore J.M., Harvey P.J., Cheneval O., Buchko G.W., Pulavarti S.V., Kaas Q., Eletsky A. Accurate de novo design of hyperstable constrained peptides. Nature. 2016;538(7625):329–335. doi: 10.1038/nature19791.
  • 107.Marques S.M., Planas-Iglesias J., Damborsky J. Web-based tools for computational enzyme design. Curr Opin Struct Biol. 2021;69:19–34. doi: 10.1016/j.sbi.2021.01.010.
  • 108.Hou Q., Kwasigroch J.M., Rooman M., Pucci F. SOLart: a structure-based method to predict protein solubility and aggregation. Bioinformatics. 2020;36(5):1445–1452. doi: 10.1093/bioinformatics/btz773.
  • 109.Kuriata A., Iglesias V., Pujols J., Kurcinski M., Kmiecik S., Ventura S. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility. Nucleic Acids Res. 2019;47(W1):W300–W307. doi: 10.1093/nar/gkz321.
  • 110.Rawat P., Prabakaran R., Kumar S., Gromiha M.M. AggreRATE-Pred: a mathematical model for the prediction of change in aggregation rate upon point mutation. Bioinformatics. 2020;36(5):1439–1444. doi: 10.1093/bioinformatics/btz764.
  • 111.Dutton G. Bacterial Platforms Can Rival Mammalian Platforms: AbSci says that its Escherichia coli platforms are engineered to defy known cell line limitations and enable high-titer scale-up of challenging biologics. Genetic Engineering & Biotechnology News. 2020;40(6):10–11.
  • 112.Bender B.J., Cisneros A. III, Duran A.M., Finn J.A., Fu D., Lokits A.D., Mueller B.K., Sangha A.K., Sauer M.F., Sevy A.M. Protocols for molecular modeling with Rosetta3 and RosettaScripts. Biochemistry. 2016;55(34):4748–4763. doi: 10.1021/acs.biochem.6b00444.
  • 113.Khersonsky O., Lipsh R., Avizemer Z., Ashani Y., Goldsmith M., Leader H., Dym O., Rogotner S., Trudeau D.L., Prilusky J., et al. Automated design of efficient and functionally diverse enzyme repertoires. Mol Cell. 2018;72(1):178–186.e5. doi: 10.1016/j.molcel.2018.08.033.
  • 114.Vavra O., Filipovic J., Plhak J., Bednar D., Marques S.M., Brezovsky J., Stourac J., Matyska L., Damborsky J. CaverDock: a molecular docking-based tool to analyse ligand transport through protein tunnels and channels. Bioinformatics. 2019;35(23):4986–4993. doi: 10.1093/bioinformatics/btz386.
  • 115.Karami Y., Guyon F., De Vries S., Tufféry P. DaReUS-Loop: accurate loop modeling using fragments from remote or unrelated proteins. Sci Rep. 2018;8(1) doi: 10.1038/s41598-018-32079-w.
  • 116.Musil M., Khan R.T., Beier A., Stourac J., Konegger H., Damborsky J., Bednar D. FireProtASR: a web server for fully automated ancestral sequence reconstruction. Brief Bioinform. 2021;22(4) doi: 10.1093/bib/bbaa337.
  • 117.Contessoto V.G., de Oliveira V.M., Fernandes B.R., Slade G.G., Leite V.B. TKSA-MC: a web server for rational mutation through the optimization of protein charge interactions. Proteins. 2018;86(11):1184–1188. doi: 10.1002/prot.25599.
  • 118.Gopi S., Devanshu D., Krishna P., Naganathan A.N. pStab: prediction of stable mutants, unfolding curves, stability maps and protein electrostatic frustration. Bioinformatics. 2018;34(5):875–877. doi: 10.1093/bioinformatics/btx697.
  • 119.Rodrigues C.H., Pires D.E., Ascher D.B. DynaMut2: assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein Sci. 2021;30(1):60–69. doi: 10.1002/pro.3942.
  • 120.Kurcinski M., Oleniecki T., Ciemny M.P., Kuriata A., Kolinski A., Kmiecik S. CABS-flex standalone: a simulation environment for fast modeling of protein flexibility. Bioinformatics. 2019;35(4):694–695. doi: 10.1093/bioinformatics/bty685.
  • 121.Kuriata A., Gierut A.M., Oleniecki T., Ciemny M.P., Kolinski A., Kurcinski M., Kmiecik S. CABS-flex 2.0: a web server for fast simulations of flexibility of protein structures. Nucleic Acids Res. 2018;46(W1):W338–W343. doi: 10.1093/nar/gky356.
  • 122.Aydınkal R.M., Serçinoğlu O., Ozbek P. ProSNEx: a web-based application for exploration and analysis of protein structures using network formalism. Nucleic Acids Res. 2019;47(W1):W471–W476. doi: 10.1093/nar/gkz390.
  • 123.Tan Z.W., Guarnera E., Tee W.-V., Berezovsky I.N. AlloSigMA 2: paving the way to designing allosteric effectors and to exploring allosteric effects of mutations. Nucleic Acids Res. 2020;48(W1):W116–W124. doi: 10.1093/nar/gkaa338.
  • 124.Chen Y., Zhao C., Wang R., Zhang W., Zhu Y., Mu W. Multidimensional engineering of Escherichia coli MG1655 for the efficient biosynthesis of difucosyllactose. J Agric Food Chem. 2025;73(9):5405–5413. doi: 10.1021/acs.jafc.4c12623.
  • 125.Shi K., Li J.-M., Wang M.-Q., Zhang Y.-K., Zhang Z.-J., Chen Q., Hollmann F., Xu J.-H., Yu H.-L. Computation-driven redesign of an NRPS-like carboxylic acid reductase improves activity and selectivity. Sci Adv. 2024;10(48) doi: 10.1126/sciadv.adp6775.
  • 126.Lauko A., Pellock S.J., Sumida K.H., Anishchenko I., Juergens D., Ahern W., Jeung J., Shida A., Hunt A., Kalvet I. Computational design of serine hydrolases. Science. 2025;388(6744) doi: 10.1126/science.adu2454.

Articles from Synthetic and Systems Biotechnology are provided here courtesy of KeAi Publishing
