Open Enzyme Database: a community-wide repository for sharing enzyme data

Le Yuan; David M Bianchi; Katherine Arneson; Bingji Guo; Sara Lambert; Chris Stephens; Yash H Wasnik; Christopher Pond; Matthew J Berry; Huimin Zhao

Nucleic Acids Res. 2025 Nov 4;54(D1):D643–D651. doi: 10.1093/nar/gkaf1082

Le Yuan¹,²,³, David M Bianchi⁴,⁵,⁶, Katherine Arneson⁷,⁸,⁹, Bingji Guo¹⁰,¹¹, Sara Lambert¹²,¹³, Chris Stephens¹⁴,¹⁵, Yash H Wasnik¹⁶,¹⁷, Christopher Pond¹⁸,¹⁹, Matthew J Berry²⁰,²¹,²², Huimin Zhao²³,²⁴,²⁵,²⁶,²⁷,✉

¹Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

² Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

³ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁴ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁵ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁶ NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁷ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁸ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

⁹ NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁰ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹¹ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹² NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹³ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁴ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁵ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁶ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁷ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁸ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

¹⁹ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²⁰ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²¹ National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²² NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²³Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²⁴ Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²⁵ NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²⁶ NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

²⁷ NSF Global Center for Biofoundry Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States

✉ To whom correspondence should be addressed. Email: zhao5@illinois.edu

Roles

Le Yuan: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review & editing

David M Bianchi: Conceptualization, Data curation, Formal analysis, Investigation, Project administration, Software, Visualization, Writing - review & editing

Katherine Arneson: Formal analysis, Investigation, Methodology, Software, Visualization, Writing - review & editing

Bingji Guo: Investigation, Software, Visualization, Writing - review & editing

Sara Lambert: Investigation, Methodology, Resources, Software

Chris Stephens: Data curation, Investigation, Software

Yash H Wasnik: Data curation, Investigation, Software

Christopher Pond: Investigation, Project administration, Software, Supervision

Matthew J Berry: Methodology, Project administration, Resources, Software, Supervision, Visualization, Writing - original draft, Writing - review & editing

Huimin Zhao: Conceptualization, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing - review & editing

PMCID: PMC12807643 PMID: 41188073

Abstract

Enzymes are the molecular machines of life and play an indispensable role in numerous biotechnological and biomedical applications. Despite the availability of a few enzyme databases, their utility is often hindered by limitations such as infrequent updates, lack of standardization, and inadequate support for machine learning. In this study, we developed the Open Enzyme Database (OED), a community-wide repository and web-based infrastructure designed to facilitate the sharing and exploration of enzyme data. As a user-friendly platform, the first release of the OED provides extensively curated information on enzyme kinetic parameters, structural data, enzymatic reactions, assay conditions, and functional annotations. It also features cutting-edge artificial intelligence (AI) tools for enzyme property prediction, along with cheminformatics-powered algorithms for the identification of promising enzymes. This database should greatly accelerate research and applications in areas such as biocatalysis, systems biology, synthetic biology, and enzyme engineering, as well as AI. The resource is freely available at https://openenzymedb.platform.moleculemaker.org/.

Introduction

Enzymes play an indispensable role in biological systems by catalyzing various biochemical reactions and, therefore, are crucial for scientific research in a wide range of areas, e.g. chemical synthesis, biofuel production, and environmental remediation [1–3]. Enzyme data in peer-reviewed journal articles are typically stored in various formats, many of which are unstructured, posing a significant barrier to downstream applications. As interest in enzyme discovery and engineering continues to expand, there is an increasing demand for comprehensive enzyme data that are both high quality and readily accessible. In particular, for machine learning (ML) and artificial intelligence (AI) applications in enzyme science, the availability of rich and well-curated datasets serves as a critical foundation for developing high-accuracy models capable of accelerating enzyme activity prediction, functional annotation, and pathway optimization [4, 5]. Although a few key enzyme databases (e.g. BRENDA and SABIO-RK) have been established [6, 7], critical limitations remain, including infrequent updates, lack of standardization, insufficient support for ML applications, and the absence of physics-based and data-driven computational algorithms for enzyme discovery.

To this end, we have established the Open Enzyme Database (OED), a community-driven and open-access knowledgebase designed to facilitate the sharing, exploration, and reuse of enzyme data. The database is presented as part of AlphaSynthesis (https://moleculemaker.org/research/alphasynthesis/), which is a platform that offers easy-to-use tools for AI-guided synthesis planning, catalyst discovery, process optimization, and closed-loop discovery science. With the OED, users can explore a variety of enzyme-related information such as enzyme kinetic parameters, enzyme structures, and enzymatic reactions. Currently, the database contains 23 309 enzymatic reactions catalyzed by 13 914 unique proteins derived from 5173 different organisms (Supplementary Table S1). In addition to serving as a repository of enzyme data, the OED also provides advanced cheminformatics-based algorithms for identifying and prioritizing promising enzymes. To help investigate the vast and still largely uncharacterized enzyme universe, the OED further offers three state-of-the-art AI tools for enzyme property prediction.

Materials and methods

Data curation

To initialize the OED, we first gathered enzyme kinetic data from publicly available databases, i.e. BRENDA and SABIO-RK [6, 7]. In this step, substantial efforts were devoted to data downloading, preprocessing, deduplication, and cleaning to ensure data quality and consistency (Supplementary data). This process resulted in 87 833 entries for enzyme turnover number (k_cat), 169 264 entries for Michaelis constant (K_m), and 51 698 entries for enzyme catalytic efficiency (k_cat/K_m). These kinetic parameters are commonly used as key indicators of enzyme catalytic activity.
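As a small worked illustration of how the three kinetic parameters relate, the sketch below derives k_cat/K_m from a k_cat and K_m pair. The function name, the input units (s^-1 for k_cat and mM for K_m), and the example values are assumptions made for illustration and are not taken from the OED.

```python
def catalytic_efficiency(kcat_per_s: float, km_mM: float) -> float:
    """Return k_cat/K_m in M^-1 s^-1 from k_cat (s^-1) and K_m (mM).

    The units are an assumption of this sketch, not an OED convention.
    """
    if kcat_per_s < 0 or km_mM <= 0:
        raise ValueError("k_cat must be non-negative and K_m must be positive")
    km_molar = km_mM / 1000.0  # convert mM to M
    return kcat_per_s / km_molar


# Hypothetical measurement: k_cat = 50 s^-1 and K_m = 0.25 mM.
efficiency = catalytic_efficiency(kcat_per_s=50.0, km_mM=0.25)
print(f"k_cat/K_m = {efficiency:.3g} M^-1 s^-1")
```

Under these assumed units, a k_cat of 50 s^-1 and a K_m of 0.25 mM correspond to a catalytic efficiency of about 2 x 10^5 M^-1 s^-1.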

In addition, the OED contains newly curated data that are lacking in other existing enzyme databases. For instance, it provides accessible reaction schemes for each entry whenever available. Currently, the database contains 23 309 unique enzymatic reactions, with 91.6% of entries linked to detailed reaction information. Moreover, assay conditions, such as pH and temperature, play a critical role in determining enzyme catalytic activity. Recognizing this, the OED has significantly expanded its coverage of such experimental conditions. Taking the K_m parameter entries as an example, the OED supplies 110 855 entries with associated pH data and 105 520 entries with linked temperature data, offering a much richer resource than previously available. Also, all entries in the database feature comprehensive enzyme-type annotations, specifying whether the enzyme is wildtype or mutated, along with detailed information on mutation sites. Furthermore, the OED currently includes 13 914 unique proteins, with interactive structure visualizations available on the UniProt accession entity page. Regarding enzyme substrates, the OED offers 18 229 compounds annotated with key physicochemical properties, including LogP, topological polar surface area (TPSA), number of hydrogen bond donors and acceptors, and number of rings. These properties are not available in other enzyme databases; they are valuable for analyzing enzyme–substrate interactions and can serve as informative features in cheminformatics and ML models.

Cheminformatics algorithms

The algorithms used in the enzyme recommendation module include maximum common substructure (MCS), Tanimoto similarity, and fragment algorithms. Powered by data-driven strategies, these algorithms are instrumental in identifying potential enzyme candidates that can catalyze substrates for which no existing enzyme–substrate pairs have been reported. This capability enables the discovery of enzymes that are specifically suited to catalyze desired substrates. To support these algorithms, compounds are represented using SMILES (Simplified Molecular Input Line Entry System) notation for computational processing.

The MCS approach was developed as a powerful tool for identifying and grouping compounds that exhibit similar structural patterns. In this work, we utilized the fMCS algorithm due to its high computational efficiency [8, 9]. The MCS score is defined mathematically as T_MCS(A, B) (see Equation 1). In this equation, |A|_a refers to the number of atoms in the input molecule, while |B|_a corresponds to the number of atoms in the database molecule to which it is compared. |MCS(A, B)|_a represents the number of atoms in the MCS shared by two molecules.

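$$T_{\mathrm{MCS}}(A, B) = \frac{|\mathrm{MCS}(A, B)|_a}{|A|_a + |B|_a - |\mathrm{MCS}(A, B)|_a} \tag{1}$$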
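A minimal sketch of the score computation, assuming Equation 1 takes the Tanimoto-style form |MCS(A, B)|_a / (|A|_a + |B|_a - |MCS(A, B)|_a) and that atoms are counted as heavy atoms. The function name is hypothetical, and the MCS size is passed in as a number rather than computed, because the substructure search itself is outside the scope of this example.

```python
def mcs_score(n_atoms_a: int, n_atoms_b: int, n_atoms_mcs: int) -> float:
    """Tanimoto-style MCS score from atom counts (assumed form of Equation 1)."""
    if n_atoms_a <= 0 or n_atoms_b <= 0:
        raise ValueError("both molecules must contain at least one atom")
    if not 0 <= n_atoms_mcs <= min(n_atoms_a, n_atoms_b):
        raise ValueError("the MCS cannot be larger than either molecule")
    return n_atoms_mcs / (n_atoms_a + n_atoms_b - n_atoms_mcs)


# Illustrative heavy-atom counts: 7-methyltryptamine (13) versus tryptamine (12),
# assuming the MCS is the entire tryptamine skeleton (12 atoms).
score = mcs_score(n_atoms_a=13, n_atoms_b=12, n_atoms_mcs=12)
print(round(score, 3))
```

With these assumed counts the score is 12/13, or roughly 0.92; identical molecules score 1, and molecules with no shared atoms score 0.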

For the similarity algorithm, we employed the Tanimoto similarity method by comparing chemical structures encoded as two-dimensional (2D) fingerprints. The widely accepted Tanimoto coefficient has proven highly effective in similarity-based virtual screening, making it an ideal choice for measuring molecular similarity [10]. This metric quantifies the degree of overlap between the fingerprint bit vectors of the user-input molecule and those of the database molecules, providing a straightforward and computationally efficient way to assess structural resemblance. In parallel, we implemented a fragment matching algorithm, which decomposes molecules into substructural units. This approach enables the detection of shared molecular fragments between the query compound and database entries, revealing key structural motifs that may not be captured by fingerprint-based similarity alone. The implementation of these algorithms is mainly based on the RDKit toolkit version 2024.09.6 (https://www.rdkit.org/).
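To make the fingerprint comparison concrete, the following sketch computes the Tanimoto coefficient for two fingerprints represented as sets of "on" bit positions. The set-based representation, the function name, and the bit positions are assumptions chosen to keep the example dependency-free; they are not the OED's implementation, which relies on RDKit fingerprints.

```python
from typing import AbstractSet


def tanimoto(bits_a: AbstractSet[int], bits_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit positions for a query molecule and a database molecule.
query_bits = {1, 5, 9, 12}
database_bits = {1, 5, 12, 20, 33}
print(tanimoto(query_bits, database_bits))
```

Here three bits are shared out of six distinct on-bits, giving a coefficient of 0.5.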

Database implementation

The OED is a multitiered web application built upon a range of open-source technologies and public Application Programming Interfaces (APIs). The main user interface is implemented using the Angular (https://angular.dev/) frontend framework. Tailwind (https://tailwindcss.com/) provides an abstraction layer over Cascading Style Sheets (CSS) to control layouts and styling, while most interactive elements of the interface are sourced from the PrimeNG (https://primeng.org/) component library. The user interface was designed using Figma (https://www.figma.com/), with a focus on ease of navigation and access to key data. Chemical structures are visualized in 2D using RDKit (https://www.rdkit.org/) and as a three-dimensional (3D) diagram using 3Dmol.js (https://3dmol.csb.pitt.edu/) [11], based on molecular geometries retrieved from the PubChem Power User Gateway (PUG) for molecules. Enzyme structures are also displayed as a 3D diagram using 3Dmol.js, with geometries retrieved on demand from the AlphaFold Protein Structure Database [12, 13]. Chart.js (https://www.chartjs.org/) and D3.js (https://d3js.org) support other data visualizations. In addition, user-drawn and user-uploaded chemical structure diagrams are further processed by Ketcher (https://github.com/epam/ketcher), and user-supplied compound names are converted to SMILES representations via the PubChem PUG.

API access to the dataset is provided through a FastAPI (https://fastapi.tiangolo.com/) web service. This service is described by an OpenAPI specification (https://www.openapis.org/), which enables an interactive Swagger (https://swagger.io/) documentation web page and supports the automatic generation of API client libraries in multiple programming languages via the OpenAPI Generator (https://github.com/OpenAPITools/openapi-generator). The underlying data records are stored in a PostgreSQL (https://www.postgresql.org/) database. In addition, a second FastAPI web service (https://github.com/moleculemaker/mmli-backend) provides utility functions, such as canonicalization of SMILES strings by RDKit, and manages the calculation of recommended enzymes and predicted properties by dispatching asynchronous jobs to compute nodes. All incoming web traffic passes through a Traefik (https://traefik.io/) application proxy, which enforces rate limits and automatically renews TLS certificates via Let’s Encrypt (https://letsencrypt.org/). Requests are subsequently reverse-proxied to nginx (https://nginx.org/) servers for frontend assets and backend APIs. Major system components are containerized as separate Docker (https://www.docker.com/) images, orchestrated using Kubernetes (https://kubernetes.io/), and hosted on an Ubuntu (https://ubuntu.com/) cluster at the National Center for Supercomputing Applications.

Database contents and features

As a freely available enzyme resource, the OED is designed to facilitate interactive exploration and its broad application in enzymology and related fields. Although this is the first release of the OED, it adheres to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data management and sharing [14]. This resource enables users not only to investigate known enzymes reported in literature through the database search functionality, but also to explore the uncharted enzyme space using AI-driven and cheminformatics-driven tools in the platform for the characterization of enzyme candidates. Programmatic access to the enzyme dataset in the OED is available to the public via API at https://openenzymedb-api.platform.moleculemaker.org/api/v1/docs/.

Overview of enzyme kinetic data

By analyzing all the k_cat entries across various Enzyme Commission (EC) numbers, we found that those entries with the first digit of EC number ranging from 1 to 4 account for >90% (Fig. 1A). Similar trends are also observed for the K_m and k_cat/K_m kinetic parameters (Supplementary Figs S1A and S2A). This is consistent with the fact that the majority of enzymatic reactions are catalyzed by oxidoreductases, transferases, hydrolases, and lyases, which correspond to EC numbers beginning with 1, 2, 3, and 4, respectively. In terms of species distribution, the three common species (Homo sapiens, Escherichia coli, Rattus norvegicus) have the highest number of k_cat, K_m, and k_cat/K_m entries among all species (Fig. 1B and Supplementary Figs S1B and S2B). Moreover, comparative analysis of enzyme kinetic parameters across different EC number categories shows that there is not a substantial difference in the median values for enzymes with first digits ranging from 1 to 4 (Fig. 1C and Supplementary Figs S1C and S2C). Notably, enzymes with EC numbers beginning with 7 (translocases) exhibit the highest median values for both k_cat and k_cat/K_m (Fig. 1C and Supplementary Fig. S2C), while displaying the lowest median K_m values among all categories (Supplementary Fig. S1C).

Figure 1. Overview of k_cat entries in the OED. (A) Classification of k_cat entries based on the first digit of EC number. (B) Top 10 species with the highest number of k_cat entries. (C) Comparison of k_cat values across different EC number categories. Abbreviations: k_cat, enzyme turnover number; OED, Open Enzyme Database; EC number, Enzyme Commission number.
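The grouping by first EC digit described in this section can be sketched as follows. The class names follow the standard EC nomenclature; the function names and the short list of EC numbers are illustrative assumptions and do not reproduce the OED data.

```python
from collections import Counter
from typing import Dict, Iterable

# Top-level Enzyme Commission classes keyed by the first digit of the EC number.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def ec_class(ec_number: str) -> str:
    """Map an EC number such as '2.8.2.1' to its top-level class name."""
    first_level = ec_number.strip().split(".")[0]
    if not first_level.isdigit() or int(first_level) not in EC_CLASSES:
        raise ValueError(f"unrecognized EC number: {ec_number!r}")
    return EC_CLASSES[int(first_level)]


def class_fractions(ec_numbers: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of entries that fall into each EC class."""
    counts = Counter(ec_class(ec) for ec in ec_numbers)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Illustrative EC numbers only; real entries would come from the database.
print(class_fractions(["1.1.1.1", "1.5.5.2", "2.8.2.1", "7.1.2.2"]))
```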

Database search

The OED (https://openenzymedb.platform.moleculemaker.org/) features a user-friendly web interface organized into several dedicated pages. The welcome page offers an overview of the OED and the web-based tools available within the platform. The tutorial page provides step-by-step documentation on how to use the database. The about page outlines supplementary details regarding the database, e.g. data statistics, GitHub repository, team information, etc. More importantly, the search page enables users to query and retrieve data related to specific enzymes of interest.

To support interactive exploration of the web database, users can perform searches using various query types, including compound name or SMILES, organism, enzyme name, UniProt accession, EC number, pH, and temperature. For each supported query type, representative examples are provided to guide users. The returned results are displayed in a tabular format, where each entry contains the compound name, organism, UniProt accession, EC number, enzyme type, experimental conditions (pH and temperature), enzyme kinetic parameters, and PubMed ID. Each PubMed ID is hyperlinked directly to the original source publication for further reference. In addition, users can click the expand icon positioned to the left of each entry to view more information, such as the enzyme name and, when available, the corresponding enzymatic reaction scheme. Cross-references to external databases, like UniProt and BRENDA, are provided [6, 15]. The search results are sortable by experimental kinetic parameters (e.g. k_cat and K_m) in either ascending or descending order to suit specific research needs. Filtering options are also available for all relevant database fields. Furthermore, the displayed data entries can be downloaded as a CSV file by clicking the export button. Relevant fields within each entry, such as the compound name, UniProt accession, and EC number, are clickable to enable users to explore detailed entity pages.
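Once search results have been exported, the same filtering and sorting can be repeated offline. The sketch below assumes a CSV layout with hypothetical column names (`compound`, `organism`, `uniprot_accession`, `ec_number`, `ph`, `temperature_c`, `kcat_per_s`) and made-up rows; the actual export headers may differ and should be checked against a downloaded file.

```python
import csv
import io
from typing import Dict, List

# Made-up rows standing in for an exported file; the accessions are placeholders.
SAMPLE_EXPORT = """\
compound,organism,uniprot_accession,ec_number,ph,temperature_c,kcat_per_s
compound A,Escherichia coli,ACC001,1.1.1.1,7.5,30,12.0
compound B,Escherichia coli,ACC002,2.7.1.1,7.0,37,85.5
compound C,Homo sapiens,ACC003,3.1.1.3,8.0,37,3.2
compound D,Escherichia coli,ACC004,4.1.1.1,6.5,25,
"""


def load_rows(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_and_sort(rows, organism: str, key: str = "kcat_per_s",
                    descending: bool = True) -> List[Dict[str, str]]:
    """Keep rows for one organism that report `key`, sorted by its numeric value."""
    selected = [r for r in rows if r["organism"] == organism and r[key]]
    return sorted(selected, key=lambda r: float(r[key]), reverse=descending)


for row in filter_and_sort(load_rows(SAMPLE_EXPORT), "Escherichia coli"):
    print(row["compound"], row["kcat_per_s"])
```

Rows without a reported value for the sort key are dropped rather than treated as zero, so that missing measurements do not appear at either end of the ranking.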

On the compound entity page, users have a centralized view of all available data associated with a specific compound. For example, by clicking on l-proline within an entry in the table of search results, users are directed to the l-proline entity page. At the top of the page, basic information (i.e. the compound name, molecular formula, SMILES, and molecular weight) is presented (Fig. 2A). Below this section, users can explore detailed chemical properties of l-proline, comprising LogP, TPSA, number of hydrogen bond donors and acceptors, and number of rings (Fig. 2B). On the top right, both 2D and 3D representations of the molecular structure are displayed (Fig. 2C). Further down the page, a table summarizes enzyme-related experimental data for l-proline, with built-in options for sorting and filtering (Fig. 2D). On the UniProt accession entity page, users can view an overview of the entity’s information, cross-references, 3D protein structure, and enzyme data associated with the UniProt accession. Similarly, the EC number entity page provides access to entity details, cross-references, representative protein structure, and reaction scheme, as well as related enzyme data. The data presented on these entity pages can also be downloaded by clicking the export button.

Figure 2. A screenshot from the OED web interface displaying the compound entity page. (A) At the top of the page, basic compound information is presented, including the name, molecular formula, SMILES, and molecular weight. (B) Chemical properties associated with the compound are provided. (C) On the top right, both 2D and interactive 3D molecular structures are shown. (D) Further down, a table summarizes experimental data of enzymes, with options for data sorting and filtering. Abbreviations: OED, Open Enzyme Database; SMILES, Simplified Molecular Input Line Entry System; 2D, two-dimensional; 3D, three-dimensional.

AI tools for enzyme property prediction

With the rapid advancement of neural networks and large language models, AI is transforming enzyme science. One of the key advantages of the OED is that all the data presented on the web interface are experimentally derived, with detailed entries supported by original peer-reviewed literature references. Beyond serving as a comprehensive resource for enzyme data exploration, the OED aspires to be a one-stop platform offering various AI tools for enzyme discovery. In this release, we mainly integrated three cutting-edge AI models (i.e. DLKcat, UniKP, and CatPred) for enzyme property prediction into the OED platform. These tools, deployed as a complementary feature not available in other enzyme databases such as BRENDA, enable users to estimate potential catalytic activities even in the absence of experimentally reported data, thereby helping to guide future experiments. Among them, DLKcat employs a deep learning approach to predict k_cat values from enzyme–substrate pairs, combining a graph neural network for substrate representation with a convolutional neural network for protein representation [16]. UniKP is a unified framework for predicting enzyme kinetic parameters (k_cat, K_m, and k_cat/K_m) from amino acid sequences and substrates based on pretrained language models [17]. CatPred is a deep learning model for predicting k_cat, K_m, and inhibitory constant (K_i) by leveraging a pretrained protein language model and Directed Message Passing Neural Networks [18].

As a case study, we demonstrate how the AI tools in the OED platform can be used. To begin with, users should navigate to the specific enzyme property prediction web page, where they are prompted to input an enzyme and a substrate into two separate input boxes. For instance, the enzyme sulfotransferase 1A1 (UniProt accession: Q29476) is entered in FASTA format into the enzyme input box, and the substrate tyrosine is provided as a SMILES string (OC1=CC=C(C[C@@H](C(O)=O)N)C=C1) in the substrate input box (Supplementary Fig. S3). After clicking the “Get Enzyme Property Prediction” button, the results are displayed on a new page. In addition, users are allowed to enter their email address to receive a notification when the results are ready, although the prediction typically takes only a few minutes. The new result page presents prediction outputs from all three AI tools. After clicking “View full results” under the UniKP section, users are directed to a dedicated UniKP results page (Supplementary Fig. S4). This page first displays the input information for the enzyme–substrate pair, followed by the predicted values for k_cat, K_m, and k_cat/K_m generated by this AI model. Below these predictions, the results are contextualized against the distribution of experimental data for each kinetic parameter from the OED.
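One simple way to place a predicted value within a distribution of experimental values is a percentile rank, sketched below. This is an illustrative assumption about how such contextualization could be done, not a description of how the OED results page computes it; the function name and the numbers are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(predicted: float, experimental: Sequence[float]) -> float:
    """Percentage of experimental values that are less than or equal to `predicted`."""
    values = sorted(experimental)
    if not values:
        raise ValueError("no experimental values to compare against")
    return 100.0 * bisect_right(values, predicted) / len(values)


# Hypothetical k_cat values (s^-1) spanning several orders of magnitude.
experimental_kcat = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
print(percentile_rank(5.0, experimental_kcat))
```

Because the rank depends only on ordering, it is unaffected by the orders-of-magnitude spread that is typical of kinetic parameters.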

Enzyme recommendation

To assist users in identifying potential enzymes capable of catalyzing a given substrate, we implemented three cheminformatics-based algorithms in the enzyme recommendation module, namely MCS, Tanimoto similarity, and fragment algorithm. Users begin by navigating to the enzyme recommendation page, where they can input a substrate of interest either by entering a compound name or SMILES string, or by intuitively drawing and uploading the molecular structure. Upon clicking the “Get Enzyme Recommendation” button, the system generates a results page summarizing outputs from all three algorithms. For both the MCS and Tanimoto similarity algorithms, scores are computed by comparing the query molecule against all compounds in the database and the top 10 compounds with the highest scores are returned. For the fragment-based algorithm, all compounds in the database that contain the query substrate as a fragment are returned.
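The top-10 selection used for the MCS and Tanimoto outputs amounts to ranking all database compounds by score and keeping the best few. A minimal sketch, assuming the scores have already been computed and stored in a dictionary keyed by compound name (the function name and the scores shown are hypothetical):

```python
import heapq
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (compound, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical similarity scores for a query substrate.
example_scores = {"compound A": 0.20, "compound B": 0.90, "compound C": 0.50}
print(top_k(example_scores, k=2))
```

`heapq.nlargest` avoids sorting the full score table when only a handful of hits is needed, which matters when the query is compared against every compound in the database.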

For example, when users aim to identify enzymes that can catalyze the substrate 7-methyltryptamine, those compounds exhibiting similar structural patterns are retrieved using the three cheminformatics algorithms in the OED. By clicking “View full results” in the MCS section, users are directed to a dedicated page displaying the 10 compounds with the highest MCS scores (Fig. 3). At the top of the web page, users can view basic information about the query substrate of interest, including its molecular structure. Below this, the 10 compounds (e.g. tryptamine and serotonin) are shown along with detailed MCS scores, ranked from highest to lowest. Users can also click the expand icon positioned to the left of each entry to access more information, such as the UniProt accession, EC number, enzyme sequence, enzyme catalytic rate and efficiency (k_cat, k_cat/K_m), and original literature. Moreover, the molecular structures of the output compounds are available for inspection, enabling users to directly assess the structural patterns shared between the input substrate 7-methyltryptamine and the retrieved compounds. For instance, the top two compounds, tryptamine and serotonin, contain an indole scaffold with an ethylamine side chain, similar to that of 7-methyltryptamine. Consequently, enzymes capable of catalyzing tryptamine or serotonin are likely to catalyze 7-methyltryptamine as well, with the corresponding interactive enzyme structures displayed on the page.

Figure 3. Enzyme recommendation based on the MCS algorithm. In this module, users input a substrate of interest, and the MCS score is calculated by comparing the input substrate with all compounds in the database. The top 10 compounds with the highest MCS scores are then displayed on the results page, along with the associated enzymes capable of catalyzing these compounds. These enzymes are proposed as promising candidates with a high likelihood of catalyzing the input substrate. Note that only five returned compounds are shown in the figure due to space constraints. Abbreviation: MCS, maximum common substructure.

Roadmap for developing the OED

As the name implies, the OED is dedicated to promoting open access and fostering community contributions, mirroring the Open Reaction Database initiative in the area of chemical reactions [19]. To develop a comprehensive OED, we have implemented an intuitive and user-friendly data submission interface, enabling researchers worldwide to contribute new enzyme data, provide feedback, and actively engage with the database. This web interface, which is available at https://openenzymedb.platform.moleculemaker.org/about/get-involved/, facilitates standardized data submission and supports the continuous growth and enrichment of the OED knowledgebase. To ensure the integrity, high quality, and interoperability of the database, all user-submitted entries will be subjected to automated validation checks designed to detect common errors and inconsistencies. When necessary, manual curation will be performed by expert reviewers with domain-specific knowledge.
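To illustrate what automated validation of a submitted entry might look like, the sketch below checks a few basic constraints. The field names (`compound`, `organism`, `ec_number`, `kcat`, `km`, `ph`), the rules, and the example values are all assumptions for illustration; the checks planned for the OED are not specified here.

```python
from typing import Any, Dict, List

VALID_EC_CLASSES = set("1234567")


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the entry passed."""
    problems: List[str] = []

    # Required text fields must be present and non-empty.
    for field in ("compound", "organism", "ec_number"):
        if not str(entry.get(field, "")).strip():
            problems.append(f"missing {field}")

    # EC numbers have four dot-separated levels; the first is a class from 1 to 7.
    ec_number = str(entry.get("ec_number", "")).strip()
    if ec_number:
        parts = ec_number.split(".")
        if len(parts) != 4 or parts[0] not in VALID_EC_CLASSES:
            problems.append("ec_number must have four levels and start with 1-7")

    # Kinetic values, when provided, must be positive numbers.
    for field in ("kcat", "km"):
        value = entry.get(field)
        if value is not None and not (isinstance(value, (int, float)) and value > 0):
            problems.append(f"{field} must be a positive number")

    # Assay pH, when provided, is expected on the conventional 0 to 14 scale.
    ph = entry.get("ph")
    if ph is not None and not (isinstance(ph, (int, float)) and 0 <= ph <= 14):
        problems.append("ph must be between 0 and 14")

    return problems


# Hypothetical submission with made-up kinetic values.
print(validate_entry({"compound": "L-proline", "organism": "Escherichia coli",
                      "ec_number": "1.5.5.2", "kcat": 12.0, "km": 0.4, "ph": 7.5}))
```

Returning every problem at once, rather than stopping at the first, lets a contributor correct a submission in a single pass.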

In parallel, we intend to establish a formal contributor recognition system to acknowledge and credit data providers and community curators. This initiative aims to incentivize data sharing and curation efforts, thereby fostering a sustainable and collaborative ecosystem. Furthermore, we will integrate standardized ontologies and metadata schemas that adhere to widely adopted community-driven standards, facilitating seamless data integration, interoperability, and long-term reusability. Looking ahead, we aspire to form a community advisory board composed of leading experts in AI, enzymology, synthetic biology, and related disciplines. This board will provide strategic guidance, oversee data policies, and help prioritize future development directions to ensure that the OED remains aligned with the evolving needs of the scientific community. Through these coordinated efforts, we aim to develop the OED as a trusted, dynamic, and openly accessible resource that accelerates enzyme research and drives scientific discovery.

Discussion and conclusion

Enzymes are indispensable biological catalysts that drive essential biochemical reactions and facilitate diverse applications in biotechnology, medicine, and sustainable chemistry [20]. In this work, we present the OED, a user-friendly and community-driven web database developed for the sharing and dissemination of enzyme data. Since its initial development, the database has undergone major updates every 2 months to ensure the timely incorporation of new data or feature improvements, and we are committed to maintaining this regular update cycle in future releases to support the continuous growth and practical utility of this resource. In particular, we are actively engaging with several world-leading researchers in biocatalysis and enzyme engineering to further expand and enrich the knowledgebase. To promote standardization, the OED provides curated enzyme information in a uniform and structured tabular format, allowing for streamlined access and seamless integration with other computational workflows. Specifically, each entry conforms to a consistent data schema, with clearly defined fields and standardized units, thereby improving data interoperability. For ML and AI applications, the OED provides preprocessed and standardized datasets that can be readily downloaded and directly applied to model development, significantly reducing the time and effort required for data preparation. Moreover, our platform integrates state-of-the-art AI tools for enzyme property prediction developed by the broader scientific community, which are not available in other existing enzyme databases. This empowers users, regardless of their programming experience, to obtain predictions of enzyme properties by simply inputting an enzyme sequence and substrate structure.

In addition, we implemented customized cheminformatics-based algorithms to enable the efficient identification of promising enzymes for given substrates. While these algorithms are useful on their own, users can also combine them with other features available in the OED for more comprehensive enzyme investigation. Specifically, when users aim to identify potential enzymes capable of catalyzing a given substrate using our cheminformatics-based methods, they can further click on the corresponding UniProt accession to access the UniProt page. By examining the detailed entries, users can view the associated enzymatic reactions, enabling them to make informed decisions based on chemical transformations between substrates and products. Furthermore, we have deployed various cutting-edge AI tools for enzyme property prediction in the OED, which can also aid in identifying potential enzymes. For instance, given a substrate and a set of candidate enzymes, these AI tools can predict the enzyme turnover numbers (k_cat values), helping users determine which enzyme is likely to exhibit a higher catalytic rate. In this way, these deep learning-based models in the OED are very useful for enzyme recommendation and can be adopted to guide the selection of enzymes for experimental studies.

To conclude, our primary objective is to build a solid foundation for accelerating enzyme research by offering standardized, high-quality, and readily accessible enzyme data to the public. Moving forward, the OED aims to broaden its scope to cover a wider range of enzyme-related information, including data on enzyme solubility, stability, substrate specificity, and function, ultimately establishing itself as a comprehensive resource for the scientific community. Given its newly curated datasets and features not available in other existing enzyme databases, we anticipate that the OED will accelerate biological research across both experimental and computational domains.

Supplementary Material

gkaf1082_Supplemental_File

gkaf1082_supplemental_file.pdf (2.2 MB, PDF)

Acknowledgements

We are deeply grateful to Prof. Frances Arnold (California Institute of Technology), Prof. Nicholas J. Turner (University of Manchester), Prof. Philip A. Romero (Duke University), and Prof. Zhongyue J. Yang (Vanderbilt University), as well as their team members Yueming Long, William Finnigan, and Xinchun Ran, for their valuable feedback that helped shape this project. We also acknowledge Chenxin Wang, who contributed to the software development.

Author contributions: Le Yuan (Conceptualization [lead], Data curation [lead], Formal analysis [lead], Investigation [lead], Methodology [lead], Project administration [equal], Software [equal], Supervision [supporting], Validation [lead], Visualization [lead], Writing – original draft [lead], Writing – review & editing [lead]), David M. Bianchi (Conceptualization [equal], Data curation [supporting], Formal analysis [supporting], Investigation [equal], Project administration [equal], Software [supporting], Visualization [supporting], Writing – review & editing [supporting]), Katherine Arneson (Formal analysis [equal], Investigation [supporting], Methodology [supporting], Software [equal], Visualization [equal], Writing – review & editing [supporting]), Bingji Guo (Investigation [supporting], Software [lead], Visualization [equal], Writing – review & editing [supporting]), Sara Lambert (Investigation [equal], Methodology [supporting], Resources [supporting], Software [supporting]), Chris Stephens (Data curation [supporting], Investigation [supporting], Software [equal]), Yash H. Wasnik (Data curation [supporting], Investigation [supporting], Software [supporting]), Christopher Pond (Investigation [supporting], Project administration [equal], Software [supporting], Supervision [supporting]), Matthew J. Berry (Methodology [supporting], Project administration [equal], Resources [supporting], Software [lead], Supervision [equal], Visualization [equal], Writing – original draft [supporting], Writing – review & editing [supporting]), and Huimin Zhao (Conceptualization [lead], Formal analysis [equal], Funding acquisition [lead], Investigation [equal], Project administration [lead], Resources [lead], Supervision [lead], Writing – review & editing [lead]).

Contributor Information

Le Yuan, Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

David M Bianchi, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Katherine Arneson, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Bingji Guo, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Sara Lambert, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Chris Stephens, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Yash H Wasnik, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Christopher Pond, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Matthew J Berry, NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Huimin Zhao, Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF Molecule Maker Lab Institute, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF iBioFoundry, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; NSF Global Center for Biofoundry Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

This work was supported by the U.S. National Science Foundation (NSF) (2019897, 2505932, DBI-2400058, and OISE-2435374 to H.Z.). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the NSF. Funding to pay the Open Access publication charges for this article was provided by NSF.

Data availability

The data underlying this article can be viewed, explored, and downloaded from the OED, which is freely available without any login requirements at https://openenzymedb.platform.moleculemaker.org/. The frontend source code can be found at https://doi.org/10.5281/zenodo.17244999.

References

1. Reisenbauer JC, Sicinski KM, Arnold FH. Catalyzing the future: recent advances in chemical synthesis using enzymes. Curr Opin Chem Biol. 2024;83:102536. 10.1016/j.cbpa.2024.102536.
2. El-Araby R. Biofuel production: exploring renewable energy solutions for a greener future. Biotechnol Biofuels Bioprod. 2024;17:129. 10.1186/s13068-024-02571-9.
3. Xu X, Lin X, Ma W et al. Biodegradation strategies of veterinary medicines in the environment: enzymatic degradation. Sci Total Environ. 2024;912:169598. 10.1016/j.scitotenv.2023.169598.
4. Yu T, Boob AG, Volk MJ et al. Machine learning-enabled retrobiosynthesis of molecules. Nat Catal. 2023;6:137–51. 10.1038/s41929-022-00909-w.
5. Yang J, Li F-Z, Arnold FH. Opportunities and challenges for machine learning-assisted enzyme engineering. ACS Cent Sci. 2024;10:226–41. 10.1021/acscentsci.3c01275.
6. Chang A, Jeske L, Ulbrich S et al. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 2021;49:D498–508. 10.1093/nar/gkaa1025.
7. Wittig U, Rey M, Weidemann A et al. SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res. 2018;46:D656–60. 10.1093/nar/gkx1065.
8. Cao Y, Jiang T, Girke T. A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics. 2008;24:i366–74. 10.1093/bioinformatics/btn186.
9. Yuan L, Tian Y, Ding S et al. PrecursorFinder: a customized biosynthetic precursor explorer. Bioinformatics. 2019;35:1603–4. 10.1093/bioinformatics/bty838.
10. Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7:20. 10.1186/s13321-015-0069-3.
11. Rego N, Koes D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics. 2015;31:1322–4. 10.1093/bioinformatics/btu829.
12. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. 10.1038/s41586-021-03819-2.
13. Varadi M, Bertoni D, Magana P et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024;52:D368–75. 10.1093/nar/gkad1011.
14. Wilkinson MD, Dumontier M, Aalbersberg IjJ et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:1–9. 10.1038/sdata.2016.18.
15. Apweiler R, Bairoch A, Wu CH et al. UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res. 2004;32:D115–9. 10.1093/nar/gkh131.
16. Li F, Yuan L, Lu H et al. Deep learning-based k_cat prediction enables improved enzyme-constrained model reconstruction. Nat Catal. 2022;5:662–72. 10.1038/s41929-022-00798-z.
17. Yu H, Deng H, He J et al. UniKP: a unified framework for the prediction of enzyme kinetic parameters. Nat Commun. 2023;14:8211. 10.1038/s41467-023-44113-1.
18. Boorla VS, Maranas CD. CatPred: a comprehensive framework for deep learning in vitro enzyme kinetic parameters. Nat Commun. 2025;16:2072. 10.1038/s41467-025-57215-9.
19. Kearnes SM, Maser MR, Wleklinski M et al. The Open Reaction Database. J Am Chem Soc. 2021;143:18820–6. 10.1021/jacs.1c09820.
20. Wang Y, Xue P, Cao M et al. Directed evolution: methodologies and applications. Chem Rev. 2021;121:12384–444. 10.1021/acs.chemrev.1c00260.

