Scientific Data. 2025 Aug 14;12:1427. doi: 10.1038/s41597-025-05753-8

Curated CYP450 Interaction Dataset: Covering the Majority of Phase I Drug Metabolism

Yu-Hao Ni 1, Yu-Wen Su 1, Shang-Chen Yang 2, Jia-Cheng Hong 1, Po-Wen Allen Du 1, Yu-Ting Hsu 1, Tien-Chueh Kuo 1,3, Yufeng Jane Tseng 1,3,4,5
PMCID: PMC12354672  PMID: 40813405

Abstract

We collected and organized a detailed dataset encompassing both substrates and non-substrates for six principal cytochrome P450 (CYP450) isozymes, responsible for 90% of Phase I drug metabolism in humans. These isozymes, specifically CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4, play critical roles in the detoxification and metabolic processing of therapeutic compounds. The dataset, meticulously assembled, includes interactions with approximately 2000 compounds per enzyme, ensuring comprehensive coverage and high accuracy. Employing a combination of conventional machine learning techniques alongside advanced methodologies such as Graph Convolutional Networks (GCN), robust models have been developed to elucidate these drug-enzyme interactions. The dataset is poised to significantly contribute to fields requiring pharmacokinetic modeling, furthering drug development efforts and toxicological studies by providing an essential resource for the accurate prediction of metabolic pathways, thereby enhancing drug safety and efficacy assessments.

Subject terms: Pharmaceutics, Machine learning

Background & Summary

Cytochrome P450 (CYP450) enzymes are integral to human detoxification processes, ubiquitously present across various organisms, with over fifty distinct CYP450 isozymes identified in the human body. Notably, isozymes such as CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4 play pivotal roles in metabolizing the most toxic compounds through Phase I oxidation processes, collectively responsible for the metabolism of approximately 90% of pharmaceuticals1–4. The strategic role of these enzymes in drug biotransformation underscores their importance in pharmacokinetic evaluations, affecting the bioavailability and therapeutic efficacy of medications. Accurately predicting the substrate interactions of CYP450 enzymes is therefore essential, aiding the early stages of drug development and expediting the ligand screening process.

Advancements in machine learning and computational chemistry have fostered the development of myriad in silico methodologies aimed at elucidating the interactions between CYP450 isozymes and drug molecules. Among these, ligand-based approaches are prevalent, utilizing the structural similarities between ligands and known active compounds. Quantitative structure-activity relationship (QSAR) models, which are among the most utilized ligand-based methods, effectively correlate molecular descriptors with biological activities, thus enabling the evaluation of compound properties based on structural characteristics5,6. While protein structure-based methods also hold promise, they are often impeded by the complexities involved in data collection, feature extraction, and the application of algorithms, particularly due to the intricate nature of large protein molecules and their high-resolution tertiary structures7,8.

The emergence of neural network technologies such as Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) has introduced novel capabilities in the realms of QSAR modeling and drug development. These technologies refine the interaction with molecular features, updating model parameters iteratively to enhance the prediction accuracy. Specifically, the Transformer-CNN model, which integrates self-attention with character-level convolutional networks, has shown efficacy in analyzing molecular sequences, such as those represented by the Simplified Molecular-Input Line-Entry System (SMILES). This model is particularly advantageous for small datasets, where it facilitates rapid model convergence9–11.

Despite the prevalence of traditional machine learning methods in the development of predictive models for CYP substrates, these models often utilize small, inconsistent datasets that are ill-suited for complex algorithms, leading to potential overfitting and extensive training durations. In contrast, deep learning approaches like GCN directly convert molecular structures into graphical representations, depicting atoms and bonds, thus providing a more nuanced understanding of molecular interactions conducive to drug development.

However, related substrates and non-substrates data are scattered across various databases and peer-reviewed literature. To address this issue, we curated an extensive dataset, consisting of substrates and non-substrates for six prominent CYP450 isozymes, which play a critical role in 90% of Phase I drug metabolism, incorporating up to 2000 compounds for each enzyme. We employed an advanced machine learning method, Graph Convolutional Networks (GCN). Additionally, we utilized sophisticated model optimization techniques, such as Bayesian optimization and SMILES enumeration, to develop robust CYP450 substrate classification models.

Our GCN-based models showcased superior performance, achieving Matthews correlation coefficients ranging from 0.51 (CYP2C19) to 0.72 (CYP1A2) when evaluated on external testing sets. This high level of accuracy underscores the robustness of our dataset. In our benchmark study (Tables 3 and 4), models trained on our curated dataset significantly outperformed those trained on the Cypstrate dataset, validating the effectiveness of our single-model GCN approach. This methodology not only enhances the prediction of CYP450-mediated metabolism but also supports compound screening in drug development, providing a reliable tool for pharmaceutical research and safety assessment.

Table 3.

GCN Performance Evaluation Across Different Datasets.

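Column groups: Collected = collected datasets; Enumerated = collected datasets (substrate enumerated); CYPstrate = datasets from CYPstrate [35].

| CYP Isoform | Collected: # of sub/non-sub | Collected: MCC | Collected: ACC | Collected: Sen/Spe | Enumerated: # folds of subs in trainsets | Enumerated: MCC | Enumerated: ACC | Enumerated: Sen/Spe | CYPstrate: # of sub/non-sub | CYPstrate: MCC | CYPstrate: ACC | CYPstrate: Sen/Spe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CYP1A2 | 492/1511 | 0.51 | 0.82 | 0.57/0.91 | 4 | 0.86 | 0.93 | 0.99/0.85 | 237/1142 | 0.50 | 0.87 | 0.53/0.94 |
| CYP2C9 | 384/1835 | 0.37 | 0.84 | 0.37/0.94 | 6 | 0.90 | 0.95 | 0.99/0.89 | 202/1175 | 0.39 | 0.86 | 0.41/0.94 |
| CYP2C19 | 355/1605 | 0.40 | 0.84 | 0.39/0.94 | 5 | 0.91 | 0.96 | 1.00/0.91 | 194/1184 | 0.45 | 0.88 | 0.39/0.97 |
| CYP2D6 | 521/1739 | 0.50 | 0.83 | 0.55/0.92 | 4 | 0.45 | 0.85 | 1.00/0.00 | 243/1140 | 0.57 | 0.88 | 0.59/0.95 |
| CYP2E1 | 277/1621 | 0.52 | 0.89 | 0.52/0.95 | 2 | 0.79 | 0.92 | 0.87/0.94 | 125/1244 | 0.47 | 0.92 | 0.44/0.97 |
| CYP3A4 | 1243/1379 | 0.60 | 0.80 | 0.80/0.81 | 4 | 0.85 | 0.95 | 1.00/0.78 | 416/991 | 0.67 | 0.86 | 0.76/0.91 |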

The following abbreviations are used: CYP Isoform - Cytochrome P450 enzyme isoform; # of sub/non-sub - Number of substrates and non-substrates; MCC - Matthews Correlation Coefficient; ACC - Accuracy; Sen/Spe - Sensitivity/Specificity; # folds of subs in trainsets - Number of folds used in training subsets.

Table 4.

GCN Performance on the Test Sets Across Different Datasets.

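Column groups: Collected = collected datasets; Cypstrate = datasets from Cypstrate [39].

| CYP Isoform | # of data | # of Sub/Nonsub | Collected: MCC | Collected: ACC | Collected: Sen | Collected: Spe | Cypstrate: MCC | Cypstrate: ACC | Cypstrate: Sen | Cypstrate: Spe |
|---|---|---|---|---|---|---|---|---|---|---|
| CYP1A2 | 345 | 59/286 | 0.72 | 0.92 | 0.77 | 0.96 | 0.65 | 0.91 | 0.68 | 0.95 |
| CYP2C9 | 342 | 48/294 | 0.55 | 0.90 | 0.44 | 0.98 | 0.52 | 0.90 | 0.46 | 0.97 |
| CYP2C19 | 345 | 47/298 | 0.50 | 0.89 | 0.48 | 0.96 | 0.46 | 0.89 | 0.42 | 0.96 |
| CYP2D6 | 344 | 60/284 | 0.62 | 0.94 | 0.64 | 0.95 | 0.59 | 0.89 | 0.52 | 0.97 |
| CYP2E1 | 338 | 28/330 | 0.57 | 0.94 | 0.57 | 0.97 | 0.42 | 0.92 | 0.37 | 0.97 |
| CYP3A4 | 352 | 104/248 | 0.71 | 0.88 | 0.76 | 0.93 | 0.66 | 0.86 | 0.68 | 0.94 |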

The following abbreviations are used: CYP Isoform - Cytochrome P450 enzyme isoform; # of data in set - Number of data points in the test set; # of Sub/Nonsub - Number of substrates and non-substrates in the dataset; MCC - Matthews Correlation Coefficient; ACC - Accuracy; Sen - Sensitivity; Spe - Specificity.

Our dataset can help improve models that predict whether a drug can act as a CYP450 substrate. It may also be integrated with other datasets that predict drug responses, potentially enhancing drug design and helping to avoid unnecessary adverse drug effects12.

Methods

Data Collection

The aim of this research was to develop a comprehensive dataset for modeling CYP450-mediated drug metabolism. To achieve this, compounds processed by the six key human CYP450 isoforms (CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4) were systematically collected from drug databases and peer-reviewed literature published prior to 2023.

Data were primarily extracted from three major databases:

  1. Drugbank13 (https://go.drugbank.com/): For each isoform, the search function was used to query substrates by entering the specific CYP450 isoform name into the search bar. Results listing compounds as substrates were carefully recorded.

  2. SuperCYP14 (https://insilico-cyp.charite.de/SuperCYPsPred/index.php?site=DrugDrugInteraction): This database was queried by selecting the ‘CYP-drug interaction’ option, followed by choosing the specific isoform and the substrate filter. This process was repeated for each isoform to ensure comprehensive data collection.

  3. Cytochrome P450 Knowledgebase15 (http://cpd.ibmh.msk.su/): Accessed by selecting the ‘complete’ dataset option, followed by filtering for ‘human’ and the specific isoform. Functional data were then viewed to list all substrates for the selected isoform.

In addition to database sources, data were also sourced from interaction tables published by Pharmacy Times (http://www.pharmacytimes.com/publications/issue/2008/2008-07/2008-07-8624), Indiana University (https://drug-interactions.medicine.iu.edu/), the ILD Care Foundation (http://ildcare.eu/Downloads/artseninfo/Drugs_metabolized_by_CYP450s.pdf), the FDA (https://www.fda.gov/Drugs/DevelopmentApprovalProcess/DevelopmentResources/DrugInteractionsLabeling/ucm093664.htm#table2-1), and Mayo Clinic Laboratories (https://www.mayocliniclabs.com/it-mmfiles/Pharmacogenomic_Associations_Tables.pdf). Each source was queried for substrates of each isoform, and the results were meticulously documented16–20.

The distribution of the compounds collected from each data source, including the number of substrates identified for each CYP450 isoform, is comprehensively summarized in Table 1. These additional data are used to support the three primary databases and serve as independent verification during the curation process.

Table 1.

Distribution of Data counts for Cytochrome P450 Substrates and Non-substrates Across Various Sources (Formatted as Substrates/Non-substrates).

| Dataset Source | CYP1A2 | CYP2C9 | CYP2C19 | CYP2D6 | CYP2E1 | CYP3A4 |
|---|---|---|---|---|---|---|
| DrugBank 5.1 [13] | 198/107 | 224/215 | 169/116 | 271/163 | 65/72 | 789/0768/158 |
| SuperCYP [14] | 19/10 | 16/10 | 16/10 | 32/8 | 14/7 | 33/6 |
| CYP Knowledgebase [15] | 53/40 | 46/61 | 52/44 | 74/70 | 81/38 | 210/54 |
| Mayo Clinic Laboratories [20] | 1/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| Zhang T-2012 [29] | 1/3 | 0/1 | 2/2 | 1/6 | 0/0 | 0/0 |
| Yamashita-2011 [30] | 105/400 | 36/440 | 56/439 | 80/400 | 84/525 | 121/231 |
| Yap-2005 [31] | 3/13 | 5/185 | 6/14 | 14/161 | 0/10 | 26/93 |
| Michielan-2009 [32] | 1/1 | 0/2 | 0/1 | 3/1 | 0/1 | 0/0 |
| Mishra, N.K.-2010 [33] | 0/1 | 1/0 | 0/1 | 0/1 | 0/0 | 1/0 |
| Siyang Tian-2018 [34] | 21/0 | 34/1 | 11/0 | 35/0 | 21/1 | 53/1 |
| Holmer-2021 [35] | 144/1220 | 72/1212 | 89/1270 | 71/1213 | 40/1277 | 134/1079 |
| Total | 546/1795 | 434/2127 | 401/1897 | 581/2023 | 305/1931 | 1343/1622 |

Data Curation

Different data sources may have various thresholds to define substrates, leading to possible biases in substrate classification. In addition, data may conflict between data sources. To address these issues, the data curation process was designed to ensure the dataset’s accuracy, consistency, and replicability, especially when integrating chemical compound data from multiple sources. Several steps were implemented throughout the verification process to preserve the overall integrity and reliability of the dataset. The detailed steps were as follows:

  1. Compound Identifier Verification: Ensuring the accuracy and consistency of chemical compound identifiers is a crucial step in dataset curation, particularly when integrating data from multiple sources. This study systematically verified each compound’s PubChem Compound Identifier (CID) to confirm its existence, classification, and consistency across multiple databases. The verification process involved several key steps to maintain the integrity of the dataset.
    • PubChem CID Retrieval: To retrieve its CID, each compound was first queried against PubChem, one of the largest publicly available chemical databases. This step ensured that every compound in the dataset had a valid and unique identifier within PubChem.
    • Cross-Referencing with CYP450 Interaction Data: Compounds with conflicting classifications were subjected to a cross-verification process to further ensure accuracy. Some compounds were labeled differently across databases: certain sources identified them as substrates, while others classified them as inhibitors. To resolve such discrepancies, the data were cross-referenced using interaction tables from authoritative sources, including the FDA Drug Metabolism Database, the Mayo Clinic Pharmacogenomics Database, and the Indiana University CYP450 Drug Interaction Table17–20. Only compounds with consistent CYP450 interaction classifications across at least two independent sources were retained in the final dataset.
    • Removal of Unverified or Inconsistent Compounds: Compounds that could not be reliably verified were systematically excluded from the dataset to maintain data integrity and ensure the accuracy of subsequent analyses. The exclusion criteria included compounds that lacked a valid PubChem CID, as their identity could not be confirmed within a recognized chemical database. Compounds without confirmed CYP450 interaction data from at least two independent sources were removed, as their metabolic role remained uncertain. Furthermore, compounds that exhibited contradictory classifications across multiple databases (for instance, those labeled simultaneously as a substrate and an inhibitor) were excluded unless supporting literature provided conclusive evidence to resolve the discrepancy. By implementing this rigorous compound verification process, we ensured that only well-characterized and properly classified compounds were retained in the final dataset. These curated data were then used to construct predictive models for CYP450 substrate classification, thereby enhancing the reliability and reproducibility of the study. This verification approach also improves dataset quality and facilitates reproducibility in future CYP450-related research, so that researchers can use the dataset for machine learning applications, cheminformatics analyses, and drug metabolism investigations.
  2. Interaction Table Cross-Verification: A rigorous cross-verification process was conducted using multiple authoritative sources to ensure the accuracy of compound classifications. Compounds with narrow therapeutic indices or significant dose-dependent side effects were carefully examined to confirm their designation as substrates or inhibitors, as errors in classification could lead to misleading conclusions in downstream predictive models. Given that some compounds exhibit dual roles as both inhibitors and substrates depending on concentration and metabolic conditions, this step was particularly crucial in refining the dataset. To validate the classification of each compound, data were systematically cross-referenced across multiple independent sources, including Pharmacist’s Letter, the FDA Drug Metabolism Database, and the Indiana University CYP450 Drug Interaction Table. Compounds were also checked against the ILD Care Foundation CYP450 Interaction Database and relevant peer-reviewed literature to ensure consistency and eliminate ambiguities. When discrepancies arose between sources, literature evidence was prioritized to determine the most biologically relevant classification.

    This multi-tiered validation process helped to prevent misclassification errors, thereby enhancing the credibility and robustness of the dataset. By ensuring that compounds were accurately labeled as substrates, inhibitors, or non-substrates, we significantly improved the reliability of the dataset for machine learning models and cheminformatics analyses. The resulting curated dataset provides a scientifically rigorous foundation for CYP450 interaction prediction, facilitating reproducibility and confident application in drug metabolism research.

  3. Steroid Classification: Steroids were systematically reviewed for their interactions with CYP450 enzymes, given their biological activity and significant metabolic effects. These compounds often exhibit complex metabolic pathways, necessitating a rigorous classification approach to ensure accurate labeling in the dataset. The classification process involved cross-checking functional data from DrugBank and FDA records to verify the metabolic role of each steroid. Active steroid metabolites were compared across multiple sources to prevent misclassification, ensuring that their CYP450 interactions were consistently reported. Steroids were labeled as substrates only if at least two independent sources confirmed their classification, while those listed as inhibitors or inducers were categorized as non-substrates, as summarized in Table 2. This stringent classification method ensured that only credibly labeled steroids were included in the final dataset, enhancing its reliability for downstream machine learning models and cheminformatics analyses.

  4. Parent Compound Identification: For compounds with multiple constituents, such as drug salts, prodrugs, or metabolites, the biologically active parent compound was systematically identified to ensure that the dataset accurately reflects real-world metabolic behavior. Since prodrugs require enzymatic activation and some drug salts exhibit different solubility or bioavailability properties, identifying the most pharmacologically relevant form was essential for accurate CYP450 metabolism modeling. To achieve this, we conducted a structured evaluation based on multiple data sources as follows:
    • Drug Label Analysis: Using records from FDA-approved drug labels and DrugBank, we identified the form of each compound approved for clinical use. The active moiety–as specified in regulatory documents–was selected if multiple formulations existed. This was cross-referenced with the FDA Drug Metabolism Database to confirm CYP450 interactions.
    • Pharmacokinetic Literature Review: Published studies on drug metabolism, clearance rates, and bioavailability were examined to determine whether a compound was primarily metabolized into a more active or inactive form. In cases where a prodrug had minimal pharmacological activity but was converted to an active metabolite, the metabolite was retained as the parent compound. Studies from PubMed, Drug Metabolism and Disposition, and clinical pharmacokinetics references were used for this validation.
    • Cross-Referencing Metabolic Pathways: Metabolic pathways were validated using CYP450 pathway models and enzyme-substrate databases to ensure consistency. If conflicting pathways were reported, the compound was prioritized based on the most frequently cited metabolic route across DrugBank, CYP450 Knowledgebase, and SuperCYP. When available, the DeepChem GCN-based metabolic pathway predictions were also utilized as an additional validation step.
  5. SMILES Standardization and Discarding Non-Verifiable Compounds: SMILES notations for all compounds were standardized using RDKit to maintain structural consistency across the dataset. This step ensured that molecular structures were consistently represented, reducing errors caused by variations in input formats. During this process, compounds with ambiguous stereochemistry were normalized to ensure uniformity in structure representation. If multiple stereochemical configurations were possible, the most prevalent or biologically relevant form was retained based on DrugBank and FDA metabolic records. Additionally, entries with missing or contradicting CYP450 interaction data across different sources were systematically reviewed. Compounds that lacked supporting documentation in at least two independent sources (e.g., DrugBank, SuperCYP, Cytochrome P450 Knowledgebase) were excluded. Similarly, if a compound was categorized as a substrate and an inhibitor without supporting literature to resolve the discrepancy, it was removed from the dataset. The dataset maintains high integrity and scientific validity by applying these SMILES standardization and data verification criteria, ensuring its reliability for subsequent cheminformatics analyses and machine learning-based CYP450 interaction modeling21.
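The retention rule that recurs in the steps above (keep a compound only when at least two independent sources agree on its role, and drop it when sources conflict) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' curation script: the function name `curate`, the record layout, and the numeric identifiers are hypothetical, and the manual literature review used to resolve conflicts is not modeled.

```python
from collections import defaultdict


def curate(records, min_sources=2):
    """Keep compounds whose role is confirmed by enough independent sources.

    records: iterable of (compound_id, source_name, role) tuples.
    Returns {compound_id: role} for compounds that (a) carry exactly one
    role across all sources and (b) have that role reported by at least
    `min_sources` distinct sources. Conflicting compounds are dropped.
    """
    sources_by_role = defaultdict(lambda: defaultdict(set))
    for compound_id, source, role in records:
        sources_by_role[compound_id][role].add(source)

    kept = {}
    for compound_id, roles in sources_by_role.items():
        if len(roles) != 1:
            continue  # contradictory classifications, e.g. substrate vs inhibitor
        role, sources = next(iter(roles.items()))
        if len(sources) >= min_sources:
            kept[compound_id] = role
    return kept


# Hypothetical records; the identifiers are placeholders, not real PubChem CIDs.
example = [
    (1001, "DrugBank", "substrate"),
    (1001, "SuperCYP", "substrate"),   # two agreeing sources: kept
    (1002, "DrugBank", "substrate"),
    (1002, "FDA", "inhibitor"),        # conflict: dropped
    (1003, "DrugBank", "substrate"),   # single source: dropped
]
print(curate(example))
```

Counting distinct source names rather than raw records prevents a compound listed twice in the same database from passing as independently confirmed.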

Table 2.

Systematic classification of steroids and their interaction with CYP450 enzymes.

| Steroids | Steroid Name | Source | Drug Name | Dosage | CYP Interactions |
|---|---|---|---|---|---|
| Glucocorticoids | Prednisolone | FDA label (NDA 21-959/S-004) | Orapred ODT | 10 mg to 60 mg | CYP3A4 substrate |
| | | FDA label (Reference ID: 2960745) | Flo-Pred | 5 mg to 60 mg | CYP3A4 substrate |
| | | FDA label (Reference ID: 3165107) | RAYOS (prednisone) | 5 mg | CYP3A4 substrate |
| | Betamethasone | FDA label (Reference ID: 4217397) | CELESTONE SOLUSPAN | | CYP3A4 substrate |
| | | Tamihida Matsunaga et al. 2012 | | | |
| | | Petra Matouková et al. 2014 | | | |
| | Dexamethasone | FDA label (Reference ID: 4500185) | HEMADY | 20 mg or 40 mg | CYP3A4 substrate |
| | | FDA label (NDA 11-664/S-062) | DECADRON (DEXAMETHASONE TABLETS) | | CYP3A4 substrate |
| | | Yi Ling Lee et al. 2012 | | | |
| | | Raucy JL et al. 2002 | | | |
| | | Radim Vrzal et al. 2008 | | | |
| | Hydrocortisone | FDA (Reference ID: 4678054) | ALKINDI® SPRINKLE (hydrocortisone) oral granules | 0.5 mg or 1 mg | CYP3A4 substrate |
| | | Wafaa El-Sankary et al. 2002 | | | |
| | | Matouková P et al. 2014 | | | |
| | Methylprednisolone | FDA label (Reference ID: 3032293) | SOLU-MEDROL | | CYP3A4 substrate |
| | | FDA label (Reference ID: 3982392) | DEPO-MEDROL | 20 mg/mL, 40 mg/mL, 80 mg/mL | CYP3A4 substrate |
| | | Usui T et al. 2003 | | | |
| | | Matouková P et al. 2014 | | | |
| | Deflazacort | FDA label (Reference ID: 4053971) | EMFLAZA | 0.9 mg/day | CYP3A4 substrate |
| Mineralocorticoid | Fludrocortisone | Omodunho Ogbu, PharmD 2019 | | | CYP3A4 substrate |
| Other Steroids | Testosterone | Sylvie K Kandiel et al. 2017 | | | CYP3A4 substrate |
| | | Yamazaki H et al. 1997 | | | CYP2C19 substrate |
| | | Yamazaki H et al. 1997 | | | CYP2C19 substrate |
| | Cholic acid | M Paolini et al. 1999 | | | CYP1A2 inhibitor |
| | Lanosterol | D Rozman et al. 1996 | | | |
| | Progesterone | FDA label (Reference ID: 3037212) | PROMETRIUM (progesterone capsules) | 100 mg per day to 300 mg | CYP3A4 substrate |
| | | FDA label | Endometrin (progesterone Vaginal Insert) | 200-300 mg | CYP2C19 substrate |
| | Medrogestone | H P E Tugster et al. 1993 | | | CYP2C19 substrate |
| | β-Sitosterol | Binkowska M et al. 2014 | | | CYP2B6 inhibitor |
| | Cholesterol | Vijayakumar TM et al. 2014 | | | CYP2B6 inhibitor |
| | 5α-cholestanol | William J Griffiths et al. 2019 | | | |

This table simplifies the identification of steroids based on their role as substrates, inducers, or inhibitors of specific CYP450 isoforms.

We established a high-quality dataset for each of the six major CYP450 enzymes through these rigorous data collection and curation processes. These curated datasets were used in QSAR and deep learning-based GCN models to predict CYP450 substrate interactions. The detailed curation process ensures that the dataset is accurate, reproducible, and reliable for future research.

Data Records

The curated dataset of CYP450 interactions is available as open access on the Figshare online repository22. The distribution of chemical compounds from each data source is summarized in Table 1. This dataset includes training and testing sets for classifying substrates and non-substrates of six essential CYP450 enzymes.

Each CSV file contains four columns:

  1. Chemical name

  2. SMILES notation

  3. Labels (where 1 indicates a substrate of CYP450 enzymes, and 0 indicates a non-substrate)

  4. Data sources

Additionally, PubChem Fingerprints for each chemical are provided in CSV format and stored in the PubChem Fingerprint folder within the Figshare repository.

The curated CYP450 dataset consists of 12 comma-separated values (CSV) files, divided into training and testing sets for CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4. The training and testing sets have already been pre-separated for five-fold cross-validation.

We recommend that researchers perform data augmentation before model training. For other cross-validation methods, researchers can merge the training and testing sets into a single file (e.g., combining CYP1A2_trainingset.csv and CYP1A2_testingset.csv) and re-split them as needed.
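For readers who want a different cross-validation scheme, the merge-and-re-split step described above might look like the following sketch, which uses only the Python standard library. The file names come from the text; the column header `Label` and the helper names `read_rows` and `stratified_folds` are assumptions, so the actual CSV headers should be checked before use.

```python
import csv
import random
from collections import defaultdict


def read_rows(path):
    """Read one CSV file into a list of dicts keyed by column header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def stratified_folds(rows, k=5, label_key="Label", seed=0):
    """Split rows into k folds, preserving the substrate/non-substrate ratio.

    Rows are grouped by label, shuffled within each group, and dealt
    round-robin into the folds so every fold receives a near-equal share
    of each class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    folds = [[] for _ in range(k)]
    for group in by_label.values():
        rng.shuffle(group)
        for index, row in enumerate(group):
            folds[index % k].append(row)
    return folds


# Usage (assumes the files are in the working directory and that the
# label column is named "Label"; adjust label_key if it is not):
# merged = read_rows("CYP1A2_trainingset.csv") + read_rows("CYP1A2_testingset.csv")
# folds = stratified_folds(merged, k=10, seed=42)
```

Stratifying by label matters here because substrates are the minority class for most isoforms, and an unstratified split can leave a fold with very few positives.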

Technical Validation

Implementation and Enhancement of Computational Models for CYP450 Substrate Classification

We deployed Graph Convolutional Networks (GCN) as the main component of the CYP450 substrate classification model to validate our curated dataset. The GCN models were developed using the GraphConvModel from the DeepChem library (version 2.0), a leading toolkit in the realm of chemistry-focused deep learning. This model inherently processes SMILES representations of molecules by leveraging the ConvMolFeaturizer, which automatically computes and analyzes a vector of local descriptors for each atom. This approach eliminates the need for manual descriptor selection and provides a robust molecular representation tailored for graph-based learning.

Strategic Data Augmentation and Validation

Acknowledging the prevalent issue of data imbalance between substrates and non-substrates, we adopted the SMILES enumeration technique as described by Bjerrum et al.23. This technique significantly enhances the dataset by generating multiple SMILES representations for each compound, effectively amplifying the presence of substrates and achieving a more balanced dataset. The performance of the GCN models on these augmented datasets was meticulously evaluated through a robust five-fold cross-validation protocol and further tested on external datasets.
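The balancing idea can be illustrated with a short Python sketch in which only substrate rows receive extra SMILES variants and every variant inherits the label of its parent compound. Everything named here is an assumption for illustration: `augment_substrates`, the column names `SMILES` and `Label`, and the toy lookup enumerator. A real enumerator would generate randomized SMILES with a cheminformatics toolkit; the lookup table below simply lists equivalent ways of writing ethanol.

```python
# Toy stand-in for a SMILES enumerator: alternative, equivalent SMILES
# strings for ethanol. A real implementation would generate these.
TOY_VARIANTS = {"CCO": ["OCC", "C(C)O"]}


def toy_enumerate(smiles, n_variants):
    """Return up to n_variants alternative SMILES for a known toy molecule."""
    return TOY_VARIANTS.get(smiles, [])[:n_variants]


def augment_substrates(rows, enumerate_fn, n_variants,
                       smiles_key="SMILES", label_key="Label"):
    """Add enumerated SMILES for substrate rows only (label == 1).

    Non-substrates pass through unchanged; duplicate variants are skipped,
    and each new row copies all other fields from its parent row.
    """
    augmented = []
    for row in rows:
        augmented.append(row)
        if str(row[label_key]) != "1":
            continue
        seen = {row[smiles_key]}
        for variant in enumerate_fn(row[smiles_key], n_variants):
            if variant not in seen:
                seen.add(variant)
                augmented.append({**row, smiles_key: variant})
    return augmented


rows = [
    {"SMILES": "CCO", "Label": "1"},        # treated as a substrate here
    {"SMILES": "c1ccccc1", "Label": "0"},   # treated as a non-substrate here
]
balanced = augment_substrates(rows, toy_enumerate, n_variants=2)
print(len(balanced))
```

One design point worth checking in any such pipeline is that variants of the same compound stay on the same side of a train/validation split, since otherwise the validation score is measured on molecules the model has already seen.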

Comprehensive Data Analysis and Hyperparameter Optimization

For each CYP450 isoform, the entire dataset collected during the initial data gathering phase was utilized for training. We employed k-fold cross-validation as a methodical approach to validate the generalization capacity of our models across independent datasets. This method involved partitioning the data into multiple groups, which were cyclically used as both training and validation sets. The averaged results from these cycles provided a reliable basis for hyperparameter optimization, which was conducted using Bayesian optimization techniques facilitated by the Hyperopt Python library24,25.

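Hyperparameter tuning was guided by the Tree-structured Parzen Estimator (TPE) from the Hyperopt package, an approach that models the conditional probability p(x|y) rather than p(y|x), allowing for more nuanced optimization based on prior outcomes26,27:

$$p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^{*} \\ g(x) & \text{if } y \ge y^{*} \end{cases} \tag{1}$$

Here x denotes a hyperparameter configuration, y the corresponding loss, and y* a threshold on the observed losses; l(x) is the density estimated from configurations whose loss fell below y*, and g(x) is the density estimated from the remaining configurations.

The optimization criterion, expected improvement (EI), is defined as:

$$\mathrm{EI}_{y^{*}}(x) = \int_{-\infty}^{\infty} \max(y^{*} - y,\, 0)\, p(y \mid x)\, dy \tag{2}$$

Under the TPE model, EI increases monotonically with the ratio l(x)/g(x), so candidate configurations are selected to maximize that ratio.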
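The connection between the expected improvement and the two densities l(x) and g(x) can be made explicit with Bayes' rule. The derivation below is a sketch of the standard TPE argument; the symbol γ = p(y < y*), the fraction of observations counted as "good", is introduced here for illustration and does not appear in the text above.

```latex
\begin{aligned}
p(x) &= \int p(x \mid y)\, p(y)\, dy
      = \gamma\, l(x) + (1-\gamma)\, g(x),
      \qquad \gamma = p(y < y^{*}), \\[4pt]
\mathrm{EI}_{y^{*}}(x)
     &= \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy
      = \frac{l(x) \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y)\, dy}
             {\gamma\, l(x) + (1-\gamma)\, g(x)} \\[4pt]
     &\propto \left( \gamma + \frac{g(x)}{l(x)}\,(1-\gamma) \right)^{-1}.
\end{aligned}
```

The remaining integral does not depend on x, so the expected improvement is largest where l(x) is high and g(x) is low, which is why candidates are ranked by l(x)/g(x).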

Performance Evaluation and Final Remarks

Our model’s effectiveness is demonstrated through several key performance metrics, which are essential for evaluating the predictive accuracy and reliability across different scenarios:

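$$\text{Accuracy} = \frac{TP + TN}{P + N} \tag{3}$$

$$\text{Sensitivity (TPR)} = \frac{TP}{TP + FN} \tag{4}$$

$$\text{Specificity (TNR)} = \frac{TN}{TN + FP} \tag{5}$$

$$\text{Matthews correlation coefficient (MCC)} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{6}$$

Here TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, and P and N are the total numbers of actual positives (substrates) and negatives (non-substrates).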

These metrics, particularly the Matthews Correlation Coefficient (MCC), underscore the balanced accuracy of our models, as detailed in Supporting information (see Table S1).
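As a worked illustration of Equations 3 to 6, the following Python sketch computes the four metrics from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the reported results; specificity is computed as the true negative rate, TN/(TN+FP).

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and MCC from counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for an imbalanced test set (50 positives, 50 negatives
# would be balanced; any counts can be substituted here).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, so on imbalanced data it stays low when a model simply predicts the majority class, even though accuracy can remain high in that case.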

To validate our comprehensive datasets and further substantiate their advantages, we employed several distinct methodologies. We initially trained the Graph Convolutional Network (GCN) model using two different training sets: one derived from our curated datasets and the other from the CYPstrate models. We evaluated the models’ performance through five-fold cross-validation and on external testing sets, as elaborated in Table 3 for cross-validation and Table 4 for external tests.

Additionally, we enhanced the robustness of our datasets by implementing SMILES enumeration on the substrates. This method was aimed at increasing the diversity and representativeness of the chemical structures, ensuring adaptability across various chemical contexts.

Our investigations clearly demonstrate the superior performance of the models trained on our datasets, especially notable in the CYP1A2 and CYP3A4 isoforms. These models have consistently exhibited exceptional performance metrics, significantly surpassing those trained with the CYPstrate datasets. The rigorous validation through model training and dataset evaluation has firmly established the effectiveness and reliability of our datasets in advancing the predictive modeling of CYP450-mediated drug metabolism.

Acknowledgements

This work was financially supported by the ‘Center for Advanced Computing and Imaging in Biomedicine (NTU-113L900703)’ and the ‘Center of Precision Medicine’ from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and the National Science and Technology Council (NSTC 111-2320-B-002-043-MY2). We thank the Computational Molecular Design and Metabolomics Laboratory, Department of Computer Science and Information Engineering, and National Taiwan University for their resources.

Author contributions

Y.J.T. conceived the project. Y.H.N. designed the method and implemented the classification model. S.C.Y. and Y.W.S. collected the data. Y.H.N. wrote the manuscript. J.C.H., P.W.A.T., and Y.T.H. refined the data inclusion criteria and calibrated the dataset. J.C.H., P.W.A.T., Y.T.H., T.C.K. and Y.J.T. edited the manuscript. All authors have reviewed and approved the final version of the manuscript.

Code availability

The in-house Python scripts demonstrating the machine learning and deep learning models are available on GitHub and archived on Zenodo28: https://zenodo.org/doi/10.5281/zenodo.13364709.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-025-05753-8.

References

  • 1. Furge, L. L. & Guengerich, F. P. Cytochrome P450 enzymes in drug metabolism and chemical toxicology: An introduction. Biochemistry and Molecular Biology Education 34, 66–74 (2006).
  • 2. Nebert, D. W. & Russell, D. W. Clinical importance of the cytochromes P450. The Lancet 360, 1155–1162 (2002).
  • 3. Lynch, T. & Price, A. The effect of cytochrome P450 metabolism on drug response, interactions, and adverse effects. American family physician 76, 391–396 (2007).
  • 4. Zanger, U. M. & Schwab, M. Cytochrome P450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacology & therapeutics 138, 103–141 (2013).
  • 5. Cherkasov, A. et al. QSAR modeling: where have you been? where are you going to? Journal of medicinal chemistry 57, 4977–5010 (2014).
  • 6. Shaikh, N., Sharma, M. & Garg, P. Selective fusion of heterogeneous classifiers for predicting substrates of membrane transporters. Journal of Chemical Information and Modeling 57, 594–607 (2017).
  • 7. Lewis, D. F. & Ito, Y. Human CYPs involved in drug metabolism: structures, substrates and binding affinities. Expert Opinion on Drug Metabolism & Toxicology 6, 661–674 (2010).
  • 8. Kesharwani, S. S., Nandekar, P. P., Pragyan, P., Rathod, V. & Sangamwar, A. T. Characterization of differences in substrate specificity among CYP1A1, CYP1A2 and CYP1B1: an integrated approach employing molecular docking and molecular dynamics simulations. Journal of Molecular Recognition 29, 370–390 (2016).
  • 9. Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • 10. Zhang, X., Zhao, J. & LeCun, Y. Character-level convolutional networks for text classification. Advances in neural information processing systems 28 (2015).
  • 11. Karpov, P., Godin, G. & Tetko, I. V. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. Journal of cheminformatics 12, 1–12 (2020).
  • 12. Tatonetti, N. P., Ye, P.-P., Daneshjou, R. & Altman, R. B. Science Translational Medicine 4, 125ra31 (2012).
  • 13. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research 46, D1074–D1082 (2018).
  • 14. Banerjee, P., Dunkel, M., Kemmler, E. & Preissner, R. SuperCYPsPred - a web server for the prediction of cytochrome activity. Nucleic acids research 48, W580–W585 (2020).
  • 15. Cytochrome P450 Knowledgebase. Cytochrome P450 knowledgebase. http://cpd.ibmh.msk.su/ (2006).
  • 16. Horn, J. R. & Hansten, P. D. Get to know an enzyme: CYP2D6. Pharmacy Times (2008). Retrieved from http://www.pharmacytimes.com/publications/issue/2008/2008-07/2008-07-8624.
  • 17. Flockhart, D. Drug interactions: Cytochrome P450 drug interaction table. Indiana University School of Medicine https://drug-interactions.medicine.iu.edu (2007).
  • 18. Genelex Corporation. Physician guidelines: Drugs metabolized by cytochrome P450’s. http://ildcare.eu/Downloads/artseninfo/Drugs_metabolized_by_CYP450s.pdf (2005).
  • 19. US Food and Drug Administration. Drug development and drug interactions: table of substrates, inhibitors and inducers. https://www.fda.gov/Drugs/DevelopmentApprovalProcess/DevelopmentResources/DrugInteractionsLabeling/ucm093664.htm#table2-1 (2020).
  • 20. Mayo Clinic Laboratories. Pharmacogenomic associations tables. https://www.mayocliniclabs.com/it-mmfiles/Pharmacogenomic_Associations_Tables.pdf (2019).
  • 21. Landrum, G. RDKit: Open-source cheminformatics. https://www.rdkit.org (2006).
  • 22. Ni, Y.-H. et al. Comprehensively-Curated Dataset of CYP450 Interactions: Enhancing Predictive Models for Drug Metabolism (2024). https://figshare.com/articles/dataset/Comprehensively-Curated_Dataset_of_CYP450_Interactions_Enhancing_Predictive_Models_for_Drug_Metabolism/26630515.
  • 23. Bjerrum, E. J. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 (2017).
  • 24. Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, 115–123 (PMLR, 2013).
  • 25. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, 2825 (2011).
  • 26. Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24 (2011).
  • 27. Chen, J.-H. & Tseng, Y. J. A general optimization protocol for molecular property prediction using a deep learning network. Briefings in Bioinformatics 23, bbab367 (2022).
  • 28. Ni, Y.-H. et al. Cmdm-lab/cyp450: v1.0. Zenodo 10.5281/zenodo.13364709 (2024).
  • 29. Zhang, T., Dai, H., Liu, L. A., Lewis, D. F. & Wei, D. Classification models for predicting cytochrome P450 enzyme-substrate selectivity. Molecular informatics 31, 53–62 (2012).
  • 30. Yamashita, F., Feng, C., Yoshida, S., Itoh, T. & Hashida, M. Automated information extraction and structure-activity relationship analysis of cytochrome P450 substrates. Journal of chemical information and modeling 51, 378–385 (2011).
  • 31. Yap, C. W. & Chen, Y. Z. Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. Journal of Chemical Information and Modeling 45, 982–992 (2005).
  • 32. Michielan, L., Terfloth, L., Gasteiger, J. & Moro, S. Comparison of multilabel and single-label classification applied to the prediction of the isoform specificity of cytochrome P450 substrates. Journal of Chemical Information and Modeling 49, 2588–2605 (2009).
  • 33. Mishra, N. K., Agarwal, S. & Raghava, G. P. Prediction of cytochrome P450 isoform responsible for metabolizing a drug molecule. BMC Pharmacology 10, 8 (2010).
  • 34. Tian, S., Djoumbou-Feunang, Y., Greiner, R. & Wishart, D. S. CypReact: a software tool for in silico reactant prediction for human cytochrome P450 enzymes. Journal of chemical information and modeling 58, 1282–1291 (2018).
  • 35. Holmer, M., de Bruyn Kops, C., Stork, C. & Kirchmair, J. CYPstrate: a set of machine learning models for the accurate classification of cytochrome P450 enzyme substrates and non-substrates. Molecules 26, 4678 (2021).

Data Availability Statement

The in-house Python scripts that demonstrate the machine learning and deep learning models are available on GitHub and archived on Zenodo (ref. 28): https://zenodo.org/doi/10.5281/zenodo.13364709.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group
