Curated CYP450 Interaction Dataset: Covering the Majority of Phase I Drug Metabolism

Yu-Hao Ni; Yu-Wen Su; Shang-Chen Yang; Jia-Cheng Hong; Po-Wen Allen Du; Yu-Ting Hsu; Tien-Chueh Kuo; Yufeng Jane Tseng

doi:10.1038/s41597-025-05753-8

2025 Aug 14;12:1427. doi: 10.1038/s41597-025-05753-8

PMCID: PMC12354672 PMID: 40813405

Abstract

We collected and organized a detailed dataset encompassing both substrates and non-substrates for six principal cytochrome P450 (CYP450) isozymes, responsible for 90% of Phase I drug metabolism in humans. These isozymes, specifically CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4, play critical roles in the detoxification and metabolic processing of therapeutic compounds. The dataset, meticulously assembled, includes interactions with approximately 2000 compounds per enzyme, ensuring comprehensive coverage and high accuracy. Employing a combination of conventional machine learning techniques alongside advanced methodologies such as Graph Convolutional Networks (GCN), robust models have been developed to elucidate these drug-enzyme interactions. The dataset is poised to significantly contribute to fields requiring pharmacokinetic modeling, furthering drug development efforts and toxicological studies by providing an essential resource for the accurate prediction of metabolic pathways, thereby enhancing drug safety and efficacy assessments.

Subject terms: Pharmaceutics, Machine learning

Background & Summary

Cytochrome P450 (CYP450) enzymes are integral to human detoxification processes, ubiquitously present across various organisms, with over fifty distinct CYP450 isozymes identified in the human body. Notably, isozymes such as CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4 play pivotal roles in metabolizing the most toxic compounds through Phase I oxidation processes, collectively responsible for the metabolism of approximately 90% of pharmaceuticals^1–4. The strategic role of these enzymes in drug biotransformation underscores their importance in pharmacokinetic evaluations, affecting the bioavailability and therapeutic efficacy of medications. Accurately predicting the substrate interactions of CYP450 enzymes is therefore essential, aiding the early stages of drug development and expediting the ligand screening process.

Advancements in machine learning and computational chemistry have fostered the development of myriad in silico methodologies aimed at elucidating the interactions between CYP450 isozymes and drug molecules. Among these, ligand-based approaches are prevalent, utilizing the structural similarities between ligands and known active compounds. Quantitative structure-activity relationship (QSAR) models, which are among the most utilized ligand-based methods, effectively correlate molecular descriptors with biological activities, thus enabling the evaluation of compound properties based on structural characteristics^5,6. While protein structure-based methods also hold promise, they are often impeded by the complexities involved in data collection, feature extraction, and the application of algorithms, particularly due to the intricate nature of large protein molecules and their high-resolution tertiary structures^7,8.

The emergence of neural network technologies such as Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) has introduced novel capabilities in the realms of QSAR modeling and drug development. These technologies refine the interaction with molecular features, updating model parameters iteratively to enhance prediction accuracy. Specifically, the Transformer-CNN model, which integrates self-attention with character-level convolutional networks, has shown efficacy in analyzing molecular sequences, such as those represented by the Simplified Molecular-Input Line-Entry System (SMILES). This model is particularly advantageous for small datasets, where it facilitates rapid model convergence^9–11.

Despite the prevalence of traditional machine learning methods in the development of predictive models for CYP substrates, these models often utilize small, inconsistent datasets that are ill-suited for complex algorithms, leading to potential overfitting and extensive training durations. In contrast, deep learning approaches like GCN directly convert molecular structures into graphical representations, depicting atoms and bonds, thus providing a more nuanced understanding of molecular interactions conducive to drug development.

However, related substrates and non-substrates data are scattered across various databases and peer-reviewed literature. To address this issue, we curated an extensive dataset, consisting of substrates and non-substrates for six prominent CYP450 isozymes, which play a critical role in 90% of Phase I drug metabolism, incorporating up to 2000 compounds for each enzyme. We employed an advanced machine learning method, Graph Convolutional Networks (GCN). Additionally, we utilized sophisticated model optimization techniques, such as Bayesian optimization and SMILES enumeration, to develop robust CYP450 substrate classification models.

Our GCN-based models achieved Matthews correlation coefficients ranging from 0.51 (CYP2C19) to 0.72 (CYP1A2) when evaluated on external testing sets, which supports the robustness of our dataset. In our benchmark study (Tables 3 and 4), models trained on our curated dataset outperformed those trained on the CYPstrate dataset on the external testing sets, supporting the effectiveness of our single-model GCN approach. This methodology not only enhances the prediction of CYP450-mediated metabolism but also supports compound screening in drug development, providing a reliable tool for pharmaceutical research and safety assessment.

Table 3.

GCN Performance Evaluation Across Different Datasets.

CYP Isoform	Collected datasets				Collected datasets (Substrate enumerated)				Datasets from CYPstrate³⁵
CYP Isoform	# of sub/non-sub	MCC	ACC	Sen/Spe	# folds of subs in trainsets	MCC	ACC	Sen/Spe	# of sub/non-sub	MCC	ACC	Sen/Spe
CYP1A2	492/1511	0.51	0.82	0.57/0.91	4	0.86	0.93	0.99/0.85	237/1142	0.50	0.87	0.53/0.94
CYP2C9	384/1835	0.37	0.84	0.37/0.94	6	0.90	0.95	0.99/0.89	202/1175	0.39	0.86	0.41/0.94
CYP2C19	355/1605	0.40	0.84	0.39/0.94	5	0.91	0.96	1.00/0.91	194/1184	0.45	0.88	0.39/0.97
CYP2D6	521/1739	0.50	0.83	0.55/0.92	4	0.45	0.85	1.00/0.00	243/1140	0.57	0.88	0.59/0.95
CYP2E1	277/1621	0.52	0.89	0.52/0.95	2	0.79	0.92	0.87/0.94	125/1244	0.47	0.92	0.44/0.97
CYP3A4	1243/1379	0.60	0.80	0.80/0.81	4	0.85	0.95	1.00/0.78	416/991	0.67	0.86	0.76/0.91

The following abbreviations are used: CYP Isoform - Cytochrome P450 enzyme isoform; # of sub/non-sub - Number of substrates and non-substrates; MCC - Matthews Correlation Coefficient; ACC - Accuracy; Sen/Spe - Sensitivity/Specificity; # folds of subs in trainsets - Number of folds used in training subsets.

Table 4.

GCN Performance on the Test Sets Across Different Datasets.

			Collected datasets				Datasets from CYPstrate³⁵
CYP Isoform	# of data	# of Sub/Nonsub	MCC	ACC	Sen	Spe	MCC	ACC	Sen	Spe
CYP1A2	345	59/286	0.72	0.92	0.77	0.96	0.65	0.91	0.68	0.95
CYP2C9	342	48/294	0.55	0.90	0.44	0.98	0.52	0.90	0.46	0.97
CYP2C19	345	47/298	0.50	0.89	0.48	0.96	0.46	0.89	0.42	0.96
CYP2D6	344	60/284	0.62	0.90	0.64	0.95	0.59	0.89	0.52	0.97
CYP2E1	338	28/310	0.57	0.94	0.57	0.97	0.42	0.92	0.37	0.97
CYP3A4	352	104/248	0.71	0.88	0.76	0.93	0.66	0.86	0.68	0.94

The following abbreviations are used: CYP Isoform - Cytochrome P450 enzyme isoform; # of data - Number of data points in the test set; # of Sub/Nonsub - Number of substrates and non-substrates in the test set; MCC - Matthews Correlation Coefficient; ACC - Accuracy; Sen - Sensitivity; Spe - Specificity.

Our dataset can help improve models that predict whether a drug acts as a CYP450 substrate. It may also be integrated with other datasets that predict drug responses, potentially enhancing drug design and reducing avoidable adverse drug effects¹².

Methods

Data Collection

The aim of this research was to develop a comprehensive dataset for modeling the metabolism of drugs mediated by CYP450 enzymes from various databases and literature. To achieve this, compounds processed by the six key human CYP450 isoforms–CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4–were systematically collected from various authenticated sources including drug databases and peer-reviewed literature published prior to 2023.

Data were primarily extracted from three major databases:

Drugbank¹³ (https://go.drugbank.com/): For each isoform, the search function was used to query substrates by entering the specific CYP450 isoform name into the search bar. Results listing compounds as substrates were carefully recorded.
SuperCYP¹⁴ (https://insilico-cyp.charite.de/SuperCYPsPred/index.php?site=DrugDrugInteraction): This database was queried by selecting the ‘CYP-drug interaction’ option, followed by choosing the specific isoform and the substrate filter. This process was repeated for each isoform to ensure comprehensive data collection.
Cytochrome P450 Knowledgebase¹⁵ (http://cpd.ibmh.msk.su/): Accessed by selecting the ‘complete’ dataset option, followed by filtering for ‘human’ and the specific isoform. Functional data were then viewed to list all substrates for the selected isoform.

In addition to database sources, data were also sourced from interaction tables published by Pharmacy Times (http://www.pharmacytimes.com/publications/issue/2008/2008-07/2008-07-8624), Indiana University (https://drug-interactions.medicine.iu.edu/), the ILD Care Foundation (http://ildcare.eu/Downloads/artseninfo/Drugs_metabolized_by_CYP450s.pdf), the FDA (https://www.fda.gov/Drugs/DevelopmentApprovalProcess/DevelopmentResources/DrugInteractionsLabeling/ucm093664.htm#table2-1), and Mayo Clinic Laboratories (https://www.mayocliniclabs.com/it-mmfiles/Pharmacogenomic_Associations_Tables.pdf). Each source was queried for substrates of each isoform, and the results were meticulously documented^16–20.

The distribution of the compounds collected from each data source, including the number of substrates identified for each CYP450 isoform, is comprehensively summarized in Table 1. These additional data are used to support the three primary databases and serve as independent verification during the curation process.

Table 1.

Distribution of Data counts for Cytochrome P450 Substrates and Non-substrates Across Various Sources (Formatted as Substrates/Non-substrates).

Dataset Source	CYP1A2	CYP2C9	CYP2C19	CYP2D6	CYP2E1	CYP3A4
DrugBank 5.1¹³	198/107	224/215	169/116	271/163	65/72	768/158
SuperCYP¹⁴	19/10	16/10	16/10	32/8	14/7	33/6
CYP Knowledgebase¹⁵	53/40	46/61	52/44	74/70	81/38	210/54
Mayo Clinic Laboratories²⁰	1/0	0/0	0/0	0/0	0/0	0/0
Zhang T-2012²⁹	1/3	0/1	2/2	1/6	0/0	0/0
Yamashita-2011³⁰	105/400	36/440	56/439	80/400	84/525	121/231
Yap-2005³¹	3/13	5/185	6/14	14/161	0/10	26/93
Michielan-2009³²	1/1	0/2	0/1	3/1	0/1	0/0
Mishra, N.K.-2010³³	0/1	1/0	0/1	0/1	0/0	1/0
Siyang Tian-2018³⁴	21/0	34/1	11/0	35/0	21/1	53/1
Holmer-2021³⁵	144/1220	72/1212	89/1270	71/1213	40/1277	134/1079
Total	546/1795	434/2127	401/1897	581/2023	305/1931	1343/1622

Data Curation

Different data sources may have various thresholds to define substrates, leading to possible biases in substrate classification. In addition, data may conflict between data sources. To address these issues, the data curation process was designed to ensure the dataset’s accuracy, consistency, and replicability, especially when integrating chemical compound data from multiple sources. Several steps were implemented throughout the verification process to preserve the overall integrity and reliability of the dataset. The detailed steps were as follows:

Compound Identifier Verification: Ensuring the accuracy and consistency of chemical compound identifiers is a crucial step in dataset curation, particularly when integrating data from multiple sources. This study systematically verified each compound’s PubChem Compound Identifier (CID) to confirm its existence, classification, and consistency across multiple databases. The verification process involved several key steps to maintain the integrity of the dataset.
- PubChem CID Retrieval: To retrieve its CID, each compound was first queried against PubChem, one of the largest publicly available chemical databases. This step ensured that every compound in the dataset had a valid and unique identifier within PubChem.
- Cross-Referencing with CYP450 Interaction Data: Compounds with conflicting classifications were subjected to a cross-verification process to further ensure accuracy. Some compounds were labeled differently across databases: certain sources identified them as substrates, while others classified them as inhibitors. To resolve such discrepancies, the data were cross-referenced using interaction tables from authoritative sources, including the FDA Drug Metabolism Database, the Mayo Clinic Pharmacogenomics Database, and the Indiana University CYP450 Drug Interaction Table^17–20. Only compounds with consistent CYP450 interaction classifications across at least two independent sources were retained in the final dataset.
- Removal of Unverified or Inconsistent Compounds: Compounds that could not be reliably verified were systematically excluded from the dataset to maintain data integrity and ensure the accuracy of subsequent analyses. The exclusion criteria covered compounds that lacked a valid PubChem CID, as their identity could not be confirmed within a recognized chemical database. Compounds without confirmed CYP450 interaction data from at least two independent sources were removed, as their metabolic role remained uncertain. Furthermore, compounds that exhibited contradictory classifications across multiple databases (for instance, those labeled simultaneously as a substrate and an inhibitor) were excluded unless supporting literature provided conclusive evidence to resolve the discrepancy. By implementing this compound verification process, we ensured that only well-characterized and properly classified compounds were retained in the final dataset. These curated data were then used to construct predictive models for CYP450 substrate classification. This verification approach also facilitates reproducibility in future CYP450-related research, so that researchers can use the dataset for machine learning applications, cheminformatics analyses, and drug metabolism investigations.
Interaction Table Cross-Verification: A rigorous cross-verification process was conducted using multiple authoritative sources to ensure the accuracy of compound classifications. Compounds with narrow therapeutic indices or significant dose-dependent side effects were carefully examined to confirm their designation as substrates or inhibitors, as errors in classification could lead to misleading conclusions in downstream predictive models. Given that some compounds exhibit dual roles as both inhibitors and substrates depending on concentration and metabolic conditions, this step was particularly crucial in refining the dataset. To validate the classification of each compound, data were systematically cross-referenced across multiple independent sources, including Pharmacist’s Letter, the FDA Drug Metabolism Database, and the Indiana University CYP450 Drug Interaction Table. Compounds were also checked against the ILD Care Foundation CYP450 Interaction Database and relevant peer-reviewed literature to ensure consistency and eliminate ambiguities. When discrepancies arose between sources, literature evidence was prioritized to determine the most biologically relevant classification.

This multi-tiered validation process helped to prevent misclassification errors, thereby enhancing the credibility and robustness of the dataset. By ensuring that compounds were accurately labeled as substrates, inhibitors, or non-substrates, we significantly improved the reliability of the dataset for machine learning models and cheminformatics analyses. The resulting curated dataset provides a scientifically rigorous foundation for CYP450 interaction prediction, facilitating reproducibility and confident application in drug metabolism research.
Steroid Classification: Steroids were systematically reviewed for their interactions with CYP450 enzymes, given their biological activity and significant metabolic effects. These compounds often exhibit complex metabolic pathways, necessitating a rigorous classification approach to ensure accurate labeling in the dataset. The classification process involved cross-checking functional data from DrugBank and FDA records to verify the metabolic role of each steroid. Active steroid metabolites were compared across multiple sources to prevent misclassification, ensuring that their CYP450 interactions were consistently reported. Steroids were labeled as substrates only if at least two independent sources confirmed their classification, while those listed as inhibitors or inducers were categorized as non-substrates, as summarized in Table 2. This stringent classification method ensured that only credibly labeled steroids were included in the final dataset, enhancing its reliability for downstream machine learning models and cheminformatics analyses.
Parent Compound Identification: For compounds with multiple constituents, such as drug salts, prodrugs, or metabolites, the biologically active parent compound was systematically identified to ensure that the dataset accurately reflects real-world metabolic behavior. Since prodrugs require enzymatic activation and some drug salts exhibit different solubility or bioavailability properties, identifying the most pharmacologically relevant form was essential for accurate CYP450 metabolism modeling. To achieve this, we conducted a structured evaluation based on multiple data sources as follows:
- Drug Label Analysis: Using records from FDA-approved drug labels and DrugBank, we identified the form of each compound approved for clinical use. The active moiety–as specified in regulatory documents–was selected if multiple formulations existed. This was cross-referenced with the FDA Drug Metabolism Database to confirm CYP450 interactions.
- Pharmacokinetic Literature Review: Published studies on drug metabolism, clearance rates, and bioavailability were examined to determine whether a compound was primarily metabolized into a more active or inactive form. In cases where a prodrug had minimal pharmacological activity but was converted to an active metabolite, the metabolite was retained as the parent compound. Studies from PubMed, Drug Metabolism and Disposition, and clinical pharmacokinetics references were used for this validation.
- Cross-Referencing Metabolic Pathways: Metabolic pathways were validated using CYP450 pathway models and enzyme-substrate databases to ensure consistency. If conflicting pathways were reported, the compound was prioritized based on the most frequently cited metabolic route across DrugBank, CYP450 Knowledgebase, and SuperCYP. When available, the DeepChem GCN-based metabolic pathway predictions were also utilized as an additional validation step.
SMILES Standardization and Discarding Non-Verifiable Compounds: SMILES notations for all compounds were standardized using RDKit to maintain structural consistency across the dataset. This step ensured that molecular structures were consistently represented, reducing errors caused by variations in input formats. During this process, compounds with ambiguous stereochemistry were normalized to ensure uniformity in structure representation. If multiple stereochemical configurations were possible, the most prevalent or biologically relevant form was retained based on DrugBank and FDA metabolic records. Additionally, entries with missing or contradicting CYP450 interaction data across different sources were systematically reviewed. Compounds that lacked supporting documentation in at least two independent sources (e.g., DrugBank, SuperCYP, Cytochrome P450 Knowledgebase) were excluded. Similarly, if a compound was categorized as a substrate and an inhibitor without supporting literature to resolve the discrepancy, it was removed from the dataset. The dataset maintains high integrity and scientific validity by applying these SMILES standardization and data verification criteria, ensuring its reliability for subsequent cheminformatics analyses and machine learning-based CYP450 interaction modeling²¹.
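
The retention rule described in the steps above (a valid CID, at least two independent sources, no conflicting labels) can be illustrated with a minimal Python sketch. This is not the authors' curation script: the function name `consensus_labels`, the record layout, and the CIDs are hypothetical, and the literature-based resolution of conflicts is not modeled.

```python
from collections import defaultdict

def consensus_labels(records, min_sources=2):
    """Keep compounds whose label agrees across at least `min_sources` sources.

    `records` is an iterable of (cid, source, label) tuples; records without
    a CID are skipped because the compound identity cannot be confirmed.
    """
    labels_by_cid = defaultdict(lambda: defaultdict(set))
    for cid, source, label in records:
        if cid is None:
            continue
        labels_by_cid[cid][source].add(label)

    kept = {}
    for cid, per_source in labels_by_cid.items():
        all_labels = set().union(*per_source.values())
        # Retain only unanimous labels backed by enough distinct sources.
        if len(all_labels) == 1 and len(per_source) >= min_sources:
            kept[cid] = next(iter(all_labels))
    return kept

# Hypothetical records: CIDs and labels are placeholders, not dataset entries.
records = [
    (1001, "DrugBank", "substrate"),
    (1001, "SuperCYP", "substrate"),      # two agreeing sources: kept
    (1002, "DrugBank", "substrate"),
    (1002, "FDA table", "inhibitor"),     # conflicting labels: dropped
    (1003, "DrugBank", "substrate"),      # single source: dropped
    (None, "SuperCYP", "substrate"),      # no PubChem CID: dropped
]
curated = consensus_labels(records)
print(curated)  # {1001: 'substrate'}
```

In this sketch a conflict always removes the compound; a fuller version would need a manual override for cases where the literature resolves the discrepancy.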

Table 2.

Systematic classification of steroids and their interaction with CYP450 enzymes.

Steroids	Steroid Name	Source	Drug Name	Dosage	CYP Interactions
Glucocorticoids	Prednisolone	FDA label (NDA 21-959/S-004)	Orapred ODT	10 mg to 60 mg	CYP3A4 substrate
		FDA label (Reference ID: 2960745)	Flo-Pred	5 mg to 60 mg	CYP3A4 substrate
		FDA label (Reference ID: 3165107)	RAYOS (prednisone)	5 mg	CYP3A4 substrate
	Betamethasone	FDA label (Reference ID: 4217397)	CELESTONE SOLUSPAN	—	CYP3A4 substrate
		Tamihida Matsunaga et al. 2012	—	—	—
		Petra Matouková et al. 2014	—	—	—
	Dexamethasone	FDA label (Reference ID: 4500185)	HEMADY	20 mg or 40 mg	CYP3A4 substrate
		FDA label (NDA 11-664/S-062)	DECADRON (DEXAMETHASONE TABLETS)	—	CYP3A4 substrate
		Yi Ling Lee et al. 2012	—	—	—
		Raucy JL et al. 2002	—	—	—
		Radim Vrzal et al. 2008	—	—	—
	Hydrocortisone	FDA (Reference ID: 4678054)	ALKINDI® SPRINKLE (hydrocortisone) oral granules	0.5 mg or 1 mg	CYP3A4 substrate
		Wafaa El-Sankary et al. 2002	—	—	—
		Matouková P et al. 2014	—	—	—
	Methylprednisolone	FDA label (Reference ID: 3032293)	SOLU-MEDROL	—	CYP3A4 substrate
		FDA label (Reference ID: 3982392)	DEPO-MEDROL	20 mg/mL, 40 mg/mL, 80 mg/mL	CYP3A4 substrate
		Usui T et al. 2003	—	—	—
		Matouková P et al. 2014	—	—	—
	Deflazacort	FDA label (Reference ID: 4053971)	EMFLAZA	0.9 mg/kg/day	CYP3A4 substrate
Mineralocorticoid	Fludrocortisone	Omodunho Ogbu, PharmD 2019	—	—	CYP3A4 substrate
Other Steroids	Testosterone	Sylvie K Kandiel et al. 2017	—	—	CYP3A4 substrate
		Yamazaki H et al. 1997	—	—	CYP2C19 substrate
		Yamazaki H et al. 1997	—	—	CYP2C19 substrate
	Cholic acid	M Paolini et al. 1999	—	—	CYP1A2 inhibitor
	Lanosterol	D Rozman et al. 1996	—	—	—
	Progesterone	FDA label (Reference ID: 3037212)	PROMETRIUM (progesterone capsules)	100 mg per day to 300 mg	CYP3A4 substrate
	Progesterone	FDA label	Endometrin (progesterone Vaginal Insert)	200-300 mg	CYP2C19 substrate
	Medrogestone	H P E Tugster et al. 1993	—	—	CYP2C19 substrate
	β-Sitosterol	Binkowska M et al. 2014	—	—	CYP2B6 inhibitor
	Cholesterol	Vijayakumar TM et al. 2014	—	—	CYP2B6 inhibitor
	5α-cholestanol	William J Griffiths et al. 2019	—	—	—

This table simplifies the identification of steroids based on their role as substrates, inducers, or inhibitors of specific CYP450 isoforms.

We established a high-quality dataset for each of the six major CYP450 enzymes through these rigorous data collection and curation processes. These curated datasets were used in QSAR and deep learning-based GCN models to predict CYP450 substrate interactions. The detailed curation process ensures that the dataset is accurate, reproducible, and reliable for future research.

Data Records

The curated dataset of CYP450 interactions is available as open access on the Figshare online repository²². The distribution of chemical compounds from each data source is summarized in Table 1. This dataset includes training and testing sets for classifying substrates and non-substrates of six essential CYP450 enzymes.

Each CSV file contains four columns:

Chemical name
SMILES notation
Labels (where 1 indicates a substrate of CYP450 enzymes, and 0 indicates a non-substrate)
Data sources

Additionally, PubChem Fingerprints for each chemical are provided in CSV format and stored in the PubChem Fingerprint folder within the Figshare repository.

The curated CYP450 dataset consists of 12 comma-separated values (CSV) files, divided into training and testing sets for CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4. The training and testing sets have already been pre-separated for five-fold cross-validation.

We recommend that researchers perform data augmentation before model training. For other cross-validation methods, researchers can merge the training and testing sets into a single file (e.g., combining CYP1A2_trainingset.csv and CYP1A2_testingset.csv) and re-split them as needed.
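
As a rough illustration of the merge-and-re-split step, the sketch below combines a training and a testing file and builds stratified folds using only the Python standard library. The column headers (`name`, `smiles`, `label`, `source`) and the helper names are assumptions made for this example; the actual headers in the Figshare files may differ, and the in-memory CSVs hold placeholder compounds.

```python
import csv
import io
import random
from collections import defaultdict

def read_rows(handle):
    """Read one CSV file (or file-like object) into a list of dict rows."""
    return list(csv.DictReader(handle))

def stratified_folds(rows, k=5, label_key="label", seed=0):
    """Split rows into k folds, keeping the class ratio similar in each fold."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    folds = [[] for _ in range(k)]
    for group in by_label.values():
        rng.shuffle(group)
        # Deal each class round-robin so every fold gets its share.
        for i, row in enumerate(group):
            folds[i % k].append(row)
    return folds

def make_demo_csv(prefix, n_sub, n_non):
    """Build a small in-memory CSV with placeholder compounds."""
    lines = ["name,smiles,label,source"]
    lines += [f"{prefix}_sub{i},CCO,1,demo" for i in range(n_sub)]
    lines += [f"{prefix}_non{i},CCN,0,demo" for i in range(n_non)]
    return io.StringIO("\n".join(lines))

# With the real files, replace the two StringIO objects by
# open("CYP1A2_trainingset.csv", newline="") and the testing-set counterpart.
merged = read_rows(make_demo_csv("train", 8, 16)) + read_rows(make_demo_csv("test", 2, 4))
folds = stratified_folds(merged, k=5)
for i, fold in enumerate(folds):
    n_sub = sum(row["label"] == "1" for row in fold)
    print(f"fold {i}: {len(fold)} rows, {n_sub} substrates")
```

Stratifying by label matters here because substrates are the minority class for most isoforms, so a purely random split could leave some folds with very few positives.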

Technical Validation

Implementation and Enhancement of Computational Models for CYP450 Substrate Classification

We deployed Graph Convolutional Networks (GCN) as the main component of the CYP450 substrate classification model to validate our curated dataset. The GCN models were developed using the GraphConvModel from the DeepChem library (version 2.0), a leading toolkit in the realm of chemistry-focused deep learning. This model inherently processes SMILES representations of molecules by leveraging the ConvMolFeaturizer, which automatically computes and analyzes a vector of local descriptors for each atom. This approach eliminates the need for manual descriptor selection and provides a robust molecular representation tailored for graph-based learning.

Strategic Data Augmentation and Validation

Acknowledging the prevalent issue of data imbalance between substrates and non-substrates, we adopted the SMILES enumeration technique as described by Bjerrum et al.²³. This technique significantly enhances the dataset by generating multiple SMILES representations for each compound, effectively amplifying the presence of substrates and achieving a more balanced dataset. The performance of the GCN models on these augmented datasets was meticulously evaluated through a robust five-fold cross-validation protocol and further tested on external datasets.

Comprehensive Data Analysis and Hyperparameter Optimization

For each CYP450 isoform, the entire dataset collected during the initial data gathering phase was utilized for training. We employed k-fold cross-validation as a methodical approach to validate the generalization capacity of our models across independent datasets. This method involved partitioning the data into multiple groups, which were cyclically used as both training and validation sets. The averaged results from these cycles provided a reliable basis for hyperparameter optimization, which was conducted using Bayesian optimization techniques facilitated by the Hyperopt Python library^24,25.

Hyperparameter tuning was guided by the Tree-structured Parzen Estimator (TPE) from the Hyperopt package, an approach that models the conditional probability p(x|y) rather than p(y|x), allowing for more nuanced optimization based on prior outcomes^26,27:

p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^{*} \\ g(x) & \text{if } y \geq y^{*} \end{cases}

The optimization criterion, expected improvement (EI), is defined as:

EI_{y^{*}}(x) = \int_{-\infty}^{\infty} \max(y^{*} - y, 0) \, p(y \mid x) \, dy \propto \frac{l(x)}{g(x)}
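
To make the roles of l(x) and g(x) concrete, here is a minimal one-dimensional sketch of the TPE idea in plain Python. It is not Hyperopt's implementation: the Gaussian kernel density estimate, the fixed bandwidth, the quantile `gamma`, and the toy loss are all assumptions chosen for illustration. Past trials are split at the threshold y*, candidates are drawn from l(x), and the one with the largest ratio l(x)/g(x) is evaluated next.

```python
import math
import random

def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate over `points`."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm
    return density

def tpe_suggest(history, rng, gamma=0.25, n_candidates=50, bandwidth=0.1):
    """Suggest the next x from (x, y) history, where lower y is better."""
    ordered = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ordered)))
    good = [x for x, _ in ordered[:n_good]]   # trials below the threshold y*: l(x)
    bad = [x for x, _ in ordered[n_good:]]    # remaining trials: g(x)
    l, g = gaussian_kde(good, bandwidth), gaussian_kde(bad, bandwidth)
    # Sample candidates from l(x), then keep the one with the largest l(x)/g(x).
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

def toy_loss(x):
    """Hypothetical validation loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2

rng = random.Random(0)
history = [(x, toy_loss(x)) for x in (rng.random() for _ in range(10))]
best_before = min(y for _, y in history)
for _ in range(30):
    x_next = tpe_suggest(history, rng)
    history.append((x_next, toy_loss(x_next)))
best_x, best_after = min(history, key=lambda pair: pair[1])
print(f"best x = {best_x:.3f}, loss = {best_after:.5f}")
```

In practice the same loop would wrap a cross-validated model score rather than `toy_loss`, and a library such as Hyperopt handles multi-dimensional and conditional search spaces that this sketch ignores.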

Performance Evaluation and Final Remarks

Our model’s effectiveness is demonstrated through several key performance metrics, which are essential for evaluating the predictive accuracy and reliability across different scenarios:

\text{Accuracy} = \frac{TP + TN}{P + N},

\text{Sensitivity (TPR)} = \frac{TP}{TP + FN},

\text{Specificity (TNR)} = \frac{TN}{TN + FP},

\text{Matthews correlation coefficient (MCC)} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.

These metrics, particularly the Matthews Correlation Coefficient (MCC), underscore the balanced accuracy of our models, as detailed in Supporting information (see Table S1).
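
The four metrics can be computed directly from confusion-matrix counts, as in the short Python sketch below. The function name and the example counts are hypothetical and are not taken from the paper's results.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from raw counts."""
    positives, negatives = tp + fn, tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        # MCC is undefined when a marginal total is zero; report 0.0 then.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts: 50 substrates and 100 non-substrates.
metrics = classification_metrics(tp=40, tn=90, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is about 0.867 while the MCC is 0.700, which shows how MCC penalizes errors on the minority class more visibly than accuracy does on an imbalanced set.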

To validate our comprehensive datasets and further substantiate their advantages, we employed several distinct methodologies. We initially trained the Graph Convolutional Network (GCN) model using two different training sets: one derived from our curated datasets and the other from the CYPstrate datasets. We evaluated the models’ performance through five-fold cross-validation and on external testing sets, as elaborated in Table 3 for cross-validation and Table 4 for external tests.

Additionally, we enhanced the robustness of our datasets by implementing SMILES enumeration on the substrates. This method was aimed at increasing the diversity and representativeness of the chemical structures, ensuring adaptability across various chemical contexts.

Our investigations show that models trained on our datasets outperform those trained on the CYPstrate datasets on the external test sets (Table 4), with higher MCC for all six isoforms, for example 0.72 versus 0.65 for CYP1A2 and 0.71 versus 0.66 for CYP3A4. The validation through model training and dataset evaluation supports the effectiveness and reliability of our datasets in advancing the predictive modeling of CYP450-mediated drug metabolism.

Acknowledgements

This work was financially supported by the ‘Center for Advanced Computing and Imaging in Biomedicine (NTU-113L900703)’ and the ‘Center of Precision Medicine’ from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and the National Science and Technology Council (NSTC 111-2320-B-002-043-MY2). We thank the Computational Molecular Design and Metabolomics Laboratory, Department of Computer Science and Information Engineering, and National Taiwan University for their resources.

Author contributions

Y.J.T. conceived the project. Y.H.N. designed the method and implemented the classification model. S.C.Y. and Y.W.S. collected the data. Y.H.N. wrote the manuscript. J.C.H., P.W.A.T. and Y.T.H. refined the data inclusion criteria and calibrated the dataset. J.C.H., P.W.A.T., Y.T.H., T.C.K. and Y.J.T. edited the manuscript. All authors have reviewed and approved the final version of the manuscript.

Code availability

The in-house Python scripts for the machine learning and deep learning experiments are available on GitHub and archived on Zenodo²⁸: https://zenodo.org/doi/10.5281/zenodo.13364709.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-025-05753-8.

References

1. Furge, L. L. & Guengerich, F. P. Cytochrome p450 enzymes in drug metabolism and chemical toxicology: An introduction. Biochemistry and Molecular Biology Education 34, 66–74 (2006).
2. Nebert, D. W. & Russell, D. W. Clinical importance of the cytochromes p450. The Lancet 360, 1155–1162 (2002).
3. Lynch, T. & Price, A. The effect of cytochrome p450 metabolism on drug response, interactions, and adverse effects. American family physician 76, 391–396 (2007).
4. Zanger, U. M. & Schwab, M. Cytochrome p450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacology & therapeutics 138, 103–141 (2013).
5. Cherkasov, A. et al. Qsar modeling: where have you been? where are you going to? Journal of medicinal chemistry 57, 4977–5010 (2014).
6. Shaikh, N., Sharma, M. & Garg, P. Selective fusion of heterogeneous classifiers for predicting substrates of membrane transporters. Journal of Chemical Information and Modeling 57, 594–607 (2017).
7. Lewis, D. F. & Ito, Y. Human cyps involved in drug metabolism: structures, substrates and binding affinities. Expert Opinion on Drug Metabolism & Toxicology 6, 661–674 (2010).
8. Kesharwani, S. S., Nandekar, P. P., Pragyan, P., Rathod, V. & Sangamwar, A. T. Characterization of differences in substrate specificity among cyp1a1, cyp1a2 and cyp1b1: an integrated approach employing molecular docking and molecular dynamics simulations. Journal of Molecular Recognition 29, 370–390 (2016).
9. Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
10. Zhang, X., Zhao, J. & LeCun, Y. Character-level convolutional networks for text classification. Advances in neural information processing systems 28 (2015).
11. Karpov, P., Godin, G. & Tetko, I. V. Transformer-cnn: Swiss knife for qsar modeling and interpretation. Journal of cheminformatics 12, 1–12 (2020).
12. Tatonetti, N. P., Ye, P.-P., Daneshjou, R. & Altman, R. B. Science Translational Medicine 4, 125ra31 (2012).
13. Wishart, D. S. et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research 46, D1074–D1082 (2018).
14. Banerjee, P., Dunkel, M., Kemmler, E. & Preissner, R. Supercypspred-a web server for the prediction of cytochrome activity. Nucleic acids research 48, W580–W585 (2020).
15. Cytochrome P450 Knowledgebase. Cytochrome P450 knowledgebase. http://cpd.ibmh.msk.su/ (2006).
16. Horn, J. R. & Hansten, P. D. Get to know an enzyme: Cyp2d6. Pharmacy Times (2008). Retrieved from http://www.pharmacytimes.com/publications/issue/2008/2008-07/2008-07-8624.
17. Flockhart, D. Drug interactions: Cytochrome p450 drug interaction table. Indiana University School of Medicine https://drug-interactions.medicine.iu.edu (2007).
18. Genelex Corporation. Physician guidelines: Drugs metabolized by cytochrome p450’s. http://ildcare.eu/Downloads/artseninfo/Drugs_metabolized_by_CYP450s.pdf (2005).
19. US Food and Drug Administration. Drug development and drug interactions: table of substrates, inhibitors and inducers. https://www.fda.gov/Drugs/DevelopmentApprovalProcess/DevelopmentResources/DrugInteractionsLabeling/ucm093664.htm#table2-1 (2020).
20. Mayo Clinic Laboratories. Pharmacogenomic associations tables. https://www.mayocliniclabs.com/it-mmfiles/Pharmacogenomic_Associations_Tables.pdf (2019).
21. Landrum, G. Rdkit: Open-source cheminformatics. https://www.rdkit.org (2006).
22. Ni, Y.-H. et al. Comprehensively-Curated Dataset of CYP450 Interactions: Enhancing Predictive Models for Drug Metabolism (2024). https://figshare.com/articles/dataset/Comprehensively-Curated_Dataset_of_CYP450_Interactions_Enhancing_Predictive_Models_for_Drug_Metabolism/26630515.
23. Bjerrum, E. J. Smiles enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 (2017).
24. Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, 115–123 (PMLR, 2013).
25. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, 2825–2830 (2011).
26. Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24 (2011).
27. Chen, J.-H. & Tseng, Y. J. A general optimization protocol for molecular property prediction using a deep learning network. Briefings in Bioinformatics 23, bbab367 (2022).
28. Ni, Y.-H. et al. Cmdm-lab/cyp450: v1.0. Zenodo 10.5281/zenodo.13364709 (2024).
29. Zhang, T., Dai, H., Liu, L. A., Lewis, D. F. & Wei, D. Classification models for predicting cytochrome p450 enzyme-substrate selectivity. Molecular informatics 31, 53–62 (2012).
30. Yamashita, F., Feng, C., Yoshida, S., Itoh, T. & Hashida, M. Automated information extraction and structure-activity relationship analysis of cytochrome p450 substrates. Journal of chemical information and modeling 51, 378–385 (2011).
31. Yap, C. W. & Chen, Y. Z. Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. Journal of Chemical Information and Modeling 45, 982–992 (2005).
32. Michielan, L., Terfloth, L., Gasteiger, J. & Moro, S. Comparison of multilabel and single-label classification applied to the prediction of the isoform specificity of cytochrome p450 substrates. Journal of Chemical Information and Modeling 49, 2588–2605 (2009).
33. Mishra, N. K., Agarwal, S. & Raghava, G. P. Prediction of cytochrome P450 isoform responsible for metabolizing a drug molecule. BMC Pharmacology 10, 8 (2010).
34. Tian, S., Djoumbou-Feunang, Y., Greiner, R. & Wishart, D. S. Cypreact: a software tool for in silico reactant prediction for human cytochrome p450 enzymes. Journal of chemical information and modeling 58, 1282–1291 (2018).
35. Holmer, M., de Bruyn Kops, C., Stork, C. & Kirchmair, J. Cypstrate: a set of machine learning models for the accurate classification of cytochrome p450 enzyme substrates and non-substrates. Molecules 26, 4678 (2021).
