Abstract
Drug-induced hepatotoxicity (DIH), characterized by diverse phenotypes and complex mechanisms, remains a critical challenge in drug discovery. To systematically decode this diversity and complexity, we propose a multi-dimensional computational framework that integrates molecular structure analysis with disease pathogenesis exploration, focusing on drug-induced intrahepatic cholestasis (DIIC) as a representative DIH subtype. First, a graph-based modularity maximization algorithm partitioned the DIIC risk gene module into eight disease pathogenesis clusters. Network proximity values between drug targets and DIIC clusters were calculated to define drug–disease relationships. Subsequently, a random forest model combining Mordred molecular descriptors, structural alerts (SAs), and network proximity achieved robust DIIC prediction on the external test set: accuracy (ACC) = 0.740 ± 0.014 and area under the curve (AUC) = 0.828 ± 0.008 (n_training = 342, n_validation = 114, n_external test = 295; 100 random modeling repetitions). Notably, a K-nearest neighbors-graph convolutional network classified drugs into eight clusters, with the Cluster 3 model demonstrating superior performance (ACC = 0.810 ± 0.024; AUC = 0.890 ± 0.014; n_training = 186, n_validation = 63, n_external test = 172). Mechanistic analysis linked critical SAs to DIIC pathogenesis: (i) furan (SA3) perturbed cytochrome P450-mediated metabolism and the regulation of lipid metabolism by PPARα; (ii) nitrogen-sulfur heteroatom chains (SA7) disrupted steroid metabolism; (iii) phenylthio groups (SA12) and their CYP450 metabolites induced cholestasis. This multi-dimensional framework bridges molecular features and disease mechanisms, offering a generalizable strategy for toxicity prediction and pathway-centric drug safety evaluation, especially for complex diseases.
Keywords: multi-dimensional computational framework, structural alerts, disease gene cluster, KNN-GCN, drug-induced hepatotoxicity
Graphical Abstract
Introduction
In drug discovery and clinical medicine, assessing hepatotoxicity is a critical component in evaluating lead compounds and drugs [1]. Drug-induced hepatotoxicity (DIH), particularly drug-induced intrahepatic cholestasis (DIIC), accounts for over 50% of clinical liver injury cases and remains a pivotal challenge in drug safety evaluation due to its multi-stage pathogenesis and heterogeneous phenotypes [2, 3]. Computational models, including classical quantitative structure–activity relationship frameworks, transcriptome-based classifiers, and benchmark concentration models [4–7], have advanced toxicity prediction. However, their accuracy in liver-specific assessments lags behind that for other organs (accuracy difference, ΔACC >15%) [8], largely owing to the liver’s first-pass metabolism and complex multi-factorial injury mechanisms.
Existing models predominantly rely on isolated data dimensions, such as molecular descriptors or biochemical endpoints [9–11], which fail to capture the synergistic interplay between structural toxicity triggers (e.g. structural alerts [SAs]) [12–15] and disease pathogenesis networks [16, 17]. For instance, while Mordred descriptors and SAs achieve moderate sensitivity (typically 0.6–0.8) in hepatotoxicity prediction [18, 19], they overlook the mechanistic bridge linking drug substructures to dysregulated pathways [16, 17]. Similarly, gene-clustering algorithms such as Louvain [20] stratify diseases yet lack integration with molecular features. This reductionist approach creates a critical gap between prediction and mechanism, limiting clinical translatability [15]. As artificial intelligence methods have begun to probe the relationships within multiple heterogeneous biological networks in drug discovery [21–23], BioCB employed deep convolutional neural networks and LightGBM as predictors of liver toxicity [24]. The development of neural network [25], graph neural network [26], and graph convolutional network (GCN) algorithms [27, 28], together with their application to the cascade classification of relationships between heterogeneous biological networks [29], makes it possible to accurately locate the hierarchical DIIC pathways associated with drugs.
The evolution of systems biology has catalyzed the development of network medicine frameworks that systematically decode disease mechanisms [30–33]. Identifying pathologically cohesive gene clusters through interactome topology analysis supplies a method for quantifying disease modules and classifying disease genes [20]. Drug–pathway proximity analysis uses a multi-scale metric to integrate the shortest path distances between drug targets and module hubs [17]. By constructing multiple biological network relationships, researchers have developed interpretable models for bio-link prediction, with area under the curve (AUC) values improved by over 10% [34], significantly narrowing the gap between predictive performance and clinical needs. Topological matching of SAs to drug–gene cluster proximity allows systematic identification of the key pathway perturbation modules triggered by drug substructures (such as the regulatory network of the bile acid transporter ABCB11) [35]. These multi-dimensional strategies overcome the limitations of traditional single-dimensional models and establish a cascade analysis paradigm of “molecular substructure–pathway dysregulation–phenotype”.
To establish such a method, the mechanism can be explained by mapping the topological characteristics of drug–pathway interactions and by modeling the causal relationship between molecular structure and disease gene hubs or clusters. To improve the accuracy of hepatotoxicity prediction and to explore substructure–mechanism relationships, this article develops a multi-dimensional computational framework based on molecular structure information and a self-built network topology. The framework classifies drugs into DIIC-associated cluster domains through disease gene clustering and modified GCNs, and further explores the relationship between drugs and disease pathogenesis in order to predict DIIC, a major aspect of hepatotoxicity.
Methods
In this study, we developed a multi-dimensional computational framework that integrates molecular structural information with drug–cluster association classification to predict DIIC. As illustrated in Fig. 1, the framework comprises six sequential modules.
Figure 1.
Workflow of the multi-dimensional computational framework.
Data collection and normalization
Drugs with positive and negative DIIC toxicity were collated from DILIrank [36] and LiverTox [37]. Drugs labeled most-DILI-concern or less-DILI-concern in DILIrank, and drugs with a likelihood score of A, B, C, or D in LiverTox, were collected and defined as DIIC positive if they showed the side effect of cholestasis in the metaADEDB database (http://lmmd.ecust.edu.cn/metaadedb/). Drugs labeled no-DILI-concern in DILIrank, and drugs with a likelihood score of E in LiverTox, were selected as DIIC negative. The molecular structures of the drugs were then downloaded from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/) in SMILES notation and pretreated as follows: (i) inorganics and mixtures were removed; (ii) salts were converted to the corresponding acids or bases.
The intrahepatic cholestasis risk genes were retrieved from the OMIM (https://www.omim.org/), DisGeNET (https://www.disgenet.org/), GeneCards (https://www.genecards.org/), and CTD (http://ctdbase.org/) databases (accessed on 31 December 2024) using “Intrahepatic cholestasis” and “cholestasis, Intrahepatic” as the two search terms. Drug targets were sourced from the DrugBank (https://go.drugbank.com/) and DGIdb (https://www.dgidb.org) databases (downloaded on 31 December 2024). The intrahepatic cholestasis risk genes and the human protein targets of the drugs were mapped to a unified UniProt format using Python 3.9.
Molecular structure characteristics extraction
On the one hand, the molecular structure was characterized with molecular SAs using SARpy (W-SARpy 1.0), which applies “string mining” to break the SMILES notation of each chemical into fragments and to identify the fragments related to the toxic effect. Here, SAs associated with intrahepatic cholestasis toxicity were identified using the default parameter settings, with the number of atoms ranging from 2 to 18 and a minimum occurrence of 3. For each SA, an aggregation value ranging from 0 to 1 was calculated from the information gain (IG) and the positive rate (PR) (see the Method section and Supplementary material for details) and used as the molecular SA feature input to the model. The identified SAs were then matched against all drugs. For a given molecule, the SA feature value equals the aggregation value of the matched alert if the drug contains any of the 19 SAs. Molecules lacking these SAs are assigned an SA feature value of 0.
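As an illustration of the two statistics behind the SA feature, the sketch below computes the information gain and positive rate of one alert from class counts. The final weighting is a placeholder: the exact aggregation formula is given only in the Supplementary material, so `aggregation_value` and its `weight` parameter are hypothetical, and all counts are invented toy numbers.

```python
import math


def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary label distribution."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def alert_statistics(pos_with, neg_with, pos_total, neg_total):
    """Return (information gain, positive rate) for one structural alert.

    pos_with / neg_with: DIIC-positive / DIIC-negative drugs containing the alert.
    pos_total / neg_total: class sizes of the whole training set.
    """
    n = pos_total + neg_total
    n_with = pos_with + neg_with
    h_before = entropy(pos_total, neg_total)
    h_after = (n_with / n) * entropy(pos_with, neg_with) + (
        (n - n_with) / n
    ) * entropy(pos_total - pos_with, neg_total - neg_with)
    positive_rate = pos_with / n_with if n_with else 0.0
    return h_before - h_after, positive_rate


def aggregation_value(info_gain, positive_rate, weight=0.5):
    """HYPOTHETICAL weighting of IG and PR; not the formula used in the paper."""
    return weight * info_gain + (1 - weight) * positive_rate


# Toy counts: an alert found in 10 positive and 0 negative drugs (10 + 10 drugs).
ig, pr = alert_statistics(10, 0, 10, 10)
sa_feature = aggregation_value(ig, pr)
```

With binary labels both IG (in bits) and PR lie between 0 and 1, so any convex combination also stays in the 0 to 1 range reported for the aggregation value.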
On the other hand, the molecular structure was also characterized with Mordred descriptors [38], a classical type of molecular descriptor. Redundant Mordred descriptors were removed if they (i) had no value for a drug, (ii) had a standard deviation less than 0.10, or (iii) had a Pearson correlation coefficient >0.90 with another descriptor.
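The three filtering rules can be sketched in plain Python as follows. Several details are assumptions rather than statements from the text: missing values are represented as `None`, the population standard deviation is used, the correlation threshold is applied to the absolute Pearson coefficient, and of two correlated descriptors the one encountered first is kept. The descriptor names and values are toy data.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(table, min_std=0.10, max_corr=0.90):
    """Return the names of the descriptors that survive the three filters.

    table maps descriptor name -> list of values, one per drug (None = missing).
    """
    kept = []
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # (i) descriptor has no value for at least one drug
        if statistics.pstdev(values) < min_std:
            continue  # (ii) near-constant descriptor
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue  # (iii) highly correlated with an already kept descriptor
        kept.append(name)
    return kept


# Toy descriptor table for four drugs (hypothetical names and values).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],    # almost collinear with "a"
    "c": [1.0, 1.0, 1.0, 1.05],   # standard deviation below 0.10
    "d": [1.0, None, 2.0, 3.0],   # missing value
    "e": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_descriptors(toy)
```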
Network proximity metrics defining the drug–disease relationships
Molecular-related-pathogenesis features were extracted with the network proximity methodology. First, the pathogenesis of DIIC was explored by constructing the DIIC module: the intrahepatic cholestasis risk genes were imported into the STRING database, and only interactions classified as “highest confidence” (combined score > 0.9) were used. Second, disease onset clusters of highly connected genes were identified in the DIIC module with a graph-based modularity maximization algorithm, commonly referred to as the Louvain algorithm. With default parameter settings, the algorithm iteratively optimizes modularity to find groups of genes that have many connections within the group and few connections between groups. Functional enrichment of the gene clusters was conducted with the g:Profiler tool [39], using all genes in the full DIIC module as the background gene set and adjusting for multiple testing (Benjamini–Hochberg procedure). After GO terms and KEGG and REACTOME pathways were tested for functional enrichment, the DIIC disease onset clusters were obtained. To verify the clustering results, we built an alternative hierarchical DIIC network by multiscale community detection in Cytoscape using the CDAPS (Community Detection Application and Service) application with the HiDef community detection algorithm [40, 41], and the communities were annotated with significantly enriched GO terms and pathways from g:Profiler.
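The quantity that the Louvain algorithm maximizes can be made concrete with a small sketch. The function below only evaluates the Newman modularity Q of a given partition of an undirected, unweighted graph; it does not perform the Louvain optimization itself. The gene identifiers `g1` to `g6` and their edges are a hypothetical toy network, not STRING data.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ], where m is the
    number of edges, L_c the number of edges inside c, and d_c the summed
    degree of the nodes in c.
    """
    m = len(edges)
    intra = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridging edge.
toy_edges = [
    ("g1", "g2"), ("g2", "g3"), ("g1", "g3"),
    ("g4", "g5"), ("g5", "g6"), ("g4", "g6"),
    ("g3", "g4"),
]
two_clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 1}
one_cluster = {gene: 0 for gene in two_clusters}

q_split = modularity(toy_edges, two_clusters)   # 5/14, about 0.357
q_merged = modularity(toy_edges, one_cluster)   # 0.0
```

Splitting the toy network into its two triangles gives a higher Q than keeping all genes in one group, which is the kind of comparison the algorithm repeats while moving genes between groups.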
Subsequently, the network proximity method was used to calculate the shortest path lengths between the drug targets and the disease onset clusters for each drug. The shortest path length (dc) between the disease onset cluster gene set (V) and the drug target set (T) was calculated by

$$d_c(V, T) = \frac{1}{\lvert T \rvert} \sum_{t \in T} \min_{v \in V} d(t, v) \qquad (1)$$

where $d(t, v)$ is the shortest path length between drug target $t$ and cluster gene $v$ in the interactome. Using the average $\mu_{d_c}$ and standard deviation $\sigma_{d_c}$ of the reference distribution between randomly selected protein groups, the calculated distance was converted into the average relative distance between drugs and disease onset clusters. The relative average shortest distance (Zdc) between each V and T was calculated to assess DIIC toxicity and served as the molecular-related-disease-onset characteristics, which were calculated by

$$Z_{d_c} = \frac{d_c(V, T) - \mu_{d_c}}{\sigma_{d_c}} \qquad (2)$$
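A minimal sketch of Eqs. 1 and 2 is given below, taking dc as the mean, over drug targets, of the shortest path to the nearest cluster gene (the “closest” measure, assumed here). The reference distribution is built by uniform random sampling of node sets of the same sizes, which is a simplification: published proximity pipelines usually match the sampled nodes by degree. The eight-node path graph and the node names are hypothetical.

```python
import random
import statistics
from collections import deque


def bfs_distances(graph, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def closest_distance(graph, targets, cluster):
    """d_c: mean over targets of the distance to the nearest cluster gene."""
    total = 0
    for t in targets:
        dist = bfs_distances(graph, t)
        total += min(dist[v] for v in cluster if v in dist)
    return total / len(targets)


def proximity_z(graph, targets, cluster, n_random=200, seed=0):
    """Z-score of d_c against randomly drawn target and cluster sets."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    observed = closest_distance(graph, targets, cluster)
    reference = [
        closest_distance(
            graph,
            rng.sample(nodes, len(targets)),
            rng.sample(nodes, len(cluster)),
        )
        for _ in range(n_random)
    ]
    mu = statistics.fmean(reference)
    sigma = statistics.pstdev(reference)
    return (observed - mu) / sigma


# Toy connected interactome: a path n0 - n1 - ... - n7.
graph = {f"n{i}": set() for i in range(8)}
for i in range(7):
    graph[f"n{i}"].add(f"n{i + 1}")
    graph[f"n{i + 1}"].add(f"n{i}")

targets = ["n0", "n1"]   # hypothetical drug targets
cluster = ["n1", "n2"]   # hypothetical cluster genes
d_c = closest_distance(graph, targets, cluster)  # (1 + 0) / 2 = 0.5
z_dc = proximity_z(graph, targets, cluster)
```

Because the toy targets sit next to the cluster, the observed distance falls below the random expectation and the resulting Z-score is negative, the sign that the framework later reads as a drug–cluster association.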
Multi-dimensional modeling
The features from the different dimensions described above were merged and screened by recursive feature elimination with cross-validation (CV). Features with low contributions were removed by computing the importance of each feature and repeating the process recursively on the pruned set, with the accuracy of each candidate feature set evaluated by 10-fold CV. The feature set with the highest 10-fold CV accuracy was chosen for modeling.
Three machine learning algorithms, namely eXtreme gradient boosting (XGB) [42], random forest (RF) [43], and support vector machine (SVM) [44], were used in Scikit-Learn to develop prediction models relating DIIC hepatotoxicity to the three types of drug characteristics. The drugs collected from DILIrank were split into training and validation sets at a ratio of 0.75:0.25, while the drugs collected from LiverTox formed the external test set. A stratified 10-fold CV grid search was performed on the training set to identify the optimal hyperparameter combinations, while uni-dimensional models were established using exclusively Mordred descriptors as a control for each classifying set. Furthermore, we conducted ablation studies comparing distinct modeling strategies that used only the structural features or only the Zdc features. All models were run with 10-fold CV in 100 random repetitions, and a set of metrics was calculated on the training, validation, and external test sets, including accuracy (ACC), precision, recall (sensitivity, SE), specificity (SP), F1-score (F1), Matthews correlation coefficient (MCC), and AUC of the receiver operating characteristic curve. For more method details, see the Method section and Supplementary material.
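For reference, the threshold-based metrics listed above can be derived from the four confusion-matrix counts as in the sketch below (AUC is omitted because it needs predicted scores rather than labels). The label vectors are invented toy data, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
import math


def binary_metrics(y_true, y_pred):
    """ACC, precision, SE, SP, F1, and MCC for binary labels (1 = DIIC positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": (tp + tn) / len(pairs),
        "Precision": precision,
        "SE": recall,
        "SP": specificity,
        "F1": f1,
        "MCC": mcc,
    }


# Toy example: 4 positive and 6 negative drugs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

In the toy example TP = 3, FN = 1, TN = 4, and FP = 2, so ACC is 0.7, precision 0.6, and SE 0.75.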
Cluster-specific classification based on K-nearest neighbors-graph convolutional network
In this article, a cluster-specific classification of drugs was explored using the K-nearest neighbors-graph convolutional network (KNN-GCN) methodology, which assigns drugs to distinct disease onset clusters. First, the structural similarity of drugs was assessed with the Tanimoto similarity coefficient based on ECFP fingerprints. Second, the molecular function (MF) similarity of disease onset clusters was computed using the R package GOSemSim [45]. Drugs and disease onset clusters were represented as nodes, with the similarity matrices as node features. Node labels were derived from the Zdc values: 1 (association) for negative Zdc values and 0 (no association) for positive Zdc values. Then, edges between nodes were generated by setting K to 1, 3, 5, 7, 9, 11, 13, and 15 and fitting the KNN model with 5-fold CV, thereby constructing the topological network structure. Finally, the GCN algorithm was used to predict the associations between disease onset clusters and drugs with 5-fold CV. The GCN model used 512-dimensional features with “concat” aggregation, followed a “1-0-1-0” S-GCN architecture with ReLU activation and normalized bias, employed a sigmoid loss function, and was trained with a learning rate of 0.001, a dropout of 0.1, no weight decay, a sample coverage of 50, and a positive class weight of 10, running for 1000 epochs with an edge sampler and a subgraph edge size of 4000. As a neural network layer, the propagation of the GCN from layer $l$ to layer $l + 1$ is defined as

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right) \qquad (3)$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its degree matrix, $H^{(l)}$ and $W^{(l)}$ are the node features and trainable weights of layer $l$, and $\sigma$ is the activation function (ReLU here).
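The two building blocks of this step, a K-nearest-neighbor graph built from Tanimoto similarities and one GCN propagation step in the sense of Eq. 3, can be sketched as follows. This is a plain symmetric-normalization layer with ReLU, not the sampled S-GCN with “concat” aggregation used for training. Fingerprints are represented as sets of on-bit indices as a stand-in for ECFP, and the fingerprints, node features, and weights are all hypothetical toy values.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_edges(fingerprints, k):
    """Undirected edges linking each node to its k most similar neighbours."""
    edges = set()
    n = len(fingerprints)
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: tanimoto(fingerprints[i], fingerprints[j]),
            reverse=True,
        )
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def gcn_layer(edges, features, weights):
    """One step of ReLU(D^-1/2 (A + I) D^-1/2 H W) on dense Python lists."""
    n = len(features)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    norm = [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    n_in, n_out = len(weights), len(weights[0])
    agg = [
        [sum(norm[i][m] * features[m][f] for m in range(n)) for f in range(n_in)]
        for i in range(n)
    ]
    return [
        [max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in)))
         for o in range(n_out)]
        for i in range(n)
    ]


# Four toy drugs forming two structurally similar pairs.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
edges = knn_edges(fps, k=1)                       # [(0, 1), (2, 3)]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W0 = [[1.0], [-1.0]]
H1 = gcn_layer(edges, H0, W0)
```

Each node's output mixes its own features with those of its neighbors, so the first pair (with opposing features) cancels to 0 while the second pair reinforces to 1.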
Each disease onset cluster thus contained a different number of drugs, forming a cluster-related drug set that represents the applicability domain defined by the KNN-GCN. Within each cluster, drugs belonging to DILIrank were randomly divided into the training and validation sets at a ratio of 0.75:0.25, drugs belonging to LiverTox were used as the external test set, and DIIC prediction models were constructed for the drugs included in that cluster.
For comparative assessment, equal numbers of positive and negative drugs were randomly designated from each cluster in the original dataset as control groups. These control models were established to benchmark the disease onset cluster models defined by the KNN-GCN method. Each cluster’s drug set (both disease-associated and control groups) underwent 100 independent random iterations to validate the predictive robustness of the cluster-specific classifications.
Pathway enrichment analysis linking SAs to DIIC pathogenesis
Shapley values were calculated with the Shapley additive explanation (SHAP) Python package to quantify the contributions of the input features to the model. The attribution of each drug characteristic to a model was plotted as a SHAP bee-swarm plot. From the distribution of the high (or low) values of each feature across the positive and negative sides of the SHAP axis, the influence of the selected drug characteristic on the incidence of DIIC hepatotoxicity could be probed and interpreted.
Furthermore, the impact of the multi-dimensional features on DIIC pathogenesis was investigated as follows. The occurrence of the selected SAs in positive and negative drugs within each cluster was examined, and the SAs that appeared solely in positive cases were selected. The targets of the drugs carrying these SAs were then overlapped in the network with all genes in the cluster to which the drugs belonged. Next, the ds values between each gene of the cluster and its closest drug target were calculated in the Menche interactome, giving results of 0 (overlapped), 1 (the closest neighbor), 2, >2 steps, or no path to any drug target (not applicable, NA). Enrichment analysis was performed on the genes with ds ≤ 2 (P-value <0.05) because our previous studies showed that genes with ds ≤ 2 were more important [17]. Enrichment analysis of all genes in each cluster was also performed. Finally, the top 20 results of the two enrichment analyses were compared, and the items shared by both were taken as the key items through which the SAs regulate DIIC pathogenesis.
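The ds binning described above amounts to a multi-source breadth-first search from the drug targets, as sketched below. The small graph, the gene names, and the choice of `g0` as the only drug target are hypothetical; the real analysis runs on the Menche interactome.

```python
from collections import deque


def distance_to_targets(graph, targets):
    """Multi-source BFS: steps from each reachable node to its closest target."""
    dist = {t: 0 for t in targets if t in graph}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def bin_cluster_genes(graph, targets, cluster_genes):
    """Assign each cluster gene to one of the bins 0, 1, 2, >2, or NA."""
    dist = distance_to_targets(graph, targets)
    bins = {}
    for gene in cluster_genes:
        d = dist.get(gene)
        if d is None:
            bins[gene] = "NA"      # no path to any drug target
        elif d > 2:
            bins[gene] = ">2"
        else:
            bins[gene] = str(d)
    return bins


def genes_for_enrichment(bins):
    """Genes with ds <= 2, i.e. the set passed on to enrichment analysis."""
    return sorted(g for g, b in bins.items() if b in {"0", "1", "2"})


# Toy network: g0 - g1 - g2 - g3, with g5 disconnected.
toy_graph = {
    "g0": {"g1"},
    "g1": {"g0", "g2"},
    "g2": {"g1", "g3"},
    "g3": {"g2"},
    "g5": set(),
}
ds_bins = bin_cluster_genes(toy_graph, ["g0"], ["g0", "g1", "g2", "g3", "g5"])
near_genes = genes_for_enrichment(ds_bins)
```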
Results
Data collection
There were 501 drugs in DILIrank and 687 drugs in LiverTox with known DIIC/no DIIC labels. After inorganics and mixtures were removed, 475 and 676 drugs remained, respectively. Eight drugs with inconsistent labels between the two databases were then removed, and drugs duplicated in LiverTox with consistent labels were removed from that set. Finally, 456 drugs in DILIrank and 295 drugs in LiverTox were retained (see Table S1), including 271 labeled DIIC positive and 480 labeled DIIC negative. The t-distributed stochastic neighbor embedding (t-SNE) of the 751 selected drugs was plotted in Fig. S2 and showed that the drugs were broadly distributed and structurally diverse.
In addition, 346 DIIC risk genes were compiled with the MeSH method from the OMIM, DisGeNET, GeneCards, and CTD databases, listed in Table S2A with a unified form.
Molecular structure characteristics
In this step, a single SA feature was obtained for all drugs through the following process. When the 456 drugs selected from DILIrank were input into SARpy as the training set, with their SMILES structures and DIIC labels, the resulting accuracy was 0.76, sensitivity was 0.65, and specificity was 0.84. As shown in Table S4, the likelihood ratios (LRs) of the 19 SAs for DIIC hepatotoxicity were all >2, and the frequencies of SA1–SA18 in positive drugs were higher than those in negative drugs. Further testing of these 19 SAs on the 295 drugs from LiverTox as an external test set yielded an accuracy of 0.69, a sensitivity of 0.72, and a specificity of 0.68, all around 0.70 [46, 47], supporting the conclusion that these 19 alerts are closely related to DIIC. This means that, for the collected drugs, the toxic structures highly associated with DIIC can be characterized by these 19 SAs. The IG and PR values of each of the 19 SAs were then weighted to assign an aggregation value to each SA. By matching the 19 SAs against all drugs, the SA feature of each drug was obtained and listed in Table S5A. For the Mordred molecular structure characteristics, 1826 Mordred descriptors were calculated for the 751 selected drugs, and those with a standard deviation <0.10 or Pearson correlation coefficients >0.90 were filtered out, leaving 422 descriptors.
Molecular-related-disease-onset characteristics
First, the DIIC module was defined with 276 DIIC genes that were significantly localized in the human interactome (Z = 12.6, P < 0.0001, STRING database) and whose largest connected component consisted of 250 risk genes (Z = 4.9, P = 1.4 × 10⁻⁶, see Fig. S1).
Meanwhile, the interconnections among the 276 genes were processed with the Louvain algorithm for community network analysis to obtain the gene clusters. When the modularity reached its maximum value of 0.63, 14 gene clusters were formed (listed in Table S2C). Only the eight gene clusters that contained >10 genes and were statistically significant were retained. A cluster was defined as a disease onset cluster if functional annotation yielded its own major biological pathways and functions (P-value < 0.05). We focused on the top 10 enrichment results, and most cluster functions were defined by the top 1 or top 2 entries (see Table S3). In this way, eight DIIC onset clusters were obtained and named C1–C8 (see Table S2D). Particularly noteworthy among them were C4 (drug metabolism-cytochrome P450, P-value = 3.23 × 10⁻²⁴), C2 (plasma lipoprotein remodeling, P-value = 1.89 × 10⁻¹²), and C6 (apoptosis, P-value = 2.28 × 10⁻¹¹). An alternative clustering strategy, multiscale community detection, revealed largely similar clusters (see Table S2E).
When drugs enter the human body, they initially trigger changes in cells or proteins through stages 1–3 (M1: transporter changes, M2: hepatocellular changes, and M3: bile canalicular changes), leading to intrahepatic cholestasis [2]. Subsequently, a series of processes ensue, including cellular self-repair and the stimulation of immune responses to combat the disease. Moreover, the clinical diagnosis of DIIC typically relies on blood tests and ultrasound imaging, with a focus on biochemical indicators such as alkaline phosphatase levels and alterations in tissue morphology. From this perspective, stages M1–M3 are directly related to the initial impacts of drugs on cells and proteins, which are crucial for the diagnosis of DIIC. To determine which of the eight clusters are more closely associated with the onset of DIIC and which parts of the body they are more likely to affect, attention was concentrated on their relation to the M1–M3 stages.
Then, the MF similarities between the eight clusters and the genes involved in the three stages of DIIC were calculated and are shown in Fig. 2. Each of C1–C8 has at least one dark red cell, indicating that every cluster has at least one clear similarity to M1–M3 of DIIC (with good similarity values in the range 0.3–0.6). According to the depth of the red color, C4 and C1 were more similar to the genes of M1, C6 was close to M3, and C7 was close to M2. C2, C3, C5, and C8 were close to both the M2 and M3 stages. The eight clusters obtained by our method therefore represent the DIIC progression from stage M1 to M3 well. Thus, through clustering, DIIC was organized into eight biologically meaningful clusters, each corresponding to particular stages of the disease, which helps in understanding the dynamic changes and potential mechanisms of DIIC.
Figure 2.
Clustering and annotation of the DIIC risk genes. (A) Similarity of eight DIIC clusters and three DIIC stages. (B) Classes of identified pathways that are functionally related are presented as inner circles, with circle size roughly indicating relative class sizes. The overlying outer circle illustrates the DIIC stages with overall preferential molecular function similarity in each cluster (e.g. M2 is close to C2, C5, C7, and C8 because genes in these four classes were more similar to M2). (C) The UMAP distribution map of 751 drugs contained in eight clusters based on the KNN-GCN classification.
Then the average shortest path length between the drug targets and the disease genes of each cluster was calculated with the network proximity algorithm. The resulting eight Zdc values (listed in Table S5C) characterize the association of each drug with DIIC onset and were used as input features of the models.
Multi-dimensional models
Before modeling, two feature sets with the best 10-fold CV accuracy were selected, one containing eight multi-dimensional features and the other eight uni-dimensional features. The multi-dimensional feature set included one SA, two Zdcs, and five Mordred descriptors, and is listed in Table S5A. The uni-dimensional features selected for fair comparison (only Mordred descriptors, structural features, or Zdc features) are listed in Table S5B.
Twelve models (their parameter settings are detailed in Table S6A) were developed from the multi-dimensional and uni-dimensional features with three algorithms, RF, XGB, and SVM, each in 100 random repetitions. Part of their statistical results is shown in Table 1 (see Table S6B for more details). The models based on multi-dimensional features performed better than those based on uni-dimensional features.
Table 1.
Performance of the multi-dimensional-feature models and the uni-dimensional-feature models.
| Data set | Algorithm | Multi-dimensional ACC | Multi-dimensional Precision | Multi-dimensional AUC | Uni-dimensional ACC | Uni-dimensional Precision | Uni-dimensional AUC |
|---|---|---|---|---|---|---|---|
| Training set (n = 342) | RF | 0.799 ± 0.012 | 0.800 ± 0.018 | 0.792 ± 0.013 | 0.701 ± 0.016 | 0.693 ± 0.022 | 0.689 ± 0.016 |
| | XGB | 0.784 ± 0.018 | 0.784 ± 0.024 | 0.776 ± 0.019 | 0.694 ± 0.021 | 0.665 ± 0.028 | 0.687 ± 0.021 |
| | SVM | 0.795 ± 0.014 | 0.811 ± 0.019 | 0.785 ± 0.015 | 0.673 ± 0.016 | 0.669 ± 0.025 | 0.657 ± 0.018 |
| Validation set (n = 114) | RF | 0.801 ± 0.031 | 0.801 ± 0.031 | 0.874 ± 0.027 | 0.702 ± 0.038 | 0.701 ± 0.040 | 0.769 ± 0.036 |
| | XGB | 0.784 ± 0.035 | 0.783 ± 0.036 | 0.848 ± 0.031 | 0.694 ± 0.036 | 0.691 ± 0.035 | 0.757 ± 0.034 |
| | SVM | 0.789 ± 0.032 | 0.790 ± 0.033 | 0.850 ± 0.029 | 0.675 ± 0.038 | 0.675 ± 0.036 | 0.696 ± 0.037 |
| Test set (n = 295) | RF | 0.740 ± 0.014 | 0.800 ± 0.010 | 0.828 ± 0.008 | 0.662 ± 0.013 | 0.752 ± 0.009 | 0.730 ± 0.009 |
| | XGB | 0.711 ± 0.016 | 0.778 ± 0.011 | 0.796 ± 0.015 | 0.658 ± 0.021 | 0.750 ± 0.015 | 0.711 ± 0.019 |
| | SVM | 0.737 ± 0.014 | 0.796 ± 0.009 | 0.828 ± 0.010 | 0.660 ± 0.012 | 0.746 ± 0.008 | 0.688 ± 0.007 |
In Table 1, the average ACC of the three multi-dimensional-feature models was consistently above 0.700, with the RF model achieving an ACC of 0.801 on the validation set. In contrast, the average ACC values of the uni-dimensional-feature models mostly remained at or below about 0.700. Compared with the uni-dimensional-feature models, the multi-dimensional-feature models showed superior predictive ability on all reported performance metrics across the training, validation, and external test sets. This advantage underscores the enhanced capability of multi-dimensional feature characterization to capture the intricate underpinnings of DIIC. Compared with the XGB and SVM models, the RF models were the most robust and reliable, with the highest mean values and among the lowest standard deviations across 100 iterations, and were therefore selected for the next step.
Cluster-specific classification
The KNN-GCN method was used to assign the 751 drugs to the 8 DIIC clusters. The best K was chosen by 5-fold CV of the KNN model (see Table S7 for the results). Although the best performance was achieved at K = 1, this setting is usually not meaningful [48]. Among the remaining values, model performance rose to a peak at K = 9, suggesting that KNN had good predictive power at this value, so K = 9 was chosen to construct the topological network graphs for the GCN (see Table S8 for more details). Then 5502 drug–cluster association pairs between the 751 drugs and 8 DIIC clusters were established, and the results are listed in Table S9. Only the 2671 pairs with prediction probability scores exceeding 0.80, and thus with good reliability, were used for subsequent model development. In this way, the 751 drugs were classified into 8 cluster domains by the KNN-GCN classification method and plotted by UMAP in Fig. 2C. In Fig. 2C, most drugs carry only one color, showing that the 751 drugs were well distinguished across the eight defined domains. Only a few drugs appear in multiple domains (drug dots with two or three colors), indicating that these drugs may play roles in several stages of DIIC onset.
Eight RF models relating the multi-dimensional features to DIIC hepatotoxicity were developed, one in each of the eight domains, each in 100 random repetitions (see Table S10). Model C3, with ACCs maintained above 0.800, was the best among the eight cluster models, as shown in Table 2. The validation set gave an AUC and ACC of 0.903 and 0.837, respectively, and the external test set gave an AUC and ACC of 0.890 and 0.810, respectively, which indicates that Model C3 had strong robustness and predictive capability. In addition, for each domain, an equal number of positive and negative drugs, matched to the number of drugs in that domain, was selected to serve as the corresponding control group. These control groups were then modeled in 100 random repetitions in the same manner as the aforementioned RF models, with the results listed in Table S10. Table S10 shows that each cluster domain model, built after the drugs were classified into their respective clusters, clearly outperforms its control model.
Table 2.
Statistical performance of model C3.
| Data set | Statistic | AUC | ACC | Precision | SE | SP | F1 | MCC |
|---|---|---|---|---|---|---|---|---|
| Training set (n = 114) | Mean | 0.819 | 0.827 | 0.831 | 0.766 | 0.872 | 0.787 | 0.654 |
| | Std | 0.017 | 0.017 | 0.026 | 0.031 | 0.023 | 0.023 | 0.035 |
| Validation set (n = 39) | Mean | 0.903 | 0.837 | 0.838 | 0.832 | 0.880 | 0.832 | 0.670 |
| | Std | 0.034 | 0.045 | 0.047 | 0.046 | 0.057 | 0.046 | 0.092 |
| Test set (n = 152) | Mean | 0.890 | 0.810 | 0.849 | 0.805 | 0.800 | 0.816 | 0.551 |
| | Std | 0.014 | 0.024 | 0.015 | 0.022 | 0.029 | 0.019 | 0.042 |
We also plotted bar charts of ACC and AUC, as shown in Fig. 3. In all eight domains, the models built within the domains after KNN-GCN classification score higher than the control models, indicating that post-classification modeling within each cluster consistently outperforms the control. In particular, Model C2, Model C3, Model C4, and Model C5 had stronger stability and robustness, with better results on the training and validation sets, and Model C2, Model C3, and Model C4 had improved predictive capability, with better results on the external test set. Comparing Model C3 with its control model on AUC and ACC across the training, validation, and external test sets, Model C3 outperforms the control in all cases. On the training set, the AUC and ACC of Model C3 reached 0.819 and 0.827, respectively, compared with 0.753 and 0.761 for the control model. On the validation set, Model C3 achieved 0.903 and 0.837, whereas the control model reached 0.820 and 0.754. On the external test set, Model C3 (0.890 and 0.810) was notably higher than the control model (0.796 and 0.711).
Figure 3.
The ACC and AUC of eight cluster models and their control models.
Moreover, the mean performance of Model C3 was compared with earlier cholestasis prediction models from the literature, as shown in Table 3. Model C3 had higher ACC and AUC values than the models of Pablo et al. [49] and Jiang et al. [50] with a similar number of features and drugs. Although Kotsampasakou et al. [51] used more compounds, Model C3 achieved much better performance with fewer features.
Table 3.
Comparison of Model C3 performance on the external test set with models reported in the literature.
| Reference | Model | n^a | n^b | AUC | ACC | Model interpretation |
|---|---|---|---|---|---|---|
| Pablo et al. [49] | Logical_OR | 419 | 8 | 0.732 | 0.684 | No |
| Jiang et al. [50] | LR | 28 | 13 | 0.716 | 0.700 | Yes |
| Kotsampasakou et al. [51] | MetaCost+IBk | 1904 | 93 | 0.604 | 0.629 | No |
| Model C3 | RF | 751 | 8 | 0.890 | 0.810 | Yes |
^a n, the number of compounds in the data set (including the training, validation, and test sets).
^b n, the number of features used in modeling.
Link of molecular SAs to DIIC pathogenesis
The contributions of the drug features in the multi-dimensional-feature RF model of 751 drugs are shown as SHAP plots in Fig. 4. In SHAP summary plots, the horizontal axis quantifies feature impact through SHAP values, where greater absolute values indicate stronger predictive influence. As shown in Fig. 4A, the SA feature was the dominant predictor, with the largest SHAP magnitude, while the network proximity metrics Zdc4 and Zdc1 ranked second and fourth, respectively, supporting the value of topological drug–disease relationships in DIIC prediction.
Figure 4.
Feature contribution analysis for the overall model. (A) Importance of the features. (B) The SHAP bee-swarm plot.
The SHAP analysis resolved feature-level effects along two dimensions: horizontal displacement denotes the magnitude of the impact (|SHAP| ≥ 0.3 was taken as the significance threshold), and color encodes the direction of the effect on the output (Δoutput > 0: red; Δoutput < 0: blue). The feature ranking in Fig. 4A showed that (i) SA was dominant (SHAP = −0.191 to 0.415, average |SHAP| = 0.145) and (ii) the network topology features Zdc4 and Zdc1 together accounted for 27.07% of the feature importance, quantitatively supporting graph-based proximity as a key determinant of DIIC.
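The quantities reported here (average |SHAP| per feature and the combined share of the proximity features) can be derived from a matrix of per-drug SHAP values. The sketch below shows one way to do this in plain Python; the matrix is a toy example with invented values, `descriptor_x` is a hypothetical placeholder feature, and the study's own importance calculation may differ.

```python
from typing import Dict, Sequence


def mean_abs_shap(shap_rows: Sequence[Sequence[float]], names: Sequence[str]) -> Dict[str, float]:
    """Average |SHAP| per feature over all samples (rows)."""
    n_samples = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(names)
    }


def importance_share(mean_abs: Dict[str, float]) -> Dict[str, float]:
    """Convert mean |SHAP| values into percentages of the total."""
    total = sum(mean_abs.values())
    return {name: 100.0 * value / total for name, value in mean_abs.items()}


# Toy SHAP matrix: 3 hypothetical drugs x 4 features (values are invented).
feature_names = ["SA", "Zdc4", "descriptor_x", "Zdc1"]
shap_values = [
    [0.40, -0.10, 0.05, -0.05],
    [-0.20, 0.10, -0.05, 0.05],
    [0.30, -0.10, 0.05, -0.05],
]

shares = importance_share(mean_abs_shap(shap_values, feature_names))
ranking = sorted(shares, key=shares.get, reverse=True)
proximity_share = shares["Zdc4"] + shares["Zdc1"]
print(ranking[0], round(proximity_share, 2))
```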
In addition, the contributions of the eight features in Models C1–C8 were calculated and plotted as SHAP bee-swarm plots (Fig. S3). SA ranked first in all eight models, and Zdc4 and Zdc1 ranked near the top, so SA, Zdc4, and Zdc1 still played key roles in DIIC toxicity. The results for Models C1–C8 also suggest that the larger the SA value, the greater the risk of DIIC toxicity, and the lower the values of Zdc4 and Zdc1, the higher the risk of DIIC toxicity.
Next, SAs that occurred only in positive drugs and were present in more than five drugs were selected, together with the drugs containing them, to explore their influence on DIIC development. The analytical framework is shown in Fig. 5A, and the important features driving DIIC pathogenesis are summarized in Fig. 5B.
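The selection rule for SAs (present only in positive drugs, and in more than five of them) can be sketched as a simple filter. The record layout, function name, and drug identifiers below are assumptions made for illustration; they are not taken from the study's code or data.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

DrugRecord = Tuple[str, bool, Set[str]]  # (drug id, is DIIC-positive, SA ids)


def select_positive_only_alerts(drugs: Iterable[DrugRecord], min_drugs: int = 5) -> Dict[str, List[str]]:
    """Return SAs seen only in positive drugs and in more than `min_drugs` of them."""
    pos_counts: Counter = Counter()
    neg_counts: Counter = Counter()
    carriers: Dict[str, List[str]] = {}
    for drug_id, is_positive, alerts in drugs:
        for sa in alerts:
            if is_positive:
                pos_counts[sa] += 1
                carriers.setdefault(sa, []).append(drug_id)
            else:
                neg_counts[sa] += 1
    return {
        sa: carriers[sa]
        for sa, n in pos_counts.items()
        if neg_counts[sa] == 0 and n > min_drugs
    }


# Toy records (hypothetical drugs): SA3 occurs in six positives and no negatives.
records = [(f"drug_{i}", True, {"SA3", "SA1"}) for i in range(1, 7)]
records.append(("drug_7", False, {"SA1"}))  # SA1 also occurs in a negative drug
records.append(("drug_8", True, {"SA7"}))   # SA7 occurs in too few drugs

selected = select_positive_only_alerts(records)
print(selected)
```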
Figure 5.
Integrated analysis of drug SAs and DIIC clusters. (A) Flowchart linking SAs and drugs to DIIC pathogenesis. (B) Integrated analysis of how the SAs related to C1, C3, and C4 influence DIIC pathogenesis. (a–c) SA contributions in C1, C4, and C3; (d–f) randomly selected drugs in C1/C4/C3 containing SA3/SA3 or SA12/SA7; (g–i) enrichment analysis of the drug–C1/C4/C3 overlapping genes and the C1/C4/C3 genes; (j–l) pathway analysis.
As shown in Fig. 5B-a, SA3 occurred only in positive drugs in C1. Methoxsalen (CID: 4114) was randomly selected as a prototypical SA3-containing compound (structure shown in Fig. 5B-d). Following the workflow in Fig. 5A, gene enrichment analysis identified four shared KEGG pathways, shown as blue columns labeled 4152, 4920, 4015, and 3320 in Fig. 5B-g (see Table S11-A for details). The PPAR signaling pathway (KEGG:03320, P-value = 1.38 × 10⁻²) ranked fourth, indicating that methoxsalen modulates PPARα and thereby influences bile acid homeostasis (Fig. 5B-j). This agrees with established evidence that methoxsalen inhibits PPARα-mediated regulation of lipid metabolism (C1) [52], inducing bile acid accumulation and cholestatic hepatotoxicity.
As shown in Fig. 5B-b, SA3 and SA12 were found in C4. Fraxinellone (CID: 124039), which contains SA3, and chlorpromazine (CID: 2726), which contains SA12, were selected. Enrichment analysis identified four shared KEGG terms for fraxinellone–C4 (purple columns labeled 980, 5204, 982, and 983 in Fig. 5B-h, left; see Table S11-B for details) and seven shared terms for chlorpromazine–C4 (yellow columns labeled 980, 982, 5204, 140, 1100, 830, and 983 in Fig. 5B-h, right; see Table S11-C for details), which merit attention. In Fig. 5B-h, left, KEGG:00980 (metabolism of xenobiotics by cytochrome P450) ranked first and was significantly enriched (P-value = 1.70 × 10⁻³), indicating that fraxinellone readily causes hepatotoxicity after it is metabolized. As shown in Fig. 5B-k, fraxinellone is metabolized by CYP450 enzymes into a cis-butene structure, which has been reported as a hepatotoxic structure [53]. Similarly, in Fig. 5B-h, right, KEGG:00980 also ranked first, indicating that SA12 in chlorpromazine can be metabolized by CYP450 enzymes into electrophilic intermediates, which then conjugate with glutathione (Fig. 5B-k) and also result in hepatotoxicity [54]. Because parent drugs and their metabolites differ structurally once drugs are metabolized in the liver, it is often the metabolites rather than the parent drugs that are hepatotoxic. For some drugs, this finding further underscores the importance of distinguishing between parent drugs and their metabolites when assessing hepatotoxicity.
As shown in Fig. 5B, SA3 appears exclusively in the positive drugs of both C1 and C4, indicating that the SA3 structure affects DIIC within these two clusters. Meanwhile, as determined above from the SHAP values, Zdc1 and Zdc4 correlate negatively with DIIC hepatotoxicity. Screening for drugs that have markedly low values of both Zdc1 and Zdc4, contain the SA3 structure, and belong to both C1 and C4 identified seven drugs (listed in Table S11-F). Taking lapatinib as an example (Fig. 5B-e, CID: 208908), the overlapping enrichment results (Table S11-D) were highly consistent with the enrichment results of both C1 and C4: metabolism of xenobiotics by cytochrome P450 and the PPAR signaling pathway both remained among the top 10 enriched terms. This indicates that lapatinib is related to C1 and C4. Studies have also shown that the parent drug lapatinib is involved in the PPAR signaling pathway [55], which in turn significantly affects lipid excretion [56], and lapatinib has indeed been reported to cause pronounced DIIC [57]. Its metabolite, 8-hydroxy-lapatinib (Fig. 5B-k), has also been reported to interact with reactive substances such as aldehydes or quinonoid imines, as well as to engage with transport proteins, thereby generating notable toxicity [58].
As shown in Fig. 5B-c, SA7 was identified as a critical determinant of DIIC. Cimetidine (CID: 2756), the prototypical SA7-bearing agent, was selected, and all of the top 10 terms were shared in the cimetidine–C3 enrichment results (orange columns in Fig. 5B-i; see Table S11-E for details), which merits attention. The fifth term, steroid biosynthesis (KEGG:00100, P-value = 3.70 × 10⁻⁴; Table S11-E), is coordinated with HMG-CoA. This network-driven metabolic dysregulation (Fig. 5B-i) offers a mechanistic explanation for the clinical association of cimetidine with cholestatic injury [59], supporting the capacity of SA7 to disrupt sterol metabolism by limiting sterol biosynthesis.
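The "shared terms" reported in this section come from comparing two enrichment results: one for the drug–cluster overlapping genes and one for the cluster genes. A minimal sketch of that comparison is given below, assuming each result is a list of (term id, P-value) pairs. The term ids `T1` to `T5`, the P-values, the top-10 cutoff, and the 0.05 threshold are illustrative assumptions, not outputs of the study's enrichment analysis.

```python
from typing import List, Sequence, Tuple

Term = Tuple[str, float]  # (pathway/term id, enrichment P-value)


def top_terms(terms: Sequence[Term], top_k: int = 10, alpha: float = 0.05) -> List[str]:
    """Ids of the `top_k` most significant terms with P-value below `alpha`."""
    significant = sorted((t for t in terms if t[1] < alpha), key=lambda t: t[1])
    return [term_id for term_id, _ in significant[:top_k]]


def shared_top_terms(drug_cluster_terms: Sequence[Term], cluster_terms: Sequence[Term],
                     top_k: int = 10, alpha: float = 0.05) -> List[str]:
    """Terms in both top lists, ordered by rank in the drug-cluster result."""
    cluster_top = set(top_terms(cluster_terms, top_k, alpha))
    return [t for t in top_terms(drug_cluster_terms, top_k, alpha) if t in cluster_top]


# Toy enrichment results with placeholder term ids and invented P-values.
drug_cluster = [("T1", 0.001), ("T2", 0.02), ("T3", 0.2), ("T4", 0.004)]
cluster_only = [("T2", 0.0001), ("T4", 0.03), ("T5", 0.01), ("T3", 0.001)]

shared = shared_top_terms(drug_cluster, cluster_only)
print(shared)
```

Here `T3` is dropped because it is not significant in the drug–cluster result, and `T1` is dropped because it does not appear in the cluster result.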
Discussion
We developed a multi-dimensional computational framework to predict DIIC for drugs and to systematically decipher the underlying mechanisms. A DIIC disease module was stratified into eight pathological clusters across three progression stages, and network proximity metrics were used to establish drug–cluster interaction signatures. Integrating SAs, Mordred descriptors, and cluster-proximity parameters yielded a robust hepatotoxicity classification model. A KNN-GCN assigned the 751 drugs to clusters, addressing the accuracy of cluster assignment. Interactome expansion revealed statistically significant pathway–cluster associations, mapping drug–DIIC sub-module interactions within their topological domains. This systematic workflow bridges molecular heterogeneity with disease mechanisms, enabling precise toxicity prediction while elucidating cluster-specific toxicity pathways.
Here, the performance of our prediction model is considered in terms of three feature dimensions. Consistent with the concept that structure determines properties, structural descriptors still make a clear contribution to DIIC in the model [12, 18, 60–63]. One example is SAs, which have been extensively investigated across various toxicological endpoints [12] and have also been applied successfully to hepatotoxicity [18]. In particular, SA3 has been repeatedly linked to cholestatic toxicity [18, 62], while SA1, SA8, SA9, and SA12 were also mentioned in earlier cholestasis research [63]. The relationship between molecular targets and risk genes was defined by the topological distances to risk genes in the disease module, quantified as the number of drug targets within varying network steps from module hubs; this strategy was previously validated for hepatotoxicity prediction [17]. In this study, an improved graph-based modularity maximization algorithm identified DIIC risk genes, forming a DIIC module and eight disease pathogenesis clusters. Network proximity values between drug targets and DIIC clusters were calculated to define drug–disease relationships. Clustering DIIC genes generated refined disease-associated gene subsets, enabling precise mapping of molecular-pathway interactions through network proximity metrics. This approach dissected heterogeneous disease progression mechanisms by resolving molecule-specific influences on distinct pathological phases, ultimately contributing to the enhanced accuracy after cluster-specific classification.
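As an illustration of how a network proximity value between drug targets and a gene cluster can be computed, the sketch below uses one common formulation: the mean shortest-path distance from each target to its closest cluster gene, converted to a z-score against randomly drawn node sets. This is an assumption made for illustration. The study's exact definition of Zdc is given in its Methods and may differ, published proximity measures typically use degree-matched rather than uniform random sampling, and the toy network and gene names here are invented.

```python
import random
import statistics
from collections import deque
from typing import Dict, List, Sequence, Set

Graph = Dict[str, Set[str]]


def bfs_distances(graph: Graph, source: str) -> Dict[str, int]:
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closest_distance(graph: Graph, targets: Sequence[str], cluster_genes: Sequence[str]) -> float:
    """Mean over targets of the distance to the nearest cluster gene (connected graph assumed)."""
    total = 0.0
    for target in targets:
        dist = bfs_distances(graph, target)
        total += min(dist[g] for g in cluster_genes if g in dist)
    return total / len(targets)


def proximity_z(graph: Graph, targets: Sequence[str], cluster_genes: Sequence[str],
                n_random: int = 200, seed: int = 0) -> float:
    """z-score of the observed closest distance against uniformly sampled node sets."""
    rng = random.Random(seed)
    nodes: List[str] = sorted(graph)
    observed = closest_distance(graph, targets, cluster_genes)
    null = [
        closest_distance(graph, rng.sample(nodes, len(targets)), rng.sample(nodes, len(cluster_genes)))
        for _ in range(n_random)
    ]
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    return (observed - mu) / sigma if sigma > 0 else 0.0


# Toy interaction network (hypothetical gene names A to H).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("G", "H"), ("B", "E")]
ppi: Graph = {}
for u, v in edges:
    ppi.setdefault(u, set()).add(v)
    ppi.setdefault(v, set()).add(u)

print(closest_distance(ppi, ["A", "H"], ["D", "E"]))
print(proximity_z(ppi, ["B"], ["B", "C"]))
```

Under this formulation a more negative z-score means the targets lie closer to the cluster than expected by chance, which is in line with the negative relationship between Zdc values and DIIC risk reported above.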
The KNN-GCN framework introduces two aspects to overcome graph learning limitations. Unlike conventional GCNs that employ cross-layer sampling, we implement topology-aware mini-batch construction via training graph sampling, which reduces GPU memory consumption. By combining the local topology preservation of KNN with the global feature aggregation of GCN, this hybrid model quantifies drug–disease associations through attention-weighted cluster attribution scores, which makes the drug–cluster mapping interpretable. Applying this framework to DIIC, we achieved mechanistic subtyping and enhanced toxicity prediction. DIIC was stratified into eight etiological clusters via tripartite relational analysis, with the dominant clusters 4, 1, and 3 (n = 476, 435, and 394 drugs, respectively) exhibiting optimal feature representation (by Shapley value). The algorithm-defined applicability domains improved prediction robustness, particularly for rare hepatotoxic events (as measured by F1-score). These advancements underscore the critical role of large-scale interactome datasets in deriving biologically plausible AI models, with cluster-specific performance metrics strongly correlating with data volume.
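The combination of a KNN graph with GCN-style aggregation can be illustrated with a small sketch: build a symmetric k-nearest-neighbor graph from feature vectors, then apply one symmetrically normalized propagation step with self-loops. This is a simplified illustration under stated assumptions. It omits learned weights, attention, and the graph-sampling mini-batches described above, and the feature vectors are invented rather than taken from the study.

```python
import math
from typing import List

Matrix = List[List[float]]


def knn_adjacency(features: Matrix, k: int) -> Matrix:
    """Symmetric 0/1 adjacency linking each sample to its k nearest neighbors (Euclidean)."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        by_distance = sorted((math.dist(features[i], features[j]), j) for j in range(n) if j != i)
        for _, j in by_distance[:k]:
            adj[i][j] = adj[j][i] = 1.0
    return adj


def gcn_propagate(adj: Matrix, features: Matrix) -> Matrix:
    """One aggregation step: D^(-1/2) (A + I) D^(-1/2) X, with no learned weights."""
    n, dim = len(adj), len(features[0])
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if a_hat[i][j]:
                weight = a_hat[i][j] / math.sqrt(degree[i] * degree[j])
                for d in range(dim):
                    out[i][d] += weight * features[j][d]
    return out


# Toy feature vectors for four hypothetical drugs forming two tight pairs.
drug_features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
adjacency = knn_adjacency(drug_features, k=1)
smoothed = gcn_propagate(adjacency, drug_features)
print(adjacency[0], smoothed[0])
```

With k = 1, each drug is linked only to its closest partner, so the propagation step averages features within each pair and leaves the two pairs separate.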
In this study, we employed a cluster-based modeling strategy, where separate models are trained for each cluster. Notably, within each cluster, the positive and negative samples are balanced, which effectively mitigates the common class imbalance issue present in the overall dataset. This intra-cluster balance reduces the need for additional imbalance-handling techniques such as cost-sensitive learning or focal loss. We also experimented with synthetic oversampling techniques like SMOTE combined with weighting of borderline samples within clusters; however, these strategies did not yield significant improvements. This suggests that the cluster-based division itself provides a natural balancing effect that leads to robust model performance. While alternative methods such as focal loss, cost-sensitive learning, and ensemble-based rebalancing have shown promise in handling class imbalance in other contexts, their application may be less straightforward or redundant under our cluster-wise balanced framework. Future work may explore these methods in scenarios where global imbalance persists or clustering is not feasible.
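The intra-cluster balance described above can be checked by computing the fraction of positive drugs in each cluster. The sketch below assumes cluster assignments are available as (cluster, label) pairs; the assignments shown are invented, and a drug assigned to several clusters would simply appear once per cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def positive_fraction_by_cluster(assignments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of DIIC-positive drugs among the drugs assigned to each cluster."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # cluster -> [positives, total]
    for cluster, is_positive in assignments:
        counts[cluster][0] += int(is_positive)
        counts[cluster][1] += 1
    return {cluster: pos / total for cluster, (pos, total) in counts.items()}


# Toy assignments (hypothetical): both clusters are perfectly balanced.
toy_assignments = [
    ("C1", True), ("C1", False),
    ("C3", True), ("C3", True), ("C3", False), ("C3", False),
]
balance = positive_fraction_by_cluster(toy_assignments)
print(balance)
```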
With the enrichment and expansion of research data, the number of nodes and edges in biological networks such as protein–protein interactions, molecular interactions, and molecular–disease relationships is increasing, and “network community discovery” research has made it possible to construct hierarchical representations of biological structure and function directly from networks. Butler et al. applied community detection and pattern recognition methods to hierarchically organize complex network structures, and such methods have been applied successfully to cataloging omics profiles [64, 65]. We used “network community discovery” to cluster the onset of disease. The resulting 8 clusters not only defined molecular characteristics but also served to assign molecules to applicability domains. By combining multi-dimensional features with an analysis of the significant features, the relationship between molecular structure fragments and pathways could be identified within each applicability domain. The enrichment results of the key molecular targets then helped to relate those pathways to the onset of disease, which leads to better interpretability. To better capture the temporal dynamics of DIIC, future studies should focus on incorporating longitudinal clinical data to model toxicity progression across different disease stages, while also developing time-resolved molecular signatures that reflect the evolving “time-dose-concentration-effect” relationships.
Conclusion
The multi-dimensional computational framework for DIIC developed here, which integrates molecular structure characteristics with disease pathogenesis, improved prediction accuracy and helped to explore the mechanisms of drug-induced disease. The method not only provides good prediction of DIIC for a new compound but also allows the mechanisms involved in toxicity to be deduced logically. Because DILI datasets are scarce and DIIC studies are lacking, the sample size used in this study is still relatively small, and some clusters contain few samples. With more DIIC test data for drugs and more research on known drug targets, the datasets can be expanded, and predictions and classifications can become more accurate. Nevertheless, the multi-dimensional computational framework established here provides a new transferable mode of pathway-anchored analysis for exploring drug toxicity in complex diseases.
Key Points
We propose the first multi-dimensional modeling framework that synergistically integrates molecular structural fingerprints with disease pathogenesis clusters to predict drug-induced hepatotoxicity.
Methodological advancement: The KNN-GCN-based disease cluster classification algorithm not only achieved superior predictive accuracy but also systematically linked substructural toxicity to mechanistic pathways.
Performance superiority: Our multi-dimensional model outperformed single-feature approaches (ΔACC >7%, ΔAUC >6%) by unifying structural descriptors and network-proximity metrics.
Mechanistic generalizability: The framework provides a transferable paradigm for anchoring toxicity mechanisms to specific disease clusters, as validated by pathway enrichment of SA-associated genes.
Supplementary Material
Contributor Information
Huayu Zhong, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Juanji Wang, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Xiaoxiao Liu, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Xiaoyun Wei, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Chengcheng Zhou, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Taiyan Zou, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China; Chongqing Engineering Research Center for Clinical Big Data and Drug Evaluation, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 401331, P. R. China; Medical Data Science Academy, College of Medical Informatics, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Xin Han, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Lingyun Mo, The Guangxi Key Laboratory of Theory and Technology for Environmental Pollution Control, College of Environmental Science and Engineering, Guilin University of Technology, No. 12 Liutai Avenue, Xiufeng District, Guilin 541004, P. R. China; Technical Innovation Center for Mine Geological Environment Restoration Engineering in Shishan Area of South China, Ministry of Natural Resources, No. 8 Minzu Road, Xingning District, Nanning 530028, P. R. China.
Wenling Qin, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chemical Biology Research Center, School of Pharmaceutical Sciences, Chongqing University, No. 174 Shazheng Street, Shapingba District, Chongqing 401331, P. R. China.
Yonghong Zhang, College of Pharmacy, Chongqing Medical University, No. 1 Yixueyuan Road, Yuzhong District, Chongqing 400016, P. R. China.
Funding
This research was financially supported by the National Natural Science Foundation of China (grant numbers: 22176020, 32072262), the Graduate Research and Innovation Project of Chongqing Municipal Education Commission (CYS23357), the CQMU Program for Youth Innovation in Future Medicine (W0181), and the Intelligent Medicine Research Project of Chongqing Medical University (YJSZHYX202203).
Data availability
The code is available at https://github.com/123zhyyy/cluster_model/tree/main. All scripts were written in Python 3.9 or Python 2.7. All model training was performed on a workstation equipped with an Intel® Core™ i7-9700 CPU (3.00 GHz) and an NVIDIA RTX 4070 GPU, with the complete training process requiring ~24 h. All molecules with SMILES notation were retrieved from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/) and standardized using the RDKit package. The structural alerts were obtained using SARpy (W-SARpy 1.0, VEGA HUB). The data that support the findings of this study are available in the main text and/or in the Supplementary material.
References
- 1. Tran TTV, Surya Wibowo A, Tayara H. et al. Artificial intelligence in drug toxicity prediction: recent advances, challenges, and future perspectives. J Chem Inf Model 2023;63:2628–43. 10.1021/acs.jcim.3c00200
- 2. Gijbels E, Vilas-Boas V, Deferm N. et al. Mechanisms and in vitro models of drug-induced cholestasis. Arch Toxicol 2019;93:1169–86. 10.1007/s00204-019-02437-2
- 3. Petrov PD, Fernández-Murga ML, López-Riera M. et al. Predicting drug-induced cholestasis: preclinical models. Expert Opin Drug Metab Toxicol 2018;14:721–38. 10.1080/17425255.2018.1487399
- 4. Wu Z, Lei T, Shen C. et al. ADMET evaluation in drug discovery. 19. Reliable prediction of human cytochrome P450 inhibition using artificial intelligence approaches. J Chem Inf Model 2019;59:4587–601. 10.1021/acs.jcim.9b00801
- 5. Muller C, Pekthong D, Alexandre E. et al. Prediction of drug induced liver injury using molecular and biological descriptors. Comb Chem High Throughput Screen 2015;18:315–22. 10.2174/1386207318666150305144650
- 6. Moreno-Torres M, López-Pascual E, Rapisarda A. et al. Novel clinical phenotypes, drug categorization, and outcome prediction in drug-induced cholestasis: analysis of a database of 432 patients developed by literature review and machine learning support. Biomed Pharmacother 2024;174:116530. 10.1016/j.biopha.2024.116530
- 7. AbdulHameed MDM, Liu R, Wallqvist A. Using a graph convolutional neural network model to identify bile salt export pump inhibitors. ACS Omega 2023;8:21853–61. 10.1021/acsomega.3c01583
- 8. Zhang S, Wang Z, Chen J. et al. Multimodal model to predict tissue-to-blood partition coefficients of chemicals in mammals and fish. Environ Sci Technol 2024;58:1944–53. 10.1021/acs.est.3c08016
- 9. Ai H, Chen W, Zhang L. et al. Predicting drug-induced liver injury using ensemble learning methods and molecular fingerprints. Toxicol Sci 2018;165:100–7. 10.1093/toxsci/kfy121
- 10. Shin HK, Kang MG, Park D. et al. Development of prediction models for drug-induced cholestasis, cirrhosis, hepatitis, and steatosis based on drug and drug metabolite structures. Front Pharmacol 2020;11:67. 10.3389/fphar.2020.00067
- 11. Lee S, Yoo S. InterDILI: interpretable prediction of drug-induced liver injury through permutation feature importance and attention mechanism. J Cheminform 2024;16:1. 10.1186/s13321-023-00796-8
- 12. Wu Z, Jiang D, Wang J. et al. Mining toxicity information from large amounts of toxicity data. J Med Chem 2021;64:6924–36. 10.1021/acs.jmedchem.1c00421
- 13. Ferrari T, Cattaneo D, Gini G. et al. Automatic knowledge extraction from chemical structures: the case of mutagenicity prediction. SAR QSAR Environ Res 2013;24:365–83. 10.1080/1062936X.2013.773376
- 14. Zhang C, Cheng F, Li W. et al. In silico prediction of drug induced liver toxicity using substructure pattern recognition method. Mol Inform 2016;35:136–44. 10.1002/minf.201500055
- 15. Su R, Wu H, Liu X. et al. Predicting drug-induced hepatotoxicity based on biological feature maps and diverse classification strategies. Brief Bioinform 2021;22:428–37. 10.1093/bib/bbz165
- 16. Dai W, Tang T, Dai Z. et al. Probing the mechanism of hepatotoxicity of hexabromocyclododecanes through toxicological network analysis. Environ Sci Technol 2020;54:15235–45. 10.1021/acs.est.0c03998
- 17. Tang T, Gan X, Zhou L. et al. Exploring the hepatotoxicity of drugs through machine learning and network toxicological methods. Curr Bioinform 2023;18:484–96. 10.2174/1574893618666230316122534
- 18. Jia X, Wen X, Russo DP. et al. Mechanism-driven modeling of chemical hepatotoxicity using structural alerts and an in vitro screening assay. J Hazard Mater 2022;436:129193. 10.1016/j.jhazmat.2022.129193
- 19. McLoughlin KS, Jeong CG, Sweitzer TD. et al. Machine learning models to predict inhibition of the bile salt export pump. J Chem Inf Model 2021;61:587–602. 10.1021/acs.jcim.0c00950
- 20. Rosenthal SB, Wang H, Shi D. et al. Mapping the gene network landscape of Alzheimer’s disease through integrating genomics and transcriptomics. PLoS Comput Biol 2022;18:e1009903. 10.1371/journal.pcbi.1009903
- 21. Terwilliger TC, Liebschner D, Croll TI. et al. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat Methods 2024;21:110–6. 10.1038/s41592-023-02087-4
- 22. Le NQK, Tran T-X, Nguyen P-A. et al. Recent progress in machine learning approaches for predicting carcinogenicity in drug development. Expert Opin Drug Metab Toxicol 2024;20:621–8. 10.1080/17425255.2024.2356162
- 23. Masumshah R, Eslahchi C. DPSP: a multimodal deep learning framework for polypharmacy side effects prediction. Bioinform Adv 2023;3:vbad110. 10.1093/bioadv/vbad110
- 24. Russo DP, Aleksunes LM, Goyak K. et al. Integrating concentration-dependent toxicity data and toxicokinetics to inform hepatotoxicity response pathways. Environ Sci Technol 2023;57:12291–301. 10.1021/acs.est.3c02792
- 25. Masumshah R, Aghdam R, Eslahchi C. A neural network-based method for polypharmacy side effects prediction. BMC Bioinformatics 2021;22:385. 10.1186/s12859-021-04298-y
- 26. Le NQK. Predicting emerging drug interactions using GNNs. Nat Comput Sci 2023;3:1007–8. 10.1038/s43588-023-00555-7
- 27. Zeng H, Zhou X, Srivastava D. et al. GraphSAINT: Graph Sampling Based Inductive Learning Method. 2019. arXiv preprint arXiv:1907.04931.
- 28. Wang X, Zhu M, Bo D. et al. AM-GCN: adaptive multi-channel graph convolutional networks. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York: ACM; 2020. p. 1243–53. 10.1145/3394486.3403177
- 29. Wang X, Cheng Y, Yang Y. et al. Multitask joint strategies of self-supervised representation learning on biomedical networks for drug discovery. Nat Mach Intell 2023;5:445–56. 10.1038/s42256-023-00640-6
- 30. Dong J, Peng Q, Deng L. et al. iMS2Net: a multiscale networking methodology to decipher metabolic synergy of organism. iScience 2022;25:104896. 10.1016/j.isci.2022.104896
- 31. Tang M, Wu ZE, Li F. Integrating network pharmacology and drug side-effect data to explore mechanism of liver injury-induced by tyrosine kinase inhibitors. Comput Biol Med 2024;170:108040. 10.1016/j.compbiomed.2024.108040
- 32. Zhao C, Dong J, Deng L. et al. Molecular network strategy in multi-omics and mass spectrometry imaging. Curr Opin Chem Biol 2022;70:102199. 10.1016/j.cbpa.2022.102199
- 33. Chen X, Zhou B, Jiang X. et al. Drug repurposing to tackle parainfluenza 3 based on multi-similarities and network proximity analysis. Front Pharmacol 2024;15:1428925. 10.3389/fphar.2024.1428925
- 34. Wang X, Xin B, Tan W. et al. DeepR2cov: deep representation learning on heterogeneous drug networks to discover anti-inflammatory agents for COVID-19. Brief Bioinform 2021;22:bbab226. 10.1093/bib/bbab226
- 35. Wang H, Zhang J, Lu Z. et al. Identification of potential therapeutic targets and mechanisms of COVID-19 through network analysis and screening of chemicals and herbal ingredients. Brief Bioinform 2022;23:bbab373. 10.1093/bib/bbab373
- 36. Chen M, Suzuki A, Thakkar S. et al. DILIrank: the largest reference drug list ranked by the risk for developing drug-induced liver injury in humans. Drug Discov Today 2016;21:648–53. 10.1016/j.drudis.2016.02.015
- 37. LiverTox. Clinical and Research Information on Drug-Induced Liver Injury. Bethesda, MD: National Institute of Diabetes and Digestive and Kidney Diseases, 2012.
- 38. Moriwaki H, Tian YS, Kawashita N. et al. Mordred: a molecular descriptor calculator. J Cheminform 2018;10:4. 10.1186/s13321-018-0258-y
- 39. Kolberg L, Raudvere U, Kuzmin I. et al. g:Profiler-interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res 2023;51:W207–12. 10.1093/nar/gkad347
- 40. Singhal A, Cao S, Churas C. et al. Multiscale community detection in cytoscape. PLoS Comput Biol 2020;16:e1008239. 10.1371/journal.pcbi.1008239
- 41. Zheng F, Zhang S, Churas C. et al. HiDeF: identifying persistent structures in multiscale 'omics data. Genome Biol 2021;22:21. 10.1186/s13059-020-02228-4
- 42. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2016. p. 785–94. 10.1145/2939672.2939785
- 43. Svetnik V, Liaw A, Tong C. et al. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 2003;43:1947–58. 10.1021/ci034160g
- 44. Noble WS. What is a support vector machine? Nat Biotechnol 2006;24:1565–7. 10.1038/nbt1206-1565
- 45. Yu G, Li F, Qin Y. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 2010;26:976–8. 10.1093/bioinformatics/btq064
- 46. Greene N, Fisk L, Naven RT. et al. Developing structure-activity relationships for the prediction of hepatotoxicity. Chem Res Toxicol 2010;23:1215–22. 10.1021/tx1000865
- 47. Pizzo F, Lombardo A, Manganaro A. et al. A new structure-activity relationship (SAR) model for predicting drug-induced liver injury, based on statistical and expert-based structural alerts. Front Pharmacol 2016;7:442. 10.3389/fphar.2016.00442
- 48. Bzdok D, Krzywinski M, Altman N. Machine learning: supervised methods. Nat Methods 2018;15:5–6. 10.1038/nmeth.4551
- 49. Rodríguez-Belenguer P, Mangas-Sanjuan V, Soria-Olivas E. et al. Integrating mechanistic and toxicokinetic information in predictive models of cholestasis. J Chem Inf Model 2024;64:2775–88. 10.1021/acs.jcim.3c00945
- 50. Jiang J, van Ertvelde J, Ertaylan G. et al. Unraveling the mechanisms underlying drug-induced cholestatic liver injury: identifying key genes using machine learning techniques on human in vitro data sets. Arch Toxicol 2023;97:2969–81. 10.1007/s00204-023-03583-4
- 51. Kotsampasakou E, Ecker GF. Predicting drug-induced cholestasis with the help of hepatic transporters-an in silico modeling approach. J Chem Inf Model 2017;57:608–15. 10.1021/acs.jcim.6b00518
- 52. Zhao G, Xu D, Yuan Z. et al. 8-Methoxypsoralen disrupts MDR3-mediated phospholipids efflux and bile acid homeostasis and its relevance to hepatotoxicity. Toxicology 2017;386:40–8. 10.1016/j.tox.2017.05.011
- 53. Wang S, Bao J, Li J. et al. Fraxinellone induces hepatotoxicity in zebrafish through oxidative stress and the transporters pathway. Molecules 2022;27:2647. 10.3390/molecules2709264
- 54. Wen B, Zhou M. Metabolic activation of the phenothiazine antipsychotics chlorpromazine and thioridazine to electrophilic iminoquinone species in human liver microsomes and recombinant P450s. Chem Biol Interact 2009;181:220–6. 10.1016/j.cbi.2009.05.014
- 55. Zhang W, Liu J, Ren X. et al. Identification of the novel markers of PPAR signalling affecting immune microenvironment and immunotherapy response of lung adenocarcinoma patients. J Cell Mol Med 2024;28:e17877. 10.1111/jcmm.17877
- 56. Traxl A, Mairinger S, Filip T. et al. Inhibition of ABCB1 and ABCG2 at the mouse blood-brain barrier with marketed drugs to improve brain delivery of the model ABCB1/ABCG2 substrate [11C]erlotinib. Mol Pharm 2019;16:1282–93. 10.1021/acs.molpharmaceut.8b01217
- 57. Bunchorntavakul C, Reddy KR. Drug hepatotoxicity: newer agents. Clin Liver Dis 2017;21:115–34. 10.1016/j.cld.2016.08.009
- 58. Castellino S, O'Mara M, Koch K. et al. Human metabolism of lapatinib, a dual kinase inhibitor: implications for hepatotoxicity. Drug Metab Dispos 2012;40:139–50. 10.1124/dmd.111.040949
- 59. García-García M, Liarte S, Gómez-González NE. et al. Cimetidine disrupts the renewal of testicular cells and the steroidogenesis in a hermaphrodite fish. Comp Biochem Physiol C Toxicol Pharmacol 2016;189:44–53. 10.1016/j.cbpc.2016.07.004
- 60. King TE, Humphrey JR, Laughton CA. et al. Optimizing excipient properties to prevent aggregation in biopharmaceutical formulations. J Chem Inf Model 2024;64:265–75. 10.1021/acs.jcim.3c01898
- 61. Liang Y, Huangfu X, Huang R. et al. Machine learning for predicting halogen radical reactivity toward aqueous organic chemicals. J Hazard Mater 2024;472:134501. 10.1016/j.jhazmat.2024.134501
- 62. Carter LE, Bugiel S, Nunnikhoven A. et al. Comparative genomic analysis of Fischer F344 rat livers exposed for 90 days to 3-methylfuran or its parental compound furan. Food Chem Toxicol 2024;184:114426. 10.1016/j.fct.2023.114426
- 63. Firman JW, Pestana CB, Rathman JF. et al. A robust, mechanistically based in silico structural profiler for hepatic cholestasis. Chem Res Toxicol 2021;34:641–55. 10.1021/acs.chemrestox.0c00465
- 64. Butler A, Hoffman P, Smibert P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018;36:411–20. 10.1038/nbt.4096
- 65. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 2019;20:273–82. 10.1038/s41576-018-0088-9
Data Availability Statement
The code is available at https://github.com/123zhyyy/cluster_model/tree/main. All scripts were written in Python 3.9 or Python 2.7. All model training was performed on a workstation equipped with an Intel® Core™ i7-9700 CPU (3.00 GHz) and an NVIDIA RTX 4070 GPU, with the complete training process requiring ~24 h. SMILES notations for all molecules were retrieved from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/) and standardized using the RDKit package. The structural alerts were obtained using SARpy (W-SARpy 1.0, VEGA HUB). The data that support the findings of this study are available in the main text and/or in the Supplementary material.









