Abstract
The potential association data between drugs, genes, and diseases is sparse and complex. Existing models find it difficult to effectively handle the problem of heterogeneous relationships and multi-source data fusion simultaneously, resulting in limited accuracy and generalization of association prediction. To address this problem, we propose a fusion method of relational graph convolutional network (R-GCN) and eXtreme Gradient Boosting (XGBoost). First, a heterogeneous graph containing drug, gene, and disease nodes and their relationships is constructed. The features of different types of nodes are aggregated and represented by R-GCN to generate high-quality node embeddings. Then, the embedded features of the drug-gene-disease triples are input into the XGBoost model for training to achieve the association prediction task. The findings demonstrate that the model’s area under the curve reaches 0.92, and the F1 score reaches 0.85, indicating strong predictive ability. This method solves the problem of association prediction in complex biological networks and brings new technological support for precision medicine.
Keywords: relational graph convolutional network, XGBoost model, heterogeneous graph construction, feature embedding fusion, association prediction
Introduction
The association between drugs, genes, and diseases is an essential field in life science research. It directly affects drug development and the functional analysis of genes and is of great significance for the diagnosis, treatment, and prognosis evaluation of diseases. Drugs regulate the occurrence and development of diseases by acting on specific genes [1, 2], which involve complex biological networks [3] and contain a wealth of potential information. Currently, research in this field faces three main difficulties. The first is data sparsity. Although a large amount of drug, gene, and disease-related data has been accumulated in biomedical databases, most of these data are local and isolated and have serious omissions, which makes it difficult for traditional statistical analysis [4] or simple machine learning [5, 6] methods to effectively mine hidden associations. The second is relationship heterogeneity. The relationships between drugs, genes, and diseases are complex and diverse, and different relationship types affect both the strength of association and the construction of prediction models. Many existing models assume a single relationship type and cannot comprehensively model multi-type and multi-scale biological relationships. The third is model generalization ability. In practical applications, prediction models must perform well on known data and reliably infer unknown or unobserved drug-gene-disease [7] triplets. However, existing methods often show low robustness and generalization performance when processing new data. These difficulties have severely limited the improvement of drug R&D (research and development) efficiency and the advancement of precision medicine. Therefore, building a new association prediction [8, 9] model that can handle complex heterogeneous relationships and effectively deal with data sparsity has become one of the core challenges in this field.
To address these challenges, this paper adopts a drug-gene-disease triple association prediction model based on the fusion of heterogeneous graph neural networks (HGNNs) and ensemble learning methods. The heterogeneous graph construction method uniformly models the drug, gene, and disease nodes and their multi-type relationships and constructs a heterogeneous network containing diverse nodes and edges. Based on this heterogeneous network, R-GCN is used to extract features and model relationships of different types of nodes, capture the complex relationships between drugs, genes, and diseases through the message-passing mechanism, and generate high-dimensional node embedding representations. On this basis, an embedding feature fusion method is designed to combine the embedding representations of drug, gene, and disease nodes to form a feature vector representing the triple relationship. Then, XGBoost is used as a classifier to train the triple feature vector to predict potential associations accurately. This study organically integrates HGNNs with ensemble learning methods. It successfully solves the problems of data sparsity and relationship heterogeneity by combining the advantages of both. This paper systematically explores integrating different data types, modeling heterogeneous relationships, and optimizing prediction tasks in model design, providing a new technical solution for association prediction in complex biological networks. These research works enrich the theoretical and methodological system of drug-gene-disease association prediction and lay a solid foundation for promoting the development of precision medicine and personalized treatment development.
The novelty of this study lies in the integrated use of relational graph convolutional networks (R-GCN) and ensemble learning (XGBoost) for predicting drug-gene-disease associations within a unified, end-to-end framework. Unlike existing approaches that typically model pairwise associations or rely on homogeneous data representations, our method constructs a heterogeneous graph to capture the multi-relational structure inherent in biomedical networks. We introduce a hierarchical attention-enhanced R-GCN to generate expressive node embeddings that preserve semantic and structural heterogeneity. These embeddings are then optimized through dimensionality reduction and sample-sensitive strategies and finally used as input to a regularized XGBoost classifier, ensuring robust prediction even under high data sparsity. Additionally, the model incorporates a fusion strategy combining Bagging and Stacking, which further enhances predictive performance. This is the first approach to synergize these techniques in a biological association prediction context, offering a scalable and generalizable framework for precision medicine applications.
Related work
The application and progress of ensemble learning methods in many fields are becoming increasingly significant, and they have shown great potential in solving complex data problems. Yang [10] et al. studied the development of ensemble learning in the era of deep learning through data analysis, methodological discussion, and the latest progress review, revealing the inherent problems and technical challenges of ensemble deep learning. In ensemble deep learning research, more and more scholars have begun to focus on effectively integrating different learning models to enhance the system’s prediction accuracy and generalization ability while coping with challenges such as high-dimensional data and class imbalance. Wu [11] et al. used genetic algorithms to construct an integrated learning framework to improve the prediction performance of synergistic drug combinations by addressing the class imbalance and high-dimensional input data problems of drug combination datasets and verified its predictive ability through cell proliferation experiments. The effectiveness of the ensemble learning [12] method is also reflected in the integration with other data representation technologies, and the model’s prediction ability is further enhanced in the application of graph embedding methods. Wang [13] et al. used ensemble learning techniques and graph embedding to establish a model for predicting gene-disease associations, enhancing the predictive performance of gene-disease associations by embedding representations and combining different data sources. These studies have demonstrated the wide application of ensemble learning methods in multiple fields. However, problems such as high method complexity, poor model interpretability, and lack of unified standards in practical applications remain.
As an important tool, heterogeneous graph networks have been widely used in constructing biomedical knowledge graphs and relationship prediction. Li [14] et al. used an HGNN framework based on an attention mechanism to deal with the embedding problem of knowledge graphs. By aggregating entity features of different semantic aspects and assigning appropriate weights, they performed better than other advanced methods in three real-world knowledge graph experiments. As the complexity and diversity of biomedical data grow, researchers are increasingly favoring more fine-grained graph attention mechanisms to capture deeper underlying associations. Li [15] et al. used a hierarchical graph attention network method and predicted the association between miRNA and disease by constructing a miRNA-disease-lncRNA heterogeneous graph and integrating node-level attention and semantic-level attention, indicating that the model had excellent performance in miRNA-disease association prediction. Liu [16] et al. used the Disentangled Graph Attention Heterogeneous Biological Memory Network method to predict and analyze disease-related miRNAs, showing that this method can effectively capture and utilize complex potential biological data associations, significantly improving the prediction performance of disease-related miRNAs. These studies still face the issue of effectively handling more diverse heterogeneous data and building larger-scale graphs, which requires further improvement and optimization. Bo-Wei Zhao [17] et al. developed a novel graph-based framework leveraging advanced network embedding techniques to capture latent associations among drugs, genes, and diseases. Their method improved predictive accuracy on benchmark datasets but struggled with data sparsity issues. B. W. Zhao [18] et al. introduced a deep learning approach integrating multi-view data sources to better model complex interactions. However, the model required extensive computational resources and lacked interpretability. Xiaorui Su [19] et al. proposed a heterogeneous graph transformer incorporating semantic attention mechanisms to weigh diverse relationships. While effective, their approach did not combine ensemble learning strategies for classification. Bo-Wei Zhao [20] et al. applied tensor decomposition methods to jointly analyze drug-gene-disease triplets, achieving promising results but with limitations in scalability.
Recent studies have significantly advanced the modeling of complex biological networks by leveraging HGNNs and ensemble learning strategies. For instance, Bo-Wei Zhao et al. [21] introduced enhanced relational attention mechanisms within heterogeneous graphs to capture multi-type interactions among biomedical entities better, improving embedding expressiveness. Ensemble learning methods have also evolved to address challenges unique to biological data, such as class imbalance and high dimensionality; [22] demonstrated how tailored ensemble frameworks can boost prediction robustness and generalization in multi-source biological datasets. Moreover, sophisticated triple-association inference frameworks, as discussed in Jike Wang et al. [23], utilize multitask graph embedding approaches to jointly infer relationships among drugs, genes, and diseases, highlighting the benefit of integrating multiple relational views. Additionally, multi-omics data integration through heterogeneous graph representations has been refined, enabling more comprehensive feature extraction across diverse biological scales. Bo-Wei Zhao [24] et al. further summarize the architectural innovations in biomedical knowledge graphs, emphasizing the need for models that efficiently handle heterogeneous and sparse data.
Surendar Rama Sitaraman et al. (2024) demonstrate improved classification accuracy through ensemble learning in diagnosing inflammatory bowel disease (IBD). Leveraging this insight, our proposed model combines ensemble learning with heterogeneous graph networks to boost prediction accuracy and stability in discovering triple associations. This leads to more reliable biomedical knowledge extraction and facilitates effective drug discovery pipelines [25]. Sunil Kumar Alavilli et al. (2023) proposed a machine learning-integrated approach for enhancing drug discovery specific to lung cancer, highlighting the potential of data-driven models in uncovering therapeutic insights. Inspired by their disease-centric predictive modeling, our study generalizes this concept to a broader tripartite prediction framework. It aims to find complex interactions among drugs, genes, and diseases using a hybrid of ensemble learning and heterogeneous graph networks [26]. A hybrid Transformer-RNN-GNN model for command verification and attack detection, as Basani et al. (2020) proposed, inspires our use of graph-based hybrid approaches combined with ensemble learning to predict drug-gene-disease associations. This integration enhances accuracy and robustness in modeling complex biological networks [27].
Our proposed model complements these advances by fusing R-GCN with the powerful ensemble learning classifier XGBoost. This integration uniquely addresses the complexity of heterogeneous biological relationships and the challenges of data sparsity, achieving superior predictive performance and generalization compared to standalone graph or ensemble models.
Methods
Heterogeneous graph construction and R-GCN model
Constructing a heterogeneous graph expressing different types of nodes and relationships is crucial in predicting drugs, genes, and diseases. The heterogeneous graph contains three types of nodes: drugs, genes, and diseases, as well as complex and multiple relationships between them. The attributes of each node type have unique characteristics. Chemical structures represent the characteristics of drug nodes; gene nodes are represented by gene function annotations; disease nodes are represented by clinical characteristics or phenotypic information. These node attributes are converted into vector representations through appropriate processing methods to provide input for subsequent graph neural network models.
When constructing a heterogeneous graph, the relationships between drugs, genes, and diseases need to be modeled as edges. The type and weight of the edge represent different ways of interaction between nodes, including drug-gene interactions, gene-disease genetic associations, and drug-disease therapeutic effects. These relationships are usually heterogeneous, and each involves different semantics and feature information, so the edge types of the graph also need to be precisely defined and labeled.
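The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the node identifiers, the relation names, and the `build_hetero_graph` helper are all hypothetical, and edge weights are omitted for brevity.

```python
from collections import defaultdict

# Hypothetical typed edges (source, relation, target); identifiers are placeholders.
EDGES = [
    ("drug:D1", "drug-gene", "gene:G1"),
    ("drug:D2", "drug-gene", "gene:G1"),
    ("gene:G1", "gene-disease", "disease:S1"),
    ("drug:D1", "drug-disease", "disease:S1"),
]


def node_type(node):
    """Read the node type from the 'type:id' naming convention assumed here."""
    return node.split(":", 1)[0]


def build_hetero_graph(edges):
    """Return {relation: {node: set(neighbors)}}, storing each edge in both directions."""
    graph = defaultdict(lambda: defaultdict(set))
    for src, rel, dst in edges:
        graph[rel][src].add(dst)
        graph[rel][dst].add(src)
    return graph


graph = build_hetero_graph(EDGES)
# Neighbors of gene G1 under the drug-gene relation only.
neighbors_of_g1 = sorted(graph["drug-gene"]["gene:G1"])
```

Keeping one adjacency structure per relation is what later allows a relation-specific weight matrix to be applied to each edge type.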
The role of R-GCN in this framework is to learn the features of each node in the heterogeneous graph through graph convolution operations. Compared with traditional graph convolutional networks (GCNs), R-GCN can handle multiple edge relationships, so it can simultaneously utilize the diverse interactions between drugs, genes, and diseases to generate richer node representations. Figure 1 shows the overall process of heterogeneous graph construction and R-GCN model node update.
Figure 1.
Schematic diagram of heterogeneous graph and R-GCN node update.
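The key idea of the R-GCN model is to update the node features based on the neighbor node information of each node. Considering node $i$ and the set of neighbour nodes $\mathcal{N}_i^r$ connected to it, each neighbour node $j$ transmits information to node $i$ through edge type $r$. R-GCN uses different weight matrices for each edge type to aggregate the features of neighbour nodes to obtain a new representation of the node. The node feature update rule of R-GCN is:

$$h_i^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right) \tag{1}$$

where $h_i^{(l)}$ is the representation of node $i$ at layer $l$, $\mathcal{R}$ is the set of edge types, $W_r^{(l)}$ is the weight matrix for edge type $r$, $W_0^{(l)}$ is the self-connection weight matrix, and $\sigma$ is the activation function.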
In Equation (1), the aggregation term involves normalization constants $c_{i,r}$, which control the influence of neighbouring nodes connected by edge type $r$. Specifically, $c_{i,r}$ is computed as:

$$c_{i,r} = |\mathcal{N}_i^r|$$

where $|\mathcal{N}_i^r|$ denotes the number of neighbors of node $i$ under relation $r$. This normalization prevents nodes with many neighbors from disproportionately dominating the feature aggregation, thus stabilizing the learning process and enabling better representation of heterogeneous relationships. The design of this normalization follows the original R-GCN formulation. Intuitively, each neighbor's contribution is scaled inversely by the connectivity of the receiving node, maintaining a balanced message passing across nodes with varying degrees.
Here, the ReLU function is used:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$

Here, $x$ is the input value. R-GCN captures the high-order relationship features between nodes through multi-layer stacked graph convolutions and updates the node representation layer by layer so that the node features better reflect its structure and semantic information in the heterogeneous graph. The network’s output is the embedding representation of each node, which contains the high-order semantic information of drug, gene, and disease nodes. These embedding representations effectively capture the complex relationships between nodes and provide high-quality input features for subsequent association prediction tasks.
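A single R-GCN update of this kind can be sketched with NumPy. The sketch assumes the per-relation normalization is the neighbor count of the receiving node, as in the original R-GCN formulation, and uses toy weight matrices; the function name `rgcn_layer` and all values are illustrative, not taken from the paper.

```python
import numpy as np


def rgcn_layer(h, rel_neighbors, rel_weights, self_weight):
    """One R-GCN update followed by ReLU.

    h:             (n_nodes, d_in) node features
    rel_neighbors: {relation: {node index i: [neighbor indices j]}}
    rel_weights:   {relation: (d_in, d_out) weight matrix}
    self_weight:   (d_in, d_out) self-connection weight matrix
    """
    out = h @ self_weight  # self-connection term
    for rel, neighbors in rel_neighbors.items():
        w_r = rel_weights[rel]
        for i, js in neighbors.items():
            if js:
                # Assumed normalization: divide by the number of neighbors
                # of node i under this relation.
                out[i] += (h[js] @ w_r).sum(axis=0) / len(js)
    return np.maximum(out, 0.0)  # ReLU


# Toy graph: node 0 = drug, node 1 = gene, node 2 = disease.
h0 = np.eye(3)
rel_neighbors = {
    "drug-gene": {0: [1], 1: [0]},
    "gene-disease": {1: [2], 2: [1]},
}
rel_weights = {"drug-gene": np.eye(3), "gene-disease": 2.0 * np.eye(3)}
h1 = rgcn_layer(h0, rel_neighbors, rel_weights, np.eye(3))
```

Stacking several such layers lets information from two-hop and three-hop neighbors reach each node, which is how the higher-order relationships mentioned above enter the embeddings.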
Relation modeling: single, dual, and triplet
Single-relation modeling constructs a graph with edges of only one relationship type, e.g. drug-gene interactions. This limited relational view constrains the embedding space to only pairwise interactions, potentially missing important cross-type associations. Dual-relation modeling extends this by incorporating two edge types, such as drug-gene and gene-disease, thus enabling the message to pass across two semantic layers, which enriches node features. Our Triplet-relation modeling integrates all three edge types—drug-gene, gene-disease, and drug-disease—allowing the R-GCN to apply distinct convolutional filters per relation and comprehensively aggregate heterogeneous neighborhood information. This richer relational context is key to achieving superior prediction performance as it captures complex multi-way dependencies inherent in biological networks.
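The three settings differ only in which edge types are kept when the graph is built. A minimal sketch, assuming edges are stored as `(source, relation, target)` tuples; the setting names and the `restrict_edges` helper are hypothetical:

```python
# Edge types kept in each modeling setting described above.
SETTINGS = {
    "single": ("drug-gene",),
    "dual": ("drug-gene", "gene-disease"),
    "triplet": ("drug-gene", "gene-disease", "drug-disease"),
}


def restrict_edges(edges, setting):
    """Keep only the edges whose relation belongs to the chosen setting."""
    keep = set(SETTINGS[setting])
    return [edge for edge in edges if edge[1] in keep]


# Placeholder edges for illustration only.
example_edges = [
    ("drug:D1", "drug-gene", "gene:G1"),
    ("gene:G1", "gene-disease", "disease:S1"),
    ("drug:D1", "drug-disease", "disease:S1"),
]
edge_counts = {name: len(restrict_edges(example_edges, name)) for name in SETTINGS}
```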
Model hyperparameter optimization
Key hyperparameters, including maximum tree depth, learning rate, number of boosting rounds, subsample ratio, column sampling ratio, gamma for minimum loss reduction, and minimum child weight, were tuned to maximize the performance of the XGBoost classifier. The optimal parameter combination was selected by grid search with five-fold cross-validation, using the area under the curve (AUC) as the selection criterion. The search ranges were: max_depth from 3 to 10, learning_rate from 0.01 to 0.3, n_estimators from 50 to 500, subsample and colsample_bytree from 0.6 to 1.0, gamma from 0 to 5, and min_child_weight from 1 to 10. Early stopping was applied to limit overfitting.
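To make the size of such a search concrete, the sketch below enumerates one possible grid inside the stated ranges. Only the range endpoints come from the text; the intermediate grid points are assumptions, and `score_fn` stands in for whatever routine returns the mean five-fold AUC for a parameter set.

```python
from itertools import product

# Range endpoints follow the text; the middle values are assumed for illustration.
PARAM_GRID = {
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [50, 200, 500],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "gamma": [0, 1, 5],
    "min_child_weight": [1, 5, 10],
}


def iter_grid(grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score); score_fn(params) is a placeholder
    for the mean cross-validated AUC of a model trained with params."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


n_candidates = sum(1 for _ in iter_grid(PARAM_GRID))
```

With three values per parameter this grid already contains 3^7 = 2187 candidates, so five-fold cross-validation requires 10 935 model fits; this is why coarse grids or early stopping are commonly used with exhaustive search.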
Embedded feature fusion and XGBoost model training
Once the node embeddings have been generated, the features of drugs, genes, and diseases are integrated into triple representations, and dimensionality reduction and an ensemble learning classifier are used to carry out the association prediction task. The embedding vectors of drugs, genes, and diseases are first concatenated to form a complete triple embedding representation. Principal component analysis (PCA) is then used to reduce the high-dimensional features and remove redundancy. The optimized embedding representation is finally input into the XGBoost model for association prediction. The specific steps of the whole process are shown in Fig. 2.
Figure 2.
Feature fusion and classification process.
To achieve this goal, this paper sets the node embedding representation to $\mathbf{h}_{\mathrm{drug}}$, $\mathbf{h}_{\mathrm{gene}}$, and $\mathbf{h}_{\mathrm{dis}} \in \mathbb{R}^{d}$, representing the node embedding vectors of drugs, genes, and diseases, respectively, and $d$ is the embedding dimension. For each drug-gene-disease triple $(\mathrm{drug}, \mathrm{gene}, \mathrm{dis})$, its joint representation is constructed in the following way:

$$\mathbf{z} = \mathbf{h}_{\mathrm{drug}} \oplus \mathbf{h}_{\mathrm{gene}} \oplus \mathbf{h}_{\mathrm{dis}} \tag{3}$$

Among them, $\oplus$ represents the vector concatenation operation, directly splicing the embedding vectors of drugs, genes, and diseases into a high-dimensional vector $\mathbf{z} \in \mathbb{R}^{3d}$. Although this direct splicing method fully retains the individual information of the nodes, the high dimension easily leads to increased noise and high computational cost. To this end, PCA is utilized here to reduce the dimension of $\mathbf{z}$ further. By calculating the covariance matrix of $\mathbf{z}$ and decomposing its eigenvalues, the eigenvectors corresponding to the $k$ largest eigenvalues are selected to obtain the reduced-dimensional representation $\mathbf{z}'$, which is as follows:

$$\mathbf{z}' = \mathbf{W}_k^{\top} \mathbf{z} \tag{4}$$

Among them, $\mathbf{W}_k$ is the matrix composed of the first $k$ principal components of the covariance matrix. $k$ is selected so that the cumulative contribution rate of the eigenvalues reaches 95%, ensuring that most of the information is retained.
To reduce the high dimensionality of the concatenated triple embedding vector $\mathbf{z}$ formed by splicing drug, gene, and disease node embeddings, PCA is applied. The number of principal components $k$ is selected so the cumulative explained variance ratio reaches or exceeds $95\%$. Formally, $k$ is the smallest integer satisfying:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{3d} \lambda_i} \geq 0.95$$

where $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{3d}$ are the eigenvalues of the covariance matrix of $\mathbf{z}$, sorted in descending order.
This selection balances information retention and computational efficiency by preserving the majority of the variance in the original data while reducing noise and redundancy. PCA was implemented using (tool or library name, e.g. scikit-learn’s PCA module), ensuring reproducibility of the dimensionality reduction process.
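The concatenation and the 95% selection rule can be sketched directly with NumPy. This is an illustrative sketch rather than the implementation used in the study: the function names are hypothetical, the data are synthetic, and the features are mean-centered before projection, which is standard PCA practice but is not stated explicitly in the text.

```python
import numpy as np


def concat_triple(h_drug, h_gene, h_dis):
    """Splice three d-dimensional embeddings into one 3d-dimensional vector."""
    return np.concatenate([h_drug, h_gene, h_dis])


def fit_pca(Z, threshold=0.95):
    """Return (mean, W_k, k) where k is the smallest number of components whose
    cumulative explained-variance ratio reaches the threshold."""
    mean = Z.mean(axis=0)
    cov = np.cov(Z - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, threshold)) + 1, len(eigvals))
    return mean, eigvecs[:, :k], k


def transform(Z, mean, W_k):
    """Project centered rows of Z onto the retained principal components."""
    return (Z - mean) @ W_k


# Synthetic data: two high-variance directions and four near-constant ones.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 0.1, 0.1, 0.1, 0.1])
mean, W_k, k = fit_pca(Z)
Z_reduced = transform(Z, mean, W_k)
z_example = concat_triple(np.ones(2), np.zeros(2), np.ones(2))
```

On this synthetic input almost all variance lies in the first two directions, so two components are retained and the remaining four are discarded as redundant.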
After generating the optimized triple embedding features, these vectors are input into the XGBoost model to complete the association prediction task. XGBoost uses gradient-boosted decision trees, combining multiple weak learners additively into a strong classifier. Each tree is grown by choosing the feature splits that yield the largest gain in the objective, so that the model can adapt to the complex nonlinear structure of input features. The objective function of XGBoost is defined as:

$$\mathcal{L} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{5}$$

Among them, $l(y_i, \hat{y}_i)$ is the prediction error loss function, $N$ is the number of training samples, and $f_m$ is the $m$-th of $M$ trees. The logarithmic loss function is used:

$$l(y_i, \hat{y}_i) = -\left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \tag{6}$$

In the regularization term, $\gamma$ and $\lambda$ control the number of leaf nodes and node weights, respectively; $\gamma T$ constrains the tree complexity of the model, where $T$ is the number of leaves; $\frac{1}{2} \lambda \lVert w \rVert^2$ controls the weight amplitude of each leaf node, where $w$ is the vector of leaf weights. The hyperparameters described in the Model hyperparameter optimization section are tuned to ensure the best performance of the XGBoost model.
The XGBoost classifier establishes a nonlinear mapping relationship between the input features and labels by modeling the embedded features after dimensionality reduction. Finally, it completes the association prediction of drug-gene-disease triples. This process effectively combines the high-quality feature embedding generated by HGNNs and the discriminative ability of ensemble learning methods, laying the foundation for subsequent prediction tasks. Although PCA is a linear dimensionality reduction method and may not fully capture nonlinear dependencies, it is employed in this work for two main reasons. First, the embeddings produced by R-GCN already encapsulate nonlinear interactions among drugs, genes, and diseases through multi-layer message passing and relation-specific transformations. Therefore, the embedding space is semantically enriched before PCA is applied. Second, PCA provides a computationally efficient way to reduce feature redundancy while retaining the majority of informative variance. By preserving 95% of the cumulative contribution rate, the PCA transformation ensures minimal loss of relevant information while also improving the training efficiency of the XGBoost classifier.
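For readers who want to see how the loss and the regularization term of the XGBoost objective combine numerically, the sketch below evaluates a regularized logarithmic loss by hand. It assumes the standard XGBoost penalty (a per-leaf cost plus an L2 penalty on leaf weights); the function names are illustrative, and the library itself optimizes this quantity through second-order approximations rather than by direct evaluation.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Summed binary logarithmic loss; probabilities are clipped for stability."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


def tree_penalty(leaf_weights, gamma, lam):
    """gamma * (number of leaves) + 0.5 * lam * (squared L2 norm of leaf weights)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_prob, trees, gamma, lam):
    """Loss plus the penalty of every tree; each tree is given as its leaf weights."""
    return log_loss(y_true, y_prob) + sum(tree_penalty(t, gamma, lam) for t in trees)


# One uninformative prediction and one two-leaf tree.
example_value = objective([1], [0.5], trees=[[1.0, -1.0]], gamma=1.0, lam=2.0)
```

In the example the loss contributes ln 2 and the tree contributes 1 × 2 + 0.5 × 2 × 2 = 4, which shows how larger `gamma` or `lam` values make additional leaves and large leaf weights more expensive.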
Model optimization and fusion strategy
An integrated optimization strategy is designed to improve both the embedding representation and the classification performance, so that the data sparsity and heterogeneity of complex biological network data can be handled effectively and prediction accuracy improved. In the embedding optimization process, the heterogeneous graph embedding representation is strengthened from multiple angles; in the optimization stage of the classification model, the sample weighting mechanism is adjusted to model sparse samples effectively; finally, the advantages of heterogeneous graph representation and decision tree classification are integrated through the model fusion strategy to form a more comprehensive solution.
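In the heterogeneous graph embedding representation optimization process, the R-GCN-based embedding representation adds a hierarchical attention mechanism to capture the multi-type relationship characteristics between nodes efficiently. Assuming that the drug, gene, and disease node embeddings are $\mathbf{h}_{\mathrm{drug}}$, $\mathbf{h}_{\mathrm{gene}}$, and $\mathbf{h}_{\mathrm{dis}}$, the following hierarchical weights are used to enhance the representation:

$$\tilde{\mathbf{h}}_v = \sum_{r \in \mathcal{R}} \alpha_r \mathbf{h}_v^{(r)} \tag{7}$$

Among them, $\mathcal{R}$ is the set of relationship types; $\alpha_r$ is the relationship weight controlled by the attention score; $\mathbf{h}_v^{(r)}$ is the representation of node $v$ aggregated under relation $r$; $\tilde{\mathbf{h}}_v$ is the final embedding representation of node $v$, $v \in \{\mathrm{drug}, \mathrm{gene}, \mathrm{dis}\}$.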
Weight allocation strategy for relationships in the heterogeneous graph
The weights assigned to each relationship type in the heterogeneous graph are learned dynamically using a hierarchical attention mechanism within the R-GCN framework. Specifically, for each node, the embedding update aggregates neighbor information weighted by attention scores $\alpha_r$ corresponding to each edge type $r$. These attention weights are initialized uniformly and trained end-to-end via backpropagation to optimize the model’s predictive performance. The attention mechanism enables the model to assign higher importance to more informative relationships while downweighting less relevant ones. This approach allows flexible, data-driven learning of edge weights, improving the representation of complex multi-type interactions in the biological network.
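The relation-level weighting can be illustrated with a small sketch. It assumes the attention weights are obtained by a softmax over one learnable score per relation, which is a common choice but is not specified in the text; the scores below are fixed placeholders rather than trained values, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Turn per-relation scores into weights that sum to one."""
    peak = max(scores.values())
    exps = {rel: math.exp(s - peak) for rel, s in scores.items()}
    total = sum(exps.values())
    return {rel: e / total for rel, e in exps.items()}


def fuse_relations(rel_embeddings, scores):
    """Weighted sum of a node's relation-specific embeddings."""
    alpha = softmax(scores)
    dim = len(next(iter(rel_embeddings.values())))
    return [
        sum(alpha[rel] * rel_embeddings[rel][k] for rel in rel_embeddings)
        for k in range(dim)
    ]


# Equal scores reproduce the uniform initialization described above.
uniform_scores = {"drug-gene": 0.0, "gene-disease": 0.0, "drug-disease": 0.0}
rel_embeddings = {
    "drug-gene": [3.0, 0.0],
    "gene-disease": [0.0, 3.0],
    "drug-disease": [3.0, 3.0],
}
fused = fuse_relations(rel_embeddings, uniform_scores)
```

During training the scores would be updated by backpropagation, so that relations carrying more predictive signal receive a larger share of the final embedding.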
In optimizing the classification model, a category-sensitive balanced training scheme is designed to address the data sparsity problem, and a weighting strategy based on sample distribution is applied. When optimizing the objective function of XGBoost, the weighted objective is as follows:

$$\mathcal{L}_{w} = \sum_{i=1}^{N} w_i \, l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m) \tag{8}$$

We introduce a weighted loss function to mitigate the impact of class imbalance among drug-gene-disease triples. Specifically, the sample weight $w_i$ for the $i$-th training example is defined as:

$$w_i = \frac{N}{N_{c_i}}$$

where $N$ is the total number of training samples, and $N_{c_i}$ is the number of samples belonging to class $c_i$, the class of the $i$-th example. This weighting scheme emphasizes minority classes by proportionally increasing their loss contribution during model training. We do not apply explicit upsampling or downsampling of rare triples in the dataset. Instead, the weighted loss mechanism allows the model to focus on underrepresented classes without altering the original data distribution, thereby preserving the integrity of the heterogeneous graph structure.
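A minimal sketch of an inverse-frequency weighting of this kind is given below. It assumes each sample's weight is the total sample count divided by the size of its class; the exact scaling used in the study may differ, and the helper name is hypothetical.

```python
from collections import Counter


def class_balanced_weights(labels):
    """Weight each sample by N / N_c, where N_c is the size of its class (assumed form)."""
    counts = Counter(labels)
    n_total = len(labels)
    return [n_total / counts[label] for label in labels]


# One positive triple among four samples: the positive gets three times the
# weight of each negative.
example_labels = [1, 0, 0, 0]
example_weights = class_balanced_weights(example_labels)
```

Weights of this form are typically passed to the classifier as per-sample weights at fit time, so the dataset itself is left unchanged.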
In Equation (8), $w_i$ is the sample category weight. Increasing the loss weight of the sparse category mitigates the model bias caused by the uneven distribution of the sample number. The heterogeneous graph embedding representation and the integrated classifier are jointly optimized to improve the overall system’s performance further and form a unified modeling framework. In this process, the optimization goal is to minimize the following joint loss function:

$$\mathcal{L}_{\mathrm{joint}} = \beta_1 \mathcal{L}_{\mathrm{emb}} + \beta_2 \mathcal{L}_{\mathrm{cls}} \tag{9}$$

Among them, $\mathcal{L}_{\mathrm{emb}}$ is the difference between the embedding layer and the true embedding; $\mathcal{L}_{\mathrm{cls}}$ represents the loss function of the classifier; $\beta_1$ and $\beta_2$ are weight parameters used to balance embedding learning and classification performance.
Finally, the model fusion strategy is applied to perform a weighted combination of the prediction results of multiple base classifiers of XGBoost through Bagging and Stacking techniques. Its formula is expressed as:

$$\hat{y} = \sum_{b=1}^{B} \omega_b \, \hat{y}_b \tag{10}$$

Among them, $\hat{y}_b$ is the prediction result of the $b$-th classifier, and $\omega_b$ is the corresponding weight, which is dynamically adjusted based on the performance of the validation set. By combining the prediction capabilities of multiple sub-models, comprehensive coverage of different feature subspaces is achieved, ensuring further improvement in overall prediction accuracy. Through these strategies, the model shows stronger generalization ability in processing sparse and heterogeneous biological network data, providing an effective solution for complex association prediction tasks.
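The weighted combination of base classifiers can be sketched as follows. The text does not state how the weights are derived from validation performance, so this sketch assumes they are validation scores normalized to sum to one; the function names and numbers are illustrative only.

```python
def normalize(scores):
    """Scale non-negative validation scores so that they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]


def fuse_predictions(per_model_probs, val_scores):
    """Weighted average of per-model probabilities for each sample."""
    weights = normalize(val_scores)
    n_samples = len(per_model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, per_model_probs))
        for i in range(n_samples)
    ]


# Two hypothetical base models that disagree on two samples; the first model
# has the higher validation score and therefore dominates the fused output.
fused_probs = fuse_predictions([[1.0, 0.0], [0.0, 1.0]], val_scores=[3.0, 1.0])
```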
Results
Model performance evaluation
The experiment aims to verify the applicability and stability of the method by quantitatively evaluating the model’s performance on different datasets. The experiment selects five representative datasets with obvious differences in the number of nodes, the number of edges, and the sparsity of data. The number of nodes and edges in dataset A are both at a medium level, and the data is relatively evenly distributed. The proportion of gene nodes in dataset B is higher, and the number of edges is larger. The number of edges in dataset C is relatively small, and the data sparsity is more obvious. The edge relationship of dataset D is more complex, involving multiple secondary associations. The proportion of drug nodes in dataset E is relatively high, which is suitable for verifying the accuracy of association prediction. Table 1 summarizes the scale and sparsity of the five datasets, providing a data basis for analyzing experimental results.
Table 1.
Scale and sparsity of different datasets
| Dataset | Number of nodes | Number of edges | Data sparsity |
|---|---|---|---|
| A | 2500 | 10 000 | 0.002 |
| B | 3000 | 15 000 | 0.0017 |
| C | 1800 | 7200 | 0.0022 |
| D | 2200 | 9000 | 0.0021 |
| E | 2700 | 12 000 | 0.0018 |
The experiment evaluates the model performance on five datasets using two indicators: the AUC of the receiver operating characteristic and the F1 score. The AUC reflects the model’s ability to distinguish between positive and negative samples. The F1 score measures the balance between precision and recall and fully demonstrates the applicability and stability of the model in the association prediction task. Figure 3 illustrates the model’s AUC and F1 score results on different datasets.
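For reference, both indicators can be computed from first principles. The sketch below uses the pairwise-ranking definition of AUC (the fraction of positive-negative pairs ranked correctly, with ties counted as one half) and the usual F1 formula; the labels and scores are synthetic and are not results from the study.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(labels, scores)
f1_value = f1_score(labels, [0, 1, 0, 1])
```

AUC is threshold-free, while F1 depends on the decision threshold used to turn scores into predicted labels, which is why the two indicators are reported together.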
Figure 3.
Performance of the model on different datasets.
According to the data in Fig. 3, the model performs best on dataset A, with an AUC of 0.92 and an F1 score of 0.85, and worst on dataset C, with an AUC of 0.87 and an F1 score of 0.81. The strong performance on dataset A stems from its relatively uniform distribution of nodes and edges, which provides sufficient and consistent neighbor features, enhances the feature aggregation effect of R-GCN, and reduces the interference of noise features on the model. In contrast, dataset C has higher sparsity and a limited number of node neighbors, resulting in the loss of important relationship features during the embedding generation process. Sparse data also increases the difficulty of XGBoost training, further weakening the classification performance. Overall, the model performs more stably on more uniform networks with higher data integrity, while highly sparse data reduces its performance.
To further validate the robustness of our proposed model, we conducted five-fold cross-validation experiments on all five datasets. Table 2 shows the mean and standard deviation of AUC and F1 scores across folds. The consistently high mean values, along with low standard deviations, indicate that the model maintains stable and reliable performance across diverse datasets. This strengthens our earlier findings based on single train-test splits.
Table 2.
Mean ± standard deviation of AUC and F1 score obtained from five-fold cross-validation experiments on different datasets
| Dataset | AUC (Mean ± Std) | F1 Score (Mean ± Std) |
|---|---|---|
| A | 0.92 ± 0.01 | 0.85 ± 0.02 |
| B | 0.90 ± 0.02 | 0.83 ± 0.03 |
| C | 0.87 ± 0.02 | 0.81 ± 0.02 |
| D | 0.89 ± 0.01 | 0.82 ± 0.02 |
| E | 0.91 ± 0.01 | 0.84 ± 0.02 |
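The cross-validation protocol behind a table of this kind can be sketched with the standard library. The fold assignment below is a simple interleaved split and the summary uses the sample standard deviation; both are assumptions, since the text does not state how folds were drawn or which deviation estimator was reported.

```python
import statistics


def kfold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


splits = list(kfold_indices(10, k=5))
# Placeholder per-fold values used only to exercise the helper.
mean_value, std_value = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Each sample appears in exactly one test fold, so every triple contributes once to the reported mean and standard deviation.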
Performance comparison of different models
To comprehensively evaluate the performance of different models in the drug-gene-disease triple association prediction task, the experiment compares the R-GCN + XGBoost model in this paper with the R-GCN alone, the traditional GCN, the DeepWalk+XGBoost model, and the random forest model. R-GCN can effectively model multi-relation graph structures and capture complex heterogeneous features; GCN focuses on the feature aggregation of local neighbors; DeepWalk generates low-dimensional representations of nodes through random walks; random forest is suitable for high-dimensional data scenarios with its high robustness. By comparing the performance of these models, the advantages and disadvantages of different methods in processing multi-source heterogeneous data are deeply explored, and the results are displayed in Table 3.
Table 3.
Performance comparison of association prediction models
| Model name | AUC | Precision | Recall | F1 score | AUPR |
|---|---|---|---|---|---|
| R-GCN + XGBoost | 0.92 | 0.88 | 0.82 | 0.85 | 0.91 |
| R-GCN | 0.89 | 0.84 | 0.78 | 0.81 | 0.89 |
| GCN | 0.87 | 0.82 | 0.76 | 0.79 | 0.85 |
| DeepWalk + XGBoost | 0.84 | 0.79 | 0.74 | 0.76 | 0.87 |
| Random forest | 0.80 | 0.75 | 0.70 | 0.72 | 0.88 |
As can be seen from Table 3, the R-GCN and XGBoost fusion model achieves AUC and F1 scores of 0.92 and 0.85, respectively, the highest among the compared models. The performance of R-GCN alone is slightly lower, with AUC and F1 scores of 0.89 and 0.81, respectively, which reflects the benefit of pairing the embedding features with a separate classifier. The AUC of GCN is 0.87, lower than that of R-GCN, indicating that ignoring heterogeneous relationships limits the model’s generalization ability. DeepWalk combined with XGBoost outperforms random forest on AUC, precision, recall, and F1 score, but its AUC is only 0.84 because it does not fully capture high-order relationship features. The random forest model has the lowest AUC and F1 scores, 0.80 and 0.72, respectively, indicating its lack of adaptability to complex network structures. While we compared with classical and widely used baselines such as GCN and DeepWalk combined with XGBoost, we recognize the importance of evaluating against newer deep learning models like heterogeneous graph attention networks or graph transformers. These models are currently under investigation for future extensions of this work. Overall, the fusion model combines the advantages of heterogeneous graph embedding and strong classifiers to more accurately mine potential associations. The Area Under the Precision–Recall Curve (AUPR) results are broadly consistent with this picture: R-GCN + XGBoost achieves the highest score of 0.91, indicating strong performance in imbalanced settings, followed by R-GCN at 0.89. The ordering of the remaining baselines differs from the AUC ranking, since random forest (0.88) and DeepWalk + XGBoost (0.87) both score above GCN (0.85) on AUPR, so the benefit of modeling heterogeneous relationships is clearest in the comparison between R-GCN and GCN.
The fusion model demonstrates a superior precision-recall tradeoff in association prediction tasks. AUPR complements AUC by better capturing the model’s performance on the positive class, which is critical in sparse or imbalanced biomedical datasets where true associations are rare.
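To make the difference between the two metrics concrete, the sketch below computes AUC and AUPR on a small imbalanced toy example. It is only an illustration under stated assumptions: the labels and scores are invented, the function names are hypothetical, and AUPR is estimated with the step-wise average precision, which may differ from the estimator used to produce Table 3.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at the rank where this positive is recalled
    return ap / n_pos


# Invented toy data: 2 true associations among 10 candidate triples.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(f"AUC  = {roc_auc(labels, scores):.3f}")            # 0.875
print(f"AUPR = {average_precision(labels, scores):.3f}")  # 0.750
```

With only two positives, two negatives ranked above the second positive lower the AUC modestly but halve the precision at that point, which is why AUPR is the more sensitive indicator when true associations are rare.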
Impact of data sparsity on the model
To explore the specific impact of data sparsity on model performance, the experiment simulates the model’s running performance under different sparsity levels by adjusting the retention ratio of edges and calculates the AUC and F1 scores under different sparsity levels to analyze the changing trends of the model’s robustness and predictive ability. Figure 4 illustrates the changing trends of the model’s AUC and F1 scores when the retention ratio of edges changes from 100% to 10%.
Figure 4.
Changes in model performance under different data sparsity.
As the retention ratio of edges gradually decreases, the AUC and F1 scores show a significant downward trend. In the high retention ratio range of 100%–80%, the AUC value drops from 0.92 to 0.89, and the F1 score remains between 0.85 and 0.84. The relatively gentle performance changes indicate that the model is not sensitive to data sparsity when the number of edges is sufficient. The performance decreases significantly when the retention ratio is lower than 70%. The AUC value decreases to 0.82 at 40%, and the F1 score decreases to 0.77, reflecting that the feature information transmission ability of the sparse graph is affected. When the retention ratio is further reduced to 10%, the AUC is only 0.67, and the F1 score decreases to 0.62, indicating that data sparsity has caused serious damage to node embedding and classifier synergy. The experimental outcomes demonstrate that the impact of sparsity on model performance is closely related to the expression quality of feature information. The higher the data sparsity, the weaker the model’s ability to capture multi-relational data. As the graph becomes sparser, the neighborhood information each node can aggregate during R-GCN updates diminishes, leading to less informative embeddings. This degradation propagates to the classifier, where fewer representative samples reduce XGBoost’s ability to learn discriminative decision boundaries. Our category-sensitive balanced training and weighted regularization help mitigate this but cannot fully compensate for extreme sparsity, as evidenced by performance drops at 10% edge retention.
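The edge-retention procedure can be sketched as follows, assuming edges are dropped uniformly at random, which is an assumption of this illustration. The toy graph, the relation names `targets` and `associated_with`, and the helpers `retain_edges` and `mean_degree` are hypothetical. The sketch only shows how the average neighborhood size available for aggregation shrinks with the retention ratio; it does not train any model.

```python
import random
from collections import Counter


def retain_edges(edges, ratio, seed=0):
    """Keep a random fraction `ratio` of the (head, relation, tail) edges."""
    k = round(len(edges) * ratio)
    return random.Random(seed).sample(edges, k)


def mean_degree(edges, nodes):
    """Average number of incident edges per node, counting isolated nodes as 0."""
    deg = Counter()
    for head, _, tail in edges:
        deg[head] += 1
        deg[tail] += 1
    return sum(deg[n] for n in nodes) / len(nodes)


# Hypothetical toy graph: 10 drugs, 10 genes, 10 diseases, two relation types.
drugs = [f"drug{i}" for i in range(10)]
genes = [f"gene{i}" for i in range(10)]
diseases = [f"disease{i}" for i in range(10)]
edges = [(d, "targets", g) for d in drugs for g in genes]
edges += [(g, "associated_with", s) for g in genes for s in diseases]
nodes = drugs + genes + diseases

for ratio in (1.0, 0.8, 0.4, 0.1):
    kept = retain_edges(edges, ratio)
    print(f"retention={ratio:.0%}  edges={len(kept)}  mean degree={mean_degree(kept, nodes):.2f}")
```

Because the mean degree scales linearly with the number of retained edges, a node at 10% retention sees on average one tenth of the neighbors it had in the full graph.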
Evaluation of modeling effect of heterogeneous relationships
In the context of multi-relational modeling, this experiment explores the performance of different modeling methods in the drug-gene-disease association prediction task. Single-relationship modeling only considers the association between two entities. In contrast, dual-relationship modeling adds more association information and enhances the model’s performance by combining two different types of nodes and edges. Triple-relationship modeling further expands the data structure and considers more complex interactions between drugs, genes, and diseases, thereby providing richer feature information. By comparing the performance of these modeling methods on different evaluation indicators, the experiment aims to reveal the potential of multi-relational modeling in biological network prediction. Table 4 describes the comparison results of each modeling method on each indicator.
Table 4.
Comparison of prediction performance of different modeling methods
| Evaluation metric | Single-relation modeling | Dual-relation modeling | Triplet-relation modeling |
|---|---|---|---|
| AUC | 0.83 | 0.87 | 0.92 |
| Precision | 0.78 | 0.80 | 0.84 |
| Recall | 0.85 | 0.88 | 0.90 |
| F1 Score | 0.81 | 0.84 | 0.87 |
In Table 4, the triple-relationship modeling achieves the best value on every evaluation indicator. Its AUC reaches 0.92, indicating a clear advantage in distinguishing between positive and negative samples. Its precision, recall, and F1 score are 0.84, 0.90, and 0.87, respectively, suggesting that the model improves the accuracy of positive predictions while reducing missed detections. The single-relationship and dual-relationship modeling score lower on both AUC (0.83 and 0.87) and F1 score (0.81 and 0.84), indicating that increasing the number of modeled relationships improves prediction performance. These results suggest that, by incorporating more complex relationships, the triple-relationship modeling improves the prediction of associations between drugs, genes, and diseases.
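As a quick consistency check, the F1 scores in Table 4 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the values printed in Table 4; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 4.
table4 = {
    "single": (0.78, 0.85, 0.81),
    "dual": (0.80, 0.88, 0.84),
    "triple": (0.84, 0.90, 0.87),
}

for name, (p, r, reported) in table4.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")
```

All three recomputed values agree with the reported F1 scores to two decimal places.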
After comparing the performance results of different modeling methods, their performance differences in specific prediction results are further explored. These differences are reflected in the model’s overall prediction ability, the distribution of error types, and the accuracy of particular domain tasks. The experiment reveals the advantages of multi-relationship modeling in processing complex biological network data by analyzing the false positive rate, false negative rate, number of misclassifications, and domain-specific prediction accuracy. Table 5 displays the comparison results of different modeling methods in these aspects.
Table 5.
Analysis of prediction results of different modeling methods
| Evaluation metric | Single-relation model | Dual-relation model | Triple-relation model |
|---|---|---|---|
| False positive rate | 0.12 | 0.08 | 0.05 |
| False negative rate | 0.15 | 0.12 | 0.09 |
| Incorrect drug predictions | 50 | 35 | 20 |
| Incorrect disease predictions | 45 | 30 | 15 |
| Cancer accuracy | 0.80 | 0.85 | 0.90 |
| Antibiotic accuracy | 0.78 | 0.83 | 0.88 |
In Table 5, the triple-relationship modeling has the lowest false positive and false negative rates, 0.05 and 0.09, respectively. The corresponding rates for the single-relationship modeling are 0.12 and 0.15, so the triple-relationship modeling substantially reduces misclassification. The number of incorrect predictions falls to 20 for drugs and 15 for diseases, compared with 50 and 45 for the single-relationship modeling. In domain-specific tasks, the prediction accuracy of the triple-relationship modeling for cancer and antibiotics reaches 0.90 and 0.88, respectively, higher than that of the other modeling methods. This shows that the triple-relationship modeling outperforms the other methods in overall prediction performance and also achieves strong results in specific domains.
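For reference, the two error rates in Table 5 follow the usual confusion-matrix definitions, sketched below. The counts are hypothetical: they describe an assumed balanced evaluation set of 100 positive and 100 negative triples and were chosen only so that the resulting rates match the triple-relation column of Table 5.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # share of true negatives predicted as associations
    fnr = fn / (fn + tp)  # share of true associations that were missed
    return fpr, fnr


# Hypothetical counts, not taken from the experiments.
fpr, fnr = error_rates(tp=91, fp=5, tn=95, fn=9)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```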
Synergy between embedded features and classifiers
In the association prediction of biological network data, the quality of embedded features has a crucial impact on model performance. This experiment explores the synergy between embedding features and classifiers by constructing five different quality levels of embedding features. The five embedding feature quality levels are original embedding, preliminary optimized embedding, optimized embedding, embedding after feature selection, and final optimized embedding. The original embedding represents the node features directly obtained through R-GCN without further optimization or processing. The preliminary optimized embedding adjusts the node features to a certain extent in the initial training stage, improving some features’ expressiveness. The optimized embedding further extracts high-order relationship information through multi-layer convolution, and the expressiveness of node features is significantly enhanced. The embedding after feature selection improves the simplicity and effectiveness of the embedding representation by removing redundant features. The final optimized embedding is comprehensively tuned and combines multiple optimization strategies to show the strongest node feature expression ability. The results of the impact of different embedding feature quality levels on the performance of the XGBoost classifier are shown in Fig. 5.
Figure 5.
Impact of embedding feature quality on classification results.
Figure 5 shows that both AUC and F1 scores increase steadily as the quality of the embedding features improves. The original embedding has low AUC and F1 scores of 0.70 and 0.65, respectively, indicating that the unoptimized node features cannot effectively support the classifier for accurate prediction. The preliminary optimized embedding slightly improves the AUC and F1 scores to 0.75 and 0.68, respectively, indicating that the optimization process helps the model to some extent but does not yet achieve the best results. The optimized embedding reaches 0.80 and 0.73 in AUC and F1 scores, respectively, reflecting a markedly stronger feature expression ability. The embedding after feature selection further improves the AUC and F1 scores to 0.86 and 0.78, respectively, by removing redundant features. The final optimized embedding achieves the best performance, with AUC and F1 scores of 0.92 and 0.85, indicating that comprehensive optimization combined with feature selection greatly strengthens the support that node features give the classifier and improves the accuracy and generalization ability of the model. This result shows that optimizing embedding quality and fusing the embeddings with the classifier play a key role in improving the model’s performance.
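The step from node embeddings to classifier inputs, including the removal of redundant features, might look like the sketch below. Several assumptions are made purely for illustration: triple features are formed by concatenating the three node embeddings, redundancy is approximated by a simple zero-variance filter, and the embeddings, node names, and helper functions are all hypothetical. The actual feature selection procedure may differ.

```python
import statistics


def triple_features(drug, gene, disease, emb):
    """Concatenate the three node embeddings into one classifier input vector."""
    return emb[drug] + emb[gene] + emb[disease]


def drop_low_variance(rows, threshold=1e-8):
    """Remove feature columns whose variance across samples is at or below `threshold`."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if statistics.pvariance(col) > threshold]
    return [[row[j] for j in keep] for row in rows], keep


# Hypothetical 3-dimensional embeddings; the last dimension is constant and carries no signal.
emb = {
    "drugA": [0.2, 0.7, 1.0], "drugB": [0.9, 0.1, 1.0],
    "geneX": [0.4, 0.3, 1.0], "geneY": [0.6, 0.8, 1.0],
    "diseaseP": [0.5, 0.5, 1.0], "diseaseQ": [0.1, 0.9, 1.0],
}
triples = [
    ("drugA", "geneX", "diseaseP"),
    ("drugB", "geneY", "diseaseQ"),
    ("drugA", "geneY", "diseaseQ"),
]

X = [triple_features(d, g, s, emb) for d, g, s in triples]
X_selected, kept_columns = drop_low_variance(X)
print(len(X[0]), "->", len(X_selected[0]), "features; kept columns", kept_columns)
```

The filtered matrix `X_selected` would then be passed, together with the triple labels, to the gradient-boosted classifier.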
Biological significance of predicted associations
Although the model performs well in terms of prediction metrics, assessing the biological significance of its results is vital. Analysis of the top predicted associations recovered well-known drug-gene-disease relationships such as Doxorubicin–TP53–breast cancer and Cisplatin–BRCA1–ovarian cancer. These closely match well-established results in the literature, confirming the model’s capacity to represent interactions that have been clinically validated. Furthermore, known antibiotic-target relationships are reflected in associations such as Ciprofloxacin–gyrA–urinary tract infection, which is especially pertinent to bacterial resistance mechanisms. In addition, the model predicted new relationships that are not yet well established, such as those between less well-known medications and disease genes, which could be promising candidates for additional research or drug repositioning. The predicted associations, which are generally both biologically meaningful and statistically sound, further support the model’s usefulness in practical biomedical applications.
Conclusions
This paper presents a method combining R-GCN and XGBoost to predict drug-gene-disease associations by constructing a heterogeneous graph and capturing high-order semantic features. The model achieves an AUC of 0.92 and an F1 score of 0.85, outperforming the baseline approaches compared in this study. While effective on sparse and multi-relationship data, performance declines under extreme sparsity, highlighting opportunities for improving feature learning and multi-scale data fusion. Our approach integrates heterogeneous graph neural networks (HGNNs) with ensemble learning, addressing data sparsity and relationship heterogeneity more effectively than methods that focus on either component alone. This fusion enhances predictive accuracy and generalization, offering a robust framework for complex biological network inference and precision medicine. Emerging techniques such as graph attention and disentangled representations [28, 29] provide promising directions for further improvement.
Key Points
Combines R-GCN and XGBoost to predict drug-gene-disease associations.
Builds a heterogeneous graph with drug, gene, and disease relationships.
Uses R-GCN to extract high-quality embeddings from multi-type node features.
XGBoost model trained on embedded triples for accurate prediction results.
Achieves 0.92 AUC and 0.85 F1 score, boosting precision medicine efforts.
Author contributions
K.N.G. contributed to the design and methodology of this study, the assessment of the outcomes, and the writing of the manuscript.
Conflict of interest: None declared.
Funding
There is no specific funding to support this research.
Data availability
All data generated or analyzed during this study are included in the manuscript.
Code availability
Not applicable.
References
- 1. Wang X, Zhang S, Yao W. et al. Revealing potential drug-disease-gene association patterns for precision medicine. Scientometrics 2021;126:3723–48. 10.1007/s11192-021-03892-4
- 2. Zhou G, Xuan C, Wang Y. et al. Drug repositioning based on a multiplex network by integrating disease, gene, and drug information. Current Bioinformatics 2023;18:266–75. 10.2174/1574893618666230223114427
- 3. Yadav K, Ramachandran R, Kumar V. et al. Indian TrANslational GlomerulonephrItis BioLogy nEtwork (I-TANGIBLE): Design and methods. Indian Journal of Nephrology 2023;33:277–82. 10.4103/ijn.ijn_305_23
- 4. Dwivedi AK. How to write statistical analysis section in medical research. J Invest Med 2022;70:1759–70. 10.1136/jim-2022-002479
- 5. Wen SJ, Liu YB, Yang G. et al. A method for miRNA-disease association prediction using machine learning decoding of multi-layer heterogeneous graph transformer encoded representations. Sci Rep 2024;14:20490. 10.1038/s41598-024-68897-4
- 6. Ding Y, Lei X, Liao B. et al. Machine learning approaches for predicting biomolecule–disease associations. Brief Funct Genomics 2021;20:273–87. 10.1093/bfgp/elab002
- 7. Kim Y, Cho Y-R. Predicting drug–gene–disease associations by tensor decomposition for network-based computational drug repositioning. Biomedicines 2023;11:1998. 10.3390/biomedicines11071998
- 8. Dong N, Mucke S, Khosla M. Mucomid: A multitask graph convolutional learning framework for miRNA-disease association prediction. IEEE/ACM Trans Comput Biol Bioinform 2022;19:3081–92. 10.1109/TCBB.2022.3176456
- 9. Tie J, Lei X, Pan Y. Metabolite-disease association prediction algorithm combining DeepWalk and random forest. Tsinghua Sci Technol 2021;27:58–67. 10.26599/TST.2021.9010003
- 10. Yang Y, Lv H, Chen N. A survey on ensemble learning under the era of deep learning. Artif Intell Rev 2023;56:5545–89. 10.1007/s10462-022-10283-5
- 11. Wu L, Ye X, Zhang Y. et al. A genetic algorithm-based ensemble learning framework for drug combination prediction. J Chem Inf Model 2023;63:3941–54. 10.1021/acs.jcim.3c00260
- 12. Lin S, Zheng H, Han B. et al. Comparative performance of eight ensemble learning approaches for the development of models of slope stability prediction. Acta Geotech 2022;17:1477–502. 10.1007/s11440-021-01440-1
- 13. Wang H, Wang X, Zhouxin Y. et al. Graph embedding and ensemble learning for predicting gene-disease associations. Int J Data Min Bioinform 2020;23:360–79. 10.1504/IJDMB.2020.108704
- 14. Li Z, Liu H, Zhang Z. et al. Learning knowledge graph embedding with heterogeneous relation attention networks. IEEE Trans Neural Netw Learn Syst 2021;33:3961–73. 10.1109/TNNLS.2021.3055147
- 15. Li Z, Zhong T, Huang D. et al. Hierarchical graph attention network for miRNA-disease association prediction. Mol Ther 2022;30:1775–86. 10.1016/j.ymthe.2022.01.041
- 16. Liu Y, Qi W, Zhou L. et al. Disentangled similarity graph attention heterogeneous biological memory network for predicting disease-associated miRNAs. BMC Genomics 2024;25:1161. 10.1186/s12864-024-11078-4
- 17. Zhao B-W, Xiao-Rui S, Yang Y. et al. Regulation-aware graph learning for drug repositioning over heterogeneous biological network. Inform Sci 2025;686:121360. 10.1016/j.ins.2024.121360
- 18. Zhao B-W, Xiao-Rui S, Yang Y. et al. Motif-aware miRNA-disease association prediction via hierarchical attention network. IEEE J Biomed Health Inform 2024;28:4281–94. 10.1109/JBHI.2024.3383591
- 19. Su X, Hu P, You Z-H. et al. Dual-channel learning framework for drug-drug interaction prediction via relation-aware heterogeneous graph transformer. In: Wooldridge M, Dy J, Natarajan S. (eds), AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI Press, Vol. 38, 2024.
- 20. Zhao B-W, Xiao-Rui S, Peng-Wei H. et al. iGRLDTI: An improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics 2023;39:btad451. 10.1093/bioinformatics/btad451
- 21. Zhao B-W, Xiao-Rui S, Yang Y. et al. A heterogeneous information network learning model with neighborhood-level structural representation for predicting lncRNA-miRNA interactions. Comput Struct Biotechnol J 2024;23:2924–33. 10.1016/j.csbj.2024.06.032
- 22. Wang J, Jianwen Feng Y, Kang PP. et al. Discovery of antimicrobial peptides with notable antibacterial potency by an LLM-based foundation model. Sci Adv 2025;11:eads8932. 10.1126/sciadv.ads8932
- 23. Wang J, Luo H, Qin R. et al. 3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model. Chem Sci 2025;16:637–48. 10.1039/D4SC06864E
- 24. Zhao B-W, Xiao-Rui S, Peng-Wei H. et al. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform 2022;23:bbac384. 10.1093/bib/bbac384
- 25. Sitaraman SR, Adnan MM, Maharajan K. et al. A classification of inflammatory bowel disease using ensemble learning model. In: 2024 First International Conference on Software, Systems and Information Technology (SSITCON). Washington, DC, USA: IEEE, pp. 1–5, 2024.
- 26. Alavilli SK. Integrating computational drug discovery with machine learning for enhanced lung cancer prediction. Journal of Current Science 2023;11.
- 27. Basani DKR. Hybrid transformer-RNN and GNN-based robotic cloud command verification and attack detection: Utilizing soft computing, rough set theory, and grey system theory. Journal Name 2020;8:70.
- 28. Yang Y, Li G, Li D. et al. Integrating fuzzy clustering and graph convolution network to accurately identify clusters from attributed graph. IEEE Trans Netw Sci Eng 2025;12:1112–25. 10.1109/TNSE.2024.3524077
- 29. Li G, Zhao B, Su X. et al. Discovering consensus regions for interpretable identification of RNA N6-Methyladenosine modification sites via graph contrastive clustering. IEEE J Biomed Health Inform 2024;28:2362–72. 10.1109/JBHI.2024.3357979