Abstract
As a pivotal RNA epigenetic modification, m1A plays a central regulatory role in the pathogenesis and progression of complex human diseases, including cancer. Exploring the potential associations between m1A sites and diseases is an important approach to revealing the molecular mechanisms of disease onset. However, traditional biological experiments are time-consuming and labor-intensive, so verified m1A-disease association data remain extremely scarce. Meanwhile, existing computational prediction methods are mostly limited to specific application scenarios and rely solely on direct association data between m1A and diseases. They do not fully integrate multi-dimensional biological information and are therefore unable to achieve efficient and accurate association prediction. In view of this, this study proposes a method for predicting associations between m1A modifications and diseases based on a ternary heterogeneous network and a graph convolutional network (GCN). By introducing circRNA as an intermediate connecting node, a ternary m1A-circRNA-disease association network is constructed, which enriches the feature information available for both m1A and diseases. The feature learning capability of the GCN is then used to extract and represent these features. Experimental results show that the proposed approach outperforms existing mainstream methods on most evaluation metrics, improving the accuracy and reliability of m1A-disease association prediction. Furthermore, a case study indicates that the predicted candidate m1A sites participate in regulating disease-related gene expression networks by modulating core processes such as RNA localization, stability, and translation efficiency, thereby providing novel insights into disease pathogenesis.
Keywords: graph convolutional neural network, m1A-disease associations, RNA modifications, self-attention mechanism, ternary heterogeneous network
1. Introduction
N1-methyladenosine (m1A) is a reversible post-transcriptional RNA modification widely present in the tRNA, rRNA and mRNA of eukaryotes. By regulating RNA stability, translational capacity, and protein function, m1A exerts a regulatory effect on disease progression (Zhang and Jia, 2018). A research team from the School of Pharmaceutical Sciences, Sun Yat-sen University, demonstrated that m1A modulates the expression of ATP5D (the δ subunit of mitochondrial ATP synthase), thereby controlling the glycolytic activity of tumor cells (Wu et al., 2022). This discovery not only reveals the critical role of m1A modification in cancer metabolism but also underscores its broader implications in the pathogenesis of human diseases, particularly cancer.
In recent years, an increasing number of studies have been dedicated to elucidating the biological functions and regulatory mechanisms of m1A modification in the initiation and progression of diseases. Consequently, many tools facilitating the study of m1A functions have been developed. Dominissini et al. (2016) were the first to map the dynamic m1A methylation patterns in eukaryotic mRNA by integrating chemical enrichment with high-throughput sequencing. Wang et al. (2021) employed LC-MS/MS quantification, CRISPR-mediated knockout of TRMT6/61A, RNA-seq and cholesterol tracing to verify that m1A modification promotes hepatocellular carcinoma (HCC) initiation by regulating cholesterol metabolism.
Researchers have compiled databases of RNA chemical modification information, such as RNAMDB (Cantara et al., 2011), RMBASE (Xuan et al., 2018), MeT-DB (Liu et al., 2015), and m6AVAR (Zheng et al., 2018), which have facilitated subsequent research on RNA chemical modifications. Predicting the associations between RNA modifications and diseases is a complex and rapidly evolving field. Notably, inferring unknown associations from the known associations in existing datasets has received significant attention. Graph neural networks (GNNs) have emerged as powerful deep learning architectures for modeling graph-structured data in biological systems. By iteratively aggregating feature information from neighboring nodes through message-passing mechanisms, GNNs can learn node embeddings that capture both local graph topology and node attributes. A recent comprehensive review by Khemani et al. (2024) summarized the fundamental concepts, architectures, and training techniques of GNNs, and discussed their challenges, benchmark datasets, and diverse applications, highlighting the rapid development and broad applicability of GNN-based approaches across multiple domains.
Ma et al. (2021a) integrated the positional information of m7G and comprehensive disease similarity information to construct a heterogeneous network, and then applied matrix decomposition to predict potential disease-related m7G sites. Huang et al. (2024) applied artificial intelligence–based approaches to investigate epitranscriptome distribution, providing new insights into the landscape of RNA modifications. Zhang Y. et al. (2025) developed DirectRM, a framework that enables the integrated detection of multiple RNA modifications and their potential crosstalk using direct RNA sequencing. Huang et al. (2023) proposed a computational framework based on random walk in heterogeneous networks. Their framework integrates multi-dimensional information such as RNA sequences, structural features, expression profiles, and disease phenotypes to construct a ternary m7G-RNA-disease network, and then propagates association signals globally through the random walk algorithm to predict potential disease-related m7G modification sites. Liu et al. (2023) proposed RMDGCN, a graph convolutional network integrated with an attention mechanism, to predict the relationship between m1A modifications and diseases. Zhang and Liu (2025) proposed m6ADP-GCNPUAS, a method that predicts the association between m6A and diseases through graph convolutional neural networks (GCNs) and Positive-Unlabeled Learning with Self-Adaptive Sampling (PUAS).
In addition, several methods developed for predicting circRNA-disease associations can provide valuable insights for our research. Wang et al. (2020a) employed deep convolutional neural networks to explore the associations between circRNAs and diseases. The framework of their method, including feature extraction and the application of machine learning models, provided insights for predicting associations between RNA modifications and diseases. Wang et al. (2020b) proposed GCNCDA based on graph convolutional networks. By integrating Gaussian interaction profile kernels and similarity networks of both circRNAs and diseases, they constructed a two-layer graph convolutional network to automatically learn node embeddings, followed by inner product decoding to predict potential circRNA-disease associations. Bian et al. (2021) introduced GATCDA, which constructed a graph attention network to adaptively learn neighbor weights and aggregate node features, and then predicted circRNA-disease associations through a bilinear decoder.
Despite the availability of these prediction methods, most of them solely rely on RNA modification-disease association data and fail to integrate additional datasets to explore the underlying mechanisms of the relationship between RNA modifications and diseases. Therefore, it is necessary to develop a novel computational tool to extract more complex m1A features related to diseases and apply them to a broader range of scenarios.
Based on this, this paper proposes a method for predicting m1A modification-disease associations using a ternary heterogeneous network and graph convolutional network, termed THGC_MDA. This model introduces circRNA as an intermediary to construct a ternary heterogeneous network of m1A-circRNA-disease. It extracts the features of m1A and diseases respectively through the graph convolutional network, thereby realizing the prediction of their associations.
2. Materials and methods
Figure 1 illustrates the workflow of THGC_MDA. First, the model constructs an m1A similarity network from m1A sequence similarity and Jaccard similarity. It then builds a disease similarity matrix by integrating the Jaccard similarity and semantic similarity of diseases. Concurrently, the meta-path networks of m1A and diseases are constructed from the ternary m1A-circRNA-disease heterogeneous network. Next, a multi-layer GCN performs deep feature learning on both m1A and diseases to obtain their updated feature representations. Finally, the two sets of features are combined to output a prediction score matrix for m1A-disease associations. In Figure 1, panel (A) shows the construction of the m1A-disease heterogeneous graph, in which we separately integrate m1A and disease similarities and employ circRNA to establish meta-paths for extracting the features of m1A sites and diseases. Panel (B) describes the deep feature learning process implemented via the multi-layer GCN. Panel (C) presents the prediction of potential m1A-disease associations using an MLP.
FIGURE 1.
The flowchart of THGC_MDA. (A) Construction of the m1A-disease heterogeneous graph and meta-paths for feature extraction. (B) Deep feature learning via multi-layer GCN. (C) Prediction of potential m1A-disease associations using MLP.
2.1. Dataset
The m1A-disease association data used in this study are the same as those used in RMDGCN (Liu et al., 2023) and were sourced from the RMVar database (http://rmvar.renlab.org) (Luo et al., 2021). To ensure data credibility, we retained high-confidence modification sites and diseases with corresponding Disease Ontology identifiers (DOIDs) in the Disease Ontology database. In total, 3,618 mutation-related m1A modification sites, 116 diseases, and 5,100 associations were obtained. Furthermore, to construct the ternary m1A-circRNA-disease heterogeneous network, we mapped the 3,618 m1A modification sites onto circRNAs, yielding 222 circRNAs. Additionally, we obtained 1,022 associations between the 222 circRNAs and the 116 diseases.
2.2. Similarity calculation
This paper applies semantic similarity and Jaccard similarity to diseases, and applies sequence similarity and Jaccard similarity to m1A.
2.2.1. Construction of the disease similarity network
We retrieved the Disease Ontology identifier (DOID) of each disease from the Disease Ontology dataset (Kibbe et al., 2015) and used these identifiers to calculate the semantic similarity of diseases. The similarity was computed using the doSim function provided in an R package (Yu et al., 2014). In the Disease Ontology framework, each disease is represented by a directed acyclic graph (DAG). The semantic similarity between disease i and disease j is defined in Equation 1:
| (1) |
Among them, and are the ancestor nodes of disease i and disease j respectively, represents the contribution value of all nodes in to disease i, and is defined as Equation 2:
| (2) |
Among them, represents the child nodes belonging to t; α is the disease semantic score, and the default setting is 0.5 (Rashid et al., 2025).
In the Disease Ontology framework, diseases that share similar ancestor nodes in the DAG structure tend to have related pathological mechanisms or biological characteristics. Therefore, diseases with higher semantic similarity scores are more likely to share similar molecular mechanisms or regulatory processes. Incorporating disease semantic similarity into the model helps capture functional relationships between diseases and provides biologically meaningful information for identifying potential associations between m1A modifications and diseases.
Given the sparsity of disease semantic similarity data, which limits the comprehensive representation of disease features, the Jaccard similarity is used to enrich the similarity information.
The Jaccard similarity between diseases $d_i$ and $d_j$ is calculated from the corresponding columns of the association matrix, as given in Equation 3:
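| $\mathrm{JJD}(d_i, d_j) = \dfrac{y_i \cdot y_j}{\lVert y_i \rVert_1 + \lVert y_j \rVert_1 - y_i \cdot y_j}$ | (3) |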
Here, $y_i$ and $y_j$ are the feature vectors of disease i and disease j in the association matrix (both taken as rows or both as columns), whose elements are usually 0 or 1; $\lVert y_i \rVert_1$ and $\lVert y_j \rVert_1$ are the L1 norms of $y_i$ and $y_j$ (that is, the sum of all elements of each vector), which equal the total number of features of disease i and disease j, respectively.
From a biological perspective, diseases that share more common associations in the interaction network are more likely to exhibit similar pathological mechanisms or molecular regulatory processes. Therefore, the Jaccard similarity reflects the degree of overlap between the association patterns of two diseases. A higher Jaccard similarity score indicates that the two diseases share more common interaction partners, suggesting potential functional or mechanistic relationships. Incorporating Jaccard similarity helps capture structural relationships within the disease network and complements the disease semantic similarity in representing disease-related features.
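The Jaccard computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common binary-vector form (shared associations divided by the size of the union); the toy association matrix and the function names `jaccard` and `jaccard_matrix` are hypothetical and are not taken from the THGC_MDA implementation.

```python
def jaccard(u, v):
    """Jaccard similarity of two binary vectors given as lists of 0/1."""
    inter = sum(a * b for a, b in zip(u, v))   # number of shared associations
    union = sum(u) + sum(v) - inter            # ||u||_1 + ||v||_1 - shared
    return inter / union if union else 0.0     # two empty profiles -> 0


def jaccard_matrix(profiles):
    """Pairwise Jaccard similarity for a list of binary profiles."""
    return [[jaccard(p, q) for q in profiles] for p in profiles]


# Toy association matrix (hypothetical): rows are m1A sites, columns are diseases.
assoc = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]

# Each disease is described by its column of the association matrix.
disease_profiles = [list(col) for col in zip(*assoc)]
jjd = jaccard_matrix(disease_profiles)
```

Applying the same function to the rows of the association matrix would give the corresponding similarity between m1A sites.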
2.2.2. Construction of the m1A similarity network
To calculate the sequence similarity between m1A modification sites, we first extracted 65-nt RNA sequence fragments from the reference sequence, with each fragment containing an m1A site at the center flanked by 32 nucleotides upstream and 32 nucleotides downstream. In computational studies of RNA modification sites, extracting fixed-length sequence windows centered on the modification site is a widely used way to capture the local sequence environment surrounding the modified nucleotide. Previous studies have demonstrated that the nucleotide sequences flanking RNA modification sites contain sequence features that are informative for identifying these sites (Chen et al., 2020). For example, Zhou et al. (2016) developed the predictor SRAMP, which employed a 61-nt sequence window to characterize the sequence context surrounding m6A modification sites. Following these studies, we adopted a window size of 65 nt, which is sufficient to capture the local sequence context around the modification site while maintaining a consistent input length for downstream computational analysis. Subsequently, these sequence fragments were one-hot encoded, in which each base (A, U, G, C) was mapped to a four-dimensional binary vector (for example, A was represented as [1,0,0,0], U as [0,1,0,0], G as [0,0,1,0], and C as [0,0,0,1]) (Wang et al., 2024). Because each sequence is 65 nt long, each encoded sequence is a vector of dimension 65 × 4 = 260.
After performing one-hot encoding, we adopted cosine similarity as the metric to calculate the similarity among all sequences corresponding to the m1A sites. The cosine similarity value between two m1A sites is defined as Equation 4:
| $\mathrm{SSM}(i,j) = \dfrac{\sum_{m=1}^{d} m_i^{(m)} m_j^{(m)}}{\sqrt{\sum_{m=1}^{d} \bigl(m_i^{(m)}\bigr)^2}\,\sqrt{\sum_{m=1}^{d} \bigl(m_j^{(m)}\bigr)^2}}$ | (4) |
Here, SSM(i,j) denotes the cosine similarity between m1A site i and m1A site j. Specifically, $m_i$ and $m_j$ represent the sequence feature vectors of m1A site i and m1A site j, respectively, and d is the total dimension of the vector obtained by one-hot encoding each sequence. $m_i^{(m)}$ and $m_j^{(m)}$ represent the m-th elements of the one-hot encoding vectors of the sequences of m1A site i and m1A site j, respectively. The numerator is the dot product of the two vectors, calculated by multiplying each pair of corresponding elements and summing the results. The denominator is the product of the L2 norms of the two vectors, where the L2 norm of a vector is the square root of the sum of the squares of its elements.
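The encoding and similarity steps above can be illustrated with a short Python sketch. It assumes the A, U, G, C ordering given in the text and uses short toy sequences instead of real 65-nt windows; the helper names `one_hot` and `cosine` are hypothetical.

```python
import math

BASES = "AUGC"  # column order follows the encoding described in the text


def one_hot(seq):
    """Flatten an RNA sequence into a binary vector of length 4 * len(seq).

    A base outside A/U/G/C (for example N) becomes four zeros in this sketch.
    """
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy 5-nt windows (hypothetical) that differ at a single position.
s1 = "AUGCA"
s2 = "AUGGA"
sim = cosine(one_hot(s1), one_hot(s2))
```

For two equal-length sequences made only of A, U, G and C, this cosine similarity reduces to the fraction of positions at which the two sequences carry the same base.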
From a biological perspective, sequence similarity reflects the degree of conservation in the nucleotide context surrounding m1A modification sites. m1A modifications often occur within specific sequence environments or motifs. Therefore, m1A sites with similar sequence patterns in their flanking regions may share similar structural characteristics or regulatory mechanisms involved in RNA modification. By measuring sequence similarity, the model can capture the local sequence features associated with m1A modification sites, which provides biologically meaningful information for identifying potential relationships among modification sites.
Furthermore, the Jaccard similarity is also employed to enrich the similarity information of m1A modification sites. The Jaccard similarity between m1A sites i and j is calculated from the corresponding rows of the association matrix, as shown in Equation 5:
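| $\mathrm{JJM}(i, j) = \dfrac{x_i \cdot x_j}{\lVert x_i \rVert_1 + \lVert x_j \rVert_1 - x_i \cdot x_j}$ | (5) |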
Here, $x_i$ and $x_j$ are the feature vectors (both taken as rows or both as columns) of the i-th and j-th m1A sites in the association matrix. The elements of these vectors are typically binary (0 or 1). $\lVert x_i \rVert_1$ and $\lVert x_j \rVert_1$ denote the L1 norms of $x_i$ and $x_j$ (i.e., the sum of all elements in each vector), which equal the total number of features associated with the i-th and j-th m1A sites.
2.2.3. Fusion of similarity networks
To integrate the two disease similarity measures, the final disease similarity DDS is defined in Equation 6:
| $\mathrm{DDS} = \dfrac{\mathrm{SSD} + \mathrm{JJD}}{2}$ | (6) |
Here, SSD denotes the semantic similarity of diseases and JJD the Jaccard similarity of diseases; the final disease similarity is the average of the two. Similarly, the final m1A similarity DDM in Equation 7 is the average of the sequence similarity SSM and the Jaccard similarity JJM of m1A:
| $\mathrm{DDM} = \dfrac{\mathrm{SSM} + \mathrm{JJM}}{2}$ | (7) |
2.2.4. Construction of the meta-path network
A meta-path can connect m1A, circRNA and diseases while capturing detailed structural information within the association network (Zhang X. et al., 2025). Based on the meta-path framework, we constructed two distinct adjacency matrices to capture network structural information at multiple hierarchical levels. The meta-path networks of m1A are shown in Equations 8 and 9:
| (8) |
| (9) |
Among them, denotes the m1A-circRNA association matrix, represents the circRNA-disease association matrix, and quantifies association strengths between m1A nodes through shared circRNA neighbors, while captures indirect associations via the m1A-circRNA-disease pathway in Equation 9. represents the transposed association network of m1A and circRNA, and represents the transposed association network of circRNA and diseases.
Similarly, the meta-path networks of diseases are shown in Equations 10 and 11:
| (10) |
| (11) |
where, measures association strengths between disease nodes through shared circRNA neighbors, and encodes indirect associations via the disease-circRNA-m1A pathway in Equation 11.
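One plausible reading of these meta-path networks is sketched below in Python, assuming each network is obtained by multiplying the association matrices along the path and then by the transposed path. The exact form used in THGC_MDA may differ, for example in normalization or thresholding, and the toy matrices and variable names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


# Toy association matrices (hypothetical sizes and values).
a_mc = [[1, 0], [1, 1], [0, 1]]   # 3 m1A sites x 2 circRNAs
a_cd = [[1, 1], [0, 1]]           # 2 circRNAs x 2 diseases

# m1A-circRNA-m1A: counts the circRNAs shared by two m1A sites.
m_c_m = matmul(a_mc, transpose(a_mc))

# m1A-circRNA-disease: two-hop reachability from m1A sites to diseases.
m_c_d = matmul(a_mc, a_cd)

# m1A-circRNA-disease-circRNA-m1A: m1A sites linked through shared diseases.
m_c_d_c_m = matmul(m_c_d, transpose(m_c_d))

# Disease side: disease-circRNA-disease and disease-circRNA-m1A-circRNA-disease.
d_c_d = matmul(transpose(a_cd), a_cd)
d_c_m_c_d = matmul(transpose(m_c_d), m_c_d)
```

All four resulting networks are square and symmetric, so they can serve as adjacency matrices over m1A nodes or disease nodes.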
2.3. Graph convolutional network
We further utilized the graph convolutional network (GCN) to extract the embedding features of each m1A and disease node. The GCN fuses the feature vectors of neighboring nodes according to the input m1A-m1A or disease-disease network (encoded as an adjacency matrix), thereby learning the embedding features of each node (Wu et al., 2020). The GCN architecture requires two core inputs: an initial node feature matrix and an adjacency matrix defining the graph topology. For m1A nodes, the initial feature matrix consists of enhanced representations generated via similarity fusion and feature enrichment. For disease nodes, the feature matrix integrates the semantic and Jaccard similarity measures. The adjacency matrix encodes the connectivity patterns between nodes in the network.
For m1A nodes, the adjacency matrix integrates information from the m1A-circRNA association path, the m1A-circRNA-disease association path, and the m1A similarity matrix DDM, as formulated in Equation 12:
| (12) |
The weighting coefficients (0.6, 0.3, 0.1) prioritize direct neighbor relationships while incorporating higher-order connectivity information. The coefficient 0.6 corresponds to first-order neighborhood aggregation (direct connectivity via shared circRNAs), preserving local structural information analogous to standard GCN designs. The coefficient 0.3 corresponds to second-order cross-modal propagation (bridging m1A and disease through two-hop circRNA paths). The coefficient 0.1 assigns a small weight to the similarity matrix, which serves as feature preprocessing and prior regularization.
For disease nodes, the adjacency matrix is constructed by integrating the circRNA-disease association path, the disease-circRNA-m1A association path, and the disease similarity matrix DDS, as in Equation 13:
| (13) |
The constructed adjacency matrices undergo symmetric normalization to ensure numerical stability during feature propagation in Equations 14, 15:
| (14) |
| (15) |
where and represent the degree matrices of and , respectively. These diagonal matrices contain elements and , and the normalization process mitigates issues of gradient explosion or vanishing during network training.
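The weighted combination and the symmetric normalization can be sketched as follows. This is an illustrative Python example, assuming the three inputs are square matrices of equal size and that normalization takes the usual form of scaling entry (i, j) by the inverse square roots of the degrees of nodes i and j; self-loops are not added because the text does not mention them. The function names and toy matrices are hypothetical.

```python
import math


def combine(path1, path2, sim, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of two meta-path networks and one similarity matrix."""
    n = len(path1)
    w1, w2, w3 = weights
    return [[w1 * path1[i][j] + w2 * path2[i][j] + w3 * sim[i][j]
             for j in range(n)] for i in range(n)]


def sym_normalize(adj):
    """Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j)."""
    deg = [sum(row) for row in adj]                        # node degrees
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


# Toy inputs (hypothetical): first-order path, second-order path, similarity.
p1 = [[1, 0], [0, 1]]
p2 = [[0, 1], [1, 0]]
s = [[1, 1], [1, 1]]
adj = combine(p1, p2, s)

norm = sym_normalize([[1.0, 1.0], [1.0, 3.0]])
```

Isolated nodes (degree zero) are mapped to zero rows here rather than raising a division error, which is one possible convention among several.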
The multi-layer GCN architecture learns hierarchical embedding representations for both m1A nodes and disease nodes. For m1A nodes, the feature transformation at the -th layer is defined as in Equations 16, 17:
| (16) |
| (17) |
where denotes the m1A node features at the -th layer, represents the trainable weight matrix, and is the Rectified Linear Unit activation function introducing non-linearity.
Similarly, for disease nodes, the feature update follows in Equation 18:
| (18) |
where contains the disease node features at the -th layer and is the corresponding trainable weight matrix.
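A layer-wise update of this kind can be illustrated with a minimal pure-Python forward pass, assuming the standard GCN propagation rule (normalized adjacency times node features times weight matrix, followed by ReLU). The weights below are fixed toy values rather than trained parameters, and `gcn_layer` and `gcn_forward` are hypothetical names.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise Rectified Linear Unit."""
    return [[max(0.0, x) for x in row] for row in m]


def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(a_hat @ h @ w)."""
    return relu(matmul(matmul(a_hat, h), w))


def gcn_forward(a_hat, h, weights):
    """Stack several layers; in this sketch every layer uses ReLU."""
    for w in weights:
        h = gcn_layer(a_hat, h, w)
    return h


# Toy example (hypothetical values): 2 nodes, 2 features per node.
a_hat = [[0.5, 0.5], [0.5, 0.5]]      # already-normalized adjacency
h0 = [[1.0, -2.0], [3.0, 4.0]]        # initial node features
w1 = [[1.0, 0.0], [0.0, -1.0]]        # first-layer weights
w2 = [[1.0, 0.0], [0.0, 1.0]]         # second-layer weights (identity)

h1 = gcn_layer(a_hat, h0, w1)
h2 = gcn_forward(a_hat, h0, [w1, w2])
```

In practice the weight matrices would be learned by backpropagation in a deep learning framework; the sketch only shows how neighbor features are mixed at each layer.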
To enhance the model’s representational capacity, we incorporated a multi-head self-attention mechanism with residual connections following the GCN feature learning. The self-attention mechanism adaptively computes importance weights between nodes in Equation 19:
| (19) |
| (20) |
where , and denote the query, key, and value matrices obtained through linear transformations of node features, represents the key vector dimension, and the function computes normalized attention weights in Equation 20.
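For illustration, a single attention head can be sketched in Python, assuming the standard scaled dot-product form in which the query-key scores are divided by the square root of the key dimension before the softmax. The query, key and value matrices below are toy inputs; in the model they would come from learned linear transformations of the node features, and the multi-head and residual parts are omitted here.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; returns (output, weights)."""
    d_k = len(k[0])                                   # key vector dimension
    k_t = [list(col) for col in zip(*k)]              # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs (hypothetical): two nodes with identical keys.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)
```

Because both keys are identical in this toy case, each node attends equally to both values, so every output row is the mean of the value rows.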
Residual connections facilitate gradient flow by combining attention outputs with original GCN features in Equation 21:
| (21) |
where is a learnable residual weight matrix.
For association prediction, we implemented a feature interaction network that processes the final embeddings of m1A and disease nodes. The m1A feature and disease feature are concatenated and processed through a multi-layer perceptron in Equation 22:
| (22) |
where denotes vector concatenation and comprises multiple non-linear transformation layers.
The interaction features are then projected through a prediction network to generate association probabilities in Equations 23, 24:
| (23) |
| (24) |
where represents the Sigmoid activation function, and are the weight matrix and bias term, respectively, and the output is constrained to the [0,1] interval, representing the probability of association between m1A and the disease.
To address class imbalance, we employed a composite loss function in Equations 25-27:
| (25) |
| (26) |
| (27) |
where denotes binary cross-entropy loss, represents Focal Loss with parameters and , which dynamically adjusts sample weights to prioritize challenging examples during training.
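The two loss terms can be sketched in Python as follows. The focal loss follows its usual definition with a class-balancing parameter `alpha` and a focusing parameter `gamma`; the default values (0.25 and 2.0) and the mixing weight `lam` used to combine the two terms are hypothetical choices for illustration, since the exact combination used in THGC_MDA may differ.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)                  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss: down-weights examples that are already well classified."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p                   # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha       # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def composite_loss(probs, labels, lam=0.5, alpha=0.25, gamma=2.0):
    """Weighted sum of mean BCE and mean focal loss (lam is hypothetical)."""
    n = len(probs)
    l_bce = sum(bce(p, y) for p, y in zip(probs, labels)) / n
    l_focal = sum(focal(p, y, alpha, gamma) for p, y in zip(probs, labels)) / n
    return lam * l_bce + (1.0 - lam) * l_focal


loss = composite_loss([0.9, 0.2, 0.6], [1, 0, 1])
```

With `gamma` set to zero the focal term reduces to a class-weighted cross-entropy, which makes the role of the focusing parameter easy to check.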
This comprehensive GCN architecture enables effective learning of deep feature representations for m1A and diseases within the heterogeneous network, facilitating accurate prediction of m1A-disease associations.
3. Results and discussion
3.1. Evaluation metrics
The model was evaluated via 5-fold cross-validation, in which the complete set of m1A-disease associations was split into five equal-sized parts. In each round, one part was used as the test set and the remaining four parts were used as the training set. To evaluate the performance of the model, we report ACC, Precision, Recall, F1_score, AUPR and AUC. The first four are defined in Equation 28:
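| $$\begin{aligned} \mathrm{ACC} &= \frac{TP+TN}{TP+TN+FP+FN}, & \mathrm{Precision} &= \frac{TP}{TP+FP}, \\ \mathrm{Recall} &= \frac{TP}{TP+FN}, & \mathrm{F1\_score} &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \end{aligned}$$ | (28) |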
Here, TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative. For the Receiver Operating Characteristic (ROC) curve, the vertical axis is the true positive rate (TPR) and the horizontal axis is the false positive rate (FPR). AUC is the area under this curve and ranges from 0 to 1; the closer it is to 1, the stronger the model's ability to distinguish between positive and negative classes. The Precision-Recall (PR) curve takes Recall as its horizontal axis and Precision as its vertical axis, and the Area Under the PR Curve (AUPR) is the area under this curve. AUPR better reflects the true performance of the model when positive samples are few.
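The threshold-based metrics follow directly from the four confusion-matrix counts, as the short Python sketch below shows. The counts in the example are made up for illustration and are not results from this study; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, tn, fp, fn):
    """ACC, Precision, Recall and F1_score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"ACC": acc, "Precision": precision, "Recall": recall, "F1_score": f1}


# Illustrative counts only (hypothetical).
m = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```

AUC and AUPR are not covered by this sketch because they are computed from the ranked prediction scores rather than from a single threshold.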
3.2. Comparing with other methods
This section compares THGC_MDA with six other methods to evaluate its performance in predicting m1A-disease associations. The comparative results are primarily derived from RMDGCN (Liu et al., 2023), a computational method based on the attention mechanism and GCN for predicting associations between RNA methylation and diseases. The raw data used in this paper are the same as those used in RMDGCN. Furthermore, to allow a consistent comparison, both the comparative methods and the evaluation metrics employed here are the same as those adopted in RMDGCN.
RWR (Tong et al., 2006) can utilize node information through multiple approaches, capturing global information of the graph to obtain correlation scores between nodes. Given that a single molecule can induce various diseases, the potential factors governing molecule-disease associations are highly correlated. NMF (Lee and Seung, 1999) decomposes the correlation matrix into a basis matrix and a weight matrix. The basis matrix represents the relative contribution of potential factors, while the weight matrix represents the relative contribution of the diseases. Additionally, we compared three existing methods: BRPCA (Ma et al., 2021b) for m7G-disease association prediction, SCMFDD (Zhang et al., 2018) for drug-disease association prediction, and LOMDA (Pech et al., 2019) for miRNA-disease association prediction. BRPCA employs singular value decomposition to impute missing entries in the adjacency matrix of the heterogeneous network, thereby obtaining a fitted matrix of the repaired associations between modification sites and diseases. SCMFDD predicts drug-disease associations through similarity-constrained matrix decomposition, while LOMDA is a prediction method for miRNA-disease relationships based on linear optimization.
The 5-fold cross-validation results show that THGC_MDA achieves the highest AUPR, Recall, Precision and F1_score among the compared methods, whereas RMDGCN attains a higher AUC and NMF the highest ACC. Detailed performance metrics for each evaluation index are summarized in Table 1. Figure 2 presents the 5-fold cross-validation results of AUC and AUPR for THGC_MDA, along with the corresponding ROC curves and PR curves derived from the final model. To address class imbalance, we implemented a random negative sampling strategy: for all experimentally validated m1A-disease associations (positive samples), an equal number of unconnected m1A-disease pairs were randomly selected from the complete bipartite space as negative samples, yielding a balanced 1:1 training dataset. The GCN model was configured with the following hyperparameters: 300 training epochs, an initial learning rate of 0.001, a GCN encoder hidden dimension of 256, an embedding output dimension of 128, and a 3-layer multilayer perceptron (MLP) predictor (32→64→32→1) with Sigmoid activation for probability calibration.
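The balanced negative sampling and the 5-fold split can be sketched as follows. This is a minimal Python illustration, assuming pairs are identified by (m1A index, disease index) tuples and that folds are formed by shuffling and dealing samples round-robin; the function names, the fixed seeds and the toy positives are hypothetical and do not reproduce the authors' code.

```python
import random


def sample_negatives(positives, n_m1a, n_disease, seed=0):
    """Draw as many unlabeled (m1A, disease) pairs as there are positives."""
    rng = random.Random(seed)
    pos = set(positives)
    if 2 * len(pos) > n_m1a * n_disease:
        raise ValueError("not enough unlabeled pairs for 1:1 sampling")
    negatives = set()
    while len(negatives) < len(pos):
        pair = (rng.randrange(n_m1a), rng.randrange(n_disease))
        if pair not in pos:                 # keep only pairs without a known link
            negatives.add(pair)
    return sorted(negatives)


def k_fold(samples, k=5, seed=0):
    """Shuffle the samples and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# Toy positives (hypothetical): 5 known associations among 5 sites and 4 diseases.
positives = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]
negatives = sample_negatives(positives, n_m1a=5, n_disease=4)
folds = k_fold(positives + negatives, k=5)
```

In each round one fold would be held out for testing and the remaining four used for training. Note that randomly chosen negatives are unlabeled pairs, so some of them may in fact be undiscovered associations.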
TABLE 1.
Performance comparison of different methods.
| Methods | AUC | AUPR | ACC | Recall | Precision | F1_score |
|---|---|---|---|---|---|---|
| RWR | 0.6343 | 0.0694 | 0.9036 | 0.1900 | 0.1041 | 0.1345 |
| NMF | 0.7120 | 0.2926 | **0.9949** | 0.0843 | 0.1038 | 0.0754 |
| BRPCA | 0.9326 | 0.1624 | 0.9304 | 0.7847 | 0.0579 | 0.1079 |
| SCMFDD | 0.8988 | 0.0499 | 0.9915 | 0.162 | 0.0528 | 0.0856 |
| LOMDA | 0.8937 | 0.3252 | 0.3540 | 0.9431 | 0.0172 | 0.0338 |
| RMDGCN | **0.9892** | 0.8682 | 0.9836 | 0.7809 | 0.799 | 0.7897 |
| THGC_MDA | 0.9704 | **0.9667** | 0.9166 | **0.9447** | **0.8951** | **0.9190** |
The best results in each column are highlighted in bold.
FIGURE 2.
The ROC curves and PR curves for 5-fold cross-validation.
3.3. Adjustment of parameters
Increasing the number of GCN layers allows the model to aggregate information from higher-order neighbors. However, too many layers increase the model's computational complexity, prolong training time, reduce generalization ability, and raise the risk of overfitting (Kipf and Welling, 2017; Li et al., 2018). Conversely, too few layers may limit the model's expressive ability and prevent it from capturing more complex structures and patterns in the data, resulting in decreased performance. It is therefore necessary to investigate the impact of the number of GCN layers on the model. Experiments were conducted with 2, 3, and 4 GCN layers, and Figure 3 shows the evaluation metrics of the model under these configurations. The results show that the model achieves the best predictive performance with 3 GCN layers.
FIGURE 3.

Comparison of adjusting the number of GCN layers.
The learning rate governs the step size of parameter updates during training. A learning rate that is too high or too low may significantly reduce the performance of the model. To identify a suitable value, the model was evaluated with learning rates of 1e-3, 5e-3, 1e-4, and 5e-4. As shown in Figure 4, the model achieves the best performance when the learning rate is 1e-3. Therefore, THGC_MDA uses a learning rate of 1e-3.
FIGURE 4.

Performance comparison of adjusting learning rate.
3.4. Ablation experiment
To validate the effectiveness of the ternary heterogeneous network constructed in this paper, a meta-path ablation experiment was conducted. In the ablated model, the m1A-disease adjacency matrix and the similarity networks of m1A and diseases were fed directly to the GCN for prediction, without incorporating the features fused via meta-paths. As shown in Figure 5, the ablated model achieved an AUC of 0.8911 and an AUPR of 0.9316, both lower than those of the full model (0.9704 and 0.9667 in Table 1), indicating that the meta-paths play a crucial role and effectively improve the prediction performance of the model.
FIGURE 5.

Comparison of ablation experiments.
3.5. Case analysis
To further validate the model’s performance, we utilized THGC_MDA to predict potential m1A sites associated with renal cell carcinoma (RCC) and conducted Gene Ontology (GO) enrichment analysis on the host genes of these newly predicted sites. In this case study, all known RCC–m1A associations were masked as unknown, and the model was used to compute RCC–m1A prediction scores. Based on these scores, we ranked the RCC-related m1A sites and retrieved their corresponding host genes. After removing redundancies (several sites share the same host gene), 439 unique host genes remained with the criterion p > 0.5. These 439 genes were functionally annotated using GO terms from three dimensions: Cellular Component (CC), Biological Process (BP), and Molecular Function (MF).
Figure 6 shows the CC enrichment results for RCC-related genes. Entries such as “lateral plasma membrane” relate to key pathways of tumor migration and metastasis in RCC. Terms like “cytosolic large ribosomal subunit” and “preribosome” reflect aberrant translational machinery, which indicates heightened proliferative activity of cancer cells, while “spindle” is associated with cell-cycle dysregulation and malignant proliferation. Through multi-omic profiling, Clark et al. (2019) demonstrated that “cytosolic large ribosomal subunit” and “preribosome” are highly expressed in RCC, resulting in enhanced translational efficiency. Functional assays showed that inhibiting ribosome assembly attenuated cancer cell proliferation, confirming that aberrant ribosome translation drives enhanced proliferative activity in RCC.
FIGURE 6.
CC enrichment of genes related to RCC.
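GO enrichment of a gene list is commonly assessed with a one-sided hypergeometric (over-representation) test. The text does not state which test or tool was used here, so the following is a sketch under that assumption; the parameter names and the toy counts are illustrative.

```python
from math import comb

def enrichment_p_value(n_background, n_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_background genes,
    of which n_term are annotated to the GO term (hypergeometric upper tail)."""
    upper = min(n_term, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_selected)

# Toy counts: 10 background genes, 4 annotated to the term,
# 3 selected genes, 2 of which carry the annotation.
p = enrichment_p_value(10, 4, 3, 2)
print(p)
```

With many GO terms tested at once, the resulting p-values would normally be adjusted for multiple testing before terms are reported.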
Many studies have demonstrated that the development of RCC is associated with various biological processes. We analyzed the enrichment relationship between BP and host genes, and the results are shown in Figure 7. From the figure, it can be seen that a single gene may be involved in multiple biological processes, while a biological process is driven by the interaction of multiple genes. In RCC, genes associated with RNA splicing and cytoplasmic translation are markedly activated, including core host genes such as RPL3 and YBX1. Previous studies have shown that piR-RCC (Wang et al., 2025) inhibits RCC proliferation and metastasis by blocking YBX1 nuclear translocation, thereby regulating its transcriptional repressor activity and indirectly affecting the expression of cytoplasmic translation-related genes.
FIGURE 7.
BP enrichment of genes related to RCC.
Next, we analyzed the MF category in the GO enrichment analysis. In Figure 8, the larger the circle, the greater the number of genes enriched in the corresponding term. It can be observed that RCC is associated with histone-modifying activity. This activity primarily refers to the chemical modification of histones, which alters chromatin compaction and regulates gene expression. In RCC, this mechanism is crucial for silencing tumor-suppressor genes (Zhu et al., 2019).
FIGURE 8.
MF enrichment of genes related to renal cell carcinoma.
4. Conclusion
Identifying m1A-disease associations helps clarify the crucial role of m1A modification in disease pathogenesis. In this study, we propose a novel computational framework, THGC_MDA, to predict potential associations between m1A and diseases. First, we obtain the disease similarity matrix by averaging the disease semantic similarity and Jaccard similarity, and the m1A similarity matrix by averaging the m1A sequence similarity and Jaccard similarity. To better capture the relationships between nodes, we introduce circRNA information and construct the m1A meta-path network and the disease meta-path network through meta-paths. Then, we use GCN to extract deep embedding features of m1A and diseases, and finally obtain the m1A-disease association predictions through a multilayer perceptron. In terms of predictive performance, the model achieved an AUC of 0.9684 and an AUPR of 0.9611 in five-fold validation. Additionally, in the RCC case study, we verified the effectiveness of the model through GO enrichment analysis. However, because the model relies on circRNA as an intermediate bridge between m1A and disease, and the annotation and functional study of circRNA remain insufficient, associations between some diseases or m1A sites and circRNA may be missing, thereby affecting the accuracy of the meta-path network construction. In the future, integrating miRNA, lncRNA, proteins, and other molecules as intermediate bridges would allow multi-type meta-path networks (such as m1A→miRNA→disease and m1A→protein→disease) to be constructed. Multi-path fusion algorithms could then integrate the information mediated by multiple molecules, reducing the reliance on circRNA data.
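The main building blocks summarized above can be illustrated with a compact NumPy sketch. It is not the released THGC_MDA code: the Jaccard similarity is assumed to be computed on rows of a binary association matrix, the meta-path feature is assumed to be the product of the m1A-circRNA and circRNA-disease matrices, and the GCN layer follows the standard propagation rule of Kipf and Welling (2017). All matrices are toy values.

```python
import numpy as np

def jaccard_similarity(assoc):
    """Row-wise Jaccard similarity of a binary association matrix."""
    a = (np.asarray(assoc) > 0).astype(float)
    inter = a @ a.T
    sizes = a.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    sim = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    np.fill_diagonal(sim, 1.0)
    return sim

def gcn_layer(adj, features, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)

# Toy data: 3 m1A sites, 3 circRNAs, 2 diseases.
m1a_disease = np.array([[1, 0], [1, 1], [0, 1]])
m1a_circ = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
circ_disease = np.array([[1, 0], [0, 1], [0, 1]])
seq_sim = np.eye(3)  # placeholder for the m1A sequence similarity

# Integrated m1A similarity: mean of sequence and Jaccard similarity.
m1a_sim = 0.5 * (seq_sim + jaccard_similarity(m1a_disease))
# Meta-path m1A -> circRNA -> disease: number of connecting paths.
meta_path = m1a_circ @ circ_disease

rng = np.random.default_rng(0)
weight = rng.normal(size=(meta_path.shape[1], 4))
embeddings = gcn_layer(m1a_sim, meta_path.astype(float), weight)
print(embeddings.shape)
```

The same steps would apply on the disease side, after which the two sets of embeddings are paired and passed to a classifier such as a multilayer perceptron.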
Several limitations of this study should be acknowledged. First, although the proposed model demonstrates good predictive performance, it operates largely as a black-box model, which may limit its biological interpretability. Interpretability is crucial for understanding the underlying mechanisms of RNA modification–disease associations. In recent years, model interpretation methods such as SHAP (Lundberg and Lee, 2017) and LIME (Ribeiro et al., 2016) have been widely used to explain machine learning models by estimating the contribution of input features to prediction outcomes. These approaches can quantify the influence of individual features and provide insights into the decision-making process of complex models. Second, all RNA modification-related variants in this study were obtained from the RMVar database (Luo et al., 2021), a comprehensive resource integrating experimentally validated and computationally predicted variants. However, relying on a single database may introduce potential biases, as data collection strategies, experimental conditions, and annotation methods may vary across studies.
To address these limitations, future work will pursue two complementary directions: integrating interpretability techniques into the proposed framework to analyze the contribution of different biological features, thereby identifying key factors influencing predicted m1A-disease associations and enhancing biological interpretability; and integrating data from multiple databases, such as RMDisease 2.0, as well as newly generated experimental datasets, to validate findings and improve the robustness and generalizability of the model. The combination of multi-database validation and model interpretability analysis will collectively strengthen the reliability of our predictions and provide deeper mechanistic insights into RNA-disease associations.
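To make the idea of feature contributions concrete, the sketch below computes exact Shapley values, the quantity that SHAP approximates, by enumerating feature subsets for a small model. It does not use the SHAP library; the `predict` function, the baseline, and the linear toy model are assumptions chosen for illustration, and exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the subset take their actual value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                subset = set(combo)
                phi[i] += weight * (value(subset | {i}) - value(subset))
    return phi

def toy_model(z):
    # Hypothetical linear scorer standing in for a trained predictor.
    return 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]

phi = exact_shapley(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

For a linear model the attributions recover the coefficients, and in general they sum to the difference between the prediction for `x` and the prediction for the baseline.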
Funding Statement
The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Key R&D Program of Shaanxi Provincial Department of Science and Technology (No. 2025CY-JJQ-65) and the Youth Innovation Team of Shaanxi Universities.
Footnotes
Edited by: An Zhu, Fujian Medical University, China
Reviewed by: Teng Zhang, Jiangsu University of Science and Technology, China
Kai Cheng Chuang, National Chung Hsing University Department of Life Sciences, Taiwan
Data availability statement
The original contributions presented in the study are included in the article/supplementary material. The processed data and codes are freely available at GitHub https://github.com/XUEz-svg/THGC_MDA. Further inquiries can be directed to the corresponding authors.
Author contributions
HG: Formal Analysis, Methodology, Conceptualization, Writing – review and editing, Writing – original draft, Investigation. XZ: Validation, Writing – review and editing, Methodology, Investigation, Formal Analysis, Data curation, Writing – original draft. LB: Supervision, Methodology, Conceptualization, Investigation, Formal Analysis, Writing – review and editing. HY: Funding acquisition, Writing – review and editing. FL: Conceptualization, Writing – review and editing, Supervision, Funding acquisition.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
- Bian C., Lei X.-J., Wu F.-X. (2021). GATCDA: predicting circRNA-disease associations based on graph attention network. Cancers 13, 2595. doi:10.3390/cancers13112595
- Cantara W. A., Crain P. F., Rozenski J., McCloskey J. A., Harris K. A., Zhang X., et al. (2011). The RNA modification database, RNAMDB: 2011 update. Nucleic Acids Res. 39, D195–D201. doi:10.1093/nar/gkq1028
- Chen Z., Zhao P., Li F., Wang Y., Smith A. I., Webb G. I., et al. (2020). Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief. Bioinform. 21, 1676–1696. doi:10.1093/bib/bbz112
- Clark D. J., Dhanasekaran S. M., Petralia F., Pan J., Song X., Hu Y., et al. (2019). Integrated proteogenomic characterization of clear cell renal cell carcinoma. Cell 179, 964–983.e931. doi:10.1016/j.cell.2019.10.007
- Dominissini D., Nachtergaele S., Moshitch-Moshkovitz S., Peer E., Kol N., Ben-Haim M. S., et al. (2016). The dynamic N1-methyladenosine methylome in eukaryotic messenger RNA. Nature 530, 441–446. doi:10.1038/nature16998
- Huang Y., Wu Z., Lan W., Zhong C. (2023). Predicting disease-associated N7-methylguanosine (m7G) sites via random walk on heterogeneous network. IEEE/ACM TCBB 20, 3173–3181. doi:10.1109/TCBB.2023.3284505
- Huang D., Meng J., Chen K. (2024). AI techniques have facilitated the understanding of epitranscriptome distribution. Cell Genomics 4, 4. doi:10.1016/j.xgen.2024.100718
- Khemani B., Patil S., Kotecha K., Tanwar S. (2024). A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions. J. Big Data 11, 11. doi:10.1186/s40537-023-00876-4
- Kibbe W. A., Arze C., Felix V., Mitraka E., Bolton E., Fu G., et al. (2015). Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 43, D1071–D1078. doi:10.1093/nar/gku1011
- Kipf T. N., Welling M. (2017). “Semi-supervised classification with graph convolutional networks,” in International conference on learning representations (ICLR).
- Lee D. D., Seung H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791. doi:10.1038/44565
- Li Q., Han Z., Wu X.-m. (2018). Deeper insights into graph convolutional networks for semi-supervised learning. AAAI Conf. Artif. Intell. 32, 3538–3545. doi:10.1609/aaai.v32i1.11604
- Liu H., Flores M. A., Meng J., Zhang L., Zhao X., Rao M. K., et al. (2015). MeT-DB: a database of transcriptome methylation in mammalian cells. Nucleic Acids Res. 43, D197–D203. doi:10.1093/nar/gku1024
- Liu L., Zhou Y., Lei X. (2023). RMDGCN: prediction of RNA methylation and disease associations based on graph convolutional network with attention mechanism. PLOS Comput. Biol. 19, e1011677. doi:10.1371/journal.pcbi.1011677
- Lundberg S., Lee S.-I. (2017). A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774. doi:10.48550/arXiv.1705.07874
- Luo X., Li H., Liang J., Zhao Q., Xie Y., Ren J., et al. (2021). RMVar: an updated database of functional variants involved in RNA modifications. Nucleic Acids Res. 49, D1405–D1412. doi:10.1093/nar/gkaa811
- Ma J., Zhang L., Chen J., Song B., Zang C., Liu H. (2021a). m7GDisAI: N7-methylguanosine (m7G) sites and diseases associations inference based on heterogeneous network. BMC Bioinforma. 22, 152. doi:10.1186/s12859-021-04007-9
- Ma J., Zhang L., Li S., Liu H. (2021b). BRPCA: bounded robust principal component analysis to incorporate similarity network for N7-methylguanosine (m7G) site-disease association prediction. IEEE/ACM TCBB 19, 3295–3306. doi:10.1109/TCBB.2021.3109055
- Pech R., Lee Y.-L., Hao D., Po M., Zhou T. (2019). LOMDA: linear optimization for miRNA-disease association prediction. bioRxiv. doi:10.1101/751651
- Rashid M. A., Chaturvedi M., Paliwal K. K. (2025). PIONet: a positional encoding integrated onehot feature based RNA-binding protein classification using deep neural network. IEEE Access 13, 87220–87228. doi:10.1109/ACCESS.2025.3570714
- Ribeiro M. T., Singh S., Guestrin C. (2016). “‘Why should I trust you?’: explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144. doi:10.1145/2939672
- Tong H., Faloutsos C., Pan J.-Y. (2006). Fast random walk with restart and its applications. IEEE, 613–622. doi:10.1109/ICDM.2006.70
- Wang L., You Z.-H., Huang Y.-A., Huang D.-S., Chan K. C. (2020a). An efficient approach based on multi-sources information to predict circRNA–disease associations using deep convolutional neural network. Bioinformatics 36, 4038–4046. doi:10.1093/bioinformatics/btz825
- Wang L., You Z.-H., Li Y.-M., Zheng K., Huang Y.-A. (2020b). GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLOS Comput. Biol. 16, e1007568. doi:10.1371/journal.pcbi.1007568
- Wang Y., Wang J., Li X., Xiong X., Wang J., Zhou Z., et al. (2021). N1-methyladenosine methylation in tRNA drives liver tumourigenesis by regulating cholesterol metabolism. Nat. Commun. 12, 6314. doi:10.1038/s41467-021-26718-6
- Wang G., Liu T., Lyu H., Liu Z. (2024). F5C-finder: an explainable and ensemble biological language model for predicting 5-formylcytidine modifications on mRNA. arXiv Preprint, 13265. doi:10.48550/arXiv.2404.13265
- Wang R., Li F., Lin Y., Lu Z., Luo W., Xu Z., et al. (2025). piR-RCC suppresses renal cell carcinoma progression by facilitating YBX-1 cytoplasm localization. Adv. Sci. 12, 12. doi:10.1002/advs.202414398
- Wu Z., Pan S., Chen F., Long G., Zhang C., Yu P. S. (2020). A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24. doi:10.1109/TNNLS.2020.2978386
- Wu Y., Chen Z., Xie G., Zhang H., Wang Z., Zhou J., et al. (2022). RNA m1A methylation regulates glycolysis of cancer cells through modulating ATP5D. PNAS 119, 119. doi:10.1073/pnas.2119038119
- Xuan J.-J., Sun W.-J., Lin P.-H., Zhou K.-R., Liu S., Zheng L.-L., et al. (2018). RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 46, D327–D334. doi:10.1093/nar/gkx934
- Yu G., Wang L.-G., Yan G.-R., He Q.-Y. (2014). DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609. doi:10.1093/bioinformatics/btu684
- Zhang C., Jia G. (2018). Reversible RNA modification N1-methyladenosine (m1A) in mRNA and tRNA. Genomics Proteomics Bioinformatics 16, 155–161. doi:10.1016/j.gpb.2018.03.003
- Zhang T., Liu L. (2025). m6ADP-GCNPUAS: m6A-disease prediction via graph convolutional network and positive-unlabeled learning with self-adaptive sampling. Interdiscip. Sci. Comput. Life Sci., 1–15. doi:10.1007/s12539-025-00760-0
- Zhang W., Yue X., Lin W., Wu W., Liu R., Huang F., et al. (2018). Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinforma. 19, 233. doi:10.1186/s12859-018-2220-4
- Zhang Y., Wu Y., Ma J., Wu Y., Li L., Wang H., et al. (2025a). DirectRM: integrated detection of landscape and crosstalk between multiple RNA modifications using direct RNA sequencing. Nat. Commun. 16, 16. doi:10.1038/s41467-025-64495-8
- Zhang X., Zou Q., Niu M., Wang C. (2025b). Predicting circRNA–disease associations with shared units and multi-channel attention mechanisms. Bioinformatics 41, btaf088. doi:10.1093/bioinformatics/btaf088
- Zheng Y., Nie P., Peng D., He Z., Liu M., Xie Y., et al. (2018). m6AVar: a database of functional variants involved in m6A modification. Nucleic Acids Res. 46, D139–D145. doi:10.1093/nar/gkx895
- Zhou Y., Zeng P., Li Y. H., Zhang Z., Cui Q. (2016). SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 44, e91. doi:10.1093/nar/gkw104
- Zhu Q., Yu L., Qin Z., Chen L., Hu H., Zheng X., et al. (2019). Regulation of OCT2 transcriptional repression by histone acetylation in renal cell carcinoma. Epigenetics 14, 791–803. doi:10.1080/15592294.2019.1615354