GigaScience. 2025 Oct 17;14:giaf123. doi: 10.1093/gigascience/giaf123

PanGIA: A universal framework for identifying association between ncRNAs and diseases

Xiaoyuan Liu 1,#, Xiye Lü 2,#, Qiuhao Chen 3, Jiqiu Sun 4, Tianyi Zhao 5,6, Yan Zhu 7
PMCID: PMC12532321  PMID: 41105014

Abstract

Background

With the growing recognition of the important roles noncoding RNAs (ncRNAs) play in various biological functions, especially their potential involvement in many human diseases, predicting ncRNA–disease associations has become a key challenge in biomedical research.

Results

Although many computational methods have been proposed to predict ncRNA–disease associations, most focus on a single type of ncRNA. However, the competitive and cooperative interactions among different types of ncRNAs are closely related to their functional roles in disease. To address this limitation, we propose a novel computational framework, PanGIA (Pan-ncRNA Graph-Interaction Attention network), designed to simultaneously predict potential associations between diseases and multiple types of ncRNAs, including microRNAs (miRNAs), long noncoding RNAs (lncRNAs), circular RNAs (circRNAs), and PIWI-interacting RNAs (piRNAs). Experimental results show that PanGIA outperforms type-specific state-of-the-art methods across multiple metrics in both individual and comprehensive predictions. It remains robust even when nodes or ncRNA types are removed, and ablation studies confirm the benefits of cross-type information.

Conclusions

PanGIA demonstrates significant advantages in predicting disease associations for different types of ncRNAs, including miRNAs, lncRNAs, circRNAs, and piRNAs. Case studies further confirm the accuracy of the model’s predictions, as all high-confidence associations were supported by literature evidence. This demonstrates the model’s strong biological interpretability and promising potential for practical applications. The successful application of PanGIA provides a new paradigm for exploring disease-associated ncRNAs, highlighting their immense potential in the field of biomedical research.

Keywords: heterogeneous graph attention network, mixture of experts, cross-task attention mechanism, ncRNA–disease association

Introduction

Noncoding RNAs (ncRNAs) refer to a class of RNA molecules that do not encode proteins but play crucial roles in various biological processes, such as posttranscriptional regulation, epigenetic modification, and cellular signaling. In recent years, with the advancement of high-throughput sequencing technologies and functional genomics, an increasing number of ncRNAs have been identified as closely associated with a wide range of complex human diseases [1–4]. A growing body of experimental evidence has demonstrated that aberrant expression or dysfunction of ncRNAs is involved in the pathogenesis of major diseases, including cancer, neurodegenerative disorders, and cardiovascular diseases. Therefore, uncovering the potential associations between ncRNAs and diseases not only helps to elucidate the molecular mechanisms underlying complex diseases but also provides theoretical support for early diagnosis, biomarker discovery, and personalized treatment strategies. In particular, ncRNA regulatory mechanisms have emerged as a research hotspot in fields such as oncology, neurological disorders, and cardiovascular disease.

Among the various types of ncRNAs, small RNAs such as microRNAs (miRNAs) have been extensively studied and are well recognized for their posttranscriptional silencing functions through binding to target mRNAs. They have demonstrated significant potential as biomarkers in a wide range of diseases [5, 6].

Circular RNAs (circRNAs), owing to their covalently closed-loop structures that confer high stability, can function as competitive endogenous RNAs (ceRNAs) for miRNAs or interact with RNA-binding proteins. Increasing evidence has demonstrated that circRNAs play critical regulatory roles and possess considerable potential for clinical applications across various disease contexts [7, 8].

Long ncRNAs (lncRNAs), which function by interacting with DNA, RNA, or proteins, are involved in processes such as chromatin modification and transcriptional regulation, and they have been found to play crucial roles in tumorigenesis, cell proliferation, and immune modulation [9, 10].

PIWI-interacting RNAs (piRNAs), initially thought to function predominantly in germ cells by suppressing transposable elements to maintain genome stability, have more recently been shown to exert regulatory functions in somatic cells as well. These piRNAs are increasingly associated with various cancers and metabolic disorders [11, 12].

Despite their critical roles in gene regulation and disease mechanisms, experimental identification of ncRNA–disease associations remains costly and time-consuming, limiting its scalability for large-scale studies. As a result, computational methods have gained increasing attention for their ability to efficiently and cost-effectively predict ncRNA–disease associations.

For miRNA, representative methods such as IMCMDA [13] leverage an integrated similarity network combined with a bilateral diffusion model to predict potential disease–miRNA associations. Regarding lncRNAs, LncDisAP [14] incorporates multiple similarity features and employs deep representation learning to effectively uncover latent associations. In the case of piRNAs, IPiDA-GBNN [15] enhances predictive accuracy by integrating graph neural networks with multifeature representations. For circRNAs, existing approaches include GCNCDA [16], which constructs a heterogeneous graph structure and applies graph convolutional networks to learn circRNA–disease relationships. Additionally, CRBPSA [17] exploits sequence- and structure-aware attention mechanisms to identify circRNA–RBP (RNA binding protein) interaction sites, thereby offering new insights into circRNA functionality. More recently, StackCirRNAPred [18] adopts a stacked ensemble learning strategy to achieve accurate classification of long circRNAs and other lncRNAs by integrating features from multiple sources.

Although these methods have achieved promising results within their respective ncRNA categories, they typically focus on a single type of ncRNA, ignoring the complex interplay and competition among different ncRNAs. For instance, miRNAs may interact with lncRNAs or circRNAs through the ceRNA mechanism, jointly regulating disease-related pathways. These interactions form a complex regulatory network, yet existing approaches lack comprehensive modeling of both cross-ncRNA relationships and multitype ncRNA–disease associations.

Therefore, there is a pressing need for novel computational frameworks that can jointly model the interrelations among various ncRNA types and their associations with diseases, enabling the discovery of previously unknown ncRNA–disease links through a more holistic understanding of their regulatory dynamics.

Despite the availability of several specialized repositories, such as miR2Disease, circR2Disease, LncRNADisease, and piRDisease, which systematically curate associations between ncRNAs and human diseases [19–22], these databases are inherently constrained by their reliance on experimentally derived evidence. The acquisition of such evidence is both resource-intensive and time-consuming, with its breadth inherently limited by laboratory conditions and prevailing research focuses. Moreover, most current studies are restricted to individual ncRNA classes, thereby neglecting potential crosstalk and cooperative interactions among distinct ncRNA species in disease pathogenesis. Consequently, existing resources remain insufficient to fully capture the complexity of ncRNA-mediated regulatory networks.

In conventional studies, most computational approaches focus on a single type of ncRNA, such as miRNA, circRNA, lncRNA, or piRNA, and are typically tailored to the specific features and data types associated with that category. Examples include IMCMDA [13], LncDisAP [14], IPiDA-GBNN [15], and GCNCDA [16], among others. These models are generally built upon sequence information, structural properties, or expression profiles unique to the targeted ncRNA type. However, such type-specific methods exhibit clear limitations in their generalizability, as they are often not applicable to other classes of ncRNAs.

This limitation arises from the substantial differences in structure, biological function, and disease-related mechanisms among various ncRNA types. For instance, miRNAs primarily function through posttranscriptional repression by targeting mRNAs, whereas lncRNAs are involved in gene regulation and chromatin remodeling. CircRNAs are known to act as “sponges” for miRNAs, and piRNAs are mainly implicated in posttranscriptional regulation and transposon silencing. Consequently, models focusing exclusively on one ncRNA type tend to ignore the potential interactions and synergies among different ncRNA categories, thereby restricting their applicability and limiting their potential to uncover cross-type regulatory mechanisms in disease contexts.

Furthermore, single-type RNA-based approaches are inadequate in capturing the complex biological interactions that may exist across different RNA types. For example, miRNAs may indirectly influence disease development by regulating lncRNA expression, circRNAs may impact disease progression through interactions with miRNAs, and piRNAs may engage with other RNA species in various biological processes. Since traditional methods are confined to individual RNA types, they are unable to fully elucidate the potential cross-talk and coregulatory mechanisms among diverse ncRNA classes.

Therefore, there is an urgent need to develop an efficient and scalable computational prediction model capable of systematically identifying potential associations between different types of ncRNAs and diseases. Such a model would not only compensate for the limitations of experimental data but also expand the knowledge graph of disease regulatory networks.

In this article, we propose PanGIA (Pan-ncRNA Graph-Interaction Attention network), a novel framework for comprehensive ncRNA–disease association prediction. To address the challenge of feature heterogeneity, PanGIA constructs a heterogeneous graph that integrates multisource data, including sequence information, functional similarity, and interaction networks of ncRNAs. To capture comprehensive and layered associations, it employs a cross-task attention mechanism combined with a mixture-of-experts architecture to dynamically learn the multilevel interactions between ncRNAs and diseases. Notably, PanGIA encompasses 4 representative classes of noncoding RNAs—miRNAs, lncRNAs, circRNAs, and piRNAs—which exhibit distinct characteristics in terms of length, structure, regulatory mechanisms, and functional roles. These classes, owing to their complementary and representative nature in current research, are collectively referred to as pan-ncRNAs [23]. By adopting a pan-ncRNA perspective, PanGIA not only overcomes the limitations of single-type ncRNA studies but also provides a more comprehensive understanding of the multilayered regulatory roles of ncRNAs in disease.

The main contributions of our work are summarized as follows:

  • Pan-ncRNA integration: PanGIA jointly models 4 types of ncRNAs (miRNA, lncRNA, circRNA, and piRNA), overcoming the limitations of single-type approaches.

  • Heterogeneous graph fusion: It constructs a heterogeneous graph to integrate sequence, semantic, and functional data, capturing complex ncRNA–disease relationships.

  • Cross-task attention: A mixture-of-experts module with cross-task attention enhances feature sharing across ncRNA types and improves prediction accuracy.

  • Superior performance: PanGIA achieves higher area under the curve (AUC), area under the precision-recall curve (AUPR), and rank metrics than baseline models, demonstrating strong generalization and reliability.

Materials

Our study involves multiple classes of noncoding RNAs and requires the simultaneous acquisition of their sequence information and disease association data. The databases utilized in this study are listed as follows:

  • miRNA: The associations between miRNAs and diseases were obtained from the HMDD v4.0 database [24], while the sequence information of miRNAs was retrieved from the miRBase database [25].

  • LncRNA/circRNA: This study includes lncRNA and circRNA associations with diseases, with data obtained from LncRNADisease v3.0 [26]. The sequence information of circRNAs was retrieved from the circBase database [27]. In contrast, lncRNA sequences were collected from 2 sources: GENCODE [28] and NONCODE [29].

  • piRNA: The associations between piRNAs and diseases were obtained from the piRDisease v1.0 [21] database, and the sequence information was retrieved from the piRBase [30] and piRNAdb [31] databases.

  • Disease: This study utilizes Disease Ontology Identifiers (DOIDs) to construct the disease similarity matrix, with corresponding information obtained from the Disease Ontology database [32].

The construction of the ncRNA–disease association network was based on merging data entries from the aforementioned association databases.

Methods

We propose a novel model named PanGIA, which is built upon the heterogeneous graph attention network (HAN) and a mixture-of-experts (MoE) framework. The model is designed to predict associations between pan-ncRNAs and diseases. The overall workflow of PanGIA is illustrated in Fig. 1, and consists of 3 main steps:

Figure 1: The structure of PanGIA.

  1. Pretraining: ncRNA node embeddings via DNABERT6

  2. Data processing: generation of heterogeneous networks

  3. Model construction: multitask association prediction via HAN and MoE with cross-task attention

Pretraining: ncRNA node embeddings via DNABERT6

In this study, we leveraged the DNABERT6 [33] model to pretrain ncRNA sequences and obtain high-quality node embeddings. DNABERT represents a recent class of methods that adapt the BERT architecture, originally developed in natural language processing, to genomic sequence modeling. The central idea is to segment DNA sequences into fixed-length k-mers, thereby treating the genomic sequence as a special type of “language.” By pretraining on large-scale genomic corpora, DNABERT is able to capture contextual dependencies within sequences and has demonstrated superior performance compared to traditional feature engineering approaches in a variety of downstream tasks, such as promoter prediction and transcription factor binding site identification. Building on this concept, we employed the DNABERT6 variant (with k = 6) to model ncRNA sequences, thereby generating embeddings suitable for subsequent graph-based representation learning. The overall workflow is illustrated in Fig. 1A and can be summarized in the following steps:

Tokenization of ncRNA sequences

We first segmented the ncRNA sequences into fixed-length k-mers of size 6. Prior to tokenization, all uracil (U) bases in the RNA sequences were systematically replaced with thymine (T) to ensure compatibility with DNA-based models such as DNABERT. The fundamental rationale of k-mer tokenization is to conceptualize DNA/RNA sequences as a specialized form of “language,” in which each nucleotide fragment of length 6 serves as an independent lexical unit. Previous studies have demonstrated that setting k = 6 provides superior performance across a wide range of genomic modeling tasks, as it effectively preserves local sequence features while simultaneously capturing long-range dependencies [33–35]. This strategy establishes a solid foundation for subsequent deep representation learning.
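The tokenization step can be illustrated with a minimal sketch. The function name `rna_to_kmers` and the `stride` parameter are hypothetical; a sliding window with stride 1 (overlapping 6-mers) is assumed here, since the text does not state whether the k-mers overlap.

```python
def rna_to_kmers(sequence, k=6, stride=1):
    """Convert an RNA sequence to the DNA alphabet and split it into k-mers."""
    # Upper-case the input and replace uracil (U) with thymine (T).
    dna = sequence.upper().replace("U", "T")
    if len(dna) < k:
        return []  # too short to yield a single k-mer
    # stride=1 yields overlapping k-mers (assumed); stride=k yields disjoint ones.
    return [dna[i:i + k] for i in range(0, len(dna) - k + 1, stride)]


tokens = rna_to_kmers("AUGGCUAGCU")
print(tokens)  # ['ATGGCT', 'TGGCTA', 'GGCTAG', 'GCTAGC', 'CTAGCT']
```

A sequence of length n produces n − k + 1 overlapping tokens, so a 10-nt input yields five 6-mers.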

Masked language model pretraining

After obtaining the 6-mer token sequences, we employed a masked language model (MLM) pretraining strategy: a subset of k-mer tokens in the input sequence was randomly replaced by a mask symbol, and the model was trained to reconstruct the original tokens conditioned on the unmasked context, thereby learning contextual dependencies within the sequence. The corresponding objective function can be formally expressed as

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid x_{\setminus \mathcal{M}}; \theta\right) \tag{1}$$

where $\mathcal{M}$ denotes the set of masked positions, $x_i$ is the masked token, $x_{\setminus \mathcal{M}}$ represents the unmasked context tokens, and $\theta$ denotes the model parameters.

From a biological perspective, this mechanism facilitates the identification of potential functional motifs and enhances the ability to capture their regulatory roles under diverse contextual environments.
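The masking step can be sketched as follows. This is an illustration only: the helper `mask_tokens`, the 15% masking rate, and the choice of independent random positions are assumptions, not details given in the text. With overlapping k-mers, neighbouring tokens partly reveal a masked token, so masking contiguous spans may be preferable in practice.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK and return the targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # token the model must recover
        masked[pos] = MASK
    return masked, targets


kmers = ["ATGGCT", "TGGCTA", "GGCTAG", "GCTAGC", "CTAGCT"]
masked, targets = mask_tokens(kmers)
print(masked, targets)
```

The model is then trained to predict each entry of `targets` from `masked`, which corresponds to summing the negative log-likelihood over the masked positions.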

Embedding representation

At the input layer, each 6-mer token is mapped into a dense vector representation composed of 3 components:

  • Token embedding: captures the semantic features of the current 6-mer

  • Positional embedding: encodes the positional information of each token within the sequence, thereby enabling the model to preserve the linear order of nucleotides

  • Segment embedding: used to differentiate between distinct segments in concatenated sequences

Formally, the overall embedding of a token can be expressed as

$$E_i = E_i^{\mathrm{tok}} + E_i^{\mathrm{pos}} + E_i^{\mathrm{seg}} \tag{2}$$

where $E_i^{\mathrm{tok}}$, $E_i^{\mathrm{pos}}$, and $E_i^{\mathrm{seg}}$ denote the token, positional, and segment embeddings of the ith token, respectively.

This multilevel embedding strategy enables the model to simultaneously retain local nucleotide features while capturing the global topological structure of the sequence, thereby providing a comprehensive representation for downstream tasks.

Transformer encoding and ncRNA embedding matrix

After obtaining the token-level embeddings, the sequence representations are fed into a stack of Transformer encoders. The core component of the Transformer is the multihead self-attention mechanism, which enables the model to capture dependencies among sequence fragments in multiple subspaces. For ncRNA sequences, this mechanism is particularly important, as functional elements such as binding sites or seed regions may exhibit interactions spanning long distances along the sequence. By stacking multiple layers of self-attention and feed-forward networks, the model is able to generate increasingly rich contextual representations.

Ultimately, the model integrates the contextual information of each sequence into an embedding matrix that not only captures local nucleotide fragment features but also encodes long-range dependencies and global sequence semantics. Compared with traditional one-hot encoding or manually engineered sequence features, the embeddings generated by DNABERT are more comprehensive and robust. For RNA nodes, we employ DNABERT embeddings of 768 dimensions as the input features. This embedding matrix is further utilized as the node feature representation in graph-based learning, thereby providing a high-quality foundation for predicting ncRNA–disease associations.

Data processing: generation of heterogeneous networks

Heterogeneous graph construction

In this study, we first preprocessed the raw ncRNA sequence data and their known associations with diseases in order to construct a cross-modal heterogeneous network, which serves as the input foundation for the PanGIA model. Formally, the heterogeneous graph is defined as

$$G = (V, E) \tag{3}$$

where the node set is given by

$$V = V_R \cup V_D \tag{4}$$

with $V_R$ denoting the set of RNA nodes and $V_D$ denoting the set of disease nodes.

The edge set $E$ is composed of multiple types of relationships, including the following:

  • RNA–disease associations:  
    $$E_{RD} = \{(r_i, d_j) \mid A_{ij} = 1\} \tag{5}$$

    where $A$ represents the known RNA–disease association matrix. If $A_{ij} = 1$, this indicates that RNA node $r_i$ is associated with disease node $d_j$.

  • RNA–RNA similarity edges:  
    $$E_{RR} = \{(r_i, r_j) \mid S^{R}_{ij} > 0\} \tag{6}$$

    where $S^{R}$ denotes the RNA similarity matrix. A weighted edge is established between 2 RNA nodes if their similarity score is greater than zero, with the edge weight denoted as $S^{R}_{ij}$.

  • Disease–disease similarity edges:  
    $$E_{DD} = \{(d_i, d_j) \mid S^{D}_{ij} > 0\} \tag{7}$$

    where $S^{D}$ denotes the disease similarity matrix. Similarly, if the similarity score between 2 diseases is greater than zero, a weighted edge is constructed with weight $S^{D}_{ij}$.
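As an illustration of the three edge types, the sketch below derives typed edge lists from an association matrix and two similarity matrices given as nested lists. The function `build_edges` and its dictionary keys are hypothetical. Skipping self-loops and listing each undirected similarity edge once (i < j) is an assumption made here for brevity.

```python
def build_edges(A, S_rna, S_dis):
    """Return typed edge lists for an RNA-disease heterogeneous graph."""
    # RNA-disease edges: one unweighted edge per known association.
    rna_disease = [(i, j) for i, row in enumerate(A)
                   for j, a in enumerate(row) if a == 1]
    # Similarity edges: weighted, kept only when the score is positive.
    rna_rna = [(i, j, w) for i, row in enumerate(S_rna)
               for j, w in enumerate(row) if i < j and w > 0]
    dis_dis = [(i, j, w) for i, row in enumerate(S_dis)
               for j, w in enumerate(row) if i < j and w > 0]
    return {"rna-disease": rna_disease, "rna-rna": rna_rna, "disease-disease": dis_dis}


A = [[1, 0], [0, 1], [1, 1]]                       # 3 RNAs x 2 diseases
S_rna = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]]
S_dis = [[1.0, 0.0], [0.0, 1.0]]
edges = build_edges(A, S_rna, S_dis)
print(edges["rna-rna"])  # [(0, 1, 0.5), (1, 2, 0.2)]
```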

Construction of ncRNA similarity matrices

To model intraclass similarities among 4 major categories of ncRNAs (miRNA, lncRNA, circRNA, and piRNA), we constructed similarity matrices based on embedding representations rather than conventional sequence alignment. Specifically, each ncRNA sequence was encoded into a dense embedding vector $e$ using a pretrained model. The similarity between any 2 sequences $s_i$ and $s_j$ of the same category was then quantified via cosine similarity:

$$\mathrm{Sim}(s_i, s_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \tag{8}$$

where $e_i$ and $e_j$ denote the embedding vectors of sequences $s_i$ and $s_j$, respectively. After normalization, the similarity values were constrained within the interval [0,1], thereby ensuring consistency across different ncRNA types.

Finally, the similarity matrices for all ncRNAs were organized into a block-diagonal structure:

$$S_{\mathrm{seq}} = \begin{bmatrix} S_{\mathrm{miRNA}} & 0 & 0 & 0 \\ 0 & S_{\mathrm{circRNA}} & 0 & 0 \\ 0 & 0 & S_{\mathrm{lncRNA}} & 0 \\ 0 & 0 & 0 & S_{\mathrm{piRNA}} \end{bmatrix} \tag{9}$$

where $S_{\mathrm{miRNA}}$, $S_{\mathrm{circRNA}}$, $S_{\mathrm{lncRNA}}$, and $S_{\mathrm{piRNA}}$ correspond to the cosine similarity matrices of miRNA, circRNA, lncRNA, and piRNA, respectively. This block-diagonal representation provides a structured foundation for integrating multiclass ncRNA similarities into downstream graph-based learning.
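A minimal sketch of this construction is given below. The function names are hypothetical, and the rescaling (cos + 1) / 2 used to map values into [0, 1] is an assumption, since the normalization scheme is not specified in the text.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity, rescaled from [-1, 1] to [0, 1] (assumed)."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    n = len(embeddings)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            cos = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
            S[i][j] = (cos + 1.0) / 2.0
    return S


def block_diagonal(blocks):
    """Place square matrices along the diagonal; cross-type entries stay 0."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(b)
    return out


S_mi = cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0]])   # toy miRNA embeddings
S_circ = cosine_similarity_matrix([[1.0, 1.0]])             # toy circRNA embedding
S_seq = block_diagonal([S_mi, S_circ])
```

Because similarities are computed only within a category, entries linking ncRNAs of different types remain zero in the block-diagonal matrix.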

Based on the constructed ncRNA–disease association network, we next compute the functional similarity of noncoding RNAs using the Gaussian Interaction Profile (GIP) kernel function. The corresponding formula is given as follows:

$$\mathrm{GIP}_R(r_i, r_j) = \exp\left(-\gamma_r \lVert \mathrm{IP}(r_i) - \mathrm{IP}(r_j) \rVert^2\right) \tag{10}$$

In this formulation, $\mathrm{IP}(r_i)$ and $\mathrm{IP}(r_j)$ represent the vectors corresponding to the ith and jth rows of the adjacency matrix $A$, respectively. The parameter $\gamma_r$ denotes the bandwidth coefficient of the kernel function, which is defined as follows:

$$\gamma_r = \frac{1}{\frac{1}{N_r} \sum_{k=1}^{N_r} \lVert \mathrm{IP}(r_k) \rVert^2} \tag{11}$$

$N_r$ denotes the total number of ncRNAs, and $\mathrm{IP}(r_k)$ represents the vector corresponding to the kth row of the adjacency matrix $A$. Subsequently, we integrate the sequence similarity and GIP-based functional similarity to obtain the final ncRNA similarity matrix:

graphic file with name TM0050.gif (12)
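The GIP kernel computation can be sketched as below. The function name `gip_kernel` is hypothetical, and the bandwidth is assumed to be the reciprocal of the mean squared norm of the interaction profiles (that is, with a unit numerator), which is a common choice but is not confirmed by the text.

```python
import math


def gip_kernel(A):
    """Gaussian interaction profile similarity between the rows of A."""
    n = len(A)
    # Bandwidth: reciprocal of the mean squared profile norm (assumed form).
    mean_sq_norm = sum(sum(x * x for x in row) for row in A) / n
    gamma = 1.0 / mean_sq_norm if mean_sq_norm > 0 else 1.0
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(A[i], A[j])))
             for j in range(n)] for i in range(n)]


A = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]       # toy ncRNA x disease associations
K_rna = gip_kernel(A)                        # ncRNA side: rows of A
K_dis = gip_kernel([list(col) for col in zip(*A)])  # disease side: columns of A
```

Rows with identical interaction profiles (rows 0 and 2 in the toy matrix) receive a similarity of exactly 1.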

Construction of the disease similarity matrix

On the disease side, we constructed 2 types of disease similarity networks based on different approaches: (i) a semantic similarity matrix calculated using Disease Ontology and (ii) a GIP-based similarity matrix generated from disease interaction profiles. By integrating these 2 sources of similarity information, we obtained a comprehensive disease similarity network to enhance the accuracy of disease representation [36–38].

Disease Ontology is a structured ontology that organizes various diseases and their hierarchical relationships. Each disease node in the ontology is assigned a unique identifier and may be associated with descriptive attributes, such as symptoms and causes. The hierarchical structure of the ontology typically resembles a tree, where parent nodes represent broader disease categories and child nodes correspond to more specific diseases. In this study, we employ the Jaccard similarity coefficient to compute the semantic similarity matrix between diseases, defined as follows:

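$$S^{D}_{\mathrm{sem}}(d_i, d_j) = \frac{\lvert \mathcal{A}(d_i) \cap \mathcal{A}(d_j) \rvert}{\lvert \mathcal{A}(d_i) \cup \mathcal{A}(d_j) \rvert} \tag{13}$$

In this formulation, $\mathcal{A}(d)$ denotes the set of ancestor nodes of the disease node d, and $\lvert \cdot \rvert$ represents the cardinality (i.e., the number of elements) of a set.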
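The Jaccard-based semantic similarity can be illustrated on a toy ontology. The helper names, the `parents` mapping, and the toy terms are hypothetical. Including each disease in its own ancestor set (`include_self=True`) is an assumption that keeps the similarity of a term with itself equal to 1.

```python
def ancestors(term, parents):
    """Collect all ancestors of a term by walking its parent links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard_similarity(d1, d2, parents, include_self=True):
    """Jaccard coefficient of the ancestor sets of two disease terms."""
    a1, a2 = ancestors(d1, parents), ancestors(d2, parents)
    if include_self:
        a1.add(d1)
        a2.add(d2)
    union = a1 | a2
    return len(a1 & a2) / len(union) if union else 0.0


# Toy hierarchy: child -> list of parents.
parents = {"lung cancer": ["cancer"], "breast cancer": ["cancer"], "cancer": ["disease"]}
print(jaccard_similarity("lung cancer", "breast cancer", parents))  # 0.5
```

The two toy cancers share 2 of the 4 terms in the union of their ancestor sets, giving 0.5.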

The functional similarity of diseases based on the GIP kernel is calculated as follows:

$$\mathrm{GIP}_D(d_i, d_j) = \exp\left(-\gamma_d \lVert \mathrm{IP}(d_i) - \mathrm{IP}(d_j) \rVert^2\right) \tag{14}$$

In this formulation, $\mathrm{IP}(d_i)$ and $\mathrm{IP}(d_j)$ represent the vectors corresponding to the ith and jth columns of the adjacency matrix $A$, respectively. The parameter $\gamma_d$ denotes the bandwidth coefficient of the kernel function, which is defined as follows:

$$\gamma_d = \frac{1}{\frac{1}{N_d} \sum_{k=1}^{N_d} \lVert \mathrm{IP}(d_k) \rVert^2} \tag{15}$$

$N_d$ denotes the total number of diseases. Subsequently, we integrate the semantic similarity and GIP-based functional similarity to obtain the final integrated disease similarity matrix:

graphic file with name TM0065.gif (16)

We integrate the constructed ncRNA similarity network, the disease similarity network, and the known ncRNA–disease associations to form a unified ncRNA–disease heterogeneous graph, denoted as $G$.

Model construction: multitask association prediction via HAN and MoE with cross-task attention

We propose a novel multitask relational prediction framework that integrates HAN with a MoE mechanism. Through a cross-task attention mechanism, the framework enables collaborative modeling across tasks, enhancing both task generalization and interaction expression capabilities. The overall structure of the model is depicted in Fig. 1C, which consists of the following 5 main components:

Feature representation and heterogeneous graph modeling

The model input consists of the ncRNA embedding matrix $R \in \mathbb{R}^{N_r \times d_r}$ and the disease embedding matrix $D \in \mathbb{R}^{N_d \times d_d}$, where $N_r$ and $N_d$ represent the number of ncRNAs and diseases, respectively, and $d_r$ and $d_d$ correspond to their embedding dimensions. Since the original feature dimensions of ncRNAs and diseases may differ, we first map the disease embeddings into the same space as the ncRNA embeddings:

$$\tilde{D} = D W_d \tag{17}$$

where $W_d \in \mathbb{R}^{d_d \times d_r}$ is a learnable linear transformation matrix.

Next, the heterogeneous network $G$, based on ncRNA–disease associations, along with the ncRNA embedding matrix $R$ and disease embedding matrix $\tilde{D}$, is fed into the HAN model. A multihead attention mechanism, guided by meta-paths, is employed to extract higher-order semantic features from the graph structure. The output of the HAN encoder is

$$(H_R, H_D) = \mathrm{HAN}(G, R, \tilde{D}) \tag{18}$$

where $H_R \in \mathbb{R}^{N_r \times d_h}$ and $H_D \in \mathbb{R}^{N_d \times d_h}$ represent the hidden representations of ncRNAs and diseases extracted by the HAN layer, and $d_h$ denotes the intermediate hidden dimension.

Expert pool and global disease information fusion

To integrate global disease semantics, we average the representations of all disease nodes to obtain the global disease feature:

$$h_{\mathrm{global}} = \frac{1}{N_d} \sum_{j=1}^{N_d} h^{D}_j \tag{19}$$

We concatenate this with each ncRNA representation to form the fused representation:

$$z_i = \left[\, h^{R}_i \,\Vert\, h_{\mathrm{global}} \,\right] \tag{20}$$

Here $h^{R}_i$ and $h^{D}_j$ denote the HAN representations of ncRNA $i$ and disease $j$, and $N_d$ is the number of diseases. Subsequently, the fused representation is input into an expert pool consisting of $K$ experts, where each expert is a nonlinear transformation module:

$$e^{(k)}_i = f_k(z_i), \quad k = 1, \dots, K \tag{21}$$

The outputs of all experts are then stacked:

$$E_i = \left[\, e^{(1)}_i; \dots; e^{(K)}_i \,\right] \in \mathbb{R}^{K \times d_e} \tag{22}$$

where $d_e$ denotes the output dimension of each expert.

Multitask gating mechanism

For each specific task Inline graphic, the corresponding ncRNA subset is Inline graphic. We learn a gating network for each task to perform attention-based selection of experts in the expert pool:

  • First, the task-related input representation Inline graphic is mapped to a query vector Inline graphic.

  • The expert representations are mapped to keys Inline graphic and values Inline graphic.

  • A multihead attention mechanism is then used to compute the attention-weighted expert representation:

graphic file with name TM0093.gif (23)

The final aggregation of the expert outputs results in the task feature representation:

graphic file with name TM0094.gif (24)
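To make the flow from the expert pool to the task representation concrete, the following simplified sketch processes a single ncRNA node. Everything here is an assumption for illustration: the function names, the choice of one ReLU-activated linear layer per expert, a single attention head instead of multihead attention, expert outputs used directly as values, and randomly initialized weights in place of learned ones.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def moe_forward(h_rna, H_dis, experts, W_q, W_k):
    """Fuse one ncRNA vector with the global disease feature and gate experts."""
    # Global disease feature: mean over all disease representations.
    g = [sum(col) / len(H_dis) for col in zip(*H_dis)]
    z = list(h_rna) + g                                   # concatenation
    # Each expert: linear map followed by ReLU (assumed expert form).
    expert_out = [[max(0.0, v) for v in matvec(W, z)] for W in experts]
    # Gating: the query comes from the fused input, keys from expert outputs.
    q = matvec(W_q, z)
    keys = [matvec(W_k, e) for e in expert_out]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    # Task representation: attention-weighted sum of expert outputs.
    d_e = len(expert_out[0])
    task_repr = [sum(w * e[i] for w, e in zip(weights, expert_out)) for i in range(d_e)]
    return task_repr, weights


rng = random.Random(0)
experts = [random_matrix(5, 8, rng) for _ in range(3)]    # K=3 experts, d_e=5
W_q, W_k = random_matrix(5, 8, rng), random_matrix(5, 5, rng)
H_dis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
task_repr, weights = moe_forward([0.1, 0.2, 0.3, 0.4], H_dis, experts, W_q, W_k)
```

In a per-task setting, each ncRNA type would hold its own `W_q` and `W_k`, so that different tasks can weight the shared experts differently.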

Cross-task attention interaction

To model the potential correlations between tasks, we input the aggregated representations of all tasks (after average pooling) into a cross-task multihead attention module:

graphic file with name TM0095.gif (25)
graphic file with name TM0096.gif (26)

The cross-task global representation Inline graphic is then concatenated with the original task representation Inline graphic:

graphic file with name TM0099.gif (27)

Relational prediction and output layer

The representations of all diseases are projected to the expert dimension through a linear transformation:

graphic file with name TM0100.gif (28)

Finally, the association score between each ncRNA and all diseases for each task is calculated through the dot product, followed by normalization using the sigmoid function:

graphic file with name TM0101.gif (29)

In this framework, the association prediction task is formulated as a binary classification problem. For each task t, the task-specific MLP transforms the RNA representations into $Z^{R}_t$, which captures task-refined structural and semantic features. Each disease node representation is projected into the same latent space, yielding $Z^{D}$. The association score between an RNA node i and a disease node j is computed as the dot product of the corresponding rows $z^{R}_{t,i}$ and $z^{D}_j$:

$$s_{ij} = z^{R}_{t,i} \cdot z^{D}_j \tag{30}$$

This score is then passed through a sigmoid function to produce the following probability:

$$p_{ij} = \sigma(s_{ij}) = \frac{1}{1 + e^{-s_{ij}}} \tag{31}$$

which indicates the likelihood that RNA i is associated with disease j. Hence, the MLP does not serve as the final classifier but rather as a task-dependent feature extractor, while the prediction itself is achieved through the interaction between RNA and disease embeddings.
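The scoring step amounts to a dot product followed by a sigmoid, as in the short sketch below. The function names and the toy embeddings are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def association_probabilities(Z_rna, Z_dis):
    """Probability matrix: sigmoid of the dot product for every RNA-disease pair."""
    return [[sigmoid(sum(a * b for a, b in zip(r, d))) for d in Z_dis] for r in Z_rna]


Z_rna = [[1.0, 0.0], [0.0, 0.0]]   # toy task-refined RNA representations
Z_dis = [[2.0, 0.0], [0.0, 3.0]]   # toy projected disease representations
P = association_probabilities(Z_rna, Z_dis)
print(P[0][1])  # 0.5, because the dot product is 0
```

A zero vector on either side gives a score of 0 and therefore a probability of 0.5, which shows why the learned representations, rather than the final layer, carry the predictive signal.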

Results

Benchmark on various ncRNAs

In this study, we evaluated the performance of different models using 5-fold cross-validation and employed Rank Index, AUC, and AUPR as evaluation metrics.
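For orientation, the sketch below shows one way to split samples into 5 folds and to compute AUC from its probabilistic definition. The function names are hypothetical, the splitting of individual samples (rather than nodes) is an assumption, and the Rank Index is omitted because its definition is not given in this section.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def roc_auc(labels, scores):
    """AUC as the probability that a positive is scored above a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties count as half a win; this O(n^2) loop is fine for a small example.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


folds = k_fold_indices(10)
print(roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # 1.0
```

In each round, one fold serves as the held-out test set and the remaining four are used for training; the reported value is the mean over the 5 rounds.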

Table 1 summarizes the 5-fold cross-validation performance of the models on the pan-ncRNA–disease association prediction task. Compared with the baseline methods, PanGIA achieved the highest AUPR in every ncRNA category and the highest or second-highest AUC.

Table 1.

Performance comparison of PanGIA and baseline models on ncRNA–disease association prediction tasks

| Model | RNA category | AUC | Rank Index | AUPR |
|---|---|---|---|---|
| NIMGSA [39] | miRNA | 0.947 ± 0.003 | 0.318 ± 0.006 | 0.682 ± 0.002 |
| MINIMDA [40] | miRNA | 0.918 ± 0.001 | 0.324 ± 0.003 | 0.904 ± 0.004 |
| PanGIA | miRNA | 0.926 ± 0.003 | 0.304 ± 0.003 | 0.914 ± 0.002 |
| gGATLDA [41] | lncRNA | 0.931 ± 0.001 | 0.283 ± 0.002 | 0.923 ± 0.005 |
| LDGRNMF [42] | lncRNA | 0.892 ± 0.005 | 0.328 ± 0.003 | 0.849 ± 0.004 |
| PanGIA | lncRNA | 0.933 ± 0.006 | 0.298 ± 0.001 | 0.927 ± 0.003 |
| iPiDi-PUL [43] | piRNA | 0.569 ± 0.026 | 0.444 ± 0.021 | 0.117 ± 0.008 |
| PUTransGCN [44] | piRNA | 0.930 ± 0.007 | 0.103 ± 0.006 | 0.598 ± 0.032 |
| PanGIA | piRNA | 0.934 ± 0.001 | 0.291 ± 0.007 | 0.929 ± 0.003 |
| IGNSCDA [45] | circRNA | 0.812 ± 0.003 | 0.331 ± 0.004 | 0.694 ± 0.006 |
| GATCL2CD [46] | circRNA | 0.931 ± 0.004 | 0.282 ± 0.007 | 0.879 ± 0.008 |
| PanGIA | circRNA | 0.927 ± 0.005 | 0.306 ± 0.003 | 0.914 ± 0.009 |
| PanGIA | miRNA, lncRNA, piRNA, circRNA | 0.988 ± 0.002 | 0.256 ± 0.001 | 0.985 ± 0.004 |

To evaluate PanGIA in multitype ncRNA–disease association prediction, we compared it with existing mainstream methods for each of the 4 ncRNA types (miRNA, lncRNA, piRNA, and circRNA), using AUC, AUPR, and Rank Index. As shown in Table 1, PanGIA achieved the highest AUPR for every ncRNA type and the highest AUC for lncRNA and piRNA; for miRNA and circRNA, its AUC (0.926 and 0.927) was slightly below that of NIMGSA (0.947) and GATCL2CD (0.931), respectively. When all ncRNA types were integrated, performance improved further, reaching the highest AUC (0.988) and AUPR (0.985) overall and the lowest Rank Index among the PanGIA configurations (0.256). These results support PanGIA's generalization ability and predictive accuracy in multitype ncRNA–disease association prediction.

Multitask synchronous prediction can provide key information

To validate the advantages of our proposed framework in leveraging the full-spectrum heterogeneous association network and the neural network architecture design, we not only examined the performance after ablating critical network modules but also progressively reduced the scale of full-spectrum data to evaluate the unique contribution of pan-ncRNA–disease association information.

Stepwise reduction of heterogeneity in the network

To systematically evaluate the impact of reduced training data on model performance, we designed 2 downsampling strategies to progressively decrease the amount of information available to the PanGIA model: (i) a random uniform subsampling strategy and (ii) an RNA-type-based node selection strategy.

In the random uniform subsampling strategy, we adopted a straightforward uniform sampling method. Specifically, a certain proportion of ncRNA and disease nodes were randomly and uniformly removed from the heterogeneous graph, along with their associated edges. This ensured that both types of nodes (ncRNAs and diseases) were reduced at the same rate, preserving the relative balance between modalities in the network while decreasing the overall graph size. By gradually scaling down the input network, we simulated scenarios with limited data availability to examine how PanGIA performs under constrained information settings.
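The uniform node-removal procedure described above can be sketched as follows. This is an illustrative simplification under stated assumptions: the graph is represented only as node lists plus (ncRNA, disease) edge pairs, and the function name `subsample_graph` and its parameters are hypothetical rather than taken from the PanGIA code.

```python
import random


def subsample_graph(rna_nodes, disease_nodes, edges, keep_ratio, seed=0):
    """Keep the same fraction of ncRNA and disease nodes, chosen uniformly at random.

    Edges are (rna, disease) pairs; any edge touching a removed node is dropped.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)

    def keep(nodes):
        ordered = sorted(nodes)  # fixed order so the seed gives reproducible samples
        k = max(1, round(len(ordered) * keep_ratio))
        return set(rng.sample(ordered, k))

    kept_rna = keep(rna_nodes)
    kept_disease = keep(disease_nodes)
    kept_edges = [(r, d) for r, d in edges if r in kept_rna and d in kept_disease]
    return kept_rna, kept_disease, kept_edges
```

Calling the function with `keep_ratio` values of 1.0, 0.8, 0.67, and 0.5 would produce progressively smaller graphs in which both node types shrink at the same rate.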

Based on this strategy, we conducted a systematic performance evaluation of the PanGIA model using 100%, 80%, 67%, and 50% of the original dataset for training. As shown in Fig. 2, performance on all 3 evaluation metrics (AUC, AUPR, and Rank Index) consistently deteriorated as the data scale was reduced. These results indicate that PanGIA is sensitive to the quantity of training data and that its predictive capability is notably affected under data-sparse conditions.

Figure 2: Performance comparison of PanGIA under different subsampling ratios.

In summary, this experiment underscores the critical importance of data completeness in achieving optimal predictive performance with PanGIA, highlighting the key role of full-spectrum biological data in robust association prediction tasks.

To further evaluate the overall contribution of different ncRNA types to the predictive performance of the model, we conducted a stepwise ablation study by progressively removing specific categories of ncRNAs from the full pan-ncRNA set. As shown in Fig. 3, we assessed model performance under various ncRNA combinations using AUC, AUPR, and Rank Index as evaluation metrics. The results demonstrate that the inclusion of all 4 ncRNA types (miRNA, lncRNA, circRNA, and piRNA) yields the best overall performance. In contrast, removing any single or multiple ncRNA types leads to a noticeable decline in 1 or more metrics, highlighting the complementary contributions of each ncRNA class to the overall prediction capability of the PanGIA framework.

Figure 3: Performance of PanGIA with different ncRNA combinations.
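The type-based ablation can be illustrated by enumerating ncRNA-type combinations and restricting the association edges to the selected types. The sketch below is an assumption-laden illustration: it lists every non-empty subset of the 4 types, which may be more combinations than were actually evaluated, and the names `type_combinations`, `filter_edges_by_type`, and `rna_type_of` are hypothetical.

```python
from itertools import combinations

RNA_TYPES = ("miRNA", "lncRNA", "circRNA", "piRNA")


def type_combinations(types=RNA_TYPES):
    """All non-empty subsets of ncRNA types, starting with the full set."""
    combos = []
    for size in range(len(types), 0, -1):
        combos.extend(combinations(types, size))
    return combos


def filter_edges_by_type(edges, rna_type_of, keep_types):
    """Keep (rna, disease) edges whose ncRNA belongs to one of keep_types."""
    keep = set(keep_types)
    return [(rna, disease) for rna, disease in edges if rna_type_of[rna] in keep]
```

For example, filtering with `keep_types=("miRNA", "lncRNA")` yields the training edges for a variant with circRNAs and piRNAs removed; the model would then be retrained and re-evaluated on each such variant.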

Robustness of PanGIA

To validate the effectiveness of key components in the PanGIA framework, we performed a series of ablation experiments by systematically removing core modules, including HAN, MoE, and the cross-task attention mechanism. Additionally, a single-task learning variant was tested for comparison with the full multitask framework. As illustrated in Figs. 4, 5, and 6, AUC, AUPR, and Rank Index were measured for each ablation variant.

Figure 4: Performance comparison of PanGIA ablation variants on AUC.

Figure 5: Performance comparison of PanGIA ablation variants on AUPR.

Figure 6: Performance comparison of PanGIA ablation variants on Rank Index.

The results demonstrate that removing any of the core components leads to a noticeable performance decline across all metrics. Specifically, the removal of HAN or MoE resulted in significant drops in both AUC and AUPR, indicating the importance of structural and expert-based representation learning. Furthermore, disabling the cross-task attention mechanism impaired the model’s ability to integrate information across tasks, reducing prediction accuracy. The single-task baseline also underperformed compared to the full model, highlighting the advantage of PanGIA’s multitask learning design. Overall, these findings confirm that each module contributes uniquely and substantially to the overall predictive power of PanGIA.

PanGIA reveals novel ncRNA–disease associations

In the case study, we selected high-confidence associations between each type of ncRNA (miRNA, circRNA, lncRNA, and piRNA) and representative diseases, as predicted by the PanGIA model. These predicted associations are presented in Table 2. Through literature review, we confirmed that all the associations listed in the table have been experimentally validated, with supporting evidence provided by the corresponding references (PMIDs).

Table 2.

Experimentally validated ncRNA–disease associations used in the case study

RNA type RNA symbol Disease name PMID
miRNA miR-944 Glioblastoma 34233294
miRNA miR-936 Glioblastoma 29218238
miRNA miR-378 Osteoarthritis 35474736
miRNA miR-139 Osteoarthritis 32185303
circRNA CSPP1 Glioblastoma 32495924
circRNA SCN3B Glioblastoma 39289188
circRNA ROCK1 Coronary artery disease 34236817
circRNA WNK1 Coronary artery disease 31821324
lncRNA LINC00324 Stomach carcinoma 32855634
lncRNA LINC00691 Stomach carcinoma 32330554
lncRNA RPSAP52 Stomach carcinoma 35322746
lncRNA AFAP1-AS1 Cholangiocarcinoma 28938565
piRNA DQ570326 Parkinson’s disease 29986767
piRNA DQ592957 Parkinson’s disease 29986767
piRNA DQ596377 Alzheimer’s disease 28127595
piRNA DQ597397 Renal cell carcinoma 25998508

All associations were experimentally validated and supported by the referenced PubMed IDs (PMIDs).

For miRNAs, literature searches confirmed that the high-confidence associations predicted by our model are supported by clear biological mechanisms. For example, miR-944 derived from glioma stem cell exosomes directly downregulates VEGFC expression, further suppressing AKT/ERK signaling activity, thereby significantly reducing glioblastoma growth and angiogenesis [47]. Similarly, miR-936 is markedly downregulated in glioblastoma tissues, with its expression negatively correlated with tumor grade. Re-expression of miR-936 can target the CKS1 gene and inhibit the downstream AKT/ERK pathway, effectively blocking the cell cycle and suppressing tumor growth [48]. In osteoarthritis (OA), overexpression of miR-378 aggravates cartilage degeneration by suppressing autophagy in chondrocytes and inhibiting chondrogenic differentiation of bone marrow mesenchymal stem cells. Its targets, Atg2a and Sox6, are well characterized; conversely, the application of anti–miR-378 alleviates OA progression and promotes joint regeneration, highlighting its therapeutic potential [49]. In addition, miR-139 is significantly upregulated in OA-damaged cartilage and can be activated by IL-1β. By directly targeting MCPIP1, it relieves translational repression of IL-6, leading to elevated IL-6 and degradative enzymes such as MMP-13 and ADAMTS4, thereby promoting chondrocyte apoptosis and matrix degradation [50].

Regarding circRNAs, multiple experimental findings also support the model predictions. CSPP1 is markedly upregulated in glioblastoma tissues, closely associated with abnormal mitosis and the proliferation of tumor cells [51]. Similarly, SCN3B has been identified as a key molecule in glioblastoma, with aberrant expression linked to enhanced tumor cell migration and invasion [52]. In cardiovascular disease, ROCK1-related circRNA plays an important role in coronary artery disease by regulating vascular smooth muscle cell contraction and apoptosis, thereby promoting disease progression [53]. Meanwhile, WNK1-derived circRNA is significantly upregulated in patients with coronary artery disease, affecting endothelial function and ion channel homeostasis, thus accelerating atherosclerosis development [54]. These results further demonstrate the molecular significance of circRNAs in diverse diseases, validating the reliability and biological value of our model predictions.

For lncRNAs, several experimentally validated findings support the predicted associations. LINC00324 is significantly downregulated in stomach carcinoma, where it interacts with miR-3200-5p to regulate downstream BCAT1 expression, thereby inhibiting tumorigenesis [55]. Similarly, LINC00691 is upregulated in gastric cancer tissues and promotes proliferation and invasion by modulating the miR-9-5p/FGFR1 axis, suggesting its oncogenic role [56]. Furthermore, RPSAP52 enhances proliferation and inhibits apoptosis in gastric cancer by regulating the miR-665/STAT3 pathway, thus promoting tumor progression [57]. In cholangiocarcinoma, AFAP1-AS1 is markedly upregulated and promotes migration and invasion through transcriptional regulation of EMT (epithelial-mesenchymal transition)-related genes, underscoring its pivotal role in tumor progression [58].

For piRNAs, most predicted associations have been experimentally verified to show differential expression and participation in pathological processes. For instance, DQ597397 is significantly upregulated in renal cell carcinoma cells compared with normal renal cells, suggesting its role in promoting tumor progression [59]. Conversely, DQ570326 and DQ592957 are downregulated in neurons derived from patients with Parkinson’s disease, potentially contributing to neurodegenerative mechanisms [60]. Moreover, DQ596377 is markedly upregulated in neurons from patients with Alzheimer’s disease (AD), with expression levels 11.38-fold higher than those in normal brain cells, indicating its involvement in AD-specific neuropathological processes [61]. Collectively, these findings demonstrate that piRNAs play crucial molecular roles in the pathogenesis of multiple major diseases and further substantiate the reliability of our model predictions.

These results indicate that PanGIA performs well in the case studies above, demonstrating its capability to identify high-confidence associations between diseases and miRNAs, circRNAs, lncRNAs, and piRNAs. Associations predicted by PanGIA that have not yet been confirmed may serve as candidate targets for subsequent biological experiments and lay a foundation for the potential application of related ncRNAs in disease diagnosis and therapy.

Conclusions

In this study, we proposed PanGIA, a novel model for ncRNA–disease association prediction that integrates HAN with a MoE architecture. Comprehensive experiments demonstrate that PanGIA achieves consistently superior performance across various RNA types, including miRNA, lncRNA, circRNA, and piRNA, validating the effectiveness of our multisource feature extraction and multitask modeling strategy.

PanGIA not only outperforms state-of-the-art methods on multiple evaluation metrics but also maintains robust and stable performance across different ncRNA categories. This indicates strong generalization and robustness of the model in handling heterogeneous RNA data and structures.

Through further case study analyses, we validated that several high-confidence predictions have been experimentally confirmed in the literature, highlighting the significant advantages of this method in terms of biological interpretability and result reliability. In particular, the ncRNA–disease associations predicted by PanGIA show substantial research and application value in fields such as neurological disorders, metabolic diseases, and cancer.

Overall, PanGIA demonstrates strong potential as a unified framework for pan-ncRNA–disease association prediction. It excels in both macro-level performance benchmarks and micro-level case reliability, suggesting excellent cross-task generalizability and interpretability. In future work, incorporating additional omics data and optimizing the network architecture may further enhance its predictive power, contributing to ncRNA functional studies, disease mechanism exploration, and the advancement of precision medicine and biomarker discovery.

Supplementary Material

giaf123_Authors_Response_To_Reviewer_Comments_Original_Submission
giaf123_GIGA-D-25-00208_Original_Submission
giaf123_GIGA-D-25-00208_Revision_1
giaf123_Reviewer_1_Report_Original_Submission

Veronica Buttaro -- 7/9/2025

giaf123_Reviewer_1_Report_Revision_1

Veronica Buttaro -- 9/1/2025

giaf123_Reviewer_2_Report_Original_Submission

Wei Lan -- 7/14/2025

giaf123_Reviewer_2_Report_Revision_1

Wei Lan -- 8/27/2025

Contributor Information

Xiaoyuan Liu, School of Medicine and Health, Harbin Institute of Technology, Harbin 150000, China.

Xiye Lü, School of Medicine and Health, Harbin Institute of Technology, Harbin 150000, China.

Qiuhao Chen, Zhengzhou Research Institute, Harbin Institute of Technology, Harbin 150000, China.

Jiqiu Sun, Department of Otorhinolaryngology, Harbin Institute of Technology Hospital, Harbin 150038, China.

Tianyi Zhao, School of Medicine and Health, Harbin Institute of Technology, Harbin 150000, China; Zhengzhou Research Institute, Harbin Institute of Technology, Harbin 150000, China.

Yan Zhu, College of Veterinary Medicine, Northeast Agricultural University, Harbin 150038, China.

Availability of Source Code and Requirements

Abbreviations

AD: Alzheimer’s disease; AUC: area under the curve; AUPR: area under the precision-recall curve; ceRNA: competitive endogenous RNA; circRNA: circular RNA; DOID: Disease Ontology Identifiers; GIP: Gaussian Interaction Profile; HAN: heterogeneous graph attention network; lncRNA: long noncoding RNA; miRNA: microRNA; MLM: masked language model; MoE: mixture-of-experts; ncRNA: noncoding RNA; OA: osteoarthritis; PanGIA: Pan-ncRNA Graph-Interaction Attention network; piRNA: PIWI-interacting RNA.

Funding

This study was supported by the National Natural Science Foundation of China (Grant No. 62172125) and the Heilongjiang Province Basic Research Support Program for Outstanding Young Teachers (YQJH2023195).

Data Availability

The supporting data underlying this study are available in the GigaScience Database, GigaDB [62].

Competing Interests

The authors declare that they have no competing interests.

References

  • 1. Esteller M. Non-coding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74. 10.1038/nrg3074.
  • 2. Loganathan T, Doss C GP. Non-coding RNAs in human health and disease: potential function as biomarkers and therapeutic targets. Funct Integr Genomics. 2023;23(1):33. 10.1007/s10142-022-00947-4.
  • 3. Harries LW. Long non-coding RNAs and human disease. Biochem Soc Trans. 2012;40(4):902–6. 10.1042/BST20120020.
  • 4. Li C, Ni YQ, Xu H, et al. Roles and mechanisms of exosomal non-coding RNAs in human health and diseases. Signal Transduct Target Ther. 2021;6(1):383. 10.1038/s41392-021-00779-x.
  • 5. Shi H, Xu J, Zhang G, et al. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7(1):101. 10.1186/1752-0509-7-101.
  • 6. Lu M, Zhang Q, Deng M, et al. An analysis of human microRNA and disease associations. PLoS One. 2008;3(10):e3420. 10.1371/journal.pone.0003420.
  • 7. Zang J, Lu D, Xu A. The interaction of circRNAs and RNA binding proteins: an important part of circRNA maintenance and function. J Neurosci Res. 2020;98(1):87–97. 10.1002/jnr.24356.
  • 8. Okholm TLH, Sathe S, Park SS, et al. Transcriptome-wide profiles of circular RNA and RNA-binding protein interactions reveal effects on circular RNA biogenesis and cancer pathway expression. Genome Med. 2020;12(1):112. 10.1186/s13073-020-00812-8.
  • 9. Yan J, Wang R, Tan J. Recent advances in predicting lncRNA–disease associations based on computational methods. Drug Disc Today. 2023;28(2):103432. 10.1016/j.drudis.2022.103432.
  • 10. Yang X, Gao L, Guo X, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS One. 2014;9(1):e87797. 10.1371/journal.pone.0087797.
  • 11. Ali SD, Tayara H, Chong KT. Identification of piRNA disease associations using deep learning. Comput Struct Biotechnol J. 2022;20:1208–17. 10.1016/j.csbj.2022.02.026.
  • 12. Rayford KJ, Cooley A, Rumph JT, et al. piRNAs as modulators of disease pathogenesis. Int J Mol Sci. 2021;22(5):2373. 10.3390/ijms22052373.
  • 13. Li Z, Zhang Y, Bai Y, et al. IMC-MDA: prediction of miRNA-disease association based on induction matrix completion. Math Biosci Eng. 2023;20(6):10659–74. 10.3934/mbe.2023471.
  • 14. Wang Y, Juan L, Peng J, et al. LncDisAP: a computation model for LncRNA-disease association prediction based on multiple biological datasets. BMC Bioinformatics. 2019;20:582. 10.1186/s12859-019-3081-1.
  • 15. Qian Y, He Q, Deng L. iPiDA-GBNN: identification of Piwi-interacting RNA-disease associations based on gradient boosting neural network. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Houston, TX, USA: IEEE; 2021:1045–50. 10.1109/BIBM52615.2021.9669592.
  • 16. Wang L, You ZH, Li YM, et al. GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLoS Comput Biol. 2020;16(5):e1007568. 10.1371/journal.pcbi.1007568.
  • 17. Cao C, Wang C, Dai Q, et al. CRBPSA: CircRNA-RBP interaction sites identification using sequence structural attention model. BMC Biol. 2024;22(1):260. 10.1186/s12915-024-02055-0.
  • 18. Wang X, Liu Y, Li J, et al. StackCirRNAPred: computational classification of long circRNA from other lncRNA based on stacking strategy. BMC Bioinformatics. 2022;23(1):563. 10.1186/s12859-022-05118-7.
  • 19. Bao Z, Yang Z, Huang Z, et al. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2019;47(D1):D1034–37. 10.1093/nar/gkz276.
  • 20. Fan C, Lei X, Tie J, et al. CircR2Disease v2.0: an updated web server for experimentally validated circRNA–disease associations and its application. Genom Proteomics Bioinform. 2022;20(3):435–45. 10.1016/j.gpb.2021.10.002.
  • 21. Muhammad A, Waheed R, Khan NA, et al. piRDisease v1.0: a manually curated database for piRNA associated diseases. Database. 2019;2019:baz052. 10.1093/database/baz052.
  • 22. Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37(Database issue):D98–104. 10.1093/nar/gkn714.
  • 23. Hombach S, Kretz M. Non-coding RNAs: Classification, biology and functioning. Adv Exp Med Biol. 2016;937:3–17. 10.1007/978-3-319-42059-2_1.
  • 24. Cui C, Zhong B, Fan R, et al. HMDD v4.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2024;52(D1):D1327–32. 10.1093/nar/gkae502.
  • 25. Kozomara A, Birgaoanu M, Griffiths-Jones S. miRBase: from microRNA sequences to function. Nucleic Acids Res. 2019;47(D1):D155–62. 10.1093/nar/gky1141.
  • 26. Lin X, Lu Y, Zhang C, et al. LncRNADisease v3.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2024;52(D1):D1365–69. 10.1093/nar/gkad828.
  • 27. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA (New York, NY). 2014;20(11):1666–70. 10.1261/rna.043687.113.
  • 28. Frankish A, Carbonell-Sala S, Diekhans M, et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2023;51(D1):D942–49. 10.1093/nar/gkac1071.
  • 29. Zhao Y, Li H, Fang S, et al. NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res. 2016;44(D1):D203–8. 10.1093/nar/gkv1252.
  • 30. Wang J, Zhang P, Lu Y, et al. piRBase: a comprehensive database of piRNA sequences. Nucleic Acids Res. 2019;47(D1):D175–80. 10.1093/nar/gky1043.
  • 31. Piuco R, Galante PAF. piRNAdb: a Piwi-interacting RNA database. 2021. 10.1101/2021.09.21.461238. Accessed July 2024.
  • 32. Schriml LM, Munro JB, Schor M, et al. The Human Disease Ontology 2022 update. Nucleic Acids Res. 2022;50(D1):D1255–261. 10.1093/nar/gkab1063.
  • 33. Ji Y, Zhou Z, Liu H, et al. DNABERT: pre-trained Bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics (Oxford, England). 2021;37(15):2112–20. 10.1093/bioinformatics/btab083.
  • 34. Sanabria M, Hirsch J, Joubert PM, et al. DNA language model GROVER learns sequence context in the human genome. Nat Mach Intell. 2024;6(8):911–23. 10.1038/s42256-024-00872-0.
  • 35. Suzuki S, Horie K, Amagasa T, et al. Genomic language models with k-mer tokenization strategies for plant genome annotation and regulatory element strength prediction. Plant Mol Biol. 2025;115(4):100. 10.1007/s11103-025-01604-7.
  • 36. Van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43. 10.1093/bioinformatics/btr500.
  • 37. Köhler S. Improved ontology-based similarity calculations using a study-wise annotation model. Database. 2018;2018. 10.1093/database/bay026.
  • 38. Mathur S, Dinakarpandian D. Finding disease similarity based on implicit semantic similarity. J Biomed Inform. 2012;45(2):363–71. 10.1016/j.jbi.2011.11.017.
  • 39. Jin C, Shi Z, Lin K, et al. Predicting miRNA-disease association based on neural inductive matrix completion with graph autoencoders and self-attention mechanism. Biomolecules. 2022;12(1):64. 10.3390/biom12010064.
  • 40. Lou Z, Cheng Z, Li H, et al. Predicting miRNA–disease associations via learning multimodal networks and fusing mixed neighborhood information. Brief Bioinform. 2022;23(5):bbac159. 10.1093/bib/bbac159.
  • 41. Wang L, Zhong C. gGATLDA: lncRNA-disease association prediction based on graph-level graph attention network. BMC Bioinformatics. 2022;23(1):11. 10.1186/s12859-021-04548-z.
  • 42. Wang MN, You ZH, Wang L, et al. LDGRNMF: lncRNA-disease associations prediction based on graph regularized non-negative matrix factorization. Neurocomputing. 2021;424:236–45. 10.1016/j.neucom.2020.02.062.
  • 43. Wei H, Xu Y, Liu B. iPiDi-PUL: identifying Piwi-interacting RNA-disease associations based on positive unlabeled learning. Brief Bioinform. 2021;22(3):bbaa058. 10.1093/bib/bbaa058.
  • 44. Chen Q, Zhang L, Liu Y, et al. PUTransGCN: identification of piRNA–disease associations based on attention encoding graph convolutional network and positive unlabelled learning. Brief Bioinform. 2024;25(3):bbae144. 10.1093/bib/bbae144.
  • 45. Lan W, Dong Y, Chen Q, et al. IGNSCDA: predicting circRNA-disease associations based on improved graph convolutional network and negative sampling. IEEE ACM T Comput Biol Bioinform. 2022;19(6):3530–38. 10.1109/TCBB.2021.3111607.
  • 46. Peng L, Yang C, Chen Y, et al. Predicting circRNA-disease associations via feature convolution learning with heterogeneous graph attention network. IEEE J Biomed Health Inf. 2023;27(6):3072–82. 10.1109/JBHI.2023.3260863.
  • 47. Jiang J, Lu J, Wang X, et al. Glioma stem cell-derived exosomal miR-944 reduces glioma growth and angiogenesis by inhibiting AKT/ERK signaling. Aging. 2021;13(15):19243–259. 10.18632/aging.203243.
  • 48. Wang D, Zhi T, Xu X, et al. MicroRNA-936 induces cell cycle arrest and inhibits glioma cell proliferation by targeting CKS1. Am J Cancer Res. 2017;7(11):2131–43.
  • 49. Feng L, Yang Z, Li Y, et al. MicroRNA-378 contributes to osteoarthritis by regulating chondrocyte autophagy and bone marrow mesenchymal stem cell chondrogenesis. Mol Ther Nucleic Acids. 2022;28:328–41. 10.1016/j.omtn.2022.03.016.
  • 50. Panagopoulos PK, Lambrou GI. The involvement of microRNAs in osteoarthritis and recent developments: a narrative review. Mediterr J Rheumatol. 2018;29(2):67–79. 10.31138/mjr.29.2.67.
  • 51. Xue YF, Li M, Li W, et al. Roles of circ-CSPP1 on the proliferation and metastasis of glioma cancer. Eur Rev Med Pharmacol Sci. 2020;24(10):5519–25. 10.26355/eurrev_202005_21337.
  • 52. Liu H, Weng J, Huang CLH, et al. Is the voltage-gated sodium channel 3 subunit (SCN3B) a biomarker for glioma? Funct Integr Genomics. 2024;24(5):162. 10.1007/s10142-024-01443-7.
  • 53. Dokumacioglu E, Duzcan I, Iskender H, et al. RhoA/ROCK-1 signaling pathway and oxidative stress in coronary artery disease patients. Braz J Cardiovas Surg. 2022;37(2):212–18. 10.21470/1678-9741-2020-0525.
  • 54. Holvoet P, Klocke B, Vanhaverbeke M, et al. RNA-sequencing reveals that STRN, ZNF484 and WNK1 add to the value of mitochondrial MT-COI and COX10 as markers of unstable coronary artery disease. PLoS One. 2019;14(12):e0225621. 10.1371/journal.pone.0225621.
  • 55. Wang S, Cheng Y, Yang P, et al. Silencing of long noncoding RNA LINC00324 interacts with microRNA-3200-5p to attenuate the tumorigenesis of gastric cancer via regulating BCAT1. Gastroenterol Res Pract. 2020;2020:4159298. 10.1155/2020/4159298.
  • 56. Liang W, Xia B, He C, et al. Overexpression of LINC00691 promotes the proliferation and invasion of gastric cancer cells via the Janus kinase/signal transducer and activator of transcription signalling pathway. Int J Biochem Cell Biol. 2020;123:105751. 10.1016/j.biocel.2020.105751.
  • 57. He C, Liu Y, Li J, et al. LncRNA RPSAP52 promotes cell proliferation and inhibits cell apoptosis via modulating miR-665/STAT3 in gastric cancer. Bioengineered. 2022;13(4):8699–711. 10.1080/21655979.2022.2054754.
  • 58. Shi X, Zhang H, Wang M, et al. LncRNA AFAP1-AS1 promotes growth and metastasis of cholangiocarcinoma cells. Oncotarget. 2017;8(35):58394–404. 10.18632/oncotarget.16880.
  • 59. Li Y, Wu X, Gao H, et al. Piwi-interacting RNAs (piRNAs) are dysregulated in renal cell carcinoma and associated with tumor metastasis and cancer-specific survival. Mol Med. 2015;21(1):381–88. 10.2119/molmed.2014.00203.
  • 60. Schulze M, Sommer A, Plötz S, et al. Sporadic Parkinson’s disease derived neuronal cells show disease-specific mRNA and small RNA signatures with abundant deregulation of piRNAs. Acta Neuropathol Commun. 2018;6(1):58. 10.1186/s40478-018-0561-x.
  • 61. Roy J, Sarkar A, Parida S, et al. Small RNA sequencing revealed dysregulated piRNAs in Alzheimer’s disease and their probable role in pathogenesis. Mol Biosyst. 2017;13(3):565–76. 10.1039/C6MB00699J.
  • 62. Liu X, Lv X, Chen Q, et al. Supporting data for “PanGIA: A Universal Framework for Identifying Association between ncRNAs and Diseases.” GigaScience Database. 2023. 10.5524/102760.

Articles from GigaScience are provided here courtesy of Oxford University Press
