Scientific Reports. 2026 Feb 17;16:6582. doi: 10.1038/s41598-026-39894-6

R-GAT: cancer document classification leveraging graph-based residual network for scenarios with limited data

Elias Hossain 1, Tasfia Nuzhat 2, Shamsul Masum 3, Shahram Rahimi 4, Noorbakhsh Amiri Golilarz 4
PMCID: PMC12913666  PMID: 41703211

Abstract

Accurate classification of cancer-related biomedical abstracts is critical for advancing cancer informatics and supporting decision-making in healthcare research. Yet progress in this domain is often constrained by limited availability of labeled corpora and the high computational demands of transformer-based approaches. To address these challenges, we propose a Residual Graph Attention Network (R-GAT) that integrates multi-head attention with residual connections to capture semantic and relational dependencies in biomedical texts. Evaluated on a curated dataset of 1,875 PubMed abstracts spanning thyroid, colon, lung, and generic cancer topics, R-GAT achieves stable and competitive performance (macro-F1: 0.96 ± 0.01), comparable to transformer-based models such as BioBERT and BioClinicalBERT and strong classical baselines like Logistic Regression, while requiring significantly fewer computational resources. Ablation studies confirm the importance of attention and residual connections in ensuring robustness under limited-data conditions. To support reproducibility and facilitate future research, we also release the curated dataset. Together, these contributions demonstrate the value of lightweight graph-based architectures as reliable and resource-efficient alternatives to computationally intensive transformers in biomedical NLP.

Subject terms: Health care, Medical research, Engineering

Introduction

Cancer remains a major global health challenge, with thyroid, colon, and lung cancers ranking among the most prevalent and deadly types worldwide [1–3]. The scale of this challenge has driven extensive biomedical research, resulting in a rapidly expanding body of scientific publications. Notably, most of these findings are first communicated through PubMed abstracts, which provide concise yet information-rich summaries of cancer-related studies. While such abstracts offer valuable insights for clinical and scientific progress, their growing volume and complexity make manual analysis infeasible. Consequently, automated classification of cancer abstracts has become an essential task to accelerate literature mining and knowledge discovery.

In response to this need, recent advances in natural language processing (NLP) have enabled substantial progress in biomedical text mining [4], offering tools for entity recognition [5], relation extraction [6], and document classification [7]. Nevertheless, two critical obstacles remain. First, large biomedical resources such as CORD-19 [8] are broad and noisy, while electronic health records (EHRs) are incomplete and difficult to access, leaving a gap in reliable, cancer-focused corpora. While multi-cancer datasets such as the Hallmarks of Cancer corpus (HOC) [9] and CORD-19 [8] subsets exist, they are either broad in scope or not specifically balanced across thyroid, colon, and lung cancer categories. Second, state-of-the-art deep learning models, particularly transformers, achieve high accuracy but require massive labeled datasets and substantial computational resources, which limits their use in data-constrained biomedical contexts. More importantly, this limitation highlights the need for models that can generalize effectively without relying on large-scale training corpora.

To ensure consistent evaluation, we curated a controlled subset of PubMed abstracts, offering a reproducible benchmark for cancer classification studies. At the same time, addressing the methodological gap left by data- and compute-intensive transformers requires exploring alternative architectures. One promising direction is the use of graph-based neural networks, which differ from sequence models by representing biomedical abstracts as interconnected structures that capture relational and semantic dependencies between terms. When combined with attention mechanisms, such models can highlight critical biomedical entities and their interactions, while residual connections mitigate information loss and stabilize training. Collectively, these properties position graph-based residual architectures as strong candidates for robust text classification in scenarios where labeled data are scarce and computational resources are limited.

Building on this rationale, this study investigates the following research question: Can graph-based residual architectures provide a robust and computationally efficient alternative to transformer models for cancer abstract classification under limited-data conditions?

To address this question, we propose R-GAT, a graph-based model that integrates multi-head attention with residual connections to capture semantic dependencies more effectively. Our approach is systematically benchmarked against a wide range of traditional machine learning, deep learning, and transformer-based baselines, with ablation studies conducted to assess the contribution of each architectural component. Ultimately, the aim of this study is to demonstrate that R-GAT provides a stable and computationally efficient solution for cancer document classification in limited-data settings. Building on this objective, the key contributions are as follows:

  • We introduce R-GAT, a graph-based approach that leverages residual connections and multi-head attention to deliver robust performance under limited-data biomedical NLP conditions.

  • We design a rigorous evaluation framework, benchmarking R-GAT against traditional machine learning, deep learning, and transformer-based baselines, and conducting ablation studies to isolate the impact of its architectural components.

  • To ensure reproducibility and provide a controlled testbed, we make available a curated subset of 1,875 PubMed abstracts, balanced across thyroid, colon, and lung cancers, which serves as a benchmark resource for future cancer document classification studies.

The remainder of this manuscript is organized as follows. First, we review related work on cancer document classification, followed by a description of the dataset and its statistical properties. We then present the proposed R-GAT framework and its components, and outline the evaluation metrics, baseline models, implementation details, and reproducibility considerations. Next, we report the experimental results and key insights, followed by a critical discussion of the findings, with emphasis on the performance of R-GAT relative to other models. Finally, the manuscript concludes with a summary of contributions and directions for future research.

Literature review

Research on cancer text classification has progressed from traditional machine learning to deep learning and transformer-based approaches, with growing interest in graph-based methods. Early work primarily focused on single-domain resources such as radiology reports and clinical notes. For instance, Nguyen et al. [10] developed a hybrid encoder–decoder model with attention for Dutch radiology reports, achieving strong accuracy but facing challenges in clinical applicability. Similarly, Tang et al. [11] fine-tuned BERT with an attention layer for clinical progress notes, attaining 97.6% accuracy and underscoring the potential of transformers for medical text classification. More recent studies such as Uskaner et al. [12] applied pre-trained BERT and DistilBERT to Turkish mammography reports, showing that domain adaptation can yield high performance, with BERT achieving a 91% F1-score.

Beyond cancer-specific tasks, graph-based neural architectures have gained traction in NLP and biomedical informatics [13]. Ai et al. [14] proposed the Edge-Enhanced Minimum-Margin Graph Attention Network (EMGAN) to address short-text sparsity, while Wei et al. [15] introduced a graph convolutional attention network (GCAN) with residual connections for time-series predictions. Other innovations, such as Song et al.’s Graph Sequence Pretraining with Transformer (GSPT) [16] and Rao et al.’s Multi-layer Residual Attention Network (MRAN) [17], highlight how attention and residual mechanisms can improve relational reasoning and semantic representation across diverse text-attributed graphs and knowledge graphs.

Together, these studies demonstrate the potential of attention-based and graph-based architectures to capture semantic and structural dependencies beyond linear text representations. However, most prior work has been limited to narrow contexts such as single cancer types, proprietary clinical notes, or imaging reports, leaving multi-cancer abstract classification comparatively underexplored. Moreover, while graph neural networks have been applied in biomedical and related domains, the role of residual connections within graph attention mechanisms has not been systematically examined for document-level biomedical classification, particularly under limited-data conditions.

This gap motivates the present work, which positions residual graph attention as a candidate solution for improving robustness and efficiency in cancer abstract classification.

Dataset construction and statistics

Data collection

To provide a reliable benchmark for cancer document classification under limited-data conditions, a domain-specific dataset was constructed comprising 1,875 medical abstracts focusing on thyroid, colon, and lung cancers, as well as general biomedical topics. The abstracts were retrieved from PubMed, one of the largest biomedical literature databases, using the open-source Entrezpy [18] Python library, which provides direct programmatic access to the NCBI Entrez system. Data collection was carried out between January and March 2024, and the search strategy was designed to incorporate cancer-specific terms such as “thyroid cancer,” “colon cancer,” and “lung cancer,” alongside more general biomedical keywords. Retrieval was restricted to articles published within the last five years to ensure that recent research trends were captured.
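The retrieval step can be illustrated with a minimal sketch. It does not use Entrezpy; instead it only builds request URLs for the public NCBI E-utilities `esearch` endpoint, which is an assumption made to keep the example dependency-free. The helper name `build_esearch_url`, the `retmax` value, and the conversion of "five years" into days are all hypothetical choices, and the general biomedical keywords are omitted because the text does not list them.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (used here instead of Entrezpy).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Cancer-specific terms quoted in the text; the "general biomedical
# keywords" are not specified in the paper, so they are left out.
SEARCH_TERMS = {
    "thyroid": '"thyroid cancer"',
    "colon": '"colon cancer"',
    "lung": '"lung cancer"',
}


def build_esearch_url(term, years=5, retmax=500):
    """Build (but do not send) an esearch request for recent PubMed records.

    `years` and `retmax` are illustrative assumptions, not values from the paper.
    """
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # query string
        "datetype": "pdat",      # filter on publication date
        "reldate": years * 365,  # only records from the last N days
        "retmax": retmax,        # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


urls = {label: build_esearch_url(term) for label, term in SEARCH_TERMS.items()}
for label, url in urls.items():
    print(label, url)
```

Building the URLs separately from sending them keeps the query definition easy to inspect and log, which helps when documenting a search strategy for reproducibility.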

An initial set of approximately 2,000 abstracts was obtained, from which non-English, duplicate, irrelevant, or incomplete entries were excluded. The remaining abstracts were manually reviewed and categorized into four groups: thyroid, colon, lung, and generic. The final dataset consisted of 1,875 unique abstracts with an average length of approximately 145 tokens, ensuring both domain specificity and balanced topical coverage.

Data cleaning

Prior to model development, the dataset underwent a systematic data cleaning and preprocessing pipeline to enhance quality and usability. Missing attributes were identified and resolved, followed by tokenization using the NLTK library [19] to break text into smaller, meaningful units. To normalize word forms, lemmatization [20] was applied, enabling the model to better capture semantic relationships. In addition, redundant and non-informative words were filtered out to reduce noise. Finally, the cleaned text was transformed into numerical vectors using multiple representation methods, including Term Frequency–Inverse Document Frequency (TF-IDF) [21], Word2Vec embeddings [22], and BERT-based tokenization [23], enabling flexibility for downstream machine learning and deep learning tasks. This rigorous collection and cleaning procedure ensures that the dataset is both representative of recent biomedical literature and suitable for robust NLP-based experimentation.
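As a rough illustration of the tokenize, filter, and vectorize steps, the sketch below computes TF-IDF weights in plain Python. It is not the authors' pipeline: it uses a regular-expression tokenizer rather than NLTK, skips lemmatization, relies on a toy stop-word list, and adopts one common smoothed IDF variant. The three example sentences are invented for demonstration.

```python
import math
import re
from collections import Counter

# Toy stop-word list (assumption); a real pipeline would use a fuller list.
STOPWORDS = {"the", "of", "in", "and", "a", "is", "for", "with", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return (vocabulary, rows) where rows[d][v] is the TF-IDF weight."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    # Smoothed IDF (one common variant; other definitions exist).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows


# Invented example sentences, not taken from the dataset.
docs = [
    "Thyroid nodules and thyroid carcinoma risk in adults.",
    "Screening for colon cancer with colonoscopy.",
    "Lung cancer screening with low-dose CT.",
]
vocab, matrix = tfidf(docs)
print(vocab)
```

Terms that occur in only one document (such as "thyroid" here) receive a higher weight than terms shared across documents (such as "cancer"), which is the property that makes sparse lexical features discriminative on small corpora.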

Dataset statistics

Table 1 summarizes the distribution of abstracts across the four categories together with the average number of tokens per abstract. The dataset is relatively well-balanced, with each category contributing between 453 and 481 abstracts. This balance reduces the likelihood of strong class imbalance effects, which can bias model training and evaluation. The average abstract length is approximately 145 tokens, though some variation exists across categories. For example, generic biomedical abstracts tend to be longer (163.72 tokens on average), while lung cancer abstracts are somewhat shorter (135.73 tokens on average). Such variation reflects the differing styles and focus of the source literature, but overall the abstracts remain concise and structurally comparable across classes.

Table 1.

Distribution of abstracts across four cancer categories, including the number of documents and average tokens per abstract. The dataset is relatively balanced, with each class contributing a similar number of samples, helping to minimize class imbalance during evaluation.

| Category | No. of Abstracts | Avg. Tokens |
|---|---|---|
| Colon Cancer | 468 | 144.45 |
| Generic | 453 | 163.72 |
| Lung Cancer | 473 | 135.73 |
| Thyroid Cancer | 481 | 136.00 |
| Total | 1,875 | 144.74 |
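The totals in Table 1 can be checked with a few lines of arithmetic. The sketch below recomputes the overall count, the count-weighted mean token length, and each class's share of the corpus from the per-category rows; the variable names are illustrative only.

```python
# Per-category rows of Table 1: (number of abstracts, average tokens).
table1 = {
    "Colon Cancer": (468, 144.45),
    "Generic": (453, 163.72),
    "Lung Cancer": (473, 135.73),
    "Thyroid Cancer": (481, 136.00),
}

total = sum(n for n, _ in table1.values())

# The overall average is weighted by class size, not a plain mean of the
# four per-class averages.
weighted_avg = sum(n * avg for n, avg in table1.values()) / total

# Share of the corpus contributed by each class.
shares = {name: n / total for name, (n, _) in table1.items()}

print(total, round(weighted_avg, 2))
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Every class share falls close to one quarter of the corpus, which is consistent with the description of the dataset as relatively balanced.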

Although the dataset size (1,875 abstracts) may appear modest compared to large-scale general-domain corpora, it is consistent with established biomedical NLP resources, such as the BC5CDR corpus (approximately 1,500 abstracts) [24]. Domain-specific corpora in biomedical text mining are often smaller in scale because of the stringent filtering and manual curation required to ensure relevance, quality, and interpretability. The relatively balanced distribution of abstracts, coupled with careful preprocessing and annotation, ensures that this dataset is both representative of its target domain and suitable for benchmarking models under limited-data conditions.

Methodology

Overview of the approach

This section describes the workflow for classifying medical documents related to thyroid cancer, colon cancer, lung cancer, and generic topics, which proceeds in four sequential phases. In the first phase, medical abstracts related to these cancers were collected from the PubMed database. The second phase involved text preprocessing, in which the raw data underwent tokenization, spelling checks, and text normalization (such as lemmatization) to produce a high-quality dataset.

The third phase comprised the R-GAT model, which unfolds across four steps. In the first step, a graph was constructed, represented by node features and an adjacency matrix. In the second step, this graph was processed through two Graph Attention Network (GAT) layers before entering the Residual Block. The third step introduced the Residual Block, a crucial component comprising three GAT layers, each with its own activation function, as depicted in Fig. 1. In the fourth step, a Global Average Pooling layer aggregated the features. Lastly, in the final phase of the workflow, a fully connected layer was used for classification.

Fig. 1.

End-to-end methodology for cancer abstract classification using the proposed R-GAT. The workflow is divided into four major phases: (1) Data Collection: Abstracts are retrieved from PubMed and curated into a medical document corpus. (2) Text Preprocessing: Cleaning operations include spelling correction, tokenization, and lemmatization, producing a high-quality dataset suitable for model training. (3) Graph Construction and R-GAT Model Architecture: Abstracts are represented as graphs, where nodes correspond to document features and edges capture relational dependencies. The adjacency matrix and feature vectors form the graph representation. This representation is processed through stacked Graph Attention (GAT) layers with non-linear activations, followed by a Residual Block consisting of three GAT layers and skip connections. The residual design mitigates information loss and stabilizes training. (4) Classification: Features are aggregated via a Global Average Pooling layer and passed through a fully connected layer with a Softmax decoder to predict four target categories: thyroid cancer, colon cancer, lung cancer, and generic biomedical abstracts.

Graph construction

The first step is to construct a graph that represents the cancer documents and their interconnections. In the node feature representation, each medical text document is assigned a feature vector that reflects its content. The feature matrix, denoted as $X \in \mathbb{R}^{N \times F}$, holds the document features, where $N$ is the number of nodes in the graph and $F$ is the number of features per node. An adjacency matrix $A \in \mathbb{R}^{N \times N}$ represents the relationships between documents, with $A_{ij}$ denoting the strength of the connection between document $i$ and document $j$, computed using cosine similarity. The value of $A_{ij}$ is set to 1 if the similarity exceeds a predefined threshold and 0 otherwise. The edge weights can be acquired through training or determined using domain-specific knowledge.
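The thresholded cosine-similarity construction can be sketched as follows, assuming NumPy is available. The threshold of 0.5 and the addition of self-loops are assumptions made for illustration, since the text does not state the threshold value or how the diagonal is treated; the function name `build_adjacency` and the toy feature matrix are likewise hypothetical.

```python
import numpy as np


def build_adjacency(X, threshold=0.5, self_loops=True):
    """Binary adjacency from pairwise cosine similarity of row vectors.

    `threshold` and `self_loops` are illustrative assumptions.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)  # avoid division by zero
    sim = X_unit @ X_unit.T                   # cosine similarity matrix (N x N)
    A = (sim > threshold).astype(np.float32)  # 1 if similar enough, else 0
    np.fill_diagonal(A, 1.0 if self_loops else 0.0)
    return A


# Toy feature matrix: documents 0 and 1 are similar, document 2 is not.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.0],
])
A = build_adjacency(X, threshold=0.5)
print(A)
```

Keeping self-loops guarantees that every node has at least one neighbor, which matters later because the attention softmax is normalized over each node's neighborhood.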

Graph layers with attention mechanism

A key component of our approach is the graph attention mechanism, which computes attention scores over the neighboring documents of each node. For a node $i$, the unnormalized attention score $e_{ij}$ toward a neighbor $j$ is given by Equation 1, which applies a Leaky Rectified Linear Unit (LeakyReLU) activation to the concatenation of the linearly transformed feature vectors of nodes $i$ and $j$. Here, $h_i$ and $h_j$ denote the feature representations of nodes $i$ and $j$, respectively; $a$ is a learnable attention weight vector and $W$ is a learnable weight matrix. The notation $a^{T}$ indicates that the vector $a$ is transposed before the dot product so that the dimensions align, and the double vertical bar $\Vert$ denotes concatenation.

$$e_{ij} = \mathrm{LeakyReLU}\left(a^{T}\left[W h_i \,\Vert\, W h_j\right]\right) \tag{1}$$

To obtain attention coefficients, we normalize the attention scores with the SoftMax function in Equation 2, where $\mathcal{N}_i$ denotes the set of neighboring nodes of node $i$.

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{2}$$
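A small NumPy sketch may help make the scoring and normalization steps concrete. It follows the standard graph attention formulation (LeakyReLU over a learned projection of concatenated, linearly transformed features, then a softmax restricted to neighbors). The LeakyReLU slope of 0.2, the random weights, and the toy graph are assumptions; the sketch also assumes every node has at least one neighbor, for example through self-loops.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # Slope 0.2 is a common default, assumed here.
    return np.where(x > 0, x, slope * x)


def attention_coefficients(H, A, W, a):
    """Neighbor-normalized attention coefficients for one attention head.

    H: (N, F) node features, A: (N, N) adjacency (nonzero = edge),
    W: (F, D) weight matrix, a: (2 * D,) attention vector.
    """
    Z = H @ W                         # linear transform, shape (N, D)
    d = Z.shape[1]
    # a^T [z_i || z_j] splits into a[:d] . z_i + a[d:] . z_j
    src = Z @ a[:d]                   # contribution of node i
    dst = Z @ a[d:]                   # contribution of node j
    e = leaky_relu(src[:, None] + dst[None, :])   # scores e_ij, shape (N, N)
    e = np.where(A > 0, e, -np.inf)   # mask out non-neighbors
    e = e - e.max(axis=1, keepdims=True)          # numerical stability
    exp_e = np.exp(e)                 # exp(-inf) = 0 for masked entries
    return exp_e / exp_e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))           # 3 nodes, 4 input features
W = rng.normal(size=(4, 2))           # project to 2 dimensions
a = rng.normal(size=4)                # attention vector of length 2 * 2
A = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)

alpha = attention_coefficients(H, A, W, a)
print(alpha.round(3))
```

Each row of `alpha` sums to one over that node's neighbors, and non-neighbors receive exactly zero weight.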

Residual blocks with graph attention layers

To capture more complex interactions, we use a Residual Block that combines multiple GAT layers, each followed by an activation function $\sigma$. The input $H^{(0)}$ in Equation 3 is the output of the two GAT layers that precede the Residual Block. Equations 3–6 describe the structure of the block, where $A$ is the adjacency matrix. Specifically, $H^{(1)}$ and $H^{(2)}$ are the outputs of the first and second GAT layers, respectively; $H^{\mathrm{res}}$ is the result of adding the residual connection (Equation 5); and $H^{(3)}$ is the output of the third GAT layer of the Residual Block.

$$H^{(1)} = \sigma\left(\mathrm{GAT}\left(H^{(0)}, A\right)\right) \tag{3}$$
$$H^{(2)} = \sigma\left(\mathrm{GAT}\left(H^{(1)}, A\right)\right) \tag{4}$$
$$H^{\mathrm{res}} = H^{(2)} + H^{(0)} \tag{5}$$
$$H^{(3)} = \sigma\left(\mathrm{GAT}\left(H^{\mathrm{res}}, A\right)\right) \tag{6}$$

The attention coefficients $\alpha_{ij}$ are used to aggregate information from adjacent nodes, yielding an updated representation $h_i'$ for each node $i$ over its neighborhood $\mathcal{N}_i$.

$$h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right) \tag{7}$$

To capture a wider range of patterns in the data, we employ $K$ independent attention heads. Each head $k$ operates independently and captures a different aspect of the interactions between nodes, which lets the model attend to diverse patterns at the same time. The non-linear activation function, denoted as $\sigma$, introduces non-linearity, allowing the model to learn intricate relationships within the data.

$$h_i' = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right) \tag{8}$$
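The multi-head layer and the Residual Block can be sketched together in NumPy. This is an untrained forward pass under several assumptions that the text does not confirm: ELU is used as the activation, head outputs are concatenated, the skip connection adds the block input to the output of the second layer, and the sizes (two heads of width four) are arbitrary. All function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def elu(x):
    # Activation choice is an assumption; the paper only says "activation".
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def gat_head(H, A, W, a):
    """One attention head: returns sum_j alpha_ij * W h_j for every node i."""
    Z = H @ W
    d = Z.shape[1]
    e = leaky_relu((Z @ a[:d])[:, None] + (Z @ a[d:])[None, :])
    e = np.where(A > 0, e, -np.inf)              # attend to neighbors only
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z


def gat_layer(H, A, heads):
    """Multi-head layer: activate each head, then concatenate the K outputs."""
    return np.concatenate([elu(gat_head(H, A, W, a)) for W, a in heads], axis=1)


def init_heads(rng, in_dim, head_dim, n_heads):
    """Random (W, a) pairs; a trained model would learn these."""
    return [
        (rng.normal(scale=0.1, size=(in_dim, head_dim)),
         rng.normal(scale=0.1, size=2 * head_dim))
        for _ in range(n_heads)
    ]


def residual_block(H0, A, layers):
    """Three GAT layers with a skip connection before the third layer."""
    H1 = gat_layer(H0, A, layers[0])
    H2 = gat_layer(H1, A, layers[1])
    H_res = H2 + H0                  # requires matching feature dimensions
    return gat_layer(H_res, A, layers[2])


rng = np.random.default_rng(0)
N, F, K, D = 4, 6, 2, 4              # nodes, input features, heads, head width
hidden = K * D                       # concatenated width, kept constant
X = rng.normal(size=(N, F))
A = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)

# Two GAT layers before the block, then the three-layer Residual Block.
pre = [init_heads(rng, F, D, K), init_heads(rng, hidden, D, K)]
block = [init_heads(rng, hidden, D, K) for _ in range(3)]

H0 = gat_layer(gat_layer(X, A, pre[0]), A, pre[1])
H3 = residual_block(H0, A, block)
print(H3.shape)
```

Keeping the concatenated width constant across layers is what allows the element-wise addition in the skip connection; if the widths differed, a projection of the block input would be needed.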

Global average pooling layer

The final node representation is created by concatenating or averaging the results. Our network uses global average pooling, which computes the mean feature vector to represent the entire graph. Finally, the average feature vector passes through a dropout layer and then a fully connected layer, followed by a SoftMax activation function to forecast the cancer document classes. Algorithm 1 is the pseudocode of the R-GAT model, which illustrates the major steps and processes in the design.

Algorithm 1.

R-GAT Model

Decoder and optimization objective

The final graph-level representation, obtained through global average pooling, is passed through a dropout layer and a fully connected decoder for classification. The decoder projects pooled embeddings into a vector with dimensionality equal to the number of cancer classes, followed by a softmax activation to produce class probabilities.
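The pooling and decoding path can be illustrated with a short NumPy sketch. The dropout rate of 0.5, the random weights, the embedding width, and the class ordering are assumptions introduced only for the example; the function name `decode` is hypothetical.

```python
import numpy as np

CLASSES = ["Colon Cancer", "Generic", "Lung Cancer", "Thyroid Cancer"]


def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def decode(H, W_fc, b_fc, drop_p=0.5, training=False, rng=None):
    """Global average pooling -> dropout -> fully connected -> softmax.

    drop_p = 0.5 is an illustrative assumption.
    """
    pooled = H.mean(axis=0)          # mean over nodes gives one vector
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(pooled.shape) >= drop_p
        pooled = pooled * keep / (1.0 - drop_p)   # inverted dropout scaling
    logits = pooled @ W_fc + b_fc    # one logit per class
    return softmax(logits)


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))          # 5 node embeddings of width 8
W_fc = rng.normal(scale=0.1, size=(8, len(CLASSES)))
b_fc = np.zeros(len(CLASSES))

probs = decode(H, W_fc, b_fc)        # inference mode: dropout disabled
predicted = CLASSES[int(np.argmax(probs))]
print(probs.round(3), predicted)
```

Dropout is active only during training; at inference the pooled vector passes through unchanged, so repeated calls give identical probabilities.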

Training is optimized using categorical cross-entropy loss:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log\left(\hat{y}_c\right)$$

where $C$ is the number of classes, $y_c$ is the ground-truth label, and $\hat{y}_c$ is the predicted probability. Optimization is performed with the Adam optimizer (learning rate = 0.001, weight decay = 0.0001), using early stopping based on validation macro-F1 to prevent overfitting.
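A worked example of the loss for a single sample with a one-hot label may be useful. The probability vectors below are invented, and the small `eps` clamp is an implementation detail added to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one sample: -sum_c y_c * log(y_hat_c)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot label for the third of four classes (invented example).
y_true = [0, 0, 1, 0]

confident = [0.02, 0.03, 0.90, 0.05]   # most mass on the correct class
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform guess over four classes

loss_confident = categorical_cross_entropy(y_true, confident)  # -ln(0.90)
loss_uncertain = categorical_cross_entropy(y_true, uncertain)  # ln(4)

print(round(loss_confident, 4), round(loss_uncertain, 4))
```

With a one-hot label, only the predicted probability of the true class contributes, so the loss reduces to the negative log of that probability: a confident correct prediction gives a loss near zero, while a uniform guess over four classes gives ln 4.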

Experimental setup

We benchmarked R-GAT against traditional machine learning, deep learning, and transformer-based baselines under stratified cross-validation protocols. Model performance was assessed using macro- and micro-averaged precision, recall, and F1-scores, with results reported as mean ± standard deviation to capture variability across folds. Confusion matrices were also generated to provide class-level error analysis, and 95% confidence intervals were included where appropriate. Full details of baseline architectures, feature extraction methods, training procedures, hyperparameter configurations, and the reproducibility statement are provided in the Supplementary Information.
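The difference between macro- and micro-averaged F1 can be sketched from a confusion matrix. The matrix below is invented for illustration and is not a result from the paper; rows are taken to be true classes and columns predicted classes.

```python
def per_class_counts(cm):
    """True positives, false positives, false negatives per class."""
    n = len(cm)
    tp = [cm[i][i] for i in range(n)]
    fp = [sum(cm[r][i] for r in range(n)) - cm[i][i] for i in range(n)]
    fn = [sum(cm[i]) - cm[i][i] for i in range(n)]
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(cm):
    tp, fp, fn = per_class_counts(cm)
    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)) / len(cm)
    # Micro: pool the counts across classes, then compute a single F1.
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    return macro, micro


# Invented 4-class confusion matrix (rows: true, columns: predicted).
cm = [
    [90, 5, 3, 2],
    [4, 88, 5, 3],
    [2, 3, 94, 1],
    [1, 2, 1, 96],
]
macro_f1, micro_f1 = macro_micro_f1(cm)
print(round(macro_f1, 4), round(micro_f1, 4))
```

In single-label multi-class classification, micro-F1 equals overall accuracy, whereas macro-F1 weights every class equally; on a balanced dataset such as this one the two tend to be close.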

Results and analysis

Insight 1: performance of baselines

Traditional machine learning and deep learning models establish important reference points for evaluating R-GAT. As detailed in the Supplementary Information (Section A.2: Extended Results for Baseline Models), Logistic Regression with TF-IDF (unigram) features achieved a macro-F1 of 0.98 ± 0.01 (the cross-validated value reported in Table 2), underscoring the effectiveness of sparse lexical representations in small biomedical datasets. Gradient Boosting and AdaBoost also performed competitively, though with higher variance across folds. In contrast, Word2Vec embeddings consistently reduced performance (macro-F1 as low as 0.60), suggesting that dense embeddings without contextual information or domain adaptation are less effective in this setting.

Deep learning models, summarized in the Supplementary Information (Section A.3: Extended Results for Deep Learning and Transformer Models), exhibited more variability. CNNs achieved a strong macro-F1 (reported in Section A.3), confirming their strength in capturing local semantic features. However, sequential models such as RNNs and shallow LSTMs underperformed substantially (macro-F1: 0.33–0.74), likely due to overfitting under limited-data conditions. Transformer-based models (e.g., BioBERT, BioClinicalBERT) delivered the highest absolute scores (macro-F1 of 0.98 ± 0.00 for BioBERT in Table 2), though at significantly higher computational cost.

These baselines highlight two important themes: (1) lightweight linear models can yield surprisingly strong results, but their performance is highly dependent on feature design and may not generalize beyond TF-IDF representations; and (2) transformers achieve state-of-the-art accuracy, but at the cost of large compute requirements. These observations motivate the exploration of architectures such as R-GAT, which aim to balance accuracy, robustness, and efficiency under limited-data and limited-resource conditions.

Insight 2: robustness of R-GAT

R-GAT delivered consistently strong results across cancer classes, achieving a macro-F1 of 0.96 ± 0.01. Figures 2a and 2b illustrate balanced predictions across categories and stable convergence, respectively, while error bars in Figure 3 confirm low variance under cross-validation. This contrasts with transformers and ensemble baselines, which showed greater fluctuations across folds.

Figure 2.

Performance visualization of the proposed R-GAT for multi-cancer abstract classification. (a) Confusion matrix showing the distribution of predictions across the four cancer classes: Colon Cancer, Lung Cancer, Thyroid Cancer, and Generic. Values on the diagonal represent correct classifications, with R-GAT achieving high accuracy across all categories (approximately 0.94 or above), indicating balanced performance and minimal class-specific bias. Off-diagonal values reflect misclassifications, which remain relatively rare. (b) Validation loss curves plotted over 50 epochs for each fold of stratified 5-fold cross-validation. The consistently smooth convergence across all folds demonstrates stable learning behavior and low variance, reinforcing the robustness of the R-GAT model under limited-data conditions.

Figure 3.

Cross-validation robustness analysis using F1-scores with 95% confidence intervals (error bars) for R-GAT, its ablated variants (GAT without residuals and GCN without attention and residuals), and baseline models (Logistic Regression and BioBERT). R-GAT achieves a macro-F1 of approximately 0.96 with the narrowest confidence intervals, indicating strong stability and consistent generalization across folds. In contrast, GCN shows wider intervals and lower mean performance, reflecting higher sensitivity to data partitioning. Logistic Regression and BioBERT achieve slightly higher absolute scores, but with greater computational demands (BioBERT) or dependence on specific feature representations (LogReg). These results emphasize that R-GAT balances robustness and efficiency, making it particularly suitable for limited-data biomedical classification scenarios.

To further assess robustness, we employed stratified 5-fold cross-validation with three independent random seeds, reducing the influence of any single data split. Table 2 compares three representative models: Logistic Regression with TF-IDF features, BioBERT, and the proposed R-GAT. Scores are reported as mean ± standard deviation across folds.
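The evaluation protocol (stratified 5-fold cross-validation repeated over three seeds) can be sketched in plain Python. The seed values 0, 1, and 2, the round-robin fold assignment, and the use of the sample standard deviation are assumptions, and the scores passed to `summarize` are placeholders rather than results from the paper.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return mean(scores), stdev(scores)


# Class counts taken from Table 1.
labels = (["colon"] * 468 + ["generic"] * 453
          + ["lung"] * 473 + ["thyroid"] * 481)

runs = []
for seed in (0, 1, 2):               # three seeds (values are assumptions)
    for test_idx in stratified_folds(labels, k=5, seed=seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        runs.append((train_idx, test_idx))

print(len(runs))                     # 3 seeds x 5 folds

# Placeholder scores to show the reporting format only.
m, s = summarize([0.95, 0.96, 0.97])
print(f"{m:.2f} ± {s:.2f}")
```

Repeating the split with several seeds yields fifteen train/test partitions, so the reported spread reflects sensitivity to partitioning rather than the quirks of any single split.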

Table 2.

Cross-validation performance (mean ± std F1-score) for three representative models: Logistic Regression with TF-IDF features, BioBERT, and the proposed R-GAT, across four cancer classes. Results highlight both per-class and macro-averaged scores, allowing comparison of classical, transformer-based, and graph-based approaches under consistent evaluation.

| Model | Colon | Generic | Lung | Thyroid | Macro F1 |
|---|---|---|---|---|---|
| Logistic Regression (TF-IDF) | 0.97 ± 0.01 | 0.97 ± 0.01 | 0.98 ± 0.01 | 0.99 ± 0.00 | 0.98 ± 0.01 |
| BioBERT | 0.98 ± 0.00 | 0.98 ± 0.01 | 0.98 ± 0.01 | 0.99 ± 0.00 | 0.98 ± 0.00 |
| R-GAT (Proposed) | 0.95 ± 0.02 | 0.95 ± 0.02 | 0.97 ± 0.01 | 0.98 ± 0.01 | 0.96 ± 0.01 |

The results show a clear trade-off: Logistic Regression can slightly exceed R-GAT in absolute macro-F1, but its performance is highly dependent on TF-IDF features and may not extend to more complex representations. BioBERT achieves the strongest absolute scores but requires substantially greater computational resources and pretraining. In contrast, R-GAT maintains competitive accuracy while offering consistently low variance and efficiency in training. This positions R-GAT as a practical middle ground, balancing accuracy, robustness, and resource demands in biomedical NLP settings where stability is often more critical than marginal gains in peak performance.

Insight 3: contribution of model components

Ablation experiments (Table 3) underscore the critical role of residual connections and multi-head attention in the R-GAT architecture. Excluding residuals reduced macro-F1 from 0.96 to 0.92, while removing both residuals and attention caused a sharper decline to 0.82 (with macro precision and recall of 0.83). These results validate the architectural choices, confirming that both mechanisms contribute substantially to robustness and generalization.

Table 3.

Macro- and micro-averaged precision, recall, and F1-scores for the full R-GAT model and its ablated variants. Results show the effect of removing residual connections (GAT) and both residuals and attention (GCN), providing a direct assessment of the contribution of each architectural component.

| Model | Macro P / R / F1 | Micro P / R / F1 |
|---|---|---|
| R-GAT (full) | 0.97 / 0.96 / 0.96 | 0.97 / 0.96 / 0.96 |
| GAT (no residuals) | 0.92 / 0.93 / 0.92 | 0.92 / 0.93 / 0.92 |
| GCN (no attention, no residuals) | 0.83 / 0.83 / 0.82 | 0.83 / 0.83 / 0.82 |

Overall, the evidence shows that although high-capacity models such as BioBERT achieve the strongest absolute scores, R-GAT delivers a more balanced trade-off, maintaining competitive accuracy at a lower computational cost. This positions R-GAT as a lightweight yet reliable alternative for biomedical NLP tasks, particularly in scenarios constrained by data availability or computational resources.

Inference testing

The predictive capability of R-GAT was further examined through inference testing on unseen biomedical abstracts (Fig. 4). In Fig. 4a, the model correctly classified an abstract on the telomere–telomerase complex in familial and sporadic thyroid carcinoma as Thyroid Cancer. In Fig. 4b, an abstract describing nitrosourea-based therapies for bronchogenic carcinoma was accurately labeled as Lung Cancer. These examples illustrate R-GAT’s ability to capture domain-specific terminology and contextual relationships, enabling precise predictions across distinct cancer types.

Figure 4.

Analysis of cancer abstracts fed into the R-GAT model for classification. (a) Thyroid Cancer: the model analyzed an abstract focused on the telomere-telomerase complex in both sporadic and familial thyroid cancer cases, emphasizing telomere shortening and telomerase activation. (b) Lung Cancer: the model processed an abstract detailing the effectiveness of nitrosoureas and other agents in treating various types of lung cancer, including oat cell carcinoma and magenta adenocarcinoma. Both abstracts were correctly classified by the R-GAT model.

While inference outcomes alone cannot establish robustness, they demonstrate how the model translates learned representations into accurate real-world classifications. By leveraging residual graph attention, R-GAT effectively highlights key biomedical entities and preserves contextual dependencies, supporting generalization beyond the training set. This ability to maintain accuracy on new inputs underscores the model’s practical utility for biomedical text mining and its potential to enhance automated cancer literature classification.

Comparative review of existing studies

A comparative summary of existing studies covering cancer types such as breast, colorectal, prostate, lung carcinoma, thyroid, colon, and lung is provided in the Supplementary Information (Section A.4: Comparative Analysis with Prior Work). As shown in the supplementary table, most datasets used in prior studies were not publicly available, indicating limited accessibility and reproducibility. In addition, the “Multi-cancer” column shows that the majority of studies focused on a single cancer type, with only a few addressing multiple types. This reveals a key limitation, because studying multiple cancers together can enable more generalized and comprehensive models. Previous research mainly concentrated on radiological and clinical reports, particularly for breast cancer. In contrast, biomedical abstracts have been underexplored, despite their availability and potential to provide diverse clinical insights. These abstracts contain rich semantic information that can support both single and multi-cancer classification tasks.

Although some prior studies have incorporated transformer-based models, the supplementary comparison (Section A.4) indicates that graph-based attention networks have not been applied to cancer classification using biomedical abstracts. Furthermore, none of the reviewed works specifically targeted thyroid, colon, and lung cancers within this modality. To address these gaps, the R-GAT model was designed to classify medical abstracts by capturing semantic and relational patterns within text using graph-based representations. The performance of R-GAT was evaluated against state-of-the-art transformer models, including BERT, BioBERT, RoBERTa, and BioClinicalBERT, as well as traditional machine learning and deep learning methods. Experimental results show that R-GAT generalizes well across cancer types and performs competitively with these models in classifying biomedical abstracts, while requiring fewer computational resources than the transformer baselines.

Discussion

Across the benchmarking spectrum, a consistent pattern emerged: simple models such as Logistic Regression and resource-intensive transformers like BioBERT both achieved strong performance on this dataset, yet their behavior proved less stable when examined across folds and feature variations. This underscores a broader issue in biomedical NLP: absolute scores alone do not capture the practical reliability of a model, especially under low-data constraints. What matters is not only peak accuracy but also how consistently a model generalizes across different conditions.
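The distinction between peak score and fold-to-fold stability can be made concrete with a small sketch. The Python below computes macro-F1 for a single fold and then summarizes a list of per-fold scores as mean and sample standard deviation, the same "mean ± std" form in which the abstract reports macro-F1. The function names and the per-fold numbers are hypothetical illustrations, not values or code from this study.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize_folds(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


# Hypothetical per-fold macro-F1 values (NOT results from the paper).
# Both lists are built to share the same mean but differ in spread.
steady = [0.95, 0.96, 0.96, 0.97, 0.96]
erratic = [0.99, 0.91, 0.98, 0.93, 0.99]

for name, scores in (("steady", steady), ("erratic", erratic)):
    avg, spread = summarize_folds(scores)
    print(f"{name}: {avg:.3f} ± {spread:.3f}")
```

Two models with the same average can therefore differ markedly in reliability, which is why the spread is reported alongside the mean.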

R-GAT’s strength lies in this dimension of stability. By combining multi-head attention with residual connections, the model maintained balanced performance across cancer categories and resisted fluctuations introduced by different data partitions. While it did not surpass all baselines in raw F1, R-GAT demonstrated that graph-based architectures can reliably capture relational dependencies in text without relying on large-scale pretraining or extensive computational resources. In practice, this makes R-GAT a dependable option in biomedical environments where annotated data are scarce and hardware resources are limited.
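As a rough illustration of how attention and a residual connection combine in one graph layer, the sketch below implements a single-head update in plain Python. It assumes the standard graph attention scoring (a LeakyReLU over a learned vector applied to concatenated projected features, normalized by softmax over neighbors) and simply adds the input features back to the aggregated output. The exact R-GAT formulation is not given in this section, so the function name `residual_gat_layer`, the square weight matrix, and the toy graph are all assumptions; a multi-head variant would run several such heads and concatenate or average them.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_gat_layer(H, neighbors, W, a):
    """One single-head graph attention update with a skip connection.

    H         : list of node feature vectors, each of length d
    neighbors : dict mapping node index -> list of neighbor indices
                (include the node itself to get a self-loop)
    W         : d x d weight matrix (square, so the residual sum is defined)
    a         : attention vector of length 2d
    """
    Z = [matvec(W, h) for h in H]  # projected features
    out = []
    for i, h in enumerate(H):
        nbrs = neighbors[i]
        # Unnormalized score for each neighbor j: LeakyReLU(a . [Z_i || Z_j]).
        logits = [
            leaky_relu(sum(x * y for x, y in zip(a, Z[i] + Z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of neighbor projections.
        agg = [
            sum(alpha * Z[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(len(h))
        ]
        # Residual connection: add the layer input back to its output.
        out.append([x + y for x, y in zip(h, agg)])
    return out


# Toy graph (hypothetical): 3 nodes, 2 features, identity projection,
# zero attention vector so every neighbor receives equal weight.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 0.0, 0.0]
updated = residual_gat_layer(H, neighbors, W, a)
```

Because the input is added back, each node keeps its own representation even when the attention weights are uninformative. This is one plausible reading of why residual connections help stability when training data are limited.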

Beyond model stability, computational efficiency was also carefully considered. Across equivalent experimental settings, R-GAT exhibited substantially shorter fine-tuning and inference durations relative to transformer-based baselines such as BioBERT, reflecting its lightweight architecture and reduced dependency on large-scale pretraining. These observations highlight that, beyond accuracy, R-GAT offers a practical balance between performance stability and computational economy for real-world biomedical NLP deployments.

The dataset itself imposes certain constraints. Its modest size and focus on single-topic abstracts simplify the classification task and reduce the diversity of linguistic patterns available for training. This explains why even lightweight baselines perform unexpectedly well and why high scores should not be over-interpreted as evidence of clinical applicability. Recognizing these limits avoids overstating contributions, while still emphasizing the value of providing a clean, balanced corpus for reproducible cancer informatics research.

The findings position R-GAT as a complementary approach within the biomedical NLP landscape. Transformers remain unmatched in accuracy when large-scale data and compute are available, whereas linear models offer competitive baselines for sparse feature spaces. Between these extremes, R-GAT provides a middle ground: a lightweight, interpretable architecture that delivers stable performance under realistic constraints. Its role is less about replacing existing methods and more about broadening the methodological toolkit for scenarios where robustness and efficiency are as critical as raw accuracy.

Conclusion

This study introduced R-GAT, a residual graph attention network developed for cancer abstract classification under limited-data conditions. Through systematic benchmarking against transformer-based and traditional baselines, R-GAT was shown to achieve competitive accuracy, reduced variance across folds, and efficiency advantages in computationally constrained environments. In addition, a curated dataset of 1,875 PubMed abstracts was released to facilitate reproducibility and provide a standardized benchmark for future investigations.

Nevertheless, the study has certain limitations. Specifically, the dataset is modest in size and has not been clinically validated, which restricts direct applicability in oncology practice. Looking ahead, future work should focus on extending the framework to additional cancer types, incorporating multi-modal data sources, and exploring hybrid graph–transformer architectures.

Additionally, emerging biomedical foundation models such as MedGemma and BioMistral will be included in future comparative analyses to further position R-GAT within the broader landscape of large-scale LLM-based biomedical NLP frameworks.

Taken together, the release of both the model and dataset provides a reproducible foundation for cancer informatics research and supplies a resource that can be integrated into benchmarking efforts and community challenges. Consequently, the findings position R-GAT as a robust and efficient alternative for biomedical NLP tasks in data-constrained settings.

Author contributions

E.H. conceived the research idea, designed the experiments, implemented the codebase, and wrote the full draft of the manuscript. T.N. contributed to refining the experimental results, enhancing performance evaluation, and improving the illustrations. S.M., S.R., and N.A.G. supervised the research work and provided critical feedback throughout the project. All authors reviewed and approved the final manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability

The dataset supporting the findings of this study is publicly available on GitHub and can be accessed at: https://github.com/eliashossain001/MedicalAbstracts

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-39894-6.

References

  • 1. Zhai, M. et al. The global burden of thyroid cancer and its attributable risk factor in 195 countries and territories: A systematic analysis for the global burden of disease study. Cancer Medicine 10, 4542–4554 (2021).
  • 2. World Health Organization. Colorectal cancer. https://www.who.int/news-room/fact-sheets/detail/colorectal-cancer. Accessed on 2023-10-15.
  • 3. Wang, Y.-H., Nguyen, P. A., Islam, M. M., Li, Y.-C. & Yang, H.-C. Development of deep learning algorithm for detection of colorectal cancer in ehr data. Studies in Health Technology and Informatics 264, 438–441 (2019).
  • 4. Moqbel, M. & Jain, A. Mining the truth: A text mining approach to understanding perceived deceptive counterfeits and online ratings. Journal of Retailing and Consumer Services 84, 104149 (2025).
  • 5. Liu, X., Erkoyuncu, J. A., Fuh, J. Y. H., Lu, W. F. & Li, B. Knowledge extraction for additive manufacturing process via named entity recognition with llms. Robotics and Computer-Integrated Manufacturing 93, 102900 (2025).
  • 6. Yang, W., Chen, Y., Xu, J., Qin, Y. & Chen, P. Automatically learning linguistic structures for entity relation extraction. Information Processing & Management 62, 103904 (2025).
  • 7. Alva Principe, R., Chiarini, N. & Viviani, M. Long document classification in the transformer era: A survey on challenges, advances, and open issues. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 15, e70019 (2025).
  • 8. Wang, L. L. et al. Cord-19: The covid-19 open research dataset. arXiv preprint arXiv:2004.10706 (2020). PMID: 32510522; PMCID: PMC7251955.
  • 9. Baker, S. et al. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinform. 32, 432–440. 10.1093/bioinformatics/btv585 (2016).
  • 10. Nguyen, E. et al. A hybrid text classification and language generation model for automated summarization of dutch breast cancer radiology reports. In 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI), 72–81 (IEEE, 2020).
  • 11. Tang, M. et al. Progress notes classification and keyword extraction using attention-based deep learning models with bert. arXiv preprint arXiv:1910.05786 (2019).
  • 12. Hepsag, J. U., Özel, P., Dali, S. A. & Yazıcı, A. Using bert models for breast cancer diagnosis from turkish radiology reports. Language Resources and Evaluation 1–32 (2023).
  • 13. Lu, Y., Goi, S. Y., Zhao, X. & Wang, J. Biomedical knowledge graph: A survey of domains, tasks, and real-world applications. arXiv preprint arXiv:2501.11632 (2025).
  • 14. Ai, W. et al. Edge-enhanced minimum-margin graph attention network for short text classification. Expert Systems with Applications 251, 124069 (2024).
  • 15. Wei, Y., Wu, D. & Terpenny, J. Remaining useful life prediction using graph convolutional attention networks with temporal convolution-aware nested residual connections. Reliability Engineering & System Safety 242, 109776 (2024).
  • 16. Song, Y. et al. A pure transformer pretraining framework on text-attributed graphs. arXiv preprint arXiv:2406.13873 (2024).
  • 17. Rao, Q., Wang, T., Guo, X., Wang, K. & Yan, Y. Knowledge graph completion using a pre-trained language model based on categorical information and multi-layer residual attention. Applied Sciences 14, 4453 (2024).
  • 18. Buchmann, J. P. & Holmes, E. C. Entrezpy: A python library to dynamically interact with the ncbi entrez databases. Bioinformatics 35, 4511–4514 (2019).
  • 19. Bird, S., Loper, E. & Klein, E. Natural Language Processing with Python (O’Reilly Media, Inc., 2009).
  • 20. Khyani, D., Siddhartha, B., Niveditha, N. & Divya, B. An interpretation of lemmatization and stemming in natural language processing. Journal of University of Shanghai for Science and Technology 22, 350–357 (2021).
  • 21. Dai, S. et al. Ai-based nlp section discusses the application and effect of bag-of-words models and tf-idf in nlp tasks. Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023 5, 13–21 (2024).
  • 22. Goldberg, Y. & Levy, O. word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).
  • 23. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding (2019). arXiv:1810.04805.
  • 24. Li, J. et al. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database 2016 (2016).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
