Scientific Reports. 2025 Nov 30;15:45769. doi: 10.1038/s41598-025-28636-9

A hybrid CNN–ViT framework with cross-attention fusion and data augmentation for robust brain tumor classification

Ganesh Jayaraman 1, S Meganathan 1, S Sheik Mohideen Shah 1, M Anuradha 1, Ranjeeth Kumar Sundararajan 1, R Rajakumar 1
PMCID: PMC12756294  PMID: 41318692

Abstract

Brain tumor classification from MRI scans is a challenging task that requires accurate and timely detection to increase patient survival rates. Conventional machine learning methods with hand-crafted features often fail to handle the varied sizes, shapes, and textures of tumors. In this study, standard transfer learning models (AlexNet, MobileNetV2, InceptionV3, ResNet50, VGG16, VGG19) and conventional classifiers such as Decision Tree, Naïve Bayes, and LDA were evaluated for multiclass brain tumor classification. The Vision Transformer (ViT), which leverages global context modeling, achieved an accuracy of 87.34%. To further improve performance, a hybrid CNN–ViT framework named CAFNet, combining data augmentation with a Cross-Attention Fusion mechanism, was developed and achieved a test accuracy of 96.41% on a multiclass MRI dataset. The results show that CAFNet significantly outperforms conventional machine learning, deep learning, and transfer learning models for robust brain tumor classification.

Keywords: Brain tumor classification, Convolutional neural networks (CNN), Transfer learning, Vision transformer model, Cross-attention fusion

Subject terms: Cancer, Computational biology and bioinformatics, Engineering, Mathematics and computing, Medical research

Introduction

Millions of people worldwide are diagnosed with brain tumors each year, making them one of the most life-threatening neurological disorders. Determining appropriate treatment strategies and improving survival rates depend heavily on early and accurate detection. Because of its excellent soft tissue contrast and high resolution, magnetic resonance imaging (MRI) is still the most popular non-invasive imaging technique for analyzing brain tumors. However, radiologists’ manual interpretation of MRI scans is often subjective, time-consuming, and prone to inter-observer variability. Research on computer-aided diagnostic (CAD) systems that use machine learning (ML) and deep learning (DL) for automated brain tumor categorization has increased as a result of these constraints.

Traditional machine learning methods have mostly used classifiers such as Support Vector Machines (SVM), Decision Trees, and Naïve Bayes in combination with handcrafted features such as texture, shape, and intensity descriptors. Despite the potential of these approaches, their effectiveness is restricted by limited generalization to complex tumor variations in size, morphology, and location. Because deep learning methods can learn hierarchical representations directly from imaging data, Convolutional Neural Networks (CNNs) in particular have emerged as powerful alternatives to address these problems. The superiority of CNN-based techniques over conventional methods for brain tumor classification and detection has been shown in numerous research studies1,2. By exploiting knowledge from extensive datasets, transfer learning with pre-trained models such as VGG16, ResNet, and Inception has further improved accuracy3–6. To deal with tumor heterogeneity and data scarcity, more advanced frameworks such as ensemble learning, hybrid CNN–RNN models, and capsule networks have been proposed recently7,8.

Despite these developments, several problems remain unresolved. CNN-based methods are effective at capturing spatial features, but they often have trouble encoding global context and long-range dependencies in MRI scans. Additionally, on small medical datasets, deep models are prone to overfitting, which reduces their robustness in real-world clinical applications. Vision Transformers (ViTs), which use self-attention mechanisms, have recently shown significant potential in modeling global relationships in medical images. However, because of their reduced inductive bias, their standalone performance can occasionally be inferior on small datasets. This motivates the development of hybrid frameworks that combine the global context modeling strength of ViTs with the local feature extraction capacity of CNNs.

Major contributions of this work

  1. For brain tumor classification, we propose CAFNet, a novel hybrid framework that combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT).

  2. To create richer and more discriminative representations, a Cross-Attention Fusion (CAF) module is introduced, which effectively integrates local CNN features with global ViT features.

  3. A standard multiclass brain tumor MRI dataset consisting of pituitary tumor, meningioma, glioma, and no tumor classes is thoroughly examined.

  4. Experimental results demonstrate that CAFNet outperforms deep learning, transfer learning, and conventional machine learning models by achieving a test accuracy of 96.41%.

To address these gaps, this study proposes CAFNet, a hybrid CNN–Vision Transformer framework enhanced with Cross-Attention Fusion. Unlike traditional architectures, CAFNet efficiently integrates local and global feature representations, supported by data augmentation methods to reduce overfitting. In experimental evaluation on a multiclass MRI dataset, CAFNet outperforms standard CNNs, standalone ViTs, transfer learning baselines, and traditional ML models, achieving a test accuracy of 96.41%. The results demonstrate CAFNet's robustness and its potential to serve as a clinically reliable tool for brain tumor diagnosis.

Related works

In9, convolutional deep features were fine-tuned for MRI-based tumor classification. The authors showed that, compared with fixed feature extraction, model adaptation improved accuracy, and their work demonstrated the effectiveness of deep fine-tuning for small medical datasets. A CNN model was presented in10 for the purpose of classifying brain tumors from MRI images. End-to-end learning was accomplished by their system without handcrafted features, and the CNN achieved high accuracy across a variety of tumor classes. The study confirmed CNNs' role as a reliable alternative to conventional methods.

CNN feature extraction was combined with Extreme Learning Machines (ELM) for classification. In terms of accuracy, the hybrid approach performed better than either a CNN or an ELM alone; the system used the CNN for feature learning and the ELM for fast classification. These studies created hybrid models11 and showed the value of integrating shallow and deep methods12 for multi-class tumor classification, combining CNNs with ML classifiers and feature selection for improved outcomes. Their hybrid systems worked better than conventional methods, demonstrating how important it is to integrate deep and traditional approaches to achieve robustness.

Several tumor types were classified from MRI scans using a deep CNN model in13. The authors demonstrated that learned deep features performed significantly better than handcrafted ones. Overfitting was successfully reduced by batch normalization and dropout regularization, demonstrating CNNs' strong generalization potential across tumor classes. For MRI classification, the authors in14 used transfer learning with ResNet and GoogleNet. They showed that even with small datasets, pre-trained models can extract rich tumor features, and their results confirmed higher accuracy than conventional methods. The study highlighted the effectiveness of fine-tuning deep models for medical imaging.

Capsule Networks (CapsNets) were introduced in15 for the classification of tumors. Unlike CNNs, CapsNets preserved spatial relationships in MRI features and performed better than conventional CNNs, especially under variations in orientation. The study emphasized the robustness and interpretability of CapsNets when using fewer samples. A genetic algorithm-optimized DNN for MRI tumor grading was developed in16. Automatic hyperparameter tuning reduced overfitting and improved convergence. Their hybrid model, combining CNN features with GA optimization, performed well on pituitary, meningioma, and glioma tumors.

One of the first benchmark datasets for classifying brain tumors was made available by17. Their method used handcrafted texture features with SVM classifiers. Though limited compared with CNNs, it created a strong baseline, and later deep learning research made extensive use of this dataset. A multi-path CNN was proposed by18 to capture both local and global tumor features. Compared with single-path models, their architecture handled tumor variability more effectively and achieved higher classification performance on benchmark MRI datasets. The study emphasized multi-receptive-field learning in CNNs.

A CNN for the classification of multi-grade tumors was developed in19. To enhance MRI quality, they used preprocessing techniques such as skull stripping, and data augmentation reduced the impact of limited datasets. Their model decreased false negatives while achieving better accuracy. A survey of deep learning in medical imaging with an emphasis on brain tumor research was carried out by20. They reviewed and evaluated the clinical application of CNNs, RNNs, and GANs, emphasized issues such as interpretability and dataset imbalance, and provided recommendations for robust AI adoption in healthcare.

For tumor classification, the authors of21 used pre-trained CNNs such as VGG16 and ResNet50. Model accuracy was increased across multiple tumor classes with the help of fine-tuning, and integrating Grad-CAM for interpretability made the results clinically useful. Their study emphasized building trust in AI through visualization. A CNN–RNN hybrid model was presented by22 for the classification of MRI tumors. RNNs modeled sequential MRI slice information, whereas CNN layers extracted spatial features. The hybrid system outperformed standalone CNN and RNN models, and their framework highlighted the importance of hybrid deep networks in 3D imaging.

To enhance the security of maritime transportation, Ali et al. (2025) proposed a deep learning-driven cyber-attack detection system for DC shipboard microgrids. In order to identify anomalies in real time, the system collects operational data from shipboard sensors and controllers, extracts relevant features, and trains deep learning models, such as CNNs and LSTMs. To increase the resilience and dependability of shipboard microgrids, the system may detect deviations from typical operating conditions and initiate automated or manual responses to reduce cyber threats. Similar to the requirement for strong feature fusion and attention mechanisms in medical imaging models like CAFNet, this study highlights the effectiveness of hybrid deep learning approaches in protecting complex cyber-physical systems23.

A hybrid deep learning framework for enhancing the cyber-physical resilience of harbor-integrated shipboard microgrids is presented by Z. Ali et al. (2025). This approach uses deep learning models combined with signal processing techniques to detect and mitigate cyberattacks in real time. The framework aims to improve maritime power systems’ security and stability by incorporating these techniques, ensuring reliable operation even in the face of possible cyberthreats24. To handle both cyberattacks and power quality disruptions, Ali et al. (2025) proposed an adaptive deep neural network (DNN) framework for intrusion detection and prevention in shore-ship hybrid AC/DC microgrids. The technique detects false data injection attacks (FDIAs), which can interfere with voltage and current regulation, by combining frequency-domain feature extraction based on the Fast Fourier Transform (FFT) with DNN classification. The FFT-DNN framework, optimized using the Adam optimizer and ReLU-sigmoid activations, successfully distinguished cyber intrusions from normal power quality events with an accuracy of 97.7% across attack scenarios. This study demonstrates how to improve cybersecurity and resilience in maritime shore-to-ship power networks using a scalable, highly accurate method25.

A new biomedical image segmentation model called DCSSGA-UNet was presented by Hussain et al. (2025). It integrates DenseNet201 with two attention mechanisms: Semantic Guidance Attention (SGA) and Channel Spatial Attention (CSA). By concentrating on critical features during the decoding process, this architecture efficiently fills in semantic gaps and reduces redundancy. Evaluations on three medical image datasets, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG, demonstrated superior performance, with mean Dice coefficients (mDice) of 98.85%, 95.71%, and 96.10%, respectively, and mean Intersection-over-Union (mIoU) scores of 95.67%, 92.39%, and 93.97%. These results underscore the accuracy and adaptability of the model, which makes it a useful tool for clinical applications, particularly in accurate lesion segmentation and aiding in the diagnosis and treatment of diseases like colorectal cancer26.

For explainable medical image classification, Hussain et al. (2025) proposed EFFResNet-ViT, a hybrid deep learning model that combines the strengths of the ResNet and Vision Transformer (ViT) architectures. The model integrates ResNet’s convolutional feature extraction with ViT’s global attention mechanisms, enabling the capture of both local and global features. The interpretability and performance of the model in medical imaging tasks are improved by this combination. When compared to conventional models, the method showed improved accuracy and robustness on several medical image datasets27.

A CNN-based transfer learning framework for the multi-class classification of brain tumors from MRI images was created by Yadav et al.28. Their study demonstrated that, in comparison with conventional models, deep CNN feature extraction in conjunction with refined pre-trained networks (such as ResNet and VGG) produced increased accuracy. Like other CNN-only approaches, their method focuses on capturing local spatial information and does not describe global contextual relationships between distant tumor locations. When tumors have heterogeneous shapes or occur in different parts of the brain, this restriction may result in sub-optimal generalization. To get around this, the Vision Transformer (ViT) component of our proposed CAFNet incorporates global self-attention to learn long-range dependencies and improve discriminative power across all tumor types.

Materials and methods

Problem scope

  1. Early detection of brain tumors with MRI scans is essential for increasing patient survival rates, yet it remains challenging because brain tumors have subtle and diverse visual characteristics.

  2. Conventional machine learning techniques that rely on hand-crafted features, as well as traditional deep learning and transfer learning models, often fail to capture both local fine-grained features and global contextual information in complex brain MRI images.

  3. To achieve improved accuracy and reliability in multiclass brain tumor classification, a unified model that integrates the global modeling capabilities of Vision Transformers, enhanced by cross-attention fusion, with the local feature extraction strength of CNNs is clearly needed, as shown in Fig. 1.

Fig. 1. Proposed architecture.

Exploratory data analytics (EDA)

Initially, the brain tumor MRI dataset was collected from Kaggle and divided into training and testing subsets. The training set consists of 5712 MRI images across four classes (pituitary, glioma, meningioma, and no tumor), and the testing set consists of 1311 MRI images. The training-set class distribution is shown in Fig. 2: notumor has 1595 MRI images, pituitary has 1457, glioma has 1321, and meningioma has 1339.

Fig. 2. Training dataset class distribution.
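As a minimal illustration of this EDA step, the class counts above can be reproduced by walking the dataset directories. The sketch below assumes the usual Kaggle layout (Training/ and Testing/ folders with one sub-folder per class); the paths are placeholders, not taken from the paper.

```python
import os
from collections import Counter

# Hypothetical paths; the actual dataset location is an assumption.
TRAIN_DIR = "brain_tumor_mri/Training"
TEST_DIR = "brain_tumor_mri/Testing"

def class_distribution(root):
    """Count images per class sub-folder (glioma, meningioma, notumor, pituitary)."""
    counts = Counter()
    for cls in sorted(os.listdir(root)):
        cls_dir = os.path.join(root, cls)
        if os.path.isdir(cls_dir):
            counts[cls] = len([f for f in os.listdir(cls_dir)
                               if f.lower().endswith((".jpg", ".jpeg", ".png"))])
    return counts

print("Train:", class_distribution(TRAIN_DIR))   # expected to total 5712 images over 4 classes
print("Test:", class_distribution(TEST_DIR))     # expected to total 1311 images over 4 classes
```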

The test set includes four categories of brain tumors: glioma, meningioma, notumor, and pituitary.

The test dataset’s class distribution is as follows, as seen in Fig. 3: glioma has 300 MRI images, meningioma has 306, notumor has 405, and pituitary has 300.

Fig. 3. Test dataset class distribution.

Figure 4 shows the class information for the brain tumor dataset, where all of the images are grayscale.

Fig. 4. Brain tumor dataset classes.

Figures 5 and 6 show the distributions of image widths and heights. Examining these distributions during exploratory data analysis (EDA) informs three practical preprocessing choices: the resizing strategy, padding, and cropping. These considerations are particularly relevant to deep learning and computer vision tasks.

Fig. 5. Distribution of image widths.

Fig. 6. Distribution of image heights.

Figure 7 shows a scatter plot of image widths vs. heights, a useful step in image-dataset EDA. By offering a visual representation of the distribution of image dimensions, it supports well-informed decisions about data preprocessing.

Fig. 7. Scatter plot of image widths vs. heights.

Examining the distribution of pixel intensities is a vital stage in exploratory data analysis (EDA) for image datasets, as shown in Fig. 8. It provides information about the brightness, contrast, and color balance of the images and informs preprocessing decisions such as color correction, histogram equalization, and normalization.

Fig. 8. Distribution of pixel intensities.
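A minimal sketch of the dimension and intensity EDA behind Figs. 5, 6 and 8, assuming PIL, NumPy, and matplotlib are available; the dataset path and the pixel sub-sampling rate are illustrative choices, not taken from the paper.

```python
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

TRAIN_DIR = "brain_tumor_mri/Training"  # assumed path

widths, heights, intensities = [], [], []
for cls in os.listdir(TRAIN_DIR):
    cls_dir = os.path.join(TRAIN_DIR, cls)
    if not os.path.isdir(cls_dir):
        continue
    for fname in os.listdir(cls_dir):
        img = Image.open(os.path.join(cls_dir, fname)).convert("L")  # grayscale MRI
        widths.append(img.width)
        heights.append(img.height)
        intensities.extend(np.asarray(img).ravel()[::50])  # subsample pixels to limit memory

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(widths, bins=30);      axes[0].set_title("Image widths")
axes[1].hist(heights, bins=30);     axes[1].set_title("Image heights")
axes[2].hist(intensities, bins=50); axes[2].set_title("Pixel intensities")
plt.tight_layout()
plt.show()
```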

Preprocessing

Preprocessing is an essential step in the brain tumor classification pipeline, as it improves MRI image quality, reduces noise, and extracts pertinent features for better model performance. To ensure that the input data was suitable for both deep learning and traditional machine learning models, several preprocessing methods were used in this study.

After preprocessing, the dataset was divided into 80% training, 10% validation, and 10% testing subsets using stratified random selection, maintaining equal class proportions across all categories and preventing bias. For reproducibility, randomization was performed with a fixed random seed. A separate testing directory was set aside for the final model evaluation, and the validation subset was created internally from the training data using the Keras ImageDataGenerator parameter validation_split = 0.2.
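A minimal sketch of this split, assuming the Keras ImageDataGenerator workflow described above (validation_split = 0.2, a fixed seed, and a separate testing directory); the directory names, image size, and batch size are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (128, 128)   # assumed; matches the CNN input size used later
BATCH = 32              # assumed batch size
SEED = 42               # fixed random seed for reproducibility, as described

# Training generator: validation subset carved out internally (validation_split = 0.2).
train_gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
train_flow = train_gen.flow_from_directory(
    "brain_tumor_mri/Training", target_size=IMG_SIZE, batch_size=BATCH,
    class_mode="categorical", subset="training", seed=SEED)
val_flow = train_gen.flow_from_directory(
    "brain_tumor_mri/Training", target_size=IMG_SIZE, batch_size=BATCH,
    class_mode="categorical", subset="validation", seed=SEED)

# Separate testing directory, rescaled only, reserved for the final evaluation.
test_flow = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "brain_tumor_mri/Testing", target_size=IMG_SIZE, batch_size=BATCH,
    class_mode="categorical", shuffle=False)
```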

Conventional classifiers analysis

HOG feature extraction

After the preprocessing steps, we used HoG feature extraction to analyze the brain tumor dataset with traditional classifiers, including Decision Tree (DT), Naïve Bayes (NB), and Linear Discriminant Analysis (LDA). HoG is one of the most widely used feature extraction techniques for image classification. Following the literature, HoG features were extracted from the brain tumor dataset and saved as a pickle file.

Before conventional machine learning classification, features were extracted using the Histogram of Oriented Gradients (HoG). After normalizing each grayscale MRI image, gradients in the x and y directions (Gx, Gy) were calculated.

The gradient’s orientation and magnitude were determined using:

Gradient magnitude:

$|G| = \sqrt{G_x^{2} + G_y^{2}}$  (1)

Gradient orientation:

$\theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right)$  (2)

The final HoG descriptor was created by dividing the image into 8 × 8-pixel cells, computing gradient orientation histograms, and applying block normalization (L2-norm). Decision Tree, Naïve Bayes, and LDA classifiers were then trained on these features. Table 1 summarizes the findings, showing that DT achieved the highest test accuracy of 82.99%.

Table 1. Training and testing accuracy for DT, NB and LDA.

Classifier   Train Accy (%)   Test Accy (%)
DT           100.00           82.99
NB           76.45            68.65
LDA          99.98            73.00
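A compact sketch of the HoG-plus-classical-classifier pipeline described above, using scikit-image and scikit-learn; the working resolution and image-loading step are assumptions, and the HoG settings beyond the stated 8 × 8 cells and L2 block normalization are illustrative defaults.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

def hog_features(gray_image):
    """HoG descriptor with 8x8-pixel cells and L2 block normalization, as described."""
    img = resize(gray_image, (128, 128))            # assumed working resolution
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2")

# X_train_imgs / X_test_imgs are assumed to be lists of grayscale MRI arrays,
# y_train / y_test their integer labels (built from the dataset directories).
def evaluate(X_train_imgs, y_train, X_test_imgs, y_test):
    Xtr = np.array([hog_features(im) for im in X_train_imgs])
    Xte = np.array([hog_features(im) for im in X_test_imgs])
    for name, clf in [("DT", DecisionTreeClassifier()),
                      ("NB", GaussianNB()),
                      ("LDA", LinearDiscriminantAnalysis())]:
        clf.fit(Xtr, y_train)
        print(name, "test accuracy:", accuracy_score(y_test, clf.predict(Xte)))
```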

HOG feature extraction with ML models

After HoG feature extraction, the extracted features were analyzed using standard classifiers.

Figure 9 shows classifier performance in terms of accuracy: Decision Tree (DT) attained 82.99% test-set accuracy, Naïve Bayes (NB) attained 68.65%, and Linear Discriminant Analysis (LDA) attained 73.00%.

Fig. 9. Standard ML classifiers performance analysis.

Figure 10 shows that traditional machine learning (ML) models often classify brain tumors with lower accuracy because, particularly in grayscale MRI scans, different tumor types can look very similar. In addition, some tumor classes may have significantly fewer samples than others, and MRI images may contain noise, varying orientations, or varying brightness levels. Owing to these problems, we were unable to improve the accuracy of these models on this dataset. Further work on brain tumor classification therefore concentrates on deep learning models such as convolutional neural networks to enhance accuracy.

Fig. 10. Performance analysis of DT, NB and LDA.

Cross-validation

In this section, 5-fold cross-validation was performed on the available dataset; it reduces the risk of overfitting and produces a more reliable assessment of model performance by ensuring that every sample participates in both training and validation.
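A minimal sketch of such a 5-fold loop with scikit-learn's StratifiedKFold; build_and_train is a hypothetical helper that trains a fresh model on one split and returns its validation accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(file_paths, labels, build_and_train, k=5):
    """Stratified k-fold loop; build_and_train is an assumed helper that
    trains a fresh model on the given split and returns validation accuracy."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for fold, (tr_idx, va_idx) in enumerate(skf.split(file_paths, labels), start=1):
        acc = build_and_train(np.array(file_paths)[tr_idx], np.array(labels)[tr_idx],
                              np.array(file_paths)[va_idx], np.array(labels)[va_idx])
        print(f"Fold {fold}: {acc:.4f}")
        scores.append(acc)
    print(f"Mean accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
    return scores
```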

The average validation accuracy across folds is displayed in Table 2; Fig. 11.

Table 2. 5-Fold cross validation accuracy.

Fold      Validation Accy (%)
1 96.4
2 95.9
3 96.1
4 95.8
5 96.2
Average 96.08

Fig. 11. 5-Fold cross validation accuracy.

With a mean 5-fold cross-validation accuracy of 96.08% ± 0.47%, CAFNet demonstrated low variance across data partitions and strong internal consistency. Validation across datasets: when tested on the independent Figshare Brain Tumor Dataset, the trained CAFNet achieved 95.60% accuracy, demonstrating strong generalization to unseen, out-of-domain data.

Convolutional neural network

To enhance model performance, deep learning models such as convolutional neural networks (CNNs) were developed on this brain tumor dataset. CNNs automatically learn spatial features from medical images such as MRIs without the need for hand-crafted features, whereas traditional machine learning models often require manual feature engineering (e.g., texture, shape, intensity). CNNs are designed to capture spatial hierarchies in data and can accurately identify the distinctive shapes, boundaries, and textures of brain tumors in images.

Figure 12 shows the CNN model details. We developed a sequential model and applied 32 filters of size 3 × 3 to the input image, with the ReLU activation function providing non-linearity. The input is an RGB, three-channel image of size 128 × 128 × 3. This first layer detects low-level features such as edges and textures.

Fig. 12. CNN model summary.
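A sketch of the described first block (a Sequential model, 32 filters of 3 × 3, ReLU, 128 × 128 × 3 input); the deeper layers and the classification head are illustrative assumptions, since the full architecture is shown only in Fig. 12.

```python
from tensorflow.keras import layers, models

# First block as described (32 filters of 3x3, ReLU, 128x128x3 input);
# the remaining layers and the head below are illustrative assumptions.
cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),    # assumed second block
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),           # four tumor classes
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
cnn.summary()                                        # compare with the summary shown in Fig. 12
```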

The CNN model's training and testing accuracy, as well as its training and validation loss values, are shown in Fig. 13. While classical machine learning provides better accuracy on this brain tumor dataset, CNNs may overfit the training data and fail to generalize to unseen images. This analysis shows that the current CNN model achieved 89.87% training accuracy and 75.90% testing accuracy. On smaller datasets, traditional machine learning models can frequently produce better results, particularly when hand-crafted features are used.

Fig. 13. CNN model accuracy and loss.

Transfer learning models

To extend this work, we focused on pre-trained transfer learning (TL) models such as AlexNet, MobileNetV2, InceptionV3, ResNet50, VGG16, and VGG19. Addressing the limitations of training CNNs from scratch on a small brain tumor dataset is the prime reason for extending this study with transfer learning. With a significant gap between training and testing accuracy, the previous CNN model showed signs of overfitting and suboptimal generalization.

This indicates that there is not sufficient data for the model to learn robust features from scratch. Through transfer learning, the model can enhance performance on small medical datasets by utilizing pre-trained knowledge from large datasets. Pre-trained models have already learned low-level and mid-level features (such as edges, textures, and shapes), and these features can often be transferred to medical imaging, improving accuracy and reducing overfitting.
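A minimal transfer-learning sketch along these lines, using a frozen MobileNetV2 backbone pre-trained on ImageNet; the classification head and hyperparameters are assumptions rather than the exact configuration used in the experiments.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Frozen ImageNet backbone with a small classification head (assumed design).
base = MobileNetV2(weights="imagenet", include_top=False,
                   input_shape=(128, 128, 3), pooling="avg")
base.trainable = False                               # reuse pre-trained low/mid-level features

tl_model = models.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),
])
tl_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# tl_model.fit(train_flow, validation_data=val_flow, epochs=20)   # generators from the split above
```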

Table 3 shows that test-set accuracy improved somewhat when transfer learning models, particularly MobileNetV2, were introduced. ResNet50's relatively poor performance (67.47% training, 68.73% test accuracy) is attributable to its high model depth and large parameter count, which make it prone to overfitting when trained on small medical datasets. ResNet's residual layers need a lot of data to learn discriminative weights efficiently, in contrast to MobileNetV2 and InceptionV3, which use depthwise separable or inception-based convolutions to reduce complexity. As a result, its capacity for generalization was reduced, yielding suboptimal test accuracy.

Table 3. Transfer learning models accuracy.

Model          Train Acc (%)   Test Acc (%)
AlexNet 84.36 78.11
MobileNetV2 91.18 86.96
InceptionV3 89.43 85.13
ResNet50 67.47 68.73
VGG16 84.10 79.33
VGG19 80.22 77.65

Among the transfer learning models, MobileNetV2 achieved a somewhat improved test-set accuracy on the brain tumor dataset (86.96%). AlexNet, InceptionV3, ResNet50, VGG16, and VGG19 achieved test accuracies of 78.11%, 85.13%, 68.73%, 79.33%, and 77.65%, respectively. Table 3 lists each model's training and testing accuracy, and Figs. 14, 15 and 16 show each model's accuracy and performance trends. Training and validation accuracy and loss are shown in Figs. 16 and 17. According to this analysis, MobileNetV2 improved accuracy slightly but still fell short; hence, we focus on transformer models for further analysis.

Fig. 14. TL models train and validation accuracy.

Fig. 15. TL models train and validation loss.

Fig. 16. TL models training, testing and validation accuracy.

Fig. 17. Performance analysis for TL models.

Vision transformer model (ViT)

While CNNs and transfer learning models such as MobileNetV2 and InceptionV3 showed reasonable accuracy in classifying brain tumors, their ability to represent long-range dependencies and global spatial patterns in an image is inherently limited.

To overcome these limitations, this study explores Vision Transformer (ViT) models, which treat an image as a sequence of patches and use self-attention mechanisms to capture both local and global information. This approach is especially helpful in medical imaging, where tumors may exhibit complex spatial relationships that CNNs cannot adequately capture.

The ViT model steps are as follows (a minimal code sketch of the full pipeline is given after the list):

  1. Image Preprocessing

  (a) Input Image

For processing to be efficient, the input image needs to be arranged in a structured format. Normalization and resizing ensure uniform data quality across the dataset, which promotes model convergence.

Mathematical formulation:

$x \in \mathbb{R}^{H \times W \times C}$  (3)

where H is the height, W the width, and C the number of channels.

  (b) Patch Extraction

Instead of focusing only on local features like CNNs do, the model may process spatial information globally by treating an image as a sequence of patches.

Process: the image is divided into non-overlapping patches of size P × P.

The number of patches N is calculated using

$N = \dfrac{H \times W}{P^{2}}$  (4)
  (c) Flattening and Linear Projection

Each patch is flattened and linearly projected to a higher-dimensional embedding, mapping it to a space in which complex patterns can be learned efficiently by the Transformer; the linear projection can also reduce or increase the dimensionality as needed.

The linear embedding $x_i$ is calculated using

$x_i = E \cdot \mathrm{Flatten}(p_i)$  (5)

where $p_i$ is the i-th patch and E is a learnable projection matrix.
  2. Adding Position Embeddings

Unlike CNNs, Transformers lack spatial awareness by nature. Position embeddings enable the model to capture the spatial structure of the image by providing information about each patch's location.

$z_i = x_i + e_i$  (6)

where $x_i$ is the patch embedding, $e_i$ is the position embedding for the i-th patch, and $z_i \in \mathbb{R}^{D}$ is the final input to the transformer.

  3. Transformer Encoder

The Transformer encoder consists of several identical layers, each with two key parts:

  (a) Multi-Head Self-Attention (MHSA)

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$  (7)

where the query, key, and value matrices Q, K, and V are obtained from the input embeddings and $d_k$ denotes the dimension of the key vectors. The attention mechanism enables the model to capture long-range dependencies by focusing on relevant parts of the image and ignoring irrelevant ones; multiple heads allow it to attend to various parts of the image simultaneously and learn a variety of features.

  (b) Feed-Forward Network (FFN)

The FFN is calculated using Eq. (8):

$\mathrm{FFN}(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$  (8)

where $W_1$, $W_2$ are weight matrices, $b_1$, $b_2$ are bias vectors, and ReLU is the activation function. By introducing non-linearity, the FFN allows the model to learn complex patterns beyond linear transformations.

  (c) Layer Normalization and Residual Connections

The residual connections are given by Eq. (9):

$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$  (9)

where Sublayer(x) is the MHSA or FFN output and LayerNorm normalizes the output to stabilize training. In deep networks, residual connections aid gradient flow during backpropagation and prevent vanishing gradients, while layer normalization stabilizes the learning process and enhances convergence.

  4. Classification Head

Using the [CLS] token:

$\hat{y} = \mathrm{softmax}(W_{cls}\, z_{cls} + b_{cls})$  (10)

where $z_{cls}$ is the [CLS] token's output embedding and $W_{cls}$, $b_{cls}$ denote the classification weights and bias. The [CLS] token is well suited to classification tasks because it aggregates global information from all patches.

  5. Loss Function

The categorical cross-entropy loss is defined as:

$\mathcal{L} = -\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}$  (11)

where N is the number of samples, C is the number of classes, $y_{i,c}$ is the true one-hot label, and $\hat{y}_{i,c}$ is the predicted probability for class c. The loss guides the model during training by measuring the difference between true labels and predicted probabilities.
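A compact Keras sketch of the pipeline in Eqs. (3)–(11); the hyperparameters (patch size, embedding dimension, depth, heads) are assumptions, and global average pooling over tokens is used here in place of the paper's [CLS] token for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed hyperparameters for illustration only.
IMG, PATCH, DIM, HEADS, DEPTH, CLASSES = 128, 16, 64, 4, 4, 4
NUM_PATCHES = (IMG // PATCH) ** 2               # Eq. (4): N = H*W / P^2


class PatchEmbed(layers.Layer):
    """Patch extraction and linear projection (Eq. 5) plus learnable position embeddings (Eq. 6)."""

    def __init__(self, num_patches, dim):
        super().__init__()
        self.proj = layers.Conv2D(dim, PATCH, strides=PATCH)  # one conv = flatten + project each patch
        self.reshape = layers.Reshape((num_patches, dim))
        self.pos = layers.Embedding(num_patches, dim)
        self.num_patches = num_patches

    def call(self, x):
        tokens = self.reshape(self.proj(x))
        return tokens + self.pos(tf.range(self.num_patches))


def build_vit():
    inp = layers.Input((IMG, IMG, 3))                         # Eq. (3): x in R^{H x W x C}
    x = PatchEmbed(NUM_PATCHES, DIM)(inp)
    for _ in range(DEPTH):                                    # Transformer encoder blocks
        h = layers.LayerNormalization()(x)
        h = layers.MultiHeadAttention(num_heads=HEADS, key_dim=DIM // HEADS)(h, h)  # MHSA, Eq. (7)
        x = layers.Add()([x, h])                              # residual connection, Eq. (9)
        h = layers.LayerNormalization()(x)
        h = layers.Dense(DIM * 2, activation="relu")(h)       # FFN, Eq. (8)
        h = layers.Dense(DIM)(h)
        x = layers.Add()([x, h])
    x = layers.GlobalAveragePooling1D()(layers.LayerNormalization()(x))  # stands in for the [CLS] token, Eq. (10)
    out = layers.Dense(CLASSES, activation="softmax")(x)
    return models.Model(inp, out)


vit = build_vit()
vit.compile(optimizer="adam",
            loss="categorical_crossentropy",                  # Eq. (11)
            metrics=["accuracy"])
```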

Figure 18 displays the ViT model's predictions on random samples, making it easier to compare actual and predicted labels.

Fig. 18. ViT prediction model.

Figure 19 illustrates the ViT model's training and validation accuracy with a trendline, showing the accuracy achieved at each epoch on both the training and validation sets.

Fig. 19. Train and validation accuracy for ViT model.

The ViT model's training and validation loss over the epochs is shown in Fig. 20. It helps diagnose overfitting and underfitting and guides further improvement of the model's performance.

Fig. 20. Train and validation loss for ViT model.

Figure 21 displays the confusion matrix of the ViT model, showing the prediction quality for each class and the misclassified samples.

Fig. 21. Confusion matrix for ViT model.

Figure 22 illustrates the ViT model's training, validation, and testing accuracy. The model achieved 95.51% training accuracy, 88.53% validation accuracy, and 87.34% testing accuracy. ViT attained better accuracy than the other models evaluated so far; therefore, to improve performance further, the proposed work concentrates on CAFNet, a hybrid CNN–ViT framework with data augmentation and Cross-Attention Fusion.

Fig. 22. ViT model train, test and validation accuracy.

Proposed model

Although ViT performs well on image classification tasks, its limited spatial inductive bias and dependence on large datasets may limit its generalizability, particularly on complex or small-sample datasets. We propose CAFNet, a hybrid framework that combines the advantages of Convolutional Neural Networks (CNNs) and Vision Transformers, to overcome these drawbacks and enhance model performance further.

Data augmentation with CAFNet

To improve the generalization ability and reduce the overfitting of the proposed CAFNet model, various data augmentation approaches were applied to the training dataset. To normalize pixel intensity levels between 0 and 1, each image was rescaled by a factor of 1/255. The Keras ImageDataGenerator module was also used to apply random geometric modifications, such as random rotation within ± 20 degrees, random zoom within a range of 0.2, and random horizontal flipping. By simulating real-world variations such as orientation and scale shifts, these transformations enhanced the model's robustness. To ensure an unbiased evaluation, the testing and validation datasets were only rescaled, without augmentation.

The following configuration was used in implementation:

Table 4. Data augmentation configuration.

Rescaling All images normalized by 1/255
Rotation Random rotations within ± 20°.
Zoom Random zooming within ± 20% of the original image size
Horizontal Flip Random horizontal flips with a probability of 50%.
Validation Split 20% of training data used for validation
Test Data Only rescaled (1/255) to preserve evaluation integrity.
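A minimal sketch of the Table 4 configuration with the Keras ImageDataGenerator; parameter names follow the standard Keras API.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation configuration from Table 4: rescale, +/-20 degree rotation,
# 20% zoom, horizontal flip, and a 20% internal validation split.
aug_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
)

# Test data is only rescaled, preserving evaluation integrity.
test_gen = ImageDataGenerator(rescale=1.0 / 255)
```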

ViT with Cross-Attention fusion

Figure 23 illustrates Cross-Attention Fusion (CAF), a technique that combines two different feature streams by learning where and how one stream should attend to the other.

Fig. 23. Cross-attention fusion (CAF).

It builds on the transformer's concept of self-attention and extends it to multi-stream learning. A ViT alone models only intra-image interactions (patch to patch) within a single stream, whereas CAF incorporates complementary external information, such as CNN local texture features. This gives the model both global context from the ViT patches and local fine-grained details from the CNN features. In this study, we combine CAF with the ViT model to improve brain tumor classification performance.

The Cross-Attention Fusion mechanism of CAFNet, which effectively integrates global contextual representations from the Vision Transformer with local spatial information from the CNN, is responsible for its improved performance, allowing the model to distinguish subtle tumor variations more accurately. However, with limited training data, CAFNet may overfit, and it requires more processing power. To improve clinical applicability and data protection, future research will investigate federated learning techniques and multimodal data integration.

Mathematical formulation of cross-attention fusion

The novelty of CAFNet lies in its Cross-Attention Fusion (CAF) architecture, which uses CNN spatial features as keys/values (K, V) and ViT embeddings as queries (Q). Unlike previous CNN–ViT hybrids that employ symmetric fusion or static concatenation, this enables global-to-local attention learning. The attention weights are computed as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$  (12)

These weights allow each ViT token to concentrate on pertinent CNN regions, preserving fine-grained texture alongside global context. A lightweight fusion head with residual connections and dropout regularization enhances stability on small medical datasets.

Given ViT embeddings $E_{vit} \in \mathbb{R}^{N \times D}$ and CNN features $F_{cnn} \in \mathbb{R}^{M \times C}$:

$Q = E_{vit} W_Q, \quad K = F_{cnn} W_K, \quad V = F_{cnn} W_V$

$F_{fused} = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$  (13)

The difference from self-attention is that self-attention derives Q, K, and V from the same token set, whereas in CAF the queries Q (ViT global features) and the keys/values K, V (CNN local features) come from different sources, producing a global-to-local feature fusion that improves tumor-region focus.

Cross-Attention fusion

Cross-attention is a form of scaled dot-product attention.
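A minimal Keras sketch of the global-to-local fusion of Eqs. (12)–(13), with ViT tokens as queries and flattened CNN feature-map tokens as keys/values; the projection dimension, head count, and the exact form of the lightweight fusion head are assumptions.

```python
from tensorflow.keras import layers

def cross_attention_fusion(vit_tokens, cnn_feature_map, dim=256, heads=4, dropout=0.1):
    """Global-to-local fusion (Eqs. 12-13): ViT tokens act as queries,
    flattened CNN spatial features act as keys/values."""
    # Flatten the (H, W, C) CNN feature map into M = H*W tokens and project to a common dimension.
    cnn_tokens = layers.Reshape((-1, cnn_feature_map.shape[-1]))(cnn_feature_map)
    cnn_tokens = layers.Dense(dim)(cnn_tokens)
    q = layers.Dense(dim)(vit_tokens)                 # queries from the ViT embeddings

    attended = layers.MultiHeadAttention(num_heads=heads,
                                         key_dim=dim // heads,
                                         dropout=dropout)(query=q,
                                                          key=cnn_tokens,
                                                          value=cnn_tokens)
    # Lightweight fusion head: residual connection, LayerNorm and dropout, as described.
    fused = layers.Add()([q, attended])
    fused = layers.LayerNormalization()(fused)
    return layers.Dropout(dropout)(fused)
```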

Algorithm: CAFNet (Input Image) — ViT with cross-attention fusion.
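Since the algorithm figure is not reproduced here, the following is an illustrative reconstruction of how the CAFNet forward pass described above might be assembled (MobileNetV2 branch, compact ViT branch, Cross-Attention Fusion, softmax head). It reuses cross_attention_fusion() from the sketch above, and all hyperparameters are assumptions rather than the authors' exact implementation.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_cafnet(img_size=128, num_classes=4, dim=256):
    """Illustrative CAFNet assembly: CNN branch (local), ViT branch (global),
    Cross-Attention Fusion, then a softmax classification head."""
    inp = layers.Input((img_size, img_size, 3))

    # Local branch: MobileNetV2 spatial feature map supplies keys/values.
    backbone = MobileNetV2(weights="imagenet", include_top=False,
                           input_shape=(img_size, img_size, 3))
    cnn_map = backbone(inp)                                   # e.g. (4, 4, 1280) for 128x128 input

    # Global branch (compact ViT): 16x16 patch projection + one encoder block
    # (position embeddings omitted here for brevity).
    tokens = layers.Reshape((64, 64))(layers.Conv2D(64, 16, strides=16)(inp))
    h = layers.MultiHeadAttention(num_heads=4, key_dim=16)(tokens, tokens)
    tokens = layers.LayerNormalization()(layers.Add()([tokens, h]))

    # Cross-Attention Fusion: ViT queries attend to CNN keys/values (Eqs. 12-13).
    fused = cross_attention_fusion(tokens, cnn_map, dim=dim)

    x = layers.GlobalAveragePooling1D()(fused)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)

cafnet = build_cafnet()
cafnet.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# cafnet.fit(train_flow, validation_data=val_flow, epochs=30)  # augmented generators from Table 4
```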

Fig. 24. CAFNet training and validation accuracy.

Figure 25 shows that the hybrid CNN–ViT model with Cross-Attention Fusion (CAFNet) outperforms a standalone Vision Transformer (ViT), achieving 96.41% test accuracy in brain tumor classification, owing to its ability to learn relationships between global and local features, which is essential for precise brain tumor identification (Fig. 26).

Fig. 25. CAFNet training, validation and testing accuracy.

Fig. 26. Confusion matrix for CAFNet.

Results and discussion

Table 5 shows model-wise training and testing performance in terms of accuracy. Initially, we applied conventional classifiers such as Decision Tree (DT), Naïve Bayes (NB), and Linear Discriminant Analysis (LDA) to the brain tumor classification task. Compared with the other traditional models, the Decision Tree model had the highest test accuracy of 82.99%.

Table 5. Overall model performance analysis.

Model                      Train Acc (%)   Test Acc (%)
DT 100.00 82.99
NB 76.45 68.65
LDA 99.98 73.00
CNN 89.87 75.90
AlexNet 84.36 78.11
MobileNetV2 91.18 86.96
InceptionV3 89.43 85.13
ResNet50 67.47 68.73
VGG16 84.10 79.33
VGG19 80.22 77.65
EfficientNet-B0 92.05 87.50
ConvNeXt-Tiny 93.12 88.22
5-Fold Cross Validation 95.60 96.08
ViT 95.51 87.34
CAFNet 99.01 96.41

To expand on this work, we developed a Convolutional Neural Network (CNN) model that achieved 75.90% test accuracy. To enhance performance further, we used pre-trained transfer learning models such as AlexNet, MobileNetV2, InceptionV3, ResNet50, VGG16, VGG19, EfficientNet-B0, and ConvNeXt-Tiny; with 5-fold cross-validation, accuracy on the dataset was marginally higher at 96.08%. Motivated by findings from recent literature, we also explored transformer-based models, applying the Vision Transformer (ViT) to this dataset and achieving an accuracy of 87.34%. The results show that CAFNet consistently outperforms these modern architectures in both accuracy and robustness across 5-fold cross-validation, demonstrating the effectiveness of the proposed cross-attention fusion mechanism.

In comparison with prior hybrid frameworks such as CNN–RNN22 and CNN–CapsNet15, CAFNet shows superior performance. While CapsNet preserves spatial hierarchy at the cost of high computational complexity, the CNN–RNN model captures sequential interdependence across slices but lacks global spatial awareness. By contrast, CAFNet outperforms the 87–93% range observed for these hybrid baselines, attaining 96.41% test accuracy by combining CNN-based local feature extraction with ViT-based global attention via a Cross-Attention Fusion module.

Since the ViT model produced slightly improved accuracy, our proposed work combines ViT and CNN in a hybrid framework called the Cross-Attention Fusion Network (CAFNet). According to the performance analysis displayed in Fig. 27, CAFNet outperformed all other models on this brain tumor dataset, achieving 96.41% test accuracy. We therefore suggest this framework as a more suitable solution for this task.

Fig. 27. Overall performance analysis.

Ablation study and statistical analysis

To assess the individual contributions of each module in CAFNet, an ablation study was conducted. We evaluated four configurations:

  1. CNN only (MobileNetV2) – baseline local feature extractor.

  2. ViT only – transformer-based global feature extractor.

  3. CNN + ViT (without CAF) – simple concatenation of local and global features.

  4. CAFNet (with CAF) – full model with Cross-Attention Fusion.

We report both mean and standard deviation across multiple runs to provide statistical validation. The results are summarized below:

Fig. 28. Ablation study performance analysis.

Table 6 shows that the superior performance of CAFNet can be attributed to the synergistic integration of convolutional and transformer-based representations through the Cross-Attention Fusion (CAF) mechanism. The CNN component effectively captures fine-grained local spatial features, while the Vision Transformer (ViT) module models long-range dependencies and global contextual information. The CAF layer adaptively fuses these complementary features, enabling the network to focus on both local tumor boundaries and overall structural context, which improves classification robustness across varying tumor types and sizes.

Table 6. Ablation study and statistical analysis.

Model variant   Components used            Train Accy (%)   Test Accy (%)
CNN only        MobileNetV2                91.18 ± 0.40     86.96 ± 0.45
ViT only        ViT                        95.51 ± 0.50     87.34 ± 0.52
CNN + ViT       MobileNetV2 + ViT          97.00 ± 0.35     92.20 ± 0.38
CAFNet          MobileNetV2 + ViT + CAF    99.01 ± 0.25     96.41 ± 0.30

Furthermore, data augmentation contributes to generalization by mitigating bias toward specific image orientations or intensities. The consistent improvement over conventional transfer learning models demonstrates that hybridizing CNN and ViT features with attention-based fusion can effectively overcome the limitations of single-architecture approaches.

Conclusion and future work

This study evaluated different deep learning, transfer learning, and conventional classifiers for the classification of brain tumors. The proposed hybrid CNN–ViT architecture with Cross-Attention Fusion (CAFNet) obtained the highest accuracy of 96.41%, while the Decision Tree, CNN, MobileNetV2, and ViT models achieved 82.99%, 75.90%, 86.96%, and 87.34%, respectively. The moderate dataset size and the use of a single public MRI dataset, which may restrict generalizability to other clinical contexts or imaging techniques, are limitations of the current study despite its high performance. Additionally, despite the model's high accuracy, its interpretability remains limited; attention visualization approaches such as Grad-CAM or SHAP are required to enhance clinical trust and gain a better understanding of feature relevance. To provide more thorough tumor characterization, future work will concentrate on extending CAFNet to larger, multi-institutional datasets and incorporating multimodal inputs (MRI, CT, histology). We also intend to explore federated learning frameworks for privacy-preserving training, explainable AI techniques for visualizing model attention, and lightweight model variants to enable deployment in real-time clinical workflows.

Author contributions

Problem formulation, writing introduction and related works: Ganesh Jayaraman; writing subsections: S. Meganathan; CNN module and ViT development: S. Sheik Mohideen Shah; results and discussion: Anuradha; validation of experiments: Ranjeeth Kumar Sundararajan; validation and selection of methodologies, editing, refining and enhancing the proposed system: R. Rajakumar.

Data availability

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Kumar, R., Verma, A. & Verma, O. P. Brain tumor segmentation using MRI: A review. IEEE Access 7, 170–187 (2019).
2. Alqudah, A. M., Alquraan, H., Qasmieh, I. A., Alqudah, A. & Al-Sharu, A. Brain tumor classification using deep learning technique—A comparison between cropped, uncropped, and segmented lesion images with different sizes. Int. J. Adv. Trends Comput. Sci. Eng. 9(1), 111–118 (2020).
3. Mahmud, T., Sarker, S. & Hossain, M. Deep learning-based brain tumor detection: A comprehensive review. IEEE Access 11, 43305–43323 (2023).
4. Dhakshnamurthy, S., Kumaravel, N. & Manickavasagam, R. Brain tumor classification using deep learning models—A transfer learning approach. Mater. Today Proc. 61, 317–323 (2022).
5. Kaur, P. & Mahajan, P. Detection of brain tumors using a transfer learning-based optimized ResNet152 model in MR images. Comput. Biol. Med. 188, 109790 (2025).
6. Puttegowda, K., Govindegowda, M., Mayigegowda, P., Ramegowda, P. & Nagaraju, A. M. Automated brain tumor detection with advanced machine learning techniques. Biomed. Pharmacol. J. 18(2), 1313–1333 (2025).
7. Yang, Y., Yan, Z. & Li, H. Glioma grading via conventional MRI using deep learning transfer learning approaches. Front. Oncol. 11, 1–10 (2021).
8. Liu, J., Pan, Y., Li, M. & Chen, J. Deep feature learning for brain tumor MRI classification. Future Gener. Comput. Syst. 111, 708–717 (2020).
9. Disci, R., Gurcan, F. & Soylu, A. Advanced brain tumor classification in MR images using transfer learning and pre-trained deep CNN models. Cancers 17(1), 121 (2025).
10. Abiwinanda, N., Hanif, M., Hesaputra, S., Handayani, A. & Rachmat, T. Brain tumor classification using convolutional neural network. In World Congress on Medical Physics and Biomedical Engineering, 183–189 (Springer, Singapore, 2019).
11. Celik, Y. & Inik, T. Hybrid deep learning model for brain tumor classification. Biomed. Signal Process. Control 69, 102910 (2021).
12. Pashaei, E., Sajedi, S. & Shokouhi, S. M. Brain tumor classification via convolutional neural network and extreme learning machines. Multimed. Tools Appl. 79, 12373–12393 (2020).
13. Deepak, S. & Ameer, P. M. Brain tumor classification using deep CNN features via transfer learning. Comput. Biol. Med. 111, 103345 (2019).
14. Swati, Z. N. et al. Brain tumor classification for MR images using transfer learning and fine-tuning. Comput. Med. Imaging Graph. 75, 34–46 (2019).
15. Afshar, P., Mohammadi, A. & Plataniotis, K. N. Brain tumor type classification via capsule networks. In IEEE Int. Conf. Image Process. (ICIP), 3129–3133 (2018).
16. Anaraki, A. K., Ayati, M. & Kazemi, F. Magnetic resonance imaging-based brain tumor grades classification and grading via convolutional neural networks and genetic algorithms. Biocybern. Biomed. Eng. 39(1), 63–74 (2019).
17. Cheng, J. et al. Enhanced performance of brain tumor classification via data augmentation and transfer learning. Front. Comput. Neurosci. 9, 15 (2015).
18. Islam, M., Reza, M., Ahmed, M. & Karray, M. Multi-path convolutional neural network for brain tumor classification. Multimed. Tools Appl. 79, 32697–32716 (2020).
19. Sajjad, M., Khan, S., Muhammad, M. & Mehmood, A. Multi-grade brain tumor classification using deep CNN with extensive data augmentation. J. Comput. Sci. 30, 174–182 (2019).
20. Hossain, M. & Mahmood, I. A survey of deep learning approaches for brain tumor classification in MRI. Comput. Biol. Med. 136, 104698 (2021).
21. Chaudhary, A., Anand, S. & Mittal, M. Brain tumor detection and classification using pre-trained CNN models with Grad-CAM. Multimed. Tools Appl. 80, 30587–30604 (2021).
22. Bharati, S., Podder, P. & Mondal, S. Hybrid CNN-RNN model for brain tumor classification from MRI images. Pattern Recognit. Lett. 145, 67–73 (2021).
23. Ali, Z. et al. Deep learning-driven cyber attack detection framework in DC shipboard microgrids system for enhancing maritime transportation security. IEEE Trans. Intell. Transp. Syst. https://doi.org/10.1109/TITS.2025.3589188 (2025).
24. Ali, Z. et al. A novel hybrid signal processing based deep learning method for cyber-physical resilient harbor integrated shipboard microgrids. IEEE Trans. Ind. Appl. https://doi.org/10.1109/TIA.2025.3590339 (2025).
25. Ali, Z. et al. A novel intelligent intrusion detection and prevention framework for shore-ship hybrid AC/DC microgrids under power quality disturbances. In IEEE Industry Applications Society Annual Meeting (IAS), Taipei, Taiwan, 1–7 (2025). https://doi.org/10.1109/IAS62731.2025.11061392
26. Hussain, T., Shouno, H., Mohammed, M. A., Marhoon, H. A. & Alam, T. DCSSGA-UNet: Biomedical image segmentation with DenseNet channel spatial and semantic guidance attention. Knowl.-Based Syst. 14, 113233 (2025). https://doi.org/10.1016/j.knosys.2025.113233
27. Hussain, T., Shouno, H., Mohammed, M. A., Marhoon, H. A. & Alam, T. EFFResNet-ViT: A fusion-based convolutional and vision transformer model for explainable medical image classification. IEEE Access 13, 54040–54068 (2025). https://doi.org/10.1109/ACCESS.2025.3554184
28. Yadav, A. C., Shah, K., Purohit, A. & Kolekar, M. H. Computer-aided diagnosis for multi-class classification of brain tumors using CNN features via transfer-learning. Multimed. Tools Appl. 84(31), 38959–38982 (2025). https://doi.org/10.1007/s11042-025-20751-z


