Abstract
Although deep learning (DL) methodologies have progressed substantially, the classification of skin cancer remains a challenging task, for several reasons: artifacts such as hair; severe class imbalance in dermatoscopic datasets; and the difficulty of extracting both fine-grained local features (details within a small area of a lesion) such as texture, color, pigment network, and vascular patterns, and long-range global features such as overall shape, border irregularity, and asymmetry. To overcome these, this study presents a novel two-stage framework. First, a Conditional Generative Adversarial Network (C’GAN) is employed to generate synthetic images for the minority classes. Second, a CNN-ViT ensemble architecture is introduced, followed by a cross-attention-based fusion module that merges their features: the attention fusion module synergistically combines the ViT's global token representations with the CNN's local feature maps. Overall performance is analyzed through standard quantitative metrics, while reliability and stability are validated through bootstrap-based statistical analysis. The framework achieved remarkable accuracies of 99.3%, 99.7%, 98.9%, and 98.2% on Dermatofibroma, Vascular Lesions, Basal Cell Carcinoma, and Actinic Keratosis respectively, besides a 99.4% overall AUC, a 0.93 bootstrap mean, and a 0.0003 standard error. The proposed model showed balanced performance on both majority and minority classes, showcasing its effectiveness under class imbalance.
Keywords: Attention maps, Vision transformers, Conditional generative adversarial networks, Attentional fusion, Ensemble learning
Subject terms: Cancer, Computational biology and bioinformatics, Diseases, Health care, Mathematics and computing
Introduction
Skin cancer is among the most prevalent malignancies worldwide. Among its types, melanoma ranks as the seventeenth most commonly diagnosed cancer, whereas non-melanoma skin cancers are reported to be the fifth most common1. According to the WCRF2, an estimated 331,722 new cases of melanoma and 58,667 related deaths were recorded worldwide in 2022. When diagnosed at the earliest, localized stage (before the melanoma has spread to the lymph nodes), the survival rate is high, but it decreases dramatically as the cancer progresses3. Hence, early detection is necessary for decreasing the associated mortality. However, current detection methodologies such as clinical skin examinations and dermoscopy are time consuming and of limited effectiveness. From a clinical perspective, dermatological diagnosis is characterized by high visual ambiguity, significant inter-class similarity, and substantial intra-class variability. These challenges affect dermatological decision-making, where false negatives carry severe clinical consequences. Even for existing deep learning techniques, achieving consistent robustness across highly challenging, underrepresented classes remains an ongoing issue that limits reliable clinical deployment4,5.
The major challenges that current DL models encounter include image complexities such as hair and other artifacts, imbalanced datasets, incomplete feature extraction, high inter-class similarity coupled with substantial intra-class variation among lesions, and lack of interpretability of model decisions6,7. Dataset imbalance is a major challenge that needs to be addressed: it not only biases model optimization toward majority classes but also undermines the reliability of standard performance metrics such as overall accuracy. Models trained without careful handling of imbalance may appear performant while failing precisely in the cases where diagnostic assistance is most needed. This can be addressed by resampling strategies that balance the dataset. A widely adopted strategy is oversampling of minority classes using data augmentation techniques8,9 such as shifting, scaling, rotating, and changing contrast; however, this creates near-duplicate images and fails to capture the diversity and complexity of lesion features. Hence, the framework synthesizes artificial images using GANs10,11, where the conditioning information guides the GAN to yield clinically relevant patterns that enrich the minority classes. This approach moves beyond traditional oversampling with data augmentations. The other major issue faced by DL models is incomplete feature extraction, which leads to information loss, poor generalization, and failure to handle similar classes. Traditional CNNs extract features through convolutional filters that operate on local regions of the input and are fundamentally limited in capturing long-range or global features. Enhancing feature extraction with global cues, such as the overall asymmetry of a lesion or the irregularity of its entire border, can therefore help overcome these challenges. Current solutions include attention mechanisms, which incorporate attention layers into CNNs to make the model focus on the more informative regions of input images, and skip connections, as in ResNet, which facilitate gradient flow and enable the network to learn more complex, hierarchical features. However, attention mechanisms can become biased toward dominant features, overlooking information in less prominent areas, and while skip connections mitigate vanishing gradients, extremely deep architectures can still present challenges such as overfitting on small datasets, redundant feature updates, unintentional shortcut learning, and training instability. Hence, the framework utilizes an ensemble architecture that combines a CNN (ResNet50V2) with a Vision Transformer12,13 (DeiT) to capture fine-grained local spatial features and long-range global dependencies respectively, leveraging complementary feature representations. An attention fusion14–17 module is utilized to integrate convolutional feature maps with ViT tokens, allowing synergistic merging of local and global contexts for complex lesions.
The work focuses on systematically combining established DL components in a clinically informed manner for skin cancer classification:
A C’GAN pipeline conditioned on the attention maps of real images is used to synthesize reliable images of underrepresented classes, addressing the problem of class imbalance.
An ensemble model consisting of ResNet50V2 and DeiT is introduced to extract features from the new balanced dataset, which combines images of the HAM10000 dataset with C’GAN-generated synthetic images.
An Attention Fusion module, inspired by the cross-attention mechanism, produces a unified, dynamically weighted representation of the local spatial features extracted by the CNN and the global contextual features captured by the ViT.
The remainder of the paper is organized as follows: Section II summarizes various approaches and existing methods for skin cancer classification, along with literature on Generative Adversarial Networks and Vision Transformers in dermatoscopy. Section III details the proposed methodology, including the C’GAN pipeline and the CNN-ViT ensemble architecture. Section IV presents the results obtained by the proposed approach and their comparison with SOTA methods. Section V discusses the reliability of the methodology, its limitations, and future directions. Section VI concludes the study with final remarks.
Literature review
This section summarizes studies relevant to the application of DL concepts to skin cancer classification, highlighting work on ensemble models, dataset balancing, vision transformers, and generative adversarial networks within this domain.
Vidhyalakshmi et al.18 developed a Hybrid Flash Butterfly (HFB) driven CNN-BiLSTM architecture for predicting and categorizing skin diseases. Their model categorizes images into four classes: normal, melanoma, benign keratosis, and melanocytic nevus. Data augmentations were employed until each category contained 6,000 images. In their framework, the HFB optimization algorithm is employed to refine the training specifications of the combined CNN-BiLSTM network. The BiLSTM component captures sequential information from dermatoscopic images by processing feature representations in both forward and backward directions.
Kanchana et al.19 introduced a customized preprocessing pipeline designed specifically for EfficientNet models (B0-B7). The pipeline includes scaling of images, dataset expansion, and removal of artifacts for improved quality of dermoscopic inputs. Gaussian and median blurring techniques were applied to reduce surrounding noise and remove artifacts. Also, multiple thresholding strategies were employed to accurately separate skin lesions from their surrounding regions. Histogram equalization and contrast stretching were incorporated into the preprocessing workflow for further refinement in image standards and ensuring subsequent extraction of features.
Chatterjee et al. in20 introduced an IncepX-Ensemble model based on Transfer Learning (TL) for enhancement in accurate categorization of skin lesion images. Standard data augmentations were incorporated for balancing the dataset. Outputs obtained from multiple models were concatenated, followed by the inclusion of a global spatial average pooling (GAP) layer to form a unified composite feature representation. A grid-search strategy was then employed to determine the optimal set of model weights for the ensemble.
Sofana Reka et al. in21 proposed an approach that integrates principles from quantum computing with traditional ML to advance skin disease classification. Their Quanvolutional Neural Network, incorporating RY qubit rotations and Pauli-Z gate operations within its quantum convolutional layer, achieved an accuracy of 82.86%, and a Quantum Support Vector Classifier built on pre-trained MobileNet produced an accuracy of 72.5%. Due to computing resource limitations, 250 random images per class were considered at a reduced resolution of 28 × 28 pixels.
ZhanLin Ji et al. in22 proposed a novel EFAM-Net architecture for categorization of skin cancer. They incorporated Attentional Residual Learning ConvNeXt (ARLC) module for extraction of low-level features and Parallel ConvNeXt (PCNXt) module for capturing complex representations. Furthermore, a newly developed Multiscale Efficient Attention Feature Fusion (MEAFF) module is introduced to enhance feature extraction and integrate multi-level feature maps. Prior to fusion, an Efficient Channel Attention (ECA) mechanism is applied to generate channel-wise importance weights.
Wu Di et al. in23 presented ECRNet, a novel skin cancer recognition framework designed to effectively capture both global and local contextual information. The network introduces an Explicit Vision Center (EVC) module for enhanced representation learning. Additionally, a feature fusion component termed the CCPA block integrates coordinate attention and channel attention to strengthen feature extraction. The EVC-CCPA-ResNet model achieved 92.12% accuracy on ISIC 2018.
Iftekhar et al. in24 proposed a DL based ensemble combining ResNet50V2, MobileNetV2, and EfficientNetV2 to improve classification accuracy. The framework integrates a relevance scoring that highlights salient features and encourages collaborative interaction between heterogeneous collections of features. To overcome the imbalance in the dataset, data augmentation strategies were utilized extensively and a hair removal preprocessing technique was applied to enable the model to capture lesion characteristics more effectively.
Adnan Saeed et al. in25 introduced a dual-path architecture known as Efficient Global-Vision Attention-Net (EG-VAN), which integrates EfficientNetV2s with a modified ResNet50 enhanced by a Spatial Context Group Attention (SCGA) module as well as a Non Local Block to effectively capture both local contextual relationships and long-range feature dependencies. Their methodology also incorporates extensive data augmentation and a hair-removal preprocessing step to improve dataset balance and image clarity. A Multi-Scale Feature Fusion (MFF) module is further employed for aggregating intermediate feature representations. In addition, the study presents a novel color balancing technique combining the Gray World algorithm with Retinex theory that recalibrates image color distributions to highlight lesion regions more effectively.
Remya et al. in26 proposed an advanced structure that integrates a ViT with TL, channel attention, and region-of-interest (ROI) extraction to accurately identify skin cancer. Data augmentation procedures are applied to alleviate class imbalance. Transfer learning is used to align multimodal data with the input format required by the ViT architecture. In their system, textual information is processed through recurrent neural networks (RNNs), while image features are extracted by CNNs. The resulting visual as well as textual embeddings are concatenated along the sequence dimension before being fed into the ViT model.
Rupali Kiran Shinde et al. in27 developed Squeeze-MNet, a deep learning-based classifier that integrates a Squeeze algorithm, for removing hair and deriving essential image features, with a MobileNet backbone. A black-hat morphological function is applied to suppress noise. The system separates benign from malignant lesions.
Moturi et al. in28 employed MobileNetV2 and DenseNet DL models for classifying skin tumors as benign or malignant. Their preprocessing pipeline includes image scaling and removal of hair for improving the precision of lesion regions. Following model evaluation, MobileNetV2 attained an 85% accuracy, whereas a customized CNN attained 95%. A Python-based web application was also developed, enabling users to input patient details, upload lesion images, and receive a cancer prediction along with the estimated probability of malignancy.
Talha Mahboob Alam et al. in29 designed a DL based framework for detecting skin cancer using a dataset that is not balanced. For mitigating class imbalance, augmentation strategies like rotation, flipping, translation, resizing, as well as shearing were applied extensively, generating 4,000–5,000 samples per class. Models such as AlexNet, InceptionNetV3, and RegNet-Y310 were trained with various learning rates to identify the optimal configuration. The proposed framework attained a 91% accuracy, an 88.1% f1-score, along with a 0.95 AUC score.
Qichen Su et al. in30 developed a dual-stage GAN-based approach termed Self Transfer GAN (STGAN) for generating high-quality 256 × 256 skin lesion images to address dataset imbalance. STGAN initially trains an unconditional GAN to learn universal lesion characteristics across all classes and subsequently transfers this knowledge to individual classes, merging it with class-specific features to generate fine-grained synthetic images. Building on STGAN, the authors also introduced T-ResNet50, a classification framework designed to enhance diagnostic performance. Both synthetic image generation and classification experiments were evaluated on the HAM10000 dataset.
Mithun Kumar Kar et al. in31 proposed LeSegGAN, a composite attention enhanced GAN architecture for robust segmentation of skin lesions. LeSegGAN integrates CNN blocks and Inception modules with residual connections in the encoder for effective capturing of local as well as global features. A channel-attention mechanism is placed between the encoder and decoder to highlight label-relevant features and improve segmentation precision. The discriminator is based on a Vision Transformer (ViT), leveraging its self-attention capabilities to enhance feature discrimination. Illumination correction using the CET method is applied during preprocessing. Furthermore, a compound loss function is utilized that combines weighted binary cross entropy, dice, and focal losses to boost performance.
Aswathy Ravikumar et al. in32 explored the feasibility of generating photo-realistic dermatoscopic images using Generative Adversarial Networks (GANs) and subsequently leveraging these synthesized samples to augment existing datasets in order to improve CNN-based skin lesion classification. Their deep learning pipeline includes multiple architectures: VGG-16, DenseNet, Xception, and ResNetV2. To improve explainability, Local Interpretable Model-Agnostic Explanations (LIME) were incorporated to highlight the image regions important to the classification decisions.
Riaz H. Junejo et al. in33 introduced a novel structure for categorization of skin lesions by combining Federated Learning (FL) with ViTs. Federated Learning facilitates cooperative training of model over distributed clinical institutions without a need for exchanging raw patient data, thereby supporting privacy preservation and accommodating non-IID data distributions. ViTs capture broad related information efficiently. They implemented a weighted Federated Averaging (FedAvg) strategy, along with client-side preprocessing, augmentation, and feature optimization. Their global model achieved a 90% accuracy, with an 88.2% sensitivity and a 91.4% specificity on the HAM10000 dataset.
Aravinda et al. in34 introduced QPAViT-NAS, a Quantum Patch Attention Vision Transformer guided by Neural Architecture Search. The model enhances both predictive accuracy and transparency in skin disease classification by combining a ViT backbone with a quantum-inspired feature encoding module that utilizes quantum superposition principles to amplify subtle texture variations in dermoscopic images. This system was assessed on a handpicked dataset containing four clinically relevant disease categories (Monkeypox, Measles, Chickenpox, and Normal), achieving an accuracy of 90.2% under 5-fold cross-validation.
Wanqing Peng et al. in35 developed DBTU-Net, a dual-branch architecture that fuses transformer-based and U-Net-based feature extraction. The Attention Dense U-Net component, built on a pre-trained ResNet50, incorporates a triple fusion attention mechanism to capture features along height, width, and channel dimensions. Following this, channel and spatial fusion attention modules further refine the multi-dimensional feature integration process.
Dipanjali Kundu et al. in36 proposed an FL-based framework using deep learning models for distinguishing monkeypox and other pox viruses securely. A cycle-consistent GAN is used to augment the dataset. MobileNetV2, ViT, and ResNet50 models are used for the classification purpose. A flower FL environment is used for security. The ViT-B32 achieved an overall accuracy of 97.9% on Cycle-GAN with Federated Learning environment.
Talayeh Tabibi et al. in37 presented an ensemble architecture formed from multiple CNNs including VGG16, Inception, Xception, ResNet50, InceptionResNet, DenseNet, and EfficientNet B0-B7 to improve skin lesion classification performance. Among these, ConvNeXt, SENet, DenseNet, and EfficientNet-B0 demonstrated superior accuracy. Xavier initialization was used for weight initialization, and classifier selection for the ensemble was guided by Kappa score rankings.
Abdulrahman Noman et al. in38 proposed a novel Cascading Size-Dependent Deep Propagation (CADP) approach to mitigate over-smoothing in graph-based few-shot learning and improve the classification of skin diseases. The graph is constructed with the help of a CNN. Repeated nonlinear transformations are avoided by decoupling the feature propagation process from the CNN, thus preventing over-smoothing and improving feature representation.
Abdulrahman Noman et al. in39 introduced a novel framework, the Synergistic Edge-Node Graph Attention Network (SEN-GAT), for diagnosing few-shot skin diseases. For effective feature transferability, they incorporated a task-driven meta-learning objective optimizing parameters across node classification and edge prediction. They also utilized an adaptive global loss function to optimize the model's parameters.
Existing literature on class imbalance relies on either classical oversampling techniques, such as SMOTE-based approaches, or unconditional GAN40 based augmentation strategies. Classical methods often fail to preserve the complex characteristics of dermoscopic images, while unconditional GANs do not explicitly enforce class consistency, making them unsuitable for targeted minority-class augmentation. The proposed framework addresses these gaps by conditioning generation on class labels and spatial attention maps derived from a pretrained classifier, so that the generator focuses on diagnostically meaningful regions while maintaining class fidelity.
Existing studies typically adopt either CNN based or transformer based classifiers in isolation, thereby failing to exploit the complementary strengths of both paradigms. Furthermore, most ensemble approaches in medical image classification rely on late fusion strategies, which do not explicitly model interactions between feature representations. The proposed framework addresses these gaps by introducing a feature level CNN-ViT ensemble with attention based fusion. This design distinguishes the proposed classifier from existing ensemble methods and aligns well with the heterogeneous nature of dermoscopic images.
Methodology
In this section, the overall procedural flow of proposed methodology is discussed, which includes: (A) Attention maps Extraction from the baseline ResNet50V2, (B) Training the C’GAN for Enhanced Data Augmentation, (C) Balanced Dataset preparation and Preprocessing, and (D) Architecture and Training of ResNet50V2-DeiT Ensemble with attention fusion. The overall workflow is depicted in Fig. 1.
Fig. 1.
Flowchart of the proposed architecture for skin cancer classification. The pipeline illustrates dataset balancing through a Generative Adversarial Network (GAN) conditioned on the attention maps generated from ResNet50V2, which is then used to train and evaluate the CNN-ViT ensemble classifier.
Attention maps extraction from baseline ResNet50V2
The generator block of a GAN synthesizes images from random noise, and the realism of these images improves as training continues. However, an unconditional GAN is unreliable for multi-class generation, as required in our case. A Conditional GAN (C’GAN) is therefore used, which generates synthetic images from random noise together with attention maps conditioned on the specific class, yielding more reliable synthetic samples.
This is a pre-training phase in which a ResNet50V2 is first trained on the original HAM10000 dataset using the Adam optimizer and cross-entropy loss for 30 epochs. The corresponding attention maps are then extracted as SmoothGrad heat maps, highlighting the regions that drive the baseline model's decisions or make them ambiguous. These class-specific attention maps are saved to guide the GAN in generating reliable synthetic images for balancing the dataset. Figure 2 shows one attention map per class, overlaid on the original images; the overlays attend most strongly to the regions that are prominent in diagnosis. In the images, purple gradients indicate cooler, lower-attention regions, whereas yellow-green gradients indicate warmer, higher-attention regions.
Fig. 2.
Extracted attention maps (one per class) overlayed on the original images.
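As an illustration of this step, the sketch below computes a SmoothGrad-style attention map in PyTorch; the function name, sample count, and noise scale are assumptions for demonstration, not the exact training configuration.

```python
import torch

def smoothgrad_map(model, image, target_class, n_samples=25, sigma=0.15):
    """SmoothGrad saliency: average input gradients over noisy copies.

    A minimal sketch; `model` is assumed to be the baseline ResNet50V2
    returning class logits for a (1, 3, 224, 224) tensor in [0, 1].
    """
    model.eval()
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + sigma * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy)[0, target_class]
        score.backward()
        grads += noisy.grad.abs()
    # Collapse channels and normalize to [0, 1] to obtain a 2-D attention map.
    amap = grads.mean(dim=1, keepdim=True) / n_samples
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    return amap  # (1, 1, 224, 224), later used to condition the C'GAN
```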
Training the C’GAN for enhanced data augmentation
The Conditional GAN (C’GAN) pipeline is implemented to generate reliable synthetic images to balance the HAM10000 dataset since the normal augmentation techniques fail to capture diversity of the classes. The architecture of the C’GAN consists of two blocks namely: Generator and Discriminator.
Generator (G)
The primary goal of the generator in a GAN is to fool the discriminator by producing synthetic images that are hard to tell apart from genuine ones. Typically, the generator block of a GAN produces synthetic images from a random noise vector, so its operation can be defined by the equation:

$$\hat{x} = G(z) \qquad (1)$$

where $z$ denotes the random noise vector.

In contrast, the C’GAN's generator block receives a condition as an additional input alongside the random noise vector $z$; in our case, the conditioning inputs are the class label and the real-image attention map. Hence, the operation of the C’GAN generator can be defined by the equation:

$$\hat{x} = G(z, c, a) \qquad (2)$$

where $c$ is the class label and $a$ is the real-image attention map extracted while training the baseline ResNet50V2.
The class label is embedded into a continuous latent representation using a learnable embedding layer, which allows semantic class information to influence generation throughout training. To ensure spatial variability, the noise vector is projected through a fully connected layer and reshaped into a two-dimensional noise map with the same spatial resolution as the input image. This design allows randomness to be injected at the pixel level rather than as a single global latent code, promoting localized variations within the lesion region. The generator processes a concatenation of the real image, its attention map, and the noise map using an encoder-decoder structure. The encoder consists of a sequence of strided convolutional layers that progressively reduce spatial resolution while increasing feature dimensionality. The first convolutional layer learns low-level texture and color representations by jointly analyzing the real image and the attention signal. Subsequent convolutional blocks deepen the feature representation, capturing lesion morphology, pigmentation patterns, and boundary characteristics. Batch normalization and ReLU activations are used to stabilize training and improve gradient flow. The decoder mirrors the encoder using transposed convolutions that gradually restore the spatial resolution. These layers reconstruct a three-channel output representing a residual image. The final synthetic image is obtained by adding the attention-weighted residual to the real image and clipping the result to a valid intensity range. Each decoder layer refines the spatial details and color distribution, allowing the generator to model subtle lesion variations that are consistent with the conditioned class label. This formulation ensures that regions identified as diagnostically relevant receive stronger modifications, while background skin regions remain largely unchanged. The generator architecture in Fig. 3a shows how the real image, attention map, and class-conditioning signals are combined and processed through the encoder-decoder stages to generate a synthetic image.
Fig. 3.
The architecture of conditional GAN: (a) Generator (b) Discriminator.
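Based on the description above, a minimal PyTorch sketch of such an attention-conditioned generator is given below. All layer widths, the two-stage encoder/decoder depth, and the module names are illustrative assumptions; the published model may differ.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Sketch of the attention-conditioned generator (layer sizes assumed)."""

    def __init__(self, n_classes=7, latent_dim=64, img_size=224):
        super().__init__()
        self.img_size = img_size
        self.label_emb = nn.Embedding(n_classes, latent_dim)             # class condition
        self.noise_fc = nn.Linear(2 * latent_dim, img_size * img_size)   # 2-D noise map
        self.encoder = nn.Sequential(                                    # strided convs
            nn.Conv2d(3 + 1 + 1, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(True),
        )
        self.decoder = nn.Sequential(                                    # restore resolution
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),  # residual image
        )

    def forward(self, real, attn, z, labels):
        # Project noise + class embedding into a pixel-level noise map.
        cond = torch.cat([z, self.label_emb(labels)], dim=1)
        noise_map = self.noise_fc(cond).view(-1, 1, self.img_size, self.img_size)
        h = self.encoder(torch.cat([real, attn, noise_map], dim=1))
        residual = self.decoder(h)
        # Attention-weighted residual: lesion regions change, background stays intact.
        return (real + attn * residual).clamp(0, 1)
```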
Discriminator (D)
The primary goal of the discriminator in a GAN is to act as a binary classifier distinguishing genuine images from the synthetics produced by the generator block. It thereby teaches the generator to produce more realistic results, providing a signal through backpropagation on how to improve its parameters whenever a fake image is identified. The discriminator is trained on both images from the dataset and synthetic samples from the generator. Discriminator and generator are optimized in opposition, each striving to outperform the other, giving rise to the name 'adversarial' and enabling both to improve.
However, in our C’GAN the discriminator is implemented as a conditional PatchGAN, which evaluates realism at the level of local image patches rather than entire images. It receives the concatenation of an image and its attention map as input, allowing it to focus on lesion relevant regions during discrimination. This design enforces fine-grained texture consistency and discourages the generator from producing visually implausible local artifacts. The discriminator consists of a sequence of convolutional layers with increasing channel depth. The initial layers capture low-level textural differences between real and synthetic images, while deeper layers analyze higher-level structural patterns such as lesion boundaries and internal pigmentation. LeakyReLU activations are used to maintain gradient flow, and batch normalization is applied to stabilize learning. The final convolutional layer produces a spatial map of realism scores, which are averaged to compute the discriminator output. The discriminator architecture from Fig. 3b shows how it distinguishes between real and synthetic images.
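A corresponding sketch of the conditional PatchGAN critic, with assumed channel widths, might look as follows:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of the conditional PatchGAN critic (channel widths assumed)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Input: image (3 channels) concatenated with its attention map (1 channel).
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 1, 4, stride=1, padding=1),  # spatial map of patch realism scores
        )

    def forward(self, img, attn):
        patch_scores = self.net(torch.cat([img, attn], dim=1))
        return patch_scores.mean(dim=(1, 2, 3))  # average patches -> scalar critic score
```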
C’GAN training
The model is trained using the Wasserstein GAN framework with gradient penalty (WGAN-GP), which provides stable convergence and mitigates vanishing gradients. The generator is optimized using a combination of adversarial loss, identity consistency loss, and diversity regularization. The adversarial loss $\mathcal{L}_{adv}$ encourages the generated samples to align with the real distribution learned by the discriminator. The identity loss $\mathcal{L}_{id}$ penalizes excessive deviation from the original image, ensuring that generated samples remain anatomically plausible. The diversity loss $\mathcal{L}_{div}$ explicitly encourages variation between successive generated samples of the same class, preventing the generator from producing duplicates. Together, these objectives promote class-consistent, diverse, and stable sample generation. The discriminator objective encourages separation between real and synthetic distributions while enforcing Lipschitz continuity through the gradient penalty $\mathcal{L}_{GP}$. The Wasserstein losses for the generator and discriminator are given as:

$$\mathcal{L}_{G} = -\,\mathbb{E}\big[D(\hat{I})\big] + \lambda_{id}\,\mathbb{E}\big[\lVert \hat{I} - I \rVert_{1}\big] - \lambda_{div}\,\mathbb{E}\big[\lVert \hat{I}_{i} - \hat{I}_{j} \rVert_{1}\big] \qquad (3)$$

$$\mathcal{L}_{D} = \mathbb{E}\big[D(\hat{I})\big] - \mathbb{E}\big[D(I)\big] + \lambda_{gp}\,\mathbb{E}\Big[\big(\lVert \nabla_{\tilde{I}} D(\tilde{I}) \rVert_{2} - 1\big)^{2}\Big] \qquad (4)$$

where $I$ denotes the real image with its derived attention map, $\hat{I}$ represents the synthetic image generated by the conditional attention-guided generator, $D(\cdot)$ is the scalar discriminator output score, $\mathbb{E}$ denotes statistical expectation, the scalar coefficient $\lambda_{id}$ prevents the generator from introducing unrealistic deviations, the hyperparameter $\lambda_{div}$ regulates the trade-off between diversity and visual consistency, $\hat{I}_{i}, \hat{I}_{j}$ represent consecutive generated samples within a mini-batch, $\tilde{I}$ denotes an interpolated image sampled between $I$ and $\hat{I}$, the gradient $\nabla_{\tilde{I}} D(\tilde{I})$ represents the discriminator's sensitivity with respect to its input, and $\lambda_{gp}$ is the scalar coefficient of the gradient penalty. The hyperparameters and training details of the C’GAN are listed in Table 1.
Table 1.
Hyperparameters and training details of C’GAN.
| C’GAN hyperparameter | Value |
|---|---|
| Latent dimension | 64 |
| Epochs | 40 |
| Generator LR | 1e-4 |
| Discriminator LR | 4e-4 |
| Gradient penalty $\lambda_{gp}$ | 10 |
| Identity loss weight $\lambda_{id}$ | 10 |
| Diversity loss weight $\lambda_{div}$ | 0.1 |
Class conditioning in GAN is incorporated through learned label embeddings, class-aware noise injection, and training on class-specific real attention pairs. During training and synthesis, minority classes are explicitly oversampled. Each generated sample is conditioned on a target class label. Attention maps ensure variations remain lesion centric. This design promotes intra-class diversity while maintaining label fidelity.
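To make the objective concrete, the following sketch implements the gradient penalty and the generator loss of Eqs. (3)-(4); the helper names and the critic interface D(image, attention_map) are assumptions consistent with the discriminator sketch above.

```python
import torch

def gradient_penalty(D, real, fake, attn):
    """WGAN-GP penalty on interpolations between real and synthetic images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = D(interp, attn)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def generator_loss(D, real, fake_i, fake_j, attn,
                   lambda_id=10.0, lambda_div=0.1):
    """Adversarial + identity - diversity terms, matching Eq. (3)."""
    adv = -D(fake_i, attn).mean()              # fool the critic
    identity = (fake_i - real).abs().mean()    # stay anatomically plausible
    diversity = (fake_i - fake_j).abs().mean() # avoid duplicate samples
    return adv + lambda_id * identity - lambda_div * diversity
```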
Quantitative and qualitative evaluation of GAN quality
The quality and diversity of the generated images are evaluated using standard generative metrics, including Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and precision-recall analysis in feature space. The achieved FID score of 21.5 indicates strong alignment between the real and synthetic distributions, while the LPIPS score of 0.45 confirms perceptual diversity among generated samples. A precision-recall value of 0.90 demonstrates a favorable balance between realism and coverage. Qualitative comparisons in Fig. 4 between a real image, its attention map, and the corresponding synthetic image further show that the generated images preserve lesion-specific characteristics such as color variation, texture, and border irregularity across different classes. Minor failure cases include occasional over-smoothing in low-contrast lesions and limited structural novelty for extremely rare classes such as df, bcc, and akiec.
Fig. 4.
Qualitative comparison between (a) Real Image (b) Attention map and (c) Generated synthetic image (one per class).
Balanced dataset preparation and preprocessing
The framework utilizes a wide and varied collection of seven distinct categories of skin lesion images, the HAM10000 dataset, consisting of 10,015 dermatoscopic samples. The class distribution of HAM10000 is presented in Table 2a. However, this dataset is not used directly to train the ResNet50V2-ViT ensemble. Several preprocessing steps are applied first: generating synthetic images for minority classes to address the class imbalance of HAM10000, applying a hair removal technique to the original images, creating a new balanced dataset from the generated synthetic images and the hair-removed HAM10000 images, partitioning the balanced dataset into training, validation, and test sets in a 70:20:10 proportion stratified by class, resizing images to 224 × 224, and normalizing them.
Table 2.
(a) HAM10000 dataset class distribution. (b) Number of synthetic images generated for each class. (c) Total number of images in balanced dataset after resampling.
| Class name | (a) Original | (b) Synthetic | (c) Balanced total |
|---|---|---|---|
| bkl (0) | 1099 | 2501 | 3600 |
| nv (1) | 6705 | 95 | 6800 |
| df (2) | 115 | 2385 | 2500 |
| mel (3) | 1113 | 2487 | 3600 |
| vasc (4) | 142 | 2358 | 2500 |
| bcc (5) | 514 | 1986 | 2500 |
| akiec (6) | 327 | 2173 | 2500 |
| Total | 10,015 | 13,985 | 24,000 |
Hair removal preprocessing
This hair removal pipeline is applied to the original images of the dataset because hair over skin lesions obstructs the view and prevents complete feature extraction of the lesions. Hair acts as noise that confuses the model and degrades performance, so removing it and repairing the skin surface is essential. The pipeline follows a series of image processing steps: (I) conversion of the image to grayscale; (II) a black-hat morphological operation computing the difference between the original image and its morphological closing, highlighting dark hair structures on lighter skin-lesion backgrounds; (III) thresholding (pixel values 10–255) to isolate hair, assigning black to the surroundings and white to hair; (IV) inpainting to reconstruct the masked hair regions from surrounding lesion pixel information. Figure 5 demonstrates the output at each step when hair removal is applied to the images: (a) ISIC_0024320, (b) ISIC_0024342, (c) ISIC_0024421, (d) ISIC_0024508, and (e) ISIC_0029414.
Fig. 5.
Demonstration of the steps involved in the hair removal process (Original Image – Grayscale Image – Blackhat Filter Result – Hair Mask Result – Inpainted/Hair-Removed Image) for images: (a) ISIC_0024320, (b) ISIC_0024342, (c) ISIC_0024421, (d) ISIC_0024508, and (e) ISIC_0029414.
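A compact OpenCV sketch of steps (I)-(IV) is shown below; the structuring-element size and inpainting radius are assumed values, as the exact parameters are not specified.

```python
import cv2

def remove_hair(bgr_image):
    """Black-hat based hair removal sketch (kernel size and radius assumed)."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)              # (I) grayscale
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 17))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)   # (II) dark hair
    _, mask = cv2.threshold(blackhat, 10, 255, cv2.THRESH_BINARY)   # (III) hair mask
    return cv2.inpaint(bgr_image, mask, inpaintRadius=3,
                       flags=cv2.INPAINT_TELEA)                     # (IV) repair skin
```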
Generating synthetic images
To overcome the issue of class imbalance, the trained C’GAN is used to generate synthetic images, overcoming the challenges of class diversity and authenticity that arise when resampling with simple augmentations. The best generator obtained during C’GAN training is loaded, and the extracted attention maps are used to generate the synthetic images. The objective of synthetic image generation in this work is not photorealistic synthesis for visual inspection alone, but targeted augmentation aimed at improving classifier robustness under data scarcity; hence it applies most heavily to the minority classes. The number of augmented images per class is determined from the original class distribution and the inter-class imbalance ratio. Rather than enforcing complete class uniformity, the proposed approach adopts a controlled balancing strategy, in which severely underrepresented classes are augmented to a predefined threshold while majority classes are minimally adjusted. This adaptive criterion provides a principled balance between diversity and classification stability. The count of synthetic images generated for each class is listed in Table 2b.
Balanced dataset preparation
The final dataset is created by combining the hair-removed original images from the HAM10000 dataset with the C’GAN-generated synthetic samples, and labels are assigned according to class. Synthetic images generated by the conditional GAN are integrated directly with real dermoscopic images to form an expanded dataset, and class balancing is achieved at the dataset level rather than at the batch level. The final balanced dataset, with enough diverse images in each class and 24,000 images in total, is then ready to train a model effectively. Its composition is presented in Table 2c.
Dataset splitting
The balanced dataset is then separated into training, validation, and test sets in a 70:20:10 proportion. The training set lets the model learn the underlying patterns, relationships, and parameter values from the input features and their corresponding targets. The validation set supports model development by evaluating and fine-tuning the model's parameters and hyperparameters, and helps prevent overfitting. The test set acts as a completely separate, unseen dataset used for unbiased evaluation of the model's final performance.
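A stratified 70:20:10 split can be realized with two successive scikit-learn splits, as in this sketch (assuming `images` and `labels` hold the balanced dataset as arrays):

```python
from sklearn.model_selection import train_test_split

# 70:20:10 stratified split: first 70/30, then the 30% into 20/10.
X_train, X_rest, y_train, y_rest = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1 / 3, stratify=y_rest, random_state=42)
```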
Architecture and training of ResNet50V2-DeiT ensemble with attention fusion
The core classification architecture is the ResNet50V2-DeiT ensemble, obtained by extending the baseline ResNet50V2 with a lightweight Vision Transformer variant, DeiT, unified by an attention fusion mechanism.
Feature extraction branches
The ensemble consists of a CNN branch, ResNet50V2, for the extraction of local features and a ViT branch, DeiT, for the extraction of global dependencies. These branches are initialized with ImageNet-pretrained weights and subsequently fine-tuned on the balanced dataset. The choice of CNN and ViT backbones is motivated by the intrinsic visual characteristics of skin lesion images: by combining a convolutional backbone with a transformer backbone, the ensemble leverages both inductive bias and global reasoning, aligning well with clinical diagnostic criteria. The selected architectures represent a balance between representational power and computational feasibility, allowing effective transfer learning while maintaining practical training and inference costs.
ResNet50V2
ResNet50V2 is a 50-layer deep convolutional neural network chosen for its depth, identity mappings, and proficiency in extracting intricate hierarchical features; it enhances gradient flow and mitigates vanishing gradients. It was developed from ResNetV1 with some modifications: Batch Normalization (BN) and ReLU are applied before each convolutional layer, and the final non-linearity after the addition operation is removed, allowing information to flow directly from input to output, hence the term residual. ResNet50V2 contains 5 main stages, each consisting of multiple bottleneck residual blocks. The key components in a bottleneck block are a 1 × 1 convolution, a 3 × 3 convolution, and another 1 × 1 convolution, each preceded by BN and ReLU. The initial and final 1 × 1 convolutions reduce and restore feature dimensions, minimizing computational complexity while maintaining representational capacity. The described ResNet50V2 architecture is displayed in Fig. 6.
Fig. 6.
ResNet50V2 architecture.
DeiT
The Data-efficient image Transformer (DeiT), a lightweight Vision Transformer variant, uses a distillation strategy to achieve high performance with limited data. A vision transformer treats an image as a sequence of patches: in the patch-embedding stage, the input RGB image is split into a sequence of non-overlapping 16 × 16 patches, which are flattened and transformed into tokens by a linear projection layer. A learnable class token is prepended, and positional embeddings are added to the patch tokens to encode their spatial locations within the image. These tokens are fed into a Transformer encoder that extracts global dependencies using Multi-Head Self-Attention (MHSA) and position-wise Feed-Forward Networks (FFNs). DeiT adds an extra 'distillation token' to the input sequence; this token is optimized to approximate the output produced by a corresponding CNN teacher. Instead of the MLP head used in ViT, DeiT uses a simple linear classifier. DeiT therefore helps capture global context such as asymmetry and border irregularity, enhancing feature extraction. The DeiT architecture can be understood from Fig. 7.
Fig. 7.
DeiT architecture.
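As a sketch, both backbones can be instantiated as feature extractors through the timm library; the model identifiers below are plausible choices and the quoted output shapes are assumptions, not the paper's confirmed configuration.

```python
import timm
import torch

# Hypothetical backbone setup: ResNetV2-50 and a distilled DeiT via timm,
# both ImageNet-pretrained, with classification heads removed (num_classes=0).
cnn = timm.create_model("resnetv2_50", pretrained=True, num_classes=0)
vit = timm.create_model("deit_base_distilled_patch16_224",
                        pretrained=True, num_classes=0)

x = torch.randn(1, 3, 224, 224)
local_feats = cnn.forward_features(x)   # spatial CNN maps, e.g. (1, 2048, 7, 7)
global_feats = vit.forward_features(x)  # token sequence incl. class/distill tokens
```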
Attention fusion module
A unified, dynamically weighted local and global feature representation is produced by the Attention Fusion Module (AFM), which is inspired by the cross-attention mechanism. The AFM integrates dense, multi-channel CNN feature maps with sparse ViT token embeddings. The proposed ensemble is designed as a feature-level fusion architecture rather than a decision-level ensemble, explicitly modeling interactions between convolutional and transformer-based representations before classification. The CNN feature maps are pooled to a compatible token-like dimension and the ViT tokens are projected to spatial dimensions. The ViT output is used as the Query (Q), while the CNN feature maps are linearly projected into the Key (K) and Value (V) components. Contextual weighting computes the similarity between the global context Q and the local details K, generating weights that prioritize the local features V, expressed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (5)$$

where Q is the query matrix, K the key matrix, V the value matrix, and $d_k$ the dimension of the keys.
A CNN backbone extracts high-resolution local features that encode lesion texture, pigment distribution, and border irregularities. In parallel, a ViT processes the same input image to capture long-range spatial dependencies and global contextual relationships. The outputs of both backbones are projected into a common embedding space and fused using an attention-based mechanism, enabling the model to learn a dynamic weighting between local and global cues. This feature-level fusion allows the ensemble to resolve cases where local texture information alone is insufficient. The dynamically weighted features form a comprehensive feature vector, which passes through a bottleneck layer to reduce dimensionality while maintaining critical information, followed by an output layer with softmax activation for the final seven-class classification. The hyperparameters of the ensemble model are listed in Table 3. The detailed structure of the ResNet50V2-ViT ensemble with the Attentional Fusion Module is shown in Fig. 8.
Table 3.
Hyperparameters of ensemble model.
| Hyperparameter | Value |
|---|---|
| Batch size | 32 |
| Image Resolution | 224 × 224 |
| Epochs | 50 |
| Optimizer | Adam |
| Loss Function | Categorical Cross Entropy |
| Activation Function | Softmax |
Fig. 8.
Architecture of attentional fusion module.
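A minimal PyTorch sketch of such a cross-attention fusion head follows; the embedding dimension, head count, and bottleneck width are assumptions consistent with the backbone sketches above.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Cross-attention fusion sketch: ViT tokens query CNN feature maps (Eq. 5)."""

    def __init__(self, cnn_dim=2048, vit_dim=768, fused_dim=512, n_classes=7):
        super().__init__()
        self.q_proj = nn.Linear(vit_dim, fused_dim)   # global context -> Query
        self.k_proj = nn.Linear(cnn_dim, fused_dim)   # local features -> Key
        self.v_proj = nn.Linear(cnn_dim, fused_dim)   # local features -> Value
        self.attn = nn.MultiheadAttention(fused_dim, num_heads=8, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(True),  # bottleneck layer
            nn.Linear(256, n_classes),                 # softmax applied in the loss
        )

    def forward(self, cnn_maps, vit_tokens):
        # Flatten (B, C, H, W) CNN maps into (B, H*W, C) token-like features.
        b, c, h, w = cnn_maps.shape
        local = cnn_maps.flatten(2).transpose(1, 2)
        q = self.q_proj(vit_tokens[:, :1])             # class token as the query
        k, v = self.k_proj(local), self.v_proj(local)
        fused, _ = self.attn(q, k, v)                  # softmax(QK^T / sqrt(d_k)) V
        return self.head(fused.squeeze(1))
```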
The ensemble model is trained in a staged manner to ensure stability and efficient convergence. Initially, backbone networks are frozen and only the fusion layers and classification head are trained. This warm-up phase allows the ensemble to learn meaningful feature interactions without disrupting pretrained representations. Subsequently, selected layers of the CNN and ViT backbones are unfrozen and jointly fine-tuned with a reduced learning rate. This joint optimization enables coordinated adaptation of both branches while minimizing the risk of catastrophic forgetting. The entire ensemble is trained end-to-end during this phase, allowing gradients to flow across fusion and backbone components.
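The staged schedule can be expressed as in this sketch, reusing the hypothetical `cnn`, `vit`, and `fusion` modules from the earlier snippets; the number of unfrozen parameters and the learning rates are illustrative.

```python
import torch

# Stage 1: freeze both backbones, train fusion + head only (warm-up).
for p in list(cnn.parameters()) + list(vit.parameters()):
    p.requires_grad = False
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)

# Stage 2: unfreeze selected deeper blocks, fine-tune jointly at a lower LR.
for p in list(cnn.parameters())[-30:] + list(vit.parameters())[-30:]:
    p.requires_grad = True
optimizer = torch.optim.Adam(
    (p for p in list(cnn.parameters()) + list(vit.parameters())
     + list(fusion.parameters()) if p.requires_grad), lr=1e-5)
```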
Results and performance analysis
This section analyzes the proposed architecture's performance through a standard set of metrics, their related curves, and the confusion matrix; an ablation study is conducted to quantify the individual contributions of the C’GAN and the ResNet50V2-DeiT ensemble; and performance on underrepresented classes is compared against several SOTA models.
Dataset description
The model is evaluated on the HAM10000 (Human Against Machine) dataset, a primary benchmark dermatoscopic skin lesion dataset used for training ML and DL models aimed at detecting skin cancer. It includes 10,015 dermoscopic skin lesion images obtained from diverse populations. Over 50% of the diagnoses were confirmed by pathology, and the remainder are based on expert consensus or follow-up information. Each image has metadata detailing patient information, specifically age, sex, and lesion location, along with its image_id, lesion_id, class name (dx), and diagnosis verification method (dx_type). The dataset is categorized into seven classes: Actinic Keratoses (akiec), Basal Cell Carcinoma (bcc), Benign Keratosis-like Lesions (bkl), Dermatofibroma (df), Melanoma (mel), Melanocytic Nevi (nv), and Vascular Lesions (vasc), with respective class distributions of 327, 514, 1099, 115, 1113, 6705, and 142 images.
Evaluation metrics
For such imbalanced multi-class problems, performance validation relies on computing the standard set of metrics like Precision, Recall, and F1-score individually for each class. Macro-averages of these are used for the overall performance. The robustness and class separability of the model are visualized using Area Under the Curve (AUC) scores of Receiver Operating Characteristic (ROC) curves as well as with Average Precision (AP) values of Precision-Recall curves.
Precision
For a specific class, precision measures how many of the samples the model labels as that class are actually correct. It is expressed by the formula:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

A higher precision value for a class indicates that the model's assignment of a sample to that class is more likely to be correct; in imbalanced multi-class classification, accuracy alone is often misleading. Per-class precision is shown in Table 4. For our model, the Macro Average Precision, defined as the mean of the precision values across classes, and the Weighted Average Precision, defined as the per-class precision weighted by class support, are 0.93 and 0.92 respectively. For the underrepresented classes df, vasc, bcc, and akiec, precision is 0.99, 0.98, 0.95, and 0.93 respectively.
Table 4.
Classification report of the proposed architecture.
| Class name | Precision | Recall | F1-score |
|---|---|---|---|
| bkl (0) | 0.86 | 0.93 | 0.90 |
| nv (1) | 0.92 | 0.91 | 0.92 |
| df (2) | 0.99 | 0.95 | 0.97 |
| mel (3) | 0.88 | 0.87 | 0.88 |
| vasc (4) | 0.98 | 1.00 | 0.99 |
| bcc (5) | 0.95 | 0.94 | 0.95 |
| akiec (6) | 0.93 | 0.89 | 0.91 |
| Macro Average | 0.93 | 0.93 | 0.93 |
| Weighted Average | 0.92 | 0.92 | 0.92 |
Recall
For a specific class, recall measures how effectively the model identifies all instances that truly belong to that class. It is expressed by the formula:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

A higher recall value for a class indicates the model's ability to identify all of its instances. Per-class recall is shown in Table 4. For our model, the Macro Average Recall is 0.93 and the Weighted Average Recall is 0.92. For the underrepresented classes df, vasc, bcc, and akiec, recall is 0.95, 1.00, 0.94, and 0.89 respectively.
F1-score
For a specific class, the F1-score is the harmonic mean of the class's precision and recall, balancing both aspects of performance in a single value. It is expressed by the formula:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$

Recall is complementary to precision, meaning that high recall often comes at the expense of low precision; the F1-score therefore combines them for a more comprehensive assessment of the model's performance on each class. Per-class F1-scores are shown in Table 4. For our model, the Macro Average F1-score is 0.93 and the Weighted Average F1-score is 0.92. For the underrepresented classes df, vasc, bcc, and akiec, it is 0.97, 0.99, 0.95, and 0.91 respectively.
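The per-class and averaged metrics of Table 4 can be reproduced with scikit-learn, assuming `y_true` and `y_pred` hold the test labels and predictions:

```python
from sklearn.metrics import classification_report

# Per-class precision/recall/F1 plus macro and weighted averages (as in Table 4).
print(classification_report(
    y_true, y_pred,
    target_names=["bkl", "nv", "df", "mel", "vasc", "bcc", "akiec"],
    digits=2))
```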
Evaluation emphasizes minority-class sensitivity and class-wise performance metrics rather than overall accuracy alone. Precision, recall, and F1-score are reported for each class in Table 4, and the confusion matrix in Fig. 10, to provide a granular view of model behavior. Precision-Recall curves are used in addition to the ROC curves of Fig. 9, as they provide a more informative assessment under class imbalance. Improvements in recall for rare but clinically critical classes are particularly highlighted, as reducing false negatives is of primary clinical importance.
Fig. 9.
Plots showing (a) ROC curves (left) and (b) Precision-Recall curves (right) of all classes.
ROC and precision-recall curves
The ROC curve demonstrates the model's discriminative capability by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). A higher AUC score for a class indicates a higher potential of the model to separate that class from the others. The AUC scores obtained by our model for each class are shown in Fig. 9a. For the underrepresented classes df, vasc, bcc, and akiec, the AUC scores are 1.00, 1.00, 1.00, and 0.99 respectively. Improvements in the area under the ROC curve for minority classes indicate that the ensemble benefits from balanced training data and multi-scale feature fusion.
The Precision-Recall curves help visualize the model's performance per class by plotting precision against recall for all classes. A higher Average Precision (AP) signifies effective balancing of the trade-off between exactness and completeness for each class. The AP values obtained by the proposed model for all classes are shown in Fig. 9b. For the underrepresented classes df, vasc, bcc, and akiec, the AP values are 0.99, 1.00, 0.99, and 0.97 respectively. Precision-Recall curves provide a more informative evaluation in the presence of class imbalance, and the ensemble demonstrates improved recall for rare classes, confirming that attention-guided augmentation and ensemble fusion jointly reduce false negatives.
Confusion matrix and saliency maps
The confusion matrix allows an in-depth depiction of the model's misclassification patterns by indicating which classes are confused with others. The normalized 7 × 7 confusion matrix is illustrated in Fig. 10; its strong diagonal dominance represents high rates of correct classification and low rates of misclassification. The confusion matrix reveals reduced misclassification between visually similar lesion types, particularly where texture-based and shape-based features conflict. For the underrepresented classes df, vasc, bcc, and akiec, 224, 249, 230, and 210 instances (TP) are correctly predicted out of 237, 249, 245, and 235 instances (TP + FN), while 3, 5, 11, and 16 instances are wrongly predicted as these classes (FP) respectively.
Fig. 10.
Confusion matrix.
Saliency maps give a visual explanation of the model's predictions. A few saliency maps generated by the Attentional Fusion Module are shown in Fig. 11. These maps reveal which features the model prioritizes, confirming that it successfully utilizes both local features and global dependencies. The ensemble's saliency responses are spatially coherent and well aligned with lesion regions, indicating that the transformer branch contributes global localization while the CNN branch preserves fine-grained details. This complementary behavior supports the clinical plausibility of the model's predictions.
Fig. 11.
Saliency maps generated by the ensemble architecture.
Ablation study and comparison with SOTA models
An ablation study is performed to quantify the individual importance of the two components of our architecture: the conditioned GAN augmentation and the ResNet50V2-DeiT ensemble. The ablation results for the various component configurations are listed and compared in Table 5.
Table 5.
Ablation study of model components (C’GAN and ViT).
| Model | Class | Precision | Recall | F1-score | Accuracy | SE | Computational time |
|---|---|---|---|---|---|---|---|
| C’GAN + ResNet50V2 + MobileNetV2 (i) | bkl (0) | 0.71 | 0.28 | 0.40 | 0.59 | 0.0010 | ~ 1 h 50 min |
| | nv (1) | 0.80 | 0.93 | 0.86 | | | |
| | df (2) | 0.19 | 0.18 | 0.18 | | | |
| | mel (3) | 0.60 | 0.21 | 0.31 | | | |
| | vasc (4) | 0.94 | 0.98 | 0.96 | | | |
| | bcc (5) | 0.71 | 0.26 | 0.38 | | | |
| | akiec (6) | 0.26 | 0.89 | 0.40 | | | |
| | Macro Average | 0.60 | 0.53 | 0.50 | | | |
| | Weighted Average | 0.66 | 0.59 | 0.57 | | | |
| C’GAN + ResNet50V2 + EfficientNetV2 (ii) | bkl (0) | 0.83 | 0.90 | 0.86 | 0.88 | 0.0008 | ~ 2 h 10 min |
| | nv (1) | 0.92 | 0.94 | 0.93 | | | |
| | df (2) | 0.94 | 0.78 | 0.86 | | | |
| | mel (3) | 0.86 | 0.82 | 0.84 | | | |
| | vasc (4) | 0.93 | 0.99 | 0.96 | | | |
| | bcc (5) | 0.98 | 0.69 | 0.81 | | | |
| | akiec (6) | 0.75 | 0.90 | 0.82 | | | |
| | Macro Average | 0.89 | 0.86 | 0.87 | | | |
| | Weighted Average | 0.89 | 0.88 | 0.88 | | | |
| C’GAN + ResNet50V2 + MobileNetV2 + EfficientNetV2 (iii) | bkl (0) | 0.80 | 0.94 | 0.86 | 0.89 | 0.0008 | ~ 2 h 25 min |
| | nv (1) | 0.91 | 0.93 | 0.92 | | | |
| | df (2) | 0.86 | 0.98 | 0.92 | | | |
| | mel (3) | 0.92 | 0.77 | 0.84 | | | |
| | vasc (4) | 0.94 | 0.99 | 0.97 | | | |
| | bcc (5) | 0.92 | 0.84 | 0.88 | | | |
| | akiec (6) | 0.92 | 0.80 | 0.86 | | | |
| | Macro Average | 0.90 | 0.89 | 0.89 | | | |
| | Weighted Average | 0.90 | 0.89 | 0.89 | | | |
| Augmentations + ResNet50V2 + DeiT (iv) | bkl (0) | 0.80 | 0.78 | 0.79 | 0.85 | 0.0009 | ~ 1 h 10 min |
| | nv (1) | 0.86 | 0.95 | 0.90 | | | |
| | df (2) | 0.93 | 0.92 | 0.92 | | | |
| | mel (3) | 0.81 | 0.68 | 0.74 | | | |
| | vasc (4) | 0.97 | 0.91 | 0.94 | | | |
| | bcc (5) | 0.85 | 0.78 | 0.81 | | | |
| | akiec (6) | 0.81 | 0.81 | 0.81 | | | |
| | Macro Average | 0.86 | 0.83 | 0.85 | | | |
| | Weighted Average | 0.85 | 0.85 | 0.85 | | | |
| C’GAN + ResNet50V2 + DeiT (Proposed) (v) | bkl (0) | **0.86** | **0.93** | **0.90** | **0.92** | **0.0003** | ~ 2 h 45 min |
| | nv (1) | **0.92** | **0.91** | **0.92** | | | |
| | df (2) | **0.99** | **0.95** | **0.97** | | | |
| | mel (3) | **0.88** | **0.87** | **0.88** | | | |
| | vasc (4) | **0.98** | **1.00** | **0.99** | | | |
| | bcc (5) | **0.95** | **0.94** | **0.95** | | | |
| | akiec (6) | **0.93** | **0.89** | **0.91** | | | |
| | Macro Average | **0.93** | **0.93** | **0.93** | | | |
| | Weighted Average | **0.92** | **0.92** | **0.92** | | | |
Bold values indicate the results obtained by the proposed approach, distinguishing them from those of the compared approaches.
From Table 5, the progression in performance can be clearly observed. Three CNN ensembles are implemented alongside the proposed CNN-ViT ensemble: ResNet50V2-MobileNetV2, ResNet50V2-EfficientNetV2, and ResNet50V2-MobileNetV2-EfficientNetV2, all evaluated on the same GAN-augmented balanced dataset. The proposed ResNet50V2-DeiT ensemble shows consistent gains in the evaluation metrics and a significant drop in standard error over the other ensembles, confirming the importance of global dependencies; this can be observed by comparing cases (i), (ii), and (iii) with case (v) of Table 5. Regular data augmentation techniques were also used to balance the dataset for training the proposed ensemble, but the C’GAN-augmented dataset shows improved results, confirming the importance of feature-targeted synthesis; this can be observed by comparing case (iv) with case (v). The performance of the proposed architecture in accuracy, standard error, and macro as well as weighted averages of the evaluation metrics demonstrates its balance and superiority. The computational complexity of the implemented models is also compared. The traditional data augmentation approach (iv) took the least computational time, ~ 1 h 10 min, as it only requires training the ResNet50V2-DeiT ensemble. Since approaches (i), (ii), (iii), and (v) all include baseline training and GAN training steps, their computational time varies chiefly with the ensemble training: (i) ResNet50V2-MobileNetV2 takes ~ 1 h 50 min, (ii) ResNet50V2-EfficientNetV2 ~ 2 h 10 min, (iii) ResNet50V2-MobileNetV2-EfficientNetV2 ~ 2 h 25 min, and (v) the proposed ResNet50V2-DeiT ensemble ~ 2 h 45 min. The trade-off of higher computational cost is justified by the resulting gains in stability and minority-class generalization.
The comparative evaluation of the proposed architecture on the minority classes is presented in Table 6, against several referred state-of-the-art (SOTA) models encompassing a range of DL methodologies: the Quanvolutional Neural Network (QNN) and Quantum Support Vector Classifier (QSVC)21; EFAM-Net incorporating the Attentional Residual Learning ConvNeXt (ARLC), Parallel ConvNeXt (PCNXt), and Multiscale Efficient Attention Feature Fusion (MEAFF) modules22; an ensemble combining ResNet50V2, MobileNetV2, and EfficientNetV224; Efficient Global-Vision Attention-Net (EG-VAN)25; ResNet with Self Transfer GAN (STGAN) and Test-Time Augmentation (TTA)30; multiple architectures including VGG-16, DenseNet, Xception, and ResNetV232; and a DenseNet and ConvNeXt fusion41. The comparison covers the evaluation metrics for the minority classes df, vasc, bcc, and akiec. The proposed framework clearly outperforms the other SOTA models on the standard evaluation metrics of the minority classes; models that achieve comparable results on individual classes do not guarantee reliability and are less consistent across all classes.
Table 6.
Comparison of evaluation metrics for the minority classes (df, vasc, bcc, and akiec) against the existing SOTA models.
| Ref | Model | Class | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Sofana Reka et al.21 | QNN | df | 0.90 | 0.88 | 0.89 |
| | | vasc | 0.94 | 0.88 | 0.91 |
| | | bcc | 0.90 | 0.92 | 0.91 |
| | | akiec | 0.83 | 0.86 | 0.84 |
| | QSVC | df | 0.80 | 0.90 | 0.85 |
| | | vasc | 0.88 | 0.74 | 0.80 |
| | | bcc | 0.73 | 0.80 | 0.76 |
| | | akiec | 0.72 | 0.78 | 0.75 |
| ZhanLin Ji et al.22 | EFAM-Net | df | 0.95 | 0.95 | 0.95 |
| | | vasc | 0.89 | 1.00 | 0.94 |
| | | bcc | 0.92 | 0.90 | 0.91 |
| | | akiec | 0.86 | 0.82 | 0.83 |
| Iftekhar et al.24 | ResNet50V2 + MobileNetV2 + EfficientNetV2 | df | 0.92 | 0.96 | 0.94 |
| | | vasc | 0.99 | 0.99 | 0.99 |
| | | bcc | 0.82 | 0.82 | 0.82 |
| | | akiec | 0.84 | 0.75 | 0.79 |
| Adnan Saeed et al.25 | EG-VAN | df | 1.00 | 0.91 | 0.95 |
| | | vasc | 1.00 | 1.00 | 1.00 |
| | | bcc | 0.96 | 1.00 | 0.98 |
| | | akiec | 0.92 | 0.78 | 0.85 |
| Qichen Su et al.30 | T-ResNet50 + STGAN + TTA | df | 0.87 | 0.91 | 0.89 |
| | | vasc | 0.96 | 1.00 | 0.98 |
| | | bcc | 0.94 | 0.87 | 0.90 |
| | | akiec | 0.81 | 0.73 | 0.77 |
| A. Ravikumar et al.32 | VGG-16 + DenseNet + Xception + Inception ResNetV2 + GAN | df | 0.86 | 0.77 | 0.81 |
| | | vasc | 0.87 | 0.94 | 0.90 |
| | | bcc | 0.82 | 0.91 | 0.86 |
| | | akiec | 0.71 | 0.70 | 0.70 |
| Mingjun Wei et al.41 | Fused CNNs | df | 0.85 | 1.00 | 0.92 |
| | | vasc | 1.00 | 1.00 | 1.00 |
| | | bcc | 0.85 | 0.92 | 0.88 |
| | | akiec | 0.87 | 0.91 | 0.89 |
| Proposed Model | C’GAN Augmentation + ResNet-DeiT Ensemble | df | **0.99** | **0.95** | **0.97** |
| | | vasc | **0.98** | **1.00** | **0.99** |
| | | bcc | **0.95** | **0.94** | **0.95** |
| | | akiec | **0.93** | **0.89** | **0.91** |
The bold values indicate the results obtained by the proposed approach, distinguishing them from the results of the compared approaches.
Significance test and discussions
Reliability through bootstrap analysis
The reliability, stability, and overall uncertainty of the model are assessed through bootstrap analysis, in which the performance metrics are computed by repeatedly evaluating the model on resampled datasets. We report the mean, standard deviation, and standard error of the standard metrics for each class.
The analysis can be found in Table 7. The mean, standard deviation, and standard error obtained from the bootstrap analysis are given by the following formulae:

$$\bar{m} = \frac{1}{B}\sum_{i=1}^{B} m_i \quad (9)$$

$$\sigma = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\left(m_i - \bar{m}\right)^2} \quad (10)$$

$$\mathrm{SE} = \frac{\sigma}{\sqrt{B}} \quad (11)$$
Table 7.
Bootstrap analysis.
| Class | Metric | Precision | Recall | F1-score |
|---|---|---|---|---|
| bkl | Mean | 0.8653 | 0.9313 | 0.8970 |
| | Std | 0.0177 | 0.0127 | 0.0119 |
| | SE | 0.0006 | 0.0004 | 0.0004 |
| nv | Mean | 0.9224 | 0.9127 | 0.9174 |
| | Std | 0.0098 | 0.0114 | 0.0084 |
| | SE | 0.0003 | 0.0003 | 0.0002 |
| df | Mean | 0.9871 | 0.9466 | 0.9664 |
| | Std | 0.0072 | 0.0135 | 0.0073 |
| | SE | 0.0002 | 0.0003 | 0.0002 |
| mel | Mean | 0.8816 | 0.8732 | 0.8772 |
| | Std | 0.0169 | 0.0179 | 0.0126 |
| | SE | 0.0006 | 0.0006 | 0.0005 |
| vasc | Mean | 0.9811 | 1.0000 | 0.9905 |
| | Std | 0.0086 | 0.0000 | 0.0044 |
| | SE | 0.0003 | 0.0000 | 0.0001 |
| bcc | Mean | 0.9559 | 0.9386 | 0.9471 |
| | Std | 0.0119 | 0.0141 | 0.0096 |
| | SE | 0.0004 | 0.0005 | 0.0004 |
| akiec | Mean | 0.9301 | 0.8990 | 0.9141 |
| | Std | 0.0155 | 0.0224 | 0.0150 |
| | SE | 0.0005 | 0.0007 | 0.0004 |
where $m_i$ denotes the metric (precision, recall, or F1-score) calculated from the $i$th bootstrap sample, and $B$ is the total number of bootstrap samples, 1000 in our case.
The high mean of each class's metrics validates the significant and consistent performance of the proposed architecture. The low standard deviations and standard errors confirm that the model's effectiveness is not an artifact of the heavily oversampled data or random variability, but stems from the synergistic data enhancement and architecture. Hence, the conducted bootstrap analysis shows that the model is reliable, statistically significant, and stable.
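To make Eqs. (9)–(11) concrete, the following is a minimal sketch of a per-class bootstrap in Python. It assumes the common variant that resamples held-out test predictions rather than retraining the model; the array names, function name, and use of scikit-learn are illustrative, with only B = 1000 taken from the paper:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def bootstrap_class_metrics(y_true, y_pred, n_classes, B=1000, seed=42):
    """Resample held-out predictions B times; return per-class mean, std, SE."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty((B, 3, n_classes))  # axes: (bootstrap sample, metric, class)
    for i in range(B):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        p, r, f1, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx],
            labels=list(range(n_classes)), zero_division=0)
        scores[i] = np.stack([p, r, f1])
    mean = scores.mean(axis=0)            # Eq. (9)
    std = scores.std(axis=0, ddof=1)      # Eq. (10)
    se = std / np.sqrt(B)                 # Eq. (11)
    return mean, std, se
```

Applied with B = 1000 resamples, the three returned arrays correspond directly to the Mean, Std, and SE rows of Table 7.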
Discussions, limitations, and future work
Regarding computational complexity, the baseline ResNet50V2 classifier contains ~ 25 M parameters and relies exclusively on convolutional layers with hierarchical feature aggregation. The generator contains ~ 25 M parameters and the discriminator ~ 12 M. The proposed ensemble integrating ResNet50V2 with DeiT contains ~ 48 M parameters. On a single NVIDIA T4 GPU in Kaggle, baseline ResNet50V2 training converged within 9 epochs at ~ 1.5 min per epoch and GAN training within 14 epochs at ~ 2.5 min per epoch; on dual GPUs, ensemble training converged within 10 epochs at ~ 4.5 min per epoch. Compared to full-image synthesis GANs, the residual formulation significantly reduces computational overhead and accelerates convergence. This efficiency makes the approach suitable for practical data augmentation in large-scale medical imaging pipelines. Compared with standalone CNN classifiers, the proposed ensemble incurs a higher computational cost due to the transformer layers and attention-based fusion. The trade-off between computational complexity and classification performance is therefore justified by the improved stability and generalization across minority classes.
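As a rough, reproducible check on the backbone figure, the parameter count of ResNet50V2 can be queried directly (a sketch assuming TensorFlow/Keras; the DeiT-Small count of ~22 M is quoted from its original paper rather than computed here):

```python
import tensorflow as tf

# The baseline classifier backbone; with the classification head included,
# Keras reports ~25.6 M parameters, consistent with the ~25 M figure above.
backbone = tf.keras.applications.ResNet50V2(weights=None)
print(f"ResNet50V2 parameters: {backbone.count_params() / 1e6:.1f} M")

# A DeiT-Small tower (~22 M parameters) alongside this backbone is
# consistent with the reported ~48 M total for the ensemble.
```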
The proposed framework demonstrates consistent performance improvements across multiple lesion types, suggesting robust generalization rather than class-specific tuning. Class-wise performance consistency indicates that the model does not overfit to specific augmentation patterns and maintains balanced decision boundaries across lesion categories.
The use of synthetic data introduces a potential risk of overfitting, particularly when real samples for rare classes are extremely limited. Synthetic images are generated with controlled variability guided by attention maps, which reduces the likelihood of mode collapse or repetitive patterns. GAN training stability remains a dependency, however, and poor generator convergence could degrade augmentation quality. Care must also be taken to prevent synthetic data from leaking into the test set, as illustrated by the sketch below.
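One hedged sketch of a leakage-safe split, not the paper's stated protocol: HAM10000's metadata records multiple images per lesion_id, so grouping the train/test split by lesion and generating synthetic images only from the training split prevents identity leakage. The CSV path and variable names are illustrative:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# HAM10000 metadata maps several image_ids to one lesion_id. Grouping the
# split by lesion_id keeps all images of a lesion on one side of the split.
meta = pd.read_csv("HAM10000_metadata.csv")  # path is illustrative

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(meta, groups=meta["lesion_id"]))
train_meta, test_meta = meta.iloc[train_idx], meta.iloc[test_idx]

# No lesion appears in both splits, so no identity leakage into the test set;
# GAN-synthesized images would then be derived from train_meta only.
assert set(train_meta["lesion_id"]).isdisjoint(set(test_meta["lesion_id"]))
```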
Future work will explore extending the framework to multi-center datasets such as ISIC 2018 and BCN20000 to test its generalization across varying imaging conditions, resolutions, and acquisition protocols. Domain-adaptive augmentation strategies and style-invariant conditioning mechanisms may further improve generalization across clinical settings. Alternative techniques for managing class imbalance can also be explored as a substitute for the extensive synthetic generation employed in this study, thereby reducing the overall complexity.
Conclusion
This paper proposed a dual-framework strategy for classifying skin cancer, synergistically integrating a GAN augmentation pipeline conditioned on attention maps with a ResNet50V2-DeiT ensemble fused dynamically through an attention fusion module. Together, these address two major limitations of deep learning in dermatoscopy: severe class imbalance, tackled through reliable synthetic image generation, and the incomplete feature extraction of CNN models. The proposed architecture's performance was evaluated rigorously. Our approach achieved balanced results on the minority classes that are competitive with existing SOTA models. The reliability, stability, and statistical significance of the model are demonstrated through the bootstrap analysis, in which the model achieved high means and low standard deviations and standard errors for all per-class metrics. The framework enhanced the classification of minority classes, achieving an accuracy of 99.3% on Dermatofibroma, 99.7% on Vascular lesions, 98.9% on Basal Cell Carcinoma, and 98.2% on Actinic Keratosis. From all the results, it is clear that we achieved balanced performance across all classes, including the underrepresented ones, which most strategies with high overall performance fail to deliver. The performance of the model can be further enhanced with other optimization techniques.
Author contributions
Shaik Riyaz Hussain¹ conceived the research idea, designed the methodology, and supervised the study. Saladi Saritha² contributed to the experimental design, model implementation, and result analysis. Abhi Chevuri¹ conducted the experiments, performed data preprocessing, and organized the figures. Yepuganti Karuna²* assisted with experimentation, validation, and interpretation of the results. Shaik Riyaz Hussain¹ and Yepuganti Karuna²* drafted and revised the manuscript. All authors read and approved the final manuscript and agree to be accountable for all aspects of the work.
Funding
Open access funding provided by Vellore Institute of Technology- AP University.
Data availability
The primary dataset used in this study, HAM10000, is publicly available through the Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T) and Kaggle (https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000). The attention maps and synthetic images generated during this study were produced as intermediate, model-dependent outputs of the ResNet50V2 and conditional GAN training processes. Due to their large file sizes, dependency on trained model parameters, and storage in non-visual .npy formats, these derived data are not publicly hosted. However, they can be made available from the corresponding author upon reasonable request, subject to practical and technical feasibility.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Wang, M., Gao, X. & Zhang, L. Recent global patterns in skin cancer incidence, mortality, and prevalence. Chin. Med. J. 138 (2), 185–192 (2025). 10.1097/CM9.0000000000003416
- 2. World Cancer Research Fund. Skin cancer statistics. https://www.wcrf.org/preventing-cancer/cancer-statistics/skin-cancer-statistics/
- 3. Sundararajan, S. et al. Metastatic melanoma. In StatPearls [Internet] (StatPearls Publishing, 2025). https://www.ncbi.nlm.nih.gov/books/NBK470358/
- 4. Khalaf, A. D., Hamdan, H., Halin, A. B. A. & Manshor, N. Segmentation and classification of skin cancer diseases based on deep learning: challenges and future directions. IEEE Access (2025).
- 5. Meedeniya, D., De Silva, S., Gamage, L. & Isuranga, U. Skin cancer identification utilizing deep learning: a survey. IET Image Process. 18 (13), 3731–3749 (2024).
- 6. Nie, Y. et al. Recent advances in diagnosis of skin lesions using dermoscopic images based on deep learning. IEEE Access 10, 95716–95747 (2022). 10.1109/ACCESS.2022.3199613
- 7. Mohanty, A., Sutherland, A., Bezbradica, M. & Javidnia, H. Skin disease analysis with limited data in particular rosacea: a review and recommended framework. IEEE Access 10, 39045–39068 (2022). 10.1109/ACCESS.2022.3165574
- 8. Abhiram, A. P., Anzar, S. M. & Panthakkan, A. DeepSkinNet: a deep learning model for skin cancer detection. In 2022 5th International Conference on Signal Processing and Information Security (ICSPIS) 97–102 (IEEE, 2022).
- 9. Ramella, G. & Serino, L. CNN issues in skin lesion classification: data distribution and quantity. IEEE Access 13, 58848–58862 (2025). 10.1109/ACCESS.2025.3556479
- 10. Veeramani, N. & Jayaraman, P. A promising AI based super resolution image reconstruction technique for early diagnosis of skin cancer. Sci. Rep. 15, 5084 (2025). 10.1038/s41598-025-89693-8
- 11. Kannan, M. et al. An enhancement of machine learning model performance in disease prediction with synthetic data generation. Sci. Rep. 15, 33482 (2025). 10.1038/s41598-025-15019-3
- 12. Najjar, F. H. et al. Transformer-aided skin cancer classification using VGG19-based feature encoding. Sci. Rep. 15, 40204 (2025). 10.1038/s41598-025-24081-w
- 13. Halawani, H. T. et al. Enhanced early skin cancer detection through fusion of vision transformer and CNN features using hybrid attention of EViT-Dens169. Sci. Rep. 15, 34776 (2025). 10.1038/s41598-025-18570-1
- 14. Bakkouri, I. & Afdel, K. MLCA2F: multi-level context attentional feature fusion for COVID-19 lesion segmentation from CT scans. SIViP 17, 1181–1188 (2023). 10.1007/s11760-022-02325-w
- 15. Bakkouri, I. et al. BG-3DM2F: bidirectional gated 3D multi-scale feature fusion for Alzheimer's disease diagnosis. Multimed. Tools Appl. 81, 10743–10776 (2022). 10.1007/s11042-022-12242-2
- 16. Bakkouri, I. & Afdel, K. Computer-aided diagnosis (CAD) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimed. Tools Appl. 79, 20483–20518 (2020). 10.1007/s11042-019-07988-1
- 17. Bakkouri, I. & Afdel, K. Multi-scale CNN based on region proposals for efficient breast abnormality recognition. Multimed. Tools Appl. 78, 12939–12960 (2019). 10.1007/s11042-018-6267-z
- 18. Vidhyalakshmi, A. M. & Kanchana, M. Classification of skin disease using a novel hybrid flash butterfly optimization from dermoscopic images. Neural Comput. Appl. 36 (8), 4311–4324 (2024).
- 19. Kanchana, K., Kavitha, S., Anoop, K. J. & Chinthamani, B. Enhancing skin cancer classification using EfficientNet B0–B7 through convolutional neural networks and transfer learning with patient-specific data. Asian Pac. J. Cancer Prev. 25 (5), 1795 (2024).
- 20. Chatterjee, S., Gil, J. M. & Byun, Y. C. Early detection of multiclass skin lesions using transfer learning-based IncepX-ensemble model. IEEE Access (2024).
- 21. Reka, S. S., Karthikeyan, H. L., Shakil, A. J., Venugopal, P. & Muniraj, M. Exploring quantum machine learning for enhanced skin lesion classification: a comparative study of implementation methods. IEEE Access (2024).
- 22. Ji, Z. et al. EFAM-Net: a multi-class skin lesion classification model utilizing enhanced feature fusion and attention mechanisms. IEEE Access (2024).
- 23. Di, W. et al. ECRNet: hybrid network for skin cancer identification. IEEE Access 12, 67880–67888 (2024).
- 24. Ahmed, I. et al. Multi-model attentional fusion ensemble for accurate skin cancer classification. IEEE Access (2024).
- 25. Saeed, A., Shehzad, K., Ahmed, S. & Azar, A. T. EG-VAN: a global and local attention based dual-branch ensemble network with advanced color balancing for multi-class skin cancer recognition. IEEE Access (2025).
- 26. Remya, S., Anjali, T. & Sugumaran, V. A novel transfer learning framework for multimodal skin lesion analysis. IEEE Access 12, 50738–50754 (2024).
- 27. Shinde, R. et al. Squeeze-MNet: precise skin cancer detection model for low computing IoT devices using transfer learning. Cancers 14, 12 (2022).
- 28. Moturi, D., Surapaneni, R. K. & Avanigadda, V. S. G. Developing an efficient method for melanoma detection using CNN techniques. J. Egypt. Natl. Cancer Inst. 36 (1), 6 (2024).
- 29. Alam, T. M. et al. An efficient deep learning-based skin cancer classifier for an imbalanced dataset. Diagnostics 12, 2115 (2022).
- 30. Su, Q., Hamed, H. N. A., Isa, M. A., Hao, X. & Dai, X. A GAN-based data augmentation method for imbalanced multi-class skin lesion classification. IEEE Access 12, 16498–16513 (2024). 10.1109/ACCESS.2024.3360215
- 31. Kumar Kar, M., Venugopal, V. & Anoop, B. N. LeSegGAN: a hybrid attention-based GAN for accurate lesion segmentation in dermatological images. IEEE Access 13, 177019–177035 (2025). 10.1109/ACCESS.2025.3621107
- 32. Ravikumar, A., Sriraman, H., Chadha, C. & Kumar Chattu, V. Alleviation of health data poverty for skin lesions using ACGAN: systematic review. IEEE Access 12, 122702–122723 (2024). 10.1109/ACCESS.2024.3417176
- 33. Junejo, R. H., Abbas, Q., Awais, M., Akram, T. & Aldajani, M. B. Federated ViT: a distributed deep learning framework for skin cancer classification. IEEE Access 13, 166326–166342 (2025). 10.1109/ACCESS.2025.3612477
- 34. Aravinda, C. V., Raja Joseph, E. & Alasmari, S. A neural architecture search-driven quantum patch attention framework for skin disease recognition and classification with XAI vision transformers. IEEE Access 13, 197312–197328 (2025). 10.1109/ACCESS.2025.3627877
- 35. Peng, W. et al. DBTU-Net: a dual branch network fusing transformer and U-Net for skin lesion segmentation. IEEE Access 13, 101262–101273 (2025). 10.1109/ACCESS.2025.3578295
- 36. Kundu, D. et al. Federated deep learning for monkeypox disease detection on GAN-augmented dataset. IEEE Access 12, 32819–32829 (2024). 10.1109/ACCESS.2024.3370838
- 37. Tabibi, S. T., Nikravanshalmani, A. & Saboohi, H. An ensemble classifier based on diverse convolutional neural networks for skin lesions classification. IEEE Access 13, 195673–195686 (2025). 10.1109/ACCESS.2024.3442827
- 38. Noman, A., Beiji, Z., Zhu, C., Alhabib, M. & Alasri, A. Cascading size-dependent deep propagation (CADP): addressing over-smoothing in graph few-shot dermatology classification. Neural Netw. 108154 (2025).
- 39. Noman, A., Beiji, Z., Zhu, C., Alhabib, M. & Alasri, A. SEN-GAT: synergistic edge-node graph attention networks for few-shot skin disease diagnosis. Digit. Signal Process. 105497 (2025).
- 40. Karuna, Y., Syamala, N., Ravikumar, C. V., Thakur, P. & Saladi, S. Modified energy-based GAN for intensity inhomogeneity correction in brain MR images. Sci. Rep. 15 (1), 26409 (2025).
- 41. Wei, M. et al. A skin disease classification model based on DenseNet and ConvNeXt fusion. Electronics 12 (2), 438 (2023).