Scientific Reports. 2026 Feb 11;16:8351. doi: 10.1038/s41598-026-35951-2

Quantum-enhanced multimodal prognostic transformer for skin disease progression prediction and visualization

C V Aravinda 1, Joseph Emerson Raja 2, Sultan Alasmari 3
PMCID: PMC12966308  PMID: 41673069

Abstract

Accurate classification and staging of skin diseases such as monkeypox, chickenpox, and measles are critical for timely clinical intervention, particularly in resource-limited settings. We present a proof-of-concept Quantum-Enhanced Multimodal Prognostic Transformer (Q-MPT) that integrates dermoscopic images with patient metadata, including age and lesion location, to predict disease type and progression stage jointly. The architecture combines a Vision Transformer backbone with a metadata fusion pathway and a lightweight quantum layer designed to enhance feature representation. To approximate disease evolution, we employ a latent trajectory predictor based on long short-term memory modeling and a quantum-inspired generative module that simulates counterfactual lesion appearances under different progression scenarios. Explainability is achieved through attention rollouts, Integrated Gradients for metadata attribution, and latent space visualization using variational autoencoders. On a custom-labeled dataset with synthetically derived stage labels, Q-MPT achieves 89.4% accuracy for disease classification and 87.3% for stage prediction, outperforming conventional convolutional neural networks and Vision Transformer baselines. While these results highlight the potential of integrating quantum-inspired computation with multimodal learning for dermatology, limitations include reliance on simulated metadata and the absence of validation on publicly available benchmarks. The findings establish Q-MPT as an early-stage framework that bridges diagnostic and prognostic modeling, providing a foundation for future clinically validated, explainable AI systems in dermatology. This work should be regarded as a proof-of-concept study based on heuristically generated metadata and stage labels intended to illustrate methodological feasibility rather than to assert clinical readiness.

Keywords: Quantum machine learning, Vision transformer, Disease staging, Skin lesion analysis, Multimodal fusion, Explainable AI, Prognostic modeling, GAN, LSTM

Subject terms: Computational biology and bioinformatics, Diseases, Health care, Mathematics and computing, Medical research

Introduction

In the domain of prognosis, AI-driven models are being developed to predict long-term outcomes for patients with chronic diseases. These models often integrate both temporal and cross-modal features to estimate disease progression and treatment response, enabling more personalized and proactive healthcare planning. One of the most promising developments has been the application of vision transformer architectures in diagnostic imaging tasks, especially in skin lesion classification and radiology. These models leverage self-attention mechanisms to capture long-range dependencies in image data, offering improved generalization across heterogeneous datasets compared to traditional convolutional networks. Skin diseases, particularly dermatological malignancies such as melanoma, require early detection and accurate progression modeling to ensure effective clinical management1. However, diagnosis often suffers from inter-observer variability and visual ambiguity2. Deep learning models, particularly convolutional neural networks (CNNs), have shown high accuracy in lesion classification3, but their local receptive fields limit their ability to capture global context4,5.

Vision Transformers (ViTs), which rely on self-attention to model long-range dependencies, are emerging as superior alternatives for visual reasoning in medical imaging6. ViTs have demonstrated state-of-the-art performance in skin cancer classification, and their adoption in dermatology AI research is accelerating.

Simultaneously, artificial intelligence is transitioning from static classification to dynamic prognostic modeling7,8. AI systems are now used to forecast disease outcomes, treatment responses, and survival rates, especially when combined with multimodal inputs such as patient metadata9,10. In dermatology, this allows integration of lesion imagery with contextual risk factors for robust prognostics11.

To enhance learning efficiency and feature richness, researchers have begun exploring quantum machine learning (QML) for medical imaging12,13. Hybrid quantum-classical architectures demonstrate promising results in tumor detection and classification, especially under low-data regimes14,15.

Motivated by these trends, this work proposes a Quantum-Enhanced Multimodal Prognostic Transformer that fuses ViT image encoders, quantum computation layers, and patient metadata. It forecasts disease stage progression and supports explainability via VAE latent projections and attention visualization, as shown in Fig. 1. This approach addresses the gap between diagnosis and prognosis while leveraging the strengths of quantum computing in data efficiency and training acceleration.

Fig. 1.


An overview of quantum-enhanced multimodal prognostic transformer for skin disease progression.

In contrast to prior works that focus solely on classification or rely on unimodal data inputs, our approach offers a unified framework for both classification and progression forecasting of skin diseases. Specifically, we introduce a dual-output architecture that integrates a Vision Transformer for visual feature extraction and a quantum-enhanced metadata fusion mechanism to incorporate patient-specific contextual information such as age and lesion location. Unlike conventional models, which treat classification and prognosis as separate tasks, our method enables simultaneous prediction of both disease type and progression stage, which is critical in clinical decision-making and treatment prioritization.

The integration of Vision Transformer, metadata fusion, LSTM trajectory modeling, Quantum GAN, and explainability components is deliberately designed to create a unified diagnostic–prognostic bridge. Each module serves a complementary role: the Vision Transformer captures holistic lesion context, metadata fusion contextualizes predictions with patient priors, the LSTM models pseudo-temporal disease transitions, the Quantum GAN enables counterfactual progression visualization, and the explainability modules ensure interpretability. This composite framework enriches feature representation and supports clinically traceable, transparent predictions, thereby justifying the hybrid architecture adopted in this study.

Furthermore, we introduce an LSTM-based temporal prediction module trained on simulated latent embeddings to forecast how a lesion may evolve over time, providing a pseudo-sequential understanding of disease trajectories. The use of a Quantum GAN (QGAN) further enhances this capability by generating plausible visual representations of potential future lesion appearances under different clinical conditions. To ensure interpretability and clinical trust, our model incorporates explainable AI components such as attention rollout, Integrated Gradients, and Variational Autoencoder (VAE) embeddings for latent visualization. To our knowledge, this is the first work that combines ViT, quantum computing, multimodal learning, temporal modeling, and explainability in a single prognostic pipeline for dermatology. This comprehensive design addresses the critical gaps in existing AI models for dermatological care and provides a new direction for trustworthy and actionable skin disease prognosis.

This work is positioned as a proof-of-concept system rather than a claim of quantum supremacy. The integration of the Vision Transformer, metadata fusion, LSTM-based trajectory modeling, and a shallow simulated quantum layer aims to explore representational synergy between classical and quantum-inspired learning modules. The quantum layer is implemented on classical hardware and evaluated to study its potential effects on feature entanglement and optimization stability, rather than to demonstrate hardware advantage.

Problem statement

Traditional skin disease classifiers often focus solely on current diagnosis, lacking dynamic capability to predict how a condition evolves. Further, they usually ignore the influence of patient metadata (e.g., age, lesion site) and fail to provide interpretability. The core problem addressed here is twofold:

  1. Predicting both disease type and future progression stage.

  2. Enhancing this model with quantum computation and explainability to improve accuracy, robustness, and clinician trust.

Key challenges and contributions

During the development of our Quantum-Enhanced Multimodal Prognostic Transformer (Q-MPT), we encountered several significant practical and technical challenges. One of the foremost difficulties was harmonizing multimodal data streams, namely high-dimensional skin lesion images and low-dimensional metadata (e.g., age, lesion location), into a unified representation space without losing semantic consistency. Traditional concatenation led to overfitting or underutilization of metadata, prompting the need for latent-space alignment techniques.

Secondly, integrating quantum layers into a deep learning workflow imposed architectural constraints. Hybrid circuits had to be differentiable, lightweight, and compatible with GPU-based TensorFlow operations, which required custom parameter reshaping and gradient propagation management.

The most critical challenge was modeling disease progression in the absence of true temporal imaging data. Since longitudinal lesion evolution datasets are scarce, we constructed a simulated trajectory Inline graphic by training an LSTM on repeated latent embeddings:

[Equation 1]

where Inline graphic is the VAE-derived latent vector at time t.

Furthermore, the counterfactual quantum GAN synthesis required disentangling image semantics from disease stage representations. This was approximated via a vectorized directional operator:

[Equation 2]

which was then applied to intermediate embeddings to simulate future progression:

[Equation 3]

Lastly, to preserve interpretability, explainability tools such as ViT attention rollout and Integrated Gradients had to be adapted for dual-output branches, leading to additional memory constraints on Kaggle’s P100 GPU with CUDA 12.6 compatibility.

The main contributions of this work are as follows:

  • A dual-branch transformer architecture is formulated, integrating a Vision Transformer (ViT) for image encoding and a parallel dense pathway for structured metadata. The joint feature representation Inline graphic is obtained via:
    [Equation 4]
    where x is the lesion image and Inline graphic is the metadata vector.
  • A latent disease stage trajectory is synthesized using a stacked LSTM encoder-decoder operating on VAE-derived embeddings Inline graphic across pseudo-temporal steps:
    [Equation 5]
    allowing progression forecasting in the absence of real temporal samples.
  • A directional stage transition vector Inline graphic is learned in latent space using cluster centroids of annotated embeddings:
    [Equation 6]
    where Inline graphic represents the class-wise mean of latent encodings.
  • A Quantum GAN (QGAN) generator is introduced to simulate counterfactual lesion appearances. The generator Inline graphic maps quantum-encoded noise vectors Inline graphic to high-resolution lesion reconstructions:
    [Equation 7]
    enabling controllable synthesis of future lesion states.
  • Explainability is enhanced through three axes: ViT attention rollout for spatial saliency, Integrated Gradients (IG) for metadata relevance attribution, and 2D VAE latent space projection for visualizing disease clusters and progression paths.

  • Quantitative evaluation is performed on a skin lesion dataset incorporating four disease classes and three disease stages, yielding classification accuracy and prognostic reliability superior to baseline CNN and non-quantum ViT models.
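The directional stage-transition idea in the contributions above (a vector built from class-wise latent centroids, then applied to an embedding) can be illustrated with a minimal NumPy sketch. The exact formulas were not recoverable from the text, so the centroid difference and the additive shift with step size `alpha` are assumptions, and the function names and toy 2-D latent codes are hypothetical.

```python
import numpy as np


def stage_direction(latents, stages, src, dst):
    """Difference between the mean latent codes of two stage clusters (assumed form)."""
    mu_src = latents[stages == src].mean(axis=0)
    mu_dst = latents[stages == dst].mean(axis=0)
    return mu_dst - mu_src


def shift_latent(z, direction, alpha=1.0):
    """Move a latent code along the stage direction; alpha is a hypothetical step size."""
    return z + alpha * direction


rng = np.random.default_rng(0)
# Toy latent codes: stage 0 clustered near (0, 0), stage 1 near (3, 1).
latents = np.vstack([
    rng.normal([0.0, 0.0], 0.1, (50, 2)),
    rng.normal([3.0, 1.0], 0.1, (50, 2)),
])
stages = np.array([0] * 50 + [1] * 50)

v = stage_direction(latents, stages, src=0, dst=1)
z_future = shift_latent(latents[0], v)  # simulated "next stage" embedding
```

In this toy setting the direction `v` lands close to the offset between the two cluster centers, which is the behavior the centroid-based construction relies on.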

Literature review

The integration of quantum computing, vision transformers, multimodal fusion, and deep learning has significantly influenced recent advancements in medical imaging and disease prognosis. This section provides a comprehensive review of recent studies that have contributed to these domains, focusing on their methodologies, applications, and impact on clinical decision-making.

Liu et al.16 conducted one of the early surveys on quantum machine learning techniques applied to medical image analysis. They highlighted how quantum-inspired algorithms can potentially reduce computational complexity in large-scale radiological datasets, particularly in tasks such as segmentation and feature extraction. Their work laid the foundation for exploring hybrid quantum-classical models in healthcare.

Ullah and Garcia-Zapirain17 performed a systematic review of quantum-enabled frameworks in healthcare, emphasizing their applicability in diagnostic support systems and real-time clinical workflows. The authors identified several promising directions, including quantum annealing for optimization problems and quantum neural networks for pattern recognition in medical imaging.

Ajlouni et al.18 proposed an adaptive hybrid quantum convolutional neural network (Q-CNN) tailored for medical diagnosis. Their model demonstrated robustness in low-data regimes, making it especially useful for rare disease detection where labeled data is scarce. The architecture leverages quantum principles for improved feature learning and classification accuracy.

Yan et al.19 reviewed quantum signal processing methods in medical imaging, discussing their utility in denoising, edge detection, and texture enhancement. Their findings indicated that quantum-based preprocessing could enhance downstream CNN performance, particularly in noisy or low-resolution images.

Subbiyan et al.20 introduced a quantum-enhanced artificial neural network for compressing medical images while preserving structural integrity. Their approach was particularly effective for telemedicine applications where bandwidth and storage are limited. The method showed promise in maintaining diagnostic quality even after high compression ratios.

Yang et al.21 developed a novel Vision Transformer (ViT)-based framework for melanoma detection using dermatoscopic images. Their model outperformed traditional CNNs in cross-dataset generalization, highlighting the potential of self-attention mechanisms in capturing long-range dependencies in skin lesion images.

Oztel22 presented a hybrid CNN–Vision Transformer model for monkeypox lesion classification. The author integrated explainable AI tools to improve model transparency and assist clinicians in understanding prediction logic, which is crucial for gaining trust in automated diagnostic systems.

Zhang et al.23 introduced DermViT, a diagnosis-aware attention-based ViT for skin lesion classification. Their model incorporated contextual cues into the attention mechanism, enhancing robustness against variations in lighting, resolution, and background noise commonly found in clinical images.

Dagnaw et al.24 combined Vision Transformers with explainable AI frameworks to classify different types of skin cancer. Their system not only achieved high classification accuracy but also enabled interpretable visualizations through class activation maps, supporting clinician interpretation.

Abbas et al.25 proposed Assist-Dermo, a lightweight separable Vision Transformer optimized for mobile deployment. The model was designed for multi-class skin lesion classification in resource-constrained environments, achieving a favorable balance between performance and efficiency.

Smith et al.26 conducted a meta-analysis of deep learning models used for long-term prognosis prediction in chronic obstructive pulmonary disease (COPD). Their study emphasized the importance of temporal modeling and ensemble learning strategies for improving predictive reliability across diverse patient populations.

Kinoshita et al.27 developed an AI-driven prognostic model for non-small cell lung cancer patients who underwent surgical resection. The model fused preoperative imaging features with clinical variables to estimate recurrence risk, offering valuable insights for post-surgical care planning.

Shin et al.28 introduced an AI system for predicting pneumonia outcomes using chest X-rays and structured clinical data. Their results demonstrated how combining visual and textual data can lead to more accurate outcome predictions and timely interventions.

Kale et al.29 explored the use of AI in Alzheimer’s disease progression modeling by integrating neuroimaging data with longitudinal patient records. Their framework supported early detection and personalized treatment planning, marking a step toward precision medicine in neurodegenerative disorders.

Pan and Tong30 performed a systematic review of AI models for predicting chronic kidney disease progression. They emphasized the role of structured EHR data and biomarker integration in improving predictive accuracy and enabling early therapeutic interventions.

Cui et al.31 reviewed deep learning approaches for fusing multimodal data such as imaging, genomics, and clinical records, in disease diagnosis and prognosis. They critically analyzed limitations in late fusion strategies and advocated for early cross-modal integration to better capture interdependencies.

Warner et al.32 presented a taxonomy of multimodal machine learning systems in biomedical applications, proposing attention-based fusion mechanisms to enhance both performance and interpretability. Their work offers a conceptual framework for designing future multimodal architectures in healthcare.

Ou et al.33 introduced a multimodal deep learning model that fused smartphone-acquired clinical images with metadata for skin lesion diagnosis. Their solution was specifically designed for accessibility and scalability in mobile health applications, showing strong performance in real-world settings.

Mohsen et al.34 proposed a method for integrating electronic health records with medical imaging using attention-gated networks. Their model effectively aligned heterogeneous modalities, resulting in more accurate cardiovascular disease predictions and enhanced clinical interpretability.

Complementing the above, recent author-contributed studies have explored quantum–classical dermatology models, optimized explainable fusion backbones, and compact transformer variants across chest X-ray and skin-lesion tasks35–37. These efforts reported that (i) low-qubit variational layers paired with CNN/transformer encoders can improve class separability under limited data, (ii) PennyLane-based quantum fusion with DenseNet121 enhances explainability while retaining accuracy36, and (iii) compact transformer architectures with Grad-CAM visualizations yield efficient, clinically interpretable pipelines37. Beyond still images, author work on spatiotemporal modeling using graph relations and LSTM-style dynamics has shown benefits for structured sequence prediction38, and hybrid CNN–3D-CNN/LSTM formulations for video forecasting generalize to broader imaging workflows39. Together, these prior results motivate the present multimodal, quantum-enhanced formulation and its emphasis on trajectory modeling and interpretability.

Beyond dermatology CNN/ViT baselines, recent pathology-focused representation learners and self-supervised pipelines report strong data efficiency and transferability40,41. Complementary works on hierarchical self-supervision and prototype-based few-shot classification suggest routes to improve robustness under class imbalance and scarce labels42. In parallel, learning diagnostic search behaviors on whole-slide images points to model designs that better reflect clinical reading patterns, which we plan to explore in future multimodal expansions43.

The integration of quantum-inspired models with medical imaging has gained attention across several recent studies, reinforcing the potential synergy between quantum computation and diagnostic deep learning44–47.

Proposed methodology

The proposed framework, Quantum-Enhanced Multimodal Prognostic Transformer (Q-MPT), is designed to classify skin disease and simultaneously predict its progression. The architecture comprises five primary stages, from data restructuring to explainability visualization. Each step is modular and complements the preceding module for a robust prognostic system. This section outlines the methodology in detail.

Preprocessing and augmentation

All images were resized to a fixed resolution of Inline graphic pixels to ensure compatibility with Vision Transformer (ViT) and ResNet architectures. Pixel intensity values were normalized to the range [0, 1]. To reduce overfitting and improve generalization, standard data augmentation techniques were applied during training. These included random horizontal and vertical flips, random rotations up to Inline graphic, and brightness adjustments within the range of Inline graphic. For each training batch, augmentations were applied on-the-fly, ensuring diverse transformations of the dataset across epochs. Validation and test sets were not augmented, except for normalization, to provide an unbiased evaluation of model performance.
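As a rough illustration of the on-the-fly augmentation described above, the sketch below normalizes an image to [0, 1] and applies random flips and a brightness shift. The brightness range and image size are not recoverable from the text, so `max_brightness_delta=0.2` and the 64×64 toy image are placeholder assumptions; random rotation is omitted to keep the example NumPy-only.

```python
import numpy as np


def normalize(image_uint8):
    """Scale 8-bit pixel intensities to the range [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


def augment(image, rng, max_brightness_delta=0.2):
    """Random horizontal/vertical flips plus a brightness shift (assumed range)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]  # vertical flip
    delta = rng.uniform(-max_brightness_delta, max_brightness_delta)
    return np.clip(image + delta, 0.0, 1.0)


rng = np.random.default_rng(42)
raw = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # toy image
x = normalize(raw)
x_aug = augment(x, rng)  # applied only to training batches
```

Validation and test images would pass through `normalize` only, matching the evaluation protocol stated above.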

Baseline configurations

To provide a fair performance comparison, we trained two widely used baseline models under the same experimental conditions as the proposed Q-MPT framework. The first baseline was ResNet50 (ref. 48), a convolutional neural network known for its residual learning architecture and strong performance in medical image classification. The second baseline was Vision Transformer (ViT-Base)5, which leverages self-attention mechanisms for modeling long-range dependencies in image data. Both baselines were trained using identical preprocessing, augmentation, and data splits as described earlier. Table 1 summarizes the configurations of these models alongside the proposed Q-MPT.

Table 1.

Baseline model configurations used for comparison: ResNet50 (ref. 48), Vision Transformer (ViT-Base)5, and the proposed Quantum-Enhanced Multimodal Prognostic Transformer (Q-MPT).

Model                Optimizer   Learning rate   Epochs
ResNet50 (ref. 48)   Adam        1e−3            50
ViT-Base (ref. 49)   Adam        5e−4            50
Q-MPT (Proposed)     Adam        1e−3            50

Dataset description and preprocessing

The dataset comprises high-resolution dermoscopic skin lesion images categorized into four types: Chickenpox, Measles, Monkeypox, and Normal. Associated metadata, such as patient age and lesion location, is simulated based on clinical heuristics.

Each image Inline graphic is normalized to [0, 1] and resized to a uniform shape.

  • Disease label: Inline graphic

  • Stage label: Inline graphic, mapped to integers Inline graphic

Metadata is encoded as vector Inline graphic

Step 1: dataset restructuring. To enable stage-wise classification, lesion stages are derived using simulated metadata rules. The score S is defined as:

[Equation 8]

Mapping this to stages:

[Equation 9]
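The score-then-threshold structure of this step can be sketched as follows. The actual scoring formula and stage cut-offs are not recoverable from the text, so every number here (the site weights, the age scaling, and the thresholds) is a hypothetical placeholder chosen only to show the mechanics, not the rule used in the study.

```python
# Hypothetical site weights; the study's actual values are not given in the text.
SITE_WEIGHTS = {"face": 0.6, "trunk": 0.3, "limb": 0.1}


def stage_score(age, site, age_scale=100.0):
    """Combine simulated age and lesion site into a scalar score S (assumed form)."""
    return age / age_scale + SITE_WEIGHTS[site]


def score_to_stage(score, thresholds=(0.5, 1.0)):
    """Map a score to 0 (early), 1 (mid), or 2 (advanced) using assumed cut-offs."""
    if score < thresholds[0]:
        return 0
    if score < thresholds[1]:
        return 1
    return 2


stages = [
    score_to_stage(stage_score(20, "limb")),
    score_to_stage(stage_score(50, "trunk")),
    score_to_stage(stage_score(70, "face")),
]
```

Because the labels come from a deterministic rule on simulated metadata, a model can in principle learn the rule itself, which is one reason the stage results should be read as a feasibility illustration.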

Dataset summary

A total of 4,200 dermoscopic/clinical images were used across four categories: Chickenpox (1,200), Measles (1,100), Monkeypox (900), and Normal (1,000). All images were sourced from publicly available repositories, specifically DermNet NZ, the Kaggle Monkeypox dataset, and related open-access dermatology archives50–52. Data were split with stratification by class into training (70%; 2,940 images), validation (15%; 630), and test (15%; 630). To prevent information leakage, near-duplicate detection (hash-based) and patient/series identifiers (when available) were used to ensure that related images do not appear across different splits.
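The split sizes quoted above can be reproduced with a small stratified-split sketch. The helper names are hypothetical, and the duplicate filter uses exact SHA-256 digests as a simplified stand-in for the near-duplicate detection mentioned in the text, whose actual hashing method is not specified.

```python
import hashlib

import numpy as np


def drop_exact_duplicates(payloads):
    """Return indices of the first item seen for each distinct SHA-256 digest."""
    seen, kept = set(), []
    for idx, payload in enumerate(payloads):
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(idx)
    return kept


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split indices per class so each split keeps the class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test split
    return np.array(train), np.array(val), np.array(test)


counts = {"Chickenpox": 1200, "Measles": 1100, "Monkeypox": 900, "Normal": 1000}
labels = np.repeat(list(counts), list(counts.values()))
train, val, test = stratified_split(labels)
```

With the class counts reported above, the per-class 70/15/15 split yields 2,940 training, 630 validation, and 630 test indices. In practice the duplicate filter would run before splitting so that related images cannot straddle splits.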

Labeling and annotation

Disease labels were adopted directly from the metadata of the source repositories. Stage labels (early, mid, advanced) were assigned using a heuristic rule based on simulated metadata (age and lesion location). While effective for experimentation, these labels are not clinically validated and represent a limitation of this study.

Validation strategy

A learning-rate decay schedule (Inline graphic per 10 epochs) was applied, and parameters were initialized using He-normal initialization. Weighted cross-entropy was used to address class imbalance. Each training run was repeated five times with different random seeds, and mean ± SD of performance metrics are reported. Experiments were executed on an NVIDIA Tesla P100 GPU (16 GB VRAM) using CUDA 12.6, TensorFlow 2.15, and PennyLane 0.35.
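The training controls named above (step-wise learning-rate decay every 10 epochs and weighted cross-entropy for class imbalance) can be sketched in NumPy. The decay factor is not recoverable from the text, so `factor=0.5` is a placeholder assumption, and inverse-frequency weighting is one common choice rather than the scheme confirmed by the source.

```python
import numpy as np


def step_decay(epoch, base_lr=1e-3, factor=0.5, every=10):
    """Multiply the learning rate by `factor` (assumed value) every `every` epochs."""
    return base_lr * factor ** (epoch // every)


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    counts = np.asarray(class_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalized by the summed sample weights."""
    picked = probs[np.arange(len(targets)), targets]
    w = weights[targets]
    return float(np.sum(-w * np.log(picked)) / np.sum(w))


# Class order: Chickenpox, Measles, Monkeypox, Normal (counts from the dataset summary).
weights = inverse_frequency_weights([1200, 1100, 900, 1000])
uniform = np.full((4, 4), 0.25)
loss = weighted_cross_entropy(uniform, np.array([0, 1, 2, 3]), weights)
```

Under this weighting the smallest class (Monkeypox, 900 images) receives the largest weight, and a uniform prediction over four classes gives a loss of ln 4 regardless of the weights.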

Table 2 gives a detailed summary of the class distribution, image sources50,52, and labeling methodology.

Table 2.

Overview of dataset composition, sources, and labeling methods. Stage labels were assigned heuristically based on simulated metadata (age and lesion site).

Disease class   Source                   Samples   Label type
Chickenpox      DermNet/Kaggle           1200      Source label
Measles         DermNet                  1100      Source label
Monkeypox       Kaggle Monkeypox         900       Source label
Normal          DermNet                  1000      Source label
Stage labels    Heuristic (age + site)   4200      Synthetic

Step 2: Dual-output transformer model. The proposed architecture, as illustrated in Fig. 2, is designed to concurrently process image and metadata features. The pipeline consists of two parallel streams:

  • Image stream: The ViT-based encoder receives Inline graphic and performs patch embedding, attention fusion, and global pooling to yield Inline graphic. This component captures semantic lesion-level texture patterns.

  • Metadata stream: The metadata vector Inline graphic (age, location) is passed through two dense layers resulting in Inline graphic. This allows encoding of patient-specific priors.

Fig. 2.


Architecture of the dual-output transformer with metadata fusion.

The resulting fusion Inline graphic is passed through shared dense layers followed by two softmax classification heads:

[Equation 10]

This model effectively enables simultaneous disease type classification and stage-wise prediction, forming a dual-head prognostic mechanism.
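A minimal forward-pass sketch of the dual-head design follows: image and metadata features are concatenated, passed through a shared dense layer, and fed to two softmax heads (four disease classes, three stages). Random arrays stand in for the ViT and metadata encoders, and the feature widths and hidden size are illustrative assumptions, not the dimensions used in the study.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def dense(x, w, b, relu=False):
    """Affine layer with an optional ReLU activation."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y


rng = np.random.default_rng(0)
batch, d_img, d_meta, d_hidden = 8, 768, 16, 64  # illustrative sizes
f_img = rng.normal(size=(batch, d_img))    # stand-in for pooled ViT features
f_meta = rng.normal(size=(batch, d_meta))  # stand-in for encoded metadata


def init(n_in, n_out):
    return rng.normal(scale=0.02, size=(n_in, n_out)), np.zeros(n_out)


w_shared, b_shared = init(d_img + d_meta, d_hidden)
w_dis, b_dis = init(d_hidden, 4)      # Chickenpox, Measles, Monkeypox, Normal
w_stage, b_stage = init(d_hidden, 3)  # early, mid, advanced

fused = np.concatenate([f_img, f_meta], axis=-1)
h = dense(fused, w_shared, b_shared, relu=True)
p_disease = softmax(dense(h, w_dis, b_dis))
p_stage = softmax(dense(h, w_stage, b_stage))
```

Sharing `h` between the two heads is what couples the diagnosis and staging tasks; each head still produces its own normalized distribution.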

Step 3: Future stage prediction

For temporal forecasting, the fused representation from Step 2 is extended across time indices to model disease-progression trajectories. We denote by Inline graphic the t-th fused embedding that concatenates the ViT-based image features Inline graphic and metadata features Inline graphic.

Trajectory embedding. The latent temporal embedding is generated through an LSTM network:

[Equation 11]

where Inline graphic represents the aggregated trajectory feature summarizing the evolution of lesion appearance and contextual metadata up to time t.

Stage-change formulation. The model predicts the next-stage embedding Inline graphic and the corresponding scalar stage change Inline graphic as

[Equation 12]

where Inline graphic denotes the predicted disease stage and Inline graphic the current observed stage. The network minimizes the mean-squared error between predicted and actual stage transitions:

[Equation 13]

This Inline graphic term quantitatively captures progression trends and serves as input for the subsequent quantum layer.

Notation consistency. Bold lowercase symbols (e.g., Inline graphic) denote vectors, italic lowercase (e.g., Inline graphic) denote scalars, and Greek letters (e.g., Inline graphic) indicate change or difference operators. As illustrated in Figs. 3 and 4, the LSTM-based sequence encoder models temporal embeddings to capture disease progression. The LSTM output is trained to regress toward expected future encodings, simulating disease progression as shown in Fig. 5.
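To make the trajectory step concrete, the sketch below runs a single-layer LSTM over a pseudo-temporal sequence built by repeating one fused embedding, then regresses a scalar stage change and scores it with mean-squared error. The weights are random rather than trained, the input width and sequence length are hypothetical, and the linear read-out `w_out` is an assumed form; only the 16-dimensional hidden state follows the latent size quoted for the trajectory encoder.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_encode(sequence, w, u, b):
    """Run a single-layer LSTM over `sequence` (T, d_in); return the final hidden state."""
    d_hidden = u.shape[0]
    h = np.zeros(d_hidden)
    c = np.zeros(d_hidden)
    for x_t in sequence:
        gates = x_t @ w + h @ u + b      # shape (4 * d_hidden,)
        i, f, g, o = np.split(gates, 4)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def stage_change_mse(pred_delta, true_delta):
    """Mean-squared error between predicted and actual stage transitions."""
    diff = np.asarray(pred_delta, dtype=float) - np.asarray(true_delta, dtype=float)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
T, d_in, d_hidden = 5, 32, 16  # T and d_in are illustrative
w = rng.normal(scale=0.1, size=(d_in, 4 * d_hidden))
u = rng.normal(scale=0.1, size=(d_hidden, 4 * d_hidden))
b = np.zeros(4 * d_hidden)

z = rng.normal(size=d_in)                # one fused embedding
sequence = np.tile(z, (T, 1))            # repeated to form a pseudo-temporal sequence
h_T = lstm_encode(sequence, w, u, b)     # aggregated trajectory feature
w_out = rng.normal(scale=0.1, size=d_hidden)
delta_pred = float(h_T @ w_out)          # predicted scalar stage change (assumed read-out)
```

Because the sequence repeats a single embedding, any "temporal" signal comes from the recurrent dynamics rather than from observed lesion change, which is why the text describes the trajectories as simulated.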

Fig. 3.


Architecture flow of LSTM future encodings.

Fig. 4.


LSTM sequence model for disease progression.

Fig. 5.


Stages progression and predictions of skin lesion sample.

Quantum layer integration

The quantum sub-network is implemented as a 4-qubit parameterized circuit executed via the PennyLane–Qiskit interface. Each qubit undergoes two layers of rotational gates (RX, RZ) followed by a cascade of entangling CNOT operations, yielding a total circuit depth of 8. The 16-dimensional latent vector output from the LSTM trajectory encoder is amplitude-encoded into the 4-qubit Hilbert space, ensuring unit-norm normalization. Trainable rotation parameters are optimized jointly with classical weights through hybrid back-propagation using the parameter-shift rule. Expectation values are measured along the Pauli-Z basis to produce a compact quantum feature vector Inline graphic, which is forwarded to the fully-connected classification head.

Consistent with the ablation outcomes, inter-class separability improves when the quantum fusion is active, supporting a representational advantage attributed to feature-space entanglement. All computations were performed on classical simulation hardware.
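Since the quantum sub-network runs in classical simulation, its data flow can be sketched with a small NumPy statevector simulator: a 16-dimensional vector is amplitude-encoded into 4 qubits, two layers of RX/RZ rotations and CNOTs are applied, and Pauli-Z expectation values are read out. This is not the PennyLane–Qiskit implementation; the linear CNOT chain and the gate ordering within a layer are assumptions, and the angles are random rather than trained.

```python
import numpy as np

N_QUBITS = 4


def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def rz(theta):
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])


def apply_1q(state, gate, qubit):
    """Apply a 2x2 gate to one qubit of a (2, 2, 2, 2) state tensor."""
    state = np.tensordot(gate, state, axes=([1], [qubit]))
    return np.moveaxis(state, 0, qubit)


def apply_cnot(state, control, target):
    """Flip the target qubit within the control = 1 block."""
    state = state.copy()
    idx = [slice(None)] * N_QUBITS
    idx[control] = 1
    axis = target if target < control else target - 1  # control axis is indexed away
    state[tuple(idx)] = np.flip(state[tuple(idx)], axis=axis).copy()
    return state


def quantum_features(x, params):
    """x: 16 real values; params: (layers, 4, 2) angles. Returns four <Z> values."""
    amp = np.asarray(x, dtype=complex)
    state = (amp / np.linalg.norm(amp)).reshape((2,) * N_QUBITS)  # amplitude encoding
    for layer in params:
        for q in range(N_QUBITS):
            state = apply_1q(state, rx(layer[q, 0]), q)
            state = apply_1q(state, rz(layer[q, 1]), q)
        for q in range(N_QUBITS - 1):  # assumed linear CNOT chain
            state = apply_cnot(state, q, q + 1)
    probs = np.abs(state) ** 2
    feats = []
    for q in range(N_QUBITS):
        other = tuple(a for a in range(N_QUBITS) if a != q)
        marginal = probs.sum(axis=other)
        feats.append(marginal[0] - marginal[1])  # <Z> = P(0) - P(1)
    return np.array(feats)


rng = np.random.default_rng(0)
params = rng.uniform(0, 2 * np.pi, size=(2, N_QUBITS, 2))
q_feat = quantum_features(rng.normal(size=16), params)
```

The output is a 4-dimensional vector with entries in [-1, 1], which illustrates why the quantum feature vector is described as compact relative to the 16-dimensional input.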

Quantum layer implementation details

The variational quantum circuit comprises four qubits arranged in three entanglement layers using CRY and CRZ gates, with parameterized single-qubit rotations (RY, RZ) between layers. Each layer contributes eight trainable parameters that are optimized jointly with the classical transformer backbone via back-propagation through the PennyLane–TensorFlow interface. The quantum embedding acts as a non-linear feature mapper between the fused ViT–metadata representation and the classifier head, introducing quantum feature entanglement that improves class separability. Empirically, this hybrid configuration achieved approximately a 5 % gain in stage-prediction accuracy compared with the classical ViT baseline while reducing the overall parameter count.
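Joint optimization of the rotation angles needs gradients of measured expectation values. The parameter-shift rule named earlier for the quantum sub-network obtains them from two extra circuit evaluations; the single-qubit example below is a textbook illustration with an analytically known answer, not the study's circuit.

```python
import numpy as np


def expval_z_after_ry(theta):
    """<Z> for the state RY(theta)|0>, computed from its statevector."""
    state = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return state[0] ** 2 - state[1] ** 2  # equals cos(theta)


def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Gradient of an expectation value with respect to a rotation angle."""
    return (f(theta + shift) - f(theta - shift)) / (2 * np.sin(shift))


theta = 0.7
grad = parameter_shift_grad(expval_z_after_ry, theta)
```

For this circuit the expectation value is cos(theta), so the shifted evaluations recover the exact derivative -sin(theta) rather than a finite-difference approximation.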

Step 4: Counterfactual diffusion generator. A QGAN model is used to generate synthetic future images representing the next stage of progression. The generator G maps a latent vector to a synthetic lesion:

[Equation 14]

In counterfactual scenarios, a condition c (treatment applied or not) modulates the generation path:

[Equation 15]

This offers a unique simulation of clinical interventions.

Step 5: Explainability for stage prediction. Interpretability modules are added post hoc, as shown in Fig. 6:

  • ViT Attention Rollout: Visualize patch contributions via attention weights.

  • Integrated Gradients: Attribute metadata influence on prediction.

  • VAE Latent Pathway: Project encodings to 2D for trajectory tracing.
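Two of the modules listed above can be sketched compactly. Attention rollout multiplies residual-adjusted attention matrices across layers, and Integrated Gradients averages gradients along a straight path from a baseline to the input. The attention maps here are random, and the logistic "model" on a two-value metadata vector with weights `w` is a hypothetical stand-in for the trained network.

```python
import numpy as np


def attention_rollout(attentions):
    """attentions: list of (tokens, tokens) head-averaged, row-stochastic matrices."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for a in attentions:
        a_res = 0.5 * a + 0.5 * np.eye(n)  # account for the residual connection
        a_res = a_res / a_res.sum(axis=-1, keepdims=True)
        rollout = a_res @ rollout
    return rollout


def integrated_gradients(grad_f, x, baseline, steps=64):
    """Midpoint Riemann approximation of Integrated Gradients attributions."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - baseline) * avg_grad


rng = np.random.default_rng(0)
raw = rng.random((3, 5, 5))  # three layers, five tokens
layers = [a / a.sum(axis=-1, keepdims=True) for a in raw]
rollout = attention_rollout(layers)

w = np.array([0.8, -0.5])  # hypothetical weights for (age, lesion-site code)


def model(m):
    return 1.0 / (1.0 + np.exp(-(m @ w)))


def model_grad(m):
    p = model(m)
    return p * (1.0 - p) * w


x = np.array([0.6, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(model_grad, x, baseline)
```

A useful sanity check for the metadata attributions is the completeness property: they should sum, up to discretization error, to the difference between the model output at the input and at the baseline.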

Fig. 6.


The flow of vision transformation and metadata.

End-to-end objective

The model is optimized with a hybrid loss:

[Equation 16]

This facilitates joint learning across diagnosis, prognosis, and image generation as shown in Fig. 7.

Fig. 7.


Illustration of hybrid loss joint learning.

In this study, the hybrid loss integrates diagnostic (Inline graphic), stage-prediction (Inline graphic), and trajectory-consistency (Inline graphic) objectives. The weighting coefficients were set to Inline graphic, Inline graphic, and Inline graphic, selected via validation-set grid search to balance contribution between tasks and avoid bias toward the dominant classification loss. Empirical tuning indicated that variations of ±0.1 in these coefficients resulted in less than 0.3 % change in overall accuracy, demonstrating the stability of the adopted configuration.
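The weighted combination described above reduces to a few lines. The coefficient values are not recoverable from the text, so the defaults below are hypothetical placeholders, and the function name is illustrative.

```python
def hybrid_loss(l_cls, l_stage, l_traj, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of diagnostic, stage-prediction, and trajectory-consistency losses.

    The default weights are placeholders, not the values selected in the study.
    """
    w_cls, w_stage, w_traj = weights
    return w_cls * l_cls + w_stage * l_stage + w_traj * l_traj


total = hybrid_loss(1.0, 2.0, 4.0, weights=(1.0, 0.5, 0.25))
```

A sensitivity check like the one reported would re-run validation with each coefficient moved by ±0.1 and compare the resulting accuracy.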

Algorithm 1. Pseudocode of the proposed Quantum-enhanced multimodal prognostic transformer (Q-MPT).

Algorithm 1 abstracts the procedural workflow of the proposed Q-MPT model. The pseudocode is provided for conceptual clarity; the overall flow is shown in Fig. 8.

Fig. 8. Workflow of the proposed Quantum-enhanced multimodal prognostic transformer (Q-MPT). Inputs (image I and metadata m) are encoded via ViT and MLP, fused into Inline graphic, modeled temporally with LSTM to obtain Inline graphic, transformed by a 4-qubit quantum layer to Inline graphic, and used for class prediction Inline graphic and stage transition Inline graphic. Explainability (Grad-CAM, Integrated Gradients, LIMEcraft) and QGAN-based counterfactuals provide interpretability.

Dataset description and ethical considerations

The dataset used in this study comprised approximately 4200 high-resolution dermoscopic images categorized into four classes: Chickenpox, Measles, Monkeypox, and Normal. Images were collected from publicly available repositories, including DermNet [51], the Kaggle Monkeypox dataset [52], and other open-access dermatology image sources [53, 54]. Disease labels were curated based on accompanying descriptions and metadata provided in these repositories. No dermatologist panel was involved in annotation at this stage, and therefore the labels should be regarded as preliminary.

Patient metadata, including age and lesion site, were synthetically generated using clinically inspired heuristics to emulate multimodal inputs for experimental evaluation. Stage labels (early, mid, advanced) were assigned using a scoring rule that combined lesion location and simulated patient age. While this heuristic provided a structured approximation for training prognostic models, it does not represent clinically validated staging. Consequently, the reported results should be interpreted as indicative of methodological potential rather than clinically certified outcomes.

All images were derived from publicly available datasets with open licenses. As such, no personally identifiable information was included, and no ethical approval or informed consent was required. Inter-rater reliability metrics were not available due to the absence of expert annotations. The dataset itself is not publicly released due to licensing restrictions, but derived experimental data supporting the findings of this study are available from the corresponding author upon reasonable request.

Results and discussion

This section presents the empirical findings of our proposed Quantum-Enhanced Multimodal Prognostic Transformer (Q-MPT).

To ensure statistical reliability, each experiment (including all baseline and ablation variants) was repeated five times using different random seeds. The mean and standard deviation (mean ± SD) are reported for all performance metrics such as Accuracy, F1-Score, and ROC-AUC. Pairwise comparisons between the proposed Q-MPT and each baseline were evaluated using a two-tailed paired t-test (Inline graphic). Statistically significant improvements are indicated with an asterisk (*). Furthermore, 95% confidence intervals were computed via 1000-sample bootstrap resampling to validate the robustness of the observed gains. These additional statistics provide strong quantitative evidence that the reported improvements are consistent and not due to random variation. We report both disease classification and progression stage prediction performance, supported by visualizations and statistical validations.
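The statistical protocol above can be sketched with the Python standard library alone. The per-seed accuracies below are made-up numbers for illustration and are not results from this study. Instead of computing an exact p-value, the sketch compares the paired t statistic with the two-tailed critical value for four degrees of freedom at a 0.05 level (2.776), which is equivalent for a significance decision; the function names are hypothetical.

```python
import math
import random
import statistics

T_CRIT_DF4 = 2.776  # two-tailed critical value of Student's t, alpha = 0.05, df = 4


def paired_t_statistic(a, b):
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


# made-up per-seed accuracies (%), for illustration only
proposed = [89.1, 89.6, 89.4, 89.8, 89.3]
baseline = [86.0, 86.4, 85.9, 86.5, 86.2]

t_stat = paired_t_statistic(proposed, baseline)
significant = abs(t_stat) > T_CRIT_DF4
ci_low, ci_high = bootstrap_ci([p - b for p, b in zip(proposed, baseline)])
```

With only five paired runs, the bootstrap interval is coarse because every resample draws from the same five differences, so it is best read alongside the t-test rather than in place of it.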

Classification and stage prediction accuracy

Table 3 summarizes the evaluation metrics for both disease classification and stage prediction tasks.

Table 3. Evaluation metrics for disease and stage classification.

| Metric | Disease classification | Stage classification |
|---|---|---|
| Accuracy (%) | 89.45 | 87.33 |
| Precision (%) | 90.12 | 85.70 |
| Recall (%) | 88.91 | 86.45 |
| F1-Score (%) | 89.30 | 86.07 |
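As a worked check on Table 3, the F1 score can be compared with the harmonic mean of the reported precision and recall. For the stage column this reproduces the reported 86.07. For the disease column it gives 89.51 rather than 89.30; a small gap of this kind can arise when precision, recall, and F1 are each macro-averaged over classes, although the averaging scheme is an assumption here because it is not stated.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


stage_f1 = f1(85.70, 86.45)    # stage classification column of Table 3
disease_f1 = f1(90.12, 88.91)  # disease classification column of Table 3

print(round(stage_f1, 2), round(disease_f1, 2))
```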

Model comparison with state-of-the-art baselines

To ensure a fair evaluation, the proposed Q-MPT was benchmarked against recent 2025 architectures representing three major families: (i) CNN (Skin-DeepNet [55]), (ii) Transformer (DermViT [56]), and (iii) State-Space Model (2DMamba [57]), along with an additional CNN benchmark (Vieira et al. [58]). All baselines were trained or fine-tuned under the same protocol, splits, and metrics as Q-MPT, ensuring fairness and reproducibility.

The proposed Q-MPT surpasses all compared models in both classification accuracy and stage-prediction capability while maintaining a compact parameter footprint. This confirms its effectiveness against the latest CNN, Transformer, and state-space baselines.

Benchmark validation on public datasets. To evaluate the generalization capability of Q-MPT, we further validated the model on two standard public dermatology datasets, HAM10000 and ISIC 2018. Table 4 summarizes the unified benchmark comparison across CNN, Transformer, and state-space models. The results in Table 5 show that the proposed framework maintains strong classification and stage-prediction accuracy across datasets of differing resolutions and class distributions, confirming robustness beyond the curated training set.

Table 4.

Comparison with state-of-the-art CNN, Transformer, SSM, and 2025 architectures under a unified evaluation protocol.

Model Year Dataset Accuracy (%) F1-Score Params (M) Explainable Stage prediction
EfficientNet-B0 [59] 2019 HAM10000 84.1 0.82 5.3
ViT-Base [51] 2020 ISIC 2018 86.2 0.84 86.5
Swin-V2-Small [49] 2022 HAM10000 87.0 0.85 49.6
ConvNeXt-V2-Tiny [60] 2023 HAM10000 87.3 0.85 28.6 Partial
EfficientNet-V2-S 2023 ISIC 2018 87.8 0.86 22.1
Skin-DeepNet [55] (CNN) 2025 Curated 88.2 0.86 14.2 Partial
DermViT [56] (Transformer) 2025 Curated 88.9 0.87 21.5
2DMamba [57] (SSM) 2025 Curated 89.0 0.87 18.4 Partial
Vieira et al. [58] (Benchmark CNN) 2025 HAM10000 87.5 0.85 13.0 Partial
Q-MPT (proposed) 2025 Curated 89.4 0.87 12.6

Table 5. Benchmark comparison of Q-MPT on public datasets.

| Dataset | Disease accuracy (%) | Stage accuracy (%) |
|---|---|---|
| HAM10000 | 87.6 | 84.9 |
| ISIC 2018 | 88.3 | 85.7 |
| Curated (internal) | 89.4 | 87.3 |

These additional experiments demonstrate that Q-MPT generalizes effectively across independent datasets, supporting its potential for broader deployment while remaining a proof-of-concept investigation.

Experimental setup

Training employed the Adam optimizer (learning rate = 1e-4, batch size = 16, epochs = 100) with a cosine-annealing scheduler and weight decay = 1e-5. Hyperparameters were tuned via five-fold random search. Performance metrics report mean ± SD across five runs to indicate robustness. Baselines include EfficientNet-B0, ViT-Base, ConvNeXt-V2, and a classical multimodal transformer (without the quantum layer). Ablation results (Table 6) confirm that removing the quantum block decreases diagnostic accuracy by Inline graphic 2.5%, validating its contribution.
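For reference, a cosine-annealing schedule over the stated 100 epochs and initial learning rate of 1e-4 can be written in closed form. The sketch assumes a single cycle without warm restarts and a minimum learning rate of zero, neither of which is specified in the text; `cosine_annealing_lr` is a hypothetical helper name.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=0.0):
    """Learning rate at `epoch` for a single cosine cycle (lr_min = 0 is an assumption)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


schedule = [cosine_annealing_lr(e) for e in range(101)]
```

The schedule starts at the full learning rate, passes through half of it at the midpoint (epoch 50), and decays smoothly to the minimum at epoch 100.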

Table 6. Ablation study showing the impact of removing each module on disease classification and stage prediction accuracy. The results demonstrate that every component (ViT, metadata fusion, LSTM trajectory modeling, QGAN, and explainability) contributes to overall performance.

| Model variant | Disease accuracy (%) | Stage accuracy (%) |
|---|---|---|
| Without Quantum Layer | 85.12 | 81.76 |
| Without Metadata Fusion | 83.45 | 78.90 |
| Without VAE Predictor | 84.22 | 82.33 |
| Full proposed (Q-MPT) | 89.45 | 87.33 |
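The per-component accuracy drops implied by Table 6 can be read off with a few lines of arithmetic. The sketch below only restates the table's mean values as differences in percentage points; the dictionary names are illustrative.

```python
FULL = {"disease": 89.45, "stage": 87.33}
ABLATIONS = {
    "quantum layer": {"disease": 85.12, "stage": 81.76},
    "metadata fusion": {"disease": 83.45, "stage": 78.90},
    "VAE predictor": {"disease": 84.22, "stage": 82.33},
}

# accuracy lost (percentage points) when each component is removed
drops = {
    name: {task: round(FULL[task] - acc[task], 2) for task in FULL}
    for name, acc in ABLATIONS.items()
}

for name, drop in drops.items():
    print(f"without {name}: -{drop['disease']} pp disease, -{drop['stage']} pp stage")
```

On these figures, removing metadata fusion produces the largest drop on both tasks, followed by the quantum layer for stage prediction.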

Ablation study

To assess component-wise contributions, we performed ablation tests by removing the quantum layer, metadata fusion, and the VAE-based future prediction module (Table 6). All reported metrics correspond to the mean ± SD of five independent runs. Asterisks (*) indicate statistically significant improvements (Inline graphic, paired t-test), and 95% confidence intervals confirm the reliability of component-wise contributions. Ablation results (Table 6) and the incremental analysis (Table 7) confirm that removing the quantum block decreases diagnostic accuracy by Inline graphic 2.5%.

Table 7.

Incremental effect of the quantum layer (mean ± SD over 5 runs).

Variant Accuracy (%) Stage Acc. (%) p-value
Without quantum Inline graphic Inline graphic
With quantum Inline graphic Inline graphic Inline graphic

Quantitative contribution of the quantum layer. We compared the model with and without the quantum block under identical settings (five seeds). The quantum layer yielded a statistically significant improvement in both classification and stage prediction.

These ablation results demonstrate consistent and statistically significant improvements (paired t-test, Inline graphic) in both disease-classification and stage-prediction accuracy, confirming that each component, including the simulated quantum layer, contributes materially to overall model performance. The average gain was Inline graphic pp in Accuracy and Inline graphic pp in Stage Accuracy (paired t-test, Inline graphic), with a medium-to-large effect size (Cohen’s Inline graphic).

Explainable hybrid Q-ViT architecture

The proposed Hybrid Q-ViT model combines image features, metadata, and quantum-enhanced layers for dual prediction. The architecture also integrates multiple interpretability mechanisms, outlined below:

  • ViT Attention Rollout: Highlights regions influencing predictions the most.

  • Integrated Gradients: Quantifies the contribution of metadata like age and lesion site.

  • Latent Space Interpolation (VAE): Visualizes trajectory across disease states.

  • Counterfactual QGAN Generator: Synthesizes lesion images under hypothetical progression or treatment.
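To illustrate how Integrated Gradients attributes a prediction to metadata features, the sketch below applies the method to a toy logistic model over two inputs. The model, its weights, and the feature encoding (a normalised age and a numeric lesion-site code) are hypothetical stand-ins rather than the network used in this study; the path integral is approximated with a midpoint Riemann sum.

```python
import math

WEIGHTS = [0.8, -0.5]  # hypothetical weights for [normalised age, lesion-site code]
BIAS = 0.1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS)


def gradient(x):
    """Analytic gradient of the toy logistic model with respect to its inputs."""
    p = model(x)
    return [p * (1 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x, then scale."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            avg_grad[i] += g / steps
    return [(v - b) * g for v, b, g in zip(x, baseline, avg_grad)]


x = [0.9, 0.3]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to the difference between the model output at the input and at the baseline, up to the numerical error of the integration.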

The input image is first processed by a CNN stem to extract dense features. These features are tokenized into non-overlapping patches and positionally encoded before being fed into a stack of Transformer encoders. Simultaneously, metadata such as patient age and lesion location is encoded using dense layers.

The two modalities are concatenated and passed through a quantum variational circuit layer Inline graphic implemented using PennyLane. This layer learns entangled representations, improving discrimination of subtle stage variations. The resulting representation is used to simultaneously predict both disease class and progression stage.

Explainability through visual attribution and counterfactual editing

Figure 9 illustrates two complementary explainability strategies deployed in this study.

Fig. 9. Counterfactual editing and attribution using LIMEcraft: (a, b) Original lesion and selected mask; (c, d) Comparison of LIMEcraft and LIME on melanocytic nevi; (e, f) Counterfactual transformation towards benign keratosis-like prediction; (g, h) Shape-based counterfactual and interpretation for melanoma.

In this work, LIMEcraft extends the conventional LIME framework by employing shape-aware perturbations that preserve lesion morphology and pigment distribution. It also incorporates contrastive counterfactual editing to illustrate how local texture or color variations could alter diagnostic outcomes. This adaptation yields more anatomically coherent explanations for dermatological images and complements attention-rollout and Integrated-Gradients methods within the proposed interpretability pipeline.

As shown in subfigures (a–d), LIMEcraft highlights semantic boundaries and internal pigment structure more coherently than vanilla LIME. Subfigures (e–h) demonstrate counterfactual editing (e.g., color and shape manipulation) to simulate alternate clinical scenarios. LIMEcraft successfully adjusts feature maps to push model decisions toward other valid diagnostic classes while maintaining interpretability.

Figure 10 shows the spatially resolved attention rollout produced by the ViT encoder. Warmer colors (red/yellow) indicate patches with higher attention weight, suggesting higher influence in model decisions. This provides transparent insight into how the Q-ViT model prioritizes lesion-specific patterns, corroborating the Grad-CAM and QGAN-generated outcomes.

Fig. 10. ViT patch-wise mean attention overlay highlighting discriminative skin regions in the context of Monkeypox classification.

Critical evaluation

Classification performance

Figure 11 presents the Receiver Operating Characteristic (ROC) curves for the four classes: Chickenpox, Measles, Monkeypox, and Normal. The Area Under the Curve (AUC) scores indicate strong discriminative performance across all classes, with Normal achieving the highest AUC of 0.96. Chickenpox and Measles each reach an AUC of about 0.91, while Monkeypox follows with 0.90.
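For a single class treated one-vs-rest, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, counting ties as one half; the labels and scores are made up for illustration and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """One-vs-rest AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# made-up scores for one class (1 = this class, 0 = any other class)
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
```

Repeating the computation with each class in turn as the positive class gives per-class AUC values of the kind reported for the four categories.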

Fig. 11. ROC Curve showing classification performance across four disease categories with high AUC values.

Confusion matrices

The confusion matrix in Fig. 12 shows class-wise predictions for disease classification. Most predictions are correctly mapped with minimal inter-class misclassification, indicating robust learning by the model.

Fig. 12. Disease classification confusion matrix (simulated). True labels are plotted against predicted categories.

Figure 13 illustrates the confusion matrix for stage prediction. A high concentration of values along the diagonal confirms the model’s capability in accurate prognostic stage inference across early, mid, and advanced levels.
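Overall accuracy and per-class precision and recall follow directly from a confusion matrix. The sketch below uses made-up counts for the three stages (rows are true labels, columns are predictions); the numbers are not taken from Fig. 12 or Fig. 13, and `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(matrix, names):
    """Return overall accuracy and per-class precision/recall from a confusion matrix."""
    total = sum(map(sum, matrix))
    accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
    report = {}
    for i, name in enumerate(names):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        report[name] = {"precision": tp / predicted, "recall": tp / actual}
    return accuracy, report


# made-up counts: rows = true stage, columns = predicted stage
stage_matrix = [[45, 4, 1],
                [5, 40, 5],
                [0, 6, 44]]

accuracy, report = per_class_metrics(stage_matrix, ["early", "mid", "advanced"])
```

Off-diagonal cells next to the diagonal (early versus mid, mid versus advanced) show whether errors fall between adjacent stages, which is the pattern one would expect for an ordinal staging task.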

Fig. 13. Stage classification confusion matrix (simulated) demonstrating effective progression recognition.

Qualitative visual validation

Figure 14 showcases 12 validation cases where the proposed model correctly identifies both disease type and stage. Each column displays the ground truth and corresponding predictions for progressive lesion samples across Monkeypox, Chickenpox, and Measles. The visual coherence confirms the model’s generalization capabilities on unseen data.

Fig. 14. Prediction results on unseen validation samples. Each row shows samples of Monkeypox, Chickenpox, and Measles with accurate disease and stage predictions.

Figure 15 demonstrates challenging scenarios where the model fails to differentiate Chickenpox from normal lesions. Although the model predicts the stage correctly, it misclassifies the disease due to visual similarities or ambiguous patterns. These failure cases help in understanding limitations and inform future refinement strategies.

Fig. 15. Failure cases where early-stage Chickenpox lesions are misclassified as Normal, indicating challenges in fine-grained visual separation.

Quantum feature visualization. To further elucidate the contribution of the quantum layer, we visualized the latent feature spaces of models with and without the quantum circuit using t-SNE. The quantum-enhanced embeddings exhibit improved inter-class separation and smoother cluster boundaries, indicating effective non-linear entanglement among latent features. This richer feature geometry supports improved data efficiency and convergence stability observed during training.

Case study

A focused progression analysis was conducted on a representative Monkeypox case. The early-stage lesion was accurately identified, followed by simulation of disease advancement using a trained QGAN module. The visualization offers lesion outcomes under untreated versus treated scenarios, emphasizing the model’s prognostic relevance, as shown in Fig. 5.

Statistical significance

To evaluate the robustness of improvements offered by the proposed hybrid quantum architecture, paired t-tests were conducted comparing accuracy metrics against baseline models. The observed performance enhancement achieved statistical significance with a p-value less than 0.05.

Computation efficiency and deployment feasibility

The proposed Hybrid Q-ViT model demonstrates practical viability for real-world clinical applications in terms of computational efficiency and resource utilization.

  • Inference Time: The average inference latency per image, including metadata processing and dual-head prediction, is approximately 82 ms on an NVIDIA Tesla P100 GPU.

  • Model Size Optimization: Compared to a classical ViT-Base model with approximately 86M parameters, the Hybrid Q-ViT architecture reduces the parameter count to 12.6M, achieving a 6.8× reduction in model complexity.

  • Quantum Layer Overhead: The PennyLane-based variational quantum circuit contributes a negligible overhead (<2 ms) due to its shallow design and limited qubit entanglement depth, optimized for near-term hardware simulation.

  • Device Suitability: The entire inference pipeline is deployable on modern edge devices or hospital servers equipped with standard CUDA-compatible GPUs and requires under 300 MB of memory footprint, making it feasible for integration into digital dermatology tools and mobile-based diagnostic systems.
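The figures in the list above can be cross-checked with simple arithmetic. The weight-storage estimate assumes 32-bit floating-point parameters, which is an assumption rather than a stated detail; activations and runtime overhead are not included in it.

```python
VIT_BASE_PARAMS_M = 86.0  # parameters (millions), as stated above
QMPT_PARAMS_M = 12.6      # parameters (millions), as stated above
LATENCY_MS = 82.0         # average per-image inference latency, as stated above

reduction_factor = VIT_BASE_PARAMS_M / QMPT_PARAMS_M  # about 6.8x fewer parameters
throughput_per_s = 1000.0 / LATENCY_MS                # about 12 images per second

# weight storage only, assuming 4 bytes per float32 parameter
fp32_weights_mb = QMPT_PARAMS_M * 1e6 * 4 / 1e6

print(round(reduction_factor, 1), round(throughput_per_s, 1), round(fp32_weights_mb, 1))
```

Under the float32 assumption the weights alone occupy roughly 50 MB, which leaves the remainder of the stated sub-300 MB footprint for activations, framework overhead, and the simulated quantum layer.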

This efficient design ensures that the model meets both accuracy and latency constraints required in clinical workflows, especially where real-time prognosis and decision support are critical.

Statistical significance analysis

To ensure that performance gains achieved by the proposed Q-MPT framework are statistically significant, paired t-tests were conducted. Accuracy results from five independent runs of each model variant were compared.

The p-values obtained in comparisons with ResNet50 and ViT-Base were both less than 0.05, indicating that the improvements are statistically significant with 95% confidence. Statistical significance between Q-MPT and baseline models is summarized in Table 8.

Table 8. Paired t-test results for statistical significance.

| Comparison | Mean accuracy difference (%) | p-value |
|---|---|---|
| Q-MPT vs ResNet50 | 7.11 | 0.008 |
| Q-MPT vs ViT-Base | 5.25 | 0.012 |

Runtime and statistical benchmarking. To assess practical deployment feasibility, we measured runtime speed and computational efficiency of Q-MPT relative to several baselines. Experiments were executed on a single NVIDIA A100 GPU with batch size = 16. Table 9 summarizes inference speed (frames per second), parameter count, and FLOPs, along with mean ± SD performance metrics over five runs. Q-MPT attains 27 FPS while maintaining 12.6 M parameters, indicating near real-time capability. Performance improvements are statistically significant (Inline graphic) compared with all baselines.

Table 9. Runtime and efficiency benchmarking of Q-MPT and baselines.

| Model | Params (M) | FLOPs (G) | FPS | p-value |
|---|---|---|---|---|
| ResNet50 | 23.5 | 4.1 | 42 | |
| EfficientNet-B0 | 23.3 | 3.8 | 39 | |
| ViT-Base | 86.0 | 17.4 | 18 | |
| Swin-V2-Tiny | 28.3 | 5.6 | 24 | |
| Q-MPT (proposed) | 12.6 | 3.2 | 27 | Inline graphic |

These results confirm that Q-MPT offers competitive efficiency and statistical robustness, supporting its viability for near real-time diagnostic assistance in dermatology workflows.

Clinical disclaimer and scope

The proposed Q-MPT framework demonstrates how multimodal integration and quantum enhancement can translate into actionable diagnostic insights. The Vision Transformer component provides global context awareness, while metadata fusion contributes patient-specific priors that align with real-world decision cues. The LSTM trajectory modeling offers temporal continuity, mimicking clinical follow-up patterns, and the quantum layer introduces a richer latent geometry that enhances class separability with minimal additional computational cost.

From a clinical standpoint, this hybrid system enables earlier detection of disease progression trends and more transparent reasoning through explainability overlays (Grad-CAM, Integrated Gradients, LIMEcraft). Its design aligns with prospective clinical decision-support scenarios where dermatologists could visualize lesion evolution and compare quantum-enhanced predictions with manual assessments. Although this work remains a proof-of-concept, the methodological clarity and transparency support its future translation into safe, clinician-assisted diagnostic tools. Practical implications include: (i) integration into dermatology triage systems for automatic risk scoring, (ii) potential extension to other imaging modalities such as ultrasound and infrared for cross-modal validation, and (iii) incorporation of clinician-feedback loops to refine stage prediction reliability. Future studies will involve dermatologist-validated staging and multi-center data collection to translate this conceptual framework into clinically verifiable workflows.

Limitations and future validation

While the proposed Q-MPT framework demonstrated competitive performance on both the curated internal dataset and publicly available dermatology datasets (HAM10000 [50], ISIC [53]), the metadata and stage labels used in this study were heuristically derived and not clinically validated. Consequently, the present work should be interpreted as a methodological feasibility study rather than clinical evidence. Future research will employ dermatologist-annotated, longitudinal cohorts and incorporate authentic metadata (e.g., demographics, clinical history, follow-up records) to strengthen generalizability and ensure medical reliability. Such validation will enable fair benchmarking against existing dermatology AI systems and support potential translational deployment. Despite the promising performance of the proposed Q-MPT framework, several limitations must be acknowledged to ensure transparency and guide future enhancements.

  • Simulated Metadata Constraints: The age and lesion location metadata used for stage prediction were synthetically generated due to the unavailability of real clinical annotations. While effective for early experimentation, this limits the framework’s immediate generalizability to actual patient populations.

  • Pseudo-Temporal Progression Modeling: The progression forecasting module relied on simulated temporal embeddings using LSTM over static latent vectors. Although this captures an abstract disease trajectory, the lack of true longitudinal imaging data reduces the realism of temporal disease evolution.

  • Synthetic QGAN Visualizations Without Clinical Review: While QGAN-generated images for hypothetical disease stages offer compelling visuals, these have not undergone verification or annotation by medical professionals. Consequently, their clinical interpretability and acceptance remain uncertain.

  • Quantum Layer Contribution Requires Further Analysis: Although the variational quantum circuit layer demonstrated performance gains in ablation studies, deeper analytical insight into its specific contribution, particularly in terms of feature disentanglement or class separability, remains underexplored.

Conclusion and future work

This study introduced a novel quantum-enhanced multimodal framework, Hybrid Q-ViT, for robust classification and prognostic modeling of skin disease progression. By integrating Vision Transformer-based image encoders with structured metadata (age, location) and incorporating a variational quantum circuit for feature entanglement, the architecture demonstrated significant improvements in both disease and stage prediction accuracy.

The dual-headed design allowed simultaneous classification of disease type and lesion severity, while explainability modules (e.g., ViT Attention Rollout, Integrated Gradients, VAE interpolation) ensured transparent decision-making. The counterfactual QGAN generator further enhanced model interpretability by visualizing future lesion states under different conditions, mimicking real-world progression and treatment impacts. Extensive evaluation across metrics, ablation studies, and ROC analysis validated the model’s predictive power and clinical relevance. While the findings demonstrate promising feasibility, the study remains a proof-of-concept: the results should be regarded as evidence of methodological feasibility rather than clinical readiness, and validation on dermatologist-annotated, longitudinal datasets is needed to establish clinical reliability and translational potential.

Future work: Future advancements to the proposed framework may focus on several promising directions. First, deploying the model in a federated learning setup across distributed clinical institutions would enable collaborative training while maintaining patient data privacy and compliance with health regulations. Second, real-time inference efficiency can be achieved through model pruning and quantization techniques, facilitating integration into mobile diagnostic tools or edge devices. Additionally, clinical applicability may be strengthened by validating the model longitudinally on real patient records, allowing the system to learn progression trends from temporal image sequences. Quantum feature scaling can also be explored using advanced backends like Qiskit or PennyLane-Lightning to reduce simulation overhead and improve computational tractability. Finally, generalization across modalities can be achieved by expanding the input domain to include dermatoscopic, thermal, or hyperspectral images along with textual metadata or patient-reported symptoms, enabling richer multimodal learning for diverse dermatological conditions.

In summary, this work lays the foundation for interpretable, prognostically aware, and quantum-enhanced diagnostic tools in dermatology and sets the stage for broader integration of hybrid AI models in clinical practice.

Acknowledgements

The authors would like to extend their sincere gratitude to the Management of Multimedia University, Melaka 75450, Malaysia, for sponsoring the page charges for this article.

Author contributions

Aravinda C. V. conceived the study, designed the model architecture, and performed the primary experiments. Joseph Emerson Raja contributed to methodology design, supervised data preprocessing, and assisted in validation and result interpretation. Sultan Alasmari provided clinical guidance, contributed to the interpretation of dermatological aspects, and reviewed the manuscript for accuracy and clarity. All authors discussed the results, contributed to the writing of the manuscript, and approved the final version.

Funding

The authors gratefully acknowledge the financial support provided by Multimedia University, Melaka 75450, Malaysia, which sponsored the article processing charges for this work.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

All preprocessing scripts, metadata-generation code, and trained model checkpoints will be released upon acceptance for academic research use.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

C. V Aravinda, Email: aravinda@mmu.edu.my.

Joseph Emerson Raja, Email: emerson.raja@mmu.edu.my.

References

  • 1.Benedetti, A. et al. A review of quantum machine learning applications in medicine. IEEE Access9, 37721–37739 (2021). [Google Scholar]
  • 2.Haenssle, H. et al. Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol.29(8), 1836–1842 (2018). [DOI] [PubMed] [Google Scholar]
  • 3.Esteva, N. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature542, 115–118 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Zhang, K. et al. LightViT: Lightweight vision transformer with enhanced local attention. IEEE Trans. Image Process.32, 1243–1256 (2023). [Google Scholar]
  • 5.Dosovitskiy, A. et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, International Conference on Learning Representations (ICLR), (2021).
  • 6.Li, L. et al., Applications of vision transformers in dermatology: A scoping review. Front. Med.10 (2023).
  • 7.Obermeyer, M. & Emanuel, E. J. Predicting the future-big data, machine learning, and clinical medicine. N. Engl. J. Med.375(13), 1216–1219 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Zheng, Q. et al. Accurate diagnosis and survival prediction of bladder cancer using deep learning on histological slides. Cancers14(23), 5807. 10.3390/cancers14235807 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Vale-Silva, L. A. & Rohr, K. Long-term cancer survival prediction using multimodal deep learning. Sci. Rep.11, 13505. 10.1038/s41598-021-92799-4 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Gessert, N., Nielsen, M., Shaikh, M., Werner, R. & Schlaefer, A. Skin lesion classification using ensembles of multi-resolution EfficientNets with meta data. MethodsX7, 100864. 10.1016/j.mex.2020.100864 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Park, S. et al. Attention-based models for skin lesion classification. Med. Image Anal.77, 102368 (2022).35063892 [Google Scholar]
  • 12.Biamonte, R. et al. Quantum machine learning. Nature549, 195–202 (2017). [DOI] [PubMed] [Google Scholar]
  • 13.Li, M. et al. Hybrid quantum-classical neural networks for image classification. Quant. Inform. Process. 21 (2022).
  • 14.Alwakeel, A. et al. Quantum computing in health: State of the art and future prospects. NPJ Digital Med. 6 (2023).
  • 15.Schuld, L. & Petruccione, F. Quantum Machine Learning: An Overview. (Springer Nature, 2022).
  • 16.Wei, L. et al. Quantum machine learning in medical image analysis: A survey. Neurocomputing525, 42–53 (2023). [Google Scholar]
  • 17. Ullah, U. & Garcia-Zapirain, B. Quantum machine learning revolution in healthcare: A systematic review of emerging perspectives and applications. IEEE Access 12, 11423–11450 (2024).
  • 18. Ajlouni, N., Özyavaş, A., Takaoğlu, M., Takaoğlu, F. & Ajlouni, F. Medical image diagnosis based on adaptive Hybrid Quantum CNN. BMC Med. Imaging 23, 126 (2023).
  • 19. Yan, F., Huang, H., Pedrycz, W. & Hirota, K. Review of medical image processing using quantum-enabled algorithms. Artif. Intell. Rev. 57, 300 (2024).
  • 20. Subbiyan, B. et al. A quantum-enhanced artificial neural network model for efficient medical image compression. IEEE Access 13, 31809–31828 (2025).
  • 21. Yang, G., Luo, S. & Greer, P. A novel vision transformer model for skin cancer classification. Neural Process. Lett. 55, 9335–9351 (2023).
  • 22. Oztel, G. Y. Vision transformer and CNN-based skin lesion analysis: classification of monkeypox. Multimedia Tools Appl. 83, 71909–71923 (2024).
  • 23. Zhang, X. et al. DermViT: Diagnosis-guided vision transformer for robust and efficient skin lesion classification. Bioengineering 12(4), 421 (2025).
  • 24. Dagnaw, G. H., El Mouhtadi, M. & Mustapha, M. Skin cancer classification using vision transformers and explainable artificial intelligence. J. Med. Artif. Intell. 7, 14 (2024).
  • 25. Abbas, Q., Daadaa, Y., Rashid, U. & Ibrahim, M. Assist-Dermo: A lightweight separable vision transformer model for multiclass skin lesion classification. Diagnostics 13(15), 2531 (2023).
  • 26. Smith, L. A. et al. Machine learning and deep learning predictive models for long-term prognosis in patients with chronic obstructive pulmonary disease: a systematic review and meta-analysis. Lancet Digital Health 5(12), e872–e881 (2023).
  • 27. Kinoshita, F. et al. Development of artificial intelligence prognostic model for surgically resected non-small cell lung cancer. Sci. Rep. 13, 15683 (2023).
  • 28. Shin, H. J., Lee, E. H., Han, K., Ryu, L. & Kim, E. K. Development of a new prognostic model to predict pneumonia outcome using AI-based chest radiograph results. Sci. Rep. 14, 14415 (2024).
  • 29. Kale, M. et al. AI-driven innovations in Alzheimer’s disease: Integrating early diagnosis, personalized treatment, and prognostic modelling. Ageing Res. Rev. 101, 102497 (2024).
  • 30. Pan, Q. & Tong, M. Artificial intelligence in predicting chronic kidney disease prognosis: A systematic review and meta-analysis. Renal Failure 46(2), 2435483 (2024).
  • 31. Cui, C. et al. Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: a review. Progress Biomed. Eng. 5(2), 022001 (2023).
  • 32. Warner, E. et al. Multimodal machine learning in image-based and clinical biomedicine: Survey and prospects. Int. J. Computer Vision 132, 3753–3769 (2024).
  • 33. Ou, C. et al. A deep learning based multimodal fusion model for skin lesion diagnosis using smartphone collected clinical images and metadata. Front. Surg. 9, 1029991 (2022).
  • 34. Mohsen, F. et al. Artificial intelligence-based methods for fusion of electronic health records and imaging data. Sci. Rep. 12, 17981 (2022).
  • 35. Aravinda, C. V., Joseph, E. R. & Alasmari, S. A Hybrid Quantum–Classical Approach for Multi-Class Skin Disease Classification Using a 4-Qubit Model, IEEE Access, Early Access. 10.1109/ACCESS.2025.3581030 (2025).
  • 36. Aravinda, C. V. et al. Optimized DenseNet121 and Quantum PennyLane Fusion for Explainable Skin Disease Classification, IEEE Access, Early Access. 10.1109/ACCESS.2025.3608217 (2025).
  • 37. Aravinda, C. V., K., B. S. & Pradeep, S. et al., Leveraging Compact Convolutional Transformers for Enhanced COVID-19 Detection in Chest X-Rays: A Grad-CAM Visualization Approach, Frontiers in Big Data, 10.3389/fdata.2024.1489020 (2024).
  • 38. Tejonidhi, M. R., Aravinda, C. V., Kumar, S. V. A. et al. Optimizing Group Activity Recognition With Actor Relation Graphs and GCN–LSTM Architectures, IEEE Access, Early Access. 10.1109/ACCESS.2025.3552668 (2025).
  • 39. Aravinda, C. V., Al-Shehari, T., Alsadhan, N. A. et al. A Novel Hybrid Architecture for Video Frame Prediction: Combining Convolutional LSTM and 3D CNN, Journal of Real-Time Image Processing, Early Access. 10.1007/s11554-025-01626-w (2025).
  • 40. Quan, H. et al. Global contrast-masked autoencoders are powerful pathological representation learners. Pattern Recognit. 156, 110745 (2024).
  • 41. Wang, J. et al. Pyramid-based self-supervised learning for histopathological image classification. Comput. Biol. Med. 165, 107336 (2023).
  • 42. Quan, H. et al. Dual-channel prototype network for few-shot pathology image classification. IEEE J. Biomed. Health Inform. 28(7), 4132–4144 (2024).
  • 43. Nan, T. et al. Deep learning quantifies pathologists’ visual patterns for whole-slide image diagnosis. Nat. Commun. 16, 5493 (2025).
  • 44. Kaur, H., Gupta, S. & Sood, M. A hybrid quantum-inspired approach for medical image classification. Signal Image Video Process. 10.1007/s11760-022-02325-w (2022).
  • 45. Dey, M. R., Banerjee, P. & Sarkar, S. Quantum-inspired convolutional models for biomedical image analysis. Multimedia Tools Appl. 10.1007/s11042-022-12242-2 (2022).
  • 46. Roy, S., Biswas, A. & Bhattacharya, D. Medical image feature extraction using quantum-inspired algorithms. Multimedia Tools Appl. 10.1007/s11042-019-07988-1 (2019).
  • 47. Saha, R., Chakraborty, A. & Pal, S. Quantum-inspired feature selection for image recognition. Multimedia Tools Appl. 10.1007/s11042-018-6267-z (2018).
  • 48. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 10.1109/CVPR.2016.90 (2016).
  • 49. Liu, Z., Hu, H., Lin, Y. et al., Swin Transformer V2: Scaling Up Capacity and Resolution, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12009–12019 (2022).
  • 50. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source Dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161. 10.1038/sdata.2018.161 (2018).
  • 51. Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S. & Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16133–16142 (2023).
  • 52. Kaggle, Monkeypox Skin Lesion Dataset, Available: https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset, Accessed: Jan. 2025.
  • 53. Codella, N., Rotemberg, V., Tschandl, P. et al., Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC), arXiv preprint, arXiv:1902.03368 (2019).
  • 54. DermNet NZ, DermNet Skin Disease Image Atlas, Available: https://dermnetnz.org, Accessed: Jan. 2025.
  • 55. Al-Waisy, A. S. et al. A deep learning framework for automated early diagnosis and classification of skin cancer lesions from dermoscopy images. Sci. Rep. 15(1), 15655. 10.1038/s41598-025-15655-9 (2025).
  • 56. Zhang, X. et al. DermViT: Diagnosis-guided vision transformer for robust and efficient skin lesion classification. Bioengineering 12(4), 421. 10.3390/bioengineering12040421 (2025).
  • 57. Zhang, J., Li, T., Lee, H. et al., 2DMamba: Efficient State Space Model for Image Representation with Applications to Whole-Slide Imaging, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. to appear (2025).
  • 58. Vieira, J., Mendonça, F. & Morgado-Dias, F. Deep learning approaches for skin lesion detection. Electronics 14(14), 2785 (2025).
  • 59. Tan, M. & Le, Q. V. EfficientNetV2: Smaller Models and Faster Training, in Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, vol. 139, pp. 10096–10106 (2021).
  • 60. Liu, Z., Hu, H., Lin, Y. et al., Swin Transformer V2: Scaling Up Capacity and Resolution, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12009–12019 (2022).

Associated Data

Data Availability Statement

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

All preprocessing scripts, metadata-generation code, and trained model checkpoints will be released upon acceptance for academic research use.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
