Abstract
Pneumonia is a severe respiratory infection that significantly contributes to global morbidity and mortality, particularly among children and the elderly. Early and accurate detection of pneumonia is crucial for timely intervention and effective treatment, reducing the risk of severe complications. Traditional diagnostic methods, such as radiographic examination of chest X-rays (CXRs) and computed tomography (CT) scans, rely on the expertise of radiologists, which can lead to subjectivity and variability in diagnosis. The rapid advancement of deep learning in medical imaging has opened new possibilities for automated pneumonia detection, enabling faster, more accurate, and scalable diagnostic solutions. Timely and accurate diagnosis of pneumonia remains an important health challenge. The proposed AI-driven framework integrates multiple imaging modalities, including CT scans and X-rays, together with other diagnostic data, thereby enhancing the robustness and accuracy of pneumonia diagnosis. Effective deep learning models are employed for multi-modal fusion, leveraging complementary information from the various imaging sources. An interpretable, timely, and accurate detection model is required to improve patients' diagnostic outcomes. Initially, pre-processing is performed to remove artifacts and noise from both the CT and X-ray image data. Relevant features are then extracted from these data using Swin Transformers (ST). Next, the complex interactions among the imaging modalities are fused using an Optimized Tensor Fusion Network (OTFN). Gradient-weighted Class Activation Mapping with Bayesian Neural Networks (Grad-CAM with BNN) is proposed for pre-emptive prediction of pneumonia. Risk is assessed using a risk scoring system that provides real-time alerts based on the predictive outputs, thereby enabling early intervention for pneumonia.
The framework is implemented on the Python platform, and the performance of the proposed work is evaluated against state-of-the-art studies.
Keywords: Pneumonia, Pre-emptive, Swin transformers, Optimized tensor fusion networks, Grad-CAM and BNN
Subject terms: Computational biology and bioinformatics, Diseases, Engineering, Health care, Mathematics and computing, Medical research
Introduction
Pneumonia is a severe respiratory infection caused by viral, bacterial, or fungal pathogens, leading to inflammation of the lung tissues and impaired gas exchange. It remains a major global health concern, particularly affecting children, the elderly, and individuals with compromised immunity1. Conventional diagnosis primarily relies on clinical examination and radiological assessment using chest X-rays (CXRs) or computed tomography (CT) scans, often supported by laboratory tests such as blood analysis and sputum cultures2. However, these diagnostic procedures are highly dependent on expert interpretation, making them susceptible to subjectivity, inter-observer variability, and delayed decision-making particularly in resource-limited settings. Consequently, there is a growing demand for automated, accurate, and scalable diagnostic frameworks capable of supporting clinicians in the early detection and management of pneumonia.
Pneumonia results from viral, bacterial, or fungal infection of the respiratory system, leading to inflammation and potential accumulation of pus-like fluid in the lungs3. Compared to viral pneumonia, which often resolves on its own, bacterial pneumonia tends to be more persistent and severe4. In some cases, both lungs may become infected, a condition referred to as bilateral or double pneumonia. To diagnose pneumonia, a healthcare professional conducts a physical examination, reviews the patient’s medical history, and may employ diagnostic procedures such as blood tests, sputum analysis, chest X-rays, or pulse oximetry.
Chest X-rays remain a simple, affordable, and widely used imaging method for pneumonia detection5. A radiologist or physician interprets CXR images to determine whether they appear normal or indicate abnormalities such as pneumonia, tuberculosis (TB), or lung tumors6. Pneumonia poses a particularly high risk to infants, elderly individuals, critically ill patients on ventilators, and those with chronic respiratory illnesses like asthma. The threat is further magnified in developing regions where poverty and inadequate healthcare infrastructure limit access to timely diagnosis and treatment. According to the World Health Organization (WHO), pneumonia and air-pollution-related respiratory diseases account for over four million deaths annually7. Each year, more than 150 million people, mainly children under the age of five, are diagnosed with pneumonia. Bacterial pneumonia, especially among children, can be severe, while influenza-related pneumonia typically causes milder symptoms. Fungal pneumonia occurs mostly in individuals with weakened immune systems.
CXRs are preferred over advanced imaging techniques such as CT and magnetic resonance imaging (MRI) due to their low cost and accessibility8. However, the increasing number of chest X-ray cases results in significant diagnostic burden, as each physician must interpret thousands of images annually. The shortage of medical professionals in both developed and developing countries compounds this challenge. Rapid and accurate detection is essential to reduce pneumonia-related mortality, especially among children and adolescents in low-resource settings9. Even experienced radiologists face limitations in interpreting CXRs accurately, as these images typically have lower sensitivity compared to CT or MRI scans.
Pneumonia primarily affects the alveolar air sacs, causing symptoms such as respiratory distress, persistent cough, fever, and, in severe cases, life-threatening complications10. Although effective treatments exist, delayed detection substantially increases the risk of mortality. Current deep learning-based diagnostic models often concentrate on symptomatic or advanced-stage pneumonia and rely exclusively on single imaging modalities, thereby limiting their generalizability and clinical dependability11. Moreover, these models frequently lack interpretability and uncertainty estimation, which hinders their integration into real-world clinical workflows. To address these limitations, this study introduces an interpretable multimodal deep learning framework that combines Gradient-weighted Class Activation Mapping (Grad-CAM) with Bayesian Neural Networks (BNNs) to facilitate pre-emptive pneumonia prediction and uncertainty-aware risk assessment.
Traditional models for pneumonia diagnosis often emphasize symptomatic or advanced stages, with limited interpretability and dependence on a single modality. While Grad-CAM and Bayesian Neural Networks have been independently explored in prior studies, the novelty of this work lies in their unified integration within an optimized multimodal fusion and risk-oriented decision framework. Specifically, the proposed approach uniquely combines Zebra Optimization–driven tensor fusion for high-order cross-modal feature interaction, uncertainty-aware Bayesian inference, and spatial explainability through Grad-CAM into a single pipeline. This coordinated design enables interpretable, uncertainty-aware, and clinically actionable pneumonia risk stratification rather than isolated classification or visualization. In particular, integrating Grad-CAM with a Bayesian Neural Network (BNN) enables the estimation of model confidence and predictive uncertainty, promoting transparent and trustworthy AI-driven medical analysis. The research contributions are summarized as follows:
Hybrid imaging pre-processing: Development of an efficient pre-processing pipeline for both X-ray and CT images by removing artifacts and noise, enhancing crucial pulmonary structures, and improving downstream learning efficiency.
Swin transformer for multi-modal feature extraction: Extraction of context-rich and hierarchical features from CT and X-ray images using the Swin Transformer to detect subtle early signs of pneumonia and strengthen model capability.
Multi-modal fusion: Integration of high-dimensional features from multiple modalities using an optimized tensor-based fusion model to preserve contextual and spatial correlations for effective pneumonia detection.
Grad-CAM with BNN: Implementation of Grad-CAM with BNN for pre-emptive pneumonia prediction, providing explainable visual insights into the regions and modalities most influential to the model’s predictions.
Risk scoring system: The Pneumonia Risk Scoring and Estimator (P-RiSE) integrates imaging-derived prediction probability, uncertainty estimation, and visual consistency scores to stratify patient risk levels and trigger early alerts. Integration of additional clinical metadata is envisioned as a future extension of the framework.
Existing multimodal pneumonia detection frameworks predominantly emphasize performance improvement through feature aggregation or ensemble learning. In contrast, the proposed framework advances beyond accuracy-centric objectives by integrating optimized feature interaction, uncertainty propagation, and visual consistency assessment into a unified risk estimation layer. Unlike prior uncertainty-aware models that report confidence scores in isolation, the proposed P-RiSE module translates probabilistic outputs and visual evidence into clinically interpretable risk alerts, positioning the framework as a decision-support system rather than a standalone classifier.
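As a concrete illustration of this risk-estimation layer, the sketch below shows how a P-RiSE-style score might combine prediction probability, uncertainty, and Grad-CAM visual-consistency into a graded alert. The weighting scheme, function names, and thresholds here are illustrative assumptions, not the framework's calibrated values.

```python
def prise_score(prob, uncertainty, consistency,
                weights=(0.5, 0.3, 0.2)):
    """Combine prediction probability, confidence (1 - uncertainty),
    and visual-consistency into a normalized [0, 1] risk score."""
    wp, wu, wc = weights
    score = wp * prob + wu * (1.0 - uncertainty) + wc * consistency
    return min(max(score, 0.0), 1.0)          # clip to [0, 1]

def risk_level(score, low=0.4, high=0.7):
    """Stratify the normalized score into low / moderate / high bands."""
    if score < low:
        return "low"
    if score < high:
        return "moderate"
    return "high"
```

Under this hypothetical weighting, a confident prediction (probability 0.9, uncertainty 0.1, consistency 0.8) falls into the "high" band and would trigger an early alert.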
The rest of the paper is organized as follows: Sect. 2 summarizes a comprehensive review of work related to pneumonia detection. Section 3 describes the design of the proposed work and its steps, and the simulation results are analyzed and discussed in Sect. 4. The paper is concluded in Sect. 5.
Literature review
Elderly people are among the specific categories for which pneumonia can be a fatal condition. Although research has effectively developed artificial intelligence-assisted diagnostic systems to identify pneumonia from chest X-ray images, these experiments were conducted without age-category segmentation and were directed towards the general population. Ren et al.12 developed CheXMed, a multimodal approach that combines medical records with image information to enhance pneumonia identification accuracy for elderly people. The model's reliability was assessed using F1-score, precision, efficacy, and recall. The suggested technique scored better than its predecessors on every assessment measure. However, few algorithms exist that can combine different kinds of variables.
Liu et al.13 developed a multi-input DenseNet fused with clinical features (MI-DenseCFNet) to construct a diagnostic framework distinguishing Aspergillus pneumonia (ASP) from Staphylococcus aureus pneumonia (SAP). A random forest dichotomous detection simulation was implemented to assess the substantial connection of each medical characteristic in identifying the two forms of pneumonia, improving productivity and medical precision in differentiating between ASP and SAP. Each patient's high-quality thoracic CT lung frames were retrieved from the picture archiving and communication system, and the associated clinical information was gathered for every patient. To provide healthcare professionals with an illustration, the developed random forest dichotomous diagnosis framework was evaluated on the 11 key medical requirements. However, the microorganisms that cause pneumonia remain hard and costly to determine.
Establishing the best therapy options requires accurate predictive forecasting to determine pneumonia clients’ hospitalization results. Xing et al.14 introduced multi-omics graph knowledge representation to improve the whole-system representation connections using a knowledge graph, non-imaging omics data, and CT scans. Longformer-based 3D deep learning and a multichannel pyramidal recursive MLP module were created for monitoring omics to retrieve depth signals in the lung frame and also collect radiomics traits in the mediastinal and lung frames. After feature testing, forecasting prognosis, and omics comparison were determined using the graph convolutional network (GCN) and identity fusion structure. The suggested model has excellent durability and outperforms prior deep and machine learning techniques, and single-type omics for validation. As a result, the suggested approach improves efficiency, enabling a thorough evaluation of the illness’s extent and prompt treatment for those at greater risk. Hence, the failure to account for complicated physical settings was a drawback.
A medical checkup and clinical imaging methods, including ultrasounds, lung biopsies, and chest X-rays, are usually used to identify pneumonia, a respiratory illness that can be severe. Ali et al.15 implemented an EfficientNetV2L technique that can improve diagnosis precision and support well-informed medical choices for people diagnosed with pneumonia using deep learning methods. The Adam optimizer successfully tuned the epochs for every approach. VGG16, Xception, CNN, InceptionResNetV2, EfficientNetV2L, and ResNet50 were the deep learning methods examined. The suggested approaches were reliable in detecting pneumonia and demonstrated the ability of deep learning algorithms to reliably identify and forecast pneumonia, offering significant assistance in making medical choices and enhancing patient care. However, insufficient or delayed treatment can have devastating repercussions for patients.
Globally, pneumonia is a severe, potentially fatal respiratory illness that largely impacts the lungs. Automatic detection methods based on computer recognition are utilized frequently in study fields such as healthcare imaging. Hasan et al.16 implemented a deep learning approach using a pneumonia detection technique to train and evaluate such systems. The study also highlights numerous effectiveness indicators developed by research in this area, along with ensemble learning and deep transfer learning approaches. The purpose was to help professionals choose the best and most efficient techniques for detecting pneumonia. However, a shortage of trained specialists can allow the illness to become severe.
Early identification may disrupt the transmission chain quickly, decreasing the impact of the illness on the community. COVID-19 has been a catastrophic outbreak that has caused severe and permanent harm to the internal organs. Sheikhi et al.17 modified a machine learning (ML) approach that uses chest X-ray images to classify patients into the desired categories of COVID-19, respiratory infections, and normal. For assessment, a gray-level co-occurrence matrix was generated to build a 4-feature vector used with the k-nearest-neighbors approach. The suggested approach offered rapidity, ease of use, and independence from significant computing resources. However, its efficiency still needed to be improved.
A respiratory infection that causes extreme inflammatory processes, pneumonia can be brought on by microbes, fungal organisms, or pathogens. Khattab et al.18 presented a deep learning (DL) technique that can identify pneumonia and COVID-19 in healthcare images, such as X-rays. Chest X-ray (CXR) databases were evaluated with neural network methods such as MobileNet, InceptionResNet V2, Xception, and Inception V3. To enhance the approach, Cross-Entropy (CE) and Focal Loss (FL) were used on balanced and imbalanced databases. The approach was demonstrated on various datasets to identify distinct pneumonia illnesses, but it sometimes produced false results.
Recent studies have demonstrated the effectiveness of hybrid vision transformers for pulmonary disease detection. Abdelhamid et al.19 proposed a hybrid vision transformer framework for pulmonary embolism detection using CT pulmonary angiogram scans, achieving improved diagnostic accuracy and robust feature representation through multimodal deep learning integration. This work further reinforces the relevance of transformer-based architectures for complex pulmonary imaging tasks and supports the methodological choices adopted in the proposed pneumonia detection framework.
Applications of data analysis and artificial intelligence (AI) have transformed the medical industry. Kumar et al.20 described a Convolutional Neural Network (CNN) method combined with an Enhanced Super-Resolution Generative Adversarial Network for rapid diagnosis and prediction of pneumonia. The CNN uses the high-quality images produced by the network for further classification. The work offers specific recommendations for scholars, essential for starting more studies in this unexplored area. Nevertheless, the absence of effective tests continues to plague physicians treating pneumonia.
The most common method for identifying the condition is chest X-ray imaging. As a result, digital pneumonia diagnosis is necessary to help physicians validate accuracy. El-Ghandour et al.21 implemented the XGBoost technique for classifying chest X-ray images for pneumonia. An ensemble strategy was implemented to deliver the final classification and to understand the basic structure of the characteristics gathered from the feature extractor. Visual interpretation was enabled, and a Bayesian approach was used to determine hyperparameters such as the learning rate and structural features. The suggested approach achieves better efficiency than the other models. However, protection was greatly impacted by the quality of identification.
Many individuals have been impacted by the devastating effects of the coronavirus's global expansion, making intervention imperative. To combat this transmissible illness, a large number of specialists have put in a great deal of effort to develop effective vaccinations. Singh et al.22 developed a deep learning method, Deep CP-CXR, for identifying those suffering from COVID-19 and pneumonia. Binary classification on chest X-ray images was implemented to differentiate between patients diagnosed with COVID-19 and healthy individuals. The suggested method increases the accuracy and rapidity of disease classification, enabling physicians to properly recognize and manage patients. However, it leaves extraneous data in the characteristic images intact.
Problem definition
Pneumonia remains a leading cause of mortality and morbidity worldwide. Accurate risk assessment and timely detection are critical for effective resource allocation and treatment. Deep learning methods applied to medical imaging for pneumonia diagnosis provide limited interpretability and often lack the uncertainty estimation that is critical to clinical decision making. Models that rely solely on CT or X-ray images fail to exploit the complementary information available across modalities, and only a few real-time systems provide risk scoring and alert mechanisms. The lack of visual explanation limits clinical trust and creates uncertainty. A better optimization algorithm is also required to improve generalization and robustness. To overcome these issues, this study introduces Grad-CAM with BNN for pre-emptive prediction and risk assessment of pneumonia.
Proposed methodology
This study proposes an AI-driven multimodal framework for pre-emptive pneumonia prediction and risk assessment by jointly analyzing chest CT and X-ray images. The framework is designed to integrate robust feature enhancement, hierarchical representation learning, optimized multimodal fusion, explainable prediction, uncertainty estimation, and clinical risk scoring within a unified pipeline.
Overall framework overview
Initially, chest CT and X-ray images are provided as multimodal inputs. These images are first subjected to Top-Hat Transformer (THT)–based preprocessing to suppress background noise and enhance salient pulmonary structures such as opacities and infiltrates. Subsequently, Swin Transformer–based feature extraction is employed to learn hierarchical and context-rich representations from each modality. The extracted features are then integrated using an Optimized Tensor Fusion Network (OTFN), which captures higher-order cross-modal interactions. The fused representations are analyzed using Gradient-weighted Class Activation Mapping (Grad-CAM) and Bayesian Neural Networks (BNNs) to enable interpretable and uncertainty-aware pre-emptive pneumonia prediction. Finally, the Pneumonia Risk Scoring and Estimator (P-RiSE) module aggregates prediction confidence, uncertainty, and visual consistency to generate patient-specific risk levels and early clinical alerts. Figure 1 presents the high-level workflow of the proposed framework.
Fig. 1.
High-level workflow of the proposed multimodal pneumonia prediction and risk assessment framework.
In this study, the CT and chest X-ray datasets originate from independent cohorts and do not contain patient-level pairing or shared patient identifiers. Consequently, no artificial CT–CXR pairs are created. Each modality is processed independently to extract deep feature representations, which are subsequently fused at the feature level using the Optimized Tensor Fusion Network. The proposed multimodal fusion strategy therefore operates at the representation level rather than the patient level, and no patient overlap exists between modalities.
To provide a more detailed understanding of the internal data flow, Fig. 2 illustrates the complete architecture with explicit tensor dimensions at each processing stage. This figure elaborates how CT and X-ray images are resized and pre-processed, followed by multi-stage feature extraction through the Swin Transformer hierarchy. The tensor dimensions are shown across all stages, including patch embedding, window-based self-attention blocks, and global average pooling. The modality-specific feature vectors are fused through tensor outer product operations in OTFN, followed by Zebra Optimization Algorithm (ZOA)–based dimensional refinement. The optimized fused features are then forwarded to the BNN with Monte-Carlo dropout for probabilistic prediction, while Grad-CAM generates spatial heat maps for visual interpretability. The outputs from these modules are finally combined within the P-RiSE block to compute a normalized pneumonia risk score categorized into low, moderate, or high risk.
Fig. 2.
Overall architecture of the proposed multimodal pneumonia prediction framework with tensor dimensions at each stage.
Together, these two Figs. 1 and 2 provide a top-down and bottom-up understanding of the proposed system: Fig. 1 conveys the conceptual workflow, while Fig. 2 details the technical architecture and dimensional transformations. The subsequent subsections describe each component of the framework in detail, beginning with image preprocessing.
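The Monte-Carlo dropout inference used by the BNN stage of Fig. 2 can be sketched in minimal NumPy form. The two-layer network, weight shapes, and dropout rate below are illustrative stand-ins for the actual trained model, intended only to show how repeated stochastic passes yield both a mean prediction and an uncertainty estimate.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.5, T=50, rng=None):
    """Monte-Carlo dropout: run T stochastic forward passes through a
    toy two-layer network and return (mean prediction, predictive std)."""
    rng = np.random.default_rng(rng)
    outs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)                # ReLU hidden layer
        mask = rng.random(h.shape) > p             # Bernoulli dropout mask
        h = h * mask / (1.0 - p)                   # inverted dropout scaling
        logit = h @ W2
        outs.append(1.0 / (1.0 + np.exp(-logit)))  # sigmoid probability
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.std(axis=0)
```

The mean over passes serves as the prediction probability, while the spread across passes is the uncertainty signal that the risk-scoring module consumes.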
Pre-processing
The Top-Hat Transformer employed in this study is a classical morphological image processing operation rather than a deep learning-based transformer architecture. It enhances small-scale bright structures by subtracting the morphologically opened image from the original input image. The term "transformer" here refers solely to image transformation and does not involve self-attention mechanisms or learnable parameters. This preprocessing step effectively highlights pneumonia-related opacities and infiltrates, improving downstream deep feature extraction. In this work, the proposed THT is used to highlight important signs of pneumonia, such as opacities or infiltrates in chest X-rays and CT images. It can be described as Eq. (1):
$$\mathrm{THT}(h) = h - (h \circ b) \tag{1}$$

Here, $h$ is the original grayscale image, $b$ is the structuring component, and $\circ$ denotes the morphological opening operation, i.e., erosion followed by dilation. The morphologically operated (opened) image is $h \circ b$. The term $h - (h \circ b)$ provides the residual image, which highlights the features that are smaller than the structuring component.
The parameters of the Top-Hat Transformer were selected empirically to enhance small-scale pulmonary abnormalities such as opacities and infiltrates while suppressing background noise. The structuring element size was tuned to balance feature enhancement and noise amplification. Similarly, Swin Transformer parameters, including patch size, window size, embedding dimension, and dropout rate, were chosen based on established best practices in medical image analysis and validated through preliminary experimentation to ensure optimal performance without excessive computational burden.
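A minimal NumPy sketch of the white top-hat operation of Eq. (1), using a flat square structuring element; the kernel size `k` and the helper names are illustrative choices, not the tuned parameters of the study.

```python
import numpy as np

def erode(img, k):
    """Grayscale erosion with a flat k x k structuring element."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def dilate(img, k):
    """Grayscale dilation with a flat k x k structuring element."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def top_hat(img, k=3):
    """White top-hat (Eq. 1): original minus morphological opening,
    keeping bright structures smaller than the structuring element."""
    opened = dilate(erode(img, k), k)   # opening = erosion then dilation
    return img - opened
```

Because the opening removes bright structures smaller than the kernel, the residual isolates exactly those small-scale bright features, which is the behavior exploited here to emphasize opacities and infiltrates.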
Swin transformer-based feature extraction
The Swin Transformer was selected for feature extraction due to its superior performance, computational efficiency, and suitability for high-resolution medical image analysis. Unlike conventional convolutional neural networks, Swin Transformer employs hierarchical feature learning through shifted window-based self-attention, enabling effective capture of both local and contextual patterns. Compared to global self-attention transformers, its localized attention mechanism significantly reduces computational complexity while preserving diagnostic-relevant spatial information, making it particularly effective for identifying subtle pulmonary abnormalities such as ground-glass opacities and consolidations in CT and X-ray images.
Swin Transformer-based feature extraction uses the hierarchical vision transformer structure and provides better performance in image processing tasks. It can effortlessly model the local and global context, and so it can establish strong performance during pneumonia detection23. Feature extraction is multiscale, using shifted windows for self-attention. To begin with, the images are converted to patch tokens. For that, the input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into non-overlapping patches of size $P \times P$, yielding the number of patch tokens shown in Eq. (2):

$$N = \frac{H}{P} \times \frac{W}{P} \tag{2}$$
The patches are flattened and projected into the embedding space through a linear layer. The next process is linear embedding, in which each flattened patch is mapped to a $D$-dimensional embedding, as shown in Eq. (3):

$$z_0 = \left[ x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E \right], \qquad E \in \mathbb{R}^{(P^2 C) \times D} \tag{3}$$
The extraction of local and global features describing the abnormalities and opacities is performed in the Swin Transformer blocks. The blocks alternate between window-based multi-head self-attention (W-MSA) and shifted window-based MSA (SW-MSA)24. Self-attention is estimated within local windows of size $M \times M$, which is more efficient than global self-attention, and is defined in Eq. (4):

$$\mathrm{Attention}(E, L, U) = \mathrm{SoftMax}\!\left(\frac{E L^{T}}{\sqrt{d}} + C\right) U \tag{4}$$
The query, key, and value matrices obtained by linear projection of the patch embeddings are $E$, $L$, and $U$, and $C$ is the bias factor related to the attention head and window position. To enable cross-window connections, the windows of successive Swin blocks are shifted by half the window size25, and standard window attention is then applied to the shifted partition. The down-sampling of the patches merges each 2 × 2 group of neighboring patches at every stage and can be defined as Eq. (5):

$$z_{\mathrm{merged}} = W_m \left[ z_{2i,2j};\; z_{2i+1,2j};\; z_{2i,2j+1};\; z_{2i+1,2j+1} \right] \tag{5}$$

where $W_m$ is a linear projection that reduces the concatenated $4D$-dimensional vector to $2D$ dimensions.
In this study, local self-attention was preferred over global self-attention due to its scalability and efficiency when processing high-resolution medical images. Global self-attention mechanisms incur quadratic computational complexity, which becomes prohibitive for large CT and X-ray inputs. In contrast, the shifted window-based local attention in Swin Transformer enables effective cross-window feature interaction while maintaining linear computational growth. Prior studies have demonstrated that this design achieves comparable or superior representational capability with significantly lower computational overhead, justifying its adoption in the proposed framework.
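The complexity argument above can be made concrete with a small back-of-the-envelope sketch. The functions and numbers are illustrative, assuming a 224 × 224 input, 4 × 4 patches, and 7 × 7 attention windows (typical Swin defaults, not values confirmed by this study).

```python
def num_patches(h, w, p):
    """Number of non-overlapping p x p patches (Eq. 2)."""
    return (h // p) * (w // p)

def attention_cost(n_tokens, window=None):
    """Count of pairwise token interactions per attention layer:
    quadratic for global MSA, but only window^2 pairs per local
    window for W-MSA (linear in the number of windows)."""
    if window is None:
        return n_tokens ** 2                      # global self-attention
    n_windows = n_tokens // (window * window)
    return n_windows * (window * window) ** 2     # window-based attention

n = num_patches(224, 224, 4)       # 56 x 56 = 3136 tokens
global_cost = attention_cost(n)    # ~9.8 million pairwise interactions
local_cost = attention_cost(n, 7)  # 64 windows x 49^2 pairs: ~64x cheaper
```

The roughly 64-fold reduction in pairwise interactions illustrates why shifted-window attention scales to high-resolution CT and X-ray inputs where global attention would be prohibitive.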
At each stage, the feature dimension is doubled while the spatial resolution is reduced by a factor of 2. The architecture used for the proposed Swin Transformer is visualized in Fig. 3.
Fig. 3.
Swin architecture for the extraction of features from the pre-processed pneumonia CT/X-ray images.
Feature fusion using OTFN
Following feature extraction, deep feature representations obtained from CT and chest X-ray images are integrated using the Optimized Tensor Fusion Network (OTFN). Let $z_{ct} \in \mathbb{R}^{d_1}$ and $z_{cxr} \in \mathbb{R}^{d_2}$ denote the deep feature vectors extracted from CT and chest X-ray images, respectively, using the Swin Transformer.
To capture higher-order cross-modal interactions between the two modalities, tensor fusion is performed through an outer product operation. Each modality is first augmented with a bias term, and the tensor fusion is defined as Eq. (6):
$$F = \begin{bmatrix} z_{ct} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_{cxr} \\ 1 \end{bmatrix} \tag{6}$$
This operation results in a $(d_1 + 1) \times (d_2 + 1)$ fused tensor, which explicitly models unimodal, bimodal, and bias-level interactions between CT and X-ray features. The outer product operator $\otimes$ captures all pairwise multiplicative correlations across modalities, enabling effective multimodal representation learning.
The resulting multimodal tensor representation can be expressed as Eq. (7):
$$Z_f = \mathrm{vec}(F) \in \mathbb{R}^{(d_1 + 1)(d_2 + 1)} \tag{7}$$

where $Z_f$ denotes the multimodal fused feature tensor. This formulation preserves both modality-specific and cross-modal feature interactions in a unified representation space.
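The bias-augmented outer-product fusion of Eq. (6) can be sketched directly in NumPy. The toy feature vectors below are illustrative, not real Swin Transformer outputs.

```python
import numpy as np

def tensor_fusion(z_ct, z_cxr):
    """Outer product of bias-augmented feature vectors (Eq. 6):
    returns a (d1 + 1) x (d2 + 1) tensor whose last row/column hold
    the unimodal terms and whose interior holds bimodal products."""
    a = np.concatenate([z_ct, [1.0]])    # append bias term to CT features
    b = np.concatenate([z_cxr, [1.0]])   # append bias term to CXR features
    return np.outer(a, b)

z_ct = np.array([0.2, 0.5])              # toy CT feature vector (d1 = 2)
z_cxr = np.array([0.7, 0.1, 0.3])        # toy CXR feature vector (d2 = 3)
F = tensor_fusion(z_ct, z_cxr)           # shape (3, 4)
```

The appended 1s are what make the fused tensor retain copies of each unimodal vector alongside the bimodal products, so no single-modality information is lost in fusion.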
To mitigate redundancy and reduce the high dimensionality introduced by tensor fusion, a Zebra Optimization Algorithm (ZOA) is employed. Inspired by zebra behavioral patterns, ZOA optimally learns weighting coefficients over the fused tensor to suppress irrelevant feature interactions while emphasizing diagnostically informative combinations. This optimization process produces a compact and discriminative fused representation, thereby enhancing fusion efficiency and downstream predictive performance.
The optimized tensor fusion output is subsequently forwarded to the classification network for pneumonia risk prediction. Through this integration of tensor fusion and metaheuristic optimization, OTFN effectively exploits complementary information from CT and X-ray modalities while maintaining robustness and interpretability.
Although tensor-based multimodal fusion enhances representational capacity, it also increases computational complexity due to high-dimensional feature interactions. The Zebra Optimization Algorithm (ZOA) was selected to address this challenge by efficiently optimizing fusion weights and reducing redundant feature representations. Unlike gradient-based optimization methods, ZOA is derivative-free and exhibits strong global search capability, making it suitable for complex, non-convex optimization problems inherent in tensor fusion. While ZOA introduces additional computational overhead during training, this cost is offset by faster convergence, improved feature selection, and enhanced predictive accuracy. Consequently, the OTFN + ZOA framework achieves a favourable trade-off between computational efficiency and diagnostic performance.
The interactions of all feature combinations result in a higher-order tensor. The ZOA proceeds through the following steps:

Population initialization: Each individual (zebra) in the population represents a candidate solution. The initial positions are generated randomly.

Fitness computation: Compute the fitness function using the selected fused features by training a prediction model.
The fitness function optimized by the Zebra Optimization Algorithm (ZOA) is defined to minimize the prediction error obtained from the fused feature representations. Let $y_i$ denote the ground-truth class label and $\hat{y}_i(w)$ represent the continuous prediction confidence produced using fused feature weights $w$. The fitness objective is formulated as Eq. (8):
$$\mathrm{Fitness}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i(w) \right)^2 \tag{8}$$
where $N$ is the total number of training samples. ZOA iteratively updates the weight vector $w$ to minimize this fitness function, thereby enhancing fusion effectiveness and overall predictive reliability.
- Update exploration and exploitation movements: Each zebra moves to a new feature subset either through predator-avoidance behaviour (exploration) or by following the herd toward the currently best-performing zebra (exploitation). The positions are updated by Eq. (9):

$$x_j^{t+1} = x_j^{t} + r \cdot \bigl(x_{\mathrm{best}}^{t} - I \cdot x_j^{t}\bigr) \tag{9}$$

where \(x_j^{t}\) is the position of the jth zebra at iteration \(t\), \(x_{\mathrm{best}}^{t}\) is the current best solution, \(r \in [0, 1]\) is a random step factor, and \(I \in \{1, 2\}\) is a random intensity factor.
- Optimizing feature weights: Real-valued weights are learned for each feature in the tensor; ZOA searches for the optimal weights that maximize prediction accuracy while reducing redundancy.
- Termination: After a fixed number of iterations (or upon convergence), the best zebra, i.e., the best weight or feature set, is selected.
- Output: The optimized and fused feature vector is given by Eq. (10):

$$FF = w^{*} \odot T \tag{10}$$

where \(FF\) represents the selected fused features, \(w^{*}\) the optimal weight vector returned by ZOA, and \(T\) the higher-order fused tensor. These optimized vectors are then applied to pre-emptive disease prediction.
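As a compact illustration, the ZOA fitness evaluation and position update described above can be sketched in Python, the paper's implementation platform. The linear-sigmoid surrogate standing in for the trained prediction model, and the greedy acceptance rule, are illustrative assumptions rather than the published configuration:

```python
import numpy as np

def fitness(weights, fused_features, labels):
    """Eq. (8): mean squared prediction error for a candidate weight vector.

    A hypothetical linear-sigmoid surrogate stands in for the trained
    prediction model operating on the weighted fused features.
    """
    scores = 1.0 / (1.0 + np.exp(-(fused_features * weights).sum(axis=1)))
    return float(np.mean((labels - scores) ** 2))

def zoa_optimize(fused_features, labels, n_zebras=20, n_iter=100, seed=0):
    """Minimal ZOA loop: initialize, score, move toward the best zebra.

    The position update follows the standard ZOA foraging rule with a
    random step factor r and intensity I in {1, 2}.
    """
    rng = np.random.default_rng(seed)
    dim = fused_features.shape[1]
    zebras = rng.random((n_zebras, dim))                 # population initialization
    for _ in range(n_iter):
        scores = [fitness(z, fused_features, labels) for z in zebras]
        best = zebras[int(np.argmin(scores))].copy()     # pioneer (best) zebra
        r = rng.random(zebras.shape)                     # step factors in [0, 1)
        I = rng.integers(1, 3, zebras.shape)             # intensity factor, 1 or 2
        candidates = zebras + r * (best - I * zebras)    # Eq. (9) update
        # Greedy selection: keep a move only if it improves the fitness.
        for j, cand in enumerate(candidates):
            if fitness(cand, fused_features, labels) < scores[j]:
                zebras[j] = cand
    scores = [fitness(z, fused_features, labels) for z in zebras]
    return zebras[int(np.argmin(scores))]                # optimal weights w*
```

The greedy acceptance step guarantees that each zebra's fitness never worsens across iterations.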
Pre-emptive pneumonia disease prediction
The Gradient-weighted Class Activation Mapping with Bayesian Neural Networks (Grad-CAM with BNN) is proposed for pre-emptive prediction of pneumonia disease and the details are described in the below sub-sections.
Gradient-weighted class activation mapping
To enhance model interpretability, representative Grad-CAM heat maps are presented for both CT and chest X-ray images. These visualizations highlight the lung regions that contribute most significantly to pneumonia prediction, demonstrating close alignment between model attention and clinically relevant pathological areas. The inclusion of Grad-CAM heat maps reinforces transparency and supports clinical trust in the proposed diagnostic framework.
In local interpretation, the class activation mapping model used is Gradient-weighted Class Activation Mapping (Grad-CAM). The interpretation results of the fused feature maps are combined, and the important regions are obtained as a heat map. To provide pre-emptive prediction, the feature maps of the last convolutional layer are selected as the source information26. Figure 4 shows the general model for Gradient-weighted Class Activation Mapping. The weight \(\alpha_i^{c}\) for class \(c\) is calculated by global-average-pooling the gradients of the class score \(y^{c}\) with respect to the ith feature map \(A^{i}\) of the convolutional layer, as shown in Eq. (11):

$$\alpha_i^{c} = \frac{1}{Z}\sum_{u}\sum_{v}\frac{\partial y^{c}}{\partial A_{uv}^{i}} \tag{11}$$

where \(Z\) is the number of spatial locations in the feature map.
Fig. 4.
The general model for gradient-weighted class activation mapping.
The saliency map \(L^{c}\) is obtained by applying a ReLU activation to the weighted combination of the convolutional feature maps, as shown in Eq. (12):

$$L^{c} = \mathrm{ReLU}\!\left(\sum_{i}\alpha_i^{c} A^{i}\right) \tag{12}$$

where the thermal map is normalized and up-sampled to obtain the final result. The feature-map output of each layer is obtained in this way, and the fused feature output \(FM\) is applied.
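Under the definitions of Eqs. (11) and (12), the heat-map computation can be sketched as follows; the feature maps and gradients passed in are placeholders for values produced by the fused network, and the up-sampling step is omitted:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM saliency from a convolutional layer (Eqs. 11-12).

    feature_maps: (C, H, W) activations A^i of the chosen layer.
    gradients:    (C, H, W) gradients of the class score w.r.t. A^i.
    """
    # Eq. (11): global-average-pool the gradients to get channel weights.
    weights = gradients.mean(axis=(1, 2))
    # Eq. (12): weighted sum of feature maps, then ReLU.
    cam = np.tensordot(weights, feature_maps, axes=1)
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] for heat-map display (up-sampling omitted here).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```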
Figure 5 shows representative Grad-CAM heat maps illustrating the model’s attention for pneumonia detection. Panels (i) and (v) show original chest X-ray images, while panels (ii) and (vi) present the corresponding Grad-CAM heat maps highlighting pneumonia-affected lung regions. Panels (iii) and (vii) depict axial CT slices, with panels (iv) and (viii) showing their respective Grad-CAM visualizations. The highlighted regions demonstrate strong spatial correspondence with clinically relevant pulmonary abnormalities, confirming the interpretability and localization capability of the proposed framework.
Fig. 5.
Representative Grad-CAM visualizations for pneumonia localization in chest X-ray and CT images.
To provide interpretation, the feature-map output of the final convolutional layer is selected as the source information27. The feature-map weights generate an intermediate thermal map \(H_j^{l}\), as shown in Eq. (13):

$$H_j^{l} = s\!\left(\mathrm{Up}\bigl(A_j^{l}\bigr)\right) \tag{13}$$

where, at the lth layer, the jth feature map is represented as \(A_j^{l}\), \(\mathrm{Up}(\cdot)\) denotes up-sampling to the input size, and \(s(\cdot)\) is a normalization function. Guided by this map, the perturbed feature vector \(X'\) is generated, as shown in Eq. (14):

$$X' = X \odot H_j^{l} + R \odot \bigl(1 - H_j^{l}\bigr) \tag{14}$$
where the background feature vectors are denoted \(R\). The thermal map is then obtained by random mask superposition28. The prediction probability \(PP\) defines the weight of each mask \(M_k\) so that the significant regions are retained, as shown in Eq. (15):

$$S = \frac{1}{K}\sum_{k=1}^{K} PP\bigl(X \odot M_k\bigr)\cdot M_k \tag{15}$$

The mask weights obtained from random sampling yield the thermal map, from which the pre-emptive pneumonia prediction probability weight is calculated. Because each mask occludes a small part of the image through its pixel values, regions whose occlusion has a greater impact on the model receive higher weights.
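The random-mask superposition described above resembles RISE-style saliency. A minimal sketch under that assumption follows; the `predict` callable is a placeholder for the pneumonia classifier's probability output, and masked-out pixels are zeroed as a simplification:

```python
import numpy as np

def mask_saliency(image, predict, n_masks=100, p_keep=0.5, rng=None):
    """RISE-style thermal map from random mask superposition.

    `predict` maps an image to a prediction probability PP in [0, 1].
    Each random binary mask is weighted by the probability the model
    assigns to the masked image, then the weighted masks are averaged.
    """
    rng = rng or np.random.default_rng(0)
    heat = np.zeros_like(image, dtype=float)
    for _ in range(n_masks):
        mask = (rng.random(image.shape) < p_keep).astype(float)
        heat += predict(image * mask) * mask   # PP-weighted mask superposition
    return heat / n_masks
```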
Bayesian Neural Networks
The Markov Chain Monte Carlo (MCMC) method is a standard sampling technique for Bayesian Neural Networks (BNNs). However, the non-linear activation functions in deep neural networks break the conjugacy between prior and posterior distributions. Although theoretically sound, the high computational cost of MCMC limits its practical applicability to deep and large-scale networks29. Consequently, despite providing a rigorous Bayesian foundation, MCMC-based inference becomes infeasible for complex multimodal architectures.
To address this limitation, Monte Carlo Dropout is employed as a practical and computationally efficient approximation to Bayesian inference30. In this approach, dropout layers remain active during inference, and T = 50 stochastic forward passes are performed to estimate the predictive mean and variance. This strategy enables effective uncertainty quantification while maintaining scalability for multimodal data. Within the proposed framework, the BNN acts as the final probabilistic classifier, while Grad-CAM complements it by providing spatial interpretability of the model’s predictions. This combination ensures both uncertainty-aware decision making and explainable outputs suitable for clinical analysis.
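A minimal numpy sketch of Monte Carlo Dropout inference for a single dense sigmoid head; the weights `W` and bias `b` are hypothetical trained parameters standing in for the classifier head:

```python
import numpy as np

def mc_dropout_predict(x, W, b, T=50, p_drop=0.2, rng=None):
    """Monte Carlo Dropout inference for a dense sigmoid layer.

    Dropout stays active at test time: T stochastic forward passes give
    a predictive mean and variance, approximating Bayesian inference.
    """
    rng = rng or np.random.default_rng(42)
    preds = []
    for _ in range(T):
        keep = rng.random(x.shape) >= p_drop       # Bernoulli dropout mask
        z = (x * keep / (1.0 - p_drop)) @ W + b    # inverted-dropout scaling
        preds.append(1.0 / (1.0 + np.exp(-z)))     # sigmoid confidence
    preds = np.array(preds)
    return preds.mean(axis=0), preds.var(axis=0)   # predictive mean, variance
```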
To formalize the Bayesian model, we define the network parameters (weights and biases) collectively as w. A common approach is to place a simple prior, such as a zero-mean Gaussian, on these parameters due to the lack of specific prior knowledge, as shown in Eq. (16):

$$p(w) = \mathcal{N}\!\left(w \mid 0, \sigma_w^{2} I\right) \tag{16}$$

where \(\sigma_w^{2}\) is a shared variance hyperparameter.
The likelihood function, which connects the network’s predictions to the observed data \(\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}\), is often modeled as a Gaussian distribution around the network output \(f(x_n; w)\), as shown in Eq. (17):

$$p(\mathcal{D} \mid w) = \prod_{n=1}^{N}\mathcal{N}\!\left(y_n \mid f(x_n; w), \sigma^{2}\right) \tag{17}$$

Here, \(\sigma^{2}\) represents the observational noise variance.
To complete the fully Bayesian model, we also place hyperpriors on the variance parameters \(\sigma_w^{2}\) and \(\sigma^{2}\). A common choice for scale parameters like these is the conjugate Inverse-Gamma distribution (or equivalently, a Gamma distribution on the precision). For example, the prior for \(\sigma_w^{2}\) can be specified as Eq. (18):

$$p\!\left(\sigma_w^{2}\right) = \mathrm{Inv\text{-}Gamma}\!\left(\alpha, \beta\right) \tag{18}$$

where \(\alpha\) and \(\beta\) are user-defined shape and rate constants. A similar prior can be placed on \(\sigma^{2}\).
The goal of Bayesian inference is then to compute the posterior distribution \(p(w \mid \mathcal{D})\), which is approximated in this work using the Monte Carlo Dropout method.
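The hierarchical prior of Eqs. (16) and (18) can be sampled ancestrally, drawing the variance from its Gamma-precision equivalent first; `alpha` and `beta` below are illustrative shape/rate constants, not values from the paper:

```python
import numpy as np

def sample_prior(dim, alpha=2.0, beta=1.0, rng=None):
    """Ancestral sample from the hierarchical prior of Eqs. (16) and (18).

    sigma_w^2 ~ Inverse-Gamma(alpha, beta), drawn via its Gamma-distributed
    precision, then w ~ N(0, sigma_w^2 I).
    """
    rng = rng or np.random.default_rng(0)
    precision = rng.gamma(shape=alpha, scale=1.0 / beta)  # 1 / sigma_w^2
    sigma2 = 1.0 / precision                              # Eq. (18) draw
    w = rng.normal(0.0, np.sqrt(sigma2), size=dim)        # Eq. (16) draw
    return w, sigma2
```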
Pre-emptive prediction of pneumonia using Grad-CAM with BNN
For timely medical intervention, early pneumonia risk identification referred to as pre-emptive pneumonia prediction aims to detect disease indicators before clinical symptoms become severe. In this study, the term pre-emptive prediction refers to the early identification of pneumonia-related radiological patterns observable in CT and chest X-ray images before the disease progresses to severe clinical manifestations. Although the employed datasets do not include longitudinal follow-up data or explicitly labeled sub-clinical cohorts, the proposed framework emphasizes detecting subtle imaging cues such as faint ground-glass opacities, early infiltrates, and localized texture variations that are commonly associated with the initial stages of pneumonia. By integrating uncertainty-aware Bayesian inference with Grad-CAM-based attention localization, the model highlights high-risk imaging patterns that may warrant early clinical attention rather than performing temporal disease progression forecasting.
Such early detection supports improved patient outcomes by enabling prioritization of high-risk cases and reducing the likelihood of hospitalization, sepsis, or respiratory complications. Furthermore, it allows hospitals to optimize resource allocation, preparing ICU support, oxygen supply, and bed availability in advance. From the optimized tensor fusion network (OTFN), the fused feature map (FM) is selected. For Grad-CAM generation, the class score is computed and passed to the shallow CNN31, as shown in Eq. (19).
$$y^{c} = f_{\mathrm{CNN}}\!\left(FM\right) \tag{19}$$

where \(y^{c}\) is the class score produced by the shallow CNN from the fused feature map \(FM\).
Based on the feature map, the Grad-CAM heat map \(L^{c}\) is computed from the gradients of the class score, as shown in Eq. (20):

$$L^{c} = \mathrm{ReLU}\!\left(\sum_{i}\alpha_i^{c}\,FM_i\right) \tag{20}$$
The attention regions highlighted by the heat map are overlaid on the reference image (CT or X-ray), and the fused features are used to perform uncertainty-aware classification32. The fully connected classifier of the standard CNN is replaced with a Bayesian (Monte Carlo Dropout) classifier, and multiple stochastic forward passes are performed, as shown in Eq. (21):

$$\hat{p} = \frac{1}{T}\sum_{t=1}^{T} f\bigl(FM; w_t\bigr), \qquad w_t \sim q(w) \tag{21}$$
The predictive mean and variance are computed, and the pneumonia class activations are attributed using the Grad-CAM-based BNN. Based on this model, the uncertainty, the probability outputs, and the final prediction (pneumonia present or not) are produced. The pre-emptive prediction of pneumonia disease using Grad-CAM with BNN is outlined in Fig. 6.
Fig. 6.
The pre-emptive prediction of pneumonia disease using Grad-CAM with BNN.
Risk score prediction
The Pneumonia Risk Scoring and Estimator (P-RiSE) is used to analyze the risk score of pneumonia disease. P-RiSE is a modular, deep learning-oriented risk evaluation model. Predictions from the patient’s medical imaging data feed a real-time alert model: for automated and clinician-facing health systems, the quantitative risk score information is combined and real-time alerts are triggered using specific thresholds. Based on the Grad-CAM-based BNN, abnormal regions are highlighted with a heat map, confidence is quantified, and pneumonia is predicted. The risk score calculation is shown in Eq. (22).
$$RS = w_1 \cdot P + w_2 \cdot \bigl(1 - U\bigr) + w_3 \cdot VCS \tag{22}$$

The predicted probability of pneumonia, the uncertainty in the prediction, and the visual consistency score are indicated as \(P\), \(U\), and \(VCS\), with \(w_1\), \(w_2\), and \(w_3\) the corresponding aggregation weights. For clinical action, real-time alerts are generated when the score exceeds the set threshold (0.80), in which case the patient requires urgent recommended actions.
The Visual Consistency Score \(VCS\) quantifies the spatial agreement between Grad-CAM highlighted regions and clinically relevant lung areas. It is computed as the normalized overlap between the Grad-CAM heatmap \(H\) and the lung region mask \(M\), defined as Eq. (23):

$$VCS = \frac{\sum_{u,v} H_{uv}\, M_{uv}}{\sum_{u,v} M_{uv}} \tag{23}$$

where higher values indicate stronger alignment between the model’s attention and pneumonia-affected lung regions, thereby reinforcing interpretability and clinical relevance.
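The two quantities above reduce to a few lines of Python. The aggregation weights in the risk-score sketch are illustrative assumptions, since the text specifies only that the three signals are normalized and aggregated:

```python
import numpy as np

def visual_consistency(heatmap, lung_mask):
    """Eq. (23): normalized overlap between heat map and lung mask."""
    overlap = float((heatmap * lung_mask).sum())
    return overlap / max(float(lung_mask.sum()), 1e-8)

def risk_score(prob, uncertainty, vcs, w=(0.5, 0.25, 0.25)):
    """Eq. (22): weighted aggregate of the three P-RiSE signals.

    The weights w are illustrative; low uncertainty should raise the
    score, so (1 - uncertainty) enters the aggregation.
    """
    return w[0] * prob + w[1] * (1.0 - uncertainty) + w[2] * vcs
```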
The pneumonia risk score is computed by integrating multimodal imaging features with probabilistic and interpretability-driven outputs. Features fused through the Optimized Tensor Fusion Network (OTFN) are input to a Bayesian Neural Network (BNN), which produces both predictive probability and epistemic uncertainty estimates. Concurrently, Grad-CAM generates spatial attention maps highlighting clinically relevant lung regions, from which a visual consistency score is derived. These three complementary signals prediction probability, uncertainty, and visual consistency are normalized and aggregated within the Pneumonia Risk Scoring and Estimator (P-RiSE) framework to yield a reliable and interpretable patient-specific risk score.
Figure 7 shows overview of the proposed uncertainty-aware pneumonia risk assessment framework. Multimodal chest X-ray and CT images are fused using OTFN and analyzed by a Bayesian Neural Network to obtain predictive probability and uncertainty estimates. Grad-CAM generates spatial explanations used to compute a visual consistency score. These outputs are integrated within the P-RiSE module to derive a final pneumonia risk score categorized into low, moderate, and high-risk levels Table 1.
Fig. 7.
Block diagram of the OTFN–BNN–Grad-CAM integrated framework for pneumonia risk stratification.
Table 1.
Representing the risk score and alerts estimation outputs
| Threshold based on risk score | Interpretation of risk |
|---|---|
| 0.0 to 0.5 | Lower |
| 0.5 to 0.8 | Moderate |
| Above 0.8 (>0.8) | Higher (triggering alerts) |
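The threshold bands of Table 1 map to a simple alerting rule:

```python
def risk_level(score):
    """Map a P-RiSE risk score to the alert bands of Table 1."""
    if score > 0.8:
        return "high"      # triggers a real-time alert
    if score > 0.5:
        return "moderate"
    return "low"
```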
The Pseudocode for proposed work is outlined in Algorithm 1.
Algorithm 1.
Pseudocode for proposed work.
Result and discussion
This section presents the results and comparisons for the model implementation. All experiments were implemented in Python on an AMD Ryzen 9 3900XT @ 4.7 GHz processor, TridentZ RGB 32 GB DDR4-3200 memory, and a Gigabyte AORUS RTX 2080 Ti 11 GB GPU. The following sub-sections present the detailed implementation and the series of experiments. The experimental CT and X-ray images were obtained from Kaggle (https://www.kaggle.com/datasets/anaselmasry/covid19normalpneumonia-ct-images and https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia). These datasets originate from independent sources and differ in spatial resolution, acquisition protocols, sample size, and patient populations. Figures 8 and 9 outline sample images from these datasets for CT and X-ray, respectively. The proposed parameter configuration is outlined in Table 2.
Fig. 8.
Sample dataset images from Kaggle based on CT scan images, (i–iii) normal images and (iv–vi) pneumonia diseased images.
Fig. 9.
Sample dataset images from Kaggle based on X-ray images, (i–iii) normal images and (iv–vi) pneumonia diseased images.
Table 2.
Proposed parameter configuration.
| Module | Parameter | Value |
|---|---|---|
| Swin transformer | Dropout rate | 0.2 |
| Window size | 7 | |
| Embedding dimension | 96 | |
| Input patch size | 4×4 | |
| OTFN | Population size of ZOA | 50 |
| Maximal iteration | 100 | |
| Batch size | 32 | |
| Fusion type | Tensor outer product | |
| Grad-CAM based BNN | Epochs | 100 |
| Posterior estimation | Bayes by back-propagation | |
| Prior distribution | Gaussian | |
| Learning rate | 0.0005 | |
The class distribution of the Kaggle datasets includes both normal and pneumonia samples for CT and chest X-ray modalities. Although a mild class imbalance exists, particularly in the chest X-ray dataset, batch-wise class balancing and extensive data augmentation were employed to mitigate its impact during training. Techniques such as SMOTE and focal loss were considered; however, they were not adopted to avoid introducing synthetic artifacts or altering clinically relevant imaging patterns. Instead, controlled augmentation strategies preserved anatomical consistency while ensuring robust model generalization across classes.
As the CT and chest X-ray datasets do not contain paired multimodal images from the same patients, the proposed multimodal fusion framework operates on unpaired data. Modality-specific features are independently extracted from CT and X-ray images and subsequently fused at the feature level using the Optimized Tensor Fusion Network (OTFN). This design effectively exploits the complementary strengths of both modalities without requiring patient-wise paired imaging samples. For each modality, the dataset was split at the image level using a stratified strategy into 70% training, 15% validation, and 15% testing subsets. Since the datasets do not include patient identifiers, patient-level splitting was not feasible; however, no image duplication exists across splits, thereby preventing data leakage. All experiments were conducted using a single fixed split. Although repeated runs and cross-validation were not performed in this study, future work will report mean and standard deviation across multiple runs to enhance statistical robustness.
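The stratified 70/15/15 image-level split can be sketched with numpy; this is a minimal version of the strategy described above, not the exact pipeline used in the experiments:

```python
import numpy as np

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Image-level stratified split into train/val/test index lists.

    Indices within each class are shuffled, then partitioned by ratio,
    so every subset preserves the class proportions and no image
    appears in more than one subset.
    """
    rng = np.random.default_rng(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_train = int(round(ratios[0] * len(idx)))
        n_val = int(round(ratios[1] * len(idx)))
        splits["train"].extend(idx[:n_train])
        splits["val"].extend(idx[n_train:n_train + n_val])
        splits["test"].extend(idx[n_train + n_val:])
    return splits
```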
Performance evaluation
The training of the model is essential to achieve accurate prediction of pneumonia and its associated risk factors. To this end, both the proposed and existing approaches, including CAM, CNN, BNN, and LSTM, were trained and evaluated. All baseline models, including CAM-based CNN, standard CNN, Bayesian Neural Network (BNN), and LSTM, were trained using identical preprocessing pipelines, data splits, and optimization settings to ensure a fair comparison. CNN-based baselines employ three convolutional layers with ReLU activation, the BNN shares the same architecture with Monte Carlo Dropout enabled, CAM utilizes class activation mapping for visualization, and the LSTM operates on flattened feature sequences. Learning rate, batch size, optimizer, and number of epochs were kept consistent across all models.
All models achieved stable performance after approximately the 80th epoch, with convergence observed by the 100th epoch. Based on these results, the bias factors and weight coefficients were optimized for all the techniques. The proposed model achieved a training accuracy of 0.956 with a minimal loss of 0.045, as illustrated in Fig. 10i,ii. The CAM method achieved an accuracy of 0.912 and a loss of 0.33 after 10 epochs, as shown in Fig. 11i,ii. The CNN approach obtained a training accuracy of 0.927 with a loss of 0.073, as presented in Fig. 12i,ii. The BNN model achieved a final accuracy of 0.92 with a loss of 0.08 at the 100th epoch (Fig. 13i,ii). Similarly, the LSTM method recorded a training accuracy of 0.88 and a loss of 0.12 (Fig. 14i,ii). Testing results further validate the robustness of the proposed approach. The testing accuracies for the proposed, CAM, CNN, BNN, and LSTM models are 0.95, 0.921, 0.90, 0.912, and 0.83, respectively (Figs. 10, 11, 12, 13 and 14). From both training and testing evaluations, it is evident that the proposed model consistently outperforms the existing methods, achieving higher accuracy and lower loss values.
Fig. 10.
Training and testing (i) accuracies and (ii) losses for the proposed work.
Fig. 11.
Training and testing (i) accuracies and (ii) losses for the CAM.
Fig. 12.
Training and testing (i) accuracies and (ii) losses for the CNN.
Fig. 13.
Training and testing (i) accuracies and (ii) losses for the BNN.
Fig. 14.
Training and testing (i) accuracies and (ii) losses for the LSTM.
Feature fusion analysis
The impact of the Zebra Optimization Algorithm (ZOA) on multimodal feature fusion is quantitatively evaluated by comparing the Optimized Tensor Fusion Network (OTFN) with conventional fusion strategies, including standard Tensor Fusion Network (TFN), Logistic Regression-based TFN (LRTFN), Early Fusion, and Late Fusion. The proposed ZOA-optimized fusion achieved a lower Mean Absolute Error (MAE) of 1.15 and Mean Squared Error (MSE) of 0.17, outperforming TFN (MAE: 1.27, MSE: 0.275), LRTFN (MAE: 1.50, MSE: 0.32), Late Fusion (MAE: 1.60, MSE: 0.42), and Early Fusion (MAE: 1.80, MSE: 0.55). These results demonstrate that ZOA effectively selects informative feature weights, reduces redundancy, and enhances overall predictive accuracy.
Although pneumonia detection is formulated as a classification task, Mean Absolute Error (MAE) and Mean Squared Error (MSE) are computed on the continuous prediction confidence scores produced by the feature fusion networks prior to threshold-based classification. These regression-based metrics quantify the stability and consistency of fused feature representations, enabling a fair comparison of fusion strategies independent of discrete class decision boundaries.
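Computing MAE and MSE on the continuous confidence scores, as described above, reduces to:

```python
import numpy as np

def fusion_errors(labels, confidences):
    """MAE and MSE between ground-truth labels and confidence scores."""
    err = np.asarray(labels, dtype=float) - np.asarray(confidences, dtype=float)
    return float(np.abs(err).mean()), float((err ** 2).mean())
```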
Feature fusion combines the various temporal and spatial features extracted from the images by the ST. For this purpose, the proposed work presents the OTFN, which is compared with existing techniques including the Tensor Fusion Network (TFN), Logistic Regression-based TFN (LRTFN), Early Fusion (EF), and Late Fusion (LF). The comparison parameters are Mean Absolute Error (MAE) and Mean Squared Error (MSE). The MAE of the proposed and existing works is plotted in Fig. 15. The proposed work shows the lowest MAE at 1.15, while TFN, LRTFN, LF, and EF show MAEs of 1.27, 1.50, 1.60, and 1.80, respectively.
Fig. 15.

MAE for the feature fusion.
The MSEs of the proposed and existing works, TFN, LRTFN, LF, and EF, are compared and plotted in Fig. 16. The MSE of the proposed work is lower, with the value of 0.17, and other techniques, TFN, LRTFN, LF, and EF, showed MSEs of 0.275, 0.32, 0.42, and 0.55, respectively. This shows that the feature fusion of the proposed work is better than the existing works.
Fig. 16.

MSE for the feature fusion.
Performance evaluation and comparative study for the pneumonia prediction
For the comparative study, the factors taken are the confusion matrix, accuracy, sensitivity, FPR, FNR, recall, precision, and F-score. The confusion matrices of proposed and other techniques, CAM, BNN, CNN, and LSTM, are shown in Fig. 17i–v. The proposed work effectively predicts pneumonia, and hence, the accuracy of the prediction output is higher.
Fig. 17.
Confusion matrices (i) proposed, (ii) BNN, (iii) CAM, (iv) CNN, and (v) LSTM.
The Fig. 18 presents the ROC–AUC analysis comparing the proposed Grad-CAM-assisted Bayesian Neural Network (BNN) with BNN, CAM, CNN, and LSTM models. The curves illustrate the relationship between True Positive Rate (TPR) and False Positive Rate (FPR) across varying decision thresholds, while the marked points denote representative operating thresholds used during evaluation.
Fig. 18.
ROC curves and corresponding AUC values for the proposed method and baseline models, demonstrating the superior discriminative performance of the proposed framework.
The proposed model achieves an AUC of 0.995, with its marked points concentrated near the top-left corner, indicating high sensitivity (TPR ≈ 0.90–0.98) at very low false positive rates (FPR < 0.02). A similar trend is observed for the BNN model (AUC = 0.995), demonstrating robust discrimination capability.
The CAM and LSTM models yield AUC values of 0.992, with operating points showing TPR values above 0.92 at FPR below 0.03, reflecting strong but slightly inferior performance compared to the proposed approach. In contrast, the CNN model attains a lower AUC of 0.974, with marked points positioned farther from the optimal region, achieving TPR ≈ 0.85–0.90 at higher FPR values (≈ 0.03–0.05).
Overall, the marked operating points clearly demonstrate that the proposed framework consistently operates closer to the ideal ROC region, offering superior discrimination, higher diagnostic reliability, and improved robustness for pneumonia prediction, thereby validating its effectiveness for clinical decision-support systems.
The FPR and FNR of the proposed and existing works, CAM, BNN, CNN, and LSTM, are visually interpreted in Fig. 19i,ii. Lower FPR and FNR indicate better prediction. The proposed method shows an FPR of 0.23 and an FNR of 0.37, while the other techniques, BNN, CAM, CNN, and LSTM, show FPRs of 0.04, 0.41, 0.09, and 0.175, and FNRs of 0.038, 0.042, 0.096, and 0.118, respectively. False Positive Rate (FPR) and False Negative Rate (FNR) values are computed directly from the confusion matrices using standard definitions, ensuring strict alignment between the values reported in Fig. 19 and the corresponding confusion matrices.
Fig. 19.
Comparative analysis of error rates for pneumonia prediction models (i) False Positive Rate (FPR) and (ii) False Negative Rate (FNR) evaluating the proposed Grad-CAM with Bayesian Neural Network (BNN) framework against existing deep learning techniques.
In Fig. 19i, the FPR values indicate how often non-pneumonia cases are incorrectly classified as pneumonia. The proposed method shows an FPR of ≈ 0.23, while BNN achieves the lowest FPR of ≈ 0.04. CAM records the highest FPR at ≈ 0.41, indicating more false alarms. CNN demonstrates a lower FPR of ≈ 0.09, and LSTM reports an intermediate FPR of ≈ 0.17.
In Fig. 19ii, the FNR values reflect the proportion of pneumonia cases that are incorrectly missed by the models. The proposed method exhibits a higher FNR of ≈ 0.37, whereas BNN and CAM achieve significantly lower FNRs of ≈ 0.04 and ≈ 0.045, respectively. CNN shows an FNR of ≈ 0.095, and LSTM records ≈ 0.12.
Overall, this comparison highlights the trade-off between false positives and false negatives across models. While CAM suffers from a high false positive rate, BNN consistently maintains low error rates. The proposed framework balances interpretability and predictive capability, aligning with the PDF’s emphasis on explainable and risk-aware pneumonia prediction.
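The standard definitions used for the error rates in Fig. 19 are:

```python
def error_rates(tp, fp, fn, tn):
    """FPR and FNR from confusion-matrix counts (standard definitions)."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # false alarms among negatives
    fnr = fn / (fn + tp) if (fn + tp) else 0.0   # missed cases among positives
    return fpr, fnr
```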
The pneumonia prediction is analysed using recall, precision, F-measure, specificity, and accuracy, as shown in Fig. 20. The proposed work shows the best values for all factors: 97%, 98%, 96%, 97.3%, and 96%, respectively, for recall, precision, F-measure, specificity, and accuracy. The other techniques showed lower values; for the BNN, the recall, precision, F-measure, specificity, and accuracy are 96%, 95%, 95.7%, 96%, and 95.8%, respectively. The remaining works also scored below the proposed work, confirming its effectiveness in predicting pneumonia from the CT and X-ray images.
Fig. 20.
Pneumonia prediction parameters and their comparison with existing works (i) recall, (ii) precision, (iii) F-score, (iv) specificity and (v) accuracy.
Prediction of risk factors
The prediction of risk factors from the prediction outcome is an important process for protecting patients from further harm. For this, existing works such as the Fuzzy rule, Deep Neural Network (DNN), Artificial Neural Network (ANN), and Feed-Forward Neural Network (FFNN) are compared with the proposed Pneumonia Risk Scoring and Estimator (P-RiSE). The risk prediction accuracy, recall, precision, and F-score are depicted in Fig. 21i–iv. The risk prediction accuracy of the proposed work is around 97%, and the recall, precision, and F-score are around 97.2%, 98%, and 97.6%, respectively. The Fuzzy rule, DNN, ANN, and FFNN show accuracies of 96.4%, 96%, 94%, and 93%, respectively, while their F-scores are 95%, 93%, 92%, and 90%, as mentioned in Fig. 21.
Fig. 21.
Pneumonia risk prediction parameters and their comparison with existing works (i) accuracy, (ii) recall, (iii) precision, and (iv) F-score.
Overall comparison with existing literatures
The state-of-the-art comparison of the proposed method with existing approaches is summarized in Table 3. To ensure a fair and valid comparison, only studies that used the same dataset were considered. Accordingly, methods such as E-COVIDNet33, Weighted Ensemble34, Deep DSR35, DL-CXR36, ResNet34-CAM37, EL-CNN38, and QCSA39 are included. Since all methods were evaluated on an identical dataset, the comparison is reliable and meaningful. The Matthews Correlation Coefficient (MCC) values reported in Table 3 are normalized to the standard range of [0, 1] and are expressed as decimal values rather than percentages. MCC provides a balanced performance measure by jointly considering true positives, true negatives, false positives, and false negatives, making it particularly suitable for evaluating models under class-imbalanced conditions. Higher MCC values indicate stronger agreement between the predicted outcomes and ground-truth labels, further confirming the robustness of the proposed method.
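The MCC rescaling to [0, 1] mentioned above, together with the geometric mean (GM) of sensitivity and specificity reported in Table 3, can be computed as follows; the (MCC + 1)/2 rescaling is one common normalization choice, assumed here:

```python
import math

def mcc_normalized(tp, fp, fn, tn):
    """Matthews Correlation Coefficient rescaled from [-1, 1] to [0, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (mcc + 1.0) / 2.0

def geometric_mean(tp, fp, fn, tn):
    """GM of sensitivity and specificity, as reported in Table 3."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sens * spec)
```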
Table 3.
State-of-art results vs proposed work.
| Authors | Biswas S et al.33 | Kundu, R., et al.34 | Peng, L., et al.35 | Hammoudi, K., et al.36 | Guo, K., et al.37 | Mabrouk, A., et al.38 | Singh, S., et al.39 | Proposed work |
|---|---|---|---|---|---|---|---|---|
| Methods | E-COVIDNet | Weighted ensemble | Deep DSR | DL-CXR | ResNet34-CAM | EL-CNN | QCSA | Grad-CAM based BNN |
| Dataset | Kaggle datasets | Kaggle datasets | Kaggle datasets | Kaggle datasets | Kaggle datasets | Kaggle datasets | Kaggle datasets | Kaggle datasets |
| Accuracy | 98.79% | 98.81% | 98.94% | 95.72% | 98.29% | 93.91% | 94.53% | 96% |
| Precision | 98.79% | 98.82% | 98.33% | – | – | 93.96% | 93.56% | 98% |
| Recall | 98.79% | 98.80% | 98.95% | – | 99.29% | 92.99% | 98.86% | 97% |
| F-score | 98.79% | 98.80% | 98.64% | – | – | 93.43% | 96.14% | 96% |
| MCC | 0.9803 | 0.9759 | 0.980 | 0.9783 | 0.9659 | 0.9362 | 0.9291 | 0.97 |
| GM | 0.982 | 0.9785 | 0.9895 | 0.9795 | 0.9741 | 0.94 | 0.9349 | 0.973 |
Discussion
This study highlights the value of integrating multimodal medical imaging for pneumonia prediction and risk assessment. By combining CT and X-ray data, the framework leverages complementary diagnostic information to provide a more comprehensive analysis. The use of transformer-based models for feature extraction and multimodal fusion ensures that both fine-grained and global patterns are effectively represented. Incorporating Bayesian neural networks with interpretability tools such as Grad-CAM emphasizes the importance of transparency, which is essential for clinical trust and adoption. Furthermore, the inclusion of risk score estimation extends the framework beyond detection, aligning it with real-world decision-making needs. Despite these strengths, future studies should validate the framework across diverse datasets and explore its adaptability to other respiratory diseases.
Limitations and future scope
This study, while demonstrating promising results in multimodal pneumonia prediction, presents several limitations. The dataset used was relatively limited in size and diversity, which may affect model generalizability. Data were collected from controlled imaging environments, potentially limiting representation of real-world clinical variability. The absence of multi-center validation restricts the broader applicability of the proposed framework. The computational complexity of the Optimized Tensor Fusion Network (OTFN) is relatively high, which may hinder real-time implementation in low-resource or clinical edge environments. Although Grad-CAM visualization provides interpretability, further improvement in explainability is required. The study utilized retrospective data without prospective clinical validation, so performance consistency across imaging devices and institutions remains to be established. Additionally, hyperparameter tuning and training costs are computationally demanding.
Future research will focus on addressing these limitations through several directions. Expanding the dataset to include multi-institutional and demographically diverse samples is essential. Incorporating other modalities such as MRI and ultrasound could enrich diagnostic robustness. Optimization of transformer architectures may enhance efficiency and reduce computational load. Explainable AI techniques will be integrated to improve model transparency for clinicians. Real-time deployment strategies will be explored for clinical decision support systems. Prospective validation studies will ensure reproducibility and clinical reliability. Integration with electronic health record data could enable holistic patient risk assessment. In future work, we plan to incorporate 5-fold or 10-fold cross-validation to further strengthen the reliability and generalizability of our research. Overall, future work will emphasize scalability, interpretability, and clinical translation.
Recent studies have demonstrated that hybrid and ensemble learning architectures can significantly enhance diagnostic robustness, generalization capability, and resilience to data heterogeneity in medical image analysis. In particular, ensemble strategies inspired by the Complex Extreme Learning Machine (CELM)40 and Parallel Extreme Learning Machine (PELM)41 frameworks have shown strong potential in improving classification stability by combining complementary decision boundaries and parallel learning mechanisms. CELM-based models leverage complex-valued feature representations to capture intricate non-linear relationships, while PELM architectures improve scalability and learning efficiency through parallelized hidden-layer processing. Motivated by these studies, future work will explore the integration of CELM- and PELM-inspired ensemble learning strategies with transformer-based and convolutional architectures to further strengthen multimodal pneumonia diagnosis. Such hybrid frameworks are expected to improve robustness, mitigate overfitting, and enhance long-range dependency modelling, thereby extending the proposed Swin Transformer–OTFN–BNN framework toward more resilient and scalable clinical decision-support systems.
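As a minimal illustration of the ensemble direction discussed above, soft voting averages the class-probability outputs of complementary models; the `soft_vote` helper and the three probability vectors below are hypothetical, not taken from CELM or PELM:

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model class-probability vectors (soft voting).

    prob_lists: one probability vector per base model, all over the same classes.
    Returns (predicted_class_index, fused_probability_vector).
    """
    n = len(prob_lists)
    weights = weights or [1.0 / n] * n  # default: equal weight per model
    num_classes = len(prob_lists[0])
    fused = [sum(w * p[c] for w, p in zip(weights, prob_lists))
             for c in range(num_classes)]
    return fused.index(max(fused)), fused

# Three hypothetical models' [normal, pneumonia] probabilities for one scan
pred, fused = soft_vote([[0.30, 0.70], [0.45, 0.55], [0.20, 0.80]])
```

Here the fused pneumonia probability exceeds the normal probability even though the individual models disagree in confidence, which is the stabilizing effect ensemble strategies aim for.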
Conclusion
This study presented a novel multimodal framework for the pre-emptive prediction and risk assessment of pneumonia by integrating Gradient-weighted Class Activation Mapping (Grad-CAM) with Bayesian Neural Networks (BNNs). The Top-Hat transform and Swin Transformer architecture were employed for preprocessing and feature extraction, respectively, from both CT and X-ray images. The extracted multimodal features were effectively fused using an Optimized Tensor Fusion Network (OTFN), while the Grad-CAM-based BNN facilitated interpretable disease prediction. Furthermore, the P-RiSE algorithm was utilized for risk score estimation, and the entire framework was implemented in Python. The proposed approach demonstrated superior performance compared to benchmark models. Specifically, it achieved lower Mean Absolute Error (MAE = 1.15) and Mean Squared Error (MSE = 0.17) values, along with a reduced False Positive Rate (FPR = 0.23) and False Negative Rate (FNR = 0.37). The recall, precision, F-measure, specificity, and accuracy were recorded as 96%, 95%, 95.7%, 96%, and 95.8%, respectively, outperforming existing models such as BNN, CAM, CNN, and LSTM.
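For reference, the reported classification metrics follow the standard confusion-matrix definitions; the sketch below shows one common way they are computed (the counts passed to `binary_metrics` are illustrative only, not the study's actual confusion matrix):

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                 # sensitivity / true positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)                    # false positive rate = 1 - specificity
    fnr = fn / (fn + tp)                    # false negative rate = 1 - recall
    return dict(recall=recall, precision=precision, f_measure=f_measure,
                specificity=specificity, accuracy=accuracy, fpr=fpr, fnr=fnr)

# Illustrative counts for a hypothetical 200-image test set
m = binary_metrics(tp=96, fp=5, fn=4, tn=95)
```

With these example counts, recall is 0.96, specificity 0.95, and accuracy 0.955, demonstrating how each reported percentage maps back to the four confusion-matrix cells.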
Author contributions
**Shaik Sikindar**: Conceptualization, Methodology, Data Curation, Formal Analysis, Investigation, Writing – Original Draft. **Ch V Raghavendran**: Validation, Visualization. **G Madhavi**: Conceptualization, Supervision.
Data availability
The datasets used in this study are publicly available on the Kaggle platform https://www.kaggle.com/datasets/anaselmasry/covid19normalpneumonia-ct-images and https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia.
Code availability
The implementation of AI-Driven Multimodal Imaging Fusion Using Swin Transformer and Optimized Tensor Fusion Networks for Pneumonia Detection is publicly accessible. The complete codebase (image preprocessing, feature extraction using the Swin Transformer, the multimodal tensor fusion module, model training, and supporting scripts) is available on GitHub at: https://github.com/shaik5651/AI-Driven-Multimodal-Imaging-Fusion-Using-Optimized-Tensor-Fusion-Networks-for-Pneumonia-Detection-.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Lee, K. Y. Pneumonia, acute respiratory distress syndrome, and early immune-modulator therapy. Int. J. Mol. Sci. 18(2), 388 (2017).
- 2. Nachiappan, A. C. et al. Pulmonary tuberculosis: role of radiology in diagnosis and management. Radiographics 37(1), 52–72 (2017).
- 3. Noterman, H. & Nurmio, A. Infectious Diseases in Children and Treatment at Home: Guidebook for Parents (2016).
- 4. Mani, C. S. Acute pneumonia and its complications. In Principles and Practice of Pediatric Infectious Diseases 238 (2017).
- 5. Jaiswal, A. K. et al. Identifying pneumonia in chest X-rays: a deep learning approach. Measurement 145, 511–518 (2019).
- 6. Reed, J. C. Chest Radiology: Patterns and Differential Diagnoses (Elsevier Health Sciences, 2017).
- 7. World Health Organization. WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide (World Health Organization, 2021).
- 8. Muller, N. L. Computed tomography and magnetic resonance imaging: past, present and future. Eur. Respir. J. 19(35 Suppl.), 3s–12s (2002).
- 9. Aston, S. J. Pneumonia in the developing world: characteristic features and approach to management. Respirology 22(7), 1276–1287 (2017).
- 10. Mc Chlery, S., Ramage, G. & Bagg, J. Respiratory tract infections and pneumonia. Periodontol. 2000 49(1), 151 (2008).
- 11. Gandhi, Z. et al. Artificial intelligence and lung cancer: impact on improving patient outcomes. Cancers 15(21), 5236 (2023).
- 12. Ren, H. et al. CheXMed: a multimodal learning algorithm for pneumonia detection in the elderly. Inf. Sci. 654, 119854 (2024).
- 13. Liu, T. et al. MI-DenseCFNet: deep learning–based multimodal diagnosis models for Aureus and Aspergillus pneumonia. Eur. Radiol. 34(8), 5066–5076 (2024).
- 14. Xing, W. et al. Multi-omics graph knowledge representation for pneumonia prognostic prediction. IEEE J. Biomed. Health Inform. (2024).
- 15. Ali, M. et al. Pneumonia detection using chest radiographs with novel EfficientNetV2L model. IEEE Access 10.1109/access.2024.3372588 (2024).
- 16. Hasan, M. R., Ullah, S. M. A. & Islam, S. M. R. Recent advancement of deep learning techniques for pneumonia prediction from chest X-ray image. Med. Rep. 10.1016/j.hmedic.2024.100106 (2024).
- 17. Sheikhi, F. et al. Automatic detection of COVID-19 and pneumonia from chest X-ray images using texture features. J. Supercomput. 79(18), 21449–21473 (2023).
- 18. Khattab, R., Abdelmaksoud, I. R. & Abdelrazek, S. Automated detection of COVID-19 and pneumonia diseases using data mining and transfer learning algorithms with focal loss from chest X-ray images. Appl. Soft Comput. 162, 111806 (2024).
- 19. Abdelhamid, A., El-Ghamry, A., Abdelhay, E. H., Abo-Zahhad, M. M. & Moustafa, H. E. D. Improved pulmonary embolism detection in CT pulmonary angiogram scans with hybrid vision transformers and deep learning techniques. Sci. Rep. 15(1), 31443 (2025).
- 20. Kumar, T. & Ujjwal, R. L. Enhanced super-resolution generative adversarial network augmented convolution neural network for pneumonia prognosis in India: promising health policy implications. Int. J. Syst. Assur. Eng. Manag. 1–13 (2025).
- 21. El-Ghandour, M. & Obayya, M. I. Pneumonia detection in chest X-ray images using an optimized ensemble with XGBoost classifier. Multimed. Tools Appl. 84(9), 5491–5521 (2025).
- 22. Singh, K., Gaur, A., Kumar, S., Shastri, S. & Mansotra, V. Deep CP-CXR: a deep learning model for classification of COVID-19 and pneumonia disease using chest X-ray images. Ann. Data Sci. 1–24 (2025).
- 23. Ullah, Z. & Kim, J. Anatomically accurate cardiac segmentation using Dense Associative Networks. Eng. Appl. Artif. Intell. 162, 112742 (2025).
- 24. Kumar, A., Yadav, S. P. & Kumar, A. An improved feature extraction algorithm for robust Swin Transformer model in high-dimensional medical image analysis. Comput. Biol. Med. 188, 109822 (2025).
- 25. Liang, J. et al. SwinIR: image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision 1833–1844 (2021).
- 26. Basith, M. Z. A. & Ibrahim, M. S. Gradient-weighted class activation mapping based deep transfer learning for glaucoma disease prediction. (2025).
- 27. Vinogradova, K., Dibrov, A. & Myers, G. Towards interpretable semantic segmentation via gradient-weighted class activation mapping (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence 34(10), 13943–13944 (2020).
- 28. Ullah, Z., Hong, M., Mahmood, T. & Kim, J. Systematic integration of attention modules into CNNs for accurate and generalizable medical image classification. Mathematics 13(22), 3728. 10.3390/math13223728 (2025).
- 29. Chandra, R. & He, Y. Bayesian neural networks for stock price forecasting before and during COVID-19 pandemic. PLoS ONE 16(7), e0253217 (2021).
- 30. Sachdeva, G., Singh, P. & Kaur, P. Plant leaf disease classification using deep convolutional neural network with Bayesian learning. Mater. Today Proc. 45, 5584–5590 (2021).
- 31. Lahmiri, S. Integrating convolutional neural networks, kNN, and Bayesian optimization for efficient diagnosis of Alzheimer's disease in magnetic resonance images. Biomed. Signal Process. Control 80, 104375 (2023).
- 32. Ahmad, Ullah, Z. & Gwak, J. Multi-teacher cross-modal distillation with cooperative deep supervision fusion learning for unimodal segmentation. Knowl. Based Syst. 297, 111854 (2024).
- 33. Biswas, S. et al. Prediction of COVID-19 from chest CT images using an ensemble of deep learning models. Appl. Sci. 11(15), 7004. 10.3390/app11157004 (2021).
- 34. Kundu, R., Das, R., Geem, Z. W., Han, G. T. & Sarkar, R. Pneumonia detection in chest X-ray images using an ensemble of deep learning models. PLoS ONE 16(9), e0256630. 10.1371/journal.pone.0256630 (2021).
- 35. Peng, L. et al. Analysis of CT scan images for COVID-19 pneumonia based on a deep ensemble framework with DenseNet, Swin Transformer, and RegNet. Front. Microbiol. 13, 995323. 10.3389/fmicb.2022.995323 (2022).
- 36. Hammoudi, K. et al. Deep learning on chest X-ray images to detect and evaluate pneumonia cases at the era of COVID-19. J. Med. Syst. 45, 75. 10.1007/s10916-021-01745-4 (2021).
- 37. Guo, K. et al. Diagnosis and detection of pneumonia using weak-label based on X-ray images: a multicenter study. BMC Med. Imaging 10.1186/s12880-023-01174-4 (2023).
- 38. Mabrouk, A., Díaz Redondo, R. P., Dahou, A., Abd Elaziz, M. & Kayed, M. Pneumonia detection on chest X-ray images using ensemble of deep convolutional neural networks. Appl. Sci. 12(13), 6448. 10.3390/app12136448 (2022).
- 39. Singh, S., Kumar, M., Kumar, A., Verma, B. K. & Shitharth, S. Pneumonia detection with QCSA network on chest X-ray. Sci. Rep. 13, 9025. 10.1038/s41598-023-35922-x (2023).
- 40. Yanar, E., Hardalaç, F. & Ayturan, K. CELM: an ensemble deep learning model for early cardiomegaly diagnosis in chest radiography. Diagnostics 15(13), 1602 (2025).
- 41. Yanar, E., Hardalaç, F. & Ayturan, K. PELM: a deep learning model for early detection of pneumonia in chest radiography. Appl. Sci. 15(12), 6487 (2025).