Brain Informatics. 2026 Feb 2;13(1):5. doi: 10.1186/s40708-025-00291-w

Multimodal fusion and explainability of artificial intelligence models in Alzheimer’s Disease detection

Vimbi Viswan 1,#, Noushath Shaffi 2,#, E Malathy 3, G Chemmalar Selvi 3, B R Kavitha 3, Abdelhamid Abdesselam 4, Shuqiang Wang 5, Ponnuthurai N Suganthan 2, Ibrahim Al Shezawi 6, Mufti Mahmud 7,8,9,
PMCID: PMC12876535  PMID: 41629566

Abstract

The integration of multimodal data has emerged as a powerful strategy for enhancing the accuracy and interpretability of artificial intelligence (AI) models in the diagnosis and prognosis of Alzheimer’s Disease (AD). This systematic review presents a comprehensive synthesis of recent advances in AI-driven multimodal fusion approaches for AD prediction. A detailed examination of widely used datasets—including their modalities, preprocessing pipelines, and accessibility—is provided to aid reproducibility and methodological transparency. We analyze and categorize the various data harmonization and preprocessing techniques employed across neuroimaging (e.g., fMRI, sMRI, PET), electrophysiological (EEG), and genomic modalities, highlighting domain-specific practices and challenges. Furthermore, fusion strategies are classified into data-level, feature-level, decision-level, and temporal (early, intermediate, and late) paradigms, offering insights into their implementation and diagnostic impact. The review also investigates the adoption of explainable AI (XAI) techniques across studies and identifies a significant underrepresentation of works that simultaneously emphasize multimodality, explainability, and methodological rigor. By adhering to both PRISMA and Kitchenham’s guidelines, this review ensures transparency and replicability in evidence synthesis. Compared to existing reviews, our work uniquely focuses on the intersection of multimodal integration and explainability within a systematically validated framework. The review concludes with recommendations for future research aimed at developing robust, interpretable, and clinically relevant AI models for AD.

Keywords: Alzheimer’s Disease, Multimodal data fusion, Artificial intelligence, Machine learning, Deep learning, Mild cognitive impairment, Explainable AI

Introduction

Neurological disorders (NLDs) encompass a range of conditions that affect the central and peripheral nervous systems, leading to various cognitive, motor, and behavioural dysfunctions [1]. Among the most prevalent NLDs is Alzheimer’s Disease (AD), a progressive neurodegenerative disorder primarily characterized by memory impairment, cognitive decline, and disorientation [2, 3]. It involves the accumulation of amyloid-beta plaques and tau tangles in the brain, leading to neuronal damage and the eventual loss of cognitive function [4].

AD has far-reaching impacts at multiple levels: economic, personal, and societal. The economic burden is substantial, covering healthcare costs, caregiving expenses, and lost productivity [2]. As the prevalence of AD continues to increase with aging populations, the financial strain on healthcare systems becomes increasingly significant. Beyond economics, individuals with AD often face reduced quality of life, increased dependency on caregivers, and social isolation. Families and communities also face the emotional and practical challenges of supporting those with the disease. Figure 1 presents a world heat map illustrating the average annual percentage change in death rates (per 100,000 population) from 1990 to 2021 across all ages and both sexes. Regions with positive values indicate a worsening mortality trend, while negative values suggest improvements over time. This spatial perspective highlights substantial geographic disparities in health outcomes, underscoring the need for targeted, inclusive, and data-driven interventions, particularly in regions where mortality trends are deteriorating.

Fig. 1.

Fig. 1

Global average annual percentage change in death rates per 100,000 population. Source – https://vizhub.healthdata.org/

Early detection of AD is crucial for initiating timely interventions and improving clinical outcomes. Detecting the disease in its prodromal form offers the potential to delay progression, manage symptoms more effectively, and plan appropriate care. However, traditional diagnostic methods are based on clinical symptoms that often appear in later stages, resulting in delayed intervention and compromised quality of life. These limitations underscore the need for innovative, efficient, and non-invasive diagnostic techniques for early-stage detection.

Artificial Intelligence (AI) has emerged as a promising tool for early disease detection across various medical domains [5, 6]. Its capabilities include enhanced diagnostics, personalized treatment, complex dataset management, and the identification of subtle patterns that may elude clinical observation [6]. AI-based models have shown promise in differentiating between healthy individuals and those with AD with high accuracy, sometimes outperforming traditional diagnostic methods [5].

Recent diagnostic advances, particularly in medical imaging, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), molecular analysis (biomarkers, genetic testing), and clinical records, offer detailed and diverse views of patient health. By combining these modalities, researchers can form a comprehensive understanding of the disease. This multimodal approach improves diagnosis, supports clinical decision making, and offers valuable insights into the mechanisms underlying AD. By reducing subjective variability and improving interpretability, AI-based systems can improve the reliability of disease detection.

Based on the comparative analysis of existing systematic reviews, it becomes evident that while several review articles have contributed valuable insights into AD detection using AI, they often fall short of concurrently addressing three crucial dimensions: exclusive focus on multimodal data, explicit integration of explainable AI (XAI) methodologies, and adherence to both PRISMA and Kitchenham’s guidelines for systematic literature reviews. For example, a study by Vimbi et al. [7] thoroughly explores XAI approaches such as LIME and SHAP, but does not restrict its analysis to multimodal data alone. Similarly, another review article by Vimbi et al. [8] emphasizes XAI and complies with SLR protocols, but includes studies that rely on single modalities for interpretation. Reviews such as [9] and [10] focus on specific data types (e.g., EEG or 3D imaging), excluding multimodal integration. Also, a recent study [11] explored multimodal data fusion but did not focus on XAI, instead limiting itself to inherently interpretable methods.

This systematic literature review (SLR) distinguishes itself from existing literature by uniquely fulfilling three important criteria: i) synthesizing studies that exclusively deal with multimodal data, ii) comprehensively investigating XAI components, and iii) rigorously adhering to both PRISMA and Kitchenham’s protocols. A comparative analysis of six prominent systematic reviews, as summarized in Table 1, reveals that none simultaneously fulfills all three criteria. Reviews focusing on XAI often include single-modality studies, those addressing multimodal data frequently lack an exclusive focus or a dedicated XAI component, and adherence to both PRISMA and Kitchenham is not consistently observed. Consequently, this review fills an important gap by offering a methodologically sound and focused analysis of interpretability in AI models that exclusively use multimodal data for AD. Serving as a unique and valuable resource, the study offers insight into the challenges and solutions related to explainability in this specific context. Complementing this, Fig. 2 encapsulates the thematic breadth and key contributions of this study. It visually outlines the comprehensive coverage of the review, from basic data modalities and fusion techniques to interpretability frameworks and clinical relevance, highlighting the integrative and forward-looking nature of our approach. Together, the table and figure reinforce the distinct contribution this review makes to the evolving landscape of AI-driven multimodal research in AD.

Table 1.

Comparative Analysis of Existing Systematic Reviews on Multimodal Data, Explainable AI, and Methodological Adherence

Refs Exclusive multimodal data focus Explainable AI Focus PRISMA Kitchenham
[7] No (includes single modality) Yes (exclusive XAI)
[9] No (EEG only) No
[8] No (explanations from single modality) Yes (exclusive XAI)
[10] No (3D imaging only) No
[12] No (includes single modality) No
[11] No (specific multimodal, not general exclusive) No (XAI not exclusive)
Our work Exclusive Multimodal XAI Focussed

Fig. 2.

Fig. 2

Overview of the key contributions and thematic coverage of the proposed systematic literature review on AI-driven multimodal approaches for Alzheimer’s Disease detection

The main contributions of this review are:

  1. Adherence to established SLR protocols such as PRISMA and Kitchenham guidelines.

  2. Well-defined research questions covering the breadth of multimodal data in AI-based AD detection.

  3. Comprehensive discussion of multimodal fusion strategies, benefits, and challenges reported over the past decade.

  4. Exploration of XAI techniques and their potential to enhance the transparency and trustworthiness of AI systems.

  5. Critical synthesis of theoretical foundations and recent advancements in multimodal data fusion, including research directions, and future opportunities in the field.

Multimodal studies typically integrate neuroimaging (e.g., sMRI, fMRI, PET), electrophysiological signals (e.g., EEG), genetic markers, and clinical or demographic variables to improve prediction robustness. Figure 3 provides a high-level overview of the commonly used modalities and their complementary roles in AD research. Readers seeking deeper background material may refer to prior surveys [7, 8, 13, 14].

Fig. 3.

Fig. 3

Commonly used data modalities in multimodal AD research, including structural MRI (sMRI), functional MRI (fMRI), PET, EEG, genetic sequencing, and clinical/demographic information. These modalities are often integrated to enhance diagnostic performance and complement one another in capturing different aspects of AD pathology

The remainder of this paper is structured as follows: Sect. 2 describes the search methodology guided by PRISMA. Section 3 presents insights derived from answering the defined research questions. Finally, Sect. 4 concludes the paper.

Search strategy

This section describes the steps to search and identify relevant articles for a systematic review. Figure 4 illustrates the stages involved. This review aims to explore research articles that employ a combination of multiple types of data to diagnose or detect AD. Published articles on AI and its associated fields are analyzed to identify contributions and summarize findings. The main objective of this review is to pinpoint areas where more research is needed, particularly in the use of multimodal fusion for AD detection.

Fig. 4.

Fig. 4

PRISMA-based workflow illustrating the stages of the systematic search, including article identification, screening, eligibility assessment, and final inclusion

To conduct a thorough systematic review, we followed the concrete guidelines proposed by Kitchenham [15] and PRISMA [16]. We started by formulating research questions (RQ) and creating search strings. Then, we identified digital libraries and conducted searches based on relevant inclusion and exclusion criteria. Next, we filtered the articles according to their relevance to the study topics and extracted necessary information from the selected articles. Finally, we produced results that allow for further critical analysis, enabling us to study the existing techniques and their benefits, future needs, and limitations. Through this process, we were able to investigate the research questions and perform a thorough systematic review.

Formulating research questions

The purpose of creating RQs is to establish a clear strategy for gathering papers that directly relate to specific topics of interest. In doing so, the reader can better understand the information presented. The RQs addressed in this article are listed in Table 2.

Table 2.

Research Questions and Their Motivations

S.No Research Questions Motivation
RQ1 What multimodal data combinations are used in the detection of AD? To understand how complementary modalities improve prediction and characterization
RQ2 What datasets and preprocessing methods are used for multimodal data in Alzheimer’s Disease research? To understand which preprocessing methods are essential for ensuring data compatibility, reproducibility, and model performance
RQ3 What methods are used for multimodal data fusion in Alzheimer’s Disease detection? Reviewing fusion methods enables assessment of their effectiveness and suitability for AD diagnosis
RQ4 What different AI models are available for AD diagnosis that incorporate multimodal data? To identify best practices and approaches for effectively learning from complex multimodal representations
RQ5 Which XAI frameworks are used to interpret AI models for Alzheimer’s Disease using multimodal data? To explore how XAI techniques enhance trust, transparency, and model validation in a multimodal setting
RQ6 What are the performance trends of AI models in multimodal Alzheimer’s Disease diagnosis across different datasets, fusion strategies, and classifiers? To assess the extent to which multimodal fusion and interpretability improve diagnostic performance over traditional methods
RQ7 What are the limitations, challenges, and future scope of using multimodal data and XAI methods for AD diagnosis? To provide direction for future research and help overcome real-world barriers to adoption and generalization

Searching articles

When conducting a thorough and comprehensive systematic review, one of the most challenging tasks is choosing the correct search terms. To ensure accuracy, the search strings were carefully selected so that they were neither too general, which could result in irrelevant articles, nor too specific, which could result in missing essential articles [16]. After multiple attempts to combine and reorganize relevant keywords, we ultimately arrived at the search strings in Table 3. We selected five databases (refer to Table 4) that are known for their comprehensive coverage of relevant research. These databases offer diverse search functions and access to high-quality studies, ensuring a thorough coverage of the literature relevant to our research topic.

Table 3.

The Search Strings

S.No Search Strings
1 XAI AND ML AND Multimodal OR fusion data AND AD
2 XAI AND DL AND Multimodal OR fusion data AND AD
3 Explainable AND ML AND Multimodal OR fusion data AND AD
4 Explainable AND DL AND Multimodal OR fusion data AND AD
5 Interpretable AND ML AND Multimodal OR fusion data AND AD
6 Interpretable AND DL AND Multimodal OR fusion data AND AD
7 Explainable ML AND Multimodal OR fusion data AND AD
8 Explainable DL AND Multimodal OR fusion data AND AD
9 Explainable AI AND Multimodal OR fusion data AND AD
10 Interpretable AI AND ML AND Multimodal OR fusion data AND AD
11 Interpretable AI AND DL AND Multimodal OR fusion data AND AD
12 Interpretable AI AND Multimodal OR fusion data AND AD
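The strings in Table 3 combine a small set of keyword families (explainability terms, model terms, the multimodal/fusion clause, and the disease term). A minimal sketch of how such a list can be generated programmatically, purely as our own illustration and not the authors' actual procedure:

```python
from itertools import product

# Keyword families behind (a subset of) the Table 3 search strings.
# The grouping is illustrative, not the authors' documented workflow.
explainability_terms = ["XAI", "Explainable", "Interpretable"]
model_terms = ["ML", "DL"]
data_clause = "Multimodal OR fusion data"
disease = "AD"

def build_search_strings():
    """Combine the keyword families into boolean query strings."""
    return [
        f"{x} AND {m} AND {data_clause} AND {disease}"
        for x, m in product(explainability_terms, model_terms)
    ]

queries = build_search_strings()
for q in queries:
    print(q)
```

Generating queries this way makes it easy to audit that every family combination is covered and that no duplicate strings slip into the protocol.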

Table 4.

The Databases Considered

S.No Database
1 IEEE Xplore (www.ieee.org)
2 ScienceDirect (www.sciencedirect.com)
3 Springer (www.springer.com)
4 PubMed (www.pubmed.ncbi.nlm.nih.gov/)
5 MDPI (www.mdpi.com)

Screening articles

After compiling the results of each individual search, we found a total of 732 publication records (IEEE=191, Springer=138, PubMed=262, ScienceDirect=124, MDPI=17). We included all research articles from the past decade until June 2025. We then removed all duplicate records and any articles published before 2014.

In the ensuing task, we used the inclusion–exclusion criteria listed in Table 5 to evaluate the identified articles. First, we went through each database search collection and identified and labeled all duplicate records. Once this was done for each collection, we combined all records into a single collection and eliminated all duplicates. This left us with a total of 179 unique records. We then scrutinized the titles and abstracts of each publication and excluded pilot studies, editorials, non-journal articles, conference proceedings, books, posters, and studies published before 2014. After this process, the number of articles was reduced to 77 eligible records. The entire process is illustrated in Fig. 4.

Table 5.

Inclusion–Exclusion Criteria

Inclusion Criteria Exclusion Criteria
Studies related to AD diagnosis using multimodal data and AI techniques Pilot papers, editorials, proceedings, magazines
Studies related to Explainable AI for AD prediction Articles not related to AI-based AD disease diagnosis
Studies related to performance results of ML/DL models with fusion data for AD

Furthermore, we excluded records that were not accessible, studies that only discussed multimodal fusion without providing evidence of performance and results, and studies that did not relate to our research questions. After screening, we identified 54 credible research articles from reputed journals that aligned with our RQs. The process is illustrated in Fig. 4 and Fig. 5, which show the different stages of the SLR following PRISMA guidelines. Furthermore, Fig. 6 displays statistics on the articles considered, organized by year. These numbers show that there is limited scientific research on AD that utilizes multimodal fusion data and that this research has only experienced significant growth in recent years. It should be noted that this review is distinct in its dedicated focus on multimodal fusion data for AD.
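The deduplication and eligibility steps described above can be sketched as a small filter over bibliographic records. The records and field names below are hypothetical, used only to make the screening logic concrete:

```python
# Sketch of the deduplication and screening steps (records are invented;
# the actual review screened 732 database exports down to 179, then 77).
records = [
    {"title": "Multimodal Fusion for AD", "year": 2021, "type": "journal"},
    {"title": "Multimodal fusion for AD ", "year": 2021, "type": "journal"},  # duplicate title
    {"title": "EEG Pilot Study", "year": 2013, "type": "journal"},            # published before 2014
    {"title": "XAI for AD Imaging", "year": 2023, "type": "conference"},      # excluded record type
    {"title": "Explainable AD Diagnosis", "year": 2019, "type": "journal"},
]

def screen(records, min_year=2014):
    """Drop duplicate titles, then keep in-scope journal articles."""
    seen, kept = set(), []
    for r in records:
        key = r["title"].strip().lower()  # normalize the title for deduplication
        if key in seen:
            continue
        seen.add(key)
        if r["year"] >= min_year and r["type"] == "journal":
            kept.append(r)
    return kept

eligible = screen(records)
print(len(eligible))  # 2 unique, in-scope journal articles survive
```

In practice each exclusion (duplicate, pre-2014, non-journal) would be counted separately so the PRISMA flowchart totals can be reported.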

Fig. 5.

Fig. 5

PRISMA flowchart showing the number of records identified, screened, excluded, and finally selected for inclusion in the systematic review

Fig. 6.

Fig. 6

Number of papers published during 2012-2025 (partial) for AD Detection using multimodal data with XAI

Data extraction and assessing quality

Following the screening process, we systematically extracted information from studies that satisfied the inclusion criteria. Authors independently collected essential details from each article, such as types of multimodal data, dataset characteristics, preprocessing steps, fusion strategies, AI models, XAI techniques, evaluation metrics, and other principal findings. Any differences in interpretation were discussed and clarified so that the extracted information remained consistent and accurate.

Existing standardized quality-assessment tools were not employed, as they are designed for clinical or epidemiological studies and do not capture preprocessing, model reproducibility, or fusion-based AI pipelines [17]. Therefore, we extracted indicators including clarity of preprocessing steps, completeness of model descriptions, justification for fusion strategies, and reporting of explainability methods. By assessing these aspects, we were able to identify the strengths and limitations of the included studies within the context of multimodal AI and XAI research. This qualitative assessment also highlights where future studies could strengthen their methodological robustness.
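The four indicators above can be operationalized as a simple per-study profile. The indicator names and the unweighted scoring below are our own illustration, not a validated instrument:

```python
# Illustrative scoring of the qualitative quality indicators described above.
# Indicator names and the equal weighting are hypothetical choices.
INDICATORS = [
    "preprocessing_clarity",
    "model_description_completeness",
    "fusion_strategy_justification",
    "explainability_reporting",
]

def quality_profile(study):
    """Return the fraction of quality indicators a study satisfies."""
    met = sum(bool(study.get(k, False)) for k in INDICATORS)
    return met / len(INDICATORS)

example = {
    "preprocessing_clarity": True,
    "model_description_completeness": True,
    "fusion_strategy_justification": False,
    "explainability_reporting": True,
}
print(quality_profile(example))  # 0.75
```

Such a profile is only a coarse summary; in the review itself the indicators were discussed qualitatively rather than aggregated into a single score.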

Results and discussion

Multimodal data combinations used in the AI based detection of Alzheimer’s Disease

This section explores RQ1: What multimodal data combinations are used in the detection of AD?

Multimodal data (MMD) integration has gained significant momentum in AD research due to its potential to offer a more holistic and robust representation of disease pathology. Unlike unimodal approaches that may capture only isolated aspects of neurodegeneration, multimodal combinations leverage complementary information across structural, functional, molecular, and behavioral domains, thereby enhancing predictive power and interpretability.

Based on our SLR, we identified and categorized 15 unique multimodal configurations (MMD 1–15) that have been employed in state-of-the-art AI-based AD detection studies. These combinations, summarized in Table 6 and Fig. 7, reveal how the field has evolved from simple bimodal pairings to complex, high-dimensional integrations.

  • Neuroimaging-Centric Integrations: Neuroimaging modalities, particularly sMRI and PET, form the foundation of many studies. MMD 3 (MRI + PET) is among the most prevalent, appearing in over ten reviewed works [21–32]. These combinations allow for the simultaneous assessment of anatomical atrophy and metabolic activity. Other notable imaging fusions include MMD 10 (MRI features + PET features) and MMD 12 (sMRI + fMRI), reflecting efforts to combine spatial and temporal aspects of brain function.

  • Integration of Imaging with Clinical and Cognitive Assessments: Several configurations (e.g., MMD 4, MMD 6, MMD 7) incorporate demographic, cognitive, and neuropsychological data alongside imaging. These additional data modalities complement imaging findings, allowing a more complete understanding of disease progression. For example, MMD 4 integrates MRI, PET, demographic information, and cognitive assessments [36], thus improving early detection and supporting more accurate differential diagnosis.

  • Fusion of Genetic, Molecular, and Laboratory Biomarkers: Recent studies have begun to integrate genotypic and molecular markers such as apolipoprotein E (APOE) status, Cerebrospinal Fluid (CSF) biomarkers, and Diffusion Tensor Imaging (DTI). These are evident in MMD 5 and MMD 6, where data from multiple biological levels are combined, including PET, MRI, CSF, DTI, and genetic profiles, for a comprehensive disease model [40, 41, 43]. This integration has shown promise in identifying prodromal stages of AD and stratifying risk in asymptomatic populations.

  • Incorporation of Sensor-Based, EEG, and Behavioral Data: Noninvasive and ambulatory data sources have also gained traction. MMD 1 uses EEG features for neurophysiological monitoring [18], while MMD 11 fuses EEG with CT and MRI for combined structural-functional evaluation [65]. These modalities are particularly valuable in real-time or low-resource diagnostic contexts.

  • Use of Electronic Health Record (EHR) -Linked Information: Some studies, such as MMD 13 and MMD 14, explore the integration of structured clinical records, comorbidity data, and medication history. These configurations recognize the multifactorial nature of AD and seek to model disease risk in ecologically valid healthcare settings [68, 69].

A comprehensive analysis of the literature reveals a progressive shift from simple bimodal integrations to complex multidimensional multimodal data combinations for AD detection. sMRI consistently emerges as the dominant modality, frequently paired with PET scans [21–32] and complemented by demographic and cognitive information [33, 34, 36], thus improving diagnostic precision. Recent studies increasingly incorporate genetic markers (e.g., APOE), CSF biomarkers, DTI, and neuropsychological test results [39–43, 48–52], reflecting a move towards individualized and biologically grounded diagnostics. Although EEG and fMRI are comparatively underutilized, their inclusion in studies combining EEG alone [18], EEG with fused CT-MRI [65], and sMRI with fMRI [67] points to a growing interest in capturing functional brain dynamics. A small but notable set of studies integrates clinical comorbidity, medication data, and EHRs [68, 69], indicating a shift towards real-world applicability. The unique multimodal configurations identified (refer to Table 6 and Fig. 7) demonstrate the ongoing effort of the research community to capture the multifactorial nature of AD using AI-driven models.

Table 6.

MMD Acronyms for AD Prediction

MMD Data Modalities References
MMD 1 EEG Features [18]
MMD 2 fMRI fusion [19, 20]
MMD 3 MRI + PET [21–32]
MMD 4 MRI + PET + Demographic [33–35]
MRI + PET + Demographic + Cognitive [36–38]
MMD 5 MRI + PET + Genetic [39, 40]
MRI + PET + Genetic + DTI + CSF + Neuro tests [41–44]
MMD 6 MRI + PET + Demographic + Cognitive + Genetic + CSF [45–47]
MRI + PET + Demographic + Cognitive + Genetic + Lab [48]
MRI + PET + Demographic + Cognitive + Genetic + NSB + Lab [49, 50]
MMD 7 MRI + Demographic + Cognitive + Genetic + NSB + Lab [51, 52]
MRI + Demographic + Clinical [53–56]
MRI + Demographic + Cognitive + Genetic [57, 58]
MMD 8 MRI + Neuropsychological Tests [59–61]
MMD 9 MRI Features (T1 + Proton Density) [62]
MMD 10 MRI + PET Features [63, 64]
MMD 11 EEG + Fused CT-MRI [65, 66]
MMD 12 sMRI + fMRI [67]
MMD 13 Cognitive + Medicine + Comorbidities [68]
MMD 14 DTI + Clinical (EHR) [69]
MMD 15 MRI + DTI [70, 71]

Fig. 7.

Fig. 7

Overview of the multimodal data combinations used in AI-based AD studies. The figure illustrates how frequently different modality pairings and higher-order combinations appear in the literature, highlighting dominant trends such as MRI–PET integration and the relative scarcity of more diverse multimodal configurations

Despite the steady growth in multimodal AI research for AD detection, several notable gaps persist in the current literature. Imaging modalities such as sMRI and PET scans are widely used, often in combination with cognitive scores or demographic data, but the integration of behavioral, clinical, and functional data (e.g., EEG, fMRI, neuropsychological assessments, and EHR) remains relatively underexplored. For example, modalities such as EEG and CT appear in only isolated studies (e.g., MMD 1, MMD 11), and combinations involving EHRs or clinical metadata are limited to a handful (e.g., MMD 13, MMD 14), despite their potential to enhance model interpretability and clinical relevance.

In addition, most existing studies limit themselves to the fusion of bimodal or trimodal data (e.g., MMD 3 and MMD 4), while only a few explore complex combinations involving four or more modalities (e.g., MMD 5, MMD 6, MMD 7). This trend highlights missed opportunities to exploit the full breadth of heterogeneous patient data to build more robust and generalizable AI models. Three principal factors may explain this pattern:

  • Modality overlap in information content - Certain data types, such as T1-weighted MRI and DTI, provide structurally similar insights into brain morphology, and their integration can yield only marginal improvements in model performance.

  • Underexplored yet complementary modality pairings - Despite the availability of datasets containing clinically synergistic modalities (e.g. PET with neuropsychological assessments or EEG with MRI), these combinations remain underexplored, reflecting a gap in both the design of the fusion strategy and the experimental implementation.

  • Limited availability of high-value modalities - Some high-value modalities, especially those requiring invasive acquisition (e.g., CSF biomarkers, genetic profiling) or continuous monitoring (e.g., EEG, sensor data), are not consistently recorded for the same subjects or are only available in small sample sizes, particularly in longitudinal cohorts. This hinders their inclusion in integrated multimodal frameworks.

Addressing these challenges through collaborative data sharing, curated benchmark datasets, and advanced fusion architectures could significantly increase the diagnostic utility of AI in real-world clinical settings for AD.
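Of the fusion strategies noted in the abstract, feature-level (intermediate) fusion underlies many of the MMD configurations above. A minimal numpy sketch with synthetic data (the feature dimensions and the per-modality z-scoring choice are our own illustration, not a specific reviewed pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-subject feature vectors (dimensions are illustrative):
# e.g., regional volumes from sMRI and regional uptake values from PET.
n_subjects = 8
mri_features = rng.normal(size=(n_subjects, 10))
pet_features = rng.normal(size=(n_subjects, 6))

def zscore(x):
    """Standardize each feature column to zero mean, unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Feature-level fusion: normalize each modality separately so neither
# dominates by scale, then concatenate into one joint representation
# that a downstream classifier would consume.
fused = np.concatenate([zscore(mri_features), zscore(pet_features)], axis=1)
print(fused.shape)  # (8, 16)
```

Decision-level (late) fusion would instead train one model per modality and combine their predictions, a trade-off several reviewed studies weigh against the joint representation shown here.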

Datasets and associated preprocessing methods used for AI driven AD prediction

This section explores RQ2: What datasets and preprocessing methods are used for multimodal data in Alzheimer’s Disease research?

The effectiveness of AI-based models for AD prediction is highly dependent on the quality, diversity, and representativeness of the datasets used during development. Given the heterogeneity of AD symptoms and its progression, the choice of datasets and their associated modalities, ranging from neuroimaging and genetics to cognitive assessments and clinical biomarkers, plays a crucial role in shaping model performance and generalizability. Equally important are the preprocessing techniques applied to raw data, which significantly influence the reliability of feature extraction, model interpretability, and cross-study comparability. Inconsistent preprocessing workflows can introduce variability, obscure clinical patterns, and limit reproducibility between studies. Therefore, identifying the most commonly used datasets and understanding the preprocessing practices adopted in the literature are vital for assessing the current research landscape, detecting methodological gaps, and guiding the development of robust and clinically applicable AI models.
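Two preprocessing steps that recur across the reviewed tabular pipelines are missing-value imputation and rescaling. A stdlib-only sketch with hypothetical values (median imputation and min–max scaling are common choices, not a prescription from any single study):

```python
# Minimal sketch of a tabular preprocessing step: median imputation
# followed by min-max scaling. Column values below are hypothetical.
import statistics

def impute_and_scale(column):
    """Replace None with the column median, then rescale to [0, 1]."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in column]
    lo, hi = min(filled), max(filled)
    if hi == lo:                      # constant column: map everything to 0.0
        return [0.0] * len(filled)
    return [(v - lo) / (hi - lo) for v in filled]

mmse_scores = [30, 24, None, 18, 27]  # hypothetical cognitive scores
print(impute_and_scale(mmse_scores))
```

Because imputation and scaling parameters must be estimated on training data only, inconsistent handling of this step is one source of the cross-study variability discussed above.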

To better understand the current research landscape, we compiled a detailed table summarizing the most widely used datasets in AI-based AD detection. Table 7 maps each dataset to specific multimodal data configurations employed across reviewed studies, offering insight into how different data modalities are utilized in practice. This table serves as a comprehensive reference for evaluating dataset diversity, modality integration, and real-world applicability - key factors that influence model development, generalizability, and clinical relevance.

Table 7.

A compilation of widely used AD datasets

Dataset Reference Modalities Datatype
ADNI [21–24, 26–31] MMD3 Images
[33–35, 39–41] MMD4 Images, Numeric
[39–44] MMD5 Images, Numeric
[45, 47–50] MMD6 Numeric
[51–58, 61] MMD7 Images, Numeric
[63, 64] MMD10 Images, Numeric
[37, 59] MMD8 Images, Numeric
[67] MMD12 Images, Numeric
[68] MMD13 Images, Numeric
[69, 70] MMD14/15 Images, Numeric
OASIS [24, 25] MMD3 Images
[65] MMD11 Images, Numeric
Harvard DB [62] MMD9 Numeric
AIBL [20, 32, 38, 46, 48, 60, 66] MMD3 Images
[55] MMD7 Images, Numeric
TADPOLE [48] MMD6 Numeric

Datasets

ADNI datasets A wide range of publicly available datasets have significantly advanced the development of AI models for AD prediction. Among them, the Alzheimer’s Disease Neuroimaging Initiative (ADNI, https://adni.loni.usc.edu) stands out as the most widely used resource. ADNI spans multiple phases: ADNI-1, ADNI-GO, ADNI-2, ADNI-3, and the recent ADNI-4, which collectively include more than 1,500 participants across the cognitively normal (CN), mild cognitive impairment (MCI), and AD groups. For example, ADNI-1 enrolled approximately 200 CN, 400 MCI, and 200 AD subjects, while ADNI-2 and ADNI-3 added hundreds more subjects with early MCI and mild AD. The initiative provides longitudinal and multimodal data that include structural and functional MRI, PET scans (Fluorodeoxyglucose (FDG), amyloid, and tau), cerebrospinal fluid (CSF) biomarkers, APOE genotyping, neuropsychological scores, demographic and clinical data, as well as laboratory test results. ADNI’s standardized imaging protocols and broad modality coverage make it the benchmark dataset for developing and validating AI and multimodal fusion models in AD research.

OASIS Another important resource is the Open Access Series of Imaging Studies (OASIS, https://www.oasis-brains.org), which consists of multiple phases (OASIS-1 to OASIS-4) and includes more than 1,300 participants, with data covering CN individuals and those with varying degrees of cognitive impairment. OASIS-3 alone provides more than 2,800 MRI sessions and 2,100 PET scans from more than 1,000 subjects. It offers sMRI, PET (including amyloid and FDG tracers), resting-state fMRI, DTI, and associated cognitive and clinical measures. Although it may not offer the same breadth of multimodal clinical data as ADNI, its standardized imaging protocols and large sample size make it a valuable resource for investigating early structural brain changes associated with AD.

AIBL Dataset The Australian Imaging, Biomarkers and Lifestyle Study of Aging (AIBL, https://aibl.csiro.au) complements ADNI by providing imaging and biomarker data from an independent cohort of approximately 1,100 subjects, including CN, MCI, and AD participants. It features sMRI, PET, and neuropsychological assessments, along with detailed information on lifestyle factors such as diet, physical activity, and sleep, which are rarely captured in other cohorts. AIBL is particularly valuable for studying the influence of modifiable risk factors.

TADPOLE dataset The TADPOLE dataset (https://tadpole.grand-challenge.org) is derived from a curated subset of ADNI and was developed as part of a global challenge aimed at predicting future cognitive decline in MCI and AD patients. It includes longitudinal multimodal data such as MRI, PET, CSF biomarkers, cognitive scores, and diagnostic labels, allowing detailed modeling of disease progression. Although not a standalone cohort, TADPOLE offers standardized benchmarks and prediction tasks that support reproducibility and cross-study comparison, making it a valuable resource for evaluating prognostic AI models.

Med Harvard Dataset The Med Harvard dataset (accessed through the Harvard Aging Brain Study: https://www.nitrc.org/projects/habs/), though not widely used in AD research, offers unique numerical features derived from T1-weighted and proton-density MRI scans that can support MRI-based classification models. However, its impact is limited by restricted public access, minimal documentation, and a narrower range of imaging modalities, making it less prevalent than more accessible datasets in the AD research community.

Referencing Table 7, which compiles the most frequently used datasets in AI-based AD research, it is evident that the field is heavily reliant on a few foundational cohorts, particularly ADNI, due to its multimodal richness and longitudinal scope. The table further highlights the various data configurations used in the studies, providing insight into how various modalities are integrated for diagnostic modeling. Although such datasets have undoubtedly propelled the advancement of AI applications in AD prediction, several challenges persist, including variability in modality availability, inconsistent preprocessing practices, and limited use of under-represented cohorts. Addressing these issues through standardized pipelines, better cross-cohort harmonization, and increased utilization of diverse datasets such as AIBL, OASIS, and Harvard DB will be essential. Such efforts will not only enhance the robustness and generalizability of AI models but will also strengthen their clinical relevance and translational impact in real-world settings.

Preprocessing techniques

Preprocessing multimodal data is a vital step in developing accurate and generalizable AI models for AD prediction. The diversity of preprocessing techniques adopted in different data modalities in multimodal AD research is visually summarized in Fig. 8. This schematic overview delineates modality-specific pipelines including structural and functional neuroimaging (sMRI, fMRI), DTI, PET, EEG, and non-imaging data such as clinical and genetic information. The figure aims to consolidate these procedures to provide a modality-aware perspective on the preprocessing landscape.

Fig. 8.


Overview of common preprocessing steps applied across different modalities in multimodal AD studies. The figure summarizes frequently used operations such as normalization, registration, skull stripping, segmentation, and artifact removal; for comprehensive modality-specific preprocessing guidelines, readers may refer to established medical imaging preprocessing surveys [5]

Given the heterogeneous nature of multimodal data—spanning neuroimaging, electrophysiological signals, genetic information, and clinical records—researchers employ diverse preprocessing pipelines tailored to each modality. To further contextualize these modality-specific strategies, Fig. 9 presents a heatmap that highlights the frequency of different preprocessing techniques in representative multimodal AD datasets. Common practices such as data normalization, bias correction, segmentation, image registration, and skull stripping appear in almost all modalities, reinforcing their fundamental role in the preparation of neuroimaging data for downstream analysis. Techniques such as AC-PC correction/alignment, bias correction, brain extraction, image registration, and segmentation are highly concentrated in MRI and PET workflows, while temporal data fusion and missing value handling are more prominent in longitudinal clinical and genetic datasets. This quantitative synthesis underscores a trend toward methodological convergence in some preprocessing steps, despite the inherent heterogeneity of data types. It also emphasizes the urgent need for unified and reproducible pipelines that accommodate the unique demands of each modality while ensuring seamless integration in multimodal AI frameworks. To provide concrete examples of these modality-specific strategies, Table 8 compiles representative preprocessing workflows reported in multimodal AD studies, detailing the associated modalities, key preprocessing steps and software tools used. This tabulated synthesis complements the schematic and heatmap overviews by linking general trends to specific implementations in the literature. In the following, we summarize the preprocessing strategies categorized according to the nature of the modality, as reflected in the literature.

Fig. 9.


Frequency distribution of preprocessing techniques applied across various data modalities in multimodal Alzheimer’s Disease studies

Table 8.

Preprocessing techniques used across the reviewed studies for the diagnosis of AD

Modalities Refs Preprocessing steps Software used
MMD1 [18] EEG signals were down-sampled to 256 Hz and segmented into M non-overlapping epochs of 5 s duration Matlab toolbox
MMD3 [21] Anterior commissure (AC)-posterior commissure (PC) correction, intensity inhomogeneity correction by the N3 algorithm, and skull stripping DDSS tool
MMD3 [20, 22, 23, 32, 38, 46, 66] MRI: AC-PC correction, spatial normalization, smoothing. PET: head motion correction, spatial normalization, smoothing DDSS tool
MMD15 [70] Noise correction, alignment, normalization, registration dcm2nii tool
MMD3-5, MMD7-8 [30, 33, 37, 40, 44, 51, 55, 56, 59] MRI images: AC-PC correction, intensity inhomogeneity correction, brain extraction, cerebellum removal, tissues segmentation, registration PET: Registration. Data preprocessing namely, Interval alignment, Age processing, and Generating target time points Freesurfer
MMD3 [25, 26, 31] MRI image: bias correction and skull stripping, segmentation of brain tissue on the skull-stripped image, image registration, noise reduction. PET image: color space transition, filtering, fusion of both low- and high-pass filters, retransformation to HSI space, image conversion Freesurfer software, FSL software, QNML, BDM
MMD7 [54] Subcortical volumes, hippocampal subfield volumes, cortical volume, cortical surface area, cortical curvature, cortical thickness, and cortical thickness standard deviation extracted; ANOVA and chi-square tests performed FreeSurfer
MMD12 [67] Spatial distortion correction, skull stripping, removal of cerebellum, and parcellation into cortical regions FreeSurfer
MMD4 [34, 35] sMRI: denoising, resampling, bias correction, image normalization, affine registration to MNI space, segmentation of GM, WM, and CSF using SPM12, global intensity correction. PET: image registration, alignment of sMRI with PET, skull elimination CAT12 toolkit, SPM12
MMD7, MMD13 [52, 68] Temporal data fusion, ATC encoding, handling missing values, data normalization, time-series feature extraction dcm2nii tool
MMD11 [65] Image preprocessing: contrast enhancement, Gaussian and median filters (image denoising), data augmentation. EEG signal: Notch, Butterworth, and CAR spatial filters DDSS tool
MMD5-7, MMD9-10 [39, 42, 43, 45, 47, 48, 58, 61-64] (1) anterior commissure-posterior commissure (AC-PC) correction via MIPAV software, (2) intensity inhomogeneity correction using the N3 algorithm, (3) skull stripping and cerebellum removal with aBEAT, (4) segmentation of the three main tissues (i.e., GM, WM, CSF) via the FAST algorithm, (5) registering images to a template via HAMMER, (6) projecting 90 region-of-interest (ROI) labels from the template to each subject image, (7) bias correction, (8) reorientation, (9) motion correction, (10) Lasso regression, and (11) normalization DDSS tool, FSL, PLINK, FreeSurfer, Beagle

Structural and Functional MRI (sMRI and fMRI) Structural MRI preprocessing typically begins with alignment of the anterior commissure (AC) and posterior commissure (PC), followed by correction for intensity inhomogeneity using the Nonparametric Nonuniform intensity Normalization (N3) algorithm and skull stripping. Subsequent steps include segmentation of brain tissue into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), normalization to standard anatomical templates (for example, MNI space), and registration to common coordinate spaces [21, 22, 39, 48, 62-64]. Tools such as FreeSurfer, the FMRIB Software Library (FSL), dcm2nii, and DDSS are widely used to perform these steps [41, 45, 48, 51, 59]. Some studies also apply motion correction, cerebellum removal, and bias correction [42, 43, 58]. For fMRI data, preprocessing includes distortion correction, skull stripping, cerebellum removal, and cortical parcellation to enable functional analysis and connectivity mapping [67]. Spatial normalization is crucial when combining fMRI with other modalities such as PET or DTI.
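
To make the normalization and masking steps concrete, the sketch below applies min-max intensity rescaling and a crude threshold-based foreground mask to a synthetic volume. This is only an illustration of the idea on toy data; real pipelines rely on dedicated tools such as FSL or FreeSurfer, and the threshold value here is arbitrary.

```python
import numpy as np

def min_max_normalize(vol):
    """Rescale voxel intensities to [0, 1]."""
    vol = vol.astype(float)
    return (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)

def crude_brain_mask(vol, threshold=0.2):
    """Threshold-based foreground mask (a toy stand-in for real skull stripping)."""
    return min_max_normalize(vol) > threshold

rng = np.random.default_rng(0)
volume = rng.uniform(0, 255, size=(16, 16, 16))  # synthetic sMRI-like volume
norm = min_max_normalize(volume)
mask = crude_brain_mask(volume)
```

In practice the mask would come from a dedicated brain-extraction tool and the normalized volume would then be registered to a template such as MNI space.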

Positron Emission Tomography (PET) PET preprocessing involves head motion correction, spatial normalization, and smoothing to improve signal-to-noise ratio (SNR). MNI registration is often performed to allow voxel-wise correspondence and fusion of structural and functional insights. Some workflows employ color space transitions (e.g., RGB to HSI), fusion filtering (low- and high-pass) and image retransformation to harmonize the data with MRI-derived regions of interest [23, 25, 34]. Skull elimination and global intensity normalization further refine the data for downstream modeling.
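
As an illustration of the smoothing step, the following sketch applies an isotropic Gaussian kernel to a synthetic noisy volume; the 8 mm FWHM and 2 mm voxel size are hypothetical values chosen only for the example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
pet = rng.normal(loc=1.0, scale=0.5, size=(32, 32, 32))  # noisy synthetic PET volume

# Convert a hypothetical 8 mm FWHM kernel at 2 mm voxels to a sigma in voxel units:
# sigma = FWHM / (2.355 * voxel_size)
sigma = 8.0 / (2.355 * 2.0)
smoothed = gaussian_filter(pet, sigma=sigma)
# Smoothing averages neighboring voxels, reducing voxel-wise noise variance
# while roughly preserving the mean signal (higher SNR).
```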

Diffusion Tensor Imaging (DTI) DTI preprocessing is used in relatively fewer studies but includes standard steps such as noise correction, spatial alignment, and normalization. It is often combined with sMRI for tissue contrast enhancement and registration [69, 70]. From the preprocessed diffusion data, scalar measures such as fractional anisotropy (FA) and mean diffusivity (MD) are commonly extracted. FA quantifies the directional coherence of water diffusion and serves as a marker of WM integrity; lower FA values may indicate microstructural degeneration, demyelination, or axonal loss, which are characteristic of AD progression. In contrast, MD reflects the overall magnitude of water diffusion regardless of direction and tends to increase in neurodegenerative conditions due to disrupted cellular barriers. Thus, these features derived from DTI are valuable for modeling microstructural brain changes and are increasingly integrated into multimodal ML frameworks for the early detection of AD.
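
Given the tensor eigenvalues, FA and MD follow directly from their standard definitions. The sketch below computes both for two hypothetical diffusion profiles (an anisotropic WM-like one and an isotropic one); the eigenvalue sets are illustrative, not taken from any study.

```python
import numpy as np

def fa_md(eigenvalues):
    """Fractional anisotropy and mean diffusivity from the three tensor eigenvalues.

    MD = mean(lambda_i); FA = sqrt(3/2) * ||lambda - MD|| / ||lambda||.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    md = lam.mean()
    fa = np.sqrt(1.5 * np.sum((lam - md) ** 2) / np.sum(lam ** 2))
    return fa, md

# Hypothetical profiles: coherent WM-like diffusion vs. isotropic (CSF-like) diffusion
fa_wm, md_wm = fa_md([1.7e-3, 0.3e-3, 0.3e-3])   # high FA: strong directionality
fa_iso, md_iso = fa_md([1.0e-3, 1.0e-3, 1.0e-3])  # FA = 0: no preferred direction
```

Loss of WM integrity would show up in this toy computation as the anisotropic profile drifting toward the isotropic one (FA decreasing, MD increasing).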

Electroencephalography (EEG) EEG preprocessing involves downsampling (e.g., to 256 Hz), epoch segmentation (e.g., 5-second windows), and application of spatial filters such as Notch, Butterworth, and Common Average Reference (CAR) filters to remove power-line noise and artifacts [18, 65]. EEG denoising often includes contrast enhancement and Gaussian or median filtering, particularly in fused EEG–CT/MRI systems.
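
The filtering and epoching steps can be sketched on a synthetic single-channel signal as follows; a 0.5-45 Hz Butterworth band-pass stands in for the Notch/Butterworth stages described above, and all signal parameters are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 256                      # sampling rate after downsampling (Hz)
t = np.arange(0, 20, 1 / fs)  # 20 s of synthetic single-channel EEG
# 10 Hz alpha-band component plus 50 Hz power-line interference
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# 4th-order Butterworth band-pass (0.5-45 Hz) attenuates the 50 Hz component;
# filtfilt applies it forward and backward for zero phase distortion
b, a = butter(4, [0.5, 45], btype="bandpass", fs=fs)
filtered = filtfilt(b, a, eeg)

# Segment into non-overlapping 5 s epochs (as in the downsampling scheme above)
epoch_len = 5 * fs
epochs = filtered[: len(filtered) // epoch_len * epoch_len].reshape(-1, epoch_len)
```

Each row of `epochs` would then feed time-frequency feature extraction (e.g., CWT) for classification.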

Genetic, Cognitive, and Clinical Data Structured numerical data, including single nucleotide polymorphisms (SNP), cognitive scores, and EHR-derived variables, require preprocessing steps such as normalization, missing value imputation, and categorical variable encoding. In some cases, time-series encoding and temporal data fusion are applied to longitudinal data [52, 68]. Genetic data processing may also include SNP filtering, imputation (via PLINK, Beagle), and linkage disequilibrium pruning [42, 43]. Statistical feature selection techniques such as ANOVA and chi-square tests are used for dimensionality reduction [54].
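
A minimal sketch of mean imputation followed by z-score standardization, using a small hypothetical cognitive-score matrix; real studies typically use richer imputation schemes (e.g., PLINK/Beagle for genetics) and dedicated libraries.

```python
import numpy as np

def impute_mean(x):
    """Replace NaNs with the column mean (simple missing-value imputation)."""
    x = x.copy()
    col_mean = np.nanmean(x, axis=0)
    nan_idx = np.where(np.isnan(x))
    x[nan_idx] = np.take(col_mean, nan_idx[1])
    return x

def zscore(x):
    """Column-wise standardization to zero mean, unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# Hypothetical matrix: rows = subjects, columns = two numeric clinical features
scores = np.array([[28.0, 1.5],
                   [np.nan, 2.0],
                   [22.0, np.nan],
                   [26.0, 1.0]])
clean = zscore(impute_mean(scores))
```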

Based on the reviewed studies, it is evident that preprocessing multimodal data in AD research is a critical step that ensures compatibility, quality, and meaningful integration across diverse data sources. Collectively, these strategies reflect a comprehensive and modality-specific approach to preprocessing, although the heterogeneity in tool chains and parameter settings highlights an urgent need for standardized and fusion-aware pipelines to ensure reproducibility and optimal model performance in AD prediction.

Fusion methods used for AI-driven AD prediction

This section explores RQ3: What methods are used for multimodal data fusion in Alzheimer’s Disease detection?

Recent approaches to AD prediction have increasingly relied on the integration of multimodal data through tailored fusion strategies. ML and DL models process and combine heterogeneous information, such as neuroimaging, fluid biomarkers, genetic data, and cognitive assessments, at different points in the analytical pipeline. The fusion process varies in complexity and function depending on when and how the data are integrated. This section outlines the fusion methodologies adopted in current AD prediction studies, distinguishing between data-level, early, intermediate, feature-level, decision-level, and late approaches, and comparing how ML and DL techniques exploit each fusion stage to improve model generalizability and diagnostic accuracy.

The combination of different data sources offers complementary perspectives on AD pathology and progression. Modalities such as sMRI, fMRI, PET imaging, CSF biomarkers, EEG signals, and clinical or genetic profiles are used in various combinations to enhance prediction performance [72]. Some studies apply layered ensemble strategies or hierarchical networks to better capture complex cross-modal relationships.

To facilitate a structured analysis of fusion strategies in AI based multimodal AD prediction, Table 9 outlines key aspects of each study considered in our review, including the type and level of fusion, the modalities integrated, and the classification models employed. In these studies, a diverse collection of algorithms is used, ranging from classic classifiers such as SVM and Random Forest to deep architectures such as CNN, LSTM, and interpretable neural models. In addition, the heat map in Fig. 10 visualizes how frequently different combinations of modality appear at fusion levels, helping to illustrate dominant patterns in the literature and highlight areas that remain underexplored. This section examines the fusion strategies found in this review, identifying common trends and less frequent modality-classifier pairings, while assessing their implications for the classification and prognosis of AD.

Table 9.

Fusion strategies for AD classification grouped by fusion level

Fusion Level Refs Strategy Modalities Classifiers
Data-Level [26] MRI and PET combined into a composite image for enhanced AD classification MRI, PET 3D CNN
[41] Combines multimodal biomarkers with imputation and classification techniques MRI, PET, CSF, Cognitive SVM, RF, Gradient Boosting
[65] Combines EEG and CT-MRI into fused signals for early AD detection using transfer learning EEG, CT-MRI HEMRDTL (Deep TL model using VGG-19)
[44] Multi-modal feature inductive learning with dual multi-level GNN and attention-based Transformer fusion MRI, SNP, Protein FIL-DMGNN
[60] Multimodal ensemble across demographics, MRI, and clinical history from multiple cohorts using large-scale AI models MRI, Clinical, Demographics Transformer Model
Decision-Level [34] Combines MRI and PET at patch level, features, and classifier levels for enhanced accuracy MRI, PET PDMML framework (multiple models)
[53] Multistage fusion strategy combining PLS features with an ensemble RF approach MRI (T1-weighted) OvR Random Forest Ensemble
[54] Multi-level hierarchical machine learning used to classify 4 groups sMRI AHM, HM classifiers
[59] Fusion ensemble integrates MRI and neuropsychological test scores with modality-specific weights MRI, Neuropsychological Tests SVM (Ensemble)
Early [45] Raw features from multiple modalities fused into a common vector for classification ADNI Dataset Random Forest
[51] Combines multimodal time-series (CSs, MRI, NSB, BL) for predicting conversion time CSs, MRI, NSB, BL LSTM, DT, RF, SVM, LR, KNN
[52] Early fusion explores hidden relationships; late fusion uses stacked ensemble voting Multimodal time-series (4 total) KNN, XGBoost, SVM, RF, DT, MLP (Stacking)
[58] Baseline MRI, genetic, and CDC data are combined into unified features for survival prediction MRI, Genetic, CDC (Cognitive + CSF) DeepSurv (Deep Learning Survival Model)
[67] CMRN model fuses structural and functional MRI using cross-modal attention sMRI, fMRI Softmax
[68] Five modalities combined early and optimized using stratified training/testing Comorbidities, Medications, CSs SVM, RF, KNN, LR, DT
[47] AMFS with SHAP feature selection and dynamic model/ensemble selection across multiple modalities MRI, PET, Cognitive tests, CSF, Blood GCN (Graph Convolutional Network)
[66] Combines CT and MRI using Early (concatenated JSM maps) and Late fusion (dual-branch network averaging) MRI, CT Random Forest, CNN
[61] Entropy-based fusion using top-entropy MRI slices and neuropsychological tests via a cognitive ELM with a boosting ensemble MRI slices, Neuropsychological test scores Extreme Learning Machine (ELM)
Feature-Level [35] Patch-based interpretable framework combining image patches (MRI, PET) and clinical data using FCRN and MLP MRI, PET, Demographic, MMSE, ApoE4 MLP
[46] Combines GM from MRI with PET via wavelet transform and other image fusion for multimodal deep learning MRI, PET, CSF CNN, RNN
[38] Hierarchical attention-based multi-task fusion using CHAM module on GM, WM, CSF, and PET sMRI, PET, CSF HAMMF model
[18] Features from CWT and BiS are integrated for improved AD and MCI classification EEG (CWT, BiS) AE, MLP, LR, SVM
[19] Combines ASL and BOLD fMRI using sparse regression to build a hyper-network for MCI classification ASL fMRI, BOLD fMRI Multi-kernel SVM
[21] Two-stage SDPN model fuses MRI and PET features for disease characterization MRI, PET SVM, Linear Classifier
[22] MRI and PET features fused before LSTM and Softmax to improve performance MRI, PET Softmax
[23] Latent features from MRI and PET are combined to enhance classification accuracy MRI, PET SVM
[24] Multiple modalities projected into a latent space using orthogonal projection MRI, PET Ensemble of classifiers
[25] Fuses sMRI, PET, and CSF using Fourier/DWT methods for dementia stage detection sMRI, PET, CSF SVM, Inception-ResNet
[27] Enhances feature representation from sMRI and FDG-PET using cross-enhanced fusion module sMRI, FDG-PET Softmax
[28] Uses correlation and structure fusion for multimodal feature selection from MRI, PET, and CSF MRI, PET, CSF SVM
[34] Combines MRI and PET at patch level, features, and classifier levels for enhanced accuracy MRI, PET PDMML framework (multiple models)
[36] Concatenates features from MRI, PET, CSF, and cognitive scores using fully connected layers MRI, PET, CSF, Cognitive CNN-based model (ML4VisAD)
[39] Multi-kernel fusion of MRI, PET, CSF, and APOE for improved GP classification MRI, PET, CSF, APOE Gaussian Process classifier
[40] MRI, PET, and SNP data fused into a latent representation for classification MRI, PET, SNP SVM
[42] Uses hierarchical attention to integrate MRI, SNP, and clinical data MRI, SNP, Clinical Two-layer MLP
[43] Incorporates intra- and inter-modal interaction modules using the DISFC model to enhance feature fusion sMRI, Clinical, Genetic Interpretable Deep Learning Model
[48] Combines Fisher Score and greedy search heuristics for optimal multimodal feature selection MRI, PET, CSF, Cognitive tests SVM, KNN
[50] Uses deep BiLSTM to integrate multiple modalities for progression prediction CSs, NSB, MRI, FDG-PET RF, SoftMax, SVM, DT, FURIA, MOEFC
[53] Multistage fusion strategy combining PLS features with an ensemble RF approach MRI (T1-weighted) OvR Random Forest Ensemble
[54] Multi-level hierarchical machine learning used to classify 4 groups sMRI AHM & HM classifiers
[55] M3ID approach transfers knowledge from multimodal teacher to unimodal student network MRI, Clinical SVM
[56] Transformed clinical/biological features are fused with CNN-extracted neuroimaging features PET, fMRI, CSF, DTI, EEG, MRI ANN with CNN backbone
[57] Fuses MRI with demographic and cognitive scores across time steps for improved prediction MRI, Demographics, Cognitive Scores 3D CNN + BRNN
[58] Baseline MRI, genetic, and CDC data are combined into unified features for survival prediction MRI, Genetic, CDC (Cognitive + CSF) DeepSurv (Deep Learning Survival Model)
[62] NSCT-extracted features fused with sCNN and denoised using FOTGV MRI, CT, PET, SPECT sCNN with FOTGV
[63] High-level SAE features fused via zero-mask method and classified with MKSVM MRI, PET 2SAE + Multi-kernel SVM
[64] DMDR projects multiple modalities into a discriminative space using self-expression constraints MRI, PET, CSF SVM
[69] A multimodal ensemble combines a Random Forest for clinical data and CNN for DTI scans to improve prediction DTI, EHR (Electronic Health Records) Random Forest + CNN
[71] Combines shape and DTI features to evaluate their combined discriminative power for AD diagnosis T1-weighted MRI, DTI SVM, LDA
[29] Combines representation learning and classifier modeling using MRI and PET in a unified framework MRI, PET SVM
[30] Multi-task learning fuses MRI, PET, CSF features to predict classification and regression outcomes MRI, PET, CSF Multi-modal SVM
[31] Dual modality fusion integrates MRI and PET scans for AD diagnosis MRI, PET Dual-3DM3-AD model
[33] Deep fusion of MRI, PET, and demographics to build a comprehensive latent representation MRI, PET, Demographics RNN-based sequence model
Intermediate [67] CMRN model fuses structural and functional MRI using cross-modal attention sMRI, fMRI Softmax
[47] AMFS with SHAP feature selection and dynamic model/ensemble selection across multiple modalities MRI, PET, Cognitive tests, CSF, Blood GCN (Graph Convolutional Network)
[20] Multi-fusion approach combining multilevel features from MRI, PET, CSF, and genetics. Uses cascaded CNNs, orthogonal latent space models, and local/global learning modules sMRI, PET, CSF, Genetic MDL-Net
[32] Cross-modality and multiscale fusion using attention mechanisms to enhance feature interaction between MRI and PET MRI, PET Least Square Twin SVM (energy-based)
Late-Level [47] AMFS with SHAP feature selection and dynamic model/ensemble selection across multiple modalities MRI, PET, Cognitive tests, CSF, Blood GCN (Graph Convolutional Network)
[66] Combines CT and MRI using Early (concatenated JSM maps) and Late fusion (dual-branch network averaging) MRI, CT Random Forest, CNN
[49] Combines CNN-BiLSTM models for each modality using stacked learning PET, MRI CNN-BiLSTM Ensemble
[52] Early fusion explores hidden relationships; late fusion uses stacked ensemble voting Multimodal time-series (4 total) KNN, XGBoost, SVM, RF, DT, MLP (Stacking)
[67] CMRN model fuses structural and functional MRI using cross-modal attention sMRI, fMRI Softmax
[70] Combines predictions from axial, coronal, and sagittal views using majority voting to enhance classification sMRI, DTI Transfer Learning CNN

Fig. 10.


Frequency of modality combinations employed across different fusion strategies in AI-based AD prediction

Early Fusion In the context of AD, early fusion has been widely adopted, particularly with time-series and imaging data. For example, [45] concatenated raw features from multiple ADNI modalities for classification using Random Forest. Similarly, [51] fused longitudinal clinical series that included CSs, MRI, NSB, and BL measures for prediction using LSTM, Decision Trees, and other ensemble classifiers. Several studies, such as [52, 68], and [58], used early fusion for structured data such as genetics and cognitive tests, combined with deep survival models or hybrid stacking approaches. These approaches often benefit from rich cross-modal interactions but may suffer when modalities have missing or misaligned data.
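
A minimal sketch of early fusion: per-modality standardization followed by concatenation into one feature vector per subject, which would then feed a single classifier. The modality names and feature dimensions are hypothetical.

```python
import numpy as np

def zscore(x):
    """Standardize each feature so modalities with large scales do not dominate."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

rng = np.random.default_rng(42)
n_subjects = 50
mri_feats = rng.normal(size=(n_subjects, 90))  # e.g., 90 ROI volumes (hypothetical)
csf_feats = rng.normal(size=(n_subjects, 3))   # e.g., three CSF biomarkers
cog_feats = rng.normal(size=(n_subjects, 5))   # e.g., five cognitive scores

# Early fusion: concatenate standardized modality blocks into one matrix
fused = np.hstack([zscore(mri_feats), zscore(csf_feats), zscore(cog_feats)])
```

The fused matrix (`n_subjects x 98` here) is what a Random Forest or LSTM would consume in the studies cited above.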

Intermediate Fusion Intermediate fusion methods integrate information from different modalities at mid-level layers within the network, allowing partial feature extraction before merging. A notable study by [67] introduced a multimodal attention-based network, which combines sMRI and fMRI. This approach enables the model to attend to relevant modality-specific features before joint reasoning, improving flexibility and interpretability. Although less commonly reported than early or late fusion, intermediate fusion offers a balanced trade-off between early richness and late robustness.
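
One way to sketch attention-based intermediate fusion is to score each modality embedding with a vector and combine the embeddings with softmax weights. This is a simplified, fixed-weight stand-in for the learned cross-modal attention modules described above; all shapes and names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_fuse(feat_a, feat_b, w):
    """Per-subject attention over two modality embeddings.

    `w` plays the role of a learned scoring vector; here it is fixed for illustration.
    """
    stacked = np.stack([feat_a, feat_b], axis=1)     # (n, 2, d)
    scores = stacked @ w                             # (n, 2) modality relevance
    alpha = softmax(scores)                          # attention weights sum to 1
    return (alpha[..., None] * stacked).sum(axis=1)  # (n, d) fused embedding

rng = np.random.default_rng(0)
smri = rng.normal(size=(8, 16))  # hypothetical mid-layer sMRI embeddings
fmri = rng.normal(size=(8, 16))  # hypothetical mid-layer fMRI embeddings
fused = attention_fuse(smri, fmri, w=rng.normal(size=16))
```

In a real network both the embeddings and `w` are learned end-to-end, which is what lets the model weight modalities differently per subject.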

Late Fusion Late fusion refers to the combination of modality-specific model outputs (or decisions) at the final stage, typically using ensemble methods. For AD prediction, [49] fused CNN-BiLSTM models trained on PET and MRI using stacked learning, while [70] implemented a view-based ensemble across axial, coronal, and sagittal slices of sMRI and DTI using majority voting. Studies such as [52] combine early and late fusion in layered ensemble strategies to increase performance. Late fusion is particularly advantageous for handling missing modalities and using independently optimized pipelines.
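
Majority voting over modality-specific predictions, as used in the view-based ensemble above, can be sketched as follows; the per-modality label vectors are hypothetical.

```python
import numpy as np

def majority_vote(predictions):
    """Late fusion: combine hard labels from modality-specific models by voting."""
    preds = np.asarray(predictions)  # (n_models, n_subjects)
    n_classes = preds.max() + 1
    # Count votes per class for each subject, then pick the most voted class
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

# Hypothetical per-modality classifier outputs (0 = CN, 1 = MCI, 2 = AD)
pet_pred = [0, 1, 2, 2]
mri_pred = [0, 1, 1, 2]
eeg_pred = [1, 1, 2, 2]
final = majority_vote([pet_pred, mri_pred, eeg_pred])  # one label per subject
```

Because each modality model is trained independently, a subject missing one modality can still be classified from the remaining votes.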

Data-Level Fusion In data-level fusion, raw data from multiple modalities are merged directly, often spatially or temporally, before feature extraction or model training. For example, [26] performed voxel-level fusion of MRI and PET images into a composite volume analyzed with a 3D CNN, while [41] handled imputation across missing multimodal biomarkers (MRI, PET, CSF, cognitive) and trained ensemble ML classifiers (SVM, RF, boosting). A unique example is [65], which fused EEG and CT-MRI signals for early diagnosis using transfer learning with a VGG-19-based model. Although data-level fusion allows for holistic representation, it requires strict alignment and normalization across inputs.
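
A minimal sketch of voxel-level fusion under the assumption that the two volumes are already co-registered: normalize each modality to a common intensity range, then take a weighted average. The weights and synthetic intensities are illustrative, not the scheme of any particular study.

```python
import numpy as np

def fuse_volumes(mri, pet, w_mri=0.5):
    """Voxel-wise weighted average of two co-registered, normalized volumes."""
    assert mri.shape == pet.shape, "data-level fusion requires registered inputs"
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-8)
    return w_mri * norm(mri) + (1 - w_mri) * norm(pet)

rng = np.random.default_rng(3)
mri = rng.uniform(0, 4000, size=(8, 8, 8))    # synthetic registered MRI volume
pet = rng.uniform(0, 30000, size=(8, 8, 8))   # synthetic registered PET volume
composite = fuse_volumes(mri, pet)            # single composite volume for a 3D CNN
```

The normalization step matters here: without it, the modality with larger raw intensities would dominate the composite, which is one reason data-level fusion demands careful harmonization.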

As detailed in Table 9, the included studies cover a range of fusion paradigms, from early fusion of raw inputs to decision-level fusion models. An observation from the table is that data-level fusion remains relatively rare. Only a handful of studies [26, 41, 65] have attempted this direct integration of raw imaging or signal data. This is likely due to substantial technical obstacles: each modality, whether MRI, PET, EEG, or CT, differs in resolution, preprocessing needs, and data dimensionality. Without standardized pipelines or harmonization frameworks, aligning such diverse inputs can introduce noise or bias, making this approach difficult to scale. The complexities faced at this level mirror those encountered in early fusion, where low-level integration also requires careful synchronization. Future directions might include the development of robust preprocessing frameworks to facilitate data-level fusion in large-scale AD studies.

Feature-Level Fusion Feature-level fusion is by far the most widely adopted strategy in the reviewed AD prediction studies, with more than 30 implementations applying this approach. It involves extracting features separately from each modality and merging them into a unified feature space for classification. Several studies used this strategy to integrate neuroimaging modalities such as MRI and PET, often combining them with clinical, genetic, or cognitive data. For example, [21, 23], and [28] used hand-crafted features from MRI and PET and applied classifiers such as SVM, linear models, and ensemble techniques. Multi-kernel methods remain popular for handling heterogeneously scaled data, as demonstrated in [19, 39], and [63], which combined structural imaging, CSF, and APOE or SNP data. Studies such as [18, 36], and [48] focused on EEG or cognitive test characteristics fused using autoencoders, CNNs, or heuristic feature selection techniques like Fisher Score.
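
The multi-kernel idea can be sketched by combining per-modality linear kernels with convex weights; the resulting kernel could then be handed to any kernel classifier such as an SVM. The feature dimensions and weights below are hypothetical.

```python
import numpy as np

def linear_kernel(x):
    """Gram matrix of inner products between subjects for one modality."""
    return x @ x.T

def multi_kernel(feature_sets, betas):
    """Combine per-modality kernels with nonnegative weights summing to 1."""
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()
    return sum(b * linear_kernel(f) for b, f in zip(betas, feature_sets))

rng = np.random.default_rng(7)
mri = rng.normal(size=(10, 90))  # hypothetical MRI ROI features (10 subjects)
csf = rng.normal(size=(10, 3))   # hypothetical CSF biomarker features
K = multi_kernel([mri, csf], betas=[0.7, 0.3])  # combined 10x10 kernel
```

Because each modality contributes through its own kernel, differently scaled or differently sized feature sets can be combined without direct concatenation, which is the appeal of multi-kernel fusion.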

Several recent approaches have explored more complex architectures to model interactions between features. For example, [42] and [43] introduced attention mechanisms and interpretable deep models to combine MRI, SNP, and clinical / genetic data. Similarly, [56] and [57] fused time-dependent imaging, demographics, and cognitive scores using hybrid CNN-RNN models to improve progression prediction. Studies such as [33] and [36] further promoted feature fusion by applying fully connected deep layers to multimodal embeddings from MRI, PET, and CSF. Despite its popularity, feature-level fusion has challenges such as high dimensionality, feature imbalance, and the risk of redundancy. However, its flexibility, interpretability, and compatibility with both traditional and deep learning classifiers have made it a cornerstone of multimodal AD prediction research.

Decision-Level Fusion Decision-level fusion aggregates the outputs of modality-specific classifiers, often through ensemble techniques. In [34], the predictions from the MRI and PET models were fused at the level of the patch, the feature, and the classifier. [53] and [54] implemented multistage or hierarchical decision fusion using Random Forest and rule-based classifiers for robust AD subgroup classification. [59] applied weighted ensemble methods to combine neuroimaging with neuropsychological tests. These strategies are modular and fault-tolerant, but may not capture deep cross-modal correlations.
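
Weighted decision-level fusion of class probabilities can be sketched as follows; the probability tables and weights are hypothetical softmax outputs chosen for illustration.

```python
import numpy as np

def weighted_decision_fusion(prob_list, weights):
    """Average class-probability outputs of modality-specific classifiers."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so fused rows remain valid probabilities
    fused = sum(wi * np.asarray(p) for wi, p in zip(w, prob_list))
    return fused.argmax(axis=1), fused

# Hypothetical softmax outputs (rows = subjects, cols = CN / MCI / AD)
mri_probs = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
npsy_probs = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
labels, fused = weighted_decision_fusion([mri_probs, npsy_probs], weights=[0.6, 0.4])
```

The modality weights here are fixed, mirroring the modality-specific weighting reported in [59]; in practice they could be tuned on validation data.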

The feature-level and early fusion strategies clearly dominate the reviewed literature on AD prediction. Feature-level fusion, in particular, is widely used across ML and DL pipelines, with studies like [18, 39], and [36] applying multi-kernel SVMs or deep networks to jointly process structured features from imaging, CSF biomarkers, cognitive scores, and genetic markers. Early fusion is commonly applied in time-series and survival prediction tasks (for example, [51, 58]), using LSTM and deep survival models to process raw multimodal input jointly. Intermediate fusion, although less frequently used, is illustrated in [67], where cross-modal attention enables mid-layer integration of fMRI and sMRI signals, balancing early synergy and late abstraction. Meanwhile, decision-level and late-fusion approaches, such as those in [49] and [70], implement ensemble frameworks to aggregate predictions from multiple models or modality-specific views. These methods offer modularity and flexibility, particularly in handling missing modalities.

As discussed, the reviewed literature demonstrates diverse and evolving practices in fusion strategy and modality integration. Table 9 encapsulates these patterns, serving as a comprehensive reference for future comparative studies and methodological refinements.

AI-based models for the detection of AD using multimodal data

This section explores RQ4: What different AI models are available for AD diagnosis that incorporate multimodal data?

The application of AI in AD prediction has evolved to take advantage of the richness of multimodal data, including neuroimaging, clinical, genetic, and biochemical characteristics, to improve diagnostic precision. Numerous investigations have used various combinations of multimodal data to achieve precise diagnoses of AD. A total of 86 different studies were identified (Table 10) and categorized based on traditional ML and DL approaches. Among these, 41 studies utilized ML, while 45 employed DL frameworks, reflecting the growing interest in more automated, end-to-end pipelines for complex data integration.

Table 10.

Studies incorporating Multimodal Data for the Prediction of Alzheimer’s Disease

MMD Refs Task Dataset Datatype ML/DL Classifier Performance
MMD1 [18] 2-way (AD vs HC, MCI vs AD, MCI vs HC) and 3-way (AD vs MCI vs HC) classification EEG database of 63 AD, 63 MCI and 63 HC Numeric ML and DL MLP, AE, SVM, LR Accuracy 87.2%
MMD3 [21] 2-way Classification ADNI Images (MRI, PET) ML SVM Acc: 97.13
[23] AD Binary Classification: AD vs NC, NC vs MCI, MCI vs AD ADNI Images ML SVM Accuracy: 0.9074 (AD vs NC), 0.74 (MCI vs NC), 0.75 (AD vs MCI)
[22] 2-way classification of AD ADNI Images DL CNN Acc: 94.82
[24] OLFG for multimodal AD diagnosis ADNI-2, OASIS-3 Images DL GNN Acc: 94.7
[25] 2-way (HC vs MCI, MCI vs AD and HC vs AD) classification OASIS Images DL CNN Acc=0.9506, Spe=0.9741, Sen=0.9276, AUC=0.985
[26] Classify NC, MCI and AD using fused scans of both MRI and FDG-PET ADNI Images DL CNN Acc: 93.21
[27] AD and SMC Diagnosis ADNI Images (sMRI and PET) DL CNN Acc: 97.67
[28] 2-way Classification - AD vs NC, NC vs LMCI, NC vs EMCI, LMCI vs EMCI ADNI Images ML SVM Acc: 91.85 (AD vs NC)
[29] 2-way classification ADNI-1, ADNI-2 Images ML SVM Acc: 96.9 (ADNI-1), 96.8 (ADNI-2) (AD vs NC)
[30] 3-way classification (AD vs MCI vs NC) ADNI Image and Numeric (PET, sMRI, CSF) ML SVM Acc=0.933
[31] AD Binary Classification:AD vs NC ADNI Images DL ResNet Accuracy: 0.98, Sen=0.978, Spe=0.975, F1=0.982
[32] NC vs AD; NC vs MCI; sMCI vs pMCI ADNI Images (MRI, PET) ML and DL Twin-SVM (ResNet50) AUC AD 0.947, MCI 0.814, pMCI 0.719
[20] MCI vs CN; AD vs MCI ADNI, AIBL Images (sMRI, PET) DL MDL-Net
MMD4 [34] AD Diagnosis ADNI Images (MRI, PET), Numeric (Demographic) DL CNN Acc=0.808, AUC=0.92, Pre=0.81, Rec=0.81
[33] AD progression prediction ADNI Image, Numeric DL RNN Refer
[36] 3-way and 5-way Classification ADNI, QT-PAD Images ML SVM Acc=0.82 (3-way) and 0.68 (5-way)
[37] AD vs HC; MCI vs HC; multi-way ADNI, OASIS, BrainLa Images (MRI, PET) ML and DL Fractional CNN; SECNN-RF Acc 97.6%; Prec 95.8%; Rec 97%; F1 96.8%
[35] AD vs CN; CN vs MCI CANDI; FUHS Images (MRI, PET), Demographics, MMSE DL MLP AD 96.22%; MCI 92.22%
[38] AD vs NC ADNI Images (sMRI, PET), Clinical scores DL HAMMF Acc 93.15%; AUC (AD) 0.906; (NC) 0.908
MMD5 [39] MCI to AD Progression detection ADNI Images, Numeric ML SVM Acc: 68.1
[40] AD diagnosis with incomplete data ADNI Image, Numeric (MRI, PET, SNP) ML SVM Acc: sMCI vs pMCI = 74.3
[41] AD diagnosis along with missing data ADNI MRI, PET, CSF, DTI, Genetics, Neurological tests ML GB, RF, SVM Acc: AD vs NC: 93.40, Precision: 92.37, Recall: 91.44
[42] MCI vs AD ADNI Image, Numeric and neurological tests DL 3D ResNet, DNN, MLP Acc: MCI vs AD:87.20 AUC:0.913
[43] AD Progression Diagnosis ADNI Image, Numeric and neurological tests DL DNN Acc: AUC=0.962, Acc=0.92, Sen=0.88, Spe=0.95
[44] AD Progression Diagnosis ADNI Image (MRI), SNP, Protein DL FIL-DMGNN Acc 93.24%
MMD6 [45] MCI-to-AD progression prediction within three years from a baseline diagnosis ADNI Images, Numeric ML RF Acc: 93.95 (3-way classification), 93.94 (progression prediction)
[49] AD Prediction ADNI1, ADNI2, ADNI Go Numeric DL CNN-BiLSTM Acc=0.926, Pre=0.94, Rec=0.984
[48] AD vs MCI vs NC ADNI-TADPOLE and AIBL Images ML SVM, KNN Spe=0.825, Sen=0.84
[50] Predicting seven cognitive scores 2.5 years ahead using multitask regression for progression support ADNI Images (MRI), Numeric (Cognitive Scores) ML and DL RF, SVM, Gradient Boosting, Naive Bayes, DT, FURIA, MOEFC, BiLSTM DFBL+RF: Acc=82.6%, Prec=84.7%, Rec=84.8%, F1=84.7%; MRBL+RF: Acc=80.3%, Prec=81.7%, Rec=80.8%, F1=81.2%
[46] AD vs HC ADNI Images (MRI, PET), CSF DL CNN Acc 89.4%
[47] AD vs HC ADNI-2 Images (MRI, PET) CSF, Cognitive Tests DL GCN + SHAP boosting Acc 95.9%; Acc 91.9%
MMD7 [52] Progression detection ADNI Numeric (three time series and one static modality) ML and DL KNN, SVM, XGB, RF, DT, MLP Acc=0.995 (AD vs NC), 0.964 (AD vs MCI vs NC)
[51] AD vs pMCI vs sMCI vs CN, AD vs MCI vs CN, conversion time prediction ADNI Images (MRI), Numeric DL LSTM Acc= 0.884 ± 2.20 (4-way), 0.938 ± 1.48 (3-way)
[53] HC vs. sMCI vs. pMCI vs. AD ADNI Numeric ML RF Acc=0.562, Pre=0.52, Rec=0.56, F1=0.54
[54] 4 way classification problem ADNI Image+Numeric ML SVM, NB, RF, Extra Trees, AB, XGB Acc: 54.67 (XGB)
[55] MCI to AD Conversion Prediction ADNI1, ADNI2 Numeric (MRI Clinical Data) DL CNN Acc=0.80, Spe=0.81, Sen=0.784, AUC=0.871
[57] Techniques for AD progression detection by incorporating longitudinal and cross-sectional data ADNI Images DL CNN Acc=0.96, Spe=0.99, Sen=0.92, AUC=0.96
[56] AD prediction using multimodal fusion data ADNI Image, Numeric (sMRI, Biological and Clinical information) DL ANN Acc: 96
[58] Predicting DAT time-to-conversion for subjects with MCI ADNI Numeric (63 features from MRI, genetic, and CDC) DL ANN NA
MMD8 [59] Detecting Brain areas afflicted by AD ADNI Image, Numeric (MRI, PET and other biomarkers) ML SVM Acc: 80.9
[60] CN vs MCI vs dementia NACC; ADNI; NIFD; PPMI; AIBL; OASIS; 4RTNI; LBDSU; FHS Image (MRI), Demographics, Clinical ML CNN
[61] CN vs MCI vs AD ADNI Image (2D MRI slices), Neuropsych scores ML Extreme Learning Machine Acc 98.11%
MMD9 [62] Fusion of multimodal images Med Harvard online Images (MR-T1, MR-PD) DL sCNN NA
MMD10 [64] 2-way classification (HC vs MCI, MCI-C vs MCI-NC, HC vs AD) ADNI-1, ADNI-2 Images DL MLP Acc: 96.75 (AD vs HC)
[63] 2-way and 3-way classification ADNI Images DL SAE Acc: 91.40 (AD vs NC)
MMD11 [65] AD vs HC OASIS Image (CT, MRI), Numeric (EEG) DL CNN Acc=0.991, Pre=0.993
[66] CN vs MCI vs AD OASIS-3 Image (CT, MRI), CDR Scores ML and DL RF; CNN Acc 95.37%; Sen 93.5%; Spe 93.5%
MMD12 [67] HC, cMCI, sMCI, AD ADNI Image (sMRI, fMRI) DL GNN Acc=0.938
MMD13 [68] Progression Detection ADNI Numeric (CS, Medicine data) ML SVM Acc=0.905
MMD14 [69] EMCI to AD conversion prediction (5 year) ADNI-1, ADNI-2, and ADNI-GO Image and Tabular (DTI Scans and EHR) ML and DL RF+CNN Acc: 98.81
MMD15 [71] AD vs NC A total of 29 patients with probable AD (M/F = 13/16, mean age = 67.5 ± 9.5 years), and 23 HC subjects Image (T1-weighted, DTI fusion) ML SVM Acc=0.946, Spe=0.933, Sen=0.955
[70] 2-way: AD vs MCI, AD vs NC, MCI vs NC. 3-way: AD vs MCI vs NC ADNI Image DL CNN Acc: AD vs NC: 92.3

AI models applied in ML studies

The most frequently adopted ML model remained the SVM, appearing in almost half of the ML studies. Its popularity stems from its robustness in handling high-dimensional data, such as MRI, PET, and EEG features, and its ability to perform well even with relatively small sample sizes. SVM-based models achieved high accuracies, particularly when applied to combinations of MRI and PET imaging. For example, Jun et al. [21] developed a computer-aided diagnosis system for early detection of AD using multimodal neuroimaging data from the ADNI database, combining structural magnetic resonance imaging and PET images. An intermediate fusion strategy was used to extract the features from each modality, which were then fused for the joint classification of AD, MCI, and cognitively normal groups. This approach integrated anatomical and functional data, achieving a classification accuracy of 93.26%, outperforming unimodal methods and demonstrating the effectiveness of multimodal fusion for the early diagnosis of AD.

In another study, Li et al. [19] proposed a multimodal hyper-network modeling method for the two-way classification of MCI vs HC using SVM. The proposed multimodal hyper-connectivity networks encode complementary information conveyed by multiple modalities (ASL and BOLD fMRI) and thus provide a more comprehensive representation of the brain's functional organization. However, this study utilized a relatively small number of subjects. Random Forest is another widely used model that offers better handling of heterogeneous data. El-Sappagh et al. [45] employed a Random Forest model to integrate 11 modalities (including sMRI, PET, and CSF biomarkers), achieving a notable classification accuracy of 94.4%. The study also incorporated SHAP-based interpretation to highlight key predictive features.

Although SVMs are the most commonly used ML models, others such as RF, KNN, and Naive Bayes are employed less frequently, often due to limitations such as susceptibility to overfitting (RF with small datasets), sensitivity to noise and irrelevant features (KNN), or simplistic assumptions about feature independence (Naive Bayes). Moreover, these approaches depend heavily on domain-specific feature engineering, which can limit adaptability to new datasets or clinical environments.
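The trade-offs noted above can be made concrete with minimal from-scratch versions of two of these classifiers; the synthetic "multimodal" features and the even/odd train-test split below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tabular multimodal features (e.g. volumetrics + scores).
y = np.repeat([0, 1], 30)
X = rng.normal(y[:, None] * 0.8, 1.0, size=(60, 6))

def knn_predict(Xtr, ytr, Xte, k=5):
    """Plain k-nearest-neighbour vote; sensitive to noisy features."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return (ytr[idx].mean(axis=1) > 0.5).astype(int)

def gnb_predict(Xtr, ytr, Xte):
    """Gaussian Naive Bayes: assumes feature independence per class."""
    stats = [(Xtr[ytr == c].mean(0), Xtr[ytr == c].std(0) + 1e-9)
             for c in (0, 1)]
    preds = []
    for x in Xte:
        # Log-likelihood per class, dropping constants shared by both.
        ll = [(-0.5 * ((x - m) / s) ** 2 - np.log(s)).sum()
              for m, s in stats]
        preds.append(int(ll[1] > ll[0]))
    return np.array(preds)

tr = np.arange(60) % 2 == 0   # crude even/odd split for a sanity check
print("kNN acc:", (knn_predict(X[tr], y[tr], X[~tr]) == y[~tr]).mean())
print("GNB acc:", (gnb_predict(X[tr], y[tr], X[~tr]) == y[~tr]).mean())
```

The Naive Bayes independence assumption is visible in the per-feature sum inside `gnb_predict`: correlated biomarkers would violate it, which is exactly the limitation cited above.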

AI models applied in DL studies

DL architectures are increasingly prominent in AD research due to their ability to learn hierarchical features directly from raw data. CNNs are the most widely used, particularly for imaging modalities such as sMRI, PET, and DTI. Studies such as Abdelaziz et al. [32] and Leng et al. [27] demonstrated CNN-based fusion frameworks achieving classification accuracies exceeding 97%. Subramanyam et al. [25] proposed a wrapper model based on Inception-ResNet to classify HC, MCI, and AD using MRI and PET images. Preprocessed MRI images were fused with PET using two-dimensional Fourier and discrete wavelet transform (DWT), reconstructed via inverse Fourier and DWT, and evaluated using multiple CNNs, with Inception-ResNet identified as the best performer for three-class AD classification.

In another study, Karim et al. [70] utilized MRI and DTI images and proposed a transfer learning strategy using CNNs to automatically classify brain scans from small ROIs, such as limited hippocampal slices. A LeNet-like CNN architecture was employed to construct and combine AD stage classification models. Several transfer learning strategies were evaluated, including cross-modal transfer learning (sMRI and DTI), cross-domain transfer learning (MNIST), and a hybrid of both. For certain classification tasks (MCI/AD and NC/MCI), initializing network parameters through transfer learning improved AD stage classification performance by more than 5 points.

Graph Neural Networks (GNNs) [67] and Residual Networks (ResNets) also feature prominently in more recent work, enabling the modeling of topological brain features or deeper architectures. Unlike ML models, DL frameworks such as the Dual Interaction Stepwise Fusion Classifier (DISFC) [73] and Hierarchical Attention-Based Multimodal Fusion (HAMMF) [38] demonstrate increasing sophistication in cross-modal feature fusion. These models often outperform traditional approaches, achieving AUC values as high as 0.962, and integrate explainability techniques such as SHAP and attention maps to improve clinical relevance. However, they are data-hungry and prone to overfitting when trained on limited or homogeneous datasets like ADNI.
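The cross-modal attention idea behind models such as [67] can be sketched in a few lines; the token embeddings, projection matrices, and dimensions below are arbitrary random stand-ins, not the reviewed architectures:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d = 8                             # shared embedding width (hypothetical)
fmri = rng.normal(size=(10, d))   # 10 fMRI-derived token embeddings
smri = rng.normal(size=(12, d))   # 12 sMRI-derived token embeddings

# Hypothetical projection matrices; in a trained model these are learned.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Cross-modal attention: fMRI tokens query sMRI tokens.
Q, K, V = fmri @ Wq, smri @ Wk, smri @ Wv
A = softmax(Q @ K.T / np.sqrt(d))   # (10, 12) cross-modal attention weights
fused = fmri + A @ V                # residual update of fMRI tokens with
                                    # attended sMRI information

print(fused.shape)          # (10, 8)
print(A.sum(axis=1))        # each row of attention weights sums to 1
```

Because `A` is an explicit, normalized weight matrix, it doubles as an interpretability artifact: each row shows which sMRI tokens informed a given fMRI token.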

Despite these advances, several limitations recur across the reviewed studies. Many pipelines still rely on domain-specific feature engineering, which may limit adaptability to new datasets or clinical settings, and although many models report strong performance, few conduct rigorous external validation to ensure robustness. In several studies, interpretability is treated as a supplementary aspect rather than a core design objective, highlighting the need for integrated, transparent, and clinically meaningful models. Overall, both ML and DL approaches show promise for AD diagnosis using multimodal data. ML models, particularly SVM and RF, are favored for simplicity and interpretability, whereas DL models such as CNNs and GNNs better capture complex, nonlinear multimodal interactions. Recent trends point toward advanced fusion strategies and increased emphasis on explainability; however, challenges related to generalization, modality balance, and clinical readiness persist. Future research should focus on developing robust, transparent, and clinically applicable models validated across diverse populations and enriched with real-world health data.

The information displayed in Table 10 provides a detailed overview of all 54 papers dedicated to AD detection using multimodal data. This table offers a thorough analysis of these studies, encompassing key details such as performance metrics, datasets utilized, and the data types associated with the respective multimodal data combinations.

Figure 11 effectively addresses RQ4. Across the 54 papers, 86 independent studies were found to utilize multimodal data in the detection of AD. The analysis reveals a near-even split between the two paradigms, with DL studies (45) slightly outnumbering ML studies (41). Within the ML studies, the SVM classifier emerges as the most frequently employed, highlighting its significance, while DL studies predominantly feature CNN as the classifier of choice.

Fig. 11.

Fig. 11

Sankey Diagram: (i) first level: number of AI studies for AD prediction using multimodal data, classified as deep learning, machine learning, and hybrid models; (ii) second level: number of individual classifiers used in the reviewed articles

Explainable AI frameworks used for interpretation of AI models

This section explores RQ5: Which XAI frameworks are used to interpret AI models for Alzheimer’s Disease using multimodal data?

This RQ focuses on identifying XAI frameworks and techniques commonly used to interpret AI-based multimodal classification models. In healthcare, explainable embedded systems can reduce clinical workload and improve diagnostic precision. This review identifies widely adopted XAI frameworks in AD research, including LIME, SHAP, GradCAM, Saliency Maps, Layerwise Relevance Propagation (LRP), and Attention Maps. Table 11 presents a consolidated overview of the reviewed studies, summarizing key aspects such as the XAI framework used, classifier architecture, classification task, fused modalities, and interpretation type, while significant features identified through these methods are reported in Table 12. Overall, these XAI approaches improve the transparency and interpretability of AI-based AD diagnostics, strengthening clinical trust and supporting informed decision making by enabling deeper insight into model behavior.

Table 11.

Studies of XAI frameworks used for interpreting AI models with multimodal data to detect AD

XAI Refs Classifier Model Classification Task Data Modality Interpret Type
SHAP [45] ML RF CN vs MCI vs AD; sMCI vs pMCI Cognitive Score, Neuropsychological battery, MRI, PET, Genetics, Medical History (lab tests, age, vital signs) Visual
SHAP [69] DL/ML RF, CNN Earliest MCI vs AD Diffusion Tensor Imaging, EHR Visual
SHAP [42] DL 3D ResNet, DNN, MLP MCI vs AD Image, Clinical, Neurological Visual
SHAP [43] DL DNN MCI vs AD Image, Numeric, Neurological Visual
SHAP [66] ML Cognitive-ELM CN vs MCI vs AD Cognitive tests, MRI/PET, Blood tests Visual, Numeric
SHAP [61] ML Cognitive-ELM AD vs non-AD; CN vs MCI vs AD 2D MRI slices, Cognitive scores Visual, Numeric
SHAP [37] ML, DL Fractional CNN, RF NC vs MCI vs AD; HC vs sMCI vs pMCI MRI, PET, CT Visual, Numeric
SHAP [47] DL GCN, SHAP Boosting AD vs NC Cognitive tests, MRI/PET, Blood tests, CSF Numeric
LRP [37] ML, DL Fractional CNN, RF NC vs MCI vs AD; HC vs sMCI vs pMCI MRI, PET, CT Visual, Numeric
GradCAM [57] DL 3D CNN, BRNN CN vs progressed to AD vs AD Demographics, Cognitive Scores, Clinical biomarkers, MRI Visual
GradCAM [27] DL 3D multimodal cross-enhanced fusion network (MENet) AD vs NC; SMC vs NC sMRI, FDG-PET Visual
GradCAM [20] DL MDL-Net MCI vs CN; AD vs MCI sMRI, PET Visual
GradCAM [35] DL FCRN + MLP AD vs CN; CN vs MCI MRI, PET, Demographics, MMSE, ApoE4 Visual, Numeric
GradCAM [46] DL FIL-DMGNN AD vs NC; EMCI vs LMCI; AD vs MCI Gene, MRI, Protein, Clinical Visual, Numeric
GradCAM [44] DL FIL-DMGNN AD vs NC; EMCI vs LMCI; AD vs MCI Gene, MRI, Protein, Clinical Visual
GradCAM [37] ML, DL Fractional CNN, RF NC vs MCI vs AD; HC vs sMCI vs pMCI MRI, PET, CT Visual, Numeric
Attention Map [55] DL ResNet MCI vs AD MRI, Clinical data Visual
Attention Map [38] DL HAMMF AD vs NC MRI, PET Visual
Saliency Map [32] DL Twin-SVM (ResNet50 features) NC vs AD; NC vs MCI; sMCI vs pMCI MRI, PET Visual, Numeric
Saliency Map [60] ML, DL RF, CNN NC vs MCI vs AD; sMCI vs pMCI MRI, CT Numeric
LIME [37] ML, DL Fractional CNN, RF NC vs MCI vs AD; HC vs sMCI vs pMCI MRI, PET, CT Visual, Numeric

Table 12.

Studies of XAI frameworks with significant features used for interpreting AI models to detect AD

XAI Refs Disease Classifier Significant features
SHAP [45] AD ML CDRSB, MMSE, DigitalTotalScore, MOCA, FAQ, ADNI_MEM, CDGLOBAL, RAVLT_immediate, ADAS 13
SHAP [69] AD DL/ML Age, PTRACCAT, APOE4, ADAS11, ADAS13, MMSE, FAQ, Ventricles, Hippocampus
SHAP [42] AD DL FDR followed by RAVL-I, CDR, MMSE Score
SHAP [43] AD DL LDELTOTAL, RAVLT, MMSE, PTGENDER
SHAP [66] AD ML CBB, MMSE, Edu, ApoE, Age, Gender
SHAP [61] AD ML Top-entropy slice embeddings, cognitive scores
SHAP [37] AD ML, DL Judgment, Memory, Orient, Sumbox
SHAP [47] AD DL CBB, MMSE, Edu, ApoE, Age, Gender
LRP [37] AD ML, DL Judgment, Memory, Orient, Sumbox
GradCAM [57] AD DL Inter and intra-slice features, inter-volumetric relationship features
GradCAM [27] AD DL For AD subjects - atrophy in ventricle region, lingual, amygdala, fusiform, hippocampus, superior temporal gyrus, medial and paracingulate gyrus by sMRI; and atrophy in superior cerebellum, lingual, calcarine, hippocampus, thalamus, middle temporal gyrus, middle frontal gyrus by FDG-PET. For SMC subjects - atrophy in ventricle regions, inferior frontal gyrus, rolandic operculum, precuneus, lingual, hippocampus, and caudate by sMRI; and atrophy in inferior frontal gyrus, middle frontal gyrus, superior parietal cortex by FDG-PET
GradCAM [20] AD DL PET, GM, WM, fused feature F_c
GradCAM [35] AD DL Neuropsychological test scores
GradCAM [46] AD DL Hippocampus; amygdala, insula
GradCAM [44] AD DL Top 30 risk features, protein levels, left hippocampus
GradCAM [37] AD ML, DL Judgment, Memory, Orient, Sumbox
Attention Map [55] AD DL Extract features from discriminative areas of MRI and attach patch scores to discriminate features of importance like hippocampus, ventricle, temporal, and parietal lobe
Attention Map [38] AD DL Texture, intensity, shape, hippocampus, amygdala, insula
Saliency Map [32] AD DL Hippocampus, cingulum, occipital regions
Saliency Map [60] AD ML, DL Hippocampus, cerebral cortex, ventricular system
LIME [37] AD ML, DL Judgment, Memory, Orient, Sumbox

Studies based on SHAP

SHAP (SHapley Additive exPlanations) is a framework that explains ML predictions by computing Shapley values from cooperative game theory. It is model-agnostic, ensures consistency across different models, and can be applied to a wide variety of model types.
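As a concrete illustration of the underlying computation, the exact Shapley values of a tiny model can be enumerated directly; the linear "risk model", its weights, and the all-zero baseline below are invented for illustration and are not from any reviewed study (practical SHAP implementations approximate this sum):

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_values(f, x, baseline):
    """Exact Shapley values by subset enumeration (feasible for few features).
    Absent features are set to a baseline value, as in interventional SHAP."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                z = baseline.copy()
                z[list(S)] = x[list(S)]       # coalition S present
                z_i = z.copy()
                z_i[i] = x[i]                 # add feature i to S
                weight = 1.0 / (n * comb(n - 1, r))
                phi[i] += weight * (f(z_i) - f(z))
    return phi

# Toy linear "risk model" over three hypothetical biomarkers.
w = np.array([0.7, -0.4, 0.2])
f = lambda z: float(w @ z)

x = np.array([2.0, 1.0, -1.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)
print(phi)                                    # equals w * (x - base) here
print(np.isclose(phi.sum(), f(x) - f(base)))  # efficiency property: True
```

The final check demonstrates the additivity (efficiency) property SHAP summary plots rely on: the attributions sum exactly to the difference between the prediction and the baseline output.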

The research by El-Sappagh et al. [45] describes an ML model that integrates 11 modalities from the ADNI data set to predict AD progression. The model, a two-layer structure using random forest as the classifier algorithm, achieves high accuracy (94.4%) with a combination of 28 features. The study emphasizes the importance of interpretable models in the clinical domain, highlighting the trade-off between accuracy and interpretability. Integrating MRI brain volume features improves accuracy by about 1%, providing crucial information for an effective prediction. Comparison with other ML models, such as SVM, KNN, Naïve Bayes, and Decision Trees (DT), demonstrates superior performance and interpretability. The article introduces an explainability approach based on SHAP to enhance understanding of model decisions. The interpretable models, including the DT, serve as "explainers" supporting the main model’s decisions. Figure 12 is an example of a SHAP summary plot used to show the contribution of all the features of the second layer for the pMCI and sMCI classes. The article concludes that the proposed model achieves high accuracy and enhances interpretability and trust in the clinical context, providing insights into AD progression.

Fig. 12.

Fig. 12

SHAP summary plot illustrating feature contributions to AD predictions across all samples. The visualization highlights the relative importance and directional influence of each feature on model output; image reproduced with permission from - [45]

Velazquez et al. [69] introduce a multimodal ensemble model to predict the conversion of individuals with Early Mild Cognitive Impairment (EMCI) to AD within five years. The ensemble combines a RF model using EHR features with a CNN that analyzes DTI scans. The study addresses imbalanced data sets, determines the optimal weight distribution, and emphasizes the interpretability of the predictions. The proposed model outperforms recent multimodality models for the prediction of AD conversion and demonstrates superior predictive accuracy. The ensemble’s performance is visualized using SHAP, and a Flask Python application is developed for clinical decision support, providing local explainability. Despite occasional mispredictions, the ensemble model shows promise in avoiding errors and achieving correct predictions. The article concludes with the potential clinical implementation of the model and its ability to predict the probability of EMCI to AD conversion within a five-year period, offering advancements over current state-of-the-art models.

In a study on the early detection and prediction of AD progression, focused on the conversion from MCI to AD, Peixin et al. [74] proposed a Hierarchical Attention-Based Multimodal Fusion framework (HAMF) that integrates imaging (MRI), genetic (SNP), and clinical data. The authors used ADNI data and a 3D ResNet for MRI, a stacked denoising autoencoder for SNPs, and a deep neural network for clinical data. These features are projected into a shared latent space using nonlinear gating, and hierarchical attention mechanisms are applied to dynamically weight and fuse the modalities, capturing cross-modal interactions. The final classification is performed using a two-layer multilayer perceptron, and the model is evaluated using five-fold cross-validation with metrics such as accuracy, sensitivity, specificity, F1 score, and AUC. The HAMF model achieved state-of-the-art results, with an accuracy of 87.2% and an AUC of 0.913, outperforming unimodal and other multimodal approaches. Clinical data emerged as the most informative single modality, but combining it with MRI provided the best predictive performance. The study's ablation experiments confirmed the effectiveness of hierarchical attention and nonlinear gating. SHAP was used to interpret the model and identify the most influential clinical features. The findings demonstrate that multimodal fusion, particularly with advanced attention mechanisms, significantly enhances the prediction of MCI-to-AD conversion, providing valuable information for clinical decision support and future research.

The researchers, Yifan et al. [73], developed a model named the Dual Interaction Stepwise Fusion Classifier (DISFC), designed to extract features from imaging, clinical and genetic modalities using a parallel three-branch network. These features were then fused through a stepwise process that incorporates both intra-modal and inter-modal interaction modules. The authors used ADNI data including structural magnetic resonance imaging, clinical evaluations, and genetic polymorphism data. The objective of this study was to develop and validate an interpretable deep learning model that takes advantage of both interaction effects and multimodal data to improve the long-term prediction of conversion from MCI to AD. DISFC was trained and validated using stratified ten-fold cross-validation in the ADNI-1 and ADNI-GO/2 cohorts and tested in an independent set from the ADNI-3 cohort. The results showed that the DISFC model achieved high predictive performance, with an AUC of 0.962 and a precision of 92.92% in cross-validation and an AUC of 0.939 and a precision of 92.86% in the independent test. The findings suggest that integrating interaction effects and multimodality into deep learning frameworks can produce more accurate and interpretable predictions for the progression of MCI to AD, thus supporting early diagnosis and intervention. The authors also note that, while the model outperformed existing methods, limitations include the use of data from a single institution and a relatively small sample size, suggesting the need for further validation on larger, more diverse datasets.

Notably, all of these studies emphasize the need to evaluate feature importance under different scenarios and provide insight into the relevance of specific features for AD prediction. Overall, this research contributes to understanding feature selection strategies and their impact on predictive modeling for AD.

Studies based on GradCAM

GradCAM is a technique that aims to make CNN models more interpretable by identifying the regions of an input image that most influence predictions [75]. It uses the gradient of the target class score with respect to the feature maps of the final convolutional layer to generate a localization map that highlights crucial regions within an image. Each feature map is assigned an importance weight derived from its spatially pooled gradients. The result is a coarse heatmap that highlights the regions most responsible for the prediction, supporting explanation.
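The core weighting step can be sketched in a few lines of numpy; the feature maps and gradients below are random stand-ins for what a backward pass through a trained CNN would produce:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM: weight each feature map by its spatially pooled gradient,
    sum over channels, and apply ReLU to keep positively influential regions."""
    alpha = gradients.mean(axis=(1, 2))              # (C,) channel weights
    cam = np.tensordot(alpha, feature_maps, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0)                         # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                        # normalize to [0, 1]
    return cam

# Hypothetical activations/gradients from a conv layer (C=4, H=W=6);
# in practice these come from a trained model's forward/backward pass.
rng = np.random.default_rng(3)
acts = rng.random((4, 6, 6))
grads = rng.normal(size=(4, 6, 6))

heatmap = grad_cam(acts, grads)
print(heatmap.shape, float(heatmap.min()) >= 0.0)   # (6, 6) True
```

In imaging studies the resulting low-resolution map is upsampled to the input size and overlaid on the scan, which is how the figures cited below were produced.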

The study by Rahim et al. [57] proposes a new framework to detect the progression of AD using longitudinal magnetic resonance data. The study aims to investigate the role of MRI time series data in predicting AD progression. The proposed model demonstrates high and stable performance by integrating temporal features from longitudinal MRI data using a 3D CNN and a Bidirectional Recurrent Neural Network (BRNN). Including multimodal input data, such as demographic data from patients and cross-sectional data, enhances the performance of the model. The authors also introduce a novel approach to XAI techniques, which provides medically acceptable insights into the network decision-making process. This approach involves a unique time series visual explainability for 2D slices and 3D brain surfaces involving demographics and cross-sectional data.

Figure 13 shows a GradCAM activation map of the proposed network that specifies the major brain regions influencing the AD detection results. The study shows that the proposed framework outperforms existing studies and other DL models through grid-search hyperparameter optimization. In another study to detect AD, Leng et al. [27] developed a novel method for learning features from multilevel and multimodal data using a cascaded deep CNN. The study divided feature extraction networks into three categories: 2D-based, 3D-based, and transformer-based. The proposed model addressed the loss of spatial information using attention to re-calibrate the feature weights. In addition, the network integrated learning mechanisms into the system, which helped to capture spatial correlations between features. The authors used spatial and channel enhancement blocks to improve the accuracy of the FDG-PET images and the sMRI features, respectively. To diagnose AD and identify atrophy through sMRI, the network used a cross-enhanced fusion module and a multitask, multilevel feature adversarial network. The proposed methodology demonstrated enhanced discrimination power and improved feature representation capabilities. However, this approach had limitations, such as the difficulty of visualizing features and neglecting the correlation, complementarity, and heterogeneity between different modalities.

Fig. 13.

Fig. 13

Grad-CAM visualization highlighting voxel regions contributing to AD predictions. The heatmap indicates model-sensitive areas in structural imaging, offering insight into spatial patterns associated with AD; image reproduced with permission from –[57]

Studies based on attention map

Attention mechanisms have become increasingly important in AI, particularly in natural language processing and computer vision tasks. They improve the performance of DL models by intelligently weighting input characteristics and extracting high-level features. Attention mechanisms also play a critical role in XAI by providing information on the regions of interest during decision-making processes [76]. As attention mechanisms continue to shape the AI landscape, they raise questions about the interpretability of the model and the intricate interplay between attention, feature extraction, and decision making.
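A minimal sketch of how attention weights double as modality-level explanations, assuming hypothetical per-modality embeddings and a random (untrained) context vector standing in for learned parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)

# Hypothetical per-modality embeddings for one subject.
modalities = {"MRI": rng.normal(size=8),
              "PET": rng.normal(size=8),
              "Clinical": rng.normal(size=8)}

# A context vector scores each modality; in a trained model this is learned,
# here it is random for illustration only.
context = rng.normal(size=8)

scores = np.array([v @ context for v in modalities.values()])
weights = softmax(scores)                 # interpretable modality weights
fused = sum(w * v for w, v in zip(weights, modalities.values()))

for name, w in zip(modalities, weights):
    print(f"{name}: weight = {w:.3f}")    # which modality drove the output
print(fused.shape)
```

Because the weights are normalized and attached to named inputs, printing them is already a rudimentary explanation of the fused representation, which is the property the studies below exploit.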

Knowledge distillation is a popular method of transferring knowledge from a larger teacher network to a smaller student network [77]. However, there are challenges when transferring knowledge from a multimodal teacher network to a single-modal, MRI-based student network. Guan et al. [55] explore the evolution, applications, and impact of attention mechanisms on modern deep-learning models for AD detection. In this study, the authors perform knowledge transfer between a multimodal teacher network and a single-modal student network by obtaining probability estimates for multiple instances, produced by feeding image patches into the networks, which serve as additional supervision for the student network. The student network, based on ResNet with a specific residual block, is trained under the guidance of the teacher network, using an MRI feature extractor with a similar architecture. A multimodal attention network is chosen as the teacher network because it is lightweight and effective and shares an architecture similar to the student network. The teacher network combines the student network with the multimodal attention module to fuse multimodal features, thus improving overall performance. The important features reveal significant regions in the image for the target class, serving as attention maps, which are a visual explanation of a deep neural network. Figure 14 presents attention maps for interpreting AD, upscaled to the same size as the input MRI image and overlaid on it. The columns in the figure represent three anatomical planes. These results are computed by averaging all correctly predicted test cases.

Fig. 14.

Fig. 14

Example visualization of a multimodal attention map used in AD prediction. The illustration highlights how attention mechanisms prioritize modality-specific features; image reproduced with permission from – [55]

The results of MCI conversion prediction between the student network, teacher network, and students trained with various knowledge distillation methods are compared. The comparison demonstrates the effectiveness of knowledge transfer from the multi-modal teacher network. The multi-modal teacher network achieved the highest prediction AUC. The proposed frameworks were evaluated on three public datasets, namely ADNI-1, ADNI-2, and AIBL. The study showcases its efficacy in transferring multimodal knowledge to an MRI-based student network, highlighting its ability to model subtle features in MRI data. The study is shown to be effective in feature learning and multimodal data fusion using stage-wise deep neural networks for dementia diagnosis.

Studies based on layerwise relevance propagation

Layer-wise Relevance Propagation (LRP) is a prominent XAI technique developed to interpret neural network predictions by assigning relevance scores to input features based on their contribution to the final output. It operates by propagating the prediction backward through the network using specific rules, a process theoretically grounded in deep Taylor decomposition [78, 79]. LRP is applicable to a range of architectures, including CNNs and RNNs. It is widely used in image classification to generate heatmaps that localize influential regions and has been extended to domains such as text analysis, medical diagnosis, and predictive maintenance. Although the quality of the explanations can vary with architecture and parameter settings, LRP remains a powerful tool for gaining insight into complex AI systems and plays a central role in the broader XAI landscape.
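The epsilon rule at the heart of LRP can be sketched for a tiny two-layer ReLU network; the weights are random and bias terms are omitted for simplicity, so this illustrates the propagation rule itself rather than any reviewed implementation:

```python
import numpy as np

def lrp_dense(a, W, R_out, eps=1e-6):
    """LRP epsilon rule for one dense layer: redistribute the output
    relevance R_out onto the inputs a in proportion to the contributions
    z_ij = a_i * W_ij, with a small stabilizer eps in the denominator."""
    z = a @ W                              # pre-activations (no bias here)
    s = R_out / (z + eps * np.sign(z))     # stabilized relevance ratio
    return a * (W @ s)                     # relevance assigned to inputs

rng = np.random.default_rng(5)
x = rng.random(4)                          # 4 input "features"
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 1))

h = np.maximum(x @ W1, 0)                  # hidden ReLU layer
out = h @ W2                               # scalar output (class score)

R_h = lrp_dense(h, W2, out)                # relevance on hidden units
R_x = lrp_dense(x, W1, R_h)                # relevance on input features

# Conservation: input relevances sum (approximately) to the output score.
print(np.isclose(R_x.sum(), out.item(), atol=1e-3))
```

The conservation check is the property that makes LRP heatmaps quantitatively meaningful: relevance is redistributed layer by layer, not created or destroyed (up to the epsilon stabilizer).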

A significant feature of LRP is its ability to generate pixel-wise relevance heatmaps post hoc. Khosroshahi et al. [37] investigate the use of LRP to interpret DL predictions for AD using a CNN trained on multimodal MRI data for three classes: CN, MCI, and AD. The authors used the pixel-wise relevance heatmaps to localize the brain regions that contribute to each class prediction, highlighting biologically relevant structures such as the hippocampus, ventricles, and cortical regions. The study shows that the CNN achieved competitive classification results and that LRP substantially improved interpretability by visually linking model predictions to anatomical features, providing clinically meaningful insights.

Studies based on LIME

Local Interpretable Model-Agnostic Explanations (LIME) is an XAI technique that interprets predictions by approximating complex black-box models with locally faithful human-interpretable approximations [80]. By perturbing the input data around a specific instance and observing the resulting changes in the model output, LIME constructs a simpler and interpretable model, such as a linear regressor or decision tree, that closely mimics the behavior of the original model within the local neighborhood [8]. Its strength lies in two core properties: interpretability, which facilitates intuitive understanding of how input features relate to predictions, and local fidelity, ensuring that explanations remain accurate in the vicinity of the instance being analyzed. LIME is model-agnostic and broadly applicable to tabular, textual, and image data. In the context of AD diagnosis, LIME can interpret predictions by identifying critical image regions, such as contiguous patches of brain scans, that drive classification decisions, thus improving model transparency and supporting clinical trust.
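The perturb-query-fit procedure can be sketched in plain Python: sample points around the instance, query the black-box model, weight each sample by its proximity to the instance, and fit a weighted linear surrogate. The two-feature `black_box` function and all constants below are illustrative stand-ins, not a model from any reviewed study.

```python
import math
import random

def black_box(x):
    """Stand-in for a complex model: a nonlinear function of two standardized
    features (e.g., hippocampal volume and a cognitive score)."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1])))

def lime_explain(f, x0, n_samples=500, sigma=0.5, seed=0):
    """Fit a locally weighted linear surrogate around instance x0."""
    rng = random.Random(seed)
    d = len(x0)
    X, y, w = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0, sigma) for xi in x0]    # perturb around x0
        dist2 = sum((a - b) ** 2 for a, b in zip(z, x0))
        X.append([1.0] + z)                            # intercept + features
        y.append(f(z))
        w.append(math.exp(-dist2 / (2 * sigma ** 2)))  # proximity kernel
    # Weighted least squares via normal equations: (X^T W X) beta = X^T W y
    k = d + 1
    A = [[sum(w[n] * X[n][i] * X[n][j] for n in range(len(X))) for j in range(k)]
         for i in range(k)]
    b = [sum(w[n] * X[n][i] * y[n] for n in range(len(X))) for i in range(k)]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            m = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta[1:]  # local feature weights (intercept dropped)

coefs = lime_explain(black_box, x0=[0.2, -0.1])
```

The surrogate's coefficients recover the local effect directions of the two features, which is exactly the "local fidelity" property described above: the explanation is accurate near the instance, not globally.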

The study by Khosroshahi et al. [37] used AI approaches that integrate ML and DL models, specifically fractional CNN and RF, alongside multimodal data including MRI, PET, and CT scans to predict various stages of AD. In this study, LIME helped visualize which image regions (via pixel-level relevance) contributed most to classification decisions, improving interpretability and clinical trust.

Studies based on saliency map

The saliency map is a crucial concept in DL and computer vision, particularly for interpreting CNNs [81]. By examining the feature maps at each CNN layer, saliency maps emphasize specific parts of an image, identifying significant pixels and assigning greater brightness to more salient regions. Brighter pixels indicate higher saliency, making the map a valuable tool for drawing attention to specific areas of the image. Typically presented in grayscale, a saliency map can be transformed into a colored format for improved visual interpretation. Often referred to as "heat maps," these visualizations show which pixels most strongly influence object classification. Saliency maps are widely used in visual attention models, enhancing our understanding of essential image features.
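At its core, a gradient-based saliency map is the magnitude of the class score's derivative with respect to each input pixel. The toy sketch below estimates this by finite differences for a hypothetical linear-plus-sigmoid classifier; real implementations compute the same quantity by automatic differentiation through the trained CNN.

```python
import math

def model_score(pixels, weights, bias=0.0):
    """Toy differentiable classifier: weighted sum of pixel intensities + sigmoid."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def saliency_map(pixels, weights, eps=1e-4):
    """Absolute gradient of the class score w.r.t. each input pixel,
    estimated here by central finite differences."""
    sal = []
    for i in range(len(pixels)):
        hi = pixels[:]; hi[i] += eps
        lo = pixels[:]; lo[i] -= eps
        sal.append(abs(model_score(hi, weights) - model_score(lo, weights)) / (2 * eps))
    # Normalize to [0, 1] so brighter values mark more salient pixels.
    m = max(sal) or 1.0
    return [s / m for s in sal]

pixels = [0.2, 0.8, 0.5, 0.1]     # flattened 2x2 "scan" (hypothetical)
weights = [0.1, 2.0, -1.5, 0.05]  # this toy model relies most on pixel 1
sal = saliency_map(pixels, weights)
```

After normalization, the pixel with the largest weight magnitude receives the brightest value, mirroring how saliency heatmaps highlight the regions a CNN relies on most.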

In a recent study, Abdelaziz et al. [32] used a DL framework based on twin support vector machine (Twin SVM) for AD classification using multimodal data, specifically sMRI and PET scans. The model combines a CNN-based feature extractor with a Twin SVM classifier to leverage the strengths of the DL and traditional ML paradigms. To enhance interpretability, the authors incorporated saliency maps, which visualize the most influential regions of the input images contributing to the model predictions. In another study, Xue et al. [60] proposed a dual-branch deep learning framework to classify AD stages using fused MRI and CT modalities, enhanced through a Jacobian Augmented Loss (JAL) function for improved multimodal alignment. A notable aspect of their work is the use of saliency maps via Gradient-weighted Class Activation Mapping (GradCAM) to interpret model predictions. In both studies [32, 60], the saliency maps highlighted key neuroanatomical regions implicated in AD pathology, such as the hippocampus, cortical structures, and ventricles, validating the biological relevance of the model outputs.

This review highlights the increasing adoption of multimodal XAI techniques for the diagnosis of AD. Techniques such as SHAP, GradCAM, LRP, Saliency Maps, LIME, and Attention Maps have been employed to interpret complex AI models by identifying key features and brain regions relevant to disease progression. SHAP remains the most widely adopted, providing local and global feature attributions across multimodal pipelines. GradCAM and Attention Maps are widely applied in DL models to highlight spatial activation patterns within neuroimaging data, whereas LRP provides fine-grained, pixel-level explanations grounded in strong theoretical foundations. LIME and Saliency Maps, though comparatively less explored, demonstrate strong potential to improve the transparency of CNN-based multimodal classifiers. In particular, the reviewed studies demonstrate that integrating XAI methods with multimodal data not only improves interpretability and clinical relevance but also supports the development of reliable decision support systems.

Performance trends and impact of multimodal data on model performance

This section highlights RQ5: What are the performance trends of AI models in multimodal Alzheimer’s Disease diagnosis across different datasets, fusion strategies, and classifiers?

The application of AI models in the diagnosis of AD has shown significant advancements, particularly through the integration of multimodal data. This section provides a comprehensive analysis of the reported performance trends of these AI models, examining how their efficacy is influenced by the choice of datasets, the strategies employed for data fusion, and the types of classifiers utilized. The discussion draws directly from a synthesis of recent studies, highlighting both successes and persistent challenges in this critical area of medical diagnostics.

Overview of reported performance metrics in multimodal AD diagnosis

AI models leveraging multimodal data have demonstrated varied performance across different diagnostic tasks in Alzheimer’s Disease. A clear distinction emerges between the high accuracy achieved in binary classification tasks and the more complex challenges presented by multi-class classification and disease progression prediction.

Studies consistently report high accuracy rates for binary classification tasks, particularly when distinguishing individuals with AD from cognitively normal (CN) controls. This strong discriminative power is evident across various modeling approaches. For instance, investigations by [21] and [29] achieved remarkable accuracies of 97.13% and 96.9% (for ADNI1) / 96.8% (for ADNI2), respectively, for AD vs. NC classification. These studies predominantly employed Support Vector Machine (SVM) classifiers on integrated MRI and PET imaging modalities. Similarly, deep learning architectures have yielded exceptional results, with a study by [27] reporting a 97.67% accuracy using Convolutional Neural Networks (CNNs) on sMRI and PET images, and another by [31] achieving 98% accuracy with a ResNet model on MRI and PET data for AD vs. NC classification. These figures underscore the robust capability of multimodal AI models to identify clear pathological differences between overt AD and healthy states.

While binary classification exhibits impressive performance, the accuracy of AI models tends to decrease when faced with more intricate multi-class problems or the prediction of disease progression. Tasks involving the differentiation of multiple cognitive states, such as 3-way (AD vs. MCI vs. HC) or 4-way (HC vs. sMCI vs. pMCI vs. AD) classification, present a more significant challenge. For example, a study by [23] reported accuracies of 74% for MCI vs. NC and 75% for AD vs. MCI, which are notably lower than the 90.74% achieved for AD vs. NC classification. More challenging distinctions, such as 4-way classification, have shown accuracies as low as 56.2% by [53] and 54.67% by [54].

Predicting disease progression, such as the conversion from Mild Cognitive Impairment (MCI) to AD, also remains a considerable hurdle. Reported accuracies for this task range from 68.1% by [39] to 80% by [55]. Although some studies, like [45], have reported higher accuracies (93.94%) for 3-way classification and progression prediction, these are generally outliers compared to the typical 3-way or 4-way performance observed in multiclass classification tasks.

In addition to the inherent difficulty of differentiating MCI subtypes and predicting progression to AD, other factors help explain the performance gap. Data issues such as class imbalance, subtle and heterogeneous changes in early disease, differences in diagnostic criteria between cohorts, and limited long-term follow-up make reliable modeling difficult. AI models also contend with small sample sizes, irregular data timing, and mixed input types. Most current methods combine data types in a simple way, making it difficult to capture the complex relationships needed to characterize early disease. Moreover, few studies incorporate biological knowledge or built-in constraints, so models often overfit to short-term patterns instead of learning meaningful disease progressions. These challenges explain why models perform well at distinguishing AD from CN cases but are much less reliable at distinguishing MCI subgroups.

Influence of datasets on diagnostic performance

The choice of dataset plays a pivotal role in shaping the reported performance of AI models for multimodal AD diagnosis. A review of the literature reveals a significant reliance on the ADNI dataset, which has implications for the generalizability and scope of current findings.

Despite ADNI’s prominence, only a limited number of studies deviate from its exclusive use. For instance, [18] utilized an EEG database, [19] combined ADNI with an in-house dataset, [25] and [65] employed the OASIS dataset, and [71] used a smaller, specific patient cohort. While some of these non-ADNI studies report respectable accuracies (e.g., 87.2% for [18] and 99.1% for [65]), the consistent trend is that the highest reported accuracies are predominantly observed in studies that leverage the ADNI dataset.

The overwhelming reliance on the ADNI dataset, while beneficial for comparability and reproducibility within the research community, raises significant considerations regarding the generalizability of these AI models to real-world clinical populations. ADNI participants are typically recruited under strict inclusion and exclusion criteria, often from academic medical centers, which may not fully represent the demographic, comorbidity, socioeconomic, or genetic diversity of AD patients encountered in routine clinical practice. Models trained predominantly on ADNI might therefore exhibit reduced performance or biases when applied to external, more heterogeneous datasets. This implies a potential "ADNI-centric" characteristic in reported performance, where high accuracies might not translate to equivalent performance in diverse clinical settings.

Furthermore, the high performance observed with ADNI-based studies, especially those using MRI and PET, might inadvertently understate the challenges associated with other modalities or less standardized data. For instance, the study using an EEG database [18] reported a lower accuracy compared to many ADNI-based image-centric studies. This could stem from the inherent noise, variability, or lower diagnostic signal-to-noise ratio in certain modalities (e.g., EEG or less standardized clinical data) compared to the highly curated neuroimaging data available from ADNI. The concentrated focus on ADNI, while yielding impressive numerical results, may obscure the true difficulty of integrating and interpreting less "clean" or less universally available multimodal data, which is a common reality in clinical scenarios.

Performance of fusion strategies on multimodal AD classification

The manner in which different data modalities are combined, or "fused," significantly influences the performance of AI models in AD diagnosis. Various fusion strategies have been explored, each with its own advantages and implications for model design and outcome (see Table 9).

High performance metrics are reported across fusion levels, indicating that effectiveness is not confined to a single approach but often depends on the specific modalities, diagnostic task, and chosen classifiers. For instance, data-level fusion, as implemented by [26], achieved an accuracy of 93.21%. Early fusion reported a very high accuracy of 99.5% for AD vs. CN classification [52]. Feature-level fusion studies have also achieved high accuracies, such as 97.67% [27] and 98% [31]. Late-level fusion likewise demonstrated strong results, reaching as high as 92.6% [49]. This breadth of success suggests that no single fusion level is universally superior; rather, the optimal strategy is often contingent upon the specific characteristics of the data and the diagnostic objective.

The dominance of feature-level fusion suggests a preference for integrating information at a relatively abstract, discriminative level, allowing the classifier to learn complex inter-modal relationships. This "early integration" assumes that features from different modalities contain complementary information that is best combined before classification. However, the success of decision-level and late-level fusion approaches, such as those by [49] and [70], indicates that combining predictions from modality-specific models can also yield high performance. This implies a significant design consideration: early fusion might capture subtle cross-modal interactions but risks being overwhelmed by noise or irrelevant features if not carefully managed. Late fusion, conversely, preserves the semantic independence of each modality’s interpretation, which can be advantageous when modalities are highly disparate or when individual models are strong predictors. The optimal choice likely depends on the inherent characteristics of the modalities and the complexity of their interactions in the context of AD pathology.

The variety of successful fusion strategies observed also indicates that the choice is not arbitrary but rather a critical design decision influenced by the nature of the multimodal data. For instance, image-based modalities like MRI and PET might lend themselves well to data-level or feature-level fusion, where spatial and intensity information can be directly combined. Conversely, combining highly disparate data types such as imaging, genetic, and clinical scores might benefit from feature-level fusion to create a unified vector, or from late-level fusion, where modality-specific models process their unique data types before their outputs are combined. This highlights the importance of matching the fusion strategy to the data’s inherent structure and the specific diagnostic task, rather than adhering to a single "best" approach.
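The distinction between fusing inputs and fusing decisions can be sketched in a few lines of Python, with toy per-subject features and probabilities standing in for real MRI/PET data; the feature values, probabilities, and the 0.6 modality weight below are all hypothetical.

```python
def early_fusion(mri_feats, pet_feats):
    """Data/feature-level fusion: concatenate per-subject feature vectors
    before a single classifier sees them."""
    return [m + p for m, p in zip(mri_feats, pet_feats)]

def late_fusion(prob_mri, prob_pet, w_mri=0.5):
    """Decision-level fusion: weighted average of per-modality class
    probabilities from two independently trained classifiers."""
    return [
        [w_mri * a + (1 - w_mri) * b for a, b in zip(pa, pb)]
        for pa, pb in zip(prob_mri, prob_pet)
    ]

# Two subjects, toy features per modality
mri = [[0.3, 1.1], [0.7, 0.2]]
pet = [[2.0], [1.5]]
fused = early_fusion(mri, pet)  # -> 3-dim vectors for one joint classifier

# Per-modality [CN, AD] probabilities from two separate classifiers
p_mri = [[0.8, 0.2], [0.3, 0.7]]
p_pet = [[0.6, 0.4], [0.1, 0.9]]
p_final = late_fusion(p_mri, p_pet, w_mri=0.6)
```

Early fusion lets one model learn cross-modal interactions from the joint feature vector, whereas late fusion keeps each modality's classifier independent and only reconciles their outputs, which matches the trade-off discussed above.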

Performance of classifiers in achieving diagnostic outcomes

The selection of classifiers is another critical determinant of diagnostic performance in multimodal AD models. Both DL and traditional ML paradigms are widely employed, often in conjunction with advanced ensemble techniques.

  1. Prominence of Deep Learning and Traditional Machine Learning Both DL and traditional ML classifiers are extensively used and have demonstrated high performance in AD diagnosis. Among DL models, CNNs are particularly prominent, especially for processing image-based data. Studies by [22, 25–27, 31, 57], and [70] all utilize CNNs and report high accuracies, ranging from 92.3% to 98% for various AD classification tasks. This underscores CNNs’ inherent strength in automatically extracting hierarchical features from raw image data, reducing the need for extensive manual feature engineering.

Among ML classifiers, SVMs are frequently observed, often yielding competitive or even superior results. Investigations by [21, 23, 28–30, 39–41, 48, 55, 71], and [64] all employ SVMs, with reported accuracies reaching up to 97.13% by [21] and 96.9% by [29] for AD vs. NC classification. These results demonstrate SVMs’ robustness and effectiveness, particularly when combined with carefully engineered features derived from multimodal data.

The co-existence of high-performing deep learning models (primarily CNNs for image data) and traditional ML models (like SVMs for feature-based data) suggests that both paradigms offer distinct advantages. CNNs excel at automated feature extraction from complex, high-dimensional data like medical images, reducing the need for manual feature engineering. SVMs, on the other hand, are powerful discriminative classifiers, often effective with smaller datasets or when well-defined features are available, demonstrating robustness and interpretability in certain contexts. This implies that the choice of classifier is not simply a matter of "newer is better," but rather a strategic decision based on the data type, the available data volume, and the specific diagnostic task. It also suggests that a hybrid approach, combining the strengths of both, could be a promising future direction.

  2. Continuous Trend of Ensemble Approaches Beyond the use of single classifiers, there is a notable trend towards employing ensemble methods and hybrid models to further enhance diagnostic performance and robustness. Examples of ensemble techniques include Random Forest (RF) used by [45], Gradient Boosting (GB) by [41], and XGBoost by [52]. More sophisticated combinations are also observed, such as the OvR Random Forest Ensemble by [53], the CNN-BiLSTM Ensemble by [49], and the Random Forest + CNN hybrid model by [69]. These approaches often leverage the complementary strengths of multiple models or integrate different learning paradigms to improve overall robustness and predictive power, particularly in complex classification scenarios.

The increasing adoption of ensemble methods and multi-task learning, as seen in studies like [30], indicates a recognition that no single classifier is perfect for all aspects of AD diagnosis. Ensemble methods improve robustness by combining diverse predictions, effectively mitigating individual model weaknesses and reducing variance. Multi-task learning, by simultaneously optimizing for related outcomes (e.g., classification and regression), can leverage shared underlying patterns across tasks, leading to more generalized and powerful representations. This trend reflects a move beyond merely achieving peak accuracy on a single metric towards building more reliable, comprehensive, and clinically applicable diagnostic tools that can handle the inherent variability and complexity of AD.
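A minimal hard-voting ensemble, with hypothetical per-subject labels from three base classifiers, illustrates how combining diverse predictions mitigates individual model weaknesses:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Hard-voting ensemble: each base model votes a class label per subject;
    ties are broken in favor of the earliest-voting model (Counter ordering)."""
    n_subjects = len(predictions_per_model[0])
    consensus = []
    for s in range(n_subjects):
        votes = [preds[s] for preds in predictions_per_model]
        consensus.append(Counter(votes).most_common(1)[0][0])
    return consensus

# Hypothetical predictions from three base classifiers on four subjects
rf  = ["CN", "MCI", "AD", "MCI"]
svm = ["CN", "AD",  "AD", "MCI"]
cnn = ["MCI", "AD", "AD", "CN"]
consensus = majority_vote([rf, svm, cnn])  # -> ['CN', 'AD', 'AD', 'MCI']
```

Note how each base model errs on a different subject, yet the consensus is correct everywhere two of the three agree; soft-voting and stacking variants replace the label counts with averaged probabilities or a learned meta-classifier.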

Explainability in AI models for multimodal Alzheimer’s Disease diagnosis

The increasing complexity of AI models, particularly DL architectures, has underscored the critical need for XAI frameworks. Understanding why a model makes a particular prediction is crucial for building trust among clinicians and facilitating the translation of AI research into practical diagnostic tools[7, 8]. Analysis of the provided studies reveals a growing trend towards incorporating XAI, primarily focusing on visual interpretability.

The most prominent XAI framework observed is SHAP, utilized in several studies. In the context of multimodal AD diagnosis, SHAP has been applied to both traditional ML models like RF and DL models such as 3DResNet, DNN, and MLP. These applications span various classification tasks, including distinguishing between CN, MCI, and AD, as well as predicting MCI to AD conversion. The data modalities being interpreted with SHAP are diverse, encompassing cognitive scores, neuropsychological batteries, MRI, PET, genetics, medical history, DTI, EHR, and general image and clinical/neurological data. The consistent output type for SHAP in these studies is visual interpretability, which likely aids in presenting complex feature contributions in an intuitive manner to human experts.
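The attribution principle behind SHAP can be illustrated with an exact Shapley computation over a tiny, hypothetical additive risk model; real SHAP implementations approximate these weighted marginal-contribution sums for high-dimensional inputs, and the marker names and contribution values below are purely illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values for a small feature set.
    value_fn(subset) returns the model's output using only that subset."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Weight of a coalition of size k in the Shapley formula.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(set(S) | {f}) - value_fn(set(S)))
        phi[f] = total
    return phi

# Toy additive risk model over three hypothetical AD markers
CONTRIB = {"hippocampal_volume": -0.4, "amyloid_pet": 0.6, "mmse_score": -0.2}

def risk(subset):
    """Risk relative to baseline when only `subset` of markers is known."""
    return sum(CONTRIB[f] for f in subset)

phi = shapley_values(risk, list(CONTRIB))
```

Because the toy model is additive, each marker's Shapley value equals its stand-alone contribution, and the values sum to the full-model output (the efficiency property) — the same guarantee that makes SHAP attributions interpretable as a decomposition of a prediction.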

Another significant XAI technique observed is GradCAM. GradCAM is particularly relevant for DL models, especially CNNs, as it produces visual explanations by highlighting the important regions in the input image that led to a specific classification decision. Studies employing GradCAM for AD diagnosis have used DL classifiers like 3DCNN and BRNN and a 3D Multimodal cross-enhanced fusion network (MENet). These models interpret multimodal data such as demographic and cognitive scores, clinical biomarkers, MRI, and FDG-PET images. Similar to SHAP, the interpretability type for GradCAM is consistently visual, providing heatmaps or activation maps that pinpoint relevant areas within medical images.
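The core GradCAM computation is small: average the class-score gradients over each feature map to get per-channel weights, then take a ReLU of the weighted sum of activation maps. The sketch below performs this on hypothetical 2x2 feature maps; in practice the activations and gradients come from the last convolutional layer of the trained network.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a conv layer's activations and the gradients of
    the class score w.r.t. those activations.

    activations, gradients: [channels][H][W] nested lists.
    """
    n_ch = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    # Channel importance: global average pooling of the gradients.
    alphas = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(n_ch)]
    # Weighted sum of activation maps, followed by ReLU.
    cam = [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(n_ch)))
            for j in range(W)] for i in range(H)]
    return cam

# Hypothetical 2-channel, 2x2 feature maps
acts = [[[1.0, 0.0], [0.5, 0.2]],
        [[0.0, 1.0], [0.3, 0.8]]]
grads = [[[0.4, 0.4], [0.4, 0.4]],     # channel 0 supports the class (alpha=0.4)
         [[-0.2, -0.2], [-0.2, -0.2]]] # channel 1 opposes it (alpha=-0.2)
cam = grad_cam(acts, grads)
```

The ReLU keeps only regions that positively support the target class; the resulting low-resolution map is then upsampled to the input image size and overlaid as the heatmaps described above.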

Furthermore, Attention Maps are also noted as an interpretability type, specifically used with ResNet deep learning models for MCI vs. AD classification, interpreting MRI and clinical data. Attention mechanisms, inherent in some neural network architectures, can highlight which parts of the input data the model "attends" to most when making a prediction, offering another form of visual explanation.

Overall, the trend in XAI for multimodal AD diagnosis is characterized by:

  1. Dominance of Visual Interpretability: All listed XAI methods (SHAP, GradCAM, Attention Maps) provide visual explanations, indicating a strong preference for intuitive, human-understandable insights, especially crucial for image-based medical data.

  2. Application Across Model Types: While DL models are increasingly complex and thus benefit greatly from XAI, traditional ML models (like RF) are also being made interpretable using frameworks like SHAP.

  3. Focus on Feature Importance and Spatial Relevance: SHAP helps identify the most influential features (e.g., specific cognitive scores, genetic markers), while GradCAM and Attention Maps pinpoint critical regions within imaging data, providing complementary insights into the model’s decision-making process.

  4. Growing Recognition of Clinical Need: The inclusion of XAI studies signifies a maturing field that is moving beyond just achieving high accuracy to ensuring transparency, trustworthiness, and clinical utility of AI models in AD diagnosis. This trend is vital for fostering adoption and enabling clinicians to validate and act upon AI-driven insights.

The analysis of reported performance trends in AI models for multimodal AD diagnosis reveals a landscape of significant progress alongside persistent challenges. While these models demonstrate remarkable success in distinguishing overt AD from healthy controls, frequently achieving accuracies exceeding 95%, their performance notably diminishes for 3-way or 4-way classification tasks, such as differentiating between various stages of MCI or accurately predicting disease progression. This efficacy is heavily influenced by the choice of datasets, with the ADNI dataset serving as the predominant resource, raising considerations about the generalizability of findings to diverse clinical populations. In terms of fusion strategies, feature-level fusion is the most common approach, though high performance is observed across various levels, indicating that the optimal strategy is contingent upon the specific data characteristics and diagnostic task. Both DL (especially CNNs) and traditional ML (SVMs) contribute significantly to high performance, often complemented by ensemble methods and multi-task learning approaches to enhance robustness and predictive capabilities.

The integration of XAI frameworks marks a crucial evolution in multimodal AD diagnosis, addressing the imperative for transparency and clinical trust in complex AI models. XAI techniques (specifically SHAP, GradCAM, and Attention Maps) are increasingly being applied across this performance spectrum. The trend indicates a growing recognition that for AI models to be truly impactful in AD diagnosis, their decision-making processes, regardless of their specific performance level or underlying fusion strategy, must be comprehensible to clinicians, thereby bridging the gap between high predictive power and practical clinical adoption. Ultimately, while current AI models show strong discriminative power for established AD, further advancements are crucial for achieving reliable early detection, precise staging, and broad clinical applicability across the complex spectrum of AD.

Limitations, challenges, and future scope

This section addresses the RQ7: What are the limitations, challenges and future scope of using multimodal data and XAI methods for AD diagnosis?

The intersection of multimodal data and XAI methods provides critical decision-support capabilities for the early detection and diagnosis of AD. While the fusion of multimodal data with XAI techniques offers comprehensive insights and enhanced interpretability, it also introduces multiple practical and methodological challenges [82, 83]. In this section, we examine existing studies employing multimodal data and XAI methods, highlighting their limitations to better understand current challenges and future research opportunities. The key challenges and potential solutions are illustrated in Fig. 15.

Fig. 15.

Fig. 15

Key challenges in applying multimodal AI and XAI techniques for AD detection. The figure also outlines potential mitigation strategies, highlighting gaps in data quality, model robustness, interpretability, and clinical usability that future research must address

Data heterogeneity

Data acquisition in multimodal fusion inherently involves diverse sources, such as wearable sensors, clinical reports, imaging modalities, and laboratory data [84, 85]. Consequently, datasets often exhibit significant heterogeneity in structure and format [18, 19, 34, 86]. This variability arises from differences in acquisition protocols, equipment, and observation conditions. Addressing this issue requires careful data preparation and feature harmonization, including steps such as registration, normalization, filtering, and class balancing for imaging modalities [87, 88].

Integration complexity

Multimodal fusion involves combining data that may differ in temporal, spatial, and spectral resolution. For example, video and audio data include temporal dependencies [89, 90], whereas neuroimaging carries spatial and spectral complexities [91]. Integration at multiple resolutions significantly increases computational and model complexity. Advanced DL architectures—such as Multimodal CNNs [92], RNNs [93], graph neural networks (GNNs) [94], Transformers [95], and generative models [96]—offer a solution by learning hierarchical and cross-modal representations that capture complex interdependencies.

Limited data size

The availability of sufficiently large multimodal datasets is a recurring challenge in clinical AI. Small sample sizes across modalities limit model generalizability and stability [19, 71, 97], often leading to overfitting and reduced clinical applicability [22, 33]. Data augmentation and synthetic sample generation can partially alleviate this limitation [54], but these methods face inherent risks: artificially generated data may fail to capture the true variability of clinical populations and may introduce bias if not rigorously validated. This challenge is further amplified in multimodal settings, where incomplete modality overlap across subjects reduces the effective sample size available for fusion-based learning. From an explainability perspective, models trained on limited datasets may produce unstable or dataset-specific explanations, undermining the reliability of XAI outputs in clinical decision-making. These limitations highlight the need for larger, harmonized multimodal cohorts and robust external validation to ensure both predictive performance and trustworthy interpretability.
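As one simple example of the augmentation strategies mentioned above, mixup-style interpolation creates synthetic training samples as convex combinations of real ones. The feature vectors and one-hot labels below are hypothetical, and the caveats about clinical validity and bias apply equally to such synthetic samples.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.3, rng=random.Random(0)):
    """Create a synthetic sample as a convex combination of two real ones.
    A Beta(alpha, alpha) draw keeps most mixes close to one parent sample."""
    lam = rng.betavariate(alpha, alpha)
    x_new = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_new = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x_new, y_new

# Two hypothetical subjects: feature vectors + one-hot labels over [CN, MCI, AD]
x_cn, y_cn = [0.9, 0.1, 0.4], [1.0, 0.0, 0.0]
x_ad, y_ad = [0.2, 0.8, 0.7], [0.0, 0.0, 1.0]
x_mix, y_mix = mixup(x_cn, y_cn, x_ad, y_ad)
```

The soft label `y_mix` preserves the mixing proportion, which acts as a regularizer; however, interpolated scans or feature vectors need not correspond to any biologically plausible subject, which is precisely the validation risk noted above.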

Lack of longitudinal validation

Despite the availability of longitudinal datasets such as ADNI, a large proportion of studies rely primarily on cross-sectional data, which restricts the ability of models to capture temporal disease dynamics and early pathological transitions. This limitation is particularly evident in tasks such as MCI subtype differentiation and progression prediction, where subtle longitudinal changes are clinically meaningful but difficult to model without sustained temporal data. Moreover, explainability analyses in most studies are conducted on static snapshots, with little investigation into whether highlighted biomarkers or brain regions remain stable or clinically consistent over time. The absence of robust longitudinal validation therefore undermines both the predictive reliability and the clinical interpretability of fusion-based AI models, emphasizing the need for temporally aware modeling strategies and long-term, multimodal follow-up cohorts.

Complexity of fusion models

Although multimodal fusion can model intricate relationships and improve prediction accuracy [98–100], the resulting architectures are often highly complex and difficult to interpret. Most AI-driven fusion models function as black boxes, reducing transparency and clinician trust [57]. Embedding explainable AI techniques, such as saliency mapping, feature attribution, and model-agnostic explanations, can provide visual, numeric, or textual insights into how each modality contributes to the final prediction [36, 101–103].

Clinical validation

Despite promising experimental results, real-world clinical adoption remains limited. Multimodal fusion models are predominantly trained and evaluated on curated datasets such as ADNI, which lack the heterogeneity of routine clinical data [25, 64]. This reliance on controlled research cohorts raises concerns about model robustness, generalizability, and explainability when deployed in real clinical settings. Clinical validation requires close collaboration with healthcare professionals, controlled trials, and benchmarking against current diagnostic standards [104, 105]. None of the reviewed studies explicitly engaged clinicians in the model development or evaluation loop, which substantially impedes trust and limits the feasibility of deployment in primary care and early diagnostic workflows. In addition, the absence of standardized evaluation protocols and clinically meaningful endpoints, such as decision impact, time-to-diagnosis, or alignment with existing diagnostic criteria, further limits the translational readiness of current multimodal AI and XAI models. This limitation is not specific to AD but applies broadly to the adoption of AI-based systems across the medical domain.

Critical insights and future directions

The stark contrast in diagnostic efficacy across disease stages reveals a critical limitation of current AI models. The models are highly effective at distinguishing well-defined, distinct disease states, where pathological differences are more pronounced and easier to capture. However, they struggle with the subtle, heterogeneous, and overlapping characteristics of prodromal stages, such as sMCI versus pMCI, and with the dynamic nature of disease progression. This suggests that while current models are adept at diagnosis at later stages, their capacity for early detection and precise staging, which is paramount for timely intervention and personalized medicine in AD, is less developed. If AI models cannot reliably differentiate between MCI subtypes or accurately predict conversion with high confidence, their immediate clinical utility for early diagnosis and prognostic guidance is limited. While high AD vs. NC accuracy is valuable for confirmatory diagnosis, the real unmet need in AD is identifying individuals at risk or in early, potentially modifiable stages. The lower performance on these complex tasks indicates that current AI models are not yet sufficiently robust for widespread application in primary care or for guiding AD preventative therapies, highlighting a gap between research capabilities and the clinical requirements for truly impactful early AD management.

Future research should focus on:

  1. Integrate causal inference with multimodal fusion to distinguish biomarkers driving AD from comorbidities, moving beyond correlation-based explainability. For example, causal graphs can model interactions between amyloid-PET, vascular lesions, and cognitive scores [106, 107].

  2. Move from feature attribution to counterfactual explanations (e.g., "if hippocampal volume increased by 5%, AD risk would decrease by X%") using methods such as generative counterfactual inference [106, 108].

  3. Develop privacy-preserving, federated learning frameworks for aggregating data across hospitals. Initiatives like OASIS-4 [109] show promise but lack standardization; blockchain-based data sharing could incentivize participation [110].

  4. Create standardized minimal data-sharing templates for multimodal AD studies that specify required metadata such as image acquisition parameters, cognitive assessments, biomarker measurements, and preprocessing steps. These templates can address the challenges outlined in Sections 3.7.1 and 3.7.3.

  5. Establish benchmark multimodal pipelines with openly shared preprocessing scripts, fusion architectures, and XAI interpretation workflows. This would reduce model-level variability and enable more rigorous cross-study comparison.

  6. Mitigate limited dataset size and incomplete modality overlap through short-term initiatives such as federated harmonization protocols and lightweight modality-imputation baselines, which can accelerate the development of clinically reliable multimodal models without requiring full data centralization.

  7. Adopt clinically meaningful evaluation endpoints that capture impact on clinical decision-making, deliver clinician-interpretable XAI outputs, and involve clinicians earlier in model design, validation, and explanation refinement.
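
The counterfactual style of explanation envisaged in direction 2 can be illustrated with a toy model. The sketch below uses an invented logistic risk model with made-up coefficients and population statistics (hippocampal volume, amyloid-PET SUVR, MMSE score); it only shows the shape of a counterfactual query and is neither a clinical model nor an implementation of generative counterfactual inference [106, 108].

```python
import numpy as np

def predicted_risk(features, weights, bias):
    """Logistic model mapping standardized features to a hypothetical AD risk."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

# Features: [hippocampal volume (cm^3), amyloid-PET SUVR, MMSE score]
# Coefficients and population statistics are illustrative, not clinical values;
# the negative volume weight encodes "smaller volume -> higher risk".
weights = np.array([-1.2, 0.9, -0.4])
bias = 0.1
mu = np.array([3.2, 1.2, 27.0])     # assumed population means
sigma = np.array([0.4, 0.2, 3.0])   # assumed population standard deviations

patient = np.array([2.8, 1.4, 24.0])
baseline = predicted_risk((patient - mu) / sigma, weights, bias)

# Counterfactual query: "if hippocampal volume increased by 5%..."
counterfactual = patient.copy()
counterfactual[0] *= 1.05
cf_risk = predicted_risk((counterfactual - mu) / sigma, weights, bias)

print(f"Baseline risk:       {baseline:.3f}")
print(f"Counterfactual risk: {cf_risk:.3f}")
print(f"Risk change:         {cf_risk - baseline:+.3f}")
```

In this toy setting the risk change is read directly off the model; the research challenge is producing such statements from deep multimodal models while keeping the counterfactual input anatomically plausible.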

By overcoming these barriers, multimodal AI systems can transition from promising experimental tools to reliable, clinician-trusted solutions for early AD diagnosis and personalized patient care.

Conclusion

This systematic literature review presents a comprehensive analysis of recent advancements in the use of AI for the prediction and diagnosis of AD through multimodal data fusion. The integration of diverse data modalities—such as neuroimaging, genetic profiles, clinical assessments, and cognitive scores—has been shown to enhance diagnostic accuracy, model robustness, and early disease detection capabilities. The review highlights a wide range of ML and DL models employed for multimodal fusion, including conventional algorithms like Random Forests and Support Vector Machines, as well as more sophisticated architectures such as CNNs, Graph Neural Networks (GNNs), and transformer-based models. Fusion strategies are typically categorized into early, intermediate, and late-stage approaches, each offering different strengths in handling data heterogeneity and complexity.

Furthermore, the review emphasizes the growing incorporation of XAI techniques—such as SHAP, GradCAM, and Attention Maps—which are instrumental in increasing the interpretability and transparency of AI-driven predictions. These methods support clinical trust by identifying key features and brain regions that contribute to diagnostic outcomes. Despite notable progress, several challenges remain. These include data heterogeneity, integration complexity, limited sample sizes, the black-box nature of deep fusion models, and the lack of clinical validation in real-world settings. Addressing these limitations will require future research to focus on improved data standardization, interpretable and computationally efficient model designs, and extensive external and prospective validation.

In conclusion, the integration of multimodal data fusion with advanced AI and XAI techniques holds significant promise for transforming AD diagnosis and prognosis. However, translating these models into clinically viable tools demands interdisciplinary collaboration, rigorous validation, and a continued emphasis on interpretability and usability.

Acknowledgements

The authors would like to express their heartfelt gratitude to the scientists who kindly released the data from their experiments.

Author contributions

This work was carried out in close collaboration among all authors. V.V. and N.S. contributed equally to the conception of the study, methodology design, and initial manuscript preparation. M.E., C.G., and K.B. provided technical insights and assisted in data curation. S.W., A.A., P.N.S., and I.A. contributed to validation, critical revisions, and domain expertise. M.M. conceived the idea, supervised the project, refined the methodology, and finalized the manuscript. All authors have read and approved the final version of the paper.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval and consent to participate

This work is based on secondary datasets available online; hence, ethical approval was not necessary.

Consent for publication

All authors have seen and approved the current version of the paper.

Competing interests

The authors declare no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Vimbi Viswan and Noushath Shaffi have contributed equally to this work.

References

  • 1. Noor MBT, Zenia NZ, Kaiser MS, Mamun SA, Mahmud M (2020) Application of deep learning in detecting neurological disorders from magnetic resonance images: a survey on the detection of alzheimer’s disease, parkinson’s disease and schizophrenia. Brain Inform 7:1–21
  • 2. Gauthier S, Webster C, Sarvaes S, Morais J, Rosa-Neto P (2022) World Alzheimer Report 2022: Life After Diagnosis - Navigating Treatment, Care and Support
  • 3. Hajamohideen F, Shaffi N, Mahmud M, Subramanian K, Al Sariri A, Vimbi V, Abdesselam A (2023) Four-way classification of alzheimer’s disease using deep siamese convolutional neural network with triplet-loss function. Brain Inform 10(1):1–13
  • 4. Shaffi N, Hajamohideen F, Abdesselam A, Mahmud M, et al (2023) Ensemble classifiers for a 4-way classification of alzheimer’s disease. In: Proc. AII2022, pp. 219–230
  • 5. Mahmud M, Kaiser MS, McGinnity TM, Hussain A (2021) Deep learning in mining biological data. Cogn Comput 13:1–33
  • 6. Mahmud M, Kaiser MS, Hussain A, Vassanelli S (2018) Applications of deep learning and reinforcement learning to biological data. IEEE Trans Neural Netw Learn Syst 29(6):2063–2079
  • 7. Vimbi V, Shaffi N, Mahmud M (2024) Interpreting artificial intelligence models: a systematic review on the application of lime and shap in alzheimer’s disease detection. Brain Inform 11(1):10
  • 8. Viswan V, Shaffi N, Mahmud M, Subramanian K, Hajamohideen F (2023) Explainable artificial intelligence in alzheimer’s disease classification: a systematic review. Cogn Comput 1–44
  • 9. Aviles M, Sánchez-Reyes LM, Álvarez-Alvarado JM, Rodríguez-Reséndiz J (2024) Machine and deep learning trends in eeg-based detection and diagnosis of alzheimer’s disease: a systematic review. Eng 5(3):1464–1484
  • 10. Awang MK, Ali G, Faheem M (2024) Deep learning techniques for alzheimer’s disease detection in 3d imaging: a systematic review. Health Sci Rep 7(9):70025
  • 11. Teles AS, de Moura IR, Silva F, Roberts A, Stahl D (2025) Ehr-based prediction modelling meets multimodal deep learning: a systematic review of structured and textual data fusion methods. Inform Fusion 102981
  • 12. Alsubaie MG, Luo S, Shaukat K (2024) Alzheimer’s disease detection using deep learning on neuroimaging: a systematic review. Mach Learn Knowl Extr 6(1):464–505
  • 13. Bandettini PA (2009) What’s new in neuroimaging methods? Ann N Y Acad Sci 1156(1):260–293
  • 14. Kuncheva LI (2002) A theoretical study on six classifier fusion strategies. IEEE Trans Pattern Anal Mach Intell 24(2):281–286
  • 15. Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report. http://www.dur.ac.uk/ebse/resources/Systematic-reviews-5-8.pdf
  • 16. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE et al (2021) The prisma 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev 10(1):1–11
  • 17. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, Topol EJ, Ioannidis JP, Collins GS, Maruthappu M (2020) Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368
  • 18. Ieracitano C, Mammone N, Hussain A, Morabito FC (2020) A novel multi-modal machine learning based approach for automatic classification of eeg recordings in dementia. Neural Netw 123:176–190
  • 19. Li Y, Liu J, Gao X, Jie B, Kim M, Yap P-T, Wee C-Y, Shen D (2019) Multimodal hyper-connectivity of functional networks using functionally-weighted lasso for mci classification. Med Image Anal 52:80–96
  • 20. Qiu Z, Yang P, Xiao C, Wang S, Xiao X, Qin J, Liu C-M, Wang T, Lei B (2024) 3d multimodal fusion network with disease-induced joint learning for early alzheimer’s disease diagnosis. IEEE Trans Med Imaging 43(9):3161–3175
  • 21. Shi J, Zheng X, Li Y, Zhang Q, Ying S (2017) Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of alzheimer’s disease. IEEE J Biomed Health Inform 22(1):173–183
  • 22. Feng C, Elazab A, Yang P, Wang T, Zhou F, Hu H, Xiao X, Lei B (2019) Deep learning framework for alzheimer’s disease diagnosis via 3d-cnn and fsbi-lstm. IEEE Access 7:63605–63618
  • 23. Dong A, Zhang G, Liu J, Wei Z (2022) Latent feature representation learning for alzheimer’s disease classification. Comput Biol Med 150:106116
  • 24. Chen Z, Liu Y, Zhang Y, Li Q, Initiative ADN et al (2023) Orthogonal latent space learning with feature weighting and graph learning for multimodal alzheimer’s disease diagnosis. Med Image Anal 84:102698
  • 25. Rallabandi VS, Seetharaman K (2023) Deep learning-based classification of healthy aging controls, mild cognitive impairment and alzheimer’s disease using fusion of mri-pet imaging. Biomed Signal Process Control 80:104312
  • 26. Kong Z, Zhang M, Zhu W, Yi Y, Wang T, Zhang B (2022) Multi-modal data alzheimer’s disease detection based on 3d convolution. Biomed Signal Process Control 75:103565
  • 27. Leng Y, Cui W, Peng Y, Yan C, Cao Y, Yan Z, Chen S, Jiang X, Zheng J, Initiative ADN et al (2023) Multimodal cross enhanced fusion network for diagnosis of alzheimer’s disease and subjective memory complaints. Comput Biol Med 157:106788
  • 28. Jiao Z, Chen S, Shi H, Xu J (2022) Multi-modal feature selection with feature correlation and feature structure fusion for mci and ad classification. Brain Sci 12(1):80
  • 29. Ning Z, Xiao Q, Feng Q, Chen W, Zhang Y (2021) Relation-induced multi-modal shared representation learning for alzheimer’s disease diagnosis. IEEE Trans Med Imaging 40(6):1632–1645
  • 30. Zhang D, Shen D, Initiative ADN et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in alzheimer’s disease. Neuroimage 59(2):895–907
  • 31. Khan AA, Mahendran RK, Perumal K, Faheem M (2024) Dual-3dm 3-ad: Mixed transformer based semantic segmentation and triplet pre-processing for early multi-class alzheimer’s diagnosis. IEEE Trans Neural Syst Rehabil Eng
  • 32. Abdelaziz M, Wang T, Anwaar W, Elazab A (2025) Multi-scale multimodal deep learning framework for alzheimer’s disease diagnosis. Comput Biol Med 184:109438
  • 33. Xu L, Wu H, He C, Wang J, Zhang C, Nie F, Chen L (2022) Multi-modal sequence learning for alzheimer’s disease progression prediction with incomplete variable-length longitudinal data. Med Image Anal 82:102643
  • 34. Liu F, Yuan S, Li W, Xu Q, Sheng B (2023) Patch-based deep multi-modal learning framework for alzheimer’s disease diagnosis using multi-view neuroimaging. Biomed Signal Process Control 80:104400
  • 35. Zhang H, Ni M, Yang Y, Xie F, Wang W, He Y, Chen W, Chen Z (2025) Patch-based interpretable deep learning framework for alzheimer’s disease diagnosis using multimodal data. Biomed Signal Process Control 100:107085
  • 36. Eslami M, Tabarestani S, Adjouadi M (2023) A unique color-coded visualization system with multimodal information fusion and deep learning in a longitudinal study of alzheimer’s disease. Artif Intell Med 140:102543
  • 37. Taiyeb Khosroshahi M, Morsali S, Gharakhanlou S, Motamedi A, Hassanbaghlou S, Vahedi H, Pedrammehr S, Kabir HMD, Jafarizadeh A (2025) Explainable artificial intelligence in neuroimaging of alzheimer’s disease. Diagnostics 15(5):612
  • 38. Liu X, Li W, Miao S, Liu F, Han K, Bezabih TT (2024) Hammf: hierarchical attention-based multi-task and multi-modal fusion model for computer-aided diagnosis of alzheimer’s disease. Comput Biol Med 176:108564
  • 39. Young J, Modat M, Cardoso MJ, Mendelson A, Cash D, Ourselin S, Initiative ADN et al (2013) Accurate multimodal probabilistic prediction of conversion to alzheimer’s disease in patients with mild cognitive impairment. NeuroImage Clinical 2:735–745
  • 40. Zhou T, Liu M, Thung K-H, Shen D (2019) Latent representation learning for alzheimer’s disease diagnosis with incomplete multi-modality neuroimaging and genetic data. IEEE Trans Med Imaging 38(10):2411–2422
  • 41. Aghili M, Tabarestani S, Adjouadi M (2022) Addressing the missing data challenge in multi-modal datasets for the diagnosis of alzheimer’s disease. J Neurosci Methods 375:109582
  • 42. Lu P, Hu L, Mitelpunkt A, Bhatnagar S, Lu L, Liang H (2024) A hierarchical attention-based multimodal fusion framework for predicting the progression of alzheimer’s disease. Biomed Signal Process Control 88:105669
  • 43. Wang Y, Gao R, Wei T, Johnston L, Yuan X, Zhang Y, Yu Z, Initiative ADN (2024) Predicting long-term progression of alzheimer’s disease using a multimodal deep learning model incorporating interaction effects. J Transl Med 22(1):265
  • 44. Lei B, Li Y, Fu W, Yang P, Chen S, Wang T, Xiao X, Niu T, Fu Y, Wang S et al (2024) Alzheimer’s disease diagnosis from multi-modal data via feature inductive learning and dual multilevel graph neural network. Med Image Anal 97:103213
  • 45. El-Sappagh S, Alonso JM, Islam SR, Sultan AM, Kwak KS (2021) A multilayer multimodal detection and prediction model based on explainable artificial intelligence for alzheimer’s disease. Sci Rep 11(1):2660
  • 46. Raza ML, Hassan ST, Jamil S, Hyder N, Batool K, Walji S, Abbas MK (2025) Advancements in deep learning for early diagnosis of alzheimer’s disease using multimodal neuroimaging: challenges and future directions. Front Neuroinform 19:1557177
  • 47. Zhang M, Cui Q, Lü Y, Li W (2024) A feature-aware multimodal framework with auto-fusion for alzheimer’s disease diagnosis. Comput Biol Med 178:108740
  • 48. KP MN, Thiyagarajan P (2022) Feature selection using efficient fusion of fisher score and greedy searching for alzheimer’s classification. J King Saud Univ-Comput Inf Sci 34(8):4993–5006
  • 49. El-Sappagh S, Abuhmed T, Islam SR, Kwak KS (2020) Multimodal multitask deep learning model for alzheimer’s disease progression detection based on time series data. Neurocomputing 412:197–215
  • 50. Abuhmed T, El-Sappagh S, Alonso JM (2021) Robust hybrid deep learning models for alzheimer’s progression detection. Knowl-Based Syst 213:106688
  • 51. El-Sappagh S, Saleh H, Ali F, Amer E, Abuhmed T (2022) Two-stage deep learning model for alzheimer’s disease detection and prediction of the mild cognitive impairment time. Neural Comput Appl 34(17):14487–14509
  • 52. El-Sappagh S, Ali F, Abuhmed T, Singh J, Alonso JM (2022) Automatic detection of alzheimer’s disease progression: an efficient information fusion approach with heterogeneous ensemble classifiers. Neurocomputing 512:203–224
  • 53. Ramírez J, Górriz J, Ortiz A, Martínez-Murcia F, Segovia F, Salas-Gonzalez D, Castillo-Barnes D, Illán I, Puntonet C, Initiative ADN et al (2018) Ensemble of random forests one vs. rest classifiers for mci and ad prediction using anova cortical and subcortical feature selection and partial least squares. J Neurosci Methods 302:47–57
  • 54. Yao D, Calhoun VD, Fu Z, Du Y, Sui J (2018) An ensemble learning system for a 4-way classification of alzheimer’s disease and mild cognitive impairment. J Neurosci Methods 302:75–81
  • 55. Guan H, Wang C, Tao D (2021) Mri-based alzheimer’s disease prediction via distilling the knowledge in multi-modal data. Neuroimage 244:118586
  • 56. Tu Y, Lin S, Qiao J, Zhuang Y, Zhang P (2022) Alzheimer’s disease diagnosis via multimodal feature fusion. Comput Biol Med 148:105901
  • 57. Rahim N, El-Sappagh S, Ali S, Muhammad K, Del Ser J, Abuhmed T (2023) Prediction of alzheimer’s progression based on multimodal deep-learning-based fusion and visual explainability of time-series data. Information Fusion 92:363–388
  • 58. Mirabnahrazam G, Ma D, Beaulac C, Lee S, Popuri K, Lee H, Cao J, Galvin JE, Wang L, Beg MF et al (2023) Predicting time-to-conversion for dementia of alzheimer’s type using multi-modal deep survival analysis. Neurobiol Aging 121:139–156
  • 59. Arco JE, Ramírez J, Górriz JM, Ruz M, Initiative ADN et al (2021) Data fusion based on searchlight analysis for the prediction of alzheimer’s disease. Expert Syst Appl 185:115549
  • 60. Xue C, Kowshik SS, Lteif D, Puducheri S, Jasodanand VH, Zhou OT, Walia AS, Guney OB, Zhang JD, Poésy S et al (2024) Ai-based differential diagnosis of dementia etiologies on multimodal data. Nat Med 30(10):2977–2989
  • 61. Zhang M, Cui Q, Lü Y, Yu W, Li W (2024) A multimodal learning machine framework for alzheimer’s disease diagnosis based on neuropsychological and neuroimaging data. Comput Ind Eng 197:110625
  • 62. Goyal S, Singh V, Rani A, Yadav N (2022) Multimodal image fusion and denoising in nsct domain using cnn and fotgv. Biomed Signal Process Control 71:103214
  • 63. Liu S, Liu S, Cai W, Che H, Pujol S, Kikinis R, Feng D, Fulham MJ et al (2014) Multimodal neuroimaging feature learning for multiclass diagnosis of alzheimer’s disease. IEEE Trans Biomed Eng 62(4):1132–1140
  • 64. Zhu Q, Xu B, Huang J, Wang H, Xu R, Shao W, Zhang D (2022) Deep multi-modal discriminative and interpretability network for alzheimer’s disease diagnosis. IEEE Trans Med Imaging
  • 65. Leela M, Helenprabha K, Sharmila L (2023) Prediction and classification of alzheimer disease categories using integrated deep transfer learning approach. Measurement Sensors 27:100749
  • 66. Mustafa Y, Luo T (2024) Unmasking dementia detection by masking input gradients: a jsm approach to model interpretability and precision. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp 75–90
  • 67. Ma J, Zhang J, Wang Z (2022) Multimodality alzheimer’s disease analysis in deep riemannian manifold. Inf Process Manag 59(4):102965
  • 68. El-Sappagh S, Saleh H, Sahal R, Abuhmed T, Islam SR, Ali F, Amer E (2021) Alzheimer’s disease progression detection model based on an early fusion of cost-effective multimodal data. Futur Gener Comput Syst 115:680–699
  • 69. Velazquez M, Lee Y (2022) Multimodal ensemble model for alzheimer’s disease conversion prediction from early mild cognitive impairment subjects. Comput Biol Med 151:106201
  • 70. Aderghal K, Afdel K, Benois-Pineau J, Catheline G (2020) Improving alzheimer’s stage categorization with convolutional neural network using transfer learning and different magnetic resonance imaging modalities. Heliyon 6(12)
  • 71. Tang X, Qin Y, Wu J, Zhang M, Zhu W, Miller MI (2016) Shape and diffusion tensor imaging based integrative analysis of the hippocampus and the amygdala in alzheimer’s disease. Magn Reson Imaging 34(8):1087–1099
  • 72. Zhang T, Shi M (2020) Multi-modal neuroimaging feature fusion for diagnosis of alzheimer’s disease. J Neurosci Methods 341:108795
  • 73. Wang Y, Gao R, Wei T, Johnston L, Yuan X, Zhang Y, Yu Z, Initiative ADN (2024) Predicting long-term progression of alzheimer’s disease using a multimodal deep learning model incorporating interaction effects. J Transl Med 22(1):265
  • 74. Lu P, Hu L, Mitelpunkt A, Bhatnagar S, Lu L, Liang H (2024) A hierarchical attention-based multimodal fusion framework for predicting the progression of alzheimer’s disease. Biomed Signal Process Control 88:105669
  • 75. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2019) Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vision. 10.1007/s11263-019-01228-7
  • 76. An J, Joe I (2022) Attention map-guided visual explanations for deep neural networks. Appl Sci 12(8):3846
  • 77. Touvron H, Cord M, Douze M, et al (2021) Training data-efficient image transformers & distillation through attention. In: Proc. ICML, pp. 10347–10357
  • 78. Bach S, Binder A, Montavon G, Klauschen F, Müller K-R, Samek W (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7):e0130140
  • 79. Bazen S, Joutard X (2013) The taylor decomposition: a unified generalization of the oaxaca method to nonlinear models
  • 80. Ribeiro MT, Singh S, Guestrin C (2016) “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144
  • 81. Li Z (2002) A saliency map in primary visual cortex. Trends Cogn Sci 6(1):9–16
  • 82. Junaid M, Ali S, Eid F, El-Sappagh S, Abuhmed T (2023) Explainable machine learning models based on multimodal time-series data for the early detection of parkinson’s disease. Comput Methods Programs Biomed 234:107495
  • 83. Lahat D, Adali T, Jutten C (2015) Multimodal data fusion: an overview of methods, challenges, and prospects. Proc IEEE 103(9):1449–1477
  • 84. Malawski F, Gałka J (2018) System for multimodal data acquisition for human action recognition. Multimedia Tools Appl 77:23825–23850
  • 85. Yilmaz Y, Aktukmak M, Hero AO (2021) Multimodal data fusion in high-dimensional heterogeneous datasets via generative models. IEEE Trans Signal Process 69:5175–5188
  • 86. Safai A, Vakharia N, Prasad S, Saini J, Shah A, Lenka A, Pal PK, Ingalhalikar M (2022) Multimodal brain connectomics-based prediction of parkinson’s disease using graph attention networks. Front Neurosci 15:741489
  • 87. Boonzaier NR, Piccirillo SG, Watts C, Price SJ (2015) Assessing and monitoring intratumor heterogeneity in glioblastoma: how far has multimodal imaging come? CNS Oncology 4(6):399–410
  • 88. Li L, Han L, Ding M, Cao H (2023) Multimodal image fusion framework for end-to-end remote sensing image registration. IEEE Trans Geosci Remote Sens 61:1–14
  • 89. Sanchez-Cortes D, Aran O, Jayagopi DB, Schmid Mast M, Gatica-Perez D (2013) Emergent leaders through looking and speaking: from audio-visual data to multimodal recognition. J Multimodal User Interfaces 7:39–53
  • 90. Guo J, Song B, Zhang P, Ma M, Luo W et al (2019) Affective video content analysis based on multimodal data fusion in heterogeneous networks. Inf Fusion 51:224–232
  • 91. Jiang X, Ma J, Xiao G, Shao Z, Guo X (2021) A review of multimodal image matching: methods and applications. Information Fusion 73:22–71
  • 92. Spasov SE, Passamonti L, Duggento A, Lio P, Toschi N (2018) A multi-modal convolutional neural network framework for the prediction of alzheimer’s disease. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1271–1274. IEEE
  • 93. Lv M, Xu W, Chen T (2019) A hybrid deep convolutional and recurrent neural network for complex activity recognition using multimodal sensors. Neurocomputing 362:33–40
  • 94. Gao J, Lyu T, Xiong F, Wang J, Ke W, Li Z (2021) Predicting the survival of cancer patients with multimodal graph neural network. IEEE/ACM Trans Comput Biol Bioinf 19(2):699–709
  • 95. Xu P, Zhu X, Clifton DA (2023) Multimodal learning with transformers: a survey. IEEE Trans Pattern Anal Mach Intell
  • 96. Suzuki M, Matsuo Y (2022) A survey of multimodal deep generative models. Adv Robot 36(5–6):261–278
  • 97. Pérez-Toro PA, Arias-Vergara T, Klumpp P, Vásquez-Correa JC, Schuster M, Noeth E, Orozco-Arroyave JR (2022) Depression assessment in people with parkinson’s disease: the combination of acoustic features and natural language processing. Speech Commun 145:10–20
  • 98. Xue Z, Marculescu R (2023) Dynamic multimodal fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2574–2583
  • 99. Steyaert S, Pizurica M, Nagaraj D, Khandelwal P, Hernandez-Boussard T, Gentles AJ, Gevaert O (2023) Multimodal data fusion for cancer biomarker discovery with deep learning. Nat Mach Intell 5(4):351–362
  • 100. Fang M, Peng S, Liang Y, Hung C-C, Liu S (2023) A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed Signal Process Control 82:104561
  • 101. Grassucci E, Sigillo L, Uncini A, Comminiello D (2023) Grouse: a task and model agnostic wavelet-driven framework for medical imaging. IEEE Signal Process
  • 102. Rahim N, El-Sappagh S, Ali S, Muhammad K, Del Ser J, Abuhmed T (2023) Prediction of alzheimer’s progression based on multimodal deep-learning-based fusion and visual explainability of time-series data. Information Fusion 92:363–388
  • 103. Folgado D, Barandas M, Famiglini L, Santos R, Cabitza F, Gamboa H (2023) Explainability meets uncertainty quantification: insights from feature-based model fusion on multimodal time series. Information Fusion 100:101955
  • 104. Chen Q, Li M, Chen C, Zhou P, Lv X, Chen C (2023) Mdfnet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification. J Cancer Res Clin Oncol 149(7):3287–3299
  • 105. Zhou P, Chen H, Li Y, Peng Y (2023) Unpaired multi-modal tumor segmentation with structure adaptation. Appl Intell 53(4):3639–3651
  • 106. Qiu S, Miller MI, Joshi PS, Lee JC, Xue C, Ni Y, Wang Y, De Anda-Duran I, Hwang PH, Cramer JA et al (2022) Multimodal deep learning for alzheimer’s disease dementia assessment. Nat Commun 13(1):3404
  • 107. Koksalmis GH, Soykan B, Brattain LJ, Huang H-H (2025) Artificial intelligence for personalized prediction of alzheimer’s disease progression: a survey of methods, data challenges, and future directions. arXiv preprint arXiv:2504.21189
  • 108. Leony F, Lin C-J, Initiative ADN et al (2025) Multimodal fusion architectures for alzheimer’s disease diagnosis: an experimental study. J Biomed Inform 104834
  • 109. Raza ML, Hassan ST, Jamil S, Hyder N, Batool K, Walji S, Abbas MK (2025) Advancements in deep learning for early diagnosis of alzheimer’s disease using multimodal neuroimaging: challenges and future directions. Front Neuroinform 19:1557177
  • 110. Jandoubi B, Akhloufi MA (2025) Multimodal artificial intelligence in medical diagnostics. Information
