Author manuscript; available in PMC: 2024 Mar 30.
Published in final edited form as: J King Saud Univ Comput Inf Sci. 2023 Jun 24;35(7):101618. doi: 10.1016/j.jksuci.2023.101618

CsAGP: Detecting Alzheimer’s disease from multimodal images via dual-transformer with cross-attention and graph pooling

Chaosheng Tang a, Mingyang Wei a, Junding Sun a,*, Shuihua Wang a,b,c,*, Yudong Zhang a,b,c,*; Alzheimer’s Disease Neuroimaging Initiative1
PMCID: PMC7615783  EMSID: EMS194928  PMID: 38559705

Abstract

Alzheimer’s disease (AD) is a devastating degenerative disease that commonly occurs in the elderly. Early detection can protect patients from further damage and is therefore crucial in treating AD. Over the past few decades, it has been demonstrated that neuroimaging can be a critical diagnostic tool for AD, and that fusing features from different neuroimaging modalities can enhance diagnostic performance. Most previous studies of multimodal feature fusion have simply concatenated the high-level features extracted by neural networks from the different neuroimaging modalities. A major problem of these studies is that they overlook the low-level feature interactions between modalities during feature extraction, resulting in suboptimal performance in AD diagnosis. In this paper, we develop a dual-branch vision transformer with cross-attention and graph pooling, namely CsAGP, which enables multi-level feature interactions between the inputs to learn a shared feature representation. Specifically, we first construct a brand-new cross-attention fusion module (CAFM), which processes MRI and PET images in two independent branches of differing computational complexity and fuses their features purely through the cross-attention mechanism so that they enhance each other. After that, a concise Reshape-Pooling-Reshape (RPR) framework based on a graph pooling algorithm is developed for token selection, reducing token redundancy in the proposed model. Extensive experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database demonstrate that the suggested method obtains 99.04%, 97.43%, 98.57%, and 98.72% accuracy for the classification of AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI, respectively.

Keywords: Alzheimer’s disease, Vision transformer, Multimodal image fusion, Deep learning

1. Introduction

Alzheimer’s disease (AD) and its prodromal stage, mild cognitive impairment (MCI), are the primary causes of dementia. The increasing impairment of memory and cognitive abilities differentiates AD and MCI. Between 2000 and 2019, the number of people who died from AD in the United States increased by more than 145% (Alzheimer’s disease facts and figures, 2022). In 2021, more than 11 million Americans provided around 16 billion hours of unpaid caregiving, worth $271.6 billion, to people with AD (Alzheimer’s disease facts and figures, 2022). The report shows that the global burden of AD will reach $2 trillion, and 152 million people will suffer from AD by 2050 (Patterson, 2018). Owing to its complicated pathogenesis, there is no effective drug or method for curing AD (Liu, 2020). Consequently, precise early detection and treatment of AD are of utmost importance.

Generally, according to different pathological features, the disease has three stages: control normal (CN), MCI, and AD. Neuropsychological tests and neuroimaging diagnoses are the primary clinical examination methods for AD. The mini-mental state examination (MMSE) and the clinical dementia rating (CDR) are the most commonly utilized tools for the clinical neuropsychological evaluation of AD and assist doctors in determining a patient’s stage. With the rapid advancement of medical technology, neuroimaging has become the mainstream method for diagnosing AD. Because it presents brain tissue with great precision and can differentiate between grey and white matter, magnetic resonance imaging (MRI) has become a common tool for the neuroimaging diagnosis of AD. Positron emission tomography (PET), another widely adopted neuroimaging tool for diagnosing AD, can detect the spread of lesions and alterations in glucose metabolism using imaging agents. Moreover, fusing the complementary information provided by different neuroimaging modalities further improves the diagnostic performance for AD.

In the past decades, inspired by deep learning in the field of computer vision, deep learning methods have been extensively employed in AD Computer-Aided Diagnosis (CAD) (Suk et al., 2014; Liu et al., 2023). However, most methods only utilized unimodal images as input, and the information provided by unimodal images is one-sided, which may lead to suboptimal performance for AD diagnosis. Researchers have recently shown increasing interest in multimodal images for AD diagnosis, and more deep learning-based multimodal feature fusion algorithms have been created (Kong et al., 2022; Zhang et al., 2019). Specifically, according to the type of input modalities, these algorithms can be split into four classes: the raw image-based methods, the fused image-based methods, the generated image-based methods, and the neuroimaging and clinical data-based methods. The raw image-based methods feed multi-input neural networks with the raw neuroimaging images or their preprocessed images and then fuse the features of different modalities by latent representation learning (Zhang and Shi, 2020; Meng, 2022). Although these methods are simple to implement, they tend to inflate the number of model parameters and ignore the interaction of information between modalities. The fused image-based methods merge important and discriminative information from several modalities into a single fused image through image preprocessing steps to reduce model parameters and then take the fused image as the model input (Song et al., 2021; Wu, 2018). However, these preprocessing steps are time-consuming and also increase computational costs. Due to factors such as cost or availability, complete multimodal images are not always available in practice. To address this limitation and utilize incomplete data, the generated image-based methods directly generate the missing data from an available modality through image generation algorithms such as generative adversarial networks (GANs) (Pan and Wang, 2206; Logan, 2021). Regrettably, it is difficult to analyze the generated images quantitatively due to the particularity of medical images.

On the other hand, the neuroimaging and clinical data-based methods combine neuroimaging and clinical data to simulate the diagnostic process of clinicians (Zhao et al., 2019; Lin et al., 2021). Even though such methods can further increase the performance of AD diagnosis, they suffer from the same limitation of time-consuming preprocessing steps for the clinical data. Furthermore, extracting effective features from high-dimensional gene sequences is challenging.

Although the convolutional operation of convolutional neural networks (CNNs) improves their ability to capture local information, it generally leads CNNs to learn features relevant only to nearby brain regions rather than more generalizable features that span multiple brain regions. It has been found that even distant brain regions can interact significantly; hence, AD-related disorders can affect many different parts of the brain (Lyu et al., 2022). The vision transformer (ViT), a new architecture based on the self-attention mechanism, was designed to model global context effectively without stacking hierarchical convolution layers. Several investigations have shown that ViT is powerful in classifying AD (Zhu, 2022; Kushol et al., 2022). Notably, the problem of token redundancy in ViT (Rao et al., 2021) was not taken into account in their models.

Additionally, from the point of view of multimodal feature fusion strategy, most existing multimodal fusion diagnosis methods merely combine selected high-level features from the various modalities to merge their information, ignoring the fusion of low-level features. Compared to high-level features, low-level features have higher resolution and contain more location and detail information, which is equally important for AD diagnosis. On the other hand, the feature extraction and fusion stages are performed independently in these methods, ignoring cross-modal interactions, which restricts the model from learning a shared representation (Khan et al., 2021). Cross-modal interaction has been shown to fully fuse features and further improve model performance (Tan and Bansal, 2019).

In this paper, we design a dual-transformer based on cross-attention and a graph pooling algorithm (CsAGP) to solve the above issues, which enables multi-level feature interaction between the input modalities through the cross-attention mechanism. Specifically, we first construct a dual-branch framework for extracting multimodal features and performing disease classification. Then, to learn rich fused features, an innovative cross-attention fusion module (CAFM) is built to extract and fuse multimodal features based on the self-attention mechanism. To reduce token redundancy in the proposed model, a concise Reshape-Pooling-Reshape (RPR) framework is developed to select tokens of high significance via a graph pooling algorithm while avoiding high computation and memory costs. The proposed CsAGP performs satisfactorily on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Our major contributions are as follows:

  • (1)

    A dual-branch vision transformer with cross-attention and a graph pooling algorithm, called CsAGP, is presented, which models the global information of images with a pure self-attention mechanism and learns multimodal fused features for AD diagnosis.

  • (2)

    An innovative cross-attention mechanism-based multi-modal feature fusion method is suggested, which can efficiently learn a shared feature representation of MRI and PET images.

  • (3)

    A concise Reshape-Pooling-Reshape (RPR) framework is developed, which filters tokens based on a graph pooling algorithm to reduce computation costs and token redundancy in the proposed model.

2. Related work

This section first introduces the current deep learning-based multimodal AD diagnosis methods. Generally, based on the type of input modalities, these methods can be split into four classes: (i) the raw image-based methods, (ii) the fused image-based methods, (iii) the generated image-based methods, and (iv) the neuroimaging and clinical data-based methods. Then, vision transformer-based AD diagnosis methods are introduced.

2.1. Deep learning-based multimodal AD diagnosis

The raw image-based methods input raw neuroimaging images of different modalities, or their preprocessed images, into multi-input neural networks to fuse features between modalities by latent representation learning. Fang et al. (Fang et al., 2020) employed three CNNs (GoogLeNet, ResNet, and DenseNet) with a dropout mechanism and the Adaboost ensemble algorithm to improve the classification precision for AD. They built a stack of CNNs to learn multimodal representations from MRI and PET images while utilizing the Adaboost ensemble algorithm to fuse their probabilistic scores. In their model, the dropout mechanism is utilized to exclude slices with poor discrimination. However, the Adaboost ensemble algorithm prioritizes misclassified data, which could introduce bias due to noisy data.

Adaptive-similarity-based multimodal feature selection (ASMFS) was developed by Shi et al. (Shi, 2022), which combines adaptive similarity learning with feature selection. Unfortunately, they only checked the efficacy of their model for binary classification problems and did not test it in multi-class situations. Jiao et al. (Jiao et al., 2022) devised a multimodal feature selection approach (FC2FS), which generates feature equivalence regularization and feature construction regularization through a similarity matrix calculated from the multimodal feature vertices. Finally, a support vector machine (SVM) is employed to complete the AD diagnosis. The model’s generalization ability may not have been maximized because only standard correlation-coefficient techniques were used when constructing the similarity matrix. Zhang et al. (Zhang et al., 2021) developed a 2.5D CNN-based framework that extracts 2.5D patches from the hippocampal areas of MRI and PET images. These 2.5D patches are then integrated by a training approach termed branching pre-training to provide a full AD diagnosis.

Although the above methods can further raise the accuracy of AD diagnosis compared with traditional machine learning methods (Shi et al., 2019; Richhariya et al., 2020), multi-input neural networks demand many model parameters and high computational costs. In addition, since only the high-level features of different modalities are considered, the latent representation learning overlooks feature interactions between modalities. To address these limitations, the fused image-based methods integrate important and discriminative information from several modalities into a single fused image using image fusion algorithms and then take the fused image as the model’s input. Song et al. (Song et al., 2021) acquired a new neuroimaging modality known as “GM-PET” by fusing the gray matter (GM) of 3D structural MRI with PET images. Experimentally, their method can improve accuracy by up to 16.48% compared to the unimodal input. Although their method significantly reduces the model’s parameters compared to other multimodal fusion methods, the preprocessing steps are time-consuming.

On the other hand, Kang et al. (Kang et al., 2020) obtained fractional anisotropy (FA) and mean diffusivity (MD) 2D image slices from diffusion tensor imaging with the FMRIB Software Library (FSL), merged them with the corresponding MRI image slices into an RGB image, and finally fed the RGB image into the VGG network to complete the classification of MCI and CN. However, they only tested their method on the CN vs. MCI task and did not consider diagnostic tasks involving other stages, such as AD. To avoid the loss of spatial information when raw 3D images are reduced to 2D slices, Kong et al. (Kong et al., 2022), similar to Ref. (Song et al., 2021), fused the GM into a 3D GM image and then fed the 3D GM image into a 3D CNN, achieving 93.21% accuracy on AD vs. CN. Although the above methods can reduce the amount of computation compared to multi-input neural networks, the preprocessing steps of image fusion are demanding.

In practice, multimodal images may be incomplete because of high financial costs or limited availability. To address this limitation and utilize incomplete data, the generated image-based methods directly produce the missing data from an available modality using generative adversarial networks (GANs). By combining a GAN and a dense CNN, Gao et al. (Gao et al., 2022) constructed a hybrid framework (PT-DCN) to diagnose AD. To make use of multimodal data, they generate PET images with a task-induced pyramid GAN. The PT-DCN can learn and merge multimodal features gradually. However, their experimental data were derived from ADNI-1 and ADNI-2, so varying MR scanner parameters may affect the experimental accuracy. Zhang et al. (Zhang et al., 2022) developed a 3D GAN (BPGAN) to generate 3D PET images from MRI images. They devised a novel hybrid loss function to constrain the training process on brain data and ultimately obtained an accuracy of 98.11% for AD vs. CN. Ye et al. (Ye et al., 2022) developed a paired GAN, which uses deep MRI features extracted by a feature extractor; the network produces equivalent PET features from these instead of from raw MRI images, reducing the model’s size.

While previous work has proven that generating missing data for AD diagnosis is feasible, it has certain drawbacks when synthesizing multimodal medical images. First, the trustworthiness of the generated data is a serious issue: because of the complicated spatial structure of medical images, there are obvious differences between synthetic and real images in terms of semantics and resolution. Second, the training process is erratic; the visual pattern in medical images is often unclear, and since GAN training is prone to instability (Creswell et al., 2018), it is difficult to spot erratic behavior and implausible outcomes. Finally, the evaluation is not always convincing: because ground-truth images are often unavailable, typical pixel-wise metrics have trouble quantitatively evaluating the generated images.

The clinical diagnosis of AD relies not only on neuroimaging data but also on the subject’s clinical and biochemical information, and fusing clinical with neuroimaging data can significantly increase the accuracy of AD diagnosis. Zhang et al. (Zhang et al., 2019) employed two separate CNNs to analyze MRI and PET images for diagnosing AD. They suggested a method based on the Pearson coefficient that combines the neuroimaging diagnosis with neuropsychological evaluations (MMSE and CDR) to steer the output of their model. However, they focused solely on the high-level features of the various modal images and paid little attention to the interactions of the low-level features.

Tu et al. (Tu et al., 2022) created an innovative multimodal AD diagnostic model. They first suggested a geometric-algebraic approach that extends low-dimensional clinical data of subjects, such as profiles, gene sequences, and MMSE scores, to high-dimensional features at various levels. Second, according to the degree of influence, a feature filtration algorithm eliminates irrelevant features from the high-dimensional features and yields transformed ones. Finally, the transformed features are combined with those extracted by CNNs from MRI images. Nan et al. (Nan, 2022) suggested a framework to investigate the impact of different modalities and their combinations on AD diagnosis. Ultimately, they found that the diagnostic performance for AD increased gradually as data from more modalities were added. Furthermore, they discovered that adding single nucleotide polymorphism (SNP) data could bring a 3% to 7% performance boost to AD diagnosis.

2.2. Vision Transformer-Based AD diagnosis

Rather than stacking hierarchical convolution layers, the vision transformer models the image’s global context based on the self-attention mechanism. Several works have shown the potential of vision transformers in AD diagnosis. Lyu et al. (Lyu et al., 2022) transferred a pre-trained ViT to a brain imaging dataset. They employed ViT as the backbone network with 2D MRI images as input and finally achieved 95.3% accuracy in AD diagnosis. Zhu et al. (Zhu, 2022) merged representation learning, feature distillation, and classification into a coherent model termed Brain Informer (BraInf). They initially deployed a multi-head ProbSparse self-attention block to minimize the computational cost of representation learning. Later, a structural distillation block was utilized to reduce the dimensionality of the spatial tensor, further reducing computational costs. However, the patch size of the MRI images was predetermined in their experiments, which is ill-considered because the structural changes produced by AD within each region are not fixed.

On the other hand, Jang et al. (Jang and Hwang, 2022) developed a medical classifier for diagnosing AD. They trained a 3D CNN to recover local features linked to anomalies of AD from 3D MRI images and then fed the obtained local features into a transformer block to combine multi-plane and multi-slice features; this procedure can produce a general representation of 3D MRI images. They achieved 93.21%, 93.27%, and 85.26% accuracies on the ADNI, AIBL, and OASIS datasets, respectively. Xing et al. (Xing et al., 2022) assembled a block to transpose 3D PET images into 2D images and fed the transposed images into a parallel vision transformer model for AD diagnosis.

In general, deep learning-based multimodal AD diagnosis methods can automatically extract AD-related features from complex neuroimaging images via CNNs without domain-specific knowledge, which avoids errors introduced by handcrafted features. However, it is difficult for CNNs to capture global features that span brain regions. Meanwhile, although vision transformer-based methods can model image-global information with the self-attention mechanism, most works do not consider the problem of token redundancy in their models. In this paper, we propose a dual-transformer that fuses MRI and PET image features based on the cross-attention mechanism and selects discriminative tokens using a graph pooling algorithm to reduce redundancy.

3. Materials

Both the database ADNI and the image preprocessing pipelines are detailed in this section.

3.1. Datasets

Data used in this article were obtained from ADNI, which was established in 2003 as a public–private partnership. ADNI aims to develop clinical, imaging, and genetic biomarkers to diagnose AD. Following the methodology described in Ref. (Golovanevsky et al., 2022), 766 subjects with both MRI and PET images were selected from the ADNI1/GO and ADNI2 phases. The numbers of AD, MCI, and CN subjects were 214, 226, and 326, respectively. Each subject has a T1-weighted MRI image and an FDG-PET image in NIfTI file format. Table 1 shows the clinical information (e.g., sex, age, MMSE scores, and CDR scores) of the selected subjects. The MRI images in this paper were acquired on three MR scanners: SIEMENS, Philips Medical Systems, and GE Medical Systems.

Table 1. The clinical information of the subjects.

Diagnosis Number Age Gender(F/M) MMSE CDR
AD 214 75.1 ± 7.8 95/119 21.2 ± 4.1 0.9 ± 0.4
MCI 226 76.0 ± 7.4 82/144 25.6 ± 4.3 0.5 ± 0.3
CN 326 76.1 ± 6.4 165/161 28.7 ± 1.4 0 ± 0

The imaging parameters are, respectively: a) repetition time [TR] = 3000 ms, echo time [TE] = 3.5 ms, inversion time [TI] = 1000 ms, flip angle = 8°, thickness = 1.2 mm, matrix size = 192 × 192 × 160, field strength = 3.0 T; b) [TR] = 6.8005 ms, [TE] = 3.116 ms, [TI] = 0 ms, flip angle = 9°, thickness = 1.2 mm, matrix size = 256 × 256 × 170, field strength = 3.0 T; c) [TR] = 7.332 ms, [TE] = 3.036 ms, [TI] = 400 ms, flip angle = 11°, thickness = 1.2 mm, matrix size = 256 × 256 × 196, field strength = 3.0 T. The ADNI data acquisition details can be seen on the official webpage of ADNI.2

3.2. Data preprocessing

To remove the impact of the various imaging parameters, the raw images were preprocessed using the standard preprocessing pipeline described in Ref. (Suk et al., 2014) with the FMRIB Software Library (FSL) and Advanced Normalization Tools (ANTs).

First, the acpcdetect software aligned all raw MRI images along the anterior commissure (AC) to posterior commissure (PC) line. After correction of intensity inhomogeneity by the nonparametric non-uniform intensity normalization (N4) algorithm, these MRI images were processed with the Brain Extraction Tool (BET) in FSL to remove the cerebellum and skull. Second, we manually checked the images to ensure that the skull was fully stripped and the dura removed. Finally, all preprocessed MRI images were spatially normalized onto a standard space.

PET images were precisely aligned with their corresponding MRI images, and a Gaussian kernel was used to further smooth the preprocessed images. Using the med2image tool, 181 axial-view slice images were acquired from each MRI and PET volume. Only slices with indices 80–100 are used in this paper, as these slices contain the most relevant information about the whole brain. To meet the input specifications, the slice images were scaled to 224 × 224. The images before and after preprocessing are shown in Fig. 1.

Fig. 1. Comparison of the raw and preprocessed images.

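For concreteness, the snippet below is a minimal sketch of the slice-selection and resizing step described above, assuming the spatially normalized volumes are stored as NIfTI files readable with nibabel; the file name and the intensity rescaling are illustrative choices of ours rather than details stated in the paper.

```python
# Minimal sketch of extracting axial slices 80-100 and resizing them to 224x224.
import numpy as np
import nibabel as nib
from PIL import Image

def extract_axial_slices(nifti_path, lo=80, hi=100, size=(224, 224)):
    """Return axial slices with indices lo..hi (inclusive), rescaled to `size`."""
    volume = nib.load(nifti_path).get_fdata()            # (X, Y, Z) intensity array
    slices = []
    for z in range(lo, hi + 1):
        sl = volume[:, :, z]
        # rescale intensities to 0-255 so the slice can be handled as an image
        sl = (sl - sl.min()) / (sl.max() - sl.min() + 1e-8) * 255.0
        img = Image.fromarray(sl.astype(np.uint8)).resize(size)
        slices.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.stack(slices)                               # (21, 224, 224)

# mri_slices = extract_axial_slices("subject_0001_mri.nii.gz")   # hypothetical file name
```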

4. Methods

Considering the difference in resolution and information in MRI and PET images, we designed two branches of different computational complexity using the encoder block proposed in Ref. (Dosovitskiy et al., 2020) to process MRI and PET images individually. The proposed CsAGP, shown in Fig. 2, comprises three components: (i) two identical Patch Embed modules that convert the MRI and PET images into non-overlapping patch tokens, (ii) a stack of K CsAGP Blocks that output the final feature representation for each modality, and (iii) a classifier that predicts the AD stage from the shared feature representation.

Fig. 2. An illustration of the proposed CsAGP.


The main implementation steps of our model can be described as follows. First, the Patch Embed module is applied to the 2D MRI and PET images, splitting and transposing each input image into a series of fixed-size patch tokens, after which the positional encoding and a class token are added to each token sequence. These token sequences with positional encoding are then passed into the CsAGP Block as image feature sequences. The feature sequences first pass through the Encoder module, which primarily consists of the self-attention mechanism and a feed-forward network (FFN); compared to CNNs, the self-attention mechanism can efficiently model long-range relationships (Dosovitskiy et al., 2020). Second, the outputs of the Encoder module are fed into the CAFM for multimodal feature fusion. The CAFM realizes the interaction of multi-level features through a pure self-attention mechanism, which differs from previous methods (Zhang et al., 2019) that concatenate the high-level features into a long vector. After that, the fused token sequences are passed through the RPR framework, which selects the discriminative tokens through a graph pooling algorithm to reduce token redundancy and memory costs. Finally, the class tokens of the two modality sequences, acting as agents, are combined to obtain the shared feature representation as the output of CsAGP, as detailed in the following subsections.
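To make the overall data flow concrete, the schematic sketch below reflects our reading of Fig. 2 rather than the authors' released code: the sub-modules of each CsAGP Block are identity stand-ins, the embedding dimension of 384 is an assumption, and the classifier acting on the concatenated class tokens is our interpretation of "combining" the two class tokens.

```python
# Schematic skeleton of the CsAGP forward pass (identity stand-ins, not the real model).
import torch
import torch.nn as nn

class CsAGPSketch(nn.Module):
    def __init__(self, dim=384, num_classes=3, num_blocks=3):
        super().__init__()
        # each "block" stands in for one Encoder -> CAFM -> RPR stage;
        # concrete sketches of those stages are given in Sections 4.1-4.3
        self.blocks = nn.ModuleList([nn.Identity() for _ in range(num_blocks)])
        self.head = nn.Linear(2 * dim, num_classes)    # classifier on both class tokens

    def forward(self, tok_mri, tok_pet):
        # tok_*: (B, N+1, D) token sequences from the two Patch Embed modules
        for blk in self.blocks:
            tok_mri, tok_pet = blk(tok_mri), blk(tok_pet)
        # the two class tokens act as agents and form the shared representation
        cls = torch.cat([tok_mri[:, 0], tok_pet[:, 0]], dim=-1)
        return self.head(cls)

# logits = CsAGPSketch()(torch.randn(2, 197, 384), torch.randn(2, 197, 384))  # (2, 3)
```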

4.1. Patch Embed

In ViT, the original image is directly converted into fixed-size patches by linear projection alone, which is a poor way to capture low-level information in images. To overcome this limitation, as shown in Fig. 2, a novel tokenization approach is employed that makes optimal use of CNNs’ strength in retrieving low-level features and reduces the training difficulty of the embedding by decreasing the patch size. Specifically, for the Mmri branch, given an input image $x_{mri} \in \mathbb{R}^{H \times W}$, we first apply a 7 × 7 convolution with a stride of 4 and padding of 3 to reduce the size of the input image, followed by two 3 × 3 convolutions with a stride of 2 and padding of 1 for improved low-level information extraction.

After that, the output $x_{mri} \in \mathbb{R}^{D \times \frac{H}{P} \times \frac{W}{P}}$ of the Patch Embed module is flattened and transposed to get the patch token matrix $x_{patch}^{mri} \in \mathbb{R}^{N \times D}$, where $N = HW/P^2$ is the number of patches, $D$ is the number of enriched channels, and $(H, W)$ and $(P, P)$ represent the resolution of the input images and image patches, respectively. Finally, the positional encoding and an extra class token $x_{cls}^{mri} \in \mathbb{R}^{1 \times D}$ are added to the patch token matrix $x_{patch}^{mri}$ as the image representation, resulting in the final patch token matrix $x_f^{mri} \in \mathbb{R}^{(N+1) \times D}$ for the subsequent steps. These procedures can be written as follows:

$$x_{mri} = \mathrm{ReLU}(\mathrm{Conv}_3(\mathrm{ReLU}(\mathrm{Conv}_2(\mathrm{ReLU}(\mathrm{Conv}_1(x_{mri})))))) \tag{1}$$
$$x_{patch}^{mri} = \mathrm{Transpose}(\mathrm{Flatten}(x_{mri})) \tag{2}$$
$$x_f^{mri} = [x_{cls}^{mri} \,\|\, x_{patch}^{mri}] + PE, \quad PE \in \mathbb{R}^{(N+1) \times D} \tag{3}$$

where $\|$ denotes the concatenation operation and $PE \in \mathbb{R}^{(N+1) \times D}$ represents the positional encoding following Ref. (Dosovitskiy et al., 2020). The Mpet branch follows the same procedure but takes a 2D PET image as input and adds its own class token $x_{cls}^{pet} \in \mathbb{R}^{1 \times D}$.
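A minimal PyTorch sketch of this tokenization is given below. The 7 × 7/stride-4 and 3 × 3/stride-2 convolutions and the appended class token and positional encoding follow Eqs. (1)–(3); the intermediate channel widths and the embedding dimension D = 384 are our assumptions rather than values stated in the paper.

```python
# Sketch of the convolutional Patch Embed of Eqs. (1)-(3).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_ch=1, dim=384, img_size=224):
        super().__init__()
        # channel widths below are assumptions; only kernel/stride/padding follow the text
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 4, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim // 4, dim // 2, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        n_patches = (img_size // 16) ** 2              # overall stride 4 * 2 * 2 = 16
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                               # x: (B, 1, H, W)
        x = self.stem(x)                                # (B, D, H/16, W/16)  -- Eq. (1)
        x = x.flatten(2).transpose(1, 2)                # (B, N, D)           -- Eq. (2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # Eq. (3)

# tokens = PatchEmbed()(torch.randn(2, 1, 224, 224))    # (2, 197, 384)
```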

4.2. Cross-Attention fusion module (CAFM)

The cross-attention fusion module (CAFM) was designed to fuse multimodal features efficiently. Specifically, let $x_f^i \in \mathbb{R}^{(N+1) \times D}$ be the final patch token matrix output by the previous step at branch $i$, where $i$ indexes the branch (Mmri or Mpet).

Fusion in the CAFM involves the class token $x_{cls}^i$ of one branch and the patch tokens of the other branch. Specifically, the class token $x_{cls}^i$ is utilized as an agent to exchange information with the patch tokens of the other branch, and then returns to the $i$-th branch so that the multimodal features are combined efficiently and favorably. After fusing the patch tokens from the other branch, the class token exchanges information with its own patch tokens once more in the subsequent blocks, imparting the information obtained from the other branch into its own patch token representations.

As shown in Fig. 2, the final matrix $x_f^i$ is fed into the CAFM, which includes two sub-blocks, each with two parts. The first part mainly contains a multi-head cross-attention (MCA) mechanism to exchange information with the patch tokens of the other branch. An illustration of the MCA on the Mmri branch is provided in Fig. 3. The Mmri branch first collects the patch tokens $x_{patch}^{pet} \in \mathbb{R}^{N \times D}$ from the Mpet branch and then concatenates them with its own class token $x_{cls}^{mri}$, as expressed in Eq. (4):

$$x'_{mri} = [x_{cls}^{mri} \,\|\, x_{patch}^{pet}] \tag{4}$$

Fig. 3. Multi-head cross-attention feature fusion for the Mmri branch.


Then, the module performs the MCA between $x_{cls}^{mri}$ and $x'_{mri}$, where the class token $x_{cls}^{mri}$ of the Mmri branch is the only query, as patch-token information has already been integrated into the class token. The MCA can be written mathematically as:

$$q = x_{cls}^{mri} W_q, \quad k = x'_{mri} W_k, \quad v = x'_{mri} W_v \tag{5}$$
$$A = \mathrm{softmax}\!\left(qk^{T}/\sqrt{D/h}\right) \tag{6}$$
$$\mathrm{MCA}(x'_{mri}) = Av \tag{7}$$

where $W_q, W_k, W_v \in \mathbb{R}^{D \times (D/h)}$ are learnable parameters, $D$ is the embedding dimension of the tokens, and $h$ is the number of heads. Because only the class token is used as the query, the computational and memory costs of constructing $A$ in the MCA are linear rather than quadratic. Finally, the output $z_{mri}$ of the first part, with a residual shortcut, is defined as follows:

$$y_{cls}^{mri} = x_{cls}^{mri} + \mathrm{MCA}([x_{cls}^{mri} \,\|\, x_{patch}^{pet}]) \tag{8}$$
$$z_{mri} = [y_{cls}^{mri} \,\|\, x_{patch}^{mri}] \tag{9}$$

The second part primarily consists of a feed-forward network with non-linear activation, which performs a spatial transformation of $z_{mri}$ through two linear projection layers to enhance the representational ability of the tokens. It can be described as follows:

$$Z_{mri} = \mathrm{LN}(\mathrm{FFN}(\mathrm{LN}(z_{mri})) + z_{mri}) \tag{10}$$
$$\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 \tag{11}$$

where $W_1 \in \mathbb{R}^{D \times K}$ is the weight of the first layer, which projects each token into a higher dimension $K$, and $W_2 \in \mathbb{R}^{K \times D}$ is the weight of the second layer. $b_1 \in \mathbb{R}^{1 \times K}$ and $b_2 \in \mathbb{R}^{1 \times D}$ are the biases. LN denotes layer normalization, and $\sigma(\cdot)$ is a non-linear activation function.
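The sketch below implements one CAFM sub-block for the MRI branch following Eqs. (4)–(11): the MRI class token is the only query and attends over the concatenation of itself and the PET patch tokens, so attention is linear in the token count. The number of heads, the GELU activation, and the FFN expansion factor are our assumptions.

```python
# Sketch of one CAFM sub-block (MRI side) following Eqs. (4)-(11).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=384, heads=6, ffn_mult=4):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                 nn.Linear(ffn_mult * dim, dim))

    def forward(self, cls_mri, patch_mri, patch_pet):
        # cls_mri: (B, 1, D); patch_mri: (B, N, D); patch_pet: (B, N, D)
        kv = torch.cat([cls_mri, patch_pet], dim=1)                 # Eq. (4)
        B = cls_mri.size(0)
        q = self.wq(cls_mri).view(B, 1, self.heads, self.dk).transpose(1, 2)
        k = self.wk(kv).view(B, -1, self.heads, self.dk).transpose(1, 2)
        v = self.wv(kv).view(B, -1, self.heads, self.dk).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.dk ** 0.5           # Eq. (6), one query
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, 1, -1)
        cls_mri = cls_mri + out                                     # Eq. (8), residual
        z = torch.cat([cls_mri, patch_mri], dim=1)                  # Eq. (9)
        return self.ln2(self.ffn(self.ln1(z)) + z)                  # Eqs. (10)-(11)

# fused = CrossAttentionFusion()(torch.randn(2, 1, 384),
#                                torch.randn(2, 196, 384), torch.randn(2, 196, 384))
```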

4.3. RPR framework

To reduce token redundancy in the proposed CsAGP, we developed the Reshape-Pooling-Reshape (RPR) framework, which consists of three stages: (i) tokens to graph (T2G), (ii) graph pooling, and (iii) graph to tokens (G2T), as illustrated in Fig. 4. The token sequences are converted into graph-structured data in the T2G stage. A graph pooling algorithm is then utilized to filter the tokens, and only the discriminative tokens are retained. Finally, the pooled subgraph vertices are reconverted into a token sequence in the G2T stage for the next step.

Fig. 4. The illustration of the RPR framework of the Mmri branch.


4.3.1. Tokens to graph (T2G)

For the Mmri branch, given the tokens $Z_{mri} \in \mathbb{R}^{(N+1) \times D}$ generated by the CAFM, we first split them into a patch token matrix $z_{patch}^{mri} \in \mathbb{R}^{N \times D}$ and a class token $z_{cls}^{mri} \in \mathbb{R}^{1 \times D}$. Then, a graph $G_{mri} = (V, A)$ is constructed, where $V$ is the vertex set $\{v_1, \ldots, v_N\}$ and $A \in \{0, 1\}^{N \times N}$ is the adjacency matrix describing the edge connections of $G_{mri}$.

In other words, a graph $G_{mri}$ with $N$ vertices is constructed, where each vertex $v_i$ has a corresponding $D$-dimensional feature vector $z_i^{mri} \in \mathbb{R}^{1 \times D}$, and the feature matrix $z_{patch}^{mri} \in \mathbb{R}^{N \times D}$ stacks the $N$ feature vectors. Then, the adjacency matrix $A$ is established from the Euclidean distances between the vertex feature vectors. Specifically, if the distance $dist_{ij}$ between vertices $v_i$ and $v_j$ is smaller than the average distance $\mu$, then $A_{ij} = 1$, which means there is an edge between $v_i$ and $v_j$; otherwise $A_{ij} = 0$. The process of establishing the adjacency matrix $A$ can be formulated as follows:

$$dist = \begin{bmatrix} \|z_1^{mri} - z_1^{mri}\|_2 & \|z_1^{mri} - z_2^{mri}\|_2 & \cdots & \|z_1^{mri} - z_N^{mri}\|_2 \\ \|z_2^{mri} - z_1^{mri}\|_2 & \|z_2^{mri} - z_2^{mri}\|_2 & \cdots & \|z_2^{mri} - z_N^{mri}\|_2 \\ \vdots & \vdots & \ddots & \vdots \\ \|z_N^{mri} - z_1^{mri}\|_2 & \|z_N^{mri} - z_2^{mri}\|_2 & \cdots & \|z_N^{mri} - z_N^{mri}\|_2 \end{bmatrix} \tag{12}$$
$$\mu = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} dist_{ij} \tag{13}$$
$$A_{ij} = \begin{cases} 1 & \text{if } dist_{ij} < \mu, \; 1 \le i, j \le N \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm and $dist$ is the distance matrix between vertices. $\mu$ is the average distance, and $dist_{ij}$ and $A_{ij}$ are the entries of $dist$ and $A$ in the $i$-th row and $j$-th column, respectively. Finally, the patch token graph $G_{mri}$ is created, with $A$ as its adjacency matrix and $z_{patch}^{mri}$ as its feature matrix. The Mpet branch generates the graph $G_{pet}$ in the same way.
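A compact sketch of this T2G construction, under the assumption that the whole patch-token matrix fits in memory, could look as follows (the function name is ours):

```python
# Sketch of the T2G step (Eqs. (12)-(14)): pairwise Euclidean distances between patch
# tokens are thresholded at their mean to form a binary adjacency matrix.
import torch

def tokens_to_graph(patch_tokens):
    """patch_tokens: (N, D) feature matrix; returns an (N, N) {0,1} adjacency matrix."""
    dist = torch.cdist(patch_tokens, patch_tokens, p=2)   # Eq. (12)
    mu = dist.mean()                                       # Eq. (13)
    adj = (dist < mu).float()                              # Eq. (14)
    return adj

# adj = tokens_to_graph(torch.randn(196, 384))
```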

4.3.2. Graph pooling

We developed a novel graph pooling algorithm to reduce token redundancy by selecting the discriminative vertices of the graphs $G_{mri}$ and $G_{pet}$ generated in the previous stage. As shown in Fig. 4, the algorithm evaluates the importance of vertices in multiple ways. The structure-based learning module (SBLM) and the feature-based learning module (FBLM) score the vertices according to their local structure and feature information, producing scores $s_1$ and $s_2$, respectively. Then, the structure-feature learning module (SFLM) obtains the final score $s$ for each vertex by combining $s_1$ and $s_2$. To make the final graph embedding carry more feature information, the vertex feature fusion module aggregates the features of the vertices to be pooled before discarding them. Finally, only the top-$k$ vertices are retained according to the final score $s$. The details of these procedures for the Mmri branch are described as follows; the Mpet branch is processed in the same way.

As shown in Fig. 4, the graph $G_{mri}$ output by the T2G stage is fed into three branches to evaluate the importance of the vertices in multiple ways. Since graph convolutional networks (GCNs) consider the structural information of graphs, a GCN is utilized in the SBLM to score each vertex based on structural information. The mathematical representation is as follows:

$$s_1 = \sigma\!\left(\hat{W}^{-\frac{1}{2}} \tilde{A} \hat{W}^{-\frac{1}{2}} X W\right) \tag{15}$$

where $\tilde{A}$ and $X \in \mathbb{R}^{N \times D}$ are the adjacency matrix and the vertex features of the graph $G_{mri}$, respectively, $\hat{W}$ denotes the diagonal vertex degree matrix, $W \in \mathbb{R}^{D \times 1}$ represents the learnable parameters, and $\sigma(\cdot)$ is a non-linear activation function.

In the FBLM, each vertex is scored by a CNN based on its feature information. The FBLM mainly consists of a 1D CNN and a batch normalization layer, mathematically represented as:

$$s_2 = \sigma(\mathrm{BN}(\mathrm{Conv}(X))) \tag{16}$$

where $X \in \mathbb{R}^{N \times D}$ represents the feature matrix of the graph $G_{mri}$.

Then, the SFLM combines $s_1$ and $s_2$ to calculate the final vertex scores. Given the scores $s_1 \in \mathbb{R}^{N \times 1}$ and $s_2 \in \mathbb{R}^{N \times 1}$ obtained from the SBLM and FBLM, respectively, we first add $s_1$ and $s_2$ to get a coarse score $s' \in \mathbb{R}^{N \times 1}$; the coarse score $s'$ is then fed into a 1D CNN to output the final score $s \in \mathbb{R}^{N \times 1}$. This can be denoted as:

$$s = \mathrm{BN}(\mathrm{Conv}(s')), \quad s' = s_1 + s_2 \tag{17}$$

After that, the vertices are sorted by the final score $s$, and only the top-$k$ vertices $V' = \{v_1, \ldots, v_k\}$ are retained as the pooling result.

Finally, to make the final graph embedding vectors more representative, we aggregate information from neighboring vertices in the feature fusion module using a graph attention network (GAT) before discarding the vertex set $\bar{V}$, where $\bar{V} = V \setminus V' = \{v_{k+1}, \ldots, v_N\}$ represents the set of vertices to be discarded. This can be denoted as:

$$z_i = \sigma\!\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{V}_i}\alpha_{ij}^{k}W^{k}h_j\right) \tag{18}$$

where $z_i$ is the aggregated feature vector of vertex $v_i$ and $h_j$ is the feature vector of its neighboring vertex $v_j$. $\mathcal{V}_i$ is the set of vertices adjacent to $v_i$, $K$ is the number of attention heads, $\alpha_{ij}^{k}$ is the $k$-th attention value between $v_i$ and $v_j$, and $W^{k}$ is the corresponding weight matrix.
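The sketch below illustrates the scoring and top-k selection of Eqs. (15)–(17) for a single graph; the sigmoid activation, the 1-D convolution kernel size, and the symmetric normalization of the adjacency matrix are our assumptions, and the GAT-based feature fusion of Eq. (18) is omitted for brevity.

```python
# Sketch of the graph pooling scores (SBLM, FBLM, SFLM) and top-k vertex selection.
import torch
import torch.nn as nn

class GraphPoolSketch(nn.Module):
    def __init__(self, dim=384, ratio=0.5):
        super().__init__()
        self.ratio = ratio
        self.w_gcn = nn.Linear(dim, 1, bias=False)               # SBLM projection W
        self.fblm = nn.Sequential(nn.Conv1d(dim, 1, kernel_size=1), nn.BatchNorm1d(1))
        self.sflm = nn.Sequential(nn.Conv1d(1, 1, kernel_size=1), nn.BatchNorm1d(1))

    def forward(self, x, adj):
        # x: (N, D) vertex features, adj: (N, N) binary adjacency matrix
        deg = adj.sum(-1).clamp(min=1.0)
        norm_adj = adj / deg.sqrt().unsqueeze(1) / deg.sqrt().unsqueeze(0)
        s1 = torch.sigmoid(self.w_gcn(norm_adj @ x)).squeeze(-1)          # Eq. (15)
        s2 = torch.sigmoid(self.fblm(x.t().unsqueeze(0))).squeeze()       # Eq. (16)
        s = self.sflm((s1 + s2).view(1, 1, -1)).squeeze()                 # Eq. (17)
        k = int(self.ratio * x.size(0))
        idx = torch.topk(s, k).indices                                    # keep top-k vertices
        return x[idx], adj[idx][:, idx], idx

# pooled_x, pooled_adj, kept = GraphPoolSketch()(torch.randn(196, 384),
#                                                (torch.rand(196, 196) > 0.5).float())
```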

4.3.3. Graph to tokens (G2T)

Given the subgraph $G'_{mri} = (V', A')$ of $G_{mri}$ obtained from the graph pooling stage, where $V' = \{v_1, \ldots, v_k\}$ and $A' \in \mathbb{R}^{k \times k}$ are the vertex set and the adjacency matrix of $G'_{mri}$, respectively, let $X' \in \mathbb{R}^{k \times D}$ denote the feature matrix of $G'_{mri}$. In G2T, the feature matrix $X'$ is reassembled into the token sequence $z_p \in \mathbb{R}^{k \times D}$, and then the class token $z_{cls}^{mri} \in \mathbb{R}^{1 \times D}$ and a new positional encoding are added to $z_p$ to provide spatial information, which can be expressed as follows:

$$z_p = \mathrm{reshape}(X') \tag{19}$$
$$z_{out} = [z_{cls}^{mri} \,\|\, z_p] + PE, \quad PE \in \mathbb{R}^{(k+1) \times D} \tag{20}$$

As shown in Fig. 2, the Mpet branch follows the same operations as the Mmri branch.
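A minimal sketch of the G2T reassembly of Eqs. (19)–(20) is given below; in practice the new positional encoding would be a learnable parameter, which we only hint at in the comment.

```python
# Sketch of the G2T step: pooled vertex features become the new patch tokens, and the
# class token plus a fresh positional encoding are re-attached (Eqs. (19)-(20)).
import torch

def graph_to_tokens(pooled_feats, cls_token, pos_embed):
    # pooled_feats: (B, k, D); cls_token: (B, 1, D); pos_embed: (1, k+1, D), learnable in practice
    z = torch.cat([cls_token, pooled_feats], dim=1)     # Eq. (20), concatenation
    return z + pos_embed

# k, D = 98, 384
# out = graph_to_tokens(torch.randn(2, k, D), torch.randn(2, 1, D),
#                       torch.zeros(1, k + 1, D))        # (2, k+1, D)
```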

5. Experiment and results

In this section, the experimental setup and the results of the performance evaluation are provided. In addition, the activated areas of CsAGP are visualized.

5.1. Experimental setup

All experiments are implemented on a workstation with two Intel Xeon Gold 6330 CPUs and four Nvidia A100 GPUs with a total of 160 GB of video memory, running Ubuntu 20.04.1 LTS. We built our model on the PyTorch 1.12.0 framework and trained it for 300 epochs. Adam is used as the optimizer, and the further experimental settings are as follows: (i) the batch size is set to 128; (ii) the loss function is the cross-entropy loss; (iii) the initial learning rate is set to $1 \times 10^{-5}$ and the weight decay to $5 \times 10^{-4}$. Of the experimental data, 60% were randomly selected for training, 20% were randomly chosen for validation, and the remaining 20% of subjects were used as test data.

For CsAGP, considering the difference in resolution and information contained in MRI and PET images and following Ref. (Chen et al., 2021), we set K = 3, M = 1, and N = 3, where K denotes the number of CsAGP Blocks, and M and N denote the number of Encoders in the PET and MRI branches, respectively. Weighing computation costs against benefits, the pooling rate r is set to 0.5.
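For reference, a minimal training-loop sketch matching the reported settings might look as follows; the model and the data loader are placeholders, since the authors' implementation is not yet released.

```python
# Sketch of the reported training configuration: Adam, lr 1e-5, weight decay 5e-4,
# batch size 128, cross-entropy loss, 300 epochs.
import torch
import torch.nn as nn

model = nn.Linear(384, 3)                     # stand-in for the full CsAGP model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

def train(loader, epochs=300):
    model.train()
    for _ in range(epochs):
        for features, labels in loader:       # loader yields batches of size 128
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
```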

5.2. Performance evaluation

To provide a quantitative assessment of the effectiveness of the suggested method for diagnosing AD, several evaluation metrics, including accuracy, specificity, and sensitivity, were computed as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{21}$$
$$\mathrm{sensitivity} = \frac{TP}{TP + FN} \tag{22}$$
$$\mathrm{specificity} = \frac{TN}{FP + TN} \tag{23}$$

The terms “true positive,” “true negative,” “false positive,” and “false negative” are denoted “TP,” “TN,” “FP,” and “FN,” respectively. In addition to the three criteria above, the area under the curve (AUC) is also considered when assessing performance. The AUC, the area under the receiver operating characteristic (ROC) curve, is a performance metric employed to measure the quality of a classifier; a larger AUC indicates better classification performance.
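The metrics of Eqs. (21)–(23), together with the AUC, can be computed as in the following sketch; the use of scikit-learn and the toy arrays are our assumptions.

```python
# Sketch of the evaluation metrics in Eqs. (21)-(23) plus AUC for a binary task.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)     # Eq. (21)
    sensitivity = tp / (tp + fn)                      # Eq. (22)
    specificity = tn / (fp + tn)                      # Eq. (23)
    auc = roc_auc_score(y_true, y_score)
    return accuracy, sensitivity, specificity, auc

# toy example with placeholder predictions
# print(binary_metrics(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]),
#                      np.array([0.2, 0.9, 0.4, 0.1])))
```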

5.3. Experiment results

In our experiments, the whole dataset was divided into AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI groups to evaluate CsAGP. Each group of experiments was conducted with unimodal (MRI or PET) and multimodal (MRI and PET) inputs. To make the results more convincing, we fed two identical images into the model when conducting the unimodal experiments. Table 2 compares the classification performance of each group.

Table 2. Classification results of the unimodal and multimodal method.

Auxiliary diagnosis Modality SEN (%) SPE (%) ACC (%) AUC (%)
AD vs CN MRI 96.73 98.39 97.87 99.62
AD vs CN PET 91.72 97.87 95.92 98.97
AD vs CN MRI + PET 97.96 99.54 99.04 99.80
AD vs MCI MRI 92.25 96.72 95.37 98.79
AD vs MCI PET 89.03 96.33 94.12 98.27
AD vs MCI MRI + PET 94.25 98.81 97.43 99.23
CN vs MCI MRI 92.64 97.10 94.94 98.92
CN vs MCI PET 94.61 94.77 94.69 98.83
CN vs MCI MRI + PET 98.52 98.61 98.57 99.76
AD vs CN vs MCI MRI 92.96 96.88 94.21 98.82
AD vs CN vs MCI PET 92.28 96.49 93.37 98.24
AD vs CN vs MCI MRI + PET 98.65 99.34 98.72 99.86

SEN: sensitivity; SPE: specificity; ACC: accuracy.

As can be seen from Table 2, the multimodal CsAGP outperforms the unimodal method. Specifically, the developed multimodal method obtains classification accuracies of 99.04%, 97.43%, 98.57%, and 98.72% on AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI, whereas the accuracies of the MRI modality are 97.87%, 95.37%, 94.94%, and 94.21%, respectively.

Compared to the MRI modality, the proposed multimodal method improves the classification accuracy by 1.17%, 2.06%, 3.63%, and 4.51% on AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI, respectively. For the PET modality, the accuracies on AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI are 95.92%, 94.12%, 94.69%, and 93.37%, respectively. Compared to the PET modality, the proposed multimodal method achieves accuracy gains of 3.12%, 3.31%, 3.88%, and 5.35%, respectively. By combining MRI and PET, the proposed multimodal method significantly improves classification accuracy compared with the unimodal method.

On the other hand, it can also be seen that the classification accuracy of the MRI modality surpasses that of the PET modality in each group of classification experiments. Compared with the PET modality, the accuracy of MRI increases by 1.95%, 1.25%, 0.25%, and 0.84% on AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI, respectively. This indicates that CsAGP can capture more discriminative features from MRI images when extracting unimodal features. We attribute this to the higher resolution of MRI images compared to PET images, which allows better differentiation between soft tissue and anatomical structures.

Compared to the results of the other group tasks on the ADNI database, the diagnostic accuracy of the AD vs. CN task is, on the whole, higher than that of the other tasks. Similar results were also reported in Ref. (Gao et al., 2022). This can be interpreted as follows: the primary neuroimaging features of AD can be distinguished more easily from those of CN and MCI. Since the subtle AD-related changes that occur in MCI are not noticeable, distinguishing MCI from AD and CN by neuroimaging data alone is difficult. We further present the performance of each group to demonstrate the differences between the groups intuitively. As seen in Fig. 5, the multimodal method performs better than the unimodal ones, which shows that joining the MRI and PET modalities can further boost classification performance.

Fig. 5. Classification performance of various groups.


5.4. Comparison with other methods

In this section, we compared our CsAGP to several other multimodal methods based on the ADNI database. As shown in Table 3, the compared methods include the raw image-based methods (Zhang et al., 2019; Fang et al., 2020; Liu et al., 2022; Kun et al., 2020), the traditional machine learning method (Shi, 2022), the fused image-based method (Song et al., 2021), the generated image-based method (Zhang et al., 2022), and the neuroimaging and clinical data-based method (Zhang et al., 2019).

Table 3. Performance comparison of the different existing methods.

Tasks Methods SEN (%) SPE (%) ACC (%) AUC (%)
AD vs CN Fang et al. (2020) 95.89 98.72 99.27 n/a
AD vs CN Zhang et al. (2019) 96.58 95.36 98.47 98.61
AD vs CN Shi et al. (2022) 96.10 97.47 96.76 97.03
AD vs CN Song et al. (2021) 93.33 94.27 94.11 n/a
AD vs CN CsAGP (ours) 97.96 99.54 99.04 99.80
AD vs MCI Fang et al. (2020) 89.71 93.59 92.57 n/a
AD vs MCI Zhang et al. (2019) 90.11 91.82 85.74 88.15
AD vs MCI Song et al. (2021) 71.19 85.94 80.80 n/a
AD vs MCI Liu et al. (2022) 94.91 98.52 94.44 97.00
AD vs MCI CsAGP (ours) 94.25 98.81 97.43 99.23
CN vs MCI Fang et al. (2020) 88.36 92.56 90.35 n/a
CN vs MCI Zhang et al. (2019) 97.43 84.31 88.20 88.01
CN vs MCI Shi et al. (2022) 85.98 70.90 80.73 78.75
CN vs MCI Song et al. (2021) 84.69 85.60 85.00 n/a
CN vs MCI CsAGP (ours) 98.52 98.61 98.57 99.76
AD vs CN vs MCI Song et al. (2021) 55.67 83.40 71.52 n/a
AD vs CN vs MCI Han et al. (Kun et al., 2020) n/a n/a 67.74 n/a
AD vs CN vs MCI Zhang et al. (2022) n/a n/a 80.00 95.00
AD vs CN vs MCI CsAGP (ours) 98.65 99.34 98.72 99.86

Bold value means the best indicator value under the same conditions and ‘n/a’ means no data.

In the AD vs. CN task, the accuracy of Fang et al. (Fang et al., 2020) was 99.27%, slightly higher than that of our suggested method. This is due to their use of ensemble learning, where the output of their model is based on three CNNs (GoogLeNet, ResNet, and DenseNet). By combining multiple different CNNs, they could leverage their diversity and differences: each CNN may perform better on different subsets of the data or feature subspaces, and aggregating their predictions through ensemble learning reduces bias and variance, improving the overall accuracy of the model.

Additionally, Ref. (Fang et al., 2020) also introduced a “dropout” mechanism to discard images with low discrimination, further reducing the noise in their model’s input data. Although ensemble learning enabled them to achieve higher classification accuracy, training three CNNs requires many parameters and much computation. In addition, compared with Fang et al. (Fang et al., 2020), CsAGP achieves the best results on all metrics except accuracy.

In the AD vs. MCI task, the sensitivity reported by Liu et al. (Liu et al., 2022) was 94.91%, only 0.66% higher than ours, which means that their model’s ability to identify positive examples is slightly better than ours. They diagnosed AD by fusing multi-scale gray and white matter features from MRI images, while we only considered 2D slice images and single-scale feature information. By extracting features at different scales and fusing them, a model can exploit both local details and global contextual information, enhancing its understanding and expression of the images. Additionally, Ref. (Liu et al., 2022) employs a channel attention mechanism to automatically learn the importance weight of each channel, enabling the model to focus on features relevant to the task. By enhancing important channels, the model improves its perception of crucial information and thus its performance.

Our CsAGP achieves the best diagnostic performance in the CN vs. MCI and AD vs. CN vs. MCI tasks. This can be attributed to several factors. Firstly, in addition to leveraging high-level features from different modalities, we also pay attention to the fusion of low-level features across modalities. This comprehensive integration of both high-level and low-level features enables CsAGP to capture a more complete representation of multimodal data. Secondly, by simultaneously conducting the feature extraction and fusion stages for different modalities, we facilitate the effective integration of multimodal features. This simultaneous processing allows CsAGP to learn shared representations and exploit complementary information from different modalities, further enhancing its performance. Furthermore, because the CAFM and the RPR framework drastically decrease the number of network parameters, the computational complexity and memory cost of our CsAGP do not rise.

5.5. Ablation experiments

In this section, ablation experiments were carried out on our CsAGP to demonstrate the efficacy of the CAFM and the RPR framework. All experiments used the same settings to ensure a fair comparison.

To reduce token redundancy and computation costs, we proposed a graph pooling algorithm to select discriminative tokens, which evaluates tokens from both feature and structural perspectives. Experiments were conducted to investigate the influence of the graph pooling algorithm on prediction performance. Since multi-class classification is more challenging than binary classification, CsAGP was evaluated with different pooling rate r values. The results on the AD vs. CN vs. MCI task are reported in Table 4.

Table 4. The classification results for different r.

r SEN (%) SPE (%) ACC (%) AUC (%)
0.1 95.61 98.04 96.30 99.32
0.3 95.56 98.42 97.00 99.49
0.5 98.65 99.34 98.72 99.83
0.7 98.69 99.36 98.79 99.86
0.9 99.00 99.57 99.21 99.90

It can be seen that the classification accuracy generally increases as r increases. Specifically, when the pooling rate r increases from 0.1 to 0.5, the classification accuracy of CsAGP rises from 96.30% to 98.72%, an improvement of 2.42%. However, the trend of increasing classification performance gradually flattens out when the pooling rate r exceeds 0.5. For example, when r = 0.9, the accuracy is 99.21%, only 0.49% higher than at r = 0.5. Therefore, weighing computation costs against benefits, r is set to 0.5 in our experiments.

As the pooling rate r increases, more tokens are preserved, allowing the model to capture more information and consequently leading to a rapid improvement in model performance. However, as r continues to increase, the noise and the computational cost of the model also increase. As a result, the trend of performance improvement of the model gradually flattens out.

To investigate the effectiveness of the FBLM and the SFLM, we conducted a series of experiments with different strategies; the results are listed in Table 5. Method A uses an MLP to evaluate the vertex feature information (FBLM*) and a linearly weighted sum of the vertex scores $s_1$ and $s_2$ (SFLM*).

Table 5. Ablations on FBLM and SFLM.

Method FBLM* SFLM* FBLM SFLM SEN (%) SPE (%) ACC (%) AUC (%)
A ✓ ✓ – – 98.49 99.25 98.23 99.86
B ✓ – – ✓ 98.46 99.20 98.47 99.79
C – ✓ ✓ – 98.51 99.27 98.63 99.81
D – – ✓ ✓ 98.65 99.34 98.72 99.86

By changing SFLM* to SFLM, the accuracy improves by 0.24% (Method A vs. Method B). When we change FBLM* to FBLM, the accuracy increases by 0.4% (Method A vs. Method C). Further, when using both FBLM and SFLM, as in Method D, the accuracy rises by 0.49% (Method A vs. Method D). These results validate that the comprehensive consideration of both vertex position and feature information plays a crucial role in the graph pooling process. Vertex position information aids in understanding the contextual and topological relationships within the graph structure, while vertex feature information provides descriptions of vertex attributes, offering crucial information for vertex representation and learning. Combining these two aspects of information helps the model better understand and process graph data, enhancing its performance and expressive capability.

To evaluate the effectiveness of the CAFM in CsAGP, we removed the CAFM while keeping the other configurations the same; in this setting, only the high-level features of the two modalities are fused. Comparative experiments were performed on all diagnosis tasks.

As seen from Table 6, under the influence of the CAFM, the accuracy increases by 1.33%, 1.38%, 2.9%, and 3.02% on AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI, respectively. These results indicate that fusing multi-level features from different modalities can further improve model performance. High-level features often contain more abstract and semantically rich information, capturing the high semantics and contextual information of images.

Table 6. Classification results of removing CAFM.

Auxiliary diagnosis Method SEN (%) SPE (%) ACC (%) AUC (%)
AD vs CN CsAGP 97.96 99.54 99.04 99.80
AD vs CN w/o CAFM 96.13 98.44 97.71 98.71
AD vs MCI CsAGP 94.25 98.81 97.43 99.23
AD vs MCI w/o CAFM 93.00 96.93 96.05 98.05
CN vs MCI CsAGP 98.52 98.61 98.57 99.76
CN vs MCI w/o CAFM 94.70 96.60 95.67 99.12
AD vs CN vs MCI CsAGP 98.65 99.34 98.72 99.86
AD vs CN vs MCI w/o CAFM 95.30 97.73 95.70 99.26

On the other hand, low-level features focus more on low-level details and local features. By fusing multi-level features, it is possible to fully utilize the complementarity of high-level and low-level features, providing a more comprehensive and rich feature representation, and enhancing the model’s understanding and expressive capability. Furthermore, high-level features are usually less sensitive to modality differences, while low-level features are more sensitive to such differences. By integrating multi-level features, the impact of modality differences can be reduced, enhancing the model’s robustness and generalization ability towards multimodal images.

In addition, each transformer branch in our model uses its class token as an agent that exchanges information with the other branch via the cross-attention mechanism. This makes it possible to generate attention maps in linear rather than quadratic time.

5.6. Visualization

Fig. 6 shows the activated areas of our CsAGP obtained with the Grad-CAM technique (Selvaraju et al., 2017). The left and right images in each cell show a subject’s slice image in a given modality and the corresponding AD-related activation map. The heatmap in Fig. 6(a) shows that the areas of interest are dispersed throughout the brain, which means that our model can analyze AD-related abnormalities throughout the brain.

Fig. 6. AD-related visualization map results using Grad-CAM.


Compared with CNNs, transformer-based networks have a large receptive field, one advantage of which is the presence of wide activated areas. In addition, compared with AD, the heatmap areas of MCI (Fig. 6(c)) are relatively concentrated, which may be because MCI is the prodromal stage of AD with few lesion areas. The heatmap areas of CN (Fig. 6(e)) are mainly focused on the center of the brain.

Furthermore, due to different imaging protocols and information emphases, the heatmap areas of the three stages of PET images (Fig. 6(b), Fig. 6(d), and Fig. 6(f)) are relatively concentrated. It can be seen that the heatmap areas of the different stages focus on different brain regions. This result further supports the view in Ref. (Suk et al., 2014) that complementary information can be obtained from a variety of modalities to improve AD diagnostic performance.

6. Conclusion

This paper proposes a dual-branch vision transformer with a cross-attention mechanism and a graph pooling algorithm, CsAGP, for multimodal AD classification. We designed a multimodal feature fusion strategy based on the cross-attention mechanism to effectively learn the shared feature representation of MRI and PET images. Furthermore, a concise framework based on a graph pooling algorithm is developed to reduce token redundancy in the proposed model. Extensive experiments on the ADNI database demonstrate that the classification accuracies of our proposed CsAGP for AD vs. CN, AD vs. MCI, CN vs. MCI, and AD vs. CN vs. MCI are 99.04%, 97.43%, 98.57%, and 98.72%, which are 4.93%, 2.99%, 8.22%, and 18.72% higher than the corresponding existing multimodal AD diagnosis methods, respectively.

The proposed CsAGP is slice-based and considers only axial-view slices, so the 2D images cannot include all the information from a full brain scan. In addition, this study has not yet conducted a processing-time comparison. Expanding CsAGP to full-brain analysis and conducting a comparative study of processing time will be part of our future research.

Acknowledgement

Anonymized.

Funding

This work is supported by the National Natural Science Foundation of China (62276092); Key Science and Technology Program of Henan Province (212102310084); Key Scientific Research Projects of Colleges and Universities in Henan Province (22A520027); British Heart Foundation Accelerator Award, UK (AA/18/3/34220); Royal Society International Exchanges Cost Share Award, UK (RP202G0230); Hope Foundation for Cancer Research, UK (RM60G0680); Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); Sino-UK Industrial Fund, UK (RP202G0289); Global Challenges Research Fund (GCRF), UK (P202PF11); LIAS, UK (P202ED10 and P202RE969); Data Science Enhancement Fund, UK (P202RE237); Fight for Sight, UK (24NN201); Sino-UK Education Fund, UK (OP202006); Biotechnology and Biological Sciences Research Council, UK (RM32G0178B8). Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Footnotes

2. Available at https://adni.loni.usc.edu.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data/Code Availability

The code will be made available at https://github.com/weimingyang4/CsAGP after the article is accepted. The authors do not have permission to share the data.

