Abstract
Deep learning can provide rapid brain age estimation based on brain magnetic resonance imaging (MRI). However, most studies use one neural network to extract the global information from the whole input image, ignoring local fine-grained details. In this paper, we propose a global-local transformer, which consists of a global-pathway to extract global-context information from the whole input image and a local-pathway to extract fine-grained details from local patches. The fine-grained information from the local patches is fused with the global-context information by an attention mechanism, inspired by the transformer, to estimate the brain age. We evaluate the proposed method on 8 public datasets with 8,379 healthy brain MRIs covering the age range of 0-97 years. Six datasets are used for cross-validation and two datasets are used for evaluating the generality. Compared with other state-of-the-art methods, the proposed global-local transformer reduces the mean absolute error of the estimated ages to 2.70 years and increases the correlation coefficient between the estimated age and the chronological age to 0.9853. In addition, our proposed method provides regional information about which local patches are most informative for brain age estimation. Our source code is available at: https://github.com/shengfly/global-local-transformer.
Keywords: Global-local transformer, attention, brain age estimation, deep learning, interpretation
I. Introduction
BRAIN age can be estimated by using machine learning techniques on the brain magnetic resonance image (MRI) [1]–[4]. The MRI-derived brain age is associated with brain health at the individual level [5]–[8].
The difference between the predicted brain age and the chronological age is called the "brain age gap" (BAG), which is an informative biomarker of brain health [5]. Many studies have shown that a positive BAG is associated with risk of cognitive decline and neurodegeneration, such as Alzheimer's disease [6], mild cognitive impairment (MCI) [7], [8], mortality [9], psychosis [10], major depressive disorder [11], and others [5], [12]. The key to brain age estimation is to train a machine learning model on a normally aging population such that it estimates brain age on healthy brain MR images with low error. In most studies [2], [3], [13], the machine learning model is trained on brain MR images labeled with each individual's chronological age, i.e., the time elapsed since birth [1]. Thus, the trained model can estimate brain age on unseen data by extracting the chronological age-specific patterns learned from healthy brain MRIs.
Convolutional neural networks (CNNs) can provide superhuman performance on various applications [14], [15], including brain age estimation [3], [8], [16]–[18]. A CNN can make predictions from either the whole input image or local patches segmented from it [19]. One advantage of applying CNNs to whole images is that they capture global information and provide an image-level prediction. However, fine-grained details are missed because deep neural networks are dominated by the salient information in the whole image. In addition, the decision a CNN makes from the whole input image is hard to understand [3], [20]. In contrast, patch-based methods can capture local detailed information and provide patch-wise evidence that reveals age-specific patterns for interpretation [19], [21]. However, their performance is limited by the lack of global-context information.
To solve the problem, we present a two-pathway network for brain age estimation. One pathway is designed to capture the global context information from the input brain MRI and another pathway is responsible for capturing the fine-grained information from a local patch. We fuse the local detailed and the global context information with an attention mechanism, inspired by the self-attention in the “transformer” [22]. Thus, our proposed fusion block is named “global-local transformer”, as shown in Fig. 1.
A. Our Method Exploits Global Context and Local Details
The global-pathway makes a decision based on the whole input image, and its deep features contain global-context information. However, such a network easily converges to the most informative regions that yield a small training loss [23], while other regions that also contain subtle age information are ignored. The local-pathway learns age information from a local patch, forcing the network to learn detailed age information within a small local region [19]; its performance is limited because the receptive field is bounded by the local patch. Many studies in the literature have shown that fusing global-context and local detailed information can improve performance [24]–[29]. Our proposed method uses the attention mechanism to optimally fuse the global-context information extracted from the global-pathway and the local detailed information extracted from the local-pathway.
B. Our Method Does Not Need Spatial Feature Alignment
A common way to fuse the global and local features from two different pathways is to crop features from the global pathway at the same spatial location as the local patch and concatenate them together [24], [26], [27]. However, there are two limitations: (1) it requires that the features from the global and local pathways be spatially aligned [24], which is difficult for an arbitrary input image size after several max-pooling layers in the neural network; (2) the cropped deep features from the global-pathway still contain only local-region information, not global context.
We use the attention mechanism [22], [30] to optimally fuse the deep features from the global-pathway and local-pathway. The attention can select the most important information and ignore the irrelevant information in the context features from the global-pathway. A weighted sum of the global-context information at all positions is fused with the feature at each location of the deep features from the local-pathway, where the weight (normalized by a softmax, named attention) is computed from the similarity between the corresponding global and local deep features. Thus, our method does not need any spatial feature alignment and can capture long-range global-context information guided by the similarities between the features from the global-pathway and local-pathway.
C. Our Method Can Be Interpreted
There are different methods to interpret deep learning models for brain age estimation. Levakov et al. [18] applied a gradient-based method to compute individual explanation maps that represent the contribution of each voxel to brain age prediction. Our previous work [31] computed a correlation map between the responses of hidden neurons and the chronological ages over a population to find the most discriminative neurons in the neural network. For the transformer, the attention flow [32] could be used to evaluate the relative relevance of the patches. These indirect interpretation methods aim to understand where or what the neural network has learned from brain images, with the limitation that neural networks are dominated by salient information.
The direct interpretation method, on the other hand, interprets the neural network by training it directly on local patches and quantifying the prediction accuracy on every local patch, highlighting the most informative patches in input images. One representative method is the BagNet [19], which classifies an image based on small local patches segmented from the image without considering their spatial order, making it easy to analyze the predicted evidence from each local patch. Similar to BagNet, our method can estimate brain age based on local patches. Thus, the patch-level evidence for each subject can be extracted and visualized for interpretation [19], [21]. The proposed method shares the advantages of BagNet. In addition, its performance on local patches is higher than that of BagNet since it also learns the corresponding global-context information through attention from the global-pathway.
II. Related Work
A. Brain Age Estimation
Table I summarizes related studies in the literature for brain age estimation using convolutional neural networks. Most studies use common network architectures for brain age estimation on brain MRIs [33]–[35]. A 3D neural network with 5 convolutional layers and one fully connected layer for brain age regression was proposed in [16], providing a mean absolute error (MAE) of 4.16 years on the age range of 18-90 years. The 3D version of the residual network [20] was applied in [17], achieving an MAE of 3.631 years by combining predictions from multiple CNNs. The 3D version of the VGG network [36] was employed for brain age estimation [37] on subjects aged 18-90 years, with an MAE of 5.55 years. Feng et al. [3] used a neural network with 10 convolutional layers on subjects aged 18-97 years, yielding an MAE of 4.06 years. Levakov et al. [18] utilized a 3D CNN with 4 convolutional layers and 2 fully-connected layers, obtaining an MAE of 3.07 years by averaging an ensemble of 10 CNNs on 10,176 subjects (age range: 4-94 years). Peng et al. [13] proposed a lightweight Simple Fully Convolutional Network (SFCN), achieving an MAE of 2.14 years on subjects aged 44-80 years. Bashyam et al. [8] developed DeepBrainNet for brain age prediction based on 2D slices, providing an MAE of 3.702 years on a large set of MRI scans.
TABLE I. Related studies on brain age estimation using convolutional neural networks.

| Study | Publication year | #Samples | Age range (years) | MAE (years) |
|---|---|---|---|---|
| Huang et al. [33] | 2017 | 1,099 | 20-80 | 4.0 |
| Cole et al. [16] | 2017 | 2,001 | 18-90 | 4.16 |
| Ueda et al. [34] | 2019 | 1,101 | 20-80 | 3.67 |
| Jónsson et al. [17] | 2019 | 1,264 | 15-80 | 3.63 |
| Jiang et al. [37] | 2020 | 1,454 | 18-90 | 5.55 |
| Feng et al. [3] | 2020 | 10,158 | 18-97 | 4.06 |
| Bashyam et al. [8] | 2020 | 11,729 | 3-95 | 3.702 |
| Levakov et al. [18] | 2020 | 10,176 | 4-94 | 3.07 |
| Peng et al. [13] | 2021 | 14,503 | 42-82 | 2.14 |
| Dinsdale et al. [35] | 2021 | 19,687 | 44-80 | 2.97 |
| He et al. [31] | 2021 | 16,705 | 0-97 | 3.00 |
| Cheng et al. [4] | 2021 | 6,586 | 17-98 | 2.428 |
| Proposed | - | 8,379 | 0-97 | 2.70 |
Recently, Cheng et al. [4] proposed a 3D two-stage-age-network to estimate brain age from T1w MRI: the first stage estimates a rough brain age and the second stage refines the result. An MAE of 2.428 years was achieved on 6,586 subjects aged 17-98 years. Our previous work [31] used a fusion-with-attention 3D network (FiA-Net) to fuse the intensity and RAVENS channels for brain age estimation, yielding an MAE of 3.00 years on a lifespan cohort (0-97 years).
Our method is different in two key ways: (1) we propose a two-pathway network that exploits both global-context and local detailed information for brain age estimation; (2) we apply the proposed method to 2D slices extracted from 3D brain MRI volumes, which is computationally efficient and achieves an MAE of 2.70 years across the lifespan (0-97 years).
B. Transformer
The transformer [22] was first used for natural language processing (NLP) and has recently become popular in visual recognition [38], [39]. The core idea is to apply a self-attention layer to the input sequence to capture the relationships among a sequence of local patches. The input sequence is first converted into three different components, namely "query", "key" and "value". Subsequently, the attention is obtained from the "query" and "key" and applied to the "value" to output a scaled sequence. Transformers have been used in image recognition [38], object detection [40], hand pose estimation [41], image super-resolution [42], etc. A recent survey can be found in [43].
Our proposed method differs in that the "query", "key" and "value" come from different features: we compute the "key" and "value" from the global-pathway and the "query" from the local-pathway. Through the "key" and "query", the attention between the global and local information is obtained and applied to the "value" to compute the global-context information for the local patches. Thus, our method can optimally fuse the global-context and local detailed information with attention, which we name "global-local attention". The corresponding transformer with "global-local attention" is named the "global-local transformer".
III. Method
A. Backbone for Deep Feature Extraction
We use a convolutional neural network (CNN) as the backbone to extract deep features from the input image. The backbone is based on VGGNet [36] but with fewer layers, motivated by the finding that "shallow neural networks provide better results than deep ones in brain age estimation" [13]. As shown in Fig. 2, the backbone contains eight blocks. Each block consists of a convolutional layer with a kernel size of 3 × 3 and padding of 1, a batch normalization layer [44], and a ReLU activation layer [15]. A max-pooling layer with a kernel size of 2 × 2 and stride of 2 is applied after every two blocks to gradually reduce the spatial dimensions. The numbers of channels in the four two-block stages are [64, 128, 256, 512], similar to VGGNet [36]. The backbone converts an input image into a deep feature map representing abstract, high-level characteristics of the input image.
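As a concrete illustration, the following is a minimal PyTorch sketch of such a backbone, written from the description above; the class name and exact stage layout are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One backbone block: 3x3 convolution (padding 1) + batch norm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class VGG8Backbone(nn.Module):
    """Eight convolutional blocks in four stages with [64, 128, 256, 512]
    channels; 2x2 max-pooling (stride 2) after every two blocks."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in [64, 128, 256, 512]:
            layers += [conv_block(ch, out_ch),
                       conv_block(out_ch, out_ch),
                       nn.MaxPool2d(kernel_size=2, stride=2)]
            ch = out_ch
        self.features = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)  # spatial size reduced by a factor of 16

backbone = VGG8Backbone()
f = backbone(torch.randn(1, 1, 130, 170))  # -> torch.Size([1, 512, 8, 10])
```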
B. Global-Local Attention Mechanism
Our aim is to develop a neural network that learns fine-grained information from a local patch while fusing the global-context visual information learned from the whole input image. We use two identical backbones with separate parameters to extract a deep feature from the whole input image and a deep feature from a local patch; both features have $d$ channels, with spatial sizes $h \times w$ and $h' \times w'$ respectively, where $h' < h$ and $w' < w$. The deep feature $f_g$ contains the global-context information of the whole input image and the deep feature $f_l$ contains the fine-grained local information of the local patch. To fuse the global-context and local fine-grained information, we propose a global-local attention, inspired by the self-attention mechanism [22]. The framework is shown in Fig. 3. For the local feature $f_l$, we use a 1 × 1 convolutional layer to project it into a new space $f^Q$, named "query". For the global feature $f_g$, we use two 1 × 1 convolutional layers to project it into two different spaces $f^K$ and $f^V$, named "key" and "value".
As shown in Fig. 3, for the deep feature $f^Q_i$ at each location (pixel level) of the query (where $i$ is the position index on the local feature), we compute its similarity to all positions of the deep feature $f^K$ using the dot product: $s_{ij} = f^Q_i \cdot f^K_j$ (where $j$ is the position index on the global feature $f^K$). We divide $s_{ij}$ by $\sqrt{d}$ and normalize it with a softmax function: $a_{ij} = \operatorname{softmax}_j\!\left(s_{ij}/\sqrt{d}\right)$. The normalized $a_{ij}$ is called attention since it assigns different weights to different locations $j$, determined by both the query $f^Q_i$ and the key $f^K_j$. Finally, the corresponding global-context information for each location $i$ on the local feature is obtained as the weighted sum of the values computed from the global-pathway: $f^G_i = \sum_j a_{ij} f^V_j$.
Similar to the self-attention [22], we can use matrix operations to compute the global-local attention efficiently. We reshape the global features $f^K$ and $f^V$ and the local feature $f^Q$ into matrices $F^K$ (of size $N_1 \times d$, with $N_1 = h \times w$), $F^V$ ($N_1 \times d$), and $F^Q$ ($N_2 \times d$, with $N_2 = h' \times w'$), respectively. The global-context local feature is then defined by:

$$F^G = \operatorname{softmax}\!\left(\frac{F^Q (F^K)^\top}{\sqrt{d}}\right) F^V \qquad (1)$$
The global-local attention differs from the self-attention [22] in two aspects. First, in self-attention, the "query", "key" and "value" come from the same feature, while in the global-local attention, the "query" ($F^Q$) comes from the local pathway and the "key" ($F^K$) and "value" ($F^V$) pair comes from the global pathway. Second, in self-attention, the "query", "key" and "value" usually have the same size, while in the global-local attention, $F^Q$ is smaller than $F^K$ and $F^V$ because $F^Q$ is computed from a small local patch. The global-local attention is therefore asymmetric: $F^K$ and $F^V$ are derived from the global-pathway while $F^Q$ is derived from the local-pathway, with far fewer positions ($N_2 \ll N_1$), and the complexity of the global-local attention (as defined in Eq. 1) is $\mathcal{O}(N_1 N_2 d)$.
As shown in Fig. 3, the output size of the global-context local feature $F^G$ is the same as that of the local feature $F^Q$. The feature values of the output are computed as a weighted sum of the values $F^V$ from the global pathway. Thus, the output contains global contextual information determined by the global and local features, without any spatial alignment.
Similar to the self-attention, we also use multi-head attention, where the global and local features are split into $h = 8$ parallel parts along the channel dimension. The global-local attention is applied to each part, and the outputs are concatenated and projected into one feature with the same size as the input feature. Multi-head attention has become a standard component of transformers [45]; more details can be found in the self-attention literature [22].
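To make the mechanism concrete, below is a minimal PyTorch sketch of the multi-head global-local attention of Eq. 1; the module name, the head-splitting details, and the per-head scaling by $\sqrt{d_k}$ (rather than $\sqrt{d}$) are our illustrative assumptions, not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    """Multi-head attention in which the query comes from the local patch
    and the key/value pair comes from the whole image (Eq. 1)."""
    def __init__(self, d: int = 512, num_heads: int = 8):
        super().__init__()
        assert d % num_heads == 0
        self.h, self.dk = num_heads, d // num_heads
        # 1x1 convolutions project the features into query/key/value spaces.
        self.to_q = nn.Conv2d(d, d, kernel_size=1)
        self.to_k = nn.Conv2d(d, d, kernel_size=1)
        self.to_v = nn.Conv2d(d, d, kernel_size=1)
        self.proj = nn.Conv2d(d, d, kernel_size=1)

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        B, d, hp, wp = f_local.shape  # N2 = hp * wp local positions
        q = self.to_q(f_local).flatten(2).transpose(1, 2)   # (B, N2, d)
        k = self.to_k(f_global).flatten(2).transpose(1, 2)  # (B, N1, d)
        v = self.to_v(f_global).flatten(2).transpose(1, 2)  # (B, N1, d)
        # Split the channels into h heads: (B, h, N, dk).
        split = lambda t: t.view(B, -1, self.h, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Attention over all global positions for every local position.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dk), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, d)  # concat heads: (B, N2, d)
        out = out.transpose(1, 2).reshape(B, d, hp, wp)     # back to map layout
        return self.proj(out)  # global-context feature aligned with local positions
```

Note that the attention matrix has shape $(N_2, N_1)$, matching the $\mathcal{O}(N_1 N_2 d)$ cost discussed above.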
C. Global-Local Transformer
In this section, we present the global-local transformer, shown in Fig. 1. We concatenate the output of the global-local attention block with the local feature because it contains global contextual information from the global pathway that is absent from the local feature. The global-context information is a weighted sum (with attention) of the global features, according to the similarity determined by both global and local features. These two features are further fused in a feed-forward block. Slightly different from the standard transformer [22], the feed-forward block contains two linear transformations (two convolutional layers with 512 channels and a kernel size of 1) with batch normalization and ReLU activation functions to fuse the global contextual and local fine-grained information. The output is added to the local feature, inspired by residual learning [20]; thus, the local feature is enriched with global-context information. The global feature and the updated local feature can also be fed into another global-local transformer, and the same structure is repeated N times to iteratively integrate the global-context and local detailed information (the selection of N is discussed in Section V-A). The final layer of each pathway for brain age estimation is a fully-connected layer that maps the 512-dimensional feature vector (obtained after an average pooling layer) to the brain age.
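A sketch of one fusion block under the same assumptions, reusing the `GlobalLocalAttention` module from the previous sketch:

```python
import torch
import torch.nn as nn

class GlobalLocalTransformerBlock(nn.Module):
    """One fusion block: global-local attention, concatenation with the
    local feature, a convolutional feed-forward, and a residual connection."""
    def __init__(self, d: int = 512):
        super().__init__()
        self.attention = GlobalLocalAttention(d)  # defined in the sketch above
        # Feed-forward: two 1x1 convolutions with batch norm and ReLU.
        self.feed_forward = nn.Sequential(
            nn.Conv2d(2 * d, d, kernel_size=1), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
            nn.Conv2d(d, d, kernel_size=1), nn.BatchNorm2d(d), nn.ReLU(inplace=True),
        )

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        context = self.attention(f_local, f_global)  # global context per local position
        fused = self.feed_forward(torch.cat([context, f_local], dim=1))
        return f_local + fused  # residual: the updated local feature

# Stacked N times; the updated local feature feeds the next block:
#   for block in blocks:
#       f_local = block(f_local, f_global)
#   age = fc(f_local.mean(dim=(2, 3)))  # average pooling + fully-connected head
```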
IV. Experiments
In this section, we present the experimental results of the proposed method on a large healthy cohort. We also compare it with baseline models and state-of-the-art neural network architectures.
A. Dataset
In this paper, we evaluate the proposed method on a healthy cohort: we collect healthy brain T1-weighted MRI scans from 8 public datasets (Table II), for a total of 8,379 samples with an age range of 0-97 years. Among them, 6 datasets are used for cross-validation, and the CMI and CoRR datasets are used for evaluating the generality of the deep learning models.
TABLE II. Datasets used in this study.

| Dataset | #Samples | Age range (years) | Gender (female/male) |
|---|---|---|---|
| ◇ BGSP [46] | 1,570 | 19.0-35.0 | 905/665 |
| ◇ OASIS-3 [47] | 1,222 | 42.0-97.0 | 750/472 |
| ◇ NIH-PD [48] | 1,211 | 0-22.2 | 626/585 |
| ◇ ABIDE-I [49] | 567 | 6.4-56.2 | 98/469 |
| ◇ IXI* | 556 | 19.9-86.3 | 309/247 |
| ◇ DLBS [50] | 315 | 20.5-89.1 | 198/117 |
| ⊕ CMI [51] | 1,765 | 5.0-21.9 | 1,117/648 |
| ⊕ CoRR [52] | 1,173 | 6.0-88.0 | 591/582 |
| Overall | 8,379 | 0-97.0 | 4,594/3,785 |

◇ Datasets used for cross-validation.
⊕ Datasets used for evaluating the generality.
The pre-processing steps include N4 bias correction [53], field-of-view normalization [54], and Multi-Atlas Skull Stripping (MASS) [55]. The skull-stripped T1w MRI is affine-registered to the SRI atlas [56] by FSL's flirt tool [57]; the atlas has a voxel size of 1 × 1 × 1 mm and was constructed from T1w scans of 24 healthy brains. The 3D brain volume is cropped to a size of 130 × 170 × 120 by removing the black boundaries. All MRI scans were checked manually to remove failed scans with severe artifacts or poor registration.
We extract 2D slices around the center of the 3D brain volumes in the axial, coronal, and sagittal planes, a strategy similar to that in [8], [58]. The number of 2D slices to extract is studied in Fig. 5 as a key variable of our algorithm. As shown in [3], [8], [58], 2D slices around the center of the brain in different planes can be used for brain age prediction. Moreover, training a 2D neural network requires fewer parameters than a 3D one, and the global-local attention is computed between every pair of positions in the global and local features, which would demand large computational resources (time and memory) in a 3D network. As shown in Table II, images from BGSP, OASIS-3, NIH-PD, ABIDE-I, IXI, and DLBS are randomly split into 5 parts for 5-fold cross-validation, and images acquired with different scanners from CMI and CoRR are used for evaluating the generality of the models.
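As an illustration, central slices can be extracted as in the following sketch; the mapping of array axes to anatomical planes is an assumption that depends on the actual data layout:

```python
import numpy as np

def extract_center_slices(volume: np.ndarray, n_slices: int = 5) -> dict:
    """Extract n_slices 2D slices around the center of a 3D volume in the
    sagittal, coronal, and axial planes (axis order assumed; adjust as needed)."""
    slices = {}
    for name, axis in [("sagittal", 0), ("coronal", 1), ("axial", 2)]:
        center = volume.shape[axis] // 2
        start = center - n_slices // 2
        idx = range(start, start + n_slices)
        slices[name] = np.stack([np.take(volume, i, axis=axis) for i in idx])
    return slices  # each entry has shape (n_slices, H, W)

vol = np.zeros((130, 170, 120), dtype=np.float32)  # cropped T1w volume
planes = extract_center_slices(vol, n_slices=5)
```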
B. Network Training
We use the mean absolute error as the loss function, which is defined as:
$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N}\left|p_i - \hat{p}_i\right| \qquad (2)$$
where $p_i$ is the known chronological age of the $i$-th subject and $\hat{p}_i$ is the estimated brain age from the neural network. The final training loss is the sum of the losses from the global pathway and the local pathway. The network is trained with the Adam optimizer built into the PyTorch platform, with an initial learning rate of 0.0001 that is halved every 25 epochs over the 80 training epochs. The batch size is set to 18 due to the limitation of GPU memory. Training the neural network takes around 12 hours on a single NVIDIA RTX 6000 GPU with 12 GB of memory.
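The training configuration above can be sketched as follows; `model` (assumed to return the global and local predictions) and `loader` are hypothetical placeholders for the two-pathway network and the data pipeline:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)
l1 = torch.nn.L1Loss()  # mean absolute error, Eq. 2

for epoch in range(80):
    for images, patches, ages in loader:  # batch size 18
        global_pred, local_pred = model(images, patches)
        # Final loss: sum of the MAE losses from both pathways.
        loss = l1(global_pred, ages) + l1(local_pred, ages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # halve the learning rate every 25 epochs
```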
C. Performance Evaluation of Age Estimation
To evaluate the age estimation performance, we use three metrics: mean absolute error (MAE), correlation coefficient (r), and cumulative score (CS) [59]. The MAE is defined in Eq. 2 and is a widely used metric for brain age estimation [3], [13], [60]. The correlation coefficient (r) [2] is computed as the Pearson correlation between the predicted ages and the chronological ages. The CS is the accuracy of age estimation given a threshold α, which is given by:
$$\mathrm{CS}(\alpha) = \frac{N_{e \le \alpha}}{N} \times 100\% \qquad (3)$$
where $N_{e \le \alpha}$ is the number of test samples whose absolute age estimation error $e$ is no larger than the threshold $\alpha$, and $N$ is the total number of test samples. A higher CS score means better performance.
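For concreteness, the three metrics can be computed as in the following sketch (the helper name `evaluate` is ours; `scipy.stats.pearsonr` supplies the correlation):

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred: np.ndarray, chron: np.ndarray, alpha: float = 5.0):
    """Return MAE (Eq. 2), Pearson r, and cumulative score CS(alpha) (Eq. 3)."""
    err = np.abs(pred - chron)
    mae = err.mean()
    r, _ = pearsonr(pred, chron)
    cs = 100.0 * (err <= alpha).mean()  # percentage of samples within alpha years
    return mae, r, cs

pred = np.array([24.1, 31.7, 66.2, 8.9])   # predicted ages (toy values)
chron = np.array([25.0, 30.0, 70.0, 9.5])  # chronological ages
print(evaluate(pred, chron))               # ~ (1.75, 0.998, 100.0)
```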
D. Comparison With Different Baseline Models
We compare the proposed global-local transformer with the following six baseline models: (1) ResNet18 [20]: a standard ResNet with 18 layers trained to estimate brain age directly from the whole input image. (2) BagNet-ResNet18 [19]: ResNet18 applied to each local patch segmented from the input image, inspired by the BagNet [19]. (3) VGG [36]: the VGG backbone (described in Section III-A) used for brain age estimation on the whole input image. (4) BagNet-VGG [19]: the VGG backbone applied to each local patch, i.e., the BagNet-ResNet18 model with ResNet18 replaced by VGG. (5) Global-Transformer [38]: the VGG backbone extracts feature vectors from the sequence of local patches cropped from the input image, and the resulting feature sequence is fed into a standard transformer for brain age estimation; here the "query", "key", and "value" come from the sequence of local patches segmented from the whole input image. (6) Local-Transformer [22]: the standard transformer applied to the feature vectors extracted from each single local patch, so the "query", "key", and "value" come from the deep features of that patch. All models are trained with the same configuration for fair comparison.
E. Comparison With State-of-the-Art Neural Networks
We also compare the proposed method with other popular neural networks for visual recognition, using the whole image as input. The compared network structures include: (1) ResNet50 and ResNet101 [20]: widely used residual networks with 50 and 101 layers. (2) WRN-50 and WRN-101 [61]: wide residual networks (WRNs) that decrease the depth and increase the width of residual networks. (3) DenseNet121 and DenseNet201 [62]: densely connected convolutional networks of different depths. (4) SqueezeNet [63] and ShuffleNet V2 [64]: two efficient networks using small kernel sizes or depthwise separable convolutional layers for visual recognition.
F. Comparison With State-of-the-Art Brain Age Estimation Methods
As mentioned above, most brain age estimation studies use common neural network structures (compared in the previous sections). Three recently published networks were specifically designed for brain age estimation, so we also compare the proposed method with them: SFCN [13], DeepBrainNet [8], and FiA-Net [31]. The SFCN was originally designed for 3D images (named SFCN 3D here); in addition, we replace its 3D convolutional kernels with 2D ones (named SFCN 2D) to compare performance on 2D and 3D images. It contains seven blocks of convolutional, batch normalization, activation, and max-pooling layers. DeepBrainNet works on 2D slices and is based on the Inception-ResNet-V2 model [65]. The same training configuration is applied to all these models for fair comparison. The results of FiA-Net are taken directly from the original paper [31] since its experimental configuration is similar to ours.
G. Performance of Single Patch Size for Comparison
We evaluate the brain age estimation performance of the proposed method with a fixed local patch size. Although our method can crop local patches at any position without feature alignment, we use a sliding-window strategy with a stride of half the patch size for computational efficiency. The final estimated age is the average of the estimated ages from all local patches.
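A sketch of this sliding-window inference, assuming a hypothetical `model` that maps a (whole image, patch) pair to a scalar age:

```python
import torch

def predict_sliding(model, image: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """Average the patch-level age predictions over a sliding window
    whose stride is half the patch size."""
    _, _, H, W = image.shape
    stride, preds = patch // 2, []
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            p = image[:, :, top:top + patch, left:left + patch]
            preds.append(model(image, p))  # the local pathway sees the patch
    return torch.stack(preds).mean()  # final estimate: mean over all patches
```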
H. Interpretation With Multiple Patch Sizes
We crop local patches of different sizes and feed them into the same local-pathway of the proposed model for brain age estimation; in other words, all patches of different sizes share the same local-pathway in the network. Although arbitrary patch sizes could be used, we set the minimum patch size to 32 and the maximum to 102 with a step of 8 for computational efficiency. During training, 30 patches of different sizes are randomly sampled at different locations of the whole image. During testing, we randomly sample 3,000 patches for each subject and obtain an estimated brain age on each patch. Thus, a distribution of the estimated brain age is obtained, described by its mean m and standard deviation σ (as shown in Fig. 4). The standard deviation σ can be considered the uncertainty [66] of brain age estimation, measuring how much the estimated age differs across brain regions. Since the brain age is estimated on local patches, the patches with the lowest MAE can be found and visualized for interpretation.
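A sketch of the multi-size sampling described above, again with a hypothetical `model`; the mean and standard deviation summarize the patch-wise age distribution:

```python
import random
import torch

def patch_age_distribution(model, image: torch.Tensor, n: int = 3000):
    """Estimate brain age on n randomly placed patches of random sizes and
    return (m, sigma); sigma serves as the uncertainty of the estimate."""
    _, _, H, W = image.shape
    ages = []
    for _ in range(n):
        size = random.choice(range(32, 103, 8))  # sizes 32, 40, ..., up to the maximum
        top = random.randint(0, H - size)
        left = random.randint(0, W - size)
        p = image[:, :, top:top + size, left:left + size]
        ages.append(model(image, p))  # all sizes share the same local pathway
    ages = torch.stack(ages)
    return ages.mean(), ages.std()
```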
V. Results
In this section, we first evaluate the performance of the proposed method with different parameters, such as the patch size of the local-pathway and the numbers of slices and global-local transformers. In the second and third parts, we compare the proposed method with different baseline models and state-of-the-art architectures. In the last part, we visualize the most informative patches for brain age estimation.
A. Parameter Evaluation of the System
First, we evaluate the performance of the proposed method with different sizes of the local patches; the experimental results are shown in Fig. 5(a). The figure shows no significant differences among the results when the patch size is greater than 48; thus, we set the patch size of the local-pathway to 64. Second, we report the brain age estimation results (Fig. 5(b)) for different numbers of 2D slices extracted from the 3D MRI scans, based on the observation that the estimated age is closest to the actual age on slices near the center [3]. Fig. 5(b) shows no significant difference in performance for 5 to 20 slices. Third, we present the performance of the system (Fig. 5(c)) for different numbers of global-local transformer blocks (as shown in Fig. 1); the MAE is lowest when the number of blocks is around N = 6-10. In the following sections, we set the number of slices to 5 and the number of blocks to N = 6 as a tradeoff between performance and computational time and memory. Fig. 5(d) shows the performance of the proposed method with different backbones: ResNet18, VGG13, and VGG8. VGG13 has a structure similar to VGG8 (shown in Fig. 2) but with 13 convolutional layers, the same convolutional configuration as VGG16 [36]. The network with 8 convolutional layers provides the best results, in line with the finding in [13] that "the deeper neural networks do not outperform the shallow ones in brain age prediction". Fig. 5(e) shows the performance of the global and local pathways of the proposed global-local transformer, where the prediction of the local pathway is the average predicted age over all local patches. The local pathway captures detailed information from the local patches together with the global-context information from the global pathway, yielding better performance than the global pathway, which estimates the brain age from the whole input image. Thus, we only report the performance of the local pathway in the following sections; the global-pathway is only used to supply global-context information that improves the local pathway.
B. Comparison With Different Baseline Models
Table III shows the performance of different models on 2D slices extracted from the three planes: axial, coronal, and sagittal. We also fuse the predictions from the three planes by averaging the estimated ages: $y = \frac{1}{3}\sum_i y_i$, where $y_i$ is the predicted age from plane $i \in \{\text{Axial, Coronal, Sagittal}\}$. Fig. 6 shows the CS curves of the different models at different error levels α. Scatter plots of estimated brain age against chronological age for the two best models are shown in Fig. 7.
TABLE III. MAE (years) and Pearson correlation (r) of different models on 2D slices from the three planes and their fusion.

| Method | MAE: Axial | MAE: Coronal | MAE: Sagittal | MAE: Fusion | r: Axial | r: Coronal | r: Sagittal | r: Fusion |
|---|---|---|---|---|---|---|---|---|
| ResNet18 [20] | 3.45±0.17 | 3.38±0.07 | 3.87±0.10 | 3.25±0.09 | 0.9759±0.0032 | 0.9764±0.0023 | 0.9700±0.0025 | 0.9789±0.0023 |
| BagNet-ResNet18 [19] | 5.13±0.21 | 6.29±0.81 | 5.57±0.17 | 5.46±0.34 | 0.9698±0.0026 | 0.9564±0.0065 | 0.9607±0.0029 | 0.9696±0.0028 |
| VGG [36] | 3.48±0.16 | 3.29±0.09 | 3.67±0.13 | 3.11±0.12 | 0.9767±0.0024 | 0.9782±0.0026 | 0.9758±0.0016 | 0.9801±0.0022 |
| BagNet-VGG [19] | 4.08±0.09 | 5.87±0.25 | 5.11±0.32 | 4.75±0.19 | 0.9758±0.0016 | 0.9723±0.0016 | 0.9749±0.0006 | 0.9763±0.0016 |
| Global-Transformer [38] | 3.97±0.26 | 5.39±0.71 | 5.34±0.22 | 4.11±0.29 | 0.9749±0.0006 | 0.9725±0.0036 | 0.9685±0.0054 | 0.9811±0.0018 |
| Local-Transformer [22] | 3.50±0.10 | 3.32±0.10 | 3.78±0.13 | 3.28±0.03 | 0.9792±0.0013 | 0.9774±0.0019 | 0.9723±0.0020 | 0.9803±0.0015 |
| Global-Local Transformer | 2.87±0.11 | 2.97±0.07 | 3.14±0.02 | 2.70±0.03 | 0.9827±0.0022 | 0.9826±0.0022 | 0.9807±0.0019 | 0.9853±0.0020 |
Several observations can be made. (1) The BagNet-based methods (BagNet-ResNet18 and BagNet-VGG) perform worse than the corresponding networks (ResNet18 and VGG) that take the whole image as input, showing that estimating brain age from local patches alone is limited. Using the self-attention mechanism, the Local-Transformer with the VGG backbone improves on BagNet-VGG, but its result is still worse than ResNet18's. In general, neural networks that take local patches as input perform worse than those that take the whole image as input. However, our proposed Global-Local Transformer gives the best performance, demonstrating the advantage of fusing the global-context and local detailed information. (2) The age information differs slightly across planes. The most informative plane is the axial, which provides better results than the coronal and sagittal planes; 2D slices from the axial plane are also used for brain age estimation in [1], [8]. For ResNet18, VGG, the Local-Transformer, and the proposed Global-Local Transformer, fusing the three planes improves performance. In the following sections, we only report the fused results from the three planes, since they are better than those of any single plane. (3) The lightweight VGG network with 8 layers provides better results than ResNet with 18 layers on both the whole input image and local patches, consistent with the finding in [13] that a lightweight network can outperform ResNet [20] for brain age estimation. (4) Our proposed Global-Local Transformer achieves the lowest MAE and the highest correlation r and CS across thresholds α among all models, on all three planes and their fusion.
To further detail the estimation performance of the different models, the MAE is broken down into age ranges. Table IV shows the performance on four roughly divided age groups. For all models, the MAE on subjects aged 30-60 years is higher than on the other age groups, demonstrating that age estimation in this range is more challenging. Overall, the results show that our proposed method provides the best performance among all six baseline models in all four age groups.
TABLE IV. MAE (years) of different models broken down by age group.

| Method | < 10 years (n = 690) | 10-30 years (n = 2,724) | 30-60 years (n = 732) | > 60 years (n = 1,296) |
|---|---|---|---|---|
| ResNet18 [20] | 1.39±0.21 | 2.23±0.11 | 5.88±0.25 | 4.90±0.15 |
| BagNet-ResNet18 [19] | 4.16±0.47 | 5.21±0.85 | 5.77±0.45 | 6.47±0.35 |
| VGG [36] | 1.20±0.15 | 2.00±0.10 | 5.45±0.29 | 5.12±0.31 |
| BagNet-VGG [19] | 4.26±0.44 | 4.30±0.38 | 5.23±0.19 | 5.64±0.25 |
| Global-Transformer [38] | 1.82±0.29 | 3.29±0.36 | 7.16±0.63 | 5.31±0.70 |
| Local-Transformer [22] | 1.75±0.18 | 2.54±0.09 | 5.53±0.42 | 4.35±0.24 |
| Global-Local Transformer | 0.97±0.12 | 1.89±0.15 | 5.12±0.34 | 3.93±0.29 |
Table V shows the performance on the individual datasets, including the 6 datasets used for cross-validation and the 2 datasets used for evaluating the generality. Our proposed method provides the lowest MAE on the 6 cross-validation datasets as well as on the CMI and CoRR datasets, indicating that it generalizes to datasets from different sites and scanners.
TABLE V. MAE (years) on each dataset. BGSP-DLBS are used for cross-validation; CMI and CoRR are used for evaluating the generality.

| Method | BGSP | OASIS-3 | NIH-PD | ABIDE-I | IXI | DLBS | CMI | CoRR |
|---|---|---|---|---|---|---|---|---|
| #Samples | 1,570 | 1,222 | 1,211 | 567 | 556 | 315 | 1,765 | 1,173 |
| Age range | 19.0-35.0 | 42.0-97.0 | 0-22.2 | 6.4-56.2 | 19.9-86.3 | 20.5-89.1 | 5.0-21.1 | 6.0-88.0 |
| ResNet18 [20] | 2.00±0.03 | 3.88±0.22 | 1.16±0.11 | 3.61±0.22 | 7.98±0.87 | 6.11±0.30 | 5.34±0.25 | 6.88±0.38 |
| BagNet-ResNet18 [19] | 4.74±0.93 | 4.99±0.53 | 3.57±0.46 | 7.37±0.80 | 9.35±0.69 | 7.77±0.64 | 10.68±0.39 | 10.90±0.72 |
| VGG [36] | 1.90±0.07 | 4.06±0.23 | 1.05±0.04 | 2.97±0.26 | 7.56±0.69 | 5.73±0.34 | 4.97±0.05 | 5.71±0.18 |
| BagNet-VGG [19] | 3.52±0.35 | 4.30±0.32 | 3.76±0.33 | 6.32±0.41 | 8.55±0.71 | 6.87±0.70 | 9.26±0.12 | 9.00±0.29 |
| Global-Transformer [38] | 2.98±0.46 | 5.78±0.64 | 1.90±0.26 | 3.75±0.34 | 7.77±0.48 | 6.00±0.54 | 5.06±0.33 | 7.74±1.12 |
| Local-Transformer [22] | 2.11±0.07 | 3.41±0.18 | 1.41±0.18 | 4.02±0.31 | 7.81±0.57 | 6.38±0.42 | 6.14±0.42 | 8.29±0.62 |
| Global-Local Transformer | 1.79±0.07 | 3.12±0.18 | 0.90±0.03 | 2.73±0.24 | 6.68±0.39 | 5.40±0.40 | 4.95±0.10 | 5.68±0.47 |
C. Comparison With State-of-the-Art Neural Networks and Models of Brain Age Estimation
Table VI shows the comparison with eight state-of-the-art deep networks and three recently published brain age estimation models in terms of MAE, correlation r, and CS (α = 5 years). All models are trained with five-fold cross-validation, and the fused results of the three planes are reported for fair comparison. We train the SFCN [13] with 2D and 3D convolutional kernels, named SFCN 2D and SFCN 3D, on the same data. From Table VI we can see that: (1) the efficient networks (ShuffleNet, SqueezeNet, and SFCN 2D) have the largest MAE (> 3.5 years), the lowest correlation (r < 0.98), and CS(α) < 80% among the compared algorithms; (2) DenseNet gives better results than the other neural networks, including ResNet, WRN, and DeepBrainNet; (3) the 3D version of SFCN provides better results than its 2D version; (4) our proposed method outperforms the general-purpose neural networks included in the comparison as well as the three networks specifically designed for brain age estimation (SFCN [13], DeepBrainNet [8], and FiA-Net [31]).
TABLE VI. Comparison with state-of-the-art neural networks and brain age estimation models (fusion of the three planes).

| Method | MAE (years) | Pearson correlation (r) | CS (α = 5 years) |
|---|---|---|---|
| ShuffleNet V2 (2.0×) [64] | 3.85±0.12 | 0.9668±0.0036 | 76.90%±0.78 |
| SqueezeNet [63] | 3.71±0.16 | 0.9710±0.0035 | 77.05%±1.53 |
| ResNet50 [20] | 3.12±0.08 | 0.9781±0.0027 | 81.88%±0.73 |
| ResNet101 [20] | 3.15±0.13 | 0.9778±0.0029 | 81.53%±0.98 |
| WRN-50-2 [61] | 3.06±0.10 | 0.9786±0.0028 | 82.38%±0.85 |
| WRN-101-2 [61] | 3.07±0.10 | 0.9788±0.0022 | 81.97%±0.81 |
| DenseNet121 [62] | 2.86±0.08 | 0.9837±0.0017 | 82.87%±1.01 |
| DenseNet201 [62] | 2.80±0.07 | 0.9836±0.0015 | 83.72%±0.65 |
| *SFCN 2D [13] (2021) | 3.58±0.10 | 0.9754±0.0023 | 77.73%±0.95 |
| *SFCN 3D [13] (2021) | 3.04±0.06 | 0.9817±0.0008 | 81.66%±0.50 |
| **FiA-Net (fus) 3D [31] (2021) | 3.00±0.06 | 0.9840±0.0000 | 81.75%±1.20 |
| *DeepBrainNet [8] (2020) | 2.97±0.11 | 0.9815±0.0022 | 82.87%±0.66 |
| Global-Local Transformer | 2.70±0.03 | 0.9853±0.0020 | 84.53%±0.77 |

* Models specifically designed for brain age estimation.
** Results are taken directly from the original paper.
D. Interpretation With Multiple Patch Sizes
In this section, we propose two types of interpretation: subject-level interpretation, which highlights the most discriminative patches of each subject, and group-level interpretation, which shows the most salient brain regions over a group of subjects within a certain age range. For subject-level interpretation, the 5 patches with the lowest MAEs at each patch size are collected and a heat map is built to visualize the most informative regions. For group-level interpretation, we select only the 5 patches per subject with the lowest MAEs at patch sizes 32 and 40, and all selected patches from subjects within the age range are averaged to obtain a fine-grained heat map. The heat map shows the probability that the lowest MAE (the best prediction) is obtained at each location of the brain image.
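A minimal sketch of how such a heat map can be accumulated from the selected patches; the `(top, left, size)` tuple layout is our assumption:

```python
import numpy as np

def patch_heatmap(selected_patches, shape):
    """Accumulate the footprints of the selected (lowest-MAE) patches
    into a normalized heat map of the most informative regions."""
    heat = np.zeros(shape, dtype=np.float32)
    for top, left, size in selected_patches:
        heat[top:top + size, left:left + size] += 1.0
    return heat / max(heat.max(), 1e-8)  # scale to [0, 1] for visualization

heat = patch_heatmap([(10, 20, 32), (14, 24, 40)], shape=(130, 170))
```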
Fig. 8 shows the most informative brain regions for each subject, computed by averaging the patches with the lowest MAEs at various patch sizes. For each brain MRI, most of the lowest-MAE patches cover the same region, indicating that the salient region (shown in Fig. 8) contains more brain age information than other brain regions. In addition, the salient brain age regions differ slightly among subjects of different ages. To compute the general trend of the salient brain age regions, we average the salient regions over subjects within a certain age range; the results are shown in Fig. 9. The salient brain regions shift over time. In children (0-5 years old), the most salient brain age region is in the frontal lobe. It shifts to the deep gray nuclei region in the 5-20 year range. Starting at 20 years, the salient region gradually shifts to the parietal lobe at 30-35 years and then shifts back from 35-40 years until 65-70 years. After 75 years, there are two salient regions that contain the most age information.
Fig. 10 shows the distribution of the standard deviation σ (the uncertainty measurement) over the lifespan. A large σ indicates large differences in the estimated brain age across brain regions. The differences are lowest at around 20 and 65 years of age, indicating that the estimates across brain regions are most consistent at these ages. Subjects around 40 years of age have the largest differences; the reason might be that there are fewer training samples in this age range. A similar finding was reported in [18].
The top bar of Fig. 11 shows the Pearson correlations between the predicted brain age error and the intracranial volume (ICV)-normalized volumes of the brain regions auto-segmented based on the SRI atlas [56]. We find no significant correlation on the cross-validation dataset (n = 5,441, r < 0.1). The bottom of Fig. 11 shows the box plot for each brain region (ROI). The average errors (AE) of the predicted brain age differ slightly across brain regions, ranging from 0.29 years (Parietal Lateral GM Right) to −0.86 years (Occipital Inferior GM Right). It is also possible to visualize the average error of each brain region across the lifespan; three examples are shown in Fig. 11.
VI. Discussion and Conclusion
In this paper, we proposed a novel neural network for brain age estimation called the global-local transformer, which optimally fuses global-context and local detailed information with the attention mechanism.
We conducted experiments on six public datasets with 5,441 healthy subjects aged 0-97 years. Compared with six different baseline models, the results show that the proposed global-local transformer provides the best performance for brain age estimation in terms of MAE, correlation, and cumulative scores at different thresholds (Table III and Fig. 6). In addition, we compared the proposed method with eight state-of-the-art neural networks and three networks designed specifically for brain age estimation (Table VI). All of these results show that fusing the global-context and local detailed information improves the performance of brain age estimation on 2D slices.
Our proposed method can also be used to interpret the evidence for brain age estimation. We showed the subject-level salient brain regions that provide the lowest MAEs (Fig. 8) and the average salient regions over groups of subjects within certain age ranges (Fig. 9). The method can also be used to compute the differences of brain aging across brain regions (Fig. 10). These results demonstrate that the proposed method not only achieves better performance than other models but also visualizes the evidence behind its brain age estimates.
The limitations of the proposed method are summarized as follows: (1) This study focuses on developing an accurate brain age estimation model; MRIs from patients are not involved, similar to other studies in the literature [3], [13], [17]. Building a high-performing machine learning model on a healthy cohort is the first step toward applying it to diseased cohorts, which is our next step. (2) We also considered the gender information in the last fully-connected layer but did not report the results since no improvement was obtained. In the future, we will investigate where and how to fuse gender information into the transformer model. (3) As shown in Table I, the dataset used in our study is not the largest (the study in [35] used 19,687 subjects aged 44-80 years). Collecting a large dataset is challenging, especially one covering the lifespan of 0-100 years; we will continue to collect data to evaluate the proposed method at different data scales. (4) As shown in Table IV, our dataset is unbalanced: the number of samples at 30-60 years is smaller than in other age groups, yielding the largest MAE (5.12 years). In the future, we will balance the dataset either by re-sampling the training samples [3] or by data augmentation. (5) It is challenging to fairly compare the proposed method with other studies in the literature since different studies use different datasets, pre-processing, and modalities, and there is no benchmark dataset for brain age estimation. We summarized the performance of studies in the literature in Table I; our method achieves an MAE of 2.70 years, lower than other studies on lifespan data (covering young and old adults). (6) Our proposed method can compute the prediction errors for each atlas-based brain region over the lifespan, as shown in Fig. 11. However, no significant correlation was found between the prediction errors and the normalized volumes of the segmented brain regions. This indicates that the mechanism of interpreting age prediction based on local patches may differ from indirect interpretation methods, such as explanation maps [18] computed from gradients with the whole brain image as input. One future direction is to study the differences between direct and indirect interpretation methods and their correlations with other natural variability in brain morphology.
In conclusion, we have proposed a global-local transformer for brain age estimation using convolutional neural networks with two pathways: a global-pathway to extract global-context information and a local-pathway to extract local detailed information, fused by the attention mechanism. The proposed method achieves state-of-the-art performance and can highlight the most informative brain regions. Future work includes using larger and more balanced datasets, fusing gender information with MRI, and applying the model to patients' MRIs.
Acknowledgments
The work of Sheng He was supported by Charles A. King Trust Research Fellowship. The work of Yangming Ou was supported in part by Harvard Medical School, in part by Boston Children’s Hospital Faculty Development Award, and in part by St. Baldrick Foundation Scholar Award Grace Fund and NIH R03 HD104891.
Appendix
Our proposed method can also be used to estimate brain age on pathology-bearing brain MR images, such as MR images with brain tumors [67]. The biological age of an MRI with a tumor is not available; it could only be obtained by subjective evaluation from radiologists, which is time-consuming. In this paper, we therefore train the machine learning models for chronological age estimation rather than biological age estimation.
We collect brain MRIs from BraTS [67], [68] and only use subjects whose ages are available, resulting in 382 subjects with an age range of 17.4 to 86.6 years. We concatenate the four modalities (T1w, T1GD, T2w, and T2-FLAIR) into one multi-channel image as input.
We use the same configuration as in the experiments on the healthy cohort: 5-fold cross-validation is applied on the BraTS dataset, and the fused results of the three planes are reported in this section. Table VII shows the performance of the proposed method compared with the baseline models, state-of-the-art neural networks, and recently published models for brain age prediction. Our proposed method outperforms the other seventeen models. The main reasons are that (1) our method predicts brain age on local patches, so it can capture brain age information from non-tumor brain regions in subjects with tumors; and (2) the global-context information is learned through attention, which can automatically identify tumor regions by computing the similarity between healthy and tumor regions, reducing the effect caused by the tumors.
TABLE VII. Performance on the BraTS dataset (brain MRIs with tumors).

| Method | MAE (years) | Pearson correlation (r) | CS (α = 5 years) |
|---|---|---|---|
| ShuffleNet V2 (2.0×) [64] | 8.72±0.77 | 0.5642±0.0979 | 37.27%±5.93 |
| SqueezeNet [63] | 7.77±0.58 | 0.6575±0.0758 | 41.20%±4.85 |
| ResNet50 [20] | 7.93±0.91 | 0.6416±0.0840 | 42.78%±6.56 |
| ResNet101 [20] | 7.98±0.90 | 0.6272±0.0693 | 40.70%±8.11 |
| WRN-50-2 [61] | 7.91±0.76 | 0.6671±0.0361 | 41.47%±6.09 |
| WRN-101-2 [61] | 7.94±0.91 | 0.6484±0.0754 | 42.53%±6.92 |
| DenseNet121 [62] | 7.11±0.77 | 0.7242±0.0430 | 46.45%±4.70 |
| DenseNet201 [62] | 7.34±0.76 | 0.6905±0.0616 | 44.35%±5.58 |
| *SFCN 2D [13] (2021) | 10.18±0.38 | 0.6350±0.1083 | 32.55%±2.63 |
| *SFCN 3D [13] (2021) | 9.98±0.33 | 0.6295±0.0676 | 32.03%±2.64 |
| *DeepBrainNet [8] (2020) | 7.90±0.88 | 0.6816±0.0463 | 42.52%±7.23 |
| ResNet18 [20] | 8.02±0.82 | 0.6670±0.0636 | 40.16%±6.39 |
| BagNet-ResNet18 [19] | 8.86±0.47 | 0.6400±0.1004 | 34.38%±4.67 |
| VGG [36] | 11.24±0.41 | 0.6066±0.0708 | 22.83%±4.74 |
| BagNet-VGG [19] | 7.79±0.57 | 0.7023±0.0743 | 40.93%±4.76 |
| Global-Transformer [38] | 8.26±0.79 | 0.6968±0.0925 | 36.75%±4.97 |
| Local-Transformer [22] | 8.06±0.77 | 0.6931±0.0837 | 40.43%±5.87 |
| Global-Local Transformer | 6.85±0.65 | 0.7538±0.0485 | 47.78%±4.14 |

* Models specifically designed for brain age prediction.
Similar to Section V-D, we also train the global-local transformer with multiple patch sizes and visualize the most informative regions in Fig. 12. The salient brain regions do not overlap with the tumor regions, indicating that the predicted age is mainly derived from the non-tumor brain regions.
Our results suggest that brain age estimation might be used for unsupervised brain tumor segmentation in the future, since the errors of brain age estimation differ between tumor and non-tumor regions: the non-tumor regions have an MAE of 7.09±5.91 years, lower than the MAE on tumor regions (8.45±7.72 years; p < 0.0001, two-sided t-test).
References
- [1].Armanious K et al. , “Age-Net: An MRI-based iterative framework for brain biological age estimation,” IEEE Trans. Med. Imag, vol. 40, no. 7, pp. 1778–1791, Jul. 2021. [DOI] [PubMed] [Google Scholar]
- [2].Hu D et al. , “Disentangled-multimodal adversarial autoencoder: Application to infant age prediction with incomplete multimodal neuroimages,” IEEE Trans. Med. Imag, vol. 39, no. 12, pp. 4137–4149, Dec. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Feng X, Lipton ZC, Yang J, Small SA, and Provenzano FA, “Estimating brain age based on a uniform healthy population with deep learning and structural magnetic resonance imaging,” Neurobiol. Aging, vol. 91, pp. 15–25, Jul. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Cheng J et al. , “Brain age estimation from MRI using cascade networks with ranking loss,” IEEE Trans. Med. Imag, early access, Jun. 4, 2021, doi: 10.1109/TMI.2021.3085948. [DOI] [PubMed] [Google Scholar]
- [5].Cole JH and Franke K, “Predicting age using neuroimaging: Innovative brain ageing biomarkers,” Trends Neurosci, vol. 40, no. 12, pp. 681–690, Dec. 2017. [DOI] [PubMed] [Google Scholar]
- [6].Beheshti I, Maikusa N, and Matsuda H, “The association between ‘Brain-Age Score’(BAS) and traditional neuropsychological screening tools in Alzheimer’s disease,” Brain Behav, vol. 8, no. 8, 2018, Art. no. e01020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Gaser C, Franke K, Klöppel S, Koutsouleris N, and Sauer H, “BrainAGE in mild cognitive impaired patients: Predicting the conversion to Alzheimer’s disease,” PLoS ONE, vol. 8, no. 6, Jun. 2013, Art. no. e67346. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Bashyam VM et al. , “MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14,468 individuals worldwide,” Brain, vol. 143, no. 7, pp. 2312–2324, Jul. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Cole JH et al. , “Brain age predicts mortality,” Mol. Psychiatry, vol. 23, pp. 1385–1392, May 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Chung Y et al. , “Use of machine learning to determine deviance in neuroanatomical maturity associated with future psychosis in youths at clinically high risk,” JAMA Psychiatry, vol. 75, no. 9, pp. 960–968, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Han LK et al. , “Brain aging in major depressive disorder: Results from the enigma major depressive disorder working group,” Mol. Psychiatry, vol. 4, pp. 1–16, May 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Kaufmann T et al. , “Common brain disorders are associated with heritable patterns of apparent aging of the brain,” Nature Neurosci, vol. 22, no. 10, pp. 1617–1623, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Peng H, Gong W, Beckmann CF, Vedaldi A, and Smith SM, “Accurate brain age prediction with lightweight deep neural networks,” Med. Image Anal, vol. 68, Feb. 2021, Art. no. 101871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Krizhevsky A, Sutskever I, and Hinton GE, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst, 2012, pp. 1097–1105. [Google Scholar]
- [15].LeCun Y, Bengio Y, and Hinton G, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015. [DOI] [PubMed] [Google Scholar]
- [16].Cole JH et al. , “Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker,” NeuroImage, vol. 163, pp. 115–124, Dec. 2017. [DOI] [PubMed] [Google Scholar]
- [17].Jonsson BA et al. , “Brain age prediction using deep learning uncovers associated sequence variants,” Nature Commun, vol. 10, no. 1, pp. 1–10, Dec. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Levakov G, Rosenthal G, Shelef I, Raviv TR, and Avidan G, “From a deep learning model back to the brain—Identifying regional predictors and their relation to aging,” Hum. Brain Mapping, vol. 41, no. 12, pp. 3235–3252, Aug. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Brendel W and Bethge M, “Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet,” in Proc. Int. Conf. Learn. Represent, 2018, pp. 1–15. [Google Scholar]
- [20].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, Jun. 2016, pp. 770–778. [Google Scholar]
- [21].Qiu S et al. , “Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification,” Brain, vol. 143, no. 6, pp. 1920–1933, Jun. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Vaswani A et al. , “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst, 2017, pp. 5998–6008. [Google Scholar]
- [23].Zhou B, Khosla A, Lapedriza A, Oliva A, and Torralba A, “Learning deep features for discriminative localization,” in Proc. Comput. Vis. Pattern Recognit, 2016, pp. 2921–2929. [Google Scholar]
- [24].van Rijthoven M, Balkenhol M, Siliña K, van der Laak J, and Ciompi F, “HookNet: Multi-resolution convolutional neural networks for semantic segmentation in histopathology whole-slide images,” Med. Image Anal, vol. 68, Feb. 2021, Art. no. 101890. [DOI] [PubMed] [Google Scholar]
- [25]. Feichtenhofer C, Fan H, Malik J, and He K, “SlowFast networks for video recognition,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 6202–6211.
- [26]. Chen W, Jiang Z, Wang Z, Cui K, and Qian X, “Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8924–8933.
- [27]. He S and Schomaker L, “FragNet: Writer identification using deep fragment networks,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 3013–3022, 2020.
- [28]. Guo Y et al., “Deep local-global refinement network for stent analysis in IVOCT images,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Springer, 2019, pp. 539–546.
- [29]. Yan Z, Han X, Wang C, Qiu Y, Xiong Z, and Cui S, “Learning mutually local-global U-Nets for high-resolution retinal lesion segmentation in fundus images,” in Proc. IEEE 16th Int. Symp. Biomed. Imag. (ISBI), Apr. 2019, pp. 597–600.
- [30]. Schlemper J et al., “Attention gated networks: Learning to leverage salient regions in medical images,” Med. Image Anal., vol. 53, pp. 197–207, Apr. 2019.
- [31]. He S et al., “Multi-channel attention-fusion neural network for brain age estimation: Accuracy, generality, and interpretation with 16,705 healthy MRIs across lifespan,” Med. Image Anal., vol. 72, Aug. 2021, Art. no. 102091.
- [32]. Abnar S and Zuidema W, “Quantifying attention flow in transformers,” 2020, arXiv:2005.00928. [Online]. Available: http://arxiv.org/abs/2005.00928
- [33]. Huang T-W et al., “Age estimation from brain MRI images using deep learning,” in Proc. IEEE 14th Int. Symp. Biomed. Imag. (ISBI), Apr. 2017, pp. 849–852.
- [34]. Ueda M et al., “An age estimation method using 3D-CNN from brain MRI images,” in Proc. IEEE 16th Int. Symp. Biomed. Imag. (ISBI), Apr. 2019, pp. 380–383.
- [35]. Dinsdale NK et al., “Learning patterns of the ageing brain in MRI using deep convolutional networks,” NeuroImage, vol. 224, Jan. 2021, Art. no. 117401.
- [36]. Simonyan K and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
- [37]. Jiang H et al., “Predicting brain age of healthy adults based on structural MRI parcellation using convolutional neural networks,” Frontiers Neurol., vol. 10, p. 1346, Jan. 2020.
- [38]. Dosovitskiy A et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–22.
- [39]. Chen M et al., “Generative pretraining from pixels,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 1691–1703.
- [40]. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, and Zagoruyko S, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comput. Vis., Springer, 2020, pp. 213–229.
- [41]. Huang L, Tan J, Liu J, and Yuan J, “Hand-Transformer: Non-autoregressive structured modeling for 3D hand pose estimation,” in Proc. Eur. Conf. Comput. Vis., Springer, 2020, pp. 17–33.
- [42]. Yang F, Yang H, Fu J, Lu H, and Guo B, “Learning texture transformer network for image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5791–5800.
- [43]. Han K et al., “A survey on vision transformer,” 2020, arXiv:2012.12556. [Online]. Available: http://arxiv.org/abs/2012.12556
- [44]. Ioffe S and Szegedy C, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
- [45]. Devlin J, Chang M-W, Lee K, and Toutanova K, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol. (NAACL-HLT), 2019, pp. 4171–4186.
- [46]. Holmes AJ et al., “Brain genomics superstruct project initial data release with structural, functional, and behavioral measures,” Sci. Data, vol. 2, no. 1, Dec. 2015, Art. no. 150031.
- [47]. LaMontagne PJ et al., “OASIS-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease,” Alzheimer’s Dementia, vol. 14, no. 7, p. P1097, 2018.
- [48]. Evans AC, “The NIH MRI study of normal brain development,” NeuroImage, vol. 30, no. 1, pp. 184–202, Mar. 2006.
- [49]. Di Martino A et al., “The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism,” Mol. Psychiatry, vol. 19, no. 6, pp. 659–667, 2014.
- [50]. Park J et al., “Neural broadening or neural attenuation? Investigating age-related dedifferentiation in the face network in a large lifespan sample,” J. Neurosci., vol. 32, no. 6, pp. 2154–2158, Feb. 2012.
- [51]. Alexander LM et al., “An open resource for transdiagnostic research in pediatric mental health and learning disorders,” Sci. Data, vol. 4, May 2017, Art. no. 170181.
- [52]. Zuo XN et al., “An open science resource for establishing reliability and reproducibility in functional connectomics,” Sci. Data, vol. 1, no. 1, pp. 1–13, 2014.
- [53]. Tustison NJ et al., “N4ITK: Improved N3 bias correction,” IEEE Trans. Med. Imag., vol. 29, no. 6, pp. 1310–1320, Jun. 2010.
- [54]. Ou Y et al., “Field of view normalization in multi-site brain MRI,” Neuroinformatics, vol. 16, nos. 3–4, pp. 431–444, Oct. 2018.
- [55]. Doshi J, Erus G, Ou Y, Gaonkar B, and Davatzikos C, “Multi-atlas skull-stripping,” Acad. Radiol., vol. 20, no. 12, pp. 1566–1576, Dec. 2013.
- [56]. Rohlfing T, Zahr NM, Sullivan EV, and Pfefferbaum A, “The SRI24 multichannel atlas of normal adult human brain structure,” Hum. Brain Mapping, vol. 31, no. 5, pp. 798–819, 2010.
- [57]. Jenkinson M and Smith S, “A global optimisation method for robust affine registration of brain images,” Med. Image Anal., vol. 5, no. 2, pp. 143–156, Jun. 2001.
- [58]. Lin W et al., “Convolutional neural networks-based MRI image analysis for the Alzheimer’s disease prediction from mild cognitive impairment,” Frontiers Neurosci., vol. 12, p. 777, Nov. 2018.
- [59]. Geng X, Zhou Z-H, and Smith-Miles K, “Automatic age estimation based on facial aging patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2234–2240, Dec. 2007.
- [60]. Cole JH, Leech R, and Sharp DJ, “Prediction of brain age suggests accelerated atrophy after traumatic brain injury,” Ann. Neurol., vol. 77, no. 4, pp. 571–581, Apr. 2015.
- [61]. Zagoruyko S and Komodakis N, “Wide residual networks,” in Proc. Brit. Mach. Vis. Conf., 2016, pp. 1–15.
- [62]. Huang G, Liu Z, Van Der Maaten L, and Weinberger KQ, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
- [63]. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, and Keutzer K, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” 2016, arXiv:1602.07360. [Online]. Available: http://arxiv.org/abs/1602.07360
- [64]. Ma N, Zhang X, Zheng H-T, and Sun J, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 116–131.
- [65]. Szegedy C, Ioffe S, Vanhoucke V, and Alemi A, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proc. AAAI Conf. Artif. Intell., 2017, vol. 31, no. 1, pp. 4278–4284.
- [66]. Edupuganti V, Mardani M, Vasanawala S, and Pauly J, “Uncertainty quantification in deep MRI reconstruction,” IEEE Trans. Med. Imag., vol. 40, no. 1, pp. 239–250, Jan. 2021.
- [67]. Menze BH et al., “The multimodal brain tumor image segmentation benchmark (BRATS),” IEEE Trans. Med. Imag., vol. 34, no. 10, pp. 1993–2024, Oct. 2015.
- [68]. Bakas S et al., “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Sci. Data, vol. 4, no. 1, pp. 1–13, Dec. 2017.