Abstract
The combination of multimodal imaging and genomics provides a more comprehensive way to study mental illnesses and brain functions. Deep network-based data fusion models have been developed to capture their complex associations, resulting in improved diagnosis of diseases. However, deep learning models are often difficult to interpret, which makes it challenging to uncover biological mechanisms using these models. In this work, we develop an interpretable multimodal fusion model to perform automated diagnosis and result interpretation simultaneously. We name it Grad-CAM guided convolutional collaborative learning (gCAM-CCL); it is achieved by combining intermediate feature maps with gradient-based weights. The gCAM-CCL model can generate interpretable activation maps to quantify pixel-level contributions of the input features. Moreover, the estimated activation maps are class-specific, which can facilitate the identification of biomarkers underlying different groups. We validate the gCAM-CCL model on a brain imaging-genetic study and demonstrate its applications to both the classification of cognitive function groups and the discovery of underlying biological mechanisms. Specifically, our analysis results suggest that during task-fMRI scans, several object-recognition-related regions of interest (ROIs) are activated, followed by several downstream encoding ROIs. In addition, the high cognitive group may have stronger neurotransmission signaling, while the low cognitive group may have problems in brain/neuron development due to genetic variations.
Keywords: Interpretable, multimodal fusion, brain functional connectivity, CAM
I. Introduction
It is now well recognized that multimodal imaging data fusion can exploit complementary information from multiple modalities, leading to more accurate diagnosis and discovery of underlying biological mechanisms [1]. Conventional multimodal fusion has often focused on matrix decomposition approaches. For example, canonical correlation analysis (CCA) [2] has been widely used to integrate multimodal data by detecting linear cross-modal correlations. To capture complex cross-modal associations, deep neural network (DNN) based models, e.g., deep CCA [3], were developed to extract high-level cross-modal associations. These methods can lead to improved performance in terms of prediction and diagnosis [3], [4].
Beyond diagnosis, it is also important to uncover the biological mechanisms underlying mental disorders. This requires the data representation model to be interpretable. However, a DNN is composed of a large number of layers, and each layer consists of several nonlinear transforms/operations, e.g., nonlinear activation and convolution, making its representation difficult to interpret. There have been many efforts to interpret DNN based models. Among these interpretation approaches, there typically exists a trade-off between “model performance”, “computational cost”, and “interpretability”. Simonyan et al. [5] proposed two methods: “class model visualization”, which finds the optimal image to explain the trained model, and “class saliency visualization”, which explains the result for each image. Perturbation-based methods [6] have also been proposed to evaluate each region/biomarker’s contribution by occluding it individually and measuring how much classification accuracy is sacrificed as a result. The computational cost is high, however, because all voxels/biomarkers and their combinations are examined individually. Optimization-based methods obtain the occlusion by solving an optimization problem, i.e., jointly minimizing the size of the occluded biomarker set and the loss of accuracy. For example, Fong et al. [7] proposed an optimization-based approach that searches only for meaningful region candidates to reduce computational cost. In addition, Dabkowski et al. [8] developed a real-time interpretation framework, but it introduced an additional network block and consequently requires computationally intensive model training. There have also been other optimization- or perturbation-based interpretation models [9], [10], but all of them introduce additional network training or additional optimization problems for each image input. Therefore, these approaches perform well for single-image interpretation but are impractical for large-scale data analysis due to high computational costs.
To address this issue, we herein develop an interpretable DNN based multimodal fusion model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps indicating pixel-wise contributions of the inputs, enabling automated result interpretation. In particular, the activation maps are class-specific, which can reveal the underlying features or biomarkers specific to each class. In addition, gCAM-CCL can capture the cross-modal associations that are linked with phenotypical traits or disease conditions. This is achieved by feeding the network representations to a collaborative layer [11] which considers both cross-modal interactions and fitting to traits.
The contributions of our gCAM-CCL model are four-fold. First, it is a novel interpretable multimodal deep learning model, which can perform automated classification and interpretation simultaneously. Second, both cross-modal associations and the fitting to class labels are considered using a batch-independent loss function, leading to better classification and association identification. Third, the interpretation is based on guided Grad-CAM, which allows for instant interpretation with high-resolution class-specific activation maps. Fourth, we propose an improved way of computing the weights for Grad-CAM, which focuses only on the pixels with positive effects on class probabilities.
The rest of the paper is organized as follows. Section II introduces several related works and the limitations of existing multimodal fusion methods. Section III proposes our new model, gCAM-CCL, and describes how it overcomes these limitations. Section IV presents experimental results of applying gCAM-CCL to an imaging-genetic study. Section V concludes the work with a brief discussion.
II. Related works
A. Multimodal data fusion: analyzing cross-modal association
Classical multimodal data fusion methods are often focused on cross-modal matrix factorization. Among them, canonical correlation analysis (CCA) [2] has been widely used for multi-view/omics studies including our work [12]–[14]. CCA aims to find the most correlated variable pairs, i.e., canonical variables, and further association analysis can be performed accordingly.
Specifically, given two data matrices X1 ∈ ℝn×r, X2 ∈ ℝn×s (n represents the sample/subject size, and r, s represent the feature/variable sizes of the two data sets), CCA seeks two optimal loading vectors u1 ∈ ℝr×1 and u2 ∈ ℝs×1 which maximize the Pearson correlation corr(X1u1, X2u2), as in Eq. 1.
(1)   (u1*, u2*) = argmax_{u1, u2} corr(X1u1, X2u2)
where u1, u2 are normalized such that u1ᵀX1ᵀX1u1 = 1 and u2ᵀX2ᵀX2u2 = 1, i.e., the canonical variables X1u1 and X2u2 have unit variance.
Solving the optimization in Eq. 1 yields the most correlated canonical variable pair, i.e., X1u1 and X2u2. Additional canonical variable pairs (with successively lower correlations) can be obtained by solving the extended optimization problem in Eq. 2.
(2)   (U1*, U2*) = argmax_{U1, U2} tr(U1ᵀX1ᵀX2U2),  subject to U1ᵀX1ᵀX1U1 = I and U2ᵀX2ᵀX2U2 = I
where U1 ∈ ℝr×k, U2 ∈ ℝs×k, k = min(rank(X1), rank(X2)), and I denotes the k × k identity matrix.
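For concreteness, the following is a minimal numerical sketch of linear CCA (Eqs. 1–2) solved via covariance whitening and an SVD. It is an illustration rather than the implementation used in the cited works, and all function and variable names are ours.

```python
import numpy as np

def linear_cca(X1, X2, k, eps=1e-8):
    """Minimal linear CCA: returns loadings U1 (r x k), U2 (s x k) and the
    canonical correlations. Columns of X1 @ U1 and X2 @ U2 are the canonical
    variable pairs of Eq. 2."""
    X1 = X1 - X1.mean(axis=0)                     # center each view
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / (n - 1) + eps * np.eye(X1.shape[1])   # regularized covariances
    S22 = X2.T @ X2 / (n - 1) + eps * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / (n - 1)

    def inv_sqrt(S):                              # S^(-1/2) via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, svals, Vt = np.linalg.svd(W1 @ S12 @ W2)   # SVD of the whitened cross-covariance
    U1 = W1 @ U[:, :k]                            # loading vectors u1 of Eq. 1
    U2 = W2 @ Vt.T[:, :k]                         # loading vectors u2 of Eq. 1
    return U1, U2, svals[:k]                      # singular values = canonical correlations

# Toy usage: two views measured on 100 subjects
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 20)), rng.normal(size=(100, 15))
U1, U2, rho = linear_cca(X1, X2, k=3)
```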
CCA captures only linear associations and therefore requires that the different data/views follow similar distributions. However, data from different modalities, e.g., fMRI imaging and genetic data, may follow different distributions and have different structures. As a result, CCA may fail to detect associations between heterogeneous data sets. To address this problem, deep CCA (DCCA) was proposed by Andrew et al. [3] to detect more complicated correlations. DCCA introduces a deep network representation before applying the CCA framework. Unlike linear CCA, which finds the optimal canonical vectors u1, u2, DCCA seeks the optimal network representations f1(X1), f2(X2), as shown in Eq. 3.
(3)   (f1*, f2*) = argmax_{f1, f2} corr(f1(X1), f2(X2))
where f1, f2 are two deep networks.
The introduction of a deep network representation provides a more flexible way to detect both linear and nonlinear correlations. In applications to both speech and handwritten-digit data [3], DCCA was more effective at finding correlations than other methods, e.g., linear CCA and kernel CCA. Despite DCCA’s superior performance, the detected associations may not be relevant to the phenotype of interest, e.g., disease. Instead, they may be driven by irrelevant signals, e.g., background and noise. As a result, they have limited value for identifying disease-related biomarkers.
B. Deep collaborative learning (DCL): phenotype-related cross-data association
To address the limitations of DCCA, we proposed a multimodal fusion model, deep collaborative learning (DCL) [11]. DCL can capture phenotype-related cross-modal associations by enforcing additional fitting to phenotype label, as formulated in Eq. 4.
(4)   (f1*, f2*, U1*, U2*) = argmax a1·corr(Z1U1, Z2U2) + a2·corr(Z1U1, y) + a3·corr(Z2U2, y)
where U1, U2 are subject to U1ᵀZ1ᵀZ1U1 = I and U2ᵀZ2ᵀZ2U2 = I; Z1 = f1(X1) ∈ ℝn×p, Z2 = f2(X2) ∈ ℝn×q; f1, f2 represent two deep networks, respectively; y ∈ ℝn×1 represents phenotype or label data; and a1, a2, a3 are the weight coefficients.
As shown in Eq. 4, DCL seeks the optimal network representation Z1 = f1(X1), Z2 = f2(X2) to maximize cross-modal correlations. Compared to DCCA, DCL’s representation retains label related information to guarantee that the identified associations are linked with label/phenotype. As a result, better classification of different cognitive groups can be achieved, according to the work described in [11]. Moreover, DCL relaxes the requirement that the projections u1 and u2 have to be in the same direction. This leads to an effective fitting to phenotypical information while incorporating cross-modal correlation.
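To make the structure of Eq. 4 concrete, the sketch below writes a DCL-style objective as a weighted sum of sample correlations in PyTorch. This is a schematic reading of [11], not the authors' implementation; the function names, the single-column projections, and the omission of the orthonormality constraints are simplifications of ours.

```python
import torch

def pearson_corr(a, b, eps=1e-8):
    """Sample Pearson correlation between two 1-D tensors."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def dcl_style_loss(z1u1, z2u2, y, a1=1.0, a2=1.0, a3=1.0):
    """Negative weighted sum of correlations (to be minimized), mirroring the
    structure of Eq. 4 with single-column projections Z1 u1, Z2 u2 and label y."""
    return -(a1 * pearson_corr(z1u1, z2u2)      # cross-modal association
             + a2 * pearson_corr(z1u1, y)       # fit of view 1 to the phenotype
             + a3 * pearson_corr(z2u2, y))      # fit of view 2 to the phenotype
```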
With the ability to capture both cross-modal associations and trait-related features, DCL can combine complementary information from multimodal data, as demonstrated in our brain imaging study [11]. However, DCL uses deep networks to extract high-level features, which are difficult to interpret. As a result, DCL is limited in its ability to reveal disease mechanisms.
C. Deep Network Interpretation: CAM-based methods
Both DCCA and DCL use deep neural networks (DNNs) for feature extraction. A DNN employs a sequence of intermediate layers to extract high-level features, and each layer is composed of a number of complex operations, e.g., nonlinear activation, kernel convolution, and batch normalization. DNN based models have found successful applications in both computer vision and medical imaging due to their superior ability to extract high-level features. However, the large number of layers and the complex/nonlinear operations in each layer make network explanation and feature identification difficult. Users may therefore doubt the reliability of deep networks: do they make decisions based on the object of interest, or based on irrelevant/background information?
Optimization-based methods and perturbation-based methods have been proposed to partially interpret deep neural networks. However, both of them introduce additional optimization problems or network training for each image input, and consequently are computationally intensive and not appropriate for large scale applications.
1). Class Activation Mapping (CAM):
To make DNN explainable, Class Activation Mapping (CAM) method [15] was proposed. CAM generates an activation map for each sample/image indicating pixel-wise contributions to the decision, e.g., class label. Moreover, the activation maps are class-specific, providing more discriminative information for class-specific analysis. This dramatically helps build trust in deep networks: for correctly classified images/samples, CAM explains how the classification decision is made by highlighting the object of interest; for incorrectly classified images/samples, CAM illustrates the misleading regions.
The activation maps in CAM are obtained by computing an optimal combination of intermediate feature maps. A weight coefficient is needed for each feature map to evaluate its importance to the decision of interest. However, for most CNN-based models, this weight is not provided. To solve this problem, a re-training procedure is introduced, in which the feature maps are used directly by a newly introduced layer to re-conduct classification. The corresponding weights are then calculated using the parameters in the introduced layer. The detailed CAM method is described as follows.
For a pre-trained CNN-based model, assume that a target feature map layer consists of K channels/feature-maps Fk ∈ ℝh×w (k = 1, 2, …, K), where h, w represent the height and width of each feature map, respectively. CAM discards all the subsequent layers and introduces a new layer (with softmax activation) to re-conduct classification using these feature maps Fk. Then a prediction score Sc is calculated by the newly introduced layer for each class c (c = 1, 2, …, C), as formulated in Eq. 5.
(5)   Sc = Σk wk^c · (1/(hw)) Σi,j Fk(i, j)
where wk^c represents the weight coefficient of feature map Fk for class c in the newly introduced layer.
After that, class-specific activation maps can be generated by first combining the feature maps with the trained weights followed by upsampling to project onto the input images, as in Eq. 6.
(6)   mapCAM = upsample( Σk wk^c Fk )
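As an illustration of Eqs. 5–6, the sketch below combines the feature maps of the target layer with class-specific weights and projects the result onto the input resolution. It assumes the retrained weights wk^c are already available and uses nearest-neighbour upsampling for simplicity; it is not the implementation of [15].

```python
import numpy as np

def cam_map(feature_maps, w_c, out_hw):
    """CAM for one class: weighted sum of feature maps (Eq. 6), then upsampling.

    feature_maps: (K, h, w) array from the target layer.
    w_c: (K,) weights of the retrained softmax layer for class c.
    out_hw: (H, W) input-image size onto which the map is projected."""
    cam = np.tensordot(w_c, feature_maps, axes=1)          # (h, w) weighted combination
    H, W = out_hw
    h, w = cam.shape
    rows = np.arange(H) * h // H                           # nearest-neighbour upsampling
    cols = np.arange(W) * w // W
    cam_up = cam[np.ix_(rows, cols)]
    cam_up = cam_up - cam_up.min()                         # normalize to [0, 1] for display
    return cam_up / (cam_up.max() + 1e-8)
```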
The re-training procedure, however, is time consuming, which limits CAM’s application. Moreover, classification accuracy is sacrificed due to the modification of the model’s architecture.
2). Gradient-weighted CAM (Grad-CAM):
To address the limitations of the CAM method, gradient-weighted CAM (Grad-CAM) was proposed [16] to compute activation maps without modifying the model’s architecture. Similar to CAM, Grad-CAM also needs a set of weight coefficients to combine the feature maps. These are obtained by first calculating the gradient of the decision of interest w.r.t. each feature map and then performing global average pooling on the gradients to obtain scalar weights. As a result, Grad-CAM avoids adding extra layers, so both the model-retraining and performance-decrease problems are resolved. The weights and the resulting activation map mapgradcam are computed as follows.
(7)   wk^c = (1/(hw)) Σi,j ∂yc/∂Fk(i, j)
where yc represents the prediction score for class c, and the class-specific activation map is then computed as
(8)   mapgradcam = ReLU( Σk wk^c Fk )
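A minimal PyTorch sketch of Eqs. 7–8 is shown below. It assumes a model that maps an input batch to class scores and uses forward/backward hooks to capture the target layer's feature maps and their gradients; it is an illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Grad-CAM: gradient-weighted combination of the target layer's feature maps.

    Returns a (h, w) class-specific activation map (before upsampling)."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    score = model(x)[0, class_idx]            # y^c for a single input
    model.zero_grad()
    score.backward()                          # d y^c / d F_k
    h1.remove(); h2.remove()
    A = feats['a'][0].detach()                # (K, h, w) feature maps F_k
    G = grads['a'][0].detach()                # (K, h, w) gradients
    w = G.mean(dim=(1, 2))                    # Eq. 7: global average pooling of gradients
    cam = F.relu((w[:, None, None] * A).sum(dim=0))   # Eq. 8: ReLU of the weighted sum
    return cam / (cam.max() + 1e-8)
```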
3). Guided Grad-CAM: high resolution class-specific activation maps:
Both CAM’s and Grad-CAM’s activation maps are coarse due to the upsampling procedure, as feature maps are normally of smaller size than the input images. This makes it difficult to identify small but important features. Fine-grained visualization methods, e.g., guided back-propagation (BP) [17] and deconvolution [6], can generate high-resolution activation maps. These methods use backward projections which operate on layer-to-layer gradients. No upsampling is involved in these back-projection methods, and therefore high-resolution activation maps can be obtained. Nevertheless, the resulting activation maps are not class-specific, which complicates their interpretation, especially for multiple classes. To obtain both high-resolution and class-specific activation maps, guided Grad-CAM was proposed in [16] by incorporating guided BP into Grad-CAM. Guided Grad-CAM computes activation maps by performing a Hadamard product between the Grad-CAM map and the guided BP map, as formulated in Eq. 9.
(9)   mapguided-gradcam = mapgradcam ⊙ mapguidedBP
where mapguidedBP represents the map computed using guided BP algorithm [17], and ⊙ represents the Hadamard product operation. Given two arbitrary matrices A, B ∈ ℝm×n, their Hadamard product is defined as (A ⊙ B)ij := AijBij.
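The fusion step of Eq. 9 can be sketched as follows: the coarse Grad-CAM map is first upsampled to the resolution of the guided-BP map and the two are then multiplied elementwise. The bilinear upsampling is our choice for illustration.

```python
import torch
import torch.nn.functional as F

def guided_grad_cam(map_gradcam, map_guided_bp):
    """Eq. 9: Hadamard product of the (upsampled) Grad-CAM map and the guided-BP map.

    map_gradcam: (h, w) coarse class-specific map; map_guided_bp: (H, W) fine map."""
    up = F.interpolate(map_gradcam[None, None], size=map_guided_bp.shape,
                       mode='bilinear', align_corners=False)[0, 0]
    return up * map_guided_bp                 # high-resolution, class-specific map
```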
III. Grad-CAM guided convolutional collaborative learning (gCAM-CCL)
To this end, we propose a new model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which can perform automated multimodal fusion and result interpretation. As shown in Fig. 1, gCAM-CCL first integrates the two modalities using the collaborative networks, and then computes class-specific activation maps using both guided BP and Grad-CAM. As a result, gCAM-CCL can perform classification and biomarker identification (i.e., result interpretation) simultaneously. Moreover, we propose an improved way to compute the weights for Grad-CAM, which focuses only on the pixels with positive effects on class probabilities.
A. Proposed learning method
Compared to the DCL model [11], gCAM-CCL employs both a new architecture and a new loss function to incorporate Grad-CAM. Since computing activation maps requires a layer of feature maps, gCAM-CCL replaces the multilayer perceptron (MLP) networks with two ConvNets so that multi-channel feature maps can be obtained. This also benefits model training because ConvNets dramatically reduce the number of parameters by enforcing shared kernel weights.
Moreover, as pointed out in Wang’s work [4], both DCCA [3] and DCL [11] include the sample size in their loss functions, giving rise to a problem of batch-size tuning. In other words, the loss functions depend on the batch size due to the population-level correlation term. As a result, a large batch size is required [4] for network training. In this work, we propose a new loss function to relieve the issue of batch-size dependence, as formulated in Eq. 10. As shown in Eq. 10, the population-level correlation term is replaced with a summation of sample-level losses. Moreover, the correlation term is replaced with a regression-type loss, i.e., the cross-entropy loss, since it has been shown that the optimization of the correlation term is equivalent to the optimization of a regression loss [4].
(10)   L = Σi=1…n [ ℓCE(z1(i), y(i)) + ℓCE(z2(i), y(i)) ]
where z1(i), z2(i) are the outputs of the two ConvNets for sample i, as illustrated in Fig. 1, ℓCE denotes the cross-entropy loss, and y(i) is the corresponding class label.
This batch-independent loss function is easier to extend to multi-class and multi-view scenarios, where the extended loss function is given as follows.
(11)   L = − Σi=1…n Σj=1…m Σc=1…C yc(i) log pj,c(i)
where m represents the number of views, C represents the number of classes, yc(i) is the one-hot encoded label of sample i, and pj,c(i) is the predicted (softmax) probability of class c from view j for sample i.
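Because the loss in Eqs. 10–11 decomposes into per-sample cross-entropy terms, a batch-independent multi-view version can be sketched as below. The exact weighting and fusion used in gCAM-CCL may differ; this only illustrates the structure.

```python
import torch
import torch.nn.functional as F

def multiview_ce_loss(view_logits, labels):
    """Sum of per-sample, per-view cross-entropy losses (in the spirit of Eq. 11).

    view_logits: list of m tensors, each (n, C) class logits from one view's branch.
    labels: (n,) integer class labels. No population-level statistic (such as a
    batch correlation) appears, so the loss does not depend on the batch size."""
    return sum(F.cross_entropy(logits, labels, reduction='sum')
               for logits in view_logits)
```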
B. Proposed interpretation method
The gCAM-CCL uses a 1D ConvNet to learn features from the SNP data and a 2D ConvNet to learn features from the brain imaging data. The outputs of the two ConvNets are flattened and then fused in the collaborative layer with the loss function in Eq. 10, which considers both cross-modal associations and their fit to the phenotype/label y. After that, two intermediate layers are selected, from which the feature maps are combined using the gradient-based weights (Eqs. 7–8) and class-specific Grad-CAM activation maps are generated accordingly. Meanwhile, fine-grained activation maps are computed by projecting the gradients back from the collaborative layer to the input layer using guided BP. The obtained activation maps indicate pixel-wise contributions to the decision of interest, e.g., the prediction, and significant biomarkers, e.g., brain FCs and genes, can be identified accordingly.
Moreover, ideal class-specific activation maps should highlight only the features relevant to the corresponding class, e.g., ‘dog’ class. However, features related to other classes, e.g., fish-related features, may have negative contributions to predicting ‘dog’ class, resulting in noisy features in the activation maps. To remove the noisy or irrelevant features, we apply a ReLU function to the gradients, as shown in Eq. 12. The ReLU function ensures positive effects so that pixels with negative contributions can be filtered out.
(12)   wk^c = (1/(hw)) Σi,j ReLU( ∂yc/∂Fk(i, j) )
where yc represents the prediction score for class c.
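A sketch of the modified weight computation in Eq. 12 is given below: the ReLU is applied to the gradients before global average pooling, so only pixels with positive effects on the class score contribute to the weights.

```python
import torch

def positive_grad_weights(grads):
    """Eq. 12 (sketch): keep only positive gradients before the pooling step.

    grads: (K, h, w) tensor of d y^c / d F_k for one sample and one class."""
    return torch.relu(grads).mean(dim=(1, 2))   # (K,) weights wk^c
```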
IV. Application to brain imaging genetic study
We apply the gCAM-CCL model to an imaging-genetic study, in which brain FC data are integrated with single nucleotide polymorphism (SNP) data to classify low/high cognitive groups. Multiple brain regions of interest (ROIs) function as a group when performing a specific task, e.g., reading. Brain FC depicts the functional associations between different brain ROIs [18]. On the other hand, genetic factors may also influence brain function, as many brain dysfunctions are heritable. Imaging-genetic integration can probe brain function from a more comprehensive view, with the potential for revealing brain mechanisms. The proposed gCAM-CCL model is used to extract and analyze complex interactions both within and between brain FC and genetics.
A. Brain imaging data
Several brain fMRI modalities from the Philadelphia Neurodevelopmental Cohort (PNC) [19] were used in the experiments. The PNC is a large-scale collaborative study between the Brain Behavior Laboratory at the University of Pennsylvania and the Children’s Hospital of Philadelphia. It contains multiple neuroimaging data types, e.g., fMRI, and genomic data, e.g., SNPs, collected from 854 adolescents aged 8 to 21 years. Three types of fMRI data are available in the PNC cohort: resting-state fMRI, emotion task fMRI, and nback task fMRI (nback-fMRI). As our work focused on analyzing cognitive ability, only nback-fMRI, which is related to working memory and lexical processing, was used in the experiments. The duration of the nback-fMRI scan was 11.6 minutes (231 TRs), during which subjects were asked to perform standard nback tasks.
SPM12 was used to conduct motion correction, spatial normalization, and spatial smoothing. Movement artifacts (head motion effects) were removed via a regression procedure using a rigid body model (6 parameters: 3 translation and 3 rotation parameters) [20], and the functional time series were band-pass filtered to the 0.01 Hz to 0.1 Hz frequency range, as the signals of interest are mainly concentrated at low frequencies. For quality control, we excluded high-motion subjects with translation > 2 mm or with signal-to-fluctuation-noise ratio (SFNR) < 275, following the work in [21]. 264 regions of interest (ROIs) (containing 21,384 voxels) were extracted based on the Power coordinates [22] with a sphere radius of 5 mm. For each subject, a 264 × 264 image was then obtained based on the 264 × 264 ROI-ROI connections, which was used as the imaging input to the gCAM-CCL model. The ROIs are arranged according to their spatial coordinates and their brain sub-network membership. The spatial information is based on the coordinates in the Montreal Neurological Institute space, and ROIs in the same brain sub-network are similar in terms of brain function. In this way, neighboring ROIs are both spatially and functionally close to each other.
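For illustration, the construction of the 264 × 264 FC input from the extracted ROI time series can be sketched as follows; zeroing the diagonal is our assumption, as the handling of self-connections is not specified above.

```python
import numpy as np

def fc_matrix(roi_timeseries):
    """Build one subject's 264 x 264 functional-connectivity 'image'.

    roi_timeseries: (T, 264) array of band-pass-filtered mean time series,
    one column per Power ROI (ordered by sub-network and MNI coordinates)."""
    fc = np.corrcoef(roi_timeseries.T)          # Pearson correlation between all ROI pairs
    np.fill_diagonal(fc, 0.0)                   # drop self-connections (our choice)
    return fc                                   # used as the (1, 264, 264) imaging input
```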
B. SNP data
The genomic data were collected from 3 platforms: the Illumina HumanHap 610 array, the Illumina HumanHap 500 array, and the Illumina Human Omni Express array. The three platforms generated 620k, 561k, and 731k SNPs, respectively [19]. A common set of SNPs (313k) was extracted, and PLINK [23] was then used to perform standard quality control, including the Hardy-Weinberg equilibrium test for genotyping errors (p-value < 1e−5), extraction of common SNPs (MAF > 5%), and linkage disequilibrium (LD) pruning with a threshold of 0.9. After that, SNPs with missing call rates > 10% and samples with missing SNPs > 5% were removed. The remaining missing values were imputed by Minimac3 [24] using the reference genome from the 1000 Genomes Project. In addition, only the SNPs within gene bodies were kept for further analysis, resulting in 98,804 SNPs in 14,131 genes.
As the study aimed to investigate the brain, we further narrowed down to brain-expression-related SNPs. This was achieved using the expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) V7 database [25], a large-scale consortium studying tissue-specific gene regulation and expression. The GTEx data were collected from 53 tissue sites from around 1000 subjects. Among the 53 tissue sites, 13 are brain-related and are listed in Table I. A set of 108 SNP loci were selected, which showed significant regulation of expression (eQTL p-value < 5 × 10−8) in all 13 brain-relevant tissues. In addition, SNPs in the top 100 brain-expressed genes were also selected based on the GTEx database. These procedures resulted in 750 SNP loci, which were used as the genetic input to the gCAM-CCL model.
TABLE I: The 13 brain-related tissue sites in the GTEx database and their sample sizes.
Tissue name | Sample size |
---|---|
Brain nucleus accumbens | 93 |
Brain caudate | 100 |
Brain cerebellar hemisphere | 89 |
Brain cerebellum | 103 |
Brain frontal cortex | 92 |
Brain cortex | 96 |
Brain amygdala | 76 |
Brain spinal cord | 80 |
Brain substantia nigra | 72 |
Brain putamen | 82 |
Brain anterior cingulate cortex | 72 |
Brain hypothalamus | 81 |
Brain hippocampus | 81 |
C. Integrating brain imaging and genetic data: classification
The gCAM-CCL was then applied to integrate brain imaging data with SNPs data to classify subjects with low/high cognitive functions. The wide range achievement test (WRAT) [26] score, a measure of comprehensive cognitive ability, including reading, comprehension, and math skills, was used to evaluate the cognitive level of each subject. The 854 subjects were divided into three classes: high cognitive/WRAT group (top 20% WRAT score), low cognitive/WRAT group (bottom 20% WRAT score), and middle group (the rest), following the procedures in the work [11].
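A minimal sketch of the percentile-based group split described above is given below; the handling of ties at the 20%/80% boundaries is our assumption.

```python
import numpy as np

def wrat_groups(wrat_scores):
    """Assign each subject to the low / middle / high WRAT group (bottom/top 20%)."""
    lo, hi = np.percentile(wrat_scores, [20, 80])
    groups = np.full(len(wrat_scores), 'middle', dtype=object)
    groups[wrat_scores <= lo] = 'low'           # bottom 20% WRAT scores
    groups[wrat_scores >= hi] = 'high'          # top 20% WRAT scores
    return groups
```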
The gCAM-CCL model adopted a 1D convolutional neural network (CNN) to learn the interactions between alleles at different SNP loci. ConvNets have been widely used for sequencing and gene expression data [27], [28] to learn local genetic structures. According to these studies, 1D kernels with relatively large sizes are preferred. As a result, a 31 × 1 kernel and a 15 × 1 kernel were used. The detailed architecture of gCAM-CCL is listed in Table IV; an illustrative sketch of the two branches is given after the table. The data were partitioned into a training set (70%), a validation set (15%), and a test set (15%). The proposed gCAM-CCL model was trained on the training set; hyper-parameters were selected based on the validation set; and the classification performance was reported on the test set.
TABLE IV: The detailed architecture of the gCAM-CCL model. The left four columns describe the fMRI ConvNet and the right four columns describe the SNP ConvNet.

Layer Name | Input Shape | Operations | Connects to | Layer Name | Input Shape | Operations | Connects to |
---|---|---|---|---|---|---|---|
f_conv1 | (b, 1, 264, 264) | K, P, MP = 7, 3, 2 | f_conv2 | s_conv1 | (b, 1, 750) | K, MP = 31, 6 | s_conv2 |
f_conv2 | (b, 16, 132, 132) | K, P, MP = 5, 2, 4 | f_conv3 | s_conv2 | (b, 16, 120) | K, MP = 31, 6 | s_conv3 |
f_conv3 | (b, 32, 33, 33) | K, P, MP = 3, 1, 3 | f_conv4 | s_conv3 | (b, 32, 15) | K = 15 | s_flatten |
f_conv4 | (b, 32, 11, 11) | K = 11 | f_flatten | s_flatten | (b, 64, 1) | - | collab_layer |
f_flatten | (b, 64, 1) | - | collab_layer | - | - | - | - |
collab_layer | (b, 4) | - | - | - | - | - | - |
Notations: b (batch size), K (kernel size), P (padding), MP (maxpooling).
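To make Table IV concrete, the following PyTorch sketch reproduces the two branches; stride-1 convolutions with the listed kernel sizes, paddings, and max-pooling factors match the listed input shapes (264 → 132 → 33 → 11 → 1 and 750 → 120 → 15 → 1). The placement of batch normalization and ReLU between convolution and pooling, and the omission of the collaborative layer, are our assumptions.

```python
import torch
import torch.nn as nn

class FMRIBranch(nn.Module):
    """2-D ConvNet for the (1, 264, 264) FC image, following Table IV."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7, padding=3), nn.BatchNorm2d(16),
            nn.ReLU(), nn.MaxPool2d(2),                  # f_conv1: 264 -> 132
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(4),                  # f_conv2: 132 -> 33
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(3),                  # f_conv3: 33 -> 11
            nn.Conv2d(32, 64, kernel_size=11),           # f_conv4: 11 -> 1
        )

    def forward(self, x):                                # x: (b, 1, 264, 264)
        return self.net(x).flatten(1)                    # f_flatten: (b, 64)

class SNPBranch(nn.Module):
    """1-D ConvNet for the (1, 750) SNP vector, following Table IV."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=31), nn.BatchNorm1d(16),
            nn.ReLU(), nn.MaxPool1d(6),                  # s_conv1: 750 -> 120
            nn.Conv1d(16, 32, kernel_size=31), nn.BatchNorm1d(32),
            nn.ReLU(), nn.MaxPool1d(6),                  # s_conv2: 120 -> 15
            nn.Conv1d(32, 64, kernel_size=15),           # s_conv3: 15 -> 1
        )

    def forward(self, x):                                # x: (b, 1, 750)
        return self.net(x).flatten(1)                    # s_flatten: (b, 64)
```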
Hyper-parameters, including the momentum, activation function, learning rate, decay rate, batch size, and maximum number of epochs, were selected using the validation set, and their values are listed in Table II. Mini-batch SGD was used to solve the optimization problem. Over-fitting occurred due to the small sample size. To address this, dropout was used, with the dropout probability of the middle layers set to 0.2. Moreover, early stopping was used during network training to further mitigate over-fitting. In addition, batch normalization was applied after each layer to relieve the gradient vanishing/exploding problem associated with ReLU activation. Computational experiments were conducted on a desktop with an Intel(R) Core(TM) i7-8700K CPU (@ 3.70GHz), 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU (11 GB).
TABLE II: Hyper-parameters used for training the gCAM-CCL model.

Method | Epochs | Batch size | Activation | Learning rate | Decay rate | Dropout | Momentum |
---|---|---|---|---|---|---|---|
gCAM-CCL | 500 | 4 | ReLU, Sigmoid | 0.00001 | Half per 200 epochs | 0.2 (middle layers) | 0.9 |
To evaluate and compare the performance, several classical classifiers, e.g., SVM, random forest (RF), and decision tree (DT), were implemented for classifying the low/high WRAT groups. In addition, several deep network based classifiers were implemented, including CCL with external classifiers (SVM/RF) and a multilayer perceptron (MLP). The three multimodal-data-integration-based classifiers, i.e., gCAM-CCL, CCL+SVM, and CCL+RF, take the two omics data sets separately as input, as shown in Fig. 1. In contrast, the other models take concatenated data as input. Specifically, to generate the “concatenated data”, brain FCs were first flattened into vectors and principal component analysis (PCA) was used for dimension reduction; the brain FC vectors and SNP vectors were then concatenated. We used a radial basis function (RBF) kernel for the SVM, and the RF classifier consisted of 50 trees. The results of classifying the high/low cognitive groups are shown in Table III. From Table III, the three data-integration-based classifiers outperform the concatenation-based classifiers. This is consistent with the result in our previous work [11], which also showed that a collaborative network with multimodal data can improve classification performance. Moreover, gCAM-CCL with its intrinsic softmax classifier achieves better classification than ‘CCL+SVM’ and ‘CCL+RF’. This may be due to the incorporation of the cross-entropy loss in the model, i.e., Eq. 10, which facilitates gradient learning during back-propagation at each iteration.
TABLE III: Classification performance for the low/high WRAT groups (ACC: accuracy; SEN: sensitivity; SPF: specificity; F1: F1 score).
Classifier | ACC | SEN | SPF | F1 |
---|---|---|---|---|
gCAM-CCL | 0.7501 | 0.7762 | 0.7157 | 0.7610 |
CCL+SVM | 0.7387 | 0.7637 | 0.7083 | 0.7504 |
CCL+RF | 0.7419 | 0.7666 | 0.7014 | 0.7523 |
DCL+SVM | 0.7315 | 0.7578 | 0.6981 | 0.7354 |
MLP | 0.7231 | 0.7555 | 0.6915 | 0.7215 |
SVM | 0.7082 | 0.7562 | 0.6785 | 0.7093 |
DT | 0.6626 | 0.6778 | 0.6430 | 0.6605 |
RF | 0.7119 | 0.7559 | 0.6714 | 0.7138 |
Logistic regression | 0.6745 | 0.7386 | 0.6285 | 0.6900 |
D. Integrating brain imaging and genetic data: result interpretation
The class-specific activation maps for the low WRAT group and the high WRAT group are plotted in Figs. 2–3, respectively. From Fig. 2, the low WRAT group shows a relatively large number of activated FCs, which contributed to the ‘low WRAT group’ class. In comparison, the high WRAT group (Fig. 3) has a relatively small number of significant FCs, which determined the ‘high WRAT group’ class. This is further validated by the average histograms of the activation maps in Fig. 4. For the low WRAT group (Fig. 4-left), a large portion of FCs were activated (high gray-scale values), while for the high WRAT group (Fig. 4-right), only a small portion were activated.
To identify significant brain FCs and SNPs, pixels with gray values > 0.05 × the maximum gray value were selected, following [16]. After that, FCs and SNPs with an occurrence frequency > 0.7 across all subjects were further selected as significant FCs (see Figs. 5–6) and SNPs (listed in Tables V-VI).
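The selection rule described above can be sketched as follows; the per-subject relative threshold and the 0.7 occurrence-frequency cutoff follow the text, while the array layout is our assumption.

```python
import numpy as np

def select_biomarkers(activation_maps, rel_thresh=0.05, freq_thresh=0.7):
    """Select significant pixels (FCs or SNPs) from one class's activation maps.

    activation_maps: (n_subjects, ...) stack of guided Grad-CAM maps. A pixel is
    'activated' in a subject if it exceeds rel_thresh times that subject's maximum
    value; pixels activated in more than freq_thresh of the subjects are returned."""
    flat = activation_maps.reshape(len(activation_maps), -1)
    activated = flat > rel_thresh * flat.max(axis=1, keepdims=True)
    freq = activated.mean(axis=0)                        # occurrence frequency per pixel
    return (freq > freq_thresh).reshape(activation_maps.shape[1:])
```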
TABLE V: Significant SNPs and their corresponding genes identified from the activation maps of the low WRAT group.
SNP rs # | Gene | SNP rs # | Gene |
---|---|---|---|
rs1642763 | ATP1B2 | rs997349 | MTURN |
rs9508 | ATPIF1 | rs17547430 | MTURN |
rs2242415 | BASP1 | rs7780166 | MTURN |
rs11133892 | BASP1 | rs10488088 | MTURN |
rs10113 | CALM3 | rs3750089 | MTURN |
rs11136000 | CLU | rs2275007 | OSGEP |
rs4963126 | DEAF1 | rs4849179 | PAX8 |
rs11755449 | EEF1A1 | rs11539202 | PDHX |
rs2073465 | EEF1A1 | rs1045288 | PSMD13 |
rs1809148 | EEF1D | rs7563960 | RNASEH1 |
rs4984683 | FBXL16 | rs145290 | RP1 |
rs7026635 | FBXW2 | rs446227 | RP1 |
rs734138 | FLYWCH1 | rs414352 | RP1 |
rs2289681 | GFAP | rs6507920 | RPL17 |
rs7258864 | GNG7 | rs12484030 | RPL3 |
rs4807291 | GNG7 | rs10902222 | RPLP2 |
rs887030 | GNG7 | rs8079544 | TP53 |
rs7254861 | GNG7 | rs6726169 | TTL |
rs12985186 | GNG7 | rs415430 | WNT3 |
rs2070937 | HP | rs8078073 | YWHAE |
rs622082 | IGHMBP2 | rs12452627 | YWHAE |
rs12460 | LINS | rs324126 | ZNF880 |
rs10044354 | LNPEP |
TABLE VI: Significant SNPs and their corresponding genes identified from the activation maps of the high WRAT group.
SNP rs # | Gene | SNP rs # | Gene |
---|---|---|---|
rs3787620 | APP | rs1056680 | MB |
rs373521 | APP | rs9257936 | MOG |
rs2829973 | APP | rs7660424 | MRFAP1 |
rs1783016 | APP | rs3802577 | PHYH |
rs440666 | APP | rs1414396 | PHYH |
rs2753267 | ATP1A2 | rs1414395 | PHYH |
rs10494336 | ATP1A2 | rs1037680 | PKM |
rs1642763 | ATP1B2 | rs2329884 | PPM1F |
rs10113 | CALM3 | rs1045288 | PSMD13 |
rs2053053 | CAMK2A | rs2271882 | RAB3A |
rs4958456 | CAMK2A | rs12294045 | SLC1A2 |
rs4958445 | CAMK2A | rs3794089 | SLC1A2 |
rs3756577 | CAMK2A | rs7102331 | SLC1A2 |
rs874083 | CAMK2A | rs3798174 | SLC22A1 |
rs3011928 | CAMTA1 | rs9457843 | SLC22A1 |
rs890736 | CPLX2 | rs1443844 | SLC22A1 |
rs17065524 | CPLX2 | rs6077693 | SNAP25 |
rs12325282 | FAHD1 | rs363043 | SNAP25 |
rs104664 | FAM118A | rs362569 | SNAP25 |
rs6874 | FAM69B | rs10514299 | TMEM161B-AS1 |
rs7026635 | FBXW2 | rs4717678 | TYW1B |
rs12735664 | GLUL | rs8078073 | YWHAE |
rs7155973 | HSP90AA1 | rs10521111 | YWHAE |
rs2251110 | LOC101928134 | rs4790082 | YWHAE |
rs2900856 | LOC441242 | rs10401135 | ZNF559 |
rs8136867 | MAPK1 |
The identified brain FCs (ROI-ROI connections) and their corresponding ROIs are visualized in Fig. 5 and Fig. 6, respectively. For the high WRAT group (Fig. 5.b), three hub ROIs (lingual gyrus, middle occipital gyrus, and inferior occipital gyrus) exhibited dominant ROI-ROI connections over the others. All three hub ROIs are occipital-related. The lingual gyrus, also known as the medial occipitotemporal gyrus, plays an important role in visual processing [29], [30], object recognition, and word processing [29]. The other two hubs, i.e., the middle and inferior occipital gyri, also play a role in object recognition [31]. As shown in Fig. 5.b, the hub ROIs also connect to several other ROIs, e.g., the cuneus and the parahippocampal gyrus. Among them, the cuneus receives visual signals and is involved in basic visual processing, and the parahippocampal gyrus is related to encoding and recognition [32]. These findings suggest that the three occipital gyri are first activated when processing visual and word signals during the WRAT test, and that several downstream processing ROIs, e.g., the parahippocampal gyrus, are then activated for further complex encoding. As a result, strong FCs exist in these ROI-ROI connections, which may lead gCAM-CCL to assign the ‘high WRAT group’ label.
For the low WRAT group (Fig. 5.a), no significant hub ROIs were identified. Instead, several previously reported task-negative regions, e.g., the temporal-parietal and cingulate gyri [33], were identified. This indicates that the low WRAT group may be weaker in activating cognition-processing ROIs, so task-negative regions are relatively more active, which may lead gCAM-CCL to make the ‘low WRAT group’ decision.
As seen in Fig. 5a-b, a relatively larger number of FCs contributed to the low WRAT group compared to the high WRAT group. Despite this, as shown in Table III, the specificity is lower than the sensitivity, which means that the accuracy of classifying the low WRAT group is lower. This suggests that the identified FCs for the high WRAT group are relatively more discriminative and that the low WRAT group may contain more noisy FCs.
Gene enrichment analysis was conducted on the identified SNPs (Tables V-VI) using the ConsensusPathDB-human (CPDB) database, and the enriched pathways are listed in Tables VII-VIII. Several neurotransmission-related pathways, e.g., regulation of neurotransmitter levels and synaptic signaling, are enriched among the genes identified for the high WRAT group. This suggests that the high WRAT group may have stronger neuronal signaling ability. Stronger neuronal signaling may benefit the daily training and development of ROI-ROI connections, and may therefore contribute to stronger cognitive ability. For the low WRAT group, several brain development and neuron growth related pathways, e.g., midbrain development and growth cone, are enriched, suggesting that the low WRAT group may have problems in brain/neuron development. This may further affect the ROI-ROI connections, leading to weaker cognitive ability. Therefore, genetic differences may lead to differences in ROI-ROI connections, especially those involving visual processing and information-encoding ROIs, and may further contribute to the differences in brain FC patterns between the two cognition groups. As a result, subjects with both neurotransmission-related genetic biomarkers and image-processing-related brain FCs may lead the classifier to more likely make the ‘high cognition’ decision. In contrast, subjects with both neuron-underdevelopment-related biomarkers and weaker image-processing-related FC patterns may lead to a higher score on the ‘low cognition’ class.
TABLE VII: Pathways enriched among the genes identified for the low WRAT group (Set size: number of genes in the pathway; Contained: number of identified genes contained in the pathway).
Pathway Name | Pathway Source | Set size | Contained | p-value | q-value |
---|---|---|---|---|---|
Eukaryotic Translation Elongation | Reactome | 106 | 5 | 1.18E-06 | 1.65E-05 |
Peptide chain elongation | Reactome | 101 | 4 | 3.12E-05 | 2.18E-04 |
Calcium Regulation in the Cardiac Cell | Wikipathways | 149 | 4 | 1.48E-04 | 6.89E-04 |
Translation | Reactome | 310 | 5 | 2.09E-04 | 7.33E-04 |
Metabolism of proteins | Reactome | 2008 | 11 | 3.96E-04 | 1.11E-03 |
Midbrain development | Gene Ontology | 94 | 4 | 1.22E-05 | 1.53E-03 |
Site of polarized growth | Gene Ontology | 167 | 4 | 1.25E-04 | 2.19E-03 |
Growth cone | Gene Ontology | 165 | 4 | 1.20E-04 | 4.07E-03 |
Cellular catabolic process | Gene Ontology | 2260 | 12 | 7.22E-05 | 4.55E-03 |
Pathways in cancer - Homo sapiens (human) | KEGG | 526 | 5 | 2.39E-03 | 5.58E-03 |
Metabolism of amino acids and derivatives | Reactome | 342 | 4 | 3.20E-03 | 6.40E-03 |
TABLE VIII: Pathways enriched among the genes identified for the high WRAT group (Set size: number of genes in the pathway; Contained: number of identified genes contained in the pathway).
Pathway Name | Pathway Source | Set size | Contained | p-value | q-value |
---|---|---|---|---|---|
Regulation of neurotransmitter levels | Gene Ontology | 335 | 9 | 6.77E-10 | 1.04E-07 |
Transmission across Chemical Synapses | Reactome | 224 | 7 | 7.75E-08 | 2.40E-06 |
Synaptic signaling | Gene Ontology | 711 | 10 | 3.17E-08 | 2.43E-06 |
Insulin secretion - Homo sapiens (human) | KEGG | 85 | 5 | 3.26E-07 | 5.06E-06 |
Organelle localization by membrane tethering | Gene Ontology | 170 | 6 | 1.34E-07 | 6.84E-06 |
Regulation of synaptic plasticity | Gene Ontology | 179 | 6 | 1.89E-07 | 7.21E-06 |
Exocytosis | Gene Ontology | 909 | 10 | 3.06E-07 | 9.36E-06 |
Membrane docking | Gene Ontology | 179 | 6 | 1.82E-07 | 1.28E-05 |
Vesicle docking involved in exocytosis | Gene Ontology | 45 | 4 | 5.11E-07 | 1.30E-05 |
Plasma membrane bounded cell projection part | Gene Ontology | 1452 | 12 | 3.05E-07 | 1.34E-05 |
Cell projection part | Gene Ontology | 1452 | 12 | 3.05E-07 | 1.37E-05 |
Neurotransmitter release cycle | Reactome | 51 | 4 | 1.76E-06 | 1.37E-05 |
Synaptic Vesicle Pathway | Wikipathways | 51 | 4 | 1.76E-06 | 1.37E-05 |
Neuronal System | Reactome | 368 | 7 | 2.25E-06 | 1.39E-05 |
Secretion by cell | Gene Ontology | 1493 | 12 | 4.13E-07 | 1.45E-05 |
Adrenergic signaling in cardiomyocytes - Homo sapiens (human) | KEGG | 144 | 5 | 4.47E-06 | 2.31E-05 |
Gastric acid secretion - Homo sapiens (human) | KEGG | 75 | 4 | 8.34E-06 | 3.70E-05 |
Neuron part | Gene Ontology | 1713 | 12 | 1.83E-06 | 4.09E-05 |
Plasma membrane bounded cell projection | Gene Ontology | 2098 | 13 | 2.22E-06 | 4.87E-05 |
V. Conclusion
In this work, we proposed an interpretable multimodal deep learning based fusion model, namely gCAM-CCL, which can perform both classification and result interpretation. The gCAM-CCL model generates activation maps to display the pixel-wise contributions of the input images and genetic vectors. Specifically, it calculates the gradients of the class score with respect to each feature map, applies global average pooling to the gradients to obtain the weights, and then combines the feature maps accordingly. Moreover, the activation maps are class-specific, which further facilitates class-difference analysis with the potential for discovering biological mechanisms.
The proposed model was applied to an imaging-genetic study to classify low/high WRAT groups. Experimental results demonstrate gCAM-CCL’s superior performance in both classification and biological mechanism analysis. Based on the generated activation maps, a number of significant brain FCs and SNPs were identified. Among the significant FCs (ROI-ROI connections), three visual processing ROIs exhibited dominant ROI-ROI connections over the others. In addition, several signal-encoding ROIs, e.g., the parahippocampal gyrus, showed connections to the three hub ROIs. These findings suggest that during task-fMRI scans, object recognition related ROIs are first activated and then downstream ROIs become involved in signal encoding. The results also suggest that the high cognitive group may have higher neurotransmitter signaling levels, while the low cognitive group may have problems in brain/neuron development, due to genetic-level differences. In summary, gCAM-CCL demonstrated higher classification accuracy than several competitive models, in addition to being able to identify significant biomarkers. Beyond the proposed imaging-genetic study, the model is generic and can find widespread applications in multimodal data integration.
TABLE IX: Full names and abbreviations of the ROIs referred to in the figures.
Inferior Parietal Lobule (Inf_Pari) | Angular Gyrus (Angular) |
Inferior Occipital Gyrus (Inf_Occi) | Fusiform Gyrus (fusiform) |
Inferior Frontal Gyrus (Inf_Fron) | Cingulate Gyrus (Cingu) |
Middle Occipital Gyrus (Mid_Occi) | Sub-Gyral (SubGyral) |
Middle Frontal Gyrus (Mid_Fron) | Paracentral Lobule (ParaCetr) |
Parahippocampa Gyrus (Parahippo) | Postcentral Gyrus (PostCetr) |
Middle Temporal Gyrus (Mid_Temp) | Precuneus (Precun) |
Superior Parietal Lobule (Sup_Pari) | Lingual gyrus (Lingual) |
Acknowledgments
This work was partially supported by the NIH (R01 GM109068, R01 MH104680, R01 MH107354, P20 GM103472, R01 EB020407, R01 EB006841, R01 MH121101, R01 DA047828) and NSF (#1539067).
Contributor Information
Wenxing Hu, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Xianghe Meng, Center of System Biology, Data Information and Reproductive Health, School of Basic Medical Science, Central South University, Changsha, Hunan, 410008, China.
Yuntong Bai, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Aiying Zhang, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Gang Qu, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Biao Cai, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Gemeng Zhang, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
Tony W. Wilson, Institute for Human Neuroscience, Boys Town National Research Hospital, Boys Town, NE 68101.
Julia M. Stephen, Mind Research Network, Albuquerque, NM 87106.
Vince D. Calhoun, Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA 30030.
Yu-Ping Wang, Biomedical Engineering Department, Tulane University, New Orleans, LA 70118.
References
[1] Sui J. et al., "Neuroimaging-based individualized prediction of cognition and behavior for mental disorders and health: Methods and promises," Biological Psychiatry, 2020.
[2] Hotelling H., "Relations between two sets of variates," Biometrika, vol. 28, pp. 321–377, 1936.
[3] Andrew G. et al., "Deep canonical correlation analysis," in International Conference on Machine Learning, 2013, pp. 1247–1255.
[4] Wang W. et al., "On deep multi-view representation learning," in International Conference on Machine Learning, 2015, pp. 1083–1092.
[5] Simonyan K. et al., "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2014.
[6] Zeiler M. D. and Fergus R., "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[7] Fong R. C. and Vedaldi A., "Interpretable explanations of black boxes by meaningful perturbation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3429–3437.
[8] Dabkowski P. and Gal Y., "Real time image saliency for black box classifiers," in Advances in Neural Information Processing Systems, 2017, pp. 6967–6976.
[9] Yuan H. et al., "Interpreting deep models for text analysis via optimization and regularization methods," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5717–5724.
[10] Fong R. et al., "Understanding deep networks via extremal perturbations and smooth masks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2950–2958.
[11] Hu W. et al., "Deep collaborative learning with application to multimodal brain development study," IEEE Transactions on Biomedical Engineering, 2019.
[12] Hu W., "Adaptive sparse multiple canonical correlation analysis with application to imaging (epi)genomics study of schizophrenia," IEEE Transactions on Biomedical Engineering, vol. 65, no. 2, pp. 390–399, 2017.
[13] Lin D. et al., "Correspondence between fMRI and SNP data by group sparse canonical correlation analysis," Medical Image Analysis, vol. 18, no. 6, pp. 891–902, 2014.
[14] Hu W. et al., "Integration of SNPs-fMRI-methylation data with sparse multi-CCA for schizophrenia study," in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2016, pp. 3310–3313.
[15] Zhou B. et al., "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[16] Selvaraju R. R. et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[17] Springenberg J. T. et al., "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[18] Calhoun V. D. and Adali T., "Time-varying brain connectivity in fMRI data: whole-brain data-driven approaches for capturing and characterizing dynamic states," IEEE Signal Processing Magazine, vol. 33, no. 3, pp. 52–66, 2016.
[19] Satterthwaite T. D. et al., "Neuroimaging of the Philadelphia Neurodevelopmental Cohort," NeuroImage, vol. 86, pp. 544–553, 2014.
[20] Friston K. J. et al., "Characterizing dynamic brain responses with fMRI: a multivariate approach," NeuroImage, vol. 2, no. 2, pp. 166–172, 1995.
[21] Rashid B. et al., "Dynamic connectivity states estimated from resting fMRI identify differences among schizophrenia, bipolar disorder, and healthy control subjects," Frontiers in Human Neuroscience, vol. 8, p. 897, 2014.
[22] Power J. D. et al., "Functional network organization of the human brain," Neuron, vol. 72, no. 4, pp. 665–678, 2011.
[23] Purcell S. et al., "PLINK: a tool set for whole-genome association and population-based linkage analyses," The American Journal of Human Genetics, vol. 81, no. 3, pp. 559–575, 2007.
[24] Das S. et al., "Next-generation genotype imputation service and methods," Nature Genetics, vol. 48, no. 10, p. 1284, 2016.
[25] Lonsdale J. et al., "The Genotype-Tissue Expression (GTEx) project," Nature Genetics, vol. 45, no. 6, p. 580, 2013.
[26] Wilkinson G. S. and Robertson G. J., Wide Range Achievement Test. Psychological Assessment Resources, 2006.
[27] Singh R. et al., "DeepChrome: deep-learning for predicting gene expression from histone modifications," Bioinformatics, vol. 32, no. 17, pp. i639–i648, 2016.
[28] Zhou J. et al., "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk," Nature Genetics, vol. 50, no. 8, p. 1171, 2018.
[29] Mechelli A. et al., "Differential effects of word length and visual contrast in the fusiform and lingual gyri during," Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 267, no. 1455, pp. 1909–1913, 2000.
[30] Mangun G. R. et al., "ERP and fMRI measures of visual spatial selective attention," Human Brain Mapping, vol. 6, no. 5–6, pp. 383–389, 1998.
[31] Grill-Spector K. et al., "The lateral occipital complex and its role in object recognition," Vision Research, vol. 41, no. 10–11, pp. 1409–1422, 2001.
[32] Mégevand P. et al., "Seeing scenes: topographic visual hallucinations evoked by direct electrical stimulation of the parahippocampal place area," Journal of Neuroscience, vol. 34, no. 16, pp. 5399–5405, 2014.
[33] Hamilton J. P. et al., "Default-mode and task-positive network activity in major depressive disorder: implications for adaptive and maladaptive rumination," Biological Psychiatry, vol. 70, no. 4, pp. 327–333, 2011.