Abstract
Integrating and analyzing multiple omics data sets, including genomics, proteomics and radiomics, can significantly advance researchers’ comprehensive understanding of Alzheimer’s disease (AD). However, current methodologies primarily focus on the main effects of genetic variations and proteins, overlooking non-additive effects such as genotype–protein interaction (GPI) and correlation patterns in brain imaging genetics studies. Importantly, these non-additive effects could contribute to intermediate imaging phenotypes and ultimately to disease occurrence. In general, the interaction between genetic variations and proteins and the correlation between them are two distinct biological effects, so disentangling the two for heritable imaging phenotypes is of great interest. Unfortunately, this issue remains largely unexplored. To fill this gap, we propose a Multi-Task Genotype–Protein Interaction and Correlation disentangling method (MT-GPIC) to identify GPI and extract correlation patterns between genetic variations and proteins. To ensure stability and interpretability, we use novel and off-the-shelf penalties to identify meaningful genetic risk factors, as well as exploit the interconnectedness of different brain regions. Additionally, since computing GPI poses a high computational burden, we develop a fast optimization strategy for solving MT-GPIC, which is guaranteed to converge. Experimental results on the Alzheimer’s Disease Neuroimaging Initiative data set show that MT-GPIC achieves higher correlation coefficients and classification accuracy than state-of-the-art methods. Moreover, our approach could effectively identify interpretable phenotype-related GPI and correlation patterns in high-dimensional omics data sets. These findings not only enhance diagnostic accuracy but also contribute valuable insights into the underlying pathogenic mechanisms of AD.
Keywords: multi-omics brain imaging genetics, genotype–protein interaction and correlation, biomarker identification
INTRODUCTION
Alzheimer’s disease (AD) is a hereditary and intricate neurodegenerative disorder [1, 2]. Growing evidence indicates that AD may result from abnormal interactions between disease-related protein changes and genetic variations. This is because many important biological processes such as RNA transport and translation are controlled by interactions of these two kinds of biomacromolecules [3, 4]. Therefore, investigating the underlying relationship between proteins and genetic variations is essential for comprehending the mechanisms of AD and facilitating diagnostics and therapeutics [4–6].
Previous studies have demonstrated that genotype–protein interaction (GPI) and correlation effects are important factors contributing to complex phenotype traits, such as AD [7–12]. From a biological standpoint, there are two distinct types of relationships between protein and genetic variations. The first kind is GPI, which denotes effect modification between genes and proteins, as proteins bind to specific DNA sequences and contribute to imaging phenotypes. The second kind is genotype–protein correlation (GPC), which denotes that disease-related imaging phenotypes might be associated with genes and proteins simultaneously because genes and proteins reflect individual information from different perspectives, which could result in a relatively high correlation between them. GPI and GPC may implicate distinct biological mechanisms, and, finally, lead to imaging phenotypes with substantial differences [4, 13]. Consequently, the identification of GPI and correlation associated with heritable phenotypes is an urgent need.
Brain imaging genetics, a burgeoning field in neuroscience, has garnered considerable attention for its enhanced capacity to unveil the relationships between single nucleotide polymorphisms (SNPs) and intermediate imaging phenotypes, i.e. quantitative traits (QTs) [2]. In the last decade, several analytical methods have emerged for studying the cumulative impacts of SNPs on intermediate imaging phenotypes, including univariate, multivariate and bi-multivariate approaches [14–18]. Early methods were predominantly univariate statistical techniques [14, 19, 20], and multivariate approaches were used to explore the relationship between multiple SNPs and a few imaging QTs. In contrast, bi-multivariate methods investigated multiple SNPs and multiple imaging QTs, and could thus identify significant SNPs and QTs simultaneously [15–18]. Despite these advancements, existing methods focus only on the main effects of genetic variations or proteins on imaging phenotypes, which may contribute to missing heritability. Few studies have sought to discover non-additive interaction effects such as GPI, so the complicated mechanisms underlying the heritability of brain disorders remain insufficiently revealed [14, 19–21].
To address this limitation, some non-linear learning methods, including kernel methods and graph neural networks [22, 23], have been proposed. Although these methods have demonstrated success in modeling feature interactions or correlations, they are limited in pinpointing which feature interactions and correlations are relevant, and thus cannot provide interpretable biomarkers. This prompts us to develop methods that can identify GPI and correlation more efficiently and practically. Unfortunately, owing to the lack of an effective disentangling method, the detection of GPI and correlation remains under-studied and poorly understood [24].
With the above observations, we aim to extract SNPs and proteins that contribute to brain imaging phenotypes while displaying a significant level of interaction and cross-correlation. To tackle this challenging problem, we propose Multi-Task Genotype–Protein Interaction and Correlation (MT-GPIC) to model GPI effects and correlation effects simultaneously. Specifically, our method leverages multi-task learning (MTL) techniques [25], incorporating an interpretable feature interaction module and a correlation regularization module. Firstly, the feature interaction module facilitates the exploration of non-linear interactions between genetic variations and proteins, thereby capturing important and interpretable biomarkers, while the correlation regularization module extracts meaningful genotype–protein patterns, ensuring good interpretability and stability. Secondly, to identify biologically meaningful genetic risk factors, we combine three sparsity-inducing norms that act group-wise, across tasks and element-wise, respectively. A network-based penalty is also employed to identify interesting brain sub-networks [26]. Thirdly, when more genetic variations are considered, the number of candidate GPI and correlation pairs grows rapidly. Thus, we derive a divide-and-conquer strategy and prove its convergence. In detail, we take into consideration the block structure of genetic variations and use a parallel approach to solve the high-dimensional problem, making our approach applicable in practice. Fourthly, experimental results demonstrate that MT-GPIC outperforms state-of-the-art methods [27–29]. Furthermore, our method offers interpretable insights into GPI and correlation, contributing to a deeper understanding of AD. Hence, our methodology holds great promise as an alternative approach for advancing imaging genetic studies.
BACKGROUND AND METHODS
This article follows the convention of representing vectors with lowercase letters and matrices with uppercase letters.
Background
In this subsection, we present a concise overview of the most relevant methods.
Sparse multiple canonical correlation analysis and adaptive SMCCA
The sparse multiple canonical correlation analysis (SMCCA) model enables the detection of associations among diverse omics data sets, such as SNPs, proteomic markers and imaging QTs [18, 28]. For ease of presentation, the SNP, protein and imaging data are each arranged as a matrix with one row per subject and one column per SNP, proteomic marker or imaging QT, respectively. Then SMCCA is defined as follows:
![]() |
(1) |
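The regularization terms in Equation (1) assist in identifying a small subset of biomarkers that are crucial for understanding the disease. However, an inherent limitation of SMCCA is its independence assumption, i.e. the features within each data set are treated as uncorrelated, so that each within-set covariance matrix is replaced by the identity matrix. This assumption ignores the relationships among genetic variations and thus leads to performance degradation.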
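As a rough illustration of the quantity that SMCCA maximizes, the sketch below sums the pairwise inner products between the projected data sets, with each weight vector scaled to unit length as the independence assumption implies. The function names `unit_l2` and `smcca_objective` and the toy data are assumptions made for illustration; the sparsity penalties are omitted.

```python
import numpy as np

def unit_l2(w):
    """Scale a weight vector to unit l2 norm (the constraint left once X^T X is treated as I)."""
    return w / np.linalg.norm(w)

def smcca_objective(blocks, weights):
    """Sum, over all pairs of data sets, of the inner product between their projections."""
    scores = [B @ w for B, w in zip(blocks, weights)]
    total = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            total += float(scores[i] @ scores[j])
    return total

# Toy data (hypothetical): 40 subjects and three zero-centered data sets.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((40, d)) for d in (6, 4, 5)]
blocks = [B - B.mean(axis=0) for B in blocks]
weights = [unit_l2(rng.standard_normal(B.shape[1])) for B in blocks]
value = smcca_objective(blocks, weights)
```

An actual SMCCA solver would alternate over the weight vectors to increase `value` under the sparsity constraints; only the objective is shown here.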
Adaptive SMCCA incorporates an additional tuning parameter to better extract the relationship among multi-omics data. Thus, we introduce the objective function of the Adaptive SMCCA:
![]() |
(2) |
Adaptive SMCCA uses this additional tuning parameter to dynamically adjust the pairwise covariances during each iteration [29]. Unfortunately, Adaptive SMCCA still relies on the independence assumption, which may limit its performance.
RelPMDCCA
RelPMDCCA, the most advanced variant of SMCCA, has been developed to simultaneously analyze multi-omics data. RelPMDCCA can be defined as follows [27, 30]:
![]() |
(3) |
RelPMDCCA improves on SMCCA in two significant ways. Firstly, RelPMDCCA eliminates the independence assumption. Secondly, it utilizes the SCAD penalty, which is less biased for large coefficients than SMCCA’s $\ell_1$-norm. Nevertheless, the above SMCCA methods cannot identify and distinguish between intricate GPI effects and correlation relationships within omics data sets. Therefore, these methods are suboptimal for comprehending the relationships among multiple omics data sets.
Multi-Task Genotype–Protein Interaction
Model
As discussed in the previous sections, most methods focus only on the main effects of genes and proteins, while neglecting potential links between proteins and genetic variations [31–34]. In this work, the underlying motivation is to extract SNPs and proteins that contribute to heritable brain imaging phenotypes while displaying significant interaction and correlation effects. We propose MT-GPIC to jointly identify GPI and reveal GPC. The SNP, protein and imaging phenotype data are each represented as a matrix, and all three matrices are zero-centered. On this basis, we define MT-GPIC as follows:
![]() |
(4) |
where
is the canonical weight carrying the main effects of SNPs,
is that carrying the main effects of proteomic markers, and
is the pairwise interaction effects between SNPs and proteomic markers.
indicates the canonical weight associated with imaging phenotypes.
,
and
control the sparsity of the main effects.
is a nonnegative tuning parameter to balance the influence of GPC. Of note, if we replace
with 1, the GPI and correlation task are equally important. We set
to identify interpretable GPI.
Specifically, the first term of MT-GPIC model aims to capture genotype–protein main and interaction effects on the brain imaging phenotypes, as mounting evidence suggests their involvement in AD pathology. The second term focuses on identifying meaningful GPC, considering that specific genes first encode proteins that contribute to heritable imaging phenotypes. Additionally, the remaining terms in the model serve to select important and interpretable multi-omics biomarkers, as well as their interactions and correlations. Therefore, our objective is to extract features that not only explain the variance of imaging QTs but also exhibit interactions and correlation patterns between SNP and protein features. To simultaneously figure out the joint and independent influence of SNPs, we define the regularization terms for SNPs as follows:
![]() |
(5) |
where the
-norm is defined as
![]() |
(6) |
-norm here emphasizes the group-wise effect of gene or linkage disequilibrium (LD) orderly.
-norm selects consistent features across GPI and correlation tasks. In addition, some SNPs may not be associated with all tasks, and thus we use
-norm to prompt the element-wise sparsity. We also use
-norm and
-norm to select important and interpretable protein, i.e.
, where
,
are nonnegative tuning parameters. Similarly, we also introduce the
-penalty to capture the interesting brain sub-networks. In particular,
, where
,
are also tuning parameters. And we defined
-norm as follows:
![]() |
(7) |
where
is the edge of brain networks and
are their weights. This setup performs feature selection for imaging phenotypes and thus gives MT-GPIC diverse and interpretable capabilities for biomarker identification.
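To make the roles of the penalties concrete, the following sketch evaluates four common sparsity-inducing penalties on a small weight matrix whose rows are features and whose columns are tasks. The exact forms are assumptions based on the standard definitions of element-wise, row-wise, group-wise and edge-based penalties, and the function names are hypothetical; they are not taken from Equations (5) to (7).

```python
import numpy as np

def l1_norm(U):
    """Element-wise sparsity: sum of absolute values."""
    return float(np.abs(U).sum())

def l21_norm(U):
    """Row-wise sparsity: a feature (row) is kept or dropped for all tasks together."""
    return float(np.linalg.norm(U, axis=1).sum())

def group_norm(U, groups):
    """Group-wise sparsity: sum over groups (e.g. LD blocks) of the Frobenius norm of their rows."""
    return float(sum(np.linalg.norm(U[idx, :]) for idx in groups))

def edge_penalty(w, edges, edge_weights):
    """Network penalty: weighted sum over edges (i, j) of sqrt(w_i^2 + w_j^2)."""
    return float(sum(c * np.hypot(w[i], w[j]) for (i, j), c in zip(edges, edge_weights)))

U = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])   # 3 features, 2 tasks
w = np.array([3.0, 4.0, 0.0])                        # weights of 3 brain regions
print(l1_norm(U), l21_norm(U), group_norm(U, [[0, 1], [2]]))
print(edge_penalty(w, [(0, 1), (1, 2)], [1.0, 0.5]))
```

The row-wise and group-wise penalties shrink whole rows or blocks to zero together, whereas the element-wise penalty can zero a single entry for one task only.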
The optimization algorithm
To address MT-GPIC, we introduce its Lagrangian form, which is defined as follows:
![]() |
(8) |
Fortunately, this objective is multi-convex and smooth, and thus we first fix
,
, and
to solve
. Specifically, we define sub-objectives with respect to
as follows:
![]() |
(9) |
By differentiating Equation (9) for
and equating it to zero, we obtain the following equation:
![]() |
(10) |
We define
![]() |
(11) |
where
,
and
are all diagonal matrices. In particular, the
,
and
are sub-gradients of
,
and
.
is a diagonal matrix whose diagonal entries are
;
is a diagonal matrix whose diagonal entries are
; and
is also a diagonal matrix with diagonal entries being
. In addition,
is a non-negative tuning parameter. Therefore, the following updating rule can be used to obtain
:
![]() |
(12) |
To fulfill the equality constraints, we use the following scaling step
:
![]() |
(13) |
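The update-then-scale pattern described above can be sketched as follows. This is a simplified, single-penalty illustration under stated assumptions: only an $\ell_1$-type penalty is used, its diagonal reweighting matrix has entries $1/(2|u_i|)$ (smoothed by a small `eps`), and `target` stands in for the fixed right-hand side formed from the other variables. The function name `reweighted_update` is hypothetical.

```python
import numpy as np

def reweighted_update(X, target, u_prev, lam, eps=1e-8):
    """One reweighted least-squares update followed by rescaling so that ||X u||_2 = 1."""
    # Diagonal entries derived from the sub-gradient of the l1-norm at the previous iterate.
    d = 1.0 / (2.0 * np.sqrt(u_prev ** 2 + eps))
    A = X.T @ X + lam * np.diag(d)
    u = np.linalg.solve(A, X.T @ target)
    # Scaling step: enforce the equality constraint on the canonical variable.
    return u / np.linalg.norm(X @ u)

# Toy data (hypothetical): 30 subjects, 5 features.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
X -= X.mean(axis=0)
target = rng.standard_normal(30)
u = np.ones(5)
for _ in range(20):
    u = reweighted_update(X, target, u, lam=0.5)
```

Because the reweighting matrix is diagonal, each update only requires solving one linear system, which is what makes the closed-form rule attractive.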
Applying similar procedures, we first simplify the objective for
as follows:
![]() |
(14) |
and then we can obtain the updating formula for
:
![]() |
(15) |
is a diagonal matrix whose diagonal entries are
, where
indicates a mapping operation which transforms a matrix into a vector by stacking its columns.
is defined as the matrix loading GPI. Thus,
(
is element-wise product), and
captures the pairwise interactions between the
th genetic variation and
th protein.
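The pairwise interaction features can be formed explicitly as element-wise products of SNP and protein columns, as in the sketch below. The function name `interaction_features` is hypothetical, and the sketch flattens the interaction weight matrix row by row, whereas the text stacks columns; the two differ only in the ordering of the entries.

```python
import numpy as np

def interaction_features(X, Z):
    """Return an n x (p*q) matrix whose column i*q + j equals X[:, i] * Z[:, j]."""
    n, p = X.shape
    q = Z.shape[1]
    return (X[:, :, None] * Z[:, None, :]).reshape(n, p * q)

# Toy data (hypothetical): 3 subjects, 2 SNPs, 3 proteins.
X = np.arange(6, dtype=float).reshape(3, 2)
Z = np.arange(9, dtype=float).reshape(3, 3)
G = interaction_features(X, Z)

# With a p x q interaction weight matrix W, the interaction score of each
# subject is x^T W z, which equals the product of G with the flattened W.
W = np.random.default_rng(0).standard_normal((2, 3))
scores = G @ W.ravel()
```

The number of columns of `G` is the product of the numbers of SNPs and proteins, which is why the interaction term dominates the computational cost.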
Similarly, the solution to solve
can be simplified as follows:
![]() |
(16) |
Then we obtain the closed-form solution to
,
![]() |
(17) |
and
are tuning parameters, and they can be fine-tuned by cross-validation.
and
are sub-gradients of
and
. Then we define
![]() |
(18) |
To meet the equality constraints, the solution for
is achieved by scaling each
, as follows:
![]() |
(19) |
The objective to solve
can be simplified as follows:
![]() |
(20) |
and then we obtain the closed-form update rule as follows:
![]() |
(21) |
where
is a diagonal matrix whose diagonal entries are
.
is also a diagonal matrix whose diagonal entry is
.
The following scaling step is applied to satisfy the equality constraint:
![]() |
(22) |
Extension to chromosome-wide analysis
MT-GPIC is difficult to directly apply to genome-wide analysis since it becomes exceptionally burdensome due to the computationally intensive
and
. To handle this issue, we here introduce a heuristic accelerating method. Although there are a huge number of SNPs in the human genome, we do not need to calculate the interaction term directly. Instead, we can compute it within each chromosome and then combine the results of all chromosomes. Specifically, we divide the high-dimensional SNPs and GPI into
non-intersected subsets, i.e.
,
, respectively.
can be user-defined or tuned from the data. Consequently, the MT-GPIC objective can be reformulated as follows:
![]() |
(23) |
Here,
represents the matrix concatenation operator. Now SNPs and the interaction terms are decoupled, facilitating parallel processing. Specifically, we partition the large genotype and GPI matrices into smaller ones, each aligning with the dimensions of LD blocks. This strategy leverages the inherent block structure of SNPs, thus preserving model performance. Therefore, based on the divide-and-conquer strategy, we can obtain a closed-form solution for
, i.e.
as follows:
![]() |
(24) |
In the above equation,
is defined as the
th diagonal block of
.
can also be obtained by the same procedure, i.e.
:
![]() |
(25) |
where
is the
th diagonal block of
.
After calculation, we concatenate
and
to obtain the final solution. Thus the computational burden is greatly reduced without sacrificing model performance. Finally, we summarize the step-by-step procedure of the MT-GPIC algorithm in Algorithm 1.
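The divide-and-conquer idea can be sketched as solving one small linear system per SNP block and concatenating the pieces. In this minimal sketch, a ridge term `lam * I` stands in for the diagonal reweighting matrices, `target` stands in for the fixed right-hand side, and the function name `solve_blockwise` is hypothetical. The blocks are processed in a loop here, but each solve is independent and could be dispatched in parallel.

```python
import numpy as np

def solve_blockwise(X, target, blocks, lam):
    """Solve one regularized system per block of columns and concatenate the solutions."""
    pieces = []
    for idx in blocks:
        Xk = X[:, idx]                                   # columns of one block (e.g. one chromosome)
        A_k = Xk.T @ Xk + lam * np.eye(len(idx))         # diagonal block of the full system
        pieces.append(np.linalg.solve(A_k, Xk.T @ target))
    return np.concatenate(pieces)

# Toy data (hypothetical): 25 subjects, 10 SNPs split into three contiguous blocks.
rng = np.random.default_rng(2)
X = rng.standard_normal((25, 10))
target = rng.standard_normal(25)
blocks = [np.arange(0, 3), np.arange(3, 7), np.arange(7, 10)]
u_blockwise = solve_blockwise(X, target, blocks, lam=1.0)
```

The result coincides with solving the full system after its off-diagonal blocks are set to zero, so the approximation is accurate to the extent that SNPs in different blocks are weakly correlated.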
Convergence analysis
Theorem 1.1.
Algorithm 1 can maintain convergence during the iterative computation.
Proof.
The proof can be divided into two parts. The first part demonstrates the convergence for
, while the second part focuses on the convergence for
,
, and
.
Part 1: we first denote the canonical weights as
. For ease of representation, we define the MT-GPIC with respect to
as follows:
(26) We also define an auxiliary function as
(27) which can be equivalently rewritten as
(28) where
,
and
are defined in Equation (10). We can easily verify
(29) Further,
is a convex quadratic function that adheres to the following conditions:
(30) Combining the above formula, we can derive that
(31) The scaling steps will not break the above conclusions. Consequently, the first part of the proof is completed.
Part 2: similarly, it is easy to draw the same conclusion for other variables, i.e.
,
, and
. For simplicity, letting
denote the original objective of our method, we can arrive at
(32) According to the derivation above, the objective of MT-GPIC is bounded below by 0. Since the objective value is non-increasing across iterations and bounded below, the algorithm converges, which completes the proof.
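The monotone behavior established by the auxiliary-function argument can be checked numerically on a simplified problem. The sketch below applies the same kind of reweighted update to an $\ell_1$-penalized least-squares objective with a smoothed absolute value and records the objective after every iteration. The problem, the smoothing constant `eps` and the function names are assumptions for illustration; the scaling step of MT-GPIC is not included.

```python
import numpy as np

def smoothed_objective(X, y, u, lam, eps):
    """Least-squares loss plus a smoothed l1 penalty, sum_i sqrt(u_i^2 + eps)."""
    return 0.5 * np.sum((y - X @ u) ** 2) + lam * np.sum(np.sqrt(u ** 2 + eps))

def mm_iterations(X, y, lam=1.0, eps=1e-6, n_iter=30):
    """Minimize a quadratic upper bound of the objective at each step."""
    u = np.ones(X.shape[1])
    history = [smoothed_objective(X, y, u, lam, eps)]
    for _ in range(n_iter):
        d = 1.0 / np.sqrt(u ** 2 + eps)                  # weights of the quadratic upper bound
        u = np.linalg.solve(X.T @ X + lam * np.diag(d), X.T @ y)
        history.append(smoothed_objective(X, y, u, lam, eps))
    return u, history

# Toy data (hypothetical): 40 subjects, 8 features, 2 of which are informative.
rng = np.random.default_rng(3)
X = rng.standard_normal((40, 8))
y = X[:, :2] @ np.array([2.0, -1.5]) + 0.1 * rng.standard_normal(40)
u_hat, history = mm_iterations(X, y)
```

Each update minimizes an upper bound that touches the objective at the current iterate, so `history` cannot increase, and since it is non-negative it must converge.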
RESULTS
Experimental setup
In our work, we benchmark MT-GPIC against three relevant methods: SMCCA, RelPMDCCA and Adaptive SMCCA (AdaSMCCA). They are considered state-of-the-art approaches for analyzing associations among multi-omics brain imaging data [27–29]. We tuned the parameters over a candidate set via 5-fold cross-validation, and chose the values yielding the highest mean testing canonical correlation coefficient (CCC). Additionally, to ensure fair and unbiased experimental results, we ran all methods on the same software platform (MATLAB 2020b) and employed the same data partition. Performance is evaluated using two key criteria, i.e. feature selection and CCC. Of note, the CCC serves as a crucial performance indicator for the model and can be computed as follows:
![]() |
(33) |
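Assuming the usual definition of the CCC as the Pearson correlation between two canonical variables, it can be computed as in the sketch below. The function name `ccc` and the toy data are hypothetical.

```python
import numpy as np

def ccc(A, a, B, b):
    """Pearson correlation between the canonical variables A @ a and B @ b."""
    s, t = A @ a, B @ b
    s = s - s.mean()
    t = t - t.mean()
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))

# Toy data (hypothetical): 20 subjects, two data sets with 4 and 3 features.
rng = np.random.default_rng(5)
X, Y = rng.standard_normal((20, 4)), rng.standard_normal((20, 3))
u, v = rng.standard_normal(4), rng.standard_normal(3)
value = ccc(X, u, Y, v)
```

In a cross-validation setting, the weights would be estimated on the training folds and the CCC evaluated on both the training and the held-out fold.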
Real neuroimaging, proteomic and genetic study
Data Source: the data applied in our work were downloaded from the ADNI database. For up-to-date information, please see www.adni-info.org.
The study included 244 participants of non-Hispanic Caucasian descent, comprising 42 healthy controls (HCs), 137 participants with mild cognitive impairment (MCIs) and 65 participants with AD; their details are presented in Table 1. Structural magnetic resonance imaging scans were subjected to voxel-based morphometry using the Statistical Parametric Mapping tool. To mitigate confounding factors, such as baseline gender, age, handedness and education, the obtained data were further adjusted using regression weights estimated from HCs. Utilizing the MarsBaR-based Automated Anatomical Labeling templates, a comprehensive set of 465 brain regions of interest was extracted [35]. In addition, the proteomic analytes were assayed using the Rules Based Medicine, Inc. proteomic panel. Following quality control, a total of 146 proteomic markers were obtained. Moreover, we extracted 10 000 SNPs from the database and used the additive coding paradigm for SNPs. Our objective was to analyze multiple omics data sets, including genomics, proteomics and radiomics, as well as to explore the GPI and correlation on intermediate imaging phenotypes. Such analyses could facilitate a deeper understanding of AD and enable targeted in-depth follow-up studies.
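The covariate adjustment described above can be sketched as a regression whose weights are estimated on HCs only and then applied to all participants. This is a minimal sketch under that reading of the text; the function name `adjust_with_hc_weights`, the inclusion of an intercept and the toy data are assumptions.

```python
import numpy as np

def adjust_with_hc_weights(Y, C, is_hc):
    """Remove covariate effects from Y using regression weights fitted on HCs only."""
    C1 = np.column_stack([np.ones(len(C)), C])                 # covariates plus intercept
    beta, *_ = np.linalg.lstsq(C1[is_hc], Y[is_hc], rcond=None)
    return Y - C1 @ beta

# Toy data (hypothetical): 60 subjects, 3 covariates, 5 imaging QTs; first 20 subjects are HCs.
rng = np.random.default_rng(4)
n = 60
C = rng.standard_normal((n, 3))
Y = C @ rng.standard_normal((3, 5)) + rng.standard_normal((n, 5))
is_hc = np.arange(n) < 20
Y_adj = adjust_with_hc_weights(Y, C, is_hc)
```

Fitting on HCs alone avoids removing disease-related variation that happens to correlate with the covariates in the patient groups.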
Table 1.
Participant characteristics
| | HC | MCI | AD |
|---|---|---|---|
| Num | 42 | 137 | 65 |
| Sex (M/F, %) | 52.38/47.62 | 69.34/30.66 | 55.38/44.62 |
| Handedness (R/L, %) | 90.48/9.52 | 92.70/7.30 | 98.46/1.54 |
| Age (mean ± std) | 75.40 ± 5.80 | 74.13 ± 7.22 | 74.75 ± 7.67 |
| Education (mean ± std) | 15.88 ± 2.77 | 16.03 ± 2.98 | 15.12 ± 3.05 |
Bi-multivariate association identification
In Figure 1, the key metrics (CCCs) are displayed for all methods, covering both training and testing data sets. In this experiment, we investigate two effects: GPC and genotype–protein main and interaction effects associated with imaging phenotypes (GPI-QT). Notably, MT-GPIC utilized the main and interaction effects of genetic variation and protein to calculate the CCCs, while the benchmark methods failed to identify GPI. Thus we only showed their main effects. As expected, thanks to its incorporation of GPI and correlation with diverse regularizations, we observed that MT-GPIC obtained the highest CCC values among all competitors on training and testing sets. This indicated its superior capability in capturing the multi-way bi-multivariate associations among multiple omics data.
Figure 1.

The CCC (mean ± std.) obtained from 5-fold cross-validation for all methods. (A) The experimental results on training data sets. (B) The experimental results on testing data sets.
Identification and explanation of genetic loci
Figure 2 depicts the heatmap of the canonical weights assigned to SNPs, where the color bar provides a relative measure of feature importance. Table 2 presents the top 10 genetic biomarkers. Notably, MT-GPIC successfully identified several AD-risk loci, including the well-known APOE variants rs429358 and rs7412, as well as rs4420638, rs56131196, rs12721051, rs6857 and rs59007384 [36, 37]. To test whether the selected SNPs significantly affect AD, we conducted ANOVA to examine their main effects on AD diagnosis (i.e. the effects of gender, age, handedness and education were excluded). As expected, all $p$-values reached the significance level ($p < 0.05$). On further investigation, owing to the group-wise penalty in MT-GPIC, meaningful groups of SNPs were identified (e.g. rs4420638, rs56131196 and rs12721051, which lie in the same gene and belong to the same LD block). This demonstrated that MT-GPIC can successfully identify neighboring genetic variations [36, 37]. In contrast, the competitors reported numerous irrelevant signals that could potentially mislead subsequent analyses. Next, to better characterize the selected genetic loci, post-analyses, such as the ANOVA and the gene-set analyses, were conducted. Both analyses demonstrated the effectiveness of MT-GPIC in identifying multiple AD-related risk loci.
Figure 2.

Canonical weights (heatmap) of SNPs from 5-fold cross-validation. Each row corresponds to a specific method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MT-GPIC.
Table 2.
The top 10 loci for each method, as determined by their mean canonical weights
| SMCCA | AdaSMCCA | RelPMDCCA | MT-GPIC |
|---|---|---|---|
| rs34768260 | rs11673490 | rs73035960 | rs429358 |
| rs34244103 | rs11083724 | rs73035964 | rs7412 |
| rs12609260 | rs12611428 | rs186072321 | rs4420638 |
| rs12609521 | rs62116964 | rs73035978 | rs56131196 |
| rs12609529 | rs61013802 | rs140627212 | rs12721051 |
| rs11673490 | rs4803659 | rs117998908 | rs769449 |
| rs11083724 | rs68073292 | rs118016134 | rs7256200 |
| rs12611428 | rs8104300 | rs34768260 | rs6857 |
| rs11666403 | rs1386502 | rs34244103 | rs10414043 |
| rs62116994 | rs67773424 | rs12609260 | rs59007384 |
Follow-up analyses: gene-set analyses
To further validate the identified genetic loci, we conducted gene-set analyses (GSEA). Specifically, we employed joint-SNP gene-based analysis using the MAGMA software tool [38], where Fisher’s test was applied to obtain $p$-values showing the association strength between a gene and the diagnostic phenotype. Interestingly, the gene harboring rs429358 and rs7412 (APOE), the gene harboring rs12721051, and the gene harboring rs6857 and rs59007384 exhibited the highest levels of statistical significance, demonstrating a significant association with the diagnostic phenotype. This again indicated that MT-GPIC could identify meaningful and AD-related genetic variations. Conversely, competitors yielded numerous potentially irrelevant signals, raising concerns regarding their impact on subsequent analyses. Specifically, SMCCA failed to detect previously reported loci and did not yield loci overlapping with the MAGMA analysis findings of our method. AdaSMCCA identified nine loci, with only rs1386502 having been previously reported. RelPMDCCA identified rs73035978 [39], but failed to identify the most important loci, rs429358 and rs7412, and also lacked overlap with the MAGMA analysis results. Moreover, SMCCA, AdaSMCCA and RelPMDCCA lacked the capability for feature grouping, missing the structural information embedded in SNPs. In summary, the overall results demonstrated that MT-GPIC can identify genetic risk factors accurately and comprehensively.
Identification and explanation of the proteomic markers
Figure 3 illustrates the canonical weights assigned to the proteomic markers, where the color bar indicates the relative importance of the features. Table 3 presents the top 10 proteomic biomarkers for each method. Notably, MT-GPIC successfully identified several proteomic markers associated with AD, including ApoE, CRP, IGM, CD5L, MIG and ApoB [37, 40]. Additionally, we conducted ANOVA to examine the main effects of the top-selected proteomic markers on AD diagnosis. The results attained statistical significance ($p < 0.05$), suggesting their high relevance to AD. In contrast, SMCCA, AdaSMCCA and RelPMDCCA identified only a few AD-related proteomic markers and reported a substantial number of markers that have not been previously associated with AD, thereby hindering interpretation. Notably, all the comparison methods failed to detect crucial AD-risk proteomic markers, such as ApoE, CRP and ApoB [37, 40]. Overall, our method identified more disease-related proteomic markers and had the potential to deepen our understanding of AD.
Figure 3.

Heatmap of canonical weights for proteomic markers (mean values), which was calculated through 5-fold cross-validation. Each row represents a specific method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MT-GPIC.
Table 3.
The top 10 significant proteomic markers for each method, as determined by their mean canonical weights
| SMCCA | AdaSMCCA | RelPMDCCA | MT-GPIC |
|---|---|---|---|
| PAI-1 | PAI-1 | CgA | ApoE |
| RANTES | RANTES | VEGF | CRP |
| Thrombospondin-1 | BDNF | MPIF-1 | IGM |
| PDGF-BB | Thrombospondin-1 | Testosterone-Total | CD5L |
| BDNF | ENA-78 | PAI-1 | ApoB |
| ENA-78 | PDGF-BB | PDGF-BB | MIG |
| SCF | GRO-alpha | BDNF | ACE |
| CgA | SCF | RANTES | TIMP-1 |
| Testosterone-Total | CgA | SCF | IL-18 |
| GRO-alpha | EGF | ENA-78 | MMP-2 |
Identification and explanation of imaging phenotypes
Identifying abnormal imaging QTs affected by AD helps enhance the performance of computer-aided diagnosis. We show the top selected imaging phenotypes of MT-GPIC in Figure 4. To better visualize the results, we mapped the selected imaging markers onto the brain in Figure 5(A). In addition, the top 10 selected imaging QTs for each method are presented in Table 4. Interestingly, MT-GPIC successfully identified both the left and right hippocampus, the left superior frontal gyrus and the left superior frontal gyrus medial orbital region as important areas affected by AD [41, 42]. In practice, hippocampal atrophy and Aβ deposition in frontal areas are both important features for diagnosing AD [43]. Moreover, MT-GPIC also identified the right superior parietal gyrus and the parahippocampal gyrus, consistent with the existing literature [44]. Notably, thanks to the network-based penalty, MT-GPIC was able to identify meaningful brain sub-networks. We display the connections between the top selected imaging QTs in Figure 5(B), which indicate potentially strong associations between certain AD-related QT pairs; for example, the hippocampal and parahippocampal areas could show similar atrophy in the network [42]. In contrast, the benchmark methods failed to provide useful information and could not identify significant AD-related imaging QTs, making them suboptimal. Overall, these findings underscore the better performance of MT-GPIC in identifying AD-affected brain regions.
Figure 4.

Heatmap of canonical weights for imaging phenotypes (mean values), which was calculated by 5-fold cross-validation. Each row corresponds to a specific method: (1) SMCCA; (2) Adaptive SMCCA; (3) RelPMDCCA; (4) MT-GPIC.
Figure 5.
(A) Visualization of identified brain imaging QTs. The color represents the canonical weights of the corresponding imaging QTs. (B) Connection of the top brain regions.
Table 4.
The top 10 brain imaging QTs for each method, determined by the mean canonical weights
| SMCCA | AdaSMCCA | RelPMDCCA | MT-GPIC |
|---|---|---|---|
| Cingulum Mid R | Cingulum Mid R | Cerebellum Crus1 R | Hippocampus L |
| Temporal Sup L | Hippocampus L | Cingulum Mid R | Parietal Sup R |
| Angular R | Temporal Mid R | Rolandic Oper R | Frontal Sup L |
| Hippocampus L | Occipital Mid L | Postcentral L | Occipital Sup L |
| Temporal Sup L | Temporal Sup R | Hippocampus L | Hippocampus R |
| SupraMarginal L | Angular R | Temporal Sup L | Hippocampus L |
| Angular L | Occipital Inf R | Frontal Sup R | ParaHippocampal L |
| Rolandic Oper R | Occipital Mid R | Caudate R | Cerebelum 7b R |
| Temporal Mid R | Temporal Sup R | Frontal Sup L | Cerebelum 9 L |
| Angular R | Temporal Mid R | Temporal Mid L | Frontal Sup Orb L |
Identification and interpretation of GPI
In addition to the main effects, our method can also uncover significant GPI. The top 10 GPI were (rs118052140, ApoE), (rs35978917, ApoD), (rs112972879, AXL), (rs4251952, RANTES), (rs35121749, AXL), (rs4251923, RANTES), (rs62118470, TSH), (rs112120887, ApoD), (rs6509113, ApoE) and (rs8113589, SCF). To investigate whether these findings were beneficial to AD diagnosis, we conducted a classification experiment using the LIBSVM toolbox to evaluate the performance in distinguishing between HCs, MCIs and ADs [45]. As shown in Figure 6, our method achieved the highest average testing classification accuracy for all diagnostic groups. This indicated that our method can identify predictive biomarkers to assist in the diagnosis of AD. Notably, the highest accuracy was achieved when both the main and interaction effects of genes and proteins were considered, compared with considering the main effects alone, which indicated the necessity of using both. Thus, owing to the GPI effects, our model can obtain disease-related interpretable biomarkers for AD diagnosis.
Figure 6.

Comparison of average classification accuracy achieved by all methods. Without GPI means that the GPI component was deleted from the MT-GPIC model.
Furthermore, we used two-way ANOVA to investigate the influence of the first two GPI on intermediate imaging phenotypes [46]. The analysis revealed several significant findings. Firstly, the main effect of the rs118052140 locus ($p = 0.01999$) showed a statistically significant association with the hippocampus area [44], as did the main effect of ApoE ($p = 0.00777$). The SNP-by-protein interaction was also found to be significant. These findings indicated that abnormal interactions between disease-related proteins and genetic variability may modulate disease-related imaging QTs, which could in turn help clinicians diagnose at-risk individuals with greater confidence.
We also examined the impact of the (rs35978917, ApoD) genotype–protein pair on the left hippocampus area. The ANOVA results indicated that the main effects of the rs35978917 locus and ApoD concentration were not statistically significant, but their interaction showed a significant effect (p = 0.0340). In addition, the main effect of ApoE (p = 0.00766) and the rs118052140-by-ApoE interaction (p = 0.01148) showed statistically significant associations with the left frontal sup orb area, suggesting that amyloid deposition in frontal areas could serve as an AD-risk biomarker, while the main effect of the rs118052140 locus (p = 0.26379) was not significant. This suggests that main and interaction effects may differ, which emphasizes the importance of considering both rather than the main effect alone. In summary, by considering GPI, our approach has the potential to reveal meaningful and interpretable phenotype-associated GPI. These findings provide valuable clues for a better understanding of the complex relationships among multiple omics data sets.
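The ANOVA analysis above asks whether the SNP-by-protein term explains variance in an imaging QT beyond the two main effects. As a rough illustration only, the sketch below frames this as a nested linear-model comparison. It is not the authors' analysis: it assumes an additive 0/1/2 genotype coding and a continuous protein concentration, it replaces the parametric ANOVA p-value with a residual-permutation (Freedman–Lane style) p-value so that only NumPy is needed, and all function names are hypothetical.

```python
import numpy as np

def _fit(X, y):
    """Least-squares fit; returns (residual sum of squares, fitted values)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return float(resid @ resid), fitted

def _designs(snp, protein):
    """Design matrices without and with the SNP-by-protein product column."""
    snp = np.asarray(snp, dtype=float)
    protein = np.asarray(protein, dtype=float)
    X0 = np.column_stack([np.ones(len(snp)), snp, protein])  # main effects only
    X1 = np.column_stack([X0, snp * protein])                # plus interaction
    return X0, X1

def interaction_f(y, snp, protein):
    """F statistic for adding the interaction term to the main-effects model."""
    y = np.asarray(y, dtype=float)
    X0, X1 = _designs(snp, protein)
    rss0, _ = _fit(X0, y)
    rss1, _ = _fit(X1, y)
    dof = len(y) - X1.shape[1]
    return (rss0 - rss1) / (rss1 / dof)

def interaction_pvalue(y, snp, protein, n_perm=499, seed=0):
    """Observed F and a permutation p-value for the interaction term."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    X0, _ = _designs(snp, protein)
    _, fitted0 = _fit(X0, y)
    resid0 = y - fitted0
    f_obs = interaction_f(y, snp, protein)
    # Shuffle main-effects residuals to mimic the "no interaction" null.
    hits = sum(
        interaction_f(fitted0 + rng.permutation(resid0), snp, protein) >= f_obs
        for _ in range(n_perm)
    )
    return f_obs, (hits + 1) / (n_perm + 1)

# Simulated example (not ADNI data): a QT with a built-in interaction effect.
rng = np.random.default_rng(42)
snp = rng.integers(0, 3, size=150)
protein = rng.normal(size=150)
qt = 0.2 * snp + 0.4 * protein + 0.6 * snp * protein + rng.normal(size=150)
f_stat, p_perm = interaction_pvalue(qt, snp, protein)
print(f"interaction F = {f_stat:.2f}, permutation p = {p_perm:.3f}")
```

A case such as (rs35978917, ApoD), where neither main effect is significant but the interaction is, corresponds to a small p-value from this test alongside unremarkable main-effect coefficients.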
Identification and interpretation of GPC
To better understand the identified correlations between SNPs and proteomic biomarkers, we present a heatmap of pairwise correlations in Figure 7. The heatmap depicts the correlations between SNP–protein pairs, with a symbol marking pairs that reached the significance level (p < 0.05). It reveals numerous significant correlations between genotype–protein pairs. Notably, the (rs429358, APOE) pair exhibited the highest positive correlation, while the (rs7412, APOE) pair showed the highest negative correlation. These results align with existing findings, i.e. Apolipoprotein E (APOE) alleles showed a high correlation with the rs429358 and rs7412 polymorphisms, and the minor homozygote of rs429358 had a high risk of AD, while the major homozygote of rs7412 resulted in a high risk of AD.
Figure 7.

Heatmap of pairwise correlation between top 10 SNPs and proteomic analytes, where a symbol marks pairwise associations that reached the significance level (p < 0.05). Of note, the heatmap reveals numerous significant correlations between genotype–protein pairs, which helps better understand the identified correlations between SNPs and proteomic biomarkers.
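A heatmap of this kind rests on a SNP-by-protein correlation matrix plus a significance mask. The sketch below shows one way such a matrix could be computed. It is an assumption-laden illustration rather than the authors' procedure: the source does not state which significance test was used, so the two-sided p-values here come from the Fisher z approximation, and the function names and simulated data are hypothetical.

```python
import math
import numpy as np

def cross_correlation(snps, proteins):
    """Pearson correlation between each SNP column and each protein column.

    Assumes no column is constant (a constant column has zero variance).
    """
    a = np.asarray(snps, dtype=float)
    b = np.asarray(proteins, dtype=float)
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]

def fisher_z_pvalues(r, n):
    """Approximate two-sided p-values via the Fisher z-transform (needs n > 3)."""
    r = np.clip(r, -0.999999, 0.999999)   # keep arctanh finite
    z = np.arctanh(r) * math.sqrt(n - 3)  # approximately standard normal under H0
    return np.vectorize(lambda v: math.erfc(abs(v) / math.sqrt(2)))(z)

# Simulated example (not ADNI data): 120 subjects, 3 SNPs, 2 proteins.
rng = np.random.default_rng(7)
snps = rng.integers(0, 3, size=(120, 3)).astype(float)
proteins = np.column_stack([
    0.5 * snps[:, 0] + rng.normal(size=120),  # correlated with the first SNP
    rng.normal(size=120),                     # unrelated to any SNP
])

r = cross_correlation(snps, proteins)
p = fisher_z_pvalues(r, n=snps.shape[0])
significant = p < 0.05  # cells that would carry the marker in the heatmap
print(np.round(r, 2))
print(significant)
```

With many SNP-protein pairs, a multiple-testing correction would normally be applied to the mask before interpreting individual cells.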
We further investigated whether the GPC exhibited distinct distributions among different genotypes and diagnostic groups. Figure 8 illustrates the genotype–protein distributions for rs429358-APOE (highest positive correlation) and rs7412-APOE (highest negative correlation). In sub-figure (A), a decreased concentration level of rs429358-APOE was observed in the MCI and AD groups. The TT and CT genotypes exhibited a reduced risk of AD. Conversely, carrying the minor allele C appeared to be an AD-risk indicator, as individuals with this allele showed a higher vulnerability to AD. Similar conclusions can be drawn from sub-figure (B). These results suggest that the concentration levels associated with the rs429358-APOE and rs7412-APOE correlations could serve as potential indicators for AD diagnosis. Such indicators could facilitate the development of effective therapeutic interventions for AD.
Figure 8.

Pairwise comparisons for SNPs and proteomic biomarkers across HC, MCI and AD groups. (A) The concentration of rs429358-APOE for various genotypes across the three groups. (B) The concentration of rs7412-APOE for various genotypes within the three groups.
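The comparison in Figure 8 amounts to summarizing a protein concentration within each combination of diagnostic group and genotype. A minimal sketch of that stratification follows. The function name `mean_by_group` and the toy records are hypothetical, and the values are illustrative rather than taken from ADNI.

```python
from collections import defaultdict
from statistics import mean

def mean_by_group(diagnosis, genotype, concentration):
    """Mean protein concentration for every (diagnosis, genotype) cell."""
    cells = defaultdict(list)
    for dx, gt, value in zip(diagnosis, genotype, concentration):
        cells[(dx, gt)].append(value)
    return {key: mean(values) for key, values in cells.items()}

# Toy records (illustrative values only).
diagnosis = ["HC", "HC", "MCI", "AD", "AD", "MCI"]
genotype = ["TT", "CT", "TT", "CC", "CT", "TT"]
concentration = [1.9, 1.7, 1.6, 1.1, 1.3, 1.4]

table = mean_by_group(diagnosis, genotype, concentration)
for (dx, gt), avg in sorted(table.items()):
    print(f"{dx:>3} {gt}: {avg:.2f}")
```

Cells with few subjects give unstable means, so in practice the per-cell sample size would be reported alongside each summary.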
Interpretation of genotype-phenotype association
In brain imaging genetics studies, understanding neurological manifestations and their genetic architectures is important and beneficial. In our work, we aimed to gain a deeper understanding of the biological basis linking genetic variations and heritable imaging QTs. Figure 9 shows the pairwise correlation analysis between the top selected SNPs and imaging markers, with a symbol marking SNP-QT pairs that reached the significance level (p < 0.05). The results indicate that most SNP-QT pairs held substantial correlation values with significant p-values. For example, rs429358 had the highest weight values with hippocampal areas, supporting the view that the hippocampal atrophy phenotype may be highly heritable. Of particular interest, owing to the newly introduced penalty, rs4420638, rs56131196 and rs12721051 were identified simultaneously. These SNPs exhibited similar patterns with brain imaging QTs, indicating that they were not independent but jointly related. This further demonstrates that MT-GPIC can detect strong genotype-phenotype associations and valuable genetic markers and imaging QTs.
Figure 9.

Heatmap of the pairwise correlation between the identified top 10 SNPs and imaging phenotypes. A symbol marks pairwise associations that reached the significance level (p < 0.05). The results showed that most SNP-QT pairs held substantial correlation values with significant p-values.
DISCUSSION AND CONCLUSION
Despite extensive research efforts, the precise pathological mechanisms of AD remain unclear. This may be attributed to the intricate interactions and correlations between genetic variations and proteins. Unfortunately, most existing imaging genetics methods overlook the potential informative value of GPI and correlations, which are crucial for unraveling the underlying relationships within multi-omics data.
To overcome these limitations, we proposed MT-GPIC to detect GPI and correlation effects associated with imaging QTs. To enhance computational efficiency, we developed a fast optimization algorithm with guaranteed convergence. MT-GPIC improved CCCs and classification accuracy compared with state-of-the-art methods. Moreover, our method can provide interpretable, AD-related biomarkers with meaningful implications.
Future research can apply our model to genome-wide, brain-wide analysis or broaden its applicability to multi-modal, multi-omics brain imaging genetics problems. Nevertheless, more experiments on different data sets are needed to further validate the robustness of the proposed model. Such efforts aim to facilitate the identification of more comprehensive biomarkers and to aid clinicians in identifying high-risk individuals.
Key Points
We proposed a novel MT-GPIC method to disentangle the GPI and correlation characteristics associated with disease-related heritable phenotypes.
We used a novel penalty and off-the-shelf penalties to detect meaningful genetic risk factors and to exploit the interconnectedness of different brain regions.
Experimental results demonstrated that MT-GPIC significantly outperformed the state-of-the-art methods.
ACKNOWLEDGMENTS
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative.
Author Biographies
Jin Zhang is a PhD student at the School of Automation, Northwestern Polytechnical University, Xi’an, China.
Zikang Ma is a master’s student at the School of Automation, Northwestern Polytechnical University, Xi’an, China.
Yan Yang is a PhD student at the School of Automation, Northwestern Polytechnical University, Xi’an, China.
Lei Guo is a full Professor at the School of Automation, Northwestern Polytechnical University, Xi’an, China.
Lei Du is an associate Professor at the School of Automation, Northwestern Polytechnical University, Xi’an, China.
Contributor Information
Jin Zhang, Department of Intelligent Science and Technology, Northwestern Polytechnical University School of Automation, 127 Youyi Road, 710072 Shaanxi, China.
Zikang Ma, Department of Intelligent Science and Technology, Northwestern Polytechnical University School of Automation, 127 Youyi Road, 710072 Shaanxi, China.
Yan Yang, Department of Intelligent Science and Technology, Northwestern Polytechnical University School of Automation, 127 Youyi Road, 710072 Shaanxi, China.
Lei Guo, Department of Intelligent Science and Technology, Northwestern Polytechnical University School of Automation, 127 Youyi Road, 710072 Shaanxi, China.
Lei Du, Department of Intelligent Science and Technology, Northwestern Polytechnical University School of Automation, 127 Youyi Road, 710072 Shaanxi, China.
FUNDING
This work was supported in part by the STI2030-Major Projects (No. 2022ZD0213700), National Natural Science Foundation of China (No. 61973255, 62136004, 61936007, 62373306), Innovation Foundation for Doctor Dissertation (No. CX2023062) and Fundamental Research Funds for the Central Universities at Northwestern Polytechnical University.
CONTRIBUTIONS
L.D., J.Z. conceived experiment(s); J.Z., Z.M., Y.Y., L.G. analyzed the biomarker; L.D., J.Z. wrote the manuscript; and L.G. wrote and reviewed the whole manuscript.
DATA AVAILABILITY
Data for this article were acquired from the ADNI database (adni.loni.usc.edu). Our algorithm’s software is available for free download at https://github.com/dulei323/MT-GPIC.
References
- 1. Sims R, Hill M, Williams J. The multiplex model of the genetics of Alzheimer’s disease. Nat Neurosci 2020;1–12.
- 2. Shen L, Thompson PM. Brain imaging genomics: integrated analysis and machine learning. Proc IEEE 2019;108(1):125–62.
- 3. Yoo J, Winogradoff D, Aksimentiev A. Molecular dynamics simulations of DNA–DNA and DNA–protein interactions. Curr Opin Struct Biol 2020;64:88–96.
- 4. Jimenez JS. Protein-DNA interaction at the origin of neurological diseases: a hypothesis. J Alzheimers Dis 2010;22(2):375–91.
- 5. Tang L. Recording protein–DNA interactions in bacteria. Nat Methods 2022;19(7):782–2.
- 6. Lei D, Zhao Y, Zhang J, et al. Identification of genetic risk factors based on disease progression derived from longitudinal brain imaging phenotypes. IEEE Trans Med Imaging 2023;1.
- 7. Canchi S, Raao B, Masliah D, et al. Integrating gene and protein expression reveals perturbed functional networks in Alzheimer’s disease. Cell Rep 2019;28(4):1103–1116.e4.
- 8. Vasunilashorn SM, Ngo LH, Inouye SK, et al. Apolipoprotein E genotype and the association between C-reactive protein and postoperative delirium: importance of gene-protein interactions. Alzheimers Dement 2019.
- 9. Domingue BW, Kanopka K, Mallard TT, et al. Modeling interaction and dispersion effects in the analysis of gene-by-environment interaction. Behav Genet 2022;52(1):56–64.
- 10. Zhang J, Wang H, Zhao Y, et al. Identification of multimodal brain imaging association via a parameter decomposition based sparse multi-view canonical correlation analysis method. BMC Bioinform 2022;23(Suppl 3):128.
- 11. Li Y, Sahakian BJ, Kang J, et al. The brain structure and genetic mechanisms underlying the nonlinear association between sleep duration, cognition and mental health. Nature Aging 2022;2(5):425–37.
- 12. Lei D, Zhang J, Liu F, et al. Identifying associations among genomic, proteomic and imaging biomarkers via adaptive sparse multi-view canonical correlation analysis. Med Image Anal 2021;70:102003.
- 13. Gallagher LA, Elena Velazquez S, Peterson B, et al. Genome-wide protein–DNA interaction site mapping in bacteria using a double-stranded DNA-specific cytosine deaminase. Nat Microbiol 2022;7(6):844–55.
- 14. McCarthy MI, Abecasis GR, Cardon LR, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008;9(5):356–69.
- 15. Shriner D, Vaughan LK, Padilla MA, Tiwari HK. Problems with genome-wide association studies. Science 2007;316(5833):1840–2.
- 16. Wei W-H, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet 2014;15(11):722–33.
- 17. Wang H, Nie F, Huang H, et al. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics 2012;28(12):i127–36.
- 18. Lin D, Calhoun VD, Wang Y-P. Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Med Image Anal 2014;18(6):891–902.
- 19. Eichler EE, Flint J, Gibson G, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 2010;11(6):446–50.
- 20. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature 2009;461(7265):747–53.
- 21. Simons YB, Bullaughey K, Hudson RR, Sella G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biol 2018;16(3):e2002985.
- 22. Cuevas J, Montesinos-López O, Juliana P, et al. Deep kernel for genomic and near infrared predictions in multi-environment breeding trials. G3: Genes, Genomes, Genetics 2019;9(9):2913–24.
- 23. Wang J, Ma A, Chang Y, et al. scGNN is a novel graph neural network framework for single-cell RNA-seq analyses. Nat Commun 2021;12(1):1882.
- 24. Serrano-Pozo A, Das S, Hyman BT. APOE and Alzheimer’s disease: advances in genetics, pathophysiology, and therapeutic approaches. Lancet Neurol 2021;20(1):68–80.
- 25. Kulminski AM, Shu L, Loika Y, et al. Genetic and regulatory architecture of Alzheimer’s disease in the APOE region. Alzheimers Dement 2020;12(1):e12008.
- 26. Lei D, Liu K, Yao X, et al. Detecting genetic associations with brain imaging phenotypes in Alzheimer’s disease via a novel structured SCCA approach. Med Image Anal 2020;61:101656.
- 27. Rodosthenous T, Shahrezaei V, Evangelou M. Integrating multi-omics data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics 2020;36(17):4616–25.
- 28. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8(1):1–27.
- 29. Wenxing H, Lin D, Cao S, et al. Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia. IEEE Trans Biomed Eng 2017;65(2):390–9.
- 30. Kabiljo R, Clegg AB, Shepherd AJ. A realistic assessment of methods for extracting gene/protein interactions from free text. BMC Bioinform 2009;10(1):1–12.
- 31. Cordeiro Y, Macedo B, Silva JL, Gomes MPB. Pathological implications of nucleic acid interactions with proteins associated with neurodegenerative diseases. Biophys Rev 2014;6:97–110.
- 32. Camero S, Ayuso JM, Barrantes A, et al. Specific binding of DNA to aggregated forms of Alzheimer’s disease amyloid peptides. Int J Biol Macromol 2013;55:201–6.
- 33. Maloney B, Lahiri DK. The Alzheimer’s amyloid β-peptide (Aβ) binds a specific DNA Aβ-interacting domain (AβID) in the APP, BACE1, and APOE promoters in a sequence-specific manner: characterizing a new regulatory motif. Gene 2011;488(1–2):1–12.
- 34. Bailey JA, Maloney B, Ge Y-W, Lahiri DK. Functional activity of the novel Alzheimer’s amyloid β-peptide interacting domain (AβID) in the APP and BACE1 promoter sequences and implications in activating apoptotic genes and in amyloidogenesis. Gene 2011;488(1–2):13–22.
- 35. Nymberg C, Jia T, Lubbe S, et al. Neural mechanisms of attention-deficit/hyperactivity disorder symptoms are stratified by MAOA genotype. Biol Psychiatry 2013;74(8):607–14.
- 36. Gao L, Cui Z, Shen L, Ji H-F. Shared genetic etiology between type 2 diabetes and Alzheimer’s disease identified by bioinformatics analysis. J Alzheimers Dis 2016;50(1):13–7.
- 37. Yi L, Ting W, Luo W, et al. A non-invasive, rapid method to genotype late-onset Alzheimer’s disease-related apolipoprotein E gene polymorphisms. Neural Regen Res 2014;9(1):69–75.
- 38. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol 2015;11(4):e1004219.
- 39. Gouveia C, Gibbons E, Dehghani N, et al. Genome-wide association of polygenic risk extremes for Alzheimer’s disease in the UK Biobank. Sci Rep 2022;12(1):8404.
- 40. Paranjpe MD, Chaffin M, Zahid S, et al. Neurocognitive trajectory and proteomic signature of inherited risk for Alzheimer’s disease. PLoS Genet 2022;18(9):e1010294.
- 41. Yang K, Tong L, Shu J, et al. High gamma band EEG closely related to emotion: evidence from functional network. Front Hum Neurosci 2020;14:89.
- 42. Wang J, Zuo X, Dai Z, et al. Disrupted functional brain connectome in individuals at risk for Alzheimer’s disease. Biol Psychiatry 2013;73(5):472–81.
- 43. Hosseinian S, Arefian E, Rakhsh-Khorshid H, et al. A meta-analysis of gene expression data highlights synaptic dysfunction in the hippocampus of brains with Alzheimer’s disease. Sci Rep 2020;10(1):1–9.
- 44. Salta E, Lazarov O, Fitzsimons CP, et al. Adult hippocampal neurogenesis in Alzheimer’s disease: a roadmap to clinical relevance. Cell Stem Cell 2023;30(2):120–36.
- 45. Bhatt A, Bhatt A. EEG based emotion recognition using SVM and LibSVM. Int J Comput Appl 2019;178:1–3.
- 46. Connelly LM. Introduction to analysis of variance (ANOVA). Medsurg Nurs 2021;30(3):218–158.