Author manuscript; available in PMC: 2018 Apr 20.
Published in final edited form as: IJCAI (U S). 2017 Aug;2017:3880–3886. doi: 10.24963/ijcai.2017/542

Predicting Alzheimer’s Disease Cognitive Assessment via Robust Low-Rank Structured Sparse Model

Jie Xu 1,3, Cheng Deng 1, Xinbo Gao 1, Dinggang Shen 2, Heng Huang 3,1
PMCID: PMC5909849  NIHMSID: NIHMS939431  PMID: 29681724

Abstract

Alzheimer's disease (AD) is a neurodegenerative disorder with slow onset, which results in progressive deterioration and persistent neurological dysfunction. Identifying informative longitudinal phenotypic neuroimaging markers and predicting cognitive measures are crucial for recognizing AD at an early stage. Many existing methods relate imaging measures to cognitive status using regression models, but they do not take full account of the interactions between cognitive scores. In this paper, we propose a robust low-rank structured sparse regression method (RLSR) to address this issue. The proposed model simultaneously selects effective features and learns the underlying structure among cognitive scores by utilizing novel mixed structured sparsity-inducing norms and a low-rank approximation. In addition, an efficient algorithm with proved convergence is derived to solve the resulting non-smooth objective function. Empirical studies on cognitive data from the ADNI cohort demonstrate the superior performance of the proposed method.

1 Introduction

Alzheimer's disease (AD), a common form of dementia, affects nerve cells in areas of the brain responsible for memory, cognition, language, and motor activity [Dailey, 2017]. By linear extrapolation of estimates from 2006, the worldwide population with AD will grow to over 100 million by 2050 [Thompson et al., 2003; Moradi et al., 2015]. Researchers believe that early detection will be key to preventing, slowing, and stopping Alzheimer's disease. Neuroimaging is a powerful tool for accurately identifying and understanding informative features, which is necessary for early Alzheimer's disease prognosis and diagnosis [Liu et al., 2015; Nie et al., 2016]. Therefore, many machine learning methods have been proposed to study neuroimaging measures, detect AD-associated pathology, and predict cognitive scores [Wang et al., 2011b; Huo et al., 2016; Wang et al., 2016]. Among imaging modalities, structural magnetic resonance imaging (MRI) is one of the most extensively used for tracking AD progression.

In the association study of selecting effective longitudinal phenotypic markers to predict cognitive scores from imaging features, the input usually consists of two matrices: the imaging feature matrix X = [x_1, …, x_n] ∈ ℝ^{d×n} and the corresponding cognitive score matrix Y = [y_1, …, y_n]^T ∈ ℝ^{n×m}, where n is the number of samples, d is the number of features, and m is the number of different measures of a certain cognitive performance.
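
For concreteness, the following minimal NumPy sketch sets up this data layout; the sizes are illustrative assumptions, not the actual ADNI dimensions.

```python
import numpy as np

# Illustrative sizes only (assumptions, not the ADNI dimensions):
d, n, m = 93, 500, 3        # features, samples, cognitive measures
X = np.random.randn(d, n)   # imaging feature matrix, one column per sample
Y = np.random.randn(n, m)   # cognitive score matrix, one row per sample
```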

A straightforward way to identify informative imaging markers is to perform feature selection [Chang and Yang, 2016], which has been demonstrated to be a useful way to reflect the correlations between cognitive measures after removing indistinctive neuroimaging markers. More recently, sparse regularization models have been extensively utilized to learn the structure of data and obtain effective features in different applications. Sparsity-inducing-norm based feature selection methods solve convex optimization problems of the form:

$\min_B \mathcal{L}(B; X) + \lambda \Omega(B)$  (1)

where ℒ is a convex function and Ω can include one or more non-smooth sparsity-inducing norms.

When Ω is the l1-norm, l1-shrinkage methods such as LASSO identify informative longitudinal phenotypic markers in the brain that are related to pathological changes of AD by imposing flat sparsity [Liu et al., 2014]. However, the selected features are distributed randomly across the whole brain and cannot be well explained. In the cognitive score prediction task, we expect to select the most informative markers, which are important for all participants, including AD, mild cognitive impairment (MCI), and healthy control (HC) subjects. To address this issue, group LASSO with an l2,1-norm is used to impose structured sparsity on the parameter matrix for feature selection [Obozinski et al., 2010; Jie et al., 2015; Yang et al., 2017; Chang et al., 2017]. It enforces important features to have non-zero weights across all participants; however, many important features are discriminative only for a subset of classes, i.e., they have large weights only for those participants. Such important features may thus be ignored by the above methods.

On this account, Lee et al. and Wang et al. proposed adding an l1,1-norm regularization term to achieve both structured and flat sparsity [Lee et al., 2010; Wang et al., 2011a]. However, because the l1,1-norm regularization term enforces flat sparsity and tends to shrink all but the largest values to zero, the non-zero weights of some important features may also be forced to zero; i.e., some features selected by the l2,1-norm regularization term can be totally suppressed by the l1,1-norm regularization term. As a result, many informative longitudinal phenotypic markers are neglected during the feature selection procedure. Thus, more carefully designed structured sparsity-inducing norms are desired in feature selection research.

In this paper, we propose a robust low-rank structured sparse regression method (RLSR) to simultaneously select important neuroimaging markers and learn the underlying structure between cognitive measures. Our main contributions are three-fold: (1) new mixed structured sparsity-inducing norms are introduced to overcome the over-shrinkage drawback of existing sparse-learning based feature selection models; (2) an explicit rank-s low-rank matrix fitting approach is used to extract the underlying interrelation structures between cognitive measures; (3) because our method leads to a highly non-smooth objective, we derive an efficient algorithm to solve it with proved convergence. We validate our method on cognitive data from the ADNI cohort and obtain promising results.

Notations

We summarize the notations used in this paper. Matrices are written as uppercase letters and vectors as bold lowercase letters. For a matrix W = {w_ij}, its i-th row and j-th column are denoted by $\mathbf{w}^i$ and $\mathbf{w}_j$, respectively. The $\ell_p$-norm of a vector $\mathbf{v} \in \mathbb{R}^n$ is defined as $\|\mathbf{v}\|_p = \left( \sum_{i=1}^n |v_i|^p \right)^{1/p}$ for p > 0. The $\ell_{2,1}$-norm of a matrix W is defined as $\|W\|_{2,1} = \sum_{i=1}^d \|\mathbf{w}^i\|_2$ (in some related papers, this is also written as the $\ell_1/\ell_2$-norm). The $\ell_{1,2}$-norm of W is defined as $\|W\|_{1,2} = \sqrt{\sum_{i=1}^d \|\mathbf{w}^i\|_1^2}$, and the $\ell_{1,1}$-norm of W is defined as $\|W\|_{1,1} = \sum_{i=1}^d \sum_{j=1}^m |w_{ij}|$.
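
For reference, a minimal NumPy sketch of these three matrix norms; the helper names are our own, not from the paper.

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1}: sum of the l2-norms of the rows of W.
    return np.sum(np.linalg.norm(W, axis=1))

def l12_norm(W):
    # ||W||_{1,2}: l2-norm of the vector of row-wise l1-norms.
    return np.sqrt(np.sum(np.linalg.norm(W, ord=1, axis=1) ** 2))

def l11_norm(W):
    # ||W||_{1,1}: sum of the absolute values of all entries.
    return np.sum(np.abs(W))
```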

2 Robust Low-Rank Structured Sparse Learning

The l2,1-norm based objectives select informative imaging markers across all the cognitive scores with joint sparsity, i.e., each imaging marker has either small or large weights for all the cognitive measures. However, for accurate identification of effective imaging markers, we utilize participants including AD, MCI, and HC subjects at different disease stages. Consequently, many features may be irrelevant to each other, which can deteriorate feature selection performance if we treat all the imaging markers as one group. On the other hand, as discussed in the introduction, an extra l1,1-norm regularizer suppresses too many non-zero values and leads to unstable feature selection results. For example, when we aim to select the top 20 features and adjust the trade-off parameter so that the l2,1-norm yields about 20 features with relatively large weights, the added l1,1-norm will dramatically shrink the weights such that only a few important features (far fewer than 20) can be selected. To tackle this over-shrinkage problem, we add a convex squared l1,2-norm regularizer instead of the l1,1-norm and solve:

$\min_B \|Y - X^T B\|_F^2 + \lambda_1 \|B\|_{2,1} + \lambda_2 \|B\|_{1,2}^2$  (2)

Typical loss functions are the least square loss and the logistic loss. To improve computational efficiency, we utilize the least square loss in this paper to select informative markers for the ADNI data. Our method can be applied to both classification tasks (e.g., AD/MCI versus normal controls (NC)) and regression tasks (e.g., estimation of clinical cognitive scores); we perform the latter in this paper.

In Eq. (2), we propose novel mixed structured sparsity norms. The standard l2,1-norm enforces joint sparsity across all cognitive measures to select imaging markers. The new l1,2-norm applies the l2-norm across markers, such that at least one non-zero element in each row of B selected by the l2,1-norm regularizer is kept. Thus, we do not lose the discriminative imaging markers selected by the l2,1-norm regularization. At the same time, the l1,2-norm imposes the l1-norm on the cognitive score weights of each marker, shrinking the weights of uncorrelated or irrelevant cognitive measures. For illustration, in Fig. 1 we plot the sparse shrinkage patterns of the matrix B using: (a) the l2,1-norm regularizer only, (b) the l2,1-norm and l1,1-norm regularizers, and (c) the l2,1-norm and l1,2-norm regularizers. The l1,1-norm over-shrinks the weights and removes the first feature selected by the l2,1-norm. The l1,2-norm suppresses some weights while supporting the results of the l2,1-norm, e.g., the first feature is still kept in the list.

Figure 1. The sparse shrinkage patterns of matrix B under different structured sparsity-inducing norms: (a) l2,1-norm, (b) l2,1-norm + l1,1-norm, (c) l2,1-norm + l1,2-norm. Blue points represent non-zero weights and white points represent zero weights. In (b), the l1,1-norm suppresses the first feature selected by the l2,1-norm. In (c), the l1,2-norm keeps at least one non-zero weight for this feature, leading to stable feature selection results.
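
As a concrete reference, a hedged NumPy sketch of evaluating the objective in Eq. (2); the helper is our illustration, not released code.

```python
import numpy as np

def objective_eq2(B, X, Y, lam1, lam2):
    # ||Y - X^T B||_F^2 + lam1 * ||B||_{2,1} + lam2 * ||B||_{1,2}^2
    fit = np.linalg.norm(Y - X.T @ B, 'fro') ** 2
    l21 = np.sum(np.linalg.norm(B, axis=1))                 # row-wise l2-norms
    l12_sq = np.sum(np.linalg.norm(B, ord=1, axis=1) ** 2)  # squared l1,2-norm
    return fit + lam1 * l21 + lam2 * l12_sq
```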

More importantly, for the specific task of predicting AD progression, we hope to extract and utilize the underlying interrelations between cognitive measures to enhance the accuracy of feature selection. In recent research [Ji and Ye, 2009; Deng et al., 2015], trace norm regularization has been used to seek low-rank shared common representations:

$\min_B \|Y - X^T B\|_F^2 + \lambda \|B\|_*$  (3)

However, this approach has two deficiencies: 1) the optimal low-rank approximation is obtained by tediously tuning the parameter λ, which has no direct connection to the rank value; 2) feature selection is not enforced in this model.

To address the above problems, we consider the following low-rank regression:

$\min_B \|Y - X^T B\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(B) = s \le \min(m, d)$  (4)

where m is the number of classes and d is the dimension of features. Compared to the parameter λ in Eq. (3), the parameter s is easier for users to set. Moreover, in order to select features while keeping the low-rank matrix fitting, we propose the following model:

$\min_{W,P} \|Y - X^T W P\|_F^2 + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,2}^2 \quad \text{s.t.} \quad P P^T = I$  (5)

where W ∈ ℝ^{d×s}, P ∈ ℝ^{s×m}, and s ≤ min(m, d). The product B = WP is a low-rank matrix with rank(B) ≤ s. Our new objective simultaneously learns the underlying interrelations between cognitive measures by low-rank matrix fitting and selects informative neuroimaging markers by the mixed structured sparsity-inducing norms. Because real-world data often contain outliers, we also replace the least square loss with an l2,1-norm based loss function, which imposes the l1-norm across data points to reduce the effect of outliers. Our final objective is to solve:

$\min_{W,P} \|Y - X^T W P\|_{2,1} + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,2}^2 \quad \text{s.t.} \quad P P^T = I$  (6)

The resulting objective has three non-smooth terms, which makes the optimization difficult. In the next section we derive an efficient algorithm with proved convergence to solve it.
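
For reference, a minimal sketch of evaluating the objective in Eq. (6); this is our illustration under the paper's notation, not released code.

```python
import numpy as np

def objective_eq6(W, P, X, Y, lam1, lam2):
    # ||Y - X^T W P||_{2,1} + lam1 * ||W||_{2,1} + lam2 * ||W||_{1,2}^2
    E = Y - X.T @ W @ P                        # n x m residual matrix
    loss = np.sum(np.linalg.norm(E, axis=1))   # l2,1 loss over sample residuals
    reg1 = np.sum(np.linalg.norm(W, axis=1))
    reg2 = np.sum(np.linalg.norm(W, ord=1, axis=1) ** 2)
    return loss + lam1 * reg1 + lam2 * reg2
```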

3 Optimization

In this section, an efficient algorithm is proposed to tackle Eq. (6), followed by the proof of its convergence.

3.1 Algorithm Derivation

We solve Eq. (6) alternately and iteratively. To begin with, we rewrite Eq. (6) as:

$\min_{W,P,H,\hat{D},D_k} \mathrm{Tr}\big((Y - X^T W P)^T H (Y - X^T W P)\big) + \lambda_1 \mathrm{Tr}(W^T \hat{D} W) + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k \quad \text{s.t.} \quad P P^T = I$  (7)

where W = [w_1, w_2, …, w_s] ∈ ℝ^{d×s}. Denote

$E = Y - X^T W P$  (8)

then H ∈ ℝ^{n×n} is a diagonal matrix defined as:

$H(i,i) = \frac{1}{2\|\mathbf{e}^i\|_2}$  (9)

where e^i (∀i = 1, 2, …, n) is the i-th row of the matrix E in Eq. (8). D̂ ∈ ℝ^{d×d} is a diagonal matrix defined as

$\hat{D}(j,j) = \frac{1}{2\|\mathbf{w}^j\|_2}, \quad j = 1, 2, \ldots, d$  (10)

and D_k ∈ ℝ^{d×d} is also a diagonal matrix, defined as:

$D_k(j,j) = \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|}, \quad k = 1, 2, \ldots, s, \; j = 1, 2, \ldots, d$  (11)
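
A sketch of how these diagonal matrices could be formed in NumPy; the small eps guard against zero denominators is our practical addition, not part of the paper's formulation.

```python
import numpy as np

def reweighting_matrices(W, E, eps=1e-8):
    # H(i,i) = 1 / (2 ||e^i||_2), Eq. (9).
    H = np.diag(1.0 / (2.0 * np.linalg.norm(E, axis=1) + eps))
    # D_hat(j,j) = 1 / (2 ||w^j||_2), Eq. (10).
    D_hat = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
    # D_k(j,j) = ||w^j||_1 / |w_jk|, Eq. (11), one matrix per column k.
    row_l1 = np.linalg.norm(W, ord=1, axis=1)
    D = [np.diag(row_l1 / (np.abs(W[:, k]) + eps)) for k in range(W.shape[1])]
    return H, D_hat, D
```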

The first step is to fix P, H, D̂, and D_k, and solve for W. Thus, we need to solve the following subproblem:

$\min_{\mathbf{w}_k} \sum_{k=1}^s \left( -2\mathbf{z}_k^T \mathbf{w}_k + \mathbf{w}_k^T (X H X^T) \mathbf{w}_k + \lambda_1 \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \mathbf{w}_k^T D_k \mathbf{w}_k \right)$  (12)

where z_k is the k-th column of the matrix XHYP^T, ∀k = 1, 2, …, s. Taking the derivative of Eq. (12) w.r.t. w_k and setting it to zero, we get:

$-2\mathbf{z}_k + 2(X H X^T)\mathbf{w}_k + 2\lambda_1 \hat{D}\mathbf{w}_k + 2\lambda_2 D_k \mathbf{w}_k = 0$  (13)
$\mathbf{w}_k = (X H X^T + \lambda_1 \hat{D} + \lambda_2 D_k)^{-1} \mathbf{z}_k$  (14)

The second step is to fix W, H, D̂, and D_k, and solve for P. Because

$\|Y - X^T W P\|_{2,1} = \mathrm{Tr}\big((Y - X^T W P)^T H (Y - X^T W P)\big) = \mathrm{Tr}(Y^T H Y) - 2\mathrm{Tr}(Y^T H X^T W P) + \mathrm{Tr}(W^T X H X^T W)$  (15)

Then the subproblem becomes:

$\max_P \mathrm{Tr}(P Y^T H X^T W) \quad \text{s.t.} \quad P P^T = I$  (16)

The solution to Eq. (16) is given by Theorem 1.

The third step is to fix W and P, and update H via Eq. (8) and Eq. (9), D̂ via Eq. (10), and D_k via Eq. (11).

We repeat the above three steps iteratively until the predefined stopping criterion is satisfied. We summarize the whole algorithm in Alg. 1.

Step 2 in the iteration of Alg. 1 amounts to solving linear systems, which can be done efficiently. Thus, our algorithm can be applied to large-scale datasets.

Algorithm 1.

The algorithm to solve Eq. (6)

Input:
1. The training data X ∈ ℝ^{d×n} with label matrix Y ∈ ℝ^{n×m}.
2. The regularization parameters λ1 and λ2, and the rank s.
Output:
1. The matrices W ∈ ℝ^{d×s} and P ∈ ℝ^{s×m}.
Initialization:
1. Set t = 0; initialize H^(t) = I_{n×n}, D̂^(t) = I_{d×d}, D_k^(t) = I_{d×d} for all k = 1, …, s; randomly initialize P^(t) ∈ ℝ^{s×m} such that P^(t) (P^(t))^T = I ∈ ℝ^{s×s}.
Repeat:
1. Calculate z_k^(t), the k-th column of the matrix X H^(t) Y (P^(t))^T.
2. Calculate W^(t) column by column: w_k^(t) = (X H^(t) X^T + λ1 D̂^(t) + λ2 D_k^(t))^{-1} z_k^(t).
3. Calculate M^(t) = Y^T H^(t) X^T W^(t).
4. Compute the SVD of M^(t): M^(t) = U^(t) Σ^(t) (V^(t))^T.
5. Update P^(t+1) = V^(t) [I, 0] (U^(t))^T.
6. Update E^(t+1) = Y − X^T W^(t) P^(t+1).
7. Update H^(t+1)(i,i) = 1 / (2 ‖e^i_(t+1)‖_2), ∀i = 1, 2, …, n.
8. Update D̂^(t+1)(j,j) = 1 / (2 ‖w^j_(t)‖_2), ∀j = 1, 2, …, d.
9. Update D_k^(t+1)(j,j) = ‖w^j_(t)‖_1 / |w_jk^(t)|, ∀j = 1, 2, …, d.
10. Update t = t + 1.
Until convergence
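
Putting the pieces together, a minimal NumPy sketch of Alg. 1 under the notation above; the eps safeguards and the QR-based initialization of P are our assumptions for numerical stability, not prescribed by the paper.

```python
import numpy as np

def rlsr(X, Y, lam1, lam2, s, n_iter=50, eps=1e-8, seed=0):
    """Sketch of Alg. 1. X: d x n feature matrix, Y: n x m score matrix."""
    d, n = X.shape
    m = Y.shape[1]
    rng = np.random.default_rng(seed)
    H = np.eye(n)                          # loss reweighting, Eq. (9)
    D_hat = np.eye(d)                      # l2,1 reweighting, Eq. (10)
    D = [np.eye(d) for _ in range(s)]      # squared l1,2 reweighting, Eq. (11)
    P = np.linalg.qr(rng.standard_normal((m, s)))[0].T   # P P^T = I
    history = []
    for _ in range(n_iter):
        # Steps 1-2: closed-form update of each column of W, Eq. (14).
        Z = X @ H @ Y @ P.T
        A = X @ H @ X.T
        W = np.column_stack([
            np.linalg.solve(A + lam1 * D_hat + lam2 * D[k], Z[:, k])
            for k in range(s)
        ])
        # Steps 3-5: update P via Theorem 1 (SVD of M = Y^T H X^T W).
        U, _, Vt = np.linalg.svd(Y.T @ H @ X.T @ W, full_matrices=True)
        P = Vt.T @ U[:, :s].T              # P = V [I, 0] U^T, Eq. (17)
        # Steps 6-9: refresh E, H, D_hat, and D_k.
        E = Y - X.T @ W @ P
        H = np.diag(1.0 / (2.0 * np.linalg.norm(E, axis=1) + eps))
        D_hat = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        row_l1 = np.linalg.norm(W, ord=1, axis=1)
        D = [np.diag(row_l1 / (np.abs(W[:, k]) + eps)) for k in range(s)]
        # Track the objective of Eq. (6) to watch convergence.
        history.append(np.sum(np.linalg.norm(E, axis=1))
                       + lam1 * np.sum(np.linalg.norm(W, axis=1))
                       + lam2 * np.sum(row_l1 ** 2))
    return W, P, history
```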

3.2 Convergence Analysis

Theorem 1

The solution to the optimization problem max_P Tr(PM) s.t. PP^T = I, where P ∈ ℝ^{s×m}, M ∈ ℝ^{m×s}, and s < m, is

$P = V [I, 0] U^T$  (17)

where U and V come from the SVD of M, M = UΛV^T, I is the identity matrix, I ∈ ℝ^{s×s}, and 0 ∈ ℝ^{s×(m−s)} is the matrix with all-zero entries. [Φ, Ψ] denotes the horizontal concatenation of two matrices Φ and Ψ that have the same number of rows.

Proof

We compute the SVD of M, M = UΛV^T, where U ∈ ℝ^{m×m}, Λ ∈ ℝ^{m×s}, V ∈ ℝ^{s×s}. Then we have

$\mathrm{Tr}(PM) = \mathrm{Tr}(P U \Lambda V^T) = \mathrm{Tr}(\Lambda V^T P U) = \mathrm{Tr}(\Lambda Q) = \sum_{k=1}^s \lambda_{kk} q_{kk}$  (18)

where Q = V^T P U, Q ∈ ℝ^{s×m}. Note that s < m. λ_kk and q_kk are the k-th diagonal elements of Λ and Q, respectively. Moreover,

$Q Q^T = V^T P U U^T P^T V = I$  (19)

where I ∈ ℝ^{s×s} is the identity matrix. Thus, q_kk ≤ 1, ∀k = 1, 2, …, s. Therefore,

$\mathrm{Tr}(PM) = \sum_{k=1}^s \lambda_{kk} q_{kk} \le \sum_{k=1}^s \lambda_{kk}$  (20)

and the equality holds when q_kk = 1, ∀k = 1, 2, …, s. In other words, Tr(PM) reaches its maximum when Q = [I, 0]. Recalling that Q = V^T P U, the optimal solution to Eq. (16) is Eq. (17).
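
A small numerical check of Theorem 1 with hypothetical sizes (s = 3, m = 5): the SVD-based solution should achieve Tr(PM) equal to the sum of the singular values, and no other feasible P should beat it.

```python
import numpy as np

rng = np.random.default_rng(0)
s, m = 3, 5                                   # hypothetical sizes with s < m
M = rng.standard_normal((m, s))

U, sigma, Vt = np.linalg.svd(M, full_matrices=True)
P_star = Vt.T @ U[:, :s].T                    # P = V [I, 0] U^T, Eq. (17)

best = np.trace(P_star @ M)                   # equals the sum of singular values
print(np.isclose(best, sigma.sum()))          # True

# No random feasible Q (Q Q^T = I) should beat the SVD solution.
for _ in range(1000):
    Q = np.linalg.qr(rng.standard_normal((m, s)))[0].T
    assert np.trace(Q @ M) <= best + 1e-9
```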

The convergence of the Alg. 1 is summarized in the following theorem:

Theorem 2

Alg. 1 monotonically decreases the objective of the problem in Eq. (6) in each iteration and converges to a local optimum of the problem.

Proof

On one hand, denote the updated P by P̃. By Theorem 1, when we fix W, H, D̂, and D_k, we get:

$\mathrm{Tr}\big((Y - X^T W \tilde{P})^T H (Y - X^T W \tilde{P})\big) + \lambda_1 \mathrm{Tr}(W^T \hat{D} W) + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k \le \mathrm{Tr}\big((Y - X^T W P)^T H (Y - X^T W P)\big) + \lambda_1 \mathrm{Tr}(W^T \hat{D} W) + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k$  (21)

Plugging in the definition of H from Eq. (9), we obtain:

$\sum_{i=1}^n \frac{\|\tilde{\mathbf{e}}^i\|_2^2}{2\|\mathbf{e}^i\|_2} + \lambda_1 \sum_{k=1}^s \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k \le \sum_{i=1}^n \frac{\|\mathbf{e}^i\|_2^2}{2\|\mathbf{e}^i\|_2} + \lambda_1 \sum_{k=1}^s \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k$  (22)

where ẽ^i denotes the i-th row of Ẽ = Y − X^T W P̃. At the same time, beginning with $(\|\tilde{\mathbf{e}}^i\|_2 - \|\mathbf{e}^i\|_2)^2 \ge 0$, we can get

$\sum_{i=1}^n \left( \|\tilde{\mathbf{e}}^i\|_2 - \frac{\|\tilde{\mathbf{e}}^i\|_2^2}{2\|\mathbf{e}^i\|_2} \right) \le \sum_{i=1}^n \left( \|\mathbf{e}^i\|_2 - \frac{\|\mathbf{e}^i\|_2^2}{2\|\mathbf{e}^i\|_2} \right)$  (23)

Therefore, adding Eq. (22) and Eq. (23) together, we get

$\|Y - X^T W \tilde{P}\|_{2,1} + \lambda_1 \sum_{k=1}^s \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k \le \|Y - X^T W P\|_{2,1} + \lambda_1 \sum_{k=1}^s \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \sum_{k=1}^s \mathbf{w}_k^T D_k \mathbf{w}_k$  (24)

On the other hand, denote the updated W by W̃ when we fix P and H. According to Step 2 in the Repeat part of Alg. 1, we have:

$\mathrm{Tr}(Y^T H Y) + \sum_{k=1}^s \left( -2\mathbf{z}_k^T \tilde{\mathbf{w}}_k + \tilde{\mathbf{w}}_k^T (X H X^T) \tilde{\mathbf{w}}_k + \lambda_1 \tilde{\mathbf{w}}_k^T \hat{D} \tilde{\mathbf{w}}_k + \lambda_2 \tilde{\mathbf{w}}_k^T D_k \tilde{\mathbf{w}}_k \right) \le \mathrm{Tr}(Y^T H Y) + \sum_{k=1}^s \left( -2\mathbf{z}_k^T \mathbf{w}_k + \mathbf{w}_k^T (X H X^T) \mathbf{w}_k + \lambda_1 \mathbf{w}_k^T \hat{D} \mathbf{w}_k + \lambda_2 \mathbf{w}_k^T D_k \mathbf{w}_k \right)$  (25)

where zk is the k-th column of XHY PT.

We plug the definitions of D̂ and D_k from Eq. (10) and Eq. (11), respectively, into Eq. (25), and have:

$\mathrm{Tr}(Y^T H Y) + \sum_{k=1}^s \left( -2\mathbf{z}_k^T \tilde{\mathbf{w}}_k + \tilde{\mathbf{w}}_k^T (X H X^T) \tilde{\mathbf{w}}_k \right) + \lambda_1 \sum_{k=1}^s \sum_{j=1}^d \frac{\tilde{w}_{jk}^2}{2\|\mathbf{w}^j\|_2} + \lambda_2 \sum_{k=1}^s \sum_{j=1}^d \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} \tilde{w}_{jk}^2 \le \mathrm{Tr}(Y^T H Y) + \sum_{k=1}^s \left( -2\mathbf{z}_k^T \mathbf{w}_k + \mathbf{w}_k^T (X H X^T) \mathbf{w}_k \right) + \lambda_1 \sum_{k=1}^s \sum_{j=1}^d \frac{w_{jk}^2}{2\|\mathbf{w}^j\|_2} + \lambda_2 \sum_{k=1}^s \sum_{j=1}^d \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} w_{jk}^2$  (26)

Similarly, beginning with $(\|\tilde{\mathbf{w}}^j\|_2 - \|\mathbf{w}^j\|_2)^2 \ge 0$, we get

$\|\tilde{\mathbf{w}}^j\|_2 - \frac{\|\tilde{\mathbf{w}}^j\|_2^2}{2\|\mathbf{w}^j\|_2} \le \|\mathbf{w}^j\|_2 - \frac{\|\mathbf{w}^j\|_2^2}{2\|\mathbf{w}^j\|_2}$  (27)

Therefore,

$\sum_{j=1}^d \|\tilde{\mathbf{w}}^j\|_2 - \sum_{j=1}^d \sum_{k=1}^s \frac{\tilde{w}_{jk}^2}{2\|\mathbf{w}^j\|_2} \le \sum_{j=1}^d \|\mathbf{w}^j\|_2 - \sum_{j=1}^d \sum_{k=1}^s \frac{w_{jk}^2}{2\|\mathbf{w}^j\|_2}$  (28)

Meanwhile, according to the AM-GM inequality, we have

$\sum_{k=1}^s \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} \tilde{w}_{jk}^2 \ge \left( \|\tilde{\mathbf{w}}^j\|_1 \right)^2$  (29)

Thus,

$\left( \|\tilde{\mathbf{w}}^j\|_1 \right)^2 - \sum_{k=1}^s \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} \tilde{w}_{jk}^2 \le 0 = \left( \|\mathbf{w}^j\|_1 \right)^2 - \sum_{k=1}^s \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} w_{jk}^2$, and hence $\sum_{j=1}^d \left( \|\tilde{\mathbf{w}}^j\|_1 \right)^2 - \sum_{k=1}^s \sum_{j=1}^d \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} \tilde{w}_{jk}^2 \le 0 = \sum_{j=1}^d \left( \|\mathbf{w}^j\|_1 \right)^2 - \sum_{k=1}^s \sum_{j=1}^d \frac{\|\mathbf{w}^j\|_1}{|w_{jk}|} w_{jk}^2$  (30)

Summing up Eq. (26), λ1 × Eq. (28), and λ2 × Eq. (30), we arrive at the following conclusion:

$\mathrm{Tr}\big((Y - X^T \tilde{W} P)^T H (Y - X^T \tilde{W} P)\big) + \lambda_1 \|\tilde{W}\|_{2,1} + \lambda_2 \|\tilde{W}\|_{1,2}^2 \le \mathrm{Tr}\big((Y - X^T W P)^T H (Y - X^T W P)\big) + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,2}^2$  (31)

Updating W and P alternately, we arrive at our goal:

$\|Y - X^T \tilde{W} \tilde{P}\|_{2,1} + \lambda_1 \|\tilde{W}\|_{2,1} + \lambda_2 \|\tilde{W}\|_{1,2}^2 \le \|Y - X^T W P\|_{2,1} + \lambda_1 \|W\|_{2,1} + \lambda_2 \|W\|_{1,2}^2 \quad \text{s.t.} \quad \tilde{P}\tilde{P}^T = I, \; P P^T = I$  (32)

In other words, Alg. 1 monotonically decreases the objective function of Eq. (6) in each iteration, so it finally converges.
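
A hedged usage sketch on synthetic data (shapes and parameters are our assumptions): using the rlsr sketch given after Alg. 1, the recorded objective values of Eq. (6) should be non-increasing across iterations, matching Theorem 2 up to the eps safeguards.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 100))   # d = 20 features, n = 100 samples
Y = rng.standard_normal((100, 5))    # m = 5 cognitive scores

# rlsr is the sketch given after Alg. 1; history holds the Eq. (6)
# objective value after each outer iteration.
W, P, history = rlsr(X, Y, lam1=0.1, lam2=0.1, s=3, n_iter=30)
print("monotone decrease:", bool(np.all(np.diff(history) <= 1e-6)))
```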

4 Experimental Results

In this section, we evaluate the prediction performance of the proposed method by applying it to the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort (adni.loni.usc.edu), where a wide range of imaging markers measured over a period of 2 years are examined and associated with cognitive scores relevant to AD.

4.1 Data Descriptions

We apply the proposed method to the ADNI cohort to predict the cognitive scores of the participants from each of two types of imaging phenotypes, i.e., FreeSurfer markers and voxel-based morphometry (VBM) markers. The detailed information is shown in Table 1. Mean modulated gray matter measures obtained from 90 target regions of interest, normalized by the total intracranial volume, were extracted as features.

Table 1.

Numbers of participants in the experiments using two different types of imaging markers

Imaging phenotypes #Total #AD #MCI #HC
FreeSurfer 496 99 225 172
VBM 440 85 203 152

4.2 Performance Comparison on the ADNI Cohort

First, we aim to identify a set of informative markers that are closely related to pathological changes due to AD. We compared our method against three closely related algorithms: multivariate ridge regression (RR), joint l2,1-norm minimization (l2,1) on both the loss function and the regularization [Nie et al., 2010], and linear regression with trace norm regularization. These competing methods are all widely used in statistical learning and brain image analysis.

In all experiments, we automatically tune the regularization parameters by selecting among the values {10^r : r ∈ {−5, …, 5}} using a standard 5-fold cross-validation strategy. After the algorithm converges, we sort the row indices of the matrix W by the sum of the absolute values in each row, and features are selected by the top-ranked indices. To measure prediction performance, we compute the root mean square error (RMSE) between the predicted scores and the ground truth.
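
A sketch of this evaluation protocol; the ranking rule follows the text above, while the helper names and the usage note are our own illustration.

```python
import numpy as np

def rank_features(W):
    # Rank feature (row) indices by the sum of absolute weights, descending.
    return np.argsort(np.sum(np.abs(W), axis=1))[::-1]

def rmse(y_true, y_pred):
    # Root mean square error between predicted and ground-truth scores.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# e.g., top20 = rank_features(W)[:20] gives the 20 top-ranked markers,
# which would then be fed to ridge regression for evaluation.
```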

The prediction experiment, evaluated by ridge regression, is repeated 100 times and the average results are reported in Fig. 2. As shown in Fig. 2, the prediction results of our method consistently outperform the competing methods in nearly all test cases for all the cognitive tasks, except for a few points in Fig. 2f, where our method catches up quickly as more features are selected. The proposed method performs best for the following reasons. RR assumes the cognitive measures at different time points to be independent, which neglects the correlations over time. For l2,1, since the pathological changes of brain structures due to AD usually do not occur in pre-identified regions with certain shapes, it is difficult to define meaningful feature groups, which makes l2,1 perform worse. The trace norm is a good way to seek the underlying interrelations between cognitive scores, but it ignores the fact that the informative markers related to AD occupy only a small portion of all the imaging measures. The proposed method, in contrast, not only detects group structure within the longitudinal phenotypic neuroimaging markers but also captures the correlations among cognitive measures. In addition, for ease of comparison, we also list the RMSE using the top 10 and top 30 selected features evaluated by ridge regression in Table 2.

Figure 2. RMSE of four feature selection algorithms on different cognitive assessment scores.

Table 2.

Prediction performance measured by RMSE using the top 10 and top 30 selected features

                    | RMSE of top 10 features          | RMSE of top 30 features
                    | RR     l2,1   Trace  Proposed    | RR     l2,1   Trace  Proposed
FreeSurfer  FLUENCY | 0.8777 0.8849 0.8982 0.8328      | 0.9411 0.9145 0.9560 0.8710
            RAVLT   | 0.8202 0.8073 0.8066 0.7685      | 0.8245 0.8132 0.8150 0.7726
            TRAILS  | 0.8467 0.8471 0.8441 0.8110      | 0.8970 0.8820 0.8923 0.8302
VBM         FLUENCY | 0.8937 0.8937 0.8906 0.8639      | 0.9601 0.9501 0.9555 0.8891
            RAVLT   | 0.8420 0.8682 0.8215 0.8610      | 0.8834 0.8779 0.8781 0.8459
            TRAILS  | 0.8758 0.8899 0.8754 0.8667      | 0.9273 0.9297 0.9241 0.8719

4.3 Identification of Informative Markers

The primary goal of the proposed method is to identify informative markers that are important for AD diagnosis and prediction. Therefore, we examine the imaging markers selected by our method and show them in Fig. 3. Due to limited display size, we provide only one tenth of the feature names for both FreeSurfer and VBM markers. As shown in Fig. 3, we observe that hippocampal measures (LHippocampus, RHippocampus, LHippVol and RHippVol) are among the top selected features. These findings accord with the established knowledge that, in the pathological pathway of AD, the hippocampus is one of the regions that exhibit Alzheimer-related changes [Braak and Braak, 1991; Delacourte et al., 1999].

Figure 3. Heat maps of our learned weight matrices on different cognitive assessment scores.

In summary, the identified neuroimaging markers are highly suggestive and effective for tracking the progression of AD, since they strongly agree with existing research findings. This also illustrates the validity of the selected imaging-cognition associations for revealing the relationships between MRI measures and cognitive scores.

5 Conclusion

To reveal the relationships between cognitive measures and neuroimaging markers, we proposed a novel robust low-rank structured sparse regression model, which selects the most informative imaging markers to predict the cognitive scores for complex brain disorders. Using the new mixed structured sparsity-inducing norms and the low-rank approximation, the proposed method can efficiently identify effective neuroimaging markers while utilizing the underlying interrelation structures between different cognitive measures. In addition, we provide an efficient algorithm with proved convergence. Validation experiments conducted on the ADNI cohort demonstrate the promise of the proposed method.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China 61572388, and U.S. NIH R01 AG049371, NSF IIS 1302675, IIS 1344152, DBI 1356628, IIS 1619308, IIS 1633753.

References

1. Braak Heiko, Braak Eva. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica. 1991;82(4):239–259. doi: 10.1007/BF00308809.
2. Chang Xiaojun, Yang Yi. Semi-supervised feature analysis by mining correlations among multiple tasks. IEEE Transactions on Neural Networks and Learning Systems. 2016. doi: 10.1109/TNNLS.2016.2582746.
3. Chang Xiaojun, Ma Zhigang, Yang Yi, Zeng Zhiqiang, Hauptmann Alexander G. Bi-level semantic representation analysis for multimedia event detection. IEEE Transactions on Cybernetics. 2017;47(5):1180–1197. doi: 10.1109/TCYB.2016.2539546.
4. Dailey Christina. The impact of Alzheimer's disease - the silent killer. JCCC Honors Journal. 2017;7(2):1.
5. Delacourte A, David JP, Sergeant N, Buee L, Wattez A, Vermersch P, Ghozali F, Fallet-Bianco C, Pasquier F, Lebert F, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology. 1999;52(6):1158. doi: 10.1212/wnl.52.6.1158.
6. Deng Cheng, Lv Zongting, Liu Wei, Huang Junzhou, Tao Dacheng, Gao Xinbo. Multi-view matrix decomposition: A new scheme for exploring discriminative information. IJCAI. 2015:3438–3444.
7. Huo Zhouyuan, Shen Dinggang, Huang Heng. New multi-task learning model to predict Alzheimer's disease cognitive assessment. International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer; 2016. pp. 317–325.
8. Ji Shuiwang, Ye Jieping. An accelerated gradient method for trace norm minimization. Proceedings of the 26th Annual International Conference on Machine Learning; ACM; 2009. pp. 457–464.
9. Jie Biao, Zhang Daoqiang, Cheng Bo, Shen Dinggang. Manifold regularized multitask feature learning for multimodality disease classification. Human Brain Mapping. 2015;36(2):489–507. doi: 10.1002/hbm.22642.
10. Lee Seunghak, Zhu Jun, Xing Eric P. Adaptive multi-task lasso: with application to eQTL detection. Advances in Neural Information Processing Systems. 2010:1306–1314.
11. Liu Manhua, Zhang Daoqiang, Shen Dinggang, et al.; Alzheimer's Disease Neuroimaging Initiative. Identifying informative imaging biomarkers via tree structured sparse learning for AD diagnosis. Neuroinformatics. 2014;12(3):381–394. doi: 10.1007/s12021-013-9218-x.
12. Liu Mingxia, Zhang Daoqiang, Shen Dinggang. View-centralized multi-atlas classification for Alzheimer's disease diagnosis. Human Brain Mapping. 2015;36(5):1847–1865. doi: 10.1002/hbm.22741.
13. Moradi Elaheh, Pepe Antonietta, Gaser Christian, Huttunen Heikki, Tohka Jussi, et al.; Alzheimer's Disease Neuroimaging Initiative. Machine learning framework for early MRI-based Alzheimer's conversion prediction in MCI subjects. NeuroImage. 2015;104:398–412. doi: 10.1016/j.neuroimage.2014.10.002.
14. Nie Feiping, Huang Heng, Cai Xiao, Ding Chris H. Efficient and robust feature selection via joint l2,1-norms minimization. Advances in Neural Information Processing Systems. 2010:1813–1821.
15. Nie Liqiang, Zhang Luming, Meng Lei, Song Xuemeng, Chang Xiaojun, Li Xuelong. Modeling disease progression via multisource multitask learners: A case study with Alzheimer's disease. IEEE Transactions on Neural Networks and Learning Systems. 2016. doi: 10.1109/TNNLS.2016.2520964.
16. Obozinski Guillaume, Taskar Ben, Jordan Michael I. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. 2010;20(2):231–252.
17. Thompson Paul M, Hayashi Kiralee M, De Zubicaray Greig, Janke Andrew L, Rose Stephen E, Semple James, Herman David, Hong Michael S, Dittmer Stephanie S, Doddrell David M, et al. Dynamics of gray matter loss in Alzheimer's disease. Journal of Neuroscience. 2003;23(3):994–1005. doi: 10.1523/JNEUROSCI.23-03-00994.2003.
18. Wang Hua, Nie Feiping, Huang Heng, Risacher Shannon, Ding Chris, Saykin Andrew J, Shen Li, et al. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. International Conference on Computer Vision (ICCV); IEEE; 2011. pp. 557–562.
19. Wang Hua, Nie Feiping, Huang Heng, Risacher Shannon, Saykin Andrew J, Shen Li, et al. Identifying AD-sensitive and cognition-relevant imaging biomarkers via joint classification and regression. International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer; 2011. pp. 115–123.
20. Wang Xiaoqian, Shen Dinggang, Huang Heng. Prediction of memory impairment with MRI data: A longitudinal study of Alzheimer's disease. International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer; 2016. pp. 273–281.
21. Yang Yanhua, Deng Cheng, Gao Shangqian, Liu Wei, Tao Dapeng, Gao Xinbo. Discriminative multi-instance multitask learning for 3D action recognition. IEEE Transactions on Multimedia. 2017;19(3):519–529.
