Abstract
The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework for multivariate regression with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among the multivariate responses. SPReM is devised specifically to address the low statistical power of many standard statistical approaches, such as Hotelling's T2 test or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for it. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we systematically investigate the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis show that SPReM outperforms other state-of-the-art methods.
Keywords: heritability ratio, imaging genetics, multivariate regression, projection regression, sparse, wild bootstrap
1 Introduction
Multivariate regression modeling with a multivariate response y ∈ ℝq and a multivariate covariate x ∈ ℝp is a standard statistical tool in modern high-dimensional inference, with wide applications in large-scale studies such as genome-wide association studies (GWAS) and neuroimaging. For instance, in GWAS, the primary problem of interest is to identify genetic variants (x) that cause phenotypic variation (y). Specifically, in imaging genetics, multivariate imaging measures (y), such as volumes of regions of interest (ROIs), are phenotypic variables, whereas covariates (x) include single nucleotide polymorphisms (SNPs), age, and gender, among others. The joint analysis of imaging and genetic data may ultimately lead to discoveries of genes for neuropsychiatric and neurological disorders such as autism and schizophrenia (Scharinger et al., 2010; Paus, 2010; Peper et al., 2007; Chiang et al., 2011a,b). Moreover, in many neuroimaging studies, there is great interest in using imaging measures (x), such as functional imaging data and cortical and subcortical structures, to predict multiple clinical and/or behavioral variables (y) (Knickmeyer et al., 2008; Lenroot and Giedd, 2006). This motivates us to systematically investigate a multivariate linear model with a multivariate response y and a multivariate covariate x.
Throughout this paper, we consider n independent observations (yi, xi) and a Multivariate Linear Model (MLM) given by
$Y = XB + E,$  (1)
where Y = (y1, …, yn)T, X = (x1, …, xn)T, and B = (βjl) is a p × q coefficient matrix with rank(B) = r* ≤ min(p, q). Moreover, the error term E = (e1, …, en)T has E(ei) = 0 and Cov(ei) = ΣR for all i, where ΣR is a q × q matrix. Many hypothesis testing problems of interest, such as comparison across groups, can often be formulated as
$H_0: CB = B_0 \quad \text{versus} \quad H_1: CB \neq B_0,$  (2)
where C is an r × p matrix and B0 is an r × q matrix. Without loss of generality, we center the covariates, standardize the responses, and assume rank(C) = r.
We focus on a specific setting in which q is relatively large but p is relatively small. This setting is general enough to cover two-sample (or multi-sample) hypothesis testing for high-dimensional data (Chen and Qin, 2010; Lopes et al., 2011). There are at least three major challenges: (i) a large number of regression parameters; (ii) a large covariance matrix; and (iii) correlations among the multivariate responses. When the numbers of responses and covariates are even moderately large, fitting the conventional MLM requires estimating a p × q matrix of regression coefficients, whose pq entries can far exceed n. Although accounting for complicated correlations among multiple responses is important for improving the overall prediction accuracy of multivariate analysis (Breiman and Friedman, 1997; Cook et al., 2010), it requires estimating q(q + 1)/2 unknown parameters in an unstructured covariance matrix.
There is great interest in developing efficient methods for handling MLMs with large q. Four popular traditional methods are the mass univariate analysis, Hotelling's T2 test, partial least squares regression, and dimension reduction methods. As pointed out by Klei et al. (2008) and many others, testing each response variable individually in the mass univariate analysis incurs a substantial penalty for controlling multiplicity. Hotelling's T2 test is not well defined when q > n, and even when q ≤ n, its power can be very low if q is nearly as large as n. Partial least squares regression (PLSR) finds a linear regression model by projecting y and x onto a smaller latent space (Chun and Keles, 2010; Krishnan et al., 2011), but it focuses on prediction and classification. Although dimension reduction techniques, such as principal component analysis (PCA), have been used to reduce the dimensions of both the responses and the covariates (Formisano et al., 2008; Kherif et al., 2002; Rowe and Hoffmann, 2006; Teipel et al., 2007), most of these methods ignore the variation of the covariates and their associations with the responses. Thus, such methods can be sub-optimal for our problem.
Recent developments include regularization methods and envelope models (Peng et al., 2010; Tibshirani, 1996; Breiman and Friedman, 1997; Cook et al., 2010, 2013; Lin et al., 2012). Cook, Li and Chiaromonte (2010) developed a powerful envelope modeling framework for MLMs. Envelope methods use dimension reduction techniques to remove immaterial information, while achieving efficient estimation of the regression coefficients by accounting for correlations among the response variables; however, existing envelope methods are limited to the n > max(p, q) scenario. Recently, much attention has been given to regularization methods for enforcing sparsity in B (Peng et al., 2010; Tibshirani, 1996). These regularization methods, however, do not provide standard inference tools (e.g., standard errors) for the regression coefficient matrix B. Lin et al. (2012) developed a projection regression model (PRM) and its associated estimation procedure to assess the relationship between a multivariate phenotype and a set of covariates, but without theoretical justification.
This paper presents a new general framework, called sparse projection regression model (SPReM), for simultaneously performing dimension reduction, response selection, estimation, and testing in a general high dimensional MLM setting. We introduce two novel heritability ratios, which extend the idea of principal components of heritability from familial studies (Klei et al., 2008; Ott and Rabinowitz, 1999), for MLM and overcome over-fitting and noise accumulation in high dimensional data by enforcing the sparsity constraint. We develop a fast algorithm for both sparse unit rank projection (SURP) and sparse multi-rank projection (SMURP). Furthermore, a test procedure based on the wild-bootstrap method is proposed, which leads to a single p–value for the test of an association between all response variables and covariates of interest, such as genetic markers. Simulations show that our method can control the overall Type I error well, while achieving high statistical power.
Section 2 of this paper introduces the SPReM framework, including a novel deflation procedure that extracts the most informative directions for testing hypotheses of interest. Section 3 establishes the asymptotic theory. Simulation studies and an imaging genetics example are used to examine the finite sample performance of SPReM in Section 4. We present concluding remarks in Section 5 and collect the assumptions and proofs in Section 6.
2 Sparse Projection Regression Model
2.1 Model Setup and Heritability Ratios
We introduce SPReM as follows. The key idea of our SPReM is to appropriately project yi in a high-dimensional space onto a low-dimensional space, while accounting for the correlation structure ΣR among the response variables and the hypothesis test in (2). Let W = [w1, …, wk] be a q × k nonrandom and unknown direction matrix, where wj are q × 1 vectors. A projection regression model (PRM) is given by
$W^T y_i = \beta_w^T x_i + \varepsilon_i,$  (3)
where βw = BW is a p × k regression coefficient matrix and the random vector εi has E(εi) = 0 and Cov(εi) = WTΣRW. When k = 1, PRM reduces to the pseudo-trait model considered in (Amos et al., 1990; Amos and Laing, 1993; Klei et al., 2008; Ott and Rabinowitz, 1999). If k ≪ min(n, q) and W were known, then one could use likelihood (or estimating equation) based methods to efficiently estimate βw, and (2) would reduce approximately to
$H_{0W}: C\beta_w = b_0 \quad \text{versus} \quad H_{1W}: C\beta_w \neq b_0,$  (4)
where Cβw = CBW and b0 = B0W. In this case, the number of null hypotheses in (4) is much smaller than that of (2). It is also expected that different W’s strongly influence the statistical power of testing the hypotheses in (2).
A fundamental question arises: how do we determine an 'optimal' W that achieves good statistical power for testing (2)? To determine W, we develop a novel deflation approach that determines the columns of W sequentially, from w1 to wk. We focus below on how to determine w1 and then discuss the extension to k > 1.
To determine an optimal w1, we consider two principles. The first principle is to maximize the mean value of the square of the signal-to-noise ratio, called the heritability ratio, for model (3). For each i, the signal-to-noise ratio in model (3) is defined as the ratio of mean to standard deviation of a signal or measurement wT yi, denoted by SNRi = wTBT xi/(wTΣRw)0.5. Thus, the heritability ratio (HR) is given by
$\mathrm{HR}(w) = n^{-1}\sum_{i=1}^{n}\mathrm{SNR}_i^2 = \frac{w^T B^T \hat{\Sigma}_X B w}{w^T \Sigma_R w},$  (5)

where $\hat{\Sigma}_X = n^{-1}\sum_{i=1}^{n} x_i x_i^T = n^{-1}X^T X$. The HR has several important interpretations. If the xi are independently and identically distributed (i.i.d.) with E(xi) = 0 and Cov(xi) = ΣX, then as n → ∞, we have

$\mathrm{HR}(w) \to_p \frac{w^T B^T \Sigma_X B w}{w^T \Sigma_R w},$
where →p denotes convergence in probability. Thus, HR(w) is close to the ratio of the variance of the signal wTBTxi to that of the noise εi. Moreover, HR(w) is close to the heritability ratio considered in (Amos et al., 1990; Amos and Laing, 1993; Klei et al., 2008; Ott and Rabinowitz, 1999) for familial studies, but we define HR from a different perspective. With this new perspective, one can easily define HR for more general designs, such as cross-sectional or longitudinal designs. One might directly maximize HR(w) to calculate an 'optimal' w1, but such a w1 can be sub-optimal for testing the hypotheses in (2), as discussed below.
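To make (5) concrete, the following minimal Python sketch evaluates the empirical heritability ratio for a given direction; the function name and the toy inputs are illustrative and not part of the paper.

```python
import numpy as np

def heritability_ratio(w, B, X, Sigma_R):
    """Empirical HR(w) from (5): the average squared signal-to-noise
    ratio of the projected responses w^T y_i under model (1).
    w: (q,) direction; B: (p, q) coefficients; X: (n, p) centered design;
    Sigma_R: (q, q) error covariance."""
    signal = X @ B @ w                 # the n signals w^T B^T x_i
    noise_var = w @ Sigma_R @ w        # Var(w^T e_i)
    return np.mean(signal ** 2) / noise_var

# Toy check: only the first response carries the covariate signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
B = np.zeros((3, 5)); B[0, 0] = 1.0
print(heritability_ratio(np.eye(5)[0], B, X, np.eye(5)))
```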
The second principle is to explicitly account for the hypotheses in (2) under model (1) and the reduced ones in (4) under model (3). We define four spaces associated with the null and alternative hypotheses of (2) and (4) as follows:

$S_{H_0} = \{B : CB = B_0\}, \quad S_{H_1} = \{B : CB \neq B_0\},$
$S_{H_W} = \{B : CBW = B_0 W\}, \quad S_{H_{1W}} = \{B : CBW \neq B_0 W\}.$

It can be shown that they satisfy the following relationship:

$S_{H_0} \subseteq S_{H_W} \quad \text{and} \quad S_{H_{1W}} \subseteq S_{H_1}.$
Due to potential information loss during dimension reduction, both SHW − SH0 and SH1 − SH1W may be nonempty, but we need to choose W such that SH1 − SH1W ≈ ∅. The next question is how to achieve this.
We consider a data transformation procedure. Let C1 be a (p−r) × p matrix such that
$\mathrm{rank}\big((C^T, C_1^T)^T\big) = p,$  (6)
that is, the rows of C1 complete the rows of C to a basis of ℝp. Let $D = (C^T, C_1^T)^T$ be a p × p matrix and $\tilde{x}_i = D^{-T}x_i$ be a p × 1 vector, where x̃i1 and x̃i2 are, respectively, the r × 1 and (p − r) × 1 subvectors of x̃i. We define $\tilde{B} = DB$, or B = D−1B̃, where B̃1 and B̃2 are, respectively, the first r rows and the last p − r rows of B̃. Therefore, model (3) can be rewritten as
$w^T y_i = w^T \tilde{B}_1^T \tilde{x}_{i1} + w^T \tilde{B}_2^T \tilde{x}_{i2} + w^T e_i.$  (7)
In (7), due to (6), we only need to consider the transformed covariate vector x̃i1, which contains useful information associated with B̃1 − B0 = CB − B0.
We define a generalized heritability ratio based on model (7). Specifically, for each i, we define a new signal-to-noise ratio as the ratio of mean to standard deviation of signal wT (B̃1 − B0)T x̃i1 + wT ei, denoted by SNRi,C = wT (B̃1 − B0)T x̃i1/(wTΣRw)0.5. The generalized heritability ratio is then defined as
$\mathrm{GHR}(w; C) = n^{-1}\sum_{i=1}^{n}\mathrm{SNR}_{i,C}^2 = \frac{w^T(\tilde{B}_1 - B_0)^T \hat{\Sigma}_{\tilde{x}_1}(\tilde{B}_1 - B_0)w}{w^T \Sigma_R w},$  (8)

where $\hat{\Sigma}_{\tilde{x}_1} = n^{-1}\sum_{i=1}^{n}\tilde{x}_{i1}\tilde{x}_{i1}^T$. If the xi are random, then we have
$\mathrm{GHR}(w; C) \to_p \frac{w^T \Sigma_C w}{w^T \Sigma_R w},$  (9)
where ΣC = (B̃1 − B0)T(D−TΣXD−1)(r,r)(B̃1 − B0), and (D−TΣXD−1)(r,r) is the upper r × r submatrix of D−TΣXD−1. In particular, if C = [Ir 0], then ΣC reduces to (B̃1 − B0)T(ΣX)(1,1)(B̃1 − B0), in which (ΣX)(1,1) is the upper r × r submatrix of ΣX. Thus, GHR(w; C) can be interpreted as the ratio of the variance of wT(B̃1 − B0)Tx̃i1 to that of wTei. We propose to calculate an optimal w* as follows:
$w^* = \operatorname*{argmax}_{w} \frac{w^T \Sigma_C w}{w^T \Sigma_R w}.$  (10)
We expect such an optimal w* to substantially reduce the sizes of both SH1 − SH1W and SHW − SH0, and thus to enhance the power of testing the hypotheses in (2). Without loss of generality, we assume B0 = 0 from now on.
We consider a simple example to illustrate the appealing properties of GHR(w; C).
Example We consider model (1) with p = q = 5 and wish to test for a nonzero effect of the first covariate on all five responses. In this case, r = 1, C = (1, 0, 0, 0, 0), B0 = (0, 0, 0, 0, 0), and D = I5, the 5 × 5 identity matrix. Without loss of generality, it is assumed that (ΣX)(1,1) = 1.
We consider three different cases of ΣR and B. In the first case, we set $\Sigma_R = \sigma^2 I_5$ and the first row of B to be (1, 0, 0, 0, 0). It follows from (8) that

$w^* = c_0 (1, 0, 0, 0, 0)^T,$

where c0 is any nonzero scalar. Therefore, w* picks out the first response, which is the only one associated with the first covariate.
In the second case, we set $\Sigma_R = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_5^2)$ with $\sigma_2^2 < \sigma_1^2$ and the first row of B to be (1, 1, 0, 0, 0). It follows from (8) that

$w^* = c_0 (\sigma_1^{-2}, \sigma_2^{-2}, 0, 0, 0)^T,$

where c0 is any nonzero scalar. Therefore, w* picks out both the first and second responses, with a larger weight on the second component. This is desirable, since β11 and β21 are equal in strength of effect and the noise level of the second response is smaller than that of the first.
In the third case, we set the first row of B to be (1, 1, 0, 0, 0), and the first and second columns of ΣR are set to σ2(1, ρ, 0, 0, 0)T and σ2(ρ, 1, 0, 0, 0)T, respectively. It follows from (8) that

$\mathrm{GHR}(w; C) = \frac{(w_1 + w_2)^2}{\sigma^2(w_1^2 + 2\rho w_1 w_2 + w_2^2) + Q(w_3, w_4, w_5)},$

where Q(w3, w4, w5) is a non-negative quadratic form of (w3, w4, w5). Thus, the optimal w* = c0(1, 1, 0, 0, 0)T chooses the first two responses with equal weight, since they are correlated with each other with the same variance and β11 = β21 = 1.
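All three cases are instances of the closed-form maximizer w* ∝ ΣR−1ℓ with ℓ = (CB)T, which also appears as the oracle solution in (16) below. A short numerical check of the third case, with illustrative values of σ2 and ρ, recovers the equal-weight solution:

```python
import numpy as np

# Third case: equal effects on the first two responses, which are
# correlated with common variance. The maximizer of a unit-rank
# generalized Rayleigh quotient is proportional to Sigma_R^{-1} ell.
sigma2, rho = 1.0, 0.5                     # illustrative values
Sigma_R = np.eye(5)
Sigma_R[:2, :2] = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
ell = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # ell = (CB)^T for this case

w_star = np.linalg.solve(Sigma_R, ell)
print(w_star / np.abs(w_star).max())       # approximately (1, 1, 0, 0, 0)
```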
For high-dimensional data, it is difficult to estimate w* accurately, since the sample covariance estimator Σ̂R can be ill-conditioned or singular when q > n. One possible solution is to focus on a small number of important features for testing; however, a naive search for the best subset is NP-hard. We develop a penalized procedure that addresses both problems while yielding a relatively accurate estimate of w. Let Σ̃R and Σ̂C be estimators of ΣR and ΣC, respectively. Here we use Σ̃R to denote a covariance estimator other than the sample covariance matrix Σ̂R. To obtain Σ̂C, we plug B̂, an estimator of B, into ΣC; without loss of generality, we consider the ordinary least squares estimator of B. By imposing a sparse structure on w1, we recast the optimization problem as
$\hat{w}_1 = \operatorname*{argmax}_{w} \frac{w^T \hat{\Sigma}_C w}{w^T \tilde{\Sigma}_R w} \quad \text{subject to} \quad \|w\|_1 \le t,$  (11)
where ‖·‖1 is the L1 norm and t > 0.
2.2 Sparse Unit Rank Projection
When r = 1, we call the problem in (10) the unit rank projection problem and its sparse version in (11) the sparse unit rank projection (SURP) problem. Many statistical problems, such as two-sample tests and marginal effect tests, can be formulated as unit rank projection problems (Lopes et al., 2011). We consider two cases: ℓ = (CB)T = 0 and ℓ = (CB)T ≠ 0. When ℓ = (CB)T = 0, the solution set of (8) is trivial, since any w ≠ 0 is a solution of (8). As discussed later, this property is extremely important for controlling the type I error rate.
When ℓ = (CB)T ≠ 0, (8) reduces to the following optimization problem:
$w^* = \operatorname*{argmax}_{w} \frac{(\ell^T w)^2}{w^T \Sigma_R w},$  (12)
where ℓ is the sole eigenvector of ΣC, since ΣC is a unit-rank matrix. To impose an L1 sparsity on w, we propose to solve the penalized version of (12) given by
$\max_{w} \ \ell^T w - \lambda\|w\|_1 \quad \text{subject to} \quad w^T \Sigma_R w \le 1.$  (13)
Although (13) can be solved by standard convex programming methods, such methods are too slow for most large-scale applications, such as imaging genetics. We therefore reformulate the problem below. Unless otherwise stated, we focus on ℓ = (CB)T ≠ 0.
By omitting a scaling factor that does not affect the generalized heritability ratio, we note that (12) is equivalent to the following:
$w_0 = \operatorname*{argmin}_{w} \left\{ \tfrac{1}{2} w^T \Sigma_R w - \ell^T w \right\}.$  (14)
We consider a penalized version of (14) as
$w_{0,\lambda} = \operatorname*{argmin}_{w} \left\{ \tfrac{1}{2} w^T \Sigma_R w - \ell^T w + \lambda\|w\|_1 \right\}.$  (15)
A nice property of (15) is that it does not explicitly involve the inequality constraint, which leads to fast computation. We refer to (14) as the oracle problem, since wλ converges to w0 as λ → 0. It can be shown that
$w_0 = \Sigma_R^{-1}\ell.$  (16)
We obtain an equivalence between (15) and (13) as follows.
Theorem 2.1 Problem (15) is equivalent to problem (13) and wλ ∝ w0,λ.
We discuss some connections between our SURP problem and the optimization problem considered in Fan et al. (2012) for performing classification in high dimensional space.
However, rather than recasting the problem as in (12) and then (15), they formulate it as

$\min_{w} \ w^T \Sigma_R w \quad \text{subject to} \quad \ell^T w = 1, \ \|w\|_1 \le c,$
which can further be reformulated as
$\min_{w} \ w^T \Sigma_R w + \lambda\|w\|_1 \quad \text{subject to} \quad \ell^T w = 1.$  (17)
Since (17) involves a linear equality constraint, they replace it by a quadratic penalty as
$w_{\lambda,\gamma} = \operatorname*{argmin}_{w} \ w^T \Sigma_R w + \lambda\|w\|_1 + \gamma(\ell^T w - 1)^2.$  (18)
This new formulation requires tuning λ and γ simultaneously, which can be computationally intensive. However, Fan et al. (2012) noted that the solution to (18) is not sensitive to γ, since the solution is always in the direction of $\Sigma_R^{-1}\ell$ when λ = 0, as validated by their simulations. Their formulation (17) is close to our formulation (15), which sheds some light on why wλ,γ is not sensitive to γ. Finally, we can show that the solution path of (15) is piecewise linear.
Proposition 2.2 Let ℓ ∈ ℝq be a constant vector and ΣR be positive definite. Then, w0,λ is a continuous piecewise linear function in λ.
We derive a coordinate descent algorithm to solve (15). Without loss of generality, suppose that w̃j for all j ≥ 2 are given and we need to optimize (15) with respect to w̃1. Partition $w = (\tilde{w}_1, \tilde{w}_2^T)^T$, $\ell = (\tilde{\ell}_1, \tilde{\ell}_2^T)^T$, and $\Sigma_R$ accordingly, where σ11, Σ12, and Σ22 are the corresponding subcomponents of ΣR. Up to terms not involving w̃1, the objective function (15) becomes

$f(\tilde{w}_1) = \tfrac{1}{2}\sigma_{11}\tilde{w}_1^2 + (\Sigma_{12}\tilde{w}_2 - \tilde{\ell}_1)\tilde{w}_1 + \lambda|\tilde{w}_1|.$

Then, taking the sub-gradient with respect to w̃1, we have

$\sigma_{11}\tilde{w}_1 + \Sigma_{12}\tilde{w}_2 - \tilde{\ell}_1 + \lambda\Gamma_1 = 0,$

where Γ1 = sign(w̃1) for w̃1 ≠ 0 and lies between −1 and 1 if w̃1 = 0. Let Sλ(t) = sign(t)(|t| − λ)+ be the soft-thresholding operator. Setting the sub-gradient to zero yields $\tilde{w}_1 = S_\lambda(\tilde{\ell}_1 - \Sigma_{12}\tilde{w}_2)/\sigma_{11}$. Based on this result, we obtain a coordinate descent algorithm as follows.
Algorithm 1
Initialize w at a starting point w(0) and set m = 0.
- Repeat:
- (b.1) Increase m by 1: m ← m + 1
- (b.2) For j = 1, …, q: if $|\tilde{\ell}_j - \Sigma_{j,-j} w_{-j}^{(m)}| \le \lambda$, then set $w_j^{(m)} = 0$; otherwise set $w_j^{(m)} = S_\lambda(\tilde{\ell}_j - \Sigma_{j,-j} w_{-j}^{(m)})/\sigma_{jj}$, where $\Sigma_{j,-j}$ is the j-th row of ΣR with σjj removed and $w_{-j}^{(m)}$ collects the most recently updated values of the other coordinates.
Until numerical convergence: we require |f(w(m))−f(w(m−1))| to be sufficiently small.
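A minimal Python sketch of Algorithm 1 follows; the function names are ours, and ℓ and ΣR are assumed to be supplied. Per Theorem 2.1, the minimizer of (15) returned below can be rescaled by (wTΣRw)−1/2 to recover a solution of (13) whenever it is nonzero.

```python
import numpy as np

def soft_threshold(t, lam):
    """S_lambda(t) = sign(t) * (|t| - lambda)_+."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def surp_coordinate_descent(ell, Sigma_R, lam, max_iter=500, tol=1e-8):
    """Coordinate descent for (15):
    minimize 0.5 * w' Sigma_R w - ell' w + lam * ||w||_1."""
    q = len(ell)
    w = np.zeros(q)
    f_old = np.inf
    for _ in range(max_iter):
        for j in range(q):
            # partial residual: exclude coordinate j from (Sigma_R w)_j
            r_j = ell[j] - Sigma_R[j] @ w + Sigma_R[j, j] * w[j]
            w[j] = soft_threshold(r_j, lam) / Sigma_R[j, j]
        f = 0.5 * w @ Sigma_R @ w - ell @ w + lam * np.abs(w).sum()
        if abs(f_old - f) < tol:   # convergence criterion of Algorithm 1
            break
        f_old = f
    return w
```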
2.3 Extension to Multi-rank Cases
In this subsection, we extend the sparse unit rank projection procedure to handle multiple rank test problems when r > 1. We propose the k–th projection direction as the solution to the following problem:
$w_k = \operatorname*{argmax}_{w} \frac{w^T \Sigma_C w}{w^T \Sigma_R w} \quad \text{subject to} \quad w^T \Sigma_R w_j = 0, \ j = 1, \ldots, k-1.$  (19)
It can be shown that (19) is equivalent to
$w_k = \operatorname*{argmax}_{w} \ w^T \Sigma_C w \quad \text{subject to} \quad w^T \Sigma_R w \le 1, \ w^T \Sigma_R w_j = 0, \ j = 1, \ldots, k-1.$  (20)
Following the reasoning in Witten and Tibshirani (2011), we recast (20) into an equivalent problem.
Proposition 2.3 Problem (20) is equivalent to the following problem:
$w_k = \operatorname*{argmax}_{w} \ w^T (\tilde{B}_1 - B_0)^T \Sigma_{11}^{1/2} P_k^{\perp} \Sigma_{11}^{1/2} (\tilde{B}_1 - B_0) w \quad \text{subject to} \quad w^T \Sigma_R w \le 1,$  (21)
where $P_k^{\perp}$ is the projection matrix onto the orthogonal complement of the space spanned by $\{\Sigma_{11}^{1/2}(\tilde{B}_1 - B_0)w_j : j = 1, \ldots, k-1\}$, in which Σ11 = (D−TΣXD−1)(r,r).
Based on Proposition 2.3, we consider several strategies of imposing the sparsity structure on wk. A simple strategy is to consider the following problem given by
$w_k = \operatorname*{argmax}_{w} \ w^T \hat{\Sigma}_C^{(k)} w - \lambda\|w\|_1 \quad \text{subject to} \quad w^T \tilde{\Sigma}_R w \le 1,$  (22)
where $\hat{\Sigma}_C^{(k)}$ denotes the plug-in estimate of the projected signal matrix in (21). When the rank of C is greater than 1, the problem in (22) is no longer convex, since it involves maximizing an objective function that is not concave. A potential solution is the minorization-maximization (MM) algorithm (Lange et al., 2000). Specifically, for any fixed w(m), a Taylor series expansion of $w^T \hat{\Sigma}_C^{(k)} w$ at w(m) gives
$w^T \hat{\Sigma}_C^{(k)} w \ge 2 w^{(m)T} \hat{\Sigma}_C^{(k)} w - w^{(m)T} \hat{\Sigma}_C^{(k)} w^{(m)}.$  (23)
Thus, the right-hand side of (23) minorizes the objective function of (22) at w(m) and yields a convex subproblem, which can be solved by standard convex optimization methods. However, in our extensive experience, the MM algorithm is too slow for most large-scale problems, such as imaging genetics.
To further improve computational efficiency, we consider a surrogate of (22). Recalling the second principle, we are only interested in extracting informative directions for testing the hypotheses of interest. We consider a spectral decomposition $(D^{-T}\Sigma_X D^{-1})_{(r,r)} = \sum_{j=1}^{r}\gamma_j \ell_j \ell_j^T$, where (γj, ℓj) are the eigenvalue-eigenvector pairs with γ1 ≥ γ2 ≥ … ≥ γr. Then, instead of solving (22), we propose to solve r SURP problems:
$w_j = \operatorname*{argmin}_{w} \left\{ \tfrac{1}{2} w^T \tilde{\Sigma}_R w - \ell^{(j)T} w + \lambda\|w\|_1 \right\}, \quad \ell^{(j)} = \sqrt{\gamma_j}\,(\tilde{B}_1 - B_0)^T \ell_j, \quad j = 1, \ldots, r.$  (24)
Solving (24) yields r sparse projection directions. Since (24) extracts the direction vectors sequentially according to the input signal ΣC, it may produce less informative directions than (22). However, this formulation leads to a fast computational algorithm, and our simulation results demonstrate its reasonable performance. Thus, (24) is preferred in practice.
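The sketch below illustrates one reading of (24), reusing surp_coordinate_descent from the earlier sketch: decompose the estimated Σ11 into eigen-pairs and feed each resulting unit-rank input to the SURP solver. The helper name and the exact form of the unit-rank inputs ℓ(j) are our assumptions.

```python
import numpy as np

def smurp_directions(B1_tilde, Sigma_11, Sigma_R, lam):
    """Sequential SURP approximation (24), a sketch.
    B1_tilde: (r, q) estimate of B~_1 - B_0; Sigma_11: (r, r) estimate
    of (D^{-T} Sigma_X D^{-1})_{(r,r)}; returns a q x r direction matrix."""
    gammas, L = np.linalg.eigh(Sigma_11)
    order = np.argsort(gammas)[::-1]       # gamma_1 >= ... >= gamma_r
    W = []
    for j in order:
        # unit-rank input built from the j-th eigen-pair of Sigma_11
        ell_j = np.sqrt(max(gammas[j], 0.0)) * (B1_tilde.T @ L[:, j])
        W.append(surp_coordinate_descent(ell_j, Sigma_R, lam))
    return np.column_stack(W)
```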
2.4 Test Procedure
We consider three statistics for testing H0W against H1W in (4). Based on model (3), we calculate the ordinary least squares estimate of βw, given by $\hat{\beta}_w = (X^T X)^{-1} X^T Y W$.
Subsequently, we calculate a k × k matrix, denoted by Tn, as follows:
$T_n = (C\hat{\beta}_w - b_0)^T \Sigma_{\tilde{\Omega}}^{-1} (C\hat{\beta}_w - b_0),$  (25)
where ΣΩ̃ is a consistent estimate of the covariance matrix of Cβ̂w − b0. Specifically, let β̃w be the restricted least squares (RLS) estimate of βw under H0W, which is given by

$\tilde{\beta}_w = \hat{\beta}_w - (X^T X)^{-1} C^T \{C (X^T X)^{-1} C^T\}^{-1} (C\hat{\beta}_w - b_0).$
Then, ΣΩ̃ can be constructed from the residuals of the restricted fit, YW − Xβ̃w. When k > 1, we use the determinant, the trace, and the largest eigenvalue of Tn as test statistics, which are given by
$W_n = \det(T_n), \quad Tr_n = \operatorname{trace}(T_n), \quad Roy_n = \operatorname{eig}_{\max}(T_n),$  (26)
where det, trace, and eigmax denote, respectively, the determinant, the trace, and the largest eigenvalue of a symmetric matrix. When k = 1, all three statistics in (26) reduce to the Wald-type (or Hotelling's T2) test statistic. For simplicity, we focus on Trn throughout the paper.
We propose a wild bootstrap method to improve the finite sample performance of the test statistic Trn. First, we fit model (1) under the null hypothesis (2) to calculate the estimated regression coefficient matrix, denoted by B̂0, with corresponding residuals $\hat{e}_i = y_i - \hat{B}_0^T x_i$ for i = 1, …, n. Then we generate G bootstrap samples $y_i^{(g)} = \hat{B}_0^T x_i + \tau_i^{(g)}\hat{e}_i$ for i = 1, …, n, where the $\tau_i^{(g)}$ are independently and identically distributed according to a distribution F, chosen here to be ±1 with equal probability. For each wild-bootstrap sample, we repeat the estimation of the optimal weights and the calculation of the test statistic, obtaining $Tr_n^{(g)}$. Subsequently, the p-value of Trn is computed as $G^{-1}\sum_{g=1}^{G} 1(Tr_n^{(g)} \ge Tr_n)$, where 1(·) is an indicator function.
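A sketch of this wild bootstrap loop appears below. For concreteness it takes the null design to be an intercept column (i.e., C selects all non-intercept rows of B); this choice, together with the names, is an assumption of the sketch. The statistic computation, including re-estimation of the projection directions, is abstracted into a user-supplied stat_fn.

```python
import numpy as np

def wild_bootstrap_pvalue(Y, X, stat_fn, G=1000, seed=0):
    """Wild bootstrap p-value for Tr_n, a sketch.
    stat_fn(Y, X) must return the scalar statistic, re-estimating the
    sparse directions on each call."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    X0 = np.ones((n, 1))                            # null design (assumed)
    B0_hat = np.linalg.lstsq(X0, Y, rcond=None)[0]  # fit under H0
    resid = Y - X0 @ B0_hat
    T_obs = stat_fn(Y, X)
    T_boot = np.empty(G)
    for g in range(G):
        tau = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
        Y_g = X0 @ B0_hat + resid * tau[:, None]    # wild bootstrap sample
        T_boot[g] = stat_fn(Y_g, X)
    return np.mean(T_boot >= T_obs)
```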
2.5 Tuning Parameter Selection
We consider several methods to select the tuning parameter λ. The first is cross validation (CV), which is primarily a way of measuring the predictive performance of a statistical model; however, CV can be computationally expensive for large-scale problems. The second is an information criterion, which has been widely used to measure the relative goodness of fit of a statistical model. However, neither of these two methods is applicable to SURP, since our primary interest is in finding informative directions for testing the null and alternative hypotheses of (2). If the null hypothesis is true, CB̂ contains only noise components and the estimated direction vectors should be random; in this case, the test statistics Trn, Wn, and Royn should not be sensitive to the value of λ. This motivates us to use the rejection rate to select the tuning parameter as follows:
(27)
where λmax is the largest λ for which w remains nonzero.
3 Asymptotic Theory
We investigate several theoretical properties of SURP and its associated estimator. By substituting Σ̃R and ℓ̂ = CB̂ into (15), we can calculate an estimate of w0 as
$\hat{w}_\lambda = \operatorname*{argmin}_{w} \left\{ \tfrac{1}{2} w^T \tilde{\Sigma}_R w - \hat{\ell}^T w + \lambda\|w\|_1 \right\}.$  (28)
The following question arises naturally: how close is ŵλ to w0? We address this question in Theorems 3.1 and 3.2.
We consider the scenario in which w0 has only a few nonzero components, that is, only a few response variables are associated with the covariates of interest. Such a scenario is common in many large-scale problems. We note that the sparsity of w0 requires neither ΣR nor ℓ to be sparse, and hence this assumption is quite flexible. Let S0 = {j : w0,j ≠ 0} be the active set of w0 = (w0,1, …, w0,q)T and let s0 be the number of elements in S0. We use the banded covariance estimator of ΣR (Bickel and Levina, 2008) and assume that ΣR belongs to the well-behaved covariance class 𝒰(ε0, α, C1), defined as

$\mathcal{U}(\varepsilon_0, \alpha, C_1) = \Big\{\Sigma : \max_{j'}\sum_{j:|j-j'|>k}|\sigma_{jj'}| \le C_1 k^{-\alpha} \text{ for all } k > 0, \ 0 < \varepsilon_0 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le 1/\varepsilon_0\Big\}.$

We have the following results.
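For reference, the banding operator applied to the sample residual covariance can be coded in a few lines; the function name is ours, and the choice of the banding parameter k is left to the user.

```python
import numpy as np

def banded_covariance(residuals, k):
    """Banded covariance estimator of Bickel and Levina (2008):
    keep only entries within k bands of the diagonal.
    residuals: (n, q) matrix of y_i - B_hat^T x_i."""
    S = np.cov(residuals, rowvar=False)   # sample covariance of residuals
    q = S.shape[0]
    j, jp = np.indices((q, q))
    return S * (np.abs(j - jp) <= k)      # B_k(Sigma_hat)
```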
Theorem 3.1 Assume that ΣR ∈ 𝒰(ε0, α, C1) and
(29)
in which the quantities γ(ε0, δ) and δ = δ(ε0) depend only on ε0. Then, with probability at least $1 - (q \vee n)^{-\eta_1} - (q \vee n)^{-\eta_2}$, we have
(30)
where C is a constant not depending on q and n. Furthermore, for ‖ℓ‖2 > δ0, we have
(31)
Theorem 3.1 gives an oracle inequality and the L2 convergence rate of ŵλ in the sparse case, which implies direction consistency and is important for ensuring the good performance of the test statistics. This result has several important implications. If the right-hand side of (31) tends to zero, then ‖ŵλ − w0‖2 converges to zero in probability. Therefore, SURP should perform well in extremely sparse cases with s0 ≪ n, which matters in practice, since such cases are common in many large-scale problems. Although we consider the banded covariance estimator of ΣR in Theorem 3.1 (Bickel and Levina, 2008), the convergence rate of ŵλ can also be established for other estimators of ΣR and ℓ, as follows.
Theorem 3.2 Suppose that ‖Σ̃R − ΣR‖2 = Op(an) = op(1) and ‖ℓ̂ − ℓ‖∞ = Op(bn) = op(1). Then
(32)
Furthermore, for ‖ℓ‖2 > δ0, we have
(33)
Theorem 3.2 gives the L2 convergence rate of ŵλ for any possible estimators of ΣR and ℓ. A direct implication is that we can consider other estimators of ΣR in order to achieve better estimation of ΣR under different assumptions of ΣR. For instance, if ΣR has an approximate factor structure with sparsity, then we may consider the principal orthogonal complement thresholding (POET) method in Fan et al. (2013) to estimate ΣR. Moreover, if we can achieve good estimation of ℓ for large p, then we can extend model (1) to the scenario with large p. We will systematically investigate these generalizations in our future work.
Remark The SPReM estimator ŵλ is closely connected with the estimators in Witten and Tibshirani (2011) and Fan et al. (2012) in the framework of penalized linear discriminant analysis. However, little is known about the theoretical properties of such estimators. To the best of our knowledge, Theorems 3.1 and 3.2 are the first results on the convergence rate of such estimators for the penalized projection problem (11).
Remark The SPReM estimator ŵλ does not have the oracle property due to the asymptotic bias introduced by the L1 penalty. See detailed discussions in (Fan and Li, 2001; Zou, 2006). However, our estimation procedure may be modified to achieve the oracle property by using some non-concave penalties or adaptive weights. We will investigate this issue in more depth in our future work.
4 Numerical Examples
4.1 Simulation 1: Two Sample Test in High Dimensions
In this subsection, we consider high-dimensional two-sample test problems and compare SPReM with the High-dimensional Two-Sample test (HTS) method of Chen and Qin (2010) and the Random Projection (RP) method of Lopes et al. (2011). Both HTS and RP are state-of-the-art methods for detecting a shift between the means of two high-dimensional normal distributions. Lopes et al. (2011) showed that the random projection method outperforms several competing methods when q/n converges to a constant or to ∞.
We simulated two sets of samples {y1, …, yn1} and {yn1+1, …, yn} from N(β1, ΣR) and N(β2, ΣR), respectively, where β1 and β2 are q × 1 mean vectors and ΣR = σ2(ρjj′), in which (ρjj′) is a q × q correlation matrix. We set n = 2n1 = 100 and let the response dimension q be 50, 100, 200, 400, and 800. We are interested in testing the null hypothesis H0 : β1 = β2 against H1 : β1 ≠ β2. This two-sample test problem is a special case of model (1) with n = n1 + n2, BT = [β1, β2], and C = (1, −1). Without loss of generality, we set β1 = β2 = 0 to assess the type I error rate, and we set the first ten components of β2 to 1 to assess power. We set σ2 to 1 and 3 and consider three different correlation matrices as follows; a data-generation sketch follows the list.
Case 1 is an independent covariance matrix with (ρjj′) = diag(1, …, 1).
Case 2 is a weak correlation matrix with ρjj′ = 1(j′ = j) + 0.3 · 1(j′ ≠ j).
Case 3 is a strong correlation matrix with ρjj′ = 0.8^{|j′−j|}.
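A minimal data-generation sketch for this simulation is given below; it draws both samples with a common Cholesky factor and, under the alternative, shifts the first ten components of β2 to 1 as described above. Function names are ours.

```python
import numpy as np

def make_correlation(case, q, rho=0.3):
    """Correlation matrices for Cases 1-3 of Simulation 1."""
    if case == 1:                                   # independence
        return np.eye(q)
    if case == 2:                                   # weak, exchangeable
        return (1 - rho) * np.eye(q) + rho * np.ones((q, q))
    j, jp = np.indices((q, q))                      # strong: 0.8^{|j'-j|}
    return 0.8 ** np.abs(j - jp)

def simulate_two_sample(q, n1=50, sigma2=1.0, case=1, shift=True, seed=0):
    """Two samples from N(beta1, Sigma_R) and N(beta2, Sigma_R)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(sigma2 * make_correlation(case, q))
    beta2 = np.zeros(q)
    if shift:
        beta2[:10] = 1.0                            # alternative hypothesis
    Y1 = rng.standard_normal((n1, q)) @ L.T         # beta1 = 0
    Y2 = beta2 + rng.standard_normal((n1, q)) @ L.T
    return Y1, Y2
```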
Simulation results are summarized in Tables 1 and 2. As expected, both HTS and RP perform worse as q grows, whereas SPReM works very well even for relatively large q, consistent with our theoretical results in Theorems 3.1 and 3.2. Moreover, HTS and RP cannot control the type I error rate well in all scenarios, whereas SPReM with the wild bootstrap works reasonably well. To the best of our knowledge, none of the existing methods for the two-sample test in high dimensions works well in this sparse setting. For Cases 2 and 3, w0 is not sparse, yet SPReM still performs reasonably well under these correlated scenarios, which may indicate the potential of extending SPReM and its theory to non-sparse cases. As expected, increasing σ2 decreases the statistical power for rejecting the null hypothesis. Since both SPReM and RP significantly outperform HTS, we increased q to 2,000 and present additional comparisons between SPReM and RP, based on 100 simulated data sets, in Figure 1.
Table 1. Simulation 1: empirical power and type I error rates of SPReM, RP, and HTS (σ2 = 1).
 | Power | | | | | Type I error | | | |
---|---|---|---|---|---|---|---|---|---|---
q | 50 | 100 | 200 | 400 | 800 | 50 | 100 | 200 | 400 | 800 |
case 1 | ||||||||||
SPReM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.035 | 0.060 | 0.055 | 0.040 | 0.035 |
RP | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
HTS | 0.965 | 0.320 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
case 2 | ||||||||||
SPReM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.060 | 0.075 | 0.055 | 0.045 | 0.050 |
RP | 1.000 | 1.000 | 1.000 | 1.000 | 0.970 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
HTS | 1.000 | 0.245 | 0.030 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
case 3 | ||||||||||
SPReM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.040 | 0.055 | 0.085 | 0.060 | 0.050 |
RP | 1.000 | 1.000 | 1.000 | 0.535 | 0.015 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 |
HTS | 1.000 | 0.140 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Table 2. Simulation 1: empirical power and type I error rates of SPReM, RP, and HTS (σ2 = 3).
 | Power | | | | | Type I error | | | |
---|---|---|---|---|---|---|---|---|---|---
q | 50 | 100 | 200 | 400 | 800 | 50 | 100 | 200 | 400 | 800 |
case 1 | ||||||||||
SPReM | 0.990 | 0.910 | 0.825 | 0.795 | 0.680 | 0.030 | 0.065 | 0.045 | 0.035 | 0.080 |
RP | 1.000 | 1.000 | 0.175 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
HTS | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
case 2 | ||||||||||
SPReM | 0.840 | 0.825 | 0.775 | 0.645 | 0.580 | 0.045 | 0.030 | 0.030 | 0.060 | 0.030 |
RP | 1.000 | 1.000 | 0.180 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
HTS | 0.105 | 0.015 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
case 3 | ||||||||||
SPReM | 0.780 | 0.755 | 0.590 | 0.465 | 0.525 | 0.050 | 0.055 | 0.050 | 0.040 | 0.075 |
RP | 1.000 | 0.260 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
HTS | 0.095 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
4.2 Simulation 2: Multiple Rank Cases
In this subsection, we evaluate the finite sample performance of SMURP. The simulation studies were designed to establish the association between a relatively high-dimensional imaging phenotype and a genetic marker (e.g., a SNP or haplotype), which is common in imaging genetics studies, while adjusting for age and other environmental factors. We set the sample size n = 100, let the phenotype dimension q be 50, 100, 200, 400, and 800, and simulated the multivariate phenotype according to model (1). The random errors were simulated from a multivariate normal distribution with mean 0 and a covariance matrix with unit diagonal elements. For the off-diagonal elements, which characterize the correlations among the phenotype components, we assigned each component to one of three categories (high, medium, and very low correlation, with 1, 1, and q − 2 components, respectively) and set the degrees of correlation between categories according to Table 3. The final covariance matrix is ΣR = σ2(ρjj′), where (ρjj′) is the correlation matrix, and we considered σ2 = 1 and 3.
Table 3. Correlation levels among the three categories of phenotype components in Simulation 2.
High | Med | Low | |
---|---|---|---|
High | 0.9 | 0.6 | 0.3 |
Med | 0.6 | 0.9 | 0.1 |
Low | 0.3 | 0.1 | 0.1 |
For the covariates, we included two SNPs with additive effects and three additional continuous covariates. We varied the minor allele frequency (MAF) of the first SNP over six scenarios (0.05, 0.1, 0.2, 0.3, 0.4, and 0.5) and fixed the MAF of the second SNP at 0.5. The three continuous covariates were simulated from a multivariate normal distribution with mean 0, standard deviation 1, and equal correlation 0.3. We first set B = 0 to assess the type I error rate. To assess power, we set the first response to be the only component of the multivariate phenotype associated with the first SNP and the second response to be the only component associated with the second SNP; specifically, we set the coefficients of the two SNPs to 1 for the selected responses and all other regression coefficients to 0. We are interested in testing the joint effect of the two SNPs on the multivariate phenotype.
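The covariate generation can be sketched as follows, coding each SNP additively as a Binomial(2, MAF) minor allele count; this coding is our reading of the additive model under Hardy-Weinberg equilibrium.

```python
import numpy as np

def simulate_snp(n, maf, rng):
    """Additive SNP coding: minor allele count in {0, 1, 2}."""
    return rng.binomial(2, maf, size=n).astype(float)

rng = np.random.default_rng(0)
n = 100
snp1 = simulate_snp(n, 0.05, rng)          # MAF varied over six scenarios
snp2 = simulate_snp(n, 0.50, rng)          # MAF fixed at 0.5
# Three continuous covariates: mean 0, unit variance, equal correlation 0.3.
R = 0.7 * np.eye(3) + 0.3 * np.ones((3, 3))
Z = rng.multivariate_normal(np.zeros(3), R, size=n)
X = np.column_stack([snp1, snp2, Z])       # n x 5 covariate matrix
```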
We applied SPReM to 100 simulated data sets. To the best of our knowledge, no other existing method handles the multi-rank test problem, so we focus on SPReM here. Table 4 presents the estimated rejection rates for different MAFs, q, and σ2. SPReM works very well even for relatively large q under both σ2 = 1 and σ2 = 3. In particular, the wild bootstrap method controls the type I error rate well in all scenarios. For power, SPReM performs reasonably well even under small MAFs and q = 800, suggesting that the method could handle much larger q as the sample size grows. As expected, increasing σ2 decreases the statistical power for rejecting the null hypothesis.
Table 4. Simulation 2: empirical power and type I error rates of SPReM across MAFs of the first SNP.
 | Power | | | | | Type I error | | | |
---|---|---|---|---|---|---|---|---|---|---|
σ2 = 1 | ||||||||||
MAF\q | 50 | 100 | 200 | 400 | 800 | 50 | 100 | 200 | 400 | 800 |
0.050 | 0.950 | 0.955 | 0.930 | 0.940 | 0.930 | 0.045 | 0.060 | 0.030 | 0.070 | 0.080 |
0.100 | 0.995 | 0.990 | 0.990 | 0.980 | 0.975 | 0.045 | 0.055 | 0.040 | 0.045 | 0.045 |
0.200 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.045 | 0.045 | 0.080 | 0.030 | 0.060 |
0.300 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.065 | 0.040 | 0.020 | 0.065 | 0.060 |
0.400 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.050 | 0.070 | 0.035 | 0.060 | 0.070 |
0.500 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.060 | 0.050 | 0.030 | 0.020 | 0.035 |
σ2 = 3 | ||||||||||
0.050 | 0.915 | 0.875 | 0.765 | 0.795 | 0.735 | 0.050 | 0.040 | 0.030 | 0.050 | 0.065 |
0.100 | 0.970 | 0.960 | 0.940 | 0.875 | 0.865 | 0.040 | 0.055 | 0.070 | 0.080 | 0.050 |
0.200 | 0.995 | 0.985 | 0.975 | 0.975 | 0.970 | 0.015 | 0.050 | 0.060 | 0.010 | 0.065 |
0.300 | 1.000 | 1.000 | 0.990 | 0.970 | 0.955 | 0.045 | 0.055 | 0.055 | 0.080 | 0.040 |
0.400 | 0.995 | 1.000 | 1.000 | 0.990 | 0.985 | 0.055 | 0.035 | 0.045 | 0.050 | 0.070 |
0.500 | 0.995 | 1.000 | 1.000 | 0.985 | 0.980 | 0.085 | 0.055 | 0.055 | 0.065 | 0.030 |
4.3 Alzheimer’s Disease Neuroimaging Initiative (ADNI) Data Analysis
The development of SPReM is motivated by the joint analysis of imaging, genetic, and clinical variables in the ADNI study. “Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California, San Francisco. ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1500 adults, ages 55 to 90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see www.adni-info.org.”
The Human 610-Quad BeadChip (Illumina, Inc., San Diego, CA) was used to genotype 818 subjects in the ADNI-1 database, which resulted in a set of 620,901 SNPs and copy number variation (CNV) markers. Since the Apolipoprotein E (ApoE) SNPs, rs429358 and rs7412, are not on the Human 610-Quad BeadChip, they were genotyped separately and added to the data set manually. For simplicity, we only considered the 10,479 SNPs collected on chromosome 19, which houses the ApoE gene, a gene commonly suspected of being associated with Alzheimer's disease. A complete GWAS of ADNI will be reported elsewhere. The SNP data were preprocessed with standard quality control steps: dropping any SNP with more than 5% missing data, imputing the missing values in each SNP with its mode, dropping SNPs with minor allele frequency below 0.05, and screening out SNPs violating Hardy-Weinberg equilibrium. Finally, we obtained 8,983 SNPs on chromosome 19, including the ApoE allele as the last SNP in our data set.
Our problem of interest is a genome-wide search for associations between the chromosome 19 SNPs that passed quality control and the brain volumes of 93 regions of interest (ROIs). We fitted model (1) with all 93 ROI volumes as responses and a covariate vector including an intercept, a given SNP, age, gender, whole brain volume, and the top five principal components to account for population stratification. To reduce population stratification effects, we used only the 761 Caucasian subjects among the 818; removing subjects with missing values left 747 subjects. We set λ = λmax in SPReM for computational efficiency. To test each SNP's effect on all 93 ROIs, we calculated the test statistic and its p-value. We also performed a standard mass univariate analysis, fitting a linear model with the same covariates and calculating a p-value for every ROI-SNP pair.
We developed a computationally efficient strategy to approximate the p-value of each SNP across different MAFs. We considered a pool of SNPs consisting of six MAF groups: MAF ∈ (0.05, 0.075], (0.075, 0.15], (0.15, 0.25], (0.25, 0.35], (0.35, 0.45], and (0.45, 0.50], with 40 SNPs per group. For each SNP, we generated 10,000 wild bootstrap samples under the null hypothesis to obtain 10,000 bootstrapped test statistics. Then, based on the 40 × 10,000 bootstrapped statistics for each MAF group, we used the Satterthwaite method to approximate the null distribution of the test statistic by a Gamma distribution with parameters (aT, bT). Specifically, we set aT = ε2/ν and bT = ν/ε by matching the mean (ε) and variance (ν) of the test statistics to those of the Gamma distribution. The histograms with fitted gamma densities and the QQ-plots are presented in Figures 2 and 3, respectively, which show that the gamma approximation works reasonably well over a wide range of MAFs when λ = λmax. Since Gamma(aT, bT) is used only to approximate the p-values of large test statistics, we only need a good approximation in the tail of the Gamma distribution; see Figure 3 for details. For each SNP, we matched its MAF to the closest MAF group in the pool and then calculated the p-value of its test statistic from the approximated gamma distribution. We present the Manhattan plot in Figure 4 and the top 10 SNPs, with their p-values from SPReM and the mass univariate analysis, in Table 5 for λ = λmax.
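The gamma approximation amounts to two moment matches and one tail evaluation, as in the sketch below; the pooled bootstrap statistics used here are synthetic placeholders rather than ADNI output.

```python
import numpy as np
from scipy import stats

def gamma_tail_pvalue(T_obs, T_boot):
    """Satterthwaite-style approximation: fit Gamma(a_T, b_T) to the null
    statistics by matching mean (eps) and variance (nu), with
    a_T = eps**2 / nu and b_T = nu / eps, then read off the upper tail."""
    eps, nu = T_boot.mean(), T_boot.var()
    a_T, b_T = eps ** 2 / nu, nu / eps
    return stats.gamma.sf(T_obs, a=a_T, scale=b_T)

# Example with 40 x 10,000 pooled synthetic statistics for one MAF group.
rng = np.random.default_rng(0)
T_boot = rng.gamma(shape=2.0, scale=1.5, size=400_000)
print(gamma_tail_pvalue(10.0, T_boot))
```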
Table 5. The top 10 SNPs and their p-values from SPReM and the mass univariate analysis (MUA).
SNP | apoe_allele | rs11667587 | rs2075650 | rs7248284 | rs3745341 |
---|---|---|---|---|---|
SPReM | 5.04E-16 | 5.95E-06 | 9.58E-06 | 2.56E-05 | 3.83E-05 |
MUA | 3.43E-11 | 4.42E-04 | 1.12E-04 | 8.75E-04 | 1.00E-03 |
SNP | rs4803646 | rs8106200 | rs2445830 | rs8102864 | rs740436 |
SPReM | 4.65E-05 | 1.16E-04 | 1.32E-04 | 1.93E-04 | 2.17E-04 |
MUA | 7.56E-04 | 3.70E-03 | 1.33E-02 | 9.34E-04 | 1.63E-03 |
We have several important findings. The ApoE allele was identified as the most significant covariate, with −log10(p) ≈ 15 and 9 for SPReM and the mass univariate analysis, respectively, indicating a strong association between the ApoE allele and the imaging phenotype, a biomarker of Alzheimer's disease diagnosis. This finding agrees with the previous result in Vounou et al. (2012). We also found interesting results for rs2075650 on the TOMM40 gene, one of the top 10 significant SNPs, with −log10(p) ≈ 5 and 4, respectively. The TOMM40 gene lies in close proximity to the ApoE gene and has also been linked to AD in recent studies (Vounou et al., 2012). In addition, we detected SNPs on chromosome 19, such as rs11667587 on the NOVA2 gene, that have not been identified in existing genome-wide association studies; these new findings may shed more light on further Alzheimer's research. The p-values of the top 10 SNPs from SPReM are much smaller than those from the mass univariate analysis; in other words, to achieve comparable p-values, the mass univariate analysis would require many more samples. This strongly demonstrates the effectiveness of our proposed method.
5 Discussion
In this paper, we have developed a general SPReM framework based on two heritability ratios. Our SPReM methodology has a wide range of applications, including sparse linear discriminant analysis, two-sample tests, and general hypothesis tests in MLMs, among many others. We have systematically investigated the L2 convergence rate of ŵλ in the ultrahigh dimensional framework, extended the SURP problem to SMURP, and offered a sequential SURP approximation algorithm. Simulation studies and a real data analysis demonstrated the excellent performance of our SPReM framework compared with other state-of-the-art methods.
6 Assumptions and Proofs
Throughout the paper, the following assumptions are needed to facilitate the technical details, although they may not be the weakest conditions.
Assumption A1. C(n−1XTX)−1CT ≍ 1, that is, there exist constants c0 and C0 such that c0 ≤ C(n−1XTX)−1CT ≤ C0.
Assumption A2. 0 < ε0 ≤ λmin(ΣR) ≤ λmax(ΣR) ≤ 1/ε0.
Assumption A3. The covariance estimator Σ̃R satisfies ‖Σ̃R − ΣR‖2 = Op(an) = op(1).
Remark: Assumption A1 is a very weak and standard assumption for regression models. Assumption A2 has been widely used in the literature. Assumption A3 requires a relatively accurate covariance estimator in terms of spectral-norm convergence. One may use well-behaved penalized estimators of ΣR under different assumptions on ΣR (Bickel and Levina, 2008; Cai et al., 2010; Lam and Fan, 2009; Rothman et al., 2009; Fan et al., 2013).
Proof of Theorem 2.1 The Karush-Kuhn-Tucker (KKT) conditions for problem (13) are given by

$\ell - \lambda\Gamma - \gamma\Sigma_R w = 0, \quad \gamma(w^T\Sigma_R w - 1) = 0, \quad \gamma \ge 0, \quad w^T\Sigma_R w \le 1,$

where Γ is a q × 1 vector equal to the subgradient of ‖w‖1 with respect to w. We consider two scenarios. First, suppose that |ℓj| > λ for some j. We must have γΣRw ≠ 0, which leads to γ > 0 and wTΣRw = 1. Thus, the KKT conditions reduce to

$\Sigma_R(\gamma w) = \ell - \lambda\Gamma.$

If we write w̃ = γw, this is equivalent to solving problem (15) in w̃ and then normalizing. Second, if |ℓj| ≤ λ for all j, then w = 0 and γ = 0, which is also the solution of (15). This finishes the proof.
Proof of Proposition 2.2 It follows from Theorem 2 of Rosset and Zhu (2007).
Proof of Proposition 2.3 The proof is similar to that of Proposition 1 of Witten and Tibshirani (2011). Letting $\tilde{w}_k = \Sigma_R^{1/2} w_k$, problem (20) can be rewritten in terms of w̃k,
which is equivalent to
(34)
Thus, the w̃k and uk that solve problem (34) are the k-th left and right singular vectors of A (Witten and Tibshirani, 2011). Therefore, uk is the k-th eigenvector of ATA, or equivalently the k-th right singular vector of A, and w̃k is the k-th left singular vector of A. Therefore, the solution of (21) is the k-th discriminant vector of (20).
Proof of Theorem 3.1 In this theorem, we specifically use the banded covariance estimator Σ̃R = Bkn(Σ̂R), where Bk(Σ) = [σjj′ 1(|j′ − j| ≤ k)] and Σ̂R is the sample covariance matrix of yi − B̂Txi.
First, we define 𝒥 = {‖Σ̃R − Bkn(ΣR)‖∞ ≤ t1} ∩ {‖ℓ̂ − ℓ‖∞ ≤ t2}, where t1 and t2 are specified as in Lemma 6.2. Then, it follows from Lemma 6.2 that $P(\mathcal{J}) \ge 1 - 3(q \vee n)^{-\eta_1} - 2(q \vee n)^{-\eta_2}$.
On the set 𝒥, by choosing λ appropriately and using Lemma 6.1, we have
Let w0,S0 = [w0,jI(j ∈ S0)], where w0,j is the j–th component of w0. The above equation can be rewritten as
which yields
Finally, we obtain the following inequality
which finishes the proof.
Proof of Theorem 3.2 It follows from Lemma 6.1 that
Then, by taking λ appropriately,
we have
By using Weyl’s inequality, we have
where ‖Σ̃R − ΣR‖2 = Op(an) = op(1). Finally, we have
(35)
which finishes the proof.
Lemma 6.1 We have the following basic inequality
(36)
Proof We rewrite the optimization problem (28) as
Thus, we have
which yields
in which we have used ℓ̂ = ΣRw0 + ℓ̂ − ℓ in the last equality.
Lemma 6.2 For suitable choices of t1 and t2, we have
(37)
Proof First, the bound on ‖Σ̃R − Bkn(ΣR)‖∞ follows from Lemma A.3 of Bickel and Levina (2008).
Second, each component of ℓ̂ − ℓ is sub-Gaussian. Then, by the union sum inequality, we have
(38)
By taking t2 appropriately, we can rewrite the above inequality as
Finally, we get
which finishes the proof.
Acknowledgments
The research of Drs. Zhu and Ibrahim was supported by NIH grants RR025747-01, GM70335, CA74015, P01CA142538-01, MH086633, EB005149-01 and AG033387. The research of Dr. Liu was supported by NSF Grant DMS-07-47575 and NIH Grant NIH/NCI R01 CA- 149569. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.
Footnotes
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. We thank the Editor, the Associate Editor, and two anonymous referees for valuable suggestions, which greatly helped to improve our presentation.
Contributor Information
Qiang Sun, Email: qsun@bios.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420..
Hongtu Zhu, Email: hzhu@bios.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420..
Yufeng Liu, Email: yiu@email.unc.edu, Department of Statistics and Operation Research, University of North Carolina at Chapel Hill, CB 3260, Chapel Hill, NC 27599..
Joseph G. Ibrahim, Email: ibrahim@bios.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420..
References
- Amos CI, Elston RC, Bonney GE, Keats BJB, Berenson GS. A multivariate method for detecting genetic linkage, with application to a pedigree with an adverse lipoprotein phenotype. Am. J. Hum. Genet. 1990;47:247–254.
- Amos CI, Laing AE. A comparison of univariate and multivariate tests for genetic linkage. Genetic Epidemiology. 1993;84:303–310.
- Bickel PJ, Levina E. Regularized estimation of large covariance matrices. The Annals of Statistics. 2008;36:199–227.
- Breiman L, Friedman J. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society, Series B. 1997;59:3–54.
- Cai T, Zhang C, Zhou H. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics. 2010;38:2118–2144.
- Chen S, Qin Y. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics. 2010;38:808–835.
- Chiang MC, Barysheva M, Toga AW, Medland SE, Hansell NK, James MR, McMahon KL, de Zubicaray GI, Martin NG, Wright MJ, Thompson PM. BDNF gene effects on brain circuitry replicated in 455 twins. NeuroImage. 2011a;55:448–454.
- Chiang MC, McMahon KL, de Zubicaray GI, Martin NG, Hickie I, Toga AW, Wright MJ, Thompson PM. Genetics of white matter development: A DTI study of 705 twins and their siblings aged 12 to 29. NeuroImage. 2011b;54:2308–2317.
- Chun H, Keles S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B. 2010;72:3–25.
- Cook RD, Helland IS, Su Z. Envelopes and partial least squares regression. Journal of the Royal Statistical Society, Series B. 2013. To appear.
- Cook RD, Li B, Chiaromonte F. Envelope models for parsimonious and efficient multivariate linear regression. Statist. Sinica. 2010;20:927–1010.
- Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society, Series B. 2012;74:745–771.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society, Series B. 2013;75:603–680.
- Formisano E, Martino FD, Valente G. Multivariate analysis of fMRI time series: classification and regression of brain responses using machine learning. Magnetic Resonance Imaging. 2008;26:921–934.
- Kherif F, Poline JB, Flandin G, Benali H, Simon O, Dehaene S, Worsley KJ. Multivariate model specification for fMRI data. NeuroImage. 2002;16:1068–1083.
- Klei L, Luca D, Devlin B, Roeder K. Pleiotropy and principal components of heritability combine to increase power for association. Genetic Epidemiology. 2008;32:9–19.
- Knickmeyer RC, Gouttard S, Kang C, Evans D, Wilber K, Smith JK, Hamer RM, Lin W, Gerig G, Gilmore JH. A structural MRI study of human brain development from birth to 2 years. J Neurosci. 2008;28:12176–12182.
- Krishnan A, Williams LJ, McIntosh AR, Abdi H. Partial least squares (PLS) methods for neuroimaging: a tutorial and review. NeuroImage. 2011;56:455–475.
- Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics. 2009;37:4254–4278.
- Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics. 2000;9:1–20.
- Lenroot RK, Giedd JN. Brain development in children and adolescents: insights from anatomical magnetic resonance imaging. Neurosci Biobehav Rev. 2006;30:718–729.
- Lin J, Zhu H, Knickmeyer R, Styner M, Gilmore J, Ibrahim J. Projection regression models for multivariate imaging phenotype. Genetic Epidemiology. 2012;36:631–641.
- Lopes ME, Jacob LJ, Wainwright MJ. A more powerful two-sample test in high dimensions using random projection. 2011. arXiv preprint arXiv:1108.2401.
- Ott J, Rabinowitz D. A principal-components approach based on heritability for combining phenotype information. Hum Heredity. 1999;49:106–111.
- Paus T. Population neuroscience: Why and how. Human Brain Mapping. 2010;31:891–903.
- Peng J, Zhu J, Bergamaschi A, Han W, Noh D, Pollack JR, Wang P. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Annals of Applied Statistics. 2010;4:53–77.
- Peper JS, Brouwer RM, Boomsma DI, Kahn RS, Pol HEH. Genetic influences on human brain structure: A review of brain imaging studies in twins. Human Brain Mapping. 2007;28:464–473.
- Rosset S, Zhu J. Piecewise linear regularized solution paths. The Annals of Statistics. 2007;35:1012–1030.
- Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186.
- Rowe DB, Hoffmann RG. Multivariate statistical analysis in fMRI. IEEE Eng Med Biol Mag. 2006;25:60–64.
- Scharinger C, Rabl U, Sitte HH, Pezawas L. Imaging genetics of mood disorders. NeuroImage. 2010;53:810–821.
- Teipel SJ, Born C, Ewers M, Bokde ALW, Reiser MF, Moller HJ, Hampel H. Multivariate deformation-based analysis of brain atrophy to predict Alzheimer's disease in mild cognitive impairment. NeuroImage. 2007;38:13–24.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vounou M, Janousova E, Wolz R, Stein J, Thompson P, Rueckert D, Montana G, ADNI. Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer's disease. NeuroImage. 2012;60:700–716.
- Witten DM, Tibshirani R. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B. 2011;73:753–772.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.