Author manuscript; available in PMC: 2021 Apr 30.
Published in final edited form as: J Am Stat Assoc. 2019 Apr 30;115(529):403–424. doi: 10.1080/01621459.2018.1555092

L2RM: Low-rank Linear Regression Models for High-dimensional Matrix Responses *

Dehan Kong 1, Baiguo An 2, Jingwen Zhang 3, Hongtu Zhu 4
PMCID: PMC7781207  NIHMSID: NIHMS997852  PMID: 33408427

Abstract

The aim of this paper is to develop a low-rank linear regression model (L2RM) to correlate a high-dimensional response matrix with a high-dimensional vector of covariates when the coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm of each coefficient matrix in order to deal with the case when the number of covariates is extremely large. We develop an efficient estimation procedure based on the trace norm regularization, which explicitly imposes the low-rank structure of the coefficient matrices. When both the dimension of the response matrix and that of the covariate vector diverge at an exponential order of the sample size, we establish the sure independence screening property under some mild conditions. We also systematically investigate theoretical properties of our estimation procedure, including estimation consistency, rank consistency, and a non-asymptotic error bound under some mild conditions. We further establish a theoretical guarantee for the overall solution of our two-step screening and estimation procedure. We examine the finite-sample performance of our screening and estimation methods using simulations and a large-scale imaging genetics dataset collected in the Philadelphia Neurodevelopmental Cohort (PNC) study.

Keywords: Imaging Genetics, Low Rank, Matrix Linear Regression, Spectral norm, Trace norm

1. Introduction

Multivariate regression modeling with a matrix response $Y \in \mathbb{R}^{p\times q}$ and a multivariate covariate $x \in \mathbb{R}^{s}$ is an important statistical tool in modern high-dimensional inference, with wide applications in large-scale studies such as imaging genetics. Specifically, in imaging genetics, matrix responses (Y) as phenotypic variables often represent the weighted (or binary) adjacency matrix of a finite graph characterizing structural (or functional) connectivity patterns, whereas covariates (x) include genetic markers (e.g., single nucleotide polymorphisms (SNPs)), age, and gender, among others. The joint analysis of imaging and genetic data may ultimately lead to discoveries of genes for many neuropsychiatric and neurological disorders, such as schizophrenia (Scharinger et al., 2010; Peper et al., 2007; Chiang et al., 2011; Thompson et al., 2013; Medland et al., 2014). This motivates us to systematically investigate a statistical model with a multivariate response Y and a multivariate covariate x.

Let $\{(x_i, Y_i) : 1 \le i \le n\}$ denote independent and identically distributed (i.i.d.) observations, where $x_i = (x_{i1}, \ldots, x_{is})^T$ is an $s \times 1$ vector of scalar covariates (e.g., clinical variables and genetic variants) and $Y_i$ is a $p \times q$ response matrix. Without loss of generality, we assume that $x_{il}$ has mean 0 and variance 1 for every $1 \le l \le s$, and that $Y_i$ has mean 0. Throughout the paper, we consider an L2RM, which is given by

$$Y_i = \sum_{l=1}^{s} x_{il} * B_l + E_i, \qquad (1)$$

where $B_l$ is a $p \times q$ coefficient matrix characterizing the effect of the lth covariate on $Y_i$, and $E_i$ is a $p \times q$ matrix of random errors with mean 0. The symbol "*" denotes scalar multiplication. Model (1) differs significantly from existing matrix regression models, which were developed for matrix covariates and univariate responses (Leng and Tang, 2012; Zhao and Leng, 2014; Zhou and Li, 2014). Our goal is to discover a small set of important covariates from x that strongly influence Y.

We focus on the most challenging setting that both the dimension of Y (or pq) and that of x (or s) can diverge with the sample size. Such a setting is general enough to cover high-dimensional univariate and multivariate linear regression models in the literature (Negahban et al., 2012; Fan and Lv, 2010; Buhlmann and van de Geer, 2011; Tibshirani, 1997; Yuan et al., 2007; Candes and Tao, 2007; Breiman and Friedman, 1997; Cook et al., 2013; Park et al., 2017). In the literature, there are two major categories of statistical methods for jointly analyzing high-dimensional matrix Y and high-dimensional vector x.

The first category is a set of mass univariate methods. Specifically, one fits a marginal linear regression to correlate each element of $Y_i$ with each element of $x_i$, leading to a total of pqs massive univariate analyses and an expanded search space with pqs elements. This approach is also called voxel-wise genome-wide association study (VGWAS) analysis in the imaging genetics literature (Hibar et al., 2011; Shen et al., 2010; Huang et al., 2015; Zhang et al., 2014; Medland et al., 2014; Zhang et al., 2014; Thompson et al., 2014; Liu and Calhoun, 2014). For instance, Stein et al. (2010) used 300 high-performance CPU nodes for approximately 27 hours to carry out a VGWAS analysis on an imaging genetic dataset with only 448,293 SNPs and 31,622 imaging measures for 740 subjects. Such computational challenges are becoming more severe as the field rapidly advances toward the most challenging setting with large pq and s. More seriously, for model (1), the massive univariate method can miss some important components of x that strongly influence Y due to the interplay among the components of x.

The second category is to fit a model accommodating all (or part of) the covariates and responses (Vounou et al., 2010, 2012; Zhu et al., 2014; Wang et al., 2012a,b; Peng et al., 2010). These methods use regularization, such as Lasso or group Lasso, to select a set of covariate-response pairs. However, when the product pqs is extremely large, it is very difficult to allocate computer memory for an array of size pqs in order to accommodate all coefficient matrices $B_l$'s, rendering all these regularization methods intractable. Therefore, almost all existing methods in this category have to use some dimension reduction technique (e.g., screening) to reduce both the number of responses and the number of covariates. Subsequently, these methods fit a multivariate linear regression model with the selected elements of Y as new responses and those of x as new covariates. However, this approach can be unsatisfactory, since it does not incorporate the matrix structural information.

The aim of this paper is to develop a low-rank linear regression model (L2RM) as a novel extension of both VGWAS and regularization methods. Specifically, instead of repeatedly fitting a univariate model to each covariate-response pair, we consider all elements of $Y_i$ as a high-dimensional matrix response and focus on the coefficient matrix of each covariate, which is approximately low-rank (Candès and Recht, 2009). There is a literature on the development of matrix variate regression (Ding and Cook, 2014; Fosdick and Hoff, 2015; Zhou and Li, 2014), but these papers focus on the case when covariates have a matrix structure. In contrast, there is a large literature on the development of various function-on-scalar regression models that emphasize the inherent functional structure of responses; see Chapter 13 of Ramsay and Silverman (2005) for a comprehensive review of this topic. Variable selection methods have been developed for some function-on-scalar regression models (Wang et al., 2007; Chen et al., 2016), but these methods focus on one-dimensional functional responses rather than two-dimensional matrix responses. Recently, some literature has considered matrix or tensor response regression (Ding and Cook, 2018; Li and Zhang, 2017; Raskutti and Yuan, 2018; Rabusseau and Kadri, 2016), but it only considers the case when the dimension of the covariates is fixed or diverges slowly with the sample size.

In this paper, we aim at efficiently correlating matrix responses with a high dimensional vector of covariates. Four major methodological contributions of this paper are as follows.

  • We introduce a low-rank linear regression model to fit high-dimensional matrix responses with a high dimensional vector of covariates, while explicitly accounting for the low-rank structure of coefficient matrices.

  • We introduce a novel rank-one screening procedure based on the spectral norm of the estimated coefficient matrix to eliminate most “noisy” scalar covariates and show that our screening procedure enjoys the sure independence screening property (Fan and Lv, 2008) with vanishing false selection rate. The use of such spectral norm is critical for dealing with a large number of noisy covariates.

  • When the number of covariates is relatively small, we propose a low-rank estimation procedure based on the trace norm regularization, which explicitly characterizes the low-rank structure of the coefficient matrices. An efficient algorithm for solving the optimization problem is developed. We systematically investigate some theoretical properties of our estimation procedure, including estimation and rank consistency when both p and q are fixed and a non-asymptotic error bound when both p and q are allowed to diverge.

  • We investigate how incorrect screening results can affect the low-rank regression model estimation both numerically and theoretically. We establish a theoretical guarantee for the overall solution, while accounting for the randomness of the first-step screening procedure.

The rest of this paper is organized as follows. In Section 2, we introduce a rank-one screening procedure to deal with a high dimensional vector of covariates and describe our estimation procedure when the number of covariates is relatively small. Section 3 investigates the theoretical properties of our method. Simulations are conducted in Section 4 to evaluate the finite-sample performance of the proposed two-step screening and estimation procedure. Section 5 illustrates an application of L2RM in the joint analysis of imaging and genetic data from the Philadelphia Neurodevelopmental Cohort (PNC) study discussed above. We finally conclude with some discussions in Section 6.

2. Methodology

Throughout the paper, we focus on addressing three fundamental issues for L2RM as follows:

  • (I) The first one is to eliminate most 'noisy' covariates $x_{il}$ when the number of candidate covariates and the dimension of the response matrix are much larger than n, that is, $\min(s, pq) \gg n$.

  • (II) The second one is to estimate the coefficient matrix Bl when Bl does have a low-rank structure.

  • (III) The third one is to investigate some theoretical properties of the screening and estimation methods.

2.1. Rank-one Screening Method

We consider the case that both pq and s diverge at an exponential order of n, and we also denote s by sn. To address (I), it is common to assume that most scalar covariates have no effects on the matrix responses, that is, Bl0 = 0 for most 1 ≤ lsn, where Bl0 is the true value for Bl. In this case, we define the true model and its size as

$$\mathcal{M} = \{1 \le l \le s_n : B_{l0} \neq 0\} \quad \text{and} \quad s_0 = |\mathcal{M}| < n. \qquad (2)$$

Our aim is to estimate the set $\mathcal{M}$ and the coefficient matrices $B_l$. Simultaneously estimating $\mathcal{M}$ and the $B_l$'s is difficult, since it is computationally infeasible to fit a model when both $s_n$ and pq are very large. For example, in the PNC data, we have $pq = 69^2 = 4{,}761$ and $s_n \approx 5 \times 10^6$. Therefore, it may be imperative to employ a screening technique to reduce the model size. However, developing a screening technique for model (1) is more challenging than for many existing screening methods, which focus on univariate responses (Fan and Lv, 2008; Fan and Song, 2010).

Similar to Fan and Lv (2008) and Fan and Song (2010), it is assumed that all covariates have been standardized so that

$$E(x_{il}) = 0 \quad \text{and} \quad E(x_{il}^2) = 1 \quad \text{for } l = 1, \ldots, s_n.$$

We also assume that every element of Yi = (Yi,jk) has been standardized, that is,

$$E(Y_{i,jk}) = 0 \quad \text{and} \quad E(Y_{i,jk}^2) = 1 \quad \text{for } j = 1, \ldots, p \text{ and } k = 1, \ldots, q.$$

We propose to screen covariates based on the estimated marginal ordinary least squares (OLS) coefficient matrix $\hat{B}_l^M = n^{-1}\sum_{i=1}^n x_{il} Y_i$ for $l = 1, \ldots, s_n$. Although the interpretations and implications of the marginal models are biased relative to the joint model, the nonsparse information about the joint model can be passed along to the marginal models under a mild condition, so they are suitable for the purpose of variable screening (Fan and Song, 2010). Specifically, we calculate the spectral norm (operator norm, or largest singular value) of $\hat{B}_l^M$, denoted by $\|\hat{B}_l^M\|_{op}$, and define a submodel as

$$\hat{\mathcal{M}}_{\gamma_n} = \{1 \le l \le s_n : \|\hat{B}_l^M\|_{op} \ge \gamma_n\}, \qquad (3)$$

where γn is a prefixed threshold.

The key advantage of using $\|\hat{B}_l^M\|_{op}$ is that it explicitly accounts for the low-rank structure of the $B_{l0}$'s, while being robust to noise and sensitive to various signal patterns (e.g., sparsely strong signals and low-rank weak signals) in the coefficient matrices. In our screening step, we use the marginal OLS estimates of the coefficient matrices, which can be regarded as the true coefficient matrices corrupted by noise. One may instead use other summary statistics based on the component-wise information of $\hat{B}_l^M$, such as $\|\hat{B}_l^M\|_1$ (the sum of the absolute values of all elements), $\|\hat{B}_l^M\|_F$, or the global Wald-type statistic used in Huang et al. (2015). It is well known that such summary statistics are sensitive to noise and suffer from the curse of dimensionality. Our simulation studies further confirm that the rank-one screening based on $\|\hat{B}_l^M\|_{op}$ is more robust to noise and more sensitive to small signal regions. Another advantage of using $\|\hat{B}_l^M\|_{op}$ is that it is computationally efficient. In contrast, one could calculate other regularized estimates (e.g., Lasso or fused Lasso) for screening, but this is computationally infeasible for L2RM when $s_n$ is much larger than the sample size.

A difficult issue with (3) is how to properly select $\gamma_n$. As shown in Section 3.1, when $\gamma_n$ is chosen properly, our screening procedure enjoys the sure independence screening property (Fan and Lv, 2008). However, it is difficult to determine $\gamma_n$ precisely in practice, since it involves two unknown positive constants $C_1$ and $\alpha$ appearing in Theorem 1, which cannot be easily determined for finite samples. We propose to use random decoupling to select $\gamma_n$, similar to Barut et al. (2016). Let $\{x_i^*, i = 1, \ldots, n\}$ be a random permutation of the original covariates $\{x_i, i = 1, \ldots, n\}$. We apply our screening procedure to the randomly decoupled data $\{x_i^*, Y_i\}_{i=1}^n$. Because the original association between $x_i$ and $Y_i$ is destroyed by the random decoupling, screening on $\{x_i^*, Y_i\}_{i=1}^n$ mimics the null model, i.e., the model with no association. We obtain the estimated marginal coefficient matrices $(\hat{B}_l^M)^*$, which are statistical estimates of the zero matrix, and the corresponding operator norms $\|(\hat{B}_l^M)^*\|_{op}$ for all $1 \le l \le s_n$. Define $\nu_n = \max_{1\le l\le s_n} \|(\hat{B}_l^M)^*\|_{op}$, which is the minimum thresholding parameter that yields no false positives under the decoupled data. Since $\nu_n$ depends on the realization of the permutation, we set the threshold value $\gamma_n$ as the median of the threshold values $\{\nu_n^{(k)}, 1 \le k \le K\}$ from K different random permutations, where $\nu_n^{(k)}$ is the threshold value for the kth permutation. We set K = 10 in this paper.
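The screening step thus reduces to one marginal OLS matrix and one spectral norm per covariate. The sketch below is a minimal illustration of this rank-one screening with the random-decoupling threshold, assuming standardized inputs; the array names (X, Y), the helper names, and K = 10 are illustrative choices rather than the authors' code.

```python
# A minimal sketch of the rank-one screening step with random decoupling.
import numpy as np

def marginal_ols(X, Y):
    """Marginal OLS coefficient matrices B_l^M = n^{-1} sum_i x_{il} Y_i.

    X : (n, s) matrix of standardized covariates.
    Y : (n, p, q) array of centered matrix responses.
    Returns an (s, p, q) array of marginal estimates.
    """
    n = X.shape[0]
    return np.einsum('il,ijk->ljk', X, Y) / n

def spectral_norms(B):
    """Largest singular value of each p x q slice."""
    return np.array([np.linalg.norm(Bl, ord=2) for Bl in B])

def rank_one_screening(X, Y, K=10, rng=None):
    """Return the selected index set M_hat and the threshold gamma_n."""
    rng = np.random.default_rng(rng)
    norms = spectral_norms(marginal_ols(X, Y))
    # Random decoupling: permute the rows of X to mimic the null model and
    # record the largest spectral norm nu_n for each permutation.
    nus = []
    for _ in range(K):
        X_perm = X[rng.permutation(X.shape[0])]
        nus.append(spectral_norms(marginal_ols(X_perm, Y)).max())
    gamma_n = np.median(nus)
    selected = np.where(norms >= gamma_n)[0]
    return selected, gamma_n
```

Each marginal estimate requires only one pass over the data and one SVD of a p × q matrix, which is what makes the procedure feasible when $s_n$ is in the millions.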

2.2. Estimation Method

To address (II), we consider the estimation of B when the true coefficient matrices $B_{l0}$'s truly have a low-rank structure. The following refined estimation step can be applied after the screening step when the number of retained covariates is relatively small. For simplicity, we denote the set $\hat{\mathcal{M}}_{\gamma_n}$ selected by the screening step by $\hat{\mathcal{M}}$. Suppose $\hat{\mathcal{M}} = \{l_1, \ldots, l_{|\hat{\mathcal{M}}|}\}$, where $1 \le l_1 < \cdots < l_{|\hat{\mathcal{M}}|} \le s_n$. Define $B = [B_l, l \in \hat{\mathcal{M}}] = [B_{l_1}, \ldots, B_{l_{|\hat{\mathcal{M}}|}}] \in \mathbb{R}^{p \times q|\hat{\mathcal{M}}|}$.

Recently, the trace norm regularization $\|B_l\|_* = \sum_k \sigma_k(B_l)$, where $\sigma_k(B_l)$ is the kth singular value of $B_l$, has been widely used to recover the low-rank structure of $B_l$ due to its computational efficiency. For instance, the trace norm has been used for matrix completion (Candès and Recht, 2009), for matrix regression models with matrix covariates and univariate responses (Zhou and Li, 2014), and for multivariate linear regression with vector responses and scalar covariates (Yuan et al., 2007). Similarly, we propose to calculate the regularized least squares estimator of B by minimizing

$$Q(B) = \frac{1}{2n}\sum_{i=1}^n \Big\|Y_i - \sum_{l\in\hat{\mathcal{M}}} x_{il} B_l\Big\|_F^2 + \lambda \sum_{l\in\hat{\mathcal{M}}} \|B_l\|_*, \qquad (4)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix and λ is a tuning parameter. The low-rank structure can be regarded as a special spatial structure, since it is closely related to functional principal component analysis. We use five-fold cross-validation to select the tuning parameter λ. Ideally, we could choose a different tuning parameter for each $B_l$, but doing so would dramatically increase the computational complexity.
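As a concrete illustration of the five-fold cross-validation over a common λ, the sketch below assumes a generic fitting routine fit_l2rm(X, Y, lam) (a hypothetical helper; one possible implementation is sketched after the algorithm later in this subsection) and scores candidate λ values by out-of-fold squared prediction error.

```python
# A minimal sketch of five-fold cross-validation for a common lambda,
# where X has shape (n, m) with the m retained covariates and Y has shape
# (n, p, q); fit_l2rm is passed in as an assumed fitting routine.
import numpy as np

def cv_select_lambda(X, Y, lambdas, fit_l2rm, n_folds=5, rng=0):
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(rng).permutation(n), n_folds)
    errors = np.zeros(len(lambdas))
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        for j, lam in enumerate(lambdas):
            B = fit_l2rm(X[train], Y[train], lam)        # (m, p, q) estimates
            Y_hat = np.einsum('il,ljk->ijk', X[f], B)    # held-out predictions
            errors[j] += np.sum((Y[f] - Y_hat) ** 2)
    return lambdas[int(np.argmin(errors))]
```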

We apply the Nesterov gradient method to solve problem (4) even though Q(B) is non-smooth (Nesterov, 2004; Beck and Teboulle, 2009). The Nesterov gradient method utilizes the first-order gradient of the objective function to obtain the next iterate based on the current search point. Unlike the standard gradient descent algorithm, the Nesterov gradient algorithm uses the two previous iterates to generate the next search point by extrapolation, which can dramatically improve the convergence rate. Before introducing the Nesterov gradient algorithm, we first state a singular value thresholding formula for the trace norm (Cai et al., 2010).

Proposition 1. For a matrix A with $\{a_k\}_{1\le k\le r}$ being its singular values, the solution to

$$\min_B \Big\{\frac{1}{2}\|B - A\|_F^2 + \lambda \|B\|_*\Big\} \qquad (5)$$

shares the same singular vectors as A, and its singular values are $b_k = (a_k - \lambda)_+$ for $k = 1, \ldots, r$.
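Proposition 1 gives a closed-form proximal operator for the trace norm. A minimal sketch, assuming only NumPy, is as follows (function name illustrative):

```python
# Singular value thresholding for the trace norm (Proposition 1).
import numpy as np

def svt(A, lam):
    """Solve min_B 0.5*||B - A||_F^2 + lam*||B||_* by soft-thresholding
    the singular values of A."""
    U, a, Vt = np.linalg.svd(A, full_matrices=False)
    b = np.maximum(a - lam, 0.0)          # b_k = (a_k - lam)_+
    return (U * b) @ Vt                   # U diag(b) V^T
```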

We present the Nesterov gradient algorithm for problem (4) as follows. Denote $R(B) = (2n)^{-1}\sum_{i=1}^n \|Y_i - \sum_{l\in\hat{\mathcal{M}}} x_{il} B_l\|_F^2$ and $J(B) = \lambda\sum_{l\in\hat{\mathcal{M}}} \|B_l\|_*$. We also define

$$g(B \mid S^{(t)}, \delta) = R(S^{(t)}) + \langle \nabla R(S^{(t)}), B - S^{(t)}\rangle + (2\delta)^{-1}\|B - S^{(t)}\|_F^2 + J(B) = (2\delta)^{-1}\big\|B - [S^{(t)} - \delta \nabla R(S^{(t)})]\big\|_F^2 + J(B) + c^{(t)},$$

where $\nabla R(S^{(t)})$ denotes the first-order gradient of $R(\cdot)$ evaluated at $S^{(t)}$, $S^{(t)}$ is an interpolation between $B^{(t)}$ and $B^{(t-1)}$ that will be defined below, $c^{(t)}$ collects all terms that do not depend on B, and δ > 0 is a suitable step size. Given the current search point $S^{(t)}$, the next iterate $B^{(t+1)}$ is the minimizer of $g(B \mid S^{(t)}, \delta)$. The search point $S^{(t)}$ itself is generated by linearly extrapolating the two previous algorithmic iterates. A key advantage of the Nesterov gradient method is that each iteration has an explicit solution. Specifically, let $B_{l_d}$, $S^{(t)}_{l_d}$, and $\nabla R(S^{(t)})_{l_d}$ be the $(dq-q+1)$th to the $(dq)$th columns of the corresponding $p \times q|\hat{\mathcal{M}}|$ matrices $B$, $S^{(t)}$, and $\nabla R(S^{(t)})$, respectively. Minimizing $(2\delta)^{-1}\|B - [S^{(t)} - \delta\nabla R(S^{(t)})]\|_F^2 + \lambda\sum_{l\in\hat{\mathcal{M}}}\|B_l\|_*$ is equivalent to solving $|\hat{\mathcal{M}}|$ sub-problems, the $d$th of which minimizes $(2\delta)^{-1}\|B_{l_d} - [S^{(t)}_{l_d} - \delta\nabla R(S^{(t)})_{l_d}]\|_F^2 + \lambda\|B_{l_d}\|_*$ for $d = 1, \ldots, |\hat{\mathcal{M}}|$; each sub-problem can be solved exactly by the singular value thresholding formula in Proposition 1.

Define $X_{\hat{\mathcal{M}}} = (x_{il})_{1\le i\le n,\, l\in\hat{\mathcal{M}}}$ as an $n \times |\hat{\mathcal{M}}|$ matrix and let $\lambda_{\max}(\cdot)$ denote the largest eigenvalue of a matrix. Our algorithm can be stated as follows:

  1. Initialize $B^{(0)} = B^{(1)}$, $\alpha^{(0)} = 0$ and $\alpha^{(1)} = 1$, $t = 1$, and $\delta = n/\lambda_{\max}\{(X_{\hat{\mathcal{M}}})^T X_{\hat{\mathcal{M}}}\}$.

  2. Repeat
    (a) $S^{(t)} = B^{(t)} + \frac{\alpha^{(t-1)} - 1}{\alpha^{(t)}}\,(B^{(t)} - B^{(t-1)})$;
    (b) for $d = 1, \ldots, |\hat{\mathcal{M}}|$:
        i. $(A_{temp})_{l_d} = S^{(t)}_{l_d} - \delta\,\nabla R(S^{(t)})_{l_d}$;
        ii. compute the singular value decomposition (SVD) $(A_{temp})_{l_d} = U_{l_d}\,\mathrm{diag}(a_{l_d})\,V_{l_d}^T$;
        iii. $b_{l_d} = (a_{l_d} - \lambda\delta)_{+}$;
        iv. $(B_{temp})_{l_d} = U_{l_d}\,\mathrm{diag}(b_{l_d})\,V_{l_d}^T$;
    (c) combine the submatrices $\{(B_{temp})_{l_d}, 1 \le d \le |\hat{\mathcal{M}}|\}$ to form the full matrix $B_{temp}$ and set $B^{(t+1)} = B_{temp}$;
    (d) $\alpha^{(t+1)} = \{1 + \sqrt{1 + (2\alpha^{(t)})^2}\}/2$;
    (e) $t = t + 1$;
  3. until objective function Q(B(t)) converges.

For the above $p \times q|\hat{\mathcal{M}}|$ matrices $A_{temp}$ and $B_{temp}$, $(A_{temp})_{l_d}$ and $(B_{temp})_{l_d}$ denote their $(dq - q + 1)$th to $(dq)$th columns, respectively.

A sufficient condition for the convergence of $\{B^{(t)}\}_{t\ge 1}$ is that the step size δ be smaller than or equal to 1/L(R), where L(R) is the smallest Lipschitz constant of the gradient of R(B) (Beck and Teboulle, 2009; Facchinei and Pang, 2003). In our case, L(R) equals $n^{-1}\lambda_{\max}\{(X_{\hat{\mathcal{M}}})^T X_{\hat{\mathcal{M}}}\}$.
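Putting the pieces together, the sketch below is one possible implementation of the Nesterov iteration above for problem (4): it follows the stated initialization, the step size $\delta = n/\lambda_{\max}\{X^T X\}$, and the singular value thresholding step, but the names (fit_l2rm, X, Y) are illustrative and simply match the hypothetical signature assumed in the cross-validation sketch earlier.

```python
# A minimal sketch of the accelerated proximal gradient (Nesterov) iteration
# for problem (4). X: (n, m) retained covariates; Y: (n, p, q) responses.
import numpy as np

def svt(A, lam):
    """Singular value thresholding (Proposition 1)."""
    U, a, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(a - lam, 0.0)) @ Vt

def fit_l2rm(X, Y, lam, max_iter=500, tol=1e-6):
    n, m = X.shape
    p, q = Y.shape[1], Y.shape[2]
    delta = n / np.linalg.eigvalsh(X.T @ X).max()      # step size 1/L(R)
    B = np.zeros((m, p, q)); B_old = B.copy()
    alpha, alpha_old = 1.0, 0.0                        # alpha^{(1)}, alpha^{(0)}
    obj_old = np.inf
    for _ in range(max_iter):
        # Extrapolated search point S^{(t)}
        S = B + ((alpha_old - 1.0) / alpha) * (B - B_old)
        resid = np.einsum('il,ljk->ijk', X, S) - Y      # fitted residuals
        grad = np.einsum('il,ijk->ljk', X, resid) / n   # gradient of R at S
        B_old = B
        # Block-wise singular value thresholding with threshold lam * delta
        B = np.stack([svt(S[l] - delta * grad[l], lam * delta) for l in range(m)])
        alpha_old, alpha = alpha, (1.0 + np.sqrt(1.0 + (2.0 * alpha) ** 2)) / 2.0
        # Objective Q(B) for the convergence check
        fit = Y - np.einsum('il,ljk->ijk', X, B)
        obj = 0.5 / n * np.sum(fit ** 2) + lam * sum(
            np.linalg.norm(B[l], ord='nuc') for l in range(m))
        if abs(obj_old - obj) < tol * max(1.0, abs(obj_old)):
            break
        obj_old = obj
    return B
```

Each iteration costs one gradient evaluation plus $|\hat{\mathcal{M}}|$ SVDs of p × q matrices; the extrapolation step is what yields the improved convergence rate noted above.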

Remarks: For model (1), it is assumed throughout the paper that $x_{il}$ has mean 0 and variance 1 for every $1 \le l \le s$ and that $Y_i$ has mean 0. If these assumptions do not hold in practice, a simple solution is to carry out a standardization step, that is, standardizing the covariates and centering the responses. We use this approach in the simulations and the real data analysis. An alternative approach is to introduce an intercept matrix $B_0$ in model (1). Our screening procedure is invariant to such a standardization step if we calculate $\hat{B}^M_{l,jk}$, the (j, k)th element of $\hat{B}_l^M$, as the sample correlation between $x_{il}$ and $Y_{i,jk}$. In the Supplementary Material, we present a modified version of our estimation algorithm and use simulations to evaluate the effects of the standardization step on estimating $B_l$. In our experience, scaling the covariates is necessary to ensure that all covariates are on the same scale, whereas centering the covariates and responses is not critical.
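A minimal sketch of the standardization step described in the remarks (scaling covariates to unit variance and centering covariates and responses); the function name is an illustrative choice:

```python
# Standardize covariates and center matrix responses before screening/fitting.
import numpy as np

def standardize(X, Y):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, variance 1 per covariate
    Yc = Y - Y.mean(axis=0)                     # center each (j, k) response entry
    return Xs, Yc
```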

3. Theoretical Properties

To address (III), we systematically investigate several key theoretical properties of the screening procedure and the regularized estimation procedure, as well as a theoretical guarantee for our two-step estimator. First, we investigate the sure independence screening property of the rank-one screening procedure when s (also denoted by $s_n$) diverges at an exponential rate of the sample size. Second, we investigate the estimation and rank consistency of our regularized estimator when both p and q are fixed. Third, we derive the non-asymptotic error bound for our estimator when both p and q are diverging. Finally, we establish an overall theoretical guarantee for our two-step estimator. We state the theorems below; their detailed proofs can be found in Appendix B.

3.1. Sure Screening Property

The following assumptions facilitate the technical arguments; they may not be the weakest possible conditions, but they simplify the proofs.

(A0) The covariates $x_i$ are i.i.d. from a distribution with mean 0 and covariance matrix $\Sigma_x$. Define $\sigma_l^2 = (\Sigma_x)_{ll}$. The vectorized error matrices $\mathrm{vec}(E_i)$ are i.i.d. from a distribution with mean 0 and covariance matrix $\Sigma_e$, where vec(·) denotes the vectorization of a matrix. Moreover, $x_i$ and $E_i = (E_{i,jk})$ are independent.

(A1) There exist some constants C1 > 0, b > 0, and 0 < κ < 1/2 such that

$$\min_{l\in\mathcal{M}} \Big\|\mathrm{cov}\Big(\sum_{l'\in\mathcal{M}} x_{il'} B_{l'0},\, x_{il}\Big)\Big\|_{op} \ge C_1 (pq)^{1/2} n^{-\kappa} \quad \text{and} \quad \max_{l\in\mathcal{M}} \|B_{l0}\|_\infty < b,$$

where $\mathrm{cov}(\sum_{l'\in\mathcal{M}} x_{il'} B_{l'0}, x_{il})$ is a p × q matrix whose (j, k)th element is $\mathrm{cov}(\sum_{l'\in\mathcal{M}} x_{il'} B_{l'0,jk}, x_{il})$, and $\|B_{l0}\|_\infty = \max_{1\le j\le p, 1\le k\le q} |B_{l0,jk}|$.

(A2) There exist positive constants C2 and C3 such that

$$\max\big(E\{\exp(C_2 x_{il}^2)\},\, E\{\exp(C_2 E_{i,jk}^2)\}\big) \le C_3$$

for every 1 ≤ lsn, 1 ≤ jp and 1 ≤ kq.

(A3) There exists a constant $C_4 > 0$ such that $\log(s_n) = C_4 n^{\xi}$ for $\xi \in (0, 1 - 2\kappa)$.

(A4) There exist constants $C_5 > 0$ and $\tau > 0$ such that $\lambda_{\max}(\Sigma_x) \le C_5 n^{\tau}$.

(A5) We assume $\log(pq) = o(n^{1-2\kappa})$.

Remarks: Assumptions (A0)-(A5) are used to establish the theory of our screening procedure when $s_n$ diverges to infinity. Assumption (A1) is analogous to Condition 3 in Fan and Lv (2008) and equation (4) in Fan and Song (2010), in which κ controls the rate of the probability error in recovering the true sparse model. Assumption (A2) is analogous to Condition (D) in Fan and Song (2010) and Condition (E) in Fan et al. (2011); it requires that $x_{il}$ and $E_{i,jk}$ be sub-Gaussian, which ensures that the tail probabilities are exponentially light. Assumption (A3) allows the dimension $s_n$ to diverge at an exponential rate of the sample size n, which is analogous to Condition 1 in Fan and Lv (2008). Assumption (A4) is analogous to Condition 4 in Fan and Lv (2008) and rules out strong collinearity. Assumption (A5) allows the product pq of the row and column dimensions of the response matrix to diverge at an exponential rate of the sample size n.

The following theorems show the sure screening property of the screening procedure. We allow p and q to be either fixed or diverging with sample size n.

Theorem 1. Under Assumptions (A0)-(A3) and (A5), if $\gamma_n = \alpha C_1 (pq)^{1/2} n^{-\kappa}$ with 0 < α < 1, then $P(\mathcal{M} \subset \hat{\mathcal{M}}_{\gamma_n}) \to 1$ as n → ∞.

Theorem 1 shows that if $\gamma_n$ is chosen properly, then our rank-one screening procedure will not miss any significant variables with overwhelming probability. Since the screening procedure automatically includes all significant covariates for small values of $\gamma_n$, it is necessary to consider the size of $\hat{\mathcal{M}}_{\gamma_n}$ when $\gamma_n = \alpha C_1(pq)^{1/2} n^{-\kappa}$.

Theorem 2. Under Assumptions (A0)-(A5), we have $P\{|\hat{\mathcal{M}}_{\gamma_n}| = O(n^{2\kappa+\tau})\} \to 1$ for $\gamma_n = \alpha C_1(pq)^{1/2} n^{-\kappa}$ with 0 < α < 1 as n → ∞.

Theorem 2 indicates that the selected model size under the sure screening property is only of polynomial order in n, even though the original model size is of exponential order in n. Therefore, the false selection rate of our screening procedure vanishes as n → ∞.

3.2. Theory for Estimation Procedure

From this subsection on, we denote $\hat{\mathcal{M}}_{\gamma_n}$ by $\hat{\mathcal{M}}$ for notational simplicity. We first provide some theoretical results for our estimation procedure. We assume that we can exactly select all the important variables in $\mathcal{M}$, i.e., $\hat{\mathcal{M}} = \mathcal{M}$, and that $s_0 = |\mathcal{M}|$ is fixed. The results also apply when the original s is fixed, in which case we only need to apply our estimation procedure.

We need more notation before introducing further assumptions. Suppose the rank of $B_{l0}$ is $r_l$. For every $l = 1, \ldots, s_n$, we denote by $U_{l0}\Theta_{l0}V_{l0}^T$ the singular value decomposition of $B_{l0}$ and use $U_{l0}^{\perp}$ and $V_{l0}^{\perp}$ to denote the orthogonal complements of $U_{l0}$ and $V_{l0}$, respectively. Define $\Sigma_{\mathcal{M}}$ as the covariance matrix of $x_{i,\mathcal{M}}$, where $x_{i,\mathcal{M}} = (x_{il})_{l\in\mathcal{M}} \in \mathbb{R}^{|\mathcal{M}|}$. We further define $A = \Sigma_{\mathcal{M}} \otimes I_{pq\times pq}$, $K_l = V_{l0}^{\perp} \otimes U_{l0}^{\perp}$, and $d_l = \mathrm{vec}(U_{l0}V_{l0}^T)$ for $l \in \mathcal{M}$, where ⊗ denotes the Kronecker product. Let $d = (d_{l_1}^T, \ldots, d_{l_{|\mathcal{M}|}}^T)^T$ and $K = \mathrm{diag}\{K_{l_1}, \ldots, K_{l_{|\mathcal{M}|}}\}$. We define $\Lambda_l \in \mathbb{R}^{(p-r_l)\times(q-r_l)}$ for $l \in \mathcal{M}$ such that $\mathrm{vec}(\Lambda) = (\mathrm{vec}(\Lambda_{l_1})^T, \ldots, \mathrm{vec}(\Lambda_{l_{|\mathcal{M}|}})^T)^T = (K^T A^{-1} K)^{-1} K^T A^{-1} d$. The matrices $\Lambda_l$ have an interesting interpretation; for instance, $\Lambda_l$ can be shown to be the Lagrange multiplier of an optimization problem. We provide more interpretation of $\Lambda_l$ in Appendix C.

We then state additional assumptions that are needed to establish the theory of our estimation procedure when both p and q are assumed to be fixed.

The following assumptions (A6)-(A8) are needed.

(A6) $\Sigma_{\mathcal{M}}$ is nonsingular.

(A7) $\max\{\mathrm{rank}(B_{l0}) : l \in \mathcal{M}\} < \min(p, q)$.

(A8) For every $l \in \mathcal{M}$, we assume $\|\Lambda_l\|_{op} < 1$.

Remarks: Assumption (A6) is a regularity condition in the low dimensional context, which rules out the scenario when one covariate is exactly a linear combination of other covariates. Assumption (A7) is used for rank consistency. Assumption (A8) can be regarded as the irrepresentable condition of Zhao and Yu (2006) in the rank consistency context. A similar condition can be found in Bach (2008).

Let $\hat{B}_l$ denote the regularized low-rank estimator of $B_l$ for $l \in \mathcal{M}$. We have the following consistency results, which depend on the rate at which the tuning parameter converges, when both p and q are fixed.

Theorem 3. (Estimation Consistency) Under Assumptions (A0) and (A6), we have

  • (i) if $n^{1/2}\lambda \to \infty$ and $\lambda \to 0$, then $\lambda^{-1}(\hat{B}_l - B_{l0}) = O_p(1)$ for all $l \in \mathcal{M}$;

  • (ii) if $n^{1/2}\lambda \to \rho \in [0, \infty)$ and $n \to \infty$, then $n^{1/2}(\hat{B}_l - B_{l0}) = O_p(1)$ for all $l \in \mathcal{M}$.

Theorem 3 reveals an interesting phase-transition phenomenon. When λ is relatively small or moderate, the convergence rate of $\hat{B}_l - B_{l0}$ is of order $n^{-1/2}$, whereas when λ gets large, the convergence rate of $\hat{B}_l - B_{l0}$ is of the order of λ. Although we have established the consistency of $\hat{B}_l$ as λ → 0, the next question is whether the rank of $\hat{B}_l$ is consistent under the same set of conditions. It turns out that such rank consistency only holds for relatively large λ, whose convergence rate to zero is slower than $n^{-1/2}$.

Theorem 4. (Rank Consistency) Under Assumptions (A0) and (A6)-(A8), if λ → 0 and $n^{1/2}\lambda \to \infty$ hold, then $P(\mathrm{rank}(\hat{B}_l) = \mathrm{rank}(B_{l0})) \to 1$ for all $l \in \mathcal{M}$.

Theorem 4 establishes the rank consistency of our regularized estimates. Theorems 3 and 4 reveal that estimation consistency and rank consistency hold simultaneously only when λ → 0 and $n^{1/2}\lambda \to \infty$. This phenomenon is similar to that for the Lasso estimator. Specifically, although the Lasso estimator can achieve model selection consistency, its convergence rate cannot attain $n^{-1/2}$ when selection consistency is satisfied (Zou, 2006).

We then consider the case when p and q are assumed to be diverging. The following assumptions (A9)-(A12) are needed.

(A9) There exist positive constants $C_L$ and $C_M$ such that $0 < C_L \le \lambda_{\min}(\Sigma_{\mathcal{M}}) \le \lambda_{\max}(\Sigma_{\mathcal{M}}) \le C_M < \infty$.

(A10) We assume that $x_{i,\mathcal{M}}$ are i.i.d. multivariate normal with mean 0 and covariance matrix $\Sigma_{\mathcal{M}}$.

(A11) The vectorized error matrices $\mathrm{vec}(E_i)$ are i.i.d. $N(0, \Sigma_e)$, where $\lambda_{\max}(\Sigma_e) \le C_U^2 < \infty$.

(A12) We assume max(p, q) → ∞ and max(p, q) = o(n) as n → ∞.

Remarks: Assumptions (A9)-(A12) are needed for our estimation procedure when both p and q diverge with the sample size n. Assumption (A9) requires that the largest eigenvalue of $\Sigma_{\mathcal{M}}$ be bounded and that the smallest eigenvalue of $\Sigma_{\mathcal{M}}$ be bounded away from 0. Assumption (A10) assumes that the covariates $x_{il}$'s are Gaussian. Assumption (A11) assumes that the largest eigenvalue of $\Sigma_e$ is bounded. Assumption (A12) allows p and q to diverge more slowly than n, but it does allow pq > n.

We then show the following non-asymptotic bound for our estimation procedure when both p and q are diverging.

Theorem 5. (Nonasymptotic bound when both p and q diverge) Under Assumptions (A9)-(A12), when $\lambda \ge 4 C_U C_M^{1/2} n^{-1/2}(p^{1/2} + q^{1/2})$, there exist some positive constants $c_1$, $c_2$, and $c_3$ such that, with probability at least $1 - c_1\exp\{-c_2(p+q)\} - c_3\exp(-n)$, we have

$$\|\hat{B} - B_0\|_F^2 \le C\Big(\sum_{l\in\mathcal{M}} r_l\Big)\lambda^2 C_L^{-2}$$

for some constant C > 0.

Theorem 5 implies that when $\{r_l, l\in\mathcal{M}\}$ and $|\mathcal{M}|$ are fixed and $\lambda \asymp n^{-1/2}(p^{1/2} + q^{1/2})$, the estimator $\hat{B}$ is consistent with probability going to 1. The convergence rate of the estimator in Theorem 5 coincides with that in Corollary 5 of Negahban et al. (2009), who studied the low-rank matrix learning problem using the trace norm regularization. Although considering different models, they also require the dimension of the matrix to satisfy max(p, q) = o(n). This differs significantly from L1-regularized problems, where the dimension may diverge at an exponential order of the sample size. The result in this theorem can also be regarded as a special case of the result in Raskutti and Yuan (2018), who derived non-asymptotic error bounds in a class of tensor regression models with sparse or low-rank penalties.

3.3. Theory for Two-step Estimator

In this section, we give a unified theory for our two-step estimator. In particular, we derive the non-asymptotic bound for our final estimate. To begin with, we introduce some notation. For simplicity, we use $\hat{\mathcal{M}}$ to denote $\hat{\mathcal{M}}_{\gamma_n}$, the set selected in the first step. Define $B_{\hat{\mathcal{M}}} = [B_l, l\in\hat{\mathcal{M}}] \in \mathbb{R}^{p\times q|\hat{\mathcal{M}}|}$ and the true value of $B_{\hat{\mathcal{M}}}$ as $B_{0\hat{\mathcal{M}}} = [B_{l0}, l\in\hat{\mathcal{M}}] \in \mathbb{R}^{p\times q|\hat{\mathcal{M}}|}$. Define $\hat{B}_{\hat{\mathcal{M}}} = [\hat{B}_l, l\in\hat{\mathcal{M}}] \in \mathbb{R}^{p\times q|\hat{\mathcal{M}}|}$ as the solution of the regularized trace norm penalization problem given by

$$\min_{B_{\hat{\mathcal{M}}}} Q(B_{\hat{\mathcal{M}}}) = \min_{B_{\hat{\mathcal{M}}}} \Big\{\frac{1}{2n}\sum_{i=1}^n \Big\|Y_i - \sum_{l\in\hat{\mathcal{M}}} x_{il}B_l\Big\|_F^2 + \lambda\sum_{l\in\hat{\mathcal{M}}} \|B_l\|_*\Big\}. \qquad (6)$$

We need the following assumptions.

(A13) Assume 2κ + τ < 1. Define $\iota_L \equiv \min_{\|\delta\|_0 \le m,\, \delta\neq 0} \delta^T\big(n^{-1}\sum_{i=1}^n x_i x_i^T\big)\delta \,/\, \|\delta\|_2^2$ for any $m = O(n^{2\kappa+\tau})$ and $\delta \in \mathbb{R}^{s}$. We further assume $\iota_L > 0$.

(A14) Assume $\max(p, q)/\log(n) \to \infty$ and $\max(p, q) = o(n^{1-2\tau})$ as n → ∞ with τ < 1/2.

Theorem 6. (Nonasymptotic bound for two-step estimator) Under Assumptions (A0)-(A5), (A10), (A11), (A13), and (A14), when $\lambda \ge 4C_5 n^{\tau-1/2}(p^{1/2}+q^{1/2})$, there exist some positive constants $c_1, \ldots, c_5$ such that, with probability at least $1 - c_1 n^{2\kappa+\tau}\exp\{-c_2(p+q)\} - c_3 n^{2\kappa+\tau}\exp(-n) - c_4\exp(-c_5 n^{1-2\kappa})$, we have

$$\|\hat{B}_{\hat{\mathcal{M}}} - B_{0\hat{\mathcal{M}}}\|_F^2 \le C\Big(\sum_{l\in\mathcal{M}} r_l\Big)\lambda^2 \iota_L^{-2}$$

for some constant C > 0.

Theorem 6 implies that when $\{r_l : l \in \mathcal{M}\}$, $|\mathcal{M}|$, and $\iota_L$ are fixed, the estimator $\hat{B}_{\hat{\mathcal{M}}}$ is consistent with probability going to 1 when $\lambda \asymp n^{\tau-1/2}(p^{1/2}+q^{1/2})$. Theorem 6 gives an overall theoretical guarantee for our two-step estimator by accounting for the random selection in the first step. A key fact used in the proof of Theorem 6 is that our first-step screening procedure enjoys the sure independence screening property. In this case, we only need to derive the non-asymptotic bound for the case in which we exactly select or over-select the important variables, as this event holds with overwhelming probability.

4. Simulations

We conduct simulations to examine the finite sample performance of the proposed estimation and screening procedures. For the sake of space, we include additional simulation results in the Supplementary Material.

4.1. Regularized Low-rank Estimate

In the first simulation, we simulate 64 × 64 matrix responses according to model (1) with s = 4 covariates. We set the four true coefficient matrices to be a cross shape (B10), a square shape (B20), a triangle shape (B30), and a butterfly shape (B40). The images of Bl0 are shown in Figure 1, and each of them consists of a yellow region of interest (ROI) containing ones and a blue ROI containing zeros.

Figure 1:

Simulation I setting: the four 64×64 true coefficient matrices for the first simulation setting: the cross shape for B10 in panel (a), the square shape for B20 in panel (b), the triangle shape for B30 in panel (c), and the butterfly shape for B40 in panel (d). The regression coefficient at each pixel is either 0 (blue) or 1 (yellow).

We independently generate all scalar covariates $x_i$ from $N(0, \Sigma_x)$, where $\Sigma_x = (\sigma_{x,ll'})$ is a covariance matrix with an autoregressive structure such that $\sigma_{x,ll'} = \rho_1^{|l-l'|}$ for $1 \le l, l' \le s$ with $\rho_1 = 0.5$. We independently generate $\mathrm{vec}(E_i)$ from $N(0, \Sigma_e)$. Specifically, we set the variance of each element of $E_i$ to $\sigma_e^2$ and the correlation between $E_{i,jk}$ and $E_{i,j'k'}$ to $\rho_2^{|j-j'|+|k-k'|}$ for $1 \le j, k, j', k' \le 64$ with $\rho_2 = 0.5$. We consider three different sample sizes, n = 100, 200, and 500, and set $\sigma_e^2$ to 1 and 25.
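For concreteness, the sketch below generates data from this setting; it exploits the fact that the stated error correlation $\rho_2^{|j-j'|+|k-k'|}$ factors into a Kronecker product of two AR(1) correlation matrices. The function names and the array B_true holding the four true coefficient matrices are illustrative assumptions.

```python
# A minimal sketch of the Simulation I data-generating mechanism.
import numpy as np

def ar1(dim, rho):
    """AR(1) correlation matrix with entries rho^{|i-j|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(B_true, n, rho1=0.5, rho2=0.5, sigma_e=1.0, rng=0):
    rng = np.random.default_rng(rng)
    s, p, q = B_true.shape
    Lx = np.linalg.cholesky(ar1(s, rho1))
    X = rng.standard_normal((n, s)) @ Lx.T                # x_i ~ N(0, Sigma_x)
    Lr, Lc = np.linalg.cholesky(ar1(p, rho2)), np.linalg.cholesky(ar1(q, rho2))
    Z = rng.standard_normal((n, p, q))
    # E_i = sigma_e * Lr Z_i Lc^T gives cov(vec(E_i)) = sigma_e^2 (R_p kron R_q)
    E = sigma_e * np.einsum('jr,irc,kc->ijk', Lr, Z, Lc)
    Y = np.einsum('il,ljk->ijk', X, B_true) + E
    return X, Y
```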

We use 100 replications to evaluate the finite-sample performance of our regularized low-rank (RLR) estimates $\hat{B}_l$. To evaluate the estimation accuracy, we compute the mean squared error of $\hat{B}_l$, defined as $\mathrm{MSE}(\hat{B}_l) = \|\hat{B}_l - B_{l0}\|_F^2$, for all 1 ≤ l ≤ 4. We also calculate the prediction errors (PE) by generating $n_{test} = 500$ independent test observations.

We compare our method with OLS, Lasso, fused Lasso, and the tensor envelope method (Li and Zhang, 2017). For a fair comparison, we also use five-fold cross-validation to select the regularization parameters of Lasso and fused Lasso and the envelope dimension of the tensor envelope method. The results are shown in Table 1. We also plot the RLR, OLS, Lasso, fused Lasso, and tensor envelope estimates of $\{\hat{B}_l, 1 \le l \le 4\}$ obtained from a randomly selected data set with n = 500 and $\sigma_e^2 = 25$ in Figure 2.

Table 1:

Simulation I results: the means of PEs and MSEs for regularized low-rank (RLR), OLS, Lasso, fused Lasso (Fused) and tensor envelope (Envelope) estimates and their associated standard errors in the parentheses. For each case, 100 simulated data sets are used.

(n, σ_e^2) Method MSE(B1) MSE(B2) MSE(B3) MSE(B4) PE
(100, 1) RLR 11.67(0.21) 9.96(0.22) 43.21(0.43) 44.88(0.52) 1.03(0.0002)
OLS 58.08(0.83) 72.38(1.08) 71.66(1.01) 58.00(0.92) 1.05(0.0004)
Lasso 42.96(0.79) 53.79(1.08) 53.21(0.98) 44.87(0.75) 1.04(0.0004)
Fused 11.85(0.20) 11.25(0.22) 13.61(0.23) 17.87(0.26) 1.02(0.0002)
Envelope 21.20(0.34) 24.95(0.36) 51.14(0.49) 55.62(0.67) 1.04(0.0002)
(200, 1) RLR 7.27(0.09) 6.73(0.10) 23.77(0.20) 23.08(0.23) 1.02(0.0001)
OLS 28.61(0.30) 34.88(0.36) 34.85(0.38) 28.11(0.32) 1.03(0.0001)
Lasso 19.29(0.38) 23.86(0.43) 23.73(0.41) 20.30(0.29) 1.02(0.0002)
Fused 5.93(0.09) 5.62(0.08) 6.59(0.10) 8.63(0.10) 1.01(0.0001)
Envelope 11.21(0.16) 13.13(0.14) 38.40(0.30) 43.33(0.35) 1.03(0.0001)
(500, 1) RLR 3.46(0.03) 3.53(0.03) 10.54(0.06) 9.75(0.06) 1.006(0.00003)
OLS 11.01(0.08) 13.87(0.10) 13.88(0.09) 11.04(0.07) 1.009(0.00003)
Lasso 5.93(0.17) 7.89(0.17) 7.78(0.18) 6.70(0.13) 1.007(0.00009)
Fused 2.36(0.03) 2.37(0.02) 2.73(0.03) 3.29(0.03) 1.004(0.00002)
Envelope 5.44(0.07) 6.60(0.07) 31.72(0.21) 38.29(0.30) 1.02(0.0001)
(100, 25) RLR 121.61(1.69) 119.58(2.37) 227.58(2.01) 263.90(2.77) 25.37(0.0027)
OLS 1451.95(20.64) 1809.40(27.01) 1791.53(25.36) 1450.10(23.10) 26.27(0.0099)
Lasso 1360.23(19.03) 1683.95(24.74) 1669.88(23.97) 1367.71(21.34) 26.22(0.0093)
Fused 238.87(3.04) 254.78(4.36) 290.50(4.47) 283.07(3.89) 25.42(0.0031)
Envelope 175.09(1.57) 139.95(2.71) 259.48(2.18) 286.98(1.81) 25.39(0.0023)
(200, 25) RLR 79.44(1.01) 71.27(1.25) 171.12(1.21) 201.43(1.63) 25.26(0.0013)
OLS 715.28(7.49) 872.02(8.93) 871.33(9.41) 702.70(8.00) 25.66(0.0037)
Lasso 657.54(7.19) 798.43(8.58) 798.01(9.04) 652.91(7.14) 25.62(0.0035)
Fused 156.75(2.00) 162.27(1.95) 174.79(2.34) 175.61(2.20) 25.27(0.0016)
Envelope 151.68(1.55) 105.18(1.61) 202.96(2.78) 230.24(2.63) 25.29(0.0016)
(500, 25) RLR 42.17(0.50) 39.70(0.59) 110.16(0.79) 125.08(0.75) 25.10(0.0005)
OLS 275.31(2.05) 346.69(2.54) 346.93(2.36) 276.05(1.83) 25.24(0.0008)
Lasso 238.43(2.33) 299.22(2.99) 298.81(2.90) 243.44(1.95) 25.22(0.0011)
Fused 80.31(0.84) 89.14(0.79) 93.22(0.90) 89.8(0.83) 25.10(0.0006)
Envelope 95.49(0.99) 75.41(0.99) 142.24(1.51) 171.34(1.31) 25.14(0.0008)

Figure 2:

Simulation I results: the RLR (panels (a)-(d)), OLS (panels (e)-(h)), Lasso (panels (i)-(l)), fused Lasso (panels (m)-(p)), and Envelope (panels (q)-(t)) estimates of the coefficient matrices from a randomly selected training dataset with n = 500, ρ1 = 0.5, ρ2 = 0.5, and $\sigma_e^2 = 25$: $\hat{B}_1$ (first column); $\hat{B}_2$ (second column); $\hat{B}_3$ (third column); and $\hat{B}_4$ (fourth column).

Inspecting Figure 2 and Table 1 reveals the following findings. First, our method always outperforms OLS and the envelope method. Second, when the images are of low rank (cross and square), our estimation method substantially outperforms Lasso. Third, our method outperforms fused Lasso when either the sample size is small or the noise variance is large, whereas fused Lasso outperforms our method in the other cases. Fourth, when the images are not of low rank (triangle and butterfly), fused Lasso performs best in most cases, whereas our method outperforms Lasso when either the noise level is high or the sample size is small.

These findings are not surprising. First, in all settings, since all the true coefficient matrices are piecewise sparse, the fused Lasso method is expected to perform well. Second, Lasso works reasonably well since it still imposes sparse structure. Third, since our method is designed for low rank cases, it performs well for the low rank cross and square cases, whereas it performs relatively worse for the triangle and butterfly cases.

We then conduct a second simulation study in which the coefficient matrices have a low-rank structure but no sparse structure. Specifically, we simulate 64 × 64 matrix responses according to model (1) with s = 2 covariates. We set the two true coefficient matrices as $B_{10} = \sum_{j=1}^{10}\lambda_{1,j}u_{1,j}v_{1,j}^T$ and $B_{20} = \sum_{j=1}^{5}\lambda_{2,j}u_{2,j}v_{2,j}^T$, where $\lambda_1 = (\lambda_{1,1}, \ldots, \lambda_{1,10})^T = (2, 1.8, 1.6, 1.4, 1.2, 1, 0.8, 0.6, 0.4, 0.2)^T$, $\lambda_2 = (\lambda_{2,1}, \lambda_{2,2}, \lambda_{2,3}, \lambda_{2,4}, \lambda_{2,5})^T = (2, 1.6, 1.2, 0.8, 0.4)^T$, and $u_{1,j}$, $u_{2,j}$, $v_{1,j}$, $v_{2,j}$ are column vectors of dimension 64. For $U_1 = (u_{1,1}, \ldots, u_{1,10})$ and $V_1 = (v_{1,1}, \ldots, v_{1,10})$, each is generated by orthogonalizing a 64 × 10 matrix with i.i.d. standard normal elements. For $U_2 = (u_{2,1}, \ldots, u_{2,5})$ and $V_2 = (v_{2,1}, \ldots, v_{2,5})$, each is generated by orthogonalizing a 64 × 5 matrix with i.i.d. standard normal elements. All other settings are the same as those in Section 4.1. Table 2 summarizes the results. Our method outperforms all competing methods when the true coefficient matrices have a low-rank structure but no sparse structure.
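A minimal sketch of constructing such purely low-rank coefficient matrices by orthogonalizing Gaussian matrices, as described above; the function and variable names are illustrative.

```python
# Construct rank-10 and rank-5 64 x 64 coefficient matrices for Simulation II.
import numpy as np

def low_rank_coef(dim, singular_values, rng):
    r = len(singular_values)
    U, _ = np.linalg.qr(rng.standard_normal((dim, r)))   # orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((dim, r)))
    return (U * singular_values) @ V.T                   # U diag(sv) V^T

rng = np.random.default_rng(0)
B10 = low_rank_coef(64, np.linspace(2.0, 0.2, 10), rng)  # 2, 1.8, ..., 0.2
B20 = low_rank_coef(64, np.linspace(2.0, 0.4, 5), rng)   # 2, 1.6, ..., 0.4
```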

Table 2:

Simulation II results: the means of PEs and MSEs for regularized low-rank (RLR), OLS, Lasso, fused Lasso (Fused), and tensor envelope (Envelope) estimates and their associated standard errors in parentheses. For each case, 100 simulated data sets are used.

(n, σ_e^2) Method MSE(B1) MSE(B2) PE
(100, 1) RLR 21.86(0.33) 13.91(0.20) 1.02(0.0001)
OLS 41.85(0.16) 56.82(0.82) 1.03(0.0003)
Lasso 55.78(0.88) 54.40(0.73) 1.03(0.0002)
Fused 57.39(0.94) 56.61(0.77) 1.03(0.0002)
Envelope 41.46(0.15) 33.23(0.54) 1.02(0.0002)
(200, 1) RLR 10.84(0.12) 6.77(0.07) 1.011(0.00005)
OLS 20.75(0.07) 27.80(0.30) 1.018(0.0001)
Lasso 27.90(0.31) 27.03(0.29) 1.018(0.0001)
Fused 27.69(0.30) 27.29(0.29) 1.018(0.0001)
Envelope 20.68(0.07) 18.33(0.22) 1.014(0.00008)
(500, 1) RLR 4.19(0.05) 2.73(0.02) 1.004(0.00002)
OLS 8.22(0.03) 10.94(0.08) 1.006(0.00003)
Lasso 12.49(0.18) 12.18(0.17) 1.007(0.00006)
Fused 10.99(0.08) 10.91(0.09) 1.006(0.00003)
Envelope 8.26(0.03) 9.37(0.09) 1.006(0.00003)
(100, 25) RLR 391.95(5.50) 254.67(3.54) 25.37(0.0024)
OLS 1044(4.10) 1447(23.94) 25.77(0.0060)
Lasso 1378.32(22.99) 1360.95(18.52) 25.75(0.0059)
Fused 1232.31(17.74) 1042.72(12.13) 25.68(0.0055)
Envelope 1033.69(3.66) 626.57(7.71) 25.46(0.0027)
(200, 25) RLR 219.13(2.14) 136.41(1.33) 25.26(0.0011)
OLS 518.63(1.83) 694.98(7.47) 25.45(0.0025)
Lasso 657.39(7.19) 644.78(7.20) 25.44(0.0025)
Fused 637.01(6.45) 589.9(5.68) 25.43(0.0022)
Envelope 516.52(1.81) 395.64(3.67) 25.33(0.0015)
(500, 25) RLR 101.8(0.91) 64.19(0.57) 25.09(0.0005)
OLS 206.57(0.73) 275.3(2.05) 25.16(0.0008)
Lasso 259.2(1.95) 254.26(2.17) 25.16(0.0008)
Fused 265.44(1.89) 255.88(1.99) 25.16(0.0008)
Envelope 206.41(0.73) 226.94(1.54) 25.14(0.0006)

4.2. Rank-one Screening using SNP Covariates

We generate 64 × 64 matrix responses according to model (1). We use the same method as in Section 4.1 to generate $E_i$ with $\rho_2 = 0.5$ and $\sigma_e^2 = 1$ or 25. We generate genetic covariates by mimicking the SNP data used in Section 5. Specifically, we use Linkage Disequilibrium (LD) blocks defined by the default method (Gabriel et al., 2002) of Haploview (Barrett et al., 2005) and PLINK (Purcell et al., 2007) to form SNP-sets. To calculate LD blocks, n subjects are simulated by randomly combining haplotypes of HapMap CEU subjects. We use PLINK to determine the LD blocks based on these subjects. We randomly select $s_n/10$ blocks and combine haplotypes of HapMap CEU subjects in each block to form genotype variables for these subjects. We randomly select 10 SNPs in each block, so that we have $s_n$ SNPs for each subject. We set $s_n$ = 2,000 and 5,000 and choose the first 20 SNPs as the significant SNPs. That is, we set the first 20 true coefficient matrices as nonzero matrices $B_{1,0} = \cdots = B_{20,0} = B_{true}$ and the remaining coefficient matrices as zero. We consider three types of coefficient matrices $B_{true}$ with significant regions of different sizes, i.e., $(p_s, q_s)$ = (4, 4), (8, 8), and (16, 16), where $p_s$ and $q_s$ denote the true size of the significant regions of interest. Figure 3 presents the true images $B_{true}$, each of which contains a yellow ROI containing ones and a blue ROI containing zeros.

In this subsection, we evaluate the effect of using different γn on the finite sample performance of the screening procedure. We will investigate the proposed random decoupling in the next subsection. Specifically, by sorting the magnitude of B^lMop in descending order, we define M^k as

$$\hat{\mathcal{M}}_k = \{1\le l\le s_n : \|\hat{B}_l^M\|_{op} \text{ is among the first } k \text{ largest of all covariates}\}. \qquad (7)$$

We apply our screening procedure to each simulated data set and report the average true nonzero coverage proportion as k varies from 1 to 200. In this case, $\mathcal{M} = \{1, 2, \ldots, 20\}$ is the set of all true nonzero indices, and $\hat{\mathcal{M}}_k$ is the index set selected by our screening method. The true nonzero coverage proportion is defined as $|\hat{\mathcal{M}}_k \cap \mathcal{M}|/|\mathcal{M}|$. We consider three different sample sizes, n = 100, 200, and 500, and run 100 Monte Carlo replications for each scenario.
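The coverage metric is straightforward to compute from the spectral norms of the marginal estimates; a minimal sketch (helper name illustrative):

```python
# True nonzero coverage proportion |M_hat_k ∩ M| / |M| for the top-k submodel.
import numpy as np

def coverage_proportion(spectral_norms, true_set, k):
    """M_hat_k indexes the k largest spectral norms."""
    top_k = set(np.argsort(spectral_norms)[::-1][:k])
    return len(top_k & set(true_set)) / len(true_set)
```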

We consider four screening methods: the rank-one screening method, the L1 entrywise norm screening, the Frobenius norm screening, and the global Wald test screening proposed in Huang et al. (2015). The curves of the average true nonzero coverage proportion (in percent) for different values of k are presented for the case $(\sigma_e^2, s_n) = (1, 2000)$ in Figure 4 and for the case $(\sigma_e^2, s_n) = (25, 2000)$ in Figure 5. Inspecting Figures 4 and 5 reveals that the rank-one screening significantly outperforms the other three methods, followed by the Frobenius norm screening. As expected, increasing the sample size n and/or k increases the true nonzero coverage proportion of all four methods. We include additional simulation results for the cases $(\sigma_e^2, s_n) = (1, 5000)$ in Figure S1 and $(\sigma_e^2, s_n) = (25, 5000)$ in Figure S2 in the Supplementary Material; the findings are similar. Overall, the rank-one screening method is more robust to noise and signal region size.

Figure 4:

Screening simulation results for the case $(\sigma_e^2, s_n) = (1, 2000)$: curves of the average true nonzero coverage proportion (in percent). The black solid, blue dashed, red dotted, and purple dash-dotted lines correspond to the rank-one screening, the L1 entrywise norm screening, the Frobenius norm screening, and the global Wald test screening, respectively. Panels (a)-(i) correspond to (n, ps, qs) = (100, 4, 4), (200, 4, 4), (500, 4, 4), (100, 8, 8), (200, 8, 8), (500, 8, 8), (100, 16, 16), (200, 16, 16), and (500, 16, 16), respectively.

Figure 5:

Screening simulation results for the case $(\sigma_e^2, s_n) = (25, 2000)$: curves of the average true nonzero coverage proportion (in percent). The black solid, blue dashed, red dotted, and purple dash-dotted lines correspond to the rank-one screening, the L1 entrywise norm screening, the Frobenius norm screening, and the global Wald test screening, respectively. Panels (a)-(i) correspond to (n, ps, qs) = (100, 4, 4), (200, 4, 4), (500, 4, 4), (100, 8, 8), (200, 8, 8), (500, 8, 8), (100, 16, 16), (200, 16, 16), and (500, 16, 16), respectively.

4.3. Simulation study for two-step procedure

In this subsection, we perform a simulation study to evaluate our two-step screening and estimation procedure. We simulate 64 × 64 matrix responses according to model (1) with sn covariates. We set the first four true coefficient matrices to be a cross shape (B10), a square shape (B20), a triangle shape (B30), and a butterfly shape (B40) shown in Figure 1. For the remaining coefficient matrices {Bl0, 5 ≤ lsn}, we set them as zero matrices. We consider sn = 2000 and 5000.

We independently generate all scalar covariates $x_i$ from $N(0, \Sigma_x)$, where $\Sigma_x = (\sigma_{x,ll'})$ is a covariance matrix with an autoregressive structure such that $\sigma_{x,ll'} = \rho_1^{|l-l'|}$ for $1 \le l, l' \le s_n$ with $\rho_1 = 0.5$. We independently generate $\mathrm{vec}(E_i)$ from $N(0, \Sigma_e)$. Specifically, we set the variance of each element of $E_i$ to $\sigma_e^2$ and the correlation between $E_{i,jk}$ and $E_{i,j'k'}$ to $\rho_2^{|j-j'|+|k-k'|}$ for $1 \le j, k, j', k' \le 64$ with $\rho_2 = 0.5$. We consider three different sample sizes, n = 100, 200, and 500, and set $\sigma_e^2$ to 1 and 25.

First, we evaluate the finite sample performance of the random decoupling. We perform our screening procedure based on the random decoupling and then apply our regularized low-rank estimation procedure. We report the MSEs of $\hat{B}_l$ (l = 1, 2, 3, 4), the model size, and the prediction error based on 100 replications in Table 3. We report the proportion of times that we exactly select the true model $\mathcal{M} = \{1, 2, 3, 4\}$, the proportion that we over-select some variables but include all the true ones, and the proportion that we miss some of the important covariates in Table 4. The proposed random decoupling works well in choosing $\gamma_n$, since the covariate set selected based on $\gamma_n$ includes the true covariates with high probability in all scenarios.

Table 3:

The means of prediction errors (PEs) and MSEs for our two-step procedure, and the average selected model size for our screening procedure. The associated standard errors are in parentheses. For each case, 100 simulated data sets are used.

(n, s_n, σ_e^2) MSE(B1) MSE(B2) MSE(B3) MSE(B4) PE Model Size
(100, 2000, 1) 15.87(3.22) 10.73(0.67) 43.52(0.46) 47.07(0.58) 1.05(0.001) 5.24(0.11)
(200, 2000, 1) 5.92(0.10) 5.10(0.11) 27.61(0.27) 28.23(0.31) 1.03(0.0003) 5.87(0.11)
(100, 5000, 1) 32.28(6.57) 12.60(1.08) 44.28(0.52) 47.14(0.65) 1.07(0.002) 5.03(0.10)
(200, 5000, 1) 5.92(0.09) 4.94(0.09) 28.03(0.29) 28.35(0.29) 1.03(0.0005) 5.83(0.13)
(100, 2000, 25) 126.17(1.84) 119.15(2.56) 227.86(1.96) 279.22(3.17) 25.99(0.027) 5.15(0.11)
(200, 2000, 25) 84.42(1.25) 73.69(1.41) 177.90(1.66) 214.67(2.15) 25.54(0.012) 5.82(0.11)
(100, 5000, 25) 136.27(3.45) 118.73(2.63) 228.65(2.19) 278.43(3.42) 26.04(0.024) 4.96(0.11)
(200, 5000, 25) 82.69(1.12) 73.53(1.37) 177.44(1.57) 211.25(2.12) 25.56(0.012) 5.75(0.13)

Table 4:

The means of PEs and MSEs for our two-step procedure in three scenarios: exact selection (“Exact”), over selection (“Over”) and missing variables (“Miss”). The proportion of times among 100 simulated data sets for each scenario is also reported. The “NA” denotes the values that are not applicable.

(n, s_n, σ_e^2) Scenario MSE(B1) MSE(B2) MSE(B3) MSE(B4) PE Proportion
(100, 2000, 1) Exact 11.61(0.18) 9.18(0.16) 43.22(0.38) 45.55(0.47) 1.06(0.001) 0.27
Over 11.17(0.17) 10.11(0.20) 43.6(0.46) 47.31(0.56) 1.05(0.001) 0.71
Miss 240(0) 53.49(1.68) 44.58(1.61) 59.11(1.18) 1.1(0.001) 0.02
(200, 2000, 1) Exact 7.09(0.08) 6.6(0.12) 23.97(0.16) 23.11(0.09) 1.03(0.0005) 0.04
Over 5.87(0.09) 5.03(0.11) 27.76(0.26) 28.44(0.30) 1.03(0.0003) 0.96
Miss NA NA NA NA NA 0
(100, 5000, 1) Exact 11.69(0.17) 9.74(0.19) 44.14(0.65) 45.79(0.41) 1.07(0.001) 0.27
Over 11.76(0.20) 9.75(0.19) 44.24(0.43) 46.60(0.54) 1.06(0.001) 0.64
Miss 240(0) 41.48(1.93) 44.96(0.65) 55.10(1.23) 1.12(0.001) 0.09
(200, 5000, 1) Exact 7.52(0.09) 6.69(0.11) 24.91(0.22) 24.24(0.08) 1.04(0.001) 0.05
Over 5.84(0.08) 4.84(0.08) 28.19(0.28) 28.57(0.28) 1.03(0.0005) 0.95
Miss NA NA NA NA NA 0
(100, 2000, 25) Exact 121.75(1.27) 114.12(2.19) 229.79(1.95) 270.16(3.47) 26.18(0.024) 0.29
Over 126.37(1.49) 121.47(2.69) 227.21(1.98) 283.29(3.00) 25.91(0.023) 0.70
Miss 240(NA) 102.16(NA) 217.49(NA) 256.52(NA) 26.45(NA) 0.01
(200, 2000, 25) Exact 79.26(0.80) 64.22(0.61) 177.62(0.86) 207.41(1.18) 25.54(0.017) 0.07
Over 84.81(1.27) 74.4(1.43) 177.92(1.71) 215.21(2.20) 25.54(0.012) 0.93
Miss NA NA NA NA NA 0
(100, 5000, 25) Exact 122.88(1.74) 114.5(2.49) 217.93(1.87) 260.99(2.64) 26.20(0.019) 0.34
Over 129.81(1.53) 122.59(2.63) 233.46(2.17) 286.15(3.34) 25.92(0.020) 0.58
Miss 240(0) 108.61(3.01) 239.31(2.06) 296.51(4.25) 26.23(0.022) 0.08
(200, 5000, 25) Exact 85.2(0.87) 72.92(1.48) 174.83(1.20) 206.34(2.32) 25.66(0.014) 0.06
Over 82.53(1.14) 73.57(1.37) 177.61(1.60) 211.56(2.12) 25.55(0.012) 0.94
Miss NA NA NA NA NA 0

Second, we consider over-selecting and/or missing some covariates. For each of the three cases above, we report the MSEs of $\hat{B}_l$ (l = 1, 2, 3, 4) and the prediction error in Table 4. When the screening procedure over-selects some irrelevant variables, the MSEs of the true nonzero coefficient matrices and the prediction error of the fitted model are similar to those obtained from the model with the correct set of covariates. In contrast, if the screening procedure misses some important variables, then the estimates corresponding to the missed variables fail completely, since the corresponding coefficient matrices are estimated as zero. However, according to the simulation results, the MSEs corresponding to the important variables that are selected remain similar to those obtained from the model with the correct set of covariates. The prediction error increases because of the missed important variables.

5. The Philadelphia Neurodevelopmental Cohort

5.1. Data Description and Preprocessing Pipeline

To motivate the proposed methodology, we consider a large database with imaging, genetic, and clinical data collected by the Philadelphia Neurodevelopmental Cohort (PNC) study. This study was a collaboration between the Center for Applied Genomics (CAG) at Children's Hospital of Philadelphia (CHOP) and the Brain Behavior Laboratory at the University of Pennsylvania (Penn). The PNC cohort consists of youths aged 8-21 years in the CHOP network who volunteered to participate in genomic studies of complex pediatric disorders. All participants underwent clinical assessment and a neuroscience-based computerized neurocognitive battery (CNB), and a subsample underwent neuroimaging. We consider 814 subjects (429 females and 385 males). The ages of the 814 participants range from 8 to 21 years, with mean 14.36 years and standard deviation 3.48 years. Specifically, each subject has a resting-state functional magnetic resonance imaging (rs-fMRI) connectivity matrix, represented as a 69 × 69 matrix, and a large genetic data set with around 5,400,000 genotyped and imputed single-nucleotide polymorphisms (SNPs) on all of the 22 chromosomes. Other clinical variables of interest include age and gender, among others. Our primary question of interest is to identify novel genetic effects on local rs-fMRI connectivity changes.

We preprocess the resting-state fMRI data using the C-PAC pipeline. First, we register the fMRI data to the standard MNI 2mm resolution template and perform segmentation using the C-PAC default settings. Next, we perform motion correction using the Friston 24-parameter method. We also perform nuisance signal correction by regressing out the following variables: the top 5 principal components in the noise regions of interest (ROIs), cerebrospinal fluid (CSF), motion parameters, and linear trends in the time series. Finally, we extract the ROI time series by averaging the voxel-wise time series in each ROI. The atlases that we use are the Harvard-Oxford Cortical Atlas (48 regions) and the Harvard-Oxford Subcortical Atlas (21 regions), which can be found in FSL. In total, we extract a time series for each of the 69 regions, and each time series has 120 observations after deleting the first and last 3 scans.

5.2. Analysis and Results

We first fit model (1) with the rs-fMRI connectivity matrices from the 814 subjects as 69 × 69 matrix responses and age and gender as clinical covariates. We also include the first 5 principal component scores based on the SNP data as covariates to correct for population stratification. We calculate the ordinary least squares estimates of the coefficient matrices and then compute the corresponding residual matrices for the brain connectivity response matrix after adjusting for the effects of the clinical covariates and the SNP principal component scores.

Second, we apply the rank-one screening procedure, using the residual matrices as responses, to select from the whole set of 5,354,265 SNPs those that are highly associated with the residual matrices. We use the random decoupling method described in Section 2.1 to choose the thresholding value $\gamma_n$ and select all indices whose $\|\hat{B}_l^M\|_{op}$ is larger than $\gamma_n$. In total, seven covariates are selected; their names are shown in Table 5. Among these seven SNPs, the first three, on Chromosome 5, have exactly the same genotypes for all subjects, and the next four, on Chromosome 10, have exactly the same genotypes for all subjects.

Table 5:

PNC data analysis results: the top 7 SNPs selected by our screening procedure.

Ranking Chromosome SNP
1 5 rs72775042
2 5 rs6881067
3 5 rs72775059
4 10 rs200328746
5 10 rs75860012
6 10 rs200248696
7 10 rs78309702

Finally, we examine the effects of the selected SNPs on our matrix response. We first fit OLS using these 7 SNPs. Since the first three SNPs have exactly the same genotypes and the next four have exactly the same genotypes, we regress our matrix response on the first selected SNP and on the fourth selected SNP, yielding two coefficient matrix estimates $\hat{B}_{(1)}^{ols}$ and $\hat{B}_{(2)}^{ols}$. The OLS estimates for the 7 SNPs are then defined as $\hat{B}_1^{ols} = \hat{B}_2^{ols} = \hat{B}_3^{ols} = \hat{B}_{(1)}^{ols}/3$ and $\hat{B}_4^{ols} = \hat{B}_5^{ols} = \hat{B}_6^{ols} = \hat{B}_7^{ols} = \hat{B}_{(2)}^{ols}/4$. We then calculate the singular values of these 7 OLS estimates and plot them in decreasing order in Figure 6. Inspecting Figure 6 reveals that these estimated coefficient matrices have a clear low-rank pattern, since the first few singular values dominate the remaining ones. This motivates us to apply our RLR estimation procedure to estimate the coefficient matrices corresponding to these 7 SNP covariates. Figure 7(a)-(g) presents the coefficient matrix estimates associated with these SNPs. The coefficient matrices corresponding to the first three selected SNPs are the same, and those corresponding to the next four selected SNPs are the same. The estimated ranks of the seven coefficient matrices are 11, 11, 11, 8, 8, 8, and 8, respectively.
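The low-rank diagnostic used here amounts to inspecting the ordered singular values of each OLS coefficient estimate; a minimal sketch, assuming matplotlib is available, with the function name illustrative:

```python
# Plot the sorted singular values of an OLS coefficient estimate B_ols (p x q).
import numpy as np
import matplotlib.pyplot as plt

def plot_singular_values(B_ols):
    sv = np.linalg.svd(B_ols, compute_uv=False)   # returned in decreasing order
    plt.plot(np.arange(1, len(sv) + 1), sv, 'o-')
    plt.xlabel('index'); plt.ylabel('singular value')
    plt.show()
```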

Figure 6:

PNC data: Panel (a)-(g) are the plots for the singular values of the OLS estimates corresponding to the 7 SNPs selected by our screening step, with singular values sorted from largest to smallest.

Figure 7:

PNC data: Panel (a)-(g) are the plots for our RLR estimates corresponding to the 7 SNPs selected by our screening step.

6. Discussion

Motivated by the analysis of imaging genetic data, we have proposed a low-rank linear regression model to correlate high-dimensional matrix responses with a high-dimensional vector of covariates when the coefficient matrices are approximately low-rank. We have developed a fast and efficient rank-one screening procedure, which enjoys the sure independence screening property as well as a vanishing false selection rate, to reduce the covariate space. We have developed a regularized estimator of the coefficient matrices based on the trace norm regularization, which explicitly incorporates the low-rank structure of the coefficient matrices, and established its estimation consistency. We have further established a theoretical guarantee for the overall solution obtained from our two-step screening and estimation procedure. We have demonstrated the efficiency of our methods by using simulations and the analysis of the PNC dataset.


Figure 3: Screening setting: Panels (a)-(c) show the true coefficient images $B_{true}$ with regions of interest of different sizes: effective regions of interest (yellow ROIs) and non-effective regions of interest (blue ROIs).

Acknowledgement

The authors would like to thank the Editor, Associate Editor and two reviewers for their constructive comments, which have substantially improved the paper.

Dr. Kong’s work was partially supported by the National Science Foundation, the National Institutes of Health, and the Natural Sciences and Engineering Research Council of Canada. Dr. Baiguo An’s work was partially supported by NSF of China (No. 11601349). Dr. Zhu’s work was partially supported by NIH grants R01MH086633 and R01MH116527, and NSF grants SES-1357666 and DMS-1407655. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or any other funding agency.

The PNC data were obtained through dbGaP (phs000607.v1.p1). Support for the collection of the data sets was provided by grant RC2MH089983 awarded to Raquel Gur and grant RC2MH089924 awarded to Hakon Hakonarson. All subjects were recruited through the Center for Applied Genomics at the Children’s Hospital of Philadelphia.

A. Auxiliary Lemmas

In this section, we include the auxiliary lemmas needed for the theorems and their proofs.

Lemma 1. (Bernstein’s inequality) Let $Z_1,\ldots,Z_n$ be independent random variables with zero mean such that $E|Z_i|^m \le m!M^{m-2}v_i/2$ for every $m\ge2$ (and all $i$) and some positive constants $M$ and $v_i$. Then $P(|Z_1 + \cdots + Z_n| > x) \le 2\exp[-x^2/\{2(v + Mx)\}]$ for $v \ge v_1 + \cdots + v_n$.

This lemma is Lemma 2.2.11 of van der Vaart and Wellner (2000), and we omit the proof.
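As a quick numerical sanity check of Lemma 1 (not part of the original argument), Rademacher variables satisfy the moment condition with $M = v_i = 1$, so the bound reads $P(|Z_1+\cdots+Z_n| > x) \le 2\exp\{-x^2/(2(n + x))\}$; the following sketch compares the empirical tail with this bound.

```python
import numpy as np

# Monte Carlo check of Bernstein's inequality (Lemma 1) for Rademacher
# variables: E|Z_i|^m = 1 <= m! * 1^{m-2} * 1 / 2 for all m >= 2, so M = v_i = 1.
rng = np.random.default_rng(0)
n, reps = 100, 20_000
S = rng.choice([-1.0, 1.0], size=(reps, n)).sum(axis=1)

for x in (10.0, 20.0, 30.0):
    empirical = np.mean(np.abs(S) > x)
    bound = 2.0 * np.exp(-x**2 / (2.0 * (n + x)))   # v = n, M = 1
    print(f"x = {x:4.0f}: empirical tail = {empirical:.4f}, Bernstein bound = {bound:.4f}")
```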

Lemma 2. Under Assumptions A0-A2, for arbitrary t > 0 and every l, l′, j, k, we have that

$$P\Big(\Big|\sum_{i=1}^n\{x_{il}x_{il'} - E(x_{il}x_{il'})\}\Big| \ge t\Big) \le 2\exp\Big\{-\frac{t^2}{2(2nC_2^{-2}e^{C_2}C_3 + tC_2^{-1})}\Big\},$$

and

$$P\Big(\Big|\sum_{i=1}^n x_{il}E_{i,jk}\Big| \ge t\Big) \le 2\exp\Big\{-\frac{t^2}{2(2nC_2^{-2}e^{C_2}C_3 + tC_2^{-1})}\Big\}.$$

Proof of Lemma 2: By Assumptions A1 and A2, we have

$$E[\exp\{C_2|x_{il}x_{il'} - E(x_{il}x_{il'})|\}] \le e^{C_2|E(x_{il}x_{il'})|}E\{e^{C_2|x_{il}x_{il'}|}\} \le e^{C_2}E\{e^{C_2x_{il}^2/2}e^{C_2x_{il'}^2/2}\} \le e^{C_2}[E\{e^{C_2x_{il}^2}\}E\{e^{C_2x_{il'}^2}\}]^{1/2} \le e^{C_2}C_3.$$

For every $m \ge 2$, one has

$$E\{|x_{il}x_{il'} - E(x_{il}x_{il'})|^m\} \le m!C_2^{-m}E\{\exp(C_2|x_{il}x_{il'} - E(x_{il}x_{il'})|)\} \le m!C_2^{-m}e^{C_2}C_3.$$

It follows from Lemma 1 that we have

$$P\Big(\Big|\sum_{i=1}^n\{x_{il}x_{il'} - E(x_{il}x_{il'})\}\Big| \ge t\Big) \le 2\exp\Big\{-\frac{t^2}{2(2nC_2^{-2}e^{C_2}C_3 + tC_2^{-1})}\Big\}.$$

Similarly, we obtain

$$E\{\exp(C_2|x_{il}E_{i,jk}|)\} \le e^{C_2}E(e^{C_2x_{il}^2/2}e^{C_2E_{i,jk}^2/2}) \le e^{C_2}\{E(e^{C_2x_{il}^2})E(e^{C_2E_{i,jk}^2})\}^{1/2} \le e^{C_2}C_3.$$

For every $m \ge 2$, we have $E|x_{il}E_{i,jk}|^m \le m!C_2^{-m}E\{\exp(C_2|x_{il}E_{i,jk}|)\} \le m!C_2^{-m}e^{C_2}C_3$. Therefore, it follows from Lemma 1 that we have

$$P\Big(\Big|\sum_{i=1}^n x_{il}E_{i,jk}\Big| \ge t\Big) \le 2\exp\Big\{-\frac{t^2}{2(2nC_2^{-2}e^{C_2}C_3 + tC_2^{-1})}\Big\}.$$

This completes the proof of Lemma 2.

The next lemma is about the subdifferential and directional derivatives of the trace norm. For more details about this lemma and its proof, please refer to Recht et al. (2010) and Borwein and Lewis (2010).

Lemma 3. For an arbitrary matrix $W$, denote its singular value decomposition by $W = UDV^T$, where $U\in R^{p\times m}$ and $V\in R^{q\times m}$ have orthonormal columns, $D = \mathrm{diag}(d_1,\ldots,d_m)$, and $d_1\ge\cdots\ge d_m > 0$ are the singular values of $W$. Then the trace norm of $W$ is $\|W\|_* = \sum_{i=1}^m d_i$ and its subdifferential is equal to

$$\partial\|W\|_* = \{UV^T + N : \|N\|_{op}\le 1,\ U^TN = 0,\ NV = 0\}.$$

The directional derivative at W is

$$\lim_{\epsilon\to0^+}\frac{\|W + \epsilon\Upsilon\|_* - \|W\|_*}{\epsilon} = \mathrm{tr}(U^T\Upsilon V) + \|(U_\perp)^T\Upsilon V_\perp\|_*,$$

where $U_\perp$ and $V_\perp$ are the orthonormal complements of $U$ and $V$.
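The directional-derivative formula in Lemma 3 can be verified numerically by comparing it with a one-sided finite difference of the trace norm; a minimal sketch (illustrative only) follows.

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm: sum of singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

# Check the directional derivative of the trace norm at a rank-r matrix W
# in a random direction Upsilon against the formula in Lemma 3.
rng = np.random.default_rng(1)
p, q, r = 6, 5, 2
W = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # rank r a.s.
Ups = rng.standard_normal((p, q))

U_full, _, Vt_full = np.linalg.svd(W, full_matrices=True)
U, V = U_full[:, :r], Vt_full[:r].T            # singular subspaces of W
U_perp, V_perp = U_full[:, r:], Vt_full[r:].T  # their orthonormal complements

formula = np.trace(U.T @ Ups @ V) + trace_norm(U_perp.T @ Ups @ V_perp)
eps = 1e-6
numeric = (trace_norm(W + eps * Ups) - trace_norm(W)) / eps
print(formula, numeric)   # the two values should agree to several decimals
```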

The following lemma is a standard result, the Gaussian comparison inequality (Anderson, 1955).

Lemma 4. Let $X$ and $Y$ be zero-mean Gaussian random vectors with covariance matrices $\Sigma_X$ and $\Sigma_Y$, respectively. If $\Sigma_X - \Sigma_Y$ is positive semi-definite, then for any convex symmetric set $C$, $P(X\in C) \le P(Y\in C)$.

B. Proof of Theorems

Proof of Theorem 1: Recall that $B_{l0}^M = \mathrm{cov}(\sum_{l'\in M}x_{il'}*B_{l'0},\ x_{il})$. For every $1\le j\le p$, $1\le k\le q$ and $1\le l\le s_n$, we have

$$\hat{B}_{l,jk}^M - B_{l0,jk}^M = n^{-1}\sum_{i=1}^n\{x_{il}Y_{i,jk} - E(x_{il}Y_{i,jk})\}.$$

It follows from Assumptions (A0)-(A2) and Lemma 2 that for any $t > 0$, we have

$$\begin{aligned}
P(|\hat{B}_{l,jk}^M - B_{l0,jk}^M| \ge t) &= P\Big(\Big|\sum_{i=1}^n\{x_{il}Y_{i,jk} - E(x_{il}Y_{i,jk})\}\Big| \ge nt\Big)\\
&= P\Big(\Big|\sum_{l'\in M}\sum_{i=1}^n\{x_{il}x_{il'} - E(x_{il}x_{il'})\}B_{l'0,jk} + \sum_{i=1}^n x_{il}E_{i,jk}\Big| \ge nt\Big)\\
&\le \sum_{l'\in M}P\Big(\Big|\sum_{i=1}^n\{x_{il}x_{il'} - E(x_{il}x_{il'})\}\Big| \ge nt\,b^{-1}(s_0+1)^{-1}\Big) + P\Big(\Big|\sum_{i=1}^n x_{il}E_{i,jk}\Big| \ge nt(s_0+1)^{-1}\Big)\\
&\le 2s_0\exp\Big\{-\frac{nt^2b^{-2}(s_0+1)^{-2}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}t)}\Big\} + 2\exp\Big\{-\frac{nt^2(s_0+1)^{-2}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}t)}\Big\}.
\end{aligned}$$

For every $l\in M$, we have

$$\begin{aligned}
P(\|\hat{B}_l^M\|_{op} \le \gamma_n) &\le P\big(\|\hat{B}_l^M - B_{l0}^M\|_{op} \ge (pq)^{1/2}(1-\alpha)C_1n^{-\kappa}\big) \le P\big(\|\hat{B}_l^M - B_{l0}^M\|_F \ge (pq)^{1/2}(1-\alpha)C_1n^{-\kappa}\big)\\
&= P\Big(\sum_{j,k}|\hat{B}_{l,jk}^M - B_{l0,jk}^M|^2 \ge pq\{(1-\alpha)C_1n^{-\kappa}\}^2\Big) \le \sum_{j,k}P\big(|\hat{B}_{l,jk}^M - B_{l0,jk}^M| \ge (1-\alpha)C_1n^{-\kappa}\big)\\
&\le 2pq\Big(s_0\exp\Big\{-\frac{n^{1-2\kappa}[(1-\alpha)C_1b^{-1}(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}(1-\alpha)C_1n^{-\kappa}\}}\Big\} + \exp\Big\{-\frac{n^{1-2\kappa}[(1-\alpha)C_1(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}(1-\alpha)C_1n^{-\kappa}\}}\Big\}\Big)\\
&\le 2pq\Big(s_0\exp\Big\{-\frac{n^{1-2\kappa}[(1-\alpha)C_1b^{-1}(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}(1-\alpha)C_1\}}\Big\} + \exp\Big\{-\frac{n^{1-2\kappa}[(1-\alpha)C_1(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}(1-\alpha)C_1\}}\Big\}\Big).
\end{aligned}$$

Let $c_1 = 2pq(s_0+1)$,

$$c_2 = \frac{[(1-\alpha)C_1b^{-1}(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}(1-\alpha)C_1\}}, \quad\text{and}\quad c_3 = \frac{[(1-\alpha)C_1(s_0+1)^{-1}]^2}{2\{2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}(1-\alpha)C_1\}}.$$

We then have $P(\|\hat{B}_l^M\|_{op} \le \gamma_n) \le 2pq(s_0+1)\exp(-c_0n^{1-2\kappa})$, where $c_0 = \min\{c_2, c_3\}$. By Assumption (A5), one has

$$P(M \subseteq \hat{M}_{\gamma_n}) = P\Big(\bigcap_{l\in M}\{\|\hat{B}_l^M\|_{op} > \gamma_n\}\Big) = 1 - P\Big(\bigcup_{l\in M}\{\|\hat{B}_l^M\|_{op} \le \gamma_n\}\Big) \ge 1 - \sum_{l\in M}P(\|\hat{B}_l^M\|_{op} \le \gamma_n) \ge 1 - s_0c_1\exp(-c_0n^{1-2\kappa}) = 1 - 2pq(s_0+1)s_0\exp(-c_0n^{1-2\kappa}) \to 1.$$

This completes the proof of Theorem 1.

Proof of Theorem 2: The proof consists of two steps. In Step 1, we will show that $P(\hat{M}_{\gamma_n} \subseteq M_o) \to 1$, where $M_o = \{1\le l\le s_n : \|B_{l0}^M\|_{op} \ge \gamma_n/2\}$. It follows from the definition of $\hat{M}_{\gamma_n}$ that we have

$$P(\hat{M}_{\gamma_n} \subseteq M_o) \ge P\Big(\bigcap_{1\le l\le s_n}\{\|\hat{B}_l^M - B_{l0}^M\|_{op} \le \gamma_n/2\}\Big).$$

Moreover, we have

$$\begin{aligned}
P\Big(\bigcap_{1\le l\le s_n}\{\|\hat{B}_l^M - B_{l0}^M\|_{op} \le \gamma_n/2\}\Big) &= 1 - P\Big(\bigcup_{1\le l\le s_n}\{\|\hat{B}_l^M - B_{l0}^M\|_{op} > \gamma_n/2\}\Big) \ge 1 - \sum_{1\le l\le s_n}P(\|\hat{B}_l^M - B_{l0}^M\|_{op} \ge \gamma_n/2)\\
&\ge 1 - \sum_{1\le l\le s_n}P(\|\hat{B}_l^M - B_{l0}^M\|_F \ge \gamma_n/2) \ge 1 - \sum_{1\le l\le s_n}\sum_{j,k}P(|\hat{B}_{l,jk}^M - B_{l0,jk}^M| \ge \alpha C_1n^{-\kappa}/2)\\
&\ge 1 - 2s_npq\Big[s_0\exp\Big\{-\frac{\alpha^2C_1^2b^{-2}(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}\alpha C_12^{-1}n^{-\kappa})}\Big\} + \exp\Big\{-\frac{\alpha^2C_1^2(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}\alpha C_12^{-1}n^{-\kappa})}\Big\}\Big]\\
&\ge 1 - 2s_npq\Big[s_0\exp\Big\{-\frac{\alpha^2C_1^2b^{-2}(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}\alpha C_12^{-1})}\Big\} + \exp\Big\{-\frac{\alpha^2C_1^2(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}\alpha C_12^{-1})}\Big\}\Big]\\
&= 1 - 2pq\exp(C_4n^{\xi})\Big[s_0\exp\Big\{-\frac{\alpha^2C_1^2b^{-2}(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}b^{-1}(s_0+1)^{-1}\alpha C_12^{-1})}\Big\} + \exp\Big\{-\frac{\alpha^2C_1^2(s_0+1)^{-2}2^{-2}n^{1-2\kappa}}{2(2C_2^{-2}e^{C_2}C_3 + C_2^{-1}(s_0+1)^{-1}\alpha C_12^{-1})}\Big\}\Big].
\end{aligned}$$

By Assumptions (A3) and (A5), one has $P(\bigcap_{1\le l\le s_n}\{\|\hat{B}_l^M - B_{l0}^M\|_{op} \le \gamma_n/2\}) \ge 1 - c_4\exp(-c_5n^{1-2\kappa})$ for some constants $c_4 > 0$ and $c_5 > 0$. Therefore, we have $P(\hat{M}_{\gamma_n} \subseteq M_o) \to 1$ by Assumption (A1).

In Step 2, we will show that $|M_o| = O(n^{2\kappa+\tau})$. Define $M_1 = \{1\le l\le s_n : \|B_{l0}^M\|_F^2 \ge \gamma_n^2/4\}$. As $\|B_{l0}^M\|_{op} \le \|B_{l0}^M\|_F$, we have $M_o \subseteq M_1$. By the definition of $M_1$, we have

$$|M_1|\gamma_n^2/4 \le \sum_{l=1}^{s_n}\|B_{l0}^M\|_F^2 = \sum_{j,k}\sum_{l=1}^{s_n}(B_{l0,jk}^M)^2 = \sum_{j,k}\sum_{l=1}^{s_n}\{E(x_{il}Y_{i,jk})\}^2 = \sum_{j,k}\|E(x_iY_{i,jk})\|^2.$$

Define $B_{0,jk} = (B_{10,jk},\ldots,B_{s_n0,jk})^T$; we can write $Y_{i,jk} = x_i^TB_{0,jk} + E_{i,jk}$. Multiplying both sides by $x_i$ and taking expectations yields $\Sigma_xB_{0,jk} = E(x_iY_{i,jk})$. Therefore, we have

$$|M_1|\gamma_n^2/4 \le \sum_{j,k}\|\Sigma_xB_{0,jk}\|^2 \le \lambda_{\max}(\Sigma_x)\sum_{j,k}B_{0,jk}^T\Sigma_xB_{0,jk} = \lambda_{\max}(\Sigma_x)\sum_{j,k}\{\mathrm{var}(Y_{i,jk}) - \mathrm{var}(Y_{i,jk}\mid x_i)\} \le pq\,\lambda_{\max}(\Sigma_x).$$

By Assumption (A4), we have $|M_1| \le 4pq\lambda_{\max}(\Sigma_x)\gamma_n^{-2} = O(n^{2\kappa+\tau})$, which implies that $|M_o| \le |M_1| = O(n^{2\kappa+\tau})$.

Combining the results of the above two steps leads to

$$P\big(|\hat{M}_{\gamma_n}| = O(n^{2\kappa+\tau})\big) \ge P(\hat{M}_{\gamma_n} \subseteq M_o) \to 1.$$

This completes the proof of Theorem 2.

Theorems 3, 4 and 5 are theoretical results for our estimation procedure, and we assume that $\hat{M} = M$ and that $\hat{M}$ is fixed.

Proof of Theorem 3: Without loss of generality, for the proof of Theorem 3, we assume $\hat{M} = M = \{1,\ldots,s\}$ with $s$ fixed for notational simplicity. We first prove Theorem 3 (i). We define

$$L(\Delta_1,\ldots,\Delta_s) = \lambda^{-2}\{Q(\lambda\Delta_1 + B_{10},\ldots,\lambda\Delta_s + B_{s0}) - Q(B_{10},\ldots,B_{s0})\} = 2^{-1}\sum_{l=1}^s\sum_{l'=1}^s n^{-1}\Big(\sum_{i=1}^n x_{il}x_{il'}\Big)\mathrm{tr}(\Delta_l^T\Delta_{l'}) - \lambda^{-1}\sum_l\mathrm{tr}\Big(\Delta_l^Tn^{-1}\sum_{i=1}^n x_{il}E_i\Big) + \lambda^{-1}\sum_l\{\|B_{l0} + \lambda\Delta_l\|_* - \|B_{l0}\|_*\},$$

where $\Delta_l = \lambda^{-1}(B_l - B_{l0})$ for $l = 1,\ldots,s$. Therefore, we have

$$(\hat\Delta_1,\ldots,\hat\Delta_s) = \arg\min\{L(\Delta_1,\ldots,\Delta_s)\},$$

where $\hat\Delta_l = \lambda^{-1}(\hat{B}_l - B_{l0})$ for $l = 1,\ldots,s$.

When $\lambda \to 0$ and $n^{1/2}\lambda \to \infty$, we have

$$n^{-1}\sum_{i=1}^n x_{il}x_{il'} \to_p \Sigma_{M,ll'}\quad\text{for every } 1\le l, l'\le s,$$

where $\Sigma_{M,ll'}$ is the $(l,l')$-th element of $\Sigma_M$ for $1\le l, l'\le s$. By the Central Limit Theorem, $n^{-1/2}\sum_{i=1}^n x_{il}E_i$ converges in distribution to a normally distributed matrix $D_l$ with mean 0 and $\mathrm{var}\{\mathrm{vec}(D_l)\} = m_{ll}\Sigma_e$ for every $1\le l\le s$. Hence

$$\lambda^{-1}n^{-1}\sum_{i=1}^n x_{il}E_i = \lambda^{-1}n^{-1/2}O_p(1) \to_p 0\quad\text{for every } 1\le l\le s.$$

For every $l = 1,\ldots,s$, recall that the singular value decomposition of $B_{l0}$ is $U_{l0}\Theta_{l0}V_{l0}^T$, and let $U_{l0\perp}$ and $V_{l0\perp}$ denote orthogonal complements of $U_{l0}$ and $V_{l0}$, respectively. By Lemma 3, we have

$$\lambda^{-1}\sum_l\{\|B_{l0} + \lambda\Delta_l\|_* - \|B_{l0}\|_*\} \to \sum_{l=1}^s\mathrm{tr}(U_{l0}^T\Delta_lV_{l0}) + \sum_{l=1}^s\|(U_{l0\perp})^T\Delta_lV_{l0\perp}\|_*.$$

Consequently, $L(\Delta_1,\ldots,\Delta_s) \to_p L_0(\Delta_1,\ldots,\Delta_s)$ for each $\Delta_l\in G_l$, $l = 1,\ldots,s$, with the $G_l$'s compact sets in $R^{p\times q}$, where

$$L_0(\Delta_1,\ldots,\Delta_s) = 2^{-1}\sum_{l=1}^s\sum_{l'=1}^s\Sigma_{M,ll'}\mathrm{tr}(\Delta_l^T\Delta_{l'}) + \sum_{l=1}^s\mathrm{tr}(U_{l0}^T\Delta_lV_{l0}) + \sum_{l=1}^s\|(U_{l0\perp})^T\Delta_lV_{l0\perp}\|_*.$$

One can see that $L_0(\Delta_1,\ldots,\Delta_s)$ is convex, hence it has a unique minimum point $(\Delta_{10},\ldots,\Delta_{s0})$. As $L(\Delta_1,\ldots,\Delta_s)$ is also convex, by Knight and Fu (2000) we have $\hat\Delta_l \to_p \Delta_{l0}$. This implies that $\lambda^{-1}(\hat{B}_l - B_{l0}) = O_p(1)$, $l = 1,\ldots,s$.

We next prove Theorem 3 (ii). We define

$$f(\Psi_1,\ldots,\Psi_s) = n\{Q(n^{-1/2}\Psi_1 + B_{10},\ldots,n^{-1/2}\Psi_s + B_{s0}) - Q(B_{10},\ldots,B_{s0})\} = 2^{-1}\sum_{l=1}^s\sum_{l'=1}^s n^{-1}\Big(\sum_{i=1}^n x_{il}x_{il'}\Big)\mathrm{tr}(\Psi_l^T\Psi_{l'}) - \sum_l\mathrm{tr}\Big(\Psi_l^Tn^{-1/2}\sum_{i=1}^n x_{il}E_i\Big) + \lambda n\sum_l\{\|B_{l0} + n^{-1/2}\Psi_l\|_* - \|B_{l0}\|_*\},$$

where $\Psi_l = n^{1/2}(B_l - B_{l0})$ for $l = 1,\ldots,s$. Let $(\hat\Psi_1,\ldots,\hat\Psi_s) = \arg\min\{f(\Psi_1,\ldots,\Psi_s)\}$; then we have $\hat\Psi_l = n^{1/2}(\hat{B}_l - B_{l0})$, $l = 1,\ldots,s$. Under Assumption (A6) and $n^{1/2}\lambda \to \rho$, we have $f(\Psi_1,\ldots,\Psi_s) \to f_0(\Psi_1,\ldots,\Psi_s)$ and

$$f_0(\Psi_1,\ldots,\Psi_s) = 2^{-1}\sum_{l=1}^s\sum_{l'=1}^s\Sigma_{M,ll'}\mathrm{tr}(\Psi_l^T\Psi_{l'}) - \sum_{l=1}^s\mathrm{tr}(\Psi_l^TD_l) + \rho\Big\{\sum_{l=1}^s\mathrm{tr}(U_{l0}^T\Psi_lV_{l0}) + \sum_{l=1}^s\|(U_{l0\perp})^T\Psi_lV_{l0\perp}\|_*\Big\},$$

where $D_l$ is a random matrix and $\mathrm{vec}(D_l)$ is normally distributed. One can see that $f_0(\Psi_1,\ldots,\Psi_s)$ is convex, hence it has a unique minimum point $(\Psi_{10},\ldots,\Psi_{s0})$ with $\Psi_{l0} = O_p(1)$ for $l = 1,\ldots,s$. Consequently, by Knight and Fu (2000), we have $\hat\Psi_l \to_d \Psi_{l0}$ for $l = 1,\ldots,s$, which indicates that $n^{1/2}(\hat{B}_l - B_{l0}) = O_p(1)$ for $l = 1,\ldots,s$. This completes the proof of Theorem 3.

Proof of Theorem 4: Without loss of generality, for the proof of Theorem 4, we assume $\hat{M} = M = \{1,\ldots,s\}$ with $s$ fixed for notational simplicity. It follows from Theorem 3(i) that $\lambda^{-1}(\hat{B}_l - B_{l0}) = O_p(1)$ holds for every $1\le l\le s$. Since the rank function is lower semi-continuous, $P(\mathrm{rank}(\hat{B}_l) \ge \mathrm{rank}(B_{l0})) \to 1$. We will then prove that $\mathrm{rank}(\hat{B}_l) = \mathrm{rank}(B_{l0})$ for every $1\le l\le s$ with probability tending to one.

Denote the singular value decomposition of $\hat{B}_l$ by $\hat{B}_l = \hat{U}_l\hat\Theta_l\hat{V}_l^T$, where $\hat{U}_l\in R^{p\times p}$ and $\hat{V}_l\in R^{q\times q}$. Let $\hat{U}_{l\perp}$ be the submatrix of $\hat{U}_l$ without the first $r_l$ columns, and $\hat{V}_{l\perp}$ the submatrix of $\hat{V}_l$ without the first $r_l$ columns, where $r_l$ is the rank of $B_{l0}$. Denote the rank of $\hat{B}_l$ by $\hat{r}_l$. We prove the theorem in two steps.

Step 1. In this step, we will show that if

$$\Big\|(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp}\Big\|_{op} < \lambda,$$

then $\hat{r}_l = r_l$. We will prove the statement by contradiction.

Let $\hat{U}_{l1}$ be the submatrix of $\hat{U}_l$ corresponding to the first $\hat{r}_l$ columns, and $\hat{V}_{l1}$ the submatrix of $\hat{V}_l$ corresponding to the first $\hat{r}_l$ columns. If $\hat{r}_l > r_l$, we can write $\hat{U}_{l\perp}$ and $\hat{V}_{l\perp}$ as $(\hat{U}_{l\perp,1}, \hat{U}_{l\perp,2})$ and $(\hat{V}_{l\perp,1}, \hat{V}_{l\perp,2})$, respectively, where $\hat{U}_{l\perp,1}\in R^{p\times(\hat{r}_l - r_l)}$, $\hat{U}_{l\perp,2}\in R^{p\times(p-\hat{r}_l)}$, $\hat{V}_{l\perp,1}\in R^{q\times(\hat{r}_l - r_l)}$, and $\hat{V}_{l\perp,2}\in R^{q\times(q-\hat{r}_l)}$. By the definition of $\hat{B}_l$, we have

$$\hat{B}_l = \arg\min_{B_l}\ \frac{1}{2n}\sum_{i=1}^n\Big\|Y_i - \sum_{l'\ne l}x_{il'}\hat{B}_{l'} - x_{il}B_l\Big\|_F^2 + \lambda\|B_l\|_*.$$

Hence, by Lemma 3, we have

$$\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\} + \lambda(\hat{U}_{l1}\hat{V}_{l1}^T + N_l) = 0,$$

with $\hat{U}_{l1}^TN_l = 0$, $N_l\hat{V}_{l1} = 0$ and $\|N_l\|_{op}\le 1$. Furthermore, we have

$$(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp} = -\lambda(\hat{U}_{l\perp})^T(\hat{U}_{l1}\hat{V}_{l1}^T + N_l)\hat{V}_{l\perp} = -\lambda(\hat{U}_{l\perp,1}, \hat{U}_{l\perp,2})^T\{\hat{U}_{l1}\hat{V}_{l1}^T + N_l\}(\hat{V}_{l\perp,1}, \hat{V}_{l\perp,2}) = -\lambda\begin{pmatrix}I_{(\hat{r}_l - r_l)\times(\hat{r}_l - r_l)} & 0\\ 0 & (\hat{U}_{l\perp,2})^TN_l\hat{V}_{l\perp,2}\end{pmatrix}.$$

From the above formula, it follows that $\|(\hat{U}_{l\perp})^T\{n^{-1}\sum_{i=1}^n x_{il}[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i]\}\hat{V}_{l\perp}\|_{op} = \lambda$ as long as $\hat{r}_l > r_l$. Consequently,

if $\|(\hat{U}_{l\perp})^T\{n^{-1}\sum_{i=1}^n x_{il}[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i]\}\hat{V}_{l\perp}\|_{op} < \lambda$, we have $\hat{r}_l = r_l$.

Step 2. In this step, we will prove that with probability tending to 1, one has

$$\Big\|(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp}\Big\|_{op} < \lambda.$$

We have

$$(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp} = (\hat{U}_{l\perp})^T\Big\{\lambda\sum_{l'=1}^s(\Sigma_{M,ll'} + o_p(1))\hat\Delta_{l'} - O_p(n^{-1/2})\Big\}\hat{V}_{l\perp} = \lambda(\hat{U}_{l\perp})^T\Big(\sum_{l'=1}^s\Sigma_{M,ll'}\hat\Delta_{l'}\Big)\hat{V}_{l\perp} + o_p(\lambda).$$

Since $\hat{B}_l$ is a consistent estimator of $B_{l0}$, we have $\hat{U}_{l\perp}(\hat{U}_{l\perp})^T = U_{l0\perp}(U_{l0\perp})^T + o_p(1)$ and $\hat{V}_{l\perp}(\hat{V}_{l\perp})^T = V_{l0\perp}(V_{l0\perp})^T + o_p(1)$. Consequently, we have

$$\begin{aligned}
\Big\|(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp}\Big\|_{op} &= \Big\|\hat{U}_{l\perp}(\hat{U}_{l\perp})^T\Big\{n^{-1}\sum_{i=1}^n x_{il}\Big[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i\Big]\Big\}\hat{V}_{l\perp}(\hat{V}_{l\perp})^T\Big\|_{op}\\
&= \lambda\Big\|U_{l0\perp}(U_{l0\perp})^T\Big(\sum_{l'=1}^s\Sigma_{M,ll'}\hat\Delta_{l'}\Big)V_{l0\perp}(V_{l0\perp})^T\Big\|_{op} + o_p(\lambda)\\
&= \lambda\Big\|U_{l0\perp}(U_{l0\perp})^T\Big(\sum_{l'=1}^s\Sigma_{M,ll'}\Delta_{l'0}\Big)V_{l0\perp}(V_{l0\perp})^T(1 + o_p(1))\Big\|_{op} + o_p(\lambda)\\
&= \lambda\|U_{l0\perp}\Lambda_l(V_{l0\perp})^T\|_{op} + o_p(\lambda) = \lambda\{\|\Lambda_l\|_{op} + o_p(1)\}.
\end{aligned}$$

As $\|\Lambda_l\|_{op} < 1$, we have $\|(\hat{U}_{l\perp})^T\{n^{-1}\sum_{i=1}^n x_{il}[\sum_{l'=1}^s x_{il'}(\hat{B}_{l'} - B_{l'0}) - E_i]\}\hat{V}_{l\perp}\|_{op} < \lambda$ with probability tending to one. This completes the proof of Theorem 4.

Proof of Theorem 5. Without loss of generality, for the proof of Theorem 5, we assume $\hat{M} = M = \{1,\ldots,s\}$ with $s$ fixed for notational simplicity. To prove Theorem 5, we first introduce some notation and definitions used in Negahban et al. (2012). Given a pair of subspaces $M\subseteq\bar{M}$, a norm-based regularizer $J$ is decomposable with respect to $(M, \bar{M}^\perp)$ if

$$J(\theta + \gamma) = J(\theta) + J(\gamma)\quad\text{for all }\theta\in M\text{ and }\gamma\in\bar{M}^\perp,$$

where $\bar{M}^\perp$ is the orthogonal complement of the space $\bar{M}$, defined as $\bar{M}^\perp = \{v : \langle u, v\rangle = 0\ \text{for all}\ u\in\bar{M}\}$.

We define the projection operator

$$\Pi_M(u) = \arg\min_{v\in M}\|u - v\|.$$

Similarly, we can define the projections $\Pi_{M^\perp}$, $\Pi_{\bar{M}}$ and $\Pi_{\bar{M}^\perp}$.

We then introduce the definition of the subspace compatibility constant. For the subspace $M$, the subspace compatibility constant with respect to the pair $(J, \|\cdot\|)$ is given by

$$\psi(M) = \sup_{u\in M\setminus\{0\}}\frac{J(u)}{\|u\|}.$$

We introduce the definition of restricted strong convexity. For a loss function $L(\theta)$, define $\delta L(\Delta,\theta) = L(\theta+\Delta) - L(\theta) - \langle\nabla L(\theta),\Delta\rangle$, where $\nabla L(\theta) = dL(\theta)/d\theta$. The loss function satisfies a restricted strong convexity condition with curvature $\kappa_L > 0$ and tolerance function $\tau_L$ if

$$\delta L(\Delta,\theta) \ge \kappa_L\|\Delta\|^2 - \tau_L^2(\theta)\quad\text{for all }\Delta\in C(M,\bar{M}^\perp;\theta),$$

where $C(M,\bar{M}^\perp;\theta) = \{\Delta : J(\Delta_{\bar{M}^\perp}) \le 3J(\Delta_{\bar{M}}) + 4J(\theta_{M^\perp})\}$.

Now we begin to prove Theorem 5. We need to use the result in Theorem 1 of Negahban et al. (2012). We first check the conditions of the theorem under our context.

Recall that $B = [B_1,\ldots,B_s]\in R^{p\times qs}$ and $r_l = \mathrm{rank}(B_{l0})$. Let us consider the class of matrices $\Theta_l\in R^{p\times q}$ that have rank $r_l\le\min\{p,q\}$, and define $\Theta = [\Theta_1,\ldots,\Theta_s]\in R^{p\times qs}$.

Let $\mathrm{row}(\Theta_l)\subseteq R^q$ and $\mathrm{col}(\Theta_l)\subseteq R^p$ denote the row space and column space of $\Theta_l$, respectively. Let $U_l$ and $V_l$ be a given pair of $r_l$-dimensional subspaces $U_l\subseteq R^p$ and $V_l\subseteq R^q$. Define $U = [U_1,\ldots,U_s]$ and $V = [V_1,\ldots,V_s]$. For a given pair $(U, V)$, we can define the subspaces $M(U,V)$, $\bar{M}(U,V)$ and $\bar{M}^\perp(U,V)$ of $R^{p\times qs}$ given by

$$M(U,V) = \{\Theta\in R^{p\times qs} : \mathrm{row}(\Theta_l)\subseteq V_l\ \mathrm{and}\ \mathrm{col}(\Theta_l)\subseteq U_l\ \mathrm{for}\ 1\le l\le s\},\quad \bar{M}(U,V) = \{\Theta\in R^{p\times qs} : \mathrm{row}(\Theta_l)\subseteq V_l\ \mathrm{or}\ \mathrm{col}(\Theta_l)\subseteq U_l\ \mathrm{for}\ 1\le l\le s\},$$

and

$$\bar{M}^\perp(U,V) = \{\Theta\in R^{p\times qs} : \mathrm{row}(\Theta_l)\subseteq V_l^\perp\ \mathrm{and}\ \mathrm{col}(\Theta_l)\subseteq U_l^\perp\ \mathrm{for}\ 1\le l\le s\},$$

where $\bar{M}^\perp(U,V)$ is the orthogonal complement of the space $\bar{M}(U,V)$. For simplicity, we will use $M$, $\bar{M}$ and $\bar{M}^\perp$ to denote $M(U,V)$, $\bar{M}(U,V)$ and $\bar{M}^\perp(U,V)$, respectively, in the following proof.

Define $J(B) = \sum_{l=1}^s\|B_l\|_*$; we can easily see that $J(B)$ is a norm. It is easy to see that the norm $J$ is decomposable with respect to the subspace pair $(M, \bar{M}^\perp)$, where $M\subseteq\bar{M}$. Therefore, the regularizer $J$ satisfies Condition (G1) in Negahban et al. (2012).

Under condition (A9), it is easy to see that the loss function $R$ is convex and differentiable and satisfies the restricted strong convexity condition with curvature $\kappa_L = C_L$ and tolerance $\tau_L = 0$; therefore, Condition (G2) in Negahban et al. (2012) holds.

After checking the conditions, we need to calculate $\psi(\bar{M})$ and $J(\{B_0\}_{M^\perp})$. It is easy to see that $J(\{B_0\}_{M^\perp}) = 0$. For $\psi(\bar{M})$, one has

$$\psi(\bar{M}) = \sup_{u\in\bar{M}\setminus\{0\}}\frac{J(u)}{\|u\|} = \sup_{B\in\bar{M}\setminus\{0\}}\frac{\sum_{l=1}^s\|B_l\|_*}{\|B\|_F} \le \sup_{B\in\bar{M}\setminus\{0\}}\frac{\sum_{l=1}^s\sqrt{2r_l}\|B_l\|_F}{\|B\|_F} \le \sqrt{\frac{\sum_{l=1}^s(\sqrt{2r_l})^2\sum_{l=1}^s\|B_l\|_F^2}{\|B\|_F^2}} = \sqrt{2\sum_{l=1}^s r_l}.$$

Therefore, by Theorem 1 in Negahban et al. (2012), when $\lambda \ge 2J^*(\nabla R(B_0))$, one has $\|\hat{B} - B_0\|_F^2 \le C(\sum_{l=1}^s r_l)\lambda^2/C_L^2$ for some constant $C > 0$.

The term J*(∇R(B0)) is actually a random quantity, and our next step is to derive the order of this term.

Define $J^*(\cdot)$ as the dual norm of $J(\cdot)$. For any matrix $A = [A_1,\ldots,A_s]\in R^{p\times qs}$, we will first prove the following result:

$$J^*(A) = \sup_{J(B)\le1}\langle A, B\rangle = \max_{1\le l\le s}\|A_l\|_{op}. \quad (8)$$

To prove (8), we first show that $J^*(A) \ge \max_{1\le l\le s}\|A_l\|_{op}$. Let $B^{(l)} = [B_1^{(l)},\ldots,B_s^{(l)}]$ with $B_k^{(l)} = 0$ for any $k\ne l$ and $\|B_l^{(l)}\|_*\le 1$. One has

$$J^*(A) \ge \sup_{J(B^{(l)})\le1}\langle A, B^{(l)}\rangle = \sup_{\|B_l^{(l)}\|_*\le1}\langle A_l, B_l^{(l)}\rangle = \|A_l\|_{op}.$$

Hence $J^*(A) \ge \|A_l\|_{op}$ holds for any $1\le l\le s$. Consequently, one has $J^*(A) \ge \max_{1\le l\le s}\|A_l\|_{op}$.

Our next step is to show that $J^*(A) \le \max_{1\le l\le s}\|A_l\|_{op}$. Denote the singular value decomposition of $B_l$ by $B_l = U_l\Theta_lV_l^T$. One has

$$\begin{aligned}
J^*(A) &= \sup_{J(B)\le1}\sum_{l=1}^s\langle U_l\Theta_lV_l^T, A_l\rangle = \sup_{J(B)\le1}\sum_{l=1}^s\mathrm{Tr}(V_l\Theta_lU_l^TA_l) = \sup_{J(B)\le1}\sum_{l=1}^s\mathrm{Tr}(\Theta_lU_l^TA_lV_l)\\
&= \sup_{J(B)\le1}\sum_{l=1}^s\sum_{k=1}^{\min\{p,q\}}\theta_{lk}(U_l^TA_lV_l)_{kk} = \sup_{J(B)\le1}\sum_{l=1}^s\sum_{k=1}^{\min\{p,q\}}\theta_{lk}\{(U_l)_{(k)}\}^TA_l(V_l)_{(k)}\\
&\le \sup_{J(B)\le1}\sum_{l=1}^s\sum_{k=1}^{\min\{p,q\}}\theta_{lk}\|A_l\|_{op} \le \sup_{J(B)\le1}\Big(\sum_{l=1}^s\sum_{k=1}^{\min\{p,q\}}\theta_{lk}\Big)\max_{1\le l\le s}\|A_l\|_{op} \le \max_{1\le l\le s}\|A_l\|_{op},
\end{aligned}$$

where $\theta_{lk}$ is the $k$th diagonal element of the diagonal matrix $\Theta_l$, $(U_l^TA_lV_l)_{kk}$ is the $(k,k)$th element of the matrix $U_l^TA_lV_l$, and $(U_l)_{(k)}$ and $(V_l)_{(k)}$ are the $k$th columns of $U_l$ and $V_l$, respectively.

Combining the two inequalities, we show that J*(A) = max1≤lsAlop.
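Equation (8) can also be checked numerically: the dual-norm supremum is attained by a rank-one matrix with unit trace norm placed in the block with the largest operator norm. A small illustrative sketch:

```python
import numpy as np

# Numerical illustration of J*(A) = max_l ||A_l||_op in (8).
rng = np.random.default_rng(2)
p, q, s = 4, 3, 3
A = [rng.standard_normal((p, q)) for _ in range(s)]

op_norms = [np.linalg.norm(Al, ord=2) for Al in A]
l_star = int(np.argmax(op_norms))

# Feasible B with J(B) = 1: top singular pair of the block with largest op norm.
U, d, Vt = np.linalg.svd(A[l_star])
B = [np.zeros((p, q)) for _ in range(s)]
B[l_star] = np.outer(U[:, 0], Vt[0])     # rank one, trace norm 1, so J(B) = 1

inner = sum(np.sum(Al * Bl) for Al, Bl in zip(A, B))
print(inner, max(op_norms))              # both equal max_l ||A_l||_op
```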

Next we need to calculate $J^*(\nabla R(B_0))$, where $\nabla R(B_0) = [D_1,\ldots,D_s]\in R^{p\times qs}$ with $D_l = 2n^{-1}\sum_{i=1}^n x_{il}E_i$. We first need to calculate $\|D_l\|_{op}$. We know that the operator norm is the dual norm of the trace norm.

From the definition of $J^*(\cdot)$, one has

$$\|D_l\|_{op} = 2\sup_{\|A\|_*\le1}\Big\langle A,\ n^{-1}\sum_{i=1}^n x_{il}E_i\Big\rangle.$$

To obtain a bound for $\|D_l\|_{op}$, we use a technique similar to the one used in Raskutti and Yuan (2018). Let $W_i$ be a $p\times q$ random matrix with i.i.d. standard normal entries. Assuming condition (A11) and by Lemma 4, conditioning on the $x_{il}$'s, we get

$$P\Big\{\sup_{\|A\|_*\le1}\Big\langle A, n^{-1}\sum_{i=1}^n x_{il}E_i\Big\rangle > t\Big\} \le P\Big\{\sup_{\|A\|_*\le1}\Big\langle A, n^{-1}\sum_{i=1}^n x_{il}W_i\Big\rangle > \frac{t}{C_U}\Big\},$$

since $\Sigma_e \preceq C_U^2I_{pq\times pq}$.

Note that $\sup_{\|A\|_*\le1}\langle A, n^{-1}\sum_{i=1}^n x_{il}W_i\rangle = \|n^{-1}\sum_{i=1}^n x_{il}W_i\|_{op}$, and, conditioning on $X_l = (x_{1l},\ldots,x_{nl})^T$, each entry of the matrix $n^{-1}\sum_{i=1}^n x_{il}W_i$ is i.i.d. $N(0, \|X_l\|_2^2/n^2)$. Since $\|X_l\|_2^2/\sigma_l^2$ is a $\chi^2$ random variable with $n$ degrees of freedom, where $\sigma_l^2 = (\Sigma_M)_{ll}$, one has

$$P\{\|X_l\|_2^2 \ge 4n\sigma_l^2\} \le \exp(-n)$$

using tail bounds for the $\chi^2$ distribution. Then, combining this with standard random matrix theory, we know that $\|n^{-1}\sum_{i=1}^n x_{il}W_i\|_{op} \le 2n^{-1/2}\sigma_l(p^{1/2} + q^{1/2})$ with probability at least $1 - c_1\exp\{-c_2(p+q)\} - \exp(-n)$, where $c_1$ and $c_2$ are some positive constants. Therefore, under conditions (A10) and (A12), there exist some positive constants $c_1$, $c_2$ and $c_3$ such that $\max_{1\le l\le s}\|D_l\|_{op} \le 4C_Un^{-1/2}(\max_{1\le l\le s}\sigma_l)(p^{1/2} + q^{1/2})$ holds with probability at least $1 - c_1\exp\{-c_2(p+q)\} - c_3\exp(-n)$. Thus, when $\lambda \ge 4C_UC_M^{1/2}n^{-1/2}(p^{1/2} + q^{1/2})$, we have $\lambda \ge J^*(\nabla R(B_0))$ with probability at least $1 - c_1\exp\{-c_2(p+q)\} - c_3\exp(-n)$.

Therefore, with probability at least $1 - c_1\exp\{-c_2(p+q)\} - c_3\exp(-n)$, one has $\|\hat{B} - B_0\|_F^2 \le C(\sum_{l\in M}r_l)\lambda^2/C_L^2$ for some positive constant $C$. This completes the proof of Theorem 5.
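As a numerical aside (not part of the proof), the operator-norm bound $\|n^{-1}\sum_{i=1}^n x_{il}W_i\|_{op} \le 2n^{-1/2}\sigma_l(p^{1/2}+q^{1/2})$ used above can be checked by simulation; the sketch below uses illustrative dimensions and Gaussian covariates.

```python
import numpy as np

# Monte Carlo check of the operator-norm bound used in the proof (illustrative only).
rng = np.random.default_rng(3)
n, p, q, sigma_l, reps = 400, 30, 20, 1.5, 200

bound = 2.0 * n**-0.5 * sigma_l * (p**0.5 + q**0.5)
exceed = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma_l, size=n)            # x_{il} with variance sigma_l^2
    W = rng.standard_normal((n, p, q))              # W_i with i.i.d. N(0,1) entries
    M = np.tensordot(x, W, axes=(0, 0)) / n         # n^{-1} sum_i x_{il} W_i
    exceed += np.linalg.norm(M, ord=2) > bound
print(f"exceedance frequency: {exceed / reps:.3f}  (bound = {bound:.3f})")
```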

Proof of Theorem 6:

To prove the theorem, we consider the event $\{M\subseteq\hat{M}\}$, since it holds with probability going to 1. We will derive the non-asymptotic error bound under the event $\{M\subseteq\hat{M}\}$. Recall that $r_l = \mathrm{rank}(B_{l0})$; one has $r_l = 0$ for $l\notin M$. Let us consider the class of matrices $\Theta_l\in R^{p\times q}$ that have rank $r_l\le\min\{p,q\}$ and define $\Theta = [\Theta_l, l\in\hat{M}]\in R^{p\times q|\hat{M}|}$. Let $\mathrm{row}(\Theta_l)\subseteq R^q$ and $\mathrm{col}(\Theta_l)\subseteq R^p$ denote the row space and column space of $\Theta_l$, respectively. Let $U_l$ and $V_l$ be a given pair of $r_l$-dimensional subspaces $U_l\subseteq R^p$ and $V_l\subseteq R^q$, respectively. Define $U = [U_l, l\in\hat{M}]$ and $V = [V_l, l\in\hat{M}]$. For a given pair $(U, V)$, we can define the subspaces $\hat{M}(U,V)$, $\bar{\hat{M}}(U,V)$ and $\bar{\hat{M}}^\perp(U,V)$ of $R^{p\times q|\hat{M}|}$ as follows:

$$\hat{M}(U,V) = \{\Theta\in R^{p\times q|\hat{M}|} : \mathrm{row}(\Theta_l)\subseteq V_l\ \mathrm{and}\ \mathrm{col}(\Theta_l)\subseteq U_l\ \mathrm{for}\ l\in\hat{M}\},\quad \bar{\hat{M}}(U,V) = \{\Theta\in R^{p\times q|\hat{M}|} : \mathrm{row}(\Theta_l)\subseteq V_l\ \mathrm{or}\ \mathrm{col}(\Theta_l)\subseteq U_l\ \mathrm{for}\ l\in\hat{M}\},\quad \bar{\hat{M}}^\perp(U,V) = \{\Theta\in R^{p\times q|\hat{M}|} : \mathrm{row}(\Theta_l)\subseteq V_l^\perp\ \mathrm{and}\ \mathrm{col}(\Theta_l)\subseteq U_l^\perp\ \mathrm{for}\ l\in\hat{M}\},$$

where $\bar{\hat{M}}^\perp(U,V)$ is the orthogonal complement of the space $\bar{\hat{M}}(U,V)$. For simplicity, we will use $\hat{M}$, $\bar{\hat{M}}$ and $\bar{\hat{M}}^\perp$ to denote $\hat{M}(U,V)$, $\bar{\hat{M}}(U,V)$ and $\bar{\hat{M}}^\perp(U,V)$, respectively.

For the norm $J(B_{\hat{M}}) = \sum_{l\in\hat{M}}\|B_l\|_*$, it is easy to see that the norm $J$ is decomposable with respect to the subspace pair $(\hat{M}, \bar{\hat{M}}^\perp)$, where $\hat{M}\subseteq\bar{\hat{M}}$. Therefore, the regularizer $J$ satisfies Condition (G1) in Negahban et al. (2012).

We need to calculate $\psi(\bar{\hat{M}})$ and $J(\{B_{0\hat{M}}\}_{\hat{M}^\perp})$. It is easy to see that $J(\{B_{0\hat{M}}\}_{\hat{M}^\perp}) = 0$. For $\psi(\bar{\hat{M}})$, since $r_l = 0$ holds for $l\notin M$, one has

$$\psi(\bar{\hat{M}}) = \sup_{u\in\bar{\hat{M}}\setminus\{0\}}\frac{J(u)}{\|u\|} = \sup_{B\in\bar{\hat{M}}\setminus\{0\}}\frac{\sum_{l\in\hat{M}}\|B_l\|_*}{\|B_{\hat{M}}\|_F} \le \sup_{B\in\bar{\hat{M}}\setminus\{0\}}\frac{\sum_{l\in M}\sqrt{2r_l}\|B_l\|_F}{\sqrt{\sum_{l\in\hat{M}}\|B_l\|_F^2}} \le \sqrt{\frac{\sum_{l\in M}(\sqrt{2r_l})^2\sum_{l\in M}\|B_l\|_F^2}{\sum_{l\in M}\|B_l\|_F^2}} = \sqrt{2\sum_{l\in M}r_l}.$$

For any $\Delta\in R^{p\times q|\hat{M}|}$, we define $F : R^{p\times q|\hat{M}|}\to R$ as

$$F(\Delta) \equiv R(B_{0\hat{M}} + \Delta) - R(B_{0\hat{M}}) + \lambda\{J(B_{0\hat{M}} + \Delta) - J(B_{0\hat{M}})\}.$$

We will derive a lower bound on $F(\Delta)$. In particular, we have

$$\begin{aligned}
F(\Delta) &= R(B_{0\hat{M}} + \Delta) - R(B_{0\hat{M}}) + \lambda\{J(B_{0\hat{M}} + \Delta) - J(B_{0\hat{M}})\}\\
&\ge \langle\nabla R(B_{0\hat{M}}), \Delta\rangle + \iota_L\|\Delta\|^2 + \lambda\{J(B_{0\hat{M}} + \Delta) - J(B_{0\hat{M}})\}\\
&\ge \langle\nabla R(B_{0\hat{M}}), \Delta\rangle + \iota_L\|\Delta\|^2 + \lambda\{J(\Delta_{\bar{\hat{M}}^\perp}) - J(\Delta_{\bar{\hat{M}}}) - 2J((B_{0\hat{M}})_{\hat{M}^\perp})\},
\end{aligned}$$

where the first inequality follows from condition (A13) and the second inequality follows from Lemma 3 in Negahban et al. (2012), applied to the pair $(\bar{\hat{M}}, \bar{\hat{M}}^\perp)$.

By the Cauchy-Schwarz inequality applied to the regularizer $J$ and its dual $J^*$, we have $|\langle\nabla R(B_{0\hat{M}}), \Delta\rangle| \le J^*(\nabla R(B_{0\hat{M}}))J(\Delta)$. Since $\lambda \ge 2J^*(\nabla R(B_{0\hat{M}}))$ holds by assumption, one has $\langle\nabla R(B_{0\hat{M}}), \Delta\rangle \ge -0.5\lambda J(\Delta) \ge -0.5\lambda\{J(\Delta_{\bar{\hat{M}}^\perp}) + J(\Delta_{\bar{\hat{M}}})\}$, where the second inequality holds due to the triangle inequality. Therefore, we have

$$\begin{aligned}
F(\Delta) &\ge -\frac{\lambda}{2}\{J(\Delta_{\bar{\hat{M}}^\perp}) + J(\Delta_{\bar{\hat{M}}})\} + \iota_L\|\Delta\|^2 + \lambda\{J(\Delta_{\bar{\hat{M}}^\perp}) - J(\Delta_{\bar{\hat{M}}}) - 2J((B_{0\hat{M}})_{\hat{M}^\perp})\}\\
&= \iota_L\|\Delta\|^2 + \lambda\Big\{\frac{1}{2}J(\Delta_{\bar{\hat{M}}^\perp}) - \frac{3}{2}J(\Delta_{\bar{\hat{M}}}) - 2J((B_{0\hat{M}})_{\hat{M}^\perp})\Big\}\\
&\ge \iota_L\|\Delta\|^2 - \frac{1}{2}\lambda\{3J(\Delta_{\bar{\hat{M}}}) + 4J((B_{0\hat{M}})_{\hat{M}^\perp})\}.
\end{aligned}$$

By the subspace compatibility, we have $J(\Delta_{\bar{\hat{M}}}) \le \psi(\bar{\hat{M}})\|\Delta_{\bar{\hat{M}}}\|$. As the projection is non-expansive and $0\in\bar{\hat{M}}$, one has $\|\Delta_{\bar{\hat{M}}}\| \le \|\Delta\|$, and thus $J(\Delta_{\bar{\hat{M}}}) \le \psi(\bar{\hat{M}})\|\Delta\|$. Substituting this into the previous inequality, and noticing that $J((B_{0\hat{M}})_{\hat{M}^\perp}) = 0$, we obtain $F(\Delta) \ge \iota_L\|\Delta\|^2 - \frac{3}{2}\lambda\psi(\bar{\hat{M}})\|\Delta\|$. The right-hand side is a quadratic form in $\|\Delta\|$; as long as $\|\Delta\|^2 > \frac{9\lambda^2}{4\iota_L^2}\psi^2(\bar{\hat{M}})$, one has $F(\Delta) > 0$. By Lemma 4 in Negahban et al. (2012), we have $\|\hat{B}_{\hat{M}} - B_{0\hat{M}}\|_F^2 \le C(\sum_{l\in M}r_l)\lambda^2/\iota_L^2$ for some positive constant $C$.

Next we need to calculate $J^*(\nabla R(B_{0\hat{M}}))$, where $\nabla R(B_{0\hat{M}}) = [D_l, l\in\hat{M}]\in R^{p\times q|\hat{M}|}$ with $D_l = 2n^{-1}\sum_{i=1}^n x_{il}E_i$. By a similar argument as in the proof of Theorem 5, one has $J^*(\nabla R(B_{0\hat{M}})) = \max_{l\in\hat{M}}\|D_l\|_{op}$. To calculate $\|D_l\|_{op}$, by the same argument as in the proof of Theorem 5, one has $\|n^{-1}\sum_{i=1}^n x_{il}W_i\|_{op} \le 2n^{-1/2}\sigma_l(p^{1/2} + q^{1/2})$ with probability at least $1 - c_1\exp\{-c_2(p+q)\} - \exp(-n)$, where $c_1$ and $c_2$ are some positive constants. Therefore, one has $J^*(\nabla R(B_{0\hat{M}})) = \max_{l\in\hat{M}}\|D_l\|_{op} \le 4n^{-1/2}(\max_{l\in\hat{M}}\sigma_l)(p^{1/2} + q^{1/2})$ with probability at least $1 - |\hat{M}|c_1\exp\{-c_2(p+q)\} - |\hat{M}|\exp(-n)$. By condition (A4), one has $\max_{l\in\hat{M}}\sigma_l \le \lambda_{\max}(\Sigma_x) \le C_5n^{\tau}$. By the proof of Theorem 2, one has $|\hat{M}| = O(n^{2\kappa+\tau})$ with probability at least $1 - c_4\exp(-c_5n^{1-2\kappa})$ for some positive constants $c_4$ and $c_5$. Thus, when $\lambda \ge 4C_5n^{\tau-1/2}(p^{1/2} + q^{1/2})$, one has $\lambda \ge J^*(\nabla R(B_{0\hat{M}}))$ with probability at least $1 - c_1n^{2\kappa+\tau}\exp\{-c_2(p+q)\} - c_3n^{2\kappa+\tau}\exp(-n) - c_4\exp(-c_5n^{1-2\kappa})$ for some positive constants $c_1$, $c_2$, $c_3$, $c_4$ and $c_5$.

By the proof of Theorem 1, the event $\{M\subseteq\hat{M}\}$ holds with probability going to 1. In particular, $P(\{M\subseteq\hat{M}\}) \ge 1 - c_4\exp(-c_5n^{1-2\kappa})$ for some positive constants $c_4$ and $c_5$. Therefore, there exist some positive constants $c_1$, $c_2$, $c_3$, $c_4$ and $c_5$ such that with probability $1 - c_1n^{2\kappa+\tau}\exp\{-c_2(p+q)\} - c_3n^{2\kappa+\tau}\exp(-n) - c_4\exp(-c_5n^{1-2\kappa})$, one has $\|\hat{B}_{\hat{M}} - B_{0\hat{M}}\|_F^2 \le C(\sum_{l\in M}r_l)\lambda^2/\iota_L^2$ for some positive constant $C$. When Assumptions (A5) and (A14) hold, with probability going to 1, one has $\|\hat{B}_{\hat{M}} - B_{0\hat{M}}\|_F^2 \le C(\sum_{l\in M}r_l)\lambda^2/\iota_L^2$. This completes the proof of Theorem 6.

C. Interpretations of Λl

In this section, we include a detailed interpretation of the definition of $\Lambda_l$. Without loss of generality, we assume $\hat{M} = M = \{1,\ldots,s\}$ with $s$ fixed for notational simplicity. We first give a necessary condition for the rank consistency presented in Theorem 4. By Proposition 18 of Bach (2008), for any $1\le l\le s$, we have $(U_{l0\perp})^T\hat\Delta_lV_{l0\perp} = o_p(1)$ if $\mathrm{rank}(\hat{B}_l) = \mathrm{rank}(B_{l0}) = r_l$. Since $\hat\Delta_l \to_p \Delta_{l0}$ and $\Delta_{l0}$ is a nonrandom quantity, we have $(U_{l0\perp})^T\Delta_{l0}V_{l0\perp} = 0$. Recall that $\{\Delta_{l0} : 1\le l\le s\}$ is the minimizer of $L_0(\Delta_1,\ldots,\Delta_s)$, and thus $\{\Delta_{l0} : 1\le l\le s\}$ is the solution of the optimization problem

$$\min L_0(\Delta)\quad\mathrm{subject\ to}\quad (U_{l0\perp})^T\Delta_lV_{l0\perp} = 0\ \mathrm{for\ every}\ 1\le l\le s. \quad (9)$$

Using the Lagrange multiplier method, consider the minimizer of

$$L(\Delta, \Lambda_1,\ldots,\Lambda_s) = 2^{-1}\mathrm{vec}(\Delta)^T(\Sigma_M\otimes I_{pq\times pq})\mathrm{vec}(\Delta) + \sum_{l=1}^s\mathrm{tr}(U_{l0}^T\Delta_lV_{l0}) + \sum_{l=1}^s\mathrm{tr}\{\Lambda_l^T(U_{l0\perp})^T\Delta_lV_{l0\perp}\},$$

where $\{\Lambda_l, l = 1,\ldots,s\}$ are Lagrange multipliers. Thus, for $l = 1,\ldots,s$, $\{\Delta_{l0} : 1\le l\le s\}$ satisfies

$$\frac{\partial L}{\partial\Delta_l} = \sum_{l'=1}^s\Sigma_{M,ll'}\Delta_{l'0} + U_{l0}V_{l0}^T + U_{l0\perp}\Lambda_l(V_{l0\perp})^T = 0, \qquad \frac{\partial L}{\partial\Lambda_l} = (U_{l0\perp})^T\Delta_{l0}V_{l0\perp} = 0.$$

Recall that $A = \Sigma_M\otimes I_{pq\times pq}$, $K_l = V_{l0\perp}\otimes U_{l0\perp}$, and $d_l = -\mathrm{vec}(U_{l0}V_{l0}^T)$ for $l = 1,\ldots,s$, where $\otimes$ denotes the Kronecker product. Let $d = (d_1^T,\ldots,d_s^T)^T$, $K = \mathrm{diag}\{K_1,\ldots,K_s\}$, and $\Lambda_l\in R^{(p-r_l)\times(q-r_l)}$ for $l = 1,\ldots,s$ such that

$$\mathrm{vec}(\Lambda) = (\mathrm{vec}(\Lambda_1)^T,\ldots,\mathrm{vec}(\Lambda_s)^T)^T = (K^TA^{-1}K)^{-1}K^TA^{-1}d.$$

Then the Lagrange equations can be written as

$$\begin{pmatrix}A & K\\ K^T & 0\end{pmatrix}\begin{pmatrix}\mathrm{vec}(\Delta)\\ \mathrm{vec}(\Lambda)\end{pmatrix} = \begin{pmatrix}d\\ 0\end{pmatrix}.$$

It is easy to show that

$$\mathrm{vec}(\Delta) = A^{-1}(d - K\,\mathrm{vec}(\Lambda))\quad\mathrm{and}\quad\mathrm{vec}(\Lambda) = (K^TA^{-1}K)^{-1}K^TA^{-1}d.$$

From the above calculation, we can see that vec(Λ) = (vec(Λ1)T, …, vec(Λs)T)T is actually the Lagrange multiplier for the optimization problem (9).
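For completeness, the Lagrange (KKT) system above can be solved directly once $A$, $K$ and $d$ have been assembled as defined in this appendix; a minimal numerical sketch follows (the sign convention matches the displayed system).

```python
import numpy as np

def lagrange_system_solution(A, K, d):
    """Solve the KKT system of Appendix C,
        [A   K] [vec(Delta) ]   [d]
        [K^T 0] [vec(Lambda)] = [0],
    by assembling the block matrix and return (vec(Delta), vec(Lambda)).
    This reproduces the closed forms vec(Lambda) = (K^T A^{-1} K)^{-1} K^T A^{-1} d
    and vec(Delta) = A^{-1}(d - K vec(Lambda)) numerically.

    A : (m, m) matrix, here Sigma_M kron I_{pq}.
    K : (m, k) matrix, block-diagonal in the K_l.
    d : (m,) vector stacking the d_l.
    """
    m, k = K.shape
    kkt = np.block([[A, K], [K.T, np.zeros((k, k))]])
    rhs = np.concatenate([d, np.zeros(k)])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:m], sol[m:]
```

In practice, $A$ and $K$ could be formed with, for example, np.kron and scipy.linalg.block_diag from $\Sigma_M$ and the singular subspaces.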

Footnotes

Supplementary Material

Supplementary Material available online includes modified algorithm for our regularized low rank estimation procedure when response and covariates are not centered and additional simulation results.

Contributor Information

Dehan Kong, Department of Statistical Sciences, University of Toronto.

Baiguo An, School of Statistics, Capital University of Economics and Business.

Jingwen Zhang, Department of Biostatistics, University of North Carolina at Chapel Hill.

Hongtu Zhu, Department of Biostatistics, University of North Carolina at Chapel Hill.

References

  1. Anderson TW (1955), “The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities,” Proceedings of the American Mathematical Society, 6, 170–176. [Google Scholar]
  2. Bach FR (2008), “Consistency of trace norm minimization,” Journal of Machine Learning Research, 9, 1019–1048. [Google Scholar]
  3. Barrett JC, Fry B, Maller J, and Daly MJ (2005), “Haploview: analysis and visualization of LD and haplotype maps,” Bioinformatics, 21, 263–265. [DOI] [PubMed] [Google Scholar]
  4. Barut E, Fan J, and Verhasselt A (2016), “Conditional sure independence screening,” Journal of the American Statistical Association, 111, 1266–1277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Beck A and Teboulle M (2009), “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, 2, 183–202. [Google Scholar]
  6. Borwein JM and Lewis AS (2010), Convex analysis and nonlinear optimization: theory and examples, Springer Science & Business Media. [Google Scholar]
  7. Breiman L and Friedman J (1997), “Predicting multivariate responses in multiple linear regression,” Journal of the Royal Statistical Society, 59, 3–54. [Google Scholar]
  8. Buhlmann P and van de Geer S (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications., New York, N. Y.: Springer. [Google Scholar]
  9. Cai J-F, Candès EJ, and Shen Z (2010), “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, 20, 1956–1982. [Google Scholar]
  10. Candès E and Tao T (2007), “The Dantzig selector: Statistical estimation when p is much larger than n,” The Annals of Statistics, 2313–2351. [Google Scholar]
  11. Candes EJ and Recht B (2009), “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics., 9, 717–772. [Google Scholar]
  12. Chen Y, Goldsmith J, and Ogden T (2016), “Variable Selection in Function-on-Scalar Regression,” Stat, 5, 88–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Chiang MC, Barysheva M, Toga AW, Medland SE, Hansell NK, James MR, McMahon KL, de Zubicaray GI, Martin NG, Wright MJ, and Thompson PM (2011), “BDNF gene effects on brain circuitry replicated in 455 twins,” NeuroImage, 55, 448–454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Cook R, Helland I, and Su Z (2013), “Envelopes and partial least squares regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75, 851–877. [Google Scholar]
  15. Ding S and Cook D (2018), “Matrix variate regressions and envelope models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 387–408. [Google Scholar]
  16. Ding S and Cook RD (2014), “Dimension folding PCA and PFC for matrix-valued predictors,” Statistica Sinica, 24, 463–492. [Google Scholar]
  17. Facchinei F and Pang J-S (2003), Finite-dimensional variational inequalities and complementarity problems Vol. I, Springer Series in Operations Research, Springer-Verlag, New York. [Google Scholar]
  18. Fan J, Feng Y, and Song R (2011), “Nonparametric independence screening in sparse ultra-high-dimensional additive models,” Journal of the American Statistical Association, 106, 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Fan J and Lv J (2008), “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society. Series B., 70, 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. —(2010), “A selective overview of variable selection in high dimensional feature space,” Statistica Sinica, 20, 101. [PMC free article] [PubMed] [Google Scholar]
  21. Fan J and Song R (2010), “Sure independence screening in generalized linear models with NP-dimensionality,” The Annals of Statistics, 38, 3567–3604. [Google Scholar]
  22. Fosdick BK and Hoff PD (2015), “Testing and modeling dependencies between a network and nodal attributes,” Journal of the American Statistical Association, 110, 1047–1056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, et al. (2002), “The structure of haplotype blocks in the human genome,” Science, 296, 2225–2229. [DOI] [PubMed] [Google Scholar]
  24. Hibar D, Stein JL, Kohannim O, Jahanshad N, Saykin AJ, Shen L, Kim S, Pankratz N, Foroud T, Huentelman MJ, Potkin SG, Jack C, Weiner MW, Toga AW, Thompson P, and ADNI (2011), “Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects,” NeuroImage, 56, 1875–1891. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Huang M, Nichols T, Huang C, Yang Y, Lu Z, Feng Q, Knickmeyer RC, Zhu H, and ADNI (2015), “FVGWAS: Fast Voxelwise Genome Wide Association Analysis of Large-scale Imaging Genetic Data,” NeuroImage, 118, 613–627. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Knight K and Fu W (2000), “Asymptotics for lasso-type estimators,” The Annals of Statistics, 28, 1356–1378. [Google Scholar]
  27. Leng C and Tang CY (2012), “Sparse matrix graphical models,” Journal of the American Statistical Association, 107, 1187–1200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Li L and Zhang X (2017), “Parsimonious tensor response regression,” Journal of the American Statistical Association, 112, 1131–1146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Liu J and Calhoun VD (2014), “A review of multivariate analyses in imaging genetics,” Frontiers in Neuroinformatics, 8, 29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Medlan SE, Jahanshad N, Neale BM, and Thompson PM (2014), “Whole-genome analyses of whole-brain data: working within an expanded search space,” Nature Neuroscience, 17, 791–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Medland SE, Jahanshad N, Neale BM, and Thompson PM (2014), “Whole-genome analyses of whole-brain data: working within an expanded search space,” Nature Neuroscience, 17, 791–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Negahban S, Yu B, Wainwright MJ, and Ravikumar PK (2009), “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” in Advances in Neural Information Processing Systems, pp. 1348–1356. [Google Scholar]
  33. Negahban SN, Ravikumar P, Wainwright MJ, and Yu B (2012), “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, 27, 538–557. [Google Scholar]
  34. Nesterov Y (2004), Introductory lectures on convex optimization, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, a basic course. [Google Scholar]
  35. Park Y, Su Z, and Zhu H (2017), “Groupwise envelope models for imaging genetic analysis,” Biometrics, 73, 1243–1253. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, and Wang P (2010), “Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer,” Annals of Applied Statistics, 4, 53–77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Peper JS, Brouwer RM, Boomsma DI, Kahn RS, and Pol HEH (2007), “Genetic Influences on Human Brain Structure: A Review of Brain Imaging Studies in Twins,” Human Brain Mapping, 28, 464–473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, et al. (2007), “PLINK: a tool set for whole-genome association and population-based linkage analyses,” The American Journal of Human Genetics, 81, 559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Rabusseau G and Kadri H (2016), “Low-rank regression with tensor responses,” in Advances in Neural Information Processing Systems, pp. 1867–1875. [Google Scholar]
  40. Ramsay JO and Silverman BW (2005), Functional Data Analysis, New York, N. Y.: Springer, 2nd ed. [Google Scholar]
  41. Raskutti G and Yuan M (2018), “Convex regularization for high-dimensional tensor regression,” Annals of Statistics, to appear. [Google Scholar]
  42. Recht B, Fazel M, and Parrilo PA (2010), “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, 52, 471–501. [Google Scholar]
  43. Scharinger C, Rabl U, Sitte HH, and Pezawas L (2010), “Imaging genetics of mood disorders,” NeuroImage, 53, 810–821. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Shen L, Kim S, Risacher SL, Nho K, Swaminathan S, West JD, Foroud T, Pankratz N, Moore JH, Sloan CD, Huentelman MJ, Craig DW, DeChairo BM, Potkin SG, Jr CRJ, Weiner MW, Saykin AJ, and ADNI (2010), “Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: A study of the ADNI cohort,” NeuroImage, 53, 1051–1063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Stein J, Hua X, Lee S, Ho A, Leow A, Toga A, Saykin AJ, Shen L, Foroud T, Pankratz N, Huentelman M, Craig D, Gerber J, Allen A, Corneveaux JJ, Dechairo B, Potkin S, Weiner M, Thompson P, and ADNI (2010), “Voxelwise genome-wide association study (vGWAS),” NeuroImage, 53, 1160–1174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Thompson PM, Ge T, Glahn DC, Jahanshad N, and Nichols TE (2013), “Genetics of the connectome,” NeuroImage, 80, 475–488. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Thompson PM, Stein JL, Medland SE, Hibar DP, Vasquez AA, Renteria ME, Toro R, Jahanshad N, Schumann G, Franke B, et al. (2014), “The ENIGMA Consortium: large-scale collaborative analyses of neuroimaging and genetic data,” Brain Imaging and Behavior, 8, 153–182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Tibshirani R (1997), “The lasso method for variable selection in the Cox model,” Statistics in Medicine, 16, 385–395. [DOI] [PubMed] [Google Scholar]
  49. van der Vaart A and Wellner J (2000), Weak Convergence and Empirical Processes: With Applications to Statistics (Springer Series in Statistics), New York, N. Y.: Springer. [Google Scholar]
  50. Vounou M, Janousova E, Wolz R, Stein J, Thompson P, Rueckert D, Montana G, and ADNI (2012), “Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer’s disease,” NeuroImage, 60, 700–716. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Vounou M, Nichols TE, Montana G, and ADNI (2010), “Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach,” NeuroImage, 53, 1147–1159. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Wang H, Nie F, Huang H, Kim S, Nho K, Risacher S, Saykin A, Shen L, and ADNI (2012a), “Identifying quantitative trait loci via group-sparse multi-task regression and feature selection: An imaging genetics study of the ADNI cohort,” Bioinformatics, 28, 229–237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Wang H, Nie F, Huang H, Risacher S, Saykin A, Shen L, and ADNI (2012b), “Identifying disease sensitive and quantitative trait relevant biomarkers from multi-dimensional heterogeneous imaging genetics data via sparse multi-modal multi-task learning,” Bioinformatics, 28, 127–136. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Wang L, Chen G, and Li H (2007), “Group SCAD regression analysis for microarray time course gene expression data,” Bioinformatics, 23, 1486–1494. [DOI] [PubMed] [Google Scholar]
  55. Yuan M, Ekici A, Lu Z, and Monteiro R (2007), “Dimension reduction and coefficient estimation in multivariate linear regression,” Journal of the Royal Statistical Society. Series B., 69, 329–346. [Google Scholar]
  56. Zhang YW, Xu ZY, Shen XT, and Pan W (2014), “Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data,” NeuroImage, 96, 309–325. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Zhao J and Leng C (2014), “Sparse matrix graphical models,” Statistica Sinica, 24, 799–814. [Google Scholar]
  58. Zhao P and Yu B (2006), “On model selection consistency of Lasso,” Journal of Machine Learning Research, 7, 2541–2563. [Google Scholar]
  59. Zhou H and Li L (2014), “Regularized matrix regression,” Journal of the Royal Statistical Society. Series B., 76, 463–483. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Zhu H, Khondker ZS, Lu Z, and Ibrahim JG (2014), “Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers,” Journal of American Statistical Association, 109, 977–990. [PMC free article] [PubMed] [Google Scholar]
  61. Zou H (2006), “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, 101, 1418–1429. [Google Scholar]
