Abstract
Heterogeneity is an unwanted source of variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify them. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the ‘Blessing of Dimensionality’. As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
Keywords: Multiple sourcing, batch effect, semiparametric factor model, principal component analysis, brain image network
1. Introduction
Aggregating and analyzing heterogeneous data is one of the most fundamental challenges in scientific data analysis. In particular, the intrinsic heterogeneity across multiple data sources violates the ideal ‘independent and identically distributed’ sampling assumption and may produce misleading results if it is ignored. For example, in genomics, data heterogeneity is ubiquitous and referred to as either ‘batch effect’ or ‘lab effect’. As characterized in [29], microarray gene expression data obtained from different labs on different processing dates may contain systematic variability. Furthermore, [30] pointed out that heterogeneity across multiple data sources may be caused by unobserved factors that have confounding effects on the variables of interest, generating spurious signals. In finance, it is also known that asset returns are driven by varying market regimes and economic conditions, which can be regarded as a temporal batch effect. Therefore, to properly analyze data aggregated from multiple sources, we need to carefully model and adjust for the unwanted variations.
Modeling and estimating the heterogeneity effect is challenging for two reasons. (i) Typically, we can only access a limited number of samples from an individual group, given the high cost of biological experiments, technological constraints, or fast switching of economic regimes. (ii) The dimensionality can be much larger than the total aggregated number of samples. The past decade has witnessed the development of many methods for adjusting batch effects in high-throughput genomics data. See, for example, [43], [2], [30], and [25]. Though progress has been made, most of the aforementioned papers focus on the practical side and none of them has a systematic theoretical justification. In fact, most of these methods are developed in a case-by-case fashion and are only applicable to certain problem domains. Thus, a gap still exists between practice and theory.
To bridge this gap, we propose a generic theoretical framework to model, estimate, and adjust heterogeneity across multiple datasets. Formally, we assume the data come from m different sources: the ith data source contributes ni samples, each having p measurements such as gene expressions of an individual or stock returns of a day. To explicitly model heterogeneity, we assume that batch-specific latent factors influence the observed data in batch i (j indexes variables; t indexes samples) as in the approximate factor model:
(1.1)   $X_{jt}^{(i)} = \lambda_j^{(i)\prime} f_t^{(i)} + u_{jt}^{(i)}, \qquad j \le p, \; t \le n_i,$
where $\lambda_j^{(i)}$ is an unknown factor loading for variable j, $f_t^{(i)}$ is the latent batch-specific factor, and $u_{jt}^{(i)}$ is the true uncorrupted signal. We consider a random loading $\lambda_j^{(i)}$. The linear term $\lambda_j^{(i)\prime} f_t^{(i)}$ models the heterogeneity effect. We assume that $u_t^{(i)} = (u_{1t}^{(i)}, \ldots, u_{pt}^{(i)})^{\prime}$ is independent of $f_t^{(i)}$ and shares the same common distribution with mean 0 and covariance Σp×p across all data sources. In the matrix-form model, (1.1) can be written as
(1.2)   $X_i = \Lambda_i F_i^{\prime} + U_i,$
where Xi is a p×ni data matrix in the ith batch, Λi is a p×Ki factor loading matrix with $\lambda_j^{(i)\prime}$ in the jth row, Fi is an ni × Ki factor matrix and Ui is a signal matrix of dimension p × ni. We allow the number of latent factors Ki to depend on batch i. We emphasize here that within one batch, our model is homogeneous. Heterogeneity in this paper refers to the fact that the batch-effect terms $\Lambda_i F_i^{\prime}$ differ across groups i = 1,…,m; these are the unwanted variations in our study.
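To fix ideas, the following minimal sketch generates one batch from model (1.2) with Gaussian factors, loadings and signal (the theory only requires sub-Gaussian tails); the function name and the choice of distributions are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_batch(p, n, K, Sigma_u, rng):
    """Draw one batch X = Lambda F' + U from model (1.2).

    Lambda (p x K) and F (n x K) are batch specific, while the columns of U are i.i.d.
    with the common covariance Sigma_u shared by all batches.
    """
    Lambda = rng.standard_normal((p, K))                          # batch-specific loadings
    F = rng.standard_normal((n, K))                               # latent factors
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=n).T   # homogeneous signal, p x n
    return Lambda @ F.T + U

# Example: two batches sharing Sigma_u but with different loadings, factors and sizes.
rng = np.random.default_rng(0)
Sigma_u = np.eye(50)
X1 = simulate_batch(p=50, n=10, K=2, Sigma_u=Sigma_u, rng=rng)
X2 = simulate_batch(p=50, n=15, K=3, Sigma_u=Sigma_u, rng=rng)
```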
To see more clearly how model (1.2) characterizes the heterogeneity, note that for the tth sample $x_t^{(i)}$, which is the tth column of Xi,
(1.3)   $\operatorname{Cov}\big(x_t^{(i)}\big) = \Lambda_i \operatorname{Cov}\big(f_t^{(i)}\big)\Lambda_i^{\prime} + \Sigma.$
Therefore, the heterogeneity is carried by the low-rank component $\Lambda_i \operatorname{Cov}(f_t^{(i)})\Lambda_i^{\prime}$ in the population covariance matrix of $x_t^{(i)}$. We need to clarify that, since we assume both Fi and Ui have mean zero, the heterogeneity discussed in this paper concerns the covariance structure as shown above rather than the mean structure. In addition, our model differs from the random/mixed effect regression models studied in the literature [45, 23, 11] in that our models are factor models without any observed factors, while the mixed/random effect model is a regression model that requires covariate matrices to estimate the batch-specific term.
Under a pervasive assumption, the heterogeneity component can be estimated by directly applying principal component analysis (PCA) or Projected-PCA (PPCA), which is more accurate when there are sufficiently informative covariates Wi [18]. Let $\widehat{\Lambda}_i \widehat{F}_i^{\prime}$ be the estimated heterogeneity component and $\widehat{U}_i = X_i - \widehat{\Lambda}_i \widehat{F}_i^{\prime}$ the heterogeneity-adjusted signal, which can be treated as homogeneous across different datasets and thus can be combined for downstream statistical analysis. This whole framework of heterogeneity adjustment is termed ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) and is schematically shown in Figure 1.
The proposed ALPHA framework is fully generic and applicable to almost all kinds of multivariate analysis of the combined, heterogeneity-adjusted datasets. As an illustrative example, in this paper we focus on the problem of Gaussian graphical model inference based on multiple datasets. It is a powerful tool to explore complex dependence structures among variables X = (X1,…,Xp)′. The sparsity pattern of the precision matrix Ω = Σ−1 encodes the information of an undirected graph G = (V,E), where V consists of p vertices corresponding to the p variables in X and E describes their dependence relationships. To be specific, Vi and Vj are linked by an edge if and only if Ωij ≠ 0 (the (i,j)th element of Ω), meaning that Xi and Xj are dependent conditioning on the rest of the variables. For heterogeneous data across m data sources, we need to first adjust for heterogeneity using the ALPHA framework. The idea of covariate-adjusted precision matrix estimation has been studied by [7], but they assumed observed factors and no heterogeneity issue, i.e., m = 1.
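The correspondence between the sparsity pattern of Ω and the edge set of the graph can be read off directly; the short sketch below, with a hypothetical function name, makes this explicit.

```python
import numpy as np

def graph_edges(Omega, tol=1e-8):
    """List the undirected edges encoded by a precision matrix: vertices i and j are
    linked iff Omega[i, j] is nonzero (up to a numerical tolerance)."""
    p = Omega.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p) if abs(Omega[i, j]) > tol]
```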
A significant amount of literature has focused on the estimation of the precision matrix Ω for graphical models with homogeneous data. [49] and [20] developed the Graphical Lasso method using the L1 penalty, and [27] and [42] used non-convex penalties. Furthermore, [40] and [33] studied the theoretical properties under different assumptions. Estimating Ω can be equivalently reformulated as a set of node-wise sparse linear regressions that utilize the Lasso or the Dantzig selector for each node [35, 48]. To relax the assumption of Gaussian data, [32] and [31] extended the graphical model to the semiparametric Gaussian copula and transelliptical families. Via the ALPHA framework, we can combine the adjusted data to construct an estimator for the precision matrix Ω by the above methods. Recent works also focus on joint estimation of multiple Gaussian or discrete graphical models which share some common structure [22, 15, 47, 8, 21]. They are concerned with both the commonality and the individual uniqueness of the graphs. In comparison, ALPHA places more emphasis on heterogeneity-adjusted aggregation for a single graph.
The rest of the paper is organized as follows. Section 2 lays out the basic problem setup and necessary assumptions. We model the heterogeneity by a semiparametric factor model. Section 3 introduces the ALPHA methodology for heterogeneity adjustment. Two main methods, PCA and PPCA, will be introduced for adjusting the factor effects under different regimes. A guiding rule of thumb is also proposed to determine which method is more appropriate. The heterogeneity-adjusted data will be combined to provide valid graph estimation in Section 4. The CLIME method of [9] is applied for precision matrix estimation. Synthetic and real data analyses are carried out to demonstrate the proposed framework in Section 5. Section 6 contains further discussions, and all the proofs are relegated to the appendix.
2. Problem setup
To more efficiently use the external covariate information in removing the heterogeneity effect, we first present a semiparametric factor model. Then, based on whether the collected external covariates have explanatory power on the factor loadings, we discuss two different regimes where PCA or PPCA should be used. We will state the conditions under which these methods can be formally justified.
2.1. Semiparametric factor model
We assume that for subgroup i, we have d external covariates $W_j^{(i)}$ for variable j. In stock returns, these can be attributes of a firm; in brain imaging, these can be the physical locations of voxels. We assume that these covariates have some explanatory power on the loading parameters in (1.1), so that the loading can be further modeled as $\lambda_j^{(i)} = g_i(W_j^{(i)}) + \gamma_j^{(i)}$, where $g_i(\cdot)$ captures the external covariate effect on $\lambda_j^{(i)}$ and $\gamma_j^{(i)}$ is the part that cannot be explained by the covariates. Thus, model (1.1) can be written as
(2.1)   $X_{jt}^{(i)} = \big(g_i(W_j^{(i)}) + \gamma_j^{(i)}\big)^{\prime} f_t^{(i)} + u_{jt}^{(i)}.$
Model (2.1) does not impose much restriction. If $W_j^{(i)}$ is not informative at all, i.e., gi(·) = 0, the model reduces to a regular factor model. In matrix form, model (2.1) can be written as
(2.2)   $X_i = \big(G_i(W_i) + \Gamma_i\big) F_i^{\prime} + U_i.$
In (2.2), Gi(Wi) and Γi are p×Ki matrices. More specifically, $g_k(W_j^{(i)})$ and $\gamma_{jk}^{(i)}$ are the (j,k)th elements of Gi(Wi) and Γi respectively. Expression (2.2) suggests that the observed data can be decomposed into a low-rank heterogeneity term $(G_i(W_i) + \Gamma_i)F_i^{\prime}$ and a homogeneous signal term Ui. Letting $u_t^{(i)}$ be the tth column of Ui, we assume all $u_t^{(i)}$ share the same distribution, with mean zero and covariance Σ, for any t ≤ ni and for all subgroups i ≤ m.
There is a large literature on factor models in econometrics [3, 5, 17, 44], machine learning [10, 36] and random matrix theory [26, 38, 46]. We refer interested readers to those papers and the references therein. However, none of these models incorporates external covariate information. The semiparametric factor model (2.1) was first proposed by [14] and further investigated by [13] and [18]. Using sufficiently informative external covariates, we are able to estimate the factors and loadings more accurately, and hence obtain a better adjustment for heterogeneity.
Here we collect some notation for eigenvalues and matrix norms used in the paper. For a matrix M, we use λmax(M), λmin(M) and λi(M) to denote its maximum, minimum and ith eigenvalues respectively. We write $\|M\|_{\max}$, $\|M\|$, $\|M\|_F$, $\|M\|_1$ and $\|M\|_{\ell_1}$ for its entry-wise maximum, spectral, Frobenius, induced ℓ1 and element-wise ℓ1 norms respectively.
2.2. Modeling assumptions and general methodology
In this subsection, we explicitly list all the required modeling assumptions. We start with an introduction of the data generating processes.
Assumption 2.1 (Data generating process). (i) $\mathbb{E}[f_t^{(i)}] = 0$ and $\operatorname{Cov}(f_t^{(i)}) = I_{K_i}$ (normalization).
(ii) $\{u_t^{(i)}\}$ are independently and identically sub-Gaussian distributed with mean zero and covariance Σ within and between subgroups, and independent of $\{f_t^{(i)}\}$. Let .
(iii) $\{f_t^{(i)}\}$ is a stationary process, with arbitrary temporal dependency. The tail of the factors is sub-Gaussian, i.e., there exist C1, C2 > 0 such that for any and .
The above set of assumptions are commonly used in the literature, see [5] and [18]. We omit detailed discussions here.
Based on whether the external covariates are informative, we specify two regimes, each of which requires some additional technical conditions.
2.2.1. Regime 1: External covariates are not informative
For the case that the external covariates do not have enough explanatory power on the factor loadings Λi, we ignore the semiparametric structure and model (2.2) reduces to the traditional factor model, extensively studied in econometrics [3, 44, 37]. PCA will be employed in Section 3.1 to estimate the heterogeneous effect. It requires the following assumptions.
Assumption 2.2. (i) (Pervasiveness) There are two positive constants cmin, cmax > 0 so that
$c_{\min} < \lambda_{\min}\big(\Lambda_i^{\prime}\Lambda_i/p\big) \le \lambda_{\max}\big(\Lambda_i^{\prime}\Lambda_i/p\big) < c_{\max}.$
(ii) .
The first condition is common and essential in the factor model literature (e.g., [44]). It requires the factors to be strong enough that the covariance matrix has spiked eigenvalues. We emphasize here that this condition is not as stringent as it looks. Consider a single-factor model Yit = bift + uit, i = 1,…,p, t = 1,…,T. The pervasiveness assumption then simply requires that $p^{-1}\sum_{i=1}^p b_i^2 \in (c_{\min}, c_{\max})$. Note that since cmin can be a small constant, our pervasiveness assumption just says that the factors have a non-negligible effect on a non-vanishing proportion of outcomes. In addition, this condition is trivially true if the λj’s can be regarded as random samples from a population with a non-degenerate covariance matrix [17]. Practically, in fMRI data analysis for instance, the lab environment (temperature, air pressure, etc.) or the mental status of the subject being scanned may cause the BOLD (Blood-Oxygen-Level Dependent) level to be uniformly higher at certain time t. This means the brain heterogeneity is globally driven by the factors ft. If the batch effect is only limited to a small number of dimensions, we think it is more appropriate to assume sparsity instead of pervasiveness on the top eigenvectors, which is quite different from our problem setup and thus beyond the scope of our paper. The second condition holds if the population has a sub-Gaussian tail.
2.2.2. Regime 2: External covariates are informative
When covariates are informative, we will employ the PPCA [18] to estimate the heterogeneous effect. It requires the following assumptions.
Assumption 2.3. (i) (Pervasiveness) There are two positive constants cmin and cmax so that
$c_{\min} < \lambda_{\min}\big(G_i(W_i)^{\prime} G_i(W_i)/p\big) \le \lambda_{\max}\big(G_i(W_i)^{\prime} G_i(W_i)/p\big) < c_{\max}.$
(ii) .
This assumption is parallel to Assumption 2.2 (i). Pervasiveness is trivially satisfied if the covariates $\{W_j^{(i)}\}_{j \le p}$ are independent and Gi is sufficiently smooth.
Assumption 2.4. (i) .
(ii) Write $\Gamma_i = (\gamma_1^{(i)}, \ldots, \gamma_p^{(i)})^{\prime}$. We assume $\{\gamma_j^{(i)}\}_{j \le p}$ are independent of $\{W_j^{(i)}\}_{j \le p}$.
(iii) Define . We assume
Condition (i) is parallel to Assumption 2.2 (ii), whereas Condition (ii) is natural since Γi cannot be explained by Wi. Condition (iii) imposes cross-sectional weak dependence of $\{\gamma_j^{(i)}\}_{j \le p}$, which is much weaker than assuming independent and identically distributed $\gamma_j^{(i)}$. This condition is mild as the main cross-sectional dependency has been taken care of by the gk(·)’s.
3. The ALPHA framework
We introduce the ALPHA framework for heterogeneity adjustment. Methodologically, for each sub-dataset we aim to estimate the heterogeneity component and subtract it from the raw data. Theoretically, we aim to obtain the explicit rates of convergence for both the corrected homogeneous signal and its sample covariance matrix. Those rates will be useful when aggregating the homogeneous residuals from multiple sources.
This section covers the details of heterogeneity adjustment under the above two regimes, which correspond to estimating Ui by either PCA or PPCA. From now on, we drop the superscript i whenever there is no confusion, as we focus on the ith data source. We use the notation $\widehat{F}$ if F is estimated by PCA and $\widetilde{F}$ if it is estimated by PPCA. This convention applies to other related quantities such as $\widehat{U}$ and $\widetilde{U}$, the heterogeneity-adjusted estimators. In addition, we use notations such as $\check{F}$ and $\check{U}$ to denote the final estimators, which are $\widehat{F}$ and $\widehat{U}$ if PCA is used, and $\widetilde{F}$ and $\widetilde{U}$ if PPCA is used.
Estimators for latent factors under regimes 1 and 2 satisfy $\check{F}^{\prime}\check{F}/n = I_K$, which corresponds to the normalization in Assumption 2.1 (i). By the principle of least squares, the residual estimator of U then admits the form
(3.1)   $\check{U} = X\big(I_n - \check{F}\check{F}^{\prime}/n\big).$
3.1. Estimating factors by PCA
In regime 1, we directly use PCA to adjust the data heterogeneity. PCA estimates F by $\widehat{F}$, where the kth column of $\widehat{F}/\sqrt{n}$ is the eigenvector of $X^{\prime}X$ corresponding to the kth largest eigenvalue. We have the following theoretical results.
Theorem 3.1. Under Assumptions 2.1 and 2.2, we have
where and .
Note that we do not explicitly assume bounded . In some applications it might be natural to assume a sparse covariance so that all terms involving can be eliminated, while in other applications, such as the graphical model, it is more natural to impose a sparsity structure on the precision matrix. In this case, one may want to keep track of the effect of , as it can grow with the dimensionality.
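For concreteness, a minimal sketch of this PCA-based adjustment of a single batch is given below; it follows the construction above (factors from the top eigenvectors of X′X, least-squares loadings, residual form (3.1)), but the function name and the plain eigendecomposition are our own illustrative choices.

```python
import numpy as np

def pca_adjust(X, K):
    """PCA heterogeneity adjustment for one p x n batch X (regime 1).

    Returns F_hat (n x K, normalized so F_hat'F_hat/n = I_K), the least-squares
    loadings Lambda_hat (p x K), and the adjusted data U_hat = X (I - F_hat F_hat'/n).
    """
    p, n = X.shape
    eigval, eigvec = np.linalg.eigh(X.T @ X)        # eigen-decomposition of the n x n matrix X'X
    order = np.argsort(eigval)[::-1]                # sort eigenvalues in descending order
    F_hat = np.sqrt(n) * eigvec[:, order[:K]]
    Lambda_hat = X @ F_hat / n
    U_hat = X - Lambda_hat @ F_hat.T                # residual form (3.1)
    return F_hat, Lambda_hat, U_hat
```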
3.2. Estimating factors by Projected-PCA
In regime 2, we would like to incorporate the external covariates using the Projected-PCA (PPCA) method proposed by [18]. The method applies PCA to the projected data; through the projection, covariate information is leveraged to reduce dimensionality. We now briefly introduce the method.
For simplicity, we only consider d = 1, that is, we only have a single covariate. The general case can be found in [18]. To model the unknown function gk(Wj), we adopt a sieve-based idea which approximates gk(·) by a linear combination of basis functions (e.g., B-splines, Fourier series, polynomial series, wavelets). Then
(3.2)   $g_k(W_j) = \sum_{\nu=1}^{J} b_{\nu,k}\, \phi_\nu(W_j) + R_k(W_j), \qquad j \le p, \; k \le K.$
Here $\{b_{\nu,k}\}_{\nu \le J}$ are the sieve coefficients of gk(Wj), corresponding to the kth factor loading; Rk is the remainder function representing the approximation error; J denotes the number of sieve bases, which may grow slowly as p diverges. We take the same basis functions in (3.2) for all k, though they can be different.
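As a toy illustration of (3.2), the snippet below fits the sieve coefficients of a smooth loading function by least squares with a polynomial basis; the function name, the basis choice and the use of a direct least-squares fit are illustrative assumptions, not the estimator used in the paper.

```python
import numpy as np

def sieve_approx(g, W, J):
    """Approximate g(W_j) by sum_nu b_nu * phi_nu(W_j) with a polynomial basis of size J.

    Returns the fitted sieve coefficients b and the remainder R(W_j) = g(W_j) - Phi b.
    """
    Phi = np.vander(W, N=J, increasing=True)        # phi_nu(w) = w^(nu-1), nu = 1, ..., J
    b, *_ = np.linalg.lstsq(Phi, g(W), rcond=None)  # least-squares sieve coefficients
    R = g(W) - Phi @ b                              # approximation error
    return b, R

# Example: the remainder shrinks as J grows for a smooth g.
W = np.linspace(-1.0, 1.0, 200)
for J in (3, 5, 9):
    _, R = sieve_approx(np.sin, W, J)
    print(J, np.abs(R).max())
```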
Define $b_k = (b_{1,k}, \ldots, b_{J,k})^{\prime}$ for each k ≤ K, and correspondingly $\phi(W_j) = (\phi_1(W_j), \ldots, \phi_J(W_j))^{\prime}$. Then we can write $g_k(W_j) = b_k^{\prime}\phi(W_j) + R_k(W_j)$.
Let $\Phi(W) = (\phi(W_1), \ldots, \phi(W_p))^{\prime}$ be the p × J matrix of basis functions, $B = (b_1, \ldots, b_K)$ the J × K matrix of sieve coefficients, and Rk(Wj) the (j,k)th element of R(W)p×K. The matrix form (2.2) can then be written as
(3.3)   $X = \big(\Phi(W)B + R(W) + \Gamma\big)F^{\prime} + U,$
recalling that the data index i is dropped. Thus the residual, relative to the covariate-explained part Φ(W)BF′, contains three parts: the sieve approximation error R(W)F′, the unexplained loading term ΓF′, and the true signal U.
The idea of PPCA is simple: since the factor loadings in (3.3) are functions of the covariates and U and Γ are independent of W, if we project (smooth) the observed data onto the space of W, the effects of U and Γ will be significantly reduced and the problem becomes nearly a noiseless one, given that the approximation error R(W) is small.
Define P as the projection matrix onto the space spanned by the basis functions of W:
(3.4)   $P = \Phi(W)\big(\Phi(W)^{\prime}\Phi(W)\big)^{-1}\Phi(W)^{\prime}.$
By (3.3), $PX \approx G(W)F^{\prime}$. Thus, F can be estimated from the ‘noiseless projected data’ PX using conventional PCA. Let the columns of $\widetilde{F}/\sqrt{n}$ be the eigenvectors corresponding to the top K eigenvalues of the n × n matrix $(PX)^{\prime}(PX)$, the sample covariance of the projected data (up to scaling). Then $\widetilde{F}$ is the PPCA estimator of F. It only differs from PCA in that we use the smoothed or projected data PX.
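A minimal sketch of this projection-then-PCA step is given below, mirroring the construction above and the residual form (3.1); the function name, the single-covariate basis matrix Phi, and the explicit linear solve are illustrative simplifications rather than the authors' implementation.

```python
import numpy as np

def ppca_adjust(X, Phi, K):
    """Projected-PCA adjustment for one p x n batch X with a p x J sieve basis matrix Phi.

    Returns F_tilde (n x K), the covariate-explained loadings G_hat = P X F_tilde / n,
    and the adjusted data U_tilde = X (I - F_tilde F_tilde'/n).
    """
    p, n = X.shape
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)   # projection onto the space spanned by Phi
    PX = P @ X                                      # smoothed (projected) data
    eigval, eigvec = np.linalg.eigh(PX.T @ PX)
    order = np.argsort(eigval)[::-1]
    F_tilde = np.sqrt(n) * eigvec[:, order[:K]]     # PCA on the projected data
    G_hat = PX @ F_tilde / n
    U_tilde = X - X @ F_tilde @ F_tilde.T / n       # residual form (3.1)
    return F_tilde, G_hat, U_tilde
```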
We need the following conditions for basis functions and accuracy of sieve approximation.
Assumption 3.1. (i) There are dmin, dmax > 0 s.t.
$d_{\min} \le \lambda_{\min}\big(\Phi(W)^{\prime}\Phi(W)/p\big) \le \lambda_{\max}\big(\Phi(W)^{\prime}\Phi(W)/p\big) \le d_{\max}$
almost surely, and maxν≤J,j≤p Eϕν(Wj)2 < ∞.
(ii) There exists k ≥ 4 s.t. as , where 𝒳 is the support of Wj and maxν,k |bν,k| < ∞.
Condition (ii) is mild; for instance, when {φν} is a polynomial basis or B-splines, it is implied by the condition that the smooth curve gk(·) belongs to a Hölder class for some L > 0 [34, 12].
Recalling the definition of νp in Assumption 2.4 (iii), we have the following results.
Theorem 3.2. Choose and assume where . Under Assumptions 2.1, 2.3, 2.4 and 3.1,
where and if there exists C s.t. vp>C/n.
3.3. A guiding rule for estimating the number of factors, the number of basis functions and determining regimes
We now address the problem of estimating the number of factors in the two different regimes. An extensive literature has contributed to this problem in regime 1, i.e., the regular factor model [4, 1, 28]. [28] and [1] proposed to use the ratio of adjacent eigenvalues of $X^{\prime}X$ to infer the number of factors, i.e., $\widehat{K} = \arg\max_{k \le K_{\max}} \lambda_k(X^{\prime}X)/\lambda_{k+1}(X^{\prime}X)$. They showed that the estimator correctly identifies K with probability tending to 1, as long as Kmax ≥ K and Kmax = O(ni ∧ p).
For the semiparametric factor model, [18] proposed
$\widetilde{K} = \arg\max_{k \le K_{\max}} \frac{\lambda_k\big((PX)^{\prime}(PX)\big)}{\lambda_{k+1}\big((PX)^{\prime}(PX)\big)}.$
Here Kmax is of the same order as Jd. It was shown that $P(\widetilde{K} = K) \to 1$ under regularity assumptions, which we omit here. When we have genuine and pervasive covariates, K̃ typically outperforms K̂. More details can be found in [18].
Once we use K̂ and K̃ to estimate the number of factors under the regular factor model and the semiparametric factor model respectively, we naturally have an adaptive rule to decide whether the covariates W are informative enough to warrant PPCA over PCA. We compare two eigen-ratios:
$\max_{k \le K_{\max}} \frac{\lambda_k(X^{\prime}X)}{\lambda_{k+1}(X^{\prime}X)} \quad \text{versus} \quad \max_{k \le K_{\max}} \frac{\lambda_k\big((PX)^{\prime}(PX)\big)}{\lambda_{k+1}\big((PX)^{\prime}(PX)\big)}.$
If the former is larger, we identify the dataset as regime 1 and apply regular PCA to get $\widehat{U}$; otherwise it is regime 2 and PPCA is used to obtain $\widetilde{U}$. The intuition behind this comparison is that the maximal eigen-ratios can be perceived as signal-to-noise ratios for estimating the spiky heterogeneity term. The first ratio measures the eigen-gap between the low-rank heterogeneity part and Σ, while the second ratio measures the eigen-gap between the covariate-explained heterogeneity part and PΣP. If G(W) is much more important than Γ in explaining the loading structure, projection preserves signal and reduces error, which improves the eigen-gap. Conversely, if W is weak in providing useful information, projection reduces both noise and signal. Therefore, if projection enlarges the maximum eigen-gap, we prefer PPCA over PCA to estimate the spiky heterogeneity part. Our proposed guiding rule effectively tells whether projection can further contrast the spiky and non-spiky parts of the covariance.
The above signal-to-noise comparison can be extended to choose the number of basis functions. Notice that we can regard regular PCA as PPCA with the number of basis functions J = p and hence P = I. In this line of thinking, we can index P by J and maximize the eigen-ratio $\max_{k \le K_{\max}} \lambda_k\big((P_J X)^{\prime}(P_J X)\big)/\lambda_{k+1}\big((P_J X)^{\prime}(P_J X)\big)$ over J, where J = p corresponds to PCA. Here we use the notation PJ to exhibit the dependency on J. We implement this guiding rule in the real data analysis.
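The following sketch implements the eigen-ratio computation and the regime decision just described; the helper names are illustrative assumptions.

```python
import numpy as np

def eigen_ratio_max(M, Kmax):
    """Maximal ratio of adjacent eigenvalues lambda_k / lambda_{k+1} for k <= Kmax,
    together with the maximizing k (the eigenvalue-ratio estimate of the factor number)."""
    ev = np.sort(np.linalg.eigvalsh(M))[::-1]
    ratios = ev[:Kmax] / ev[1:Kmax + 1]            # Kmax should be well below rank(M)
    return ratios.max(), int(np.argmax(ratios)) + 1

def choose_regime(X, Phi, Kmax):
    """Guiding rule: compare the maximal eigen-ratio of X'X with that of (PX)'(PX)."""
    r_pca, K_hat = eigen_ratio_max(X.T @ X, Kmax)
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
    PX = P @ X
    r_ppca, K_tilde = eigen_ratio_max(PX.T @ PX, Kmax)
    return ("PCA", K_hat) if r_pca >= r_ppca else ("PPCA", K_tilde)
```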
In practice, there is still a chance of misspecifying the true number of factors K in ALPHA. One might be curious about how this affects the performance of ALPHA and the subsequent statistical analysis. To clarify this issue, we conduct a sensitivity analysis on the number of factors in Section G.3 in the appendix. The take-home message is that overestimation of K does not hurt, while underestimation of K might mislead subsequent statistical inference.
3.4. Summary of ALPHA
We now summarize the final procedure and the convergence rates. We first divide the m subgroups into two classes based on whether the collected covariates have significant influence on the loadings.
ALPHA consists of the following three steps.
Step 1: (Preprocessing) For data source i, determine whether it belongs to the PCA class or the PPCA class according to the guiding rule given in Section 3.3, and correspondingly estimate Ki by Ǩi, which equals K̂i or K̃i (choosing J if necessary).
Step 2: (Adjustment) Apply Projected-PCA to remove the heterogeneity if data source i belongs to the PPCA class; otherwise use PCA, resulting in adjusted data $\check{U}_i$, which is either $\widetilde{U}_i$ or $\widehat{U}_i$.
Step 3: (Aggregation) Combine the adjusted data to conduct further statistical analysis (see the sketch below). For example, estimate the covariance by the aggregated sample covariance $\check{\Sigma} = N^{-1}\sum_{i=1}^m \check{U}_i\check{U}_i^{\prime}$, where $N = \sum_{i=1}^m n_i$ is the aggregated sample size; or estimate the sparse precision matrix Ω by existing graphical model methods.
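An end-to-end sketch of the three steps, built on the helper functions sketched earlier (choose_regime, pca_adjust, ppca_adjust), is given below; it illustrates the workflow under our naming assumptions and is not the authors' implementation.

```python
import numpy as np

def alpha_aggregate(X_list, Phi_list, Kmax):
    """ALPHA pipeline sketch: per-batch regime choice, heterogeneity adjustment,
    and aggregation of the adjusted residuals into a pooled covariance estimator."""
    U_all = []
    for X, Phi in zip(X_list, Phi_list):
        regime, K = choose_regime(X, Phi, Kmax)     # Step 1: preprocessing
        if regime == "PPCA":                        # Step 2: adjustment
            _, _, U_adj = ppca_adjust(X, Phi, K)
        else:
            _, _, U_adj = pca_adjust(X, K)
        U_all.append(U_adj)
    U = np.hstack(U_all)                            # Step 3: aggregation, p x N
    Sigma_check = U @ U.T / U.shape[1]              # aggregated covariance estimator
    return Sigma_check, U
```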
We summarize the ALPHA procedure in Algorithm 1 given in Section A. We also summarize the convergence of the adjusted data and the aggregated covariance below. To ease the presentation, we consider a typical regime in practice: for some constant C. We focus on the situation of sufficiently smooth curves (k = ∞) so that J diverges very slowly (say with rate ) and bounded ϕmax and νp (defined respectively in Theorem 3.2 and Assumption 2.4). Based on the discussions of the previous subsections, for the estimation of U in , we have
Therefore, PPCA dominates PCA as long as effective covariates are provided. However, dominates all the remaining terms, so that
In addition, for estimation of , we have
(3.5) |
where , depending on . If we consider a very sparse covariance matrix so that is bounded, we can simply drop the term δ in both regimes. Then, regime 1 achieves better rate if but regime 2 outperforms otherwise.
4. Post-ALPHA inference
We have summarized the order of the biases caused by adjusting heterogeneity for each data source in Section 3.4. Now we combine the adjusted data for further statistical analysis. As an example, we study estimation of the Gaussian graphical model. Assume further that ut is Gaussian and consider the following class of precision matrices:
(4.1) |
To simplify the analysis, we assume R is fixed, but all the analysis can be easily extended to include growing R.
To estimate Ω = Σ−1 via CLIME, we need a covariance estimator as the input. We assume here that the number of factors is known, i.e., the small exceptional probability of failing to recover Ki is ignored for ease of discussion. Such an estimator is naturally given by
(4.2)   $\check{\Sigma} = \frac{1}{N}\sum_{i=1}^m \check{U}_i \check{U}_i^{\prime}, \qquad N = \sum_{i=1}^m n_i.$
Since the number of data sources is huge, we focus on the case of diverging N and p.
4.1. Covariance estimation
Let ΣN be the oracle sample covariance matrix, i.e., $\Sigma_N = N^{-1}\sum_{i=1}^m U_i U_i^{\prime}$, computed from the unobserved true signals. We consider the difference between our proposed estimator (4.2) and ΣN in this subsection. The oracle estimator obviously attains the rate $\|\Sigma_N - \Sigma\|_{\max} = O_P\big(\sqrt{\log p / N}\big)$.
Let where is the kth column of Fi. It is not hard to verify that is Gaussian distributed with mean zero and variance Σ. Note that are i.i.d. with respect to k and i, using the assumption . By the standard concentration bound (e.g. Lemma 4.2 of [19]),
where . Therefore, by (3.5), we have
(4.3) |
where and .
We now examine the difference between the ALPHA estimator and the oracle estimator in two specific cases. In the first case, we apply PCA to all data sources, i.e., all data sources fall in the PCA class and Ki is bounded. We then have am,N,p = m log p/N. This rate is dominated by the oracle error rate $\sqrt{\log p/N}$ if and only if $m = O\big(\sqrt{N/\log p}\big)$. This means traditional PCA performs optimally for adjusting heterogeneity as long as the number of subgroups grows more slowly than the order of $\sqrt{N/\log p}$.
If we apply PPCA to all data sources, i.e., all data sources fall in the PPCA class and Ki is bounded, then . This rate is of smaller order than the oracle rate if p/log p > CN for some constant C > 0. The advantage of using PPCA is that even when ni is bounded so that , we can still achieve the optimal rate of convergence as long as the dimensionality is large enough, at least of the order of N.
4.2. Precision matrix estimation
In order to obtain an estimator of the sparse precision matrix from $\check{\Sigma}$, we apply the CLIME estimator proposed by [9]. For a given $\check{\Sigma}$, CLIME solves the following optimization problem:
(4.4)   $\widehat{\Omega} = \arg\min_{\Omega \in \mathbb{R}^{p\times p}} \|\Omega\|_{\ell_1} \quad \text{subject to} \quad \|\check{\Sigma}\,\Omega - I_p\|_{\max} \le \lambda,$
where λ is a tuning parameter. Note that (4.4) can be solved column-wise by linear programming. However, CLIME does not necessarily generate a symmetric matrix. We can simply symmetrize it by taking, for each pair (i,j), the entry of $\widehat{\Omega}_{ij}$ and $\widehat{\Omega}_{ji}$ with the smaller magnitude. The resulting matrix after symmetrization, still denoted by $\widehat{\Omega}$ with a slight abuse of notation, also attains a good rate of convergence. In particular, we consider the sparse precision matrix class in (4.1). The following theorem guarantees recovery of any sparse matrix Ω in this class.
Theorem 4.1. Suppose and let . Choosing , we have
Furthermore, and .
Here we stress that we choose CLIME for the precision matrix estimation because it only relies on the max-norm guarantee . The intuition is that for any true Ω with bounded, ,
One can see from above that fast convergence of encourages feasibility of Ω, which is a necessary step for establishing consistency of the resulting M-estimator. Interested readers can refer to the proof of Theorem 4.1 for more details. Other possible methods for precision matrix recovery (e.g. graphical Lasso in [20], graphical Dantzig selector in [48] and graphical neighborhood selection in [35]) can be considered for post-ALPHA inference as well, but their convergence rate needs to be studied in a case-by-case fashion.
Theorem 4.1 shows that CLIME has a strong theoretical guarantee of convergence under different matrix norms. The rate of convergence has two parts: one corresponds to the minimax optimal rate [48], while the other is due to the error caused by estimating the unknown factors under various situations. The discussion at the end of Section 4.1 suggests that the latter error is often negligible.
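For completeness, a minimal sketch of the column-wise linear program (4.4) and the symmetrization step is given below, using scipy's generic LP solver; the function names, the variable splitting, and the choice of solver are our own illustrative assumptions, not the implementation of [9].

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(Sigma_check, j, lam):
    """Solve min ||w||_1 s.t. ||Sigma_check w - e_j||_inf <= lam via the split w = u - v, u, v >= 0."""
    p = Sigma_check.shape[0]
    e_j = np.zeros(p); e_j[j] = 1.0
    c = np.ones(2 * p)                                          # objective: sum(u) + sum(v) = ||w||_1
    A_ub = np.vstack([np.hstack([Sigma_check, -Sigma_check]),   #  Sigma (u - v) - e_j <= lam
                      np.hstack([-Sigma_check, Sigma_check])])  # -(Sigma (u - v) - e_j) <= lam
    b_ub = np.concatenate([lam + e_j, lam - e_j])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

def clime(Sigma_check, lam):
    """Column-wise CLIME followed by symmetrization (keep the entry with smaller magnitude)."""
    p = Sigma_check.shape[0]
    Omega = np.column_stack([clime_column(Sigma_check, j, lam) for j in range(p)])
    return np.where(np.abs(Omega) <= np.abs(Omega.T), Omega, Omega.T)
```

In practice, λ must be chosen large enough for the constraint set to be nonempty and is typically tuned by cross-validation.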
In addition, we numerically investigate how misspecification of the number of factors K will affect the precision matrix estimation in Section G.3 in the appendix.
5. Numerical studies
In this section, we first validate the established theoretical results through Monte Carlo simulations. Our purpose is to show that, after heterogeneity adjustment, the proposed aggregated covariance estimator approximates the oracle sample covariance ΣN well, thereby leading to accurate estimation of the true covariance matrix Σ and precision matrix Ω. We also compare the performance of PPCA and regular PCA for heterogeneity adjustment under different settings.
In addition, we analyze a real brain image dataset using the proposed procedure. The dataset to be analyzed is the ADHD-200 data [6]. It consists of rs-fMRI images of 608 subjects, of whom 465 are healthy and 143 are diagnosed with ADHD. We dropped subjects with missing values in our analysis. Following [39], we divided the whole brain into 264 regions of interest (ROI, p = 264), which are regarded as nodes in our graphical model. Each brain was scanned multiple times, with sample sizes ranging from 76 to 261 (76 ≤ ni ≤ 261). In each scan, we acquired the blood-oxygen-level dependent (BOLD) signal within each ROI. Note that subjects have different ages, genders, etc., which results in heterogeneity in the covariance structure of the data. We need to remove this unwanted heterogeneity; otherwise it will dilute or corrupt the true biological signal, i.e., the difference in the brain functional network between healthy people and ADHD patients.
5.1. Preliminary analysis
To apply our ALPHA framework, we first need to argue that the pervasiveness condition (Assumption 2.2) holds for the real dataset considered. This is done in Section G.2, together with further discussions on pervasiveness. We also collect the physical locations of the 264 regions as the external covariates. Ideally, we hope these covariates are pervasive in explaining the batch effect (Assumption 2.3) while bearing no association with the graph structure of ut. This is reasonably true: the level of batch effect is non-uniform over different locations of the brain when scanned in fMRI machines; furthermore, it has been widely acknowledged in biological studies that spatial adjacency does not necessarily imply brain functional connectivity.
To construct the covariates from the physical locations, we simply split the 264 regions into 10 clusters (J = 10) by hierarchical clustering (Ward’s minimum variance method) and use the categorical cluster indices as the covariates of the nodes. The clustering result is shown in Figure 2 and the spatial locations of the 264 regions are shown in Figure 6 in 10 different colors. Black (middle), green (left) and blue (right) roughly represent the frontal lobe; gray (middle), pink (left) and magenta (right) occupy the parietal lobe; red (left) and orange (right) are in the occipital lobe; finally, yellow (left) and navy (right) provide information about the temporal lobe.
Here J = 10 is only used to calibrate our synthetic model in the next subsection. In the real data analysis, we will choose J adaptively according to the heuristic guiding rule of the maximal eigen-gap discussed in Section 3.3. Note that since the covariate W is one-dimensional (d = 1) and discrete, the sieve basis functions are just the indicator functions 𝟙(w − 0.5 ≤ W < w + 0.5) for w = 1,…,10 (a construction sketched below). We use the same external covariates for all subjects in both the healthy and diseased groups.
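A minimal sketch of this covariate construction (Ward clustering of the ROI coordinates followed by the indicator basis) is shown below; the function name and defaults are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def location_basis(coords, n_clusters=10):
    """Build categorical covariates W_j and the indicator sieve basis from ROI coordinates.

    coords is a p x 3 array of physical locations; W_j is the cluster label of ROI j and
    Phi[j, w-1] = 1 if W_j = w, matching the indicator basis described above.
    """
    Z = linkage(coords, method="ward")                      # Ward's minimum variance clustering
    W = fcluster(Z, t=n_clusters, criterion="maxclust")     # labels in {1, ..., n_clusters}
    Phi = np.zeros((coords.shape[0], n_clusters))
    Phi[np.arange(coords.shape[0]), W - 1] = 1.0
    return W, Phi
```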
The next question is how to divide the subjects into the PCA class and the PPCA class based on whether the selected covariates explain the loadings effectively. We implemented the method given in Section 3.3 and found that 398 healthy subjects (85.6%) and 126 diseased subjects (88.1%) prefer PPCA over PCA, meaning that the physical locations indeed have explanatory power on the factor loadings of most subjects. We assigned these subjects to the PPCA class and the remaining subjects to the PCA class. Based on the class labels, we employed the corresponding method to estimate the number of factors and adjust the heterogeneity. We used Kmax = 3. The estimated numbers of factors for the two groups are summarized in Table 1.
Table 1. Numbers of subjects with each estimated number of factors (Kmax = 3).

| | 1 | 2 | 3 |
|---|---|---|---|
| Healthy | 253 | 148 | 64 |
| ADHD | 78 | 40 | 25 |
5.2. Synthetic datasets
In this simulation study, for stability, we use the first 15 subjects in the healthy group to calibrate the simulation models. We specify four asymptotic settings for our simulation studies:
Case 1: m = 500, ni = 10 for i = 1,…,m, p = 100, 200,…,600 and G(W) ≠ 0;
Case 2: m = 100, 200,…,1000, ni = 10 for i = 1,…,m, p = 264 and G(W) ≠ 0;
Case 3: m = 100, ni = 10, 20,…,100 for i = 1,…,m, p = 264 and G(W) ≠ 0;
Case 4: m = 20, 40,…,200, ni = 20, 40,…,200 for i = 1,…,m, p = 264 and G(W) = 0.
Here the last setting represents regime 1, where we expect PCA to work well when the number of subjects is of the order of the square root of the total sample size. The first three settings represent regime 2 with informative covariates; they present asymptotics with growing p, m and ni respectively. The details on model calibration and data generation can be found in Section G.1.
We first investigate the errors of estimating the covariance of ut in max-norm after applying PPCA or PCA for heterogeneity adjustment. We also compare them with the estimation errors obtained by naively pooling all the data together without any heterogeneity adjustment. However, the estimation error of the naively pooled sample covariance is too large to fit in the graph for the first three cases, so we do not plot it there. Denote the oracle sample covariance of ut by ΣN as before. The estimation errors, based on 100 simulations, under the four settings are presented in Figure 3.
In Case 1, m and ni are fixed while the dimension p increases. This setting highlights the advantages of Projected-PCA over regular PCA. From the left panel, we observe that an increase of dimensionality improves the performance of Projected-PCA. This is consistent with the rate we derived in the theory. In Case 2, ni and p are fixed while m increases. Both PPCA and PCA benefit from an increasing number of subjects. However, since ni is small, PPCA again outperforms. In Case 3, m and p are fixed while ni increases. Both methods achieve better estimation as ni increases, but more importantly, regular PCA outperforms PPCA when ni is large enough. This is again consistent with our theory. As illustrated in Section 4.1, when m is fixed, PCA attains the convergence rate , while PPCA only achieves , which is worse than PCA when p/log p = o(N). In Case 4, p is fixed, and both m and ni increase. Note that the covariates have no explanatory power at all, i.e., the pervasiveness condition in Assumption 2.3 does not hold, so PPCA is not applicable. Adjusting by PCA behaves much better, and PPCA is sometimes as bad as ‘nPCA’, which corresponds to no heterogeneity adjustment. This is not unexpected as we utilize noisy external covariates.
Now we focus on the estimation error of the precision matrix. We plug the aggregated covariance estimator (4.2), obtained from the data after adjusting for heterogeneity, into CLIME to get the estimator of Ω. In Figure 4, the estimation errors are depicted under the four asymptotic settings. From the plots we see that they share similar behavior with the covariance errors shown in Figure 3: in Case 1, ni is small, so it is advantageous to use PPCA, and PPCA behaves better as the dimension increases; in Case 2, both PPCA and PCA benefit from an increasing number of subjects and PPCA outperforms PCA; in Case 3, PCA outperforms PPCA when ni is large enough since m is fixed; in Case 4, the covariates have no explanatory power at all so PPCA does not make sense. In the first three cases, if we do not adjust the data heterogeneity, the errors will be too large to fit in the current scale.
We also present the ROC curves of our proposed methods in Figure 5, which is of interest to readers concerned with sparsity pattern recovery. The black dashed line is the 45 degree line, representing the performance of random guessing. It is obvious from those plots that heterogeneity adjustment greatly improves the sparsity recovery of the precision matrix. When the sample size of each subject is small, genuine pervasive covariates increase the power of PPCA, while if the sample size is relatively large, PCA is sufficiently good at recovering the graph structure. Also notice that in all cases the naive method without heterogeneity adjustment can still achieve a certain amount of power, but we can improve the performance dramatically by correcting the batch effects.
5.3. Brain image network data
We report the estimated graphs for both the healthy group and the ADHD patient group, with batch effects removed using our ALPHA framework, in this subsection. We took various sparsity levels of the networks from 1% to 5% (corresponding to the same set of λ’s for the two groups) and selected the common edges, which are stable with respect to tuning, to be depicted.
The brain networks produced by our proposed method are presented in Figure 6. Our method gives 90.7% identical edges for the two networks. However, if we ignore heterogeneity and naively pool the data from all subjects together, it generates 10.2% unshared edges, roughly 1% more than ALPHA produces. Therefore, by heterogeneity adjustment, we find less difference in brain functional networks between ADHD patients and healthy people. In addition, we investigate how those unshared edges are distributed across the 10 clusters. We summarize the total degree of unshared-edge vertices within each cluster in Table 2. As we can see, in the left occipital lobe (red) and the left parietal lobe (pink), there are significant differences in the functional connectivity structure between healthy people and patients, although in general the difference is weak. These are signs that ADHD is a complex disease that affects many regions of the brain. The general methodology we provide here could be valuable for further understanding the mechanism of the disease.
Table 2. Total degrees of unshared-edge vertices within each cluster.

| | red | orange | blue | green | yellow | navy | pink | black | magenta | gray |
|---|---|---|---|---|---|---|---|---|---|---|
| Healthy | 3 | 4 | 3 | 2 | 7 | 6 | 10 | 12 | 11 | 6 |
| ADHD | 9 | 6 | 7 | 5 | 12 | 5 | 6 | 15 | 9 | 10 |
6. Discussions
Heterogeneity is usually informed by domain knowledge of the dataset. In particular, it occurs with high chance when the data come from different sources or subgroups. In the brain image dataset we used in the numerical study, heterogeneity across patients can stem from differences in age, gender, etc. When it is less clear whether heterogeneity exists, we can calculate multiple summary statistics for all the subgroups and see whether they are significantly different. In the case of pervasive heterogeneity, we can test for it by the magnitude of the dominating eigenvalues of the covariance matrix in each subgroup. A systematic testing method for heterogeneity is important and we leave it as a future research topic. Note that even if all the subgroups are actually homogeneous, ALPHA does not hurt the statistical efficiency under appropriate scaling assumptions. Specifically, for the PCA-based ALPHA, we showed in Section 4.1 that as long as the number of subgroups $m = O(\sqrt{N/\log p})$, the aggregated covariance estimator enjoys the oracle max-norm rate. This means that given homogeneous data, when the number of data splits is not large, ALPHA yields the same statistical rate as the full-sample oracle estimator. For the PPCA-based ALPHA, the aggregated estimator enjoys the oracle rate when p/log p exceeds the order of N.
As we have seen, ALPHA is adaptive to factor structures and is flexible in incorporating external information. However, this advantage of PPCA is accompanied by more assumptions and the practical issue of selecting proper basis functions and their number in the sieve approximation. One contribution of the paper lies in the seamless integration of PCA and PPCA, which leverages effective external covariates. If no valuable covariates exist and the sample size is relatively large for each data source, we have shown that conventional PCA is still an effective tool.
Note that our framework is compatible with any statistical procedure that only requires an accurate covariance estimator as the input, like the CLIME method we illustrate in this work. The ALPHA procedure gives theoretical guarantees for the adjusted data and the aggregated covariance, which serve as foundations for establishing the statistical properties of the subsequent procedure. Besides, ALPHA has potential applications in regression analysis. If the residual terms are the true predictors for a response of interest, we can first apply ALPHA to extract the residuals before the regression procedure. For example, the residual BOLD signal we obtained by ALPHA in the brain functional network analysis (Section 5.3) is potentially useful in predicting whether a person has ADHD. This is a typical logistic regression problem based on the ALPHA adjustment. We leave the detailed study of combining ALPHA with regression models for future investigation. One recent work [16] has adopted a method similar to ALPHA that extracts residuals for model selection in high-dimensional regression.
Finally, we point out two current limitations of ALPHA. The first limitation lies in the pervasiveness assumption on the heterogeneity terms. More specifically, for each subgroup i, ALPHA requires the signal strength of the heterogeneous part to overwhelm the homogeneous residual part Ui so that PCA or PPCA can accurately estimate and remove it. Such a requirement can be violated in practice when the heterogeneous term has similar signal strength to the homogeneous term. Additionally, statistical methods that require more than the max-norm error guarantee, say in the general non-sparse situation, may for now be inappropriate for post-ALPHA inference.
Acknowledgments
This project was supported by National Science Foundation grants DMS-1206464, DMS-1406266 and 2R01-GM072611-12.
Appendix A: Algorithm for ALPHA
The pseudo code for the algorithm ALPHA is shown as follows.
Appendix B: A key lemma
Recall that we defined
(B.1) |
where we used notations such as and to denote the final estimators, which are and if PCA is used, and and if PPCA is used.
The following lemma holds for no matter whether PCA or PPCA is applied.
Lemma B.1. For any K by K matrix H such that , if log p = O(n),
where ; and furthermore
where .
The above lemma states that the error of estimating U by (or estimating by ) is decomposed into two parts. The first part is inevitable even when the factor matrix F in (3.1) is known in advance. The second part is caused by the uncertainty from estimating F. Since the true F is identifiable up to an orthonormal transformation H, we need to carefully choose H to bound the error Π (or Δ). We will provide explicit rates of convergence for those terms in the following two sections.
Proof. By definition of , . We first look at the convergence of . Obviously where
Since we have
Similarly , so
According to Lemma F.4 (i), ||UF/n ||max = OP (1) and noting both ||F||max and are , we conclude the result for easily.
Now we consider in the following.
So and it suffices to bound the two terms.
Decompose J1 by . Therefore,
since . Similar to J1, we decompose J2 only replacing with . According to Lemma F.4 (i), , hence . We then conclude that .
Now let us take a look at IV. where
By assumption, ||H||max ≤ ||H|| = OP (1). Simple decompositions of D1 gives
Since , we have
It is also not hard to show Under both Theorems C.1 and D.1 (replacing by for regime 1 and for regime 2), we can check the following relationship holds:
Therefore we have
□
Appendix C: Proof of Theorem 3.1
Recall that PCA estimates F by where the kth column of is the eigenvector of corresponding to the kth largest eigenvalue. By the definition of , we have
where K is a K by K diagonal matrix with top K eigenvalues of in descending order as diagonal elements. Define a K by K matrix H as in [17]:
It has been shown that are all OP(1).
The following lemma provides all the rates of convergences that are needed for downstream analysis.
Lemma C.1. Under Assumptions 2.1 and 2.2, we have and
(i) and ;
(ii) ;
(iii) ;
(iv) .
Combining the above results with Lemma B.1, we have
where and additionally
where . Thus we complete the proof for Theorem 3.1. We are left to check Lemma C.1, which is done in the following three subsections.
C.1. Convergence of factors
Recall . Substituting , we have,
(C.1) |
To bound , note that there is a constant C > 0, so that
Hence we need to bound since . The following lemma gives the stochastic bounds for each individual term.
Lemma C.2. (i) ,
(ii) ,
Proof. (i) Obviously according to Lemma F.1. attains the same rate. In addition, again according to Lemma F.1. So combining the three terms, we have We now refine the bound for Then the refined rate of is
(ii) Since by Lemma F.1,
is bounded by while is bounded by
which based on results of Lemma F.2 and (i) is □
The final rate of convergence for and are summarized as follows.
Proposition C.1.
(C.2) |
Proof. The results follow from Lemmas C.2. □
C.2. Rates of and
Note first that the two matrices under consideration are both K by K, so we do not lose any rate by bounding them with their Frobenius norms.
Let us first find the rate for . Basically we need to bound for i = 1, 2, 3. Firstly
Since and by Lemma F.1, we have Secondly,
Finally,
So combining three terms we have
Now we bound . Since , we have
Therefore has the same rate since . So
C.3. Rate of
In order to study the rate of , we essentially need to bound for i = 1, 2, 3. We handle each term separately.
By Lemma F.5, Therefore,
From bounding , the last term has rate
So combining three terms, we conclude
Appendix D: Proof of Theorem 3.2
Recall that by the definition of , we have
where K is a K × K diagonal matrix with the first K largest eigenvalues of in descending order as its diagonal elements. Define the K by K matrix H as in [18]:
It has been shown that , and , are all OP(1). Here we remind the reader that, although H and K are different from those in regime 1 defined in the previous section, they play essentially the same roles (and thus carry the same notation).
The following lemma provides all the rates of convergences that are needed for downstream analysis.
Lemma D.1. Choose and assume where Under Assumptions 2.1, 2.3, 2.4 and 3.1, we have and
(i) and
(ii)
(iii)
(iv)
Combining the above lemma with Lemma B.1, we obtain
where and
where if there exists C s.t. νp > C/n. We choose to keep terms here although it makes a long presentation of the rate.
Thus we complete the proof for Theorem 3.2. We are left to check Lemma D.1, which is done in the following three subsections.
D.1. Convergence of factors
Recall . Substituting , we have,
(D.1) |
where Ai,i ≤ 3 has nothing to do with R(W) and Γ:
Ai, 3 ≤ i ≤ 8 takes care of terms involving R(W):
the remaining are terms involving Γ:
To bound , as in Theorem C.1 we only need to bound for i = 1,…,15 since again we have The following lemma gives the rate for each term.
Lemma D.2. (i) ,
(ii) ,
(iii) and ,
(iv) and ,
(v) and ,
(vi)
Proof. (i) Because , By Lemma F.3 and F.4, and
Hence
(ii) We have By Lemma F.3 and F.4, and By Assumption 3.1, Note the fact that for matrix , , ,. So
(iii) Note that , and Hence we have Thus
Similarly, attains the same rate of convergence.
In addition, notice A9, A10 have similar representation as A4, A5. The only difference is to replace R by Γ. It is not hard to see Therefore .
(iv) Note that
and Hence
A11 has similar representation as A6. Since
we have
(v) According to Lemma F.4, Thus
since The rate of convergence for A8 can be bounded in the same way. So do A12 and A13. Given that we have .
(vi) Obviously, and . We conclude . Same bound holds for A15. □
The final rate of convergence for and are summarized as follows.
Proposition D.1. Choose and assume and νp = O(1),
(D.2) |
Proof. The max norm result follows from Lemmas D.2 and (D.1), while the Frobenius norm result has been shown in [18]. □
D.2. Rates of and
Note first that the two matrices under consideration are both K by K, so we do not lose any rate by bounding them with their Frobenius norms.
It has been proved in [18] that . By the choice of J, the last term vanishes. So
[18] also showed that . Since and are both , we easily show since
D.3. Rate of
By (D.1), in order to bound we essentially need to bound for . We do not go into the details of each term again as in Lemma D.2. However, we point out the differences here. All the Ai fall into two types: those starting with F and those starting with U.
If a term Ai starts with F, say Ai = FQ, in Lemma D.2, we bound in using . Now we use bound so that we obtain all related rates by just changing rate to .
Terms starting with U includes Ai, i = 2,3,8,13. In Lemma D.2, we bound , i = 3,8,13 using while we bound using . Correspondingly now we need to control and separately to update the rates. The derivation is relegated to Lemma F.5. We have and .
So we replace the corresponding terms in Lemma D.2. It is not hard to see the dominating term is . Therefore, has the same rate.
Appendix E: Proof of Theorem 4.1
Proof. Denote the oracle empirical covariance matrix as
As in [9] the upper bound on is obtained by proving
(E.1) |
Once the two bounds are established, we proceed by observing
and then it readily follows that if ,
where the first term of the last inequality uses the constraint of (4.4) while the optimality condition of (4.4) is applied to bound by . So it remains to find in (E.1). Since ,, so we just need to bound and . Obviously,
We have shown in (4.3) that given by (4.2) attains the rate . Thus . A similar proof as in [9] can also reach error bounds under and , which we omit. The proof is now complete. □
Appendix F: Technical lemmas
Lemma F.1. (i) ,
(ii) ,
(iii)
Proof. We simply apply Markov inequality to get the rates.
since
□
Lemma F.2. (i) .
(ii) ,
(iii)
Proof. (i) where λk is the kth column of Λ. Since is mean zero sub-Gaussian with variance proxy , we have .
(ii) . We need to bound each term separately. The second term is bounded by the upper tail bound of the Hanson-Wright inequality for sub-Gaussian vectors [24, 41], i.e.
Choose s = log n and apply union bound, we have . Then we deal with the first term. By Chernoff bound,
where . [24] showed that
For , the right hand side is less than exp(3tr(Σ)η/2) ≤ exp(Cpη). Choose η = Cθ2, we have
Minimizing the right hand side by choosing θ = s/(2Cp), it is easy to check that . So we conclude that .
(iii) Let be the kth column of F. where are U and canceling the tth column and element respectively. From (ii) we know the second term is of order . Define subGaussian , which is independent with ut. Thus
where . Similar to (ii), we choose η = Cθ2n here. It is not hard to see . Thus . □
Lemma F.3.(i) .
(ii) .
(iii) .
Proof. These results can be found in the paper of Fan, Liao and Wang (2014), but the conditions used there are slightly different from ours. In particular, we allow no time (sample) dependence and only require bounded instead of . By the Markov inequality, it is sufficient to show that the expected value of each term attains the corresponding rate of convergence.
and are both O(np) following the same proof as above. Thus the proof is complete. □
Lemma F.4. (i)
(ii) .
(iii) .
Proof. (i) It is not hard to see . The detailed proof by Chernoff bound is given in the following. By union bound and Chernoff bound, we have
The expectation is calculated by first conditioning on F,
where the second equality uses the sub-Gaussianity of ujt and the last inequality is from and . Therefore, choosing , we have
Thus .
(ii) , where . Consider the tail probability condition on W:
The right hand side can be further bounded by
Choosing θ to minimize the upper bound and taking expectations with respect to W, we obtain
Finally choose t ≍ , the tail probability is arbitrarily small with a proper constant. So . The second part of the results follows similarly. Note and the first term dominates. So the same derivation gives
where since the eigenvalues of are assumed to be bounded almost surely. Hence, .
(iii) . Using Chernoff bound again, we get
Since , the right hand side is easy to bound by first conditioning on F.
Therefore, choosing , we have
So we conclude . By similar derivation as in (ii), we also have and are both of order . □
Lemma F.5. (i) ,
(ii) and .
Proof. (i) . The second term is . So it suffices to focus on the first term. Let and so that Var(vt) = I. Write , so we have . Also denote . Thus and .
(F.1) |
where and are two unit vectors of dimension p. We will bound the right hand side with arbitrary unit vectors and .
Note that and . By Bernstein inequality, we have for constant C > 0,
Choose in (F.1), we can easily show that the exception probability is small as long as C is large enough. Therefore, noting ,. Finally .
(ii) The rates of and can be similarly derived as (i). Denote , so
Denoting the kth column of Φ(W)B by (ΦB)k, we have
where we use max . □
Appendix G: More details on synthetic data analysis
G.1. Model calibration and data generation
We calibrate (estimate) the 264 by 264 covariance matrix of ut by applying our proposed method to the data in the healthy group. Plugging it into the CLIME solver delivers a sparse precision matrix Ω, which is taken as the truth in the simulation. Note that after the regularization in CLIME, Ω−1 is not the same as the input covariance estimate, and we set the true covariance Σ = Ω−1. To obtain the covariance matrix used in setting 1, we also calibrate, using the same method, a sub-model that involves only the first 100 regions. We then copy this 100 × 100 matrix multiple times to form a p × p block-diagonal matrix and use it for simulations in setting 1. We describe how we calibrate these ‘true models’ and generate data from them as follows.
- (External covariates) For each j ≤ p, generate the external covariate Wj i.i.d. from the multinomial distribution with P(Wj = s) = ws, s ≤ 10, where the weights ws are calibrated from the hierarchical clustering results of the real data.
- (Calibration) For the first 15 healthy subjects, obtain estimators for F, B and Γ by PPCA according to [18]. Use the rows of the estimated factors to fit a stationary VAR model ft = Aft−1 + ∈t, where ∈t ∼ N(0, Σ∈), and obtain the estimated VAR parameters.
- (Simulation) For each subject i ≤ m, pick one of the 15 calibrated models and their associated parameters at random and do the following.
  (a) Generate the entries of Γi i.i.d. with mean zero and variance equal to the sample variance of all entries of the calibrated Γ. For the first three settings, compute the ‘true’ loading matrix Λi = Gi(W) + Γi. For the last setting, set Λi = Γi since G(W) = 0.
  (b) Generate factors from the VAR model ft = Aft−1 + ∈t, where the parameters A and Σ∈ are taken from the fitted values in the calibration step (see the sketch below).
  (c) Finally, generate the observed data Xi = ΛiFi′ + Ui, where each column of Ui is randomly sampled from N(0, Ω−1) and Ω has been calibrated by the CLIME solver as described at the beginning of this section.
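A minimal sketch of the factor generation in step (b), assuming a stable VAR(1) with spectral radius below one; the function name and the burn-in length are illustrative choices.

```python
import numpy as np

def simulate_var_factors(A, Sigma_eps, n, rng, burn_in=100):
    """Generate n factors from f_t = A f_{t-1} + eps_t with eps_t ~ N(0, Sigma_eps)."""
    K = A.shape[0]
    f = np.zeros(K)
    out = np.empty((n, K))
    for t in range(burn_in + n):                 # burn-in draws approximate stationarity
        f = A @ f + rng.multivariate_normal(np.zeros(K), Sigma_eps)
        if t >= burn_in:
            out[t - burn_in] = f
    return out
```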
G.2. More on pervasiveness
In this subsection, we discuss the pervasiveness assumption, which requires the spiked eigenvalues to grow with order p, and present the numerical performance of ALPHA for different levels of cmin and cmax (defined in Assumption 2.3). This gives the reader a rough idea of how the spikiness (i.e., the constant in front of the rate) affects the performance. We particularly consider the cases where cmax is small or cmin is large. As a first step, we verify that the real data are consistent with the pervasiveness assumption.
Denote the maximum and minimum eigenvalues of the matrix by λmax and λmin respectively, and denote the maximum eigenvalue of the matrix by . We first investigate the magnitudes of λmin, λmax and derived from the real data. Following exactly the same data generation procedure as in the original simulation study, we randomly generate 1,000 subjects. We find that λmax has mean 15.352 and standard deviation 4.918, λmin has mean 10.069 and standard deviation 5.416, and has mean 1.317 and standard deviation 0.119. We also investigate the signal-to-noise ratio , which has mean 7.711 and standard deviation 4.230. Therefore, our real data demonstrate a spiked covariance structure, although the spikes are not extremely pronounced.
We then manipulate the data generation process to create two modified cases. One multiplies the original loading matrix Λ by 3, called Modified (a), while the other divides Λ by 3, called Modified (b). Note that in the case of Modified (b), λmin will be 1/9 of the original λmin and thus smaller than , so we do not see a clear eigen-gap in this case. Table 3 compares the performance of recovering the precision matrix Ω under the original and modified settings when ni = 100.
Table 3. Errors in recovering the precision matrix Ω under the original and modified loading matrices (ni = 100). The three numeric columns report the estimation error of Ω under three different error measures, in the same order as in Table 5 (the Original row coincides with the Ǩ row of Table 5 for ni = 100).
Original | 0.564 | 3.445 | 1.188
Modified (a) | 0.524 | 3.052 | 1.066
Modified (b) | 0.749 | 4.914 | 1.719
We can see from Table 3 that ALPHA performs slightly better under Modified (a) than under the original setting. Note that increasing cmin makes the heterogeneity part more spiky, which allows PCA or PPCA to distinguish the spiky heterogeneity term more easily. In contrast, decreasing cmax makes the originally spiky heterogeneity term harder to detect, so we tend to miss several heterogeneity factors when extracting them. Therefore, under Modified (b) the estimation error becomes significantly larger than in the original case.
G.3. Sensitivity analysis on the number of factors
In this section, we study through simulations how the estimated number of factors affects the recovery of the Gaussian graphical model. Correct specification of the number of factors is critical to the validity of our ALPHA method, so we first assess the performance of the PCA-based and PPCA-based estimators of the number of factors on our simulated datasets. Recall that both are eigenvalue-ratio estimators, the PPCA version being computed after applying the projection operator P defined in (3.5) in the main text. The final estimator of the number of factors, denoted by Ǩ, comes from the heuristic strategy we developed for choosing between PCA and PPCA: we choose PCA if it yields a larger eigen-ratio between the spiked and non-spiked parts of the covariance than PPCA does, and choose PPCA otherwise. The intuition is that we favor the method that better separates the spiked part of the covariance from the non-spiked part.
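For intuition only, the sketch below implements a generic eigenvalue-ratio estimator in the spirit of [1] together with the PCA/PPCA selection rule; the estimators actually used in the paper are defined through the projection operator P of (3.5) in the main text and may differ in detail. `Phi` denotes a hypothetical p × d sieve basis built from the covariates W.

```python
# A hedged sketch of an eigenvalue-ratio factor-number estimator and of the
# heuristic that picks between PCA and Projected-PCA (PPCA).
import numpy as np

def eigen_ratio_K(S, k_max=10):
    """Return (k_hat, leading ratio): maximize lambda_k / lambda_{k+1}."""
    eigs = np.sort(np.linalg.eigvalsh(S))[::-1]
    ratios = eigs[:k_max] / eigs[1:k_max + 1]
    k_hat = int(np.argmax(ratios)) + 1
    return k_hat, ratios[k_hat - 1]

def choose_K(X, Phi, k_max=10):
    """X: n x p (demeaned) data; Phi: p x d sieve basis of the covariates W."""
    n = X.shape[0]
    S_pca = X.T @ X / n                                  # raw sample covariance
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)        # projection onto span(Phi)
    XP = X @ P                                           # data projected onto the covariate space
    S_ppca = XP.T @ XP / n
    k_pca, r_pca = eigen_ratio_K(S_pca, k_max)
    k_ppca, r_ppca = eigen_ratio_K(S_ppca, k_max)
    # Favor the method with the larger leading eigen-ratio (the combined heuristic).
    return k_pca if r_pca >= r_ppca else k_ppca
```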
Analogously to the simulation study in our paper, we generate BOLD data for m = 1,000 subjects from the calibrated 'true' models. We investigate the accuracy of the PCA-based and PPCA-based estimators and of Ǩ for two cases: (i) ni = 20, p = 264 and (ii) ni = 100, p = 264; the results are presented in Table 4. As the table shows, when ni is small the PPCA-based estimator outperforms the PCA-based one, whereas when ni is large the PCA-based estimator is better. Note also that our heuristic estimator Ǩ performs well in both the large- and small-ni cases.
Table 4. Error rates of the estimated number of factors (TotErr: total error rate; OverEst: rate of overestimation; UnderEst: rate of underestimation).
 | ni = 20 | | | ni = 100 | |
---|---|---|---|---|---|---
 | TotErr | OverEst | UnderEst | TotErr | OverEst | UnderEst
PCA-based | 38.7% | 0% | 38.7% | 0.7% | 0% | 0.7%
PPCA-based | 29.7% | 6.8% | 22.9% | 4.7% | 2.7% | 2.0%
Ǩ | 29.7% | 6.8% | 22.9% | 3.5% | 2.3% | 1.2%
Given the good performance of our proposed estimators of the factor number, we now artificially enlarge this estimation error and examine how it affects the Gaussian graphical model analysis. Let η be a random perturbation with P(η = 0) = 1/2, P(η = 1) = 1/3 and P(η = 2) = 1/6. Define K+ := K + η and K− := max(K − η, 0), where K is the true number of factors. As the notation indicates, K+ overestimates the factor number while K− underestimates it. Since P(η ≠ 0) = 1/2, their estimation accuracy is only 50%, worse than that of the PCA-based and PPCA-based estimators presented above. We use K+ and K− in turn as the estimator of the number of factors to recover the precision matrix of U, and compare their performance with that of Ǩ. A minimal sketch of this perturbation is given below; the results are presented in Table 5.
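The perturbation can be drawn as follows (a hypothetical helper, with one draw of η shared by K+ and K− for simplicity):

```python
# A minimal sketch of the artificial over-/under-estimators K_plus and K_minus.
import numpy as np

rng = np.random.default_rng(0)

def perturbed_K(K_true):
    eta = rng.choice([0, 1, 2], p=[1/2, 1/3, 1/6])  # P(eta=0)=1/2, P(eta=1)=1/3, P(eta=2)=1/6
    return K_true + eta, max(K_true - eta, 0)       # (K_plus, K_minus)
```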
Table 5. Errors in recovering the precision matrix Ω with different estimates of the number of factors. For each of ni = 20 and ni = 100, the three numeric columns report the estimation error of Ω under three different error measures.
 | ni = 20 | | | ni = 100 | |
---|---|---|---|---|---|---
Oracle | 0.687 | 4.131 | 1.311 | 0.335 | 2.018 | 0.695
Ko | 0.873 | 2.824 | 1.351 | 0.536 | 2.006 | 2.017
Ǩ | 1.156 | 8.581 | 2.950 | 0.564 | 3.445 | 1.188
K+ | 0.771 | 3.27 | 1.49 | 0.586 | 2.154 | 1.074
K− | 1.618 | 11.384 | 4.062 | 1.84 | 15.133 | 4.941
"Oracle" above means that we directly use the generated noise U to compute its sample covariance and plug it into CLIME to recover the precision matrix. Ko means that we know the true number of pervasive factors and use PCA or Projected-PCA (choosing the method that yields the larger eigen-ratio) to adjust for the factors. As the table shows, K+ is nearly as good as Ko, which means that overestimating the number of factors does not hurt the recovery accuracy. In contrast, underestimating the number of factors seriously increases the estimation error of Ω, as shown by K−, because the unadjusted pervasive factors heavily corrupt the covariance of U. Nevertheless, both K+ and K− use partial information about the true number of factors. In comparison, our procedure Ǩ, without any prior knowledge of the number of factors, performs well in recovering Ω.
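As a rough illustration of the Ko pipeline (not the exact implementation used in the paper), one can remove the top-K principal components from each subject's demeaned data, pool the residuals across subjects, and feed their sample covariance to a sparse precision estimator; `graphical_lasso` from scikit-learn is again used as a stand-in for CLIME.

```python
# A hedged sketch: PCA-based factor adjustment with a known K, followed by sparse
# precision estimation on the pooled residuals (GraphicalLasso in place of CLIME).
import numpy as np
from sklearn.covariance import graphical_lasso

def adjust_factors_pca(X, K):
    """Remove the best rank-K approximation (estimated common component) from X (n_i x p)."""
    X = X - X.mean(axis=0, keepdims=True)
    U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
    common = (U_svd[:, :K] * s[:K]) @ Vt[:K]          # rank-K factor component
    return X - common

def recover_precision(datasets, K_list, alpha=0.1):
    """datasets: list of n_i x p arrays; K_list: number of factors used for each subject."""
    residuals = np.vstack([adjust_factors_pca(X, K) for X, K in zip(datasets, K_list)])
    S = residuals.T @ residuals / residuals.shape[0]  # pooled sample covariance of the residuals
    _, omega_hat = graphical_lasso(S, alpha=alpha)    # returns (covariance, precision)
    return omega_hat
```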
Contributor Information
Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, jqfan@princeton.edu.
Han Liu, Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA, hanliu@northwestern.edu.
Weichen Wang, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, nickweichwang@gmail.com.
Ziwei Zhu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, zzw9348ustc@gmail.com.
References
- [1] Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227. MR3064065
- [2] Alter O, Brown PO and Botstein D (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences 97 10101–10106.
- [3] Bai J (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171. MR1956857
- [4] Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221. MR1926259
- [5] Bai J and Ng S (2013). Principal components estimation and identification of static factors. Journal of Econometrics 176 18–29. MR3067022
- [6] Biswal BB, Mennes M, Zuo X-N, Gohel S, Kelly C, Smith SM, Beckmann CF, Adelstein JS, Buckner RL and Colcombe S (2010). Toward discovery science of human brain function. Proceedings of the National Academy of Sciences 107 4734–4739.
- [7] Cai TT, Li H, Liu W and Xie J (2012). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika ass058. MR3034329
- [8] Cai TT, Li H, Liu W and Xie J (2015). Joint estimation of multiple high-dimensional precision matrices. The Annals of Statistics 38 2118–2144. MR3497754
- [9] Cai TT, Liu W and Luo X (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594–607. MR2847973
- [10] Cai TT, Ma Z and Wu Y (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics 41 3074–3110. MR3161458
- [11] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L and Liu C (2011). Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS ONE 6 e17238.
- [12] Chen X (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics 6 5549–5632.
- [13] Connor G, Hagmann M and Linton O (2012). Efficient semiparametric estimation of the Fama–French model and extensions. Econometrica 80 713–754. MR2951947
- [14] Connor G and Linton O (2007). Semiparametric estimation of a characteristic-based factor model of common stock returns. Journal of Empirical Finance 14 694–717.
- [15] Danaher P, Wang P and Witten DM (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 373–397. MR3164871
- [16] Fan J, Ke Y and Wang K (2016). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490.
- [17] Fan J, Liao Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680. MR3091653
- [18] Fan J, Liao Y and Wang W (2016). Projected principal component analysis in factor models. The Annals of Statistics 44 219–254. MR3449767
- [19] Fan J, Rigollet P and Wang W (2015). Estimation of functionals of sparse covariance matrices. The Annals of Statistics 43 2706. MR3405609
- [20] Friedman J, Hastie T and Tibshirani R (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
- [21] Guo J, Cheng J, Levina E, Michailidis G and Zhu J (2015). Estimating heterogeneous graphical models for discrete data with an application to roll call voting. The Annals of Applied Statistics 9 821. MR3371337
- [22] Guo J, Levina E, Michailidis G and Zhu J (2011). Joint estimation of multiple graphical models. Biometrika asq060. MR2804206
- [23] Higgins J, Thompson SG and Spiegelhalter DJ (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172 137–159. MR2655609
- [24] Hsu D, Kakade SM and Zhang T (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17. MR2994877
- [25] Johnson WE, Li C and Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 118–127.
- [26] Johnstone IM and Lu AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 682–693. MR2751448
- [27] Lam C and Fan J (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics 37 4254. MR2572459
- [28] Lam C and Yao Q (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40 694–726. MR2933663
- [29] Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K and Irizarry RA (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11 733–739.
- [30] Leek JT and Storey JD (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3 1724–1735.
- [31] Liu H, Han F and Zhang C-H (2012). Transelliptical graphical models. In Advances in Neural Information Processing Systems.
- [32] Liu H, Lafferty J and Wasserman L (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research 10 2295–2328. MR2563983
- [33] Loh P-L and Wainwright MJ (2013). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics 41 3022–3049. MR3161456
- [34] Lorentz GG (2005). Approximation of Functions, vol. 322. American Mathematical Society. MR0213785
- [35] Meinshausen N and Bühlmann P (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 1436–1462. MR2278363
- [36] Negahban S and Wainwright MJ (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics 1069–1097. MR2816348
- [37] Onatski A (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168 244–258. MR2923766
- [38] Paul D (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617. MR2399865
- [39] Power JD, Cohen AL, Nelson SM, Wig GS, Barnes KA, Church JA, Vogel AC, Laumann TO, Miezin FM and Schlaggar BL (2011). Functional network organization of the human brain. Neuron 72 665–678.
- [40] Ravikumar P, Wainwright MJ, Raskutti G and Yu B (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5 935–980. MR2836766
- [41] Rudelson M and Vershynin R (2013). Hanson–Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18. MR3125258
- [42] Shen X, Pan W and Zhu Y (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107 223–232. MR2949354
- [43] Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ and Clarke RB (2008). The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis. BMC Medical Genomics 1 42.
- [44] Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 1167–1179. MR1951271
- [45] Verbeke G and Lesaffre E (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association 91 217–221.
- [46] Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. The Annals of Statistics 45 1342–1374. MR3662457
- [47] Yang S, Lu Z, Shen X, Wonka P and Ye J (2015). Fused multiple graphical lasso. SIAM Journal on Optimization 25 916–943. MR3343365
- [48] Yuan M (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research 11 2261–2286. MR2719856
- [49] Yuan M and Lin Y (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35. MR2367824