Published in final edited form as: Electron J Stat. 2018 Dec 5;12(2):3908–3952. doi: 10.1214/18-EJS1466

Heterogeneity adjustment with applications to graphical model inference

Jianqing Fan 1, Han Liu 2, Weichen Wang 3, Ziwei Zhu 4

Abstract

Heterogeneity is an unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the 'Blessing of Dimensionality'. As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.

Keywords: Multiple sourcing, batch effect, semiparametric factor model, principal component analysis, brain image network

1. Introduction

Aggregating and analyzing heterogeneous data is one of the most fundamental challenges in scientific data analysis. In particular, the intrinsic heterogeneity across multiple data sources violates the ideal ‘independent and identically distributed’ sampling assumption and may produce misleading results if it is ignored. For example, in genomics, data heterogeneity is ubiquitous and referred to as either ‘batch effect’ or ‘lab effect’. As characterized in [29], microarray gene expression data obtained from different labs on different processing dates may contain systematic variability. Furthermore, [30] pointed out that heterogeneity across multiple data sources may be caused by unobserved factors that have confounding effects on the variables of interest, generating spurious signals. In finance, it is also known that asset returns are driven by varying market regimes and economy status, which can be regarded as a temporal batch effect. Therefore, to properly analyze data aggregated from multiple sources, we need to carefully model and adjust the unwanted variations.

Modeling and estimating the heterogeneity effect is challenging for two reasons. (i) Typically, we can only access a limited number of samples from an individual group, given the high cost of biological experiments, technological constraints, or fast economic regime switching. (ii) The dimensionality can be much larger than the total aggregated number of samples. The past decade has witnessed the development of many methods for adjusting batch effects in high-throughput genomics data. See, for example, [43], [2], [30], and [25]. Though progress has been made, most of the aforementioned papers focus on the practical side and none of them has a systematic theoretical justification. In fact, most of these methods are developed in a case-by-case fashion and are only applicable to certain problem domains. Thus, a gap still exists between practice and theory.

To bridge this gap, we propose a generic theoretical framework to model, estimate, and adjust heterogeneity across multiple datasets. Formally, we assume the data come from m different sources: the ith data source contributes $n_i$ samples, each having p measurements such as the gene expressions of an individual or the stock returns of a day. To explicitly model heterogeneity, we assume that batch-specific latent factors $f_t^i$ influence the observed data $X_{jt}^i$ in batch i (j indexes variables; t indexes samples) as in the approximate factor model:

X_{jt}^i = \lambda_j^{i\prime} f_t^i + u_{jt}^i, \qquad 1 \le j \le p,\; 1 \le t \le n_i,\; 1 \le i \le m, \qquad (1.1)

where $\lambda_j^i$ is an unknown factor loading vector for variable $j$ and $u_{jt}^i$ is the true uncorrupted signal. We consider random loadings $\lambda_j^i$. The linear term $\lambda_j^{i\prime} f_t^i$ models the heterogeneity effect. We assume that $f_t^i$ is independent of $u_{jt}^i$ and that $u_t^i = (u_{1t}^i, \dots, u_{pt}^i)'$ shares the same common distribution, with mean 0 and covariance $\Sigma_{p\times p}$, across all data sources. In matrix form, (1.1) can be written as

X^i = \Lambda^i F^{i\prime} + U^i, \qquad (1.2)

where $X^i$ is the $p \times n_i$ data matrix of the $i$th batch, $\Lambda^i$ is a $p \times K_i$ factor loading matrix with $\lambda_j^{i\prime}$ in its $j$th row, $F^i$ is an $n_i \times K_i$ factor matrix and $U^i$ is a signal matrix of dimension $p \times n_i$. We allow the number of latent factors $K_i$ to depend on the batch $i$. We emphasize that within one batch our model is homogeneous. Heterogeneity in this paper refers to the fact that the batch-effect terms $\{\Lambda^i F^{i\prime}\}_{i=1}^m$ differ across the groups $i = 1,\dots,m$; these are the unwanted variations in our study.

To see more clearly how model (1.2) characterizes the heterogeneity, note that for the $t$th sample $X_t^i$, the $t$th column of $X^i$,

\operatorname{var}(X_t^i) = \Lambda^i \operatorname{var}(f_t^i) \Lambda^{i\prime} + \Sigma. \qquad (1.3)

Therefore, the heterogeneity is carried by the low-rank component $\Lambda^i \operatorname{var}(f_t^i) \Lambda^{i\prime}$ in the population covariance matrix of $X_t^i$. We clarify that, since we assume both $F^i$ and $U^i$ have mean zero, the heterogeneity discussed in this paper concerns the covariance structure, as shown above, rather than the mean structure. In addition, our model differs from the random/mixed effect regression models studied in the literature [45, 23, 11] in that ours is a factor model with no observed factors, whereas the mixed/random effect model is a regression model that requires covariate matrices to estimate the batch-specific term.
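To make the model concrete, the following minimal numpy sketch (with arbitrary, made-up dimensions, not the paper's simulation design) generates data from (1.2) with a shared covariance Σ and batch-specific low-rank heterogeneity terms:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n_i, m = 50, 2, 30, 3          # illustrative dimensions only

# shared AR(1)-type covariance of the homogeneous signal u_t
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

batches = []
for i in range(m):
    Lambda_i = rng.normal(size=(p, K))            # batch-specific loadings Lambda^i
    F_i = rng.normal(size=(n_i, K))               # latent factors F^i (n_i x K)
    U_i = rng.multivariate_normal(np.zeros(p), Sigma, size=n_i).T  # p x n_i signal
    X_i = Lambda_i @ F_i.T + U_i                  # model (1.2): X^i = Lambda^i F^i' + U^i
    batches.append(X_i)

# As in (1.3), var(X_t^i) = Lambda^i var(f_t^i) Lambda^i' + Sigma:
# a low-rank heterogeneous part plus the common covariance Sigma.
```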

Under a pervasiveness assumption, the heterogeneity component can be estimated by directly applying principal component analysis (PCA) or Projected-PCA (PPCA), the latter being more accurate when there are sufficiently informative covariates $W^i$ [18]. Let $\widehat\Lambda^i \widehat F^{i\prime}$ be the estimated heterogeneity component and $\widehat U^i = X^i - \widehat\Lambda^i \widehat F^{i\prime}$ the heterogeneity-adjusted signal, which can be treated as homogeneous across different datasets and thus can be combined for downstream statistical analysis. This whole framework of heterogeneity adjustment is termed ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) and is shown schematically in Figure 1.

Fig 1.

Schematic illustration of ALPHA: depending on whether we can find sufficiently informative covariates W, we implement principal component analysis (PCA) or Projected-PCA (PPCA) (labeled M1 and M2, respectively) to remove the heterogeneity effect ΛF′ for each batch of data. This decision is made adaptively by a heuristic method. After removing the unwanted variations, the homogeneous data $\{U^i\}_{i=1}^m$ are aggregated for further analysis.

The proposed ALPHA framework is fully generic and applicable to almost all kinds of multivariate analysis of the combined, heterogeneity-adjusted datasets. As an illustrative example, we focus in this paper on Gaussian graphical model inference based on multiple datasets. The Gaussian graphical model is a powerful tool to explore complex dependence structures among the variables X = (X_1,…,X_p)′. The sparsity pattern of the precision matrix $\Omega = \Sigma^{-1}$ encodes an undirected graph G = (V, E), where V consists of p vertices corresponding to the p variables in X and E describes their dependence relationships. Specifically, $V_i$ and $V_j$ are linked by an edge if and only if $\Omega_{ij} \ne 0$ (the (i,j)th element of Ω), meaning that $X_i$ and $X_j$ are dependent conditional on the rest of the variables. For heterogeneous data across m data sources, we need to first adjust for heterogeneity using the ALPHA framework. The idea of covariate-adjusted precision matrix estimation has been studied by [7], but they assumed observed factors and no heterogeneity issue, i.e., m = 1.

A significant amount of literature has focused on estimating the precision matrix Ω for graphical models with homogeneous data. [49] and [20] developed the graphical Lasso method using the L1 penalty, and [27] and [42] used non-convex penalties. Furthermore, [40] and [33] studied the theoretical properties under different assumptions. Estimating Ω can be equivalently reformulated as a set of node-wise sparse linear regressions that utilize the Lasso or Dantzig selector for each node [35, 48]. To relax the assumption of Gaussian data, [32] and [31] extended the graphical model to the semiparametric Gaussian copula and transelliptical families. Via the ALPHA framework, we can combine the adjusted data $\widehat U^i$ to construct an estimator of the precision matrix Ω by the above methods. Recent work also focuses on joint estimation of multiple Gaussian or discrete graphical models which share some common structure [22, 15, 47, 8, 21]; these works are concerned with both the commonality and the individual uniqueness of the graphs. In comparison, ALPHA places more emphasis on heterogeneity-adjusted aggregation for a single graph.

The rest of the paper is organized as follows. Section 2 lays out a basic problem setup and necessary assumptions. We model the heterogeneity by a semiparametric factor model. Section 3 introduces the ALPHA methodology for heterogeneity adjustment. Two main methods PCA and PPCA will be introduced for adjusting the factor effects under different regimes. A guiding rule of thumb is also proposed to determine which method is more appropriate. The heterogeneity-adjusted data will be combined to provide valid graph estimation in Section 4. The CLIME method of [9] is applied for precision matrix estimation. Synthetic and real data analyses are carried out to demonstrate the proposed framework in Section 5. Section 6 contains further discussions and all the proofs are relegated to the appendix.

2. Problem setup

To use the external covariate information more efficiently in removing the heterogeneity effect, we first present a semiparametric factor model. Then, based on whether the collected external covariates have explanatory power for the factor loadings, we discuss two different regimes in which PCA or PPCA should be used. We state the conditions under which these methods can be formally justified.

2.1. Semiparametric factor model

We assume that for subgroup i we have d external covariates $W_j^i = (W_{j1}^i,\dots,W_{jd}^i)'$ for variable j. For stock returns, these can be attributes of a firm; in brain imaging, they can be the physical locations of voxels. We assume that these covariates have some explanatory power on the loading parameters $\lambda_j^i$ in (1.1), so that the loading can be further modeled as $\lambda_j^i = g^i(W_j^i) + \gamma_j^i$, where $g^i(\cdot)$ captures the external covariate effect on $\lambda_j^i$ and $\gamma_j^i$ is the part that cannot be explained by the covariates. Thus, model (1.1) can be written as

X_{jt}^i = \lambda_j^{i\prime} f_t^i + u_{jt}^i = (g^i(W_j^i) + \gamma_j^i)' f_t^i + u_{jt}^i. \qquad (2.1)

Model (2.1) imposes little restriction. If $W_j^i$ is not informative at all, i.e., $g^i(\cdot) = 0$, the model reduces to a regular factor model. In matrix form, model (2.1) can be written as

X^i = \Lambda^i F^{i\prime} + U^i, \quad \text{where } \Lambda^i = G^i(W^i) + \Gamma^i, \quad 1 \le i \le m. \qquad (2.2)

In (2.2), $G^i(W^i)$ and $\Gamma^i$ are $p \times K_i$ matrices. More specifically, $g_k^i(W_j^i)$ and $\gamma_{jk}^i$ are the $(j,k)$th elements of $G^i(W^i)$ and $\Gamma^i$ respectively. Expression (2.2) suggests that the observed data can be decomposed into a low-rank heterogeneity term $\Lambda^i F^{i\prime}$ and a homogeneous signal term $U^i$. Letting $u_t^i$ be the $t$th column of $U^i$, we assume all $u_t^i$'s share the same distribution for any $t \le n_i$ and for all subgroups $i \le m$, with $\mathbb{E}[u_t^i] = 0$ and $\operatorname{var}(u_t^i) = \Sigma$.

There has been a large literature on factor models in econometrics [3, 5, 17, 44], machine learning [10, 36] and random matrix theories [26, 38, 46]. We refer the interested readers to those relevant papers and the references therein. However, none of these models incorporate the external covariate information. The semiparametric factor model (2.1) was first proposed by [14] and further investigated by [13] and [18]. Using sufficiently informative external covariates, we are able to more accurately estimate the factors and loadings, and hence yield better adjustment for heterogeneity.

Here we collect some notation for eigenvalues and matrix norms used in the paper. For a matrix M, we use $\lambda_{\max}(M)$, $\lambda_{\min}(M)$ and $\lambda_i(M)$ to denote the maximum eigenvalue, the minimum eigenvalue and the $i$th eigenvalue of M respectively. We define $\|M\|_{\max} = \max_{i,j}|M_{ij}|$, $\|M\|_2 = \lambda_{\max}^{1/2}(M'M)$ ($\|M\|$ for short), $\|M\|_F = (\sum_{i,j} M_{ij}^2)^{1/2}$, $\|M\|_1 = \max_j \sum_i |M_{ij}|$ and $\|M\|_{1,1} = \sum_{i,j}|M_{ij}|$ to be its entry-wise maximum, spectral, Frobenius, induced $\ell_1$ and element-wise $\ell_1$ norms.
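For readers who prefer code, these norms translate directly into numpy; the snippet below is only a small illustration for a generic matrix M.

```python
import numpy as np

M = np.arange(6, dtype=float).reshape(2, 3)   # any matrix, for illustration

max_norm  = np.max(np.abs(M))                 # ||M||_max: entry-wise maximum
spec_norm = np.linalg.norm(M, 2)              # ||M||_2: largest singular value
fro_norm  = np.linalg.norm(M, 'fro')          # ||M||_F
l1_norm   = np.max(np.abs(M).sum(axis=0))     # ||M||_1: max absolute column sum
l11_norm  = np.abs(M).sum()                   # ||M||_{1,1}: sum of absolute entries
```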

2.2. Modeling assumptions and general methodology

In this subsection, we explicitly list all the required modeling assumptions. We start with an introduction of the data generating processes.

Assumption 2.1 (Data generating process). (i) $n_i^{-1} F^{i\prime} F^i = I$.

(ii) $\{u_t^i\}_{t \le n_i,\, i \le m}$ are independently and identically sub-Gaussian distributed with mean zero and covariance $\Sigma$, within and between subgroups, and independent of $\{W_j^i, f_t^i\}$. Let $\|\Sigma\|_2 = C_0 < \infty$.

(iii) $\{f_t^i\}_{t \le n_i}$ is a stationary process with arbitrary temporal dependence. The tails of the factors are sub-Gaussian, i.e., there exist $C_1, C_2 > 0$ such that for any $\alpha \in \mathbb{R}^{K_i}$ and $s > 0$, $\mathbb{P}(|\alpha' f_t^i| > s) \le C_1 \exp(-C_2 s^2/\|\alpha\|^2)$.

The above set of assumptions is commonly used in the literature; see [5] and [18]. We omit detailed discussions here.

Based on whether the external covariates are informative, we specify two regimes, each of which requires some additional technical conditions.

2.2.1. Regime 1: External covariates are not informative

For the case that the external covariates do not have enough explanatory power on the factor loadings Λi, we ignore the semiparametric structure and model (2.2) reduces to the traditional factor model, extensively studied in econometrics [3, 44, 37]. PCA will be employed in Section 3.1 to estimate the heterogeneous effect. It requires the following assumptions.

Assumption 2.2. (i) (Pervasiveness) There are two positive constants $c_{\min}$ and $c_{\max}$ such that

c_{\min} < \lambda_{\min}(p^{-1}\Lambda^{i\prime}\Lambda^i) < \lambda_{\max}(p^{-1}\Lambda^{i\prime}\Lambda^i) < c_{\max} \quad \text{a.s. for all } i.

(ii) $\max_{k \le K_i,\, j \le p} |\lambda_{jk}^i| = O_P(\sqrt{\log p})$.

The first condition is common and essential in the factor model literature (e.g., [44]). It requires the factors to be strong enough that the covariance matrix $\Lambda^i \operatorname{cov}(f_t^i) \Lambda^{i\prime} + \Sigma$ has spiked eigenvalues. We emphasize that this condition is not as stringent as it looks. Consider a single-factor model $Y_{it} = b_i f_t + u_{it}$, $i = 1,\dots,p$, $t = 1,\dots,T$. The pervasiveness assumption then implies $c_{\min} p \le \sum_{i=1}^p b_i^2 \le c_{\max} p$. Since $c_{\min}$ can be a small constant, the pervasiveness assumption just says that the factors $\{f_t\}_{t=1}^T$ have a non-negligible effect on a non-vanishing proportion of outcomes. In addition, this condition is trivially true if the $\{\lambda_j^i\}_{j=1}^p$ can be regarded as random samples from a population with a non-degenerate covariance matrix [17]. Practically, in fMRI data analysis for instance, the lab environment (temperature, air pressure, etc.) or the mental status of the subject being scanned may cause the BOLD (Blood-Oxygen-Level Dependent) level to be uniformly higher at certain times $t$. This means the brain heterogeneity is globally driven by the factors $\{f_t\}_{t=1}^T$. If the batch effect is limited to a small number of dimensions, it is more appropriate to assume sparsity of the top eigenvectors instead of pervasiveness; this is quite different from our problem setup and thus beyond the scope of this paper. The second condition holds if the population has a sub-Gaussian tail.
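As a quick empirical illustration of the remark that random loadings satisfy pervasiveness, the following sketch (with hypothetical dimensions) checks that the eigenvalues of $p^{-1}\Lambda'\Lambda$ stay bounded away from 0 and $\infty$ as $p$ grows when the rows of $\Lambda$ are i.i.d. draws from a non-degenerate distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
for p in (200, 1000, 5000):
    Lambda = rng.normal(size=(p, 3))                    # i.i.d. loadings, K = 3
    eigvals = np.linalg.eigvalsh(Lambda.T @ Lambda / p)
    print(p, eigvals.round(2))                          # eigenvalues concentrate near 1
```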

2.2.2. Regime 2: External covariates are informative

When covariates are informative, we will employ the PPCA [18] to estimate the heterogeneous effect. It requires the following assumptions.

Assumption 2.3. (i) (Pervasiveness) There are two positive constants $c_{\min}$ and $c_{\max}$ such that

c_{\min} < \lambda_{\min}(p^{-1} G^i(W^i)' G^i(W^i)) < \lambda_{\max}(p^{-1} G^i(W^i)' G^i(W^i)) < c_{\max} \quad \text{a.s. for all } i.

(ii) $\max_{k \le K_i,\, j \le p} \mathbb{E}\, g_k^i(W_j^i)^2 < \infty$.

This assumption is parallel to Assumption 2.2 (i). Pervasiveness is trivially satisfied if {Wji}jp are independent and Gi is sufficiently smooth.

Assumption 2.4. (i) $\mathbb{E}\,\gamma_{jk}^i = 0$ and $\max_{k \le K_i,\, j \le p} |\gamma_{jk}^i| = O_P(\sqrt{\log p})$.

(ii) Write $\gamma_j^i = (\gamma_{j1}^i, \dots, \gamma_{jK_i}^i)'$. We assume $\{\gamma_j^i\}_{j \le p}$ are independent of $\{W_j^i\}_{j \le p}$.

(iii) Define $\nu_p = \max_{i \le m} \max_{k \le K_i} p^{-1} \sum_{j \le p} \operatorname{var}(\gamma_{jk}^i) < \infty$. We assume

\max_{k \le K_i,\, j \le p} \sum_{j' \le p} |\mathbb{E}\, \gamma_{jk}^i \gamma_{j'k}^i| = O(\nu_p).

Condition (i) is parallel to Assumption 2.2 (ii), whereas Condition (ii) is natural since $\Gamma^i$ cannot be explained by $W^i$. Condition (iii) imposes cross-sectional weak dependence on $\gamma_j^i$, which is much weaker than assuming independent and identically distributed $\{\gamma_j^i\}_{j \le p}$. This condition is mild, as the main dependency has been taken care of by the $g_k(\cdot)$'s.

3. The ALPHA framework

We introduce the ALPHA framework for heterogeneity adjustment. Methodologically, for each sub-dataset we aim to estimate the heterogeneity component and subtract it from the raw data. Theoretically, we aim to obtain the explicit rates of convergence for both the corrected homogeneous signal and its sample covariance matrix. Those rates will be useful when aggregating the homogeneous residuals from multiple sources.

This section covers the details of heterogeneity adjustment under the above two regimes, which correspond to estimating $U^i$ by either PCA or PPCA. From now on, we drop the superscript $i$ whenever there is no confusion, as we focus on the $i$th data source. We write $\widehat F$ if $F$ is estimated by PCA and $\widetilde F$ if it is estimated by PPCA. This convention applies to other related quantities such as $\widehat U$ and $\widetilde U$, the heterogeneity-adjusted estimators. In addition, we use notation such as $\check F$ and $\check U$ for the final estimators, which are $\widehat F$ and $\widehat U$ if PCA is used, and $\widetilde F$ and $\widetilde U$ if PPCA is used.

The estimators of the latent factors under regimes 1 and 2 satisfy $n^{-1}\check F'\check F = I$, which corresponds to the normalization in Assumption 2.1 (i). By the principle of least squares, the residual estimator of $U$ then admits the form

\check U = X\Big(I - \frac{1}{n}\check F\check F'\Big). \qquad (3.1)
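For completeness, (3.1) is simply the least-squares residual given the estimated factors: using $n^{-1}\check F'\check F = I$,

\widehat{\Lambda} = \arg\min_{\Lambda}\ \|X - \Lambda\check F'\|_F^2 = X\check F(\check F'\check F)^{-1} = \tfrac{1}{n}X\check F, \qquad \check U = X - \widehat{\Lambda}\check F' = X\Big(I - \tfrac{1}{n}\check F\check F'\Big).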

3.1. Estimating factors by PCA

In regime 1, we directly use PCA to adjust the data heterogeneity. PCA estimates $F$ by $\widehat F$, where the $k$th column of $\widehat F/\sqrt{n}$ is the eigenvector of $X'X$ corresponding to its $k$th largest eigenvalue. We have the following theoretical results.

Theorem 3.1. Under Assumptions 2.1 and 2.2, we have

\widehat U - U = -\frac{1}{n}UFF' + \Pi, \qquad \widehat U\widehat U' - UU' = -\frac{1}{n}UFF'U' + \Delta,

where $\|\Pi\|_{\max} = O_P\big(\sqrt{\log n\log p}\,(1/\sqrt{p} + 1/n) + \sqrt{\log n}\,\|\Sigma\|_1/p\big)$ and $\|\Delta\|_{\max} = O_P\big((1 + n/p)\log p + n^2\|\Sigma\|_1^2/p^2\big)$.

Note that we do not explicitly assume $\|\Sigma\|_1$ to be bounded. In some applications it might be natural to assume a sparse covariance so that all terms involving $\|\Sigma\|_1$ can be eliminated, while in other applications, such as the graphical model, it is more natural to impose a sparsity structure on the precision matrix. In that case, one may want to keep track of the effect of $\|\Sigma\|_1$, as it can be as large as $O(\sqrt{p})$ since $\|\Sigma\|_1 \le \sqrt{p}\,\|\Sigma\|_2 = O(\sqrt{p})$.
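As a concrete illustration of the PCA-based adjustment in this subsection, the following minimal numpy sketch (assuming the number of factors K is given; its data-driven choice is deferred to Section 3.3) computes $\widehat F$, $\widehat\Lambda$ and the adjusted data $\widehat U$:

```python
import numpy as np

def pca_adjust(X, K):
    """PCA heterogeneity adjustment for one batch X (p x n).

    Columns of F_hat / sqrt(n) are the top-K eigenvectors of X'X, so that
    F_hat'F_hat / n = I, and the residual is U_hat = X (I - F_hat F_hat'/n).
    """
    p, n = X.shape
    _, eigvecs = np.linalg.eigh(X.T @ X)              # eigenvalues in ascending order
    F_hat = np.sqrt(n) * eigvecs[:, -K:][:, ::-1]     # n x K, top-K eigenvectors
    Lambda_hat = X @ F_hat / n                        # least-squares loadings
    U_hat = X - Lambda_hat @ F_hat.T                  # = X (I - F_hat F_hat'/n)
    return U_hat, F_hat, Lambda_hat
```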

3.2. Estimating factors by Projected-PCA

In regime 2, we would like to incorporate the external covariates using the Projected-PCA (PPCA) method proposed by [18]. The method applies PCA on the projected data and by projection, covariates information is leveraged to reduce dimensionality. We now briefly introduce the method.

For simplicity, we only consider d = 1, that is, we only have a single covariate. The general case can be found in [18]. To model the unknown function gk(Wj), we adopt a sieve based idea which approximates gk(·) by a linear combination of basis functions {ϕ1(x),ϕ2(x),} (e.g., B-spline, Fourier series, polynomial series, wavelets). Then

g_k(W_j) = \sum_{\nu=1}^J b_{\nu,k}\,\phi_\nu(W_j) + R_k(W_j), \qquad k \le K,\; j \le p. \qquad (3.2)

Here {bν,k}ν=1J are the sieve coefficients of gk(Wj), corresponding to the kth factor loading; Rk is the remainder function representing the approximation error; J denotes the number of sieve bases which may grow slowly as p diverges. We take the same basis functions in (3.2) for all k though they can be different.

Define $b_k = (b_{1,k},\dots,b_{J,k})' \in \mathbb{R}^J$ for each $k \le K$, and correspondingly $\phi(W_j) = (\phi_1(W_j),\dots,\phi_J(W_j))' \in \mathbb{R}^J$. Then we can write

g_k(W_j) = \phi(W_j)' b_k + R_k(W_j).

Let $B_{J\times K} = (b_1,\dots,b_K)$, $\Phi(W)_{p \times J} = (\phi(W_1),\dots,\phi(W_p))'$, and let $R_k(W_j)$ be the $(j,k)$th element of $R(W)_{p \times K}$. The matrix form (2.2) can be written as

X = \Phi(W)BF' + R(W)F' + \Gamma F' + U, \qquad (3.3)

recalling that the data index i is dropped. Thus the residual contains three parts: the sieve approximation error R(W)F, unexplained loading ΓF and true signal U.

The idea of PPCA is simple: since the factor loadings are a function of the covariates in (3.3) and U and Γ are independent of W, if we project (smooth) the observed data onto the space of W, the effect of U and Γ will be significantly reduced and the problem becomes nearly a noiseless one, given that the approximation error R(W) is small.

Define P as the projection onto the space spanned by the basis functions of W:

P = \Phi(W)\big(\Phi(W)'\Phi(W)\big)^{-1}\Phi(W)'. \qquad (3.4)

By (3.3), $PX \approx P\Phi(W)BF' \approx G(W)F'$. Thus, $F$ can be estimated from the 'noiseless projected data' $PX$ using conventional PCA. Let the columns of $\widetilde F/\sqrt{n}$ be the eigenvectors corresponding to the top $K$ eigenvalues of the $n \times n$ matrix $X'PX$, which is the sample covariance matrix of the projected data. Then $\widetilde F$ is the PPCA estimator of $F$. It differs from PCA only in that we use the smoothed, or projected, data $PX$.
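The following sketch mirrors the PCA code above but first projects onto a sieve basis; the polynomial basis used here is only a placeholder for whichever basis (B-splines, Fourier series, indicators) is appropriate for the application.

```python
import numpy as np

def ppca_adjust(X, W, K, J):
    """Projected-PCA adjustment for one batch X (p x n) with a scalar covariate W (length p)."""
    p, n = X.shape
    Phi = np.vander(W, J, increasing=True)            # p x J sieve basis Phi(W); placeholder choice
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)     # projection matrix (3.4)
    _, eigvecs = np.linalg.eigh(X.T @ P @ X)          # PCA on the projected data
    F_tilde = np.sqrt(n) * eigvecs[:, -K:][:, ::-1]   # n x K, top-K eigenvectors
    U_tilde = X @ (np.eye(n) - F_tilde @ F_tilde.T / n)
    return U_tilde, F_tilde
```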

We need the following conditions for basis functions and accuracy of sieve approximation.

Assumption 3.1. (i) There are $d_{\min}, d_{\max} > 0$ such that

d_{\min} < \lambda_{\min}(p^{-1}\Phi(W)'\Phi(W)) < \lambda_{\max}(p^{-1}\Phi(W)'\Phi(W)) < d_{\max}

almost surely, and $\max_{\nu \le J,\, j \le p} \mathbb{E}\,\phi_\nu(W_j)^2 < \infty$.

(ii) There exists $k \ge 4$ such that, as $J \to \infty$, $\sup_{x \in \mathcal{X}} |g_k(x) - \sum_{\nu=1}^J b_{\nu,k}\phi_\nu(x)|^2 = O(J^{-k})$, where $\mathcal{X}$ is the support of $W_j$, and $\max_{\nu,k}|b_{\nu,k}| < \infty$.

Condition (ii) is mild; for instance, when $\{\phi_\nu\}$ is a polynomial basis or B-splines, it is implied by the condition that the smooth curve $g_k(\cdot)$ belongs to a Hölder class $\mathcal{G} = \{g : |g^{(r)}(s) - g^{(r)}(t)| \le L|s - t|^{\alpha}\}$ for some $L > 0$, with $k = 2(r+\alpha) \ge 4$ [34, 12].

Recalling the definition of νp in Assumption 2.4 (iii), we have the following results.

Theorem 3.2. Choose $J = (p \min\{n, p, \nu_p^{-1}\})^{1/k}$ and assume $J^2\phi_{\max}^2\log(nJ) = O(p)$, where $\phi_{\max} = \max_{\nu \le J}\sup_{x \in \mathcal{X}}\phi_\nu(x)$. Under Assumptions 2.1, 2.3, 2.4 and 3.1,

\widetilde U - U = -\frac{1}{n}UFF' + \Pi, \qquad \widetilde U\widetilde U' - UU' = -\frac{1}{n}UFF'U' + \Delta,

where $\|\Pi\|_{\max} = O_P\big(\sqrt{\log n/p}\,(J\phi_{\max} + \sqrt{\log p}) + J\phi_{\max}\|\Sigma\|_1\sqrt{\log n}/p\big)$ and $\|\Delta\|_{\max} = O_P\big(n\sqrt{\nu_p/p}\,(J^2\phi_{\max}^2 + \log p) + nJ\phi_{\max}\|\Sigma\|_1(J\phi_{\max} + \sqrt{\log p})/p + n^2J^2\phi_{\max}^2\|\Sigma\|_1^2/p^2\big)$, provided there exists $C$ such that $\nu_p > C/n$.

3.3. A guiding rule for estimating the number of factors, the number of basis functions and determining regimes

We now address the problem of estimating the number of factors in the two different regimes. An extensive literature has contributed to this problem in regime 1, i.e., the regular factor model [4, 1, 28]. [28] and [1] proposed to use the ratio of adjacent eigenvalues of $X'X$ to infer the number of factors. They showed that the estimator $\widehat K = \arg\max_{k \le K_{\max}} \lambda_k(X'X)/\lambda_{k+1}(X'X)$ correctly identifies $K$ with probability tending to 1, as long as $K_{\max} \ge K$ and $K_{\max} = O(n_i \wedge p)$.

For the semiparametric factor model, [18] proposed

\widetilde K = \arg\max_{k \le K_{\max}} \lambda_k(X'PX)/\lambda_{k+1}(X'PX).

Here $K_{\max}$ is of the same order as $Jd$. It was shown that $\mathbb{P}(\widetilde K = K) \to 1$ under regularity assumptions, which we omit here. When we have genuine and pervasive covariates, $\widetilde K$ typically outperforms $\widehat K$. More details can be found in [18].

Once we use K^ and K˜ to estimate the number of factors under the regular factor model and semiparametric factor model respectively, we naturally have an adaptive rule to decide whether the covariates W are informative enough to use PPCA over PCA. We compare two eigen-ratios:

\frac{\lambda_{\widehat K}(X'X)}{\lambda_{\widehat K+1}(X'X)} \quad \text{vs.} \quad \frac{\lambda_{\widetilde K}(X'PX)}{\lambda_{\widetilde K+1}(X'PX)}.

If the former is larger, we identify the dataset as regime 1 and apply regular PCA to get $\widehat U$; otherwise it is regime 2 and PPCA is used to obtain $\widetilde U$. The intuition behind this comparison is that the maximal eigen-ratios can be perceived as signal-to-noise ratios for estimating the spiky heterogeneity term. Given that $n^{-1}XX' \approx GG' + \Gamma\Gamma' + \Sigma$ and $n^{-1}PXX'P \approx GG' + P\Gamma\Gamma'P + P\Sigma P$, the first ratio measures the eigen-gap between $GG' + \Gamma\Gamma'$ and $\Sigma$, and the second ratio measures the eigen-gap between $GG' + P\Gamma\Gamma'P$ and $P\Sigma P$. If $G(W)$ is much more important than $\Gamma$ in explaining the loading structure, projection preserves the signal and reduces the error, improving the eigen-gap. Conversely, if $W$ provides little useful information, projection reduces both noise and signal. Therefore, if projection enlarges the maximum eigen-gap, we prefer PPCA over PCA for estimating the spiky heterogeneity part. The proposed guiding rule effectively tells whether projection can further contrast the spiky and non-spiky parts of the covariance.

The above signal-to-noise comparison can be extended to choose the number of basis functions. Notice that we can regard regular PCA as PPCA with the number of basis functions J = p and hence P = I. Following this line of thinking, we can index P by J and maximize $\lambda_{\widetilde K(J)}(X'P_J X)/\lambda_{\widetilde K(J)+1}(X'P_J X)$ over $J \in \{1, 2, \dots, J_{\max}, p\}$, where J = p corresponds to PCA. Here the notation $\widetilde K(J)$ and $P_J$ exhibits the dependence on J. We implement this guiding rule in the real data analysis; a computational sketch is given below.
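Here is one way to implement the rule, as a sketch under the simplifying assumption that the candidate basis matrices (one per value of J) have already been built; plain PCA (formally J = p, no projection) serves as the baseline.

```python
import numpy as np

def max_eigen_ratio(M, K_max):
    """Return (max adjacent eigen-ratio, argmax index + 1) of a symmetric matrix M."""
    ev = np.sort(np.linalg.eigvalsh(M))[::-1]          # eigenvalues, descending
    ratios = ev[:K_max] / ev[1:K_max + 1]              # requires K_max + 1 <= M.shape[0]
    k = int(np.argmax(ratios))
    return ratios[k], k + 1                            # estimated number of factors

def choose_regime(X, Phi_list, K_max):
    """Guiding rule of Section 3.3: compare the maximal eigen-ratio of X'X (PCA)
    with the best eigen-ratio of X'P_J X over candidate bases Phi_list (PPCA)."""
    best_ratio, K_hat = max_eigen_ratio(X.T @ X, K_max)
    choice = ("PCA", K_hat)
    for Phi in Phi_list:                               # one p x J basis matrix per candidate J
        P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)  # projection onto span of Phi(W)
        ratio, K_tilde = max_eigen_ratio(X.T @ P @ X, K_max)
        if ratio > best_ratio:
            best_ratio, choice = ratio, ("PPCA", K_tilde, Phi)
    return choice
```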

In practice, there is still chance of misspecification of the true number of factors K by ALPHA. One might be curious about how this will affect the performance of ALPHA and the subsequent statistical analysis. To clarify this issue, we conduct sensitivity analysis on the number of factors in Section G.3 in the appendix. The take-home message is that the overestimation of K will not hurt, while underestimation of K might mislead subsequent statistical inference.

3.4. Summary of ALPHA

We now summarize the final procedure and convergence rates. We first divide m subgroups into two classes based on whether the collected covariates have significant influence on the loadings.

\mathcal{M}_1 = \{i \le m : W^i \text{ is not informative}\}, \qquad \mathcal{M}_2 = \{i \le m : W^i \text{ is informative}\}.

ALPHA consists of the following three steps.

Step 1: (Preprocessing) For each data source i, determine whether it belongs to $\mathcal{M}_1$ or $\mathcal{M}_2$ according to the guiding rule of Section 3.3 and correspondingly estimate $K$ by $\check K$, which equals $\widehat K$ or $\widetilde K$ (and choose $J$ if necessary).

Step 2: (Adjustment) Apply Projected-PCA to estimate $\Lambda^i F^{i\prime}$ if $i \in \mathcal{M}_2$; otherwise use PCA to remove the heterogeneity. This results in the adjusted data $\check U^i$, which is either $\widehat U^i$ or $\widetilde U^i$.

Step 3: (Aggregation) Combine the adjusted data $\{\check U^i\}_{i=1}^m$ for further statistical analysis. For example, estimate the covariance $\Sigma$ by $\widehat\Sigma = (N - \sum_i \check K_i)^{-1}\sum_{i=1}^m \check U^i\check U^{i\prime}$, where $N = \sum_i n_i$ is the aggregated sample size, or estimate the sparse precision matrix $\Omega$ by existing graphical model methods (see the sketch below).
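A sketch of the aggregation in Step 3 (the degrees-of-freedom-corrected pooled covariance, which mirrors the estimator (4.2) introduced in Section 4):

```python
import numpy as np

def aggregate_covariance(U_list, K_list):
    """Step 3 of ALPHA: pool the adjusted residuals and form
    Sigma_hat = (N - sum_i K_i)^{-1} sum_i U_i U_i', with N = sum_i n_i."""
    N = sum(U.shape[1] for U in U_list)      # total sample size
    dof = N - sum(K_list)                    # degrees-of-freedom correction
    p = U_list[0].shape[0]
    S = np.zeros((p, p))
    for U in U_list:
        S += U @ U.T
    return S / dof
```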

We summarize the ALPHA procedure in Algorithm 1, given in Appendix A. We also summarize the convergence of $\widehat U^i$ and $\widetilde U^i$ below. To ease presentation, we consider a typical regime in practice: $n_i < Cp$ and $\sum_{i \le m} K_i < C\sqrt{N}$ for some constant $C$. We focus on the situation of sufficiently smooth curves ($k = \infty$), so that $J$ diverges very slowly (say at rate $O(\log p)$), and bounded $\phi_{\max}$ and $\nu_p$ (defined in Theorem 3.2 and Assumption 2.4 respectively). Based on the discussions of the previous subsections, for the estimation of $U$ in max norm we have

\check U^i - U^i = -\frac{1}{n_i}U^iF^iF^{i\prime} + \begin{cases} O_P\big(\sqrt{\log n_i\log p/p} + \sqrt{\log n_i\log p}/n_i\big) & \text{if } i \in \mathcal{M}_1, \\ O_P\big(\sqrt{\log n_i\log p/p}\big) & \text{if } i \in \mathcal{M}_2. \end{cases}

Therefore, PPCA dominates PCA as long as effective covariates are provided. However, $U^iF^iF^{i\prime}/n_i$ dominates all the remaining terms, so that $\|\check U^i - U^i\|_{\max} = O_P(\|U^iF^iF^{i\prime}/n_i\|_{\max}) \asymp O_P(\sqrt{\log n_i\log p/n_i})$.

In addition, for the estimation of $UU'$ we have

\check U^i\check U^{i\prime} - U^iU^{i\prime} = -\frac{1}{n_i}U^iF^iF^{i\prime}U^{i\prime} + \begin{cases} O_P(\log p + \delta) & \text{if } i \in \mathcal{M}_1, \\ O_P\big(n_i\log p\sqrt{\nu_p/p} + \delta\big) & \text{if } i \in \mathcal{M}_2, \end{cases} \qquad (3.5)

where $\delta = n_i^2\|\Sigma\|_1^2\log p/p^2$ depends on $\|\Sigma\|_1$. If we consider a very sparse covariance matrix so that $\|\Sigma\|_1$ is bounded, we can simply drop the term $\delta$ in both regimes. Then regime 1 achieves a better rate if $p = O(n_i^2\nu_p)$, and regime 2 outperforms otherwise.

4. Post-ALPHA inference

We have summarized the order of biases caused by adjusting heterogeneity for each data source in Section 3.4. Now we combine the adjusted data together for further statistical analysis. As an example, we study estimation of the Gaussian graphical model. Assume further uti~N(0,Σ) and consider the following class of the precision matrices:

\mathcal{F}(s, R) = \Big\{\Omega : \Omega \succ 0,\ \|\Omega\|_1 \le R,\ \max_{1 \le i \le p}\sum_{j=1}^p \mathbb{1}(\Omega_{i,j} \ne 0) \le s\Big\}. \qquad (4.1)

To simplify the analysis, we assume R is fixed, but all the analysis can be easily extended to include growing R.

To estimate $\Omega = \Sigma^{-1}$ via CLIME, we need a covariance estimator as the input. We assume here that the number of factors is known, i.e., the exceptional event of failing to recover $K_i$, whose probability vanishes, is ignored for ease of discussion. Such an estimator is naturally given by

\widehat\Sigma = \frac{1}{N - \sum_{i \le m} K_i}\sum_{i=1}^m \check U^i\check U^{i\prime}. \qquad (4.2)

Since the number of data sources is huge, we focus on the case of diverging N and p.

4.1. Covariance estimation

Let $\Sigma_N$ be the oracle sample covariance matrix, i.e., $\Sigma_N = N^{-1}\sum_{i=1}^m U^iU^{i\prime}$. We consider the difference between the proposed $\widehat\Sigma$ and $\Sigma_N$ in this subsection. The oracle estimator obviously attains the rate $\|\Sigma_N - \Sigma\|_{\max} = O_P(\sqrt{\log p/N})$.

Let $\xi_k^i = U^i\bar f_k^i/\sqrt{n_i}$, where $\bar f_k^i$ is the $k$th column of $F^i$. It is not hard to verify that $\xi_k^i$ is Gaussian with mean zero and variance $\Sigma$. Note that $\{\xi_k^i\}_{1 \le i \le m,\,1 \le k \le K_i}$ are i.i.d. over $k$ and $i$, using the assumption $F^{i\prime}F^i/n_i = I$. By a standard concentration bound (e.g. Lemma 4.2 of [19]),

\Big\|\sum_{i \le m}\Big(\frac{1}{n_i}U^iF^iF^{i\prime}U^{i\prime} - K_i\Sigma\Big)\Big\|_{\max} = \Big\|\sum_{i \le m}\sum_{k \le K_i}\big(\xi_k^i\xi_k^{i\prime} - \Sigma\big)\Big\|_{\max} = O_P\big(\sqrt{K_{\mathrm{tot}}\log p}\big),

where $K_{\mathrm{tot}} = \sum_{i \le m} K_i$. Therefore, by (3.5), we have

\|\widehat\Sigma - \Sigma_N\|_{\max} = \Big\|\frac{N}{N - \sum_{i \le m}K_i}\cdot\frac{1}{N}\sum_{i \le m}\big(\check U^i\check U^{i\prime} - U^iU^{i\prime} + K_i\Sigma\big) + \frac{\sum_{i \le m}K_i}{N - \sum_{i \le m}K_i}\Big(\frac{1}{N}\sum_{i \le m}U^iU^{i\prime} - \Sigma\Big)\Big\|_{\max} =: O_P(a_{m,N,p}), \qquad (4.3)

where $a_{m,N,p} = \frac{|\mathcal{M}_1|\log p}{N} + \frac{N_2\log p}{N}\sqrt{\frac{\nu_p}{p}} + \frac{\sqrt{K_{\mathrm{tot}}\log p}}{N} + \frac{K_{\mathrm{tot}}}{N}\sqrt{\frac{\log p}{N}}$ and $N_2 = \sum_{i \in \mathcal{M}_2} n_i$.

We now examine the difference between the ALPHA estimator and the oracle estimator in two specific cases. In the first case, we apply PCA to all data sources, i.e., all $i \in \mathcal{M}_1$ and $K_i$ is bounded. We then have $a_{m,N,p} = m\log p/N$. This rate is dominated by the oracle error rate $\sqrt{\log p/N}$ if and only if $m = O(\sqrt{N/\log p})$. This means traditional PCA performs optimally for adjusting heterogeneity as long as the number of subgroups grows more slowly than the order of $\sqrt{N/\log p}$.

If we apply PPCA to all data sources, i.e., all $i \in \mathcal{M}_2$ and $K_i$ is bounded, then $a_{m,N,p} = \sqrt{\nu_p/p}\,\log p + \sqrt{m\log p}/N$. This rate is of smaller order than the oracle rate $\sqrt{\log p/N}$ if $p/\log p > CN$ for some constant $C > 0$. The advantage of using PPCA is that when $n_i$ is bounded, so that $m \asymp N$, we can still achieve the optimal rate of convergence as long as the dimensionality is large enough, at least of the order $N$.

4.2. Precision matrix estimation

In order to obtain an estimator for the sparse precision matrix from Σ^, we apply the CLIME estimator proposed by [9]. For a given Σ^, CLIME solves the following optimization problem:

\widehat\Omega = \arg\min_{\Omega}\ \|\Omega\|_{1,1} \quad \text{subject to} \quad \|\widehat\Sigma\Omega - I\|_{\max} \le \lambda, \qquad (4.4)

where $\|\Omega\|_{1,1} = \sum_{i,j \le p}|\Omega_{ij}|$ and $\lambda$ is a tuning parameter. Note that (4.4) can be solved column-wise by linear programming. However, CLIME does not necessarily produce a symmetric matrix. We can simply symmetrize it by taking, for each pair, the element of smaller magnitude between $\widehat\omega_{ij}$ and $\widehat\omega_{ji}$. The resulting matrix after symmetrization, still denoted by $\widehat\Omega$ with a slight abuse of notation, also attains a good rate of convergence. In particular, we consider the sparse precision matrix class $\mathcal{F}(s, C_0)$ in (4.1). The theorem below guarantees recovery of any sparse matrix $\Omega \in \mathcal{F}(s, C_0)$; before stating it, we give a small computational sketch of the column-wise linear program and the symmetrization.
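The sketch below solves (4.4) column by column with scipy's LP solver and then applies the min-magnitude symmetrization; it is meant only to make the construction explicit (and assumes λ is large enough for feasibility). Dedicated CLIME solvers should be preferred for large p.

```python
import numpy as np
from scipy.optimize import linprog

def clime(Sigma_hat, lam):
    """Column-wise CLIME (4.4) via linear programming, with min-magnitude symmetrization.

    Each column omega is written as u - v with u, v >= 0; minimizing sum(u + v)
    equals minimizing ||omega||_1 subject to |Sigma_hat @ omega - e_j| <= lam.
    """
    p = Sigma_hat.shape[0]
    Omega = np.zeros((p, p))
    c = np.ones(2 * p)                                  # objective: sum(u) + sum(v)
    A_ub = np.block([[ Sigma_hat, -Sigma_hat],
                     [-Sigma_hat,  Sigma_hat]])         # encodes the two-sided constraint
    for j in range(p):
        e_j = np.zeros(p); e_j[j] = 1.0
        b_ub = np.concatenate([lam + e_j, lam - e_j])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        Omega[:, j] = res.x[:p] - res.x[p:]
    keep_ij = np.abs(Omega) <= np.abs(Omega.T)          # keep the entry of smaller magnitude
    return np.where(keep_ij, Omega, Omega.T)
```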

Theorem 4.1. Suppose $\Omega \in \mathcal{F}(s, C_0)$ and let $\tau_{m,N,p} = \sqrt{\log p/N} + a_{m,N,p}$. Choosing $\lambda \asymp \tau_{m,N,p}$, we have

\|\widehat\Omega - \Omega\|_{\max} = O_P(\tau_{m,N,p}).

Furthermore, $\|\widehat\Omega - \Omega\|_1 = O_P(s\,\tau_{m,N,p})$ and $\|\widehat\Omega - \Omega\|_2 = O_P(s\,\tau_{m,N,p})$.

Here we stress that we choose CLIME for precision matrix estimation because it relies only on the max-norm guarantee for $\|\widehat\Sigma - \Sigma\|_{\max}$. The intuition is that, for any true $\Omega$ with bounded $\|\Omega\|_1$,

\|I - \widehat\Sigma\Omega\|_{\max} = \|(\Sigma - \widehat\Sigma)\Omega\|_{\max} \le \|\widehat\Sigma - \Sigma\|_{\max}\|\Omega\|_1 = O(\|\widehat\Sigma - \Sigma\|_{\max}).

One can see from the above that fast convergence of $\|\widehat\Sigma - \Sigma\|_{\max}$ ensures (with high probability) the feasibility of $\Omega$ for (4.4), which is a necessary step for establishing the consistency of the resulting M-estimator. Interested readers may refer to the proof of Theorem 4.1 for more details. Other possible methods for precision matrix recovery (e.g. the graphical Lasso of [20], the graphical Dantzig selector of [48] and the graphical neighborhood selection of [35]) can also be considered for post-ALPHA inference, but their convergence rates need to be studied in a case-by-case fashion.

Theorem 4.1 shows that CLIME has strong theoretical guarantee of convergence under different matrix norms. The rate of convergence has two parts, one corresponding to the minimax optimal rate [48] while the other is due to the error caused by estimating the unknown factors under various situations. The discussions at the end of Section 4.1 suggest that the latter error is often negligible.

In addition, we numerically investigate how misspecification of the number of factors K will affect the precision matrix estimation in Section G.3 in the appendix.

5. Numerical studies

In this section, we first validate the established theoretical results through Monte Carlo simulations. Our purpose is to show that after heterogeneity adjustment, the proposed aggregated covariance estimator $\widehat\Sigma$ approximates the oracle sample covariance $\Sigma_N$ well, thereby leading to accurate estimation of the true covariance matrix $\Sigma$ and precision matrix $\Omega$. We also compare the performance of PPCA and regular PCA for heterogeneity adjustment under different settings.

In addition, we analyze a real brain imaging dataset using the proposed procedure. The dataset to be analyzed is the ADHD-200 data [6]. It consists of rs-fMRI images of 608 subjects, of whom 465 are healthy and 143 are diagnosed with ADHD. We dropped subjects with missing values in our analysis. Following [39], we divided the whole brain into 264 regions of interest (ROIs, p = 264), which are regarded as nodes in our graphical model. Each brain was scanned multiple times, with sample sizes ranging from 76 to 261 (76 ≤ n_i ≤ 261). In each scan, we acquired the blood-oxygen-level dependent (BOLD) signal within each ROI. Note that subjects have different ages, genders, etc., which results in heterogeneity in the covariance structure of the data. We need to remove this unwanted heterogeneity; otherwise it will dilute or corrupt the true biological signal, i.e., the difference in the brain functional network between healthy people and ADHD patients.

5.1. Preliminary analysis

To apply our ALPHA framework, we first need to argue that the pervasiveness condition of Assumption 2.2 holds for the real dataset considered. This is done in Section G.2, together with further discussions on pervasiveness. We also collect the physical locations of the 264 regions as the external covariates. Ideally, we hope these covariates are pervasive in explaining the batch effect (Assumption 2.3) while bearing no association with the graph structure of $u_t$. This is reasonable because the level of batch effect is non-uniform over different locations of the brain when scanned in fMRI machines; furthermore, it has been widely acknowledged in biological studies that spatial adjacency does not necessarily imply brain functional connectivity.

To construct $W_j^i$ from the physical locations, we simply split the 264 regions into 10 clusters (J = 10) by hierarchical clustering (Ward's minimum variance method) and use the categorical indices as the covariates of the nodes. The clustering result is shown in Figure 2, and the spatial locations of the 264 regions are shown in Figure 6 in 10 different colors. Black (middle), green (left) and blue (right) represent roughly the frontal lobe; gray (middle), pink (left) and magenta (right) occupy the parietal lobe; red (left) and orange (right) lie in the occipital lobe; finally, yellow (left) and navy (right) cover the temporal lobe.

Fig 2.

Cluster Dendrogram for physical locations with J = 10.

Fig 6.

Estimated brain functional connectivity networks using physical locations as covariates to correct heterogeneity. 10 region clusters are labeled in 10 colors. Black, blue and red edges represent respectively common edges, unshared edges in the healthy group and in the ADHD group.

Here J = 10 is only used to calibrate our synthetic model in the next subsection. In the real data analysis, we will choose J adaptively according to our heuristic guiding rule of the maximal eigen-gap discussed in Section 3.3. Note that here since the covariate W is one-dimensional (d = 1) and discrete, the sieve basis functions are just indicator functions 𝟙(w − 0.5 ≤ W < w + 0.5) for w = 1,…,10. We use the same external covariates for all subjects in both healthy and diseased groups.
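A sketch of this construction (Ward clustering of the ROI coordinates, then indicator basis functions), assuming a p x 3 array of physical locations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def indicator_basis(locations, J=10):
    """Ward hierarchical clustering of ROI coordinates into J clusters, then
    one-hot indicator functions of the cluster labels as sieve basis functions."""
    Z = linkage(locations, method="ward")              # locations: p x 3 coordinates
    labels = fcluster(Z, t=J, criterion="maxclust")    # cluster indices 1..J
    Phi = (labels[:, None] == np.arange(1, J + 1)[None, :]).astype(float)  # p x J
    return Phi, labels
```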

The next question is how to divide the subjects into $\mathcal{M}_1$ and $\mathcal{M}_2$ based on whether the selected covariates explain the loadings effectively. We implemented the method given in Section 3.3 and discovered that 398 healthy samples (85.6%) and 126 diseased samples (88.1%) prefer PPCA over PCA, meaning that the physical locations indeed have explanatory power for the factor loadings of most subjects. We identified these as subjects in $\mathcal{M}_2$, while the others were classified into $\mathcal{M}_1$. Based on the class labels, we employed the corresponding method to estimate the number of factors and adjust the heterogeneity. We used $K_{\max} = 3$. The estimated numbers of factors for the two groups are summarized in Table 1.

Table 1.

Distribution of estimated number of factors for healthy and ADHD groups

$\check K_i$    1    2    3
Healthy       253  148   64
ADHD           78   40   25

5.2. Synthetic datasets

In this simulation study, for stability, we use the first 15 subjects in the healthy group to calibrate the simulation models. We specify four asymptotic settings for our simulation studies:

  1. m = 500, ni = 10 for i = 1,..,m, p = 100, 200,…,600 and G(W) ≠ 0;

  2. m = 100, 200,…,1000, ni = 10 for i = 1,…,m, p = 264 and G(W) ≠ 0;

  3. m = 100, ni = 10, 20,…,100 for i = 1,…,m, p = 264 and G(W) ≠ 0;

  4. m = 20, 40,…,200, ni = 20, 40,…,200 for i = 1,…,m, p = 264 and G(W) = 0.

Here the last setting represents regime 1, where we should expect PCA to work well when the number of subjects is of the order of the square root of the total sample size, i.e., $m \asymp \sqrt{N}$. The first three settings represent regime 2 with informative covariates; they present asymptotics with growing p, m and $n_i$ respectively. The details on model calibration and data generation can be found in Section G.1.

We first investigate the errors of estimating covariance of ut in max-norm after applying PPCA or PCA for heterogeneity adjustment. We also compare them with the estimation errors if we naively pool all the data together without any heterogeneity adjustment. However, the estimation error of the naively pooled sample covariance is too large to fit in the graph for the first 3 cases, which we thus do not plot. Denote the oracle sample covariance of ut by ΣN as before. The estimation errors, based on 100 simulations, under the four settings are presented in Figure 3.

Fig 3.

Estimation of Σ by PCA, PPCA and the oracle sample covariance matrix for 4 different settings. Case 1: m and ni are fixed while the dimension p increases; case 2: ni and p are fixed while m increases; case 3: m and p are fixed while ni increases; case 4: p is fixed, and both m and ni increase and conditions for PPCA are violated.

In Case 1, m and $n_i$ are fixed while the dimension p increases. This setting highlights the advantages of Projected-PCA over regular PCA. From the left panel, we observe that increasing the dimensionality improves the performance of Projected-PCA, consistent with the rates derived in our theory. In Case 2, $n_i$ and p are fixed while m increases. Both PPCA and PCA benefit from an increasing number of subjects; however, since $n_i$ is small, PPCA again outperforms. In Case 3, m and p are fixed while $n_i$ increases. Both methods achieve better estimation as $n_i$ increases but, more importantly, regular PCA outperforms PPCA when $n_i$ is large enough. This is again consistent with our theory: as illustrated in Section 4.1, when m is fixed, PCA attains the convergence rate $\|\widehat\Sigma - \Sigma\|_{\max} = O_P(\sqrt{\log p/N})$, while PPCA only achieves $\|\widehat\Sigma - \Sigma\|_{\max} = O_P(\log p/\sqrt{p})$, which is worse than PCA when $p/\log p = o(N)$. In Case 4, p is fixed and both m and $n_i$ increase. Note that the covariates have no explanatory power at all, i.e., the pervasiveness condition in Assumption 2.3 does not hold, so PPCA is not applicable. Adjusting by PCA behaves much better, and PPCA is sometimes as bad as 'nPCA', which corresponds to no heterogeneity adjustment. This is not unexpected, as PPCA utilizes noisy external covariates.

Now we focus on the estimation error of the precision matrix. We plug $\widehat\Sigma$, obtained from the data after adjusting for heterogeneity, into CLIME to get the estimator $\widehat\Omega$ of $\Omega$. In Figure 4, $\|\widehat\Omega - \Omega\|_{\max}$ and $\|\widehat\Omega - \Omega\|_1$ are depicted under the four asymptotic settings. From the plots we see that $\|\widehat\Omega - \Omega\|_{\max}$ and $\|\widehat\Omega - \Omega\|_1$ behave similarly to $\|\widehat\Sigma - \Sigma\|_{\max}$ shown in Figure 3: in Case 1, $n_i$ is small, so it is advantageous to use PPCA, and PPCA behaves better as the dimension increases; in Case 2, both PPCA and PCA benefit from an increasing number of subjects and PPCA outperforms PCA; in Case 3, PCA outperforms PPCA when $n_i$ is large enough since m is fixed; in Case 4, the covariates have no explanatory power at all, so PPCA does not make sense. In the first three cases, if we do not adjust the data heterogeneity, $\|\widehat\Omega - \Omega\|_{\max}$ and $\|\widehat\Omega - \Omega\|_1$ are too large to fit on the current scale.

Fig 4.

Estimation of Ω. Presented are the estimation errors in max-norm and L1-norm for 4 different settings. In Case 4, nPCA refers to no PCA, i.e., we do not adjust heterogeneity.

We also present the ROC curves of our proposed methods in Figure 5, which is of interest to readers concerned with sparsity pattern recovery. The black dashed line is the 45 degree line, representing performance of random guess. It is obvious from those plots that heterogeneity adjustment very much improves the sparsity recovery of the precision matrix. When the sample size of each subject is small, genuine pervasive covariates increase the power of PPCA while if the sample size is relatively large, PCA is sufficiently good in recovering graph structures. Also notice that in all cases, the naive method without heterogeneity adjustment can still achieve a certain amount of power, but we can improve the performance dramatically by correcting the batch effects.

Fig 5.

ROC curves for sparsity recovery of Ω for 4 different settings.

5.3. Brain image network data

In this subsection we report the estimated graphs for both the healthy group and the ADHD patient group, with batch effects removed using our ALPHA framework. We took various sparsity levels of the networks, from 1% to 5% (corresponding to the same set of λ's for the two groups), and depicted the edges that are common across these sparsity levels, which are stable with respect to the tuning.

The brain networks produced by our proposed method are presented in Figure 6. They share 90.7% identical edges between the two groups. However, if we ignore heterogeneity and naively pool the data from all subjects together, the procedure generates 10.2% unshared edges, roughly 1% more than ALPHA produces. Therefore, by heterogeneity adjustment we find less difference in the brain functional networks between ADHD patients and healthy people. In addition, we investigate how the unshared edges are distributed across the 10 clusters. We summarize the total degree of unshared-edge vertices within each cluster in Table 2. As we can see, in the left occipital lobe (red) and the left parietal lobe (pink) there are significant differences in the functional connectivity structure between healthy people and patients, although in general the difference is weak. These are signs that ADHD is a complex disease that affects many regions of the brain. The general methodology we provide here could be valuable for further understanding the mechanism of the disease.

Table 2.

The degree of unshared edge vertices for each cluster

           red  orange  blue  green  yellow  navy  pink  black  magenta  gray
Healthy      3       4     3      2       7     6    10     12       11     6
ADHD         9       6     7      5      12     5     6     15        9    10

6. Discussions

Heterogeneity is usually informed by domain knowledge of the dataset. In particular, it occurs with high probability when the data come from different sources or subgroups. In the brain imaging dataset used in the numerical study, heterogeneity across patients can stem from differences in age, gender, etc. When it is less clear whether heterogeneity exists, we can calculate multiple summary statistics for all the subgroups and see whether they are significantly different. In the case of pervasive heterogeneity, we can test for it via the magnitude of the dominating eigenvalues of the covariance matrix in each subgroup. A systematic testing method for heterogeneity is important, and we leave it as a future research topic. Note that even if all the subgroups are actually homogeneous, ALPHA does not hurt the statistical efficiency under appropriate scaling assumptions. Specifically, for the PCA-based ALPHA, we showed in Section 4.1 that as long as the number of subgroups $m = O(\sqrt{N/\log p})$, $\widehat\Sigma$ enjoys the oracle max-norm rate. This means that given homogeneous data, when the number of data splits is not large, ALPHA yields the same statistical rate as the full-sample oracle estimator. For the PPCA-based ALPHA, $\widehat\Sigma$ enjoys the oracle rate when $p/\log p = \Omega(N)$.

As we have seen, ALPHA is adaptive to factor structures and is flexible enough to include external information. However, this advantage of PPCA is accompanied by more assumptions and by the practical issue of selecting proper basis functions, and their number, in the sieve approximation. One contribution of the paper lies in the seamless integration of PCA and PPCA, which leverages effective external covariates. If no valuable covariates exist and the sample size is relatively large for each data source, we have shown that conventional PCA is still an effective tool.

Note that our framework is compatible with any statistical procedure that only requires an accurate covariance estimator as input, such as the CLIME method illustrated in this work. The ALPHA procedure gives theoretical guarantees for $\|\check U - U\|_{\max}$ and $\|\widehat\Sigma - \Sigma\|_{\max}$, which serve as foundations for establishing the statistical properties of the subsequent procedure. Besides, ALPHA has potential applications in regression analysis. If the residual terms $\{U^i\}_{i=1}^m$ are the true predictors for the responses of interest $\{Y^i\}_{i=1}^m$, we can first apply ALPHA to extract the residuals before the regression procedure. For example, the residual BOLD signals we obtained by ALPHA in the brain functional network analysis (Section 5.3) are potentially useful in predicting whether a person has ADHD. This is a typical logistic regression problem based on the ALPHA adjustment. We leave the detailed study of combining ALPHA with regression models for future investigation. One recent work [16] has adopted a method similar to ALPHA that extracts residuals for model selection in high-dimensional regression.
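As a sketch of this idea (not part of the paper's analysis), one could summarize each subject's adjusted residuals, here by the upper-triangular entries of their correlation matrix (a convenient but hypothetical feature choice), and feed the summaries to a penalized logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def residual_features(U):
    """One simple subject-level feature vector: upper-triangular correlations of
    the adjusted residuals U (p x n_i)."""
    C = np.corrcoef(U)
    iu = np.triu_indices_from(C, k=1)
    return C[iu]

def fit_adhd_classifier(U_list, y):
    """Logistic regression of diagnosis y (0/1) on post-ALPHA residual features."""
    Xmat = np.vstack([residual_features(U) for U in U_list])
    return LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(Xmat, y)
```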

Finally, we point out two current limitations of ALPHA. The first lies in the pervasiveness assumption on the heterogeneity terms $\{\Lambda^iF^{i\prime}\}_{i=1}^m$. More specifically, for each subgroup i, ALPHA requires the signal strength of the heterogeneous part $\Lambda^iF^{i\prime}$ to overwhelm the homogeneous residual part $U^i$, so that PCA or PPCA can accurately estimate $\Lambda^iF^{i\prime}$ and remove it. Such a requirement can be violated in practice when the heterogeneous term has signal strength similar to that of the homogeneous term. Additionally, statistical methods that require more than the max-norm error guarantees ($\|\check U - U\|_{\max}$, $\|\widehat\Sigma - \Sigma\|_{\max}$), say in the general non-sparse situation, may for now be inappropriate for post-ALPHA inference.

Acknowledgments

This project was supported by National Science Foundation grants DMS-1206464 and DMS-1406266 and by National Institutes of Health grant 2R01-GM072611-12.

Appendix A: Algorithm for ALPHA

The pseudo code for the algorithm ALPHA is shown as follows.

Algorithm 1: Adaptive low-rank principal heterogeneity adjustment (ALPHA)

Input: panels $X^i_{p \times n_i}$ and $d$-dimensional covariates $\{W_j^i\}_{j=1}^p$ from $m$ data sources; $J_{\max}$, $K_{\max}$ (with $J_{\max} \ge (K_{\max}+1)/d$).
Output: $\check U^i$, $\check K_i$ and $\widehat\Sigma$.

1:  procedure ALPHA
2:    for each subject $i \le m$ do
3:      $\widehat K_i \leftarrow \arg\max_{k \le K_{\max}} \lambda_k(X^{i\prime}X^i)/\lambda_{k+1}(X^{i\prime}X^i)$
4:      $\Delta\lambda_0^i \leftarrow \lambda_{\widehat K_i}(X^{i\prime}X^i)/\lambda_{\widehat K_i+1}(X^{i\prime}X^i)$
5:      for each $(K_{\max}+1)/d \le J \le J_{\max}$ do
6:        $P_J^i \leftarrow \Phi(W^i)(\Phi(W^i)'\Phi(W^i))^{-1}\Phi(W^i)'$ for this $J$
7:        $\widetilde K_J^i \leftarrow \arg\max_{k \le K_{\max}} \lambda_k(X^{i\prime}P_J^iX^i)/\lambda_{k+1}(X^{i\prime}P_J^iX^i)$
8:        $\Delta\lambda_J^i \leftarrow \lambda_{\widetilde K_J^i}(X^{i\prime}P_J^iX^i)/\lambda_{\widetilde K_J^i+1}(X^{i\prime}P_J^iX^i)$
9:      end for
10:     $J_*^i \leftarrow \arg\max_J \Delta\lambda_J^i$
11:     $\widetilde K_i \leftarrow \widetilde K_{J_*^i}^i$
12:
13:     if $\Delta\lambda_0^i > \Delta\lambda_{J_*^i}^i$ (i.e. $i \in \mathcal{M}_1$) then
14:       $\widehat F^i/\sqrt{n_i} \leftarrow$ eigenvectors of $X^{i\prime}X^i$ corresponding to the top $\widehat K_i$ eigenvalues
15:       $\widehat\Lambda^i \leftarrow X^i\widehat F^i/n_i$,  $\widehat U^i \leftarrow X^i - \widehat\Lambda^i\widehat F^{i\prime}$
16:       $\check U^i \leftarrow \widehat U^i$,  $\check K_i \leftarrow \widehat K_i$
17:     else
18:       $\widetilde F^i/\sqrt{n_i} \leftarrow$ eigenvectors of $X^{i\prime}P_{J_*^i}^iX^i$ corresponding to the top $\widetilde K_i$ eigenvalues
19:       $\widetilde\Lambda^i \leftarrow X^i\widetilde F^i/n_i$,  $\widetilde U^i \leftarrow X^i - \widetilde\Lambda^i\widetilde F^{i\prime}$
20:       $\check U^i \leftarrow \widetilde U^i$,  $\check K_i \leftarrow \widetilde K_i$
21:     end if
22:   end for
23:
24:   $\widehat\Sigma \leftarrow (\sum_i n_i - \sum_i \check K_i)^{-1}\sum_{i=1}^m \check U^i\check U^{i\prime}$
25:   return $\{\check U^i\}_{i=1}^m$, $\{\check K_i\}_{i=1}^m$ and $\widehat\Sigma$
26: end procedure

Appendix B: A key lemma

Recall that we defined

\check U = X\Big(I - \frac{1}{n}\check F\check F'\Big), \qquad (B.1)

where we used notations such as Fˇ and Uˇ to denote the final estimators, which are F^ and U^ if PCA is used, and F˜ and U˜ if PPCA is used.

The following lemma holds for Uˇ no matter whether PCA or PPCA is applied.

Lemma B.1. For any $K \times K$ matrix $H$ such that $\|H\| = O_P(1)$, if $\log p = O(n)$,

\check U - U = -\frac{1}{n}UFF' + \Pi,

where $\|\Pi\|_{\max} = O_P\big(\frac{\sqrt{\log n}}{n}\big(\|F'(\check F - FH)\|_{\max}\|\Lambda\|_{\max} + \|U(\check F - FH)\|_{\max}\big) + \|\check F - FH\|_{\max}\|\Lambda\|_{\max} + \sqrt{\log n}\,\|H'H - I\|_{\max}\|\Lambda\|_{\max}\big)$; and furthermore

\check U\check U' - UU' = -\frac{1}{n}UFF'U' + \Delta,

where $\|\Delta\|_{\max} = O_P\big(\|U(\check F - FH)\|_{\max}\|\Lambda\|_{\max} + \|U(\check F - FH)\|_{\max}^2 + \|F'(\check F - FH)\|_{\max}\|\Lambda\|_{\max}^2 + n\|H'H - I\|_{\max}\|\Lambda\|_{\max}^2\big)$.

The above lemma states that the error of estimating U by Uˇ (or estimating UU by UˇUˇ ) is decomposed into two parts. The first part is inevitable even when the factor matrix F in (3.1) is known in advance. The second part is caused by the uncertainty from estimating F. Since the true F is identifiable up to an orthonormal transformation H, we need to carefully choose H to bound the error Π (or Δ). We will provide explicit rates of convergence for those terms in the following two sections.

Proof. By definition of Uˇ, Uˇ=U(In1FF)+n1X(FˇFˇFF). We first look at the converge of UˇU. Obviously Π=n1X(FˇFˇFF)=I+II where

I=1nΛF(FˇFˇFF),II=1nU(FˇFˇFF).

Since F(FˇFˇFF)=F(FˇFH)Fˇ+nH(FˇFH)+n(HHI)F, we have

Imax=OP(Λmax(||F(FˇFH)||max||Fˇ/n||max+||FˇFH||max+||HHI||max||F||max)).

Similarly U(FˇFˇFF)=U(FˇFH)Fˇ+UFH(FˇFH)+UF(HHI)F, so

||II||max=OP(||U(FˇFH)||max||Fˇ/n||max+||UF/n||max(||FˇFH||max+||HHI||max||F||max)).

According to Lemma F.4 (i), ||UF/n ||max = OP (1) and noting both ||F||max and ||Fˇ||max are OP(n), we conclude the result for Πmax easily.

Now we consider UˇUˇ in the following.

UˇUˇ=U(In1FF)U+n1U(In1FF)(FˇFˇFF)X+n2X(FˇFˇFF)2X=:UU1nUFFU+III+IV.

So Δ=III+IV and it suffices to bound the two terms.

||III||max=OP(||n1U(IFF/n)FˇFˇF||max||Λ||max+||n1U(IFF/n)FˇFˇU||max)=:OP(||J1||max||Λ||max+||J2||max).

Decompose J1 by J1=n1U(FˇFH)FˇFn2UFF(FˇFH)FˇF. Therefore,

||J1||max=OP(||U(FˇFH)||max+n1||UF||max||F(FˇFH)||max),

since ||FˇF/n||max||FˇF/n||F||Fˇ||F||F||F/n=K. Similar to J1, we decompose J2 only replacing FˇF with FˇU. According to Lemma F.4 (i), ||FˇU/n||max=OP(||UF/n||max+||U(FˇFH)||max)=OP(1+||U(FˇFH)||max), hence J2max=OP((||J1||max(1+||U(FˇFH)||max)). We then conclude that ||III||max=OP((||U(FˇFH)||max+n1||UF||max||F(FˇFH)||max)(||Λ||max+||U(FˇFH)||max)).

Now let us take a look at IV. IVmax=||D1+D2+D2+D3||max where

D1=n2ΛF(FˇFˇFF)2FΛ=Λ(nIn1FFˇFˇF)Λ,D2=n2U(FˇFˇFF)2FΛ=n2UFF(FˇFˇFF)FΛD3=n2U(FˇFˇFF)2U.

By assumption, ||H||max ≤ ||H|| = OP (1). Simple decompositions of D1 gives

||D1||max=OP((||F(FˇFH)||max+n||HHI||max)||Λ||max2).

Since D2=n2UFF(FˇFH)FˇFΛn1UFH(FˇFH)FΛUF(HHI)Λ, we have

||D2||max=OP(||UF/n||max||D1||max)=OP(||D1||max).

It is also not hard to show D3max=OPIIImax+D1max. Under both Theorems C.1 and D.1 (replacing Fˇ by F^ for regime 1 and F˜ for regime 2), we can check the following relationship holds:

n1||UF||max||U(FˇFH)||max=OP(||Λ||max2).

Therefore we have

||Δ||max=||III+IV||max=OP(||U(FˇFH)||max||Λ||max+||U(FˇFH)||max2+||F(FˇFH)||max||Λ||max2+n||HHI||max||Λ||max2).

Appendix C: Proof of Theorem 3.1

Recall that PCA estimates $F$ by $\widehat F$, where the $k$th column of $\widehat F/\sqrt{n}$ is the eigenvector of $(pn)^{-1}X'X$ corresponding to its $k$th largest eigenvalue. By the definition of $\widehat F$, we have

\frac{1}{np}X'X\widehat F = \widehat F K,

where $K$ is a $K \times K$ diagonal matrix with the top $K$ eigenvalues of $(np)^{-1}X'X$ in descending order as its diagonal elements. Define a $K \times K$ matrix $H$ as in [17]:

H = \frac{1}{np}\Lambda'\Lambda F'\widehat F K^{-1}.

It has been shown that $\|K\|$, $\|K^{-1}\|$, $\|H\|$ and $\|H^{-1}\|$ are all $O_P(1)$.

The following lemma provides all the rates of convergences that are needed for downstream analysis.

Lemma C.1. Under Assumptions 2.1 and 2.2, we have $\|\Lambda\|_{\max} = O_P(\sqrt{\log p})$ and

(i) $\|\widehat F - FH\|_F = O_P(\sqrt{n/p} + 1/\sqrt{n})$ and $\|\widehat F - FH\|_{\max} = O_P(\sqrt{\log n/p} + \sqrt{\log n}/n)$;

(ii) $\|F'(\widehat F - FH)\|_{\max} = O_P(1 + n/p)$;

(iii) $\|U(\widehat F - FH)\|_{\max} = O_P\big((1 + \sqrt{n/p})\sqrt{\log p} + n\|\Sigma\|_1/p\big)$;

(iv) $\|H'H - I\|_{\max} = O_P(1/n + 1/p)$.

Combining the above results with Lemma B.1, we have

\widehat U - U = -\frac{1}{n}UFF' + \Pi,

where $\|\Pi\|_{\max} = O_P\big(\sqrt{\log n\log p}\,(1/\sqrt{p} + 1/n) + \sqrt{\log n}\,\|\Sigma\|_1/p\big)$, and additionally

\widehat U\widehat U' - UU' = -\frac{1}{n}UFF'U' + \Delta,

where $\|\Delta\|_{\max} = O_P\big((1 + n/p)\log p + n^2\|\Sigma\|_1^2/p^2\big)$. Thus we complete the proof of Theorem 3.1. It remains to verify Lemma C.1, which is done in the following three subsections.

C.1. Convergence of factors F^

Recall H=(np)1ΛΛFF^K1. Substituting X=ΛF+U, we have,

F^FH=(i=13Ei)K1,Ε1=1npFΛUF^,Ε2=1npUΛFF^,Ε3=1npUUF^. (C.1)

To bound ||F^FH||max, note that there is a constant C > 0, so that

||F^FH||maxC||K1||2i=13||Ei||max.

Hence we need to bound ||Ei||maxfori=1,2,3 since ||K1||2=OP(1). The following lemma gives the stochastic bounds for each individual term.

Lemma C.2. (i) E1F=OP(n/p)=E2F, E3F=OP(1/n+1/p+n/p).

(ii) E1max=OP(logn/p)=E2max,E3max=OP(1/p+logn/n).

Proof. (i) Obviously E1Fp1ΛUF=OP(n/p) according to Lemma F.1. E2F attains the same rate. In addition, E3Fn1/2p1UUF=OP(1+n/p) again according to Lemma F.1. So combining the three terms, we have F^FHF=OP(1+n/p). We now refine the bound for E3FE3F(np)1(UUFFHF+UUFF^FHF)=OP(1/n+1/p+n/p). Then the refined rate of F^FHF is OP(n/p+1/n).

(ii) Since ΛUF^F=OP(np) by Lemma F.1,

E1max=OP((np)1FmaxΛUF^F)=OP(logn/p).

E2max is bounded by p1UΛmax=OP(logn/p) while E3max is bounded by

OP((np)1(UUFmax+nUUmaxF^FHF)),

which based on results of Lemma F.2 and (i) is OP(1/p+logn/n).

The final rate of convergence for F^FHmax and F^FHF are summarized as follows.

Proposition C.1.

F^FHmax=OP(lognp+lognn)andF^FHF=OP(np+1n). (C.2)

Proof. The results follow from Lemmas C.2. □

C.2. Rates of F(F^FH)max and HHImax

Note first that the two matrices under consideration is both K by K, so we do not lose rates bounding them by their Frobenius norm.

Let us find out rate for F(F^FH)F. Basically we need to bound FEiF for i = 1, 2, 3. Firstly

FE1F=p1ΛUF^Fp1(ΛUFFHF+ΛUFF^FHF).

Since ΛUFF=OP(np) and ΛUF=OP(np) by Lemma F.1, we have FE1F=OP(n/p+n/p). Secondly,

FE2Fp1FUΛF=OP(n/p).

Finally,

FE3F=OP(1npUFF2+1npFUUFF^FHF)=OP(1+n/p).

So combining three terms we have F(F^FH)maxF(F^FH)F=OP(1+n/p).

Now we bound HHIF. Since HH=n1(FHF^)FH+n1F^(FHF^)+I, we have

HHIF=OP(1nF(F^FH)F+1nF^FHF2)=OP(1n+1p).

Therefore HHIF has the same rate since HHIFHFHHIFH1F. So HHImax=OP(1/n+1/p).

C.3. Rate of U(F^FH)max

In order to study rate of U(F^FH)max,we essentially need to bound UEimax for i = 1, 2, 3. We handle each term separately.

UE1max=OP(1npUFmaxΛUF^F)=OP(1nUFmaxFE1F)=OP(logpp+nlogpp).

By Lemma F.5, UUΛmax=OP(nplogp+n1). Therefore,

UE2max=OP(1pUUΛmax)=OP(n1p+nlogpp).

From bounding E3F, the last term has rate

UE3max=1npUUUF^max1npUmaxUUF^F=OP((1+n/p)logp).

So combining three terms, we conclude U(F^FH)max=OP((1+n/p)logp+n1/p).

Appendix D: Proof of Theorem 3.2

Recall that by the definition of $\widetilde F$, we have

\frac{1}{np}X'PX\widetilde F = \widetilde F K,

where $K$ is a $K \times K$ diagonal matrix with the first $K$ largest eigenvalues of $(np)^{-1}X'PX$ in descending order as its diagonal elements. Define the $K \times K$ matrix $H$ as in [18]:

H = \frac{1}{np}B'\Phi(W)'\Phi(W)BF'\widetilde F K^{-1}.

It has been shown that $\|K\|$, $\|K^{-1}\|$, $\|H\|$ and $\|H^{-1}\|$ are all $O_P(1)$. We remind the reader that although $H$ and $K$ here differ from those of regime 1 defined in the previous section, they play essentially the same roles (hence the same notation).

The following lemma provides all the rates of convergences that are needed for downstream analysis.

Lemma D.1. Choose $J = (p\min\{n, p, \nu_p^{-1}\})^{1/k}$ and assume $J^2\phi_{\max}^2\log(nJ) = O(p)$, where $\phi_{\max} = \max_{\nu \le J}\sup_{x \in \mathcal{X}}\phi_\nu(x)$. Under Assumptions 2.1, 2.3, 2.4 and 3.1, we have $\|\Lambda\|_{\max} = O_P(J\phi_{\max} + \sqrt{\log p})$ and

(i) $\|\widetilde F - FH\|_F = O_P(\sqrt{n/p})$ and $\|\widetilde F - FH\|_{\max} = O_P(\sqrt{\log n/p})$;

(ii) $\|F'(\widetilde F - FH)\|_{\max} = O_P(\sqrt{n/p} + n/p + n\sqrt{\nu_p/p})$;

(iii) $\|U(\widetilde F - FH)\|_{\max} = O_P(\sqrt{n\log p/p} + nJ\phi_{\max}\|\Sigma\|_1/p)$;

(iv) $\|H'H - I\|_{\max} = O_P(1/\sqrt{p} + 1/\sqrt{pn} + \sqrt{\nu_p/p})$.

Combining the above lemma with Lemma B.1, we obtain

\widetilde U - U = -\frac{1}{n}UFF' + \Pi,

where $\|\Pi\|_{\max} = O_P\big(\sqrt{\log n/p}\,(J\phi_{\max} + \sqrt{\log p}) + J\phi_{\max}\|\Sigma\|_1\sqrt{\log n}/p\big)$, and

\widetilde U\widetilde U' - UU' = -\frac{1}{n}UFF'U' + \Delta,

where $\|\Delta\|_{\max} = O_P\big(n\sqrt{\nu_p/p}\,(J^2\phi_{\max}^2 + \log p) + nJ\phi_{\max}\|\Sigma\|_1(J\phi_{\max} + \sqrt{\log p})/p + n^2J^2\phi_{\max}^2\|\Sigma\|_1^2/p^2\big)$ if there exists $C$ such that $\nu_p > C/n$. We choose to keep the $\|\Sigma\|_1$ terms here although they lengthen the presentation of the rate.

Thus we complete the proof for Theorem 3.2. We are left to check Lemma D.1, which is done in the following three subsections.

D.1. Convergence of factors F˜

Recall H=(np)1BΦ(W)Φ(W)BFF˜K1. Substituting X=Φ(W)BF+R(W)F+ΓF+U, we have,

F˜FH=(i=115Aj)K1 (D.1)

where Ai,i ≤ 3 has nothing to do with R(W) and Γ:

A1=1npFBΦ(W)UF˜,A2=1npUΦ(W)BFF˜,A3=1npUPUF˜;

Ai, 3 ≤ i ≤ 8 takes care of terms involving R(W):

A4=1npFBΦ(W)R(W)FF˜,A5=1npFR(W)Φ(W)BFF˜,A6=1npFR(W)PR(W)FF˜,A7=1npFR(W)PUF˜,A8=1npUPR(W)FF˜;

the remaining are terms involving Γ:

A9=1npFBΦ(W)ΓFF˜,A10=1npFΓΦ(W)BFF˜,A11=1npFΓPΓFF˜,A12=1npFΓPUF˜,A13=1npUPΓFF˜,A14=1npFRPΓFF˜,A15=1npFΓPRFF˜.

To bound F˜FHmax, as in Theorem C.1 we only need to bound Aimax for i = 1,…,15 since again we have K12=OP(1). The following lemma gives the rate for each term.

Lemma D.2. (i) A1max=OP(logn/p)=A2max,

(ii) A3max=OP(Jϕmaxlog(nJ)/p),

(iii) A4max=OP(Jk/2logn)=A5max and A9max=OP(νplogn/p)=A10max,

(iv) A6max=OP(Jklogn) and A11max=OP(Jνplogn/p),

(v) A7max=OP(ϕmaxp1J1klog(nJ)logn)=A8max and A12max=OP(Jϕmaxνplog(nJ)logn/p)=A13max,

(vi) A14max=OP(p1J1kνplogn)=A15max.

Proof. (i) Note that $\|F\|_{\max}=O_P(\sqrt{\log n})$ and $\|\tilde{F}\|_F=O_P(\sqrt{n})$. By Lemmas F.3 and F.4, $\|U'\Phi(W)B\|_F=O_P(\sqrt{pn})$ and $\|U'\Phi(W)B\|_{\max}=O_P(\sqrt{p\log n})$.

Hence

A1maxKnpFmaxBΦ(W)UFF˜F=OP(logn/p),A2maxKnpUΦ(W)BmaxFFF˜F=OP(logn/p).

(ii) We have $A_3=\frac{1}{np}U'\Phi(W)(\Phi(W)'\Phi(W))^{-1}\Phi(W)'U\tilde{F}$. By Lemmas F.3 and F.4, $\|U'\Phi(W)\|_F=O_P(\sqrt{npJ})$ and $\|U'\Phi(W)\|_{\max}=O_P(\phi_{\max}\sqrt{p\log(nJ)})$. By Assumption 3.1, $\|(\Phi(W)'\Phi(W))^{-1}\|_2=O_P(p^{-1})$. Note that, for matrices $A\in\mathbb{R}^{m\times n}$, $B\in\mathbb{R}^{n\times n}$ and $C\in\mathbb{R}^{n\times r}$, $\|ABC\|_{\max}=\max_{i\le m,k\le r}|a_i'Bc_k|\le\sqrt{n}\,\|A\|_{\max}\|B\|_2\|C\|_F$. So

A3maxJdnpUΦ(W)max(Φ(W)Φ(W))12Φ(W)UFF˜F=OP(Jϕmaxlog(nJ)/p).

(iii) Note that Φ(W)B2G(W)2+R(W)2=OP(p), and R(W)max=OP(Jk/2). Hence we have BΦ(W)R(W)maxBΦ(W)1R(W)maxpBΦ(W)2R(W)max=OP(pJk/2). Thus

A4maxK3/2npFmaxBΦ(W)R(W)maxFF˜F=OP(Jk/2logn).

Similarly, A5max attains the same rate of convergence.

In addition, notice that A9 and A10 have representations similar to those of A4 and A5; the only difference is that R is replaced by Γ. It is not hard to see that BΦΓmax=OP(pνp). Therefore A9max=OP(νplogn/p)=A10max.

(iv) Note that

P2=(Φ(W)Φ(W))1/2Φ(W)Φ(W)(Φ(W)Φ(W))1/22=1

and R(W)PR(W)maxpR(W)max2P2=OP(pJκ). Hence

A6maxKnpFmaxR(W)PR(W)maxFF˜F=OP(Jκlogn).

A11 has similar representation as A6. Since

ΓPΓmaxΦΓF2(ΦΦ)12=OP(Jνp),

we have A11max=OP(Jνplogn/p).

(v) According to Lemma F.4, UΦ(W)max=OP(ϕmaxplog(nJ)). Thus

A7maxKnpFmaxF˜FRΦ(ΦΦ)1ΦUmaxOP(p1Jlogn)RΦF(ΦΦ)12ΦUmax=OP(ϕmaxJlog(nJ)lognpJκ),

since RΦFRFΦ2=OP(pJk/2). The rate of convergence for A8 can be bounded in the same way. So do A12 and A13. Given that ΓΦF=OP(pJνp), we have A12max=OP(Jϕmaxνplog(nJ)logn/p)=A13max.

(vi) Obviously, A14max=OP(p1lognRPΓmax) and ||RPΓ||max||RΦ||F||(ΦΦ)1||ΦΓF. We conclude A14max=OP(p1J1kνplogn). Same bound holds for A15. □

The final rates of convergence for $\|\tilde{F}-FH\|_{\max}$ and $\|\tilde{F}-FH\|_F$ are summarized as follows.

Proposition D.1. Choose $J=(p\min\{n,p,\nu_p^{-1}\})^{1/k}$ and assume $J^2\phi_{\max}^2\log(nJ)=O(p)$ and $\nu_p=O(1)$. Then

$\|\tilde{F}-FH\|_{\max}=O_P\big(\sqrt{\log n/p}\big)\quad\text{and}\quad\|\tilde{F}-FH\|_F=O_P\big(\sqrt{n/p}\big).$ (D.2)

Proof. The max norm result follows from Lemma D.2 and (D.1), while the Frobenius norm result has been shown in [18]. □

D.2. Rates of $\|F'(\tilde{F}-FH)\|_{\max}$ and $\|HH'-I\|_{\max}$

Note first that the two matrices under consideration are both K × K, so we lose no rate by bounding their max norms by their Frobenius norms.

It has been proved in [18] that F(F˜FH)F=OP(n/p+n/p+nνp/p+nJk/2). By the choice of J, the last term is asymptotically negligible. So

F(F˜FH)maxF(F˜FH)F=OP(n/p+n/p+nνp/p).

[18] also showed that HHIF=OP(1/p+1/pn+Jκ/2+νp/p). Since H and H1 are both OP(1), we easily show HHImaxHHIFHHHIFH1=OP(1/p+1/pn+νp/p) since Jκp/νp.

D.3. Rate of $\|U(\tilde{F}-FH)\|_{\max}$

By (D.1), in order to bound $\|U(\tilde{F}-FH)\|_{\max}$ we essentially need to bound $\|UA_i\|_{\max}$ for i = 1, …, 15. Rather than repeating the detailed calculations of Lemma D.2 for each term, we point out the differences here. The $A_i$ fall into two types: those starting with F and those starting with U′.

If a term $A_i$ starts with F, say $A_i=FQ$, then in Lemma D.2 we bounded $\|A_i\|_{\max}$ using $\sqrt{K}\|F\|_{\max}\|Q\|_F$. Now we use the bound $\|UA_i\|_{\max}\le\sqrt{K}\|UF\|_{\max}\|Q\|_F$, so all the related rates are obtained by simply replacing the rate $\|F\|_{\max}=O_P(\sqrt{\log n})$ with $\|UF\|_{\max}=O_P(\sqrt{n\log p})$.

The terms starting with U′ are $A_i$, i = 2, 3, 8, 13. In Lemma D.2, we bounded $\|A_i\|_{\max}$, i = 3, 8, 13, using $\|U'\Phi\|_{\max}$, while we bounded $\|A_2\|_{\max}$ using $\|U'\Phi B\|_{\max}$. Correspondingly, we now need to control $\|UU'\Phi\|_{\max}$ and $\|UU'\Phi B\|_{\max}$ separately to update the rates. The derivation is relegated to Lemma F.5. We have UUΦ(W)max=OP(ϕmax(nplogp+nΣ1)) and UUΦ(W)Bmax=OP(nplogp+nJϕmaxΣ1).

So we replace the corresponding terms in Lemma D.2. It is not hard to see the dominating term is UA2max=OP(nlogp/p+nJϕmaxΣ1/p). Therefore, U(F˜FH)max has the same rate.

Appendix E: Proof of Theorem 4.1

Proof. Denote the oracle empirical covariance matrix as

$\Sigma_N=\frac{1}{N}\sum_{i=1}^{m}U_iU_i'.$

As in [9], the upper bound on $\hat\Omega-\Omega$ is obtained by proving

$\|(\hat\Sigma-\Sigma_N)\Omega\|_{\max}=O_P(\tau_{m,N,p})\quad\text{and}\quad\|(\Sigma_N-\Sigma)\Omega\|_{\max}=O_P(\tau_{m,N,p}).$ (E.1)

Once the two bounds are established, we proceed by observing

$\|I_p-\hat\Sigma\Omega\|_{\max}=\|(\hat\Sigma-\Sigma)\Omega\|_{\max}=O_P(\tau_{m,N,p}),$

and it then readily follows that, if $\lambda\asymp\tau_{m,N,p}$,

$\|\hat\Omega-\Omega\|_{\max}\le\|\Omega(I_p-\hat\Sigma\hat\Omega)\|_{\max}+\|(I_p-\hat\Sigma\Omega)'\hat\Omega\|_{\max}\le\|\Omega\|_1\|I_p-\hat\Sigma\hat\Omega\|_{\max}+\|I_p-\hat\Sigma\Omega\|_{\max}\|\hat\Omega\|_1\le\lambda\|\Omega\|_1+\tau\|\Omega\|_1=O_P(\tau_{m,N,p}),$

where the first term of the last inequality uses the constraint in (4.4), while the optimality of (4.4) is used to bound $\|\hat\Omega\|_1$ by $\|\Omega\|_1$. So it remains to find $\tau_{m,N,p}$ in (E.1). Since $\Omega\in\mathcal{F}(s,C_0)$, $\|\Omega\|_1\le C_0$, so we just need to bound $\|\hat\Sigma-\Sigma_N\|_{\max}$ and $\|\Sigma_N-\Sigma\|_{\max}$. Obviously,

$\|\Sigma_N-\Sigma\|_{\max}=O_P\Big(\sqrt{\tfrac{\log p}{N}}\Big).$

We have shown in (4.3) that $\hat\Sigma$ given by (4.2) attains the rate $\|\hat\Sigma-\Sigma_N\|_{\max}=O_P(a_{m,N,p})$. Thus $\tau_{m,N,p}=\sqrt{\log p/N}+a_{m,N,p}$. A proof similar to that in [9] also yields error bounds under the $\ell_1$ and $\ell_2$ norms, which we omit. The proof is now complete. □
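To make the pooling step in this proof concrete, the following minimal sketch (an illustration with hypothetical dimensions, not the authors' code) forms the oracle-type pooled covariance $\Sigma_N=N^{-1}\sum_{i=1}^m U_iU_i'$ from residuals of m sources and plugs it into an off-the-shelf sparse precision estimator; scikit-learn's graphical_lasso is used only as a stand-in for the CLIME solver in (4.4).

```python
import numpy as np
from sklearn.covariance import graphical_lasso  # stand-in for the CLIME solver

rng = np.random.default_rng(0)
p, m, n_i = 30, 5, 80                 # hypothetical: p variables, m sources, n_i samples per source
N = m * n_i                           # total aggregated sample size

# Pool (toy) homogeneous residual matrices U_i of shape (p, n_i) across the m sources.
U_list = [rng.standard_normal((p, n_i)) for _ in range(m)]
Sigma_N = sum(U @ U.T for U in U_list) / N      # oracle-type pooled covariance Sigma_N

# Plug the pooled covariance into a sparse precision-matrix estimator.
_, Omega_hat = graphical_lasso(Sigma_N, alpha=0.1)
print(np.max(np.abs(Omega_hat - np.eye(p))))    # here the true precision matrix is the identity
```

The same pooled covariance would be passed to CLIME in the actual procedure; only the downstream solver differs in this sketch.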

Appendix F: Technical lemmas

Lemma F.1. (i) $\|\Lambda'U\|_F^2=O_P(np)$,

(ii) $\|U'U\|_F^2=O_P(np^2+pn^2)$,

(iii) $\|U'UF\|_F^2=O_P(np^2+pn^2)$.

Proof. We simply apply Markov inequality to get the rates.

EΛUF2=E[tr(ΛUUΛ)]=ntr(ΛΣΛ)nΣtr(ΛΛ)=O(np).
EUUF2=E[t=1nt=1n(j=1pujtujt)2]=j1,j2=1p(t=1nE[uj1t2uj2t2]+1tt1nσj1j22)=OP(np2+pn2),

since j1,j2σj1j22=tr(Σ2)Σtr(Σ)=O(p).

EUUFF2=E[t=1nk=1K(t=1nj=1pujtujtftk)2]=k=1Kj1,j2=1p(t=1nE[uj1t2uj2t2]ftk2+1tt1nσj1j22ft1k2)=OP(np2+pn2).

Lemma F.2. (i) $\|\Lambda'U\|_{\max}=O_P(\sqrt{p\log n})$,

(ii) $\|U'U\|_{\max}=O_P(p)$,

(iii) $\|U'UF\|_{\max}=O_P(\sqrt{np\log n}+p\sqrt{\log n})$.

Proof. (i) $\|\Lambda'U\|_{\max}=\max_{t,k}|u_t'\lambda_k|$, where $\lambda_k$ is the kth column of Λ. Since $u_t'\lambda_k$ is mean-zero sub-Gaussian with variance proxy $\lambda_k'\Sigma\lambda_k\le\|\Sigma\|\|\lambda_k\|^2=O(p)$, we have $\|\Lambda'U\|_{\max}=O_P(\sqrt{p\log n})$.

(ii) $\|U'U\|_{\max}=\max_{t,t'}|u_t'u_{t'}|\le\max_{t\ne t'}|u_t'u_{t'}|+\max_t|u_t'u_t|$. We need to bound each term separately. The second term is bounded by the upper tail of the Hanson–Wright inequality for sub-Gaussian vectors [24, 41], i.e.

$\mathbb{P}\big(\|u_t\|^2>\mathrm{tr}(\Sigma)+2\sqrt{\mathrm{tr}(\Sigma^2)\,s}+2\|\Sigma\|s\big)\le e^{-s}.$

Choosing s = log n and applying the union bound, we have $\max_t|u_t'u_t|=O_P(\mathrm{tr}(\Sigma)+2\sqrt{\mathrm{tr}(\Sigma^2)\log n})=O_P(p+\sqrt{p\log n})=O_P(p)$. We then deal with the first term. By the Chernoff bound,

(maxtt|utut|>s)2n2esθE[exp(θutut)],

where E[exp(θutut)]=E[exp(θ2utΣut/2)]E[exp(Cθ2ut2)]. [24] showed that

E[exp(ηut2)]exp(tr(Σ)η+tr(Σ2)η212Ση)

For η<1/(4Σ)tr(Σ)/(4tr(Σ2)), the right hand side is less than exp(3tr(Σ)η/2) ≤ exp(Cpη). Choose η = 2, we have

(maxtt|utut|>s)2n2exp(sθ+Cθ2p).

We minimize the right hand side and choose θ = s/(2Cp), it is easy to check η<1/(4Σ) and see that maxtt|utut|=Op(plogn). So we conclude that UUmax=Op(p).

(iii) Let f¯k be the kth column of F. UUFmax=maxt,k|utUf¯k|maxt,k|utU(t)f¯k(t)|+maxt,k|ututftk| where U(t),f¯k(t) are U and f¯k canceling the tth column and element respectively. From (ii) we know the second term is of order Op(pmaxtk|ftk|)=Op(plogn). Define ξ=U(t)f¯k(t) subGaussian (0,Σf¯k(t)2), which is independent with ut. Thus

(maxt,k|utξ|>s)2nKesθE[exp(θutξ)],

where E[exp(θutξ)]E[exp(θ2utΣutf¯k(t)2/2)]E[exp(Cθ2nut2)]. Similar to (ii), we choose η = 2n here. It is not hard to see maxt,k|utξ|=Op(nplogn). Thus UUFmax=Op(nplogn+plogn). □

Lemma F.3. (i) $\|F'U'\|_F^2=O_P(np)$,

(ii) $\|U'\Phi(W)\|_F^2=O_P(npJ)$ and $\|U'\Phi(W)B\|_F^2=O_P(np)$,

(iii) $\|\Phi(W)'UF\|_F^2=O_P(npJ)$ and $\|B'\Phi(W)'UF\|_F^2=O_P(np)$.

Proof. These results can be found in Fan, Liao and Wang [18]. However, the conditions used there differ slightly from ours: we assume no temporal (sample) dependence and only require a bounded $\|\Sigma\|_2$ instead of a bounded $\|\Sigma\|_1$. By the Markov inequality, it suffices to show that the expected value of each term attains the corresponding rate of convergence.

EFUF2=E[tr(FE[UU]F)]=E[tr(Ftr(Σ)F)]=ntr(Σ)=O(np).
EUΦ(W)F2=E[tr(ΦE[UU|W]Φ)]=nE[tr(ΦΣΦ)]nJdE[ΦΣΦ2]nJdC0E[ΦΦ2]=O(npJ).

EΦ(W)UFF2=E[tr(ΦE[UFFU|W]Φ)]=E[tr(FF)tr(ΦΣΦ)]=O(npJ).EUΦ(W)BF2 and BΦ(W)UFF2 are both O(np) following the same proof as above. Thus the proof is complete. □

Lemma F.4. (i) $\|F'U'\|_{\max}=O_P(\sqrt{n\log p})$,

(ii) $\|U'\Phi(W)\|_{\max}=O_P(\phi_{\max}\sqrt{p\log(nJ)})$ and $\|U'\Phi(W)B\|_{\max}=O_P(\sqrt{p\log n})$,

(iii) $\|\Phi(W)'UF\|_{\max}=O_P(\phi_{\max}\sqrt{np\log J})$ and $\|B'\Phi(W)'UF\|_{\max}=O_P(\sqrt{np})$.

Proof. (i) It is not hard to see FUmax=maxkK,jp|t=1nftkujt|=Op(nlogp). The detailed proof by Chernoff bound is given in the following. By union bound and Chernoff bound, we have

(maxkK,ip|t=1nftkujt|>t)2pKetθE[eθt=1nftkujt].

The expectation is calculated by first conditioning on F,

E[eθt=1nftkujt]=E[E[eθt=1nftkujt|F]]E[eθ2t=1nftk2σjj/2]e12nC0θ2,

where the second equality uses the sub-Gaussianity of ujt and the last inequality is from n1FF=I and Σ2C0. Therefore, choosing θ=tnC0, we have

(maxkK,jp|t=1nftkujt|>t)2pKetθeC02nθ2=2pKet22C0n.

Thus FUmax=Op(nlogp).

(ii) $\|U'\Phi(W)\|_{\max}=\max_{\nu,l,t}|\sum_{j=1}^p u_{jt}\phi_\nu(W_{jl})|=\max_{\nu,l,t}|\bar\phi_{\nu l}'u_t|$, where $\bar\phi_{\nu l}=(\phi_\nu(W_{1l}),\dots,\phi_\nu(W_{pl}))'$. Consider the tail probability conditional on W:

(maxνJ,ld,kn|ϕ¯νluk|>t|W)2JdnetθE[eθϕ¯νluk|W]2Jdnexp{tθ+12θ2ϕ¯νlΣϕ¯νl}.

The right hand side can be further bounded by

2Jdnexp(tθ+12θ2C0ϕ¯νl2)2Jdnexp(tθ+12pC0θ2ϕmax2).

Choosing θ to minimize the upper bound and taking the expectation with respect to W, we obtain

(maxνJ,ld,kn|ϕ¯νluk|>t)2Jdnexp{t22pC0ϕmax2}.

Finally, choosing $t\asymp\phi_{\max}\sqrt{p\log(nJ)}$ with a proper constant, the tail probability is arbitrarily small. So $\|U'\Phi(W)\|_{\max}=O_P(\phi_{\max}\sqrt{p\log(nJ)})$. The second part of the result follows similarly. Note that $\|U'\Phi(W)B\|_{\max}\le\|U'G(W)\|_{\max}+\|U'R(W)\|_{\max}$ and the first term dominates. So the same derivation gives

(UG(W)max>t)2Knexp{t22C0g¯k2},

where $\bar g_k=(g_k(W_1),\dots,g_k(W_p))'$. We have $\|\bar g_k\|^2=O_P(p)$ since the eigenvalues of $p^{-1}G(W)'G(W)$ are assumed to be bounded almost surely. Hence, $\|U'\Phi(W)B\|_{\max}=O_P(\sqrt{p\log n})$.

(iii) Φ(W)UFmax=maxνJ,ld,kK|j=1pi=1nϕν(Wjl)ujifik|. Using Chernoff bound again, we get

(maxνJ,ld,kK|j=1pi=1nϕν(Wjl)ujifik|>t)2JdKetθ.E[eθt=1nftkϕ¯νlut].

Since t=1nftkϕ¯νlut|F~sub-Gaussian(0,t=1nftk2ϕ¯νlΣϕ¯νl)=sub-Gaussian(0,nϕ¯νlΣϕ¯νl), the right hand side is easy to bound by first conditioning on F.

E[eθt=1nftkϕ¯νlut]E[exp(12nθ2ϕ¯νlΣϕ¯νl)]E[exp(12npC0ϕmax2θ2)].

Therefore, choosing θ=tnpC0ϕmax2, we have

(||Φ(W)UF||max>t)2JdK.exp{tθ+12npC0ϕmax2θ2}=2JdKexp{t22npC0ϕmax2}.

So we conclude ||Φ(W)UF||max=Op(ϕmaxnplogJ). By similar derivation as in (ii), we also have ||BΦ(W)UF||max and ||G(W)UF||max are both of order OP(np). □

Lemma F.5. (i) ||UUΛ||max=OP(nplogp+n||Σ||1),

(ii) ||UUΦ(W)||max=OP(ϕmax(nplogp+n||Σ||1)) and ||UUΦ(W)B||max=OP(nplogp+nJϕmax||Σ||1).

Proof. (i) ||UUΛ||maxmaxj,k|t=1nujtutλknj=1pσjjλjk|+nmaxj,kj=1p|σjj||λjk|. The second term is O(n||Σ||1). So it suffices to focus on the first term. Let Σ=AA and ut=Avt so that Var(vt) = I. Write A=(a1,,ap), so we have ujt=ajvt. Also denote dk=Aλk. Thus ujtutλk=ajvtvtdk and j=1pσjjλjk=ajdk.

(maxj,k|t=1n(ajvtvtdkajdk)|>s)pK(|t=1n(a˜jvtvtd˜ka˜jd˜k)|>smaxj,kajdk), (F.1)

where a˜j and d˜k are two unit vectors of dimension p. We will bound the right hand side with arbitrary unit vectors a˜j and d˜k.

(|t=1na˜jvtvtd˜kna˜jd˜k|>s)(|t=1n((a˜j+d˜k)vt)2n||a˜j+d˜k||2|>2s)+(|t=1n((a˜jd˜k)vt)2n||a˜jd˜k||2|>2s).

Note that (a˜j+d˜k)vt~subGaussian(0,||a˜j+d˜k||2) and ||a˜j+d˜k||24. By Bernstein inequality, we have for constant C > 0,

(|t=1n(a˜jvtvtd˜ka˜jd˜k)|>s)2exp(Cmin(s2/n,s)).

Choose s=Cnlogpmaxjkajdk in (F.1), we can easily show that the exception probability is small as long as C is large enough. Therefore, noting maxjkajdkCmaxk||λk||,maxj,k|t=1nujtutλknj=1pσjjλj,k|=OP(nlogpmaxk||λk||)=OP(nplogp). Finally ||UUΛ||max=OP(nplogp+n||Σ||1).

(ii) The rates of ||UUΦ(W)||max and |UUΦ(W)B||max can be similarly derived as (i). Denote Φνl=(ϕν(W1l),,ϕν(Wpl)), so

||UUΦ(W)||maxmaxj,ν,l|t=1nujtutΦνlnj=1pσjjϕν(Wjl)|+nmaxj,ν,lj=1p|σjj||ϕν(Wjl)|=OP(nlogpmaxν,lΦνl+nϕmaxΣ1)=OP(ϕmax(nplogp+nΣ1)).

Denote the kth column of Φ(W)B by (ΦB)k, we have

||UUΦ(W)B||maxmaxj,k|t=1nujtut(ΦB)knj=1pσjj(ΦB)jk|+nmaxj,kj=1p|σjj||(ΦB)jk|=OP(nlogpmaxk(ΦB)k+nJϕmaxΣ1)=OP(nplogp+nJϕmaxΣ1),

where we use $\max_k\|(\Phi B)_k\|\le\|\Phi B\|_F=O_P(\sqrt{p})$. □

Appendix G: More details on synthetic data analysis

G.1. Model calibration and data generation

We calibrate (estimate) the 264 × 264 covariance matrix $\hat\Sigma$ of $u_t$ by applying our proposed method to the data from the healthy group. Plugging it into the CLIME solver delivers a sparse precision matrix Ω, which is taken as the truth in the simulation. Note that, after the regularization in CLIME, Ω−1 is not the same as $\hat\Sigma$, and we set the true covariance Σ = Ω−1. To obtain the covariance matrix used in setting 1, we also calibrate, using the same method, a sub-model that involves only the first 100 regions. We then copy this 100 × 100 matrix multiple times to form a p × p block-diagonal matrix and use it for the simulations in setting 1. We describe how we calibrate these ‘true models’ and generate data from them as follows; a minimal code sketch of the generation step follows the list.

  1. (External covariates) For each j ≤ p, generate the external covariate $W_j$ i.i.d. from the multinomial distribution with $P(W_j=s)=w_s$, $s\le 10$, where $\{w_s\}_{s=1}^{10}$ are calibrated from the hierarchical clustering results of the real data.

  2. (Calibration) For the first 15 healthy subjects, obtain estimators for F, B and Γ by PPCA, resulting in $\tilde F$, $\tilde B=n^{-1}(\Phi(W)'\Phi(W))^{-1}\Phi(W)'X\tilde F$ and $\tilde\Gamma=n^{-1}(I-P)X\tilde F$ according to [18]. Use the rows of the estimated factor matrix to fit a stationary VAR model $f_t=Af_{t-1}+\epsilon_t$, where $\epsilon_t\sim N(0,\Sigma_\epsilon)$, and obtain the estimators $\tilde A$ and $\tilde\Sigma_\epsilon$.

  3. (Simulation) For each subject im, pick one of the 15 calibrated models and their associated parameters from above at random and do the following.
    • (a)
      Generate γjki (entries of Γi) i.i.d. from N(0,σ˜γ2) where σ˜γ2 is the sample variance of all entries of Γ˜. For the first three settings, compute the ‘true’ loading matrix Λi=Φ(W)B˜+Γi. For the last setting, set Λi = Γi since G(W) = 0.
    • (b)
      Generate factors $f_t^i$ from the VAR model $f_t^i=\tilde Af_{t-1}^i+\epsilon_t$ with $\epsilon_t\sim N(0,\tilde\Sigma_\epsilon)$, where the parameters $\tilde A$ and $\tilde\Sigma_\epsilon$ are taken from the fitted values in step 2.
    • (c)
      Finally, generate the observed data $X^i=\Lambda^iF^{i\prime}+U^i$, where each column of $U^i$ is randomly sampled from N(0, Ω−1) and Ω has been calibrated by the CLIME solver as described at the beginning of the section.
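A minimal sketch of the generation loop in step 3 is given below; it assumes the calibrated quantities (the loading part Φ(W)B̃, the VAR parameters Ã and Σ̃_ε, the variance σ̃_γ², and the noise covariance Ω−1) are already available as NumPy arrays, and the toy inputs in the usage example are placeholders rather than calibrated values.

```python
import numpy as np

def generate_subject(n_i, Phi_B, A_tilde, Sigma_eps, sigma_gamma, Sigma_u, rng):
    """Generate one subject's data X^i = Lambda^i F^i' + U^i from pre-calibrated inputs."""
    p, K = Phi_B.shape
    # (a) subject-specific loading: Lambda^i = Phi(W) B_tilde + Gamma^i
    Gamma_i = rng.normal(0.0, sigma_gamma, size=(p, K))
    Lambda_i = Phi_B + Gamma_i
    # (b) factors from the calibrated stationary VAR(1): f_t = A_tilde f_{t-1} + eps_t
    F = np.zeros((n_i, K))
    f = np.zeros(K)
    for t in range(n_i):
        f = A_tilde @ f + rng.multivariate_normal(np.zeros(K), Sigma_eps)
        F[t] = f
    # (c) idiosyncratic noise with covariance Sigma_u = Omega^{-1}, then the observed data
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=n_i).T   # p x n_i
    return Lambda_i @ F.T + U                                       # p x n_i

# Toy usage (placeholder inputs, for illustration only)
rng = np.random.default_rng(1)
p, K, n_i = 50, 3, 100
X_i = generate_subject(n_i, rng.standard_normal((p, K)), 0.5 * np.eye(K),
                       np.eye(K), 0.3, np.eye(p), rng)
print(X_i.shape)   # (50, 100)
```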

G.2. More on pervasiveness

In this subsection, we discuss the pervasiveness assumption, which requires the spiked eigenvalues to grow with order p, and present the numerical performance of ALPHA for different levels of cmin and cmax (defined in Assumption 2.3). This gives the reader a rough idea of how the spikiness (or the constants in front of the rates) affects the performance. We particularly consider the cases where cmax is small or cmin is large. As a first step, we verify that the real data are consistent with the pervasiveness assumption.

Denote the maximum and minimum eigenvalues of the matrix ΛΛ/p by λmax and λmin respectively, and denote the maximum eigenvalue of the matrix UU/p by λmaxu. We first investigate the magnitude of λmin, λmax and λmaxu derived from the real data. Following exactly the same data generation procedure as in the original simulation study, we randomly generate 1,000 subjects. We find that λmax has mean 15.352 and standard deviation 4.918, λmin has mean 10.069 and standard deviation 5.416 and λmaxu has mean 1.317 and standard deviation 0.119. We also investigate the signal-to-noise ratio λminmaxu, which has mean 7.711 and standard deviation 4.230. Therefore, our real data demonstrates a spiked covariance structure while the spikes are not extremely spiky.
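For concreteness, the eigenvalue diagnostics described above can be computed as in the sketch below; the toy loadings, the normalization by p, and reading the flattened notation as Λ'Λ/p and UU'/p are our assumptions, not part of the original analysis.

```python
import numpy as np

def spikiness_summary(Lambda, U):
    """Return (lambda_max, lambda_min, lambda_max_u) for Lambda'Lambda/p and UU'/p."""
    p = Lambda.shape[0]
    eig_load = np.linalg.eigvalsh(Lambda.T @ Lambda / p)   # K eigenvalues of Lambda'Lambda/p
    eig_noise = np.linalg.eigvalsh(U @ U.T / p)            # eigenvalues of UU'/p
    return eig_load.max(), eig_load.min(), eig_noise.max()

# Toy inputs (illustration only); in the study these come from the calibrated model.
rng = np.random.default_rng(2)
p, K, n = 264, 3, 100
Lambda = rng.standard_normal((p, K))
U = rng.standard_normal((p, n))
lam_max, lam_min, lam_max_u = spikiness_summary(Lambda, U)
print(lam_max, lam_min, lam_max_u, lam_min / lam_max_u)    # last value: signal-to-noise ratio
```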

Then we manipulate the data generation process corresponding to two different cases. One is to multiply the original loading matrix Λ by 3, called Modified (a), while the other is to divide Λ by 3, called Modified (b). Note that in the case of Modified (b), λmin will be 1/9 of the original λmin and thus smaller than λmaxu, so we do not see a clear eigen-gap in this case. Table 3 compares the performance of recovering the precision matrix Ω under the original and modified settings when ni = 100.

Table 3.

Gaussian Graphical Model Analysis

             $\|\hat\Omega-\Omega\|_{\max}$   $\|\hat\Omega-\Omega\|_1$   $\|\hat\Omega-\Omega\|_2$
Original            0.564                        3.445                       1.188
Modified (a)        0.524                        3.052                       1.066
Modified (b)        0.749                        4.914                       1.719

We can see from the table above that the performance of ALPHA in the case of Modified (a) is slightly better than that in the original case. Note that increasing cmin makes the heterogeneity part more spiky. Larger cmin allows PCA or PPCA to distinguish the spiky heterogeneity term more easily. In contrast, decreasing cmax makes the original spiky heterogeneity term hard to detect. We also tend to miss several heterogeneity factors while extracting them. Therefore, in Modified (b), the estimation error becomes significantly larger compared with the original case.

G.3. Sensitivity analysis on the number of factors

In this section, we study, through simulations, how the estimated number of factors affects the recovery of the Gaussian graphical model. The specification of the number of factors is critical to the validity of our ALPHA method, which motivates us to first assess the performance of $\hat K$ and $\tilde K$ on our simulated datasets. Recall that

$\hat K=\arg\max_{k\le K_{\max}}\lambda_k(XX')/\lambda_{k+1}(XX'),\qquad \tilde K=\arg\max_{k\le K_{\max}}\lambda_k(X'PX)/\lambda_{k+1}(X'PX),$

where P is the projection operator defined in (3.5) of the main text. The final estimator of the number of factors, denoted by Ǩ, comes from the heuristic strategy we developed for choosing between PCA and PPCA: we choose PCA if $\lambda_{\hat K}(XX')/\lambda_{\hat K+1}(XX')\ge\lambda_{\tilde K}(X'PX)/\lambda_{\tilde K+1}(X'PX)$, and PPCA otherwise. The intuition is that we favor the method that yields the larger eigen-ratio between the spiked and non-spiked parts of the covariance.
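A minimal sketch of this eigenvalue-ratio rule is given below, assuming X is stored as a p × n array and Φ(W) as a p × Jd sieve matrix (both toy inputs here); K_max denotes the user-chosen search cap and should stay below the rank of X'PX.

```python
import numpy as np

def eigen_ratio_k(M, K_max):
    """argmax_{k <= K_max} lambda_k(M) / lambda_{k+1}(M), plus the achieved ratio."""
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]          # eigenvalues in descending order
    ratios = lam[:K_max] / lam[1:K_max + 1]
    k = int(np.argmax(ratios)) + 1
    return k, ratios[k - 1]

def choose_num_factors(X, Phi, K_max):
    """Heuristic: compare the PCA and projected-PCA eigen-ratios and keep the larger one."""
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)       # projection onto the column space of Phi(W)
    K_hat, r_pca = eigen_ratio_k(X @ X.T, K_max)        # PCA: eigenvalues of XX'
    K_til, r_ppca = eigen_ratio_k(X.T @ P @ X, K_max)   # PPCA: eigenvalues of X'PX
    return K_hat if r_pca >= r_ppca else K_til

# Toy usage: two spiked factors plus noise, with a random sieve basis (illustration only)
rng = np.random.default_rng(3)
p, n, J = 80, 60, 5
X = rng.standard_normal((p, 2)) @ rng.standard_normal((2, n)) * 3 + rng.standard_normal((p, n))
Phi = rng.standard_normal((p, J))
print(choose_num_factors(X, Phi, K_max=4))
```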

Analogous to the simulation study in our paper, we generate m = 1,000 subjects' BOLD data based on the calibrated “true” models. We investigate the accuracy of the proposed $\hat K$, $\tilde K$ and Ǩ for two cases: (i) ni = 20, p = 264 and (ii) ni = 100, p = 264, presented in Table 4. As we can see from the table, when ni is small, $\tilde K$ outperforms $\hat K$, and when ni is large, $\hat K$ is better. Note also that our heuristic estimator Ǩ performs well for both large and small ni.

Table 4.

Accuracy of K^, K˜ and Ǩ

ni = 20 ni = 100
TotErr OverEst UnderEst TotErr OverEst UnderEst
K^ 38.7% 0% 38.7% 0.7% 0% 0.7%
K˜ 29.7% 6.8% 22.9% 4.7% 2.7% 2.0%
Ǩ 29.7% 6.8% 22.9% 3.5% 2.3% 1.2%

Given the performance of our proposed estimators of the factor number, we now artificially enlarge this estimation error and examine how it affects the Gaussian graphical model analysis. Let η be a random perturbation with P(η = 0) = 1/2, P(η = 1) = 1/3 and P(η = 2) = 1/6. Define K+ := K + η and K− := max(K − η, 0), where K is the true number of factors. As the notation indicates, K+ overestimates the factor number while K− underestimates it. Since P(η ≠ 0) = 1/2, their estimation accuracy is only 50%, worse than that of $\hat K$ and $\tilde K$ presented above. We use K+ and K− as the estimators of the number of factors to recover the precision matrix of U and compare their performance with that of Ǩ. The results are presented in Table 5.

Table 5.

Gaussian Graphical Model Analysis

ni = 20 ni = 100
        $\|\hat\Omega-\Omega\|_{\max}$  $\|\hat\Omega-\Omega\|_1$  $\|\hat\Omega-\Omega\|_2$    $\|\hat\Omega-\Omega\|_{\max}$  $\|\hat\Omega-\Omega\|_1$  $\|\hat\Omega-\Omega\|_2$
Oracle        0.687        4.131        1.311          0.335        2.018        0.695
Ko            0.873        2.824        1.351          0.536        2.006        2.017
Ǩ             1.156        8.581        2.950          0.564        3.445        1.188
K+            0.771        3.27         1.49           0.586        2.154        1.074
K−            1.618        11.384       4.062          1.84         15.133       4.941

“Oracle” above means that we directly use the generated noise U to calculate its sample covariance and plug it into CLIME to recover the precision matrix. Ko means that we know the true number of pervasive factors and use PCA or Projected-PCA (choosing the method that yields the larger eigen-ratio) to adjust for the factors. As we can see from the table above, K+ is nearly as good as Ko, which means that overestimating the number of factors does not hurt the recovery accuracy. In contrast, underestimating the number of factors seriously increases the estimation error of Ω, as shown by K−, because the unadjusted pervasive factors heavily corrupt the covariance of U. Nevertheless, both K+ and K− use partial information about the true number of factors. In comparison, our procedure Ǩ, without any prior knowledge of the number of factors, performs well in recovering Ω.
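For reference, the three discrepancy measures reported in Tables 3 and 5 can be computed as in the sketch below; interpreting $\|\cdot\|_1$ as the matrix ℓ1 operator norm (maximum absolute column sum) and $\|\cdot\|_2$ as the spectral norm is our reading of the notation.

```python
import numpy as np

def precision_errors(Omega_hat, Omega):
    """Max-norm, matrix l1-operator-norm, and spectral-norm errors between two precision matrices."""
    D = Omega_hat - Omega
    err_max = np.max(np.abs(D))        # ||.||_max : largest absolute entry
    err_l1 = np.linalg.norm(D, 1)      # ||.||_1   : maximum absolute column sum
    err_spec = np.linalg.norm(D, 2)    # ||.||_2   : largest singular value
    return err_max, err_l1, err_spec

# Toy usage with hypothetical matrices
rng = np.random.default_rng(4)
Omega = np.eye(5)
Omega_hat = Omega + 0.05 * rng.standard_normal((5, 5))
print(precision_errors(Omega_hat, Omega))
```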

Contributor Information

Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, jqfan@princeton.edu.

Han Liu, Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA, hanliu@northwestern.edu.

Weichen Wang, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, nickweichwang@gmail.com.

Ziwei Zhu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA, zzw9348ustc@gmail.com.

References

  • [1].Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227. MR3064065
  • [2].Alter O, Brown PO and Botstein D (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences 97 10101–10106.
  • [3].Bai J (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171. MR1956857
  • [4].Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221. MR1926259
  • [5].Bai J and Ng S (2013). Principal components estimation and identification of static factors. Journal of Econometrics 176 18–29. MR3067022
  • [6].Biswal BB, Mennes M, Zuo X-N, Gohel S, Kelly C, Smith SM, Beckmann CF, Adelstein JS, Buckner RL and Colcombe S (2010). Toward discovery science of human brain function. Proceedings of the National Academy of Sciences 107 4734–4739.
  • [7].Cai TT, Li H, Liu W and Xie J (2012). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika ass058 MR3034329
  • [8].Cai TT, Li H, Liu W and Xie J (2015). Joint estimation of multiple high-dimensional precision matrices. The Annals of Statistics 38 2118–2144. MR3497754
  • [9].Cai TT, Liu W and Luo X (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594–607. MR2847973
  • [10].Cai TT, Ma Z and Wu Y (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics 41 3074–3110. MR3161458
  • [11].Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L and Liu C (2011). Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PloS one 6 e17238.
  • [12].Chen X (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics 6 5549–5632.
  • [13].Connor G, Hagmann M and Linton O (2012). Efficient semiparametric estimation of the fama–french model and extensions. Econometrica 80 713–754. MR2951947
  • [14].Connor G and Linton O (2007). Semiparametric estimation of a characteristic-based factor model of common stock returns. Journal of Empirical Finance 14 694–717.
  • [15].Danaher P, Wang P and Witten DM (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 373–397. MR3164871
  • [16].Fan J, Ke Y and Wang K (2016). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490
  • [17].Fan J, Liao Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680. MR3091653
  • [18].Fan J, Liao Y and Wang W (2016). Projected principal component analysis in factor models. The Annals of Statistics 44 219–254. MR3449767
  • [19].Fan J, Rigollet P and Wang W (2015). Estimation of functionals of sparse covariance matrices. Annals of statistics 43 2706 MR3405609
  • [20].Friedman J, Hastie T and Tibshirani R (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9 432–441.
  • [21].Guo J, Cheng J, Levina E, Michailidis G and Zhu J (2015). Estimating heterogeneous graphical models for discrete data with an application to roll call voting. The annals of applied statistics 9 821 MR3371337
  • [22].Guo J, Levina E, Michailidis G and Zhu J (2011). Joint estimation of multiple graphical models. Biometrika asq060 MR2804206
  • [23].Higgins J, Thompson SG and Spiegelhalter DJ (2009). A reevaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172 137–159. MR2655609
  • [24].Hsu D, Kakade SM and Zhang T (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab 17 MR2994877
  • [25].Johnson WE, Li C and Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8 118–127.
  • [26].Johnstone IM and Lu AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 682–693. MR2751448
  • [27].Lam C and Fan J (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics 37 4254 MR2572459
  • [28].Lam C and Yao Q (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40 694–726. MR2933663
  • [29].Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K and Irizarry RA (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11 733–739.
  • [30].Leek JT and Storey JD (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3 1724–1735.
  • [31].Liu H, Han F and Zhang C. h. (2012). Transelliptical graphical models. In Advances in Neural Information Processing Systems
  • [32].Liu H, Lafferty J and Wasserman L (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research 10 2295–2328. MR2563983
  • [33].Loh P-L and Wainwright MJ (2013). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics 41 3022–3049. MR3161456
  • [34].Lorentz GG (2005). Approximation of functions, vol. 322 American Mathematical Soc. MR0213785
  • [35].Meinshausen N and Buhlmann P (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 1436–1462. MR2278363
  • [36].Negahban S and Wainwright MJ (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics 1069–1097. MR2816348
  • [37].Onatski A (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168 244–258. MR2923766
  • [38].Paul D (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 17 1617 MR2399865
  • [39].Power JD, Cohen AL, Nelson SM, Wig GS, Barnes KA, Church JA, Vogel AC, Laumann TO, Miezin FM and Schlaggar BL (2011). Functional network organization of the human brain. Neuron 72 665–678.
  • [40].Ravikumar P, Wainwright MJ, Raskutti G and Yu B (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5 935–980. MR2836766
  • [41].Rudelson M and Vershynin R (2013). Hanson-Wright inequality and sub-gaussian concentration. Electron. Commun. Probab 18 MR3125258
  • [42].Shen X, Pan W and Zhu Y (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107 223–232. MR2949354
  • [43].Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ and Clarke RB (2008). The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets–improving meta-analysis and prediction of prognosis. BMC medical genomics 1 42.
  • [44].Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American statistical association 97 1167–1179. MR1951271
  • [45].Verbeke G and Lesaffre E (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association 91 217–221.
  • [46].Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Annals of statistics 45 1342–1374. MR3662457
  • [47].Yang S, Lu Z, Shen X, Wonka P and Ye J (2015). Fused multiple graphical lasso. SIAM Journal on Optimization 25 916–943. MR3343365
  • [48].Yuan M (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research 11 2261–2286. MR2719856
  • [49].Yuan M and Lin Y (2007). Model selection and estimation in the gaussian graphical model. Biometrika 94 19–35. MR2367824
