Published in final edited form as: Stat Sin. 2023 Jan;33(1):27–53. doi: 10.5705/ss.202020.0145

HIGH-DIMENSIONAL FACTOR REGRESSION FOR HETEROGENEOUS SUBPOPULATIONS

Peiyao Wang 1, Quefeng Li 1, Dinggang Shen 2,3,4, Yufeng Liu 1

Abstract

In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition of heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model and a group-specific model. The global model ignores the data heterogeneity, while the group-specific model fits each subgroup separately. We prove the estimation and prediction consistency for our proposed estimators, and show that they have better convergence rates than those of the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a mis-specified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer’s Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.

Keywords: Factor models, heterogeneity, penalized regression, prediction

1. Introduction

Data heterogeneity is an important issue in modern complex data analysis. In practice, data heterogeneity may come from variables or samples. More specifically, multi-modality/source data have heterogeneity among the variables, because they may correspond to different types of measurements. For example, in biomedical imaging, people may acquire both MRI and PET images (Zhang et al. (2011)). In genomics studies, measurements are collected from different sources, such as mRNA and miRNA (Muniategui et al. (2013)). In addition to variable heterogeneity, data heterogeneity can also arise from samples. For example, there can be subpopulations, batch and clustering effects, or outliers in the data (Bühlmann (2016)), potentially violating the standard independent and identically distributed (i.i.d.) assumption. Ignoring such heterogeneity can lead to poor estimation and prediction. Hence, it is important to take data heterogeneity into account during the modeling process.

In this study, we are interested in data heterogeneity that comes from subgroup populations. For example, in the Alzheimer’s Disease (AD) study, subjects can have five subtypes: Normal Control (NC), Significant Memory Concern (SMC), Early Mild Cognitive Impairment (eMCI), Late Mild Cognitive Impairment (lMCI), and AD, where these subtypes are ordered by disease severity. Owing to data heterogeneity, it can be difficult to build accurate and interpretable predictive models on such data using traditional statistical techniques. A global model that fits a single regression model to all the data may be restrictive because it ignores the group label information, whereas fitting distinct regression models in each group may not be optimal because this does not capture shared information across groups. Hence, a statistical regression model that can recover interpretable globally shared and group-specific signals in the data is required to handle such heterogeneous data. In the literature, varying coefficient models (Hastie and Tibshirani (1993)) and mixed-effects models (Pinheiro and Bates (2000)) are useful in addressing data heterogeneity. However, these models can be computationally expensive to use in practice, especially when the dimension is high. More recently, Vicari and Vichi (2013) proposed a general regression model to account for both between-cluster and within-cluster variation. Meinshausen and Bühlmann (2015) proposed a maximin-effects approach under the mixture model. Zhao, Cheng and Liu (2016) proposed a partially linear regression framework to model massive heterogeneous data. Tang and Song (2016) and Ma and Huang (2017) proposed fused penalties to estimate regression coefficients in order to identify subpopulations. Wang, Liu and Shen (2018) proposed a locally weighted penalized model by incorporating a progression score in the local kernels. However, these models are not designed to characterize the globally shared and group-specific structures. Thus, it is desirable to build a model that can identify such structures, quantify prediction errors, and draw interpretable and generalizable scientific conclusions.

There is a large body of literature on data heterogeneity for unsupervised learning. Principal component analysis (PCA) (Wold, Esbensen and Geladi (1987)) techniques are popular, owing to their computational simplicity and theoretical soundness. The joint and individual variations explained (JIVE) method (Lock et al. (2013)) decomposes joint and individual low-rank signals across multiple sources of data. More recent extensions of JIVE can be found in Feng et al. (2018); Gaynanova and Li (2019); Park and Lock (2020). These methods can be extended easily to decompose data from multiple subgroups. Zhou et al. (2015) proposed a matrix factorization framework for common and individual feature extraction for multi-block data.

Closely related to PCA, another popular technique for handling data heterogeneity is factor models. Factor models are useful unsupervised learning tools that model the dependence between multiple variables. The relationship between PCA and factor models is well studied in the literature (Joliffe and Morgan (1992); Stock and Watson (2002); Bai and Ng (2002)). Factor models assume that the variations among the variables are driven by latent factors residing in a low-dimensional space. More recently, Fan et al. (2018) proposed a factor model framework to model the heterogeneity from different subgroups. They used the factor model in the context of Gaussian graphical models to estimate common and individual graphs from different groups. Their structural assumption on the data matrices can be generalized to predictive modeling.

Here, we focus on supervised learning, and propose a novel factor regression model for heterogeneous data with jointly shared and group-specific structures. We assume that the leading factors in each group drive the majority of variation, which contributes to the heterogeneity effects. After the majority of the variation has been removed, the residual signals are assumed to be homogeneous across subgroups; that is, they have the same covariance matrix. Under this framework, the predictors in the proposed model can be decomposed into heterogeneous factors and homogeneous signals. Correspondingly, in our proposed model, the regression coefficients associated with the factors are group specific, whereas the regression coefficients associated with the homogeneous signals are shared across groups. We use PCA to estimate the factors and homogeneous signals. Because the estimated factors and homogeneous signals are orthogonal, their coefficients can be estimated separately. The low-dimensional heterogeneous regression coefficients can be estimated directly using the ordinary least squares (OLS) method. After projecting the responses on the estimated factors in each group, their residuals can be aggregated together to perform a global regression. When the dimension is high, the homogeneous signals’ coefficients are difficult to estimate. Building on existing penalization methods (Hoerl and Kennard (2000); Tibshirani (1996); Zou and Hastie (2005)), we propose a flexible penalized least squares method to solve for the high-dimensional coefficients. In the least squares problem, we use the adaptive thresholding estimator (Cai and Liu (2011)) to estimate the covariance of the homogeneous signals. For prediction, we propose a data-driven trace maximization step to estimate the factors and homogeneous signals in the test set before applying our model for prediction.

We establish the estimation consistency for our proposed estimators using either an $\ell_2$ or $\ell_1$ penalty. In terms of the prediction accuracy, we study the prediction error of our method in both theoretical and simulation studies, and demonstrate that the proposed model attains a good balance between a global model and a group-specific model. Furthermore, we show that our method is robust when the underlying model is group specific, and has comparable prediction performance with respect to the group-specific model. We apply our method to an Alzheimer’s Disease Neuroimaging Initiative (ADNI) data set and an aggregated microarray data set to show the competitiveness of our model in terms of model prediction and interpretability.

The rest of paper is organized as follows. In Section 2, we introduce the factor decomposition of heterogeneous and homogeneous signals and a corresponding regression model. In Section 3, we introduce the model estimation and a data-driven approach to estimate the factors in the testing data for prediction. In Section 4, we study the estimation and prediction consistency of our proposed method, and compare it with those of group-specific and global models under different scenarios. In Section 5, we conduct simulated experiments to evaluate the performance of our model under different settings, and compare them with that of the global and group-specific models. In Section 6, we apply our model to the ADNI data to predict the clinical score. We conclude the paper with a discussion in Section 7.

2. Motivation and Model Framework

Factor models are useful for modeling the dependence between multiple variables, if these variables are driven by some latent factors. For heterogeneous data, the subgroup heterogeneity can be captured by the group-specific latent factors. After removing such latent factors, different subgroups can be viewed as homogeneous samples for a joint analysis. In this section, we first motivate our proposed model by introducing two simple models in Section 2.1. Then, we briefly review the factor decomposition for heterogeneous data and propose our new factor regression model in Section 2.2.

2.1. Motivation

We first introduce some notation. Assume that the data come from $G$ groups. There are $n_g$ samples in the $g$th group, each having the same set of $p$ explanatory variables. Let $\{X_g, Y_g\}_{g=1}^G$ be the observations from the $G$ groups, where $X_g \in \mathbb{R}^{n_g \times p}$ is the data matrix and $Y_g \in \mathbb{R}^{n_g}$ is the response vector.

There are two commonly used approaches in the regression setup for heterogeneous subpopulations. On the one hand, ignoring the group information, one can use a global model:

$$Y = \mu^* + X\beta^* + \epsilon, \tag{2.1}$$

where $Y = (Y_1^\top, \ldots, Y_G^\top)^\top$ and $X = (X_1^\top, \ldots, X_G^\top)^\top$. In this model, all the subgroups share the same intercept and regression coefficients. The global model ignores the heterogeneity from subgroups and may be too restrictive. On the other hand, by modeling each group separately, one may consider a group-specific model:

$$Y_g = \mu_g^* + X_g\beta_g^* + \epsilon_g. \tag{2.2}$$

However, this model may not be efficient because it ignores the shared information across subgroups. These global and group-specific models motivate us to consider a model in between, under which the group-specific heterogeneity and homogeneity across subgroups can both be accounted for. This can be achieved by using a factor model that decomposes covariates into the heterogeneous and homogeneous components.

2.2. Factor model framework

To model the heterogeneous effect introduced by groups, assume that the data matrix $X_g$ can be decomposed as

$$X_g = F_g\Lambda_g + U_g, \tag{2.3}$$

where $F_g \in \mathbb{R}^{n_g \times K_g}$ is the factor matrix, $\Lambda_g \in \mathbb{R}^{K_g \times p}$ is the loading matrix, and $U_g \in \mathbb{R}^{n_g \times p}$ denotes the homogeneous signals, also known as idiosyncratic errors in the factor model literature (Bai and Ng (2008)). The number of random factors $K_g$ can vary among groups.

Denote the $i$th rows of $X_g$, $F_g$, and $U_g$ by $x_{g,i}$, $f_{g,i}$, and $u_{g,i}$, respectively. By (2.3), we have $x_{g,i} = \Lambda_g^\top f_{g,i} + u_{g,i}$. We assume $f_{g,i}$ and $u_{g,i}$ are uncorrelated and satisfy $E(f_{g,i}) = 0$, $\mathrm{cov}(f_{g,i}) = I_{K_g \times K_g}$, $E(u_{g,i}) = 0$, and $\mathrm{cov}(u_{g,i}) = \Sigma_u$. Hence, for each sample in group $g$, we have $\mathrm{cov}(x_{g,i}) = \Lambda_g^\top\Lambda_g + \Sigma_u$, which is the sum of the group-specific low-rank matrix $\Lambda_g^\top\Lambda_g$, capturing the group-specific heterogeneity, and the matrix $\Sigma_u$, which is homogeneous across different groups.

We adopt the approximate factor model (Stock and Watson (2002)) by assuming that $\Sigma_u$ is sparse. Its sparsity can be characterized by $m_p$, defined as

$$m_p = \max_{i \le p}\sum_{j=1}^p I(\sigma_{u,ij} \neq 0),$$

which is the maximum number of nonzero entries in any row of $\Sigma_u$.

Under the decomposition (2.3), we have the following regression model for the $g$th group:

$$Y_g = \mu_g^* + F_g\gamma_g^* + U_g\beta^* + \epsilon_g. \tag{2.4}$$

Here, $\mu_g^*$ is the true group mean vector, $\gamma_g^* \in \mathbb{R}^{K_g}$ denotes the true group-specific coefficients for $F_g$, $\beta^* \in \mathbb{R}^p$ denotes the common coefficients shared across the $G$ groups for $U_g$, and $\epsilon_g$ is the noise term with variance $\sigma^2$. In (2.4), the $\gamma_g^*$ vary across the $G$ groups, and they characterize the heterogeneity induced by the factors in the regression model. Moreover, the group mean term $\mu_g^*$ contributes to the heterogeneity in the regression model (2.4). When the heterogeneous effect is removed from (2.4), we have the same coefficients $\beta^*$ for $U_g$ across the $G$ groups.

From (2.4), we can see that the heterogeneity is modeled by $\mu_g^* + F_g\gamma_g^*$. After adjusting for this heterogeneous term, the remainder term $U_g\beta^*$ is homogeneous. Model (2.4) implies that, for the response $y_{g,i}$ of the $i$th subject in group $g$, we have $\mathrm{var}(y_{g,i}) = \gamma_g^{*\top}\gamma_g^* + \beta^{*\top}\Sigma_u\beta^* + \sigma^2$. This shows that the variance can be decomposed as the sum of a group-specific part $\gamma_g^{*\top}\gamma_g^*$, a homogeneous part $\beta^{*\top}\Sigma_u\beta^*$, and the background noise $\sigma^2$. This decomposition allows us to account for the heterogeneity among subgroups, while also borrowing information across subgroups to model homogeneous effects.

One special case of our proposed model (2.4) is when there is no group-specific factor; that is, $F_g = 0$. Then, it reduces to a mean-specific model:

$$Y_g = \mu_g^* + X_g\beta^* + \epsilon_g. \tag{2.5}$$

This model lies between the global model (2.1) and the group-specific model (2.2). It is different from (2.1) because it adjusts the group mean. It is different from (2.2) because different groups share the common regression coefficients. We refer to (2.5) as the “Factor-0” model.

3. Model Estimation and Prediction

In this section, we introduce the model estimation procedure and a data-driven way to estimate the factors in the testing data for prediction. The overall training procedure consists of two steps. First, we estimate the factors and homogeneous signals from the training data. Second, we estimate the regression coefficients using the estimated factors and homogeneous signals. In Section 3.1, we introduce how the factors can be estimated using a PCA. In Section 3.2, we introduce our procedure for estimating the model parameters. After the model is trained, in Section 3.3, we propose a data-driven procedure to estimate the factors in the testing data in order to make predictions.

3.1. Factor model estimation

For group $g$, the estimation of $F_g$ and $\Lambda_g$ can be formulated as the following optimization problem:

$$\min_{F_g,\Lambda_g}\|X_g - F_g\Lambda_g\|_F, \quad \text{s.t. } F_g^\top F_g = n_gI, \ \Lambda_g\Lambda_g^\top \text{ is diagonal}, \tag{3.1}$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm. The solution to (3.1) can be obtained by performing the eigendecomposition of the matrix $X_gX_g^\top$. Following the standard PCA procedure, we estimate $F_g$ by $\hat{F}_g$, where the $k$th column of $\hat{F}_g$ is $\sqrt{n_g}$ times the eigenvector corresponding to the $k$th largest eigenvalue of $X_gX_g^\top$. Then, the loading matrix $\Lambda_g$ can be estimated by regressing $X_g$ on $\hat{F}_g$ to obtain $\hat{\Lambda}_g = \hat{F}_g^\top X_g/n_g$. The homogeneous signal matrix $U_g$ can hence be estimated by the residual matrix $\hat{U}_g = X_g - \hat{F}_g\hat{\Lambda}_g$.
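To make this step concrete, the following is a minimal numpy sketch of the PCA computation described above, assuming $K_g$ is known; the function name and interface are our own and are not part of any software accompanying the paper.

```python
import numpy as np

def estimate_factors(X_g, K_g):
    """PCA-based estimation of factors, loadings, and homogeneous signals for one
    group, following (3.1): the columns of F_hat are sqrt(n_g) times the leading
    eigenvectors of X_g X_g^T."""
    n_g = X_g.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X_g @ X_g.T)   # eigendecomposition of the n_g x n_g matrix
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues in decreasing order
    V = eigvecs[:, order[:K_g]]                      # leading K_g eigenvectors
    F_hat = np.sqrt(n_g) * V                         # estimated factor matrix (n_g x K_g)
    Lambda_hat = F_hat.T @ X_g / n_g                 # estimated loading matrix (K_g x p)
    U_hat = X_g - F_hat @ Lambda_hat                 # estimated homogeneous signals (n_g x p)
    return F_hat, Lambda_hat, U_hat
```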

We now consider estimating the number of factors $K_g$. In the literature, several estimators have been proposed to solve this problem (Bai and Ng (2002); Lam and Yao (2012); Ahn and Horenstein (2013)). We consider the following estimator:

$$\hat{K}_g = \arg\max_{k \le K_{\max}}\frac{\lambda_k(X_gX_g^\top)}{\lambda_{k+1}(X_gX_g^\top)}, \tag{3.2}$$

where $\lambda_k(\cdot)$ denotes the $k$th largest eigenvalue (Lam and Yao (2012)). Here, $K_{\max}$ is a pre-determined upper bound for the number of factors. This estimator has been shown to be a consistent estimator (Ahn and Horenstein (2013)) of the true $K_g$, and is simple to implement in practice.
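A short sketch of the eigenvalue-ratio estimator (3.2) is given below; as above, the function name and the default value of $K_{\max}$ are illustrative assumptions rather than part of the authors' implementation.

```python
import numpy as np

def estimate_num_factors(X_g, K_max=10):
    """Eigenvalue-ratio estimator (3.2) of the number of factors
    (Lam and Yao (2012); Ahn and Horenstein (2013))."""
    K_max = min(K_max, X_g.shape[0] - 1)              # guard against tiny groups
    eigvals = np.linalg.eigvalsh(X_g @ X_g.T)[::-1]   # eigenvalues in decreasing order
    ratios = eigvals[:K_max] / eigvals[1:K_max + 1]   # lambda_k / lambda_{k+1}, k = 1,...,K_max
    return int(np.argmax(ratios)) + 1                 # argmax over k (1-indexed)
```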

3.2. Estimation of regression coefficients

Given $\hat{F}_g$ and $\hat{U}_g$, as discussed in Section 3.1, we can estimate the model parameters $\mu_g^*$, $\gamma_g^*$, and $\beta^*$. The factor decomposition (2.3) projects the original signals onto the low-dimensional space spanned by $F_g$ and the space spanned by $U_g$, which is orthogonal to $F_g$. Owing to the properties of the eigendecomposition, $\hat{F}_g$ and $\hat{U}_g$ are orthogonal to each other. Hence, we can estimate the regression coefficients $\gamma_g^*$ and $\beta^*$ in (2.4) separately. Given $\hat{F}_g$, $\mu_g^*$ and $\gamma_g^*$ can be estimated by the following OLS estimators:

$$\hat{\mu}_g = \bar{Y}_g, \quad \hat{\gamma}_g = \frac{\hat{F}_g^\top Y_g}{n_g}, \tag{3.3}$$

where $\bar{Y}_g$ denotes the sample mean of the response in group $g$.

Note that the factor matrix $F_g$ and the coefficients $\gamma_g^*$ are not separately identifiable, because for any orthogonal matrix $H_g$, we have $F_g\gamma_g^* = F_gH_gH_g^\top\gamma_g^*$. Hence, $(F_g, \gamma_g^*)$ cannot be identified from $(F_gH_g, H_g^\top\gamma_g^*)$. In practice, it does not matter which one is used, because the linear space spanned by the columns of $F_gH_g$ is the same as that spanned by those of $F_g$.

For the homogeneous regression coefficients $\beta^*$, because they are shared across groups, we can aggregate the residuals from the response and the factor projection to perform a global regression to estimate $\beta^*$. Denote the aggregated residual vector from the response as $\tilde{Y} = (\tilde{Y}_1^\top, \ldots, \tilde{Y}_G^\top)^\top$, where $\tilde{Y}_g = Y_g - \hat{\mu}_g - \hat{F}_g\hat{\gamma}_g$. Let $U = (U_1^\top, \ldots, U_G^\top)^\top$ and $\hat{U} = (\hat{U}_1^\top, \ldots, \hat{U}_G^\top)^\top$. We solve the following penalized quadratic minimization problem to estimate $\beta^*$:

$$\min_\beta \frac{1}{2}\left\{\beta^\top\hat{\Sigma}_u\beta - \frac{2}{n}\tilde{Y}^\top\hat{U}\beta\right\} + \lambda P(\beta), \tag{3.4}$$

where $P(\beta)$ is a penalty function and $\lambda$ is a tuning parameter, the optimal value of which is chosen using cross-validation. In particular, we consider an $\ell_1$ penalty with $P(\beta) = \sum_{j=1}^p|\beta_j|$ and an $\ell_2$ penalty with $P(\beta) = \sum_{j=1}^p\beta_j^2$, and denote the corresponding solutions of (3.4) by $\hat{\beta}_\lambda^{\text{lasso}}$ and $\hat{\beta}_\lambda^{\text{ridge}}$, respectively. In (3.4), $\hat{\Sigma}_u$ is an estimator of $\Sigma_u$. To obtain such an estimator, we use the adaptive thresholding method (Cai and Liu (2011)). More specifically, let $\hat{\sigma}_{ij} = (1/n)\sum_{g=1}^G\sum_{t=1}^{n_g}\hat{u}_{g,ti}\hat{u}_{g,tj}$ and $\hat{\theta}_{ij} = (1/n)\sum_{g=1}^G\sum_{t=1}^{n_g}(\hat{u}_{g,ti}\hat{u}_{g,tj} - \hat{\sigma}_{ij})^2$, where $\hat{u}_{g,ti}$ is the $(t,i)$th element of $\hat{U}_g$. We have

$$\hat{\Sigma}_u = (\hat{\sigma}_{ij}^{\mathcal{T}})_{p\times p}, \quad \hat{\sigma}_{ij}^{\mathcal{T}} = \begin{cases}\hat{\sigma}_{ii}, & i = j,\\ s_{ij}(\hat{\sigma}_{ij}), & i \neq j,\end{cases} \tag{3.5}$$

where $s_{ij}(\cdot)$ is any thresholding function that satisfies, for all $z \in \mathbb{R}$,

$$s_{ij}(z) = 0 \text{ when } |z| < \tau_{ij}, \quad \text{and} \quad |s_{ij}(z) - z| \le \tau_{ij} \text{ when } |z| \ge \tau_{ij}. \tag{3.6}$$

Here, $\tau_{ij} = D\omega_n\sqrt{\hat{\theta}_{ij}}$ is an adaptive threshold, where $\omega_n = 1/\sqrt{p} + \sqrt{\log p/n}$. The purpose of using such a thresholding estimator is to ensure that $\Sigma_u$ can be consistently estimated when $p > n$. In Section S3.1 of the Supplementary Material, we perform a sensitivity study on the choice of $D$, and find that the prediction performance of our method is not sensitive to $D$. Thus, we recommend choosing $D$ to be a fixed number, rather than tuning it. When $p < n$, $\Sigma_u$ does not have to be sparse. In this case, we find it is safe to choose $D = 0$; see Section S3.1 of the Supplementary Material.
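The adaptive thresholding step and the ridge version of (3.4) admit a short implementation. The sketch below follows the reconstruction of (3.4)–(3.6) given above with soft thresholding as $s_{ij}(\cdot)$; the helper names and the choice $D = 2$ are illustrative assumptions, and the Lasso version would replace the closed-form solve with an iterative solver such as coordinate descent.

```python
import numpy as np

def adaptive_threshold_cov(U_hat, D=2.0):
    """Adaptive soft-thresholding estimator (3.5)-(3.6) of Sigma_u, built from the
    pooled residual matrix U_hat (n x p)."""
    n, p = U_hat.shape
    sigma = U_hat.T @ U_hat / n                        # sample covariance of the residuals
    M2 = (U_hat ** 2).T @ (U_hat ** 2) / n             # mean over t of (u_ti * u_tj)^2
    theta = M2 - sigma ** 2                            # theta_ij = mean of (u_ti u_tj - sigma_ij)^2
    omega_n = 1.0 / np.sqrt(p) + np.sqrt(np.log(p) / n)
    tau = D * omega_n * np.sqrt(theta)                 # entrywise adaptive thresholds
    out = np.sign(sigma) * np.maximum(np.abs(sigma) - tau, 0.0)   # soft thresholding
    np.fill_diagonal(out, np.diag(sigma))              # keep the diagonal untouched
    return out

def fit_beta_ridge(Sigma_u_hat, U_hat, Y_tilde, lam):
    """Ridge solution of (3.4): minimizing
    (1/2){beta' Sigma_u_hat beta - (2/n) Y_tilde' U_hat beta} + lam * ||beta||^2
    gives beta = (Sigma_u_hat + 2*lam*I)^{-1} U_hat' Y_tilde / n."""
    n, p = U_hat.shape
    return np.linalg.solve(Sigma_u_hat + 2 * lam * np.eye(p), U_hat.T @ Y_tilde / n)
```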

We summarize the overall training procedure as follows:

  1. For $g = 1, \ldots, G$:

    1. Estimate $K_g$ from (3.2).

    2. Perform PCA on $X_gX_g^\top$ to obtain $\hat{F}_g$. Estimate $\mu_g^*$ and $\gamma_g^*$ from (3.3).

    3. Compute the projection matrix $P_g = \hat{F}_g\hat{F}_g^\top/n_g$.

  2. Let $H = \mathrm{diag}(I - P_1, \ldots, I - P_G)$ be the block diagonal matrix. Compute the aggregated signals $\hat{U} = HX$ and $\tilde{Y} = H(Y - \hat{\mu})$, where $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_G)$. Estimate $\hat{\Sigma}_u$ from $\hat{U}$ using (3.5). Solve the optimization problem (3.4) to estimate $\beta^*$, as illustrated in the sketch following this list.
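The following driver chains the hypothetical helpers sketched in Sections 3.1 and 3.2 above (estimate_num_factors, estimate_factors, adaptive_threshold_cov, fit_beta_ridge); it is one possible realization of the two-step procedure, not the authors' code, and assumes the data are given as lists of per-group arrays.

```python
import numpy as np

def train_factor_regression(X_groups, Y_groups, lam, K_max=10, D=2.0):
    """Two-step training: (i) per-group factor estimation and OLS for (mu_g, gamma_g);
    (ii) pooled penalized regression (3.4) for the shared beta."""
    models, U_parts, Ytil_parts = [], [], []
    for X_g, Y_g in zip(X_groups, Y_groups):
        K_g = estimate_num_factors(X_g, K_max)           # step 1(a): number of factors
        F_g, Lam_g, U_g = estimate_factors(X_g, K_g)     # step 1(b): PCA factors and loadings
        mu_g = Y_g.mean()                                # (3.3): group mean
        gamma_g = F_g.T @ Y_g / len(Y_g)                 # (3.3): group-specific coefficients
        models.append({"K": K_g, "Lambda": Lam_g, "mu": mu_g, "gamma": gamma_g})
        U_parts.append(U_g)                              # equivalently (I - P_g) X_g
        Ytil_parts.append(Y_g - mu_g - F_g @ gamma_g)    # residuals after the factor fit
    U_hat = np.vstack(U_parts)                           # aggregated homogeneous signals
    Y_tilde = np.concatenate(Ytil_parts)
    Sigma_u_hat = adaptive_threshold_cov(U_hat, D=D)     # (3.5): thresholded covariance
    beta_hat = fit_beta_ridge(Sigma_u_hat, U_hat, Y_tilde, lam)   # solve (3.4), ridge version
    return models, beta_hat
```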

In practice, it can be desirable to have an automatic way to choose between the proposed factor regression model (2.4) and the group-specific model (2.2). We provide an effective rule of thumb in Section S2 in the Supplementary Material.

3.3. Prediction

After training the model, in order to make predictions on the testing data, we need to estimate the factors and homogeneous signals in the testing data. In practice, they are not observable. We provide a data-driven procedure to estimate them based on the estimated loading matrix. Let $X_{g,*} \in \mathbb{R}^{n_{g,*} \times p}$ denote the testing data matrix from group $g$. We aim to estimate the factor matrix $F_{g,*} \in \mathbb{R}^{n_{g,*} \times K_g}$ and the homogeneous signal matrix $U_{g,*} \in \mathbb{R}^{n_{g,*} \times p}$. Note that the number of columns in $F_{g,*}$ is the same as that in $F_g$.

Motivated by (3.1), we assume that the training and testing data from the same group have the same factor decomposition with the same loading matrix $\Lambda_g$. Hence, given $\hat{\Lambda}_g$ from the training data, we propose estimating $F_{g,*}$ by solving

$$\min_{F_{g,*}}\|X_{g,*} - F_{g,*}\hat{\Lambda}_g\|_F, \quad \text{s.t. } F_{g,*}^\top F_{g,*} = n_{g,*}I. \tag{3.7}$$

Note that (3.7) can be formulated as a trace maximization problem, the solution of which is given by $\hat{F}_{g,*} = \sqrt{n_{g,*}}\,\tilde{V}_g\tilde{U}_g^\top$, where $\tilde{V}_g$ and $\tilde{U}_g$ come from the singular value decomposition $\hat{\Lambda}_gX_{g,*}^\top = \tilde{U}_gS_g\tilde{V}_g^\top$.
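A minimal sketch of this prediction step, under the SVD-based solution reconstructed above, is as follows; the function name is our own.

```python
import numpy as np

def estimate_test_factors(X_test, Lambda_hat):
    """Solve (3.7) by trace maximization: with the SVD
    Lambda_hat @ X_test.T = U_tilde S V_tilde^T, the solution is
    F_test = sqrt(n_test) * V_tilde @ U_tilde^T."""
    n_test = X_test.shape[0]
    U_tilde, s, Vt_tilde = np.linalg.svd(Lambda_hat @ X_test.T, full_matrices=False)
    F_test = np.sqrt(n_test) * Vt_tilde.T @ U_tilde.T   # estimated test factors (n_test x K_g)
    U_test = X_test - F_test @ Lambda_hat               # estimated homogeneous signals
    return F_test, U_test
```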

4. Theoretical Properties

We study the statistical properties of the proposed estimator. Without loss of generality, we assume that $\mu_g^* = 0$ for any $g \in \{1, \ldots, G\}$, so that (2.4) reduces to

$$Y_g = F_g\gamma_g^* + U_g\beta^* + \epsilon_g. \tag{4.1}$$

We establish the following theoretical results. First, we prove in Theorem 1 that the proposed estimators are consistent up to a rotation of the true parameters. As a corollary, we give an upper bound of the prediction error for the proposed method. Second, we show in Theorems 2 and 3 that if (4.1) is true, the group-specific model and the global model yield worse predictions than those of our proposed method. On the other hand, we show in Theorem 4 that even if one assumes each group has a distinct model, our method can have the same prediction error as the group-specific model when p is sufficiently large. Thus, our method is robust to model mis-specification.

First, we introduce some notation. For a matrix $A \in \mathbb{R}^{p\times p}$, let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues, respectively. Let $\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}$, $\|A\| = \sqrt{\lambda_{\max}(A^\top A)}$, $\|A\|_1 = \max_{j\le p}\sum_{i=1}^p|a_{ij}|$, and $\|A\|_{\max} = \max_{i,j\le p}|a_{ij}|$ denote its Frobenius, $\ell_2$, $\ell_1$, and elementwise maximum norms, respectively. For a vector $b \in \mathbb{R}^p$, let $\|b\| = \sqrt{\sum_{j=1}^pb_j^2}$, $\|b\|_1 = \sum_{j=1}^p|b_j|$, and $\|b\|_\infty = \max_{j\le p}|b_j|$ denote its $\ell_2$, $\ell_1$, and maximum norms, respectively, and define its support as $\{j : b_j \neq 0\}$. Furthermore, we let $n_{\max} = \max_{g\le G}n_g$, $n = \sum_{g=1}^Gn_g$, and $[m] = \{1, \ldots, m\}$ for a general positive integer $m$. In addition, we introduce the following definitions.

Definition 1.

A vector $\beta \in \mathbb{R}^p$ is called $s$-sparse if and only if the cardinality of its support is at most $s$.

Definition 2 (RE Condition).

A matrix $\Sigma$ is said to satisfy the restricted eigenvalue (RE) condition if and only if there exists a positive constant $\kappa$ such that $\beta^\top\Sigma\beta \ge \kappa\|\beta\|^2$ for any $\beta \in \mathcal{C}(S) = \{\beta \in \mathbb{R}^p : \|\beta_{S^c}\|_1 \le 3\|\beta_S\|_1\}$, where $S \subseteq [p]$ and $S^c$ denotes its complement.

4.1. Consistency of the factor regression method

To establish the consistency of our proposed method, we need to impose the following conditions.

Assumption 1 (Pervasiveness).

There exist positive constants $C_{\min}$ and $C_{\max}$ such that, for any $g \in [G]$,

$$C_{\min} < \lambda_{\min}(p^{-1}\Lambda_g\Lambda_g^\top) < \lambda_{\max}(p^{-1}\Lambda_g\Lambda_g^\top) < C_{\max}.$$

Assumption 2.

For any $g \in [G]$, assume that $\{f_{g,i}\}_{i\le n_g}$ and $\{u_{g,i}\}_{i\le n_g}$ are i.i.d. sub-Gaussian random vectors with zero means and covariances $I_{K_g\times K_g}$ and $\Sigma_u$, respectively. More explicitly, assume for any $\alpha \in \mathbb{R}^{K_g}$, $\gamma \in \mathbb{R}^p$, and $s > 0$, there exists $C > 0$ such that $P(|\alpha^\top f_{g,i}| > s) \le \exp(-Cs^2/\|\alpha\|^2)$ and $P(|\gamma^\top u_{g,i}| > s) \le \exp(-Cs^2/\|\gamma\|^2)$. Moreover, assume $\{f_{g,i}\}_{i\le n_g}$ are uncorrelated with $\{u_{g,i}\}_{i\le n_g}$.

Assumption 3.

There exist positive constants $c_1$ and $c_2$ such that $\lambda_{\min}(\Sigma_u) > c_1$ and $\|\Sigma_u\|_1 < c_2$.

Assumption 4.

For any $g \in [G]$, $j \in [p]$, and $i_1, i_2, i \le n_g$, there exists a positive constant $M$ such that

  1. $\|\lambda_{g,j}\| < M$, where $\lambda_{g,j}$ denotes the $j$th column of $\Lambda_g$;

  2. $E[p^{-1/2}\{u_{g,i_1}^\top u_{g,i_2} - E(u_{g,i_1}^\top u_{g,i_2})\}]^4 < M$;

  3. $E\|p^{-1/2}\sum_{j=1}^p\lambda_{g,j}u_{g,ij}\|^4 < M$.

Assumption 1 is a typical pervasiveness assumption that ensures the latent factors can be well estimated by the PCA method (Bai and Ng (2013); Fan, Liao and Mincheva (2013)). It requires that the latent factors affect a large proportion of the variables, and is commonly used in the factor analysis literature. Assumption 2 is a typical sub-Gaussian assumption on the latent factors and the idiosyncratic components. Assumption 3 is a regularity condition on $\Sigma_u$. Assumption 4 is a collection of technical conditions needed to establish the factor estimation consistency. Such conditions are commonly used in the factor analysis literature (Bai (2003); Bai and Ng (2008); Fan, Liao and Mincheva (2013)). Given these conditions, we show that under model (4.1), the proposed estimators are consistent.

Theorem 1.

Suppose Assumptions 1–3 hold, $\log p = o(n^{2/39})$, $n = o(p^2)$, and $m_p\omega_n = o(1)$. Then, it follows that

  1. $\|\hat{\gamma}_g - H_g\gamma_g^*\| = O_P(1/\sqrt{n_g} + 1/\sqrt{p})$, where $\hat{\gamma}_g$ is as defined in (3.3), $H_g = \hat{D}_g^{-1}\hat{F}_g^\top F_g\Lambda_g\Lambda_g^\top$, and $\hat{D}_g$ is a $\hat{K}_g\times\hat{K}_g$ diagonal matrix consisting of the $\hat{K}_g$ largest eigenvalues of $X_gX_g^\top$.

  2. In (3.4), if we choose an $\ell_2$ penalty and $\lambda = C\max\{n_{\max}^{3/4}/n, \sqrt{n_{\max}p}/n\}$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_\lambda^{\text{ridge}} - \beta^*\| = O_P\!\left(\frac{n_{\max}^{3/4}}{n} + \frac{\sqrt{n_{\max}p}}{n} + m_p\omega_n\sqrt{\frac{n_{\max}}{n}}\right). \tag{4.2}$$

  3. Assuming that $\beta^*$ is $s$-sparse, $\Sigma_u$ satisfies the RE condition, and $s\omega_n = o(1)$, if we choose an $\ell_1$ penalty in (3.4) and $\lambda = C\omega_n(m_p + \sqrt{n_{\max}/n})$, for some large enough constant $C$, then we have
    $$\|\hat{\beta}_\lambda^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left(m_p\omega_n + \sqrt{\frac{n_{\max}}{n}}\,\omega_n\right)\right). \tag{4.3}$$

Statement (a) shows that $\hat{\gamma}_g$ is consistent for $\gamma_g^*$ up to a rotation given by $H_g$. When the latent factors are known, the oracle convergence rate of $\hat{\gamma}_g$ is $O_P(1/\sqrt{n_g})$. Compared with this oracle rate, the extra term of $O_P(1/\sqrt{p})$ is essentially due to the estimation error of the latent factors; see Lemma 1(a). When $p \gg n$, such a term is ignorable and the oracle rate can be attained. This is because, in that situation, many variables can be used to estimate the latent factors. The error in estimating the latent factors is so small that it does not affect the convergence rate of $\hat{\gamma}_g$. This is essentially due to the blessing-of-dimensionality property of factor analysis, which has been studied in Li et al. (2018). Statements (b) and (c) show that the proposed penalized estimator in (3.4) is consistent for $\beta^*$, regardless of whether an $\ell_1$ or $\ell_2$ penalty is imposed. To simplify the discussion, assume that $n_1 = \cdots = n_G$, and that $m_p$ and $G$ are bounded. Then, the convergence rates in (4.2) and (4.3) reduce to

$$\|\hat{\beta}_\lambda^{\text{ridge}} - \beta^*\| = O_P\!\left(\frac{1}{n^{1/4}} + \sqrt{\frac{p}{n}}\right), \quad \|\hat{\beta}_\lambda^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{\frac{s}{p}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.4}$$

Hsu, Kakade and Zhang (2014) show that the minimax rate of a Ridge estimator in a linear regression model is $O_P(\sqrt{p/n})$ if no sparsity assumption is imposed. Compared with this minimax rate, our method has an extra term of $O_P(1/n^{1/4})$, which is again due to the error when estimating the latent factors; see Lemma 4. However, when $p \gg n$, such a term is ignorable and the minimax rate can be obtained. A similar conclusion can be drawn for the Lasso estimator. In (4.4), the term of $O_P(\sqrt{s\log p/n})$ agrees with the minimax rate of the standard Lasso problem (Raskutti, Wainwright and Yu (2011)). The extra term of $O_P(\sqrt{s/p})$ comes from the estimation error of $\hat{\Sigma}_u$; see Fan, Liao and Mincheva (2013). This term is ignorable when $p \gg n$, in which case the minimax rate is attained.

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \hat{F}_g\hat{\gamma}_g + \hat{U}_g\hat{\beta}_\lambda^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \hat{F}_g\hat{\gamma}_g + \hat{U}_g\hat{\beta}_\lambda^{\text{lasso}}$ denote the predicted values of $Y_g$, where $\hat{\gamma}_g$ is given in (3.3), $\hat{\beta}_\lambda^{\text{ridge}}$ and $\hat{\beta}_\lambda^{\text{lasso}}$ are the Ridge and Lasso estimators, respectively, solved from (3.4), and $\hat{F}_g$ and $\hat{U}_g$ are as described in Section 3.1. The following corollary gives the upper bounds of the corresponding in-sample prediction errors.

Corollary 1.

Under the assumptions of Theorem 1, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{n_{\max}^{3/4}}{n\sqrt{n_g}} + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}} + m_p\omega_n\sqrt{\frac{n_{\max}}{n\,n_g}}\right) + O_P\!\left(\sqrt{\frac{\log n_g\log p}{n_g}} + \frac{1}{n_g^{1/4}\sqrt{p}}\right), \tag{4.5}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left(m_p\omega_n + \sqrt{\frac{n_{\max}}{n}}\,\omega_n\right)\right) + O_P\!\left(\sqrt{\frac{\log n_g\log p}{n_g}} + \frac{1}{n_g^{1/4}\sqrt{p}}\right). \tag{4.6}$$

Again, if we assume $n_1 = \cdots = n_G$, and that $m_p$ and $G$ are bounded, these results reduce to

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{1}{n^{1/4}\sqrt{p}} + \frac{\sqrt{p}}{n}\right), \tag{4.7}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{1}{n^{1/4}\sqrt{p}} + \sqrt{\frac{s}{np}} + \sqrt{\frac{\log n\log p}{n}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.8}$$

In (4.7), the term of $O_P(\sqrt{p}/n)$ agrees with the minimax rate of the prediction error given by the Ridge estimator in a standard linear regression problem (Dicker (2016); Dobriban and Wager (2018)). In (4.8), the term of $O_P(\sqrt{s\log p/n})$ agrees with the prediction error given by the Lasso estimator in the standard setting (Bickel, Ritov and Tsybakov (2009)). All other terms are ignorable when $p \gg n$.

In conclusion, these results show that our proposed estimators can have the same convergence rates as the Ridge and Lasso estimators have under the standard homogeneous linear regression model, which is simpler than the heterogeneous model we have considered.

4.2. Consistency of group-specific and global models

We study the statistical properties of the group-specific and global models when the underlying model follows (4.1). We show that, in this case, our proposed method has an advantage over these two models in terms of the prediction error. We rewrite (4.1) as

$$Y_g = \tilde{X}_g\beta^* + F_g\delta_g + d_pU_g\beta^* + \epsilon_g, \tag{4.9}$$

where $\tilde{X}_g = p^{-1/2}X_g$, $\delta_g = \gamma_g^* - p^{-1/2}\Lambda_g\beta^*$, and $d_p = 1 - p^{-1/2}$. Here, we standardize $X_g$ by dividing it by $p^{1/2}$. This is because the pervasiveness assumption means that $X_g$ is unbounded, which is different from the typical linear regression model. Therefore, we rescale it to be $\tilde{X}_g$. Then, the group-specific model seeks to solve

$$\hat{\beta}_{g,\lambda} = \arg\min_\beta \frac{1}{2n_g}\|Y_g - \tilde{X}_g\beta\|^2 + \lambda P(\beta), \tag{4.10}$$

whereas the global model seeks to solve

$$\hat{\beta}_{\lambda,\text{global}} = \arg\min_\beta \frac{1}{2n}\|Y - \tilde{X}\beta\|^2 + \lambda P(\beta), \tag{4.11}$$

where $\tilde{X} = (\tilde{X}_1^\top, \ldots, \tilde{X}_G^\top)^\top$, $\lambda$ is a tuning parameter, and $P(\beta)$ is a general penalty function. Similar to (3.4), we choose either an $\ell_1$ or an $\ell_2$ penalty, and denote the corresponding solutions by $(\hat{\beta}_{g,\lambda}^{\text{lasso}}, \hat{\beta}_{\lambda,\text{global}}^{\text{lasso}})$ and $(\hat{\beta}_{g,\lambda}^{\text{ridge}}, \hat{\beta}_{\lambda,\text{global}}^{\text{ridge}})$, respectively. Next, we give the convergence rates of the estimators in the group-specific and global models in Theorems 2 and 3, respectively.

Theorem 2.

Suppose Assumptions 1–3 hold and $\log p = o(n)$. Then, it follows that

  1. If we use an $\ell_2$ penalty in (4.10) and choose $\lambda = C/p$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{g,\lambda}^{\text{ridge}} - \beta^*\| = O_P\!\left(\sqrt{p}\,\|\delta_g\| + d_p\left(1 + \sqrt{\frac{p}{n_g}}\right) + \sqrt{\frac{p}{n_g}}\right). \tag{4.12}$$
  2. Assuming that $\beta^*$ is $s$-sparse, $\Lambda_g^\top\Lambda_g/p$ satisfies the RE condition, and $s\sqrt{\log p/(n_gp)} = o(1)$, if we use an $\ell_1$ penalty in (4.10) and choose $\lambda = C\{(1 + \sqrt{\log p/n_g})(d_p + \|\delta_g\|) + \sqrt{\log p/n_g}\}/\sqrt{p}$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{g,\lambda}^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left\{\left(1 + \sqrt{\frac{\log p}{n_g}}\right)\left(d_p + \|\delta_g\|\right) + \sqrt{\frac{\log p}{n_g}}\right\}\right). \tag{4.13}$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \tilde{X}_g\hat{\beta}_{g,\lambda}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \tilde{X}_g\hat{\beta}_{g,\lambda}^{\text{lasso}}$ be the predicted values of $Y_g$, where $\hat{\beta}_{g,\lambda}^{\text{ridge}}$ and $\hat{\beta}_{g,\lambda}^{\text{lasso}}$ are the Ridge and Lasso solutions, respectively, to (4.10). We have the following upper bounds of their in-sample prediction errors.

Corollary 2.

Under the assumptions of Theorem 2, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{p}{n_g}}\|\delta_g\| + d_p\left(\frac{1}{\sqrt{n_g}} + \sqrt{\frac{p}{n_g}}\right) + \sqrt{\frac{p}{n_g}}\right), \tag{4.14}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left\{\left(1 + \sqrt{\frac{\log p}{n_g}}\right)\left(d_p + \|\delta_g\|\right) + \sqrt{\frac{\log p}{n_g}}\right\}\right). \tag{4.15}$$

Theorem 3.

Suppose Assumptions 1–3 hold and $\log p = o(n)$. Then, it follows that

  1. If we use an $\ell_2$ penalty in (4.11) and choose $\lambda = C/p$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}} - \beta^*\| = O_P\!\left(\sqrt{\frac{n_{\max}p}{n}}\sum_{g=1}^G\|\delta_g\| + d_p\left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}p}}{n}\right) + \frac{\sqrt{n_{\max}p}}{n}\right). \tag{4.16}$$
  2. Assuming that $\beta^*$ is $s$-sparse, $\Lambda_g^\top\Lambda_g/p$ satisfies the RE condition, and $s\sqrt{\log p/(n_gp)} = o(1)$ for any $g \in [G]$, if we use an $\ell_1$ penalty in (4.11) and choose $\lambda = C[\{\sqrt{n_{\max}/(np)} + (1/n)\sqrt{n_{\max}\log p/p}\}(d_p + \sum_{g=1}^G\|\delta_g\|) + (1/n)\sqrt{n_{\max}\log p/p}]$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left\{\left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}\log p}}{n}\right)\left(d_p + \sum_{g=1}^G\|\delta_g\|\right) + \frac{\sqrt{n_{\max}\log p}}{n}\right\}\right). \tag{4.17}$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \tilde{X}_g\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \tilde{X}_g\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}}$ be the predicted values of $Y_g$, where $\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}}$ and $\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}}$ are the Ridge and Lasso solutions, respectively, to (4.11). We have the following upper bounds for their in-sample prediction errors.

Corollary 3.

Under the assumptions of Theorem 3, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{n_{\max}p}{n\,n_g}}\sum_{g=1}^G\|\delta_g\| + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}} + d_p\left(\sqrt{\frac{n_{\max}}{n\,n_g}} + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}}\right)\right), \tag{4.18}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left\{\frac{\sqrt{n_{\max}\log p}}{n} + \left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}\log p}}{n}\right)\left(d_p + \sum_{g=1}^G\|\delta_g\|\right)\right\}\right). \tag{4.19}$$

Under (4.1), $\|\delta_g\| \le \|\gamma_g^*\| + p^{-1/2}\|\Lambda_g\beta^*\| = O(1)$ for all $g \in [G]$, and $d_p = O(1)$. Thus, if we assume that $n_1 = \cdots = n_G$ and $G$ is bounded, then (4.14) and (4.18) further reduce to $\frac{1}{\sqrt{n_g}}\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\| = O_P(\sqrt{p/n})$ for the Ridge estimator. Compared with the prediction error of our Ridge estimator, which is $O_P(\sqrt{p}/n)$, these two methods are worse by a factor of $\sqrt{n}$, owing to the mis-specified model (4.1). Similarly for the Lasso estimator, when $n_1 = \cdots = n_G$ and $G$ is bounded, (4.15) and (4.19) reduce to $\frac{1}{\sqrt{n_g}}\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\| = O_P(\sqrt{s/n} + \sqrt{s\log p/n})$. Compared with our Lasso estimator, they have an extra term of $\sqrt{s/n}$, which also comes from the model mis-specification and is nonignorable.

4.3. Robustness

In this section, we assume each group follows a distinct model

$$Y_g = \tilde{X}_g\beta_g^* + \epsilon_g, \tag{4.20}$$

and examine how well our method performs under this model assumption. In other words, we study how robust our method is under model mis-specification. Here, we still use the rescaled $\tilde{X}_g$ as the design matrix. We rewrite (4.20) as $Y_g = p^{-1/2}F_g\Lambda_g\beta_g^* + p^{-1/2}U_g\beta_g^* + \epsilon_g$. Compared with (4.1), we see that $p^{-1/2}\Lambda_g\beta_g^*$ and $p^{-1/2}\beta_g^*$ can be viewed as $\gamma_g^*$ and $\beta^*$, respectively, in our model. Under the model assumption in (4.20), we have the following results.

Theorem 4.

Suppose Assumptions 1–3 hold, $\log p = o(n^{2/39})$, $n = o(p^2)$, and $m_p\omega_n = o(1)$. Then, for any $g \in [G]$, it follows that

  1. $\|\hat{\gamma}_g - p^{-1/2}H_g\Lambda_g\beta_g^*\| = O_P(1/\sqrt{n_g} + 1/\sqrt{p})$, where $H_g$ is as defined in Theorem 1.

  2. If an $\ell_2$ penalty in (3.4) is used and $\lambda = O(\max\{n_{\max}^{3/4}\sqrt{p}/n, \sqrt{n_{\max}p}/n\})$, then
    $$\left\|\hat{\beta}_\lambda^{\text{ridge}} - \frac{1}{\sqrt{p}}\beta_g^*\right\| = O_P\!\left(\frac{\sqrt{n_{\max}p}}{n} + \frac{n_{\max}^{3/4}}{n}\right) + \sum_{g'=1}^GO_P\!\left(\sqrt{\frac{n_{g'}}{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right).$$

  3. Assuming that $\beta_g^*$ is $s$-sparse and $\Sigma_u$ satisfies the RE condition, if we use an $\ell_1$ penalty in (3.4) and choose $\lambda = C\{\omega_n\sqrt{n_{\max}/n} + \sqrt{n_{\max}/(np)}\sum_{g'=1}^G\|\beta_{g'}^* - \beta_g^*\|\}$, for some large enough constant $C$, we have
    $$\left\|\hat{\beta}_\lambda^{\text{lasso}} - \frac{1}{\sqrt{p}}\beta_g^*\right\| = O_P\!\left(\sqrt{s}\left\{\sqrt{\frac{n_{\max}}{n}}\,\omega_n + \sqrt{\frac{n_{\max}}{np}}\sum_{g'=1}^G\left\|\beta_{g'}^* - \beta_g^*\right\|\right\}\right).$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}}$ be the same as in Corollary 1. Using Theorem 4, we give the upper bounds of the in-sample prediction errors given by our proposed method, when the underlying model follows (4.20).

Corollary 4.

Under the assumptions of Theorem 4, for each $g \in [G]$, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\frac{1}{\sqrt{n_g}}\right) + O_P\!\left(\frac{1}{\sqrt{n_gp}}\right) + O_P\!\left(\frac{1}{\sqrt{n_g}}\right)\left\|\hat{\beta}_\lambda^{\text{ridge}} - \frac{1}{\sqrt{p}}\beta_g^*\right\|, \tag{4.21}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\frac{1}{\sqrt{n_g}}\right) + O_P\!\left(\frac{1}{\sqrt{n_gp}}\right) + O_P\!\left(\frac{1}{\sqrt{n_g}}\right)\left\|\hat{\beta}_\lambda^{\text{lasso}} - \frac{1}{\sqrt{p}}\beta_g^*\right\|. \tag{4.22}$$

When $n_1 = \cdots = n_G$ and $G$ is bounded, (4.21) and (4.22) further reduce to

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\sum_{g'=1}^G\frac{1}{\sqrt{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right) + O_P\!\left(\sqrt{\frac{p}{n}}\right) = O_P\!\left(\sqrt{\frac{p}{n}}\right), \tag{4.23}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\sum_{g'=1}^G\sqrt{\frac{s}{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right) + O_P\!\left(\sqrt{\frac{s}{np}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.24}$$

We compare these convergence rates with those given by the group-specific model. Because the true model (4.20) is a special case of (4.9), obtained by setting $d_p = 0$ and $\delta_g = 0$, it follows from Theorem 2 that the prediction errors of the group-specific model are $O_P(\sqrt{p/n_g})$ and $O_P(\sqrt{s\log p/n_g})$ when using a Ridge or a Lasso estimator, respectively. Comparing them with (4.23) and (4.24), we find that the Ridge estimator of our model has the same rate as the group-specific Ridge estimator; see (4.23). For the Lasso estimator, when $p$ is small, our model converges at a rate of $\sqrt{s/(np)}$, which is slower than that of the group-specific model by a factor of $n/(p\log p)$. The reason is that our model estimates $G^{-1}\sum_{g'=1}^G\beta_{g'}^*$, instead of $\beta_g^*$, and needs to estimate $\Sigma_u$, which introduces an extra error of $O_P(\sqrt{s/(np)})$. However, when $p \gg n$, all these terms are negligible, and our model has the same convergence as the group-specific model. In conclusion, we have shown that even if the true model is group-specific, our method still provides comparable prediction to that of the group-specific model, especially when the dimension $p$ is high.

5. Simulation Studies

In this section, we perform two simulation studies to compare our proposed model with the global, group-specific, and Factor-0 models. In both studies, we choose $G = 3$, $p = 200$, $K_g = 3$, and $n_g = 100$ for any $g \in [G]$, generate 600 training samples to train all four models, and evaluate their mean squared error (MSE) on an independent test set of 600 samples. Additional simulation studies on other choices of $K_g$ can be found in Section S3.4 in the Supplementary Material. We repeat the simulations 50 times. In setting 1, we generate data from our proposed model. In setting 2, we generate different models for different groups.

5.1. Setting 1: under proposed model

We first generate data from the proposed model in (2.4). For any $g \in [G]$, we generate $\{f_{g,i}\}_{i\le n_g}$ as i.i.d. samples from $\mathcal{N}(0, I_{K_g\times K_g})$. We set

$$\Lambda_g = (\Lambda_{g1}, \Lambda_{g2}), \qquad \Lambda_g^\top\Lambda_g = \begin{pmatrix}\Lambda_{g1}^\top\Lambda_{g1} & \Lambda_{g1}^\top\Lambda_{g2}\\ \Lambda_{g2}^\top\Lambda_{g1} & \Lambda_{g2}^\top\Lambda_{g2}\end{pmatrix}.$$

To ensure that $\Lambda_g$ satisfies the pervasiveness assumption (Assumption 1), we first choose a positive-definite matrix $R * (s_gs_g^\top)$, where $R = (r_{ij})$ with $r_{ij} = 0.1^{|i-j|}$, $s_g = (\lambda_{g,1}, \ldots, \lambda_{g,K_g})^\top$, $(\lambda_{1,1}, \lambda_{1,2}, \lambda_{1,3}) = (7.0, 3.5, 1.2)$, $(\lambda_{2,1}, \lambda_{2,2}, \lambda_{2,3}) = (10, 3.9, 1.2)$, $(\lambda_{3,1}, \lambda_{3,2}, \lambda_{3,3}) = (13, 3.9, 1.1)$, and $*$ denotes elementwise matrix multiplication. Additional simulation studies on other choices of $(\lambda_{g,1}, \ldots, \lambda_{g,K_g})$ can be found in Section S3.2 in the Supplementary Material. Then, we perform an eigendecomposition on it to obtain $R * (s_gs_g^\top) = V_gD_gV_g^\top$, where $D_g$ is the diagonal matrix consisting of its eigenvalues. Next, we set $\Lambda_{g1} = Q_gD_g^{1/2}V_g^\top$, where $Q_g$ is a random orthonormal matrix, and $\Lambda_{g2} = Q_gT_g$, where $T_g$ is a $K_g\times(p-K_g)$ matrix with elements randomly generated from $\mathrm{Unif}(-1/20, 1/20)$. This construction of $\Lambda_g$ ensures that it has spiked eigenvalues, as required by the pervasiveness assumption, and that its rank is $K_g$. We further generate $\{u_{g,i}\}_{i\le n_g}$ as i.i.d. samples from $\mathcal{N}(0, \Sigma_u)$, where $\Sigma_u$ is a diagonal matrix with diagonal elements all equal to 0.03. For the coefficients in (2.4), we choose $\mu_g^* = g$ for $g = 1, 2, 3$. We set $\gamma_1^* = (h, h, 2h)^\top$, $\gamma_2^* = (h, 2h, h)^\top$, and $\gamma_3^* = (2h, h, h)^\top$, where we let $h$ change so that, as it increases, the between-group heterogeneity increases accordingly. We consider two settings of $\beta^*$. For a sparse $\beta^*$, we set $\beta^* = (\mathbf{2}_{10}^\top, \mathbf{0}_{90}^\top, -\mathbf{2}_{10}^\top, \mathbf{0}_{90}^\top)^\top$, where $\mathbf{m}_L$ denotes an $L$-dimensional vector with elements all equal to $m$; for a dense $\beta^*$, we set $\beta^* = (\mathbf{1}_{80}^\top, \mathbf{0}_{20}^\top, -\mathbf{1}_{80}^\top, \mathbf{0}_{20}^\top)^\top$. Finally, we generate the error term $\epsilon$ as i.i.d. samples from $\mathcal{N}(0, 4)$.
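For readers who want to reproduce this setting, the following is a compact numpy sketch of the Setting 1 generator under the reconstruction above; the helper names, the 0-based group index, and the random-number conventions are our own choices.

```python
import numpy as np

def make_loading(s_g, p, K_g, rng):
    """Build Lambda_g = [Lambda_g1, Lambda_g2] with spiked eigenvalues,
    following the construction described above (r_ij = 0.1^|i-j|)."""
    R = 0.1 ** np.abs(np.subtract.outer(np.arange(K_g), np.arange(K_g)))
    M = R * np.outer(s_g, s_g)                              # R * (s_g s_g^T), elementwise
    vals, vecs = np.linalg.eigh(M)                          # eigendecomposition M = V D V^T
    Q, _ = np.linalg.qr(rng.standard_normal((K_g, K_g)))    # random orthonormal matrix
    Lambda1 = Q @ np.diag(np.sqrt(vals)) @ vecs.T           # Q D^{1/2} V^T
    Lambda2 = Q @ rng.uniform(-0.05, 0.05, size=(K_g, p - K_g))
    return np.hstack([Lambda1, Lambda2])

def make_group(g, n_g, p, K_g, h, beta, rng):
    """Generate (X_g, Y_g) from (2.3)-(2.4) for one group in Setting 1 (g is 0-based)."""
    s_list = [np.array([7.0, 3.5, 1.2]), np.array([10, 3.9, 1.2]), np.array([13, 3.9, 1.1])]
    gammas = [np.array([h, h, 2 * h]), np.array([h, 2 * h, h]), np.array([2 * h, h, h])]
    Lam = make_loading(s_list[g], p, K_g, rng)
    F = rng.standard_normal((n_g, K_g))                     # factors ~ N(0, I)
    U = rng.normal(scale=np.sqrt(0.03), size=(n_g, p))      # Sigma_u = 0.03 * I
    X = F @ Lam + U                                         # decomposition (2.3)
    Y = (g + 1) + F @ gammas[g] + U @ beta + rng.normal(scale=2.0, size=n_g)  # model (2.4)
    return X, Y
```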

Under this model generation scheme, Figure 1 shows how the MSEs of these four methods change as $h$ varies. When $\beta^*$ is sparse, all methods use an $\ell_1$ penalty; when $\beta^*$ is dense, all methods use an $\ell_2$ penalty. The shaded areas represent the standard errors of the MSEs in the 50 simulations. The optimal tuning parameters in these methods are chosen using 10-fold cross-validation. It is clearly seen that for most $h$, our model performs best. Owing to the model mis-specification, the group-specific model loses some efficiency in estimating the homogeneous part of (4.9) separately, and the global model entirely ignores the heterogeneity. The Factor-0 model adjusts for group means; therefore, it is better than the global model. However, it is still worse than the proposed full model, indicating that some additional heterogeneity has not been fully taken into account in the Factor-0 model. When $h$ increases, the true model (2.4) becomes more group-specific, and less homogeneity can be used to estimate the common $\beta^*$. In this case, the group-specific model gradually outperforms our method. They both become much better than the global and the Factor-0 models. The estimation errors on $\gamma_g^*$ and $\beta^*$ are reported in Tables S2 and S3 in the Supplementary Material.

Figure 1. The MSE curves given by the four models. The left panel represents the results for a sparse $\beta^*$, and the right panel represents the results for a dense $\beta^*$.

5.2. Setting 2: under group-specific model

We generate different models for different groups and inspect how robust our model is under such a scenario. We generate $f_{g,i}$ as we did in the first study and $u_{g,i}$ as i.i.d. samples from $\mathcal{N}(0, \Sigma_u)$, where $\Sigma_u = (\sigma_{u,ij})$ with $\sigma_{u,ij} = 0.1^{|i-j|}\times 0.03$ if $|i-j| \le 2$, and $\sigma_{u,ij} = 0$ otherwise. Additional simulation studies with $\{f_{g,i}\}_{i\le n_g}$ and $\{u_{g,i}\}_{i\le n_g}$ generated from more general sub-Gaussian distributions for both settings can be found in Section S3.3 in the Supplementary Material. For $\Lambda_g$, we set $\Lambda_g = \tilde{Q}_g * s_g$, where $s_g$ is as in the first study, and $\tilde{Q}_g$ is a random $K_g\times p$ orthonormal matrix. Then, we use these elements to generate $X_g$ according to (2.3) and normalize it to obtain the design matrix $\tilde{X}_g$. Given $\tilde{X}_g$, for any $g \in [G]$, we generate $Y_g$ from (4.20) by setting $\mu_g = g$ for $g \in [G]$, generating $\epsilon$ as i.i.d. samples from $\mathcal{N}(0, 4)$, and choosing two kinds of $\beta_g^*$. For sparse $\beta_g^*$, we set $\beta_1^* = (10h, 10h, -10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$, $\beta_2^* = (10h, -10h, 10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$, and $\beta_3^* = (-10h, 10h, 10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$. For dense $\beta_g^*$, we set $\beta_1^* = (10h, 10h, -10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$, $\beta_2^* = (10h, -10h, 10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$, and $\beta_3^* = (-10h, 10h, 10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$.

Under this model generation scheme, Figure 2 shows the MSE curves of the four methods, which are computed in the same way as in the first study. For sparse $\beta_g^*$, when $h$ is small, the differences between the group-specific model, the Factor-0 model, and our method are marginal, which agrees with what we proved in Corollary 4. When $h$ gets larger, the group difference dominates. In this case, the group-specific model gives the best prediction, although our model is not far off. Compared with these two models, the global and Factor-0 models are much worse, because they fail to recognize the group difference. For dense $\beta_g^*$, when $h$ is small, all other models have similar performance, except for the global model. As $h$ gets larger, our model becomes slightly worse than the group-specific model, for the same reason discussed in the sparse case. However, the performance of the Factor-0 model deteriorates much faster. In conclusion, this study shows that our method's performance is still acceptable, even when the underlying models in the various groups are different. The estimation errors on $\beta_g^*$ are reported in Table S4 in the Supplementary Material.

Figure 2. The MSE curves given by the four models. The left panel represents the results for sparse $\beta_g^*$, and the right panel represents the results for dense $\beta_g^*$.

6. Application to ADNI Data Analysis

AD is an irreversible neurodegenerative disease that results in a loss of mental functions caused by a deterioration of the brain. It is the most common cause of dementia among people over the age of 65, affecting an estimated 5.5 million Americans, yet no prevention methods or cures have been discovered. The ADNI was started in 2004 with the goal of tracking the progression of the disease using biomarkers, and using clinical measures to assess the brain’s function over the course of the disease states. In this section, we apply our method to the ADNI data. We are interested in predicting the ADAS-Cog scores using structural magnetic resonance imaging (MRI) scans. All subjects in our analysis are from the ADNI2 phase of the study. In total, there are 697 subjects in our analysis and five groups: NC, SMC, eMCI, lMCI, and AD, ordered by disease severity. The MRI images were preprocessed using anterior commissure-posterior commissure correction, intensity inhomogeneity correction, skull stripping, cerebellum removal based on registration with atlas, spatial segmentation, and registration. After registration, we obtain MRI data with 93 regions of interest (ROIs). For each of the 93 ROIs, we compute the volume of gray matter as a feature. As a result, for each subject, we finally obtain 93 MRI features. Our goal is to predict the ADAS-Cog scores using the 93 MRI features, together with the group information.

We randomly partition the whole data set into two parts: 75% for training the model, and the rest for testing the performance. We repeat the random split 100 times. The testing MSEs and the corresponding standard errors are reported in Table 1 (overall performance) and Table 2 (groupwise performance). We compare four models: the global model (2.1), the group-specific model (2.2), the Factor-0 model (2.5), and our proposed model, as shown in (2.4). For each model, we use three penalty functions: the $\ell_2$ penalty (Ridge), the $\ell_1$ penalty (Lasso), and the Elastic Net (EN) penalty with the bridging parameter 0.5.

Table 1.

Overall MSEs for the four models.

Penalty Global Group-specific Factor-0 Proposed

Ridge 27.52 (0.33) 15.70 (0.19) 15.17 (0.18) 15.04 (0.18)
EN 28.23 (0.33) 16.26 (0.21) 15.47 (0.18) 15.40 (0.18)
Lasso 28.27 (0.34) 16.39 (0.23) 15.49 (0.19) 15.45 (0.18)

Table 2.

Groupwise MSEs for the four models.

Group Global Group-specific Factor-0 Proposed

Penalty = Ridge
NC 16.66 (0.38) 6.24 (0.09) 6.50 (0.10) 6.19 (0.10)
SMC 14.52 (0.31) 6.68 (0.15) 6.43 (0.15) 6.54 (0.15)
eMCI 18.37 (0.41) 10.26 (0.19) 9.84 (0.19) 9.82 (0.19)
lMCI 19.17 (0.38) 16.75 (0.32) 15.61 (0.30) 15.92 (0.32)
AD 73.55 (0.38) 41.25 (0.32) 40.00 (0.30) 39.28 (0.32)

Penalty = Elastic Net
NC 16.79 (0.38) 6.45 (0.09) 6.40 (0.11) 6.37 (0.09)
SMC 15.46 (0.38) 7.12 (0.09) 6.78 (0.11) 6.96 (0.09)
eMCI 18.65 (0.38) 10.59 (0.09) 10.13 (0.11) 10.22 (0.09)
lMCI 20.26 (0.38) 18.32 (0.09) 16.14 (0.11) 16.43 (0.09)
AD 75.00 (0.38) 41.49 (0.09) 40.54 (0.11) 39.64 (0.09)

Penalty = Lasso
NC 16.69 (0.38) 6.49 (0.09) 6.41 (0.11) 6.37 (0.09)
SMC 15.57 (0.38) 7.16 (0.09) 6.84 (0.11) 7.05 (0.09)
eMCI 18.44 (0.38) 10.73 (0.09) 10.17 (0.11) 10.26 (0.09)
lMCI 20.36 (0.38) 18.53 (0.09) 16.21 (0.11) 16.50 (0.09)
AD 75.40 (0.38) 41.73 (0.09) 40.47 (0.11) 39.68 (0.09)

As shown in Tables 1 and 2, our proposed model achieves promising performance in most cases. The global model performs worst, because it does not use the label information at all. The group-specific model does not perform as well as our proposed model, because it does not borrow information across different groups. Note that the Factor-0 model achieves a great improvement over the global model, which demonstrates that the difference in group means is the main source of the heterogeneous effect on the clinical scores across the five groups. It is seen in Table 2 that our model achieves its greatest improvement over the other models on the AD patients, which indicates that the effects of the heterogeneous factors identified in the AD group are much stronger than those in the other groups. This appears to be reasonable, because the brain structure of AD patients is significantly more impaired.

Our model also has good interpretability. In this real data set, we can interpret the variation due to the identified factors as disease-specific variation, and the variation due to the homogeneous signals as the disease-shared variation among all groups. Figure 3 gives heatmaps of $\hat{\Sigma}_{x,g} = (1/n_g)X_g^\top X_g$ (the top row), where $\Sigma_{x,g} = \mathrm{cov}(x_{g,i})$, and of $\hat{\Sigma}_{u,g}$ (the bottom row), which is obtained by applying an adaptive soft threshold to $\hat{\Sigma}_{x,g} - \hat{\Lambda}_g^\top\hat{\Lambda}_g$. The left, middle, and right columns of Figure 3 correspond to the NC, eMCI, and AD groups, respectively. From Figure 3, we can see that the bottom row looks more homogeneous across groups than the top row. We further represent brain connections using precision matrices estimated from Gaussian graphical models (Cai, Liu and Luo (2011)). See Section S4 in the Supplementary Material.

Figure 3. Heatmaps of $\hat{\Sigma}_{x,g}$ and $\hat{\Sigma}_{u,g}$ in the NC, eMCI, and AD groups.

7. Conclusion

We have proposed a factor regression model for heterogeneous data with subpopulations. Our proposed model decomposes the predictors into heterogeneous components driven by latent factors and homogeneous components. We assume the group-specific latent factors explain the main heterogeneous variations and, consequently, their associated coefficients can differ across groups. The homogeneous components share the same covariance matrix and, as a result, they share the same regression coefficients. Because the factors are unobserved, we first estimate them using a standard PCA procedure. We use OLS to estimate the group-specific coefficients directly. For the homogeneous regression coefficients, we propose a flexible penalized least squares solution. For model prediction, we also propose a data-driven procedure to estimate the factors for the testing data. Theoretical results on the estimation and prediction consistency under $\ell_2$ and $\ell_1$ penalties are established. We show that our proposed model is robust under the group-specific model. Extensive simulation studies further demonstrate the competitive performance of our proposed model over the global model and the group-specific model, with our proposed model achieving a good balance between the two. Finally, we apply the proposed method to an ADNI data set for clinical score prediction, and demonstrate that our model has good prediction power and meaningful interpretation. One interesting future direction is to extend the method to other types of outcomes, such as categorical or count data.


Acknowledgments

The authors would like to thank the editor, associate editor, and reviewers for their helpful comments and suggestions. This research was supported in part by NSF grant DMS-1821231, NIH grants R01GM126550 and R01AG073259.

Footnotes

Supplementary Material

Section S1 gives proofs of Theorems 1–4, Corollaries 1.1–4.1, and the supporting lemmas. Section S2 provides a rule of thumb to choose between our proposed model and the group-specific model in practice. Section S3 presents additional simulation results. Section S4 contains additional results from the ADNI data analysis. Section S5 shows the analysis results when we apply our method to a combined microarray data set.

References

  1. Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
  2. Bai J (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
  3. Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  4. Bai J and Ng S (2013). Principal components estimation and identification of static factors. Journal of Econometrics 176, 18–29.
  5. Bai J and Ng S (2008). Large dimensional factor analysis. Foundations and Trends in Econometrics 3, 89–163.
  6. Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37, 1705–1732.
  7. Bühlmann P (2016). Partial least squares for heterogeneous data. In The Multiple Facets of Partial Least Squares and Related Methods (Edited by Abdi H, Esposito Vinzi V, Russolillo G, Saporta G and Trinchera L), 3–15. Springer International Publishing, Cham.
  8. Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106, 672–684.
  9. Cai T, Liu W and Luo X (2011). A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 594–607.
  10. Dicker LH (2016). Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 22, 1–37.
  11. Dobriban E and Wager S (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics 46, 247–279.
  12. Fan J, Liao Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75, 603–680.
  13. Fan J, Liu H, Wang W and Zhu Z (2018). Heterogeneity adjustment with applications to graphical model inference. Electronic Journal of Statistics 12, 3908–3952.
  14. Feng Q, Jiang M, Hannig J and Marron J (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis 166, 241–265.
  15. Gaynanova I and Li G (2019). Structural learning and integrative decomposition of multi-view data. Biometrics 75, 1121–1132.
  16. Hastie T and Tibshirani R (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological) 55, 757–796.
  17. Hoerl AE and Kennard RW (2000). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 42, 80–86.
  18. Hsu D, Kakade SM and Zhang T (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics 14, 569–600.
  19. Joliffe I and Morgan B (1992). Principal component analysis and exploratory factor analysis. Statistical Methods in Medical Research 1, 69–95.
  20. Lam C and Yao Q (2012). Factor modeling for high-dimensional time series: Inference for the number of factors. The Annals of Statistics 40, 694–726.
  21. Li Q, Cheng G, Fan J and Wang Y (2018). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 113, 380–389.
  22. Lock EF, Hoadley KA, Marron JS and Nobel AB (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics 7, 523–542.
  23. Ma S and Huang J (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112, 410–423.
  24. Meinshausen N and Bühlmann P (2015). Maximin effects in inhomogeneous large-scale data. The Annals of Statistics 43, 1801–1830.
  25. Muniategui A, Pey J, Planes FJ and Rubio A (2013). Joint analysis of miRNA and mRNA expression data. Briefings in Bioinformatics 14, 263–278.
  26. Park JY and Lock EF (2020). Integrative factorization of bidimensionally linked matrices. Biometrics 76, 61–74.
  27. Pinheiro JC and Bates DM (2000). Linear mixed-effects models: Basic concepts and examples. Mixed-Effects Models in S and S-Plus, 3–56.
  28. Raskutti G, Wainwright MJ and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory 57, 6976–6994.
  29. Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
  30. Tang L and Song PX (2016). Fused lasso approach in regression coefficients clustering: Learning parameter heterogeneity in data integration. The Journal of Machine Learning Research 17, 3915–3937.
  31. Tibshirani R (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
  32. Vicari D and Vichi M (2013). Multivariate linear regression for heterogeneous data. Journal of Applied Statistics 40, 1209–1230.
  33. Wang P, Liu Y and Shen D (2018). Flexible locally weighted penalized regression with applications on prediction of Alzheimer's Disease Neuroimaging Initiative's clinical scores. IEEE Transactions on Medical Imaging 38, 1398–1408.
  34. Wold S, Esbensen K and Geladi P (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 37–52.
  35. Zhang D, Wang Y, Zhou L, Yuan H and Shen D (2011). Multimodal classification of Alzheimer's disease and mild cognitive impairment. Neuroimage 55, 856–867.
  36. Zhao T, Cheng G and Liu H (2016). A partially linear framework for massive heterogeneous data. The Annals of Statistics 44, 1400–1437.
  37. Zhou G, Cichocki A, Zhang Y and Mandic DP (2015). Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems 27, 2426–2439.
  38. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.
