Published in final edited form as: Stat Sin. 2023 Jan;33(1):27–53. doi: 10.5705/ss.202020.0145

HIGH-DIMENSIONAL FACTOR REGRESSION FOR HETEROGENEOUS SUBPOPULATIONS

Peiyao Wang 1, Quefeng Li 1, Dinggang Shen 2,3,4, Yufeng Liu 1

Abstract

In modern scientific research, data heterogeneity is commonly observed owing to the abundance of complex data. We propose a factor regression model for data with heterogeneous subpopulations. The proposed model can be represented as a decomposition of heterogeneous and homogeneous terms. The heterogeneous term is driven by latent factors in different subpopulations. The homogeneous term captures common variation in the covariates and shares common regression coefficients across subpopulations. Our proposed model attains a good balance between a global model and a group-specific model. The global model ignores the data heterogeneity, while the group-specific model fits each subgroup separately. We prove the estimation and prediction consistency for our proposed estimators, and show that they have better convergence rates than those of the group-specific and global models. We show that the extra cost of estimating latent factors is asymptotically negligible and the minimax rate is still attainable. We further demonstrate the robustness of our proposed method by studying its prediction error under a mis-specified group-specific model. Finally, we conduct simulation studies and analyze a data set from the Alzheimer’s Disease Neuroimaging Initiative and an aggregated microarray data set to further demonstrate the competitiveness and interpretability of our proposed factor regression model.

Keywords: Factor models, heterogeneity, penalized regression, prediction

1. Introduction

Data heterogeneity is an important issue in modern complex data analysis. In practice, data heterogeneity may come from variables or samples. More specifically, multi-modality/source data have heterogeneity among the variables, because they may correspond to different types of measurements. For example, in biomedical imaging, people may acquire both MRI and PET images (Zhang et al. (2011)). In genomics studies, measurements are collected from different sources, such as mRNA and miRNA (Muniategui et al. (2013)). In addition to variable heterogeneity, data heterogeneity can also arise from samples. For example, there can be subpopulations, batch and clustering effects, or outliers in the data (Bühlmann (2016)), potentially violating the standard independent and identically distributed (i.i.d.) assumption. Ignoring such heterogeneity can lead to poor estimation and prediction. Hence, it is important to take data heterogeneity into account during the modeling process.

In this study, we are interested in data heterogeneity that comes from subgroup populations. For example, in the Alzheimer’s Disease (AD) study, subjects can have five subtypes: Normal Control (NC), Significant Memory Concern (SMC), Early Mild Cognitive Impairment (eMCI), Late Mild Cognitive Impairment (lMCI), and AD, where these subtypes are ordered by disease severity. Owing to data heterogeneity, it can be difficult to build accurate and interpretable predictive models on such data using traditional statistical techniques. A global model that fits a single regression model to all the data may be restrictive because it ignores the group label information, whereas fitting distinct regression models in each group may not be optimal because this does not capture shared information across groups. Hence, a statistical regression model that can recover interpretable globally shared and group-specific signals in the data is required to handle such heterogeneous data. In the literature, varying coefficient models (Hastie and Tibshirani (1993)) and mixed-effects models (Pinheiro and Bates (2000)) are useful in addressing data heterogeneity. However, these models can be computationally expensive to use in practice, especially when the dimension is high. More recently, Vicari and Vichi (2013) proposed a general regression model to account for both between-cluster and within-cluster variation. Meinshausen and Bühlmann (2015) proposed a maximin-effects approach under the mixture model. Zhao, Cheng and Liu (2016) proposed a partially linear regression framework to model massive heterogeneous data. Tang and Song (2016) and Ma and Huang (2017) proposed fused penalties to estimate regression coefficients in order to identify subpopulations. Wang, Liu and Shen (2018) proposed a locally weighted penalized model by incorporating a progression score in the local kernels. However, these models are not designed to characterize the globally shared and group-specific structures. Thus, it is desirable to build a model that can identify such structures, quantify prediction errors, and draw interpretable and generalizable scientific conclusions.

There is a large body of literature on data heterogeneity for unsupervised learning. Principal component analysis (PCA) (Wold, Esbensen and Geladi (1987)) techniques are popular, owing to their computational simplicity and theoretical soundness. The joint and individual variations explained (JIVE) method (Lock et al. (2013)) decomposes joint and individual low-rank signals across multiple sources of data. More recent extensions of JIVE can be found in Feng et al. (2018); Gaynanova and Li (2019); Park and Lock (2020). These methods can be extended easily to decompose data from multiple subgroups. Zhou et al. (2015) proposed a matrix factorization framework for common and individual feature extraction for multi-block data.

Closely related to PCA, another popular technique for handling data heterogeneity is factor models. Factor models are useful unsupervised learning tools that model the dependence between multiple variables. The relationship between PCA and factor models is well studied in the literature (Joliffe and Morgan (1992); Stock and Watson (2002); Bai and Ng (2002)). Factor models assume that the variations among the variables are driven by latent factors residing in a low-dimensional space. More recently, Fan et al. (2018) proposed a factor model framework to model the heterogeneity from different subgroups. They used the factor model in the context of Gaussian graphical models to estimate common and individual graphs from different groups. Their structural assumption on the data matrices can be generalized to predictive modeling.

Here, we focus on supervised learning, and propose a novel factor regression model for heterogeneous data with jointly shared and group-specific structures. We assume that the leading factors in each group drive the majority of variation, which contributes to the heterogeneity effects. After the majority of the variation has been removed, the residual signals are assumed to be homogeneous across subgroups; that is, they have the same covariance matrix. Under this framework, the predictors in the proposed model can be decomposed into heterogeneous factors and homogeneous signals. Correspondingly, in our proposed model, the regression coefficients associated with the factors are group specific, whereas the regression coefficients associated with the homogeneous signals are shared across groups. We use PCA to estimate the factors and homogeneous signals. Because the estimated factors and homogeneous signals are orthogonal, their coefficients can be estimated separately. The low-dimensional heterogeneous regression coefficients can be estimated directly using the ordinary least squares (OLS) method. After projecting the responses on the estimated factors in each group, their residuals can be aggregated together to perform a global regression. When the dimension is high, the homogeneous signals’ coefficients are difficult to estimate. Building on existing penalization methods (Hoerl and Kennard (2000); Tibshirani (1996); Zou and Hastie (2005)), we propose a flexible penalized least squares method to solve for the high-dimensional coefficients. In the least squares problem, we use the adaptive thresholding estimator (Cai and Liu (2011)) to estimate the covariance of the homogeneous signals. For prediction, we propose a data-driven trace maximization step to estimate the factors and homogeneous signals in the test set before applying our model for prediction.

We establish the estimation consistency for our proposed estimators using either an $\ell_2$ or $\ell_1$ penalty. In terms of the prediction accuracy, we study the prediction error of our method in both theoretical and simulation studies, and demonstrate that the proposed model attains a good balance between a global model and a group-specific model. Furthermore, we show that our method is robust when the underlying model is group specific, and has comparable prediction performance with respect to the group-specific model. We apply our method to an Alzheimer’s Disease Neuroimaging Initiative (ADNI) data set and an aggregated microarray data set to show the competitiveness of our model in terms of model prediction and interpretability.

The rest of paper is organized as follows. In Section 2, we introduce the factor decomposition of heterogeneous and homogeneous signals and a corresponding regression model. In Section 3, we introduce the model estimation and a data-driven approach to estimate the factors in the testing data for prediction. In Section 4, we study the estimation and prediction consistency of our proposed method, and compare it with those of group-specific and global models under different scenarios. In Section 5, we conduct simulated experiments to evaluate the performance of our model under different settings, and compare them with that of the global and group-specific models. In Section 6, we apply our model to the ADNI data to predict the clinical score. We conclude the paper with a discussion in Section 7.

2. Motivation and Model Framework

Factor models are useful for modeling the dependence between multiple variables, if these variables are driven by some latent factors. For heterogeneous data, the subgroup heterogeneity can be captured by the group-specific latent factors. After removing such latent factors, different subgroups can be viewed as homogeneous samples for a joint analysis. In this section, we first motivate our proposed model by introducing two simple models in Section 2.1. Then, we briefly review the factor decomposition for heterogeneous data and propose our new factor regression model in Section 2.2.

2.1. Motivation

We first introduce some notation. Assume that the data come from $G$ groups. There are $n_g$ samples in the $g$th group, each having the same set of $p$ explanatory variables. Let $\{X_g, Y_g\}_{g=1}^G$ be the observations from the $G$ groups, where $X_g \in \mathbb{R}^{n_g \times p}$ is the data matrix and $Y_g \in \mathbb{R}^{n_g}$ is the response vector.

There are two commonly used approaches in the regression setup for heterogeneous subpopulations. On the one hand, ignoring the group information, one can use a global model:

$$Y = \mu^* + X\beta^* + \epsilon, \tag{2.1}$$

where $Y = (Y_1^\top, \ldots, Y_G^\top)^\top$ and $X = (X_1^\top, \ldots, X_G^\top)^\top$. In this model, all the subgroups share the same intercept and regression coefficients. The global model ignores the heterogeneity from subgroups and may be too restrictive. On the other hand, by modeling each group separately, one may consider a group-specific model:

$$Y_g = \mu_g^* + X_g\beta_g^* + \epsilon_g. \tag{2.2}$$

However, this model may not be efficient because it ignores the shared information across subgroups. These global and group-specific models motivate us to consider a model in between, under which the group-specific heterogeneity and homogeneity across subgroups can both be accounted for. This can be achieved by using a factor model that decomposes covariates into the heterogeneous and homogeneous components.

2.2. Factor model framework

To model the heterogeneous effect introduced by groups, assume that the data matrix $X_g$ can be decomposed as

$$X_g = F_g\Lambda_g + U_g, \tag{2.3}$$

where $F_g \in \mathbb{R}^{n_g \times K_g}$ is the factor matrix, $\Lambda_g \in \mathbb{R}^{K_g \times p}$ is the loading matrix, and $U_g \in \mathbb{R}^{n_g \times p}$ denotes the homogeneous signals, also known as idiosyncratic errors in the factor model literature (Bai and Ng (2008)). The number of random factors $K_g$ can vary among groups.

Denote the $i$th rows of $X_g$, $F_g$, and $U_g$ by $x_{g,i}$, $f_{g,i}$, and $u_{g,i}$, respectively. By (2.3), we have $x_{g,i} = \Lambda_g^\top f_{g,i} + u_{g,i}$. We assume $f_{g,i}$ and $u_{g,i}$ are uncorrelated and satisfy $E(f_{g,i}) = 0$, $\mathrm{cov}(f_{g,i}) = I_{K_g \times K_g}$, $E(u_{g,i}) = 0$, and $\mathrm{cov}(u_{g,i}) = \Sigma_u$. Hence, for each sample in group $g$, we have $\mathrm{cov}(x_{g,i}) = \Lambda_g^\top\Lambda_g + \Sigma_u$, which is the sum of the group-specific low-rank matrix $\Lambda_g^\top\Lambda_g$, capturing the group-specific heterogeneity, and the matrix $\Sigma_u$, which is homogeneous across different groups.

We adopt the approximate factor model (Stock and Watson (2002)) by assuming that $\Sigma_u$ is sparse. Its sparsity can be characterized by $m_p$, defined as

$$m_p = \max_{i \le p}\sum_{j=1}^p I(\sigma_{u,ij} \neq 0),$$

which is the maximum number of nonzero entries in any row of $\Sigma_u$.

Under the decomposition (2.3), we have the following regression model for the $g$th group:

$$Y_g = \mu_g^* + F_g\gamma_g^* + U_g\beta^* + \epsilon_g. \tag{2.4}$$

Here, $\mu_g^*$ is the true group mean vector, $\gamma_g^* \in \mathbb{R}^{K_g}$ denotes the true group-specific coefficients for $F_g$, $\beta^* \in \mathbb{R}^p$ denotes the common coefficients shared across the $G$ groups for $U_g$, and $\epsilon_g$ is the noise term with variance $\sigma^2$. In (2.4), the $\gamma_g^*$ vary across the $G$ groups, and they characterize the heterogeneity induced by the factors in the regression model. Moreover, the group mean term $\mu_g^*$ contributes to the heterogeneity in the regression model (2.4). When the heterogeneous effect is removed from (2.4), we have the same coefficients $\beta^*$ for $U_g$ across the $G$ groups.

From (2.4), we can see that the heterogeneity is modeled by $\mu_g^* + F_g\gamma_g^*$. After adjusting for this heterogeneous term, the remainder term $U_g\beta^*$ is homogeneous. Model (2.4) implies that, for the response $y_{g,i}$ of the $i$th subject in group $g$, we have $\mathrm{var}(y_{g,i}) = \gamma_g^{*\top}\gamma_g^* + \beta^{*\top}\Sigma_u\beta^* + \sigma^2$. This shows that the variance can be decomposed as the sum of a group-specific part $\gamma_g^{*\top}\gamma_g^*$, a homogeneous part $\beta^{*\top}\Sigma_u\beta^*$, and the background noise $\sigma^2$. This decomposition allows us to account for the heterogeneity among subgroups, while also borrowing information across subgroups to model homogeneous effects.

One special case of our proposed model (2.4) is when there is no group-specific factor; that is, $F_g = 0$. Then, it reduces to a mean-specific model:

$$Y_g = \mu_g^* + X_g\beta^* + \epsilon_g. \tag{2.5}$$

This model lies between the global model (2.1) and the group-specific model (2.2). It is different from (2.1) because it adjusts the group mean. It is different from (2.2) because different groups share the common regression coefficients. We refer to (2.5) as the “Factor-0” model.

3. Model Estimation and Prediction

In this section, we introduce the model estimation procedure and a data-driven way to estimate the factors in the testing data for prediction. The overall training procedure consists of two steps. First, we estimate the factors and homogeneous signals from the training data. Second, we estimate the regression coefficients using the estimated factors and homogeneous signals. In Section 3.1, we introduce how the factors can be estimated using a PCA. In Section 3.2, we introduce our procedure for estimating the model parameters. After the model is trained, in Section 3.3, we propose a data-driven procedure to estimate the factors in the testing data in order to make predictions.

3.1. Factor model estimation

For group $g$, the estimation of $F_g$ and $\Lambda_g$ can be formulated as the following optimization problem:

$$\min_{F_g,\Lambda_g}\|X_g - F_g\Lambda_g\|_F, \quad \text{s.t. } F_g^\top F_g = n_gI, \ \Lambda_g\Lambda_g^\top \text{ is diagonal}, \tag{3.1}$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm. The solution to (3.1) can be obtained by performing the eigendecomposition of the matrix $X_gX_g^\top$. Following the standard PCA procedure, we estimate $F_g$ by $\hat{F}_g$, where the $k$th column of $\hat{F}_g$ is $\sqrt{n_g}$ times the eigenvector corresponding to the $k$th largest eigenvalue of $X_gX_g^\top$. Then, the loading matrix $\Lambda_g$ can be estimated by regressing $X_g$ on $\hat{F}_g$ to obtain $\hat{\Lambda}_g = \hat{F}_g^\top X_g/n_g$. The homogeneous signal matrix $U_g$ can hence be estimated by the residual matrix $\hat{U}_g = X_g - \hat{F}_g\hat{\Lambda}_g$.
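To make this step concrete, the following is a minimal numpy sketch of the PCA computation described above, assuming $K_g$ is known; the function name and interface are our own and are not part of any software accompanying the paper.

```python
import numpy as np

def estimate_factors(X_g, K_g):
    """PCA-based estimation of factors, loadings, and homogeneous signals for one
    group, following (3.1): the columns of F_hat are sqrt(n_g) times the leading
    eigenvectors of X_g X_g^T."""
    n_g = X_g.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X_g @ X_g.T)   # eigendecomposition of the n_g x n_g matrix
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues in decreasing order
    V = eigvecs[:, order[:K_g]]                      # leading K_g eigenvectors
    F_hat = np.sqrt(n_g) * V                         # estimated factor matrix (n_g x K_g)
    Lambda_hat = F_hat.T @ X_g / n_g                 # estimated loading matrix (K_g x p)
    U_hat = X_g - F_hat @ Lambda_hat                 # estimated homogeneous signals (n_g x p)
    return F_hat, Lambda_hat, U_hat
```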

We now consider estimating the number of factors $K_g$. In the literature, several estimators have been proposed to solve this problem (Bai and Ng (2002); Lam and Yao (2012); Ahn and Horenstein (2013)). We consider the following estimator:

$$\hat{K}_g = \arg\max_{k \le K_{\max}}\frac{\lambda_k(X_gX_g^\top)}{\lambda_{k+1}(X_gX_g^\top)}, \tag{3.2}$$

where $\lambda_k(\cdot)$ denotes the $k$th largest eigenvalue (Lam and Yao (2012)). Here, $K_{\max}$ is a pre-determined upper bound for the number of factors. This estimator has been shown to be a consistent estimator (Ahn and Horenstein (2013)) of the true $K_g$, and is simple to implement in practice.
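A short sketch of the eigenvalue-ratio estimator (3.2) is given below; as above, the function name and the default value of $K_{\max}$ are illustrative assumptions rather than part of the authors' implementation.

```python
import numpy as np

def estimate_num_factors(X_g, K_max=10):
    """Eigenvalue-ratio estimator (3.2) of the number of factors
    (Lam and Yao (2012); Ahn and Horenstein (2013))."""
    K_max = min(K_max, X_g.shape[0] - 1)              # guard against tiny groups
    eigvals = np.linalg.eigvalsh(X_g @ X_g.T)[::-1]   # eigenvalues in decreasing order
    ratios = eigvals[:K_max] / eigvals[1:K_max + 1]   # lambda_k / lambda_{k+1}, k = 1,...,K_max
    return int(np.argmax(ratios)) + 1                 # argmax over k (1-indexed)
```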

3.2. Estimation of regression coefficients

Given $\hat{F}_g$ and $\hat{U}_g$, as discussed in Section 3.1, we can estimate the model parameters $\mu_g^*$, $\gamma_g^*$, and $\beta^*$. The factor decomposition (2.3) projects the original signals onto the low-dimensional space spanned by $F_g$ and the space spanned by $U_g$, which is orthogonal to $F_g$. Owing to the properties of the eigendecomposition, $\hat{F}_g$ and $\hat{U}_g$ are orthogonal to each other. Hence, we can estimate the regression coefficients $\gamma_g^*$ and $\beta^*$ in (2.4) separately. Given $\hat{F}_g$, $\mu_g^*$ and $\gamma_g^*$ can be estimated by the following OLS estimators:

$$\hat{\mu}_g = \bar{Y}_g, \quad \hat{\gamma}_g = \frac{\hat{F}_g^\top Y_g}{n_g}, \tag{3.3}$$

where $\bar{Y}_g$ denotes the sample mean of the response in group $g$.

Note that the factor matrix $F_g$ and the coefficients $\gamma_g^*$ are not separately identifiable, because for any orthogonal matrix $H_g$, we have $F_g\gamma_g^* = F_gH_gH_g^\top\gamma_g^*$. Hence, $(F_g, \gamma_g^*)$ cannot be identified from $(F_gH_g, H_g^\top\gamma_g^*)$. In practice, it does not matter which one is used, because the linear space spanned by the columns of $F_gH_g$ is the same as that spanned by those of $F_g$.

For the homogeneous regression coefficients $\beta^*$, because they are shared across groups, we can aggregate the residuals from the response and the factor projection to perform a global regression to estimate $\beta^*$. Denote the aggregated residual vector from the response as $\tilde{Y} = (\tilde{Y}_1^\top, \ldots, \tilde{Y}_G^\top)^\top$, where $\tilde{Y}_g = Y_g - \hat{\mu}_g - \hat{F}_g\hat{\gamma}_g$. Let $U = (U_1^\top, \ldots, U_G^\top)^\top$ and $\hat{U} = (\hat{U}_1^\top, \ldots, \hat{U}_G^\top)^\top$. We solve the following penalized quadratic minimization problem to estimate $\beta^*$:

$$\min_\beta \frac{1}{2}\left\{\beta^\top\hat{\Sigma}_u\beta - \frac{2}{n}\tilde{Y}^\top\hat{U}\beta\right\} + \lambda P(\beta), \tag{3.4}$$

where $P(\beta)$ is a penalty function and $\lambda$ is a tuning parameter, the optimal value of which is chosen using cross-validation. In particular, we consider an $\ell_1$ penalty with $P(\beta) = \sum_{j=1}^p|\beta_j|$ and an $\ell_2$ penalty with $P(\beta) = \sum_{j=1}^p\beta_j^2$, and denote the corresponding solutions of (3.4) by $\hat{\beta}_\lambda^{\text{lasso}}$ and $\hat{\beta}_\lambda^{\text{ridge}}$, respectively. In (3.4), $\hat{\Sigma}_u$ is an estimator of $\Sigma_u$. To obtain such an estimator, we use the adaptive thresholding method (Cai and Liu (2011)). More specifically, let $\hat{\sigma}_{ij} = (1/n)\sum_{g=1}^G\sum_{t=1}^{n_g}\hat{u}_{g,ti}\hat{u}_{g,tj}$ and $\hat{\theta}_{ij} = (1/n)\sum_{g=1}^G\sum_{t=1}^{n_g}(\hat{u}_{g,ti}\hat{u}_{g,tj} - \hat{\sigma}_{ij})^2$, where $\hat{u}_{g,ti}$ is the $(t,i)$th element of $\hat{U}_g$. We have

$$\hat{\Sigma}_u = (\hat{\sigma}_{ij}^{\mathcal{T}})_{p\times p}, \quad \hat{\sigma}_{ij}^{\mathcal{T}} = \begin{cases}\hat{\sigma}_{ii}, & i = j,\\ s_{ij}(\hat{\sigma}_{ij}), & i \neq j,\end{cases} \tag{3.5}$$

where $s_{ij}(\cdot)$ is any thresholding function that satisfies, for all $z \in \mathbb{R}$,

$$s_{ij}(z) = 0 \text{ when } |z| < \tau_{ij}, \quad \text{and} \quad |s_{ij}(z) - z| \le \tau_{ij} \text{ when } |z| \ge \tau_{ij}. \tag{3.6}$$

Here, $\tau_{ij} = D\omega_n\sqrt{\hat{\theta}_{ij}}$ is an adaptive threshold, where $\omega_n = 1/\sqrt{p} + \sqrt{\log p/n}$. The purpose of using such a thresholding estimator is to ensure that $\Sigma_u$ can be consistently estimated when $p > n$. In Section S3.1 of the Supplementary Material, we perform a sensitivity study on the choice of $D$, and find that the prediction performance of our method is not sensitive to $D$. Thus, we recommend choosing $D$ to be a fixed number, rather than tuning it. When $p < n$, $\Sigma_u$ does not have to be sparse. In this case, we find it is safe to choose $D = 0$; see Section S3.1 of the Supplementary Material.
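The adaptive thresholding step and the ridge version of (3.4) admit a short implementation. The sketch below follows the reconstruction of (3.4)–(3.6) given above with soft thresholding as $s_{ij}(\cdot)$; the helper names and the choice $D = 2$ are illustrative assumptions, and the Lasso version would replace the closed-form solve with an iterative solver such as coordinate descent.

```python
import numpy as np

def adaptive_threshold_cov(U_hat, D=2.0):
    """Adaptive soft-thresholding estimator (3.5)-(3.6) of Sigma_u, built from the
    pooled residual matrix U_hat (n x p)."""
    n, p = U_hat.shape
    sigma = U_hat.T @ U_hat / n                        # sample covariance of the residuals
    M2 = (U_hat ** 2).T @ (U_hat ** 2) / n             # mean over t of (u_ti * u_tj)^2
    theta = M2 - sigma ** 2                            # theta_ij = mean of (u_ti u_tj - sigma_ij)^2
    omega_n = 1.0 / np.sqrt(p) + np.sqrt(np.log(p) / n)
    tau = D * omega_n * np.sqrt(theta)                 # entrywise adaptive thresholds
    out = np.sign(sigma) * np.maximum(np.abs(sigma) - tau, 0.0)   # soft thresholding
    np.fill_diagonal(out, np.diag(sigma))              # keep the diagonal untouched
    return out

def fit_beta_ridge(Sigma_u_hat, U_hat, Y_tilde, lam):
    """Ridge solution of (3.4): minimizing
    (1/2){beta' Sigma_u_hat beta - (2/n) Y_tilde' U_hat beta} + lam * ||beta||^2
    gives beta = (Sigma_u_hat + 2*lam*I)^{-1} U_hat' Y_tilde / n."""
    n, p = U_hat.shape
    return np.linalg.solve(Sigma_u_hat + 2 * lam * np.eye(p), U_hat.T @ Y_tilde / n)
```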

We summarize the overall training procedure as follows:

  1. For $g = 1, \ldots, G$:

    1. Estimate $K_g$ from (3.2).

    2. Perform PCA on $X_gX_g^\top$ to obtain $\hat{F}_g$. Estimate $\mu_g^*$ and $\gamma_g^*$ from (3.3).

    3. Compute the projection matrix $P_g = \hat{F}_g\hat{F}_g^\top/n_g$.

  2. Let $H = \mathrm{diag}(I - P_1, \ldots, I - P_G)$ be the block diagonal matrix. Compute the aggregated signals $\hat{U} = HX$ and $\tilde{Y} = H(Y - \hat{\mu})$, where $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_G)$. Estimate $\hat{\Sigma}_u$ from $\hat{U}$ using (3.5). Solve the optimization problem (3.4) to estimate $\beta^*$, as illustrated in the sketch following this list.
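The following driver chains the hypothetical helpers sketched in Sections 3.1 and 3.2 above (estimate_num_factors, estimate_factors, adaptive_threshold_cov, fit_beta_ridge); it is one possible realization of the two-step procedure, not the authors' code, and assumes the data are given as lists of per-group arrays.

```python
import numpy as np

def train_factor_regression(X_groups, Y_groups, lam, K_max=10, D=2.0):
    """Two-step training: (i) per-group factor estimation and OLS for (mu_g, gamma_g);
    (ii) pooled penalized regression (3.4) for the shared beta."""
    models, U_parts, Ytil_parts = [], [], []
    for X_g, Y_g in zip(X_groups, Y_groups):
        K_g = estimate_num_factors(X_g, K_max)           # step 1(a): number of factors
        F_g, Lam_g, U_g = estimate_factors(X_g, K_g)     # step 1(b): PCA factors and loadings
        mu_g = Y_g.mean()                                # (3.3): group mean
        gamma_g = F_g.T @ Y_g / len(Y_g)                 # (3.3): group-specific coefficients
        models.append({"K": K_g, "Lambda": Lam_g, "mu": mu_g, "gamma": gamma_g})
        U_parts.append(U_g)                              # equivalently (I - P_g) X_g
        Ytil_parts.append(Y_g - mu_g - F_g @ gamma_g)    # residuals after the factor fit
    U_hat = np.vstack(U_parts)                           # aggregated homogeneous signals
    Y_tilde = np.concatenate(Ytil_parts)
    Sigma_u_hat = adaptive_threshold_cov(U_hat, D=D)     # (3.5): thresholded covariance
    beta_hat = fit_beta_ridge(Sigma_u_hat, U_hat, Y_tilde, lam)   # solve (3.4), ridge version
    return models, beta_hat
```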

In practice, it can be desirable to have an automatic way to choose between the proposed factor regression model (2.4) and the group-specific model (2.2). We provide an effective rule of thumb in Section S2 in the Supplementary Material.

3.3. Prediction

After training the model, in order to make predictions on the testing data, we need to estimate the factors and homogeneous signals in the testing data. In practice, they are not observable. We provide a data-driven procedure to estimate them based on the estimated loading matrix. Let $X_{g,*} \in \mathbb{R}^{n_{g,*} \times p}$ denote the testing data matrix from group $g$. We aim to estimate the factor matrix $F_{g,*} \in \mathbb{R}^{n_{g,*} \times K_g}$ and the homogeneous signal matrix $U_{g,*} \in \mathbb{R}^{n_{g,*} \times p}$. Note that the number of columns in $F_{g,*}$ is the same as that in $F_g$.

Motivated by (3.1), we assume that the training and testing data from the same group have the same factor decomposition with the same loading matrix $\Lambda_g$. Hence, given $\hat{\Lambda}_g$ from the training data, we propose estimating $F_{g,*}$ by solving

$$\min_{F_{g,*}}\|X_{g,*} - F_{g,*}\hat{\Lambda}_g\|_F, \quad \text{s.t. } F_{g,*}^\top F_{g,*} = n_{g,*}I. \tag{3.7}$$

Note that (3.7) can be formulated as a trace maximization problem, the solution of which is given by $\hat{F}_{g,*} = \sqrt{n_{g,*}}\,\tilde{V}_g\tilde{U}_g^\top$, where $\tilde{V}_g$ and $\tilde{U}_g$ come from the singular value decomposition $\hat{\Lambda}_gX_{g,*}^\top = \tilde{U}_gS_g\tilde{V}_g^\top$.
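A minimal sketch of this prediction step, under the SVD-based solution reconstructed above, is as follows; the function name is our own.

```python
import numpy as np

def estimate_test_factors(X_test, Lambda_hat):
    """Solve (3.7) by trace maximization: with the SVD
    Lambda_hat @ X_test.T = U_tilde S V_tilde^T, the solution is
    F_test = sqrt(n_test) * V_tilde @ U_tilde^T."""
    n_test = X_test.shape[0]
    U_tilde, s, Vt_tilde = np.linalg.svd(Lambda_hat @ X_test.T, full_matrices=False)
    F_test = np.sqrt(n_test) * Vt_tilde.T @ U_tilde.T   # estimated test factors (n_test x K_g)
    U_test = X_test - F_test @ Lambda_hat               # estimated homogeneous signals
    return F_test, U_test
```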

4. Theoretical Properties

We study the statistical properties of the proposed estimator. Without loss of generality, we assume that $\mu_g^* = 0$ for any $g \in \{1, \ldots, G\}$, so that (2.4) reduces to

$$Y_g = F_g\gamma_g^* + U_g\beta^* + \epsilon_g. \tag{4.1}$$

We establish the following theoretical results. First, we prove in Theorem 1 that the proposed estimators are consistent up to a rotation of the true parameters. As a corollary, we give an upper bound of the prediction error for the proposed method. Second, we show in Theorems 2 and 3 that if (4.1) is true, the group-specific model and the global model yield worse predictions than those of our proposed method. On the other hand, we show in Theorem 4 that even if one assumes each group has a distinct model, our method can have the same prediction error as the group-specific model when p is sufficiently large. Thus, our method is robust to model mis-specification.

First, we introduce some notation. For a matrix $A \in \mathbb{R}^{p\times p}$, let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues, respectively. Let $\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}$, $\|A\| = \sqrt{\lambda_{\max}(A^\top A)}$, $\|A\|_1 = \max_{j\le p}\sum_{i=1}^p|a_{ij}|$, and $\|A\|_{\max} = \max_{i,j\le p}|a_{ij}|$ denote its Frobenius, $\ell_2$, $\ell_1$, and elementwise maximum norms, respectively. For a vector $b \in \mathbb{R}^p$, let $\|b\| = \sqrt{\sum_{j=1}^pb_j^2}$, $\|b\|_1 = \sum_{j=1}^p|b_j|$, and $\|b\|_\infty = \max_{j\le p}|b_j|$ denote its $\ell_2$, $\ell_1$, and maximum norms, respectively, and define its support as $\{j : b_j \neq 0\}$. Furthermore, we let $n_{\max} = \max_{g\le G}n_g$, $n = \sum_{g=1}^Gn_g$, and $[m] = \{1, \ldots, m\}$ for a general positive integer $m$. In addition, we introduce the following definitions.

Definition 1.

A vector $\beta \in \mathbb{R}^p$ is called $s$-sparse if and only if the cardinality of its support is at most $s$.

Definition 2 (RE Condition).

A matrix $\Sigma$ is said to satisfy the restricted eigenvalue (RE) condition if and only if there exists a positive constant $\kappa$ such that $\beta^\top\Sigma\beta \ge \kappa\|\beta\|^2$ for any $\beta \in \mathcal{C}(S) = \{\beta \in \mathbb{R}^p : \|\beta_{S^c}\|_1 \le 3\|\beta_S\|_1\}$, where $S \subseteq [p]$ and $S^c$ denotes its complement.

4.1. Consistency of the factor regression method

To establish the consistency of our proposed method, we need to impose the following conditions.

Assumption 1 (Pervasiveness).

There exist positive constants $C_{\min}$ and $C_{\max}$ such that, for any $g \in [G]$,

$$C_{\min} < \lambda_{\min}(p^{-1}\Lambda_g\Lambda_g^\top) < \lambda_{\max}(p^{-1}\Lambda_g\Lambda_g^\top) < C_{\max}.$$

Assumption 2.

For any $g \in [G]$, assume that $\{f_{g,i}\}_{i\le n_g}$ and $\{u_{g,i}\}_{i\le n_g}$ are i.i.d. sub-Gaussian random vectors with zero means and covariances $I_{K_g\times K_g}$ and $\Sigma_u$, respectively. More explicitly, assume for any $\alpha \in \mathbb{R}^{K_g}$, $\gamma \in \mathbb{R}^p$, and $s > 0$, there exists $C > 0$ such that $P(|\alpha^\top f_{g,i}| > s) \le \exp(-Cs^2/\|\alpha\|^2)$ and $P(|\gamma^\top u_{g,i}| > s) \le \exp(-Cs^2/\|\gamma\|^2)$. Moreover, assume $\{f_{g,i}\}_{i\le n_g}$ are uncorrelated with $\{u_{g,i}\}_{i\le n_g}$.

Assumption 3.

There exist positive constants $c_1$ and $c_2$ such that $\lambda_{\min}(\Sigma_u) > c_1$ and $\|\Sigma_u\|_1 < c_2$.

Assumption 4.

For any $g \in [G]$, $j \in [p]$, and $i_1, i_2, i \le n_g$, there exists a positive constant $M$ such that

  1. $\|\lambda_{g,j}\| < M$, where $\lambda_{g,j}$ denotes the $j$th column of $\Lambda_g$;

  2. $E[p^{-1/2}\{u_{g,i_1}^\top u_{g,i_2} - E(u_{g,i_1}^\top u_{g,i_2})\}]^4 < M$;

  3. $E\|p^{-1/2}\sum_{j=1}^p\lambda_{g,j}u_{g,ij}\|^4 < M$.

Assumption 1 is a typical pervasiveness assumption that ensures the latent factors can be well estimated by the PCA method (Bai and Ng (2013); Fan, Liao and Mincheva (2013)). It requires that the latent factors affect a large proportion of the variables, and is commonly used in the factor analysis literature. Assumption 2 is a typical sub-Gaussian assumption on the latent factors and the idiosyncratic components. Assumption 3 is a regularity condition on $\Sigma_u$. Assumption 4 is a collection of technical conditions needed to establish the factor estimation consistency. Such conditions are commonly used in the factor analysis literature (Bai (2003); Bai and Ng (2008); Fan, Liao and Mincheva (2013)). Given these conditions, we show that under model (4.1), the proposed estimators are consistent.

Theorem 1.

Suppose Assumptions 1–3 hold, $\log p = o(n^{2/39})$, $n = o(p^2)$, and $m_p\omega_n = o(1)$. Then, it follows that

  1. $\|\hat{\gamma}_g - H_g\gamma_g^*\| = O_P(1/\sqrt{n_g} + 1/\sqrt{p})$, where $\hat{\gamma}_g$ is as defined in (3.3), $H_g = \hat{D}_g^{-1}\hat{F}_g^\top F_g\Lambda_g\Lambda_g^\top$, and $\hat{D}_g$ is a $\hat{K}_g\times\hat{K}_g$ diagonal matrix consisting of the $\hat{K}_g$ largest eigenvalues of $X_gX_g^\top$.

  2. In (3.4), if we choose an $\ell_2$ penalty and $\lambda = C\max\{n_{\max}^{3/4}/n, \sqrt{n_{\max}p}/n\}$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_\lambda^{\text{ridge}} - \beta^*\| = O_P\!\left(\frac{n_{\max}^{3/4}}{n} + \frac{\sqrt{n_{\max}p}}{n} + m_p\omega_n\sqrt{\frac{n_{\max}}{n}}\right). \tag{4.2}$$

  3. Assuming that $\beta^*$ is $s$-sparse, $\Sigma_u$ satisfies the RE condition, and $s\omega_n = o(1)$, if we choose an $\ell_1$ penalty in (3.4) and $\lambda = C\omega_n(m_p + \sqrt{n_{\max}/n})$, for some large enough constant $C$, then we have
    $$\|\hat{\beta}_\lambda^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left(m_p\omega_n + \sqrt{\frac{n_{\max}}{n}}\,\omega_n\right)\right). \tag{4.3}$$

Statement (a) shows that $\hat{\gamma}_g$ is consistent for $\gamma_g^*$ up to a rotation given by $H_g$. When the latent factors are known, the oracle convergence rate of $\hat{\gamma}_g$ is $O_P(1/\sqrt{n_g})$. Compared with this oracle rate, the extra term of $O_P(1/\sqrt{p})$ is essentially due to the estimation error of the latent factors; see Lemma 1(a). When $p \gg n$, such a term is ignorable and the oracle rate can be attained. This is because, in that situation, many variables can be used to estimate the latent factors. The error in estimating the latent factors is so small that it does not affect the convergence rate of $\hat{\gamma}_g$. This is essentially due to the blessing-of-dimensionality property of factor analysis, which has been studied in Li et al. (2018). Statements (b) and (c) show that the proposed penalized estimator in (3.4) is consistent for $\beta^*$, regardless of whether an $\ell_1$ or $\ell_2$ penalty is imposed. To simplify the discussion, assume that $n_1 = \cdots = n_G$, and that $m_p$ and $G$ are bounded. Then, the convergence rates in (4.2) and (4.3) reduce to

$$\|\hat{\beta}_\lambda^{\text{ridge}} - \beta^*\| = O_P\!\left(\frac{1}{n^{1/4}} + \sqrt{\frac{p}{n}}\right), \quad \|\hat{\beta}_\lambda^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{\frac{s}{p}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.4}$$

Hsu, Kakade and Zhang (2014) show that the minimax rate of a Ridge estimator in a linear regression model is $O_P(\sqrt{p/n})$ if no sparsity assumption is imposed. Compared with this minimax rate, our method has an extra term of $O_P(1/n^{1/4})$, which is again due to the error when estimating the latent factors; see Lemma 4. However, when $p \gg n$, such a term is ignorable and the minimax rate can be obtained. A similar conclusion can be drawn for the Lasso estimator. In (4.4), the term of $O_P(\sqrt{s\log p/n})$ agrees with the minimax rate of the standard Lasso problem (Raskutti, Wainwright and Yu (2011)). The extra term of $O_P(\sqrt{s/p})$ comes from the estimation error of $\hat{\Sigma}_u$; see Fan, Liao and Mincheva (2013). This term is ignorable when $p \gg n$, in which case the minimax rate is attained.

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \hat{F}_g\hat{\gamma}_g + \hat{U}_g\hat{\beta}_\lambda^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \hat{F}_g\hat{\gamma}_g + \hat{U}_g\hat{\beta}_\lambda^{\text{lasso}}$ denote the predicted values of $Y_g$, where $\hat{\gamma}_g$ is given in (3.3), $\hat{\beta}_\lambda^{\text{ridge}}$ and $\hat{\beta}_\lambda^{\text{lasso}}$ are the Ridge and Lasso estimators, respectively, solved from (3.4), and $\hat{F}_g$ and $\hat{U}_g$ are as described in Section 3.1. The following corollary gives the upper bounds of the corresponding in-sample prediction errors.

Corollary 1.

Under the assumptions of Theorem 1, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{n_{\max}^{3/4}}{n\sqrt{n_g}} + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}} + m_p\omega_n\sqrt{\frac{n_{\max}}{n\,n_g}}\right) + O_P\!\left(\sqrt{\frac{\log n_g\log p}{n_g}} + \frac{1}{n_g^{1/4}\sqrt{p}}\right), \tag{4.5}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left(m_p\omega_n + \sqrt{\frac{n_{\max}}{n}}\,\omega_n\right)\right) + O_P\!\left(\sqrt{\frac{\log n_g\log p}{n_g}} + \frac{1}{n_g^{1/4}\sqrt{p}}\right). \tag{4.6}$$

Again, if we assume $n_1 = \cdots = n_G$, and that $m_p$ and $G$ are bounded, these results reduce to

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{1}{n^{1/4}\sqrt{p}} + \frac{\sqrt{p}}{n}\right), \tag{4.7}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\frac{1}{n^{1/4}\sqrt{p}} + \sqrt{\frac{s}{np}} + \sqrt{\frac{\log n\log p}{n}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.8}$$

In (4.7), the term of $O_P(\sqrt{p}/n)$ agrees with the minimax rate of the prediction error given by the Ridge estimator in a standard linear regression problem (Dicker (2016); Dobriban and Wager (2018)). In (4.8), the term of $O_P(\sqrt{s\log p/n})$ agrees with the prediction error given by the Lasso estimator in the standard setting (Bickel, Ritov and Tsybakov (2009)). All other terms are ignorable when $p \gg n$.

In conclusion, these results show that our proposed estimators can have the same convergence rates as the Ridge and Lasso estimators have under the standard homogeneous linear regression model, which is simpler than the heterogeneous model we have considered.

4.2. Consistency of group-specific and global models

We study the statistical properties of the group-specific and global models when the underlying model follows (4.1). We show that, in this case, our proposed method has an advantage over these two models in terms of the prediction error. We rewrite (4.1) as

$$Y_g = \tilde{X}_g\beta^* + F_g\delta_g + d_pU_g\beta^* + \epsilon_g, \tag{4.9}$$

where $\tilde{X}_g = p^{-1/2}X_g$, $\delta_g = \gamma_g^* - p^{-1/2}\Lambda_g\beta^*$, and $d_p = 1 - p^{-1/2}$. Here, we standardize $X_g$ by dividing it by $p^{1/2}$. This is because the pervasiveness assumption means that $X_g$ is unbounded, which is different from the typical linear regression model. Therefore, we rescale it to be $\tilde{X}_g$. Then, the group-specific model seeks to solve

$$\hat{\beta}_{g,\lambda} = \arg\min_\beta \frac{1}{2n_g}\|Y_g - \tilde{X}_g\beta\|^2 + \lambda P(\beta), \tag{4.10}$$

whereas the global model seeks to solve

$$\hat{\beta}_{\lambda,\text{global}} = \arg\min_\beta \frac{1}{2n}\|Y - \tilde{X}\beta\|^2 + \lambda P(\beta), \tag{4.11}$$

where $\tilde{X} = (\tilde{X}_1^\top, \ldots, \tilde{X}_G^\top)^\top$, $\lambda$ is a tuning parameter, and $P(\beta)$ is a general penalty function. Similar to (3.4), we choose either an $\ell_1$ or an $\ell_2$ penalty, and denote the corresponding solutions by $(\hat{\beta}_{g,\lambda}^{\text{lasso}}, \hat{\beta}_{\lambda,\text{global}}^{\text{lasso}})$ and $(\hat{\beta}_{g,\lambda}^{\text{ridge}}, \hat{\beta}_{\lambda,\text{global}}^{\text{ridge}})$, respectively. Next, we give the convergence rates of the estimators in the group-specific and global models in Theorems 2 and 3, respectively.

Theorem 2.

Suppose Assumptions 1–3 hold and $\log p = o(n)$. Then, it follows that

  1. If we use an $\ell_2$ penalty in (4.10) and choose $\lambda = C/p$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{g,\lambda}^{\text{ridge}} - \beta^*\| = O_P\!\left(\sqrt{p}\,\|\delta_g\| + d_p\left(1 + \sqrt{\frac{p}{n_g}}\right) + \sqrt{\frac{p}{n_g}}\right). \tag{4.12}$$
  2. Assuming that $\beta^*$ is $s$-sparse, $\Lambda_g^\top\Lambda_g/p$ satisfies the RE condition, and $s\sqrt{\log p/(n_gp)} = o(1)$, if we use an $\ell_1$ penalty in (4.10) and choose $\lambda = C\{(1 + \sqrt{\log p/n_g})(d_p + \|\delta_g\|) + \sqrt{\log p/n_g}\}/\sqrt{p}$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{g,\lambda}^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left\{\left(1 + \sqrt{\frac{\log p}{n_g}}\right)\left(d_p + \|\delta_g\|\right) + \sqrt{\frac{\log p}{n_g}}\right\}\right). \tag{4.13}$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \tilde{X}_g\hat{\beta}_{g,\lambda}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \tilde{X}_g\hat{\beta}_{g,\lambda}^{\text{lasso}}$ be the predicted values of $Y_g$, where $\hat{\beta}_{g,\lambda}^{\text{ridge}}$ and $\hat{\beta}_{g,\lambda}^{\text{lasso}}$ are the Ridge and Lasso solutions, respectively, to (4.10). We have the following upper bounds of their in-sample prediction errors.

Corollary 2.

Under the assumptions of Theorem 2, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{p}{n_g}}\|\delta_g\| + d_p\left(\frac{1}{\sqrt{n_g}} + \sqrt{\frac{p}{n_g}}\right) + \sqrt{\frac{p}{n_g}}\right), \tag{4.14}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left\{\left(1 + \sqrt{\frac{\log p}{n_g}}\right)\left(d_p + \|\delta_g\|\right) + \sqrt{\frac{\log p}{n_g}}\right\}\right). \tag{4.15}$$

Theorem 3.

Suppose Assumptions 1–3 hold and $\log p = o(n)$. Then, it follows that

  1. If we use an $\ell_2$ penalty in (4.11) and choose $\lambda = C/p$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}} - \beta^*\| = O_P\!\left(\sqrt{\frac{n_{\max}p}{n}}\sum_{g=1}^G\|\delta_g\| + d_p\left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}p}}{n}\right) + \frac{\sqrt{n_{\max}p}}{n}\right). \tag{4.16}$$
  2. Assuming that $\beta^*$ is $s$-sparse, $\Lambda_g^\top\Lambda_g/p$ satisfies the RE condition, and $s\sqrt{\log p/(n_gp)} = o(1)$ for any $g \in [G]$, if we use an $\ell_1$ penalty in (4.11) and choose $\lambda = C[\{\sqrt{n_{\max}/(np)} + (1/n)\sqrt{n_{\max}\log p/p}\}(d_p + \sum_{g=1}^G\|\delta_g\|) + (1/n)\sqrt{n_{\max}\log p/p}]$, for some large enough constant $C$, we have
    $$\|\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}} - \beta^*\| = O_P\!\left(\sqrt{s}\left\{\left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}\log p}}{n}\right)\left(d_p + \sum_{g=1}^G\|\delta_g\|\right) + \frac{\sqrt{n_{\max}\log p}}{n}\right\}\right). \tag{4.17}$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}} = \tilde{X}_g\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}} = \tilde{X}_g\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}}$ be the predicted values of $Y_g$, where $\hat{\beta}_{\lambda,\text{global}}^{\text{ridge}}$ and $\hat{\beta}_{\lambda,\text{global}}^{\text{lasso}}$ are the Ridge and Lasso solutions, respectively, to (4.11). We have the following upper bounds for their in-sample prediction errors.

Corollary 3.

Under the assumptions of Theorem 3, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{n_{\max}p}{n\,n_g}}\sum_{g=1}^G\|\delta_g\| + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}} + d_p\left(\sqrt{\frac{n_{\max}}{n\,n_g}} + \frac{1}{n}\sqrt{\frac{n_{\max}p}{n_g}}\right)\right), \tag{4.18}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\right\| = O_P\!\left(\sqrt{\frac{s}{n_g}}\left\{\frac{\sqrt{n_{\max}\log p}}{n} + \left(\sqrt{\frac{n_{\max}}{n}} + \frac{\sqrt{n_{\max}\log p}}{n}\right)\left(d_p + \sum_{g=1}^G\|\delta_g\|\right)\right\}\right). \tag{4.19}$$

Under (4.1), $\|\delta_g\| \le \|\gamma_g^*\| + p^{-1/2}\|\Lambda_g\beta^*\| = O(1)$ for all $g \in [G]$, and $d_p = O(1)$. Thus, if we assume that $n_1 = \cdots = n_G$ and $G$ is bounded, then (4.14) and (4.18) further reduce to $\frac{1}{\sqrt{n_g}}\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid F_g, U_g)\| = O_P(\sqrt{p/n})$ for the Ridge estimator. Compared with the prediction error of our Ridge estimator, which is $O_P(\sqrt{p}/n)$, these two methods are worse by a factor of $\sqrt{n}$, owing to the mis-specified model (4.1). Similarly for the Lasso estimator, when $n_1 = \cdots = n_G$ and $G$ is bounded, (4.15) and (4.19) reduce to $\frac{1}{\sqrt{n_g}}\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid F_g, U_g)\| = O_P(\sqrt{s/n} + \sqrt{s\log p/n})$. Compared with our Lasso estimator, they have an extra term of $\sqrt{s/n}$, which also comes from the model mis-specification and is nonignorable.

4.3. Robustness

In this section, we assume each group follows a distinct model

$$Y_g = \tilde{X}_g\beta_g^* + \epsilon_g, \tag{4.20}$$

and examine how well our method performs under this model assumption. In other words, we study how robust our method is under model mis-specification. Here, we still use the rescaled $\tilde{X}_g$ as the design matrix. We rewrite (4.20) as $Y_g = p^{-1/2}F_g\Lambda_g\beta_g^* + p^{-1/2}U_g\beta_g^* + \epsilon_g$. Compared with (4.1), we see that $p^{-1/2}\Lambda_g\beta_g^*$ and $p^{-1/2}\beta_g^*$ can be viewed as $\gamma_g^*$ and $\beta^*$, respectively, in our model. Under the model assumption in (4.20), we have the following results.

Theorem 4.

Suppose Assumptions 1–3 hold, $\log p = o(n^{2/39})$, $n = o(p^2)$, and $m_p\omega_n = o(1)$. Then, for any $g \in [G]$, it follows that

  1. $\|\hat{\gamma}_g - p^{-1/2}H_g\Lambda_g\beta_g^*\| = O_P(1/\sqrt{n_g} + 1/\sqrt{p})$, where $H_g$ is as defined in Theorem 1.

  2. If an $\ell_2$ penalty in (3.4) is used and $\lambda = O(\max\{n_{\max}^{3/4}\sqrt{p}/n, \sqrt{n_{\max}p}/n\})$, then
    $$\left\|\hat{\beta}_\lambda^{\text{ridge}} - \frac{1}{\sqrt{p}}\beta_g^*\right\| = O_P\!\left(\frac{\sqrt{n_{\max}p}}{n} + \frac{n_{\max}^{3/4}}{n}\right) + \sum_{g'=1}^GO_P\!\left(\sqrt{\frac{n_{g'}}{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right).$$

  3. Assuming that $\beta_g^*$ is $s$-sparse and $\Sigma_u$ satisfies the RE condition, if we use an $\ell_1$ penalty in (3.4) and choose $\lambda = C\{\omega_n\sqrt{n_{\max}/n} + \sqrt{n_{\max}/(np)}\sum_{g'=1}^G\|\beta_{g'}^* - \beta_g^*\|\}$, for some large enough constant $C$, we have
    $$\left\|\hat{\beta}_\lambda^{\text{lasso}} - \frac{1}{\sqrt{p}}\beta_g^*\right\| = O_P\!\left(\sqrt{s}\left\{\sqrt{\frac{n_{\max}}{n}}\,\omega_n + \sqrt{\frac{n_{\max}}{np}}\sum_{g'=1}^G\left\|\beta_{g'}^* - \beta_g^*\right\|\right\}\right).$$

Let $\hat{Y}_{g,\lambda}^{\text{ridge}}$ and $\hat{Y}_{g,\lambda}^{\text{lasso}}$ be the same as in Corollary 1. Using Theorem 4, we give the upper bounds of the in-sample prediction errors given by our proposed method, when the underlying model follows (4.20).

Corollary 4.

Under the assumptions of Theorem 4, for each $g \in [G]$, we have

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\frac{1}{\sqrt{n_g}}\right) + O_P\!\left(\frac{1}{\sqrt{n_gp}}\right) + O_P\!\left(\frac{1}{\sqrt{n_g}}\right)\left\|\hat{\beta}_\lambda^{\text{ridge}} - \frac{1}{\sqrt{p}}\beta_g^*\right\|, \tag{4.21}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\frac{1}{\sqrt{n_g}}\right) + O_P\!\left(\frac{1}{\sqrt{n_gp}}\right) + O_P\!\left(\frac{1}{\sqrt{n_g}}\right)\left\|\hat{\beta}_\lambda^{\text{lasso}} - \frac{1}{\sqrt{p}}\beta_g^*\right\|. \tag{4.22}$$

When $n_1 = \cdots = n_G$ and $G$ is bounded, (4.21) and (4.22) further reduce to

$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{ridge}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\sum_{g'=1}^G\frac{1}{\sqrt{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right) + O_P\!\left(\sqrt{\frac{p}{n}}\right) = O_P\!\left(\sqrt{\frac{p}{n}}\right), \tag{4.23}$$
$$\frac{1}{\sqrt{n_g}}\left\|\hat{Y}_{g,\lambda}^{\text{lasso}} - E(Y_g \mid \tilde{X}_g)\right\| = O_P\!\left(\sum_{g'=1}^G\sqrt{\frac{s}{np}}\left\|\beta_{g'}^* - \beta_g^*\right\|\right) + O_P\!\left(\sqrt{\frac{s}{np}} + \sqrt{\frac{s\log p}{n}}\right). \tag{4.24}$$

We compare these convergence rates with those given by the group-specific model. Because the true model (4.20) is a special case of (4.9), obtained by setting $d_p = 0$ and $\delta_g = 0$, it follows from Theorem 2 that the prediction errors of the group-specific model are $O_P(\sqrt{p/n_g})$ and $O_P(\sqrt{s\log p/n_g})$ when using a Ridge or a Lasso estimator, respectively. Comparing them with (4.23) and (4.24), we find that the Ridge estimator of our model has the same rate as the group-specific Ridge estimator; see (4.23). For the Lasso estimator, when $p$ is small, our model converges at a rate of $\sqrt{s/(np)}$, which is slower than that of the group-specific model by a factor of $n/(p\log p)$. The reason is that our model estimates $G^{-1}\sum_{g'=1}^G\beta_{g'}^*$, instead of $\beta_g^*$, and needs to estimate $\Sigma_u$, which introduces an extra error of $O_P(\sqrt{s/(np)})$. However, when $p \gg n$, all these terms are negligible, and our model has the same convergence as the group-specific model. In conclusion, we have shown that even if the true model is group-specific, our method still provides comparable prediction to that of the group-specific model, especially when the dimension $p$ is high.

5. Simulation Studies

In this section, we perform two simulation studies to compare our proposed model with the global, group-specific, and Factor-0 models. In both studies, we choose $G = 3$, $p = 200$, $K_g = 3$, and $n_g = 100$ for any $g \in [G]$, generate 600 training samples to train all four models, and evaluate their mean squared error (MSE) on an independent test set of 600 samples. Additional simulation studies on other choices of $K_g$ can be found in Section S3.4 in the Supplementary Material. We repeat the simulations 50 times. In setting 1, we generate data from our proposed model. In setting 2, we generate different models for different groups.

5.1. Setting 1: under proposed model

We first generate data from the proposed model in (2.4). For any $g \in [G]$, we generate $\{f_{g,i}\}_{i\le n_g}$ as i.i.d. samples from $\mathcal{N}(0, I_{K_g\times K_g})$. We set

$$\Lambda_g = (\Lambda_{g1}, \Lambda_{g2}), \qquad \Lambda_g^\top\Lambda_g = \begin{pmatrix}\Lambda_{g1}^\top\Lambda_{g1} & \Lambda_{g1}^\top\Lambda_{g2}\\ \Lambda_{g2}^\top\Lambda_{g1} & \Lambda_{g2}^\top\Lambda_{g2}\end{pmatrix}.$$

To ensure that $\Lambda_g$ satisfies the pervasiveness assumption (Assumption 1), we first choose a positive-definite matrix $R * (s_gs_g^\top)$, where $R = (r_{ij})$ with $r_{ij} = 0.1^{|i-j|}$, $s_g = (\lambda_{g,1}, \ldots, \lambda_{g,K_g})^\top$, $(\lambda_{1,1}, \lambda_{1,2}, \lambda_{1,3}) = (7.0, 3.5, 1.2)$, $(\lambda_{2,1}, \lambda_{2,2}, \lambda_{2,3}) = (10, 3.9, 1.2)$, $(\lambda_{3,1}, \lambda_{3,2}, \lambda_{3,3}) = (13, 3.9, 1.1)$, and $*$ denotes elementwise matrix multiplication. Additional simulation studies on other choices of $(\lambda_{g,1}, \ldots, \lambda_{g,K_g})$ can be found in Section S3.2 in the Supplementary Material. Then, we perform an eigendecomposition on it to obtain $R * (s_gs_g^\top) = V_gD_gV_g^\top$, where $D_g$ is the diagonal matrix consisting of its eigenvalues. Next, we set $\Lambda_{g1} = Q_gD_g^{1/2}V_g^\top$, where $Q_g$ is a random orthonormal matrix, and $\Lambda_{g2} = Q_gT_g$, where $T_g$ is a $K_g\times(p-K_g)$ matrix with elements randomly generated from $\mathrm{Unif}(-1/20, 1/20)$. This construction of $\Lambda_g$ ensures that it has spiked eigenvalues, as required by the pervasiveness assumption, and that its rank is $K_g$. We further generate $\{u_{g,i}\}_{i\le n_g}$ as i.i.d. samples from $\mathcal{N}(0, \Sigma_u)$, where $\Sigma_u$ is a diagonal matrix with diagonal elements all equal to 0.03. For the coefficients in (2.4), we choose $\mu_g^* = g$ for $g = 1, 2, 3$. We set $\gamma_1^* = (h, h, 2h)^\top$, $\gamma_2^* = (h, 2h, h)^\top$, and $\gamma_3^* = (2h, h, h)^\top$, where we let $h$ change so that, as it increases, the between-group heterogeneity increases accordingly. We consider two settings of $\beta^*$. For a sparse $\beta^*$, we set $\beta^* = (\mathbf{2}_{10}^\top, \mathbf{0}_{90}^\top, -\mathbf{2}_{10}^\top, \mathbf{0}_{90}^\top)^\top$, where $\mathbf{m}_L$ denotes an $L$-dimensional vector with elements all equal to $m$; for a dense $\beta^*$, we set $\beta^* = (\mathbf{1}_{80}^\top, \mathbf{0}_{20}^\top, -\mathbf{1}_{80}^\top, \mathbf{0}_{20}^\top)^\top$. Finally, we generate the error term $\epsilon$ as i.i.d. samples from $\mathcal{N}(0, 4)$.
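For readers who want to reproduce this setting, the following is a compact numpy sketch of the Setting 1 generator under the reconstruction above; the helper names, the 0-based group index, and the random-number conventions are our own choices.

```python
import numpy as np

def make_loading(s_g, p, K_g, rng):
    """Build Lambda_g = [Lambda_g1, Lambda_g2] with spiked eigenvalues,
    following the construction described above (r_ij = 0.1^|i-j|)."""
    R = 0.1 ** np.abs(np.subtract.outer(np.arange(K_g), np.arange(K_g)))
    M = R * np.outer(s_g, s_g)                              # R * (s_g s_g^T), elementwise
    vals, vecs = np.linalg.eigh(M)                          # eigendecomposition M = V D V^T
    Q, _ = np.linalg.qr(rng.standard_normal((K_g, K_g)))    # random orthonormal matrix
    Lambda1 = Q @ np.diag(np.sqrt(vals)) @ vecs.T           # Q D^{1/2} V^T
    Lambda2 = Q @ rng.uniform(-0.05, 0.05, size=(K_g, p - K_g))
    return np.hstack([Lambda1, Lambda2])

def make_group(g, n_g, p, K_g, h, beta, rng):
    """Generate (X_g, Y_g) from (2.3)-(2.4) for one group in Setting 1 (g is 0-based)."""
    s_list = [np.array([7.0, 3.5, 1.2]), np.array([10, 3.9, 1.2]), np.array([13, 3.9, 1.1])]
    gammas = [np.array([h, h, 2 * h]), np.array([h, 2 * h, h]), np.array([2 * h, h, h])]
    Lam = make_loading(s_list[g], p, K_g, rng)
    F = rng.standard_normal((n_g, K_g))                     # factors ~ N(0, I)
    U = rng.normal(scale=np.sqrt(0.03), size=(n_g, p))      # Sigma_u = 0.03 * I
    X = F @ Lam + U                                         # decomposition (2.3)
    Y = (g + 1) + F @ gammas[g] + U @ beta + rng.normal(scale=2.0, size=n_g)  # model (2.4)
    return X, Y
```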

Under this model generation scheme, Figure 1 shows how the MSEs of these four methods change as $h$ varies. When $\beta^*$ is sparse, all methods use an $\ell_1$ penalty; when $\beta^*$ is dense, all methods use an $\ell_2$ penalty. The shaded areas represent the standard errors of the MSEs in the 50 simulations. The optimal tuning parameters in these methods are chosen using 10-fold cross-validation. It is clearly seen that for most $h$, our model performs best. Owing to the model mis-specification, the group-specific model loses some efficiency in estimating the homogeneous part of (4.9) separately, and the global model entirely ignores the heterogeneity. The Factor-0 model adjusts for group means; therefore, it is better than the global model. However, it is still worse than the proposed full model, indicating that some additional heterogeneity has not been fully taken into account in the Factor-0 model. When $h$ increases, the true model (2.4) becomes more group-specific, and less homogeneity can be used to estimate the common $\beta^*$. In this case, the group-specific model gradually outperforms our method. They both become much better than the global and the Factor-0 models. The estimation errors on $\gamma_g^*$ and $\beta^*$ are reported in Tables S2 and S3 in the Supplementary Material.

Figure 1. The MSE curves given by the four models. The left panel represents the results for a sparse $\beta^*$, and the right panel represents the results for a dense $\beta^*$.

5.2. Setting 2: under group-specific model

We generate different models for different groups and inspect how robust our model is under such a scenario. We generate $f_{g,i}$ as we did in the first study and $u_{g,i}$ as i.i.d. samples from $\mathcal{N}(0, \Sigma_u)$, where $\Sigma_u = (\sigma_{u,ij})$ with $\sigma_{u,ij} = 0.1^{|i-j|}\times 0.03$ if $|i-j| \le 2$, and $\sigma_{u,ij} = 0$ otherwise. Additional simulation studies with $\{f_{g,i}\}_{i\le n_g}$ and $\{u_{g,i}\}_{i\le n_g}$ generated from more general sub-Gaussian distributions for both settings can be found in Section S3.3 in the Supplementary Material. For $\Lambda_g$, we set $\Lambda_g = \tilde{Q}_g * s_g$, where $s_g$ is as in the first study, and $\tilde{Q}_g$ is a random $K_g\times p$ orthonormal matrix. Then, we use these elements to generate $X_g$ according to (2.3) and normalize it to obtain the design matrix $\tilde{X}_g$. Given $\tilde{X}_g$, for any $g \in [G]$, we generate $Y_g$ from (4.20) by setting $\mu_g = g$ for $g \in [G]$, generating $\epsilon$ as i.i.d. samples from $\mathcal{N}(0, 4)$, and choosing two kinds of $\beta_g^*$. For sparse $\beta_g^*$, we set $\beta_1^* = (10h, 10h, -10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$, $\beta_2^* = (10h, -10h, 10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$, and $\beta_3^* = (-10h, 10h, 10h, \mathbf{10}_5^\top, \mathbf{0}_{187}^\top, \mathbf{10}_5^\top)^\top$. For dense $\beta_g^*$, we set $\beta_1^* = (10h, 10h, -10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$, $\beta_2^* = (10h, -10h, 10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$, and $\beta_3^* = (-10h, 10h, 10h, \mathbf{1}_{80}^\top, \mathbf{0}_{37}^\top, \mathbf{1}_{80}^\top)^\top$.

Under this model generation scheme, Figure 2 shows the MSE curves of the four methods, which are computed in the same way as in the first study. For sparse $\beta_g^*$, when $h$ is small, the differences between the group-specific model, the Factor-0 model, and our method are marginal, which agrees with what we proved in Corollary 4. When $h$ gets larger, the group difference dominates. In this case, the group-specific model gives the best prediction, although our model is not far off. Compared with these two models, the global and Factor-0 models are much worse, because they fail to recognize the group difference. For dense $\beta_g^*$, when $h$ is small, all other models have similar performance, except for the global model. As $h$ gets larger, our model becomes slightly worse than the group-specific model, for the same reason discussed in the sparse case. However, the performance of the Factor-0 model deteriorates much faster. In conclusion, this study shows that our method's performance is still acceptable, even when the underlying models in the various groups are different. The estimation errors on $\beta_g^*$ are reported in Table S4 in the Supplementary Material.

Figure 2. The MSE curves given by the four models. The left panel represents the results for sparse $\beta_g^*$, and the right panel represents the results for dense $\beta_g^*$.

6. Application to ADNI Data Analysis

AD is an irreversible neurodegenerative disease that results in a loss of mental functions caused by a deterioration of the brain. It is the most common cause of dementia among people over the age of 65, affecting an estimated 5.5 million Americans, yet no prevention methods or cures have been discovered. The ADNI was started in 2004 with the goal of tracking the progression of the disease using biomarkers, and using clinical measures to assess the brain’s function over the course of the disease states. In this section, we apply our method to the ADNI data. We are interested in predicting the ADAS-Cog scores using structural magnetic resonance imaging (MRI) scans. All subjects in our analysis are from the ADNI2 phase of the study. In total, there are 697 subjects in our analysis and five groups: NC, SMC, eMCI, lMCI, and AD, ordered by disease severity. The MRI images were preprocessed using anterior commissure-posterior commissure correction, intensity inhomogeneity correction, skull stripping, cerebellum removal based on registration with atlas, spatial segmentation, and registration. After registration, we obtain MRI data with 93 regions of interest (ROIs). For each of the 93 ROIs, we compute the volume of gray matter as a feature. As a result, for each subject, we finally obtain 93 MRI features. Our goal is to predict the ADAS-Cog scores using the 93 MRI features, together with the group information.

We randomly partition the whole data set into two parts: 75% for training the model, and the rest for testing the performance. We repeat the random split 100 times. The testing MSEs and the corresponding standard errors are reported in Table 1 (overall performance) and Table 2 (groupwise performance). We compare four models: the global model (2.1), the group-specific model (2.2), the Factor-0 model (2.5), and our proposed model, as shown in (2.4). For each model, we use three penalty functions: the $\ell_2$ penalty (Ridge), the $\ell_1$ penalty (Lasso), and the Elastic Net (EN) penalty with the bridging parameter 0.5.

Table 1.

Overall MSEs for the four models.

Penalty Global Group-specific Factor-0 Proposed

Ridge 27.52 (0.33) 15.70 (0.19) 15.17 (0.18) 15.04 (0.18)
EN 28.23 (0.33) 16.26 (0.21) 15.47 (0.18) 15.40 (0.18)
Lasso 28.27 (0.34) 16.39 (0.23) 15.49 (0.19) 15.45 (0.18)

Table 2.

Groupwise MSEs for the four models.

Group Global Group-specific Factor-0 Proposed

Penalty = Ridge
NC 16.66 (0.38) 6.24 (0.09) 6.50 (0.10) 6.19 (0.10)
SMC 14.52 (0.31) 6.68 (0.15) 6.43 (0.15) 6.54 (0.15)
eMCI 18.37 (0.41) 10.26 (0.19) 9.84 (0.19) 9.82 (0.19)
lMCI 19.17 (0.38) 16.75 (0.32) 15.61 (0.30) 15.92 (0.32)
AD 73.55 (0.38) 41.25 (0.32) 40.00 (0.30) 39.28 (0.32)

Penalty = Elastic Net
NC 16.79 (0.38) 6.45 (0.09) 6.40 (0.11) 6.37 (0.09)
SMC 15.46 (0.38) 7.12 (0.09) 6.78 (0.11) 6.96 (0.09)
eMCI 18.65 (0.38) 10.59 (0.09) 10.13 (0.11) 10.22 (0.09)
lMCI 20.26 (0.38) 18.32 (0.09) 16.14 (0.11) 16.43 (0.09)
AD 75.00 (0.38) 41.49 (0.09) 40.54 (0.11) 39.64 (0.09)

Penalty = Lasso
NC 16.69 (0.38) 6.49 (0.09) 6.41 (0.11) 6.37 (0.09)
SMC 15.57 (0.38) 7.16 (0.09) 6.84 (0.11) 7.05 (0.09)
eMCI 18.44 (0.38) 10.73 (0.09) 10.17 (0.11) 10.26 (0.09)
lMCI 20.36 (0.38) 18.53 (0.09) 16.21 (0.11) 16.50 (0.09)
AD 75.40 (0.38) 41.73 (0.09) 40.47 (0.11) 39.68 (0.09)

As shown in Tables 1 and 2, our proposed model achieves promising performance in most cases. The global model performs worst, because it does not use the label information at all. The group-specific model does not perform as well as our proposed model, because it does not borrow information across different groups. Note that the Factor-0 model achieves a great improvement over the global model, which demonstrates that the difference in group means is the main source of the heterogeneous effect on the clinical scores across the five groups. It is seen in Table 2 that our model achieves its greatest improvement over the other models on the AD patients, which indicates that the effects of the heterogeneous factors identified in the AD group are much stronger than those in the other groups. This appears to be reasonable, because the brain structure of AD patients is significantly more impaired.

Our model also has good interpretability. In this real data set, we can interpret the variation due to the identified factors as disease-specific variation, and the variation due to the homogeneous signals as the disease-shared variation among all groups. Figure 3 gives heatmaps of $\hat{\Sigma}_{x,g} = (1/n_g)X_g^\top X_g$ (the top row), where $\Sigma_{x,g} = \mathrm{cov}(x_{g,i})$, and of $\hat{\Sigma}_{u,g}$ (the bottom row), which is obtained by applying an adaptive soft threshold to $\hat{\Sigma}_{x,g} - \hat{\Lambda}_g^\top\hat{\Lambda}_g$. The left, middle, and right columns of Figure 3 correspond to the NC, eMCI, and AD groups, respectively. From Figure 3, we can see that the bottom row looks more homogeneous across groups than the top row. We further represent brain connections using precision matrices estimated from Gaussian graphical models (Cai, Liu and Luo (2011)). See Section S4 in the Supplementary Material.

Figure 3. Heatmaps of $\hat{\Sigma}_{x,g}$ and $\hat{\Sigma}_{u,g}$ in the NC, eMCI, and AD groups.

7. Conclusion

We have proposed a factor regression model for heterogeneous data with subpopulations. Our proposed model decomposes the predictors into heterogeneous components driven by latent factors and homogeneous components. We assume the group-specific latent factors explain the main heterogeneous variations and, consequently, their associated coefficients can differ across groups. The homogeneous components share the same covariance matrix and, as a result, they share the same regression coefficients. Because the factors are unobserved, we first estimate them using a standard PCA procedure. We use OLS to estimate the group-specific coefficients directly. For the homogeneous regression coefficients, we propose a flexible penalized least squares solution. For model prediction, we also propose a data-driven procedure to estimate the factors for the testing data. Theoretical results on the estimation and prediction consistency under $\ell_2$ and $\ell_1$ penalties are established. We show that our proposed model is robust under the group-specific model. Extensive simulation studies further demonstrate the competitive performance of our proposed model over the global model and the group-specific model, with our proposed model achieving a good balance between the two. Finally, we apply the proposed method to an ADNI data set for clinical score prediction, and demonstrate that our model has good prediction power and meaningful interpretation. One interesting future direction is to extend the method to other types of outcomes, such as categorical or count data.


Acknowledgments

The authors would like to thank the editor, associate editor, and reviewers for their helpful comments and suggestions. This research was supported in part by NSF grant DMS-1821231, NIH grants R01GM126550 and R01AG073259.

Footnotes

Supplementary Material

Section S1 gives proofs of Theorems 1–4, Corollaries 1.1–4.1, and the supporting lemmas. Section S2 provides a rule of thumb to choose between our proposed model and the group-specific model in practice. Section S3 presents additional simulation results. Section S4 contains additional results from the ADNI data analysis. Section S5 shows the analysis results when we apply our method to a combined microarray data set.

References

  1. Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
  2. Bai J (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
  3. Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  4. Bai J and Ng S (2013). Principal components estimation and identification of static factors. Journal of Econometrics 176, 18–29.
  5. Bai J and Ng S (2008). Large dimensional factor analysis. Foundations and Trends in Econometrics 3, 89–163.
  6. Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37, 1705–1732.
  7. Bühlmann P (2016). Partial least squares for heterogeneous data. In The Multiple Facets of Partial Least Squares and Related Methods (Edited by Abdi H, Esposito Vinzi V, Russolillo G, Saporta G and Trinchera L), 3–15. Springer International Publishing, Cham.
  8. Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106, 672–684.
  9. Cai T, Liu W and Luo X (2011). A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 594–607.
  10. Dicker LH (2016). Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 22, 1–37.
  11. Dobriban E and Wager S (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics 46, 247–279.
  12. Fan J, Liao Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75, 603–680.
  13. Fan J, Liu H, Wang W and Zhu Z (2018). Heterogeneity adjustment with applications to graphical model inference. Electronic Journal of Statistics 12, 3908–3952.
  14. Feng Q, Jiang M, Hannig J and Marron J (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis 166, 241–265.
  15. Gaynanova I and Li G (2019). Structural learning and integrative decomposition of multi-view data. Biometrics 75, 1121–1132.
  16. Hastie T and Tibshirani R (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological) 55, 757–796.
  17. Hoerl AE and Kennard RW (2000). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 42, 80–86.
  18. Hsu D, Kakade SM and Zhang T (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics 14, 569–600.
  19. Joliffe I and Morgan B (1992). Principal component analysis and exploratory factor analysis. Statistical Methods in Medical Research 1, 69–95.
  20. Lam C and Yao Q (2012). Factor modeling for high-dimensional time series: Inference for the number of factors. The Annals of Statistics 40, 694–726.
  21. Li Q, Cheng G, Fan J and Wang Y (2018). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 113, 380–389.
  22. Lock EF, Hoadley KA, Marron JS and Nobel AB (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics 7, 523–542.
  23. Ma S and Huang J (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112, 410–423.
  24. Meinshausen N and Bühlmann P (2015). Maximin effects in inhomogeneous large-scale data. The Annals of Statistics 43, 1801–1830.
  25. Muniategui A, Pey J, Planes FJ and Rubio A (2013). Joint analysis of miRNA and mRNA expression data. Briefings in Bioinformatics 14, 263–278.
  26. Park JY and Lock EF (2020). Integrative factorization of bidimensionally linked matrices. Biometrics 76, 61–74.
  27. Pinheiro JC and Bates DM (2000). Linear mixed-effects models: Basic concepts and examples. Mixed-Effects Models in S and S-Plus, 3–56.
  28. Raskutti G, Wainwright MJ and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory 57, 6976–6994.
  29. Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
  30. Tang L and Song PX (2016). Fused lasso approach in regression coefficients clustering: Learning parameter heterogeneity in data integration. The Journal of Machine Learning Research 17, 3915–3937.
  31. Tibshirani R (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.
  32. Vicari D and Vichi M (2013). Multivariate linear regression for heterogeneous data. Journal of Applied Statistics 40, 1209–1230.
  33. Wang P, Liu Y and Shen D (2018). Flexible locally weighted penalized regression with applications on prediction of Alzheimer's Disease Neuroimaging Initiative's clinical scores. IEEE Transactions on Medical Imaging 38, 1398–1408.
  34. Wold S, Esbensen K and Geladi P (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 37–52.
  35. Zhang D, Wang Y, Zhou L, Yuan H and Shen D (2011). Multimodal classification of Alzheimer's disease and mild cognitive impairment. Neuroimage 55, 856–867.
  36. Zhao T, Cheng G and Liu H (2016). A partially linear framework for massive heterogeneous data. The Annals of Statistics 44, 1400–1437.
  37. Zhou G, Cichocki A, Zhang Y and Mandic DP (2015). Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems 27, 2426–2439.
  38. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.
