Published in final edited form as: J Am Stat Assoc. 2021 May 19;117(540):2105–2119. doi: 10.1080/01621459.2021.1904958

Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data

Tianxi Cai a, Molei Liu a, Yin Xia b

Abstract

Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integrating procedure, accommodates between-study heterogeneity in both the covariate distribution and the model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation studies. We further illustrate the utility of SHIR by deriving phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

Keywords: DataSHIELD, Distributed learning, High dimensionality, Model heterogeneity, Rate optimality, Sparsistency

1. Introduction

1.1. Background

Synthesizing information from multiple studies is crucial for evidence-based medicine and policy decision making. Meta-analyzing multiple studies allows for more precise estimates and enables investigation of generalizability. In the presence of heterogeneity across studies and high-dimensional predictors, such integrative analysis is, however, highly challenging. An example of such integrative analysis is the development of generalizable predictive models using electronic health records (EHR) data from different hospitals. In addition to high-dimensional features, EHR data analysis encounters privacy constraints in that individual patient data (IPD) typically cannot be shared across local hospital sites, which makes the challenge of integrative analysis even more pronounced. Breach of privacy arising from data sharing is in fact a growing concern for scientific studies in general. Recently, Wolfson et al. (2010) proposed a generic individual-information protected integrative analysis framework, named DataSHIELD, that transfers only summary statistics¹ from each distributed local site to the central site for pooled analysis. Conceptually highly valued by research communities (see, e.g., Jones et al. 2012; Doiron et al. 2013), DataSHIELD facilitates multi-study integrative analysis when IPD pooled meta-analysis is not feasible due to ethical and/or legal restrictions (Gaye et al. 2014). In the low-dimensional setting, a number of statistical methods have been developed for distributed analysis under the DataSHIELD constraint (see, e.g., Chen et al. 2006; Wu et al. 2012; Liu and Ihler 2014; Lu et al. 2015; Huang and Huo 2015; Han and Liu 2016; He et al. 2016; Zöller, Lenz, and Binder 2018; Duan et al. 2019, 2020). Existing work on distributed high-dimensional regression has largely focused on settings without between-study heterogeneity, as detailed in Section 1.2. To the best of our knowledge, no existing distributed learning method can effectively handle both high dimensionality and model heterogeneity across the local sites.

1.2. Related Work

In the context of high-dimensional regression, several recently proposed distributed inference approaches can potentially be used for integrative analysis under the DataSHIELD constraint. Specifically, Tang, Zhou, and Song (2016), Lee et al. (2017), and Battey et al. (2018) proposed distributed inference procedures that aggregate local debiased LASSO estimators (Zhang and Zhang 2014; Van de Geer et al. 2014; Javanmard and Montanari 2014). By including a debiasing procedure in their pipelines, the corresponding estimators can be used directly for inference. Lee et al. (2017) and Battey et al. (2018) proposed to further truncate the aggregated dense debiased estimators to achieve sparsity; see also Maity, Sun, and Banerjee (2019). Although this debiasing-based strategy can be extended to fit our heterogeneous modeling assumption, it loses statistical efficiency due to its failure to account for the heterogeneity of the information matrices across sites. In addition, the use of a debiasing procedure at the local sites incurs additional estimation error, as detailed in Section 4.4.

Lu et al. (2015) and Li et al. (2016) proposed distributed approaches for $\ell_2$-regularized logistic and Cox regression. However, their methods require sequential communication between the local sites and the central machine, which may be time and resource consuming. Chen and Xie (2014) proposed to estimate high-dimensional parameters by first adopting majority voting to select a positive set and then combining the local estimates of the coefficients belonging to this set. Wang, Peng, and Dunson (2014) proposed to aggregate the local estimators through their median rather than their mean, which was shown to be more robust to poor estimation at local sites with insufficient sample sizes (Minsker 2019). More recently, Wang et al. (2017) and Jordan, Lee, and Yang (2019) presented a communication-efficient surrogate likelihood framework for distributed statistical learning that transfers only first-order summary statistics, that is, gradients, between the local sites and the central site. Fan, Guo, and Wang (2019) extended their idea and proposed two iterative distributed optimization algorithms for general penalized likelihood problems. However, their framework, as well as the others summarized in this paragraph, is restricted to homogeneous scenarios and cannot be easily extended to settings with heterogeneous models or covariates.

1.3. Our Contributions

In this article, we fill the methodological gap of high-dimensional distributed learning methods that can accommodate cross-study heterogeneity by proposing a novel data-Shielding High-dimensional Integrative Regression (SHIR) method under the DataSHIELD constraints. While SHIR can be viewed as analogous to the integrative analysis of debiased local LASSO estimators, it achieves debiasing without having to debias the local estimators. SHIR solves the LASSO problem only once at each local site, without requiring inverse Hessian matrices or locally debiased estimators, and needs only one round of communication. Statistically, it serves as a tool for integrative model estimation and variable selection in the presence of high dimensionality and heterogeneity in model parameters across sites. In addition, under the ultra-high dimensional regime where p can grow exponentially with the total sample size N, we demonstrate that SHIR can asymptotically achieve the same error rates as the ideal estimator based on the IPD pooled analysis, denoted by IPDpool, and attain consistent variable selection. Such properties are not readily available in the existing literature, and some novel technical tools are developed for the theoretical verification. We also show theoretically that SHIR is statistically more efficient than the approach based on integrating and thresholding locally debiased estimators (see, e.g., Lee et al. 2017; Battey et al. 2018). Results from our numerical studies confirm that SHIR performs similarly to the ideal IPDpool estimator and outperforms the other methods.

1.4. Outline of the Paper

The rest of this article is organized as follows. We introduce the settings in Section 2 and describe SHIR, our proposed approach, in Section 3. Theoretical properties of the SHIR estimator are studied in Section 4. We derive the upper bounds for its prediction and estimation risks, compare them with those of the existing approach, and show that the errors incurred by aggregating derived data are negligible compared to the statistical minimax rate. When the true model is ultra-sparse, SHIR is shown to be asymptotically equivalent to the IPDpool estimator and achieves sparsistency. Section 5 compares the performance of SHIR to existing methods through simulations. We apply SHIR to derive classification models for coronary artery disease (CAD) using EHR data from four different disease cohorts in Section 6. Section 7 concludes the paper with a discussion. Technical proofs of the theoretical results and additional numerical results are provided in the supplementary material.

2. Problem Statement

Throughout, for any integer $d$, let $[d]=\{1,\ldots,d\}$. For any vector $x=(x_1,x_2,\ldots,x_d)^{\top}\in\mathbb{R}^d$ and index set $\mathcal{S}=\{j_1,\ldots,j_k: j_1<\cdots<j_k\}\subseteq[d]$, let $x_{\mathcal{S}}=(x_{j_1},\ldots,x_{j_k})^{\top}$, $x_{-1}=(x_2,\ldots,x_d)^{\top}$, let $\|x\|_q$ denote the $\ell_q$ norm of $x$, and $\|x\|_{\infty}=\max_{j\in[d]}|x_j|$. Suppose there are $M$ independent studies with $n_m$ subjects in the $m$th study, for $m=1,\ldots,M$. For the $i$th subject in the $m$th study, let $Y_i^{(m)}$ and $X_i^{(m)}$ denote the response and the $p$-dimensional covariate vector, respectively, $D_i^{(m)}=(Y_i^{(m)},X_i^{(m)\top})^{\top}$, $\mathbf{Y}^{(m)}=(Y_1^{(m)},\ldots,Y_{n_m}^{(m)})^{\top}$, and $\mathbf{X}^{(m)}=(X_1^{(m)},X_2^{(m)},\ldots,X_{n_m}^{(m)})^{\top}$. We assume that the observations in study $m$, $\mathcal{D}^{(m)}=\{D_i^{(m)}, i=1,\ldots,n_m\}$, are independent and identically distributed. Without loss of generality, assume that $X_i^{(m)}$ includes 1 as its first component and that $X_{i,-1}^{(m)}$ has mean zero. Define the population parameters of interest as

$$\beta_0^{(m)}=\operatorname*{argmin}_{\beta^{(m)}}\mathcal{L}_m(\beta^{(m)}),\quad\text{where}\quad \mathcal{L}_m(\beta^{(m)})=\mathbb{E}\{f(\beta^{(m)\top}X_i^{(m)},Y_i^{(m)})\},\quad \beta^{(m)}=(\beta_1^{(m)},\beta_2^{(m)},\ldots,\beta_p^{(m)})^{\top},$$

for some specified loss function $f$. Let $\beta_j=(\beta_j^{(1)},\ldots,\beta_j^{(M)})^{\top}$, $\beta^{(\bullet)}=(\beta^{(1)\top},\ldots,\beta^{(M)\top})^{\top}$, and let $\beta_{0j}$ and $\beta_0^{(\bullet)}$ denote the true values of $\beta_j$ and $\beta^{(\bullet)}$. We consider the ultra-high dimensional setting, where the covariate dimension $p$ can grow at an exponential rate of the total sample size $N=\sum_{m=1}^{M}n_m$.

For each $j$, we follow the typical meta-analysis formulation and decompose $\beta_j^{(m)}$ as $\beta_j^{(m)}=\mu_j+\alpha_j^{(m)}$ with $\alpha_j=(\alpha_j^{(1)},\ldots,\alpha_j^{(M)})^{\top}$, and we set $\mathbf{1}_{M\times1}^{\top}\alpha_j=0$ for identifiability. Here, $\mu_j$ represents the average effect of the covariate $X_j$ and $\alpha_j$ captures the between-study heterogeneity of the effects. Let $\mu=(\mu_1,\ldots,\mu_p)^{\top}$, $\alpha^{(\bullet)}=(\alpha^{(1)\top},\ldots,\alpha^{(M)\top})^{\top}$, $\alpha_{-1}^{(\bullet)}=(\alpha_{-1}^{(1)\top},\ldots,\alpha_{-1}^{(M)\top})^{\top}$, and let $\mu_0$ and $\alpha_0^{(\bullet)}$ denote the true values of $\mu$ and $\alpha^{(\bullet)}$, respectively. Consider the empirical global loss function

$$\widehat{\mathcal{L}}(\beta^{(\bullet)})=N^{-1}\sum_{m=1}^{M}n_m\widehat{\mathcal{L}}_m(\beta^{(m)}),\quad\text{where}\quad \widehat{\mathcal{L}}_m(\beta^{(m)})=n_m^{-1}\sum_{i=1}^{n_m}f(\beta^{(m)\top}X_i^{(m)},Y_i^{(m)}),\quad m=1,\ldots,M.$$

Minimizing $\widehat{\mathcal{L}}(\beta^{(\bullet)})$ is obviously equivalent to estimating each $\beta^{(m)}$ using $\mathcal{D}^{(m)}$ only. To improve the estimation of $\beta_0^{(\bullet)}$ by synthesizing information from $\mathcal{D}^{(\bullet)}$ and to overcome the high dimensionality, we employ penalized loss functions, $\widehat{\mathcal{L}}(\beta^{(\bullet)})+\lambda\rho(\beta^{(\bullet)})$, with the penalty function $\rho(\cdot)$ designed to leverage prior structural information on $\beta_0^{(\bullet)}$. Under the prior assumption that $\mu_0$ is sparse and that $\alpha_{0,-1}^{(1)},\ldots,\alpha_{0,-1}^{(M)}$ are sparse and share the same support, we impose a mixture of LASSO and group LASSO penalties: $\rho(\beta^{(\bullet)})=\sum_{j=2}^{p}|\mu_j|+\lambda_g\sum_{j=2}^{p}\|\alpha_j\|_2$, where $\lambda_g\ge0$ is a tuning parameter. A similar penalty has been used in Cheng, Lu, and Liu (2015). Our construction differs slightly from that of Cheng, Lu, and Liu (2015), where $\|\alpha_{j,-1}\|_2$ was used instead of $\|\alpha_j\|_2$. This modified penalty leads to two main advantages: (i) the estimator is invariant to permutation of the indices of the $M$ studies; and (ii) it yields better theoretical estimation error bounds for the heterogeneous effects. An idealized IPDpool estimator for $\beta_0^{(\bullet)}$ can then be obtained as

$$\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}=\operatorname*{argmin}_{\beta^{(\bullet)}}\widehat{\mathcal{Q}}(\beta^{(\bullet)}),\quad\text{where}\quad\widehat{\mathcal{Q}}(\beta^{(\bullet)})=\widehat{\mathcal{L}}(\beta^{(\bullet)})+\lambda\rho(\beta^{(\bullet)}),\tag{1}$$

for some tuning parameter $\lambda\ge0$. However, the IPDpool estimator is not feasible under the DataSHIELD constraint. Our goal is to construct an alternative estimator that asymptotically attains the same efficiency as $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$ but requires sharing only summary data. When $p$ is small, the sparse meta-analysis (SMA) approach of He et al. (2016) achieves this goal by estimating $\beta^{(\bullet)}$ as $\widehat{\beta}_{\mathrm{SMA}}^{(\bullet)}=\operatorname{argmin}_{\beta^{(\bullet)}}\widehat{Q}_{\mathrm{SMA}}(\beta^{(\bullet)})$, where $\widehat{Q}_{\mathrm{SMA}}(\beta^{(\bullet)})=N^{-1}\sum_{m=1}^{M}(\beta^{(m)}-\breve{\beta}^{(m)})^{\top}\breve{V}_m^{-1}(\beta^{(m)}-\breve{\beta}^{(m)})+\lambda\rho(\beta^{(\bullet)})$, $\breve{\beta}^{(m)}=\operatorname{argmin}_{\beta^{(m)}}\widehat{\mathcal{L}}_m(\beta^{(m)})$, and $\breve{V}_m=\{n_m\nabla^2\widehat{\mathcal{L}}_m(\breve{\beta}^{(m)})\}^{-1}$. The SMA method satisfies the DataSHIELD constraint since only the derived statistics $\breve{\beta}^{(m)}$ and $\breve{V}_m$ are shared in the integrative regression. The SMA estimator attains the oracle property when $p$ is relatively small, but fails for large $p$ due to the failure of $\breve{\beta}^{(m)}$.
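
To make the decomposition and the mixture penalty concrete, the following minimal Python sketch (function and variable names are ours, not part of the paper's implementation) evaluates $\rho(\beta^{(\bullet)})$ from the reparameterization $\beta_j^{(m)}=\mu_j+\alpha_j^{(m)}$ under the identifiability constraint $\mathbf{1}_{M\times1}^{\top}\alpha_j=0$:

```python
import numpy as np

def mixture_penalty(mu, alpha, lam_g):
    """rho(beta) = sum_{j>=2} |mu_j| + lam_g * sum_{j>=2} ||alpha_j||_2.

    mu    : (p,) average effects; mu[0] is the intercept and is not penalized.
    alpha : (M, p) study-specific deviations, with each column summing to zero.
    """
    return np.abs(mu[1:]).sum() + lam_g * np.linalg.norm(alpha[:, 1:], axis=0).sum()

def beta_from(mu, alpha):
    """Recover the study-specific coefficients beta_j^(m) = mu_j + alpha_j^(m)."""
    return mu[None, :] + alpha  # shape (M, p)

# toy example with M = 3 studies and p = 4 covariates (first entry is the intercept)
mu = np.array([0.1, 0.5, 0.0, -0.3])
alpha = np.array([[0.0,  0.2, 0.0, 0.0],
                  [0.0, -0.1, 0.0, 0.0],
                  [0.0, -0.1, 0.0, 0.0]])  # columns sum to zero
print(mixture_penalty(mu, alpha, lam_g=1 / np.sqrt(3)))
print(beta_from(mu, alpha))
```

Here $\lambda_g$ is set to $M^{-1/2}$ only to match the order used later in the theory; the sketch is meant solely to fix notation.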

3. Data-SHIR

3.1. SHIR Method

In the high-dimensional setting, one may overcome the limitation of the SMA approach by replacing $\breve{\beta}^{(m)}$ with the regularized LASSO estimator,

$$\widehat{\beta}_{\mathrm{LASSO}}^{(m)}=\operatorname*{argmin}_{\beta^{(m)}}\widehat{\mathcal{L}}_m(\beta^{(m)})+\lambda_m\|\beta_{-1}^{(m)}\|_1.\tag{2}$$

However, aggregating $\{\widehat{\beta}_{\mathrm{LASSO}}^{(m)}, m\in[M]\}$ is problematic for large $p$ due to their inherent biases. To overcome the bias issue, we build the SHIR method motivated by SMA and the debiasing approach for LASSO (see, e.g., Van de Geer et al. 2014), yet achieve debiasing without having to debias the $M$ local estimators. Specifically, we propose the SHIR estimator for $\beta_0^{(\bullet)}$ as $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}=\operatorname{argmin}_{\beta^{(\bullet)}}\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})$, where

$$\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})=N^{-1}\sum_{m=1}^{M}n_m\{\beta^{(m)\top}\widehat{\mathbb{H}}_m\beta^{(m)}-2\beta^{(m)\top}\widehat{g}_m\}+\lambda\rho(\beta^{(\bullet)}),\tag{3}$$

$\widehat{\mathbb{H}}_m=\nabla^2\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})$ is an estimate of the Hessian matrix, and $\widehat{g}_m=\widehat{\mathbb{H}}_m\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\nabla\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})$. Our SHIR estimator $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ satisfies the DataSHIELD constraint because $\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})$ depends on $\mathcal{D}^{(m)}$ only through the summary statistics $\widehat{\mathcal{D}}_m=\{n_m,\widehat{\mathbb{H}}_m,\widehat{g}_m\}$, which can be obtained within the $m$th study, and it requires only one round of data transfer from the local sites to the central node.

With $\{\widehat{\mathbb{H}}_m,\widehat{g}_m, m=1,\ldots,M\}$, we may implement the SHIR procedure using coordinate descent algorithms (Friedman, Hastie, and Tibshirani 2010) along with a reparameterization. Let

$$\widehat{Q}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)})=\widehat{\mathcal{L}}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)})+\lambda\rho(\mu,\alpha^{(\bullet)};\lambda_g),$$

where $\rho(\mu,\alpha^{(\bullet)};\lambda_g)=\|\mu_{-1}\|_1+\lambda_g\|\alpha_{-1}^{(\bullet)}\|_{2,1}$, $\|\alpha_{-1}^{(\bullet)}\|_{2,1}=\sum_{j=2}^{p}\|\alpha_j\|_2$, and

$$\widehat{\mathcal{L}}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)})=N^{-1}\sum_{m=1}^{M}n_m\{(\mu+\alpha^{(m)})^{\top}\widehat{\mathbb{H}}_m(\mu+\alpha^{(m)})-2\widehat{g}_m^{\top}(\mu+\alpha^{(m)})\}.$$

Then the optimization problem in Equation (3) can be reparameterized and represented as

$$(\widehat{\mu}_{\mathrm{SHIR}},\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)})=\operatorname*{argmin}_{(\mu,\alpha^{(\bullet)})}\widehat{Q}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)}),\quad\text{s.t. }\mathbf{1}_{M\times1}^{\top}\alpha_j=0,\ \forall j\in[p],$$

and $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ is obtained via the transformation $\beta_j^{(m)}=\mu_j+\alpha_j^{(m)}$ for every $j\in[p]$. The above procedure is presented in Algorithm A1 in Section A.5 of the supplementary material.
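
To illustrate the two communication steps, here is a minimal Python sketch assuming a logistic loss; the function and variable names are ours, and the penalized minimization at the central node is left to a generic group-lasso solver (the paper's Algorithm A1 uses coordinate descent):

```python
import numpy as np

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def local_summary(X, y, beta_lasso):
    """At site m: compute the shared summary D_m = (n_m, H_m, g_m) from a pre-fit
    local LASSO solution, for the logistic loss
    L_m(b) = n^{-1} sum_i {log(1 + exp(x_i'b)) - y_i x_i'b}."""
    n = X.shape[0]
    p_hat = expit(X @ beta_lasso)
    grad = X.T @ (p_hat - y) / n                         # gradient of L_m at beta_lasso
    H = (X * (p_hat * (1 - p_hat))[:, None]).T @ X / n   # Hessian of L_m at beta_lasso
    g = H @ beta_lasso - grad
    return n, H, g

def shir_surrogate(mu, alpha, summaries):
    """Central node: evaluate L_SHIR(mu, alpha) using only the shared summaries."""
    N = sum(n for n, _, _ in summaries)
    loss = 0.0
    for (n, H, g), a in zip(summaries, alpha):
        b = mu + a                                       # beta^(m) = mu + alpha^(m)
        loss += n * (b @ H @ b - 2 * g @ b)
    return loss / N
```

Minimizing `shir_surrogate` plus $\lambda\rho(\mu,\alpha^{(\bullet)};\lambda_g)$ subject to $\sum_m\alpha^{(m)}=0$ can then be carried out with any sparse-group solver after substituting $\alpha^{(1)}=-\sum_{m\ge2}\alpha^{(m)}$, exactly as described above.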

Remark 1.

The first term in $\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})$ is essentially the second-order Taylor expansion of $\widehat{\mathcal{L}}(\beta^{(\bullet)})$ at the local LASSO estimators $\widehat{\beta}_{\mathrm{LASSO}}^{(\bullet)}$. The SHIR method can also be viewed as approximately aggregating local debiased LASSO estimators without actually carrying out the standard debiasing process. To see this, let $\widehat{Q}_{\mathrm{dLASSO}}(\beta^{(\bullet)})=N^{-1}\sum_{m=1}^{M}n_m(\beta^{(m)}-\widehat{\beta}_{\mathrm{dLASSO}}^{(m)})^{\top}\widehat{\mathbb{H}}_m(\beta^{(m)}-\widehat{\beta}_{\mathrm{dLASSO}}^{(m)})+\lambda\rho(\beta^{(\bullet)})$, where $\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}$ is the debiased LASSO estimator for the $m$th study with

$$\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}=\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\widehat{\Theta}_m\nabla\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)}),\quad\text{for }m=1,\ldots,M,\tag{4}$$

and $\widehat{\Theta}_m$ is a regularized inverse of $\widehat{\mathbb{H}}_m$. We may write

$$\begin{aligned}\widehat{Q}_{\mathrm{dLASSO}}(\beta^{(\bullet)})&=N^{-1}\sum_{m=1}^{M}\big\{n_m\big[\beta^{(m)\top}\widehat{\mathbb{H}}_m\beta^{(m)}-2\beta^{(m)\top}\widehat{\mathbb{H}}_m\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}\big]+C_m\big\}+\lambda\rho(\beta^{(\bullet)})\\&\approx N^{-1}\sum_{m=1}^{M}\big\{n_m\big[\beta^{(m)\top}\widehat{\mathbb{H}}_m\beta^{(m)}-2\beta^{(m)\top}\widehat{g}_m\big]+C_m\big\}+\lambda\rho(\beta^{(\bullet)})=\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})+N^{-1}\sum_{m=1}^{M}C_m,\end{aligned}$$

where we use $\widehat{\Theta}_m\widehat{\mathbb{H}}_m\approx\mathbf{I}$ in the above approximation and the term

$$C_m=n_m\{\widehat{\mathbb{H}}_m\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\widehat{\mathbb{H}}_m\widehat{\Theta}_m\nabla\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})\}^{\top}\{\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\widehat{\Theta}_m\nabla\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})\}$$

does not depend on $\beta^{(\bullet)}$. We use $\widehat{\Theta}_m\widehat{\mathbb{H}}_m\approx\mathbf{I}$ only heuristically above to show the connection between our SHIR estimator and the debiased LASSO; the validity and asymptotic properties of the SHIR estimator require neither obtaining any $\widehat{\Theta}_m$ nor establishing a theoretical guarantee that $\widehat{\Theta}_m\widehat{\mathbb{H}}_m$ is sufficiently close to $\mathbf{I}$.

Remark 2.

Compared with existing debiasing-based methods (Lee et al. 2017; Battey et al. 2018), the SHIR approach is both computationally and statistically efficient. It does not rely on the debiased statistics (4) and achieves debiasing without calculating $\widehat{\Theta}_m$, which can only be estimated well under strong conditions (Van de Geer et al. 2014; Janková and Van De Geer 2016).

3.2. Tuning Parameter Selection

The implementation of SHIR requires selecting three sets of tuning parameters: $\{\lambda_m, m\in[M]\}$, $\lambda$, and $\lambda_g$. We select $\{\lambda_m, m\in[M]\}$ for the local LASSO problems via standard $K$-fold cross-validation (CV). Selecting $\lambda$ and $\lambda_g$ requires balancing the tradeoff between the model's degrees of freedom, denoted by $\mathrm{DF}(\lambda,\lambda_g)$, and the quadratic loss in $\widehat{Q}_{\mathrm{SHIR}}(\beta^{(\bullet)})$. It is not feasible to tune $\lambda$ and $\lambda_g$ via CV since individual-level data are not available at the central site. We propose to select $\lambda$ and $\lambda_g$ as the minimizer of the generalized information criterion (GIC) (Wang and Leng 2007; Zhang, Li, and Tsai 2010), defined as

$$\mathrm{GIC}(\lambda,\lambda_g)=\mathrm{Deviance}(\lambda,\lambda_g)+\gamma_N\,\mathrm{DF}(\lambda,\lambda_g),$$

where $\gamma_N$ is a prespecified scaling parameter and

$$\mathrm{Deviance}(\lambda,\lambda_g)=N^{-1}\sum_{m=1}^{M}n_m\{\widehat{\beta}_{\mathrm{SHIR}}^{(m)}(\lambda,\lambda_g)^{\top}\widehat{\mathbb{H}}_m\widehat{\beta}_{\mathrm{SHIR}}^{(m)}(\lambda,\lambda_g)-2\widehat{g}_m^{\top}\widehat{\beta}_{\mathrm{SHIR}}^{(m)}(\lambda,\lambda_g)\}.$$

Following Zhang, Li, and Tsai (2010) and Vaiter et al. (2012), we define $\mathrm{DF}(\lambda,\lambda_g)$ as the trace of

$$\big[\nabla^2_{\widehat{\mathcal{S}}_\mu,\widehat{\mathcal{S}}_\alpha}\widehat{Q}_{\mathrm{SHIR}}(\widehat{\mu}_{\mathrm{SHIR}},\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)})\big]^{-1}\big[\nabla^2_{\widehat{\mathcal{S}}_\mu,\widehat{\mathcal{S}}_\alpha}\widehat{\mathcal{L}}_{\mathrm{SHIR}}(\widehat{\mu}_{\mathrm{SHIR}},\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)})\big],$$

where $\widehat{\mathcal{S}}_\mu=\{j:\widehat{\mu}_{\mathrm{SHIR},j}(\lambda,\lambda_g)\neq0\}$, $\widehat{\mathcal{S}}_\alpha=\{j:\|\widehat{\alpha}_{\mathrm{SHIR},j}(\lambda,\lambda_g)\|_2\neq0\}$, and the operator $\nabla^2_{\widehat{\mathcal{S}}_\mu,\widehat{\mathcal{S}}_\alpha}$ denotes the second-order partial derivative with respect to $(\mu_{\widehat{\mathcal{S}}_\mu}^{\top},\alpha^{(2)\top}_{\widehat{\mathcal{S}}_\alpha},\ldots,\alpha^{(M)\top}_{\widehat{\mathcal{S}}_\alpha})^{\top}$ after plugging $\alpha^{(1)}=-\sum_{m=2}^{M}\alpha^{(m)}$ into $\widehat{Q}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)})$ or $\widehat{\mathcal{L}}_{\mathrm{SHIR}}(\mu,\alpha^{(\bullet)})$.
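
A minimal sketch of how these quantities can be evaluated at the central node from the shared summaries alone; the degrees of freedom below are approximated by the number of free parameters in the selected model, $|\widehat{\mathcal{S}}_\mu|+(M-1)|\widehat{\mathcal{S}}_\alpha|$, a simplification of the trace formula above, and the helper names are ours:

```python
import numpy as np

def deviance(mu, alpha, summaries):
    """Deviance(lambda, lambda_g) at the SHIR fit, using only D_m = (n_m, H_m, g_m)."""
    N = sum(n for n, _, _ in summaries)
    dev = 0.0
    for (n, H, g), a in zip(summaries, alpha):
        b = mu + a
        dev += n * (b @ H @ b - 2 * g @ b)
    return dev / N

def gic(mu, alpha, summaries, gamma_N):
    """GIC = Deviance + gamma_N * DF, with DF approximated by the selected-model size."""
    alpha = np.asarray(alpha)
    M = alpha.shape[0]
    s_mu = int(np.sum(mu != 0))
    s_alpha = int(np.sum(np.linalg.norm(alpha, axis=0) != 0))
    df = s_mu + (M - 1) * s_alpha
    return deviance(mu, alpha, summaries) + gamma_N * df

# BIC-type choice used in the paper: gamma_N = log(N) / N; (lambda, lambda_g) is chosen
# to minimize gic over a grid of candidate values.
```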

Remark 3.

As discussed in Kim, Kwon, and Choi (2012), $\gamma_N$ can be chosen depending on the goal, with common choices including $\gamma_N=2/N$ for AIC (Akaike 1974), $\gamma_N=\log N/N$ for BIC (Bhat and Kumar 2010), $\gamma_N=\log\log p\cdot\log N/N$ for the modified BIC (Wang, Li, and Leng 2009), and $\gamma_N=2\log p/N$ for RIC (Foster and George 1994). We used BIC with $\gamma_N=\log N/N$ in our numerical studies.

Remark 4.

For linear models, it has been shown that a proper choice of $\gamma_N$ guarantees GIC's model selection consistency under various divergence rates of the dimension $p$ (Kim, Kwon, and Choi 2012). For example, for fixed $p$, GIC is consistent if $N\gamma_N\to\infty$ and $\gamma_N\to0$. When $p$ diverges at a polynomial rate $N^{\xi}$, GIC is consistent with $\gamma_N=\log N/N$ (BIC) if $0<\xi<1/2$, and with $\gamma_N=\log\log p\cdot\log N/N$ (modified BIC) if $0<\xi<1$. When $p$ diverges at an exponential rate $O(\exp(\kappa N^{\xi}))$, GIC is consistent with $\gamma_N=N^{v-1}$ for $\xi<v<1$. These results can be naturally extended to more general log-likelihood functions.

4. Theoretical Results

In this section, we present theoretical properties of $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ under the mixture penalty $\rho(\beta^{(\bullet)})$ defined in Section 2, and discuss how our theoretical results can be extended to other sparse structures in Section 7. In Sections 4.2 and 4.3, we derive consistency and equivalence results for the prediction and estimation risks of SHIR under a high-dimensional sparse model and a smooth loss function $f$. In Section 4.4, we compare the risk bounds of SHIR with those of an estimator derived from the debiasing-based aggregation approaches (Lee et al. 2017; Battey et al. 2018). In addition, Section 4.5 shows that SHIR achieves sparsistency, that is, variable selection consistency, for the nonzero sets of $\mu_0$ and $\alpha_0^{(\bullet)}$. We begin with some notation and definitions that will be used throughout the article.

4.1. Notation and Definitions

Let $o\{a(n)\}$, $O\{a(n)\}$, $\omega\{a(n)\}$, $\Omega\{a(n)\}$, and $\Theta\{a(n)\}$ respectively denote sequences that grow at a smaller, equal or smaller, larger, equal or larger, and equal rate compared with the sequence $a(n)$. Similarly, let $o_P$, $O_P$, $\omega_P$, $\Omega_P$, and $\Theta_P$ denote the corresponding rates holding with probability approaching 1 as $n\to\infty$.

For any vector $v_0\in\mathbb{R}^d$, denote the $\ell_2$-ball around $v_0$ with radius $r>0$ by $\mathcal{B}_r(v_0)=\{v\in\mathbb{R}^d:\|v-v_0\|_2\le r\}$. Following Vershynin (2018), we define the sub-Gaussian norm of a random variable $X$ as $\|X\|_{\psi_2}:=\sup_{q\ge1}q^{-1/2}(\mathbb{E}|X|^q)^{1/q}$, and for any random vector $X=(X_1,\ldots,X_d)^{\top}$, its sub-Gaussian norm is defined as $\|X\|_{\psi_2}=\sup_{v\in\mathcal{B}_1(0)}\|v^{\top}X\|_{\psi_2}$. For any symmetric matrix $\mathbb{X}$, let $\Lambda_{\min}(\mathbb{X})$ and $\Lambda_{\max}(\mathbb{X})$ denote its minimum and maximum eigenvalues, respectively. For $a\in\mathbb{R}$, denote by $\mathrm{sign}(a)$ the sign of $a$, and for an event $\mathcal{E}$, denote by $I(\mathcal{E})$ the indicator of $\mathcal{E}$. Denote $\mathcal{S}_\mu=\{j:\mu_{0j}\neq0\}$, $\mathcal{S}_\alpha=\{j:\|\alpha_{0j}\|_2\neq0\}$, $\mathcal{S}_0=\mathcal{S}_\mu\cup\mathcal{S}_\alpha$, $\mathcal{S}_{\mathrm{full}}=\{\mathcal{S}_\mu,\mathcal{S}_\alpha\}$, $s_\mu=|\mathcal{S}_\mu|$, $s_\alpha=|\mathcal{S}_\alpha|$, and $s_0=|\mathcal{S}_0|$. Let $f_1(a,y)=\partial f(a,y)/\partial a$ and $f_2(a,y)=\partial^2 f(a,y)/\partial a^2$. Also, let $\mathbb{H}(\beta^{(\bullet)})=N^{-1}\mathrm{bdiag}\{n_1\mathbb{H}_1(\beta^{(1)}),n_2\mathbb{H}_2(\beta^{(2)}),\ldots,n_M\mathbb{H}_M(\beta^{(M)})\}$ with $\mathbb{H}_m(\beta^{(m)})=\nabla^2\widehat{\mathcal{L}}_m(\beta^{(m)})$, $\widehat{\mathbb{H}}=\mathbb{H}(\widehat{\beta}_{\mathrm{LASSO}}^{(\bullet)})$, $\bar{\mathbb{H}}_m(\beta^{(m)})=\mathbb{E}[\mathbb{H}_m(\beta^{(m)})]$, and $\bar{\mathbb{H}}_m=\bar{\mathbb{H}}_m(\beta_0^{(m)})$. Finally, we introduce the compatibility condition ($\mathcal{C}_{\mathrm{comp}}$) below.

Definition 1 (Compatibility Condition $\mathcal{C}_{\mathrm{comp}}$).

The Hessian matrix $\mathbb{H}(\beta^{(\bullet)})$ and the index set $\mathcal{S}$ satisfy the compatibility condition if, for all $(\mu_\Delta^{\top},\alpha_\Delta^{(\bullet)\top})^{\top}=(\mu_\Delta^{\top},\alpha_\Delta^{(1)\top},\ldots,\alpha_\Delta^{(M)\top})^{\top}\in\mathcal{C}(t,\mathcal{S})$ with any constant $t>0$, there exists a constant $\phi_0\{t,\mathcal{S},\mathbb{H}(\beta^{(\bullet)})\}$ such that

$$(\|\mu_{\Delta}\|_1+\lambda_g\|\alpha_{\Delta}^{(\bullet)}\|_{2,1})^2\le N^{-1}\sum_{m=1}^{M}n_m|\mathcal{S}|\,\big\|\mathbb{H}_m^{1/2}(\beta^{(m)})(\mu_\Delta+\alpha_\Delta^{(m)})\big\|_2^2\big/\phi_0\{t,\mathcal{S},\mathbb{H}(\beta^{(\bullet)})\},$$

where $\mathcal{C}(t,\mathcal{S})=\{(u^{\top},v^{(\bullet)\top})^{\top}=(u^{\top},v^{(1)\top},\ldots,v^{(M)\top})^{\top}: v^{(1)}+\cdots+v^{(M)}=0,\ \|u_{\mathcal{S}^c}\|_1+\lambda_g\|v_{\mathcal{S}^c}^{(\bullet)}\|_{2,1}\le t(\|u_{\mathcal{S}}\|_1+\lambda_g\|v_{\mathcal{S}}^{(\bullet)}\|_{2,1})\}$, and $\phi_0\{t,\mathcal{S},\mathbb{H}(\beta^{(\bullet)})\}$ represents the compatibility constant of $\mathbb{H}(\beta^{(\bullet)})$ on the set $\mathcal{S}$.

4.2. Prediction and Estimation Consistency

To establish the theoretical properties of the SHIR estimator in terms of estimation and prediction risks, we first introduce some sufficient conditions. Throughout the following analysis, we assume that $n_m=\Theta(N/M)$ for $m\in[M]$ and $\lambda_g=\Theta(M^{-1/2})$.

Condition 1.

There exists an absolute constant $\phi_0^*>0$ such that for all $\delta_1=\Theta\{(s_0M\log p/N)^{1/2}\}$ and $\beta^{(\bullet)}=(\beta^{(1)\top},\ldots,\beta^{(M)\top})^{\top}$ satisfying $\beta^{(m)}\in\mathcal{B}_{\delta_1}(\beta_0^{(m)})$, the Hessian matrix $\mathbb{H}(\beta^{(\bullet)})$ and the index set $\mathcal{S}_0$ satisfy $\mathcal{C}_{\mathrm{comp}}$ (Definition 1) with compatibility constant $\phi_0\{t,\mathcal{S}_0,\mathbb{H}(\beta^{(\bullet)})\}\ge\phi_0^*$.

Condition 2.

For all $m\in[M]$, $X_{ij}^{(m)}f_1(\beta_0^{(m)\top}X_i^{(m)},Y_i^{(m)})$ is sub-Gaussian; that is, there exists a positive constant $\kappa=\Theta(1)$ such that $\|X_{ij}^{(m)}f_1(\beta_0^{(m)\top}X_i^{(m)},Y_i^{(m)})\|_{\psi_2}<\kappa$. In addition, there exists $B>0$ such that $\max_{m\in[M],i\in[n_m]}\|X_i^{(m)}\|_{\infty}\le B$.

Condition 3.

There exists a positive constant $C_L=\Theta(1)$ such that $|f_1(a,y)-f_1(b,y)|\le C_L|a-b|$ for all $a,b$.

Remark 5.

Condition 1 is in a similar spirit to the restricted eigenvalue or restricted strong convexity conditions introduced by Negahban et al. (2012). The first part of Condition 2 controls the tail behavior of $X_{ij}^{(m)}f_1(a,y)$ so that the random error $\nabla\widehat{\mathcal{L}}_m(\beta_0^{(m)})$ can be bounded properly and the method can benefit from the group sparsity of $\alpha^{(\bullet)}$ (Huang and Zhang 2010). This condition can be easily verified for sub-Gaussian designs and an extensive class of models, for example, the logistic model. In addition, the condition $\max_{m\in[M],i\in[n_m]}\|X_i^{(m)}\|_{\infty}\le B$ holds for bounded designs with $B=\Theta(1)$ and for sub-Gaussian designs with $B=\Theta[\{\log(pN)\}^{1/2}]$. Condition 3 assumes a smooth loss $f$ to guarantee that the empirical Hessian matrix $\nabla^2\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})$ is close enough to $\nabla^2\widehat{\mathcal{L}}_m(\beta_0^{(m)})$, and that $\widehat{g}_m=\widehat{\mathbb{H}}_m\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\nabla\widehat{\mathcal{L}}_m(\widehat{\beta}_{\mathrm{LASSO}}^{(m)})$ is close enough to $\nabla^2\widehat{\mathcal{L}}_m(\beta_0^{(m)})\beta_0^{(m)}-\nabla\widehat{\mathcal{L}}_m(\beta_0^{(m)})$.

Proposition 1 below shows that, for sub-Gaussian weighted designs with a regular Hessian matrix, Condition 1 (the compatibility condition) holds with probability approaching 1. This can be viewed as an extension of existing results for sub-Gaussian designs in the linear model with a lasso penalty (Rivasplata 2012) to our case with a nonlinear model and the mixture penalty. We present the proof of Proposition 1 in Section A.1 of the supplementary material.

Proposition 1.

Assume that $s_0=o\{N/(M\log p)\}$ and Condition 3 holds. Assume in addition that there exist absolute constants $\kappa_x, C_x>0$ such that for all $m\in[M]$, $C_x^{-1}\le\Lambda_{\min}(\bar{\mathbb{H}}_m)\le\Lambda_{\max}(\bar{\mathbb{H}}_m)\le C_x$, $\max_{x\in\mathcal{B}_1(0)}\mathbb{E}[x^{\top}X_i^{(m)}]^4\le C_x$, and, for any $\delta_1=\Theta\{(s_0M\log p/N)^{1/2}\}$ and $\beta^{(m)}\in\mathcal{B}_{\delta_1}(\beta_0^{(m)})$, $\|X_i^{(m)}\{f_2(\beta^{(m)\top}X_i^{(m)},Y_i^{(m)})\}^{1/2}\|_{\psi_2}\le\kappa_x$. Then Condition 1 is satisfied with probability approaching 1.

Remark 6.

As an important example in practice, it is not hard to verify that, for logistic models with $f(a,y)=-ya+\log(1+e^{a})$ and sub-Gaussian covariates $X_i^{(m)}$, the key assumption on the weighted design required in Proposition 1, $\|X_i^{(m)}\{f_2(\beta^{(m)\top}X_i^{(m)},Y_i^{(m)})\}^{1/2}\|_{\psi_2}\le\kappa_x$, is satisfied.

We further assume in Condition 4 that the local LASSO estimators achieve the minimax optimal error rates up to a logarithmic factor (Raskutti, Wainwright, and Yu 2011; Negahban et al. 2012).

Condition 4.

The local estimators satisfy $\max_{m\in[M]}\|\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\beta_0^{(m)}\|_1=O_P\{s_0(\log p/n_m)^{1/2}\}$, and both $\max_{m\in[M]}\|\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\beta_0^{(m)}\|_2$ and $\max_{m\in[M]}n_m^{-1/2}\|\mathbf{X}^{(m)}(\widehat{\beta}_{\mathrm{LASSO}}^{(m)}-\beta_0^{(m)})\|_2$ are $O_P\{(s_0\log p/n_m)^{1/2}\}$.

Remark 7.

An extensive literature, including Van de Geer et al. (2008), Bühlmann and Van De Geer (2011), and Negahban et al. (2012), has established a complete theoretical framework for this property. See, for example, Negahban et al. (2012), where Condition 4 can be established for strongly convex loss functions $f$.

Next, we present the risk bounds for SHIR, including the prediction risk $\|\widehat{\mathbb{H}}^{1/2}(\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}-\beta_0^{(\bullet)})\|_2$ and the estimation risk $\|\widehat{\mu}_{\mathrm{SHIR}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}$.

Theorem 1 (Risk bounds for SHIR).

Under Conditions 1–4, there exist $\lambda=\Theta(\{(\log p+M)/N\}^{1/2}+Bs_0M\log p/N)$ and $\lambda_g=\Theta(M^{-1/2})$ such that

$$\|\widehat{\mathbb{H}}^{1/2}(\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}-\beta_0^{(\bullet)})\|_2=O_P\big(\{s_0(\log p+M)/N\}^{1/2}+Bs_0^{3/2}M\log p/N\big);$$
$$\|\widehat{\mu}_{\mathrm{SHIR}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}=O_P\big(s_0\{(\log p+M)/N\}^{1/2}+Bs_0^{2}M\log p/N\big).$$

Note that, in Theorem 1, the rate of the penalty coefficient for $\sum_{j=2}^{p}\|\alpha_j\|_2$ is $\lambda\lambda_g=\Theta[\{(\log p+M)/(NM)\}^{1/2}+BM^{1/2}s_0\log p/N]$. The second term in each of the upper bounds of Theorem 1 is the error incurred by aggregating derived data instead of raw data. These terms are asymptotically negligible under the sparsity condition $s_0=o(\{N(\log p+M)\}^{1/2}/[BM\log p])$. Then $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ achieves the same error rate as the ideal estimator $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$ obtained by combining the raw data, as shown in the following section, and is nearly rate optimal.

4.3. Asymptotic Equivalence in Prediction and Estimation

Under suitable sparsity assumptions, we show the asymptotic equivalence, with respect to prediction and estimation risks, of SHIR and the ideal IPDpool estimator $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$, which can alternatively be defined as

$$(\widehat{\mu}_{\mathrm{IPDpool}},\widehat{\alpha}_{\mathrm{IPDpool}}^{(\bullet)})=\operatorname*{argmin}_{(\mu,\alpha^{(\bullet)})}\widehat{\mathcal{L}}(\mu,\alpha^{(\bullet)})+\tilde{\lambda}\rho(\mu,\alpha^{(\bullet)};\lambda_g),\quad\text{s.t. }\mathbf{1}_{M\times1}^{\top}\alpha_j=0,\ \forall j\in[p],$$

where $\tilde{\lambda}$ is a tuning parameter.

Theorem 2.

(Asymptotic Equivalence) Under the assumptions of Theorem 1 and assuming $s_0=o(\{N(\log p+M)\}^{1/2}/[BM\log p])$, there exist $\tilde{\lambda}=\Theta(\{(\log p+M)/N\}^{1/2})$ and $\lambda_g=\Theta(M^{-1/2})$ such that the IPDpool estimator $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$ satisfies

$$\|\widehat{\mathbb{H}}^{1/2}(\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}-\beta_0^{(\bullet)})\|_2=O_P\big(\{s_0(\log p+M)/N\}^{1/2}\big);\quad \|\widehat{\mu}_{\mathrm{IPDpool}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{IPDpool}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}=O_P\big(s_0\{(\log p+M)/N\}^{1/2}\big).$$

Furthermore, for some $\lambda_\Delta=o(\tilde{\lambda})$, the IPDpool estimator and the SHIR estimator defined by (3) with $\lambda=\tilde{\lambda}+\lambda_\Delta$ are equivalent in prediction and estimation in the sense that

$$\|\widehat{\mathbb{H}}^{1/2}(\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}-\beta_0^{(\bullet)})\|_2\le\|\widehat{\mathbb{H}}^{1/2}(\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}-\beta_0^{(\bullet)})\|_2+o_P\big(\{s_0(\log p+M)/N\}^{1/2}\big);$$
$$\|\widehat{\mu}_{\mathrm{SHIR}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{SHIR}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}\le\|\widehat{\mu}_{\mathrm{IPDpool}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{IPDpool}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}+o_P\big(s_0\{(\log p+M)/N\}^{1/2}\big).$$

Theorem 2 demonstrates the asymptotic equivalence between $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ and $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$ with respect to estimation and prediction risks, and hence implies the optimality of SHIR. Specifically, when $s_0=o(\{N(\log p+M)\}^{1/2}/[BM\log p])$, the excess risks of $\widehat{\beta}_{\mathrm{SHIR}}^{(\bullet)}$ relative to $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}$ are of smaller order than the risks of IPDpool itself, that is, the minimax optimal rates (up to a logarithmic factor) for multi-task learning of high-dimensional sparse models (Huang and Zhang 2010; Lounici et al. 2011). A similar equivalence result was given in Theorem 4.8 of Battey et al. (2018) for the truncated debiased LASSO estimator. However, to the best of our knowledge, such results have not been established in the existing literature for LASSO-type estimators obtained directly from a sparse regression objective. Compared with Battey et al. (2018), our result does not require the Hessian matrix $\widehat{\mathbb{H}}_m$ to have a sparse inverse, since we do not actually rely on debiasing $\widehat{\beta}_{\mathrm{LASSO}}^{(m)}$. Consequently, the proofs of Theorem 2 are considerably more involved than those in Battey et al. (2018). The new technical tools are developed and presented in detail in the supplementary material.

4.4. Comparison With the Debiasing-based Strategy

To compare with existing approaches, we next consider an extension of the debiased-LASSO-based procedures proposed in Lee et al. (2017) and Battey et al. (2018) that incorporates between-study heterogeneity. Specifically, at the $m$th site, we derive the debiased LASSO estimator $\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}$ as defined in (4), where $\widehat{\Theta}_m$ is obtained via nodewise LASSO (Javanmard and Montanari 2014), and send it to the central site. At the central site, we compute $\widehat{\mu}_{\mathrm{dLASSO}}=M^{-1}\sum_{m=1}^{M}\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}$, $\widehat{\alpha}_{\mathrm{dLASSO}}^{(m)}=\widehat{\beta}_{\mathrm{dLASSO}}^{(m)}-\widehat{\mu}_{\mathrm{dLASSO}}$, and $\widehat{\alpha}_{\mathrm{dLASSO}}^{(\bullet)}=(\widehat{\alpha}_{\mathrm{dLASSO}}^{(1)\top},\ldots,\widehat{\alpha}_{\mathrm{dLASSO}}^{(M)\top})^{\top}$. The final estimators for $\mu$ and $\alpha^{(\bullet)}$ are then obtained by thresholding $\widehat{\mu}_{\mathrm{dLASSO}}$ and $\widehat{\alpha}_{\mathrm{dLASSO}}^{(\bullet)}$, as in Lee et al. (2017) and Battey et al. (2018), giving $\widehat{\mu}_{\mathrm{L\&B}}=\mathcal{T}_\mu(\widehat{\mu}_{\mathrm{dLASSO}};\tau_1)$ and $\widehat{\alpha}_{\mathrm{L\&B}}^{(\bullet)}=\mathcal{T}_\alpha(\widehat{\alpha}_{\mathrm{dLASSO}}^{(\bullet)};\tau_2)$, where

$$\mathcal{T}_\mu(\mu;\tau_1)=\{\mu_1,\mu_2^{h+}(\tau_1),\ldots,\mu_p^{h+}(\tau_1)\}^{\top}\ \text{or}\ \{\mu_1,\mu_2^{s+}(\tau_1),\ldots,\mu_p^{s+}(\tau_1)\}^{\top};\quad \mathcal{T}_\alpha(\alpha^{(\bullet)};\tau_2)=\mathrm{vec}\{[\alpha_1,\alpha_2^{h+}(\tau_2),\ldots,\alpha_p^{h+}(\tau_2)]\}\ \text{or}\ \mathrm{vec}\{[\alpha_1,\alpha_2^{s+}(\tau_2),\ldots,\alpha_p^{s+}(\tau_2)]\},$$

where, for any vector $x=(x_1,\ldots,x_d)^{\top}$ and constant $\tau$, $x^{h+}=xI(\|x\|_2>\tau)$ and $x^{s+}=x(1-\|x\|_2^{-1}\tau)I(\|x\|_2>\tau)$ respectively denote the hard- and soft-thresholded counterparts of $x$, and $\mathrm{vec}(\mathbf{A})$ vectorizes the matrix $\mathbf{A}$ by column.
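
For concreteness, here is a minimal Python sketch of the two thresholding operators (the function names are ours); they are applied entrywise to $\widehat{\mu}_{\mathrm{dLASSO}}$ and column-wise to the groups $\widehat{\alpha}_j$:

```python
import numpy as np

def hard_threshold(x, tau):
    """x * I(||x||_2 > tau); x may be a scalar coefficient or a group (vector)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return x if np.linalg.norm(x) > tau else np.zeros_like(x)

def soft_threshold(x, tau):
    """x * (1 - tau / ||x||_2) * I(||x||_2 > tau): group soft-thresholding."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    nrm = np.linalg.norm(x)
    return x * (1 - tau / nrm) if nrm > tau else np.zeros_like(x)
```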

The error rates of $\{\widehat{\mu}_{\mathrm{L\&B}},\widehat{\alpha}_{\mathrm{L\&B}}^{(\bullet)}\}$ can be derived by extending Lee et al. (2017) and Battey et al. (2018). We outline the results below and provide details in Section A.3.4 of the supplementary material. Denote $\bar{\mathbb{H}}_m(\beta^{(m)})=\mathbb{E}[\mathbb{H}_m(\beta^{(m)})]$, $\bar{\mathbb{H}}_m=\bar{\mathbb{H}}_m(\beta_0^{(m)})$, $\bar{\Theta}_m=\{\bar{\theta}_{m,jk}\}_{p\times p}=\bar{\mathbb{H}}_m^{-1}$, and $s_1=\max_{m\in[M],j\in[p]}|\{k:\bar{\theta}_{m,jk}\neq0\}|$. Then, in analogy to Theorem 1, one can obtain

$$\|\widehat{\mu}_{\mathrm{L\&B}}-\mu_0\|_1+\lambda_g\|\widehat{\alpha}_{\mathrm{L\&B}}^{(\bullet)}-\alpha_0^{(\bullet)}\|_{2,1}\tag{5}$$
$$=O_P\big(s_0\{(\log p+M)/N\}^{1/2}+Bs_0(s_0+s_1)M\log p/N\big),\tag{6}$$

where $B$ is as defined in Condition 2. Compared with the error rates of SHIR presented in Theorem 1, $\{\widehat{\mu}_{\mathrm{L\&B}},\widehat{\alpha}_{\mathrm{L\&B}}^{(\bullet)}\}$ shares the same first term, $s_0\{(\log p+M)/N\}^{1/2}$, representing the error of an individual-level empirical process. However, its second term, incurred by data aggregation, can be larger than that of SHIR when $s_1=\omega(s_0)$, which can happen under the complex designs encountered in practice.

In addition, SHIR can be more efficient than the debiasing-based strategy even when the impact of the additional error term, which depends on $s_1$ in (6), is asymptotically negligible. Consider the setting in which all the $\beta^{(m)}$'s are the same, that is, $\beta^{(m)}=\beta$, and $p$ is moderate or small so that regularization is unnecessary and the maximum likelihood estimator (MLE) for $\beta$ is feasible and asymptotically Gaussian. In this case, SHIR can be viewed as the inverse-variance weighted estimator with asymptotic variance $\Sigma_{\mathrm{SHIR}}=\{\sum_{m=1}^{M}n_m\bar{\Theta}_m^{-1}\}^{-1}$, while the debiasing-based approach outputs an estimator with variance $\Sigma_{\mathrm{L\&B}}=M^{-2}\sum_{m=1}^{M}n_m^{-1}\bar{\Theta}_m$. It is not hard to show that $\Sigma_{\mathrm{SHIR}}\preceq\Sigma_{\mathrm{L\&B}}$, where equality holds only if all the $\bar{\Theta}_m$'s are proportional to one another. Thus, SHIR is strictly more efficient than the debiasing-based approach in the low-dimensional setting with heterogeneous $\bar{\Theta}_m$, which commonly arises in meta-analysis since the distributions of the $\mathbf{X}^{(m)}$'s are typically heterogeneous across the local sites. In the high-dimensional setting, SHIR is similarly expected to benefit from the inverse-variance weighting construction, and our simulation results in Section 5 support this point.
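
A quick numerical illustration of the ordering $\Sigma_{\mathrm{SHIR}}\preceq\Sigma_{\mathrm{L\&B}}$, a sketch with arbitrarily generated matrices playing the role of $\bar{\Theta}_m$ rather than quantities from any real analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 4, 3
n = np.array([300, 400, 500, 600])

# heterogeneous per-site matrices standing in for Theta_bar_m (inverse information)
Theta = []
for m in range(M):
    A = rng.normal(size=(d, d))
    Theta.append(A @ A.T + np.eye(d))

Sigma_shir = np.linalg.inv(sum(n[m] * np.linalg.inv(Theta[m]) for m in range(M)))
Sigma_lb = sum(Theta[m] / n[m] for m in range(M)) / M**2

# Sigma_lb - Sigma_shir should be positive semidefinite (eigenvalues >= 0 up to
# floating-point error), reflecting the efficiency of inverse-variance weighting.
print(np.linalg.eigvalsh(Sigma_lb - Sigma_shir))
```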

4.5. Sparsistency

In this section, we present theoretical results concerning the variable selection consistency of SHIR. We begin with some additional sufficient conditions for the sparsistency result.

Condition 5.

For all $\delta_2=\Theta\{(s_0M\log p/N)^{1/2}\}$ and $\beta^{(m)}$ satisfying $\beta^{(m)}\in\mathcal{B}_{\delta_2}(\beta_0^{(m)})$, there exists $C_{\min}=\Theta(1)$ such that $\Lambda_{\min}\{\mathbb{H}_{m,\mathcal{S}_0}(\beta^{(m)})\}>C_{\min}$, where $\mathbb{H}_{m,\mathcal{S}_0}(\beta^{(m)})$ denotes the submatrix of $\mathbb{H}_m(\beta^{(m)})$ with rows and columns corresponding to $\mathcal{S}_0$.

Condition 6.

For all $\delta_3=\Theta\{(s_0M\log p/N)^{1/2}\}$ and $\beta^{(\bullet)}=(\beta^{(1)\top},\ldots,\beta^{(M)\top})^{\top}$ satisfying $\beta^{(m)}\in\mathcal{B}_{\delta_3}(\beta_0^{(m)})$, the weighted design matrix $\mathbb{W}(\beta^{(\bullet)})$ satisfies the irrepresentable condition $\mathcal{C}_{\mathrm{Irrep}}$ on $\mathcal{S}_{\mathrm{full}}$ with parameter $\epsilon>0$, where $\mathbb{W}(\beta^{(\bullet)})$ is defined in Section A.2 and $\mathcal{C}_{\mathrm{Irrep}}$ is given in Definition A2 of the supplementary material.

Condition 7.

Let $v=\min\{\min_{j\in\mathcal{S}_\mu}|\mu_{0j}|,\ M^{-1/2}\min_{j\in\mathcal{S}_\alpha}\|\alpha_{0j}\|_2\}$. For the $\epsilon$ defined in Condition 6, $[\{s_0(\log p+M)/N\}^{1/2}+Bs_0^{3/2}M\log p/N]/(v\epsilon)\to0$ as $N\to\infty$.

Remark 8.

Conditions 5–7 are sparsistency assumptions similar to those of Zhao and Yu (2006) and Nardi et al. (2008). Condition 5 requires the eigenvalues of the covariance matrix of the weighted design corresponding to $\mathcal{S}_0$ to be bounded away from zero, so that its inverse behaves well. Condition 6 adapts the commonly used irrepresentable condition (Zhao and Yu 2006) to our mixture penalty setting. Roughly speaking, it requires that the weighted design corresponding to $\mathcal{S}_{\mathrm{full}}^c$ cannot be represented well by the weighted design corresponding to $\mathcal{S}_{\mathrm{full}}$. Compared to Nardi et al. (2008), $\mathcal{C}_{\mathrm{Irrep}}$ is less intuitive but essentially weaker. We justify this condition under several common correlation structures and compare it with Zhao and Yu (2006) in Section A.2 of the supplementary material. Condition 7 assumes that the minimum magnitude of the coefficients is large enough for the nonzero coefficients to be recognizable. It imposes an essentially weaker assumption on the minimum magnitude than the local LASSO (Zhao and Yu 2006), because we leverage the group structure of the $\beta_0^{(m)}$'s to improve the efficiency of variable selection.

Theorem 3.

(Sparsistency) Let $\widehat{\mathcal{S}}_\mu=\{j:\widehat{\mu}_{\mathrm{SHIR},j}\neq0\}$ and $\widehat{\mathcal{S}}_\alpha=\{j:\|\widehat{\alpha}_{\mathrm{SHIR},j}\|_2\neq0\}$. Denote the events $\mathcal{O}_\mu=\{\widehat{\mathcal{S}}_\mu=\mathcal{S}_\mu\}$ and $\mathcal{O}_\alpha=\{\widehat{\mathcal{S}}_\alpha=\mathcal{S}_\alpha\}$. Under Conditions 1–7 and assuming that

$$\lambda=o(v/s_0^{1/2})\quad\text{and}\quad\lambda=\epsilon^{-1}\,\omega\big(\{(\log p+M)/N\}^{1/2}+Bs_0M\log p/N\big),$$

we have $P(\mathcal{O}_\mu\cap\mathcal{O}_\alpha)\to1$ as $N\to\infty$.

Theorem 3 establishes the sparsistency of SHIR. When $s_0=o(\{N(\log p+M)\}^{1/2}/[BM\log p])$, Condition 7 reduces to $v\epsilon=\omega(\{s_0(\log p+M)/N\}^{1/2})$, the corresponding sparsistency assumption for the IPDpool estimator. In contrast, a similar condition, which could be as strong as $v\epsilon=\omega\{(s_0M\log p/N)^{1/2}\}$, is required for the local LASSO estimator (Zhao and Yu 2006). Compared with the local estimator, our integrative analysis procedure can therefore detect smaller signals under suitable sparsity assumptions. In this sense, the structure of $\beta_0^{(\bullet)}$ helps us improve selection efficiency over the local LASSO estimator. Different from existing work, we need to carefully address the mixture penalty $\rho$ and the aggregation noise of SHIR, which introduce technical difficulties into our theoretical analysis.

In both Theorems 2 and 3, we allow $M$, the number of studies, to diverge while preserving the theoretical properties. The growth rate of $M$ is allowed to be

$$M=\min\big(o\{(N/\log p)^{1/2}/(Bs_0)\},\ o\{N/(Bs_0\log p)^2\}\big)$$

for the equivalence result in Theorem 2 and

$$M=\min\big(o\{N\epsilon v/(Bs_0^{3/2}\log p)\},\ o\{N(\epsilon v)^2/s_0\}\big)$$

for the sparsistency result in Theorem 3.

5. Simulation Study

We present simulation results in this section to evaluate the performance of our proposed SHIR estimator and compare it with several other approaches. The simulation code is available at https://github.com/moleibobliu/SHIR. We let $M\in\{4,8\}$ and $p\in\{100,800,1500\}$ and set $n_m=n=400$ for each $m$. For each configuration, we summarize results based on 200 simulated datasets. We consider five data-generating mechanisms:

  1. Sparse precision and correctly specified model (strong and sparse signal): Across all studies, let $\mathcal{S}_\mu=\{1,2,\ldots,6\}$ for $\mu$, $\mathcal{S}_\alpha=\{3,4,\ldots,8\}$ for $\alpha$, $\mathcal{S}=\mathcal{S}_\mu\cup\mathcal{S}_\alpha$ and $\mathcal{S}^c=[p]\setminus\mathcal{S}$. For each $m\in[M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance $\Sigma^{(m)}$, where $\Sigma^{(m)}_{\mathcal{S}^c\mathcal{S}^c}=\Sigma_{p-8}(r_m)$, $\Sigma^{(m)}_{\mathcal{S}^c\mathcal{S}}=\Sigma_{p-8}(r_m)\Gamma_{p-8,8}(r_m,15)$, $\Sigma^{(m)}_{\mathcal{S}\mathcal{S}}=\mathbf{I}_8+\Gamma^{\top}_{p-8,8}(r_m,15)\Sigma_{p-8}(r_m)\Gamma_{p-8,8}(r_m,15)$, $\mathbf{I}_q$ denotes the $q\times q$ identity matrix, $\Sigma_q(r)$ denotes the $q\times q$ AR(1) correlation matrix with correlation coefficient $r$, $\Gamma_{q_1,q_2}(r,s_1)$ denotes the $q_1\times q_2$ matrix with each of its columns having $s_1$ randomly picked entries set to $r$ or $-r$ at random and the remaining entries set to 0, and $r_m=0.4(m-1)/M+0.15$. Given $X^{(m)}$, we generate $Y^{(m)}$ from the logistic model $P(Y^{(m)}=1\mid X^{(m)})=\mathrm{expit}\{X^{(m)\top}_{\mathcal{S}_\mu}\mu_{\mathcal{S}_\mu}+X^{(m)\top}_{\mathcal{S}_\alpha}\alpha^{(m)}_{\mathcal{S}_\alpha}\}$ with $\mu_{\mathcal{S}_\mu}=0.5\cdot(1,1,1,1,1,1)^{\top}$ and $\alpha^{(m)}_{\mathcal{S}_\alpha}=0.35(-1)^m(1,1,1,1,1,1)^{\top}$.

  2. Sparse precision and correctly specified model (weak and sparse signal): The same data-generating mechanism as in (i), but with relatively weak signals $\mu_{\mathcal{S}_\mu}=0.2\cdot(1,1,1,1,1,1)^{\top}$ and $\alpha^{(m)}_{\mathcal{S}_\alpha}=0.15(-1)^m(1,1,1,1,1,1)^{\top}$.

  3. Sparse precision and correctly specified model (strong and dense signal): The same mechanism as in (i), but with denser supports, $\mathcal{S}_\mu=\{1,2,\ldots,18\}$ and $\mathcal{S}_\alpha=\{7,8,\ldots,24\}$, and more heterogeneous coefficients across the sites (see their specific values in Section A.5 of the supplementary material).

  4. Sparse precision and correctly specified model (weak and dense signal): The same mechanism as in (iii), but with weaker signals (see Section A.5 of the supplementary material).

  5. Dense precision and misspecified model: Let $\mathcal{S}_1=\{1,2,\ldots,5\}$, $\mathcal{S}_2=\{6,\ldots,50\}$, and $\mathcal{S}_3=[p]\setminus(\mathcal{S}_1\cup\mathcal{S}_2)$. For each $m\in[M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance matrix $\Sigma^{(m)}$, where $\Sigma^{(m)}_{(\mathcal{S}_2\cup\mathcal{S}_3)(\mathcal{S}_2\cup\mathcal{S}_3)}=\mathrm{bdiag}\{\Sigma_{45}(r_m),\Sigma_{p-50}(r_m)\}$, $\Sigma^{(m)}_{\mathcal{S}_3\mathcal{S}_1}=0$, $\Sigma^{(m)}_{\mathcal{S}_2\mathcal{S}_1}=\Sigma_{45}(r_m)\Gamma_{45,5}(r_m,45)$, and $\Sigma^{(m)}_{\mathcal{S}_1\mathcal{S}_1}=\mathbf{I}_5+\Gamma^{\top}_{45,5}(r_m,45)\Sigma_{45}(r_m)\Gamma_{45,5}(r_m,45)$. Given $X^{(m)}$, we generate $Y^{(m)}$ from a logistic model with $P(Y^{(m)}=1\mid X^{(m)})=\mathrm{expit}\big\{\sum_{j=1}^{5}\{0.25+0.15(-1)^m\}\{X_j^{(m)}+0.2(X_j^{(m)})^3\}+0.1\sum_{j=1}^{4}X_j^{(m)}X_{j+1}^{(m)}\big\}$.

Across all settings, the distributions of $X^{(m)}$ and the model parameters of $Y^{(m)}\mid X^{(m)}$ differ across the $M$ sites, mimicking heterogeneity in both the covariates and the models. The heterogeneity of $X^{(m)}$ is driven by the study-specific correlation coefficient $r_m$ in its covariance matrix $\Sigma^{(m)}$. Under Settings (i)–(iv), the fitted logistic loss corresponds to the likelihood under a correctly specified model, with the support of $\mu$ and that of $\alpha^{(m)}$ overlapping but not identical. Under Setting (v), the fitted loss corresponds to a misspecified model, but the true target parameter $\beta^{(m)}$ remains approximately sparse, with only the first 5 elements being relatively large, 45 close to zero, and the remaining exactly zero. For each $j\in\mathcal{S}$, there are 15 nonzero coefficients on average in the $j$th column of the precision matrix $\Theta_m$ under Settings (i)–(iv), and 45 nonzero coefficients under Setting (v). Thus, Settings (i)–(iv) simulate the scenario of a sparse precision matrix on the active set, and Setting (v) simulates a relatively dense precision matrix. The two design building blocks $\Sigma_q(r)$ and $\Gamma_{q_1,q_2}(r,s_1)$ used above can be generated as sketched below.
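
For illustration, a short Python sketch (the helper names are ours) of generating these two building blocks:

```python
import numpy as np

def ar1_corr(q, r):
    """Sigma_q(r): q x q AR(1) correlation matrix with (i, j) entry r^|i-j|."""
    idx = np.arange(q)
    return r ** np.abs(idx[:, None] - idx[None, :])

def gamma_matrix(q1, q2, r, s1, rng):
    """Gamma_{q1,q2}(r, s1): each column has s1 randomly placed entries equal to +r or -r."""
    G = np.zeros((q1, q2))
    for j in range(q2):
        rows = rng.choice(q1, size=s1, replace=False)
        G[rows, j] = rng.choice([-r, r], size=s1)
    return G

rng = np.random.default_rng(0)
M = 4
r = [0.4 * (m - 1) / M + 0.15 for m in range(1, M + 1)]  # study-specific correlations
print(ar1_corr(5, r[0]))
print(gamma_matrix(6, 3, r[0], s1=2, rng=rng))
```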

For each simulated dataset, we obtain the SHIR estimator as well as the following alternative estimators: (a) the IPDpool estimator $\widehat{\beta}_{\mathrm{IPDpool}}^{(\bullet)}=\operatorname{argmin}_{\beta^{(\bullet)}}\widehat{\mathcal{Q}}(\beta^{(\bullet)})$; (b) the SMA estimator (He et al. 2016), following the sure independence screening procedure (Fan and Lv 2008) that reduces the dimension to $n/(3\log n)$, as recommended by He et al. (2016); and (c) the debiasing-based estimator $\widehat{\beta}_{\mathrm{L\&B}}^{(\bullet)}$ introduced in Section 4.4, denoted by DebiasL&B. For $\widehat{\beta}_{\mathrm{L\&B}}^{(\bullet)}$, we used soft thresholding to be consistent with the penalty used by IPDpool, SMA and SHIR. We used the BIC to choose the tuning parameters for all methods.

In Figures 1 and 2, we present the relative average absolute estimation error (rAEE), based on $\|\beta^{(\bullet)}-\beta_0^{(\bullet)}\|_1$, and the relative prediction error (rPE), based on $\|\mathbf{X}(\beta^{(\bullet)}-\beta_0^{(\bullet)})\|_2$, for each estimator relative to the IPDpool estimator. Consistent with the theoretical equivalence results, the SHIR estimator attains estimation and prediction accuracy very close to those of the idealized IPDpool estimator, with rPE and rAEE around 1.03 under Setting (i), 1.02 under (ii), 1.04 under (iii), 1.03 under (iv), and 1.07 under (v). The SHIR estimator is substantially more efficient than SMA under all settings, with about a 45% reduction in both AEE and PE on average. This can be attributed to the improved performance of the local LASSO estimator $\widehat{\beta}_{\mathrm{LASSO}}^{(m)}$ over the MLE $\breve{\beta}^{(m)}$ for sparse models. The superior performance is more pronounced for large $p$, such as 800 and 1500, because the screening procedure does not work well in choosing the active set, especially in the presence of correlations among the covariates. Compared with DebiasL&B, SHIR also demonstrates a gain in efficiency. Specifically, relative to SHIR, DebiasL&B has 15%–29% higher AEE and 18%–42% higher PE across the five settings. This is consistent with our theoretical results in Section 4.4 that SHIR has smaller error than DebiasL&B due to the heterogeneous Hessians and aggregation errors. In addition, compared to Settings (i)–(iv), the excess error of DebiasL&B is larger in Setting (v), where the inverse Hessian $\bar{\Theta}_m$ is relatively dense. This is also consistent with the conclusions in Section 4.4.

Figure 1.

The relative average absolute estimation error (AEE) of IPDpool (IPD), SHIR, DebiasL&B (Debias), and SMA compared to those of IPDpool under different $M\in\{4,8\}$, $p\in\{100,800,1500\}$ and data-generating mechanisms (i)–(v) introduced in Section 5.

Figure 2.

The relative prediction error (PE) of IPDpool (IPD), SHIR, DebiasL&B (Debias), and SMA compared to those of IPDpool under different $M\in\{4,8\}$, $p\in\{100,800,1500\}$ and data-generating mechanisms (i)–(v) introduced in Section 5.

In Figure 3, we present the average number of misclassifications on the support of $\beta^{(\bullet)}$, that is, $\sum_{j=1}^{p}|I(\widehat{\beta}_j=0)-I(\beta_{0,j}=0)|$ for $\widehat{\beta}$ obtained via the different methods under Settings (i)–(iv), where the model for $Y$ is correctly specified. SMA performs poorly, with larger misclassification numbers under nearly all settings, especially for $p=800,1500$ and dense signals. Both IPDpool and SHIR have good support recovery performance, with misclassification numbers below 2.5 under all settings with sparse signals and below 7.5 under those with dense signals. These two methods attain similar misclassification numbers, with absolute differences less than 0.8 across all settings. Compared to IPDpool and SHIR, DebiasL&B performs significantly worse in all settings with $p\in\{800,1500\}$. For weak signals, $M=4$ and $p\in\{800,1500\}$, the misclassification numbers of DebiasL&B are about two to four times those of IPDpool and SHIR. For strong signals or $M=8$, the gap between DebiasL&B and SHIR is still visible, though somewhat smaller. For example, under Setting (i) with $M=8$, DebiasL&B has about 60% more misclassifications than SHIR when $p=800$, and about 110% more when $p=1500$ on average. In Figures A1 and A2 of the supplementary material, we present the average true positive rate (TPR) and false discovery rate (FDR) for recovering the support of $\beta^{(\bullet)}$. When $p=100$, DebiasL&B tends to have a smaller FDR than SHIR, but this is achieved at the expense of a substantially lower TPR. On the other hand, when $p$ is larger ($p\in\{800,1500\}$), SHIR attains a lower FDR than DebiasL&B while attaining a higher or comparable TPR. In summary, SHIR achieves performance similar to IPDpool and better than DebiasL&B in support recovery.

Figure 3.

The average number of misclassifications on $\{I(\beta_j\neq0), j=1,\ldots,p\}$ based on IPDpool (IPD), SHIR, DebiasL&B (Debias), and SMA under different $M\in\{4,8\}$, $p\in\{100,800,1500\}$ and data-generating mechanisms (i)–(iv) introduced in Section 5.

6. Application to EHR Phenotyping in Multiple Disease Cohorts

Linking EHR data with biorepositories containing “-omics” information has expanded the opportunities for biomedical research (Kho et al. 2011). With the growing availability of these high-dimensional data, the bottleneck in clinical research has shifted from a paucity of biologic data to a paucity of high-quality phenotypic data. Accurately and efficiently annotating patients with disease characteristics among millions of individuals is a critical step in fulfilling the promise of using EHR data for precision medicine. Novel machine learning-based phenotyping methods leveraging a large number of predictive features have improved the accuracy and efficiency of existing phenotyping methods (Liao et al. 2015; Yu et al. 2015).

While the portability of phenotyping algorithms across multiple patient cohorts is of great interest, existing phenotyping algorithms are often developed and evaluated for a specific patient population. To investigate the portability issue and develop EHR phenotyping algorithms for CAD that are useful for multiple cohorts, Liao et al. (2015) developed a CAD algorithm using a cohort of rheumatoid arthritis (RA) patients and applied the algorithm to other disease cohorts using EHR data from the Partners HealthCare System. Here, we performed integrative analysis of multiple EHR disease cohorts to jointly develop algorithms for classifying CAD status in four disease cohorts: type 2 diabetes mellitus (DM), inflammatory bowel disease (IBD), multiple sclerosis (MS), and RA. Under the DataSHIELD constraint, our proposed SHIR algorithm lets the data determine whether a single CAD phenotyping algorithm can perform well across the four disease cohorts or whether disease-specific algorithms are needed.

For algorithm training, clinical investigators manually curated gold-standard labels on CAD status, used as the response $Y$, for $n_1=172$ DM patients, $n_2=230$ IBD patients, $n_3=105$ MS patients, and $n_4=760$ RA patients. There are a total of $p=533$ candidate features, including both codified features and narrative features extracted via natural language processing (NLP) (Zeng et al. 2006), as well as their two-way interactions. Examples of codified features include demographic information, lab results, medication prescriptions, and counts of International Classification of Diseases (ICD) codes and Current Procedural Terminology (CPT) codes. Since patients may not have certain lab measurements and this missingness is highly informative, we also create missing indicators for the lab measurements as additional features. Examples of NLP terms include mentions of CAD, current smoking (CSMO), nonsmoking (NSMO), and CAD-related procedures. Since count variables such as the total number of CAD ICD codes are zero-inflated and skewed, we take the $\log(x+1)$ transformation and include $I(x>0)$ as an additional feature for each count variable $x$.
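
A minimal sketch of this feature construction (the column names and the simple fill-in rule for labs are illustrative assumptions, not the study's exact pipeline):

```python
import numpy as np
import pandas as pd

def prepare_features(df, count_cols, lab_cols):
    """Build log(x + 1) transforms and I(x > 0) indicators for zero-inflated counts,
    plus missingness indicators for lab measurements."""
    out = pd.DataFrame(index=df.index)
    for c in count_cols:
        out[f"log1p_{c}"] = np.log1p(df[c])
        out[f"{c}_gt0"] = (df[c] > 0).astype(int)
    for c in lab_cols:
        out[f"{c}_missing"] = df[c].isna().astype(int)
        out[c] = df[c].fillna(0.0)  # simple fill-in, for the sketch only
    return out
```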

For each cohort, we randomly select 50% of the observations to form the training set for developing the CAD algorithms and use the remaining 50% for validation. We trained CAD algorithms based on SHIR, DebiasL&B, and SMA. Since the true model parameters are unknown, we evaluate the performance of the different methods based on the prediction performance of the trained algorithms on the validation set. We consider several standard accuracy measures, including the area under the receiver operating characteristic curve (AUC), the Brier score defined as the mean squared residual on the validation data, and the F-score at threshold values chosen to attain a false-positive rate of 5% (F5%) and 10% (F10%), where the F-score is defined as the harmonic mean of the sensitivity and the positive predictive value. The standard errors of the estimated prediction performance measures are obtained by bootstrapping the validation data. We only report results based on tuning parameters selected with BIC, as in the simulation studies, but note that the results obtained with AIC are largely similar in terms of prediction performance. Furthermore, to verify the improvement in performance from combining the four datasets, we include the LASSO estimator fit to each local dataset (Local) as a comparison.

In Table 1, we present the estimated coefficients for variables that received nonzero coefficients from at least one of the included methods. Interestingly, all integrative analysis methods set all heterogeneous coefficients to zero, suggesting that a single CAD algorithm can be used across all cohorts, although different intercepts were used for the different disease cohorts. The magnitudes of the coefficients from SHIR largely agree with the published algorithm, with the most important features being NLP mentions and ICD codes for CAD, as well as the total number of ICD codes, which serves as a measure of healthcare utilization. SMA set all variables to zero except for age, nonsmoker status, and the NLP mentions and ICD codes for CAD, while DebiasL&B has a support more similar to that of SHIR.

Table 1.

Detected variables and magnitudes of their fitted coefficients for homogeneous effect μ.

Variable DebiasL&B SHIR SMA
Prescription count of statin 0.14 0.07 0
Age 0.09 0.26 0.28
Total ICD counts −0.38 −0.75 0
NLP count of CAD 0.97 1.34 0.81
NLP count of CAD procedure related concepts 0 0.02 0
NLP count of nonsmoker −0.07 −0.25 −0.42
NLP count of nonsmoker > 0 −0.53 0 0
NLP count of current-smoker 0 −0.03 0
NLP count of CAD related diagnosis or procedure ≥ 1 0.06 0.05 0
ICD count for CAD 1.00 0.67 0.35
CPT count for stent or CABG 0 0.05 0
CPT count for echo 0 −0.10 0
ICD count for CAD × CPT count for echo 0 −0.04 0
NLP count of non-smoker × Oncall 0.09 0 0
NLP count of CAD × NLP count of possible-smoker 0 −0.02 0

NOTE: A×B denotes the interaction term of variables A and B. The log(x+1) transformation is taken on the count data and the covariates are normalized.

The point estimates, along with their 95% bootstrap confidence intervals, of the accuracy measures are presented in Figure 4. The results suggest that SHIR has the best performance across all methods, on nearly all datasets and across all measures. Among the integrative methods, SMA and DebiasL&B performed much worse than SHIR on all accuracy measures. For example, the AUCs with their 95% confidence intervals of the CAD algorithm for the RA cohort trained via SHIR, SMA, and DebiasL&B are 0.93 (0.90, 0.95), 0.88 (0.84, 0.92), and 0.86 (0.82, 0.90), respectively. Compared to the local estimator, SHIR also performs substantially better. For example, the AUCs of SHIR and Local for the IBD cohort are 0.93 (0.88, 0.97) and 0.90 (0.84, 0.95), respectively. The difference between the integrative procedures and the local estimator is more pronounced for the DM cohort, with AUC around 0.95 for SHIR and 0.90 for the local estimator trained using DM data only. The local estimator fails to produce an informative algorithm for the MS cohort due to the small size of its training set. These results again demonstrate the power of borrowing information across studies via integrative analysis.

Figure 4.

The mean and 95% bootstrap confidence interval of AUC, Brier score, F5% and F10% of DebiasL&B, Local, SHIR and SMA on the validation data from the four studies.

7. Discussion

In this article, we proposed a novel approach, SHIR, for integrative analysis of high-dimensional data under the DataSHIELD framework, where only summary statistics are allowed to be transferred from the local sites to the central node so as to protect the individual-level data. As we demonstrated via both theoretical analyses and numerical studies, the SHIR estimator is considerably more efficient than estimators obtained from the debiasing-based strategies considered in the literature (Lee et al. 2017; Battey et al. 2018). Moreover, our method accommodates heterogeneity in the design matrices as well as in the coefficients of the local sites, which is not adequately handled under the ultra high-dimensional regime in the existing literature. Our approach solves the LASSO problem only once at each local site, without requiring the computation of $\widehat{\Theta}_m$ or any debiasing. Note that SHIR aims at $\ell_1/\ell_2$-consistent estimation and is not asymptotically unbiased. Consequently, it cannot be directly used for hypothesis testing or confidence interval construction; see, for example, Caner and Kock (2018a,b). Future work lies in developing statistical approaches for such purposes under DataSHIELD, high dimensionality, and heterogeneity. In addition, the sparsistency of our estimator relies on the irrepresentable condition (Condition 6), which has been commonly used in the literature (see, e.g., Yuan and Lin 2006; Nardi et al. 2008), but its rigorous verification for random designs or nonlinear models is technically highly challenging. To achieve variable selection consistency without such a condition, one may use a nonconcave (group) sparse penalty, such as the group adaptive lasso (Wang and Leng 2008) or the group bridge (Zhou and Zhu 2010), in our framework.

For the choice of penalty in the current article, we focus primarily on the mixture penalty $\rho(\beta^{(\bullet)})=\sum_{j=2}^{p}|\mu_j|+\lambda_g\sum_{j=2}^{p}\|\alpha_j\|_2$. Nevertheless, other penalty functions, such as the group lasso (Huang and Zhang 2010) and the hierarchical lasso (Zhou and Zhu 2010), can be incorporated into our framework provided that they effectively leverage the relevant prior knowledge. Techniques similar to those used for deriving the theoretical results of SHIR with the mixture penalty can be used for other penalty functions, with some technical details varying according to the choice of $\rho(\cdot)$. See Section A.4 of the supplementary material for further justification.

For the consistency result in Theorem 1, SHIR requires $s_0=o\{(N/(M\log p))^{1/2}\}$. Although this sparsity assumption is already weaker than those in the existing literature (e.g., Battey et al. 2018), as shown in Section 4.4, it may be strong in practical applications. For example, $(N/(M\log p))^{1/2}\approx7$ in the EHR example, which suggests that the sparsity assumption may not hold there. Nevertheless, the resulting SHIR algorithm appears to perform well in terms of out-of-sample classification accuracy. On the other hand, it is of interest to explore the possibility of relaxing such an assumption. One potential approach is to use multiple rounds of communication, as in Fan, Guo, and Wang (2019). Detailed analysis of this approach warrants future research.

Supplementary Material


Funding

The research of Yin Xia was supported in part by NSFC Grants 12022103, 11771094, and 11690013. The research of Tianxi Cai and Molei Liu was partially supported by the Translational Data Science Center for a Learning Health System at Harvard Medical School and the Harvard T.H. Chan School of Public Health.

Footnotes

Supplementary Material

In the Supplement, we provide justifications for Conditions 1 and 6, present detailed proofs of Theorems 1–3, outline theoretical analyses of SHIR for various penalty functions, and present additional simulation results.

Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.

1. Commonly used summary statistics include the locally fitted regression coefficient and its Hessian matrix in low-dimensional parametric regression models (see, e.g., Duan et al. 2019, 2020).

References

  1. Akaike H. (1974), “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, 19, 716–723.
  2. Battey H, Fan J, Liu H, Lu J, and Zhu Z. (2018), “Distributed Testing and Estimation Under Sparse High Dimensional Models,” The Annals of Statistics, 46, 1352–1382.
  3. Bhat HS, and Kumar N. (2010), On the Derivation of the Bayesian Information Criterion, School of Natural Sciences, University of California.
  4. Bühlmann P, and Van De Geer S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer Science & Business Media.
  5. Caner M, and Kock AB (2018a), “Asymptotically Honest Confidence Regions for High Dimensional Parameters by the Desparsified Conservative Lasso,” Journal of Econometrics, 203, 143–168.
  6. Caner M, and Kock AB (2018b), “High Dimensional Linear GMM,” arXiv preprint arXiv:1811.08779.
  7. Chen X, and Xie M-g. (2014), “A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data,” Statistica Sinica, 24, 1655–1684.
  8. Chen Y, Dong G, Han J, Pei J, Wah BW, and Wang J. (2006), “Regression Cubes With Lossless Compression and Aggregation,” IEEE Transactions on Knowledge and Data Engineering, 18, 1585–1599.
  9. Cheng X, Lu W, and Liu M. (2015), “Identification of Homogeneous and Heterogeneous Variables in Pooled Cohort Studies,” Biometrics, 71, 397–403.
  10. Doiron D, Burton P, Marcon Y, Gaye A, Wolffenbuttel BH, Perola M, Stolk RP, Foco L, Minelli C, Waldenberger M, Holle R, Kvaløy K, Hillege HL, Tassé A-M, Ferretti V, and Fortier I. (2013), “Data Harmonization and Federated Analysis of Population-Based Studies: The BioSHaRE Project,” Emerging Themes in Epidemiology, 10, 12.
  11. Duan R, Boland MR, Liu Z, Liu Y, Chang HH, Xu H, Chu H, Schmid CH, Forrest CB, Holmes JH, Schuemie MJ, Berlin JA, Moore JH, and Chen Y. (2020), “Learning From Electronic Health Records Across Multiple Sites: A Communication-Efficient and Privacy-Preserving Distributed Algorithm,” Journal of the American Medical Informatics Association, 27, 376–385.
  12. Duan R, Boland MR, Moore JH, and Chen Y. (2019), “ODAL: A One-Shot Distributed Algorithm to Perform Logistic Regressions on Electronic Health Records Data From Multiple Clinical Sites,” in Pacific Symposium on Biocomputing, eds. Altman RB, Dunker AK, Hunter L, Ritchie MD, Murray T, and Klein TE, Kohala Coast, HI: World Scientific, pp. 30–41.
  13. Fan J, Guo Y, and Wang K. (2019), “Communication-Efficient Accurate Statistical Estimation,” arXiv preprint arXiv:1906.04870.
  14. Fan J, and Lv J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, Series B, 70, 849–911.
  15. Foster DP, and George EI (1994), “The Risk Inflation Criterion for Multiple Regression,” The Annals of Statistics, 22, 1947–1975.
  16. Friedman J, Hastie T, and Tibshirani R. (2010), “A Note on the Group Lasso and a Sparse Group Lasso,” arXiv preprint arXiv:1001.0736.
  17. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, Minion J, Boyd AW, Newby CJ, Nuotio M-L, Wilson R, Butters O, Murtagh B, Demir I, Doiron D, Giepmans L, Wallace SE, Budin-Ljøsne I, Oliver Schmidt C, Boffetta P, Boniol M, Bota M, Carter KW, deKlerk N, Dibben C, Francis RW, Hiekkalinna T, Hveem K, Kvaløy K, Millar S, Perry IJ, Peters A, Phillips CM, Popham F, Raab G, Reischl E, Sheehan N, Waldenberger M, Perola M, van den Heuvel E, Macleod J, Knoppers BM, Stolk RP, Fortier I, Harris JR, Woffenbuttel BH, Murtagh MJ, Ferretti V, and Burton PR (2014), “DataSHIELD: Taking the Analysis to the Data, Not the Data to the Analysis,” International Journal of Epidemiology, 43, 1929–1944.
  18. Han J, and Liu Q. (2016), “Bootstrap Model Aggregation for Distributed Statistical Learning,” in Advances in Neural Information Processing Systems, eds. Lee D, Sugiyama M, Luxburg U, Guyon I, and Garnett R, San Diego, CA: Curran Associates, Inc., pp. 1795–1803.
  19. He Q, Zhang HH, Avery CL, and Lin D. (2016), “Sparse Meta-Analysis With High-Dimensional Data,” Biostatistics, 17, 205–220.
  20. Huang C, and Huo X. (2015), “A Distributed One-Step Estimator,” arXiv preprint arXiv:1511.01443.
  21. Huang J, and Zhang T. (2010), “The Benefit of Group Sparsity,” The Annals of Statistics, 38, 1978–2004.
  22. Janková J, and Van De Geer S. (2016), “Confidence Regions for High-Dimensional Generalized Linear Models Under Sparsity,” arXiv preprint arXiv:1610.01353.
  23. Javanmard A, and Montanari A. (2014), “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression,” The Journal of Machine Learning Research, 15, 2869–2909.
  24. Jones E, Sheehan N, Masca N, Wallace S, Murtagh M, and Burton P. (2012), “DataSHIELD – Shared Individual-Level Analysis Without Sharing the Data: A Biostatistical Perspective,” Norsk Epidemiologi, 21.
  25. Jordan MI, Lee JD, and Yang Y. (2019), “Communication-Efficient Distributed Statistical Inference,” Journal of the American Statistical Association, 114, 668–681.
  26. Kho AN, Pacheco JA, Peissig PL, Rasmussen L, Newton KM, Weston N, Crane PK, Pathak J, Chute CG, Bielinski SJ, Kullo IJ, Li R, Manolio TA, Chisholm RL, and Denny JC (2011), “Electronic Medical Records for Genetic Research: Results of the eMERGE Consortium,” Science Translational Medicine, 3, 79re1.
  27. Kim Y, Kwon S, and Choi H. (2012), “Consistent Model Selection Criteria on High Dimensions,” Journal of Machine Learning Research, 13, 1037–1057.
  28. Lee JD, Liu Q, Sun Y, and Taylor JE (2017), “Communication-Efficient Sparse Regression,” Journal of Machine Learning Research, 18, 1–30.
  29. Li W, Liu H, Yang P, and Xie W. (2016), “Supporting Regularized Logistic Regression Privately and Efficiently,” PLoS One, 11, e0156479.
  30. Liao KP, Ananthakrishnan AN, Kumar V, Xia Z, Cagan A, Gainer VS, Goryachev S, Chen P, Savova GK, Agniel D, et al. (2015), “Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease Across 3 Chronic Disease Cohorts,” PLoS One, 10, e0136651.
  31. Liu M, Xia Y, Cai T, and Cho K. (2020), “Integrative High Dimensional Multiple Testing With Heterogeneity Under Data Sharing Constraints,” arXiv preprint arXiv:2004.00816.
  32. Liu Q, and Ihler AT (2014), “Distributed Estimation, Information Loss and Exponential Families,” in Advances in Neural Information Processing Systems, pp. 1098–1106.
  33. Lounici K, Pontil M, Van De Geer S, and Tsybakov AB (2011), “Oracle Inequalities and Optimal Inference Under Group Sparsity,” The Annals of Statistics, 39, 2164–2204.
  34. Lu C-L, Wang S, Ji Z, Wu Y, Xiong L, Jiang X, and Ohno-Machado L. (2015), “WebDISCO: A Web Service for Distributed Cox Model Learning Without Patient-Level Data Sharing,” Journal of the American Medical Informatics Association, 22, 1212–1219.
  35. Maity S, Sun Y, and Banerjee M. (2019), “Communication-efficient Integrative Regression in High-dimensions,” arXiv preprint arXiv:1912.11928. [2106]
  36. Minsker S. (2019), “Distributed Statistical Estimation and Rates of Convergence in Normal Approximation,” Electronic Journal of Statistics, 13, 5213–5252. [2106] [Google Scholar]
  37. Nardi Y, Rinaldo A. (2008), “On the Asymptotic Properties of the Group Lasso Estimator for Linear Models,” Electronic Journal of Statistics, 2, 605–633. [2111,2117] [Google Scholar]
  38. Negahban SN, Ravikumar P, Wainwright MJ, Yu B. (2012), “A Unified Framework for High-dimensional Analysis of m-estimators With Decomposable Regularizers,” Statistical Science, 27, 538–557. [2109] [Google Scholar]
  39. Raskutti G, Wainwright MJ, and Yu B. (2011), “Minimax Rates of Estimation for High-dimensional Linear Regression Over lq-Balls,” IEEE Transactions on Information Theory, 57, 6976–6994. [2109] [Google Scholar]
  40. Rivasplata O. (2012), “Subgaussian Random Variables: An Expository Note,” Internet publication, PDF. [2109]
  41. Tang L, Zhou L, and Song PX-K (2016), “Method of Divide-andcombine in Regularized Generalized Linear Models for Big Data,” arXiv preprint arXiv:1611.06208. [2105] [Google Scholar]
  42. Vaiter S, Deledalle C, Peyré G, Fadili J, and Dossal C. (2012), “The Degrees of Freedom of the Group Lasso,” arXiv preprint arXiv:1205.1481. [2108]
  43. Van de Geer S, Bühlmann P, Ritov Y, Dezeure R. (2014), “On Asymptotically Optimal Confidence Regions and Tests for High-dimensional Models,” The Annals of Statistics, 42, 1166–1202. [2105,2107,2108] [Google Scholar]
  44. Van de Geer SA (2008), “High-dimensional Generalized Linear Models and the Lasso,” The Annals of Statistics, 36, 614–645. [2109] [Google Scholar]
  45. Vershynin R. (2018), High-dimensional Probability: An Introduction With Applications in Data Science, Vol. 47. Cambridge, UK: Cambridge University Press. [2108] [Google Scholar]
  46. Wang H, and Leng C. (2007), “Unified Lasso Estimation by Least Squares Approximation,” Journal of the American Statistical Association, 102, 1039–1048. [2108] [Google Scholar]
  47. Wang H, and Leng C. (2008), “A Note on Adaptive Group Lasso,”Computational Statistics & Data Analysis, 52, 5277–5286. [2118] [Google Scholar]
  48. Wang H, Li B, and Leng C. (2009), “Shrinkage Tuning Parameter Selection With a Diverging Number of Parameters,” Journal of the Royal Statistical Society, Series B, 71, 671–683. [2108] [Google Scholar]
  49. Wang J, Kolar M, Srebro N, and Zhang T. (2017), “Efficient Distributed Learning With Sparsity,” in Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 3636–3645. JMLR. org. [2106] [Google Scholar]
  50. Wang X, Peng P, and Dunson DB (2014), “Median Selection Subset Aggregation for Parallel Inference,” In Advances in Neural Information Processing Systems, pp. 2195–2203. [2106]
  51. Wolfson M, Wallace SE, Masca N, Rowe G, Sheehan NA, Ferretti V, LaFlamme P, Tobin MD, Macleod J, Little J, Fortier I, Knoppers BM, Burton PR (2010), “DataSHIELD: Resolving a Conflict in Contemporary Bioscienceperforming a Pooled Analysis of Individual-level Data Without Sharing the Data,” International Journal of Epidemiology, 39, 1372–1382. [2105] [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Wu Y, Jiang X, Kim J, and Ohno-Machado L. (2012), “Grid Binary Logistic Regression (glore): Building Shared Models Without Sharing Data,” Journal of the American Medical Informatics Association, 19, 758–764. [2105] [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, Murphy SN, Kohane IS, and Cai T. (2015), “Toward High-throughput Phenotyping: Unbiased Automated Feature Extraction and Selection From Knowledge Sources,” Journal of the American Medical Informatics Association, 22, 993–1000. [2116] [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Yuan M, and Lin Y. (2006), “Model Selection and Estimation in Regression With Grouped Variables,” Journal of the Royal Statistical Society, Series B, 68, 49–67. [2117] [Google Scholar]
  55. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, and Lazarus R. (2006), “Extracting Principal Diagnosis, Co-morbidity and Smoking Status for Asthma Research: Evaluation of a Natural Language Processing System,” BMC Medical Informatics and Decision Making, 6, Article no. 30. [2116] [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Zhang C-H, and Zhang SS (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242. [2105] [Google Scholar]
  57. Zhang Y, Li R, and Tsai C-L (2010), “Regularization Parameter Selections Via Generalized Information Criterion,” Journal of the American Statistical Association, 105, 312–323. [2108] [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Zhao P, and Yu B. (2006), “On Model Selection Consistency of Lasso,” Journal of Machine Learning Research, 7, 2541–2563. [2111] [Google Scholar]
  59. Zhou N, and Zhu J. (2010), “Group Variable Selection Via a Hierarchical Lasso and Its Oracle Property,” arXiv preprint arXiv:1006.2871. [2118]
  60. Zöller D, Lenz S, and Binder H. (2018), “Distributed Multivariable Modeling for Signature Development Under Data Protection Constraints,” arXiv preprint arXiv:1803.00422. [2105]


Supplementary Materials

Supplementary materials for this article are available online.
