Published in final edited form as: J Mach Learn Res. 2023 Jan-Dec;24:265.

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Jue Hou 1, Zijian Guo 2, Tianxi Cai 3

Abstract

Risk modeling with electronic health records (EHR) data is challenging due to the lack of direct observations of the disease outcome and the high dimensionality of the predictors. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging a small labeled dataset with annotated outcomes together with extensive unlabeled data containing outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference in the challenging setting of a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach over existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.

Keywords: generalized linear models, high dimensional inference, model mis-specification, risk prediction, semi-supervised learning

1. Introduction

Precise risk prediction is vitally important for successful clinical care. High-risk patients can be assigned to more intensive monitoring or intervention to improve outcomes. Traditionally, risk prediction models are developed based on cohort studies or registry data. Population-based disease registries, while remaining a critical source for epidemiological studies, collect information on a relatively small set of pre-specified variables and hence may limit researchers’ ability to develop comprehensive risk prediction models (Warren and Yabroff, 2015). Most clinical care is delivered in healthcare systems (Thompson et al., 2015), and electronic health records (EHR) embedded in healthcare systems accrue rich clinical data in broad patient populations. EHR systems centralize the data collected during routine patient care, including structured elements such as codes for International Classification of Diseases, medication prescriptions, and medical procedures, as well as free-text narrative documents such as physician notes and pathology reports that can be processed through natural language processing for analysis. EHR data are also often linked with biobanks, which provide additional rich molecular information to assist in developing comprehensive risk prediction models for a broad patient population.

Risk modeling with EHR data, however, is challenging for several reasons. First, precise information on the clinical outcome of interest, Y, is often embedded in free-text notes and requires manual effort to extract accurately. Readily available outcome surrogates S, such as diagnostic codes or mentions of the outcome, may be predictive of the true outcome Y but can deviate from the true label Y. Here we consider the general situation in which the vector of surrogates S consists of noisy, error-prone proxies of Y and may include non-informative surrogates. For example, using EHR data from Mass General Brigham, we found that the positive predictive value was only 0.48 for having at least 1 diagnosis code of Type II Diabetes Mellitus (T2DM) and 0.19 for having at least 1 mention of T2DM in medical notes. Directly using these EHR proxies as the true disease status to derive risk models may lead to substantial biases. On the other hand, extracting precise disease status requires manual chart review, which is not feasible at a large scale. It is thus of great interest to develop risk prediction models under a semi-supervised learning (SSL) framework using both a large unlabeled dataset of size N containing information on the predictors X along with the surrogates S and a small labeled dataset of size n with additional observations on Y curated via chart review. Throughout the paper, we impose no stringent model assumptions on the triplet (Y, X, S) while using generalized linear working models to define and estimate the risk prediction model (see Section 2).

Additional challenges arise from the high dimensionality of the predictor vector X and potential model mis-specification. Although much progress has been made in high-dimensional regression in recent years, there is a paucity of literature on high-dimensional inference under the SSL setting. Precise estimation of the high-dimensional risk model is even more challenging if the risk model is not sparse. Allowing the risk model to be dense is particularly important when X includes genomic markers, since a large number of genetic markers appear to contribute to the risk of complex traits (Frazer et al., 2009). For example, Vujkovic et al. (2020) recently identified 558 genetic variants as significantly associated with T2DM risk. An additional challenge arises when the fitted risk model is mis-specified, which occurs frequently in practice, especially in the high-dimensional setting. Model mis-specification can also cause the fitted model of Y ∣ X to be dense. There are limited methods currently available for making inference about high-dimensional risk prediction models in the SSL setting, especially under a possibly mis-specified dense model. In this paper, we fill this gap by proposing an efficient surrogate assisted SSL (SAS) prediction procedure that leverages the fully observed surrogates S to make inference about a high-dimensional risk model under such settings.

Our proposed estimation and inference procedures are as follows. For estimation, we first use the labelled data to fit a regularized imputation model with surrogates and high-dimensional covariates; then we impute the missing outcomes for the unlabeled data and fit the risk model using the imputed outcome and high-dimensional predictors. For inference, we devise a novel bias correction method, which corrects the bias due to the regularization for both imputation and estimation. Compared to existing literature, the key advantages of our proposed SAS procedure are

  1. Applicable to dense risk models Y ∣ X: we allow the working risk model for Y ∣ X to be dense as long as the working imputation model for Y ∣ (S, X) is sparse;

  2. Robustness to model mis-specification: the working models for both the risk prediction Y ∣ X and the imputation Y ∣ (S, X) can be mis-specified;

  3. Requires no assumptions on the measurement error in S as proxies of Y and allows S itself to be of high dimension;

  4. Our analysis of the Lasso with estimated inputs in the loss (see (6) and (20)) facilitates the consistency analysis for a dense model independently of the convergence rate of the consistently estimated inputs. The technique is an independent contribution to the high-dimensional statistics literature.

The sparsity assumption on the imputation model is less stringent since we anticipate that most information on Y can be well captured by the low-dimensional S, while the fitted model of Y ∣ X might be dense under possible model mis-specification. Our theory reveals that suitable use of unlabeled data may greatly relax the sparsity requirement on Y ∣ X. While most of the SSL literature emphasizes efficiency gains, our work opens a new direction of expanding estimability through SSL.

1.1. Related Literatures

Under the supervised setting where both Y and X are fully observed, much progress has been made in recent years in the area of high-dimensional inference. High-dimensional regression methods have been developed for commonly used generalized linear models under sparsity assumptions on the regression parameters (van de Geer and Bühlmann, 2009; Negahban et al., 2010; Huang and Zhang, 2012). Recently, Zhu and Bradic (2018b) studied the inference of linear combinations of coefficients under a dense linear model with a sparse precision matrix. Inference procedures have also been developed for both sparse (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014) and dense combinations of the regression parameters (Cai et al., 2019; Zhu and Bradic, 2018a). High-dimensional inference under the logistic regression model has also been studied recently (van de Geer et al., 2014; Ma et al., 2020; Guo et al., 2020).

Under the SSL setting with n ≪ N, however, there is a paucity of literature on high-dimensional inference. Although SSL can be viewed as a missing data problem, it differs from the standard missing data setting in a critical way: under the SSL setting, the missing probability tends to 1, which would violate a key assumption required in the missing data literature (e.g., Bang and Robins, 2005; Smucler et al., 2019; Chakrabortty et al., 2019). Existing work on SSL with high-dimensional covariates largely focuses on post-estimation inference on global parameters under sparse linear models, with examples including SSL estimation of the population mean (Zhang et al., 2019; Zhang and Bradic, 2021), the explained variance (Cai and Guo, 2020), and the average treatment effect (Cheng et al., 2018; Kallus and Mao, 2020). Our SAS procedure is among the first attempts to conduct semi-supervised inference on the high-dimensional coefficients and the individual prediction in a high-dimensional, dense and possibly mis-specified risk prediction model. In a concurrent work, Deng et al. (2020) studied efficient SSL estimation of high-dimensional linear models. Our work differs from theirs in at least three ways: 1) we consider the more flexible generalized linear models; 2) our setting involves the surrogates S, characterizing the imprecise data in EHR; 3) we study dense coefficients whose number of nonzero elements exceeds the number of labels. In high-dimensional regression with missing data, another line of work studied the estimation of linear models with missing or noisy covariates X (Loh and Wainwright, 2011; Belloni et al., 2017; Chandrasekher et al., 2020).

The surrogates S can be viewed conceptually as “mis-measured” proxies of the true outcome Y. Semi-supervised methods have been developed under the assumption that S depends on X only through Y, which essentially assumes an independent measurement error in S. For example, Gronsbell et al. (2019) studied the generalized linear risk prediction model using a mis-measured S. With a single S, Zhang et al. (2022) considered a high-dimensional generalized linear model for the prediction model, allowing the independence assumption to be slightly violated. Our SAS approach differs from the measurement error approach in two fundamental aspects: 1) typical measurement error approaches require S to be a single proxy outcome of the same type as Y, while our SAS approach allows a vector S of arbitrary types as long as some of its components are predictive of Y; 2) measurement error approaches impose stringent independence and model assumptions on the triplet (S, X, Y), while our SAS approach imposes neither. Violation of these two requirements may obstruct the deployment of measurement error methods or compromise their performance.

1.2. Organization of the Paper

The remainder of the paper is organized as follows. We introduce our population parameters and model assumptions in Section 2. In Section 3, we propose the SAS estimation method along with its associated inference procedures. In Section 4, we state the theoretical guarantees of the SAS procedures, whose proofs are provided in the Supplementary Materials. We also remark on the sparsity relaxation and the efficiency gain of the SSL. In Section 5, we present simulation results highlighting finite sample performance of the SAS estimators and comparisons to existing methods. In Section 6, we apply the proposed method to derive individual risk prediction for T2DM using EHR data from Mass General Brigham.

2. Settings and Notations

For the i-th observation, Y_i ∈ ℝ denotes the outcome variable, S_i ∈ ℝ^q denotes the surrogates for Y_i, and X_i ∈ ℝ^{p+1} denotes the high-dimensional covariates with the first element being the intercept. Under the SSL setting, we observe n independent and identically distributed (i.i.d.) labeled observations, ℒ = {(Y_i, X_i^⊤, S_i^⊤)^⊤, i = 1, …, n}, and N − n i.i.d. unlabeled observations, 𝒰 = {W_i = (X_i^⊤, S_i^⊤)^⊤, i = n+1, …, N}. We assume that the labeled subjects are randomly sampled by design and that the proportion of labelled samples is n/N = ρ ∈ (0, 1) with ρ → 0 as n → ∞. We focus on the high-dimensional setting where the dimensions p and q grow with n, allowing p + q to be larger than n. Motivated by our application, our main focus is on the setting where N is much larger than p, but our approach can be extended to the case where p exceeds N under specific conditions.

To predict Yi with Xi, we consider a possibly mis-specified working regression model with a known monotone and smooth link function g,

Y_i ∼ g(β^⊤X_i). (1)

We identify the target parameter as that of the most predictive working model, as measured by the pseudo log-likelihood ℓ(y, x):

β_0 = argmin_β −E[ℓ(Y_i, β^⊤X_i)],   ℓ(y, x) = yx − G(x),   G′(x) = g(x). (2)

Here we do not assume any model for the true conditional expectation E[Y_i ∣ X_i]. Our goal is to accurately estimate the high-dimensional parameter β_0, alternatively characterized by the first-order condition for (2),

E[X_i{Y_i − g(β_0^⊤X_i)}] = 0. (3)

Our procedure generally allows for a wide range of link functions; detailed requirements on g(·) and its anti-derivative G are given in Section 4. In our motivating example, Y is a binary indicator of T2DM status and g(x) = 1/(1 + e^{−x}) with G(x) = log(1 + e^x). We shall further construct confidence intervals for g(β_0^⊤x_new) for any x_new ∈ ℝ^{p+1}. The predicted outcome g(β_0^⊤x_new) can be interpreted as the maximum pseudo log-likelihood prediction under the working model g(β^⊤x_new). We make no assumption on the sparsity of β_0 relative to the number of labels n, and hence it is not feasible to perform valid supervised learning for β_0 when s_β = ‖β_0‖_0 > n.

We shall derive an efficient SSL estimate for β0 by leveraging 𝒰. To this end, we fit a working imputation model

Y_i ∼ g(γ^⊤W_i), (4)

whose limiting parameter is likewise defined as the most predictive working model

γ_0 = argmin_γ −E[ℓ(Y_i, γ^⊤W_i)],   equivalently   E[W_i{Y_i − g(γ_0^⊤W_i)}] = 0. (5)

The definition of γ_0 guarantees

E[X_i{Y_i − g(γ_0^⊤W_i)}] = 0, (6)

and hence, if we impute Y_i by Y_i^* = g(γ_0^⊤W_i), we have E[X_i{Y_i^* − g(β_0^⊤X_i)}] = 0 regardless of the adequacy of the imputation model (4) for the conditional mean E[Y_i ∣ W_i]. It is thus feasible to carry out an SSL procedure by first deriving an estimate of γ_0 using the labelled data ℒ and then regressing the imputed outcome against X_i using the whole data ℒ ∪ 𝒰. Although we do not require β_0 to be sparse or any of the fitted models to hold, we do assume that γ_0 defined in (5) is sparse. When the surrogates S are strongly predictive of the outcome, the sparsity assumption on γ_0 is reasonable since the majority of the information in Y can be captured by S.
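As a one-line verification of this claim (restating only (3) and (6)), adding and subtracting Y_i inside the expectation gives

\mathbb{E}\bigl[X_i\{Y_i^{\ast} - g(\beta_0^{\top}X_i)\}\bigr]
  = \underbrace{\mathbb{E}\bigl[X_i\{g(\gamma_0^{\top}W_i) - Y_i\}\bigr]}_{=0\ \text{by (6)}}
  + \underbrace{\mathbb{E}\bigl[X_i\{Y_i - g(\beta_0^{\top}X_i)\}\bigr]}_{=0\ \text{by (3)}} = 0,

so the target moment condition is preserved by imputing with the limiting parameter γ_0 even when both working models are misspecified.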

Notations. We focus on the setting where min{n, p + q, N} → ∞. For convenience, we shall use n → ∞ in the asymptotic analysis. For two sequences of random variables A_n and B_n, we use A_n = O_p(B_n) and A_n = o_p(B_n) to denote lim_{c→∞} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0 and lim_{c→0} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0, respectively. For two positive sequences a_n and b_n, a_n = O(b_n) or b_n ≳ a_n means that there exists C > 0 such that a_n ≤ C b_n for all n; a_n ≍ b_n if a_n = O(b_n) and b_n = O(a_n); and a_n ≪ b_n or a_n = o(b_n) if limsup_n a_n/b_n = 0. We use Z_n ⇝ N(0, 1) to denote that the sequence of random variables Z_n converges in distribution to a standard normal random variable.

3. Methodology

3.1. SAS Estimation of β0

The SAS estimation procedure for β_0 consists of two key steps: (i) fitting the imputation model to ℒ to obtain the estimate γ̂ of γ_0 defined in (5); and (ii) estimating β_0 in (3) by fitting the imputed outcome Ŷ_i = g(γ̂^⊤W_i) against X_i over ℒ ∪ 𝒰. In both steps, we devise Lasso-type estimators to deal with the high dimensionality of X. In principle, other types of variable selection methods, e.g., SCAD (Fan and Li, 2001) or the square-root Lasso (Belloni et al., 2011), may also be used. We use the Lasso as the example for its simplicity. A further discussion on the choice of regularized estimators is given in Remark 6.

In Step (i), we estimate γ0 by the L1 regularized pseudo log-likelihood estimator γ^, defined as

γ̂ = argmin_{γ∈ℝ^{p+q+1}} ℓ_imp(γ) + λ_γ‖γ_{−1}‖_1  with  λ_γ ≍ √(log(p+q)/n), (7)

where a_{−1} denotes the sub-vector of a vector a containing all coefficients except for the intercept, and

ℓ_imp(γ) = −(1/n) Σ_{i=1}^n ℓ(Y_i, γ^⊤W_i)  with ℓ(y, x) defined in (2). (8)

The imputation loss (8) corresponds to the negative log-likelihood when Y is binary, the imputation model holds, and g is the anti-logit link. With γ̂, we impute the unobserved outcomes for subjects in 𝒰 as Ŷ_i = g(γ̂^⊤W_i), for n + 1 ≤ i ≤ N.
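As a rough illustrative sketch (not the authors' implementation), Step (i) with the logistic link can be carried out with an off-the-shelf L1-penalized logistic regression; the function and variable names below (fit_imputation_model, W_lab, y_lab, lam_gamma) are hypothetical, and the mapping between λ_γ and sklearn's C reflects the (1/n)-scaled loss in (7)–(8).

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_imputation_model(W_lab, y_lab, lam_gamma):
    """L1-penalized logistic fit of Y on W = (X, S) over the n labeled samples, cf. (7)."""
    n = W_lab.shape[0]
    # sklearn's saga solver minimizes ||coef||_1 + C * sum_i(log-loss_i) with an unpenalized
    # intercept; C = 1 / (n * lam_gamma) matches (1/n)*loss + lam_gamma * ||gamma_{-1}||_1.
    model = LogisticRegression(penalty="l1", C=1.0 / (n * lam_gamma),
                               solver="saga", fit_intercept=True, max_iter=5000)
    model.fit(W_lab, y_lab)
    return np.concatenate([model.intercept_, model.coef_.ravel()])

def impute_outcomes(W_unlab, gamma_hat):
    """Imputed outcomes Y_hat = g(gamma_hat' W) with g the expit link."""
    return 1.0 / (1.0 + np.exp(-(gamma_hat[0] + W_unlab @ gamma_hat[1:])))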

In Step (ii), we estimate β_0 by β̂ = β̂(γ̂), defined as

β̂(γ̂) = argmin_{β∈ℝ^{p+1}} ℓ(β; γ̂) + λ_β‖β_{−1}‖_1  with  λ_β ≍ √(log p/N), (9)

where ℓ(β; γ̂) is the imputed pseudo log-likelihood loss:

ℓ(β; γ̂) = −(1/N){Σ_{i>n} ℓ(Ŷ_i, β^⊤X_i) + Σ_{i=1}^n ℓ(Y_i, β^⊤X_i)}  with ℓ(y, x) defined in (2). (10)

We denote the loss based on the complete-data pseudo log-likelihood of the full data by

ℓ_PL(β) = −(1/N) Σ_{i=1}^N ℓ(Y_i, β^⊤X_i), (11)

and define the gradients of the various losses (8)–(11) as

ℓ̇_imp(γ) = ∇_γ ℓ_imp(γ),   ℓ̇_PL(β) = ∇_β ℓ_PL(β),   ℓ̇(β; γ) = ∇_β ℓ(β; γ). (12)
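Continuing the sketch above, Step (ii) regresses the labeled Y and the imputed Ŷ on X over all N subjects. Because the imputed outcomes are fractional, one convenient device (an assumption of this sketch, not prescribed by the paper) is to split each observation into a weighted pseudo-success and pseudo-failure, whose weighted logistic log-likelihood coincides with the imputed loss in (10); names such as fit_risk_model and lam_beta are again illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_risk_model(X_all, y_or_yhat, lam_beta):
    """L1-penalized fit of the imputed loss (10): y_or_yhat holds Y for labeled rows
    and Y_hat = g(gamma_hat' W) for unlabeled rows."""
    N = X_all.shape[0]
    y_frac = np.asarray(y_or_yhat, dtype=float)
    # Each row contributes y*log(p) + (1-y)*log(1-p); encode it as two weighted 0/1 rows.
    X_big = np.vstack([X_all, X_all])
    y_big = np.concatenate([np.ones(N), np.zeros(N)])
    w_big = np.concatenate([y_frac, 1.0 - y_frac])
    keep = w_big > 0                      # drop zero-weight copies (binary labeled rows)
    model = LogisticRegression(penalty="l1", C=1.0 / (N * lam_beta),
                               solver="saga", fit_intercept=True, max_iter=5000)
    model.fit(X_big[keep], y_big[keep], sample_weight=w_big[keep])
    return np.concatenate([model.intercept_, model.coef_.ravel()])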

3.2. SAS Inference for Individual Prediction

Since g(·) is specified, inference on g(x_new^⊤β) follows immediately from inference on x_new^⊤β. We shall consider inference on the standardized linear prediction x_std^⊤β with the standardized covariates

x_std = x_new/‖x_new‖_2

and then scale the confidence interval back. In this way, the scaling by ‖x_new‖_2 is made explicit in the expression of the confidence interval.

The estimation error of β̂ can be decomposed into two components corresponding to the respective errors associated with (7) and (9). Specifically, we write

β̂ − β_0 = {β̄(γ̂) − β_0} + {β̂ − β̄(γ̂)}, (13)

where β̄(γ̂) is defined as the minimizer of the expected imputed loss conditional on the labeled data ℒ, that is,

β̄(γ̂) = argmin_{β∈ℝ^{p+1}} E[ℓ(β; γ̂) ∣ ℒ]. (14)

The term β̄(γ̂) − β_0 captures the error from the imputation model in (7), while the term β̂ − β̄(γ̂) captures the error from the prediction model in (9) given the imputation model parameter γ̂. As ℓ1 penalization is involved in both steps, we shall correct the regularization bias from the two sources. Following the typical one-step debiased LASSO (Zhang and Zhang, 2014), the bias β̂ − β̄(γ̂) is estimated by Θ̂ ℓ̇(β̂; γ̂), where Θ̂ is an estimator of {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1}, the inverse Hessian of ℓ(·; γ̂) at β = β_0.

The bias correction for β̄(γ̂) − β_0 requires some innovation, since we need to conduct the bias correction for a nonlinear functional β̄(·) of the LASSO estimator γ̂, which has not been studied in the literature. We identify β̄(γ̂) and β_0 by the first-order moment conditions,

β̄(γ̂):  E_{i>n}[X_i{g(β̄(γ̂)^⊤X_i) − g(γ̂^⊤W_i)}] = 0,    β_0:  E[X_i{g(β_0^⊤X_i) − Y_i}] = E[X_i{g(β_0^⊤X_i) − g(γ_0^⊤W_i)}] = 0. (15)

Here E_{i>n}[·] denotes the conditional expectation of a single copy of the unlabeled data given the labelled data. Equating the two estimating equations in (15) and applying a first-order approximation, we approximate the difference β̄(γ̂) − β_0 by

β̄(γ̂) − β_0 ≈ −{E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} E_{i>n}[X_i{g(γ_0^⊤W_i) − g(γ̂^⊤W_i)}]. (16)

Together with the bias correction for β̂ − β̄(γ̂), this motivates the debiased estimator

β̂ − {(1−ρ)/n} Σ_{i=1}^n Θ̂X_i{g(γ̂^⊤W_i) − Y_i} − Θ̂ ℓ̇(β̂; γ̂).

The (1 − ρ) factor, which tends to one when n is much smaller than N, comes from the proportion of unlabeled data whose missing outcomes are imputed.

For theoretical considerations, we devise a cross-fitting scheme in our debiasing process. We split the labelled and unlabeled data into K folds of approximately equal size, respectively. The number of folds does not grow with the dimension (e.g., K = 10). We denote the index sets for the folds of the labelled data ℒ as ℐ_1, …, ℐ_K, and those of the unlabeled data 𝒰 as 𝒥_1, …, 𝒥_K. We denote the respective sizes of each fold in the labelled data and the full data as n_k = |ℐ_k| and N_k = n_k + |𝒥_k|, where |𝒜| denotes the cardinality of 𝒜. Define ℐ_k^c = {1, …, n}∖ℐ_k and 𝒥_k^c = {n+1, …, N}∖𝒥_k. For each labelled fold ℐ_k, we fit the imputation model with the out-of-fold labelled samples:

γ̂^(k) = argmin_{γ∈ℝ^{p+q+1}} −(n − n_k)^{−1} Σ_{i∈ℐ_k^c} ℓ(Y_i, γ^⊤W_i) + λ_γ‖γ_{−1}‖_1. (17)

Using γ̂^(k), we fit the prediction model with the out-of-fold data ℐ_k^c ∪ 𝒥_k^c:

β̂^(k) = argmin_{β∈ℝ^{p+1}} −(N − N_k)^{−1} {Σ_{i∈𝒥_k^c} ℓ(g(γ̂^(k)⊤W_i), β^⊤X_i) + Σ_{i∈ℐ_k^c} ℓ(Y_i, β^⊤X_i)} + λ_β‖β_{−1}‖_1. (18)

To estimate the projection

u_0 = {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} x_std, (19)

we propose an L1-penalized estimator

û^(k) = argmin_{u∈ℝ^{p+1}} (N − N_k)^{−1} Σ_{k′≠k} Σ_{i∈ℐ_{k′}∪𝒥_{k′}} (1/2) g′(β̂^(k,k′)⊤X_i)(X_i^⊤u)² − u^⊤x_std + λ_u‖u‖_1, (20)

where β̂^(k,k′) is trained with the samples outside folds k and k′,

β̂^(k,k′) = argmin_{β∈ℝ^{p+1}} −(N − N_k − N_{k′})^{−1} {Σ_{i∈(𝒥_k∪𝒥_{k′})^c} ℓ(g(γ̂^(k,k′)⊤W_i), β^⊤X_i) + Σ_{i∈(ℐ_k∪ℐ_{k′})^c} ℓ(Y_i, β^⊤X_i)} + λ_β‖β_{−1}‖_1, (21)

with

γ̂^(k,k′) = argmin_{γ∈ℝ^{p+q+1}} −(n − n_k − n_{k′})^{−1} Σ_{i∈(ℐ_k∪ℐ_{k′})^c} ℓ(Y_i, γ^⊤W_i) + λ_γ‖γ_{−1}‖_1.

The estimators in (21) take similar forms to those in (17) and (18), except that their training samples exclude two folds of data, ℐ_k ∪ 𝒥_k and ℐ_{k′} ∪ 𝒥_{k′}. In the summand of (20), the data (Y_i, X_i^⊤, S_i^⊤) in fold k′ ≠ k, i.e. i ∈ ℐ_{k′} ∪ 𝒥_{k′}, are independent of β̂^(k,k′), which is trained without folds k and k′. The estimation of u requires an estimator of β, and both estimators are subsequently used in the debiasing step. Using the same set of data multiple times for β̂, û, debiasing and variance estimation may induce over-fitting bias, so we implement the cross-fitting scheme to reduce the over-fitting bias. As a remark, cross-fitting might not be necessary for the theory under additional assumptions and/or with empirical process techniques.
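To make the fold bookkeeping in (17)–(18) concrete, here is a minimal cross-fitting sketch under the same assumptions and naming conventions as the earlier snippets (fit_imputation_model, impute_outcomes, fit_risk_model); it is illustrative only and omits the additional leave-two-folds-out fits in (20)–(21).

import numpy as np

def crossfit_sas(X_lab, S_lab, y_lab, X_unlab, S_unlab, lam_gamma, lam_beta, K=10, seed=0):
    """Return per-fold (gamma_hat^(k), beta_hat^(k)) and the fold index sets."""
    rng = np.random.default_rng(seed)
    lab_folds = np.array_split(rng.permutation(len(y_lab)), K)
    unlab_folds = np.array_split(rng.permutation(X_unlab.shape[0]), K)
    W_lab, W_unlab = np.hstack([X_lab, S_lab]), np.hstack([X_unlab, S_unlab])
    gammas, betas = [], []
    for k in range(K):
        lab_out = np.setdiff1d(np.arange(len(y_lab)), lab_folds[k])            # I_k^c
        unlab_out = np.setdiff1d(np.arange(X_unlab.shape[0]), unlab_folds[k])  # J_k^c
        g_k = fit_imputation_model(W_lab[lab_out], y_lab[lab_out], lam_gamma)
        y_hat = impute_outcomes(W_unlab[unlab_out], g_k)
        X_out = np.vstack([X_unlab[unlab_out], X_lab[lab_out]])
        y_out = np.concatenate([y_hat, y_lab[lab_out]])
        betas.append(fit_risk_model(X_out, y_out, lam_beta))
        gammas.append(g_k)
    return gammas, betas, lab_folds, unlab_folds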

We obtain the cross-fitted debiased estimator of x_std^⊤β_0, denoted \widehat{x_std^⊤β}, defined as

\widehat{x_std^⊤β} = (1/K) Σ_{k=1}^K x_std^⊤β̂^(k) − (1/N) Σ_{k=1}^K Σ_{i∈𝒥_k} û^(k)⊤X_i{g(β̂^(k)⊤X_i) − g(γ̂^(k)⊤W_i)} − (1/n) Σ_{k=1}^K Σ_{i∈ℐ_k} û^(k)⊤X_i{(1−ρ)g(γ̂^(k)⊤W_i) + ρg(β̂^(k)⊤X_i) − Y_i}. (22)

The second term corrects the bias β̄(γ̂) − β_0 and the third term corrects the bias β̂ − β̄(γ̂). The corresponding variance estimator is

V̂_SAS = (1/n) Σ_{k=1}^K Σ_{i∈ℐ_k} (û^(k)⊤X_i)²{(1−ρ)g(γ̂^(k)⊤W_i) + ρg(β̂^(k)⊤X_i) − Y_i}² + (ρ²/n) Σ_{k=1}^K Σ_{i∈𝒥_k} (û^(k)⊤X_i)²{g(β̂^(k)⊤X_i) − g(γ̂^(k)⊤W_i)}². (23)

Through the link g and the scaling factor ‖x_new‖_2, we estimate g(x_new^⊤β_0) by g(‖x_new‖_2 \widehat{x_std^⊤β}) and construct the (1 − α) × 100% confidence interval for g(x_new^⊤β_0) as

[ g(‖x_new‖_2{\widehat{x_std^⊤β} − z_{α/2}√(V̂_SAS/n)}),  g(‖x_new‖_2{\widehat{x_std^⊤β} + z_{α/2}√(V̂_SAS/n)}) ], (24)

where z_{α/2} is the (1 − α/2) quantile of the standard normal distribution.
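Putting (22)–(24) together, the following sketch computes the cross-fitted debiased estimate, its variance estimate and the confidence interval, taking the per-fold γ̂^(k), β̂^(k) and projection directions û^(k) (the solutions of (20), not computed here) as inputs. It assumes that coefficient vectors store the intercept first, that the design matrices carry no intercept column, and that x_new includes a leading 1; all function and variable names are hypothetical.

import numpy as np
from scipy.stats import norm

g = lambda t: 1.0 / (1.0 + np.exp(-t))          # logistic link
lin = lambda A, c: c[0] + A @ c[1:]             # coefficient vectors carry the intercept first

def sas_debias_ci(x_new, gammas, betas, u_hats, X_lab, W_lab, y_lab,
                  X_unlab, W_unlab, lab_folds, unlab_folds, alpha=0.05):
    """Cross-fitted debiased estimate of x_std'beta_0 with variance and CI for g(x_new'beta_0)."""
    K, n = len(gammas), len(y_lab)
    N = n + X_unlab.shape[0]
    rho = n / N
    x_std = np.asarray(x_new, dtype=float) / np.linalg.norm(x_new)
    plug, corr_unlab, corr_lab, v_lab, v_unlab = [], 0.0, 0.0, 0.0, 0.0
    for k in range(K):
        gk, bk, uk = gammas[k], betas[k], u_hats[k]
        Jl, Ju = lab_folds[k], unlab_folds[k]
        uXl, uXu = lin(X_lab[Jl], uk), lin(X_unlab[Ju], uk)
        # residual used to correct the imputation-step bias (unlabeled fold)
        ru = g(lin(X_unlab[Ju], bk)) - g(lin(W_unlab[Ju], gk))
        # residual used to correct the regularization bias of beta_hat (labeled fold)
        rl = (1 - rho) * g(lin(W_lab[Jl], gk)) + rho * g(lin(X_lab[Jl], bk)) - y_lab[Jl]
        plug.append(x_std @ bk)
        corr_unlab += np.sum(uXu * ru)
        corr_lab += np.sum(uXl * rl)
        v_lab += np.sum(uXl ** 2 * rl ** 2)
        v_unlab += np.sum(uXu ** 2 * ru ** 2)
    theta = np.mean(plug) - corr_unlab / N - corr_lab / n      # debiased x_std'beta, cf. (22)
    V = v_lab / n + rho ** 2 * v_unlab / n                     # variance estimator, cf. (23)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(V / n)
    s = np.linalg.norm(x_new)
    return g(s * theta), (g(s * (theta - half)), g(s * (theta + half)))   # point estimate and CI (24)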

4. Theory

We introduce assumptions required for both estimation and inference in Section 4.1. We state our theories for estimation and inference, respectively in Sections 4.2 and 4.3.

4.1. Assumptions

We assume the complete data consist of i.i.d. copies of (Y_i, X_i^⊤, S_i^⊤)^⊤ for i = 1, …, N. In our focused SSL setting, only the first n outcome labels Y_1, …, Y_n are observed. Under the i.i.d. assumption, our SSL setting is equivalent to the missing completely at random (MCAR) assumption. The sparsities of γ_0, β_0 and u_0 are denoted by

s_γ = ‖γ_0‖_0,   s_β = ‖β_0‖_0,   s_u = ‖u_0‖_0.

We focus on the setting with n, p + q, N → ∞, with n allowed to be smaller than p + q. We allow s_γ, s_β and s_u to grow with n, p + q and N, subject to s_γ ≪ n and s_β + s_u ≪ N. While our method and theory adaptively apply to both the SSL (N ≫ n) and missing data (N ≍ n) settings without prior knowledge of the limit of n/N, we emphasize the SSL (N ≫ n) setting, which matches our motivating EHR studies and is also less studied in the literature. To achieve the sharper dimension conditions, we consider the sub-Gaussian design as in Portnoy (1984, 1985); Negahban et al. (2010). We denote the sub-Gaussian norm for both random variables and random vectors as ‖·‖_{ψ_2}. The detailed definition is given in Appendix D.

Assumption 1 For constants ν_1, ν_2 and M independent of n, p and N,

  1. the residuals Y_i − g(γ_0^⊤W_i) and Y_i − g(β_0^⊤X_i) are sub-Gaussian random variables with sub-Gaussian norms bounded as ‖Y_i − g(γ_0^⊤W_i)‖_{ψ_2} ≤ ν_1 and ‖Y_i − g(β_0^⊤X_i)‖_{ψ_2} ≤ ν_2;

  2. the link function g satisfies the monotonicity and smoothness conditions: inf_{x∈ℝ} g′(x) ≥ 0, sup_{x∈ℝ} g′(x) < M and sup_{x∈ℝ} |g″(x)| < M.

Under our motivating example with a binary Y_i and g(x) = e^x/(1 + e^x), conditions 1a and 1b are satisfied. The conditions are also satisfied for the probit link and the identity link. Condition 1a is universal for high-dimensional regression. Admittedly, the Lipschitz requirement in 1b rules out some generalized linear model links with unbounded derivatives, such as the exponential link, but we may substitute the condition by assuming a bounded X_i.

Assumption 2 For constants σ_max² and σ_min² independent of n, p, N,

  1. W_i is a sub-Gaussian vector with sub-Gaussian norm ‖W_i‖_{ψ_2} ≤ σ_max/2;

  2. The weak overlapping condition holds at the population parameters β_0 and γ_0:
    1. inf_{‖v‖_2=1} v^⊤E[{g′(β_0^⊤X_i) ∧ 1} X_iX_i^⊤]v ≥ σ_min²,
    2. inf_{‖v‖_2=1} v^⊤E[{g′(γ_0^⊤W_i) ∧ 1} W_iW_i^⊤]v ≥ σ_min²;
  3. The non-degeneracy of the average residual variance:

inf_{‖v‖_2=1} E[{Y_i − (1−ρ)g(γ_0^⊤W_i) − ρg(β_0^⊤X_i)}²(X_i^⊤v)²] ≥ σ_min².

Assumption 2a is typical for high-dimensional regression (Negahban et al., 2010), and it also implies a bounded maximal eigenvalue of the second moment,

sup_{‖v‖_2=1} v^⊤E[W_iW_i^⊤]v ≤ σ_max².

Notably, we do not require two conditions that are common for high-dimensional generalized linear models (Huang and Zhang, 2012; van de Geer et al., 2014): 1) an upper bound on sup_{i=1,…,N}‖X_i‖_∞; 2) a lower bound on inf_{i=1,…,N} g′(β_0^⊤X_i), often known as the overlapping condition for the logistic regression model. Compared to the overlapping condition under logistic regression that g′(β_0^⊤X_i) and g′(γ_0^⊤W_i) are bounded away from zero, our Assumptions 2b and 2c are weaker because they are implied by the typical minimal eigenvalue condition

inf_{‖v‖_2=1} v^⊤E[W_iW_i^⊤]v ≥ σ_min²

plus the overlapping condition.

4.2. Consistency of the SAS Estimation

We now state the L2 and L1 convergence rates of our proposed SAS estimator.

Theorem 1 (Consistency of SAS estimation) Under Assumptions 1, 2 and with

s_γ = o(n/log(p+q)),   s_β = o(N/log p),   λ_β ≍ √(log p/N), (25)

we have

‖β̂ − β_0‖_2 = O_p(√s_β λ_β + (1−ρ)√(s_γ log(p+q)/n)),
‖β̂ − β_0‖_1 = O_p(s_β λ_β + (1−ρ)² s_γ log(p+q)/(nλ_β)).

Remark 2 The dimension requirement for our SAS estimator to achieve L2 consistency significantly weakens the existing dimension requirement in the supervised setting (Negahban et al., 2010; Huang and Zhang, 2012; Bühlmann and Van De Geer, 2011; Bickel et al., 2009). With λ_β ≍ √(log(p)/N), Theorem 1 implies the L2 consistency of β̂ under the dimension condition

(1−ρ)² s_γ log(p+q)/n + s_β log(p)/N = o(1). (26)

When N ≫ n, our requirement on the sparsity of β_0, s_β = o(N/log(p)), is significantly weaker than s_β = o(n/log(p)), which is known as the fundamental sparsity limit for identifying the high-dimensional regression vector in the supervised setting. Theorem 1 indicates that, with assistance from the surrogates S observed in 𝒰, the SAS procedure allows s_β > n provided that N is sufficiently large and the imputation model is sparse. This distinguishes our result from most estimation results in high-dimensional supervised settings. In the SSL literature, the utility of unlabeled data for relaxing the sparsity condition has not previously been recognized.

Remark 3 In the context of Theorem 1, a sparse imputation model, often induced by a small number of highly predictive surrogates, is essential for an optimal estimation rate. When s_β > s_γ, the L2 rate in Theorem 1 has two components: √(s_β log(p)/N), the minimax rate for learning β from all N observations, and √(s_γ log(p+q)/n), the minimax rate for learning γ from the labeled data (Raskutti et al., 2011). Thus, the rate cannot be further improved when the sparser imputation model is used to identify the denser β without additional conditions.

Remark 4 If the L1 consistency is of interest, the penalty levels are chosen as

λ_β ≍ max{√(log p/N), √(s_γ/s_β) λ_γ}, (27)

which produces the L1 estimation rate from Theorem 1

‖β̂ − β_0‖_1 = O_p(s_β√(log(p)/N) + √(s_γ s_β log(p)/n)).

Compared to the condition for L1 consistency under supervised learning, s_β = o(n/log(p)), the condition from SAS estimation, s_β = o(min{n/(s_γ log(p)), √(N/log(p))}), allows a denser β_0 in the setting with a very sparse γ_0 and a large unlabeled sample. On the other hand, the L2 estimation rate in Theorem 1 remains the same if

√(log(p)/N) ≲ λ_β ≲ max{√(log p/N), √(s_γ/s_β) λ_γ}.

Our theory on the SAS inference procedure uses the L2 instead of the L1 consistency.

Theorem 1 implies the following prediction consistency result.

Corollary 5 (Consistency of individual prediction) Suppose x_new is a sub-Gaussian random vector satisfying sup_{‖v‖_2=1} v^⊤E[x_new x_new^⊤]v ≤ σ_max². Under the conditions of Theorem 1, we have

g(β̂^⊤x_new) − g(β_0^⊤x_new) = O_p(‖β̂ − β_0‖_2) = o_p(1).

The concentration result of Corollary 5 is established with respect to the joint distribution of the data and the new observation x_new. This is in sharp contrast to the individual prediction conditional on any new observation x_new. If the goal is to conduct inference for any given x_new, the theoretical justification is provided in the following Theorem 7 and Corollary 8.

Remark 6 Other types of penalties shown to provide L2-consistent estimation of the working imputation model can substitute for the Lasso penalty in (7), since the L2 rate of ‖γ̂ − γ_0‖_2 is the only property invoked for γ̂ in the proof of Theorem 1. For example, we may choose the square-root Lasso (Belloni et al., 2011) with pivotal recovery under linear models with the identity link g(x) = x. Changing the Lasso penalty in (9), however, might require a different proof to produce the stated estimation rate adaptive to arbitrary s_β/N and s_γ/n, covering both the s_β/N ≲ s_γ/n and s_β/N ≳ s_γ/n settings (Cases 1 and 2 in the proof of Theorem 1). If only the setting guaranteed by a very large N is of interest, other penalties for β̂ can work equally well (by adapting Case 1 in the proof of Theorem 1).

4.3. √n-Inference with the Debiased SAS Estimator

We state the validity of our SSL inference in Theorem 7. We use A ⇝ B to denote that the random variable A converges in distribution to the distribution B.

Theorem 7 (SAS Inference) Let xnew be the random vector representing the covariate of a new individual. Under Assumptions 1, 2 and the dimension condition

(1−ρ)⁴ s_γ² log²(p+q)/n + ρ(s_β² + s_β s_u) log²(p)/N + (1−ρ)² s_γ s_u log(p+q) log(p)/N = o(1), (28)

we draw inference on x_new^⊤β_0 conditionally on x_new according to

√n V̂_SAS^{−1/2} {\widehat{x_std^⊤β} − x_new^⊤β_0/‖x_new‖_2} ∣ x_new ⇝ N(0, 1),

where V̂_SAS defined in (23) is the estimator of the asymptotic variance

V_SAS = E[(u_0^⊤X_i)²{Y_i − (1−ρ)g(γ_0^⊤W_i) − ρg(β_0^⊤X_i)}²] + ρ(1−ρ) E[(u_0^⊤X_i)²{g(γ_0^⊤W_i) − g(β_0^⊤X_i)}²],

with

u_0 = Θ_0 x_new/‖x_new‖_2 = {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} x_new/‖x_new‖_2. (29)

By Young’s inequality, the condition (28) is implied by

(1−ρ)⁴ s_γ² log²(p+q)/n + ρ(s_β + s_u)² log²(p)/N = o(1). (30)

When p is much smaller than the full sample size N, our condition (30) allows the sparsity levels of β_0 and u_0 to be as large as p. Even if p is larger than N, our SAS inference procedure is valid if s_β + s_u ≪ √N/log(p). In the literature on confidence interval construction in the high-dimensional supervised setting, a valid inference procedure for a single regression coefficient in linear regression requires s_β ≪ √n/log(p) (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014). Such a sparsity condition has been shown to be necessary for constructing a confidence interval of parametric rate (Cai and Guo, 2017). We have leveraged the unlabeled data to significantly relax this fundamental limit of statistical inference from s_β ≪ √n/log(p) to s_β ≪ N/{√n log(p)}. The large amount of unlabelled data validates statistical inference for a dense model in high dimensions.

The sparsity of u_0 is determined by x_new and the precision matrix Θ_0. In the supervised learning setting, for confidence interval construction for a single regression coefficient, van de Geer et al. (2014) requires s_u ≪ n/log(p). According to (30), our SAS inference requires s_u ≪ N/{√n log(p)}, which can be weaker than s_u ≪ n/log(p) if the amount of unlabeled data is larger than n². Theorem 7 implies that our proposed CI in (24) is valid in terms of coverage, which is summarized in the following corollary.

Corollary 8 Under Assumptions 1 and 2, as well as (28), the CI defined in (24) satisfies,

P[ g(‖x_new‖_2{\widehat{x_std^⊤β} − z_{α/2}√(V̂_SAS/n)}) ≤ g(x_new^⊤β_0) ≤ g(‖x_new‖_2{\widehat{x_std^⊤β} + z_{α/2}√(V̂_SAS/n)}) ] = 1 − α + o(1),

and the length of the CI is of the order 2 g′(x_new^⊤β_0) ‖x_new‖_2 z_{α/2}√(V_SAS/n), where V_SAS is the asymptotic variance defined in Theorem 7.

Confidence interval construction for g(x_new^⊤β_0) in the high-dimensional supervised setting has recently been studied in Guo et al. (2020). Guo et al. (2020) assumes the prediction model to be correctly specified as a high-dimensional sparse logistic regression, and their inference procedure is valid if s_β ≪ √n/log(p). In contrast, we leverage the unlabeled data to allow for a mis-specified prediction model and a dense regression vector, as long as the dimension requirement in (28) is satisfied.

4.4. Efficiency comparison of SAS Inference

Efficiency in the high-dimensional setting, or in the SSL setting where the proportion of labelled data decays to zero, has yet to be formalized. Here we use the efficiency bound in the classical low-dimensional setting with a fixed ρ as the benchmark. Apart from the relaxation of various sparsity conditions, we illustrate next that our SAS inference achieves decent efficiency with a properly specified imputation model, compared to supervised learning and to the benchmark.

Similar to the phenomenon discovered by Chakrabortty and Cai (2018), if the imputation model is correct, we can guarantee the efficiency gain by SAS inference in comparison to the asymptotic variance of the supervised learning,

V_SL = E[(u_0^⊤X_i)²{Y_i − g(β_0^⊤X_i)}²]. (31)

Proposition 9 If E[Y_i ∣ S_i, X_i] = g(γ_0^⊤W_i), we have V_SL ≥ V_SAS.

Moreover, we can show that our SAS inference attains the benchmark efficiency derived from the classical fixed-ρ setting (Tsiatis, 2007). To simplify the derivation, we describe the missing-completely-at-random mechanism through binary observation indicators R_i, i = 1, …, N, independent of Y_i, X_i and S_i. We still denote the proportion of labelled data as ρ = E[R_i]. The unsorted data take the form

𝒟 = {D_i = (X_i^⊤, S_i^⊤, R_i, R_iY_i)^⊤, i = 1, …, N}.

We consider the following class of complete data semi-parametric models

ℳ_comp = {f_{X,Y,S,R}(x, y, s, r) = f_X(x) ρ^r(1−ρ)^{1−r} f_{Y∣S,X}(y∣s,x) f_{S∣X}(s∣x) : f_{Y∣S,X}, f_X, f_{S∣X} are arbitrary densities}, (32)

and establish the efficiency bounds for regular asymptotically linear (RAL) estimators under ℳ_comp by deriving the associated efficient influence function in the following proposition. We denote the nuisance parameters for f_{Y∣S,X}, f_X and f_{S∣X} collectively as η, and use η_0 to denote the true underlying nuisance parameter that generates the data. The parameter of interest β_0 is not part of the model ℳ_comp but is defined implicitly through the moment condition (3).

Proposition 10 The efficient influence function for θ = x_std^⊤β_0 under ℳ_comp is

φ_eff(D_i; θ_0, η_0) = (R_i/ρ) u_0^⊤X_i{Y_i − E[Y_i ∣ S_i, X_i]} − u_0^⊤X_i{g(β_0^⊤X_i) − E[Y_i ∣ S_i, X_i]}.

Under the assumptions of Theorem 7 and additionally E[Y_i ∣ S_i, X_i] = g(γ_0^⊤W_i), our SAS debiased estimator admits the same influence function,

\widehat{x_std^⊤β} − x_new^⊤β_0/‖x_new‖_2 = N^{−1} Σ_{i=1}^N φ_eff(D_i; θ_0, η_0) + o_p({ρN}^{−1/2}),

according to Appendix B3 Step 2 (A.31).

5. Simulation

We have conducted extensive simulation studies to evaluate the finite sample performance of the SAS estimation and inference procedures under various scenarios. Throughout, we let p = 500, q = 100, N = 20000 and consider n = 500. The signals in β are varied to be approximately sparse or fully dense with a mixture of strong and weak signals. The surrogates S are either moderately or strongly predictive of Y, as specified below. For each configuration, we summarize the results based on 500 simulated datasets. We compare our SAS procedure with the supervised LASSO (SLASSO) that (1) estimates β_0 by regressing Y on X over the labeled data with the Lasso; and (2) draws inference on x_new^⊤β_0 with the one-step debiased Lasso (van de Geer et al., 2014).

To mimic the zero-inflated discrete distribution of EHR features, we first generate Z^x_{i,1}, …, Z^x_{i,p}, Z^u_i, Z^s_{i,1}, …, Z^s_{i,q} independently from N(0, 25). Then we construct X_i from (Z^u_i, Z^x_i) with Z^x_i = (Z^x_{i,1}, …, Z^x_{i,p})^⊤ via the transformation ς(z) = log{1 + exp(z)}:

X_{i,1} = {ς(Σ_{j=2}^p 2X_{i,j}/(p−1) + Z^x_{i,1}/2) − μ_X}/σ_X,
X_{i,j} = {ς(Z^x_{i,j}(1 − p^{−1}) + Z^u_i/p) − μ_X}/σ_X,   j = 2, …, p.

We standardize X_{i,j} to have roughly mean zero and unit variance with μ_X = 1.80 and σ_X = 2.74. The shared term Z^u_i induces correlation among the covariates.
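For concreteness, a sketch of this covariate-generating mechanism (with the constants p = 500, μ_X = 1.80, σ_X = 2.74 from the text; the function names are illustrative and the code follows the displayed formulas above):

import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)          # numerically stable log(1 + exp(z))

def simulate_X(n_obs, p=500, mu_X=1.80, sigma_X=2.74, seed=0):
    rng = np.random.default_rng(seed)
    Zx = rng.normal(0.0, 5.0, size=(n_obs, p))   # latent N(0, 25) variables
    Zu = rng.normal(0.0, 5.0, size=n_obs)        # shared factor inducing correlation
    X = np.empty((n_obs, p))
    X[:, 1:] = (softplus(Zx[:, 1:] * (1 - 1.0 / p) + Zu[:, None] / p) - mu_X) / sigma_X
    X[:, 0] = (softplus(2.0 * X[:, 1:].sum(axis=1) / (p - 1) + Zx[:, 0] / 2.0) - mu_X) / sigma_X
    return X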

For S and Y, we consider two scenarios under which the imputation model is either correctly or incorrectly specified. We present the “Scenario I: neither the risk prediction model nor the imputation model is correctly specified” in the main text and the “Scenario II: The imputation model is correctly specified and exactly sparse” in Section A of the Supplementary materials.

Scenario I: neither the risk prediction model nor the imputation model is correctly specified. In this scenario, we first generate Y_i from the probit model

P(Y_i = 1 ∣ Z^x_i) = Φ(α^⊤Z^x_i)  with  Φ(x) = ∫_{−∞}^x (2π)^{−1/2} e^{−t²/2} dt,

and then generate S from

S_{i,1} = {ς(Z^s_{i,1}/2 + θY_i) − μ_S}σ_S^{−1} + ξ^⊤X_i,  and  S_{i,j} = {ς(Z^s_{i,j}) − μ_X}σ_X^{−1},   j = 2, …, q.

We chose μ_S and σ_S depending on α such that S_{i,1} is roughly mean 0 and variance 1. Under this setting, a logistic imputation model would be misspecified but nevertheless approximately sparse with appropriately chosen ξ. The coefficients α control the optimal prediction accuracy of X for Y, while θ controls the optimal prediction accuracy of S for Y. We consider two α of different sparsity patterns, which also determine the rest of the parameters:

Sparse (s_α = 3):  α = (0.45, 0.318, 0.318, 0_{497×1}^⊤)^⊤,  μ_S = 1.82,  σ_S = 2.01;
Dense (s_α = 500):  α = (0.316, 0.059_{29×1}^⊤, 0.007_{470×1}^⊤)^⊤,  μ_S = 2.71,  σ_S = 2.68,

where a_{k×1} = (a, …, a)^⊤ ∈ ℝ^k for any a. The sparsity of α subsequently affects the approximate sparsity of β_0 (Table 1), which we measure by the squared ratio between the ℓ1 norm and the ℓ2 norm,

𝒮(β) = ‖β‖_1²/‖β‖_2²,  which satisfies 𝒮(β)/‖β‖_0 ≤ 1. (33)
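As a quick numerical illustration of (33) (the example vectors here are made up, not the simulation parameters), 𝒮(β) equals the support size for an equal-magnitude sparse vector and stays far below p when a few coefficients dominate many small ones:

import numpy as np

def approx_sparsity(beta):
    """S(beta) = ||beta||_1^2 / ||beta||_2^2, the approximate sparsity measure in (33)."""
    beta = np.asarray(beta, dtype=float)
    return np.sum(np.abs(beta)) ** 2 / np.sum(beta ** 2)

print(approx_sparsity([1.0] * 3 + [0.0] * 497))    # 3.0: exactly 3-sparse
print(approx_sparsity([1.0] * 3 + [0.01] * 497))   # about 20.8: approximately sparse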

We consider two θ: (a) θ=0.6 for S to be moderately predictive of Y; and (b) θ=1 for strong surrogates. The parameter ξ depends on both the choices of α and θ:

s_α = 3, θ = 0.6:  ξ = (0.407, 0.330, 0.330, 0.005_{497×1}^⊤)^⊤;
s_α = 3, θ = 1:  ξ = (0.199, 0.163, 0.163, 0.002_{497×1}^⊤)^⊤;
s_α = 500, θ = 0.6:  ξ = (0.350, 0.064_{29×1}^⊤, 0.011_{470×1}^⊤)^⊤;
s_α = 500, θ = 1:  ξ = (0.169, 0.032_{29×1}^⊤, 0.005_{470×1}^⊤)^⊤.

Table 1:

AUC table for simulations with 500 labels under Scenario I. The AUCs are evaluated on an independent testing set of size 100. We approximately measure the sparsity by 𝒮(v) = ‖v‖_1²/‖v‖_2².

Scenario                           Prediction accuracy (AUC)
Surrogate   𝒮(β_0)   𝒮(γ_0)      Oracle   SLASSO   SAS
Strong      174       1.32        0.724    0.660    0.711
Moderate    174       1.26        0.724    0.660    0.713
Strong      28.3      1.33        0.719    0.694    0.713
Moderate    28.3      1.24        0.719    0.694    0.711

Due to the complexity of the data generating process and the noncollapsibility of logistic regression models, we cannot analytically express the true β_0 in either scenario. Instead, we numerically evaluate β_0 with a large simulated dataset, using the oracle knowledge of the exchangeability among covariates, according to the model

logit P(Y_i = 1 ∣ X_i) ∼ η_0 + η_1 X_{i,1} + η_2 Σ_{j=2}^{s_α} X_{i,j} + η_3 Σ_{j=s_α+1}^p X_{i,j}.

We derive the true β_0 as

β_0 = (η_0, η_1, (η_2)_{s_α×1}^⊤, (η_3)_{(p−s_α)×1}^⊤)^⊤.

We report the simulation settings under Scenario I in Table 1, where we present the predictive power of the oracle estimation and of the Lasso estimation. We also report the average area under the receiver operating characteristic (ROC) curve (AUC) for the oracle β_0, SLASSO and the proposed SAS estimation. Our SAS estimation achieves a better AUC than the supervised LASSO across all scenarios, and is comparable to the AUC with the true coefficient β_0. In addition, we observe that the AUC of the supervised LASSO is sensitive to the approximate sparsity 𝒮(β_0), while the AUC of the SAS estimation does not seem to be affected by 𝒮(β_0).

To evaluate the SAS inference for the individualized prediction, we consider six different choices of x_new. We first select x_new^L, x_new^M, x_new^H from a random sample of x_new generated from the distribution of X_i such that their predicted risks are around 0.2, 0.5, and 0.7, corresponding to low, moderate and high risk. We additionally consider three sets of x_new with different levels of sparsity:

Sparse: x_new^S = (1, 1, 0_{499×1}^⊤)^⊤;  Intermediate: x_new^I = (1, 0.183_{30×1}^⊤, 0_{470×1}^⊤)^⊤;  Dense: x_new^D = (1, 0.045_{500×1}^⊤)^⊤.

In Table 2, we compare our SAS estimator of x_new^⊤β_0 with the corresponding SLASSO across all settings under Scenario I. The root mean squared error (rMSE) of the SAS estimation decays with the sample size, while the rMSE of the supervised LASSO provides evidence of inconsistency for the intermediate and dense deterministic x_new. The bias of the supervised LASSO is also significantly larger than that of the SAS estimation. The performance of the SAS estimation is insensitive to the sparsity of β_0, while that of the supervised LASSO severely deteriorates with a dense β_0. The improvement from the supervised LASSO to the SAS estimation is regulated by the surrogate strength.

Table 2:

Comparison of the SAS estimation to the supervised LASSO (SLASSO): bias, empirical standard error (ESE) and root mean squared error (rMSE) of the linear predictions x_new^⊤β_0 under Scenario I with 500 labels, moderate or large 𝒮(β_0), and strong or moderate surrogates.

                  SLASSO                   SAS: Moderate            SAS: Strong
Type              Bias    ESE    rMSE      Bias    ESE    rMSE      Bias    ESE    rMSE
Moderate 𝒮(β_0)
x_new^L           0.605   0.387  0.719     0.165   0.249  0.298     0.118   0.196  0.229
x_new^M          −0.083   0.337  0.347    −0.008   0.246  0.246    −0.016   0.195  0.196
x_new^H          −0.718   0.521  0.887    −0.234   0.294  0.376    −0.176   0.225  0.286
x_new^S          −0.072   0.144  0.161    −0.080   0.094  0.123    −0.018   0.078  0.080
x_new^I          −0.460   0.096  0.470    −0.110   0.093  0.143    −0.055   0.071  0.090
x_new^D          −0.413   0.091  0.423    −0.110   0.089  0.141    −0.114   0.069  0.133
Large 𝒮(β_0)
x_new^L           0.389   0.275  0.477     0.161   0.215  0.269     0.133   0.264  0.296
x_new^M          −0.017   0.280  0.280    −0.014   0.213  0.213    −0.017   0.268  0.268
x_new^H          −0.600   0.481  0.769    −0.251   0.271  0.370    −0.164   0.296  0.339
x_new^S          −0.202   0.140  0.246    −0.074   0.097  0.122    −0.009   0.078  0.079
x_new^I          −0.178   0.098  0.203    −0.075   0.086  0.115    −0.071   0.075  0.103
x_new^D          −0.185   0.090  0.206    −0.109   0.084  0.138    −0.113   0.073  0.135

In Table 3, we compare our SAS inference with the supervised debiased LASSO across the settings under Scenario I. Our SAS inference procedure attains approximately honest coverage of the 95% confidence intervals for all types of x_new under all scenarios. Unsurprisingly, the debiased SLASSO under-covers for the deterministic x_new as a consequence of the violation of the sparsity assumptions on β_0 and the precision matrix. Under our design, the first covariate X_1 has the strongest dependence upon the other covariates, and its associated row in the precision matrix is thus the densest. Consequently, the inference for β^⊤x_new^S = β_0 + β_1 suffers the most severe under-coverage under the debiased SLASSO. The debiased SLASSO also has acceptable coverage for the random x_new^L, x_new^M, x_new^H sampled from the covariate distribution despite the presence of substantial bias, which we attribute to the even larger variance that dominates the bias. In contrast, our SAS inference has small bias across all scenarios and improved variance from the strong surrogates.

Table 3:

Bias, empirical standard error (ESE), average of the estimated standard errors (ASE), and empirical coverage of the 95% confidence intervals (CP) for the debiased supervised LASSO (SLASSO) and the debiased SAS estimator of the linear predictions x_new^⊤β_0 under Scenario I with 500 labels, moderate or large 𝒮(β_0), and strong or moderate surrogates.

                  Debiased SLASSO                Debiased SAS: Moderate Surrogates    Debiased SAS: Strong Surrogates
Type              Bias    ESE    ASE    CP       Bias    ESE    ASE    CP             Bias    ESE    ASE    CP
Risk prediction model approximately sparse
x_new^L          −0.290   1.901  1.896  0.948     0.021   1.873  1.864  0.949          0.018   1.531  1.531  0.950
x_new^M          −0.091   1.994  1.981  0.947    −0.007   1.961  1.954  0.950         −0.015   1.560  1.570  0.953
x_new^H           0.348   2.106  2.074  0.942    −0.050   2.036  2.039  0.950         −0.011   1.632  1.623  0.950
x_new^S           0.171   0.157  0.128  0.694    −0.019   0.149  0.150  0.950         −0.001   0.132  0.125  0.924
x_new^I          −0.001   0.129  0.125  0.938    −0.013   0.123  0.116  0.932          0.010   0.101  0.094  0.920
x_new^D           0.141   0.137  0.138  0.812    −0.011   0.123  0.118  0.944         −0.001   0.096  0.095  0.940
Large 𝒮(β_0)
x_new^L          −0.134   1.918  1.914  0.951     0.018   1.875  1.878  0.951          0.018   1.529  1.524  0.948
x_new^M          −0.056   1.970  1.962  0.948    −0.020   1.911  1.927  0.952          0.005   1.603  1.597  0.950
x_new^H           0.109   2.051  2.029  0.945    −0.022   1.997  1.991  0.950         −0.040   1.671  1.668  0.951
x_new^S           0.029   0.155  0.127  0.892    −0.008   0.153  0.147  0.946         −0.013   0.133  0.131  0.938
x_new^I           0.002   0.131  0.125  0.930     0.001   0.122  0.114  0.936          0.002   0.101  0.098  0.936
x_new^D           0.113   0.135  0.139  0.874    −0.007   0.119  0.116  0.938         −0.003   0.099  0.097  0.960

According to Tables A1, A2 and A3 in Appendix A, the results under Scenario II are consistent with our findings under Scenario I. We also compare SAS to an unsupervised learning approach using a proxy outcome derived from the surrogates in Appendix A. Under Scenario III, which is very similar to Scenario I, SAS performs as well as in Scenario I while the unsupervised learning approach fails completely. This is expected, since the unsupervised approach requires that the deviation of the surrogates from the true outcome, S − Y, be uncorrelated with the risk factors X. Otherwise, spurious association between the outcome Y and the risk factors X can be induced, creating bias in the estimation of the risk prediction model.

6. Application of SAS to EHR Study

We applied the proposed SAS method to the risk prediction of Type II Diabetes Mellitus (T2DM) using EHR and genomic data of participants in the Mass General Brigham Biobank study. The number of genetic risk factors among single nucleotide polymorphisms (SNPs) for T2DM has grown exponentially following the expansion of genome-wide association studies. As an incomplete summary, Voight et al. (2010), Morris et al. (2012) and Scott et al. (2017) each discovered around a dozen new risk SNPs for T2DM, and the recent studies by Mahajan et al. (2018) and Vujkovic et al. (2020) discovered 135 and 558 new risk SNPs, respectively. Some new risk SNPs in Mahajan et al. (2018) even had large coefficients in the polygenic risk score. The ever-growing number of risk SNPs suggests that the genetic risk prediction model for T2DM may be dense. Compared to the large biobank data that generated the genome-wide association studies, the EHR captures the temporal information of T2DM onset and of other phenotypes predictive of T2DM, and thus may provide more accurate forecasting of T2DM. As we mentioned in the introduction, direct extraction of disease onset from the EHR by diagnosis codes or mentions in medical notes may contain substantial false positives. From an expert annotation of the medical histories of 271 patients, we found 38 patients with a T2DM diagnosis code and 161 patients with a mention of T2DM in medical notes who had actually never developed T2DM. The annotation process requires intensive labor of highly skilled medical experts, leading to the limited number of labels.

To define the study cohort, we extracted from the EHR of each patient the date of the first EHR encounter t_ini, the follow-up period C, and the counts and dates of the diagnosis codes and note mentions of clinical concepts related to T2DM as well as its risk factors. We only included patients who did not have any diagnosis code or note mention of T2DM up to baseline, where the baseline time is defined as 1990 if t_ini is prior to 1990 and as their first year if t_ini ≥ 1990. Although neither the diagnosis code nor the note mention of T2DM is sufficiently specific, they are highly sensitive and can be used to accurately remove patients who had already developed T2DM at baseline. This exclusion criterion resulted in N = 20216 patients who are free of T2DM at baseline and have both EHR and genomic features for risk modeling. Among those, we have a total of n = 271 patients whose T2DM status during follow-up, Y, has been obtained via manual chart review. The prevalence of T2DM was about 14% based on the labeled data.

We aim to develop a risk prediction model for Y by fitting a working model P(Y = 1 ∣ X) = g(β_0^⊤X), where the baseline covariate vector X includes age, gender, indicators for the occurrence of diagnosis codes and note counts for obesity, hypertension, coronary artery disease (CAD), and hyperlipidemia during the first-year window, as well as a total of 49 single nucleotide polymorphisms previously reported as associated with T2DM in Mahajan et al. (2018) with odds ratios greater than 1.1. We additionally adjust for follow-up by including log(C) and allow for non-linear effects by including two-way interactions between the SNPs and the other baseline covariates. All variables with fewer than 10 nonzero values within the labelled set are removed, resulting in final covariates of dimension p = 260. We standardize the covariates to have mean 0 and variance 1. To impute the outcome, we used the predicted probability of T2DM derived from the unsupervised phenotyping method MAP (Liao et al., 2019), which achieves an AUC of 0.98, indicating a strong surrogate. In addition to the proposed SAS procedure, we derive risk prediction models based on the supervised LASSO with the same set of covariates. We let K = 5 in the cross-fitting and use 5-fold cross-validation for tuning parameter selection. To compare the performance of the different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC. We repeated the process 10 times and took the average of the predicted probabilities across the repeats for each labelled sample and each method in comparison.
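As an illustration of this repeated cross-validated AUC evaluation (a sketch only; fit_and_predict is a hypothetical placeholder for either the SAS or the SLASSO fitting routine, which for SAS would also consume the unlabeled data):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def repeated_cv_auc(fit_and_predict, X_lab, y_lab, n_repeats=10, n_folds=10, seed=0):
    """Average out-of-fold predicted risks over repeats, then compute a single AUC."""
    preds = np.zeros((n_repeats, len(y_lab)))
    for r in range(n_repeats):
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in kf.split(X_lab):
            preds[r, test_idx] = fit_and_predict(X_lab[train_idx], y_lab[train_idx],
                                                 X_lab[test_idx])
    return roc_auc_score(y_lab, preds.mean(axis=0))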

In Figure 2, we present the estimated β coefficients for the covariates that received a p-value less than 0.05 from the SAS inference. The confidence intervals are generally narrower from the SAS inference. For the coefficients of baseline age and follow-up time, both of which are expected to have a positive effect on the T2DM onset status during the observation window, the SAS inference produced much narrower confidence intervals than the debiased SLASSO. In addition, the SAS inference identified one global genetic risk factor and 6 other subgroup genetic risk factors, while SLASSO identified none of these.

Figure 2:


Point and 95% confidence interval estimates for the coefficients with nominal p-value < 0.05 from SAS inference. The horizontal bars indicate the estimated 95% confidence intervals. The solid points indicate the (initial) estimates, and the triangles indicate debiased estimates. Colors red and green indicate different methods, SAS and SLASSO, respectively.

In Table 4, we present the AUCs of the estimated risk prediction models using the high-dimensional X. It is important to note that the AUC is a measure of prediction accuracy, so debiasing might lead to a worse AUC by accepting larger variability for reduced bias. The AUC from SLASSO is very poor, probably due to the over-fitting bias with the small sample size of the labeled set. With the information from the large unlabeled data, SAS produced a significantly higher AUC than the SLASSO.

Table 4:

The cross-validated (CV) AUC of the estimated risk prediction models with high-dimensional EHR and genetic features based on SAS and the supervised LASSO. Also shown is the AUC of the imputation model derived for the SAS procedure.

Method Imputation SAS SLASSO
CV AUC 0.928 0.763 0.488

For illustration, we present in Figure 3 the individual risk predictions with 95% confidence intervals for three sets of 10 patients, with each set randomly selected from the low (< 5%), medium (5% ~ 15%) or high risk (> 15%) subgroups. These risk groups are constructed for illustration purposes, and a patient with covariates x_new is classified to the low, medium or high risk group if expit(β̂^⊤x_new) belongs to the low, medium or high tertile of {expit(β̂^⊤X_i), i = 1, …, N}. We observe that the confidence intervals for the patients' predicted risks are substantially narrower under SAS, whereas the debiased SLASSO inference is not very informative, with most error bars stretching from zero to one. The contrast between the SAS CIs and the SLASSO CIs demonstrates the improved efficiency resulting from leveraging the information in the unlabeled data through predictive surrogates.

Figure 3:


Point and 95% confidence interval estimates for the predicted risks of 30 randomly selected patients. The vertical bars indicate the estimated 95% confidence intervals. The circle and triangle shapes correspond to the (initial) estimates and the debiased estimates, respectively. Solid points indicate the observed T2DM cases. Colors red and green indicate the different methods, SAS and SLASSO.

7. Discussion

We proposed the SAS estimation and inference method for high-dimensional risk prediction models with a diminishing proportion of observed outcomes. With a sparse imputation model based on predictive surrogates, SAS can recover a dense risk prediction model that is impossible to learn with supervised methods, and it achieves better efficiency than supervised methods when the latter are applicable. We showed that these theoretical advantages lead to better prediction accuracy and shorter confidence intervals in simulations and in a real-data example.

While the SAS procedure is a powerful tool with minimal requirements, attention should be given to the inclusion of highly informative surrogates so that the imputation model is sparse (or approximately sparse). If all surrogates predict Y poorly and the imputation model is dense, the SAS procedure can suffer a compromised convergence rate in estimation. While the current study is motivated by the existence of an easy-to-learn imputation model with highly predictive surrogates, the SAS framework can be extended to settings where the imputation model is not easier to learn than the model for Y ∣ X. When the imputation model is estimable but denser than the risk prediction model (i.e., s_β < s_γ), we can follow similar strategies as in our SAS inference procedure to reduce the bias incurred during the imputation step from γ̂. Specifically, we may consider a debiased estimator for β̂,

β̂_debias = argmin_{β∈ℝ^{p+1}} Σ_{k=1}^K { −(1/N) Σ_{i∈𝒥_k} ℓ(g(γ̂^(k)⊤W_i), β^⊤X_i) + (1/n) Σ_{i∈ℐ_k} β^⊤X_i{g(γ̂^(k)⊤W_i) − Y_i} } + λ‖β_{−1}‖_1.

This debiased SAS estimation will attain the optimal rate √(s_β log(p)/n), and we also expect an efficiency gain in the resulting variance compared to the supervised estimator, analogous to the efficiency gain observed in the SAS inference. Adaptive approaches to infer whether a given dataset falls into the setting with s_β > s_γ or s_β < s_γ are straightforward in simpler settings where s_β and s_γ can be estimated, but warrant future research in general. In the extremely dense imputation model setting with s_γ > n, information-theoretic bounds indicate that the imputation model will be inestimable, invalidating any subsequent steps involving γ̂. A possible solution is to redefine the imputation model as the sparser of the risk prediction model and the original imputation model. A potential approach to identifying such a sparser imputation model is through the under-identified Dantzig selector

γ̂_ada = argmin_{γ∈ℝ^{p+q+1}} ‖γ‖_1,  subject to  ‖(1/n) Σ_{i=1}^n X_i{Y_i − g(γ^⊤W_i)}‖_∞ ≤ λ.

Both γ_0 and (β_0^⊤, 0_q^⊤)^⊤ should fall in the feasible region with a suitable λ, and the minimization of the ℓ1 norm may pick the sparsest element from the feasible class. Using γ̂_ada in the SAS estimation may attain the optimal rate uniformly over s_β and s_γ. Theoretical studies of the above proposals warrant future research.

Supplementary Material


Figure 1:

A dense prediction model (graph with dashed lines) can be compressed into a sparse imputation model (graph with solid lines) when the effects of most baseline covariates are reflected in a few variables in the EHR that monitor the development of the event of interest.

Contributor Information

Jue Hou, Division of Biostatistics, University of Minnesota School of Public Health, Minneapolis, MN 55455, USA.

Zijian Guo, Department of Statistics, Rutgers University, Piscataway, NJ 08854-8019, USA.

Tianxi Cai, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.

References

  1. Bang Heejung and Robins James M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  2. Belloni A, Chernozhukov V, and Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
  3. Belloni Alexandre, Kaul Abhishek, and Rosenbaum Mathieu. Pivotal estimation via self-normalization for high-dimensional linear models with error in variables, 2017.
  4. Bickel Peter J., Ritov Ya'acov, and Tsybakov Alexandre B. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.
  5. Bühlmann Peter and van de Geer Sara. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
  6. Cai T. Tony and Guo Zijian. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615–646, 2017.
  7. Cai T. Tony and Guo Zijian. Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):391–419, 2020.
  8. Cai Tianxi, Cai T. Tony, and Guo Zijian. Optimal statistical inference for individualized treatment effects in high-dimensional models. arXiv e-prints:1904.12891, 2019.
  9. Chakrabortty Abhishek and Cai Tianxi. Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist., 46(4):1541–1572, 2018.
  10. Chakrabortty Abhishek, Lu Jiarui, Cai T. Tony, and Li Hongzhe. High dimensional M-estimation with missing outcomes: A semi-parametric framework. arXiv e-prints:1911.11345, 2019.
  11. Chandrasekher Kabir Aladin, El Alaoui Ahmed, and Montanari Andrea. Imputation for high-dimensional linear regression, 2020.
  12. Cheng David, Ananthakrishnan Ashwin, and Cai Tianxi. Efficient and robust semi-supervised estimation of average treatment effects in electronic medical records data. arXiv e-prints:1804.00195, 2018.
  13. Deng Siyi, Ning Yang, Zhao Jiwei, and Zhang Heping. Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv e-prints:2011.14185, 2020.
  14. Fan Jianqing and Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  15. Frazer Kelly A, Murray Sarah S, Schork Nicholas J, and Topol Eric J. Human genetic variation and its contribution to complex traits. Nature Reviews Genetics, 10(4):241–251, 2009.
  16. Gronsbell Jessica, Minnier Jessica, Yu Sheng, Liao Katherine, and Cai Tianxi. Automated feature selection of predictors in electronic medical records data. Biometrics, 75(1):268–277, 2019.
  17. Guo Zijian, Rakshit Prabrisha, Herman Daniel S, and Chen Jinbo. Inference for the case probability in high-dimensional logistic regression. arXiv preprint:2012.07133, 2020.
  18. Huang Jian and Zhang Cun-Hui. Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. J. Mach. Learn. Res., 13(1):1839–1864, 2012.
  19. Javanmard Adel and Montanari Andrea. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014.
  20. Kallus Nathan and Mao Xiaojie. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv e-prints:2003.12408, 2020.
  21. Liao Katherine P, Sun Jiehuan, and 18 others. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association, 26(11):1255–1262, 2019.
  22. Loh Po-Ling and Wainwright Martin J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, and Weinberger KQ, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  23. Ma Rong, Cai T. Tony, and Li Hongzhe. Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, 116(534):984–998, 2020.
  24. Mahajan Anubha, Taliun Daniel, and 113 others. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature Genetics, 50(11):1505–1513, 2018.
  25. Morris Andrew P., Voight Benjamin F., Teslovich Tanya M., Ferreira Teresa, Segrè Ayellet V., Steinthorsdottir Valgerdur, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics, 44(9):981–990, 2012.
  26. Negahban Sahand, Ravikumar Pradeep, Wainwright Martin J., and Yu Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Technical Report 797, University of California Berkeley, Department of Statistics, 2010.
  27. Portnoy Stephen. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Statist., 12(4):1298–1309, 1984.
  28. Portnoy Stephen. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. Ann. Statist., 13(4):1403–1417, 1985.
  29. Raskutti Garvesh, Wainwright Martin J., and Yu Bin. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  30. Scott Robert A., Scott Laura J., Mägi Reedik, Marullo Letizia, Gaulton Kyle J., et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes, 66(11):2888–2902, 2017.
  31. Smucler Ezequiel, Rotnitzky Andrea, and Robins James M. A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv e-prints:1904.03737, 2019.
  32. Thompson Caroline A, Kurian Allison W, and Luft Harold S. Linking electronic health records to better understand breast cancer patient pathways within and between two health systems. eGEMs, 3(1), 2015.
  33. Tsiatis A. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York, 2007.
  34. van de Geer Sara, Bühlmann Peter, Ritov Ya'acov, and Dezeure Ruben. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014.
  35. van de Geer Sara A. and Bühlmann Peter. On the conditions used to prove oracle results for the lasso. Electron. J. Statist., 3:1360–1392, 2009.
  36. Vershynin Roman. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  37. Voight Benjamin F., Scott Laura J., Steinthorsdottir Valgerdur, Morris Andrew P., Dina Christian, Welch Ryan P., et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nature Genetics, 42(7):579–589, 2010.
  38. Vujkovic Marijana, Keaton Jacob M, and 48 others. Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nature Genetics, 52(7):680–691, 2020.
  39. Warren Joan L and Yabroff K Robin. Challenges and opportunities in measuring cancer recurrence in the United States. Journal of the National Cancer Institute, 107(8):djv134, 2015.
  40. Zhang Anru, Brown Lawrence D., and Cai T. Tony. Semi-supervised inference: General theory and estimation of means. Ann. Statist., 47(5):2538–2566, 2019.
  41. Zhang Cun-Hui and Zhang Stephanie S. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.
  42. Zhang Yichi, Liu Molei, Neykov Matey, and Cai Tianxi. Prior adaptive semi-supervised learning with application to EHR phenotyping. Journal of Machine Learning Research, 23(83):1–25, 2022.
  43. Zhang Yuqian and Bradic Jelena. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2021. doi: 10.1093/biomet/asab042.
  44. Zhu Yinchu and Bradic Jelena. Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association, 113(524):1583–1600, 2018a.
  45. Zhu Yinchu and Bradic Jelena. Significance testing in non-sparse high-dimensional linear models. Electron. J. Statist., 12(2):3312–3364, 2018b.
