Published in final edited form as: J Mach Learn Res. 2023 Jan-Dec;24:265.

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Jue Hou 1, Zijian Guo 2, Tianxi Cai 3

Abstract

Risk modeling with electronic health records (EHR) data is challenging due to the lack of direct observations of the disease outcome and the high dimensionality of the predictors. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging a small labeled dataset with annotated outcomes together with extensive unlabeled data containing outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference in the challenging setting of a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach over existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.

Keywords: generalized linear models, high dimensional inference, model mis-specification, risk prediction, semi-supervised learning

1. Introduction

Precise risk prediction is vitally important for successful clinical care. High-risk patients can be assigned to more intensive monitoring or intervention to improve outcomes. Traditionally, risk prediction models are developed based on cohort studies or registry data. Population-based disease registries, while remaining a critical source for epidemiological studies, collect information on a relatively small set of pre-specified variables and hence may limit researchers’ ability to develop comprehensive risk prediction models (Warren and Yabroff, 2015). Most clinical care is delivered in healthcare systems (Thompson et al., 2015), and electronic health records (EHR) embedded in healthcare systems accrue rich clinical data in broad patient populations. EHR systems centralize the data collected during routine patient care, including structured elements such as codes for International Classification of Diseases, medication prescriptions, and medical procedures, as well as free-text narrative documents such as physician notes and pathology reports that can be processed through natural language processing for analysis. EHR data are also often linked with biobanks, which provide additional rich molecular information to assist in developing comprehensive risk prediction models for a broad patient population.

Risk modeling with EHR data, however, is challenging for several reasons. First, precise information on the clinical outcome of interest, Y, is often embedded in free-text notes and requires manual effort to extract accurately. Readily available outcome surrogates S, such as diagnostic codes or mentions of the outcome, may be predictive of the true outcome Y but can deviate from the true label Y. Here we consider the general situation in which the vector of surrogates S consists of noisy, error-prone proxies of Y and may include non-informative surrogates. For example, using EHR data from Mass General Brigham, we found that the positive predictive value was only 0.48 for having at least 1 diagnosis code of Type II Diabetes Mellitus (T2DM) and 0.19 for having at least 1 mention of T2DM in medical notes. Directly using these EHR proxies as the true disease status to derive risk models may lead to substantial biases. On the other hand, extracting precise disease status requires manual chart review, which is not feasible at a large scale. It is thus of great interest to develop risk prediction models under a semi-supervised learning (SSL) framework using both a large unlabeled dataset of size N containing information on the predictors X along with the surrogates S and a small labeled dataset of size n with additional observations on Y curated via chart review. Throughout the paper, we impose no stringent model assumptions on the triplet (Y, X, S) while using generalized linear working models to define and estimate the risk prediction model (see Section 2).

Additional challenges arise from the high dimensionality of the predictor vector X and potential model mis-specification. Although much progress has been made in high-dimensional regression in recent years, there is a paucity of literature on high-dimensional inference under the SSL setting. Precise estimation of the high-dimensional risk model is even more challenging if the risk model is not sparse. Allowing the risk model to be dense is particularly important when X includes genomic markers, since a large number of genetic markers appear to contribute to the risk of complex traits (Frazer et al., 2009). For example, Vujkovic et al. (2020) recently identified 558 genetic variants as significantly associated with T2DM risk. An additional challenge arises when the fitted risk model is mis-specified, which occurs frequently in practice, especially in the high-dimensional setting. Model mis-specification can also cause the fitted model of Y ∣ X to be dense. There are limited methods currently available for making inference about high-dimensional risk prediction models in the SSL setting, especially under a possibly mis-specified dense model. In this paper, we fill this gap by proposing an efficient surrogate assisted SSL (SAS) prediction procedure that leverages the fully observed surrogates S to make inference about a high-dimensional risk model under such settings.

Our proposed estimation and inference procedures are as follows. For estimation, we first use the labelled data to fit a regularized imputation model with surrogates and high-dimensional covariates; then we impute the missing outcomes for the unlabeled data and fit the risk model using the imputed outcome and high-dimensional predictors. For inference, we devise a novel bias correction method, which corrects the bias due to the regularization for both imputation and estimation. Compared to existing literature, the key advantages of our proposed SAS procedure are

  1. Applicable to dense risk models Y ∣ X: we allow the working risk model for Y ∣ X to be dense as long as the working imputation model for Y ∣ (S, X) is sparse;

  2. Robustness to model mis-specification: the working models for both the risk prediction Y ∣ X and the imputation Y ∣ (S, X) can be mis-specified;

  3. Requires no assumptions on the measurement error in S as proxies of Y and allows S itself to be of high dimension;

  4. Our analysis of the Lasso with estimated inputs in the loss (see (6) and (20)) facilitates the consistency analysis for a dense model independently of the convergence rate of the consistently estimated inputs. The technique is an independent contribution to the high-dimensional statistics literature.

The sparsity assumption on the imputation model is less stringent since we anticipate that most information on Y can be well captured by the low-dimensional S, while the fitted model of Y ∣ X might be dense under possible model mis-specification. Our theory reveals that suitable use of unlabeled data may greatly relax the sparsity requirement on Y ∣ X. While most of the SSL literature emphasizes efficiency gains, our work opens a new direction of expanding estimability through SSL.

1.1. Related Literatures

Under the supervised setting where both Y and X are fully observed, much progress has been made in recent years in the area of high-dimensional inference. High-dimensional regression methods have been developed for commonly used generalized linear models under sparsity assumptions on the regression parameters (van de Geer and Bühlmann, 2009; Negahban et al., 2010; Huang and Zhang, 2012). Recently, Zhu and Bradic (2018b) studied the inference of linear combinations of coefficients under a dense linear model with a sparse precision matrix. Inference procedures have also been developed for both sparse (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014) and dense combinations of the regression parameters (Cai et al., 2019; Zhu and Bradic, 2018a). High-dimensional inference under the logistic regression model has also been studied recently (van de Geer et al., 2014; Ma et al., 2020; Guo et al., 2020).

Under the SSL setting with n ≪ N, however, there is a paucity of literature on high-dimensional inference. Although SSL can be viewed as a missing data problem, it differs from the standard missing data setting in a critical way: under the SSL setting, the missing probability tends to 1, which would violate a key assumption required in the missing data literature (e.g., Bang and Robins, 2005; Smucler et al., 2019; Chakrabortty et al., 2019). Existing work on SSL with high-dimensional covariates largely focuses on post-estimation inference on global parameters under sparse linear models, with examples including SSL estimation of the population mean (Zhang et al., 2019; Zhang and Bradic, 2021), the explained variance (Cai and Guo, 2020), and the average treatment effect (Cheng et al., 2018; Kallus and Mao, 2020). Our SAS procedure is among the first attempts to conduct semi-supervised inference on the high-dimensional coefficients and the individual prediction in a high-dimensional, dense and possibly mis-specified risk prediction model. In a concurrent work, Deng et al. (2020) studied efficient SSL estimation of high-dimensional linear models. Our work differs from theirs in at least three ways: 1) we consider the more flexible generalized linear models; 2) our setting involves the surrogates S, characterizing the imprecise data in EHR; 3) we study dense coefficients whose number of nonzero elements exceeds the number of labels. In high-dimensional regression with missing data, another line of work studied the estimation of linear models with missing or noisy covariates X (Loh and Wainwright, 2011; Belloni et al., 2017; Chandrasekher et al., 2020).

The surrogates S can be viewed conceptually as “mis-measured” proxies of the true outcome Y. Semi-supervised methods have been developed under the assumption that S depends on X only through Y, which essentially assumes an independent measurement error in S. For example, Gronsbell et al. (2019) studied the generalized linear risk prediction model using a mis-measured S. With a single S, Zhang et al. (2022) considered a high-dimensional generalized linear model for the prediction model, allowing the independence assumption to be slightly violated. Our SAS approach differs from the measurement error approach in two fundamental aspects: 1) typical measurement error approaches require S to be a single proxy outcome of the same type as Y, while our SAS approach allows a vector S of arbitrary types as long as some of its components are predictive of Y; 2) measurement error approaches impose stringent independence and model assumptions on the triplet (S, X, Y), while our SAS approach imposes neither. Violation of these two requirements may obstruct the deployment of measurement error methods or compromise their performance.

1.2. Organization of the Paper

The remainder of the paper is organized as follows. We introduce our population parameters and model assumptions in Section 2. In Section 3, we propose the SAS estimation method along with its associated inference procedures. In Section 4, we state the theoretical guarantees of the SAS procedures, whose proofs are provided in the Supplementary Materials. We also remark on the sparsity relaxation and the efficiency gain of the SSL. In Section 5, we present simulation results highlighting finite sample performance of the SAS estimators and comparisons to existing methods. In Section 6, we apply the proposed method to derive individual risk prediction for T2DM using EHR data from Mass General Brigham.

2. Settings and Notations

For the i-th observation, Y_i ∈ ℝ denotes the outcome variable, S_i ∈ ℝ^q denotes the surrogates for Y_i, and X_i ∈ ℝ^{p+1} denotes the high-dimensional covariates with the first element being the intercept. Under the SSL setting, we observe n independent and identically distributed (i.i.d.) labeled observations, ℒ = {(Y_i, X_i^⊤, S_i^⊤)^⊤, i = 1, …, n}, and N − n i.i.d. unlabeled observations, 𝒰 = {W_i = (X_i^⊤, S_i^⊤)^⊤, i = n+1, …, N}. We assume that the labeled subjects are randomly sampled by design and that the proportion of labelled samples is n/N = ρ ∈ (0, 1) with ρ → 0 as n → ∞. We focus on the high-dimensional setting where the dimensions p and q grow with n, allowing p + q to be larger than n. Motivated by our application, our main focus is on the setting where N is much larger than p, but our approach can be extended to the case where p exceeds N under specific conditions.

To predict Yi with Xi, we consider a possibly mis-specified working regression model with a known monotone and smooth link function g,

Y_i ∼ g(β^⊤X_i). (1)

We identify the target parameter as that of the most predictive working model, as measured by the pseudo log-likelihood ℓ(y, x):

β_0 = argmin_β −E[ℓ(Y_i, β^⊤X_i)],   ℓ(y, x) = yx − G(x),   G′(x) = g(x). (2)

Here we do not assume any model for the true conditional expectation E[Y_i ∣ X_i]. Our goal is to accurately estimate the high-dimensional parameter β_0, alternatively characterized by the first-order condition for (2),

E[X_i{Y_i − g(β_0^⊤X_i)}] = 0. (3)

Our procedure generally allows for a wide range of link functions; detailed requirements on g(·) and its anti-derivative G are given in Section 4. In our motivating example, Y is a binary indicator of T2DM status and g(x) = 1/(1 + e^{−x}) with G(x) = log(1 + e^x). We shall further construct confidence intervals for g(β_0^⊤x_new) for any x_new ∈ ℝ^{p+1}. The predicted outcome g(β_0^⊤x_new) can be interpreted as the maximum pseudo log-likelihood prediction under the working model g(β^⊤x_new). We make no assumption on the sparsity of β_0 relative to the number of labels n, and hence it is not feasible to perform valid supervised learning for β_0 when s_β = ‖β_0‖_0 > n.

We shall derive an efficient SSL estimate for β0 by leveraging 𝒰. To this end, we fit a working imputation model

Y_i ∼ g(γ^⊤W_i), (4)

whose limiting parameter is likewise defined as the most predictive working model

γ_0 = argmin_γ −E[ℓ(Y_i, γ^⊤W_i)],   equivalently   E[W_i{Y_i − g(γ_0^⊤W_i)}] = 0. (5)

The definition of γ_0 guarantees

E[X_i{Y_i − g(γ_0^⊤W_i)}] = 0, (6)

and hence, if we impute Y_i by Y_i^* = g(γ_0^⊤W_i), we have E[X_i{Y_i^* − g(β_0^⊤X_i)}] = 0 regardless of the adequacy of the imputation model (4) for the conditional mean E[Y_i ∣ W_i]. It is thus feasible to carry out an SSL procedure by first deriving an estimate of γ_0 using the labelled data ℒ and then regressing the imputed outcome against X_i using the whole data ℒ ∪ 𝒰. Although we do not require β_0 to be sparse or any of the fitted models to hold, we do assume that γ_0 defined in (5) is sparse. When the surrogates S are strongly predictive of the outcome, the sparsity assumption on γ_0 is reasonable since the majority of the information in Y can be captured by S.
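As a one-line verification of this claim (restating only (3) and (6)), adding and subtracting Y_i inside the expectation gives

\mathbb{E}\bigl[X_i\{Y_i^{\ast} - g(\beta_0^{\top}X_i)\}\bigr]
  = \underbrace{\mathbb{E}\bigl[X_i\{g(\gamma_0^{\top}W_i) - Y_i\}\bigr]}_{=0\ \text{by (6)}}
  + \underbrace{\mathbb{E}\bigl[X_i\{Y_i - g(\beta_0^{\top}X_i)\}\bigr]}_{=0\ \text{by (3)}} = 0,

so the target moment condition is preserved by imputing with the limiting parameter γ_0 even when both working models are misspecified.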

Notations. We focus on the setting where min{n, p + q, N} → ∞. For convenience, we shall use n → ∞ in the asymptotic analysis. For two sequences of random variables A_n and B_n, we use A_n = O_p(B_n) and A_n = o_p(B_n) to denote lim_{c→∞} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0 and lim_{c→0} lim_{n→∞} P(|A_n| ≥ c|B_n|) = 0, respectively. For two positive sequences a_n and b_n, a_n = O(b_n) or b_n ≳ a_n means that there exists C > 0 such that a_n ≤ C b_n for all n; a_n ≍ b_n if a_n = O(b_n) and b_n = O(a_n); and a_n ≪ b_n or a_n = o(b_n) if limsup_n a_n/b_n = 0. We use Z_n ⇝ N(0, 1) to denote that the sequence of random variables Z_n converges in distribution to a standard normal random variable.

3. Methodology

3.1. SAS Estimation of β0

The SAS estimation procedure for β_0 consists of two key steps: (i) fitting the imputation model to ℒ to obtain the estimate γ̂ of γ_0 defined in (5); and (ii) estimating β_0 in (3) by fitting the imputed outcome Ŷ_i = g(γ̂^⊤W_i) against X_i over ℒ ∪ 𝒰. In both steps, we devise Lasso-type estimators to deal with the high dimensionality of X. In principle, other types of variable selection methods, e.g., SCAD (Fan and Li, 2001) or the square-root Lasso (Belloni et al., 2011), may also be used. We use the Lasso as the example for its simplicity. A further discussion on the choice of regularized estimators is given in Remark 6.

In Step (i), we estimate γ0 by the L1 regularized pseudo log-likelihood estimator γ^, defined as

γ̂ = argmin_{γ∈ℝ^{p+q+1}} ℓ_imp(γ) + λ_γ‖γ_{−1}‖_1  with  λ_γ ≍ √(log(p+q)/n), (7)

where a_{−1} denotes the sub-vector of a vector a containing all coefficients except for the intercept, and

ℓ_imp(γ) = −(1/n) Σ_{i=1}^n ℓ(Y_i, γ^⊤W_i)  with ℓ(y, x) defined in (2). (8)

The imputation loss (8) corresponds to the negative log-likelihood when Y is binary, the imputation model holds, and g is the anti-logit link. With γ̂, we impute the unobserved outcomes for subjects in 𝒰 as Ŷ_i = g(γ̂^⊤W_i), for n + 1 ≤ i ≤ N.
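As a rough illustrative sketch (not the authors' implementation), Step (i) with the logistic link can be carried out with an off-the-shelf L1-penalized logistic regression; the function and variable names below (fit_imputation_model, W_lab, y_lab, lam_gamma) are hypothetical, and the mapping between λ_γ and sklearn's C reflects the (1/n)-scaled loss in (7)–(8).

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_imputation_model(W_lab, y_lab, lam_gamma):
    """L1-penalized logistic fit of Y on W = (X, S) over the n labeled samples, cf. (7)."""
    n = W_lab.shape[0]
    # sklearn's saga solver minimizes ||coef||_1 + C * sum_i(log-loss_i) with an unpenalized
    # intercept; C = 1 / (n * lam_gamma) matches (1/n)*loss + lam_gamma * ||gamma_{-1}||_1.
    model = LogisticRegression(penalty="l1", C=1.0 / (n * lam_gamma),
                               solver="saga", fit_intercept=True, max_iter=5000)
    model.fit(W_lab, y_lab)
    return np.concatenate([model.intercept_, model.coef_.ravel()])

def impute_outcomes(W_unlab, gamma_hat):
    """Imputed outcomes Y_hat = g(gamma_hat' W) with g the expit link."""
    return 1.0 / (1.0 + np.exp(-(gamma_hat[0] + W_unlab @ gamma_hat[1:])))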

In Step (ii), we estimate β_0 by β̂ = β̂(γ̂), defined as

β̂(γ̂) = argmin_{β∈ℝ^{p+1}} ℓ(β; γ̂) + λ_β‖β_{−1}‖_1  with  λ_β ≍ √(log p/N), (9)

where ℓ(β; γ̂) is the imputed pseudo log-likelihood loss:

ℓ(β; γ̂) = −(1/N){Σ_{i>n} ℓ(Ŷ_i, β^⊤X_i) + Σ_{i=1}^n ℓ(Y_i, β^⊤X_i)}  with ℓ(y, x) defined in (2). (10)

We denote the loss based on the complete-data pseudo log-likelihood of the full data by

ℓ_PL(β) = −(1/N) Σ_{i=1}^N ℓ(Y_i, β^⊤X_i), (11)

and define the gradients of the various losses (8)–(11) as

ℓ̇_imp(γ) = ∇_γ ℓ_imp(γ),   ℓ̇_PL(β) = ∇_β ℓ_PL(β),   ℓ̇(β; γ) = ∇_β ℓ(β; γ). (12)
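Continuing the sketch above, Step (ii) regresses the labeled Y and the imputed Ŷ on X over all N subjects. Because the imputed outcomes are fractional, one convenient device (an assumption of this sketch, not prescribed by the paper) is to split each observation into a weighted pseudo-success and pseudo-failure, whose weighted logistic log-likelihood coincides with the imputed loss in (10); names such as fit_risk_model and lam_beta are again illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_risk_model(X_all, y_or_yhat, lam_beta):
    """L1-penalized fit of the imputed loss (10): y_or_yhat holds Y for labeled rows
    and Y_hat = g(gamma_hat' W) for unlabeled rows."""
    N = X_all.shape[0]
    y_frac = np.asarray(y_or_yhat, dtype=float)
    # Each row contributes y*log(p) + (1-y)*log(1-p); encode it as two weighted 0/1 rows.
    X_big = np.vstack([X_all, X_all])
    y_big = np.concatenate([np.ones(N), np.zeros(N)])
    w_big = np.concatenate([y_frac, 1.0 - y_frac])
    keep = w_big > 0                      # drop zero-weight copies (binary labeled rows)
    model = LogisticRegression(penalty="l1", C=1.0 / (N * lam_beta),
                               solver="saga", fit_intercept=True, max_iter=5000)
    model.fit(X_big[keep], y_big[keep], sample_weight=w_big[keep])
    return np.concatenate([model.intercept_, model.coef_.ravel()])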

3.2. SAS Inference for Individual Prediction

Since g(·) is specified, inference on g(x_new^⊤β) follows immediately from inference on x_new^⊤β. We shall consider inference on the standardized linear prediction x_std^⊤β with the standardized covariates

x_std = x_new/‖x_new‖_2

and then scale the confidence interval back. In this way, the scaling by ‖x_new‖_2 is made explicit in the expression of the confidence interval.

The estimation error of β̂ can be decomposed into two components corresponding to the respective errors associated with (7) and (9). Specifically, we write

β̂ − β_0 = {β̄(γ̂) − β_0} + {β̂ − β̄(γ̂)}, (13)

where β̄(γ̂) is defined as the minimizer of the expected imputed loss conditional on the labeled data ℒ, that is,

β̄(γ̂) = argmin_{β∈ℝ^{p+1}} E[ℓ(β; γ̂) ∣ ℒ]. (14)

The term β̄(γ̂) − β_0 captures the error from the imputation model in (7), while the term β̂ − β̄(γ̂) captures the error from the prediction model in (9) given the imputation model parameter γ̂. As ℓ1 penalization is involved in both steps, we shall correct the regularization bias from the two sources. Following the typical one-step debiased LASSO (Zhang and Zhang, 2014), the bias β̂ − β̄(γ̂) is estimated by Θ̂ ℓ̇(β̂; γ̂), where Θ̂ is an estimator of {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1}, the inverse Hessian of ℓ(·; γ̂) at β = β_0.

The bias correction for β̄(γ̂) − β_0 requires some innovation, since we need to conduct the bias correction for a nonlinear functional β̄(·) of the LASSO estimator γ̂, which has not been studied in the literature. We identify β̄(γ̂) and β_0 by the first-order moment conditions,

β̄(γ̂):  E_{i>n}[X_i{g(β̄(γ̂)^⊤X_i) − g(γ̂^⊤W_i)}] = 0,    β_0:  E[X_i{g(β_0^⊤X_i) − Y_i}] = E[X_i{g(β_0^⊤X_i) − g(γ_0^⊤W_i)}] = 0. (15)

Here E_{i>n}[·] denotes the conditional expectation of a single copy of the unlabeled data given the labelled data. Equating the two estimating equations in (15) and applying a first-order approximation, we approximate the difference β̄(γ̂) − β_0 by

β̄(γ̂) − β_0 ≈ −{E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} E_{i>n}[X_i{g(γ_0^⊤W_i) − g(γ̂^⊤W_i)}]. (16)

Together with the bias correction for β̂ − β̄(γ̂), this motivates the debiased estimator

β̂ − {(1−ρ)/n} Σ_{i=1}^n Θ̂X_i{g(γ̂^⊤W_i) − Y_i} − Θ̂ ℓ̇(β̂; γ̂).

The (1 − ρ) factor, which tends to one when n is much smaller than N, comes from the proportion of unlabeled data whose missing outcomes are imputed.

For theoretical considerations, we devise a cross-fitting scheme in our debiasing process. We split the labelled and unlabeled data into K folds of approximately equal size, respectively. The number of folds does not grow with the dimension (e.g., K = 10). We denote the index sets for the folds of the labelled data ℒ as ℐ_1, …, ℐ_K, and those of the unlabeled data 𝒰 as 𝒥_1, …, 𝒥_K. We denote the respective sizes of each fold in the labelled data and the full data as n_k = |ℐ_k| and N_k = n_k + |𝒥_k|, where |𝒜| denotes the cardinality of 𝒜. Define ℐ_k^c = {1, …, n}∖ℐ_k and 𝒥_k^c = {n+1, …, N}∖𝒥_k. For each labelled fold ℐ_k, we fit the imputation model with the out-of-fold labelled samples:

γ̂^(k) = argmin_{γ∈ℝ^{p+q+1}} −(n − n_k)^{−1} Σ_{i∈ℐ_k^c} ℓ(Y_i, γ^⊤W_i) + λ_γ‖γ_{−1}‖_1. (17)

Using γ̂^(k), we fit the prediction model with the out-of-fold data ℐ_k^c ∪ 𝒥_k^c:

β̂^(k) = argmin_{β∈ℝ^{p+1}} −(N − N_k)^{−1} {Σ_{i∈𝒥_k^c} ℓ(g(γ̂^(k)⊤W_i), β^⊤X_i) + Σ_{i∈ℐ_k^c} ℓ(Y_i, β^⊤X_i)} + λ_β‖β_{−1}‖_1. (18)

To estimate the projection

u_0 = {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} x_std, (19)

we propose an L1-penalized estimator

û^(k) = argmin_{u∈ℝ^{p+1}} (N − N_k)^{−1} Σ_{k′≠k} Σ_{i∈ℐ_{k′}∪𝒥_{k′}} (1/2) g′(β̂^(k,k′)⊤X_i)(X_i^⊤u)² − u^⊤x_std + λ_u‖u‖_1, (20)

where β̂^(k,k′) is trained with the samples outside folds k and k′,

β̂^(k,k′) = argmin_{β∈ℝ^{p+1}} −(N − N_k − N_{k′})^{−1} {Σ_{i∈(𝒥_k∪𝒥_{k′})^c} ℓ(g(γ̂^(k,k′)⊤W_i), β^⊤X_i) + Σ_{i∈(ℐ_k∪ℐ_{k′})^c} ℓ(Y_i, β^⊤X_i)} + λ_β‖β_{−1}‖_1, (21)

with

γ̂^(k,k′) = argmin_{γ∈ℝ^{p+q+1}} −(n − n_k − n_{k′})^{−1} Σ_{i∈(ℐ_k∪ℐ_{k′})^c} ℓ(Y_i, γ^⊤W_i) + λ_γ‖γ_{−1}‖_1.

The estimators in (21) take similar forms to those in (17) and (18), except that their training samples exclude two folds of data, ℐ_k ∪ 𝒥_k and ℐ_{k′} ∪ 𝒥_{k′}. In the summand of (20), the data (Y_i, X_i^⊤, S_i^⊤) in fold k′ ≠ k, i.e. i ∈ ℐ_{k′} ∪ 𝒥_{k′}, are independent of β̂^(k,k′), which is trained without folds k and k′. The estimation of u requires an estimator of β, and both estimators are subsequently used in the debiasing step. Using the same set of data multiple times for β̂, û, debiasing and variance estimation may induce over-fitting bias, so we implement the cross-fitting scheme to reduce the over-fitting bias. As a remark, cross-fitting might not be necessary for the theory under additional assumptions and/or with empirical process techniques.
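To make the fold bookkeeping in (17)–(18) concrete, here is a minimal cross-fitting sketch under the same assumptions and naming conventions as the earlier snippets (fit_imputation_model, impute_outcomes, fit_risk_model); it is illustrative only and omits the additional leave-two-folds-out fits in (20)–(21).

import numpy as np

def crossfit_sas(X_lab, S_lab, y_lab, X_unlab, S_unlab, lam_gamma, lam_beta, K=10, seed=0):
    """Return per-fold (gamma_hat^(k), beta_hat^(k)) and the fold index sets."""
    rng = np.random.default_rng(seed)
    lab_folds = np.array_split(rng.permutation(len(y_lab)), K)
    unlab_folds = np.array_split(rng.permutation(X_unlab.shape[0]), K)
    W_lab, W_unlab = np.hstack([X_lab, S_lab]), np.hstack([X_unlab, S_unlab])
    gammas, betas = [], []
    for k in range(K):
        lab_out = np.setdiff1d(np.arange(len(y_lab)), lab_folds[k])            # I_k^c
        unlab_out = np.setdiff1d(np.arange(X_unlab.shape[0]), unlab_folds[k])  # J_k^c
        g_k = fit_imputation_model(W_lab[lab_out], y_lab[lab_out], lam_gamma)
        y_hat = impute_outcomes(W_unlab[unlab_out], g_k)
        X_out = np.vstack([X_unlab[unlab_out], X_lab[lab_out]])
        y_out = np.concatenate([y_hat, y_lab[lab_out]])
        betas.append(fit_risk_model(X_out, y_out, lam_beta))
        gammas.append(g_k)
    return gammas, betas, lab_folds, unlab_folds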

We obtain the cross-fitted debiased estimator of x_std^⊤β_0, denoted \widehat{x_std^⊤β}, defined as

\widehat{x_std^⊤β} = (1/K) Σ_{k=1}^K x_std^⊤β̂^(k) − (1/N) Σ_{k=1}^K Σ_{i∈𝒥_k} û^(k)⊤X_i{g(β̂^(k)⊤X_i) − g(γ̂^(k)⊤W_i)} − (1/n) Σ_{k=1}^K Σ_{i∈ℐ_k} û^(k)⊤X_i{(1−ρ)g(γ̂^(k)⊤W_i) + ρg(β̂^(k)⊤X_i) − Y_i}. (22)

The second term corrects the bias β̄(γ̂) − β_0 and the third term corrects the bias β̂ − β̄(γ̂). The corresponding variance estimator is

V̂_SAS = (1/n) Σ_{k=1}^K Σ_{i∈ℐ_k} (û^(k)⊤X_i)²{(1−ρ)g(γ̂^(k)⊤W_i) + ρg(β̂^(k)⊤X_i) − Y_i}² + (ρ²/n) Σ_{k=1}^K Σ_{i∈𝒥_k} (û^(k)⊤X_i)²{g(β̂^(k)⊤X_i) − g(γ̂^(k)⊤W_i)}². (23)

Through the link g and the scaling factor ‖x_new‖_2, we estimate g(x_new^⊤β_0) by g(‖x_new‖_2 \widehat{x_std^⊤β}) and construct the (1 − α) × 100% confidence interval for g(x_new^⊤β_0) as

[ g(‖x_new‖_2{\widehat{x_std^⊤β} − z_{α/2}√(V̂_SAS/n)}),  g(‖x_new‖_2{\widehat{x_std^⊤β} + z_{α/2}√(V̂_SAS/n)}) ], (24)

where z_{α/2} is the (1 − α/2) quantile of the standard normal distribution.
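Putting (22)–(24) together, the following sketch computes the cross-fitted debiased estimate, its variance estimate and the confidence interval, taking the per-fold γ̂^(k), β̂^(k) and projection directions û^(k) (the solutions of (20), not computed here) as inputs. It assumes that coefficient vectors store the intercept first, that the design matrices carry no intercept column, and that x_new includes a leading 1; all function and variable names are hypothetical.

import numpy as np
from scipy.stats import norm

g = lambda t: 1.0 / (1.0 + np.exp(-t))          # logistic link
lin = lambda A, c: c[0] + A @ c[1:]             # coefficient vectors carry the intercept first

def sas_debias_ci(x_new, gammas, betas, u_hats, X_lab, W_lab, y_lab,
                  X_unlab, W_unlab, lab_folds, unlab_folds, alpha=0.05):
    """Cross-fitted debiased estimate of x_std'beta_0 with variance and CI for g(x_new'beta_0)."""
    K, n = len(gammas), len(y_lab)
    N = n + X_unlab.shape[0]
    rho = n / N
    x_std = np.asarray(x_new, dtype=float) / np.linalg.norm(x_new)
    plug, corr_unlab, corr_lab, v_lab, v_unlab = [], 0.0, 0.0, 0.0, 0.0
    for k in range(K):
        gk, bk, uk = gammas[k], betas[k], u_hats[k]
        Jl, Ju = lab_folds[k], unlab_folds[k]
        uXl, uXu = lin(X_lab[Jl], uk), lin(X_unlab[Ju], uk)
        # residual used to correct the imputation-step bias (unlabeled fold)
        ru = g(lin(X_unlab[Ju], bk)) - g(lin(W_unlab[Ju], gk))
        # residual used to correct the regularization bias of beta_hat (labeled fold)
        rl = (1 - rho) * g(lin(W_lab[Jl], gk)) + rho * g(lin(X_lab[Jl], bk)) - y_lab[Jl]
        plug.append(x_std @ bk)
        corr_unlab += np.sum(uXu * ru)
        corr_lab += np.sum(uXl * rl)
        v_lab += np.sum(uXl ** 2 * rl ** 2)
        v_unlab += np.sum(uXu ** 2 * ru ** 2)
    theta = np.mean(plug) - corr_unlab / N - corr_lab / n      # debiased x_std'beta, cf. (22)
    V = v_lab / n + rho ** 2 * v_unlab / n                     # variance estimator, cf. (23)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(V / n)
    s = np.linalg.norm(x_new)
    return g(s * theta), (g(s * (theta - half)), g(s * (theta + half)))   # point estimate and CI (24)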

4. Theory

We introduce assumptions required for both estimation and inference in Section 4.1. We state our theories for estimation and inference, respectively in Sections 4.2 and 4.3.

4.1. Assumptions

We assume the complete data consist of i.i.d. copies of (Y_i, X_i^⊤, S_i^⊤)^⊤ for i = 1, …, N. In our focused SSL setting, only the first n outcome labels Y_1, …, Y_n are observed. Under the i.i.d. assumption, our SSL setting is equivalent to the missing completely at random (MCAR) assumption. The sparsities of γ_0, β_0 and u_0 are denoted by

s_γ = ‖γ_0‖_0,   s_β = ‖β_0‖_0,   s_u = ‖u_0‖_0.

We focus on the setting with n, p + q, N → ∞, with n allowed to be smaller than p + q. We allow s_γ, s_β and s_u to grow with n, p + q and N, subject to s_γ ≪ n and s_β + s_u ≪ N. While our method and theory adaptively apply to both the SSL (N ≫ n) and missing data (N ≍ n) settings without prior knowledge of the limit of n/N, we emphasize the SSL (N ≫ n) setting, which matches our motivating EHR studies and is also less studied in the literature. To achieve the sharper dimension conditions, we consider the sub-Gaussian design as in Portnoy (1984, 1985); Negahban et al. (2010). We denote the sub-Gaussian norm for both random variables and random vectors as ‖·‖_{ψ_2}. The detailed definition is given in Appendix D.

Assumption 1 For constants ν_1, ν_2 and M independent of n, p and N,

  1. the residuals Y_i − g(γ_0^⊤W_i) and Y_i − g(β_0^⊤X_i) are sub-Gaussian random variables with sub-Gaussian norms bounded as ‖Y_i − g(γ_0^⊤W_i)‖_{ψ_2} ≤ ν_1 and ‖Y_i − g(β_0^⊤X_i)‖_{ψ_2} ≤ ν_2;

  2. the link function g satisfies the monotonicity and smoothness conditions: inf_{x∈ℝ} g′(x) ≥ 0, sup_{x∈ℝ} g′(x) < M and sup_{x∈ℝ} |g″(x)| < M.

Under our motivating example with a binary Y_i and g(x) = e^x/(1 + e^x), conditions 1a and 1b are satisfied. The conditions are also satisfied for the probit link and the identity link. Condition 1a is universal for high-dimensional regression. Admittedly, the Lipschitz requirement in 1b rules out some generalized linear model links with unbounded derivatives, such as the exponential link, but we may substitute the condition by assuming a bounded X_i.

Assumption 2 For constants σ_max² and σ_min² independent of n, p, N,

  1. W_i is a sub-Gaussian vector with sub-Gaussian norm ‖W_i‖_{ψ_2} ≤ σ_max/2;

  2. The weak overlapping condition holds at the population parameters β_0 and γ_0:
    1. inf_{‖v‖_2=1} v^⊤E[{g′(β_0^⊤X_i) ∧ 1} X_iX_i^⊤]v ≥ σ_min²,
    2. inf_{‖v‖_2=1} v^⊤E[{g′(γ_0^⊤W_i) ∧ 1} W_iW_i^⊤]v ≥ σ_min²;
  3. The non-degeneracy of the average residual variance:

inf_{‖v‖_2=1} E[{Y_i − (1−ρ)g(γ_0^⊤W_i) − ρg(β_0^⊤X_i)}²(X_i^⊤v)²] ≥ σ_min².

Assumption 2a is typical for high-dimensional regression (Negahban et al., 2010), and it also implies a bounded maximal eigenvalue of the second moment,

sup_{‖v‖_2=1} v^⊤E[W_iW_i^⊤]v ≤ σ_max².

Notably, we do not require two conditions that are common for high-dimensional generalized linear models (Huang and Zhang, 2012; van de Geer et al., 2014): 1) an upper bound on sup_{i=1,…,N}‖X_i‖_∞; 2) a lower bound on inf_{i=1,…,N} g′(β_0^⊤X_i), often known as the overlapping condition for the logistic regression model. Compared to the overlapping condition under logistic regression that g′(β_0^⊤X_i) and g′(γ_0^⊤W_i) are bounded away from zero, our Assumptions 2b and 2c are weaker because they are implied by the typical minimal eigenvalue condition

inf_{‖v‖_2=1} v^⊤E[W_iW_i^⊤]v ≥ σ_min²

plus the overlapping condition.

4.2. Consistency of the SAS Estimation

We now state the L2 and L1 convergence rates of our proposed SAS estimator.

Theorem 1 (Consistency of SAS estimation) Under Assumptions 1, 2 and with

s_γ = o(n/log(p+q)),   s_β = o(N/log p),   λ_β ≍ √(log p/N), (25)

we have

‖β̂ − β_0‖_2 = O_p(√s_β λ_β + (1−ρ)√(s_γ log(p+q)/n)),
‖β̂ − β_0‖_1 = O_p(s_β λ_β + (1−ρ)² s_γ log(p+q)/(nλ_β)).

Remark 2 The dimension requirement for our SAS estimator to achieve L2 consistency significantly weakens the existing dimension requirement in the supervised setting (Negahban et al., 2010; Huang and Zhang, 2012; Bühlmann and Van De Geer, 2011; Bickel et al., 2009). With λ_β ≍ √(log(p)/N), Theorem 1 implies the L2 consistency of β̂ under the dimension condition

(1−ρ)² s_γ log(p+q)/n + s_β log(p)/N = o(1). (26)

When N ≫ n, our requirement on the sparsity of β_0, s_β = o(N/log(p)), is significantly weaker than s_β = o(n/log(p)), which is known as the fundamental sparsity limit for identifying the high-dimensional regression vector in the supervised setting. Theorem 1 indicates that, with assistance from the surrogates S observed in 𝒰, the SAS procedure allows s_β > n provided that N is sufficiently large and the imputation model is sparse. This distinguishes our result from most estimation results in high-dimensional supervised settings. In the SSL literature, the utility of unlabeled data for relaxing the sparsity condition has not previously been recognized.

Remark 3 In the context of Theorem 1, a sparse imputation model, often induced by a small number of highly predictive surrogates, is essential for an optimal estimation rate. When s_β > s_γ, the L2 rate in Theorem 1 has two components: √(s_β log(p)/N), the minimax rate for learning β from all N observations, and √(s_γ log(p+q)/n), the minimax rate for learning γ from the labeled data (Raskutti et al., 2011). Thus, the rate cannot be further improved when the sparser imputation model is used to identify the denser β without additional conditions.

Remark 4 If the L1 consistency is of interest, the penalty levels are chosen as

λ_β ≍ max{√(log p/N), √(s_γ/s_β) λ_γ}, (27)

which produces the L1 estimation rate from Theorem 1

‖β̂ − β_0‖_1 = O_p(s_β√(log(p)/N) + √(s_γ s_β log(p)/n)).

Compared to the condition for L1 consistency under supervised learning, s_β = o(n/log(p)), the condition from SAS estimation, s_β = o(min{n/(s_γ log(p)), √(N/log(p))}), allows a denser β_0 in the setting with a very sparse γ_0 and a large unlabeled sample. On the other hand, the L2 estimation rate in Theorem 1 remains the same if

√(log(p)/N) ≲ λ_β ≲ max{√(log p/N), √(s_γ/s_β) λ_γ}.

Our theory on the SAS inference procedure uses the L2 instead of the L1 consistency.

Theorem 1 implies the following prediction consistency result.

Corollary 5 (Consistency of individual prediction) Suppose x_new is a sub-Gaussian random vector satisfying sup_{‖v‖_2=1} v^⊤E[x_new x_new^⊤]v ≤ σ_max². Under the conditions of Theorem 1, we have

g(β̂^⊤x_new) − g(β_0^⊤x_new) = O_p(‖β̂ − β_0‖_2) = o_p(1).

The concentration result of Corollary 5 is established with respect to the joint distribution of the data and the new observation x_new. This is in sharp contrast to the individual prediction conditional on any new observation x_new. If the goal is to conduct inference for any given x_new, the theoretical justification is provided in the following Theorem 7 and Corollary 8.

Remark 6 Other types of penalties shown to provide L2-consistent estimation of the working imputation model can substitute for the Lasso penalty in (7), since the L2 rate of ‖γ̂ − γ_0‖_2 is the only property invoked for γ̂ in the proof of Theorem 1. For example, we may choose the square-root Lasso (Belloni et al., 2011) with pivotal recovery under linear models with the identity link g(x) = x. Changing the Lasso penalty in (9), however, might require a different proof to produce the stated estimation rate adaptive to arbitrary s_β/N and s_γ/n, covering both the s_β/N ≲ s_γ/n and s_β/N ≳ s_γ/n settings (Cases 1 and 2 in the proof of Theorem 1). If only the setting guaranteed by a very large N is of interest, other penalties for β̂ can work equally well (by adapting Case 1 in the proof of Theorem 1).

4.3. √n-Inference with the Debiased SAS Estimator

We state the validity of our SSL inference in Theorem 7. We use A ⇝ B to denote that the random variable A converges in distribution to the distribution B.

Theorem 7 (SAS Inference) Let xnew be the random vector representing the covariate of a new individual. Under Assumptions 1, 2 and the dimension condition

(1−ρ)⁴ s_γ² log²(p+q)/n + ρ(s_β² + s_β s_u) log²(p)/N + (1−ρ)² s_γ s_u log(p+q) log(p)/N = o(1), (28)

we draw inference on x_new^⊤β_0 conditionally on x_new according to

√n V̂_SAS^{−1/2} {\widehat{x_std^⊤β} − x_new^⊤β_0/‖x_new‖_2} ∣ x_new ⇝ N(0, 1),

where V̂_SAS defined in (23) is the estimator of the asymptotic variance

V_SAS = E[(u_0^⊤X_i)²{Y_i − (1−ρ)g(γ_0^⊤W_i) − ρg(β_0^⊤X_i)}²] + ρ(1−ρ) E[(u_0^⊤X_i)²{g(γ_0^⊤W_i) − g(β_0^⊤X_i)}²],

with

u_0 = Θ_0 x_new/‖x_new‖_2 = {E[g′(β_0^⊤X_i)X_iX_i^⊤]}^{−1} x_new/‖x_new‖_2. (29)

By Young’s inequality, the condition (28) is implied by

(1−ρ)⁴ s_γ² log²(p+q)/n + ρ(s_β + s_u)² log²(p)/N = o(1). (30)

When p is much smaller than the full sample size N, our condition (30) allows the sparsity levels of β_0 and u_0 to be as large as p. Even if p is larger than N, our SAS inference procedure is valid if s_β + s_u ≪ √N/log(p). In the literature on confidence interval construction in the high-dimensional supervised setting, a valid inference procedure for a single regression coefficient in linear regression requires s_β ≪ √n/log(p) (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014). Such a sparsity condition has been shown to be necessary for constructing a confidence interval of parametric rate (Cai and Guo, 2017). We have leveraged the unlabeled data to significantly relax this fundamental limit of statistical inference from s_β ≪ √n/log(p) to s_β ≪ N/{√n log(p)}. The large amount of unlabelled data validates statistical inference for a dense model in high dimensions.

The sparsity of u_0 is determined by x_new and the precision matrix Θ_0. In the supervised learning setting, for confidence interval construction for a single regression coefficient, van de Geer et al. (2014) requires s_u ≪ n/log(p). According to (30), our SAS inference requires s_u ≪ N/{√n log(p)}, which can be weaker than s_u ≪ n/log(p) if the amount of unlabeled data is larger than n². Theorem 7 implies that our proposed CI in (24) is valid in terms of coverage, which is summarized in the following corollary.

Corollary 8 Under Assumptions 1 and 2, as well as (28), the CI defined in (24) satisfies,

P[ g(‖x_new‖_2{\widehat{x_std^⊤β} − z_{α/2}√(V̂_SAS/n)}) ≤ g(x_new^⊤β_0) ≤ g(‖x_new‖_2{\widehat{x_std^⊤β} + z_{α/2}√(V̂_SAS/n)}) ] = 1 − α + o(1),

and the length of the CI is of the order 2 g′(x_new^⊤β_0) ‖x_new‖_2 z_{α/2}√(V_SAS/n), where V_SAS is the asymptotic variance defined in Theorem 7.

Confidence interval construction for g(x_new^⊤β_0) in the high-dimensional supervised setting has recently been studied in Guo et al. (2020). Guo et al. (2020) assumes the prediction model to be correctly specified as a high-dimensional sparse logistic regression, and their inference procedure is valid if s_β ≪ √n/log(p). In contrast, we leverage the unlabeled data to allow for a mis-specified prediction model and a dense regression vector, as long as the dimension requirement in (28) is satisfied.

4.4. Efficiency comparison of SAS Inference

Efficiency in the high-dimensional setting, or in the SSL setting where the proportion of labelled data decays to zero, has yet to be formalized. Here we use the efficiency bound in the classical low-dimensional setting with a fixed ρ as the benchmark. Apart from the relaxation of various sparsity conditions, we illustrate next that our SAS inference achieves decent efficiency with a properly specified imputation model, compared to supervised learning and to the benchmark.

Similar to the phenomenon discovered by Chakrabortty and Cai (2018), if the imputation model is correct, we can guarantee the efficiency gain by SAS inference in comparison to the asymptotic variance of the supervised learning,

V_SL = E[(u_0^⊤X_i)²{Y_i − g(β_0^⊤X_i)}²]. (31)

Proposition 9 If E[Y_i ∣ S_i, X_i] = g(γ_0^⊤W_i), we have V_SL ≥ V_SAS.

Moreover, we can show that our SAS inference attains the benchmark efficiency derived from the classical fixed-ρ setting (Tsiatis, 2007). To simplify the derivation, we describe the missing-completely-at-random mechanism through binary observation indicators R_i, i = 1, …, N, independent of Y_i, X_i and S_i. We still denote the proportion of labelled data as ρ = E[R_i]. The unsorted data take the form

𝒟 = {D_i = (X_i^⊤, S_i^⊤, R_i, R_iY_i)^⊤, i = 1, …, N}.

We consider the following class of complete data semi-parametric models

ℳ_comp = {f_{X,Y,S,R}(x, y, s, r) = f_X(x) ρ^r(1−ρ)^{1−r} f_{Y∣S,X}(y∣s,x) f_{S∣X}(s∣x) : f_{Y∣S,X}, f_X, f_{S∣X} are arbitrary densities}, (32)

and establish the efficiency bounds for regular asymptotically linear (RAL) estimators under ℳ_comp by deriving the associated efficient influence function in the following proposition. We denote the nuisance parameters for f_{Y∣S,X}, f_X and f_{S∣X} collectively as η, and use η_0 to denote the true underlying nuisance parameter that generates the data. The parameter of interest β_0 is not part of the model ℳ_comp but is defined implicitly through the moment condition (3).

Proposition 10 The efficient influence function for θ = x_std^⊤β_0 under ℳ_comp is

φ_eff(D_i; θ_0, η_0) = (R_i/ρ) u_0^⊤X_i{Y_i − E[Y_i ∣ S_i, X_i]} − u_0^⊤X_i{g(β_0^⊤X_i) − E[Y_i ∣ S_i, X_i]}.

Under the assumptions of Theorem 7 and additionally E[Y_i ∣ S_i, X_i] = g(γ_0^⊤W_i), our SAS debiased estimator admits the same influence function,

\widehat{x_std^⊤β} − x_new^⊤β_0/‖x_new‖_2 = N^{−1} Σ_{i=1}^N φ_eff(D_i; θ_0, η_0) + o_p({ρN}^{−1/2}),

according to Appendix B3 Step 2 (A.31).

5. Simulation

We have conducted extensive simulation studies to evaluate the finite sample performance of the SAS estimation and inference procedures under various scenarios. Throughout, we let p = 500, q = 100, N = 20000 and consider n = 500. The signals in β are varied to be approximately sparse or fully dense with a mixture of strong and weak signals. The surrogates S are either moderately or strongly predictive of Y, as specified below. For each configuration, we summarize the results based on 500 simulated datasets. We compare our SAS procedure with the supervised LASSO (SLASSO) that (1) estimates β_0 by regressing Y on X over the labeled data with the Lasso; and (2) draws inference on x_new^⊤β_0 with the one-step debiased Lasso (van de Geer et al., 2014).

To mimic the zero-inflated discrete distribution of EHR features, we first generate Z^x_{i,1}, …, Z^x_{i,p}, Z^u_i, Z^s_{i,1}, …, Z^s_{i,q} independently from N(0, 25). Then we construct X_i from (Z^u_i, Z^x_i) with Z^x_i = (Z^x_{i,1}, …, Z^x_{i,p})^⊤ via the transformation ς(z) = log{1 + exp(z)}:

X_{i,1} = {ς(Σ_{j=2}^p 2X_{i,j}/(p−1) + Z^x_{i,1}/2) − μ_X}/σ_X,
X_{i,j} = {ς(Z^x_{i,j}(1 − p^{−1}) + Z^u_i/p) − μ_X}/σ_X,   j = 2, …, p.

We standardize X_{i,j} to have roughly mean zero and unit variance with μ_X = 1.80 and σ_X = 2.74. The shared term Z^u_i induces correlation among the covariates.
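For concreteness, a sketch of this covariate-generating mechanism (with the constants p = 500, μ_X = 1.80, σ_X = 2.74 from the text; the function names are illustrative and the code follows the displayed formulas above):

import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)          # numerically stable log(1 + exp(z))

def simulate_X(n_obs, p=500, mu_X=1.80, sigma_X=2.74, seed=0):
    rng = np.random.default_rng(seed)
    Zx = rng.normal(0.0, 5.0, size=(n_obs, p))   # latent N(0, 25) variables
    Zu = rng.normal(0.0, 5.0, size=n_obs)        # shared factor inducing correlation
    X = np.empty((n_obs, p))
    X[:, 1:] = (softplus(Zx[:, 1:] * (1 - 1.0 / p) + Zu[:, None] / p) - mu_X) / sigma_X
    X[:, 0] = (softplus(2.0 * X[:, 1:].sum(axis=1) / (p - 1) + Zx[:, 0] / 2.0) - mu_X) / sigma_X
    return X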

For S and Y, we consider two scenarios under which the imputation model is either correctly or incorrectly specified. We present the “Scenario I: neither the risk prediction model nor the imputation model is correctly specified” in the main text and the “Scenario II: The imputation model is correctly specified and exactly sparse” in Section A of the Supplementary materials.

Scenario I: neither the risk prediction model nor the imputation model is correctly specified. In this scenario, we first generate Y_i from the probit model

P(Y_i = 1 ∣ Z^x_i) = Φ(α^⊤Z^x_i)  with  Φ(x) = ∫_{−∞}^x (2π)^{−1/2} e^{−t²/2} dt,

and then generate S from

S_{i,1} = {ς(Z^s_{i,1}/2 + θY_i) − μ_S}σ_S^{−1} + ξ^⊤X_i,  and  S_{i,j} = {ς(Z^s_{i,j}) − μ_X}σ_X^{−1},   j = 2, …, q.

We chose μ_S and σ_S depending on α such that S_{i,1} is roughly mean 0 and variance 1. Under this setting, a logistic imputation model would be misspecified but nevertheless approximately sparse with appropriately chosen ξ. The coefficients α control the optimal prediction accuracy of X for Y, while θ controls the optimal prediction accuracy of S for Y. We consider two α of different sparsity patterns, which also determine the rest of the parameters:

Sparse (s_α = 3):  α = (0.45, 0.318, 0.318, 0_{497×1}^⊤)^⊤,  μ_S = 1.82,  σ_S = 2.01;
Dense (s_α = 500):  α = (0.316, 0.059_{29×1}^⊤, 0.007_{470×1}^⊤)^⊤,  μ_S = 2.71,  σ_S = 2.68,

where a_{k×1} = (a, …, a)^⊤ ∈ ℝ^k for any a. The sparsity of α subsequently affects the approximate sparsity of β_0 (Table 1), which we measure by the squared ratio between the ℓ1 norm and the ℓ2 norm,

𝒮(β) = ‖β‖_1²/‖β‖_2²,  which satisfies 𝒮(β)/‖β‖_0 ≤ 1. (33)
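As a quick numerical illustration of (33) (the example vectors here are made up, not the simulation parameters), 𝒮(β) equals the support size for an equal-magnitude sparse vector and stays far below p when a few coefficients dominate many small ones:

import numpy as np

def approx_sparsity(beta):
    """S(beta) = ||beta||_1^2 / ||beta||_2^2, the approximate sparsity measure in (33)."""
    beta = np.asarray(beta, dtype=float)
    return np.sum(np.abs(beta)) ** 2 / np.sum(beta ** 2)

print(approx_sparsity([1.0] * 3 + [0.0] * 497))    # 3.0: exactly 3-sparse
print(approx_sparsity([1.0] * 3 + [0.01] * 497))   # about 20.8: approximately sparse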

We consider two θ: (a) θ=0.6 for S to be moderately predictive of Y; and (b) θ=1 for strong surrogates. The parameter ξ depends on both the choices of α and θ:

s_α = 3, θ = 0.6:  ξ = (0.407, 0.330, 0.330, 0.005_{497×1}^⊤)^⊤;
s_α = 3, θ = 1:  ξ = (0.199, 0.163, 0.163, 0.002_{497×1}^⊤)^⊤;
s_α = 500, θ = 0.6:  ξ = (0.350, 0.064_{29×1}^⊤, 0.011_{470×1}^⊤)^⊤;
s_α = 500, θ = 1:  ξ = (0.169, 0.032_{29×1}^⊤, 0.005_{470×1}^⊤)^⊤.

Table 1:

AUC table for simulations with 500 labels under Scenario I. The AUCs are evaluated on an independent testing set of size 100. We approximately measure the sparsity by 𝒮(v) = ‖v‖_1²/‖v‖_2².

Scenario                           Prediction accuracy (AUC)
Surrogate   𝒮(β_0)   𝒮(γ_0)      Oracle   SLASSO   SAS
Strong      174       1.32        0.724    0.660    0.711
Moderate    174       1.26        0.724    0.660    0.713
Strong      28.3      1.33        0.719    0.694    0.713
Moderate    28.3      1.24        0.719    0.694    0.711

Due to the complexity of the data generating process and the noncollapsibility of logistic regression models, we cannot analytically express the true β_0 in either scenario. Instead, we numerically evaluate β_0 with a large simulated dataset, using the oracle knowledge of the exchangeability among covariates, according to the model

logit P(Y_i = 1 ∣ X_i) ∼ η_0 + η_1 X_{i,1} + η_2 Σ_{j=2}^{s_α} X_{i,j} + η_3 Σ_{j=s_α+1}^p X_{i,j}.

We derive the true β_0 as

β_0 = (η_0, η_1, (η_2)_{s_α×1}^⊤, (η_3)_{(p−s_α)×1}^⊤)^⊤.

We report the simulation settings under Scenario I in Table 1, where we present the predictive power of the oracle estimation and of the Lasso estimation. We also report the average area under the receiver operating characteristic (ROC) curve (AUC) for the oracle β_0, SLASSO and the proposed SAS estimation. Our SAS estimation achieves a better AUC than the supervised LASSO across all scenarios, and is comparable to the AUC with the true coefficient β_0. In addition, we observe that the AUC of the supervised LASSO is sensitive to the approximate sparsity 𝒮(β_0), while the AUC of the SAS estimation does not seem to be affected by 𝒮(β_0).

To evaluate the SAS inference for the individualized prediction, we consider six different choices of x_new. We first select x_new^L, x_new^M, x_new^H from a random sample of x_new generated from the distribution of X_i such that their predicted risks are around 0.2, 0.5, and 0.7, corresponding to low, moderate and high risk. We additionally consider three sets of x_new with different levels of sparsity:

Sparse: x_new^S = (1, 1, 0_{499×1}^⊤)^⊤;  Intermediate: x_new^I = (1, 0.183_{30×1}^⊤, 0_{470×1}^⊤)^⊤;  Dense: x_new^D = (1, 0.045_{500×1}^⊤)^⊤.

In Table 2, we compare our SAS estimator of x_new^⊤β_0 with the corresponding SLASSO across all settings under Scenario I. The root mean squared error (rMSE) of the SAS estimation decays with the sample size, while the rMSE of the supervised LASSO provides evidence of inconsistency for the intermediate and dense deterministic x_new. The bias of the supervised LASSO is also significantly larger than that of the SAS estimation. The performance of the SAS estimation is insensitive to the sparsity of β_0, while that of the supervised LASSO severely deteriorates with a dense β_0. The improvement from the supervised LASSO to the SAS estimation is regulated by the surrogate strength.

Table 2:

Comparison of the SAS estimation to the supervised LASSO (SLASSO): bias, empirical standard error (ESE) and root mean squared error (rMSE) of the linear predictions x_new^⊤β_0 under Scenario I with 500 labels, moderate or large 𝒮(β_0), and strong or moderate surrogates.

                  SLASSO                   SAS: Moderate            SAS: Strong
Type              Bias    ESE    rMSE      Bias    ESE    rMSE      Bias    ESE    rMSE
Moderate 𝒮(β_0)
x_new^L           0.605   0.387  0.719     0.165   0.249  0.298     0.118   0.196  0.229
x_new^M          −0.083   0.337  0.347    −0.008   0.246  0.246    −0.016   0.195  0.196
x_new^H          −0.718   0.521  0.887    −0.234   0.294  0.376    −0.176   0.225  0.286
x_new^S          −0.072   0.144  0.161    −0.080   0.094  0.123    −0.018   0.078  0.080
x_new^I          −0.460   0.096  0.470    −0.110   0.093  0.143    −0.055   0.071  0.090
x_new^D          −0.413   0.091  0.423    −0.110   0.089  0.141    −0.114   0.069  0.133
Large 𝒮(β_0)
x_new^L           0.389   0.275  0.477     0.161   0.215  0.269     0.133   0.264  0.296
x_new^M          −0.017   0.280  0.280    −0.014   0.213  0.213    −0.017   0.268  0.268
x_new^H          −0.600   0.481  0.769    −0.251   0.271  0.370    −0.164   0.296  0.339
x_new^S          −0.202   0.140  0.246    −0.074   0.097  0.122    −0.009   0.078  0.079
x_new^I          −0.178   0.098  0.203    −0.075   0.086  0.115    −0.071   0.075  0.103
x_new^D          −0.185   0.090  0.206    −0.109   0.084  0.138    −0.113   0.073  0.135

In Table 3, we compare our SAS inference with the supervised debiased LASSO across the settings under Scenario I. Our SAS inference procedure attains approximately honest coverage of the 95% confidence intervals for all types of x_new under all scenarios. Unsurprisingly, the debiased SLASSO under-covers for the deterministic x_new as a consequence of the violation of the sparsity assumptions on β_0 and the precision matrix. Under our design, the first covariate X_1 has the strongest dependence upon the other covariates, and its associated row in the precision matrix is thus the densest. Consequently, the inference for β^⊤x_new^S = β_0 + β_1 suffers the most severe under-coverage under the debiased SLASSO. The debiased SLASSO also has acceptable coverage for the random x_new^L, x_new^M, x_new^H sampled from the covariate distribution despite the presence of substantial bias, which we attribute to the even larger variance that dominates the bias. In contrast, our SAS inference has small bias across all scenarios and improved variance from the strong surrogates.

Table 3:

Bias, empirical standard error (ESE), average of the estimated standard errors (ASE), and empirical coverage of the 95% confidence intervals (CP) for the debiased supervised LASSO (SLASSO) and the debiased SAS estimator of the linear predictions x_new^⊤β_0 under Scenario I with 500 labels, moderate or large 𝒮(β_0), and strong or moderate surrogates.

                  Debiased SLASSO                Debiased SAS: Moderate Surrogates    Debiased SAS: Strong Surrogates
Type              Bias    ESE    ASE    CP       Bias    ESE    ASE    CP             Bias    ESE    ASE    CP
Risk prediction model approximately sparse
x_new^L          −0.290   1.901  1.896  0.948     0.021   1.873  1.864  0.949          0.018   1.531  1.531  0.950
x_new^M          −0.091   1.994  1.981  0.947    −0.007   1.961  1.954  0.950         −0.015   1.560  1.570  0.953
x_new^H           0.348   2.106  2.074  0.942    −0.050   2.036  2.039  0.950         −0.011   1.632  1.623  0.950
x_new^S           0.171   0.157  0.128  0.694    −0.019   0.149  0.150  0.950         −0.001   0.132  0.125  0.924
x_new^I          −0.001   0.129  0.125  0.938    −0.013   0.123  0.116  0.932          0.010   0.101  0.094  0.920
x_new^D           0.141   0.137  0.138  0.812    −0.011   0.123  0.118  0.944         −0.001   0.096  0.095  0.940
Large 𝒮(β_0)
x_new^L          −0.134   1.918  1.914  0.951     0.018   1.875  1.878  0.951          0.018   1.529  1.524  0.948
x_new^M          −0.056   1.970  1.962  0.948    −0.020   1.911  1.927  0.952          0.005   1.603  1.597  0.950
x_new^H           0.109   2.051  2.029  0.945    −0.022   1.997  1.991  0.950         −0.040   1.671  1.668  0.951
x_new^S           0.029   0.155  0.127  0.892    −0.008   0.153  0.147  0.946         −0.013   0.133  0.131  0.938
x_new^I           0.002   0.131  0.125  0.930     0.001   0.122  0.114  0.936          0.002   0.101  0.098  0.936
x_new^D           0.113   0.135  0.139  0.874    −0.007   0.119  0.116  0.938         −0.003   0.099  0.097  0.960

According to Tables A1, A2 and A3 in Appendix A, the results under Scenario II are consistent with our findings under Scenario I. We also compare SAS to an unsupervised learning approach using a proxy outcome derived from the surrogates in Appendix A. Under Scenario III, which is very similar to Scenario I, SAS performs as well as in Scenario I while the unsupervised learning approach fails completely. This is expected, since the unsupervised approach requires that the deviation of the surrogates from the true outcome, S − Y, be uncorrelated with the risk factors X. Otherwise, spurious association between the outcome Y and the risk factors X can be induced, creating bias in the estimation of the risk prediction model.

6. Application of SAS to EHR Study

We applied the proposed SAS method to the risk prediction of Type II Diabetes Mellitus (T2DM) using EHR and genomic data of participants in the Mass General Brigham Biobank study. The number of genetic risk factors among single nucleotide polymorphisms (SNPs) for T2DM has grown exponentially following the expansion of genome-wide association studies. As an incomplete summary, Voight et al. (2010), Morris et al. (2012) and Scott et al. (2017) each discovered around a dozen new risk SNPs for T2DM, and the recent studies by Mahajan et al. (2018) and Vujkovic et al. (2020) discovered 135 and 558 new risk SNPs, respectively. Some new risk SNPs in Mahajan et al. (2018) even had large coefficients in the polygenic risk score. The ever-growing number of risk SNPs suggests that the genetic risk prediction model for T2DM may be dense. Compared to the large biobank data that generated the genome-wide association studies, the EHR captures the temporal information of T2DM onset and of other phenotypes predictive of T2DM, and thus may provide more accurate forecasting of T2DM. As we mentioned in the introduction, direct extraction of disease onset from the EHR by diagnosis codes or mentions in medical notes may contain substantial false positives. From an expert annotation of the medical histories of 271 patients, we found 38 patients with a T2DM diagnosis code and 161 patients with a mention of T2DM in medical notes who had actually never developed T2DM. The annotation process requires intensive labor of highly skilled medical experts, leading to the limited number of labels.

To define the study cohort, we extracted from the EHR of each patient the date of the first EHR encounter t_ini, the follow-up period C, and the counts and dates of the diagnosis codes and note mentions of clinical concepts related to T2DM as well as its risk factors. We only included patients who did not have any diagnosis code or note mention of T2DM up to baseline, where the baseline time is defined as 1990 if t_ini is prior to 1990 and as their first year if t_ini ≥ 1990. Although neither the diagnosis code nor the note mention of T2DM is sufficiently specific, they are highly sensitive and can be used to accurately remove patients who had already developed T2DM at baseline. This exclusion criterion resulted in N = 20216 patients who are free of T2DM at baseline and have both EHR and genomic features for risk modeling. Among those, we have a total of n = 271 patients whose T2DM status during follow-up, Y, has been obtained via manual chart review. The prevalence of T2DM was about 14% based on the labeled data.

We aim to develop a risk prediction model for Y by fitting a working model P(Y = 1 ∣ X) = g(β_0^⊤X), where the baseline covariate vector X includes age, gender, indicators for the occurrence of diagnosis codes and note counts for obesity, hypertension, coronary artery disease (CAD), and hyperlipidemia during the first-year window, as well as a total of 49 single nucleotide polymorphisms previously reported as associated with T2DM in Mahajan et al. (2018) with odds ratios greater than 1.1. We additionally adjust for follow-up by including log(C) and allow for non-linear effects by including two-way interactions between the SNPs and the other baseline covariates. All variables with fewer than 10 nonzero values within the labelled set are removed, resulting in final covariates of dimension p = 260. We standardize the covariates to have mean 0 and variance 1. To impute the outcome, we used the predicted probability of T2DM derived from the unsupervised phenotyping method MAP (Liao et al., 2019), which achieves an AUC of 0.98, indicating a strong surrogate. In addition to the proposed SAS procedure, we derive risk prediction models based on the supervised LASSO with the same set of covariates. We let K = 5 in the cross-fitting and use 5-fold cross-validation for tuning parameter selection. To compare the performance of the different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC. We repeated the process 10 times and took the average of the predicted probabilities across the repeats for each labelled sample and each method in comparison.
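As an illustration of this repeated cross-validated AUC evaluation (a sketch only; fit_and_predict is a hypothetical placeholder for either the SAS or the SLASSO fitting routine, which for SAS would also consume the unlabeled data):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def repeated_cv_auc(fit_and_predict, X_lab, y_lab, n_repeats=10, n_folds=10, seed=0):
    """Average out-of-fold predicted risks over repeats, then compute a single AUC."""
    preds = np.zeros((n_repeats, len(y_lab)))
    for r in range(n_repeats):
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in kf.split(X_lab):
            preds[r, test_idx] = fit_and_predict(X_lab[train_idx], y_lab[train_idx],
                                                 X_lab[test_idx])
    return roc_auc_score(y_lab, preds.mean(axis=0))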

In Figure 2, we present the estimated β coefficients for the covariates that received a p-value less than 0.05 from the SAS inference. The confidence intervals are generally narrower from the SAS inference. For the coefficients of baseline age and follow-up time, both of which are expected to have a positive effect on the T2DM onset status during the observation window, the SAS inference produced much narrower confidence intervals than the debiased SLASSO. In addition, the SAS inference identified one global genetic risk factor and 6 other subgroup genetic risk factors, while SLASSO identified none of these.

Figure 2:


Point and 95% confidence interval estimates for the coefficients with nominal p-value < 0.05 from SAS inference. The horizontal bars indicate the estimated 95% confidence intervals. The solid points indicate the (initial) estimates, and the triangles indicate debiased estimates. Colors red and green indicate different methods, SAS and SLASSO, respectively.

In Table 4, we present the AUCs of the estimated risk prediction models using the high-dimensional X. It is important to note that the AUC is a measure of prediction accuracy, so debiasing might lead to a worse AUC by accepting larger variability for reduced bias. The AUC from SLASSO is very poor, probably due to the over-fitting bias with the small sample size of the labeled set. With the information from the large unlabeled data, SAS produced a significantly higher AUC than the SLASSO.

Table 4:

The cross-validated (CV) AUC of the estimated risk prediction models with high-dimensional EHR and genetic features based on SAS and the supervised LASSO. Also shown is the AUC of the imputation model derived for the SAS procedure.

Method Imputation SAS SLASSO
CV AUC 0.928 0.763 0.488

For illustration, we present in Figure 3 the individual risk predictions with 95% confidence intervals for three sets of 10 patients, with each set randomly selected from the low (< 5%), medium (5% ~ 15%) or high risk (> 15%) subgroups. These risk groups are constructed for illustration purposes, and a patient with covariates x_new is classified to the low, medium or high risk group if expit(β̂^⊤x_new) belongs to the low, medium or high tertile of {expit(β̂^⊤X_i), i = 1, …, N}. We observe that the confidence intervals for the patients' predicted risks are substantially narrower under SAS, whereas the debiased SLASSO inference is not very informative, with most error bars stretching from zero to one. The contrast between the SAS CIs and the SLASSO CIs demonstrates the improved efficiency resulting from leveraging the information in the unlabeled data through predictive surrogates.

Figure 3:


Point and 95% confidence interval estimates for the predicted risks of 30 randomly selected patients. The vertical bars indicate the estimated 95% confidence intervals. The circle and triangle shapes correspond to the (initial) estimates and the debiased estimates, respectively. Solid points indicate the observed T2DM cases. Colors red and green indicate the different methods, SAS and SLASSO.

7. Discussion

We proposed the SAS estimation and inference method for high-dimensional risk prediction models with a diminishing proportion of observed outcomes. With a sparse imputation model based on predictive surrogates, SAS can recover a dense risk prediction model that is impossible to learn with supervised methods, and it achieves better efficiency than supervised methods when the latter are applicable. We showed that these theoretical advantages lead to better prediction accuracy and shorter confidence intervals in simulations and in a real-data example.

While the SAS procedure is a powerful tool with minimal requirements, attention should be given to the inclusion of highly informative surrogates so that the imputation model is sparse (or approximately sparse). If all surrogates predict Y poorly and the imputation model is dense, the SAS procedure can suffer a compromised convergence rate in estimation. While the current study is motivated by the existence of an easy-to-learn imputation model with highly predictive surrogates, the SAS framework can be extended to settings where the imputation model is not easier to learn than the model for Y ∣ X. When the imputation model is estimable but denser than the risk prediction model (i.e., s_β < s_γ), we can follow similar strategies as in our SAS inference procedure to reduce the bias incurred during the imputation step from γ̂. Specifically, we may consider a debiased estimator for β̂,

β̂_debias = argmin_{β∈ℝ^{p+1}} Σ_{k=1}^K { −(1/N) Σ_{i∈𝒥_k} ℓ(g(γ̂^(k)⊤W_i), β^⊤X_i) + (1/n) Σ_{i∈ℐ_k} β^⊤X_i{g(γ̂^(k)⊤W_i) − Y_i} } + λ‖β_{−1}‖_1.

This debiased SAS estimation will attain the optimal rate √(s_β log(p)/n), and we also expect an efficiency gain in the resulting variance compared to the supervised estimator, analogous to the efficiency gain observed in the SAS inference. Adaptive approaches to infer whether a given dataset falls into the setting with s_β > s_γ or s_β < s_γ are straightforward in simpler settings where s_β and s_γ can be estimated, but warrant future research in general. In the extremely dense imputation model setting with s_γ > n, information-theoretic bounds indicate that the imputation model will be inestimable, invalidating any subsequent steps involving γ̂. A possible solution is to redefine the imputation model as the sparser of the risk prediction model and the original imputation model. A potential approach to identifying such a sparser imputation model is through the under-identified Dantzig selector

γ̂_ada = argmin_{γ∈ℝ^{p+q+1}} ‖γ‖_1,  subject to  ‖(1/n) Σ_{i=1}^n X_i{Y_i − g(γ^⊤W_i)}‖_∞ ≤ λ.

Both γ_0 and (β_0^⊤, 0_q^⊤)^⊤ should fall in the feasible region with a suitable λ, and the minimization of the ℓ1 norm may pick the sparsest element from the feasible class. Using γ̂_ada in the SAS estimation may attain the optimal rate uniformly over s_β and s_γ. Theoretical studies of the above proposals warrant future research.

Supplementary Material


Figure 1:

A dense prediction model (graph with dashed lines) can be compressed into a sparse imputation model (graph with solid lines) when the effects of most baseline covariates are reflected in a few variables in the EHR that monitor the development of the event of interest.

Contributor Information

Jue Hou, Division of Biostatistics, University of Minnesota School of Public Health, Minneapolis, MN 55455, USA.

Zijian Guo, Department of Statistics, Rutgers University, Piscataway, NJ 08854-8019, USA.

Tianxi Cai, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.

References

  1. Bang Heejung and Robins James M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  2. Belloni A, Chernozhukov V, and Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
  3. Belloni Alexandre, Kaul Abhishek, and Rosenbaum Mathieu. Pivotal estimation via self-normalization for high-dimensional linear models with error in variables, 2017.
  4. Bickel Peter J., Ritov Ya'acov, and Tsybakov Alexandre B. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.
  5. Bühlmann Peter and van de Geer Sara. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
  6. Cai T. Tony and Guo Zijian. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615–646, 2017.
  7. Cai T. Tony and Guo Zijian. Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):391–419, 2020.
  8. Cai Tianxi, Cai T. Tony, and Guo Zijian. Optimal statistical inference for individualized treatment effects in high-dimensional models. arXiv e-prints:1904.12891, 2019.
  9. Chakrabortty Abhishek and Cai Tianxi. Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist., 46(4):1541–1572, 2018.
  10. Chakrabortty Abhishek, Lu Jiarui, Cai T. Tony, and Li Hongzhe. High dimensional M-estimation with missing outcomes: A semi-parametric framework. arXiv e-prints:1911.11345, 2019.
  11. Chandrasekher Kabir Aladin, El Alaoui Ahmed, and Montanari Andrea. Imputation for high-dimensional linear regression, 2020.
  12. Cheng David, Ananthakrishnan Ashwin, and Cai Tianxi. Efficient and robust semi-supervised estimation of average treatment effects in electronic medical records data. arXiv e-prints:1804.00195, 2018.
  13. Deng Siyi, Ning Yang, Zhao Jiwei, and Zhang Heping. Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv e-prints:2011.14185, 2020.
  14. Fan Jianqing and Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  15. Frazer Kelly A, Murray Sarah S, Schork Nicholas J, and Topol Eric J. Human genetic variation and its contribution to complex traits. Nature Reviews Genetics, 10(4):241–251, 2009.
  16. Gronsbell Jessica, Minnier Jessica, Yu Sheng, Liao Katherine, and Cai Tianxi. Automated feature selection of predictors in electronic medical records data. Biometrics, 75(1):268–277, 2019.
  17. Guo Zijian, Rakshit Prabrisha, Herman Daniel S, and Chen Jinbo. Inference for the case probability in high-dimensional logistic regression. arXiv preprint:2012.07133, 2020.
  18. Huang Jian and Zhang Cun-Hui. Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. J. Mach. Learn. Res., 13(1):1839–1864, 2012.
  19. Javanmard Adel and Montanari Andrea. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014.
  20. Kallus Nathan and Mao Xiaojie. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv e-prints:2003.12408, 2020.
  21. Liao Katherine P, Sun Jiehuan, and 18 others. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association, 26(11):1255–1262, 2019.
  22. Loh Po-Ling and Wainwright Martin J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, and Weinberger KQ, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  23. Ma Rong, Cai T. Tony, and Li Hongzhe. Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, 116(534):984–998, 2020.
  24. Mahajan Anubha, Taliun Daniel, and 113 others. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature Genetics, 50(11):1505–1513, 2018.
  25. Morris Andrew P., Voight Benjamin F., Teslovich Tanya M., Ferreira Teresa, Segrè Ayellet V., Steinthorsdottir Valgerdur, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics, 44(9):981–990, 2012.
  26. Negahban Sahand, Ravikumar Pradeep, Wainwright Martin J., and Yu Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Technical Report 797, University of California Berkeley, Department of Statistics, 2010.
  27. Portnoy Stephen. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Statist., 12(4):1298–1309, 1984.
  28. Portnoy Stephen. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. Ann. Statist., 13(4):1403–1417, 1985.
  29. Raskutti Garvesh, Wainwright Martin J., and Yu Bin. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  30. Scott Robert A., Scott Laura J., Mägi Reedik, Marullo Letizia, Gaulton Kyle J., et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes, 66(11):2888–2902, 2017.
  31. Smucler Ezequiel, Rotnitzky Andrea, and Robins James M. A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv e-prints:1904.03737, 2019.
  32. Thompson Caroline A, Kurian Allison W, and Luft Harold S. Linking electronic health records to better understand breast cancer patient pathways within and between two health systems. eGEMs, 3(1), 2015.
  33. Tsiatis A. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York, 2007.
  34. van de Geer Sara, Bühlmann Peter, Ritov Ya'acov, and Dezeure Ruben. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014.
  35. van de Geer Sara A. and Bühlmann Peter. On the conditions used to prove oracle results for the lasso. Electron. J. Statist., 3:1360–1392, 2009.
  36. Vershynin Roman. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  37. Voight Benjamin F., Scott Laura J., Steinthorsdottir Valgerdur, Morris Andrew P., Dina Christian, Welch Ryan P., et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nature Genetics, 42(7):579–589, 2010.
  38. Vujkovic Marijana, Keaton Jacob M, and 48 others. Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nature Genetics, 52(7):680–691, 2020.
  39. Warren Joan L and Yabroff K Robin. Challenges and opportunities in measuring cancer recurrence in the United States. Journal of the National Cancer Institute, 107(8):djv134, 2015.
  40. Zhang Anru, Brown Lawrence D., and Cai T. Tony. Semi-supervised inference: General theory and estimation of means. Ann. Statist., 47(5):2538–2566, 2019.
  41. Zhang Cun-Hui and Zhang Stephanie S. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.
  42. Zhang Yichi, Liu Molei, Neykov Matey, and Cai Tianxi. Prior adaptive semi-supervised learning with application to EHR phenotyping. Journal of Machine Learning Research, 23(83):1–25, 2022.
  43. Zhang Yuqian and Bradic Jelena. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2021. doi: 10.1093/biomet/asab042.
  44. Zhu Yinchu and Bradic Jelena. Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association, 113(524):1583–1600, 2018a.
  45. Zhu Yinchu and Bradic Jelena. Significance testing in non-sparse high-dimensional linear models. Electron. J. Statist., 12(2):3312–3364, 2018b.
