Author manuscript; available in PMC: 2023 Nov 15.
Published in final edited form as: J Mach Learn Res. 2022;23:83.

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Yichi Zhang 1,*, Molei Liu 2,*, Matey Neykov 3, Tianxi Cai 4
PMCID: PMC10653017  NIHMSID: NIHMS1912660  PMID: 37974910

Abstract

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite this potential, EHR data remain underutilized for discovery research due to a major limitation: the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available for all patients. Under a working prior assumption that S is related to X only through Y, and allowing this assumption to hold only approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.

Keywords: High dimensional sparse regression, regularization, single index model, semi-supervised learning, electronic health records

1. Introduction

Electronic Health Records (EHRs) provide a large and rich data source for biomedical research aiming to further our understanding of disease progression and treatment response. EHR data have been successfully used to gain novel insights into a wide range of diseases, with examples including diabetes (Brownstein et al., 2010), rheumatoid arthritis (Liao et al., 2014), inflammatory bowel disease (Ananthakrishnan et al., 2014), and autism (Doshi-Velez et al., 2014). EHR is also a powerful discovery tool for identifying novel associations between genomic markers and multiple phenotypes through analyses such as phenome-wide association studies (Denny et al., 2010; Kohane, 2011; Wilke et al., 2011; Cai et al., 2018).

Despite its potential, ensuring unbiased and powerful biomedical studies using EHR is challenging because EHR was primarily designed for patient care, billing, and record keeping. Extracting precise phenotype information for an individual patient requires manual medical chart reviews, an expensive process that is not scalable for research studies. To overcome such difficulties, recent efforts including those from Informatics for Integrating Biology and the Bedside (i2b2) (Liao et al., 2015; Yu et al., 2015, e.g.) and the Electronic Medical Records and Genomics (eMERGE) network (Newton et al., 2013; Gottesman et al., 2013) have been devoted to developing phenotyping algorithms to predict disease status using relatively small training datasets with gold standard labels extracted via chart review.

Various approaches to EHR phenotyping have been proposed. Supervised machine learning methods have been shown to achieve robust performance across disease phenotypes and EHR systems (Carroll et al., 2012; Liao et al., 2015). However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the candidate features, denoted by X, are of high dimensionality p. One approach to overcoming the high dimensionality is to consider unsupervised methods. Unfortunately, standard unsupervised methods such as clustering are likely to fail when the dimension of X is large but a majority of the features are unrelated to the phenotype of interest and instead predictive of some other underlying subgroups. Recently, unsupervised methods based on “silver standard labels” have been proposed. These methods leverage a surrogate outcome S that is highly predictive of the true phenotype status Y, such as the count of International Classification of Diseases (ICD) billing codes for the disease, to train the phenotyping algorithm against the features X. Specifically, Halpern et al. (2016) and Zhang et al. (2020) utilized anchor variables with high positive predictive value as the surrogate S to estimate Y ∣ X under the conditional independence assumption S ⊥ X ∣ Y. Agarwal et al. (2016) trained a penalized logistic regression of S on X for phenotyping of Y against X. Chakrabortty et al. (2017) provided theoretical justification for this strategy. They showed that a regularized estimator constructed from an unlabeled subset consisting of those with extreme values of S can be used to infer the direction of β under single index models S ∼ f(α⊤X, ϵ) and Y ∼ g(β⊤X). Their method relies on the similarity between the directions of α and β for efficient estimation. However, it is not robust to poor surrogacy resulting from violation of such assumptions. Furthermore, their method cannot be directly used to predict Y using both S and X, or to accurately recover the scale of Pr(Y = 1 ∣ S, X).

A number of semi-supervised or weakly supervised deep learning procedures have also been proposed recently and shown to attain better performance than their supervised counterparts. For example, Ratner et al. (2017) proposed a weakly supervised approach that trains a deep model with imperfect labels generated from user-specified labeling functions based on sources such as patterns, heuristics, and external knowledge bases. Wang and Poon (2018) developed a framework for weak supervision from multiple sources by composing probabilistic logic with deep learning. McDermott et al. (2018) designed a semi-supervised cycle Wasserstein regression generative adversarial network (CWR-GAN) approach that uses adversarial signals to learn from unlabelled samples and improve prediction performance when gold-standard labels are scarce. However, due to their complex architectures, it remains unclear when and how the surrogate features, along with the unlabeled dataset, can improve the prediction performance of these deep models.

In this paper, we propose a semi-supervised (SS) method for estimating Y ∣ W = (S, X⊤)⊤ that borrows information from both a small labeled dataset with n realizations of (Y, W) and a much larger unlabeled dataset with N observations of W, under a high dimensional setting with N ≫ p ≫ n. We consider a logistic phenotype model for Y ∣ S, X, a single index model (SIM) for S ∣ X, as well as a working prior assumption that S is independent of X given Y. We obtain the estimator through regularization with penalty functions reflecting the prior knowledge. When the prior assumption holds exactly, we show that the unlabeled dataset can naturally be used to assist in the estimation of the phenotype model. Allowing the prior assumption to hold only approximately or to be highly violated, our prior adaptive semi-supervised (PASS) estimator adaptively incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior.

The proposed PASS estimator is similar to the prior LASSO (pLASSO) procedure of Jiang et al. (2016) in that both approaches aim to incorporate prior information into the ℓ1-penalized estimator in a high-dimensional setting. Nevertheless, the differences are substantial and clear. Jiang et al. (2016) assumed that the prior information is summarized into predicted values and contributes to the likelihood term. In contrast, we use the prior information to guide the shrinkage and incorporate it into the penalty term. In this sense, PASS and pLASSO complement each other to some extent. However, as shown in both theory and simulations, putting prior information into the likelihood term tends to lead to a “take it or leave it” phenomenon: the usefulness of the prior information is determined by the overall effect of all predictors. By contrast, by putting the prior information into the penalty term, the PASS approach provides more flexible control: it is able to scrutinize the individual effect of each predictor. This gained flexibility can result in improved theoretical and numerical performance.

The rest of this paper is organized as follows. We discuss the motivation, an important special scenario, and the general methodology in Section 2. We analyze the theoretical properties of the proposed approach in Section 3, and assess its finite sample performance via simulation studies in Section 4. Furthermore, we illustrate the practical value of the proposed approach on three real EHR datasets in Section 5. Finally, we conclude the paper with some discussions and extensions in Section 6. All technical proofs and additional numerical results are given in the Supplementary Materials.

2. Methodology

2.1. Setup

We assume that the underlying data consist of N independent and identically distributed (i.i.d.) observations {(Yi, Si, Xi⊤)⊤ = (Yi, Wi⊤)⊤, i = 1, …, N}, where Yi is a binary indicator of the disease status of the ith patient, Si is a scalar surrogate variable chosen via domain knowledge to be reasonably predictive of Yi, and Xi is a p-dimensional feature vector. Examples of Si include the total count of ICD codes for the disease of interest or of its mentions in clinical notes extracted via natural language processing (NLP). Candidate features X may include the ICD9 code counts for competing diagnoses, lab results, as well as NLP mentions of relevant signs/symptoms, medications and procedures. We may also include various transformations of the original features in X to account for non-linear effects. While {Wi, i = 1, …, N} is fully observed, Yi is only observed for a random subset of n patients. Hence the observed data are 𝓛 ∪ 𝓤, where without loss of generality the first n observations are assumed fully observed, forming the labeled dataset 𝓛 = {(Yi, Wi⊤)⊤, i = 1, …, n}, and the rest constitute the unlabeled dataset 𝓤 = {Wi, i = n+1, …, N}.

Throughout, for a d-dimensional vector v, the ℓq-norm of v is ‖v‖q = (Σj=1,…,d |vj|^q)^{1/q}. The ℓ∞-norm of v is ‖v‖∞ = max1≤j≤d |vj|. The support of v is supp(v) = {j : vj ≠ 0}. If 𝓘 is a subset of {1, …, d}, then v𝓘 denotes the d-dimensional vector whose jth element is vj·1{j ∈ 𝓘}, where 1B is the indicator function of the event B. The independence between random variables/vectors U and V is written as U ⊥ V. We also denote the negative log-likelihood function associated with the logistic model by ℓ(y, η) = −yη + log(1 + e^η).

2.2. Model Assumptions

To predict Y using W = (S, X⊤)⊤, we assume

Pr(Y = 1 ∣ W) = σ(ζ0 + Sγ0 + X⊤β0) = σ(ϑ0⊤W̄),  with ϑ0 = (ζ0, γ0, β0⊤)⊤,  (𝓜Y)

where for any vector w, w̄ = (1, w⊤)⊤, and σ(t) = e^t/(1 + e^t). To leverage the data in 𝓤, we further assume a single index model (SIM) for S ∣ X, i.e., there exists α0 ∈ ℝ^p such that

S = f(X⊤α0, ϵ),  with some ϵ ⊥ X and f satisfying E{f²(X⊤α0, ϵ)} < ∞,  (𝓜S)

where X⊤α0 is a single linear combination of the features X and f is an unknown link function. Here ζ0, γ0, β0 and α0 are parameters to be estimated, where only the direction of α0 is identifiable and its norm does not affect the construction introduced below. If α0 and β0 are similar in certain ways, one would expect that the unlabeled dataset 𝓤 may be used to improve upon the standard supervised estimator for β0 using 𝓛 alone. For example, if S is a noisy representation of Y with random measurement error, then it is reasonable and common in the EHR literature (Hong et al., 2019; Zhang et al., 2020, e.g.) to assume

X ⊥ S ∣ Y.  (𝓒prior)

Note that a conditional independence assumption similar to (𝓒prior) was imposed between the input and the pretext target given the label in the context of self-supervised learning, to demonstrate its advantage (Lee et al., 2020). Under (𝓒prior), we have Proposition 1, with proof given in the Supplementary Materials.

Proposition 1.

Under (𝓜Y), (𝓜S), (𝓒prior), and assuming that E(XX⊤) is positive definite and that (C1) for any two vectors a1, a2, E(X⊤a2 ∣ X⊤a1) is linear in X⊤a1, there exist scalars k1, k2 such that α0 = k1β0 and ᾱ = k2β0, where

(τ̄, ᾱ) = arg min over τ, α of E(S − τ − X⊤α)².

Remark 1.

Condition (C1) holds for elliptical distributions including the multivariate normal. By Diaconis and Freedman (1984) and Hall and Li (1993), this assumption tends to hold for non-elliptical designs when the dimensionality is high. Specifically, one can show that under mild regularity conditions, for two projection vectors a1 and a2 drawn uniformly at random from 𝕊^{p−1} = {v ∈ ℝ^p : ‖v‖2 = 1}, the pair (X⊤a2, X⊤a1) weakly converges to a bivariate normal distribution with high probability, and thus E(X⊤a2 ∣ X⊤a1) is at least approximately linear in X⊤a1; see Theorem 1.1 of Diaconis and Freedman (1984) and equation (1.9) of Hall and Li (1993).

Proposition 1 hinges on the main result of Li and Duan (1989) that when the features X satisfy (C1), the direction of the coefficients of a SIM can be estimated using least squares regression of the response against X. It suggests that 𝓤 can greatly improve the estimation of β0 under (𝓒prior) because the phenotype model (𝓜Y) may be rewritten as logit Pr(Y = 1 ∣ W) = ζ + Sγ + ρX⊤α0 for some ρ. Under this model, a simple SS estimator for ζ, γ and β in (𝓜Y) can be obtained as ζ̂, γ̂ and ρ̂α̂, where

(ζ̂, γ̂, ρ̂) = arg min over ζ, γ, ρ of Σi=1,…,n ℓ(Yi, ζ + γSi + ρXi⊤α̂),  (τ̂, α̂) = arg min over τ, α of Σi=1,…,N (Si − τ − Xi⊤α)².

By doing so, the direction of the high dimensional vector β is estimated based on the entire 𝓛 ∪ 𝓤, and only the low dimensional parameters (ζ, γ, ρ) are estimated using the small labeled dataset 𝓛. Hereafter we shall refer to this SS estimator derived under (𝓒prior) as SSprior.
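For concreteness, the following minimal R sketch illustrates this two-step SSprior fit; the object names (X, S, Y, n) are illustrative rather than taken from the paper's released code, and a LASSO-penalized least squares fit via glmnet stands in for the unpenalized least squares step when p is large.

    library(glmnet)
    # Step 1: estimate the direction of alpha by (penalized) least squares of S on X,
    # using all N observations in L and U.
    fit_alpha <- cv.glmnet(X, S, family = "gaussian")
    alpha_hat <- as.numeric(coef(fit_alpha, s = "lambda.min"))[-1]   # drop the intercept
    # Step 2: on the n labeled observations, fit the 3-parameter logistic model of
    # Y on S and the single index X %*% alpha_hat.
    lab    <- data.frame(Y = Y, S = S[1:n], index = as.numeric(X[1:n, ] %*% alpha_hat))
    fit_ss <- glm(Y ~ S + index, data = lab, family = binomial())
    # Implied coefficient vector for X under (C_prior): rho_hat * alpha_hat.
    beta_ss <- coef(fit_ss)["index"] * alpha_hat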

Nevertheless, SSprior is only valid when (𝓒prior) and (C1) hold exactly. Our goal is to develop a more robust SS estimator under (𝓜Y) and (𝓜S) that can efficiently exploit 𝓤 when (𝓒prior) and (C1) may only hold approximately. In this more general setting, a desirable SS estimator should improve upon the standard supervised estimator when the directions of α0 and β0 are similar in their magnitude and/or support. In addition, it should perform similarly to the supervised estimator when the two directions are not close. We shall now detail our PASS estimation procedure, which automatically adapts to these different cases as reflected in the observed data.

2.3. Prior Adaptive Semi-Supervised (PASS) Estimator

With 𝓛 only, a supervised estimator for β can be obtained via the standard ℓ1-penalized regression:

ϑ˘ = (ζ˘, γ˘, β˘⊤)⊤ = arg min over ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ‖β‖1.  (1)

With properly chosen λ, the consistency and rate of convergence of ϑ˘ have been established (van de Geer, 2008). To improve the estimation of β through leveraging 𝓤, we note that when (𝓒prior) holds approximately, the magnitude of β0 − ρα0 is small for some ρ, and the support of β0 − ρα0 is of small size as well.

To incorporate this prior belief about the relationship between α0 and β0, we construct the penalty term

min over ρ of {λ1‖(β − ρα0)𝓐0‖1 + λ2‖(β − ρα0)𝓐0^c‖1},

where 𝓐0 = supp(α0), and λ1, λ2 > 0 are tuning parameters. Since (α0)𝓐0^c = 0, the penalty term is equivalent to

λ1{min over ρ of ‖(β − ρα0)𝓐0‖1} + λ2‖β𝓐0^c‖1.  (2)

The first term in the penalty measures how far β is from the closest vector along the α0 direction, and hence encourages a small magnitude of β − ρα0. The second term shrinks β𝓐0^c towards 0, which reflects our prior that predictors irrelevant to S are likely to be irrelevant to Y as well. The tuning parameters λ1, λ2 control the strength of the belief imposed. When they are sufficiently large, β will be forced to be a multiple of α0, and we end up with the same estimator as in the case where (𝓒prior) holds.

Since we have N ≫ p samples to estimate α0, we use the adaptive LASSO (ALASSO) penalized least squares estimator α̂ (Zou, 2006; Zou and Zhang, 2009), where

(τ̂, α̂) = arg min over τ, α of (1/N) Σi=1,…,N (Si − τ − Xi⊤α)² + μ Σj=1,…,p ω̂j|αj|,

where ω̂j = |α̂init,j|^{−ν} for some constant ν > 0, α̂init = (α̂init,1, …, α̂init,p)⊤,

(τ̂init, α̂init) = arg min over τ, α of (1/N) Σi=1,…,N (Si − τ − Xi⊤α)² + μinit‖α‖1,

and μinit and μ are tuning parameters that can be chosen via cross-validation or the Bayesian information criterion (BIC). Here, α̂ is actually an estimator of ᾱ, which has the same direction as α0 under the conditions of Proposition 1.
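A minimal R sketch of this two-stage ALASSO fit of S on X is given below, assuming the glmnet package and objects X (an N × p matrix) and S (a length-N vector); for brevity it uses ν = 1 and cross-validation in place of BIC for tuning.

    library(glmnet)
    # Stage 1: initial LASSO fit of S on X over all N samples.
    fit_init   <- cv.glmnet(X, S, family = "gaussian")
    alpha_init <- as.numeric(coef(fit_init, s = "lambda.min"))[-1]
    # Stage 2: adaptive LASSO with weights |alpha_init_j|^{-nu};
    # coefficients that were zero initially receive a very large penalty.
    nu <- 1
    w  <- 1 / (abs(alpha_init)^nu + 1e-8)
    fit_alasso <- cv.glmnet(X, S, family = "gaussian", penalty.factor = w)
    alpha_hat  <- as.numeric(coef(fit_alasso, s = "lambda.min"))[-1]
    A_hat      <- which(alpha_hat != 0)    # estimated support supp(alpha_hat)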

Appending the penalty term (2) to the negative log-likelihood and replacing α0 with its estimate α̂, we propose to estimate ϑ0 = (ζ0, γ0, β0⊤)⊤ by

ϑ̂ = (ζ̂, γ̂, β̂⊤)⊤ = arg min over ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ1{min over ρ of ‖(β − ρα̂)𝓐̂‖1} + λ2‖β𝓐̂^c‖1,

where 𝓐̂ = supp(α̂). The estimators can be equivalently obtained as

(ρ̂, ϑ̂) = arg min over ρ, ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ1‖(β − ρα̂)𝓐̂‖1 + λ2‖β𝓐̂^c‖1.  (3)

The impact of the tuning parameters λ1,λ2 can be understood from a bias-variance tradeoff viewpoint. When λj’s are large, β^ tends to be a multiple of α^ and thus is an estimator with high bias and low variance. In contrast, when λj’s are small, the likelihood term based on the labeled dataset 𝓛 is the dominant part, and hence β^ will have low bias and high variance. By varying the values of λj’s, we are able to obtain a continuum connecting these two extremes. In practice, λ1 and λ2 can be chosen via standard data-driven approaches such as cross-validation.

2.4. Computation Details

The minimization in (3) can be solved with standard software for LASSO estimation. Let δ = β − ρα̂. We can re-parametrize the objective in (3) in terms of ρ, ζ, γ, and δ as

(ζ̂, γ̂, ρ̂, δ̂) = arg min over ζ, γ, ρ, δ of (1/n) Σi=1,…,n ℓ(Yi, ζ + Siγ + ρXi⊤α̂ + Xi⊤δ) + λ1(‖δ𝓐̂‖1 + κ‖δ𝓟∖𝓐̂‖1),

where 𝓟 = {1, …, p} and κ = λ2/λ1. This is a typical LASSO problem with covariates (1, Si, Xi⊤α̂, Xi⊤)⊤, parameters (ζ, γ, ρ, δ⊤)⊤, and a weighted ℓ1 penalty on the parameters. Hence it can be solved by essentially any algorithm for ALASSO fitting. In this paper, we use the R package glmnet (Friedman et al., 2010) to compute ζ̂, γ̂, ρ̂, and δ̂, and construct the final estimator of ϑ0 as ϑ̂ = (ζ̂, γ̂, β̂⊤)⊤ with β̂ = δ̂ + ρ̂α̂.
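The following minimal R sketch illustrates the re-parametrized fit with glmnet, assuming alpha_hat and A_hat are the ALASSO estimate and its support from Section 2.3 and that Y holds the n observed labels; κ is fixed here for illustration, whereas in practice (λ1, λ2) would be selected jointly, e.g. by cross-validating over a grid of κ.

    library(glmnet)
    index <- as.numeric(X %*% alpha_hat)              # the single index X^T alpha_hat
    Z     <- cbind(S = S, index = index, X)           # glmnet adds the intercept zeta
    kappa <- 2                                        # kappa = lambda2 / lambda1
    pf    <- c(0, 0,                                  # leave gamma and rho unpenalized
               ifelse(seq_len(ncol(X)) %in% A_hat, 1, kappa))
    fit_pass <- cv.glmnet(Z[1:n, ], Y, family = "binomial", penalty.factor = pf)
    theta    <- as.numeric(coef(fit_pass, s = "lambda.min"))
    zeta_hat  <- theta[1]                             # intercept
    gamma_hat <- theta[2]                             # coefficient of S
    rho_hat   <- theta[3]                             # coefficient of X^T alpha_hat
    delta_hat <- theta[-(1:3)]
    beta_hat  <- delta_hat + rho_hat * alpha_hat      # final coefficient vector for X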

3. Theoretical Properties

In this section, we present non-asymptotic risk bounds for the PASS estimator. We also make theoretical comparisons with the supervised LASSO estimator to shed light on when PASS outperforms the LASSO and where such improvement comes from.

3.1. Notations

A random variable V is sub-Gaussian(τ²) if E{exp(λ|V|)} ≤ 2 exp(λ²τ²/2) holds for all λ > 0. Throughout, we define

U = (X⊤, 1)⊤,  K = E(UU⊤),  ξ = (α⊤, τ)⊤,  Zα = (X⊤, X⊤α, S, 1)⊤,  G = E(Zᾱ Zᾱ⊤),
θ = (δ⊤, ρ, γ, ζ)⊤,  H = E[σ(Zᾱ⊤θ0){1 − σ(Zᾱ⊤θ0)} Zᾱ Zᾱ⊤],

where ᾱ is given by (ᾱ⊤, τ̄)⊤ = ξ̄ = arg min over ξ of E(S − U⊤ξ)², and Θ0 = {θ : δ + ρᾱ = β0, ζ = ζ0, γ = γ0}. Denote 𝓑0 = supp(β0), 𝓐 = supp(ᾱ) and q = |𝓐|. We assume ‖ᾱ‖2 = 1 without loss of generality, since ᾱ is used to recover only the direction of β0 in the SIM and one can change ρ correspondingly to make any β = δ + ρᾱ invariant to ‖ᾱ‖2. Note that under (𝓜Y), any θ0 ∈ Θ0 minimizes E{ℓ(Y, Zᾱ⊤θ)}, and due to perfect multicollinearity in Zᾱ, θ0 is not unique. However, any θ0 ∈ Θ0 corresponds to the unique β0 = δ0 + ρ0ᾱ, and thus Zᾱ⊤θ0 = ζ0 + Sγ0 + X⊤β0 = ϑ0⊤W̄ is well-defined. Moreover, any quantity depending on θ0 only through Zᾱ⊤θ0 is well-defined. Since the main results in this section depend on θ0 solely through Zᾱ⊤θ0, we will use θ0 to represent any θ ∈ Θ0 for simplicity.

For θ = (δ⊤, ρ, γ, ζ)⊤, define Ω(θ) = λ0(|ρ| + |γ| + |ζ|) + λ1‖δ𝓐‖1 + λ2‖δ𝓟∖𝓐‖1, Δα = 2μinit q/φ², and Π(θ) = |ρ|, where φ is a constant defined in Assumption (A4) in Section S2.1 of the Supplementary Materials, and λ0 = 36B{log(6/ϵ)/n}^{1/2}. To introduce the oracle θ*, we define the oracle risk function as:

𝓔(θ, 𝓢+, 𝓢−) = E ℓ(Y, Zᾱ⊤θ) − E ℓ(Y, Zᾱ⊤θ0) + 256 κ(𝓢+)²|𝓢+| / {ϖ ψ(𝓢+)} + 8λ1‖θ𝓢−∩𝓐‖1 + 8λ2‖θ𝓢−∩(𝓟∖𝓐)‖1 + 8λ1ΔαΠ(θ),  (4)

where

ψ(𝓢+) = inf over {v : Ω(v𝓢+^c) ≤ 3Ω(v𝓢+)} of v⊤Gv / (v𝓢+⊤v𝓢+),
κ(𝓢+) = λ0 if 𝓢+ ∩ 𝓐 = Ø and 𝓢+ ∩ (𝓟∖𝓐) = Ø;  κ(𝓢+) = λ2 if 𝓢+ ∩ 𝓐 = Ø and 𝓢+ ∩ (𝓟∖𝓐) ≠ Ø;  κ(𝓢+) = +∞ if 𝓢+ ∩ 𝓐 ≠ Ø.

Define θ* = (δ*⊤, ρ*, γ*, ζ*)⊤, 𝓢*+ and 𝓢*− as the solution to

arg min over {θ, 𝓢+, 𝓢−} satisfying 𝓢+ ∩ 𝓢− = Ø, 𝓢+ ∪ 𝓢− = supp(θ) ∪ 𝓟̄, 𝓟̄ ⊆ 𝓢+, and ‖G^{1/2}(θ − θ0)‖2 ≤ η, of 𝓔(θ, 𝓢+, 𝓢−),

where 𝓟̄ = {p+1, p+2, p+3}, and η is a constant defined in Assumption (A3) in Section S2.1 of the Supplement. Let 𝓢* = 𝓢*+ ∪ 𝓢*− = supp(θ*) ∪ 𝓟̄, κ* = κ(𝓢*+), and β* = δ* + ρ*ᾱ. Intuitively, one may view 𝓢*+ as the union of the unpenalized predictors and the predictors with large coefficients that are not recovered by 𝓐, while 𝓢*− can be viewed as the union of the predictors with small nonzero coefficients and the predictors recovered by 𝓐. Partitioning the support of θ* into 𝓢*+ and 𝓢*− is inspired by Bühlmann and Van De Geer (2011, Section 6.2.4) and leads to a refined bound.

3.2. Main result

We first establish the risk bounds for the PASS estimator in the following theorem. Its proof can be found in Section S2 of the Supplementary Materials.

Theorem 1.

For any ϵ > 0, if Assumptions (A1)–(A8) (introduced in Section S2.1 of the Supplementary Materials) hold, the following inequalities hold simultaneously with probability at least 1 − 10ϵ:

Excess risk:  E ℓ(Y, Zα̂⊤θ̂) − E ℓ(Y, Zᾱ⊤θ0) ≤ Ξ,
Linear prediction error:  E(Zα̂⊤θ̂ − Zᾱ⊤θ0)² ≤ Ξ/ϖ,
Probability prediction error:  E{σ(Zα̂⊤θ̂) − σ(Zᾱ⊤θ0)}² ≤ Ξ/ϖ,

where ϖ is a positive constant defined in (A2), Ξ = 64𝓔(θ*, 𝓢*+, 𝓢*−), and 𝓔 is the oracle risk function defined in equation (4).

Remark 2.

As detailed in Section S2.1 of the Supplementary Materials, Assumptions (A1)–(A8) are imposed on the tail behaviour of the regression residuals, the regularity of the design matrix, the minimum signal strength of ᾱ, the sample sizes, and the rates of the tuning parameters. These assumptions are commonly used in the theoretical literature on the LASSO, such as the sub-Gaussian variable condition and the restricted eigenvalue condition; see, e.g., van de Geer and Bühlmann (2009); Bickel et al. (2009); Bühlmann and Van De Geer (2011).

Remark 3.

The last term of the risk bound 𝓔(θ*, 𝓢*+, 𝓢*−) is of order O(λ1Δα|ρ*|), which reflects the estimation error in α̂. Following Lemma S8 in the Supplement, one can show that Δα = Op(N^{−1/2}|𝓐|). All the other terms in Ξ describe the estimation error in θ̂ as if α̂ were replaced with ᾱ. When N is sufficiently large, O(λ1Δα|ρ*|) is typically negligible relative to the other terms. Specifically, if N ≫ n|𝓐|²log(p), then O(λ1Δα|ρ*|) = O({Nn}^{−1/2}log(p)^{1/2}|𝓐|) = o(n^{−1}). In general, as long as N ≫ max(n, p) and ᾱ is not much denser than β0, as is typical in EHR applications, O(λ1Δα|ρ*|) is dominated by the risk of the supervised LASSO estimator, and even by that of the supervised oracle estimator obtained with knowledge of supp(β0).

To gain a better understanding of how the key quantity Ξ in Theorem 1 changes with the similarity between the prior information ᾱ and the target β0, we discuss several specific cases in Section 3.3, based on the risk bound derived in Theorem 1.

3.3. Specific Cases

Following Remark 3, we focus our discussion on the settings where N is sufficiently large such that the last term of the risk bound is negligible. We consider three different scenarios as illustrated in Figure 1: (Case 1) ᾱ recovers both the support and the direction of β0; (Case 2) ᾱ almost recovers the support of β0 but has a substantially different direction from β0; (Case 3) ᾱ fails to recover the support of β0 (let alone its direction) and provides poor information. These three cases depict perfect, good, and poor quality of the prior information ᾱ in recovering the support and direction of β0. Next, we rigorously characterize the three cases by properly specifying the parameters ρ, δ, 𝓢+, and 𝓢−, and derive the convergence rate of Ξ, the risk bound of the PASS estimator, based on Theorem 1.

Figure 1:

Examples of the coefficients β0 and ᾱ in the three cases of Section 3.3. Labels S+, S−, and A in the diagrams represent 𝓢̄+∖𝓟̄, 𝓢̄−, and 𝓐 as chosen and defined in Cases 1–3. β0 and ᾱ are aligned for comparison of their directions and supports. Presented below are scatter plots of σ^{−1}{Pr(Y = 1 ∣ S, X)} against S for the simulated samples generated under Cases 1–3.

Case 1 (left panel): ᾱ recovers both the support and the direction of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows strong collinearity with S. PASS largely outperforms supervised LASSO and has the same convergence rate as a low dimensional regression.

Case 2 (middle): ᾱ (nearly) recovers the support but not the direction of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows moderate collinearity with S. PASS still outperforms both supervised LASSO and pLASSO in terms of convergence rate.

Case 3 (right): ᾱ fails to recover the support of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows weak collinearity with S. PASS has the same convergence rate as supervised LASSO.

Case 1.

Let ρ̄ = arg min over ρ of ‖β0 − ρᾱ‖1, δ̄ = β0 − ρ̄ᾱ, θ̄ = (δ̄⊤, ρ̄, γ0, ζ0)⊤, 𝓢̄+ = 𝓟̄ and 𝓢̄− = supp(δ̄). If ᾱ successfully recovers the support and direction of β0 (see the left panel of Figure 1), then 𝓢̄− ≈ Ø and ‖δ̄‖1 ≈ 0. Since ‖G^{1/2}(θ̄ − θ0)‖2 = 0 and 𝓢̄+ ∩ 𝓐 = Ø, we have Ξ = O{𝓔(θ̄, 𝓢̄+, 𝓢̄−)} by the definition of θ*. Hence by Theorem 1, the excess risk of θ̂ satisfies

Ξ = Op(λ0² + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op(λ0²) = Op(n^{−1}),

recalling that λ0 = O(n^{−1/2}).

As a standard result (Negahban et al., 2009), the rate of the excess risk of the supervised LASSO estimator is either Op{n^{−1}log(p)|𝓑0|} or Op{n^{−1/2}log(p)^{1/2}‖β0‖1}. These two rate bounds are established under different sparsity norms of β0 and are generally comparable, e.g., when the average magnitude of the non-zero entries in β0 is of order n^{−1/2}log(p)^{1/2}. In comparison, Op(n^{−1}), the risk rate of PASS in Case 1, is much sharper. Further, Op(n^{−1}) is actually the rate for a low (fixed) dimensional logistic regression. Thus, if β0 is very close to a multiple of ᾱ, PASS can outperform the vanilla LASSO and be comparable with a low dimensional regression in terms of convergence rate. This large gain owes to the use of the N unlabeled observations to obtain the direction of β0, which reduces the high dimensional regression to a low dimensional one where only the intercept and the scale of β0 need to be estimated.

Case 2.

Consider the same choice of θ̄, 𝓢̄+ and 𝓢̄− as in Case 1. If ᾱ recovers the support but not the direction of β0 (see the middle panel of Figure 1), we only have ‖δ̄𝓢̄−∩𝓐^c‖1 ≈ 0 but not ‖δ̄‖1 ≈ 0. Then by Theorem 1, the excess risk of PASS is

Ξ = Op(λ0² + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op{n^{−1/2}log(q)^{1/2}‖δ̄𝓢̄−∩𝓐‖1},

recalling that λ1 = O{n^{−1/2}log(q)^{1/2}}.

In Case 2, the convergence rate of the excess risk of PASS is still better than that of the supervised LASSO estimator when q ≪ p:

O{n^{−1/2}log(q)^{1/2}‖δ̄𝓢̄−∩𝓐‖1} ≤ O{n^{−1/2}log(p)^{1/2}‖β0‖1},

since ‖δ̄𝓢̄−∩𝓐‖1 ≤ min over ρ of ‖β0 − ρᾱ‖1 ≤ ‖β0‖1. Namely, even if ᾱ does not recover the direction of β0 very well, as long as the prior information 𝓐 = supp(ᾱ) is sparse and successfully covers supp(β0), which is reflected by 𝓢̄+ = 𝓟̄, the PASS estimator still benefits from the prior information. This is because recovering the support of β0 reduces the dimensionality of the empirical errors that need to be controlled from p to q = |𝓐|. In this case, it is also interesting to compare the proposed PASS estimator with the prior LASSO (pLASSO) procedure of Jiang et al. (2016). When supp(ᾱ) and supp(β0) are close but the directions of ᾱ and β0 are quite different, the pLASSO procedure is unable to utilize this information and only attains the same convergence rate as supervised LASSO, which is essentially slower than that of PASS.

Case 3.

Let ρ̄ = 0, δ̄ = β0, θ̄ = (δ̄⊤, ρ̄, γ0, ζ0)⊤, 𝓢̄+ = 𝓟̄ ∪ (𝓑0∖𝓐) and 𝓢̄− = 𝓑0∖𝓢̄+. If ᾱ fails to recover the support of β0, i.e., 𝓐 ∩ 𝓑0 ≈ Ø and ‖β0,𝓐∩𝓑0‖1 ≈ 0, we have ‖δ̄𝓢̄−‖1 ≤ ‖δ̄𝓐‖1 = ‖β0,𝓐∩𝓑0‖1 ≈ 0. Then again using Theorem 1,

Ξ = Op(λ2²|𝓢̄+| + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op{n^{−1}log(p)|𝓑0|},

recalling that λ2 = O{n^{−1/2}log(p)^{1/2}}.

In Case 3, the excess risk of PASS is of the same order as that of supervised LASSO. Therefore the PASS approach is robust against low-quality prior information that recovers neither the direction nor the support of β0. This benefit is a result of using the data-adaptive parameter ρ to control the influence of the prior information on the estimator.

4. Simulation Studies

4.1. Main setups

We conducted extensive simulation studies to examine the finite-sample performance of the PASS estimator and to compare it with existing approaches. We first considered the case where the logistic model for Y ∣ S, X is correctly specified, S ∣ X follows a SIM, and X is nearly elliptical, but the similarity between α0 and β0 varies. Since EHR features are often zero-inflated and skewed count variables, we generated the p = 500 dimensional X from

Xi = h(Zi),  Zi ∼ N(0, ΣZ),  h(t) = log(1 + [e^t]),

where [u] denotes the integer nearest to u, ΣZ = (σi,j)i,j=1,…,p and σi,j = 4(0.5)^{|i−j|}. Here [e^{Zij}] mimics a skewed raw EHR feature, which is typically transformed via t ↦ log(1 + t) prior to model fitting. We then generated the surrogate S from a SIM in X:

Si = h(1 + Xi⊤α0 + ϵi),  with ϵi ∼ N(0, 2²).

Following the model assumption (𝓜Y), the disease status Yi was generated from

σ^{−1}{Pr(Yi = 1 ∣ Wi)} = −4 + 0.5Si + Xi⊤β0.

To mimic different qualities of the prior information one could encounter in practice, we design six scenarios with different similarities between the true β0 and α0:

I: α0 = (a1, a2, 0p−10),  β0 = 1.5(a1, a2, 0p−10);
II: α0 = (a1, a2, 0p−10),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
III: α0 = (a1, a2, a2, a2, 0p−20),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
IV: α0 = (a1, 0p−5),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
V: α0 = (a1, a2, 0p−10),  β0 = 1.5(a2, a1, 0p−10);
VI: α0 = (a1, a2, 0p−10),  β0 = 1.5(a2, 05, a1, 0p−15),

where

a1 = (0.5, 1, 0.8, 0.6, 0.2),  d1 = (0.05, 0.5, 1.4, 0.5, 0.6),
a2 = (0.1, 0.2, 0.2, 0.2, 0.7),  d2 = (0.02, 0.05, 0.02, 0.02, 0.05).

Our specifications of β0 and α0 are motivated by the three key specific cases introduced in Section 3.3 and illustrated in Figure 1. Scenario I is the ideal case where β0 and α0 have identical directions. In Scenario II, most of the components of β0 differ slightly from a scalar multiple of α0, while a few components differ substantially. Scenarios I and II are designed to examine the performance of the PASS estimator when the prior information is highly or somewhat reliable. In Scenario III, α0 is denser than β0 and contains quite a few weak signals. On the contrary, in Scenario IV β0 is denser than α0. In Scenario V, the magnitudes of α0 and β0 are quite different, whereas they still share the same support. Scenarios III, IV and V are designed to examine the performance of the PASS estimator with respect to different degrees of accuracy of the support information. In Scenario VI, both the magnitude and the support of α0 and β0 differ substantially, which means the unlabeled dataset provides little information. This scenario allows us to see whether the PASS estimator is robust against unreliable prior information. See Figure 2 for a visualization of β0 and ρα0 across the different scenarios.
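To make the setup concrete, the following minimal R sketch generates one dataset under Scenario I; the outcome-model intercept is taken as −4, matching the reconstruction of the outcome model above, and all other constants follow this section.

    set.seed(1)
    p <- 500; N <- 10000
    a1 <- c(0.5, 1, 0.8, 0.6, 0.2)
    a2 <- c(0.1, 0.2, 0.2, 0.2, 0.7)
    alpha0 <- c(a1, a2, rep(0, p - 10))
    beta0  <- 1.5 * alpha0                             # Scenario I: identical direction
    h <- function(t) log(1 + round(exp(t)))            # h(t) = log(1 + [e^t])
    Sigma_Z <- 4 * 0.5^abs(outer(1:p, 1:p, "-"))       # sigma_ij = 4 * (0.5)^|i-j|
    Z <- matrix(rnorm(N * p), N, p) %*% chol(Sigma_Z)
    X <- h(Z)                                          # zero-inflated, skewed features
    S <- h(1 + as.numeric(X %*% alpha0) + rnorm(N, sd = 2))
    Y <- rbinom(N, 1, plogis(-4 + 0.5 * S + as.numeric(X %*% beta0)))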

Figure 2:

Supports and values of the coefficients β0 and 1.5α0 under Scenarios I–VI introduced in Section 4.1. Only those indices j satisfying β0,j ≠ 0 or α0,j ≠ 0 are shown in the plots.

We compare PASS to the following existing methods: (1) supervised LASSO penalized logistic regression with n training samples (LASSOn); (2) supervised ALASSO penalized logistic regression with n training samples, denoted by ALASSOn; (3) the SSprior estimator described in Section 2.2; and (4) two variants of the pLASSO estimator proposed in Jiang et al. (2016): (i) we fit a penalized logistic model with a LASSO penalty imposed on the predictors outside supp(α̂), as in equation (8) of Jiang et al. (2016), and then use the predicted probability from that model as Yi^p in their equation (7), denoted by pLASSO1; (ii) we use the predicted probability given by the SSprior approach as Yi^p in their equation (7), denoted by pLASSO2.

Throughout, we let N = 10000 and set ν = 1 in the ALASSO weights. We use the Bayesian information criterion (BIC) to select μinit and μ in the estimation of α, due to the large N, and use 10-fold cross-validation to select λ1, λ2 for the estimation of β, so that the phenotype model is tuned towards prediction performance. We quantify the average prediction performance of the estimated linear score ϑ̃⊤W̄, with ϑ̃ obtained via the different methods, on an independent test dataset of size 10000. For each choice of ϑ̃⊤W̄, we consider the area under the receiver operating characteristic curve (AUC) for classifying Y, the excess risk (ER) as defined in Section 3, and the mean squared error of the predicted probabilities (MSE-P), i.e., the mean squared difference between the predicted and the true probabilities. We summarize results based on 1000 simulated datasets for each configuration.
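For reference, the three metrics can be computed as in the short R sketch below, where Y_test denotes the test labels, eta_hat the fitted linear score ϑ̃⊤W̄, and eta_true the true linear predictor; these object names are illustrative, and pROC is just one of several packages providing the AUC.

    loglik <- function(y, eta) -y * eta + log(1 + exp(eta))                  # l(y, eta)
    auc   <- as.numeric(pROC::auc(Y_test, plogis(eta_hat)))                  # AUC
    er    <- mean(loglik(Y_test, eta_hat)) - mean(loglik(Y_test, eta_true))  # excess risk
    mse_p <- mean((plogis(eta_hat) - plogis(eta_true))^2)                    # MSE-P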

In Figure 3, we compare the prediction measures for estimators obtained with n = 100. In Scenario I, where the directions of β0 and α0 coincide, the SSprior approach performs the best as expected, yet the proposed PASS method attains very similar accuracy, followed by pLASSO2, which performs only slightly worse. When the directions of β0 and α0 are somewhat different, as in Scenario II, the SSprior and pLASSO estimators deteriorate quickly. In contrast, the PASS estimator maintains high accuracy and outperforms all competing estimators substantially. We observe qualitatively similar patterns for Scenarios III and IV, under which α0 and β0 have somewhat different supports. No matter whether α0 is denser than β0 as in Scenario III, or β0 is denser than α0 as in Scenario IV, the PASS method consistently outperforms the supervised estimators. Additionally, the performance of the SSprior and pLASSO approaches is not quite satisfactory. In Scenario V, β0 and α0 have the same support but are quite different in terms of magnitude. The proposed method manages to utilize the same-support information, whereas the pLASSO approaches fail to do so. Finally, the goal of Scenario VI is to examine the robustness of the methods when β0 and α0 differ a lot, possibly due to the use of an inappropriate surrogate. The PASS estimator performs similarly to the supervised estimators, indicating that our procedure is indeed adaptive to how well the data support the prior assumption. Across all scenarios, the ALASSO approach performs slightly worse than LASSO, possibly due to the presence of some small nonzero coefficients in β0.

Figure 3:

AUC (left), ER (middle) and MSE-P (right) evaluated on the test set for simulation studies under Scenarios I–VI. Outliers are not drawn. The mean performance of the PASS approach is marked using dashed lines for ease of comparison. The size of the labeled dataset is fixed at n = 100.

In Figure 4, we present the AUC, ER and MSE-P of the PASS estimator trained with n = 100 and of the supervised LASSO estimator with varying label size. In Scenario I, where the prior assumption holds exactly, PASS100, the PASS approach with 100 labeled samples, even outperforms LASSO400, the LASSO approach with 400 labeled samples. When the prior assumption holds approximately, as in Scenarios II through V, PASS100 consistently outperforms LASSO150 and achieves performance similar to LASSO200, which requires twice as many labels. Finally, in Scenario VI, where the prior information is highly inaccurate, the PASS method maintains performance comparable to LASSO100.

Figure 4:

AUC (left), ER (middle) and MSE-P (right) evaluated on the test set for simulation studies under Scenarios I–VI. Outliers are not drawn. The mean performance of the PASS approach is marked using red dashed lines for ease of comparison. The size of the labeled dataset is n = 100 for PASS, while it varies for LASSO, as indicated in the subscripts.

4.2. Efficiency and Robustness Evaluations under Mis-specifications

We conducted simulation studies under three additional scenarios to further investigate the efficiency and robustness of PASS when the model assumptions and the elliptical design assumption are violated. We again set p = 500 and generated Xi = 2Φ(Zi) − 1, where Φ(·) is the cumulative distribution function of the standard normal applied element-wise, Zi = (Zi1, …, Zip)⊤ ∼ N(0, ΣZ), ΣZ = (σi,j)i,j=1,…,p, with σi,j = (0.5)^{|i−j|} if i = j, or both i and j are ≤ 20, or both i and j are > 20, and σi,j = 0 otherwise. We make ΣZ block-diagonal for the convenience of obtaining the population solutions of β and α through the best logistic or least squares approximation under model mis-specification. In real EHR studies, a common data-generating paradigm is that the features X, e.g., some genetic variants, precede the disease status Y, and Y precedes some clinical surrogate S, e.g., the count of ICD codes associated with the disease. To mimic this, we generated Yi and Si from the following models:

Yi = I{(0.8, 1, 1, 0.8, 0.4, 0p−5)⊤Xi + ϵyi ≥ 0},  ϵyi ∼ N(0, 1),
Si = μYi + η1⊤Xi + Yi·η2⊤Xi + ϵsi,  ϵsi ∼ N(0, 1).

Assumptions (𝓒prior) and (𝓜S) hold when η1 = η2 = 0, and would be severely violated when η1 and η2 are large. We design three scenarios with η1 and η2 representing different degrees of violation of the surrogate assumptions:

  (i) μ = 1, and η1 = η2 = 0;

  (ii) μ = 1.5, η1 = (a3, 0p−5), and η2 = (d3, 0p−5);

  (iii) μ = 2, η1 = (a3, a3, a3, 0p−15), and η2 = (d3, d3, d3, 0p−15),

where a3 = (0.6, 0.4, 0.4, 0.5, 0.5) and d3 = (0.3, 0.4, 0.6, 0.5, 0.5). Here μ depicts the marginal effect of Yi on Si, and is set to keep the AUC of the target model at a similar level across the three scenarios. Across all scenarios, Pr(Yi = 1 ∣ Si, Xi) no longer follows a parametric logistic model, i.e., (𝓜Y) is misspecified. Our goal is to estimate the limiting coefficients ζ0, γ0, β0 defined as the minimizer of E ℓ(Yi, ζ + γSi + Xi⊤β). The benchmark methods and their implementation, tuning, and evaluation procedures are the same as in Section 4.1, except that we implement supervised LASSO with n ranging from 100 to 700.
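A minimal R sketch of this data-generating mechanism, shown for Scenario ii, is given below; the constants follow the description above, and the block-diagonal ΣZ is built over the index blocks {1, …, 20} and {21, …, p}.

    set.seed(1)
    p <- 500; N <- 10000
    block <- c(rep(1, 20), rep(2, p - 20))
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-")) * outer(block, block, "==")   # block-diagonal
    Z <- matrix(rnorm(N * p), N, p) %*% chol(Sigma)
    X <- 2 * pnorm(Z) - 1                              # X_i = 2 * Phi(Z_i) - 1
    a3 <- c(0.6, 0.4, 0.4, 0.5, 0.5); d3 <- c(0.3, 0.4, 0.6, 0.5, 0.5)
    mu <- 1.5
    eta1 <- c(a3, rep(0, p - 5)); eta2 <- c(d3, rep(0, p - 5))           # Scenario ii
    Y <- as.integer(X %*% c(0.8, 1, 1, 0.8, 0.4, rep(0, p - 5)) + rnorm(N) >= 0)
    S <- mu * Y + as.numeric(X %*% eta1) + Y * as.numeric(X %*% eta2) + rnorm(N)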

In Figure 5, we present the AUC, ER and MSE-P of the methods under Scenarios i–iii. In Scenario i, PASS has performance similar to the semi-supervised benchmarks SSprior and pLASSO, and all the semi-supervised estimators significantly outperform the two supervised estimators, since (𝓒prior) holds and ᾱ essentially recovers the direction of β0 well. Among the semi-supervised estimators, the SSprior and pLASSO2 estimators have a slight advantage with smaller variation, as expected, since both heavily rely on the prior information, which is of high quality in this setting. In Scenario ii, the key assumption (𝓒prior) is violated, which drastically impacts the performance of SSprior and pLASSO2. On the other hand, PASS and pLASSO1 still effectively leverage the imperfect information from ᾱ to approximately recover the support of β0, and thus outperform SSprior, pLASSO2, and the supervised methods. In Scenario iii, η1 and η2 become denser than those in Scenario ii. This can make the recovery of supp(β0) using supp(ᾱ) less accurate, and interestingly, PASS outperforms all methods including pLASSO1, which also leverages supp(ᾱ). In all three scenarios, PASS significantly outperforms supervised LASSO trained with the same number of labels or even 2–3 times as many, which demonstrates a large gain from using the unlabelled dataset to assist the regression. Finally, the results demonstrate that our method can still efficiently leverage the prior information from S in estimating the target parameters when S ∣ Y, X depends strongly on X so that (𝓒prior) is violated, (𝓜Y) is misspecified, and the design is non-elliptical.

Figure 5:

Evaluation metrics on the test set for simulation studies under Scenarios i–iii introduced in Section 4.2. Outliers are not drawn. The mean performance of the PASS approach is marked using red dashed lines for ease of comparison. In the left panel, we present the evaluation metrics of all methods for comparison when n = 100. In the right panel, we compare the performance of PASS with n = 100 against supervised LASSO obtained using labelled samples of various sizes n (from 100 to 700).

5. Application to EHR Phenotyping

We examine the performance of PASS along with other approaches in three real-world EHR phenotyping studies whose goal is to develop classification models for the diseases of interest. All studies were performed at a large tertiary hospital system with EHR spanning multiple decades. Each study has n0 labeled observations for algorithm training and validation. We consider three choices of training size n, each no more than n0/2, in all examples. First, we randomly split the labelled samples into four folds of equal size. Then we take each fold in turn as the validation set, sample n training labels from the other three folds 20 times, train and validate the algorithms, and finally average the evaluation metrics and their standard errors over the validation results on the four folds. We replicate this procedure 10 times and report the average performance.

Data Example 1 (CAD Phenotyping).

The goal of this study is to identify patients with coronary artery disease (CAD) based on their EHR features. The study cohort consists of N=4164 patients, out of which a random subset of n0=181 patients have their true CAD status annotated via chart review by domain experts. We use the sum of the counts for the CAD ICD code and NLP mention of CAD as the surrogate. There are p=585 additional EHR features consisting of the total count of all ICD codes as a healthcare utilization measure, 10 ICD codes related to CAD, and 574 NLP variables. For the size of training labels, we consider n=50,70,90. This de-identified dataset has been analyzed in previous studies (Zhang et al., 2019, e.g.) and is publicly available online: https://celehs.github.io/PheCAP/articles/example2.html.

Data Example 2 (RA Phenotyping).

Similar to the CAD phenotyping study, the goal is to identify patients with rheumatoid arthritis (RA) based on their EHR features. There are N = 46114 patients in total, out of which n0 = 435 patients have their RA status annotated. Again, we choose the sum of the ICD code counts and NLP mentions of RA as the surrogate. The p = 924 additional EHR features consist of the healthcare utilization measure and 923 NLP variables potentially predictive of RA. For the size of training labels, we consider n = 50, 125, 200.

Data Example 3 (Depression Phenotyping).

The goal is to identify patients with depression based on their codified EHR features. There are N = 9474 patients in total and n0 = 236 labeled observations. The surrogate is chosen as the count of the depression ICD code. There are p = 231 additional EHR features, including the healthcare utilization measure and 230 codified EHR features on depression-related medication prescriptions, laboratory tests and ICD codes. For the size of training labels, we consider n = 50, 85, 120.

In the three data examples, N is significantly larger than p, with N/max(p, n) being approximately 7 for CAD, 50 for RA, and 41 for Depression. In all three studies, we apply the x ↦ log(1 + x) transformation to all count variables. Also, since patients with higher healthcare utilization tend to have higher counts of most features, we orthogonalize all features against the healthcare utilization before regression fitting. Since ϑ0 is unknown in applications, we quantify the performance of an estimator ϑ̃ based on the AUC and the Brier skill score (BSS) of σ(ϑ̃⊤W̄) for predicting Y, where the BSS is defined as 1 − Êv[{Y − σ(ϑ̃⊤W̄)}²]/Êv[{Y − Êv(Y)}²], and Êv denotes the empirical expectation over the validation sample. The BSS is essentially a binary version of the R-square.
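The preprocessing and the BSS can be computed as in the brief R sketch below, where X_raw holds the raw count features, util the healthcare utilization feature, Y_val the validation labels, and p_hat the predicted probabilities σ(ϑ̃⊤W̄); all names are illustrative.

    X <- log(1 + X_raw)                                     # x -> log(1 + x) for count features
    X <- apply(X, 2, function(x) resid(lm(x ~ util)))       # orthogonalize against utilization
    # Brier skill score of predicted probabilities on the validation sample:
    bss <- 1 - mean((Y_val - p_hat)^2) / mean((Y_val - mean(Y_val))^2)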

For comparison, we include PASS, SSprior, pLASSO2, supervised LASSO and ALASSO in the three data examples to estimate the phenotyping model (𝓜Y). We exclude pLASSO1 since it requires fitting an unpenalized regression on supp(α̂), which is infeasible when |supp(α̂)| > n. In addition, we compare to the unsupervised LASSO (ULASSO) approach of Chakrabortty et al. (2017), which estimates the direction of the logistic coefficients β for Y ∣ X ∼ σ(β⊤X) by regressing I(S > cu) against X on the subset of patients whose S is either greater than cu or smaller than cl, for some pre-specified cu and cl typically chosen such that Pr(S > cu) and Pr(S < cl) are small. Since the ULASSO approach only provides an estimate β̃ to optimize the prediction of Y ∣ X via β̃⊤X, without using S explicitly as an additional predictor, we also derive a semi-supervised variant of ULASSO, denoted by SSULASSO, by regressing the labeled Y against β̃⊤X and S, as for SSprior.

As shown in Figure 6, PASS significantly outperforms the supervised LASSO and ALASSO when n = 50 in all three examples. As the label size n increases, their performances get closer. Compared with the semi-supervised benchmarks, PASS has slightly or moderately better performance on the CAD and RA studies. For Depression, PASS substantially outperforms them, especially SSprior and SSULASSO. For example, when n = 50, PASS attains an average AUC in classifying depression about 0.1 higher than that of SSprior and SSULASSO and 0.05 higher than that of pLASSO. The gap becomes smaller when n increases, as expected. Interestingly, the supervised estimators also outperform pLASSO, SSprior, and SSULASSO on the Depression dataset, but have similar or worse performance than these semi-supervised approaches on the other two examples. This could in part be attributed to the relatively poor quality of the surrogate information, which makes the existing semi-supervised approaches fail. In contrast, PASS can utilize such prior information more effectively and robustly, and still achieves better performance than the supervised estimators. Thus, we conclude that incorporating prior information from the unlabeled dataset can improve and stabilize the prediction performance of phenotyping models in EHR applications, and that PASS is more robust and efficient in leveraging the prior information than existing semi-supervised methods. In addition, ULASSO shows much worse performance than the other supervised and semi-supervised methods in all examples. This illustrates the importance of collecting labels and of including the surrogate in the regression models for EHR phenotyping.

Figure 6:

Out-of-sample AUC and BSS on Data Examples 1–3, with various sizes of labelled training samples denoted by n. The median performance of PASS is marked using red dashed lines for ease of comparison.

6. Discussion

In this paper, we propose PASS, a high dimensional sparse estimator that adaptively incorporates prior knowledge from a surrogate under a semi-supervised scenario commonly encountered in application fields such as EHR analysis. Compared to supervised approaches, the proposed PASS approach can substantially reduce the required number of labeled samples when the model assumptions (𝓜S) and (𝓒prior) and the elliptical design assumption (C1) hold exactly or approximately, and thus the prior information ᾱ is trustworthy. Compared to the existing pLASSO and SSprior approaches that also incorporate prior information, the PASS approach is robust against unreliable prior information ᾱ, which might be the case when the surrogate model assumptions are violated or the design X is highly non-elliptically distributed.

One of the main challenges in our theoretical analysis comes from the collinearity of the covariates (1, Si, Xi⊤α̂, Xi⊤)⊤ induced by introducing ρ to leverage the prior information in α̂. We overcome this by properly constructing the oracle coefficients θ* and the restricted eigenvalue assumption (A6). The formulation of our problem falls into the missing data framework with data missing completely at random. However, the missing probability approaches 1 as N → ∞. This, together with the high dimensionality of X, makes the theoretical justification more challenging than in the standard missing data literature. Without the prior assumption that β0 − ρα0 is sparse in a certain sense, the unlabeled dataset cannot directly contribute to the estimation of β0. Our proposed PASS procedure hinges on the sparsity of β0 − ρα0 to leverage the unlabeled dataset.

We have restricted the discussion to a single surrogate variable for simplicity. However, the proposed method can be easily extended to multiple surrogates. Specifically, consider K surrogates, denoted by S[1], …, S[K]. Let α̂[k] be the ALASSO estimator obtained by regressing Si[k] against Xi, 𝓐̂ = ∪k=1,…,K supp(α̂[k]), Si = (Si[1], …, Si[K])⊤ and ρ = (ρ[1], …, ρ[K])⊤. We can obtain an estimator of the model parameters as

(ζ̂, γ̂, ρ̂, β̂) = arg min over ζ, γ, ρ, β of n^{−1} Σi=1,…,n ℓ(Yi, ζ + Si⊤γ + Xi⊤β) + λ1‖(β − Σk ρ[k]α̂[k])𝓐̂‖1 + λ2‖β𝓐̂^c‖1.

Theoretical justification and the finite sample performance of β̂ under this setting warrant further research. In our numerical studies, we focused only on fully simulated datasets and real examples. We are further interested in investigating the performance of our approach through semi-synthetic experiments with various setups for the surrogate variables. In addition, it may be interesting to extend the semi-supervised PASS estimator from high dimensional sparse parametric regression to semi-parametric settings such as the sparse additive model (Ravikumar et al., 2009) and the sparse varying coefficient model (Noh and Park, 2010). Under semi-parametric models, one could still leverage prior information by shrinking the coefficients towards ρα̂ with some sparse penalty function to gain statistical efficiency. Studying the specific forms and theoretical properties of such approaches within a semi-supervised framework warrants future research.

R code for implementing PASS and the benchmark methods and for replicating the simulation results can be found at https://github.com/moleibobliu/PASS.

Supplementary Material

Supplement

Contributor Information

Yichi Zhang, Department of Computer Science and Statistics, University of Rhode Island.

Molei Liu, Department of Biostatistics, Harvard T.H. Chan School of Public Health.

Matey Neykov, Department of Statistics and Data Science, Carnegie Mellon University.

Tianxi Cai, Department of Biostatistics, Harvard T.H. Chan School of Public Health.

References

  1. Agarwal Vibhu, Podchiyska Tanya, Banda Juan M, Goel Veena, Leung Tiffany I, Minty Evan P, Sweeney Timothy E, Gyang Elsie, and Shah Nigam H. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 23(6):1166–1173, 2016.
  2. Ananthakrishnan Ashwin N, Cheng Su-Chun, Cai Tianxi, Cagan Andrew, Gainer Vivian S, Szolovits Peter, Shaw Stanley Y, Churchill Susanne, Karlson Elizabeth W, Murphy Shawn N, et al. Association between reduced plasma 25-hydroxy vitamin d and increased risk of cancer in patients with inflammatory bowel diseases. Clinical Gastroenterology and Hepatology, 12(5):821–827, 2014.
  3. Bickel Peter J, Ritov Ya'acov, and Tsybakov Alexandre B. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009. doi: 10.1214/08-AOS620.
  4. Brownstein John S, Murphy Shawn N, Goldfine Allison B, Grant Richard W, Sordo Margarita, Gainer Vivian, Colecchi Judith A, Dubey Anil, Nathan David M, Glaser John P, et al. Rapid identification of myocardial infarction risk associated with diabetes medications using electronic medical records. Diabetes Care, 33(3):526–531, 2010.
  5. Bühlmann Peter and Van De Geer Sara. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
  6. Cai Tianxi, Zhang Yichi, Ho Yuk-Lam, Link Nicholas, Sun Jiehuan, Huang Jie, Cai Tianrun A, Damrauer Scott, Ahuja Yuri, Honerlaw Jacqueline, Huang Jie, Costa Lauren, Schubert Petra, Hong Chuan, Gagnon David, Sun Yan V, Gaziano J Michael, Wilson Peter, Cho Kelly, Tsao Philip, O'Donnell Christopher J, Liao Katherine P, and for the VA Million Veteran Program. Association of interleukin 6 receptor variant with cardiovascular disease effects of interleukin 6 receptor blocking therapy: a phenome-wide association study. JAMA Cardiology, 3(9):849–857, 2018. doi: 10.1001/jamacardio.2018.2287.
  7. Carroll Robert J, Thompson Will K, Eyler Anne E, Mandelin Arthur M, Cai Tianxi, Zink Raquel M, Pacheco Jennifer A, Boomershine Chad S, Lasko Thomas A, Xu Hua, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19(e1):e162–e169, 2012.
  8. Chakrabortty Abhishek, Neykov Matey, Carroll Raymond, and Cai Tianxi. Surrogate aided unsupervised recovery of sparse signals in single index models for binary outcomes. arXiv preprint arXiv:1701.05230, 2017.
  9. Denny Joshua C, Ritchie Marylyn D, Basford Melissa A, Pulley Jill M, Bastarache Lisa, Brown-Gentry Kristin, Wang Deede, Masys Dan R, Roden Dan M, and Crawford Dana C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics, 26(9):1205–1210, 2010.
  10. Diaconis Persi and Freedman David. Asymptotics of graphical projection pursuit. The Annals of Statistics, 12(3):793–815, 1984. doi: 10.1214/aos/1176346703.
  11. Doshi-Velez Finale, Ge Yaorong, and Kohane Isaac. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics, 133(1):e54–e63, 2014.
  12. Friedman Jerome, Hastie Trevor, and Tibshirani Rob. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. doi: 10.18637/jss.v033.i01.
  13. Gottesman Omri, Kuivaniemi Helena, Tromp Gerard, Faucett W Andrew, Li Rongling, Manolio Teri A, Sanderson Saskia C, Kannry Joseph, Zinberg Randi, Basford Melissa A, et al. The electronic medical records and genomics (eMERGE) network: past, present, and future. Genetics in Medicine, 15(10):761–771, 2013.
  14. Hall Peter and Li Ker-Chau. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, 21(2):867–889, 1993. doi: 10.1214/aos/1176349155.
  15. Halpern Yoni, Horng Steven, Choi Youngduck, and Sontag David. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4):731–740, 2016.
  16. Hong Chuan, Liao Katherine P, and Cai Tianxi. Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics, 75(1):78–89, 2019.
  17. Jiang Yuan, He Yunxiao, and Zhang Heping. Variable selection with prior information for generalized linear models via the prior LASSO method. Journal of the American Statistical Association, 111(513):355–376, 2016. doi: 10.1080/01621459.2015.1008363.
  18. Kohane Isaac S. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 12(6):417–428, 2011.
  19. Lee Jason D, Lei Qi, Saunshi Nikunj, and Zhuo Jiacheng. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.
  20. Li Ker-Chau and Duan Naihua. Regression analysis under link violation. The Annals of Statistics, 17(3):1009–1052, 1989. doi: 10.1214/aos/1176347254.
  21. Liao Katherine P, Diogo Dorothée, Cui Jing, Cai Tianxi, Okada Yukinori, Gainer Vivian S, Murphy Shawn N, Gupta Namrata, Mirel Daniel, Ananthakrishnan Ashwin N, et al. Association between low density lipoprotein and rheumatoid arthritis genetic factors with low density lipoprotein levels in rheumatoid arthritis and non-rheumatoid arthritis controls. Annals of the Rheumatic Diseases, 73(6):1170–1175, 2014.
  22. Liao Katherine P, Cai Tianxi, Savova Guergana K, Murphy Shawn N, Karlson Elizabeth W, Ananthakrishnan Ashwin N, Gainer Vivian S, Shaw Stanley Y, Xia Zongqi, Szolovits Peter, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ, 350:h1885, 2015.
  23. McDermott Matthew, Yan Tom, Naumann Tristan, Hunt Nathan, Suresh Harini, Szolovits Peter, and Ghassemi Marzyeh. Semi-supervised biomedical translation with cycle wasserstein regression GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  24. Negahban Sahand, Yu Bin, Wainwright Martin J, and Ravikumar Pradeep K. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
  25. Newton Katherine M, Peissig Peggy L, Kho Abel Ngo, Bielinski Suzette J, Berg Richard L, Choudhary Vidhu, Basford Melissa, Chute Christopher G, Kullo Iftikhar J, Li Rongling, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association, 20(e1):e147–e154, 2013.
  26. Noh Hoh Suk and Park Byeong U. Sparse varying coefficient models for longitudinal data. Statistica Sinica, pages 1183–1202, 2010.
  27. Ratner Alexander, Bach Stephen H, Ehrenberg Henry, Fries Jason, Wu Sen, and Ré Christopher. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment, volume 11, page 269, 2017.
  28. Ravikumar Pradeep, Lafferty John, Liu Han, and Wasserman Larry. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.
  29. van de Geer Sara A. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008. doi: 10.1214/009053607000000929.
  30. van de Geer Sara A and Bühlmann Peter. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009. doi: 10.1214/09-EJS506.
  31. Wang Hai and Poon Hoifung. Deep probabilistic logic: A unifying framework for indirect supervision. arXiv preprint arXiv:1808.08485, 2018.
  32. Wilke RA, Xu H, Denny JC, Roden DM, Krauss RM, McCarty CA, Davis RL, Skaar Todd, Lamba J, and Savova G. The emerging role of electronic medical records in pharmacogenomics. Clinical Pharmacology & Therapeutics, 89(3):379–386, 2011.
  33. Yu Sheng, Liao Katherine P, Shaw Stanley Y, Gainer Vivian S, Churchill Susanne E, Szolovits Peter, Murphy Shawn N, Kohane Isaac S, and Cai Tianxi. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association, 22(5):993–1000, 2015.
  34. Zhang Lingjiao, Ding Xiruo, Ma Yanyuan, Muthu Naveen, Ajmal Imran, Moore Jason H, Herman Daniel S, and Chen Jinbo. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients. Journal of the American Medical Informatics Association, 27(1):119–126, 2020.
  35. Zhang Yichi, Cai Tianrun, Yu Sheng, Cho Kelly, Hong Chuan, Sun Jiehuan, Huang Jie, Ho Yuk-Lam, Ananthakrishnan Ashwin N, Xia Zongqi, et al. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nature Protocols, 14(12):3426–3444, 2019.
  36. Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. doi: 10.1198/016214506000000735.
  37. Zou Hui and Zhang Hao Helen. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009. doi: 10.1214/08-AOS625.
