Author manuscript; available in PMC: 2023 Nov 15.
Published in final edited form as: J Mach Learn Res. 2022;23:83.

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Yichi Zhang 1,*, Molei Liu 2,*, Matey Neykov 3, Tianxi Cai 4
PMCID: PMC10653017  NIHMSID: NIHMS1912660  PMID: 37974910

Abstract

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite this potential, EHR data remain underutilized for discovery research due to a major limitation: the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available for all patients. Under a working prior assumption that S is related to X only through Y, and allowing this assumption to hold only approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.

Keywords: High dimensional sparse regression, regularization, single index model, semi-supervised learning, electronic health records

1. Introduction

Electronic Health Records (EHRs) provide a large and rich data source for biomedical research aiming to further our understanding of disease progression and treatment response. EHR data have been successfully used to gain novel insights into a wide range of diseases, with examples including diabetes (Brownstein et al., 2010), rheumatoid arthritis (Liao et al., 2014), inflammatory bowel disease (Ananthakrishnan et al., 2014), and autism (Doshi-Velez et al., 2014). EHR is also a powerful discovery tool for identifying novel associations between genomic markers and multiple phenotypes through analyses such as phenome-wide association studies (Denny et al., 2010; Kohane, 2011; Wilke et al., 2011; Cai et al., 2018).

Despite its potential, ensuring unbiased and powerful biomedical studies using EHR is challenging because EHR was primarily designed for patient care, billing, and record keeping. Extracting precise phenotype information for an individual patient requires manual medical chart reviews, an expensive process that is not scalable for research studies. To overcome such difficulties, recent efforts including those from Informatics for Integrating Biology and the Bedside (i2b2) (Liao et al., 2015; Yu et al., 2015, e.g.) and the Electronic Medical Records and Genomics (eMERGE) network (Newton et al., 2013; Gottesman et al., 2013) have been devoted to developing phenotyping algorithms to predict disease status using relatively small training datasets with gold standard labels extracted via chart review.

Various approaches to EHR phenotyping have been proposed. Supervised machine learning methods have been shown to achieve robust performance across disease phenotypes and EHR systems (Carroll et al., 2012; Liao et al., 2015). However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the candidate features, denoted by X, are of high dimensionality p. One approach to overcoming the high dimensionality is to consider unsupervised methods. Unfortunately, standard unsupervised methods such as clustering are likely to fail when the dimension of X is large but a majority of the features are unrelated to the phenotype of interest and instead predictive of some other underlying subgroups. Recently, unsupervised methods based on “silver standard labels” have been proposed. These methods leverage a surrogate outcome S that is highly predictive of the true phenotype status Y, such as the count of International Classification of Diseases (ICD) billing codes for the disease, to train the phenotyping algorithm against the features X. Specifically, Halpern et al. (2016) and Zhang et al. (2020) utilized anchor variables with high positive predictive value as the surrogate S to estimate Y ∣ X under the conditional independence assumption S ⊥ X ∣ Y. Agarwal et al. (2016) trained a penalized logistic regression of S on X for phenotyping of Y against X. Chakrabortty et al. (2017) provided theoretical justification for this strategy. They showed that a regularized estimator constructed from an unlabeled subset consisting of those with extreme values of S can be used to infer the direction of β under single index models S ∼ f(α⊤X, ϵ) and Y ∼ g(β⊤X). Their method relies on the similarity between the directions of α and β for efficient estimation. However, it is not robust to poor surrogacy resulting from violation of such assumptions. Furthermore, their method cannot be directly used to predict Y using both S and X, or to accurately recover the scale of Pr(Y = 1 ∣ S, X).

A number of semi-supervised or weakly supervised deep learning procedures have also been proposed recently and shown to attain better performance than their supervised counterparts. For example, Ratner et al. (2017) proposed a weakly supervised approach that trains a deep model with imperfect labels generated from user-specified labeling functions based on sources such as patterns, heuristics, and external knowledge bases. Wang and Poon (2018) developed a framework for weak supervision from multiple sources by composing probabilistic logic with deep learning. McDermott et al. (2018) designed a semi-supervised cycle Wasserstein regression generative adversarial network (CWR-GAN) approach that uses adversarial signals to learn from unlabelled samples and improve prediction performance when gold-standard labels are scarce. However, due to their complex architectures, it remains unclear when and how the surrogate features, along with the unlabeled dataset, can improve the prediction performance of these deep models.

In this paper, we propose a semi-supervised (SS) method for estimating Y ∣ W = (S, X⊤)⊤ that borrows information from both a small labeled dataset with n realizations of (Y, W) and a much larger unlabeled dataset with N observations of W, under a high dimensional setting with N ≫ p ≫ n. We consider a logistic phenotype model for Y ∣ S, X, a single index model (SIM) for S ∣ X, as well as a working prior assumption that S is independent of X given Y. We obtain the estimator through regularization with penalty functions reflecting the prior knowledge. When the prior assumption holds exactly, we show that the unlabeled dataset can naturally be used to assist in the estimation of the phenotype model. Allowing the prior assumption to hold only approximately or to be highly violated, our prior adaptive semi-supervised (PASS) estimator adaptively incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior.

The proposed PASS estimator is similar to the prior LASSO (pLASSO) procedure of Jiang et al. (2016) in that both approaches aim to incorporate prior information into the ℓ1-penalized estimator in a high-dimensional setting. Nevertheless, the differences are substantial and clear. Jiang et al. (2016) assumed that the prior information is summarized into predicted values and contributes to the likelihood term. In contrast, we use the prior information to guide the shrinkage and incorporate it into the penalty term. In this sense, PASS and pLASSO complement each other to some extent. However, as shown in both theory and simulations, putting prior information into the likelihood term tends to lead to a “take it or leave it” phenomenon: the usefulness of the prior information is determined by the overall effect of all predictors. By contrast, by putting the prior information into the penalty term, the PASS approach provides more flexible control: it is able to scrutinize the individual effect of each predictor. This gained flexibility can result in improved theoretical and numerical performance.

The rest of this paper is organized as follows. We discuss the motivation, an important special scenario, and the general methodology in Section 2. We analyze the theoretical properties of the proposed approach in Section 3, and assess its finite sample performance via simulation studies in Section 4. Furthermore, we illustrate the practical value of the proposed approach on three real EHR datasets in Section 5. Finally, we conclude the paper with some discussions and extensions in Section 6. All technical proofs and additional numerical results are given in the Supplementary Materials.

2. Methodology

2.1. Setup

We assume that the underlying data consist of N independent and identically distributed (i.i.d.) observations {(Yi, Si, Xi⊤)⊤ = (Yi, Wi⊤)⊤, i = 1, …, N}, where Yi is a binary indicator of the disease status of the ith patient, Si is a scalar surrogate variable chosen via domain knowledge to be reasonably predictive of Yi, and Xi is a p-dimensional feature vector. Examples of Si include the total count of ICD codes for the disease of interest or of its mentions in clinical notes extracted via natural language processing (NLP). Candidate features X may include the ICD9 code counts for competing diagnoses, lab results, as well as NLP mentions of relevant signs/symptoms, medications and procedures. We may also include various transformations of the original features in X to account for non-linear effects. While {Wi, i = 1, …, N} is fully observed, Yi is only observed for a random subset of n patients. Hence the observed data are 𝓛 ∪ 𝓤, where without loss of generality the first n observations are assumed fully observed, forming the labeled dataset 𝓛 = {(Yi, Wi⊤)⊤, i = 1, …, n}, and the rest constitute the unlabeled dataset 𝓤 = {Wi, i = n+1, …, N}.

Throughout, for a d-dimensional vector v, the ℓq-norm of v is ‖v‖q = (Σj=1,…,d |vj|^q)^{1/q}. The ℓ∞-norm of v is ‖v‖∞ = max1≤j≤d |vj|. The support of v is supp(v) = {j : vj ≠ 0}. If 𝓘 is a subset of {1, …, d}, then v𝓘 denotes the d-dimensional vector whose jth element is vj·1{j ∈ 𝓘}, where 1B is the indicator function of the event B. The independence between random variables/vectors U and V is written as U ⊥ V. We also denote the negative log-likelihood function associated with the logistic model by ℓ(y, η) = −yη + log(1 + e^η).

2.2. Model Assumptions

To predict Y using W = (S, X⊤)⊤, we assume

Pr(Y = 1 ∣ W) = σ(ζ0 + Sγ0 + X⊤β0) = σ(ϑ0⊤W̄),  with ϑ0 = (ζ0, γ0, β0⊤)⊤,  (𝓜Y)

where for any vector w, w̄ = (1, w⊤)⊤, and σ(t) = e^t/(1 + e^t). To leverage the data in 𝓤, we further assume a single index model (SIM) for S ∣ X, i.e., there exists α0 ∈ ℝ^p such that

S = f(X⊤α0, ϵ),  with some ϵ ⊥ X and f satisfying E{f²(X⊤α0, ϵ)} < ∞,  (𝓜S)

where X⊤α0 is a single linear combination of the features X and f is an unknown link function. Here ζ0, γ0, β0 and α0 are parameters to be estimated, where only the direction of α0 is identifiable and its norm does not affect the construction introduced below. If α0 and β0 are similar in certain ways, one would expect that the unlabeled dataset 𝓤 may be used to improve upon the standard supervised estimator for β0 using 𝓛 alone. For example, if S is a noisy representation of Y with random measurement error, then it is reasonable and common in the EHR literature (Hong et al., 2019; Zhang et al., 2020, e.g.) to assume

X ⊥ S ∣ Y.  (𝓒prior)

Note that a conditional independence assumption similar to (𝓒prior) was imposed between the input and the pretext target given the label in the context of self-supervised learning, to demonstrate its advantage (Lee et al., 2020). Under (𝓒prior), we have Proposition 1, with proof given in the Supplementary Materials.

Proposition 1.

Under (𝓜Y), (𝓜S), (𝓒prior), and assuming that E(XX⊤) is positive definite and that (C1) for any two vectors a1, a2, E(X⊤a2 ∣ X⊤a1) is linear in X⊤a1, there exist scalars k1, k2 such that α0 = k1β0 and ᾱ = k2β0, where

(τ̄, ᾱ) = arg min over τ, α of E(S − τ − X⊤α)².

Remark 1.

Condition (C1) holds for elliptical distributions including the multivariate normal. By Diaconis and Freedman (1984) and Hall and Li (1993), this assumption tends to hold for non-elliptical designs when the dimensionality is high. Specifically, one can show that under mild regularity conditions, for two projection vectors a1 and a2 drawn uniformly at random from 𝕊^{p−1} = {v ∈ ℝ^p : ‖v‖2 = 1}, the pair (X⊤a2, X⊤a1) weakly converges to a bivariate normal distribution with high probability, and thus E(X⊤a2 ∣ X⊤a1) is at least approximately linear in X⊤a1; see Theorem 1.1 of Diaconis and Freedman (1984) and equation (1.9) of Hall and Li (1993).

Proposition 1 hinges on the main result of Li and Duan (1989) that when the features X satisfy (C1), the direction of the coefficients of a SIM can be estimated using least squares regression of the response against X. It suggests that 𝓤 can greatly improve the estimation of β0 under (𝓒prior) because the phenotype model (𝓜Y) may be rewritten as logit Pr(Y = 1 ∣ W) = ζ + Sγ + ρX⊤α0 for some ρ. Under this model, a simple SS estimator for ζ, γ and β in (𝓜Y) can be obtained as ζ̂, γ̂ and ρ̂α̂, where

(ζ̂, γ̂, ρ̂) = arg min over ζ, γ, ρ of Σi=1,…,n ℓ(Yi, ζ + γSi + ρXi⊤α̂),  (τ̂, α̂) = arg min over τ, α of Σi=1,…,N (Si − τ − Xi⊤α)².

By doing so, the direction of the high dimensional vector β is estimated based on the entire 𝓛 ∪ 𝓤, and only the low dimensional parameters (ζ, γ, ρ) are estimated using the small labeled dataset 𝓛. Hereafter we shall refer to this SS estimator derived under (𝓒prior) as SSprior.
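For concreteness, the following minimal R sketch illustrates this two-step SSprior fit; the object names (X, S, Y, n) are illustrative rather than taken from the paper's released code, and a LASSO-penalized least squares fit via glmnet stands in for the unpenalized least squares step when p is large.

    library(glmnet)
    # Step 1: estimate the direction of alpha by (penalized) least squares of S on X,
    # using all N observations in L and U.
    fit_alpha <- cv.glmnet(X, S, family = "gaussian")
    alpha_hat <- as.numeric(coef(fit_alpha, s = "lambda.min"))[-1]   # drop the intercept
    # Step 2: on the n labeled observations, fit the 3-parameter logistic model of
    # Y on S and the single index X %*% alpha_hat.
    lab    <- data.frame(Y = Y, S = S[1:n], index = as.numeric(X[1:n, ] %*% alpha_hat))
    fit_ss <- glm(Y ~ S + index, data = lab, family = binomial())
    # Implied coefficient vector for X under (C_prior): rho_hat * alpha_hat.
    beta_ss <- coef(fit_ss)["index"] * alpha_hat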

Nevertheless, SSprior is only valid when (𝓒prior) and (C1) hold exactly. Our goal is to develop a more robust SS estimator under (𝓜Y) and (𝓜S) that can efficiently exploit 𝓤 when (𝓒prior) and (C1) may only hold approximately. In this more general setting, a desirable SS estimator should improve upon the standard supervised estimator when the directions of α0 and β0 are similar in their magnitude and/or support. In addition, it should perform similarly to the supervised estimator when the two directions are not close. We shall now detail our PASS estimation procedure, which automatically adapts to these different cases as reflected in the observed data.

2.3. Prior Adaptive Semi-Supervised (PASS) Estimator

With 𝓛 only, a supervised estimator for β can be obtained via the standard ℓ1-penalized regression:

ϑ˘ = (ζ˘, γ˘, β˘⊤)⊤ = arg min over ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ‖β‖1.  (1)

With properly chosen λ, the consistency and rate of convergence of ϑ˘ have been established (van de Geer, 2008). To improve the estimation of β through leveraging 𝓤, we note that when (𝓒prior) holds approximately, the magnitude of β0 − ρα0 is small for some ρ, and the support of β0 − ρα0 is of small size as well.

To incorporate this prior belief about the relationship between α0 and β0, we construct the penalty term

min over ρ of {λ1‖(β − ρα0)𝓐0‖1 + λ2‖(β − ρα0)𝓐0^c‖1},

where 𝓐0 = supp(α0), and λ1, λ2 > 0 are tuning parameters. Since (α0)𝓐0^c = 0, the penalty term is equivalent to

λ1{min over ρ of ‖(β − ρα0)𝓐0‖1} + λ2‖β𝓐0^c‖1.  (2)

The first term in the penalty measures how far β is from the closest vector along the α0 direction, and hence encourages a small magnitude of β − ρα0. The second term shrinks β𝓐0^c towards 0, which reflects our prior that predictors irrelevant to S are likely to be irrelevant to Y as well. The tuning parameters λ1, λ2 control the strength of the belief imposed. When they are sufficiently large, β will be forced to be a multiple of α0, and we end up with the same estimator as in the case where (𝓒prior) holds.

Since we have N ≫ p samples to estimate α0, we use the adaptive LASSO (ALASSO) penalized least squares estimator α̂ (Zou, 2006; Zou and Zhang, 2009), where

(τ̂, α̂) = arg min over τ, α of (1/N) Σi=1,…,N (Si − τ − Xi⊤α)² + μ Σj=1,…,p ω̂j|αj|,

where ω̂j = |α̂init,j|^{−ν} for some constant ν > 0, α̂init = (α̂init,1, …, α̂init,p)⊤,

(τ̂init, α̂init) = arg min over τ, α of (1/N) Σi=1,…,N (Si − τ − Xi⊤α)² + μinit‖α‖1,

and μinit and μ are tuning parameters that can be chosen via cross-validation or the Bayesian information criterion (BIC). Here, α̂ is actually an estimator of ᾱ, which has the same direction as α0 under the conditions of Proposition 1.
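A minimal R sketch of this two-stage ALASSO fit of S on X is given below, assuming the glmnet package and objects X (an N × p matrix) and S (a length-N vector); for brevity it uses ν = 1 and cross-validation in place of BIC for tuning.

    library(glmnet)
    # Stage 1: initial LASSO fit of S on X over all N samples.
    fit_init   <- cv.glmnet(X, S, family = "gaussian")
    alpha_init <- as.numeric(coef(fit_init, s = "lambda.min"))[-1]
    # Stage 2: adaptive LASSO with weights |alpha_init_j|^{-nu};
    # coefficients that were zero initially receive a very large penalty.
    nu <- 1
    w  <- 1 / (abs(alpha_init)^nu + 1e-8)
    fit_alasso <- cv.glmnet(X, S, family = "gaussian", penalty.factor = w)
    alpha_hat  <- as.numeric(coef(fit_alasso, s = "lambda.min"))[-1]
    A_hat      <- which(alpha_hat != 0)    # estimated support supp(alpha_hat)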

Appending the penalty term (2) to the negative log-likelihood and replacing α0 with its estimate α̂, we propose to estimate ϑ0 = (ζ0, γ0, β0⊤)⊤ by

ϑ̂ = (ζ̂, γ̂, β̂⊤)⊤ = arg min over ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ1{min over ρ of ‖(β − ρα̂)𝓐̂‖1} + λ2‖β𝓐̂^c‖1,

where 𝓐̂ = supp(α̂). The estimators can be equivalently obtained as

(ρ̂, ϑ̂) = arg min over ρ, ϑ of (1/n) Σi=1,…,n ℓ(Yi, ϑ⊤W̄i) + λ1‖(β − ρα̂)𝓐̂‖1 + λ2‖β𝓐̂^c‖1.  (3)

The impact of the tuning parameters λ1,λ2 can be understood from a bias-variance tradeoff viewpoint. When λj’s are large, β^ tends to be a multiple of α^ and thus is an estimator with high bias and low variance. In contrast, when λj’s are small, the likelihood term based on the labeled dataset 𝓛 is the dominant part, and hence β^ will have low bias and high variance. By varying the values of λj’s, we are able to obtain a continuum connecting these two extremes. In practice, λ1 and λ2 can be chosen via standard data-driven approaches such as cross-validation.

2.4. Computation Details

The minimization in (3) can be solved with standard software for LASSO estimation. Let δ = β − ρα̂. We can re-parametrize the objective in (3) in terms of ρ, ζ, γ, and δ as

(ζ̂, γ̂, ρ̂, δ̂) = arg min over ζ, γ, ρ, δ of (1/n) Σi=1,…,n ℓ(Yi, ζ + Siγ + ρXi⊤α̂ + Xi⊤δ) + λ1(‖δ𝓐̂‖1 + κ‖δ𝓟∖𝓐̂‖1),

where 𝓟 = {1, …, p} and κ = λ2/λ1. This is a typical LASSO problem with covariates (1, Si, Xi⊤α̂, Xi⊤)⊤, parameters (ζ, γ, ρ, δ⊤)⊤, and a weighted ℓ1 penalty on the parameters. Hence it can be solved by essentially any algorithm for ALASSO fitting. In this paper, we use the R package glmnet (Friedman et al., 2010) to compute ζ̂, γ̂, ρ̂, and δ̂, and construct the final estimator of ϑ0 as ϑ̂ = (ζ̂, γ̂, β̂⊤)⊤ with β̂ = δ̂ + ρ̂α̂.
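The following minimal R sketch illustrates the re-parametrized fit with glmnet, assuming alpha_hat and A_hat are the ALASSO estimate and its support from Section 2.3 and that Y holds the n observed labels; κ is fixed here for illustration, whereas in practice (λ1, λ2) would be selected jointly, e.g. by cross-validating over a grid of κ.

    library(glmnet)
    index <- as.numeric(X %*% alpha_hat)              # the single index X^T alpha_hat
    Z     <- cbind(S = S, index = index, X)           # glmnet adds the intercept zeta
    kappa <- 2                                        # kappa = lambda2 / lambda1
    pf    <- c(0, 0,                                  # leave gamma and rho unpenalized
               ifelse(seq_len(ncol(X)) %in% A_hat, 1, kappa))
    fit_pass <- cv.glmnet(Z[1:n, ], Y, family = "binomial", penalty.factor = pf)
    theta    <- as.numeric(coef(fit_pass, s = "lambda.min"))
    zeta_hat  <- theta[1]                             # intercept
    gamma_hat <- theta[2]                             # coefficient of S
    rho_hat   <- theta[3]                             # coefficient of X^T alpha_hat
    delta_hat <- theta[-(1:3)]
    beta_hat  <- delta_hat + rho_hat * alpha_hat      # final coefficient vector for X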

3. Theoretical Properties

In this section, we present non-asymptotic risk bounds for the PASS estimator. We also make theoretical comparisons with the supervised LASSO estimator to shed light on when PASS outperforms the LASSO and where such improvement comes from.

3.1. Notations

A random variable V is sub-Gaussian(τ²) if E{exp(λ|V|)} ≤ 2 exp(λ²τ²/2) holds for all λ > 0. Throughout, we define

U = (X⊤, 1)⊤,  K = E(UU⊤),  ξ = (α⊤, τ)⊤,  Zα = (X⊤, X⊤α, S, 1)⊤,  G = E(Zᾱ Zᾱ⊤),
θ = (δ⊤, ρ, γ, ζ)⊤,  H = E[σ(Zᾱ⊤θ0){1 − σ(Zᾱ⊤θ0)} Zᾱ Zᾱ⊤],

where ᾱ is given by (ᾱ⊤, τ̄)⊤ = ξ̄ = arg min over ξ of E(S − U⊤ξ)², and Θ0 = {θ : δ + ρᾱ = β0, ζ = ζ0, γ = γ0}. Denote 𝓑0 = supp(β0), 𝓐 = supp(ᾱ) and q = |𝓐|. We assume ‖ᾱ‖2 = 1 without loss of generality, since ᾱ is used to recover only the direction of β0 in the SIM and one can change ρ correspondingly to make any β = δ + ρᾱ invariant to ‖ᾱ‖2. Note that under (𝓜Y), any θ0 ∈ Θ0 minimizes E{ℓ(Y, Zᾱ⊤θ)}, and due to perfect multicollinearity in Zᾱ, θ0 is not unique. However, any θ0 ∈ Θ0 corresponds to the unique β0 = δ0 + ρ0ᾱ, and thus Zᾱ⊤θ0 = ζ0 + Sγ0 + X⊤β0 = ϑ0⊤W̄ is well-defined. Moreover, any quantity depending on θ0 only through Zᾱ⊤θ0 is well-defined. Since the main results in this section depend on θ0 solely through Zᾱ⊤θ0, we will use θ0 to represent any θ ∈ Θ0 for simplicity.

For θ = (δ⊤, ρ, γ, ζ)⊤, define Ω(θ) = λ0(|ρ| + |γ| + |ζ|) + λ1‖δ𝓐‖1 + λ2‖δ𝓟∖𝓐‖1, Δα = 2μinit q/φ², and Π(θ) = |ρ|, where φ is a constant defined in Assumption (A4) in Section S2.1 of the Supplementary Materials, and λ0 = 36B{log(6/ϵ)/n}^{1/2}. To introduce the oracle θ*, we define the oracle risk function as:

𝓔(θ, 𝓢+, 𝓢−) = E ℓ(Y, Zᾱ⊤θ) − E ℓ(Y, Zᾱ⊤θ0) + 256 κ(𝓢+)²|𝓢+| / {ϖ ψ(𝓢+)} + 8λ1‖θ𝓢−∩𝓐‖1 + 8λ2‖θ𝓢−∩(𝓟∖𝓐)‖1 + 8λ1ΔαΠ(θ),  (4)

where

ψ(𝓢+) = inf over {v : Ω(v𝓢+^c) ≤ 3Ω(v𝓢+)} of v⊤Gv / (v𝓢+⊤v𝓢+),
κ(𝓢+) = λ0 if 𝓢+ ∩ 𝓐 = Ø and 𝓢+ ∩ (𝓟∖𝓐) = Ø;  κ(𝓢+) = λ2 if 𝓢+ ∩ 𝓐 = Ø and 𝓢+ ∩ (𝓟∖𝓐) ≠ Ø;  κ(𝓢+) = +∞ if 𝓢+ ∩ 𝓐 ≠ Ø.

Define θ* = (δ*⊤, ρ*, γ*, ζ*)⊤, 𝓢*+ and 𝓢*− as the solution to

arg min over {θ, 𝓢+, 𝓢−} satisfying 𝓢+ ∩ 𝓢− = Ø, 𝓢+ ∪ 𝓢− = supp(θ) ∪ 𝓟̄, 𝓟̄ ⊆ 𝓢+, and ‖G^{1/2}(θ − θ0)‖2 ≤ η, of 𝓔(θ, 𝓢+, 𝓢−),

where 𝓟̄ = {p+1, p+2, p+3}, and η is a constant defined in Assumption (A3) in Section S2.1 of the Supplement. Let 𝓢* = 𝓢*+ ∪ 𝓢*− = supp(θ*) ∪ 𝓟̄, κ* = κ(𝓢*+), and β* = δ* + ρ*ᾱ. Intuitively, one may view 𝓢*+ as the union of the unpenalized predictors and the predictors with large coefficients that are not recovered by 𝓐, while 𝓢*− can be viewed as the union of the predictors with small nonzero coefficients and the predictors recovered by 𝓐. Partitioning the support of θ* into 𝓢*+ and 𝓢*− is inspired by Bühlmann and Van De Geer (2011, Section 6.2.4) and leads to a refined bound.

3.2. Main result

We first establish the risk bounds for the PASS estimator in the following theorem. Its proof can be found in Section S2 of the Supplementary Materials.

Theorem 1.

For any ϵ > 0, if Assumptions (A1)–(A8) (introduced in Section S2.1 of the Supplementary Materials) hold, the following inequalities hold simultaneously with probability at least 1 − 10ϵ:

Excess risk:  E ℓ(Y, Zα̂⊤θ̂) − E ℓ(Y, Zᾱ⊤θ0) ≤ Ξ,
Linear prediction error:  E(Zα̂⊤θ̂ − Zᾱ⊤θ0)² ≤ Ξ/ϖ,
Probability prediction error:  E{σ(Zα̂⊤θ̂) − σ(Zᾱ⊤θ0)}² ≤ Ξ/ϖ,

where ϖ is a positive constant defined in (A2), Ξ = 64𝓔(θ*, 𝓢*+, 𝓢*−), and 𝓔 is the oracle risk function defined in equation (4).

Remark 2.

As detailed in Section S2.1 of the Supplementary Materials, Assumptions (A1)–(A8) are imposed on the tail behaviour of the regression residuals, the regularity of the design matrix, the minimum signal strength of ᾱ, the sample sizes, and the rates of the tuning parameters. These assumptions are commonly used in the theoretical literature on the LASSO, such as the sub-Gaussian variable condition and the restricted eigenvalue condition; see, e.g., van de Geer and Bühlmann (2009); Bickel et al. (2009); Bühlmann and Van De Geer (2011).

Remark 3.

The last term of the risk bound 𝓔(θ*, 𝓢*+, 𝓢*−) is of order O(λ1Δα|ρ*|), which reflects the estimation error in α̂. Following Lemma S8 in the Supplement, one can show that Δα = Op(N^{−1/2}|𝓐|). All the other terms in Ξ describe the estimation error in θ̂ as if α̂ were replaced with ᾱ. When N is sufficiently large, O(λ1Δα|ρ*|) is typically negligible relative to the other terms. Specifically, if N ≫ n|𝓐|²log(p), then O(λ1Δα|ρ*|) = O({Nn}^{−1/2}log(p)^{1/2}|𝓐|) = o(n^{−1}). In general, as long as N ≫ max(n, p) and ᾱ is not much denser than β0, as is typical in EHR applications, O(λ1Δα|ρ*|) is dominated by the risk of the supervised LASSO estimator, and even by that of the supervised oracle estimator obtained with knowledge of supp(β0).

To gain a better understanding of how the key quantity Ξ in Theorem 1 changes with the similarity between the prior information ᾱ and the target β0, we discuss several specific cases in Section 3.3, based on the risk bound derived in Theorem 1.

3.3. Specific Cases

Following Remark 3, we focus our discussion on the settings where N is sufficiently large such that the last term of the risk bound is negligible. We consider three different scenarios as illustrated in Figure 1: (Case 1) ᾱ recovers both the support and the direction of β0; (Case 2) ᾱ almost recovers the support of β0 but has a substantially different direction from β0; (Case 3) ᾱ fails to recover the support of β0 (let alone its direction) and provides poor information. These three cases depict perfect, good, and poor quality of the prior information ᾱ in recovering the support and direction of β0. Next, we rigorously characterize the three cases by properly specifying the parameters ρ, δ, 𝓢+, and 𝓢−, and derive the convergence rate of Ξ, the risk bound of the PASS estimator, based on Theorem 1.

Figure 1:

Examples of the coefficients β0 and ᾱ in the three cases of Section 3.3. Labels S+, S−, and A in the diagrams represent 𝓢̄+∖𝓟̄, 𝓢̄−, and 𝓐 as chosen and defined in Cases 1–3. β0 and ᾱ are aligned for comparison of their directions and supports. Presented below are scatter plots of σ^{−1}{Pr(Y = 1 ∣ S, X)} against S for the simulated samples generated under Cases 1–3.

Case 1 (left panel): ᾱ recovers both the support and the direction of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows strong collinearity with S. PASS largely outperforms supervised LASSO and has the same convergence rate as a low dimensional regression.

Case 2 (middle): ᾱ (nearly) recovers the support but not the direction of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows moderate collinearity with S. PASS still outperforms both supervised LASSO and pLASSO in terms of convergence rate.

Case 3 (right): ᾱ fails to recover the support of β0; σ^{−1}{Pr(Y = 1 ∣ S, X)} shows weak collinearity with S. PASS has the same convergence rate as supervised LASSO.

Case 1.

Let ρ̄ = arg min over ρ of ‖β0 − ρᾱ‖1, δ̄ = β0 − ρ̄ᾱ, θ̄ = (δ̄⊤, ρ̄, γ0, ζ0)⊤, 𝓢̄+ = 𝓟̄ and 𝓢̄− = supp(δ̄). If ᾱ successfully recovers the support and direction of β0 (see the left panel of Figure 1), then 𝓢̄− ≈ Ø and ‖δ̄‖1 ≈ 0. Since ‖G^{1/2}(θ̄ − θ0)‖2 = 0 and 𝓢̄+ ∩ 𝓐 = Ø, we have Ξ = O{𝓔(θ̄, 𝓢̄+, 𝓢̄−)} by the definition of θ*. Hence by Theorem 1, the excess risk of θ̂ satisfies

Ξ = Op(λ0² + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op(λ0²) = Op(n^{−1}),

recalling that λ0 = O(n^{−1/2}).

As a standard result (Negahban et al., 2009), the rate of the excess risk of the supervised LASSO estimator is either Op{n^{−1}log(p)|𝓑0|} or Op{n^{−1/2}log(p)^{1/2}‖β0‖1}. These two rate bounds are established under different sparsity norms of β0 and are generally comparable, e.g., when the average magnitude of the non-zero entries in β0 is of order n^{−1/2}log(p)^{1/2}. In comparison, Op(n^{−1}), the risk rate of PASS in Case 1, is much sharper. Further, Op(n^{−1}) is actually the rate for a low (fixed) dimensional logistic regression. Thus, if β0 is very close to a multiple of ᾱ, PASS can outperform the vanilla LASSO and be comparable with a low dimensional regression in terms of convergence rate. This large gain owes to the use of the N unlabeled observations to obtain the direction of β0, which reduces the high dimensional regression to a low dimensional one where only the intercept and the scale of β0 need to be estimated.

Case 2.

Consider the same choice of θ̄, 𝓢̄+ and 𝓢̄− as in Case 1. If ᾱ recovers the support but not the direction of β0 (see the middle panel of Figure 1), we only have ‖δ̄𝓢̄−∩𝓐^c‖1 ≈ 0 but not ‖δ̄‖1 ≈ 0. Then by Theorem 1, the excess risk of PASS is

Ξ = Op(λ0² + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op{n^{−1/2}log(q)^{1/2}‖δ̄𝓢̄−∩𝓐‖1},

recalling that λ1 = O{n^{−1/2}log(q)^{1/2}}.

In Case 2, the convergence rate of the excess risk of PASS is still better than that of the supervised LASSO estimator when q ≪ p:

O{n^{−1/2}log(q)^{1/2}‖δ̄𝓢̄−∩𝓐‖1} ≤ O{n^{−1/2}log(p)^{1/2}‖β0‖1},

since ‖δ̄𝓢̄−∩𝓐‖1 ≤ min over ρ of ‖β0 − ρᾱ‖1 ≤ ‖β0‖1. Namely, even if ᾱ does not recover the direction of β0 very well, as long as the prior information 𝓐 = supp(ᾱ) is sparse and successfully covers supp(β0), which is reflected by 𝓢̄+ = 𝓟̄, the PASS estimator still benefits from the prior information. This is because recovering the support of β0 reduces the dimensionality of the empirical errors that need to be controlled from p to q = |𝓐|. In this case, it is also interesting to compare the proposed PASS estimator with the prior LASSO (pLASSO) procedure of Jiang et al. (2016). When supp(ᾱ) and supp(β0) are close but the directions of ᾱ and β0 are quite different, the pLASSO procedure is unable to utilize this information and only attains the same convergence rate as supervised LASSO, which is essentially slower than that of PASS.

Case 3.

Let ρ̄ = 0, δ̄ = β0, θ̄ = (δ̄⊤, ρ̄, γ0, ζ0)⊤, 𝓢̄+ = 𝓟̄ ∪ (𝓑0∖𝓐) and 𝓢̄− = 𝓑0∖𝓢̄+. If ᾱ fails to recover the support of β0, i.e., 𝓐 ∩ 𝓑0 ≈ Ø and ‖β0,𝓐∩𝓑0‖1 ≈ 0, we have ‖δ̄𝓢̄−‖1 ≤ ‖δ̄𝓐‖1 = ‖β0,𝓐∩𝓑0‖1 ≈ 0. Then again using Theorem 1,

Ξ = Op(λ2²|𝓢̄+| + λ1‖δ̄𝓢̄−∩𝓐‖1 + λ2‖δ̄𝓢̄−∩𝓐^c‖1) ≈ Op{n^{−1}log(p)|𝓑0|},

recalling that λ2 = O{n^{−1/2}log(p)^{1/2}}.

In Case 3, the excess risk of PASS is of the same order as that of supervised LASSO. Therefore the PASS approach is robust against low-quality prior information that recovers neither the direction nor the support of β0. This benefit is a result of using the data-adaptive parameter ρ to control the influence of the prior information on the estimator.

4. Simulation Studies

4.1. Main setups

We conducted extensive simulation studies to examine the finite-sample performance of the PASS estimator and to compare it with existing approaches. We first considered the case where the logistic model for Y ∣ S, X is correctly specified, S ∣ X follows a SIM, and X is nearly elliptical, but the similarity between α0 and β0 varies. Since EHR features are often zero-inflated and skewed count variables, we generated the p = 500 dimensional X from

Xi = h(Zi),  Zi ∼ N(0, ΣZ),  h(t) = log(1 + [e^t]),

where [u] denotes the integer nearest to u, ΣZ = (σi,j)i,j=1,…,p and σi,j = 4(0.5)^{|i−j|}. Here [e^{Zij}] mimics a skewed raw EHR feature, which is typically transformed via t ↦ log(1 + t) prior to model fitting. We then generated the surrogate S from a SIM in X:

Si = h(1 + Xi⊤α0 + ϵi),  with ϵi ∼ N(0, 2²).

Following the model assumption (𝓜Y), the disease status Yi was generated from

σ^{−1}{Pr(Yi = 1 ∣ Wi)} = −4 + 0.5Si + Xi⊤β0.

To mimic different qualities of the prior information one could encounter in practice, we design six scenarios with different similarities between the true β0 and α0:

I: α0 = (a1, a2, 0p−10),  β0 = 1.5(a1, a2, 0p−10);
II: α0 = (a1, a2, 0p−10),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
III: α0 = (a1, a2, a2, a2, 0p−20),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
IV: α0 = (a1, 0p−5),  β0 = 1.5(a1 + d1, a2 + d2, 0p−10);
V: α0 = (a1, a2, 0p−10),  β0 = 1.5(a2, a1, 0p−10);
VI: α0 = (a1, a2, 0p−10),  β0 = 1.5(a2, 05, a1, 0p−15),

where

a1 = (0.5, 1, 0.8, 0.6, 0.2),  d1 = (0.05, 0.5, 1.4, 0.5, 0.6),
a2 = (0.1, 0.2, 0.2, 0.2, 0.7),  d2 = (0.02, 0.05, 0.02, 0.02, 0.05).

Our specifications of β0 and α0 are motivated by the three key specific cases introduced in Section 3.3 and illustrated in Figure 1. Scenario I is the ideal case where β0 and α0 have identical directions. In Scenario II, most of the components of β0 differ slightly from a scalar multiple of α0, while a few components differ substantially. Scenarios I and II are designed to examine the performance of the PASS estimator when the prior information is highly or somewhat reliable. In Scenario III, α0 is denser than β0 and contains quite a few weak signals. On the contrary, in Scenario IV β0 is denser than α0. In Scenario V, the magnitudes of α0 and β0 are quite different, whereas they still share the same support. Scenarios III, IV and V are designed to examine the performance of the PASS estimator with respect to different degrees of accuracy of the support information. In Scenario VI, both the magnitude and the support of α0 and β0 differ substantially, which means the unlabeled dataset provides little information. This scenario allows us to see whether the PASS estimator is robust against unreliable prior information. See Figure 2 for a visualization of β0 and ρα0 across the different scenarios.
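To make the setup concrete, the following minimal R sketch generates one dataset under Scenario I; the outcome-model intercept is taken as −4, matching the reconstruction of the outcome model above, and all other constants follow this section.

    set.seed(1)
    p <- 500; N <- 10000
    a1 <- c(0.5, 1, 0.8, 0.6, 0.2)
    a2 <- c(0.1, 0.2, 0.2, 0.2, 0.7)
    alpha0 <- c(a1, a2, rep(0, p - 10))
    beta0  <- 1.5 * alpha0                             # Scenario I: identical direction
    h <- function(t) log(1 + round(exp(t)))            # h(t) = log(1 + [e^t])
    Sigma_Z <- 4 * 0.5^abs(outer(1:p, 1:p, "-"))       # sigma_ij = 4 * (0.5)^|i-j|
    Z <- matrix(rnorm(N * p), N, p) %*% chol(Sigma_Z)
    X <- h(Z)                                          # zero-inflated, skewed features
    S <- h(1 + as.numeric(X %*% alpha0) + rnorm(N, sd = 2))
    Y <- rbinom(N, 1, plogis(-4 + 0.5 * S + as.numeric(X %*% beta0)))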

Figure 2:

Supports and values of the coefficients β0 and 1.5α0 under Scenarios I–VI introduced in Section 4.1. Only those indices j satisfying β0,j ≠ 0 or α0,j ≠ 0 are shown in the plots.

We compare PASS to the following existing methods: (1) supervised LASSO penalized logistic regression with n training samples (LASSOn); (2) supervised ALASSO penalized logistic regression with n training samples, denoted by ALASSOn; (3) the SSprior estimator described in Section 2.2; and (4) two variants of the pLASSO estimator proposed in Jiang et al. (2016): (i) we fit a penalized logistic model with a LASSO penalty imposed on the predictors outside supp(α̂), as in equation (8) of Jiang et al. (2016), and then use the predicted probability from that model as Yi^p in their equation (7), denoted by pLASSO1; (ii) we use the predicted probability given by the SSprior approach as Yi^p in their equation (7), denoted by pLASSO2.

Throughout, we let N = 10000 and set ν = 1 in the ALASSO weights. We use the Bayesian information criterion (BIC) to select μinit and μ in the estimation of α, due to the large N, and use 10-fold cross-validation to select λ1, λ2 for the estimation of β, so that the phenotype model is tuned towards prediction performance. We quantify the average prediction performance of the estimated linear score ϑ̃⊤W̄, with ϑ̃ obtained via the different methods, on an independent test dataset of size 10000. For each choice of ϑ̃⊤W̄, we consider the area under the receiver operating characteristic curve (AUC) for classifying Y, the excess risk (ER) as defined in Section 3, and the mean squared error of the predicted probabilities (MSE-P), i.e., the mean squared difference between the predicted and the true probabilities. We summarize results based on 1000 simulated datasets for each configuration.
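For reference, the three metrics can be computed as in the short R sketch below, where Y_test denotes the test labels, eta_hat the fitted linear score ϑ̃⊤W̄, and eta_true the true linear predictor; these object names are illustrative, and pROC is just one of several packages providing the AUC.

    loglik <- function(y, eta) -y * eta + log(1 + exp(eta))                  # l(y, eta)
    auc   <- as.numeric(pROC::auc(Y_test, plogis(eta_hat)))                  # AUC
    er    <- mean(loglik(Y_test, eta_hat)) - mean(loglik(Y_test, eta_true))  # excess risk
    mse_p <- mean((plogis(eta_hat) - plogis(eta_true))^2)                    # MSE-P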

In Figure 3, we compare the prediction measures for estimators obtained with n = 100. In Scenario I, where the directions of β0 and α0 coincide, the SSprior approach performs the best as expected, yet the proposed PASS method attains very similar accuracy, followed by pLASSO2, which performs only slightly worse. When the directions of β0 and α0 are somewhat different, as in Scenario II, the SSprior and pLASSO estimators deteriorate quickly. In contrast, the PASS estimator maintains high accuracy and outperforms all competing estimators substantially. We observe qualitatively similar patterns for Scenarios III and IV, under which α0 and β0 have somewhat different supports. No matter whether α0 is denser than β0 as in Scenario III, or β0 is denser than α0 as in Scenario IV, the PASS method consistently outperforms the supervised estimators. Additionally, the performance of the SSprior and pLASSO approaches is not quite satisfactory. In Scenario V, β0 and α0 have the same support but are quite different in terms of magnitude. The proposed method manages to utilize the same-support information, whereas the pLASSO approaches fail to do so. Finally, the goal of Scenario VI is to examine the robustness of the methods when β0 and α0 differ a lot, possibly due to the use of an inappropriate surrogate. The PASS estimator performs similarly to the supervised estimators, indicating that our procedure is indeed adaptive to how well the data support the prior assumption. Across all scenarios, the ALASSO approach performs slightly worse than LASSO, possibly due to the presence of some small nonzero coefficients in β0.

Figure 3:

AUC (left), ER (middle) and MSE-P (right) evaluated on the test set for simulation studies under Scenarios I–VI. Outliers are not drawn. The mean performance of the PASS approach is marked using dashed lines for ease of comparison. The size of the labeled dataset is fixed at n = 100.

In Figure 4, we present the AUC, ER and MSE-P of the PASS estimator trained with n = 100 and of the supervised LASSO estimator with varying label size. In Scenario I, where the prior assumption holds exactly, PASS100, the PASS approach with 100 labeled samples, even outperforms LASSO400, the LASSO approach with 400 labeled samples. When the prior assumption holds approximately, as in Scenarios II through V, PASS100 consistently outperforms LASSO150 and achieves performance similar to LASSO200, which requires twice as many labels. Finally, in Scenario VI, where the prior information is highly inaccurate, the PASS method maintains performance comparable to LASSO100.

Figure 4:

AUC (left), ER (middle) and MSE-P (right) evaluated on the test set for simulation studies under Scenarios I–VI. Outliers are not drawn. The mean performance of the PASS approach is marked using red dashed lines for ease of comparison. The size of the labeled dataset is n = 100 for PASS, while it varies for LASSO, as indicated in the subscripts.

4.2. Efficiency and Robustness Evaluations under Mis-specifications

We conducted simulation studies under three additional scenarios to further investigate the efficiency and robustness of PASS when the model assumptions and the elliptical design assumption are violated. We again set p = 500 and generated Xi = 2Φ(Zi) − 1, where Φ(·) is the cumulative distribution function of the standard normal applied element-wise, Zi = (Zi1, …, Zip)⊤ ∼ N(0, ΣZ), ΣZ = (σi,j)i,j=1,…,p, with σi,j = (0.5)^{|i−j|} if i = j, or both i and j are ≤ 20, or both i and j are > 20, and σi,j = 0 otherwise. We make ΣZ block-diagonal for the convenience of obtaining the population solutions of β and α through the best logistic or least squares approximation under model mis-specification. In real EHR studies, a common data-generating paradigm is that the features X, e.g., some genetic variants, precede the disease status Y, and Y precedes some clinical surrogate S, e.g., the count of ICD codes associated with the disease. To mimic this, we generated Yi and Si from the following models:

Yi = I{(0.8, 1, 1, 0.8, 0.4, 0p−5)⊤Xi + ϵyi ≥ 0},  ϵyi ∼ N(0, 1),
Si = μYi + η1⊤Xi + Yi·η2⊤Xi + ϵsi,  ϵsi ∼ N(0, 1).

Assumptions (𝓒prior) and (𝓜S) hold when η1 = η2 = 0, and would be severely violated when η1 and η2 are large. We design three scenarios with η1 and η2 representing different degrees of violation of the surrogate assumptions:

  (i) μ = 1, and η1 = η2 = 0;

  (ii) μ = 1.5, η1 = (a3, 0p−5), and η2 = (d3, 0p−5);

  (iii) μ = 2, η1 = (a3, a3, a3, 0p−15), and η2 = (d3, d3, d3, 0p−15),

where a3 = (0.6, 0.4, 0.4, 0.5, 0.5) and d3 = (0.3, 0.4, 0.6, 0.5, 0.5). Here μ depicts the marginal effect of Yi on Si, and is set to keep the AUC of the target model at a similar level across the three scenarios. Across all scenarios, Pr(Yi = 1 ∣ Si, Xi) no longer follows a parametric logistic model, i.e., (𝓜Y) is misspecified. Our goal is to estimate the limiting coefficients ζ0, γ0, β0 defined as the minimizer of E ℓ(Yi, ζ + γSi + Xi⊤β). The benchmark methods and their implementation, tuning, and evaluation procedures are the same as in Section 4.1, except that we implement supervised LASSO with n ranging from 100 to 700.
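A minimal R sketch of this data-generating mechanism, shown for Scenario ii, is given below; the constants follow the description above, and the block-diagonal ΣZ is built over the index blocks {1, …, 20} and {21, …, p}.

    set.seed(1)
    p <- 500; N <- 10000
    block <- c(rep(1, 20), rep(2, p - 20))
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-")) * outer(block, block, "==")   # block-diagonal
    Z <- matrix(rnorm(N * p), N, p) %*% chol(Sigma)
    X <- 2 * pnorm(Z) - 1                              # X_i = 2 * Phi(Z_i) - 1
    a3 <- c(0.6, 0.4, 0.4, 0.5, 0.5); d3 <- c(0.3, 0.4, 0.6, 0.5, 0.5)
    mu <- 1.5
    eta1 <- c(a3, rep(0, p - 5)); eta2 <- c(d3, rep(0, p - 5))           # Scenario ii
    Y <- as.integer(X %*% c(0.8, 1, 1, 0.8, 0.4, rep(0, p - 5)) + rnorm(N) >= 0)
    S <- mu * Y + as.numeric(X %*% eta1) + Y * as.numeric(X %*% eta2) + rnorm(N)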

In Figure 5, we present the AUC, ER and MSE-P of the methods under Scenarios i–iii. In Scenario i, PASS has performance similar to the semi-supervised benchmarks SSprior and pLASSO, and all the semi-supervised estimators significantly outperform the two supervised estimators, since (𝓒prior) holds and ᾱ essentially recovers the direction of β0 well. Among the semi-supervised estimators, the SSprior and pLASSO2 estimators have a slight advantage with smaller variation, as expected, since both heavily rely on the prior information, which is of high quality in this setting. In Scenario ii, the key assumption (𝓒prior) is violated, which drastically impacts the performance of SSprior and pLASSO2. On the other hand, PASS and pLASSO1 still effectively leverage the imperfect information from ᾱ to approximately recover the support of β0, and thus outperform SSprior, pLASSO2, and the supervised methods. In Scenario iii, η1 and η2 become denser than those in Scenario ii. This can make the recovery of supp(β0) using supp(ᾱ) less accurate, and interestingly, PASS outperforms all methods including pLASSO1, which also leverages supp(ᾱ). In all three scenarios, PASS significantly outperforms supervised LASSO trained with the same number of labels or even 2–3 times as many, which demonstrates a large gain from using the unlabelled dataset to assist the regression. Finally, the results demonstrate that our method can still efficiently leverage the prior information from S in estimating the target parameters when S ∣ Y, X depends strongly on X so that (𝓒prior) is violated, (𝓜Y) is misspecified, and the design is non-elliptical.

Figure 5:

Evaluation metrics on the test set for simulation studies under Scenarios i–iii introduced in Section 4.2. Outliers are not drawn. The mean performance of the PASS approach is marked using red dashed lines for ease of comparison. In the left panel, we present the evaluation metrics of all methods for comparison when n = 100. In the right panel, we compare the performance of PASS with n = 100 against supervised LASSO obtained using labelled samples of various sizes n (from 100 to 700).

5. Application to EHR Phenotyping

We examine the performance of PASS along with other approaches in three real-world EHR phenotyping studies whose goal is to develop classification models for the diseases of interest. All studies were performed at a large tertiary hospital system with EHR spanning multiple decades. Each study has n0 labeled observations for algorithm training and validation. We consider three choices of training size n, each no more than n0/2, in all examples. First, we randomly split the labelled samples into four folds of equal size. Then we take each fold in turn as the validation set, sample n training labels from the other three folds 20 times, train and validate the algorithms, and finally average the evaluation metrics and their standard errors over the validation results on the four folds. We replicate this procedure 10 times and report the average performance.

Data Example 1 (CAD Phenotyping).

The goal of this study is to identify patients with coronary artery disease (CAD) based on their EHR features. The study cohort consists of N=4164 patients, out of which a random subset of n0=181 patients have their true CAD status annotated via chart review by domain experts. We use the sum of the counts for the CAD ICD code and NLP mention of CAD as the surrogate. There are p=585 additional EHR features consisting of the total count of all ICD codes as a healthcare utilization measure, 10 ICD codes related to CAD, and 574 NLP variables. For the size of training labels, we consider n=50,70,90. This de-identified dataset has been analyzed in previous studies (Zhang et al., 2019, e.g.) and is publicly available online: https://celehs.github.io/PheCAP/articles/example2.html.

Data Example 2 (RA Phenotyping).

Similar to the CAD phenotyping study, the goal is to identify patients with rheumatoid arthritis (RA) based on their EHR features. There are N = 46114 patients in total, out of which n0 = 435 patients have their RA status annotated. Again, we choose the sum of the ICD code counts and NLP mentions of RA as the surrogate. The p = 924 additional EHR features consist of the healthcare utilization measure and 923 NLP variables potentially predictive of RA. For the size of training labels, we consider n = 50, 125, 200.

Data Example 3 (Depression Phenotyping).

The goal is to identify patients with depression based on their codified EHR features. There are N = 9474 patients in total and n0 = 236 labeled observations. The surrogate is chosen as the count of the depression ICD code. There are p = 231 additional EHR features, including the healthcare utilization measure and 230 codified EHR features on depression-related medication prescriptions, laboratory tests and ICD codes. For the size of training labels, we consider n = 50, 85, 120.

In the three data examples, N is significantly larger than p, with N/max(p, n) being approximately 7 for CAD, 50 for RA, and 41 for Depression. In all three studies, we apply the x ↦ log(1 + x) transformation to all count variables. Also, since patients with higher healthcare utilization tend to have higher counts of most features, we orthogonalize all features against the healthcare utilization before regression fitting. Since ϑ0 is unknown in applications, we quantify the performance of an estimator ϑ̃ based on the AUC and the Brier skill score (BSS) of σ(ϑ̃⊤W̄) for predicting Y, where the BSS is defined as 1 − Êv[{Y − σ(ϑ̃⊤W̄)}²]/Êv[{Y − Êv(Y)}²], and Êv denotes the empirical expectation over the validation sample. The BSS is essentially a binary version of the R-square.
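The preprocessing and the BSS can be computed as in the brief R sketch below, where X_raw holds the raw count features, util the healthcare utilization feature, Y_val the validation labels, and p_hat the predicted probabilities σ(ϑ̃⊤W̄); all names are illustrative.

    X <- log(1 + X_raw)                                     # x -> log(1 + x) for count features
    X <- apply(X, 2, function(x) resid(lm(x ~ util)))       # orthogonalize against utilization
    # Brier skill score of predicted probabilities on the validation sample:
    bss <- 1 - mean((Y_val - p_hat)^2) / mean((Y_val - mean(Y_val))^2)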

For comparison, we include PASS, SSprior, pLASSO2, supervised LASSO and ALASSO in the three data examples to estimate the phenotyping model (𝓜Y). We exclude pLASSO1 since it requires fitting an unpenalized regression on supp(α̂), which is infeasible when |supp(α̂)| > n. In addition, we compare to the unsupervised LASSO (ULASSO) approach of Chakrabortty et al. (2017), which estimates the direction of the logistic coefficients β for Y ∣ X ∼ σ(β⊤X) by regressing I(S > cu) against X on the subset of patients whose S is either greater than cu or smaller than cl, for some pre-specified cu and cl typically chosen such that Pr(S > cu) and Pr(S < cl) are small. Since the ULASSO approach only provides an estimate β̃ to optimize the prediction of Y ∣ X via β̃⊤X, without using S explicitly as an additional predictor, we also derive a semi-supervised variant of ULASSO, denoted by SSULASSO, by regressing the labeled Y against β̃⊤X and S, as for SSprior.

As shown in Figure 6, PASS significantly outperforms the supervised LASSO and ALASSO when n = 50 in all three examples. As the label size n increases, their performances get closer. Compared with the semi-supervised benchmarks, PASS has slightly or moderately better performance on the CAD and RA studies. For Depression, PASS substantially outperforms them, especially SSprior and SSULASSO. For example, when n = 50, PASS attains an average AUC in classifying depression about 0.1 higher than that of SSprior and SSULASSO and 0.05 higher than that of pLASSO. The gap becomes smaller when n increases, as expected. Interestingly, the supervised estimators also outperform pLASSO, SSprior, and SSULASSO on the Depression dataset, but have similar or worse performance than these semi-supervised approaches on the other two examples. This could in part be attributed to the relatively poor quality of the surrogate information, which makes the existing semi-supervised approaches fail. In contrast, PASS can utilize such prior information more effectively and robustly, and still achieves better performance than the supervised estimators. Thus, we conclude that incorporating prior information from the unlabeled dataset can improve and stabilize the prediction performance of phenotyping models in EHR applications, and that PASS is more robust and efficient in leveraging the prior information than existing semi-supervised methods. In addition, ULASSO shows much worse performance than the other supervised and semi-supervised methods in all examples. This illustrates the importance of collecting labels and of including the surrogate in the regression models for EHR phenotyping.

Figure 6:

Out-of-sample AUC and BSS on Data Examples 1–3, with various sizes of labelled training samples denoted by n. The median performance of PASS is marked using red dashed lines for ease of comparison.

6. Discussion

In this paper, we propose PASS, a high dimensional sparse estimator that adaptively incorporates prior knowledge from a surrogate under a semi-supervised scenario commonly encountered in application fields such as EHR analysis. Compared to supervised approaches, the proposed PASS approach can substantially reduce the required number of labeled samples when the model assumptions (𝓜S) and (𝓒prior) and the elliptical design assumption (C1) hold exactly or approximately, and thus the prior information ᾱ is trustworthy. Compared to the existing pLASSO and SSprior approaches that also incorporate prior information, the PASS approach is robust against unreliable prior information ᾱ, which might be the case when the surrogate model assumptions are violated or the design X is highly non-elliptically distributed.

One of the main challenges in our theoretical analysis comes from the collinearity of the covariates (1, Si, Xi⊤α̂, Xi⊤)⊤ induced by introducing ρ to leverage the prior information in α̂. We overcome this by properly constructing the oracle coefficients θ* and the restricted eigenvalue assumption (A6). The formulation of our problem falls into the missing data framework with data missing completely at random. However, the missing probability approaches 1 as N → ∞. This, together with the high dimensionality of X, makes the theoretical justification more challenging than in the standard missing data literature. Without the prior assumption that β0 − ρα0 is sparse in a certain sense, the unlabeled dataset cannot directly contribute to the estimation of β0. Our proposed PASS procedure hinges on the sparsity of β0 − ρα0 to leverage the unlabeled dataset.

We have restricted the discussion to a single surrogate variable for simplicity. However, the proposed method can be easily extended to multiple surrogates. Specifically, consider K surrogates, denoted by S[1], …, S[K]. Let α̂[k] be the ALASSO estimator obtained by regressing Si[k] against Xi, 𝓐̂ = ∪k=1,…,K supp(α̂[k]), Si = (Si[1], …, Si[K])⊤ and ρ = (ρ[1], …, ρ[K])⊤. We can obtain an estimator of the model parameters as

(ζ̂, γ̂, ρ̂, β̂) = arg min over ζ, γ, ρ, β of n^{−1} Σi=1,…,n ℓ(Yi, ζ + Si⊤γ + Xi⊤β) + λ1‖(β − Σk ρ[k]α̂[k])𝓐̂‖1 + λ2‖β𝓐̂^c‖1.

Theoretical justification and the finite sample performance of β̂ under this setting warrant further research. In our numerical studies, we focused only on fully simulated datasets and real examples. We are further interested in investigating the performance of our approach through semi-synthetic experiments with various setups for the surrogate variables. In addition, it may be interesting to extend the semi-supervised PASS estimator from high dimensional sparse parametric regression to semi-parametric settings such as the sparse additive model (Ravikumar et al., 2009) and the sparse varying coefficient model (Noh and Park, 2010). Under semi-parametric models, one could still leverage prior information by shrinking the coefficients towards ρα̂ with some sparse penalty function to gain statistical efficiency. Studying the specific forms and theoretical properties of such approaches within a semi-supervised framework warrants future research.

R code for implementing PASS and the benchmark methods and for replicating the simulation results can be found at https://github.com/moleibobliu/PASS.

Supplementary Material

Supplement

Contributor Information

Yichi Zhang, Department of Computer Science and Statistics, University of Rhode Island.

Molei Liu, Department of Biostatistics, Harvard T.H. Chan School of Public Health.

Matey Neykov, Department of Statistics and Data Science, Carnegie Mellon University.

Tianxi Cai, Department of Biostatistics, Harvard T.H. Chan School of Public Health.

References

  1. Agarwal Vibhu, Podchiyska Tanya, Banda Juan M, Goel Veena, Leung Tiffany I, Minty Evan P, Sweeney Timothy E, Gyang Elsie, and Shah Nigam H. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 23(6):1166–1173, 2016.
  2. Ananthakrishnan Ashwin N, Cheng Su-Chun, Cai Tianxi, Cagan Andrew, Gainer Vivian S, Szolovits Peter, Shaw Stanley Y, Churchill Susanne, Karlson Elizabeth W, Murphy Shawn N, et al. Association between reduced plasma 25-hydroxy vitamin d and increased risk of cancer in patients with inflammatory bowel diseases. Clinical Gastroenterology and Hepatology, 12(5):821–827, 2014.
  3. Bickel Peter J, Ritov Ya'acov, and Tsybakov Alexandre B. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009. doi: 10.1214/08-AOS620.
  4. Brownstein John S, Murphy Shawn N, Goldfine Allison B, Grant Richard W, Sordo Margarita, Gainer Vivian, Colecchi Judith A, Dubey Anil, Nathan David M, Glaser John P, et al. Rapid identification of myocardial infarction risk associated with diabetes medications using electronic medical records. Diabetes Care, 33(3):526–531, 2010.
  5. Bühlmann Peter and Van De Geer Sara. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
  6. Cai Tianxi, Zhang Yichi, Ho Yuk-Lam, Link Nicholas, Sun Jiehuan, Huang Jie, Cai Tianrun A, Damrauer Scott, Ahuja Yuri, Honerlaw Jacqueline, Huang Jie, Costa Lauren, Schubert Petra, Hong Chuan, Gagnon David, Sun Yan V, Gaziano J Michael, Wilson Peter, Cho Kelly, Tsao Philip, O'Donnell Christopher J, Liao Katherine P, and for the VA Million Veteran Program. Association of interleukin 6 receptor variant with cardiovascular disease effects of interleukin 6 receptor blocking therapy: a phenome-wide association study. JAMA Cardiology, 3(9):849–857, 2018. doi: 10.1001/jamacardio.2018.2287.
  7. Carroll Robert J, Thompson Will K, Eyler Anne E, Mandelin Arthur M, Cai Tianxi, Zink Raquel M, Pacheco Jennifer A, Boomershine Chad S, Lasko Thomas A, Xu Hua, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19(e1):e162–e169, 2012.
  8. Chakrabortty Abhishek, Neykov Matey, Carroll Raymond, and Cai Tianxi. Surrogate aided unsupervised recovery of sparse signals in single index models for binary outcomes. arXiv preprint arXiv:1701.05230, 2017.
  9. Denny Joshua C, Ritchie Marylyn D, Basford Melissa A, Pulley Jill M, Bastarache Lisa, Brown-Gentry Kristin, Wang Deede, Masys Dan R, Roden Dan M, and Crawford Dana C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics, 26(9):1205–1210, 2010.
  10. Diaconis Persi and Freedman David. Asymptotics of graphical projection pursuit. The Annals of Statistics, 12(3):793–815, 1984. doi: 10.1214/aos/1176346703.
  11. Doshi-Velez Finale, Ge Yaorong, and Kohane Isaac. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics, 133(1):e54–e63, 2014.
  12. Friedman Jerome, Hastie Trevor, and Tibshirani Rob. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. doi: 10.18637/jss.v033.i01.
  13. Gottesman Omri, Kuivaniemi Helena, Tromp Gerard, Faucett W Andrew, Li Rongling, Manolio Teri A, Sanderson Saskia C, Kannry Joseph, Zinberg Randi, Basford Melissa A, et al. The electronic medical records and genomics (eMERGE) network: past, present, and future. Genetics in Medicine, 15(10):761–771, 2013.
  14. Hall Peter and Li Ker-Chau. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, 21(2):867–889, 1993. doi: 10.1214/aos/1176349155.
  15. Halpern Yoni, Horng Steven, Choi Youngduck, and Sontag David. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4):731–740, 2016.
  16. Hong Chuan, Liao Katherine P, and Cai Tianxi. Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics, 75(1):78–89, 2019.
  17. Jiang Yuan, He Yunxiao, and Zhang Heping. Variable selection with prior information for generalized linear models via the prior LASSO method. Journal of the American Statistical Association, 111(513):355–376, 2016. doi: 10.1080/01621459.2015.1008363.
  18. Kohane Isaac S. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 12(6):417–428, 2011.
  19. Lee Jason D, Lei Qi, Saunshi Nikunj, and Zhuo Jiacheng. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.
  20. Li Ker-Chau and Duan Naihua. Regression analysis under link violation. The Annals of Statistics, 17(3):1009–1052, 1989. doi: 10.1214/aos/1176347254.
  21. Liao Katherine P, Diogo Dorothée, Cui Jing, Cai Tianxi, Okada Yukinori, Gainer Vivian S, Murphy Shawn N, Gupta Namrata, Mirel Daniel, Ananthakrishnan Ashwin N, et al. Association between low density lipoprotein and rheumatoid arthritis genetic factors with low density lipoprotein levels in rheumatoid arthritis and non-rheumatoid arthritis controls. Annals of the Rheumatic Diseases, 73(6):1170–1175, 2014.
  22. Liao Katherine P, Cai Tianxi, Savova Guergana K, Murphy Shawn N, Karlson Elizabeth W, Ananthakrishnan Ashwin N, Gainer Vivian S, Shaw Stanley Y, Xia Zongqi, Szolovits Peter, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ, 350:h1885, 2015.
  23. McDermott Matthew, Yan Tom, Naumann Tristan, Hunt Nathan, Suresh Harini, Szolovits Peter, and Ghassemi Marzyeh. Semi-supervised biomedical translation with cycle wasserstein regression GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  24. Negahban Sahand, Yu Bin, Wainwright Martin J, and Ravikumar Pradeep K. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
  25. Newton Katherine M, Peissig Peggy L, Kho Abel Ngo, Bielinski Suzette J, Berg Richard L, Choudhary Vidhu, Basford Melissa, Chute Christopher G, Kullo Iftikhar J, Li Rongling, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association, 20(e1):e147–e154, 2013.
  26. Noh Hoh Suk and Park Byeong U. Sparse varying coefficient models for longitudinal data. Statistica Sinica, pages 1183–1202, 2010.
  27. Ratner Alexander, Bach Stephen H, Ehrenberg Henry, Fries Jason, Wu Sen, and Ré Christopher. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment, volume 11, page 269, 2017.
  28. Ravikumar Pradeep, Lafferty John, Liu Han, and Wasserman Larry. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.
  29. van de Geer Sara A. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008. doi: 10.1214/009053607000000929.
  30. van de Geer Sara A and Bühlmann Peter. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009. doi: 10.1214/09-EJS506.
  31. Wang Hai and Poon Hoifung. Deep probabilistic logic: A unifying framework for indirect supervision. arXiv preprint arXiv:1808.08485, 2018.
  32. Wilke RA, Xu H, Denny JC, Roden DM, Krauss RM, McCarty CA, Davis RL, Skaar Todd, Lamba J, and Savova G. The emerging role of electronic medical records in pharmacogenomics. Clinical Pharmacology & Therapeutics, 89(3):379–386, 2011.
  33. Yu Sheng, Liao Katherine P, Shaw Stanley Y, Gainer Vivian S, Churchill Susanne E, Szolovits Peter, Murphy Shawn N, Kohane Isaac S, and Cai Tianxi. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association, 22(5):993–1000, 2015.
  34. Zhang Lingjiao, Ding Xiruo, Ma Yanyuan, Muthu Naveen, Ajmal Imran, Moore Jason H, Herman Daniel S, and Chen Jinbo. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients. Journal of the American Medical Informatics Association, 27(1):119–126, 2020.
  35. Zhang Yichi, Cai Tianrun, Yu Sheng, Cho Kelly, Hong Chuan, Sun Jiehuan, Huang Jie, Ho Yuk-Lam, Ananthakrishnan Ashwin N, Xia Zongqi, et al. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nature Protocols, 14(12):3426–3444, 2019.
  36. Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. doi: 10.1198/016214506000000735.
  37. Zou Hui and Zhang Hao Helen. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009. doi: 10.1214/08-AOS625.
