Journal of the American Medical Informatics Association (JAMIA). 2021 Dec 28;29(5):918–927. doi: 10.1093/jamia/ocab267

SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies

Xiaokang Liu 1, Jessica Chubak 2,3, Rebecca A Hubbard 4, Yong Chen 5,
PMCID: PMC9714591  PMID: 34962283

Abstract

Objectives

Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies solely relying on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to simultaneously address both issues by developing new sampling methods to select an optimal subsample to collect gold standard phenotypes for improving the accuracy of association estimation.

Materials and Methods

We develop a surrogate-assisted two-wave (SAT) sampling method, in which a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated by the A-optimality criterion (OSMAC) are employed sequentially to select a subsample for outcome validation through manual chart review, subject to budget constraints. A model is then fitted based on the subsample with the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors demonstrate the effectiveness of SAT.

Results

We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association.

Conclusions

The proposed approach handles both the rarity of cases and the misclassification of the surrogate in EHR-based association studies where the true phenotype is initially absent. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.

Keywords: association study, electronic health records, error in phenotype, rare disease, sampling strategy

INTRODUCTION

Real-world evidence on the effectiveness of medical therapies can be derived from the study of health care information collected in the course of delivery and administration of routine medical care. Some typical sources of real-world data (RWD) include electronic health records (EHR), administrative claims, and billing data. Notably, these data are generated as a by-product of the health information ecosystem and are not directly intended to support research. In contrast, clinical trials and comparative effectiveness research studies relying on primary data collection focus on generation of high-quality research data, but, as a result, patient populations and care patterns represented in these studies may not be generalizable to real-world patients and their health care experiences. RWD provides a data resource that directly inherits the characteristics of patients and their care environments as they occur in routine clinical care, and the use of RWD for research has the potential to overcome limits to generalizability of research study data.1

EHR, one important source of RWD, is a collection of patients’ individual health records that are routinely collected by health care providers and medical devices.2 Diagnosis, clinical, and medication information in EHR can serve as important resources for biomedical and clinical research,3–5 for example, to identify phenotype-genotype associations or improve clinical care quality. EHR-based association studies aimed at identifying risk factors for patient phenotypes have attracted substantial attention.6,7 By providing data on a large pool of patients, EHR data are uniquely valuable for studying rare diseases. The 1983 Orphan Drug Act defined a rare disease as a condition that affects fewer than 200,000 persons in the United States, and there are over 10,000 rare diseases which together affect 10% of the population.8 For example, progressive supranuclear palsy is a rare degenerative neurological disease and is believed to affect at least 20,000 Americans.9 Some rare diseases are life-threatening, and patients who suffer from rare diseases often face diagnostic delays or misdiagnoses due to the lack of sufficient data to establish reliable diagnostic standards. Therefore, EHR-based studies investigating risk factors for rare diseases are particularly useful for advancing understanding of disease mechanisms and identifying biomarkers that may facilitate diagnosis.10,11 Traditional statistical methods are the workhorse of such investigations; however, their performance is sensitive to the quality of the data.12–18 In this article, we consider two challenges to the validity of conclusions in rare disease research relying on error-prone EHR-derived phenotypes (hereafter referred to as the surrogate phenotype).

EHR-based phenotyping algorithms have been developed to generate surrogates of the true patient phenotype based on EHR data and other ancillary data resources, including disease registries and claims data.19,20 Surrogate phenotypes can be used to identify cohorts, exposures, and outcomes for studies such as genome- and phenome-wide association studies in which the surrogate phenotype is used in place of the true phenotype in analysis.21–23 Before a phenotyping algorithm can be put into application, assessment of its accuracy using a validation dataset is necessary to guarantee its validity. However, even after validation, the performance of phenotyping algorithms is still affected by many factors, and imperfect classification accuracy induces bias and degrades the reliability of analysis results.12–15 One analytic alternative is to use a subsample of patients whose true phenotypes are collected through manual chart review in conjunction with the surrogate information of the full cohort to conduct combined statistical analyses.16–18 As the size of this subsample may be small due to time and budget constraints, developing methods to select subjects for chart review that lead to the best performance of subsequent analyses has the potential to significantly improve EHR-based association analyses.

The low prevalence of the outcome is another issue that requires special attention in analysis. Classifiers trained with a low prevalence and hence highly imbalanced outcome may have poor performance in the minority class, and the subsample with the natural class distribution inherited from the population commonly does not lead to the best classification performance.24–26 Low prevalence outcomes also result in larger variances in coefficient estimation.27 Methods for the low prevalence setting include randomly over-sampling (under-sampling) the minority (majority) class or generating synthetic cases.28,29 Another solution is to form a balanced subsample with a well-designed sampling strategy.30,31 Thus, a properly tailored sampling method could simultaneously address both the bias due to error in the surrogate phenotype and the efficiency loss caused by imbalanced classes.

A number of sampling methods have been developed to address either low prevalence outcomes or surrogate phenotypes. For example, in the low prevalence setting, Tan and Heagerty30 proposed a surrogate-guided sampling (SGS) design to enrich the cases in a subsample. Fithian and Hastie31 proposed a local case-control sampling strategy to select influential subsamples, that is, subjects whose observed responses contradict their predicted class. Such subjects are more difficult to classify conditional on the features and are therefore influential in forming the classification boundary. When the true phenotype is unavailable, inspired by an optimal subsampling procedure motivated by the A-optimality criterion (OSMAC),32 Zhang et al33 proposed an optimal nearly response-independent sampling scheme (optimal sampling under measurement constraints, abbreviated OSUMC). OSMAC and OSUMC both target minimizing the asymptotic mean squared error (MSE) of the subsample estimator, that is, the trace of its asymptotic variance-covariance matrix, by selecting an optimal set of sampling probabilities. They differ in that OSMAC requires information on true phenotypes whereas OSUMC does not, because the derived variance-covariance matrix of OSMAC involves the true phenotype while that of OSUMC does not. However, none of these methods is designed to handle low prevalence and error-prone surrogate phenotypes simultaneously, which motivates the method developed here.

The proposed surrogate-assisted two-wave (SAT) sampling method leverages the strengths of SGS and OSMAC by incorporating the surrogate information into two subsampling waves to control the estimator’s MSE. See Figure  1 for a research pipeline and see Figure  2 for a conceptual diagram of SAT. Specifically, at the first stage, a pilot subsample is obtained by applying SGS to enrich cases, which facilitates a better pilot estimator to approximate the full cohort maximum likelihood estimator (MLE). At the second stage, by plugging the pilot estimator into OSMAC sampling probabilities and replacing the true phenotype by a function of the surrogate, we compute the SAT sampling probabilities and then obtain the second-stage subsample. We then get the final estimator by fitting a weighted logistic regression on the pooled subsample that combines both the pilot subsample and the second-stage subsample. To the best of our knowledge, this is the first sampling method that utilizes surrogate information to address the effects of both low prevalence and measurement error of an outcome in EHR-based association studies. When compared with OSUMC, SAT further exploits the information in the surrogate to aid sampling, while compared with SGS, SAT utilizes an additional MSE minimization rule in the second-stage sampling to improve estimation accuracy.

Figure 1.


The pipeline for association studies from electronic health record (EHR) data resource to final analysis. Based on the EHR data, surrogate phenotypes are generated by a validated well-behaved phenotyping algorithm, and gold standard phenotypes are obtained through manual chart review for a selected subsample of patients. The association analysis is conducted only on this subsample with their true phenotypes and risk factors. Our proposed method is a novel method that focuses on surrogate-assisted subsample selection to include informative samples to improve the subsequent association estimation.

Figure 2.


Surrogate-assisted two-wave (SAT) case boosting sampling. At stage 1, for enriching cases, we apply surrogate-guided sampling (SGS) to obtain a pilot subsample of size n1, for which we collect the true phenotypes and use them to obtain a pilot estimator for the association parameter. At stage 2, based on the pilot estimator, the covariates, and the surrogate phenotype, we compute the SAT sampling probabilities which target controlling mean squared error (MSE) and apply them to obtain the second-stage subsample of size n2, for which the true phenotypes are collected. The final estimator is obtained by fitting a weighted logistic regression in the pooled subsample of size n1+n2 with their true phenotypes.

MATERIALS AND METHODS

Methods

The method is applied to an EHR-derived dataset to conduct an association analysis between a binary response $Y$, with $y_i \in \{0, 1\}$, and a set of covariates $x_i^* \in \mathbb{R}^{p-1}$ using logistic regression. For all $N$ patients in the dataset, we observe $(x_i^*, s_i)$, $i = 1, \ldots, N$, where $s_i \in \{0, 1\}$ is the binary algorithm-derived surrogate ($S$). The true phenotype is only available for a small subsample that consists of two parts: the pilot subsample $(x_i^*, y_i, s_i)$ with $i = 1, \ldots, n_1$ patients, and the second-stage subsample $(x_i^*, y_i, s_i)$ with $i = 1, \ldots, n_2$ patients. Sampling techniques are exploited to facilitate subsample selection, and the associated set of sampling probabilities is denoted $\{\pi_i\}$, $i = 1, \ldots, N$. We write $x_i = (1, x_i^{*T})^T \in \mathbb{R}^p$ to include the intercept.

Existing method: OSMAC

OSMAC minimizes the asymptotic MSE of the subsample estimator by selecting an appropriate set of sampling probabilities. A pilot subsample of size $n_1$ is needed to obtain a pilot estimator $\hat\beta_{\text{pilot}}$ for the subsequent sampling probability calculation. With imbalanced data, the pilot subsample is selected through a stratified sampling procedure where the binary response is used as the grouping variable, and the stage 1 sampling probabilities are $\pi_i = \frac{1}{2 \times \#\{\text{cases}\}}$ for $y_i = 1$ and $\pi_i = \frac{1}{2 \times \#\{\text{controls}\}}$ for $y_i = 0$. The stage 2 sampling probabilities are then computed as

$$\pi_i^{\text{OSMAC}} = \frac{|y_i - p_i(\hat\beta_{\text{pilot}})| \, \|M_X^{-1} x_i\|_2}{\sum_{j=1}^{N} |y_j - p_j(\hat\beta_{\text{pilot}})| \, \|M_X^{-1} x_j\|_2} \quad (1)$$

with $M_X = \sum_{i=1}^{N} p_i(\hat\beta_{\text{pilot}})(1 - p_i(\hat\beta_{\text{pilot}})) \, x_i x_i^T$ and $p_i(\hat\beta_{\text{pilot}}) = (1 + e^{-x_i^T \hat\beta_{\text{pilot}}})^{-1}$. The second-stage sampling is then conducted with $\pi_i^{\text{OSMAC}}$ to obtain a subsample of size $n_2$ ($n_2 > n_1$). The final estimator is obtained by fitting a weighted logistic regression on the pooled subsample of size $n = n_1 + n_2$:

$$\hat\beta_{\text{OSMAC}} = \arg\min_{\beta} \; -\frac{1}{n_1 + n_2} \Bigg( \sum_{i \in \text{pilot set}} \frac{1}{\pi_i} \big\{ y_i \log p_i(\beta) + (1 - y_i) \log(1 - p_i(\beta)) \big\} + \sum_{j \in \text{second-stage set}} \frac{1}{\pi_j^{\text{OSMAC}}} \big\{ y_j \log p_j(\beta) + (1 - y_j) \log(1 - p_j(\beta)) \big\} \Bigg). \quad (2)$$

Because calculating the sampling probabilities requires knowledge of the true phenotype $y_i$, this method is not practicable in real-world settings.
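For concreteness, the second-stage probabilities in equation (1) can be computed directly from a pilot estimate. The sketch below is our own illustration (not the authors' code; `osmac_probs`, `inv2`, and `sigmoid` are our names), restricted to two coefficients so that the 2×2 matrix $M_X$ can be inverted in closed form.

```python
import math

def sigmoid(t):
    # numerically safe logistic function
    return 1.0 / (1.0 + math.exp(-max(min(t, 30.0), -30.0)))

def inv2(m):
    # closed-form inverse of a 2x2 matrix
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def osmac_probs(X, y, beta_pilot):
    """Second-stage OSMAC probabilities: each subject's probability is
    proportional to |y_i - p_i(beta_pilot)| * ||M_X^{-1} x_i||_2,
    as in equation (1)."""
    N = len(X)
    p = [sigmoid(beta_pilot[0] * xi[0] + beta_pilot[1] * xi[1]) for xi in X]
    # M_X = sum_i p_i (1 - p_i) x_i x_i^T
    M = [[sum(p[i] * (1.0 - p[i]) * X[i][r] * X[i][c] for i in range(N))
          for c in range(2)] for r in range(2)]
    Minv = inv2(M)
    score = []
    for i in range(N):
        v0 = Minv[0][0] * X[i][0] + Minv[0][1] * X[i][1]
        v1 = Minv[1][0] * X[i][0] + Minv[1][1] * X[i][1]
        score.append(abs(y[i] - p[i]) * math.hypot(v0, v1))
    tot = sum(score)
    return [s / tot for s in score]
```

Subjects whose observed response disagrees with the pilot model's prediction (large $|y_i - p_i|$) receive larger probabilities, which is exactly why the true phenotype is needed here.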

Existing method: OSUMC

The idea of OSUMC is very similar to that of OSMAC, except that its asymptotic results are derived unconditionally on $Y$. Thus, the asymptotic variance-covariance matrix of the subsample estimator does not involve $Y$, and the resulting $\pi_i$ require no information on $y_i$, that is:

$$\pi_i^{\text{OSUMC}} = \frac{p_i(\hat\beta_{\text{pilot}})(1 - p_i(\hat\beta_{\text{pilot}})) \, \|M_X^{-1} x_i\|_2}{\sum_{j=1}^{N} p_j(\hat\beta_{\text{pilot}})(1 - p_j(\hat\beta_{\text{pilot}})) \, \|M_X^{-1} x_j\|_2} \quad (3)$$

where $M_X$ and $p_i(\hat\beta_{\text{pilot}})$ are defined as in OSMAC. OSUMC does not account for outcome prevalence and uses simple random sampling (SRS) for pilot subsample selection. Only the second-stage subsample is used in the final estimation step to fit a weighted logistic regression. The OSUMC estimator is denoted $\hat\beta_{\text{OSUMC}}$.

Existing method: SGS

SGS is a surrogate-assisted case-boosting algorithm. Specifically, the whole dataset is divided into two strata according to the surrogate value, $s_i = 1$ or $s_i = 0$, and different sampling probabilities are then applied. Let $z_i = 1$ indicate that the $i$-th subject is selected into the subsample and $z_i = 0$ otherwise, and let $n_{s1}$ and $n_{s0}$ be the numbers of positive and negative surrogates in the whole dataset. Then:

$$\pi_i^{\text{SGS}} = P(z_i = 1 \mid s_i = s) = \frac{P(s_i = s \mid z_i = 1) \, P(z_i = 1)}{P(s_i = s)} \quad (4)$$

for $s \in \{0, 1\}$. The case proportion parameter $R = P(s_i = 1 \mid z_i = 1)$ is chosen by the investigator to adjust the case proportion in the subsample; $P(z_i = 1)$ can be estimated by $n_1/N$, and $P(s_i = s)$ is approximated by $n_{s1}/N$ or $n_{s0}/N$. After standardizing so that $\sum_{i=1}^{N} \pi_i^{\text{SGS}} = 1$, we get $\pi_i^{\text{SGS}} = R/n_{s1}$ for $s_i = 1$ and $\pi_i^{\text{SGS}} = (1 - R)/n_{s0}$ for $s_i = 0$.
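The standardized SGS probabilities are simple enough to write out directly (a minimal sketch; the function name is ours):

```python
def sgs_probs(s, R):
    """Standardized surrogate-guided sampling probabilities:
    R / n_s1 for surrogate-positive subjects and (1 - R) / n_s0 for
    surrogate-negative subjects, so the probabilities sum to 1."""
    n_s1 = sum(s)
    n_s0 = len(s) - n_s1
    return [R / n_s1 if si == 1 else (1 - R) / n_s0 for si in s]
```

With $R = 0.5$, the expected surrogate-positive share of the subsample is 50%, regardless of the surrogate prevalence in the cohort; this is the case-boosting effect.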

Proposed method: SAT

This method extends OSMAC to the scenario where $Y$ is unavailable for the majority of the cohort and has low prevalence in the study population. The two sampling steps of OSMAC are modified accordingly.

Stage 1: The SGS method is employed to boost cases.

Stage 2: As the true phenotype is not available for all subjects, we use OSMAC with surrogate phenotype:

$$\pi_i^{\text{SAT}} = f(\hat\beta_{\text{pilot}}, x_i, s_i) = \frac{|y_i^c - p_i(\hat\beta_{\text{pilot}})| \, \|M_X^{-1} x_i\|_2}{\sum_{j=1}^{N} |y_j^c - p_j(\hat\beta_{\text{pilot}})| \, \|M_X^{-1} x_j\|_2} \quad (5)$$

where $y_i^c = E(y_i \mid s_i = s) = P(y_i = 1 \mid s_i = s)$ is the expectation of the true phenotype conditional on the corresponding surrogate observation. To obtain $y_i^c$, we write $P(y_i = 1 \mid s_i = s) = \frac{P(s_i = s \mid y_i = 1) \, P(y_i = 1)}{P(s_i = s)}$, where $P(s_i = 1 \mid y_i = 1)$ can be computed from the pilot subsample as the proportion of positive surrogates among all cases, with $P(s_i = 0 \mid y_i = 1) = 1 - P(s_i = 1 \mid y_i = 1)$; $P(y_i = 1)$ is approximated by $p_i(\hat\beta_{\text{pilot}})$; and $P(s_i = s)$ is approximated by $n_{s1}/N$ or $n_{s0}/N$. A weighted logistic regression is then fitted on the pooled subsample to get the final estimator $\hat\beta_{\text{SAT}}$. An alternative is to replace $y_i^c$ by $s_i$ in the calculation of $\pi_i^{\text{SAT}}$; we refer to these two variants as SAT-cY and SAT-S, respectively.
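The Bayes step for $y_i^c$ can be written out as follows (a sketch under our own naming: `sens_hat` is the pilot-estimated $P(s_i = 1 \mid y_i = 1)$, `p_i` the pilot-model estimate of $P(y_i = 1)$, and `ns1`, `ns0` the cohort surrogate counts described above):

```python
def conditional_case_prob(s_i, p_i, sens_hat, ns1, ns0, N):
    """y_i^c = P(y_i = 1 | s_i = s) by Bayes' rule:
    P(s | y = 1) * P(y = 1) / P(s), with P(s | y = 1) estimated from the
    pilot subsample, P(y = 1) approximated by p_i(beta_pilot), and
    P(s) approximated by ns1/N or ns0/N."""
    p_s_given_case = sens_hat if s_i == 1 else 1.0 - sens_hat
    p_s = (ns1 if s_i == 1 else ns0) / N
    return p_s_given_case * p_i / p_s
```

For example, with an estimated sensitivity of 0.9, a pilot-model probability of 0.05, and 800 surrogate positives among 10,000 subjects, a surrogate-positive subject gets $y_i^c = 0.9 \times 0.05 / 0.08 = 0.5625$, a large upward revision of the case probability.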

Proposed algorithm:

  1. With a given case proportion parameter $R$, compute $\pi_i^{\text{SGS}} = R/n_{s1}$ for $s_i = 1$ and $\pi_i^{\text{SGS}} = (1 - R)/n_{s0}$ for $s_i = 0$, and sample with replacement to obtain a pilot subsample of size $n_1$. Collect true phenotypes for the selected subjects.

  2. Fit a weighted logistic regression on the pilot subsample to get the pilot estimator:
    $$\hat\beta_{\text{pilot}} = \arg\min_{\beta} \; -\frac{1}{n_1} \sum_{i=1}^{n_1} \frac{1}{\pi_i^{\text{SGS}}} \big\{ y_i \log p_i(\beta) + (1 - y_i) \log(1 - p_i(\beta)) \big\}.$$
  3. If using SAT-cY, compute $P(s_i = 1 \mid y_i = 1) \approx \frac{\#\{\text{positive surrogates among cases}\}}{\#\{\text{cases}\}}$ and $P(s_i = 0 \mid y_i = 1) = 1 - P(s_i = 1 \mid y_i = 1)$ based on the pilot subsample, and then calculate $y_i^c = \frac{P(s_i = s \mid y_i = 1) \, P(y_i = 1)}{P(s_i = s)}$. Otherwise, skip this step and go to step 4.

  4. Obtain $\pi_i^{\text{SAT}}$ in equation (5) with $y_i^c$ or $s_i$ and conduct the second-stage sampling with replacement to get a subsample of size $n_2$ ($n_2 > n_1$). Collect true phenotypes for the selected subjects.

  5. Pool all the subsamples and compute:
    $$\hat\beta_{\text{SAT}} = \arg\min_{\beta} \; -\frac{1}{n_1 + n_2} \Bigg( \sum_{i \in \text{pilot set}} \frac{1}{\pi_i^{\text{SGS}}} \big\{ y_i \log p_i(\beta) + (1 - y_i) \log(1 - p_i(\beta)) \big\} + \sum_{j \in \text{second-stage set}} \frac{1}{\pi_j^{\text{SAT}}} \big\{ y_j \log p_j(\beta) + (1 - y_j) \log(1 - p_j(\beta)) \big\} \Bigg).$$

Simulation study

We compare SAT to the uniform method (obtain n samples by SRS), SGS (obtain n samples by SGS), and OSUMC under four settings. OSMAC is provided as a reference representing the best performance that could be achieved if true phenotypes were available for the full sample. The considered settings are:

  1. In this setting, we use a surrogate exhibiting nondifferential sensitivity and specificity. The covariates $x_i^* \in \mathbb{R}^{p-1}$ with $p = 8$ are generated from a multivariate normal distribution $N_{p-1}(-1.5\,\mathbf{1}_{p-1}, \Sigma)$ with $\Sigma = (0.5^{1(i \neq j)})$, where $1(i \neq j) = 1$ if $i \neq j$ and $0$ otherwise, and $\mathbf{1}_{p-1}$ is a length $p-1$ vector of 1's. The true coefficient vector is $\beta^* = (-0.5, 0, 0, 0.5, 0.5, 0.5, 0.5, 0.5)^T$, giving a low case prevalence $P(Y = 1) = 5.4\%$. The response $Y$ is generated independently from the Bernoulli distribution, and the surrogate $S$ is generated from a Bernoulli distribution with sensitivity 85% and specificity 95%. We fix $n_1 = 300$ and increase $n_2$ from 500 to 1,200 by 100 to evaluate the effect of sample size on performance. The case proportion parameter $R$ is set to 0.5. Moreover, to see the effects of sensitivity and specificity on the performance of SAT, we also fix $(n_1, n_2) = (300, 800)$ and let sensitivity and specificity take values in $(0.80, 0.85, 0.90, 0.95, 0.99)$ to generate $S$.

  2. The surrogate's sensitivity and specificity depend on the exposure group. The first covariate, ie, the exposure, is generated from a Bernoulli distribution with probability 0.4, and the remaining covariates are generated from $N_{p-2}(-1.5\,\mathbf{1}_{p-2}, \Sigma)$ with $\Sigma = (0.5^{1(i \neq j)})$. The coefficient vector is $\beta^* = (-0.7, 0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5)^T$, which makes $P(Y = 1) = 5.4\%$. When generating the surrogate, we consider two scenarios: (A) for the unexposed group, sensitivity is 85% and specificity is 90%, while for the exposed group, sensitivity is 80% and specificity is 95%; (B) for the unexposed group, sensitivity is 85% and specificity is 90%, while for the exposed group, sensitivity is 95% and specificity is 80%. We fix $n_1 = 300$ and increase $n_2$ from 500 to 1,200 by 100. $R$ is set to 0.5.

  3. We investigate the effect of the pilot ratio $\rho = n_1/n \in \{0.1, 0.15, \ldots, 0.85, 0.9\}$ on the performance of SAT with pooled subsample size $n = 800$; a surrogate with nondifferential sensitivity (85%) and specificity (95%) is used. All other details are the same as in setting 1.

  4. We evaluate the effect of the case proportion parameter $R \in \{0.1, 0.2, \ldots, 0.8, 0.9\}$ on the performance of SAT with case prevalence values of 1.6% and 5.4%, respectively, to find a range of $R$ that ensures SAT's usability and robustness. We fix $n_1 = 300$ and $n_2 = 800$ and use a surrogate with nondifferential sensitivity (85%) and specificity (95%). To obtain a case prevalence of 1.6%, we let $\beta^* = (-2, 0, 0, 0.5, 0.5, 0.5, 0.5, 0.5)^T$. All other details are the same as in setting 1.

See Table 1 for a summary of the simulation settings. Under each setting, the whole population is generated once with full sample size $N = 100{,}000$, and the sampling procedure is repeated 1,000 times. The methods are compared in terms of their estimation accuracy (MSE), computed as $\frac{1}{1{,}000} \sum_{t=1}^{1{,}000} \|\hat\beta^{(t)} - \beta^*\|_2^2$, where $\hat\beta^{(t)}$ is the estimated coefficient vector in replicate $t$, $\beta^*$ is the true coefficient vector, and $\|\cdot\|_2^2$ is the squared Euclidean norm.
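The accuracy metric is a straightforward average over replicates; as a worked example (function name ours):

```python
def empirical_mse(estimates, beta_star):
    """Empirical MSE over replicates: the average squared Euclidean
    distance between each replicate's estimate and the truth."""
    return sum(sum((b - t) ** 2 for b, t in zip(est, beta_star))
               for est in estimates) / len(estimates)
```

For instance, two replicates estimating $(1, 1)$ and $(0, 0)$ against a truth of $(0, 0)$ give an empirical MSE of $(2 + 0)/2 = 1$.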

Table 1.

Details of simulation settings

Setting 1 (baseline): n1 = 300, n2 = 500–1,200; nondifferential misclassification with (sensitivity, specificity) = (0.85, 0.90).
Setting 1, surrogate-property variant: n1 = 300, n2 = 800; sensitivity and specificity each take values in (0.80, 0.85, 0.90, 0.95, 0.99).
Setting 2 (differential misclassification by exposure group): n1 = 300, n2 = 500–1,200; scenario A: exposed (0.80, 0.95), unexposed (0.85, 0.90); scenario B: exposed (0.95, 0.80), unexposed (0.85, 0.90).
Setting 3 (varying pilot size): n1 = 800 × ρ, n2 = 800 − n1, with ρ ∈ [0.1, 0.9]; nondifferential misclassification (0.85, 0.90).
Setting 4 (varying case proportion): n1 = 300, n2 = 800; nondifferential misclassification (0.85, 0.90).

Note: For all settings, the full sample size is N = 100,000 and the dimension is p = 8. For settings 1, 3, and 4, all covariates are generated from a multivariate normal distribution; for setting 2, the exposure variable is generated from a Bernoulli distribution with probability 0.4 and the remaining covariates from a multivariate normal distribution.

Application to a breast cancer study

We applied these methods to an EHR dataset from the BRAVA study, which focused on identifying risk factors for second breast cancer events (SBCE, defined as recurrent or second primary breast cancer) in a cohort of women with a personal history of breast cancer at Kaiser Permanente Washington,34 an integrated health care delivery system in Washington state. The dataset consists of 3,152 women diagnosed with a primary stage I–IIB invasive breast cancer between 1993 and 2006. The outcome of interest, SBCE, was obtained through manual review of medical records for all cohort members and was treated as the gold standard. Additionally, an algorithm-derived surrogate phenotype was obtained based on automated EHR and cancer registry data.35 We use characteristics of the primary breast cancer as risk factors, including patient age at primary breast cancer diagnosis, tumor stage (stage I or II), and estrogen receptor status. The study population has a moderately low prevalence of SBCE: among the 3,152 patients, 407 (12.7%) experienced SBCE. To investigate SAT's performance on EHR datasets with substantially lower phenotype prevalence, we also experimented with synthesized datasets with phenotype prevalences of 2.5% and 5%, created by oversampling the controls in BRAVA. The availability of the gold standard SBCE phenotype for the full cohort allows estimation of an oracle full cohort MLE, against which we calculate the empirical MSE and estimation bias across 500 replicates of sampling and estimation to compare methods.

Similar to the settings considered in Tong et al,18 we compare the methods with surrogates generated in three ways. The first scenario (case 1) uses a previously published high-specificity SBCE phenotyping algorithm to generate the surrogate, with 89% sensitivity and 99% specificity.35 The second scenario (case 2) uses a randomly generated surrogate with nondifferential 90% sensitivity and 95% specificity. The third scenario (case 3) uses a surrogate exhibiting differential misclassification: for patients with stage I primary cancers, we generated a surrogate with 90% sensitivity and 95% specificity, whereas for stage II patients, the surrogate was simulated with 95% sensitivity and 90% specificity. Under each scenario, different combinations of pilot subsample size $n_1$ (200 and 400) and second-stage subsample size $n_2$ (400, 700, and 1,000) are tested. We use $R = 0.5$ in the analysis.

RESULTS

Results from simulation studies

Among all the methods under comparison, the uniform method had the worst performance. Because OSUMC uses SRS in the pilot sampling stage, the low prevalence of the outcome led to failures in estimating $\hat\beta_{\text{pilot}}$, especially when $n_1$ was small. For methods with pilot subsamples obtained by SGS, there were very few failed estimations. The results reported for each method are limited to the replicates in which the pilot estimation did not fail. Under setting 1 (Figures 3 and 4), SAT-S had performance comparable to SAT-cY, and both achieved a smaller MSE than OSUMC and SGS. Figure 4 displays head-to-head comparisons of MSEs among methods over a grid of sensitivities and specificities. In summary, OSUMC performed worse than the SAT methods, and a surrogate with high specificity enabled the SAT methods to beat SGS. Between SAT-cY and SAT-S, once the sensitivity was high enough, SAT-S outperformed SAT-cY even when the specificity was moderate. The results of setting 2 (see Supplementary Figure S3) indicate that, with a surrogate exhibiting differential misclassification, the SAT methods outperformed OSUMC and performed at least as well as SGS. The results from settings 1 and 2 also demonstrate that the estimation accuracy of all methods improved as $n_2$ (and thus $n$) increased. Figure 5 shows how the MSE changes when we fix $n$ and increase the pilot subsample's proportion $\rho$. For SAT-cY and SAT-S, when $0 < \rho < 0.25$, the MSE decreased sharply with increasing $\rho$, because only when the pilot subsample is large enough can the pilot estimator approximate the full cohort MLE reasonably well. Beyond $\rho = 0.4$, the curves of SAT-S and SAT-cY show an increasing trend. As $\rho \to 1$, the two curves converge to the same level: the MSE of applying SGS at both sampling stages. This is because the second-stage subsample is key to controlling the MSE, and a larger $\rho$ implies a smaller second-stage subsample and thus a less effective reduction of the MSE.
Therefore, there exists an optimal $\rho$ that balances the quality of the pilot estimator against the amount of MSE reduction. As for the effect of $R$ on SAT, the results of setting 4 (see Supplementary Figure S4) show that the region of $R$ corresponding to lower MSEs is $[0.3, 0.5]$ for a case prevalence around 4%. To avoid failures in estimating $\hat\beta_{\text{pilot}}$ when the case prevalence is low and $n_1$ is small, a slightly larger $R \in [0.5, 0.6]$ can be used without compromising the performance of SAT. Using $R = 0.5$ is a safe choice in most situations.

Figure 3.


Comparison of methods under setting 1 with $n_1 = 300$ and an increasing second-stage subsample size $n_2 \in \{500, 600, 700, 800, 900, 1{,}000, 1{,}100, 1{,}200\}$. The surrogate exhibits nondifferential misclassification with sensitivity of 85% and specificity of 95%. OSMAC is displayed as a gold standard for reference. Under setting 1, the SAT methods achieved smaller MSE than SGS, OSUMC, and the uniform method.

Figure 4.


Comparison of methods under setting 1 with (n1, n2)=(300, 800) for different combinations of sensitivities and specificities. The MSE differences between methods over a grid of sensitivities and specificities are displayed. The heatmap (a) represents the difference between the MSE of SGS and the MSE of SAT-cY (ie, MSE[SGS] – MSE[SAT-cY]) for each combination of sensitivity and specificity. The red block denotes that, at the specified sensitivity and specificity, SGS has a larger MSE than SAT-cY, and the purple block denotes that SAT-cY has a larger MSE than SGS. All the subsequent heatmaps can be similarly interpreted. From these plots we observe that, in terms of MSE minimization, OSUMC performed worse than the SAT methods, and a surrogate with a high specificity enabled the SAT methods to outperform SGS.

Figure 5.


Comparison of methods under setting 3 with pooled subsample size $n = 800$ and an increasing pilot ratio $\rho = n_1/n \in \{0.1, 0.15, \ldots, 0.85, 0.9\}$. The surrogate exhibits nondifferential misclassification with sensitivity of 85% and specificity of 95%. OSMAC is displayed as a gold standard for reference. The variation in MSE for the two-stage methods (OSMAC, SAT-S, and SAT-cY) as a function of $\rho$ indicates that the pilot subsample size $n_1$ should be relatively small to guarantee good MSE reduction, yet not too small, as the first-stage pilot sample is needed to produce a well-behaved pilot estimator.

Results from BRAVA data analysis

In the analysis of the original BRAVA dataset, as in the simulation, when $n_1 = 200$ there were two failures in estimating $\hat\beta_{\text{pilot}}$ for OSUMC due to the low prevalence of the outcome, and the methods with pilot subsamples obtained by SGS successfully resolved this problem. Figure 6 reports the MSE of SAT-S, SAT-cY, OSUMC, and SGS. The results display a pattern similar to the simulation: for all three surrogate types, OSUMC and SGS perform worse than the SAT methods, and SAT-S has performance comparable to SAT-cY. Supplementary Figure S5 shows the estimation bias of the effect (log odds ratio) of tumor stage on the occurrence of SBCE under case 1. All methods appear to be unbiased; OSUMC has a larger variance than the other methods, which perform comparably. The results for cases 2 and 3 are similar (see Supplementary Figures S6 and S7). In settings where we varied prevalence by oversampling controls, SAT achieved smaller MSEs than SGS and OSUMC (Supplementary Material, Part I).

Figure 6.


Comparison of methods in terms of MSE using BRAVA breast cancer data. From the first row to the third row, the plots show the results for surrogate cases 1–3, respectively. The first column is for the setting with pilot subsample size $n_1 = 200$, and the second column for $n_1 = 400$. Each panel displays the MSE of all methods at three levels of the second-stage subsample size, $n_2 = 400$, 700, and 1,000. The MSE is calculated as $\frac{1}{500} \sum_{t=1}^{500} \|\hat\beta^{(t)} - \beta^*\|_2^2$, where $\hat\beta^{(t)}$ is the estimated coefficient vector, $\beta^*$ is the oracle full cohort MLE based on all the true phenotypes, and $\|\cdot\|_2^2$ is the squared Euclidean norm. OSMAC is displayed as a gold standard for reference. These plots show a clear superiority of the SAT methods over SGS and OSUMC in terms of MSE reduction.

DISCUSSION

EHR-based association studies of rare diseases using surrogate phenotypes may produce unreliable conclusions. Both the low prevalence of the outcome and the error in the surrogate impede accurate estimation of risk factors' association with the outcome. Numerous techniques have been developed to handle these two challenges, but to our knowledge, there is little literature that considers the two problems simultaneously, although this is a commonly encountered problem in real-world studies of EHR data. Yin et al36 recently proposed a method that handles these two problems with a surrogate-assisted sampling scheme to enrich cases and a projection technique that exploits the correlation between outcome and surrogate to improve estimation efficiency based on a nondifferential misclassification assumption. Different from their idea and with no assumption, we formulated this problem as a subsample selection task and combined SGS and OSMAC to provide subsample selection guidance, the SAT methods. Through simulation studies, we demonstrated that with a proper surrogate whose sensitivity and specificity are reasonably high, SAT could effectively reduce MSE and is better than the methods that only resolve one of the two issues. For example, OSUMC provides a nearly response-independent subsampling procedure. However, failure to consider the phenotype prevalence results in inaccurate pilot estimation, and the method often fails to converge in pilot estimation. The BRAVA data analysis demonstrated the success of applying the SAT methods on real-world datasets. Although the rate of experiencing SBCE in BRAVA was 12.7%, which does not meet a strict definition of rare disease, we also investigated the SAT’s performance in rarer disease settings by oversampling controls. In these rare disease settings, SAT achieved smaller MSEs than SGS and OSUMC. 
Future research should focus on methods for very low case prevalence, for example, 1% or lower, settings in which even the OSMAC estimator has a large bias. Thus, for association studies of extremely rare diseases, the sampling approach is not the best choice and other methods should be considered.

SAT is proposed for EHR-based association studies based on the assumption that the information contained in the EHR is adequate to guarantee the validity of the manual chart review results, that is, a "true phenotype" can be obtained with manual chart review. However, there are situations where the reference is not a gold standard, because the completeness of the information required to guarantee the validity of a diagnosis may vary depending on health care utilization, clinical practice, and documentation practices (eg, an NIH clinical center compared with a community hospital). This issue has been discussed extensively in diagnostic test studies37,38 but is under-researched in EHR data settings.39,40 Linking EHR data to external databases with additional specialized information (eg, cancer registry data) may improve accuracy in some applications. In general, however, additional research is needed to develop new methods that account for imperfect manual chart review.

We have several suggestions for applying SAT in practice. First, when only a poor-quality surrogate (ie, sensitivity or specificity <0.8) is available, SGS or OSUMC are better choices than SAT. Second, although SAT-S achieves a greater MSE reduction than SAT-cY with a good surrogate in our simulation, SAT-cY is more robust than SAT-S to the accuracy of the surrogate. Unless a phenotyping algorithm has been validated to be of high accuracy, SAT-cY is preferred. Third, the number of cases in the pilot subsample depends on a user-specified case proportion parameter and the surrogate's accuracy, especially its specificity.30 If the case prevalence is very low, we can appropriately increase the case proportion parameter (eg, taking R from [0.5, 0.6]) or use a surrogate of higher specificity to include more cases in the pilot subsample. Finally, the pilot subsample needs to be large enough to provide a good approximation of the full cohort MLE for the subsequent sampling probability calculation.
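The third suggestion can be checked with a back-of-envelope Bayes calculation (ours, not a function from the paper or package): the expected true-case fraction of a pilot that draws a fraction R from surrogate-positives and 1-R from surrogate-negatives is R*PPV + (1-R)*(1-NPV), which is very sensitive to specificity when prevalence is low:

```python
def pilot_case_yield(prev, sens, spec, R):
    """Expected true-case fraction in a surrogate-guided pilot that
    samples a fraction R from surrogate-positives."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    one_minus_npv = ((1 - sens) * prev /
                     ((1 - sens) * prev + spec * (1 - prev)))
    return R * ppv + (1 - R) * one_minus_npv

# prevalence 2%, sensitivity 0.9, R = 0.5: raising specificity from
# 0.90 to 0.99 sharply raises the pilot's case fraction
low = pilot_case_yield(0.02, 0.9, 0.90, 0.5)
high = pilot_case_yield(0.02, 0.9, 0.99, 0.5)
print(round(low, 3), round(high, 3))  # 0.079 0.325
```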

In essence, SAT is a generalization of OSMAC to the phenotype-absent scenario in rare disease analysis, and it has some limitations. OSMAC was developed to minimize the asymptotic MSE of the subsample estimator, but its asymptotic properties were derived without considering an imbalanced class distribution. To our knowledge, there are no explicit results for the asymptotic distribution of the subsample estimator in the imbalanced class situation. Even though OSMAC and SAT successfully control the MSE in our simulations, theoretical development is needed to generalize the theoretical foundation of OSMAC to the low prevalence scenario. The second limitation is SAT's requirement of a moderately accurate phenotyping algorithm. In some investigations of rare diseases, a case-finding algorithm is commonly employed to capture as many cases as possible. These algorithms guarantee a high sensitivity of the surrogate but often have poor specificity, which is a barrier to the application of SAT. The third limitation lies in the selection of an optimal ρ for a given n. In general, a smaller ρ is preferred to maximize the extent of MSE reduction. However, n1 also needs to be large enough to obtain a good pilot estimator. The optimal ρ also depends on the problem itself, for example, data quality and signal strength, so it is hard to provide uniform guidance.
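For reference, a sketch of the A-optimality (mMSE) subsampling probabilities that motivate OSMAC, assuming the form π_i ∝ |y_i - p_i(β̂)|·‖M⁻¹x_i‖ with M the observed information at the pilot estimate (verify the exact form against ref 32 before relying on it):

```python
import numpy as np

def osmac_mmse_probs(X, y, beta_pilot):
    """A-optimality-style subsampling probabilities: observations whose
    label disagrees with the pilot fit, weighted by their leverage under
    the observed information matrix, get more sampling mass."""
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta_pilot, -30, 30)))
    w = p * (1.0 - p)
    M = (X * w[:, None]).T @ X / len(y)           # observed information
    lever = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)
    score = np.abs(y - p) * lever
    return score / score.sum()

# toy example: points the pilot fit misclassifies get larger mass
X = np.column_stack([np.ones(4), [2.0, -2.0, 0.5, -0.5]])
y = np.array([0.0, 1.0, 1.0, 0.0])
pi = osmac_mmse_probs(X, y, np.array([0.0, 1.5]))
```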

An alternative to the proposed sampling method is to directly apply a nearly 'perfect' surrogate produced by a highly polished phenotyping algorithm in association studies. Great efforts and advances have been made in developing high-throughput phenotyping algorithms,41–45 where techniques such as natural language processing tools are exploited to leverage information from multiple resources to improve classification accuracy. However, as discussed above, even after validation, the performance of phenotyping algorithms can still be affected by many factors in specific applications, and classification error is practically unavoidable. Our method approaches the problem in a different way and reduces the consequences of using error-prone surrogates in association studies. Thus, even with a good surrogate, applying the SAT methods remains a good choice that leverages both the experience of clinicians and the accuracy of the surrogate to provide an unbiased and efficient estimator.

Going forward, it is necessary to investigate the asymptotic properties of the subsample estimator in the imbalanced class scenario. Once the asymptotic distribution is available, we can follow the approach of OSMAC to derive an optimal set of sampling probabilities from minimizing the asymptotic MSE. Moreover, it is also worthwhile to develop sampling techniques that target improving prediction performance.

CONCLUSION

In this article, we introduced a surrogate-assisted two-wave case boosting sampling method to address estimation problems caused by low prevalence outcomes and error-prone surrogate phenotypes in association studies. With a well-behaved surrogate, SAT provides a more accurate estimator than benchmark methods. Further theoretical development to rigorously account for the imbalanced class distribution is needed to sharpen the method's performance in MSE minimization.

FUNDING

Research reported in this publication was supported in part by the National Institutes of Health (NIH) grants 1R01AI130460, 1R01AG073435, 1R56AG074604, 1R01LM013519, 1R56AG069880 (to XL and YC), R21CA227613 (to RH, YC, and JC), and R21CA143242 (to JC). Data collection was funded in part by the NIH grants R01CA093772, R01CA120562, and U01CA063731. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was supported partially through Patient-Centered Outcomes Research Institute (PCORI) Project Program Awards (ME-2019C3-18315 and ME-2018C3-14899). All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee.

AUTHOR CONTRIBUTIONS

XL and YC designed methods and experiments; XL conducted simulation experiments and data analysis; RH and JC provided the dataset from Kaiser Permanente Washington; XL drafted the main article; all coauthors made substantial contributions to revise the article; all authors interpreted the results and provided instructive comments; all authors have approved the article.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

Supplementary Material

ocab267_Supplementary_Data

ACKNOWLEDGMENTS

We wish to thank Jiayi Tong for her help in designing the layout for some of the figures. Some of the icons in Figure 1 are made by monkik, Freepik, GOWI and Slidicon from www.flaticon.com.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from Kaiser Permanente Washington but restrictions apply to the availability of these data, which were obtained under a data use agreement for the current study, and so are not publicly available. Data are available from the authors upon reasonable request and with permission of the Kaiser Permanente Washington Health Research Institute.

CODE AVAILABILITY STATEMENT

The R package to implement the method is publicly available at https://github.com/Penncil/SAT where the instructions to install the package are also available.

Contributor Information

Xiaokang Liu, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.

Jessica Chubak, Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA; Department of Epidemiology, University of Washington, Seattle, Washington, USA.

Rebecca A Hubbard, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.

Yong Chen, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.

REFERENCES

  • 1. Sherman RE, Anderson SA, Dal Pan GJ, et al.  Real-world evidence—what is it and what can it tell us. N Engl J Med  2016; 375 (23): 2293–7. [DOI] [PubMed] [Google Scholar]
  • 2. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-electronic-health-record-data-clinical-investigations-guidance-industry Accessed May 2021.
  • 3. Jensen PB, Jensen LJ, Brunak S.  Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet  2012; 13 (6): 395–405. [DOI] [PubMed] [Google Scholar]
  • 4. Atreja A, Achkar JP, Jain AK, Harris CM, Lashner BA.  Using technology to promote gastrointestinal outcomes research: a case for electronic health records. Am J Gastroenterol  2008; 103 (9): 2171–8. [DOI] [PubMed] [Google Scholar]
  • 5. Smoller JW.  The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet  2018; 177 (7): 601–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Ritchie MD, Denny JC, Crawford DC, et al.  Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet  2010; 86 (4): 560–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Li R, Duan R, Kember RL, et al.  A regression framework to uncover pleiotropy in large-scale electronic health record data. J Am Med Inform Assoc  2019; 26 (10): 1083–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Haendel M, Vasilevsky N, Unni D, et al.  How many rare diseases are there?  Nat Rev Drug Discov  2020; 19 (2): 77–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. https://rarediseases.org/rare-diseases/progressive-supranuclear-palsy/ Accessed May 2021.
  • 10. Höglinger GU, Melhem NM, Dickson DW, et al. ; PSP Genetics Study Group. Identification of common variants influencing risk of the tauopathy progressive supranuclear palsy. Nat Genet  2011; 43 (7): 699–705. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Sanchez-Contreras MY, et al.  Replication of progressive supranuclear palsy genome-wide association study identifies SLCO1A2 and DUSP10 as new susceptibility loci. Mol Neurodegener  2018; 13: 37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Neuhaus JM.  Bias and efficiency loss due to misclassified responses in binary regression. Biometrika  1999; 86 (4): 843–55. [Google Scholar]
  • 13. Duan R, Cao M, Wu Y, et al.  An empirical study for impacts of measurement errors on EHR based association studies. AMIA Annu Symp Proc  2017; 2016: 1764–73. [PMC free article] [PubMed] [Google Scholar]
  • 14. Copeland KT, Checkoway H, McMichael AJ, Holbrook RH.  Bias due to misclassification in the estimation of relative risk. Am J Epidemiol  1977; 105 (5): 488–95. [DOI] [PubMed] [Google Scholar]
  • 15. Lyles RH, Lin J.  Sensitivity analysis for misclassification in logistic regression via likelihood methods and predictive value weighting. Stat Med  2010; 29 (22): 2297–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Laurence SM, James PH.  Logistic regression when the outcome is measured with uncertainty. Am J Epidemiol  1997; 146: 195–203. [DOI] [PubMed] [Google Scholar]
  • 17. Lyles RH, Tang L, Superak HM, et al.  Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology  2011; 22 (4): 589–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Tong J, Huang J, Chubak J, et al.  An augmented estimation procedure for EHR-based association studies accounting for differential misclassification. J Am Med Inform Assoc  2020; 27 (2): 244–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Pathak J, Kho AN, Denny JC.  Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. J Am Med Inform Assoc  2013; 20 (e2): e206–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. https://rethinkingclinicaltrials.org/chapters/conduct/electronic-health-records-based-phenotyping/electronic-health-records-based-phenotyping-introduction/ Accessed May 2021.
  • 21. Kho AN, Hayes MG, Rasmussen-Torvik L, et al.  Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J Am Med Inform Assoc  2012; 19 (2): 212–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Ritchie MD, Denny JC, Zuvich RL, et al. ; Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group. Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation  2013; 127 (13): 1377–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Agarwal V, Podchiyska T, Banda JM, et al.  Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc  2016; 23 (6): 1166–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Weiss GM, Provost F. The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-43, Dept. of Computer Science, Rutgers Univ; 2001.
  • 25. Wei Q, Dunbrack RL Jr. The role of balanced training and testing data sets for binary classifiers in bioinformatics. PLoS One  2013; 8 (7): e67863. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Lin WJ, Chen JJ.  Class-imbalanced classifiers for high-dimensional data. Brief Bioinform  2013; 14 (1): 13–26. [DOI] [PubMed] [Google Scholar]
  • 27. Salas-Eljatib C, Fuentes-Ramirez A, Gregoire TG, Altamirano A, Yaitul V.  A study on the effects of unbalanced data when fitting logistic regression models in ecology. Ecol Indicators  2018; 85: 502–8. [Google Scholar]
  • 28. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP.  SMOTE: synthetic minority over-sampling technique. JAIR  2002; 16: 321–57. [Google Scholar]
  • 29. Lunardon N, Menardi G, Torelli N.  ROSE: a package for binary imbalanced learning. R J  2014; 6 (1): 79–89. [Google Scholar]
  • 30. Tan WK, Heagerty HP. Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data. Preprint at https://arxiv.org/abs/1904.00412 (2019).
  • 31. Fithian W, Hastie T.  Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat  2014; 42 (5): 1693–724. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Wang H, Zhu R, Ma P.  Optimal subsampling for large sample logistic regression. J Am Stat Assoc  2018; 113 (522): 829–44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Zhang T, Ning Y, Ruppert D.  Optimal sampling for generalized linear models under measurement constraints. J Comput Graph Stat  2021; 30 (1): 106–14. [Google Scholar]
  • 34. Boudreau DM, Yu O, Chubak J, et al.  Comparative safety of cardiovascular medication use and breast cancer outcomes among women with early stage breast cancer. Breast Cancer Res Treat  2014; 144 (2): 405–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Chubak J, Yu O, Pocobelli G, et al.  Administrative data algorithms to identify second breast cancer events following early-stage invasive breast cancer. J Natl Cancer Inst  2012; 104 (12): 931–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Yin Z, Tong J, Chen Y, et al.  A cost-effective chart review sampling design to account for phenotyping error in electronic health records (EHR) data. J Am Med Inform Assoc  2022; 29 (1): 52–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Chu H, Chen S, Louis TA.  Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard. J Am Stat Assoc  2009; 104 (486): 512–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Liu Y, Chen Y, Chu H.  A unification of models for meta-analysis of diagnostic accuracy studies without a gold standard. Biometrics  2015; 71 (2): 538–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Yu S, Ma Y, Gronsbell J, et al.  Enabling phenotypic big data with PheNorm. J Am Med Inform Assoc  2018; 25 (1): 54–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Ahuja Y, Zhou D, He Z, et al.  sureLDA: a multidisease automated phenotyping method for the electronic health record. J Am Med Inform Assoc  2020; 27 (8): 1235–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Yu S, Liao KP, Shaw SY, et al.  Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc  2015; 22 (5): 993–1000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Agarwal V, Podchiyska T, Banda JM, et al.  Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc  2016; 23 (6): 1166–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Yu S, Chakrabortty A, Liao KP, et al.  Surrogate-assisted feature extraction for high-throughput phenotyping. J Am Med Inform Assoc  2017; 24 (e1): e143–e149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Liao KP, Sun J, Cai TA, et al.  High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. J Am Med Inform Assoc  2019; 26 (11): 1255–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Zheng NS, Feng Q, Kerchberger VE, et al.  PheMap: a multi-resource knowledge base for high-throughput phenotyping within electronic health records. J Am Med Inform Assoc  2020; 27 (11): 1675–87. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
