Abstract
To use electronic health records (EHRs) to help design and conduct clinical trials, an essential first step is selecting eligible patients from the EHR, a task known as EHR phenotyping. We present two novel statistical methods that can be used in the context of EHR phenotyping. One mitigates the requirement for gold-standard control patients in developing phenotyping algorithms, and the other corrects for bias in downstream analysis introduced by study samples contaminated with ineligible subjects.
Keywords: Electronic health records, phenotyping, anchor variable, case contamination
Introduction
Electronic Health Records (EHRs) are expected to play an increasingly important role in clinical trials. EHRs can be used to identify a group of eligible patients for recruitment, to generate real-world evidence for comparative effectiveness, and to generate evidence to support the initiation of clinical trials. In order to draw valid conclusions using EHR data, EHR phenotyping is an essential first step in assuring that case and control groups are selected accurately. We present two methods that can be used in the context of EHR phenotyping. The first method is a maximum likelihood approach to phenotyping EHRs within an anchor variable framework,1 and the second method is an estimating equation approach to correcting for bias in downstream analysis introduced by study samples contaminated by ineligible subjects.2
An anchor variable approach to EHR phenotyping
To identify patients with a specific phenotype using EHR data, the standard practice is to develop a phenotyping algorithm using a large annotated training set that includes labeled cases and controls. For a range of diseases, it is challenging to identify a representative group of controls in EHRs because of under-diagnosis for various reasons. However, a subgroup of cases can often be identified with high accuracy by applying substantive knowledge (e.g. patients who were positive for a certain test are known to be cases). The readily accessible annotations then include an incomplete set of gold-standard cases but no gold-standard controls. Herein we propose a phenotyping method that efficiently and accurately leverages such incomplete labels, referred to as “positive-only” data. That is, a cohort sample with “positive-only” labels consists of a group of labeled cases and a large number of unlabeled patients, each of whom may be either a case or a control. Our method for analyzing “positive-only” data relies on the assumption that the labeled cases are representative of all cases in the EHR population.1 We summarize domain knowledge of case diagnosis by a binary “anchor variable”,3 which is highly informative of patients’ true case status. Such an anchor variable can be defined for many phenotypes; for example, it could be a pathologic diagnosis of cancer. Let Y denote the latent binary label for the phenotype (1: case; 0: control), X denote the predictor variables of Y, and S denote the binary anchor variable (1: positive; 0: negative). Here (X, Y, S) are considered random variables from which EHR patients, including both anchor-positive cases and unlabeled patients, are randomly drawn, with only (X, S) observed. The anchor variable is characterized by two properties: (1) perfect positive predictive value (PPV), i.e. p(Y = 1|S = 1) = 1, and (2) constant sensitivity, i.e. p(S = 1|Y = 1, X) = p(S = 1|Y = 1) ≡ c, where c is a constant between 0 and 1.
Owing to these two properties, the probability of phenotype presence and the probability of anchor positivity differ by the constant factor c: p(Y = 1|X) = p(S = 1|X)/c. Therefore, the likelihood for the observed data, (X, S), can be transformed into a function of the latent variable Y. Assuming a logistic regression working model,
logit p(Y = 1|X; β) = Xᵀβ,
the regression coefficients β and the anchor sensitivity c can be estimated simultaneously by maximizing the likelihood function. When there is concern that the anchor sensitivity may vary with covariates, our method can be easily extended to accommodate a parametric model for such variation.
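To make the estimation concrete: under the two anchor properties, the observed anchor follows p(S = 1|X) = c · expit(Xᵀβ), so β and c can be estimated jointly from (X, S) alone. The following is a minimal sketch on simulated data (the simulation setup, variable names, and optimizer choice are our own illustrative assumptions, not the original paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Simulate a "positive-only" cohort: Y is latent, only (X, S) is observed.
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, c_true = np.array([-1.0, 1.5]), 0.4
Y = rng.binomial(1, expit(X @ beta_true))
S = Y * rng.binomial(1, c_true, size=n)  # perfect PPV, constant sensitivity c

def neg_loglik(theta):
    beta, c = theta[:-1], expit(theta[-1])  # expit keeps c in (0, 1)
    q = np.clip(c * expit(X @ beta), 1e-12, 1 - 1e-12)  # p(S = 1 | X)
    return -np.sum(S * np.log(q) + (1 - S) * np.log(1 - q))

res = minimize(neg_loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
beta_hat, c_hat = res.x[:-1], expit(res.x[-1])
```

With a strong covariate signal, both the slope in β and the anchor sensitivity c are recovered from positive-only labels, without any gold-standard controls.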
We recognize that positive-only data also provide an efficient data source for model validation. To this end, we developed novel statistical methods for assessing calibration and quantifying prediction accuracy using “positive-only” data (Zhang L, Ma Y, Muthu N, et al. Testing calibration of risk prediction models using positive-only electronic health record data. submitted). Our method addresses the key challenge that the true case or control statuses of the unlabeled patients are unknown. Taking advantage of the two properties of the anchor variable, we proposed a model-free estimate of the number of cases among the unlabeled patients in each risk region defined a priori. Our calibration statistic was constructed by aggregating differences between the model-free and model-based estimates of the number of cases across risk regions; the statistic follows a chi-squared distribution under the null. Similarly, we proposed nonparametric estimators of predictive accuracy measures, such as positive predictive values and areas under the receiver operating characteristic curve, using positive-only data, and derived their asymptotic properties.
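The intuition behind the model-free estimate can be sketched briefly: within each risk region, the expected anchor-positive count equals c times the number of cases, so dividing the observed anchor-positive count by c yields a case count to compare against the model-based sum of predicted risks. A hedged toy example follows (our own simulation; the actual test statistic and its variance are derived in the submitted manuscript):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
n, c = 10000, 0.4
x = rng.normal(size=n)
p_true = expit(-1.0 + 1.5 * x)           # true risk p(Y = 1 | X)
y = rng.binomial(1, p_true)
s = y * rng.binomial(1, c, size=n)       # anchor: perfect PPV, sensitivity c

p_hat = p_true                           # stand-in for a fitted model's predictions
edges = np.quantile(p_hat, [0.25, 0.5, 0.75])
region = np.digitize(p_hat, edges)       # four a priori risk regions (quartiles)

model_based, model_free = [], []
for g in range(4):
    mask = region == g
    model_based.append(p_hat[mask].sum())  # model-based expected case count
    model_free.append(s[mask].sum() / c)   # anchor positives scaled by 1/c
```

For a well-calibrated model the two counts agree within sampling noise in every region; a calibration statistic aggregates the (standardized) discrepancies across regions.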
Accounting for phenotyping inaccuracy in downstream analysis
Although we strive to achieve the highest classification accuracy possible when assigning phenotypes to patients based on EHR data, phenotyping methods are not perfect and often result in mis-identification of study subjects. When identifying a group of individuals for a clinical trial, for example, the cohort may be contaminated by individuals who do not truly have the phenotype, but share some clinical factors that are consistent with the phenotype definition. Such mis-identification results in “contaminated” study samples, and failure to account for the contamination can ultimately lead to biased results in downstream analysis. Below, we outline a solution using estimating equations that we proposed to correct the bias introduced by case contamination in the context of an EHR-based case-control study.2 Our framework can be easily extended to the setting of a cohort study.
We assumed that controls can be identified accurately, but that the case group is contaminated with ineligible individuals. We refer to this contaminating group as “non-cases” because they do not satisfy the case definition, but also do not meet all criteria of the control definition. Let D denote the true phenotype status, where D = 0 denotes a genuine control, D = 1 a genuine case, and D = 2 a non-case. The goal of the analysis is to model the association between D and a vector of covariates, X. This would be straightforward if there were an easy way to distinguish the genuine cases from the non-cases, but in an EHR setting this is often difficult.
Because of resource constraints, it is typically feasible to validate case versus non-case status only for a small subsample of patients, usually 100 to 400 individuals, via manual chart review. Let V denote the validated status of individuals in the validation subset, where V = 1 denotes a true case and V = 0 a non-case. Let Z be a collection of variables that can help distinguish cases from non-cases; Z may or may not overlap with X. We used the validation subset, where the true label is known, to build a model for P(V = 1|Z), referred to as the “phenotyping model”. This model was then used to predict the probability of being a case, P(V = 1|Z), for the non-validated individuals in the case pool.
Finally, to estimate the association between D and X using all patients, each patient in the case pool was weighted by an individualized factor: the validated status V if the patient was selected for validation, and the predicted probability of being a case, P(V = 1|Z), otherwise. We showed that given a valid model for P(V = 1|Z), the final estimates obtained using our method are consistent.
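Putting the steps together, the pipeline can be sketched as follows. This is a toy simulation of our own devising; the contamination mechanism, the variable names, and the use of off-the-shelf weighted logistic regression are illustrative assumptions, not the paper's exact estimating equations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Simulate controls (D=0), genuine cases (D=1), and non-cases (D=2).
n0, n1, n2 = 2000, 1500, 500
x = np.concatenate([rng.normal(0, 1, n0), rng.normal(1, 1, n1), rng.normal(0, 1, n2)])
d = np.concatenate([np.zeros(n0), np.ones(n1), 2 * np.ones(n2)])
pool = d > 0                               # contaminated "case" pool (cases + non-cases)
z = (d == 1) + rng.normal(0, 0.5, x.size)  # variable separating cases from non-cases

# Step 1: validate a small subsample of the pool by manual chart review.
val_idx = rng.choice(np.flatnonzero(pool), size=300, replace=False)
v = (d[val_idx] == 1).astype(int)

# Step 2: phenotyping model P(V=1|Z) fit on the validated subset.
pheno = LogisticRegression().fit(z[val_idx].reshape(-1, 1), v)

# Step 3: weights -- V if validated, predicted P(V=1|Z) otherwise; controls get 1.
w = np.ones(x.size)
w[pool] = pheno.predict_proba(z[pool].reshape(-1, 1))[:, 1]
w[val_idx] = v

# Step 4: weighted regression of case status on X corrects the contamination bias.
corrected = LogisticRegression().fit(x.reshape(-1, 1), pool.astype(int), sample_weight=w)

# For comparison: a naive fit ignoring contamination, and an oracle fit that
# uses the (normally unknown) true labels with non-cases excluded.
naive = LogisticRegression().fit(x.reshape(-1, 1), pool.astype(int))
oracle = LogisticRegression().fit(x[d != 2].reshape(-1, 1), (d[d != 2] == 1).astype(int))
```

In this setup the naive coefficient is attenuated toward zero by the non-cases, while the weighted fit lands close to the oracle that knows the true labels.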
Acknowledgements
This is a summary of a presentation made at the University of Pennsylvania’s 12th Conference on Statistical Issues in Clinical Trials - Electronic Health Records (EHR) in Randomized Clinical Trials: Challenges and Opportunities. The material is authored by the current authors and is under review at or accepted by Biometrics, the Journal of the American Medical Informatics Association, and Biostatistics.
Funding
The research in the cited work was supported by the US National Institutes of Health (R01-HL138306, R01-CA236468, R01-CA207365) as indicated in the original manuscripts.
Footnotes
Declaration of conflicting interests
None declared.
References
- 1. Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach for electronic health record phenotyping using positive and unlabeled patients. J Am Med Inform Assoc 2020; 27(1): 119–126.
- 2. Wang L, Schnall J, Small A, et al. Case contamination in electronic health records-based case-control studies. Biometrics. Epub ahead of print 4 April 2020. DOI: 10.1111/biom.13264.
- 3. Halpern Y, Choi Y, Horng S, et al. Using anchors to estimate clinical state without labeled data. AMIA Annu Symp Proc 2014; 2014: 606–615.
