BioData Mining. 2025 Jun 11;18:39. doi: 10.1186/s13040-025-00454-9

A probabilistic approach for building disease phenotypes across electronic health records

David Vidmar 1, Jessica De Freitas 1, Will Thompson 1, John M Pfeifer 1, Brandon K Fornwalt 1, Noah Zimmerman 1, Riccardo Miotto 1, Ruijun Chen 1
PMCID: PMC12153169  PMID: 40500747

Abstract

Background

Identifying the set of patients with a particular disease diagnosis across electronic health records (EHRs), referred to as a phenotype, is an important step in clinical research and applications. However, this task is often challenging, as incomplete data can render definitive classification impossible. We propose a probabilistic approach to phenotyping based on Bayesian inference that does not require gold-standard labels. In this paper, we develop multiple heuristic "labeling functions" (LFs) for four diseases across de-identified EHR data and aggregate their votes through a majority vote approach (MV), a popular open-source approach (Snorkel OSS), and our proposed probabilistic approach (LEVI). We compare the resulting phenotypes to those built using expert-curated logic from the literature, as well as an off-the-shelf natural language processing pipeline (Medspacy), using a curated sample of physician-reviewed labels for evaluation.

Results

Phenotypes built using LFs perform better than off-the-shelf alternatives on classification performance (F1 scores of 0.79–0.82 vs. expert-logic: 0.68, Medspacy: 0.55). Compared to output scores from Snorkel OSS, LEVI provides better probabilistic performance (expected calibration error of 0.04 vs. 0.12), ROC AUC estimates (interval score [loss] of 0.03 vs. 0.10), and operating point selection (equal-cost net benefit of 0.18 vs. 0.15).

Conclusions

For challenging disease states, phenotyping using probabilities rather than binary classification can lead to improved and more personalized downstream decision-making. Probabilistic phenotypes built using LEVI exhibit low calibration error without the need for labels, allowing for better risk-benefit tradeoffs.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13040-025-00454-9.

Keywords: EHR phenotyping, Weak supervision, Bayesian inference, Model calibration, Decision theory

Background

The task of disease phenotyping involves searching through electronic health records (EHRs) to identify all patients with a particular diagnosis. The resulting set of patients is referred to as a phenotype, with high-fidelity phenotypes [1] enabling various important downstream objectives such as inferring effect sizes [2], building risk prediction models [3], and identifying care gaps [4]. Low-fidelity phenotypes, on the other hand, can have significant adverse effects on subsequent analyses, potentially leading to biased or improper inferences [5, 6].

Phenotyping is traditionally framed as a problem of binary classification. A common approach involves eliciting custom logic directly from physicians with knowledge of the coding practice at a particular hospital system [7]. This is usually implemented as a query across various tables in the database holding the EHRs. Another approach looks only at the clinical notes, focusing on extracting disease mentions and classifying them as either incidental or positive [8]. Recently, machine learning approaches have also become popular [9]. Some of these approaches require only a few seed concepts as input for unsupervised phenotype definitions [10], whereas others require a comprehensive set of ground truth labels to train phenotype classification models [11].

In tasks where large-scale human annotation is expensive, such as phenotyping, aggregating weaker sources of information in lieu of gold standard labels shows great promise [12–14]. A common approach to this problem incorporates multiple independent heuristics, often called “labeling functions” (LFs), which are applied to unlabeled data through voting [15]. Different means of aggregating the outputs of these heuristics exist in the literature, such as majority vote [16], Gibbs sampling [17], and Markov random fields [18]. This approach has been shown to yield effective results across diverse tasks such as spam filtering [19], named entity recognition [20], and extracting chemical reactions from text [21]. However, model training with these approaches is usually treated as an optimization problem without regard to predictive characteristics such as model calibration or strictly proper scoring rules [22].

A significant challenge in working with EHRs is that they suffer from incompleteness [23–25], providing a limited view of all the care a patient received. As a consequence, scenarios such as ambiguous references to out-of-site care, contradictory findings from different providers, and limited available follow-up data can result in cases where it is difficult to determine whether a diagnosis was made. Binarizing these highly uncertain cases as either “inside” or “outside” of a phenotype, rather than representing confidence through a probability, prematurely throws away information that could be useful for downstream decision-making [26]. Specifically, any use of a phenotype should be informed by our certainty of diagnosis combined with the particular costs of false positives and false negatives. In using a phenotype to search for care gaps, for example, we may be trading off in-demand physician time against the potential benefit of discovering a true care gap.

A fully probabilistic phenotype model, where the output is a well-calibrated probability rather than a binary classification, should help overcome these limitations. We hypothesize that using diagnosis probabilities, derived directly from the statistics of LF aggregation with informed priors, will result in improved calibration and can be leveraged for better performance when applied to specific cost-benefit tradeoffs likely to be encountered in practice. To this end, we propose label estimation via inference (LEVI), a simple Bayesian model which assigns label probabilities from LF votes without the need for any gold standard labels. We show that probabilistic phenotypes built using LEVI exhibit both better classification performance than traditional baselines and better calibration characteristics than a commonly-used method of LF aggregation. Further, we show that this improves downstream uses of such phenotypes, in the form of better coverage properties of a relevant estimation problem and better operating point selection in a risk-benefit analysis.

Methods

Study design

In this study we use de-identified EHR data from a regional health system in the Tempus Database (Tempus AI, Inc., Chicago, IL), consisting of 1,817,294 patients. This data is split at the patient level into separate training and validation sets using a 90% / 10% split. All iterative development of LFs is done using only patients sourced from the training set, whereas all chart reviews are done using only patients sourced from the validation set. We focus on phenotyping four different diseases in this study: interstitial lung disease (ILD), heart failure (HF), hypertrophic cardiomyopathy (HCM), and chronic obstructive pulmonary disease (COPD). These were chosen to represent a diverse set of diseases across dimensions of complexity and prevalence.

An overview of our study design is shown in Fig. 1. Because most diseases have low prevalence across all patients in the EHRs, for each disease state we first filter our population down to disease-specific candidates [27]. These candidates include all patients who have at least one piece of evidence suggestive of the specified disease across relevant EHR fields such as diagnosis codes, medications, and note text. Regular expressions (RegEx) were used to pattern-match all synonyms and related terms present in the note text. The comprehensive list of relevant evidence for each disease was elicited directly from our expert physicians and was developed to be as broad as possible, so that only patients with no identifiable evidence of the specified disease in their records are excluded. Importantly, at this stage we are not attempting the harder problem of classifying whether a patient was or was not actually diagnosed. This step, therefore, increases disease prevalence in the candidate population at low risk of excluding diseased patients. In total, this filtering leads to a candidate population of 503,744 patients with 239,176 males, 264,469 females, and 99 undeclared.
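As an illustration of this filtering step, the sketch below shows how such a candidate filter might be expressed with regular expressions. The term list, diagnosis-code prefix, and patient record schema are hypothetical examples rather than the study's actual criteria.

```python
import re

# Hypothetical synonym list for one disease (ILD); the study's actual term
# lists were curated by expert physicians and were intentionally broad.
ILD_TERMS = [
    r"interstitial lung disease",
    r"\bILD\b",
    r"pulmonary fibrosis",
    r"\bIPF\b",
]
ILD_PATTERN = re.compile("|".join(ILD_TERMS), flags=re.IGNORECASE)


def is_candidate(patient):
    """Return True if any note or structured field shows evidence of the disease.

    `patient` is assumed to be a dict with 'notes' (list of strings) and
    'diagnosis_codes' (list of ICD-10 codes); this schema is illustrative only.
    """
    has_code = any(code.startswith("J84") for code in patient["diagnosis_codes"])
    has_mention = any(ILD_PATTERN.search(note) for note in patient["notes"])
    return has_code or has_mention
```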

Fig. 1

Overview of phenotyping approaches evaluated in this study. All approaches begin with a disease-specific filter to remove patients with no relevant evidence present. (A) Expert-curated logic rules are used to query data and build the phenotype. (B) An off-the-shelf natural language processing (NLP) model labels all disease mentions as positive or incidental (family history, negation, etc.). Any positive mention results in inclusion into the phenotype. (C) A set of labeling functions (LFs) vote on each patient’s inclusion or exclusion into the phenotype. Aggregation is applied to determine the final phenotype

Phenotyping approaches

For each disease state, we build phenotypes from our candidate population using three different approaches:

  1. Expert Logic: We build a query implementing expert-curated logic across EHR fields. This logic is taken from phenotypes published and validated in the literature [28–30] or deposited in PheKB, the Phenotype Knowledgebase [31]. PheKB is a tool for building, validating, and sharing algorithms that use structured or unstructured EHR data [32].

  2. Medspacy: We use an open-source library developed for completing natural language processing (NLP) tasks on clinical text. Each disease mention present in a patient’s record is analyzed to determine whether it is either negated, uncertain, hypothetical, or family-related. The phenotype then comprises all patients who have at least one “positive” disease mention, i.e. one which does not fall into any of those categories.

  3. Labeling Function Ensemble: We iteratively develop a set of positive and negative LFs by reviewing patient records and ideating on patterns that indicate either a positive or negative label, combined with spot-checking accuracy on new samples and consulting clinicians as needed. Such LFs can apply to any modality of data, e.g. the presence of a particular medication in structured records or the mention of a disease in a numbered list in the note text (a minimal sketch of such LFs is shown below). This set of LFs is then applied to the population of interest. Finally, we use three different methods of aggregating votes: majority vote (with ties defaulting to negative), Snorkel OSS [33] (a popular open-source unsupervised modeling technique), and our proposed label estimation via inference (LEVI; described in the next section).
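To make the shape of these LFs concrete, the sketch below shows what a few heart failure LFs might look like. The patterns, record schema, and medication list are illustrative inventions, not the LFs used in the study; votes follow the common weak-supervision convention of 1 for a positive vote, 0 for a negative vote, and -1 for abstaining.

```python
import re


def lf_hf_problem_list(patient):
    """Positive LF: heart failure appears as a numbered problem-list entry.

    Returns 1 (positive vote) or -1 (abstain); schema and pattern are illustrative.
    """
    pattern = re.compile(r"^\s*\d+\.\s.*heart failure", re.IGNORECASE | re.MULTILINE)
    return 1 if any(pattern.search(note) for note in patient["notes"]) else -1


def lf_hf_ruled_out(patient):
    """Negative LF: an explicit 'no evidence of heart failure' style statement.

    Returns 0 (negative vote) or -1 (abstain).
    """
    pattern = re.compile(r"no (evidence|signs?) of (congestive )?heart failure", re.IGNORECASE)
    return 0 if any(pattern.search(note) for note in patient["notes"]) else -1


def lf_hf_medication(patient):
    """Positive LF: presence of a heart failure medication (illustrative list only)."""
    hf_meds = {"sacubitril-valsartan", "carvedilol", "furosemide"}
    return 1 if hf_meds & {m.lower() for m in patient.get("medications", [])} else -1
```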

Label Estimation via Inference (LEVI)

Our data consists of patients with an (unknown) label $y \in \{0, 1\}$ corresponding to whether or not they have received a positive diagnosis. We have access to $K$ labeling functions, each of which provides a vote $\lambda_i \in \{0, 1\}$ for a given patient (1 if the LF fires, 0 otherwise). We define $\pi$ as the (unknown) disease prevalence and $\theta_i$ and $\phi_i$ as the (unknown) true positive rate and false positive rate of the $i$th labeling function.

Our task, then, corresponds to determining the posterior label probability $P(y=1 \mid \boldsymbol{\lambda})$ for a given patient, where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)$ collects the LF votes. If we parametrize our priors for $\pi$, $\theta_i$, and $\phi_i$ in conjugate form, as Beta distributions, we can solve for this posterior label probability in closed form (see Supplemental section S1 for detailed derivation). We find that

$$P(y=1 \mid \boldsymbol{\lambda}) = \sigma\!\left( \log\frac{\bar{\pi}}{1-\bar{\pi}} + \sum_{i=1}^{K}\left[ \lambda_i \log\frac{\bar{\theta}_i}{\bar{\phi}_i} + (1-\lambda_i)\log\frac{1-\bar{\theta}_i}{1-\bar{\phi}_i} \right] \right) \qquad (1)$$

where $\sigma(\cdot)$ is the logistic function and $\bar{\pi}$, $\bar{\theta}_i$, and $\bar{\phi}_i$ denote the prior means of the prevalence and of each LF's true and false positive rates.

We see that this posterior takes the form of a logistic model, where our weights have been derived directly from the rules of probability theory via our priors on prevalence, true positive rates, and false positive rates, rather than fitted on labeled data (which we do not have access to).
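The sketch below is a minimal implementation of Eq. 1 as written above, assuming each LF's vote is encoded as 1 if it fired and 0 otherwise (a Snorkel-style vote matrix L can be converted with `L != -1`) and that prior means for prevalence and each LF's rates are supplied. Names, signatures, and the example prior means are illustrative; the authors' reference implementation is available in the repository linked under Data availability.

```python
import numpy as np


def levi_posterior(votes, theta_bar, phi_bar, pi_bar):
    """Closed-form posterior P(y=1 | votes) from Eq. 1 (illustrative sketch).

    votes     : (n_patients, n_lfs) array; 1 if the LF fired for the patient, 0 otherwise
                (for a Snorkel-style matrix L with -1 = abstain, use votes = (L != -1)).
    theta_bar : (n_lfs,) prior-mean rate of firing on diseased patients, P(fire | y=1).
    phi_bar   : (n_lfs,) prior-mean rate of firing on non-diseased patients, P(fire | y=0).
    pi_bar    : prior-mean disease prevalence.

    Negative LFs are handled through their inverted priors, so theta_bar and
    phi_bar already reflect each LF's polarity.
    """
    votes = np.asarray(votes, dtype=float)
    theta_bar, phi_bar = np.asarray(theta_bar), np.asarray(phi_bar)
    w = np.log(theta_bar / phi_bar) - np.log((1 - theta_bar) / (1 - phi_bar))    # per-LF weight
    w0 = np.log(pi_bar / (1 - pi_bar)) + np.sum(np.log((1 - theta_bar) / (1 - phi_bar)))
    return 1.0 / (1.0 + np.exp(-(w0 + votes @ w)))                               # logistic function


# Example with three LFs and hypothetical prior means; 0.22 is the prevalence
# prior mean used later when comparing against Snorkel OSS.
probs = levi_posterior(votes=[[1, 0, 0], [1, 1, 0]],
                       theta_bar=[0.15, 0.10, 0.12],
                       phi_bar=[0.01, 0.02, 0.01],
                       pi_bar=0.22)
```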

Prior selection

In order to use Eq. 1, we need to define priors for prevalence and for the true / false positive rates of our LFs. For simplicity we assume the same priors for the performance of each positive LF, denoted $\theta_i \sim \mathrm{Beta}(\alpha_{\theta}, \beta_{\theta})$ and $\phi_i \sim \mathrm{Beta}(\alpha_{\phi}, \beta_{\phi})$. We retain the same values for negative LFs, but since these votes indicate a negative label the priors are inverted: a negative LF's rate of firing on diseased patients takes the prior $\mathrm{Beta}(\alpha_{\phi}, \beta_{\phi})$, and its rate of firing on non-diseased patients takes the prior $\mathrm{Beta}(\alpha_{\theta}, \beta_{\theta})$. Without loss of generality, then, we will refer only to priors on the positive LFs for the remainder of the paper.

To encode our priors into quantitative values, we utilize the principle of Maximum Entropy [34, 35]. In brief, this principle states that we want to identify the prior distribution consistent with our knowledge which is most non-committal. For this work, we find it convenient to describe our prior information in the form of relative “upper bounds” on a variable’s plausible value. As such, we frame prior specification as the following maximization problem:

$$\max_{\alpha, \beta} \; H\!\left[\mathrm{Beta}(\alpha, \beta)\right] \quad \text{subject to} \quad I_{u}(\alpha, \beta) = 0.95 \qquad (2)$$

where $u$ represents our upper bound value, $H$ stands for entropy, and $I_{u}(\alpha, \beta)$ stands for the regularized incomplete beta function evaluated at $u$. This specification ensures that we select a prior distribution where 95% of probability mass lies below a given upper bound $u$ and which is maximally non-committal otherwise.

To select these upper bounds, we note that the widespread success of majority vote aggregation [27] implies that it is not reasonable to assign equal plausibility to all levels of LF performance. In reality, during the process of building LFs we have direct access to spot-checking their precision on new samples. We posit that this process reliably results in LFs which cover relatively small, but precise, subgroups of the true target population. We therefore have strong priors that the false positive rates of positive LFs are very small, and, to a lesser extent, that their true positive rates are relatively small. Moreover, we find that due to the many incidental mentions occurring in note text it is challenging to identify a candidate population with high prevalence. In this work, therefore, we place correspondingly small upper bounds on the false positive rates and true positive rates of the positive LFs and a moderate upper bound on prevalence, and solve Eq. 2 with these values to obtain the Beta prior parameters used throughout this work.
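A small sketch of how the maximization in Eq. 2 can be solved numerically with off-the-shelf tools is given below; the 0.05 upper bound in the usage line is a hypothetical value for illustration, not necessarily a bound used in the study.

```python
import numpy as np
from scipy import optimize
from scipy.stats import beta


def max_entropy_beta(upper_bound, mass=0.95):
    """Beta(a, b) prior with maximum differential entropy, subject to `mass`
    of its probability lying below `upper_bound` (the constraint in Eq. 2)."""

    def neg_entropy(log_params):
        a, b = np.exp(log_params)          # log-parametrization keeps a, b > 0
        return -beta.entropy(a, b)

    def tail_constraint(log_params):
        a, b = np.exp(log_params)
        # beta.cdf is the regularized incomplete beta function I_u(a, b)
        return beta.cdf(upper_bound, a, b) - mass

    result = optimize.minimize(
        neg_entropy,
        x0=np.zeros(2),                    # start at the uniform Beta(1, 1)
        constraints={"type": "eq", "fun": tail_constraint},
        method="SLSQP",
    )
    return np.exp(result.x)


# Example: a prior whose 95th percentile sits at a hypothetical bound of 0.05
a_fpr, b_fpr = max_entropy_beta(0.05)
```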

We empirically examine the sensitivity of our inferences to these priors and find them to be robust to reasonable changes (Supplemental Fig. 1).

Ranking performance estimation

We can leverage the probabilistic nature of Eq. 1 to derive inferences about the performance of our phenotype in a given population without the need for any labeled data. These inferences about performance can be especially useful during the process of building LFs, as they give a sense of how confident we can be that a current set of LFs is performing well. If we infer that performance is acceptable after building a handful of LFs, for example, we can consider the phenotype complete and avoid further effort that would likely yield only diminishing returns.

In particular, we are often interested in the ranking performance of a given phenotype as measured by the area under the receiver operating characteristic curve (ROC AUC). Let the label probabilities across our population of $N$ patients be denoted $q_1, \ldots, q_N$, where $q_j = P(y_j = 1 \mid \boldsymbol{\lambda}_j)$. Given access only to these label probabilities, we can solve for a suitable approximation of our posterior inference about the value of ROC AUC in closed form (see Supplemental section S2 for derivation). This is given by a normal distribution whose mean (Eq. 3) and variance (Eq. 4) are closed-form functions of the label probabilities $q_j$ and of each patient's rank $r_j$, i.e. the rank of $q_j$ with respect to all label probabilities across the population. This posterior can then be summarized by a credible interval (CI) of any desired width.

Operating point selection

Most downstream use-cases of phenotyping require more than just the patient-level probability $P(y=1 \mid \boldsymbol{\lambda})$. Instead, they require a decision to be made about which patients are or are not included in a phenotype, and can therefore be framed as a problem of decision theory. In particular, assume we can decide upon a suitable “loss function” $L(a, y)$ for each patient, which denotes the cost associated with taking an action $a$ when the ground truth label turns out to be $y$. In our case, this action is whether or not to include a patient in the given phenotype.

For phenotyping in particular, we can usually frame the problem in terms of costs associated with false positives (unnecessary workups, administrative overheads) and false negatives (opportunity costs of missed diagnoses), and can therefore use a loss function with just two terms: $C_{FP} = L(a=1, y=0)$ and $C_{FN} = L(a=0, y=1)$. It can be shown (see Supplemental section S3) that the Bayes optimal decision rule in this case is that a patient should be included in the phenotype only if

$$P(y=1 \mid \boldsymbol{\lambda}) > \frac{C_{FP}}{C_{FP} + C_{FN}} \qquad (5)$$

This rule, therefore, effectively sets an “operating point” of $C_{FP} / (C_{FP} + C_{FN})$ for our output score, dependent on the cost ratio specific to a given use case.
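Applying Eq. 5 amounts to a simple threshold on the posterior probabilities, as in the sketch below; the cost values shown are placeholders for whatever a particular use case dictates.

```python
import numpy as np


def include_in_phenotype(posterior_probs, cost_fp, cost_fn):
    """Bayes-optimal inclusion rule from Eq. 5: include a patient whenever
    P(y=1 | votes) exceeds C_FP / (C_FP + C_FN)."""
    threshold = cost_fp / (cost_fp + cost_fn)
    return np.asarray(posterior_probs) > threshold


# Example with placeholder costs: false negatives judged twice as costly,
# giving an operating point of 1/3.
decisions = include_in_phenotype([0.1, 0.4, 0.9], cost_fp=1.0, cost_fn=2.0)
```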

Chart review evaluation

To evaluate the performance of different phenotyping approaches we select 200 random patients per disease from the corresponding candidate pool for chart review adjudication, resulting in a total of 800 patients with ground truth labels and an overall prevalence of 26%. Since we are concerned with the general performance of these approaches across all phenotypes, we report results here using a “pooled cohort” including all 800 chart reviews and the corresponding phenotype-specific predictions. Results at the phenotype level are provided in Supplemental Table S1, and a sensitivity analysis related to the number of LFs is shown in Supplemental Figure S3. To enable the fairest comparison between approaches, for prediction scores from Snorkel OSS we set the “class_balance” parameter equal to 0.22. This was chosen as the mean value of the prior distribution used for LEVI.
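For reference, obtaining comparable Snorkel OSS scores might look like the sketch below, assuming the snorkel 0.9-style LabelModel API; the label matrix here is placeholder data rather than the study's vote matrix.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# L is an (n_patients, n_lfs) label matrix with entries in {-1, 0, 1},
# where -1 denotes an abstaining labeling function (placeholder values here).
L = np.array([
    [1, -1, 0],
    [1, 1, -1],
    [0, -1, -1],
    [-1, 1, 1],
    [0, 0, -1],
    [1, -1, 1],
])

label_model = LabelModel(cardinality=2, verbose=False)
# class_balance fixes [P(y=0), P(y=1)]; 0.22 matches the mean of LEVI's prevalence prior
label_model.fit(L, class_balance=[0.78, 0.22], seed=0)
snorkel_scores = label_model.predict_proba(L)[:, 1]   # P(y=1) per patient
```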

To test the coverage properties of ROC AUC estimation using Snorkel OSS and LEVI scores, we bootstrap-resample the pooled cohort to create 1000 trials. For every trial we estimate credible intervals (CIs) for ROC AUC using Eqs. 3 and 4. We then compute the actual ROC AUC value using the chart-reviewed labels and determine whether each CI covers (i.e., contains) the true value at CI widths ranging from 0 to 100%. Finally, we compute the coverage at each CI width as the fraction of trials whose CI covers the true value. The closer the resulting curve is to the diagonal, the better coverage properties that approach has.
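A sketch of this coverage experiment is shown below; `levi_auc_ci` stands in for a function implementing Eqs. 3 and 4 (not reproduced here), and `scores` and `labels` are placeholders for the pooled-cohort predictions and chart-review labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)


def coverage_curve(scores, labels, levi_auc_ci, widths, n_trials=1000):
    """Fraction of bootstrap trials whose credible interval contains the
    chart-review ROC AUC, evaluated at each requested CI width."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    covered = np.zeros(len(widths))
    for _ in range(n_trials):
        idx = rng.integers(0, n, size=n)                     # bootstrap resample
        true_auc = roc_auc_score(labels[idx], scores[idx])   # ROC AUC from chart-review labels
        for k, w in enumerate(widths):
            lo, hi = levi_auc_ci(scores[idx], width=w)       # CI from scores alone (Eqs. 3 and 4)
            covered[k] += (lo <= true_auc <= hi)
    return covered / n_trials
```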

Finally, we compare the performance of all approaches in a cost-benefit analysis using a popular method to visualize net benefit in clinical settings [36]. Since LEVI and Snorkel OSS provide score outputs, we use Eq. 5 to select the optimal operating point for each phenotype based on the cost-benefit ratio, whereas all other approaches provide binary classifications by default.
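As a sketch of how this analysis can be computed, the function below assumes the standard decision-curve formulation of net benefit [36], evaluated at the operating point implied by Eq. 5, with the cost ratio parameterized as C_FP / C_FN; the exact weighting used in Fig. 4 may be parameterized differently, and the variable names are placeholders.

```python
import numpy as np


def net_benefit(scores, labels, cost_ratio):
    """Per-patient net benefit of including patients whose score exceeds the
    Eq. 5 threshold, where a false positive costs `cost_ratio` (= C_FP / C_FN)
    times the benefit of a true positive (standard decision-curve weighting)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    threshold = cost_ratio / (1.0 + cost_ratio)   # Eq. 5 with C_FN normalized to 1
    included = scores > threshold
    tp = np.sum(included & (labels == 1))
    fp = np.sum(included & (labels == 0))
    return (tp - cost_ratio * fp) / len(labels)
```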

Results

Classification performance

The phenotype classification performance of each approach was compared across the pooled cohort (Table 1), with Snorkel OSS and LEVI using a default operating point of 0.5. We see that precision and recall values vary widely across different approaches, with phenotypes built using LFs outperforming the other approaches. In particular, the best performing phenotypes, as judged by balancing precision and recall, are from LFs aggregated using majority vote (F1 = 0.82) and LEVI (F1 = 0.82). When LFs are aggregated using Snorkel OSS, we see a slight drop in performance (F1 = 0.79), which could be due to the level of class imbalance in our data. This aligns with other groups reporting similar challenges in reliably improving upon the classification performance of majority vote with related methods [37].

Table 1.

Performance comparison of different phenotyping approaches across a pooled cohort including all 4 diseases, with 95% bootstrap confidence intervals. Labeling function (LF) ensemble approaches provide better overall classification performance, with aggregation using Label Estimation via Inference (LEVI) providing the best probabilistic performance. Bolded values represent the highest performing model for each metric. ROC AUC: receiver operating characteristic area under the curve

                   Classification metrics        Scoring metrics
 Approach          Precision  Recall  F1         ROC AUC  Brier  ECE
 Medspacy                              0.55      -        -      -
 Expert Logic                          0.68      -        -      -
 LF Ensemble
   Majority Vote                       0.82      -        -      -
   Snorkel OSS                         0.79      0.92     0.10   0.12
   LEVI                                0.82      0.94     0.07   0.04

Score-based performance

We tested the performance of Snorkel OSS and LEVI model scores across various metrics on the pooled cohort (Table 1). We see that LEVI outperforms Snorkel OSS in terms of ROC AUC (LEVI = 0.94 vs. Snorkel OSS = 0.92). Moreover, LEVI shows improved probabilistic performance as judged by Brier Loss (LEVI = 0.07 vs. Snorkel OSS = 0.10), a strictly proper scoring rule which is affected by model calibration, and expected calibration error (ECE; LEVI = 0.04 vs. Snorkel OSS = 0.12) [22, 38].

We see clear evidence for LEVI’s improved calibration in Fig. 2, which shows calibration curves across the pooled cohort for each approach. Both approaches are seen to be “under-calibrated”, as evidenced by the curves lying below the diagonal, but LEVI is relatively close to the diagonal as compared to Snorkel OSS. This means that the score outputs from LEVI exhibit reasonable calibration properties without access to any labels or post-hoc calibration methods (such as temperature scaling), which would require labels.
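For readers reproducing Fig. 2 and the ECE values in Table 1, the calibration curve and expected calibration error can be computed along the lines of the sketch below; the bin count and the placeholder data are illustrative choices.

```python
import numpy as np
from sklearn.calibration import calibration_curve


def expected_calibration_error(labels, scores, n_bins=10):
    """ECE: average gap between predicted probability and observed frequency,
    weighted by the fraction of patients falling in each probability bin."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    which_bin = np.clip(np.digitize(scores, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = which_bin == b
        if mask.any():
            gap = abs(scores[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece


# Placeholder chart-review labels and model scores for illustration
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9])
frac_positive, mean_predicted = calibration_curve(labels, scores, n_bins=3)
print(expected_calibration_error(labels, scores, n_bins=3))
```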

Fig. 2

Calibration curve for different labeling function (LF) aggregation methods. Label estimation via inference (LEVI) shows better calibration than Snorkel OSS visually, as judged by proximity to the diagonal

Estimated ROC AUC coverage

Figure 3 shows that LEVI coverage remains relatively close to the diagonal, providing better coverage properties for ROC AUC estimation than Snorkel OSS. For example, at 95% confidence the ROC AUC estimates from LEVI provide 91% actual coverage compared to Snorkel OSS's 39%. While LEVI's intervals still slightly undercover, they provide impressive coverage properties for an approach which requires no labeled data.

Fig. 3

Coverage plot of receiver operating characteristic area under the curve (ROC AUC) estimation across bootstrap-resampled labeled data. Both methods undercover the true value, meaning fewer CIs contained the actual ROC AUC value than the nominal confidence level would imply. Label estimation via inference (LEVI), however, exhibits significantly better coverage properties than Snorkel OSS, as judged by proximity to the diagonal

While coverage properties are important, it is also important for CIs to be as narrow as possible for inferences to be useful. A common metric used to evaluate an estimator’s overall performance is given by the negatively-oriented interval score [39]. This score, where lower values are better, balances CI width and coverage properties to provide a judgment of overall performance. We find that LEVI has a better interval score compared to Snorkel OSS (0.03 vs. 0.10), implying better ROC AUC inferences.

Cost-benefit analysis

Figure 4 shows that LEVI provides the largest net benefit of all approaches for the vast majority of potential cost ratios, only coming in second to majority vote for a small range around a cost ratio of 0.2. Importantly, LEVI provides positive net benefit for the largest range of potential cost ratios, only failing to provide benefit when the cost of false positives is very high. This slight dip into negative benefit is due to the slight under-calibration observed in Fig. 2, but it is corrected as the cost ratio tends toward 1 and the optimal threshold becomes equivalent to predicting all patients as negative. Snorkel OSS, due in part to its poor calibration, consistently provides lower net benefit, with negative benefit over a larger range of cost ratios.

Fig. 4

The cost-sensitive net benefit for each phenotyping approach (y axis) across a range of potential false positive versus false negative cost ratios (x axis). The cost ratio of interest is defined by subject-matter experts with consideration to the particular use-case, with higher net benefit representing better performance at the specified cost ratio. Label estimation via inference (LEVI) provides the highest net benefit at most cost ratios, and only fails to beat the baseline for a narrow range where false positives cost significantly more than false negatives. CFP: Cost of false positives, CFN: Cost of false negatives

Discussion

We have shown that phenotypes built through the aggregation of multiple independent labeling functions can outperform traditional approaches across different disease states, without the need for gold standard labels for training. Building a set of LFs for a phenotype can be done in a quick feedback loop and iterated on rapidly and collaboratively after reviewing data and LF votes. This rapid iteration loop is particularly beneficial for phenotyping work, where it is not uncommon for the target disease definition to change slightly over the course of a study, such as narrowing from a broad disease state (e.g., “stroke”) to a more granular one (e.g., “hemorrhagic stroke”). In the LF-based framework, such changes can be handled through corresponding tweaks to the set of LF rules. This also extends to situations where we want to apply LFs to different hospital sites, allowing modifications of LF rules to account for differences observed in the new data that were not present in the original data.

Another advantage of LF-based approaches compared to more involved text processing approaches is scalability. Many LFs can be expressed as simple RegEx patterns, which can be composed into database queries and run efficiently at scale across many records. This is advantageous for phenotyping, since we often want to build phenotypes across the entire EHR. This can span many millions of records, so any approach to automated phenotyping must take compute time into consideration. Moreover, for LEVI in particular we can also aggregate the resulting LF votes into an output score within this same query since, unlike Snorkel OSS, scores are determined in closed form through Eq. 1.

Beyond showing the overall advantages of LF-based approaches, we show that the particular method of LF aggregation has a meaningful impact on the performance and interpretation of the output scores. In particular, we show that the output scores from LEVI can be treated as probabilities with reasonable calibration properties. This makes sense, since these scores are derived directly from a probabilistic model and therefore represent genuine Bayesian probabilities. Given that the outputs incorporate no labeled data, we believe this level of calibration is notable. This is in contrast to scores from Snorkel OSS, for example, which are poorly calibrated despite showing relatively similar classification performance. Moreover, since LEVI is based on a simple Bayesian model, it is easy to adjust the prevalence prior for different populations when such information is available. This is particularly useful for disease diagnosis, where the literature often provides estimates of reasonable disease prevalence and where prevalence can change across different subgroups of interest.

Conclusions

Treating the phenotyping task as one of probable inference, rather than binary classification, better aligns with the problems facing physicians and scientists. While it may seem clear whether a patient was or was not diagnosed with a particular disease, determining such information entirely from available digitized records often results in challenging edge cases that cannot be classified with certainty. Such cases, we argue, should be represented with a probability when possible, and only binarized as necessary and with consideration of a cost-benefit analysis (e.g., using Eq. 5). We implement this approach through LEVI and show that probabilities can be leveraged to improve performance in ROC AUC estimation and operating point selection. We believe this probabilistic approach has the potential to play a meaningful role in downstream analyses based on such phenotypes.

This study has some limitations to note. First, it examines a relatively small number of disease states due to the time-consuming nature of chart reviews for evaluation. Second, we do not compare our method to bespoke approaches requiring significant site-specific knowledge to implement. It is possible that such approaches, with sufficient site-specific tuning, could outperform the approaches presented here on particular phenotypes. However, bespoke approaches are unlikely to be scalable across institutions and as such are not generally suitable for widespread use. Finally, the open-source version of the Snorkel software we use here may not incorporate all the newest methods implemented in the closed-source commercial version, and therefore we cannot draw any conclusions related to that software.

Future work should explore identifying groups of patients who are particularly challenging to classify and improving prior selection, perhaps by incorporating LF vote counts and agreements or disagreements across samples. This could improve the calibration and coverage properties of the LEVI outputs, providing better inferences and downstream decisions.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (261.9KB, docx)

Acknowledgements

Not Applicable.

Abbreviations

CI

Credible Interval

COPD

Chronic Obstructive Pulmonary Disease

ECE

Expected Calibration Error

EHR

Electronic Health Records

HCM

Hypertrophic Cardiomyopathy

HF

Heart Failure

ILD

Interstitial Lung Disease

LEVI

Label Estimation via Inference

LF

Labeling Function

MV

Majority Vote

NLP

Natural Language Processing

RegEx

Regular Expression

ROC AUC

Area Under the Receiver Operating Characteristic Curve

Author contributions

D.V., J.D., and W.T. developed the methodology and analysis, and J.P. and R.C. provided chart reviews for validation. B.F., R.M., and N.Z. helped conceptualize the study and provided supervision. D.V. and J.D. wrote the main manuscript text and all authors reviewed the manuscript.

Funding

Not Applicable.

Data availability

Deidentified data used in the research was collected in a real-world health care setting and is subject to controlled access for privacy and proprietary reasons. When possible, derived data supporting the findings of this study have been made available within the paper and its figures and tables. Minimal working code to run LEVI on custom data and labeling functions is available at https://github.com/dvidmar/levi.

Declarations

Ethics approval and consent to participate

This study was conducted on de-identified health information subject to an IRB exempt determination (Advarra Pro00072742) and did not involve human subjects research.

Consent for publication

Not Applicable.

Competing interests

All authors are employees and shareholders of Tempus AI, Inc.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Hripcsak G, Albers DJ. High-fidelity phenotyping: richness and freedom from bias. J Am Med Inf Assoc JAMIA. 2018;25:289–94. 10.1093/jamia/ocx110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.DeBoever C, Tanigawa Y, Aguirre M, et al. Assessing digital phenotyping to enhance genetic studies of human diseases. Am J Hum Genet. 2020;106:611–22. 10.1016/j.ajhg.2020.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Raghunath S, Pfeifer JM, Ulloa-Cerna AE, et al. Deep neural networks can predict New-Onset atrial fibrillation from the 12-Lead ECG and help identify those at risk of atrial fibrillation–Related stroke. Circulation. 2021;143:1287–98. 10.1161/CIRCULATIONAHA.120.047829. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Chandra A, Philips ST, Pandey A, et al. Electronic health Records–Based Cardio-Oncology registry for care gap identification and pragmatic research: procedure and observational study. JMIR Cardio. 2021;5:e22296. 10.2196/22296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Sood PD, Liu S, Lehmann H, et al. Assessing the effect of electronic health record data quality on identifying patients with type 2 diabetes: Cross-Sectional study. JMIR Med Inf. 2024;12:e56734. 10.2196/56734. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Leader JB, Pendergrass SA, Verma A, et al. Contrasting association results between existing PheWAS phenotype definition methods and five validated electronic phenotypes. AMIA Annu Symp Proc AMIA Symp. 2015;2015:824–32. [PMC free article] [PubMed]
  • 7.Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inf Assoc. 2013;20:e147–54. 10.1136/amiajnl-2012-000896. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Eyre H, Chapman AB, Peterson KS et al. Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. AMIA Annu Symp Proc. 2022;2021:438–47. [PMC free article] [PubMed]
  • 9.Banda JM, Seneviratne M, Hernandez-Boussard T, et al. Advances in electronic phenotyping: from Rule-Based definitions to machine learning models. Annu Rev Biomed Data Sci. 2018;1:53–68. 10.1146/annurev-biodatasci-080917-013315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.De Freitas JK, Johnson KW, Golden E, et al. Phe2vec: automated disease phenotyping based on unsupervised embeddings from electronic health records. Patterns. 2021;2. 10.1016/j.patter.2021.100337. [DOI] [PMC free article] [PubMed]
  • 11.Ni Y, Alwell K, Moomaw CJ, et al. Towards phenotyping stroke: leveraging data from a large-scale epidemiological study to detect stroke diagnosis. PLoS ONE. 2018;13:e0192586. 10.1371/journal.pone.0192586. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inf Assoc JAMIA. 2016;23:731–40. 10.1093/jamia/ocw011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Shu K, Zheng G, Li Y et al. Leveraging multi-source weak social supervision for early detection of fake news. 2020.
  • 14.Zhang Z-Y, Zhao P, Jiang Y, et al. Learning from incomplete and inaccurate supervision. IEEE Trans Knowl Data Eng. 2022;34:5854–68. 10.1109/TKDE.2021.3061215. [Google Scholar]
  • 15.Ratner A, Bach SH, Ehrenberg H, et al. Snorkel: rapid training data creation with weak supervision. Proc VLDB Endow Int Conf Very Large Data Bases. 2017;11:269. 10.14778/3157794.3157797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Narasimhamurthy A. Theoretical bounds of majority voting performance for a binary classification problem. IEEE Trans Pattern Anal Mach Intell. 2005;27:1988–95. 10.1109/TPAMI.2005.249. [DOI] [PubMed] [Google Scholar]
  • 17.Platanios EA, Dubey A, Mitchell T. Estimating accuracy from unlabeled data: a Bayesian approach. Proceedings of the 33rd International Conference on Machine Learning. PMLR; 2016:1416–25.
  • 18.Ratner A, Hancock B, Dunnmon J, et al. Training complex models with Multi-Task weak supervision. Proc AAAI Conf Artif Intell. 2019;33:4763–71. 10.1609/aaai.v33i01.33014763. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Maheshwari A, Chatterjee O, Killamsetty K, et al. Semi-Supervised data programming with subset selection. Association for Computational Linguistics; 2021.
  • 20.Lison P, Hubin A, Barnes J, et al. Named entity recognition without labelled data: A weak supervision approach. Association for Computational Linguistics; 2020.
  • 21.Mallory EK, de Rochemonteix M, Ratner A, et al. Extracting chemical reactions from text using snorkel. BMC Bioinformatics. 2020;21:217. 10.1186/s12859-020-03542-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Guo C, Pleiss G, Sun Y et al. On Calibration of Modern Neural Networks. 2017.
  • 23.Getzen E, Ungar L, Mowery D, et al. Mining for equitable health: assessing the impact of missing data in electronic health records. J Biomed Inf. 2023;139:104269. 10.1016/j.jbi.2022.104269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhou Y, Shi J, Stein R, et al. Missing data matter: an empirical evaluation of the impacts of missing EHR data in comparative effectiveness research. J Am Med Inf Assoc. 2023;30:1246–56. 10.1093/jamia/ocad066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Li J, Yan XS, Chaudhary D, et al. Imputation of missing values for electronic health record laboratory data. Npj Digit Med. 2021;4:1–14. 10.1038/s41746-021-00518-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Harrell F. Classification vs. Prediction. Stat. Think. 2017. https://www.fharrell.com/post/classification/. (Accessed 11 November 2024).
  • 27.Rosenman M, He J, Martin J, et al. Database queries for hospitalizations for acute congestive heart failure: flexible methods and validation based on set theory. J Am Med Inf Assoc. 2014;21:345–52. 10.1136/amiajnl-2013-001942. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Zeng C, Schlueter DJ, Tran TC, et al. Comparison of phenomic profiles in the all of Us research program against the US general population and the UK biobank. J Am Med Inf Assoc. 2024;31:846–54. 10.1093/jamia/ocad260. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Pujades-Rodriguez M, Guttmann OP, Gonzalez-Izquierdo A, et al. Identifying unmet clinical need in hypertrophic cardiomyopathy using National electronic health records. PLoS ONE. 2018;13:e0191214. 10.1371/journal.pone.0191214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Farrand E, Collard HR, Guarnieri M, et al. Extracting patient-level data from the electronic health record: expanding opportunities for health system research. PLoS ONE. 2023;18:e0280342. 10.1371/journal.pone.0280342. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Bielinski SJ. Heart Failure (HF) with Differentiation between Preserved and Reduced Ejection Fraction | PheKB. 2013. https://phekb.org/phenotype/heart-failure-hf-differentiation-between-preserved-and-reduced-ejection-fraction. (Accessed 1 November 2024).
  • 32.Kirby JC, Speltz P, Rasmussen LV, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inf Assoc JAMIA. 2016;23:1046–52. 10.1093/jamia/ocv202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Ratner A. Snorkel. 2017. https://github.com/snorkel-team/snorkel
  • 34.Jaynes ET. Information theory and statistical mechanics. Phys Rev. 1957;106:620–30. 10.1103/PhysRev.106.620. [Google Scholar]
  • 35.Jaynes ET. Where do we stand on maximum entropy? Maximum Entropy Formalism Conference, MIT. 1978.
  • 36.Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;i6. 10.1136/bmj.i6. [DOI] [PMC free article] [PubMed]
  • 37.Shin C, Sebag AS. Can we get smarter than majority vote? Efficient use of individual rater’s labels for content moderation. 2nd Workshop on Efficient Natural Language and Speech Processing. 2022.
  • 38.Kumar A, Liang PS, Ma T. Verified uncertainty calibration. In: Wallach H, Larochelle H, Beygelzimer A, et al. editors. Advances in neural information processing systems. Curran Associates, Inc.; 2019.
  • 39.Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and Estimation. J Am Stat Assoc. 2007;102:359–78. 10.1198/016214506000001437. [Google Scholar]
