Journal of the American Medical Informatics Association (JAMIA). 2023 Dec 18;31(3):574–582. doi: 10.1093/jamia/ocad241

Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease

Joshua C Smith 1,, Brian D Williamson 2, David J Cronkite 3, Daniel Park 4, Jill M Whitaker 5, Michael F McLemore 6, Joshua T Osmanski 7, Robert Winter 8, Arvind Ramaprasan 9, Ann Kelley 10, Mary Shea 11, Saranrat Wittayanukorn 12, Danijela Stojanovic 13, Yueqin Zhao 14, Sengwee Toh 15, Kevin B Johnson 16, David M Aronoff 17, David S Carrell 18
PMCID: PMC10873852  PMID: 38109888

Abstract

Objectives

Automated phenotyping algorithms can reduce development time and operator dependence compared to manually developed algorithms. One such approach, PheNorm, has performed well for identifying chronic health conditions, but its performance for acute conditions is largely unknown. Herein, we implement and evaluate PheNorm applied to symptomatic COVID-19 disease to investigate its potential feasibility for rapid phenotyping of acute health conditions.

Materials and methods

PheNorm is a general-purpose automated approach to creating computable phenotype algorithms based on natural language processing, machine learning, and (low cost) silver-standard training labels. We applied PheNorm to cohorts of potential COVID-19 patients from 2 institutions and used gold-standard manual chart review data to investigate the impact on performance of alternative feature engineering options and implementing externally trained models without local retraining.

Results

Models at each institution achieved AUC, sensitivity, and positive predictive value of 0.853, 0.879, and 0.851 and of 0.804, 0.976, and 0.885, respectively, at quantiles of model-predicted risk that maximize F1. We report performance metrics for all combinations of silver labels, feature engineering options, and models trained internally versus externally.

Discussion

Phenotyping algorithms developed using PheNorm performed well at both institutions. Performance varied with different silver-standard labels and feature engineering options. Models developed locally at one site also worked well when implemented externally at the other site.

Conclusion

PheNorm models successfully identified an acute health condition, symptomatic COVID-19. The simplicity of the PheNorm approach allows it to be applied at multiple study sites with substantially reduced overhead compared to traditional approaches.

Keywords: natural language processing, electronic health records, phenotyping, machine learning, COVID-19

Background and significance

Computable phenotyping algorithms are widely used in healthcare and public health surveillance.1–3 An example is the United States Food and Drug Administration (FDA)’s Sentinel medical product safety surveillance system, where phenotyping algorithms based on electronic health insurance claims and other healthcare data are used to identify health outcomes of interest, as well as underlying indications for medications.4,5 These algorithms have typically been developed with time-intensive expert curation and validated using manually annotated gold-standard training sets, resulting in high costs, long development timelines, and limited scalability.6 This approach limits the number of phenotypes that can be investigated and may also introduce operator-dependent idiosyncrasies to resultant algorithms.

Automated approaches to developing phenotype algorithms have been applied successfully to many adult7,8 and pediatric9 chronic health conditions. One such approach, PheNorm,10 automates the generation of phenotyping algorithms without requiring expert-labeled gold-standard training data, greatly reducing development time and operator dependence. The PheNorm approach mines clinical knowledge articles to identify relevant medical concepts to be operationalized via natural language processing (NLP) of patient chart notes; machine learning and silver-standard case status labels are used for model development. Silver-standard labels are easily computed approximations of gold-standard case status labels (such as the number of times the main diagnosis code for a health condition appears in a patient’s medical record) and allow model training using very large numbers of observations because the cost constraints of creating gold-labeled training data do not apply.

Whether automated approaches will perform well for acute health conditions is unknown. Agarwal applied an automated approach to identify patients with acute myocardial infarction,11 but diagnostic codes alone can accurately identify such patients,12,13 and reliance on NLP-derived information may be more difficult. Unlike information available for modeling chronic health conditions—including longitudinal encounter, diagnosis, procedure, lab, and/or medication data—secondary data for acute health events is relatively sparse and rarely offers “second chances” to recognize and classify a potential event.

The PheNorm method recommends generating NLP-derived features as counts of affirmative mentions of relevant medical concepts in patients’ clinical note text (ie, mentions that are not negated, historical, hypothetical, or about someone other than the patient in question), normalizing these counts by measures of healthcare utilization, and using dimension reduction (also called “covariate pre-selection”) before modeling. Whether these details of the PheNorm process are necessary when modeling acute conditions is unknown.

Coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, was first identified in December 2019. As it was a new condition, little was initially known about the virus and structured clinical data was unavailable. As the pandemic progressed, diagnostic guidelines, laboratory testing, coding practices, and treatment options changed rapidly. As such, COVID-19 presents an opportunity to investigate application of the PheNorm approach to an acute condition that could serve several clinical, epidemiological, and public health purposes.

Objective

Herein, we apply an automated approach to developing phenotyping models to accurately identify patients with symptomatic COVID-19 disease. We investigate the effects of alternative feature engineering options and evaluate model performance in 2 diverse healthcare settings.

COVID-19 disease is an acute condition for which diagnostic codes have been shown to have low accuracy,14 which may be due to both over-coding and under-coding. During the height of the pandemic, providers may have been more likely to diagnose a patient with COVID-19 due to administrative reasons, the high incidence rate, and/or a tendency to label conditions with similar symptomatology as COVID-19. Conversely, mild cases may have been under-coded during the time when healthcare resources were stretched thin and patients with mild symptoms were discouraged from seeking care. In either case, patients with positive PCR tests were often coded (correctly) for COVID-19 even when asymptomatic. As our focus is on symptomatic COVID-19 disease, evidence of infection was insufficient since many patients who tested positive for COVID-19 were asymptomatic. We define symptomatic disease as SARS-CoV-2 infections accompanied by at least mild symptoms as characterized by the National Institutes of Health.15

Methods

In 2 heterogeneous healthcare settings, we implemented several variations of the PheNorm approach10,16,17 to investigate its utility for identifying patients with an acute health condition: symptomatic COVID-19 disease. Steps in the PheNorm method are summarized in Figure 1 and throughout this section. Additionally, we apply this modeling approach to patients identified in the traditional manner—ie, patients with new diagnosis codes for the condition of interest, COVID-19 disease—and we supplement this traditional cohort with a set of patients who do not have COVID-19 diagnoses but have other structured data codes suggesting they may have experienced COVID-19—an approach we refer to as high-sensitivity filtering (described below).

Figure 1.

Major steps in implementing the PheNorm modeling approach and types of expertise required.

Settings and study cohorts

Eligible patients received care at 1 of 2 healthcare settings: Vanderbilt University Medical Center (VUMC), an academic medical center delivering outpatient, emergency, and inpatient care in the southern United States; or Kaiser Permanente Washington (KPWA), an integrated care health maintenance organization (HMO) provider delivering outpatient and limited emergency care in the Pacific Northwest of the United States. VUMC and its affiliated hospitals and clinics see over 3 million patient visits per year, including over 70 000 admissions and 200 000 emergency department (ED) visits. KPWA provides integrated care to over 660 000 enrollees through 35 outpatient clinics in Washington state; hospital and ED care are externally contracted and not documented in the KPWA EHR. As contributors to the FDA Sentinel Innovation Center,18 both institutions were in a position to collaboratively develop and test these methods in environments with different institutional characteristics and patient visit types.

Patients were included in the study if they had an encounter between April 1, 2020 and March 31, 2021 coded with either (a) any of 6 International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) diagnosis codes indicating COVID-19 disease, or (b) any of 43 procedure codes, medications, laboratory studies, problem list entries, or other ICD-10-CM diagnosis codes determined to be highly correlated with a diagnosis of COVID-19. The former criterion (a) is a common approach to identifying patients with a clinical phenotype,19 which we call our traditional filter; the latter criterion (b) is a novel method we developed,20 which we call our high-sensitivity filter (see Section SA). The PheNorm approach is completely independent of the methods used to identify potential events for modeling; our inclusion of the high-sensitivity filtering approach allows us to separately investigate ways to potentially improve phenotyping sensitivity. All VUMC data came from the VUMC electronic health record (EHR); KPWA data came from the KPWA outpatient EHR supplemented by structured medical claims data for inpatient, emergency, and outpatient care provided to KPWA enrollees by external providers. This work was conducted under the authority of the FDA Sentinel Initiative in support of FDA medical product safety surveillance. The VUMC and KPWA institutional review boards therefore determined it to be a public health surveillance activity exempt from IRB review.21–23
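
As a concrete illustration, the sketch below shows how the two cohort-entry filters and index-date assignment might be expressed. It is illustrative only, not the sites' production code: the 6 traditional ICD-10-CM codes are those listed later for Structured Label 2, the 43 high-sensitivity codes (Section SA) are represented by an empty placeholder set, and the tie-breaking rule (first qualifying encounter determines the filter group) is an assumption.

```python
from datetime import date

TRADITIONAL_CODES = {"U07.1", "J12.81", "J12.82", "B34.2", "B97.21", "B97.29"}
HIGH_SENSITIVITY_CODES = set()  # placeholder for the 43 codes/medications/labs/problem-list entries in Section SA
STUDY_START, STUDY_END = date(2020, 4, 1), date(2021, 3, 31)


def cohort_entry(encounters):
    """Classify one patient's cohort entry and return (filter_type, index_date).

    `encounters` is an iterable of (encounter_date, set_of_codes) pairs; the index date
    is the first qualifying encounter during the study period (see the next subsection).
    """
    for enc_date, codes in sorted(encounters):
        if not (STUDY_START <= enc_date <= STUDY_END):
            continue
        if codes & TRADITIONAL_CODES:
            return "traditional", enc_date
        if codes & HIGH_SENSITIVITY_CODES:
            return "high_sensitivity", enc_date
    return None, None
```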

Data catchment period for potential COVID-19 episodes

In PheNorm, a fixed data catchment period anchored to a patient-specific index date identifies the data used to operationalize silver labels and features. Considering catchment periods of up to several years is reasonable for chronic health conditions,7 but data regarding acute conditions are generated during narrow timeframes. We defined potential COVID-19 episodes and corresponding index dates as a patient’s first encounter with a qualifying ICD-10-CM code or high-sensitivity filter feature during the study period (Section SA). Our catchment period was the index date ±30 days, which we consider likely to include relevant and exclude unrelated information. While some patients likely died within the 30-day post-index period, we reasoned that this was unlikely to affect performance due to the acute nature of the phenotype and the available data. That is, due to the nature of EHR data, all patients will have variable amounts of evidence, but that evidence should be sufficient to identify an acute condition. Eligible patients included adults (age 18+ years) with at least one encounter and ≥1000 characters of clinical text during the study period. KPWA patients (only) were required to be enrolled in the KPWA integrated care plan to ensure access to EHR data for manual chart review. All notes within each patient’s catchment period were processed using the MetaMapLite24 NLP tool to identify mentions of clinical concepts represented in the Unified Medical Language System (UMLS).
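
A minimal sketch of the catchment window and eligibility checks described above follows; the data structures and function names are assumptions for illustration rather than the sites' actual extraction code.

```python
from datetime import date, timedelta

CATCHMENT_DAYS = 30


def catchment_window(index_date):
    """Return (start, end) of the period from which silver labels and features are drawn."""
    return (index_date - timedelta(days=CATCHMENT_DAYS),
            index_date + timedelta(days=CATCHMENT_DAYS))


def is_eligible(age_years, notes, study_start, study_end):
    """Adults (18+ years) with >= 1000 characters of clinical text during the study period.

    `notes` is an iterable of (note_date, note_text) pairs.
    """
    n_chars = sum(len(text) for note_date, text in notes if study_start <= note_date <= study_end)
    return age_years >= 18 and n_chars >= 1000
```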

Silver-standard labels

The PheNorm authors suggest replacing scarce, costly gold-standard data with silver-standard data during model training and, further, suggest considering alternative versions thereof. Accordingly, we used information from each patient’s data catchment period to operationalize 4 silver-standard labels (using either structured or NLP-derived data) that we speculated would have strong positive correlations with actual symptomatic COVID-19 disease:

  1. Structured Label 1: Count of calendar days with a COVID-19 diagnosis code (U07.1), including both outpatient visits and inpatient days

  2. Structured Label 2: Count of calendar days with any of 6 COVID-19-related diagnosis codes: U07.1, J12.81, J12.82, B34.2, B97.21, B97.29 (Section SA)

  3. NLP Label 1: Count of mentions of the term “COVID-19” in chart notes

  4. NLP Label 2: Count of chart notes with an NLP-identified UMLS concept for COVID-19 disease (C5203670)

Detailed definitions and summary data for each silver label are provided in Section SE.
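
To make the four silver labels concrete, the following sketch computes them for one patient's catchment-period data. The helper signature and input formats are hypothetical; the authoritative definitions, including the full diagnosis code list, are in Sections SA and SE.

```python
COVID_DX_CODES = {"U07.1", "J12.81", "J12.82", "B34.2", "B97.21", "B97.29"}
COVID_CUI = "C5203670"  # UMLS concept for COVID-19 disease


def silver_labels(diagnosis_days, notes, note_cuis):
    """Compute the 4 silver-standard labels for one patient's catchment period.

    diagnosis_days : iterable of (calendar_date, set_of_dx_codes)
    notes          : iterable of note text strings
    note_cuis      : iterable of sets of UMLS CUIs found by MetaMapLite, one set per note
    """
    return {
        "structured_1": sum(1 for _, codes in diagnosis_days if "U07.1" in codes),
        "structured_2": sum(1 for _, codes in diagnosis_days if codes & COVID_DX_CODES),
        "nlp_1": sum(text.upper().count("COVID-19") for text in notes),
        "nlp_2": sum(1 for cuis in note_cuis if COVID_CUI in cuis),
    }
```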

Feature engineering

Predictive models require patient features as input to generate predictions. PheNorm primarily uses features extracted from clinical text using NLP. To select features relevant to the phenotype of interest, we applied the automated feature extraction for phenotyping (AFEP) method.17 Briefly, AFEP utilizes knowledgebase articles describing a phenotype of interest to automatically select relevant features without requiring time-intensive expert curation. Instead, clinical concepts mentioned in the articles are extracted using NLP and those present in the majority of articles are used as features. Using MetaMapLite, we processed the text of 5 clinical knowledgebase articles on COVID-19 disease and identified 158 UMLS concepts to be used as patient features (Section SD). We then created 4 versions of each of these features based on relevant concepts identified by MetaMapLite in patients’ chart notes during the catchment period: (1) a simple version that counted all mentions of the corresponding concept in a patient’s chart, (2) a “non-negated” version that counted only mentions that were not negated in text (eg, “fever” was counted, but not “no fever”), (3) a normalized version of the simple count, and (4) a normalized version of the non-negated count. Normalization was achieved by dividing raw counts by the length of the patient’s chart text (in characters). Since longer charts would likely have more mentions of selected features, we theorized that such normalization would reduce arbitrary differences due to note length and patient visit frequency. However, we evaluated both normalized and non-normalized counts since longer notes, more visits, and more mentions might also imply the presence of the phenotype. The version of NLP-extracted features used in each model was determined by the feature engineering options implemented for that model (Table 1). We also operationalized 2 “structured” data features: patient sex (as captured in the EHR) and patient age in years.
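
The sketch below illustrates the 4 engineered versions of a single concept feature as described above. The mention format and the division-by-chart-length normalization mirror the text, but the exact representation used at each site is an assumption.

```python
def feature_versions(mentions, chart_length_chars):
    """Build the four engineered versions of one UMLS concept feature.

    mentions           : list of dicts like {"cui": ..., "negated": bool} for one concept
    chart_length_chars : total characters of the patient's clinical text in the catchment period
    """
    raw = len(mentions)
    non_negated = sum(1 for m in mentions if not m["negated"])
    denom = max(chart_length_chars, 1)  # guard against division by zero
    return {
        "count": raw,                              # version 1: all mentions
        "count_nonneg": non_negated,               # version 2: negated mentions excluded
        "count_norm": raw / denom,                 # version 3: normalized by chart length
        "count_nonneg_norm": non_negated / denom,  # version 4: both options combined
    }
```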

Table 1.

Five sets of PheNorm models each implementing alternative feature engineering options (Yes=option is implemented) and the scientific question motivating each model set.a

| Model set | Exclude negated mentions | Normalize by patient’s chart length | Dimension reduction pre-modeling | Scientific question |
| --- | --- | --- | --- | --- |
| 1 | No | No | No | Does simple feature engineering yield sufficient model performance? |
| 2 | Yes | No | No | Does excluding NLP negation improve performance (vs Model set 1)? |
| 3 | No | Yes | No | Does normalizing features improve performance (vs Model set 1)? |
| 4 | No | No | Yes | Is performance preserved in models based on reduced feature sets? |
| 5 | Yes | Yes | Yes | Do all feature engineering options combined improve performance? |
a These 5 model sets allowed us to investigate the effect on model performance of not using any of the 3 feature engineering options (model set 1), the effect of each option separately (model sets 2-4), and the combined effects of all 3 options simultaneously (model set 5). For completeness, we report in Section SF results for the other 3 logical combinations of feature engineering options not included in this table.

Dimension reduction

When a large number of features are available for model training or the number of observations available for such training is limited, dimension reduction methods may yield simpler models without sacrificing performance by removing duplicative or less-informative features. Simpler models (those using fewer features) may also be more portable and less susceptible to noise. Since free-text notes were the primary source of PheNorm features, dimension reduction may yield models less dependent on local note-writing style, vocabulary, and expression. We investigated one such approach, surrogate-assisted feature engineering, which favors candidate features highly predictive of silver-standard labels for the phenotype,16 hypothesizing that feature reduction would yield models with comparable performance using substantially fewer features.
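
As a simplified illustration of the pre-selection idea (keeping only features that are predictive of a silver label), the sketch below uses a cross-validated lasso as the selection mechanism. This is not the published surrogate-assisted feature engineering procedure (reference 16), which is more involved; it only conveys the notion of filtering features by their association with a silver-standard label, and the log(1 + x) transforms are an assumed convention.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def preselect_features(X, silver_label, feature_names):
    """Keep features with nonzero lasso coefficients when predicting a silver label.

    X            : (n_patients, n_features) matrix of engineered NLP feature counts
    silver_label : (n_patients,) silver-standard label counts
    """
    X_log = np.log1p(X)
    y_log = np.log1p(silver_label)
    model = LassoCV(cv=5).fit(X_log, y_log)
    return [name for name, coef in zip(feature_names, model.coef_) if abs(coef) > 1e-8]
```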

Model development

PheNorm models estimate each patient’s probability of being a phenotype case using easily operationalized silver-standard surrogate labels, a limited set of structured-data features, and NLP-derived features, described above. The PheNorm procedure involves 3 steps. First, the silver-standard label of interest is regressed on a collection of potential predictors consisting of a noisy version of the silver-standard label,10 the remaining silver-standard labels, and the features. Next, the regression model is used with the original (non-noisy) silver labels and features to make predictions of the silver label given the features. Finally, these predictions are used in an expectation-maximization algorithm to estimate the probability of being phenotype positive given the silver labels and features.10
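
The sketch below is a deliberately simplified rendering of these 3 steps, not the authors' published implementation: it omits normalization of silver labels by healthcare utilization, uses ridge regression for the denoising step, and assumes an illustrative corruption rate. The final step fits a two-component Gaussian mixture by EM and takes the posterior probability of the higher-mean component as the estimated probability of being a phenotype case.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)


def phenorm_probability(silver, other_silvers, X, corrupt_rate=0.3):
    """silver: (n,) target silver label; other_silvers: (n, k); X: (n, p) engineered features."""
    y = np.log1p(silver)
    Z = np.column_stack([np.log1p(other_silvers), X])

    # Step 1: regress the silver label on a noise-corrupted copy of itself plus the other inputs.
    corrupted = y * rng.binomial(1, 1 - corrupt_rate, size=y.shape)
    reg = Ridge(alpha=1.0).fit(np.column_stack([corrupted, Z]), y)

    # Step 2: predict using the original (non-noisy) silver label and features.
    score = reg.predict(np.column_stack([y, Z]))

    # Step 3: EM-fitted two-component mixture; posterior of the higher-mean component = P(case).
    gmm = GaussianMixture(n_components=2, random_state=0).fit(score.reshape(-1, 1))
    case_component = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict_proba(score.reshape(-1, 1))[:, case_component]
```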

We developed models for all logical combinations of the abovementioned options (negation, normalization, and dimension reduction). Each of the model sets included 5 PheNorm models, one for each of the 4 silver labels plus a fifth aggregating results from the other 4 (40 models total; the aggregate model averages the predicted probabilities from the 4 silver-label models). We trained these models using data from patients without gold-standard case labels and evaluated them using data from a set-aside sample of patients with gold-standard labels. We focus on the scientific questions related to the 5 model sets described in Table 1.

Gold-standard sample

We used manual chart review to create the gold-standard data used to evaluate our PheNorm models from a stratified random sample of patients identified by our traditional and high-sensitivity filters (above). We calculated sampling probabilities (Section SC) for reweighting performance metrics during model evaluation (below). Trained chart abstractors following written guidelines (Section SC) assigned phenotype-positive labels to patients with evidence of at least possible SARS-CoV-2 infection and at least symptomatic COVID-19 disease,15 and phenotype-negative labels to all other patients. Sub-samples of dual, blind independent chart reviews indicated high inter-rater agreement (Cohen’s kappa of 0.951 at VUMC and 0.802 at KPWA).

Model evaluation

We evaluated models in our set-aside gold-standard sample (above), weighted to reflect stratum-specific sampling probabilities. To assess whether PheNorm improved the accuracy of outcome identification compared to the “raw” accuracy of the codes alone, we compared raw true-positive rates in our gold-standard sample (ie, without the benefit of PheNorm modeling) to the PPVs achieved by the PheNorm models—overall and separately for our traditional filters (combined) and our high-sensitivity filters (combined). Metrics we used to assess PheNorm performance are area under the receiver operating characteristic curve (AUC), and for selected cut-points of model-predicted risk, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score (the harmonic mean of PPV and sensitivity).
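
A minimal sketch of this evaluation step follows: each gold-standard patient carries a stratum-specific inverse-probability-of-sampling weight (Section SC), metrics are computed from weighted confusion-matrix counts, and the reported cut-point is the one maximizing weighted F1. Variable names and the threshold search are illustrative assumptions.

```python
import numpy as np


def weighted_metrics(y_true, p_hat, weights, threshold):
    """Weighted sensitivity, specificity, PPV, NPV, and F1 at one cut-point of predicted risk."""
    y_pred = (p_hat >= threshold).astype(int)
    tp = np.sum(weights * ((y_pred == 1) & (y_true == 1)))
    fp = np.sum(weights * ((y_pred == 1) & (y_true == 0)))
    fn = np.sum(weights * ((y_pred == 0) & (y_true == 1)))
    tn = np.sum(weights * ((y_pred == 0) & (y_true == 0)))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0.0
    f1 = 2 * ppv * sens / (ppv + sens) if (ppv + sens) > 0 else 0.0
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv, "npv": npv, "f1": f1}


def best_f1_threshold(y_true, p_hat, weights):
    """Scan candidate cut-points of model-predicted risk and keep the one maximizing weighted F1."""
    candidates = np.unique(p_hat)
    return max(candidates, key=lambda t: weighted_metrics(y_true, p_hat, weights, t)["f1"])
```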

Additionally, we evaluated each site’s models using the other site’s gold-standard data to assess model transportability to an external site without local training or tailoring.

Results

Characteristics of the 24 177 VUMC and 8329 KPWA patients meeting study inclusion criteria are summarized in Table 2. Median age and percent female were 45 years and 58% at VUMC and 53 years and 58% at KPWA. Most patients (VUMC: 83%; KPWA: 82%) qualified for the study by satisfying our traditional filters (any of 6 ICD-10-CM diagnoses for COVID-19 disease); the remainder qualified by our high-sensitivity filters. The proportion of chart-review-confirmed COVID-19 cases in the VUMC gold-standard sample was 85.8% overall, 92.6% in the traditional filter group, and 53.5% in the high-sensitivity filter group. At KPWA, these proportions were 68.7% overall, 76.2% in the traditional filter group, and 34.4% in the high-sensitivity filter group.

Table 2.

Characteristics of the study cohorts by study site.

| Characteristic | VUMC count | VUMC percent | KPWA count | KPWA percent |
| --- | --- | --- | --- | --- |
| All patients | 24 177 | 100 | 8329 | 100 |
| Sex is female | 14 025 | 58 | 4837 | 58 |
| Age 18-29 years | 5645 | 23 | 1104 | 13 |
| Age 30-49 years | 8131 | 34 | 2503 | 30 |
| Age 50-69 years | 7433 | 31 | 3126 | 38 |
| Age 70+ years | 2968 | 12 | 1596 | 19 |
| Race is White | 16 407 | 68 | 5335 | 64 |
| Ethnicity is Hispanic | 1018 | 4 | 756 | 9 |
| Qualified for study by COVID-19 diagnosis code | 19 986 | 83 | 6847 | 82 |
| Qualified for study by high-sensitivity filter (any) | 4191 | 17 | 1482 | 18 |
| 1-3 days with clinical notesa | 7334 | 30 | 1768 | 21 |
| 4-6 days with clinical notesa | 5241 | 22 | 2357 | 28 |
| 7-9 days with clinical notesa | 3852 | 16 | 1693 | 20 |
| 10+ days with clinical notesa | 7750 | 32 | 2511 | 30 |
a Calendar days with clinical notes during ±30 days from each patient’s index date.

The AUCs across all PheNorm models ranged from 0.770 to 0.804 at VUMC and 0.801 to 0.853 at KPWA; model PPVs (at maximum F1 score) ranged from 0.858 to 0.903 at VUMC and 0.772 to 0.876 at KPWA (Section SF). The model sets with the highest AUC at each study site are summarized in Table 3 (at VUMC, 2 other model sets had the same highest AUC; see Section SF). The VUMC model with the highest AUC was trained on a structured data silver label (Structured Label 2, the count of calendar days during the data catchment period with any of 6 COVID-19-related diagnosis codes) without any of the feature engineering options, ie, without excluding negated mentions, feature normalization, or dimension reduction (Model set 1 in Table 1) (AUC = 0.804). The highest-AUC KPWA model was also trained on a structured data silver label (Structured Label 1, the count of calendar days during the data catchment period with the ICD-10-CM diagnosis code U07.1) with feature normalization but without excluding negated mentions or dimension reduction (Model set 3 in Table 1) (AUC = 0.853). At VUMC, both models trained on NLP silver labels achieved the highest F1 score (0.937); at KPWA, the model trained on NLP Label 2 achieved the highest F1 score (0.869).

Table 3.

Performance of highest-AUC model sets at each study site, separately developed and evaluated, for identifying patients with symptomatic COVID-19 disease (AUC, F1, sensitivity, specificity, PPV, and NPV) at quantiles of model-predicted risk that maximize F1.

| Study site (model set) | Silver label | AUC | Max. F1 | Sensitivity | Specificity | PPV | NPV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VUMC (model set 1) | Struc. 1 | 0.802 | 0.927 | 0.976 | 0.214 | 0.883 | 0.597 |
| VUMC (model set 1) | Struc. 2 | 0.804 | 0.929 | 0.976 | 0.234 | 0.885 | 0.617 |
| VUMC (model set 1) | NLP 1 | 0.788 | 0.937 | 0.982 | 0.309 | 0.896 | 0.743 |
| VUMC (model set 1) | NLP 2 | 0.775 | 0.937 | 0.982 | 0.306 | 0.896 | 0.741 |
| VUMC (model set 1) | Agg. | 0.786 | 0.937 | 0.982 | 0.306 | 0.896 | 0.741 |
| KPWA (model set 3) | Struc. 1 | 0.853 | 0.865 | 0.879 | 0.662 | 0.851 | 0.713 |
| KPWA (model set 3) | Struc. 2 | 0.851 | 0.862 | 0.875 | 0.662 | 0.850 | 0.706 |
| KPWA (model set 3) | NLP 1 | 0.819 | 0.861 | 0.945 | 0.451 | 0.791 | 0.789 |
| KPWA (model set 3) | NLP 2 | 0.833 | 0.869 | 0.949 | 0.482 | 0.801 | 0.812 |
| KPWA (model set 3) | Agg. | 0.847 | 0.867 | 0.949 | 0.472 | 0.798 | 0.809 |

Abbreviations: AUC, area under the receiver operator characteristics curve; Max., maximum; F1, F1 score, defined as the harmonic mean of PPV (precision) and sensitivity (recall); PPV, positive predictive value; NPV, negative predictive value; Struct., structured.

Bolded values are for models with the highest AUC or F1 for each study site.

VUMC’s highest PPV from model set 1 was 0.896, a 4.1% improvement over the 85.8% true-case rate observed in gold-standard data; KPWA’s highest PPV from model set 3 was 0.851, a 16.4% improvement over the 68.7% true-case rate in gold-standard data. PPVs within subsets of the initial cohort defined by filter group (traditional or high-sensitivity) varied substantially, showing larger improvements at KPWA than at VUMC. Among patients identified by traditional filters alone, the VUMC PheNorm model’s PPV of 0.950 was a 2.4% improvement over the true-case rate in VUMC gold-standard data (92.6%); the KPWA PheNorm model’s PPV of 0.851 was an 8.9% improvement over the true-case rate in KPWA gold-standard data (76.2%). Among patients identified by high-sensitivity filters alone, the VUMC PheNorm model’s PPV of 0.615 was an 8.0% improvement over the true-case rate in gold-standard data, and the KPWA PheNorm model’s PPV of 0.893 was a 54.9% improvement (but results in this high-sensitivity group may be unstable due to the small sample size).

Model performance varied by silver label, but models trained on structured-data labels (Structured 1, Structured 2) had higher AUCs at both VUMC and KPWA; the exception was NLP Label 1 (mentions of the term “COVID-19”), which had the highest AUC for model set 5 at VUMC. Performance also varied with alternative feature engineering options (Table 1), but all yielded strong performance. Full results for all model sets are available in Section SF. Excluding negated mentions from feature counts (model set 2) did not substantially impact model performance, yielding models with AUCs between 0.777 and 0.804 at VUMC and between 0.803 and 0.849 at KPWA. Similarly, normalizing feature mention counts by the quantity of each patient’s clinical text processed (model set 3) had little impact on performance overall, yielding AUCs between 0.770 and 0.790 at VUMC and between 0.819 and 0.853 at KPWA. Dimension reduction produced models with strong performance based on fewer features (Section SG). At VUMC, models based on 87 (55%) and 71 (45%) features pre-selected from the full set of 158 achieved AUCs of 0.804 and 0.791, respectively, and models at KPWA based on 55 (35%) and 30 (19%) features achieved AUCs of 0.848 and 0.842, respectively, for model sets 4 and 5.

Overall, differences in performance between the alternative feature engineering options and baseline model set 1 appeared minor. The change in AUC for each silver label across all model sets ranged from −0.003 to 0.016 at VUMC and from −0.018 to 0.012 at KPWA; variation in F1, sensitivity, specificity, and PPV was similarly small (Section SH).

Performance metrics at alternative cut-points of model-predicted probability for the highest-AUC model/silver-label combinations at VUMC and KPWA are presented in Figure 2. Using cut-points of model-predicted probability that yielded at least 80% PPV (a commonly used benchmark) gives sensitivities of 0.999 for the best VUMC model and 0.905 for the best KPWA model. Performance levels suitable for addressing different scientific questions may be achieved by selecting different cut-points of predicted probability.

Figure 2.

Model performance metrics (F0.5, F1, NPV, PPV, Sensitivity, and Specificity) for (A) the highest-AUC VUMC model set and (B) the highest-AUC KPWA model set.

At both study sites the performance of externally trained models (eg, performance of a model trained on KPWA data evaluated on VUMC data) was generally similar to that of internally trained models (eg, performance of a model trained on VUMC data evaluated on VUMC gold-standard data), with some exceptions (Figure 3). At VUMC, the AUC of the best externally trained model was 0.804, compared to 0.817 for the best locally trained model. At KPWA, the AUC of the best-performing externally trained model was 0.834, compared to 0.853 for the best locally trained model. An exception was the VUMC-trained models in model set 5, which had substantially lower AUCs on KPWA gold-standard data (Figure 3).

Figure 3.

Comparing performance of internally trained and externally trained models based on AUC: (A) Performance of VUMC and KPWA models on VUMC gold-standard data and (B) performance of KPWA and VUMC models on KPWA gold-standard data.

Discussion

This work demonstrates that the PheNorm approach to automating development of phenotyping algorithms can be applied successfully to symptomatic COVID-19 disease, an acute health condition. The high raw true-positive rate in VUMC’s traditional filter group (without the benefit of modeling) was contrary to our expectations and to reports by others,14 and may have been influenced by local practices in the initial year of the pandemic. This high true-positive rate likely contributes to the generally stronger PPV and sensitivity of the VUMC models, but VUMC’s access to inpatient and ED EHR records may also be an important factor. Further, as illustrated in Figure 2, the PheNorm models would allow researchers to achieve very high PPVs (in the upper 90% range) with acceptable reductions in sensitivity, tradeoffs that are not possible if ICD-10 diagnoses alone were used to define the phenotype. Models based on NLP-derived data from KPWA’s outpatient chart notes also performed well and, critically, improved identification of patients with actual symptomatic COVID-19 disease to a level (∼85% PPV) that would be sufficient for use in many scientific investigations.

Variation in model performance by silver label, depending both on data type and counting rules, underscores the value of using diverse silver labels and feature engineering options, both of which are facilitated by the PheNorm approach. Alternative feature engineering options, particularly when easily implemented (as they are here), should also be considered. Although incorporating negation and/or normalization had only a minor impact on performance for this phenotype, they may be consequential in modeling other phenotypes. Interestingly, dimension reduction yielded strong performance even when using one-fourth to one-third as many features, as was the case with models trained at KPWA, thereby simplifying real-world model implementation. When choosing a model for implementation, simpler models—those with fewer features and/or simpler feature engineering options—should be favored (Occam’s principle).

Most models trained at VUMC and evaluated at KPWA, and vice versa, performed surprisingly well. In nearly all cases, KPWA-trained models performed as well as or better than VUMC-trained models on VUMC data. Similarly, VUMC-trained models implemented on KPWA data performed nearly as well as KPWA-trained models, with the exception of the model set 5 models. While externally trained models tended to perform well, the substantial drop in performance at KPWA of VUMC-trained model set 5 reinforces the advantages of local training. The generally strong performance of externally trained models may be evidence of the portability of PheNorm models but may also be an artifact of this particular phenotype; further investigation is required.

This work has limitations. First, our phenotype was COVID-19 disease and we used data from early in the COVID-19 pandemic (when COVID-19 replaced other common reasons for hospital and clinic visits); both may introduce idiosyncrasies relative to other phenotypes and time periods. Second, we used data from only 2 healthcare settings which, though diverse, may not be representative of other settings. Future work should investigate other acute phenotypes, subtypes of phenotypes (eg, “severe” COVID-19 disease or “insulin-dependent” diabetes), and chronic phenotypes at multiple diverse sites; future research should also explore approaches to developing useful silver-standard labels and how data catchment window size impacts performance, particularly for acute conditions. Third, PheNorm modeling requires access to electronic clinical text, which may not be available in all settings.

Conclusion

The PheNorm approach can successfully identify an acute health condition, as illustrated by these models predicting COVID-19 disease. The simplicity of the PheNorm approach to model development allows it to be applied at multiple study sites with substantially reduced overhead compared to manual phenotyping approaches. Considering multiple silver labels and feature engineering options is recommended; for this phenotype, however, they had only minor effects on performance. Preliminary results indicate that models trained at one site may be transportable to other sites with little decrease in performance. Future work should further explore questions of cross-setting transportability, develop guidelines for formulating relevant silver-standard labels, and explore modeling of health outcomes labeled with varying degrees of certainty. As large-scale systems for medication surveillance such as FDA Sentinel increasingly incorporate EHR data elements, including structured and unstructured data, advances in scalable approaches for computable phenotyping may offer unique opportunities to facilitate timely and efficient execution of queries against these data.


Acknowledgments

Many thanks are due to members of the Sentinel Innovation Center Workgroup that provided critical feedback during development of this work.

Contributor Information

Joshua C Smith, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Brian D Williamson, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

David J Cronkite, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Daniel Park, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Jill M Whitaker, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Michael F McLemore, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Joshua T Osmanski, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Robert Winter, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.

Arvind Ramaprasan, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Ann Kelley, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Mary Shea, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Saranrat Wittayanukorn, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States.

Danijela Stojanovic, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States.

Yueqin Zhao, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States.

Sengwee Toh, Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States.

Kevin B Johnson, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States.

David M Aronoff, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, United States.

David S Carrell, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Author contributions

J.C.S. and D.S.C. conceptualized and led the study at VUMC and KPWA, respectively. J.C.S., D.S.C., B.D.W., D.J.C., D.P., J.T.O., R.W., A.R., S.W., D.S., Y.Z., K.B.J., and D.M.A. contributed to the design and planning of study methodology. Data acquisition and analysis were performed by J.C.S., D.S.C., B.D.W., D.J.C., D.P., J.M.W., M.F.M., J.T.O., R.W., A.R., A.K., M.S., and D.M.A. J.C.S., D.S.C., and B.D.W. wrote the initial draft and all authors contributed revisions and approved the final version of the manuscript.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was funded as part of the Sentinel Initiative and supported by Task Order 75F40119F19002 under Master Agreement 75F40119D10037 from the U.S. Food and Drug Administration (FDA). The work at Vanderbilt University Medical Center was also supported by CTSA Award No. UL1 TR002243 from the National Center for Advancing Translational Sciences. The views expressed in this work represent those of the authors and do not necessarily represent the official views of the U.S. FDA, the National Center for Advancing Translational Sciences, or the National Institutes of Health.

Conflicts of interest

None declared.

Data availability

The data underlying this article cannot be shared publicly due to institutional policies that protect the privacy of individuals whose data was used in the study.

References
