Abstract
Objective
Diagnosis codes documented in electronic health records (EHR) are often relied upon to clinically phenotype patients for biomedical research. However, these diagnoses can be incomplete and inaccurate, leading to false negatives when searching for patients with phenotypes of interest. This study aims to determine whether PheMAP, a comprehensive knowledgebase integrating multiple clinical terminologies beyond diagnosis to capture phenotypes, can effectively identify patients lacking relevant EHR diagnosis codes.
Materials and Methods
We investigated a collection of 3.5 million patient records from Vanderbilt University Medical Center’s EHR and focused on 4 well-studied phenotypes: (1) type 2 diabetes mellitus (T2DM), (2) dementia, (3) prostate cancer, and (4) sensorineural hearing loss. We applied PheMAP to match structured concepts in patient records and calculated a phenotype risk score (PheScore) to indicate patient-phenotype similarity. Patients meeting predefined PheScore criteria but lacking diagnosis codes were identified. Clinically knowledgeable experts adjudicated randomly selected patients per phenotype as Positive, Possibly Positive, or Negative.
Results
Our approach indicated that 5.3% of patients lacked a diagnosis for T2DM, 4.5% for dementia, 2.2% for prostate cancer, and 0.2% for sensorineural hearing loss. The expert review indicated 100% precision (for Possibly Positive or Positive cases) for dementia and sensorineural hearing loss, and 90.0% and 85.0% precision for T2DM and prostate cancer, respectively. Excluding Possibly Positive cases, the precision for T2DM and prostate cancer was 88.9% and 81.3%, respectively.
Conclusions
Leveraging clinical terminologies incorporated by PheMAP can effectively identify patients with phenotypes who lack EHR diagnosis codes, thereby enhancing phenotyping quality and related research reliability.
Keywords: electronic health records, phenotyping, diagnosis
Introduction
Electronic health records (EHR) store extensive, valuable, and intricate details of a patient’s medical journey in digital form. They enable efficient access for large-scale analyses and, thus, have become increasingly recognized as a data source for biomedical research.1,2 Patients’ records in EHR are now relied upon for a wide range of scientific investigations, including, but not limited to, phenotyping and genomics,3–6 clinical decision support,7,8 drug repurposing,9,10 patient outcome prediction,11,12 and clinical workflow mining.13,14 Research in these domains often commences with identifying sets of records that differentiate patients based on the presence or absence of specific phenotype(s). Diagnoses, usually represented by International Classification of Diseases (ICD) codes, serve as a prevalent and valuable resource for this endeavor.
To support more effective phenotyping using diagnosis codes, Phecodes5,15 were developed to group together distinct ICD codes that represent clinically related conditions. Due to its simplicity and efficiency, Phecode mapping has been widely used for patient phenotyping in numerous studies involving EHR data. However, reliance only on the presence (or absence) of diagnostic codes to ascertain a patient’s phenotype status can be problematic for several reasons. First, it is common for patients to receive care at multiple care sites that do not exchange information, leading to fragmented records.16 In one study by Bourgeois et al, it was shown that, within a 5-year observation window, over 30% of patients who sought treatment in acute care settings visited 2 or more hospitals, with minimal health information exchange between these hospitals.17 Second, diagnosis codes may be inaccurate because of clinical documentation and coding inaccuracies or errors.18–21 For example, in one study examining mental illness in Massachusetts, it was observed that EHR data were missing up to 28% of diagnosis codes for depression and bipolar disorder.22 Third, the practice of ICD code assignment can differ across institutions (eg, use of non-standard codes and insurance-based code assignment that cannot be captured by Phecode mapping), potentially leading to the omission of disease identifications.23 These missing diagnosis codes may lead to false negatives in a study, potentially undermining research findings, shrinking the already small cohorts for many phenotypes, and introducing biases into nearly all studies that require cohort construction.
Prior research has shown that utilizing multiple components of EHR data, such as phenotype-related laboratory tests and medications, offers a more comprehensive basis for characterizing and identifying phenotypes when compared to diagnosis codes alone.24 However, manually gathering such information is time-consuming and cost-prohibitive, particularly for large groups of phenotypes. As such, the PheMAP knowledgebase was developed to systematically incorporate information surrounding phenotypes, using online medical resources (eg, MedlinePlus, MedicineNet, and Wikipedia) that thoroughly describe the traits and treatment knowledge of known human disease phenotypes.25 It includes medical concepts spanning diagnoses, symptoms, clinical procedures, medications, and laboratory tests in patients’ records,25,26 encompassing multiple clinical terminologies. Moreover, it specifies the relevance of medical concepts related to each phenotype, which enables patient scoring—a PheScore—to measure the patient-phenotype matching level by searching these concepts in patients’ structured records. By leveraging this tool, PheMAP further offers the opportunity to redress the missingness of ICD codes that can lead to false negatives in research studies.
In this study, we investigate the extent to which the computable knowledge of clinical terminologies in the PheMAP knowledgebase can be relied upon to recognize patients whose records lack relevant diagnosis codes, thus addressing a critical challenge that extends beyond PheMAP’s original purpose of general phenotyping. We select 4 distinct disease phenotypes as a case study: (1) type 2 diabetes mellitus (T2DM), (2) dementia, (3) prostate cancer, and (4) sensorineural hearing loss. We assess the effectiveness of the PheMAP-based method by comparing its classifications to labels annotated by phenotyping experts from Vanderbilt University Medical Center (VUMC).
Materials and methods
Data source
In this study, we utilized all de-identified EHR data from the clinical database at VUMC,27 encompassing over 30 years of records from its inception up to August 2024 and covering all practices, facilities, and care settings. The EHR data were represented according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model. This study was approved by the Institutional Review Board at Vanderbilt University Medical Center (IRB #: 201434).
PheMAP knowledgebase
The PheMAP knowledgebase, established in 2020, incorporated information from 5 publicly available medical text sources—Mayo Clinic Patient Care & Health Information website, MedlinePlus, MedicineNet, WikiDoc, and Wikipedia—many of which are maintained by medical professionals.25,26 This process has been detailed extensively25,26 and can be summarized as follows: First, the article titles from the 5 text sources were matched to concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS).28 Second, CUIs were mapped to ICD codes via UMLS and then these ICD codes were mapped to Phecodes to associate them with phenotypes. Third, the content of articles corresponding to the same phenotype was then combined and parsed for CUI identification. Fourth, term frequency-inverse document frequency (TF-IDF) values were derived for each concept, quantifying their importance or relevance to a specific phenotype relative to all phenotypes by considering how frequently it appears in the phenotype of interest in relation to its occurrence across all phenotypes.
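For orientation, one common textbook form of the TF-IDF weight is sketched below. The exact weighting variant used to build PheMAP is described in the cited PheMAP publications and may differ (for example, in smoothing or normalization); the symbols tf, df, and N are notation introduced here for illustration only.

$$
\mathrm{TFIDF}(c, p) = \mathrm{tf}(c, p) \cdot \log \frac{N}{\mathrm{df}(c)}
$$

Here tf(c, p) is the number of times concept c appears in the combined articles for phenotype p, N is the total number of phenotypes, and df(c) is the number of phenotypes whose combined articles mention c. Under this form, a concept that appears often for one phenotype but rarely across the others receives a large weight for that phenotype.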
Then, to enable direct applications in patients’ EHR data, these concepts were mapped to multiple standard terminologies that are commonly used in EHR systems, including (1) International Classification of Diseases, Ninth and Tenth revisions, clinical modification (ICD-9-CM, ICD-10-CM) for diagnosis and procedure categorization; (2) International Classification of Diseases, Tenth revision, procedure coding system (ICD-10-PCS) for inpatient procedure classification; (3) Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) for a comprehensive list of medical terms to represent codes, synonyms, and definitions used in clinical documentation and reporting; (4) Current Procedural Terminology (CPT) for medical, surgical, and diagnostic procedures; (5) Logical Observation Identifiers Names and Codes (LOINC) for medical laboratory observations; and (6) RxNorm for US-approved medications. Terms mapped from a PheMAP concept share the same TF-IDF value as the concept itself. As evidence of the mapping efficacy, the diagnosis terms (for instance, ICD-10-CM codes starting with “E11” and ICD-9-CM codes beginning with “250”) that were directly mapped from the related phenotype (ie, T2DM) were observed to have the top TF-IDF values compared to all other mapped terms under the same phenotype.
The sum of TF-IDF values of the identified terms in the EHR data of a patient against a specific phenotype, defined as the PheScore, represents the degree to which the patient matches the phenotype. As such, these results can enable high-throughput phenotyping by facilitating automated searches for relevant medical terms in EHR data. Building upon this capacity, this research aims to leverage PheMAP to resolve a different yet significant challenge: identifying patients with a phenotype but lacking relevant diagnosis codes in their records.
Data extraction
We selected 4 biologically distinct phenotypes to assess whether the PheMAP knowledgebase can facilitate identification of likely cases lacking ICD-based diagnoses in patients’ records: (1) T2DM (Phecode: 250.2), (2) dementia (Phecode: 290.1), (3) prostate cancer (Phecode: 185; male only), and (4) sensorineural hearing loss (Phecode: 389.1). We extracted all relevant components from the EHR to calculate phenotype-specific PheScores, integrating information from the Condition, Procedure, Measurement, Drug, and Observation tables, encompassing the 6 standard terminologies (ie, ICD-9/10-CM, ICD-10-PCS, SNOMED-CT, CPT, LOINC, and RxNorm codes). No additional restrictions were imposed on these sources. For example, we included all assigned ICD-9-CM and ICD-10-CM codes, regardless of the intention of use within the system (eg, billing versus non-billing).
PheScore calculation
We utilized the PheMAP knowledgebase to calculate PheScores for each patient based on extracted EHR data as follows. First, for each phenotype, we searched the EHR data for the corresponding PheMAP terms within the terminologies noted above. Diagnosis terms falling within the exclusion range for each phenotype (eg, diagnosis term “type 1 diabetes” for phenotype T2DM) were excluded from the PheScore calculation to reduce the chance that patients with closely related but distinct phenotypes (eg, type 1 diabetes) would receive an elevated score for the target phenotype (eg, T2DM).15 The specific exclusion ranges used in this study are provided in Table S1. To mitigate the effect of diagnoses that were documented at different rates of occurrence, each EHR term was counted only once. Finally, to derive the final PheScores, we summed the TF-IDF scores of the matched terms for each of the phenotypes.
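The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `phescore`, the term-key format, and all weights in the example are hypothetical.

```python
from typing import AbstractSet, Dict, Iterable


def phescore(patient_terms: Iterable[str],
             phemap_weights: Dict[str, float],
             excluded_terms: AbstractSet[str] = frozenset()) -> float:
    """Sum the TF-IDF weights of the distinct PheMAP terms found in a record.

    patient_terms  : every coded term in the patient's structured EHR data
    phemap_weights : term -> TF-IDF weight for one phenotype (hypothetical format)
    excluded_terms : diagnosis terms in the phenotype's exclusion range
    """
    distinct = set(patient_terms)  # each EHR term is counted only once
    return sum(weight
               for term, weight in phemap_weights.items()
               if term in distinct and term not in excluded_terms)


# Illustrative weights only; these are NOT values from PheMAP.
t2dm_weights = {
    "RxNorm:6809": 12.0,     # metformin
    "CPT:82947": 8.5,        # blood glucose test
    "ICD10CM:E10.9": 20.0,   # type 1 diabetes, in the T2DM exclusion range
}
record = ["RxNorm:6809", "RxNorm:6809", "CPT:82947", "ICD10CM:E10.9"]
score = phescore(record, t2dm_weights, excluded_terms={"ICD10CM:E10.9"})
print(score)  # 20.5: metformin counted once, type 1 diabetes code ignored
```

Deduplicating before summing mirrors the count-once rule, so a patient with many repeated glucose tests scores the same as one with a single test.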
Identifying patients lacking a diagnosis
For each phenotype, we labeled a patient as lacking a diagnosis for a particular phenotype if they met 3 criteria: (1) their EHR record was devoid of any related ICD diagnosis code; (2) their PheScore was greater than or equal to a threshold, defined as the TF-IDF value of the corresponding diagnosis concept of the phenotype (ie, the highest TF-IDF value of a single concept for the target phenotype in PheMAP); and (3) the cumulative TF-IDF values of the top x% matched concepts (in terms of their TF-IDF values) surpassed y% of the patient’s total PheScore (referred to as the top-concept rule). To determine if the first criterion was violated, we mapped all of the ICD diagnosis codes for each patient to Phecodes.5,15,16 If any of the mapped Phecodes matched the target phenotype’s Phecode, or one of its sub-codes (eg, ICD-10-CM code E11.22 maps to Phecode 250.22, which falls under the T2DM Phecode 250.2), then the first criterion was considered to be violated. The third criterion was designed to exclude patients who have multiple matched concepts but lack any that strongly indicate the target phenotype (ie, the TF-IDF values of all matched concepts are relatively low compared to those associated with the target phenotype’s diagnosis codes). In this study, x and y were set to 30 and 70, respectively. For convenience, we refer to the resulting patient cohort as the lacking-diagnosis cohort. To facilitate comparison, the cohort of patients who only violated the first criterion was designated as the reference cohort.
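The 3 criteria can be expressed as a short Python sketch. Several details here are assumptions rather than the study's documented procedure: the function names are hypothetical, sub-codes are detected by string prefix, the number of "top x%" concepts is rounded up with a minimum of 1, and the PheScore is compared to the threshold with ≥ (as in the Table 2 row "Patients with PheScores ≥ threshold").

```python
import math
from typing import Iterable, List


def under_target(phecode: str, target: str) -> bool:
    """True if `phecode` equals the target Phecode or is one of its sub-codes."""
    # Assumption: sub-codes extend the target's string, eg 250.22 under 250.2.
    prefix = target if "." in target else target + "."
    return phecode == target or phecode.startswith(prefix)


def top_concept_rule(weights: List[float], x: float = 30.0, y: float = 70.0) -> bool:
    """Criterion 3: the top x% of matched concepts must exceed y% of the PheScore."""
    if not weights:
        return False
    ranked = sorted(weights, reverse=True)
    k = max(1, math.ceil(len(ranked) * x / 100))  # assumed rounding: ceiling, at least 1
    return sum(ranked[:k]) > sum(ranked) * y / 100


def lacks_diagnosis(patient_phecodes: Iterable[str],
                    matched_weights: Iterable[float],
                    target: str,
                    threshold: float,
                    x: float = 30.0,
                    y: float = 70.0) -> bool:
    """Apply the 3 criteria to one patient for one phenotype."""
    if any(under_target(code, target) for code in patient_phecodes):
        return False                      # criterion 1 violated: diagnosis present
    weights = list(matched_weights)
    if sum(weights) < threshold:
        return False                      # criterion 2 violated: PheScore too low
    return top_concept_rule(weights, x, y)


# Illustrative weights only; 31.2 is the T2DM threshold reported in this study.
print(lacks_diagnosis(["401.1"], [31.2, 1.3], target="250.2", threshold=31.2))  # True
```

In the example, one strongly indicative concept accounts for most of the score, so the top-concept rule passes. A patient with eight weakly related concepts of weight 5 each would reach the threshold but fail the rule.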
The distributions of age, sex, PheScore, and the number of matched terms were compared between the lacking-diagnosis and reference cohorts. Mann-Whitney U tests were performed for age, PheScore, and the number of matched terms. Chi-square tests were performed for sex.
Evaluation and expert review
For each phenotype, we randomly sampled 20 patients from the lacking-diagnosis cohort for reviews with clinically knowledgeable experts to determine the phenotypic status of each patient. Specifically, 2 experts (M.E.G. and V.E.K.) independently reviewed the patients’ clinical notes for all 4 phenotypes. Discrepancies in annotations were resolved by in-depth discussions between the 2 experts and a third expert (W.Q.W.). The experts labeled each patient as Positive, Negative, or Possibly Positive. Here, Positive (which corresponds to true positive) and Negative (which corresponds to false positive) refer to a definitive phenotyping status, whereas Possibly Positive denoted the scenario where relevant evidence was identified from clinical notes but was insufficient to draw a definitive conclusion. For example, if a patient’s clinical notes indicated the presence of the phenotype of interest without sufficiently strong supporting or objective evidence, the patient was labeled as Possibly Positive. Each expert provided a justification for their decisions by recording the note text in the patient’s chart that supported the phenotype label, which was relied upon to aid in the resolution of any discrepancies among the experts.
We assessed precision (or positive predictive value [PPV]) in the following 2 scenarios: (1) excluding patients with a Possibly Positive designation to only consider those labeled as Positive or Negative in calculations, and (2) including Possibly Positive patients and counting them as Positive. Then, to further interrogate the results, we analyzed the data of multiple patients who were correctly identified as lacking relevant ICD diagnosis codes by our method, focusing on the primary EHR terms contributing to their PheScores (ie, ICD-9/10-CM, ICD-10-PCS, SNOMED-CT, CPT, LOINC, and RxNorm codes). Additionally, we performed case-by-case analysis for those classified by our method as lacking relevant ICD diagnosis codes but were labeled as Negative or Possibly Positive by experts.
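The 2 precision scenarios reduce to simple arithmetic, sketched below. The function name `ppv` and its parameters are hypothetical; the example counts are the T2DM expert labels reported in Table 3 (16 Positive, 2 Negative, 2 Possibly Positive).

```python
def ppv(positive: int, negative: int, possibly_positive: int,
        count_possible_as_positive: bool) -> float:
    """Precision (PPV) in percent under one of the 2 scenarios."""
    if count_possible_as_positive:
        # Scenario 2: Possibly Positive patients are counted as Positive.
        true_pos = positive + possibly_positive
        total = positive + negative + possibly_positive
    else:
        # Scenario 1: Possibly Positive patients are dropped entirely.
        true_pos = positive
        total = positive + negative
    if total == 0:
        raise ValueError("no reviewed patients to score")
    return 100.0 * true_pos / total


# T2DM counts from Table 3.
print(round(ppv(16, 2, 2, count_possible_as_positive=True), 1))   # 90.0
print(round(ppv(16, 2, 2, count_possible_as_positive=False), 1))  # 88.9
```

Note that excluding Possibly Positive patients shrinks the denominator (18 rather than 20 for T2DM), which is why the 2 scenarios can differ even when the number of Negative labels is unchanged.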
Results
We assessed 3 541 543 patients with a mean (SD) age of 46.2 (25.6) years. Of this population, 1 886 084 (53.3%) and 1 654 581 (46.7%) were female and male, respectively. Table 1 presents the descriptive statistics for the 4 cohorts identified using relevant ICD diagnosis codes, providing a reference for subsequent comparisons. Type 2 diabetes mellitus and dementia corresponded to the highest (7.5%) and lowest (0.9%) prevalence, respectively. The medians (IQRs) of PheScores for T2DM, dementia, prostate cancer, and sensorineural hearing loss were 62.4 (34.9-99.7), 101.1 (55.2-153.9), 80.1 (57.6-133.6), and 25.3 (25.3-50.6), respectively. The T2DM cohort exhibited the highest median of the distinct number of matched terms in structured EHR data, due in part to a relatively larger set of comorbidities than the other phenotypes.
Table 1.
Characteristics of the cohorts for patients with ICD-based phenotypes.
| Characteristics (cohorts with ICD diagnosis codes present) | T2DM | Dementia | Prostate cancer (male only) | Sensorineural hearing loss |
|---|---|---|---|---|
| Cohort size (n [%]) | 266 332 (7.5%) | 33 299 (0.9%) | 38 203 (2.3%) | 85 701 (2.4%) |
| Age (mean ± SD) | 67.8 ± 17.3 | 84.1 ± 14.6 | 78.4 ± 11.4 | 60.4 ± 27.0 |
| Sex (Female [%]) | 131 260 (49.3%) | 18 535 (55.7%) | 0 (0.0%) | 42 512 (49.6%) |
| PheScore distribution (median [IQR]) | 62.4 (34.9-99.7) | 101.1 (55.2-153.9) | 80.1 (57.6-133.6) | 25.3 (25.3-50.6) |
| Distinct terms matched (median [IQR]) | 5 (3-8) | 4 (2-8) | 3 (2-5) | 1 (1-2) |
Age was calculated based on the time gap (in years) between a patient’s birth year and the year 2024.
Abbreviations: ICD, International Classification of Diseases; T2DM, type 2 diabetes mellitus.
Table 2 presents the characteristics of the reference and lacking-diagnosis cohorts for all 4 phenotypes. The PheScore thresholds were 31.2, 50.5, 56.0, and 25.3 for T2DM, dementia, prostate cancer, and sensorineural hearing loss, respectively. Among the patients with PheScores greater than or equal to the thresholds and satisfying the top-concept rule, 5.3% of the patients lacked a diagnosis code for T2DM, 4.5% for dementia, 2.2% for prostate cancer, and 0.2% for sensorineural hearing loss (ie, the lacking-diagnosis cohorts). Across all phenotypes, patients in the reference cohorts had significantly higher PheScores than those in the lacking-diagnosis cohorts (P < .05 for all 4 phenotypes), due in part to the presence of ICD diagnosis codes in the reference cohort. For T2DM, the lacking-diagnosis cohort had a higher proportion of female patients than the reference cohort (51.2% vs 50.0%). Additionally, the lacking-diagnosis cohorts for all phenotypes were older than their corresponding reference cohorts, and the difference was statistically significant for every phenotype except sensorineural hearing loss. The reference cohorts also had more matched terms than the lacking-diagnosis cohorts (P < .05 for all 4 phenotypes).
Table 2.
Characteristics of patient subpopulations.
| Patient subpopulation | T2DM | Dementia | Prostate cancer | Sensorineural hearing loss |
|---|---|---|---|---|
| Threshold of PheScores | 31.2 | 50.5 | 56.0 | 25.3 |
| Patients with PheScores ≥ threshold (n) | 290 203 | 39 331 | 38 909 | 85 913 |
| Sum of patients in the reference cohort and lacking-diagnosis cohort (n) | 215 772 | 24 939 | 29 717 | 52 371 |
| Reference cohort | ||||
| Cohort size (n [%]) | 204 391 (94.7%) | 23 817 (95.5%) | 29 065 (97.8%) | 52 272 (99.8%) |
| Age (mean ± SD) | 67.5 ± 17.8 | 83.7 ± 14.8 | 78.2 ± 11.8 | 58.1 ± 28.7 |
| Sex (female [%]) | 102 100 (50.0%) | 13 055 (54.8%) | 0 (0.0%) | 25 790 (49.3%) |
| PheScore distribution (median [IQR]) | 45.4 (33.0-75.5) | 63.2 (53.1-112.8) | 74.3 (56.0-128.5) | 25.3 (25.3-25.3) |
| Distinct terms matched (median [IQR]) | 4 (2-7) | 5 (2-9) | 2 (1-5) | 1 (1-1) |
| Lacking-diagnosis cohort | ||||
| Cohort size (n [%]) | 11 381 (5.3%) | 1122 (4.5%) | 652 (2.2%) | 99 (0.2%) |
| Age (mean ± SD) | 70.8 ± 15.6*** | 87.9 ± 12.2*** | 82.2 ± 9.5*** | 64.1 ± 23.5 |
| Sex (female [%]) | 5831 (51.2%)*** | 652 (58.1%) | 0 (0.0%) | 41 (41.4%) |
| PheScore distribution (median [IQR]) | 32.5 (31.2-34.3)*** | 50.5 (50.5-54.2)*** | 56.0 (56.0-57.6)*** | 25.3 (25.3-25.3)* |
| Distinct terms matched (median [IQR]) | 2 (1-3)*** | 1 (1-2)*** | 1 (1-2)*** | 1 (1-1)* |
The distributions of age, sex, PheScore, and the number of matched terms were compared between the reference and lacking-diagnosis cohorts. The Mann-Whitney U test was performed for age, PheScore, and the number of matched terms. The chi-square test was performed for sex. Significance levels are denoted as follows: *** for P < .001, ** for P < .01, and * for P < .05. These levels are indicated only within the cells of the lacking-diagnosis cohorts.
Abbreviation: T2DM, type 2 diabetes mellitus.
Among the 20 patients randomly selected from the lacking-diagnosis cohort for each phenotype, the experts confirmed 16 (80.0%) as Positive for T2DM, 16 (80.0%) for dementia, 13 (65.0%) for prostate cancer, and 19 (95.0%) for sensorineural hearing loss (Table 3). When casting Possibly Positive as Positive, the precision for dementia (with 4 labeled as Possibly Positive) and sensorineural hearing loss (with 1 labeled as Possibly Positive) was 100%, identical to precision when excluding Possibly Positive patients. For T2DM, the precision was 90.0% when casting Possibly Positive as Positive and 88.9% when excluding them. For prostate cancer, the precision was 85.0% when casting Possibly Positive as Positive and 81.3% when excluding them.
Table 3.
Precision for identifying patients lacking relevant ICD diagnosis codes compared to expert review.
| Phenotype | Labeled positives | Labeled negatives | Labeled possibly positive | PPV, casting possibly positive as positive (%) | PPV, excluding possibly positive (%) | Error rate, casting possibly positive as positive (%) | Error rate, excluding possibly positive (%) |
|---|---|---|---|---|---|---|---|
| T2DM | 16 | 2 | 2 | 90.0 | 88.9 | 10.0 | 11.1 |
| Dementia | 16 | 0 | 4 | 100 | 100 | 0 | 0 |
| Prostate cancer | 13 | 3 | 4 | 85.0 | 81.3 | 15.0 | 18.7 |
| Sensorineural hearing loss | 19 | 0 | 1 | 100 | 100 | 0 | 0 |
Abbreviations: ICD, International Classification of Diseases; PPV, positive predictive value; T2DM, type 2 diabetes mellitus.
Table 4 presents 4 examples correctly identified using PheMAP knowledge, one patient for each phenotype. There are multiple notable findings. First, highly relevant clinical observations and/or suspected (or preliminary) conditions represented by SNOMED-CT codes were matched in the records of the example Positive patients. For instance, the fourth patient was assigned 1 SNOMED-CT code related to deafness and another related to auditory vertigo, which together contributed 70.1% of the patient’s PheScore. The second patient was assigned a dementia SNOMED-CT code, which could reflect clinical documentation of a possible diagnosis that needs further follow-up. Second, highly relevant clinical procedures represented by CPT codes were matched in the records of the example Positive patients for prostate cancer (prostate antigen assay), sensorineural hearing loss (immunohistochemistry), and T2DM (blood glucose level). This finding is consistent with expectations because the results of these phenotype-specific procedures were directly used by clinicians to diagnose patients. Third, multiple comorbidities represented by ICD codes were also matched (eg, a Meniere’s disease ICD code for the patient with sensorineural hearing loss; obesity and cerebral thrombosis ICD codes for the patient with T2DM; cerebral degeneration and occlusion and stenosis of carotid artery ICD codes for the patient with dementia). Fourth, the phenotype of interest was documented in the problem list of the respective example patient.
Table 4.
Examples of patients successfully identified by PheMAP as lacking a diagnosis.
| Patient index | Phenotype | PheScore | Notable contributing terms of the PheScore | Evidence extracted from expert review |
|---|---|---|---|---|
| 1 | T2DM | 55.6 | | One of the notes associated with a congenital heart disease clinic visit indicates that this patient had a history of type 2 diabetes. T2DM was also documented in the patient’s problem list. |
| 2 | Dementia | 73.6 | | One of the notes associated with an outpatient visit reveals that this patient had advanced dementia and was unable to recall any events. Dementia was also documented in the patient’s problem list. |
| 3 | Prostate cancer | 74.3 | | One of the notes associated with an endocrine and diabetes clinic visit reveals that this patient had a malignant tumor in the prostate and underwent a prostatectomy 6 years ago. Prostate cancer was also documented in the patient’s problem list. |
| 4 | Sensorineural hearing loss | 43.5 | | One of the consultation notes indicates that this patient had Meniere’s disease with sensorineural hearing loss. Hearing loss was also documented in the patient’s problem list. |
Abbreviations: ICD, International Classification of Diseases; SNOMED-CT, Systematized Nomenclature of Medicine Clinical Terms; TF-IDF, term frequency-inverse document frequency; T2DM, type 2 diabetes mellitus.
Next, we performed a case-by-case error analysis for all patients who were labeled as Negative by the experts, providing the top contributing terms matched by PheMAP and patient-specific comments from expert reviewers. For T2DM (Table S2), 2 patients were incorrectly identified as lacking relevant ICD diagnosis codes. One patient had multiple hypoglycemia-related ICD codes and CPT codes for blood glucose levels, which collectively contributed to approximately 77% of the PheScore, already exceeding the predefined threshold (ie, 31.2). While the patient’s structured EHR data included a metformin RxNorm code and 2 ICD codes for abnormal weight loss, there was no indication of T2DM in the clinical notes, aside from a mention of family history of diabetes. The other patient was assigned a diabetes mellitus onset SNOMED-CT code; however, this was noted only in the patient’s problem list, with no supporting evidence of a personal T2DM history in the clinical notes. For prostate cancer (Table S3), we found that the 3 false positive classifications were primarily due to (1) 2 patients being assigned a prostate cancer-related SNOMED-CT code, likely from incorrect coding for problem lists, and (2) 2 patients having prostate-specific antigen CPT codes, though with negative results. These matched terms dominated the PheScores which surpassed the predefined threshold (ie, 56.0).
We further extended the case-by-case analysis to patients labeled as Possibly Positive by experts. Each of these patients was assigned a phenotype-specific SNOMED-CT code, which, by itself, met the PheScore threshold (Tables S2-S5). The phenotype-specific evidence was observed in multiple note types for these patients, such as the problem list (including the problem list section in other note types like Progress Report and the notes named as Problem List), Outside Medical Record report, and Clinic Visit Summary. Beyond being mentioned in a single location within a specific note or section, such as the Problem List, which may lack context, patients also had at least one other piece of supporting evidence. For example, aside from being documented as having a “malignant tumor of prostate cancer” in a problem list note, the fourth patient in Table S3 showed elevated prostate-specific antigen levels and a mildly enlarged prostate, leading to a classification of Possibly Positive. The only Possibly Positive patient for sensorineural hearing loss had clinician documentation of “sensorineural hearing loss?—right ear” recorded in a clinical note associated with an internal medicine clinic visit (Table S4).
Discussion
In this study, we leverage the PheMAP knowledgebase to identify patients lacking relevant ICD diagnosis codes in their records. Using computable PheMAP knowledge achieved a high precision for dementia and sensorineural hearing loss (ie, >95%), and reasonably good precision for T2DM and prostate cancer (ie, >80%) across the entire EHR database at VUMC. These results highlight the potential of PheMAP as an effective strategy to enhance the accuracy of diagnostic information in EHRs, which, in turn, can improve the reliability of research with EHR data.
Our investigation suggests quantitatively that solely relying on EHR diagnosis codes for phenotyping can be problematic. We found that approximately 5.3% of patients with T2DM-specific PheScore above the predefined threshold, who also met the top-concept rule, lacked a T2DM diagnosis in their records, highlighting a potential for significant false negatives in analyses. Indeed, expert review confirmed that the majority of these patients had T2DM. While the proportions of patients lacking a diagnosis for different phenotypes varied, PheMAP remained an effective overall tool for identifying additional positive cases; this efficacy is likely due to its unique mechanism that relies on multiple EHR components, a benefit that has been confirmed by previous research.24 Patients who had multiple closely relevant EHR terms surrounding the phenotype of interest were more likely to be an actual case, even if not identified as such by ICD diagnosis codes. Correctly identifying these patients may redress some of the existing bias in clinical and artificial intelligence and machine learning (AI/ML)-based research, even if the numbers of individuals are relatively small compared with the overall counts.
The detailed examination of the Negative patients for T2DM and prostate cancer spotlights several opportunities for further refining the outlined method. We observed 2 major types of matched EHR terms that contributed to the high PheScores of these patients. The first type involved phenotype-specific measurements, such as blood glucose level tests for T2DM and prostate-specific antigen tests for prostate cancer. Currently, the presence of these measurements contributes to the PheScore calculation, regardless of whether the results are normal or abnormal. Measurements with normal results can therefore inflate patients’ PheScores. Accounting for whether each measurement result falls within the normal range (which could be derived automatically with the help of large language models29,30) when calculating patients’ PheScores could enhance the effectiveness of identifying patients lacking a diagnosis.
The second type of potentially misleading matched terms corresponds to disease concepts that are closely related to the phenotype of interest but more accurately characterize other phenotypes. These concepts fall outside of the exclusion range of the corresponding Phecode, so PheMAP retained them as part of the updated knowledgebase. Two representative examples associated with large TF-IDF scores are (1) hypoglycemia-related ICD codes for T2DM and (2) pulmonary hypertension and heart disease ICD codes for prostate cancer. Removing these concepts from the corresponding PheMAP phenotypes could decrease the occurrence of false positives.
It is important to note that structured EHR data collected by different healthcare organizations exhibit different patterns, so the distributions of phenotype-specific PheScores could differ. Multiple key factors contribute to this variance. First, the degree to which patients’ EHR data are fragmented in healthcare organizations differs geographically and largely depends on the number of healthcare facilities a patient visits and how well these facilities share information. Second, the history lengths of patients’ records can also differ, even with similar fragmentation levels, reflecting variations in the health status of patient populations. Third, healthcare organizations usually follow their own practices for code assignment within the EHR. Therefore, when implementing our method in local EHR databases, it is highly recommended that healthcare organizations establish site-specific top-concept rules and PheScore thresholds to accommodate their tolerance for noise.
The phenotype-specific PheScore threshold and the top-concept rule were set to reflect the confidence level in identifying patients lacking a diagnosis. Specifically, if a patient’s PheScore exceeded the maximum TF-IDF value of the relevant phenotype concept by a constant factor (we set this to 1) and satisfied the predefined top-concept rule, they could be classified as lacking a diagnosis for that phenotype with the corresponding confidence level. These two gates ensured that the patient’s record contained a sufficient number of highly related concepts (eg, relevant medication prescriptions, laboratory tests, symptoms, and procedures) to support this classification. Selecting the appropriate phenotype-specific threshold and top-concept rule is crucial for effective patient cohort construction. For example, a relatively high threshold and an aggressive top-concept rule (requiring the top-matched terms with the highest TF-IDF values to constitute a larger portion of the patient’s PheScore) are preferable if the goal is to increase the size of the established case cohort by including more patients with high confidence. Conversely, a lower threshold and a less aggressive rule are better suited for excluding false negative patients from an established control cohort.
Because it is computable, the PheMAP knowledgebase is designed to scale the identification of patients lacking relevant diagnosis codes across all phenotypes. As a result, healthcare organizations that implement this solution would be able to assess the reliability of the diagnosis codes in their EHR systems and subsequently make necessary adjustments in their research databases. Additionally, our approach could potentially be used as an AI copilot that suggests diagnosis codes to clinicians in real time, thereby helping to prevent important diagnosis codes from being missed in documentation, especially for uncommon diseases. Finally, our approach has the potential to assist utilization review teams by identifying likely underreported ICD codes, reducing the manual burden of uncovering missing diagnoses and improving the completeness of coded clinical data, a key objective of utilization review.
While our results are encouraging, this study has several limitations that present opportunities for future research. First, this study was restricted to an evaluation of 4 phenotypes. Because expert chart review requires non-trivial effort, the advanced summarization capabilities of large language models31 could facilitate evaluation. The expert review process could be significantly accelerated by developing prompt engineering strategies and techniques to mitigate hallucinations, thereby allowing for a more comprehensive and expedient evaluation of PheMAP performance across all phenotypes. Second, this study involved data from a single healthcare organization, which may not fully represent PheMAP’s effectiveness across different settings. Expanding the research to include healthcare organizations with different characteristics (eg, urban versus rural, community versus regional, and general versus specialized facilities) could provide more comprehensive insights into its applicability. Third, the current algorithm for calculating the PheScore counts each matched term only once, regardless of how many times it occurred. This may underweight significant recurrent terms, such as measurements with abnormal results. Further research is needed to enhance the algorithm for deriving phenotype-specific PheScores to better capture the value of repeated important terms, thereby optimizing the effectiveness of PheMAP.
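The third limitation can be made concrete with a short Python sketch. The count-once function mirrors the behavior described above, under the assumption that the PheScore is a sum of TF-IDF weights over matched terms. The log-scaled variant is purely hypothetical: it is one of many possible ways to give repeated terms more weight and is not a method proposed or evaluated in this study. The concept names and weights are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def phescore_count_once(occurrences: Iterable[str], weights: Dict[str, float]) -> float:
    """Each matched term contributes its weight once, however often it recurs."""
    return sum(weights[c] for c in set(occurrences) if c in weights)


def phescore_log_frequency(occurrences: Iterable[str], weights: Dict[str, float]) -> float:
    """Hypothetical alternative: scale each weight by 1 + ln(number of occurrences)."""
    counts = Counter(c for c in occurrences if c in weights)
    return sum(weights[c] * (1.0 + math.log(n)) for c, n in counts.items())


if __name__ == "__main__":
    # Illustrative weights; not taken from PheMAP.
    weights = {"abnormal_hba1c": 2.0, "metformin": 3.0}
    single = ["abnormal_hba1c"]
    repeated = ["abnormal_hba1c"] * 5

    # Count-once scoring cannot distinguish one abnormal result from five.
    print(phescore_count_once(single, weights), phescore_count_once(repeated, weights))
    # The frequency-aware variant assigns the repeated record a higher score.
    print(phescore_log_frequency(single, weights), phescore_log_frequency(repeated, weights))
```

A logarithmic scale is used here only so that very frequent terms do not dominate the score; any such weighting would need to be validated against expert review before use.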
Conclusion
Using ICD-based diagnoses alone for EHR phenotyping can introduce inaccuracies in cohort formation due to chart fragmentation and coding errors. Leveraging PheMAP knowledge can improve the identification of patients with a phenotype by combining multiple EHR components, thereby limiting false negatives and helping to minimize bias. As such, PheMAP offers a valuable contribution to efforts to enhance the reliability of EHR-based research. Future opportunities include refining the PheMAP concept matching process by incorporating additional information, such as clinical measurement results, and expanding and evaluating high-throughput analysis across a broader range of diseases and healthcare organizations with diverse characteristics.
Contributor Information
Chao Yan, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Monika E Grabowska, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Rut Thakkar, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Alyson L Dickson, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
Peter J Embí, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
QiPing Feng, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
Joshua C Denny, All of Us Research Program, National Institutes of Health, Bethesda, MD 20892, United States.
Vern Eric Kerchberger, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
Bradley A Malin, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Wei-Qi Wei, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States.
Author contributions
Chao Yan (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review & editing), Monika E. Grabowska (Investigation, Validation, Writing—review & editing), Rut Thakkar (Validation), Alyson L. Dickson (Investigation, Writing—review & editing), Peter J. Embí (Writing—review & editing), QiPing Feng (Writing—review & editing), Joshua C. Denny (Writing—review & editing), Vern Eric Kerchberger (Investigation, Validation, Writing—review & editing), Bradley A. Malin (Investigation, Writing—review & editing), Wei-Qi Wei (Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing—review & editing)
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was supported by the National Institutes of Health (grant numbers R01HL171809, R01AG084550, R01HL133786, R01GM139891, R01AG069900, U01HG011181, P50HD106446, R01LM012806, and K99LM014428). The dataset used for the analyses was obtained from Vanderbilt University Medical Center’s Synthetic Derivative, which is supported by institutional funding and by the National Center for Advancing Translational Sciences (grant number UL1 TR002243).
Conflicts of interest
All authors have no competing interests to declare.
Data availability
Limited primary cohort data are available by request to the corresponding author, Dr Wei-Qi Wei, at wei-qi.wei@vumc.org, pending VUMC approval and a data use agreement. Example source code for the analysis is shared at https://github.com/The-Wei-Lab/PheMAP_patients_lacking_diagnosis.
References
- 1. Adler-Milstein J, Jha AK. HITECH Act drove large gains in hospital electronic health record adoption. Health Aff. 2017;36:1416-1422.
- 2. Shortreed SM, Cook AJ, Coley RY, Bobb JF, Nelson JC. Challenges and opportunities for using big healthcare data to advance medical science and public health. Am J Epidemiol. 2019;188:851-861.
- 3. Wei WQ, Denny JC. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015;7:41.
- 4. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26:1205-1210.
- 5. Wei WQ, Bastarache LA, Carroll RJ, et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS One. 2017;12:e0175508.
- 6. Kho AN, Pacheco JA, Peissig PL, et al. Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med. 2011;3:79re1.
- 7. Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3:17.
- 8. Pawloski PA, Brooks GA, Nielsen ME, Olson-Bullis BA. A systematic review of clinical decision support systems for clinical oncology practice. J Natl Compr Canc Netw. 2019;17:331-338.
- 9. Yan C, Grabowska ME, Dickson AL, et al. Leveraging generative AI to prioritize drug repurposing candidates for Alzheimer’s disease with real-world clinical validation. NPJ Digit Med. 2024;7:46.
- 10. Zhou Y, Wang F, Tang J, Nussinov R, Cheng F. Artificial intelligence in COVID-19 drug repurposing. Lancet Digit Health. 2020;2:e667-e676.
- 11. Goldstein BA, Navar AM, Pencina MJ, Ioannidis J. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc. 2017;24:198-208.
- 12. Zhang Z, Yan C, Zhang X, Nyemba SL, Malin BA. Forecasting the future clinical events of a patient through contrastive learning. J Am Med Inform Assoc. 2022;29:1584-1592.
- 13. Yan C, Chen Y, Li B, Liebovitz D, Malin B. Learning clinical workflows to identify subgroups of heart failure patients. AMIA Annu Symp Proc. 2016;2016:1248-1257.
- 14. Zhang Y, Padman R, Patel N. Paving the COWpath: learning and visualizing clinical pathways from electronic health record data. J Biomed Inform. 2015;58:186-197.
- 15. Wu P, Gifford A, Meng X, et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med Inform. 2019;7:e14325.
- 16. Wei WQ, Leibson CL, Ransom JE, et al. Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus. J Am Med Inform Assoc. 2012;19:219-224.
- 17. Bourgeois FC, Olson KL, Mandl KD. Patients treated at multiple acute health care facilities: quantifying information fragmentation. Arch Intern Med. 2010;170:1989-1995.
- 18. Horsky J, Drucker EA, Ramelson HZ. Accuracy and completeness of clinical coding using ICD-10 for ambulatory visits. AMIA Annu Symp Proc. 2017;2017:912-920.
- 19. Banerjee D, Chung S, Wong EC, Wang EJ, Stafford RS, Palaniappan LP. Underdiagnosis of hypertension using electronic health records. Am J Hypertens. 2012;25:97-102.
- 20. Khajouei R, Abbasi R, Mirzaee M. Errors and causes of communication failures from hospital information systems to electronic health record: a record-review study. Int J Med Inform. 2018;119:47-53.
- 21. Callahan A, Shah NH, Chen JH. Research and reporting considerations for observational studies using electronic health record data. Ann Intern Med. 2020;172:S79-S84.
- 22. Madden JM, Lakoma MD, Rusinak D, Lu CY, Soumerai SB. Missing clinical and behavioral health data in a large electronic health record (EHR) system. J Am Med Inform Assoc. 2016;23:1143-1149.
- 23. Wei WQ, Leibson CL, Ransom JE, Kho AN, Chute CG. The absence of longitudinal data limits the accuracy of high-throughput clinical phenotyping for identifying type 2 diabetes mellitus subjects. Int J Med Inform. 2013;82:239-247.
- 24. Wei WQ, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J Am Med Inform Assoc. 2016;23:e20-e27.
- 25. Zheng NS, Feng Q, Kerchberger VE, et al. PheMAP: a multi-resource knowledge base for high-throughput phenotyping within electronic health records. J Am Med Inform Assoc. 2020;27:1675-1687.
- 26. Wan NC, Yaqoob AA, Ong HH, Zhao J, Wei W-Q. Evaluating resources composing the PheMAP knowledge base to enhance high-throughput phenotyping. J Am Med Inform Assoc. 2023;30:456-465.
- 27. Danciu I, Cowan JD, Basford M, et al. Secondary use of clinical data: the Vanderbilt approach. J Biomed Inform. 2014;52:28-35.
- 28. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267-270.
- 29. Yan C, Ong HH, Grabowska ME, et al. Large language models facilitate the generation of electronic health record phenotyping algorithms. J Am Med Inform Assoc. 2024;31:1994-2001.
- 30. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA. 2023;330:866-869.
- 31. Van Veen D, Van Uden C, Blankemeier L, et al. Clinical text summarization: adapting large language models can outperform human experts. Nat Med. 2024;30:1134-1142.
