Abstract
Background and Aim:
Guidelines recommend risk stratification scores in patients presenting with gastrointestinal bleeding (GIB), but such scores are uncommonly employed in practice. Automation and deployment of risk stratification scores in real time within electronic health records (EHRs) would overcome a major impediment. This requires an automated mechanism to accurately identify (“phenotype”) patients with GIB at the time of presentation. Our aim was to develop and evaluate EHR-based phenotyping algorithms to identify emergency department (ED) patients with acute GIB.
Methods:
We specified criteria using structured data elements to create rules for identifying patients with GIB and also developed multiple natural language processing (NLP)-based approaches for automated phenotyping. We tested these approaches with tenfold cross-validation for 10 iterations (n = 7144) and external validation (n = 2988) and compared them with a standard method for identifying patient conditions, the Systematized Nomenclature of Medicine (SNOMED). The gold standard for GIB diagnosis was independent dual manual review of medical records. The primary outcome was the positive predictive value.
Results:
A decision rule using GIB-specific terms from ED triage and the ED review-of-systems assessment performed better than the Systematized Nomenclature of Medicine on internal and external validation (positive predictive value 85%, confidence interval [CI]: 83%–87% vs 69%, CI: 66%–72%; P < 0.001). The syntax-based NLP algorithm and the Bidirectional Encoder Representation from Transformers neural network-based NLP algorithm had performance similar to the structured-data-field decision rule.
Conclusions:
An automated decision rule employing GIB-specific triage and review-of-systems terms can be used to trigger EHR-based deployment of risk stratification models to guide clinical decision making in real time for patients with acute GIB presenting to the ED.
Keywords: Gastrointestinal hemorrhage, Informatics, Natural language processing
Introduction
Acute gastrointestinal (GI) bleeding (GIB) is the most common GI diagnosis requiring hospital admission in the USA.1 Guidelines for upper and lower GIB recommend risk stratification of patients, including the use of risk assessment scores.2–4 Although many risk stratification tools have been developed and validated, they are not commonly used in real-world clinical practice partly because providers must manually enter a variety of variables into the scoring system. Widespread electronic health record (EHR) adoption makes it possible to automatically deploy risk stratification scores within the clinical workflow for acute GIB; however, in order to embed risk stratification models into EHRs and deploy them in real time, patients must first be correctly identified. This process of accurately identifying patients, called phenotyping, is used for any study that seeks to reliably group patients with a specific diagnosis or condition, from surveillance studies to comparative effectiveness research. Phenotyping is typically the first step in developing and validating risk stratification models within EHRs.5 Such processes have been used to improve the accuracy of case definition of inflammatory bowel disease patients as Crohn’s disease or ulcerative colitis, as well as to facilitate clinical trial recruitment and deploy randomized controlled trials.6,7
Unlike many conditions that require multiple elements of the record (laboratory testing, reported symptoms, and biometrics such as vital signs) for diagnosis, acute GIB is a condition that can be directly and clearly identified using a limited number of terms by patient report or provider evaluation. To our knowledge, no previous study has explored the early identification of patients with acute GIB with an EHR-based model or its implementation within standard EHR workflow.
Phenotypes utilize both structured and unstructured data and are typically applied retrospectively, after clinical encounters have ended (e.g. ICD codes are a popular component of phenotypes). If the goal is to identify patients who would benefit from predictive models tailored to a particular condition, EHR phenotypes must use data elements generated during the visit. The Systematized Nomenclature of Medicine (SNOMED) is a comprehensive international clinical terminology that is the standard for encoding patient conditions in the EHR. Other phenotyping approaches include decision rules using specific data elements (e.g. triage diagnosis) and machine learning approaches that utilize unstructured clinical text through natural language processing (NLP). Machine learning models use computational modeling to learn from data, and their performance can improve with an increasing amount of data. NLP is a set of tools for extracting data from narrative text; it uses syntactic processing, the order and arrangement of words in sentences and phrases, and semantic analysis to capture the meaning of the text. The empirical performance of these tools provides a useful comparison with decision rules. For prediction, however, a “screening” phenotype using data elements likely to be entered close to real time may be better than a tool that may be delayed (e.g. by delays in note writing).8
Our study aimed to accurately identify patients with acute GIB reported or witnessed in the emergency department, such that the identification of this phenotype can occur in real time in the EHR to subsequently launch predictive models for risk stratification. We chose to use the current standard for phenotyping patients, SNOMED, as the comparator, even though SNOMED does not provide real-time phenotyping in the emergency department.
Methods
We began by creating a sensitive data mart, a patient dataset selected using specified criteria, to screen for all patients presenting with GIB from 2014 to 2017 in the Yale New Haven Health System EHR (Epic). Creating a sensitive data mart allows exclusion of patients with no evidence of the phenotype and adequate handling of the volume, heterogeneity, and velocity of the data.9 To create the data mart, we screened for patients with data suggesting the phenotype of overt GIB. To maximize the capture of relevant data, we defined the process by which patients were evaluated in the emergency department and the common time periods for data entry. We also identified the points at which a diagnosis could be entered in the EHR throughout the hospital stay: hospital problem list, encounter diagnosis, admission diagnosis, and hospital billing diagnosis.
There were four categories of screening criteria, which were selected on the basis of existing identifiers in the EHR used to denote GIB (Fig. 1).
Figure 1.

Screening criteria used to create data mart for acute gastrointestinal (Gl) bleeding (GIB). SNOMED, Systematized Nomenclature of Medicine.
Development of electronic health record gastrointestinal bleeding phenotypes.
We developed rule-based algorithms and two machine learning-based algorithms, one using syntactic NLP analysis and the other using a Bidirectional Encoder Representation from Transformers (BERT) neural network NLP model, and compared their performance with that of SNOMED-only classification (Fig. 2).
Figure 2.

Rule-based algorithms and machine learning (natural language processing [NLP]) algorithm. GI, gastrointestinal; GIB, GI bleeding; ROS, review of systems.
To determine the specific data elements included in the rule-based algorithms, we analyzed the clinical workflow to identify two points where relevant diagnosis or symptom data were entered into the EHR. The first point of data entry was triage, where a nurse selected presenting diagnoses from a drop-down list of prespecified diagnoses. We reviewed all triage diagnoses to identify those that were GI diagnoses. The second data entry point was the review of systems (ROS) section in the note template used for all emergency department patients, which contains elements referring to overt GIB. Both the triage diagnosis and the ROS provide structured data fields that can be coded as 0 (not present) or 1 (present). We hypothesized that terms specific to GIB might improve performance and therefore created two decision rules. Decision rule 1 was positive if a GIB ROS field (e.g. hematemesis or blood in stool) was positive or if any GI triage term was present. Decision rule 2 was positive if a GIB ROS field was positive or if a GIB-specific triage term was present.
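The two decision rules reduce to simple Boolean logic over the structured triage and ROS fields. A minimal sketch is below; the term sets and field names are illustrative assumptions, since the actual Epic triage picklist and ROS field identifiers are site-specific and not reported here:

```python
# Illustrative term sets (hypothetical names); the real triage picklist and
# ROS fields are site-specific Epic structured data elements.
GI_TRIAGE_TERMS = {"gi bleeding", "hematemesis", "melena", "vomiting blood",
                   "rectal bleeding", "abdominal pain", "nausea"}   # all GI terms
GIB_TRIAGE_TERMS = {"gi bleeding", "hematemesis", "melena",
                    "vomiting blood", "rectal bleeding"}            # bleed-specific
ROS_GIB_FIELDS = {"hematemesis", "blood_in_stool", "melena"}        # 1 = present

def decision_rule_1(triage_dx: set, ros_positive: set) -> bool:
    """Positive if any GI triage term OR any GIB ROS field is positive."""
    return bool(triage_dx & GI_TRIAGE_TERMS) or bool(ros_positive & ROS_GIB_FIELDS)

def decision_rule_2(triage_dx: set, ros_positive: set) -> bool:
    """Positive only if a GIB-specific triage term OR a GIB ROS field is positive."""
    return bool(triage_dx & GIB_TRIAGE_TERMS) or bool(ros_positive & ROS_GIB_FIELDS)
```

A patient triaged with only "abdominal pain" and a negative ROS would be flagged by rule 1 but not rule 2, which illustrates where rule 2's higher specificity comes from.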
The syntax-based NLP approach included preprocessing of the unstructured text of notes written by physician providers in the emergency department using scispaCy, a Python library for advanced NLP that breaks text down into smaller unique parts (tokens), a strategy applied to other biomedical text data.10–12 Classification was performed using random forest, support vector machine, and elastic net classifiers.
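To illustrate the preprocessing step, the sketch below tokenizes note text and builds count-based feature vectors. A simple regex tokenizer stands in for scispaCy's biomedical tokenizer; in the study, vectors like these would feed the random forest, support vector machine, and elastic net classifiers:

```python
import re
from collections import Counter

def tokenize(note: str) -> list:
    """Lowercase word tokenizer; a simple stand-in for scispaCy's tokenizer."""
    return re.findall(r"[a-z]+", note.lower())

def build_vocab(notes) -> dict:
    """Map each token seen in the training notes to a feature index."""
    return {tok: i for i, tok in
            enumerate(sorted({t for note in notes for t in tokenize(note)}))}

def to_vector(note: str, vocab: dict) -> list:
    """Term-count feature vector aligned to the training vocabulary."""
    counts = Counter(tokenize(note))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec
```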
The NLP approach used to capture meaning, or semantic information, from physician notes was the BERT model.13 BERT is a neural network transformer-based model that generates embeddings for sentences to capture their meaning. We used Clinical BERT, a variant of BERT fine-tuned on biomedical and clinical text corpora (MIMIC-III and PubMed).14
We performed sensitivity analyses for hematemesis and/or melena and for hematochezia. For hematemesis and/or melena, decision rule 2 was modified to include only the triage diagnoses GI bleeding, hematemesis, melena, and vomiting blood and the ROS elements melena and hematemesis. For hematochezia, decision rule 2 was modified to include only the triage diagnoses GI bleeding, major rectal bleeding, and rectal bleeding and the ROS elements blood in stool, anal bleeding, rectal bleeding, and hematochezia.
Evaluation of phenotype.
To assess the performance of the different phenotypes, we performed manual note review to create a gold standard. Two clinical domain experts (DS and CT) reviewed the medical records of all patients and classified each patient as having the phenotype or not on the basis of expert opinion and a prespecified structured evaluation (Table 1). We further categorized the acute bleeding by symptoms, either hematemesis/melena or hematochezia.
Table 1.
Gold-standard strategy to label encounters
| Steps to label the gold standard: acute gastrointestinal bleeding in the ED |
|---|
ED, emergency department
Training and validation datasets.
The total number of encounters was temporally divided into training (70%, 9/2014 to 7/2016) and validation (30%, 7/2016 to 5/2017) sets (Table 2). Internal validation was performed with 10 replications of tenfold cross-validation across the training set for the decision rules and the syntax-based NLP. For each cross-validation split, McNemar's test was performed for sensitivity, specificity, and positive predictive value (PPV), comparing SNOMED with decision rule 1, decision rule 2, and syntax-based NLP with random forest, support vector machine, and elastic net classifiers. Because of computational constraints, we did not perform internal validation on the BERT neural network model. External validation was performed directly on the held-out validation set. The primary performance metric was the PPV; secondary metrics were sensitivity and specificity. A high PPV indicates that a high proportion of the patients identified with acute GIB are true cases; ideally, a phenotype would also have high sensitivity so that it identifies a high proportion of all true cases of acute GIB. There is no clear performance threshold for PPV, but PPV > 75% has been considered acceptable and reported for EHR phenotypes.15–20
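These metrics can be computed directly from the 2×2 confusion counts between the manual-review gold standard and each phenotype's output. A minimal sketch (labels are 1 if the acute GIB phenotype is present):

```python
def phenotype_metrics(gold, pred):
    """PPV, sensitivity, and specificity from paired binary labels
    (1 = acute GIB phenotype present, 0 = absent)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    return {
        "ppv": tp / (tp + fp),          # of flagged patients, fraction truly bleeding
        "sensitivity": tp / (tp + fn),  # of true bleeds, fraction flagged
        "specificity": tn / (tn + fp),  # of non-bleeds, fraction correctly not flagged
    }
```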
Table 2.
Baseline characteristics for the training and external validation datasets
| | | Training/internal validation N = 7144 | External validation N = 2988 |
|---|---|---|---|
| Sex | Male (%) | 48 | 49 |
| | Female (%) | 52 | 51 |
| Age | Mean | 57 | 56 |
| Ethnicity | White/Caucasian (%) | 45 | 45 |
| | Black/African American (%) | 25 | 26 |
| | Hispanic/Latino (%) | 15 | 14 |
| | Asian (%) | 2 | 1 |
| | Other (%) | 7 | 13 |
| Disposition | Admit (%) | 66 | 66 |
| | Discharge (%) | 32 | 33 |
| | Other (%) | 2 | 2 |
| Type of bleeding† | Hematemesis/melena (%) | 24 | 24 |
| | Hematochezia (%) | 51 | 45 |
| 30-day mortality (%) | | 6 | 5 |
| Inpatient mortality (%) | | 3 | 2 |
| Red blood cell transfusion (%) | | 22 | 23 |
†The types of bleeding do not add up to 100% because the remainder includes patients who have no acute bleeding.
Sensitivity analysis by bleeding etiology.
A predefined sensitivity analysis was performed in the same external validation cohort to predict either hematemesis and/or melena or hematochezia. These are two clinically distinct symptom complexes that may indicate an upper or lower GI tract source, respectively.
Statistical methodology.
McNemar's test was used to compare sensitivity, specificity, and PPV for each iteration of internal validation for SNOMED, the decision rules, and the NLP tools. For the 10 replications of tenfold internal validation, the median and range of P values for McNemar's test are presented (Appendix I). We used 10 replications of tenfold cross-validation to account for the variability of random splitting and generate a more robust estimate of the potential performance of each approach. McNemar's test was also applied to the external validation dataset. PPV was the primary performance metric, with a goal of PPV > 75%. We predefined comparisons between the baseline (SNOMED) and each of the phenotyping approaches (decision rule 1, decision rule 2, and the two NLP-based approaches). We applied a Bonferroni correction, defined significance as P < 0.01, and present the corresponding 99% confidence intervals (CIs).
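For paired comparisons on the same encounters, McNemar's test uses only the discordant pairs, i.e., encounters that exactly one of the two approaches classifies correctly against the gold standard. The sketch below implements a standard continuity-corrected version with an exact chi-square (1 df) P value; the paper does not specify which variant was used, so this is an assumption:

```python
import math

def mcnemar(gold, pred_a, pred_b):
    """Continuity-corrected McNemar's test comparing two classifiers on the
    same gold-labeled encounters. Returns (statistic, two-sided P value)."""
    # b: A correct and B wrong; c: A wrong and B correct (discordant pairs)
    b = sum(a == g and x != g for g, a, x in zip(gold, pred_a, pred_b))
    c = sum(a != g and x == g for g, a, x in zip(gold, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square (1 df) survival function: P(X > stat) = erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2))
```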
Results
Performance of decision rules
Internal validation with tenfold cross-validation for 10 iterations.
Decision rules 1 and 2 and the syntax-based NLP tools had better PPVs than SNOMED codes in identifying patients with acute bleeding. The syntax-based NLP tool with random forest classifier had the highest sensitivity (0.97, 99% CI: 0.969–0.973) and was better than SNOMED (0.61, 99% CI: 0.606–0.618) with median P value <0.0001. Decision rule 1, decision rule 2 (0.88, 99% CI: 0.882–0.889), and the syntax-based NLP tool with elastic net and support vector machine classifiers also had higher sensitivity than SNOMED (median P value <0.0001). Decision rule 2 had the highest specificity (0.75, 99% CI: 0.740–0.757) and was better than SNOMED (0.39, 99% CI: 0.382–0.399) with median P value <0.0001 (Appendix I).
External validation.
The PPVs of the decision rules (0.78, 99% CI: 0.75–0.80 for decision rule 1; 0.85, 99% CI: 0.83–0.87 for decision rule 2) were higher than that of SNOMED (0.69, 99% CI: 0.66–0.72; P < 0.001). Syntax-based NLP with the elastic net classifier (PPV 0.83, 99% CI: 0.81–0.86) and the BERT neural network model (PPV 0.84, 99% CI: 0.82–0.86) were also higher than SNOMED (P < 0.001). The sensitivities of the decision rules (0.90, 99% CI: 0.88–0.92 for decision rule 1; 0.87, 99% CI: 0.85–0.89 for decision rule 2), syntax-based NLP with the elastic net classifier (0.85, 99% CI: 0.83–0.87), and the BERT neural network NLP model (0.93, 99% CI: 0.92–0.95) were higher than that of SNOMED codes (0.59, 99% CI: 0.57–0.62; P < 0.001). For specificity, SNOMED codes (0.45, 99% CI: 0.42–0.48) were worse than decision rule 2 (0.69, 99% CI: 0.65–0.73; P < 0.001), syntax-based NLP with the elastic net classifier (0.65, 99% CI: 0.61–0.69), and the BERT neural network NLP model (0.63, 99% CI: 0.59–0.67) but similar to decision rule 1 (0.47, 99% CI: 0.43–0.51; P = 0.87) (Table 3).
Table 3.
Results of phenotyping decision rules and NLP algorithm in training and validation sets
External validation (30%), N = 2988: performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference): problem list; encounter diagnosis; billing diagnosis | Decision rule 1: all GI triage terms + ROS fields | Decision rule 2: GI bleed-specific triage terms + ROS fields | Syntax-based NLP with elastic net | BERT neural network NLP |
|---|---|---|---|---|---|
| PPV | 69% (0.66–0.72) | 78%* (0.75–0.80) | 85%* (0.83–0.87) | 83%* (0.81–0.86) | 84%* (0.82–0.86) |
| Sensitivity | 59% (0.57–0.62) | 90%* (0.88–0.92) | 87%* (0.85–0.89) | 85%* (0.83–0.87) | 93%* (0.92–0.95) |
| Specificity | 45% (0.42–0.48) | 47% (0.43–0.51) | 69%* (0.65–0.73) | 65%* (0.61–0.69) | 63%* (0.59–0.67) |
*P < 0.001 compared with SNOMED (baseline).
BERT, Bidirectional Encoder Representation from Transformers; Gl, gastrointestinal; NLP, natural language processing; PPV, positive predictive value; ROS, review of systems; SNOMED, Systematized Nomenclature of Medicine.
Sensitivity analysis
Sensitivity analysis based on type of bleeding.
On external validation for hematemesis and/or melena, decision rule and NLP algorithms had higher PPV than SNOMED codes (P < 0.001). The BERT neural network NLP model had a similar sensitivity to SNOMED codes (P = 0.77), while the syntax-based NLP models and the modified decision rule had lower sensitivity (P < 0.001) (Table 4).
Table 4.
Sensitivity analysis of hematemesis and/or melena and hematochezia in the external validation group
Hematemesis and/or melena validation (30%), N = 650/2988: performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference) | Decision rule 2 modified: upper GI bleed-specific triage terms + ROS fields | Syntax-based NLP with random forest | BERT neural network NLP |
|---|---|---|---|---|
| PPV | 29% (0.26–0.31) | 82%* (0.76–0.87) | 85%* (0.81–0.90) | 78%* (0.74–0.83) |
| Sensitivity | 71% (0.67–0.75) | 34%* (0.30–0.39) | 46%* (0.42–0.51) | 72% (0.68–0.76) |
| Specificity | 46% (0.44–0.49) | 98%* (0.97–0.98) | 98%* (0.97–0.98) | 94%* (0.93–0.95) |

Hematochezia validation (30%), N = 1316/2988: performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference) | Decision rule 2 modified: lower GI bleed-specific triage terms + ROS fields | Syntax-based NLP with support vector machines | BERT neural network NLP |
|---|---|---|---|---|
| PPV | 42% (0.39–0.45) | 72%* (0.69–0.74) | 86%* (0.83–0.88) | 84%* (0.82–0.87) |
| Sensitivity | 55% (0.51–0.58) | 87%* (0.85–0.89) | 82%* (0.79–0.84) | 90%* (0.88–0.92) |
| Specificity | 40% (0.37–0.43) | 73%* (0.70–0.75) | 89%* (0.87–0.91) | 87%* (0.84–0.89) |
*P < 0.001 compared with SNOMED (baseline).
BERT, Bidirectional Encoder Representation from Transformers; Gl, gastrointestinal; NLP, natural language processing; PPV, positive predictive value; ROS, review of systems; SNOMED, Systematized Nomenclature of Medicine.
For hematochezia alone, the decision rule and NLP algorithms had higher PPV than SNOMED codes (P < 0.001) (Table 4).
Discussion
This is the first study to develop an EHR phenotype for identifying acute GIB in real time. Prompt identification of individuals with acute GIB in the emergency department is an important first step in deploying risk scores that would inform the level of care and clinical management decisions (Fig. 3). We found that the automated decision rule combining bleed-specific terms at initial nursing triage with bleed-specific terms in the emergency department provider's ROS had the highest PPV (85%) on external validation, with a sensitivity of 87%, for identifying patients presenting with acute GIB in the emergency department. In practice, this rule would trigger an alert that deploys a risk stratification tool for acute GIB through the EHR to the provider. A proposed workflow would be the following: a patient presents to the emergency department with acute GIB and is identified by the decision rule from a GIB-specific triage diagnosis of “Vomiting Blood.” The patient would then be flagged, and their vital signs, laboratory values, and medical history elements would be extracted from the EHR and run through a risk stratification algorithm that identifies very low risk patients at a prespecified threshold (e.g. machine learning-based risk stratification tools).21,22 The algorithm would then deliver an alert to the provider identifying very low risk patients who could be discharged.
Figure 3.

Schematic showing how the electronic health record (EHR) phenotyping rule fits into the provider workflow to find patients with gastrointestinal bleeding (GIB) and deploy risk scores to assist decision making.
We chose the decision rule because the goal is to identify patients who actually have acute GIB, which requires a low false positive rate (high specificity and PPV). Sensitivity should also be as high as possible to ensure that most patients with acute GIB are detected. A PPV of >75% has been reported as “high,” and the PPV for the highest performing decision rule was 91% on internal validation and 85% on external validation.15–20
Traditionally, the only method to identify patients with GIB has been diagnostic codes (International Classification of Diseases [ICD]-9 and ICD-10) from the billing diagnosis list. By definition, these codes cannot detect patients presenting with acute GIB at the point of care because they generally are not entered into the EHR until much later in the hospital course, well after initial identification and risk assessment are needed. Additionally, ICD codes may not detect all patients who have bleeding, as it may be reported but not coded in the billing diagnosis. SNOMED may be better than diagnostic codes alone: it is a terminology that includes both billing codes and other related text and is updated every quarter.23,24 Our decision rule outperforms the baseline SNOMED methodology even when SNOMED is sampled over multiple areas in the EHR, including the billing diagnosis list. Interestingly, using multiple strategies of NLP over the entire text of the emergency department provider notes does not markedly improve performance over a decision rule for identifying patients with acute GIB, although it appears to confer a slight benefit in the sensitivity analysis for hematochezia. Unfortunately, the NLP tools are not applicable to real-time phenotyping because the full note text may not be available until later; providers may complete documentation up to 24 h after the actual visit.
The NLP tools had performance similar to the decision rules for detecting overall acute bleeding. On sensitivity analysis, NLP tools performed similarly to the modified decision rule for hematemesis or melena and better than the modified decision rule for hematochezia. One advantage of NLP-based tools over decision rules is flexibility in handling additional covariates, including demographics and clinical information. In our evaluation, we incorporated age and gender into the features for the syntax-based NLP algorithm, but the results were similar. Future work is needed to systematically evaluate the benefit of incorporating other types of variables into NLP classification.
Strengths.
This study compares the performance of existing automated methods (SNOMED) across multiple time points in the EHR in identifying acute GIB at admission. This excludes GIB after admission while already hospitalized for another condition and provides the best data possible, given the constraints of using only structured data fields. Prompt identification of individuals with acute GIB would allow for the automatic provision of risk stratification scores for providers to guide appropriate triaging and clinical decision making.
Limitations.
We did not review patients who did not have the specific SNOMED code for GIB, did not have a GI-related triage problem, and did not have a positive ROS for GIB. We believe it is reasonable to assume that these patients likely presented with another primary issue and without any clinically significant GIB. Risk stratification scores for acute GIB are typically used for patients who present with GIB as the chief and acute complaint, and clinical decisions, such as admission or hospital-based interventions, need to be made early after presentation.
This phenotype was developed for a specific center with a local workflow, including the availability of a triage nurse with structured data fields for triage diagnosis. Patients outside this cohort (with negative SNOMED, none of the triage terms, and no ROS positivity) were not reviewed, which limits the phenotype's applicability to all-comers in the emergency department. However, we believe that patients without any GI symptoms or signs at triage or during emergency department evaluation and without any evidence of GIB in diagnostic codes or the other elements of SNOMED are very unlikely to have presented with clinically significant acute GIB. Changes in coding (e.g. from ICD-9 to ICD-10) and temporal shifts in treatment options, patient epidemiology, hospital utilization, and risk can all decrease the performance of these phenotypes in identifying patients of interest.
Future directions.
These results provide a robust method of identifying patients with acute GIB, which is a crucial first step in developing an EHR-based model for risk prognostication. Mapping the points at which data are generated in the care process guides the eventual deployment and implementation of a risk stratification prognostic algorithm for clinical decision making in real time.
Financial support:
DS is supported by an NIH grant T32 DK007017.
Appendix I
Internal validation with tenfold cross-validation for 10 iterations: SNOMED versus decision rules
Training (70%), N = 7144. Internal validation (tenfold cross-validation for 10 iterations): performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference): problem list; encounter diagnosis; billing diagnosis | Decision rule 1: all GI triage terms + ROS fields | P value (median, range) | Decision rule 2: GI bleed-specific triage terms + ROS fields | P value (median, range) |
|---|---|---|---|---|---|
| PPV | 74% (0.740–0.746) | 85% (0.849–0.855) | <0.0001 (0–0.002) | 91% (0.907–0.913) | <0.0001 |
| Sensitivity | 61% (0.606–0.618) | 91% (0.904–0.910) | <0.0001 | 88% (0.882–0.889) | <0.0001 |
| Specificity | 39% (0.382–0.399) | 55% (0.536–0.555) | 0.036 (0–0.50) | 75% (0.740–0.757) | <0.0001 |
GI, gastrointestinal; PPV, positive predictive value; ROS, review of systems; SNOMED, Systematized Nomenclature of Medicine.
Internal validation with tenfold cross-validation for 10 iterations: SNOMED versus NLP approaches
Training (70%), N = 7144. Internal validation (tenfold cross-validation for 10 iterations): performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference): problem list; encounter diagnosis; billing diagnosis | Syntax-based NLP with random forest | P value (median, range) | Syntax-based NLP with support vector machines | P value (median, range) | Syntax-based NLP with elastic net | P value (median, range) |
|---|---|---|---|---|---|---|---|
| PPV | 74% (0.740–0.746) | 86% (0.857–0.862) | <0.0001 | 88% (0.880–0.885) | <0.0001 | 89% (0.888–0.893) | <0.0001 |
| Sensitivity | 61% (0.606–0.618) | 97% (0.969–0.973) | <0.0001 | 94% (0.937–0.942) | <0.0001 | 90% (0.895–0.902) | <0.0001 |
| Specificity | 39% (0.382–0.399) | 54% (0.533–0.551) | 0.005 (0–0.16) | 64% (0.630–0.649) | <0.0001 (0–0.001) | 68% (0.672–0.692) | <0.0001 (0–0.0003) |
GI, gastrointestinal; NLP, natural language processing; PPV, positive predictive value; ROS, review of systems; SNOMED, Systematized Nomenclature of Medicine.
External validation on temporally separate cohort of patients, with sensitivity analysis for hematemesis and/or melena and hematochezia
Performance of each algorithm (with 99% confidence interval).

| Metric | SNOMED codes (reference): problem list; encounter diagnosis; billing diagnosis | Syntax-based NLP with random forest | Syntax-based NLP with support vector machines | Syntax-based NLP with elastic net |
|---|---|---|---|---|
| External validation (30%) N = 2988 |||||
| PPV | 74% (0.71–0.76) | 80%* (0.78–0.82) | 81%* (0.79–0.83) | 83%* (0.81–0.86) |
| Sensitivity | 61% (0.59–0.64) | 98%* (0.96–0.98) | 95%* (0.93–0.96) | 85%* (0.83–0.87) |
| Specificity | 38% (0.34–0.43) | 51%* (0.46–0.54) | 54%* (0.50–0.58) | 65%* (0.61–0.69) |
| Hematemesis and/or melena validation (30%) N = 650/2988 | ||||
| PPV | 29% (0.26–0.31) | 85%* (0.81–0.90) | 84%* (0.80–0.89) | 66%* (0.61–0.70) |
| Sensitivity | 71% (0.67–0.75) | 46%* (0.42–0.51) | 50%* (0.45–0.55) | 64% (0.59–0.69) |
| Specificity | 46% (0.44–0.49) | 98%* (0.97–0.98) | 97%* (0.96–0.98) | 90%* (0.88–0.91) |
| Hematochezia validation (30%) N = 1316/2988 | ||||
| PPV | 42% (0.39–0.45) | 84%* (0.81–0.86) | 86%* (0.83–0.88) | 75%* (0.72–0.78) |
| Sensitivity | 55% (0.51–0.58) | 86%* (0.84–0.89) | 82%* (0.79–0.84) | 83%* (0.81–0.86) |
| Specificity | 40% (0.37–0.43) | 87%* (0.85–0.89) | 89%* (0.87–0.91) | 78%* (0.75–0.80) |
*P < 0.001 compared with SNOMED (baseline).
GI, gastrointestinal; NLP, natural language processing; PPV, positive predictive value; ROS, review of systems; SNOMED, Systematized Nomenclature of Medicine.
Footnotes
Declaration of conflict of interest: The authors declare no relevant industrial affiliations.
Ethical approval: The Yale University Institutional Review Board has reviewed the database used to generate this study and has approved the use of patient data for this study. HIC#1408014519.
References
- 1. Peery AF, Crockett SD, Murphy CC et al. Burden and cost of gastrointestinal, liver, and pancreatic diseases in the United States: update 2018. Gastroenterology 2019; 156: 254–72.e11.
- 2. Laine L, Jensen DM. Management of patients with ulcer bleeding. Am. J. Gastroenterol. 2012; 107: 345–60; quiz 361.
- 3. Gralnek IM, Dumonceau JM, Kuipers EJ et al. Diagnosis and management of nonvariceal upper gastrointestinal hemorrhage: European Society of Gastrointestinal Endoscopy (ESGE) Guideline. Endoscopy 2015; 47: a1–46.
- 4. Barkun AN, Bardou M, Kuipers EJ et al. International consensus recommendations on the management of patients with nonvariceal upper gastrointestinal bleeding. Ann. Intern. Med. 2010; 152: 101–13.
- 5. Pathak J, Kho AN, Denny JC. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. J. Am. Med. Inform. Assoc. 2013; 20: e206–11.
- 6. Richesson RL, Hammond WE, Nahm M et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J. Am. Med. Inform. Assoc. 2013; 20: e226–31.
- 7. Angus DC. Fusing randomized trials with big data: the key to self-learning health care systems? JAMA 2015; 314: 767–8.
- 8. Banda JM, Seneviratne M, Hernandez-Boussard T, Shah NH. Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annu. Rev. Biomed. Data Sci. 2018; 1: 53–68.
- 9. Liao KP, Cai T, Savova GK et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015; 350: h1885.
- 10. Neumann M, King D, Beltagy I, Ammar W. ScispaCy: fast and robust models for biomedical natural language processing. arXiv e-prints 2019: arXiv:1902.07669. Available from URL: https://ui.adsabs.harvard.edu/abs/2019arXiv190207669N
- 11. Wang Y, Liu S, Afzal N et al. A comparison of word embeddings for the biomedical natural language processing. arXiv e-prints 2018: arXiv:1802.00400. Available from URL: https://ui.adsabs.harvard.edu/abs/2018arXiv180200400W
- 12. Kim C, Zhu V, Obeid J, Lenert L. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLoS ONE 2019; 14: e0212778.
- 13. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv e-prints 2018: arXiv:1810.04805. Available from URL: https://ui.adsabs.harvard.edu/abs/2018arXiv181004805D
- 14. Chang D, Balazevic I, Allen C, Chawla D, Brandt C, Taylor RA. Benchmark and best practices for biomedical knowledge graph embeddings. arXiv e-prints 2020: arXiv:2006.13774. Available from URL: https://ui.adsabs.harvard.edu/abs/2020arXiv200613774C
- 15. Ananthakrishnan AN, Cai T, Savova G et al. Improving case definition of Crohn's disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm. Bowel Dis. 2013; 19: 1411–20.
- 16. Liao KP, Cai T, Gainer V et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res. (Hoboken) 2010; 62: 1120–7.
- 17. Xia Z, Secor E, Chibnik LB et al. Modeling disease severity in multiple sclerosis using electronic health records. PLoS ONE 2013; 8: e78927.
- 18. Wilkinson T, Schnier C, Bush K et al. Identifying dementia outcomes in UK Biobank: a validation study of primary care, hospital admissions and mortality data. Eur. J. Epidemiol. 2019; 34: 557–65.
- 19. Wilkinson T, Ly A, Schnier C et al. Identifying dementia cases with routinely collected health data: a systematic review. Alzheimers Dement. 2018; 14: 1038–51.
- 20. Rubbo B, Fitzpatrick NK, Denaxas S et al. Use of electronic health records to ascertain, validate and phenotype acute myocardial infarction: a systematic review and recommendations. Int. J. Cardiol. 2015; 187: 705–11.
- 21. Shung D, Au B, Taylor R et al. Validation of a machine learning model that outperforms clinical risk scoring systems for upper gastrointestinal bleeding. Gastroenterology 2020; 158: 160–7.
- 22. Shung D, Simonov M, Gentry M, Au B, Laine L. Machine learning to predict outcomes in patients with acute gastrointestinal bleeding: a systematic review. Dig. Dis. Sci. 2019; 64: 2078–87.
- 23. Hripcsak G, Levine ME, Shang N, Ryan PB. Effect of vocabulary mapping for conditions on phenotype cohorts. J. Am. Med. Inform. Assoc. 2018; 25: 1618–25.
- 24. Elkin PL, Ruggieri AP, Brown SH et al. A randomized controlled trial of the accuracy of clinical record retrieval using SNOMED-RT as compared with ICD9-CM. Proc. AMIA Symp. 2001: 159–63.
