Abstract
Background:
The objective of this study was to develop a portable natural language processing approach to aid in the identification of postoperative venous thromboembolism (VTE) events from free-text clinical notes.
Methods:
We abstracted clinical notes from 25,494 operative events from two independent healthcare systems. A VTE detected as part of the American College of Surgeons’ National Surgical Quality Improvement Program was used as the reference standard. An NLP engine, EasyCIE-PEDVT, was trained to detect pulmonary embolism (PE) and deep vein thrombosis (DVT) from clinical notes. International Classification of Diseases (ICD) discharge diagnosis codes for VTE were used as a baseline comparator. The classification performance of EasyCIE-PEDVT was compared to that of ICD codes using sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) in internal and external validation cohorts.
Results:
To detect PE, EasyCIE-PEDVT had a sensitivity of 0.714 and 0.815 in internal and external validation, respectively. To detect DVT, EasyCIE-PEDVT had a sensitivity of 0.846 and 0.849 in internal and external validation, respectively. EasyCIE-PEDVT had significantly higher discrimination for DVT compared to ICD codes in internal validation (AUC: 0.920 vs 0.761, p<0.001) and external validation (AUC: 0.921 vs 0.794, p<0.001). There was no significant difference in the discrimination for PE between EasyCIE-PEDVT and ICD codes.
Conclusions:
Accurate surveillance of postoperative VTE may be achieved using NLP on clinical notes in two independent healthcare systems. These findings suggest NLP may augment manual chart abstraction for large registries such as NSQIP.
ARTICLE SUMMARY
In a cohort study that included 25,494 surgical encounters from two independent healthcare systems, we developed a natural language processing approach to identify postoperative venous thromboembolism events from clinical notes. The importance of this report is that natural language processing can accurately extract postoperative venous thromboembolic events from clinical notes, which can augment manual chart abstraction.
INTRODUCTION
Accurate identification of postoperative venous thromboembolism (VTE) events is necessary to develop risk prediction models,1 design quality improvement interventions,2 assess hospital quality,3 and foster hospital quality collaboratives.4 Surveillance typically relies on International Classification of Diseases (ICD) diagnosis codes5 or manual chart abstraction.6 ICD codes are widely available but demonstrate limited accuracy for the identification of VTE events.5,7 Manual chart abstraction is utilized by programs such as the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP).6 While the ACS-NSQIP program provides high-quality clinical data abstraction, the labor-intensive process limits the generalizability and scalability of this approach.8
The wide-scale adoption of electronic health record systems (EHRs) by U.S. hospitals provides an opportunity to leverage advanced data analytic techniques for surveillance of postoperative VTE.9 The majority of postoperative complications are described in clinical notes. Natural language processing (NLP) offers a unique opportunity to automate the abstraction of adverse clinical events from electronic documentation.10 The goal of NLP is to convert unstructured text into a structured format, as detailed in Figure 1. NLP methods include rules-based approaches and statistical approaches.11 Rules-based approaches have the advantage of requiring less training data and delivering explainable, easily interpretable results; however, these methods require significant effort to design. Statistical NLP approaches require little manual effort during training; however, they need a considerable amount of training data to ensure accuracy, and they deliver probability-based results. Given these probability-based results, the classifications of statistical NLP can be challenging to interpret, especially when the underlying logic is sophisticated, e.g., the logic used in the ACS-NSQIP program.
FIGURE 1.

Diagram of EasyCIE-PEDVT inferencing workflow. A. Sentence-Level Inferencing. During sentence-level inferencing, each sentence is broken down into terms, which are then categorized as target concepts or modifiers (certainty, temporality, site, dept, anatomy). B. Document-Level Inferencing. During document-level inferencing, the sentence-level conclusions are aggregated to a document-level conclusion. This step aggregates multiple mentions of VTE or resolves discrepancies between sentences. C. Patient-Level Inferencing. In patient-level inferencing, the document-level conclusions are aggregated to resolve conflicts between documents, identify whether a VTE is new-onset, and reach a patient-level conclusion.
Our group and others have utilized NLP to extract clinical data related to several postoperative complications, including surgical site infections,12 VTE,13–15 pneumonia,16 and urinary tract infections.17 However, the majority of these studies focus on the mention-level or document-level extraction of concepts, failing to integrate information across a patient’s medical record.18 In addition, the majority of these studies focus on a single healthcare system and have not demonstrated portability of the NLP approach beyond a single EHR system.13,14
Given these limitations, the objectives of the current study are (1) to develop an NLP solution to identify acute VTE in postsurgical patients based on NSQIP VTE definitions; (2) to compare the performance of our NLP solution against the classification performance of ICD codes for VTE; and (3) to demonstrate portability across two healthcare systems utilizing different EHRs. We hypothesize that an NLP solution for VTE detection will have similar accuracy to manual chart review and improved accuracy compared to ICD codes in two independent healthcare systems.
METHODS
Setting
We performed a retrospective cohort study using data from two independent healthcare systems located in Utah that serve the intermountain west: the University of Utah Health and Intermountain Healthcare. The University of Utah Health has maintained an EHR serviced by Epic since May 11, 2014. Intermountain Healthcare has maintained a separate EHR, Help2, locally serviced since 1996. The institutional review boards at each health care system approved the study, granting a waiver of informed consent.
Study Cohort
Operative events reviewed by NSQIP-trained surgical clinical reviewers (SCR) were eligible for study inclusion. The NSQIP methodology has been described previously.6,19 NSQIP SCRs select a random, stratified set of patients undergoing surgical procedures, then review all of those patients’ records for postoperative complications occurring within 30 days of the procedure. If a patient’s postoperative documentation is not complete in the EHR, attempts are made to contact the patient to complete a 30-day follow-up. Operative events occurring in one healthcare system from January 1, 2015, through August 31, 2017, were included as a training cohort, and operative events occurring from September 1, 2017, through December 31, 2018, were included as an internal validation cohort. Operative events occurring at the other healthcare system from September 1, 2008, through June 30, 2017, were included as an external validation cohort.
Data Sources
We extracted the NSQIP reviewed cases for patient demographics, procedure characteristics, and outcomes for each cohort. The enterprise data warehouses at each healthcare system were queried using the NSQIP case identifiers for all ICD 9th and 10th edition discharge diagnosis codes and select electronic clinical notes from one year before surgery through 30 days after the operative date. We included the following clinical note types: radiology reports (computed tomography and duplex ultrasound reports), telephone encounter notes, and discharge summaries.
Reference Standard
We used the NSQIP definition for PE and DVT from the 2018 Operations Manual as the reference standard for the study.20
NLP Development
The description of our NLP architecture, Easy Clinical Information Extractor (EasyCIE), a rule-based NLP tool, has been reported previously and is illustrated in Figure 1.12,21,22 The Supplementary Appendix describes the tool in more detail. EasyCIE breaks up a patient’s clinical record into documents, sections, sentences, and sentence tokens. EasyCIE identifies relevant concepts by tokens and aggregates the context within the same sentences into semantic representations (Figure 1A). EasyCIE then aggregates all these sentence-level representations in a document to infer a document-level classification (Figure 1B).23 Finally, EasyCIE aggregates document classification results along a temporal timeline to infer a patient-level classification (Figure 1C). The NLP rules were initially developed based on our reference standard. We iteratively revised the rules against errors in the training cohort until all addressable errors were resolved. The final software package, EasyCIE-PEDVT, including the source code and PE/DVT knowledgebase, can be found at https://github.com/jianlins/EasyCIE_GUI.
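The three inference levels described above (sentence, document, patient) can be sketched schematically. The following Python sketch is a minimal, illustrative simplification: the target concepts, cue phrases, and aggregation rules are hypothetical and are not the actual EasyCIE-PEDVT rule set.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical target concepts and modifier cues (NOT the EasyCIE-PEDVT rules).
TARGETS = {"PE": r"\bpulmonary embolism\b", "DVT": r"\b(?:deep vein thrombosis|dvt)\b"}
NEGATION_CUES = ("no evidence of", "negative for", "without")
HISTORICAL_CUES = ("history of", "prior")

def sentence_level(sentence):
    """Sentence-level inferencing: find a target concept and apply
    certainty/temporality modifiers found in the same sentence."""
    s = sentence.lower()
    for concept, pattern in TARGETS.items():
        if re.search(pattern, s):
            negated = any(cue in s for cue in NEGATION_CUES)
            historical = any(cue in s for cue in HISTORICAL_CUES)
            return concept, not (negated or historical)
    return None

def document_level(sentences):
    """Document-level inferencing: aggregate sentence conclusions;
    any positively asserted mention yields a document-level positive."""
    found = set()
    for sent in sentences:
        hit = sentence_level(sent)
        if hit is not None and hit[1]:
            found.add(hit[0])
    return found

@dataclass
class Note:
    note_date: date
    sentences: list

def patient_level(notes, surgery_date):
    """Patient-level inferencing: a concept counts as a new-onset
    postoperative event only if it is documented after surgery
    and never asserted before surgery."""
    pre, post = set(), set()
    for note in notes:
        concepts = document_level(note.sentences)
        (post if note.note_date > surgery_date else pre).update(concepts)
    return post - pre
```

In this toy version, a postoperative radiology report asserting a pulmonary embolism yields a patient-level PE, while a negated or historical mention is suppressed at the sentence level before it can propagate upward.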
Validation
Once training was finalized, the performance of EasyCIE-PEDVT was evaluated on both the internal and external validation cohorts to demonstrate the portability of the NLP process. After the initial validations, selected additional errors were reviewed, and the NLP rules were subsequently revised; the resulting performance is reported as the final internal and external validation performance. Additional retraining at the external validation site was necessary to ensure that note types and section headings were interpreted appropriately by EasyCIE-PEDVT, as no universal approach has been demonstrated to be efficacious for section detection.24 The subsequent need for rules modification was a direct result of EasyCIE-PEDVT misinterpreting note sections titled “Pulmonary Embolism” and “Deep Vein Thrombosis” in the external validation notes as VTE events. These sections were not present in the training corpora. The performance of EasyCIE-PEDVT throughout the training and validation process is shown in Table S1 (Supplementary Appendix).
ICD Code Comparison
The discharge diagnosis codes were classified for PE and DVT based on the numerators of the 2020 (ICD-CM 10th edition)25 and the 2015 (ICD-CM 9th edition)26 AHRQ Patient Safety Indicators for postoperative VTE. The list of ICD codes utilized is shown in Table S2 (Supplementary Appendix).
Statistical Analysis
The analysis was performed using the R v4.0 Statistical Software Package.27 A p-value less than 0.05 was considered statistically significant. We performed univariate analysis on the training, internal validation, and external validation cohorts using the Pearson chi-square test for discrete variables and Student’s t-test for continuous variables. The Bonferroni method was used to adjust for multiple comparisons in the univariate analysis.28 The performance of EasyCIE-PEDVT and of the ICD codes was compared to the reference standard and evaluated for sensitivity, specificity, area under the receiver operating characteristic curve (AUC), positive predictive value, negative predictive value, positive likelihood ratio, and negative likelihood ratio. A true positive occurred if both EasyCIE-PEDVT and the reference standard concluded a postoperative VTE was present. To evaluate the performance of EasyCIE-PEDVT between the internal and external validation cohorts, the test of two proportions was used.29 To evaluate the difference between the sensitivity and specificity of EasyCIE-PEDVT and the ICD codes, McNemar’s test was used.30 The differences in the predictive values and likelihood ratios were calculated as previously described.31,32 We utilized bootstrapping with 2000 iterations to generate 95% confidence intervals (CI) for each point estimate.33 The Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram for PE is shown in Supplementary Figure S1, and for DVT in Supplementary Figure S2.
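As a concrete illustration of these metrics, the sketch below computes sensitivity, specificity, predictive values, and likelihood ratios from a 2×2 confusion matrix, with a percentile-bootstrap confidence interval analogous to the 2000-iteration bootstrap described above. The counts in the example are hypothetical values chosen to be roughly consistent with the internal-validation DVT results in Tables 2–3; they are not the study’s raw data, and the study itself used R rather than Python.

```python
import random

def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy metrics from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "plr": sens / (1 - spec),  # positive likelihood ratio
        "nlr": (1 - sens) / spec,  # negative likelihood ratio
    }

def bootstrap_ci(tp, fp, fn, tn, metric, n_boot=2000, seed=42):
    """Percentile bootstrap CI: resample cases with replacement and
    recompute the metric on each resampled 2x2 table."""
    rng = random.Random(seed)
    cases = ["tp"] * tp + ["fp"] * fp + ["fn"] * fn + ["tn"] * tn
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(cases, k=len(cases))
        counts = {k: sample.count(k) for k in ("tp", "fp", "fn", "tn")}
        try:
            stats.append(diagnostic_metrics(**counts)[metric])
        except ZeroDivisionError:
            continue  # skip degenerate resamples (e.g., no positive cases)
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats))]
```

For example, hypothetical counts of tp=11, fn=2, fp=15, tn=2238 yield a sensitivity of 0.846, specificity of 0.993, and a positive likelihood ratio of about 127, similar to the internal-validation DVT row of Tables 2–3.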
RESULTS
Cohort Descriptions
We included 6,735 operative events in the training cohort, 2,266 operative events in the internal validation cohort, and 16,493 operative events in the external validation cohort (Table 1). In the internal validation cohort, there were significantly more appendectomy and hernia procedures and significantly fewer stomach procedures compared to the training cohort (Table 1). In the external validation cohort, there were significantly more appendectomy and vascular procedures and significantly fewer esophagus and skin procedures compared to the training cohort (Table 1). There were significantly more outpatient procedures in the internal validation cohort compared to the training cohort (Table 1).
TABLE 1.
Patient and Procedure Characteristics and Outcomes for Patients in Training, Internal Validation, and External Validation Cohorts.
| | Training (N=6,735) | Internal Validation (N=2,266) | External Validation (N=16,493) |
|---|---|---|---|
| Age (Years, SD) | 53 (17) | 53 (17) | 53 (18) |
| Sex (male, %) | 3232 (47.99%) | 1116 (49.25%) | 8110 (49.17%) |
| Race (%) | |||
| White | 6149 (91.30%) | 2073 (91.48%) | 15175 (92.01%)* |
| Black or African American | 76 (1.13%) | 26 (1.15%) | 138 (0.84%) |
| Asian | 94 (1.40%) | 32 (1.41%) | 205 (1.24%) |
| Native Hawaiian or Other Pacific Islander | 41 (0.61%) | 13 (0.57%) | 157 (0.95%) |
| American Indian or Alaska Native | 96 (1.43%) | 31 (1.37%) | 64 (0.39%) |
| Not Reported | 279 (4.14%) | 91 (4.02%) | 754 (4.57%) |
| Hispanic Ethnicity (%) | 657 (9.76%) | 207 (9.14%) | 1525 (9.25%) |
| Independent Functional Health Status (%) | 6642 (98.62%) | 2246 (99.12%) | 16101 (97.62%)* |
| Procedure Type (%) | |||
| Appendectomy | 644 (9.56%) | 315 (13.90%)* | 2821 (17.10%)* |
| Breast | 678 (10.07%) | 197 (8.69%) | 1453 (8.81%) |
| Colon/Rectal | 1090 (16.18%) | 409 (18.05%) | 2667 (16.17%) |
| Esophagus | 272 (4.04%) | 88 (3.88%) | 413 (2.50%)* |
| General Abdominal | 232 (3.44%) | 51 (2.25%) | 551 (3.34%) |
| Hepatobiliary | 741 (11.00%) | 253 (11.17%) | 1672 (10.14%) |
| Hernia | 1411 (20.95%) | 552 (24.36%)* | 3489 (21.15%) |
| Other Skin | 520 (7.72%) | 144 (6.35%) | 1038 (6.29%)* |
| Stomach | 501 (7.44%) | 63 (2.78%)* | 213 (1.29%)* |
| Vascular | 642 (9.53%) | 193 (8.52%) | 2162 (13.11%)* |
| Outpatient Status (%) | 3354 (49.80%) | 1252 (55.25%)* | 8316 (50.42%) |
| Operative Duration (min, SD) | 120 (99) | 120 (110) | 99 (95) |
| Hospital Length of Stay (Days, SD) | 3.3 (8.4) | 3.0 (9.3) | 3.2 (6.8) |
| Postoperative DVT (%) | 36 (0.53%) | 13 (0.57%) | 172 (1.04%)* |
| Postoperative Pulmonary Embolism (%) | 15 (0.22%) | 7 (0.31%) | 65 (0.39%) |
*p<0.001 vs. Training Cohort
The prevalence of PE in the training cohort was similar to that in the internal validation cohort (0.22% vs 0.31%, absolute difference (95%CI): 0.08% (−0.2%–0.3%), p=1) and the external validation cohort (0.22% vs 0.39%, absolute difference (95%CI): 0.17% (0.01%–0.32%), p=0.05). The prevalence of DVT in the training cohort was similar to that in the internal validation cohort (0.53% vs 0.57%, p=1) and significantly lower than in the external validation cohort (0.53% vs 1.04%, p<0.001) (Table 2).
TABLE 2.
EasyCIE-PEDVT Performance in Internal and External Validation Cohorts Compared to ICD Codes.
Pulmonary Embolism

| Metrics | EasyCIE-PEDVT, Internal Validation | EasyCIE-PEDVT, External Validation | p-Value* | ICD Code, Internal Validation | p-Value# | ICD Code, External Validation | p-Value# |
|---|---|---|---|---|---|---|---|
| Sensitivity (95%CI) | 0.714 (0.286–1.00) | 0.815 (0.723–0.908) | 0.89 | 0.714 (0.286–1.00) | 1 | 0.908 (0.831–0.969) | 0.06 |
| Specificity (95%CI) | 0.993 (0.989–0.996) | 0.998 (0.997–0.999) | <0.001 | 0.992 (0.989–0.996) | 0.84 | 0.994 (0.993–0.995) | 0.004 |
| AUC (95%CI) | 0.854 (0.673–1.00) | 0.907 (0.859–0.954) | 0.56 | 0.853 (0.673–1.00) | 0.84 | 0.951 (0.915–0.986) | 0.06 |
Deep Vein Thrombosis

| Metrics | EasyCIE-PEDVT, Internal Validation | EasyCIE-PEDVT, External Validation | p-Value* | ICD Code, Internal Validation | p-Value# | ICD Code, External Validation | p-Value# |
|---|---|---|---|---|---|---|---|
| Sensitivity (95%CI) | 0.846 (0.615–1.00) | 0.849 (0.791–0.901) | 1 | 0.538 (0.231–0.769) | 0.045 | 0.599 (0.529–0.674) | <0.001 |
| Specificity (95%CI) | 0.993 (0.990–0.996) | 0.992 (0.991–0.994) | 0.7 | 0.983 (0.977–0.988) | <0.001 | 0.988 (0.987–0.990) | <0.001 |
| AUC (95%CI) | 0.920 (0.818–1.00) | 0.921 (0.894–0.947) | 0.99 | 0.761 (0.620–0.902) | 0.017 | 0.794 (0.757–0.83) | <0.001 |
ICD: International Classification of Diseases, AUC: Area Under the Receiver Operating Characteristic Curve, CI: Confidence Interval
*p-value vs. Internal Validation
#p-value vs. ICD Code
EasyCIE-PEDVT Performance for VTE Detection
For PE detection, the sensitivity of EasyCIE-PEDVT was similar between the internal validation (0.714 [95% CI: 0.286–1.0]) and external validation cohorts (0.815 [95% CI: 0.723–0.908], p=0.89). The specificity of EasyCIE-PEDVT was significantly higher in the external validation cohort compared to the internal validation cohort (0.998 [95% CI: 0.997–0.999] vs 0.993 [95% CI: 0.989–0.996], p<0.001). AUC did not differ significantly between validation cohorts (p=0.56) (Table 2).
For DVT detection, the sensitivity of EasyCIE-PEDVT was similar between the internal validation (0.846 [95% CI: 0.615–1.00]) and external validation cohorts (0.849 [95% CI: 0.791–0.901], p=1). Similarly, the specificity of EasyCIE-PEDVT did not differ significantly between the internal and external validation cohorts (0.993 [95% CI: 0.990–0.996] vs 0.992 [95% CI: 0.991–0.994], p=0.70). AUC remained similar between validation cohorts (p=0.99) (Table 2).
EasyCIE-PEDVT compared to ICD Discrimination for VTE Detection
No significant differences were observed between the discrimination performance of EasyCIE-PEDVT and ICD codes for PE detection in the internal validation (AUC [95%CI]: 0.854 [0.673–1.00] vs 0.853 [0.673–1.00], p=0.84) and external validation cohorts (AUC [95%CI]: 0.907 [0.859–0.954] vs 0.951 [0.915–0.986], p=0.06). EasyCIE-PEDVT had significantly higher discrimination performance compared to ICD codes for DVT detection in the internal validation (AUC [95%CI]: 0.920 [0.818–1.00] vs 0.761 [0.620–0.902], p=0.017) and external validation (AUC [95%CI]: 0.921 [0.894–0.947] vs 0.794 [0.757–0.830], p<0.001).
The predictive values of EasyCIE-PEDVT and the ICD codes are shown in Table 3. The positive predictive value (PPV) of both EasyCIE-PEDVT and the ICD codes was low for both PE and DVT detection. Across the internal and external validation cohorts, the PPV for PE detection ranged from 0.238–0.632 for EasyCIE-PEDVT and 0.227–0.374 for the ICD codes, and the PPV for DVT detection ranged from 0.423–0.539 for EasyCIE-PEDVT and 0.156–0.353 for the ICD codes. However, the negative predictive value (NPV) was near perfect, ranging from 0.997–1.00 for both EasyCIE-PEDVT and the ICD codes in all cohorts. Given the low prevalence of both PE and DVT, we also calculated the positive and negative likelihood ratios for both approaches. For example, in the external validation cohort, the positive likelihood ratio (PLR) was significantly higher for EasyCIE-PEDVT than for the ICD codes for the detection of both PE (PLR [95% CI]: 432.1 [298.4–625.7] vs 158.6 [155.7–161.5], p<0.001) and DVT (PLR [95% CI]: 110.8 [92.1–133.4] vs 53.1 [52.3–53.9], p<0.001) (Table 3).
TABLE 3.
Value of EasyCIE-PEDVT Information in the Internal and External Validation Cohorts Compared to ICD Codes.
Pulmonary Embolism

| Metrics | EasyCIE-PEDVT (Internal Validation) | ICD Code (Internal Validation) | p-Value | EasyCIE-PEDVT (External Validation) | ICD Code (External Validation) | p-Value |
|---|---|---|---|---|---|---|
| Positive Predictive Value (95%CI) | 0.238 (0.125–0.400) | 0.227 (0.107–0.369) | 0.85 | 0.632 (0.549–0.723) | 0.374 (0.329–0.429) | <0.001 |
| Negative Predictive Value (95%CI) | 0.999 (0.998–1.00) | 0.999 (0.998–1.00) | 0.85 | 0.999 (0.999–1.00) | 1.00 (0.999–1.00) | 0.06 |
| Positive Likelihood Ratio (95%CI) | 100.8 (51.3–198.4) | 107.1 (103.1–111.1) | 0.84 | 432.1 (298.4–625.7) | 158.6 (155.7–161.5) | <0.001 |
| Negative Likelihood Ratio (95%CI) | 0.288 (0.089–0.928) | 0.276 (0.26–0.292) | 0.85 | 0.185 (0.111–0.308) | 0.096 (0.091–0.102) | 0.06 |
Deep Vein Thrombosis

| Metrics | EasyCIE-PEDVT (Internal Validation) | ICD Code (Internal Validation) | p-Value | EasyCIE-PEDVT (External Validation) | ICD Code (External Validation) | p-Value |
|---|---|---|---|---|---|---|
| Positive Predictive Value (95%CI) | 0.423 (0.308–0.588) | 0.156 (0.08–0.243) | <0.001 | 0.539 (0.496–0.589) | 0.353 (0.313–0.397) | <0.001 |
| Negative Predictive Value (95%CI) | 0.999 (0.998–1.00) | 0.997 (0.996–0.999) | 0.044 | 0.998 (0.998–0.999) | 0.996 (0.995–0.997) | <0.001 |
| Positive Likelihood Ratio (95%CI) | 127.1 (73.0–221.4) | 34.1 (32.7–35.5) | <0.001 | 110.8 (92.1–133.4) | 53.1 (52.3–54.0) | <0.001 |
| Negative Likelihood Ratio (95%CI) | 0.155 (0.043–0.554) | 0.467 (0.451–0.483) | 0.054 | 0.152 (0.107–0.217) | 0.406 (0.400–0.411) | <0.001 |
ICD: International Classification of Diseases, CI: Confidence Interval
EasyCIE-PEDVT Error Analysis
To better understand the limitations of EasyCIE-PEDVT for VTE detection, we performed an error analysis of the false-negative and false-positive errors in the internal and external validation cohorts (Table 4). The most common reason for a false-negative error by EasyCIE-PEDVT was a lack of documentation (notes), accounting for 100% and 73% of the false-negative errors in the internal and external validation cohorts, respectively. In external validation, 18% of the false-negative errors were determined to be true negatives after reexamination of the medical records. With regard to the false-positive errors, EasyCIE-PEDVT rule errors accounted for the majority, including patient inferencing, document inferencing, section splitting, local inferencing, and named entity recognition errors. Another reason for false-positive errors was the correct identification of DVT events that were not medically treated. Interestingly, 19% and 24% of the apparent false-positive errors in the internal and external validation cohorts were true positives (not an error) after reexamination of the medical record.
TABLE 4.
Error Analysis of EasyCIE-PEDVT in the Internal and External Validation Cohorts.
| Error Type | Reason | Internal Validation | External Validation | Description |
|---|---|---|---|---|
| False Negative | n | 2 | 22 | |
| | Documentation Not Present | 2 (100%) | 16 (73%) | The documentation needed to identify a PE/DVT was not available electronically |
| | Document Inferencing | 0 | 2 (9%) | Document-level inferencing rules reached an incorrect PE/DVT conclusion when aggregating the sentences and paragraphs of the document |
| | True Negative | 0 | 4 (18%) | No evidence of PE/DVT in the medical record |
| False Positive | n | 37 | 193 | |
| | True Positive | 7 (19%) | 46 (24%) | The patient met the definition of PE/DVT according to the current NSQIP definitions |
| | Not Treated | 2 (5%) | 37 (19%) | The patient had a DVT that was not treated with anticoagulation |
| | Patient Inference | 0 (0%) | 11 (6%) | Patient-level inferencing rules reached an incorrect PE/DVT conclusion when aggregating the documents |
| | Document Inference | 0 (0%) | 3 (2%) | Document-level inferencing rules reached an incorrect PE/DVT conclusion when aggregating the sentences and paragraphs of the document |
| | Section Splitting | 8 (22%) | 4 (2%) | Section-level splitting rules reached an incorrect conclusion regarding the section of the document |
| | Local Inference | 18 (49%) | 84 (44%) | Sentence/paragraph inferencing rules reached an incorrect PE/DVT conclusion when aggregating the terms |
| | Named Entity Recognition | 2 (5%) | 8 (4%) | Local terminology required to identify key terms related to PE/DVT was not in the dictionary |
DISCUSSION
We present our most recent progress using NLP to assist with the surveillance of postoperative VTE events. There are several significant findings in this work. First, we developed a portable NLP pipeline for surveillance of postoperative VTE events from free-text clinical notes with high accuracy in two independent healthcare systems. Second, compared to the use of ICD codes, our NLP solution demonstrated superior performance for the surveillance of DVT and similar performance for the surveillance of PE. Third, the inability to detect post-discharge complications due to a lack of documentation was the greatest NLP limitation.
Detecting postoperative VTE according to the NSQIP guideline is more challenging than simply detecting a DVT or PE in radiology reports because the NSQIP definition of VTE requires criteria beyond the presence or absence of a mention in a radiology report. First, NSQIP requires PE and DVT events to be new-onset, and therefore not present at the time of surgery. Second, DVTs are required to have a therapeutic intervention, such as anticoagulation administration. EasyCIE-PEDVT integrated information across time to infer whether a VTE event was considered new-onset. We were able to achieve a sensitivity and specificity in our internal validation cohort of 0.714 and 0.993 for PE, and 0.846 and 0.993 for DVT, respectively.
Other groups have attempted to extract VTE information across a patient’s postoperative course. For example, Murff et al. developed a rules-based NLP system to extract several postoperative complications, including VTE events, in the VA Healthcare system.13 They reported a patient-level sensitivity and specificity of 0.89 and 0.95 for the postoperative episode. Statistical NLP systems have also been applied for the surveillance of VTE events. Selby et al. utilized a “bag-of-words” statistical NLP model for postoperative VTE surveillance and demonstrated a sensitivity and specificity of 0.85 and 0.94 for DVT and 0.90 and 0.987 for PE.14 Thus, we demonstrated improved specificity and equivalent sensitivity compared to previous work. Interestingly, in both the present study and that of Selby et al., the false-negative errors occurred because documentation was not available for NLP analysis. These findings highlight the use of NLP in a “real-world” setting, where variability in data availability may affect NLP performance.
The portability of our NLP approach across two independent healthcare systems utilizing different EHRs is novel. The previous work mentioned above limited NLP validation to a single corpus in one healthcare system. Given differences in EHR design, document formatting, and local terminologies, NLP systems need to be validated in distinct corpora from different healthcare systems before widespread dissemination is feasible. For the surveillance of VTE events, we demonstrated equivalent or improved EasyCIE-PEDVT performance in external validation. Notably, we did not observe a significant difference in the performance of EasyCIE-PEDVT between the two validation cohorts, except for a small improvement in PE specificity in external validation of questionable clinical significance. Our external validation performance is comparable to the single-center studies mentioned above.13,14 However, our NLP rules required revision after initial external validation due to unanticipated note sections present in the external validation corpora. These findings highlight the challenges of porting NLP systems across institutions and EHRs, where variability in documentation practices and note formatting may impact NLP performance. Best practice for implementing NLP systems requires validation and recalibration of system performance prior to widespread implementation.24
Traditionally, ICD codes have been utilized for surveillance of VTE events.5 Unlike NLP processes that require training and validation, ICD codes are readily available for use and analysis. Henderson et al. utilized the Agency for Healthcare Research and Quality (AHRQ) Patient Safety Indicator (PSI) for VTE and compared its performance to manual chart review.5 They demonstrated a sensitivity and specificity of 0.87 and 0.98, respectively, for the VTE PSI. We compared the performance of the ICD codes utilized by the PSI to EasyCIE-PEDVT and NSQIP chart review. We observed improved discrimination for DVT and equivalent discrimination for PE with EasyCIE-PEDVT compared to the ICD codes in both internal and external validation cohorts. Our findings are similar to those of Murff et al., who demonstrated improved specificity but not improved sensitivity for VTE compared to the AHRQ VTE PSI.13 Although we did not demonstrate improved discrimination for PE detection using EasyCIE-PEDVT, there are secondary benefits of an NLP approach, including obtaining information regarding temporality and anatomy, which is not available via ICD codes.
We performed a thorough error analysis of EasyCIE-PEDVT to identify limitations and areas for improvement. With regard to the false-negative errors, the most common reason was a lack of documentation due to post-discharge events occurring outside the healthcare system, accounting for 100% and 73% of these errors in the internal and external validation cohorts, respectively. This observation highlights the continued challenge of post-discharge monitoring of patients who may lose contact with the healthcare system. Acknowledging that the majority of readmissions are caused by new-onset post-discharge complications, there may still be a surveillance gap even with automated methodologies if the information is not available to the algorithms.34 Other causes of false-negative and false-positive errors were related to NLP inferencing rule errors arising from semantic contexts unseen during training. The advantages of a rules-based NLP system include the explainability of the results and the ability to perform in-depth error analysis. Before implementation in a new healthcare system, these errors could be addressed with a small amount of retraining data followed by modification of the NLP rules to improve performance.
Interestingly, 18% of the false-negative and 24% of the false-positive errors reverted to true-negative and true-positive non-errors after manual review, suggesting an NLP-assisted solution could potentially achieve even better surveillance than manual chart review alone. If we were to include these corrected cases in the reference standard, EasyCIE-PEDVT would have a PE sensitivity of 0.714 and 0.852 in internal and external validation, respectively, compared to an NSQIP PE sensitivity of 1 and 0.778. For DVT, EasyCIE-PEDVT would have a sensitivity of 0.9 and 0.879 compared to an NSQIP DVT sensitivity of 0.650 and 0.859 in internal and external validation, respectively. These values may be artificially elevated, as the number of VTE events missed by both EasyCIE-PEDVT and NSQIP is unknown. However, these findings demonstrate that EasyCIE-PEDVT and NSQIP can offer complementary approaches to VTE surveillance. EasyCIE-PEDVT errors occurred through a lack of documentation, which can be overcome by current NSQIP telephone contact protocols. NSQIP errors occurred when events were missed within the clinical text, which can be overcome by EasyCIE-PEDVT-assisted surveillance.
Our work has several implications for hospital quality assessment efforts. Despite the high sensitivity and specificity of EasyCIE-PEDVT for VTE events, given the low prevalence of these events, the PPV is expectedly low. We observed a PPV of 0.2–0.6 for PE and 0.3–0.5 for DVT, consistent with prior reports.14,35 However, the value of EasyCIE-PEDVT is in the NPV. With NPVs of 0.996–1.00 for PE and DVT, EasyCIE-PEDVT can reliably exclude postoperative VTE events in low-likelihood patients. For example, in external validation, reviewing only the NLP-flagged charts for DVT and PE would result in a 98.4% and 99.3% reduction, respectively, in the number of charts needing review. Consistent with our previous work with surgical site infections, hospital quality assessment reviewers could safely exclude patients based on NLP review alone and limit manual chart review to the positively flagged patients.12 The integration of NLP into patient-safety and quality workflows could decrease labor-intensive efforts and improve data abstraction efficiency.
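The chart-review reduction cited above follows from simple arithmetic on prevalence, sensitivity, and PPV: the fraction of charts flagged for review is (prevalence × sensitivity) / PPV. The back-of-the-envelope sketch below uses the external-validation DVT values from Tables 1–2 (prevalence 1.04%, sensitivity 0.849, PPV 0.539); it is illustrative, not the study’s exact calculation.

```python
def review_reduction(prevalence, sensitivity, ppv):
    """Fraction of charts excluded from manual review when only
    NLP-flagged charts (true positives + false positives) are reviewed.
    flagged/N = (TP/N) / PPV = prevalence * sensitivity / ppv."""
    return 1 - prevalence * sensitivity / ppv

# External validation, DVT: roughly 98.4% of charts would not need review.
reduction_dvt = review_reduction(prevalence=0.0104, sensitivity=0.849, ppv=0.539)
```

The same arithmetic applied to the PE row gives a reduction of a similar magnitude, since PE prevalence is even lower than DVT prevalence.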
There may be negative effects of NLP-based VTE surveillance. Based on our findings, hospitals that use NLP for automated extraction of VTE events would likely observe an increase in recorded postoperative VTE events due to surveillance bias, a phenomenon similar to the surveillance bias observed after the implementation of a VTE screening program using ultrasonography.36 Participating hospitals may therefore see increased penalties from pay-for-performance programs such as the Centers for Medicare and Medicaid Services Hospital-Acquired Conditions program.37 However, more accurate detection of postoperative VTE events allows healthcare systems to develop more effective countermeasures, which can benefit hospitals in the long term.4
We acknowledge several limitations of our study. EasyCIE-PEDVT was developed according to NSQIP definitions, and its performance characteristics may differ under other VTE definitions. However, an advantage of a rules-based NLP system is the ability to modify its behavior easily with a small amount of training data; as discussed above, we modified our rule set after initial external validation to adapt to unforeseen document formatting. We utilized documents from two independent healthcare systems located in the same geographic region; because clinical language may vary by region, the portability of EasyCIE-PEDVT without retraining may be limited. Lastly, we utilized clinical text only and did not incorporate other data, such as medication administration records, into EasyCIE-PEDVT. Given that the NSQIP definition of DVT requires therapy, integrating medication administration data may improve predictive performance.
CONCLUSION
We developed and validated a portable NLP approach for surveillance of postoperative VTE events. We demonstrate improved performance of our NLP approach compared with ICD codes for surveillance of DVT and equivalent performance for PE across two independent healthcare systems. The high NPV of our NLP approach allows cases to be accurately excluded from manual chart review; however, given the low PPV, NLP-flagged events should be verified by manual chart review. We believe NLP is ready for integration into hospital surveillance programs to reduce the chart review burden and improve resource utilization for quality assessment activities.
Supplementary Material
ACKNOWLEDGEMENTS
FUNDING/SUPPORT
This research was supported by grant 1K08HS025776 from the Agency for Healthcare Research and Quality (Dr. Bucher). The computational resources used were partially funded by the NIH Shared Instrumentation Grant 1S10OD021644-01A1.
ROLE OF THE FUNDER/SPONSOR
The Agency for Healthcare Research and Quality had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. The views expressed are those of the authors and do not necessarily represent the views or opinions of the U.S. Government or the U.S. Department of Veterans Affairs.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
CONFLICT OF INTEREST DISCLOSURE
The authors report no disclosures.
ACCESS TO DATA AND DATA ANALYSIS
Dr. Bucher had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
REFERENCES
- 1. Rozeboom PD, Bronsert MR, Velopulos CG, et al. A comparison of the new, parsimonious tool Surgical Risk Preoperative Assessment System (SURPAS) to the American College of Surgeons (ACS) risk calculator in emergency surgery. Surgery. 2020;168(6):1152–1159.
- 2. Myers SP, Brown JB, Leeper CM, et al. Early versus late venous thromboembolism: A secondary analysis of data from the PROPPR trial. Surgery. 2019;166(3):416–422.
- 3. Stey AM, Russell MM, Ko CY, Sacks GD, Dawes AJ, Gibbons MM. Clinical registries and quality measurement in surgery: a systematic review. Surgery. 2015;157(2):381–395.
- 4. Yang AD, Hewitt DB, Blay E Jr., et al. Multi-institution Evaluation of Adherence to Comprehensive Postoperative VTE Chemoprophylaxis. Ann Surg. 2020;271(6):1072–1079.
- 5. Henderson KE, Recktenwald A, Reichley RM, et al. Clinical validation of the AHRQ postoperative venous thromboembolism patient safety indicator. Jt Comm J Qual Patient Saf. 2009;35(7):370–376.
- 6. Ko CY, Hall BL, Hart AJ, Cohen ME, Hoyt DB. The American College of Surgeons National Surgical Quality Improvement Program: achieving better and safer surgery. Jt Comm J Qual Patient Saf. 2015;41(5):199–204.
- 7. Burles K, Innes G, Senior K, Lang E, McRae A. Limitations of pulmonary embolism ICD-10 codes in emergency department administrative data: let the buyer beware. BMC Med Res Methodol. 2017;17(1):89.
- 8. Hanauer DA, Englesbe MJ, Cowan JA Jr., Campbell DA. Informatics and the American College of Surgeons National Surgical Quality Improvement Program: automated processes could replace manual record review. J Am Coll Surg. 2009;208(1):37–41.
- 9. Adler-Milstein J, Jha AK. HITECH Act Drove Large Gains In Hospital Electronic Health Record Adoption. Health Aff (Millwood). 2017;36(8):1416–1422.
- 10. Wu ST, Kaggal VC, Dligach D, et al. A common type system for clinical natural language processing. J Biomed Semantics. 2013;4(1):1.
- 11. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–551.
- 12. Bucher BT, Shi J, Ferraro JP, et al. Portable Automated Surveillance of Surgical Site Infections Using Natural Language Processing: Development and Validation. Ann Surg. 2020.
- 13. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA. 2011;306(8):848–855.
- 14. Selby LV, Narain WR, Russo A, Strong VE, Stetson P. Autonomous detection, grading, and reporting of postoperative complications using natural language processing. Surgery. 2018;164(6):1300–1305.
- 15. Heilbrun ME, Chapman BE, Narasimhan E, Patel N, Mowery D. Feasibility of Natural Language Processing-Assisted Auditing of Critical Findings in Chest Radiology. J Am Coll Radiol. 2019;16(9 Pt B):1299–1304.
- 16. Dublin S, Baldwin E, Walker RL, et al. Natural Language Processing to identify pneumonia from radiology reports. Pharmacoepidemiol Drug Saf. 2013;22(8):834–841.
- 17. Divita G, Carter M, Redd A, et al. Scaling-up NLP Pipelines to Process Large Corpora of Clinical Notes. Methods Inf Med. 2015;54(6):548–552.
- 18. Chapman BE, Lee S, Kang HP, Chapman WW. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J Biomed Inform. 2011;44(5):728–737.
- 19. Shiloach M, Frencher SK Jr., Steeger JE, et al. Toward robust information: data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. J Am Coll Surg. 2010;210(1):6–16.
- 20. American College of Surgeons. Variables and Definitions. In: NSQIP Operations Manual. Chicago, IL: American College of Surgeons; 2018.
- 21. Shi J, Mowery D. EasyCIE: A Development Platform to Support Quick and Easy, Rule-based Clinical Information Extraction. Paper presented at: Fifth IEEE International Conference on Healthcare Informatics; 2017; Park City, UT.
- 22. Shi J, Liu S, Pruitt LCC, et al. Using Natural Language Processing to improve EHR Structured Data-based Surgical Site Infection Surveillance. AMIA Annu Symp Proc. 2019;2019:794–803.
- 23. Shi J, Mowery D, Zhang M, Sanders J, Chapman W, Gawron L. Extracting Intrauterine Device Usage from Clinical Texts Using Natural Language Processing. Paper presented at: 2017 IEEE International Conference on Healthcare Informatics (ICHI); 2017.
- 24. Pomares-Quimbaya A, Kreuzthaler M, Schulz S. Current approaches to identify sections within clinical narratives from electronic health records: a systematic review. BMC Med Res Methodol. 2019;19(1):155.
- 25. Agency for Healthcare Research and Quality. Patient Safety Indicator 12 (PSI 12) Perioperative Pulmonary Embolism or Deep Vein Thrombosis Rate. Rockville, MD; 2020.
- 26. Agency for Healthcare Research and Quality. Patient Safety Indicator 12 (PSI 12) Perioperative Pulmonary Embolism or Deep Vein Thrombosis Rate. Rockville, MD; 2015.
- 27. R: A language and environment for statistical computing [computer program]. Vienna, Austria: R Foundation for Statistical Computing; 2020.
- 28. Shaffer JP. Multiple Hypothesis Testing. Annual Review of Psychology. 1995;46(1):561–584.
- 29. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med. 1998;17(8):873–890.
- 30. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12(2):153–157.
- 31. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat. 2009;5(1):Article 27.
- 32. Leisenring W, Alonzo T, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56(2):345–351.
- 33. Austin PC, Tu JV. Bootstrap Methods for Developing Predictive Models. The American Statistician. 2004;58(2):131–137.
- 34. Merkow RP, Ju MH, Chung JW, et al. Underlying reasons associated with hospital readmission following surgery in the United States. JAMA. 2015;313(5):483–495.
- 35. FitzHenry F, Murff HJ, Matheny ME, et al. Exploring the frontier of electronic health record surveillance: the case of postoperative complications. Med Care. 2013;51(6):509–516.
- 36. Ju MH, Chung JW, Kinnier CV, et al. Association between hospital imaging use and venous thromboembolism events rates based on clinical data. Ann Surg. 2014;260(3):558–564; discussion 564–566.
- 37. Bilimoria KY, Chung J, Ju MH, et al. Evaluation of surveillance bias and the validity of the venous thromboembolism quality measure. JAMA. 2013;310(14):1482–1489.