Abstract
Knowledge of the left ventricular ejection fraction is critical for the optimal care of patients with heart failure. When a document contains multiple ejection fraction assessments, accurate classification of their contextual use is necessary to filter out historical findings or recommendations and to prioritize the assessments for selection of document-level ejection fraction information. We present a natural language processing system that classifies the contextual use of both quantitative and qualitative left ventricular ejection fraction assessments in clinical narrative documents. We created support vector machine classifiers with a variety of features extracted from the target assessment, associated concepts, and document section information. The experimental results showed that our classifiers achieved good performance, reaching 95.6% F1-measure for quantitative assessments and 94.2% F1-measure for qualitative assessments in a five-fold cross-validation evaluation.
Keywords: Heart Failure, Ventricular Ejection Fraction, Medical Informatics, Natural Language Processing
Introduction
Ejection fraction is a measure of the percentage of the blood in a heart ventricle that is expelled during contraction. Knowledge of the left ventricular ejection fraction (LVEF) is critical for the optimal care of patients with heart failure (HF). Multiple treatments have been demonstrated to prolong the lives of HF patients in randomized trials that enrolled patients with an LVEF below 40%. The LVEF is therefore needed to determine whether patients will benefit from treatment and to ensure that these treatments are prescribed to the appropriate patients.
The U.S. Veterans Health Administration (VHA) Consortium for Healthcare Informatics Research (CHIR) Translational Use-Case Project for Ejection Fraction (TUCP-EF) and the Automated Data Acquisition for Heart Failure (ADAHF) research project aimed at the automated extraction of LVEF mentions, LVSF (left ventricular systolic function) mentions, and their associated quantitative and qualitative assessments, as illustrated in the following examples:
LVEF mentions (e.g., “left ventricular ejection fraction”, “VISUAL ESTIMATE OF LVEF”, “EF”)
LVEF quantitative values (e.g., “∼0.60-0.65”, “0.45”, “50%”)
LVEF or LVSF qualitative assessments (e.g., “NORMAL”, “mildly decreased”, “SEVERE”)
LVSF mentions (e.g., “Global left ventricular systolic function”, “systolic dysfunction”, “LVSF”)
When determining the current and most reliable LVEF assessment in a clinical note, identifying the contextual use of these assessments is important. For example, assessments mentioned as recommendations or historical values can be ignored or kept only as historical reference for longitudinal evaluation, and knowing whether an assessment is a summary, an interpretation, or a measurement can be used to prioritize the values used for the document-level LVEF assessment. For this study, we focused on classifying the contextual use of LVEF quantitative and qualitative assessments into five categories: summary, interpretation, technical measurement, recommendation, or past finding.
To our knowledge, this is the first attempt at classifying the contextual use of LVEF assessments in HF, but several studies have focused on similar specialized classification tasks, such as tumor status [1], lung cancer stages [2], and medication prescription status [3]. More generally, for local context recognition and analysis, several Natural Language Processing (NLP) systems have been developed that focus on negation or other assertions about medical concepts. For negation classification, rule-based systems like Negfinder [4] and NegEx [5] used regular expressions with trigger terms to determine whether a medical term was negated. The BioNLP 2009 [6] and CoNLL-2010 [7-8] shared tasks focused on detecting negations (and their scope) in natural language text. Kilicoglu and Bergler [9] compiled negation cues from the corpus and detected negation using dependency-based heuristics. Morante et al. [10] implemented negation scope detection in two stages, sentence-level and phrase-level classification, with memory-based learning.
Assertion classification was the focus of the 2010 i2b2 NLP challenge [11], where the task consisted of assigning one of six status categories to medical problems: present, absent, hypothetical, possible, conditional, or not associated with the patient. Several studies subsequently used this challenge data and showed that machine learning approaches [12-14] performed better than handcrafted rule-based systems.
In the following sections, we will describe the methods we used for context classification and present our experimental results with feature contribution analysis.
Materials and Methods
We approached the contextual use classification task as a supervised learning problem. The classifier is given an LVEF assessment as input and must assign one of the five context categories as output. We created Support Vector Machine (SVM) classifiers with a variety of lexical features extracted from the target assessment, associated concepts, and section titles.
The TUCP-EF corpus of clinical notes used for this task consists of echocardiography reports, radiology reports, and other note types in the VHA developed using the Text Integration Utility (TIU) software [15]. These clinical notes were manually annotated for LVEF and LVSF mentions and assessments, along with the contextual use of these assessments. We defined each category of contextual use as explained below:
Contextual Use of Assessments
Summary
Any assessment appearing in the summary of important findings in the study. The summary is a short and concise synopsis of the study, in contrast to the usually detailed and lengthy interpretation. Examples:
FINAL IMPRESSION: “Normal” LV function
EF “55-65%”.
Interpretation
Any assessment generated from the clinician's synthesis of the echocardiography machine metrics and reading of the ultrasound images, using expert clinical judgment. This assessment usually appears in the body of the echo report as detailed findings or impressions. Examples:
Systolic function is “normal” with estimated ejection fraction “60%”.
LVEF appears “mildly reduced” (“35-40%”).
Technical Measurement
Any metric that appears to be taken directly from the echocardiography machine readings. This indicates that the assessment was calculated by various algorithms from the technician's measurements. Examples:
Ejection fraction is calculated at “42%”.
LVEF (Teichholz): “33%”
Recommendation
Any assessment generated as part of decision support messages or reminders. This is not an assessment associated with the actual patient, but rather serves as a recommendation or instructional guideline. Examples:
Please contact primary care provider should patient's ejection fraction fall under “40%”.
If “severe” systolic dysfunction is observed in studies with limited visual, please repeat the study within 2 days.
Past Finding
Any assessment from a previous echocardiography study. This reflects the patient's past LVEF assessments and is not the value estimated from the current study. Examples:
No significant change from a previous study (“50-55%”).
Compared to the last echo, systolic function has reduced from “normal” to mildly decreased.
Data Description
The TUCP-EF corpus includes 3,060 clinical notes from three different components of the VistA electronic health record files [15]: echocardiography laboratory reports (1,140 reports, from 16 VHA medical centers sampled at random), the radiology package (720 reports, from 5 medical centers sampled at random), and TIU (1,200 reports, from 17 sites sampled at random). Among these 3,060 reports, 1,465 contained at least one of our concepts of interest (i.e., LVEF or LVSF mentions or assessments) and were selected for this project. 2,185 quantitative assessments and 1,278 qualitative assessments were manually annotated. The distributions of contextual use categories are displayed in Table 1. No qualitative assessment was annotated as technical measurement, recommendation, or past finding.
Table 1. Distribution of contextual use categories.
| Contextual use | Quantitative count | Quantitative % | Qualitative count | Qualitative % |
|---|---|---|---|---|
| Summary | 944 | 43.2 | 811 | 63.5 |
| Interpretation | 927 | 42.4 | 467 | 36.5 |
| Tech. measurement | 296 | 13.6 | 0 | 0.0 |
| Recommendation | 10 | 0.5 | 0 | 0.0 |
| Past finding | 8 | 0.4 | 0 | 0.0 |
| All | 2,185 | 100.0 | 1,278 | 100.0 |
Two of the contextual use categories (summary and interpretation) accounted for more than 85% of the quantitative assessment instances in the data set, while the other three classes were relatively infrequent.
For qualitative assessments, a heart failure expert identified terms used to describe left ventricular function and provided a corresponding quantitative EF range for each term, as shown in Table 2. To enhance the consistency of assessments, we normalized and grouped each qualitative assessment into one of six ranges (<30%, 25-35%, 30-40%, 35-45%, 40-50%, and >50%) using this list of assessment terms: a numerical range was assigned when the assessment contained one of the terms (a normalization sketch follows Table 2).
Table 2. Qualitative assessment ranges and terms.
| Range | Terms |
|---|---|
| <30% | severe |
| 25-35% | moderate to severe |
| 30-40% | moderate |
| 35-45% | mild to moderate |
| 40-50% | mild |
| >50% | normal, borderline low, lower limits of normal, hyperdynamic, preserved |
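As an illustration, the following minimal sketch applies the Table 2 mapping. The term-to-range mapping comes from Table 2; the function name, the (low, high) tuple representation of ranges, and the longest-match strategy are our own assumptions, not the original implementation.

```python
# Normalization of qualitative assessments per Table 2; "<30%" and
# ">50%" are represented as (0, 30) and (50, 100), an assumption.
QUALITATIVE_RANGES = {
    "severe": (0, 30),
    "moderate to severe": (25, 35),
    "moderate": (30, 40),
    "mild to moderate": (35, 45),
    "mild": (40, 50),
    "normal": (50, 100),
    "borderline low": (50, 100),
    "lower limits of normal": (50, 100),
    "hyperdynamic": (50, 100),
    "preserved": (50, 100),
}

def normalize_qualitative(assessment):
    """Return the (low, high) EF range for a qualitative assessment.
    Longer terms are matched first so 'mild to moderate' is not
    shadowed by 'mild' or 'moderate'."""
    text = assessment.lower()
    for term in sorted(QUALITATIVE_RANGES, key=len, reverse=True):
        if term in text:
            return QUALITATIVE_RANGES[term]
    return None  # no known term: leave the assessment unnormalized

print(normalize_qualitative("Systolic function is normal"))      # (50, 100)
print(normalize_qualitative("mild to moderate LV dysfunction"))  # (35, 45)
```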
We also normalized each quantitative value into the same ranges. To assess the consistency of assessments, we compared each pair of assessments in a document using the normalized ranges. 73 clinical notes (about 5% of the corpus) contained assessment pairs that did not agree with each other (i.e., their ranges differed), even after excluding pairs with overlapping ranges (e.g., one assessment at 25-35% and the other at 30-40%).
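A sketch of this pairwise agreement check, under the same assumed (low, high) range representation, where agreement is read as range overlap per the example above:

```python
# Pairwise agreement check over normalized ranges; names are
# illustrative, not from the original system.
from itertools import combinations

def ranges_overlap(a, b):
    """Two ranges overlap when neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]

def document_is_consistent(ranges):
    """True when every pair of normalized ranges in a document overlaps,
    so 25-35% vs. 30-40% counts as agreement, not a conflict."""
    return all(ranges_overlap(a, b) for a, b in combinations(ranges, 2))

print(document_is_consistent([(25, 35), (30, 40)]))  # True (overlap)
print(document_is_consistent([(0, 30), (50, 100)]))  # False (conflict)
```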
Contextual use classification is crucial for choosing which assessment to use for the document-level LVEF, especially when multiple assessments have different ranges. When selecting the document-level LVEF, recommendations and past findings were excluded, and the following order was used to prioritize the remaining assessments:
summary (quantitative) > summary (qualitative) > interpretation (quantitative) > technical measurement (quantitative) > interpretation (qualitative)
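The sketch below illustrates this selection rule; the (contextual_use, kind, value) tuple representation of an assessment is an assumption for illustration only.

```python
# Document-level LVEF selection: drop recommendations and past
# findings, then take the highest-priority remaining assessment.
PRIORITY = [
    ("summary", "quantitative"),
    ("summary", "qualitative"),
    ("interpretation", "quantitative"),
    ("technical measurement", "quantitative"),
    ("interpretation", "qualitative"),
]

def select_document_lvef(assessments):
    """assessments: list of (contextual_use, kind, value) tuples.
    Returns the highest-priority eligible assessment, or None."""
    eligible = [a for a in assessments
                if a[0] not in ("recommendation", "past finding")]
    if not eligible:
        return None
    return min(eligible, key=lambda a: PRIORITY.index((a[0], a[1])))

doc = [("interpretation", "quantitative", "60%"),
       ("summary", "qualitative", "normal"),
       ("past finding", "quantitative", "35%")]
print(select_document_lvef(doc))  # ('summary', 'qualitative', 'normal')
```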
In the next section, we describe how the feature vectors were extracted from clinical notes and which feature set was utilized for contextual use classification.
Methods
We built an NLP information extraction application with two pre-processing components and SVM classifiers, as depicted in Figure 1. Pre-processing includes a tokenizer and a section detector.
Figure 1. Application architecture for feature extraction and classification of contextual use.
The tokenizer is based on regular expressions and splits the text into groups of alphanumerical characters separated by whitespace characters. For clinical note section detection, we randomly selected 150 documents (about 10% of the corpus) and compiled a list of section headers related to the concepts of interest. A phrase was detected as a section header when it contained one of the following keywords: impression, summary, findings, conclusion, assessment, interpretation, results, or note.
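A minimal sketch of these two pre-processors follows. The keyword list comes from the text above; the exact regular expressions are assumptions about one reasonable implementation, not the original code.

```python
# Whitespace-delimited tokenizer and keyword-based section detector.
import re

SECTION_KEYWORDS = {"impression", "summary", "findings", "conclusion",
                    "assessment", "interpretation", "results", "note"}

def tokenize(text):
    """Split the text into groups of characters separated by whitespace."""
    return re.split(r"\s+", text.strip())

def is_section_header(phrase):
    """A phrase counts as a section header when it contains one of the
    keywords, matched case-insensitively on word boundaries."""
    return any(re.search(rf"\b{kw}\b", phrase, re.IGNORECASE)
               for kw in SECTION_KEYWORDS)

print(tokenize("EF  55-65%."))                 # ['EF', '55-65%.']
print(is_section_header("FINAL IMPRESSION:"))  # True
```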
Machine Learning Classification Features
The features extracted from the pre-processor output are listed below; a sketch of the lexical and concept position features follows the list.
Lexical: A bag-of-words representation of lexical features included the assessment term itself, plus unigrams and bigrams of the five words preceding it and the five words following it.
Concept position: We created two binary features representing whether the assessment was the only occurrence in the document and whether any other assessment preceded the target assessment in the document. We also defined a feature representing the assessment location normalized by the document length.
Related concept: We created features for mentions related to the target assessment. We captured the nearest LVEF or LVSF mention; the mention term itself, plus unigrams and bigrams of the five words preceding it and the five words following it, were used for this set of features. We also computed the distance in words between the target assessment and the nearest mention.
Section: We detected the phrases containing the section header keywords listed above. Titles of the nearest section, the nearest previous section, and the nearest following section were used as section features.
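The sketch below illustrates the lexical and concept position features under our own encoding assumptions (string-valued indicator features; the original encoding is not published).

```python
# Lexical context n-grams and concept-position features.
def context_ngrams(tokens, idx, window=5):
    """Assessment term plus unigrams and bigrams from the five words
    before and after the assessment token at position `idx`."""
    before = tokens[max(0, idx - window):idx]
    after = tokens[idx + 1:idx + 1 + window]
    feats = [f"term={tokens[idx]}"]
    for side, words in (("pre", before), ("post", after)):
        feats += [f"{side}_uni={w}" for w in words]
        feats += [f"{side}_bi={a}_{b}" for a, b in zip(words, words[1:])]
    return feats

def position_features(idx, assessment_positions, doc_len):
    """Binary only-occurrence / has-preceding features plus the
    location normalized by document length."""
    return {
        "only_occurrence": len(assessment_positions) == 1,
        "has_preceding": any(p < idx for p in assessment_positions),
        "relative_position": idx / doc_len,
    }

tokens = ("Systolic function is normal with estimated "
          "ejection fraction 60%").split()
print(context_ngrams(tokens, tokens.index("60%")))
print(position_features(8, [8], len(tokens)))
```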
Contextual Use Classification
We trained two multi-class SVM classifiers, one for quantitative assessments and the other for qualitative assessments, both using the LIBLINEAR software [16]. The quantitative assessment classifier assigned one of the five contextual use categories: summary, interpretation, technical measurement, recommendation, or past finding. Because qualitative assessments were annotated in only two categories, the qualitative assessment classifier made a binary decision between summary and interpretation.
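The following sketch shows an equivalent setup. The paper used LIBLINEAR directly [16]; scikit-learn's LinearSVC, which wraps LIBLINEAR, serves here as a stand-in, and the feature dicts and labels are toy placeholders.

```python
# Multi-class linear SVM over sparse indicator features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X = [{"post_uni=estimated": 1, "section=IMPRESSION": 1},
     {"pre_uni=calculated": 1, "nearest_mention=EF": 1}]
y = ["summary", "technical measurement"]

# LinearSVC trains one-vs-rest classifiers for multi-class problems.
clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(X, y)
print(clf.predict([{"section=IMPRESSION": 1}]))
```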
Experimental Results
Feature Contribution
To train and test our approach, we performed five-fold cross-validation on the TUCP-EF corpus and measured the contribution of each of the four feature subsets described above (a sketch of this evaluation setup follows). Table 3 shows the cross-validation results when cumulatively adding each set of features.
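A sketch of such a five-fold evaluation, reusing the scikit-learn stand-in above on placeholder data; micro-averaged F1 is our assumption for the single overall F1 figures reported below.

```python
# Five-fold cross-validation with micro-averaged F1 scoring.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X = [{"section=IMPRESSION": 1}] * 10 + [{"pre_uni=calculated": 1}] * 10
y = ["summary"] * 10 + ["technical measurement"] * 10

clf = make_pipeline(DictVectorizer(), LinearSVC())
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_micro")
print(scores.mean())
```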
Table 3. Feature contribution: F1-measure (%).
| Features | Quantitative | Qualitative |
|---|---|---|
| 1. Lexical | 90.3 | 78.3 |
| 2. Concept position | 57.2 | 77.2 |
| 3. Related concept | 78.2 | 80.2 |
| 4. Section | 80.9 | 94.0 |
| 1 + 2 | 91.9 | 85.9 |
| 1 + 2 + 3 | 91.6 | 86.9 |
| All (1 + 2 + 3 + 4) | 95.6 | 94.2 |
As in previous contextual classification tasks (e.g., negation detection and assertion classification), using only lexical features provided a good baseline for quantitative assessment classification. However, lexical features were less helpful for classifying qualitative assessments. Surprisingly, using only section features yielded a 94% F1-measure for qualitative assessments, showing that global features such as section titles can be more beneficial than local lexical features when few distinguishing cue words appear in the context window surrounding the assessment mention.
Concept position features performed better for qualitative assessments than for quantitative assessments; most summary assessments were mentioned later in the document than other assessments. The features extracted from the nearest LVEF or LVSF mentions were designed to mitigate the limited window size of the lexical features; these related concept features helped more than lexical features for qualitative assessments, giving an F1-measure about 2% higher. Combining features, especially with section features, allowed for substantially better performance than models trained on individual feature sets alone. Using a chi-squared test to measure statistical significance, the performance of the full feature system (All, 1 + 2 + 3 + 4) was significantly better than the systems with other feature combinations (p < 0.001), but not significantly better than the system using only section features for qualitative assessments (p = 0.801).
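For illustration, a comparison of this kind can be run on a 2x2 table of correct versus incorrect prediction counts. The counts below are approximations derived from the reported quantitative F1 figures, not published per-instance results.

```python
# Chi-squared comparison of two systems' correct/incorrect counts.
from scipy.stats import chi2_contingency

#            correct  incorrect
table = [[2089, 96],    # all features, ~95.6% of 2,185 assessments
         [2008, 177]]   # features 1 + 2, ~91.9% of 2,185 assessments
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")
```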
Classification Results
As seen in Table 4, classification of summary, interpretation, and technical measurement quantitative assessments showed good performance, with F1-measures over 95%. The recommendation category benefited the most from lexical features, reaching a 90% F1-measure even with few instances (only 10 examples). However, the classifier performed very poorly on past finding assessments: given the highly unbalanced class distribution, no past finding assessment was correctly classified, indicating ample room for improvement for this category.
Table 4. Classification results (%): recall (R), precision (P), and F1-measure (F).
| Contextual use | R (quant.) | P (quant.) | F (quant.) | R (qual.) | P (qual.) | F (qual.) |
|---|---|---|---|---|---|---|
| Summary | 95.9 | 96.1 | 96.0 | 95.7 | 95.2 | 95.5 |
| Interpretation | 95.8 | 94.7 | 95.2 | 91.7 | 92.4 | 92.0 |
| Tech. measurement | 97.0 | 97.3 | 97.1 | n/a | n/a | n/a |
| Recommendation | 90.0 | 90.0 | 90.0 | n/a | n/a | n/a |
| Past finding | 0.0 | 0.0 | 0.0 | n/a | n/a | n/a |
| All | 95.6 | 95.6 | 95.6 | 94.2 | 94.2 | 94.2 |
Overall, our classifiers achieved good performance, reaching 95.6% F1-measure for quantitative assessments and 94.2% F1-measure for qualitative assessments in a five-fold cross validation evaluation.
Table 5 displays the counts of true positives (bolded), false positives, and false negatives for each category in a confusion matrix. Even though section detection played an important role in both quantitative and qualitative assessment classification, some summary assessments appearing under section titles that our section detector did not capture were misclassified as interpretation assessments. Summary quantitative assessments were often confused with interpretation assessments, with 39 false negatives and 32 false positives.
Table 5. Confusion matrix: rows are annotated categories, columns are classified categories; true positives are bolded.

| Contextual use | Summ | Inte | Tech | Reco | Past |
|---|---|---|---|---|---|
| *Quantitative* |  |  |  |  |  |
| Summary | **905** | 39 | 0 | 0 | 0 |
| Interpretation | 32 | **888** | 7 | 0 | 0 |
| Tech. measurement | 0 | 8 | **287** | 1 | 0 |
| Recommendation | 0 | 0 | 1 | **9** | 0 |
| Past finding | 5 | 3 | 0 | 0 | **0** |
| *Qualitative* |  |  |  |  |  |
| Summary | **776** | 35 |  |  |  |
| Interpretation | 39 | **428** |  |  |  |
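For reference, counts of this kind can be derived as follows; the gold and predicted label lists are tiny placeholders.

```python
# Deriving confusion-matrix counts with scikit-learn.
from sklearn.metrics import confusion_matrix

LABELS = ["summary", "interpretation", "technical measurement",
          "recommendation", "past finding"]
gold = ["summary", "summary", "interpretation", "past finding"]
pred = ["summary", "interpretation", "interpretation", "summary"]
print(confusion_matrix(gold, pred, labels=LABELS))
```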
We observed that many technical measurement assessments were preceded by a colon (':') in the corpus. When there was no punctuation between an LVEF mention and a quantitative assessment, as in "EF 57%", the assessment tended to be misclassified as an interpretation assessment. Several past finding assessments were misclassified as summary or interpretation assessments. Possible future improvements to reduce these errors include features capturing date expressions near the quantitative assessment for the past finding category.
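A hedged sketch of that suggested date-expression feature follows; the regular expression covers only a few common note formats and is entirely our own assumption.

```python
# Binary feature flagging a date expression near the assessment.
import re

DATE_RE = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{4})",
    re.IGNORECASE)

def has_nearby_date(tokens, idx, window=8):
    """True when a date expression occurs within `window` tokens of the
    assessment at position `idx`."""
    context = " ".join(tokens[max(0, idx - window):idx + window + 1])
    return DATE_RE.search(context) is not None

tokens = "No change from echo of 3/14/2012 ( 50-55% )".split()
print(has_nearby_date(tokens, tokens.index("50-55%")))  # True
```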
Conclusion
This study demonstrated that the local context of quantitative and qualitative LVEF assessments could be successfully classified using multi-class SVM classifiers. We showed that our application performed well for both quantitative and qualitative assessments with various local and global features. We observed that lexical features and section information features contributed the most to this good performance.
Acknowledgments
This publication is based upon work supported by the Department of Veterans Affairs, HSR&D, grant numbers IBE 09-069 (ADAHF) and HIR 08-374 (Consortium for Healthcare Informatics Research) and HIR 09-007 (Translational Use Case). The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs or other affiliated institutions.
References
- 1. Cheng LT, Zheng J, Savova GK, Erickson BJ. Discerning Tumor Status from Unstructured MRI Reports: Completeness of Information in Existing Reports and Utility of Automated Natural Language Processing. J Digit Imaging. 2010;23(2):119–32. doi: 10.1007/s10278-009-9215-7.
- 2. Nguyen AN, Lawley MJ, Hansen DP, Bowman RV, Clarke BE, Duhig EE, Colquist S. Symbolic rule-based classification of lung cancer stages from free-text pathology reports. J Am Med Inform Assoc. 2010;17(4):440–5. doi: 10.1136/jamia.2010.003707.
- 3. Kim Y, Garvin JH, Heavirland J, Williams J, Meystre SM. Medication Prescription Status Classification in Clinical Narrative Documents. Proceedings of AMIA 2014. 2014.
- 4. Mutalik PG, Deshpande A, Nadkarni PM. Use of General-purpose Negation Detection to Augment Concept Indexing of Medical Documents: A Quantitative Study Using the UMLS. J Am Med Inform Assoc. 2001;8(6):598–609. doi: 10.1136/jamia.2001.0080598.
- 5. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–10. doi: 10.1006/jbin.2001.1029.
- 6. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP'09 shared task on event extraction. Proceedings of the BioNLP 2009 Shared Task. 2009:1–9.
- 7. Farkas R, Vincze V, Móra G, Csirik J, Szarvas G. The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text. Proceedings of the 14th CoNLL. 2010:1–12.
- 8. Vincze V, Szarvas G, Farkas R, Móra G, Csirik J. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics. 2008;9(Suppl 11):S9. doi: 10.1186/1471-2105-9-S11-S9.
- 9. Kilicoglu H, Bergler S. Syntactic dependency based heuristics for biological event extraction. Proceedings of the BioNLP 2009 Shared Task. 2009:119–27.
- 10. Morante R, Liekens A, Daelemans W. Learning the Scope of Negation in Biomedical Texts. Proceedings of EMNLP 2008. 2008:715–24.
- 11. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. J Am Med Inform Assoc. 2011;18(5):552–6. doi: 10.1136/amiajnl-2011-000203.
- 12. Bruijn BD, Cherry C, Kiritchenko S, Martin J, Zhu X. Machine-learned Solutions for Three Stages of Clinical Information Extraction: the State of the Art at i2b2 2010. J Am Med Inform Assoc. 2011;18(5):557–62. doi: 10.1136/amiajnl-2011-000150.
- 13. Kim Y, Riloff E, Meystre SM. Improving classification of medical assertions in clinical notes. Proceedings of the 49th ACL/HLT. 2011;2:311–6.
- 14. Bejan CA, Vanderwende L, Xia F, Yetisgen-Yildiz M. Assertion modeling and its role in clinical phenotype identification. J Biomed Inform. 2013;46(1):68–74. doi: 10.1016/j.jbi.2012.09.001.
- 15. VistA Monograph. http://www.ehealth.va.gov/VistA_Monograph.asp.
- 16. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: A Library for Large Linear Classification. J Mach Learn Res. 2008;9:1871–4.