Abstract
Objective
The 2018 National NLP Clinical Challenge (2018 n2c2) focused on the task of cohort selection for clinical trials, where participating systems were tasked with analyzing longitudinal patient records to determine if the patients met or did not meet any of the 13 selection criteria. This article describes our participation in this shared task.
Materials and Methods
We followed a hybrid approach combining pattern-based, knowledge-intensive, and feature weighting techniques. After preprocessing the notes using publicly available natural language processing tools, we developed individual criterion-specific components that relied on collecting knowledge resources relevant for these criteria and pattern-based and weighting approaches to identify “met” and “not met” cases.
Results
As part of the 2018 n2c2 challenge, 3 runs were submitted. The overall micro-averaged F1 on the training set was 0.9444. On the test set, the micro-averaged F1 scores for the 3 submitted runs were 0.9075, 0.9065, and 0.9056. The best run placed second in the overall challenge, and all 3 runs were statistically similar to the top-ranked system. A reimplemented system achieved the best overall F1 of 0.9111 on the test set.
Discussion
We highlight the need for a focused resource-intensive effort to address the class imbalance in the cohort selection identification task.
Conclusion
Our hybrid approach was able to identify all selection criteria with high F1 performance on both training and test sets. Based on our participation in the 2018 n2c2 task, we conclude that there is merit in continuing a focused criterion-specific analysis and developing appropriate knowledge resources to build a quality cohort selection system.
Keywords: natural language processing [L01.224.065.580], information storage and retrieval [L01.313.500.750.280], information systems [L01.313.500.750.300], cohort identification, clinical trial selection criteria
INTRODUCTION
Identifying the appropriate cohort is a critical component of the design and subsequent success of a clinical trial. Clinical and health services researchers carefully study and account for confounding factors when determining the selection criteria. It is, hence, important to ensure that the patients satisfy the inclusion and exclusion criteria identified in the study design. However, selection criteria that go beyond simple demographic information often seek information not readily available in a structured database. They are also not easily searchable using a medical record search engine as they often require a more complex query, rather than a simple keyword match. There is a critical need to develop accurate approaches to extract relevant selection criteria from clinical notes for a robust cohort identification task. Developing natural language processing-based systems that can automatically assess the eligibility of patients for research studies can both reduce the time it takes to recruit patients and help remove bias from clinical trials.1
Understanding the need for standardized data sets to evaluate multiple natural language processing (NLP) approaches, Stubbs et al organized a shared task on cohort identification as part of the 2018 n2c2.2,3 The goal of the shared task was to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. In this article, we summarize our participation in the shared task. Our approach to the cohort identification shared task was to develop a combination of pattern-based, knowledge-intensive, and feature weighting techniques to characterize the 13 selection criteria for cohort identification. This hybrid approach ranked second in overall F1 score among the participating runs in the shared task and was statistically tied with the top-ranked approach. A reimplementation further improved the performance, with an overall micro-averaged F1 of 0.9111.
Related work
Identifying a patient cohort requires extracting information from clinical narratives. However, parsing and analyzing clinical notes is challenging.4 It has been shown that the automated process of parsing clinical notes can significantly reduce the burden of the hitherto time-consuming manual process.5 In recent years, NLP has proven to be helpful in identifying a patient cohort from a larger group.6–8
Many methods have been proposed to identify a cohort of patients from their clinical notes.9 Kandula et al developed a rule-based algorithm to identify cases of type-2 diabetes10; Klompas et al utilized clinical definitions to create rules for disease identification and achieved good performance11; and Trick et al detected bloodstream infection based on rules developed from the definitions of blood isolate categories.12 Rule-based systems have also been shown to be easily refined once in operation.13,14
Various machine learning techniques have also been used to address numerous patient identification tasks. These include Bayesian Networks, Random Forests, Support Vector Machines, and decision tree-based algorithms, among others.15–17 In their review, Shivade et al9 note that numerous studies combine rules and machine learning models to obtain additional benefits from both types of techniques.
Rule-based systems are the most easily used and explainable methods. Keung et al18 identified a cohort by using controlled vocabulary and semantic types; Lingren et al19 found that rule-based algorithms provided the best positive predictive value in identifying patients with Autism Spectrum Disorder; Nguyen et al20 implemented MEDTEX, a rule-based system, to identify clinical terminology from text and then tuned it for ontology to extract lung cancer staging information from pathology reports.21
Rule-based systems are, however, limited in their ability to learn patterns from data. Therefore, various machine learning techniques were proposed for such tasks. Passos et al22 achieved the best prediction accuracy on identifying suicidal signals from patients with mood disorder using a relevance vector machine; Zhou et al23 treated extracting patient phenotypes as classification tasks; and Miotto and Weng developed a case-based reasoning framework for cohort identification and achieved good precision among 262 patients and 30 000 randomly assigned patients.24
However, researchers have also pointed out that formal rule-based approaches are useful for extracting complicated and structured information and give reliable results.9,25 Schmiedeskamp et al26 showed that empirical rules are effective in identifying patients with nosocomial Clostridium difficile infection; Savova et al27 used regular expressions to identify peripheral arterial disease; and Sohn and Savova improved smoking status classification with rules.28 A potential reason for this is that some criteria, factors, or phenotypes are described explicitly in the notes and, hence, can be easily captured by empirical rules. However, such factors are often ignored by the relatively more complicated machine learning techniques that are better at modeling patterns from data instead.29
Therefore, hybrid systems that combine the advantages of both empirical rules and learning-based approaches are preferred in many studies.30,31 One of the most widely used tools for parsing clinical notes, cTAKES, follows a hybrid approach.32 Many other hybrid systems that extend cTAKES have been proposed for other tasks. Liu et al proposed a hybrid information extraction framework with high cohort identification performance in peripheral arterial disease and heart failure33; Cui et al implemented EpiDEA to extract epilepsy and seizure information from patient notes for cohort identification34; and Lin et al built a patient identifier for rheumatoid arthritis (RA) with methotrexate-induced liver transaminase abnormalities,35 showing that combining machine learning approaches with rules extracted from cTAKES performed significantly better than baselines that do not use rules. Wei et al combined NLP, machine learning, and ontology to identify patients with type 2 diabetes,36 and Zhao and Vydiswaran proposed a hybrid approach to deidentify clinical notes.37
In this work, we examine the effectiveness of pattern-based and knowledge-intensive approaches in cohort selection and propose a pipeline approach to identify selection criteria from clinical notes. We propose that a similar approach could be designed to build effective learning systems that achieve state-of-the-art performance in cohort selection tasks.
MATERIALS AND METHODS
Data set characteristics
The shared task on cohort identification was organized as Track 1 of the 2018 n2c2. The goal of the task was to use narrative medical records to identify patients who meet any of the eligibility criteria identified by the organizers. These eligibility criteria were derived from real clinical trials and focused on patients’ medications, past medical histories, and whether certain events had occurred in a specified timeframe in the patients’ records. The criteria are described in more detail in the following section.
The data set consisted of 288 longitudinal patient records, annotated to determine if patients matched a list of 13 selection criteria. Of those records, 202 were made available as the training set and the remaining 86 were held out as the test set. For each patient, the records were aggregated in chronological order with the latest record last. For all documents in the training set, the records were annotated as “met” or “not met” for each of the selection criteria. Additional annotation details justifying the final criteria labels were released for 10 of the patient records in the training set.
Description of the selection criteria
The 13 selection criteria, as provided in the annotation guidelines used for the task, are described below:
DRUG-ABUSE: Do the patient records indicate current or past drug abuse?
ALCOHOL-ABUSE: Do the patient records indicate current alcohol use over the weekly recommended limits? The standard weekly recommended limits were not provided.
ENGLISH: Does the patient demonstrate fluency in speaking English?
MAKES-DECISIONS: Does the patient demonstrate ability to make decisions by himself or herself?
ABDOMINAL: Does the patient have a history of intra-abdominal surgery, a small- or large-intestine resection, or small bowel obstruction?
MAJOR-DIABETES: Does the patient have any major diabetes-related complication? For the purposes of the annotation exercise, a “major complication,” as opposed to a “minor complication,” was defined as any of the following that are a result of (or strongly correlated with) uncontrolled diabetes: amputation, kidney damage, skin conditions, retinopathy, nephropathy, or neuropathy.
ADVANCED-CAD: Do the patient records indicate advanced cardiovascular disease? For the purposes of the annotation exercise, an “advanced” condition was defined as having 2 or more of the following: (1) taking 2 or more medications to treat CAD, (2) a history of myocardial infarction (MI), (3) currently experiencing angina, (4) past or present ischemia.
ASP-FOR-MI: Do the patient records indicate use of aspirin to prevent MI?
MI-6MOS: Did the patient experience MI in the past 6 months?
KETO-1YR: Did the patient experience ketoacidosis in the past year?
DIETSUPP-2MOS: Did the patient take a dietary supplement (excluding Vitamin D) in the past 2 months?
HBA1C: Do the patient records include laboratory test results with any HbA1c value between 6.5% and 9.5%?
CREATININE: Do the patient records include laboratory test results with serum creatinine value above the normal range? The standard normal range was not provided.
These criteria span 5 kinds of traits commonly included in the inclusion and exclusion criteria for clinical trials and research studies, namely,
Evidence of substance abuse: DRUG-ABUSE and ALCOHOL-ABUSE;
Numeric inference of laboratory test values: HBA1C and CREATININE;
Clinical complications: ABDOMINAL, MAJOR-DIABETES, ADVANCED-CAD, MI-6MOS, and KETO-1YR;
Medication history: ASP-FOR-MI and DIETSUPP-2MOS; and
Independence: ENGLISH and MAKES-DECISIONS.
Overall methodology
In order to identify the diverse set of selection criteria, we followed a hybrid approach combining pattern-based, knowledge-intensive, and feature weighting techniques. Specifically, the overall methodology involved the following steps:
Preprocessing and section identification
Time-sensitive processing
Knowledge acquisition
Criteria-specific processing
Preprocessing and section identification
First, the longitudinal patient records were split by their record date, creating a time sequence map for every patient. A section identifier was developed based on section headers and template text in the training set. Existing off-the-shelf clinical NLP tools and resources, such as MetaMap,38 cTAKES,32 and RxNorm,39 were run on all patient records. Negations and hedging were detected using the NegEx40 and ConText41 components in cTAKES. Name–value pairs were identified using the colon as the delimiter to extract the names and corresponding values from the text.
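The colon-delimited name–value extraction described above can be illustrated with a small Python sketch. This is not the system's implementation: the regular expression, the 40-character cap on names, and the function name `extract_name_value_pairs` are all assumptions made for the example.

```python
import re

# Hypothetical pattern: a short name, a colon, then the value up to end of line.
# The character class and the 40-character cap are illustrative choices.
NAME_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 /\-]{0,40}?)\s*:\s*(\S.*?)\s*$")

def extract_name_value_pairs(note_text):
    """Return (name, value) tuples for lines shaped like 'Name: value'."""
    pairs = []
    for line in note_text.splitlines():
        match = NAME_VALUE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs

example = "Creatinine: 1.4\nNeuropathy: None\nPatient seen in clinic today."
pairs = extract_name_value_pairs(example)
```

In this toy note, the first two lines yield pairs, and the free-text sentence is skipped because it contains no colon.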
Time-sensitive processing
Of the selection criteria, 3 were specified with a time range that constrained which notes were considered relevant. As previously described, the patient notes were available in a chronologically sorted order. However, since dates were aliased, the current date was set differently for each patient based on the date of their last recorded note. The date ranges were then computed relative to the date of the last note. For 2 of the selection criteria, KETO-1YR and MI-6MOS, only the patient records that were within the respective time window were analyzed. However, when analyzing the training set for DIETSUPP-2MOS, we observed that dietary supplements prescribed in an earlier time period were not repeated in later notes, even though the patient was taking those supplements. Hence, mentions of dietary supplements earlier than the prescribed 2-month period were still included unless they were mentioned as being discontinued in the following records.
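A minimal Python sketch of the relative time window follows, assuming each patient's notes are available as (date, text) tuples. The function name `notes_in_window`, the day-based window, and the sample dates are hypothetical; the only element taken from the text is that the reference date is the patient's most recent note.

```python
from datetime import date, timedelta

def notes_in_window(notes, window_days):
    """Keep notes dated within `window_days` of the patient's most recent note.

    `notes` is a list of (note_date, text) tuples. The reference date is the
    latest note date, because aliased dates share no common "today".
    """
    if not notes:
        return []
    reference = max(note_date for note_date, _ in notes)
    cutoff = reference - timedelta(days=window_days)
    return [(d, t) for d, t in notes if d >= cutoff]

# Made-up notes with aliased (future) dates.
notes = [
    (date(2090, 1, 10), "Admitted with chest pain."),
    (date(2090, 9, 2), "Follow-up visit."),
    (date(2090, 12, 20), "Routine check."),
]
recent = notes_in_window(notes, window_days=183)  # roughly 6 months
```

Approximating 6 months as 183 days is a simplification. A calendar-month offset would be an equally reasonable reading of the criterion.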
Knowledge acquisition
One of the significant challenges in identifying certain characteristics is to collect lists of relevant diseases, procedures, or drugs. These lists were collected from Wikipedia and from websites of government agencies, such as the National Institutes of Health42 and the United States Food and Drug Administration.43 In this shared task, 4 of the selection criteria depended on collecting appropriate lists of clinical concepts and acronyms. These included a list of comorbidities associated with coronary artery diseases (for ADVANCED-CAD) and diabetes (for MAJOR-DIABETES); a list of 51 abdominal surgery procedures, including their abbreviations and acronyms (for ABDOMINAL); and a list of dietary supplements (for DIETSUPP-2MOS) from a combination of supplements mentioned on the National Institutes of Health website42 and manually identified in the training set. Medical concept names were expanded using common names, abbreviations, and other linguistic variations derived from the Unified Medical Language System Metathesaurus.44
Criteria-specific processing
Following the common steps of preprocessing and collecting relevant knowledge resources, we implemented individual components using a combination of knowledge-based, pattern-based, and feature weighting approaches for each selection criterion. Once the components made independent criterion-specific decisions, they were combined to generate the final results. The components are described in more detail in the next section.
As part of our participation in the 2018 n2c2 Track 1 challenge, 3 runs were submitted. These runs varied slightly in the configuration of the ADVANCED-CAD and MAJOR-DIABETES components, as described below. The approaches were refined after the challenge concluded and a re-implemented system (referred to as the “final system”) was used to generate the final results.
DESCRIPTION OF INDIVIDUAL CRITERIA-SPECIFIC COMPONENTS
The selection criteria were grouped by the types of traits they represent. Similar approaches were used for criteria in each trait type.
Evidence of substance abuse
There were 2 criteria related to substance abuse, namely, ALCOHOL-ABUSE and DRUG-ABUSE. A hot spot-based identification approach was designed to identify positive instances of alcohol abuse and drug abuse.
For ALCOHOL-ABUSE, sentences containing words relating to alcohol were selected from each document, based on the occurrence of a handful of key words including drink, alcohol, etoh (short for ethanol), and ethanol. Contextual cues were then analyzed to determine whether each sentence described alcohol abuse. A list of modifier words was created to capture terms describing the degree of alcohol consumption (abuse, addiction, binge, concern, dependence, excessive, and heavy). The context search window was defined as the set of 4 words around the key word (2 terms before and 2 after). This window was optimized based on observation of performance on the training set. In addition to modifiers, a list of mental health conditions and behaviors commonly associated with alcohol abuse was created (anxiety, debauchery, depression, dipsomania, dissoluteness, distraught, drunkenness, inebriety, insobriety, intemperance, intoxicate, and bibulousness). No search limit was specified for these words, as they did not depend on modifying the meaning of the key words. Lastly, a list of negation words was created to identify negated contexts (no, without, stop, n’t, not, h/o [history of], never, none, nor, non, rare, previous, prior, history, denies, and negative). The search window for these terms was empirically set as 8 terms around the key word (5 terms before and 3 after). All terms were normalized with a Snowball stemmer to remove affixes from words, keeping the string comparison of tokenized sentence terms and list terms consistent.
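The windowing logic described above can be sketched in Python as follows. The word lists are copied or subsetted from the text, but the tokenization, the absence of stemming, the order in which negation and modifier windows are checked, and the function name `classify_sentence` are assumptions for illustration, not the system's actual code.

```python
KEY_WORDS = {"drink", "alcohol", "etoh", "ethanol"}
MODIFIERS = {"abuse", "addiction", "binge", "concern", "dependence",
             "excessive", "heavy"}
# Subset of the negation list given in the text.
NEGATIONS = {"no", "without", "stop", "not", "h/o", "never", "none", "nor",
             "non", "rare", "previous", "prior", "history", "denies", "negative"}

def classify_sentence(sentence):
    """Return 'met', 'negated', or None for one sentence.

    Simplification: lowercase whitespace tokens with punctuation stripped,
    and no stemming (the text describes a Snowball stemmer).
    """
    tokens = [t.strip(".,;:()").lower() for t in sentence.split()]
    for i, token in enumerate(tokens):
        if token not in KEY_WORDS:
            continue
        # Negation window: 5 tokens before and 3 after the key word.
        neg_window = tokens[max(0, i - 5):i] + tokens[i + 1:i + 4]
        if NEGATIONS & set(neg_window):
            return "negated"
        # Modifier window: 2 tokens before and 2 after the key word.
        mod_window = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
        if MODIFIERS & set(mod_window):
            return "met"
    return None
```

For example, `classify_sentence("Patient reports heavy alcohol use.")` finds "heavy" in the modifier window, whereas `classify_sentence("He denies any alcohol abuse.")` is caught first by the wider negation window.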
A similar hot spot-based approach was used for DRUG-ABUSE. Names of specific drugs of abuse, such as oxycodone, methamphetamine, marijuana, and cocaine, along with their street names (eg, “meth”)45 were used as key words and modifiers. The clinical note text was searched for mentions in the context of words describing the degree of use (occasional, serial, abuse, overdose), and the purpose of use (illicit, psychedelic, recreational). If these conditions were met, the candidate mentions were extracted for further analysis. Cases of appropriate or prescribed drug use were identified by looking for terms indicating a purpose of use and were excluded from further analysis. Negated contexts were similarly identified and labeled as “not met.” Finally, candidates were scored based on the extent of the evidence above and the frequency of mentions in the notes to make the final decision. All 202 cases in the training set were correctly identified using this approach.
Numeric inference of laboratory test values
There were 2 criteria related to numeric inference, namely, CREATININE and HBA1C, that required similar pattern-based extraction and numeric analysis. Candidate mentions of CREATININE and HBA1C were identified using separate regular expressions, and associated test result values were extracted. Test values that were listed in a name–value pair format and in a tabular format were handled separately. If multiple values were mentioned (eg, results of multiple tests), the values were individually identified and tested.
The healthy, normal range of serum creatinine values was determined by analyzing the “met” and “not met” documents in the training set. If any of the extracted serum creatinine values was found to be above the normal range, the criterion was considered “met.” In contrast, the range of interest for HbA1c was specified in the annotation guidelines as between 6.5% and 9.5%, even though the range associated with diagnostic criteria for diabetes is known to vary slightly by gender and other demographic factors.46–48 Following the annotation guidelines, any extracted value outside this range of interest was considered as not meeting the criterion, and marked as “not met.” For CREATININE, the computed precision, recall, and F1 were 0.92, 0.79, and 0.84, respectively, over the 82 “met” cases in the training set. For HBA1C, the computed precision, recall, and F1 were 0.93, 0.94, and 0.93, respectively, over the 67 “met” cases in the training set.
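To make the numeric inference step concrete, the sketch below extracts HbA1c values with a regular expression and applies the 6.5% to 9.5% range from the annotation guidelines. The pattern `HBA1C_PATTERN`, the function names, and the choice of inclusive bounds are assumptions. The system's actual expressions, including its handling of tabular values, are not given in the text.

```python
import re

# Hypothetical pattern for mentions such as "HbA1c: 7.2%" or "A1C 8.1".
# It allows up to 15 non-digit characters between the name and the value.
HBA1C_PATTERN = re.compile(
    r"\b(?:hb)?a1c\b[^0-9\n]{0,15}?(\d{1,2}(?:\.\d+)?)", re.IGNORECASE
)

def hba1c_values(text):
    """Return every HbA1c value found in the text as a float."""
    return [float(v) for v in HBA1C_PATTERN.findall(text)]

def hba1c_met(text, low=6.5, high=9.5):
    """True if any extracted value lies in [low, high] (inclusive by assumption)."""
    return any(low <= v <= high for v in hba1c_values(text))
```

Each value is tested individually, so a record with one in-range result and one out-of-range result still counts as “met” under this reading of the criterion.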
Clinical complications
Of the selection criteria, 5 were related to clinical complications: ABDOMINAL, MAJOR-DIABETES, ADVANCED-CAD, MI-6MOS, and KETO-1YR. Knowledge-intensive approaches were devised to decide these criteria. A common feature across these components was the search for relevant evidence in specific sections of the clinical notes.
For ABDOMINAL, the list of 51 abdominal surgeries described earlier was first searched within the past medical history and the past surgical history sections of the record. If none of these words matched, the entire record was searched in the context of positive words (for example, in a 5-word contextual window around S/P or “status post”). Of the 77 “met” and 125 “not met” cases in the training set, 70 and 123 cases, respectively, were correctly identified.
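The two-stage search for ABDOMINAL might look like the following Python sketch. The three procedures are an illustrative subset (the full list of 51 is not reproduced in the text), and the section dictionary layout, the substring matching, and the interpretation of the 5-word window as 5 tokens on each side are assumptions.

```python
import re

# Illustrative subset only; the full list of 51 procedures is not given here.
PROCEDURES = ["appendectomy", "cholecystectomy", "colectomy"]

def abdominal_met(sections, full_text, window=5):
    """sections: dict mapping a lowercase section name to its text (hypothetical)."""
    # Stage 1: look for a listed procedure in the history sections.
    for name in ("past medical history", "past surgical history"):
        text = sections.get(name, "").lower()
        if any(p in text for p in PROCEDURES):
            return True
    # Stage 2: look near "s/p" or "status post" anywhere in the record.
    tokens = re.sub(r"status post", "s/p", full_text.lower()).split()
    for i, tok in enumerate(tokens):
        if tok.strip(".,;:") == "s/p":
            nearby = tokens[max(0, i - window):i + window + 1]
            if any(p in word for word in nearby for p in PROCEDURES):
                return True
    return False
```

Restricting the fallback search to a positive context such as “s/p” is what keeps a sentence like “Colectomy was discussed but declined” from being counted as a surgical history.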
For MAJOR-DIABETES, the sections related to family history were identified and removed. Other parts of the notes that referenced the patient family history (such as, “mother” and “father”) were also removed. Further, if section headers related to diabetic complications (such as “Retinopathy,” “Neuropathy,” or “Nephropathy”) were followed by an explicit negation (eg, “Neuropathy: None”), such sections were discarded. In the second version of this component (used in Run-2 and Run-3), the list was expanded to also exclude negated mentions of “extremities.”
The remaining sections were processed using cTAKES to determine the disease or disorder name, unique concept identifier, semantic types and groups, polarity, and confidence. Patients were screened for the mention of diabetes by checking for 1 of the diabetes-related terms (“diabetic,” “diabetes,” “hba1c,” “insulin,” “obese,” “extremity,” “LUE,” or “edema”) with positive polarity and high confidence. If the condition was satisfied, then the semantic type was compared against an advanced set of concepts comprising the following tokens: “gangrene,” “decubitus ulcers,” “polyneuropathy,” “neuropathy,” “retinopathy,” “diabetic foot/nephropathy/glomerulopathy/dermopathy,” “nephropathy,” “macular degeneration/edema,” or “amputation.” If the patient met any 1 of these conditions with positive polarity with high confidence, the patient was labeled as “met.” If a patient had multiple polarities for the same condition, the highest polarity for a given condition was used.
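The screen-then-match logic above can be sketched in Python, assuming the concept extractor's output has already been reduced to simple records. The `Mention` dataclass, the +1/-1 polarity encoding, the 0.8 confidence threshold, and the shortened term sets are all hypothetical and do not reflect the actual cTAKES output format.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    polarity: int      # +1 asserted, -1 negated (hypothetical encoding)
    confidence: float

# Shortened, illustrative versions of the term lists in the text.
SCREEN_TERMS = {"diabetic", "diabetes", "hba1c", "insulin"}
COMPLICATIONS = {"gangrene", "neuropathy", "retinopathy", "nephropathy",
                 "amputation"}

def major_diabetes_met(mentions, min_confidence=0.8):
    """'met' needs a confident diabetes mention AND a confident complication."""
    # Keeping any asserted mention mirrors "use the highest polarity".
    asserted = {m.text.lower() for m in mentions
                if m.polarity > 0 and m.confidence >= min_confidence}
    if not asserted & SCREEN_TERMS:
        return False
    return bool(asserted & COMPLICATIONS)
```

A patient with an asserted “diabetes” mention and only a negated “neuropathy” mention would be labeled “not met” here, and the same holds for a complication mention without any diabetes-related term.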
A similar analysis was required for ADVANCED-CAD. The 4 conditions identified in the criterion definition were divided into 2 parts – the drug component and the rest. For the drug component, medication concepts were identified by MetaMap and then further processed using RxNorm to find drugs related to cardiac conditions. For the second component, key words related to MI, angina, and ischemia were searched using MetaMap concept identification and negation detection. In the second variation of this component (used in Run-2), instead of any mention of chest pain, only the most recent status of chest pain was considered valid and previous mentions of angina were ignored. The Run-2 variation had a training set accuracy of 0.86, precision of 0.87, recall of 0.90, and F1 score of 0.89.
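Once the 4 conditions have been detected, the final ADVANCED-CAD decision reduces to the “2 or more” rule from the criterion definition. The Python sketch below assumes the upstream detectors have already produced a medication count and three boolean flags. The function and parameter names are illustrative.

```python
def advanced_cad_met(n_cad_medications, has_mi_history,
                     has_current_angina, has_ischemia):
    """Apply the '2 or more of 4 conditions' rule from the criterion definition."""
    conditions = [
        n_cad_medications >= 2,  # (1) taking 2 or more CAD medications
        has_mi_history,          # (2) history of myocardial infarction
        has_current_angina,      # (3) currently experiencing angina
        has_ischemia,            # (4) past or present ischemia
    ]
    return sum(conditions) >= 2
```

Note that taking 3 medications alone satisfies only condition (1), so such a patient is still “not met” unless a second condition also holds.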
For MI-6MOS, a list of key words indicating MI was created based on the training set. The algorithm searched for the MI key words in the records that were within 6 months of the most recent date. Negations and mentions of time periods were checked. If a key word was found but was negated, the criterion was labeled as “not met.” Further, if a key word occurred along with a mention of time (eg, “last December”), then the time mention was checked to see if the event fell within the 6-month period. If not, then the criterion was labeled as “not met.” The training set accuracy, precision, recall, and F1 scores were 0.97, 0.88, 0.80, and 0.84, respectively.
A similar time-based analysis was conducted for KETO-1YR. A list was compiled containing terms that are commonly used to indicate ketoacidosis. The algorithm searched through the MetaMap output for the presence of these words, as well as the presence of these words in negated contexts. If the words were present in the negation list, the patient was labeled as “not met.” If ketoacidosis was mentioned and was not negated, the note date was extracted and checked for whether it was within the year of the most recent note to determine if the criterion was met. All 202 records in the training set were “not met” and were correctly identified as such.
Medication history
There were 2 criteria, ASP-FOR-MI and DIETSUPP-2MOS, which were related to medication intake. For DIETSUPP-2MOS, “met” cases were determined by checking whether dietary supplements from the list described earlier were mentioned in the clinical notes within 2 months of the most recent note. Clinical notes with dietary supplements mentioned beyond 2 months from the most recent note were considered “not met.” On the training set, the accuracy, precision, recall, and F1 scores were 0.98, 1.00, 0.95, and 0.98, respectively.
For ASP-FOR-MI, a list including all variations of aspirin was compiled and used to search for mentions of aspirin in the notes. Further, the MI-6MOS component (without the time filter) was used to detect if the patient met the MI criterion. If both conditions were met, the criterion was labeled as “met”; otherwise, it was labeled as “not met.” On the training set, the precision, recall, and F1 scores were 0.88, 1.00, and 0.94, respectively.
Independence
The last 2 criteria, ENGLISH and MAKES-DECISIONS, were related to the patient’s ability to make independent decisions. For ENGLISH, a list of common spoken languages and demonyms was collected based on the United States Census data. Mention of languages in the context of the patient was considered as evidence for lack of English-speaking skills and was labeled as “not met.” On the training set, 8 out of 10 “not met” cases and all “met” cases were correctly identified using this approach. The 2 false positives were related to mentions of an ethnicity and cuisine (“Italian restaurant”), rather than language.
Similarly, for MAKES-DECISIONS, a list of conditions related to mental acuity (such as “mental retardation,” “altered mental status,” and “Alzheimer”) was collected by analyzing the training set. Sections that were marked as “family history” and contexts that mentioned a close family relative (eg, mother, father) were discarded. If a condition word matched within a section related to the patient, the criterion was marked as “not met.” On the training set, all 8 “not met” cases were correctly identified, but 4 “met” cases were falsely labeled as “not met.”
RESULTS
As previously described, 3 runs were submitted, based on minor configuration variations to the ADVANCED-CAD and the MAJOR-DIABETES components. The overall micro-averaged F1 on the training set was 0.9444.
Table 1 summarizes the performance of the 3 submitted runs and the final system on the test set. Of the 3 submitted runs, Run-2 performed the best and achieved micro-averaged F1 and AUC scores of 0.9075 and 0.9073, respectively. Its macro-averaged F1 and AUC scores were 0.7824 and 0.7712, respectively. The performance of Run-1 and Run-3 was similar, with micro-averaged F1 scores of 0.9056 and 0.9065, respectively. Run-2 was ranked second among all submitted runs in the challenge, but all 3 runs were considered statistically similar to the top-ranked run.
Table 1.
System | MET Prec | MET Rec | MET Spec | MET F1 | NOT MET Prec | NOT MET Rec | NOT MET F1 | OVERALL F1 | OVERALL AUC |
---|---|---|---|---|---|---|---|---|---|
Run-1 (Micro) | 0.8940 | 0.8824 | 0.9272 | 0.8882 | 0.9188 | 0.9272 | 0.9230 | 0.9056 | 0.9048 |
Run-2 (Micro) | 0.8928 | 0.8889 | 0.9256 | 0.8908 | 0.9228 | 0.9256 | 0.9242 | 0.9075 | 0.9073 |
Run-3 (Micro) | 0.8943 | 0.8845 | 0.9272 | 0.8894 | 0.9202 | 0.9272 | 0.9237 | 0.9065 | 0.9058 |
Final (Micro) | 0.9007 | 0.8889 | 0.9317 | 0.8947 | 0.9233 | 0.9317 | 0.9275 | 0.9111 | 0.9103 |
Run-1 (Macro) | 0.7855 | 0.7258 | 0.8132 | 0.7469 | 0.8652 | 0.8132 | 0.8146 | 0.7808 | 0.7695 |
Run-2 (Macro) | 0.7853 | 0.7310 | 0.8113 | 0.7492 | 0.8702 | 0.8113 | 0.8156 | 0.7824 | 0.7712 |
Run-3 (Macro) | 0.7857 | 0.7276 | 0.8132 | 0.7480 | 0.8665 | 0.8132 | 0.8153 | 0.7817 | 0.7704 |
Final (Macro) | 0.8010 | 0.7557 | 0.8395 | 0.7698 | 0.8691 | 0.8395 | 0.8336 | 0.8017 | 0.7976 |
Abbreviations: Prec, precision; Rec, recall; Spec, specificity; AUC, area under the curve.
The final system matched or exceeded Run-2 on all micro-averaged measures and on most macro-averaged measures. Its micro-averaged F1 and AUC scores were 0.9111 and 0.9103, respectively, and its macro-averaged F1 and AUC scores were 0.8017 and 0.7976, respectively. The micro-averaged F1 score is higher than that of the best system in the challenge.
Table 2 summarizes the performance of the final system, our best performing run, on the 13 selection criteria. The performance was the highest for ENGLISH, DRUG-ABUSE, and HBA1C, with F1 scores above 0.9 for all 3 criteria. The most challenging criterion on the test set was MAKES-DECISIONS, with the F1 score of 0.575. All instances labeled by the KETO-1YR component were also correct, but since there were no “met” cases in the test set, the overall F1 score was 0.5.
Table 2.
Criterion | MET Prec | MET Rec | MET Spec | MET F1 | NOT MET Prec | NOT MET Rec | NOT MET F1 | OVERALL F1 | OVERALL AUC |
---|---|---|---|---|---|---|---|---|---|
ABDOMINAL | 0.9231 | 0.8000 | 0.9643 | 0.8571 | 0.9000 | 0.9643 | 0.9310 | 0.8941 | 0.8821 |
ADVANCED-CAD | 0.7455 | 0.9111 | 0.6585 | 0.8200 | 0.8710 | 0.6585 | 0.7500 | 0.7850 | 0.7848 |
ALCOHOL-ABUSE | 1.000 | 0.6667 | 1.000 | 0.8000 | 0.9881 | 1.000 | 0.9940 | 0.8970 | 0.8333 |
ASP-FOR-MI | 0.8395 | 1.000 | 0.2778 | 0.9128 | 1.000 | 0.2778 | 0.4348 | 0.6738 | 0.6389 |
CREATININE | 0.9412 | 0.6667 | 0.9839 | 0.7805 | 0.8841 | 0.9839 | 0.9313 | 0.8559 | 0.8253 |
DIETSUPP-2MOS | 0.9268 | 0.8636 | 0.9286 | 0.8941 | 0.8667 | 0.9286 | 0.8966 | 0.8953 | 0.8961 |
DRUG-ABUSE | 0.7500 | 1.000 | 0.9880 | 0.8571 | 1.000 | 0.9880 | 0.9939 | 0.9255 | 0.9940 |
ENGLISH | 0.9865 | 1.000 | 0.9231 | 0.9932 | 1.000 | 0.9231 | 0.9600 | 0.9766 | 0.9615 |
HBA1C | 1.000 | 0.8000 | 1.000 | 0.8889 | 0.8793 | 1.000 | 0.9358 | 0.9123 | 0.9000 |
KETO-1YR | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.5000 | 0.5000 |
MAJOR-DIABETES | 0.8974 | 0.8140 | 0.9070 | 0.8537 | 0.8298 | 0.9070 | 0.8667 | 0.8602 | 0.8605 |
MAKES-DECISIONS | 0.9747 | 0.9277 | 0.3333 | 0.9506 | 0.1429 | 0.3333 | 0.2000 | 0.5753 | 0.6305 |
MI-6MOS | 0.4286 | 0.3750 | 0.9487 | 0.4000 | 0.9367 | 0.9487 | 0.9427 | 0.6713 | 0.6619 |
OVERALL (Micro) | 0.9007 | 0.8889 | 0.9317 | 0.8947 | 0.9233 | 0.9317 | 0.9275 | 0.9111 | 0.9103 |
OVERALL (Macro) | 0.8010 | 0.7557 | 0.8395 | 0.7698 | 0.8691 | 0.8395 | 0.8336 | 0.8017 | 0.7976 |
Abbreviations: Prec, precision; Rec, recall; Spec, specificity; AUC, area under the curve.
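The KETO-1YR row shows how a component with no errors can still score an overall F1 of 0.5. The Python sketch below assumes that the overall F1 is the mean of the “met” and “not met” F1 scores and that AUC is the mean of “met” recall and specificity. Both formulas are inferred from the values in Table 2 rather than taken from the official scoring script, and the ALCOHOL-ABUSE confusion counts are likewise back-calculated from the reported scores.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def summarize(tp, fp, tn, fn):
    """Overall F1 and AUC from confusion counts, with 'met' as the positive class."""
    met_prec = tp / (tp + fp) if tp + fp else 0.0
    met_rec = tp / (tp + fn) if tp + fn else 0.0
    not_prec = tn / (tn + fn) if tn + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0  # also the recall of 'not met'
    overall_f1 = (f1(met_prec, met_rec) + f1(not_prec, spec)) / 2
    auc = (met_rec + spec) / 2
    return overall_f1, auc

# KETO-1YR test set: 0 'met' and 86 'not met' cases, all labeled 'not met'.
keto_f1, keto_auc = summarize(tp=0, fp=0, tn=86, fn=0)

# ALCOHOL-ABUSE test set: counts inferred from the reported scores.
alc_f1, alc_auc = summarize(tp=2, fp=0, tn=83, fn=1)
```

With no “met” cases, the “met” F1 is 0 by convention, so the best attainable overall F1 under this averaging is 0.5 regardless of how well the component performs.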
DISCUSSION
Although supervised machine learning approaches are known to generalize better than rule-based approaches, highly imbalanced data sets pose unique challenges to such approaches. As shown in Table 3, 6 of the 13 criteria had a highly imbalanced distribution of “met” and “not met” cases in the training set, with over 90% of instances in 1 of the 2 classes. The criteria with the highest imbalances were in the substance abuse trait (ALCOHOL-ABUSE: 3.5% “met,” 96.5% “not met” and DRUG-ABUSE: 5.9% “met,” 94.1% “not met”) and in the independence trait (MAKES-DECISIONS: 96% “met,” 4% “not met” and ENGLISH: 95.1% “met,” 4.9% “not met”). Further, all training and test set instances of KETO-1YR belonged to the “not met” class, making a naïve “always no” labeler perform as well as our knowledge-rich component.
Table 3.
Criterion | MET (training set, n = 202) | NOT MET (training set) | %MET (training set) | MET (test set, n = 86) | NOT MET (test set) | %MET (test set) |
---|---|---|---|---|---|---|
ABDOMINAL | 77 | 125 | 38.1 | 30 | 56 | 34.9 |
ADVANCED-CAD | 125 | 77 | 61.9 | 45 | 41 | 52.3 |
ALCOHOL-ABUSE | 7 | 195 | 3.5 | 3 | 83 | 3.5 |
ASP-FOR-MI | 162 | 40 | 80.2 | 68 | 18 | 79.1 |
CREATININE | 82 | 120 | 40.6 | 24 | 62 | 27.9 |
DIETSUPP-2MOS | 105 | 97 | 52.0 | 44 | 42 | 51.2 |
DRUG-ABUSE | 12 | 190 | 5.9 | 3 | 83 | 3.5 |
ENGLISH | 192 | 10 | 95.1 | 73 | 13 | 84.9 |
HBA1C | 67 | 135 | 33.2 | 35 | 51 | 40.7 |
KETO-1YR | 0 | 202 | 0.0 | 0 | 86 | 0.0 |
MAJOR-DIABETES | 113 | 89 | 55.9 | 43 | 43 | 50.0 |
MAKES-DECISIONS | 194 | 8 | 96.0 | 83 | 3 | 96.5 |
MI-6MOS | 18 | 184 | 8.9 | 8 | 78 | 9.3 |
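The imbalance described above can be recomputed directly from the training-set counts in Table 3. The following is a minimal sketch; the 90% threshold comes from the text, whereas the function and variable names are illustrative.

```python
# Training-set counts (MET, NOT MET) copied from Table 3.
TRAIN_COUNTS = {
    "ABDOMINAL": (77, 125),
    "ADVANCED-CAD": (125, 77),
    "ALCOHOL-ABUSE": (7, 195),
    "ASP-FOR-MI": (162, 40),
    "CREATININE": (82, 120),
    "DIETSUPP-2MOS": (105, 97),
    "DRUG-ABUSE": (12, 190),
    "ENGLISH": (192, 10),
    "HBA1C": (67, 135),
    "KETO-1YR": (0, 202),
    "MAJOR-DIABETES": (113, 89),
    "MAKES-DECISIONS": (194, 8),
    "MI-6MOS": (18, 184),
}


def majority_share(met: int, not_met: int) -> float:
    """Fraction of instances that fall in the larger class."""
    return max(met, not_met) / (met + not_met)


def highly_imbalanced(counts, threshold=0.90):
    """Names of criteria whose larger class exceeds `threshold`."""
    return sorted(
        name
        for name, (met, not_met) in counts.items()
        if majority_share(met, not_met) > threshold
    )


print(highly_imbalanced(TRAIN_COUNTS))
```

The majority share is also the training accuracy of a labeler that always predicts the larger class, which is why these 6 criteria (ALCOHOL-ABUSE, DRUG-ABUSE, ENGLISH, KETO-1YR, MAKES-DECISIONS, and MI-6MOS) leave little room for a learned model to improve on a trivial baseline.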
We employed a combination of pattern-based and knowledge-intensive methods for 9 of the 13 criteria. Although the performance of knowledge-intensive approaches is limited by the quality and completeness of the resources themselves, the developed components performed similarly over the training and test sets. Further, most of these components produced interpretable models that enabled us to conduct an extensive error analysis. The main causes of errors related to the knowledge resources were incompleteness of the resources (such as missed dietary supplements, abdominal surgeries, and major complications of diabetes); vocabulary gaps (eg, patient-friendly or lay terms instead of clinical terms); and inaccurate clinical NLP tools. There were also some errors related to clinical complications, comorbidities, and inherent inconsistencies in documentation. Table 4 summarizes the main causes of errors, along with specific examples related to MAJOR-DIABETES.
Table 4.
Cause of error | Example | Explanation |
---|---|---|
Incomplete resources | He has a long hx (history) of ischemic ulcer of the second toe of the right foot. | “ischemic ulcer” was not listed as a severe condition (“ischemic gangrene” was) |
Vocabulary mismatch | Given the patient’s cardiac history, his hypertension, and worsening kidney function, we believe that he will be better served as an inpatient. | The note uses a descriptive phrase (“worsening kidney function”) rather than the clinical term for the severe condition (“renal failure”) |
Inaccurate clinical NLP tools | He has sacral decubitus ulcers that are well healing, no evidence of acute infection | “decubitus ulcers” not recognized as a medical concept by cTAKES |
Severe complications | Significant for bilateral knee arthroscopies and multiple bypass grafts and revascularizations of his lower extremities | Mentioned conditions individually considered mild indicators, but taken together, represented a severe condition |
Unrelated complications | … should have him on ACE inhibitor with his mild diabetes … Diabetes mellitus (adult onset) … findings significant for severe compression neuropathy of the median nerve … | Patient does not have major diabetes, neuropathy unrelated to diabetes |
Clinical state inconsistency [False Positive] | … s/p L nephrectomy, … hypertension, no DM, low HDL … DIAGNOSIS: Renal cell carcinoma … May have mild pulmonary edema on CXR however, clinically appears euvolemic without signs of failure … 61 yo male (CRF: former smoker, HTN, obesity) present with 4–5 days of increasing chest pain. | Mixed signal for diabetes (“no DM” vs. “mild edema” and “obesity”), nephrectomy unrelated to diabetes. |
Clinical state inconsistency [False Negative] | Hard to refine her diabetes control … Extremities: No edema … No active foot lesions. Small blister, resolving L 5th toe. Small crust L medial malleolus. | Minimal mention of diabetes/symptoms (eg, edema), but significant advanced conditions |
Abbreviations: ACE, angiotensin-converting-enzyme; CRF, Chronic Renal Failure; CXR, chest x-ray; DM, diabetes mellitus; HDL, high-density lipoprotein; HTN, hypertension; NLP, natural language processing.
Finally, the best-performing approaches for many cohort identification criteria relied heavily on knowledge resources and decision lists rather than a weighted linear combination of features, for which supervised machine learning approaches are ideally suited. Collecting knowledge resources is labor intensive and akin to defining a computable phenotype49 for the selection criteria. Although medical knowledge ontologies, such as the Unified Medical Language System Metathesaurus,44 are available and could be used for some criteria, they may not be at the desired granularity. For example, even though all abdominal surgeries are medical procedures, only those restricted to the abdomen were relevant for this task. Medical ontologies, clinical NLP tools, and consumer health vocabularies can assist in reducing the need for exhaustive listing of keywords.
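To make the contrast concrete, the sketch below sets a small decision list next to a weighted linear scorer for a MAJOR-DIABETES-style criterion. It is purely illustrative: the term lists, weights, threshold, and function names are hypothetical and are not the resources or rules used in our system, and negation handling is omitted.

```python
# Hypothetical knowledge resource: clinical terms plus lay variants.
SEVERE_TERMS = {"renal failure", "ischemic gangrene", "amputation"}
LAY_VARIANTS = {"worsening kidney function": "renal failure"}

# Hypothetical feature weights for the linear scorer.
WEIGHTS = {"diabetes": 0.5, "renal failure": 1.0, "neuropathy": 0.75}


def normalize(note: str) -> str:
    """Lowercase the note and map lay phrases to clinical terms."""
    text = note.lower()
    for lay, clinical in LAY_VARIANTS.items():
        text = text.replace(lay, clinical)
    return text


def decision_list(note: str) -> str:
    """Ordered rules: the first rule that fires decides the label."""
    text = normalize(note)
    if "diabetes" not in text:
        return "not met"
    if any(term in text for term in SEVERE_TERMS):
        return "met"
    return "not met"


def linear_score(note: str, threshold: float = 1.0) -> str:
    """Sum the weights of matched terms and compare with a threshold."""
    text = normalize(note)
    score = sum(w for term, w in WEIGHTS.items() if term in text)
    return "met" if score >= threshold else "not met"


notes = [
    "Diabetes with worsening kidney function.",
    "Mild diabetes; compression neuropathy of the median nerve.",
]
for note in notes:
    print(decision_list(note), "|", linear_score(note))
```

On these toy inputs, both approaches label the first note “met” once the lay phrase is mapped to a listed term. For the second note, the additive scorer returns “met” because 2 weak cues sum past the threshold, whereas the ordered rules return “not met”; this mirrors the “unrelated complications” error in Table 4.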
The robust performance of the approaches described in this article on both training and test sets indicates that criterion-specific approaches continue to be effective for certain tasks when labeled data are limited. Additional studies are needed to test the generalizability and robustness of the approach on clinical notes from other institutions and on other specific tasks.
CONCLUSION
We developed a combination of knowledge-intensive and pattern-based approaches to identify 13 selection criteria for clinical trials. This system was ranked as 1 of the best-performing systems in the 2018 n2c2 track on cohort identification. Based on our participation, we conclude that there continues to be merit in developing individual criterion-specific components that rely on focused analysis to build a quality cohort selection system. Our approach was able to identify all selection criteria with high micro-averaged F1 on both the training and test sets released as part of the 2018 n2c2 challenge.
FUNDING
This work was partially supported by faculty startup grants from the University of Michigan.
AUTHOR CONTRIBUTIONS
The work was done in part by students and learners in the graduate course of Natural Language Processing for Health Data at the University of Michigan. VGVV designed the overall approach, mentored the learners through the development phase, and wrote the first draft of the article. AS and XZ helped with manuscript preparation, and all authors reviewed and approved the manuscript.
Conflict of interest statement
None declared.
REFERENCES
- 1. Stubbs A, et al. A methodology for using professional knowledge in corpus annotation. 2013. ProQuest Dissertations and Theses, Brandeis University. https://search.library.brandeis.edu/primo-explore/fulldisplay?docid=TN_proquest1351438877&context=PC&vid=BRAND&lang=en_US&search_scope=EVERYTHING&adaptor=primo_central_multiple_fe&tab=everything&query=any,contains,A%20Methodology%20for%20Using%20Professional%20Knowledge%20in%20Corpus%20Annotation&offset=0
- 2. Uzuner Ö, Stubbs A, Filannino M, et al. National NLP clinical challenge (n2c2) 2018. shared task and workshop, track 1: cohort selection for clinical trials. https://n2c2.dbmi.hms.harvard.edu/track1 Accessed January 16, 2019.
- 3. Stubbs A, Filannino M, Uzuner Ö. Cohort selection for clinical trials: n2c2 2018 shared task track 1. J Am Med Inform Assoc 2019.
- 4. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform 2008; 17 (1): 128–44.
- 5. Penberthy L, Brown R, Puma F, Dahman B. Automated matching software for clinical trials eligibility: measuring efficiency and flexibility. Contemporary Clinical Trials 2010; 31 (3): 207–17.
- 6. Sarmiento RF, Dernoncourt F. Improving patient cohort identification using natural language processing. In: Secondary Analysis of Electronic Health Records. Cham, Switzerland: Springer; 2016: 405–17.
- 7. Kumar V, Liao K, Cheng SC, et al. Natural language processing improves phenotypic accuracy in an electronic medical record cohort of type 2 diabetes and cardiovascular disease. J Am Coll Cardiol 2014; 63 (suppl 12): A1359.
- 8. Carrell DS, Halgrim S, Tran DT, et al. Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence. Am J Epidemiol 2014; 179 (6): 749–58.
- 9. Shivade C, Raghavan P, Fosler-Lussier E, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc 2014; 21 (2): 221–30.
- 10. Kandula S, Zeng-Treitler Q, Chen L, Salomon WL, Bray BE. A bootstrapping algorithm to improve cohort identification using structured data. J Biomed Inform 2011; 44: S63–8.
- 11. Klompas M, Haney G, Church D, Lazarus R, Hou X, Platt R. Automated identification of acute hepatitis B using electronic medical record data to facilitate public health surveillance. PLOS One 2008; 3 (7): e2626.
- 12. Trick WE, Zagorski BM, Tokars JI, et al. Computer algorithms to detect bloodstream infections. Emerg Infect Dis 2004; 10 (9): 1612.
- 13. Hebert PL, Geiss LS, Tierney EF, Engelgau MM, Yawn BP, McBean AM. Identifying persons with diabetes using Medicare claims data. Am J Med Qual 1999; 14 (6): 270–7.
- 14. Wright A, Pang J, Feblowitz JC, et al. A method and knowledge base for automated inference of patient problems from structured data in an electronic medical record. J Am Med Inform Assoc 2011; 18 (6): 859–67.
- 15. Zhao D, Weng C. Combining PubMed knowledge and EHR data to develop a weighted Bayesian network for pancreatic cancer prediction. J Biomed Inform 2011; 44 (5): 859–68.
- 16. Sesen MB, Kadir T, Alcantara RB, Fox J, Brady M. Survival prediction and treatment recommendation with Bayesian techniques in lung cancer. AMIA Annu Symp Proc 2012; 2012: 838.
- 17. Kawaler E, Cobian A, Peissig P, Cross D, Yale S, Craven M. Learning to predict post-hospitalization VTE risk from EHR data. AMIA Annu Symp Proc 2012; 2012: 436–45.
- 18. Keung SL, Zhao L, Tyler E, et al. Cohort identification for clinical research: querying federated electronic healthcare records using controlled vocabularies and semantic types. AMIA Jt Summits Transl Sci Proc 2012; 2012: 9.
- 19. Lingren T, Chen P, Bochenek J, et al. Electronic health record based algorithm to identify patients with autism spectrum disorder. PLoS One 2016; 11 (7): e0159621.
- 20. Nguyen AN, Lawley MJ, Hansen DP, Colquist S. A simple pipeline application for identifying and negating SNOMED clinical terminology in free text. In: Sintchenko, Vitali (Editor); Croll, Peter (Editor). HIC 2009: Proceedings; Frontiers of Health Informatics - Redefining Healthcare, National Convention Centre Canberra, 19-21 August 2009. Brunswick East, Vic.: Health Informatics Society of Australia (HISA), 2009: 188-196. ISBN: 9780980552010. https://search.informit.com.au/documentSummary;res=IELHEA;dn=504954502711240
- 21. Nguyen AN, Lawley MJ, Hansen DP, et al. Symbolic rule-based classification of lung cancer stages from free-text pathology reports. J Am Med Inform Assoc 2010; 17 (4): 440–5.
- 22. Passos IC, Mwangi B, Cao B, et al. Identifying a clinical signature of suicidality among patients with mood disorders: a pilot study using a machine learning approach. J Affect Disord 2016; 193: 109–16.
- 23. Zhou SM, Rahman MA, Atkinson M, Brophy S. Mining textual data from primary healthcare records: Automatic identification of patient phenotype cohorts. In: International Joint Conference on Neural Networks (IJCNN); IEEE. 2014: 3621–27.
- 24. Miotto R, Weng C. Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials. J Am Med Inform Assoc 2015; 22 (e1): e141–50.
- 25. Mykowiecka A, Marciniak MI, Kupść A. Rule-based information extraction from patients’ clinical data. J Biomed Inform 2009; 42 (5): 923–36.
- 26. Schmiedeskamp M, Harpe S, Polk R, Oinonen M, Pakyz A. Use of international classification of diseases, ninth revision clinical modification codes and medication use data to identify nosocomial clostridium difficile infection. Infect Control Hosp Epidemiol 2009; 30 (11): 1070–6.
- 27. Savova GK, Fan J, Ye Z, et al. Discovering peripheral arterial disease cases from radiology notes using natural language processing. AMIA Annu Symp Proc 2010; 2010: 722.
- 28. Sohn S, Savova GK. Mayo clinic smoking status classification system: extensions and improvements. AMIA Annu Symp Proc 2009; 2009: 619.
- 29. Wang AY, Lancaster WJ, Wyatt MC, Rasmussen LV, Fort DG, Cimino JJ. Classifying clinical trial eligibility criteria to facilitate phased cohort identification using clinical data repositories. AMIA Annu Symp Proc 2017; 2017: 1754.
- 30. Xu H, Fu Z, Shah A, et al. Extracting and integrating data from entire electronic health records for detecting colorectal cancer cases. AMIA Annu Symp Proc 2011; 2011: 1564.
- 31. Sohn S, Kocher JA, Chute CG, Savova GK. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc 2011; 18 (suppl 1): i144–9.
- 32. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17 (5): 507–13.
- 33. Liu H, Bielinski SJ, Sohn S, et al. An information extraction framework for cohort identification using electronic health records. AMIA Jt Summits Transl Sci Proc 2013; 2013: 149.
- 34. Cui L, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS. Epidea: extracting structured epilepsy and seizure information from patient discharge summaries for cohort identification. AMIA Annu Symp Proc 2012; 2012: 1191.
- 35. Lin C, Karlson EW, Dligach D, et al. Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. J Am Med Inform Assoc 2015; 22 (e1): e151–61.
- 36. Wei Q, Tao C, Jiang G, Chute CG. A high throughput semantic concept frequency based approach for patient identification: a case study using type 2 diabetes mellitus clinical notes. AMIA Annu Symp Proc 2010; 2010: 857.
- 37. Zhao X, Vydiswaran VGV. HyDeXT: a hybrid de-identification and extraction tool for health text. AMIA Annu Symp Proc 2017; 2017: 2250.
- 38. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010; 17 (3): 229–36.
- 39. US National Library of Medicine. Unified Medical Language System: RxNorm. https://www.nlm.nih.gov/research/umls/rxnorm/ Accessed January 16, 2019.
- 40. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001; 34 (5): 301–10.
- 41. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009; 42 (5): 839–51.
- 42. National Institutes of Health. Office of Dietary Supplements. https://ods.od.nih.gov/ Accessed January 16, 2019.
- 43. US Food and Drug Administration. Homepage. https://www.fda.gov/ Accessed January 16, 2019.
- 44. UMLS Reference Manual [Internet]. Bethesda, MD: National Library of Medicine (US); 2009. Metathesaurus. https://www.ncbi.nlm.nih.gov/books/NBK9684/ Accessed January 16, 2019.
- 45. Vydiswaran VG, Mei Q, Hanauer DA, Zheng K. Mining consumer health vocabulary from community-generated text. AMIA Annu Symp Proc 2014; 2014: 1150–9.
- 46. Ma Q, Liu H, Xiang G, Shan W, Xing W. Association between glycated hemoglobin A1c levels with age and gender in Chinese adults with no prior diagnosis of diabetes mellitus. Biomed Rep 2016; 4 (6): 737–40.
- 47. Ziemer DC, Kolm P, Weintraub WS, et al. Glucose-independent, black-white differences in hemoglobin A1c levels: a cross-sectional analysis of 2 studies. Ann Intern Med 2010; 152 (12): 770–7.
- 48. Gallegos-Macias AR, Macias SR, Kaufman E, Skipper B, Kalishman N. Relationship between glycemic control, ethnicity and socioeconomic status in Hispanic and white non-Hispanic youths with type 1 diabetes mellitus. Pediatr Diabetes 2003; 4 (1): 19–23.
- 49. Richesson R, Smerek M, Rusincovitch S, et al. Electronic health records-based phenotyping. Resource chapter. Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials NIH Health Care Systems Research Collaboratory; 2014. https://rethinkingclinicaltrials.org/resources/ehr-phenotyping/ Accessed March 23, 2019.