Abstract
Objective:
Emergency medical services (EMS) provide critical interventions for patients with acute illness and injury and play an important role in implementing prehospital emergency care research. Retrospective, manual patient record review, the current reference standard for identifying patient cohorts, requires significant time and financial investment. We developed automated classification models to identify eligible patients for prehospital clinical trials using EMS clinical notes and compared model performance to manual review.
Methods:
Using the eligibility criteria for an ongoing prehospital study of chest pain patients, we used EMS clinical notes (n = 1,209) to manually classify patients as eligible, ineligible, or indeterminate. We randomly split these records into training and test sets to develop and evaluate machine-learning (ML) algorithms using natural language processing (NLP) for feature (variable) selection. We compared the models to the manual classification to calculate sensitivity, specificity, accuracy, positive predictive value, and F1 measure. We measured the clinical expert time required for the manual and automated review methods.
Results:
ML models’ sensitivity, specificity, accuracy, positive predictive value, and F1 measure ranged from 0.93 to 0.98. Compared to manual classification (N = 363 records), the automated method excluded 90.9% of records as ineligible, leaving only 33 records for manual review.
Conclusions:
Our ML-derived approach demonstrates the feasibility of developing a high-performing, automated classification system using EMS clinical notes to streamline the identification of a specific cardiac patient cohort. This efficient approach can be leveraged to facilitate prehospital patient-trial matching and patient phenotyping (e.g., influenza-like illness) and to create prehospital patient registries.
Keywords: patient phenotype, machine learning, natural language processing, prehospital
INTRODUCTION
Background and Importance
The identification of patients who satisfy specific criteria from a large population in the prehospital setting has numerous use cases, including clinical trial recruitment, outcome prediction, and patient phenotyping (e.g., influenza-like illness). Although emergency medical services (EMS) play a key role in providing critical care and timely interventions for acute illness and injury, most research concerning emergency care interventions is done in emergency department (ED) and hospital settings. This results in insufficient scientific evidence to guide prehospital treatment decisions and patient care (1, 2). More high-quality clinical trials in EMS settings are needed to advance prehospital emergency care, especially for high acuity, low frequency events (3). The large number of prehospital providers, the difficulty of training them, and the constrained time for research activities during a clinical episode create fundamental barriers to prehospital research. As a result, prehospital trial sites frequently struggle to meet recruitment schedules and accrual goals and may need to re-train EMS personnel on trial protocols due to turnover. Thus, participant screening, recruitment, and data collection present immense challenges for those investigating prehospital care (1, 4, 5).
Currently, cohort selection is determined by manual medical record review, which is considered the reference standard (or gold standard) validation method. Manual screening is time consuming and requires significant institutional financial investment, factors that may limit the breadth of review (6–8). This is especially true in prehospital settings, where EMS systems experience inadequate resources and logistical challenges when participating in clinical trials (4). However, as electronic health records (EHR) become the standard among EMS systems (currently over 10,000 EMS agencies serving 49 states contribute data to the National EMS registry (9)), alternate methods for identifying research study participants are feasible. Researchers now have access to large volumes of prehospital clinical data and new opportunities to conduct effective scientific research to improve overall quality of care and operational efficiency. Zhang and Demner-Fushman, among others, found machine learning (ML) approaches useful for automated classification of patient populations for eligibility screening (7, 10, 11) and disease identification (12–14). However, because the details necessary for determining eligibility are primarily recorded as narrative text in the EHR, simple queries of discrete data prove inadequate (7, 10, 15). Classifying clinical text requires methods based on natural language processing (NLP), a field of computer science concerned with understanding and analyzing human language by using algorithms to identify and extract natural language from free text and convert it into numerical representations (16). At a high level, these algorithms can be classified into two groups based on the way they “learn” the desired classification (output) to make predictions: supervised and unsupervised learning. The majority of clinical text classification studies use supervised machine learning, a method that relies on reference standards determined by subject matter experts (15, 17, 18).
Conversely, unsupervised learning is based on the idea that a computer can learn to identify complex processes and patterns without human guidance along the way (19). Typically, patients are identified for clinical trial eligibility through manual review or automated queries of registries and EHRs. Unlike clinical trials in a traditional setting, where patients can be matched through an existing solution such as a registry, our classification framework differs in several significant aspects. Other machine learning methods, such as those described by Ni et al. (7) and Meystre et al. (15), rely on data from clinical records and individual patient data to identify a match between patient data and study criteria, using cosine similarity and decision rules to determine eligibility. These classification systems used structured EHR fields to great effect; however, such fields are not an integral component of EMS care delivery, nor do they provide the context necessary to determine patient eligibility (7, 15, 20). Because these structured fields are not present in the EMS EHR, methods developed to identify clinical trial participants from registries and acute care EHRs do not generalize to the EMS setting. In this paper we describe the development and evaluation of a clinical trial eligibility screening tool that applies ML and NLP techniques to unstructured EMS clinical data. EMS clinical notes are readily accessible, and the stand-alone nature of our classification framework is advantageous in terms of flexibility: because the framework relies on the clinical notes alone, it improves the generalizability and feasibility of applying similar methods to other diseases or conditions. In this article, we compared the performance of the algorithms to a reference standard based on manual review by clinical experts and calculated sensitivity, specificity, accuracy, positive predictive value, and F1 measure.
To compare the screening efficiency of manual review versus ML methods, we used the “workload” metric, defined as the number of records that must be reviewed from the output to identify all eligible patients (7).
MATERIALS AND METHODS
We conducted a retrospective study in a single EMS system, Orange County Emergency Services (OCES), from December 2017 to July 2018. OCES responds to approximately 15,000 requests for service annually, covering 400 square miles of urban and rural areas with advanced life support ambulances. This study received human subjects research approval from the University of North Carolina Institutional Review Board.
Machine learning based natural language processing for prehospital clinical trials is a complex topic, so we assembled a multidisciplinary team of researchers. The team included 5 subject matter experts (RS, TB, JG, RK, and MP): 2 paramedics (RS, TB), 1 EMS medical director (JG), 1 prehospital clinical researcher (MP), and 2 machine learning specialists (RS, RK). We followed methodological frameworks similar to those employed by Ni et al. (7) and Zhang and Demner-Fushman (10) for automated classification. We summarize these steps and how we performed them as follows.
In step 1, we identified the research question. Authors RS and MP proposed initial study objectives and a research design that were discussed and refined with the rest of the research team.
In step 2, we selected relevant patient records. Our search was designed to capture all patient records that might be eligible for “A study of the feasibility of prehospital remote ischemic conditioning (RIC)” (4) (ClinicalTrials.gov Identifier: NCT03400579), for which OCES served as an enrollment site. This single-arm, open-label pilot study evaluated the feasibility of delivering remote ischemic conditioning (RIC) by emergency medical services (EMS) in the prehospital setting (4). The primary objective of the pilot study was to assess the feasibility of prehospital delivery of RIC by EMS providers in the United States (4). Eligible patients had non-traumatic chest pain or anginal equivalent symptoms as defined by the local EMS protocol.
In this step we queried the EMS EHR, a cloud-based data warehouse, using its internal structured query language to extract all patient records, from December 2017 to July 2018, that met the initial study screening criteria, i.e., transported to the University of North Carolina at Chapel Hill Emergency Department (UNC-CH ED), age > 17 years, and systolic blood pressure between 100–180 mm Hg (N = 1,223). All EHR documentation is recorded by EMS providers, and these were the only inclusion criteria that could be applied within the internal EHR query system. A dataset of de-identified data was extracted in Excel format, including patient demographics, level of service (advanced or basic life support), systolic blood pressure, primary impression, patient chief complaint, chief complaint anatomy and organ system, signs/symptoms, and narrative.
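For illustration, the structured screening criteria above could be applied over an extracted table as follows. This is a minimal sketch, not the EHR's actual query language, and the column names (`destination`, `age`, `systolic_bp`) are hypothetical stand-ins for the real EHR fields.

```python
# Hypothetical sketch of the initial screening filter: destination ED,
# age > 17 years, systolic BP 100-180 mm Hg. Column names are invented.
import pandas as pd

records = pd.DataFrame({
    "destination": ["UNC-CH ED", "UNC-CH ED", "Other ED"],
    "age": [45, 16, 60],
    "systolic_bp": [130, 120, 150],
})

screened = records[(records["destination"] == "UNC-CH ED")
                   & (records["age"] > 17)
                   & (records["systolic_bp"].between(100, 180))]
print(len(screened))  # only the first record satisfies all criteria
```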
In step 3, we processed the dataset for manual review. All data processing was conducted in Python using Pandas, an open-source, BSD-licensed library (21). Duplicate patient records and those missing required data (chief complaint, primary impression, narrative, blood pressure, or patient age) were eliminated, reducing the dataset to 1,209 patient records. Three additional columns were added to the Excel sheet to capture data generated by manual review: classification label, review time (seconds), and key indicators for inability to label the patient record as eligible or ineligible.
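A minimal Pandas sketch of this cleaning step, assuming hypothetical column names (the study's actual field names may differ):

```python
# Sketch of step 3: drop duplicates and rows missing required fields,
# then add the three manual-review columns. Column names are assumed.
import pandas as pd

REQUIRED = ["chief_complaint", "primary_impression", "narrative",
            "systolic_bp", "age"]

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicate records and rows missing any required field."""
    df = df.drop_duplicates()
    df = df.dropna(subset=REQUIRED)
    # Columns to capture data generated by manual review
    for col in ["label", "review_seconds", "indeterminate_reason"]:
        df[col] = pd.NA
    return df.reset_index(drop=True)

raw = pd.DataFrame({
    "chief_complaint": ["chest pain", "chest pain", None],
    "primary_impression": ["ACS", "ACS", "syncope"],
    "narrative": ["pt c/o chest pain", "pt c/o chest pain", "pt fainted"],
    "systolic_bp": [120, 120, 110],
    "age": [60, 60, 70],
})
cleaned = clean_records(raw)
print(len(cleaned))  # the duplicate and the incomplete row are removed
```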
In step 4, we created the manual reference standard. Two paramedics (RS, TB) independently and manually reviewed each of the 1,209 de-identified patient records to assess participant eligibility against the pilot study eligibility criteria and annotated each using three class labels: eligible, indeterminate, and ineligible. The two assessments were compared, and disagreements were resolved by an EMS physician (JG). We calculated inter-annotator agreement using Cohen’s kappa.
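Cohen's kappa can be computed with scikit-learn's `cohen_kappa_score`; the annotations below are illustrative, not drawn from the study data.

```python
# Inter-annotator agreement on the three eligibility labels.
# These toy annotations agree on 4 of 5 records.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["eligible", "ineligible", "ineligible", "indeterminate", "eligible"]
annotator_b = ["eligible", "ineligible", "eligible", "indeterminate", "eligible"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 4))  # chance-corrected agreement
```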
In step 5, we developed the automated machine learning models. The data (N = 1,209) were randomly shuffled and split into two independent datasets: a training set (70%, N = 846) as the basis for ML model development and a test set (30%, N = 363) for model performance evaluation. ML best practices suggest a dataset split of no more than 70/30, as the models otherwise risk overfitting (16, 22), and our sample size is consistent with similar studies (7, 15). A “holdout” dataset was not utilized during model training because we did not have the requisite volume of data to create separate training, validation, and test sets. However, to estimate how the model would perform in general when making predictions on data not used during training, we implemented five-fold cross-validation. This approach involves randomly dividing the set of observations into k groups (k = 5), or folds, of approximately equal size (23); each fold in turn is treated as a validation set, and the method is fit on the remaining k − 1 folds. We constructed an NLP pipeline (Figure 2) to train and test a classifier using supervised ML techniques to automatically assign each patient record an eligibility label (Figure 1), using the training set (N = 846) containing the reference standard eligibility labels. Prior to entering the NLP pipeline, the dataset passed through a series of logical rules based on the inclusion criteria. Unstructured data fields were then processed for feature (variable) extraction (Figure 1, step 2). Text analysis is a major application field for ML algorithms; however, raw data, as a sequence of symbols, cannot be fed directly to the algorithms, since most expect numerical feature vectors of fixed size rather than raw text of variable length (24). Therefore, features were extracted through a bag of words, with tf*idf weights (determined from the dataset of clinical notes) for each label.
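The 70/30 split and five-fold cross-validation described above can be sketched with scikit-learn on synthetic stand-in data (a LogisticRegression stands in for the study's classifiers):

```python
# 70/30 shuffled split of 1,209 records, then 5-fold CV on the
# training set. Data are synthetic stand-ins, not the study data.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1209, n_features=20, random_state=42)

# 70% training / 30% test, shuffled, as in step 5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42)

# Five-fold CV on the training set estimates generalization performance
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=5)
print(len(X_train), len(X_test), len(scores))  # 846 363 5
```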
For example, the phrases “chest pain” and “NTG” within the same patient record may be given higher weight because of the association between the high frequency of these terms in the same document and positive “eligible” cases. The Scikit-learn toolkit (24) along with Python Pandas (21) were used to perform the text processing and construct the models. To reduce the bias generated by the varying lengths of individual clinical note sentences, we applied sublinear tf scaling (i.e., replacing tf with 1 + log(tf)) (25). The lower and upper boundaries of the range of n-values for the n-grams to be extracted were (1, 3). For example, an n-gram range of (1, 1) means only unigrams (single words) are used for the word occurrence matrix associated with the characteristics of each label. Feature selection hyperparameters were chosen using grid search, a widely used global optimization approach to automatic hyperparameter tuning (26). Grid search exhaustively searches a manually specified subset of the hyperparameter space of the targeted algorithm (e.g., the n-gram range) (27).
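A minimal sketch of these feature-extraction settings (sublinear tf*idf, n-grams of length 1–3, grid search over the n-gram range); the clinical notes and labels below are invented for illustration:

```python
# Bag-of-words features with sublinear tf*idf weighting, plus a grid
# search over the n-gram range. Notes and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

notes = ["pt c/o chest pain, ASA and NTG given",
         "pt fell, no chest pain, refused transport",
         "chest pressure radiating to left arm, nitro administered",
         "syncopal episode, no pain reported"]
labels = ["eligible", "ineligible", "eligible", "ineligible"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),  # tf -> 1 + log(tf)
    ("clf", LinearSVC()),
])
grid = GridSearchCV(pipe,
                    {"tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)]},
                    cv=2)
grid.fit(notes, labels)
print(grid.best_params_)
```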
Figure 2.

Features used for patient cohort classification. a. Most and least positive feature coefficients by class. b. Example of features in a misclassified patient clinical note (actual = eligible, predicted = ineligible).
Figure 1.

Architecture of the automated eligibility screening model.
We developed four ML classification models: Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Tree Boosting (GTB). SVM is a non-probabilistic binary linear classifier that represents data as points in space separated by a hyperplane (24) and is often used as a base classifier because of its ability to handle high dimensional data (16, 22). Random Forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees (24). AdaBoost and GTB are boosting ensemble models that train a number of individual models sequentially, each learning from the mistakes of the previous model (22, 24). Using the test set (N = 363), we evaluated each model’s ability to classify a patient record as eligible, ineligible, or indeterminate using five metrics: sensitivity, specificity, accuracy (agreement), positive predictive value, and F1 measure. We used sensitivity, positive predictive value, and the F1 measure, the harmonic mean of precision and recall, as the strongest indicators of overall performance. The F1 measure is preferred for machine learning models with unbalanced class distributions, which are common among low frequency events (22). Each metric was micro-averaged across each mention in clinical notes (i.e., calculated from a confusion matrix combining all classifications in the dataset).
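The four-model comparison can be sketched as follows on synthetic, class-imbalanced stand-in data; micro-averaged F1 pools all classifications into one confusion matrix, as described above:

```python
# Train the four model families named above and score each with a
# micro-averaged F1. Data are synthetic stand-ins, not the study data.
from sklearn.svm import LinearSVC
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced classes, mimicking the mostly-ineligible cohort
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GTB": GradientBoostingClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Micro-averaging combines all classifications before scoring
    results[name] = f1_score(y_te, model.predict(X_te), average="micro")
print(sorted(results))
```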
Algorithm parameters were initially set to their defaults for baseline comparison. Performance tuning was conducted by trial and error, guided by the Scikit-learn documentation (24). Parameter tuning focused on the size of each tree, the learning rate that controls overfitting, and the number of variables within a tree. We found that a high learning rate (1.0) resulted in overfitting with the default number of trees (100), so the learning rate was adjusted to 0.5 (range 0.0–1.0). Tuning the number of variables per tree below 100 did not improve sensitivity or specificity.
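As an illustration of the adjustment described above, a GradientBoostingClassifier can be configured with the default 100 trees and the learning rate lowered from 1.0 to 0.5; the data are synthetic stand-ins:

```python
# Sketch of the learning-rate tuning: default tree count (100) with
# the learning rate reduced to 0.5 to limit overfitting.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=1)
gtb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
                                 random_state=1)
gtb.fit(X, y)
print(gtb.learning_rate, gtb.n_estimators)
```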
Given the study’s objective, the clinical researchers preferred an algorithm whose predicted labels had high sensitivity (true positives) and a low number of false negatives (i.e., eligible patient records whose predicted label was ineligible). Further, we chose to include the indeterminate label as an acknowledgement that EMS clinical notes might not contain sufficient detail to discriminate and would therefore require manual review.
RESULTS
Descriptive Statistics of Evaluation Data
Of the 1,209 patient records included in the analysis, the majority were male (54%) with a mean age of 54 (±18) years. Manual annotation identified 105 eligible (8.7%) and 1,082 ineligible patient records (Table 1). Overall inter-annotator agreement exceeded the recommended Cohen’s kappa threshold for adequate agreement of 0.70 (28) (κ = 0.76; raw agreement 88%), demonstrating substantial agreement between the reviewers. The clinical note word length per record ranged from 185 to 1,009 (mean = 354). Little difference in mean word count was seen between eligible (366 words) and ineligible (352 words) patient records.
Table 1.
Description of patient record labels
| Patient Record Labels | Train | Test | Total |
|---|---|---|---|
| Total | 846 | 363 | 1,209 |
| Eligible | 69 | 36 | 105 |
| Indeterminate | 15 | 7 | 22 |
| Ineligible | 762 | 320 | 1,082 |
Features Used for Patient Cohort Classification
We identified 8,376 features within the clinical notes dataset (N = 846) used to train the different ML models. Top features used by the classifiers included explicit indicators of acute coronary syndrome, as well as co-occurring treatments. For example, top features for the eligible class included “chest”, “ASA”, and “nitro” (Figure 2). No positive features indicated regionally or institution specific phrases; however, “UNC” (University of North Carolina) appeared as a top negative feature for the eligible classification. A full list of top (positive) and least (negative) features and their associated tf*idf weights used to determine each class are shown in Figure 2.
Evaluation of Machine Learning Models
To evaluate classification performance, we calculated sensitivity, specificity, accuracy, positive predictive value, and F1 measure (Table 2) for each patient record label in the test set for the following supervised ML classification algorithms: Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Tree Boosting (GTB). The two highest performing algorithms were the ensemble method classifier, GTB (F1 = 0.93), and the linear SVM (F1 = 0.89). The GTB algorithm combines the predictions of several base estimators built with a given learning algorithm; this approach improves generalizability and robustness over a single estimator (24).
Table 2.
Evaluation metrics for classifier performance
| Classifier | Sensitivity | Specificity | Accuracy | Precision (PPV) | F1 measure |
|---|---|---|---|---|---|
| SVM | |||||
| Eligible | 0.33 (0.19–0.75) | 1.00 (0.98–1.00) | 0.93 (0.90–0.95) | 1.00 (0.98–1.00) | 0.50 (0.46–0.89) |
| Indeterminate | 0.00 (0.00–0.52) | 1.00 (0.98–1.00) | 0.98 (0.95–0.99) | 0.00 (0.00–0.64) | 0.00 (0.00–0.48) |
| Ineligible | 1.00 (0.98–1.00) | 0.28 (0.13–0.59) | 0.91 (0.88–0.93) | 0.91 (0.88–0.93) | 0.95 (0.91–1.00) |
| Average | 0.92 (0.89–0.96) | 0.96 (0.93–0.99) | 0.91 (0.89–0.94) | 0.90 (0.87–0.95) | 0.89 (0.82–0.95) |
| Random Forest | |||||
| Eligible | 0.00 (0.00–0.25) | 1.00 (0.98–1.00) | 0.90 (0.88–0.93) | * | * |
| Indeterminate | 0.00 (0.00–0.52) | 1.00 (0.98–1.00) | 0.98 (0.95–0.99) | * | * |
| Ineligible | 1.00 (0.98–1.00) | 0.00 (0.00–0.19) | 0.88 (0.84–0.93) | 0.88 (0.84–0.90) | 0.94 (0.92–0.98) |
| Average | 0.93 (0.89–0.96) | 0.96 (0.94–0.98) | 0.92 (0.88–0.97) | 0.93 (0.89–0.95) | 0.83 (0.71–0.93) |
| AdaBoost |
| Eligible | 0.25 (0.14–0.68) | 1.00 (0.98–1.00) | 0.93 (0.90–0.96) | 1.00 (0.98–1.00) | 0.40 (0.28–0.98) |
| Indeterminate | 0.00 (0.00–0.52) | 1.00 (0.98–1.00) | 0.98 (0.95–0.99) | * | * |
| Ineligible | 1.00 (0.98–1.00) | 0.21 (0.10–0.53) | 0.91 (0.89–0.94) | 0.90 (0.89–0.92) | 0.95 (0.93–1.00) |
| Average | 0.91 (0.87–0.95) | 0.95 (0.93–0.98) | 0.94 (0.92–0.96) | 0.91 (0.83–0.96) | 0.88 (0.85–0.94) |
| Gradient Tree Boosting | |||||
| Eligible | 0.69 (0.33–0.90) | 0.98 (0.94–0.99) | 0.95 (0.93–0.98) | 0.76 (0.46–0.95) | 0.72 (0.66–0.80) |
| Indeterminate | 0.00 (0.00–0.52) | 0.98 (0.94–0.99) | 0.96 (0.93–0.98) | * | * |
| Ineligible | 0.96 (0.92–0.98) | 0.63 (0.47–0.90) | 0.92 (0.89–0.95) | 0.95 (0.92–0.97) | 0.95 (0.93–0.98) |
| Average | 0.91 (0.88–0.96) | 0.95 (0.93–0.98) | 0.94 (0.93–0.96) | 0.90 (0.88–0.95) | 0.93 (0.93–0.94) |
If the sample sizes in the positive (disease present) and negative (disease absent) groups do not reflect the real prevalence of the disease, then the positive and negative predictive values and accuracy cannot be reliably estimated and those values should be disregarded.
PPV = positive predictive value.
Overall, performance was high among all of the classifiers (F1 measure ≥ 0.83), with SVM (F1 = 0.89) and AdaBoost (F1 = 0.88) attaining similar overall performance. No classifier was able to produce an F1 measure or precision (PPV) for the indeterminate label, because the small sample sizes in the positive and negative groups did not reflect the real prevalence of the label. Random Forest had the lowest micro-averaged F1 measure (0.83) and was the only classifier unable to produce an F1 measure for the eligible label. GTB significantly outperformed all other classifiers in sensitivity, the ability to recall the actual eligible patients (0.69), while SVM and AdaBoost performed better at identifying how many of the predicted eligible patient records were relevant (1.00, 1.00).
We performed error analysis for the GTB algorithm by reviewing the false positives (N = 8) and false negatives (N = 11) for the eligible patient record label. The majority of the errors were ascribed to confusion between symptoms describing non-cardiac and cardiac chest pain, such as breathing difficulty, rate-induced pain, or atypical presentations. Other errors were due to negation confusion or lack of clarity in documentation. For example, one paramedic never explicitly documented “chest pain” yet described the symptoms as, “pain at a 5 and goes from under her left breast to her left breast to her sternum – it is described as sharp in nature.” These errors were reduced in the tree-based algorithms by altering the maximum depth, which limits the number of nodes in the tree. We observed higher recall without lowering precision when reducing the number of nodes.
In a final step we compared individual patient record labels to the reference standard. Figure 3 presents the confusion matrix plots for the evaluated classifiers. The confusion matrices display classifier performance without normalization by the label support size (the number of patient records in each label), comparing the reference standard label to the predicted label. In the eligible label (N = 36), GTB (N = 25) had the highest number of true positives, with SVM (N = 12) and AdaBoost (N = 9) performing similarly and Random Forest (N = 0) not predicting any eligible patient records correctly. False negatives in the eligible label were determined using only those records with a predicted label of ineligible. We observed a low rate of false negatives from the GTB algorithm (N = 11), with higher numbers of false negatives from SVM (N = 24), AdaBoost (N = 27), and Random Forest (N = 36), which incorrectly labeled these records as ineligible. In the largest prediction label, ineligible (N = 320), all classifiers achieved perfect classification of true positives with the exception of GTB (N = 306).
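The per-label true-positive and false-negative counts above come from confusion matrices like the following minimal sketch (labels are illustrative, not the study data):

```python
# Three-class confusion matrix over the study's label set.
# Rows = reference-standard label, columns = predicted label.
from sklearn.metrics import confusion_matrix

classes = ["eligible", "indeterminate", "ineligible"]
y_true = ["eligible", "eligible", "ineligible", "ineligible", "indeterminate"]
y_pred = ["eligible", "ineligible", "ineligible", "ineligible", "ineligible"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)  # diagonal entries are true positives for each label
```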
Figure 3.

Confusion matrix plots.
Workload Reduction
With manual review only, research staff would need to review all 1,209 patient record notes to determine study eligibility status. Our manual review required an estimated 40 hours to complete. Using our automated classifiers, only records labeled as eligible and indeterminate would need manual review, requiring just over an hour of review time. All ML models required less than 3 seconds to process and label all patient records (N = 1,209); however, this does not include the time taken to develop the classifiers. SVM and GTB had the highest workload reduction (96.3%, 97.2%); however, SVM (N = 24) had over double the false negatives in the eligible class compared to GTB (N = 11) (Figure 4).
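Under the workload metric defined in the Methods, the reduction can be computed directly; the numbers below use the 33-of-363 figure reported in the abstract:

```python
# Workload reduction: the share of records excluded from manual review.
def workload_reduction(total_records: int, records_to_review: int) -> float:
    """Percentage of records that no longer require manual review."""
    return 100.0 * (total_records - records_to_review) / total_records

# 33 of 363 test-set records still require manual review
reduction = workload_reduction(363, 33)
print(round(reduction, 1))  # -> 90.9
```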
Figure 4.

Workload reduction percentages and classifier performance.
DISCUSSION
Our study found that the identification of a specific cardiac patient cohort can be accomplished by leveraging prehospital clinical notes. In this study we explored the feasibility of ML and NLP methods for identifying research study participant eligibility and compared the resulting models to traditional manual record review using EMS clinical notes. The results from the four algorithms we developed demonstrate that a supervised ML approach effectively classified patient records for eligibility status and required significantly less workload than manual review. In this study, a researcher would only need to review 33 of the 363 test-set records (9.1%) to identify the eligible patients. These ML models were evaluated not on their ability to identify a binary status (eligible or ineligible), but as a screening tool that accurately classified eligibility status and recognized when clinical notes did not contain enough information for a reliable classification (indeterminate). The indeterminate category assisted in improving sensitivity, as a clinical diagnosis often presents as a differential in prehospital clinical notes due to diagnostic constraints and limited scope of care. The indeterminate category may provide insight into future feature selection for algorithm parameter tuning and may guide quality improvement of clinical documentation, such as standardization. Barriers to research study participant eligibility determination and recruitment may be one reason research studies in the prehospital setting occur infrequently.
In previous research, ML systems using clinical notes successfully identified diseases (10, 29, 30), adverse drug events (31), and social risk factors (14, 32). An automated classification system for research study eligibility status using study records from ClinicalTrials.gov reported a precision of 0.90, recall of 0.86, and F2 score of 0.87 (10). These findings are consistent with the performance of our classifiers, which varied in F1 measure from 0.83 to 0.93, with SVM serving as our baseline classifier. SVM classifiers tend to have high accuracy and low overfitting in the high-dimensional feature spaces seen in biomedical text classification problems (33). Overall, the GTB classifier performed better in the more prevalent classes, with F1 measures of 0.72 (eligible) and 0.95 (ineligible). In this case more training data could further improve performance. We speculate that this result is due to the relative class imbalance in the dataset; i.e., 1,082 of 1,209 patient records were annotated as ineligible. However, the positive predictive value of 0.76 achieved by GTB in the eligible class must be interpreted in light of the low prevalence of clinical trial eligibility, which was 10% (36 of 363). Given a cardiac domain-specific dataset rather than a general EMS dataset, our supervised ML based methods might achieve better performance.
Ni and colleagues (7) used similar information extraction methods to automate eligibility screening among pediatric emergency department patients and achieved a 90% potential workload reduction in patient cohort identification. We observed a consistent workload reduction of >92% when the GTB classifier was applied to the dataset, resulting in manual validation of only 33 of 363 records to find 25 eligible patient records (75.8% eligibility rate). This promising result shows the great potential for reducing the manual labor associated with clinical screening; a similar study reported an average eligibility rate of 1.25% (7).
During reference standard creation, we observed that the concepts of patient capacity and anginal equivalent symptoms could be expressed in a multitude of ways, and these concepts often lacked enough information for the reviewers to classify the patient records. For example, patient records describing syncopal and near-syncopal episodes accounted for the majority of records labeled indeterminate. Patient records with inadequately described medical history or patients exhibiting signs/symptoms of dementia, Alzheimer’s disease, or impaired cognition were also deemed indeterminate. Indeterminate patient records had a lower average number of words per clinical note (337 words) than the collection overall (354), with 50% of these records having fewer than 323 words per clinical note. This suggests that there were not enough features present to train the algorithm for the outcome of interest. Lack of clarity surrounding capacity and anginal equivalents was a point of disagreement during manual review, which may have contributed to lower rates of inter-annotator agreement. This confusion was evident when analyzing feature selection for the different classes (Figure 2a), which showed similar features between the eligible and indeterminate classes. Misclassified patients (actual = eligible, predicted = ineligible) had a number of conflicting features (Figure 2b), often due to treatments documented at the direction of the dispatcher prior to arrival. A difficulty with all ML models is the inherent reliance on clinicians to adequately document their history and physical findings (12, 28, 34). ML models using hospital clinical notes are known to be limited in their generalizability due to geographical language (15, 35); however, no slang or geographical linguistics were observed in this study. This suggests the potential of using EMS clinical notes from more than one institution, possibly nationally. To the best of our knowledge, this is the first study to use EMS clinical notes to train a ML algorithm.
It is not yet known if EMS clinical notes contain enough relevant features to train robust ML algorithms.
Our goal was to identify the minority class, eligible, resulting in greater importance of sensitivity than positive predictive value. Imbalanced classes were another difficulty that we observed while constructing the models. Further performance improvement may rely on using more advanced NLP and ML techniques such as part-of-speech tagging and named entity recognition. It is worth noting that our ML classifiers are supervised learning algorithms; thus there is an up-front time investment required to perform manual review for each new problem domain.
Integrating ML algorithms into the EHR could drive patient phenotyping, patient-trial matching systems, clinical decision support tools, and surveillance and quality improvement systems. The code used to build the models is publicly available at https://github.com/rstem/prehospital_eligibility. Our ultimate intention is to implement this algorithm to promote identification of potentially eligible research study patient records and enhance trial enrollment through continuous, near real time monitoring and feedback loops to participating EMS providers. These novel approaches to eligibility screening and monitoring open the possibility of large-scale prehospital clinical trials, which are often hampered by limited resources and funding (6, 7).
Limitations
A limitation of our ML-based method is that it was developed and evaluated in a single health care setting; the generalizability of algorithm performance to other settings is currently unknown. ML-based systems depend on robust, high-quality data sets (36, 37), suggesting a potential limitation in terms of sample size and uneven class distribution. Unfortunately, no formula or recommendations currently exist for the sample size needed to build a gold standard corpus for ML-based systems (38). Because our ML classifiers are supervised, an up-front time investment is required to perform annotation for each new problem domain. Our small team of annotators was able to produce 1200 high-accuracy annotations in 40 hours, a significant financial investment that may nonetheless be justified in trials with large recruitment goals. Further, our automated algorithms were developed for the eligibility criteria of a specific, ongoing prehospital study of chest pain patients; these algorithms may perform differently in other clinical settings or scenarios. As with all supervised learning algorithms, ours have the potential for selection bias due to our use of clinical narrative data from a single EMS system. To our knowledge, no study has analyzed bias (gender, race/ethnicity, etc.) in EMS clinical notes; however, research suggests that medicine carries unintended bias related to systemic neglect of gender and sex (39). Finally, the results and conclusions from this research were generated in an offline, retrospective environment. Real-time implementation may never be possible in the prehospital setting because of workflow: in some cases, patient documentation is not completed before medical decision-making, and documentation occasionally occurs up to 24 hours after care initiation (12). Thus, our tool may not provide support at the time EMS interacts with the patient.
In the future, it would be worthwhile to study the feasibility of deploying our classifiers in real-world environments, such as consumer-facing search engines for clinical trial registries or feedback loops to improve EMS provider participation.
In summary, we leveraged ML and NLP approaches using prehospital EHR clinical notes and demonstrated that it is feasible to develop a high-performing automated classification system. Evaluated against a paramedic-generated reference standard of real-world prehospital EHR clinical notes, our approach achieved more than 0.93 in sensitivity, positive predictive value, and F1 measure. Compared to manual classification (N = 363 records), the automated method excluded approximately 91% of records as ineligible, leaving only 33 records for manual review. Our study shows the feasibility of developing a ML model to identify a specific cardiac patient cohort using only prehospital clinical notes. Future research is needed to investigate whether this method can identify similarly rare events such as influenza-like illness. We hypothesize that implementing our approach may substantially reduce the time and effort required to execute prehospital clinical research and may significantly expand clinical trial access for patients receiving emergency care.
Funding Source:
RS is funded by the National Library of Medicine Institutional Training Grants for Research Training in Biomedical Informatics (T15LM012500–03).
Footnotes
Conflicts of Interest: No potential conflict of interest was reported by the authors.
Contributor Information
Rachel Stemerman, Carolina Health Informatics Program, University of North Carolina, Chapel Hill, North Carolina.
Thomas Bunning, Department of Anesthesiology, Duke University Medical Center, Durham, North Carolina.
Joseph Grover, Department of Emergency Medicine, University of North Carolina, Chapel Hill, North Carolina.
Rebecca Kitzmiller, Carolina Health Informatics Program, University of North Carolina, Chapel Hill, North Carolina.
Mehul D. Patel, Department of Emergency Medicine, University of North Carolina, Chapel Hill, North Carolina.
REFERENCES
- 1. Lecky F, Russell W, Fuller G, McClelland G, Pennington E, Goodacre S, Han K, Curran A, Holliman D, Freeman J, et al. Systematic review of pre-hospital controlled trials in trauma patients - The Head Injury Transportation Straight to Neurosurgery (HITS-NS) randomised trial: a feasibility study. Health Technol Assess. 2016;20(1):1–198. doi: 10.3310/hta20010.
- 2. Callaham M. Quantifying the scanty science of prehospital emergency care. Ann Emerg Med. 1997;30(6):785–90. doi: 10.1016/S0196-0644(97)70049-0.
- 3. Wang HE, Yealy DM. Out-of-hospital clinical trials: challenges in advancing the evidence base. Ann Emerg Med. 2011;57(3):232–3. doi: 10.1016/j.annemergmed.2010.11.034.
- 4. Patel MD, Platts-Mills TF, Grover JM, Thomas SM, Rossi JS. Feasibility of prehospital delivery of remote ischemic conditioning by emergency medical services in chest pain patients: protocol for a pilot study. Pilot Feasibility Stud. 2019;5:42. doi: 10.1186/s40814-019-0431-8.
- 5. Elm JJ, Palesch Y, Easton JD, Lindblad A, Barsan W, Silbergleit R, Conwit R, Dillon C, Farrant M, Battenhouse H, et al. Screen failure data in clinical trials: Are screening logs worth it? Clin Trials. 2014;11(4):467–72. doi: 10.1177/1740774514538706.
- 6. Penberthy LT, Dahman BA, Petkov VI, DeShazo JP. Effort required in eligibility screening for clinical trials. J Oncol Pract. 2012;8(6):365–70. doi: 10.1200/JOP.2012.000646.
- 7. Ni Y, Kennebeck S, Dexheimer JW, McAneney CM, Tang H, Lingren T, Li Q, Zhai H, Solti I. Automated clinical trial eligibility prescreening: increasing the efficiency of patient identification for clinical trials in the emergency department. J Am Med Inform Assoc. 2015;22(1):166–78. doi: 10.1136/amiajnl-2014-002887.
- 8. Schmickl CN, Li M, Li G, Wetzstein MM, Herasevich V, Gajic O, Benzo RP. The accuracy and efficiency of electronic screening for recruitment into a clinical trial on COPD. Respir Med. 2011;105(10):1501–6. doi: 10.1016/j.rmed.2011.04.012.
- 9. Bode AD, Singh M, Andrews JR, Baez AA. Racial and gender disparities in violent trauma: Results from the NEMSIS database. Am J Emerg Med. 2019;37(1):53–5. doi: 10.1016/j.ajem.2018.04.049.
- 10. Zhang K, Demner-Fushman D. Automated classification of eligibility criteria in clinical trials to facilitate patient-trial matching for specific patient populations. J Am Med Inform Assoc. 2017;24(4):781–7. doi: 10.1093/jamia/ocw176.
- 11. Penberthy L, Brown R, Puma F, Dahman B. Automated matching software for clinical trials eligibility: measuring efficiency and flexibility. Contemp Clin Trials. 2010;31(3):207–17. doi: 10.1016/j.cct.2010.03.005.
- 12. Doan S, Maehara CK, Chaparro JD, Lu S, Liu R, Graham A, Berry E, Hsu CN, Kanegaye JT, Lloyd DD, Pediatric Emergency Medicine Kawasaki Disease Research Group, et al. Building a Natural Language Processing Tool to Identify Patients With High Clinical Suspicion for Kawasaki Disease from Emergency Department Notes. Acad Emerg Med. 2016;23(5):628–36. doi: 10.1111/acem.12925.
- 13. Zhou L, Baughman AW, Lei VJ, Lai KH, Navathe AS, Chang F, Sordo M, Topaz M, Zhong F, Murrali M, et al. Identifying Patients with Depression Using Free-text Clinical Documents. Stud Health Technol Inform. 2015;216:629–33.
- 14. Bejan CA, Angiolillo J, Conway D, Nash R, Shirey-Rice JK, Lipworth L, Cronin R, Pulley J, Kripalani S, Barkin S, Johnson K, et al. Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records. J Am Med Inform Assoc. 2018;25(1):61–71. doi: 10.1093/jamia/ocx059.
- 15. Meystre SM, Heider PM, Kim Y, Aruch DB, Britten CD. Automatic trial eligibility surveillance based on unstructured clinical data. Int J Med Inform. 2019;129:13–9. doi: 10.1016/j.ijmedinf.2019.05.018.
- 16. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–51. doi: 10.1136/amiajnl-2011-000464.
- 17. Delahanty RJ, Alvarez J, Flynn LM, Sherwin RL, Jones SS. Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis. Ann Emerg Med. 2019;73(4):334–44. doi: 10.1016/j.annemergmed.2018.11.036.
- 18. Bertsimas D, Dunn J, Steele DW, Trikalinos TA, Wang Y. Comparison of machine learning optimal classification trees with the pediatric emergency care applied research network head trauma decision rules. JAMA Pediatr. 2019;173(7):648–56. doi: 10.1001/jamapediatrics.2019.1068.
- 19. Li X, Wang H, He H, Du J, Chen J, Wu J. Intelligent diagnosis with Chinese electronic medical records based on convolutional neural networks. BMC Bioinformatics. 2019;20(1):62. doi: 10.1186/s12859-019-2617-8.
- 20. Köpcke F, Trinczek B, Majeed RW, Schreiweis B, Wenk J, Leusch T, Ganslandt T, Ohmann C, Bergh B, Rohrig R, et al. Evaluation of data completeness in the electronic health record for the purpose of patient recruitment into clinical trials: a retrospective analysis of element presence. BMC Med Inform Decis Mak. 2013;13:37. doi: 10.1186/1472-6947-13-37.
- 21. pandas: Python Data Analysis Library [Internet]. [cited 2019 Mar 10]. Available from: https://pandas.pydata.org/index.html.
- 22. Stubbs A, Pustejovsky J. The Basics. Natural Language Annotation for Machine Learning. Sebastopol, California: O’Reilly Media, Inc; 2012.
- 23. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. New York, NY: Springer New York; 2013.
- 24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- 25. Manning CD, Raghavan P, Schutze H. Introduction to information retrieval. Cambridge: Cambridge University Press; 2008.
- 26. Chan S, Treleaven P. Continuous Model Selection for Large-Scale Recommender Systems. In: Big Data Analytics. London, UK: Elsevier; 2015. p. 107–24.
- 27. Bergstra J, Bengio Y. Random Search for Hyper-Parameter Optimization. J Mach Learn Res. 2012;13:281–305.
- 28. Mowery D, Wiebe J, Visweswaran S, Harkema H, Chapman WW. Building an automated SOAP classifier for emergency department reports. J Biomed Inform. 2012;45(1):71–81. doi: 10.1016/j.jbi.2011.08.020.
- 29. Wang Y, Chen ES, Pakhomov S, Arsoniadis E, Carter EW, Lindemann E, Sarkar IN, Melton GB. Automated Extraction of Substance Use Information from Clinical Texts. AMIA Annu Symp Proc. 2015;2015:2121–30.
- 30. Carrell DS, Cronkite D, Palmer RE, Saunders K, Gross DE, Masters ET, Hylan TR, Von Korff M. Using natural language processing to identify problem usage of prescription opioids. Int J Med Inform. 2015;84(12):1057–64. doi: 10.1016/j.ijmedinf.2015.09.002.
- 31. Wang G, Jung K, Winnenburg R, Shah NH. A method for systematic discovery of adverse drug events from clinical notes. J Am Med Inform Assoc. 2015;22(6):1196–204. doi: 10.1093/jamia/ocv102.
- 32. Navathe AS, Zhong F, Lei VJ, Chang FY, Sordo M, Topaz M, Navathe SB, Rocha RA, Zhou L. Hospital Readmission and Social Risk Factors Identified from Physician Notes. Health Serv Res. 2018;53(2):1110–36. doi: 10.1111/1475-6773.12670.
- 33. Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, Del Fiol G. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform. 2014;52:457–67. doi: 10.1016/j.jbi.2014.06.009.
- 34. Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, Xu H. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc. 2011;18(5):601–6. doi: 10.1136/amiajnl-2011-000163.
- 35. Shivade C, Hebert C, Lopetegui M, de Marneffe M-C, Fosler-Lussier E, Lai AM. Textual inference for eligibility criteria resolution in clinical trials. J Biomed Inform. 2015;58 Suppl:S211–8. doi: 10.1016/j.jbi.2015.09.008.
- 36. Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, Liu S, Zeng Y, Mehrabi S, Sohn S, et al. Clinical information extraction applications: A literature review. J Biomed Inform. 2018;77:34–49. doi: 10.1016/j.jbi.2017.11.011.
- 37. Sarmiento RF, Dernoncourt F. Improving patient cohort identification using natural language processing. In: Secondary analysis of electronic health records. Cham: Springer International Publishing; 2016. p. 405–17.
- 38. Juckett D. A method for determining the number of documents needed for a gold standard corpus. J Biomed Inform. 2012;45(3):460–70. doi: 10.1016/j.jbi.2011.12.010.
- 39. Hamberg K. Gender bias in medicine. Womens Health (Lond). 2008;4(3):237–43. doi: 10.2217/17455057.4.3.237.
