Abstract
Objective
While current approaches to handling unstructured clinical data, such as manual abstraction and structured proxy variables, are widely used, these methods can be time-consuming, poorly scalable, and imprecise. This article aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.
Materials and Methods
We trained selective classifiers (logistic regression, random forest, support vector machine) to extract 5 variables from clinical notes: depression (n = 1563), diagnosis of glioblastoma (GBM, n = 659), diagnosis of rectal adenocarcinoma (DRA, n = 601), abdominoperineal resection of rectal adenocarcinoma (APR, n = 601), and low anterior resection of rectal adenocarcinoma (LAR, n = 601). We varied the cost of false positives (FP), false negatives (FN), and abstained notes and measured total misclassification cost.
Results
The depression selective classifiers abstained on anywhere from 0% to 97% of notes, and the change in total misclassification cost ranged from −58% to 9%. Selective classifiers abstained on 5%–43% of notes across the GBM and colorectal cancer models. The GBM selective classifier abstained on 43% of notes, which led to improvements in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier and when compared to structured proxy variables.
Discussion
We showed that selective classifiers outperformed both non-selective classifiers and structured proxy variables for extracting data from unstructured clinical notes.
Conclusion
Selective prediction should be considered when abstaining is preferable to making an incorrect prediction.
Background
Electronic health records (EHRs) are a valuable data source for clinical outcomes research and quality improvement.1,2 EHRs typically contain structured and unstructured data. Structured data refers to data captured from drop-down menus, multi-select menus, or other data modalities that follow a consistent format when entered into the EHR.3 Unstructured data refers to free-form text, prose, or data that does not follow a consistent format when entered into the EHR. While structured data like medication orders and ICD (International Classification of Diseases) codes have been used for research purposes, important variables for outcomes research (eg, diagnosis dates, cancer progression and recurrence events, treatment response, comorbidities, adverse events, mortality, etc.) are typically available only as unstructured data. As healthcare professionals increasingly rely on capturing information in unstructured text,4 it becomes necessary to establish scalable methods of extracting unstructured data to enable clinical research and quality improvement efforts.
Unstructured variables in EHRs are typically extracted through manual abstraction or approximated using proxy variables available as structured data.5 Manual abstraction is time-consuming and not scalable, especially with the increase in EHR data produced by healthcare systems.6 Additionally, manual abstraction can be expensive if a highly trained professional (eg, a clinician) is required to comprehend the clinical nuances of EHR data.
In cases where manual abstraction is logistically infeasible, structured data proxy variables have been used in population-based research. Crucially, structured proxies are imperfect and can introduce bias into downstream analyses. For example, the presence of an ICD code (structured) may be used as a proxy for clinician-confirmed diagnosis (unstructured) for a certain condition. However, the use and accuracy of ICD codes may vary substantially across sites, are often not validated, and can present challenges when updating to newer versions.7–9 Additionally, CPT codes are frequently miscoded.10
Given these limitations of manual abstraction and structured proxies, new approaches that leverage machine learning to increase efficiency and accuracy must be explored.6,11 Despite recent advances in the field, state-of-the-art machine learning (ML) models still do not perform as well as humans on several unstructured data extraction tasks, including those involving negation terms or small misspellings, and tasks requiring reasoning from context.12–15 Therefore, for use cases that require high accuracy, model-based abstraction may not always be acceptable.
Another approach to unstructured data extraction is model-assisted abstraction, where humans use the output of machine learning models to improve the efficiency or accuracy of manual abstraction.16 One common approach for model-assisted abstraction is using models to identify records that may require human review.16 For example, this may be used to build a cohort of patients that meet certain inclusion criteria. While this approach may improve abstraction efficiency, notes not flagged by the model may never be included in the final dataset, leading to lower sample sizes and systematic exclusion of certain patients. Therefore, there is still no effective way to synthesize the terabytes of unstructured EHR data produced daily, and as a result, the quality and quantity of clinical outcomes research are limited.17
To address the above limitations of manual abstraction and model-assisted abstraction, we used selective prediction, a modeling framework that allows a model to abstain from generating a prediction under certain scenarios, for example when uncertainty in the prediction is high. The motivation for selective prediction is that in certain settings—especially in healthcare—making an incorrect prediction may be worse than not making a prediction at all.18 For example, Guan et al19 found that incorporating selective prediction improved the performance of a support vector machine classifier to diagnose breast cancer, reducing the number of false negative predictions and increasing the overall accuracy of the model.
Our study makes 2 contributions to the selective prediction literature. First, we apply selective prediction to a clinical natural language processing (NLP) task, which is relevant since selective prediction has been understudied in NLP and to our knowledge has not been applied in clinical NLP tasks.20 Second, we employ decision theoretic utility-based thresholding, an understudied implementation of selective prediction which has several advantages including flexible, asymmetric rejection regions grounded in user-defined tolerances for false positives and false negatives.18,21
We hypothesized that incorporating selective prediction would lead to improved accuracy compared to traditional prediction for data points where the model generated a prediction. We applied this workflow to 5 variables found in clinical notes: depression, diagnosis of glioblastoma (GBM) (binary, yes/no), diagnosis of rectal adenocarcinoma (binary, yes/no), laparoscopic surgical resection of colorectal adenocarcinoma (binary, yes/no), and laparoscopic surgical resection approach for colorectal adenocarcinoma (categorical, LAR/APR/other).
Materials and methods
Data sources
This retrospective cohort study used data from 2 sources: MIMIC-III and the Stanford Hospital EHR. MIMIC-III is a dataset comprising health-related data associated with over 40 000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center from 2001 to 2012.24,25 All patient information in the MIMIC-III dataset is deidentified, so its use was exempt from institutional review board approval. EHR note data from Stanford Hospital (Stanford, CA) were also used for retrospective analysis; this data source included any patient treated at Stanford from 1998 to 2022. Ethics approval was granted through the Stanford University IRB (#50031).
Approach
We demonstrate our NLP workflow in 2 use cases. First, we extract mentions of depression from patient notes in the MIMIC-III dataset. We do this by first using weak supervision to weakly label notes, and subsequently training 3 supervised classifiers: logistic regression, random forest, and support vector machine. For each of the 3 supervised models, we build selective and non-selective classifiers, using a range of costs for false negatives and false positives. This use case demonstrates how selective prediction can be used with any supervised prediction model as well as weak supervision.
In the second use case, we apply our workflow to extract clinical variables from cancer pathology reports originating from an academic medical center. Manual abstraction of a fixed number of data points was performed by human abstractors (Figure 1A). Next, selective classifiers were developed using the data collected by manual abstraction. In contrast to typical ML models that generate a prediction for any input, selective prediction models may abstain from generating a prediction for certain data points (Figure 1B).20
Figure 1.
(A) Diagram depicting the modeling workflow used to extract unstructured variables. EHR data is abstracted, then used to train and test a model that either predicts an outcome or abstains from a prediction. Notes for which a prediction is not generated are abstracted manually. (B) Depiction of selective prediction via utility-based thresholding. Two thresholds (Threshold 1 and Threshold 2) are applied to a continuous predicted probability to yield 3 possible classifications: negative (predicted probability < Threshold 1), positive (predicted probability > Threshold 2), and undetermined (Threshold 1 ≤ predicted probability ≤ Threshold 2). Thresholds are selected based on real world utilities so as to minimize the total cost, defined as (# of FP) × (cost of FP) + (# of FN) × (cost of FN) + (# of undetermined) × (cost of undetermined), where # = number, FP = false positive, and FN = false negative.
Cohort selection criteria
Depression cohort
Note data on patients diagnosed with depression from the Phenotype Annotations for Patient Notes in the MIMIC-III Database were used to label patient notes in the MIMIC-III Clinical Database. The MIMIC-III annotated dataset consists of 1138 unique patient notes from the MIMIC-III dataset manually labeled by expert annotators.22 These notes were selected from the full MIMIC-III note database of 59 652 notes. Notes from patients in the MIMIC-III dataset were labeled by 2 expert human annotators (1 clinical researcher and 1 resident physician) for presence or absence of depression. In total, 1138 unique patient notes, representing 1138 unique patients, were labeled. Notes that exhibited disagreement between both human annotators regarding the label, as well as notes where at least one human annotator expressed uncertainty about the label, were excluded. The remaining 563 labeled unique patient notes, representing 563 unique patients, were selected for the cohort. From these 563 notes, 136 (24%) had a mention of depression. Of those 563 notes, 112 notes (20%) were randomly selected as a development set, and the remaining 451 (80%) were selected as a test set. For the training set, we took a random sample of 1000 unlabeled notes from the full set of 59 652 MIMIC-III notes (excluding the 563 labeled notes). These unlabeled notes were weakly labeled using a weak supervision workflow described below.
Glioblastoma cohort
The glioblastoma cohort included patients with a suspected diagnosis of GBM. Patients with a pathology report containing the word “glioblastoma” were included, since any patient diagnosed with primary GBM would have a pathology report indicating that diagnosis (high sensitivity). Patients younger than 18 at the date of the pathology report that contained the word “glioblastoma” were excluded due to differences in the clinical management of pediatric GBM. The remaining 1195 patients were considered for subsequent analyses. Of these, 629 patients were manually abstracted. For each patient, each note was manually abstracted in temporal order to determine whether or not the note indicated a diagnosis of GBM (binary outcome). This process yielded 659 notes for the GBM cohort.
Colorectal cancer cohort
The colorectal cancer cohort was created by selecting patients between January 2012 and December 2021 that had ICD-9 and ICD-10 codes for rectal cancer (154.0 and 154.1 for ICD-9 and C19 and C20 for ICD-10), and Current Procedural Terminology (CPT) codes for treatment of rectal cancer (45110, 45111, 45112, 45114, 45116, 45126, 45395, 44146, 44208, 48.4, 48.5, 48.6, 1007599, 1007604). Patients younger than 18 at the date of the pathology report were excluded due to differences in clinical management. We also excluded patients with mention of “mucinous adenocarcinoma” in the pathology report to further select for patients with rectal adenocarcinoma. In total, 721 patients were selected in the initial cohort. Out of the 721 patients, 298 patients were randomly selected for manual abstraction and considered for subsequent analysis. These notes were used to build models to predict 3 outcomes: diagnosis of rectal adenocarcinoma, laparoscopic surgical resection of rectal adenocarcinoma via abdominoperineal resection, and laparoscopic surgical resection of rectal adenocarcinoma via low anterior resection. The 3 outcomes were manually abstracted from patient notes in temporal order. In cases where the outcome of interest was not present in a patient’s notes, all notes for the patient were reviewed. This process yielded 601 notes for diagnosis of rectal adenocarcinoma, 882 notes for laparoscopic surgical resection of rectal adenocarcinoma via low anterior resection, and 986 notes for laparoscopic surgical resection of rectal adenocarcinoma via abdominoperineal resection.
Outcomes
Confirmation of depression
Confirmation of depression (binary variable indicating presence vs. absence) was defined as evidence of any of the following in a note: diagnosis of depression; prescription of anti-depressant medication; or any description of intentional drug overdose, suicide, or self-harm attempts. Depression diagnosis was determined by expert human annotators via relevant patient notes from the MIMIC-III database. Further details are highlighted in Gehrmann et al.23
Confirmation of primary glioblastoma diagnosis
Primary glioblastoma diagnosis (binary variable indicating presence vs. absence) was determined by reviewing the relevant pathology reports of patients with suspected GBM. In general, patients diagnosed via surgical resection, biopsy, or specimen review with GBM (WHO grade IV), with or without oligodendroglial or gliosarcoma features, were considered to have primary glioblastoma. Patients with transformed or recurrent GBM diagnoses were not considered to have primary glioblastoma.
Confirmation of rectal adenocarcinoma diagnosis
Rectal adenocarcinoma diagnosis (binary variable indicating presence vs. absence) was determined by reviewing the relevant pathology reports of patients with suspected rectal adenocarcinoma. In general, patients diagnosed via surgical resection of rectal adenocarcinoma were considered to have a rectal adenocarcinoma diagnosis.
Approach for laparoscopic surgical resection of rectal adenocarcinoma
There were 2 laparoscopic surgical approaches of interest for resection of suspected rectal adenocarcinoma. The first approach is low anterior resection (LAR), which involves removing part of the rectum. The second approach is abdominoperineal resection (APR), which involves removing the anus, rectum, and sigmoid colon. We built 2 separate binary classifier models to identify patients who underwent APR or LAR.
Data abstraction
For the variables related to colorectal cancer, we used existing data that had been manually abstracted for other research purposes. To confirm GBM diagnosis, we used the methodology described below to perform abstraction.
First, a note review was performed to understand how GBM diagnosis was documented in the EHR. Second, based on the findings from the preliminary note review, a detailed abstraction instruction guide was created that described how to determine a diagnosis of GBM from a patient’s note. Instructions for dealing with ambiguous cases (eg, gliosarcoma or glioblastoma with oligodendroglioma features) were included, as well as guidance for cases that the abstractor could not determine. Third, a standardized abstraction data entry tool was developed, and a pilot round of abstraction was conducted to refine the instruction guide and data entry tool. The data entry tool had a “flag” feature that abstractors could use to indicate notes for which they were not able to make a determination. Fourth, patient notes were allocated to non-clinician abstractors (with duplication to allow for calculation of inter-rater reliability) and abstraction was performed. A subset of notes was assigned to clinician abstractors to calculate the accuracy of non-clinician abstractors. Lastly, all flagged notes were reviewed, and data quality checks were performed.
Inter-rater reliability
Out of the 659 GBM notes abstracted, 172 were duplicate-abstracted to calculate inter-rater reliability, and of those 54 were duplicate-abstracted by a physician. The inter-rater reliability analysis yielded a Fleiss' Kappa score of 0.869 (P-value < .001), indicating excellent reliability. Treating the physician abstractor as a gold standard, the non-physician abstractors had comparable results to the physician, achieving sensitivity of 0.98, specificity of 1.00, PPV of 1.00, and NPV of 0.90.
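To illustrate the inter-rater reliability calculation, a minimal sketch in R is shown below, assuming a ratings matrix with one row per duplicate-abstracted note and one column per abstractor. The data here are simulated placeholders, not the study data, and the irr package's kappam.fleiss function is one common implementation of Fleiss' Kappa.

```r
# Minimal sketch of a Fleiss' Kappa calculation (assumptions: simulated
# placeholder ratings; the study used 172 duplicate-abstracted GBM notes).
library(irr)

set.seed(7)
abstractor_1 <- sample(c("GBM", "not GBM"), 172, replace = TRUE, prob = c(0.7, 0.3))
# Simulate a second abstractor who agrees with the first most of the time
abstractor_2 <- ifelse(runif(172) < 0.9, abstractor_1,
                       sample(c("GBM", "not GBM"), 172, replace = TRUE))

ratings <- cbind(abstractor_1, abstractor_2)  # rows = notes, columns = raters
kappam.fleiss(ratings)                        # returns kappa, z statistic, and P-value
```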
We ran selective prediction models on 4 datasets derived from pathology reports, one for each of the following outcomes: confirmed GBM diagnosis, diagnosis of rectal adenocarcinoma (DRA), LAR of rectal adenocarcinoma, and APR of rectal adenocarcinoma.
Statistical analysis
Weak labeling
All modeling was done at the level of notes. Weak supervision was used to label the 1000 MIMIC-III notes in the training set for the depression cohort. Labeling functions are rough heuristics used to programmatically generate weak labels from unlabeled data. We manually reviewed the 112 notes in the development set and created labeling functions using keyword matching and regular expressions (Table S1). For example, a labeling function might label a note as positive if it contains the strings “depression,” “depressed,” or “depressive,” and abstain otherwise. We created 28 total labeling functions that were used to train a generative model that combines the outputs of multiple labeling functions, leveraging their collective knowledge and handling their conflicts.26
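To make the weak supervision step concrete, the sketch below shows two illustrative labeling functions in R, in the spirit of the keyword and regular-expression rules in Table S1. The function names, patterns, and example notes are hypothetical, and the generative label model that combines labeling-function outputs (Snorkel-style) is not reproduced here.

```r
# Minimal sketch of keyword/regex labeling functions (illustrative only;
# the actual 28 labeling functions are listed in Table S1).
POS <- 1L; NEG <- 0L; ABSTAIN <- NA_integer_

lf_depression_keyword <- function(note) {
  # Positive if the note mentions depression/depressed/depressive
  if (grepl("\\bdepress(ion|ed|ive)\\b", tolower(note), perl = TRUE)) POS else ABSTAIN
}

lf_denies_depression <- function(note) {
  # Negative if the note explicitly denies depression
  if (grepl("denies (any )?depression", tolower(note), perl = TRUE)) NEG else ABSTAIN
}

notes <- c("Patient reports feeling depressed for 2 weeks.",
           "Patient denies depression or suicidal ideation.")

# One column of weak labels per labeling function; NA = abstain.
# Conflicting votes (as in note 2) are resolved by the generative label model.
weak_votes <- sapply(list(lf_depression_keyword, lf_denies_depression),
                     function(lf) vapply(notes, lf, integer(1)))
weak_votes
```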
The output of the generative model was a probability. We evaluated the performance of the generative model using the AUC, as well as classification metrics (sensitivity, specificity, PPV, NPV) at a threshold of 0.5. Binary weak labels were generated for the 1000 notes using a threshold of 0.5.
In total, 1563 notes were labeled for depression in the MIMIC-III dataset. These notes were used to evaluate the performance of single and double threshold classifiers built with support vector machine (SVM), random forest, and logistic regression models, highlighting the generalizability of double threshold classifiers across a variety of models and inputs on a standardized labeled dataset.
Feature engineering and feature selection
For Stanford EHR variables (GBM, DRA, APR, LAR), modeling was performed at the level of a patient note, and each patient may have multiple notes. A randomly selected 60% of notes were used for model training, and the remaining 40% were used for model testing (Table S2). The single and double threshold models for a given classification task were applied to the same train and test populations of annotated notes. Features were derived entirely from the free text of pathology reports. Using the text of these documents, TF-IDF (term frequency-inverse document frequency) scores were calculated for unigrams, bigrams, trigrams, and 4-grams. Briefly, TF-IDF quantifies how characteristic an n-gram (an n-word phrase) is of a given document within a corpus. For example, the TF-IDF score for the unigram “cancer” in a patient’s pathology report is calculated by multiplying the frequency of the word “cancer” in that report (term frequency) by the inverse document frequency, which down-weights words that appear in many pathology reports across all patients. A sketch of this feature construction is shown below.
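The following is a minimal sketch of TF-IDF feature construction in base R over a toy corpus, restricted to unigrams for brevity; the study computed TF-IDF for unigrams through 4-grams over pathology report text, and the documents shown here are illustrative.

```r
# Minimal sketch of TF-IDF features over a toy corpus (unigrams only).
docs <- c("final diagnosis glioblastoma who grade iv",
          "no evidence of glioblastoma low grade glioma",
          "rectal adenocarcinoma status post low anterior resection")

tokens <- strsplit(tolower(docs), "\\s+")                  # simple whitespace tokenizer
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: proportion of each term within a document (rows = documents)
tf <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab)) / length(tok)))

# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(length(docs) / colSums(tf > 0))

tfidf <- sweep(tf, 2, idf, `*`)                            # documents x terms feature matrix
round(tfidf[, c("glioblastoma", "grade", "resection")], 3)
```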
In the GBM and colorectal cancer models, L1-regularized logistic regression (Lasso, glmnet package in R) was used for feature selection and prediction. Hyperparameters (regularization for Lasso and sparsity for document term matrix) were tuned using 10-fold cross-validation. The metric for cross-validation was misclassification cost (described below). Selected features were examined for sensibility and clinical meaningfulness (Table S3).
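As a sketch of the Lasso fitting and tuning step, the R code below runs cv.glmnet with 10-fold cross-validation on placeholder data. Note that type.measure = "class" (plain misclassification error) is used here as a stand-in for the study's custom misclassification cost, and the feature matrix and labels are simulated.

```r
# Minimal sketch of L1-regularized logistic regression with 10-fold CV
# (placeholder data; the study cross-validated against misclassification cost).
library(glmnet)

set.seed(42)
x <- matrix(rnorm(200 * 50), nrow = 200)     # stand-in for the TF-IDF document-term matrix
y <- rbinom(200, 1, 0.3)                     # stand-in for abstracted binary labels

cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,   # alpha = 1 -> Lasso
                    nfolds = 10, type.measure = "class")

# Predicted probabilities at the cross-validated lambda, and selected features
p_hat   <- predict(cv_fit, newx = x, s = "lambda.min", type = "response")
nonzero <- coef(cv_fit, s = "lambda.min")
sum(nonzero != 0)                            # number of retained coefficients (incl. intercept)
```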
In the depression model, we separately fit L1-regularized logistic regression, random forest (randomForest R package), and support vector machine (e1071 R package) models. The random forest model was trained with default parameter settings: the number of trees (ntree) was 500, and the number of variables tried at each split (mtry) was the square root of the number of predictors, the default for classification. The SVM model was also trained with default parameter settings: a Radial Basis Function (RBF) kernel; the cost parameter C, which penalizes misclassifications in the SVM’s optimization problem, set to 1; and the gamma parameter, which controls the influence of individual training examples on the model, set to 1 divided by the number of features.
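Below is a parallel sketch for the random forest and RBF-kernel SVM depression classifiers using the stated default settings, again on simulated placeholder features; variable names are illustrative.

```r
# Minimal sketch of the random forest and RBF-kernel SVM fits with default
# parameters (ntree = 500, mtry = sqrt(p); cost = 1, gamma = 1/p), on placeholder data.
library(randomForest)
library(e1071)

set.seed(42)
x  <- matrix(rnorm(200 * 50), nrow = 200)
y  <- factor(rbinom(200, 1, 0.3))
df <- data.frame(x, label = y)

rf_fit  <- randomForest(label ~ ., data = df, ntree = 500)
svm_fit <- svm(label ~ ., data = df, kernel = "radial", cost = 1, probability = TRUE)

# Predicted probabilities of the positive class ("1") for downstream thresholding
rf_prob  <- predict(rf_fit, df, type = "prob")[, "1"]
svm_prob <- attr(predict(svm_fit, df, probability = TRUE), "probabilities")[, "1"]
```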
Selective classification
For each binary variable, the output of the supervised model was a probability between 0 and 1 indicating the probability that the given note contains the variable of interest. Typically, a probability threshold is applied to convert a probability prediction model into a classifier. In selecting an optimal probability threshold, relative costs for false negatives and false positives must be specified. Here, “cost” refers not to monetary cost but rather real-world utilities. In the example of GBM diagnosis, the cost of a false positive corresponds to the cost of incorrectly labeling a note as having GBM, and the cost of a false negative corresponds to the cost of incorrectly labeling a note as not having GBM. The former results in incorrectly including a patient in the final dataset, whereas the latter results in incorrectly excluding a patient from the final dataset.
We implemented selective prediction by selecting 2 probability thresholds instead of one. Any note with a predicted probability falling between the 2 thresholds is considered “undetermined.” The idea behind this approach is that a predicted probability very close to 0 or 1 indicates higher model confidence. By labeling notes whose predicted probability is far from 0 or 1 as “undetermined,” the model performance on the notes not labeled “undetermined” is likely to be better than the model performance on the overall sample.
The double threshold approach requires specifying a relative cost for labeling a note as “undetermined,” as well as costs for false positives and false negatives. A grid search was used to identify the 2 thresholds which minimize the total cost, defined as follows:
Total cost = (# of FP) × (cost of FP) + (# of FN) × (cost of FN) + (# of undetermined) × (cost of undetermined) (1)
where # is number, FP is false positive, and FN is false negative. With this double threshold approach, we can expect abstraction efficiency gains equal to 100% minus the “undetermined rate,” or the percentage of notes labeled as “undetermined.” Although the model may not be able to classify 100% of notes, we expect more accurate performance on the subset of notes the model is able to classify, thus leading to gains in abstraction efficiency with minimal loss of accuracy.
We compare the double threshold approach to the typical single threshold approach, where predicted probabilities above or below the threshold are labeled “positive” or “negative,” respectively.
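A minimal sketch of this utility-based double thresholding is shown below, assuming a vector of validation-set predicted probabilities and ground-truth labels (simulated here) and the survey-derived FP:FN:undetermined cost ratio of 13.5:8:1; the grid resolution and helper names are illustrative, not the study code.

```r
# Minimal sketch of double-threshold selection by grid search over total cost
# (Equation 1), using simulated predictions and the cost ratio 13.5:8:1.
set.seed(1)
y     <- rbinom(500, 1, 0.3)                                     # ground-truth labels
p_hat <- plogis(rnorm(500, mean = ifelse(y == 1, 1.5, -1.5)))    # predicted probabilities

total_cost <- function(p_hat, y, t1, t2, c_fp = 13.5, c_fn = 8, c_und = 1) {
  pred <- ifelse(p_hat < t1, 0L, ifelse(p_hat > t2, 1L, NA_integer_))  # NA = undetermined
  fp  <- sum(pred == 1L & y == 0L, na.rm = TRUE)
  fn  <- sum(pred == 0L & y == 1L, na.rm = TRUE)
  und <- sum(is.na(pred))
  fp * c_fp + fn * c_fn + und * c_und
}

# Grid search over threshold pairs with Threshold 1 <= Threshold 2
grid <- expand.grid(t1 = seq(0.01, 0.99, by = 0.01),
                    t2 = seq(0.01, 0.99, by = 0.01))
grid <- grid[grid$t1 <= grid$t2, ]
grid$cost <- mapply(function(a, b) total_cost(p_hat, y, a, b), grid$t1, grid$t2)
grid[which.min(grid$cost), ]   # selected Threshold 1, Threshold 2, and their total cost
```

Setting the undetermined cost far below the FP and FN costs widens the gap between the two thresholds, trading a higher undetermined rate for fewer costly misclassifications.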
For the depression cohort, we evaluated the performance of the selective classifiers across 9 cost scenarios: undetermined cost was fixed at 1; FP cost could be either 2, 5, or 10; and FN cost could be either 2, 5, or 10. Each cost scenario represents a distinct use case for the depression model. For example, the scenario where FP cost = 5 and FN cost = 2 could correspond to a cohort selection use case where a researcher wants to select a “pure” cohort of patients with depression—FPs are less tolerable than FNs.
To quantify the relative costs of FPs, FNs, and undetermined notes for the GBM and colorectal cancer models, we created a survey for researchers (the end users of the models) that uses a regret-based approach.27 This validated method asks respondents to consider their preferences and willingness to accept tradeoffs in hypothetical scenarios about patient care. Our survey presents a scenario with 100 patient notes to be classified by an imperfect prediction model and asks respondents to quantify how many patient notes they would be willing to review manually to avoid a false positive and a false negative. The responses can be used to generate cost ratios for FP:undetermined and FN:undetermined, respectively. A total of 7 clinician and abstractor responses were received. Responses were scanned for consistency, and follow-up conversations were used to understand the motivations behind all responses. After removing one outlier response, the remaining responses had similar values and reasoning. A simple average of the remaining 6 preferences gave a cost ratio for FP:FN:undetermined of 13.5:8:1. This means that a false positive is 13.5 times more costly than the model abstaining; similarly, a false negative is 8 times more costly than the model abstaining.
Model evaluation
Misclassification cost, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated, treating the abstracted labels as ground truth. In addition, we introduce 3 metrics for the evaluation of selective prediction models: undetermined rate, positive undetermined rate, and negative undetermined rate. The positive undetermined rate is defined as the number of positive observations labeled as undetermined (undetermined positives, or UP) over the total number of positive observations within the training dataset (TP + FN + UP). Similarly, the negative undetermined rate is defined as the number of negative observations labeled as undetermined (undetermined negatives, or UN) over the total number of negative observations within the training dataset (TN + FP + UN). Positive and negative undetermined rates give insight into the composition of the undetermined rate. The undetermined rate is defined as the number of predictions a selective prediction model abstains from making (UN + UP) over the total number of possible predictions (UN + UP + FN + FP + TN + TP). A selective prediction model with a lower undetermined rate yields greater efficiency gains than a model with a larger undetermined rate and similar accuracy. For example, if there is an equal proportion of cases and controls in the training dataset, but the positive undetermined rate is much larger than the negative undetermined rate, this suggests the selective prediction model is struggling to predict labels for cases accurately.
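These metrics reduce to simple counts; a minimal sketch is shown below, assuming a vector of selective predictions in which NA marks an abstention. The data are simulated placeholders that continue the naming convention of the earlier sketches.

```r
# Minimal sketch of selective-prediction evaluation metrics, where pred uses
# NA to mark "undetermined" (simulated placeholder data).
set.seed(1)
y    <- rbinom(500, 1, 0.3)
pred <- sample(c(0L, 1L, NA_integer_), 500, replace = TRUE, prob = c(0.5, 0.3, 0.2))

undetermined_rate          <- mean(is.na(pred))                        # (UN + UP) / total
positive_undetermined_rate <- sum(is.na(pred) & y == 1) / sum(y == 1)  # UP / (TP + FN + UP)
negative_undetermined_rate <- sum(is.na(pred) & y == 0) / sum(y == 0)  # UN / (TN + FP + UN)

# Accuracy metrics computed only on notes where a definitive prediction was made
kept        <- !is.na(pred)
sensitivity <- sum(pred[kept] == 1 & y[kept] == 1) / sum(y[kept] == 1)
specificity <- sum(pred[kept] == 0 & y[kept] == 0) / sum(y[kept] == 0)
ppv         <- sum(pred[kept] == 1 & y[kept] == 1) / sum(pred[kept] == 1)
npv         <- sum(pred[kept] == 0 & y[kept] == 0) / sum(pred[kept] == 0)
c(undetermined_rate = undetermined_rate, sensitivity = sensitivity, ppv = ppv)
```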
After training, the final model was applied to all unlabeled data, yielding 3 types of predictions: “positive,” “negative,” or “undetermined.” For notes labeled positive or negative, the model’s prediction was considered final. Notes labeled “undetermined” were those for which the model abstained from making a definitive prediction. These notes were given to abstractors for manual review.
Failure analysis
Failure analysis involved a manual review of observations where the model prediction resulted in false positives or false negatives during cross-validation (Table S4). The goal of failure analysis was to determine the reasons for the model’s incorrect prediction. In some cases, the model’s prediction was correct and erroneously labeled as a false positive or false negative due to abstractor error. Potential reasons for abstractor error include mixing up dates of records, misreading or misinterpreting similar words or synonyms, and failing to notice words and phrases. In these cases, the underlying data was amended, the model was retrained, and model accuracy metrics were recalculated.
Comparison to structured data proxies
We assessed whether the output of selective prediction models for GBM diagnosis was more accurate than EHR-derived structured proxy variables. This analysis was not performed for the colorectal cancer cohorts because structured variables were considered in the inclusion criteria for those cohorts. Structured proxy variables for diagnosis of GBM may include the presence of an ICD or CPT code or documented treatment with antineoplastics typically used to treat GBM. The following ICD codes relevant to GBM were included: C71.0–C71.9 and C72.9 in the ICD-10 scheme; 191.9 in the ICD-9 scheme.28–30 CPT codes related to “glioblastoma” (81287, 81345), “malignant neoplasm of the brain” (70554, 70555, 78600, 78601, 78605, 78606, 78608, 78609, 78610), and surgical procedures related to GBM (61510) were included.31 The proxy for treatment included any documented order or administration of temozolomide or bevacizumab. The labels generated by the model were compared with these structured data elements using the performance metrics described above to quantify the increased accuracy obtained by using an NLP-based model.
Results
Cost survey
Based on the survey conducted (n = 6), a cost ratio for FP: FN: undetermined was assigned as 13.5:8:1 (Table 1).
Table 1.
Questions and summary of responses for the cost survey (n = 6).
Common scenario: Suppose you have a group of patients, each with one medical note. You need to decide whether each of these patients should be included in a dataset containing patients with GBM. This dataset will be used for outcomes research and quality improvement efforts. Since no prediction model is 100% accurate, it is expected that it will make some mistakes in its decision to include/exclude a patient from the dataset. Alternatively, you can manually review a patient's case and make the right decision with absolute certainty, but note that manually reviewing the note is time consuming.

| Question | Average | Minimum | Maximum |
|---|---|---|---|
| How many patient notes would you be willing to review manually to prevent one patient who is healthy from being incorrectly included into the GBM dataset (ie, false positive)? | 13.5 | 10 | 20 |
| How many patient notes would you be willing to review manually to prevent one patient who has GBM from being left out of the dataset (ie, false negative)? | 8 | 3 | 15 |
Depression models
The weak supervision model, trained on 1000 randomly sampled notes and tested on 451 labeled notes, had a test set AUC of 0.87. Using a classification threshold of 0.5, the model had a sensitivity of 0.93, specificity of 0.81, PPV of 0.61, and NPV of 0.97. These results suggest that the weak labels used to train downstream classifiers had reasonable accuracy.
In our examination of the 3 supervised models—Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM)—we systematically varied the costs assigned to false negatives (FN) and false positives (FP), and observed how the relationship between these costs influenced the rate of undetermined classifications and the overall misclassification cost (Figure 2, Table S5).
Figure 2.
For the MIMIC-III depression classification task, each panel shows the undetermined rate of the double threshold classifier for logistic regression (LR), random forest (RF), and support vector machine (SVM) models at the given cost values for false negatives (FN) and false positives (FP). The undetermined cost was fixed at 1. Beneath each bar is shown the change in total misclassification cost between the single and double threshold classifiers, with a negative value corresponding to a decrease in cost with the implementation of double thresholding.
When FN cost = 2 and FP cost = 2, the undetermined rates were 0% for LR, 4.21% for RF, and 0% for SVM. LR also saw a modest decrease in misclassification cost with the use of double thresholding (−2%), while RF exhibited a marginal cost increase (+5%), and SVM maintained an unchanged cost. As the cost of FNs increased (FN cost: 5, FP cost: 2), undetermined rates increased across all models, with LR showing the smallest jump to 37.25%, while RF and SVM exhibited larger rises to 38.58% and 69.40%, respectively. Simultaneously, we noticed divergent changes in misclassification costs; while both LR and RF saw modest changes (9% and −5%, respectively), SVM observed a decrease (−22%). Undetermined rates were highest when the cost of FNs was set to 10. For example, with FP costs at 2, the rates were 78.49% for LR, 72.28% for RF, and 77.61% for SVM. The changes in total misclassification cost were −19% for LR, −42% for RF, and −25% for SVM, indicating a decrease in cost with the implementation of double thresholding across all models. This trend persisted as the FP cost was increased to 5 and 10.
Colorectal cancer models
Overall, our double threshold models performed better than the single threshold models, with 16%, 35%, and 41% decreases in total cost compared to single threshold models for diagnosis of rectal adenocarcinoma (DRA), abdominoperineal resection (APR), and low anterior resection (LAR), respectively (Table 2). Among the 3 colorectal cancer (CRC) models, the LAR double threshold classifier had the highest undetermined rate (10%), where 90% of notes were automatically labeled by the LAR classifier. Excluding the undetermined charts led to improvements over single threshold modeling in sensitivity (0.71 to 0.91), specificity (0.99 to 1.00), PPV (0.94 to 0.98), and NPV (0.94 to 0.99). Our DRA double threshold classifier had the second highest undetermined rate (9%), where 91% of notes were automatically labeled by the DRA classifier. Similarly, we saw improvements over single threshold modeling in sensitivity (0.87 to 0.93), specificity (0.92 to 0.93), PPV (0.90 to 0.92), and NPV (0.90 to 0.94). Lastly, our APR double threshold classifier had the lowest undetermined rate (5%), where 95% of notes were automatically labeled by our APR classifier. This led to improvements over single threshold modeling in sensitivity (0.81 to 0.91), specificity (0.99 to 0.99), PPV (0.88 to 0.94), and NPV (0.98 to 0.99).
Table 2.
Test set performance of classification models (single threshold and double threshold) for diagnosis of glioblastoma, diagnosis of rectal adenocarcinoma, abdominoperineal resection of rectal adenocarcinoma, and low anterior resection of rectal adenocarcinoma.
| | Diagnosis of glioblastoma | | Diagnosis of rectal adenocarcinoma | | Abdominoperineal resection of rectal adenocarcinoma | | Low anterior resection of rectal adenocarcinoma | |
|---|---|---|---|---|---|---|---|---|
| | Double threshold | Single threshold | Double threshold | Single threshold | Double threshold | Single threshold | Double threshold | Single threshold |
| Sensitivity | 0.96 | 0.94 | 0.93 | 0.87 | 0.91 | 0.82 | 0.91 | 0.71 |
| Specificity | 0.96 | 0.79 | 0.93 | 0.92 | 0.99 | 0.99 | 1.00 | 0.99 |
| PPV | 0.98 | 0.89 | 0.92 | 0.90 | 0.94 | 0.88 | 0.98 | 0.94 |
| NPV | 0.91 | 0.88 | 0.94 | 0.90 | 0.99 | 0.98 | 0.99 | 0.94 |
| # FN (%) | 4 | 10 | 7 | 14 | 3 | 8 | 4 | 19 |
| # FP (%) | 2 | 20 | 8 | 10 | 2 | 5 | 1 | 3 |
| Total N | 264 | 264 | 240 | 240 | 394 | 394 | 353 | 353 |
| Predictions made | 150 | 264 | 218 | 240 | 375 | 394 | 318 | 353 |
| Undetermined rate | 0.43 | | 0.09 | | 0.05 | | 0.10 | |
| Positive undetermined rate | 0.38 | | 0.07 | | 0.25 | | 0.32 | |
| Negative undetermined rate | 0.52 | | 0.11 | | 0.023 | | 0.049 | |
| Threshold 1 | 0.27 | | 0.15 | | 0.12 | | 0.17 | |
| Threshold 2 | 0.9 | 0.59 | 0.7 | 0.65 | 0.7 | 0.28 | 0.71 | 0.59 |
| Misclassification cost | 173 | 350 | 114 | 136 | 47 | 72 | 59 | 100 |
Positive undetermined rate = UP/(TP + FN + UP) and negative undetermined rate = UN/(TN + FP + UN), where UP (undetermined positives) is the number of positive observations labeled as undetermined, TP + FN + UP represents the total number of positive observations in the training dataset, UN (undetermined negatives) is the number of negative observations labeled as undetermined, and TN + FP + UN represents the total number of negative observations in the training dataset.
Glioblastoma model
The double threshold model performed better than the single threshold model, with a 51% decrease in total cost compared to the single threshold model (Table 2). The GBM double threshold classifier had an undetermined rate of 43%, meaning that 57% of notes were automatically labeled by the model. Excluding the undetermined charts led to improvements over single threshold modeling in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91).
For GBM, the double threshold model outperformed all structured proxy variables, yielding costs anywhere from 84.8% to 87.1% lower than those of structured proxy variables (Table 3). Compared to ICD codes, the double threshold model had higher sensitivity (0.96 vs 0.62), specificity (0.96 vs 0.46), PPV (0.98 vs 0.70), and NPV (0.91 vs 0.38). Compared to CPT codes, the double threshold model had higher sensitivity (0.96 vs 0.36), specificity (0.96 vs 0.59), PPV (0.98 vs 0.64), and NPV (0.91 vs 0.32). Compared to the proxy variable of GBM-related treatment, the double threshold model had higher sensitivity (0.96 vs 0.21), specificity (0.96 vs 0.84), PPV (0.98 vs 0.71), and NPV (0.91 vs 0.34).
Table 3.
Performance metrics of structured proxy variables (ICD codes, CPT codes, and GBM-related medication) in predicting confirmed diagnosis of GBM.
| Metric | Any GBM-related ICD code | Any GBM-related CPT code | Received GBM-related medication at any time |
|---|---|---|---|
| Sensitivity | 0.62 | 0.36 | 0.21 |
| Specificity | 0.46 | 0.59 | 0.84 |
| PPV | 0.70 | 0.64 | 0.71 |
| NPV | 0.38 | 0.32 | 0.34 |
| # FP | 46 | 35 | 14 |
| # FN | 65 | 108 | 135 |
| Misclassification cost | 1141 | 1337 | 1269 |
Discussion
We propose an approach that leverages selective prediction and human abstractors to improve the accuracy and efficiency of clinical data abstraction. Specifically, we used selective prediction implemented via cost-based probability thresholding to allow our classifier to abstain from predicting certain data points, which were then given to a human for manual abstraction. Using the depression cohort, we demonstrated how logistic regression, random forest, and support vector machine models can be converted to selective classifiers based on predetermined costs for FNs, FPs, and undetermined. We found that increasing the FN cost: undetermined cost ratio led to a higher increase in undetermined rate than increasing the FP cost: undetermined cost ratio. We also found that the greater the difference between FP or FN cost and undetermined cost, the greater the reduction in total misclassification cost when using selective classification compared to non-selective classification. This confirms the intuition that selective prediction is most preferable when abstaining from a prediction is much less costly than making an incorrect prediction. In the GBM and colorectal cancer models, the selective classifier led to sizable gains in automation (anywhere from 57% to 95% across our 4 outcomes) and for GBM diagnosis, outperformed all structured proxy variables in terms of misclassification cost.
For any well-calibrated prediction model, the resulting prediction has a higher chance of being wrong when model uncertainty is high. In many healthcare settings, making a wrong prediction is more harmful than not making a prediction at all. Selective prediction has the potential to improve efficiency and accuracy in healthcare applications where the cost of abstaining from a prediction is lower than the cost of a misclassification. One such scenario is automated note abstraction for cohort selection, because the cost of abstaining is simply the cost of manually reviewing a note. The relative preference for abstaining compared to making an incorrect classification is reflected in the results of the cost survey. Other examples of selective prediction exist; for example, Kotropoulos and Arce32,33 reported building a linear classifier with the option to postpone decision-making, which they called a rejection. In that work, postponed decisions required further input from a domain expert in the context of the paper’s intended application. Another example comes from Guan et al,19 who used a selective prediction classification model to abstain from prediction when the classification task has a large amount of uncertainty.
Defining misclassification cost is a critical step in developing classification models. Common metrics for assessing ML model performance include F1 score, area under the curve (AUC), and calibration, but such metrics fail to consider real-world consequences of model errors.34 These numbers have little context in the clinical setting, for example, there is no cutoff F1 score that indicates a model is safe to use in making clinical decisions. Assigning costs for false positives and false negatives is an attempt to capture the difficulty of trade-offs in medical decision-making and tie the output of the model to the would-be impact of the model’s output. To establish cost ratios, we used stakeholder surveys but other methods may be implemented. Through this survey, stakeholders must implicitly consider how much they value quantities such as clinical accuracy, time, and monetary cost. Since there is no single right cost ratio, a survey that directly asks relevant stakeholders about their preferences, as we did, can help elicit the relative costs associated with the model outputs. Such a survey that makes people consider consequences and anticipate regret has been suggested to be effective because it combines both intuitive and deliberative systems of thinking.27 Future work may assess the validity, reliability, and consistency of such surveys.
One key question that selective prediction raises is how to deal with the data points where the model abstained. One simple approach is to have humans review all undetermined data points by hand. This may be preferable when the number of undetermined data points is small, when the total cost of manual review is low, or when the outcome is difficult to predict for undetermined data points. Beyond a certain number of training data points, the signal-to-noise ratio decreases and improvements in classification performance diminish.35 The manual review of undetermined data points could lend insight into methods to help fine-tune models. Another approach could leverage active learning techniques like uncertainty sampling to preferentially sample undetermined data points and retrain the model on notes that were initially labeled as uncertain. We found that the undetermined rate for the GBM model was higher than that of the CRC models. We hypothesize this is due to the greater variability in the language of pathology reports describing a diagnosis of glioblastoma than those describing a colorectal cancer procedure. Pathologists can describe multiple tumor types within a pathology report and speculate as to whether glioblastoma is the correct diagnosis. Interpretations can also differ among pathologists. In contrast, colorectal cancer resection procedures have standard descriptions, and there is no subjectivity in documenting what procedure a patient underwent. The heightened variability in glioblastoma diagnosis reports presents challenges in achieving precise classification, leading to higher undetermined rates. Preprocessing glioblastoma pathology reports using tools such as named entity recognition, or training the model on more pathology reports, could result in lower undetermined rates for future diagnosis models.
Our approach has some limitations. First, we tested the algorithm on clinical oncology documents (pathology reports) and binary variables (diagnosis and procedures). Further testing is needed to confirm that the approach works on a variety of different datasets and variables. Second, selective prediction could theoretically result in a large proportion of undetermined data points without substantial increases in accuracy. Adjusting the cost ratios may ameliorate this: as the cost assigned to an undetermined prediction increases, the model will make more predictions, at the trade-off of more false positives and false negatives. Third, we use predicted probabilities as proxies for model uncertainty, which may not be the optimal metric for thresholding. In the active learning literature, alternative metrics for quantifying model uncertainty have been proposed, including Shannon entropy and variance in query-by-committee predictions.36
Conclusion
Selective prediction using cost-based probability thresholding can semi-automate unstructured EHR data extraction by giving “easy” notes to a model and “hard” notes to human abstractors, thus increasing efficiency while maintaining or improving accuracy. Selective prediction models substantially outperformed non-selective prediction models and structured proxy variables like ICD codes on a binary classification task, generating higher quality datasets that can be used for outcomes research.
Contributor Information
Akshay Swaminathan, Stanford University School of Medicine, Stanford, CA, United States; Cerebral Inc. Claymont, DE, United States.
Ivan Lopez, Stanford University School of Medicine, Stanford, CA, United States; Cerebral Inc. Claymont, DE, United States.
William Wang, Department of Biology, Stanford University, Stanford, CA, United States; Department of Bioengineering, Stanford University, Stanford, CA, United States.
Ujwal Srivastava, Department of Computer Science, Stanford University, Stanford, CA, United States.
Edward Tran, Department of Computer Science, Stanford University, Stanford, CA, United States; Department of Management Science and Engineering, Stanford University, Stanford, CA, United States.
Aarohi Bhargava-Shah, Stanford University School of Medicine, Stanford, CA, United States.
Janet Y Wu, Stanford University School of Medicine, Stanford, CA, United States.
Alexander L Ren, Stanford University School of Medicine, Stanford, CA, United States.
Kaitlin Caoili, Stanford University School of Medicine, Stanford, CA, United States.
Brandon Bui, Department of Human Biology, Stanford University, Stanford, CA, United States.
Layth Alkhani, Department of Bioengineering, Stanford University, Stanford, CA, United States; Department of Chemistry, Stanford University, Stanford, CA, United States.
Susan Lee, Department of Computer Science, Stanford University, Stanford, CA, United States.
Nathan Mohit, Department of Computer Science, Stanford University, Stanford, CA, United States; Department of Human Biology, Stanford University, Stanford, CA, United States.
Noel Seo, Department of Sociology, Stanford University, Stanford, CA, United States.
Nicholas Macedo, Department of Biology, Stanford University, Stanford, CA, United States; Department of Radiology, Stanford University School of Medicine, Stanford, CA, United States.
Winson Cheng, Department of Computer Science, Stanford University, Stanford, CA, United States; Department of Chemistry, Stanford University, Stanford, CA, United States.
Charles Liu, Department of Surgery, Stanford University School of Medicine, Stanford, CA, United States.
Reena Thomas, Department of Neurology and Neurological Sciences, Stanford Health Care, Stanford, CA, United States.
Jonathan H Chen, Stanford Center for Biomedical Informatics Research, Stanford, CA, United States; Division of Hospital Medicine, Stanford, CA, United States; Clinical Excellence Research Center, Stanford, CA, United States; Department of Medicine, Stanford, CA, United States.
Olivier Gevaert, Stanford Center for Biomedical Informatics Research, Stanford, CA, United States; Department of Medicine, Stanford, CA, United States.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Author contributions
A.S., I.L., and O.G. conceived the idea. A.S. and I.L. performed data verification. A.S., I.L., W.W., E.T., U.S., A.B.S., J.W., A.L.R., B.B., L.A., S.L., N.M., N.S., N.M., W.C., and C.L. carried out data acquisition. I.L., A.S., W.W., U.S., and E.T. performed data analysis and interpretation. I.L., A.S., W.W., U.S., and E.T. wrote the draft of the manuscript and made critical revisions. O.G., J.H.C., and R.T. supervised the project. All authors have read and approved the manuscript.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflicts of interest
A.S. reports stock ownership in Roche (RHHVF) and consulting fees from Conduce Health. J.H.C. reports royalties from Reaction Explorer LLC; consulting fees from National Institute of Drug Abuse Clinical Trials Network, Tuolc Inc, Roche Inc; and payment for expert testimony from Younker Hyde MacFarlane PLLC and Sutton Pierce. All other authors declare that they have no competing interests.
Data availability
All data needed to evaluate the conclusions are present in the paper and in the Supplementary Materials.
The datasets generated and analyzed during the current study are not publicly available due to patient privacy but are available from the corresponding author (O.G.) on reasonable request.
References
- 1. Improved Diagnostics & Patient Outcomes | HealthIT.gov. Accessed September 14, 2022. https://www.healthit.gov/topic/health-it-and-health-information-exchange-basics/improved-diagnostics-patient-outcomes
- 2. Hecht J. The future of electronic health records. Nature. 2019;573(7775):S114-S116. 10.1038/d41586-019-02876-y
- 3. Polnaszek B, Gilmore-Bykovskyi A, Hovanes M, et al. Overcoming the challenges of unstructured data in multi-site, electronic medical record-based abstraction. Med Care. 2016;54(10):e65-e72. 10.1097/MLR.0000000000000108
- 4. Kong H-J. Managing unstructured big data in healthcare system. Healthc Inform Res. 2019;25(1):1-2. 10.4258/hir.2019.25.1.1
- 5. Yang S, Varghese P, Stephenson E, et al. Machine learning approaches for electronic health records phenotyping: a methodical review. 2022:2022.04.23.22274218. 10.1101/2022.04.23.22274218
- 6. Alzu'bi AA, Watzlaf VJM, Sheridan P. Electronic health record (EHR) abstraction. Perspect Health Inf Manag. 2021;18(Spring):1g. Accessed September 14, 2022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8120673/
- 7. Kaur R, Ginige JA, Obst O. A systematic literature review of automated ICD coding and classification systems using discharge summaries.
- 8. Rasmy L, Tiryaki F, Zhou Y, et al. Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies. J Am Med Inform Assoc. 2020;27(10):1593-1599. 10.1093/jamia/ocaa180
- 9. O'Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res. 2005;40(5 Pt 2):1620-1639. 10.1111/j.1475-6773.2005.00444.x
- 10. King MS, Sharp L, Lipsky MS. Accuracy of CPT evaluation and management coding by family physicians.
- 11. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. 2022. 10.48550/arXiv.2108.07258
- 12. Wu S, Roberts K, Datta S, et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27(3):457-470. 10.1093/jamia/ocz200
- 13. Lin BY, Gao W, Yan J, et al. RockNER: a simple method to create adversarial examples for evaluating the robustness of named entity recognition models. arXiv, 2021, preprint: not peer reviewed. 10.48550/arXiv.2109.05620
- 14. Pruthi D, Dhingra B, Lipton ZC. Combating adversarial misspellings with robust word recognition. arXiv, 2019, preprint: not peer reviewed. 10.48550/arXiv.1905.11268
- 15. Singh PK, Paul S. Deep learning approach for negation handling in sentiment analysis. IEEE Access. 2021;9:102579-102592. 10.1109/ACCESS.2021.3095412
- 16. Birnbaum B, Nussbaum N, Seidl-Rathkopf K, et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. 2020, preprint: not peer reviewed. Accessed September 14, 2022. http://arxiv.org/abs/2001.09765
- 17. Botsis T, Hartvigsen G, Chen F, et al. Secondary use of EHR: data quality issues and informatics opportunities. Summit Translat Bioinforma. 2010;2010:1-5. Accessed September 14, 2022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041534/
- 18. Gandouz M, Holzmann H, Heider D. Machine learning with asymmetric abstention for biomedical decision-making. BMC Med Inform Decis Making. 2021;21(1):294. 10.1186/s12911-021-01655-y
- 19. Guan H, Zhang Y, Cheng HD, et al. Bounded-abstaining classification for breast tumors in imbalanced ultrasound images. Int J Appl Math Comput Sci. 2020;30:325-336. 10.34768/amcs-2020-0025
- 20. Xin J, Tang R, Yu Y, et al. The art of abstention: selective prediction and error regularization for natural language processing. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics; 2021:1040-1051. 10.18653/v1/2021.acl-long.84
- 21. Hendrickx K, Perini L, Van der Plas D, et al. Machine learning with a reject option: a survey. 2021, preprint: not peer reviewed. Accessed March 22, 2023. http://arxiv.org/abs/2107.11277
- 22. Moseley E, Celi LA, Wu J, et al. Phenotype annotations for patient notes in the MIMIC-III database. Accessed May 25, 2023. 10.13026/TXMT-8M40
- 23. Gehrmann S, Dernoncourt F, Li Y, et al. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS One. 2018;13(2):e0192360. 10.1371/journal.pone.0192360
- 24. Johnson A, Pollard T, Mark R. MIMIC-III clinical database. 2015. Accessed May 25, 2023. 10.13026/C2XW26
- 25. MIMIC-III, a freely accessible critical care database | Scientific Data. Accessed June 12, 2023. https://www.nature.com/articles/sdata201635
- 26. Ratner A, Bach SH, Ehrenberg H, et al. Snorkel: rapid training data creation with weak supervision. Proc VLDB Endowment. 2017;11(3):269-282. 10.14778/3157794.3157797
- 27. Tsalatsanis A, Hozo I, Vickers A, et al. A regret theory approach to decision curve analysis: a novel method for eliciting decision makers’ preferences and decision-making. BMC Med Inform Decis Mak. 2010;10:51. 10.1186/1472-6947-10-51
- 28. 2021/2022 ICD-10-CM Index > “Glioblastoma.” Accessed September 14, 2022. https://www.icd10data.com/ICD10CM/Index/G/Glioblastoma
- 29. 2022 ICD-10-CM Codes C72. Malignant neoplasm of spinal cord, cranial nerves and other parts of central nervous system. Accessed September 14, 2022. https://www.icd10data.com/ICD10CM/Codes/C00-D49/C69-C72/C72-
- 30. 2022 ICD-10-CM Codes C71. Malignant neoplasm of brain. Accessed September 14, 2022. https://www.icd10data.com/ICD10CM/Codes/C00-D49/C69-C72/C71-
- 31. Medical Billing Codes Search—CPT, ICD 9, ICD 10 HCPCS Codes & Articles, Guidelines | Codify by AAPC. Accessed September 14, 2022. https://www.aapc.com/codes/code-search/
- 32. Kompa B, Snoek J, Beam AL. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digit Med. 2021;4(1):4-6. 10.1038/s41746-020-00367-3
- 33. Kotropoulos C, Arce GR. Linear classifier with reject option for the detection of vocal fold paralysis and vocal fold edema. EURASIP J Adv Signal Process. 2009;2009(1):13. 10.1155/2009/203790
- 34. Vickers AJ, Calster BV, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6. 10.1136/bmj.i6
- 35. Arnold R, Marcus JS, Petropoulos G, et al. Is data the new oil? Diminishing returns to scale.
- 36. Sharma M, Bilgic M. Evidence-based uncertainty sampling for active learning. Data Min Knowl Disc. 2017;31(1):164-202. 10.1007/s10618-016-0460-3