Journal of the American Medical Informatics Association: JAMIA
2022 Jan 5;29(5):891–899. doi: 10.1093/jamia/ocab292

The potential for leveraging machine learning to filter medication alerts

Siru Liu 1, Kensaku Kawamoto 2, Guilherme Del Fiol 3, Charlene Weir 4, Daniel C Malone 5, Thomas J Reese 6,7, Keaton Morgan 8, David ElHalta 9, Samir Abdelrahman 10,11,
PMCID: PMC9006688  PMID: 34990507

Abstract

Objective

To evaluate the potential for machine learning to predict medication alerts that might be ignored by a user, and intelligently filter out those alerts from the user’s view.

Materials and Methods

We identified features (eg, patient and provider characteristics) that have been proposed in the literature to modulate user responses to medication alerts; these features were then refined through expert review. Models were developed using rule-based and machine learning techniques (logistic regression, random forest, support vector machine, neural network, and LightGBM). We collected log data on alerts shown to users throughout 2019 at University of Utah Health. We sought to maximize precision while maintaining a sensitivity of at least 0.99 (ie, a false-negative rate <0.01), a threshold predefined through discussion with physicians and pharmacists. Two null hypotheses were tested: (H1) there is no difference in precision among prediction models and (H2) the removal of any feature category does not change precision.

Results

A total of 3,481,634 medication alerts with 751 features were evaluated. With sensitivity fixed at 0.99, LightGBM achieved the highest precision (0.192; H1). Its false-negative rate was significantly lower than the maximum of 0.01 predefined by subject-matter experts (P <0.001). This model could reduce alert volume by 54.1%. We removed different feature categories (H2) and found that not all features significantly contributed to precision. Removing medication order features (eg, dosage) decreased precision the most (from 0.192 to 0.147, P =0.001).

Conclusions

Machine learning potentially enables the intelligent filtering of medication alerts.

Keywords: clinical decision support, machine learning, alert fatigue

INTRODUCTION

Within electronic health record (EHR) systems, medication alerts (eg, for drug–drug interactions, duplicate orders, medication allergies, and incorrect drug dosage) are often shown to healthcare providers during the prescribing process, with the goal of decreasing drug-related harm.1 However, medication alerts are often high in volume and have high override rates, ranging from 49% to 96%.2 A large number of clinically irrelevant alerts may lead to “alert fatigue,” in which users become desensitized to alerts.1,3–5 Alert fatigue could lead to patient safety concerns when users begin to ignore clinically important alerts.6,7 Several approaches have been proposed to address alert fatigue,8–10 including tiering how alerts are displayed based on severity.11 A second approach, previously used at the University of Utah Health (UUH), is to implement a formal clinical decision support (CDS) governance process. This process includes a data-driven review of alerts to identify opportunities for improving alert logic and discontinuing ineffective alerts.12 Previous research has found that clinician type, work complexity, and repeated alerts influence the perceived value of alerts.13,14 More recently, researchers have proposed using discrete contextual features (eg, provider characteristics, patient conditions, medication history, kidney function, liver function) to identify whether a given alert will be useful to the end user.15–18 If the utility of an alert to the end user could be predicted, less useful medication alerts could be filtered out of the user’s EHR display while useful alerts are preserved. However, commercial medication knowledge bases take into consideration only a limited number of patient and contextual factors when generating alerts. This may be an important reason why clinical users often consider medication alerts irrelevant.

Researchers have suggested that machine learning could be leveraged to predict user responses to medication alerts and reduce alert fatigue.7,19 The predictive task is to classify alerts as useful (positive prediction) or not useful (negative prediction). The false-negative rate should be minimal to avoid suppressing alerts that would lead to a successful intervention. Poly et al.19 analyzed 5 features of medication alerts to develop a model that achieved a sensitivity of 0.883 (ie, a false-negative rate of 0.117). However, that study did not consider many other features that might affect provider behavior, and its false-negative rate may be too high for clinical use. The study also did not account for the imbalanced nature of the dataset, as most alerts are not acted upon by users. In this study, we used a more comprehensive set of predictors, systematically identified through a multi-method approach, to predict user responses to medication alerts more accurately. We used machine learning approaches to examine the impact of these contextual features. We aimed to maximize precision while maintaining a clinically acceptable sensitivity of at least 0.99 (ie, a false-negative rate below 0.01). The contextual features spanned 10 feature categories (eg, laboratory, patient, provider, and order features).

MATERIALS AND METHODS

This study was approved by the University of Utah Institutional Review Board. The research setting was UUH, a medium-sized academic healthcare system in Salt Lake City, Utah, with 4 hospitals and 12 community clinic centers. The medication alerts in the study were generated by Medi-Span decision support rules as configured in the Epic EHR. They comprised drug–drug interaction alerts (31.7%), duplicate therapy alerts (29.0%), duplicate medication alerts (21.6%), drug-allergy alerts (8.2%), dose alerts (7.9%), and other medication alerts (1.6%). For physicians, the interaction settings filter out duplicate alerts, and unfiltered medication alerts require a higher certainty of evidence (the probable documentation level in Medi-Span). For pharmacists, the interaction settings display duplicate alerts and alerts with a relatively lower certainty of evidence (the suspected documentation level in Medi-Span). All alerts for allergies to inactive ingredients are suppressed.

We conducted a measurement study using machine learning models in order to test 2 null hypotheses: (H1) there is no difference in precision among prediction models and (H2) removal of any feature category does not change precision.

Figure 1 illustrates the 6 steps in the study methods: (1) a literature review to develop a preliminary features list; (2) a series of group-based discussions with subject matter experts to revise the features list; (3) a mapping of features into the UUH Enterprise Data Warehouse (EDW) to collect data; (4) data preprocessing; (5) model development; and (6) evaluation.

Figure 1.


Study overview. Abbreviations: EDW, Enterprise Data Warehouse; n, the number of medication alerts.

Literature review

In the first step, one author (SL) identified studies on MEDLINE (PubMed) to develop a list of features that may affect user response to CDS. The search strategy comprised 2 parts: (1) using 2 MeSH terms (“Decision Support Systems, Clinical [MeSH]” and “Medical Order Entry Systems* [MeSH]”); and (2) using the combination of keywords “decision support” and “clinical” to search papers published from May 2019 to May 2020 that had not been MeSH-indexed (see Supplementary Material Tables 1 and 2). The inclusion criteria were: (1) English peer-reviewed manuscripts; (2) focus on CDS directed at healthcare providers (eg, physicians, pharmacists), delivered on-screen, integrated into the EHR, and with automatic extraction of at least some data from the EHR; and (3) study in inpatient acute care hospitals, outpatient primary care clinics, or subspecialty care clinics. Exclusion criteria were: (1) CDS based on simulated scenarios and (2) CDS in nursing homes or public health departments. Using these inclusion and exclusion criteria, SL screened titles and abstracts and then retrieved the full texts for relevant studies. SL then extracted potential features if studies assessed relationships between features and user responses to CDS or provided suggestions about how to contextualize CDS.

Group discussion and EDW mapping

In order to minimize subjective judgments from a single reviewer that could diminish the performance of the model, we conducted a series of five 1-h online meetings with experts (all study authors) in relevant fields (CDS, social psychology theory, machine learning, and pharmacy) from August 2020 to October 2020. We asked participants to review features individually and document the features they would like to discuss with the other experts. A discussion was then conducted to reach a group consensus. We encouraged participants to modify features and add new features based on their own experiences. We also considered the accessibility of features in the EDW at the UUH. We removed features that were too specific (eg, invasive tumor with microinvasion) or features that could not be extracted from the EDW (eg, user attitudes and implementation strategies).

After refining the list of features, we extracted log data for all medication alerts (3,481,634) generated and shown to users in 2019 within the UUH. Only data that would have been available on or before the time of the alert were used for analysis. For example, for a patient’s medications, we included only medications that had been prescribed on or before the date/time of the alert triggering. Features included patient demographics and ordering context (eg, age, diagnosis, encounter type), provider characteristics (eg, professional role, department), and alert characteristics (eg, medication alert type, triggered medication orders). A detailed features list is provided in Supplementary Material Table 3.
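The point-in-time rule above (use only data available on or before the alert) can be sketched as follows. This is a minimal illustration that assumes a hypothetical `MedicationOrder` record holding a patient identifier, a medication name, and an order timestamp. The actual EDW schema and extraction queries are not described in this article.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MedicationOrder:
    # Hypothetical record layout; not the UUH EDW schema.
    patient_id: str
    medication: str
    ordered_at: datetime


def medications_known_at_alert(orders, patient_id, alert_time):
    """Return the patient's medications ordered on or before the alert fired.

    Orders placed after alert_time are excluded, so the feature never uses
    information that was unavailable when the alert was shown.
    """
    return sorted(
        {o.medication for o in orders
         if o.patient_id == patient_id and o.ordered_at <= alert_time}
    )


orders = [
    MedicationOrder("p1", "warfarin", datetime(2019, 3, 1, 8, 0)),
    MedicationOrder("p1", "aspirin", datetime(2019, 3, 1, 9, 30)),
    MedicationOrder("p1", "amiodarone", datetime(2019, 3, 1, 11, 0)),  # after the alert
    MedicationOrder("p2", "metformin", datetime(2019, 3, 1, 7, 0)),
]
print(medications_known_at_alert(orders, "p1", datetime(2019, 3, 1, 9, 30)))
# ['aspirin', 'warfarin']
```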

During the final online meeting, we consulted the participants, including physicians and pharmacists, to determine the maximum false-negative rate at which a model would still be clinically useful. This value was used as the predefined threshold to test the study hypotheses.

Study design

We defined the user response as the outcome of the user’s interaction with a medication alert, which took 1 of 2 values: successful or unsuccessful. Medication alerts were considered successful in the following cases:

(1) the user receiving the alert removed or discontinued the alert-causing medication order, and the alert did not fire again within 1 h;

(2) the user canceled the alert (exited out), the alert did not fire again within 1 h, and the medication was not ordered again in the next hour; and

(3) the user overrode the alert, the alert did not fire again within 1 h, and the medication order was discontinued within an hour for purposes other than discharge.

This definition was proposed by the University of Utah CDS committee and agreed upon by all authors.20 Our goal was to build a model that could filter out unsuccessful alerts, as defined here, while preserving high sensitivity for successful alerts, operationalized as a false-negative rate (1 − sensitivity) <0.01. We also sought to determine the impact of each feature category on model performance by removing that category’s features while keeping the false-negative rate below 0.01. Finally, we reported other evaluation metrics:

- F1 (ie, the harmonic mean of precision and recall);
- area under the precision-recall curve (PR-AUC); and
- filter rate (ie, the proportion of all alerts that would be predicted as not successful and therefore filtered out).
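A minimal sketch of the outcome definition above follows. It assumes a hypothetical `AlertEvent` record whose boolean fields summarize what happened in the hour after the alert. How these fields would be derived from EDW log data is not shown and is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertEvent:
    # Hypothetical summary of the hour following an alert.
    action: str  # "discontinued_order", "canceled_alert", or "overrode_alert"
    alert_refired_within_1h: bool
    order_reentered_within_1h: bool = False
    discontinued_within_1h: bool = False
    discontinuation_reason: Optional[str] = None  # eg, "discharge"


def is_successful(e: AlertEvent) -> bool:
    """Label an alert using the 3 'successful' criteria."""
    if e.alert_refired_within_1h:
        # Every criterion requires that the alert did not fire again within 1 h.
        return False
    if e.action == "discontinued_order":
        # (1) The alert-causing order was removed or discontinued.
        return True
    if e.action == "canceled_alert":
        # (2) The alert was canceled and the medication was not reordered within 1 h.
        return not e.order_reentered_within_1h
    if e.action == "overrode_alert":
        # (3) The alert was overridden, but the order was discontinued within 1 h
        #     for a reason other than discharge.
        return e.discontinued_within_1h and e.discontinuation_reason != "discharge"
    return False
```

Every path through the function first requires that the alert did not refire. This mirrors the condition shared by all 3 criteria.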

Data preprocessing and model development

The data preprocessing included imputing missing values, encoding categorical features and scaling numeric features, conducting univariate analyses, and adjusting for class imbalance by sampling. To handle missing values of numeric features, we evaluated 3 methods: mean imputation,21 median imputation,21 and imputation with the most frequent value.22 Mean imputation achieved the highest F1 value. For each categorical feature, we represented missing values with an additional category, which we called the null feature. We then used 1-hot encoding to convert each category into a binary feature indicating its presence (1) or absence (0) in the data point (ie, a medication alert).23 For numeric features, we evaluated 2 scaling methods (standardization and robust scaling). Scaling puts all features on a comparable scale and accelerates gradient descent during classification.24 Standardization achieved the highest F1 value; in this method, each feature is centered on its mean and divided by its standard deviation.
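The NumPy sketch below illustrates the preprocessing choices reported above: mean imputation, standardization, and 1-hot encoding with a null category. It is not the study’s code. The function name `preprocess` and the input layout are hypothetical. In practice, the means, standard deviations, and category levels would be learned on the training split only and reused for validation and testing. scikit-learn’s `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` provide this fit/transform pattern.

```python
import numpy as np


def preprocess(numeric, categorical):
    """Sketch of mean imputation, standardization, and 1-hot encoding.

    numeric: 2-D array-like of floats, with np.nan marking missing values.
    categorical: list of rows, each a list of category strings (None = missing).
    Returns (scaled numeric features, 1-hot categorical features).
    """
    numeric = np.asarray(numeric, dtype=float)
    # Mean imputation: replace NaN with the column mean of the observed values.
    col_means = np.nanmean(numeric, axis=0)
    imputed = np.where(np.isnan(numeric), col_means, numeric)
    # Standardization: center on the mean, divide by the standard deviation.
    std = imputed.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns centered but unscaled
    scaled = (imputed - imputed.mean(axis=0)) / std

    # 1-hot encoding, with missing values mapped to an explicit "null" category.
    blocks = []
    for j in range(len(categorical[0])):
        column = ["null" if row[j] is None else row[j] for row in categorical]
        levels = sorted(set(column))
        blocks.append(np.array([[1.0 if v == lvl else 0.0 for lvl in levels]
                                for v in column]))
    return scaled, np.hstack(blocks)


scaled, onehot = preprocess([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]],
                            [["inpatient"], [None], ["outpatient"]])
```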

We also had to address the imbalance of our dataset (positive:negative cases = 1:11). We evaluated 3 sampling methods for imbalance: random oversampling of the minority class, random undersampling of the majority class, and combined sampling (eg, synthetic minority over-sampling technique).25 Random undersampling of the majority class achieved the highest F1 value.
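Random undersampling of the majority class could look like the sketch below. The helper name `random_undersample` is hypothetical. We assume it is applied to the training data only, so that the validation and testing sets keep the original class ratio.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts.

    Assumes binary labels with an actual imbalance; returns a shuffled subset.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    # Sample as many majority rows as there are minority rows, without replacement.
    maj_idx = rng.choice(np.flatnonzero(y == majority), size=len(min_idx), replace=False)
    keep = rng.permutation(np.concatenate([min_idx, maj_idx]))
    return X[keep], y[keep]
```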

To remove features with no significant association with the model outcome (ie, the user interaction outcome of a medication alert), we performed the Kruskal-Wallis nonparametric test26 for numeric features and the chi-square (χ2) test27 for categorical features. Features with P > 0.05 were excluded. We also calculated the correlation (Pearson r) between each pair of features.28 For each pair with r > 0.95, we removed the feature with the larger average r with all other features.29
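The correlation filter described above might be sketched as follows, under these assumptions:

- absolute Pearson r is compared against 0.95;
- pairs are handled greedily, starting from the most correlated pair;
- no column is constant.

The univariate screening step is not shown. In Python it could use functions such as `scipy.stats.kruskal` and `scipy.stats.chi2_contingency`.

```python
import numpy as np


def prune_correlated(X, names, threshold=0.95):
    """Greedily drop 1 feature from each pair whose |Pearson r| exceeds threshold.

    Within a pair, the feature with the larger mean |r| against all other
    remaining features is dropped. Assumes no constant columns.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        r = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(r, 0.0)  # ignore each feature's correlation with itself
        i, j = np.unravel_index(np.argmax(r), r.shape)
        if r[i, j] <= threshold:
            break  # no remaining pair exceeds the threshold
        # Mean |r| of each remaining feature against the other remaining features.
        mean_r = r.sum(axis=1) / (len(keep) - 1)
        keep.pop(i if mean_r[i] >= mean_r[j] else j)
    return [names[k] for k in keep]
```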

With sensitivity constrained to at least 0.99 (ie, a false-negative rate below 0.01), we developed a rule-based model and 5 machine learning models: lasso logistic regression,30 random forest,31 neural network,32 support vector machine,33 and LightGBM.34 The first 4 machine learning algorithms are traditional algorithms widely used in healthcare. LightGBM is based on gradient boosting and has shown better performance than traditional machine learning models in several studies.35,36 We used the skope-rules and scikit-learn packages in Python 3. Hyperparameters were tuned using Bayesian optimization with the Optuna package.
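Because sensitivity is fixed rather than optimized, each model’s probability threshold has to be chosen from its predicted scores. The sketch below shows one simple approach. It assumes an alert is shown (predicted successful) when its score is at or above the threshold. It takes the largest threshold that still keeps at least 99% of successful alerts, which filters the most alerts under the sensitivity constraint. Precision and filter rate are then computed at that threshold. The function name is hypothetical, and this is not necessarily how the study selected its thresholds.

```python
import numpy as np


def threshold_for_sensitivity(y_true, scores, min_sensitivity=0.99):
    """Largest threshold t such that sensitivity(score >= t) >= min_sensitivity."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos_scores = np.sort(scores[y_true])[::-1]  # successful alerts, highest score first
    # Successful alerts that must stay visible. The small epsilon guards against
    # floating-point error pushing min_sensitivity * n just above an integer.
    n_needed = int(np.ceil(min_sensitivity * len(pos_scores) - 1e-9))
    return pos_scores[n_needed - 1]
```

Raising the threshold can only lower sensitivity, so the n-th highest positive score is the largest cut point that satisfies the constraint.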

Evaluation

Data splitting

The dataset was randomly partitioned into 3 sets using stratified sampling to retain the class ratio: training (60%), validation (20%), and testing (20%). The training and validation datasets were used to develop the model and tune the hyperparameters. We used 10-fold cross-validation for training and validating the models.
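A stratified 60/20/20 split can be sketched as follows. The function is a hypothetical helper that shuffles each class separately, so every partition keeps approximately the original class ratio. Applying scikit-learn’s `train_test_split` twice with its `stratify` argument is a common alternative.

```python
import numpy as np


def stratified_split(y, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) row indices with class ratios preserved."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    splits = [[], [], []]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        # Cut points for this class, eg, at 60% and 80% of its rows.
        cuts = (np.cumsum(fractions)[:-1] * len(idx)).round().astype(int)
        for part, chunk in zip(splits, np.split(idx, cuts)):
            part.extend(chunk.tolist())
    return [np.array(sorted(p)) for p in splits]
```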

Testing hypotheses

We evaluated the highest performing tuned models on the testing dataset to test the 2 null hypotheses, using 1000 rounds of bootstrapping.37 Statistical analysis methods and outcomes for each hypothesis are listed in Table 1.
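The bootstrap evaluation resamples the test set with replacement and recomputes a metric in each round, as in the sketch below. The helper names are hypothetical. The resulting false-negative rates could then be compared against the 0.01 maximum with a 1-sample Wilcoxon signed-rank test, for example `scipy.stats.wilcoxon(fnrs - 0.01, alternative="less")`. A round that happened to contain no successful alerts would produce NaN and would need to be handled separately.

```python
import numpy as np


def false_negative_rate(y_true, y_pred):
    """Share of successful alerts (label 1) that would be filtered (prediction 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_pred[y_true == 1] == 0))


def bootstrap_metric(y_true, y_pred, metric, n_rounds=1000, seed=0):
    """Recompute metric on n_rounds resamples (with replacement) of the test set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = np.empty(n_rounds)
    for b in range(n_rounds):
        idx = rng.integers(0, n, size=n)  # sample n row indices with replacement
        values[b] = metric(y_true[idx], y_pred[idx])
    return values
```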

Table 1.

Summary of statistical analysis methods and outcomes for each hypothesis

Hypothesis | Statistical analysis | Outcome
H1: There is no difference in precision among prediction models. | Friedman test; if significant, pairwise comparisons with Nemenyi test | Precision
H2: Removal of any feature category does not change precision. | Friedman test; if significant, pairwise comparisons with Nemenyi test | Precision

Hypothesis 1

To test the first hypothesis, we set each model’s probability threshold so that its sensitivity was ≥0.99. We then used the Friedman test to assess the significance of differences in precision among the classifiers.38 All features were included. If the difference was significant, the Nemenyi test was used for pairwise comparisons. The highest performing model was the model with the highest precision and a sensitivity ≥0.99. Models that could not achieve a sensitivity of 0.99 were excluded from the comparison. If none of the models achieved a sensitivity ≥0.99, we would select the model with the highest sensitivity as the highest performing model. The highest performing model was used to test the second hypothesis. In addition, we calculated the false-negative rate (1 − sensitivity) in each of the 1000 bootstrap rounds. We then used the 1-sample Wilcoxon signed-rank test to assess whether the false-negative rate was significantly lower than the predefined maximum.38
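The standard forms of the Friedman statistic and the Nemenyi critical difference are given below for reference. Here, k is the number of models (or feature sets) compared. N is the number of blocks over which they are ranked; in this study these are presumably the bootstrap rounds, which is an assumption. R̄_j is the average rank of model j across the N blocks.

```latex
\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} \bar{R}_j^{2} - \frac{k(k+1)^2}{4}\right],
\qquad
\mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}
```

Under the null hypothesis, χ²_F approximately follows a χ² distribution with k − 1 degrees of freedom. Under the Nemenyi test, 2 models differ significantly when their average ranks differ by at least CD. Here q_α is the critical value of the Studentized range statistic divided by √2.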

Hypothesis 2

To test the second hypothesis using the highest performing model, we adjusted the model’s probability threshold to achieve a sensitivity ≥0.99. We then used the Friedman test to assess the significance of differences in precision among models using different sets of features. If the difference was significant, we conducted pairwise comparisons using the Nemenyi test. We grouped features in 2 ways: by the characteristics of the features and by the difficulty of operationalizing (engineering) them. For the second grouping, we consulted an expert in standards-based CDS (KK) to categorize features into 2 levels:

(1) technically implementable using current interoperability standards, operationally defined as potentially feasible for a medication knowledge vendor to implement using the CDS Hooks39 and Fast Healthcare Interoperability Resources (FHIR)40 standards; and

(2) unlikely to be feasible.

For the purposes of this study, a feature was deemed technically implementable using CDS Hooks and FHIR if it met any of these conditions:

- it was available through the FHIR interface provided by the Epic EHR system;
- it was supported, or in the process of being supported, by Epic’s implementation of CDS Hooks; or
- it was available in external knowledge resources (eg, the drug class associated with a medication).

We then analyzed how model performance varied across features grouped in these 2 ways. In addition to sensitivity, precision, and F1, we calculated the filter rate: the proportion of all alerts that would be filtered out (predicted not successful). The filter rate represents the ability of our model to reduce the number of alerts.
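The evaluation metrics used throughout can be computed from binary predictions as in the sketch below. A prediction of 0 means the alert would be filtered. The function name and dictionary keys are hypothetical. The sketch assumes at least 1 successful alert and at least 1 positive prediction; otherwise the divisions are undefined.

```python
def alert_filter_metrics(y_true, y_pred):
    """Sensitivity, precision, F1, filter rate, and false-negative rate.

    y_true: 1 if the alert was successful, else 0.
    y_pred: 1 if the model predicts success (alert shown), 0 if it would be filtered.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    filtered = sum(1 for p in y_pred if p == 0)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "filter_rate": filtered / len(y_pred),
        "false_negative_rate": 1 - sensitivity,
    }
```

As a consistency check, plugging the LightGBM values reported in Table 2 into the F1 formula gives 2 × 0.192 × 0.991 / (0.192 + 0.991) ≈ 0.322. This matches the reported F1.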

RESULTS

The dataset contained 3,481,634 medication alerts displayed to 8,270 providers for 178,298 patients from January 1, 2019 through December 31, 2019 at the UUH. For each medication alert, 751 features were also extracted (see Supplementary Material Table 3).

H1: There is no difference in precision among prediction models

Table 2 summarizes the probability threshold, sensitivity, precision, F1 value, and filter rate of each model, with the probability threshold set to yield a sensitivity ≥0.99. Hyperparameters used in the models are listed in Supplementary Material Table 4. The sensitivity of the rule-based model was 0.571. The sensitivities of the machine learning models were 0.991 (support vector machine), 0.991 (logistic regression), 0.994 (neural network), 0.992 (random forest), and 0.991 (LightGBM). These values were significantly higher than the sensitivity of the rule-based model (P <0.001). After pairwise comparisons, the LightGBM model achieved the highest precision (0.192). Boxplots for precision and F1 are available in Supplementary Material Figures 1 and 2. At a probability threshold of 0.06, the LightGBM model achieved a false-negative rate (0.009) significantly lower than 0.01 (P <0.001). Using this model to filter alerts could decrease the number of alerts by 54.1% with a false-negative rate below 1% (Figure 2).

Table 2.

Precision and F1 of each model when the sensitivity value was set to ≥0.99

Model | Probability threshold | Sensitivity | Precision | F1 | Filter rate (%)
Support vector machine | 0.04 | 0.991 [0.991, 0.991] | 0.103 [0.102, 0.103] | 0.186 [0.186, 0.186] | 14.0
Logistic regression | 0.02 | 0.991 [0.991, 0.991] | 0.117 [0.117, 0.117] | 0.209 [0.209, 0.209] | 25.0
Neural network | 0.01 | 0.994 [0.994, 0.994] | 0.128 [0.128, 0.128] | 0.226 [0.226, 0.227] | 31.0
Random forest | 0.15 | 0.992 [0.992, 0.992] | 0.145 [0.145, 0.145] | 0.253 [0.253, 0.253] | 39.2
LightGBM | 0.06 | 0.991 [0.991, 0.991] | 0.192 [0.192, 0.192] | 0.322 [0.322, 0.322] | 54.1

Note: Filter rate is the proportion of alerts that would be predicted as not being successful and filtered out.

Figure 2.


Precision-recall curves for machine learning models.

The precision-recall curves for each model are shown in Figure 2. The LightGBM model obtained the highest area under the precision-recall curve (P <0.001). When recall was greater than 0.122, the LightGBM model outperformed the other models.

Figure 3.


The precision-recall curve for the LightGBM model after removing different feature categories.

H2: Removal of any feature category does not change precision

The classification performance of the LightGBM model with different feature categories removed is presented in Table 3. Removing features associated with orders significantly decreased precision and F1, by 0.045 and 0.065, respectively (P =0.001). Removing features related to provider characteristics or patient diagnosis history did not significantly change precision or F1. Removing each of the remaining feature categories significantly changed precision, F1, or both. These differences were small and might not be clinically significant; removing vital sign, allergy, or laboratory features even slightly increased precision and F1. Boxplots for precision and F1 are in Supplementary Material Figures 3 and 4.

Table 3.

Threshold, precision, and F1 of the LightGBM model after removing certain feature categories

Feature category | Sensitivity | Precision | F1 | Filter rate (%)
All features | 0.991 [0.991, 0.991] | 0.192 [0.192, 0.192] | 0.322 [0.322, 0.322] | 54.1
Order (-) | 0.992 [0.991, 0.992] | 0.147** [0.147, 0.147] | 0.256** [0.256, 0.257] | 40.1
Alert (-) | 0.992 [0.992, 0.992] | 0.178** [0.178, 0.178] | 0.302** [0.302, 0.302] | 54.2
Patient (-) | 0.991 [0.991, 0.991] | 0.183** [0.183, 0.184] | 0.310** [0.310, 0.310] | 51.9
Alert response (-) | 0.991 [0.991, 0.991] | 0.190** [0.189, 0.190] | 0.318** [0.318, 0.318] | 53.5
Medication (-) | 0.991 [0.991, 0.991] | 0.190** [0.190, 0.191] | 0.320 [0.319, 0.320] | 53.8
Provider (-) | 0.991 [0.991, 0.991] | 0.191 [0.191, 0.191] | 0.320 [0.320, 0.321] | 53.8
Diagnosis (-) | 0.991 [0.991, 0.992] | 0.192 [0.192, 0.192] | 0.322 [0.322, 0.322] | 53.9
Vital signs (-) | 0.992 [0.992, 0.992] | 0.193 [0.193, 0.193] | 0.323* [0.323, 0.323] | 54.3
Allergy (-) | 0.991 [0.991, 0.992] | 0.193 [0.193, 0.194] | 0.324** [0.323, 0.324] | 54.3
Lab (-) | 0.991 [0.991, 0.992] | 0.194** [0.194, 0.194] | 0.324** [0.324, 0.324] | 54.6

Note: Thresholds were modified to let the model achieve a sensitivity ≥0.99. The first row shows the performance of the model using all features. The threshold value for all models was 0.06. Filter rate: the proportion of alerts that would be predicted as not being successful and filtered out. **P =0.001, *P <0.05.

Table 4 presents results of the LightGBM model using feature groups defined by their feasibility for operational use. Compared with the model using all features, limiting the model to technically implementable features slightly decreased precision and F1, by 0.002 and 0.004, respectively (P <0.001). These differences are likely not clinically significant. Boxplots for precision and F1 are in Supplementary Material Figure 5. The filter rate also decreased slightly, from 54.1% to 53.5%.

Table 4.

Threshold, precision, and F1 of the LightGBM model using datasets with features in different categories related to the feasibility of operational use

Feature set | Threshold | Sensitivity | Precision | F1 | Filter rate (%)
Technically implementable features | 0.06 | 0.991 [0.991, 0.991] | 0.190** [0.190, 0.190] | 0.318** [0.318, 0.319] | 53.5
All features | 0.06 | 0.991 [0.991, 0.991] | 0.192 [0.192, 0.192] | 0.322 [0.322, 0.322] | 54.1

Note: Thresholds were modified to let the model achieve a sensitivity ≥0.99. **P <0.001.

DISCUSSION

In this study, we accomplished 3 goals: (1) identifying features that potentially predict user behavior when interacting with medication alerts; (2) testing the feasibility of applying machine learning models to predict user responses with high sensitivity; and (3) evaluating the impact of different categories of features on user response prediction.

Two experiments were performed in this study. The first experiment demonstrated that all 5 machine learning models met the predefined sensitivity threshold of 0.99 for clinical utility, whereas the rule-based model did not. This finding suggests that machine learning techniques are better suited than the rule-based approach to predicting user responses to medication alerts. Among the machine learning models, LightGBM achieved the highest precision. LightGBM and related gradient boosting models have also performed well in previous studies of other medication-related tasks, such as predicting opioid-overdose risk and detecting adverse drug reactions.35,36,41,42 Lo-Ciganic et al.35 applied machine learning techniques to predict opioid overdose risk among Medicare beneficiaries. They found that a gradient boosting model outperformed logistic regression and random forest.35 Hoang et al.36 found that a gradient boosting model detected signals for adverse drug reactions better than other supervised machine learning models. Several characteristics of the LightGBM algorithm may contribute to its higher performance. First, its histogram-based algorithm partitions continuous features into discrete bins, which may suit many continuous clinical features. Second, its leaf-wise tree growth allows the construction of trees complex enough to describe different clinical settings.
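The histogram idea can be illustrated with a simplified sketch. A continuous feature is mapped to a small number of bins, so candidate split points only need to be evaluated between bins rather than between every pair of raw values. The sketch uses quantile-based bin edges for illustration only. LightGBM’s own bin construction (controlled by parameters such as `max_bin`) differs in detail, and the function name is hypothetical.

```python
import numpy as np


def histogram_bins(values, max_bins=4):
    """Map a continuous feature to integer bin indices using quantile edges.

    Returns (bin index per value, inner bin edges).
    """
    values = np.asarray(values, dtype=float)
    # Inner edges at evenly spaced quantiles (eg, 25th/50th/75th for 4 bins).
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    inner_edges = np.unique(np.quantile(values, quantiles))
    # side="right": a value equal to an edge falls into the bin above it.
    return np.searchsorted(inner_edges, values, side="right"), inner_edges


bins, edges = histogram_bins(np.arange(1, 9), max_bins=4)
```

With 8 values and 4 bins, a tree would consider at most 3 split points for this feature instead of 7.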

One important issue in filtering alerts is the false-negative rate of the model, because it is critical to avoid suppressing alerts that would lead to a successful intervention. The LightGBM model’s false-negative rate was significantly lower than the predefined maximal false-negative rate when the threshold was less than or equal to 0.06 (P <0.001). This finding indicates that the probability threshold needs to be set at a very low value in order to make the model clinically acceptable.

In the second experiment, we found that medication order parameters significantly improved model performance. Medication order parameters included:

- the medications and medication subclasses involved;
- medication route;
- pro re nata (p.r.n., or as-needed) order;
- medication frequency;
- order priority;
- order protocol; and
- order care setting (inpatient or outpatient).

Seidling et al.43 proposed that using details of medication orders (eg, route of administration) to tailor medication alerts could improve alert appropriateness from 10% to 25%. Order care setting has also been proposed to be positively correlated with alert acceptance.44,45 To facilitate the potential implementation of alert filtering based on machine learning, we considered the difficulty of operationalizing features. The model using only features that could potentially be operationalized in the future had precision similar to that of the model with all features. It filtered 53.5% of alerts with a <1% false-negative rate. What we define here as “technically implementable” refers to a future envisioned CDS state; nonetheless, this analysis suggests that filtering alerts in this manner may be of practical value. Removing some feature categories significantly decreased precision and F1 in the statistical analysis. However, those differences were very small and unlikely to be clinically significant.

We identified one previous study, by Poly et al.,19 that applied machine learning methods to predict user responses to alerts based on 5 features. Our study used a more extensive set of 751 features reflecting the context of users’ interactions with medication alerts. In addition, Poly et al. and the present study used different definitions of user response. Poly et al. defined the user response as the user’s immediate acceptance or rejection of the alert. In contrast, we used an outcome proposed by Kawamoto et al.,20 which accounts for relevant actions taken up to an hour after the alert. We chose this approach because users can dismiss the alert itself but still take appropriate actions downstream in the workflow after clearing the alert.

In particular, under the third criterion, an alert was successful if the user overrode the alert, the alert did not fire again within 1 h, and the medication order was discontinued within an hour for purposes other than discharge. This criterion may include instances where a physician ignored an alert but later discontinued the order because of intervention by a pharmacist. We lacked data to determine whether a pharmacist intervened or the physician proactively canceled the order (eg, after reviewing additional patient data or consulting a pharmacist). To ensure patient safety, we therefore categorized all cases meeting the third criterion as successful alerts. Nearly 159,000 alerts met this criterion. Of these, the medication was discontinued within 5 min of the alert for 101,997 (64.3%), between 5 and 15 min for 22,810 (14.4%), and after more than 15 min for 33,880 (21.3%). Given the short interval between the alert and discontinuation, pharmacist intervention is less likely in those cases.

Furthermore, we believe that machine learning algorithms should not suppress alerts that physicians ignore but pharmacists catch downstream, leading to discontinuation. In an ideal scenario, physicians would have seen the alert and discontinued the medication without pharmacist intervention.

This study has several limitations.

First, it was a retrospective analysis of 2019 data from a single healthcare system using one EHR system. We evaluated the model on a held-out testing dataset with 1000 rounds of bootstrapping, but the generalizability of the model should be assessed using a prospective external dataset in the future.

Second, these models were trained on user responses to medication alerts, and those responses may not have been the best decisions for the patient. Filtering alerts with predictive models based on clinicians’ responses to previous alerts therefore risks automation bias that reinforces suboptimal clinical decisions. Important alerts could be filtered out simply because clinicians ignored similar alerts in the past.

Third, recent research has suggested that tailoring alerts can successfully decrease alert volume but has no impact on alert override rates.7,46 Consequently, we cannot assume that filtering out alerts that are likely to be ignored would necessarily lead users to pay more attention to the remaining alerts.

Fourth, filtering medication alerts will reduce the amount of user interaction data. Researchers may then lack sufficient data to maintain or update the prediction models. Filtering might also fundamentally change user behavior. Therefore, an evaluation system should be developed using a mixed-methods approach to comprehensively assess clinician behavior and patient outcomes.

Future work in this area could include optimizing predictive models by exploring temporal trends with recurrent neural networks. It could also add embedding layers that map categorical input features to lower-dimensional representations. Technical investigations could enable the integration of this type of approach into the EHR. Such integration would also allow exploration of combinations of feature categories, including potentially creating more parsimonious models to facilitate implementation. Researchers could also analyze the false negatives further to characterize which successful alerts these models miss. From a practical standpoint, clinicians need to know what is potentially missed.

We envision a future state in which implementing filtering rules based on predictive models is technically feasible using standards-based CDS approaches such as CDS Hooks. However, many more steps will be needed to enable such a future:

- EHR vendors adopting CDS Hooks to replace or augment their current medication alert mechanisms;
- medication knowledge vendors adopting the approach; and
- perhaps most challenging from a technical perspective, achieving response times in line with current, highly optimized approaches.

Finally, much more sociotechnical and user acceptance work is needed before a predictive modeling-based approach to filtering medication alerts is operationalized. For example, users may not accept a “black box” approach to filtering alerts because its behavior is hard to predict. Even a 1% false-negative rate may be clinically unacceptable for some types of alerts. Furthermore, recent research has found that a hybrid rule-based and machine learning approach can identify high-risk medication errors in prescriptions.47 Future studies are warranted to evaluate the performance of hybrid models in predicting user responses to medication alerts.
Because LightGBM is an ensemble model with relatively low interpretability, we used 2 approaches to help explain model performance. First, we identified statistically and clinically significant features, such as the medications involved, the medication route, pro re nata (p.r.n., or as-needed) orders, and the order care setting (inpatient or outpatient). Second, we performed a preliminary interpretability analysis by examining how removing different feature categories affected model performance. Combining the rule-based model with the LightGBM model may improve interpretability and give informatics experts more guidance for evaluating the prediction results. Drawing causal inferences from machine learning models to help users understand model behavior is an important direction for future research.

CONCLUSION

Alert fatigue is a significant problem. In this study, we demonstrated the feasibility of applying machine learning models to predict user responses to medication alerts using a large set of contextual features. If used in clinical practice, the LightGBM model could filter out 54.1% of medication alerts with a 0.9% false-negative rate, indicating the potential for automatic filters to hide alerts that users are unlikely to act on. In addition, we identified medication order parameters as particularly important for predicting user responses. Removing features that are particularly challenging to operationalize in care settings did not significantly change model precision.
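Figures like those above come from fixing sensitivity at 0.99, so the false-negative rate stays at or below 0.01. The operating point with the best precision is then chosen, and the alerts falling below it are counted as hidden. The following is a minimal sketch of that selection on a toy set of ten scored alerts. The scores, labels, and function names are hypothetical, and this brute-force scan illustrates the arithmetic rather than reproducing the authors' procedure.

```python
def choose_threshold(scores, labels, min_sensitivity=0.99):
    """Return the score threshold with the highest precision among thresholds that keep
    sensitivity >= min_sensitivity. Alerts scoring below the threshold would be hidden.

    scores: predicted probability that the user acts on each alert.
    labels: 1 if the user acted on the alert, 0 if it was ignored.
    """
    positives = sum(labels)
    best_t, best_precision = None, -1.0
    for t in sorted(set(scores)):
        shown = [y for s, y in zip(scores, labels) if s >= t]
        tp = sum(shown)
        if tp / positives < min_sensitivity:
            break  # raising the threshold can only lower sensitivity further
        precision = tp / len(shown)
        if precision > best_precision:
            best_t, best_precision = t, precision
    return best_t


def summarize(scores, labels, threshold):
    """Return (precision, sensitivity, fraction of alerts hidden) at a threshold."""
    shown = [y for s, y in zip(scores, labels) if s >= threshold]
    tp = sum(shown)
    return tp / len(shown), tp / sum(labels), 1 - len(shown) / len(scores)


# Hypothetical toy data: 10 scored alerts, 3 of which the user acted on.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
t = choose_threshold(scores, labels)
precision, sensitivity, hidden = summarize(scores, labels, t)
print(t, f"{precision:.2f} {sensitivity:.2f} {hidden:.2f}")
# prints 0.4 0.50 1.00 0.40
```

With only three acted-on alerts, a 0.99 sensitivity floor forces all of them to stay visible. On a test set with millions of alerts, the same rule would tolerate hiding up to 1% of acted-on alerts. The quadratic scan is kept deliberately simple; a sorted cumulative count would be the practical choice at scale.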

FUNDING

This work was supported by the University of Utah.

AUTHOR CONTRIBUTIONS

SL conducted the literature review, feature identification, data extraction, model development, and statistical analysis, and drafted the work. SL, SA, KK, GDF, CW, and DM were involved in the experimental design. SL, SA, KK, GDF, CW, DM, TR, KM, and DE were involved in revising the features and the paper. SA mentored all machine learning steps of the project. All authors approved the submitted version.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

KK reports honoraria, consulting, sponsored research, licensing, or co-development outside the submitted work in the past 3 years with McKesson InterQual, Hitachi, Pfizer, Premier, Klesis Healthcare, RTI International, Mayo Clinic, the University of Washington, the University of California at San Francisco, MD Aware, and the U.S. Office of the National Coordinator for Health IT (via ESAC and Security Risk Solutions) in the area of health information technology. KK was also an unpaid board member of Health Level Seven International, a nonprofit health IT standards development organization; he is an unpaid member of the U.S. Health Information Technology Advisory Committee; and he has helped develop a number of health IT tools that may be commercialized to enable wider impact. None of these relationships has direct relevance to the manuscript, but they are reported in the interest of full disclosure. The other authors have no conflicts of interest related to this study.

DATA AVAILABILITY

The data underlying this article cannot be shared publicly due to patient healthcare data privacy protection requirements.

Supplementary Material

ocab292_Supplementary_Data

Contributor Information

Siru Liu, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Kensaku Kawamoto, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Guilherme Del Fiol, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Charlene Weir, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Daniel C Malone, Department of Pharmacotherapy, Skaggs College of Pharmacy, University of Utah, Salt Lake City, Utah, USA.

Thomas J Reese, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

Keaton Morgan, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

David ElHalta, Pharmacy Services, University of Utah, Salt Lake City, Utah, USA.

Samir Abdelrahman, Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA; Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt.

REFERENCES

1. Dexheimer JW, Kirkendall ES, Kouril M, et al. The effects of medication alerts on prescriber response in a pediatric hospital. Appl Clin Inform 2017; 8 (2): 491–501.
2. Cash JJ. Alert fatigue. Am J Health Syst Pharm 2009; 66 (23): 2098–101.
3. Osheroff JA, Teich JM, Middleton B, et al. A roadmap for national action on clinical decision support. J Am Med Inform Assoc 2007; 14 (2): 141–5.
4. Kawamoto K, Houlihan CA, Balas EA, et al. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005; 330 (7494): 765.
5. Chaparro JD, Hussain C, Lee JA, et al. Reducing interruptive alert burden using quality improvement methodology. Appl Clin Inform 2020; 11 (1): 46–58.
6. McCoy AB, Thomas EJ, Krousel-Wood M, et al. Clinical decision support alert appropriateness: a review and proposal for improvement. Ochsner J 2014; 14 (2): 195–202.
7. Kane-Gill SL, O’Connor MF, Rothschild JM, et al. Technologic distractions (part 1): summary of approaches to manage alert quantity with intent to reduce alert fatigue and suggestions for alert fatigue metrics. Crit Care Med 2017; 45 (9): 1481–8.
8. Liu S, Reese TJ, Kawamoto K, et al. A systematic review of theoretical constructs in CDS literature. BMC Med Inform Decis Mak 2021; 21 (1): 102.
9. Liu S, Reese TJ, Kawamoto K, et al. Toward optimized clinical decision support: a theory-based approach. In: 2020 IEEE International Conference on Healthcare Informatics (ICHI). IEEE; 2020.
10. Liu S, Reese TJ, Kawamoto K, et al. A theory-based meta-regression of factors influencing clinical decision support adoption and implementation. J Am Med Inform Assoc 2021; 28 (11): 2514–22.
11. Paterno MD, Maviglia SM, Gorman PN, et al. Tiering drug-drug interaction alerts by severity increases compliance rates. J Am Med Inform Assoc 2009; 16 (1): 40–6.
12. Kawamoto K, Flynn MC, Kukhareva P, et al. A pragmatic guide to establishing clinical decision support governance and addressing decision support fatigue: a case study. AMIA Annu Symp Proc 2018; 2018: 624–33. http://www.ncbi.nlm.nih.gov/pubmed/30815104. Accessed December 29, 2021.
13. Payne TH, Hines LE, Chan RC, et al. Recommendations to improve the usability of drug-drug interaction clinical decision support alerts. J Am Med Inform Assoc 2015; 22 (6): 1243–50.
14. Ancker JS, Edwards A, Nosal S, et al. Effects of workload, work complexity, and repeated alerts on alert fatigue in a clinical decision support system. BMC Med Inform Decis Mak 2017; 17 (1): 36.
15. Jung M, Riedmann D, Hackl WO, et al. Physicians’ perceptions on the usefulness of contextual information for prioritizing and presenting alerts in computerized physician order entry systems. BMC Med Inform Decis Mak 2012; 12: 111.
16. Riedmann D, Jung M, Hackl WO, et al. Development of a context model to prioritize drug safety alerts in CPOE systems. BMC Med Inform Decis Mak 2011; 11: 35.
17. Ammenwerth E, Hackl WO, Riedmann D, et al. Contextualization of automatic alerts during electronic prescription: researchers’ and users’ opinions on useful context factors. Stud Health Technol Inform 2011; 169: 920–4.
18. Riedmann D, Jung M, Hackl WO, et al. How to improve the delivery of medication alerts within computerized physician order entry systems: an international Delphi study. J Am Med Inform Assoc 2011; 18 (6): 760–6.
19. Poly TN, Islam MM, Muhtar MS, et al. Machine learning approach to reduce alert fatigue using a disease medication–related clinical decision support system: model development and validation. JMIR Med Inform 2020; 8 (11): e19489.
20. Kawamoto K, Flynn MC, Kukhareva P, et al. A pragmatic guide to establishing clinical decision support governance and addressing decision support fatigue: a case study. AMIA Annu Symp Proc 2018; 2018: 624–33. http://www.ncbi.nlm.nih.gov/pubmed/30815104. Accessed March 6, 2020.
21. Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol 1991; 134 (8): 895–907.
22. Carpenter JR, Kenward MG. Missing Data in Randomised Controlled Trials: A Practical Guide. Birmingham, UK: Health Technology Assessment Methodology Programme; 2008: 199.
23. Potdar K, Pardawala TS, Pai CD. A comparative study of categorical variable encoding techniques for neural network classifiers. IJCA 2017; 175 (4): 7–9.
24. Bhaskar H, Hoyle DC, Singh S. Machine learning in bioinformatics: a brief survey and recommendations for practitioners. Comput Biol Med 2006; 36 (10): 1104–25.
25. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. JAIR 2002; 16: 321–57.
26. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc 1952; 47 (260): 583–621.
27. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In: Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). New York, NY: Springer; 1992: 11–28.
28. Benesty J, Chen J, Huang Y. Pearson correlation coefficient. In: Encyclopedia of Public Health. Dordrecht, the Netherlands: Springer; 2009: 1–4.
29. Kuhn M, Johnson K. Data pre-processing. In: Applied Predictive Modeling. New York, NY: Springer; 2013: 27–59.
30. Hosmer DW, Lemeshow S. Applied Logistic Regression. Hoboken, NJ: Wiley; 2000.
31. Breiman L. Random forests. Mach Learn 2001; 45 (1): 5–32.
32. Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Machine Intell 1990; 12 (10): 993–1001.
33. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv:1810.11363. Published online October 24, 2018.
34. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems. Long Beach, CA; 2017: 3147–55. https://github.com/Microsoft/LightGBM. Accessed May 13, 2020.
35. Lo-Ciganic WH, Huang JL, Zhang HH, et al. Evaluation of machine-learning algorithms for predicting opioid overdose risk among Medicare beneficiaries with opioid prescriptions. JAMA Netw Open 2019; 2 (3): e190968.
36. Hoang T, Liu J, Roughead E, et al. Supervised signal detection for adverse drug reactions in medication dispensing data. Comput Methods Programs Biomed 2018; 161: 25–38.
37. Mani S, Ozdas A, Aliferis C, et al. Medical decision support using machine learning for early detection of late-onset neonatal sepsis. J Am Med Inform Assoc 2014; 21 (2): 326–36.
38. Demsar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006; 7: 1–30.
39. CDS Hooks. 2017. https://cds-hooks.org/. Accessed December 30, 2020.
40. Health Level 7. FHIR v4.0.1. https://www.hl7.org/fhir/. Accessed December 30, 2020.
41. Liu J, Wu J, Liu S, et al. Predicting mortality of patients with acute kidney injury in the ICU using XGBoost model. PLoS One 2021; 16 (2): e0246306.
42. Li K, Shi Q, Liu S, et al. Predicting in-hospital mortality in ICU patients with sepsis using gradient boosting decision tree. Medicine (Baltimore) 2021; 100 (19): e25813.
43. Seidling HM, Klein U, Schaier M, et al. What, if all alerts were specific: estimating the potential impact on drug interaction alert burden. Int J Med Inform 2014; 83 (4): 285–91.
44. Seidling HM, Phansalkar S, Seger DL, et al. Factors influencing alert acceptance: a novel approach for predicting the success of clinical decision support. J Am Med Inform Assoc 2011; 18 (4): 479–84.
45. Daniels CC, Burlison JD, Baker DK, et al. Optimizing drug-drug interaction alerts using a multidimensional approach. Pediatrics 2019; 143 (3): e20174111.
46. Horn JR, Hansten PD, Osborn JD, et al. Customizing clinical decision support to prevent excessive drug–drug interaction alerts. Am J Health Syst Pharm 2011; 68 (8): 662–4.
47. Corny J, Rajkumar A, Martin O, et al. A machine learning–based clinical decision support system to identify prescriptions with a high risk of medication error. J Am Med Inform Assoc 2020; 27 (11): 1688–94.


Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
