JAMIA Open. 2025 Feb 25;8(1):ooaf016. doi: 10.1093/jamiaopen/ooaf016

Transforming appeal decisions: machine learning triage for hospital admission denials

Timothy Owolabi
PMCID: PMC11854074  PMID: 40008183

Abstract

Objective

To develop and validate a machine learning model that helps physician advisors efficiently identify hospital admission denials likely to be overturned on appeal.

Materials

Analysis of 2473 appealed hospital admission denials with known outcomes, split 90:10 for training and testing.

Methods

Six binary classifier models were trained and evaluated using accuracy, precision, recall, and F1 score metrics.

Results

An elastic net logistic regression model was selected based on computational efficiency and optimal performance with 84% accuracy, 84% precision, 98% recall, and an F1 score of 0.9.

Discussion

The predictive model addresses the risk of physician advisors accepting inappropriate denials due to biased perceptions of appeal success. Model implementation improved denial screening efficiency and was a key feature of a more successful appeal strategy.

Conclusions

By addressing data quality problems inherent to electronic health data, and expanding the feature space, machine learning can be an effective tool in the healthcare provider space.

Keywords: supervised machine learning, artificial intelligence, insurance claim reviews, utilization review, concurrent review, physician advisor, natural language processing

Introduction

Background

Medical necessity, crucial for insurance coverage, refers to healthcare services meeting accepted diagnostic or treatment standards.1 Through utilization management (UM), providers and insurers evaluate services against these standards. Physician advisors, required by law (42 CFR § 482.30)2 and accreditation guidelines, review denied claims for potential appeals. WellSpan Health, a nonprofit integrated system serving Pennsylvania, provided inpatient care to 72 645 patients in 2024, with significant dependence on insurance payments.3

Objectives

This study demonstrates how machine learning can assist physician advisors to efficiently identify wrongly denied hospital admissions. By predicting which denials are likely to be overturned on appeal, the algorithm helps ensure hospitals recover appropriate payments while reducing the manual burden of screening hundreds of denials each month.

Literature review

The insurance and regulatory landscape

A 2019 analysis of 150 million Medicare Advantage and commercial enrollees showed increasing observation care use, especially among older patients, often for nonclinical reasons.4 United States hospital claim denial rates range from 9% to 17%.5,6 In response to documented Medicare Advantage service delays and denials,7 CMS mandated alignment with traditional Medicare standards (CMS-4201-F).8

Machine learning in healthcare

Machine learning applications for electronic health record (EHR) data face challenges like poor labeling and variable clinical representations.9,10 While denial risk prediction and fraud detection typically use claims data,11–13 this work extends these methods from insurance settings14 to provider decision support.

Materials

Data pre-processing

EPIC electronic health denial data was analyzed from June 1, 2020 to August 31, 2023. Fifty observations were removed for a variety of reasons, including missing Social Vulnerability Index (SVI) scores (Appendix S1). SVI scores are based on US Census data and are created and maintained by the Centers for Disease Control and Prevention.15 The final dataset contained 2473 observations across 28 variables (Appendix S2). Appeal outcomes and factor variables were one-hot encoded, and data was split 90:10 for training and testing in a one-time split. The 10% hold-out data was selected using an R package that allowed automated, random, stratified sampling, ensuring that class proportions in the test set remained consistent with the full dataset. This approach maintained representativeness while reducing potential bias in model evaluation. Araújo et al. successfully used random over-sampling to address class imbalance.14 While there was a class imbalance in the training data, random over-sampling degraded model performance, so this technique was omitted.
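The one-time stratified 90:10 split can be sketched as follows. This is an illustrative Python/scikit-learn analogue of the R workflow described above, not the study's code; the dataset and column names (`los_days`, `svi_score`, `overturned`) are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical denial dataset: one row per appealed denial, with a binary
# outcome column indicating whether the denial was overturned on appeal.
denials = pd.DataFrame({
    "los_days": [2, 5, 1, 7, 3, 4, 2, 6, 1, 3] * 10,
    "svi_score": [0.2, 0.8, 0.5, 0.1, 0.9, 0.4, 0.6, 0.3, 0.7, 0.5] * 10,
    "overturned": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1] * 10,  # ~80% overturned
})

X = denials.drop(columns="overturned")
y = denials["overturned"]

# 90:10 one-time split, stratified so the test set preserves the
# ~80/20 class proportions of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # class proportions match (~0.8)
```

Stratification matters here precisely because of the class imbalance: an unstratified 10% sample could easily contain too few upheld denials to evaluate precision reliably.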

Baseline accuracy

Evaluation of model performance requires establishing an appropriate baseline for comparison. In classification problems with unbalanced class distributions, the majority class proportion serves as a performance threshold. In the training dataset, 80% of appeals resulted in overturned denials. Therefore, a naive classifier that predicts the majority class for all denials would achieve 80% accuracy. This majority class baseline provides a reference point for assessing whether a model offers meaningful predictive value beyond simply identifying the predominant outcome.
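As a minimal numeric check of the baseline described above (the counts are chosen to mirror the reported 80% overturn rate and 2225-row training set; this is not the study's code):

```python
# Hypothetical training labels: 1 = denial overturned on appeal, 0 = upheld.
labels = [1] * 1780 + [0] * 445  # 80% overturned across 2225 observations

# A naive classifier always predicts the majority class (overturned).
majority = max(set(labels), key=labels.count)
baseline_accuracy = sum(1 for y in labels if y == majority) / len(labels)
print(f"{baseline_accuracy:.2f}")  # -> 0.80
```

Any candidate model must therefore beat 80% accuracy to add value beyond always predicting "overturned."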

Clinical concept creation

Feature extraction from diagnostic data utilized a rule-based natural language processing (NLP) approach built on substring matching within ICD-10 codes. First, 30 distinct clinical concepts were defined, each characterized by specific character patterns that could appear within ICD-10 codes. Rather than matching complete ICD-10 codes, the algorithm searched for these predefined string patterns within the diagnostic codes of each observation. For example, a clinical concept might be identified by the presence of specific characters or substrings that appear across multiple related ICD-10 codes, eg, the substrings “delirium” and “encephalopathy” both mapped to the altered mental status clinical concept. This approach created binary features indicating the presence or absence of each clinical concept, effectively transforming the raw diagnostic codes into structured features while capturing clinically meaningful patterns that span multiple related diagnoses. A dedicated search for each clinical concept was conducted on the diagnosis variable for every denial, and each concept was dummy coded for all denials in the data set, introducing 30 new features (Appendix S3).
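The substring-to-concept mapping can be sketched as follows (a minimal Python illustration; the concept names and patterns shown are assumptions standing in for the study's full list of 30 concepts):

```python
# Illustrative subset of the concept dictionary: each clinical concept is
# defined by substrings that may appear across multiple related diagnoses.
CONCEPT_PATTERNS = {
    "altered_mental_status": ["delirium", "encephalopathy"],
    "heart_failure": ["heart failure", "cardiomyopathy"],
}

def concept_flags(diagnosis_text: str) -> dict:
    """Dummy-code each clinical concept as 1 if any of its substrings
    appears in the (lower-cased) diagnosis string, else 0."""
    text = diagnosis_text.lower()
    return {
        concept: int(any(pat in text for pat in patterns))
        for concept, patterns in CONCEPT_PATTERNS.items()
    }

flags = concept_flags("Metabolic encephalopathy; acute on chronic heart failure")
print(flags)  # {'altered_mental_status': 1, 'heart_failure': 1}
```

Running the flagging function over every denial yields one binary column per concept, which is how the 30 new features enter the data set.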

Methods

Model descriptions

Six binary classification algorithms, varying in computational intensity and modeling approach, were trained. The goal of the predictive model (PM) is to identify accepted denials that are likely to be overturned on appeal. Human screening of accepted denials is time consuming, cognitively taxing, and subject to bias. In addition to accuracy, models should balance recall and precision to help physician advisors save time and increase the likelihood of successfully appealing denials that appear unappealable to the human eye. The Elastic Net (eNet) model was selected based on its overall performance on a single, novel test dataset.16 Detailed model descriptions are available in Appendix S4.

Model training

Hardware and software details

For this analysis, a laptop computer equipped with 4 CPU cores was utilized (Appendix S5). All coding tasks were executed using R (version 4.3.1), a robust open-source programming system designed for statistical computation and graphics. The R tidymodels package was instrumental in model training.17 This R package provides a framework to efficiently and consistently create machine learning algorithms. The tidymodels general structure follows 3 steps: specify the type of learner (algorithm), create a recipe for model training (workflow), and fit the model. Feature engineering steps can be embedded in the workflow.

Approach to model development

The training data comprised 2225 observations and 52 variables. Clinical subject matter expertise was used to identify the most relevant EHR data elements, and the feature space was expanded to 200 to 300 features per candidate model by iteratively adding feature interactions that improved model performance. Hyperparameters were tuned within each model’s tidymodels workflow, and each candidate predictive model underwent 10-fold cross-validation: the data set was randomly divided into 10 subsets, and each subset in turn served as the validation set while the model was trained on the remaining 9. After training and cross-validation, the models were applied to the 10% hold-out data set for testing. The testing data set contained 248 observations, the same 52 variables, and the same feature interactions included in the training data.
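A scaled-down sketch of this workflow, using Python/scikit-learn as an analogue of the R tidymodels pipeline (the synthetic data, grid values, and pipeline structure here are illustrative assumptions, not the study's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training matrix, with the ~80/20 class split.
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.2, 0.8], random_state=0)

# Pairwise interactions expand the feature space (here 12 -> 78; the study
# hand-selected interactions to reach 200-300 features from 52 variables).
pipe = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)

# Hyperparameter tuning with 10-fold cross-validation.
grid = GridSearchCV(
    pipe,
    {"logisticregression__l1_ratio": [0.25, 0.5, 0.75],
     "logisticregression__C": [0.1, 1.0]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The elastic net's `l1_ratio` hyperparameter is what lets it interpolate between lasso (1.0) and ridge (0.0) regularization, which is relevant to the model comparison reported below.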

Results

Test data model outcomes

In addition to accuracy, precision, and recall, the F1 score was reported to quantify each model’s ability to balance the tradeoff between avoiding futile appeals (precision) and maximizing successful appeals (recall). To validate the selection of the eNet model, a post-hoc analysis was conducted using bootstrapped samples to calculate confidence intervals for all model metrics and to perform pairwise t-tests comparing F1 scores across models.
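The metric and bootstrap computations can be illustrated as follows (the confusion counts are hypothetical, chosen only to approximate the reported eNet error profile on the 248-observation test set; this is not the study's data or code):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Hypothetical test-set labels and predictions (1 = overturned):
# 194 true positives, 4 false negatives, 39 false positives, 11 true negatives.
y_true = np.array([1] * 198 + [0] * 50)
y_pred = np.concatenate([np.ones(194), np.zeros(4),
                         np.ones(39), np.zeros(11)]).astype(int)

print(round(precision_score(y_true, y_pred), 3),
      round(recall_score(y_true, y_pred), 3),
      round(f1_score(y_true, y_pred), 3))

# Percentile bootstrap 95% CI for the F1 score.
n = len(y_true)
f1_samples = []
for _ in range(2000):
    idx = rng.integers(0, n, n)  # resample cases with replacement
    f1_samples.append(f1_score(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(f1_samples, [2.5, 97.5])
print(f"F1 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Bootstrapping the test cases (rather than refitting models) yields confidence intervals around each point estimate, which is the kind of interval reported in Table 1.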

While several models demonstrated similar performance, the logistic regression-based models (elastic net, lasso, and ridge) consistently outperformed more computationally intensive alternatives (eg, decision trees, support vector machines, and neural networks). Pairwise comparisons of F1 scores indicated that the lasso model had the highest overall performance, though the differences between lasso and eNet were small. Given the real-world considerations of model deployment, including generalizability and stability across datasets, the eNet model remained an appropriate choice due to its ability to balance regularization effects while maintaining competitive performance (Table 1). All t-test pairwise comparisons are available in Appendix S6.

Table 1.

Model comparison on test data.

Model | Accuracy [CI] | Precision [CI] | Recall [CI] | F1 score [CI] | F1 score t-test | P-value
Elastic net model | 0.839 [0.790, 0.883] | 0.844 [0.792, 0.890] | 0.980 [0.959, 0.995] | 0.907 [0.874, 0.934] | eNet versus lasso | .001
Lasso model | 0.843 [0.794, 0.887] | 0.848 [0.798, 0.894] | 0.980 [0.959, 0.995] | 0.909 [0.877, 0.936] | Lasso versus ridge | 8.25E-24
Ridge model | 0.831 [0.782, 0.875] | 0.840 [0.790, 0.885] | 0.975 [0.952, 0.995] | 0.902 [0.871, 0.930] | Lasso versus ridge | 8.25E-24
Decision tree model | 0.810 [0.758, 0.859] | 0.819 [0.767, 0.867] | 0.980 [0.958, 0.995] | 0.892 [0.858, 0.922] | Lasso versus DT | 1.29E-115
Support vector machine model | 0.810 [0.758, 0.859] | 0.822 [0.771, 0.869] | 0.975 [0.953, 0.995] | 0.892 [0.859, 0.921] | Lasso versus SVM | 1.64E-62
Neural net model | 0.827 [0.778, 0.871] | 0.855 [0.805, 0.900] | 0.945 [0.913, 0.973] | 0.897 [0.865, 0.926] | Lasso versus NN | 9.37E-121

The logistic regression (LR) models (Lasso, Ridge, and eNet) performed better on average than the more computationally intensive models.

Calibration of best model (eNet)

After selecting the eNet model, conformal inference techniques were used to assess the alignment of the model’s certainty with the observed rate of denial overturns. The initial model calibration was sub-optimal, so various remediation techniques were explored.18 The most significant improvement was observed when a logistic regression model was fitted to the data, utilizing the probability estimates as predictors. Despite the improved model calibration at upper predicted probabilities, the unbalanced classes and low number of upheld denials in the training data appear to be causing the model difficulties at low probabilities (Figure 1). A Brier score measures the difference between the predicted probability and the actual outcome, with 0 representing perfect accuracy and 1 representing perfect inaccuracy. Due to the class imbalance, the Brier score was used to calculate a Brier Skill Score (BSS). BSS provides a fairer comparison between calibrated and uncalibrated models by normalizing performance relative to a naïve baseline model. The uncalibrated model performed 22.1% better than the naïve baseline, while the calibrated model performed 41.7% better, nearly doubling the relative improvement and enhancing the real-world utility (Table 2).
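The Brier Skill Score and the logistic recalibration described above can be sketched as follows (a hedged Python illustration with synthetic outcomes and probabilities; the naïve reference predicts the observed base rate for every case, and the `p_raw` distribution is an assumption, not the study's model output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def brier(p, y):
    """Mean squared difference between predicted probability and outcome."""
    return np.mean((p - y) ** 2)

rng = np.random.default_rng(1)

# Hypothetical outcomes (~80% overturned) and raw model probabilities.
y = (rng.random(1000) < 0.8).astype(float)
p_raw = np.clip(0.6 * y + 0.25 + rng.normal(0, 0.05, 1000), 0, 1)

# Naive reference model: predict the observed base rate for every denial.
p_ref = np.full_like(y, y.mean())
bs_ref = brier(p_ref, y)

# Brier Skill Score: 1 - BS_model / BS_reference (>0 beats the baseline).
bss_raw = 1 - brier(p_raw, y) / bs_ref

# Platt-style recalibration: a logistic regression fitted with the raw
# probability estimates as the only predictor, as described in the text.
cal = LogisticRegression(max_iter=1000).fit(p_raw.reshape(-1, 1), y)
p_cal = cal.predict_proba(p_raw.reshape(-1, 1))[:, 1]
bss_cal = 1 - brier(p_cal, y) / bs_ref

print(round(bss_raw, 3), round(bss_cal, 3))
```

Normalizing against the base-rate reference is what makes the BSS fair under class imbalance: a model that merely echoes the 80% prevalence scores 0, not 80%.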

Figure 1.

Side-by-side calibration plots of an Elastic Net model before and after logistic regression calibration, showing improved alignment with observed event rates but persistent miscalibration at probability extremes due to class imbalance.

A calibration plot reveals improvement of model performance but suboptimal performance at the extremes.

Table 2.

Model calibration metrics.

Model | Metric | Score
Naïve reference model | Brier score | 0.353269
Uncalibrated | Brier score | 0.275164
Uncalibrated | Brier skill score | 0.221092
Calibrated | Brier score | 0.206037
Calibrated | Brier skill score | 0.416770

The Brier Skill Score reveals that the uncalibrated model performed 22.1% better than the naïve baseline model, while the calibrated model performed 41.7% better, nearly doubling the relative improvement.

Financial impact

The eNet algorithm streamlines the triage process for a repeat physician advisor review of a denial, enabling a more data-driven approach to appeal decisions. The model has been deployed in the production environment along with a Managed Medicare appeal strategy based on the CMS guidance concerning the 2-midnight rule.7 Appeals of Managed Medicare denials based on the 2-midnight rule began in June 2023 with passage of CMS-4201-F.8 Following model deployment, the monthly appeal rate doubled, and the volume of overturned denials nearly doubled, as seen in a screenshot from an EPIC Slicer Dicer radar dashboard depicting WellSpan system appeal data (Figure 2).

Figure 2.

Two line charts from an EPIC Slicer Dicer dashboard showing denial appeal trends. After model deployment, appeal rates and overturned denial volumes nearly doubled, though multiple factors likely contributed.

Following predictive model deployment, the monthly appeal rate doubled, and the volume of overturned denials nearly doubled as seen in this screenshot from an EPIC Slicer Dicer radar dashboard depicting WellSpan system appeal data. This improvement coincided with multiple changes: expanded application of the 2-midnight rule which began in June 2023, implementation of insights gained from denials that would have been previously accepted, and proprietary UM process improvements.

Discussion

Model development

Recent regulatory changes necessitated a fresh approach to hospital admission denial appeals at WellSpan Health. A machine learning solution was developed, drawing from diverse EHR sources. Data preparation included handling missing values and engineering features to standardize multiple representations of clinical conditions.

While biased toward predicting overturn success and showing uncertainty at probability extremes, the elastic net model effectively supports the goal of identifying overlooked appeal opportunities (Figure 3). Following model deployment, a significant increase in overturned denials was observed; however, this improvement coincided with multiple changes: expanded application of the 2-midnight rule, insights gained from denials that would have been previously accepted, and proprietary process improvements. The model streamlines denial review by reducing manual effort and minimizing bias from historical perceptions. Though only one component of a broader appeal strategy, the AI model has proven valuable in optimizing the appeal process.

Figure 3.

Confusion matrix and Venn diagram illustrating the Elastic Net model’s performance. The model achieved 83% accuracy, correctly predicting 194 overturned and 11 upheld denials. Recall was 98%, with a 16% false positive rate, aligning with its goal of triaging potential appeals for human review.

A confusion matrix is a summary table of the performance of a supervised predictive model. Here, the false alarm (false positive) and false negative rates are consistent with the goal of avoiding missed appeal opportunities.

Challenges

Several challenges emerged during model development. The unbalanced distribution of denial outcomes created significant uncertainty, particularly for lower overturn probabilities. Decision tree and support vector machine models underperformed, likely due to insufficient training data. Random over-sampling failed to improve these imbalance issues. Additionally, the model struggled to differentiate between extended hospital stays caused by medical complexity versus discharge barriers, requiring manual review of false positive predictions.

Future use

Model automation is slated as a mid-to-long term goal, contingent on updates to the existing data pipeline. The model currently runs in an R developer environment, but development of an R Shiny app will allow non-developer end-users to run the model on novel data within WellSpan’s secure network. As appeal outcomes drift, model re-training may be needed to maintain generalizability.

Societal impact of inappropriate insurance denials in healthcare

The inappropriate denial of insurance claims creates cascading adverse consequences. For patients, denied claims can lead to delayed care and financial distress. Hospitals bear substantial financial and operational burdens through appeals processes and uncompensated care, particularly safety-net hospitals serving vulnerable populations. On a broader level, inappropriate denials contribute to healthcare system inefficiency. When medically necessary care is denied, patients may later require more expensive emergency care, while resources devoted to managing denials could be better allocated to patient care.

Conclusion

In this project, a predictive model was developed to assess the likelihood of overturning denied hospital admissions accepted upon initial review by a hospital physician advisor. Six classifiers were trained, and the best model was calibrated based on the results of conformal inference techniques. Deployment of this tool illustrates that by addressing data quality problems inherent to electronic health data and expanding the feature space, machine learning can be an effective tool in the healthcare provider space.

Supplementary Material

ooaf016_Supplementary_Data

Acknowledgments

The author would like to thank the health analytics faculty of the Northwestern University School of Professional Studies and the members of his Master’s Capstone advisory committee. Thank you especially to Lynd Bacon, PhD, MBS, Northwestern University for reviewing and providing feedback for this paper. Please note that publicly available large language models were used for writing assistance during the editorial process.

Author contributions

Timothy T. Owolabi (Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Resources, Software, Validation, Visualization)

Supplementary material

Supplementary material is available at JAMIA Open online.

Funding

None declared.

Conflicts of interest

None declared.

Data availability

The data used for this project was extracted from the electronic health record and cannot be shared publicly due to containing protected health information. A public GitHub repository contains sample R code used during model implementation. https://github.com/towola01/UM-Denial-PM.git.

References


