Abstract
Objective
This study assessed whether a real-time artificial intelligence (AI) tool, applied early in hospitalization, correctly aligned with the eventual discharge status of inpatient versus observation.
Methods
This retrospective case–control study at Baylor Scott & White Medical Center – Temple involved patients on 11 randomly chosen calendar days between August 2023 and October 2024. A real-time AI care level score (CLS) and machine learning likelihood (MeL) recommendations for inpatient versus observation discharge status were developed. Receiver operating characteristic curves were used to compare CLS, MeL, and commercial screening tool criteria with actual inpatient versus observation discharge status.
Results
The receiver operating characteristic curve for CLS-based prediction of the MeL recommendation for inpatient status had the highest area under the curve (AUC), 0.9954 (95% confidence interval [CI] = 0.9909, 0.9998). The AUC for CLS alone in predicting inpatient discharge was 0.8949 (95% CI = 0.8692, 0.9206). A CLS score ≥76 yielded the highest correct classification rate, 86%. For CLS versus the commercial screening tool, the AUC was lowest at 0.8419 (95% CI = 0.8121, 0.8717).
Conclusions
Patients with a real-time AI CLS ≥76 had an 86% correct assignment of inpatient discharge status.
Keywords: Clinical decision support, hospital utilization management, machine learning
In their review of observation medicine, Ross and Granovsky dated the first descriptions of observation beds to 1965, when it was recommended that the use of observation beds in the emergency department should not exceed 24 hours.1 The 2005 Centers for Medicare and Medicaid Services (CMS) Transmittal 42 defined observation (status) hospitalization as “a well-defined set of specific, clinically appropriate services, which include ongoing short-term treatment, assessment, and reassessment before a decision can be made regarding whether patients will require further treatment as hospital inpatients or if they are able to be discharged from the hospital.”2 In response to patient advocacy groups, in 2013, the CMS added the two-midnight rule: “Inpatient admissions generally are payable under Part A if the admitting practitioner expects a patient to require a hospital stay that crosses 2 midnights and the medical record supports that reasonable expectation.”3 Status assignment does not affect the care a patient can receive but rather how the hospital services provided are billed. The Office of the Inspector General found that “Medicare paid nearly three times more for a short inpatient stay than an [outpatient] stay for the same condition.”4 Incorrectly billing an inpatient as an observation patient decreases revenue for the hospital by lowering payments from the payer. Even with the two-midnight rule, the definition of an inpatient lacks clarity.4 Both hospitals and payers incur expenses associated with review staff who perform labor-intensive tasks, such as applying commercial screening tools5 to assign inpatient versus observation status, along with the corresponding hospital/payer faxes, emails, and phone calls.
Commercial screening tools are difficult to apply and have been reported to have poor predictive value, failing to replace physician judgment.6,7 When both hospitals and payers are able to see that real-time artificial intelligence (AI) aligns with inpatient versus observation patient status, inpatient status can be authorized based on a mutually agreed threshold, thereby reducing the need for hospitals/payers to apply screening tools, as well as the frequency of faxes, emails, and phone calls.
METHODS
The Baylor Scott & White Research Institute institutional review board reviewed the study, approved the protocol (023-272, reference #403052), and waived the need for informed consent because this was a secondary analysis of existing data. The principal investigator and members of the Baylor Scott & White Research Institute institutional review board by certification of CitiProgram affirm that the study complied with relevant ethics and guidelines, including Health Insurance Portability and Accountability Act privacy protection, principles of the Belmont Report, and principles of records-based research.
Data source
At Baylor Scott & White Medical Center – Temple, electronic medical records (EMRs) and the AI platform DragonFly (previously known as Cortex)8 were reviewed for patients discharged as either inpatients or under observation on 11 randomly selected calendar days from August 2023 to October 2024.
The study used the AI platform DragonFly, developed by Xsolis and powered by machine learning. As of 2021, its big-data input included 356 million documents, 128 million encounters, 3.9 billion laboratory results and vital signs, 765 million medications, and a 124-billion-word corpus, with 1.3 million status predictions. The AI platform begins interrogating the EMR when the patient is registered; if a patient is hospitalized from the emergency department, AI input begins even before the hospitalization order is written. The AI input used to calculate the care level score (CLS) incorporates the patient’s wellness score (based on actuarial life tables, Medicare’s hierarchical condition categories, the Elixhauser Comorbidity Index, and the Charlson Comorbidity Index), vital signs, laboratory values, orders, medications administered, medical history, progress notes, and procedure documents. Additionally, the platform analyzes physician notes and clinical documents to predict the likelihood of inpatient care based on millions of similar cases. AI interrogation of the EMR can occur as frequently as every 2 hours, producing reports of both the maximum and current CLS scores; CLS scores are also recorded in a flowsheet within the EMR. After the AI’s 12-hour discovery window, the machine learning (MeL) prediction of OBSERVATION or INPATIENT may be generated; if no usable prediction exists for the patient, no MeL prediction is produced. Once the MeL status model is populated, the end user can be confident that the CLS incorporates all relevant components of the EMR.
Study population
Patients included in the study were documented by both the EMR and the AI platform as discharged either under observation or as inpatients. Patients <18 years of age and those with ob/gyn hospitalizations were excluded. A total of 630 patients were enrolled. The AI platform’s CLS provides a comprehensive assessment of medical necessity, gauging both the severity of illness and the intensity of service by automatically extracting clinical data from the EMR in near real time as patients progress through their hospital stay. The CLS ranges from 0 to 157, with higher scores indicating that an inpatient order is more appropriate. The AI’s prediction of hospital status typically becomes available within the first 12 hours of hospitalization.
Study outcomes
The presence of a MeL status recommendation for inpatients versus observation patients, along with CLS scores, served as the independent variables. The dependent variables were actual discharge status, length of stay, plus binary subsets of Medicare Advantage payer versus others, surgical versus medical patients, and met or failed commercial screening criteria for inpatient hospitalization.
Statistical methods
All statistical analyses were performed using SAS software version 9.4 (SAS Institute, Cary, NC, USA). Sample characteristics were described using descriptive statistics. Frequencies and percentages were used for categorical variables, while means and standard deviations (or medians and ranges where appropriate) were used for continuous variables. Logistic regression was conducted with four different cutoff points. Pearson’s correlation assessed associations between continuous variables, and the kappa statistic evaluated agreement between methods. Statistical significance was set at P < .05. A two-sample t test with unequal variance was used to calculate the sample size needed to achieve 90% power, under the assumption that the maximum CLS score for observation patients was 50 to 60, with a standard deviation ranging from 30 to 40, and the maximum CLS score for inpatients was 90 to 100, with a standard deviation between 40 and 50. A sample size of 630 patients met the 90% power criterion.
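The power calculation above can be approximated with a normal-approximation sketch. The authors' exact procedure and software inputs are not reported; the midpoint means of 55 and 95 and standard deviations of 35 and 45 used below are assumptions drawn from the stated ranges, chosen only to illustrate the calculation:

```python
# Sketch: per-group sample size for a two-sided, two-sample comparison of means
# with unequal variances, using the normal approximation. The means and SDs are
# assumed midpoints of the ranges reported in the Methods, not the study's inputs.
from math import ceil
from statistics import NormalDist

def n_per_group(mu_obs, mu_inpt, sd_obs, sd_inpt, alpha=0.05, power=0.90):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = z.inv_cdf(power)            # power quantile
    delta = mu_inpt - mu_obs             # assumed difference in maximum CLS
    return ceil((z_alpha + z_beta) ** 2 * (sd_obs ** 2 + sd_inpt ** 2) / delta ** 2)

# Assumed midpoints: observation CLS 55 (SD 35), inpatient CLS 95 (SD 45)
n = n_per_group(55, 95, 35, 45)
```

Under these assumptions, the per-group requirement is small, so the enrolled sample of 630 comfortably exceeds the 90% power criterion.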
RESULTS
Table 1 presents descriptive statistics for the full sample. A total of 466 patients (74%) were discharged as inpatients. The overall median CLS was 96, and the median length of stay was 75 hours. The commercial screening criterion classified inpatient status for 330 patients (52%).
Table 1.
Descriptive statistics for the full sample
| Variable | Value |
|---|---|
| Max CLS score: median (IQR) | 96 (83–124) |
| Length of stay (hr): median (IQR) | 75 (38–150) |
| Medicare Advantage payer: No / Yes | 394 (62%) / 236 (38%) |
| MeL: OBS / INPT | 121 (19%) / 509 (81%) |
| Discharge: OBS / INPT | 164 (26%) / 466 (74%) |
| Commercial criteria: OBS / INPT | 300 (48%) / 330 (52%) |
CLS indicates care level score; INPT, discharged as inpatient hospitalization; IQR, interquartile range; MeL, machine learning prediction; OBS, discharged as observation hospitalization.
Figure 1 presents the receiver operating characteristic curve for CLS predicting inpatient discharge status. The area under the curve (AUC) was 0.8949 (95% CI = 0.8692–0.9206). A cutoff of CLS ≥76 yielded the highest correct classification rate, 86%; a cutoff of CLS ≥90 minimized the distance to the “perfect” point at the upper left of the plot; and a cutoff of CLS ≥91 minimized the difference between sensitivity and specificity. The AUC for CLS predicting the MeL recommendation for inpatients was 0.9954 (95% CI = 0.9909–0.9998); for MeL, the cutoff of CLS ≥76 had the highest correct classification rate, 99%. The AUC for the relationship between CLS and the commercial screening tool’s inpatient recommendation was 0.8419 (95% CI = 0.8121–0.8717), with a cutoff of CLS ≥90 yielding the highest correct classification rate, 76%. The 95% CI for the MeL AUC did not overlap with the CIs for CLS predicting inpatient discharge or for the commercial screening tool. In contrast, the lower 95% CI bound for CLS predicting inpatient discharge (0.8692) overlapped the upper bound for the commercial screening tool comparison (0.8717).
Figure 1.
Receiver operating characteristic curves. Discharge indicates care level score (CLS) prediction of patient discharge status. The area under the curve (AUC) was 0.8949, with a 95% confidence interval (CI) of 0.8692–0.9206. MeL indicates machine learning likelihood (MeL), or the CLS and artificial intelligence recommendation, for inpatient care. The AUC was 0.9954, with a 95% CI of 0.9909–0.9998. Commercial indicates the relationship between CLS and commercial screening tools for recommending inpatient care, which had an AUC of 0.8419, with a 95% CI of 0.8121–0.8717.
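The three cutoff-selection criteria referenced in Figure 1 (highest correct classification, closest approach to the upper-left "perfect" point, and minimal sensitivity-specificity difference) can be illustrated with a minimal sketch on hypothetical scores. This is not the study's code; the toy data below are invented for illustration:

```python
def cutoff_metrics(scores, labels):
    """Sensitivity and specificity at each candidate cutoff (score >= t => inpatient)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)
        spec = sum(s < t for s in neg) / len(neg)
        yield t, sens, spec

def pick_cutoffs(scores, labels):
    """Cutoffs chosen by the three criteria described for Figure 1."""
    n, n_pos = len(labels), sum(labels)
    rows = list(cutoff_metrics(scores, labels))
    best_acc = max(rows, key=lambda r: (r[1] * n_pos + r[2] * (n - n_pos)) / n)[0]
    best_dist = min(rows, key=lambda r: (1 - r[1]) ** 2 + (1 - r[2]) ** 2)[0]
    best_bal = min(rows, key=lambda r: abs(r[1] - r[2]))[0]
    return best_acc, best_dist, best_bal

# Hypothetical CLS-like scores: label 1 = discharged inpatient, 0 = observation
scores = [10, 20, 30, 40, 50, 60]
labels = [0, 0, 0, 1, 1, 1]
best = pick_cutoffs(scores, labels)
```

On perfectly separated toy data all three criteria select the same cutoff; on real data, as in Figure 1, they can diverge (76, 90, and 91 here).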
The correlation between CLS and length of stay was only weakly positive (Pearson’s r = 0.3636). For observation discharge status, a kappa coefficient of 0.1311 (95% CI = 0.0211–0.2411) indicated poor agreement between the MeL and the commercial screening tool. Similarly, for inpatient discharge status, a kappa coefficient of 0.1257 (95% CI = 0.0594–0.1920) indicated poor agreement between the MeL and the commercial screening tool.
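Cohen's kappa, used above to quantify MeL versus commercial-tool agreement, is computed from a 2x2 agreement table as (observed agreement minus chance agreement) over (1 minus chance agreement). The counts below are hypothetical, chosen only to illustrate the formula; the study's underlying agreement tables are not reported:

```python
def cohens_kappa(table):
    """Cohen's kappa for a 2x2 agreement table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_obs = (a + d) / n                                        # observed agreement
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical agreement counts between two classification methods
kappa = cohens_kappa([[30, 10], [10, 50]])
```

Kappa values near 0 (such as the study's 0.13) indicate agreement barely better than chance, while 1.0 indicates perfect agreement.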
Table 2 shows the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy of the different CLS cutoff points for patients based on discharge status. Consistent with Figure 1, a cutoff CLS of ≥76 had the highest accuracy at 86%, with a PPV of 87%. In contrast, the highest PPV of 98% associated with a CLS ≥110 was coupled with a reduced accuracy of 63%.
Table 2.
Analysis of different cutoff points using discharge as inpatients as the gold standard
| CLS cutoff | Sensitivity | Specificity | PPV | NPV | Accuracy |
|---|---|---|---|---|---|
| ≥76 | 95% | 59% | 87% | 82% | 86% |
| ≥90 | 83% | 79% | 92% | 62% | 82% |
| ≥110 | 52% | 98% | 98% | 41% | 63% |
Accuracy indicates the percentage of correct classifications; CLS, care level score; NPV, negative predictive value (the probability that a patient below the cutoff was truly discharged under observation); PPV, positive predictive value (the probability that a patient at or above the cutoff was truly discharged as an inpatient).
Table 3 presents contingency tables for the CLS ≥76 and CLS ≥110 cutoff points. At the CLS ≥76 cutoff, 445 of the 466 inpatients were correctly classified for inpatient reimbursement; however, 67 of the 164 patients discharged under observation were erroneously included, for a total of 512 patients submitted to the payer for inpatient reimbursement. At the CLS ≥110 cutoff, only 240 of the 466 patients discharged as inpatients were correctly identified. After adding the 4 observation patients misclassified as inpatients, the total inpatient reimbursement burden for the payer was only 244 patients. By submitting only 244 patients for inpatient billing, the hospital would submit 386 patients for observation reimbursement; however, only 160 of these were correctly classified as observation patients.
Table 3.
Comparison of care level score versus discharge status results for observation and inpatient care
| Cutoff | CLS classification | Actual discharge: observation | Actual discharge: inpatient |
|---|---|---|---|
| ≥76 | Observation | 97 (59%) | 21 (5%) |
| ≥76 | Inpatient | 67 (41%) | 445 (95%) |
| ≥110 | Observation | 160 (98%) | 226 (49%) |
| ≥110 | Inpatient | 4 (2%) | 240 (51%) |
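As a consistency check, the Table 2 metrics for the CLS ≥76 cutoff can be recomputed directly from the Table 3 counts (a sketch, not the study's code):

```python
# Table 3 counts for the CLS >= 76 cutoff
tp, fn = 445, 21   # inpatients classified as inpatient / as observation
fp, tn = 67, 97    # observation patients classified as inpatient / as observation

sensitivity = tp / (tp + fn)                 # 445/466
specificity = tn / (tn + fp)                 # 97/164
ppv = tp / (tp + fp)                         # 445/512
npv = tn / (tn + fn)                         # 97/118
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 542/630
```

Rounded to whole percentages, these reproduce the Table 2 row for ≥76: sensitivity 95%, specificity 59%, PPV 87%, NPV 82%, and accuracy 86%.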
DISCUSSION
The study design produced independent and statistically validated findings supporting the predictive value of real-time AI for determining appropriate discharge status as inpatient versus observation. Because real-time AI yielded higher AUC values than the commercial screening tool, it is a reasonable tool for determining whether patients should be hospitalized as inpatients or as observation patients.
The outcome of utilization management is influenced by the operational rigor required for the application of either real-time AI or commercial screening tools, as well as the complexities of the patient’s illness throughout hospitalization. For example, if an initial observation patient with chronic obstructive pulmonary disease unexpectedly developed hypoxic respiratory failure requiring mechanical ventilation during hospitalization, the initial real-time AI prediction for observation hospitalization would differ from that at discharge as an inpatient. The CMS recognizes unexpected significant events that should align with inpatient hospitalization.9
An Xsolis research report indicated that up to 40% of hospitalizations could be automatically classified as inpatient versus observation with 99% accuracy, thereby reducing both payer and hospital time, as well as the expense of utilization review administrative work.8 Based on information from Humana indicating a nearly equal mix of fax and EMR access among its providers, Xsolis also compared time savings between the two methods. Xsolis reduced the average time to reach a determination by 28.4% compared with the fax/EMR access average of 47.5 minutes. Thus, the approximately 15 minutes saved per case by the AI platform corresponds to 1 hour of utilization review nurse time saved for every four AI cases reviewed.
Our utilization review nurses’ CLS/physician advisor decision matrix uses a cutoff point of CLS ≥ 75 for inpatient consideration, which compares favorably with this study’s cutoff point of CLS ≥76, yielding the highest correct classification rate of 86%. Considering the nurses’ utilization decision matrix, which includes eight decision pathways using the CLS, only two pathways result in physician advisor referrals. Since implementing the AI utilization tool, there has been a reduction in physician advisor referrals from an average of 522 patients per month to an average of 107 patients per month.
This study was limited to a single tertiary medical center. A significant difference in the incidence of hospital inpatients at discharge, other than the reported 74%, could change these results. Likelihood ratios (LRs) are independent of prevalence and can be calculated from the sensitivity and specificity data in Table 2. The highest positive LR is 26, for a CLS cutoff of ≥110.10 This study supports the validity of real-time automated AI determination for accurately predicting the outcome of appropriate inpatient orders for care aligned with final inpatient discharge. Generalizing these results, we cautiously assert that real-time AI has the potential to improve the accuracy and timeliness of utilization reviews by reducing waste related to manpower expenditure for hospital/payer faxes, emails, and phone calls. Using real-time AI would benefit patients, hospitals, and payers by reducing errors that require disputes and rework, thus lowering costs for both hospitals and payers while increasing appropriate accountability for payment for hospital care.
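The positive likelihood ratio cited above follows directly from the Table 2 sensitivity and specificity values, since LR+ = sensitivity / (1 - specificity). A minimal sketch:

```python
def lr_positive(sensitivity, specificity):
    """Positive likelihood ratio; unlike PPV, independent of prevalence."""
    return sensitivity / (1 - specificity)

# Sensitivity and specificity for the three CLS cutoffs reported in Table 2
lrs = {
    76: lr_positive(0.95, 0.59),   # ~2.3
    90: lr_positive(0.83, 0.79),   # ~4.0
    110: lr_positive(0.52, 0.98),  # ~26
}
```

The CLS ≥110 cutoff trades sensitivity for a very high LR+, which is why it maximizes PPV (98%) despite its lower overall accuracy (63%).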
Disclosure statement/Funding
The authors report no funding or conflicts of interest.
References
- 1. Ross MA, Granovsky M. History, principles, and policies of observation medicine. Emerg Med Clin North Am. 2017;35(3):503–518. doi:10.1016/j.emc.2017.03.001.
- 2. DHHS-CMS. CMS manual system Pub 100-2 Medicare benefit policy transmittal 42. https://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/Downloads/R42BP.pdf.
- 3. CMS. Frequently asked questions: 2-midnight inpatient admission guidance & patient status reviews for admissions on or after October 1, 2013. 2013. http://www.cms.gov/Research-Statistics-Data-and-Systems/Monitoring-Programs/Medical-Review/Downloads/QAsforWebsitePosting_110413-v2-CLEAN.pdf. Accessed January 1, 2015.
- 4. Locke C, Sheehy AM, Deutschendorf A, Mackowiak S, Flansbaum BE, Petty B. Changes to inpatient versus outpatient hospitalization: Medicare’s 2-midnight rule. J Hosp Med. 2015;10(3):194–201. doi:10.1002/jhm.2312.
- 5. MCG Health. Industry-leading evidence-based care guidelines (Ed28). 2024. https://www.mcg.com/care-guidelines/care-guidelines/.
- 6. Wang H, Robinson RD, Coppola M, et al. The accuracy of InterQual criteria in determining the need for observation versus hospitalization in emergency department patients with chronic heart failure. Crit Pathw Cardiol. 2013;12(4):192–196. doi:10.1097/HPC.0b013e3182a313e1.
- 7. Strumwasser I, Paranjpe NV, Ronis DL, Share D, Sell LJ. Reliability and validity of utilization review criteria: appropriateness evaluation protocol, standardized medreview instrument, and intensity-severity-discharge criteria. Med Care. 1990;28(2):95–111. doi:10.1097/00005650-199002000-00001.
- 8. Xsolis.com. Increasing efficiency with health plan case reviews in utilization management. 2013. https://info.xsolis.com/l/1023201/2023-12-03/2lzl7/1023201/1701656631kLePrA2M/xs_wp_hp_humanatimestudy.pdf. Accessed February 12, 2024.
- 9. CMS. Medicare program integrity manual: 6.5.2 - Conducting patient status reviews of claims for Medicare Part A payment for inpatient hospital admissions. 2020. https://www.cms.gov/Regulations-and-Guidance/Guidance/Manuals/downloads/pim83c06.pdf.
- 10. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329(7458):168–169. doi:10.1136/bmj.329.7458.168.

