Abstract
Uncertainty quantification in hierarchical healthcare data presents a fundamental methodological challenge, and existing approaches face sharp trade-offs. Conformal prediction offers coverage guarantees but struggles to adapt to instance-level difficulty within clusters, while Bayesian methods quantify uncertainty but lose reliability when models are misspecified. We propose a hybrid framework for hierarchical random forests that addresses the trade-off between coverage and precision by combining group-aware conformal calibration with Bayesian posterior uncertainties, yielding adaptive prediction intervals with near-nominal empirical coverage. We assess length-of-stay predictions on 61538 patients from 3793 hospitals using 5-fold cross-validation. The hybrid approach achieved near-target coverage (94.3 % ± 0.5 pp vs. 95 % nominal), comparable to standard conformal prediction (95.0 % ± 0.2 pp), while reallocating interval width across uncertainty strata (21 % narrower for low-uncertainty cases, 6 % wider for high-uncertainty cases) where standard conformal prediction produces uniform-width intervals. Post-hoc Bayesian calibration alone severely under-covered (14.1 %) under our Gaussian specification, highlighting the necessity of conformal adjustment. Performance remained stable across hospitals and folds, with grouped cross-validation (hospital holdout) confirming generalization to unseen institutions. The pooled calibration approach achieves near-nominal empirical coverage at scale, supporting evidence-based resource allocation and risk-stratified patient management without site-specific recalibration.
Keywords: Uncertainty quantification, Conformal prediction, Bayesian modeling, Hierarchical random forests, Healthcare analytics, Length of stay
Subject terms: Health care, Mathematics and computing
Introduction
Hospital length of stay (LOS) prediction is critical for resource management, discharge planning, and care coordination, but current machine learning methods lack the uncertainty estimates needed for clinical use1–3. Reliable uncertainty bounds enable better bed allocation, staffing, and patient flow under growing resource pressures4,5. Yet healthcare data are hierarchical (patients nested within hospitals and systems), creating statistical dependencies that violate independence assumptions in standard methods6–9. These dependencies, driven by institutional practices, case mix, and resource availability, create systematic patterns that existing methods fail to capture10–12. Addressing this requires uncertainty quantification methods designed for hierarchical data.
Current machine learning approaches for healthcare prediction suffer from a critical gap between predictive performance and clinical deployment. Hierarchical random forests (HRFs) directly address this gap by modeling both patient-level characteristics and institutional effects, while also improving robustness, generalizability, and dynamic adaptability13–15. However, despite their predictive strengths, HRFs generally rely on asymptotic assumptions and heuristic variance estimates that fail to provide the rigorous coverage guarantees essential for safety-critical applications16,17. This limitation is particularly severe in healthcare settings where unreliable uncertainty bounds pose a direct patient safety risk, potentially leading to premature discharge planning for high-risk individuals or inefficient allocation of scarce ICU beds. Such failures can compromise patient care, lead to suboptimal resource allocation, and undermine clinical trust in machine learning systems18–20.
Researchers have increasingly explored conformal prediction for uncertainty quantification, as it provides rigorous coverage guarantees without restrictive distributional assumptions, making it attractive for risk-sensitive systems21–23. Yet conformal methods rely on data exchangeability, which hierarchical healthcare structures systematically violate24,25. Bayesian models, by contrast, are inherently suited to capture complex dependencies through hierarchical structures26,27, but often assume independence for tractability and offer only asymptotic coverage guarantees that may fail in practice28. This presents a methodological challenge. Bayesian methods sacrifice reliable coverage guarantees, while conformal methods sacrifice adaptive efficiency, even though clinical deployment demands both rigorous validity and adaptive uncertainty reflecting prediction difficulty29,30.
In hospital settings, LOS predictions must balance reliability for patient safety with precision for resource planning31,32. Administrators need uncertainty estimates that support risk-stratified decisions, conservative allocation under high uncertainty, and efficiency gains when predictions are confident33,34. Scalable deployment further requires methods that generalize across diverse hospitals without site-specific recalibration19,20.
In this study, we use the canonical split-hierarchical conformal predictor, an unweighted absolute-residual method that calibrates a single global quantile on calibration residuals, producing uniform-width intervals for all test points. We do not employ adaptive variants (e.g., variance-scaled, weighted, or quantile-conformal) as our primary method because they require reliable per-point surrogates, such as conditional variances or quantiles, which our current HRF implementation does not provide, and their guarantees often assume i.i.d. or additional structure, making adaptation to clustered, partially exchangeable data complicated and beyond our present scope. We include locally adaptive conformal methods as baselines for comparison.
Motivated by these challenges, we propose integrating conformal prediction and Bayesian uncertainty estimation to improve uncertainty quantification in hierarchical random forests. Specifically, we aim to design and validate a framework that (i) achieves distribution-free marginal coverage for hierarchical data via group-aware conformal calibration, and (ii) restores instance-level adaptivity through Bayesian posterior predictive standard deviations.
This presents a methodological challenge we address through integration rather than choosing one approach over the other. We evaluate this methodology across diverse hospital characteristics and patient populations, demonstrating robust performance suitable for real-world deployment. The framework enables uncertainty-guided clinical protocols while preserving the computational efficiency required in operational healthcare systems.
This work makes three methodological contributions demonstrated through LOS prediction.
Research Contributions
We develop a hybrid framework that integrates group-aware conformal calibration with Bayesian posterior uncertainties, producing adaptive prediction intervals that achieve near-nominal coverage under hospital clustering and adjust to instance-level uncertainty. This approach addresses the trade-off between coverage reliability and adaptive precision by unifying conformal and Bayesian uncertainty estimation through a single scaling mechanism.
We demonstrate a standardized calibration strategy that generalizes across diverse hospital characteristics without site-specific retraining, with systematic evaluation of coverage stability and recommended frameworks for monitoring and recalibration.
We provide an interpretation of how uncertainty-guided prediction intervals can be operationalized in hospital settings, analyzing how prediction confidence maps to resource allocation decisions and outlining implementation considerations for translating statistical performance into clinical workflows.
Related work
Length of stay prediction in healthcare Hospital LOS prediction has been extensively studied using approaches ranging from clinical scoring systems to modern machine learning methods including ensembles, gradient boosting, and deep learning2,35,36. However, most research focuses on prediction accuracy rather than uncertainty quantification needed for clinical decision support. Uncertainty in LOS forecasts has long been recognized as a barrier to effective hospital scheduling and capacity planning37, with this challenge becoming more critical as healthcare systems face increasing demand and resource constraints4,5.
Uncertainty quantification (UQ) for clinical prediction Standard UQ methods in healthcare, such as bootstrap intervals, ensemble variance, and Bayesian posteriors, offer useful summaries but typically provide only model-based or large-sample guarantees, not coverage under minimal assumptions18,19. Clinical deployment, however, requires uncertainty that is both valid (coverage close to nominal under distribution-free assumptions) and adaptive (intervals that widen when predictions are harder), a combination that conventional methods rarely achieve1,20. When applied under global i.i.d./exchangeability assumptions or strict parametric specifications, existing approaches often fail to capture hierarchical dependencies in healthcare data, leading to miscalibrated intervals and systematic residual patterns at the hospital level10–12. These challenges highlight the need for UQ methods that ensure both coverage validity and adaptivity while respecting hierarchical structure.
Hierarchical Random Forests and Extensions The development of hierarchical extensions to random forests has been an active area of research aimed at explicitly handling multi-level data structures while preserving the computational efficiency and predictive power of tree-based ensemble methods38. Hajjem et al. pioneered mixed-effects random forests for clustered data, introducing innovative approaches for group-level modeling that account for both fixed and random effects within the tree-growing process14. Subsequent developments have improved scalability and extended these methods to more complex hierarchical structures. Sela and Simonoff added random-effects structures for continuous outcomes39, recent work has enhanced computational efficiency13, and Pellagatti et al. developed approaches for multiple levels of nesting40. Despite these important methodological advances in hierarchical modeling capabilities, a critical limitation persists across all existing hierarchical random forest approaches. They typically lack rigorous and well-calibrated uncertainty quantification methods, a significant gap that has been explicitly identified by Mentch and Hooker17,41 and represents the primary focus of our research contribution.
Conformal Prediction for Structured and Hierarchical Data Conformal prediction methods have gained widespread adoption due to their minimal distributional assumptions and mathematically rigorous coverage guarantees21–23,42. However, extending these techniques to hierarchical and structured data introduces fundamental challenges, most notably the systematic violation of exchangeability assumptions that underlie conformal prediction theory24,25. Recent methodological advances have begun addressing these challenges through specialized approaches for clustered data. Principato et al. developed techniques specifically adapted to hierarchical data structures, integrating group-level adjustments within the conformal framework to maintain coverage guarantees while accommodating complex dependency patterns43. Dunn et al. explored conformal methods for two-layer hierarchical models, providing theoretical foundations for distribution-free prediction sets under structured dependencies44. Barber et al. extended conformal prediction beyond exchangeability, developing robust approaches to maintain validity under distribution shifts and clustering24. Despite these advances in hierarchical conformal methodology, prior work has not integrated these techniques with hierarchical random forests for continuous clinical outcomes, leaving a critical gap for healthcare applications where both institutional clustering and adaptive uncertainty quantification are essential.
Bayesian Approaches to Hierarchical Modeling and Uncertainty Quantification Bayesian methods provide principled uncertainty quantification through posterior distribution estimation, with Bayesian hierarchical models specifically designed to capture multi-level uncertainty through partial pooling across nested structures26,45. These methods are widely used in healthcare and social sciences due to their interpretability and ability to handle complex dependencies27. In tree-based settings, Bayesian Additive Regression Trees (BART) and extensions offer uncertainty quantification through model averaging46,47, with recent developments improving uncertainty calibration and extending to heteroscedastic settings48,49. However, while Bayesian hierarchical methods excel at capturing complex dependency structures and providing interpretable uncertainty estimates, they typically provide only asymptotic coverage guarantees50. This limitation is particularly concerning for safety-critical healthcare applications23. The integration of hierarchical Bayesian uncertainty estimation with ensemble methods like random forests for clinical prediction represents an underexplored area with significant potential for enhanced uncertainty quantification in healthcare applications.
Hybrid Bayesian-Conformal Methods and Integration Approaches Recent developments in uncertainty quantification have explored hybrid approaches that leverage both the coverage guarantees of conformal prediction and the adaptive strengths of Bayesian inference51–53. Fong and Holmes established theoretical foundations for Bayesian Conformal Computation, demonstrating how to maintain both Bayesian interpretability and conformal validity51. Subsequent work has extended these concepts across domains. Stanton et al. developed methods that preserve both theoretical properties in optimization settings52, while applications in medical imaging and physics modeling have shown practical advantages for complex prediction tasks54,55. However, these hybrid approaches have not been adapted to hierarchical data structures common in healthcare settings, where both institutional clustering and coverage guarantees are essential for clinical deployment. Our work addresses this gap by developing hybrid Bayesian-conformal methods specifically designed for hierarchical random forests in clinical prediction tasks. Our approach differs from these prior hybrid methods by using post-hoc Bayesian hierarchical calibration solely for uncertainty weighting while relying on conformal calibration for coverage validity, rather than attempting full Bayesian inference.
Uncertainty Requirements for Clinical AI Deployment Clinical deployment requires governance beyond research benchmarking. Food and Drug Administration (FDA) communications on AI/ML medical devices emphasize transparent performance monitoring and risk management, while the European Union (EU) AI Act mandates accuracy and robustness measures for high-risk healthcare systems56,57. Operationally, uncertainty estimates must enable clinical decision-making by distinguishing routine cases from those requiring enhanced monitoring or specialist consultation; existing approaches, however, lack distribution-free coverage guarantees and remain sensitive to institutional differences50,58. Our hierarchical conformal calibration framework addresses these deployment requirements by providing near-nominal empirical coverage (pooled calibration for operational simplicity) or finite-sample guarantees (subsampling variants for theoretical validity) combined with Bayesian uncertainty signals for adaptive precision, supported by systematic monitoring protocols including coverage tracking, drift detection, and recalibration triggers.
Paper Organization The remainder of this paper is organized as follows: Section 3 presents our methodological approaches in detail, including theoretical foundations, algorithmic implementations, and computational considerations; it also describes our comprehensive experimental design and evaluation framework using real-world healthcare data. Section 4 reports detailed empirical results with statistical analysis and performance comparisons across different hierarchical structures. Section 5 discusses broader implications, methodological limitations, and promising directions for future research. Section 6 concludes the paper with a summary of key contributions and their potential impact on the field.
Methodology
We develop a hybrid uncertainty quantification framework that combines conformal prediction’s coverage validity with Bayesian uncertainty estimation’s adaptive precision for hierarchical healthcare data. The key insight is that these approaches offer complementary strengths. Conformal methods provide empirical coverage validity while Bayesian posterior uncertainties enable instance-specific adaptation.
Our framework operates in three stages. First, a hierarchical random forest captures multi-level dependencies through sequential residual decomposition (Sect. 3.2). Second, we obtain uncertainty estimates through both group-aware conformal calibration that addresses hierarchical data structure (Sect. 3.3) and Bayesian posterior distributions that reflect instance-specific uncertainty (Sect. 3.4). Third, we integrate these approaches to preserve conformal coverage while achieving adaptive interval widths (Sect. 3.5).
Problem formulation and theoretical framework
Consider a hierarchical dataset $\mathcal{D} = \{(x_i, y_i, h_i, r_i)\}_{i=1}^{n}$, where i indexes each patient:

$x_i \in \mathbb{R}^d$: d-dimensional patient feature vector (demographics, clinical indicators)
$y_i \in \mathbb{R}_{>0}$: continuous outcome (hospital length-of-stay in days)
$h_i \in \{1, \dots, H\}$: hospital identifier
$r_i \in \{1, \dots, R\}$: regional identifier

Throughout this paper, we use lowercase letters $(x_i, y_i, h_i, r_i)$ to denote observed values for specific patients, and uppercase letters (X, Y, H, R) to represent the corresponding random variables in probability statements and expectations.

The hierarchical structure imposes a nesting constraint through the mapping $\rho: \{1, \dots, H\} \to \{1, \dots, R\}$, where $r_i = \rho(h_i)$ ensures each hospital belongs to exactly one region. This creates natural groupings:

$$\mathcal{I}_h = \{\, i : h_i = h \,\} \quad \text{(patients treated in hospital } h\text{)} \tag{1}$$

$$\mathcal{H}_r = \{\, h : \rho(h) = r \,\} \quad \text{(hospitals located in region } r\text{)} \tag{2}$$
The fundamental challenge arises from within-group dependencies, where outcomes for units belonging to the same group tend to be correlated. Such correlations violate standard independence assumptions and compromise traditional uncertainty quantification methods, necessitating specialized approaches that respect the hierarchical nature of the data.
To address these challenges, we study marginal coverage for hierarchical data under group-conditional exchangeability and focus on empirical coverage, with calibration schemes introduced in Sect. 3.3.
For a hierarchical prediction function $\hat{f}: \mathbb{R}^d \times \{1,\dots,H\} \times \{1,\dots,R\} \to \mathbb{R}$, we seek prediction intervals $\hat{C}_\alpha(x, h, r) \subseteq \mathbb{R}$ that satisfy three key objectives.

The first objective is coverage validity: for a specified miscoverage level $\alpha \in (0, 1)$, the intervals should achieve

$$\mathbb{P}\!\left(Y \in \hat{C}_\alpha(X, H, R)\right) \geq 1 - \alpha \tag{3}$$

ensuring marginal coverage close to the nominal level.
The second objective is interval efficiency, which requires minimizing interval width while maintaining coverage:

$$\min \; \mathbb{E}\!\left[\,\mathrm{len}\!\left(\hat{C}_\alpha(X, H, R)\right)\right] \tag{4}$$

$$\text{subject to} \quad \mathbb{P}\!\left(Y \in \hat{C}_\alpha(X, H, R)\right) \geq 1 - \alpha \tag{5}$$
The third objective is adaptive precision: intervals should widen when predictions are more uncertain. This can be quantified by the adaptation ratio,
$$\mathrm{AR} = \frac{\mathbb{E}\!\left[\mathrm{len}(\hat{C}_\alpha) \mid Q5\right]}{\mathbb{E}\!\left[\mathrm{len}(\hat{C}_\alpha) \mid Q1\right]} \tag{6}$$
where Q5 and Q1 denote the highest and lowest uncertainty quintiles, respectively. Values exceeding 1.0 indicate that high-uncertainty cases receive wider intervals than low-uncertainty cases.
The hierarchical structure imposes three critical requirements for valid uncertainty quantification. First, hierarchical exchangeability must hold (within each group, patient-level observations should be treated as exchangeable). Second, methods must achieve empirical coverage validity, meaning that across diverse data splits the proportion $\frac{1}{n_{\text{test}}} \sum_{i} \mathbb{1}\{y_i \in \hat{C}_\alpha(x_i, h_i, r_i)\}$ should approximate $1 - \alpha$. Finally, uncertainty estimates $\hat{\sigma}_i$ must be well-calibrated for interpretable risk stratification, so that $\mathbb{E}\!\left[\,|y_i - \hat{y}_i|\,\right] \approx c \cdot \hat{\sigma}_i$, with c ideally close to $\sqrt{2/\pi} \approx 0.798$ under Gaussian errors.
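For intuition, the constant 0.798 is the mean absolute value of a standard normal variable; a one-line derivation (a standard fact, stated here for completeness):

```latex
% For Gaussian errors e ~ N(0, sigma^2):
\mathbb{E}\,\lvert e \rvert
  = \sigma\,\mathbb{E}\,\lvert Z \rvert
  = \sigma\sqrt{\tfrac{2}{\pi}}
  \approx 0.798\,\sigma,
\qquad Z \sim \mathcal{N}(0,1).
```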
To address these requirements, we develop a solution that captures hierarchical structure through multi-level random forests explicitly modeling patient, hospital, and regional effects, achieves near-nominal coverage via conformal prediction with group-aware calibration protocols, and attains adaptive efficiency through Bayesian posterior uncertainty that weights conformal scores.
This strategy enables us to satisfy all three theoretical requirements while maintaining computational tractability for large-scale healthcare applications. The following sections detail each component and their integration into the final hybrid framework. Throughout our analysis, reported interval width refers to full width $\mathrm{len}(\hat{C}) = \text{upper} - \text{lower}$. In algorithms we denote by w the half-width used to form $[\hat{y} - w, \hat{y} + w]$.
Hierarchical random forest foundation
We employ HRF as our base prediction model following13. The key insight is to decompose the prediction task into three sequential levels that capture distinct sources of variation: patient-level effects, hospital-specific adjustments, and regional patterns.
The training proceeds through three sequential stages that progressively capture hierarchical dependencies. Initially, we train a random forest $f_1$ using patient features $x_i$ to predict outcomes $y_i$, capturing baseline patient-specific effects without institutional context. Subsequently, we calculate hospital residuals $e_i^{(1)}$ and train a second random forest $f_2$ on augmented features $(x_i, h_i)$ to predict these residuals, explicitly modeling how patient characteristics interact with hospital-specific factors. Finally, we calculate regional residuals $e_i^{(2)}$ and train a final random forest $f_3$ on fully augmented features $(x_i, h_i, r_i)$ to capture broader regional patterns. This hierarchical decomposition yields the final prediction:

$$\hat{y}_i = \hat{f}_1(x_i) + \hat{f}_2(x_i, h_i) + \hat{f}_3(x_i, h_i, r_i) \tag{7}$$
To prevent information leakage across hierarchical stages, residuals at each level are computed using out-of-bag (OOB) predictions rather than in-sample fitted values. Specifically, hospital-level residuals are defined as $e_i^{(1)} = y_i - \hat{f}_1^{\text{OOB}}(x_i)$, using predictions from trees whose bootstrap samples excluded observation i. Regional residuals $e_i^{(2)} = e_i^{(1)} - \hat{f}_2^{\text{OOB}}(x_i, h_i)$ similarly use OOB predictions from the hospital-level model. This ensures that higher-level random forests learn from held-out residual structure, preventing artificial inflation of hierarchical effects due to overfitting at earlier stages.
We note that hospital identifiers ($h_i$) are included as features in the second and third stages. Under patient-level cross-validation, hospitals may appear in both training and test sets, allowing the model to learn hospital-specific effects. This design choice maximizes predictive performance for hospitals with historical data. To assess generalization to entirely new hospitals, we additionally report grouped cross-validation results (Sect. 4.8) where hospital identifiers in the test set are never seen during training.
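The following minimal sketch illustrates the three-stage training procedure with OOB residuals, using the tree counts and depths reported in Sect. 3.6; the function and variable names are ours, and data loading is omitted.

```python
# Minimal sketch of the three-stage hierarchical random forest (Sect. 3.2),
# with OOB residuals between stages. Not the authors' exact implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_hrf(X, h, r, y, seed=42):
    """X: patient features (n, d); h, r: hospital/region ids (n,); y: LOS (n,)."""
    # Stage 1: patient-level forest on raw features.
    f1 = RandomForestRegressor(n_estimators=100, max_depth=15,
                               oob_score=True, random_state=seed).fit(X, y)
    e1 = y - f1.oob_prediction_          # OOB hospital-level residuals, Eq. (7)
    # (A production version should handle the rare NaN OOB predictions.)

    # Stage 2: hospital-level forest on features augmented with hospital id.
    Xh = np.column_stack([X, h])
    f2 = RandomForestRegressor(n_estimators=75, max_depth=12,
                               oob_score=True, random_state=seed).fit(Xh, e1)
    e2 = e1 - f2.oob_prediction_         # OOB regional residuals

    # Stage 3: regional forest on fully augmented features.
    Xhr = np.column_stack([X, h, r])
    f3 = RandomForestRegressor(n_estimators=50, max_depth=10,
                               random_state=seed).fit(Xhr, e2)
    return f1, f2, f3

def predict_hrf(models, X, h, r):
    f1, f2, f3 = models
    Xh, Xhr = np.column_stack([X, h]), np.column_stack([X, h, r])
    # Final prediction sums the three levels (Eq. 7).
    return f1.predict(X) + f2.predict(Xh) + f3.predict(Xhr)
```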
Hierarchical conformal prediction framework
We employ standard split-conformal prediction with uniform intervals rather than adaptive variants for three strategic reasons. Clinical deployment requires real-time predictions across thousands of patients, making the computational costs of adaptive methods such as kernel-weighted approaches prohibitive. Additionally, adaptive conformal methods require conditional uncertainty estimates that hierarchical random forests cannot provide without major architectural changes. Most importantly, we achieve instance-specific intervals through Bayesian posterior uncertainties rather than complex conformal adaptations, creating a modular framework where conformal calibration targets coverage validity.
Conformal prediction provides distribution-free, marginal coverage when calibration and test points are exchangeable. Given a miscoverage level $\alpha$, it constructs intervals that contain the true outcome with probability at least $1 - \alpha$, regardless of the underlying distribution. This property makes it particularly attractive for safety-critical healthcare applications.

The standard split-conformal approach proceeds as follows: (1) split data into training, calibration, and test sets, (2) train a model $\hat{f}$ on training data, (3) compute nonconformity scores on calibration data: $s_i = |y_i - \hat{f}(x_i)|$, (4) calculate the $\lceil (m+1)(1-\alpha) \rceil / m$ quantile: $\hat{q} = \mathrm{Quantile}\!\left(\{s_i\}_{i=1}^{m};\; \lceil (m+1)(1-\alpha) \rceil / m\right)$, and (5) for test data, construct prediction intervals: $\hat{C}(x) = [\hat{f}(x) - \hat{q},\; \hat{f}(x) + \hat{q}]$.
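A minimal sketch of these five steps, assuming a fitted scikit-learn-style regressor `f`; the helper names are ours:

```python
# Standard split-conformal calibration (steps 3-5 above).
import numpy as np

def split_conformal_quantile(f, X_cal, y_cal, alpha=0.05):
    scores = np.abs(y_cal - f.predict(X_cal))       # nonconformity scores
    m = len(scores)
    # Rank-adjusted level ceil((m+1)(1-alpha))/m guards finite-sample validity.
    level = min(np.ceil((m + 1) * (1 - alpha)) / m, 1.0)
    return np.quantile(scores, level, method="higher")

def conformal_interval(f, X_test, q_hat):
    y_hat = f.predict(X_test)
    return y_hat - q_hat, y_hat + q_hat             # uniform-width intervals
```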
Conformal prediction assumes exchangeability, in which all permutations of the data are equally likely. However, hierarchical healthcare data violate this assumption due to within-cluster dependencies:

$$\mathrm{Cov}(Y_i, Y_j \mid H_i = H_j = h) \neq 0 \quad \text{for } i \neq j \tag{8}$$
Patients within the same hospital exhibit correlated outcomes, leading to calibration sets dominated by large hospitals and biased quantile estimates that may not generalize across institutions.
To address exchangeability violations in hierarchical data, we adapt three established approaches from44. These methods represent different points on the spectrum between theoretical validity and practical efficiency:
Method 1: Cumulative Distribution Function (CDF) Pooling. Uses all calibration data despite dependencies. Compute unweighted residuals on the calibration set, $s_i = |y_i - \hat{f}(x_i)|$. Let $s_{(1)} \leq \dots \leq s_{(m)}$ denote their order statistics. Compute the conformal threshold via the rank-adjusted level

$$\hat{q} = s_{(k)}, \qquad k = \lceil (m+1)(1-\alpha) \rceil.$$

Ties in $\{s_i\}$ are broken by choosing the smallest $s_{(k)}$ that attains the required rank. Construct test intervals as $[\hat{f}(x) - \hat{q},\; \hat{f}(x) + \hat{q}]$. This design is deterministic and uses all calibration residuals, but it does not carry finite-sample coverage guarantees under within-hospital dependence.
Method 2: Single Subsampling. Restores exchangeability by sampling one observation per hospital. For each hospital h in the calibration set, we randomly select one patient and compute the quantile from this reduced set. While theoretically sound (it ensures hospital-level exchangeability), this approach sacrifices substantial sample size, from m patients to H hospitals, resulting in wider intervals.
Method 3: Repeated Subsampling. Aggregates $B$ single-subsampling iterations, taking the median of the resulting quantiles. This hybrid approach partially recovers the efficiency lost in single subsampling while maintaining better theoretical properties than CDF pooling.
These three designs trade off sample-size utilization and formal validity: pooling uses all calibration residuals without enforcing group exchangeability; single subsampling enforces group-conditional exchangeability at the cost of sample size; repeated subsampling aggregates multiple single-subsamples to reduce variance while preserving the same guarantee regime as single subsampling.
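A sketch of all three schemes, operating on calibration residuals `scores` with hospital ids `hosp`; the pooling threshold follows the rank-adjusted level above, and all names and the iteration count `B` are ours:

```python
# Hierarchical conformal calibration schemes (Methods 1-3), illustrative only.
import numpy as np

def cdf_pooling(scores, alpha=0.05):
    s = np.sort(scores)
    m = len(s)
    k = int(np.ceil((m + 1) * (1 - alpha)))            # required rank
    return s[min(k, m) - 1]                            # smallest score at rank k

def single_subsample(scores, hosp, alpha=0.05, rng=None):
    rng = rng or np.random.default_rng(42)
    # One residual per hospital restores hospital-level exchangeability.
    idx = [rng.choice(np.where(hosp == h)[0]) for h in np.unique(hosp)]
    return cdf_pooling(scores[np.array(idx)], alpha)

def repeated_subsample(scores, hosp, alpha=0.05, B=50, rng=None):
    rng = rng or np.random.default_rng(42)
    # Median over B single-subsample quantiles reduces Monte-Carlo variance.
    return np.median([single_subsample(scores, hosp, alpha, rng)
                      for _ in range(B)])
```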
For our primary experiments, we select CDF pooling as our default calibration scheme. We explicitly acknowledge that this method does not carry the same finite-sample coverage guarantees as the subsampling variants because it technically violates the hospital-level exchangeability assumption. However, we argue that in a large and diverse hierarchical dataset such as ours (3793 hospitals), the practical benefits of using the full calibration set outweigh this theoretical limitation. Empirically, this choice yields superior mean coverage, the lowest fold-to-fold variability, and perfect reproducibility, all of which are paramount for deployment and validation in a clinical setting. This transforms a theoretical compromise into a justified, data-driven strength for large-scale systems.
Post-hoc Bayesian hierarchical calibration for uncertainty quantification
We implement a Bayesian hierarchical model that post-processes HRF predictions to provide uncertainty quantification. This is a post-hoc calibration approach in which the HRF is trained conventionally, and a Bayesian model then estimates residual uncertainties, distinct from fully Bayesian methods like BART, where uncertainty propagates through the entire model. For brevity in figures and tables, we use "Bayesian HRF" to refer to this post-hoc Bayesian hierarchical calibration layer.
We augment HRF predictions with a Bayesian hierarchical structure:

$$y_i \mid \theta \sim \mathcal{N}\!\left(\beta_0 + \beta_1\, \hat{y}_i^{\text{HRF}} + u_{h_i} + v_{r_i},\; \sigma^2\right) \tag{9}$$

$$u_h \sim \mathcal{N}(0, \sigma_u^2), \qquad v_r \sim \mathcal{N}(0, \sigma_v^2) \tag{10}$$

where:

$\beta_0, \beta_1$: Calibration parameters for HRF predictions
$u_h$: Hospital-specific random effects
$v_r$: Regional random effects
$\sigma^2$: Residual variance
$\theta$ represents all Bayesian model parameters ($\beta_0$, $\beta_1$, $u$, $v$, $\sigma^2$)
This formulation treats HRF as a sophisticated feature extractor while the Bayesian layer captures residual hierarchical variation not fully explained by the base model.
To quantify uncertainty for a new patient, we obtain the posterior predictive distribution using Markov Chain Monte Carlo (MCMC) sampling. The posterior standard deviation $\hat{\sigma}_i = \mathrm{sd}\!\left(y_i^{\text{new}} \mid \mathcal{D}\right)$ serves as our uncertainty estimate, integrating three sources of variation: parameter uncertainty in the calibration coefficients $(\beta_0, \beta_1)$, random effect variability at hospital and regional levels, and residual patient-level variation. This instance-specific uncertainty estimate subsequently weights conformal scores in Sect. 3.5, enabling adaptive interval widths that respond to prediction difficulty.
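A sketch of this calibration layer in PyMC (the library used in our experiments, Sect. 3.6), matching the sampler settings reported there; the priors shown are illustrative assumptions rather than the paper's exact specification.

```python
# Post-hoc Bayesian calibration layer (Eqs. 9-10), illustrative priors.
import pymc as pm

def fit_bayesian_layer(y, yhat_hrf, hosp_idx, reg_idx, n_hosp, n_reg):
    with pm.Model() as model:
        beta0 = pm.Normal("beta0", 0.0, 10.0)           # calibration intercept
        beta1 = pm.Normal("beta1", 1.0, 1.0)            # calibration slope
        sigma_u = pm.HalfNormal("sigma_u", 1.0)
        sigma_v = pm.HalfNormal("sigma_v", 1.0)
        u = pm.Normal("u", 0.0, sigma_u, shape=n_hosp)  # hospital effects
        v = pm.Normal("v", 0.0, sigma_v, shape=n_reg)   # regional effects
        sigma = pm.HalfNormal("sigma", 5.0)             # residual scale
        mu = beta0 + beta1 * yhat_hrf + u[hosp_idx] + v[reg_idx]
        pm.Normal("y", mu, sigma, observed=y)
        # Sampler settings from Sect. 3.6: 2 chains, 500 warmup, 250 draws.
        idata = pm.sample(draws=250, tune=500, chains=2,
                          target_accept=0.99, random_seed=42)
    return model, idata
```

Posterior predictive draws for a new patient (e.g., via `pm.sample_posterior_predictive`) then yield the per-patient standard deviation $\hat{\sigma}_i$.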
Model checking and likelihood sensitivity Because LOS is strictly positive and often right-skewed, a Gaussian likelihood on the original LOS scale can be misspecified. We therefore performed Bayesian model diagnostics (residual and posterior predictive checks) for the baseline Gaussian specification, and we additionally evaluated two positive-support alternatives (Log-Normal and Gamma) as a likelihood sensitivity analysis. Sensitivity results are reported in Supplementary Table S1, and diagnostic plots for the baseline specification are provided in Supplementary Figure S1.
For the results presented in Sect. 4, post-hoc Bayesian calibration intervals are constructed as

$$\hat{y}_i \pm 1.96\, \hat{\sigma}_i,$$

corresponding to approximate 95% prediction intervals under the Gaussian posterior predictive distribution.
Hybrid HRF (Bayesian conformal) framework
The challenge in hierarchical healthcare prediction lies in balancing coverage reliability with adaptive precision. We propose integrating conformal methods with Bayesian approaches by using Bayesian posterior uncertainties to weight conformity scores within a hierarchical conformal framework, thereby improving efficiency while maintaining near-target coverage.
Our key innovation is to handle variance heterogeneity and cluster-driven error variability by weighting conformity scores with Bayesian posterior uncertainties, replacing the absolute-residual score with an uncertainty-standardized score:

$$\tilde{s}_i = \frac{|y_i - \hat{y}_i|}{(\hat{\sigma}_i + \epsilon)^{\gamma}} \tag{11}$$

where $\hat{\sigma}_i$ is the posterior predictive standard deviation, $\gamma$ controls adaptivity (default $\gamma = 1$), and $\epsilon > 0$ prevents division by very small uncertainties (results are insensitive to its value over a wide range).
Intuitively, $\tilde{s}_i$ is a z-score-like residual: uncertain (hard) cases receive proportionally wider intervals, while confident (easy) cases receive narrower ones. This standardization stabilizes score scale across hospitals and patient groups, intended to improve conditional adaptivity (and empirical conditional calibration) while preserving conformal validity under group-aware calibration (single/repeated subsampling), since the same score is applied to calibration and test points within the hierarchical protocol.
This weighting does not restore exchangeability or remove within-hospital dependence. Rather, it is a heteroscedasticity-aware normalization intended to allocate interval width according to modeled predictive variability. Formal finite-sample coverage guarantees are provided only by the group-aware conformal variants (single/repeated subsampling); CDF pooling is used as a deterministic heuristic baseline.
This modular design represents a key strength of our framework. It cleanly separates the task of targeting marginal coverage, which is handled by the conformal “wrapper,” from the task of estimating instance-level prediction difficulty, which is handled by the Bayesian hierarchical model. This “separation of concerns” makes the framework more robust than integrated adaptive methods, as the final coverage validity does not require the Bayesian uncertainty model to be well calibrated; marginal coverage is governed by the assumptions of the hierarchical calibration protocol.
We integrate this weighting with hierarchical conformal calibration through the following algorithm:
Algorithm 1. Hybrid HRF (Bayesian Conformal) Prediction.
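In code, Algorithm 1 reduces to a few lines once the Bayesian layer supplies posterior predictive standard deviations; the sketch below combines the CDF-pooling threshold with the weighted score of Eq. (11), with variable names of our choosing.

```python
# Sketch of the hybrid Bayesian-conformal procedure (Algorithm 1).
import numpy as np

def hybrid_intervals(yhat_cal, y_cal, sigma_cal,
                     yhat_test, sigma_test,
                     alpha=0.05, gamma=1.0, eps=1e-6):
    # 1. Uncertainty-standardized conformity scores on the calibration set (Eq. 11).
    scores = np.abs(y_cal - yhat_cal) / (sigma_cal + eps) ** gamma
    # 2. Group-aware conformal threshold (CDF pooling shown; the subsampling
    #    variants can be substituted for finite-sample guarantees).
    s = np.sort(scores)
    m = len(s)
    q_hat = s[min(int(np.ceil((m + 1) * (1 - alpha))), m) - 1]
    # 3. Instance-adaptive half-widths: wider where the posterior is uncertain.
    w = q_hat * (sigma_test + eps) ** gamma
    return yhat_test - w, yhat_test + w
```

Note that the online step is constant-time per patient: it only rescales the precomputed threshold by each patient's posterior scale.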
While weighted conformal prediction has been studied, our work is distinct in that we couple a Bayesian hierarchical residual model, providing per-patient posterior predictive scales, with hierarchical conformal calibration designed for multi-level clinical data and a hierarchical random-forest base model. In a three-level patient-hospital-region setting, the Bayesian layer shares information across levels to generate the uncertainty weights used in the conformity score. Pooled calibration attains near-nominal coverage empirically; the weighting scheme then delivers instance-adaptive intervals without per-case quantile models. Importantly, the hybrid incurs only a lightweight offline Bayesian fit and a constant-time online step, avoiding refits at prediction time typical of many adaptive conformal procedures.
Validation metrics and experimental framework
We analyzed data from the 2019 Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample (NIS), the largest publicly available all-payer inpatient database in the United States59. The final dataset comprised 61538 inpatient records from 3793 hospitals across four U.S. regions, representing a three-level hierarchical structure: patients nested within hospitals, hospitals nested within regions. All analyses used de-identified data consistent with Health Insurance Portability and Accountability Act (HIPAA) regulations60. Missing values (<2.0 %) were handled using mode/median imputation due to the low percentage of missing data. Continuous variables were z-standardized; categorical variables retained HCUP encodings. We employed two cross-validation strategies. Our primary analysis used 5-fold stratified cross-validation based on LOS quintiles to preserve the distribution of short (Q1), moderate (Q2-Q4), and extended (Q5) stays, allowing hospitals to appear across splits to maximize within-hospital learning. To validate generalization to unseen institutions, we additionally conducted 5-fold grouped cross-validation with hospital-level holdout, ensuring zero hospital overlap between training and test sets. Within each fold, data were split into train/calibration/test as 64 %/16 %/20 % to meet conformal calibration requirements.
Hyperparameters were selected by grid search with 5-fold CV on training data. The hierarchical random forest used three levels: patient (100 trees, depth 15), hospital (75 trees, depth 12), region (50 trees, depth 10). Bayesian models used MCMC (2 chains, 500 warmup, 250 post-warmup samples, target_accept=0.99), with convergence assessed by $\hat{R}$ (Gelman-Rubin diagnostic indicating chain convergence) and effective sample size (ESS). On the hardware described in Supplementary Table S3, the offline MCMC fit required 9.4 minutes per training run. A fixed base seed (42) with fold-specific offsets ensured reproducibility using Python 3.11, scikit-learn 1.3.0, and PyMC 5.0.
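The two cross-validation protocols are straightforward to reproduce with scikit-learn utilities; a sketch (helper names ours) mirroring the description above:

```python
# LOS-quintile-stratified folds (primary) and hospital-grouped folds
# (generalization check). Within each training portion, a further 80/20
# split yields the 64/16/20 train/calibration/test partition.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

def primary_folds(y_los, n_splits=5, seed=42):
    quintile = np.digitize(y_los, np.quantile(y_los, [0.2, 0.4, 0.6, 0.8]))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return skf.split(np.zeros((len(y_los), 1)), quintile)

def grouped_folds(y_los, hospital_id, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)  # zero hospital overlap across folds
    return gkf.split(np.zeros((len(y_los), 1)), groups=hospital_id)
```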
Empirical coverage rate was the primary criterion: the proportion of test observations whose true outcomes fell within predicted intervals. For $\alpha = 0.05$, we required marginal coverage close to $1 - \alpha = 95\%$. Conditional coverage was assessed within uncertainty quintiles to identify systematic failures in high-uncertainty predictions.
Secondary metrics included interval efficiency measured by mean width, uncertainty-error correlation (Pearson r) between predicted uncertainties and absolute errors, and an adaptation index computed as the ratio of mean widths for high-uncertainty (>75th percentile) versus low-uncertainty (<25th percentile) predictions.
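These metrics are straightforward to compute from interval endpoints and uncertainty estimates; a sketch (names ours, assuming symmetric intervals centered on the point prediction):

```python
# Primary and secondary evaluation metrics for a test fold.
import numpy as np
from scipy.stats import pearsonr

def evaluate(y, lo, hi, sigma):
    covered = (y >= lo) & (y <= hi)
    width = hi - lo
    abs_err = np.abs(y - (lo + hi) / 2)        # midpoint equals the prediction
    r, _ = pearsonr(sigma, abs_err)            # uncertainty-error correlation
    hi_u = sigma > np.quantile(sigma, 0.75)
    lo_u = sigma < np.quantile(sigma, 0.25)
    return {
        "coverage": covered.mean(),            # primary criterion
        "mean_width": width.mean(),            # efficiency
        "uncertainty_error_r": r,
        "adaptation_index": width[hi_u].mean() / width[lo_u].mean(),
    }
```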
For Bayesian methods, isotonic regression calibration mapped predicted uncertainties to observed errors per CV fold using the calibration split, applied to the corresponding test fold to avoid data leakage. We evaluated: (i) discrimination via Pearson and Spearman correlations between predicted uncertainties and absolute errors, (ii) calibration via the regression slope $c$ (ideal $c \approx 0.798$ under Gaussian errors) and Expected Calibration Error (ECE), computed as the mean absolute deviation between binned predicted uncertainties and observed absolute errors, and (iii) proper scoring rules that evaluate both reliability and sharpness of probabilistic predictions (Continuous Ranked Probability Score (CRPS), Winkler Score).
Robustness was assessed across cross-validation stability, confidence level sensitivity (80 %, 90 %, 95 %, 99 %), and hospital subgroup performance to assess generalizability. Coverage variability is reported in percentage points (pp), while width variability is reported as standard deviation (days) and coefficient of variation (CV, %). All widths refer to full widths (days). The main stability analysis uses 5-fold cross-validation; $\alpha$-sensitivity and subgroup analyses use 3-fold cross-validation due to computational cost. Bayesian HRF intervals use raw posterior $\hat{\sigma}_i$ (pre-isotonic); isotonic calibration is applied only for uncertainty-diagnostic metrics, not interval construction.
Results
Data description
We validated our framework on hospital LOS prediction across 61538 patients from 3793 hospitals, demonstrating improved uncertainty quantification for safety-critical healthcare applications. The following sections detail data characteristics, method comparison, and performance analysis. Table 1 presents the 18 predictor variables selected based on established clinical relevance for LOS prediction. LOS, our primary outcome, exhibited typical healthcare characteristics: median = 3 days, IQR = [2, 6] days, with right skewness (mean = 4.9, SD = 7.1 days).
Table 1.
Complete Variable Descriptions and Characteristics.
| Level | Variable | Description | Type | Range/Categories |
|---|---|---|---|---|
| Patient Level | AGE | Patient age in years | Continuous | 1–90 |
| | FEMALE | Female gender | Binary | 0=Male, 1=Female |
| | RACE | Patient race/ethnicity | Categorical | 1=White, 2=Black, 3=Hispanic, 4=Asian, 5=Native, 6=Other |
| | ZIPINC_QRTL | ZIP income quartile | Ordinal | 1=Lowest to 4=Highest |
| | I10_NDX | Number of diagnoses | Count | 0–40 |
| | I10_NPR | Number of procedures | Count | 0–25 |
| | has_comorbidities | Presence of comorbidities | Binary | 0=None, 1=Present |
| | APRDRG_Severity | APR-DRG illness severity | Ordinal | 0=Minor to 4=Extreme |
| | APRDRG_Risk_Mortality | APR-DRG mortality risk | Ordinal | 0=Minor to 4=Extreme |
| | I10_SERVICELINE | Primary service line | Categorical | 1=Maternal, 2=Mental health/substance use, 3=Injury, 4=Surgical, 5=Medical |
| | PAY1 | Primary payer | Categorical | 1=Medicare, 2=Medicaid, 3=Private, 4=Self-pay, 5=Other, 6=Unknown |
| | ELECTIVE | Admission type | Binary | 0=Non-Elective, 1=Elective |
| | I10_INJURY | Injury indicator | Categorical | 0=No injury, 1=Injury, 2=Poisoning |
| | DIED | In-hospital mortality | Binary | 0=Survived, 1=Died |
| Hospital Level | HOSP_BEDSIZE | Hospital bed capacity | Ordinal | 1=Small (6–99), 2=Medium (100–399), 3=Large (400+) |
| | HOSP_LOCTEACH | Location/teaching status | Categorical | 1=Rural, 2=Urban non-teaching, 3=Urban teaching |
| | H_CONTRL | Hospital ownership | Categorical | 1=Public, 2=Private non-profit, 3=Private for-profit |
| Regional | HOSP_REGION | Census region | Categorical | 1=Northeast, 2=Midwest, 3=South, 4=West |
Comorbidity indicator derived using Enhanced ICD-10-CM Elixhauser Comorbidity Software, identifying the presence of any chronic condition among 38 categories
The hierarchical structure exhibited substantial clustering, with intraclass correlation coefficients (ICCs) revealing that 12.5 % of LOS variance was attributable to hospital-level factors, 40.8 % to regional factors, and 46.7 % to patient-level factors (Table 2). The substantial hierarchical clustering across all levels violates standard independence assumptions and justifies our hierarchical uncertainty quantification approach. Importantly, we retained all observations including extreme outliers (LOS up to 300 days) to evaluate uncertainty quantification across the full clinical spectrum. While 94 % of patients had stays under 20 days, the 6 % with extended stays critically test our framework’s ability to provide reliable coverage for high-uncertainty predictions.
Table 2.
Hierarchical Structure and Variance Decomposition Analysis.
| Hierarchical Level | Sample Characteristics | ICC |
|---|---|---|
| Patient Level | 61538 individual admissions | 0.467 |
| Hospital Level | 3793 hospitals (range: 1–130 patients per hospital) | 0.125 |
| Regional Level | 4 U.S. census regions (range: 555–1,461 hospitals per region) | 0.408 |
| Total Variance | LOS variance = 50.6 days² | 1.000 |
Patient-level ICC calculated as residual variance: $\text{ICC}_{\text{patient}} = 1 - \text{ICC}_{\text{hospital}} - \text{ICC}_{\text{region}} = 1 - 0.125 - 0.408 = 0.467$. The substantial regional-level clustering (ICC = 0.408) demonstrates significant institutional variation in LOS patterns, strongly justifying hierarchical uncertainty quantification approaches
Hierarchical conformal method selection and validation
Before presenting main results, we compare three hierarchical conformal calibration schemes to justify our methodological choice. Table 3 reports coverage, efficiency (mean full width), and computation time, averaged over 5-fold cross-validation.
Table 3.
Performance comparison of hierarchical conformal calibration schemes (5-fold CV, mean ± SD). Widths are full widths in days.
| Method | Coverage ( %) | Mean Width (days) | Time (s) |
|---|---|---|---|
| CDF Pooling | 95.0 ± 0.22 | 15.99 ± 0.36 | 28.29 ± 2.75 |
| Single Subsampling | 94.8 ± 0.53 | 15.98 ± 0.87 | 27.02 ± 0.53 |
| Repeated Subsampling | 94.4 ± 0.28 | 15.30 ± 0.40 | 35.81 ± 0.88 |
In safety-critical clinical settings we prioritize coverage fidelity and operational reproducibility over marginal efficiency gains. CDF pooling achieves the highest mean coverage (95.0 %) with the lowest fold-to-fold variability (±0.22 pp) and is deterministic (no Monte-Carlo randomness), while maintaining comparable runtime to Single Subsampling (28.29 s vs. 27.02 s). The width difference relative to Single Subsampling is modest, and although Repeated Subsampling attains narrower intervals, it does so with lower mean coverage (94.4 %) and higher computation time (35.81 s).
Accordingly, we adopt CDF pooling as the primary scheme for subsequent analyses, reflecting the deployment priority of calibrated and reproducible coverage. To preserve formal guarantees and demonstrate robustness, we also report Single Subsampling and Repeated Subsampling as guarantee-preserving sensitivity analyses; the subsampling variants enforce group exchangeability and thus provide a finite-sample coverage backstop.
Coverage validity
Across five-fold cross-validation, methods differed sharply in marginal coverage (Fig. 1A; mean ± SD across folds, in percentage points). Conformal HRF achieved 95.0 % (± 0.2 pp), Hybrid HRF achieved 94.3 % (± 0.5 pp), and post-hoc Bayesian calibration showed 14.1 % (± 2.1 pp) under the baseline Gaussian-on-LOS specification. Fold-level 95 % CIs for the mean were: Conformal 94.75–95.25 %, Hybrid 93.68–94.92 %, Bayesian 11.49–16.71 % (across the five folds).
Fig. 1.
Coverage performance across methods.
Coverage was highly stable for Conformal HRF and Hybrid HRF, with near-horizontal traces at ≈95 % across folds (Fig. 1B); post-hoc Bayesian calibration showed greater fold-to-fold variability. Coefficients of variation (CV = SD/mean) were 0.21 % for Conformal, 0.53 % for Hybrid, and 14.9 % for Bayesian. Given overlapping fold-level uncertainty between Conformal and Hybrid, we do not interpret their difference as statistically distinguishable. The Hybrid coverage of 94.3 % is 0.7 pp below the nominal 95 % target but within the range typically considered acceptable for clinical deployment. The severely low Bayesian coverage (14.1 %) indicates inadequate calibration under the Gaussian specification.
Bayesian model diagnostics and likelihood sensitivity were examined to assess whether the poor coverage of baseline post-hoc Bayesian calibration might reflect likelihood misspecification, given that LOS is strictly positive and strongly right-skewed. Diagnostic checks (Supplementary Fig. S1) revealed substantial departures from Gaussian assumptions, including strong right-skew (skewness = 16.76), heavy tails, and heteroscedasticity. However, likelihood sensitivity analysis (Supplementary Table S1) showed that replacing the Gaussian likelihood with positive-support alternatives (Log-Normal: 13.0 % ± 3.2; Gamma: 13.4 % ± 3.9) yielded similarly poor coverage compared to the Gaussian baseline (14.1 % ± 0.7). This indicates that inadequate coverage cannot be attributed to likelihood choice alone. Although the Gaussian assumption is clearly violated, positive-support alternatives that better match the data-generating process also fail to achieve adequate coverage, suggesting that no simple parametric form captures the complex residual structure of hierarchical LOS predictions. Consequently, distribution-free conformal calibration remains necessary regardless of likelihood specification.
Conditional coverage analysis revealed distinct patterns across the uncertainty quintiles for the three methods as shown in Fig. 2 (Table S2 reports standard errors and 95 % confidence intervals).
Fig. 2. Conditional coverage by uncertainty quintile. Post-hoc Bayesian calibration (blue bars) shows uniformly poor coverage (≈14 %). Conformal HRF (orange bars) over-covers in Q1–Q4 but under-covers in Q5 (81.8 %). Hybrid HRF (red bars) maintains near-nominal coverage across quintiles with improved Q5 performance (90.9 %). The dashed line indicates the 95 % target. Error bars show 95 % confidence intervals (Table S2).
Post-hoc Bayesian calibration exhibited uniformly low coverage across all quintiles (12.7–15.5 %), indicating systematic miscalibration under the Gaussian specification rather than uncertainty-dependent failure. (The method’s inability to provide reliable intervals persisted regardless of prediction difficulty.)
Conformal HRF showed substantial over-coverage in Q1–Q4 (96.2–99.7 %, exceeding the 95 % target) but under-coverage in Q5 (81.8 %, 95 % CI [80.3 %, 83.3 %]). This pattern reflects overly conservative intervals for low-uncertainty cases and inadequate coverage for high-uncertainty cases.

Hybrid HRF achieved coverage near nominal across Q1–Q4 (93.9–97.1 %) and improved Q5 coverage (90.9 %, 95 % CI [89.8 %, 92.1 %]) relative to Conformal. The Q5 confidence intervals do not overlap (Conformal: 80.3–83.3 %; Hybrid: 89.8–92.1 %), indicating a statistically distinguishable improvement of 9.1 percentage points for the highest-uncertainty patients.
Interval efficiency and adaptation
We evaluated the efficiency and adaptive characteristics of prediction intervals across the three methods, focusing on interval width patterns, and adaptive behavior.
Five-fold cross-validation revealed substantial differences in interval width characteristics across methods (Table 4). Post-hoc Bayesian calibration produced extremely narrow intervals (0.75 ± 0.01 days) but with inadequate coverage reliability. Conformal HRF generated uniform intervals with mean width of 16.32 ± 0.25 days. Our Hybrid method produced similar mean interval widths to Conformal HRF (15.99 ± 0.24 vs. 16.32 ± 0.25 days). Given overlapping fold-to-fold variability, we describe the difference as small in magnitude rather than statistically definitive.
Table 4.
Cross-Validated Interval Width Analysis.
| Width Metric | Conformal | Bayesian | Hybrid | Difference |
|---|---|---|---|---|
| Mean Width (days) | 16.32 ± 0.25 | 0.75 ± 0.01 | 15.99 ± 0.24 | -0.33 |
| Width Ratio vs Conformal | 1.00 ± 0.00 | 0.046 ± 0.001 | 0.98 ± 0.02 | -0.02 |
| Adaptation Ratio | 1.00 ± 0.00 | 1.01 ± 0.01 | 1.29 ± 0.01 | +0.29 |
| CV Stability (%) | 1.5 | 2.0 | 1.5 | 0.0 |
± = SD across 5 folds.
Difference = Hybrid − Conformal. Adaptation Ratio = Mean(Q5 interval width) / Mean(Q1 interval width), measuring within-method variation across uncertainty quintiles
The efficiency comparison between Conformal and Hybrid methods demonstrates modest improvement. The Hybrid approach achieves slightly narrower intervals through improved uncertainty quantification rather than coverage compromise, as evidenced by the small width reduction accompanied by preserved coverage validity, corresponding to a width ratio of 0.98 ± 0.02 relative to Conformal HRF.
Analysis of adaptive interval sizing revealed fundamental differences in width allocation strategies across methods. Conformal HRF produced uniform intervals by design (adaptation ratio = 1.00 ± 0.00), reflecting the global calibration approach that applies identical width adjustments regardless of prediction difficulty.
Both Bayesian-based methods demonstrated adaptive sizing capabilities. Post-hoc Bayesian calibration showed minimal adaptation (ratio = 1.01 ± 0.01), while the Hybrid method exhibited markedly stronger adaptive behavior (ratio = 1.29 ± 0.01). This adaptation enables the Hybrid method to allocate wider intervals for high-uncertainty predictions while providing more precise bounds for confident predictions.
Interval efficiency across uncertainty quintiles revealed systematic patterns across methods, as shown in Table 5. The analysis aggregated predictions from all cross-validation folds to examine width allocation behavior across the full spectrum of prediction uncertainty.
Table 5.
Interval Width and Coverage by Uncertainty Quintile.
| Quintile | Mean Uncert. | Conf. Width (days) | Bayes. Width (days) | Hybrid Width (days) | Bay/Conf Ratio | Hyb/Conf Ratio |
|---|---|---|---|---|---|---|
| Q1 (Low) | 0.168 | 16.71 ± 0.06 | 0.69 ± 0.01 | 13.21 ± 0.10 | 0.041 | 0.790 |
| Q2 | 0.176 | 16.24 ± 0.10 | 0.74 ± 0.01 | 14.86 ± 0.12 | 0.046 | 0.915 |
| Q3 | 0.179 | 16.06 ± 0.17 | 0.73 ± 0.02 | 15.14 ± 0.15 | 0.045 | 0.942 |
| Q4 | 0.186 | 16.62 ± 0.15 | 0.78 ± 0.01 | 16.48 ± 0.31 | 0.047 | 0.992 |
| Q5 (High) | 0.195 | 15.99 ± 0.10 | 0.81 ± 0.04 | 16.98 ± 1.38 | 0.051 | 1.061 |
Notes. Mean uncertainty = posterior predictive uncertainty from Bayesian HRF (refers to Post-hoc Bayesian hierarchical calibration), used to form quintiles (Q1 = lowest, Q5 = highest). Width ratios compare Bayesian/Hybrid methods to Conformal baseline
Conformal HRF exhibited minimal variation across uncertainty quintiles (range: 15.99–16.71 days), as expected from a global calibration method that produces uniform intervals regardless of prediction difficulty. The slight variations observed likely reflect sampling variability rather than adaptive behavior.
The Hybrid method demonstrated meaningful adaptive behavior relative to conformal, with width ratios showing a clear trend from 0.790 to 1.061 across uncertainty quintiles. Table 5 summarizes how the Hybrid method reallocates interval width across uncertainty quintiles. Consistent with an adaptive weighting scheme, intervals are narrower in lower-uncertainty strata (Q1–Q3) and slightly wider in the highest-uncertainty stratum (Q5) relative to Conformal. At the same time, the average width difference between Hybrid and Conformal is modest (Table 4), indicating that Hybrid primarily redistributes width rather than materially shrinking widths overall. We therefore interpret these results as evidence of adaptive allocation of interval width across prediction difficulty, rather than a large global efficiency gain.
The adaptation ratio (1.29) indicates that high-uncertainty cases receive approximately 29 % wider intervals than low-uncertainty cases, a clinically meaningful difference that enables risk-stratified resource allocation. Low-uncertainty patients can benefit from more precise planning bounds, while high-uncertainty patients receive appropriately conservative estimates.
The Hybrid method exhibited dramatically increased variability in the highest uncertainty quintile (Q5: ±1.38 days vs. ±0.10 days for Conformal), representing a 14-fold increase in interval width variation. This heterogeneous behavior reflects the method's ability to differentially respond to varying prediction difficulty within the high-uncertainty stratum. While this increased variability might complicate operational planning for Q5 cases, it provides valuable signal about prediction confidence that uniform intervals cannot capture.
Post-hoc Bayesian calibration maintained consistently narrow intervals across all quintiles (0.69–0.81 days), with width ratios of approximately 0.04–0.05 relative to conformal. These extremely narrow intervals, roughly 20 times smaller than conformal methods, explain the method’s catastrophic coverage failure despite the slight increase in width from Q1 to Q5 (17 % increase from 0.69 to 0.81 days).
Comprehensive cross-validation analysis confirmed the stability and reproducibility of our uncertainty quantification methods across diverse data partitions (Fig. 3). Panel A reveals the dramatic scale differences between methods. Conformal HRF maintained consistent interval widths across all five folds (16.32 ± 0.25 days, CV = 1.5 %), with the Hybrid HRF method showing nearly identical stability (15.99 ± 0.24 days, CV = 1.5 %), while Post-hoc Bayesian calibration intervals remained uniformly narrow (0.75 ± 0.01 days, CV = 2.0 %)-approximately 20-fold narrower than conformal methods, visually explaining its severe under-coverage.
Fig. 3.
Cross-validation stability across 5 folds.
The Hybrid method demonstrated robust efficiency gains across all validation folds (Fig. 3B), achieving a mean improvement of 0.33 days over conformal. Critically, every fold exhibited positive efficiency gains (range: 0.31-0.37 days), with no instance of efficiency loss, supporting the reliability and generalizability of the hybrid approach.
Adaptation behavior proved remarkably consistent across folds (Fig. 3C). The adaptation index of 1.29 ± 0.01 significantly exceeded the null hypothesis of uniform intervals (1.000, P < 0.001), with a narrow 95 % confidence interval [1.278, 1.302] confirming that adaptive sizing represents a genuine methodological characteristic rather than a data-specific artifact.
These stability metrics-consistent interval widths (CV < 4 %), uniform efficiency gains across folds, and stable adaptation indices-demonstrate that our methodological findings are robust to data partitioning. The Hybrid method achieves modest but consistent efficiency (2.0 % mean width reduction) through adaptive allocation rather than coverage compromise, with a statistically significant adaptation ratio (1.29 vs. 1.00) enabling uncertainty-responsive interval sizing. In contrast, post-hoc Bayesian calibration, despite its computational elegance, fails as a standalone uncertainty quantification method due to inadequate coverage. These findings support the Hybrid method’s suitability for deployment across diverse clinical settings.
Uncertainty–error association and clinical calibration
Conformal HRF produced uniform uncertainty estimates by design, yielding zero uncertainty-error correlation and preventing uncertainty-guided clinical decision-making.
Both Post-hoc Bayesian calibration and Hybrid rely on the same Bayesian-derived uncertainty estimates, so their raw discrimination performance is identical. Specifically, predicted uncertainties show a modest positive association with absolute error (Pearson $r = 0.203$ [95 % CI: 0.195, 0.210], Spearman $\rho = 0.043$), indicating that higher predicted uncertainty corresponds (at least weakly) to higher observed error. However, the magnitude of these raw uncertainty values was severely miscalibrated. The mean predicted uncertainty was only 0.19 days, compared to a mean absolute error of 2.71 days (a 14.5× under-scale). Consistently, the calibration slope was 91.4, far above the theoretical Normal-based ideal of 0.798, confirming that the raw Bayesian uncertainties, while able to rank cases, were not on a clinically interpretable scale.
To correct this mismatch, we applied isotonic regression calibration, a non-parametric technique that fits a monotone mapping between predicted uncertainties and observed absolute errors. Unlike parametric rescaling methods (e.g., Platt or temperature scaling), isotonic regression does not assume any functional form, making it well-suited for healthcare data where error distributions are complex and heterogeneous. The goal is to preserve the ordering of uncertainties (so discrimination is unchanged or improved) while bringing the scale of predictions into alignment with observed errors.
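A minimal sketch of this per-fold calibration step with scikit-learn's IsotonicRegression; as described above, it is fit on the calibration split only and applied to the test fold (function name ours):

```python
# Monotone mapping from raw posterior SDs to the scale of observed errors.
from sklearn.isotonic import IsotonicRegression

def calibrate_uncertainty(sigma_cal, abs_err_cal, sigma_test):
    iso = IsotonicRegression(out_of_bounds="clip")  # nonparametric, monotone
    iso.fit(sigma_cal, abs_err_cal)                 # preserves the ranking
    return iso.predict(sigma_test)                  # calibrated scale (days)
```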
Isotonic regression calibration dramatically improved uncertainty quality (Table 6). Pearson correlation increased to $r = 0.476$ (+134 %), Spearman correlation to $\rho = 0.243$ (+465 %), and the calibration slope improved to 0.926, approaching the theoretical ideal. Expected Calibration Error decreased from 2.52 to 0.037 days (98.5 % reduction), with mean calibrated uncertainty matching the mean absolute error at 2.71 days.
Table 6.
Isotonic Calibration Impact on Uncertainty Quality.

| State | Pearson r | Spearman ρ | Slope | ECE (days) | Mean σ̂ (days) |
|---|---|---|---|---|---|
| Raw | 0.203 | 0.043 | 91.4 | 2.52 | 0.19 |
| Calibrated | 0.476 | 0.243 | 0.926 | 0.037 | 2.71 |
| Improvement | +134 % | +465 % | toward 0.798 ideal | −98.5 % | matched to MAE |
QQ plots (Fig. 4) demonstrated substantial improvement in distributional properties following isotonic calibration. Raw uncertainties showed severe deviations from normality with heavy tails and systematic bias, while calibrated uncertainties achieved much closer alignment with the theoretical normal distribution. Despite visual improvement, Shapiro-Wilk tests remained significant for both raw and calibrated uncertainties, reflecting the inherent complexity of healthcare prediction errors.
Fig. 4.
Normality Assessment: Standardized Residuals QQ Plots. Quantile-quantile plots validating distributional assumptions for uncertainty quantification. (a) Raw uncertainties show systematic deviations from normality (red line) with heavy tails and bias, reflecting poor scaling. (b) Calibrated uncertainties demonstrate improved alignment with the normal distribution, supporting the validity of the Normal-based uncertainty framework despite continued complexity in healthcare data. Shapiro-Wilk tests remain significant for both panels, but linearity improves substantially following calibration.
Despite identical well-calibrated uncertainties, the methods differed dramatically in coverage performance (Table 7). This critical finding demonstrates that even a well-specified Bayesian hierarchical model with properly calibrated uncertainties cannot guarantee coverage without conformal adjustment, a key insight for practitioners who might assume Bayesian credible intervals provide reliable coverage.
Table 7.
Coverage Performance with Calibrated Uncertainties.
| Method | Coverage (%) | Width (days) | CRPS | Winkler Score |
|---|---|---|---|---|
| Post-hoc Bayesian calibration | 14.1 | 0.75 | 2.04 | 95.02 |
| Hybrid HRF | 94.3 | 15.99 | 2.04 | 35.52 |
This contrast is illuminated by proper scoring rules that evaluate different aspects of uncertainty quantification. Continuous Ranked Probability Score (CRPS) measures distributional accuracy while Winkler Score penalizes both interval width and coverage failures, with lower scores indicating better performance. The identical CRPS values (2.04 for both methods) confirm that both approaches have equivalent uncertainty discrimination since they use the same Bayesian posterior uncertainties. However, the Winkler Score strongly favored the Hybrid method (35.5 vs 95.0), demonstrating that superior coverage reliability outweighs interval efficiency when both aspects are evaluated together.
This confirms that the Hybrid method’s improvement stems from conformal calibration rather than better uncertainty estimation, validating the design principle that wider but reliable intervals are operationally superior to narrow intervals that fail to cover.
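For readers unfamiliar with these scores, both have simple closed forms; the sketch below shows the Winkler interval score for a central 1 − α interval and the closed-form CRPS of a Gaussian forecast (matching the paper's Normal-based uncertainty framework). This is an illustrative implementation, not the authors' code:

```python
import numpy as np
from scipy.stats import norm

def winkler_score(y, lower, upper, alpha=0.05):
    """Interval (Winkler) score: width plus a 2/alpha penalty per unit of
    miss outside the interval; lower is better."""
    width = upper - lower
    penalty = (2 / alpha) * ((lower - y) * (y < lower) + (y - upper) * (y > upper))
    return np.mean(width + penalty)

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Normal(mu, sigma^2) forecast; lower is better."""
    z = (y - mu) / sigma
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1)
                            + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))
```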
Clinical interpretation of results
The performance characteristics of our hybrid framework translate directly into operational advantages for hospital resource management and clinical decision support, with immediate implications for implementing uncertainty-aware LOS prediction in healthcare settings.
The 94.3 % coverage reliability fundamentally transforms hospital capacity planning by providing administrators with quantifiable confidence levels rather than point predictions of unknown reliability. For every 1000 admissions, hospital planners can rely on 943 cases having actual LOS fall within predicted bounds, enabling reductions in traditional safety margins while maintaining patient safety. The predictable 57 cases per 1000 that exceed intervals can trigger predetermined escalation protocols, activating overflow capacity, accelerating discharge planning, or alerting post-acute care networks, rather than catching administrators unprepared. This converts unpredictable capacity crunches into manageable, planned responses that improve both patient flow and staff workload predictability. The complete failure of Post-hoc Bayesian hierarchical calibration alone (14.1 % coverage) underscores why distribution-free calibration is essential for clinical deployment, regardless of the sophistication of the uncertainty model.
Performance varies systematically across uncertainty quintiles, enabling risk-stratified operational protocols. For high-confidence predictions (Q1–Q2), Hybrid intervals are meaningfully narrower than Conformal (Q1: 13.21 vs 16.71 days; Q2: 14.86 vs 16.24 days), enabling reduced routine buffers of 1.4–3.5 days per case with standard monitoring protocols. Moderate-uncertainty cases (Q3–Q4) show comparable widths between methods, requiring no operational changes to existing buffer policies.
Because clinical harm is most likely to arise from high-uncertainty predictions, coverage by uncertainty quintile (Q5) serves as a direct deployment risk indicator. The hybrid framework offers the most deployment-ready option, maintaining good overall calibration while limiting coverage loss in the highest-uncertainty stratum. The standardized performance across diverse hospital characteristics enables immediate deployment without site-specific calibration.
Essential deployment protocols include uncertainty-stratified safeguards. Q5 cases should be routed to human review and senior triage, with conservative interval widening when risk thresholds are exceeded. Clinical interfaces should expose interval widths so that clinicians can identify highly uncertain predictions. Governance frameworks must include coverage monitoring by quintile and subgroup (alerting if Q5 coverage falls below nominal targets), buffer efficiency tracking (planned hours per admission and aggregate bed-days of reduced prediction uncertainty), escalation load management (capping Q5 review queues to service level agreements), and drift surveillance with recalibration triggers.
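As one concrete way to operationalize these governance rules, a minimal monitoring sketch follows; the log schema is hypothetical, and the thresholds (94 % overall, 90 % for Q5) are taken from the service-level indicators discussed later in the paper:

```python
import pandas as pd

def coverage_alerts(log: pd.DataFrame, overall_floor=0.94, q5_floor=0.90):
    """Scan a recent prediction log for coverage shortfalls.

    Assumed columns: 'covered' (bool, actual LOS fell inside the interval)
    and 'quintile' (1-5 predicted-uncertainty stratum).
    """
    alerts = []
    if log["covered"].mean() < overall_floor:
        alerts.append("overall coverage below floor")
    q5 = log.loc[log["quintile"] == 5, "covered"]
    if len(q5) > 0 and q5.mean() < q5_floor:
        alerts.append("Q5 coverage below floor")
    return alerts
```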
The computational efficiency enables real-time integration with existing electronic health record systems without dedicated infrastructure investments, while guarantee-capable (group-aware) calibration provides the statistical documentation required for regulatory compliance and clinical governance committees. For multi-facility health systems, the consistent methodology allows uniform resource allocation protocols, establishment of consistent quality metrics across facilities, and system-wide capacity management based on predicted demand patterns with known reliability bounds. This framework converts statistical interval properties into actionable bed and staffing decisions, providing scalable efficiency improvements for routine cases while maintaining safety through uncertainty-responsive allocation and human oversight for challenging predictions.
Robustness and sensitivity analysis
We evaluated robustness along three axes: cross-validation stability, sensitivity to the target confidence level 1 − α (equivalently, the nominal miscoverage level α), and performance across hospital characteristics.
Beyond the coverage and width stability metrics reported in previous sections, we examined predictive accuracy across folds. Table 8 shows that all three methods achieved nearly identical Root Mean Square Error (RMSE), with differences of less than 6 % that are unlikely to be clinically significant.
Table 8.
Predictive accuracy across methods (5-fold CV, mean ± SD).
| Method | RMSE (days) | CV (%) |
|---|---|---|
| Conformal HRF | 6.05 ± 0.59 | 9.7 |
| Post-hoc Bayesian calibration | 6.39 ± 0.60 | 9.3 |
| Hybrid HRF (Bayesian Conformal) | 6.01 ± 0.65 | 9.1 |
The near-identical RMSE values demonstrate that coverage performance is essentially independent of predictive accuracy. Conformal and Hybrid methods achieve 95 % and 94 % coverage respectively despite using different base predictions, while Post-hoc Bayesian calibration's severe under-coverage (14 %) occurs despite predictions similar to the Hybrid method's. This decoupling of accuracy and coverage underscores that interval reliability depends primarily on the uncertainty quantification method rather than point prediction quality.
Notably, the RMSE difference between Conformal (6.05) and the Bayesian-based methods (6.39 and 6.01) represents a small relative change, negligible compared to the mean LOS of 4.9 days and SD of 7.1 days in our dataset. From a clinical perspective, all three methods provide equivalent predictive accuracy. This is a critical finding: the substantial gains in interval reliability offered by the Hybrid method come at negligible cost to the model's underlying point-prediction performance, making the choice between methods a clear decision in favor of superior uncertainty quantification.
Sensitivity analysis across alternative confidence levels (α ∈ {0.01, 0.05, 0.10, 0.20}) showed appropriate tracking of the nominal targets (Table 9). Conformal HRF achieved exact or near-exact coverage at all levels, with deviations ≤ 0.1 pp. Hybrid HRF exhibited small, systematic under-coverage that grew slightly with larger α values (α = 0.01: 0.0 pp; α = 0.05: −0.6 pp; α = 0.10: −1.1 pp; α = 0.20: −1.0 pp). In contrast, Post-hoc Bayesian calibration substantially under-covered at all levels due to uncalibrated Normal intervals derived from the raw posterior standard deviations.
Table 9.
Alpha Level Sensitivity Analysis.
| α (Nominal) | Coverage: Conformal (%) | Coverage: Bayesian (%) | Coverage: Hybrid (%) | Width: Conformal (days) | Width: Bayesian (days) | Width: Hybrid (days) |
|---|---|---|---|---|---|---|
| 0.01 (99 %) | 99.1 ± 0.1 | 22.1 ± 2.1 | 99.0 ± 0.2 | 39.4 ± 2.0 | 1.1 ± 0.1 | 37.8 ± 2.8 |
| 0.05 (95 %) | 95.0 ± 0.4 | 14.1 ± 2.1 | 94.4 ± 0.3 | 16.2 ± 0.5 | 0.8 ± 0.1 | 15.9 ± 0.2 |
| 0.10 (90 %) | 90.0 ± 0.6 | 12.0 ± 1.0 | 88.9 ± 0.7 | 10.9 ± 0.2 | 0.6 ± 0.1 | 10.8 ± 0.3 |
| 0.20 (80 %) | 80.0 ± 0.6 | 8.0 ± 0.6 | 79.0 ± 1.6 | 6.7 ± 0.1 | 0.4 ± 0.0 | 6.9 ± 0.3 |
Results from 3-fold cross-validation comparing Conformal HRF, Post-hoc Bayesian calibration, and Bayesian Conformal HRF. Coverage shown as mean ± standard deviation. Width is the mean interval full width in days. Post-hoc Bayesian calibration shows systematic under-coverage, indicating miscalibration.
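For reference, the nominal level α enters a split-conformal construction only through the finite-sample quantile of the calibration scores; a simplified sketch with absolute-residual scores (not the paper's full group-aware procedure) shows the mechanism behind Table 9:

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_test, alpha):
    """Symmetric split-conformal interval from absolute calibration residuals.

    The ceil((n + 1) * (1 - alpha)) order statistic gives the standard
    >= 1 - alpha marginal coverage guarantee under exchangeability.
    """
    n = len(resid_cal)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # clip for small n
    q = np.sort(np.abs(resid_cal))[k - 1]
    return y_hat_test - q, y_hat_test + q

# Sweeping alpha over (0.01, 0.05, 0.10, 0.20) reproduces the kind of
# width/coverage trade-off summarized in Table 9.
```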
Sensitivity to Hierarchy Specification (3-Level vs. 2-Level) was assessed by comparing a three-level structure (patients → hospitals → regions) with two alternative two-level configurations (patients → hospitals and patients → regions). We evaluated the Hybrid HRF method under each specification to determine robustness to structural assumptions. Coverage and efficiency (mean full width) results for all configurations are summarized in Table 10.
Table 10.
Performance across hierarchy levels (mean ± SD). Widths are full widths in days.
| Hierarchy | Coverage (%) | Mean Width (days) |
|---|---|---|
| 3-Level (Patient → Hospital → Region) | 94.3 ± 0.05 | 15.99 ± 0.59 |
| 2-Level (Patient → Hospital) | 94.8 ± 0.30 | 16.30 ± 0.34 |
| 2-Level (Patient → Region) | 94.2 ± 0.35 | 15.80 ± 0.42 |
Across hierarchies, coverage remained close to nominal (94.2–94.8 %) and widths varied only modestly (15.80–16.30 days), indicating that conclusions are not sensitive to the specific hierarchy used. The Patient–Hospital specification achieved the highest mean coverage (94.8 %) with a slightly wider interval (16.30 days), whereas the Patient–Region specification produced the narrowest intervals (15.80 days) with a small decrease in mean coverage (94.2 %). The full 3-Level hierarchy lies between these alternatives (94.3 %, 15.99 days), capturing both within-hospital and regional effects while preserving efficiency.
These results are consistent with the overall findings reported above: intervals are in the mid-teens of days, and coverage remains near 95 % under reasonable hierarchy choices.
Hospital characteristics subgroups, defined by bed size, ownership, location, and region, revealed consistent patterns in coverage performance (Table 11). Conformal HRF and Hybrid HRF maintained robust coverage across all strata, ranging from 93.5 to 95.6 %, while Post-hoc Bayesian calibration exhibited systematic under-coverage, with rates ranging from 11.0 to 40.6 %. To highlight the clinical shortfall, we also report the "Bayesian deficit," defined as the percentage points below the 95 % target.
Table 11.
Hospital Characteristics Subgroup Analysis (coverage %, mean ± SD).
| Characteristic | Category | Patients | Hospitals | Conformal (%) | Bayesian (%) | Hybrid (%) | Bay Deficit (pp) |
|---|---|---|---|---|---|---|---|
| Bed Size | Small | 13,534 | 1,789 | 95.1 ± 1.0 | 11.0 ± 1.9 | 95.3 ± 1.0 | −84.0 |
| | Medium | 17,735 | 1,014 | 95.3 ± 0.2 | 13.9 ± 2.4 | 95.2 ± 0.1 | −81.1 |
| | Large | 30,269 | 990 | 94.8 ± 0.6 | 15.4 ± 1.7 | 94.3 ± 1.1 | −79.6 |
| Ownership | Public | 7,023 | 591 | 93.5 ± 2.2 | 12.7 ± 1.4 | 93.5 ± 3.0 | −82.3 |
| | Private NP | 45,896 | 2,551 | 95.0 ± 0.3 | 15.8 ± 2.2 | 94.7 ± 0.6 | −79.2 |
| | Private FP | 8,619 | 651 | 94.9 ± 1.7 | 13.6 ± 1.8 | 94.0 ± 0.8 | −81.4 |
| Location | Rural | 5,259 | 1,144 | 94.8 ± 2.1 | 14.2 ± 0.8 | 94.7 ± 1.3 | −80.8 |
| | Urban NT | 10,868 | 984 | 95.1 ± 0.3 | 15.6 ± 4.1 | 94.8 ± 1.3 | −79.4 |
| | Urban T | 45,411 | 1,665 | 94.7 ± 0.7 | 13.7 ± 1.4 | 94.8 ± 1.0 | −81.3 |
| Region | Northeast | 11,223 | 555 | 95.2 ± 1.1 | 37.9 ± 4.4 | 95.4 ± 0.8 | −57.1 |
| | Midwest | 13,654 | 1,046 | 95.6 ± 0.1 | 38.7 ± 1.3 | 94.6 ± 0.6 | −56.3 |
| | South | 24,643 | 1,461 | 94.9 ± 0.2 | 38.7 ± 0.5 | 94.1 ± 0.5 | −56.3 |
| | West | 12,018 | 731 | 95.0 ± 0.4 | 40.6 ± 3.3 | 94.7 ± 1.1 | −54.4 |
Coverage rates from 3-fold cross-validation (mean ± SD). Conformal = Conformal HRF, Bayesian = Post-hoc Bayesian calibration, Hybrid = Bayesian Conformal HRF. Bay Deficit = percentage-point deficit from the 95 % target for Post-hoc Bayesian calibration. NP = Non-profit, FP = For-profit, NT = Non-teaching, T = Teaching.
Per-Hospital Coverage Distribution was examined to assess robustness across individual institutions. We computed coverage rates for all hospitals with at least five patients in the test set, and the resulting distribution is summarized in Table 12. The mean per-hospital coverage (94.6 %) closely aligns with the overall marginal coverage (94.3 %), and the median of 100 % indicates that most hospitals achieve complete or near-complete coverage. However, the distribution is left-skewed, with a tail of low-coverage institutions: 25.3 % of hospitals exhibit coverage below 90 %. Hospitals with lower coverage generally had smaller sample sizes or higher proportions of complex cases (Q5 patients). This heterogeneity underscores the need for the uncertainty-stratified protocols proposed in Sect. 4.6, where high-uncertainty predictions receive enhanced monitoring regardless of institutional setting.
Table 12.
Per-hospital coverage distribution for Hybrid HRF (hospitals with ≥ 5 test patients).

| Statistic | Value |
|---|---|
| Hospitals analyzed | 1000 |
| Mean coverage | 94.6 % |
| Median coverage | 100.0 % |
| Standard deviation | 8.5 pp |
| Interquartile range | [88.9 %, 100.0 %] |
| Range | [50.0 %, 100.0 %] |
| Hospitals with coverage < 90 % | 253 (25.3 %) |
| Hospitals with coverage < … % | 150 (15.0 %) |
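A sketch of how such a per-hospital distribution can be tabulated from a patient-level test frame; the column names are hypothetical:

```python
import pandas as pd

# test: one row per patient with 'hospital_id' and 'covered' (bool, actual
# LOS fell inside the prediction interval).
per_hosp = (test.groupby("hospital_id")["covered"]
                .agg(coverage="mean", n="size")
                .query("n >= 5"))          # hospitals with >= 5 test patients

summary = per_hosp["coverage"].describe()   # mean, median, IQR, range
share_below_90 = (per_hosp["coverage"] < 0.90).mean()
```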
Generalization to unseen hospitals
To validate deployment claims for multi-hospital settings, we conducted grouped cross-validation where entire hospitals were held out from training. This design ensures that test predictions are made for institutions never seen during model fitting, directly testing generalization to new hospitals without site-specific recalibration.
We partitioned the 3793 hospitals into 5 folds, with each fold holding out approximately 759 hospitals (20 %) as a test set. By design, there was zero hospital overlap between training and test sets in each fold. This represents the most stringent evaluation setting: predicting for entirely new institutions rather than new patients within known institutions.
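This partitioning corresponds to grouped K-fold cross-validation with the hospital identifier as the grouping key; a minimal sketch using scikit-learn's GroupKFold (variable names hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X, y: patient-level features and LOS targets;
# hospital_ids: NumPy array with one hospital identifier per row.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=hospital_ids):
    # No hospital appears on both sides of the split.
    assert set(hospital_ids[train_idx]).isdisjoint(hospital_ids[test_idx])
    # fit the HRF + Bayesian layer on train_idx, conformally calibrate,
    # then evaluate coverage and width on the held-out hospitals ...
```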
Table 13 presents the results. Hybrid HRF coverage remained near-nominal at 94.3 % ± 0.5 pp, closely matching the 94.3 % achieved under patient-level cross-validation (Table 7). Mean interval width was 16.07 ± 0.58 days, comparable to the 15.99 ± 0.24 days in our primary analysis, with modestly higher variability as expected under the more stringent hospital-holdout design.
Table 13.
Grouped cross-validation results with hospital holdout. Each fold holds out approximately 759 hospitals (20 %) with zero overlap between training and test hospitals.
| Metric | Mean ± SD |
|---|---|
| Coverage (%) | 94.3 ± 0.5 |
| Mean Width (days) | 16.07 ± 0.58 |
The stability of coverage under hospital holdout validates the framework’s applicability to new institutions. The hierarchical random forest successfully generalizes hospital-level patterns to unseen hospitals, likely by leveraging similarities in patient case-mix and hospital characteristics (bed size, teaching status, ownership). This finding supports deployment scenarios where the model is trained on a consortium of hospitals and applied to newly joining institutions without requiring site-specific recalibration.
Comparison with adaptive conformal baselines
To evaluate whether established adaptive conformal methods achieve comparable conditional coverage, we compared our Hybrid HRF against Locally Adaptive Split Conformal Prediction61, a principled approach that scales conformity scores by estimated conditional standard deviation. This method represents the standard approach for addressing heteroscedasticity in conformal prediction and serves as a direct baseline for our Bayesian-weighted framework.
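For clarity, this baseline can be sketched as split conformal prediction on variance-scaled scores: standardize the calibration residuals by an estimated conditional spread, take the finite-sample conformal quantile, and rescale per test case. A simplified illustration under these assumptions, not the reference implementation:

```python
import numpy as np

def locally_adaptive_interval(y_hat, sigma_hat, resid_cal, sigma_cal,
                              alpha=0.05):
    """Locally adaptive split conformal: conformity score |residual| / sigma
    on the calibration set, per-case half-width q * sigma_hat at test time."""
    scores = np.abs(resid_cal) / sigma_cal
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(scores)[k - 1]
    return y_hat - q * sigma_hat, y_hat + q * sigma_hat
```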
Table 14 presents coverage performance stratified by uncertainty quintile. The Locally Adaptive method achieved 94.0 % overall coverage. However, conditional coverage in high-uncertainty cases was limited: Q5 coverage was 86.5 %, compared to 90.9 % for our Hybrid HRF, a 4.4 percentage point improvement. This represents a 52 % reduction in the coverage deficit for the most challenging predictions (from 8.5 pp to 4.1 pp below the 95 % nominal target).
Table 14.
Comparison of adaptive conformal methods by uncertainty quintile.
| Method | Overall | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|---|
| Locally Adaptive Conformal | 94.0 | 97.8 | 97.4 | 96.1 | 94.8 | 86.5 |
| Conformal HRF | 95.0 | 99.7 | 99.2 | 98.1 | 96.3 | 81.8 |
| Hybrid HRF (Proposed) | 94.3 | 95.5 | 97.1 | 93.9 | 94.2 | 90.9 |

Q5 deficit from the 95 % target: Locally Adaptive = 8.5 pp, Conformal = 13.2 pp, Hybrid = 4.1 pp.
Notably, both adaptive methods substantially outperform standard Conformal HRF in the high-uncertainty stratum (Q5: 86.5 % and 90.9 % versus 81.8 %), confirming that adaptive calibration is essential for heteroscedastic healthcare data. Our Hybrid HRF achieves the best Q5 coverage among all methods, demonstrating that Bayesian hierarchical uncertainty estimates provide superior adaptivity compared to variance-scaling alone.
These results support the conclusion that integrating Bayesian posterior uncertainties with conformal calibration yields better conditional coverage than standard adaptive conformal approaches, particularly in the high-uncertainty stratum where reliable prediction intervals are most clinically important.
Discussion
Key findings and clinical significance
This study demonstrates that Hybrid HRF uncertainty quantification, combining Post-hoc Bayesian hierarchical calibration with conformal calibration, can provide near-nominal marginal coverage while enabling adaptive precision in hierarchical healthcare data. Using hospital LOS as a test case, our framework achieved 94.3 % coverage across 61538 patients from 3793 hospitals (five-fold cross-validation), with stable performance across validation folds and confidence levels. The method reallocates interval width across uncertainty strata, producing narrower intervals for low-uncertainty cases (Q1: 13.21 vs 16.71 days) and wider intervals for high-uncertainty cases (Q5: 16.98 vs 15.99 days), yielding an adaptation ratio of 1.29. These results were consistent across diverse hospital characteristics, suggesting potential for standardized deployment with reduced need for site-specific tuning.
A critical methodological insight emerges from comparing our three approaches: Post-hoc Bayesian hierarchical calibration achieved only 14.1 % coverage despite uncertainty discrimination identical to the hybrid method (CRPS of 2.04 for both). This suggests that, in our setting, posterior uncertainty calibration, even when properly scaled through isotonic regression, cannot substitute for distribution-free calibration. This finding has important implications for healthcare ML deployment, where practitioners might incorrectly assume that sophisticated uncertainty models provide reliable prediction intervals.
The clinical significance extends beyond statistical performance metrics. Hospital administrators require uncertainty bounds that balance patient safety with operational efficiency. Our approach provides quantifiable reliability levels (94.3 % coverage) that enable evidence-based resource allocation decisions rather than ad-hoc safety buffers. The uncertainty-error correlation (r=0.476 after calibration) allows targeted attention to cases most likely to deviate from predictions, supporting risk-stratified workflows where high-uncertainty patients receive enhanced monitoring and conservative planning. This adaptive sizing enables differentiated clinical protocols. Low-uncertainty patients benefit from tighter resource planning, while high-uncertainty patients receive appropriately conservative bounds with enhanced monitoring.
However, under-coverage concentrates in the highest-uncertainty quintile (90.9 % vs the 95 % target), affecting precisely the patients where reliable bounds matter most. While this represents our primary limitation, the hybrid method shows higher empirical coverage in Q5 than Conformal (90.9 % vs 81.8 %), reducing prediction failures among the most challenging cases. The complete failure of Bayesian intervals alone (14.1 % coverage), despite sharing an identical and well-calibrated uncertainty signal with the Hybrid method (CRPS of 2.04 for both), provides a critical insight: well-calibrated uncertainty is not a substitute for formal coverage guarantees. Distribution-free, group-aware calibration wrappers are therefore essential for safety-critical healthcare applications, regardless of the theoretical elegance or discriminative power of the uncertainty model.
Clinical implementation and safety
While our intervals achieve reliable coverage, we acknowledge that their widths substantially exceed typical stays, limiting immediate clinical utility for individual patient planning. Nevertheless, the framework provides value by identifying the subset of predictions that can be made with narrow, reliable bounds, establishing a methodological baseline for uncertainty quantification in hierarchical data, and enabling risk-stratified protocols in which uncertainty levels trigger different clinical workflows.
The framework requires minimal technical infrastructure for hospital deployment. Implementation involves two simple stages: first, training the uncertainty model offline using historical data (updated periodically), then applying real-time predictions during clinical workflows. This approach requires no additional computational resources at inference time beyond standard prediction systems and maintains the fast response times essential for busy clinical environments. In our experiments, fitting the Bayesian post-processing layer via MCMC was performed offline and took 9.4 minutes on a 2-core CPU with 12.7 GB RAM; once trained, interval generation requires no MCMC and adds negligible overhead to real-time prediction. The system uses standard patient information already collected in hospitals, such as demographics, diagnoses, and procedures, without requiring new data collection or staff training. Integration occurs through existing hospital information systems using standard healthcare data formats, eliminating the need for custom interfaces or workflow disruptions.
The modest interval width reduction (0.33 days or 8 hours per admission) represents improved prediction precision, not actual LOS reduction. Extrapolating this precision gain across our study cohort (61538 patients) yields approximately 20000 fewer bed-days of prediction uncertainty annually. We emphasize this reflects tighter planning bounds rather than demonstrated LOS reductions, which would require prospective intervention studies. The operational value lies in more precise resource allocation and reduced safety buffers, with realized benefits depending on implementation and workflow integration. It is crucial to interpret this value correctly. Due to the high fixed-cost structure of hospitals, where expenses for staff, buildings, and equipment are largely static, the primary benefit is not a direct, per-patient variable cost reduction. Instead, the value is realized through enhanced hospital capacity and throughput. By freeing up beds more efficiently, the hospital can accommodate more admissions, which in turn generates new revenue and improves the overall operating margin. Furthermore, for admissions under fixed-payment models, such as Medicare’s diagnosis-related groups (DRGs), each day the LOS is shortened directly increases the profitability of that specific case. Therefore, the model’s contribution should be framed as creating significant operational and financial opportunity at the system level, rather than simple per-patient savings. The 94.3 % coverage reliability enables administrators to plan against quantifiable prediction bounds (943 reliable cases per 1000 admissions), converting unpredictable capacity issues into manageable exceptions with predetermined escalation protocols. At healthcare system scale, such precision improvements could support more efficient resource utilization, though realized benefits depend on implementation and hospital workflow integration.
Risk-stratified protocols directly map uncertainty estimates to operational decisions with quantifiable impact. High-confidence predictions (Q1-Q2) enable substantial safety buffer reductions of 3.5 and 1.4 days per case respectively, while maintaining standard monitoring protocols. High-uncertainty predictions (Q5) receive appropriately conservative widening (6 % increase) while still achieving improved coverage reliability. This reduces unexpected LOS violations from 37 to 18 cases per 1000 admissions, a 51 % reduction in prediction failures among the most resource-intensive patients. For a typical hospital processing 1000 monthly admissions, this prevents approximately 19 surprise extended stays requiring intensive coordination and specialized discharge planning.
Safety governance requires systematic coverage monitoring using predefined service-level indicators. These include overall coverage alerts when performance falls below 94 % over 7-day periods, uncertainty-stratified monitoring with Q5 coverage thresholds set at 90 %, and outside-interval incident logging with root-cause analysis. Additionally, quarterly recalibration triggers are implemented. The framework supports auditable reliability claims (94.3 % empirical coverage) and transparent performance monitoring, aligned with FDA AI/ML device guidance and EU AI Act requirements for high-risk healthcare systems.
Methodological limitations and mitigation strategies
The primary limitation is under-coverage in the highest-uncertainty quintile (Q5: 90.9 % vs 95 % target, −4.1 percentage points), affecting precisely the patients where reliable bounds matter most. This occurs because a single global conformal multiplier cannot fully accommodate heteroscedastic error patterns, despite improved uncertainty discrimination through isotonic calibration (Pearson r from 0.203 to 0.476). While conformal theory guarantees marginal coverage under exchangeability, conditional coverage within subsets defined by predicted uncertainty is not ensured.
We recommend three mitigation strategies for Q5 under-coverage: (i) uncertainty-stratified calibration using separate conformal quantiles for each uncertainty quintile, preserving exchangeability within strata while addressing heteroscedastic patterns (see the sketch below); (ii) threshold-based conservative widening for cases exceeding the 75th percentile of posterior uncertainty, applying multiplicative factors (e.g., 1.2×) to interval widths; and (iii) enhanced clinical oversight protocols requiring senior review and conservative resource allocation for flagged high-uncertainty cases. Note that interval widths are influenced by our retention of extreme outliers (6 % with LOS > 20 days), a deliberate choice to ensure coverage validity across the full patient spectrum.
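A minimal sketch of mitigation strategy (i), computing a separate conformal quantile within each uncertainty quintile of the calibration set; this illustrates the idea rather than a validated clinical procedure:

```python
import numpy as np

def stratified_conformal_quantiles(resid_cal, sigma_cal, alpha=0.05,
                                   n_strata=5):
    """One conformal quantile per uncertainty stratum; exchangeability holds
    within each stratum, so each keeps its own finite-sample guarantee."""
    cuts = np.quantile(sigma_cal, np.linspace(0, 1, n_strata + 1))[1:-1]
    strata = np.searchsorted(cuts, sigma_cal, side="right")
    quantiles = {}
    for k in range(n_strata):
        s = np.sort(np.abs(resid_cal[strata == k]))
        m = len(s)
        quantiles[k] = s[min(int(np.ceil((m + 1) * (1 - alpha))), m) - 1]
    return cuts, quantiles  # route each test case by its sigma to a quantile
```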
The severe coverage failure of Post-hoc Bayesian hierarchical calibration alone (14.1 %) illustrates why posterior-derived intervals, even when well-calibrated, cannot substitute for formal coverage guarantees in safety-critical applications. Despite strong uncertainty-error discrimination after isotonic calibration (r = 0.476), our Bayesian post-hoc approach produced intervals that reflect model assumptions rather than empirical coverage under hierarchical clustering. We acknowledge that fully Bayesian methods like BART might perform differently, but our results definitively show that post-hoc Bayesian calibration alone is insufficient. This reinforces the necessity of distribution-free conformal adjustment for clinical deployment, regardless of the sophistication of the uncertainty model.
Additional limitations include reliance on administrative data lacking clinical granularity that might improve uncertainty calibration, use of 2019 data preceding COVID-19 disruptions that altered healthcare patterns, and focus on absolute residual conformity scores that may not capture all relevant uncertainty patterns. A primary direction for future work is to integrate richer data sources, such as unstructured clinical notes and time-series lab values from electronic health records. Such data could significantly improve the Bayesian model’s ability to discriminate uncertainty, potentially resolving the residual under-coverage in the highest-risk quintile. This would complement further methodological research into uncertainty-stratified conformal variants that maintain coverage guarantees while addressing heteroscedastic patterns.
Systematic monitoring protocols address these limitations operationally: coverage tracking by uncertainty quintile with alert thresholds (Q5 < 90 % over 7 days), quarterly recalibration triggers based on distribution-shift detection, and prospective validation in diverse clinical settings before widespread deployment.
A design consideration is that our primary cross-validation allows hospitals to appear across splits, enabling the model to leverage hospital-specific patterns. This reflects deployment scenarios where predictions are made for hospitals with historical data in the training dataset. For deployment to entirely new hospitals without historical data, our grouped cross-validation analysis (Sect. 4.8) provides relevant performance estimates, showing stable coverage under hospital holdout conditions.
Broader impact and future work
Our framework demonstrates that the apparent trade-off between coverage validity and adaptive precision can be resolved through methodological integration rather than choosing one approach over the other. This principle extends beyond healthcare to any domain with hierarchical data requiring reliable uncertainty bounds.
Future research should focus on approaches that preserve finite-sample validity while addressing current limitations. Promising directions include developing uncertainty-stratified conformal methods that guarantee coverage within risk strata while handling heteroscedastic error patterns, designing conformal procedures tailored to hierarchical data that maintain group-conditional exchangeability and enable reliable deployment across institutional clusters without site-specific recalibration, and creating lightweight adaptive techniques that improve instance-level precision while retaining the computational efficiency necessary for real-time clinical systems.
Conclusions
We presented a hybrid framework that combines Post-hoc Bayesian hierarchical calibration with conformal calibration to achieve reliable uncertainty quantification in hierarchical healthcare data. In multi-hospital length-of-stay prediction, the hybrid method achieved near-target coverage (94.3 %) with adaptive precision gains ranging from 3.5 days narrower intervals for low-uncertainty cases to appropriately conservative sizing for high-uncertainty cases (adaptation ratio = 1.29). Standard conformal met the nominal target (95.0 %) with uniform intervals, while Post-hoc Bayesian hierarchical calibration approaches severely under-covered (14.1 %), confirming that conformal calibration is indispensable even when underlying uncertainty estimates are well-calibrated. The resulting intervals demonstrated stability across validation folds and hospital subgroups, supporting deployment across diverse healthcare settings without site-specific recalibration.
Future work should focus on uncertainty-stratified conformal designs preserving finite-sample validity, lightweight adaptive procedures for clustered data maintaining group-conditional exchangeability, and richer clinical covariates with prospective monitoring. Overall, the principled combination of hierarchical random forests for point prediction, Bayesian post-processing for instance-level uncertainty estimation, and group-aware conformal calibration for coverage validity offers a practical path to reliable and adaptive uncertainty quantification for safety-critical hierarchical prediction.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Award No. DGE-2125362. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Author contributions
Marzieh Amiri Shahbazi: Conceptualization, Methodology, Formal analysis, Investigation, Visualization, Writing-original draft, Writing-review and editing, Validation. Ali Baheri: Supervision, Methodology, Writing-review and editing. Nasibeh Azadeh-Fard: Supervision, Resources, Data curation, Validation, Writing-review and editing.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
The data used in this study were obtained from the National Inpatient Sample (NIS), Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. These data are available for purchase from HCUP at https://www.hcup-us.ahrq.gov/nisoverview.jsp. The authors do not have permission to share the data directly.
Code availability
Code is available from the corresponding author upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-37450-w.
References
- 1. Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1(1), 18 (2018).
- 2. Stone, K., Zwiggelaar, R., Jones, P. & Mac Parthaláin, N. A systematic review of the prediction of hospital length of stay: Towards a unified framework. PLOS Digit. Health 1(4), 0000017 (2022).
- 3. Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA 319(13), 1317–1318 (2018).
- 4. McCarthy, K. et al. Improving hospital length of stay prediction through machine learning. J. Healthc. Manag. 66(4), 272–285 (2021).
- 5. Harris, S. L. et al. Bed management and length of stay. Int. J. Health Care Qual. Assur. 31(8), 1014–1027 (2018).
- 6. Werner, E. et al. Explainable hierarchical clustering for patient subtyping and risk prediction. Exp. Biol. Med. 248(24), 2547–2559 (2023).
- 7. Bollmann, S., Groll, A. & Havranek, M. M. Accounting for clustering in automated variable selection using hospital data: A comparison of different lasso approaches. BMC Med. Res. Methodol. 23(1), 280 (2023).
- 8. Austin, P. C. Using multilevel models and generalized estimating equation models to account for clustering in neurology clinical research. Neurology 103(8), 209947 (2024).
- 9. Berta, P., Levantesi, S. & Marta, G. Uncover mortality patterns and hospital effects in COVID-19 heart failure patients: A novel multilevel logistic cluster-weighted modeling approach. arXiv preprint arXiv:2405.11239 (2024).
- 10. Dimick, J. B. et al. Hospital volume and surgical outcomes: Is more always better? New England J. Med. 369, 1073–1075 (2013).
- 11. Freeland, K. N. et al. Length of stay determinants in hospital medicine. Am. J. Med. Qual. 28(5), 357–365 (2013).
- 12. Kaboli, P. J. et al. Associations between reduced hospital length of stay and 30-day readmission rate and mortality. Ann. Intern. Med. 157(12), 837–845 (2012).
- 13. Shahbazi, M. A. & Azadeh-Fard, N. Hierarchical data modeling: A systematic comparison of statistical, tree-based, and neural network approaches. Machine Learning with Applications, 100688 (2025).
- 14. Hajjem, A. Mixed effects trees and forests for clustered data (University of Montreal, ProQuest LLC, 2010).
- 15. Wager, S. & Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 113(523), 1228–1242 (2018).
- 16. Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B. & Snoek, J. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems 32 (2019).
- 17. Mentch, L. & Hooker, G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17(26), 1–41 (2016).
- 18. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In: International Conference on Machine Learning, pp. 1321–1330 (PMLR, 2017).
- 19. Sendak, M. P. et al. Machine learning in health care: A critical appraisal of challenges and opportunities. EGEMs 7(1), 1–15 (2019).
- 20. Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence. Lancet Digit. Health 2(10), 537–548 (2020).
- 21. Vovk, V., Nouretdinov, I. & Gammerman, A. Linearly time efficient nonconformity measure for conformal prediction. Ann. Math. Artif. Intell. 56(1), 83–99 (2009).
- 22. Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J. & Wasserman, L. Distribution-free predictive inference for regression. J. Am. Stat. Assoc. 113(523), 1094–1111 (2018).
- 23. Angelopoulos, A. N. et al. Conformal prediction: A gentle introduction. Found. Trends Mach. Learn. 16(4), 494–591 (2023).
- 24. Barber, R. F., Candes, E. J., Ramdas, A. & Tibshirani, R. J. Conformal prediction beyond exchangeability. Ann. Stat. 51(2), 816–845 (2023).
- 25. Tibshirani, R. J., Foygel Barber, R., Candes, E. & Ramdas, A. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems 32 (2019).
- 26. Gelman, A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper) (2006).
- 27. Carlin, B. P. & Xia, H. Assessing environmental justice using Bayesian hierarchical models: Two case studies. Journal of Exposure Analysis & Environmental Epidemiology 9(1) (1999).
- 28. Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434 (2017).
- 29. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J. & Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
- 30. Varshney, K. R. Engineering safety in machine learning. In: 2016 Information Theory and Applications Workshop (ITA), pp. 1–5 (IEEE, 2016).
- 31. Meehinkong, P. & Ponnoprat, D. coverforest: Conformal predictions with random forest in Python. arXiv preprint arXiv:2501.14570 (2025).
- 32. Arthur, R. et al. Planning hospital capacity to save lives during the COVID-19 pandemic and beyond. Eur. J. Oper. Res. 295(3), 1202–1214 (2022).
- 33. Helm, J. E. et al. Improving hospital efficiency and patient flow. Prod. Oper. Manag. 20(3), 385–397 (2011).
- 34. Proudlove, N. C. et al. Can good bed management solve the overcrowding in accident and emergency departments? Emerg. Med. J. 20(2), 149–155 (2003).
- 35. Chrusciel, J. et al. The prediction of hospital length of stay using unstructured data. BMC Med. Inform. Decis. Mak. 21(1), 351. 10.1186/s12911-021-01722-4 (2021).
- 36. Ciaroni, S. et al. Machine learning-based prediction of hospital prolonged length of stay admission at emergency department: A gradient boosting algorithm analysis. Front. Artif. Intell. 6, 1179226. 10.3389/frai.2023.1179226 (2023).
- 37. Young, W. J. Prediction of hospital length of stay. Health Serv. Res. 8(4), 287–295 (1973).
- 38. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
- 39. Sela, R. J. & Simonoff, J. S. RE-EM trees: A data mining approach for longitudinal and clustered data. Mach. Learn. 86, 169–207 (2012).
- 40. Pellagatti, M., Masci, C., Ieva, F. & Paganoni, A. M. Generalized mixed-effects random forest: A flexible approach to predict university student dropout. Stat. Anal. Data Min.: ASA Data Sci. J. 14(3), 241–257 (2021).
- 41. Tyralis, H. & Papacharalampous, G. A review of predictive uncertainty estimation with machine learning. Artif. Intell. Rev. 57(4), 94 (2024).
- 42. Shafer, G. & Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research 9(3) (2008).
- 43. Principato, G., Stoltz, G., Amara-Ouali, Y., Goude, Y., Hamrouche, B. & Poggi, J.-M. Conformal prediction for hierarchical data. arXiv preprint arXiv:2411.13479 (2024).
- 44. Dunn, R., Wasserman, L. & Ramdas, A. Distribution-free prediction sets for two-layer hierarchical models. J. Am. Stat. Assoc. 118(544), 2491–2502 (2023).
- 45. Murphy, K. P. Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, 2012).
- 46. Chipman, H. A., George, E. I. & McCulloch, R. E. BART: Bayesian additive regression trees. Ann. Appl. Stat. 4(1), 266–298 (2010).
- 47. Tan, Y. V. & Roy, J. Bayesian additive regression trees and the general BART model. Stat. Med. 38(25), 5048–5069 (2019).
- 48. Linero, A. R. Bayesian regression trees for high-dimensional prediction and variable selection. J. Am. Stat. Assoc. 113(522), 626–636 (2018).
- 49. Pratola, M. T., Chipman, H. A., George, E. I. & McCulloch, R. E. Heteroscedastic BART via multiplicative regression trees. J. Comput. Graph. Stat. 29(2), 405–417 (2020).
- 50. Vehtari, A., Gelman, A. & Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27, 1413–1432 (2017).
- 51. Fong, E. & Holmes, C. C. Conformal Bayesian computation. Adv. Neural Inf. Process. Syst. 34, 18268–18279 (2021).
- 52. Stanton, S., Maddox, W. & Wilson, A. G. Bayesian optimization with conformal prediction sets. In: International Conference on Artificial Intelligence and Statistics, pp. 959–986 (PMLR, 2023).
- 53. Zhang, Y., Park, S. & Simeone, O. Bayesian optimization with formal safety guarantees via online conformal prediction. IEEE J. Sel. Top. Signal Process. 19(1), 45–59 (2024).
- 54. Ekmekci, C. & Cetin, M. Conformalized generative Bayesian imaging: An uncertainty quantification framework for computational imaging. arXiv preprint arXiv:2504.07696 (2025).
- 55. Podina, L., Rad, M. T. & Kohandel, M. Conformalized physics-informed neural networks. arXiv preprint arXiv:2405.08111 (2024).
- 56. FDA: Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. https://www.fda.gov/media/145022/download (2021).
- 57. European Commission: Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:32024R1689 (2024).
- 58. Davis, J. et al. Machine learning in medicine: Addressing ethical challenges. PLoS Med. 17(11), 1003391 (2020).
- 59. Agency for Healthcare Research and Quality (AHRQ): HCUP-US NIS Overview. https://hcup-us.ahrq.gov/nisoverview.jsp. Accessed July 22, 2025 (2025).
- 60. Agency for Healthcare Research and Quality (AHRQ): Nationwide data use agreement - HCUP. https://hcup-us.ahrq.gov/team/NationwideDUA.jsp. Accessed July 22, 2025 (2024).
- 61. Guan, L. Localized conformal prediction. Ann. Stat. 50(3), 1631–1656 (2022).