Toward generalizable and interpretable machine learning models in healthcare: Insights from ICU outcome predictions

Lasse Bohlen; Julian Rosenberger; Nico Hambauer; Daniel Zähringer; Volkmar Franz; Patrick Zschech; Mathias Kraus

doi:10.1007/s10729-026-09760-y

2026 Jun 10;29(2):23. doi: 10.1007/s10729-026-09760-y

PMCID: PMC13253713 PMID: 42268427

Abstract

The application of machine learning (ML) models in healthcare management offers high potential. In particular, resource allocation and operational decision-making in intensive care units (ICUs) can benefit from ML predictions, leading to improvements in patient outcomes and operational efficiency. However, the generalizability of these models across diverse hospital settings with potentially different patient populations remains a critical challenge. This study examines the generalizability of ML-based ICU outcome prediction models built using external data. We utilize data from two sources: a European University Hospital (EUH) dataset from Universitätsklinikum Carl Gustav Carus Dresden, Germany and the Medical Information Mart for Intensive Care (MIMIC)-IV database, representing different healthcare systems and patient populations. Our approach evaluates multiple models of varying architectures and complexity across three common prediction tasks in ICU settings (mortality, length of stay, and readmission), analyzes the impact of data availability on model performance, and applies interpretability techniques to identify features and scenarios where models succeed or fail in new environments. We found that locally trained models generally outperform those using external data when sufficient local data is available. Low and medium complexity models, such as generalized additive models, demonstrate significantly superior generalizability compared to high complexity models and require substantially less local data for high-quality predictions, offering evidence-based guidance for healthcare managers dealing with limited data resources. Our results demonstrate how interpretability techniques can identify dataset differences that hinder generalizability, providing valuable insights for healthcare practitioners in implementing ML solutions across diverse hospitals. This research contributes to the development of more generalizable and interpretable ML models in healthcare.

Keywords: Healthcare, Intensive Care Units, Machine Learning, Generalizability, Interpretability

Highlights

This study evaluates how well machine learning models for predicting ICU patient outcomes (mortality, length of stay, readmission) perform when transferred between different hospitals, using data from European and US healthcare systems.
We find that simpler, more interpretable models maintain significantly better performance when transferred between hospitals compared to complex algorithms, challenging the assumption that more complex and sophisticated models are always better for healthcare applications.
Healthcare institutions can use data availability to inform evidence-based decisions about adopting ML: external models can be useful when there is very little or no local data; low or medium complexity models tend to be preferable when there are small to medium sized datasets; and it is only when there are large locally available datasets that more complex and sophisticated models achieve superior performance.
Our interpretability analysis provides healthcare managers with practical tools to identify which clinical features may cause external models to fail in their specific hospital setting, enabling safer model validation before implementation.

Introduction

Intensive care units (ICUs) represent a critical and high-cost component of healthcare systems [1]. Complex decisions must be made to ensure both high-quality care and efficient resource utilization [2]. To support these decisions, effective ICU management increasingly looks towards data-driven approaches, with machine learning (ML) offering promising predictive capabilities [3, 4]. Using ML models, reliable predictions of patient outcomes can provide vital input for optimizing bed management [5], anticipating staffing needs [6], informing discharge planning [7], and improving overall operational efficiency within the ICU [8].

Following trends in the broader ML field, ML models in healthcare have become increasingly complex [9, 10]. This complexity is often justified by arguments that sophisticated architectures can better capture the multifaceted nature of patient data [11, 12]. This has led the majority of research to favor architectures such as neural networks or ensemble methods to enhance predictive accuracy [13]. However, relatively little is known about the practical requirements for successfully deploying such complex models across diverse healthcare settings. In particular, complex ML models require large amounts of well-organized, labeled, and cleaned data to find trustworthy patterns that generalize well. Unfortunately, these data requirements are primarily met only by large healthcare institutions, creating a critical gap where smaller institutions cannot develop their own sophisticated ML models [14].

As a remedy, many healthcare institutions, especially smaller ones or those with constrained data and computational resources, increasingly seek to leverage models developed at facilities with greater data resources [15]. However, the utility of these externally developed models depends entirely on their ability to generalize successfully, that is, to maintain satisfactory performance when applied to distinct patient populations beyond the initial cohort used during development [16]. This quality becomes crucial when predictive models are deployed across different hospitals or healthcare systems, where variations in patient demographics, clinical practices, and data collection procedures can lead to inaccurate predictions [17, 18]. Moreover, successful implementation faces substantial organizational challenges beyond technical performance. Healthcare managers must navigate trade-offs when deciding whether to develop local models, adopt external solutions, or adapt existing models to their institutions. These decisions are further complicated by differences in electronic health record systems, organizational structures, and availability of local training data [11, 19].

The development of generalizable predictive models in healthcare, particularly for ICUs, is an active area of research [20]. Many studies have focused on building and externally validating ML models, but results are often mixed when these models are applied in new clinical environments [21–23]. Advanced techniques like federated learning [24] or transfer learning [25] aim to improve model performance across sites, especially when local data is limited.

However, current research faces several limitations in addressing generalizability challenges. First, the impact of model complexity on generalizability remains largely underexplored, particularly in the context of clinical environments. Second, little attention has been given to the role of local training data. In particular, it is unclear whether small datasets can be used to optimize a locally developed model, or if it would be more effective to adopt a more sophisticated external model that has been trained on a much larger dataset. Third, most studies rely on aggregate performance metrics without examining the underlying reasons why models succeed or fail in new settings. As a result, healthcare managers lack clear guidance on when and how externally developed ML models can be safely applied to clinical and operational decision-making.

To address these critical gaps in knowledge, our study undertakes a systematic investigation into the factors shaping ML model generalizability in the ICU context. Our research is guided by three central research questions:

RQ1: How does model complexity influence cross-hospital generalizability of ICU prediction models?

RQ2: When local training data is limited, to what extent do external models or additional external data improve predictive performance?

RQ3: How can interpretability techniques reveal why models succeed or fail when transferred between healthcare institutions?

Through answering these questions, we aim to provide healthcare managers with guidance for identifying suitable models and data strategies when implementing predictive systems, particularly in settings with limited local data availability.

Our investigation explores the generalizability of ML models across two distinct ICU environments: a European University Hospital (EUH) dataset from Universitätsklinikum Carl Gustav Carus Dresden, Germany, and the publicly available Medical Information Mart for Intensive Care (MIMIC)-IV database. This cross-institutional approach allows us to assess how model complexity influences generalizability when ICU prediction models are transferred between healthcare institutions. We focus on three commonly studied ICU prediction tasks (mortality, length of stay, and readmission) to ensure clinical relevance and validity. In addition, we consider the impact of data availability by simulating contexts where access to local data is limited. To deepen our understanding of these dynamics, we apply interpretability techniques to identify clinical features contributing to shifts in model behavior between settings. This approach provides practical insights into model selection, data requirements, and the risks associated with implementing externally developed models in new clinical environments.

Our analysis yields three key contributions for healthcare management. First, we demonstrate that model complexity is a key factor in cross-hospital generalizability, with low and medium complexity models significantly outperforming high complexity alternatives when transferred between institutions. Statistical analysis across three ICU prediction tasks reveals that more complex models experience disproportionate performance degradation when applied to different hospitals.

Second, we empirically identify patterns in data availability that can inform model selection strategies. Across prediction tasks, low complexity models perform well with a small amount of local training data, whereas medium and high complexity models require progressively larger datasets to achieve favorable local performance. This suggests that institutions with limited local data may benefit more from training low or medium complexity models locally than from adopting more complex external solutions.

Third, we demonstrate how interpretability techniques can reveal why models succeed or fail when transferred across hospitals, identifying specific clinical features where consistent physiological relationships enable successful deployment versus those where hospital-specific data collection practices create deployment risks. This provides healthcare managers with practical tools for validating external models without requiring access to proprietary training data.

The remainder of this paper is structured as follows. Section 2 reviews relevant literature on ICU prediction tasks, model complexity-generalizability relationships, and interpretability techniques. Section 3 details our experimental design, data sources, model selection, and evaluation approach for assessing cross-hospital generalizability. Section 4 presents our empirical findings across generalizability evaluation, data availability analysis, and interpretability insights. Section 5 discusses our key findings, their practical and theoretical implications for healthcare management, and identifies future research directions, while Section 6 concludes our work.

Research background

Understanding the generalizability of ML models in healthcare requires examining three key aspects that inform our research approach. First, we establish the clinical context by examining the specific prediction tasks where ML models are deployed in ICU settings and their relevance to clinical decision-making (Section 2.1). Second, we discuss the aspect of local data availability and explore how ML model complexity may affect generalizability across hospitals (Section 2.2). Finally, we examine how different levels of model complexity relate to interpretability and how the latter can be used as a tool for a better understanding of generalizability patterns (Section 2.3). Together, these aspects provide the foundation for our empirical investigation of which modeling approaches are most suitable for cross-hospital deployment.

ICU machine learning tasks and clinical decision making

ML has shown immense promise in healthcare, offering powerful tools to support clinical decision-making by creating rapid and accurate predictions based on complex, high-dimensional data [41–44]. In high-stakes environments such as the ICU, ML models are developed to predict patient outcomes, including the risk of life-threatening complications [45], readiness for transfer [39], length of stay [5], and likelihood of readmission [46]. These predictions can inform treatment decisions and help optimize resource allocation [36, 47, 48].

Our investigation focuses on three critical ICU prediction tasks that directly support clinical decision-making: mortality, length of stay in the ICU, and readmission to the ICU after discharge. These tasks were selected because they represent common applications of ML in intensive care settings and provide diverse profiles for evaluating model generalizability across different hospitals [12, 49].

Predicting a patient’s mortality is a common approach to assessing illness severity and acuity. As a definitive and consistently recorded outcome, mortality is frequently used as a target in ICU-focused ML research [5, 12]. Developed models are typically benchmarked against established scores such as APACHE IV [50] and SAPS III [51]. The goal is to identify high-risk patients early, enabling timely ICU transfers or staffing adjustments to ensure appropriate care. Studies have shown that patient mortality risk is closely linked to their acuity [26], and that patients with a high risk of dying often require more intensive nursing care [27]. As a result, acuity is commonly used in nurse-to-patient assignment problems to support balanced and fair workload distribution [28–30].

Estimating a patient’s length of stay is crucial for ICU resource planning and discharge coordination. Scheduling models, such as those by Turhan and Bilgen [34] and Heider et al. [35], rely on historical length of stay data to optimize patient admissions and surgical scheduling. Recent work by Shi et al. [36] leverages ML to handle long-tailed length of stay distributions, enhancing scheduling in complex settings. Individual length of stay predictions can further support early discharge planning by informing patients and caregivers about care plans and warning signs [37]. Moreover, prolonged ICU length of stay has been associated with increased resource use. Identifying patients at risk of prolonged stay can prompt a search for alternative care options, underlining the importance of accurate length of stay forecasting for both operational and clinical outcomes [52].

Predicting a patient’s readmission is essential for supporting safer discharge decisions from the ICU. Critical care professionals often base their discharge decisions on workload pressure and the ongoing demand for ICU beds [53]. In resource-limited settings, clinicians frequently face the challenge of identifying patients who are stable enough to be transferred to a general ward in order to free up ICU capacity for more critical cases. However, these decisions are complex, and readmissions to the ICU are associated with worse outcomes, including higher mortality rates, longer hospital stays, and increased costs [7, 54]. Identifying patients at risk of readmission can support safer discharge decisions and improve continuity of care [2, 39]. ML models are being explored as a promising approach to anticipate post-discharge deterioration. Accurate prediction could help prevent avoidable harm, optimize patient flow, and reduce the strain on critical care resources [46, 55, 56].

Table 1 summarizes the clinical importance and decision-making applications of these three prediction tasks. As illustrated, each task addresses distinct operational challenges: mortality prediction supports acuity-based resource allocation and workload planning, length of stay prediction enables proactive capacity management and surgical scheduling, and readmission prediction informs discharge safety protocols and care transitions. The diversity of these clinical applications, spanning from immediate bedside decisions to strategic resource planning, provides an effective foundation for evaluating the generalizability of ML models in various ICU settings and for different situations.

Table 1.

Clinical importance and decision-making applications of ICU prediction tasks

Prediction Task	Clinical Importance	Associated Decision Tasks
Mortality	Proxy for illness severity and acuity [26]; correlates with nursing workload [27]	Acuity-based nurse assignment [28–30]; nurse re-scheduling after workload disruption [31]; admission control [32]
Length of stay	Enables discharge date estimation [33]; lower daily nursing workload for long stay patients [27]	Admission planning [2, 34]; surgery scheduling with downstream ICU capacity [35, 36]; early discharge planning [37]
Readmission	Identifies post-discharge risk within specified time frames; quality indicator for care transitions; increased mortality risk and higher cost for readmitted patients [38]	Patient discharge and transfer decisions [39]; ICU readmission as quality measure [40]


While these prediction tasks offer clinical value, their successful implementation across healthcare institutions faces substantial organizational challenges that extend beyond technical performance. Healthcare managers must navigate complex trade-offs when deciding whether to develop models locally, adopt external solutions, or adapt existing models to institutional contexts [19]. These decisions are complicated by heterogeneity in electronic health record systems, organizational structure, and data availability, all of which can limit model generalizability [11]. Understanding how model complexity shapes both generalizability and organizational acceptance is therefore critical for effective healthcare management decision-making [57].

Model complexity and generalizability

The following section provides context for our first two research questions. First, we ask how model complexity might impact generalizability across hospitals (RQ1). Second, we discuss under what conditions it might be preferable to train simpler models locally rather than adopt or adapt complex, externally developed models, particularly when local data availability is limited (RQ2).

In our context, generalizability refers to a model’s ability to maintain satisfactory performance when applied to distinct patient populations beyond the initial cohort used during development [16]. This quality is crucial when predictive models are created using external data and deployed in new environments, such as different hospitals [17]. The importance of generalizability becomes pressing when considering the potential risks of applying ML models in new settings, where variations in patient demographics and healthcare practices can lead to inaccurate and harmful predictions [18].

Generalizability provides insight into how robust a model is across different contexts and settings, revealing a dimension of performance measurement beyond the local setting where training data originates [16]. This is particularly relevant when multiple sites may benefit from a single model rather than training individual models for every location. Such model sharing is especially valuable when training data is scarce and large sample collection is unfeasible [15].

Complex models are capable of learning intricate and highly specific patterns within the training data, but these patterns are not necessarily transferable to other locations. Recent studies investigating ML model applicability across hospitals have yielded mixed results. The widely adopted Epic Sepsis model exemplifies these divergent outcomes. Wong et al. [21] critique the model’s performance when applied to a Michigan, U.S. hospital and raise concerns about sepsis care quality, while Cull et al. [45] affirm its effectiveness using data from Greenville, U.S. Similarly, a Hepatitis B prediction model successfully transferred between Nigerian hospitals but failed when applied to an Australian cohort [22]. These conflicting findings underscore the challenges associated with ML model generalizability in healthcare.

In many real-world settings, hospitals have only limited local data available, which constrains their ability to train complex models from scratch. Recent research offers potential solutions through federated learning, which allows hospitals to collaborate on model training without sharing data [24, 58], and transfer learning, which adapts models trained at one hospital for use at another using small amounts of local data [25, 59]. However, while these studies measure and report model performance across scenarios, they often fail to identify the exact reasons for performance differences or why model adjustments are necessary. Furthermore, these studies rarely consider whether training a simpler model directly on the limited local data might sometimes yield higher predictive performance than adapting a complex external model.

We examine how different model complexity levels influence both local performance and cross-hospital generalizability. We hypothesize that high model complexity may result in excellent performance at the hospital where training data originated but struggle to generalize effectively to other hospitals due to "local overfitting", where models learn hospital-specific patterns that are not applicable in other settings [20]. Additionally, we investigate how the level of local data availability interacts with model complexity to shape performance.

The challenge of estimating a model's generalizability before transferring it is compounded by the interpretability limitations of complex models. As models become more flexible and incorporate vast amounts of data and features, it becomes difficult for clinicians to understand the relationship between input features and model predictions [60–62]. When such complex models are transferred to new hospitals, it is often unclear whether they will perform reliably or how they will behave in unfamiliar settings.

Model complexity and interpretability

The generalizability challenges discussed in the previous section raise a critical question: how can we understand why models succeed or fail when transferred across hospitals? This question directly connects to our third research question (RQ3) about using interpretability techniques to explain generalizability patterns. Understanding the relationship between model complexity and interpretability is essential for addressing this challenge.

Achieving high predictive performance in healthcare often requires ML models with high flexibility to capture complex patterns in high-dimensional data [41]. Flexible model types, such as artificial neural networks and boosted decision trees, are well-suited for learning non-linear relationships and intricate feature interactions [63]. However, greater flexibility often comes at the cost of increased model complexity, making the model structure harder to interpret as it captures multi-level and sometimes opaque patterns across features [64].

This trade-off between flexibility and interpretability becomes particularly critical when considering cross-hospital generalizability. Interpretable ML refers to ML models and techniques that provide explanations for their predictions in terms understandable to human users, enabling them to comprehend, trust, and effectively manage the prediction process [65]. Without understanding the reasoning behind model predictions, it remains unclear why models succeed or fail in new environments or how they will respond to unfamiliar scenarios, making them potentially dangerous to use. When complex models fail to generalize effectively, it often remains difficult to identify the underlying reasons, since models may rely on hospital-specific patterns that do not transfer to other settings. Interpretable ML enables medical professionals to understand the decision logic behind ML models completely, ensuring that predictions align with domain knowledge and can be trusted for clinical decision-making across different environments [61, 66].

Two main strategies exist to address the interpretability challenge in the context of generalizability assessment. The first approach involves limiting model complexity through various constraints such as sparsity, linearity, or additivity [67]. By limiting complexity, the decision logic becomes more comprehensible to clinical stakeholders, making it easier to identify when and why models may struggle in new environments. Examples include logistic regression (linear and additive), decision trees (rule-based), and generalized additive models (additive but non-linear). However, this approach risks compromising the predictive performance that is critical in high-stakes healthcare applications [64].
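To make the distinction between "linear and additive" and "additive but non-linear" concrete, the two model forms can be written side by side. This is the standard textbook formulation rather than notation taken from this study; the symbols $\beta_0$, $\beta_j$, and $f_j$ are introduced here for illustration, and $m$ denotes the number of clinical features.

```latex
% Logistic regression: each feature enters through a single coefficient
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \sum_{j=1}^{m} \beta_j x_j

% Generalized additive model: each feature enters through its own
% (possibly non-linear) shape function f_j, still without interactions
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \sum_{j=1}^{m} f_j(x_j)
```

In both cases the contribution of a single feature can be plotted on its own, which is what allows a clinician to inspect one feature at a time.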

The second strategy maintains model flexibility while applying additional techniques to reconstruct the decision logic of complex models with post-hoc explanation methods [67]. Common post-hoc explanation methods include SHapley Additive exPlanations (SHAP) [68] and Local Interpretable Model-agnostic Explanations (LIME) [69]. These methods provide insights into learned feature-outcome relationships, such as identifying positive or negative feature effects, determining feature importance, and revealing trends across different feature values. For generalizability assessment, these techniques can reveal which features contribute to successful or failed cross-hospital transfer. However, post-hoc explanations are approximations to the underlying model rather than exact representations [65].
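The idea behind Shapley-based attribution can be illustrated with a minimal sketch that computes exact Shapley values for a small toy model. It is not the SHAP library and not the procedure used in this study: the function name `shapley_values`, the toy model, and the choice to represent an "absent" feature by a single baseline value are all simplifying assumptions made here for illustration, and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Assumption: a feature that is "absent" from a coalition is replaced
    by its value in `baseline` (one reference patient).
    """
    n = len(x)

    def value(present: set) -> float:
        # Evaluate the model with only the features in `present` taken from x.
        z = [x[j] if j in present else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Toy model (hypothetical): one additive term and one interaction term.
def toy_model(z: List[float]) -> float:
    return 2.0 * z[0] + z[1] * z[2]


phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved in it.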

Figure 1 demonstrates how body temperature affects mortality prediction across three different model complexity levels. As model complexity increases from left to right, the feature's effect becomes increasingly difficult to interpret, illustrating the fundamental trade-off between model complexity and interpretability. This example highlights why interpretability becomes crucial when assessing whether learned patterns will generalize across different clinical environments.

Fig. 1 — Low, medium, and high complexity models from left to right. An exemplary feature "Mean Body Temperature" has different effects on the model output for mortality prediction depending on the ML model complexity

While most research focuses on the trade-off between interpretability and performance when choosing between constraining complexity and using post-hoc explanations [62, 70], we additionally consider the complexity–generalizability trade-off. In this study, we focus on two key dimensions: evaluating models of varying complexity levels to understand the complexity-generalizability relationship, and applying interpretability techniques to explain why certain models generalize better than others across hospital settings. This comprehensive approach enables us to provide practical guidance for healthcare institutions considering cross-hospital ML model deployment.

Methodology

Our methodology addresses the three research questions established in Section 1: how model complexity influences generalizability across hospitals (RQ1), to what extent external models or additional external data improve predictive performance when local training data is limited (RQ2), and how interpretability techniques can help understand these generalizability patterns (RQ3). This section outlines our research approach, data sources, model selection, experimental design, and evaluation approach to investigate these questions.

Generalizability definition and operationalization

We define generalizability as a model’s ability to maintain satisfactory performance when applied to distinct patient populations beyond the initial cohort used during development [16]. In healthcare settings, this translates to how well models trained at one hospital perform when deployed at another institution with different patient demographics, clinical practices, and data collection procedures [11, 19].

We operationalize this concept using two datasets: $\mathcal{D}_S$ (source hospital) and $\mathcal{D}_T$ (target hospital), both containing the same $m$ clinical features with $n_S$ and $n_T$ patients respectively. Each patient is represented by a tuple $(x_i, y_i)$, where $x_i \in \mathbb{R}^m$ describes the patient's clinical features and $y_i \in \{0, 1\}$ indicates the task-specific outcome.

Our objective is to train ML models $f \in \mathcal{H}$ from a hypothesis space $\mathcal{H}$ that perform well on prediction tasks.

We systematically compare two scenarios: (1) local models where $f_T$ minimizes loss over patients in $\mathcal{D}_T$ and is evaluated in the same hospital, and (2) external models where $f_S$ minimizes loss over patients in $\mathcal{D}_S$ but is evaluated on $\mathcal{D}_T$. Comparing both approaches with models of different complexities provides insights into how model complexity impacts cross-hospital generalizability.
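The local-versus-external comparison can be sketched as a small evaluation harness. This is an illustrative sketch only and makes several assumptions that are not taken from the text: performance is measured with AUROC, the local model is trained on a training split of the target hospital, and `fit_centroid_scorer` is a toy stand-in for a real learning algorithm. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Patient = Tuple[List[float], int]        # (clinical features, binary outcome)
Scorer = Callable[[List[float]], float]  # maps features to a risk score


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(data: Sequence[Patient]) -> Scorer:
    """Toy model: project x onto the direction joining the two class means."""
    m = len(data[0][0])
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(m)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(m)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    return lambda x: sum((x[j] - mid[j]) * direction[j] for j in range(m))


def compare_local_external(
    fit: Callable[[Sequence[Patient]], Scorer],
    source: Sequence[Patient],
    target_train: Sequence[Patient],
    target_test: Sequence[Patient],
) -> Dict[str, float]:
    """Train once per hospital, then evaluate both models on the same target test set."""
    local, external = fit(target_train), fit(source)
    xs = [x for x, _ in target_test]
    ys = [y for _, y in target_test]
    auc_local = auroc([local(x) for x in xs], ys)
    auc_external = auroc([external(x) for x in xs], ys)
    return {"local": auc_local, "external": auc_external, "gap": auc_local - auc_external}


# Made-up data in which the feature-outcome relationship is reversed between hospitals.
source = [([1.0], 0), ([2.0], 0), ([3.0], 1), ([4.0], 1)]
target_train = [([1.0], 1), ([2.0], 1), ([3.0], 0), ([4.0], 0)]
target_test = [([1.5], 1), ([3.5], 0)]
result = compare_local_external(fit_centroid_scorer, source, target_train, target_test)
```

Evaluating both models on the same held-out target patients keeps the comparison fair: any difference in the reported gap comes from where the model was trained, not from which patients it was scored on.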

Task definitions

We define three binary classification tasks using routinely collected ICU data. As detailed in Section 2.1 and summarized in Table 1, these tasks encompass diverse clinical decision-making contexts with different temporal prediction windows.

Mortality prediction

Indicating whether the patient died during the ICU stay. We use clinical data from the first 24 hours of ICU admission to enable early identification of high-risk patients.

Length of stay prediction

Separating long-stay patients (7-day threshold) from shorter stays, where length of stay is calculated in hours. We use clinical data from the first 24 hours of admission to simulate an early prediction, which in principle enables proactive capacity planning.

Readmission prediction

Indicating readmission within 72 hours of ICU discharge. We use clinical data from the last 24 hours of the initial stay, including length of stay as an additional feature since it is available at discharge time. Patients who died during the initial ICU stay are excluded from this task to avoid prediction bias.
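The three task definitions can be expressed as label functions over a single ICU stay record. The sketch below is illustrative: the `IcuStay` fields and function names are hypothetical, and treating both the 7-day and the 72-hour boundaries as inclusive is an assumption made here, not a detail confirmed by the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IcuStay:
    """Hypothetical minimal record of one ICU stay."""
    died_in_icu: bool
    los_hours: float                        # length of stay in hours
    hours_to_readmission: Optional[float]   # None if the patient was never readmitted


def mortality_label(stay: IcuStay) -> int:
    """1 if the patient died during the ICU stay, else 0."""
    return int(stay.died_in_icu)


def long_stay_label(stay: IcuStay, threshold_days: float = 7.0) -> int:
    """1 for long-stay patients. Assumption: the threshold itself counts as long."""
    return int(stay.los_hours >= threshold_days * 24.0)


def readmission_label(stay: IcuStay, window_hours: float = 72.0) -> Optional[int]:
    """1 if readmitted within the window, 0 otherwise.

    Returns None for patients who died during the initial stay, since they
    are excluded from the readmission task.
    """
    if stay.died_in_icu:
        return None
    readmitted = (
        stay.hours_to_readmission is not None
        and stay.hours_to_readmission <= window_hours  # assumption: inclusive bound
    )
    return int(readmitted)


survivor = IcuStay(died_in_icu=False, los_hours=200.0, hours_to_readmission=48.0)
deceased = IcuStay(died_in_icu=True, los_hours=30.0, hours_to_readmission=None)
```

Returning `None` instead of 0 for deceased patients keeps the exclusion explicit, so those stays can be filtered out before training rather than silently counted as "not readmitted".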

Data sources and preprocessing

We use two large ICU datasets from geographically separate and independent healthcare institutions to evaluate cross-hospital generalizability. The datasets represent different healthcare systems and patient populations, providing an ideal testing environment for assessing model generalizability.

MIMIC-IV dataset

The Medical Information Mart for Intensive Care (MIMIC)-IV is a large-scale research dataset consolidating patient records from Beth Israel Deaconess Medical Center, collected between 2008 and 2022. MIMIC-IV is publicly available and represents one of the most widely used healthcare databases in ML research. For our study, we utilize 94458 de-identified health records across five ICUs within the medical center [71].

European university hospital (EUH) dataset

We utilize a proprietary dataset from Universitätsklinikum Carl Gustav Carus Dresden, Germany, covering surgical and anesthetic ICU admissions between 2004 and 2023. The dataset includes demographic and medical parameters comparable to MIMIC-IV data and encompasses 25062 patients.

Feature selection and alignment

We selected 28 clinical features based on their prevalence in prior ICU models, availability in both datasets, and input from clinical experts. To support generalizability, we focused exclusively on routinely collected patient measurements such as vital signs and laboratory values that are typically available within the first few hours of ICU admission. We deliberately excluded variables directly related to clinical decisions or contextual information, such as admission reasons, medications, or treatment choices, as these are more likely to vary across hospitals and may reduce model comparability.

Both datasets undergo standardized preprocessing to ensure consistency: (1) unit alignment between datasets (e.g., converting glucose from mmol/L to mg/dL), (2) age adjustment in EUH data to match MIMIC-IV’s anonymization protocol (maximum age 90), (3) outlier removal using predefined clinical thresholds, (4) temporal aggregation using mean values over specified time windows, (5) exclusion of patients with >50% missing features, (6) standardized feature scaling and encoding, and (7) k-nearest neighbors imputation for remaining missing values.
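Steps (1), (2), (3), and (5) of this pipeline can be sketched in a few lines. The sketch below is illustrative only: the record layout and the plausibility bounds in `BOUNDS` are hypothetical (the actual thresholds are given in Appendix A), and treating out-of-range values as missing is an assumption about how outliers are removed. Temporal aggregation, scaling, encoding, and k-nearest neighbors imputation are not shown.

```python
from typing import Dict, Optional

GLUCOSE_MGDL_PER_MMOLL = 18.016  # glucose molar mass is about 180.16 g/mol
MAX_AGE = 90                     # age cap stated in the text
# Hypothetical plausibility bounds; the real thresholds are in Appendix A.
BOUNDS = {"heart_rate": (20.0, 300.0), "glucose": (10.0, 2000.0)}

Record = Dict[str, Optional[float]]


def glucose_to_mgdl(mmol_per_l: float) -> float:
    """Step (1): convert glucose from mmol/L to mg/dL."""
    return mmol_per_l * GLUCOSE_MGDL_PER_MMOLL


def cap_age(age: float) -> float:
    """Step (2): cap age at the maximum used in the reference dataset."""
    return min(age, MAX_AGE)


def drop_outliers(rec: Record) -> Record:
    """Step (3): set values outside the bounds to missing (None)."""
    out = dict(rec)
    for name, (lo, hi) in BOUNDS.items():
        value = out.get(name)
        if value is not None and not (lo <= value <= hi):
            out[name] = None
    return out


def missing_fraction(rec: Record) -> float:
    """Share of features that are missing for one patient."""
    return sum(v is None for v in rec.values()) / len(rec)


def keep_patient(rec: Record, max_missing: float = 0.5) -> bool:
    """Step (5): exclude patients with more than 50% missing features."""
    return missing_fraction(rec) <= max_missing
```

Applying outlier removal before the missingness filter means that implausible values also count toward the 50% limit; whether the original pipeline counts them this way is not stated and is assumed here.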

Table 2 provides a comparative summary of the 28 clinical features for both datasets, showing mean values, standard deviations, and missing data rates. Notable differences include higher missing rates for certain features in EUH (e.g., respiratory rate: 41.91% vs. 0.23% in MIMIC-IV) and for others in MIMIC-IV (e.g., lactate: 39.98% vs. 0.27% in EUH), reflecting different data collection practices across institutions. Detailed explanations of all clinical features and specific outlier removal thresholds are provided in Appendix A.

Table 2.

Comparison of MIMIC-IV and EUH data for 28 features. Mean values, standard deviations (std), and missing data rates are shown for each dataset. Note that for the readmission task, we extracted the same features from the last 24 hours instead of the first 24 hours, resulting in slightly different values (see Appendix A)

Feature	MIMIC-IV Mean	MIMIC-IV Std	MIMIC-IV Missing (%)	EUH Mean	EUH Std	EUH Missing (%)
Age	63.67	16.41	0.00	64.04	16.62	0.01
Weight	82.65	23.75	2.49	77.74	19.61	37.60
Temperature	36.68	0.82	19.11	36.76	0.75	6.81
Respiratory Rate	19.26	3.82	0.23	17.22	3.97	41.91
Heart Rate	85.48	15.85	0.10	83.48	16.86	0.28
Glucose	136.73	48.38	0.87	138.23	35.51	0.28
Mean Blood Pressure	78.28	11.13	0.42	83.75	12.21	17.29
Potential Hydrogen	7.37	0.07	38.48	7.42	0.05	48.50
Glasgow Coma Scale Total	12.17	3.32	0.17	11.55	4.53	56.31
Gender (Female %)	43.27	–	0.00	40.24	–	0.00
Partial Pressure of O2	134.53	46.77	52.99	96.06	29.96	0.37
Fraction of Inspired O2	54.70	14.74	46.37	30.71	10.42	1.29
Potassium	4.18	0.55	0.16	4.13	0.42	0.38
Sodium	138.44	4.48	0.39	138.84	4.27	0.63
Leukocytes	12.21	7.75	1.06	11.49	5.92	4.89
Thrombocytes (Platelets)	205.75	106.81	0.61	223.60	109.84	4.82
Bilirubin	1.96	4.37	56.37	0.99	1.49	20.60
Bicarbonate	23.41	4.31	0.19	26.11	3.27	0.31
Hemoglobin	10.47	2.02	0.55	10.58	1.78	0.00
Prothrombin Time	1.45	0.61	15.06	1.37	0.35	5.15
Aspartate Aminotransferase	112.74	225.99	56.75	78.91	159.30	16.47
Alanine Aminotransferase	87.34	203.13	56.53	61.30	135.05	19.86
Partial Pressure of CO2	41.21	9.65	52.26	40.92	6.24	0.30
Albumin	3.16	0.60	73.22	2.98	0.54	28.64
Anion Gap	14.07	3.31	0.77	6.16	2.92	71.97
Lactate	2.16	1.53	39.98	1.48	1.57	0.27
Urea Nitrogen	26.32	21.66	0.23	41.42	32.20	15.33
Creatinine	1.43	1.50	0.10	1.07	0.99	5.79
Mortality (%)	7.30	–	0.00	5.70	–	0.00
Long stay (%)	14.30	–	0.00	9.20	–	0.00
Readmission (%)	6.50	–	0.00	7.80	–	0.00


Despite these differences, a convex hull analysis (Appendix B) indicated substantial overlap in feature ranges (over 90%) between the two datasets, a pattern that generally suggests strong potential for cross-hospital generalizability [72].

Model selection and complexity levels

We evaluate six ML models representing three distinct complexity levels, enabling systematic investigation of the relationship between model complexity and cross-hospital generalizability. As discussed in Section 2.3, model complexity affects both predictive performance and interpretability, with potential implications for generalizability that we investigate empirically.

Low complexity models

Logistic Regression (LR) serves as our baseline model, providing linear and additive relationships between features and outcomes with high interpretability. Decision Trees (DT) offer rule-based predictions through sequential splits that form clear decision paths. When constrained to shallow depths, DTs remain highly interpretable while capturing non-linear patterns through discrete splits.

Medium complexity models

Generalized Additive Models (GAMs) maintain the additive property of linear models while allowing non-linear feature-outcome relationships. We evaluate two GAM implementations: Explainable Boosting Machine (EBM) [73], which uses bagged and boosted tree ensembles to learn step functions, and Interpretable Generalized Additive Neural Networks (IGANN) [74], which employs extreme learning machines for gradient boosting. Both models provide inherent interpretability through visualizable feature shape functions while offering greater flexibility than linear models.
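The additive structure shared by both GAM variants can be illustrated with a toy model in which the log-odds are an intercept plus one shape function per feature. The sketch below uses step functions, similar in spirit to the piecewise-constant shapes that tree-based GAMs learn. All cut points, contributions, and the intercept are hypothetical values chosen for illustration, not fitted parameters from this study.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Tuple

# Each shape function is a step function: sorted cut points plus one
# contribution (in log-odds) per interval. All values are hypothetical.
ShapeFn = Tuple[List[float], List[float]]

SHAPES: Dict[str, ShapeFn] = {
    "age": ([50.0, 70.0], [-0.4, 0.0, 0.6]),
    "lactate": ([2.0, 4.0], [-0.2, 0.5, 1.2]),
}
INTERCEPT = -2.5


def shape_value(shape: ShapeFn, x: float) -> float:
    """Look up the contribution of the interval that contains x."""
    cuts, values = shape
    return values[bisect_right(cuts, x)]


def predict_proba(features: Dict[str, float]) -> float:
    """Sum the per-feature contributions and apply the logistic link."""
    logit = INTERCEPT + sum(
        shape_value(SHAPES[name], x) for name, x in features.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))
```

Because every feature contributes through its own one-dimensional function, each shape can be plotted and inspected separately, which is what makes such models interpretable by design.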

High complexity models

Extreme Gradient Boosting (XGB) and Multilayer Perceptrons (MLP) represent highly flexible models with universal approximation properties. XGB uses gradient boosting with decision trees to capture complex feature interactions and non-linear patterns. MLP employs multiple hidden layers with non-linear activation functions, enabling the learning of highly complex decision boundaries. Both models can approximate any continuous function on a compact domain but require post-hoc explanation methods, such as SHAP or LIME, for interpretability analysis.

All models are implemented using standard ML libraries with identical preprocessing pipelines. Implementation details are provided in Appendix C.

Experimental design and scenarios

Our experimental design evaluates model generalizability across multiple dimensions in order to address our research questions. We implement three complementary approaches. First, we conduct a generalizability evaluation to investigate how model complexity influences cross-hospital generalizability (RQ1). Second, we perform a data availability analysis to examine how external models or additional external data affect predictive performance (RQ2). Third, we perform an interpretability analysis to explore why models succeed or fail (RQ3).

Generalizability evaluation

We evaluate model generalizability by comparing two primary training scenarios for each model and task combination: (1) local data, where models are trained and tested on the same hospital dataset, representing the standard scenario in which sufficient local data is available, and (2) external data, where models are trained on one hospital dataset and tested on the other, representing the scenario in which models must transfer across institutions. This bidirectional evaluation (EUH ↔ MIMIC-IV) assesses generalizability in both directions, as transfer may not be symmetric due to dataset characteristics.

Figure 2 illustrates these training and evaluation scenarios. Local scenarios (white) represent optimal conditions where models are trained and tested on data from the same institution. External scenarios (light gray) assess generalizability by testing models trained at one hospital on data from another hospital. This design directly addresses RQ1 by enabling systematic comparison of how different model complexity levels affect cross-hospital performance.

Fig. 2. Evaluation and training settings are indicated by color: white for models trained and tested on local data, and light gray for models trained on one hospital's data and tested on the other hospital's data

Data availability analysis

To simulate real-world constraints faced by institutions with limited local data, we conduct experiments with varying local data availability. We incrementally increase local training samples from 250 to 16000, focusing on the EUH dataset as the target institution. For each sample size, we compare two training approaches: (1) models trained exclusively on the limited local data, and (2) models trained on the limited local data supplemented with the complete external MIMIC-IV dataset. This analysis identifies "break-even points" at which local-only models begin to outperform externally supplemented models, answering RQ2 and providing practical guidance for institutions with limited data resources.
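The break-even logic can be sketched as a comparison of two learning curves. The example below is a hedged illustration: the doubling schedule between 250 and 16000 samples is an assumption (only the endpoints are stated), and "break-even" is read here as the first sample size at which the local-only AUROC exceeds the supplemented AUROC. The helper names are hypothetical.

```python
from typing import Dict, List, Optional


def doubling_sizes(start: int = 250, stop: int = 16000) -> List[int]:
    """Sample sizes from start to stop; the doubling step is an assumption."""
    sizes = []
    n = start
    while n <= stop:
        sizes.append(n)
        n *= 2
    return sizes


def break_even(local: Dict[int, float],
               supplemented: Dict[int, float]) -> Optional[int]:
    """First sample size at which local-only AUROC exceeds supplemented AUROC.

    Both arguments map a local sample size to a mean AUROC. Returns None
    if the local-only models never overtake the supplemented ones.
    """
    for n in sorted(local):
        if local[n] > supplemented[n]:
            return n
    return None
```

With noisy curves, a stricter rule (for example, requiring the local-only model to stay ahead at all larger sizes) may be preferable; the simple first-crossing rule is used here only to keep the sketch short.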

Interpretability analysis

To address RQ3 and understand the mechanisms underlying generalizability patterns, we apply interpretability techniques to examine feature-outcome relationships across datasets and models. For inherently interpretable models (LR, DT, GAMs), we visualize the direct model logic and feature effects. For complex models (XGB, MLP), we employ SHAP values [68] to provide post-hoc interpretability. SHAP values assign approximated contribution scores to individual feature values for each prediction, enabling generation of global explanation plots that show how specific features affect predictions across the entire dataset. This approach allows consistent interpretability analysis across all model types and identifies specific features where models learn consistent versus inconsistent relationships across hospitals, revealing potential sources of generalizability challenges.
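To make the idea of contribution scores concrete, the sketch below computes exact Shapley values for a model with a handful of features by enumerating all feature coalitions. This is not the SHAP library and not the approximation used in practice: features outside a coalition are simply replaced by a fixed baseline value, which is a simplifying assumption, and the enumeration is only feasible for very few features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(coalition: frozenset) -> float:
        # Features in the coalition take their actual value, others the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Weighted marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

The contributions always sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that allows per-feature effects to be aggregated into global explanation plots.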

Evaluation strategy and metrics

We developed an evaluation protocol designed to provide robust generalizability assessments.

Performance metrics

We use Area Under the Receiver Operating Characteristic Curve (AUROC) as our primary performance metric. AUROC is widely adopted for binary classification in ICU prediction tasks [12, 75] and offers threshold-independent evaluation, enabling fair comparison across hospitals without requiring model calibration. This metric is particularly suitable for our generalizability assessment because it remains consistent across different prevalence rates and decision thresholds. Thus, it is generally recommended for model comparisons [76].
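The threshold independence of AUROC follows from its interpretation as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following minimal sketch computes it directly from that definition; it is quadratic in the number of samples and meant for illustration, not as the implementation used in this study.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.

    Ties count as one half. Quadratic in sample size, so this is for
    illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ordering of scores matters, any strictly increasing rescaling of the scores leaves the value unchanged, and so does replicating every negative case, which changes prevalence but not the score distribution within each class.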

Generalizability assessment

We measure generalizability using two complementary criteria that capture different aspects of cross-hospital performance.

Model-specific Generalizability Loss measures the performance degradation that each model experiences when it is transferred between hospitals, that is, the drop in AUROC observed for a model trained on a source hospital and evaluated on a target hospital relative to the local scenario. This metric reflects the expected performance drop when externally trained models are applied locally.

Since a model with larger model-specific losses may be preferable in external scenarios due to its potentially higher base performance, we consider a second metric.

Comparative Generalizability Loss measures the performance of a model relative to the best-transferring model. This metric highlights the performance gap between a given model and the optimal choice for cross-hospital deployment, providing practical guidance for model selection in external scenarios.
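One way to formalize the two losses is sketched below. The notation is assumed here for illustration and need not match the symbols of the original equations: $M_{s \to t}$ denotes model $M$ trained on source hospital $s$ and evaluated on target hospital $t$, and $\mathcal{M}$ is the set of evaluated models. Taking the locally trained model $M_{t \to t}$ as the reference for the model-specific loss is also an assumption; the same model evaluated on its own hospital, $M_{s \to s}$, would be an alternative reference.

```latex
% Illustrative notation (assumed, not taken from the original equations):
%   M_{s \to t}  : model M trained on hospital s, evaluated on hospital t
%   \mathcal{M}  : set of all evaluated models
\begin{align}
  L_{\text{model}}(M) &= \mathrm{AUROC}\bigl(M_{t \to t}\bigr)
                        - \mathrm{AUROC}\bigl(M_{s \to t}\bigr), \\
  L_{\text{comp}}(M)  &= \max_{M' \in \mathcal{M}} \mathrm{AUROC}\bigl(M'_{s \to t}\bigr)
                        - \mathrm{AUROC}\bigl(M_{s \to t}\bigr).
\end{align}
```

Under this reading, both losses are non-negative whenever local training is at least as good as external training, and the best-transferring model has a comparative loss of zero.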

Evaluation protocol

We implement a standardized evaluation strategy to ensure reliable performance estimates and maintain strict train-test separation in all scenarios. Each dataset is initially split into training and test sets in an 80/20 ratio, and this division is maintained throughout all experiments.

Within each of the resulting training and test sets, five slightly different folds are generated in a cross-validation pattern. This results in five training folds and five test folds for each hospital. For each experiment, we optimize the hyperparameters via grid search with cross-validation on the five training folds, using ranges informed by established best practices (see Appendix C for details). The optimal parameters are selected based on the average validation AUROC.

Using the optimal parameters, we then train five models (one per training fold) and evaluate each of them on all five test folds. This results in 25 evaluations per model and scenario. This design yields robust performance estimates together with a measure of their uncertainty, while strictly preventing any overlap between training and test data, a critical consideration when the data come from different hospitals. Figure 3 visualizes our evaluation strategy.

Fig. 3. Overview of the training and evaluation strategy. Hyperparameters are optimized using 5-fold cross-validation on the training data and selected based on the average validation AUROC. Using these hyperparameters, five models are trained (one per training fold) and each is evaluated on all five test folds, resulting in 25 evaluations per model and scenario
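The fold structure behind the 25 evaluations can be sketched with index lists. How exactly the "slightly different" folds are built is not specified, so the sketch assumes that each fold leaves out one fifth of the respective set, as in standard cross-validation. The dataset size of 1000 and the function names are hypothetical.

```python
import random
from typing import List, Tuple


def split_80_20(n: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle n sample indices and split them 80/20 into train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]


def leave_one_part_out(indices: List[int], k: int = 5) -> List[List[int]]:
    """k overlapping folds; fold j omits every k-th element starting at j."""
    return [[v for pos, v in enumerate(indices) if pos % k != j]
            for j in range(k)]


train, test = split_80_20(1000)
train_folds = leave_one_part_out(train)  # five training folds
test_folds = leave_one_part_out(test)    # five test folds
# One model per training fold, each scored on every test fold.
evaluation_pairs = [(m, t) for m in range(len(train_folds))
                    for t in range(len(test_folds))]
```

Since all training folds are drawn from the training split and all test folds from the test split, no sample can appear on both sides of any of the 25 model and test-fold pairs.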

Statistical analysis

The 25 evaluations per model enable robust statistical comparison across models and scenarios. Following Demšar [77], we employ the non-parametric Friedman test to detect significant differences in model performance (for both model-specific and comparative generalizability loss), followed by the Nemenyi post-hoc test for pairwise comparisons. These rank-based tests assign the lowest rank to models that lose the least performance when tested on external hospital datasets, providing a robust measure of generalizability. Therefore, lower ranks are better. This approach handles multiple comparisons effectively and allows visualization of critical differences between models [77]. Further details on the ranking procedure can be found in Appendix D.
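The ranking step can be sketched as follows. The code computes average ranks (rank 1 for the smallest loss), the Friedman chi-square statistic, and the Nemenyi critical difference as described by Demšar. It is a simplified illustration: no tie correction is applied to the Friedman statistic, and treating the 25 evaluations as the number of blocks is an assumption about how the procedure in Appendix D is set up.

```python
from math import sqrt
from typing import List


def average_ranks(losses: List[List[float]]) -> List[float]:
    """Mean rank per model; losses[i][j] is the loss of model j in
    evaluation i. Rank 1 goes to the smallest loss; ties share ranks."""
    n, k = len(losses), len(losses[0])
    totals = [0.0] * k
    for row in losses:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of the tied 1-based ranks
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(losses: List[List[float]]) -> float:
    """Friedman chi-square statistic (without tie correction)."""
    n, k = len(losses), len(losses[0])
    ranks = average_ranks(losses)
    return 12 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )


def nemenyi_cd(k: int, n: int, q_alpha: float) -> float:
    """Critical difference in average rank, following Demšar."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))
```

For six models and 25 evaluations, `nemenyi_cd(6, 25, 2.850)` gives a critical difference of about 1.51 average ranks, where 2.850 is the critical value for six models at a significance level of 0.05 from Demšar's table. Two models whose average ranks differ by less than this value would not be considered significantly different.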

Results

This section presents our results studying the relationship between model complexity and cross-hospital generalizability in ICU outcome prediction. Following the methodology outlined in Section 3, we systematically evaluate six models of varying complexity across three prediction tasks using data from two geographically distinct healthcare institutions.

Section 4.1 reports model performance across local and external scenarios with statistical assessment of generalizability (RQ1). Section 4.2 examines how data availability constraints interact with model complexity to inform practical deployment decisions (RQ2). Section 4.3 applies interpretability techniques to reveal why certain models succeed or fail when transferred between institutions, providing deeper insights into the observed generalizability patterns (RQ3).

Generalizability evaluation

We evaluate model generalizability by comparing local models (trained and tested on the same hospital dataset) with external models (trained on one hospital dataset and tested on another). The results in Table 3 reveal consistent patterns across all three ICU prediction tasks, with local models systematically outperforming external models. However, the magnitude and characteristics of generalizability challenges vary substantially across prediction tasks.

Table 3.

Model performance (AUROC ± standard deviation) for the three ICU classification tasks across train and test set combinations. Gray fields show scenarios with external test sets. The model-specific and comparative generalizability losses are also reported
