Abstract
Forecasting inpatient mortality (IM) and discharges against medical advice (DAMA) provides essential insights for healthcare quality monitoring and hospital management. This study compared six time-series forecasting methods—ARIMA, Grey Model, NNETAR, LSTM, Prophet, and Chronos, a pretrained probabilistic model—to predict monthly IM and DAMA in two tertiary hospitals in China from January 2018 to December 2024. Model performance was evaluated using RMSE, MAE, and MAPE. Chronos demonstrated the best predictive accuracy for IM across both hospitals, achieving the lowest MAPE values (26.96–33.37%) and outperforming traditional and deep learning approaches (Diebold–Mariano test, p < 0.05). For DAMA forecasting, Chronos performed optimally (MAPE = 5.52%) in the hospital with higher and more stable DAMA volumes, whereas NNETAR yielded relatively superior results (MAPE = 11.29%) in the hospital with smaller and more irregular time series. LSTM consistently showed limited generalizability, likely due to small sample sizes and model complexity. These findings indicate that pretrained models such as Chronos can deliver robust and scalable forecasting performance even with limited data, while simpler neural networks like NNETAR may better handle low-volume, noisy data. Implementing these models in hospital management systems could enhance the timeliness and precision of quality monitoring, enabling proactive responses to adverse clinical and operational trends.
Keywords: Deaths numbers, Discharge against medical advice numbers, Time-series, Forecasting, Hospital management
Subject terms: Health care, Mathematics and computing, Medical research
Introduction
Healthcare performance indicators serve as essential quantitative instruments for systematically evaluating and continuously enhancing the quality of care, occupying a strategic and central role within the clinical quality management framework1. Among them, inpatient mortality (IM) and discharges against medical advice (DAMA) represent key outcome measures for monitoring healthcare quality. IM refers to deaths that occur during hospitalization, whereas DAMA describes situations in which patients voluntarily leave the hospital contrary to medical recommendations2, which is associated with a higher likelihood of adverse outcomes, including elevated risks of readmission, mortality, and increased healthcare costs3,4. Although not all IM or DAMA events are preventable, rising or unstable trends often indicate deficiencies in clinical practice, care coordination, or institutional challenges such as staffing shortages and resource constraints, particularly affecting vulnerable populations5,6. These indicators should be interpreted within the broader clinical and organizational context.
Growing availability of routinely collected hospital data has stimulated interest in quantitative forecasting approaches to support quality monitoring and managerial decision-making. Existing forecasting studies of IM have predominantly relied on retrospective methods, such as last-year’s corresponding period values, historical averages, last-value-carried-forward approach or classical univariate time-series methods, especially ARIMA model, typically applied in single-center settings7,8. While these approaches are well established and interpretable, classical univariate time-series models such as ARIMA rely on strong assumptions of linearity and stationarity, whereas simpler retrospective methods implicitly assume temporal stability across periods. These assumptions may be violated in real-world hospital settings characterized by structural changes and non-stationary dynamics. Moreover, relatively few studies have systematically compared these methods with more flexible machine learning or deep learning models under identical data conditions. As a result, evidence remains limited on the relative advantages and limitations of alternative forecasting paradigms for IM in routine hospital practice. By contrast, research on DAMA has largely focused on descriptive epidemiology and regression-based analyses aimed at identifying patient-level risk factors2,9. Hospital-level time-series forecasting of DAMA volumes or rates has received far less attention, despite its direct relevance for staffing, bed management, and targeted intervention planning. Although time-series methods have proven effective in related hospital management tasks, such as forecasting bed occupancy or discharge volumes10,11, their systematic application to DAMA forecasting remains underdeveloped.
Meanwhile, methodological advances in time-series forecasting have expanded the range of available analytical tools. Beyond traditional statistical models, neural network–based approaches such as NNETAR and recurrent architectures including long short-term memory (LSTM) networks have been increasingly applied to capture nonlinear temporal dependencies. More recently, pretrained transformer-based time-series models, exemplified by Chronos, have been proposed as a means of leveraging large and heterogeneous datasets to deliver robust forecasting performance, even in data-limited settings. However, empirical evidence comparing pretrained models with conventional statistical and neural approaches in healthcare contexts—particularly for low-frequency and high-variability indicators such as IM and DAMA—remains limited.
Against this background, the present study systematically evaluates six representative forecasting approaches—ARIMA, Grey Model, NNETAR, LSTM, Prophet, and the pretrained Chronos model—for predicting monthly IM and DAMA in two tertiary hospitals located in distinct regions of China. By integrating multiple methodological paradigms within a unified comparative framework, this study addresses three key gaps in the existing literature: limited comparative evaluation of IM forecasting models, the relative absence of hospital-level DAMA forecasting studies, and the lack of empirical assessment of pretrained time-series models in routine hospital quality monitoring. The findings aim to inform the selection of forecasting strategies that are both methodologically robust and practically scalable for data-driven hospital management across heterogeneous clinical settings.
Materials and methods
Data source
Monthly IM and DAMA time-series data were extracted from routinely collected hospital administrative records within the electronic medical record (EMR) systems of two tertiary Class A hospitals in China. At each institution, original inpatient-level records were retrieved by the respective Departments of Medical Records in accordance with established data governance procedures. IM and DAMA events were identified based on discharge disposition and in-hospital outcomes recorded in the EMR system, using consistent administrative definitions across hospitals and throughout the study period.
Data preprocessing consisted of data quality assessment, temporal aggregation, and dataset structuring, rather than complex transformations. Data completeness and internal consistency were examined prior to aggregation, and no missing values were identified; therefore, no imputation or record exclusion procedures were required. The inpatient-level records were subsequently aggregated into monthly counts of IM and DAMA for each hospital to construct univariate time-series datasets suitable for forecasting analysis. No normalization, detrending, seasonal adjustment, smoothing, or outlier removal was applied prior to model implementation, as the objective was to evaluate model performance under real-world hospital data conditions.
The study period spanned from January 2018 to December 2024, yielding 84 monthly observations per indicator per hospital. For model development, data from January 2018 to December 2023 (72 observations; 85.7%) were used as the training set, while data from January to December 2024 (12 observations; 14.3%) were reserved as the test set, reflecting a realistic forecasting scenario in which historical data are used to predict outcomes in the most recent year. The two hospitals were intentionally selected to represent geographically distinct tertiary institutions within the same national healthcare system, thereby enhancing external validity and enabling evaluation of forecasting model robustness across heterogeneous institutional contexts. One hospital is located in a western plateau region (altitude: 3,650 m), while the other is situated in an eastern plain region (altitude: 20 m). Both institutions are tertiary Class A referral hospitals with comparable functional roles, standardized administrative reporting practices, and stable bed capacity throughout the study period. This design allows assessment of model performance across diverse real-world hospital environments while minimizing confounding related to institutional level or reporting standards. The reported altitude difference is included solely as a regional contextual characteristic and was not treated as an analytical or explanatory variable in the modeling process. To facilitate interpretation of the underlying temporal structure, time-series plots of IM and DAMA over the full study period (2018–2024) are provided in the Results section, illustrating overall trends, variability, and potential seasonal patterns prior to forecasting. All datasets were fully anonymized before analysis. Ethical approval was obtained from the institutional review boards of both hospitals, and informed consent was waived due to the retrospective use of anonymized administrative data.
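The chronological hold-out described above amounts to a simple index-based split; a minimal Python sketch, with `monthly_counts` as a hypothetical stand-in for one hospital's 84 monthly values:

```python
# Hypothetical series of 84 monthly counts (Jan 2018 - Dec 2024).
monthly_counts = list(range(84))

# Chronological split: no shuffling, the most recent year is held out.
train = monthly_counts[:72]   # Jan 2018 - Dec 2023 (72 obs; 85.7%)
test = monthly_counts[72:]    # Jan 2024 - Dec 2024 (12 obs; 14.3%)
```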
Statistical analysis
This study aimed to forecast the monthly volumes of IM and DAMA for the most recent year available (2024) in both hospitals. The workflow followed a structured sequence of steps, as illustrated in Fig. 1. After exploratory data analysis (EDA), the model training phase applied six separate models for subsequent comparison: the Autoregressive Integrated Moving Average (ARIMA) model, the Grey Model (GM), the Neural Network Autoregression (NNETAR) model, the Long Short-Term Memory (LSTM) network, Facebook's Prophet, and Amazon's Chronos. The models were then evaluated on the test set. Model performance was assessed using the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), and the Diebold–Mariano (DM) test was employed to statistically compare predictive accuracy between models.
Fig. 1.
Workflow for forecasting monthly IM and DAMA.
Statistical models
Model development involved training 6 separate models for comparative analysis. This section introduces them briefly.
ARIMA: ARIMA(p, d, q) is a widely used statistical model for univariate time-series forecasting. It combines differencing with autoregressive and moving-average components12. The ARIMA(p, d, q) model is expressed as:
$$\phi(B)\,(1 - B)^{d}\, y_t = \theta(B)\,\varepsilon_t \tag{1}$$

Where:
$y_t$: original series.
$d$: differencing order.
$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ (AR polynomial).
$\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$ (MA polynomial).
$\varepsilon_t$: white noise error.
Here $B$ denotes the backshift operator, $B y_t = y_{t-1}$.
Prior to model fitting, stationarity was assessed using the Augmented Dickey–Fuller (ADF) test, supported by visual inspection of time-series plots and ACF/PACF patterns. When non-stationarity was identified, differencing was applied until approximate stationarity was achieved, with the differencing order $d$ selected to avoid over-differencing. The AR and MA orders $(p, q)$ were identified based on ACF/PACF diagnostics, and final model selection was guided by the Akaike Information Criterion (AIC). Model adequacy was evaluated using residual diagnostics, including the Ljung–Box test. Figure 2 presents the flowchart for the ARIMA model.
Fig. 2.
Flow Chart of ARIMA Model.
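As a small illustration of the differencing step (the "I" in ARIMA), the following standard-library Python sketch applies first differencing and inverts it; it is illustrative only, not the fitting code used in the study:

```python
def difference(series):
    """First differencing: y'_t = y_t - y_{t-1} (removes a linear trend)."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

def undifference(diffed, anchor):
    """Invert first differencing by cumulative summation, anchored at the
    last observed pre-differencing level `anchor`."""
    out, prev = [], anchor
    for v in diffed:
        prev += v
        out.append(prev)
    return out
```

Repeating `difference` d times corresponds to the $(1-B)^d$ term in Eq. (1).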
GM: GM is a predictive modeling technique based on grey system theory, developed by Deng Julong in the early 1980s13, and is particularly suited for small-sample, non-negative time series. Among the grey system family, the GM(1,1) model is the most widely used. The first ‘1’ refers to a single variable. The second ‘1’ denotes a first-order differential equation used for modeling. GM(1,1) is typically effective for series exhibiting approximately monotonic or exponential trends. In this study, the IM and DAMA series satisfy the basic data requirements of GM(1,1), including small sample size (72 monthly observations in the training set) and non-negativity. However, these series do not consistently display strong monotonic or exponential growth patterns and instead show irregular fluctuations. Accordingly, GM was included as a representative small-sample benchmark model frequently reported in healthcare forecasting studies, rather than under the assumption that its ideal conditions are fully met.
The non-negative original time series (raw data) is

$$X^{(0)} = \left(x^{(0)}(1),\, x^{(0)}(2),\, \ldots,\, x^{(0)}(n)\right) \tag{2}$$

and its first-order accumulated generating operation (AGO) time series is

$$X^{(1)} = \left(x^{(1)}(1),\, x^{(1)}(2),\, \ldots,\, x^{(1)}(n)\right), \qquad x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i) \tag{3}$$

where $n$ is the sample size of the data.

Adjacent neighbour means are computed from the AGO series as:

$$z^{(1)}(k) = \frac{1}{2}\left(x^{(1)}(k) + x^{(1)}(k-1)\right), \qquad k = 2, 3, \ldots, n \tag{4}$$

The whitenization equation is:

$$\frac{dx^{(1)}}{dt} + a\,x^{(1)} = b \tag{5}$$

In this equation, $a$ is the developing coefficient and $b$ is the control variable; these are the two parameters of the GM(1,1) model, estimated by least squares from the grey differential equation $x^{(0)}(k) + a\,z^{(1)}(k) = b$.

Inverse AGO (restoring predicted values) is then applied, and the predicted original sequence is obtained by differencing:

$$\hat{x}^{(0)}(k) = \hat{x}^{(1)}(k) - \hat{x}^{(1)}(k-1) \tag{6}$$

with $\hat{x}^{(1)}(k) = \left(x^{(0)}(1) - \dfrac{b}{a}\right) e^{-a(k-1)} + \dfrac{b}{a}$.
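A compact standard-library implementation of the GM(1,1) steps above (AGO, neighbour means, least-squares estimation of a and b, time-response function, inverse AGO) might look like the following; it is a sketch for illustration, not the exact code used in the study:

```python
import math

def gm11_forecast(x0, steps=1):
    """GM(1,1) per Eqs. (2)-(6): AGO, adjacent neighbour means,
    least-squares estimation of (a, b), time-response function,
    then inverse AGO. Returns `steps` out-of-sample forecasts."""
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]                  # AGO, Eq. (3)
    z1 = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]     # Eq. (4)
    # Least squares for x0(k) = -a*z1(k) + b via normal equations.
    y = x0[1:]
    m = len(z1)
    sz, sy = sum(z1), sum(y)
    szz = sum(z * z for z in z1)
    szy = sum(z * v for z, v in zip(z1, y))
    c = (m * szy - sz * sy) / (m * szz - sz * sz)             # slope c = -a
    a = -c                                                    # assumes a != 0
    b = (sy - c * sz) / m
    # Time-response function for x1 (k = 1 reproduces the first value).
    def x1_hat(k):
        return (x0[0] - b / a) * math.exp(-a * (k - 1)) + b / a
    # Inverse AGO, Eq. (6): difference consecutive x1_hat values.
    return [x1_hat(n + s) - x1_hat(n + s - 1) for s in range(1, steps + 1)]
```

On a near-exponential series the one-step forecast tracks the true continuation closely, which is the regime GM(1,1) is designed for.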
NNETAR: NNETAR model is a type of parametric, nonlinear neural network specifically designed for time series forecasting. It combines autoregressive modeling with a feedforward neural network to capture both linear and nonlinear temporal dependencies14,15.
The modeling procedure involves two stages. First, the autoregressive lag order of the time series is determined. Second, a feedforward neural network is trained using the selected lagged observations as inputs, a single hidden layer, and one output node for forecasting, as illustrated in Fig. 3. The NNETAR model can be expressed as:
$$\hat{y}_t = f\left(y_{t-1},\, y_{t-2},\, \ldots,\, y_{t-p}\right) + \varepsilon_t \tag{7}$$

$$f\left(y_{t-1}, \ldots, y_{t-p}\right) = \beta_0 + \sum_{j=1}^{k} w_j\, g\!\left(b_j + \sum_{i=1}^{p} w_{ij}\, y_{t-i}\right) \tag{8}$$

Where:
$\hat{y}_t$: predicted value at time $t$.
$y_{t-1}, \ldots, y_{t-p}$: lagged values of the time series.
$f$: nonlinear function learned by the neural network.
$\varepsilon_t$: random error term (assumed to be white noise).
$p$: number of lagged inputs.
$k$: number of hidden nodes.
$w_{ij}$: weight from input $i$ to hidden node $j$.
$b_j$: bias for hidden node $j$.
$w_j$: weight from hidden node $j$ to the output.
$\beta_0$: output bias.
$g$: activation function.
In this study, the NNETAR model was implemented as a feedforward neural network with a single hidden layer using a logistic activation function. Lag selection followed a rule-based procedure: non-seasonal series used a single autoregressive lag (p = 1), whereas seasonal series included all lags up to the data frequency (lags 1 through 12 for monthly data). The number of hidden neurons was determined using the package's default rule-based setting (approximately half the number of input lags plus one, rounded to an integer). To improve forecast stability and reduce sensitivity to random initialization, the model was trained 20 times with different random seeds, and the final forecasts were obtained by averaging the results across runs.
Fig. 3.
Schematic architecture of the NNETAR model.
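The autoregressive embedding underlying Eq. (7) can be illustrated with a small helper that turns a series into lagged input rows and targets; this is a sketch (`lag_matrix` is a hypothetical name, not part of the R forecast package):

```python
def lag_matrix(series, p):
    """Build (inputs, targets) for an AR-style neural network:
    each input row holds the p most recent values (y_{t-1}, ..., y_{t-p})
    and the corresponding target is y_t."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append([series[t - i] for i in range(1, p + 1)])
        y.append(series[t])
    return X, y
```

Each row of `X` is one training example for the hidden layer in Eq. (8); for seasonal monthly data, p would extend to 12.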
LSTM: LSTM is a specialized form of Recurrent Neural Network (RNN) designed to model sequential and time series data with long-range temporal dependencies.
By introducing a memory cell and a set of gating mechanisms, LSTM overcomes the vanishing and exploding gradient problems commonly encountered in standard RNNs, enabling more effective learning of long-term patterns16.
At each time step $t$, the LSTM unit is governed by the following equations:

$$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{9}$$

$$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{10}$$

$$\tilde{c}_t = \tanh\!\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{11}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{12}$$

$$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{13}$$

$$h_t = o_t \odot \tanh\left(c_t\right) \tag{14}$$

where $x_t$ denotes the input at time $t$; $h_t$ and $c_t$ represent the hidden state and cell state, respectively; $f_t$, $i_t$, and $o_t$ correspond to the forget gate, input gate, and output gate; $\sigma$ is the sigmoid activation function; and $\odot$ denotes element-wise multiplication. These components and information flows17 are illustrated schematically in Fig. 4.
Fig. 4.

LSTM architecture and information flow.
In this study, the LSTM model was implemented in Python using the PyTorch framework. For each time series, the network received the previous four monthly observations as input (lag window = 4) and was configured as a one-step-ahead forecaster. The model was trained for a fixed 50 epochs without early stopping or dropout. Overfitting was controlled through the use of the AdamW optimizer with weight decay (L2 regularization), which penalizes excessively large weights and stabilizes training.
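To make the gating equations (9)–(14) concrete, here is a scalar, standard-library rendering of a single LSTM step; it is purely illustrative (the study's model used PyTorch's vectorized implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step with scalar input and state, following Eqs. (9)-(14).
    w[g] = (weight on x_t, weight on h_prev) and b[g] = bias for gate g."""
    f = sigmoid(w['f'][0] * x_t + w['f'][1] * h_prev + b['f'])    # forget gate, Eq. (9)
    i = sigmoid(w['i'][0] * x_t + w['i'][1] * h_prev + b['i'])    # input gate, Eq. (10)
    g = math.tanh(w['g'][0] * x_t + w['g'][1] * h_prev + b['g'])  # candidate, Eq. (11)
    o = sigmoid(w['o'][0] * x_t + w['o'][1] * h_prev + b['o'])    # output gate, Eq. (13)
    c = f * c_prev + i * g        # cell-state update, Eq. (12)
    h = o * math.tanh(c)          # hidden state, Eq. (14)
    return h, c
```

With all weights and biases at zero, every sigmoid gate evaluates to 0.5 and the candidate to 0, so the cell state simply halves: a quick sanity check on the gating arithmetic.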
Prophet: Prophet is an open-source forecasting tool developed by Facebook (now Meta). It reframes forecasting as a curve-fitting problem using a decomposable time-series model with trend, seasonality, and holiday components fitted with nonlinear smoothers14. The trend term (g(t)) captures non-periodic changes in the time series; Prophet supports both linear and logistic growth models, allowing flexibility in modeling trend saturation. The seasonal term (s(t)) models periodic effects, such as daily, weekly, or yearly seasonality, using a Fourier series representation that allows the model to learn complex patterns over time. The holiday term (h(t)) accounts for the impact of holidays or other special events that may cause deviations from the usual patterns. The general form of the Prophet model is:
$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t \tag{15}$$

Where:
$y(t)$: the observed time series,
$g(t)$: the trend component,
$s(t)$: the seasonal component,
$h(t)$: the holiday component,
$\varepsilon_t$: the error term, assumed to be normally distributed.
In our implementation, the trend term $g(t)$ was specified using Prophet's default linear trend (growth = "linear"). The seasonal component $s(t)$ was modeled as recurring within-year variation (yearly seasonality) using a Fourier series representation.
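The Fourier representation of s(t) can be sketched in a few lines of standard-library Python; this is illustrative only (Prophet estimates the coefficients a_n, b_n during fitting):

```python
import math

def fourier_seasonality(t, period, coeffs):
    """Prophet-style seasonal term as a truncated Fourier series:
    s(t) = sum_n a_n*cos(2*pi*n*t/period) + b_n*sin(2*pi*n*t/period),
    with coeffs = [(a_1, b_1), (a_2, b_2), ...]."""
    return sum(
        a * math.cos(2 * math.pi * n * t / period)
        + b * math.sin(2 * math.pi * n * t / period)
        for n, (a, b) in enumerate(coeffs, start=1)
    )
```

By construction the term repeats exactly every `period` units, which is what makes it suitable for recurring within-year variation (period = 12 for monthly data).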
Chronos: Chronos is a pretrained probabilistic time-series forecasting framework based on transformer architectures, designed to enable accurate forecasting with minimal task-specific tuning. Unlike traditional statistical models or neural networks trained directly on a target dataset, Chronos adopts a representation learning paradigm in which continuous time series are first scaled and quantized into discrete token sequences. These tokens are then modeled using a transformer architecture trained with a cross-entropy objective on large collections of time series from diverse domains18.
Through large-scale cross-domain pretraining, Chronos learns general temporal representations that can be transferred to unseen forecasting tasks in a zero-shot setting, without requiring retraining or parameter optimization on the target data. This characteristic distinguishes Chronos from task-trained neural models such as NNETAR and LSTM and allows it to serve as a pretrained benchmark in comparative forecasting studies.
In this study, we employed the pretrained chronos-bolt-small (48 M) model. The model was applied directly to the monthly IM and DAMA series to generate one-step-ahead probabilistic forecasts, enabling evaluation of its performance under realistic low-data conditions without additional fine-tuning.
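The scale-and-quantize idea behind Chronos-style tokenization can be sketched as follows; the bin count, clipping range, and mean-absolute scaling rule here are simplifying assumptions for illustration and do not reproduce the released Chronos tokenizer:

```python
def tokenize(series, num_bins=4096, clip=15.0):
    """Sketch of Chronos-style preprocessing: mean-scale the series,
    clip to [-clip, clip], then uniformly quantize into integer tokens."""
    scale = sum(abs(v) for v in series) / len(series)
    width = 2 * clip / num_bins
    tokens = []
    for v in series:
        s = max(-clip, min(clip, v / scale))
        tokens.append(min(num_bins - 1, int((s + clip) / width)))
    return tokens, scale

def detokenize(tokens, scale, num_bins=4096, clip=15.0):
    """Map tokens back to approximate real values (bin centres, rescaled)."""
    width = 2 * clip / num_bins
    return [(-clip + (tok + 0.5) * width) * scale for tok in tokens]
```

Round-tripping a series through the tokenizer loses at most half a bin width per value (after rescaling), which is the discretization error the transformer's cross-entropy objective operates over.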
Forecasting metrics
For this study, we applied three forecasting error metrics to evaluate the accuracy of these models. These measures are RMSE, MAE and MAPE, defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{16}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{17}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{18}$$

Where: $n$ is the number of observations, $y_i$ is the actual observed value, and $\hat{y}_i$ is the predicted value19.
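Equations (16)–(18) translate directly into a few lines of Python (a plain standard-library sketch):

```python
import math

def rmse(actual, pred):
    """Root mean square error, Eq. (16)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Mean absolute error, Eq. (17)."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error, Eq. (18), in percent.
    Assumes no zero actual values (true for monthly counts here)."""
    return 100.0 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, pred))
```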
For comparing the predictive accuracy of models, the Diebold–Mariano (DM) test is used. The DM test is implemented in R using the function forecast::dm.test(e1, e2, alternative = "two.sided", h = 1, power = 1). Under this specification, the loss function is the absolute error, $L(e_{m,t}) = |e_{m,t}|$, and the loss differential is defined as

$$d_t = \left|e_{1,t}\right| - \left|e_{2,t}\right|$$

The null hypothesis of the DM test is that the two models have equal expected loss. The test statistic is given by

$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\widehat{\mathrm{Var}}\left(\bar{d}\right)}} \tag{19}$$

where $\widehat{\mathrm{Var}}(\bar{d})$ denotes the estimated variance of the sample mean of the loss differential, $\bar{d} = \frac{1}{n}\sum_{t=1}^{n} d_t$. With the default variance estimator (varestimator = "acf") and h = 1, the long-run variance reduces to the contemporaneous variance of $d_t$, so that

$$\widehat{\mathrm{Var}}\left(\bar{d}\right) = \frac{\hat{\gamma}_0}{n} \tag{20}$$

where $\hat{\gamma}_0$ is the sample variance of $d_t$.
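A simplified standard-library sketch of the DM statistic under this specification (absolute-error loss, h = 1) is given below; note that forecast::dm.test uses an autocovariance-based variance estimator, so finite-sample values may differ slightly:

```python
import math

def dm_statistic(e1, e2):
    """Diebold-Mariano statistic with absolute-error loss and h = 1:
    d_t = |e1_t| - |e2_t|;  DM = mean(d) / sqrt(var(d) / n),
    using the contemporaneous (n - 1) sample variance of d_t."""
    d = [abs(a) - abs(b) for a, b in zip(e1, e2)]
    n = len(d)
    d_bar = sum(d) / n
    var = sum((v - d_bar) ** 2 for v in d) / (n - 1)
    return d_bar / math.sqrt(var / n)
```

A positive statistic indicates the first model incurred larger absolute errors on average; swapping the two error series flips the sign.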
Microsoft Excel 2021 was used for initial time-series dataset construction. Model development was conducted using R 4.4.3 for the ARIMA, GM, NNETAR, and Prophet models, and Python 3.12 was employed for implementing the LSTM network and the Chronos model. For all six forecasting approaches, the two hospitals were modeled independently.
Results
Figure 5 presents the monthly time-series trends of IM and DAMA for the two hospitals from 2018 to 2024. IM counts in both hospitals remained relatively low throughout the study period (means: 10.67 and 19.95, respectively) and were characterized by irregular month-to-month fluctuations without a clear long-term increasing or decreasing trend. Hospital 2 exhibited higher IM levels and greater variability than Hospital 1. In contrast, DAMA volumes were substantially higher than IM in both hospitals (means: 137.61 and 38.17, respectively). Hospital 1 showed comparatively stable DAMA patterns with moderate temporal variability, whereas Hospital 2 displayed lower DAMA volumes accompanied by greater relative irregularity across months. These descriptive patterns highlight clear differences in scale and variability between indicators and institutions, providing important contextual background for the subsequent forecasting analyses.
Fig. 5.
Trends of IM and DAMA in the two hospitals from 2018 to 2024.
Table 1 shows the characteristics of IM and DAMA data from January 2018 to December 2023. In Hospital 1, the IM count ranged from 4 to 23 per month, with a mean of 11.00 and a median of 10.5. The monthly DAMA volume varied from 45 to 212, with a mean of 136.61 and a median of 134.0. The standard deviations for IM and DAMA were 4.19 and 32.92, respectively, resulting in coefficients of variation (CV) of 38.09% and 24.09%. The distribution of IM in Hospital 1 showed moderate positive skewness (0.94) and leptokurtosis (3.70), suggesting slight right-tail asymmetry and peakedness. Conversely, the DAMA data were relatively symmetric (skewness = -0.23) and approximately mesokurtic (kurtosis = 3.13). For Hospital 2, IM exhibited a wider range (7 to 73) with a higher mean (20.56) and greater variability (standard deviation = 8.97; CV = 43.63%) compared to Hospital 1; its distribution was strongly right-skewed (skewness = 3.03) and markedly leptokurtic (kurtosis = 17.99), reflecting occasional months with unusually high death counts. DAMA in Hospital 2 ranged from 8 to 63 per month, with a mean of 36.21 and a CV of 33.83%, and was approximately symmetric and mesokurtic (skewness = -0.08; kurtosis = 2.92).
Table 1.
Descriptive statistics of IM and DAMA in training set.
| Statistic | IM (Hospital 1) | DAMA (Hospital 1) | IM (Hospital 2) | DAMA (Hospital 2) |
|---|---|---|---|---|
| Max | 23 | 212 | 73 | 63 |
| Min | 4 | 45 | 7 | 8 |
| Median | 10.5 | 134.0 | 19.0 | 37.0 |
| Standard deviation (Std.) | 4.19 | 32.92 | 8.97 | 12.25 |
| Mean | 11.00 | 136.61 | 20.56 | 36.21 |
| Coefficient of variation (CV, %) | 38.09 | 24.09 | 43.63 | 33.83 |
| Skewness | 0.94 | -0.23 | 3.03 | -0.08 |
| Kurtosis | 3.70 | 3.13 | 17.99 | 2.92 |
Figure 6 illustrates the monthly distribution patterns of IM and DAMA across the six-year training period for the two hospitals. Each boxplot represents the distribution of values for a given month, allowing assessment of central tendency, dispersion, and potential outliers. For Hospital 1, the upper panel shows that monthly DAMA volumes exhibit moderate variability across months, with higher medians observed in the middle of the year (May to August), suggesting potential seasonal or operational influences. The pattern is based on visual inspection of the time series rather than on formal statistical tests of seasonality. The interquartile range (IQR) is wider during these months, indicating increased variability. The lower panel of Hospital 1 indicates that IM values remain relatively stable across months, with medians consistently ranging between 10 and 15. January appears to have slightly higher variability, and several months include mild outliers, though no major deviation from the overall monthly trend is evident. In Hospital 2, monthly DAMA distributions (upper panel) are generally lower in magnitude than those in Hospital 1, with relatively tight IQRs and limited outliers. Medians remain relatively stable throughout the year, indicating consistent patient discharge behavior with minimal seasonal fluctuation. IM values in Hospital 2 (lower panel) show more month-to-month variability compared to Hospital 1, particularly in the early part of the year (January to March), where both the spread and presence of outliers are more pronounced. While most months show medians around 20, months like February and July demonstrate broader dispersion, suggesting the presence of atypical mortality patterns in those periods.
Fig. 6.
Monthly distribution trends of IM and DAMA (2018–2023).
Table 2 presents the forecasting performance of the six models on IM data for both hospitals, on both the training and test sets. For Hospital 1, the Chronos model showed the best predictive performance on the test dataset, achieving the lowest RMSE (2.6612), MAE (2.1506), and MAPE (33.3685%). The NNETAR model achieved good training performance (MAPE = 28.693%) but showed notable degradation on test data (MAPE = 69.4149%). The LSTM model did not perform competitively: it exhibited relatively large RMSE, MAE, and MAPE on both the training and test sets, with test errors better only than NNETAR's, indicating limited predictive accuracy and generalization in this application. For Hospital 2, a similar trend was observed. The Chronos model again delivered the most accurate forecasts on the test data, with a MAPE of 26.9575%. The LSTM model performed poorly, with the highest errors across all metrics. Overall, Chronos consistently outperformed all other models across both institutions, highlighting its potential as a robust, general-purpose forecasting tool for IM.
Table 2.
The results of the models for IM cases.
| Hospital | Model | RMSE (train) | MAE (train) | MAPE (train, %) | RMSE (test) | MAE (test) | MAPE (test, %) |
|---|---|---|---|---|---|---|---|
| Hospital 1 | ARIMA | 4.16 | 3.1667 | 33.0679 | 3.3912 | 2.8333 | 44.2847 |
| Hospital 1 | GM | 4.0117 | 3.0895 | 32.7135 | 3.5282 | 2.9631 | 46.2586 |
| Hospital 1 | NNETAR | 3.3218 | 2.6566 | 28.693 | 6.0692 | 4.6592 | 69.4149 |
| Hospital 1 | Prophet | 3.4858 | 2.7823 | 28.1396 | 4.0447 | 3.4802 | 48.3677 |
| Hospital 1 | LSTM | 4.59 | 3.7943 | 43.3676 | 4.1013 | 3.5864 | 53.2915 |
| Hospital 1 | Chronos | 4.6869 | 3.3545 | 28.6911 | 2.6612 | 2.1506 | 33.3685 |
| Hospital 2 | ARIMA | 7.7803 | 5.4258 | 28.2898 | 7.7009 | 5.879 | 41.5226 |
| Hospital 2 | GM | 8.8711 | 5.6445 | 31.2939 | 6.8515 | 5.93 | 42.5013 |
| Hospital 2 | NNETAR | 6.3513 | 4.6445 | 26.5944 | 8.2555 | 6.417 | 44.2799 |
| Hospital 2 | Prophet | 7.4467 | 4.9273 | 26.6499 | 9.026 | 6.6756 | 46.7492 |
| Hospital 2 | LSTM | 16.9262 | 15.0141 | 95.576 | 19.4453 | 18.7838 | 125.485 |
| Hospital 2 | Chronos | 9.3439 | 6.0181 | 33.8896 | 4.4039 | 3.8081 | 26.9575 |
Table 3 presents the forecasting performance of the six models on DAMA data for both tertiary hospitals. For Hospital 1, the Chronos model demonstrated the strongest generalization to unseen data, achieving the lowest test RMSE (11.0939), MAE (8.0941), and MAPE (5.5177%). Although NNETAR exhibited the lowest training error (RMSE = 19.3885; MAPE = 12.2291%), it did not fully maintain this advantage on the test set, suggesting some degree of overfitting. ARIMA also performed reasonably well given the relatively stable DAMA patterns at this site, while the GM, LSTM, and Prophet models performed poorly. For Hospital 2, the NNETAR model demonstrated relatively superior predictive performance on the test dataset, achieving the lowest RMSE (7.5667), MAE (5.8258), and MAPE (11.2867%). In contrast, more complex deep learning-based models such as LSTM and Chronos showed considerably weaker performance on the test data. The Chronos model yielded the highest test RMSE (20.8802), MAE (18.3003), and MAPE (34.7463%), followed closely by LSTM. The comparative weakness of Chronos and LSTM at this site may reflect the lower magnitude and greater irregularity of DAMA counts in Hospital 2, conditions under which simpler neural network architectures may adapt more effectively.
Table 3.
The results of the models for DAMA cases.
| Hospital | Model | RMSE (train) | MAE (train) | MAPE (train, %) | RMSE (test) | MAE (test) | MAPE (test, %) |
|---|---|---|---|---|---|---|---|
| Hospital 1 | ARIMA | 21.8752 | 16.9512 | 13.9327 | 11.6843 | 9.7828 | 6.6189 |
| Hospital 1 | GM | 31.4401 | 24.9558 | 22.008 | 23.1211 | 21.6125 | 14.7988 |
| Hospital 1 | NNETAR | 19.3885 | 15.1496 | 12.2291 | 12.5496 | 11.1636 | 7.9162 |
| Hospital 1 | Prophet | 30.6982 | 24.5352 | 21.6782 | 26.5893 | 24.5097 | 16.9377 |
| Hospital 1 | LSTM | 41.502 | 34.2493 | 25.8049 | 24.6605 | 22.5114 | 15.5568 |
| Hospital 1 | Chronos | 33.7296 | 26.1622 | 22.6461 | 11.0939 | 8.0941 | 5.5177 |
| Hospital 2 | ARIMA | 9.6318 | 7.8793 | 27.2238 | 9.1067 | 8.0895 | 17.5157 |
| Hospital 2 | GM | 11.631 | 9.503 | 37.6488 | 9.9546 | 9.309 | 18.6476 |
| Hospital 2 | NNETAR | 2.8546 | 2.1778 | 7.1203 | 7.5667 | 5.8258 | 11.2867 |
| Hospital 2 | Prophet | 10.1087 | 7.9354 | 30.5555 | 16.5878 | 14.4713 | 28.4543 |
| Hospital 2 | LSTM | 13.0651 | 10.528 | 35.6563 | 19.2162 | 17.4598 | 33.0682 |
| Hospital 2 | Chronos | 16.9254 | 14.106 | 61.9739 | 20.8802 | 18.3003 | 34.7463 |
Collectively, these findings demonstrate that no single model is universally best across both indicators and hospitals. Chronos performed strongly for IM and for DAMA in Hospital 1, whereas NNETAR was more effective for DAMA forecasting in Hospital 2.
Table 4 and Fig. 7 show the results of pairwise comparisons between forecasting models using the DM test for IM and DAMA at the two hospitals. The DM statistics and associated p-values indicate which pairwise differences in forecast accuracy were statistically significant.
Table 4.
Pairwise comparison between different models using DM test.
| Model 1 | Model 2 | DM (IM, H1) | p (IM, H1) | DM (IM, H2) | p (IM, H2) | DM (DAMA, H1) | p (DAMA, H1) | DM (DAMA, H2) | p (DAMA, H2) |
|---|---|---|---|---|---|---|---|---|---|
| Chronos | LSTM | 3.22** | 0.008 | 13.60*** | < 0.001 | 3.70** | 0.004 | -0.45 NS | 0.664 |
| Chronos | ARIMA | -2.76* | 0.018 | -1.78 NS | 0.102 | -1.00 NS | 0.339 | 2.78* | 0.018 |
| Chronos | GM | -2.81* | 0.017 | -3.19** | 0.009 | -6.04*** | < 0.001 | 3.08* | 0.01 |
| Chronos | NNETAR | -1.33 NS | 0.212 | -5.34*** | < 0.001 | -0.57 NS | 0.582 | 4.67** | 0.001 |
| Chronos | Prophet | -2.42* | 0.034 | -1.89 NS | 0.086 | -6.65*** | < 0.001 | 1.17 NS | 0.268 |
| LSTM | ARIMA | 2.15 NS | 0.055 | 11.12*** | < 0.001 | 4.04** | 0.002 | 2.89* | 0.015 |
| LSTM | GM | 1.80 NS | 0.099 | 12.61*** | < 0.001 | 0.31 NS | 0.766 | 3.90** | 0.002 |
| LSTM | NNETAR | 0.49 NS | 0.635 | -2.70* | 0.021 | 3.43** | 0.006 | 5.16*** | < 0.001 |
| LSTM | Prophet | 0.13 NS | 0.901 | 6.70*** | < 0.001 | -0.44 NS | 0.672 | 1.29 NS | 0.222 |
| ARIMA | GM | -2.61* | 0.024 | -0.04 NS | 0.968 | -11.92*** | < 0.001 | -0.93 NS | 0.370 |
| ARIMA | NNETAR | -0.43 NS | 0.673 | -4.87*** | < 0.001 | -0.04 NS | 0.971 | 1.34 NS | 0.208 |
| ARIMA | Prophet | -1.06 NS | 0.314 | -0.76 NS | 0.464 | -6.16*** | < 0.001 | -2.54* | 0.028 |
| GM | NNETAR | -0.27 NS | 0.794 | -5.09*** | < 0.001 | 3.27** | 0.007 | 3.04* | 0.011 |
| GM | Prophet | -0.83 NS | 0.424 | -0.45 NS | 0.660 | -1.02 NS | 0.331 | -2.65* | 0.022 |
| NNETAR | Prophet | -0.35 NS | 0.732 | 4.22** | 0.001 | -3.50** | 0.005 | -3.29** | 0.007 |

H1: Hospital 1; H2: Hospital 2.
Note: *** indicates p-value < 0.001, ** indicates p-value between 0.01 and 0.001, * indicates p-value between 0.05 and 0.01. NS: non-significant difference between the forecasts of the paired models.
Fig. 7.
Visualization of model comparison using Diebold-Mariano Test.
For Hospital 1 (IM forecasting), the Chronos model outperformed the LSTM model (DM statistic = 3.22, p = 0.008), ARIMA (p = 0.018), GM (p = 0.017), and Prophet (p = 0.034). The difference between Chronos and NNETAR was not statistically significant (p = 0.212), and comparisons of LSTM against ARIMA, GM, NNETAR, and Prophet showed no significant differences. For Hospital 2 (IM forecasting), Chronos was significantly more accurate than LSTM (DM statistic = 13.60, p < 0.001), GM (p = 0.009), and NNETAR (p < 0.001), while differences with ARIMA and Prophet were not statistically significant. These findings reinforce that Chronos consistently ranked among the top-performing models for IM. For Hospital 1 (DAMA forecasting), Chronos significantly outperformed LSTM (p = 0.004), GM (p < 0.001), and Prophet (p < 0.001); ARIMA significantly outperformed LSTM (p = 0.002), and differences between Chronos and either ARIMA or NNETAR were not significant. For Hospital 2 (DAMA forecasting), NNETAR significantly outperformed LSTM (p < 0.001), GM (p = 0.011), Prophet (p = 0.007), and Chronos (p = 0.001), while comparisons involving ARIMA yielded mixed results depending on the model pair. Across both hospitals, Chronos was statistically superior for IM, particularly relative to LSTM and GM, and NNETAR was statistically superior for DAMA in Hospital 2, confirming the numerical results. Many comparisons were not statistically significant, indicating overlapping performance among several models. These results emphasize the importance of considering both statistical significance and practical accuracy when selecting hospital forecasting models.
Figure 8 compares test values with fitted forecasts. The figure highlights that overfitting was most pronounced in NNETAR and LSTM, especially in IM forecasting. Chronos displayed consistently stable behavior, closely tracking IM values across both hospitals. For DAMA in Hospital 2, NNETAR produced the most stable test-set trajectory. This pattern suggests that models with higher representational complexity (e.g., LSTM, Chronos) do not always perform best for low-volume or highly variable indicators, whereas simpler neural network structures may yield more robust performance under those conditions. Although some models achieved relatively low test errors (e.g., Chronos for IM; NNETAR for DAMA), MAPE values for IM remained moderately high (often above 30%). This reflects the inherent volatility and small numerical scale of IM data. For hospital managers, these levels of error indicate that forecasts should be interpreted as supporting tools rather than definitive predictors, especially for short-term operational decisions.
Fig. 8.
Test data versus fitted values from different models.
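The error measures used throughout (RMSE, MAE, MAPE) can be computed directly from test-set residuals. The sketch below is a hypothetical helper, not the study's code; it also makes visible why MAPE inflates on low-count series such as monthly IM, where a small absolute miss on a small actual value produces a large percentage error:

```python
import math

def forecast_metrics(actual, predicted):
    """Compute RMSE, MAE, and MAPE (%) for aligned series."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined for zero actuals and grows quickly on small
    # counts, one reason IM errors look large despite modest misses.
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape}
```

For example, missing an actual of 10 by 2 already contributes 20 percentage points to MAPE, while the same miss on an actual of 100 contributes only 2.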
Discussion
This study evaluated six forecasting models—ARIMA, GM, NNETAR, LSTM, Prophet, and Chronos—across two tertiary hospitals with differing demographic and operational characteristics. By comparing traditional statistical models, neural network–based methods, and a state-of-the-art pretrained probabilistic model, this study provides new insights into the forecasting of IM and DAMA, two key quality and performance indicators in hospital management20,21.
Across both hospitals, Chronos consistently achieved the lowest or near-lowest forecast errors and demonstrated stable generalisation performance (Table 2). In contrast, NNETAR frequently exhibited lower training errors but higher test errors, suggesting sensitivity to sample size and a tendency toward overfitting under limited data conditions. The LSTM model provided limited gains under the relatively small sample sizes considered. This is consistent with evidence that neural forecasting models, trained from scratch with many parameters, generally require substantial data and intensive tuning to generalize well22, while Chronos can leverage pretrained representations from large heterogeneous corpora to achieve more robust accuracy even in low-data settings18.
It is noteworthy that, in several cases, test errors were not uniformly higher than training errors across all models, indicators, and hospitals. This pattern reflects characteristics of the data and evaluation design rather than methodological shortcomings. First, the monthly IM and DAMA series are short (72 observations in the training set), making error estimates sensitive to a small number of observations. Second, relatively parsimonious models such as ARIMA and GM may underfit the training data, leading to similar or occasionally lower test errors by chance. Third, the use of a single contiguous year (2024) as the test set means that test performance is strongly influenced by whether that year exhibits smoother or less variable behaviour than the preceding period. Finally, pretrained or regularised models such as Chronos may exhibit implicit regularisation effects, resulting in comparable training and test errors without indicating data leakage or overfitting. These observations underscore the importance of interpreting training-test error relationships within the context of small, noisy hospital-level time series.
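One way to reduce the dependence on a single contiguous test year is rolling-origin (expanding-window) evaluation, which averages errors over several forecast origins. A minimal index-generating sketch follows; the function name is illustrative, and the 72-observation training window and 12-month horizon of this study are used only as example defaults in the usage below:

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=1):
    """Yield (train_idx, test_idx) expanding-window splits.

    Unlike a single contiguous holdout year, errors averaged over
    several origins are less sensitive to any one atypical period.
    """
    origin = initial_train
    while origin + horizon <= n_obs:
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += step
```

With 84 monthly observations, for instance, `rolling_origin_splits(84, 60, 12, step=6)` yields three overlapping 12-month evaluation windows instead of one.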
Forecasting performance for DAMA varied more substantially between hospitals in our study (Table 3). In Hospital 1, where DAMA volumes were higher and relatively stable, Chronos produced the most accurate forecasts. In Hospital 2, characterized by lower volumes and greater irregularity, NNETAR achieved the lowest test errors. Importantly, these between-hospital differences in forecasting performance are attributed to differences in time-series properties and operational context, rather than to altitude-related causal mechanisms. Altitude was reported solely as a regional characteristic and was not modelled or tested as a determinant of IM or DAMA outcomes. This heterogeneity is consistent with recent benchmark and survey studies showing that dataset characteristics—such as variability, count magnitude, and the strength and regularity of temporal patterns—play an important role in determining which forecasting models perform best. In low-data or weakly structured settings, simpler models can match or even outperform more complex deep learning architectures22–24. These findings support our interpretation that the lower volume and higher irregularity of DAMA in Hospital 2 favored a relatively simple neural network. These results highlight the importance of selecting simple architectures for low-volume or noisy series, along with rigorous validation strategies such as DM testing.
From an operational perspective, hospitals often plan staffing, bed capacity, and emergency surge responses using historical averages or subjective judgement. Forecasting tools, particularly those capable of capturing recurring fluctuations such as those observed in DAMA for Hospital 1, can support more proactive and data-driven resource allocation. Both IM and DAMA function as sentinel indicators of care processes, and even moderate improvements in forecasting accuracy may help hospital administrators detect abnormal fluctuations earlier, identify service bottlenecks, and initiate timely interventions.
Overall, this study highlights that no single forecasting model is universally optimal across all contexts. Pretrained models such as Chronos appear well suited for IM and relatively stable DAMA patterns, whereas simpler neural architectures such as NNETAR may be preferable for low-volume or highly irregular series. Hospitals should therefore select forecasting methods based on local data characteristics rather than model complexity alone25,26. The performance of Chronos despite limited data suggests substantial potential for pretrained models in resource-limited hospitals that lack rich datasets or technical capacity. These models may enable institutions to leapfrog traditional model development burdens and implement high-quality forecasting pipelines more rapidly.
Limitations
This study relied solely on historical IM and DAMA values; incorporating additional variables, such as patient demographics and disease severity, may improve accuracy. Monthly data limited the training sample to 72 observations per hospital, which constrains model complexity and may influence the performance of deep learning methods. Although descriptive patterns suggested potential seasonality, no formal decomposition or seasonal testing procedure was applied; future studies could include decomposition methods. Recent research highlights the strong performance of ensemble approaches27, and future studies could evaluate them. Changes in staffing, hospital policy, or clinical protocols may influence IM and DAMA but were not explicitly modeled. Finally, although the two hospitals differ markedly in geographic and environmental context, this study does not attempt causal inference regarding altitude-related effects: altitude was not modeled as an explanatory variable, and assessing its potential impact on IM or DAMA would require patient-level physiological and clinical covariates beyond the scope of the present analysis.
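As a concrete starting point for the decomposition analyses suggested above, a classical additive decomposition (centred moving-average trend plus averaged seasonal indices) can be sketched in a few lines. This is an illustrative textbook construction, not a procedure applied in this study, and it assumes the series spans enough full cycles that every seasonal position receives at least one detrended value:

```python
def classical_decompose(series, period=12):
    """Minimal additive decomposition: trend, seasonal indices.

    Trend is a 2 x period centred moving average (valid for even
    periods such as 12 months); seasonal indices are the averaged
    detrended values per cycle position, centred to sum to zero.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        w = series[t - half:t + half + 1]  # period + 1 points
        trend[t] = (0.5 * w[0] + sum(w[1:-1]) + 0.5 * w[-1]) / period
    # Average detrended values by position within the seasonal cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    mean_adj = sum(raw) / period  # centre indices to sum to ~0
    seasonal = [r - mean_adj for r in raw]
    return trend, seasonal
```

The remainder component is then the observed value minus trend and seasonal index, and inspecting the seasonal indices gives a quick formal check on the descriptive seasonality noted earlier.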
Conclusion
This study demonstrates that forecasting IM and DAMA using a combination of statistical, machine learning, deep learning, and pretrained models provides practical insights for hospital management. For IM, the pretrained Chronos model consistently achieved the best or near-best performance across both hospitals, demonstrating robust generalisation under limited data conditions. For DAMA, forecasting performance was indicator- and site-specific: Chronos performed best in the hospital with higher and more stable DAMA volumes, whereas NNETAR outperformed other models in the lower-volume, more irregular series. Overall, the findings indicate that no single forecasting approach is universally optimal. Model selection should be guided by data characteristics such as scale, variability, and temporal regularity rather than model complexity alone. Pretrained models offer a scalable solution for stable hospital indicators, while simpler neural architectures may be preferable for low-volume or noisy time series. Future research should incorporate multi-source data and explore hybrid and ensemble methods to support more responsive and data-driven hospital management.
Acknowledgements
We are grateful to the referees and the editors for their valuable comments.
Author contributions
Cheng Pang: Conceptualization, Methodology, Formal analysis, Mainly Writing, Funding acquisition; Dexi Jiayong: Data curation, Writing – review & editing, Revising; Dandan Jiang: Data curation, Writing – review & editing; Yi Wang: Investigation, Writing – review & editing; Naishi Li: Methodology, Investigation, Writing – review & editing; Dan Ren: Conceptualization, Funding acquisition, project administration, Writing – review & editing.
Funding
The study was supported by the Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2024-RW630-01) and the Natural Science Foundation of Xizang Autonomous Region (XZZR202402111(W)). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data availability
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval and consent to participate
The study was approved by the Ethics Committee of Peking Union Medical College Hospital (approval number I-25PJ1221) and the Ethics Committee of People's Hospital of Xizang Autonomous Region (approval number ME-TBHP-24-067). The study was conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin because the study used a retrospective study design.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cheng Pang and Dexi Jiayong contributed equally to this work.
References
- 1.Barbazza, E., Klazinga, N. S. & Kringos, D. S. Exploring the actionability of healthcare performance indicators for quality of care: a qualitative analysis of the literature, expert opinion and user experience. BMJ Qual. Saf. 30, 1010–1020 (2021).
- 2.Alfandre, D. J. I'm going home: discharges against medical advice. Mayo Clin. Proc. 84, 255–260 (2009).
- 3.Saia, M. et al. Hospital readmissions and mortality following discharge against medical advice: a five-year retrospective, population-based cohort study in Veneto region, Northeast Italy. BMJ Open 13, e069775 (2023).
- 4.Ni, J. et al. Discharge against medical advice after hospitalization for sepsis: predictors, 30-day readmissions, and outcomes. J. Emerg. Med. 65, e383–e392 (2023).
- 5.Ambasta, A., Santana, M., Ghali, W. A. & Tang, K. Discharge against medical advice: 'deviant' behaviour or a health system quality gap? BMJ Qual. Saf. 29, 348–352 (2020).
- 6.Kumah, A. Poor quality care in healthcare settings: an overlooked epidemic. Front. Public Health 13, 1504172 (2025).
- 7.Dalili, H., Shariat, M. & Sahebi, L. Time series analysis for forecasting neonatal intensive care unit census and neonatal mortality. BMC Pediatr. 25, 339 (2025).
- 8.Rodea-Montero, E. R. et al. Trends, structural changes, and assessment of time series models for forecasting hospital discharge due to death at a Mexican tertiary care hospital. PLOS ONE 16, e0248277 (2021).
- 9.Cho, N. Y. et al. Discharge against medical advice in trauma patients: trends, risk factors, and implications for health care management strategies. Surgery 176, 942–948 (2024).
- 10.Gao, R., Cheng, W. X., Suganthan, P. N. & Yuen, K. F. Inpatient discharges forecasting for Singapore hospitals by machine learning. IEEE J. Biomed. Health Inform. 26, 4966–4975 (2022).
- 11.Avinash, G., Pachori, H., Sharma, A. & Mishra, S. Time series forecasting of bed occupancy in mental health facilities in India using machine learning. Sci. Rep. 15, 2686 (2025).
- 12.Besong, A. E. et al. Significance of the ARIMA epidemiological modeling to predict the rate of HIV and AIDS in the Kumba health district of Cameroon. Front. Public Health 13, 1526454 (2025).
- 13.Tan, X. et al. Grey modelling and real-time forecasting for the approximate non-homogeneous white exponential law BDS clock bias sequences. Sci. Rep. 14, 17897 (2024).
- 14.Talkhi, N., Fatemi, A., Ataei, N. & Jabbari Nooghabi, M. Modeling and forecasting number of confirmed and death caused COVID-19 in Iran: a comparison of time series forecasting methods. Biomed. Signal Process. Control 66, 102494 (2021).
- 15.Saranya, M. S. & Vinish, V. N. A comparative evaluation of streamflow prediction using the SWAT and NNAR models in the Meenachil river basin of central Kerala, India. Water Sci. Technol. 88, 2002–2018 (2023).
- 16.Guo, Y. et al. Deep learning models for hepatitis E incidence prediction leveraging Baidu index. BMC Public Health 24, 3014 (2024).
- 17.Sembiring, I., Wahyuni, S. N. & Sediyono, E. LSTM algorithm optimization for COVID-19 prediction model. Heliyon 10, e26158 (2024).
- 18.Ansari, A. F. et al. Chronos: learning the language of time series. Preprint at https://doi.org/10.48550/arXiv.2403.07815 (2024).
- 19.Al-qaness, M. A. A., Ewees, A. A., Fan, H. & Abd El Aziz, M. Optimization method for forecasting confirmed cases of COVID-19 in China. J. Clin. Med. 9, 674 (2020).
- 20.Gaur, A., Gilham, E., Machin, L. & Warriner, D. Discharge against medical advice: the causes, consequences and possible corrective measures. Br. J. Hosp. Med. 85, 1–14 (2024).
- 21.Cecil, E., Bottle, A., Esmail, A., Vincent, C. & Aylin, P. What is the relationship between mortality alerts and other indicators of quality of care? A national cross-sectional study. J. Health Serv. Res. Policy 25, 13–21 (2020).
- 22.Lim, B. & Zohren, S. Time-series forecasting with deep learning: a survey. Philos. Trans. R. Soc. A 379, 20200209 (2021).
- 23.Brigato, L. et al. Position: there are no champions in long-term time series forecasting. Preprint at https://doi.org/10.48550/arXiv.2502.14045 (2025).
- 24.Liu, X. & Wang, W. Deep time series forecasting models: a comprehensive survey. Mathematics 12, 1504 (2024).
- 25.Youssef, A. et al. External validation of AI models in health should be replaced with recurring local validation. Nat. Med. 29, 2686–2687 (2023).
- 26.Andric, M. & Dragoni, M. Machine learning and statistical insights into hospital stay durations: the Italian EHR case. Preprint at https://doi.org/10.48550/arXiv.2504.18393 (2025).
- 27.Sakib, M. & Siddiqui, T. Multi-network-based ensemble deep learning model to forecast Ross River virus outbreak in Australia. Int. J. Pattern Recognit. Artif. Intell. 37, 2352015 (2023).