Abstract
Motivation
Healthcare data, particularly in critical care settings, presents three key challenges for analysis. First, physiological measurements come from different sources but are inherently related. Yet, traditional methods often treat each measurement type independently, losing valuable information about their relationships. Second, clinical measurements are collected at irregular intervals, and these sampling times can carry clinical meaning. Finally, missing values are prevalent. Whilst several imputation methods exist to tackle this common problem, they often fail to address the temporal nature of the data or to provide estimates of uncertainty in their predictions.
Results
We propose using deep Gaussian process emulation with stochastic imputation, a methodology initially conceived to deal with computationally expensive models and uncertainty quantification, to solve the problem of handling missing values that naturally occur in critical care data. This method leverages longitudinal and cross-sectional information and provides uncertainty estimation for the imputed values. Our evaluation on a clinical dataset shows that the proposed method performs better than conventional methods, such as multiple imputation by chained equations (MICE), last-known value imputation, and individually fitted Gaussian processes (GPs).
Availability and implementation
The source code of the experiments is freely available at: https://github.com/aliakbars/dgpsi-picu.
1 Introduction
One of the main challenges in analysing healthcare data is that they are usually collected from multiple measurement streams. A patient’s medical record may include data from different sources, such as medical histories, laboratory tests, and imaging studies (Johnson et al. 2016). Integrating and analysing the data can thus be challenging, as the various data sources may use different formats, units, and scales. For example, a patient’s CO2 level may be measured breath by breath directly from a ventilator circuit, but their albumin levels are measured daily from blood sample tests.
The difference in sampling frequency for data from multiple sources (Groenwold 2020) relates to the cost to both the patient and the observer. For example, non-invasive measurements are sampled frequently, whereas invasive measurements, which require technically difficult procedures, are sampled less frequently. Blood sampling comes at a cost to the patient in loss of blood volume—while this will be relatively small for adults, this can be significant enough for children that they may need blood transfusions (François et al. 2022, Siegal et al. 2023).
With the difference in sampling frequency, aligning these multiple streams of data will result in missing values. Simply removing missing values and performing a complete case analysis would not suffice, since useful observations might be lost (Vesin et al. 2013). Additionally, there is the problem of informative sampling, where one might observe an extended period of intervals because the patient is getting better, making the clinicians reduce the frequency of monitoring (Che et al. 2018). These circumstances make it challenging to assess a patient’s health and make informed decisions accurately. As the sampling frequency is, by nature, informative, it will be more difficult to detect subtle changes in unobserved variables retrospectively.
Current practices for handling missing values in healthcare data often prioritize simplicity over complexity. A common approach is using the last-known value imputation, also known as the last-observation-carried-forward (LOCF) method, which fills in missing values by extending the most recent available measurement for a specified time period (Siddiqui and Ali 1998). While this method is straightforward, it overlooks the correlation between covariates. On the other hand, multiple imputation by chained equations (MICE), another widely used method, uses cross-sectional information between covariates but treats each observation independently (Rubin 1987, Tsvetanova et al. 2021).
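As an illustration of LOCF on a toy hourly series (the two-hour `limit` is our own choice to cap how long a stale value is carried, not part of the original method):

```python
import numpy as np
import pandas as pd

# Toy hourly grid with irregularly observed albumin values (NaN = not measured).
ts = pd.DataFrame(
    {"albumin": [33.0, np.nan, np.nan, 31.0, np.nan]},
    index=pd.date_range("2024-01-01", periods=5, freq="h"),
)

# LOCF: carry the most recent observation forward; `limit=2` stops
# extending a stale value beyond two hours.
ts["albumin_locf"] = ts["albumin"].ffill(limit=2)
```

Because the carried value ignores the other covariates, any correlation between albumin and concurrently measured variables is lost, which is exactly the weakness noted above.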
More recently, deep neural networks have become increasingly popular for handling missing values in healthcare data (Lipton et al. 2016, Che et al. 2018, Cao et al. 2018, Yıldız et al. 2022). However, these methods have a limitation: they typically do not provide estimates of uncertainty in their predictions. This can be a problem in healthcare data analysis, where medical observations and interpretations inherently contain uncertainty, which may come from measurement error, inherent noise in the signal, or the use of surrogate markers. When this uncertainty in the input data is not accounted for, it can lead to unreliable model predictions (Cabitza et al. 2017).
This study aims to tackle the problem of handling missing values in multivariate time series data by leveraging both longitudinal and cross-sectional information. We use a deep Gaussian process (GP) model with stochastic imputation (Ming and Guillas 2021, Ming et al. 2023) where time is the input to predict the target variable through covariates in the model’s latent layer. By fitting the model jointly, the available information can be used as leverage to impute missing values stochastically. Moreover, this method comes with uncertainty estimation. This approach is evaluated against a baseline method that fits individual GPs to covariates using complete case analysis for model training and subsequent imputation.
A deep Gaussian process model is a hierarchical structure of GP nodes organized in layers to represent latent variables (Damianou and Lawrence 2013). Each node receives input from the previous layer and produces output that serves as input for the next layer. The observed data points appear at the final layer of this hierarchy. While a single-layer GP model is limited by its kernel function, which must be highly parametrized to capture complex data patterns, a deep GP model learns such patterns non-parametrically through its hierarchy and therefore has fewer hyperparameters to optimize (Salimbeni and Deisenroth 2017). Due to their ability to provide uncertainty estimates, deep GP models have applications in real-life domains, including aero-propulsion system simulation (Biggio et al. 2021), crop yield prediction from remote sensing data (You et al. 2017), and uncertainty estimation in electronic health records (Li et al. 2021).
The remainder of the paper is structured as follows. The clinical problem that motivated this research is explained in Section 2. GPs and deep GPs are reviewed in Section 3, where the methodological approach and alternative techniques for handling missing values in multivariate time series data are described. The proposed deep GP using stochastic imputation is validated by applying it to a clinical case study in Section 4. Finally, findings and future directions are summarized in Section 5.
2 Motivation: clinical problem
Missing values are a common challenge in healthcare datasets, arising from sources such as incomplete patient forms, survey non-responses (Penny and Atkinson 2012), and technical glitches during data collection (Zhang and Koru 2020). These missing values manifest within the data and affect study outcomes and statistical validity. Therefore, choosing appropriate methodological approaches to tackle this problem is crucial.
In critical care medicine, missing values are also a result of irregular and informative sampling (Groenwold 2020). To provide some context, clinicians in critical care monitor deviations from the expected arterial acidity (pH) range to gain insights into respiratory function, electrolyte balance, and the underlying diseases of the patients (Sirker et al. 2002). When pH deviates from normal ranges, either through acidosis (pH < 7.35) or alkalosis (pH > 7.45), it could disrupt vital biochemical processes and overall equilibrium, with studies showing that blood pH levels are associated with the mortality rate (Jung et al. 2011, Rodríguez-Villar et al. 2021) and neurological recovery in cases of cardiopulmonary resuscitation (Shin et al. 2017).
One way to monitor and model the pH level is the Stewart-Fencl approach (Stewart 1978, 1983, Fencl and Leith 1993). This approach identifies three independent variables that determine pH. The first variable is carbon dioxide (CO2), a major source of acid in the body that can be continuously monitored using modern bedside equipment. The second and third variables are strong ion differences (e.g. Na, K, Cl) and weak acids (e.g. albumin, lactate, urea, phosphate), respectively. These components require blood tests for measurement, which are performed less frequently due to their invasive nature (Barie 2004).
While CO2 is also measured through capnography as a surrogate, known as end-tidal CO2 (ETCO2), the main interest is in blood CO2 levels because they directly influence pH. In various respiratory conditions (such as asthma and COPD), the CO2 in the blood does not equilibrate with CO2 in the lungs. This creates a measurable gap between blood and alveolar CO2 levels, which can also be informative (Anderson and Breen 2000, Lee et al. 2024).
This disparity in measurement frequency creates a pattern of irregular sampling, where data collection occurs at inconsistent intervals. The resulting gaps in data collection lead to missing values, particularly for parameters requiring blood tests, as illustrated in Fig. 1. Consequently, the irregular nature of these measurements complicates the application of standard statistical techniques to analyse and interpret the data.
Figure 1.
Irregular sampling of six measurements of a sample patient. The pCO2, strong ion differences (SID), and lactate are measured from bedside monitoring, while albumin, phosphate, and urea levels are obtained from blood tests.
On the other hand, informative sampling occurs when data point selection is influenced by factors related to clinical hypotheses. For instance, clinicians might refrain from collecting data when a patient’s health improves and vice versa (Che et al. 2018). In this context, both irregular and informative sampling scenarios fall under the Missing Not at Random (MNAR) category, as the missingness is not random but associated with unobserved data or specific conditions and the missingness carries information (Little and Rubin 2019).
This study addresses the challenge of missing values in critical care data, which can impact patient outcome predictions (Vesin et al. 2013), for example, in predicting in-hospital and 30-day mortality (Sharafoddini et al. 2019). Rather than relying on complete-case analysis, which risks losing valuable information, this work proposes using deep GPs for data imputation. This approach leverages both cross-sectional and longitudinal information from patient records.
Following the Stewart-Fencl approach, the analysis is structured with arterial pH as the dependent variable. Relevant covariates are constructed to identify factors causing pH deviations. The proposed method not only imputes missing values in the covariates but also quantifies the uncertainty associated with these imputations, providing a more comprehensive understanding of the data’s reliability. Although pH is primarily used to evaluate acid-base status, other covariates provide critical information on metabolic status relevant to patient care in critical care settings (Gattinoni et al. 2017).
This study focuses on its application in critical care medicine because of its clinical importance, but the proposed method applies to different scenarios with similar conditions. For example, missing values are also found in human activity recognition from multiple sensor streams (Jain et al. 2022), sleep disorder diagnoses using electroencephalogram (EEG) (Lee et al. 2021), and hepatocellular carcinoma (Han et al. 2021).
3 Methodology
3.1 Gaussian processes
Let $x = (x_1, \ldots, x_N)^\top$ represent a $D$-dimensional input with $N$ observed data points and $y = (y_1, \ldots, y_N)^\top$ be the corresponding outputs. Then, the GP model assumes that $y$ follows a multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2 \mathbf{R})$, where $\mu$ is the mean vector, $\sigma^2$ is the scale parameter, and $\mathbf{R}$ is the correlation matrix. Cell $ij$ in the matrix is specified by $\mathbf{R}_{ij} = k(x_i, x_j) + \eta \mathbb{1}_{\{i = j\}}$, where $k$ is a given kernel function, $\eta$ is the nugget term, and $\mathbb{1}$ is the indicator function. In this study, we consider Gaussian processes with zero means, i.e. $\mu = \mathbf{0}$, and kernel functions with the multiplicative form $k(x_i, x_j) = \prod_{d=1}^{D} k_d(x_{id}, x_{jd})$, where $k_d$ is a one-dimensional stationary and isotropic kernel function, e.g. the squared exponential kernel (Rasmussen and Williams 2006), for the $d$-th input dimension.
The hyperparameters $\sigma^2$, $\eta$, and those contained in $k$ are typically estimated using maximum likelihood or maximum a posteriori (Rasmussen and Williams 2006). Given estimated GP hyperparameters and the realizations $x$ of the input and $y$ of the output, the posterior predictive distribution of the output at a new input position $x^*$ follows a Gaussian distribution with mean $\mu^*(x^*)$ and variance $\sigma^{*2}(x^*)$ given by:
| $\mu^*(x^*) = r(x^*)^\top \mathbf{R}^{-1} y$ | (1) |
| $\sigma^{*2}(x^*) = \sigma^2 \left( 1 + \eta - r(x^*)^\top \mathbf{R}^{-1} r(x^*) \right)$ | (2) |
where $r(x^*)$ is the $N$-vector with $i$-th element $k(x^*, x_i)$.
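A minimal NumPy sketch of the zero-mean GP posterior in Equations (1) and (2), assuming a one-dimensional input and a squared exponential kernel (function names and default values here are illustrative, not the paper's implementation):

```python
import numpy as np

def sq_exp(a, b, lengthscale=1.0):
    """Squared exponential kernel k(a, b) for 1-D inputs a (m,) and b (n,)."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * lengthscale**2))

def gp_posterior(x_train, y_train, x_new, lengthscale=1.0, sigma2=1.0, eta=1e-6):
    """Zero-mean GP posterior mean (Eq. 1) and variance (Eq. 2) at x_new."""
    R = sq_exp(x_train, x_train, lengthscale) + eta * np.eye(len(x_train))
    r = sq_exp(x_new, x_train, lengthscale)            # rows are r(x*)^T
    mu = r @ np.linalg.solve(R, y_train)               # r(x*)^T R^{-1} y
    quad = np.einsum("ij,ji->i", r, np.linalg.solve(R, r.T))
    var = sigma2 * (1.0 + eta - quad)                  # sigma^2 (1 + eta - r^T R^{-1} r)
    return mu, var
```

At an observed time the posterior mean reproduces the measurement and the variance collapses towards the nugget, which is what makes GPs natural smoothers for irregularly sampled signals.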
A GP model can be used as a smoothing function for irregularly sampled signals through the predicted mean function of a time series. Thus, GPs have been used to model electronic health records (Lasko et al. 2013), wearable sensor data for e-health (Clifton et al. 2013), gene expression data (Gao et al. 2008, Kirk and Stumpf 2009, Liu et al. 2010), and quantitative traits (Vanhatalo et al. 2019, Arjas et al. 2020).
3.2 Linked Gaussian processes
Consider $P$ GP models $f_1, \ldots, f_P$, where $P$ is the total number of output dimensions of the computer models, each with sets of $D$-dimensional input ($x_p$) producing sets of one-dimensional output ($w_p$). In the Stewart–Fencl approach, this multi-output GP model can be interpreted as using time as a shared input variable, but with differing numbers of training points, and predicting covariates as outputs. Let $g$ be another GP model with $M$ sets of $P$-dimensional input ($w$) and one-dimensional output ($y$), where the $P$ features (i.e. dimensions) in $w$ correspond to the $P$-dimensional outputs produced by $f_1, \ldots, f_P$. A linked GP (LGP) is then created when we feed the predictions from $f_1, \ldots, f_P$ into the input of $g$.
Given a new global input position $x^*$, the hierarchy of the LGP that produces the global output prediction $y^*$ can be seen in Fig. 2.
Figure 2.
A two-layered linked Gaussian process model. In the Stewart–Fencl approach, the first layer can be seen as using time as the input variable and predicting covariates as outputs, where the outputs are conditionally independent with respect to . A special case arises when all models in the first layer share time as a common input variable with an equal number of training points. The second GP layer then takes the outputs from the first layer as its inputs to model pH as the output variable. This model corresponds to a deep Gaussian process model when the first layer is treated as latent.
Assume that the model parameters involved in $f_1, \ldots, f_P$ and $g$ are known or estimated, and that we observe realizations $x_p$, $w_p$, and $y$ of the inputs and outputs of $f_p$ and $g$, respectively, for all $p = 1, \ldots, P$. Then, given that the one-dimensional outputs of $f_1, \ldots, f_P$ are conditionally independent, the posterior predictive distribution of the global output $y^*$ at $x^*$ has PDF:
| $p(y^* \mid x^*) = \int p_g(y^* \mid w^*) \, p_f(w^* \mid x^*) \, \mathrm{d}w^*$ | (3) |
where $p_g$ and $p_f$ are the PDFs of the posterior predictive distributions of $g$ and $f_1, \ldots, f_P$, respectively. However, $p(y^* \mid x^*)$ is analytically intractable: the integral in Equation (3) has no closed-form expression because the posterior of $g$ is Gaussian with mean and variance that are nonlinear in $w^*$, so marginalizing over the Gaussian input $w^*$ yields a non-Gaussian mixture with no closed-form density under standard kernels (e.g. squared exponential, Matérn). The first two moments, however, are available in closed form, since they reduce to kernel expectations against a Gaussian. It can be shown that, given the GP specifications in Subsection Gaussian Processes, the mean, $\tilde{\mu}(x^*)$, and variance, $\tilde{\sigma}^2(x^*)$, of $y^*$ are expressed analytically as follows:
| $\tilde{\mu}(x^*) = \mathbf{I}^\top \mathbf{R}^{-1} y$ | (4) |
| $\tilde{\sigma}^2(x^*) = y^\top \mathbf{R}^{-1} \mathbf{J} \mathbf{R}^{-1} y - \left( \mathbf{I}^\top \mathbf{R}^{-1} y \right)^2 + \sigma^2 \left( 1 + \eta - \mathrm{tr}\left( \mathbf{R}^{-1} \mathbf{J} \right) \right)$ | (5) |
where $\mathbf{I}$ is the vector with its $i$th element
| $\mathbf{I}_i = \mathbb{E}_{w^*}\left[ k(w^*, w_i) \right]$ | (6) |
and $\mathbf{J}$ is the matrix with its $ij$-th element
| $\mathbf{J}_{ij} = \mathbb{E}_{w^*}\left[ k(w^*, w_i) \, k(w^*, w_j) \right]$ | (7) |
and the expectations in Equations (6) and (7) have closed-form expressions under the linear kernel, squared exponential kernel, and a class of Matérn kernels (Titsias and Lawrence 2010, Kyzyurova et al. 2018, Ming and Guillas 2021). The linked GP is then defined as a Gaussian approximation with its mean and variance given by Equations (4) and (5). Furthermore, the linked GP can be built iteratively to analytically approximate the posterior predictive distribution of outputs from any feed-forward GP system. Research has shown that this approach provides an adequate approximation by minimizing the Kullback-Leibler divergence (Ming and Guillas 2021).
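For intuition on the kernel expectation in Equation (6): when the second-layer kernel is squared exponential with lengthscale $\gamma$ and the first-layer prediction $W^*$ is Gaussian with mean $m$ and variance $v$, the expectation $\mathbb{E}[k(W^*, w_i)]$ is a Gaussian convolution with closed form $\gamma/\sqrt{\gamma^2 + v}\, \exp(-(m - w_i)^2 / (2(\gamma^2 + v)))$. The sketch below (made-up numbers) checks this against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0                            # second-layer kernel lengthscale
w_train = np.array([-1.0, 0.0, 1.5])   # second-layer training inputs w_i
m, v = 0.3, 0.2                        # mean/variance of first-layer prediction W*

# Closed-form I_i = E[k(W*, w_i)] for the squared exponential kernel.
I_closed = gamma / np.sqrt(gamma**2 + v) * np.exp(
    -((m - w_train) ** 2) / (2.0 * (gamma**2 + v))
)

# Monte Carlo estimate of the same expectation.
W = rng.normal(m, np.sqrt(v), size=200_000)
I_mc = np.exp(-((W[:, None] - w_train[None, :]) ** 2) / (2.0 * gamma**2)).mean(axis=0)
```

The agreement between the two estimates illustrates why the linked-GP moments can be computed analytically instead of by sampling.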
3.3 Deep Gaussian processes
An LGP model is useful when we have complete cases in the data. In many settings, however, the data contain incomplete cases: although all input positions have their corresponding global output observed, some input locations may have only partially observed internal inputs/outputs. When we construct an LGP between the global input and output, we must then remove incomplete cases to build the individual GP models, losing information. To address this issue, a deep GP (DGP) model can be used.
Consider $P$ GP models $f_1, \ldots, f_P$ with $N$ sets of $D$-dimensional input ($x$) and $N$ sets of one-dimensional output ($w_p$). Let $g$ be another GP model with $N$ sets of $P$-dimensional input ($w$) and one-dimensional output ($y$). A two-layered DGP model is then created when we feed the predictions from $f_1, \ldots, f_P$ into the input of $g$.
Note that if the realizations $x$ of the input and $y$ of the global output are fully observed while the realizations $w$ of the latent layer are completely or partially missing, we need to use complete cases in $(x, w_p)$ to train each $f_p$, and complete cases in $(w, y)$ to train $g$. This can be represented by the likelihood function:
| $L(\theta) = \prod_{p=1}^{P} p\left( w_p^{\mathrm{c}} \mid x^{\mathrm{c}}; \theta_p \right) \, p\left( y^{\mathrm{c}} \mid w^{\mathrm{c}}; \theta_g \right)$ | (8) |
where the superscript $\mathrm{c}$ denotes complete cases, which leads to optimizations of individual GP likelihoods with complete cases.
However, for a two-layer DGP model, we can retain and extract as much information as possible about its latent values from $w$, while jointly training the individual GPs by maximizing the likelihood function:
| $L(\theta) = \int p\left( y \mid w; \theta_g \right) \prod_{p=1}^{P} p\left( w_p \mid x; \theta_p \right) \mathrm{d}w_{\mathrm{mis}}$ | (9) |
where $p(y \mid w; \theta_g)$ is the multivariate Gaussian PDF of $g$, $p(w_p \mid x; \theta_p)$ is the multivariate Gaussian PDF of $f_p$, and $w_{\mathrm{mis}}$ denotes the missing cases (i.e. latent values) in $w$. However, due to the nonlinearity between $w$ and $y$, the integral with respect to $w_{\mathrm{mis}}$ in Equation (9) is analytically intractable, making the inference of DGPs challenging.
3.4 Deep GP with stochastic imputation
Recently, stochastic imputation (SI) was proposed to tackle the inference issue in deep GP (Ming et al. 2023), leveraging the fact that DGP and LGP are similar in structure. This approach provides a well-balanced trade-off between computational complexity and accuracy by combining the computational efficiency of variational inference (Salimbeni and Deisenroth 2017) and the accuracy of a full Bayesian approach (Sauer et al. 2023). The key concept of SI is converting a DGP emulator to multiple LGP emulators, each representing a DGP realization with imputed latent variables. Because some elements of the latent layer are observed, full imputation of the latent layer is not required.
Given a DGP hierarchy as described in Subsection Deep Gaussian Processes and realizations $x$ of the input and $y$ of the global output, we can obtain point estimates of the unknown model parameters in $f_p$ for all $p = 1, \ldots, P$ and in $g$ using the stochastic expectation maximization (SEM) algorithm (Ming et al. 2023). With the estimated model parameters, the DGP emulator gives the approximate posterior predictive mean and variance of the output $y^*$ at a new input position $x^*$ as described in Algorithm 1.
Algorithm 1.
Construction of a two-layered DGP emulator
Input: (i) Realizations $x$, $y$, and the partially observed $w$; (ii) A new input position $x^*$; (iii) The number of imputations $N$.
Output: Mean $\mu(x^*)$ and variance $\sigma^2(x^*)$ of $y^*$.
1: for $i = 1, \ldots, N$ do
2: Given $x$, $y$, and the estimated model parameters, draw an imputation $w^{(i)}$ of the latent output via an Elliptical Slice Sampling (Nishihara et al. 2014) update.
3: Construct the LGP emulator with the mean $\mu_i(x^*)$ and variance $\sigma_i^2(x^*)$, given $x$, $y$, and $w^{(i)}$.
4: end for
5: Compute the mean and variance of $y^*$ by $\mu(x^*) = \frac{1}{N} \sum_{i=1}^{N} \mu_i(x^*)$ and $\sigma^2(x^*) = \frac{1}{N} \sum_{i=1}^{N} \left( \sigma_i^2(x^*) + \mu_i(x^*)^2 \right) - \mu(x^*)^2$.
One can extend Algorithm 1 to multi-layered DGPs with multi-dimensional outputs by applying the same algorithm and repeating step 2 for each layer. A detailed explanation of this generalization is provided in Ming et al. (2023).
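The final step of Algorithm 1 pools the $N$ per-imputation predictions by mixture moments; the helper below is our own sketch of that pooling rule, not the dgpsi API:

```python
import numpy as np

def combine_imputations(means, variances):
    """Pool N per-imputation predictive means/variances (step 5 of Algorithm 1).

    The pooled mean is the average of the means; the pooled variance is the
    average second moment minus the squared pooled mean, so disagreement
    between imputations inflates the reported uncertainty.
    """
    means, variances = np.asarray(means), np.asarray(variances)
    mu = means.mean(axis=0)
    var = (variances + means**2).mean(axis=0) - mu**2
    return mu, var
```

For example, two imputations predicting 1.0 and 3.0 with variance 0.5 each pool to mean 2.0 and variance 1.5: the extra 1.0 of variance comes from the spread between the imputed latent layers.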
3.5 Benchmarking
Our numerical experiments evaluated four different models to analyse the dataset.
3.6 Deep GP with stochastic imputation (DGP-SI)
We trained a unified deep GP model that integrated all components: it took the timestamp as input, processed the covariates in the latent layer, and predicted the target variable as output, all within a single end-to-end framework as described in Subsection Deep GP with Stochastic Imputation.
To set the baseline for our proposed models, we compared them with the following approaches:
Last-observation carried forward (LOCF) imputation: we used the last-known value of the variable until the next measurement is observed (Siddiqui and Ali 1998);
MICE: multiple imputation by chained equations (Jarrett et al. 2022), which ignores the temporal dependency and relies only on the cross-sectional information between variables, treating each observation as independent and identically distributed (Rubin 1987, Little and Rubin 2019) and imputing iteratively;
GP interpolation: we fitted a GP regressor with a squared exponential kernel individually for each covariate.
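For contrast with the baselines, the chained-equations idea can be reduced to a deterministic toy: regress each incomplete column on the others with least squares and overwrite its missing entries, cycling until convergence. Real MICE additionally draws from the predictive distribution and pools several imputed datasets; the function below is our simplified sketch, not the implementation used in the experiments:

```python
import numpy as np

def chained_lsq_impute(X, n_iter=5):
    """Minimal, deterministic sketch of chained-equations imputation."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # Initialise missing entries with their column means.
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            obs = ~miss[:, j]
            if obs.all():
                continue
            # Regress column j on all other columns using observed rows only.
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            # Overwrite the missing entries with the fitted values.
            X[~obs, j] = A[~obs] @ beta
    return X
```

Note how this uses only cross-sectional relationships between columns; time never enters the regression, which is the limitation the proposed DGP-SI addresses.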
4 Numerical experiments
4.1 Clinical problem
Building on the Stewart-Fencl approach (Stewart 1983), the model was constructed with pH as the dependent variable and three independent variables: CO2 levels, strong ion differences (SID), and weak acid concentrations, yielding a continuous pH estimation model. We then simulated real-world clinical scenarios by deliberately masking (withholding) portions of the covariates, mirroring the practical challenges of aligning laboratory test results with continuous bedside monitoring. Using measured pH levels to impute these missing covariate values, the aim was to provide clinicians with insights into patient status between laboratory tests.
The experiments were done on a dataset of 14 ICU admission windows selected at random from a paediatric intensive care unit. Each admission had a different number of data points, ranging from 19 to 115 hourly timestamps, and some patients had multiple admissions. To further de-identify the patients, the dates were shifted to future dates while retaining the time relationships.
To follow the physicochemical approach of acid-base balance, the model incorporated multiple blood gas measurements. Specifically, the model used partial carbon dioxide pressure (pCO2) and pH measurements and the difference between Na and Cl concentrations to represent the SID (Kellum 2000). Although CO2 levels can be measured through both blood gas analysis (Hassan and Martinez 2024) and capnography (ETCO2) (Raffe 2020), the numerical experiment was simplified by using only blood gas measurements.
For acid-base balance modelling, blood gas measurements are generally preferred over capnography for CO2 because they offer a more comprehensive and direct assessment of the body’s acid-base status (Anderson and Breen 2000, Lee et al. 2024). Additionally, the weak acid component was represented by lactate measurements from the blood gas analyser, as observations for albumin, phosphate, and urea were limited to only one or two measurements in half of the admission windows.
The DGP-SI model was trained simultaneously for all components using the architecture illustrated in Fig. 3. As a baseline, an ablation study was also conducted using individually fitted GP models to predict the three covariates (pCO2, SID, and lactate) from time inputs. For a fair comparison, the MICE model only used time and pH data, excluding covariates since these would not be available during the inference process of our proposed models.
Figure 3.
A two-layered deep Gaussian process (DGP) for pH prediction. This study compares two fitting approaches: individually fitting GP models by predicting the covariates using time as the input and the DGP with stochastic imputation (SI) method, where all the components are trained simultaneously.
Following the literature, the data was preprocessed through several steps. First, the data was discretised into sequences of hourly intervals and the measurements were aggregated by taking the arithmetic means (Lipton et al. 2016, Ghosheh et al. 2024). Then, to evaluate the model’s robustness, either the observed pH or the covariates were randomly masked with varying proportions (10%, 20%, 30%, and 40%) (Beaulieu-Jones and Moore 2017, Jafrasteh et al. 2023). Realistically, the proportion of missingness typically ranges from about 15% to 30% according to a similar study where a GP model was also used as a benchmark (Luo et al. 2018). To evaluate the robustness of our approach under more challenging conditions, we further conducted stress tests at 40% missingness. Finally, z-score transformation was applied to both pH values and the covariates to standardize the data distribution.
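On toy data, the three preprocessing steps (hourly aggregation, random masking, z-scoring) can be sketched as follows; all names and numbers here are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy irregular pH measurements for one admission window.
times = pd.to_datetime("2024-01-01") + pd.to_timedelta(rng.uniform(0, 24, 40), unit="h")
raw = pd.DataFrame({"ph": rng.normal(7.35, 0.05, 40)}, index=times).sort_index()

# 1) Discretise into hourly intervals, aggregating by the arithmetic mean.
hourly = raw.resample("1h").mean().dropna()

# 2) Randomly mask a proportion of the observed values for evaluation.
mask_prop = 0.3
masked = hourly.copy()
idx = rng.choice(len(hourly), size=int(mask_prop * len(hourly)), replace=False)
masked.iloc[idx, 0] = np.nan

# 3) z-score transformation using the remaining observed values.
mu, sd = masked["ph"].mean(), masked["ph"].std()
masked["ph_z"] = (masked["ph"] - mu) / sd
```

The held-back values at `idx` then serve as ground truth when scoring the imputation methods.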
To evaluate model performance given the presence of measurement noise, the accuracy of missing value imputations was defined as the mean absolute error (MAE):
| $\mathrm{MAE}_d = \frac{1}{N} \sum_{i=1}^{N} \left| y_{i,d} - \hat{y}_{i,d} \right|$ | (10) |
where $N$ is the total number of missing values being evaluated, $y_{i,d}$ is the $i$-th true value that was masked, and $\hat{y}_{i,d}$ is the $i$-th estimated value at dimension $d$. In the GP-based methods, the model prediction is computed as the mean of the predictive distribution.
Additionally, the models were evaluated with respect to uncertainty quantification using the negative log likelihood (NLL), defined as:
| $\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left( y_i \mid x_i \right)$ | (11) |
where $y_i$ denotes the $i$-th masked true value and $x_i$ represents the input to the probability density function $p_\theta$, parametrized by the learnable parameters $\theta$. Note that, since the LOCF method does not provide uncertainty quantification, it was excluded from this evaluation.
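Both metrics are straightforward to compute from held-out values; for the GP-based imputers the predictive density in Equation (11) is Gaussian, so the NLL takes its familiar closed form (function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over the masked values (Equation 10)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def gaussian_nll(y_true, mu, var):
    """Average Gaussian negative log likelihood (Equation 11), evaluated
    with the predictive mean and variance returned by a GP-based imputer."""
    y_true, mu, var = map(np.asarray, (y_true, mu, var))
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y_true - mu) ** 2 / var)
```

An overconfident imputer (small variance with a large error) is penalized heavily by the quadratic term, which is why NLL complements MAE as an uncertainty-aware score.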
5 Results
5.1 Imputing missing values in covariates
The first experiment was to predict missing values with varying proportions from three variables: pCO2, SID, and lactate as the weak acid. To ensure comparable results, all models were standardized to use time as the input variable and pH as the target variable. All three variables were measured at the same time as pH. This experiment used all the observed pH values to link the three covariates, as suggested by the Stewart-Fencl physicochemical approach. Since training a linked GP with a sequential design would, in this case, reduce to individually fitted GPs with complete case analysis, only four models were compared in this experiment.
The performance comparison revealed that DGP achieved the lowest error rates at 10% and 20% missing values, performed similarly to GP at 30%, and slightly underperformed compared to GP at 40% (Fig. 4). However, as noted earlier, a missingness rate of 40% is only used here as a benchmark and is not typical in clinical settings. The proposed method works best when the amount of missing data is at realistic, clinically expected levels. Both DGP and GP demonstrated better performance than MICE and LOCF, with LOCF showing lower error rates than MICE as missingness increased. These findings suggest that longitudinal information is more valuable than cross-sectional information for covariate imputation. However, DGP’s low error rates indicate that combining both longitudinal and cross-sectional information yields optimal results.
Figure 4.
Average MAE values in imputing the covariates. The error bars represent the standard error of the MAE values across all admission windows. DGP, which combines longitudinal and cross-sectional information, performs better in imputing missing values in the covariates, particularly at lower missingness levels. However, as the proportion of missing values increases, methods that rely on longitudinal information become more effective.
Given the results, practitioners should generally start with a GP model when the underlying data is expected to be smooth or can be well-approximated using established kernels. If the validation metrics indicate that a higher model capacity is needed and a hierarchical structure aligns with the mathematical formulation of the problem, e.g. the physicochemical modelling, then using DGP may be appropriate. In our case, DGP helps to address the partial information issues that we have in the covariates, as well as the non-stationarity. However, it is important to note that adding more layers to DGPs may provide little improvement in accuracy, and past a certain depth, the additional computational cost outweighs the marginal gains in performance (Dunlop et al. 2018, Ming et al. 2023).
5.2 Uncertainty quantification
When incorporating uncertainty quantification through NLL evaluation, the results show that all models exhibited higher NLL values as the proportion of missing data increased (Fig. 5). Furthermore, although DGP and GP achieved comparable NLL at 10% and 20% missingness, DGP maintained tighter uncertainty bounds than GP at 30% and 40%, resulting in lower NLL. Consistent with the MAE results, both DGP and GP outperformed MICE, whilst the last-known imputation method was excluded from this evaluation as it does not provide uncertainty quantification.
Figure 5.

Average NLL values in imputing the covariates. The error bars represent the standard error of the NLL values across all admission windows. DGP generally performs better at imputing missing values whilst maintaining tighter uncertainty bounds for the covariates. As the proportion of missing values increases, all models show higher NLL values.
As the covariates were connected through pH in the output layer using DGP-SI, an observation from one covariate could affect the uncertainty of another covariate where an observation was unavailable. To demonstrate this effect, differences in uncertainty were compared by manually masking observations from the three covariates in three different intervals, focusing on masking lactate within these intervals. The experiment revealed that the uncertainty in lactate, shown in Fig. 6, was less when only the observed points in lactate were masked instead of all three covariates being masked.
Figure 6.
Uncertainty quantification derived from the DGP-SI model when imputing missing values in the covariates. The dots represent observed values and the dots between the two vertical lines show masked values. The x-axis shows scaled time, and the y-axis shows standardized covariate values. The uncertainty for lactate resulting from manually masking the three covariates (left) is greater than that from only masking lactate (right).
Given that the number of tests performed on a patient is often restricted, particularly when procedures are invasive or carry associated risks, clinicians face the challenge of obtaining clinical insight from limited measurements. As demonstrated in Fig. 6, DGP-SI offers an advantage in such situations by enabling the integration of longitudinal patterns within a specific covariate alongside cross-sectional information from the remaining covariates. This approach can support informed test selection and imputation strategies, optimizing both patient safety and diagnostic yield.
6 Discussion
This paper demonstrated that DGP-SI, initially developed for uncertainty quantification in computationally expensive models, could effectively handle missing values in critical care data from different sources. The analysis across 14 admission windows showed that DGP-SI performed better in imputing missing covariate values, particularly when the proportion of missing data was low. This approach offers clinicians insight into patient states between measurements whilst providing uncertainty quantification, hence attaching a measure of confidence that addresses the inherent uncertainties in medical science (Cabitza et al. 2017).
A similar approach for inferring missing values in time series, such as Variational GP Dynamical Systems (Damianou et al. 2011), can be found in the literature. Although Variational GP and DGP-SI share similar aims, namely modelling high-dimensional sequential data and providing uncertainty quantification, we chose to use the Deep Gaussian Process with Stochastic Imputation (Ming et al. 2023) due to its favourable properties in uncertainty quantification. A common concern with the variational inference used in Variational GP Dynamical Systems (Damianou et al. 2011) is that it may fail to capture important aspects of posterior uncertainty. Additionally, maximization of the ELBO can be challenging due to non-convexity or the large number of model parameters introduced by the layered structure of GPs (Ming et al. 2023). As an alternative, Ming et al. (2023) showed that using Elliptical Slice Sampling (ESS) within a Gibbs sampler (ESS-within-Gibbs) to impute latent variables in DGPs could lead to better uncertainty quantification.
From a clinical perspective, accurately imputing missing values with quantified uncertainty can impact decision-making, especially in time-sensitive critical care scenarios. Clinicians often face the challenge of making treatment decisions with incomplete data, and the uncertainty quantification provided by DGP-SI can shape how imputed values are interpreted and applied in clinical workflows, e.g. in guiding decisions on whether to order additional tests or in cautioning against over-reliance on imputed results. Rather than serving solely as a methodological complication, the resulting risk distributions can provide valuable insights to support clinical decision-making and facilitate conversations with patients and their families (Mathiszig-Lee et al. 2022). Additionally, this method has the potential to enhance early warning systems in intensive care units, identifying deteriorating patients earlier while reducing false alarms.
The proposed approach also has two main limitations that motivate future work. First, it is less effective at emulating the Stewart-Fencl physicochemical model for pH prediction, likely due to error propagation through intermediate variables. Second, as highlighted by Ming et al. (2023), the method becomes computationally expensive with larger datasets. To handle this, the data were partitioned into shorter admission windows. We did not compare the results against non-discretized time data for two reasons. First, the absolute number of missing values would differ when using non-discretized raw data, even at the same missingness proportion, complicating a fair comparison. Second, the discretization step aligns the data with clinically relevant time intervals, reflecting the practical realities of ICU monitoring, where clinicians are not always available at the bedside and thus only observe aggregated measurements. Note that such an approach may not be feasible in settings with high-frequency measurements, such as those from wearable devices.
To address these limitations, several paths forward exist. The computational burden can be reduced through time discretization and data aggregation, as demonstrated in this work. Alternative solutions include implementing sparse GP approximations (Snelson and Ghahramani 2007; Bauer et al. 2016) or utilizing GPU parallelization for exact GP computations (Wang et al. 2019). Additionally, further research should evaluate the model's performance on multimodal datasets with naturally varying patterns of missing data, focusing on scenarios with enough observations to fit a deep GP emulator.
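To make the sparse-approximation route concrete, the sketch below implements the subset-of-regressors flavour of sparse GP prediction (one member of the family analysed by Bauer et al. 2016), which replaces the O(n³) cost of exact GP regression with O(nm²) for m inducing points; the toy data, kernel, and inducing-point placement are our own illustrative choices:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel between two 1-D input sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def sor_predict(x, y, x_new, z, noise_var=0.1):
    """Subset-of-regressors predictive mean with inducing inputs z.

    Only m x m systems are solved (m = len(z)), so the cost is O(n m^2)
    rather than the O(n^3) of exact GP regression.
    """
    Kmm = rbf(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
    Kmn = rbf(z, x)
    A = Kmm + Kmn @ Kmn.T / noise_var
    w = np.linalg.solve(A, Kmn @ y) / noise_var
    return rbf(x_new, z) @ w

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
z = np.linspace(0, 10, 15)                    # 15 inducing points stand in for 200 data
mean = sor_predict(x, y, np.array([2.0, 5.0]), z)
```

When the inducing inputs coincide with the full training set, the subset-of-regressors mean recovers the exact GP predictive mean, so the approximation degrades gracefully as m grows towards n.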
Contributor Information
Ali A Septiandri, Department of Statistical Science, University College London, London WC1E 7HB, United Kingdom.
Deyu Ming, School of Management, University College London, London WC1E 6BT, United Kingdom.
Francisco Alejandro DiazDelaO, Clinical Operational Research Unit, University College London, London WC1H 0BT, United Kingdom.
Takoua Jendoubi, Department of Statistical Science, University College London, London WC1E 7HB, United Kingdom.
Samiran Ray, Paediatric Intensive Care Unit, Great Ormond Street Hospital For Children NHS Foundation Trust, London WC1N 3BH, United Kingdom.
Author contributions
Ali A. Septiandri (conceived the experiments, conducted the experiments, analysed the results, wrote the original manuscript), Deyu Ming (conceived the experiments, analysed the results, reviewed and edited the manuscript), Francisco Alejandro DiazDelaO (conceived the experiments, analysed the results, reviewed and edited the manuscript), Takoua Jendoubi (reviewed and edited the manuscript), and Samiran Ray (reviewed and edited the manuscript).
Conflict of interest
No competing interest is declared.
Funding
This work was supported by the UK Engineering and Physical Sciences Research Council (EP/W523835/1 to A.A.S. and EP/T017791/1 to F.A.D. and S.R.).
Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.
References
- Anderson CT, Breen PH. Carbon dioxide kinetics and capnography during critical care. Crit Care 2000;4:207–15. 10.1186/cc696
- Arjas A, Hauptmann A, Sillanpää MJ. Estimation of dynamic SNP-heritability with Bayesian Gaussian process models. Bioinformatics 2020;36:3795–802. 10.1093/bioinformatics/btaa199
- Barie PS. Phlebotomy in the intensive care unit: strategies for blood conservation. Crit Care 2004;8:S34–36. 10.1186/cc2454
- Bauer M, Van Der Wilk M, Rasmussen CE. Understanding probabilistic sparse Gaussian process approximations. Adv Neural Inf Process Syst 2016;29:1533–41.
- Beaulieu-Jones BK, Moore JH; Pooled Resource Open-Access ALS Clinical Trials Consortium. Missing data imputation in the electronic health record using deeply learned autoencoders. In: Pacific Symposium on Biocomputing 2017. Hackensack, NJ: World Scientific, 2017, 207–18.
- Biggio L, Wieland A, Chao MA et al. Uncertainty-aware prognosis via deep Gaussian process. IEEE Access 2021;9:123517–27.
- Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine learning in medicine. JAMA 2017;318:517–8. 10.1001/jama.2017.7797
- Cao W, Wang D, Li J et al. BRITS: bidirectional recurrent imputation for time series. Adv Neural Inf Process Syst 2018;31:6776–86.
- Che Z, Purushotham S, Cho K et al. Recurrent neural networks for multivariate time series with missing values. Sci Rep 2018;8:6085.
- Clifton L, Clifton DA, Pimentel MAF et al. Gaussian processes for personalized e-health monitoring with wearable sensors. IEEE Trans Biomed Eng 2013;60:193–7. 10.1109/TBME.2012.2208459
- Damianou A, Lawrence ND. Deep Gaussian processes. In: Carvalho CM, Ravikumar P (eds), Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, Scottsdale, AZ. PMLR, 2013, 207–15. https://proceedings.mlr.press/v31/damianou13a.html
- Damianou AC, Titsias MK, Lawrence ND. Variational Gaussian process dynamical systems. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS'11. Red Hook, NY: Curran Associates Inc, 2011, 2510–8.
- Dunlop MM, Girolami MA, Stuart AM et al. How deep are deep Gaussian processes? J Mach Learn Res 2018;19:1–46.
- Fencl V, Leith DE. Stewart's quantitative acid-base chemistry: applications in biology and medicine. Respir Physiol 1993;91:1–16.
- François T, Sauthier M, Charlier J et al. Impact of blood sampling on anemia in the PICU: a prospective cohort study. Pediatr Crit Care Med 2022;23:435–43. 10.1097/pcc.0000000000002947
- Gao P, Honkela A, Rattray M et al. Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics 2008;24:i70–5. 10.1093/bioinformatics/btn278
- Gattinoni L, Pesenti A, Matthay M. Understanding blood gas analysis. Intensive Care Med 2017;44:91–3. 10.1007/s00134-017-4824-y
- Ghosheh GO, Li J, Zhu T. IGNITE: individualized generation of imputations in time-series electronic health records. arXiv, 10.48550/arXiv.2401.04402, 2024, preprint: not peer reviewed.
- Groenwold RHH. Informative missingness in electronic health record systems: the curse of knowing. Diagn Progn Res 2020;4:8. 10.1186/s41512-020-00077-0
- Han S, Tsui K-W, Zhang H et al. Multiple imputation analysis for propensity score matching with missing causes of failure: an application to hepatocellular carcinoma data. Stat Methods Med Res 2021;30:2313–28. 10.1177/09622802211037075
- Hassan W, Martinez S. Arterial blood gas sampling [ABG machine use]. In: StatPearls. Treasure Island, FL: StatPearls Publishing, 2024. https://www.ncbi.nlm.nih.gov/books/NBK606112/
- Jafrasteh B, Hernández-Lobato D, Lubián-López SP et al. Gaussian processes for missing value imputation. Knowl Based Syst 2023;273:110603. 10.1016/j.knosys.2023.110603
- Jain Y, Tang CI, Min C et al. ColloSSL: collaborative self-supervised learning for human activity recognition. Proc ACM Interact Mob Wearable Ubiquitous Technol 2022;6:1–28.
- Jarrett D, Cebere B, Liu T et al. HyperImpute: generalized iterative imputation with automatic model selection. In: 39th International Conference on Machine Learning. PMLR, 2022, 9916–37. 10.48550/ARXIV.2206.07769
- Johnson AE, Pollard TJ, Shen L et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016;3:160035. 10.1038/sdata.2016.35
- Jung B, Rimmele T, Le Goff C et al.; AzuRea Group. Severe metabolic or mixed acidemia on intensive care unit admission: incidence, prognosis and administration of buffer therapy. A prospective, multiple-center study. Crit Care 2011;15:R238. 10.1186/cc10487
- Kellum JA. Determinants of blood pH in health and disease. Crit Care 2000;4:6–14. 10.1186/cc644
- Kirk PDW, Stumpf MPH. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data. Bioinformatics 2009;25:1300–6. 10.1093/bioinformatics/btp139
- Kyzyurova KN, Berger JO, Wolpert RL. Coupling computer models through linking their statistical emulators. SIAM/ASA J Uncertainty Quantification 2018;6:1151–71.
- Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 2013;8:e66341. 10.1371/journal.pone.0066341
- Lee DH, Driver BE, Reardon RF. Pitfalls of overreliance on capnography and disregard of visual evidence of tracheal tube placement: a pediatric case series. JEM Rep 2024;3:100061. 10.1016/j.jemrpt.2023.100061
- Lee W, Lee J, Kim Y. Contextual imputation with missing sequence of EEG signals using generative adversarial networks. IEEE Access 2021;9:151753–65.
- Li Y, Rao S, Hassaine A et al. Deep Bayesian Gaussian processes for uncertainty estimation in electronic health records. Sci Rep 2021;11:20685.
- Lipton ZC, Kale D, Wetzel R. Directly modeling missing data in sequences with RNNs: improved classification of clinical time series. In: Doshi-Velez F, Fackler J, Kale D et al. (eds), Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, Northeastern University, Boston, MA, 18–19 August 2016. PMLR, 253–70. https://proceedings.mlr.press/v56/Lipton16.html
- Little RJ, Rubin DB. Statistical Analysis with Missing Data. Vol. 793. Hoboken, NJ: John Wiley & Sons, 2019.
- Liu Q, Lin KK, Andersen B et al. Estimating replicate time shifts using Gaussian process regression. Bioinformatics 2010;26:770–6. 10.1093/bioinformatics/btq022
- Luo Y, Szolovits P, Dighe AS et al. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc 2018;25:645–53. 10.1093/jamia/ocx133
- Mathiszig-Lee JF, Catling FJR, Moonesinghe SR et al. Highlighting uncertainty in clinical risk prediction using a model of emergency laparotomy mortality risk. NPJ Digit Med 2022;5:70. 10.1038/s41746-022-00616-7
- Ming D, Guillas S. Linked Gaussian process emulation for systems of computer models using Matérn kernels and adaptive design. SIAM/ASA J Uncertainty Quantification 2021;9:1615–42.
- Ming D, Williamson D, Guillas S. Deep Gaussian process emulation using stochastic imputation. Technometrics 2023;65:150–61.
- Nishihara R, Murray I, Adams RP. Parallel MCMC with generalized elliptical slice sampling. J Mach Learn Res 2014;15:2087–112.
- Penny KI, Atkinson I. Approaches for dealing with missing data in health care studies. J Clin Nurs 2012;21:2722–9.
- Raffe MR. Oximetry and capnography. In: The Veterinary ICU Book. Boca Raton, FL: CRC Press, 2020, 86–95.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.
- Rodríguez-Villar S, Kraut J, Arévalo-Serrano J et al.; Acid-Base Working Group. Systemic acidemia impairs cardiac function in critically ill patients. EClinicalMedicine 2021;37:100956. 10.1016/j.eclinm.2021.100956
- Rubin DB. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: John Wiley & Sons, 1987. 10.1002/9780470316696
- Salimbeni H, Deisenroth M. Doubly stochastic variational inference for deep Gaussian processes. Adv Neural Inf Process Syst 2017;30:4591–602.
- Sauer A, Gramacy RB, Higdon D. Active learning for deep Gaussian process surrogates. Technometrics 2023;65:4–18. 10.1080/00401706.2021.2008505
- Sharafoddini A, Dubin JA, Maslove DM et al. A new insight into missing data in intensive care unit patient profiles: observational study. JMIR Med Inform 2019;7:e11605. 10.2196/11605
- Shin J, Lim YS, Kim K et al. Initial blood pH during cardiopulmonary resuscitation in out-of-hospital cardiac arrest patients: a multicenter observational registry-based study. Crit Care 2017;21:322. 10.1186/s13054-017-1893-9
- Siddiqui O, Ali MW. A comparison of the random-effects pattern mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. J Biopharm Stat 1998;8:545–63. 10.1080/10543409808835259
- Siegal DM, Belley-Côté EP, Lee SF et al. Small-volume blood collection tubes to reduce transfusions in intensive care: the STRATUS randomized clinical trial. JAMA 2023;330:1872–81. 10.1001/jama.2023.20820
- Sirker A, Rhodes A, Grounds R et al. Acid-base physiology: the 'traditional' and the 'modern' approaches. Anaesthesia 2002;57:348–56.
- Snelson E, Ghahramani Z. Local and global sparse Gaussian process approximations. In: Artificial Intelligence and Statistics. San Juan, Puerto Rico: PMLR, 2007, 524–31.
- Stewart PA. Independent and dependent variables of acid-base control. Respir Physiol 1978;33:9–26. 10.1016/0034-5687(78)90079-8
- Stewart PA. Modern quantitative acid–base chemistry. Can J Physiol Pharmacol 1983;61:1444–61. 10.1139/y83-207
- Titsias M, Lawrence ND. Bayesian Gaussian process latent variable model. In: Teh YW, Titterington M (eds), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010. PMLR, 844–51.
- Tsvetanova A, Sperrin M, Peek N et al. Missing data was handled inconsistently in UK prediction models: a review of method used. J Clin Epidemiol 2021;140:149–58. 10.1016/j.jclinepi.2021.09.008
- Vanhatalo J, Li Z, Sillanpää MJ. A Gaussian process model and Bayesian variable selection for mapping function-valued quantitative traits with incomplete phenotypic data. Bioinformatics 2019;35:3684–92. 10.1093/bioinformatics/btz164
- Vesin A, Azoulay E, Ruckly S et al. Reporting and handling missing values in clinical studies in intensive care units. Intensive Care Med 2013;39:1396–404. 10.1007/s00134-013-2949-1
- Wang K, Pleiss G, Gardner J et al. Exact Gaussian processes on a million data points. Adv Neural Inf Process Syst 2019;32:14648–59.
- Yıldız AY, Koç E, Koç A. Multivariate time series imputation with transformers. IEEE Signal Process Lett 2022;29:2517–21.
- You J, Li X, Low M et al. Deep Gaussian process for crop yield prediction based on remote sensing data. AAAI 2017;31:4559–65. 10.1609/aaai.v31i1.11172
- Zhang Y, Koru G. Understanding and detecting defects in healthcare administration data: toward higher data quality to better support healthcare operations and decisions. J Am Med Inform Assoc 2020;27:386–95. 10.1093/jamia/ocz201