Abstract
Background and Objective:
Vital sign monitoring in the Intensive Care Unit (ICU) is crucial for enabling prompt interventions for patients. This underscores the need for an accurate predictive system. Therefore, this study proposes a novel deep learning approach for forecasting Heart Rate (HR), Systolic Blood Pressure (SBP), and Diastolic Blood Pressure (DBP) in the ICU.
Methods:
We extracted 24,886 ICU stays from the MIMIC-III database, which contains data from over 46 thousand patients, to train and test the model. The model proposed in this study, the Transformer-based Diffusion Probabilistic Model for Sparse Time Series Forecasting (TDSTF), merges Transformer and diffusion models to forecast vital signs. TDSTF showed state-of-the-art performance in predicting vital signs in the ICU, outperforming other models at predicting distributions of vital signs while being more computationally efficient. The code is available at https://github.com/PingChang818/TDSTF.
Results:
The results of the study showed that TDSTF achieved a Standardized Average Continuous Ranked Probability Score (SACRPS) of 0.4438 and a Mean Squared Error (MSE) of 0.4168, an improvement of 18.9% and 34.3% over the best baseline model, respectively. The inference speed of TDSTF is more than 17 times faster than the best baseline model.
Conclusion:
TDSTF is an effective and efficient solution for forecasting vital signs in the ICU, and it shows a significant improvement compared to other models in the field.
Keywords: deep learning, time series forecasting, sparse data, vital signs, ICU
1. Introduction
Vital signs are crucial in monitoring patients’ health and body functions in the ICU. Continuous monitoring systems alert caregivers of potential adverse events [1, 2]. A predictive warning system for vital signs can save valuable time by enabling prompt interventions [3, 4]. However, the implementation of such a system faces several challenges. ICUs are complex environments where patients have multiple underlying conditions, treatments, and interventions that can affect their vital signs, making it difficult to develop algorithms that accurately predict them [5, 6]. Vital sign prediction requires large amounts of data to develop accurate algorithms [7], but in many ICUs, the availability and quality of electronic health record data can be limited [8, 9]. This makes it challenging to use classical statistical methods, which often rely on manual feature engineering and cannot capture complex patterns [10, 11]. As a result, few attempts have been made to predict vital signs in the ICU using these methods.
With the development of deep learning, applying this new approach to predicting ICU vital signs has become possible. Deep learning has revolutionized the field of time series forecasting in recent years, with its high representational power enabling successful predictions of vital signs in various studies. For instance, in Generative Boosting [12], a Long Short-Term Memory (LSTM) network is used to create a generative model, effectively reducing error propagation and improving HR prediction performance. In a comparison of Recurrent and Convolutional Neural Networks (RNNs and CNNs) [13] using different horizon strategies on the MIMIC-II dataset, the bidirectional LSTM (Bi-LSTM) with the DIRMO strategy delivered the best predictions of HR and blood pressure. The TOP-Net model [14] predicts tachycardia onset using Bi-LSTM, with a prediction horizon of up to 6 hours; it was trained on data from fewer than 6 thousand patients in the MIMIC-III database. The Temporal Fusion Transformer [15] predicts vital sign quantiles based on an attention mechanism, capturing anomalous temporal patterns and optimizing input time windows by calculating temporal importance.
Although these methods have shown promising results in vital sign forecasting, they still face several limitations when it comes to practical application in the ICU setting. First, these models require continuous monitoring of vital signs; however, given the complex and unpredictable conditions in the ICU, monitoring often becomes intermittent and sporadic. Second, these models only consider vital signs, ignoring the interrelated events, namely the interventions and conditions surrounding a patient in the ICU, that could improve forecasting accuracy. For example, existing models fail to account for active interventions such as medications and medical procedures, and their impact on patient vital signs. Third, the datasets used to evaluate these models often have a limited number of subjects, leading to a lack of generalizability and potential bias, which is a major concern in critical ICU scenarios.
We aim to develop an effective and efficient deep learning approach to forecast Heart Rate (HR), Systolic Blood Pressure (SBP), and Diastolic Blood Pressure (DBP) in the ICU setting. The use of diffusion probabilistic models (diffusion models for short) [16, 17] in time series analysis has been gaining popularity due to their balance between flexibility and tractability [18], and these models have achieved state-of-the-art performance in time series forecasting. Our study investigates the effectiveness of diffusion models in handling sparse time series data and making fast and accurate predictions of vital signs in the ICU setting. The triplet form enhances the diffusion model's ability to process sparse data, and a Transformer-based backbone leads to improved performance compared to baseline models. The ultimate goal is to promptly provide ICU caregivers with critical information by considering all recorded events, not just vital signs. The novel contributions of this paper include: (1) an examination of the ability of the diffusion model to extract temporal dependencies from sparse time series data; (2) the use of the triplet form to enhance the efficiency of the diffusion model when processing sparse data; (3) fast and accurate forecasting of vital signs in the ICU setting; (4) integration of all recorded events in the ICU setting for vital sign forecasting without being limited by screened data; and (5) a comparison of the proposed model with baseline models, showing improved performance using the Transformer as the backbone.
2. Related Works
2.1. Probabilistic Time Series Forecasting
In many real-world scenarios, the future is uncertain, and making a single best estimate of the outcome is impossible. To account for this uncertainty, probabilistic forecasting provides a range of possible outcomes and the probability of each. This is particularly significant in the ICU setting, where caregivers must understand the risks associated with different decisions. Several probabilistic time series forecasting models have been proposed in recent years and achieved state-of-the-art performance. MQ-RNN [19] uses an RNN to generate hidden states of the input time series, which are then transformed into contextual information by a global Multilayer Perceptron (MLP); a local MLP uses the contextual information and covariates to generate quantile predictions. In DeepAR [20], hidden states are obtained through an RNN and then input into linear layers with activation to generate the mean and variance of the assumed likelihood model from which predictions are sampled. DeepFactor [21] assumes that time series are exchangeable and decomposes the joint distribution into global and local time series; an RNN is employed to capture global non-linear patterns, while an assumed observation model discerns local uncertainties conditioned on the global effects. The outputs of these two functions are used to generate the forecasting distribution. EnCQR [22] is a homogeneous ensemble approach whose member learners can be constructed using various machine learning algorithms; it combines conformal prediction and quantile regression to construct prediction intervals without relying on specific distributional assumptions.
2.2. Diffusion Model
The diffusion model is a powerful generative model that learns the underlying distribution of data by transforming data samples into Gaussian noise, and it has achieved state-of-the-art performance in various applications [23]. Initially, it garnered significant attention due to its superior performance in image synthesis compared to the Generative Adversarial Network (GAN) [24–26]. In recent years, its potential has expanded to domains such as protein sequence analysis [27], threat detection [28], audio synthesis [29], and probabilistic time series forecasting [16, 17]. CSDI takes as input a matrix holding both the historical data and the target, together with a mask matrix indicating missing values [17]. Its backbone is based on DiffWave [30], which enables correlation across all features and time points. The results from CSDI have shown that the diffusion model can be optimized by selecting an appropriate backbone for the specific task.
2.3. Transformer
The Transformer uses an encoder-decoder architecture [31], widely applied in Natural Language Processing (NLP) tasks. It is known for its ability to attend to specific parts of the input sequence rather than considering the entire sequence equally. This is achieved through an attention mechanism, which assigns a weight to each element of the input sequence, indicating how much attention the model should allocate to each element when making predictions. A notable extension of the Transformer is GPT-3 [32], which has demonstrated strong semantic representation capabilities. ICU data share many characteristics with NLP data, given the vast vocabulary of recordable items. While multiple aspects of a patient's condition can be monitored in the ICU, only a limited number of them are recorded at any given time. Additionally, the recording intervals may be irregular, presenting data processing challenges. Furthermore, different patients may have different items recorded, and all possible items must be considered in the analysis. All of these factors contribute to the extreme sparsity of ICU data [8, 9].
3. Methods
3.1. TDSTF
Generative models aim to learn the underlying distribution of an observation dataset, but the main challenge lies in marginalizing out the latent variables to calculate the normalizing constant for a valid distribution, which is intractable [33]. Variational inference is a commonly used solution, transforming the distribution calculation into an optimization problem. The diffusion model applies variational inference to approximate a data distribution. It is based on the idea that a continuous Gaussian diffusion process can be reversed with the same functional form as the forward process [34]. After learning the reverse process, pure input noise converges to data points sampled from the modeled distribution. This can be approximated with a discrete Gaussian diffusion process, given a large enough number of diffusion steps.
We propose a Transformer-based Diffusion Probabilistic Model for Sparse Time Series Forecasting (TDSTF). The diffusion processes in terms of time series forecasting are shown in Figure 1, divided into forward and reverse trajectories. Conditioned on the history observation $\mathbf{x}^{co}$, the purpose of our method is to learn a distribution $p_\theta$, parameterized by $\theta$, that approximates the true conditional distribution $q(\mathbf{x}_0 \mid \mathbf{x}^{co})$, so that $p_\theta$ predicts the target $\mathbf{x}_0$. During the forward trajectory, a small amount of noise is added to the ground truth $\mathbf{x}_0$ at each step to obtain the noisy target $\mathbf{x}_t$. At the final step, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The noise amount at each step is determined by a variance schedule $\{\beta_t\}_{t=1}^{T}$. The forward process is expressed as a Markov chain:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right) \tag{1}$$

where $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ stands for a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Table 1 lists all notations used in our method for convenience. Bold font denotes a vector; for instance, $\mathbf{x} \in \mathbb{R}^d$, where $d$ is the dimension of the vector.
Figure 1:

Diagram of the diffusion processes in our forecasting model. The blue curves indicate the history events as the conditional data $\mathbf{x}^{co}$. The red curves symbolize the noisy target $\mathbf{x}_t$ at time step $t$ in the forward trajectory or the intermediate result $\hat{\mathbf{x}}_t$ during prediction. The target is represented by $\mathbf{x}_0$. The gaps among the curves symbolize the missing values in the sparse data. The forward trajectory adds noise of increasing levels to $\mathbf{x}_0$. The reverse trajectory then removes the noise from the pure noise $\mathbf{x}_T$ to generate samples.
Table 1:
Notations used in the proposed model.
| Notation | Description |
|---|---|
| $\mathbf{x}_0$ | Ground truth of target without noise |
| $\mathbf{x}_t$ | Noisy target during forward trajectory at step $t$ |
| $\hat{\mathbf{x}}_t$ | Intermediate result during inference at step $t$ |
| $\mathbf{x}^{co}$ | Conditional data |
| $\epsilon_\theta$ | Backbone network of the diffusion model parameterized by $\theta$ |
| $q$ | Distribution of noisy target |
| $p_\theta$ | Distribution of output from the diffusion model |
| $\beta_t$ | Diffusion schedule at step $t$ |
| $T$ | Number of diffusion steps |
The reverse process begins with a Gaussian noise sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. At each time step, the process progressively denoises $\mathbf{x}_t$ until $\mathbf{x}_0$ is obtained. The final sampled $\mathbf{x}_0$ is drawn from a distribution designed to resemble the training data distribution. This process can also be represented as a Markov chain:

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \mu_\theta(\mathbf{x}_t, t),\ \sigma_\theta^2(\mathbf{x}_t, t)\,\mathbf{I}\right) \tag{2}$$

where $\mu_\theta$ and $\sigma_\theta^2$ are learnable functions that generate the mean and variance of the modeled distribution. Equation 1 implies that $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)$, where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. This means that we can sample $\mathbf{x}_t$ at any time step during the forward process based only on $\mathbf{x}_0$. As a result, it is advantageous to construct $\mu_\theta$ and $\sigma_\theta^2$ as follows:

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right) \tag{3}$$

$$\sigma_\theta^2(\mathbf{x}_t, t) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \tag{4}$$

where $\epsilon_\theta$ is the backbone network that predicts the added noise, in order for $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ to be close to the forward process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, as stated in [35].
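To make the recursions above concrete, the following is a minimal sketch of the closed-form forward sampling (Equation 1) and a single reverse step (Equations 3 and 4). The schedule values ($T = 50$, $\beta_1 = 10^{-4}$, $\beta_T = 0.5$) mirror the setup reported in Section 4.4, and quadratic spacing is assumed as in CSDI [17]; this is illustrative code, not the released TDSTF implementation.

```python
import torch

# Quadratic variance schedule: T = 50 steps from beta_1 = 1e-4 to beta_T = 0.5
# (values as reported later in the paper; quadratic spacing assumed, as in [17]).
T = 50
betas = torch.linspace(1e-4 ** 0.5, 0.5 ** 0.5, T) ** 2
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_sample(x0, t, eps):
    """Sample x_t directly from x_0 via q(x_t | x_0), the closed form of Eq. 1."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

def reverse_step(x_t, t, eps_pred):
    """One denoising step x_t -> x_{t-1} using the mean (Eq. 3) and variance (Eq. 4)."""
    ab = alpha_bars[t]
    mean = (x_t - betas[t] / (1.0 - ab).sqrt() * eps_pred) / alphas[t].sqrt()
    if t == 0:
        return mean  # no noise is added at the final step
    var = (1.0 - alpha_bars[t - 1]) / (1.0 - ab) * betas[t]
    return mean + var.sqrt() * torch.randn_like(x_t)
```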
3.2. Model Structure
The objective of the model training is to maximize the Evidence Lower Bound (ELBO) of the data log-likelihood [24], which reduces to the following noise-prediction objective:

$$\min_\theta \; \mathbb{E}_{\mathbf{x}_0 \sim q,\ \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\ t}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ \mathbf{x}^{co},\ t\right) \right\|_2^2\right] \tag{5}$$

where $\epsilon_\theta$ is a function that can be learned to predict the noise $\epsilon$ for each step of the denoising process. Figure 2 illustrates the entire training procedure. Before being input into $\epsilon_\theta$, $\mathbf{x}^{co}$ and $\mathbf{x}_t$ are transformed into triplets.
Figure 2:

The training procedure of our forecasting model minimizes the Mean Squared Error (MSE) between the noise prediction $\epsilon_\theta(\cdot)$ and the true noise $\epsilon$ as the loss function. The implementation of $\epsilon_\theta$ in the dashed box will be expanded and explained in detail later.
Traditional methods, such as aggregation and imputation, are ineffective when dealing with extremely sparse data: aggregation results in poor resolution and loss of temporal information, while imputation introduces excessive noise. To overcome these issues, the triplet form compactly stores sparse data. Each triplet contains a feature, a time, and a value, together with a mask bit indicating the presence or absence of data, represented by 1 or 0, respectively. The absolute time of the valid data points from the raw data is transformed into a relative time range. Figure 3 gives an illustration of converting a sparse matrix to a triplet representation, and a code sketch follows the figure. If the number of conditional triplets exceeds the input size (preset as a hyperparameter according to data preprocessing), the feature selection module prioritizes data points of the same features as the target, most correlated to the target data [14, 15], and then fills in the remaining input triplets randomly. The Mean Squared Error (MSE) between the predicted noise and the true Gaussian noise is used as the loss function.
Figure 3:

An illustration of converting a sparse matrix to triplet representation. In this example, a patient received 200 milligrams (mg) of losartan at the second minute. A Heart Rate (HR) of 90 beats per minute (bpm) and a Systolic Blood Pressure (SBP) of 130 millimeters of mercury (mmHg) are recorded in the third and fifth minute, respectively. Assuming all other data points in the matrix are missing, the 3 valid event records a, b, and c line up in an array of triplets. The size of the input triplet array is preset and can be larger than the number of valid triplets. The mask value of 0 in d signifies the invalidity of this triplet, meaning the invalidity of its other 3 elements.
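A minimal sketch of the conversion in Figure 3 follows; the feature ids and fixed array size are illustrative assumptions, not the released preprocessing code.

```python
import numpy as np

def to_triplets(events, num_triplets=60):
    """Pack sparse (feature_id, minute, value) records into fixed-size arrays of
    (feature, time, value, mask); mask 0 marks padding, and padded values stay at
    0, the per-feature mean after standardization."""
    feats = np.zeros(num_triplets, dtype=np.int64)
    times = np.zeros(num_triplets, dtype=np.float32)
    values = np.zeros(num_triplets, dtype=np.float32)
    masks = np.zeros(num_triplets, dtype=np.float32)
    for i, (f, t, v) in enumerate(events[:num_triplets]):
        feats[i], times[i], values[i], masks[i] = f, t, v, 1.0
    return feats, times, values, masks

# The three valid events of Figure 3 (feature ids 17, 0, 1 are hypothetical):
# losartan 200 mg at minute 2, HR 90 bpm at minute 3, SBP 130 mmHg at minute 5.
triplets = to_triplets([(17, 2.0, 200.0), (0, 3.0, 90.0), (1, 5.0, 130.0)])
```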
We construct $\epsilon_\theta$ using a deep neural network divided into two stages: the front stage and the back stage. This architecture is depicted in Figure 4. The front stage maps the data into higher-dimensional spaces: the embedding module maps the triplet features into vectors, a linear layer projects the triplet values into vectors, and a group of sinusoidal functions transforms the triplet times into vectors. To avoid disturbing the missingness representations, the representations of the feature and time also incorporate information about the missingness, and the values of triplets with masks of 0 are set to 0 (the mean value for all features after standardization). A random diffusion step is applied in each iteration to generate noisy target values. The diffusion step is represented as a vector obtained from a lookup table and projected through linear layers. All of these vectors are fed into the back stage; a sketch of the front stage follows below.
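The following sketch shows one plausible reading of the front stage; the dimensions, module names, and layout are assumptions for illustration, with only the overall design (feature embedding, value projection, sinusoidal time encoding, and a diffusion-step lookup table) taken from the description above.

```python
import math
import torch
import torch.nn as nn

class FrontStage(nn.Module):
    """Maps triplet components and the diffusion step into high-dimensional vectors."""
    def __init__(self, num_features=129, dim=128, num_steps=50):
        super().__init__()
        self.feat_emb = nn.Embedding(num_features, dim)  # feature id -> vector
        self.value_proj = nn.Linear(1, dim)              # scalar value -> vector
        self.step_emb = nn.Embedding(num_steps, dim)     # diffusion-step lookup table
        self.step_proj = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.dim = dim

    def time_encoding(self, times):
        """Sinusoidal encoding of the relative triplet times."""
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        angles = times.unsqueeze(-1) * freqs
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, feats, times, values, step):
        # feats, times, values: (batch, triplets); step: (batch,)
        h = (self.feat_emb(feats)
             + self.time_encoding(times)
             + self.value_proj(values.unsqueeze(-1)))
        return h, self.step_proj(self.step_emb(step))
```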
Figure 4:

The architecture of the $\epsilon_\theta$ network, designed for predicting the noise at the current diffusion step.
Considering the trait similarity between NLP data and the triplet data in our context, both originating from expansive and diverse spaces, we employ the Transformer to build the back stage. This decision aligns with prevailing work advocating the attention mechanism for its efficacy in handling complex data representations [31, 32, 36]. The conditional triplets are transformed by the encoders into Query ($Q$), Key ($K$), and Value ($V$) vectors. The self-attention layer calculates the correlation strength between one triplet and every other triplet through the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{6}$$

where $d_k$ is the dimension of $Q$ and $K$, and the scaling factor $1/\sqrt{d_k}$ enhances the model's robustness. The $K$ and $V$ output from the top encoder layer are input into the decoder's encoder-decoder attention layers, together with the target, to obtain the cross attention. The multi-headed attention mechanism also enhances the model's representational power by expanding latent subspaces with multiple independent sets of $Q$, $K$, and $V$. The architecture is characterized by its use of 3 Transformer blocks, each consisting of 2 stacked encoder-decoder Transformers that sequentially process the input data. This encoder-decoder pattern can refine the output of the first encoder-decoder Transformer, potentially correcting errors and adding complexity to the representation. The first block is connected through a skip connection that extends beyond the subsequent Transformer block, directly interfacing with the convolutional (Conv) decoder. The second block also utilizes a skip connection, contributing its processed features directly to the Conv decoder. The third block follows a direct sequential path, transmitting its output to the Conv decoder. The skip connections allow information to be passed directly from one layer to another, allowing the network to effectively learn complex relationships in the data while mitigating the risk of vanishing gradients [29, 37]. The training process is outlined in Algorithm 1.
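Equation 6 is the standard scaled dot-product attention of [31]; stripped of our specific layer sizes, it amounts to:

```python
import torch

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 6)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise triplet correlations
    return torch.softmax(scores, dim=-1) @ V
```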
Algorithm 1.
Training
| Input: conditional data $\mathbf{x}^{co}$, ground truth target $\mathbf{x}_0$ |
| Output: trained network $\epsilon_\theta$ |
| repeat |
| Sample $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$ and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ |
| Take gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ \mathbf{x}^{co},\ t\right) \right\|_2^2$ |
| until converged |
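In PyTorch, one iteration of Algorithm 1 reduces to the familiar noise-prediction step below; `model` stands in for the $\epsilon_\theta$ network, and its call signature is an assumption.

```python
import torch

def train_step(model, optimizer, x0, cond, alpha_bars, T=50):
    """One Algorithm 1 iteration: sample a step and a noise, regress the noise."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # closed-form forward sample
    loss = ((eps - model(x_t, cond, t)) ** 2).mean()  # MSE to the true noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```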
3.3. Model Inference
The values of the predicted triplets in the model are initially Gaussian noise, corresponding to the first diffusion step of the reverse trajectory (step $T$ in Figure 1). These noisy values are then denoised using the output from $\epsilon_\theta$ according to Equation 2, which is written as $\hat{\mathbf{x}}_{t-1} = \mu_\theta(\hat{\mathbf{x}}_t, t) + \sigma_\theta(\hat{\mathbf{x}}_t, t)\,\mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The features and times of the triplets to predict provide important contextual information. The denoised values are then fed back into the model for the next diffusion step, and this process is repeated until the final step of the reverse trajectory outputs $\hat{\mathbf{x}}_0$. The detailed process for this inference is described in Algorithm 2.
Algorithm 2.
Inference
| Input: trained network $\epsilon_\theta$, conditional data $\mathbf{x}^{co}$ |
| Output: $\hat{\mathbf{x}}_0$ (prediction of $\mathbf{x}_0$) |
| Sample $\hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ |
| for $t = T, \ldots, 1$ do |
| Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else set $\mathbf{z} = \mathbf{0}$ |
| $\hat{\mathbf{x}}_{t-1} = \mu_\theta(\hat{\mathbf{x}}_t, t) + \sigma_\theta(\hat{\mathbf{x}}_t, t)\,\mathbf{z}$ |
| end for |
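A sketch of Algorithm 2 in the same style, reusing the `reverse_step` helper from the Section 3.1 sketch and the assumed `model` signature:

```python
import torch

@torch.no_grad()
def sample(model, cond, shape, T=50):
    """Reverse trajectory: start from pure noise and denoise T times (Algorithm 2)."""
    x = torch.randn(shape)  # \hat{x}_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_pred = model(x, cond, torch.full((shape[0],), t))
        x = reverse_step(x, t, eps_pred)  # one Equation 2 update
    return x  # \hat{x}_0, one sample from the modeled distribution
```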
4. Experiments
4.1. Data and Preprocessing
In this study, we evaluate the model using the MIMIC-III dataset [38]. This dataset holds health information for over 46 thousand patients admitted to the Beth Israel Deaconess Medical Center (Boston, MA) between 2001 and 2012. The data preprocessing approach is depicted in Figure 5. The 3 steps in the yellow box are detailed as follows. First, we exclude records related to pediatric patients by retaining only individuals aged 18 years or older. Second, we remove records with abnormal feature values. The features represent the medical events that happen to the patients, such as vital signs, medication usage, and biochemical test results. We eliminate outliers by retaining values within reasonable ranges for 117 numerical features, while non-null values are retained for other features categorized as yes-or-no items. Third, we preserve only the initial 40 consecutive minutes of each ICU stay, denoting this piece of data as an individual ICU sample. Of these 40 minutes, the first 30 minutes are used for the conditional data, and the last 10 minutes for the target data. Both the conditional and target data are screened to ensure non-emptiness; we exclude ICU samples whose 10-minute segment lacks any target data, as sketched below. We standardize the feature values to ensure they are suitable for input into the TDSTF model. We partially adopt the exclusion method from [9] in steps 1 and 2. To prevent subject repetition, we divide the dataset into training and testing sets, randomly selecting 80% and 20% of the data by subject.
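The windowing and screening logic of step 3 can be sketched as follows; the event representation and feature names are illustrative, not the released preprocessing code.

```python
def split_stay(events, cond_minutes=30, target_minutes=10):
    """events: list of (feature, minute, value) within one ICU stay. Keep the
    first 40 minutes: the first 30 as conditional data, the last 10 as targets;
    return None (exclude the sample) if either segment is empty."""
    window = [e for e in events if e[1] < cond_minutes + target_minutes]
    cond = [e for e in window if e[1] < cond_minutes]
    target = [e for e in window
              if e[1] >= cond_minutes and e[0] in ("HR", "SBP", "DBP")]
    if not cond or not target:
        return None
    return cond, target
```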
Figure 5:

Illustration of the data preprocessing steps involved in this study. $n$ denotes the number of ICU stays.
After implementing the exclusion criteria in the preprocessing step, a total of 11,401 subjects were included in the study. From these eligible subjects, we extracted 24,886 ICU stays to test our method. Table 2 presents statistics for the eligible subjects and target features in both the training and testing datasets before standardization. Subjects in the training and testing datasets combined were generally older (age of 65.42±16.27) and male (57.2%). The training data is further split at an 80–20% ratio, with the latter portion used for validation. Note that one subject corresponds to one or more ICU stays. The goal of this study is to predict three target features: HR, SBP, and DBP [39].
Table 2:
Subject number by sex; mean and standard deviation of subject age and target features.
| | Training | Testing |
|---|---|---|
| n | 19,955 | 4,931 |
| M/F | 5244/3919 | 1281/957 |
| Age | 65.43±16.35 | 65.36±15.94 |
| HR | 90.82±21.56 | 91.24±22.27 |
| SBP | 117.74±27.95 | 117.76±30.45 |
| DBP | 60.34±16.86 | 59.90±17.05 |

HR is heart rate in beats per minute (bpm); SBP is systolic blood pressure in millimeters of mercury (mmHg); DBP is diastolic blood pressure in mmHg; M is male; F is female; n is the number of ICU stays.
4.2. Metrics
To thoroughly assess the performance of the proposed method, multiple metrics are utilized to measure the differences between the forecast and the ground truth time series.
One of the simplest and most commonly used metrics for this purpose is the Mean Squared Error (MSE). It is defined as the mean squared distance between each element of the prediction and the corresponding target, calculated as

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2 \tag{7}$$

where $\hat{x}_i$ is the prediction, $x_i$ is the ground truth, and $N$ is the number of evaluated elements.
In our case, the median values of the generated samples act as the deterministic predictions for the calculation of MSE with the ground truth. A lower MSE value indicates better performance.
Another widely used metric for evaluating the accuracy of probabilistic forecasts is the Continuous Ranked Probability Score (CRPS). It is defined as:

$$\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty}\left(F(z) - \mathbb{1}\{z \geq x\}\right)^2 dz \tag{8}$$

where $F$ is the modeled Cumulative Distribution Function (CDF), $x$ is the observation, and $\mathbb{1}$ is the indicator function. Figure 6 illustrates how CRPS is calculated. It provides a comprehensive measure of forecast accuracy by evaluating the mean difference between the predicted CDF and the ground truth, taking into account both under-prediction and over-prediction. The smaller the CRPS value, the more the modeled distribution concentrates around the ground truth, indicating better performance.
Figure 6:

A schematic illustration of the Continuous Ranked Probability Score (CRPS). The blue curves represent the modeled probability density functions, and the red dashed curves represent their cumulative distribution functions. Subplots (a) and (b) depict the same distribution with different ground truth values. The black hatched area representing the integration as the CRPS value in (a) is smaller than that in (b), meaning the modeled distribution evaluates better.
To calculate the Standardized Average CRPS (SACRPS) when predicting a standardized target value $x$, we adopt the same discrete quantile loss as in [17]:

$$\Lambda_\alpha\!\left(F^{-1}(\alpha), x\right) = \left(\alpha - \mathbb{1}\{x < F^{-1}(\alpha)\}\right)\left(x - F^{-1}(\alpha)\right) \tag{9}$$

$$\mathrm{CRPS}(F, x) \approx \sum_{i=1}^{19} \frac{2\,\Lambda_{\alpha_i}\!\left(F^{-1}(\alpha_i), x\right)}{19}, \qquad \alpha_i = 0.05\,i \tag{10}$$

where $F^{-1}(\alpha_i)$ is the value at quantile level $\alpha_i$ from the generated samples. The scaling factor represents the variance of the target distribution, which becomes more difficult to predict as the variance increases. SACRPS, calculated as the CRPS divided by this scaling factor, provides a more accurate evaluation of the model.
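A sample-based sketch of this computation is given below. The quantile levels follow the discrete approximation of [17]; normalizing by the mean absolute target value follows the CSDI implementation and is our reading of the scaling factor described above, so treat that choice as an assumption.

```python
import numpy as np

def sacrps(samples, y):
    """Approximate CRPS via quantile losses at levels 0.05, ..., 0.95
    (Equations 9-10), then normalize. samples: (num_samples, n); y: (n,)."""
    levels = np.arange(0.05, 1.0, 0.05)
    quantiles = np.quantile(samples, levels, axis=0)  # empirical F^{-1}(alpha_i)
    crps = 0.0
    for alpha, q in zip(levels, quantiles):
        crps += 2 * np.mean((alpha - (y < q)) * (y - q))
    return crps / len(levels) / np.abs(y).mean()
```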
4.3. Experiment Setup
We use the ADAM optimizer [40] with a learning rate of $10^{-3}$. The mini-batch size is set to 32 samples. We implemented all the experiments using the Python programming language and the PyTorch framework [41]. The experiments were conducted on a workstation equipped with an Intel i7-12700K processor and an Nvidia RTX 3090 graphics card.
4.4. Comparison with Baseline
Five distinct models were employed as baselines for predicting vital signs with the MIMIC-III dataset. MQ-RNN encompasses a single layer of bidirectional GRU with a hidden size of 50. The encoder CNN employs kernel sizes of [7, 3, 3] alongside channel counts of [30, 30, 30]. Each layer within the Forking MLP Decoder is dimensioned at 30. The lookback period is tailored to cover the entire input time length. Categorical feature embeddings assume a dimensionality of 50. DeepAR comprises 2 layers of LSTM, each with 40 cells. The dimension of the embeddings for categorical features is set to 50. The context length is adjusted to encompass the prediction horizon. For evaluation and sampling predictions, the Student's t-distribution is employed. DeepFactor applies both global and local models instantiated as LSTMs. The count of global factors stands at 10, with each global model housing a single hidden layer of 50 units. The local model comprises a hidden layer containing 5 units. The observation model is specified as Student's t. EnCQR contains an ensemble of 5 member learners, each configured as a 5-layer LSTM with a hidden size of 128 and a regularization factor of $10^{-4}$. Conformalized prediction intervals are utilized as the output. CSDI involves 50 diffusion steps, with a quadratic noise schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.5$ [17]. Feature, time, and diffusion embedding dimensions are established at 16, 128, and 128, respectively. To address the missing data, we adopted mean imputation (zeros after standardization) for MQ-RNN, DeepAR, DeepFactor, and EnCQR, whereas CSDI utilized a mask matrix to denote the positions of missing data. In our model, the input size of the conditional triplet array is set to 60 according to Figure 7.
Figure 7:

Histogram showing the distribution of valid conditional data points derived from ICU samples within the training dataset, exhibiting a mean count of 12.9. On average, 99.7% of the conditional data within the matrix representation are missing. More than 99.5% of the ICU samples in the training dataset contain 60 or fewer valid conditional data points.
The experimental results are presented in Table 3, with 3 trials conducted. Both the SACRPS and MSE results indicate that the CSDI and TDSTF models performed much better than the other models. On average, TDSTF obtained an SACRPS of 0.4438 and an MSE of 0.4168, while CSDI obtained 0.5470 and 0.6341, respectively. Thus, TDSTF improved SACRPS and MSE by 18.9% and 34.3% over CSDI. Figure 8 shows the violin histograms of SACRPS and MSE per ICU sample for the first trial of TDSTF. The results concentrate around the median values of 0.4651 (SACRPS) and 0.1677 (MSE), which demonstrates the robustness of the model.
Table 3:
SACRPS and MSE of the proposed model in comparison with the baselines. The best results are shown in bold.
| | MQ-RNN | DeepAR | DeepFactor | EnCQR | CSDI | TDSTF (Ours) |
|---|---|---|---|---|---|---|
| SACRPS | 1.0276 | 0.9930 | 0.8221 | 0.8255 | 0.5515 | **0.4379** |
| | 1.0367 | 0.9615 | 0.8207 | 0.8256 | 0.5455 | **0.4526** |
| | 1.0385 | 0.9653 | 0.8238 | 0.8254 | 0.5439 | **0.4408** |
| MSE | 1.1014 | 1.0295 | 1.0325 | 1.2711 | 0.6430 | **0.4061** |
| | 1.1237 | 1.0304 | 1.0332 | 1.2710 | 0.6358 | **0.4434** |
| | 1.1186 | 1.0289 | 1.0324 | 1.2715 | 0.6236 | **0.4008** |
Figure 8:

Violin histograms of the SACRPS (left) and the MSE (right) per ICU sample from the first trial of TDSTF. The results for each metric concentrate around the median values of 0.4651 and 0.1677, respectively.
Figure 9 shows some forecasting examples from the first trial of TDSTF. All the deterministic predictions fall within the 95% confidence interval. Subplot (a) predicts an increasing trend of HR over 100 beats per minute (bpm). A sudden increase in HR to around 100 bpm is captured in subplot (b). Both imply potential tachycardia. Subplot (d) shows a continuously decreasing SBP, so possible hypotension may be expected. Subplot (e) forecasts recovery from hypertension into a period of normotensive SBP around 120 millimeters of mercury (mmHg). A decreasing trend of DBP below 60 mmHg is seen in subplot (g), suggesting the patient is at risk for dangerously low blood pressure. Subplot (h) captures a sudden decrease in DBP, which should alert clinicians to the potential for further hypotension. The model also performed well on test cases without target conditional data, as shown in subplots (c), (f), and (i).
Figure 9:

Examples from the test results of the first trial of TDSTF. HR predictions in subplots (a)-(c) are in beats per minute (bpm), while SBP and DBP predictions in subplots (d)-(i) are in millimeters of mercury (mmHg). The horizontal axis represents the relative time in minutes for each ICU stay. The red dots indicate target feature values, and the blue dots indicate conditional values of the same target feature. Forecasts based on all 129 features show high accuracy, even without conditional data of the target feature, as shown in (c), (f), and (i). The 95% confidence intervals are shown as the areas between the top and bottom bars of the violin histograms, with wider areas corresponding to more confidence and the middle bars representing the median values of the generated data points.
The complexity of TDSTF was compared with CSDI using PyTorch-OpCounter. The comparison considered the number of parameters and the number of Multiply-Accumulate operations (MACs) of the models. The number of parameters determines the model complexity, while the MAC count reflects the computational cost. The results show that CSDI had 0.28 million parameters, while TDSTF had 0.56 million parameters. The training time for CSDI was 12 hours, while for TDSTF it was only 0.5 hours. Inference per sample consumed more than 55 billion MACs in CSDI and 0.79 million in TDSTF. Consequently, 14.913 and 0.865 seconds were required for inference per sample in CSDI and TDSTF, respectively, making TDSTF more than 17 times faster. The fact that TDSTF is more complex than CSDI but consumes much less computation supports our assumption that the CSDI overhead is due to the large amount of missingness in its input.
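For reference, these complexity numbers come from PyTorch-OpCounter (the `thop` package); a minimal usage sketch with a stand-in module follows, since the exact TDSTF input shapes are not reproduced here.

```python
import torch
import torch.nn as nn
from thop import profile  # PyTorch-OpCounter

# Stand-in module and input; replace with the trained TDSTF network and one
# preprocessed sample to reproduce the reported MAC and parameter counts.
model = nn.Sequential(nn.Linear(60, 128), nn.ReLU(), nn.Linear(128, 60))
x = torch.randn(1, 60)
macs, params = profile(model, inputs=(x,))
print(f"{macs / 1e6:.3f} M MACs, {params / 1e6:.3f} M parameters")
```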
4.5. Ablation Study
To assess the impact of Transformer block structures on the backbone's architecture and the diffusion model's performance, we conducted an ablation study. This study focused on 3 configurations, as delineated in Figure 10. With a series of 3 trials, the outcomes are shown in Table 4. TDSTF-a and TDSTF-c require the same training duration of 0.5 hours. TDSTF-c demonstrates better performance and robustness compared to TDSTF-a. Although TDSTF-b achieves a lower MSE than TDSTF-c, it shows a higher SACRPS and demands a longer training time of 0.7 hours. TDSTF-c manages to maintain comparable overall performance with a shorter training duration. As a result, TDSTF-c achieves an optimal balance between performance and model efficiency.
Table 4:
SACRPS and MSE of the 3 ablation configurations, each with $T = 50$ and a noise schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.5$. The best results are shown in bold. TDSTF-c is highlighted as the selected configuration.
| Configuration | TDSTF-a | TDSTF-b | TDSTF-c |
|---|---|---|---|
| SACRPS | 0.4843 | **0.4328** | 0.4379 |
| | 0.6072 | 0.4720 | **0.4526** |
| | 0.5162 | 0.4442 | **0.4408** |
| MSE | 0.7861 | **0.3188** | 0.4061 |
| | 1.2609 | **0.4247** | 0.4434 |
| | 0.6341 | **0.3506** | 0.4008 |
Noise amount plays a critical role in diffusion models, prompting our investigation of the schedule through 2 additional sets of experiments, each consisting of 3 trials, built upon TDSTF-c. In both scenarios, $\beta_1$ remains fixed at $10^{-4}$, and $T$ is still set to 50. $\beta_T$ assumes values of 0.05 and 0.1, respectively. The results in Table 5 indicate that the schedule spanning from $\beta_1 = 10^{-4}$ to $\beta_T = 0.5$ yields the best performance.
Table 5:
SACRPS and MSE of configuration TDSTF-c with 3 $\beta_T$ values. The best results are shown in bold.
| $\beta_T$ | 0.05 | 0.1 | 0.5 (Ours) |
|---|---|---|---|
| SACRPS | 0.5898 | 0.4893 | **0.4379** |
| | 0.5884 | 0.4721 | **0.4526** |
| | 0.6002 | 0.5023 | **0.4408** |
| MSE | 0.5247 | 0.4613 | **0.4061** |
| | 0.5629 | 0.4443 | **0.4434** |
| | 0.5777 | 0.4847 | **0.4008** |
5. Discussion
This study presents a deep learning model, TDSTF, that can accurately predict sparse time series data in the ICU setting. The model was trained on MIMIC-III data and tested on several performance metrics. The results show that TDSTF outperforms several baseline models and can detect important changes in the vital signs of ICU patients. The model consists of a residual network based on Transformers, which enables it to capture complex temporal dependencies while remaining efficient in computation and storage.
The TDSTF model is developed to accurately forecast sparse time series data without making assumptions about the underlying distribution. The model was trained and tested using the MIMIC-III ICU data and evaluated using SACRPS and MSE. It outperforms the baseline models, including MQ-RNN, DeepAR, DeepFactor, EnCQR, and CSDI. The mask used in CSDI reduces disturbance but cannot guarantee sufficient exclusion of invalid information from the sparse time series input. On average, 99.7% of conditional data points were missing, which could negatively impact performance and waste computational resources. Setting the input size of conditional triplets to 60 for TDSTF improved the signal-to-noise ratio, and TDSTF outperformed CSDI because of its improved representational power. MQ-RNN, DeepAR, DeepFactor, and EnCQR use RNNs to extract hidden states and cannot fully utilize parallel computation, hindering model complexity [42]; the Transformer-based network solves these issues. Moreover, DeepAR and DeepFactor assume likelihood models that may be too simple to represent high-dimensional distributions, whereas TDSTF directly learns the underlying distributions.
TDSTF is effective because it accurately captures temporal patterns in sparse time series data, as shown in Figure 9. In addition to high accuracy in the slow change cases, the model successfully recognizes sudden changes, which are important indicators of changes in system states. The high accuracy in predicting vital signs without target conditional data shows that the model can extract interrelations among all features. The quick response of TDSTF is also crucial in the ICU setting. All these benefits help detect deteriorations in time, such as sudden tachycardia or developing hypotension, and thus are important for timely interventions in the ICU setting.
However, there is still room for improving TDSTF. One issue is that limitations on the size of the Transformer input can hinder its effectiveness, as memory and time consumption increase quadratically with input size [43]. Additionally, the data used in the study may introduce bias into the model, as the eligible subjects were generally older and male (as shown in Table 2), which could limit its generalizability to other populations. These issues highlight the importance of considering both the technical limitations and the demographic characteristics of the data when developing and evaluating deep learning models for forecasting sparse time series.
We look forward to this work serving as a foundation for addressing practical clinical applications in the ICU. It is essential to note that subtle variations in vital signs can be warning signals of clinical deterioration that could eventually result in adverse outcomes. To further improve treatment outcomes, we may extend the proposed TDSTF to include core body temperature, breathing rate, and oxygen saturation as additional forecasting outputs. Additionally, we may explore the application of TDSTF to reducing false alarms, with the aim of alleviating the stress experienced by ICU caregivers.
6. Conclusion
The TDSTF model offers a solution to the limitations of current state-of-the-art probabilistic forecasting models. By incorporating a Transformer-based backbone and using a triplet representation to avoid input noise and increase speed, TDSTF outperformed the main baseline model, CSDI, on both SACRPS and MSE by an average of 18.9% and 34.3%, respectively, while running more than 17 times faster at inference. The results of this study demonstrate the improved accuracy and efficiency of TDSTF for sparse time series forecasting, making it a more practical and valuable tool for forecasting vital signs in the ICU setting. Additionally, TDSTF's ability to capture temporal dependencies across all features further enhances its potential for real-world applications. Further studies of TDSTF's performance should be performed on datasets with other patient characteristics, and, if these results are further validated, evaluation in real-time scenarios would be warranted.
Figure 10:

The back-stage structures for the ablation study.
Highlights.
The TDSTF model forecasts vital signs based on all events in the MIMIC-III dataset.
The proposed model captures temporal patterns in both slow and sudden changes.
The MSE of the model improves by 34.3% over the best baseline model.
The SACRPS of the model improves by 18.9% over the best baseline model.
The inference speed of the model is more than 17 times faster than the best baseline model.
Funding Statement
This work was supported by grants from the National Heart, Lung, and Blood Institute (#R21HL159661) and the National Science Foundation (#2052528). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
Conflict of Interest
S.F.Q. is a consultant for Bryte Bed, Guidepoint Global, Cowen Services, Whispersom, DR Capital and Best Doctors and is a member of the Hypopnea Taskforce of the American Academy of Sleep Medicine. Other authors have nothing to disclose.
Data Availability Statement
The MIMIC-III dataset used in this study is available at https://physionet.org/content/mimiciii/1.4/. In order to access the data, researchers must complete the application form.
References
- [1].Kenzaka Tsuneaki, Okayama Masanobu, Kuroki Shigehiro, Fukui Miho, Yahata Shinsuke, Hayashi Hiroki, Kitao Akihito, Sugiyama Daisuke, Kajii Eiji, and Hashimoto Masayoshi. Importance of vital signs to the early diagnosis and severity of sepsis: association between vital signs and sequential organ failure assessment score in patients with sepsis. Internal Medicine, 51(8):871–876, 2012. [DOI] [PubMed] [Google Scholar]
- [2].Yoon Joo Heung, Mu Lidan, Chen Lujie, Dubrawski Artur, Hravnak Marilyn, Pinsky Michael R, and Clermont Gilles. Predicting tachycardia as a surrogate for instability in the intensive care unit. Journal of Clinical Monitoring and Computing, 33:973–985, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Subbe Christian P, Kruger Michael, Rutherford Peter, and Gemmel L. Validation of a modified early warning score in medical admissions. Qjm, 94(10):521–526, 2001. [DOI] [PubMed] [Google Scholar]
- [4].Sessler Daniel I and Saugel Bernd. Beyond ‘failure to rescue’: the time has come for continuous ward monitoring. British journal of anaesthesia, 122(3):304–306, 2019. [DOI] [PubMed] [Google Scholar]
- [5].Doig Alexa K, Drews Frank A, and Keefe Maureen R. Informing the design of hemodynamic monitoring displays. CIN: Computers, Informatics, Nursing, 29(12):706–713, 2011. [DOI] [PubMed] [Google Scholar]
- [6].Collins Sarah A, Mamykina Lena, Jordan Desmond, Stein Dan M, Shine Alisabeth, Reyfman Paul, and Kaufman David. In search of common ground in handoff documentation in an intensive care unit. Journal of biomedical informatics, 45(2):307–315, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Kristinsson Ævar Örn, Gu Ying, Rasmussen Søren M, Mølgaard Jesper, Haahr-Raunkjær Camilla, Meyhoff Christian S, Aasvang Eske K, and Sørensen Helge BD. Prediction of serious outcomes based on continuous vital sign monitoring of high-risk patients. Computers in Biology and Medicine, 147:105559, 2022. [DOI] [PubMed] [Google Scholar]
- [8].Ghassemi Marzyeh, Pimentel Marco, Naumann Tristan, Brennan Thomas, Clifton David, Szolovits Peter, and Feng Mengling. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in icu with sparse, heterogeneous clinical data. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015. [PMC free article] [PubMed] [Google Scholar]
- [9].Tipirneni Sindhu and Reddy Chandan K. Self-supervised transformer for sparse and irregularly sampled multivariate clinical time-series. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(6):1–17, 2022. [Google Scholar]
- [10].Jauregi Unanue Inigo, Zare Borzeshi Ehsan, and Piccardi Massimo. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. Journal of biomedical informatics, 76:102–109, 2017. [DOI] [PubMed] [Google Scholar]
- [11].Bzdok Danilo, Altman Naomi, and Krzywinski Martin. Statistics versus machine learning. Nature Methods, 15(4):233–234, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Liu Shiyu, Yao Jia, and Motani Mehul. Early prediction of vital signs using generative boosting via lstm networks. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 437–444. IEEE, 2019. [Google Scholar]
- [13].Masum Shamsul, Chiverton John P, Liu Ying, and Vuksanovic Branislav. Investigation of machine learning techniques in forecasting of blood pressure time series data. In Artificial Intelligence XXXVI: 39th SGAI International Conference on Artificial Intelligence, AI 2019, Cambridge, UK, December 17–19, 2019, Proceedings 39, pages 269–282. Springer, 2019. [Google Scholar]
- [14].Liu Xiaoli, Liu Tongbo, Zhang Zhengbo, Kuo Po-Chih, Xu Haoran, Yang Zhicheng, Lan Ke, Li Peiyao, Ouyang Zhenchao, Ng Yeuk Lam, et al. Top-net prediction model using bidirectional long short-term memory and medical-grade wearable multisensor system for tachycardia onset: algorithm development study. JMIR Medical Informatics, 9(4):e18803, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Phetrittikun Ratchakit, Suvirat Kerdkiat, Pattalung Thanakron Na, Kongkamol Chanon, Ingviya Thammasin, and Chaichulee Sitthichok. Temporal fusion transformer for forecasting vital sign trajectories in intensive care patients. In 2021 13th Biomedical Engineering International Conference (BMEiCON), pages 1–5. IEEE, 2021. [Google Scholar]
- [16].Rasul Kashif, Seward Calvin, Schuster Ingmar, and Vollgraf Roland. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Machine Learning, pages 8857–8868. PMLR, 2021. [Google Scholar]
- [17].Tashiro Yusuke, Song Jiaming, Song Yang, and Ermon Stefano. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021. [Google Scholar]
- [18].Sohl-Dickstein Jascha, Weiss Eric, Maheswaranathan Niru, and Ganguli Surya. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. [Google Scholar]
- [19].Wen Ruofeng, Torkkola Kari, Narayanaswamy Balakrishnan, and Madeka Dhruv. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017. [Google Scholar]
- [20].Salinas David, Flunkert Valentin, Gasthaus Jan, and Januschowski Tim. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020. [Google Scholar]
- [21].Wang Yuyang, Smola Alex, Maddix Danielle, Gasthaus Jan, Foster Dean, and Januschowski Tim. Deep factors for forecasting. In International conference on machine learning, pages 6607–6617. PMLR, 2019. [Google Scholar]
- [22].Jensen Vilde, Bianchi Filippo Maria, and Anfinsen Stian Normann. Ensemble conformalized quantile regression for probabilistic time series forecasting. IEEE Transactions on Neural Networks and Learning Systems, 2022. [DOI] [PubMed] [Google Scholar]
- [23].Yang Ling, Zhang Zhilong, Song Yang, Hong Shenda, Xu Runsheng, Zhao Yue, Zhang Wentao, Cui Bin, and Yang Ming-Hsuan. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023. [Google Scholar]
- [24].Ho Jonathan, Jain Ajay, and Abbeel Pieter. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. [Google Scholar]
- [25].Song Jiaming, Meng Chenlin, and Ermon Stefano. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. [Google Scholar]
- [26].Austin Jacob, Johnson Daniel D, Ho Jonathan, Tarlow Daniel, and van den Berg Rianne. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021. [Google Scholar]
- [27].Anand Namrata and Achim Tudor. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019, 2022. [Google Scholar]
- [28].Blau Tsachi, Ganz Roy, Kawar Bahjat, Bronstein Alex, and Elad Michael. Threat model-agnostic adversarial defense using diffusion models. arXiv preprint arXiv:2207.08089, 2022. [Google Scholar]
- [29].Oord Aaron van den, Dieleman Sander, Zen Heiga, Simonyan Karen, Vinyals Oriol, Graves Alex, Kalchbrenner Nal, Senior Andrew, and Kavukcuoglu Koray. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. [Google Scholar]
- [30].Kong Zhifeng, Ping Wei, Huang Jiaji, Zhao Kexin, and Catanzaro Bryan. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020. [Google Scholar]
- [31].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. Attention is all you need. Advances in neural information processing systems, 30, 2017. [Google Scholar]
- [32].Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. [Google Scholar]
- [33].Zhang Cheng, Bütepage Judith, Kjellström Hedvig, and Mandt Stephan. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008–2026, 2018. [DOI] [PubMed] [Google Scholar]
- [34].Feller William. On the theory of stochastic processes, with particular reference to applications. In Selected Papers I, pages 769–798. Springer, 2015. [Google Scholar]
- [35].Luo Calvin. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022. [Google Scholar]
- [36].Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [Google Scholar]
- [37].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [Google Scholar]
- [38].Johnson Alistair, Pollard Tom, and Mark Roger. Mimic-iii clinical database (version 1.4). PhysioNet, 10(C2XW26):2, 2016. [Google Scholar]
- [39].Lockwood Craig, Conroy-Hiller Tiffany, and Page Tamara. Vital signs. JBI Evidence Synthesis, 2(6):1–38, 2004. [DOI] [PubMed] [Google Scholar]
- [40].Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [Google Scholar]
- [41].Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. [Google Scholar]
- [42].De Mulder Wim, Bethard Steven, and Moens Marie-Francine. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language, 30(1):61–98, 2015. [Google Scholar]
- [43].Ding Ming, Zhou Chang, Yang Hongxia, and Tang Jie. Cogltx: Applying bert to long texts. Advances in Neural Information Processing Systems, 33:12792–12804, 2020. [Google Scholar]