PLOS Digital Health
. 2022 Jul 19;1(7):e0000074. doi: 10.1371/journal.pdig.0000074

Conditional generation of medical time series for extrapolation to underrepresented populations

Simon Bing 1,2,*, Andrea Dittadi 3, Stefan Bauer 4,5,6, Patrick Schwab 6
Editor: Gilles Guillot
PMCID: PMC9931259  PMID: 36812549

Abstract

The widespread adoption of electronic health records (EHRs) and the subsequent increased availability of longitudinal healthcare data have led to significant advances in our understanding of health and disease, with direct and immediate impact on the development of new diagnostics and therapeutic treatment options. However, access to EHRs is often restricted due to their perceived sensitive nature and associated legal concerns, and the cohorts therein are typically those seen at a specific hospital or network of hospitals and therefore not representative of the wider population of patients. Here, we present HealthGen, a new approach for the conditional generation of synthetic EHRs that maintains an accurate representation of real patient characteristics, temporal information and missingness patterns. We demonstrate experimentally that HealthGen generates synthetic cohorts that are significantly more faithful to real patient EHRs than the current state-of-the-art, and that augmenting real data sets with conditionally generated cohorts of underrepresented subpopulations of patients can significantly enhance the generalisability of models derived from these data sets to different patient populations. Synthetic conditionally generated EHRs could help increase the accessibility of longitudinal healthcare data sets and improve the generalisability of inferences made from these data sets to underrepresented populations.

Author summary

Electronic health record (EHR) data sets are essential for developing machine learning (ML) based therapeutic and diagnostic tools. Developing such data-driven models requires large and diverse amounts of medical data, but access to the necessary data sets is often unavailable in practice. Even when access is provided, the data usually stems from a single source, resulting in models that perform well for the patient groups from this limited source, but not on the diverse, general population. Here, we introduce a new method to generate synthetic EHR patient data, helping to overcome the issues of data access and patient representation. The data that our method generates shares the characteristics of real patient data, allowing the developers of downstream ML models to use this data freely during development. With our model, we can directly control the composition of patient cohorts in terms of demographic variables such as age, sex and ethnicity. We can therefore generate more representative data sets, which lead to fairer downstream models and ultimately fairer treatment of underrepresented populations.

1 Introduction

The broad use of electronic health records (EHRs) has led to a significant increase in the availability of longitudinal health care data. As a consequence, our understanding of health and disease has deepened, allowing for the development of diagnostic and therapeutic approaches directly derived from EHR patient data. Models that utilize rich healthcare time series data derived from clinical practice could enable a variety of use cases in personalised medicine, as evidenced by the numerous recent efforts in this area [1–4]. However, the development of these novel diagnostic and therapeutic tools is often hampered by the lack of access to actionable patient data [5].

Even after being deidentified, EHR data is perceived as highly sensitive, and clinical institutions raise legal and privacy concerns over the sharing of patient data they may have access to [6]. Furthermore, even if data is made public, it often originates from a single institution only [7–9], resulting in a data set that may not be representative of more general patient populations. Basing machine learning models solely on single-site data sets risks overfitting to a cohort of patients biased towards the population seen at one clinic or hospital, and renders their use for general applications across heterogeneous patient populations uninformative at best and harmful at worst [10, 11].

Putting aside the issue of non-representative patient cohorts, the development of accurate machine learning-based models for healthcare is further impeded by the much smaller volume of available data compared to other domains. While fields such as computer vision or language modelling have made significant advances, thanks in part to access to large-scale training data sets like ImageNet [12] or large text corpora derived from the World Wide Web, there do not yet exist any comparable data repositories for machine learning in healthcare that may spur innovation at a similar pace. Practical problems may also arise during model development due to a lack of training samples for specific, rare patient conditions. If one wishes to study a model’s behaviour given data with certain properties, such as data from only patients with a certain pre-existing condition, medical data sets may often be too small to representatively cover such populations.

One potentially attractive approach to address the aforementioned issues would be to generate realistic, synthetic training data for machine learning models. Given access to an underlying distribution that approximates that of the real data, paired with the capability to sample from it, one could theoretically synthesize data sets of any desired size. The generated synthetic patient data can be used for assessing [13–15] or even improving machine learning-based healthcare software, e.g. for liver lesion classification [16]. If the generative model of the data were to also have the capacity to generate samples conditioned on freely chosen factors, such as pre-existing conditions, data sets with the exact properties required for a specified task could be generated as well. Previous reports suggest that such synthetically generated data sets may furthermore be shared with a significantly lower risk of exposing private information [17].

Beyond generating synthetic data to address issues surrounding fairness and bias mitigation, other complementary approaches have been studied in the literature. These include methods to transfer learned knowledge from one data set to another [18, 19], casting the collection of training data as an optimization problem with an objective function directly linked to population-level goals [20], as well as meta-learning approaches that generalize to a new task with relatively few samples [21]. Considering confounding factors is another important point when addressing issues surrounding fairness and bias in machine learning applications for medicine [22, 23]. While these and other approaches to mitigating bias in medical data sets [24] show promise to aid in the development of fairer clinical machine learning tools, we propose a complementary approach based on synthesizing medical data sets. Specifically, our method is characterized by its capability to conditionally generate data, thereby directly modelling the effect of otherwise confounding variables.

Developing models with synthetic data is already widely applied in machine learning research. In reinforcement learning, for example, it is the de facto standard to train models in simulation, in order to have high-fidelity control over the environment [25, 26], or simply because experiments in the real world would be too costly, unethical or dangerous to conduct. Some previous work even suggests that models trained on synthetic data could outperform those derived from real data sets [27]. The gap between real and synthetic data is rapidly closing in fields like facial recognition in computer vision, as has recently been demonstrated by Wood et al. [28], for example.

Classical approaches to generating medical time series data exist, but they fall short of the requirements that modern data-driven models place on their input. Some works employ hand-crafted generation mechanisms followed by expensive post-hoc rectifications by clinical professionals [17], while others rely on mathematical models of biological subsystems such as the cardiovascular system [29, 30], which require a detailed physiological understanding of the system to be modelled. When the output data stems from multiple, interconnected subsystems whose global dynamics are too complex to model with ordinary differential equations, and the required data set is too large for experts to tediously correct unrealistic samples, these approaches may be difficult to utilize.

A natural approach to learning complex relationships from data is to move away from hand-crafted generative models and utilize machine learning methods. While a plethora of powerful generative models for medical imaging data generation have been brought forward in recent years [31–34], relatively little research has been reported on generating synthetic medical time series data [5, 35–37]. Moreover, the generation and evaluation of synthetic patient data [38] is often challenging due to the high prevalence of missing measurement values in the original medical data sets [39–42].

To address these issues, we present HealthGen, a new approach to conditionally generate EHRs that accurately represent real measured patient characteristics, including time series of clinical observations and missingness patterns. In this work, we demonstrate that the patient cohorts generated by our model are significantly better aligned with the real data than those produced by various state-of-the-art approaches for medical time series generation. Our model outperforms previous approaches because it was explicitly developed for real clinical time series data: it models not only the dynamics of the clinical covariates, but also their patterns of missingness, which have been shown to be potentially highly informative in medical settings [43]. We show that our model’s capability to synthesize specific patient subpopulations, by conditioning on their demographic descriptors, allows us to generate synthetic data sets that exhibit fairer downstream behaviour across patient subgroups than competing approaches. Moreover, we demonstrate that by conditionally generating patient samples of underrepresented subpopulations and augmenting real data sets to equally represent each patient group, we can significantly boost the fairness (in this work, we measure fairness in terms of generalization performance to underrepresented populations) of downstream models derived from the data. Furthermore, we evaluate the quality and usefulness of the data we generate using a downstream task that represents a realistic clinical use case, allowing us to compare our model against competing approaches in a setting that is relevant for clinical impact.

Our main contributions are:

  • We introduce HealthGen, a new machine-learning method to conditionally generate realistic EHR data, including patient characteristics, the temporal evolution of clinical observations and their associated missingness patterns over time.

  • We experimentally show that our method outperforms current state-of-the-art models for medical time series generation in synthetically generating patient cohorts that are faithful to real patient data.

  • We demonstrate the high fidelity of control over synthetic cohort composition that our model’s conditional generation capability provides, generating more diverse synthetic data sets than competing approaches, which ultimately leads to a fairer representation of different patient populations.

  • We show that by augmenting real data with conditionally generated samples of underrepresented populations, the models derived from these data sets exhibit significantly fairer behaviour than those derived from unaltered real data.

  • We perform a comprehensive computational evaluation in realistic clinical use cases to evaluate the comparative performance of HealthGen against various state-of-the-art generative time-series modelling approaches.

2 Results

2.1 Overview

For this study, we consider the MIMIC-III data set [8], which consists of EHRs containing time series of measurements of patients that spent time in the intensive care units (ICUs) of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA. Additionally, each patient is described by static variables such as their age, ethnicity, insurance type and sex. The time series of a given patient is labelled to indicate whether or not one of the following clinical interventions was performed: mechanical ventilation, vasopressor administration, colloid bolus administration, crystalloid bolus administration or non-invasive ventilation.

After extracting the cohort of patients from the database, we split them into training (70%), validation (15%) and test (15%) sets, stratified by their binary intervention labels. The generative models are trained on the real training data D_train = {x_{1:T}^n, m_{1:T}^n, s^n, y^n}_{n=1}^{N_train}, where x_{1:T} denotes the time series of covariates, m_{1:T} the time series of binary masks indicating where values are missing, s the static patient variables and y a patient’s set of labels for the respective clinical interventions. We include the missingness information m_{1:T} explicitly, as previous work has shown that patterns of missing values are highly informative [41] and that, especially in the medical setting, including them is preferable to imputation [43].
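As an illustrative sketch (not the authors' implementation; the function and array names are hypothetical), the pairing of a covariate series x_{1:T} with its binary missingness mask m_{1:T} can be constructed as follows:

```python
import numpy as np

def make_missingness_mask(x_raw):
    """Split a raw covariate series (NaN = not measured) into an
    imputed value array x and a binary observation mask m."""
    m = (~np.isnan(x_raw)).astype(np.float32)  # 1 where a value was observed
    x = np.nan_to_num(x_raw, nan=0.0)          # placeholder for missing entries
    return x, m

# Toy series: T = 4 time steps, 2 covariates, some values unmeasured.
x_raw = np.array([[80.0, np.nan],
                  [np.nan, 98.0],
                  [82.0, 97.0],
                  [np.nan, np.nan]])
x, m = make_missingness_mask(x_raw)
```

Keeping the mask alongside the placeholder-imputed values lets a downstream model distinguish a true zero from an unmeasured entry, which is the motivation for modelling m_{1:T} explicitly.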

To evaluate and compare generative models, we first train a downstream time series classification model developed for medical data [43] on the data sets synthesized by each model. The classifier trained on synthetic data is evaluated on the held-out real test data, and its AUROC score (Area Under the Receiver Operating Characteristic curve) is compared with that of the same classifier trained on the real data. The final measure of how faithful a given generated data set is to the real data is the difference between the evaluation score of the synthetic data and that of the real data. Details on the experimental pipeline can be found in Section 4.
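This train-on-synthetic, test-on-real comparison can be sketched as follows. The rank-sum AUROC and the nearest-centroid stand-in classifier are illustrative simplifications of the medical time-series classifier [43] used in the paper, and all names here are hypothetical:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def centroid_scorer(X, y):
    """Toy stand-in for the downstream classifier: score samples by
    projection onto the difference of the two class centroids."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return lambda X_new: X_new @ w

def tstr_gap(real_train, synth_train, real_test, fit=centroid_scorer):
    """Fit the same classifier on real and on synthetic training data,
    evaluate both on held-out real data, and return the AUROC gap
    (real-trained minus synthetic-trained); smaller is better."""
    X_test, y_test = real_test
    auc_real = auroc(y_test, fit(*real_train)(X_test))
    auc_synth = auroc(y_test, fit(*synth_train)(X_test))
    return auc_real - auc_synth
```

A synthetic data set that perfectly reproduced the real training distribution would drive the gap toward zero, which is the selection criterion used throughout the results.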

In our first experiment, we generate synthetic patient cohorts that are faithful to the real data in terms of their demographics, i.e. they contain the same number of patients as the original data under the same distribution of static variables. Repeating this synthetic cohort generation for all of the available clinical intervention labels, we compare the performance of our model to baseline models for generating clinical time series data. As baselines, we consider Stochastic Recurrent Neural Networks (SRNN) [44] and KalmanVAE (KVAE) [45], both based on variational autoencoders (VAE) [46], and TimeGAN [47], based on generative adversarial networks (GAN) [48]. We then present results of an extension of the previous experiment demonstrating our proposed approach’s conditional generation capability, where we generate patient cohorts with static variable distributions that differ from the real data, and investigate the fairness of models derived from the resulting synthetic data. In an additional experiment, we identify real data settings where some subpopulations of patients have a significantly lower classification score than other patients. Using the conditional generation capability of our model, we augment the real data with synthetic samples of minority groups and test if this augmentation leads to an increase in the downstream classification score of the previously underrepresented populations.

2.2 Generating synthetic patient cohorts

In the first experiment of this work, we investigate our model’s capability to generate synthetic data that is faithful to the real data and useful in downstream tasks. To study the generative performance of our model and compare it against competing approaches for medical time series generation, we employ the experimental framework described in detail in the Methods section.

Here, we train each generative model conditioned on the real labels y and then generate a synthetic data set D^, where the generation is again conditioned on the label of the considered task, to guarantee that the synthetic data shares the same statistics in terms of split between positive and negative labels as the real data. While we could in practice generate as much data as we wish, we synthesize data sets containing the same number of patients as the real data for this experiment, to facilitate a fair comparison between the real patient cohorts and those that have been synthetically generated.

A downstream model for medical time series classification [43] is then trained on the synthetic training data D^train of each generative model and evaluated on a held-out test data set Dtest consisting of real patients. We compare the performance of our model against the current state-of-the-art models for time series generation, across all five available classification tasks in the MIMIC-III data. The results of these experiments are summarized in Table 1. We provide examples of synthetically generated patients’ time series in Fig 1, next to a real patient’s time series for comparison. Additional, more detailed visualizations are presented in S4 Appendix. While these visualizations show a qualitative similarity between synthetic and real time series, they further underscore the need for the more fine-grained quantitative evaluation of the generated data that we employ.

Table 1. Comparison of AUROC scores for all predictive tasks between HealthGen and the baseline models.

The 95% confidence interval of the mean value is presented in parentheses and is estimated via bootstrapping with 30 samples.

vent vaso colloid_bolus crystalloid_bolus niv
Real Data 0.809 (0.807, 0.811) 0.801 (0.799, 0.803) 0.751 (0.741, 0.760) 0.613 (0.609, 0.616) 0.634 (0.632, 0.637)
HealthGen (Ours) 0.769 (0.767, 0.772) 0.722 (0.718, 0.727) 0.664 (0.650, 0.678) 0.574 (0.571, 0.577) 0.567 (0.566, 0.569)
SRNN 0.639 (0.637, 0.641) 0.693 (0.690, 0.695) 0.661 (0.656, 0.666) 0.562 (0.561, 0.564) 0.553 (0.552, 0.555)
KVAE 0.559 (0.549, 0.570) 0.608 (0.589, 0.627) 0.565 (0.544, 0.586) 0.538 (0.531, 0.545) 0.523 (0.517, 0.529)
TimeGAN 0.558 (0.558, 0.558) 0.703 (0.703, 0.704) 0.552 (0.530, 0.573) 0.545 (0.543, 0.548) 0.530 (0.527, 0.533)
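The confidence intervals in Table 1 are estimated by bootstrapping. A minimal percentile-bootstrap sketch (hypothetical names; the paper's exact resampling procedure may differ in detail) is:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=30, alpha=0.05, seed=0):
    """Mean score with a (1 - alpha) percentile-bootstrap confidence
    interval, estimated from n_boot resamples with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi
```

With only 30 resamples, as stated in the table note, the interval endpoints are themselves somewhat noisy; larger bootstrap counts would tighten the estimate of the interval.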

Fig 1. Sample time series of synthetically generated patients, with the time series of one real patient for comparison.

Fig 1

In all but one of the considered tasks, our approach significantly outperforms the state-of-the-art models in synthetically generating medical time series. The fact that our model’s downstream classification score is higher than the baselines’, across all experimental settings, suggests that the synthetic data generated by our model is more faithful to the real data, and thus more useful for the development of downstream clinical time series prediction tasks, than the cohorts generated by any of the competing architectures.

2.3 Conditional generation

From Table C in S3 Appendix we see that many demographic variables contain highly underrepresented classes, which can cause the classification performance for these subpopulations to be much lower than for the majority class. This shines a light on a real problem found in many clinical settings, especially when transferring between hospitals [3]. In preliminary experiments, we investigate the classification score on a per-group level for the real data (cf. Table E in S1 Appendix). While it does not occur for all static variables and classification tasks, we identify cases where there is a significant difference in the score of a given subpopulation with respect to the other groups. This inter-group performance gap raises the question of whether our model’s conditional generation capability can be leveraged to address these fairness issues.

In the preceding experiment, we did not utilize our model’s capacity to conditionally generate synthetic patient cohorts, as we did not explicitly condition our model on the static variables s during training or generation. Here, in addition to the label y, we condition on a static variable of interest, allowing us to conditionally generate an equal number of synthetic patient samples for each subgroup of this static variable. We investigate if conditionally generating patient cohorts provides a benefit in terms of fairness, as well as overall downstream performance. In this experiment, we study the performance on a per-group level, comparing not only to the baseline approaches, but also to our model when unconditionally generating the data. To enable a fair comparison, the overall number of generated patients is equal for each considered model. The results of this experiment for two exemplary settings are presented in Fig 2, with additional results reported in Fig A in S1 Appendix.
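Equalizing subgroup counts at generation time can be sketched as follows; `balanced_condition_plan` and the commented sampling call are hypothetical illustrations, not the HealthGen API:

```python
def balanced_condition_plan(n_total, groups):
    """Allocate an (approximately) equal number of synthetic samples
    to each subgroup of a static variable, summing exactly to n_total."""
    per_group, remainder = divmod(n_total, len(groups))
    plan = {g: per_group for g in groups}
    for g in groups[:remainder]:  # spread any remainder deterministically
        plan[g] += 1
    return plan

plan = balanced_condition_plan(10, ["white", "black", "hispanic"])
# Each entry would then condition the generator, e.g. (hypothetical call):
# for group, n in plan.items():
#     x, m = model.sample(n=n, s=group, y=label)
```

The resulting synthetic cohort represents every subgroup equally, regardless of how skewed the real demographic distribution is.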

Fig 2. Comparison of AUROC score between our model when conditionally and unconditionally generating synthetic data with baselines.

Fig 2

We show results of the colloid_bolus task for different ethnicity groups (a) as well as the range of insurance types (b). Note that the Asian subpopulation is not among the ethnicity groups, as there are no positive validation samples among this group for the considered task, thereby prohibiting the calculation of the AUROC score. Conditionally generating patient cohorts is preferable to unconditional generation overall, and for almost all subgroups, allowing for the generation of more representative and therefore fairer synthetic data sets. Significance levels between groups of interest are shown with brackets, where * corresponds to p < 0.05, ** to p < 0.01 and *** to p < 0.001.

The results indicate that, for settings in which the real data exhibits performance differences between subpopulations, conditionally generating synthetic patient cohorts provides a significant benefit over unconditional generation. In terms of overall performance, and in nearly all subgroups, our model outperforms the baselines when conditionally generating data. The score of the fairness metric we utilize is also higher for our model than for any of the competing approaches (cf. Table C in S1 Appendix). Our model is not only capable of generating synthetic “copies” of the real data in terms of the distribution of demographics; we can also generate data sets with a high degree of control over the composition of subpopulations, resulting in more diverse training sets for downstream tasks.

2.4 Real data augmentation

In the preceding experiment, we demonstrated that our model’s conditional generation capability can be used to synthesize patient cohorts that yield more fair downstream classification models. The settings that emerge in which our model can provide a benefit are those where the real data displays an imbalance in the downstream performance between patient subpopulations. This gives rise to the question if our approach to conditionally generate synthetic data can also be useful for the setting when access to the real data is not restricted, but rather the given cohort does not fulfill the requirements for the development of downstream models. For these cases, we hypothesize that we can conditionally generate more examples of this previously underrepresented class, augment the real data with them and thereby boost the performance in the downstream classification task for this subpopulation.

One of the cases where we identify a significant difference in the performance of the trained classifier for different subtypes of patients is the colloid_bolus classification task when looking at different insurance types of patients. In Fig 3, we see that while the overall score on this task is fairly high, the underrepresented class of Government-insured individuals has a significantly lower score than all other classes. The score of this class is even below 0.5, the score that would be obtained by randomly guessing which class a sample belongs to.

Fig 3. Comparison of the AUROC score between the real data and data sets obtained by augmenting the real data with synthetic patients from the considered generative models.

Fig 3

We report the scores for each different insurance type on the (a) colloid_bolus as well as the (b) vaso classification task. Significance levels between groups of interest are shown with brackets, where * corresponds to p < 0.05, ** to p < 0.01 and *** to p < 0.001.

To investigate if our model can improve the performance for such an underrepresented group, we conditionally generate additional samples of each underrepresented class and augment the real training data with these until all insurance types are represented by the same number of samples. Since the baselines cannot generate data conditioned on static variables, we have them unconditionally generate the same overall number of samples with which our model augments the real data. We then compare the results of the downstream classifier trained on data sets augmented by synthetic data of the respective generative models to the classifier trained on the real data.
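The augmentation targets can be computed as in the following sketch (hypothetical names), which returns, per subgroup, how many synthetic samples are needed to match the largest group:

```python
from collections import Counter

def augmentation_counts(group_labels):
    """Per-subgroup number of synthetic samples to generate so that
    every subgroup reaches the size of the currently largest one."""
    counts = Counter(group_labels)
    target = max(counts.values())
    return {g: target - c for g, c in counts.items()}

# Toy insurance-type distribution with underrepresented groups.
needed = augmentation_counts(["Medicare"] * 5 + ["Government"] * 2 + ["Self Pay"])
```

The conditional generator would then be asked for exactly `needed[g]` samples of each group g, leaving the majority group untouched.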

As we see in Fig 3(a), our model can indeed increase the performance of previously underrepresented groups. Our model significantly boosts the predictive score of the Government class, without sacrificing the performance of any other subpopulation. This result is remarkable, since the number of positive samples of Government-insured individuals is exceedingly small. Interestingly, the performance of the Self Pay class, which is also heavily underrepresented, is boosted as well, even though it was already at a high level to begin with. While some baselines also manage to boost the score of the Government-insured class, they either fail to do so to the same degree as our approach, or they also decrease the performance for another class. These findings are further underlined by the resulting fairness metric scores presented in Table D in S1 Appendix. Our model’s superiority is further demonstrated by the fact that our conditional augmentation boosts the overall score as well.

A second setting in which our model provides a benefit for underrepresented classes is the vaso task, again looking at different insurance types. Here, the performance on the minority groups of Government and Self Pay insured patients is not as dramatically lower compared to the other majority groups, but our approach to augment the real data still provides a significant benefit. Visualized in Fig 3(b), our augmentation boosts the performance of the downstream classifier for the two smallest classes significantly, even in a setting where their score is not severely below that of the larger groups to begin with.

2.5 Privacy

To qualitatively assess if our model simply memorizes the training data and reproduces it at generation time, we visualize time series of a randomly selected, synthetically generated sample and time series of the three closest samples (nearest neighbours) in the training data. In Fig 4, we compare the corresponding features of the synthetic and real samples side-by-side and observe that, while certain patterns are shared, the synthetic data is not a copy of any of the real patients. This indicates that our model does not memorize the sensitive training data, allowing us to conjecture that it is privacy-preserving to a certain degree and that sharing synthetically generated patient cohorts does not jeopardize the private information of the real patients our model was trained with. We stress that this visualization is only meant as a qualitative visual inspection that the synthetically generated data is not a copy of any real patients. This is not a metric of how well the synthetic data captures the relevant characteristics of the real data, as this is measured quantitatively via our evaluation pipeline, described in detail in Section 4.
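A quantitative complement to this visual check (our own sketch, with hypothetical names, not part of the paper's pipeline) is to compute, for each synthetic sample, the distances to its nearest real training samples; exact zeros would indicate verbatim memorization:

```python
import numpy as np

def nearest_neighbour_distances(synthetic, real, k=3):
    """For each synthetic sample (time series flattened to a vector),
    return the Euclidean distances to its k nearest real samples,
    sorted ascending. Zero distances would flag verbatim copies."""
    S = synthetic.reshape(len(synthetic), -1)
    R = real.reshape(len(real), -1)
    d = np.linalg.norm(S[:, None, :] - R[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k]

# Toy data: 3 real patients, T = 2 steps, 2 features.
real = np.arange(12, dtype=float).reshape(3, 2, 2)
synthetic = real[:1] + 0.5  # similar to real patient 0, but not a copy
dists = nearest_neighbour_distances(synthetic, real)
```

Strictly positive minimum distances support, but do not prove, the absence of memorization; formal privacy guarantees would require techniques such as differential privacy.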

Fig 4. Comparison of the time series of a randomly sampled, synthetically generated patient and the corresponding time series of the three closest real patients (nearest neighbours) in the training data.

Fig 4

While certain characteristics such as the number of missing values per feature or dynamics are similar between the synthetic sample and its nearest neighbours amongst the real data, we observe that the synthetically generated data is not a copy of the real data, indicating that our method does not memorize the data it sees during training. This experiment is merely meant to visually check if the generated data is an exact copy of the real data. The overall quality of the synthetic data is measured quantitatively by our evaluation scheme, not visually assessed here.

3 Discussion

We presented HealthGen, a deep generative model capable of synthesizing realistic patient time series data, including informative patterns of missingness, that is faithful to data observed in real patient cohorts. To study the quality of the generated patient cohorts, we trained our generative model on the MIMIC-III data set, consisting of labeled ICU patient time series, to synthetically generate EHR data and evaluate the utility of the generated data on the clinically relevant downstream task of classifying patients’ medical time series. In an experimental comparison against existing state-of-the-art models for time series generation, we explored multiple dimensions of the generative capability of our proposed approach: first, we synthesized patient cohorts with the same distribution of static variables as the real training data and observed that the data generated by our model is significantly more faithful to the real data across all evaluated downstream clinical prediction tasks than existing state-of-the-art approaches. In a second experiment, we demonstrated that HealthGen is capable of conditionally generating synthetic patient cohorts with static variable distributions that differ from the underlying, real data, without sacrificing the quality of the generated samples and boosting the fairness of the resulting synthetic patient cohorts. Finally, we identified settings where HealthGen can alleviate issues of unfair downstream classification performance between demographic subpopulations that arise in the real data, by augmenting the real patient cohorts with more diverse, synthetic samples.

3.1 Generating synthetic patient cohorts

A key motivation behind synthetically generating medical time series is the lack of access to this type of data for the development of downstream tasks. Data-driven approaches to assist clinical practitioners in diagnostic or therapeutic tasks promise significant improvements to the quality of healthcare we can provide in the future, but without sufficient data, both in terms of amount and quality, their development is impeded. Clinical institutions that collect this type of data at large are reluctant to centralize and share it, raising the question of how access to useful training and development data may be ensured. One approach that has been brought forward recently is the idea of synthesizing patient cohorts, with the hope that these generated data sets can then be shared freely. In this scenario, only one model has to be granted access to the sensitive real data, while the synthesized cohorts generated by the trained generative model can be freely shared with anyone in need of data for developing a downstream task. The primary requirement for this generated data is that it must adhere to the characteristics of the real data in such a way that it allows for the meaningful development of downstream models. These models are then deployed in the real world, to be used with real data. We demonstrate our model’s capacity to fulfill precisely these requirements. On five different clinical downstream tasks, the synthetic patient cohorts generated by our approach are more closely aligned with the real data, evident from their classification scores being closer to that of the downstream task trained on real data, than any competing baselines. While the generated data never outperforms the real data in terms of classification score in any setting, we stress that this cannot be expected and, more importantly, does not diminish the obtained results.
In practice, one would only have access to synthetic data for development of these downstream models, so the performance obtained by training on real data is only considered for model selection of the upstream generative model, by means of providing a point of reference for comparison.

Our approach to synthetically generating realistic and useful medical time series outperforms competing state-of-the-art models for a number of reasons. We include inductive biases aligned with the healthcare domain, in the form of explicitly modelling missing data patterns and separating the generation of these missing values from the generation of observable clinical variables. Furthermore, our model’s capability to capture the influence of static and demographic variables on the generated data, and to condition on them during generation, adds to the expressive capacity of our architecture. To the best of our knowledge, no other model for generating time series in the medical domain explicitly models missing data patterns, even though their importance and prevalence in clinical data are well known. Instead of cherry-picking features with low missingness or downsampling the temporal resolution of the data to alleviate missing values, we can generate time series data that is faithful to the characteristics of realistically occurring EHR data. Not only do we outperform the competing baselines on all of the considered downstream tasks, but we do so in a much more realistic setting than previously presented in the literature. This is a notable contribution, as we validate and compare models for generating medical time series on real-world downstream tasks for healthcare applications, giving our findings significantly more weight for clinical practitioners concerned with impact beyond exemplary academic settings.

3.2 Conditional generation

In the context of healthcare applications, providing fair diagnostic models is of high ethical importance, and increasing the fairness of such tools can have a potentially large impact. If an approach works well for the majority of a cohort at the cost of neglecting a certain subgroup, this can lead to systematically worse treatment of patients belonging to that group. Even a small increase in the predictive quality of a diagnostic tool can mean that hundreds or thousands of patients receive treatment better aligned with their needs.

In addition to generating synthetic patient cohorts of high quality and usefulness, our model offers fine-grained control over the composition of cohorts, without sacrificing the quality of the generated data. In settings where the real data displays significant performance differences between subpopulations, utilizing this conditioning capability provides a benefit and yields more fair synthetic data sets. This increased fairness is evident from the lower variance between subpopulations when utilizing conditioning, compared to the unconditionally generated data of the other approaches. Importantly, this increase in fairness does not come at the cost of diminished performance for certain populations, but rather through an increase in the score of previously sub-par groups, which is also evident in the increase in overall score relative to our model without conditioning. While conditional generation never hurts overall, we cannot boost the score of any subpopulation in an arbitrary setting. For the conditioning to provide a benefit, the real data must display performance imbalances between subpopulations. When this imbalance is not present and all subgroups perform similarly, conditioning should not be expected to provide an additional benefit, as the differences between subpopulations are not relevant for their classification.

The fact that we can successfully condition the generative process of our model to synthesize patients with given features indicates HealthGen’s capability to correctly capture meaningful dependencies between high-level, time-invariant patient features and their influence on the resulting dynamics of the generated covariates. The key modelling choices that enable this level of conditioning are the introduction of an additional static latent variable to capture time-invariant patient states, as well as the inference procedure by which our model learns the dependencies between this time-invariant latent variable and the dynamics of the sequential data we are interested in generating. Splitting the high-level patient features from the dynamics on an architectural level encourages our model to learn these concepts separately. Independence, however, does not follow from this separation, as high-level patient states will dictate the evolution of dynamical variables over time, which we capture in the dependencies of the static latent variable on time-varying observations during inference and vice versa during generation.

3.3 Real data augmentation

We have shown that by leveraging our model’s conditional generation capability, we can synthesize data sets that are significantly more fair in their representation of subpopulations of patients. We demonstrated this in the setting where access to the real data is not given during the development of downstream models, such that one has to rely on fully synthetic data. This raises the question of whether conditionally generated data can also provide a benefit when we do have access to the real data. In the scenario where a real data set is available during the development of such a clinical tool, we investigated whether augmenting the real data with synthetic patients of specific, underrepresented subpopulations can help to develop a more fair downstream classifier.

After identifying settings where certain subpopulations display significantly lower classification performance than the majority groups, in our final experiment we demonstrate our model’s capacity to increase the fairness of these real data sets through augmentation with synthetically generated data. This underlines the usefulness of synthetic data generation for data augmentation, a task orthogonal to the original objective of our model, namely the generation of fully synthetic patient cohorts. That we can boost the performance of downstream tasks using these mixed data sets, consisting of real as well as synthetic samples, speaks to our model’s capability to generate time series that are true to the real data in their informative characteristics, and opens up further possibilities for utilizing synthetically generated data in relevant, real-world applications.

Here, the effect of our approach’s explicit modelling of static variables of interest, and the resulting capability to condition on these, becomes even more evident. While other generative models can also boost the classification of individual underrepresented classes through unconditional generation, our model proves to have a decisive advantage. The baselines are bound to generate some examples of the minority classes during generation, but we can generate these with high fidelity in a targeted fashion. The resulting augmented data set boosts the score of underrepresented groups without sacrificing the previously good scores of the other subpopulations, which cannot be said for the baselines against which we compare. While we can provide a benefit via augmentation with generated data in specific settings, this does not hold in general: we cannot simply hope to boost any subpopulation’s downstream classification performance by generating more samples of that class. Our findings suggest that two main conditions must be met for augmentation to provide a benefit: the gap between the classification score of the minority group and the other groups must exceed a certain magnitude, and the other groups must display a minimum overall score, so that the model has informative enough examples to learn from, even if they belong to a group other than the one we are interested in generating.

3.4 Limitations

In this work, we do not provide strong guarantees on the privacy-preserving nature of the generated data sets. While it may be interesting to extend our model in the future with rigorous differential privacy guarantees [49], we argue that the synthetic data we generate is privacy-preserving to a certain degree. For example, we provide experimental evidence that our model does not simply memorize the training data and reproduce it to generate synthetic cohorts. Moreover, it has been shown that training neural network architectures with stochastic gradient descent intrinsically guarantees a certain degree of privacy [50], the extent of which is, however, still an open research question.

Furthermore, the quality and diversity of the data our model generates is limited by the real data with which it is trained. We cannot hope to generate samples that lie too far outside the distribution of patients the model has seen during training. In addition, biases related to factors our model does not condition on are likely to be reproduced in the generated data, although this is a fundamental issue all machine learning models face. A possible solution could be the integration of HealthGen into a federated learning framework as an avenue of future development. The initial motivation to synthetically generate EHR data is the lack of large publicly available data sets, with those that are available being representative of only a specific patient cohort. Training a generative model on multiple cohorts in a privacy-preserving, federated fashion has been proposed to increase the diversity of the generated data and further catalyze the development of personalized medicine [51, 52]. While there is no reason to believe our model could not be applied to other data, validating the presented results on an external data set would lend our findings additional weight. However, the lack of access to suitable data sets that can be readily used is the key limiting factor to doing so.

4 Materials and methods

Here, we present the methodological details of the experimental pipeline used in this study. As the raw data is not directly suitable for training, the custom preprocessing pipeline introduced in Section 4.1 is employed to bring it into the necessary format. This formatted data can then be used to train the HealthGen model, presented in Section 4.2. Once trained, the model is capable of generating synthetic data sets, whose quality and similarity to the real data are quantitatively evaluated using the evaluation pipeline described in Section 4.4.

4.1 Data set and preprocessing

In our experiments, we use the publicly available Medical Information Mart for Intensive Care (MIMIC-III) data set [8] as input. In its raw form, it consists of the deidentified electronic health records (EHRs) of 53,423 patients collected in the intensive care units (ICUs) of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA between 2001 and 2012. The data is organized in multiple tables: a single patient’s information is linked across tables through a unique patient ID, and time series data contains a time stamp to maintain the correct temporal ordering of measurements. In this form, the sequential data is not ordered, and many of the raw measurements represent the same concept but are redundantly recorded under different names.

As a first preprocessing step, we employ a slightly modified version of the MIMIC-Extract pipeline [53]. This yields a data set containing the ordered time series of measurements of each patient, static patient variables such as age, sex, ethnicity and insurance type, and a sequence of binary labels at each time step, indicating whether a certain medical intervention was active or not. We apply the standard cohort selection found in the literature [54–56]: the first ICU admission of adult patients (at least 15 years old), with a minimum duration of 12 hours, resulting in a total number of N = 34472 patients.

At this point, the time series data is still irregularly sampled and asynchronous across different features of the same patient. Given a sampling frequency, we look at the resulting window around each time step and either record the measurement, or indicate the absence of a measurement with a NaN (Not a Number) value. We then truncate all time series to have the same, fixed length. In our setting we choose a sampling frequency of 4 steps per hour and truncate the sequences to have a total duration of 12 hours.
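The resampling step above can be sketched as follows for a single feature. The per-window tie-breaking (keeping the last measurement in each window) is an illustrative assumption, not necessarily the exact rule used in the pipeline:

```python
import numpy as np

def bin_measurements(times, values, freq_per_hour=4, duration_hours=12):
    """Bin irregularly sampled measurements of one feature onto a regular
    grid, marking empty windows with NaN. `times` are in hours since ICU
    admission; window t covers [t/freq, (t+1)/freq)."""
    n_steps = freq_per_hour * duration_hours
    grid = np.full(n_steps, np.nan)
    for t, v in zip(times, values):
        idx = int(t * freq_per_hour)
        if 0 <= idx < n_steps:
            grid[idx] = v  # later measurements overwrite earlier ones
    return grid

# Example: heart-rate readings at 0.1 h and 0.2 h (same window) and 3.0 h.
hr = bin_measurements([0.1, 0.2, 3.0], [80.0, 82.0, 90.0])
```

With a sampling frequency of 4 steps per hour and a 12-hour duration, this yields a sequence of 48 steps, most of them NaN.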

From the observed feature sequences we additionally extract a sequence of binary masks m1:T indicating which values in x1:T are observed and which are missing:

$$m_{t,d} = \begin{cases} 1, & \text{if } x_{t,d} \neq \text{NaN}, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$

Finally, we standardize all (non-missing) numerical values of x1:T to empirically have zero mean and unit variance along each dimension d = 1, …, D, and replace the NaN values in x1:T with zeros.
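A minimal sketch of the mask extraction of Eq (1) and the subsequent standardization, assuming the data is held as a NumPy array with NaN marking missing entries:

```python
import numpy as np

def masks_and_standardize(x):
    """x: array of shape (N, T, D) with NaN for missing entries.
    Returns binary masks (1 = observed, per Eq (1)) and features
    standardized per dimension over the observed values, with NaNs
    replaced by zeros."""
    m = (~np.isnan(x)).astype(np.float32)
    mu = np.nanmean(x, axis=(0, 1))           # per-feature mean over observed values
    sigma = np.nanstd(x, axis=(0, 1)) + 1e-8  # avoid division by zero
    x_std = (x - mu) / sigma
    return np.nan_to_num(x_std, nan=0.0), m

x = np.array([[[1.0, np.nan], [3.0, 4.0]]])  # N=1, T=2, D=2
x_std, m = masks_and_standardize(x)
```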

To obtain a binary label for a patient, we split the 12-hour sequence into three sections: a 6-hour observation window followed by a 2-hour hold-out section and finally a 4-hour prediction window. The label is then extracted from the prediction window: if an intervention is active at any time in this section, the label is positive, otherwise it is negative. Drawing inspiration from Suresh et al. [57], this procedure aims to create a fair prediction of future interventions from observed data by minimizing information leakage. If there is no gap between observation and prediction, oftentimes the last step of the observation contains enough information for a meaningful prediction. We extract five binary labels corresponding to different types of clinical interventions in the ICU: vent refers to mechanical ventilation, vaso to the administration of vasopressor drugs, colloid_bolus and crystalloid_bolus refer to colloid and crystalloid fluid bolus administration, respectively, and niv denotes non-invasive ventilation. An overview of the prevalence of overall positive samples for each of these labels is presented in Table B in S3 Appendix. Table C in S3 Appendix provides a summary of the extracted static variables and the representation of each sub-cohort and Table A in S3 Appendix presents all extracted time-varying features together with selected statistics.
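The window-based label extraction can be illustrated as follows; the step indices follow directly from the 6-hour/2-hour/4-hour split at 4 steps per hour:

```python
import numpy as np

def extract_label(intervention, freq_per_hour=4):
    """intervention: binary sequence over the full 12-hour stay (48 steps
    at 4 steps/hour). Returns 1 if the intervention is active at any step
    of the 4-hour prediction window that follows the 6-hour observation
    window and the 2-hour hold-out gap."""
    start = 8 * freq_per_hour   # prediction window starts after 6 h + 2 h
    end = 12 * freq_per_hour
    return int(intervention[start:end].any())

seq = np.zeros(48, dtype=int)
seq[35] = 1  # intervention active inside the prediction window (steps 32 to 47)
label = extract_label(seq)
```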

After preprocessing, each patient is represented by a time series of inputs $x_{1:T} = \{x_t \in \mathbb{R}^D\}_{t=1}^{T}$, a time series of missingness masks $m_{1:T} = \{m_t \in \{0,1\}^D\}_{t=1}^{T}$, where D = 104, a vector of static features $s \in \mathbb{R}^M$, M = 4, and a set of binary outcome labels y ∈ {0, 1}^L, L = 5. The time series x1:T and m1:T cover 6 hours of measurements at a resolution of 15 minutes between steps, resulting in sequences of length 25. The final data set $\mathcal{D} = \{x^n_{1:T}, m^n_{1:T}, s^n, y^n\}_{n=1}^{N}$ is then split into a training, validation and test set, stratified with respect to the labels y.

In Fig 1 we visualize an exemplary set of time series of one patient. We can observe that some covariates, such as the heart rate or oxygen saturation, have many successive measurements and their evolution over time can be studied directly, while others, such as CO2, are missing values at the majority of time steps and the signal of their dynamics is much sparser. This visualization also shows the two types of correlations evident in the sequential data. Firstly, the values of variables may be correlated over time, as we can see from the evolution of the heart rate and the diastolic blood pressure. Secondly, the times at which values were measured may correlate as well, i.e. the patterns of missingness of different input variables can be related.

4.2 The HealthGen model

Here we introduce the main technical contribution of this work: the generative model we propose for the task of conditionally generating medical time series data, which we christen HealthGen. The HealthGen model consists of a dynamical VAE-based architecture that allows for the generation of feature time series with informative missing values, conditioned on high-level static variables and binary labels.

Machine learning models can generally be categorized into discriminative or generative models. While discriminative models aim to learn decision boundaries in the data, generative models aim to learn the underlying distribution of the data. In contrast to discriminative models, generative models allow samples from the data distribution to be drawn, enabling the generation of synthetic data sets. A widely used family of generative models are variational autoencoders (VAEs) [46], which also constitute the basis of our HealthGen model. VAEs consist of an encoder, or inference network, which encodes the data to a lower dimensional latent representation, and a decoder, or generation network, which takes the latent representation and attempts to decode the data back to its original state. By pushing the reconstructed data to be close to the original observation, as well as imposing constraints on the structure of the latent space, VAE models can efficiently learn to model the underlying data-generating distribution.

The HealthGen model leverages an extension of the VAE framework to sequential data, where sequential latent variables z1:T describe the dynamics of the observed data in the latent space. In HealthGen, we introduce an additional static latent variable v to capture the time-invariant characteristics of the data. In the remainder of this section, we present the generative and inference models of HealthGen. In S2 Appendix, we provide a more detailed model description including the functional forms of all the distributions, and all implementation details required to reproduce our results.

Generative model

As discussed in the previous section, our data consists of a feature time series x1:T representing the physiological state of a patient, a sequence of binary missingness masks m1:T indicating when a value of x1:T is observed and when it is missing, static observable variables s, and labels y. The pattern of missingness is notably not random, and highly informative, as preliminary experiments have shown (see S1 Appendix). This result is in line with the findings of Che et al. [43], who show that missing values in medical time series play a key role in downstream predictive tasks. Given their evident importance, we explicitly model the missingness masks m1:T alongside the observed feature sequence x1:T.

The decoder network of the generative model aims to generate x1:T and m1:T from the latent variables v and z1:T. The generative process starts from the static latent variable v with a fixed unconditional prior $p(v) = \mathcal{N}(v; 0, I)$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. The observed features x1:T and missingness masks m1:T are then independently generated as follows.

First, the sequence of missingness masks m1:T is generated from the static latent variable v, conditioned on the static observable variable s and label y. We do not model this as a dynamical process, but rather generate the entire sequence in one step, modelling its joint distribution with independent Bernoulli distributions:

$$p_{\theta_m}(m_{1:T} \mid v, s, y) = \prod_{t=1}^{T} \prod_{d=1}^{D} \mathrm{Bernoulli}(m_{t,d}; \mu_{t,d}). \qquad (2)$$

The subsequent generation of the observed features sequence x1:T is based on the SRNN model [44], including ideas from the DSAE [58] and the conditional VAE [59] models. A transition model between timesteps in the latent space is learned, and at each timestep t the latent variable zt is decoded together with the internal RNN state ht, with additional conditioning on the static latent v, the static features s, and the labels y, to generate xt. The generative distribution for x1:T is given by:

$$p_{\theta_x}(x_t \mid z_t, h_t, v, s, y) = \mathcal{N}\big(x_t; \mu_{\theta_x}(z_t, h_t, v, s, y), \operatorname{diag}\{\sigma^2_{\theta_x}(z_t, h_t, v, s, y)\}\big). \qquad (3)$$

Finally, the joint distribution of all variables, conditional on the observed static features s and labels y, is:

$$p(x_{1:T}, m_{1:T}, z_{1:T}, h_{1:T}, v \mid s, y) = p(v)\, p_{\theta_m}(m_{1:T} \mid v, s, y) \prod_{t=1}^{T} p_{\theta_x}(x_t \mid z_t, h_t, v, s, y)\, p_{\theta_z}(z_t \mid z_{t-1}, h_t)\, p_{\theta_h}(h_t \mid x_{t-1}, h_{t-1}, v). \qquad (4)$$

A graphical representation of the dependencies implied by this generative distribution is shown in Fig 5.

Fig 5. Probabilistic graphical model of the generative process of HealthGen.

Fig 5

Diamond shaped nodes represent deterministic variables, round nodes probabilistic variables. Arrows represent direct dependencies. Shaded nodes represent observed or generated variables.

To generate synthetic data, we sample the features and missingness masks from the generative model p(x1:T, m1:T | s, y) using ancestral sampling as described in Algorithm 1. In practice, the conditioning is implemented by concatenating the vectors s and y, so to perform unconditional generation we simply do not pass s at the conditioning step, but only y. Note that we sample $z_0 \sim \mathcal{N}(z_0; 0, I)$ rather than fixing it to 0, as we observed that this empirically yields synthetic data that is more useful for the downstream tasks.

Algorithm 1 HealthGen sampling.

1: Set values for conditionals s, y
2: Sample v ∼ p(v) from the static latent prior
3: Sample missingness masks m1:T ∼ pθm(m1:T | v, s, y)
4: Sample z0 ∼ N(z0; 0, I)
5: Initialize h0 ← 0
6: for t ← 1 to T do
7:  Sample zt ∼ pθz(zt | zt−1, ht)
8:  Sample xt ∼ pθx(xt | zt, ht, v, s, y)
9:  Encode xt to obtain ht+1 ← eh(xt, v, ht)
10: end for
11: return x1:T, m1:T
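For illustration, the sampling procedure of Algorithm 1 can be sketched as below. The trained networks are replaced by random linear maps with tanh nonlinearities, and the Gaussian sampling of z_t is collapsed to its mean; all weights and dimensions here are placeholders, not the trained HealthGen model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, Z, H, V = 25, 104, 32, 64, 32  # sequence length, feature and latent dims
S, Y = 4, 5                          # static feature and label dims

# Hypothetical stand-ins for the trained networks: random linear maps.
W_mu = rng.normal(size=(V + S + Y, T * D)) * 0.1      # mask decoder logits
W_z = rng.normal(size=(Z + H, Z)) * 0.1               # latent transition (mean only)
W_x = rng.normal(size=(Z + H + V + S + Y, D)) * 0.1   # feature decoder
W_h = rng.normal(size=(D + H + V, H)) * 0.1           # deterministic RNN step

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(s, y):
    v = rng.normal(size=V)                               # step 2: v ~ p(v)
    logits = np.concatenate([v, s, y]) @ W_mu
    m = (rng.uniform(size=T * D) < sigmoid(logits)).reshape(T, D)  # step 3
    z, h = rng.normal(size=Z), np.zeros(H)               # steps 4-5
    xs = []
    for _ in range(T):
        z = np.tanh(np.concatenate([z, h]) @ W_z)        # step 7 (mean only)
        x = np.concatenate([z, h, v, s, y]) @ W_x        # step 8 (mean only)
        h = np.tanh(np.concatenate([x, h, v]) @ W_h)     # step 9
        xs.append(x)
    return np.stack(xs), m                               # step 11

x_gen, m_gen = sample(np.zeros(S), np.zeros(Y))
```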

Inference model

Similarly to the generative process, we split the inference model into two steps, beginning with the inference of the static variable v from the observable feature sequence x1:T, the missingness pattern sequence m1:T, the static features s, and the label y. The approximate posterior distribution of v, i.e. the distribution of the static latent variable conditioned on the above-mentioned quantities, is formalized as:

$$q_{\phi_v}(v \mid x_{1:T}, m_{1:T}, s, y) = \mathcal{N}\big(v; \mu_{\phi_v}(x_{1:T}, m_{1:T}, s, y), \operatorname{diag}\{\sigma^2_{\phi_v}(x_{1:T}, m_{1:T}, s, y)\}\big). \qquad (5)$$

The static latent variable v encodes the static features as well as the label, but it also encodes static information from the time series inputs. This allows our model to capture high-level information about a patient’s state that is not explicitly represented by any of the static features alone. By splitting the inference into a static latent variable and a latent time series representing the underlying dynamics of the observable features, our model learns to separate the time-invariant content of a given sample from the dynamics that govern the evolution of the time-varying parts of its state. A patient’s general state has a large effect on the temporal evolution of their time-varying lower-level states, which is reflected in our model’s conditioning on the static features (both latent and observed) at multiple steps during the inference and generative processes.

This is formalized in the approximate posterior distribution of the latent sequence z1:T, which is defined as follows:

$$q_{\phi_z}(z_t \mid z_{t-1}, g_t) = \mathcal{N}\big(z_t; \mu_{\phi_z}(z_{t-1}, g_t), \operatorname{diag}\{\sigma^2_{\phi_z}(z_{t-1}, g_t)\}\big) \quad \text{for } t > 1. \qquad (6)$$

The overall inference model of HealthGen can then be written as follows:

$$q_{\phi}(z_{1:T}, g_{1:T}, h_{1:T}, v \mid x_{1:T}, m_{1:T}, s, y) \qquad (7)$$
$$= q_{\phi_v}(v \mid x_{1:T}, m_{1:T}, s, y) \qquad (8)$$
$$\times \prod_{t=1}^{T} q_{\phi_z}(z_t \mid z_{t-1}, g_t)\, q_{\phi_g}(g_t \mid x_t, h_t, g_{t+1}, v)\, p_{\theta_h}(h_t \mid x_{t-1}, h_{t-1}, v). \qquad (9)$$

A graphical overview of the described inference procedure, together with all modelled dependencies during the encoding step are visualized in Fig 6.

Fig 6. Probabilistic graphical model of HealthGen at inference time.

Fig 6

Diamond shaped nodes represent deterministic variables, round nodes probabilistic variables. Arrows represent direct dependencies. Shaded nodes represent observed variables.

Training HealthGen

VAE-based models are trained by maximizing the Evidence Lower BOund (ELBO), a lower bound on the data log likelihood, and we adapt this objective to optimize our model as well. The general structure of the ELBO includes a term which encourages a faithful reconstruction of the data after it has been encoded and subsequently decoded, as well as a term penalizing the deviation of the latent distribution from a chosen prior distribution. HealthGen’s parameters are trained by maximizing an ELBO on the data log likelihood conditional on the labels and observable static variables. The final functional form of the ELBO which we optimize is given by:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_{\phi_v}(v \mid x, m, s, y)}\Big[\log p_{\theta_m}(m \mid v, s, y) + \sum_t \mathbb{E}_{q_{\phi_z}(z_{1:t} \mid \tilde{g}_{1:t})}\big[\log p_{\theta_x}(x_t \mid z_t, \tilde{h}_t, v, s, y)\big]\Big] - D_{\mathrm{KL}}\big(q_{\phi_v}(v \mid x, m, s, y) \,\|\, p(v)\big) - \sum_t \mathbb{E}_{q_{\phi_z}(z_{1:t-1} \mid \tilde{g}_{1:t-1})}\Big[D_{\mathrm{KL}}\big(q_{\phi_z}(z_t \mid z_{t-1}, \tilde{g}_t) \,\|\, p_{\theta_z}(z_t \mid z_{t-1}, \tilde{h}_t)\big)\Big].$$

We can discern reconstruction terms for both the observed feature sequence x1:T and the missingness masks m1:T in the first two summands, as well as KL divergences in the following terms, penalizing the deviation of the inferred latent distributions of v and z1:T from their respective priors. The full derivation used to arrive at this objective function is presented in S2 Appendix. The parameters θ = [θx, θm, θz, θh] of the generative model and ϕ = [ϕv, ϕz, ϕg] of the inference model are jointly trained by gradient descent on the negative ELBO. The KL divergence terms have analytical expressions, and all intractable expectations are approximated with Monte Carlo estimation. In practice, we mask the reconstruction loss term of the features x1:T with the masks m1:T, so that only features which have actually been observed contribute to the learning signal.

4.3 Baseline models

We choose three generative models against which to compare our approach: the SRNN [44], the KVAE [45] and the TimeGAN [47]. These models were chosen in an effort to select examples from the literature that represent the state-of-the-art in generative sequence modelling across different architectures. For technical details on the baseline models, please refer to the original papers.

The SRNN model is chosen to represent the “classical” dynamical VAE model: it utilizes RNNs as encoder and decoder and models the internal dynamics of the inferred latent sequence with an explicit transition model. In the comprehensive comparison of DVAE models provided by Girin et al. [60] it emerges as the best-performing model, leading us to select it as the representative of this class of generative models.

The KalmanVAE is also included in our comparison due to its unique approach to modelling dynamics in the latent space. It combines a VAE with a classical linear state-space model for the latent dynamics, resulting in interesting properties of the inference process. Fraccaro et al. [45] show that this approach works well in settings with well-described dynamics, such as low-dimensional mechanical systems, raising the question of how well it translates to the dynamics of clinical observables.

Models based on the GAN architecture are renowned for generating high-quality synthetic data. To investigate whether this also holds in the setting we consider, we compare against the state-of-the-art GAN model for sequential data. In their original publication, Yoon et al. [47] also present one experiment on the MIMIC-III data set, making the TimeGAN model, a priori, one of the most direct competitors to our approach.

Since none of the models described above can generate data conditioned on labels y, we provide a minor extension to all of them to enable a fairer comparison. Drawing inspiration from the Conditional VAE model [59], we repeat the labels y across all T time steps, yielding y1:T, and encode them as additional features during training. The resulting latent sequence is then again extended by y1:T before decoding. At generation time, we can choose y1:T as we wish, append it to the sampled prior or random noise, and decode to obtain a generated sample conditioned on the desired label.
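The label-conditioning extension described above amounts to a simple tiling operation; a sketch, with shapes matching the data description in Section 4.1:

```python
import numpy as np

def append_labels(x, y):
    """Repeat the label vector y across all T time steps and append it to
    the feature sequence, as done to make the baselines label-conditional.
    x: (T, D) feature sequence, y: (L,) label vector.
    Returns an array of shape (T, D + L)."""
    T = x.shape[0]
    y_rep = np.tile(y, (T, 1))  # y_{1:T}: the same labels at every step
    return np.concatenate([x, y_rep], axis=1)

x = np.zeros((25, 104))
y = np.array([1, 0, 0, 1, 0])
x_cond = append_labels(x, y)
```

At generation time, the same tiled labels are appended to the sampled latent sequence (or noise) before decoding.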

4.4 Evaluation

Quantitatively evaluating generative models is no trivial task, and in the setting where the generated data takes the form of real valued time series, this is especially true. Generative models that have been widely heralded as impressive examples of their class often convince the reader with generated human faces that are indiscernible from real images [61, 62]. In the medical setting, where specialized domain knowledge is necessary to tell the difference between real and fake samples, the quality of generated imaging data is presented to clinical experts, who then discriminate between synthetic and real samples [31].

Unfortunately, none of these approaches apply to the medical time series data we aim to synthesize. It may be possible to identify generated data of extremely low quality by visual inspection, but once a certain fidelity is achieved, discerning between a “better” and a “worse” example of generated samples is no longer qualitatively possible.

We therefore rely on the Train on Synthetic, Test on Real (TSTR) evaluation paradigm, first introduced by Esteban et al. [63]. A conceptual overview of our evaluation pipeline is presented in Fig 7. Let $E$ denote the evaluation model trained on the real data $\mathcal{D}_{\text{train}}$ and $\hat{E}$ the evaluation model trained on the synthetic data $\hat{\mathcal{D}}_{\text{train}}$. $E$ and $\hat{E}$ share the same architecture and are trained according to identical procedures with the same hyperparameters. Both models are then evaluated on the same held-out real test data $\mathcal{D}_{\text{test}}$:

$$e = E(\mathcal{D}_{\text{test}}), \qquad (10)$$
$$\hat{e} = \hat{E}(\mathcal{D}_{\text{test}}). \qquad (11)$$

The quantitative measure of how well a given generative model M performs is the difference between $e$ and $\hat{e}$. If the generative model M captures the dependencies in the real data that are informative for the downstream task represented by the evaluation model, and succeeds in synthesizing these in the generated data set $\hat{\mathcal{D}}$, this is reflected in a score $\hat{e}$ that is close, or ideally equal, to $e$.
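A toy end-to-end illustration of the TSTR protocol, with class-conditional Gaussians standing in for patient data and a nearest-centroid classifier standing in for the GRU-D evaluation model (both hypothetical simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, shift):
    """Toy stand-in for a patient data set: class-conditional Gaussians."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 5)) + shift * y[:, None]
    return x, y

def fit_centroids(x, y):
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, x, y):
    pred = np.argmin(((x[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

real_train, y_rt = make_data(500, shift=2.0)
synth_train, y_st = make_data(500, shift=1.8)   # an imperfect "generative model"
test_x, test_y = make_data(500, shift=2.0)

e = accuracy(fit_centroids(real_train, y_rt), test_x, test_y)        # train on real
e_hat = accuracy(fit_centroids(synth_train, y_st), test_x, test_y)   # TSTR
gap = abs(e - e_hat)  # the smaller the gap, the more faithful the synthetic data
```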

Fig 7. Conceptual overview of the experimental pipeline.

Fig 7

The generative model M is trained with the real training data, allowing it to generate a synthetic data set. Two identical evaluation models $E$ and $\hat{E}$ are then trained with the real or synthetic data, respectively. Finally, these evaluation models are tested on the real data, yielding the metric $e$ derived from the real training set and its counterpart $\hat{e}$ derived from the synthetic training data.

The model that implements E in practice is the GRU-D model [43] for medical time series classification. Based on the Gated Recurrent Unit (GRU) [64], this model was specifically introduced for classifying time series with missing values in the medical domain. It identifies two characteristics of missing values in the healthcare setting: first, the value of the missing variable tends to decay to some default value if its last measurement happened a long time ago (homeostasis), and second, the importance of an input variable will decrease if it has not been observed for an extended period of time.

These principles are modelled with two separate decay mechanisms. If a variable is missing for a number of time steps, its value decays toward the empirical mean of its measurements over time. The second decay is applied to the internal hidden state of the RNN cell, to model the waning importance of states that have not been updated in a while. In addition to the input features x1:T, the GRU-D model also takes the masks m1:T as well as the times since the last measurement $\delta_{1:T} = \{\delta_t \in \mathbb{R}^D\}_{t=1}^{T}$ as direct input.
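The first decay mechanism can be sketched as follows; gamma is a learned, per-feature parameter in GRU-D, fixed to a constant here for illustration:

```python
import numpy as np

def input_decay(x_last, x_mean, delta, gamma=0.5):
    """GRU-D-style input decay: a missing variable's imputed value decays
    from its last observation toward the empirical mean as the time since
    that observation (delta) grows."""
    w = np.exp(-np.maximum(0.0, gamma * delta))
    return w * x_last + (1.0 - w) * x_mean

# Just after a measurement (delta = 0) the last value is used unchanged;
# long after it (large delta) the imputation approaches the empirical mean.
v0 = input_decay(x_last=100.0, x_mean=80.0, delta=0.0)
v_late = input_decay(x_last=100.0, x_mean=80.0, delta=50.0)
```

An analogous exponential decay is applied to the hidden state to model the waning relevance of stale information.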

While previous works have also used the TSTR framework to evaluate the quality of their generated medical time series data [47, 63], we argue that our setting is better suited to evaluating generative models in the healthcare domain. The key difference we wish to highlight is the proxy task, implemented by the downstream evaluation model, that is chosen for the evaluation. Past approaches have attempted to predict the value of the next step in the input sequence [47], or to predict whether a time series surpasses a pre-defined threshold [63]. We opt for a downstream model that is specifically designed for a clinically relevant prediction task using real-world medical time series. This constitutes a setting much closer to a real application in healthcare, thus facilitating a comparison of generative models according to relevant criteria instead of contrived metrics.

4.5 Uncertainty estimation

In all of our experiments, we repeat each run with five random initialisations of weights for the entire experimental pipeline, i.e. the generative model as well as the downstream evaluation model. For each generative model, we then choose the initialisation with the highest resulting downstream performance and estimate the 95% confidence interval of the mean of the AUROC score by performing bootstrap resampling 30 times on the resulting generated synthetic data set. This allows us to report and compare not only the obtained performance of the models we consider, but also the uncertainty of our chosen metric.
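The confidence-interval estimate can be sketched as follows. This is a simplified version that resamples a vector of per-evaluation scores rather than the full synthetic data set; the score values in the test are placeholders, not results from the paper:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=30, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for the mean of a
    metric (e.g. AUROC) by resampling the scores with replacement,
    mirroring the 30-fold bootstrap described above."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), float(lo), float(hi)
```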

In the second and final experiments of this work, we additionally perform statistical tests to quantify the significance of differences between competing approaches. Here, we take the bootstrap scores of the settings we wish to compare and perform the one-sided, non-parametric Mann-Whitney U test [65] to determine whether one approach significantly outperforms the other.
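The one-sided test can be sketched as below, using the standard normal approximation to the U statistic (without tie correction); in practice a library routine such as SciPy's would be used:

```python
import numpy as np
from math import erf, sqrt

def mann_whitney_u_greater(x, y):
    """One-sided Mann-Whitney U test (normal approximation, no tie
    correction): p-value for the alternative 'x is stochastically
    greater than y', as used to compare bootstrap score sets."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    both = np.concatenate([x, y])
    order = np.argsort(both)
    ranks = np.empty(n1 + n2)
    ranks[order] = np.arange(1, n1 + n2 + 1)
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2      # U statistic for x
    mu = n1 * n2 / 2                              # mean of U under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # std of U under H0
    z = (u - mu - 0.5) / sigma                    # continuity correction
    return 0.5 * (1 - erf(z / sqrt(2)))           # P(Z >= z)
```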

4.6 Memorization analysis

We analyze the privacy-preserving characteristics of our model in a similar fashion to DuMont Schütte et al. [31]. To find the nearest neighbour of a synthetic sample among the real data, we measure the distances between their respective latent encodings. To this end, we take our trained model and encode a randomly sampled synthetic patient, yielding a 32-dimensional static latent vector v and a 32-dimensional latent time series z1:T with 25 time steps. After flattening the time dimension in z1:T and concatenating the static latent vector v, we end up with an 832-dimensional latent representation of the synthetic patient. We repeat this process for all patients in the real training data set, again yielding an 832-dimensional latent representation for each real patient. Then, using the cosine distance between vectors, we find the three nearest neighbours of the randomly sampled synthetic patient and plot the respective time series of this generated patient and its nearest neighbours amongst the training data in order to qualitatively compare them. A randomly sampled synthetic patient with its three nearest neighbours is visualized in Fig 4.
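The nearest-neighbour search described above can be sketched as follows; the function name and array shapes are illustrative (with T = 25 and d = 32, the concatenated representation is 832-dimensional, as in the text):

```python
import numpy as np

def nearest_neighbours(z_syn, v_syn, Z_real, V_real, k=3):
    """Find the k real patients whose latent encodings are closest
    (in cosine distance) to one synthetic patient's encoding.
    z_syn: (T, d) latent time series, v_syn: (d,) static latent vector;
    Z_real: (N, T, d), V_real: (N, d). Flattening the time dimension and
    concatenating the static vector yields one vector per patient."""
    q = np.concatenate([z_syn.ravel(), v_syn])
    refs = np.concatenate([Z_real.reshape(len(Z_real), -1), V_real], axis=1)
    # cosine distance = 1 - cosine similarity
    sims = refs @ q / (np.linalg.norm(refs, axis=1) * np.linalg.norm(q))
    dists = 1.0 - sims
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]
```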

Supporting information

S1 Appendix. Preliminary findings and additional experimental results.

(PDF)

S2 Appendix. Model implementation and training details.

(PDF)

S3 Appendix. Data set characteristics.

(PDF)

S4 Appendix. Visualizations of synthetically generated data.

(PDF)

Data Availability

The utilized MIMIC-III data set (https://physionet.org/content/mimiciii/1.4/) is publicly available to researchers who have completed a course certifying their ability to handle sensitive patient data. Access to the data may be requested at https://mimic.mit.edu/.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7(299):299ra122. doi: 10.1126/scitranslmed.aab3719 [DOI] [PubMed] [Google Scholar]
  • 2. Sandfort V, Johnson AEW, Kunz LM, Vargas JD, Rosing DR. Prolonged Elevated Heart Rate and 90-Day Survival in Acutely Ill Patients: Data From the MIMIC-III Database. J Intensive Care Med. 2019;34(8):622–629. doi: 10.1177/0885066618756828 [DOI] [PubMed] [Google Scholar]
  • 3. Schwab P, Mehrjou A, Parbhoo S, Celi LA, Hetzel J, Hofer M, et al. Real-time prediction of COVID-19 related mortality using electronic health records. Nat Commun. 2021;12(1):1058. doi: 10.1038/s41467-020-20816-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Tomašev N, Glorot X, Rae JW, Zielinski M, Askham H, Saraiva A, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572(7767):116–119. doi: 10.1038/s41586-019-1390-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Alaa AM, Chan AJ, van der Schaar M. Generative Time-series Modeling with Fourier Flows. International Conference on Learning Representations. 2021. [Google Scholar]
  • 6. van Panhuis WG, Paul P, Emerson C, Grefenstette J, Wilder R, Herbst AJ, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14(1):1144. doi: 10.1186/1471-2458-14-1144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Hyland SL, Faltys M, Hüser M, Lyu X, Gumbsch T, Esteban C, et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med. 2020;26(3):364–373. doi: 10.1038/s41591-020-0789-4 [DOI] [PubMed] [Google Scholar]
  • 8. Johnson AEW, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3(1):160035. doi: 10.1038/sdata.2016.35 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018;5(1):180178. doi: 10.1038/sdata.2018.178 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Dexter G, Grannis S, Dixon B, Kasthurirathne SN. Generalization of Machine Learning Approaches to Identify Notifiable Conditions from a Statewide Health Information Exchange. AMIA Joint Summits on Translational Science proceedings. 2020;2020:152–161. [PMC free article] [PubMed] [Google Scholar]
  • 11. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Med. 2018;15(11):1–17. doi: 10.1371/journal.pmed.1002683 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2009. p. 248–255.
  • 13. Chen J, Chun D, Patel M, Chiang E, James J. The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC Med Inform Decis Mak. 2019;19(1):1–9. doi: 10.1186/s12911-019-0793-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Chen RJ, Lu MY, Chen TY, Williamson DF, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng. 2021:1–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Tucker A, Wang Z, Rotalinti Y, Myles P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. NPJ Digit Med. 2020;3(1):1–13. doi: 10.1038/s41746-020-00353-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Frid-Adar M, Klang E, Amitai M, Goldberger J, Greenspan H. Synthetic data augmentation using GAN for improved liver lesion classification. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE; 2018. p. 289–293. [Google Scholar]
  • 17. Buczak AL, Babin S, Moniz L. Data-driven approach for creating synthetic electronic medical records. BMC Med Inform Decis Mak. 2010;10(1):59. doi: 10.1186/1472-6947-10-59 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Wang Z, Dai Z, Poczos B, Carbonell J. Characterizing and Avoiding Negative Transfer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. [Google Scholar]
  • 19. Gao Y, Cui Y. Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nat Commun. 2020;11(1):5131. doi: 10.1038/s41467-020-18918-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Rolf E, Worledge TT, Recht B, Jordan M. Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. In: Proceedings of the 38th International Conference on Machine Learning; 2021. p. 9040–9051.
  • 21. Qiu YL, Zheng H, Devos A, Selby H, Gevaert O. A meta-learning approach for genomic survival analysis. Nat Commun. 2020;11(1):6350. doi: 10.1038/s41467-020-20167-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Sul JH, Martin LS, Eskin E. Population structure in genetic studies: Confounding factors and mixed models. PLOS Genet. 2018;14(12):1–22. doi: 10.1371/journal.pgen.1007309 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Zhao Q, Adeli E, Pohl KM. Training confounder-free deep learning models for medical applications. Nat Commun. 2020;11(1):6010. doi: 10.1038/s41467-020-19784-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Thompson HM, Sharma B, Bhalla S, Boley RA, McCluskey C, Dligach D, et al. Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. J Am Med Inform Assoc: JAMIA. 2021;28:2393–2403. doi: 10.1093/jamia/ocab148 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Tremblay J, Prakash A, Acuna D, Brophy M, Jampani V, Anil C, et al. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops; 2018. p. 969–977.
  • 26. Ahmed O, Träuble F, Goyal A, Neitz A, Wüthrich M, Bengio Y, et al. CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning. International Conference on Learning Representations. 2021. [Google Scholar]
  • 27. Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects. Conference on Robot Learning (CoRL). 2018. [Google Scholar]
  • 28.Wood E, Baltrušaitis T, Hewitt C, Dziadzio S, Johnson M, Estellers V, et al. Fake It Till You Make It: Face analysis in the wild using synthetic data alone. arXiv preprint. 2021. Available from: https://arxiv.org/abs/2109.15102v2.
  • 29. McSharry PE, Clifford GD, Tarassenko L, Smith LA. A dynamical model for generating synthetic electrocardiogram signals. IEEE Trans Biomed Eng. 2003;50(3):289–294. doi: 10.1109/TBME.2003.808805 [DOI] [PubMed] [Google Scholar]
  • 30. Quiroz-Juárez MA, Jiménez-Ramírez O, Vázquez-Medina R, Breña-Medina V, Aragón JL, Barrio RA. Generation of ECG signals from a reaction-diffusion model spatially discretized. Sci Rep. 2019;9(1):19000. doi: 10.1038/s41598-019-55448-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. DuMont Schütte A, Hetzel J, Gatidis S, Hepp T, Dietz B, Bauer S, et al. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. NPJ Digit Med. 2021;4(1):141. doi: 10.1038/s41746-021-00507-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Gohorbani A, Natarajan V, Coz DD, Liu Y. DermGAN: Synthetic Generation of Clinical Skin Images with Pathology. arXiv preprint. 2019. Available from: https://arxiv.org/abs/1911.08716v1.
  • 33. Kohlberger T, Liu Y, Moran M, Chen PHC, Brown T, Hipp JD, et al. Whole-Slide Image Focus Quality: Automatic Assessment and Impact on AI Cancer Detection. Journal of Pathology Informatics. 2019;10:39. doi: 10.4103/jpi.jpi_11_19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Skandarani Y, Jodoin PM, Lalande A. GANs for Medical Image Synthesis: An Empirical Study. arXiv preprint arXiv:2105.05318. 2021. [DOI] [PMC free article] [PubMed]
  • 35.Dash S, Yale A, Guyon I, Bennett KP. Medical Time-Series Data Generation using Generative Adversarial Networks. In: International Conference on Artificial Intelligence in Medicine. Springer; 2020. p. 382–391.
  • 36. Jarrett D, Bica I, van der Schaar M. Time-series Generation by Contrastive Imitation. Advances in Neural Information Processing Systems. 2021;34. [Google Scholar]
  • 37. van Breugel B, Kyono T, Berrevoets J, van der Schaar M. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks. Advances in Neural Information Processing Systems. 2021;34. [Google Scholar]
  • 38. Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol. 2020;20(1):1–40. doi: 10.1186/s12874-020-00977-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Ma C, Zhang C. Identifiable Generative models for Missing Not at Random Data Imputation. Advances in Neural Information Processing Systems. 2021;34. [Google Scholar]
  • 40.Nabi R, Bhattacharya R, Shpitser I. Full law identification in graphical models of missing data: Completeness results. In: International Conference on Machine Learning; 2020. p. 7153–7163. [PMC free article] [PubMed]
  • 41. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592. doi: 10.1093/biomet/63.3.581 [DOI] [Google Scholar]
  • 42. Scheffer J. Dealing with Missing Data. Res Lett Inf Math Sci. 2002;3:153–160. [Google Scholar]
  • 43. Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep. 2018;8(1):6085. doi: 10.1038/s41598-018-24271-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Fraccaro M, Sønderby SrK, Paquet U, Winther O. Sequential Neural Models with Stochastic Layers. Advances in Neural Information Processing Systems. 2016;29. [Google Scholar]
  • 45. Fraccaro M, Kamronn S, Paquet U, Winther O. A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning. Advances in Neural Information Processing Systems. 2017;30. [Google Scholar]
  • 46. Kingma DP, Welling M. Auto-Encoding Variational Bayes. International Conference on Learning Representations. 2014. [Google Scholar]
  • 47. Yoon J, Jarrett D, van der Schaar M. Time-series Generative Adversarial Networks. Advances in Neural Information Processing Systems. 2019;32. [Google Scholar]
  • 48. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Nets. Advances in Neural Information Processing Systems. 2014;27. [Google Scholar]
  • 49. Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. Found Trends Theor Comput Sci. 2014;9(3–4):211–407. [Google Scholar]
  • 50.Hyland SL, Tople S. An Empirical Study on the Intrinsic Privacy of SGD. arXiv preprint. 2020. Available from: https://arxiv.org/abs/1912.02919v3.
  • 51. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, et al. The future of digital health with federated learning. NPJ Digit Med. 2020;3(1):119. doi: 10.1038/s41746-020-00323-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Sheller MJ, Edwards B, Reina GA, Martin J, Pati S, Kotrotsou A, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep. 2020;10(1):12598. doi: 10.1038/s41598-020-69250-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Wang S, McDermott MBA, Chauhan G, Ghassemi M, Hughes MC, Naumann T. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. In: Proceedings of the ACM Conference on Health, Inference, and Learning; 2020. p. 222–235.
  • 54. Ghassemi M, Pimentel M, Naumann T, Brennan T, Clifton D, Szolovits P, et al. A Multivariate Timeseries Modeling Approach to Severity of Illness Assessment and Forecasting in ICU with Sparse, Heterogeneous Clinical Data. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. doi: 10.1609/aaai.v29i1.9209 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. McDermott M, Yan T, Naumann T, Hunt N, Suresh H, Szolovits P, et al. Semi-Supervised Biomedical Translation With Cycle Wasserstein Regression GANs. Proceedings of the AAAI Conference on Artificial Intelligence. 2018. doi: 10.1609/aaai.v32i1.11890 [DOI] [Google Scholar]
  • 56.Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M. Continuous State-Space Models for Optimal Sepsis Treatment: a Deep Reinforcement Learning Approach. In: Proceedings of the 2nd Machine Learning for Healthcare Conference; 2017. p. 147–163.
  • 57.Suresh H, Hunt N, Johnson A, Celi LA, Szolovits P, Ghassemi M. Clinical Intervention Prediction and Understanding using Deep Networks. arXiv preprint. 2017. Available from: https://arxiv.org/abs/1705.08498v1.
  • 58.Yingzhen L, Mandt S. Disentangled Sequential Autoencoder. In: International Conference on Machine Learning; 2018. p. 5670–5679.
  • 59. Sohn K, Lee H, Yan X. Learning Structured Output Representation using Deep Conditional Generative Models. Advances in Neural Information Processing Systems. 2015;28. [Google Scholar]
  • 60.Girin L, Leglaive S, Bie X, Diard J, Hueber T, Alameda-Pineda X. Dynamical Variational Autoencoders: A Comprehensive Review. arXiv preprint. 2020. Available from: https://arxiv.org/abs/2008.12595v3.
  • 61. Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. p. 4401–4410. [DOI] [PubMed]
  • 62. Vahdat A, Kautz J. NVAE: A Deep Hierarchical Variational Autoencoder. Advances in Neural Information Processing Systems. 2020;33. [Google Scholar]
  • 63.Esteban C, Hyland SL, Rätsch G. Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv preprint. 2017. Available from: https://arxiv.org/abs/1706.02633v2.
  • 64.Cho K, van Merrienboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. ACL; 2014. p. 1724–1734.
  • 65. Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Stat. 1947;18(1):50–60. doi: 10.1214/aoms/1177730491 [DOI] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000074.r001

Decision Letter 0

Henry Horng-Shing Lu, Gilles Guillot

13 Apr 2022

PDIG-D-22-00047

Conditional Generation of Medical Time Series for Extrapolation to Underrepresented Populations

PLOS Digital Health

Dear Dr. Bing,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 12 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Gilles Guillot

Academic Editor

PLOS Digital Health

Journal Requirements:

1. Your co-authors, Stefan Bauer (stefan.a.bauer@gmail.com) and Patrick Schwab (patrick.x.schwab@gsk.com), have not confirmed authorship of the manuscript. We have resent them the authorship confirmation email; however please check that the above email address for them is correct and follow up personally to ensure they confirm. Please note that we cannot pass your manuscript to Production until we have received confirmations from all co-authors.

Just in case your co-authors are having difficulty confirming their authorship, you may advise them to send us an email at digitalhealth@plos.org and we will confirm their authorship on the authors' behalf.

2. Please update the completed 'Competing Interests' statement. Please declare all competing interests beginning with the statement “I have read the journal's policy and the authors of this manuscript have the following competing interests:”.

3. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

4. Please provide separate figure files in .tif or .eps format and remove any figures embedded in your manuscript file. Please also ensure that all files are under our size limit of 20MB. If you are using LaTeX, you do not need to remove embedded figures.

For more information about how to convert your figure files please see our guidelines: https://journals.plos.org/digitalhealth/s/figures

5. We notice that your supplementary figures and tables are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that all Supporting Information files are included correctly and that each one has a legend listed in the manuscript after the references list.

Additional Editor Comments (if provided):

This manuscript reports an important and interesting piece of work.

In addition to the comments of three reviewers, I would add the following remark: most of the material presented is technical, abstract and based on highly specialised data analytics techniques. On the other hand, the PLOS Digital Health readership is highly diverse and not necessarily familiar with the techniques implemented. There is a need to bridge the gap between the current state of the manuscript and the journal's readership. In particular, the Material and Methods section does a poor job in its current state: it contains a lot of acronyms and abbreviations, it makes use of many variables and distributions that are not defined, and the rationale of the model is not stated anywhere. Equations and diagrams cannot be a substitute for explanations in plain English. This section requires a thorough re-writing.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present an interesting method, HealthGen, to work on an important topic of generating medical time series (EHR) data for underrepresented population groups. The authors report findings that suggest that, compared with other methods including SRNN, KVAE, and TimeGAN, HealthGen performs better in generating realistic patient cohorts. In addition, the generated samples would lead to a better fairness representation for the minority group populations, and the model trained on these generated samples of underrepresented populations would have better fairness than the model trained on the real dataset. The authors studied two population groups, ethnicity and insurance types, and showed that the method proposed in this manuscript performed best in most scenarios. Overall, this method and these findings are important for model fairness and performance disparity reduction/elimination; the manuscript is well written, the message is clear, and the organization is easy to follow. Still, the findings remain limited and the authors should consider the following comments:

Minor:

The reference should be sorted in order. Reference [1] is not cited in the article.

Major:

Since the author is working on underrepresented population groups, it would be better to discuss some other works including Federate Learning, transfer learning [1, 2], population allocation [3], meta learning [4], etc. Also, it would be better to mention some bias reduction and mitigation methods [5].

Another issue is the confounding factors [6,7]. It would be better to consider some factors, like age, sex, education background, ethnicity, etc. If you do not want to consider these confounding factors, please discuss the reasons.

[1]: Wang, Z., Dai, Z., Póczos, B. and Carbonell, J., 2019. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11293-11302).

[2]: Gao, Y. and Cui, Y., 2020. Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nature communications, 11(1), pp.1-8.

[3]: Rolf, E., Worledge, T.T., Recht, B. and Jordan, M., 2021, July. Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. In International Conference on Machine Learning (pp. 9040-9051). PMLR.

[4]: Qiu, Y.L., Zheng, H., Devos, A., Selby, H. and Gevaert, O., 2020. A meta-learning approach for genomic survival analysis. Nature communications, 11(1), pp.1-11.

[5]: Thompson, H.M., Sharma, B., Bhalla, S., Boley, R., McCluskey, C., Dligach, D., Churpek, M.M., Karnik, N.S. and Afshar, M., 2021. Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. Journal of the American Medical Informatics Association, 28(11), pp.2393-2403.

[6]: Ewers, R.M. and Didham, R.K., 2006. Confounding factors in the detection of species responses to habitat fragmentation. Biological reviews, 81(1), pp.117-142.

[7]: Sul, J.H., Martin, L.S. and Eskin, E., 2018. Population structure in genetic studies: Confounding factors and mixed models. PLoS genetics, 14(12), p.e1007309.

Reviewer #2: This paper by Bing et al. presents an interesting new approach to the

generation of synthetic medical time series data. The proposed

HealthGen model is based on the foundation of a RNN-based deep

learning architecture that incorporates static values as conditioning

factors on Tom series data generation processes. The authors use

this conduct multiple experiments demonstrating the potential utility

of this approach for generation of synthetic data that might

ameliorate poor model performance for specific subgroups, leading to

more fair models.

This paper presents a novel and thoughtful approach at the

intersection of two key problems: appropriate generation of

high-fidelity data for construction of medical models and disparities

in clinical AI. The analyses provided demonstrate the utility of the

model for developing datasets that overcome differential model

performance across ethnicities and sources of payment, two key sources

of potential biases in medical AI models.

The paper is generally convincing and the model seems highly

promising. There are some concerns that should be addressed to

clarify the presentation of the work and to strengthen the argument.

The fidelity of the time series models is central to the success of

the proposed HealthGen model Put simply, the synthetic time series

must be close enough to the original time series to be useful as

synthetic data without being simplistic duplications of the original

data. Figure 1, Figure 4, and Appendix D are used to support the

argument that the synthetic time series are close, but not too close

to the originals. However, as these results are purely qualitative,

it's hard to tell if they are are generalizable. This argument would be

much more compelling if the authors could propose an appropriate

measure of similarity and to demonstrate that this measure showed

appropriate characteristics over a large dataset.

This suggestion is made with full understanding that the appropriate

metric is far from obvious, as the goal would be to measure success in

hitting a "sweet spot" between complete lack of correlation and

memorization. Possibilities such as distance correlation/covariance

might be considered, but it is not clear that they would

suffice. Additional challenges include the need to ensure that

synthesized time series exhibit clinically appropriate and feasible

behavior. For example, a synthetic dataset with high correlation to an

actual dataset might still be inappropriate if the synthetic data

exhibiting unrealistically large fluctuations. If a metric-based

approach is not feasible, another alternative might be an adversarial

approach - if an appropriately-trained classifier was not able to

successfully classify sequences as either being real or synthetic,

this might be seen as an argument for the success of the data

synthesis approach.

On another, perhaps more minor note, the discussion of the colloid

bollus augmentation results in Figures 2 and 3 seems to understate

the potential importance of those results. Tables C.2 and C.3 clearly

illustrate that the number of government-funded colloiud bolus

positive samples is very small, representing 0.95% (for colloid) and

3% (government funded). From the base of roughly 34,000 patients, this

amounts to approximately 10 patients at the intersection of these two

groups (assuming that the two factors are roughly independent). It's

not surprising that the initial results are so poor, but the

remarkable improvement on the augmented data is well-worth

emphasizing.

Regarding the presentation of the methods, the figures are useful and the equations are helpful. However, the details are a bit dense, reading as if presented for a NeurIPS audience, as opposed to the presumably broader audience that might be reading PLOS Digital Health. A slightly gentler introduction might be preferred for this audience.

A minor concern: the lack of external validation might be seen as a weakness. Although there is no reason to believe that this method would not work on a second data set, this validation should be explored at some point in the future, and deserves mention in the current manuscript.

Reviewer #3: The paper presents HealthGen, a conditional generative model for synthetic patient cohorts. In addition to producing synthetic data more faithful to real data than previous state-of-the-art models, their model also allows for conditional generation of specific patient populations, a feature missing in previous state-of-the-art models. This conditional generation allows them to specifically generate data for underrepresented groups and therefore reduce downstream AI models’ inequality in predictive power between over- and underrepresented groups.

Most parts of the paper are well written, and it is easy to follow the authors’ arguments for the most part. Some parts of the paper could be rewritten to better integrate into the overall story, and parts of the evaluation could be improved to better demonstrate the points the authors want to make. The introduction is especially well written and gives an excellent motivation for the problem. My main issue with the paper is that the authors do not provide enough data to sufficiently support their claims.

Issues regarding the performance of the model include the following:

1. The paper sets up 4 static variables and 5 classification tasks, but the authors only show results for a fraction of possible combinations of static variables and tasks. Additional information is needed to show that the model’s performance holds for the other combinations of static variables and classification tasks.

2. In Figure 4 the authors show a synthetic patient and the data of the 3 closest real patients. While the data of the real patients looks similar, the synthetic data looks very different (different value ranges, different trends), leading to doubts whether their method of identifying nearest neighbors for synthetic patients is working. Additional statistics for, e.g., a sample of 10 synthetic patients and the distance to their nearest neighbors, as well as the distance between the nearest neighbors, would give more insight.

3. As described in 2., Figure 4 leads to doubts whether their method of identifying nearest neighbors works. The authors should therefore consider using another baseline approach like dynamic time warping to measure the distances between patient time series to verify that their method is working.
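The dynamic-time-warping baseline suggested in point 3 could be prototyped with the classic dynamic-programming recurrence. The sketch below is illustrative only (not the authors' method): it computes the O(n·m) DTW distance between two 1-D time series with an absolute-difference local cost.

```python
import numpy as np

def dtw_distance(s, t):
    """Dynamic time warping distance between two 1-D sequences,
    using the standard O(len(s) * len(t)) dynamic program."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])
```

Because DTW tolerates local time shifts and stretches, using it as the nearest-neighbor metric would test whether the doubts about Figure 4 stem from the distance measure rather than from the generated data itself.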

4. For the results in Section 2.4 they make an unfair comparison: “Since the baselines cannot generate data conditioned on static variables, we have them unconditionally generate the same number of overall samples that our model augments the real data with.”

As the baseline models can create any number of synthetic patients, they should create synthetic patients (discarding synthetic patients from overrepresented classes) until they have enough to create a balanced dataset. This would show whether the baselines are able to create underrepresented groups with the same quality as the conditional approach, while also demonstrating that the conditional approach makes this much simpler and more targeted.
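The rejection-sampling procedure suggested in point 4 can be sketched in a few lines. This is a hypothetical illustration: `generate_batch` and `group_of` are stand-ins for whatever unconditional generator and group-labeling function a given baseline provides.

```python
from collections import Counter

def balanced_synthetic_cohort(generate_batch, group_of, targets, max_draws=100_000):
    """Rejection-sample an unconditional generator until every target
    group count is met.

    generate_batch(n) -> list of synthetic patients  (hypothetical stand-in)
    group_of(patient) -> hashable group label        (hypothetical stand-in)
    targets: dict mapping group label -> required number of samples
    """
    kept, counts, drawn = [], Counter(), 0
    while drawn < max_draws and any(counts[g] < k for g, k in targets.items()):
        for patient in generate_batch(256):
            drawn += 1
            g = group_of(patient)
            # keep only patients from groups that still need samples
            if g in targets and counts[g] < targets[g]:
                kept.append(patient)
                counts[g] += 1
    return kept
```

The `max_draws` cap guards against groups the unconditional generator (almost) never produces, which is exactly the inefficiency that a conditional generator such as HealthGen avoids.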

5. The downstream classification model’s strong performance on some underrepresented groups indicates that group membership might already carry information. Some initial information on how the medical interventions are distributed over group memberships would help put the results in context.

6. P. 6: “In all considered tasks, our approach significantly outperforms the state-of-the-art models in synthetically generating medical time series.” This is not true. See, for example, Table 1 (the improvement for colloid_bolus is likely not significant) or Fig. 3b overall (performance worse than the baseline, with a significant difference).

7. Very small underrepresented groups see a large increase in predictive power of the downstream model. This could be because the generative model learns to ‘overfit’ on the few available samples from these groups.

Issues regarding the improved fairness from using synthetic data:

8. The authors define a way to measure fairness and claim that their approach produces fairer results, but they never provide numerical evidence to support this claim.

9. If the model overfits on underrepresented groups, it is also likely to reproduce biases found in the dataset. “Furthermore, the quality and diversity of the generated data which our model produces is limited by the real data with which it is trained” is not enough of a warning for this possibility.

10. Ideally, one would use more than one dataset for evaluation to ensure that the model actually learns to generalize the properties of underrepresented groups rather than overfitting on patterns observed in the data. However, I concede that this is very difficult due to the limited availability of medical datasets and their limited compatibility.

Other issues with the introduction:

11. The term missingness is mentioned in the introduction (p. 3). While its meaning becomes clear in the later context, a short explanation would improve reading flow here.

Other issues with the results section:

12. The initial description of the dataset should explicitly mention that it contains only patients from one hospital, i.e. “[…] patients that spent time in the intensive care unit (ICU) in the Beth Israel hospital …”. Together with the preceding introduction, this can otherwise make it look like the dataset contains a variety of patients from different populations and the HealthGen model can be used out-of-the-box by anyone wishing to improve their dataset diversity.

13. For Fig. 1 and Fig. 4 it is not clear why or how these features were selected for display from the set of all features. Reducing the number of features and the number of patients so the image can fit horizontally would immensely improve readability.

14. The benefit of having Fig. 1 is unclear; the displayed synthetic time series look very different from the real ones, and only the patterns of missingness seem appropriate.

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Harry Hochheiser

Reviewer #3: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000074.r003

Decision Letter 1

Henry Horng-Shing Lu, Gilles Guillot

10 Jun 2022

Conditional Generation of Medical Time Series for Extrapolation to Underrepresented Populations

PDIG-D-22-00047R1

Dear Mr Bing,

We are pleased to inform you that your manuscript 'Conditional Generation of Medical Time Series for Extrapolation to Underrepresented Populations' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Gilles Guillot

Academic Editor

PLOS Digital Health

***********************************************************

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #3: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Preliminary findings and additional experimental results.

    (PDF)

    S2 Appendix. Model implementation and training details.

    (PDF)

    S3 Appendix. Data set characteristics.

    (PDF)

    S4 Appendix. Visualizations of synthetically generated data.

    (PDF)

    Attachment

    Submitted filename: Response_Letter.pdf

    Data Availability Statement

    The utilized MIMIC-III data set (https://physionet.org/content/mimiciii/1.4/) is publicly available to researchers after having completed a course to certify their capability to handle sensitive patient data. Access to the data may be requested at https://mimic.mit.edu/.


    Articles from PLOS Digital Health are provided here courtesy of PLOS
