Table 2. Fidelity results with utility metrics.
Utility with all features

| Target | Model | Metric | MIMIC-III (Train on Real) | MIMIC-III (Train on Synth) | eICU (Train on Real) | eICU (Train on Synth) |
|---|---|---|---|---|---|---|
| Mortality | GBDT | AUC | 0.762 | 0.736 | 0.943 | 0.938 |
| | | AP | 0.304 | 0.261 | 0.600 | 0.534 |
| | RF | AUC | 0.723 | 0.710 | 0.954 | 0.945 |
| | | AP | 0.276 | 0.251 | 0.600 | 0.580 |
| | GRU | AUC | 0.728 | 0.667 | 0.937 | 0.938 |
| | | AP | 0.278 | 0.193 | 0.567 | 0.528 |
| | LR | AUC | 0.712 | 0.680 | 0.872 | 0.818 |
| | | AP | 0.233 | 0.207 | 0.323 | 0.260 |
| | Average | AUC | 0.731 | 0.689 | 0.926 | 0.909 |
| | | AP | 0.272 | 0.228 | 0.522 | 0.475 |
Utility with random subsets of features

| Target | Model | Metric | MIMIC-III Mean-diff | MIMIC-III p-value (X = 0.04) | eICU Mean-diff | eICU p-value (X = 0.04) |
|---|---|---|---|---|---|---|
| Mortality | RF | AUC | 0.009 | 0.000 | 0.009 | 0.000 |
| | | AP | 0.035 | 0.000 | 0.035 | 0.098 |
| Gender | | AUC | 0.065 | 1.000 | 0.019 | 0.000 |
| | | AP | 0.046 | 0.860 | 0.013 | 0.000 |
(Upper) Downstream task performance with four different predictive models under two training settings (train on real vs. train on synthetic) on the MIMIC-III and eICU datasets. Performance is evaluated on the original test sets. The best performance in each column is shown in bold. (Lower) The average absolute performance difference (in terms of AUC/AP) between training on real vs. synthetic data, and the corresponding p-values (computed by a one-sample t-test), for predicting mortality and gender with random subsets of features.
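The evaluation summarized in Table 2 can be illustrated with a short sketch. The snippet below is a minimal, hypothetical reconstruction rather than the paper's actual pipeline: it trains the same classifier once on real and once on synthetic data, scores both on the real test set with AUC/AP (upper table), then repeats the comparison over random feature subsets and applies a one-sample t-test to the absolute AUC gaps (lower table). The placeholder data, the choice of random forest, the number of subsets, and the one-sided alternative hypothesis are all assumptions.

```python
# Minimal sketch of the utility evaluation, using random placeholder data;
# the actual MIMIC-III / eICU features are not reproduced here.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, d = 1000, 20
X_real = rng.normal(size=(n, d))
y_real = (X_real[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
X_synth = X_real + rng.normal(scale=0.1, size=(n, d))  # stand-in "synthetic" data
y_synth = y_real.copy()
X_test = rng.normal(size=(n, d))
y_test = (X_test[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)


def evaluate(train_X, train_y, test_X, test_y):
    """Train a classifier and report AUC / AP on the held-out real test set."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_X, train_y)
    scores = model.predict_proba(test_X)[:, 1]
    return roc_auc_score(test_y, scores), average_precision_score(test_y, scores)


# Upper part of Table 2: train on real vs. train on synthetic data,
# both evaluated on the same real test set.
auc_real, ap_real = evaluate(X_real, y_real, X_test, y_test)
auc_synth, ap_synth = evaluate(X_synth, y_synth, X_test, y_test)

# Lower part of Table 2: repeat the comparison over random feature subsets and
# test whether the mean absolute AUC gap stays below a small tolerance (0.04,
# matching the value in the table header) with a one-sample t-test.
diffs = []
for _ in range(20):  # the number of subsets is an assumption
    cols = rng.choice(d, size=d // 2, replace=False)
    a_r, _ = evaluate(X_real[:, cols], y_real, X_test[:, cols], y_test)
    a_s, _ = evaluate(X_synth[:, cols], y_synth, X_test[:, cols], y_test)
    diffs.append(abs(a_r - a_s))

mean_diff = float(np.mean(diffs))
# One-sided test of H0: mean gap >= 0.04 against H1: mean gap < 0.04;
# a small p-value indicates the synthetic data preserves downstream utility.
t_stat, p_two_sided = stats.ttest_1samp(diffs, popmean=0.04)
p_value = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(f"mean |AUC gap| = {mean_diff:.3f}, p = {p_value:.3f}")
```

Under this reading, a p-value near zero supports the claim that synthetic training preserves utility, which is consistent with the pattern in the lower table: the only mean difference exceeding 0.04 (gender AUC on MIMIC-III, 0.065) yields a p-value of 1.000.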