. 2024 Dec 3;8:e2400056. doi: 10.1200/CCI.24.00056

Actionability of Synthetic Data in a Heterogeneous and Rare Health Care Demographic: Adolescents and Young Adults With Cancer

Joshi Hogenboom 1, Aiara Lobo Gomes 1, Andre Dekker 1, Winette Van Der Graaf 2,3, Olga Husson 2,4, Leonard Wee 1
PMCID: PMC11627331  PMID: 39626135

Abstract

PURPOSE

Research on rare diseases and atypical health care demographics is often slowed by high interparticipant heterogeneity and overall scarcity of data. Synthetic data (SD) have been proposed as a means for data sharing, enlargement, and diversification by artificially generating data that reproduce real phenomena while obscuring the real patient data. The utility of SD is actively scrutinized in health care research, but the role of sample size in the actionability of SD is insufficiently explored. We aim to understand the interplay of actionability and sample size by generating SD sets of varying sizes from gradually diminishing amounts of real individuals' data. We evaluate the actionability of SD in a highly heterogeneous and rare demographic: adolescents and young adults (AYAs) with cancer.

METHODS

A population-based cross-sectional cohort study of 3,735 AYAs was subsampled at random to produce 13 training data sets of varying sample sizes. We studied four distinct generator architectures built on the open-source Synthetic Data Vault library. Each architecture was used to generate SD of varying sizes on the basis of each of the aforementioned training subsets. SD actionability was assessed by comparing the resulting SD with their respective real data against three metrics: veracity, utility, and privacy concealment.

RESULTS

All examined generator architectures yielded actionable data when generating SD with sizes similar to the real data. Large SD sample size increased veracity but generally increased privacy risks. Using fewer training participants led to faster convergence in veracity, but partially exacerbated privacy concealment issues.

CONCLUSION

SD are a promising option for data sharing and data augmentation, yet sample size plays a significant role in their actionability. SD generation should go hand-in-hand with consistent scrutiny, and sample size should be carefully considered in this process.


Synthetic data, possibly game-changing for research hindered by scarcity, but when are they actionable?

INTRODUCTION

Health care data for rare diseases in heterogeneous populations—such as adolescents and young adults (AYAs) with cancer—are generally difficult to acquire, and sizable high-quality data sets on these participants are not widely available. Research on this important group has thus been hampered by limited data availability, resulting in a lack of standards for age-specific cancer care.1 Sharing of data sets is regularly constrained by administrative burden and privacy-related challenges. However, it is exactly the research into such diseases that would benefit most from greater access to larger volumes of high-quality data.

CONTEXT

  • Key Objective

  • Synthetically generated data offer vast new potential for studying rare and heterogeneous conditions, but what are the metrics of actionability that should be carefully evaluated before using synthetic data (SD) to accelerate research for areas that are hindered by data scarcity?

  • Knowledge Generated

  • Free and open-source SD generator tools are available that produce actionable SD. It is important to evaluate the effect of training and generated sample size in terms of three metrics—veracity, utility, and privacy. Before SD use, the actionability should be carefully evaluated in accordance with the intended use of the generated data sets.

  • Relevance

  • The use of actionable SD could be beneficial for the study of rare cancers and to answer critical questions.

Synthetic data (SD) have been suggested as a potential surrogate that addresses data availability and data acquisition,2-7 since they offer means to augment both sample size and data diversity. Here, SD are defined as artificially generated person-level health data that resemble, to some degree of authenticity, real individual health data, while containing no identifiable information of any actual person. In this case, the goal of SD is to catalyze knowledge generation in health care without exposing real humans' data. Although the use of SD within health care is still in development, ethicists, policymakers, and researchers are trying to establish how SD can be used appropriately.5,8 Meanwhile, various techniques9-12 are now openly available which researchers can use to train generative models and thus create SD for various purposes.13-22

Constructing an SD generator always requires some initial form of real data for training. A trained generative model (ie, generator) can then be used to produce SD of any arbitrary size. The training sample size is therefore an important question, and one that depends on the intended use of the SD. Likewise, it is not universally known, for a given generator, whether there are upper and lower limits on the SD sample size that ought to be produced. The interplay of actionability and sample size (both for training and generation) therefore warrants careful examination.

What designates SD as actionable remains a disputed topic2,23,24; in this work, we propose three evaluation domains that may be considered as minimally necessary for actionable SD. (1) Veracity, meaning that the generated SD are statistically similar to the training data in terms of distribution of values. (2) Utility, meaning that SD replicate the intervariable associations within the training data, for example, the coefficients in a logistic regression model. (3) Privacy concealment, in terms of obscuring individual identity and obscuring their observable attributes.

To elucidate the dependence of veracity, utility, and privacy concealment on sample size, we ran training and SD-generating experiments using real AYA cancer data (persons age 18-39 years at first cancer diagnosis). Cancer incidence is low in this demographic (only one million new cases annually worldwide), making this group difficult to study by conventional methods.12 Although this group experiences the burden of cancer in a starkly different way compared with pediatric and elderly patients,25,26 AYAs are predominantly treated in either pediatric or adult care institutes. Evidence to inform personalized age-specific patterns of cancer care is very limited for AYAs, hence our interest in understanding how SD might potentially be used to address AYA health-related hypotheses.

METHODS

Data from the SURVAYA study25,26 (ClinicalTrials.gov identifier: NCT05379387) were reused with permission from the sponsor. SURVAYA was a population-based cross-sectional cohort study among AYA cancer survivors registered in the Netherlands Cancer Registry (NCR). SURVAYA included AYAs who had been treated either in a Dutch academic medical center or in the Netherlands Cancer Institute, and were registered in the NCR between 1999 and 2015; Amsterdam AMC, Netherlands Cancer Institute, and Utrecht UMC supplied treatment-related data up to 2014. The principal instruments in SURVAYA were validated health-related quality-of-life questionnaires,27-30 which were linked to clinical attributes from the NCR. Details are provided in the original study publications.25,26 In total, there were 950 variables, some of which contained unstructured text.

We limited ourselves to the variables selected by Saris et al25 for logistic regression models of negative body image among AYAs. We used exactly the same inclusion and exclusion criteria, and the same data cleaning scripts, as the original research. We imputed the remaining missing values using multiple imputation by chained equations; the missing value rate was in fact very low (Data Supplement, Table S1). Age at diagnosis was retained as a privacy concealment challenge for attribute inference, since certain cancer diagnoses are widely known to be exceptionally rare among younger AYAs. The preprocessed training data, hereafter referred to as the original data set (OD), consisted of 3,735 AYAs with 21 variables (37 after one-hot encoding of categories; Data Supplement, Table S2).

Training and generating experiments are illustrated in Figure 1. The OD was sampled without replacement to produce 13 smaller training data sets, ranging from 3,600 down to 600 participants in steps of 250; together with the full OD, this gave a total of 14 distinct training sets.
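The subsampling scheme above can be sketched as follows. This is an illustrative reconstruction using only the sample-size figures from the text; the function name, participant identifiers, and seed are hypothetical and not the authors' code:

```python
import random

def make_training_subsets(od_records, sizes, seed=42):
    """Draw progressively smaller training subsets from the full
    original data set (OD), each sampled without replacement.
    The full OD is also kept as one of the training sets."""
    rng = random.Random(seed)
    subsets = {len(od_records): list(od_records)}
    for n in sizes:
        subsets[n] = rng.sample(od_records, n)
    return subsets

# The paper's grid: 3,600 down to 600 participants in steps of 250
# (13 subsets), plus the full OD of 3,735, for 14 training sets.
sizes = list(range(3600, 599, -250))
subsets = make_training_subsets(list(range(3735)), sizes)
```

Sampling without replacement within each subset (as stated in the text) ensures no participant is duplicated inside a single training set.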

FIG 1.


Overview of the experimental setup used to evaluate the role of OD and SD sample size in SD actionability. (A) The generator training process, including simulation of smaller sample sizes and sample size-specific generator training. (B) The data generation process, including synthetic data generation and actionability assessment. AYA, adolescents and young adults; OD, original data set; SD, synthetic data.

We tested four SD generator architectures from the Synthetic Data Vault (SDV)9 Python library. Two of these were machine learning (ML)–based architectures that used strictly parametric statistical models. The first was a speed-optimized Gaussian copula generator (hereafter Fast ML) and the other was a classical Gaussian copula generator (hereafter Gaussian Copula). Two nonparametric architectures were based on deep-learning neural networks: a Conditional Tabular Generative Adversarial Network (CTGAN)12 and a Conditional Generative Adversarial Network incorporating Differential Privacy (DP-CGAN).10 The DP-CGAN was not part of SDV's current library, but was built on top of existing SDV-library modules.

In the Fast ML architecture, each of the aforementioned subsets of the OD was used for training, resulting in 14 separately trained Fast ML generators. This process was then repeated in its entirety for the Gaussian Copula, CTGAN, and DP-CGAN architectures, always resulting in 14 separately trained generators within each of the four selected architectures.

Each architecture used the default (hyper)parameters for training specified by its developers. We limited our experiments to these default hyperparameters, with no fine-tuning and no grid search. All architectures were supplied with the metadata of the OD as required by the documentation.10,31 Hyperparameters per architecture and variable metadata are provided in the Data Supplement, Tables S3 and S4, respectively.

From each of the separately trained generators (four architectures × 14 training OD subsets per architecture = 56 unique generators), we generated SD of varying sample sizes ranging from 100 up to 39,100 artificial participants, in increments of 1,000. Thus, for each uniquely trained generator, we derived 40 distinct sets of SD (56 generators × 40 SD sets per generator = 2,240 SD sets in total).
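The resulting experimental grid can be sketched directly from the figures reported above (variable names are ours, for illustration only):

```python
# SD sample sizes: 100 up to 39,100 artificial participants,
# in increments of 1,000, giving 40 distinct SD sizes.
sd_sizes = list(range(100, 39101, 1000))

n_architectures = 4    # Fast ML, Gaussian Copula, CTGAN, DP-CGAN
n_training_sets = 14   # 13 subsets plus the full OD

n_generators = n_architectures * n_training_sets  # 56 unique generators
n_sd_sets = n_generators * len(sd_sizes)          # 2,240 SD sets in total
```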

Actionability was assessed by comparing a selected SD set, from a given generator, with the corresponding OD training set used to train that generator, against three metrics: veracity, utility, and privacy concealment. For instance, an SD set of 5,100 samples generated with a Fast ML generator trained on 3,735 real participants was compared with those specific 3,735 real participants.

Veracity of SD was quantified in terms of precision, density, recall, and coverage.32-34 The precision metric quantifies the proportion of individual participants in the SD that fall within a minimum number of neighboring participants (k) found in the OD. Density corrects for outliers in precision by scoring the number of samples inside the densely overlapping regions. Recall describes the proportion of OD samples that lie close to the SD samples. Finally, coverage corrects for outliers in recall by building the radii from the OD samples rather than the SD samples. A schematic overview of the veracity metrics' mechanisms is provided in the Data Supplement (Fig S1). The metrics were computed using the prdc Python library with a constant k = 5.
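The logic of these nearest-neighbor metrics can be sketched in plain Python. This is a simplified illustration of prdc-style precision and coverage, not the library's optimized implementation (the study itself used the prdc package, whose `compute_prdc(real_features, fake_features, nearest_k)` call returns all four metrics):

```python
import math

def knn_radius(points, idx, k):
    """Distance from points[idx] to its k-th nearest other point."""
    dists = sorted(math.dist(points[idx], p)
                   for j, p in enumerate(points) if j != idx)
    return dists[k - 1]

def precision(real, fake, k=5):
    """Fraction of synthetic samples falling inside the k-NN ball
    of at least one real sample (prdc-style precision)."""
    radii = [knn_radius(real, i, k) for i in range(len(real))]
    hits = sum(
        1 for f in fake
        if any(math.dist(r, f) <= radii[i] for i, r in enumerate(real))
    )
    return hits / len(fake)

def coverage(real, fake, k=5):
    """Fraction of real samples with at least one synthetic sample
    inside their own k-NN radius (prdc-style coverage)."""
    hits = 0
    for i, r in enumerate(real):
        rad = knn_radius(real, i, k)
        if any(math.dist(r, f) <= rad for f in fake):
            hits += 1
    return hits / len(real)
```

Density and recall follow the same pattern: recall swaps the roles of real and synthetic samples, and density counts how many real-sample balls contain each synthetic sample, normalized by k.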

Utility was examined by checking whether a given SD set reproduced the same logistic regression coefficients as the original paper by Saris et al,25 for generative architectures that had been trained using all 3,735 real participants. For each generator, we selected an SD output size of 3,100 for the comparison of odds ratios (ORs) with Saris et al. The regression analyses were conducted using Python libraries: Statsmodels for regression analysis and scikit-learn for an optimism-adjusted (through random under-sampling) concordance statistic, computed as the area under the receiver operating characteristic curve.
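A hypothetical helper shows how a single association might be labeled when comparing SD and OD regressions, under the assumption that "overlapping" means the SD's 95% CI contains the OD point estimate; the labels and the geometric-midpoint stand-in are ours for illustration, not the authors' analysis code:

```python
def compare_odds_ratio(od_or, sd_or_ci):
    """Label one association: 'overlapping' if the SD's 95% CI for
    the odds ratio contains the OD point estimate, otherwise
    'shifted', with a flag when the effect direction reverses."""
    lo, hi = sd_or_ci
    if lo <= od_or <= hi:
        return "overlapping"
    # Geometric midpoint of the CI as a rough stand-in for the SD
    # point estimate (ORs are multiplicative).
    sd_point = (lo * hi) ** 0.5
    if (od_or - 1) * (sd_point - 1) < 0:
        return "shifted (direction reversed)"
    return "shifted"
```

For example, an OD odds ratio of 1.8 against an SD interval of (1.2, 2.5) would count toward the "overlapping associations" column of Table 1, whereas (0.4, 0.9) would be a shifted effect with a reversed direction.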

Privacy concealment was assessed with two metrics. First, we measured the minimum Hamming distance10 between any synthetic participant in a given SD set and any real participant in the OD used to train the corresponding generator. Second, we estimated an attribute inference probability using seemingly innocent information that might be obtained through public channels. We limited the test variables to age in nearest whole year, type of cancer, and romantic partnership status, which we assumed could easily be gleaned from social media. We then assumed that an attacker would infer a sensitive and not easily known attribute, that is, an AYA's self-perception of sexual attractiveness (a primary outcome in SURVAYA). The attribute inference attack was simulated using a Correct Attribution Probability (CAP) algorithm.35
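Both privacy checks can be sketched with toy records represented as tuples of attributes. This is a simplified illustration of the minimum Hamming distance and of CAP-style attribution logic, not the SDMetrics implementation; the record layout and index arguments are hypothetical:

```python
def min_hamming_distance(synthetic, original):
    """Smallest number of differing fields between any synthetic
    record and any real record; 0 means a real participant has
    been replicated verbatim in the SD."""
    return min(
        sum(a != b for a, b in zip(s, o))
        for s in synthetic for o in original
    )

def cap_score(synthetic, original, key_idx, target_idx):
    """CAP-style sketch: for each real record, find synthetic
    records matching on the 'public' key fields (eg, age, cancer
    type, partnership status) and score how often the sensitive
    target field would be attributed correctly. Returned as a
    concealment-style score, where 1 is best and 0 is worst."""
    risks = []
    for o in original:
        key = tuple(o[i] for i in key_idx)
        matches = [s for s in synthetic
                   if tuple(s[i] for i in key_idx) == key]
        if matches:
            correct = sum(s[target_idx] == o[target_idx] for s in matches)
            risks.append(correct / len(matches))
        else:
            risks.append(0.0)
    return 1 - sum(risks) / len(risks)
```

A minimum Hamming distance of zero corresponds to the replication events discussed in the Results, while the CAP sketch mirrors the attack scenario above: key fields gleaned from public channels, target field the sensitive attribute.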

Institutional Review Board Statement

The SURVAYA study was conducted in accordance with the Declaration of Helsinki and approved by the Netherlands Cancer Institute Institutional Review Board (IRBd18122) on February 6, 2019. In the presented work, SURVAYA study data were reused with permission from the study principal investigator and sponsor.

Informed Consent Statement

Informed consent was obtained from all participants involved in the SURVAYA study.

RESULTS

Veracity

Figure 2 shows the sensitivity of coverage to training and SD sample size. For the generator architectures we tested, reasonable coverage (>0.75) was obtained for a wide range of training sample sizes, as long as more than 3,000 synthetic participants were generated. Among the four architectures, CTGAN overall yielded lower coverage across combinations of training and SD sample sizes. When the original data were relatively small and lacking in diversity, the chance of finding synthetic participants within the radii of the five closest real participants was relatively high; hence, coverage seems counterintuitively better for smaller training set sizes.

FIG 2.


Veracity in terms of coverage (left) and density (right), per color-coded training data sample size. The veracity score corresponds to the prdc output, where a value of 1 is best and 0 is worst. Each horizontal row of figures depicts the respective scores of the included generator architectures. CTGAN, Conditional Tabular Generative Adversarial Network; DP-CGAN, Conditional Generative Adversarial Network incorporating Differential Privacy; ML, machine learning.

Density was for the most part independent of training and SD sample size, as shown in Figure 2. However, CTGAN shows a stepwise improvement in the density metric from 1,100 to 3,100 training sample size. Similarly, the precision and recall scores were acceptably high and were independent of either training or SD sample size; the precision and recall curves are provided in the Data Supplement (Fig S2) and minimum scores were 0.86, 0.90, 0.75, and 0.93 for Fast ML, Gaussian Copula, CTGAN, and DP-CGAN, respectively, for both precision and recall.

Utility

The overall picture for the ORs in univariable and multivariable logistic regressions against negative body image was quite mixed. There was a moderate degree of overlapping associations among the variables, where the CI of the OR estimated from the SD included the same OR estimated in the OD. However, the number of overlapping associations between the SD and OD varied between different generator architectures.

Furthermore, some ORs that were not statistically significant in the OD became newly significant (P < .05) in the SD. Additionally, some ORs from the SD had been perturbed so much that their CIs no longer contained the OR derived from the OD, and other ORs had shifted effects.

The number of overlapping, shifted, and newly significant effects generally increased in multivariable regression compared with univariable regression. Table 1 summarizes the number of perturbations in the ORs observed from univariable and multivariable regressions, in terms of overlap or shifted OR estimates in the SD relative to the OD, and the number of statistically significant variables in the OD and SD.

TABLE 1.

Summary of the Number of Overlapping, Shifted, and Statistically Significant Associations (after one-hot encoding) Comparing a Synthetic Data Set of 3,100 Participants With the Original Data Set of 3,735 Participants

Architecture Overlapping Associations Shifted Associations Significant in SD Significant in OD
Fast ML
 Univariable 16/37 3/37 18/37 25/37
 Multivariable 26/37 1/37 14/37 18/37
Gaussian Copula
 Univariable 13/37 11/37 17/37 25/37
 Multivariable 18/37 9/37 12/37 18/37
CTGAN
 Univariable 17/37 6/37 31/37 25/37
 Multivariable 20/37 7/37 13/37 18/37
DP-CGAN
 Univariable 22/37 5/37 26/37 25/37
 Multivariable 23/37 9/37 14/37 18/37

Abbreviations: CTGAN, Conditional Tabular Generative Adversarial Network; DP-CGAN, Conditional Generative Adversarial Network incorporating Differential Privacy; ML, machine learning; OD, original data set; SD, synthetic data.

The adjusted c-statistics for multivariable regression over all variables on negative body image were 0.87, 0.88, 0.86, and 0.86 for the Fast ML, Gaussian Copula, CTGAN, and DP-CGAN generators, respectively.

Forest plots of the ORs (Data Supplement, Figs S3-S10) show that the effect estimates from the OD had not been consistently reproduced in the SD, irrespective of generator architecture.

Privacy Concealment

For a given training sample, as ever-larger SD sample sizes are generated, there comes a point, for all generators except DP-CGAN, at which a real participant's information becomes replicated in the SD. This corresponds to a minimum Hamming distance of zero in the given SD.

Figure 3 illustrates how the probability of a real patient being replicated in an SD data set generally tends to increase for all architectures except for DP-CGAN. The figure shows the proportion of identical participants in training and SD sets, normalized to the size of the training set. The DP-CGAN appears entirely insensitive to training and generation sample size, as we were consistently able to generate up to 39,100 SD samples without replicating any real participants.

FIG 3.


Identity concealment in relation to varying training and synthetic data sample size, per color-coded generator architecture. Each horizontal row depicts the respective scores per training data set sample size. CTGAN, Conditional Tabular Generative Adversarial Network; DP-CGAN, Conditional Generative Adversarial Network incorporating Differential Privacy; ML, machine learning.

For Fast ML and Gaussian Copula, when generating 39,100 SD participants, approximately 5% of synthetic participants perfectly resembled real participants, and this occurred independently of the training sample size. The number of replicated participants in the SD decreases roughly linearly toward zero as the SD size approaches zero.

The training sample size dependence of the CTGAN is notable: for 1,100 training participants and fewer, the number of replicated participants is very low, just above that of the DP-CGAN (which was always zero). However, when a training sample size of 3,100 or more was used, the CTGAN switched to a state in which it was much more likely than Fast ML or Gaussian Copula to replicate a real participant in the SD, consistently reaching up to 15% of synthetic participants perfectly resembling real participants when 39,100 synthetic participants were generated. Note that this is the same range of training sample sizes over which the CTGAN's density (and precision) veracity metrics switch from poor (approximately 0.50) to acceptable (approximately 0.70) in Figure 2. However, the training and generation loss curves (Data Supplement, Figs S11 and S12) did not indicate a clear sign of model collapse for the CTGAN in this range of training sample sizes.

Even if a real participant's values are not fully replicated in the SD, it might still be possible to infer sensitive information through attribute inference attacks. Figure 4 plots the attribute inference score from the CAP algorithm. Higher scores imply better concealment and therefore lower risk of exposing sensitive information (here, self-perception of attractiveness) via an attribute inference attack using only the SD. Concealment scores for Fast ML and Gaussian Copula were insensitive to training sample size and to SD sample size; median scores were 0.41 and 0.46, respectively. Concealment in CTGAN and DP-CGAN depended on training sample size, but not on SD output size; scores ranged from 0.32 to 0.40 and from 0.34 to 0.42, respectively. However, Figure 4 shows that attribute inference is not directly related to replication of real participants in the SD.

FIG 4.


Attribute concealment in relation to varying synthetic data sample size, per color-coded training data sample size. The median attribute inference score corresponds to the SDMetrics CAP algorithm outcome, where a value of 1 is best and 0 is worst. Each horizontal row depicts the respective scores of the included generator architectures. CAP, Correct Attribution Probability; CTGAN, Conditional Tabular Generative Adversarial Network; DP-CGAN, Conditional Generative Adversarial Network incorporating Differential Privacy; ML, machine learning.

DISCUSSION

Sample size played an important role in SD actionability. Training and SD sample sizes influenced veracity, as high diversity was only obtained through an appropriate balance between training and SD sample size, regardless of generator architecture; the CTGAN architecture did, however, require proportionally larger SD sizes to obtain equal diversity. Utility was only assessed with an SD sample size similar to the training data, and we found that intervariable associations were not uniformly reproducible in the SD. Privacy concealment was sensitive to sample size. The Fast ML and Gaussian Copula architectures showed a distinct decrease in identity concealment that was proportional to both training and synthetic sample sizes. CTGAN was able to achieve up to a factor 65 SD enlargement with smaller training data sets, but was susceptible to identity disclosure when certain training sample sizes were exceeded. The DP-CGAN architecture was the best performer in terms of identity concealment, as in none of our experiments did it include any training data in the SD. All SD had inferable attributes, irrespective of generator architecture, which carried a finite risk of disclosing sensitive real information.

Arora and Arora15 previously investigated SD sample size and found it to be less influential, but their sample size ranges were narrow, and they acknowledged the need for broader investigation. To the best of our knowledge, our combination of varying training and SD sizes is unique, whereas other work consistently focuses on balancing representativeness and privacy for a given training and SD sample size, hindering further comparison with our findings. The minimum Hamming distance for identity concealment indeed represents the most conservative scenario. However, the experiment was informative insofar as it demonstrated the increasing risk of disclosure as SD size increases, and the differing susceptibility among generator architectures. Inescapably, failed identity concealment can result in reidentification when an unfortunate combination of identifiable attributes matches real data held by a bad actor.36

We assume most clinical SD creators have limited computational resources and know-how to perform architecture adaptation and exhaustive tuning. Thus, we chose four accessible and easy-to-use generator architectures with default settings. It is, however, conceivable that actionability would improve after adaptation and tuning37 (despite apparent overfitting or underfitting), yet we expect this to be feasible for only a fraction of creators.

The implications would likely differ on the basis of the specifics of each use-case and accordingly require customized considerations. For instance, when augmenting a data set, certain architectures appear to have an upper enlargement limit; conversely, when sharing SD with a small sample size, for instance to prototype applications, the SD should still be sufficiently large to capture the real diversity. We suggest that researchers intending to use or publish SD first perform checks along the lines of those reported above. We recommend checking veracity and utility as a function of sample size, to be reasonably assured of the reliability of the generator model. Next, if the intended use of the SD warrants it, privacy and identity concealment checks should be considered.

The (in)dispensability of consistent SD scrutiny might be substantially influenced by real participants' characteristics, and the role of sample size requires further evaluation. Although this first work focused on evaluating the original authors' models on the SD, we are actively investigating other use-cases for these SD, such as network analysis, machine learning classifiers, and unsupervised clustering. In part, the necessity of consistent scrutiny reiterates the need for established methods to assess the actionability of SD.2,23,24 Regardless, SD can technically be made actionable, and continued innovation may not only simplify this process and its appraisal but also incorporate domains unmentioned here, such as SD fairness.38 Yet for true SD actionability in health care, established ethical and legal consensus is paramount, as it should dictate what actionability actually implies.5,8

Overall, SD are indeed able to resemble real data to a moderate degree. In our experiments with AYA cancer data, we found that sample size plays various roles in defining SD actionability. Typically, the SD sample size had to be sufficiently large to capture the full diversity of the real data, yet it should also not be too large, as this might exacerbate flaws in utility and increase privacy risks. The training sample size dictated this balance between veracity, privacy, and SD sample size; smaller training sample sizes generally appear best suited to smaller SD sample sizes.

ACKNOWLEDGMENT

All individuals who have contributed to this study are included in the author list. Experiments were made possible using the Data Science Research Infrastructure (https://dsri.maastrichtuniversity.nl/) hosted at Maastricht University.

PRIOR PRESENTATION

Presented as preprint on MedRxiv, available via: https://doi.org/10.1101/2024.03.04.24303526.

SUPPORT

Supported in part by the European Union's Horizon 2020 research and innovation program through The STRONG-AYA Initiative (Grant agreement ID: 101057482), the Innovative Medicines Initiative, the Digital Oncology Network for Europe, the European Regional Development Fund, the Netherlands Organization for Scientific Research through a Vidi grant (ID: 198.007), ZonMW, and Stichting Hanarth Fonds.

DATA SHARING STATEMENT

The code and software versioning used for this work are available on GitHub with a working example, see: https://github.com/MaastrichtU-CDS/AYA-synthetic-data.

AUTHOR CONTRIBUTIONS

Conception and design: Joshi Hogenboom, Aiara Lobo Gomes, Andre Dekker, Olga Husson, Leonard Wee

Financial support: Andre Dekker, Leonard Wee

Provision of study materials or patients: Olga Husson

Collection and assembly of data: Joshi Hogenboom, Olga Husson

Data analysis and interpretation: All authors

Manuscript writing: All authors

Final approval of manuscript: All authors

Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Aiara Lobo Gomes

Research Funding: IQVIA (Inst), Johnson & Johnson/Janssen (Inst)

Travel, Accommodations, Expenses: IQVIA (Inst)

Andre Dekker

Employment: Medical Data Works B.V.

Stock and Other Ownership Interests: Medical Data Works B.V.

Honoraria: Varian Medical Systems, Accuray, Roche, Janssen-Cilag, Philips Research (Inst), Varian Medical Systems (Inst), Medtronic

Consulting or Advisory Role: Varian Medical Systems

Research Funding: Varian Medical Systems (Inst), Philips Healthcare (Inst), OncoRadiomics (Inst)

Patents, Royalties, Other Intellectual Property: Royalties from Mirada Medical Ltd on a deep learning application in medical imaging (Inst); Royalties from Varian Medical Systems on dosimetry software for radiation oncology (Inst); Royalties from Health Innovation Ventures on Radiomics, eLearning and prediction models in cancer (Inst); Royalties from PXI for software related to small animal irradiation for life sciences research (Inst). Patents currently held: Monitoring respiration based on plethysmographic heart rate signal (US patent 6,702,752); Apparatus and method for monitoring respiration with a pulse oximeter (US patent 6,709,402); Monitoring Mayer wave effects based on a photoplethysmographic signal (US patent 6,805,673); Monitoring physiological parameters based on variations in a photoplethysmographic baseline signal (US patent 6,896,661); Monitoring physiological parameters based on variations in a photoplethysmographic signal (US patent 7,001,337); Knowledge-Based Interpretable Predictive Model for Survival Analysis (US patent 8,078,554); Dose distribution modeling by region from functional imaging (US patent 8,812,240); Systems, methods and devices for analyzing quantitative information obtained from radiological images (US patent 9,721,340 B2)

Expert Testimony: Thoratec

Travel, Accommodations, Expenses: Medical Data Works B.V., Varian Medical Systems, Roche, Janssen-Cilag

Winette Van Der Graaf

Consulting or Advisory Role: Agenus (Inst), PTC Therapeutics (Inst), Novartis (I), Bayer (I)

Research Funding: Lilly (Inst)

No other potential conflicts of interest were reported.


Data Availability Statement

The code and software versions used for this work, together with a working example, are available on GitHub: https://github.com/MaastrichtU-CDS/AYA-synthetic-data.


Articles from JCO Clinical Cancer Informatics are provided here courtesy of Wolters Kluwer Health