Stat Med. 2025 Aug 13;44(18-19):e70227. doi: 10.1002/sim.70227

A Framework for Generating Realistic Synthetic Tabular Data in a Randomized Controlled Trial Setting

Niki Z Petrakos 1, Erica E M Moodie 1, Nicolas Savy 2
PMCID: PMC12345405  PMID: 40801475

ABSTRACT

Generation of realistic synthetic data has garnered considerable attention in recent years, particularly in the health research domain due to its utility in, for instance, sharing data while protecting patient privacy or determining optimal clinical trial design. While much work has been concentrated on synthetic image generation, generation of realistic and complex synthetic tabular data of the type most commonly encountered in classic epidemiological or clinical studies is still lacking, especially with regard to generating data for randomized controlled trials (RCTs). There is no consensus regarding the best way to generate synthetic tabular RCT data such that the underlying multivariate data distribution is preserved. Motivated by an RCT in the treatment of Human Immunodeficiency Virus, we empirically compared the ability of several strategies and three generation techniques (two machine learning methods and one more classical statistical method) to faithfully reproduce realistic data. Our results suggest that using a sequential generation approach with an R‐vine copula model to generate baseline variables, followed by a simple random treatment allocation to mimic the RCT setting, and subsequent regression models for variables post‐treatment allocation (such as the trial outcome) is the most effective way to generate synthetic tabular RCT data that capture important and realistic features of the real data.

Keywords: Adversarial Random Forest, copula, data generation, Generative Adversarial Network, randomized controlled trials, synthetic data, tabular data


Abbreviations

AIDS: acquired immunodeficiency syndrome
ARF: Adversarial Random Forest
ART: antiretroviral therapy
CDF: cumulative distribution function
CI: confidence interval
GAN: Generative Adversarial Network
HIV: human immunodeficiency virus
KS: Kolmogorov–Smirnov
KNN: k-nearest neighbors
OOB: out-of-bag
OR: odds ratio
RCT: randomized controlled trial
SMART: Sequential Multiple Assignment Randomized Trial
SMOTE: Synthetic Minority Oversampling Technique
TVD: total variation distance
URF: unsupervised random forest
XGBoost: Extreme Gradient Boosting

1. Introduction

Synthetic data generation has become a topic of increased interest in many disciplines, such as finance, climate science, and health research [1, 2, 3]. Across these domains, countless studies require the analysis of complex systems (e.g., testing trading algorithms in hypothetical economic market stressors, the analysis of extreme weather events impacting power grids, or determining the best medical treatment for a group of patients). However, the data required are often difficult to access, usually due to privacy reasons. Even when the data can be accessed, they can be incomplete or otherwise insufficient for answering the scientific question at hand [4, 5]. An increasingly common solution has been to turn towards synthetic data sets that are faithful to the original data and can demonstrate important features of the original data. Though an abundance of synthetic data generation methods exist, many produce generated data sets that are much too simplistic and cannot reflect the complexities of the real world and the unusual univariate and multivariate distributions that can be observed, especially with mixed data types [6, 7, 8]. In health research in particular, plasmode simulations were introduced in the early 2000s to allow some of the real‐world complexity to be captured in the simulated data [9], but a major drawback of this approach is that it often focuses only on the generation of an outcome variable, rather than generating realistic data for a large number of variables at once [5]. It may also be the case that repeated simulations are called for, or there may be a need to introduce more variability (such as to protect data privacy [10]). Recently, in medical research, where randomized controlled trials (RCTs) are often held as the “gold standard” for determining efficacy and safety of various treatments, data generation via “in silico trials” has gained attention. In silico trials are virtual trials that are conducted through computer simulations to generate synthetic data such that the generated distribution mimics the original data distribution without simply copying the original observations [11, 12]. The utility of such trials includes, for example, determining the optimal study design of a full‐scale RCT, as it would be of great benefit to study different iterations of study designs in a timely, efficient manner before deciding which design to implement in real life [12, 13, 14, 15]. Of course, as trial designs can be very complex with multiple features to consider (sample size, follow‐up time, definition of outcome, etc.), the ability to generate synthetic data that reflect real‐world complexities becomes a necessity. Additionally, it has been suggested that generating realistic synthetic data that reflect the same characteristics as the real data is essential for model validation in the field of unsupervised learning [12]. Thus, the need for complex synthetic data that are realistic exists in many settings.

Trials, and indeed a large number of epidemiological and clinical studies, have data that can be represented in tabular form. Tabular data refers to a table of data with each column representing a random variable such that all columns together follow an unknown joint distribution. Each row represents one observation from that (unknown) joint distribution. A major breakthrough in synthetic data generation occurred in 2014, when Goodfellow et al. introduced Generative Adversarial Networks (GANs), a form of deep learning [16]. GANs have been particularly influential in improving image generation, with many specific GAN architectures designed to handle image data [17, 18, 19]. However, considerably less research has been dedicated to the tabular data context (within which RCT data fall), with some notable exceptions of specific variants of GANs (termed GAN architectures) showing promise for this context [20, 21, 22, 23]. Even less work has focused on the context of small, tabular, RCT data in which use cases for synthetic data sets are just beginning to be explored and appreciated [14, 24, 25], as most work applied to tabular health care data is currently devoted to observational, electronic health records [26]. In comparison to image data, tabular data (and in particular tabular RCT data) have added difficulties for the data generation process, including variables with more complex dependence structures, small sample sizes, variables with different distributions (e.g., continuous and discrete variables), variables with complex distributions (e.g., multi‐modal continuous distributions, mixture distributions), and categorical variables with imbalanced class representation. These difficulties are not rare edge cases in real‐life applications; rather, these added complexities are commonplace in tabular RCT data, for which there are often fewer than a few hundred participants (i.e., observations/rows). For example, age and sex are almost always recorded in RCTs, where age usually follows a continuous distribution and sex a discrete distribution. Health scores are another commonly included variable in RCTs, which often follow a multi‐modal continuous distribution. Demographics (such as sex and ethnicity) may have groups that are rarely observed in a given sample (women and people of color are often not well represented in RCTs) [27, 28]. This leads to imbalanced class representations where the vast majority of observations are in a select few strata. Thus, it is important to consider data generation methods that are able to adequately handle these challenges.

More recently, tree‐based machine learning (ML) methods for synthetic data generation have also gained traction. One such method is the Adversarial Random Forest (ARF), which uses an adversarial training procedure similar to that of GANs but harnesses classification and regression trees rather than neural networks [29]. This method is appealing since it can naturally handle data with variables of mixed type (continuous, discrete) and requires far fewer computational resources compared to GANs. Although ARF was described in the seminal paper as “not (yet)” generating completely new, synthetic data [29], it is almost exclusively used for this purpose [30, 31, 32, 33]. While ML methods have increased in popularity in the data generation domain, they are not the only methods that hold promise. Some researchers have shown that copula‐based generation approaches can also work very well when generating realistic synthetic data [15, 34, 35, 36]. Though copulas were introduced decades ago in the statistics literature [37], their use for synthetic data generation is relatively new. Copulas, and in particular R‐vine copulas, are especially attractive for generating tabular RCT data because they can effectively capture univariate distributions by construction, as well as multivariate dependencies between variables (no matter the distribution of the variables).

There is currently no consensus in the literature on how best to generate tabular data for RCTs. Generally, there are two distinct ways of generating tables of data: either all at once (simultaneously), often using complex machine learning algorithms, or in a sequential fashion. Sequential data generation has blossomed in a variety of contexts, such as music and non‐tabular health data [38, 39]; however, it is still rather new in the tabular RCT data domain [36, 40]. Figure 1 provides a schematic for understanding the difference between the simultaneous and sequential frameworks, where we use the term “execution models” for models fit to real data to generate synthetic observations for one variable (often, one outcome) at a time. This sequential approach is derived from the well‐known area of agent‐based models, which have wide‐ranging applications, including public health and infectious disease modeling [41, 42, 43]. Indeed, these models have also been suggested to determine the design of community‐randomized prevention studies [44]. The basic idea of agent‐based modeling is to simulate a baseline population, pre‐specify outcomes of interest, simulate changes in the behavior of the baseline population through predictive modeling, and then measure the outcomes of interest. While much of the machine learning literature provides methodologies for performing data generation in a simultaneous fashion, we hypothesize that a sequential framework, similar to the workflow of agent‐based models, would be better suited to the task of generating tabular RCT data, due to the inherently sequential nature of RCT data. RCT data have a temporal aspect and causal relationships between variables that may be more naturally captured by sequentially generating data. In an RCT, researchers first collect study participant observations for a set of baseline variables. Then, participants are randomized to treatment. After that, they are often followed for a period of time, and follow‐up data are collected at certain intervals, perhaps repeatedly or simply once with a final outcome. In a simultaneous data generation framework, such as one that uses GANs, temporal relationships between variables are not considered. Since the algorithm learns the distributions of all variables at once and generates data for all variables at once, information from variables collected later in time can inform the generation of variables representing earlier time points that would not, in the real world, be influenced by future events or measurements. However, though simultaneous generation does not mimic how an RCT is carried out in “nature”, the borrowing of future information to inform variables that are temporally precedent could greatly improve data generation performance when certain variables, no matter the temporal relationship, are highly correlated.

FIGURE 1.

Schematic of the steps in the simultaneous versus sequential data generation frameworks. Synth stands for synthetic, Tx stands for treatment assignment, Post‐Rand. stands for post‐treatment randomization variables, and Y stands for the final outcome. $n_{\text{real}}$ and $n_{\text{synth}}$ represent the number of rows in the real and synthetic data sets, respectively. While $n_{\text{real}}$ and $n_{\text{synth}}$ need not be equivalent, we set $n_{\text{synth}}$ equal to $n_{\text{real}}$ in the work shown in this paper.
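
To make the contrast in Figure 1 concrete, the following minimal Python sketch outlines the two frameworks. The helper names (`fit_generator`, `fit_execution_model`, the column arguments, and the `tx` column) are hypothetical placeholders, not code from this work:

```python
import numpy as np
import pandas as pd

def generate_simultaneous(real: pd.DataFrame, n_synth: int, fit_generator):
    """Simultaneous framework: learn the joint distribution of ALL columns at
    once (baseline, treatment, follow-up, outcome), then sample every column."""
    model = fit_generator(real)          # e.g., a CTGAN trained on the full table
    return model.sample(n_synth)

def generate_sequential(real: pd.DataFrame, n_synth: int, fit_generator,
                        baseline_cols, post_cols, fit_execution_model, seed=0):
    """Sequential framework: follow the RCT timeline step by step."""
    rng = np.random.default_rng(seed)
    # Step 1: generate the baseline cohort only
    synth = fit_generator(real[baseline_cols]).sample(n_synth)
    # Step 2: mimic randomization; treatment is independent of baseline data
    synth["tx"] = rng.choice(4, size=n_synth)
    # Steps 3+: one execution model per post-randomization variable, in time order
    for col in post_cols:
        exec_model = fit_execution_model(real, response=col)  # regress on earlier vars
        synth[col] = exec_model.generate(synth)               # prediction + induced noise
    return synth
```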

For the synthetic data generation of tabular health data with mixed data types, the current literature generally harnesses GAN‐based methods, though these methods do not always show optimal performance. Wang and Pai (2023) investigated synthetic data generation in the context of small RCTs, finding that while GANs are very effective in generating diverse data observations, they require large data sets for training [25]. The authors also found that a hybrid generation approach that employed a rather simple minority oversampling method in combination with a GAN effectively generated realistic synthetic observations in the small tabular RCT setting with different variable types. Koloi et al. (2023) compared a copula‐based approach with GANs and another deep learning approach, Variational Autoencoders [45], for generating a virtual baseline patient population, and found that the copula approach yielded the best performance in generating synthetic data with distributions closest to the real data [46].

In this work, we aim to empirically study methods for generating synthetic tabular RCT data such that the real data distribution is preserved, and to provide recommendations for best practice. To accomplish this aim, we compare the simultaneous framework to the sequential framework. Further, within the sequential framework setting, the performance of a GAN‐based algorithm, a random forest‐based algorithm, and a copula‐based algorithm is compared. In particular, this work aims to investigate through a case study (i) the potential differences between a simultaneous data generation approach and a sequential one, and (ii) the impact of the complexity of the generator used for the baseline data, within the context of small tabular RCT data. This article is organized as follows. In Section 2, the generative algorithms compared in this article are described. Section 3 details the motivating data set and the simultaneous and sequential frameworks to be compared. Section 4 presents the results of our empirical studies. Finally, Section 5 concludes with recommendations and a discussion of limitations and future directions for related work.

2. Generative Algorithms

Early data generation techniques stem from methods that deal with missing data. While these methods are often thought of as a means to impute missing data, they can also be thought of as data generation techniques (since imputing missing data involves creating values where once there were no values). At first, methods were fairly straightforward and simplistic—mean imputation, or carrying forward values from past instances. Though easy to implement, these methods suffered from making overly simplistic assumptions that likely did not represent reality. Then, the Synthetic Minority Oversampling Technique (SMOTE) was proposed [47]. SMOTE works by selecting a sample from an under‐represented stratum for a given discrete variable. Then, it selects one of the sample's k nearest neighbors and interpolates between the selected neighbor and the original sample to generate a synthetic sample. SMOTE thus became a popular tool for correcting class imbalances within tabular data, and several extensions were developed [48, 49, 50, 51]. However, these methods were still too simplistic for many contexts, especially those that involved generating data with complex distributions [25]. Additionally, imputation methods generally create observations by exploiting the data of other individuals in the same data set, whereas the goal of generating an entire synthetic data set (e.g., through generative ML algorithms) is to do so by exploiting exogenous data by means of fitted models or learning directly from the data. The development of GANs made the task of generating realistic data far more feasible and thus added to their popularity. In recent years, another method that has garnered much attention in the ML field is Variational Autoencoders, or VAEs [45], which generate synthetic observations through a neural network structure that starts with an encoder, which maps input data to a compressed latent space, and ends with a decoder, which maps a sample from the latent space back to the original data space and outputs a synthetic observation. For tabular data generation, variants known as Conditional Tabular GAN (CTGAN) and Tabular VAE (TVAE) were proposed to handle the complexities of the tabular data context [20]. In this work, we did not consider TVAE as it has been shown that it can perform poorly when a minority class of a discrete variable is rarely observed [52], and yet this is a common occurrence in RCT data. We now turn to describing the methods employed in the present comparison.
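
As an illustration of the interpolation step just described, here is a minimal SMOTE sketch for purely numeric features; real applications would typically use the imbalanced-learn package's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority-class rows by interpolating between a
    randomly chosen row and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each row is its own neighbor
    _, idx = nn.kneighbors(X_minority)
    rows = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))   # pick a minority sample
        j = rng.choice(idx[i, 1:])          # pick one of its k nearest neighbors
        lam = rng.random()                  # interpolation weight in [0, 1)
        rows.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.vstack(rows)
```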

2.1. GANs and CTGAN

A GAN typically consists of two fully‐connected neural networks: a generative model, $G$, and a discriminative model, $D$. The generative model, $G(z; \theta_g): \mathcal{Z} \to \mathcal{X}$, parameterized by $\theta_g$, takes random noise $z \in \mathcal{Z}$ as input and outputs a vector of synthetic values $G(z) \in \mathcal{X}$, where $\mathcal{X}$ is the real data space (generally, both $\mathcal{Z}$ and $\mathcal{X}$ are contained in $\mathbb{R}^d$). Often, $z$ is sampled from a standard Normal distribution, though more generally this prior distribution is denoted as $p_Z(z)$. The real data distribution is denoted as $p_X(x)$. The discriminative model, $D(x; \theta_d): \mathcal{X} \to [0, 1]$, parameterized by $\theta_d$, takes the output from $G$ (the vector of synthetic data) as its input, although $D$ could of course also take real data as an input. Then, $D$ (the discriminative model) outputs a scalar value representing the probability that its input came from the real data. Both models are trained simultaneously in an adversarial fashion, where the generative and discriminative models attempt to satisfy a minimax condition with the following value function, $V(G, D)$:

$$\min_{\theta_g} \max_{\theta_d} V(G, D) = \mathbb{E}_{x \sim p_X(x)}\big[\log D(x; \theta_d)\big] + \mathbb{E}_{z \sim p_Z(z)}\big[\log\big(1 - D(G(z; \theta_g); \theta_d)\big)\big]$$

A specific GAN architecture, CTGAN, proposed by Xu et al. in 2019, aims to handle the additional complexities that come with generating tabular data [20]. In particular, the authors implemented mode‐specific normalization to better mimic multi‐modal continuous distributions. They also introduced a conditional generator with the idea of training‐by‐sampling, where observations are generated by conditioning on both a particular discrete variable and one of the observed categories for the given discrete variable, allowing for a more even exploration of all classes of each discrete variable, even for those with unbalanced class distributions. Since CTGANs are the preferred state‐of‐the‐art method for generating tabular data with mixed data types [53, 54], and their implementation is well documented through the Synthetic Data Vault library in Python [34], we pursue this approach in our investigations. For more details, including theoretical results, regarding GANs, please refer to [16, 55].
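
As a usage illustration, the following sketch fits a CTGAN with the Synthetic Data Vault's single-table API (assuming SDV 1.x; the file name is a placeholder):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_df = pd.read_csv("actg175.csv")        # placeholder file name

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)     # infer continuous/discrete column types

synthesizer = CTGANSynthesizer(metadata, epochs=300)  # epochs: the main tuning knob
synthesizer.fit(real_df)
synthetic_df = synthesizer.sample(num_rows=len(real_df))  # n_synth = n_real, as in this work
```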

2.2. ARF

A more recent ML method that has gained traction due to easier model fitting and reduced computational demands compared to GANs (including CTGAN) is ARF, developed by Watson and colleagues in 2023 [29]. Broadly, ARF is a recursive version of an unsupervised random forest (URF), which creates synthetic observations by drawing independent samples from the marginal distributions of the original data. Then, these synthetic data are compared to the real data by training an RF classifier. In ARF, this procedure works by first fitting a standard URF to the original data to produce a synthetic data set. Then, the leaf coverage is computed for each leaf, where leaf coverage is defined as $n_b^l (n_b)^{-1}$, that is, the ratio of the number of training samples in a leaf $l$ from a tree $b$ to the number of training samples in tree $b$. Leaf coverage can be interpreted as the probability of a random real data sample being contained in the data subspace, denoted $\mathcal{X}_b^l$, of leaf $l$ from tree $b$. New synthetic data are then generated by sampling from the marginal distributions of leaves chosen with probability proportional to the computed coverage of each leaf. Then, this new synthetic data set is compared to the real data through the training of another RF classifier. This iterative procedure continues until convergence is reached, which occurs when the out‐of‐bag (OOB) error (i.e., the prediction error calculated by using a proportion of excluded real data samples) falls below an analyst‐selected threshold. In the end, each leaf in the ARF should contain data that are jointly independent, meaning $p(x \mid \theta_b^l) = \prod_{j=1}^{d} p(x_j \mid \theta_b^l)$, where $\theta_b^l$ is the combination of Boolean variables that represents the membership of leaf $l$ in tree $b$ (often referred to as the split criteria). Hence, for trees $b$ with leaves $l$ and real data samples $x$, ARF aims to find a set of splits such that the data distribution can be estimated via

$$q(x) = \frac{1}{B} \sum_{l, b \,:\, x \in \mathcal{X}_b^l} q(\theta_b^l) \prod_{j=1}^{d} q(x_j; \psi_{b,j}^l)$$

where $B$ is the total number of trees, $q(\theta_b^l)$ is the empirical coverage of leaf $l$ from tree $b$, and $q(x_j; \psi_{b,j}^l)$ is the estimated marginal density for variable $x_j$ with parameters $\psi_{b,j}^l$. The $\psi_{b,j}^l$ can be learned using various methods; in the original paper, the authors used classic maximum likelihood estimation for continuous variables and Bayesian inference methods for discrete variables. The estimated joint density $q(x)$ can be thought of as a weighted average of the estimated densities in the leaves for which the split criteria are satisfied.

As mentioned in Section 1, ARFs require far fewer computational resources than GANs. This is due to the assumption that in each leaf, the data are jointly independent. Hence, the multivariate real data density can be estimated through separate estimators of the marginal real data densities per leaf, which is a much simpler task than that of a GAN, which estimates the joint density directly. The authors note, however, that this benefit is not without other costs: ARF training may involve deeper trees and several iterations of training. However, in practice, the computation time for ARF is still far less than that required for GANs [29]. Additionally, since ARF utilizes classification and regression trees, it is well suited to the mixed tabular data setting. Another advantage of ARF over GANs in particular is the limited number of hyperparameters, which renders the method simpler to implement.
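
The leaf-coverage quantity $q(\theta_b^l)$ is easy to compute for any fitted forest; the sketch below does so with scikit-learn rather than the arf package, purely to illustrate the concept:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leaf_coverage(rf: RandomForestClassifier, X_train: np.ndarray):
    """Empirical coverage q(theta_b^l): the share of training rows that land in
    each leaf l of each tree b. ARF samples leaves proportionally to coverage,
    then draws each variable from that leaf's estimated marginal density."""
    leaf_ids = rf.apply(X_train)                 # shape: (n_samples, n_trees)
    per_tree = []
    for b in range(leaf_ids.shape[1]):
        ids, counts = np.unique(leaf_ids[:, b], return_counts=True)
        per_tree.append(dict(zip(ids, counts / len(X_train))))
    return per_tree                              # one {leaf_id: coverage} dict per tree
```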

2.3. Copulas and R‐Vine Copula

While complex machine learning models have attracted much attention, there are other statistical methods that also hold potential in generating realistic synthetic data with complex distributions. One such method is copulas [37]. A copula is a multidimensional cumulative distribution function (CDF) that relates the marginal distributions of a collection of random variables to the overall joint distribution. A key result that allows for the modeling of a complex joint distribution using copulas is Sklar's Theorem [37], which states that for any joint CDF, there exists a copula representation built on the marginal CDFs of each variable:

Theorem 1 (Sklar's Theorem)

If $X = \{X_1, \ldots, X_d\}$ is a vector of random variables with joint CDF $F_{X_1,\ldots,X_d}$ and marginal CDFs $F_{X_1}, \ldots, F_{X_d}$, then there exists a copula $C$ such that

$$F_{X_1,\ldots,X_d}(x_1, \ldots, x_d) = C\big(F_{X_1}(x_1), \ldots, F_{X_d}(x_d)\big)$$

Then, applying the chain rule formula, it can be shown that the joint probability density function of $X$ can be represented as the following decomposition:

$$f_{X_1,\ldots,X_d}(x_1, \ldots, x_d) = c\big(F_{X_1}(x_1), \ldots, F_{X_d}(x_d)\big) \times \prod_{i=1}^{d} f_{X_i}(x_i)$$

where $c(\cdot)$ is the copula density function. Using recursive conditioning and the Bayes formula, the joint density function can also be represented as

$$f_{X_1,\ldots,X_d}(x_1, \ldots, x_d) = f_{X_d}(x_d) \times f_{X_{d-1} \mid X_d}(x_{d-1} \mid x_d) \times f_{X_{d-2} \mid X_{d-1}, X_d}(x_{d-2} \mid x_{d-1}, x_d) \times \cdots \times f_{X_1 \mid X_2, \ldots, X_d}(x_1 \mid x_2, \ldots, x_d)$$

and in general, the conditional marginal distribution can be written as

$$f_{X_k \mid Z}(x_k \mid z) = c\big(F_{X_k \mid Z_{-j}}(x_k \mid z_{-j}),\; F_{Z_j \mid Z_{-j}}(z_j \mid z_{-j})\big) \times f_{X_k \mid Z_{-j}}(x_k \mid z_{-j}) \qquad (1)$$

where $Z$ is a $d$‐dimensional vector of random variables and $z$ is a vector of realizations (observations) of $Z$, and the subscript $-j$ indicates the entire vector excluding variable $j$. Here, $Z$ differs from $X$ in that it is a subset of $X$; one can consider the conditional marginal distribution of $X_k$ given an arbitrary collection or subset of variables from $X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_d$. Of note, this result shows that each conditional marginal distribution can be written as a product of bi‐dimensional copulas and conditional marginal density functions. Hence, this representation is often referred to as the pair‐copula construction. Because of this formulation, copulas are very useful in modeling multivariate dependencies.

By construction, copulas are designed to effectively model univariate distributions, since they utilize the inverse of the empirical distribution of each variable when modeling the marginal CDF. This becomes clearer when describing the data generation process using a copula. For each $X_i$, $i = 1, \ldots, d$, the first step is to sample $n$ independent observations from a random Uniform, $U$. Then, apply the fitted copula model (fitted to the original data) to the independent $U$ to generate dependent observations, $\tilde{U}$. Finally, apply the integral transform to map $\tilde{U}$ to the $X_i$ data space via the (inverse of the) empirical distribution of the observed $X_i$.
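
These three generation steps can be illustrated with a simple Gaussian copula, a restricted special case chosen here only for brevity; the actual analyses used R-vine copulas via rvinecopulib, which allow different pair-copula families:

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(X: np.ndarray, n_synth: int, seed: int = 0) -> np.ndarray:
    """Fit a Gaussian copula to continuous data X (n x d) and sample n_synth rows."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Fitting: pseudo-observations on the uniform scale, then normal-score correlation
    U = (stats.rankdata(X, axis=0) - 0.5) / n
    R = np.corrcoef(stats.norm.ppf(U), rowvar=False)
    # Steps 1-2: draw correlated normal scores (equivalent to pushing independent
    # uniforms through the fitted copula) and map them to dependent uniforms
    U_dep = stats.norm.cdf(rng.multivariate_normal(np.zeros(d), R, size=n_synth))
    # Step 3: map the dependent uniforms back through each empirical inverse CDF
    return np.column_stack([np.quantile(X[:, j], U_dep[:, j]) for j in range(d)])
```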

Though copula models date back to 1959, vine copulas (of which R‐vine copulas are a subset) were developed in the late 1990s and 2000s [56, 57, 58] as a specific type of copula that invokes a nested tree (or “vine”) structure, where pairs of copulas are utilized to build the entire copula structure, and hence are useful in modeling the dependence between variables in a high‐dimensional, multivariable context [59]. This is because the decomposition shown in Equation (1) is far from unique in our data generation context. The decomposition is unique if all random variables are strictly continuous [37], which is very rarely the case in RCT data. R‐vine copulas are useful in a high‐dimensional multivariable setting because they allow for a structure to be applied to the non‐unique decomposition, and there exists an algorithm to traverse said vine structure to effectively model the joint CDF. Due to this utility in a multivariable context, we chose to utilize specifically R‐vine copulas as another method to consider in our data generation framework.

3. Methods

We first describe the data set used in this paper to provide context for the data generation frameworks and illustrate more easily how the different frameworks operate in an RCT data setting.

3.1. Experimental Tools—Data Description

The data used are from the AIDS Clinical Trials Group Study 175 (ACTG 175) [60], which compared four human immunodeficiency virus (HIV) therapeutic arms considering mono‐ or combined treatment among adolescents and adults living with HIV and whose CD4 cell counts were between 200 and 500 per cubic millimeter. A total of 2139 patients were recruited from 43 AIDS Clinical Trials Units and nine National Hemophilia Foundation locations in the United States and Puerto Rico. To be eligible to participate in the study, patients had to satisfy the following criteria: Be at least 12 years old with a laboratory‐confirmed HIV infection, have a CD4 count between 200 and 500 per cubic millimeter in the 30 days leading up to treatment randomization, have no history of acquired immunodeficiency syndrome (AIDS), and have a Karnofsky score of 70 or higher (which is a health score that measures a patient's functional status, with a maximum score of 100). Study participants were randomly assigned to one of four treatments at baseline (zidovudine only—this was the baseline comparator, zidovudine and didanosine, zidovudine and zalcitabine, or didanosine only) and were followed for a median time of 143 weeks. Several baseline covariates were measured, including sex, age, weight, race, hemophilia status, whether a patient identified as homosexual, injection drug use status, Karnofsky score, history of prior antiretroviral therapy (ART), whether a patient had symptomatic HIV infection, and CD4 count. Patients had follow‐up visits at weeks 2, 4, and 8, and then every 12 weeks, and CD4 count was measured every 12 weeks starting from week 8. CD4 count at week 96 was not recorded for approximately 37% of participants. The primary endpoint was the composite event defined by having a CD4 count decline of at least 50%, an event indicating progression of HIV to AIDS, or death. If no event was observed, then the participant's outcome was deemed to be censored. Hence, the trial outcome was whether the composite event was observed. Note that the final outcome was treated as a binary variable in this analysis, both for simplicity and, more importantly, because binary outcomes are highly prevalent in clinical studies. We acknowledge that in doing so, important information related to censoring was discarded. Limitations regarding this simplification can be found in Section 5. For a full list of variables and their support, see the Online Supporting Information.

3.2. Experimental Setup

To provide a framework to generate realistic data in the RCT context, we compared the frameworks that we term CTGAN Simultaneous versus CTGAN Sequential. We also investigated the frameworks that we term ARF Sequential and R‐Vine Copula Sequential. Within CTGAN Sequential, we experimented with pre‐processing the original data in settings where the original data had bounded or asymmetric distributions. Additionally, we compared three different ways to induce randomness when generating data using sequential regression models. Finally, we considered the degree of complexity or sophistication used in each step of the sequential frameworks. Altogether, we compared eight different data generation frameworks. We explain the setup of each in more detail below.

3.2.1. CTGAN Simultaneous

The CTGAN Simultaneous framework involved training a CTGAN model on the entire data set, all at once (i.e., simultaneously). A model was trained on the real data set, including all baseline covariates, treatment allocation, post‐randomization variables, and the final outcome. Then, this trained model was used to generate observations for all variables, again simultaneously.

3.2.2. CTGAN Sequential

The CTGAN Sequential framework again involved training a CTGAN model on the real data set, but the data generation process proceeded in a sequential fashion such that the natural steps in an RCT were followed (i.e., baseline data collection, then treatment randomization, followed by post‐randomization data collection occurring at follow‐up visits, and finally the outcome measurement) and temporal relationships between variables were maintained. In other words, as the first time point in the trial was at baseline, the first step in the sequential process was to train a model on the data subset that included only the baseline variables. The next time point in the trial was treatment randomization, though generating synthetic treatment did not involve fitting a model to real data and hence did not require any real data subset. Following this was each follow‐up visit in the trial. The first was at week 20, so the subset of data used to train a model at this point included all variables collected up until week 20 (baseline variables, treatment assignment, and CD4 count at week 20). The next follow‐up visit was at week 96, and similarly, the subset of data included all variables collected up until week 96 (baseline variables, treatment assignment, CD4 count at week 20, and CD4 count at week 96). The final visit occurred after week 96 to measure the final outcome, and here all variables were included in model fitting.

Hence, the first step involved only the baseline variables. First, a CTGAN was trained on the real baseline data, and then the fitted CTGAN was used to generate observations for the baseline variables. Then, what we term “execution models” were fit for each subsequent variable. In a trial setting, the next occurrence was treatment allocation, and since the context involved RCT data, treatment allocation was randomized, independently of any baseline information. The execution model for generating treatment data was simply a probability distribution from which to draw observations. For example, to generate data for the treatment arms in the ACTG 175 trial, samples were drawn from a multinomial distribution with equal probability of observing each of the four treatment arms. Then, following treatment allocation, the post‐randomization variables were generated. For example, to generate the CD4 count at week 20, an execution model was fit to the real data by regressing the CD4 count at week 20 on baseline variables and treatment. Similarly, to generate the CD4 count at week 96, another execution model was fit to the real data by regressing the CD4 count at week 96 on baseline variables, treatment, and CD4 count at week 20. Generating data for one variable at a time was an easier task than generating data for several variables at once, since modeling and learning a univariate distribution is simpler than doing so for a multivariate distribution representing several variables. Hence, statistical regression models (either linear or logistic) were chosen for generating the post‐randomization variables; these were far simpler and less resource‐intensive than a CTGAN. Finally, the last execution model was for generating the outcome variable, and this model was, again, a regression model. For the binary outcome in the ACTG 175 trial, a logistic regression model was fit to generate the outcome, where all other variables from the real data were included as predictors in the model.
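
A condensed sketch of this sequential pipeline follows, using scikit-learn regressions as execution models. The column names (arms, cd420, cd496, event), the fitted `baseline_ctgan` object, and the Bernoulli draw for the binary outcome are illustrative assumptions; categorical predictors are assumed to be numerically encoded:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def sequential_generate(real: pd.DataFrame, baseline_ctgan, seed: int = 0) -> pd.DataFrame:
    """Sequential pipeline: baseline draw -> randomization -> execution models."""
    rng = np.random.default_rng(seed)
    n = len(real)
    baseline = [c for c in real.columns if c not in ("arms", "cd420", "cd496", "event")]

    synth = baseline_ctgan.sample(num_rows=n)   # step 1: synthetic baseline cohort
    synth["arms"] = rng.choice(4, size=n)       # step 2: 1:1:1:1 randomization

    # steps 3-4: linear-regression execution models, respecting time ordering
    for col, preds in [("cd420", baseline + ["arms"]),
                       ("cd496", baseline + ["arms", "cd420"])]:
        obs = real.dropna(subset=[col])         # omit rows missing the response
        fit = LinearRegression().fit(obs[preds], obs[col])
        resid = (obs[col] - fit.predict(obs[preds])).to_numpy()
        # prediction + a resampled residual (randomness injection is discussed next)
        synth[col] = fit.predict(synth[preds]) + rng.choice(resid, size=n)

    # step 5: logistic execution model; draw the binary outcome from fitted probabilities
    preds = baseline + ["arms", "cd420", "cd496"]
    obs = real.dropna(subset=["event"])
    out = LogisticRegression(max_iter=1000).fit(obs[preds], obs["event"])
    synth["event"] = rng.binomial(1, out.predict_proba(synth[preds])[:, 1])
    return synth
```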

There is an important distinction between generating observations and predicting observations, which is particularly pertinent when using regression models to generate data. After a regression execution model was fit to the real data, where the response variable in the model was the variable for which we wished to generate data and the predictors in the model were all variables that came before it temporally in the trial setting, the next step was to use the fitted model to generate predictions for the response variable using the already‐generated synthetic data (the baseline CTGAN‐generated data plus data generated by previous execution models). However, predicting from a simple execution model, such as a linear regression, lacks the randomness that would be expected in real (or realistic) data. Hence, we adopted an approach familiar from the multiple imputation literature to ensure that any individuals with identical baseline data could still have different values generated for subsequent variables. In this work, we induced randomness in three different ways to compare which was the most effective: (a) generated observation = prediction + a draw from $N(0, \hat{\sigma}_{\text{resid}})$, where $\hat{\sigma}_{\text{resid}}$ was the observed standard deviation of the residuals from the fitted execution model; (b) generated observation = prediction + a randomly sampled residual; and (c) generated observation was a sample from $\{(\text{prediction} + \text{residual}) \geq 0\}$. The third method was implemented by simply rejecting any value of (prediction + residual) that did not satisfy the bounds of the distribution. In other words, first sample one of the model residuals, then check whether the resulting generated value of (prediction + residual) was valid. If valid, set the generated value to this sum, and if not, repeat the process. The first two methods of inducing randomness could lead to nonsensical values given the context of the generated variable (e.g., negative values for CD4 count), which motivated the inclusion of the third method, where all generated values would be admissible. In our context, inadmissible values were those less than zero, which is why the inequality threshold was set to zero. However, for other contexts, one would simply adjust this threshold value to adapt this strategy to a different context. Additionally, the second and third methods of inducing randomness were included to determine whether omitting the normality assumption of the first method led to better data generation results.
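
The three noise-injection schemes might be implemented as in the following sketch, where `pred` holds the model predictions for the synthetic rows and `resid` the residuals from the fitted execution model (both are assumed inputs):

```python
import numpy as np

def inject_randomness(pred, resid, method, rng, lower=0.0):
    """(a) Normal noise with the residual SD; (b) resampled residuals;
    (c) resampled residuals with rejection of inadmissible values."""
    pred, resid = np.asarray(pred), np.asarray(resid)
    if method == "a":
        return pred + rng.normal(0.0, resid.std(), size=len(pred))
    if method == "b":
        return pred + rng.choice(resid, size=len(pred))
    if method == "c":
        out = np.empty(len(pred))
        for i, p in enumerate(pred):
            val = p + rng.choice(resid)
            while val < lower:               # reject, e.g., negative CD4 counts
                val = p + rng.choice(resid)  # redraw until admissible
            out[i] = val
        return out
    raise ValueError(f"unknown method: {method}")
```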

The CTGAN Sequential framework was also implemented with the first method of inducing randomness using pre‐processed original data. Certain variables in the real data were transformed before fitting any generation models in the hopes of improving the quality of the generated data. For example, the natural log transformation was applied to CD4 count at baseline, week 20, and week 96, and then the generated data were back‐transformed so that the final synthetic data set was on the same scale as the real data.

Additionally, to assess the impact of the degree of complexity of the models used in the sequential framework, CTGANs were used as the post‐treatment execution models in place of regression models. The purpose of this was to determine whether fitting complex models at each stage of a sequential fit was worth the additional computational power and complexity. When fitting CTGAN models as execution models, the sequential steps were still followed. For example, when generating CD4 count at week 20 using a CTGAN, only variables collected up to that point in time were included in the new CTGAN (i.e., baseline covariates, treatment allocation, and CD4 count at week 20), thus replicating some of the work of the baseline CTGAN fit. Then, only the generated data for CD4 count at week 20 were saved and merged with the previously generated synthetic data (baseline data, treatment). This process was repeated for the CD4 count at week 96. Then, to generate data for the outcome, a CTGAN was fit to the entire real data set, and only the generated data for the outcome from this CTGAN model were saved and merged with the previously generated synthetic data. Since the training of a CTGAN involves random noise, there was no need to introduce further randomness when fitting a CTGAN as an execution model (unlike in the regression setting).

3.2.3. ARF Sequential

The ARF Sequential framework was very similar to the previous CTGAN Sequential frameworks, differing only in that ARF Sequential utilized an ARF rather than a CTGAN to generate the synthetic baseline cohort. The execution models were exactly the same regression models as described for the CTGAN Sequential frameworks. For ARF Sequential, one version of inducing randomness was implemented: (c) generating observations by sampling from $\{(\text{prediction} + \text{residual}) \geq 0\}$. Additionally, the original data were not pre‐processed before generating synthetic data using this method. The decision not to pre‐process data, to pursue only regression for execution models, and to introduce randomness that ensured the synthetic data respected the range of the data distribution was based on preliminary explorations of the methods with the CTGAN Sequential approach.

3.2.4. R‐Vine Copula Sequential

The R‐Vine Copula Sequential framework was also very similar to the CTGAN Sequential framework. As was described for ARF Sequential, the main difference, as alluded to in the name, was that the baseline generator was an R‐vine copula model fitted to the real baseline data, instead of a CTGAN. The rest of the framework mimics that of ARF Sequential, again due to the preliminary explorations of the sequential CTGAN approach.

3.2.5. Simulations

Altogether, we compared eight different data generation frameworks, which we label and characterize as follows:

1. CTGAN Simultaneous: no data pre-processing; CTGAN trained on all data at once.
2. CTGAN Sequential 1: data pre-processing; CTGAN baseline generator; regression execution models; sample from $N(0, \hat{\sigma}_{\text{resid}})$ and add to prediction.
3. CTGAN Sequential 2: no data pre-processing; CTGAN baseline generator; regression execution models; sample from $N(0, \hat{\sigma}_{\text{resid}})$ and add to prediction.
4. CTGAN Sequential 3: no data pre-processing; CTGAN baseline generator; regression execution models; sample directly from residuals and add to prediction.
5. CTGAN Sequential 4: no data pre-processing; CTGAN baseline generator; regression execution models; sample from $\{(\text{prediction} + \text{residual}) \geq 0\}$.
6. CTGAN Sequential 5: no data pre-processing; CTGAN baseline generator; CTGAN execution models.
7. ARF Sequential: no data pre-processing; ARF baseline generator; regression execution models; sample from $\{(\text{prediction} + \text{residual}) \geq 0\}$.
8. R-Vine Copula Sequential: no data pre-processing; R-vine copula baseline generator; regression execution models; sample from $\{(\text{prediction} + \text{residual}) \geq 0\}$.

Figure 2 is a flowchart showing key distinctions defined by modeling choices, which in turn define each of the eight data generation frameworks.

FIGURE 2.

Flowchart of each decision point (data pre‐processing, simultaneous versus sequential, CTGAN versus ARF versus R‐vine copula, regression versus CTGAN execution models, method to induce randomness) resulting in the final eight synthetic data generation frameworks.

Since each framework involved a step that induced randomness, the data generation process was repeated 500 times, and performance was compared across all 500 simulation runs. For CTGAN Simultaneous, random noise was incorporated in the fitting of the CTGAN as described in Subsection 2.1. This was the case for CTGAN Sequential 1–5 as well, though CTGAN Sequential 1–4 also incorporated randomness in the data generation process through the three different methods described earlier for the regression execution models. For ARF Sequential, randomness was involved in the first step of the algorithm, where a subset of the real data was randomly sampled. This means the generation of both the baseline data and the post‐baseline variables involved randomness. For R‐Vine Copula Sequential, the copula model fit to the original data was also not deterministic due to the non‐unique decomposition of the joint distribution, and hence randomness was induced both for the generation of the baseline variables using an R‐vine copula, as well as through the incorporation of randomness when generating data using the regression execution models. Therefore, each simulation run involved re‐fitting models to generate a new synthetic data set, and then computing metrics for the new synthetic data set (refer to Section 3.3 for a specification of the performance metrics that were employed). Data generation via CTGANs was conducted in Python utilizing the Synthetic Data Vault and SDMetrics libraries [34, 61], and data generation via ARFs and R‐vine copulas was conducted in R utilizing the arf and rvinecopulib packages, respectively [29, 62].
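
Schematically, each replicate re-fits and re-generates from scratch, so all sources of randomness propagate into the metric distributions. A hypothetical runner (the `framework` and `evaluate` callables are placeholders, not code from this work) could look like:

```python
import numpy as np
import pandas as pd

def run_study(real: pd.DataFrame, framework, evaluate, n_reps=500, seed=2025):
    """Each replicate re-fits the generators and produces a fresh synthetic data
    set, so fitting randomness (CTGAN noise, ARF subsampling, vine structure
    selection) is reflected in the distribution of the computed metrics."""
    rows = []
    for rep in range(n_reps):
        rng = np.random.default_rng(seed + rep)   # independent stream per replicate
        synth = framework(real, rng=rng)          # one new synthetic data set
        rows.append(evaluate(real, synth))        # dict of Section 3.3 metrics
    return pd.DataFrame(rows)
```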

3.2.6. Definition of Models

Now, we describe in greater detail the models that were fit for each of the eight data generation frameworks.

In CTGAN Simultaneous, one model was fit in total (a CTGAN), which was trained on the entire data set without any pre‐processing of bounded variables.

In CTGAN Sequential 1, five models were fit in total, and the natural log transformation was applied to CD4 count at baseline, week 20, and week 96 in the real data before models were fit. First, a CTGAN was fit to the baseline data only. Then, a multinomial distribution with $n = 2139$ and probability 0.25 for each of the four arms was assumed for the treatment execution model. For CD4 count at week 20, a linear regression model was fit with (the natural log of) CD4 count at week 20 as the response variable and the following variables as the predictors: Age, weight, sex, race, hemophilia status, homosexuality identity, intravenous drug use, Karnofsky score, prior non‐zidovudine ART usage, zidovudine usage 30 days prior, previous time on ART, ART historical usage, symptomatic HIV status, CD4 count at baseline, and randomized treatment assignment. For CD4 count at week 96, a linear regression model was fit, omitting observations with missing data. CD4 count at week 96 (transformed to be on the natural log scale) was the response variable, and the predictor variables were (the natural log of) CD4 count at week 20, in addition to the same predictor variables as for the CD4 week 20 execution regression model. For the outcome, a logistic regression model was fit, omitting observations with missing data. The response variable was the binary outcome, and the predictor variables were all other variables in the data set (baseline variables, randomized treatment, CD4 count at week 20, and CD4 count at week 96). Finally, the synthetically generated observations for CD4 count at baseline, week 20, and week 96 were exponentiated so that they were on the same scale as the original data.

In CTGAN Sequential 2, similar to CTGAN Sequential 1, a total of five models were fit. The only difference was that there was no data pre‐processing.

In CTGAN Sequential 3 and CTGAN Sequential 4, the same models were fit as in CTGAN Sequential 2 (five models total). Recall that the difference was how randomness was induced when generating values. CTGAN Sequential 2: sample from $N(0, \hat{\sigma}_{\text{resid}})$ and add to the prediction; CTGAN Sequential 3: sample directly from the residuals and add to the prediction; CTGAN Sequential 4: sample from $\{(\text{prediction} + \text{residual}) \geq 0\}$. That is, in CTGAN Sequential 4, if the randomly selected residual added to the prediction led to an inadmissible value (in this case, a negative CD4 cell count), a new residual was drawn and checked for admissibility.

In CTGAN Sequential 5, again, five models were fit in total; however, in this framework, each model except the treatment execution model was a CTGAN. The first CTGAN was fit to only the baseline variables to produce baseline synthetic data. The same multinomial distribution as before was assumed for sampling observations to generate synthetic treatment values. Then, another CTGAN was fit, this time to the baseline variables, treatment, and CD4 count at week 20. Only the generated data for CD4 count at week 20 were saved (the rest of the generated data from this CTGAN were discarded). A third CTGAN was fit to the baseline variables, treatment, CD4 count at week 20, and CD4 count at week 96. Again, only the generated data for CD4 count at week 96 were saved. A final CTGAN was fit to all data (baseline variables, treatment, CD4 count at week 20, CD4 count at week 96, and outcome), and only the generated data for the outcome were saved.

In ARF Sequential, five models were fit in total. First, an ARF model was fit using the baseline data, and then the same execution models as described for CTGAN Sequential 2–4 were fit. As in CTGAN Sequential 4, randomness was incorporated by adding a randomly selected residual that led to an admissible value of the covariate.

In R‐Vine Copula Sequential, five models were fit. First, an R‐vine copula model was fit using the baseline data, and the rest is as described for ARF Sequential.

3.3. Experimental Evaluation—Metrics

To compare the eight different generation frameworks, a wide range of metrics was used to ensure conclusions were robust to the type of metric used. As was recommended in the literature [4], the quality of the synthetic data (i.e., the closeness of the synthetic data distributions to the original data distributions) was measured by considering univariate and bivariate comparisons. To do so, the simulation was repeated 500 times, where each simulation involved generating a new synthetic data set. For univariate comparisons, the complement of the Kolmogorov–Smirnov (KS) statistic was measured for continuous variables, and the complement of the Total Variation Distance (TVD) was measured for discrete variables. The KS statistic is defined as the maximum distance between two empirical CDFs, and hence the complement of the KS statistic represents the “closeness” of the two distributions. The complement of the KS statistic is defined as $1 - \sup_{x} |F_{n_{\text{real}}}(x) - F_{n_{\text{synth}}}(x)|$, where $F_{n_{\text{real}}}(x)$ and $F_{n_{\text{synth}}}(x)$ are the empirical CDFs for some continuous variable $X$ with realization $x$ in the real data and synthetic data, respectively. The complement of the TVD is defined as $1 - \frac{1}{2} \sum_{a \in A} |\pi_a^{\text{real}} - \pi_a^{\text{synth}}|$, where $\pi_a^{\text{real}}$ and $\pi_a^{\text{synth}}$ represent the proportion of stratum $a$ of some discrete variable $A$ in the real and synthetic data, respectively. The TVD represents the difference in proportions for the strata of a given discrete variable between the real and synthetic data. In this context, it can be thought of as the categorical counterpart to the KS statistic. Thus, when utilizing the complement of the KS statistic and the complement of the TVD, higher scores indicated higher quality of the synthetic data generation in terms of fidelity to the real data.
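
Both univariate metrics are straightforward to compute directly; a sketch follows (the SDMetrics library provides equivalent scores, to the best of our reading named KSComplement and TVComplement):

```python
import numpy as np
from scipy import stats

def ks_complement(x_real, x_synth) -> float:
    """1 - sup_x |F_real(x) - F_synth(x)| for a continuous variable (1 = identical)."""
    return 1.0 - stats.ks_2samp(x_real, x_synth).statistic

def tv_complement(a_real, a_synth) -> float:
    """1 - (1/2) * sum_a |pi_a^real - pi_a^synth| for a discrete variable."""
    cats = np.union1d(a_real, a_synth)                    # all strata seen in either set
    p = np.array([np.mean(np.asarray(a_real) == c) for c in cats])
    q = np.array([np.mean(np.asarray(a_synth) == c) for c in cats])
    return 1.0 - 0.5 * np.abs(p - q).sum()
```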

For bivariate comparisons, a correlation similarity score using the Spearman correlation was computed for pairs of continuous variables, and a contingency similarity score was calculated for pairs of discrete variables and for discrete–continuous pairs. For a pair of variables in which one was discrete and the other continuous, the continuous variable was discretized by quartiles. The reason for the decision to discretize by quartiles was two‐fold: First, it is beneficial to apply the same discretization procedure to all continuous variables when calculating this metric for the sake of consistency and reproducibility, and second, the interpretation of this bivariate metric did not involve the values of the proportions themselves in the contingency table but rather the difference between these values, comparing the contingency table of the real data to that of the synthetic data. This means that the way in which discretization was carried out is of little importance since no clinical interpretation is being performed here; rather, it is only necessary that the same procedure is applied in both the real and synthetic data sets. The correlation similarity score is defined as the normalized difference between the real data correlation of a pair of variables and the synthetic data correlation of the same pair of variables: $1 - \frac{1}{2} |\rho_{XY}^{\text{real}} - \rho_{XY}^{\text{synth}}|$, where $\rho_{XY}^{\text{real}}$ and $\rho_{XY}^{\text{synth}}$ are the Spearman correlations between two continuous variables $X$ and $Y$ in the real and synthetic data sets, respectively. Spearman correlation was used rather than Pearson correlation because it is rank‐based and thus relies on fewer parametric assumptions. In fact, both were calculated; however, in our simulations, the results using Spearman versus Pearson correlations were very similar, so choosing one over the other did not make a difference. For brevity, we report only those for the Spearman correlation. The contingency similarity score is defined as the normalized difference between the real data proportions in a contingency table of two variables and the synthetic data proportions in a contingency table of the same variables: $1 - \frac{1}{2} \sum_{a \in A} \sum_{b \in B} |\pi_{ab}^{\text{real}} - \pi_{ab}^{\text{synth}}|$, where $\pi_{ab}^{\text{real}}$ and $\pi_{ab}^{\text{synth}}$ represent the proportion observed in both stratum $a$ of some discrete variable $A$ and stratum $b$ of some discrete variable $B$ in the real and synthetic data, respectively. Again, higher scores indicated higher fidelity of the synthetic to the real data. All univariate and bivariate metrics were computed for all variables and all pairs of variables for each simulation run, and then plotted. The univariate and bivariate distributions were also compared graphically by selecting a single simulation run at random and plotting the synthetic and real data distributions of each variable. Plots for a selection of variables with varying features are presented.
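
A sketch of the two bivariate scores (for a continuous/discrete pair, the continuous variable would first be binned by quartiles, e.g., with pd.qcut(x, 4)):

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlation_similarity(x_real, y_real, x_synth, y_synth) -> float:
    """1 - |rho_real - rho_synth| / 2, with rho the Spearman correlation."""
    rho_r = stats.spearmanr(x_real, y_real).correlation
    rho_s = stats.spearmanr(x_synth, y_synth).correlation
    return 1.0 - 0.5 * abs(rho_r - rho_s)

def contingency_similarity(a_real, b_real, a_synth, b_synth) -> float:
    """1 - (1/2) * sum_{a,b} |pi_ab^real - pi_ab^synth| over the joint table."""
    tab_r = pd.crosstab(pd.Series(a_real), pd.Series(b_real), normalize=True)
    tab_s = pd.crosstab(pd.Series(a_synth), pd.Series(b_synth), normalize=True)
    tab_r, tab_s = tab_r.align(tab_s, fill_value=0.0)   # same strata in both tables
    return 1.0 - 0.5 * np.abs(tab_r.to_numpy() - tab_s.to_numpy()).sum()
```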

As is often done in the machine learning literature [20, 63], machine learning efficacy metrics were also compared: precision, recall, and F‐1 score. Precision, also known as the positive predictive value in statistics, is the proportion of predicted positives that are true positives: True Positives / (True Positives + False Positives), where a “positive” is defined as the outcome event having occurred for a given patient. Hence, a “true positive” means the classifier correctly identified that the outcome event occurred for a particular participant, while a “false positive” means the classifier identified that the outcome event occurred for a participant when in truth, the event did not occur. Recall, also known as sensitivity, is the proportion of actual positives correctly detected by the model: True Positives / (True Positives + False Negatives). The F‐1 score is the harmonic mean of precision and recall: (2 · Precision · Recall) / (Precision + Recall). Accuracy was not considered because in imbalanced data, it can be a misleading metric [64]. First, the real data were split into training and test sets, then a machine learning classifier was trained on the training set and used to predict values for the real data test set. Similarly, the synthetic data were split into training and test sets, and another classifier was trained on the synthetic data training set and then used to predict values for the real data test set. Precision, recall, and the F‐1 score were measured for each classifier. Ideally, the scores from the model trained on the real data and from the model trained on the synthetic data should be very similar, thus indicating that the synthetic data were very similar to the real data. Since the choice of classifier could impact the results, two different classifiers were chosen: Extreme Gradient Boosting (XGBoost), which had the ability to deal with missing values, and k‐nearest neighbors (KNN), which required a data imputation step [65]. When fitting the KNN classifier, k was set to five for all simulation runs. Precision, recall, and the F‐1 score were measured for each simulation run, and then the differences between the real and synthetic data metrics for precision, recall, and F‐1 were plotted across all 500 simulations, for both the XGBoost and KNN classifiers. The plot of absolute differences is shown in Section 4, and the plot of relative differences can be found in the Online Supporting Information. Values close to zero indicated good performance, meaning the metric value for the classifier trained on real data was similar to the metric value for the classifier trained on synthetic data.
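
The following sketch computes the real-versus-synthetic metric gaps with a KNN classifier (k = 5, as in the text); swapping in xgboost.XGBClassifier would give the XGBoost variant. Complete, numerically encoded features are assumed:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

def ml_efficacy_gaps(real_X, real_y, synth_X, synth_y, seed=0):
    """Train one classifier on real data and one on synthetic data, score BOTH
    on the same held-out real test set, and return real-minus-synthetic gaps."""
    Xtr, Xte, ytr, yte = train_test_split(real_X, real_y, test_size=0.3,
                                          random_state=seed, stratify=real_y)
    clf_real = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)
    clf_synth = KNeighborsClassifier(n_neighbors=5).fit(synth_X, synth_y)
    gaps = {}
    for name, metric in [("precision", precision_score),
                         ("recall", recall_score), ("f1", f1_score)]:
        gaps[name] = (metric(yte, clf_real.predict(Xte))
                      - metric(yte, clf_synth.predict(Xte)))
    return gaps   # values near zero indicate high fidelity
```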

Because our context involves RCT data, it is also interesting to know whether the synthetic data recover the same trial inference results as the real data. Note that we do not advocate for synthetic data to be used to draw clinical or substantive conclusions; rather, comparing inference results between synthetic and real data can capture another element of the multivariate distribution of the synthetic data. As is done in other data generation work focused on tabular RCT data, a chosen estimand and its standard error are estimated (often via regression) in the real and synthetic data and then compared [66]. Ideally, the inference results using the synthetic data should be very close to those of the real data. Here, the effect of treatment on the trial outcome was estimated via an odds ratio (OR). For simplicity, treatment was dichotomized with zidovudine only as the baseline comparator and the other three treatment arms grouped together. This re‐categorization was performed after the synthetic data were generated, solely to compute these trial inference metrics. A logistic regression model was fit such that the binary outcome was regressed on binary treatment. The same procedure (dichotomize treatment, regress the outcome on binary treatment to estimate the OR and standard error) was also performed in the real data. ORs and standard errors were estimated across all 500 simulation runs. The plots of estimated ORs and CIs across all simulation runs for each framework are shown in Section 4. Higher fidelity to the original data was indicated by point estimates being contained in, and CIs overlapping with, the real data CI. Finally, we also measured the total computing time required to generate and evaluate each synthetic data set for each of the eight proposed frameworks.
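
A sketch of this inference comparison with statsmodels, assuming illustrative column names (arms coded 0–3 with 0 = zidovudine only, and event as the binary outcome):

```python
import numpy as np
import statsmodels.api as sm

def treatment_or_ci(df):
    """Estimate the OR (and 95% CI) of the binary outcome for dichotomized
    treatment (any combination arm vs. zidovudine only) via logistic regression."""
    tx = (df["arms"] != 0).astype(int)           # 1 = didanosine/combination arms
    X = sm.add_constant(tx.to_numpy())
    fit = sm.Logit(df["event"].to_numpy(), X).fit(disp=0)
    beta = np.asarray(fit.params)[1]             # log odds ratio for treatment
    se = np.asarray(fit.bse)[1]
    return np.exp(beta), np.exp([beta - 1.96 * se, beta + 1.96 * se])

# compare, e.g., treatment_or_ci(real_df) versus treatment_or_ci(synthetic_df)
```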

4. Results

4.1. Graphical Comparisons of the Synthetic and Real Data Distributions

First, a simulation run was selected at random, and the univariate and bivariate distributions were plotted. Though density plots were examined for all variables, only a select few are discussed here for brevity. In all density plots, the pink represents the real data distribution and the blue represents the synthetic data distribution. Each plot shows the distribution generated for each of the eight data generation frameworks, for one simulation run chosen at random. The sequential execution model framework was much more successful at recovering the original data distribution for treatment allocation as compared to the simultaneous framework. This may be due to the nature of the RCT context, where treatment was assigned randomly. When equal probabilities of sampling from each treatment arm were assumed, and samples were drawn at random to generate the synthetic treatment assignment, the generation process essentially mimicked the implementation of the real‐life RCT. In contrast, the simultaneous CTGAN model used all information at once, and thus likely picked up on relationships between treatment assignment and post‐treatment randomization variables that then influenced the generation of treatment assignment, even though the treatment assignment was independent of all other variables in the RCT. This was confirmed in the density plot (see Online Supporting Information), where the generated treatment values were skewed towards arms 3 and 4.

When comparing the generation performance of each framework for the variable age, for instance, R-Vine Copula Sequential was clearly the most effective at capturing the original data distribution. The distribution of synthetic data generated by ARF Sequential appeared close to that of the real data (with some unexpected behavior near the mean); however, this method generated values that were unrealistic for a trial setting. For instance, synthetic ages as small as one year and as large as 91 years were generated, despite the much smaller range of observed ages in the real data (12 to 70 years). Pre-processing the data (CTGAN Sequential 1) shifted the mean of the synthetic distribution closer to that of the real distribution, but the tail of the synthetic distribution exhibited behavior absent from the real distribution. As expected, CTGAN Sequential 2–5 gave very similar results, since they generated the baseline data in the same manner. (The density plots of age for one simulation run across all eight frameworks are included in the Online Supporting Information.)

Similar results are displayed in Figure 3, which shows the density plots for the Karnofsky score comparing synthetic to real data. In preliminary explorations, this variable was treated as a continuous variable with a multi-modal distribution, as is commonly done for health score variables. However, when treating the Karnofsky score as continuous, the ARF model generated several nonsensical values. In the real data, the only observed values were contained in the set {70, 80, 90, 100} due to how this variable is measured; a score between these values is impossible. However, ARF generated several values between 70 and 80, between 80 and 90, and so on, likely due to smoothing, as well as values as large as 120, even though the maximum possible Karnofsky score is 100. The CTGAN model at baseline also exhibited this behavior, but to a lesser extent, generating fewer values between the admissible multiples of 10. In contrast, the R-vine copula model only generated values that were also observed in the real data (i.e., 70, 80, 90, and 100). Because of these inadmissible synthetic values generated by the CTGAN- and ARF-based methods, we decided to treat the Karnofsky score as discrete for the simulation results presented here. Indeed, with a discrete Karnofsky score, all frameworks (including R-Vine Copula Sequential) performed better than with a continuous one, indicating that it may be easier for generators to learn the distribution of a multi-category discrete variable than a multi-modal continuous distribution. We also tried treating the Karnofsky score as continuous and rounding synthetic values to the nearest 10 as a post-processing step, but this still resulted in unsatisfactory performance; for instance, the (inadmissible) rounded values included 110 and 120. Thus, all of the following results treat the Karnofsky score as a categorical variable with four strata (unless stated otherwise). For the (discrete) Karnofsky score, the CTGAN-based frameworks showed similarly less-than-satisfactory performance, whereas ARF Sequential and R-Vine Copula Sequential closely captured the real data distribution. It is interesting to note that when the Karnofsky score was treated as continuous, CTGAN Simultaneous performed very poorly in mimicking the original data distribution.
R‐Vine Copula Sequential performed the best, though the CTGAN Sequential frameworks were also able to detect the multi‐modality of the original data distributions, even though in these frameworks, the (baseline) Karnofsky score was generated by a CTGAN, just like in CTGAN Simultaneous. The difference was that the CTGAN Sequential frameworks trained the CTGAN generative model on only the baseline data, rather than all data simultaneously. This seemed to improve the generation of the baseline variables. In terms of generating the outcome variable (which was binary), most methods performed well, with R‐Vine Copula Sequential, ARF Sequential, and CTGAN Sequential 4 showing the best performance, and CTGAN Sequential 1 (pre‐processing data) showing the worst performance. Refer to the Online Supporting Information for the density plots of the generation of the composite binary event outcome.

FIGURE 3. Density plots of Karnofsky score (generated as a discrete variable) at baseline for a single data generation run as compared to the real data for eight candidate synthetic data generators.

The performance of each method in generating data for variables that change over time was also investigated. In particular, the synthetic and real distributions were plotted for CD4 count at baseline and week 20 (see Online Supporting Information), as well as at week 96 (Figure 4). At baseline, R-Vine Copula Sequential was the most effective at mimicking the original data distribution; the other frameworks captured the mean of the original distribution, but with different spreads. Though ARF Sequential displayed a distribution that looked similar to that of the real data, some of the generated baseline CD4 counts were unrealistic or inadmissible; for instance, negative values were generated. At week 20, R-Vine Copula Sequential and ARF Sequential again showed excellent performance, and CTGAN Sequential 1–4 performed comparably. Both methods that used CTGANs to generate CD4 count at week 20 (CTGAN Simultaneous, CTGAN Sequential 5) were much less successful at capturing the real data distribution. At week 96, where 37% of participants in the real data had missing CD4 values, CTGAN Simultaneous and CTGAN Sequential 5 performed very poorly at capturing the real data distribution. Again, R-Vine Copula Sequential and ARF Sequential outperformed the rest, though CTGAN Sequential 4 showed adequate performance.

FIGURE 4. Density plots of CD4 cell count at week 96 for a single data generation run as compared to the real data for eight candidate synthetic data generators.

Additionally, bivariate density plots were employed to evaluate whether relationships between pairs of variables were captured and to verify whether the generated data introduced correlations between variables that were not present in the original data. Several pairs of variables were examined, though only two are discussed here for brevity: CD4 count at baseline and CD4 count at week 20 (expected to be highly correlated), and age and treatment arm (expected to be uncorrelated). Plots of these bivariate comparisons can be found in the Online Supporting Information. R-Vine Copula Sequential was the most successful at capturing the highly correlated relationship. ARF Sequential also performed well, though the synthetically generated correlation did not extend to the more extreme regions observed in the real data. CTGAN Sequential 4 also showed satisfactory performance, whereas CTGAN Sequential 1 performed the worst. Additionally, CTGAN Sequential 2 and 3 generated inadmissible values for CD4 count at week 20 (values less than zero), whereas CTGAN Sequential 4 generated only admissible values. CTGAN Simultaneous and CTGAN Sequential 5 generated distributions that overlapped with the real-data distribution, but with more noise. All frameworks were successful in not creating relationships between variables where none existed in the real data. R-Vine Copula Sequential and ARF Sequential performed remarkably well at capturing the real bivariate distribution of age against treatment arm: both the mean and the spread of age across treatment arms were very close to those of the real data. CTGAN Sequential 1 was also moderately successful at mimicking the original bivariate distribution, and the remaining frameworks performed similarly to one another.

4.2. Univariate and Bivariate Performance Metrics

The eight data generation frameworks were further evaluated based on univariate and bivariate distribution similarity scores. Recall that for univariate continuous distributions, the complement of the KS statistic was employed, and for univariate discrete distributions, the complement of the TVD was calculated. For bivariate distribution comparisons where both variables were continuous, a normalized difference in correlation was utilized. For bivariate distribution comparisons where both variables were discrete or one variable was discrete and the other was continuous, a normalized difference in proportions from the contingency table was used. In all cases, a higher value indicated better performance.
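For concreteness, the two univariate scores can be computed as in the following sketch; the bivariate scores are analogous normalized differences and are omitted for brevity. Inputs are assumed to be arrays without missing values.

```python
# Sketch of the univariate similarity scores: the KS complement for
# continuous variables and the TVD complement for discrete ones (both in
# [0, 1]; higher indicates closer real/synthetic distributions).
import numpy as np
from scipy.stats import ks_2samp

def ks_complement(real_col, synth_col):
    # The KS statistic is the maximum gap between the two empirical CDFs,
    # so its complement scores identical samples as 1, disjoint ones as 0.
    return 1.0 - ks_2samp(real_col, synth_col).statistic

def tvd_complement(real_col, synth_col):
    cats = np.union1d(real_col, synth_col)
    p = np.array([(np.asarray(real_col) == c).mean() for c in cats])
    q = np.array([(np.asarray(synth_col) == c).mean() for c in cats])
    return 1.0 - 0.5 * np.abs(p - q).sum()  # TVD is half the L1 distance
```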

As shown in the top left of Figure 5, the R‐Vine Copula Sequential framework outperformed the other frameworks in terms of capturing univariate continuous distributions in the original data. Not only did it perform better on average, as evidenced by the mean being much higher than the rest, but the spread of values across all 500 simulations was also very tight. This differed from the other frameworks that, at times, performed extremely poorly. ARF Sequential showed adequate performance with a mean slightly lower than that of R‐Vine Copula Sequential, though there were some instances of poor performance, as shown by the longer tail. The four iterations of CTGAN Sequential utilizing regression models (1–4) all performed similarly (Figure 5, top left panel). When capturing the univariate discrete distributions in the original data, R‐Vine Copula Sequential and ARF Sequential showed similarly good performance (Figure 5, top right panel). Though the CTGAN‐based methods showed adequate performance on average in capturing discrete univariate distributions, there were many instances of very small metric values and hence very poor performance. In both the univariate continuous and discrete settings, CTGAN Simultaneous had the poorest performance.

FIGURE 5. Comparison of the eight data generation methods by similarity of the synthetic and real distributions: (a) univariate similarity scores for continuous variables, (b) univariate similarity scores for discrete variables, (c) bivariate correlation similarity scores, and (d) bivariate contingency similarity scores. Note that these scores aggregate across multiple variables and across all data generation runs.

When capturing the correlation between pairs of (continuous) variables, ARF Sequential performed the best, with R‐Vine Copula Sequential demonstrating similarly high performance but with slightly more instances of lower metric values and hence poorer performance, as shown in the bottom left of Figure 5. Though all methods showed adequate performance on average, there were instances in which CTGAN Simultaneous and CTGAN Sequential 5 (the two methods that harnessed CTGAN(s) to generate post‐baseline variables) performed very poorly. Again, CTGAN Sequential 1–4 demonstrated very similar results. When comparing the ability to maintain the bivariate relationships between pairs of discrete variables and discrete and continuous variables, there was a clearer distinction between R‐Vine Copula Sequential and the CTGAN‐based frameworks, as shown in the bottom right of Figure 5. ARF Sequential had similar performance on average as compared to R‐Vine Copula Sequential, but this framework had the longest tail and the smallest metric values compared to all frameworks, indicating that there were several simulation runs in which the data generated by ARF Sequential had bivariate discrete relationships that were far from those of the real data. Though there were a few instances when R‐Vine Copula Sequential performed sub‐optimally, it was again the most successful framework at capturing the bivariate relationships in the real data, on average.

4.3. ML Efficacy Metrics

When interpreting the ML efficacy results, it was less clear which framework performed best, since the results depended heavily on the type of classifier (here, XGBoost versus KNN) as well as the metric (precision, recall, or F-1 score). One pattern that emerged in Figure 6 was that, regardless of classifier, CTGAN Simultaneous and CTGAN Sequential 5 performed the worst in generating data with a distribution that closely resembled the original. Also, regardless of the classifier used to assess efficacy, CTGAN Sequential 1–4 performed very similarly, and their results showed rather high performance compared to the rest of the frameworks, since their box plots were generally close to zero; this was much more apparent for the KNN classifier. For XGBoost, the R-Vine Copula Sequential and ARF Sequential methods performed well on the difference in recall, but much worse than CTGAN Sequential 1–4 on the differences in precision and in F-1. For the KNN classifier, while the average metric differences for R-Vine Copula Sequential and ARF Sequential were close to zero for all three metrics, there were some instances in which these frameworks performed very poorly, and were even the worst of all eight methods, as evidenced by the large outliers. While it is recommended to present these ML efficacy metrics to assess the quality of synthetic data, these results were unfortunately less interpretable than the distribution plots and similarity scores. Additionally, it was not clear whether the mixed ML efficacy results were due to true differences between the real and synthetic data distributions for each framework, or to what extent they were influenced by the choice of classifier or ML metric.

FIGURE 6. Comparison of the eight data generation methods by similarity of the ML efficacy metrics (precision, recall, and F-1 score) for both XGBoost and KNN classifiers. The vertical axis represents the absolute difference between the real and synthetic metric values; the horizontal dashed line at zero indicates that the metric value for the classifier trained on real data equals that of the classifier trained on synthetic data. A value close to zero means that the two models performed similarly and therefore that the synthetic data successfully retained the original data distribution.

4.4. Trial Inference Metrics

The estimated ORs and 95% CIs resulting from each synthetic data set, for each data generation framework, were also compared to the OR and 95% CI estimated using the real data. CTGAN Sequential 1 had the best performance of all frameworks, as shown in Figure 7, since the majority of simulation runs resulted in point estimates close to that of the real data (represented by the horizontal dashed line) and estimated CIs overlapping with the CI estimated from the real data (represented by the purple shaded region). CTGAN Sequential 2–4 performed similarly to one another as well as to both ARF Sequential and R-Vine Copula Sequential. CTGAN Simultaneous demonstrated very poor performance, with most simulation runs resulting in an estimated OR far from that of the real data and a CI barely overlapping with the real data CI; the estimated CIs also varied greatly across simulation runs, indicating large variation across synthetic data sets and hence in the data generation process. CTGAN Sequential 5 also showed very poor performance for similar reasons, though the lengths of its estimated CIs across simulation runs were generally shorter than those of CTGAN Simultaneous. Of all frameworks, R-Vine Copula Sequential had the shortest median CI length across the 500 simulation runs (0.196). The median CI lengths for CTGAN Sequential 1–3 and ARF Sequential were very similar, ranging from 0.197 to 0.210, while CTGAN Sequential 5 and CTGAN Simultaneous had much larger median CI lengths (0.459 and 0.446, respectively). Synthetic data generated by CTGAN Sequential 1–4, ARF Sequential, and R-Vine Copula Sequential generally led to the same inferential conclusion as the real data (in this case, a statistically significant positive effect of the combined treatment group relative to zidovudine). Taken together, these results indicate that sequential data generation seemed preferable for recapturing trial inference results from the real data, though care in the specific modeling choices is still needed.

FIGURE 7. Comparison of the eight data generation methods by similarity of trial inference results. The horizontal dashed line (at 0.52) represents the OR of the effect of binary treatment on the outcome estimated from the real data, and the purple shaded region indicates the corresponding real-data 95% CI (0.42 to 0.65). The estimated ORs and associated CIs are plotted for each simulation run, per framework. Note that the vertical axis scales for CTGAN Simultaneous and CTGAN Sequential 5 differ from the rest.

4.5. Computing Time

Lastly, we compared the computing time needed for all 500 simulations to generate data and evaluate metrics for each of the eight frameworks. Simulations were performed on a machine with a 14-core CPU and 16 GB RAM. Ordered from fastest to slowest, the times taken to generate and evaluate synthetic data were as follows (hh:mm:ss): ARF Sequential (00:22:27), CTGAN Sequential 2 (06:01:19), CTGAN Sequential 1 (06:01:47), CTGAN Sequential 3 (06:02:41), CTGAN Sequential 4 (06:04:53), CTGAN Simultaneous (07:07:10), R-Vine Copula Sequential (12:12:34), and CTGAN Sequential 5 (25:38:46). Note that computation times were substantially higher when the Karnofsky score was treated as a continuous variable rather than a categorical variable, again suggesting that learning the distribution of a categorical variable with multiple strata is an easier task than learning a multi-modal continuous distribution. All frameworks involving CTGAN were run using Python version 3.9; the R-Vine Copula Sequential and ARF Sequential frameworks were run using R version 4.3.1. ARF Sequential took considerably less time than all other frameworks, as expected, since synthetic data generation using ARF was designed to be much less computationally intensive [29]. R-Vine Copula Sequential took more time than the CTGAN Sequential methods that also utilized regression models, while using CTGAN models everywhere was the most time- and resource-intensive. Though these differences in computing time are important, none of the frameworks were so slow as to be prohibitive.

5. Discussion

In this paper, we developed and compared the performance of eight different data generation frameworks to determine which was the most effective at reproducing the original data distribution of a data set arising from an RCT. We compared a simultaneous framework with a sequential framework, and within the sequential framework, we evaluated the differences between utilizing a CTGAN, an ARF, or an R-vine copula to generate the baseline cohort data, with various options for addressing bounded (non-negative) variables. We found that the sequential data generation framework greatly outperformed simultaneous data generation when the task was to generate a synthetic tabular data set in an RCT context that maintained the original data distribution. In particular, the best framework of the eight presented here was to not pre-process the data, to use an R-vine copula to generate the baseline variables, to fit regression models for post-baseline variables, and to induce randomness by adding a randomly drawn residual to the model prediction, restricted to the set of admissible values, to produce each synthetic observation. This R-Vine Copula Sequential data generation framework showed great promise, as it outperformed the other frameworks considered in capturing both the univariate and bivariate distributions in the original data. While its performance in capturing univariate distributions was to be expected, it was surprising to see it outperform CTGAN models in capturing bivariate distributions as well, since both R-vine copula models and CTGANs were designed to capture multivariate dependencies. Note, however, that only one framework utilizing an R-vine copula was included in the experimentation; we did not run simulations combining an R-vine copula with data pre-processing, with more complex (non-regression) execution models, or with other methods for inducing randomness when generating post-randomization variables. Hence, though the combination of decision points defining the R-Vine Copula Sequential framework showed the best performance of the eight frameworks presented here, our experiments were not "fully factorial", and it remains possible that other choices regarding data pre-processing, regression execution models, and inducing randomness, in combination with a sequential R-vine copula framework, may exhibit different performance.
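For illustration, the "prediction plus resampled residual" device for a single post-baseline step can be sketched as follows. A linear execution model and a non-negativity bound are assumed purely for illustration, and the covariate and outcome names are placeholders; the execution models used in our experiments may differ.

```python
# Sketch of the "prediction plus resampled residual" device for one
# post-baseline variable. Assumptions (illustrative): a linear execution
# model, a non-negativity bound, covariate matrices real_X / synth_X with
# outcome real_y, and at least one admissible draw existing per row.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

def generate_post_baseline(real_X, real_y, synth_X, lower=0.0):
    model = LinearRegression().fit(real_X, real_y)
    resid = np.asarray(real_y - model.predict(real_X))  # empirical residuals
    pred = model.predict(synth_X)
    draw = pred + rng.choice(resid, size=len(pred))     # prediction + residual
    bad = draw < lower                                  # inadmissible values
    while bad.any():                                    # redraw until admissible
        draw[bad] = pred[bad] + rng.choice(resid, size=int(bad.sum()))
        bad = draw < lower
    return draw
```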

While complex machine learning methods have gained much popularity and perform very well in certain contexts, our investigations suggest that they underperform for RCT data generation purposes. At first glance, it may appear that generating all data at once using a powerful model such as a CTGAN should be sufficient. We have shown this is not the case: in an RCT setting, it is necessary to generate data in sequential steps that follow the temporal ordering of a real-life RCT, and using future data to inform the generation of past data negatively impacted performance. Furthermore, the results presented here for ARF Sequential suggest that its computational benefits and the generally high performance shown by the univariate and (continuous) bivariate metrics were not without drawbacks. On average, the ARF-based method performed very well in capturing the real data distribution, and at an exceptionally fast speed. However, for real-world use, the generated data sets would require post-processing to ensure that they were realistic and did not contain inadmissible values. The similarity metrics included in this work showed good performance on average, but ARF Sequential generated values outside the support of the real data, which led to synthetic data that were not realistic. For instance, the synthetic data generated by this framework exhibited impossible health scores (when the Karnofsky score was treated as a multi-modal continuous variable), negative counts, and ages outside the observed range, representing individuals who would never be enrolled in the given RCT context. Note that the R-Vine Copula Sequential framework demonstrated similar, if not better, performance than ARF Sequential without suffering from the issues that would necessitate data post-processing.

While these investigations demonstrated many strengths, notably in providing a framework for effectively generating synthetic tabular data in an RCT context, there were also some limitations. Though we examined eight possible approaches, we did not consider all possible permutations when comparing data generation frameworks, including simultaneous data generation using an ARF or an R‐vine copula. The former omission was due to how we aimed to organize the comparisons presented in this manuscript. The primary comparison was to show the difference between simultaneous and sequential data generation; then, within a sequential generation context, we compared various generative algorithms. The latter omission was due to the limitations of the vine structure, where including time‐varying treatment and covariates would result in closed loops between variables, which is impossible for a vine. Though our use‐case example is a single‐stage RCT where treatment was not time‐varying, we will, in future work, extend these investigations to RCTs where treatments do change over time (such as in Sequential Multiple Assignment Randomized Trials, also known as SMARTs [67]). We also did not include a distance metric for capturing the closeness of multivariate (greater than two) dependencies; however, the ML efficacy metrics presented here still help capture whether these higher‐order relationships were preserved as they involved learning the relationship between all variables and the trial outcome. Additionally, in this work, the same number of observations was generated as was collected in the original data. We did not explore the potential impact of generating a synthetic data set with more or fewer observations (rows) than the original data set. However, we did not expect that one framework or method would perform better for different generated sample sizes, and we did not believe that exploring the effect of changing the sample size of the synthetic data set would be a scientific question of interest.

Additionally, the results presented here were from simulations all based on the same data set; we did not investigate how these eight frameworks might perform for different real data sets. Nevertheless, the data set chosen for our empirical study contained mixed variable types, randomization, longitudinal aspects, and missing information, thereby capturing many realistic and complex features that would be seen in a typical clinical or epidemiological study. However, the time-to-event nature of the original trial outcome was not explored in this analysis. In future work, we will explore the impact of different forms of missingness, including censoring, in tabular data generation. It would be interesting to test whether a sequential framework, particularly one that harnesses an R-vine copula at baseline, still performs best for RCT data with different structures, such as time-varying treatments, a more complex outcome (such as a time-to-event outcome), a smaller sample size, or a collection of variables with more complex distributions. Additionally, we did not specifically investigate whether privacy concerns were addressed by any of the proposed generation frameworks; this would be an important step for future work. With increasing momentum for research to be open and reproducible, the approach put forward in this work could be used to generate mock data sets that mimic real data without releasing any information associated with individual research participants, permitting, for example, the publication of code and analyses mimicking those in statistical papers.

Though the purpose of the work presented here was to examine methods for generating realistic and complex synthetic tabular RCT data, the proposed sequential data generation strategy is adaptable to other modern biomedical research settings, such as high-dimensional data scenarios. One of the primary strengths of the sequential generation approach is the flexibility and control that the researcher has when defining the models to be fit at each sequential step. Take the setting of omics data as an example. At baseline, there may be hundreds if not thousands of variables (genes) to generate, which would likely render a copula-based approach less than optimal; here, a machine learning approach may be better suited to handle the large array of variables. When generating post-randomization variables, a large number of variables may need to be included as predictors in the execution model, which may necessitate a penalized regression approach or the incorporation of principal component analysis into the sequential data generation procedure. Hence, the flexibility to choose different modeling approaches at each sequential step allows the proposed sequential data generation framework to be rather easily adapted to other settings, not just small tabular RCT data sets. How best to adapt these approaches to high-dimensional settings, rather than the more typical low-dimensional tabular RCT context, remains an important avenue for future investigation.

Furthermore, in our sequential frameworks that utilized regression models to produce post-baseline variables, observations with missing values were omitted whenever variables included in the regression execution model had missingness. Consequently, no missing values were generated in the synthetic sample even though the original data contained them. This led to a loss of information and perhaps hindered the ability of the regression models to accurately capture the true data distribution: in the real data, observations might not be missing completely at random, so the generated distribution mimics the observed distribution rather than the distribution that would be observed in the absence of missingness (or were observations missing completely at random). Though not implemented in this paper, we have considered how our sequential framework could take missing values into account and even generate them. Take, for instance, the scenario in the ACTG 175 data where CD4 count at week 96 had non-negligible missingness. A new random variable could be defined, say R, where R = 1 if CD4 count at week 96 was observed and R = 0 if it was missing. Then, after generating CD4 count at week 20 but before generating CD4 count at week 96, another execution model could be fit to generate synthetic R, that is, to model Pr(R = 1 | Baseline, Treatment, CD4 Week 20); this could easily be done via logistic regression. Then, to generate CD4 count at week 96, a weighted regression execution model could be fit with baseline covariates, treatment assignment, and CD4 count at week 20 as predictors, with weights equal to the inverse of the probability of being observed (estimated by the logistic regression execution model for R). In other words, observations that are more likely to be missing are up-weighted. To generate missingness in CD4 count at week 96, we could simply omit observations in the synthetic data for participants with synthetic R = 0. The same process could be repeated to estimate the weights for a weighted regression execution model when generating the outcome. While we have outlined here a way to incorporate a model for missingness and generate data with missing values, further exploration of this approach within the sequential data generation framework for tabular RCT data would be of interest. Model selection when developing a sequential data generation framework is another intriguing avenue of research to consider.
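A minimal sketch of this proposed (but not implemented) missingness extension follows; the column names are illustrative stand-ins for the ACTG 175 variables, and the predictors are assumed to be fully observed.

```python
# Sketch of the proposed missingness extension: model the probability that
# CD4 at week 96 is observed, then fit an inverse-probability-weighted
# regression as the execution model. Column names are illustrative.
import numpy as np
import statsmodels.api as sm

def fit_weighted_execution_model(df):
    X = sm.add_constant(df[["age", "karnof", "trt", "cd4_wk20"]])

    # Step 1: logistic execution model for R = 1{CD4 at week 96 observed}.
    r = df["cd4_wk96"].notna().astype(int)
    p_obs = sm.Logit(r, X).fit(disp=0).predict(X)

    # Step 2: weighted regression for CD4 at week 96 among observed rows;
    # rows resembling those prone to missingness (small p_obs) are up-weighted.
    obs = r == 1
    return sm.WLS(df.loc[obs, "cd4_wk96"], X.loc[obs],
                  weights=1.0 / p_obs[obs]).fit()

# Step 3 (generation): draw synthetic R ~ Bernoulli(p_obs) from the fitted
# logistic model evaluated on the synthetic covariates, generate CD4 week 96
# from the weighted model, and blank it out wherever synthetic R = 0.
```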

Based on the findings presented here, we recommend a sequential data generation approach that splits the generation task into distinct, time-ordered steps when generating a synthetic table of RCT data with mixed-type variables. Additionally, we found that generating more complex tables of data (e.g., a set of baseline variables) was most effective when harnessing a vine copula approach, in particular R-vine copulas, whereas simpler regression models appear sufficient for generating subsequent variables one at a time. Further, we recommend the use of simple statistical comparisons of univariate and bivariate distributions, and careful handling of any bounded covariates, to ensure faithfulness of the synthetic data to the original data on which they are based. These powerful yet simple approaches to generating realistic synthetic data can be harnessed to study design operating characteristics for new RCTs, to perform more realistic simulations when assessing the finite-sample performance of new methodology, or to produce "nearly" real data to accompany new code without risking any breach of data confidentiality.

Author Contributions

N.Z.P. contributed to the experimental design, coding of simulations, and writing and editing of the original manuscript. E.E.M.M. and N.S. contributed to the experimental design and editing of the original manuscript.

Disclosure

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1: Additional information consisting of a table of the ACTG 175 trial variables used in the analysis, as well as further synthetic and real data distribution plots, ML efficacy metric comparisons, and plots of trial inference results across all simulation runs, can be found in the online version of the article at the publisher's website.

SIM-44-0-s001.pdf (2.7MB, pdf)

Acknowledgments

E.E.M.M. is a Canada Research Chair (Tier 1) in Statistical Methods for Precision Medicine and acknowledges the support of a chercheur de mérite career award from the Fonds de Recherche du Québec, Santé. This work is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada.

Petrakos N. Z., Moodie E. E. M., and Savy N., “A Framework for Generating Realistic Synthetic Tabular Data in a Randomized Controlled Trial Setting,” Statistics in Medicine 44, no. 18‐19 (2025): e70227, 10.1002/sim.70227.

Funding: This work is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada.

Data Availability Statement

The data that support the findings of this study are openly available in the speff2trial R package at https://cran.r‐project.org/web/packages/speff2trial/index.html.

References

1. Assefa S. A., Dervovic D., Mahfouz M., Tillman R. E., Reddy P., and Veloso M., "Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls," in ICAIF '20: Proceedings of the First ACM International Conference on AI in Finance, October 15–16 (Association for Computing Machinery, 2020), 1–8.
2. Hsu A., Khoo W., Goyal N., and Wainstein M., "Next-Generation Digital Ecosystem for Climate Data Mining and Knowledge Discovery: A Review of Digital Data Collection Technologies," Frontiers in Big Data 3, no. 29 (2020): 1–19.
3. Pezoulas V. C., Zaridis D. I., Mylona E., et al., "Synthetic Data Generation Methods in Healthcare: A Review on Open-Source Tools and Methods," Computational and Structural Biotechnology Journal 23 (2024): 2892–2910.
4. Figueira A. and Vaz B., "Survey on Synthetic Data Generation, Evaluation Methods and GANs," Mathematics 10, no. 15 (2022): 1–41.
5. Schreck N., Slynko A., Saadati M., and Benner A., "Statistical Plasmode Simulations – Potentials, Challenges and Recommendations," Statistics in Medicine 43, no. 9 (2024): 1804–1825.
6. Franklin J. M., Schneeweiss S., Polinski J. M., and Rassen J. A., "Plasmode Simulation for the Evaluation of Pharmacoepidemiologic Methods in Complex Healthcare Databases," Computational Statistics and Data Analysis 72 (2014): 219–226.
7. Souli Y., Trudel X., Diop A., Brisson C., and Talbot D., "Longitudinal Plasmode Algorithms to Evaluate Statistical Methods in Realistic Scenarios: An Illustration Applied to Occupational Epidemiology," BMC Medical Research Methodology 23, no. 242 (2023): 1–15.
8. Crowther M. J. and Lambert P. C., "Simulating Biologically Plausible Complex Survival Data," Statistics in Medicine 32, no. 23 (2013): 4118–4134.
9. Vaughan L. K., Divers J., Padilla M. A., et al., "The Use of Plasmodes as a Supplement to Simulations: A Simple Example Evaluating Individual Admixture Estimation Methodologies," Computational Statistics and Data Analysis 53, no. 5 (2009): 1755–1766.
10. O'Keefe C. M. and Rubin D. B., "Individual Privacy Versus Public Good: Protecting Confidentiality in Health Research," Statistics in Medicine 34, no. 23 (2015): 3081–3103.
11. Pappalardo F., Russo G., Tshinanu F. M., and Viceconti M., "In Silico Clinical Trials: Concepts and Early Adoptions," Briefings in Bioinformatics 20, no. 5 (2019): 1699–1708.
12. Friedrich S. and Friede T., "On the Role of Benchmarking Data Sets and Simulations in Method Comparison Studies," Biometrical Journal 66, no. 1 (2024): 1–15.
13. Chen Z., Zhang H., Guo Y., et al., "Exploring the Feasibility of Using Real-World Data From a Large Clinical Data Research Network to Simulate Clinical Trials of Alzheimer's Disease," npj Digital Medicine 4, no. 1 (2021): 1–9.
14. Sarrami-Foroushani A., Lassila T., MacRaild M., et al., "In-Silico Trial of Intracranial Flow Diverters Replicates and Expands Insights From Conventional Clinical Trials," Nature Communications 12, no. 1 (2021): 1–12.
15. Zwep L. B., Guo T., Nagler T., Knibbe C. A. J., Meulman J. J., and van Hasselt J. G. C., "Virtual Patient Simulation Using Copula Modeling," Clinical Pharmacology and Therapeutics 115, no. 4 (2024): 795–804.
16. Goodfellow I., Pouget-Abadie J., Mirza M., et al., "Generative Adversarial Networks," Communications of the ACM 63, no. 11 (2020): 139–144.
17. Mirza M. and Osindero S., "Conditional Generative Adversarial Nets," 2014, http://arxiv.org/abs/1411.1784.
18. Denton E. L., Chintala S., Szlam A., and Fergus R., "Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks," in Advances in Neural Information Processing Systems, December 7–10 (Curran Associates, Inc., 2015), 1–9.
19. Radford A., Metz L., and Chintala S., "Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks," in International Conference on Learning Representations, May 7–9 (CoRR, 2015), 1–16.
20. Xu L., Skoularidou M., Cuesta-Infante A., and Veeramachaneni K., "Modeling Tabular Data Using Conditional GAN," in Advances in Neural Information Processing Systems, December 8–14 (NeurIPS, 2019), 1–11.
21. Choi E., Biswal S., Malin B., Duke J., Stewart W. F., and Sun J., "Generating Multi-Label Discrete Patient Records Using Generative Adversarial Networks," in Proceedings of the 2nd Machine Learning for Healthcare Conference, August 18–19 (PMLR, 2017), 286–305.
22. Rajabi A. and Garibay O. O., "TabFairGAN: Fair Tabular Data Generation With Generative Adversarial Networks," Machine Learning and Knowledge Extraction 4, no. 2 (2022): 488–501.
23. Walia M., Tierney B., and McKeever S., "Synthesising Tabular Data Using Wasserstein Conditional GANs With Gradient Penalty (WCGAN-GP)," in Proceedings of the 28th Irish Conference on Artificial Intelligence and Cognitive Science, December 7–8 (AICS, 2020), 1–12.
24. Askin S., Burkhalter D., Calado G., and El Dakrouni S., "Artificial Intelligence Applied to Clinical Trials: Opportunities and Challenges," Health Technology 13 (2023): 203–213.
25. Wang W. and Pai T. W., "Enhancing Small Tabular Clinical Trial Dataset Through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP," Data 8, no. 9 (2023): 1–20.
26. Hernandez M., Epelde G., Alberdi A., Cilla R., and Rankin D., "Synthetic Data Generation for Tabular Health Records: A Systematic Review," Neurocomputing 493 (2022): 28–45.
27. Boden-Albala B., "Confronting Legacies of Underrepresentation in Clinical Trials: The Case for Greater Diversity in Research," Neuron 110, no. 5 (2022): 746–748.
28. Heiat A., Gross C. P., and Krumholz H. M., "Representation of the Elderly, Women, and Minorities in Heart Failure Clinical Trials," Archives of Internal Medicine 162, no. 15 (2002): 1682–1688.
29. Watson D. S., Blesch K., Kapar J., and Wright M. N., "Adversarial Random Forests for Density Estimation and Generative Modeling," in Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, April 25–27 (PMLR, 2023), 5357–5375.
30. Thees O., Novák J., and Templ M., "Evaluation of Synthetic Data Generators on Complex Tabular Data," in Privacy in Statistical Databases, September 25–27 (PSD, 2024), 194–209.
31. Qian Z., Davis R., and van der Schaar M., "Synthcity: A Benchmark Framework for Diverse Use Cases of Tabular Synthetic Data," in Advances in Neural Information Processing Systems, December 10–16 (NeurIPS, 2023), 3173–3188.
32. Höllig J. and Geierhos M., "Utility Meets Privacy: A Critical Evaluation of Tabular Data Synthesizers," IEEE Access 13 (2025): 44497–44509.
33. Fössing E. and Drechsler J., "An Evaluation of Synthetic Data Generators Implemented in the Python Library Synthcity," in Privacy in Statistical Databases, September 25–27 (PSD, 2024), 178–193.
34. Patki N., Wedge R., and Veeramachaneni K., "The Synthetic Data Vault," in Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), October 17–19 (IEEE, 2016), 399–410.
35. Sun Y., Cuesta-Infante A., and Veeramachaneni K., "Learning Vine Copula Models for Synthetic Data Generation," in Proceedings of the AAAI Conference on Artificial Intelligence, January 21 – February 1 (AAAI, 2019), 5049–5057.
36. Demeulemeester R., Savy N., Grosclaude P., Costa N., and Saint-Pierre P., "Agent Based Modeling in Health Care Economics: Examples in the Field of Thyroid Cancer," International Journal of Biostatistics 19, no. 2 (2023): 351–368.
37. Sklar A., "Fonctions de répartition à n dimensions et leurs marges," Publications de l'Institut de Statistique de l'Université de Paris 8 (1959): 229–231.
38. Akbari M. and Liang J., "Semi-Recurrent CNN-Based VAE-GAN for Sequential Data Generation," in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, April 15–20 (IEEE, 2018), 2321–2325.
39. Dahmen J. and Cook D., "SynSys: A Synthetic Data Generation System for Healthcare Applications," Sensors (Basel) 19, no. 5 (2019): 1–11.
40. Saint-Pierre P. and Savy N., "Agent-Based Modeling in Medical Research, Virtual Baseline Generator and Change in Patients' Profile Issue," International Journal of Biostatistics 19, no. 2 (2023): 333–349.
41. Maglio P. P. and Mabry P. L., "Agent-Based Models and Systems Science Approaches to Public Health," American Journal of Preventive Medicine 40, no. 3 (2011): 392–394.
42. Ainslie K. E. C., Haber M. J., Malosh R. E., Petrie J. G., and Monto A. S., "Maximum Likelihood Estimation of Influenza Vaccine Effectiveness Against Transmission From the Household and From the Community," Statistics in Medicine 37, no. 6 (2018): 970–983.
43. Yan Q. L., Tang S. Y., and Xiao Y. N., "Impact of Individual Behaviour Change on the Spread of Emerging Infectious Diseases," Statistics in Medicine 37, no. 6 (2018): 948–969.
44. Boren D., Sullivan P. S., Beyrer C., Baral S. D., Bekker L. G., and Brookmeyer R., "Stochastic Variation in Network Epidemic Models: Implications for the Design of Community Level HIV Prevention Trials," Statistics in Medicine 33, no. 22 (2014): 3894–3904.
45. Kingma D. P. and Welling M., "Auto-Encoding Variational Bayes," in International Conference on Learning Representations, Scottsdale, United States, May 2–4 (CoRR, 2013), 1–14.
46. Koloi A., Loukas V. S., Sakellarios A., et al., "A Comparison Study on Creating Simulated Patient Data for Individuals Suffering From Chronic Coronary Disorders," in 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), July 24–27 (IEEE, 2023), 1–4.
47. Chawla N. V., Bowyer K. W., Hall L. O., and Kegelmeyer W. P., "SMOTE: Synthetic Minority Over-Sampling Technique," Journal of Artificial Intelligence Research 16 (2002): 321–357.
48. Han H., Wang W. Y., and Mao B. H., "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning," in International Conference on Intelligent Computing, August 23–26 (Springer, 2005), 878–887.
49. Bunkhumpornpat C., Sinapiromsaran K., and Lursinsap C., "Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling Technique for Handling the Class Imbalanced Problem," in Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, April 27–30 (Springer, 2009), 475–482.
50. He H., Bai Y., Garcia E. A., and Li S., "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning," in Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), June 1–8 (IEEE, 2008), 1322–1328.
51. Douzas G., Bacao F., and Last F., "Improving Imbalanced Learning Through a Heuristic Oversampling Method Based on k-Means and SMOTE," Information Sciences 465 (2018): 1–20.
52. Kiran A. and Kumar S. S., "A Comparative Analysis of GAN and VAE Based Synthetic Data Generators for High Dimensional, Imbalanced Tabular Data," in Proceedings of the 2023 2nd International Conference for Innovation in Technology (INOCON), March 3–5 (IEEE, 2023), 1–6.
53. Kiran A. and Kumar S. S., "A Methodology and an Empirical Analysis to Determine the Most Suitable Synthetic Data Generator," IEEE Access 12 (2024): 12209–12228.
54. Chen Q., Ye A., Zhang Y., Chen J., and Huang C., "An Intra-Class Distribution-Focused Generative Adversarial Network Approach for Imbalanced Tabular Data Learning," International Journal of Machine Learning and Cybernetics 15, no. 7 (2024): 2551–2572.
55. Biau G., Cadre B., Sangnier M., and Tanielian U., "Some Theoretical Properties of GANs," Annals of Statistics 48, no. 3 (2020): 1539–1566.
56. Joe H., "Families of m-Variate Distributions With Given Margins and m(m−1)/2 Bivariate Dependence Parameters," Lecture Notes–Monograph Series 28 (Institute of Mathematical Statistics, 1996): 120–141.
57. Bedford T. and Cooke R. M., "Vines – A New Graphical Model for Dependent Random Variables," Annals of Statistics 30, no. 4 (2002): 1031–1068.
58. Aas K., Czado C., Frigessi A., and Bakken H., "Pair-Copula Constructions of Multiple Dependence," Insurance: Mathematics and Economics 44, no. 2 (2009): 182–198.
59. Genest C., Okhrin O., and Bodnar T., "Copula Modeling From Abe Sklar to the Present Day," Journal of Multivariate Analysis 201 (2024): 1–9.
60. Hammer S. M., Katzenstein D. A., Hughes M. D., et al., "A Trial Comparing Nucleoside Monotherapy With Combination Therapy in HIV-Infected Adults With CD4 Cell Counts From 200 to 500 per Cubic Millimeter," New England Journal of Medicine 335, no. 15 (1996): 1081–1090.
61. DataCebo, Inc., "Synthetic Data Metrics," Version 0.15.0, 2024.
62. Nagler T. and Vatter T., "rvinecopulib: High Performance Algorithms for Vine Copula Modeling," Version 0.6.3.1.1, 2023.
63. Zhou Y., Dong F., Liu Y., Li Z., Du J., and Zhang L., "Forecasting Emerging Technologies Using Data Augmentation and Deep Learning," Scientometrics 123 (2020): 1–29.
64. Jeni L. A., Cohn J. F., and De La Torre F., "Facing Imbalanced Data – Recommendations for the Use of Performance Metrics," in Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, September 2–5 (IEEE, 2013), 245–251.
65. Hastie T., Tibshirani R., Sherlock G., Eisen M., Brown P., and Botstein D., "Imputing Missing Data for Gene Expression Arrays," Technical Report, Stanford Statistics Department 17, no. 6 (2001): 520–525.
66. Guillaudeux M., Rousseau O., Petot J., et al., "Patient-Centric Synthetic Data Generation, No Reason to Risk Re-Identification in Biomedical Data Analysis," npj Digital Medicine 6, no. 37 (2023): 1–10.
67. Collins L. M., Murphy S. A., and Strecher V., "The Multiphase Optimization Strategy (MOST) and the Sequential Multiple Assignment Randomized Trial (SMART): New Methods for More Potent eHealth Interventions," American Journal of Preventive Medicine 32, no. 5 (2007): 112–118.
