Table 2: Methods and examples of synthetic data generation (further explanations are provided in the Glossary in Supplementary Table 2).
Techniques | Description | Examples | Software
---|---|---|---
Prediction | | |
Multiple Imputation | Data are generated by simulating multiple copies of the population from which respondents have been selected. All data with missing values are filled in by multiple imputation, and a random sample from each of these synthetic populations is then released [53]. | US Census Bureau's Small Area Income and Poverty Estimates (https://www.census.gov/library/fact-sheets/2021/what-are-synthetic-data.html) |
Classification and Regression Tree (CART) | CART uses a series of conditional models to sequentially predict and impute the value of each variable, depending on the values of other variables, some of which have already been synthesised. The approach builds on multiple imputation, in which values for missing data are imputed based on their relationships with the original variables. Prediction approaches effectively treat all values as missing, and the synthesised variables are generated through imputation based on relationships observed in the real data [18] (a minimal sketch follows the table). | Synthetic versions of the Scottish Longitudinal Study (SLS) (https://sls.lscs.ac.uk/guides-resources/synthetic-data/) | R package synthpop [54]
Autoregressive models | An autoregressive (AR) model predicts future behaviour based on past behaviour. It is used for forecasting when there is some correlation between values in a time series and the values that precede and succeed them. The process is a linear regression of the data in the current series against one or more past values in the same series. The AR process is an example of a stochastic process, which has degrees of uncertainty or randomness built in: while future trends can be predicted from past data, predictions will never be 100% accurate, although the process usually gets "close enough" to be useful for generating synthetic data. AR models are also called conditional models, Markov models, or transition models [55] (a minimal sketch follows the table). | |
Sampling | | |
Bayesian Network | In this approach, the relationships between the variables are specified within a graphical structure (e.g., a directed acyclic graph), and joint probability distributions (or contingency tables, if categorical) for all the variables are derived from the original data [17]. Synthetic data are generated by sampling from these distributions (a minimal sketch follows the table). | The National Cancer Registry ('Simulacrum'), generated by Health Data Insight and Public Health England (https://healthdatainsight.org.uk/project_category/synthetic-data/) | "Simulacrum codebase" in MATLAB
Bayesian models using MCMC | MCMC chains are used to sample from the joint posterior distribution of the variables to be synthesised. A mixture of variable types can be handled by transforming the observed joint distribution into a multivariate normal distribution. Multi-level data can also be synthesised (i.e., preserving hierarchical structures such as pupils within schools or patients within hospitals) [56, 57]. | UK primary care data for public health research: (i) CPRD cardiovascular disease dataset [46, 58]; (ii) CPRD COVID-19 symptoms and risk factors synthetic dataset [46, 58] | R package bnlearn; R package jomo [59]
Synthetic minority over-sampling technique (SMOTE) | SMOTE is essentially a resampling technique that creates new data points for the minority class. Instead of duplicating observations, it creates new observations along the lines joining a randomly chosen minority point and its nearest neighbours, i.e., it synthesises new examples between existing (real) examples rather than replicating them at their exact locations. SMOTE was originally developed to address the problems of applying classification algorithms to unbalanced datasets. The method was proposed in 2002 [60] and can be used, for example, to train neural network classifiers for medical decision making [61] (see the sketch after the table). | |
Static spatial microsimulation | This technique generates synthetic populations by combining individual-level samples (i.e., microdata, the "seed") with aggregated census data (i.e., macro data, the "target"). Its intended use is for cases where spatial microsimulation is desirable, i.e., where individuals belong to household and regional hierarchical structures. The main method used for synthetic reconstruction is Iterative Proportional Fitting [62, 63], which builds up a synthetic dataset for a small area from census tables, so that the entirely synthetic dataset is drawn from a joint probability distribution specified by conditioning attributes on existing (known) attributes (a minimal sketch follows the table). | |
Generative models | | |
Generative Adversarial Networks (GANs) | GANs comprise two competing neural network models. One takes noise as input and generates samples (the generator, or "forger"). The other model (the discriminator) receives samples from both the generator and the training data and attempts to distinguish between the two sources. There is a continuous game between the two networks in which the generator learns to produce ever more realistic samples while the discriminator learns to become better and better at distinguishing generated data from real data. A GAN is successful if the two networks train well against each other, each learning at the expense of the other, and attain equilibrium over time [64] (a minimal sketch follows the table). | Quantifying utility and preserving privacy in synthetic datasets (QUIPP), The Alan Turing Institute [Conditional GAN for Tabular data (CTGAN)] (https://www.turing.ac.uk/research/research-projects/quipp-quantifying-utility-and-preserving-privacy-synthetic-data-sets); Synthetic data in machine learning for healthcare (https://www.vanderschaar-lab.com/synthetic-data-breaking-the-data-logjam-in-machine-learning-for-healthcare/) (PATE-GAN, ADS-GAN, and TimeGAN techniques) |
Autoencoders and variational autoencoders (VAEs) | An autoencoder consists of two individual deep neural networks. The first, the encoder, compresses the input (original) data into a short code. The second, the decoder, is a mirror image of the encoder; its purpose is to decompress the short code generated by the encoder into a representation that closely resembles the original data [65]. Variational autoencoders (VAEs) are a more modern version of autoencoders: they use the same architecture but impose added constraints on the learned encoded representation. Both techniques use sampling to produce new samples that are similar to those in the original dataset but not exactly the same [66] (see the sketch after the table). | |
Recurrent Neural Networks (RNNs) | RNNs are networks with loops in them, allowing information to persist, i.e., they remember things from prior inputs while generating outputs. This architecture equips RNNs with a form of internal state (memory), enabling them to exhibit temporal dynamic behaviour and to process sequences of inputs. In RNNs the aim is to build a generative model that captures the joint probability p(x, y) of the inputs x and the output y. This probability can be used to sample data, or to make predictions by using Bayes' rule to calculate the posterior probability p(y|x) and then estimating the most likely output [67]. | |
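The following minimal sketches illustrate several of the techniques listed above. They are hedged, illustrative examples only: the datasets, column names, network sizes, and hyperparameters are assumptions made for demonstration, not the implementations used by the projects cited in the table. The first sketch shows CART-style sequential synthesis in the spirit of the synthpop approach, substituting scikit-learn decision trees for synthpop's R implementation and assuming numeric columns (categorical columns would use a classification tree instead).

```python
# CART-style sequential synthesis for numeric columns: each variable is modelled
# from the variables synthesised before it, and values are drawn from the real
# observations falling in the same tree leaf (preserving within-leaf variability).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def synthesize_cart(real: pd.DataFrame, order: list) -> pd.DataFrame:
    synth = pd.DataFrame(index=range(len(real)))
    # First variable: bootstrap from its marginal distribution.
    synth[order[0]] = rng.choice(real[order[0]].to_numpy(), size=len(real), replace=True)
    for i, col in enumerate(order[1:], start=1):
        predictors = order[:i]
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
        tree.fit(real[predictors], real[col])
        real_leaves = tree.apply(real[predictors])      # leaf of each real record
        synth_leaves = tree.apply(synth[predictors])    # leaf of each synthetic record
        values = real[col].to_numpy()
        synth[col] = [rng.choice(values[real_leaves == leaf]) for leaf in synth_leaves]
    return synth

# Hypothetical usage with made-up column names:
# synthetic_df = synthesize_cart(real_df, order=["age", "bmi", "systolic_bp"])
```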
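A sketch of autoregressive synthesis for a univariate time series: an AR(1) model, x_t = c + phi * x_(t-1) + noise, is fitted to an observed series by least squares and then used to simulate a synthetic series with a similar autocorrelation structure. The "observed" series below is itself simulated as a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ar1(series: np.ndarray):
    """Estimate x_t = c + phi * x_{t-1} + eps_t by least squares."""
    x_prev, x_next = series[:-1], series[1:]
    phi, c = np.polyfit(x_prev, x_next, deg=1)          # slope = phi, intercept = c
    resid = x_next - (c + phi * x_prev)
    return c, phi, resid.std()

def simulate_ar1(c: float, phi: float, sigma: float, n: int, x0: float) -> np.ndarray:
    """Generate a series from a fitted AR(1) process with Gaussian noise."""
    out = np.empty(n)
    out[0] = x0
    for t in range(1, n):
        out[t] = c + phi * out[t - 1] + rng.normal(scale=sigma)
    return out

# Stand-in "observed" data, then a synthetic series released in its place.
observed = simulate_ar1(c=0.5, phi=0.8, sigma=1.0, n=200, x0=0.0)
c_hat, phi_hat, sigma_hat = fit_ar1(observed)
synthetic = simulate_ar1(c_hat, phi_hat, sigma_hat, n=len(observed), x0=observed[0])
```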
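A sketch of sampling-based synthesis for categorical variables, assuming a hypothetical chain-structured graph (A -> B -> C): marginal and conditional contingency tables are estimated from the real data and synthetic records are drawn from them. Dedicated Bayesian-network software such as the bnlearn package cited above would additionally learn the graph structure itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def synthesize_chain(real: pd.DataFrame, chain: list, n: int) -> pd.DataFrame:
    """Sample synthetic records from P(A) * P(B|A) * P(C|B) ... estimated from `real`."""
    synth = pd.DataFrame(index=range(n))
    # Root node: sample from its marginal distribution.
    root = chain[0]
    p_root = real[root].value_counts(normalize=True)
    synth[root] = rng.choice(p_root.index.to_numpy(), size=n, p=p_root.to_numpy())
    # Each child: sample from its contingency table conditional on the parent just synthesised.
    for parent, child in zip(chain, chain[1:]):
        cond = pd.crosstab(real[parent], real[child], normalize="index")
        synth[child] = [
            rng.choice(cond.columns.to_numpy(), p=cond.loc[parent_val].to_numpy())
            for parent_val in synth[parent]
        ]
    return synth

# Hypothetical usage with made-up categorical columns:
# synthetic_df = synthesize_chain(real_df, chain=["smoking", "diagnosis", "treatment"], n=1000)
```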
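A sketch of SMOTE using the imbalanced-learn package; the feature matrix and class labels are randomly generated stand-ins rather than any of the cited medical data.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(3)

# Stand-in imbalanced dataset: 950 majority-class rows, 50 minority-class rows.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

# SMOTE interpolates between each minority sample and its nearest minority neighbours,
# creating new synthetic minority examples rather than duplicating existing ones.
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_resampled))   # e.g. {0: 950, 1: 50} -> {0: 950, 1: 950}
```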
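A sketch of Iterative Proportional Fitting for two attributes: a seed table built from microdata is repeatedly rescaled until its row and column totals match known aggregate (census-style) marginals. The seed counts and marginal totals below are made up for illustration.

```python
import numpy as np

def ipf_2d(seed: np.ndarray, row_targets: np.ndarray, col_targets: np.ndarray,
           max_iter: int = 100, tol: float = 1e-8) -> np.ndarray:
    """Rescale a positive seed table so its margins match the target row/column totals."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]   # match row totals
        table *= (col_targets / table.sum(axis=0))[None, :]   # match column totals
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Seed: age group x household size counts from a small microdata sample (made up).
seed = np.array([[3.0, 5.0, 2.0],
                 [4.0, 1.0, 6.0]])
# Known aggregate totals for the small area (made up; both sum to 200).
weights = ipf_2d(seed, row_targets=np.array([120.0, 80.0]),
                 col_targets=np.array([60.0, 90.0, 50.0]))
# `weights` gives expected counts per cell; integerising them and cloning the
# corresponding microdata rows yields the synthetic small-area population.
```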
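A sketch of the GAN training loop described above, written in PyTorch for a toy two-dimensional continuous dataset. The architectures, learning rates, and data are illustrative assumptions; tabular GANs such as CTGAN, PATE-GAN, or TimeGAN add substantially more machinery (conditional sampling of discrete columns, differential privacy, temporal structure).

```python
import torch
from torch import nn

torch.manual_seed(0)
# Stand-in "real" data: 1000 two-dimensional records.
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.3]) + torch.tensor([2.0, -1.0])

noise_dim = 8
generator = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, 2))
discriminator = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Discriminator: learn to separate real rows from generated rows.
    fake = generator(torch.randn(128, noise_dim)).detach()
    real = real_data[torch.randint(0, len(real_data), (128,))]
    d_loss = (bce(discriminator(real), torch.ones(128, 1))
              + bce(discriminator(fake), torch.zeros(128, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: learn to produce rows the discriminator labels as real.
    fake = generator(torch.randn(128, noise_dim))
    g_loss = bce(discriminator(fake), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, synthetic rows are drawn by feeding fresh noise to the generator.
synthetic = generator(torch.randn(500, noise_dim)).detach()
```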
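A sketch of a variational autoencoder in PyTorch, following the encoder/decoder description above: the encoder maps each record to the parameters of a latent distribution, a sample from that distribution is decoded back into data space, and new synthetic records are produced by decoding fresh draws from the latent space. Again, the data and network sizes are assumptions for illustration.

```python
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        return self.decoder(z), mu, logvar

torch.manual_seed(0)
data = torch.randn(1000, 4)                  # stand-in for a real continuous dataset
model = VAE(n_features=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    recon, mu, logvar = model(data)
    recon_loss = ((recon - data) ** 2).mean()                       # reconstruct the input
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # constraint on the code
    loss = recon_loss + 0.1 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# New samples: decode random draws from the latent (encoded) space.
synthetic = model.decoder(torch.randn(500, 2)).detach()
```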