Int J Popul Data Sci. 2022 May 23;7(1):1727. doi: 10.23889/ijpds.v7i1.1727

Table 1: Proposed simplified categorisation of synthetic data (a more comprehensive summary of existing terminology for synthetic data can be found in Supplementary Table 1).

Data utility: Minimally disclosive, minimal analytic value, low fidelity

Category: Univariate synthetic data

Expectations:
  • Preserves the type, structure and format of the original data
  • Does not contain any original data
  • Impossible to identify any single entity
  • Variables in the synthetic dataset should have aggregated statistics similar to those of the ground truth (original data) for both continuous and categorical variables (a minimal code sketch follows this row). These include:
    • Population/cohort-based distribution data (e.g., age and income distributions)
    • Categorical proportion data (e.g., % of ethnic groups)
  • Requires no, or only light, disclosure processes

Uses:
  1. Basic or advanced code testing, including data management or cleaning
  2. Education and training for data analysis
  3. Sharing (allows easier sharing of data within or between government departments)
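As an illustration of this category (not part of the original table), the following sketch samples each variable independently from its fitted marginal distribution, so aggregate statistics survive but no relationships between variables do. The column names and the normal approximation for continuous variables are assumptions made for the example.

```python
# Minimal sketch of univariate synthesis: each column is sampled from its
# own marginal distribution, so aggregate statistics are preserved but all
# relationships between columns are lost. Column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def synthesise_univariate(real: pd.DataFrame, n: int) -> pd.DataFrame:
    synth = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            # Continuous variable: match the empirical mean and spread
            # (a normal approximation; a fitted distribution could be used).
            synth[col] = rng.normal(real[col].mean(), real[col].std(), n)
        else:
            # Categorical variable: resample categories in proportion
            # to their observed frequencies.
            freqs = real[col].value_counts(normalize=True)
            synth[col] = rng.choice(freqs.index, size=n, p=freqs.values)
    return pd.DataFrame(synth)

# Toy "ground truth" with hypothetical variables
real = pd.DataFrame({
    "age": rng.normal(45, 12, 1_000),
    "ethnic_group": rng.choice(["A", "B", "C"], 1_000, p=[0.6, 0.3, 0.1]),
})
synthetic = synthesise_univariate(real, n=1_000)
print(synthetic["age"].mean())
print(synthetic["ethnic_group"].value_counts(normalize=True))
```

Because each column is drawn separately, the output is suitable for code testing and training but not for any analysis that depends on relationships between variables.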

Category: Multivariate synthetic data

Expectations:
  • Preserves complex inter-relationships between variables; a close representation of values in real individuals in a specific population is expected
  • Can preserve multivariate distributions for higher- or lower-level geographies and for household structure
  • Can preserve real/logical relationships (joint distributions, e.g., marital status and age)
  • Can preserve biological relationships (e.g., excessive thirst due to diabetes)
  • Preserves statistical distributions of variables and at least some relationships (e.g., correlations) between them (a minimal code sketch follows this row)
  • Preserves the original confounding structure of the data
    • These datasets should be informed by subject-matter expertise or familiarity with real-world distributions
    • The more relationships are preserved, the more realistic the data can be
  • Requires stringent disclosure processes

Uses:
  1. Extended code testing
  2. Extended education and training for data analysis
  3. Testing experimental methods
  4. Exploring and comparing populations
  5. Understanding and examining specific subgroups of populations for study or trial/intervention planning
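A minimal illustration of preserving correlations (not from the paper): fit a mean vector and covariance matrix to continuous variables and sample from the resulting multivariate normal. The variables and coefficients below are invented for the example; real pipelines typically use copulas or model-based generators to handle mixed data types.

```python
# Minimal sketch of multivariate synthesis for continuous variables:
# sample from a multivariate normal fitted to the real data, so the
# covariance (and hence correlation) structure is carried over.
import numpy as np

rng = np.random.default_rng(0)

# Toy "ground truth": age and a biologically related measure
age = rng.normal(50, 15, 2_000)
blood_glucose = 4.5 + 0.03 * age + rng.normal(0, 0.8, 2_000)
real = np.column_stack([age, blood_glucose])

# Fit the joint distribution (mean vector + covariance matrix) ...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ... and draw synthetic records from it.
synthetic = rng.multivariate_normal(mean, cov, size=2_000)

# Correlations should be similar in the two datasets.
print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Unlike the univariate sketch, the sampled records here reproduce the joint distribution, which is what makes this category more analytically useful and more disclosive.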

Data utility: More disclosure risk, more analytic value, high fidelity

Category: Complex modality synthetic data

Description:
  • Data created from perturbations using accurate forward models (i.e., models that simulate outcomes given specific inputs), physical simulation or AI-driven generative models [33] (a minimal forward-model sketch follows this row)
  • Includes specific modalities such as cardiac and radiology images, physiological longitudinal data (e.g., from wearables), genomes, longitudinal data on interactions with public services, prescriptions data, etc.

Expectations:
  • Quality, effectiveness and robustness of the synthetic data rely on the quality of the ground truth data supplied to train the algorithms
  • Careful consideration is required around the appropriate generation of high-volume data modalities (e.g., issues such as geometric distortions of body parts inherent to MRIs)
  • Requires stringent disclosure processes

Uses:
  • Producing high-dimensional data distributions in image synthesis [34, 35] (e.g., image-to-image translation) [36, 37] and speech synthesis
  • Increasing the robustness and adaptability of AI models and testing the stability of machine learning for medicine and healthcare (e.g., predicting early Alzheimer’s disease from brain imaging data, or medical imaging such as skin lesions [38], pathology slides [39] and other imaging modalities) [40–43]
  • Hypothesis generation by facilitating the use of data in understanding social and human behaviour (e.g., diagnosing mental health conditions from a smartphone mood diary app or audio recordings without using personally identifiable data)
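To ground the forward-model idea, here is an illustrative sketch, not the paper's method: a simple parametric model simulates a day of wearable-style heart-rate data from input parameters, and perturbing those inputs across a cohort yields synthetic longitudinal traces. The model form and parameter values are assumptions for the example; realistic physiological or imaging simulators are far more elaborate.

```python
# Minimal sketch of forward-model synthesis: a parametric model simulates
# a wearable-style heart-rate trace given input parameters, and perturbing
# those inputs yields new synthetic records. Model and parameters are
# illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(7)

def forward_model(resting_hr, circadian_amp, noise_sd, minutes=1_440):
    """Simulate one day of minute-level heart rate from input parameters."""
    t = np.arange(minutes)
    circadian = circadian_amp * np.sin(2 * np.pi * t / minutes)  # daily cycle
    noise = rng.normal(0, noise_sd, minutes)                     # sensor noise
    return resting_hr + circadian + noise

# Perturb the inputs around plausible population values to generate a
# cohort of synthetic traces.
cohort = [
    forward_model(
        resting_hr=rng.normal(65, 5),
        circadian_amp=rng.normal(8, 2),
        noise_sd=2.0,
    )
    for _ in range(100)
]
print(len(cohort), cohort[0].shape)  # 100 synthetic day-long traces
```

As the table notes, the fidelity of such outputs depends entirely on how accurate the forward model (or, for AI-driven approaches, the training data) is for the modality being simulated.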