Wellcome Open Research. 2024 Dec 24;9:57. Originally published 2024 Feb 19. [Version 2] doi: 10.12688/wellcomeopenres.20530.2

Releasing synthetic data from the Avon Longitudinal Study of Parents and Children (ALSPAC): Guidelines and applied examples

Daniel Major-Smith 1,a, Alex S F Kwong 1,2, Nicholas J Timpson 1,3, Jon Heron 1,3, Kate Northstone 1
PMCID: PMC11809151  PMID: 39931104

Version Changes

Revised. Amendments from Version 1

The revised version of our manuscript has taken into consideration the helpful and constructive comments of the reviewers (detailed in our responses to the reviewers). The main updates are:

  • Including two additional paragraphs in the Discussion providing guidance for other population-based studies adopting such synthetic data sharing guidelines, in addition to some challenges and open questions regarding the sharing of synthetic data (e.g., ownership of and access to synthetic datasets, and whether synthetic datasets satisfy journal and funding data sharing policies).

  • Clarifying some questions regarding how synthesising data via ‘synthpop’ works (e.g., the use of predictive models, meaning there is no need to take the causal structure of the data into consideration when synthesising), practicalities of synthesising under different study designs (e.g., when the synthesis is for more general use and there are no specified exposures and outcomes), and providing more information on how to compare synthetic vs observed data (e.g., visual inspection of bar charts and histograms).

  • Providing more details of how the repeated measures data were synthesised, and when ‘synthpop’ can and cannot be used for repeated measures data.

  • Re-structuring Tables 1, 2 and 4 to make the observed and synthetic results easier to compare.

  • In the Discussion, reinforcing the safeguards we have put in place to try and avoid synthetic datasets being analysed as observed data, and potential implications if this occurs.

Abstract

The Avon Longitudinal Study of Parents and Children (ALSPAC) is a prospective birth cohort. Since its inception in the early 1990s, the study has collected over thirty years of data on approximately 15,000 mothers, their partners, and their offspring, resulting in over 100,000 phenotype variables to date. Maintaining data security and participant anonymity and confidentiality are key principles for the study, meaning that data access is restricted to bona fide researchers who must apply to use data, which is then shared on a project-by-project basis. Despite these legitimate reasons for restricting data access, this does run counter to emerging best scientific practices encouraging making data openly available to facilitate transparent and reproducible research. Given the rich nature of the resource, ALSPAC data are also a valuable educational tool, used for teaching a variety of methods, such as longitudinal modelling and approaches to modelling missing data. To support these efforts and to overcome the restrictions in place with the study’s data sharing policy, we discuss methods for generating and making openly available synthesised ALSPAC datasets; these synthesised datasets are modelled on the original ALSPAC data, thus maintaining variable distributions and relations among variables (including missing data) as closely as possible, while at the same time preserving participant anonymity and confidentiality. We discuss how ALSPAC data can be synthesised using the ‘synthpop’ package in the R statistical programming language (including an applied example), present a list of guidelines for researchers wishing to release such synthesised ALSPAC data to follow, and demonstrate how this approach can be used as an educational tool to illustrate longitudinal modelling methods.

Keywords: ALSPAC, Synthetic data, Reproducibility, Confidentiality, Open science, Methods education

Introduction

Scientific best practice is moving towards enhanced openness, reproducibility and transparency, with data and analysis code increasingly being shared alongside scientific publications ( Bouter, 2023; Localio et al., 2018; Munafò et al., 2017; Smaldino et al., 2019). Although data sharing is still not universal, there is a continued push from academics, journals, funders and governments towards this goal ( Abbasi, 2023; Federer et al., 2018; Hardwicke et al., 2018; House of Commons Science Innovation and Technology Committee, 2023; Malički et al., 2021; Mathur & Fox, 2023; Minocher et al., 2021; Shepherd et al., 2017; Tedersoo et al., 2021). While beneficial for science as a whole, these changes may be challenging for research from certain sources, such as large-scale population-based longitudinal studies, where data often cannot be made openly available. Reasons for this include preserving participant anonymity and ensuring that only legitimate researchers are able to access the resource ( Colditz, 2009; Hogue, 1991; Quintana, 2020; Samet, 2009; Shepherd et al., 2017). This is the case for the focus of this paper, the Avon Longitudinal Study of Parents and Children (ALSPAC). ALSPAC is a longitudinal population-based birth cohort which enrolled approximately 15,000 pregnant women resident in the Bristol area of the UK who had expected dates of delivery between 1st April 1991 and 31st December 1992. These women, their partners, and their children – and more recently these children’s children – have been followed ever since ( Boyd et al., 2013; Fraser et al., 2013; Lawlor et al., 2019; Major-Smith et al., 2023; Northstone et al., 2019; Northstone et al., 2023). The ALSPAC resource is available to bona fide researchers, and it is not possible to release observed data alongside published work (as detailed in the ALSPAC Data Management plan).

Rather than releasing observed data, an alternative approach to data sharing is based on creating ‘synthetic’ datasets (for an introduction to synthetic data, see Raghunathan, 2021). These synthesised datasets are modelled on the original observed data, thus closely maintaining the marginal distributions of variables (e.g., mean, variation, cell counts, etc.), as well as the relationships between variables. However, as the data are simulated and by design do not correspond to real-life individuals, they preserve participant anonymity (note that, because fully synthetic datasets do not contain personal information of real individuals, they are likely exempt from complying with the European Union’s General Data Protection Regulation [GDPR]; Beduschi, 2024). Although various approaches to synthetic dataset creation exist ( Raghunathan, 2021), the methods followed here are based on the ‘synthpop’ package available in the R programming language, which has been spearheaded by a longitudinal studies group at the University of Edinburgh ( Nowok et al., 2016) (see also https://www.synthpop.org.uk/about-synthpop.html).

While synthetic data may not exactly preserve the attributes of the original observed data, due to random variability and the inability of models to perfectly recreate the original data, any analyses and conclusions ought to be similar ( Nowok et al., 2016; Quintana, 2020). This can enable readers of the paper – either pre-publication, during the peer review process or post-publication – to explore the raw data, understand the analyses better, and replicate analyses themselves using these synthesised data ( Coughlin, 2017; Quintana, 2020; Shepherd et al., 2017). This can help provide readers with assurance that the reported results are broadly correct, allow readers to test out the methods, and could also help with the self-correction of science by noticing potential errors in the analyses (such as treating a categorical variable as continuous ( Decety et al., 2015; Shariff et al., 2016), or recoding the control and intervention groups of a clinical trial incorrectly ( Goldacre et al., 2019), to give two high-profile examples). In addition, synthesised datasets from longitudinal studies such as ALSPAC could be created as an educational tool to help others learn about new and/or complex methods ( Goldstein, 2018); for an example using synthesised data to explore longitudinal growth trajectory modelling, see ( Elhakeem et al., 2022).

In this paper we: i) briefly describe the ‘synthpop’ package in more detail; ii) discuss recommendations for checking the synthesised data and ensuring that synthesised data are non-disclosive; iii) introduce guidelines to be adopted by researchers wishing to release synthesised ALSPAC data; iv) provide an example workflow for synthesising ALSPAC data (using an openly-available ALSPAC dataset); and v) present an example of how synthetic ALSPAC data can be used as an educational tool, focusing on longitudinal modelling methods.

Creating synthetic datasets using ‘synthpop’

The ‘synthpop’ package works sequentially, with each variable synthesised conditional on previously-synthesised variables (other than the first variable, which is synthesised by random sampling from the observed values). For instance, say our dataset had just three variables: age, sex and height. If age were synthesised first, it would be generated by randomly sampling from the observed distribution of age. If sex were synthesised next, a model of sex given age would be fitted in the observed data, and synthetic values of sex would be generated by sampling from this model’s predictions given the previously-synthesised values of age. Finally, if height were synthesised last, a model of height given age and sex would be fitted in the observed data, with synthetic values again sampled from this model’s predictions given the previously-synthesised values of age and sex. The default algorithm for synthesising data is tree-based (using classification and regression trees; CART), but it is also possible to synthesise data using alternative tree-based (e.g., random forest) or parametric (e.g., linear, logistic) models. Note also that this method accounts for missing data, and maintains relations between missing data and other variables ( Nowok et al., 2016). This process of synthetic data generation is closely related to the method of multiple imputation by chained equations for imputing missing data ( van Buuren, 2018); the difference being that, rather than only imputing missing values, these synthetic data methods generate wholly-synthetic datasets based on the observed data ( Raghunathan, 2021).
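The sequential logic can be sketched language-agnostically. The following Python toy (not synthpop's actual CART algorithm; the data and variable names are invented) synthesises each variable by sampling from the empirical distribution conditional on the values already synthesised. Note that with such tiny data every synthetic row exactly reproduces an observed row, which is precisely why disclosure checks on real syntheses matter:

```python
import random

random.seed(42)

# Toy observed data: (age, sex, height) -- purely illustrative values
observed = [
    (8, "F", 128), (8, "M", 131), (9, "F", 134), (9, "M", 136),
    (10, "F", 139), (10, "M", 140), (11, "F", 146), (11, "M", 145),
    (12, "F", 152), (12, "M", 150), (13, "F", 157), (13, "M", 158),
]

def synthesise(observed, n):
    synthetic = []
    for _ in range(n):
        # 1) First variable: random draw from the observed marginal distribution
        age = random.choice([row[0] for row in observed])
        # 2) Second variable: draw conditional on the synthesised age
        sex = random.choice([row[1] for row in observed if row[0] == age])
        # 3) Third variable: draw conditional on both age and sex
        height = random.choice(
            [row[2] for row in observed if row[0] == age and row[1] == sex]
        )
        synthetic.append((age, sex, height))
    return synthetic

syn = synthesise(observed, 100)
```

Sampling within the subset of observed rows that match the already-synthesised values is a crude stand-in for CART, which effectively partitions the data on the predictors and samples from the matching leaf.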

Most of the synthesising is automated by the command ‘syn’ within the ‘synthpop’ package, although it is possible to specify various options, such as the type of model to use, the order in which variables are synthesised, the choice of predictor variables, whether to apply a ‘smoothing’ parameter to continuous variables (to lower the risk of disclosive data for continuous data), and applying rules to maintain relations between variables (for instance, if synthesising variables such as ‘ever smoked’ and ‘amount smoked per day’, one could specify a rule that said ‘if never smoked, then code amount smoked per day as 0’). For more information on the ‘synthpop’ package and its functionality, see ( Nowok et al., 2016).
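The smoking rule mentioned above can be illustrated with a minimal sketch (Python here for brevity; the variable names 'ever_smoked' and 'per_day' are hypothetical, and synthpop applies such rules during synthesis rather than as a post-hoc fix as shown here):

```python
# Hypothetical synthesised rows, purely for illustration
synthetic_rows = [
    {"ever_smoked": "yes", "per_day": 10},
    {"ever_smoked": "no", "per_day": 3},   # inconsistent: the rule should fix this
    {"ever_smoked": "no", "per_day": 0},
]

def apply_rule(rows):
    """If a participant never smoked, force 'amount smoked per day' to 0."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left unchanged
        if row["ever_smoked"] == "no":
            row["per_day"] = 0
        fixed.append(row)
    return fixed

clean = apply_rule(synthetic_rows)
```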

Recommendations when using ‘synthpop’: An example in population-based studies

Successful synthesis should meet two goals: i) preserving participant anonymity; and ii) maintaining relations between variables, the latter being vital for reproducibility ( Quintana, 2020). These two goals may trade-off somewhat against one another, as we discuss below.

The most important factor to consider when synthesising data is the potential disclosure risk. As synthetic datasets are wholly simulated, data ought to be non-disclosive as they are no longer based on individual records. However, it is possible that a unique combination of values could be synthesised corresponding to a unique individual in the observed data, thus remaining a potential disclosure risk. Although researchers using the synthesised data will not be able to know whether such a unique observation matches that of an actual participant, there is a remote possibility that unique individuals could be identified from the synthesised data. We therefore recommend that users undertake ‘statistical disclosure control’ checks on any synthetic datasets to remove any unique observations that occur in both the observed and synthetic datasets. This can be done easily within the ‘synthpop’ package using the commands ‘replicated.uniques’ (to tabulate these cases) and ‘sdc’ (to remove these cases). In our experience, the number of such unique replicates within synthesised ALSPAC data is likely to be quite low. For instance, in the dataset introduced below with 15 variables and 3,727 observations, when using the default CART synthesising method, only 4 observations (0.11% of the original sample) were unique replicates which had to be removed. However, this does depend on a range of factors, such as the number and type of variables; for instance, because there is less variation in possible responses, a dataset with many categorical variables may be more likely to result in a greater number of unique replicates, especially if some categories have low cell counts (an example is given in the associated script where approx. 10% of synthesised cases are unique replicates; see the ‘Data Availability’ section for more information).
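The logic behind ‘replicated.uniques’ and ‘sdc’ can be sketched as follows (a simplified Python illustration of the idea described above, not synthpop's exact implementation; the rows are invented toy tuples):

```python
from collections import Counter

def replicated_uniques(observed, synthetic):
    """Synthetic rows that replicate a row which is unique in the observed
    data -- the potential disclosure risks described above."""
    obs_unique = {row for row, n in Counter(observed).items() if n == 1}
    return [row for row in synthetic if row in obs_unique]

def sdc(observed, synthetic):
    """Statistical disclosure control: drop every replicated unique
    from the synthetic dataset."""
    risky = set(replicated_uniques(observed, synthetic))
    return [row for row in synthetic if row not in risky]

# Toy rows as (age, sex, depressed) tuples; (44, "F", 1) is unique in the
# observed data and replicated in the synthetic data, so it is removed.
observed = [(30, "F", 0), (30, "F", 0), (44, "F", 1), (51, "M", 0), (51, "M", 0)]
synthetic = [(30, "F", 0), (44, "F", 1), (51, "M", 0), (30, "F", 0)]
clean = sdc(observed, synthetic)
```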

If the number of unique replicates is found to be higher than one would like or anticipate, there are a number of available options (although what constitutes a ‘high number’ of unique replicates is a subjective matter and is up to the researcher to decide). For instance, synthesising data via parametric, rather than tree-based, methods may reduce the number of such cases. For synthesising continuous data, a ‘smoothing’ option can be applied which may provide an additional level of disclosure control by making the synthesised values slightly different from the observed values. Top and bottom coding of variables is also available, which may help reduce any potentially identifiable outliers. For more information on statistical disclosure control in ‘synthpop’, see ( Nowok et al., 2016). However, in some circumstances it may not be possible to significantly reduce the number of unique replicates, and hence the sample size of the observed vs synthetic data may differ quite substantially; as long as the remaining synthetic dataset maintains the relations between variables from the observed data, this difference in sample size should not make much difference in practice (although large differences in sample size may lead to a loss of precision in estimates). In addition to this formal statistical disclosure control to remove unique replicates, we recommend that researchers perform a manual check of the data, using a few rows of the dataset, to ensure that ‘synthpop’ has not generated any data corresponding to real participants (this is primarily a sanity check, as the ‘sdc’ command should automatically remove all such cases).
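As a minimal sketch of the top and bottom coding idea (Python for illustration; the bounds and height values are invented, and synthpop's own implementation may differ):

```python
def top_bottom_code(values, lower, upper):
    """Winsorise: pull extreme values in to the given bounds, so that
    outliers (which may be identifying) are not released verbatim."""
    return [min(max(v, lower), upper) for v in values]

heights = [132.0, 148.5, 151.0, 196.4]          # hypothetical heights, cm
coded = top_bottom_code(heights, 135.0, 190.0)  # bounds chosen for illustration
```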

The second factor to consider is whether the synthetic dataset successfully maintains relations between variables (although even synthetic datasets which do not maintain relations between variables can still be useful for reproducing code and checking for errors; ( Shepherd et al., 2017)). The ‘synthpop’ package provides a suite of useful tools for comparing the synthesised data against the observed data. This includes a simple comparison of the distributions of each variable, through to more complex conditional associations, such as in a multivariable regression. Ideally, these should be similar between the observed and synthetic datasets, although random variation and imperfections in the synthesising process are of course inevitable. Formal measures of ‘utility’, comparing the synthetic to the observed data, are also available within the ‘synthpop’ package, but are not discussed here ( Raab et al., 2017).

There are no definitive guidelines for successful synthesis, although some suggestions and recommendations can be made ( Nowok et al., 2016; Raab et al., 2017). First, when synthesising a large number of variables, tree-based methods are often quicker than parametric methods. Second, tree-based methods may reconstruct the observed data structure more faithfully than parametric methods. However, this advice may not hold in all circumstances, and we urge users to check the synthesised results against the observed data and examine whether the similarity is sufficient; if not, try a different specification and compare results. For instance, one can try using a parametric method, or an alternative tree-based method. In our experience, the order in which variables are synthesised can also influence the correspondence with the observed data, although not in all circumstances. While synthesising the data in any order may be sufficient, we have found that synthesising the exposure(s) and outcome(s) last sometimes maintains the relations between variables more faithfully, although this does somewhat contradict the advice given in ( Raab et al., 2017), who recommend synthesising the most important variables first (although if the dataset is being synthesised for more general use and does not have specific exposures or outcomes, synthesising the variables in any order will likely suffice). Note that as synthetic data generation using ‘synthpop’ uses predictive modelling, it is not necessary to consider the causal structure of the data (i.e., the data-generating mechanisms) when synthesising data using this approach. Due to random variation, some synthetic datasets may be closer to the observed data by chance; trying a different starting seed is therefore another option for improving the correspondence between synthetic and observed data. We advise that users test different synthesising methods to see what works best for their specific dataset in terms of replicating the properties of the observed data (e.g., one method may maintain the relationship between the exposure and outcome more faithfully than another). For similar, but more detailed, guidelines on using ‘synthpop’, see ( Raab et al., 2017). There are no definitive rules on what constitutes ‘successful’ synthesis, so again researchers must use their subjective judgement.

There may be a trade-off between the two aims introduced at the start of this section. For instance, methods of disclosure control may alter the relations between variables by removing some observations; conversely, a synthesised dataset which matches the observed data more faithfully may be at an increased risk of participant disclosure. Synthesis using parametric methods may reduce the number of unique replicates but at the same time result in less faithful synthetic datasets, while CART methods may better recreate the original data but result in more unique replicates, for example. Given the competing demands of creating a synthesised dataset close to the original vs reducing the risk of participant disclosure, there may be an iterative cycle between these potentially-conflicting requirements.

ALSPAC guidelines for releasing synthetic data using ‘synthpop’

Here, we detail the guidelines required by all users of ALSPAC data wanting to make their synthesised data openly available. Please note, these are subject to change as the process develops over time. The most recent version of the ‘ ALSPAC synthetic data checklist’ will be available online. For many of these steps example code is provided below:

  • 1)

    When submitting an ALSPAC proposal to access the data, make sure to state that you intend to release synthesised data. This can be noted by an amendment at a later date if necessary.

  • 2)

    Reduce the number of variables and observations in the dataset you plan to synthesise to only those required to replicate the results in the paper (e.g., if the original dataset contains 15,645 observations and 20 variables, while the final analyses contain 4,000 observations and only use 10 variables, all additional observations and variables should be removed prior to synthesis). To avoid releasing a large number of synthetic variables, synthetic datasets should include fewer than 50 variables; if you need to synthesise more than 50 variables, please talk to ALSPAC and provide justification first.

  • 3)

    Check whether there are individuals uniquely identified in both the observed and synthetic datasets. If there are, these should be removed from the synthetic dataset. Perform a manual check on a handful of cases as a sanity check to make sure there are no unique replicates in the observed and synthetic datasets.

  • 4)

    Check that the distributions of all synthesised variables are similar to those in the observed data. Simply visualising the data (e.g., in bar charts or histograms) should provide this information (see worked example below).

  • 5)

    Check that for key variables of interest (e.g., exposure and outcome) the relationships between synthesised variables are comparable to those in the observed data (e.g., via univariable and multivariable regressions). Or, in other words, re-estimate your substantive analysis.

  • 6)

    Include a variable at the beginning of the synthetic dataset named ‘FALSE_DATA’, with values of ‘FALSE_DATA’ for all observations, to ensure it is clear that the dataset contains synthetic data, rather than real observations ( Nowok et al., 2017).

  • 7)

    Include a disclaimer in the published paper, and alongside the synthetic data, making it clear to users that the data are synthetic and should not be used for any subsequent research or publications. We recommend the following statement: “ These are synthesised ALSPAC datasets, and are not suitable for research purposes. The relations between variables are unlikely to be maintained perfectly, so there is the risk when using these synthesised datasets that results may differ from the true data. Only the actual, observed, ALSPAC data should be used for formal research and analyses reported in published work.”.

  • 8)

    Provide the DOI or a suitable weblink as to where the synthetic data are stored (e.g., GitHub, Dryad, or the Open Science Framework).

  • 9)

    Agree to provide details of downloads/requests to use the synthetic data if ALSPAC request such information (if possible).

  • 10)

    Agree to publish the script that developed the synthetic dataset to sit alongside the dataset (including code for all variable name changes and variable recodes and derivations from the original ALSPAC data, to facilitate reproducibility).

Example ‘synthpop’ script

In this section, we will detail the basics of how to synthesise ALSPAC data using the ‘synthpop’ package, and how to compare between observed vs synthesised datasets in a simple regression context. Figure 1 summarises this process. The example here is based on data from an openly available subset of the ALSPAC data ( Northstone et al., 2022) ( https://osf.io/8sgze) so that readers can reproduce these steps. The substantive analysis is a logistic regression with the outcome being a diagnosis of depression in ALSPAC offspring at age 17 years (based on revised Clinical Interview Schedule; ( Lewis et al., 1992)), and depression scores from mothers 8-months post-delivery as the exposure (using Edinburgh Postnatal Depression Scale; ( Cox et al., 1987)), adjusting for a range of sociodemographic confounders (maternal age at delivery, maternal educational attainment, child gender, maternal home ownership status, and maternal ethnicity). This example was conducted using R version 4.0.4 ( R Development Core Team, 2021) and version 1.6-0 of the ‘synthpop’ package.

Figure 1. Flow-chart detailing the process for synthesising data to ensure that synthetic data is non-disclosive whilst maintaining relations between variables.


Note that this may be an iterative process to find a synthetic dataset which meets these two potentially-competing demands. Note also that both ‘acceptable amounts of data loss’ and ‘synthetic data matching observed’ are subjective judgment calls.

Step-by-step guide for synthesising data:

  • 1.

    Install the ‘synthpop’ package in R and load it.

    install.packages("synthpop")
    library(synthpop)
  • 2.

    Read in the observed dataset (here a Stata .dta file, using the R package ‘readstata13’), and then keep only variables and observations used in final analyses (note that some pre-processing of these variables has been omitted here; see the associated “SynthPopExample.R” script for full details).

    library(readstata13)  # provides read.dta13()
    dat <- read.dta13("Master_MSc_Data.dta")
    dat <- dat[, c("gender", "bwt", "gest", "ethnic", "matage", "mated", "pated", "msoc", "psoc", "housing", "marital", "parity", "pregSize", "mat_dep", "depression_17")]
    cca_marker <- complete.cases(dat[, c("gender", "ethnic", "matage", "mated", "housing", "mat_dep", "depression_17")])
    dat <- dat[cca_marker == TRUE, ]
  • 3.

    Take the observed dataset and synthesise using the ‘syn’ command (setting a seed so it is reproducible). The example here just uses the default ‘classification and regression trees’ method; for additional options, see ( Nowok et al., 2016) and the “SynthPopExample.R” script associated with this paper.

    dat_syn <- syn(dat, seed = 13327)
  • 4.

    Next, apply ‘statistical disclosure control’ to remove individuals with unique combinations of variables in both the observed and synthesised data. Tabulate the number and percentage of such cases using the ‘replicated.uniques’ command, and then remove them from the synthesised dataset using the ‘sdc’ command with the ‘rm.replicated.uniques’ option. In this example of 3,727 observations, only 4 (0.11%) are unique replicates to be removed from the synthesised dataset.

    replicated.uniques(dat_syn, dat)
    dat_syn <- sdc(dat_syn, dat, rm.replicated.uniques = TRUE)
  • 5.

    Perform a manual check on a handful of cases to ensure that the ‘sdc’ command above worked and that there are no replicated unique individuals in the final synthetic dataset.

    # Create a dataset of unique observed individuals
    dat_unique <- dat[!(duplicated(dat) | duplicated(dat, fromLast = TRUE)), ]
    
    # Create a dataset of unique synthetic individuals
    syn_unique <- dat_syn$syn[!(duplicated(dat_syn$syn) | duplicated(dat_syn$syn, fromLast = TRUE)), ]
    
    # Select 10 rows at random from the unique observed dataset
    row_unique <- dat_unique[sample(nrow(dat_unique), 10), ]
    
    # Check there are no duplicated observations (this should equal ‘0’)
    sum(duplicated(rbind.data.frame(syn_unique, row_unique)))
  • 6.

    Compare the distribution of variables between the observed and synthetic datasets using the ‘compare’ command. This provides a series of descriptive tables and figures (see Figure 2), which compare the marginal distribution of all variables in the synthesised and observed datasets. Figure 2 illustrates that the synthetic data matches the observed distribution of each variable very well (including NAs/missing values).

    compare(dat_syn, dat, stat = "counts", nrow = 3, ncol = 5)
  • 7.

    Run an unadjusted model comparing the association between the exposure and outcome in both observed and synthetic datasets (here, the exposure is the ALSPAC mother’s depressive symptoms score 8-months post-delivery, and the outcome is whether their offspring was depressed at age 17 years). First store a model using the synthetic data (using the ‘glm.synds’ command), and then compare this against a model using the actual data. In our example, the association in the synthetic data is similar to, though slightly larger than, that in the observed data ( Figure 3; note that, by default, ‘synthpop’ converts all coefficients to z-values so that all coefficients are on the same scale and hence easier to compare).

    model.syn <- glm.synds(depression_17 ~ mat_dep, family = "binomial", data = dat_syn)
    compare(model.syn, dat)
  • 8.

    Repeat step 7, using a multivariable model adjusting for additional covariates (i.e., our substantive analysis model). Here, we can see that the relations between the outcome and other variables are also similar, although not exactly the same, in both observed and synthetic datasets ( Figure 4; we also stress here that in a causal analysis between the exposure and outcome the coefficients of these additional covariates do not have a straightforward interpretation [see Westreich & Greenland, 2013 on the ‘table 2’ fallacy]).

    model.syn2 <- glm.synds(depression_17 ~ mat_dep + matage + ethnic + gender + mated + housing, family = "binomial", data = dat_syn)
    compare(model.syn2, dat)
  • 9.

    Repeat steps 3 to 8 until you are happy that: a) the number of unique replicates removed from the synthetic dataset is sufficiently low; and b) the distributions and relations between variables in the observed and synthetic datasets are sufficiently similar. Note that both of these decisions are subjective judgements.

  • 10.

    Add a variable called ‘FALSE_DATA’, with the value of ‘FALSE_DATA’ for all observations, to the start of the synthetic dataset, so users know that the dataset contains synthetic – as opposed to observed – data.

    dat_syn$syn <- cbind(FALSE_DATA = rep("FALSE_DATA", nrow(dat_syn$syn)), dat_syn$syn)
  • 11.

    Save the synthesised dataset. In our example, we have saved the synthesised data as R, CSV and Stata data files, respectively.

    write.syn(dat_syn, file = "syntheticData", filetype = "Rdata")
    write.syn(dat_syn, file = "syntheticData", filetype = "csv")
    write.syn(dat_syn, file = "syntheticData", filetype = "Stata", convert.factors = "labels")

Figure 2. Comparing variable distributions in observed (dark blue) and synthetic (light blue) datasets.


Figure 3. Example analysis comparing univariable associations between an exposure (maternal depressive symptoms score; mat_dep) and an outcome (offspring depression diagnosis; depression_17) in both observed (dark blue) and synthetic (light blue) datasets.


Note also that the results in the plot are on a z-value scale.

Figure 4. Example analysis comparing multivariable associations between an exposure (maternal depressive symptoms score; mat_dep) and an outcome (offspring depression diagnosis; depression_17) in both observed (dark blue) and synthetic (light blue) datasets.


Note also that the results in the plot are on a z-value scale.

Applied longitudinal example – using ‘synthpop’ as an educational and open research tool

The above example highlights how data simulated using ‘synthpop’ can mimic basic results from longitudinal studies. However, one of the key assets of longitudinal studies like ALSPAC is the ability to capture traits over time in a repeated measures context. Currently, ‘synthpop’ does not provide dedicated functionality for synthesising repeated measures data. However, using the framework above, it is possible to simulate repeated measures (for example, but not limited to, height, weight, substance use, test scores and mental health) to create trajectories of simulated data that mimic the real data. Here we give two examples adapted from existing research within ALSPAC to highlight how repeated measures data can be simulated using ‘synthpop’ (note that, unlike the simple example above, the observed ALSPAC data for these analyses are not openly available, although the synthetic data are; see the 'Data Availability' section). For both examples, we synthesised datasets using the default CART approach with these longitudinal data in ‘wide’ format (i.e., one row per participant, with time-points as separate variables); synthesised data were then converted to ‘long’ format (i.e., multiple rows per participant, one per time-point) for the multi-level modelling analysis.
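The wide-to-long conversion can be sketched as follows (a stdlib Python illustration with invented column names; in practice this reshaping would typically be done in R, e.g. with base R's 'reshape' or tidyr's 'pivot_longer'):

```python
def wide_to_long(rows, id_col, time_cols):
    """Reshape one-row-per-participant 'wide' data into one-row-per-
    time-point 'long' data, dropping missing (None) measurements."""
    long_rows = []
    for row in rows:
        for age, col in time_cols.items():
            if row.get(col) is not None:
                long_rows.append({"id": row[id_col], "age": age, "height": row[col]})
    return long_rows

# Hypothetical synthetic wide rows: height at (approximately) ages 7 and 9
wide = [
    {"id": 1, "ht7": 121.5, "ht9": 133.0},
    {"id": 2, "ht7": 119.0, "ht9": None},   # missing age-9 measurement
]
long_data = wide_to_long(wide, "id", {7: "ht7", 9: "ht9"})
```

Because missing measurements are simply dropped when reshaping, participants contribute however many occasions they have, which is what allows the multi-level models to use all available data.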

The first example examines height trajectories from childhood to early adulthood using a multilevel modelling framework, similar to that of Howe et al. ( Howe et al., 2012; Howe et al., 2016), using up to eight occasions of height between approximately 7 and 18 years of age. The second example uses growth mixture modelling of depression trajectories across adolescence and early adulthood, examining both the association between adolescent self-esteem and these trajectories, and the association between these trajectories and later depression, similar to that of Kwong et al. ( Kwong et al., 2019) and López-López et al. ( López-López et al., 2020), using up to nine occasions of depressive symptoms between approximately 10 and 24 years of age. For further details on these methods, please refer to the original papers.

As shown in Figure 5 and Table 1, synthetic height trajectories are almost identical to those from the observed data when assessed using multi-level growth models. Both trajectories show the same rate of change, and the estimates from the model are nearly identical across both datasets. The marginal differences are likely to reflect different sample sizes or random variability in the synthesis model (synthetic n=10,261, observed n=10,059).

Figure 5. Synthetic (solid blue line) and observed (dashed orange line) trajectories of height in ALSPAC.


Table 1. Synthetic and observed estimates from the height trajectories.

Observed ( n=10,059, n obs =53,853) Synthetic ( n=10,261, n obs =53,864)
Fixed effects Beta Std Err 95% CI P Beta Std Err 95% CI P
   Age 5.19 0.01 5.17 to 5.21 <0.001 5.17 0.01 5.15 to 5.19 <0.001
   Age 2 -0.24 0.00 -0.24 to -0.23 <0.001 -0.24 0.00 -0.25 to -0.24 <0.001
   Intercept 152.74 0.07 152.60 to 152.88 <0.001 152.80 0.07 152.66 to 152.94 <0.001
Random effects Estimate Std Err 95% CI Estimate Std Err 95% CI
   var(Age) 0.45 0.01 0.43 to 0.47 0.48 0.01 0.46 to 0.51
   var(Age 2) 0.03 0.00 0.03 to 0.04 0.03 0.00 0.03 to 0.03
   var(Intercept) 46.54 0.72 45.16 to 47.97 45.50 0.70 44.14 to 46.90
   cov(Age, Age 2) 0.09 0.00 0.09 to 1.00 0.09 0.00 0.09 to 1.00
   cov(Age, Intercept) 0.95 0.07 0.82 to 1.07 0.96 0.07 0.82 to 1.09
   cov(Age 2, Intercept) -0.51 0.02 -0.55 to -0.47 -0.48 0.02 -0.52 to -0.45
   var(Residual) 6.32 0.05 6.22 to 6.43 7.35 0.06 7.23 to 7.47

Note: a random intercept and random slope model was used to estimate the height trajectories, with the age variable mean-centred at 12 years (the mean age across all assessments). We used a quadratic polynomial age term to allow for non-linearity. var: variance; cov: covariance; Std Err: standard error; CI: confidence interval
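The growth model described in the note could be sketched as follows (shown here in R with the ‘lme4’ package, although the accompanying analysis scripts use Stata; a long-format data frame ‘synth_long’ with columns id, age and height is an assumption for illustration):

```r
library(lme4)

# Centre age at 12 years (the mean age across assessments) and fit a
# random intercept and random slope model with a quadratic age term
synth_long$age_c <- synth_long$age - 12
fit <- lmer(height ~ age_c + I(age_c^2) +
              (1 + age_c + I(age_c^2) | id),
            data = synth_long)
summary(fit)
```

The random-effects part `(1 + age_c + I(age_c^2) | id)` corresponds to the variance and covariance terms reported in Table 1.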

Furthermore, when adding an interaction term of sex to estimate male and female height trajectories, the synthetic and observed data created trajectories that were nearly identical, as shown in Figure 6 and Table 2. However, the main effect of sex (i.e., the difference in height between males and females at age 12 [the mean age across all assessments]) did vary between the observed and synthetic data, which is likely a result of random variability in the synthesis model. Nevertheless, as shown in Figure 6, this had little effect on the estimation of the height trajectories, and it is worth noting that the confidence intervals for the main effect overlap, as in the example above.

Figure 6. Synthetic male (solid blue line), synthetic female (solid green line), observed male (dashed orange line) and observed female (dashed red line) trajectories of height in ALSPAC.


Table 2. Synthetic and observed estimates from the height trajectories by sex.

Observed ( n=10,053, n obs =53,818) Synthetic ( n=9,834, n obs =52,557)
Fixed effects Beta Std Err 95% CI P Beta Std Err 95% CI P
   Age 5.65 0.01 5.64 to 5.67 <0.001 5.65 0.01 5.63 to 5.67 <0.001
   Female 0.05 0.14 -0.23 to 0.33 0.751 -0.35 0.14 -0.63 to -0.06 0.016
   Female*Age -0.98 0.01 -1.01 to -0.96 <0.001 -0.99 0.01 -1.02 to -0.97 <0.001
   Age 2 -0.11 0.00 -0.11 to -0.10 <0.001 -0.12 0.00 -0.12 to -0.11 <0.001
   Female*Age 2 -0.28 0.00 -0.29 to -0.27 <0.001 -0.26 0.00 -0.27 to -0.25 <0.001
   Intercept 152.72 0.10 152.53 to 152.92 <0.001 153.00 0.10 152.80 to 153.19 <0.001
Random effects Estimate Std Err 95% CI Estimate Std Err 95% CI
   var(Age) 0.14 0.01 0.13 to 0.15 0.17 0.01 0.16 to 0.18
   var(Age 2) 0.01 0.00 0.01 to 0.01 0.01 0.00 0.01 to 0.01
   var(Intercept) 46.50 0.72 45.12 to 47.93 45.20 0.71 43.83 to 46.62
   cov(Age, Age 2) 0.01 0.00 0.01 to 0.01 0.02 0.00 0.01 to 0.02
   cov(Age, Intercept) 0.89 0.04 0.80 to 0.97 0.77 0.05 0.68 to 0.87
   cov(Age 2, Intercept) -0.53 0.01 -0.55 to -0.50 -0.53 0.01 -0.55 to -0.50
   var(Residual) 6.54 0.06 6.44 to 6.65 7.54 0.06 7.42 to 7.67

Note: a random intercept and random slope model was used to estimate the height trajectories, with the age variable mean-centred at 12 years (the mean age across all assessments). We used a quadratic polynomial age term to allow for non-linearity. Female was coded as 0=male, 1=female. var: variance; cov: covariance; Std Err: standard error; CI: confidence interval
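The sex-specific trajectories in Table 2 correspond to adding a main effect of sex plus its interactions with both age terms; a sketch under the same illustrative assumptions (a long-format data frame ‘synth_long’, with ‘female’ coded 0/1):

```r
library(lme4)

# Interactions of female (0 = male, 1 = female) with the linear and
# quadratic age terms give separate male and female trajectories;
# the 'female' main effect is the sex difference in height at age 12
fit_sex <- lmer(height ~ (age_c + I(age_c^2)) * female +
                  (1 + age_c + I(age_c^2) | id),
                data = synth_long)
```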

Building on the example above, we show that synthetic data can also be used for more advanced forms of growth curve modelling. As shown in Figure 7, synthetic datasets produce patterns of depression trajectories similar to the observed data when analysed using growth mixture modelling (GMM). Several features used to assess model fit within GMM were similar between the synthetic and observed data, including entropy, Bayesian Information Criterion (BIC) and Sample Size Adjusted BIC, which were 0.735, 249378.07 and 249304.98 for the synthetic data and 0.734, 251430.95 and 251357.86 for the observed data, respectively. In addition, the estimates from adjusted regression models using synthetic data mimic the estimates from the adjusted regression models using observed data. For example, Table 3 shows that higher self-esteem in childhood is associated with lower relative risk ratios for each of the trajectories, and this pattern matches across both the synthetic and observed data. Furthermore, Table 4 shows that worse depression trajectories across adolescence and early adulthood are associated with greater odds of depression later on, and these estimates are almost identical for both the synthetic and observed datasets.

Figure 7. Synthetic (dashed blue lines) and observed (solid orange lines) data depression trajectories using growth mixture modelling.


Table 3. Association between self-esteem and different depression trajectories (RRR, Std Err and P value) with synthetic and observed data.

Stable Low vs Stable High; Stable Low vs Increasing; Stable Low vs Transient; Stable Low vs Decreasing

Synthetic (n=3,719)
   Self-esteem: 0.17 (0.02), P<0.001; 0.44 (0.03), P<0.001; 0.29 (0.02), P<0.001; 0.51 (0.04), P<0.001

Observed (n=3,850)
   Self-esteem: 0.11 (0.01), P<0.001; 0.42 (0.03), P<0.001; 0.23 (0.02), P<0.001; 0.58 (0.05), P<0.001

Note: Depression trajectories were created using growth mixture modelling with 9 time points (see ( Kwong et al., 2019) for further details). Models were adjusted for maternal depression, maternal education and financial problems. Self-esteem was standardised to have a mean of 0 and a SD of 1; higher scores reflect higher self-esteem. RRR: relative risk ratio; Std Err: standard error.
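The relative risk ratios in Table 3 come from a multinomial model of trajectory class on self-esteem; a sketch (shown here with the ‘nnet’ package in R for illustration; variable names such as ‘traj_class’ and ‘self_esteem’ are assumptions, not the actual ALSPAC names):

```r
library(nnet)

# 'Stable Low' as the reference trajectory class; covariates as in the note
synth_dep$traj_class <- relevel(factor(synth_dep$traj_class), ref = "Stable Low")
fit_rrr <- multinom(traj_class ~ self_esteem + mat_depression +
                      mat_education + financial_problems,
                    data = synth_dep)
exp(coef(fit_rrr))  # exponentiated coefficients are relative risk ratios
```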

Table 4. Association between depression trajectories and later depression using synthetic and observed data.

Observed ( n=3,432) Synthetic ( n=3,215)
Comparison OR Std Err 95% CI P OR Std Err 95% CI P
   Stable Low vs Stable High 9.61 2.07 6.32 to 14.63 <0.001 9.18 2.48 5.40 to 15.59 <0.001
   Stable Low vs Increasing 5.78 0.71 4.53 to 7.36 <0.001 5.27 0.65 4.12 to 6.71 <0.001
   Stable Low vs Transient 3.50 0.46 2.70 to 4.52 <0.001 2.75 0.38 2.09 to 3.62 <0.001
   Stable Low vs Decreasing 2.46 0.49 1.67 to 3.62 <0.001 3.33 0.63 2.30 to 4.81 <0.001

Note: Depression trajectories were created using growth mixture modelling with 9 time points (see ( Kwong et al., 2019) for further details). Models were adjusted for maternal depression, maternal education and financial problems. Later depression was assessed two years after the trajectories. OR: odds ratio; Std Err: standard error; CI: confidence interval.
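Similarly, the odds ratios in Table 4 correspond to a logistic regression of later depression on trajectory class; a sketch under the same illustrative assumptions (‘later_depression’ coded 0/1):

```r
# Logistic regression with 'Stable Low' as the reference trajectory class
fit_or <- glm(later_depression ~ traj_class + mat_depression +
                mat_education + financial_problems,
              data = synth_dep, family = binomial)
exp(cbind(OR = coef(fit_or), confint.default(fit_or)))  # ORs with Wald 95% CIs
```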

Discussion

We recognise the importance of open science practices for longitudinal population studies, while also acknowledging the need for such studies to maintain control over access to potentially-sensitive data. We believe that the synthetic data approach described in this paper provides a reasonable compromise between these competing demands, allowing data users to make de-identified synthesised data openly available while complying with best practice in data security and participant confidentiality (see also ( Quintana, 2020; Shepherd et al., 2017)). While the focus of this paper has been on longitudinal studies, and ALSPAC in particular, the suggestions and guidelines presented here may also help inform the sharing of potentially-sensitive individual-level data in many other areas and disciplines ( Quintana, 2020). These methods can also be used to construct synthetic datasets for educational purposes, such as demonstrating complex methods like modelling trajectories from repeated measures and growth mixture modelling, as illustrated above. We end with a brief discussion of some clarifications and potential limitations of this synthetic data approach.

While undoubtedly a useful pedagogical tool, and beneficial for open science practices, we state clearly here that these synthetic datasets should not be used in place of the actual observed data for research purposes; that is, synthesised data should never be used for a final published analysis. While hopefully similar on average to the observed data, the synthesised relations between variables may not be preserved perfectly and hence may provide different results, so published work should only ever be based on the observed data. To try and avoid synthetic ALSPAC data being used for research purposes – whether knowingly or otherwise – we have put a number of safe-guards in place (see points 2, 6 and 7 of the ‘ALSPAC guidelines for releasing synthetic data using ‘synthpop’’ section). Additional measures will be put in place should synthetic ALSPAC data be found to have been used for these purposes.

We also note that the recommendations and guidelines above only apply to data synthesised using ‘synthpop’, rather than via other simulation methods, such as using ALSPAC summary statistics or regression results to inform simulation parameters (for an example study using the latter approach to explore selection bias in ALSPAC, see ( Millard et al., 2023)). We make this distinction because these latter forms of simulation are only based on summary-level data, meaning that simulated data points do not correspond to data from individuals in the observed dataset and can sometimes take on impossible values (e.g., negative and/or decimal values, if the original scale was positive and/or only took integer values). In contrast, when using sampling, tree-based methods or ranked modelling methods within ‘synthpop’, the synthesised values are taken directly from the observed data, making synthesised data more faithful to the observed data, but also potentially increasing the risk of disclosure. In addition, the equations and parameters used to simulate data from summary statistics are transparent, meaning that it is obvious that the data are simulated. For ‘synthpop’, on the other hand, the synthesis methods and parameters are much more opaque; greater attention to potential disclosure of individual-level information is therefore needed when synthesising data using ‘synthpop’.
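This distinction can be illustrated with a toy example in R (the numbers are invented purely for illustration):

```r
# Simulating from summary statistics alone (e.g., a published mean and SD)
# can produce values outside a variable's observed support:
set.seed(42)
sim_score <- rnorm(1000, mean = 2.1, sd = 1.8)  # say, a 0-10 integer questionnaire score
any(sim_score < 0)        # negative scores can occur
any(sim_score %% 1 != 0)  # non-integer values occur

# By contrast, sampling- and tree-based methods in 'synthpop' draw synthetic
# values directly from the observed data, so impossible values cannot arise
# (at the cost of a potentially greater disclosure risk, as noted above)
```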

The present paper has focused on the ‘synthpop’ package, although other software for synthetic data generation is available ( Raghunathan, 2021; Shepherd et al., 2017), such as the Imputation and Variance Estimation Software ( IVEware; ( Raghunathan et al., 2022)). However, at present, for synthesising ALSPAC data we recommend using the ‘synthpop’ package because it contains built-in statistical disclosure control functionality to automatically remove potentially disclosive observations. Please talk to ALSPAC first if you wish to use an alternative method for data synthesis. At present, ALSPAC also does not permit data to be made openly-available via other approaches which aim to anonymise and de-identify participants (e.g., statistical disclosure control; Templ et al., 2015), as these still largely make use of observed – rather than wholly synthetic – data, although this may change in the future.
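For instance, ‘synthpop’ includes the sdc() function for post-hoc statistical disclosure control; a minimal sketch, assuming a synthesis object ‘syn_obj’ created from an observed data frame ‘observed_data’ (both names are illustrative):

```r
library(synthpop)

# Identify synthetic rows that exactly replicate unique records in the
# observed data (the most obviously disclosive cases)
replicated.uniques(syn_obj, observed_data)

# Remove such rows and stamp the data with a label so the dataset cannot
# easily be mistaken for real data
syn_obj <- sdc(syn_obj, data = observed_data,
               rm.replicated.uniques = TRUE,
               label = "FAKE_DATA")
```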

A potential limitation of the ‘synthpop’ package is that it is primarily designed for datasets with independent observations, rather than more complex situations such as multi-level/hierarchical data. While it can be applied in such circumstances – such as for longitudinal modelling with repeated measures, as demonstrated above – the correspondence between the observed and synthetic data needs to be assessed carefully and cannot be assumed to hold. For instance, when using the standard ‘synthpop’ package, synthesising data in ‘wide’ format (as used here) appears to work well, but this is unlikely to hold for data in ‘long’ format, as information on the relations between observations within individuals would be lost. There are user-written extensions to ‘synthpop’ which describe how to synthesise hierarchical data (see http://gradientdescending.com/synthesising-multiple-linked-data-sets-in-r/), but these methods are not covered in this paper. The ‘synthpop’ package can, however, be used to synthesise other data types, such as time-to-event/survival data (see also ( Smith et al., 2022)).

A further limitation is that the ‘synthpop’ package is primarily available only in the R programming language (although an implementation of ‘synthpop’ is also available in Python). While alternative synthesis software such as IVEware is compatible with a larger number of statistical packages (e.g., R, Stata, SPSS and SAS), as discussed above we do not recommend this approach due to its lack of statistical disclosure control measures. We hope that the step-by-step guide, along with the more detailed R scripts associated with this data note, will enable even researchers unfamiliar with the R programming language to successfully create synthesised datasets.

The release of synthetic datasets from longitudinal population-based studies is novel (to the best of our knowledge, ALSPAC is the first such study to create specific guidelines for sharing synthetic datasets). We encourage other longitudinal population-based studies to build upon the knowledge and guidelines developed by ALSPAC in creating their own synthetic data sharing policies, furthering the promotion of open and transparent research. We believe that our current approach is comprehensive, and that our guidelines could therefore be readily adopted by other studies and adapted as required. We would also welcome feedback from other longitudinal population-based studies and their users to help us improve our guidelines and processes.

There are also some challenges and open questions: for instance, whether the safe-guards discussed above are sufficient to prevent synthetic data being used in place of real data. A further issue is the question of ownership; that is, who owns the synthetic data? At present, synthetic ALSPAC data may be made freely available to all without the need for managed access or agreements to be signed. This is largely to make the synthetic data as open and accessible as possible, while minimising additional administrative work for ALSPAC staff. If needed, however, this may change in the future. An additional question is whether such synthetic datasets meet journal and funder requirements for data sharing. Although sharing synthetic data is clearly better than sharing no data, clarity from funders and journals is required to definitively answer this. However, in our experience, some journals with clear policies mandating data sharing have been open to the sharing of synthetic ALSPAC data, given that the raw ALSPAC data cannot be released.

To end, we stress that, wherever possible, the observed raw data – alongside the analysis code ( Goldacre et al., 2019; Goldstein et al., 2020; Localio et al., 2018) – should be made openly available to facilitate fully-reproducible open science ( Goldstein, 2018; Harper, 2019; Munafò et al., 2017; Peng et al., 2006). Where this is not feasible, either to preserve participant confidentiality or to ensure only legitimate researchers can access the resource, releasing synthetic datasets is a useful and pragmatic alternative, which enables research to be ‘quasi-reproducible’ ( Shepherd et al., 2017). For a recent example of such an approach which includes openly-available synthetic ALSPAC data, see ( Major-Smith, 2023). We hope to see an increasing number of papers, both in ALSPAC and more widely, using synthetic generation methods to make potentially-sensitive datasets openly available.

Consent

Ethical approval for this study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Informed consent for the use of data collected via questionnaires and clinics was obtained from participants following the recommendations of the ALSPAC Ethics and Law Committee at the time. Study participants have the right to withdraw their consent for elements of the study or from the study entirely. Full details of the ALSPAC consent procedures are available on the study website ( http://www.bristol.ac.uk/alspac/researchers/research-ethics/).

Acknowledgements

We are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses.

Funding Statement

The UK Medical Research Council and Wellcome Trust (Grant ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors and Daniel Major-Smith, Alex S F Kwong and Kate Northstone will serve as guarantors for the contents of this paper. DM-S was supported by the John Templeton Foundation (ref no. 61917). This research was funded in whole, or in part, by the Wellcome Trust [217065, https://doi.org/10.35802/217065]. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. A comprehensive list of grant funding is available on the ALSPAC website (http://www.bristol.ac.uk/alspac/external/documents/grant-acknowledgements.pdf).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 3 approved, 1 approved with reservations]

Data availability

Please see the ALSPAC data management plan, which describes the policy regarding data sharing through a system of managed open access ( http://www.bristol.ac.uk/alspac/researchers/data-access/documents/alspac-data-management-plan.pdf). Other than the freely available ALSPAC dataset and the synthetic datasets (see below), all other data used for this submission will be made available on request to the Executive ( alspac-exec@bristol.ac.uk). These datasets are linked to ALSPAC project number B4301; please quote this project number during your application. The following datasets and analysis code files supporting this submission are available on DM-S’s GitHub page ( https://github.com/djsmith-90/synthetic-data, available under a GPL-3.0 license, archived at the time of publication: https://doi.org/10.5281/zenodo.10457847, djsmith-90, (2024)); this includes:

  1) “SynthPopExample.r”: An example R script to replicate the step-by-step example using the openly available ALSPAC dataset, as well as explore some additional ‘synthpop’ functionality (parametric synthesis, smoothing parameters, top- and bottom-coding, synthesis using different numbers of variables, etc.). The openly available subset of ALSPAC data used for this example is available here ( https://doi.org/10.17605/OSF.IO/8SGZE; Northstone et al., 2022).

  2) “synthpop_repeated-measures_script”: This script processes and synthesises the datasets for the longitudinal modelling examples.

  3) “Simulated_height.dta”: The synthesised ALSPAC dataset for the multi-level growth models of height, in Stata format (note that the corresponding observed ALSPAC data files are not available for these analyses).

  4) “analysis_height”: The Stata script to perform the multi-level growth models on the “Simulated_height.dta” dataset.

  5) “Simulated_depression_mplus.dta”: The synthesised ALSPAC dataset for performing the growth mixture modelling analysis of depression, in Stata format (note that the corresponding observed ALSPAC data files are not available for these analyses).

  6) “prep_analysis_depression”: Script which initially processes the data for the growth mixture modelling analysis (in Stata), followed by the MPlus code to perform the growth mixture modelling analysis.

The steps below highlight how to apply for access to the data included in the data note and all other ALSPAC data:

Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool: http://www.bristol.ac.uk/alspac/researchers/our-data/.

References

  1. Abbasi K: A commitment to act on data sharing. BMJ. 2023;382: 1609. 10.1136/bmj.p1609 [DOI] [Google Scholar]
  2. Beduschi A: Synthetic data protection: towards a paradigm change in data regulation? Big Data & Society. 2024;11(1):20539517241231277. 10.1177/20539517241231277 [DOI] [Google Scholar]
  3. Bouter L: Why research integrity matters and how it can be improved. Account Res. 2023;11:1–10. 10.1080/08989621.2023.2189010 [DOI] [PubMed] [Google Scholar]
  4. Boyd A, Golding J, Macleod J, et al. : Cohort profile: the 'children of the 90s'--the index offspring of the Avon Longitudinal Study of Parents and Children. Int J Epidemiol. 2013;42(1):111–127. 10.1093/ije/dys064 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Colditz GA: Constraints on data sharing: experience from the nurses' health study. Epidemiology. 2009;20(2):169–171. 10.1097/EDE.0b013e318196ad0f [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Coughlin SS: Reproducing epidemiologic research and ensuring transparency. Am J Epidemiol. 2017;186(4):393–394. 10.1093/aje/kwx065 [DOI] [PubMed] [Google Scholar]
  7. Cox JL, Holden JM, Sagovsky R: Detection of postnatal depression. Development of the 10-item Edinburgh Postnatal Depression Scale. Br J Psychiatry. 1987;150:782–786. 10.1192/bjp.150.6.782 [DOI] [PubMed] [Google Scholar]
  8. Decety J, Cowell JM, Lee K, et al. : RETRACTED: the negative association between religiousness and children’s altruism across the world. Curr Biol. 2015;25(22):2951–2955. 10.1016/j.cub.2015.09.056 [DOI] [PubMed] [Google Scholar]
  9. djsmith-90: djsmith-90/synthetic-data: v1.0.0 (v1.0.0). Zenodo.[Dataset]2024. 10.5281/zenodo.10457847 [DOI]
  10. Elhakeem A, Hughes RA, Tilling K, et al. : Using linear and natural cubic splines, SITAR, and latent trajectory models to characterise nonlinear longitudinal growth trajectories in cohort studies. BMC Med Res Methodol. 2022;22(1):1–20. 10.1186/s12874-022-01542-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Federer LM, Belter CW, Joubert DJ, et al. : Data sharing in PLOS ONE : an analysis of Data Availability Statements. PLoS One. 2018;13(5): e0194768. 10.1371/journal.pone.0194768 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Fraser A, Macdonald-Wallis C, Tilling K, et al. : Cohort profile: the avon longitudinal study of parents and children: ALSPAC mothers cohort. Int J Epidemiol. 2013;42(1):97–110. 10.1093/ije/dys066 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Goldacre B, Morton CE, DeVito NJ: Why researchers should share their analytic code. BMJ. 2019;367: l6365. 10.1136/bmj.l6365 [DOI] [PubMed] [Google Scholar]
  14. Goldstein ND: Toward open-source epidemiology. Epidemiology. 2018;29(2):161–164. 10.1097/EDE.0000000000000782 [DOI] [PubMed] [Google Scholar]
  15. Goldstein ND, Hamra GB, Harper S: Are descriptions of methods alone sufficient for study reproducibility? An example from the cardiovascular Literature. Epidemiology. 2020;31(2):184–188. 10.1097/EDE.0000000000001149 [DOI] [PubMed] [Google Scholar]
  16. Hardwicke TE, Mathur MB, MacDonald K, et al. : Data availability, reusability, and analytic reproducibility: evaluating the impact of a mandatory open data policy at the journal Cognition. R Soc Open Sci. 2018;5(8): 180448. 10.1098/rsos.180448 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Harper S: A future for observational epidemiology: clarity, credibility, transparency. Am J Epidemiol. 2019;188(5):840–845. 10.1093/aje/kwy280 [DOI] [PubMed] [Google Scholar]
  18. Hogue CJ: Ethical issues in sharing epidemiologic data. J Clin Epidemiol. 1991;44(Suppl 1):103–107. 10.1016/0895-4356(91)90183-a [DOI] [PubMed] [Google Scholar]
  19. House of Commons Science Innovation and Technology Committee: Reproducibility and Research Integrity.2023.
  20. Howe LD, Tilling K, Galobardes B, et al. : Socioeconomic differences in childhood growth trajectories: at what age do height inequalities emerge? J Epidemiol Community Health. 2012;66(2):143–148. 10.1136/jech.2010.113068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Howe LD, Tilling K, Matijasevich A, et al. : Linear spline multilevel models for summarising childhood growth trajectories: a guide to their application using examples from five birth cohorts. Stat Methods Med Res. 2016;25(5):1854–1874. 10.1177/0962280213503925 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Kwong ASF, López-López JA, Hammerton G, et al. : Genetic and environmental risk factors associated with trajectories of depression symptoms from adolescence to young adulthood. JAMA Netw Open. 2019;2(6): e196587. 10.1001/jamanetworkopen.2019.6587 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Lawlor DA, Lewcock M, Rena-Jones L, et al. : The second generation of the avon longitudinal study of parents and children (ALSPAC-G2): a cohort profile [version 2; peer review: 2 approved]. Wellcome Open Res. 2019;4:36. 10.12688/wellcomeopenres.15087.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Lewis G, Pelosi AJ, Araya R, et al. : Measuring psychiatric disorder in the community: a standardized assessment for use by lay interviewers. Psychol Med. 1992;22(2):465–486. 10.1017/s0033291700030415 [DOI] [PubMed] [Google Scholar]
  25. Localio AR, Goodman SN, Meibohm A, et al. : Statistical code to support the scientific story. Ann Intern Med. 2018;168(11):828–829. 10.7326/M17-3431 [DOI] [PubMed] [Google Scholar]
  26. López-López JA, Kwong ASF, Washbrook E, et al. : Trajectories of depressive symptoms and adult educational and employment outcomes. BJPsych Open. 2020;6(1):e6. 10.1192/bjo.2019.90 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Major-Smith D: Exploring causality from observational data: an example assessing whether religiosity promotes cooperation. Evol Hum Sci. 2023;5:e22. 10.1017/ehs.2023.17 [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Major-Smith D, Heron J, Fraser A, et al. : The Avon Longitudinal Study of Parents and Children (ALSPAC): a 2022 update on the enrolled sample of mothers and the associated baseline data [version 1; peer review: 2 approved]. Wellcome Open Res. 2023;7:283. 10.12688/wellcomeopenres.18564.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Malički M, Jerončić A, Aalbersberg IJ, et al. : Systematic review and meta-analyses of studies analysing instructions to authors from 1987 to 2017. Nat Commun. 2021;12(1): 5840. 10.1038/s41467-021-26027-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Mathur MB, Fox MP: Toward open and reproducible epidemiology. Am J Epidemiol. 2023;192(4):658–664. 10.1093/aje/kwad007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Millard LA, Fernández-Sanlés A, Carter AR, et al. : Exploring the impact of selection bias in observational studies of COVID-19: a simulation study. Int J Epidemiol. 2023;52(1):44–57. 10.1093/ije/dyac221 [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Minocher R, Atmaca S, Bavero C, et al. : Estimating the reproducibility of social learning research published between 1955 and 2018. R Soc Open Sci. 2021;8(9): 210450. 10.1098/rsos.210450 [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Munafò MR, Nosek BA, Bishop DVM, et al. : A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021. 10.1038/s41562-016-0021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Northstone K, Heron J, Smith D, et al. : ALSPAC Masters Training Dataset (Stata).2022. 10.17605/OSF.IO/8SGZE [DOI]
  35. Northstone K, Lewcock M, Groom A, et al. : The Avon Longitudinal Study of Parents and Children (ALSPAC): an update on the enrolled sample of index children in 2019 [version 1; peer review: 2 approved]. Wellcome Open Res. 2019;4:51. 10.12688/wellcomeopenres.15132.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Northstone K, Shlomo YB, Teyhan A, et al. : The Avon Longitudinal Study of Parents and children ALSPAC G0 partners: a cohort profile [version 1; peer review: 1 approved with reservations]. Wellcome Open Res. 2023;8:37. 10.12688/wellcomeopenres.18782.1 [DOI] [Google Scholar]
  37. Nowok B, Raab GM, Dibben C: Synthpop: bespoke creation of synthetic data in R. J Stat Softw. 2016;74(11):1–26. 10.18637/jss.v074.i11 [DOI] [Google Scholar]
  38. Nowok B, Raab GM, Dibben C: Providing bespoke synthetic data for the UK longitudinal studies and other sensitive data with the synthpop package for R. Stat J IAOS. 2017;33(3):785–796. 10.3233/SJI-150153 [DOI] [Google Scholar]
  39. Peng RD, Dominici F, Zeger SL: Reproducible epidemiologic research. Am J Epidemiol. 2006;163(9):783–789. 10.1093/aje/kwj093 [DOI] [PubMed] [Google Scholar]
  40. Quintana DS: A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis generation. eLife. 2020;9: e53275. 10.7554/eLife.53275 [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Raab GM, Nowok B, Dibben C: Guidelines for producing useful synthetic data. arXiv Prepr. 2017. 10.48550/arXiv.1712.04078 [DOI] [Google Scholar]
  42. R Development Core Team: R: A language and environment for statistical computing.2021. [Google Scholar]
  43. Raghunathan TE: Synthetic data. Annu Rev Stat Its Appl. 2021;8:129–140. 10.1146/annurev-statistics-040720-031848 [DOI] [Google Scholar]
  44. Raghunathan TE, Solenberger P, Berglund J: IVEware: Imputation and Variance Estimation Software.2022. Reference Source [Google Scholar]
  45. Samet JM: Data: to share or not to share? Epidemiology. 2009;20(2):172–174. Reference Source [DOI] [PubMed] [Google Scholar]
  46. Shariff AF, Willard AK, Muthukrishna M, et al. : What is the association between religious affiliation and children’s altruism? Curr Biol. 2016;26(15):R699–R700. 10.1016/j.cub.2016.06.031 [DOI] [PubMed] [Google Scholar]
  47. Shepherd BE, Peratikos MB, Rebeiro F, et al. : A pragmatic approach for reproducible research with sensitive data. Am J Epidemiol. 2017;186(4):387–392. 10.1093/aje/kwx066 [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Smaldino E, Turner MA, Contreras Kallens A: Open science and modified funding lotteries can impede the natural selection of bad science. R Soc Open Sci. 2019;6(7): 190194. 10.1098/rsos.190194 [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Smith A, Lambert C, Rutherford MJ: Generating high-fidelity synthetic time-to-event datasets to improve data transparency and accessibility. BMC Med Res Methodol. 2022;22(1): 176. 10.1186/s12874-022-01654-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Tedersoo L, Küngas R, Oras E, et al. : Data sharing practices and data availability upon request differ across scientific disciplines. Sci Data. 2021;8(1): 192. 10.1038/s41597-021-00981-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Templ M, Kowarik A, Meindl B: Statistical disclosure control for micro-data using the R package sdcMicro. J Stat Softw. 2015;67:1–36. 10.18637/jss.v067.i04 [DOI] [Google Scholar]
  52. van Buuren S: Flexible imputation of missing data.CRC Press, Boca Raton, FL,2018. 10.1201/9780429492259 [DOI] [Google Scholar]
  53. Westreich D, Greenland S: The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol. 2013;177(4):292–298. 10.1093/aje/kws412 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wellcome Open Res. 2025 Feb 5. doi: 10.21956/wellcomeopenres.25979.r116323

Reviewer response for version 2

Phillip Melton 1

I thank the authors of this manuscript for their thoughtful responses to my earlier comments that they have addressed in this updated version of the manuscript.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

biostatistics, longitudinal data, cohort studies, data science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2025 Jan 10. doi: 10.21956/wellcomeopenres.25979.r116944

Reviewer response for version 2

Venet Osmani 1

The motivation of this paper is commendable. There is indeed an urgent need to make health data more widely available to advance science and mitigate the reproducibility crisis.

However, there are a number of ways the paper can be further improved.

  1. The authors' goal, stated in the introduction, is to generate research-quality synthetic data, i.e. data where 'any analyses and conclusions ought to be similar'. However, the method (synthpop) is not very well suited for this objective. Indeed, the authors state this in the discussion that synthetic variables may take on impossible values.

    Therefore, there appears to be little gain in using synthpop with respect to the approach adopted by the NHS artificial data pilot, where the risk of disclosure is virtually zero. Can authors comment on this point?

    Perhaps a more sophisticated method that can generate research-quality synthetic data should be considered if this goal is to be achieved.

  2. Tree-based methods are not well equipped to handle longitudinal data. Authors address this issue by creating a 'wide' data format, however at the expense of assuming that each time-point measurement is an independent variable, which of course is not correct.

  3. I liked the recommendations. I would add another: to consider whether the synthetic dataset is useful, that is to compare whether results in a downstream task (e.g. an outcome prediction) are similar between the real and the synthetic datasets.
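The downstream-task check suggested in point 3 can be sketched in a few lines. The following is a minimal, stdlib-only Python illustration (synthpop itself is an R package; the data, the linear model, and all numbers here are invented purely to show the shape of the comparison): fit the same simple regression to an "observed" and a "synthetic" dataset and compare the resulting slopes.

```python
import random

random.seed(42)

def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# "Observed" data: y depends on x with a true slope of 2.
x_obs = [random.gauss(0, 1) for _ in range(1000)]
y_obs = [2.0 * x + random.gauss(0, 1) for x in x_obs]

# "Synthetic" data generated from a model fitted to the observed data,
# mimicking what a synthesiser aims to preserve.
slope_obs = fit_slope(x_obs, y_obs)
x_syn = [random.gauss(0, 1) for _ in range(1000)]
y_syn = [slope_obs * x + random.gauss(0, 1) for x in x_syn]

slope_syn = fit_slope(x_syn, y_syn)

# If the synthesis preserved the relationship, the two slopes should agree closely.
print(f"observed slope: {slope_obs:.2f}, synthetic slope: {slope_syn:.2f}")
```

If the two estimates diverge substantially, the synthetic data have not preserved this particular relationship and a different synthesis specification may be warranted.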

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Machine learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Wellcome Open Res. 2025 Jan 8. doi: 10.21956/wellcomeopenres.25979.r116321

Reviewer response for version 2

Sam Harper 1

Overall, I think the authors have done a good job of responding to the comments of the reviewers, and at this point I think we can probably just agree to disagree regarding certain questions. Specifically, I still think it would help readers and those wanting to use 'synthpop' if the authors could provide at least some guidance about how to actually determine the 'faithfulness' of the synthetic vs. original datasets.

The authors say:

even datasets which randomly synthesise each variable independent of all others can still be useful to explore the underlying code/analysis approach and checking for errors, even if all relations between the variables are lost 

But what kind of errors would I check for to determine whether or not the synthetic data are 'useful'? Just that the code can run, even if the results are not sensible? Or if X1 and X2 are correlated at 0.7 in the original source but the synthpop correlation is 0.1, is that useful? Again, I don't want to let the perfect be the enemy of the good here, but I even think that if the authors provided 1 or 2 examples of cases where they actually thought this *was not* useful, that would help readers a bit more. Given that they say, " we urge users to check the synthesised results against the observed data and examine whether the similarity is sufficient; if not, try a different specification and compare results". At least an example of 'non-sufficiency' here would help. 

And I could not tell whether the authors specifically mention it in the paper, but I think that one of the other benefits of 'synthpop' could be to provide a complete replication package for a published paper when the underlying data are sensitive/confidential and cannot be released. Including the synthetic data in a complete set of scripts can be extraordinarily helpful for readers and reviewers to have a better understanding of exactly how researchers generated their estimates and figures, even if the underlying data must be synthetic. 

Apart from that, I think the revised version is solid and will be helpful for readers interested in pushing open science further.
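The reviewer's correlation example (0.7 in the original source versus 0.1 in the synthetic data) can be turned into an explicit check. Below is a stdlib-only Python sketch with invented toy data; the 0.2 tolerance is an arbitrary threshold chosen for the example, not an established rule.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Invented toy data: the observed pair is strongly correlated...
x_obs = [1, 2, 3, 4, 5, 6]
y_obs = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]

# ...but the synthetic version has largely lost the relationship.
x_syn = [1, 2, 3, 4, 5, 6]
y_syn = [3.0, 1.5, 4.9, 2.2, 5.8, 3.3]

r_obs = pearson_r(x_obs, y_obs)
r_syn = pearson_r(x_syn, y_syn)

# Arbitrary illustrative tolerance: flag the variable pair if the
# correlation has drifted by more than 0.2.
if abs(r_obs - r_syn) > 0.2:
    print(f"correlation poorly preserved: observed {r_obs:.2f}, synthetic {r_syn:.2f}")
```

Running such a comparison over all pairwise correlations would give users a concrete, if crude, sense of which relationships the synthesis has retained.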

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

epidemiology, impact evaluation, reproducible research

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2024 Dec 31. doi: 10.21956/wellcomeopenres.25979.r116322

Reviewer response for version 2

Neal Goldstein 1

I thank the authors for their responsiveness to my initial comments. I have no additional requests at this time.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Epidemiology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2024 Sep 11. doi: 10.21956/wellcomeopenres.22725.r87781

Reviewer response for version 1

Phillip Melton 1

The paper “Releasing synthetic data from the Avon Longitudinal Study of Parents and Children (ALSPAC): Guidelines and applied examples” by Major-Smith et al. provides a real-world example of how to produce synthetic data from a large population cohort. The authors provide examples of the methodology used in the R package, synthpop, and then give an example of longitudinal analysis. Overall, the paper is well structured, but I have a few comments and suggestions that I hope the authors may find useful.

General comments:

  • The discussion would benefit from a paragraph on recommendations for other cohort studies that may be considering developing guidelines around synthetic data. Several other longitudinal cohort studies would potentially benefit from this.

  • Also, who technically owns the synthetic dataset: is this the author, or what role does ALSPAC play in this? Is there an agreement that the authors sign, and how is access managed?

  • Would synthetic data meet EU GDPR data sharing guidelines? While synthetic data are artificial, the GDPR may still apply (see Beduschi A et al. (2024 1))

Specific comments:

  • P3: When mentioning other tree-based and parametric methods beyond CART, it would help to specify some of these.  

  • P4. For longitudinal data, how is potential disclosure managed, given that temporal data has the potential to identify participants?

  • P5. More details on the specifics of the similarity of distributions. Guidance here or recommendations would be useful.

  • Figure 4: Why are so many of the Z-values 0 for the synthetic data? Is this because they are potentially categorical?

  • Tables: Showing the synthetic data and the observed data for each covariate on the same row may be easier to interpret for the reader.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

biostatistics, longitudinal data, cohort studies, data science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

  • 1. Beduschi A: Synthetic data protection: towards a paradigm change in data regulation? Big Data & Society. 2024;11(1). 10.1177/20539517241231277 [DOI] [Google Scholar]
Wellcome Open Res. 2024 Dec 17.
Daniel Smith 1

We thank the reviewer for their positive review and their constructive comments. We have responded to the reviewer’s comments in turn below, with the reviewer’s original comments in standard font and our responses in italics.  

General comments:

The discussion would benefit from a paragraph on recommendations for other cohort studies that may be considering developing guidelines around synthetic data. Several other longitudinal cohort studies would potentially benefit from this.

Response: This is a nice idea, thanks for suggesting it. We have now added a paragraph on this in the Discussion: “The release of synthetic datasets from longitudinal population-based studies is novel (to the best of our knowledge ALSPAC is the first such study to create specific guidelines for sharing synthetic datasets). We would welcome other longitudinal population-based studies to build upon the knowledge and guidelines developed by ALSPAC in creating their own synthetic data sharing policies, furthering the promotion of open and transparent research. We believe that our current approach is comprehensive and therefore our guidelines could be readily adopted by other studies and adapted as required. We would also welcome feedback from other longitudinal population-based studies and their users to help us improve our guidelines and processes.”

Also, who technically owns the synthetic dataset: is this the author, or what role does ALSPAC play in this? Is there an agreement that the authors sign, and how is access managed?

Response: We have added a second new paragraph in the Discussion following the paragraph above, which addresses this question (among others): “There are also some challenges and open questions; for instance, whether potential safe-guards to prevent synthetic data being used in the place of real data are sufficient (as discussed above). A further issue is the question of ownership; that is, who owns the synthetic data? At present, synthetic ALSPAC data is allowed to be made freely-available to all without the need for managed access or for agreements to be signed. This is largely to make the synthetic data as open and accessible as possible, while minimising additional administrative work for ALSPAC staff. If needed, however, this may change in the future. An additional question is whether such synthetic datasets meet journal and funder requirements for data sharing. Although sharing synthetic data is clearly better than sharing no data, clarity from funders and journals is required to definitively answer this. However, in our experience some journals that have clear policies mandating data sharing have been open to the sharing of synthetic ALSPAC data, given that the raw ALSPAC data cannot be released.”

Would synthetic data meet EU GDPR data sharing guidelines? While synthetic data are artificial, the GDPR may still apply (see Beduschi A et al. (2024 1))

Response: Based on our reading of this paper, these synthetic datasets would not need to be GDPR compliant as our synthesis methods are fully synthetic (i.e., do not retain any original individual participant data). We have now added this to the second paragraph of the Introduction: “However, as data are simulated and do not correspond to real-life individuals by design, they preserve participant anonymity (note that, because fully synthetic datasets do not contain personal information of real individuals, they are likely exempt from complying with the European Union’s General Data Protection Regulation [GDPR]; Beduschi, 2024).”

Specific comments:

  • P3: When mentioning other tree-based and parametric methods beyond CART, it would help to specify some of these.  

    Response: Now updated: “it is also possible to synthesise data using alternative tree-based (e.g., random forest) or parametric (e.g., linear, logistic) models”

  • P4. For longitudinal data how is potential disclosure managed because you now have temporal data that has the potential to identify participants?

    Response: For longitudinal data our approach for disclosure management is still the same – i.e., remove participants if all their data are uniquely replicated in both the original and synthetic datasets. We do not believe that longitudinal/repeated data requires a different approach to disclosure compared with cross-sectional data.

  • P5. More details on the specifics of the similarity of distributions. Guidance here or recommendations would be useful.

    Response: This point was also raised by reviewer 1. Essentially, we do not believe that it is possible to make an objective decision regarding whether synthetic data are ‘similar enough’ to the observed data to be ‘useful’. For instance, even datasets which randomly synthesise each variable independent of all others can still be useful to explore the underlying code/analysis approach and check for errors, even if all relations between the variables are lost (see Shepherd et al., 2017; https://doi.org/10.1093/aje/kwx066 ). We refer this reviewer to our response to reviewer 1 for more details.

  • Figure 4: Why are so many of the Z-values 0 for the synthetic data? Is this because they are potentially categorical?

    Response: Apologies for any confusion here. This is because these results are the Z-values of the point estimate and 95% confidence interval (not Z-scores of the original variable). This is the default option for ‘synthpop’, which we have now noted more clearly in the manuscript (point 7 in the step-by-step guide): “note that, by default, ‘synthpop’ converts all coefficients to z-values so that all coefficients are on the same scale and hence easier to compare”.

  • Tables: Showing the synthetic data and the observed data for each covariate on the same row may be easier to interpret for the reader.

    Response: This was also suggested by reviewer 1, and the tables have been updated accordingly.

Wellcome Open Res. 2024 Aug 29. doi: 10.21956/wellcomeopenres.22725.r95196

Reviewer response for version 1

Neal Goldstein 1

This article describes how an R package ‘synthpop’ can be used to create a synthetic population based on the ALSPAC longitudinal study. The authors observed that the synthetic population largely recapitulated analyses conducted on the original data.

The article is well written and deals with a timely and important topic in research: reproducibility and transparency. The R code included with the article is easy to follow and should allow other researchers to implement ‘synthpop’ on their own.

I have a few comments for additional consideration:

  1. First, on pg5, I am dubious that a claim such as “These are synthesized ALSPAC datasets, and are not suitable for research purposes” - also echoed pg 11 “we state clearly here that these synthetic datasets should not be used in place of the actual observed data for research purposes;” - will prevent researchers from conducting original research on these synthetic data as they are more readily accessed. What other precautions can the authors recommend here? Signing an agreement before download? There is also an onus on journals or reviewers when seeing a synthetic data study to understand the intention of the research.

  2. Second, I would like to have seen a brief comparison of data anonymizing approaches (e.g., to scrub the protected health information from an electronic health record export) to a wholly synthetic data approach. When and under what conditions are each appropriate?

  3. Third, how does use of synthpop (or other synthetic data generators) satisfy funders’ data sharing requirements? From an NIH-centric perspective, this would be their Data Management and Sharing Policy.

  4. Fourth, while there is a check for replicated unique individuals, we must also be concerned about small cell counts; even though an observation could be unique, because there are so few of them it could allow for identification of individuals. This is common in surveillance data where small cell counts are suppressed.

  5. Fifth, on pg4, the claim “difference in sample size should not make much difference in practice” – is potentially misleading because this could lead to an important loss in precision in estimates.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Epidemiology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Wellcome Open Res. 2024 Dec 17.
Daniel Smith 1

We thank the reviewer for their positive review and their constructive comments. We have responded to the reviewer’s comments in turn below, with the reviewer’s original comments in standard font and our responses in italics.

I have a few comments for additional consideration:

First, on pg5, I am dubious that a claim such as “These are synthesized ALSPAC datasets, and are not suitable for research purposes” - also echoed pg 11 “we state clearly here that these synthetic datasets should not be used in place of the actual observed data for research purposes;” - will prevent researchers from conducting original research on these synthetic data as they are more readily accessed. What other precautions can the authors recommend here? Signing an agreement before download? There is also an onus on journals or reviewers when seeing a synthetic data study to understand the intention of the research.

Response: This is a good point. We want to make these synthetic datasets as open as possible to avoid unnecessary bureaucracy and barriers to data access, so currently recommend just making these synthetic datasets openly-available (e.g., on GitHub or OSF). We are therefore trusting researchers not to abuse this system by using these synthetic datasets for research purposes. Nonetheless, we have put some safe-guards in place to try and mitigate this possibility, such as: including a disclaimer in any published paper and with the synthetic data warning that data are synthetic and should not be used for research purposes; reducing the number of variables and observations in the synthetic dataset so that the synthetic dataset is very specific to the research question of interest (and hence may not be applicable to other research questions); and including a ‘FALSE_DATA’ variable in the dataset so users cannot confuse it with actual data (points 2, 6 and 7 of the ‘ALSPAC guidelines’ section). We will also develop ALSPAC-specific guidelines in the future if these rules are found to be violated. The paragraph in the Discussion on this has now been updated: “To try and avoid synthetic ALSPAC data being used for research purposes – whether knowingly or otherwise – we have put a number of safe-guards in place (see points 2, 6 and 7 of the ‘ALSPAC guidelines for releasing synthetic data using ‘synthpop’’ section). Additional measures will be put in place should synthetic ALSPAC data be found to have been used for these purposes.”

Second, I would like to have seen a brief comparison of data anonymizing approaches (e.g., to scrub the protected health information from an electronic health record export) to a wholly synthetic data approach. When and under what conditions are each appropriate?

Response: While we agree a comparison of data anonymisation approaches would be informative, this is beyond the scope of the present article, which focuses specifically on using synthpop to create synthetic ALSPAC data. At present, ALSPAC does not permit data to be made available via other approaches which aim to anonymise and de-identify participants (e.g., statistical disclosure control; Templ et al., 2015: https://doi.org/10.18637/jss.v067.i04 ) as these still largely make use of observed – rather than wholly synthetic – data, although this may change in the future. We have added a sentence to this effect in the Discussion (paragraph 4).

Third, how does use of synthpop (or other synthetic data generators) satisfy funders’ data sharing requirements? From an NIH-centric perspective, this would be their Data Management and Sharing Policy.

Response: This is a very good question, and one we are not entirely clear on the answer to! As we note in the introduction, “although data sharing is still not universal, there is a continued push from academics, journals, funders and governments towards this goal”; however, we are not aware of any specific policies from funders regarding the sharing of synthetic data. Most funders now do recommend that data should be openly-available (if possible), so a policy of sharing synthetic data when data could not otherwise be shared should appeal to them, but without talking to – and clarification from – funders, the actual implications of this for funders’ data sharing requirements are unclear. We have added a section on this in the Discussion: “An additional question is whether such synthetic datasets meet journal and funder requirements for data sharing. Although sharing synthetic data is clearly better than sharing no data, clarity from funders and journals is required to definitively answer this. However, in our experience some journals that have clear policies mandating data sharing have been open to the sharing of synthetic ALSPAC data, given that the raw ALSPAC data cannot be released.”

Fourth, while there is a check for replicated unique individuals, we must also be concerned about small cell counts; even though an observation could be unique, because there are so few of them it could allow for identification of individuals. This is common in surveillance data where small cell counts are suppressed.

Response: We agree that this is an issue for observed data (and ALSPAC already has a policy of no potentially-identifiable cell counts of less than 5 in publications; https://www.bristol.ac.uk/media-library/sites/alspac/documents/alspac-publications-checklist.pdf ), but we would argue that this is less of an issue for synthetic data. As we note in our paper: “As synthetic datasets are wholly simulated, data ought to be non-disclosive as they are no longer based on individual records. However, it is possible that a unique combination of values could be synthesised corresponding to a unique individual in the observed data, thus remaining a potential disclosure risk. Although researchers using the synthesised data will not be able to know whether such a unique observation matches that of an actual participant, there is a remote possibility that unique individuals could be identified from the synthesised data. We therefore recommend that users undertake ‘statistical disclosure control’ checks on any synthetic datasets to remove any unique observations that occur in both the observed and synthetic datasets.” The recommendation to remove uniquely-replicated individuals is therefore likely to be over-cautious as it would be almost-impossible to know whether the data for a uniquely-identified individual in the synthetic dataset was the same as a uniquely-identified individual in the observed dataset. This risk is therefore even more remote when considering small cell counts greater than 1 or on a smaller number of variables, hence why we have focused on uniquely-replicated individuals here.
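The uniqueness check described in this response can be sketched in a few lines. The following stdlib-only Python illustration uses invented toy records of the form (sex, age, score) and is not the study's actual implementation (which uses R): rows appearing exactly once in the observed data count as unique individuals, and any synthetic row that exactly replicates one of them is removed before release.

```python
from collections import Counter

# Invented toy records of the form (sex, age, score).
observed = [("F", 30, 12), ("M", 41, 7), ("F", 30, 12), ("M", 58, 99)]
synthetic = [("F", 30, 12), ("M", 58, 99), ("M", 41, 3), ("F", 29, 10)]

# Rows appearing exactly once in the observed data: unique individuals.
counts = Counter(observed)
unique_observed = {row for row, n in counts.items() if n == 1}

# Synthetic rows replicating a unique observed row are a potential
# disclosure risk...
flagged = [row for row in synthetic if row in unique_observed]

# ...and are dropped from the synthetic dataset before release.
safe_synthetic = [row for row in synthetic if row not in unique_observed]

print(f"flagged {len(flagged)} row(s); releasing {len(safe_synthetic)} of {len(synthetic)}")
```

Note that the synthetic row ("F", 30, 12) is kept even though it matches observed data, because that combination occurs twice in the observed dataset and so does not single out a unique individual.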

Fifth, on pg4, the claim “difference in sample size should not make much difference in practice” – is potentially misleading because this could lead to an important loss in precision in estimates.

Response: Yes, this is technically true – although in practice the differences in sample sizes between observed and synthetic data are generally small, so should make little difference in reality. Nonetheless, we have updated this sentence accordingly: “this difference in sample size should not make much difference in practice (although large differences in sample size may lead to a loss of precision in estimates)”.

Wellcome Open Res. 2024 May 17. doi: 10.21956/wellcomeopenres.22725.r79973

Reviewer response for version 1

Sam Harper 1

  •  General Comments

- This paper provides a sensible overview and introduction to the practical use of synthetic datasets in the service of increasing the transparency and reproducibility of epidemiologic research. Overall I think the paper is generally well written and structured, but I also have a few comments and suggestions. 

- As a general comment, the authors (in many places) make it clear that deciding 'how good is good enough' for synthetic data requires subjective interpretations. However, I think readers would probably benefit from more discussion of how to describe, evaluate, or measure the concept of 'faithfulness' between the synthetic and observed datasets and statistical relationships. Are there no rules of thumb or guidance about how to compare, say, measures of central tendency or variability in ways that would help researchers to draw sensible conclusions about comparability? If not, perhaps some could or should even be suggested here? 

  • Specific comments

- p4. The authors note that the order in which variables are synthesized may also affect 'correspondence' between synthesized and observed data. Is this a potential consequence of not respecting the temporal ordering of the variables in question? And is this more important for causal questions than, say, descriptive or predictive analyses? For example, if both income and education are included, wouldn't it be prudent to synthesize education first, since that is a potential determinant of income? Should researchers use a directed acyclic graph in order to make these relationships clear (if the analysis is causal)?

- p4. The authors say that synthesizing the 'exposures and outcomes last' is advantageous, but at this point in the paper it isn't clear what the exposure or outcome is, since there has only been talk about ALSPAC more generally. This makes me wonder whether the authors would provide different guidance for those wanting to produce a synthetic dataset for a specific exposure-outcome investigation vs. a synthetic general dataset for multiple uses. Would this lead to different strategies for implementation?

- p4. "We advise that users test different synthesizing methods, and see what works best for their specific dataset", but how is 'works best' defined here? By some metric?

- p5. "Check that the distributions are similar..." how similar (e.g., a standardized difference of <10%? Means or medians within a certain range?). Again, a little help or guidance here could be beneficial, even if no hard-and-fast rules apply. I think for many readers that have never heard of the concept of generating synthetic data, this would help.

- p7 (Figure 4). I recognize that the purpose of this figure is to demonstrate the similarity between the coefficients in the synthetic and observed data (which it does), but it's important to make it clear to readers that in an analysis with a focal exposure of interest and an outcome, these coefficients do not have any straightforward interpretation (i.e., Table 2 fallacy from Westreich et al 2013 AJE), and any direct interpretations should be avoided. 

- p9. I know that it is impossible to put all of the details for both cross-sectional and longitudinal synthetic datasets together, but I think at least some intuition for how the process of developing the synthetic data for the longitudinal case would be of interest. Was this done parametrically or again by regression trees? How to deal with constraints placed on some of the longitudinal variables (e.g., some can only increase, like age or education). A few details would probably be helpful. 

- p11. "the main effect of sex did vary between the observed and synthetic data" By how much? Meaningfully? Also a little confusing because in the context of EMM there is no 'single' effect of sex (i.e., effect of sex at what age?) so it isn't clear what "main effect of sex" means here.

- p11. Also the authors suggest this could be due to the removal of 200 individuals due to SDC, but can they possibly confirm that this is the case (even if, ultimately those individuals would have to be removed in order to make the synthetic dataset available?). At least you could rule this out as an explanation. 

- p14. I'm a little confused regarding the comment on "multi-level/ hierarchical data" as a limitation.  ALSPAC is multi-level data, right (occasions nested within individuals)? So were the observations just treated as independent for the longitudinal / trajectory models given above? Does that have any consequences for variance estimation using the synthetic vs. the observed data? 

  •  Minor comments

- Figure 2 is great but hard to read. Would be better to increase the font size for the legend as well as for each strip text heading and x- and y-axes (including titles). Could also drop the grayscale background but that is more of a personal preference.

- Figure 3. Not quite sure that it is best to use the Z-value for the x-axis. Might someone confuse this with a z-statistic (i.e., from a test?). Why not just use the regression coefficient, or clarify that this is a symptom score?

- Table 1. I would encourage re-orienting the table so that readers can look to compare coefficients and standard errors from left to right to compare synthetic vs. observed values across columns, rather than across rows. It would be much easier to read. 

- Table 4. Would also encourage re-orienting to compare synthetic vs. observed across columns rather than rows.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

epidemiology, impact evaluation, reproducible research

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Wellcome Open Res. 2024 Dec 17.
Daniel Smith 1

We thank the reviewer for their positive review and their constructive comments. We also apologise for the length of time taken to respond to the reviewer’s comments; it took quite a while to find additional reviewers for our paper, and we thank this reviewer for their patience. We have responded to the reviewer’s comments in turn below, with the reviewer’s original comments in standard font and our responses in italics.  

As a general comment, the authors (in many places) make it clear that deciding 'how good is good enough' for synthetic data requires subjective interpretations. However, I think readers would probably benefit from more discussion of how to describe, evaluate, or measure the concept of 'faithfulness ' between the synthetic and observed datasets and statistical relationships. Are there no rules of thumb or guidance about how to compare, say, measures of central tendency or variability in ways that would help researchers to draw sensible conclusions about comparability? If not, perhaps some could or should even be suggested here?

Response: While we sympathise with the reviewer’s sentiments here, we do not believe that it is possible to make an objective decision regarding whether synthetic data are ‘similar enough’ to the observed data to be ‘useful’. For instance, even datasets which randomly synthesise each variable independent of all others can still be useful to explore the underlying code/analysis approach and checking for errors, even if all relations between the variables are lost (see Shepherd et al., 2017; https://doi.org/10.1093/aje/kwx066 ). Of course, the closer the synthetic data are to the observed the better, but in the absence of clear dividing lines, and different needs depending on the type of dataset, we do not feel it would be appropriate for us to provide guidance on this to readers. In our original manuscript we pointed readers to the concept of ‘utility’, which measures how similar the synthetic and observed data are, but did not develop this further for the reasons outlined above (see paragraph 4 of the ‘ Recommendations when using ‘synthpop’ section): “The second factor to consider is whether the synthetic dataset successfully maintains relations between variables (although even synthetic datasets which do not maintain relations between variables can still be useful for reproducing code and checking for errors; (Shepherd et al., 2017)). The ‘synthpop’ package provides a suite of useful tools for comparing the synthesised data against the observed data. This includes a simple comparison of the distributions of each variable, through to more complex conditional associations, such as in a multivariable regression. Ideally, these should be similar between the observed and synthetic datasets, although random variation and imperfections in the synthesising process are of course inevitable. Formal measures of ‘utility’, comparing the synthetic to the observed data, are also available within the ‘synthpop’ package, but are not discussed here (Raab et al., 2017). 
[…] There are no definitive rules on what constitutes ‘successful’ synthesis, so again researchers must use their subjective judgement.”  
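As a concrete illustration of the comparison and utility tools referred to above, a minimal R sketch using the SD2011 dataset bundled with 'synthpop' (not ALSPAC data; an editorial illustration, not part of the manuscript's code) might look like:

```r
# Minimal sketch using synthpop's bundled SD2011 survey data (not ALSPAC):
# synthesise a few variables, then compare distributions and compute a
# formal utility measure (cf. Raab et al., 2017).
library(synthpop)

obs <- SD2011[, c("sex", "age", "edu", "income", "depress")]
synds <- syn(obs, seed = 2024)  # default: first variable sampled, rest CART

# Side-by-side comparison of each variable's distribution
compare(synds, obs)

# General (propensity-score-based) utility of the synthetic dataset
utility.gen(synds, obs)
```

Here `compare()` performs the simple distributional checks described in the text, while `utility.gen()` returns the formal utility measures (e.g., pMSE) that the manuscript points to but does not discuss in detail.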

Specific comments

p4. The authors note that the order in which variables are synthesized may also affect 'correspondence' between synthesized and observed data. Is this a potential consequence of not respecting the temporal ordering of the variables in question? And is this more important for causal questions than, say, descriptive or predictive analyses? For example, if both income and education are included, wouldn't it be prudent to synthesize education first, since that is a potential determinant of income? Should researchers use a directed acyclic graph in order to make these relationships clear (if the analysis is causal)?

Response: As synthpop models are predictive, rather than based on the causal structure/data-generating mechanisms, it does not appear necessary to take the (assumed) causal relations between variables into consideration when synthesising data (e.g., in the reviewer’s example, whether education is synthesised before income, or income before education, would make little difference to the final synthetic dataset). We have added the following sentence to paragraph 5 of the ‘Recommendations when using ‘synthpop’’ section to clarify this: “Note that as synthetic data generation using ‘synthpop’ uses predictive modelling, it is not necessary to consider the causal structure of the data (i.e., the data-generating mechanisms) when synthesising data using this approach.”

p4. The authors say that synthesizing the 'exposures and outcomes last' is advantageous, but at this point in the paper it isn't clear what the exposure or outcome is, since there has only been talk about ALSPAC more generally. This makes me wonder whether the authors would provide different guidance for those wanting to produce a synthetic dataset for a specific exposure-outcome investigation vs. a synthetic general dataset for multiple uses. Would this lead to different strategies for implementation?

Response: We have added further clarification on this in brackets following this sentence: “While synthesising the data in any order may be sufficient, we have found that synthesising the exposure(s) and outcome(s) last sometimes maintains the relations between variables more faithfully, although this does somewhat contradict the advice given in (Raab et al., 2017), who recommend synthesising the most important variables first (although if the dataset is being synthesised for more general use and does not have specific exposures or outcomes, synthesising the variables in any order will likely suffice).”
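To make the ordering point concrete, the `visit.sequence` argument of `syn()` controls the order in which variables are synthesised. A hedged sketch (again using the bundled SD2011 data, with 'smoke' and 'depress' standing in as a hypothetical exposure and outcome; an editorial illustration, not the manuscript's code):

```r
# Sketch: controlling synthesis order via 'visit.sequence', placing a
# hypothetical exposure ('smoke') and outcome ('depress') last.
library(synthpop)

obs <- SD2011[, c("sex", "age", "edu", "smoke", "depress")]

# Covariates first; exposure and outcome synthesised last
synds <- syn(obs,
             visit.sequence = c("sex", "age", "edu", "smoke", "depress"),
             seed = 2024)

synds$method  # first visited variable is sampled, the remainder use CART
```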

p4. "We advise that users test different synthesizing methods, and see what works best for their specific dataset", but how is 'works best' defined here? By some metric?

Response: Again, we appreciate that these decisions are somewhat vague and ‘vibes-y’, but this is inevitable when dealing with subjective judgements such as this. However, we have now clarified this statement and provided an example to help readers: “We advise that users test different synthesising methods, and see what works best for their specific dataset in terms of replicating the properties of the observed data (e.g., one method may maintain the relationship between the exposure and outcome more faithfully than another).”

p5. "Check that the distributions are similar..." how similar (e.g., a standardized difference of <10%? Means or medians within a certain range?). Again, a little help or guidance here could be beneficial, even if no hard-and-fast rules apply. I think for many readers that have never heard of the concept of generating synthetic data, this would help.

Response: While we do not propose any quantitative measures to assess similarity for the reasons outlined above, we have added that simply eye-balling the data is generally sufficient for this: “Check that the distributions of all synthesised variables are similar to those in the observed data. Simply visualising the data (e.g., in bar charts or histograms) should provide this information (see worked example below).”
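For readers new to these checks, the 'eye-balling' step can be as simple as base-graphics histograms and bar charts alongside 'synthpop''s built-in `compare()` plots. A sketch using the bundled SD2011 data (an editorial illustration, not the manuscript's code):

```r
# Sketch: visually comparing synthetic vs observed distributions with
# simple base-graphics plots (synthpop's compare() draws similar plots
# automatically).
library(synthpop)

obs <- SD2011[, c("sex", "age", "income")]
synds <- syn(obs, seed = 2024)

# Continuous variables: side-by-side histograms
par(mfrow = c(1, 2))
hist(obs$age, main = "Observed age", xlab = "Age")
hist(synds$syn$age, main = "Synthetic age", xlab = "Age")

# Categorical variables: grouped bar chart
barplot(rbind(table(obs$sex), table(synds$syn$sex)),
        beside = TRUE, legend.text = c("Observed", "Synthetic"))
```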

p7 (Figure 4). I recognize that the purpose of this figure is to demonstrate the similarity between the coefficients in the synthetic and observed data (which it does), but it's important to make it clear to readers that in an analysis with a focal exposure of interest and an outcome, these coefficients do not have any straightforward interpretation (i.e., Table 2 fallacy from Westreich et al 2013 AJE), and any direct interpretations should be avoided.

Response: We have added the following text to make this clear to readers: “we also stress here that in a causal analysis between the exposure and outcome the coefficients of these additional covariates do not have a straightforward interpretation [see Westreich and Greenland, 2013 on the ‘table 2’ fallacy]).”

p9. I know that it is impossible to put all of the details for both cross-sectional and longitudinal synthetic datasets together, but I think at least some intuition for how the process of developing the synthetic data for the longitudinal case would be of interest. Was this done parametrically or again by regression trees? How to deal with constraints placed on some of the longitudinal variables (e.g., some can only increase, like age or education). A few details would probably be helpful. 

Response: We have provided additional details of the synthesis approach used in the ‘Applied longitudinal example’ section: “For both examples, we synthesised datasets using the default CART approach with these longitudinal data in ‘wide’ format (i.e., one row per participant with time-points as separate variables); synthesised data were then converted to ‘long’ format for the multilevel modelling analysis (i.e., multiple observations per participant with one row per time-point).” Additional details of the data used for the repeated measures analyses have also been provided (new text in bold font): “The first example examines height trajectories from childhood to early adulthood using a multilevel modelling framework, similar to that of Howe et al. (Howe et al., 2012; Howe et al., 2016), using up to eight occasions of height between approximately 7 to 18 years of age. The second example uses growth mixture modelling to examine associations between adolescent self-esteem and depression trajectories across adolescence and early adulthood, and then depression trajectories across adolescence and early adulthood associated with later depression, similar to that of Kwong et al. (Kwong et al., 2019) and López-López et al. (López-López et al., 2020), using up to nine occasions of depressive symptoms between approximately 10 and 24 years of age.” We have also updated the discussion to be clearer on when the standard ‘synthpop’ package can be used for multi-level/hierarchical data and when it cannot: “For instance, when using the standard ‘synthpop’ package, synthesising data in ‘wide’ format (as used here) appears to work well, but is unlikely to hold for data in ‘long’ format as information on the relations between observations within individuals would be lost.”
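A minimal sketch of this wide-format synthesis followed by reshaping to long format (with toy data and hypothetical variable names, not the actual ALSPAC height variables; an editorial illustration):

```r
# Sketch (hypothetical variable names): synthesise repeated measures in
# 'wide' format (one row per participant), then reshape to 'long' format
# (one row per time-point) for multilevel modelling.
library(synthpop)

# Toy wide dataset: one height column per assessment age
set.seed(42)
wide <- data.frame(female = factor(rbinom(500, 1, 0.5)),
                   ht7  = rnorm(500, 122, 5),
                   ht10 = rnorm(500, 138, 6),
                   ht13 = rnorm(500, 156, 8))

# CART synthesis over the wide-format columns keeps within-person
# relations between occasions (each column is modelled conditional on
# the previously synthesised columns)
synds <- syn(wide, seed = 2024)

# Reshape the synthetic data to long format (base R)
long <- reshape(cbind(id = seq_len(nrow(synds$syn)), synds$syn),
                direction = "long",
                varying = c("ht7", "ht10", "ht13"),
                v.names = "height",
                times = c(7, 10, 13), timevar = "age",
                idvar = "id")
```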

p11. "the main effect of sex did vary between the observed and synthetic data" By how much? Meaningfully? Also a little confusing because in the context of EMM there is no 'single' effect of sex (i.e., effect of sex at what age?) so it isn't clear what "main effect of sex" means here.

Response: We have now clarified this in the text: “the main effect of sex (i.e., the difference in height between males and females at age 12 [the mean age of all the assessments]) did vary between the observed and synthetic data”. This information has also been added to the footnotes in Tables 1 and 2. As we note later in this paragraph, this did not seem to meaningfully impact on differences between the synthetic and observed data: “However, as shown in Figure 6, this had little effect on the estimation of the height trajectories and it is worth noting the confidence intervals for the main effect overlap”.

p11. Also the authors suggest this could be due to the removal of 200 individuals due to SDC, but can they possibly confirm that this is the case (even if, ultimately those individuals would have to be removed in order to make the synthetic dataset available?). At least you could rule this out as an explanation. 

Response: We have re-run the analysis without including the SDC command in ‘synthpop’ and find that the results are nearly identical to the analysis containing the SDC command in ‘synthpop’, suggesting that it is not the removal of the SDC observations, but more likely random variability from the synthesis model. These results are presented in the table below. We have now amended the manuscript to reflect this: “However, the main effect of sex (i.e., the difference in height between males and females at age 12 [the mean age of all the assessments]) did vary between the observed and synthetic data, which is likely a result of random variability in the synthesis model”.

Synthetic with SDC command (n = 9,834; n_obs = 52,557) vs synthetic without SDC command (n = 9,969; n_obs = 52,692):

Fixed effects:

| Parameter | Beta (with SDC) | Std Err | 95% CI | P | Beta (without SDC) | Std Err | 95% CI | P |
|---|---|---|---|---|---|---|---|---|
| Age | 5.65 | 0.01 | 5.63 to 5.67 | <0.001 | 5.65 | 0.01 | 5.63 to 5.67 | <0.001 |
| Female | -0.35 | 0.14 | -0.63 to -0.06 | 0.016 | -0.35 | 0.13 | -0.63 to -0.06 | 0.014 |
| Female*Age | -0.99 | 0.01 | -1.02 to -0.97 | <0.001 | -0.99 | 0.01 | -1.02 to -0.96 | <0.001 |
| Age² | -0.12 | 0.00 | -0.12 to -0.11 | <0.001 | -0.12 | 0.00 | -0.12 to -0.11 | <0.001 |
| Female*Age² | -0.26 | 0.00 | -0.27 to -0.25 | <0.001 | -0.26 | 0.00 | -0.27 to -0.25 | <0.001 |
| Intercept | 153.00 | 0.10 | 152.80 to 153.19 | <0.001 | 153.00 | 0.10 | 152.80 to 153.19 | <0.001 |

Random effects:

| Parameter | Estimate (with SDC) | Std Err | 95% CI | Estimate (without SDC) | Std Err | 95% CI |
|---|---|---|---|---|---|---|
| var(Age) | 0.17 | 0.01 | 0.16 to 0.18 | 0.17 | 0.01 | 0.16 to 0.18 |
| var(Age²) | 0.01 | 0.00 | 0.01 to 0.01 | 0.01 | 0.00 | 0.01 to 0.01 |
| var(Intercept) | 45.20 | 0.71 | 43.83 to 46.62 | 45.20 | 0.71 | 43.83 to 46.62 |
| cov(Age, Age²) | 0.02 | 0.00 | 0.01 to 0.02 | 0.02 | 0.00 | 0.01 to 0.02 |
| cov(Age, Intercept) | 0.77 | 0.05 | 0.68 to 0.87 | 0.77 | 0.05 | 0.68 to 0.87 |
| cov(Age², Intercept) | -0.53 | 0.01 | -0.55 to -0.50 | -0.53 | 0.01 | -0.55 to -0.50 |
| var(Residual) | 7.54 | 0.06 | 7.42 to 7.67 | 7.54 | 0.06 | 7.42 to 7.67 |
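As an informal illustration of the model specification reported in the table above, the following R sketch (simulated data and hypothetical variable names; the actual analyses used Stata) fits the same fixed- and random-effects structure with 'lme4', with age centred at 12 so that the 'female' coefficient is the sex difference in height at the mean assessment age:

```r
# Sketch (simulated data, hypothetical variable names): multilevel growth
# model with linear and quadratic age terms, sex interactions, and random
# intercepts and slopes, mirroring the structure in the table above.
library(lme4)

set.seed(42)
n <- 300
d <- expand.grid(id = 1:n, age = c(-5, -2, 0, 2, 5, 6))  # age centred at 12
d$female <- rep(rbinom(n, 1, 0.5), times = 6)
u <- rnorm(n, 0, 6)  # person-level random intercepts

# Simulate heights roughly matching the reported fixed effects
d$height <- 153 + 5.65 * d$age - 0.12 * d$age^2 -
            0.35 * d$female - 0.99 * d$female * d$age -
            0.26 * d$female * d$age^2 + u[d$id] + rnorm(nrow(d), 0, 2.7)

fit <- lmer(height ~ (age + I(age^2)) * female + (age + I(age^2) | id),
            data = d)
fixef(fit)  # 'female' is the sex difference in height at the mean age
```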

p14. I'm a little confused regarding the comment on "multi-level/ hierarchical data" as a limitation. ALSPAC is multi-level data, right (occasions nested within individuals)? So were the observations just treated as independent for the longitudinal / trajectory models given above? Does that have any consequences for variance estimation using the synthetic vs. the observed data?

Response: We have hopefully explained this above, re. the discussion of ‘wide’ vs ‘long’ longitudinal data formats. In terms of variance estimation, there do not appear to be any large, consistent or meaningful differences between the variances, standard errors and random effects for the observed vs synthetic data (Tables 1–4), suggesting that this is not an issue (at least when synthesising longitudinal/repeated measures data in ‘wide’ format).

  •  Minor comments

Figure 2 is great but hard to read. Would be better to increase the font size for the legend as well as for each strip text heading and x- and y-axes (including titles). Could also drop the grayscale background but that is more of a personal preference.

Response:  These plots are the default ones created by the ‘synthpop’ package, and are surprisingly difficult to edit manually! To try and make this figure easier to read, we have changed the orientation from landscape (3 rows, 5 columns) to portrait (5 rows, 3 columns) which should increase the plot size in the manuscript and make these easier to read.

Figure 3. Not quite sure that it is best to use the Z-value for the x-axis. Might someone confuse this with a z-statistic (i.e., from a test?). Why not just use the regression coefficient, or clarify that this is a symptom score?

Response:  This is the default option for ‘synthpop’, which we have now noted more clearly in the manuscript (point 7 in the step-by-step guide): “note that, by default, ‘synthpop’ converts all coefficients to z-values so that all coefficients are on the same scale and hence easier to compare”.
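For context, 'synthpop' provides model-fitting functions (`lm.synds()`/`glm.synds()`) whose results can be compared against the observed data, producing the z-value plots discussed here. A sketch using the bundled SD2011 data (an editorial illustration, not the manuscript's code):

```r
# Sketch: fit a regression to the synthetic data, then compare coefficients
# against the same model fitted to the observed data; by default the
# comparison plot displays coefficients as z-values.
library(synthpop)

obs <- SD2011[, c("sex", "age", "edu", "depress")]
synds <- syn(obs, seed = 2024)

fit <- lm.synds(depress ~ sex + age + edu, data = synds)
compare(fit, obs)  # plots observed vs synthetic coefficients (z-values)
summary(fit)
```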

Table 1. I would encourage re-orienting the table so that readers can look to compare coefficients and standard errors from left to right to compare synthetic vs. observed values across columns, rather than across rows. It would be much easier to read. 

Table 4. Would also encourage re-orienting to compare synthetic vs. observed across columns rather than rows.

Response:  We have re-formatted Tables 1, 2 and 4 to make the comparison between synthetic and observed results easier.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. djsmith-90: djsmith-90/synthetic-data: v1.0.0 (v1.0.0). Zenodo.[Dataset]2024. 10.5281/zenodo.10457847 [DOI]
    2. Northstone K, Heron J, Smith D, et al. : ALSPAC Masters Training Dataset (Stata).2022. 10.17605/OSF.IO/8SGZE [DOI]

    Data Availability Statement

    Please see the ALSPAC data management plan, which describes the policy regarding data sharing ( http://www.bristol.ac.uk/alspac/researchers/data-access/documents/alspac-data-management-plan.pdf) and operates via a system of managed open access. Other than the freely available ALSPAC dataset and the synthetic datasets (see below), all other data used for this submission will be made available on request to the Executive ( alspac-exec@bristol.ac.uk). These datasets are linked to ALSPAC project number B4301; please quote this project number during your application. The following datasets and analysis code files supporting this submission are available on DM-S’s GitHub page ( https://github.com/djsmith-90/synthetic-data, available under a GPL-3.0 license, archived at the time of publication: https://doi.org/10.5281/zenodo.10457847, djsmith-90, (2024)); this includes:

    • 1) “SynthPopExample.r”: An example R script to replicate the step-by-step example using the openly available ALSPAC dataset, as well as explore some additional ‘synthpop’ functionality (parametric synthesis, smoothing parameters, top- and bottom-coding, synthesis using different numbers of variables, etc.). The openly available subset of ALSPAC data used for this example is available here ( https://doi.org/10.17605/OSF.IO/8SGZE; Northstone et al., 2022).

    • 2) “synthpop_repeated-measures_script”: This script processes and synthesises the datasets for the longitudinal modelling examples.

    • 3) “Simulated_height.dta”: The synthesised ALSPAC dataset for the multi-level growth models of height, in Stata format (note that the corresponding observed ALSPAC data files are not available for these analyses).

    • 4) “analysis_height”: The Stata script to perform the multi-level growth models on the “Simulated_height.dta” dataset.

    • 5) “Simulated_depression_mplus.dta”: The synthesised ALSPAC dataset for performing the growth mixture modelling analysis of depression, in Stata format (note that the corresponding observed ALSPAC data files are not available for these analyses).

    • 6) “prep_analysis_depression”: Script which initially processes the data for the growth mixture modelling analysis (in Stata), followed by the MPlus code to perform the growth mixture modelling analysis.

    The steps below highlight how to apply for access to the data included in the data note and all other ALSPAC data:

    Please note that the study website contains details of all the data that are available through a fully searchable data dictionary and variable search tool: http://www.bristol.ac.uk/alspac/researchers/our-data/.


    Articles from Wellcome Open Research are provided here courtesy of The Wellcome Trust
