2025 Jan 9;13(4):420–444. doi: 10.1093/jssam/smae047

Synthesizing Surveys with Multiple Units of Observation: An Application to the Longitudinal Aging Study in India

Joshua Snoke ✉, Erik Meijer, Drystan Phillips, Jenny Wilkens, Jinkook Lee
PMCID: PMC12596149  PMID: 41210181

Abstract

We present methodology for creating synthetic data and an application to create a publicly releasable synthetic version of the Longitudinal Aging Study in India (LASI). The LASI, a health and retirement survey, is used for research and educational purposes, but it can only be shared under restricted access due to privacy considerations. We present novel methods to synthesize the survey, maintaining three nested levels of observation—individuals, couples, and households—with both continuous and categorical variables and survey weights. We show that the synthetic data maintain the distributional patterns of the confidential data and largely mitigate identification and attribute disclosure risk. We also present a novel method for controlling the risk-utility tradeoff for the synthetic data that takes into account the survey sampling rates. Specifically, we down-weight records that have a high likelihood of being uniquely identifiable in the population due to unique demographic information and oversampling. We show that this approach reduces both identification and attribute risk while preserving utility better than the common alternative of coarsening records. Our methods and evaluations provide a foundation for creating synthetic versions of surveys with multiple units of observation, such as the LASI.

Keywords: Aging studies, Disclosure risk, Multiple units of observation, Survey data, Synthetic data


Statement of Significance.

We present novel methodology for synthesizing survey data that are collected with multiple units of observation. Prior synthetic data methods were insufficient due to restrictions on the units of observation or the types of data. Our approach can utilize any common method of sequential synthesis, but we simultaneously synthesize individuals, couples, and households by first restructuring the data and synthesizing structural variables. Our methods can be implemented using standard public software packages, so others can easily apply them to their own survey data. Additionally, we provide new means of mitigating disclosure risk that are specific to the types of risk inherent in synthetic data, an area that currently has little research.

1. INTRODUCTION

Researchers and data maintainers who collect survey data increasingly seek to share the data for purposes of education, training, reproducibility, and enabling secondary analyses. Prior to sharing survey data, data maintainers commonly apply statistical disclosure control (SDC) methods to confidential data (Hundepool et al. 2012). Approaches include data reduction methods, such as suppressing certain variables, records, or cells (for tabular data), or coarsening variables by collapsing categories. Synthetic data were proposed as an alternative method to traditional SDC approaches, with formative work by Rubin (1993), Little (1993), and Raghunathan et al. (2003). Rather than starting from the entire confidential sample and attempting to maintain as much of the original data values as possible, synthetic data starts with an assumed data-generating process, estimates sample parameters based on the confidential data, and draws fully new records based on a model using these sample parameters. In this way, a model and a data-generating process are essential to synthetic data in a way that they are not for other SDC methods.

Multiple synthetic data sets can be combined to produce valid inferences (Reiter 2003; Reiter and Raghunathan 2007; Raab et al. 2016), similar to combining rules for multiple imputation, but it is also common to produce only one synthetic data set for exploratory data analysis and model building. Synthetic data can be created by replacing every record entirely, known generally as fully synthetic data, or only some records or portions of records can be replaced, which is known as partially synthetic data (Reiter 2003, 2005; Drechsler and Reiter 2009). In some cases, the synthetic data can be analyzed similarly whether the data are partially or fully synthetic, but in other cases, they must be handled differently (Reiter and Kinney 2012; Raab et al. 2016).

In this paper, we consider the application of synthetic data to survey data, motivated by synthesizing the Longitudinal Aging Study in India (LASI), a health and retirement survey that can only be shared under restricted access due to privacy considerations. Synthetic data have been previously applied to survey data in a variety of settings. For example, Benedetto et al. (2018) created a synthetic version of the Survey of Income and Program Participation for the US Census, Hu et al. (2018) applied synthesis methods on the American Community Survey, Drechsler and Reiter (2009) and Kim et al. (2021) synthesized establishment surveys, Hu et al. (2022) evaluated synthesis methods using the Consumer Expenditure Survey from the Bureau of Labor Statistics, and Drechsler and Reiter (2012) offered general methods for synthesizing large surveys. Less work has considered the problem of synthesizing surveys with nested structure, for example those that contain individuals within couples or households (HHs). Prior work such as Manrique-Vallier and Hu (2018) and Hu et al. (2018) proposed latent class models to handle hierarchical data or structural zeros, but these methods have only been applied to categorical data. Alternatively, Benedetto and Totty (2023) proposed to create nested structure in synthetic data sets by matching records, for example, couples, after synthesizing individual characteristics. We propose a novel method for synthesizing multiple levels of observation, and we do so while using common synthesis models that are computationally efficient and easy to create with publicly available software. We show that high utility and low disclosure risk can be maintained simply by re-imagining the joint distribution of interest.

We organize the remainder of this paper as follows: section 2 describes the LASI, section 3 provides an overview of synthesis models for data with HHs or similarly nested structures, section 4 discusses down-weighting records for additional disclosure control, section 5 details our synthesis model for the LASI data set, sections 6 and 7 give the utility measures and risk measures, respectively, that we use to evaluate our synthesis, section 8 provides empirical results, and section 9 concludes with a discussion.

2. GATEWAY TO GLOBAL AGING DATA AND THE LASI

The LASI is a multidisciplinary survey of individuals aged 45 and over and their spouses of any age in India (Bloom et al. 2021; Perianayagam et al. 2022). Wave 1 was mostly conducted between 2017 and 2019, and Wave 2 is in preparation as of August 2024. LASI was designed to be nationally representative, as well as representative of each of the 36 states and union territories in India. Therefore, the sampling was stratified by State and Urbanicity (i.e., urban or rural) within each state, with a three-stage clustered sampling design in rural areas and a four-stage clustered sampling design in urban areas. Sampling weights are provided that reflect the sampling design as well as differential response rates.1 Once a HH had been included in the sample, all HH members aged 45 and over and their spouses were asked to participate.

LASI’s questionnaire was designed to provide comparable information to the US Health and Retirement Study (HRS) (Juster and Suzman 1995; NIA 2007), which started in 1992 as a nationally representative panel study of people over the age of 50 living in the United States and their spouses. Similar studies, such as the LASI, have been conducted in many countries across the world and are known as the Health and Retirement Studies–International Network of Studies (HRS-INS). The topics covered in the LASI questionnaire range widely and include demographics, health, socio-economic status, and social support, among others.

To further facilitate cross-country comparisons, the Gateway to Global Aging Data at the University of Southern California prepares harmonized datasets for over 10 of the HRS-INS, with comparable variable names and definitions, as well as documentation, closely following the RAND HRS (see Bugliari et al. (2024) for its latest version as of August 2024), which is a user-friendly data file that contains a large subset of the HRS variables. The starting point for the creation of our synthetic data is the Harmonized LASI data file (Chien et al. 2023).

As part of collecting the LASI, respondents are assured that their privacy and confidentiality will be maintained throughout any downstream publications or data sharing. This is vital both to preserve the ethical right to privacy and to ensure the viability of future data collections. Respondents might be concerned about re-identification, whereby someone could ascertain that they participated in the survey. They may also be concerned about disclosure of certain sensitive attributes, which may result from re-identification. In other cases, attributes may be learned by inference, such as the income of the wealthiest person in a small town.

With a very large underlying population, such as India’s, it may be tempting to think that the risk of such disclosures is minimal. But given enough information, particularly geographically identifying information in low population regions, studies have shown that disclosure is quite possible (e.g., Rocher et al. 2019).

2.1 Data Access Through Synthetic Data

Using synthetic data opens the door to a few potential means for increasing data access. While there are many potential uses for synthetic data, we have envisioned three major avenues for the use of synthetic Harmonized LASI data at this time.

  1. User trainings: The Gateway conducts quarterly user trainings to teach students and researchers how to utilize harmonized data for their analyses. Synthetic data would make these trainings easier to conduct.

  2. Exploratory research: Researchers could conduct exploratory research on synthetic data made easily available on the Gateway website and determine whether the data would suit their research plan. If they find it suitable, they could then apply for the Harmonized LASI.

  3. Restricted data research: We could use synthetic data to protect respondent anonymity when using highly restricted data. For example, the LASI may be linked to data on pollution or severe weather, which are very closely tied to specific geographic locations, potentially making respondents identifiable. The use of synthetic data in these cases would allow researchers to conduct their analyses in full detail while protecting respondent anonymity.

Given the goals of synthesizing the LASI, we only generate one synthetic data set. This lessens the burden on users in learning to handle multiple data sets and it allows them to work with the synthetic data in the same way that they would work with the confidential data. The lack of multiple data sets limits the ability for researchers to conduct valid inference using the synthetic data, but that is not currently a goal of this synthetic data.

We synthesize the variables listed in table 1. The file contains both HH and individual-level characteristics, as well as couple identifiers indicating spousal respondents. There is one auxiliary variable, Head of HH gender, which was not part of the original Harmonized LASI file, but was derived to be used in the synthesis process to re-weight the data after synthesis to match the original weight totals. It will not be released in the synthetic data. The LASI will contain more than one wave in the future, but our synthetic data only represent the first (and currently only) wave. Future work could consider the longitudinal nature in the synthesis when more waves are collected, but longitudinal models are not part of the current work.

Table 1.

Variables Included in the Synthesis Process and Evaluations

Measurement unit Characteristic Description
Household HH ID 42,311 households
Household State 35 states/territories
Household Rural indicator Binary
Household HH # residents Continuous
Household HH # respondents Continuous
Household HH survey weight Continuous
Household Head of HH gender (not released) Binary
Couple Couple ID 1–4 couples per HH
Individual Individual ID 1–8 individuals per HH
Individual Gender Binary
Individual Age Binned into 9 categories
Individual Education years Continuous
Individual Education category 10 categories
Individual Self-reported health Ordered 1–5
Individual Activities of daily living Ordered 0–5
Individual Working status Binary
Individual Individual earnings Continuous
Individual Individual survey weight Continuous

Note.—ID values are arbitrarily labeled in the synthetic data, not based on confidential IDs.

3. SYNTHESIZING DATA WITH MULTIPLE LEVELS OF NESTED OBSERVATIONS WITH SEQUENTIAL SYNTHESIS

The LASI survey contains three levels of observation: HHs, couples within HHs, and individuals within couples. We want to maintain both the HH and couple information in the synthetic data and ensure that the relationships between individuals in the same HH or couple are preserved. To do this, we propose a novel approach that reshapes the data and adds structural variables to the synthesis process, so that we can use common conditional synthesis approaches. This concept can be applied to synthesize any survey with multiple nested levels of observation, such as HHs and individuals.

Prior work on synthesizing this type of data is limited. Hu et al. (2018) proposed using hierarchical Bayesian models to draw individual and HH characteristics from a Dirichlet process mixture of products of multinomials. The model assumes individuals and HHs are each members of nested latent classes, such that the relationships between the two can be modeled. Their model does not accommodate continuous variables or sampling weights, which are common in surveys such as the LASI or the US Census Bureau’s American Community Survey. These Bayesian models are also computationally intensive and require defining complex joint distributions for the variables in the data.

Alternatively, Benedetto and Totty (2023) presented a method to match couples after synthesizing individual characteristics, thus preserving both the individual-level and couple-level distributions. For each synthesized individual who should be coupled, they greedily draw candidate matches from the pool of possible matches using distance measures, until all individuals are matched. While innovative, the structure in our data set, which contains HHs, couples, and individuals, is more complex than this prior work, and we do not see a straightforward means of applying the method to our data.

Given the limitations of these prior methods for surveys with multiple nested levels of observations, we design a simpler but intuitive approach to maintaining the structure in the data in our synthesis model. Rather than considering multiple observation levels, we flatten the data to a single level, the HH level, and we explicitly incorporate couple and individual level information. We want to synthesize the data using a sequence of conditional models, as is common in the field, for example, using a sequence of regressions (Raghunathan et al. 2001; Nowok et al. 2016b) or CART models (Reiter 2005; Drechsler and Reiter 2011).

To understand the new data structure, let N be the number of HHs in our data, and let there be M measured HH characteristics, Q measured individual characteristics, and P possible couples2 in each HH. We define:

  1. Y_{ik}: i = 1, …, N; k = 1, …, M as the HH characteristics.

  2. X_{ijlr}: i = 1, …, N; j = 1, …, Q; l = 1, …, P; r = 1, 2 as the individual characteristics.

  3. Z_{ilr}: i = 1, …, N; l = 1, …, P; r = 1, 2 as binary flags indicating the existence of person r in couple l in HH i.

The last set of variables define the structure of the HHs. First, the number of individual respondents that exist in a HH is restricted by the total respondents, which is a HH feature. Second, couples must be filled in order, such that couple 3 can only exist if couple 2 exists, and so on. Third, the second person in a couple can only exist if the first person in the couple exists.

In other words, if there are h_i total respondents in HH i, then by definition Z_{ilr} = 0 for any position (l, r) with l + r − 1 > h_i, since each of couples 1, …, l − 1 must contain at least one person before couple l can be filled. By definition, if Z_{ilr} = 0, then X_{ijlr} is set to missing for all j. While N, M, and Q are given based on the confidential data, the selection of P is less straightforward. We discuss the selection of P as a synthesis tuning parameter in the supplementary materials.

Because certain numbers of total HH respondents allow for different arrangements of these individuals into couples, we provide an example illustration of the possible person combinations that exist with a maximum of three respondents in a HH in table 2.

Table 2.

Notional Table of All Possible HH Person Patterns with Age as an Example Variable.

                   Couple 1             Couple 2             Couple 3
# HH Respondents   Person 1  Person 2   Person 1  Person 2   Person 1
1                  50
2                  45        47
2                  46                   65
3                  47        45         66
3                  70                   48        46
3                  70                   65                   47
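The structural rules above (couples filled in order, and person 2 only present if person 1 is) fully determine which person patterns are possible. As a quick check, the short Python sketch below (our illustration, not part of the LASI codebase) enumerates the valid patterns for HHs with at most P = 3 couples, recovering the six rows of table 2:

```python
from itertools import combinations

def valid_patterns(h, P=3):
    """Enumerate valid assignments of h respondents to couple slots.

    Positions are (l, r): couple l in 1..P, person r in {1, 2}.
    Rules: person 2 requires person 1 in the same couple, and
    occupied couples must be consecutive starting from couple 1.
    """
    slots = [(l, r) for l in range(1, P + 1) for r in (1, 2)]
    patterns = []
    for combo in combinations(slots, h):
        s = set(combo)
        # Person 2 only exists if person 1 exists in the same couple.
        ok = all((l, 1) in s for (l, r) in s if r == 2)
        # Couples are filled in order: occupied couples are 1..len(couples).
        couples = {l for (l, _) in s}
        ok = ok and couples == set(range(1, len(couples) + 1))
        if ok:
            patterns.append(sorted(s))
    return patterns

# Households with 1, 2, or 3 respondents (maximum of three, as in table 2)
counts = [len(valid_patterns(h)) for h in (1, 2, 3)]
print(counts)  # [1, 2, 3] -> six patterns in total, matching table 2
```

The counts 1 + 2 + 3 = 6 correspond to the six rows of the notional table.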

In order to correctly synthesize only the individual characteristics for individuals who exist in the HHs, we must carefully construct the synthesis order, that is, the order of conditional models for synthesizing each variable. We arrange the conditional models as follows3:

  1. Household characteristics: for every k, fit f(Y_{ik} | Y_{i1}, Y_{i2}, …, Y_{i(k−1)})

  2. First individual in the HH: for every j, with l = 1 and r = 1, fit f(X_{ij11} | X_{i111}, X_{i211}, …, X_{i(j−1)11}, Y)

  3. Remaining individuals: starting with l = 1 and r = 2, for every l and r ∈ {1, 2}:

    1. Fit f(Z_{ilr} | Z_{i11}, Z_{i12}, Z_{i21}, …, Z_{i(lr−1)}, Y)4

    2. For every j, fit f(X_{ijlr} | X_{i1lr}, X_{i2lr}, …, X_{i(j−1)lr}, X_{ij11}, X_{ij12}, X_{ij21}, …, X_{ij(lr−1)}, Y, Z_{ilr} = 1)

where i indexes the HHs, k indexes the HH characteristics, j indexes the individual characteristics, l indexes the couples within a HH, and r indexes the person within the couple. Once we fit the models, we generate synthetic HH records, and we only draw synthetic X_{ijlr} values if Z_{ilr} = 1. Because we condition the Z models on the number of respondents as a HH characteristic and on the prior Z variables, we ensure that each HH has the correct number of individuals with synthesized characteristics. Figure 1 gives a visual depiction of the structure and synthesis of the data.
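The conditional sequence above can be sketched in miniature. The toy Python below (our illustration; the actual synthesis uses CART models via synthpop in R) replaces each model fit with a draw from the empirical distribution conditional on the variables synthesized earlier, and only draws the second person's characteristic when the structural flag Z equals 1:

```python
import random

random.seed(0)

# Toy confidential data: each household has a size (Y) and up to two ages (X).
# This stands in for CART-based sequential synthesis: each variable is drawn
# conditional on the variables synthesized before it.
conf = [
    {"size": 1, "age_11": 52, "z_12": 0, "age_12": None},
    {"size": 2, "age_11": 60, "z_12": 1, "age_12": 58},
    {"size": 2, "age_11": 47, "z_12": 1, "age_12": 45},
    {"size": 1, "age_11": 70, "z_12": 0, "age_12": None},
]

def draw(conditional_pool):
    return random.choice(conditional_pool)

synthetic = []
for _ in conf:
    # 1. Household characteristic Y (number of respondents).
    size = draw([h["size"] for h in conf])
    # 2. First individual's characteristic, conditional on Y.
    age_11 = draw([h["age_11"] for h in conf if h["size"] == size])
    # 3a. Structural flag Z for person (1, 2), conditional on Y.
    z_12 = draw([h["z_12"] for h in conf if h["size"] == size])
    # 3b. Second person's characteristic, drawn only if Z = 1.
    age_12 = draw([h["age_12"] for h in conf if h["z_12"] == 1]) if z_12 else None
    synthetic.append({"size": size, "age_11": age_11, "z_12": z_12, "age_12": age_12})

# Structural consistency: age_12 is present exactly when z_12 == 1.
assert all((h["age_12"] is not None) == (h["z_12"] == 1) for h in synthetic)
```

The key design point carried over from the paper is the ordering: structural Z flags are synthesized before the characteristics they gate, so no characteristics are generated for people who do not exist.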

Figure 1.


A Visual Representation of the X, Y, and Z Variables. The arrows depict which variables are conditioned on in the synthesis models.

Given this approach, we can apply standard synthesis models and software (Nowok et al. 2016a) to perform conditional sequential synthesis. After completing the synthesis, we convert the structure back to the structure of the confidential individual level file to produce information in a familiar format for users. Code to replicate our synthesis approach is available.5

Practitioners interested in applying the methods to their own survey data should be able to adapt our code easily by following the concepts laid out in this section. If users would like to utilize model types other than CART, such as sequential linear regressions, it would be trivial to make that change. Households and couples are likely the most common nested levels of observation in surveys, so numerous other applications could use the exact approach detailed in this section. Our method may need additional work to adapt it to surveys with other types of structure, but the conceptual framework provided here should allow future efforts to adapt these ideas to many other survey structures.

4. DOWN-WEIGHTING RISKY RECORDS IN SEQUENTIAL SYNTHESIS MODELS

A few states and territories in the data have very small populations, so they were significantly oversampled to ensure that every Indian state and territory has a minimum number of records for state- and territory-specific analyses. For these locations, the disclosure risk from releasing synthetic data records similar to the confidential records is higher because there are fewer other candidate HHs in these states. In other words, if someone inferred that a HH with certain attributes was present in the confidential data that had contributed to the synthetic data, it would present less risk if the HH was in a large state with tens of thousands of HHs sharing the same attributes than in a small state with only one or a few such HHs.

We provide a new method for managing the risk-utility tradeoff of these oversampled locations by down-weighting records using the record weight feature of CART models. We down-weight records with high probabilities of population uniqueness due to oversampling or unique combinations of characteristics. Hu et al. (2022) first proposed the concept of risk down-weighting for joint Bayesian synthesizers by reducing the contribution of risky records to the likelihood. Our method is conceptually similar, but ours is the first paper, to our knowledge, to provide a way of down-weighting with sequential synthesis models.

We down-weight using the case-weights in the CART models implemented in the rpart package (Breiman et al. 1984; Therneau et al. 2015) in R. This causes the model to put less weight on certain records when choosing splits to grow the tree. The CART models borrow strength across states where HHs are similar, so down-weighting records in oversampled states will have the effect of capturing relationships in small states that are similar to those in larger states but reducing the ability of the model to learn unique relationships in smaller states and reproduce these in the synthetic data. It should not have a substantial impact on larger states.

We use an estimate of identification risk, specifically estimated risk of being unique in the population, as the inverse weight value to down-weight records in the synthesis process for HHs that are at higher risk. We detail how we compute the estimate of population uniqueness in greater detail in section 7. The methodology of down-weighting we provide can be used with any measure of risk, such as the types of risk measures used by Hu et al. (2022). We choose the measure of population uniqueness given the collection methods used for the LASI, but other applications of our methods could use different weights. Evaluations of the down-weighting method for the LASI are presented in section 8.3.
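The mechanics of down-weighting can be sketched simply. The Python below is an illustration only: it assumes (our choice, not stated in the paper) case weights of the form w_i = 1 − r̂_i, and emulates a model's reduced reliance on risky donor records via weighted sampling, whereas the actual synthesis passes the weights to rpart's case-weight argument inside CART models:

```python
import random

random.seed(1)

# Estimated probabilities of population uniqueness for five households
# (hypothetical values; section 7 describes how r_i is estimated).
risk = [0.01, 0.02, 0.01, 0.95, 0.03]

# One simple down-weighting scheme (an assumption for illustration):
# case weight w_i = 1 - r_i, so near-unique records barely influence
# the fitted synthesis models.
weights = [1.0 - r for r in risk]

# Emulate a model "borrowing strength": draw 10,000 donor records with
# these case weights and measure how often the risky record appears.
draws = random.choices(range(len(risk)), weights=weights, k=10_000)
share_risky = draws.count(3) / len(draws)
print(round(share_risky, 3))  # well below the unweighted share of 0.2
```

As intended, the high-risk record's influence shrinks roughly in proportion to its weight, while low-risk records are nearly unaffected.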

Down-weighting records makes sense for synthetic data, particularly since the membership inference attack (MIA) (e.g., Stadler et al. 2022) has become a more common measure of disclosure risk for synthetic data. Our method would be advantageous against the MIA because by definition down-weighting records directly reduces the risk modeled by the MIA. Yet, we choose not to evaluate this metric for our paper because we believe it is more likely for an attacker in our context to match synthetic records based on key variables, which they may know about targeted HHs.6 First, the MIA does not account for sampling uncertainty, which is significant in our context. Second, an attacker would not learn much meaningful information from inferring that a record with certain attributes contributed to the creation of the synthetic data without being able to leverage that information to learn additional attributes. Lastly, an attacker would not have access to the weights we use because they are based on unreleased information. This restricts the ability to successfully conduct the MIA, which needs to replicate the synthesis process.

Similar to our proposed synthesis methods for survey data with multiple levels of observations in section 3, down-weighting by the CART case-weights is simple and easily implementable in existing statistical software. Yet, we find it produces a powerful risk-utility tradeoff curve that enables creators of synthetic data to more easily protect more risky records. We expect that many other surveys contain similar privacy concerns due to significant oversampling of certain populations, whether by geography or demographics. Additionally, because the method is flexible to utilize different measures of disclosure risk to down-weight records, future work could adapt this to other applications where the risk comes from other sources than oversampling or unique combinations of key variables.

5. SYNTHESIZING THE LASI

In the prior sections, we described the general format for synthesizing data with multiple levels of observations. Here, we detail some of the specific decisions we made regarding the functional forms for the synthesis models and the tuning of those models. Additional aspects of the synthesis process, such as the CART tuning parameters and the method for handling survey weights, are detailed in the supplementary data online. Also recall that we generate only a single synthetic data set.

5.1 Synthesis Model and Predictor Selection

We use flexible non-parametric synthesis models using the CART algorithm first proposed for synthesis by Reiter (2005) and implemented in the synthpop package in R (Nowok et al. 2016a, 2016b). All selected variables are synthesized except for State and Rural indicator, resulting in a partially synthetic data set. By not synthesizing State and Rural indicators, we maintain the same number of HHs by State and Rural areas in the synthetic data as are in the confidential data. These two geographic variables formed the core of the sampling frame of the confidential survey data, so we keep the size of the sample and strata fixed.

We exclude the HH survey weight as a predictor of the individual variables because we found that including it overfit the relationship between the HH weight and the individual characteristics. We also include individual characteristics of previous individuals as predictors only for the same characteristic. For example, when synthesizing the third respondent in the HH's age, we include the first and second respondents' ages as predictors, but no other characteristics of the first and second persons. This reduces the number of features in each model; we found that including all of them led to overfitting. Finally, individual survey weights are synthesized conditional only on the HH survey weight, previous persons' individual survey weights, and the corresponding individual's other characteristics.

6. STATISTICAL UTILITY EVALUATION METRICS

Synthetic data are commonly used in different ways, as described in Snoke et al. (2018) and Arnold and Neunhoeffer (2020). Our utility assessment focuses on general utility, measured using distributional distance measures between the confidential and synthetic data. We evaluate the utility of the synthetic data using summary statistics, visual depictions of the distributions, and a distributional distance measure, the pMSE ratio (Snoke et al. 2018). The pMSE ratio extends the pMSE, originally proposed (without a name) by Woo et al. (2009). The pMSE is a distributional distance measure that computes the mean-squared error of the predicted probabilities that records belong to the synthetic data versus the confidential data. Since the proposal of the pMSE, a broad class of general utility distance measures has been developed based on discriminant models between the confidential and synthetic data, such as those described in Bowen et al. (2021) and Raab et al. (2021). Raab et al. (2021) show that the pMSE ratio is among the most powerful and versatile statistics in this class.

We obtain the pMSE ratio by dividing the observed pMSE by its expectation under the null hypothesis that the synthetic data are drawn from the same true data-generating process as the confidential sample. A ratio of 1 implies the synthetic data are distributionally as close as a new sample from the same underlying population, and larger values imply worse synthesis. As a rule of thumb, Raab et al. (2021) suggest anything below 10 represents acceptable synthesis.

We compute distributional comparisons both for the entire data set and for all 1-, 2-, and 3-way combinations of characteristics, separately within the sets of variables measured at the individual and HH levels. When computing the pMSE ratio for 1-, 2-, or 3-dimensional variable comparisons, we use fully saturated logistic models, which is recommended for low-dimensional distributional comparisons (Raab et al. 2021). When we compare the distribution of the entire data set, we use CART models to discriminate between synthetic and confidential data. We utilize two adaptations recommended by Bowen and Snoke (2021) to compute the pMSE ratio: first, we use a training-test split to choose the optimal CART model with the highest area under the curve, and second, we estimate the null by resampling the confidential data only rather than the confidential and synthetic data combined.
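To make the computation concrete, here is a small, self-contained Python sketch of the pMSE and a resampling-based null for a single categorical variable. It is an illustration only (the paper uses saturated logistic and tuned CART discriminators on the real data), and for a saturated model on one categorical variable the fitted propensity in each category is simply the share of synthetic records in that category:

```python
import random

random.seed(2)

def pmse(conf, synth):
    """pMSE with a saturated discriminator on one categorical variable:
    the fitted propensity in each category is the synthetic share there."""
    n_c, n_s = len(conf), len(synth)
    c = n_s / (n_c + n_s)
    cats = set(conf) | set(synth)
    prop = {g: synth.count(g) / (conf.count(g) + synth.count(g)) for g in cats}
    combined = conf + synth
    return sum((prop[g] - c) ** 2 for g in combined) / len(combined)

conf = [random.choice("ABC") for _ in range(500)]
good_synth = [random.choice("ABC") for _ in range(500)]  # same process
bad_synth = ["A"] * 500                                   # distorted process

# Null expectation by resampling the confidential data only: repeatedly
# split it in two and compute the pMSE between the halves.
null = []
for _ in range(200):
    shuffled = random.sample(conf, len(conf))
    null.append(pmse(shuffled[:250], shuffled[250:]))
expected = sum(null) / len(null)

ratio_good = pmse(conf, good_synth) / expected
ratio_bad = pmse(conf, bad_synth) / expected
print(ratio_good < 10 < ratio_bad)  # True
```

The faithful synthesis stays well below the rule-of-thumb threshold of 10, while the distorted one far exceeds it.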

7. DISCLOSURE RISK METRICS

We evaluate two types of disclosure risk: probability of identification and attribute disclosure. To measure identification disclosure risk, we utilize matching on quasi-identifiers (QIs), such as is proposed in Reiter and Mitra (2009). Because this is fully synthetic data within geographies, the identification risk does not represent a “true match,” but an attacker might still attempt to match synthetic records with external information on key variables. If they can do this, they may still attempt to learn attributes based on such matches.

Because we are synthesizing survey data, we do not assume the attacker already knows whether HHs participated, so we also consider the likelihood that a match is unique in the population as part of our identification risk estimate. This distinction matters because the uncertainty that comes from survey sampling provides additional privacy protection. If the attacker does not know which HHs participated in the survey, the likelihood of identification from matching records for a HH sampled from a geography at a rate of 1 out of 100,000 is substantially lower than a HH sampled from a geography at a rate of 1 out of 10.

We estimate identification risk using a combination of sample and population uniqueness, and we follow prior work using an adapted form of the log-linear risk estimation from Skinner and Shlomo (2008). Specifically, we estimate the risk value:

r_i = E[1/F_k | f_k = 1, QI_i], for k such that f_k = 1

where Fk and fk are the number of HHs in the population and sample, respectively, sharing the same characteristics and i indexes the HHs in the confidential data. We encourage interested readers to look at their paper for more details, but we provide an abbreviated summary of the risk computation here.

Assume we organize the HHs in our data as a contingency table with counts of records (f_k) sharing the same QIs (Q_k), and denote the corresponding counts in the population by F_k. To estimate risk, we assume the counts are distributed f_k ~ Poisson(π_k λ_k) and F_k ~ Poisson(λ_k), where π_k is the known inclusion probability of the HH in the survey. We use Poisson regression models to produce estimated λ̂_k values, which are used to create risk estimates:

r_i = (1 − e^{−(1 − π_k) λ̂_k}) / ((1 − π_k) λ̂_k)

We extend their work by using a penalized lasso Poisson regression model, and we select the penalization parameter based on the model selection criteria laid out in Skinner and Shlomo (2008).
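The risk value above has a simple closed form once λ̂_k and π_k are in hand. The Python sketch below illustrates it with hypothetical values (in the paper, λ̂_k comes from the lasso Poisson model and π_k from the sampling design):

```python
import math

def uniqueness_risk(lam_hat, pi):
    """Estimated E[1/F_k | f_k = 1] when F_k - f_k ~ Poisson((1 - pi) * lambda):
    the chance that a sample-unique household is unique in the population."""
    mu = (1.0 - pi) * lam_hat
    if mu == 0:
        return 1.0  # the cell is fully sampled: sample unique is population unique
    return (1.0 - math.exp(-mu)) / mu

# A heavily oversampled small state (pi = 0.1) with few similar households,
# versus a large state (pi = 0.0001) with many (hypothetical lambda values).
r_small = uniqueness_risk(lam_hat=2.0, pi=0.1)
r_large = uniqueness_risk(lam_hat=5000.0, pi=0.0001)
print(round(r_small, 3), round(r_large, 6))  # 0.464 0.0002
```

This captures the point made in section 2: a sample unique from a heavily oversampled, sparsely populated stratum carries far more identification risk than one drawn at a very low rate from a large stratum.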

When generating the log-linear risk estimates, we assume that the attacker seeking to identify records knows only certain QIs from the data. These QIs are the same ones we use when measuring identification risk by matching. Specifically, we assume they know the State indicator, Rural indicator, number of HH residents, and the ages, gender, educational attainment, and employment status of the first two respondents in the HH.7 After computing r estimates for each sample-unique record, we match records with the synthetic data and estimate how many records have unique synthetic data matches, commonly referred to as unique-uniques. Finally, we weight the unique-uniques by their r values, since records with low r values are unlikely to be identified in the population even if they are uniquely matched in the sample.

Given the ability to identify a HH, an attacker may wish to learn about particular sensitive attributes. We estimate the risk of learning information about the total HH income among HHs with high identification risk. We compute two measures of attribute risk. First, we compute the number of records where the observed synthetic value (e.g., income) is within a certain percentage of the confidential value:

Definition 7.1.

Define the a1 attribute risk metric as:

$$ a_1(b, R) = \sum_{j} I(r_j > R)\, I\big( x_j^C (1+b) > x_j^S > x_j^C (1-b) \big) $$

where j = 1, …, J indexes the set of records that are unique-uniques, b is the relative (percentage) range around the confidential value, x_j^C are the confidential values, x_j^S are the synthetic values, and R is the threshold value that determines the set of risky records for which we compute this metric.

This measure is the same as the correct median attribution probability (CMAP) (Feldman and Kowal 2022) up to three changes. First, we only generate one synthetic dataset, so the median value used in the CMAP is simply the single synthetic value. Second, we compute our measure across a subset of records, namely those that meet the identification risk threshold. Lastly, the CMAP uses an absolute difference (denoted by ε), whereas we use a relative difference (denoted by b) to bound the area around the true value that the attacker considers.
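Under these changes, the a_1 computation reduces to counting high-risk records whose synthetic value falls in a relative band around the confidential value. A minimal sketch, assuming positive confidential values and b expressed as a fraction (e.g., 0.10 for within 10 percent):

```python
def a1(b, R, risks, x_conf, x_syn):
    """Count records with identification risk above R whose synthetic value
    lies within a relative band of +/- b around the confidential value."""
    count = 0
    for r, xc, xs in zip(risks, x_conf, x_syn):
        # Band direction assumes positive confidential values
        # (e.g., only nonzero incomes are evaluated).
        if r > R and xc * (1 - b) < xs < xc * (1 + b):
            count += 1
    return count

# b = 0.10 mirrors the a1(10, R) entries in table 4 (within 10 percent)
n_close = a1(0.10, 0.5, [0.6, 0.9, 0.2], [1000.0, 2000.0, 500.0],
             [1050.0, 3000.0, 510.0])
```

In the toy call, only the first record counts: its risk exceeds 0.5 and its synthetic value 1050 lies within the (900, 1100) band around the confidential value 1000.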

Second, we propose a new measure that is conceptually similar to measures recommended by Reiter et al. (2014), though not in a Bayesian framework, and by Stadler et al. (2022). Specifically, we emulate an attacker by training a model on the synthetic data to predict income from other key attributes. We bootstrap this process to create a set of models from which we produce a distribution of predicted income values for the target records. We then compute the risk metric as the mean probability that the confidential income values fall within a certain range of this distribution, given in Definition 7.2.

Definition 7.2.

Define the a2 attribute risk metric as:

$$ a_2(b, R) = \frac{ \sum_{j} I(r_j > R)\, p\big( x_j^C (1+b) > \hat{x}_j^S > x_j^C (1-b) \mid S \big) }{ \sum_{j} I(r_j > R) } $$

where j = 1, …, J indexes the set of records that are unique-uniques, b is the relative range around the confidential value, x_j^C is the confidential value, S is the synthetic data, and x̂_j^S is the value for HH j predicted by the attacker's models.

This measure of risk should be stronger than Definition 7.1 because the attacker leverages the information about the relationships between the variables contained in the synthetic data. We estimate both measures only for unique-unique records rather than for all records, in order to focus on the records with the greatest risk due to identifiability. Additionally, because we use a large number of key variables, the vast majority of the HHs are unique, so we do not expect the risk values to change much if we utilized all records.
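The a_2 computation can be sketched as follows. The text does not specify the attacker's model, so this illustration substitutes a simple one-predictor least-squares fit for the bootstrapped attacker models; the predictor z and all variable names are hypothetical, and the sketch assumes positive confidential incomes and b expressed as a fraction:

```python
import random

def fit_ols(xs, ys):
    """One-predictor least squares: y ~ a + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: fall back to a flat prediction
        return my, 0.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def a2(b, R, risks, x_conf, z_conf, syn_z, syn_x, n_boot=200, rng=None):
    """Mean probability, over unique-unique records with risk above R, that
    an attacker model trained on bootstrapped synthetic data predicts a
    value within +/- b (relative) of the confidential income."""
    rng = rng or random.Random(0)
    targets = [(xc, zc) for r, xc, zc in zip(risks, x_conf, z_conf) if r > R]
    if not targets:
        return 0.0
    hits = [0] * len(targets)
    n = len(syn_x)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        a, slope = fit_ols([syn_z[i] for i in idx], [syn_x[i] for i in idx])
        for j, (xc, zc) in enumerate(targets):
            pred = a + slope * zc
            if xc * (1 - b) < pred < xc * (1 + b):
                hits[j] += 1
    return sum(h / n_boot for h in hits) / len(targets)

# Toy check: synthetic data with an exact linear relation x = 2z, so the
# attacker recovers the income of the single high-risk target almost surely.
syn_z = [float(z) for z in range(1, 11)]
syn_x = [2.0 * z for z in syn_z]
prob = a2(0.5, 0.5, [0.9, 0.2], [6.0, 8.0], [3.0, 4.0], syn_z, syn_x)
```

When the synthetic data encode the income relationship exactly, as in the toy example, a_2 approaches 1; the low values reported in table 4 indicate the synthetic LASI does not leak such precise relationships for high-risk records.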

8. EMPIRICAL RESULTS

We provide detailed utility evaluations of synthetic data generated using the methods described in the previous sections. We provide results on utility measures for the couple and HH distributions, with an extensive additional utility evaluation provided in the supplementary material. Finally, we evaluate the risk-utility tradeoff of using down-weighting or coarsening of states to reduce risk, as discussed in section 4.

8.1 Within Couple and Household Distributional Comparisons

We evaluate how well the synthetic data capture the within-couple and within-HH distributions in the confidential data by looking at the relationships between the male and female partner in the couple8 and different persons in the HH. First, figure 2 shows the distribution of the ages of the female and male partners in the couple in the left panel and the distribution of the ages of the oldest and youngest respondents in the HH in the right panel, among HHs with more than one respondent.9 Apart from a few couples where the female partner is significantly older than the male, the synthetic data accurately captures the distribution in the confidential data. The synthetic data does an even better job, relative to the couples, of preserving the HH age relationships.

Figure 2. Left panel: distribution of the ages of the female and male partners in the couple in the confidential and synthetic data. Right panel: distribution of the ages of the oldest and youngest respondents in the household in the confidential and synthetic data. Both are only among households with more than one respondent.

Next, figure 3 shows the difference between the educational attainment of the male and female partners in each couple in the left panel, and the difference between the most and fewest years of education among respondents in the HH in the right panel, among HHs with more than one respondent. Again, the synthetic data captures the distribution accurately. It slightly under-predicts the number of couples with the same number of years of education, and the errors more often predict the woman having more years of education. Similar to the couples, the synthetic data slightly under-predicts the number of HHs where the difference is zero, but overall it replicates the distribution of educational attainment for individuals within the same HH very accurately.

Figure 3. Left panel: distribution of the educational attainment (in years) of the female and male partners in the couple in the confidential and synthetic data. Right panel: distribution of the educational attainment (in years) of the respondents in the household with the most and least educational attainment in the confidential and synthetic data. Both are only among households with more than one respondent.

8.2 Disclosure Risks of the Synthetic LASI

We evaluate disclosure risks on the same synthetic data set for which we presented utility evaluations. We utilize the measures of identification and attribute disclosure risk described in section 7. Recall that we assume an attacker has access to information (QIs) about the State indicator, Rural indicator, number of HH respondents, the ages, gender, educational attainment, and employment status of the first two respondents in HHs they wish to identify. They match information of the HHs to the synthetic data and look for unique matches in the synthetic data.

For HHs that are unique in the confidential data, we compute the estimated risk of being unique in the population, defined in section 7. Identification risk from this type of matching attack is higher for records that are more likely to be unique in the population, since the attacker is more likely to be matching the target HH.

Table 3 gives the results for the synthetic data. We see that while almost 6 percent of records are synthetic unique-confidential uniques (SU-CUs), the number of SU-CU records declines quickly at higher risk thresholds. At the highest risk level, there are 86 HH records in the synthetic data that are unique and can be matched to confidential records. As we will show in section 8.3, we can expect a certain amount of matching simply by random chance, so we should not expect to see zero HHs in the highest risk categories unless we manually edit the records. Depending on the context, the data maintainer may decide whether or not further protections are merited.

Table 3.

Number of Households with Different Thresholds of Identification Risk

Households N Percent
Total 42311 100
Synthetic unique-confidential unique 2505 5.92
SU-CU: r > 0.1 686 1.62
SU-CU: r > 0.5 215 0.508
SU-CU: r > 0.95 86 0.203

Note.—Risk defined as the likelihood of being unique in the population given a set of QIs. SU-CU defined as records that share a unique set of QIs in both confidential and synthetic data.

To help determine the potential harm from identifying a HH, we estimate the risk of learning information about the total HH income among HHs with high identification risk. We compute two measures as described in section 7. First, we compute a1(b,R), the number of records where the observed synthetic income falls within a certain percentage difference, b, of the confidential income. Second, we compute a2(b,R), the mean probability that the confidential income falls within a percent range, b, given the estimated distribution of synthetic values.

Table 4 provides the results. We see that the number of HHs with high attribute risk decreases as identification risk increases. We find only one HH at the highest level of identification risk where the synthetic income is within 10 percent of the confidential income. Even at the lowest identification risk threshold, we find only 20 HHs where the synthetic income is that close to the confidential value. We also see that the mean probabilities that the confidential value falls within the distribution estimated from the synthetic data are low. For the highest identification risk group, we find a mean probability of 0.22 for a ±50 percent range around the confidential value, suggesting an attacker would likely learn little about even the rough magnitude of HH income for particular records.

Table 4.

Summary of Total Household Income Attribute Risk Values for Households with High Identification Risk

SU-CU households
R = 0.1 R = 0.5 R = 0.95
Total records 294 91 34
a1(50,R) 67 23 9
a1(20,R) 34 12 4
a1(10,R) 20 6 1
a2(50,R) 0.30 0.22 0.22
a2(20,R) 0.11 0.073 0.075
a2(10,R) 0.06 0.034 0.038

Note.—Only households with nonzero confidential income are evaluated.

Reviewing these results as a whole, we believe the risk of real harm to these HHs is very low. Only a very small percentage of HHs could potentially be matched by an attacker, who would also need extensive external information on every possible HH in the data. Given the context from which the data were obtained, a malicious party would likely need either to obtain Indian Census data containing the same QIs as the LASI, which we do not believe exists as public microdata, or to have personal, intimate information about a HH in a very remote area.

Additionally, the attribute risk is quite low, so even if they were able to identify participation of a HH in the survey, it is highly unlikely that an attacker would learn any new meaningful sensitive information. The choice of how much risk is acceptable is ultimately a policy decision, since the only way to achieve zero risk is by not releasing any information. Given the context of the survey and our evaluation, we believe the synthetic data could be safely released.

8.3 Risk-Utility Tradeoff for Additional Protections of States with Small Populations

The risk results in the previous section showed low risk overall, but these risks are not evenly distributed across geographic locations due to the survey oversampling. To better protect individuals in small locations, we evaluate two possible methods. First, one could simply coarsen the geographic information by grouping oversampled (small) states with larger neighbors. This straightforward approach removes any unique relationships between characteristics that exist within the smaller states and instead draws values based on the average relationship across the combined smaller and larger states. We compare this with our proposed method of down-weighting records based on the r measure of identification risk, presented in section 4.

We evaluate three levels of down-weighting corresponding to different levels of protection. We give zero weight to any record with an r value above one of the cutoffs {0, 0.5, 0.9}, effectively suppressing it from the synthesis; the lower the cutoff, the more suppression. This also removes any contribution to the synthesis model from records sampled with certainty. Recall that r estimates the likelihood that a HH record is unique in the population, so replicating records with higher values of r in the synthetic data is riskier. For sample-unique records with risk values below the cutoff, we set the weight to 1 - r^2. For records that have no expected chance of being unique in the population, that is, they are not unique in the confidential data, we set the weight to 1.
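This weighting rule can be written compactly. An illustrative sketch, where sample_unique flags records that are unique in the confidential data on the QIs:

```python
def synthesis_weight(r, cutoff, sample_unique):
    """CART case-weight for a record: records that are not sample unique get
    weight 1; sample uniques with estimated risk r above the cutoff are
    suppressed (weight 0); the rest are down-weighted by 1 - r**2."""
    if not sample_unique:
        return 1.0
    if r > cutoff:
        return 0.0
    return 1.0 - r ** 2
```

With cutoff 0, every sample-unique record with positive risk is suppressed; with cutoff 0.9, only the riskiest records are removed while moderately risky records are smoothly down-weighted.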

We evaluate the risk and utility tradeoff for the different approaches by estimating (1) the overall pMSE ratio for the distributional similarity of the entire synthetic data set to the confidential data set and (2) the identification risk based on the percentage of SU-CU records among those with different values of r. For each synthesis method, we replicate the process 30 times to avoid differences due to random chance and report the mean results.
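The pMSE ratio computation can be sketched as follows, assuming propensity scores (the predicted probabilities of being synthetic from a classifier fit to the stacked confidential and synthetic data) have already been estimated; the null expectation for a logistic propensity model with k parameters follows Snoke et al. (2018):

```python
def pmse(propensities, c):
    """pMSE: mean squared deviation of propensity scores from c, the
    proportion of synthetic rows in the stacked data."""
    return sum((p - c) ** 2 for p in propensities) / len(propensities)

def pmse_ratio(propensities, c, k):
    """Ratio of observed pMSE to its null expectation for a logistic
    propensity model with k parameters (Snoke et al. 2018); values near 1
    indicate the synthetic data are hard to distinguish from the
    confidential data."""
    N = len(propensities)
    expected = (k - 1) * (1 - c) ** 2 * c / N
    return pmse(propensities, c) / expected
```

For example, if every propensity score equals c, the classifier cannot separate the two data sets at all and the pMSE is zero.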

Figure 4 plots the risk and utility values against each other. Each panel represents records with a different level of expected population uniqueness, and we see a similar pattern when comparing the synthesis approaches within each panel. The unweighted approach sits in the upper left corner of each panel, signifying both higher utility and higher risk. Coarsening states does little to mitigate the risk but still decreases the utility. The down-weighting approaches give progressively more protection in exchange for less utility, as we would expect. When only the riskiest records are down-weighted, there is essentially no drop in utility in exchange for increased protection, particularly among the riskiest records. As we lower the down-weighting cutoff, we still maintain some utility while decreasing the risk even further. Results showing utility versus the attribute risk measures are given in the supplementary material.

Figure 4. Risk and Utility for Different Synthesis Methods. Risk, on the y-axis (higher is more risk), is measured by the percentage of SU-CU records with certain risk thresholds. Utility, on the x-axis (lower is more utility), is measured by the pMSE ratio for the distributional closeness of the entire synthetic data set to the confidential data.

Figure 5 shows more granular utility results for two states, one smaller and one larger, that were grouped together in the coarsening approach. The two-way utility metric is the average pMSE ratio value for all two-way distributional comparisons including the variable listed on the y-axis. The small state has a high percentage of records with high-risk values. The left panel shows the utility for the larger state and the right panel for the smaller state. Each shows the results for (1) the full-state synthesis (same as shown above), (2) the grouped synthesis, and (3) the down-weighted synthesis with the cutoff of r > 0.9.

Figure 5. Average Values Across All Two-way pMSE Ratio Values for Individual or Household Variables in the Synthetic Data. Lower values indicate closer distributional similarity with the confidential data. Showing synthesis models using all states, grouped states, and down-weighted records. Left panel showing a large state and the right panel showing a small state, which were grouped together in the grouped synthesis.

We see that down-weighting preserves the accuracy of the two-way relationships for every variable in both the small and the large state, while grouping the states distorts the distributions in each of the individual states. The fractions of SU-CU records at different risk levels in the small state (not shown) display a similar pattern to the overall results shown in figure 4. For states that would otherwise be grouped, the down-weighting approach is thus particularly effective at preserving utility while reducing disclosure risk compared with a coarsening approach.

9. DISCUSSION

We present a novel approach for synthesizing individuals nested within couples and both nested within HHs using simple sequential synthesis methods. This method provides high utility, capturing overall distributional similarity and maintaining the couple and HH relationships between individuals in the data. This approach enables synthesis of this type of survey that could not be accomplished using prior methods.

We also provide a novel method for managing the risk-utility tradeoff by down-weighting records using the record weight feature of CART models. We down-weight records with high probabilities of population uniqueness based on identifying features and sampling rates. For surveys that utilize significant oversampling, some records will be much more identifiable than others, so we need to mitigate their risk. We find that the down-weighting approach provides both better utility and less risk than the alternative of coarsening the geographic information.

We apply our methods to the first wave of the LASI, and we show that we can produce a synthetic version of this survey that maintains high levels of univariate, bivariate, and higher order distributional similarity to the confidential data. We also show that the identification and attribute risk of synthetic data released from this model are low.

Future work on the synthetic LASI could expand the number of variables, either by including more questions from the LASI or by matching the variables here to additional data to synthesize jointly. As we do this, we will need to monitor different parts of the joint distribution of the variables to ensure that it continues to capture the distribution observed in the confidential data. We may also target improvements for variables in the current file that have worse utility, such as ADL or working status.

The methods we present here can be easily adapted to other surveys because they rely on simple synthetic data models that are computationally fast and use publicly available code. Future work on other surveys will need to adapt the restructuring approach to fit the structure of other surveys. For example, they may only contain individuals within HHs without couple information. The specific structure we provide here will not work for every survey, but the concepts we provide should be flexible enough to be used for a wide variety of surveys. In addition, the concepts of down-weighting using the CART case-weights can be utilized, both for surveys with multiple levels of observations and for more classic synthetic data sets. Future work could consider using other risk measures than the one we utilize, and such work could explore the impact of different risk measures on the ability to control the risk-utility tradeoff of the synthesis.

Supplementary Materials

Supplementary materials are available online at academic.oup.com/jssam.


Footnotes

1

Even with variation, the response rates are high compared to many other surveys, varying between 77 and 96 percent across states and territories. By comparison, for example, the average response rates across the 113 studies listed in the supplementary material of Daikeler et al. (2020) were 36 percent for web surveys and 49 percent for surveys using other modes.

2

Note that many "couples" consist of a single individual who does not have a partner.

3

Model parameters not included to simplify notation.

4

We abuse the notation here slightly, since the last prior person could be either l1 or r1.

6

This type of matching is sometimes referred to as a linkage attack because a malicious party seeks to link information they have from other sources with the released data. We describe our disclosure evaluation methods in more detail in section 7.

7

The log-linear models were too sparse when we included information on more respondents, which led to inaccurate results. Also, the likelihood of a match in the synthetic data decreases with additional QIs, so we assume that information on the first two respondents is the optimal amount an attacker would choose to utilize.

8

There are a small number of same-sex couples in the data, but we do not present graphical results on these couples due to disclosure concerns over displaying confidential distributions of age or education for individuals with a rare characteristic in our sample.

9

Note that not all household members are respondents: only individuals age 45+ and their spouses were eligible.

Contributor Information

Joshua Snoke, Dr. Joshua Snoke is with the Economics, Sociology, and Statistics Department, RAND Corporation, 4570 Fifth Avenue, Suite 600, Pittsburgh, PA 15213, USA.

Erik Meijer, Dr. Erik Meijer is with the Department of Economics & Center for Economic and Social Research, University of Southern California, 635 Downey Way, Suite 305, Los Angeles, CA 90089, USA.

Drystan Phillips, Mr. Drystan Phillips is with the Department of Economics & Center for Economic and Social Research, University of Southern California, 635 Downey Way, Suite 305, Los Angeles, CA 90089, USA.

Jenny Wilkens, Ms. Jenny Wilkens is with the Department of Economics & Center for Economic and Social Research, University of Southern California, 635 Downey Way, Suite 305, Los Angeles, CA 90089, USA.

Jinkook Lee, Dr. Jinkook Lee is with the Department of Economics & Center for Economic and Social Research, University of Southern California, 635 Downey Way, Suite 305, Los Angeles, CA 90089, USA.

REFERENCES

  1. Arnold C., Neunhoeffer M. (2020), “Really useful synthetic data—a framework to evaluate the quality of differentially private synthetic data,” arXiv:2004.07740.
  2. Benedetto G., Totty E. (2023), “Synthesizing Familial Linkages for Privacy in Microdata,” Journal of Privacy and Confidentiality, 13. [Google Scholar]
  3. Benedetto G., Stanley J. C., Totty E. (2018), “The Creation and Use of the SIPP Synthetic Beta v7.0,” US Census Bureau. Available at https://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/SSBdescribe_nontechnicalv7.pdf
  4. Bloom D. E., Sekher T., Lee J. (2021), “Longitudinal Aging Study in India (LASI): New Data Resources for Addressing Aging in India,” Nature Aging, 1, 1070–1072. [DOI] [PubMed] [Google Scholar]
  5. Bowen C. M., Snoke J. (2021), “Comparative Study of Differentially Private Synthetic Data Algorithms from the NIST PSCR Differential Privacy Synthetic Data Challenge,” Journal of Privacy and Confidentiality, 11. [Google Scholar]
  6. Bowen C. M., Liu F., Su B. (2021), “Differentially Private Data Release via Statistical Election to Partition Sequentially: Statistical Election to Partition Sequentially,” Metron, 79, 1–31. [Google Scholar]
  7. Breiman L., Friedman J. H., Olshen R. A., Stone C. J. (1984), Classification and Regression Trees (1st ed.), New York: Chapman and Hall/CRC. [Google Scholar]
  8. Bugliari D., Carroll J., Hayden O., Hayes J., Hurd M. D., Lee S., Main R., McCullough C. M., Meijer E., Pantoja P., Rohwedder S. (2024), RAND HRS detailed imputations file 2020 (v1) documentation. Available at https://www.rand.org/content/dam/rand/www/external/labor/aging/dataprod/randhrsimp1992_2020v2.pdf
  9. Chien S., Young C., Phillips D., Wilkens J., Wang Y., Gross A., Meijer E., Angrisani M., Lee J. (2023), Harmonized LASI documentation, version A.3 (2017–2021). Available at https://lasi-dad.org/codebooks/Harmonized%20LASI-DAD%20A.3.pdf
  10. Daikeler J., Bošnjak M., Lozar Manfreda K. (2020), “Web versus Other Survey Modes: An Updated and Extended Meta-Analysis Comparing Response Rates,” Journal of Survey Statistics and Methodology, 8, 513–539. [Google Scholar]
  11. Drechsler J., Reiter J. P. (2009), “Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey,” Journal of Official Statistics, 25, 589–603. [Google Scholar]
  12. Drechsler J., Reiter J. P. (2011), “An Empirical Evaluation of Easily Implemented, Nonparametric Methods for Generating Synthetic Datasets,” Computational Statistics & Data Analysis, 55, 3232–3243. [Google Scholar]
  13. Drechsler J., Reiter J. P. (2012), “Combining Synthetic Data with Subsampling to Create Public Use Microdata Files for Large Scale Surveys,” Survey Methodology, 38, 73–79. [Google Scholar]
  14. Feldman J., Kowal D. R. (2022), “Bayesian Data Synthesis and the Utility-Risk Trade-off for Mixed Epidemiological Data,” The Annals of Applied Statistics, 16, 2577–2602. [Google Scholar]
  15. Hu J., Reiter J. P., Wang Q. (2018), “Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data,” Bayesian Analysis, 13, 183–200. [Google Scholar]
  16. Hu J., Savitsky T. D., Williams M. R. (2022), “Risk-Efficient Bayesian Data Synthesis for Privacy Protection,” Journal of Survey Statistics and Methodology, 10, 1370–1399. [Google Scholar]
  17. Hundepool A., Domingo-Ferrer J., Franconi L., Giessing S., Nordholt E. S., Spicer K., de Wolf P.-P. (2012), Statistical Disclosure Control, United Kingdom: Wiley. [Google Scholar]
  18. Juster F. T., Suzman R. (1995), “An Overview of the Health and Retirement Study,” Journal of Human Resources, 30, S7–S56. [Google Scholar]
  19. Kim H. J., Drechsler J., Thompson K. J. (2021), “Synthetic Microdata for Establishment Surveys under Informative Sampling,” Journal of the Royal Statistical Society Series A: Statistics in Society, 184, 255–281. [Google Scholar]
  20. Little R. J. (1993), “Statistical Analysis of Masked Data,” Journal of Official Statistics, 9, 407–426. [Google Scholar]
  21. Manrique-Vallier D., Hu J. (2018), “Bayesian Non-Parametric Generation of Fully Synthetic Multivariate Categorical Data in the Presence of Structural Zeros,” Journal of the Royal Statistical Society Series A: Statistics in Society, 181, 635–647. [Google Scholar]
  22. NIA (2007), “Growing Older in America: The Health and Retirement Study (NIH publication No. 07-5757),” Technical Report, National Institute on Aging.
  23. Nowok B., Raab G., Snoke J., Dibben C. (2016a), synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control. R package version 1.8-0. Available at https://cran.r-project.org/web/packages/synthpop/index.html
  24. Nowok B., Raab G. M., Dibben C. (2016b), “Synthpop: Bespoke Creation of Synthetic Data in R,” Journal of Statistical Software, 74, 1–26. [Google Scholar]
  25. Perianayagam A., Bloom D., Lee J., Parasuraman S., Sekher T. V., Mohanty S. K., Chattopadhyay A., Govil D., Pedgaonkar S., Gupta S., Agarwal A., Posture A., Weerman A., Pramanik S. (2022), “Cohort Profile: The Longitudinal Ageing Study in India (LASI),” International Journal of Epidemiology, 51, e167–e176. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Raab G. M., Nowok B., Dibben C. (2016), “Practical Data Synthesis for Large Samples,” Journal of Privacy and Confidentiality, 7, 67–97. [Google Scholar]
  27. Raab G. M., Nowok B., Dibben C. (2021), “Assessing, Visualizing and Improving the Utility of Synthetic Data,” arXiv:2109.12717.
  28. Raghunathan T. E., Lepkowski J. M., Van Hoewyk J., Solenberger P. (2001), “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Survey Methodology, 27, 85–96. [Google Scholar]
  29. Raghunathan T. E., Reiter J. P., Rubin D. B. (2003), “Multiple Imputation for Statistical Disclosure Limitation,” Journal of Official Statistics, 19, 1–16. [Google Scholar]
  30. Reiter J. P. (2003), “Inference for Partially Synthetic, Public Use Microdata Sets,” Survey Methodology, 29, 181–188. [Google Scholar]
  31. Reiter J. P. (2005), “Using CART to Generate Partially Synthetic Public Use Microdata,” Journal of Official Statistics, 21, 441–462. [Google Scholar]
  32. Reiter J. P., Kinney S. K. (2012), “Inferentially Valid, Partially Synthetic Data: Generating from Posterior Predictive Distributions Not Necessary,” Journal of Official Statistics, 28, 583. [Google Scholar]
  33. Reiter J. P., Mitra R. (2009), “Estimating Risks of Identification Disclosure in Partially Synthetic Data,” Journal of Privacy and Confidentiality, 1, 99–110. [Google Scholar]
  34. Reiter J. P., Raghunathan T. E. (2007), “The Multiple Adaptations of Multiple Imputation,” Journal of the American Statistical Association, 102, 1462–1471. [Google Scholar]
  35. Reiter J. P., Wang Q., Zhang B. E. (2014), “Bayesian Estimation of Disclosure Risks for Multiply Imputed, Synthetic Data,” Journal of Privacy and Confidentiality, 6, 17–33. [Google Scholar]
  36. Rocher L., Hendrickx J. M., De Montjoye Y.-A. (2019), “Estimating the Success of re-Identifications in Incomplete Datasets Using Generative Models,” Nature Communications, 10, 3069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Rubin D. B. (1993), “Statistical Disclosure Limitation,” Journal of Official Statistics, 9, 461–468. [Google Scholar]
  38. Skinner C., Shlomo N. (2008), “Assessing Identification Risk in Survey Microdata Using Log-Linear Models,” Journal of the American Statistical Association, 103, 989–1001. [Google Scholar]
  39. Snoke J., Raab G. M., Nowok B., Dibben C., Slavkovic A. (2018), “General and Specific Utility Measures for Synthetic Data,” Journal of the Royal Statistical Society Series A: Statistics in Society, 181, 663–688. [Google Scholar]
  40. Stadler T., Oprisanu B., Troncoso C. (2022), "Synthetic Data - Anonymisation Groundhog Day," in Proceedings of the 31st USENIX Security Symposium, Berkeley: USENIX Association, pp. 1451–1468.
  41. Therneau T., Atkinson B., Ripley B., Ripley M. B. (2015), “rpart: Recursive Partitioning and Regression Trees R package version 4.1.16.” Available at https://cran.r-project.org/web/packages/rpart/index.html
  42. Woo M.-J., Reiter J. P., Oganian A., Karr A. F. (2009), “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation,” Journal of Privacy and Confidentiality, 1, 111–124. [Google Scholar]


Articles from Journal of Survey Statistics and Methodology are provided here courtesy of Oxford University Press
