American Journal of Epidemiology. 2017 Jun 5;186(1):83–91. doi: 10.1093/aje/kwx050

Protecting Confidentiality in Cancer Registry Data With Geographic Identifiers

Mandi Yu *, Jerome Phillip Reiter, Li Zhu, Benmei Liu, Kathleen A Cronin, Eric J (Rocky) Feuer
PMCID: PMC5860429  PMID: 28453646

Abstract

The National Cancer Institute's Surveillance, Epidemiology, and End Results Program releases research files of cancer registry data. These files include geographic information at the county level, but no finer. Access to finer geography, such as census tract identifiers, would enable richer analyses—for example, examination of health disparities across neighborhoods. To date, tract identifiers have been left off the research files because they could compromise the confidentiality of patients’ identities. We present an approach to inclusion of tract identifiers based on multiply imputed, synthetic data. The idea is to build a predictive model of tract locations, given patient and tumor characteristics, and randomly simulate the tract of each patient by sampling from this model. For the predictive model, we use multivariate regression trees fitted to the latitude and longitude of the population centroid of each tract. We implement the approach in the registry data from California. The method results in synthetic data that reproduce a wide range (but not all) of analyses of census tract socioeconomic cancer disparities and have relatively low disclosure risks, which we assess by comparing individual patients’ actual and synthetic tract locations. We conclude with a discussion of how synthetic data sets can be used by researchers with cancer registry data.

Keywords: breast cancer; classification and regression trees; health disparities; multiple imputation; partial synthetic data; Surveillance, Epidemiology, and End Results Program


Population-based cancer registry data, such as those collected by the National Cancer Institute (NCI)'s Surveillance, Epidemiology, and End Results (SEER) Program, are central for quantifying cancer burdens related to socioeconomic deprivation. Unfortunately, cancer registry data usually do not include information on patient-level socioeconomic status (SES). As a result, users often utilize neighborhood SES measures as proxies or contextual data (1, 2).

To facilitate the construction of such proxies, it is beneficial for data stewards to release information on patients’ places of residence. However, releasing detailed geographic data could introduce unacceptable risks of rendering individuals identifiable, particularly since the data usually include a number of demographic variables on the individuals that could be matched to other data sources. Hence, many data stewards aggregate geographic data before release. For example, to access SEER data, researchers have the options to 1) apply for the regular research file with the patient's county of residence as the finest level of geographic information or 2) request access to a special research data set with precalculated quintiles constructed using several census-tract SES attributes as the only substate geographic information (2). Of course, both approaches limit the flexibility that secondary data users can have in analyzing data at finer geographic levels.

As alternatives, data stewards can apply disclosure limitation methods to the geographic data, such as geomasking (3–5) and partially synthetic data (6–10). In partially synthetic data approaches, the data steward estimates a statistical model for the geographic variables given the attributes of the geographic areas and replaces the actual geographic information with draws from the model. In this way, the partially synthetic data can preserve spatial relationships from the actual data while not releasing actual locations. Multiple draws of the synthetic geographies are released, thereby allowing analysts to incorporate uncertainty due to the simulation when making inferences (11, 12). To date, methods have been proposed for generating synthetic data with latitudes and longitudes (9, 10), with modestly aggregated geographic areas like blocks (7, 8), and with substantially aggregated geographic areas like census tracts or counties (6). Synthetic data products have been developed for a number of government databases, including the Survey of Consumer Finances (13), the Survey of Income and Program Participation (14), the Longitudinal Business Database (15, 16), and the group quarters data in the American Community Survey (17).

In this article, we propose to generate synthetic census tracts for the SEER registry data with the goal of providing finer geographic details in the research file. Our setting is most similar to that of Burgette and Reiter (6), who used a Bayesian multinomial probit model to synthesize tracts from individual-level tracts. However, as the authors noted, those models may not fit well when there is a large number of tracts or covariates, as in the SEER data. We therefore extend the approach used by Wang and Reiter (10), who proposed using classification and regression trees (CARTs) as models for point-referenced geographic areas, to simulate tract values. We apply the methodology to SEER data for female breast cancers diagnosed in California in 2012. We present a variety of evaluations of the data utility (i.e., the usefulness and limitations of the synthetic data sets), as well as evaluations of the disclosure risks in releasing the synthetic tracts.

METHODS

The SEER Program collects data on cancer cases diagnosed in the SEER registries’ catchment areas (18). Information collected includes data on cancer diagnostic factors (e.g., grade, stage, size of tumor), patient's demographic characteristics (e.g., age, sex, race, ethnicity), and residential address at the time of diagnosis. However, only the geocoded state, county, and residential census tract data are submitted to the NCI, per the cooperative agreements. According to the registries, 95.9% of census tracts were derived from complete and valid street addresses for cancer cases diagnosed in SEER 17 registry areas in 2012 (19). For the remaining cases, the tracts were either assigned on the basis of zip codes (3.8%) or not obtained (0.2%). Considering the high quality of SEER census tract data, we anticipate the impact of missing and imputed tracts on the overall results to be minimal.

We replace patients’ actual census tracts with simulated values, leaving all other variables intact. To do so, we treat the latitude and longitude of each tract's population-weighted centroid as a bivariate continuous outcome. We predict this outcome conditional on some set of covariates and simulate new tracts for every patient. We separately fit models and synthesize tracts within each county, thereby ensuring that each individual remains in the same county as in the research file. Thus, the synthetic data match the actual SEER data when aggregated at the county level. We note that the synthetic and actual tracts are identical for counties with only 1 census tract. We do not view releasing the tract identification as incurring additional disclosure risk in this case. We define tracts using Federal Information Processing Series (FIPS) codes.

Data sources

We use data from the November 2014 SEER data submission in California (19), which includes 25,034 female breast cancers diagnosed in 2012 from 7,074 census tracts defined on the basis of the 2010 tract boundaries. Forty-four cases without any tract information are excluded. The final sample comprises 24,990 cancers occurring in 24,403 patients. A small percentage of patients (approximately 2%) have more than 1 cancer. For simplicity, we treat each cancer as coming from a distinct patient. Census tracts for 2 cases diagnosed in counties that contain a single tract are kept unchanged in the synthetic data. We select commonly analyzed variables to be the predictors (20–22). Patient- and cancer-level predictors are drawn from the SEER data. Tract-level predictors are obtained from the 2009–2013 American Community Survey 5-year estimates (23). Table 1 displays the predictors according to the level of aggregation. Sample distributions of patient- and cancer-level variables are shown in Web Table 1 (available at http://aje.oxfordjournals.org/).

Table 1.

Patient-, Case-, and Census Tract-Level Attributes of Women With Malignant Breast Cancer, California SEER Registries, 2012

Level of Attribute | Description of Attribute
Patient level | Age, race, Hispanic origin, marital status, place of birth, and health insurance status
Case level | Surgery, chemotherapy, hormonal therapy, radiotherapy, tumor stage, grade, and subtype (estrogen receptor, progesterone receptor, HER2), and autopsy case status
Census tract level |
  Race/ethnicity | % NH white, % NH black, % NH Asian/Pacific Islander, % American Indian/Alaska Native, % Hawaiian, and % Hispanic
  Socioeconomic status | % below federal poverty line, % unemployed, median family income ($/year), % with bachelor's degree, % with high school diploma, and % who completed ninth grade
  Age | % aged <18 years and % aged ≥65 years
  Immigration status | % foreign-born and % with language isolation

Abbreviations: HER2, human epidermal growth factor receptor 2; NH, non-Hispanic; SEER, Surveillance, Epidemiology, and End Results.

Synthesis model: CART models for simulating census tracts

To model the latitude and longitude, we use a bivariate CART model (24). CART models are appealing in that they can handle a large number of covariates of different types and capture complex relationships automatically. The basic idea of CART models (25) is to recursively partition the data on one predictor at a time, so that each subset is increasingly homogenous with regard to the outcome. CART models have been used to model spatial relationships in environmental and ecological data (24, 26) and to generate synthetic data (27, 28).

For i = 1, ..., n, let (ϕ_i, λ_i) be the latitude and longitude for tract i, and let z_i be a q-dimensional set of tract-level covariates. Let x_ij be the p-dimensional cancer-level attributes for patient j in tract i, where j = 1, ..., n_i. For the SEER data, ϕ_ij = ϕ_i and λ_ij = λ_i for all (i, j). We fit a bivariate regression tree, denoted T, of (ϕ_ij, λ_ij) on x_ij or possibly (x_ij, z_i) using the MVPART procedure in R (R Foundation for Statistical Computing, Vienna, Austria). We require each leaf of the tree to include a minimum of 5 cases, and we continue splitting the data as long as doing so increases the overall R^2 value by a factor of e^(−10). This maximizes the size of the trees.

Using T, we follow a 2-step process to simulate the FIPS codes. First, we drop each x_ij or (x_ij, z_i) down T. For each case, we randomly select, with replacement, a working latitude and longitude from the cancer's corresponding leaf. Second, we replace each working latitude and longitude with a FIPS code in a way that ensures that the marginal distributions of FIPS codes in the synthetic data match those in the actual data. To do so, we concatenate the working and actual latitudes and longitudes and create an indicator such that I = 1 for working values and I = 0 for actual values. We fit a logistic regression of I on the concatenated latitudes and longitudes. Using this model, we compute the predicted probability that I = 1 for each of the 2n concatenated cases. We then match cases, without replacement, in the working data to those in the actual data, using nearest-neighbor matching on the predicted probabilities. We set each cancer's synthetic FIPS code to the FIPS code for its corresponding match. We repeat this 2-step selection multiple times to obtain multiple synthetic tracts that can be released for research use.

The CART synthesizer works most effectively when the bivariate response surface defined by latitude and longitude is explained reasonably well (but not perfectly) by the attributes. When this is not the case (for example, when demographic characteristics change abruptly in contiguous tracts), the synthetic data can have low quality in some dimensions.
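
The 2-step procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: the leaf assignments are taken as given (in practice they would come from the fitted tree T), the logistic-regression probabilities are replaced by a placeholder `score` function, and the matching is done greedily. All function names, variable names, and tract labels are hypothetical.

```python
import random

def draw_working_locations(leaf_ids, centroids, rng):
    """Step 1: sample, with replacement, a centroid from each case's leaf."""
    by_leaf = {}
    for leaf, xy in zip(leaf_ids, centroids):
        by_leaf.setdefault(leaf, []).append(xy)
    return [rng.choice(by_leaf[leaf]) for leaf in leaf_ids]

def match_to_fips(working_scores, actual_scores, actual_fips):
    """Step 2: greedy nearest-neighbor matching without replacement.

    Each working case receives the FIPS code of the closest actual case
    that has not been matched yet.
    """
    available = list(range(len(actual_scores)))
    synthetic = []
    for w in working_scores:
        best = min(available, key=lambda k: abs(actual_scores[k] - w))
        available.remove(best)
        synthetic.append(actual_fips[best])
    return synthetic

# Toy county: 5 cases in 3 tracts (labels and coordinates are made up).
actual_fips = ["A", "A", "B", "C", "C"]
centroids = [(34.0, -118.2), (34.0, -118.2), (34.1, -118.3),
             (34.2, -118.4), (34.2, -118.4)]
leaf_ids = [0, 0, 0, 1, 1]  # assumed result of dropping each case down T

def score(xy):
    # Placeholder for the fitted logistic-regression probability that I = 1.
    return xy[0]

rng = random.Random(2012)
working = draw_working_locations(leaf_ids, centroids, rng)
synthetic_fips = match_to_fips([score(w) for w in working],
                               [score(c) for c in centroids],
                               actual_fips)
```

Because every actual record is matched exactly once, the tract counts in `synthetic_fips` equal those in `actual_fips`, which is the marginal-preservation property described above.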

Selecting the predictors

We consider 4 models with different synthesis strategies and predictors, as shown in Table 2. Model 1 includes only the patient- and cancer-level predictors. Model 2 adds tract-level predictors, primarily as a comparison to see the effect on risk and the utility of adding tract-level information. Model 3 uses the same attributes as model 1 but fits separate models for each racial/ethnic group. This may help the synthesizer estimate more accurately the distribution of tract locations by race/ethnicity, which is central in evaluations of cancer disparities. Model 4 extends model 3 by fitting separate models for each race/ethnicity and cancer-stage stratum. For models 3 and 4, for strata with fewer than 20 cancers, we do not fit CART models but generate synthetic tracts by sampling randomly from the observed tract values in the county. We assess the disclosure risks and the utilities of synthetic data sets generated from each model, and we ultimately use the model that leads to the highest degree of utility with acceptable confidentiality protection.
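
The small-stratum rule for models 3 and 4 can be sketched as a simple dispatch. This is an illustrative Python sketch only: `synthesize_stratum` and `stub_cart` are hypothetical names, the CART synthesizer is replaced by a stub, and sampling with replacement from the county's observed tracts is an assumption, since the text does not specify it.

```python
import random

MIN_STRATUM_SIZE = 20  # threshold stated in the text

def synthesize_stratum(stratum_tracts, county_tracts, cart_synthesizer, rng):
    """Use the CART synthesizer for large strata, random sampling otherwise."""
    if len(stratum_tracts) < MIN_STRATUM_SIZE:
        # Assumption: draw with replacement from the county's observed tracts.
        return [rng.choice(county_tracts) for _ in stratum_tracts]
    return cart_synthesizer(stratum_tracts)

def stub_cart(tracts):
    # Hypothetical stand-in for a fitted CART synthesizer.
    return list(reversed(tracts))

rng = random.Random(0)
county = ["A", "B", "C"] * 10                      # 30 observed tract values
small = synthesize_stratum(["A", "B", "C"], county, stub_cart, rng)
large = synthesize_stratum(county, county, stub_cart, rng)
```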

Table 2.

Census Tract Synthesis Strategies and Attributes Included in Each Candidate Model, California SEER Registries, 2012

Candidate Model | Synthesis Strategy | Attributes Included
Model 1 | Overall synthesis | Patient- and cancer-level attributes
Model 2 | Overall synthesis | Patient-, cancer-, and census tract-level attributes
Model 3 | Synthesis stratified by patient race/ethnicityᵃ | Patient- and cancer-level attributes with stratifiers excluded
Model 4 | Synthesis stratified by patient race/ethnicity and cancer stageᵇ | Patient- and cancer-level attributes with stratifiers excluded

Abbreviation: SEER, Surveillance, Epidemiology, and End Results.

a Race/ethnicity categories included non-Hispanic white, non-Hispanic black, non-Hispanic Asian/Pacific Islander, Hispanic, and other.

b Cancer stage was coded according to the Seventh Edition of the AJCC Cancer Staging Manual (31) and included the categories early (stage IIA and earlier), regional (stage IIB and later, not including stage IV), distant (stage IV), and unknown.

Evaluation of data utility

We compare synthetic and actual data on several commonly analyzed descriptive and inferential statistics. We compute the percentage difference of the synthetic point estimates relative to the actual estimates. We also compute the 95% confidence interval overlap measure of Karr et al. (29), which finds the probability mass in common in the confidence intervals estimated with the synthetic and actual data. This measure takes a maximum value of 0.95 and a minimum value of zero. The more overlap in the confidence intervals, the less loss of utility in the synthetic data.
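
The two agreement measures can be sketched in Python. The article does not reproduce the formula of Karr et al. (29), so the overlap function below is an assumed form: it takes normal sampling distributions and averages the probability mass that each estimate's distribution places on the other estimate's 95% confidence interval. This form is consistent with the stated maximum of 0.95 and minimum of zero, but it should be checked against reference 29 before use. All function names are hypothetical.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def percent_difference(synthetic_est, actual_est):
    """Difference of the synthetic estimate relative to the actual one, in %."""
    return 100.0 * (synthetic_est - actual_est) / actual_est

def ci_overlap_probability(est_s, se_s, est_a, se_a):
    """Assumed form: average mass each distribution puts on the other's CI."""
    def mass(est, se, lo, hi):
        return normal_cdf((hi - est) / se) - normal_cdf((lo - est) / se)
    lo_s, hi_s = est_s - Z95 * se_s, est_s + Z95 * se_s
    lo_a, hi_a = est_a - Z95 * se_a, est_a + Z95 * se_a
    return 0.5 * (mass(est_s, se_s, lo_a, hi_a) + mass(est_a, se_a, lo_s, hi_s))

identical = ci_overlap_probability(0.55, 0.095, 0.55, 0.095)
shifted = ci_overlap_probability(-0.152, 0.056, -0.135, 0.056)
disjoint = ci_overlap_probability(0.0, 1.0, 100.0, 1.0)
```

With the rounded Quintile 2 values from Table 3 as inputs, this assumed form gives roughly 0.94; identical intervals give 0.95, and widely separated intervals give values near zero.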

For descriptive statistics, we estimate incidence rates by race/ethnicity and quintile of census-tract median family income per 100,000 persons. All rates are age-adjusted to the 2000 US standard population (30). The race/ethnicity groups we consider include non-Hispanic white, non-Hispanic black, non-Hispanic Asian/Pacific Islander, and Hispanic. We do not consider other racial/ethnic groups (e.g., American Indians and Alaska Natives, Hispanic blacks, etc.), because the sample sizes are too small to yield reliable results, even with the actual data. In addition to income, we also consider operationalizing SES using other measures, such as the percentage of the population living below the federal poverty line, the percentage of residents unemployed, and the percentage of residents with a college degree, in sensitivity analyses. The sensitivity to a finer level of population aggregation by SES decile and vigintile is also evaluated.
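
Age adjustment by direct standardization can be sketched as below. This is a generic illustration, not the SEER software: the age groups, counts, and weights are made-up placeholders (they are not the 2000 US standard population weights), and `age_adjusted_rate` is a hypothetical name.

```python
def age_adjusted_rate(cases, population, std_weights, per=100_000):
    """Directly standardized rate: weighted sum of age-specific rates."""
    total_weight = sum(std_weights.values())
    rate = 0.0
    for age_group, weight in std_weights.items():
        age_specific = cases[age_group] / population[age_group]
        rate += (weight / total_weight) * age_specific
    return rate * per

# Placeholder inputs for one race/ethnicity x income-quintile cell.
cases = {"<65": 30, "65+": 40}
population = {"<65": 100_000, "65+": 20_000}
std_weights = {"<65": 0.75, "65+": 0.25}  # hypothetical, not the real standard

rate = age_adjusted_rate(cases, population, std_weights)
```

The same function would be applied once with tract income quintiles derived from actual tracts and once for each synthetic data set, so that the resulting rates can be compared.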

We also evaluate the performance of synthetic data in preserving statistics for small samples by comparing rates of incidence for subgroups, such as late-stage cancers and cancers of various molecular subtypes, by race/ethnicity and SES. Breast cancers coded with a stage of IIB or later according to the Seventh Edition of the AJCC Cancer Staging Manual (31) are considered late-stage. Subtype groups are formed by combining joint hormone receptor (HR) and human epidermal growth factor receptor 2 (HER2) status: luminal A (HER2−/HR+), luminal B (HER2+/HR+), HER2-enriched (HER2+/HR−), and triple negative (HER2−/HR−), as defined by SEER (32).
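
The subgroup definitions above amount to two small lookups, sketched here in Python. The stage-group labels and the boolean coding of receptor status are assumptions about how the variables might be stored; the function names are hypothetical.

```python
# Stage groups treated as late-stage (IIB or later); the label strings are
# an assumption about how the stage variable is coded.
LATE_STAGES = {"IIB", "IIIA", "IIIB", "IIIC", "IV"}

# Keys are (HER2 positive, HR positive).
SUBTYPES = {
    (False, True): "luminal A",         # HER2-/HR+
    (True, True): "luminal B",          # HER2+/HR+
    (True, False): "HER2-enriched",     # HER2+/HR-
    (False, False): "triple negative",  # HER2-/HR-
}

def is_late_stage(stage_group):
    return stage_group in LATE_STAGES

def subtype(her2_positive, hr_positive):
    return SUBTYPES[(her2_positive, hr_positive)]
```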

Population data for race/ethnicity by tract income quintile are available from the 2010 Census. Individuals could have reported multiple races in the 2010 Census, whereas only 1 race is extracted by SEER. Because of this incompatibility, we follow the methodology developed by Yu and Gibson (M. Yu and J. T. Gibson, National Cancer Institute, unpublished data, 2015) to allocate responses of multiple/not specified to one of 4 single races (white, black, American Indian/Alaska Native, and Asian/Pacific Islander) for each age, sex, and Hispanic-origin group. The resulting estimates match the modified county-level population data used by the NCI in the 1975–2012 Cancer Statistics Review reports (33): the single-race county totals when summed by county and the tract totals when summed by race (34).

For inferential statistics, we estimate the impact of SES on the likelihood of late-stage diagnosis by fitting a multilevel regression model. The predictors include age, race/ethnicity, subtype, and tract income quintile. This model represents a typical research question of interest to data users (35, 36).

Assessment of disclosure risk

Using 2 measures, we evaluate the risks that ill-intentioned users (intruders) could reidentify patients’ tracts of residence. The first measure is the “switch rate”: the percentage of patients whose synthetic tracts differ from their actual ones. The switch rate is calculated separately for each synthetic data set. The second measure is the “hit rate,” which captures the likelihood of reidentifying the actual tract using the information from multiple synthetic data sets. Here, we suppose that the intruder uses each patient's most frequently occurring census tract across all synthetic data sets as the best guess for the actual tract. When the best guess agrees with the actual value, we call it a “hit.” The “hit rate” is the percentage of hits across the entire data set.
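
Both risk measures can be sketched in a few lines of Python. The data are toy values, the function names are hypothetical, and the tie-breaking rule (the first tract encountered among equally frequent ones) is an assumption, since the text does not say how ties in the intruder's best guess are resolved.

```python
from collections import Counter

def switch_rate(actual, synthetic):
    """Percentage of patients whose synthetic tract differs from the actual."""
    changed = sum(a != s for a, s in zip(actual, synthetic))
    return 100.0 * changed / len(actual)

def hit_rate(actual, synthetic_sets):
    """Percentage of patients whose most frequent synthetic tract is correct."""
    hits = 0
    for j, truth in enumerate(actual):
        guesses = Counter(ds[j] for ds in synthetic_sets)
        best_guess = guesses.most_common(1)[0][0]  # ties: first seen wins
        hits += best_guess == truth
    return 100.0 * hits / len(actual)

actual = ["A", "B", "C", "D"]
synthetic_sets = [
    ["A", "C", "C", "B"],
    ["B", "B", "C", "A"],
    ["A", "B", "D", "A"],
]
switch_rates = [switch_rate(actual, ds) for ds in synthetic_sets]
hits = hit_rate(actual, synthetic_sets)
```

In this toy example each data set switches half of the tracts, yet the intruder's majority vote recovers 3 of the 4 actual tracts, which illustrates why the hit rate is assessed across data sets.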

To help interpret these measures, we compare the risks for the synthetic data generated by the 4 models with the risk for a geographic perturbation approach, in which we randomly switch tracts within a county, disregarding other variables (hereafter called random-switch). We use 1,000 replications of the random-switch technique to obtain precise estimates of switch rates and hit rates. When the switch rates and hit rates for a model-based synthesis procedure are not too different from those for random-switch, arguably there is not much risk in releasing the synthetic tracts from model-based approaches.
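
A random-switch baseline can be sketched as a permutation of tract labels within each county. Treating the switch as a within-county shuffle is an assumption (the text says only that tracts are switched at random within a county), and the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict

def random_switch(counties, tracts, rng):
    """Shuffle tract labels among the cases of each county."""
    index_by_county = defaultdict(list)
    for i, county in enumerate(counties):
        index_by_county[county].append(i)
    switched = list(tracts)
    for indices in index_by_county.values():
        shuffled = [tracts[i] for i in indices]
        rng.shuffle(shuffled)
        for i, tract in zip(indices, shuffled):
            switched[i] = tract
    return switched

counties = ["X", "X", "X", "Y", "Y", "Z"]
tracts = ["x1", "x2", "x3", "y1", "y2", "z1"]
rng = random.Random(1000)
switched = random_switch(counties, tracts, rng)
```

Repeating the call many times and computing switch rates and hit rates on each replicate would give the reference values against which the model-based synthesizers are compared.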

RESULTS

Evaluation of data utility

Figure 1 displays plots of confidence interval overlap probabilities of female breast cancer incidence rates for the synthetic and actual data (left panels) and the percent difference in rates as a fraction of the actual data rate (right panels), for models 1–4. Separate rates are calculated for all racial/ethnic groups (non-Hispanic white, non-Hispanic black, non-Hispanic Asian/Pacific Islander, and Hispanic) by income quintile. For models 1 and 2, we see large deviations in rates, apparent as low overlap probabilities and large absolute values in the percent differences. Further examination reveals that they mainly occur among non-Hispanic white, non-Hispanic black, and Hispanic patients. Indeed, this motivated us to propose models 3 and 4. Model 3 performs the best in preserving inferences across all race/ethnicity × income groups. Evidently, the stratification improves estimates. Further stratification by stage as implemented in model 4 causes a slight loss in data quality, but only for non-Hispanic blacks, for which the synthesis process tends to move patients from low-SES groups to high-SES groups. Because of inferior utility results in models 1 and 2, we focus on models 3 and 4 in subsequent evaluations.

Figure 1.

Female breast cancer incidence by race/ethnicity and census-tract income quintile, overall and in 4 racial/ethnic groups (non-Hispanic (NH) white, NH black, NH Asian/Pacific Islander (API), and Hispanic), in models 1–4, California, 2012. Left-hand panels (A, C, E, and G) show the 95% confidence interval (CI) overlap probability based on data generated from models 1, 2, 3, and 4, respectively. Right-hand panels (B, D, F, and H) show the percent difference in the age-adjusted incidence rate based on data generated from models 1, 2, 3, and 4, respectively. Income quintiles: squares, quintile 1; circles, quintile 2; triangles, quintile 3; pluses, quintile 4; multiplication signs, quintile 5.

Figure 2 displays similar plots for late-stage breast cancers. Because model 4 stratifies on stage, we expect the highest levels of agreement for model 4. This is borne out in the synthetic data, evident as high overlap probabilities and low percent differences in rates across all groups.

Figure 2.

Incidence rates of late-stage female breast cancer by race/ethnicity and census-tract income quintile, overall and in 4 racial/ethnic groups (non-Hispanic (NH) white, NH black, NH Asian/Pacific Islander (API), and Hispanic), in models 3 and 4, California, 2012. Left-hand panels (A and C) show the 95% confidence interval (CI) overlap probability based on data generated from models 3 and 4, respectively. Right-hand panels (B and D) show the percent difference in the age-adjusted incidence rate based on data generated from models 3 and 4, respectively. Income quintiles: squares, quintile 1; circles, quintile 2; triangles, quintile 3; pluses, quintile 4; multiplication signs, quintile 5.

Web Figure 1 presents the confidence interval overlap probabilities and percent differences for rates by subtype. Most exceed 80%. A few are between 60% and 80%; these occur mainly among HER2+/HR− and HER2+/HR+ subtypes, where the sample sizes are relatively small. This pattern holds for all race/ethnicity and income groups. The random pattern in percent differences suggests no systematic biases.

Taking the results of all analyses thus far, it appears that model 4 offers the highest utility. The detailed rates and agreement measures plotted in Figures 1 and 2 and Web Figure 1 for model 4 are included in Web Tables 2–7.

Using the synthetic data from model 4, we next compare regression coefficients for predicting late-stage diagnosis using multilevel logistic regression. The covariates include age, tract income quintile, race/ethnicity, and subtype. Goodness-of-fit statistics, including the Akaike information criterion, the Bayesian information criterion, and log-likelihood, suggest that the fixed-effects model (without assuming random effects for county and tract) fits best for both the actual data and each of the 5 synthetic data sets. Table 3 displays the coefficients estimated from the model 4 synthetic data and the actual data. For all effects, the point estimates are close and the variance estimates are almost identical. The almost perfectly overlapped confidence intervals indicate that the synthesis procedure reproduces these results quite accurately.

Table 3.

Multilevel Logistic Regression Coefficients (and Standard Errors) for the Likelihood of Late-Stage Breast Cancer Based on Synthetic Data Generated From Model 4 and Actual Data, California SEER Registries, 2012

Variable | Synthetic Data | Actual Data | 95% CI Overlap Probability
Intercept | 0.548 (0.095) | 0.550 (0.095) | 0.950
Age, years | −0.015 (0.001) | −0.015 (0.001) | 0.948
Census tract income quintile | | |
  Quintile 1 (low)ᵃ | 0 | |
  Quintile 2 | −0.152 (0.056) | −0.135 (0.056) | 0.940
  Quintile 3 | −0.207 (0.055) | −0.188 (0.055) | 0.937
  Quintile 4 | −0.285 (0.055) | −0.268 (0.054) | 0.938
  Quintile 5 (high) | −0.418 (0.055) | −0.427 (0.055) | 0.947
Race/ethnicity | | |
  NH whiteᵃ | 0 | |
  NH black | 0.242 (0.062) | 0.240 (0.062) | 0.950
  NH Asian/Pacific Islander | −0.143 (0.049) | −0.144 (0.049) | 0.950
  Hispanic | 0.190 (0.043) | 0.188 (0.043) | 0.950
Subtype | | |
  HER2+/HR+ᵃ | 0 | |
  HER2+/HR− | 0.270 (0.078) | 0.263 (0.078) | 0.949
  HER2−/HR+ | −0.528 (0.048) | −0.528 (0.048) | 0.950
  HER2−/HR− | −0.163 (0.063) | −0.165 (0.063) | 0.950

Abbreviations: CI, confidence interval; HER2, human epidermal growth factor receptor 2; HR, hormone receptor; NH, non-Hispanic; SEER, Surveillance, Epidemiology, and End Results.

a Reference category.

We also evaluate a commonly used measure of SES health disparities, the relative concentration index (RCI) (37). Unlike previously assessed measures, which examine each income group separately, the RCI summarizes differences in health outcomes across the entire income distribution. A positive RCI indicates that the most advantaged groups (i.e., high SES) have a worse health outcome (cancer incidence) than the least advantaged groups. The RCI results for model 4 are summarized in Web Table 8. Overall, the synthetic data from model 4 offer the best quality for most health disparity statistics. In the few cases (8 out of 30) with relatively low agreement (confidence interval overlap probability of 70%–80%), we reach the same inferential conclusions when using the RCI to test for the presence and direction of SES disparities.

Web Tables 9–12 show results from sensitivity analyses of alternative SES measures and scales for model 4. SES quintiles formed by tract-level poverty, unemployment, or education yield levels of utility similar to those obtained from the income quintiles (Web Table 9). Finer stratification of the population by SES decile and vigintile does not reduce data utility (Web Tables 10–12).

Evaluation of disclosure risk

Due to disclosure rules at the NCI, we cannot publish the exact switch rates and hit rates in this article. Instead, we broadly summarize the risk evaluation results. Model 2 results in entirely unacceptable levels of risk, with the lowest switch rates and the highest hit rates. Including the tract-level predictors results in many cases' never switching tracts. In some sense, using tract-level predictors is akin to predicting an outcome with some function of the outcome as a predictor, so it is not surprising that we get many “perfect” predictions. We also try a slightly reduced version of model 2, removing nonracial census tract attributes, but the rates are still unacceptable. Therefore, we do not recommend model 2 or the use of tract-level predictors to predict tract centroids in general. Models 1, 3, and 4 have similar levels of risk, with model 3 having the highest risk and model 1 having the lowest risk. The switch and hit rates for all 3 models are close to those for random-switch.

We also evaluate the relationships between the switch rate and features of the tracts. In general, switch rates increase with the sample size in a tract and with the number of tracts in a county. Tracts with unacceptable risks (low rates), which tend to have small populations, could be aggregated into larger regions to reduce risks without much sacrifice in overall data quality.

DISCUSSION

Based on the analyses of risk and utility, we select model 4 as the synthesizer. It reproduces reasonably well the analyses of racial and SES disparities in cancer incidence while offering low probabilities that patients’ synthetic tracts will match their actual tracts. We generated 5 partially synthetic data sets from model 4, so as to allow users to account for inference uncertainties (12).
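
To show how an analyst might use the 5 data sets, the sketch below applies the combining rules commonly used for partially synthetic data (the approach cited as reference 12): the point estimate is the average across data sets, and the variance is the average within-data-set variance plus the between-data-set variance divided by the number of data sets. The input numbers are made up, and the function name is hypothetical.

```python
from statistics import mean, variance

def combine_partially_synthetic(estimates, variances):
    """Combine m per-data-set estimates and their variance estimates."""
    m = len(estimates)
    q_bar = mean(estimates)      # combined point estimate
    u_bar = mean(variances)      # average within-data-set variance
    b = variance(estimates)      # between-data-set variance (divisor m - 1)
    total_variance = u_bar + b / m
    return q_bar, total_variance

# Hypothetical coefficient estimates from 5 synthetic data sets.
estimates = [-0.150, -0.155, -0.149, -0.153, -0.153]
variances = [0.056 ** 2] * 5

q_bar, total_variance = combine_partially_synthetic(estimates, variances)
```

The square root of `total_variance` plays the role of the standard error when forming confidence intervals from the synthetic data.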

As the analyses show, synthetic data can support reliable inference in many settings. However, not all analyses can be accurately preserved. Thus, the primary uses of synthetic data may be routine cancer surveillance or directing data-capturing and processing systems, where the analyses are known in advance. The extended use of synthetic data for research, particularly research involving complicated analyses that are unknown to data stewards, probably needs to be restricted to initial development of analytical strategies and exploratory data analysis. Subsequent verification analyses, to determine whether the analytical results on synthetic data are true for the actual data, would help to establish the public's trust in synthetic data results and provide feedback for further improvement in synthetic data modeling. To take advantage of such benefits, the NCI and the 4 cancer registries in California have agreed to make the synthetic data sets available upon request and under agreement with the NCI. Through collaborations with the NCI, data users may have their analyses conducted with the synthetic data and validated using the actual data. The actual data results (after the performance of disclosure checks to ensure that the output does not represent an unacceptable risk) will be shared with the analysts, so that the impact of synthesis on statistical inferences can be evaluated. Variables routinely released in the SEER research data file (38) but not used in the process of generating the synthetic data will also be made available, providing more flexibility to users to choose study topics. The NCI plans to use feedback from these validation studies in further refinement of the synthesis process. More information about how to request special SEER data is described elsewhere (39).

Similar validation systems have been employed by the Census Bureau for the synthetic Survey of Income and Program Participation (40) and Longitudinal Business Database products (41). For example, user accounts for the synthetic Longitudinal Business Database have grown from 1–2 accounts in 2010 to almost 40 in 2014 (42). Researchers are actively building systems intended to automate such feedback (see Reiter et al. (43) and McClure and Reiter (44)). SEER has incorporated longitudinal data for approximately 8 million cancers across 20 registries since its inception in 1973. Some registries joined later. In future work, we plan to expand the synthesis to all participating SEER registries and to synthesize data back to the year 2000 to facilitate analyses of time trends in incidence rates and survival analyses.

This study focused on female patients. Because neighborhood SES may capture differential contextual effects for men and women, future research on cancers occurring in people of both sexes is needed.

Supplementary Material

Web Material

ACKNOWLEDGMENTS

Author affiliations: Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Rockville, Maryland (Mandi Yu, Li Zhu, Benmei Liu, Kathleen A. Cronin, Eric J. (Rocky) Feuer); and Department of Statistical Science, Trinity College of Arts and Sciences, Duke University, Durham, North Carolina (Jerome P. Reiter).

This work was supported in part with funds from the National Cancer Institute under contract HHSN261201200121P.

Conflict of interest: none declared.

REFERENCES

1. Krieger N, Chen JT, Waterman PD, et al. Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: does the choice of area-based measure and geographic level matter?: the Public Health Disparities Geocoding Project. Am J Epidemiol. 2002;156(5):471–482.
2. Yu M, Tatalovich Z, Gibson JT, et al. Using a composite index of socioeconomic status to investigate health disparities while protecting the confidentiality of cancer registry data. Cancer Causes Control. 2014;25(1):81–92.
3. Armstrong MP, Rushton G, Zimmerman DL. Geographically masking health data to preserve confidentiality. Stat Med. 1999;18(5):497–525.
4. Hampton KH, Fitch MK, Allshouse WB, et al. Mapping health data: improved privacy protection with donut method geomasking. Am J Epidemiol. 2010;172(9):1062–1069.
5. Zandbergen PA. Ensuring confidentiality of geocoded health data: assessing geographic masking strategies for individual-level data. Adv Med. 2014;2014:567049.
6. Burgette LF, Reiter JP. Multiple-shrinkage multinomial probit models with applications to simulating geographies in public use data. Bayesian Anal. 2013;8(2):453–478.
7. Machanavajjhala A, Kifer D, Abowd J, et al. Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008 (IEEE catalog no. CFP08026-PRT). Washington, DC: Institute of Electrical and Electronics Engineers; 2008:277–286.
8. Paiva T, Chakraborty A, Reiter J, et al. Imputation of confidential data sets with spatial locations using disease mapping models. Stat Med. 2014;33(11):1928–1945.
9. Quick H, Holan SH, Wikle CK, et al. Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography. Spat Stat. 2015;14(C):439–451.
10. Wang H, Reiter JP. Multiple imputation for sharing precise geographies in public use data. Ann Appl Stat. 2012;6(1):229–252.
11. Little RJ. Statistical analysis of masked data. J Off Stat. 1993;9(2):407–426.
12. Reiter JP. Inference for partially synthetic, public use microdata sets. Surv Methodol. 2003;29(2):181–188.
13. Kennickell AB. Multiple imputation and disclosure protection: the case of the 1995 Survey of Consumer Finances. In: Alvey W, Jamerson B, eds. Record Linkage Techniques—1997. Proceedings of an International Workshop and Exposition. Washington, DC: National Academy Press; 1997:248–267.
14. Abowd JM, Stinson M, Benedetto G. Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project. Suitland, MD: Longitudinal Employer-Household Dynamics Program, Bureau of the Census, US Department of Commerce; 2006. https://ecommons.cornell.edu/handle/1813/43929. Accessed May 15, 2015.
15. Kinney SK, Reiter JP, Miranda J. Improving the Synthetic Longitudinal Business Database. (Paper CES-WP-14-12). Suitland, MD: Center for Economic Studies, Bureau of the Census, US Department of Commerce; 2014. http://econpapers.repec.org/paper/cenwpaper/14-12.htm. Accessed May 15, 2015.
16. Kinney SK, Reiter JP, Reznek AP, et al. Towards unrestricted public use business microdata: the Synthetic Longitudinal Business Database. Int Stat Rev. 2011;79(3):362–384.
17. Drechsler J. Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. (Lecture Notes in Statistics, vol. 201). New York, NY: Springer Science & Business Media; 2011.
18. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. List of SEER registries. http://seer.cancer.gov/registries/list.html. Accessed July 15, 2016.
19. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. SEER*Stat Database: Incidence—SEER 20 Regs, November 2014 Submission (1973–2013 Varying)—Linked to County Attributes—Total U.S., 1969–2013 Counties. Rockville, MD: National Cancer Institute; 2015.
20. Bauer KR, Brown M, Cress RD, et al. Descriptive analysis of estrogen receptor (ER)-negative, progesterone receptor (PR)-negative, and HER2-negative invasive breast cancer, the so-called triple-negative phenotype: a population-based study from the California Cancer Registry. Cancer. 2007;109(9):1721–1728.
21. MacKinnon JA, Duncan RC, Huang Y, et al. Detecting an association between socioeconomic status and late stage breast cancer using spatial analysis and area-based measures. Cancer Epidemiol Biomarkers Prev. 2007;16(4):756–762.
22. Ward E, Jemal A, Cokkinides V, et al. Cancer disparities by race/ethnicity and socioeconomic status. CA Cancer J Clin. 2004;54(2):78–93.
23. Bureau of the Census, US Department of Commerce. 2009–2013 American Community Survey. (Summary file). Washington, DC: US Department of Commerce; 2015. http://ftp2.census.gov/. Accessed May 8, 2017.
24. De'ath G. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology. 2002;83(4):1105–1117.
25. Breiman L, Friedman J, Stone CJ, et al. Classification and Regression Trees. 1st ed. (Wadsworth Statistics/Probability). Boca Raton, FL: CRC Press; 1984.
26. Bel L, Allard D, Laurent JM, et al. CART algorithm for spatial data: application to environmental and ecological data. Comput Stat Data Anal. 2009;53(8):3082–3093.
27. Reiter JP. Using CART to generate partially synthetic public use microdata. J Off Stat. 2005;21(3):441–462.
28. Drechsler J, Reiter JP. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput Stat Data Anal. 2011;55(12):3232–3243.
29. Karr AF, Kohnen CN, Oganian A, et al. A framework for evaluating the utility of data altered to protect confidentiality. Am Stat. 2006;60(3):224–232.
30. Tiwari RC, Clegg LX, Zou Z. Efficient interval estimation for age-adjusted cancer rates. Stat Methods Med Res. 2006;15(6):547–569.
31. Edge S, Byrd DR, Compton CC, et al., eds. AJCC Cancer Staging Manual. 7th ed. New York, NY: Springer Publishing Company; 2011.
32. Howlader N, Altekruse SF, Li CI, et al. US incidence of breast cancer subtypes defined by joint hormone receptor and HER2 status. J Natl Cancer Inst. 2014;106(5):dju055.
33. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. Previous version: SEER Cancer Statistics Review, 1975–2012. https://seer.cancer.gov/archive/csr/1975_2012/. Updated November 18, 2015. Accessed May 8, 2017.
34. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. SEER*Stat Database: Populations—Total U.S. (1990–2014): Single Ages to 85+, Katrina/Rita Adjustment—Linked to County Attributes—Total U.S., 1969–2014 Counties. Rockville, MD: National Cancer Institute; 2015.
35. Kuo TM, Mobley LR, Anselin L. Geographic disparities in late-stage breast cancer diagnosis in California. Health Place. 2011;17(1):327–334.
36. McLafferty S, Wang F, Luo L, et al. Rural-urban inequalities in late-stage breast cancer: spatial and social dimensions of risk and access. Environ Plann B Plann Des. 2011;38(4):726–740.
37. Kakwani N, Wagstaff A, van Doorslaer E. Socioeconomic inequalities in health: measurement, computation, and statistical inference. J Econom. 1997;77(1):87–103.
38. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. Documentation for the ASCII text data files. http://seer.cancer.gov/data/documentation.html. Accessed July 15, 2016.
39. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. Specialized SEER*Stat datasets. http://seer.cancer.gov/resources/specialized.html. Accessed July 15, 2016.
40. Bureau of the Census, US Department of Commerce. Survey of Income and Program Participation. Synthetic SIPP data. https://www.census.gov/programs-surveys/sipp/guidance/sipp-synthetic-beta-data-product.html. Revised July 14, 2016. Accessed May 8, 2017.
41. Bureau of the Census, US Department of Commerce. Synthetic Longitudinal Business Database (SynLBD). Syn LBD beta data. https://www.census.gov/ces/dataproducts/synlbd/. Accessed May 8, 2017.
42. Vilhuber L. Broadening data access through synthetic data. Presented at the NCRN Meeting Spring 2015, Washington, DC, May 7–8, 2015. Ithaca, NY: NSF-Census Research Network, Cornell University; 2015. (NCRN Coordinating Office preprint 1813:40185). https://ecommons.cornell.edu/handle/1813/40185. Accessed May 15, 2015.
43. Reiter JP, Oganian A, Karr AF. Verification servers: enabling analysts to assess the quality of inferences from public use data. Comput Stat Data Anal. 2009;53(4):1475–1482.
44. McClure DR, Reiter JP. Towards providing automated feedback on the quality of inferences from synthetic datasets. J Priv Confid. 2012;4(1):171–188.

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press
