Abstract
CDC WONDER (Centers for Disease Control and Prevention Wide-Ranging Online Data for Epidemiologic Research) is the nation’s primary data repository for health statistics. Before WONDER data are released to the public, data cells based on fewer than 10 cases are suppressed. We showed that maps produced from suppressed data have predictable geographic biases that can be removed by combining population data available in the system with an algorithm that uses regional rates to estimate missing data. By using CDC WONDER heart disease mortality data, we demonstrated that the effects of suppression could be largely overcome.
CDC WONDER (Centers for Disease Control and Prevention Wide-Ranging Online Data for Epidemiologic Research) provides county-level data on directly age-adjusted mortality rates, and age- and gender-stratified mortality and population counts.1 To protect against the potential disclosure of personal health information, WONDER suppresses any statistic (counts or rates) calculated using fewer than 10 observations.2 However, such suppression restricts the utility of WONDER data to compute and map reliable rates for areas with small populations, for short time periods, or for rare diseases.3,4 Furthermore, rates that are indirectly adjusted for age, which are currently not provided by WONDER, can only be calculated for those counties where count data are not suppressed.5,6 Using an example of heart disease mortality, we showed that rates computed from suppressed mortality count data provided by WONDER are biased in predictable ways and that our algorithm can be used to remove these known biases.
DATA SUPPRESSION AND LOCAL MORTALITY RATES
Ignoring data suppression systematically underestimates mortality rates in counties with small populations, which are most often found in rural areas. To illustrate this bias, we examined the spatial patterns of heart disease mortality (2007–2009) using maps constructed from WONDER’s published age-adjusted rates (Figure 1a) as compared with our age-adjusted rates calculated using WONDER’s age-stratified mortality count data (Figure 1b).6 Both maps were directly age-adjusted using 10-year age groups, and thus, if data suppression were not an issue, the maps would display identical spatial patterns.
The map in Figure 1a served as the reference map for heart disease mortality patterns in US counties. By comparison, the map in Figure 1b clearly showed underestimation of mortality rates, especially in the predominantly rural Great Plains region of the United States. The correlation in county-level rates between the 2 maps for all US counties was 0.885 (n = 2970), and the correlation in rates for counties in the Great Plains region was 0.752 (n = 587).
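To make the source of this bias concrete, the following minimal sketch computes a directly age-adjusted rate for a single hypothetical county twice: once from complete counts, and once with every cell below 10 treated as zero, which is the effect of ignoring suppression. The function name, the four age groups, and all counts, populations, and standard weights are invented for illustration and are not WONDER values.

```python
def direct_age_adjusted_rate(deaths, population, std_weights, per=100_000):
    """Directly age-adjusted rate: standard-weighted sum of age-specific rates."""
    return per * sum(
        w * d / p for d, p, w in zip(deaths, population, std_weights)
    )


# Hypothetical county with four age groups (all values are made up).
deaths = [2, 7, 15, 40]
population = [5000, 4000, 2000, 1000]
std_weights = [0.4, 0.3, 0.2, 0.1]  # illustrative weights that sum to 1

# Ignoring suppression amounts to treating cells with 0 to 9 deaths as zero.
visible_deaths = [d if d >= 10 else 0 for d in deaths]

full_rate = direct_age_adjusted_rate(deaths, population, std_weights)
biased_rate = direct_age_adjusted_rate(visible_deaths, population, std_weights)

print(f"rate from complete counts:   {full_rate:.1f} per 100,000")
print(f"rate ignoring suppression:   {biased_rate:.1f} per 100,000")
```

Because dropped cells can only remove deaths from the numerator, the second rate can never exceed the first, and the gap grows as more age groups fall below the threshold.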
The difference between these maps is attributed to data suppression. The WONDER data table (rates) used in the construction of the map in Figure 1a had minimal suppression (∼4%) compared with the WONDER table (counts) used in the construction of the map in Figure 1b (∼30%). These differences in both levels of suppression and mortality rates indicate that some information that was used to create the map in Figure 1a was not available when calculating rates depicted in the map in Figure 1b. We inferred from WONDER’s published suppression guidelines that the information “missing” from our rate calculations was likely age-specific mortality count data and associated crude rates for age groups that have fewer than 10 observations each. In the case of the map in Figure 1a, WONDER is able to release rates calculated using age groups that individually have fewer than 10 observations but that, when used in concert with information for other age groups in the county, result in a final rate calculation that is based on at least 10 cases.
METHODS
WONDER data release policies state that the term “Suppressed” replaces subnational death counts and rates, as well as corresponding population figures, when the figure represents 0 to 9 persons.2 However, population figures corresponding to suppressed data cells are only suppressed when the population counts themselves are between 0 and 9 persons. Because population counts in any cell are rarely fewer than 10 persons, it is possible to compute an expected mortality count for most suppressed cells by multiplying their corresponding population by the applicable regional mortality rate. Our age-adjustment algorithm reduced the impact of data suppression by substituting such an expected value for a suppressed value. Statewide rates or other small-area estimates of mortality may be used as the regional mortality rate in the algorithm.7,8 Regional risk estimates computed at the substate level, such as those derived from agglomerations of neighboring counties, may improve the accuracy of the expected counts by accounting for local variations in rates that may not be captured when using a statewide risk estimate. The map produced by our algorithm using statewide mortality rates (Figure 2a) showed a high degree of similarity to our reference map in Figure 1a. Algorithm details and software are available from http://www.webdmap.com/suppression.
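The substitution step described above can be sketched as follows. This is an illustration under stated assumptions rather than the published software: the function names are hypothetical, suppressed cells are represented as `None`, the regional rates are assumed to be age-specific and expressed per person, and all numbers are invented.

```python
def impute_suppressed(deaths, population, regional_rates):
    """Replace suppressed death counts (None) with expected counts.

    Expected count = age-specific population * age-specific regional rate.
    A cell whose population is also suppressed (None) is left as None.
    """
    filled = []
    for d, p, r in zip(deaths, population, regional_rates):
        if d is not None:
            filled.append(float(d))   # observed count, kept as is
        elif p is not None:
            filled.append(p * r)      # expected count replaces "Suppressed"
        else:
            filled.append(None)       # nothing to estimate from
    return filled


def direct_age_adjusted_rate(deaths, population, std_weights, per=100_000):
    """Directly age-adjusted rate; cells that are still None contribute nothing."""
    total = 0.0
    for d, p, w in zip(deaths, population, std_weights):
        if d is None or p is None:
            continue
        total += w * d / p
    return per * total


# Hypothetical county: the two youngest age groups are suppressed.
deaths = [None, None, 15, 40]
population = [5000, 4000, 2000, 1000]
statewide_rates = [0.0005, 0.0015, 0.007, 0.035]  # assumed, per person
std_weights = [0.4, 0.3, 0.2, 0.1]                # illustrative, sum to 1

filled = impute_suppressed(deaths, population, statewide_rates)
unadjusted_rate = direct_age_adjusted_rate(deaths, population, std_weights)
adjusted_rate = direct_age_adjusted_rate(filled, population, std_weights)
```

The sketch does not constrain an expected count to the 0 to 9 range that a suppressed cell is known to occupy. Whether to apply such a cap is a design choice that the description above leaves open.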
RESULTS
The correlation in county-level rates with the reference map (Figure 1a) improved from 0.885 for the unadjusted map (Figure 1b) to 0.976 for the adjusted map (Figure 2a) (n = 2970), and the correlation for counties in the Great Plains region improved from 0.752 to 0.922 (n = 587). Thus, the algorithm removed 69% of the original variation in rates across all counties and 72.8% of the variation in rates in the Great Plains region. The spatial patterns of this improvement can be seen in the maps in Figure 2b and 2c.
DISCUSSION
Our results suggest 2 ways to address the problem of rate underestimation caused by suppressed WONDER data. First, data distributors could provide information about the degree to which a user’s data request was suppressed to help them understand the impact of data suppression on their analysis and avoid misinterpretation. Such information could include the number of suppressed cells, as well as the proportion of the population that was subject to suppression. Second, CDC WONDER data users who seek to use mortality count data may consider utilizing an adjustment algorithm, as described above, to overcome biases caused by data suppression.
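As an illustration of the first suggestion, a data distributor or user could summarize suppression for a request along the following lines. This is a hypothetical sketch: the function name, the `(deaths, population)` cell layout with `None` marking a suppressed count, and the example values are all assumptions, not part of WONDER.

```python
def suppression_summary(cells):
    """Summarize suppression for a list of (deaths, population) cells.

    deaths is None when the cell is suppressed; population is None only
    when the population figure itself is suppressed.
    """
    suppressed = [(d, p) for d, p in cells if d is None]
    total_pop = sum(p for _, p in cells if p is not None)
    suppressed_pop = sum(p for _, p in suppressed if p is not None)
    return {
        "n_cells": len(cells),
        "n_suppressed": len(suppressed),
        "share_cells_suppressed": len(suppressed) / len(cells),
        "share_population_suppressed": suppressed_pop / total_pop,
    }


# Hypothetical request: four age-group cells for one county.
cells = [(None, 5000), (None, 4000), (15, 2000), (40, 1000)]
summary = suppression_summary(cells)

for key, value in summary.items():
    print(f"{key}: {value}")
```

Reporting the population share alongside the cell count matters because a few suppressed cells can cover a large part of a small county's population.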
Human Participant Protection
Review board approval was not needed because we used only publicly available, de-identified data.
References
- 1. Centers for Disease Control and Prevention. General help for CDC WONDER. Available at: http://wonder.cdc.gov/wonder/help/main.html. Accessed October 20, 2013.
- 2. Centers for Disease Control and Prevention. Underlying cause of death 1999–2010 help. Available at: http://wonder.cdc.gov/wonder/help/ucd.html. Accessed October 20, 2013.
- 3. Beyer KMM, Tiwari C, Rushton G. Five essential properties of disease maps. Ann Assoc Am Geogr. 2012;102(5):1067–1075.
- 4. Tiwari C, Rushton G. Using spatially adaptive filters to map late stage colorectal cancer incidence in Iowa. In: Fisher P, editor. Developments in Spatial Data Handling. Berlin, Germany: Springer-Verlag; 2005. pp. 665–676.
- 5. Pickle LW, White AA. Effects of the choice of age-adjustment method on maps of death rates. Stat Med. 1995;14(5-7):615–627. doi: 10.1002/sim.4780140519.
- 6. Curtin LR, Klein RJ. Direct standardization (age-adjusted death rates). Healthy People Stat Notes. 1995;6:1–10.
- 7. Clayton D, Kaldor J. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 1987;43(3):671–681.
- 8. Mungiole M, Pickle LW, Simonson KH, White AA. Application of a weighted head-banging algorithm to mortality data maps. Stat Med. 1999;18(23):3201–3209. doi: 10.1002/(sici)1097-0258(19991215)18:23<3201::aid-sim310>3.0.co;2-u.