American Journal of Public Health. 2020 Jun;110(6):829–832. doi: 10.2105/AJPH.2020.305611

The Use of Small Area Estimates in Place-Based Health Research

Amanda Y Kong, Xingyou Zhang
PMCID: PMC7204458  PMID: 32298183

Abstract

Interest in the impact of the built environment on health behaviors, outcomes, and disparities is increasing, and the growing development of statistical modeling techniques has allowed researchers to better investigate these relationships. However, without enough data that are identifiable at smaller geographic levels (e.g., census tract), place-based health researchers are unable to reliably estimate the prevalence of a health outcome at these more granular and potentially more salient neighborhood levels.

When reliable direct survey estimates cannot be produced because of small samples or a lack of samples, estimates based on small area estimation techniques are often used. As place-based health research and the production and secondary use of small area estimates increase, it is critical that researchers understand both the underlying methods used to create these estimates and their limitations. Without this foundation, researchers may fit inappropriate models, or interpret findings inaccurately.

As a demonstrative example, we focus this discussion on the small area health indicator estimates recently produced through the 500 Cities Project by the Robert Wood Johnson Foundation, the Centers for Disease Control and Prevention (CDC), and the CDC Foundation.


Interest by both researchers and communities in the impact of neighborhood characteristics on health behaviors and outcomes is flourishing. Increased interest in health inequities, policies regulating the built environment, and the dissemination of geospatial technologies and data has greatly contributed to this trend.1 Furthermore, increasing statistical computing power and the growing popularity and feasibility of implementing multilevel statistical models have allowed researchers to more accurately determine whether compositional or contextual variables are associated with health behaviors and outcomes.2 Generally, compositional effects represent those that are due to variation in characteristics of individuals, whereas contextual effects represent those that are due to variation in the neighborhood environment that individuals may reside in.2,3 For example, a researcher may be interested in investigating whether individual-level factors or built environment factors are driving physical activity levels. One could fit multilevel models to examine whether an individual’s self-efficacy to exercise (compositional variable) or their accessibility to parks (contextual variable) is predictive of physical activity levels. Where individual-level data are not available, researchers often rely on aggregate group-level data to test associations between group- or area-level variables.2

A major difficulty of place-based health research is defining a neighborhood that is salient to the priority population or health phenomenon, as well as accessing data that include geoidentifiers at that neighborhood level.1 To date, administratively defined areas, such as census block groups, census tracts, and zip codes, are commonly used to operationalize neighborhoods.1 However, without enough data that are identifiable at smaller geographic scales, place-based health researchers are unable to reliably estimate the prevalence of an outcome at more granular neighborhood levels.4–6 Furthermore, local jurisdictions with limited resources to conduct community health assessments may only have access to estimates of health indicators at larger geographic levels (e.g., state, county) that are not relevant to or representative of what may be occurring in their local neighborhoods.

Small area estimates are those that are indirectly estimated for a geographic area for which a survey cannot produce valid and reliable direct estimates because of factors such as low response rates, limited sample sizes, or a lack of samples.5,6 Small area estimates can be a useful indicator for local health behavior and disease surveillance purposes, and they have also been used in secondary analyses to identify potential community-level factors that may be associated with health behaviors or disease prevalence.7–9 Additionally, small area estimates are typically more precise than direct survey estimates for areas with small samples.5,6,10,11 As neighborhood and health research continues to grow, and the release and use of small area estimates increases, it is critical that researchers understand both the underlying methods used to create these estimates and their limitations.

The purpose of this commentary is to provide an overview of the methods used to generate small area estimates and to discuss the limitations of their use in answering research questions. Although there are several methods to create small area estimates, most follow similar procedures that combine individual-level data, area group composition population estimates, and multilevel mixed effects regression techniques. As a demonstrative example, we focus our discussion, which may broadly be applied to other small area estimates, on those recently produced through the 500 Cities Project.

PRODUCING 500 CITIES SMALL AREA ESTIMATES

Through a partnership between the Robert Wood Johnson Foundation (RWJF), the Centers for Disease Control and Prevention (CDC), and the CDC Foundation, the 500 Cities Project (500 Cities) released a publicly available data set that includes 27 small area estimates of chronic disease health behaviors and health outcomes.12 Researchers produced city- and census tract–level estimates for 500 cities, including the largest 497 US cities and three additional cities, to ensure representation from all states.12 Together, these cities represent 33.4% of the total US population.12 The 500 Cities Project is innovative because of its national urban representation and inclusion of an array of health indicator prevalence estimates at small geographic area levels. The data set is intended to provide descriptive estimates of health indicators in an effort to supplement existing surveillance data, help local jurisdictions expand community engagement to identify potential health issues, and establish key health objectives.12

Researchers created the small area estimates from 500 Cities by combining data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS)13 with Census Bureau and American Community Survey14 population estimates. The BRFSS is a national surveillance system that uses random digit dialing to administer a telephone survey (cellphone and landline) to collect demographic and behavioral health risk data for US adults in all 50 states, the District of Columbia, and three US territories.13 Nearly half a million adult BRFSS surveys are completed annually.13 Although data are collected at the individual level, only county-level indicators are included, and direct area prevalence estimates are available only at the state level and for some large metropolitan areas.13 Researchers used a multilevel regression and poststratification approach to generate estimates for the 500 cities and the census tracts within them.10 Details of this methodology are briefly described on the CDC Web site, which also refers readers to several journal articles that first used this method.10,15,16 Here, we summarize this methodology with a focus on the conceptual interpretation of this approach to aid in understanding what these model estimates represent.

To produce small area model estimates in 500 Cities, researchers fit multilevel logistic models to national BRFSS data to determine an individual’s probability of having a health behavior or outcome as a function of several compositional sociodemographic attributes, including a person’s age, gender, and race/ethnicity.10 Models included county and census tract poverty estimates from the American Community Survey and county and state random effects. County and state random effects statistically account for the correlation of individuals living in the same county (or state). These random effects also represent unmeasured contextual factors that may affect an individual’s probability of having a health indicator. However, the impacts of measured contextual effects (other than area poverty) are not directly incorporated into model estimates. Other contextual variables were not included as predictors of model estimates because, although one contextual variable may be an important predictor for one geographic area, it may not be appropriate for another.10 For example, two cities may have a similarly high obesity prevalence. However, the factors driving the high prevalence may differ: perhaps lower accessibility to parks and recreation centers in one city and high caloric intake in the other city. Finally, to estimate city- and tract-level prevalence, the models were poststratified to their respective area group composition population estimates using 2010 Census demographic data.10 In other words, the produced 500 Cities small area estimates represent the expected prevalence estimate of a health indicator for census tracts and cities given the sociodemographic characteristics (age, gender, and race/ethnicity) of the individuals in the census tract or city, census tract–level poverty, and unmeasured contextual factors in the county and state. 
The produced values are therefore akin to synthetic estimates, and they do not represent the actual observed or direct prevalence in a census tract or city.5,6 In addition, because a single global statistical model is fit to create these model estimates, some estimates may be more valid for certain health indicators or geographic areas than for others.
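The poststratification step described above can be illustrated with a minimal sketch. It assumes that a multilevel model has already produced a predicted probability for each demographic cell; the cell definitions, probabilities, and population counts below are invented for illustration and are not 500 Cities values, and the function name is hypothetical.

```python
# Hypothetical illustration of poststratification; all numbers are made up.

def poststratify(cell_probs, cell_counts):
    """Return the population-weighted average of predicted probabilities.

    cell_probs:  dict mapping a demographic cell to the model-predicted
                 probability of the health indicator for that cell.
    cell_counts: dict mapping the same cells to census population counts.
    """
    total = sum(cell_counts.values())
    if total == 0:
        raise ValueError("area has no population")
    weighted = sum(cell_probs[cell] * n for cell, n in cell_counts.items())
    return weighted / total


# Assumed predicted probabilities by (age group, gender) cell.
probs = {
    ("18-44", "F"): 0.10,
    ("18-44", "M"): 0.20,
    ("45+", "F"): 0.30,
    ("45+", "M"): 0.40,
}

# Two tracts with identical cell probabilities but different composition.
younger_tract = {
    ("18-44", "F"): 400, ("18-44", "M"): 400,
    ("45+", "F"): 100, ("45+", "M"): 100,
}
older_tract = {
    ("18-44", "F"): 100, ("18-44", "M"): 100,
    ("45+", "F"): 400, ("45+", "M"): 400,
}

younger_estimate = poststratify(probs, younger_tract)  # about 0.19
older_estimate = poststratify(probs, older_tract)      # about 0.31
```

In this sketch the two tracts differ only in who lives there, yet their estimates differ, which is the sense in which a model-based estimate reflects sociodemographic composition rather than an observed local prevalence.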

VALIDATING 500 CITIES SMALL AREA ESTIMATES

Prior studies comparing estimates produced using the 500 Cities methodology to the “gold standard” of direct survey estimates have found validity to vary by health indicator and geographic scale. In a national study, internal validation of county-level small area estimates of chronic obstructive pulmonary disease compared with BRFSS direct survey estimates indicated a high Pearson’s correlation of 0.88.10 In Missouri, Pearson’s correlations between BRFSS direct survey estimates and county-level small area estimates of chronic obstructive pulmonary disease, current smoking, diabetes, obesity, and uninsured status ranged from 0.731 to 0.960.15 External validation (i.e., comparing small area estimates with direct survey estimates from data that were not used to produce them) with locally collected data indicated correlations of 0.28 for obesity and no health insurance, 0.40 for cigarette smoking, 0.51 for diabetes, and 0.69 for chronic obstructive pulmonary disease.15 In a sample of cities in Massachusetts, external validation comparing city-level MDPHnet (an electronic health record–based surveillance platform) disease prevalence with 500 Cities small area estimates indicated Pearson’s correlations ranging from 0.65 (obesity prevalence) to 0.89 (diabetes prevalence).17 Finally, an external validation study comparing 500 Cities estimates to Boston BRFSS direct survey estimates indicated good to strong validity for several city-level factors (e.g., cigarette smoking, high blood pressure, diabetes), but not for others (e.g., binge drinking, obesity, mental distress, physical distress); additionally, validity was not as strong for subcity levels.16
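The validation studies above summarize agreement with Pearson’s correlation. As a minimal sketch of that calculation, the following computes the correlation between paired direct and model-based estimates; the prevalence values are invented for illustration, and the helper name is hypothetical. The second comparison shows a general property of the statistic: a set of estimates that is uniformly offset from the direct estimates is still perfectly correlated with them, so a difference measure is reported alongside.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Invented county-level prevalences (proportions), for illustration only.
direct = [0.12, 0.18, 0.25, 0.31, 0.22]
model = [0.15, 0.17, 0.22, 0.27, 0.24]
r_model = pearson_r(direct, model)

# Estimates offset by 5 percentage points everywhere: correlation stays
# at 1 even though every estimate differs from its direct counterpart.
shifted = [d + 0.05 for d in direct]
r_shifted = pearson_r(direct, shifted)
mean_abs_diff = sum(abs(d - s) for d, s in zip(direct, shifted)) / len(direct)
```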

Validation of small area estimates is not an easy task, as external sources (such as health surveys conducted by local jurisdictions) may also suffer from nonresponse, bias, and suppression. The validity of health indicators may also depend on the actual predictors entered into the multilevel prediction model. Additionally, at finer geographic scales, direct survey estimates may be less reliable and have greater bias, and it also may become more difficult to incorporate local variations in public health interventions and social, economic, and cultural contexts into small area estimates.16 In this way, it is difficult to know what the “true” or “valid” prevalence of a neighborhood health indicator is for comparison purposes. When interpreting and disseminating study results, researchers using small area estimates, and specifically those produced by 500 Cities, should be acutely aware that the validity of these estimates may vary substantially by health indicator, geographic area, and geographic scale.18

OTHER CONSIDERATIONS

With the release of this wealth of small area health data, several publications have appeared in which the authors use regression methods to conduct secondary analyses of the 500 Cities small area estimates. To date, researchers have used 500 Cities to investigate associations of gentrification7 and park quality8 with physical health, and to examine neighborhood-level income and racial disparities in the prevalence of smoking,19 obesity,9 and poor mental health.20 Researchers conducting secondary regression analyses of small area estimates should be cognizant of several additional interpretation issues. First, although the county and state random effects model some county- and state-level variation in the estimates, neighborhood-level or other local contextual factors that may have a large impact on a health indicator will not be directly accounted for in model estimates. For example, if a city implements a tobacco control policy that substantially reduces smoking prevalence, this policy effect will not be directly accounted for in the 500 Cities small area estimate, resulting in an area smoking prevalence estimate that may be biased upward.15 Furthermore, the fact that a tract or city has a high prevalence of a health indicator does not imply that there is something about the environment or context that is driving this prevalence estimate; rather, there may be a large number of individuals with a high probability of having the health indicator residing in this tract or city. Thus, researchers should be aware that small area estimates could potentially introduce substantial bias in the evaluation of the actual impact of contextual variables, such as local interventions and policies.15,16

Second, the CDC states about 500 Cities, “The SAE [small area estimate] for each city is dependent mainly upon the sociodemographic characteristics of that city.”12 Researchers who statistically test associations between tract- or city-level demographic characteristics and small area estimate indicators should be aware that significant associations may be due to the poststratification estimation process, which uses an area’s group composition of age, gender, and race/ethnicity to produce model estimates. Regressing small area estimates on the same variables that were used to create them will result in associations that reflect the group compositional variables used to produce the estimates. Likewise, regressing small area estimates on one another (e.g., physical activity on obesity) will also likely yield significant associations. Although researchers may choose to include some area-level demographic characteristics that were used to generate the small area estimates as control variables, caution should be taken when interpreting these specific effects as meaningful.
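The circularity described above can be made concrete with a deliberately simplified sketch. It assumes a small area estimate built only from an area’s age composition and two fixed, invented cell probabilities; real model estimates also include area poverty and random effects, so the fit would not be exact in practice. All names and numbers are hypothetical.

```python
# Simplified, deterministic illustration; values are invented.

def ols_fit(x, y):
    """Simple linear regression of y on x: (slope, intercept, R-squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Assumed predicted probabilities for younger and older adults.
P_YOUNG, P_OLD = 0.10, 0.30

# Share of older adults in six hypothetical tracts.
share_older = [0.10, 0.20, 0.35, 0.50, 0.65, 0.80]

# Synthetic small area estimate: a weighted average of the two
# probabilities, with no contextual effect of any kind built in.
sae = [P_YOUNG * (1 - s) + P_OLD * s for s in share_older]

# Regress the estimate on the variable that was used to construct it.
slope, intercept, r_squared = ols_fit(share_older, sae)
```

Here the slope recovers the difference between the two assumed probabilities (0.20) and R-squared equals 1, even though nothing about the tracts other than composition entered the estimates. The association is a property of how the estimates were constructed, not evidence about the tracts themselves.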

Finally, researchers using small area estimates should be aware of the error associated with these estimates, which is often not acknowledged. First, the BRFSS data themselves may be subject to measurement error or self-report bias. Additionally, the model estimates produced in 500 Cities have margins of error that differ substantially by health indicator, geographic area, and geographic scale. This error may be especially large in neighborhoods with small population sizes and in areas with small county and state BRFSS samples. Although there are various statistical methodologies21,22 that may be used to account for this measurement error, these methods can be complex, and researchers often do not account for this measurement error in their secondary analyses. Therefore, reported standard errors are more than likely biased, potentially increasing type I error. We urge researchers to examine and report the margins of error of 500 Cities health indicators and discuss the impact and limitations of their findings, especially if this error is not explicitly modeled in subsequent analyses.
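As one way to examine and report this error, the sketch below converts a published interval into an approximate standard error and a relative standard error. It assumes each estimate is accompanied by a symmetric 95% confidence interval on the prevalence scale; the function names and the example values are hypothetical, and this is a descriptive check only, not an implementation of the correction methods cited above.

```python
Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def standard_error_from_ci(lower, upper, z=Z_95):
    """Approximate standard error implied by a symmetric interval."""
    if upper < lower:
        raise ValueError("upper limit is below lower limit")
    return (upper - lower) / (2 * z)


def relative_standard_error(estimate, lower, upper, z=Z_95):
    """Standard error expressed as a fraction of the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return standard_error_from_ci(lower, upper, z) / estimate


# Invented tract-level prevalence (%) with its 95% interval.
estimate, lower, upper = 20.0, 16.08, 23.92

se = standard_error_from_ci(lower, upper)               # about 2.0
rse = relative_standard_error(estimate, lower, upper)   # about 0.10
```

Reporting these quantities next to each estimate makes it easier to see which tracts or indicators carry the most uncertainty before they enter a secondary regression.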

CONCLUSIONS

500 Cities is an exciting, promising, and novel project that has spurred continued interest in understanding the relationship between neighborhood characteristics and health behaviors and outcomes. Although the estimates are so far reflective of 2010 population estimates, the CDC hopes to provide census tract– and city-level estimates using more recent intercensal population data.12 The small area estimates generated by 500 Cities are intended to “allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.”12 To support these efforts, New York University Langone Health, in partnership with the RWJF, released an interactive Web-based resource known as the City Health Dashboard.23 The City Health Dashboard includes measures and data visualization from 500 Cities, as well as several other indicators of socioeconomic, built environment, and health behavior and outcome factors from other surveillance systems (e.g., National Vital Statistics System), small area estimates (e.g., US Small Area Life Expectancy Estimates Project), and data sources (e.g., US Department of Agriculture Economic Research Service’s Food Access Research Atlas).23 Small area estimates may provide preliminary data to communities about important health indicators and may also supplement local community health assessments and surveillance efforts, which may help jurisdictions to prioritize health needs, set community health goals, and allocate funding to local health policy and intervention strategies. As researchers use these data to statistically test hypotheses, they must also be mindful of how these estimates were produced when interpreting research findings, especially as localities and policymakers increasingly look to researchers for guidance and expertise.

ACKNOWLEDGMENTS

We thank Justin Feldman, PhD (New York University School of Medicine, Department of Population Health) for his guidance and discussion on this topic.

Note. The findings and conclusions of this study are those of the authors and do not necessarily reflect the views of the Substance Abuse and Mental Health Services Administration or the US Department of Health and Human Services.

CONFLICTS OF INTEREST

The authors have no conflicts of interest to report.

REFERENCES

1. Diez Roux AV. Neighborhoods and health: where are we and where do we go from here? Rev Epidemiol Sante Publique. 2007;55(1):13–21. doi: 10.1016/j.respe.2006.12.003.
2. Diez Roux AV. Multilevel analysis in public health research. Annu Rev Public Health. 2000;21(1):171–192. doi: 10.1146/annurev.publhealth.21.1.171.
3. Diez Roux AV. A glossary for multilevel analysis. J Epidemiol Community Health. 2002;56(8):588–594. doi: 10.1136/jech.56.8.588.
4. Thorpe LE. Surveillance as our sextant. Am J Public Health. 2017;107(6):847–848. doi: 10.2105/AJPH.2017.303803.
5. Ghosh M, Rao JN. Small area estimation: an appraisal. Stat Sci. 1994;9(1):55–76.
6. Rao JN. Some new developments in small area estimation. J Iranian Stat Soc. 2003;2(2):145–169.
7. Gibbons J, Barton M, Brault E. Evaluating gentrification’s relation to neighborhood and city health. PLoS One. 2018;13(11):e0207432. doi: 10.1371/journal.pone.0207432.
8. Mullenbach LE, Mowen AJ, Baker BL. Assessing the relationship between a composite score of urban park quality and health. Prev Chronic Dis. 2018;15:E136. doi: 10.5888/pcd15.180033.
9. Fitzpatrick KM, Shi X, Willis D, Niemeier J. Obesity and place: chronic disease in the 500 largest US cities. Obes Res Clin Pract. 2018;12(5):421–425. doi: 10.1016/j.orcp.2018.02.005.
10. Zhang X, Holt JB, Lu H et al. Multilevel regression and poststratification for small-area estimation of population health outcomes: a case study of chronic obstructive pulmonary disease prevalence using the behavioral risk factor surveillance system. Am J Epidemiol. 2014;179(8):1025–1033. doi: 10.1093/aje/kwu018.
11. Jia H, Muennig P, Borawski E. Comparison of small-area analysis techniques for estimating county-level outcomes. Am J Prev Med. 2004;26(5):453–460. doi: 10.1016/j.amepre.2004.02.004.
12. Centers for Disease Control and Prevention. 500 cities: local data for better health. 2016. Available at: https://www.cdc.gov/500cities/faqs/general.htm. Accessed April 25, 2018.
13. Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System. Available at: https://www.cdc.gov/brfss/about/index.htm. Accessed May 1, 2019.
14. US Census Bureau. American Community Survey Estimates. Available at: https://www.census.gov/programs-surveys/acs. Accessed May 1, 2019.
15. Zhang X, Holt JB, Yun S, Lu H, Greenlund KJ, Croft JB. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the behavioral risk factor surveillance system. Am J Epidemiol. 2015;182(2):127–137. doi: 10.1093/aje/kwv002.
16. Wang Y, Holt JB, Zhang X et al. Comparison of methods for estimating prevalence of chronic diseases and health behaviors for small geographic areas: Boston Validation Study, 2013. Prev Chronic Dis. 2017;14:E99. doi: 10.5888/pcd14.170281.
17. Klompas M, Cocoros NM, Menchaca JT et al. State and local chronic disease surveillance using electronic health record systems. Am J Public Health. 2017;107(9):1406–1412. doi: 10.2105/AJPH.2017.303874.
18. Wang Y, Holt JB, Xu F et al. Using 3 health surveys to compare multilevel models for small area estimation for chronic diseases and health behaviors. Prev Chronic Dis. 2018;15:E133. doi: 10.5888/pcd15.180313.
19. Leas EC, Schleicher NC, Prochaska JJ, Henriksen L. Place-based inequity in smoking prevalence in the largest cities in the United States. JAMA Intern Med. 2019;179(3):442–444. doi: 10.1001/jamainternmed.2018.5990.
20. Browning M, Rigolon A. Do income, race and ethnicity, and sprawl influence the greenspace–human health link in city-level analyses? Findings from 496 cities in the United States. Int J Environ Res Public Health. 2018;15(7):E1541. doi: 10.3390/ijerph15071541.
21. Lewis JB, Linzer DA. Estimating regression models in which the dependent variable is based on estimates. Polit Anal. 2017;13(4):345–364.
22. Hornstein AS, Greene WH. Usage of an estimated coefficient as a dependent variable. Econ Lett. 2012;116(3):316–318.
23. Gourevitch MN, Athens JK, Levine SE, Kleiman N, Thorpe LE. City-level measures of health, health determinants, and equity to foster population health improvement: The City Health Dashboard. Am J Public Health. 2019;109(4):585–592. doi: 10.2105/AJPH.2018.304903.
