Abstract
Background:
To our knowledge, no agreed-upon best practices exist for joining U.S. Census ZIP Code Tabulation Areas (ZCTAs) and U.S. Postal Service ZIP Codes (ZIPs). One-to-one linkage using 5-digit ZCTA identifiers excludes ZIPs without direct matches. “Crosswalk” linkage may match a ZCTA to multiple ZIPs, avoiding losses.
Methods:
We compared non-crosswalk and crosswalk linkages nationally and for mortality and health insurance in California. To elucidate selection implications, generalized additive models related sociodemographics to whether ZCTAs contained non-matching ZIPs.
Results:
Nationwide, 15% of ZCTAs had non-matching ZIPs, i.e., ZIPs dropped under non-crosswalk linkage. ZCTAs with non-matching ZIPs were positively associated with metropolitan core location, lower socioeconomic status, and non-white population. In California, 34% of ZIPs in the mortality data and 25% in the health insurance data had ZCTAs with non-matching ZIPs; however, these ZIPs constituted only 0.03% of total mortality and 0.44% of total insurance enrollees.
Conclusions:
Our study findings support the use of crosswalk linkages and ZCTAs as a unit of analysis. One-to-one linkage may cause bias by differentially excluding ZIPs with more disadvantaged populations, although affected population sizes appear small.
Keywords: ZIP Code, ZCTA, crosswalk, American Community Survey, bias
Background
Researchers often link U.S. Census American Community Survey (ACS) data, including sociodemographic variables, with datasets that use ZIP Codes1–4 for spatial identification. ACS data span various geographic levels, from census block groups to ZIP Code Tabulation Areas (ZCTAs). ZIP Codes are assigned by the U.S. Postal Service (USPS) for mail delivery, lack official boundaries, and can change frequently.5–7 In contrast, ZCTAs are stable Census Bureau products with updated, public shapefiles,8 making them more suitable for research.
To relate ZIP Codes to ZCTAs, “crosswalk” linkages (e.g., Uniform Data System [UDS] Mapper9, CensusReporter.org10; details in eAppendix 1) combine spatial methods and manual reviews to join commercial ZIP Code points (not publicly available) with ZCTAs. Non-crosswalk one-to-one linkage based on five-digit identifiers excludes ZIP Codes with no matching ZCTA.10 Studies have reported potential bias from these spatiotemporal mismatches and have called for caution when linking ZIP Codes and ZCTAs.6,7 The issue can be avoided by using other geographies (e.g., census tracts), but some datasets contain only ZIP Codes. The epidemiology literature offers limited guidance on best practices for linking ZIP Code and ZCTA datasets.
We compared non-crosswalk one-to-one and crosswalk linkages nationally to identify and characterize ZIP Codes excluded under non-crosswalk linkage. In case studies of California mortality and low-cost health insurance coverage datasets, we aimed to quantify the size of populations excluded when using non-crosswalk linkage. The results inform potential selection bias from non-crosswalk linkage, including whom exclusions tend to affect, where they occur, and to what extent they may affect epidemiologic studies.
Methods
Study data
We obtained crosswalk data from the UDS Mapper (John Snow, Inc)9 to longitudinally join ZIP Codes with ZCTAs from 2009 to 2019.
The national analysis, covering the 50 states plus Washington, D.C., focused on 2019 data. We obtained ZCTA-level population characteristics from 2015–2019 ACS 5-Year Estimates.11 We selected factors associated with health disparities,12,13 including educational attainment (% population aged ≥25 years with at least a bachelor’s degree), poverty (% households whose income in the past 12 months was below the poverty level), race–ethnicity (% Non-Hispanic White, Non-Hispanic Black, and Hispanic), and housing (% renters, single-family households, occupied housing units, and moved last year). To assess urbanicity, we used 2010 ZIP Code-level Rural-Urban Commuting Area (RUCA) codes dichotomized into metropolitan core (RUCA=1) and non-metropolitan core (RUCA=2 to 10), following prior work.14,15 We generated ZCTA-level RUCA values by crosswalking dichotomized RUCA to ZCTAs and taking the most common value for ZCTAs with multiple matching ZIP Codes (details in eAppendix 2).
For the California case studies, we obtained annual ZIP Code-level all-cause mortality counts from 2009 to 2018 from the California Health and Human Services Open Data Portal.2 Death counts ≤10 were suppressed for confidentiality; zero deaths were reported as 0. We also obtained 2016 ZIP Code-level counts of enrollees in Covered California, a health insurance marketplace established under the Affordable Care Act.3 The minimum number of reported enrollees was 10, with no indicators of cell suppression and no zero cells. Therefore, we treated missing data as suppressed.
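The ZCTA-level urbanicity assignment described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: the function name, the toy ZIP Code and RUCA values, and the tie-breaking rule (ties resolved toward metropolitan core) are all assumptions, since the actual procedure is documented in eAppendix 2.

```python
from collections import Counter, defaultdict

def zcta_urbanicity(zip_to_zcta, zip_ruca):
    """Assign each ZCTA the most common dichotomized RUCA class of its ZIP Codes."""
    classes = defaultdict(list)
    for zip_code, zcta in zip_to_zcta.items():
        ruca = zip_ruca.get(zip_code)
        if ruca is None:
            continue  # ZIP Code without a RUCA code contributes nothing
        # Dichotomize: RUCA 1 = metropolitan core, RUCA 2 to 10 = non-metropolitan core
        classes[zcta].append("metro" if ruca == 1 else "non-metro")
    result = {}
    for zcta, labels in classes.items():
        counts = Counter(labels)
        # Assumed tie-break (not from the paper): prefer "metro" when counts are equal
        result[zcta] = "metro" if counts["metro"] >= counts["non-metro"] else "non-metro"
    return result

# Hypothetical toy crosswalk (ZIP Code -> ZCTA) and RUCA codes, for illustration only
zip_to_zcta = {"10001": "10001", "10101": "10001", "10102": "10001", "30003": "30003"}
zip_ruca = {"10001": 1, "10101": 1, "10102": 4, "30003": 7}

urbanicity = zcta_urbanicity(zip_to_zcta, zip_ruca)
print(urbanicity)
```

In this toy example the first ZCTA has two metropolitan and one non-metropolitan ZIP Code, so the majority rule labels it metropolitan.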
Data analysis
National analysis
We identified ZIP Codes excluded based on non-crosswalk linkage. To identify ZCTAs containing excluded ZIP Codes, we calculated the number of ZIP Codes nested within each ZCTA according to the 2019 UDS crosswalk file. A “1” indicates a one-to-one match with the matching ZIP Code retained. For ZCTAs containing more than one ZIP Code, non-matching ZIP Codes are excluded under non-crosswalk linkage. For analyses, we categorized the number of nested ZIP Codes per ZCTA as 1, 2, 3, or ≥4, or dichotomized it as 1 or >1.
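The counting and categorization steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the crosswalk rows are hypothetical (ZIP Code, ZCTA) pairs, the function names are invented for this example, and a ZIP Code is treated as excluded when no ZCTA shares its five-digit identifier.

```python
from collections import Counter

def nested_zip_counts(crosswalk):
    """Number of ZIP Codes nested in each ZCTA, from (zip, zcta) crosswalk rows."""
    return Counter(zcta for _zip, zcta in crosswalk)

def categorize(n):
    """Collapse a nested-ZIP count into the categories 1, 2, 3, >=4."""
    return str(n) if n < 4 else ">=4"

def excluded_zips(crosswalk):
    """ZIP Codes dropped by an identifier-only join: no ZCTA has the same 5-digit code."""
    zcta_ids = {zcta for _zip, zcta in crosswalk}
    return sorted(z for z, _zcta in crosswalk if z not in zcta_ids)

# Hypothetical crosswalk rows, for illustration only
crosswalk = [
    ("10001", "10001"), ("10101", "10001"), ("10102", "10001"), ("10103", "10001"),
    ("20002", "20002"), ("20202", "20002"),
    ("30003", "30003"),
]

counts = nested_zip_counts(crosswalk)
categories = {zcta: categorize(n) for zcta, n in counts.items()}
has_nonmatching = {zcta: n > 1 for zcta, n in counts.items()}  # dichotomized 1 vs >1
dropped = excluded_zips(crosswalk)

print(categories)
print(dropped)
```

Under a crosswalk join all seven toy ZIP Codes would be retained, whereas the identifier-only join keeps three.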
We summarized the distribution of selected characteristics by number of ZIP Codes nested per ZCTA (categorical), by state. To assess if urbanicity was associated with whether a ZIP Code is excluded under non-crosswalk linkage, we fitted a logistic regression model, adjusting for state. To understand characteristics of ZCTAs containing >1 ZIP Codes (dichotomized), we fitted generalized additive models (implemented with “mgcv” in R16) with a logit link for each ZCTA population characteristic separately, allowing for potential nonlinearity with a penalized cubic spline and adjusting for state and ZCTA-level urbanicity. Results were presented as predicted probability of ZCTAs containing >1 ZIP Code across the values of each ZCTA population characteristic.
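To make the urbanicity comparison concrete, the sketch below computes a crude odds ratio with a Wald 95% confidence interval from a 2×2 table. This is a deliberate simplification and not the model described above: it does not adjust for state, and the cell counts are hypothetical values chosen only to show the arithmetic.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald confidence interval from a 2x2 table.

    a: metropolitan core, excluded      b: metropolitan core, retained
    c: non-metropolitan core, excluded  d: non-metropolitan core, retained
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical cell counts, not taken from the study data
or_est, ci_low, ci_high = odds_ratio_wald(a=300, b=700, c=100, d=900)
print(f"OR={or_est:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

A state-adjusted estimate would instead come from a logistic regression with state as a covariate, which requires a statistical library and is not shown here.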
California case studies
We calculated the percent of ZIP Codes excluded under non-crosswalk linkage and tabulated the number of deaths and of Covered California health insurance enrollees excluded under non-crosswalk linkage. In sensitivity analyses, we imputed missing or suppressed cells under a range of possible scenarios (eAppendix 4). Analyses used R 4.2.2.17
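The tabulation and sensitivity analysis can be illustrated with a small sketch. The counts, ZIP Codes, and function name below are hypothetical, and the two imputation scenarios (every suppressed cell set to 1 or to 10) are assumed bounds based on the suppression rule for death counts; the scenarios actually used are described in eAppendix 4.

```python
def excluded_share(counts, zcta_ids, fill=None):
    """Percent of events in ZIP Codes with no same-numbered ZCTA.

    counts: dict mapping ZIP Code -> count, with None marking a suppressed cell
    fill:   value imputed for suppressed cells; None leaves them out entirely
    """
    total = 0
    excluded = 0
    for zip_code, n in counts.items():
        if n is None:
            if fill is None:
                continue
            n = fill
        total += n
        if zip_code not in zcta_ids:
            excluded += n
    return 100.0 * excluded / total if total else 0.0

# Hypothetical ZIP Code-level counts; None marks a suppressed cell
counts = {"10001": 500, "10101": None, "20002": 300, "20202": 20, "30003": None}
zcta_ids = {"10001", "20002", "30003"}

observed = excluded_share(counts, zcta_ids)        # suppressed cells ignored
low = excluded_share(counts, zcta_ids, fill=1)     # assumed lower-bound scenario
high = excluded_share(counts, zcta_ids, fill=10)   # assumed upper-bound scenario
print(f"{observed:.2f}% {low:.2f}% {high:.2f}%")
```

Comparing the three percentages shows how much the excluded share could move if suppressed cells fall disproportionately in non-matching ZIP Codes.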
Results:
Nationally, there were 32,983 unique ZCTAs in 2019. Of these, 4,926 (15%) contained at least one non-matching ZIP Code (i.e., >1 nested ZIP Code), ranging from 3% in South Dakota to 47% in D.C. (median across states=14%). Most commonly, ZCTAs had only one non-matching ZIP Code (10.9%, n=3,591; i.e., 2 nested ZIP Codes), while 2.5% (n=825) had two non-matching ZIP Codes and 1.6% (n=510) had three or more (eTable 1). Nationally, there were 41,104 unique 5-digit ZIP Codes in 2019. Of these, 19.4% (n=7,966) did not have a one-to-one matching ZCTA. The majority (90%, n=7,142) were Post Office (P.O.) Box or large-volume customer ZIP Codes per UDS assignment. ZIP Codes within metropolitan cores had higher odds of not matching a ZCTA than those outside metropolitan cores (OR=5.38, 95% CI: 5.07–5.72).
In most states, ZCTAs with lower socioeconomic status (e.g., poverty, housing characteristics) and a higher % of non-white population tended to contain non-matching ZIP Codes (i.e., ZIP Codes excluded under non-crosswalk linkage) (eFigure 1), though nationwide associations with the sociodemographic factors were mostly nonlinear (Figure; detailed descriptions in eAppendix 5).
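As a quick consistency check, the counts reported in the paragraph above can be recombined to reproduce the stated percentages. The variable names are ours; the numbers are copied directly from the text.

```python
# Counts as reported in the text
total_zctas = 32_983
one_extra, two_extra, three_plus = 3_591, 825, 510   # ZCTAs by number of non-matching ZIP Codes
with_nonmatching = one_extra + two_extra + three_plus  # 4926

total_zips = 41_104
unmatched_zips = 7_966
po_box_or_large_volume = 7_142

pct_zctas = 100 * with_nonmatching / total_zctas
pct_zips = 100 * unmatched_zips / total_zips
pct_po = 100 * po_box_or_large_volume / unmatched_zips

print(with_nonmatching)
print(f"{pct_zctas:.1f}% of ZCTAs, {pct_zips:.1f}% of ZIP Codes, {pct_po:.1f}% P.O. Box or large-volume")
```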
Figure.

Predicted probabilities (blue line) with 95% confidence intervals (green shaded bands) of a ZCTA containing >1 ZIP Code (left side y-axis) across levels of selected ZCTA-level sociodemographic factors, full U.S. ZCTAs, 2019. Based on generalized additive models with a penalized cubic spline for the factor and controlling for state and urbanicity. Overlaid histograms show the distribution of each ZCTA sociodemographic factor (right side y-axis), with the 1st and 3rd quantiles noted by vertical dashed lines and the median by a vertical dotted line. X-axes were restricted to the 5th through 95th percentiles of each ZCTA sociodemographic distribution for better interpretability.
In California, 30% of ZCTAs contained non-matching ZIP Codes, the 3rd highest among states. On average, 33.83% of n=2,664 ZIP Codes did not have a one-to-one ZCTA match annually across the 10 years of mortality data (2009–2018 range: 33.71%–33.90%). Under non-crosswalk linkage, 0.03% (n=777) of deaths would be excluded over the 10 years. However, of the ZIP Codes excluded, many had suppressed small counts (n=280/year on average, equivalent to 44.7% of total suppressed small cells; eTable 2). In sensitivity analyses imputing suppressed cells, the proportion of excluded deaths remained low (0.16–1.15%; eTable 3). Of ZIP Codes in the 2016 Covered California data, 24.94% (n=576) had no matching ZCTA, representing 0.44% of enrollees. In sensitivity analyses imputing suppressed cells, this proportion remained low (0.44–0.57%; eTable 3). Additionally, we found that non-matching ZIP Codes in the case studies were in high-population ZCTAs (eAppendix 7).
Discussion:
We assessed potential selection bias from non-crosswalk linkage compared to crosswalk linkage nationally and in California. ZIP Codes excluded under non-crosswalk linkage were more likely to be within metropolitan cores. ZCTAs with certain sociodemographics (e.g., higher % non-white population, higher % renters, lower % single-family households) were more likely to contain ZIP Codes excluded under non-crosswalk linkage. Hence, non-crosswalk linkage might exclude populations in high-density areas and disproportionately exclude more vulnerable populations, introducing selection bias (expanded in eAppendix 6), since these sociodemographic factors tend to correlate with increased adverse exposures and health outcomes.18
In the case studies, despite a large percentage of ZIP Codes without matching ZCTAs, predominantly in high-population areas, the numbers of California deaths and insurance enrollees excluded were small, suggesting the influence of resulting bias from non-crosswalk linkage might be modest in California.
Crosswalk linkage should be used cautiously since P.O. Box and large-volume customer ZIP Codes are a less reliable estimate of residence.9,19,20 Additionally, households can have P.O. Boxes along with mailing addresses.19 However, given the large areal coverage of ZIP Codes, P.O. Boxes may be reasonable residential proxies. The median distance between breast cancer patients’ street address and their P.O. Box ZIP-centroid in a California-based study was 2.2 miles and 4.3 miles at the 3rd quantile.20 Distances between addresses and P.O. Boxes may be smaller than ZIP Code areal coverage, which in heavily urbanized states such as New Jersey can be 12.8 square miles (3.6-mile radius).21
This study has limitations. First, we treated crosswalk files as a “gold standard.” We did not compare spatial matching between ZCTAs and ZIP Codes that have non-coinciding boundaries, so crosswalk linkage is not a perfect solution.7 Our results apply to scenarios in which commercial shapefiles of ZIP Codes are unavailable and manual spatial matching is therefore not possible. Second, analyses with ZCTA sociodemographics evaluated only one ZCTA characteristic at a time, which, while providing some evidence, does not take into account underlying complex relations. Last, our case studies included only California. Different states and datasets might reveal different patterns. Given that California ranked 3rd highest among states in the percentage of ZCTAs with non-matching ZIP Codes, we believe our results are generalizable to states with large numbers of non-matching ZIP Codes.
To avoid potential selection bias, we recommend the following research practices when joining ZIP Code and ZCTA data. First, we call for clearly documented methodology for generating crosswalk files (e.g., Census Reporter10). Second, we recommend that researchers working with ZIP Code-level data use crosswalk linkages,10,22 or at least include them in a sensitivity analysis. Third, given that neither USPS nor ACS provides ZIP Code-level population statistics, we recommend that ZCTAs be the primary unit of analysis. Researchers using ZIP Code data should also be aware of the geographical scale and other biases associated with spatial context.23,24
Supplementary Material
Source of Funding:
This work was supported by grants # R01ES035137 and P30ES007048 from the National Institute of Environmental Health Sciences.
Footnotes
Conflicts of interest: none declared.
Data and computing code availability:
The paper uses public data and data obtained from IPUMS (http://doi.org/10.18128/D050.V18.0). IPUMS data cannot be redistributed but are publicly available. The data citation in the main article has the full URL. The UDS crosswalk data link was redirected to another website in April 2024, and users can no longer see historical crosswalks. Code for data cleaning and analysis, along with the longitudinal UDS crosswalk data used in this work, is available at GitHub: https://github.com/madaopt/ZIP-ZCTA-Epidemiology
References:
- 1. United States Census Bureau. County Business Patterns. Census.gov. Accessed March 18, 2024. https://www.census.gov/programs-surveys/cbp.html
- 2. California Health and Human Services. Death Profiles by ZIP Code - California Health and Human Services Open Data Portal. Accessed March 7, 2024. https://data.chhs.ca.gov/dataset/death-profiles-by-zip-code
- 3. California Health Care Foundation. Covered California Enrollment by Zip Code, March 2016. California Health Care Foundation. Accessed March 7, 2024. https://www.chcf.org/publication/covered-california-enrollment-by-zip-code-march-2016/
- 4. California Energy Commission. Light-Duty Vehicle Population in California. California Energy Commission. Accessed March 18, 2024. https://www.energy.ca.gov/data-reports/energy-almanac/zero-emission-vehicle-and-infrastructure-statistics/light-duty-vehicle
- 5. United States Postal Service. ZIP Code™ - The Basics. Accessed March 7, 2024. https://faq.usps.com/s/article/ZIP-Code-The-Basics
- 6. Krieger N, Waterman P, Chen JT, Soobader MJ, Subramanian SV, Carson R. Zip Code Caveat: Bias Due to Spatiotemporal Mismatches Between Zip Codes and US Census–Defined Geographic Areas—The Public Health Disparities Geocoding Project. Am J Public Health. 2002;92(7):1100–1102. doi: 10.2105/AJPH.92.7.1100
- 7. Grubesic TH, Matisziw TC. On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Int J Health Geogr. 2006;5(1):1–15. doi: 10.1186/1476-072X-5-58
- 8. United States Census Bureau. ZIP Code Tabulation Areas (ZCTAs). Census.gov. Accessed March 7, 2024. https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html
- 9. Uniform Data System (UDS). ZIP Code to ZCTA Crosswalk – UDS Mapper. Accessed March 18, 2024. https://udsmapper.org/zip-code-to-zcta-crosswalk/
- 10. Census Reporter. acs-aggregate/crosswalks/zip_to_zcta/ZIP_ZCTA_README.md at master · censusreporter/acs-aggregate. GitHub. Accessed March 18, 2024. https://github.com/censusreporter/acs-aggregate/blob/master/crosswalks/zip_to_zcta/ZIP_ZCTA_README.md
- 11. Manson S, Schroeder J, Van Riper D, Knowles K, Kugler T, Roberts F, Ruggles S. IPUMS National Historical Geographic Information System: Version 18.0. doi: 10.18128/D050.V18.0
- 12. Singh GK, Daus GP, Allender M, et al. Social Determinants of Health in the United States: Addressing Major Health Inequality Trends for the Nation, 1935–2016. Int J MCH AIDS. 2017;6(2):139–164. doi: 10.21106/ijma.236
- 13. Rauh VA, Landrigan PJ, Claudio L. Housing and Health. Annals of the New York Academy of Sciences. 2008;1136(1):276–288. doi: 10.1196/annals.1425.032
- 14. Eckel SP, Cockburn M, Shu YH, et al. Air pollution affects lung cancer survival. Thorax. 2016;71(10):891. doi: 10.1136/thoraxjnl-2015-207927
- 15. Han PJ. Rural Definitions Matter. Published online 2024.
- 16. Wood SN. Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2011;73(1):3–36. doi: 10.1111/j.1467-9868.2010.00749.x
- 17. R Core Team. R: A language and environment for statistical computing. Published online 2022. https://www.R-project.org/
- 18. Smith LH. Selection Mechanisms and Their Consequences: Understanding and Addressing Selection Bias. Curr Epidemiol Rep. 2020;7(4):179–189. doi: 10.1007/s40471-020-00241-6
- 19. Iannacchione VG, Staab JM, Redden DT. Evaluating the Use of Residential Mailing Addresses in a Metropolitan Household Survey. Public Opinion Quarterly. 2003;67(2):202–210. doi: 10.1086/374398
- 20. Hurley SE, Saunders TM, Nivas R, Hertz A, Reynolds P. Post Office Box Addresses: A Challenge for Geographic Information System-Based Studies. Epidemiology. 2003;14(4):386. doi: 10.1097/01.EDE.0000073161.66729.89
- 21. Grubesic TH. Zip codes and spatial analysis: Problems and prospects. Socio-Economic Planning Sciences. 2008;42(2):129–149. doi: 10.1016/j.seps.2006.09.001
- 22. Climate-CAFE/zip_codes_and_zctas. Published online March 8, 2024. Accessed April 5, 2024. https://github.com/Climate-CAFE/zip_codes_and_zctas
- 23. Sadler RC. How ZIP codes nearly masked the lead problem in Flint. The Conversation. Published September 20, 2016. Accessed July 23, 2024. http://theconversation.com/how-zip-codes-nearly-masked-the-lead-problem-in-flint-65626
- 24. Sadler RC, Lafreniere DJ. You are where you live: Methodological challenges to measuring children’s exposure to hazards. J Child Poverty. 2017;23(2):189–198. doi: 10.1080/10796126.2017.1336705
