GeoHealth. 2026 Jan 4;10(1):e2025GH001596. doi: 10.1029/2025GH001596

A Novel Method for Generating Spatially Resolved Synthetic Populations for Health Impact Assessments in Vulnerable Populations

Flannery Black‐Ingersoll 1, Chad W Milando 1, Zachary T Popp 1, Mariangelí Echevarría‐Ramos 2, M Patricia Fabian 1,3, Amruta Nori‐Sarma 4,5, Jonathan I Levy 1
PMCID: PMC12765813  PMID: 41497286

Abstract

The spatial resolution of environmental exposure and sociodemographic population data is often mismatched given limited publicly available population data that complies with privacy requirements for individuals. To address this limitation, we developed a novel matching algorithm to construct a synthetic population at the address‐level. To demonstrate how our approach can improve environmental justice (EJ) analyses and health impact assessments (HIAs), we examined sociodemographic patterns of residential proximity to major roadways in Greater Boston (Massachusetts) and HIA results, comparing our method with a random address allocation method. The synthetic population was developed at a census tract‐level using US Census microdata and combinatorial optimization methods and then downscaled to address‐level parcels by matching building attributes to synthetic households. We designated households within 50 m of a major road “high exposure” and households below state median household income “low income”. We found substantial misclassification for individual households (only 21% of the high exposure/low‐income households in the matched data set were identified as such in the random allocation data set). We found modest aggregate differences in matched allocation (3.3% of low‐income households had high exposure) compared to random allocation (3.4%). In an HIA, the difference between random and matched allocation would be stronger when there is a strong interactive effect between a sociodemographic effect modifier and exposure on the outcome. Address‐level exposure assignment based on synthetic populations can provide more significant and nuanced health impact and EJ analyses. Our novel method can be applied to other regions of the US and expanded to other dimensions of population vulnerability.

Keywords: environmental justice, health impact assessment, synthetic population

Plain Language Summary

Communities and decision makers often need to identify if there are disparities in the distribution of hazardous exposures and associated health outcomes. To do so requires understanding of both spatial patterns of exposures and of the attributes of exposed populations. While environmental exposure data are available at increasingly higher spatial resolution, data on high‐resolution population sociodemographic characteristics are limited by privacy requirements in the US. To support the investigation of environmental exposures and health outcomes across sociodemographic characteristics at address‐level resolution, we used publicly available US Census data to simulate an address‐level population with sociodemographic information. In a case study looking at proximity to major roadways in Greater Boston (Massachusetts), we compared exposure patterns between our approach and approaches where household attributes were not used for address assignment. We found large differences in how individual households were identified but modest differences in the percent of households identified as high‐exposure and low‐income. We also showed that differences in estimated health impacts would depend on whether there was a strong interaction between the environmental exposure and sociodemographic variable. The methods used to create the address‐level synthetic population can be replicated in other regions of the US using the same census data resources.

Key Points

  • Low spatial resolution population data can mask environmental justice issues

  • Public US Census data can be used to generate address‐level synthetic populations

  • Health impacts vary by sociodemographic and exposure variable interactions

1. Introduction

Many environmental hazards have substantial heterogeneity across small spatial units, with patterns that are not randomly distributed across the population. Numerous studies have documented disproportionate exposures among marginalized communities, providing empirical evidence to support concerns about environmental justice (EJ) (Brulle & Pellow, 2006; Bullard & Wright, 1993; Checker, 2021; Lee, 2002; Smith & Laribi, 2022; Van Horne et al., 2023; Wing, 2005), wherein environmental hazards and benefits are unequally distributed across racial and socioeconomic population subgroups (Chowkwanyun, 2023). However, in situations where available exposure data are highly spatially resolved, analyses of the sociodemographic patterns of exposure may be inaccurate if the available population data lack comparable spatial resolution. This can be challenging given that publicly available census data can only provide limited sociodemographic data at coarse (i.e., US census tract) spatial resolution in order to protect individual‐level privacy and avoid reidentification of census participants, leading to mismatches with exposures that can vary across tens of meters (Gardner‐Frolick et al., 2022; Racz & Rish, 2022; Shan et al., 2024). As high‐resolution environmental exposure data become increasingly available through satellite data and other geospatial information, the gap between exposure and sociodemographic data is increasing, creating challenges for EJ analyses and other investigations of sociodemographic exposure patterns and their health implications.

One strategy to address this gap is to utilize synthetic populations, simulated data sets that combine coarse geographic resolution individual data with higher resolution population‐level characteristics for covariates (Milando et al., 2025). There is a growing body of research developing and applying synthetic population data sets to a range of questions related to sociodemographic patterns of exposures or behaviors (Basra et al., 2017; Gelb et al., 2024; Jiang et al., 2022, 2024; Joubert, 2018; Levy et al., 2014; Milando et al., 2021; Wheaton et al., 2009). The applications of synthetic populations span urban planning (Lin, 2024), policy analyses (Joubert, 2018), and public health applications of methods including agent‐based modeling, risk assessment, intervention assessment, and health impact assessment (Basra et al., 2017; Jiang et al., 2022; Levy et al., 2014; Milando et al., 2021; Nicolaie et al., 2023; Wheaton et al., 2009). Synthetic population data sets have the potential to aid in identification of inequities for spatially variable exposures that are true to underlying population characteristics while avoiding possible privacy concerns.

Synthetic population methods developed and applied to date lack the spatial resolution necessary to align with highly spatially heterogeneous exposures. Previous applications of synthetic population data sets have been primarily census tract‐level, geographic domains that vary in areal size based on the population density but containing approximately 4,000 people (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021). A recent Canadian study presents a synthetic population method that achieves a geospatial disaggregation of income, age, ethnicity, and household composition at a scale similar to a US block group (cluster of blocks with 600–3,000 people) (Gelb et al., 2024; US Census Bureau, 2022). Another US study generated age, sex, race, and Hispanic or Latino origin data at block‐group resolution (Lin, 2024). While this spatial resolution is higher than the census tract‐level, it may still be inadequate to align with many environmental exposures. Other studies allocated households to residential roadways, but did so randomly rather than based on individual and building attributes (Jiang et al., 2022, 2024). Methods to assign synthetic populations to building parcels that acknowledge correlations between population and parcel attributes could allow for novel insights by better capturing heterogeneity by sociodemographic characteristics that vary within tracts, by households, and/or by individuals. For example, a study of lead water service line distributions in New York State concluded that census tracts with higher proportions of Hispanic residents were more likely to have potential lead service lines (Nigra et al., 2023). However, to conduct this analysis, the researchers had to aggregate the lead service line exposure data to align with the census tract‐level population data.

In this study, we build upon methods previously used to generate census tract‐level synthetic populations (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021). We develop a census tract‐level population spanning a broader geographic area than in our previous work, and then use a novel method to downscale the population further to the parcel level. We illustrate the implications of our downscaling methodology in a case study application with an environmental exposure with high spatial heterogeneity: residential proximity to major roadways in 21 cities and towns in the Mystic River Watershed (MRW) area of Greater Boston, Massachusetts. We demonstrate the implications of this parcel‐level synthetic data set for health impact assessments (HIAs) by applying hypothetical epidemiological findings (with effect modification of the proximity‐outcome association by sociodemographic characteristics). To our knowledge, the present study is the first to generate a parcel‐level synthetic population using methods beyond random allocation of households and to evaluate the implications for exposure misclassification, HIAs, and conclusions about environmental justice.

2. Materials and Methods

2.1. Location

The study area of interest for this project was the 21 cities and towns spanning the Mystic River Watershed (MRW) in the Greater Boston area of Massachusetts (Figure 1). The MRW is a highly urbanized watershed in New England that includes both urban and suburban communities, and its residents face a multitude of environmental hazards related to and exacerbated by climate change (e.g., urban heat islands, flooding, and chemical hazards). These cities and towns are nested within the Boston‐Cambridge‐Newton, Massachusetts‐New Hampshire Metropolitan/Micropolitan Statistical Area, which has documented disparities in wealth, including racial and ethnic gaps (Meschede et al., 2016). Annual household income, for example, varies substantially among census tracts in the MRW (Figure 1).

Figure 1.

Map of municipalities in the Mystic River Watershed area of Greater Boston, Massachusetts with the percent of households, by census tract, with an annual income of less than $100,000 in the 2021 US Census (U.S. Census Bureau, 2021b).

2.2. Overview

We developed a novel method to generate a parcel‐level synthetic population data set in two stages. First, we generated a census tract‐level synthetic population using combinatorial optimization with simulated annealing (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021). Second, we developed and applied a matching algorithm to allocate the synthetic population to individual building parcels. Whereas the random allocation assumes a random distribution of sociodemographic and exposure characteristics throughout a census tract, the matched allocation allocates the synthetic population to parcels using housing characteristics (e.g., building value). To demonstrate the influence of our methodology for parcel allocation, we allocated the census tract‐level households using both our matching algorithm and at random, comparing resulting conclusions about sociodemographic patterns of exposure in a use‐case example (proximity to major roadways). Proximity to major roadways presents an informative proxy within urbanized areas for several health‐relevant environmental exposures including indoor and outdoor air pollution (Brugge et al., 2013; Dadvand et al., 2014; Matthaios et al., 2024; Rowangould, 2013; Yuchi et al., 2020) and heat (Dadvand et al., 2014). Proximity to major roadway is also a highly health‐relevant metric, associated with adverse cardiovascular (Brugge et al., 2013), renal (Lue et al., 2013), neurological (Yuchi et al., 2020), and birth outcomes (Dadvand et al., 2014). With potential effect modification by sociodemographic characteristics such as income (Cakmak et al., 2016), exposure disparities by race (Rowangould, 2013), and several associated adverse health outcomes, proximity to major roadway is a suitable use‐case for our parcel‐level synthetic population data set. 
We then conducted a hypothetical HIA, demonstrating the implications of our matching algorithm as a function of the correlations among covariates and nature of hypothetical epidemiological associations.

2.3. Stage 1: Census Tract‐Level Synthetic Population

The census tract‐level synthetic population data set was developed using methods previously described: combinatorial optimization using simulated annealing with replacement (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021). This method, originally described in Williamson (2007) and as applied in Levy et al. (2014), involves the resampling from householder microdata to minimize the proportional differences between constraints and weighted estimates for census tract‐level individuals and households. The census tract‐level constraints serve as known distributions of a defined set of characteristics that are imposed on the resampling process to optimize the “fit” of the final data set (e.g., if the census tract‐level data show 30% of households in a tract are below median household income, the synthetic population data set for that tract should have approximately 30% of households below median household income).

We obtained person and household observations for this resampling procedure from the Public Use Microdata Samples (PUMS) from the 2021 American Community Survey (ACS) 5‐year estimates (Table 1). The 21 cities and towns in the MRW comprise 10 Public Use Microdata Areas (PUMAs). The PUMS data are a representative sample of approximately five percent of individuals within a given PUMA, a set of adjacent census tracts totaling approximately 100,000 people (U.S. Census Bureau, 2021j). We defined the constraints for the census tract‐level distributions using 2021 ACS 5‐year Detailed Estimates tables for the census tracts within the study area's 10 PUMAs. These constraints were compiled into tables with tract‐level counts for a range of characteristics: race/ethnicity, sex and age, education, householder age and household income, and householder age and tenure (Table 2, Table S1 and Table S2 in Supporting Information S1). We then prepared these data for the tract‐level allocation to adjust for changes to PUMS and ACS since the last publication of this tract‐level synthetic population generation method (Text S1 in Supporting Information S1) (Milando et al., 2021). After making these adjustments, the microdata and constraint tables were used to run the combinatorial optimization algorithm (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021) to generate the census tract‐level synthetic population. As described in Levy et al. (2014), this approach conducts simulated annealing using probabilistic reweighting to reach an optimum match (in this case, of householders from the microdata to census‐tracts) across all included constraint variables. The constraint variables include person and household characteristics; however, the method implemented in our study allocates the householders who are drawn from a subset of the person microdata (which has full household samples). We then rejoin these to the full household samples after allocating the householders. 
The objective is to minimize the relative sum of z‐squares (RSSZ2, the sum of z‐squares divided by the 0.05 chi‐square critical value, estimated at each resampling) (Williamson, 2007, 2013). As in previous studies, after running the combinatorial optimization, we assessed the goodness‐of‐fit of the census tract‐level synthetic population using the overall total absolute error per household (OTAE/HH) produced in the output from the combinatorial optimization algorithm (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021).
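The resampling idea can be illustrated with a deliberately simplified sketch. It assumes a single constraint table (household income category) and uses total absolute error as the objective rather than RSSZ; the function names, cooling schedule, and toy data are hypothetical and are not the authors' implementation.

```python
import math
import random

def total_absolute_error(selection, microdata, constraints):
    """Sum over categories of |synthetic count - constraint count|."""
    counts = {category: 0 for category in constraints}
    for idx in selection:
        counts[microdata[idx]] += 1
    return sum(abs(counts[c] - constraints[c]) for c in constraints)

def anneal_tract(microdata, constraints, n_households, steps=5000,
                 start_temp=2.0, cooling=0.999, seed=0):
    """Resample microdata records (with replacement) to fit tract constraints."""
    rng = random.Random(seed)
    # Start from a random sample drawn with replacement from the microdata.
    selection = [rng.randrange(len(microdata)) for _ in range(n_households)]
    error = total_absolute_error(selection, microdata, constraints)
    temp = start_temp
    for _ in range(steps):
        if error == 0:
            break
        # Propose swapping one selected record for a random microdata record.
        slot = rng.randrange(n_households)
        previous = selection[slot]
        selection[slot] = rng.randrange(len(microdata))
        new_error = total_absolute_error(selection, microdata, constraints)
        # Keep improvements; keep worse swaps with a probability that
        # shrinks as the temperature cools.
        if new_error <= error or rng.random() < math.exp((error - new_error) / temp):
            error = new_error
        else:
            selection[slot] = previous
        temp *= cooling
    return selection, error

# Toy example: the PUMA microdata are 40% below median income, but the
# tract constraint says 30 of 100 households are below median income.
microdata = ["below_median"] * 40 + ["at_or_above_median"] * 60
constraints = {"below_median": 30, "at_or_above_median": 70}
selection, error = anneal_tract(microdata, constraints, n_households=100)
```

The returned `error` is the remaining mismatch between the synthetic tract and its constraint counts; dividing such an error by the number of households gives a per-household fit measure in the spirit of OTAE/HH.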

Table 1.

Public Use Microdata Sample (PUMS)

| Type | Variables |
|---|---|
| Person | Age, Sex, Ancestry, Educational Status, Relationship to reference person, Hispanic/Non‐Hispanic, Serial Number |
| Household | Tenure, Household Income, Adjusted Household Income, Number of people in house, Number of rooms in house, Year built |

Table 2.

American Community Survey Tables

| Table number | Table name |
|---|---|
| B04006 | People Reporting Ancestry |
| B03001 | Hispanic or Latino by Specific Origin |
| B01001 | Sex by Age |
| B15001 | Sex by educational attainment for the population 18 years and over |
| B19307 | Age of householder by household income in the past 12 months (2021 inflation‐adjusted dollars) |
| B25007 | Tenure by Age of Householder |

2.4. Stage 2: Parcel‐Level Synthetic Population

After creating the census tract‐level synthetic population data set, we developed an allocation algorithm to downscale further by matching households to building parcels. We selected variables that we anticipated would correlate with a range of household attributes related to sociodemographic characteristics available in both the MassGIS Property Tax Parcel Data (MassGIS Bureau of Geographic Information, 2024b) and the census tract‐level synthetic population data. Proceeding by PUMA, we categorized each of the matching variables based on the respective distributions of the property tax data and the census‐tract level synthetic population data. These variables are total value (tertiles), number of rooms (tertiles), and year built (above/below median). Based on matching methods commonly implemented in observational epidemiologic studies, we applied a treatment‐control matching algorithm in R. To implement this algorithm, we structured the data in two columns: “treatment” (households in the property tax data) and “control” (households in the synthetic population data from Stage 1). We matched the control households to treatment households using 1:1 nearest neighbor matching without replacement within each respective census tract (e.g. control households in census tract 1 matched with treatment households in census tract 1). This method applies logistic regression, calculating propensity scores for both treatment and control observations based on the variables provided (household characteristics: number of rooms, year built, and total value). Propensity score matching of treatment and control groups is commonly used in observational epidemiologic studies (Austin, 2011). The propensity score is the probability of exposure (treatment vs. control group) measured from a set of predictors. 
After assigning propensity scores to all treatment and control observations, the matching method then assigns controls to treatment observations by finding the closest matching propensity score. To determine how well the matching worked, we then examined the standardized mean differences for the matched covariates (Belitser et al., 2011).
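The assignment step of 1:1 nearest neighbor matching without replacement can be sketched as follows. This is a minimal illustration that assumes the propensity scores have already been estimated (for example, by logistic regression on number of rooms, year built, and total value) and that processes treatment units in input order; the function name, identifiers, and scores are hypothetical, and the source does not show the R code actually used.

```python
def match_nearest_without_replacement(treated_scores, control_scores):
    """Greedy 1:1 nearest-neighbor matching on propensity score.

    Both arguments map an identifier to a propensity score. Each treated
    unit receives the closest control that has not already been used.
    """
    available = dict(control_scores)
    matches = {}
    for treated_id, score in treated_scores.items():
        if not available:
            break  # more treated units than controls: the rest stay unmatched
        best = min(available, key=lambda cid: abs(available[cid] - score))
        matches[treated_id] = best
        del available[best]  # "without replacement": each control is used once
    return matches

# Toy example within a single census tract:
# "treatment" = parcels, "control" = synthetic households.
parcels = {"parcel_A": 0.80, "parcel_B": 0.35, "parcel_C": 0.55}
households = {"hh_1": 0.30, "hh_2": 0.60, "hh_3": 0.78, "hh_4": 0.10}
matches = match_nearest_without_replacement(parcels, households)
```

Running the function separately for each census tract restricts matches to households and parcels in the same tract, which mirrors the within-tract matching described above.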

Implementing this method required addressing missing values in our chosen treatment and control groups: both the property tax data and the census tract‐level synthetic population data had missing values, and the property tax data had fewer household observations (N = 364,472) than the tract‐level synthetic population data (N = 535,063). We imputed both to reduce missingness and made several manual corrections to the property tax data (Text S2 in Supporting Information S1). After imputation, four PUMAs had adequate property tax data for parcels within all census tracts, and we were therefore able to assign synthetic population households to all census tracts in those PUMAs. Although we were able to expand the number of households in the property tax data where the parcels indicated more households than there were observations in the data set, and to impute missing variables, 6 PUMAs still contained census tracts with fewer households in the property data than in the synthetic population. These census tracts were therefore omitted from the final parcel‐level population. After imputation and manual correction, we were able to assign 87% of synthetic population households to parcels (N = 464,490) (Tables S3 and S4 in Supporting Information S1). The full schematic of PUMA to census tract to parcel‐level allocation is provided in Figure S1 of Supporting Information S1. As a sensitivity analysis, we compared the results of this matched allocation with a random allocation of the census tract‐level synthetic population to the property tax parcels.

2.5. Use‐Case Example: Residential Proximity to Major Roadway

As a case study of the application of this parcel‐level synthetic population to an exposure‐ and health‐relevant environmental variable, we examined the shortest distance to the nearest major road from the centroid of each parcel. This exposure is a proxy for traffic‐related air pollution and other similarly patterned exposures (e.g., traffic noise). We used the Massachusetts Department of Transportation line shapefiles for public and private roads in the state, which are publicly available from MassGIS (MassGIS Bureau of Geographic Information, 2024a). Major roads were designated as interstates, highways, and state routes. Based on peer‐reviewed literature on health impacts of close proximity to major roadways, we designated parcels within 50 m of a major road as high exposure (Brugge et al., 2013; Chen et al., 2017; Clark et al., 2010; Freid et al., 2021; Gaskins et al., 2018; Kingsley et al., 2015, 2016; Lue et al., 2013). To consider the potential for exposure misclassification across sociodemographic covariates, we assume the distance to the nearest major road is assessed “correctly” in the matched allocation and then assign these distances to the randomly allocated parcel data using a household serial number (Figure S2 in Supporting Information S1).
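The exposure classification can be sketched as a point-to-polyline distance calculation. The example below assumes parcel centroids and road vertices are already in a projected coordinate system with units of meters and that each road is represented as a list of straight segments; the function names and coordinates are hypothetical.

```python
import math

def point_segment_distance(px, py, ax, ay, bx, by):
    """Shortest planar distance from point P to the segment from A to B."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment (A == B)
    # Project P onto the line through A and B, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def is_high_exposure(centroid, road_segments, threshold_m=50.0):
    """True if the centroid lies less than threshold_m from any segment."""
    nearest = min(
        point_segment_distance(*centroid, *start, *end)
        for start, end in road_segments
    )
    return nearest < threshold_m

# Toy example: one straight major road running 100 m along the x-axis.
roads = [((0.0, 0.0), (100.0, 0.0))]
near_parcel = is_high_exposure((50.0, 30.0), roads)   # 30 m from the road
far_parcel = is_high_exposure((50.0, 80.0), roads)    # 80 m from the road
edge_parcel = is_high_exposure((130.0, 40.0), roads)  # exactly 50 m from the road end
```

The strict inequality follows the <50 m versus ≥50 m split used for the exposure categories, so a parcel exactly 50 m away is classified as low exposure.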

We used these two metrics to look for differences in exposure by sociodemographic covariates that may also be potential modifiers of adverse health outcomes of exposure to major roadways (Cakmak et al., 2016). For example, we looked at conclusions about the percent of households close to a major road across median household income and rent/own status (tenure). In addition, we conducted two comparisons. First, we looked at the percentage of households changing exposure classifications in the matched compared to the random allocation, which provides insight into potential exposure misclassification at the individual household level. Second, we calculated the aggregate number of households characterized as high risk (high exposure and renter or high exposure and below median household income), to determine whether overall conclusions about environmental justice are affected by any exposure misclassification.
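The two comparisons can be illustrated with a small cross-tabulation keyed on household serial number. The data below are invented for illustration only; the function name is hypothetical.

```python
from collections import Counter

def classification_shift(matched, random_alloc):
    """Count households by (class under matched, class under random allocation)."""
    return Counter((matched[hh], random_alloc[hh]) for hh in matched)

# Toy data: exposure class per household serial number under each allocation.
matched = {1: "E+", 2: "E+", 3: "E-", 4: "E-", 5: "E-"}
random_alloc = {1: "E+", 2: "E-", 3: "E+", 4: "E-", 5: "E-"}

shift = classification_shift(matched, random_alloc)

# Household-level agreement: share of matched-E+ households that random
# allocation also labels E+.
n_matched_high = sum(1 for v in matched.values() if v == "E+")
agreement = shift[("E+", "E+")] / n_matched_high

# Aggregate comparison: overall share of E+ households under each method.
share_matched = n_matched_high / len(matched)
share_random = sum(1 for v in random_alloc.values() if v == "E+") / len(random_alloc)
```

In this toy case the aggregate share of high-exposure households is identical under both methods even though only half of the matched high-exposure households keep that label, which is the kind of pattern the two comparisons are designed to separate.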

2.6. Hypothetical Health Impact Assessment

We evaluated whether using a synthetic population produced with matching versus random allocation would lead to different conclusions about health impacts. A key input for a health impact assessment is an exposure‐response association derived from the epidemiological literature, which could include both a main effect of the exposure of interest and a modifier variable (e.g., socioeconomic status). We developed a hypothetical health impact assessment, generating mock inputs for modifier, exposure, and outcome variables. The conceptual model that we applied used a logistic regression, as one example of a plausible exposure‐response association:

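$$\operatorname{logit}(Y) = \beta_0 + \beta_M \cdot \text{modifier} + \beta_E \cdot \text{exposure} + \beta_I \cdot (\text{modifier} \times \text{exposure})$$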

To examine how matching versus random allocation and effect modification make a difference in health impact assessment, we looked for changes in cases among householders with the exposure and the modifier for various exposure‐modifier associations and model coefficients. While this model is presented as a hypothetical, exposures might include proximity to major roadway, heat index, greenspace, etc. Modifiers might include a range of sociodemographic characteristics such as median household income, householder race and ethnicity, tenure, etc.
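A minimal sketch of this calculation follows, computing expected case counts from the logistic model for binary modifier and exposure variables. It assumes that random allocation can be represented by making the modifier independent of exposure while keeping the same marginal prevalence, and it uses an illustrative population size and exposure prevalence; these choices and all function names are assumptions, not the simulation design in the Supporting Information.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def expected_cases(cell_counts, b0, bM, bE, bI):
    """Expected cases; cell_counts maps (modifier, exposure) in {0,1} to counts."""
    return sum(
        n * expit(b0 + bM * m + bE * e + bI * m * e)
        for (m, e), n in cell_counts.items()
    )

def cells_from_probabilities(n, p_exposed, p_mod_given_exposed, p_mod_given_unexposed):
    """Split n households into the four (modifier, exposure) cells."""
    n_exp = n * p_exposed
    n_unexp = n - n_exp
    return {
        (1, 1): n_exp * p_mod_given_exposed,
        (0, 1): n_exp * (1 - p_mod_given_exposed),
        (1, 0): n_unexp * p_mod_given_unexposed,
        (0, 0): n_unexp * (1 - p_mod_given_unexposed),
    }

# Matched allocation: strong exposure-modifier association.
matched = cells_from_probabilities(10_000, 0.5, 0.90, 0.10)
# Random allocation (assumed): same marginal modifier prevalence (0.5),
# but no association with exposure.
random_cells = cells_from_probabilities(10_000, 0.5, 0.50, 0.50)

# Exposure main effect only (modifier and interaction coefficients are 0).
diff_no_interaction = (expected_cases(matched, 0, 0, 0.8, 0)
                       - expected_cases(random_cells, 0, 0, 0.8, 0))
# Interaction only (modifier and exposure coefficients are 0).
diff_interaction = (expected_cases(matched, 0, 0, 0, 1.1)
                    - expected_cases(random_cells, 0, 0, 0, 1.1))
```

Under these assumptions the two allocations give the same expected cases when only the exposure main effect is non-zero, while a positive interaction coefficient yields more expected cases under matched allocation because more households carry both the exposure and the modifier.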

3. Results

In the first stage, we created a census tract‐level synthetic population by combinatorial optimization with low overall total absolute error per household (OTAE/HH). The two tracts with higher OTAE/HH had very few households (one tract with 4 households and one tract with 363 households) relative to other census tracts. Overall, 70% of census tracts in the final data set had low OTAE/HH (OTAE/HH < 1), with 99% of tracts having OTAE/HH < 5 (Figure S3 in Supporting Information S1). In the matching algorithm, we found small standardized mean differences across the covariates (mean overall: −0.05, range: −2.82 to 1.47), an indication of balanced covariates (number of rooms, year built, and total value) in the pairing of treatments (property tax data parcels) and controls (synthetic population households) (Belitser et al., 2011).
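For readers unfamiliar with the balance diagnostic, one common definition of the standardized mean difference divides the difference in group means by a pooled standard deviation. The sketch below uses that definition as an assumption, since the source does not state the exact formula applied; the function name and data are hypothetical.

```python
import math
import statistics

def standardized_mean_difference(treated_values, control_values):
    """(mean_treated - mean_control) / sqrt((var_treated + var_control) / 2).

    Assumes each group has at least two values and that the pooled
    variance is non-zero.
    """
    mean_diff = statistics.mean(treated_values) - statistics.mean(control_values)
    pooled_sd = math.sqrt(
        (statistics.variance(treated_values) + statistics.variance(control_values)) / 2
    )
    return mean_diff / pooled_sd

# Toy example: number of rooms for matched parcels and synthetic households.
parcel_rooms = [5, 6, 7, 8]
household_rooms = [5, 6, 7, 9]
smd_rooms = standardized_mean_difference(parcel_rooms, household_rooms)
```

Values near zero indicate that the matched groups have similar distributions for that covariate.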

Descriptive statistics for the aggregate (all households) and stratified (by proximity to major roadway) synthetic population are provided in Table 3. In aggregate, we allocated a total of 464,490 households with a median income of approximately $92,093. Most households are owned (67.4%) and built before 2000 (89.8%). Householders are mostly White race alone (85.2%) and more than half of householders have higher education degrees (57.1%), reflecting the fact that the larger MRW region includes many suburban communities along with cities in the urban core. Table 3 also presents the matched allocation of households to parcels by proximity to nearest major roadway. Comparing household and householder characteristics by stratum of proximity to nearest major roadway, we see no major differences across these sociodemographic categorizations. A slightly higher percent of those living within 50 m of major roadways (51.2%) make below median income compared to those living more than 50 m from a major roadway (50.1%). Those living within 50 m of major roadways are somewhat more likely to rent (34.3%) compared to those living further away (31.0%). Finally, those living within 50 m of major roadways are more likely to live in smaller homes (48.5% with less than 6 rooms) than those living further away (45.9%).

Table 3.

Household and Householder Characteristics in Aggregate and Stratified by Proximity to Major Roadway

| Characteristic | Aggregated, n | Aggregated, % | Households <50 m (6.8% of total), n | Households <50 m, % | Households ≥50 m (93.2% of total), n | Households ≥50 m, % |
|---|---|---|---|---|---|---|
| Total | 464,490 | 100.0 | 31,405 | 100.0 | 433,085 | 100.0 |
| Household characteristics | | | | | | |
| Income (<median a) | 232,243 | 50.0 | 16,070 | 51.2 | 216,173 | 49.9 |
| Tenure: Rent | 144,988 | 31.2 | 10,771 | 34.3 | 134,217 | 31.0 |
| Tenure: Own | 313,215 | 67.4 | 20,187 | 64.3 | 293,028 | 67.7 |
| Tenure: Other | 6,287 | 1.4 | 447 | 1.4 | 5,840 | 1.3 |
| Year Built (Pre‐2000) | 416,972 | 89.8 | 28,214 | 89.8 | 388,758 | 89.8 |
| Number of Rooms (<6) | 214,171 | 46.1 | 15,241 | 48.5 | 198,930 | 45.9 |
| Building Value: <$322,526.40 | 154,881 | 33.3 | 10,854 | 34.6 | 144,027 | 33.3 |
| Building Value: $322,526.40 – $475,000.00 | 156,909 | 33.8 | 10,081 | 32.1 | 146,828 | 33.9 |
| Building Value: >$475,000.00 | 152,700 | 32.9 | 10,470 | 33.3 | 142,230 | 32.8 |
| Householder characteristics | | | | | | |
| Age (≥65 years) | 124,030 | 26.7 | 8,514 | 27.1 | 115,516 | 26.7 |
| Hispanic/Latino | 37,034 | 8.0 | 2,448 | 7.8 | 34,586 | 8.0 |
| Education: Higher Education Degree | 265,343 | 57.1 | 17,683 | 56.3 | 247,660 | 57.2 |
| Education: High School Diploma | 170,474 | 36.7 | 11,761 | 37.5 | 158,713 | 36.6 |
| Education: No High School Diploma | 28,673 | 6.2 | 1,961 | 6.2 | 26,712 | 6.2 |
| Race b: White alone | 395,907 | 85.2 | 26,871 | 85.6 | 369,036 | 85.2 |
| Race b: Black alone | 22,885 | 4.9 | 1,424 | 4.5 | 21,461 | 5.0 |
| Race b: Asian alone | 22,584 | 4.9 | 1,589 | 5.1 | 20,995 | 4.9 |
| Race b: Some other race alone | 12,950 | 2.8 | 750 | 2.4 | 12,200 | 2.8 |
| Race b: Two or more races | 9,777 | 2.1 | 734 | 2.3 | 9,043 | 2.1 |
| Race b: American Indian or Alaska Native alone | 380 | 0.1 | 37 | 0.1 | 343 | 0.1 |
| Race b: Native Hawaiian & Other Pacific Islander alone | 7 | 0.0 | 0 | 0.0 | 7 | 0.0 |

a Median household income $92,092.71.

b As categorized by the ACS 2017–2021 5‐year census data.

We compared the proximity to major roads for each individual parcel within a given census tract with a hypothetical situation where the average proximity was assigned to all parcels within a census tract (Figure 2). We found differing exposure assignments where the heterogeneity in exposures is omitted in the latter case and with substantial error for some parcels that border major roadways. Analogous issues occur for median income by parcel versus tract (Figure S4 in Supporting Information S1). Examining the misclassification of exposure by effect modifier in the aggregate across the watershed, we see little evidence that the percentage of households living close to major roads and below median income (Table 4) differs across allocation methods. For example, 3.3% of the households in the matched allocation are both below median household income and close to a major roadway, compared to 3.4% in the random allocation. An investigation of tenure by proximity to major roadway yields similar findings (Table S5 in Supporting Information S1).

Figure 2.

Public Use Microdata Area 00506: (a) assigning tract‐level average proximity to major roadway to parcels compared to (b) parcel‐level average proximity to major roadway by parcels. Major roadways are shown in red, and tract outlines are shown in black.

Table 4.

Allocation by Proximity to Major Roadway and Household Income (Median: $92,092.71)

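| Allocation | Matched | Matched | Matched | Matched | Random | Random | Random | Random |
|---|---|---|---|---|---|---|---|---|
| Modifier b | M+ | M+ | M− | M− | M+ | M+ | M− | M− |
| Assigned exposure a | E+ | E− | E+ | E− | E+ | E− | E+ | E− |
| Total households (%) | 3.3 | 46.7 | 3.5 | 46.5 | 3.4 | 46.6 | 3.3 | 46.7 |

a <50 m (E+) or ≥50 m (E−) from major roadway.

b <Median household income (M+) or ≥median household income (M−).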

However, when we investigate the individual household level, there is a substantial amount of exposure misclassification, with both “false positives” and “false negatives” that are masked in the aggregate calculations. For example, in the data set created by matching (number of rooms, total value, and year built), 15,335 households are both below the median household income and within 50 m of a major roadway (Table 5). Random allocation only correctly assigns 3,145 (21%) of these households. There are 12,726 households that are in the low exposure and low‐income group using matched allocation but are incorrectly assigned to be high exposure using random allocation. Similarly, there are 12,925 households that are in the high exposure and low‐income group using matched allocation but are incorrectly assigned to be low exposure using random allocation. The same analysis was done with tenure as the modifying variable (Table S6 in Supporting Information S1). The Sankey diagrams in Figures S5 and S6 of Supporting Information S1 illustrate the shift in exposure and modifier categorization that occurs based on the method of allocating households to parcels.

Table 5.

Assigned and Calculated Proximity to Major Road by Household Income (Median: $92,092.71)

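| Allocation | Matched | Matched | Matched | Matched | Random | Random | Random | Random |
|---|---|---|---|---|---|---|---|---|
| Modifier b | M+ | M+ | M− | M− | M+ | M+ | M− | M− |
| Assigned exposure a | E+ | E− | E+ | E− | E+ | E− | E+ | E− |
| True E+, households (n) | 15,335 | 0 | 16,070 | 0 | 3,145 | 12,925 | 3,055 | 12,280 |
| True E−, households (n) | 0 | 216,912 | 0 | 216,173 | 12,726 | 203,447 | 12,479 | 204,433 |

a <50 m (E+) or ≥50 m (E−) from major roadway.

b <Median household income (M+) or ≥median household income (M−).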

To illustrate the influence of this exposure misclassification and examine the potential application of a parcel‐level synthetic population to a health impact assessment, we simulated health impacts of an adverse environmental risk factor with hypothetical epidemiological evidence. We looked for differences in health impact estimates for varying magnitudes of model coefficients and exposure‐effect modifier associations (Text S5 in Supporting Information S1). We present the difference in counts of health outcomes (difference in cases) in our simulation of two exposure scenarios. For our initial simulations (Figure 3), we assumed a strong positive association between exposure and effect modifier value. We did so by setting a high probability that an exposed individual had the “high risk” effect modifier (P(Modifier | Exposure +) = 0.90) and a low probability that an unexposed individual had the “high risk” effect modifier (P(Modifier | Exposure −) = 0.10). In this simulation, there are no differences when the interaction term is zero (Figure 3a). However, there is systematic underestimation of health impacts using random allocation for non‐zero interactions, with the magnitude of the bias increasing with increasing strength of the interaction (βI) (Figure 3b). If there is a non‐zero interaction (βI ≠ 0), the degree of underestimation depends on the magnitude of the exposure main effect βE, where a higher exposure main effect relative to the interaction term leads to lower bias (Figure 3c). In situations where the probability of “high risk” modifier status conditional on exposure is closer to random (e.g., P(Modifier | Exposure +) = 0.70), the essential patterns are similar but the magnitude of the difference in health impact estimates for random versus matched allocation decreases (Figure 4).

Figure 3.

Difference in cases (sum of cases in matched exposure scenario minus sum of cases in random exposure scenario) from a hypothetical health impact assessment. Simulated point estimates of average difference in case counts (red) and smoothed (loess local smooth) models with 95% confidence intervals (blue) and β0 = 0. There is a high probability of having the modifier among those exposed (P(M|E+) = 0.90), with a low probability among those unexposed (P(M|E−) = 0.10) in all three plots. Three scenarios are shown for the difference in cases for different values of: (a) the exposure beta‐coefficient with the interaction and modifier beta‐coefficients set equal to 0, (b) the interaction beta‐coefficient with the modifier and exposure beta‐coefficients set equal to 0, and (c) the exposure beta‐coefficient with an interaction beta‐coefficient of 1.1 and modifier beta‐coefficient of 0.3.

Figure 4.

Difference in cases (sum of cases in matched exposure scenario minus sum of cases in random exposure scenario) from a hypothetical health impact assessment. Simulated point estimates of average difference in case counts (red) and smoothed (loess local smooth) models with 95% confidence intervals (blue) and β0 = 0. There is a moderate probability of having the modifier among those exposed (P(M|E+) = 0.70), with a moderately low probability among those unexposed (P(M|E−) = 0.30) in all three plots. Three scenarios are shown for the difference in cases for different values of: (a) the exposure beta‐coefficient with the interaction and modifier beta‐coefficients set equal to 0, (b) the interaction beta‐coefficient with the modifier and exposure beta‐coefficients set equal to 0, and (c) the exposure beta‐coefficient with an interaction beta‐coefficient of 1.1 and modifier beta‐coefficient of 0.3.

4. Discussion

We developed a highly spatially resolved synthetic population data set of 464,490 households in 10 PUMAs in the Greater Boston MRW area, using a novel methodology that combined the creation of a census tract‐level population using combinatorial optimization (Basra et al., 2017; Levy et al., 2014; Milando et al., 2021) with a matching algorithm using tax property records for housing units to downscale to the parcel level. While synthetic populations have been developed for multiple previous applications, including equity analyses, this is the first time to our knowledge that a synthetic population was developed at parcel scale with inclusion of the association between household and parcel attributes. The parcel‐level synthetic population data set can be used to investigate potential disparities in individual and joint exposures by sociodemographic characteristics.

For a spatially heterogeneous exposure (residential proximity to major roadway, a proxy for air pollution exposure) we found substantial differences in exposure assignment when using parcel‐level as compared to aggregated census tract‐level data. When comparing our parcel‐level matching algorithm with random allocation, there were small aggregate differences in group‐level exposures stratified by income or tenure but large differences in individual‐level exposure characterization that could have substantial implications for a health impact assessment based on multivariable epidemiological findings.
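The distinction between aggregate and individual‐level agreement can be made concrete with a toy example. The sketch below uses ten hypothetical low‐income households and made‐up exposure flags; it is not drawn from the study data, and the function names are illustrative. Both allocations flag the same share of households as high exposure, yet they mostly flag different households.

```python
def exposure_share(flags):
    """Share of households flagged as high exposure (aggregate view)."""
    return sum(flags) / len(flags)

def household_agreement(matched_flags, random_flags):
    """Share of households flagged under matched allocation that are
    also flagged under random allocation (individual-level view)."""
    flagged = [i for i, flag in enumerate(matched_flags) if flag]
    if not flagged:
        return float("nan")
    both = sum(1 for i in flagged if random_flags[i])
    return both / len(flagged)

# Hypothetical high-exposure flags for the same ten low-income households.
matched_flags = [True, True, True, False, False,
                 False, False, False, False, False]
random_flags = [True, False, False, True, False,
                False, True, False, False, False]

if __name__ == "__main__":
    print("matched share:", exposure_share(matched_flags))
    print("random share:", exposure_share(random_flags))
    print("household-level agreement:",
          household_agreement(matched_flags, random_flags))
```

A group‐level comparison would report no difference here, while a household‐level comparison shows that most flagged households differ between the two allocations.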

The empirical patterns and the importance of using a matching algorithm for parcel‐level exposure assignment will vary by setting and the nature of the health evidence. When applied to a hypothetical multivariable epidemiological analysis, we found that using random allocation led to an underestimation of cases among those exposed to the main exposure (proximity to major roadway) and the effect modifier. The degree of underestimation was dependent on the strength of the association between exposure and effect modifier and the relative values of the coefficients derived from the epidemiological study. Proximity to major roadway can be interpreted in this context as a proxy for air pollution exposure, which is often available at coarse geographic resolution. The exposure misclassification in the random allocation relative to the matched allocation is important if there are sociodemographic factors relevant to health that are not already highly correlated with the modifier of interest (and would therefore merit inclusion in the multivariable model). Similarly, when considering applications to environmental justice analyses or other characterization of sociodemographic exposure patterns, the magnitude and direction of the difference between matched and random allocation depends on the relative sizes of the various exposure and modifier groups, the geographic patterns of exposure, and the extent to which there is an underlying correlation between the factors used in the matching process (household characteristics) and the exposure itself.

One crucial question is whether it is likely that exposure and sociodemographic patterns within at least some census tracts of concern are patterned in a manner that would lead to inaccurate conclusions in the absence of the methods we developed. In geographies containing environmental justice areas, we may anticipate that the true distribution of environmental exposures is heterogeneous by sociodemographic characteristics. The potential spatial misalignment of exposure, modifier, and outcome data is described in a study looking at methods for assigning historical redlining exposure (Maliniak et al., 2023). In that study, misclassification occurred due to differences between present‐day and historical boundaries, which the researchers found could contribute to bias, particularly as the geographic scale (ranging from census blocks to tracts) increases (Maliniak et al., 2023). In general, there is a potential risk of Simpson's paradox, in which patterns observed at disaggregated spatial scales differ from patterns observed in aggregate. This is most likely to occur when exposures and demographics vary substantially between geographic aggregates, which could be expected for multiple environmental exposures (Hernán et al., 2011; Sachdeva & Fotheringham, 2023).
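A small numeric illustration of Simpson's paradox in this setting follows. The counts are hypothetical and chosen only to produce the reversal: within each of two tracts, low‐income households have a higher exposure rate than higher‐income households, but after pooling the tracts the ordering flips because the tracts differ sharply in both exposure and income mix. The tract names and helper functions are illustrative.

```python
# Hypothetical counts: (exposed households, total households) per group.
tracts = {
    "tract_A": {"low_income": (9, 10), "higher_income": (80, 100)},
    "tract_B": {"low_income": (20, 100), "higher_income": (1, 10)},
}

def rate(exposed, total):
    """Exposure rate for one group in one tract."""
    return exposed / total

def pooled_rate(tract_counts, group):
    """Exposure rate for one group after pooling all tracts."""
    exposed = sum(t[group][0] for t in tract_counts.values())
    total = sum(t[group][1] for t in tract_counts.values())
    return exposed / total

if __name__ == "__main__":
    for name, groups in tracts.items():
        print(name,
              "low:", rate(*groups["low_income"]),
              "higher:", rate(*groups["higher_income"]))
    print("pooled low:", pooled_rate(tracts, "low_income"))
    print("pooled higher:", pooled_rate(tracts, "higher_income"))
```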

While our synthetic population follows standard methods and fulfills all internal validation criteria as described above, our method has several limitations. First, we are unable to externally validate the synthetic population because, given the underlying privacy concerns, we lack geographically refined multivariable population attributes. Validation of a synthetic population is primarily based on an understanding of internal validity (Gelb et al., 2024), as measured by the ability of the census data to measure, with accuracy and precision, the attributes of the population residing in a given census area (Warren, 2022). Additionally, we consider how well the population microdata can be resampled to fit the census tract constraints using an overall total average error. In this study, there is an additional validation step related to our parcel‐level matching algorithm, which is based both on the statistical performance of our algorithm and the fidelity of the underlying tax property data. We are unable to compare the tax property data to a relevant “gold standard” of residential building data; however, the need for imputation and the inability to downscale in some tracts due to missingness in the property data suggest that the property data could be improved. Second, this synthetic population is created based on households and householders, not individuals within households. These are the units considered in both the census tract and parcel‐level population generation, so the counts of people within households are not necessarily accurate, hindering individual‐level analyses of people within homes. Future methodological refinements could potentially incorporate household size within the matching algorithm based on secondary models of household size as a function of available attributes like number of rooms and home value, but that was beyond the scope of the present study.
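As a rough illustration of how fit to census tract constraints can be scored, the sketch below computes a total absolute error and its average per constraint category. This is a generic metric commonly used with combinatorial optimization; the exact definition of the error measure used in this study is not given in this section, so the formula, the category names, and all counts are assumptions.

```python
def total_absolute_error(synthetic, target):
    """Sum of absolute differences between synthetic and target counts."""
    return sum(abs(synthetic.get(key, 0) - count)
               for key, count in target.items())

def mean_absolute_error(synthetic, target):
    """Total absolute error averaged over constraint categories."""
    return total_absolute_error(synthetic, target) / len(target)

# Hypothetical constraint counts for one census tract.
target_counts = {
    "owner": 600,
    "renter": 400,
    "below_median_income": 500,
    "at_or_above_median_income": 500,
}

# Hypothetical counts from one candidate synthetic population.
synthetic_counts = {
    "owner": 590,
    "renter": 410,
    "below_median_income": 505,
    "at_or_above_median_income": 495,
}

if __name__ == "__main__":
    print("total absolute error:",
          total_absolute_error(synthetic_counts, target_counts))
    print("mean absolute error per category:",
          mean_absolute_error(synthetic_counts, target_counts))
```

A lower value indicates that the resampled households reproduce the tract margins more closely; it says nothing about whether households are placed at the right parcels within the tract.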

In spite of these limitations, our two‐stage method successfully constructed a parcel‐level synthetic population at a higher spatial resolution than those used in the literature to date. Further, we were able to construct the synthetic population entirely from publicly available data, so that these methods are reproducible in other parts of the US. Additional steps to advance this methodology would include the development of methods relevant to rural and other non‐urban contexts. In rural areas, exposures are likely more homogeneous, population density is lower, and the geographic size of PUMAs, census tracts, and parcels is larger. The necessary scale for a synthetic population would depend on these factors as well as the sociodemographic homogeneity of the rural population (Bray, 2021; Harrington et al., 2020). Future work could further refine the method by incorporating time‐activity, such as time spent at work or school, where additional and/or differing exposures are experienced. This could be captured with the integration of a social network‐infused synthetic population model similar to that presented by Jiang et al. (2022). Finally, an important future step would be to incorporate robust community input for the identification of relevant exposures of interest, validation of the distribution of exposures, and interpretation of findings from parcel‐level synthetic population applications (Van Horne et al., 2023).

5. Conclusions

We developed a parcel‐level synthetic population data set using publicly available census and tax property data for a section of 21 cities and towns in the Mystic River Watershed area in Massachusetts. A use‐case example with an exposure proxy for air pollution (proximity to major roadway) demonstrates the potential for exposure misclassification and application of the synthetic population to analyses of exposure disparities with high spatial heterogeneity. Disaggregating sociodemographic data to the parcel‐level can address the spatial misalignment of exposures and modifiers in environmental justice and health impact assessments.

Conflict of Interest

The authors declare no conflicts of interest relevant to this study.

Supporting information

Supporting Information S1

Table S1

Acknowledgments

This project was partially funded by the United States Environmental Protection Agency (EPA) under assistance agreement RD‐84048001 to Boston University, until the assistance agreement was terminated in May 2025 for no longer effectuating agency priorities. We thank the Georgette L’Italien Memorial Scholarship Fund and the participating universities for their support to allow for the completion of this manuscript. The contents of this document do not necessarily reflect the views and policies of the EPA, nor do they endorse trade names or recommend the use of commercial products mentioned in this document. FB‐I was additionally supported by the National Institute of Environmental Health Sciences (NIEHS) T32 training Grant (T32ES014562). M. Patricia Fabian is supported by the National Oceanic and Atmospheric Administration Grant (NOAA NA21OAR4310313).

Black‐Ingersoll, F., Milando, C. W., Popp, Z. T., Echevarría‐Ramos, M., Fabian, M. P., Nori‐Sarma, A., & Levy, J. I. (2026). A novel method for generating spatially resolved synthetic populations for health impact assessments in vulnerable populations. GeoHealth, 10, e2025GH001596. 10.1029/2025GH001596

Contributor Information

Flannery Black‐Ingersoll, Email: fblackin@bu.edu.

Jonathan I. Levy, Email: jonlevy@bu.edu.

Data Availability Statement

All data used for this study are publicly available. Census tract‐level American Community Survey data (U.S. Census Bureau, 2021b, 2021c, 2021d, 2021e, 2021f, 2021g) are available via https://data.census.gov and Public Use Microdata Samples are available from https://data.census.gov/app/mdat (U.S. Census Bureau, 2021a). Property tax parcel data for Massachusetts are available on the MassGIS website (MassGIS Bureau of Geographic Information, 2024b), as are Massachusetts road line shapefiles (MassGIS Bureau of Geographic Information, 2024a). Shapefiles of Massachusetts PUMA and census tract boundaries are publicly available from the U.S. Census Bureau (U.S. Census Bureau, 2021h, 2021i). A comprehensive rBookdown guide to the code and specific data sets used in this paper, including sample code, is available on the first author's GitHub: https://github.com/Flannery-BIng?tab=repositories. To complete this work, we used R (R Core Team, 2023) and QGIS software (QGIS Development Team, 2024).

References

  1. Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399–424. 10.1080/00273171.2011.568786
  2. Basra, K., Fabian, M., Holberger, R., French, R., & Levy, J. (2017). Community‐engaged modeling of geographic and demographic patterns of multiple public health risk factors. International Journal of Environmental Research and Public Health, 14(7), 730. 10.3390/ijerph14070730
  3. Belitser, S. V., Martens, E. P., Pestman, W. R., Groenwold, R. H. H., de Boer, A., & Klungel, O. H. (2011). Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and Drug Safety, 20(11), 1115–1129. 10.1002/pds.2188
  4. Bray, L. A. (2021). Settler colonialism and rural environmental injustice: Water inequality on the Navajo nation. Rural Sociology, 86(3), 586–610. 10.1111/ruso.12366
  5. Brugge, D., Lane, K., Padró‐Martínez, L. T., Stewart, A., Hoesterey, K., Weiss, D., et al. (2013). Highway proximity associated with cardiovascular disease risk: The influence of individual‐level confounders and exposure misclassification. Environmental Health, 12(1), 84. 10.1186/1476-069X-12-84
  6. Brulle, R. J., & Pellow, D. N. (2006). Environmental justice: Human health and environmental inequalities. Annual Review of Public Health, 27(27), 103–124. 10.1146/annurev.publhealth.27.021405.102124
  7. Bullard, R. D., & Wright, B. H. (1993). Environmental justice for all: Community perspectives on health and research. Toxicology and Industrial Health, 9(5), 821–841. 10.1177/074823379300900508
  8. Cakmak, S., Hebbern, C., Cakmak, J. D., & Vanos, J. (2016). The modifying effect of socioeconomic status on the relationship between traffic, air pollution and respiratory health in elementary schoolchildren. Journal of Environmental Management, 177, 1–8. 10.1016/j.jenvman.2016.03.051
  9. Checker, M. (2021). Environmental justice and gentrification in New York City. Environment: Science and Policy for Sustainable Development, 63(2), 16–27. 10.1080/00139157.2021.1871293
  10. Chen, H., Kwong, J. C., Copes, R., Tu, K., Villeneuve, P. J., van Donkelaar, A., et al. (2017). Living near major roads and the incidence of dementia, Parkinson's disease, and multiple sclerosis: A population‐based cohort study. The Lancet, 389(10070), 718–726. 10.1016/S0140-6736(16)32399-6
  11. Chowkwanyun, M. (2023). Environmental justice: Where it has been, and where it might be going. Annual Review of Public Health, 44(1), 93–111. 10.1146/annurev-publhealth-071621-064925
  12. Clark, N. A., Demers, P. A., Karr, C. J., Koehoorn, M., Lencar, C., Tamburic, L., & Brauer, M. (2010). Effect of early life exposure to air pollution on development of childhood asthma. Environmental Health Perspectives, 118(2), 284–290. 10.1289/ehp.0900916
  13. Dadvand, P., Ostro, B., Figueras, F., Foraster, M., Basagaña, X., Valentín, A., et al. (2014). Residential proximity to major roads and term low birth weight: The roles of air pollution, heat, noise, and road‐adjacent trees. Epidemiology, 25(4), 518–525. 10.1097/EDE.0000000000000107
  14. Freid, R. D., Qi, Y., Espinola, J. A., Cash, R. E., Aryan, Z., Sullivan, A. F., & Camargo, C. A. (2021). Proximity to major roads and risks of childhood recurrent wheeze and asthma in a severe bronchiolitis cohort. International Journal of Environmental Research and Public Health, 18(8), 8. 10.3390/ijerph18084197
  15. Gardner‐Frolick, R., Boyd, D., & Giang, A. (2022). Selecting data analytic and modeling methods to support air pollution and environmental justice investigations: A critical review and guidance framework. Environmental Science & Technology, 56(5), 2843–2860. 10.1021/acs.est.1c01739
  16. Gaskins, A. J., Hart, J. E., Mínguez‐Alarcón, L., Chavarro, J. E., Laden, F., Coull, B. A., et al. (2018). Residential proximity to major roadways and traffic in relation to outcomes of in vitro fertilization. Environment International, 115, 239–246. 10.1016/j.envint.2018.03.029
  17. Gelb, J., Apparicio, P., & Alizadeh, H. (2024). A synthetic vulnerable population dataset for fine scale geographical equity analysis and urban planning. Scientific Data, 11(1), 954. 10.1038/s41597-024-03771-6
  18. Harrington, R. A., Califf, R. M., Balamurugan, A., Brown, N., Benjamin, R. M., Braund, W. E., et al. (2020). Call to action: Rural health: A presidential advisory from the American heart association and American stroke association. Circulation, 141(10), e615–e644. 10.1161/CIR.0000000000000753
  19. Hernán, M. A., Clayton, D., & Keiding, N. (2011). The Simpson's paradox unraveled. International Journal of Epidemiology, 40(3), 780–785. 10.1093/ije/dyr041
  20. Jiang, N., Crooks, A. T., Kavak, H., Burger, A., & Kennedy, W. G. (2022). A method to create a synthetic population with social networks for geographically‐explicit agent‐based models. Computational Urban Science, 2(1), 7. 10.1007/s43762-022-00034-1
  21. Jiang, N., Yin, F., Wang, B., & Crooks, A. T. (2024). A large‐scale geographically explicit synthetic population with social networks for the United States. Scientific Data, 11(1), 1204. 10.1038/s41597-024-03970-1
  22. Joubert, J. W. (2018). Synthetic populations of South African urban areas. Data in Brief, 19, 1012–1020. 10.1016/j.dib.2018.05.126
  23. Kingsley, S. L., Eliot, M. N., Whitsel, E. A., Huang, Y.‐T., Kelsey, K. T., Marsit, C. J., & Wellenius, G. A. (2016). Maternal residential proximity to major roadways, birth weight, and placental DNA methylation. Environment International, 92–93, 43–49. 10.1016/j.envint.2016.03.020
  24. Kingsley, S. L., Eliot, M. N., Whitsel, E. A., Wang, Y., Coull, B. A., Hou, L., et al. (2015). Residential proximity to major roadways and incident hypertension in post‐menopausal women. Environmental Research, 142, 522–528. 10.1016/j.envres.2015.08.002
  25. Lee, C. (2002). Environmental justice: Building a unified vision of health and the environment. Environmental Health Perspectives, 110(suppl 2), 141–144. 10.1289/ehp.02110s2141
  26. Levy, J. I., Fabian, M. P., & Peters, J. L. (2014). Community‐wide health risk assessment using geographically resolved demographic data: A synthetic population approach. PLoS One, 9(1), e87144. 10.1371/journal.pone.0087144
  27. Lin, Y. (2024). Synthetic population data for small area estimation in the United States. Environment and Planning B: Urban Analytics and City Science, 51(2), 553–562. 10.1177/23998083231215825
  28. Lue, S.‐H., Wellenius, G. A., Wilker, E. H., Mostofsky, E., & Mittleman, M. A. (2013). Residential proximity to major roadways and renal function. Journal of Epidemiology & Community Health, 67(8), 629–634. 10.1136/jech-2012-202307
  29. Maliniak, M. L., Moubadder, L., Nash, R., Lash, T. L., Kramer, M. R., & McCullough, L. E. (2023). Census tracts are not neighborhoods: Addressing spatial misalignment in studies examining the impact of historical redlining on present‐day health outcomes. Epidemiology, 34(6), 817–826. 10.1097/EDE.0000000000001646
  30. MassGIS Bureau of Geographic Information. (2024a). MassGIS data: MassGIS‐MassDOT roads [feature layer]. Retrieved from https://www.mass.gov/info-details/massgis-data-massgis-massdot-roads
  31. MassGIS Bureau of Geographic Information. (2024b). MassGIS data: Property tax parcels [feature layer]. Retrieved from https://www.mass.gov/info-details/massgis-data-property-tax-parcels
  32. Matthaios, V. N., Holland, I., Kang, C. M., Hart, J. E., Hauptman, M., Wolfson, J. M., et al. (2024). The effects of urban green space and road proximity to indoor traffic‐related PM2.5, NO2, and BC exposure in inner‐city schools. Journal of Exposure Science and Environmental Epidemiology, 34(5), 745–752. 10.1038/s41370-024-00669-8
  33. Meschede, T., Hamilton, D., Muñoz, A. P., Jackson, R., & Darity, W. (2016). Inequality in the “cradle of liberty”: Race/ethnicity and wealth in greater Boston. Race and Social Problems, 8(1), 18–28. 10.1007/s12552-016-9166-9
  34. Milando, C. W., Black‐Ingersoll, F., Khemani, M., Fabian, M. P., & Levy, J. I. (2025). Filling the gaps in environmental justice data: The role of synthetic populations. Environment International, 204, 109790. 10.1016/j.envint.2025.109790
  35. Milando, C. W., Yitshak‐Sade, M., Zanobetti, A., Levy, J. I., Laden, F., & Fabian, M. P. (2021). Modeling the impact of exposure reductions using multi‐stressor epidemiology, exposure models, and synthetic microdata: An application to birthweight in two environmental justice communities. Journal of Exposure Science and Environmental Epidemiology, 31(3), 442–453. 10.1038/s41370-021-00318-4
  36. Nicolaie, M. A., Füssenich, K., Ameling, C., & Boshuizen, H. C. (2023). Constructing synthetic populations in the age of big data. Population Health Metrics, 21(1), 19. 10.1186/s12963-023-00319-5
  37. Nigra, A. E., Lieberman‐Cribbin, W., Bostick, B. C., Chillrud, S. N., & Carrión, D. (2023). Geospatial assessment of racial/ethnic composition, social vulnerability, and lead water service lines in New York City. Environmental Health Perspectives, 131(8), 087015. 10.1289/EHP12276
  38. QGIS Development Team. (2024). QGIS Geographic Information System [Dataset]. Retrieved from https://www.qgis.org
  39. Racz, L., & Rish, W. (2022). Exposure monitoring toward environmental justice. Integrated Environmental Assessment and Management, 18(4), 858–862. 10.1002/ieam.4534
  40. R Core Team. (2023). R: A Language and Environment for Statistical Computing [Dataset]. Retrieved from https://www.R-project.org/
  41. Rowangould, G. M. (2013). A census of the US near‐roadway population: Public health and environmental justice considerations. Transportation Research Part D: Transport and Environment, 25, 59–67. 10.1016/j.trd.2013.08.003
  42. Sachdeva, M., & Fotheringham, A. S. (2023). A geographical perspective on Simpson’s paradox. Journal of Spatial Information Science, 26, 26. 10.5311/JOSIS.2023.26.212
  43. Shan, X., Casey, J. A., Shearston, J. A., & Henneman, L. R. F. (2024). Methods for quantifying source‐specific air pollution exposure to serve epidemiology, risk assessment, and environmental justice. GeoHealth, 8(11), e2024GH001188. 10.1029/2024GH001188
  44. Smith, A., & Laribi, O. (2022). Environmental justice in the American public health context: Trends in the scientific literature at the intersection between health, environment, and social status. Journal of Racial and Ethnic Health Disparities, 9(1), 247–256. 10.1007/s40615-020-00949-7
  45. US Census Bureau. (2022). US census Bureau glossary. Census. Government. Retrieved from https://www.census.gov/programs-surveys/geography/about/glossary.html
  46. U.S. Census Bureau. (2021a). ACS 5‐Year estimates public use microdata sample [Dataset]. Retrieved from https://data.census.gov/mdat/#/search?ds=ACSPUMS5Y2021
  47. U.S. Census Bureau. (2021b). Age of householder by household income in the past 12 months (2021 inflation‐adjusted dollars) (No. B19037) [Dataset]. Retrieved from https://data.census.gov/table?q=B19037&g=040XX00US25,25$1400000
  48. U.S. Census Bureau. (2021c). Hispanic or Latino by specific origin (version American community survey, ACS 5‐Year estimates detailed tables, table B03001) [Dataset]. Retrieved from https://data.census.gov/table/ACSDT5Y2021.B03001?q=B03001&g=040XX00US25
  49. U.S. Census Bureau. (2021d). People reporting ancestry (version American community survey, ACS 5‐Year estimates detailed tables, table B04006) [Dataset]. Retrieved from https://data.census.gov/table/ACSDT5Y2021.B04006?q=B04006&g=040XX00US25
  50. U.S. Census Bureau. (2021e). Sex by age (version American community survey, ACS 5‐Year estimates detailed tables, table B01001) [Dataset]. Retrieved from https://data.census.gov/table/ACSDT5Y2021.B01001?q=B01001&g=040XX00US25
  51. U.S. Census Bureau. (2021f). Sex by educational attainment for the population 18 years and over (No. B15001) [Dataset]. Retrieved from https://data.census.gov/table?q=B15001&g=040XX00US25,25$1400000
  52. U.S. Census Bureau. (2021g). Tenure by age of householder (version American community survey, ACS 5‐Year estimates detailed tables, table B25007) [Dataset]. Retrieved from https://data.census.gov/table/ACSDT5Y2021.B25007?q=B25007&g=040XX00US25
  53. U.S. Census Bureau. (2021h). Tl_2021_us_puma [Dataset]. TIGER/Line Shapefiles, 2021. Retrieved from https://www2.census.gov/geo/tiger/TIGER2001/PUMA/
  54. U.S. Census Bureau. (2021i). Tl_2021_us_tract [Dataset]. TIGER/Line Shapefiles, 2021. Retrieved from https://www2.census.gov/geo/tiger/TIGER2001/TRACT/
  55. U.S. Census Bureau. (2021j). Understanding and using the American community survey public use microdata sample files: What data users need to know. U.S. Government Printing Office.
  56. Van Horne, Y. O., Alcala, C. S., Peltier, R. E., Quintana, P. J. E., Seto, E., Gonzales, M., et al. (2023). An applied environmental justice framework for exposure science. Journal of Exposure Science and Environmental Epidemiology, 33(1), 1–11. 10.1038/s41370-022-00422-z
  57. Warren, R. (2022). 2020 American community survey: Use with caution, an analysis of the undercount in the 2020 ACS data used to derive estimates of the undocumented population. Journal on Migration and Human Security, 10(2), 134–145. 10.1177/23315024221102327
  58. Wheaton, W. D., Cajka, J. C., Chasteen, B. M., Wagener, D. K., Cooley, P. C., Ganapathi, L., et al. (2009). Synthesized population databases: A US geospatial database for agent‐based models. Methods Report (RTI Press), 2009(10), 905. 10.3768/rtipress.2009.mr.0010.0905
  59. Williamson, P. (2007). CO instruction manual V2.0 working paper 2002/1. Population Microdata Unit, Department of Geography, University of Liverpool.
  60. Williamson, P. (2013). An evaluation of two synthetic small‐area Microdata simulation methodologies: Synthetic reconstruction and combinatorial optimisation. In Tanton R. & Edwards K. (Eds.), Spatial microsimulation: A reference guide for users (pp. 19–47). Springer. 10.1007/978-94-007-4623-7_3
  61. Wing, S. (2005). Environmental justice, science, and public health. Environmental Health Perspectives, 113(8–1), 54–64. 10.1289/ehp.7900
  62. Yuchi, W., Sbihi, H., Davies, H., Tamburic, L., & Brauer, M. (2020). Road proximity, air pollution, noise, green space and neurologic disease incidence: A population‐based cohort study. Environmental Health, 19(1), 8. 10.1186/s12940-020-0565-4



Articles from GeoHealth are provided here courtesy of Wiley
