PLOS One. 2022 Jul 14;17(7):e0270746. doi: 10.1371/journal.pone.0270746

High-resolution gridded estimates of population sociodemographics from the 2020 census in California

Nicholas J Depsky 1,*, Lara Cushing 2, Rachel Morello-Frosch 3
Editor: Krishna Prasad Vadrevu 4
PMCID: PMC9282657  PMID: 35834564

Abstract

This paper introduces a series of high-resolution (100-meter) population grids for eight different sociodemographic variables across the state of California using data from the 2020 census. These layers constitute the ‘CA-POP’ dataset and were produced using dasymetric mapping methods to downscale census block populations, with fine-scale residential tax parcel boundaries and Microsoft’s remotely-sensed building footprint layer as ancillary datasets. In comparison to a number of existing gridded population products, CA-POP shows good concordance and offers a number of benefits, including more recent data vintage, higher resolution, more accurate building footprint data, and in some cases more sophisticated but parsimonious and transparent dasymetric mapping methodologies. A general accuracy assessment of the CA-POP dasymetric mapping methodology was conducted by producing a population grid constrained by population observations within block groups instead of blocks. Comparing this grid’s population apportionment to block-level census values yielded a median absolute relative error of approximately 30% for block group-to-block apportionment. The final CA-POP grids, however, are constrained by higher-resolution census block-level observations, which likely makes them more accurate than these block group-constrained grids over a given region; error assessments of their population disaggregation are not possible because no observational data exist at the sub-block scale. The CA-POP grids are freely available as GeoTIFF rasters online at github.com/njdepsky/CA-POP, for total population, Hispanic/Latinx population of any race, residents under 18 years old (i.e. minors), and non-Hispanic populations for the following groups: American Indian/Alaska Native, Asian, Black/African-American, Native Hawaiian and other Pacific Islander, White, and other race or multiracial (two or more races).

Introduction

Understanding the spatial distribution of human populations is integral to civic and land use planning, public policy design and various fields of academic research. For example, many public health studies in the United States (U.S.) seek to quantify the number of people residing near a potential environmental health hazard [1–4]. Similarly, environmental justice and equity-oriented research often evaluates the degree to which people of color and other socially disadvantaged populations live in closer proximity to environmental contaminants or hazards [3, 5, 6]. However, the ability to estimate fine-scale spatial distributions of populations in many prior studies has been limited to the spatial granularity of population estimates that are made available by public enumerating agencies, such as the U.S. Census Bureau.

In the U.S., the most granular spatial units of enumeration are census blocks, available in each decennial year (i.e. 2000, 2010, 2020). In non-decennial years, the finest scale estimates are made at the block group level, which is coarser than the census block level. In California (CA), for example, blocks have an average land area of roughly 0.8 km2 (~200 acres) and a population of about 75 people each, while block groups are roughly twenty times larger, both in average area and population. Census blocks therefore provide population information at a high spatial resolution, although in more sparsely populated regions their areal extents tend to be much larger and consist largely of open, unpopulated space. Without more precise information about the likely locations of population within these areas, researchers are often forced to assume that populations are uniformly distributed across the entire area of the given census spatial unit [5, 7, 8]. Such simplifying assumptions may have significant implications for study findings, especially for research in rural areas and for research concerned with precisely quantifying populations within an area smaller than local census block or block group areas, such as a specified buffer distance surrounding a polluting facility [5, 9, 10].

To address this issue, many techniques have been developed to disaggregate population estimates to finer scales. Broadly speaking, this field of population downscaling is a form of ‘dasymetric’ mapping, a methodology which dates back many decades [11–13]. Eicher and Brewer (2001) [14] formalized many of the techniques and terminology used in modern dasymetric mapping studies in their study to disaggregate population from the 1990 U.S. census from 159 different counties. They refer to the county boundaries where they have observed population estimates as their “source zones”, then mask out areas likely to be unpopulated within each county based on higher resolution, ancillary land use datasets, with final populated area boundaries within each county deemed their “target zones”. Many subsequent studies emulated the dasymetric mapping techniques detailed in this study, usually employing various land use datasets as their primary source of ancillary data to reapportion population within source zones (e.g., [15–17]). Some studies construct multi-class weighting schemes to reapportion population to target zones based on the characteristics of the land use type (e.g. high-density versus low-density residential) [18–20], and often integrate additional ancillary datasets, such as tax parcel data [21, 22], home address [17, 23], property records [24], building footprints [25] and/or mobile phone data [26].

More recently, researchers have begun to employ more complex machine learning techniques to predict fine-scale population distributions within source zones, often using a wide array of ancillary datasets, such as road networks, nighttime lights, infrastructure and building footprint data, in addition to land use layers as covariates in the models [27–32]. These highly-modeled approaches can represent a significant improvement from simpler techniques, especially in regions of the world for which source zone population estimates from official census surveys are infrequent and/or only exist at very coarse spatial resolutions [33–37]. Leyk et al. (2019) [11] provide a thorough review of dasymetric mapping methods employed in past studies, including these highly-modeled approaches, to construct large-scale (i.e. global, continental) grids of population.

In this paper, we introduce a new suite of publicly available population grids, known as ‘CA-POP’, for the State of California produced using dasymetric mapping methods. The grids represent values for eight different demographic variables from the 2020 U.S. Census and are provided at a pixel resolution of 100 meters. Census blocks from the 2020 census were utilized as source zones, with high-resolution residential tax parcel boundaries and remotely-sensed, individual building footprints used as ancillary datasets to construct target zones of population within each block. A relatively simple, polygon binary method [14] of reapportioning population from the block level source zones to parcel and/or building level target zones was utilized. A qualitative comparison to a few of the more heavily-modeled gridded products available in California (e.g. LandScan, WorldPop) revealed that CA-POP differentiates between populated and unpopulated regions very well relative to those products. This suggests that in contexts where both source zone population estimates and ancillary datasets are available at high resolutions, simpler, more easily-replicable dasymetric mapping techniques can yield high quality grids without needing to employ more complex algorithms.

Producing high-resolution CA-POP grids for various demographic variables estimated in the 2020 census, including racial and ethnic subgroups, can serve as a resource for studies that seek to evaluate these communities. A precursor to the 2020 CA-POP grids developed by the authors, based on 2017 American Community Survey block group and 2010 block source zones, was employed by Casey et al. (2021) [38] to assess social inequalities in residential proximity to large methane-emitting sites and by Pace et al. (2022) [39] to estimate racial/ethnic inequalities in estimated drinking water concentrations of arsenic, nitrate, and hexavalent chromium from community water systems and areas of potentially high domestic well prevalence, demonstrating the utility of CA-POP for environmental equity studies. As more socio-demographic variables are released in future American Community Surveys from the U.S. Census Bureau, additional grids based on these block group-level values may be produced and uploaded to the public CA-POP repository (github.com/njdepsky/CA-POP).

Data and methods

In conjunction with the estimates of census block populations from the 2020 U.S. Census, two sources of ancillary data were used that represent spatial units at a sub-block level of spatial granularity: i) residential tax parcel boundaries and ii) estimates of the individual footprint of every building throughout the state. Both of these ancillary data sources were used to identify areas within each census block likely to contain populated, residential areas, as opposed to vacant, commercial or other non-residential space.

Census data

We utilized block level estimates of population collected during the 2020 U.S. Census–the highest spatial-resolution available from the U.S. Census Bureau–from the (P.L. 94–171) Redistricting Data Summary File [40]. The tabular block-population data for the Summary File, as well as the shapefile of block boundaries were obtained from the U.S. Census Bureau in November 2021 for the entire state of California [41]. Specifically, data were obtained for the following variables: i) total population, ii) Hispanic/Latinx population of any race, iii) non-Hispanic/Latinx populations for all major racial subgroups available in the P.L. 94–171 file and, iv) population of minors (younger than 18 years old) (Table 1). We produced grids for all racial/ethnic subgroups made available thus far for uniform (single) race classifications, with respondents identifying as another race or as multiple races grouped into a grid for “other/multiracial” residents.

Table 1. Census variables represented as CA-POP grids.

| 2020 Census Block Level Population Totals Obtained (P.L. 94–171 Code) | CA-POP Grid Name |
| --- | --- |
| Population (P002001) | TOTAL |
| Hispanic or Latinx (P002002) | HISP |
| Non-Hispanic or Latinx, White (P002005) | NHWHITE |
| Non-Hispanic or Latinx, Black or African American (P002006) | NHBLACK |
| Non-Hispanic or Latinx, American Indian and Alaska Native (P002007) | NHAMIND |
| Non-Hispanic or Latino, Asian (P002008) | NHASIAN |
| Non-Hispanic or Latinx, Native Hawaiian and Other Pacific Islander (P002009) | NHHIPI |
| OTHER/MULTI grid is the combined sum of: Non-Hispanic or Latinx, Some Other Race alone (P002010) + Non-Hispanic or Latinx, Population of two or more races (P002011) | NHOTHERMULTI |
| MINORS grid (population < 18 years old) created from: Population of adults (P003001) (subtracted from P002001) | MINORS |
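
The two derived grids in Table 1 are simple combinations of P.L. 94–171 fields. The following is a minimal sketch of that arithmetic, assuming each block record is a dictionary keyed by the field codes in Table 1; the record layout, the function name and the example counts are hypothetical and are not taken from the CA-POP code.

```python
def derive_grid_values(block):
    """Return CA-POP grid values for one census block record.

    `block` is assumed to be a dict keyed by the P.L. 94-171 codes
    listed in Table 1 (a hypothetical layout for illustration).
    """
    return {
        "TOTAL": block["P002001"],
        "HISP": block["P002002"],
        "NHWHITE": block["P002005"],
        "NHBLACK": block["P002006"],
        "NHAMIND": block["P002007"],
        "NHASIAN": block["P002008"],
        "NHHIPI": block["P002009"],
        # Some Other Race alone + population of two or more races
        "NHOTHERMULTI": block["P002010"] + block["P002011"],
        # Minors = total population minus the adult population
        "MINORS": block["P002001"] - block["P003001"],
    }


# Made-up counts for a single block, for illustration only.
example_block = {
    "P002001": 100, "P002002": 40, "P002005": 30, "P002006": 8,
    "P002007": 1, "P002008": 15, "P002009": 1, "P002010": 2,
    "P002011": 3, "P003001": 78,
}
grids = derive_grid_values(example_block)
```

In this made-up block, the Hispanic/Latinx grid and the six non-Hispanic grids sum to the total population, which is a useful consistency check when preparing the inputs.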

In choosing our racial/ethnic groupings, we sought to maximize the utility of CA-POP for research employing race as a proxy for experiences of racism—particularly racism operating at institutional and structural levels—to determine opportunities and risk factors at the neighborhood level. This is in keeping with the understanding of race as a social construct that has been used to systematically discriminate against and socioeconomically marginalize specific groups of people [42, 43]. We chose groupings that are typical in the environmental justice literature and somewhat reflect shared forms of discrimination [44]. However, we recognize that forms of discrimination vary widely between racial and ethnic groups that we have grouped together (for example, different immigration policies for Mexicans and Cubans, who might both identify as “Hispanic” or “Latino/Latinx”). We were limited in our ability to create more fine-grained categories due to the availability of current data, and grids for additional racial categorizations provided in subsequent 2020 Census or American Community Survey tables (e.g., additional sub-categories for Hispanic/Latinx and Asian respondents) can be generated when these data are released.

The official population enumerated in the 2020 Census for the entire state of California is 39,538,223 people across 519,723 census blocks, with a mean area of 0.79 km2, or 195 acres. Estimates for each of the above values at the block-group level for the 2020 census were also obtained for use in an accuracy assessment of the dasymetric mapping method employed for total population. Block-groups are at a significantly coarser spatial resolution than blocks (~1:20), with a total count of 25,607 and mean area of 16.0 km2, or 3950 acres.

Residential parcel data

We utilized boundaries for all tax parcels in California from LightBox-Digital Map Products (accessible at digmap.com/platform/smartparcels/), which contains 12,728,980 parcels classified by 278 different land use types. This dataset is used by the California Air Resources Board, among other state agencies, and updated quarterly. The data we utilized for this study was from the final quarter of 2018 and represents tax parcels that were assessed either in 2018 (55% of total) or 2017 (45%). Although the vintage (i.e. date of data collection) of these parcel boundaries is not perfectly consistent with the 2020 census population estimates, obtaining ancillary data with uniform vintages is challenging and rarely done in a completely harmonized manner [15, 30, 35]. Given the relative recency of this parcel data, it is still a valuable source of ancillary data for dasymetric mapping of 2020 census populations.

We identified 30 of these 278 land use classes as residential for use as ancillary data in creating the population grids; 8,839,658 residential parcels represented roughly two-thirds of all parcels statewide, and covered 6.7% of the total area represented in the full parcel dataset. The full list of these residential land use classes is shown in the S1 Table. The average residential parcel area is 3,500 m2 (~38,000 ft2, ~0.86 acres), approximately 220x smaller than the average census block, making these parcel boundaries valuable for downscaling population estimates within blocks. The highest proportion of residential parcel types are ‘SINGLE FAMILY RESIDENTIAL’ (n = 7,255,233, 82.1%) and ‘CONDOMINIUM (RESIDENTIAL)’ (n = 330,047, 3.73%).

In terms of area, the most abundant land use types are ‘SINGLE FAMILY RESIDENTIAL’ (12,850 km2, 41.5%) and ‘RURAL RESIDENCE (AGRICULTURAL)’ (12,000 km2, 38.8%). The amount of populated residential area within each parcel varies greatly, especially between certain land use types, such as ‘SINGLE FAMILY RESIDENTIAL’ and ‘RURAL RESIDENCE (AGRICULTURAL)’. For example, the former tends to be fairly small, encompassing a single house and surrounding lot area, while the latter often includes a farm residence as well as adjacent agricultural fields. Therefore, even within many residential parcel boundaries, there is a need to further distinguish populated versus unpopulated space, which we largely achieve here through the use of building footprint data.

Building footprint data

Further distinguishing between open space and populated areas within blocks and larger residential parcels was done using publicly-available, remotely-sensed building footprints produced by Microsoft for the entire country. The initial version of this dataset was released in 2018, though a second version was released in early 2021 and was obtained in November of 2021 for use in this study. These building footprints were identified from publicly-available satellite imagery of the U.S. and employed a series of machine learning (deep neural net) classification algorithms to identify likely building rooftops, converting these footprints to a polygon shapefile for each state. More information on the production of this dataset can be found on its online source repository (github.com/microsoft/USBuildingFootprints). This dataset contains estimated footprints of 11,542,912 distinct buildings across California, with an average individual building area of 277 m2 (~2980 ft2, ~0.07 acres), or approximately 13x smaller than the average size of residential parcels, and ~2,850x smaller than the average census block area statewide. The approximate date range of the source satellite image used to create each building footprint is also provided, with 91.6% of all buildings delineated using imagery from 2018 or later.
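
The size ratios quoted for blocks, parcels and building footprints follow directly from the reported mean areas. The short check below is only an arithmetic illustration using the rounded values stated in the text; the variable names are our own.

```python
ACRE_M2 = 4046.86  # square meters per acre

mean_block_m2 = 0.79e6     # mean census block area (0.79 km2)
mean_parcel_m2 = 3500.0    # mean residential parcel area
mean_building_m2 = 277.0   # mean building footprint area

# Ratios between the three spatial units (all approximate, since the
# inputs are rounded values from the text).
block_to_parcel = mean_block_m2 / mean_parcel_m2
parcel_to_building = mean_parcel_m2 / mean_building_m2
block_to_building = mean_block_m2 / mean_building_m2
parcel_acres = mean_parcel_m2 / ACRE_M2

print(f"block / parcel     ~ {block_to_parcel:.0f}x")
print(f"parcel / building  ~ {parcel_to_building:.1f}x")
print(f"block / building   ~ {block_to_building:.0f}x")
print(f"mean parcel        ~ {parcel_acres:.2f} acres")
```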

Despite being the ancillary data source of highest spatial granularity, one inherent limitation of the building footprint data is that it is a single-class dataset, with no distinction between building types, making it difficult to identify which buildings are residential structures. Additionally, the classification algorithm used for building delineation is not perfect, with Microsoft reporting its accuracy in terms of precision and recall at 98.5% and 92.4%, respectively. Precision reflects false positives (detecting a building where there is none): a precision of 98.5% implies that about 1.5% of detected footprints are not actual buildings. Recall reflects false negatives (failing to detect an existing building): a recall of 92.4% implies that about 7.6% of existing buildings go undetected. This rate of false negatives is not insignificant, and examples of such instances can be seen in S1 Fig. Given these limitations, we opted to use entire parcel boundaries to represent populated areas in small residential plots, so as to avoid relying completely on building footprints to identify populated structures within all residential parcels; this is described in further detail below. However, the relative performance of Microsoft’s building detection algorithm is still remarkable and their footprint dataset allows for substantial spatial downscaling of likely residential zones, especially in areas where fine-scale residential parcels are absent. Fig 1 presents examples of census block group boundaries and the ancillary datasets.
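
To make the precision and recall figures concrete, the sketch below converts them into error shares and into a rough count of undetected buildings. It assumes, purely for illustration, that the reported precision and recall apply uniformly to the California footprints; the function names are hypothetical and the result is an order-of-magnitude illustration, not a figure reported by the authors or by Microsoft.

```python
def error_shares(precision, recall):
    """Return (share of detections that are false, share of buildings missed)."""
    false_detection_share = 1.0 - precision  # detected footprints that are not buildings
    missed_share = 1.0 - recall              # real buildings with no footprint
    return false_detection_share, missed_share


def implied_missed_buildings(n_detected, precision, recall):
    """Rough count of real buildings that would go undetected.

    Assumes the reported precision and recall hold for this set of
    detections, which is an illustrative assumption only.
    """
    true_detections = n_detected * precision
    actual_buildings = true_detections / recall
    return actual_buildings - true_detections


false_share, missed_share = error_shares(0.985, 0.924)
missed = implied_missed_buildings(11_542_912, 0.985, 0.924)
```

Under these assumptions the number of missed buildings is on the order of several hundred thousand statewide, which is one way to see why footprints alone were not used to constrain populated area.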

Fig 1. Data sources used for the population grid creation process.

Examples are shown in urban (top row) and rural (bottom row) settings. The 2020 census blocks represent the source zones of population and the parcel and building footprint data the ancillary data comprising the target population zones. (Satellite base-imagery source: USGS (NAIP) from The National Map).

Dasymetric methods

Using the 2020 census blocks as source zones for population estimates across California, the residential parcel and individual building polygons were used to apportion population to smaller sub-regions within each block. Therefore, some combination of residential parcels and/or building boundaries served as the target zones for population within each census block, with this final vector layer of populated areas then converted to a 100m-resolution statewide grid. This approach could be classified as a form of ‘polygon binary’ dasymetric mapping, where vector polygon ancillary data sources are used to define populated versus unpopulated classifications within source zones, assuming population is homogeneously distributed amongst the populated target zone regions within each source zone [14].
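
The homogeneous-distribution assumption of the polygon binary approach can be sketched as follows: every target zone in a source zone receives the same population density, so each zone's share of the population is proportional to its area. This is a minimal illustration with hypothetical names and made-up numbers, not the PostGIS implementation used to produce CA-POP.

```python
def apportion_binary(source_population, target_areas_m2):
    """Split a source zone's population across its target zones by area.

    All target zones share one density (people per m2); areas outside
    the target zones receive no population.
    """
    total_area = sum(target_areas_m2)
    if total_area <= 0:
        raise ValueError("source zone has no populated target area")
    density = source_population / total_area
    return [density * area for area in target_areas_m2]


# Made-up block: 75 people, three target zones of 400, 600 and 2,000 m2.
shares = apportion_binary(75, [400.0, 600.0, 2000.0])
```

Because every zone shares one density, the apportioned values always sum back to the source zone population, which is the constraint that makes the block-level census counts exact in the final grid.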

Other, more complex dasymetric mapping techniques that utilize multi-class information associated with ancillary data to assign population density weights to each land use region have been employed in past studies as well [15, 18, 22]. In theory, a similar approach could have been employed with the residential parcel ancillary data used in this study, assigning relative population density weights for each of the 30 residential parcel types. However, coming up with appropriate weights for each parcel type is not straightforward, especially as many of the residential parcel classes are absent or rare in some counties compared to others. Also, many of the studies that apply multi-class weighting schemes utilize population source zones and ancillary land use datasets at much coarser resolutions than the data sources we used (e.g. census tracts rather than blocks) [14, 22, 25]. Accounting for likely variation in population densities between parcel types is less important with smaller source zones like those used in our study. Furthermore, given that the building footprint data lack classifications of structure type and that parcel and building boundaries were often used together to apportion population within a given census block, we opted to treat both ancillary data sources in a binary fashion.

Creation of the 100m x 100m statewide grids from the 2020 census population estimates was done for each census block, in a stepwise fashion as follows:

  • A) Identified all “small” residential parcels to include as eventual target zones in each census block. “Small” residential parcels were defined as those with an area less than or equal to one acre (~4050 m2). However, for five high-density residential classes (‘APARTMENT HOUSE (100+ UNITS)’, ‘APARTMENT HOUSE (5+ UNITS)’, ‘APARTMENTS (GENERIC)’, ‘COOPERATIVE (RESIDENTIAL)’, ‘HIGHRISE APARTMENTS’), parcels of up to 10 acres (~40,500 m2) were included in this “small” categorization. These thresholds were utilized to exclude parcels that contain large areas of open space in addition to residential structures. This was most commonly seen in ‘RURAL RESIDENCE (AGRICULTURAL)’ parcels, whose average size (~18 acres) is roughly 40x that of the average ‘SINGLE FAMILY RESIDENTIAL’ parcel (~0.45 acres), despite both types tending to encompass just one single-family house. Therefore, parcel sizes for most residential parcel types were limited to one acre so that any open space contained within them would not exceed the size of a medium to large yard surrounding a single-family home (S2 Fig). This one-acre threshold corresponds to roughly 40% of the area of a single 100 x 100m grid cell, and therefore any open yard space within these plots is likely to have minimal impact on the eventual gridded output. The 10-acre threshold utilized for the five high-density parcel types was selected after manual inspection showed that parcels in those classes often occupy more area (i.e. a full city block) in urban zones without containing large amounts of unpopulated space.

    These small residential parcels were selected as target zones amongst the ancillary data because most inaccuracies observed upon manual inspection of several hundred parcels were instances of large open spaces being assigned a residential use code, masked out here by selecting only small residential parcels. Given the somewhat common occurrence of false negatives in the building footprint dataset, we did not want to rely on these footprints alone to constrain populated area, though other gridded population efforts have employed such an approach, including the “constrained” WorldPop population grids [28] and those produced by Huang et al. (2021) [25] for the contiguous U.S. (CONUS) region.

  • B) Identified all building footprints within “large” residential parcels not included in step (A) and combined those footprints with the small residential polygon boundaries from (A) to produce the final target zones for populations within all census blocks containing some residential parcel area.

    Building footprints within all “large” residential parcels were assumed to be residential and selected as target zones for population, which masked out open space in these large parcels but still included likely housing structures. These building polygon geometries were then merged with the small residential parcel polygons selected in step (A) to produce the final target zone extents for population apportionment within each census block. Therefore, the target zones within a single census block could consist of both small residential parcel boundaries as well as building footprints if both small and large residential parcels are present.

    Because intersecting the census block, residential parcel and building footprint polygon geometries resulted in some small slivers or fragments of individual parcels or building footprints being assigned to certain blocks, a sliver-removal algorithm was employed to remove most of these instances using an upper population density limit of 1 person per 10 m2. Slivers are defined as any single-part polygon resulting from the block-parcel intersection whose area is less than:

    [original residential parcel area] / [2 * # of polygons descended from a given parcel after intersecting with blocks]

    In other words, if a residential parcel with an area of 1km2 is split evenly across two different blocks into two 0.5km2 portions, they will both be preserved since 0.5km2 > [1km2 / (2 x 2) = 0.25km2]. However, if this same parcel is split across two blocks such that 90% (0.9km2) of its area is contained in one block and 10% (0.1km2) in the other, the smaller portion would be considered a sliver and removed since it is less than 0.25km2. This ultimately resulted in the removal of 3.1% of polygons (in terms of count, not area) resulting from the intersection of census blocks with residential parcels.

    The population density limit of 1 person per 10 m2 roughly corresponds to the 99.9th percentile of observed population density in the original census block source zones and was employed to remove erroneous target zone geometries composed only of small sliver/fragment polygons. Steps (A) and (B) were applied to all census blocks that had some amount of residential parcel area.

  • C) Identified all building footprints within census blocks with a non-zero population, but which contain no residential parcels, and set as target zones for those blocks.

    A small portion of the state’s population resides in census blocks without residential parcels, largely in sparsely-populated regions. In these instances, the only ancillary data available was the building footprint data and population was uniformly apportioned across all building geometries in these cases. The main limitation in this approach is that not all structures are residential, resulting in some likely over-apportionment of populations to non-residential structures, an issue that was largely avoided in step (B) by selecting only buildings within large residential parcels.

  • D) Identified any remaining blocks that have a non-zero population but contain neither residential parcels nor building footprints. We used the census block boundary as the target zone in this case, assuming uniform population distribution across these blocks.

    A tiny fraction of the state’s estimated 2020 population is enumerated in census blocks with neither residential parcels nor detected building footprints. In these cases, the target zones were simply treated as equivalent to the source zones (census block boundaries) and populations were uniformly distributed throughout these areas.
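
The selection rules in steps (A) through (D) can be summarized in a short sketch. This is a simplified reading of the text rather than the authors' PostGIS code: the function names and the acre conversion constant are our own, and edge cases the text does not describe (for example, a block whose only residential parcels are large and contain no detected buildings) are not handled.

```python
ACRE_M2 = 4046.86  # square meters per acre

# The five high-density classes that use the 10-acre "small" threshold.
HIGH_DENSITY_CLASSES = {
    "APARTMENT HOUSE (100+ UNITS)",
    "APARTMENT HOUSE (5+ UNITS)",
    "APARTMENTS (GENERIC)",
    "COOPERATIVE (RESIDENTIAL)",
    "HIGHRISE APARTMENTS",
}


def is_small_parcel(land_use, area_m2):
    """Step (A): one acre for most classes, 10 acres for high-density ones."""
    limit_acres = 10 if land_use in HIGH_DENSITY_CLASSES else 1
    return area_m2 <= limit_acres * ACRE_M2


def is_sliver(piece_area, original_parcel_area, n_pieces):
    """Sliver test for one piece of a parcel split across census blocks.

    n_pieces is the number of polygons the parcel was split into.
    """
    return piece_area < original_parcel_area / (2 * n_pieces)


def target_zone_step(population, has_residential_parcels, has_buildings):
    """Return which step defines a block's target zones, or None if unpopulated."""
    if population == 0:
        return None
    if has_residential_parcels:
        return "A+B"  # small parcels plus footprints inside large parcels
    if has_buildings:
        return "C"    # all building footprints in the block
    return "D"        # the whole block boundary


# The worked example from the text: a 1 km2 parcel split across two blocks.
even_split_kept = not is_sliver(0.5, 1.0, 2)   # 0.5 km2 piece is kept
small_piece_removed = is_sliver(0.1, 1.0, 2)   # 0.1 km2 piece is a sliver
```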

The resultant target zone polygonal geometries produced in steps (A-D) were uniformly assigned population densities based on their parent source zone populations and then converted to 100m x 100m statewide raster grids, which contain values of people per pixel (Fig 2). Roughly 34.5 million people in this final grid (87.2% of the state population) fell within small residential parcels (Step A); 3.6 million (9.1%) resided in large residential parcels and are therefore represented by building footprints within those parcels (Step B); 1.3 million (3.4%) fell in blocks with no residential parcels identified and are therefore uniformly represented across all building footprints within these blocks (Step C); and for 128,000 people (0.3%) there existed neither residential parcel boundaries nor building footprints within their census blocks, resulting in a uniform distribution of those populations across their entire block areas (Step D) (Fig 3). All geospatial operations were performed in a PostgreSQL (v13.3) programming environment using the PostGIS (v2.5) spatial database extension.
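
The conversion from target zones with uniform densities to people-per-pixel values can be illustrated with a toy rasterizer. The sketch below assumes area-weighted rasterization and uses axis-aligned rectangles in place of real polygons so that it stays dependency-free; the text does not specify the exact rasterization rule used in PostGIS, so this is an assumption for illustration, and all names are hypothetical.

```python
CELL = 100.0  # grid resolution in meters


def rasterize(zones, n_cols, n_rows, cell=CELL):
    """Area-weighted rasterization of rectangular target zones.

    zones: list of (xmin, ymin, xmax, ymax, people), coordinates in meters.
    Returns a list of rows (row 0 covers y in [0, cell)) of people per pixel.
    """
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for xmin, ymin, xmax, ymax, people in zones:
        density = people / ((xmax - xmin) * (ymax - ymin))  # people per m2
        for r in range(n_rows):
            for c in range(n_cols):
                # Overlap between the zone and this pixel along each axis.
                dx = min(xmax, (c + 1) * cell) - max(xmin, c * cell)
                dy = min(ymax, (r + 1) * cell) - max(ymin, r * cell)
                if dx > 0 and dy > 0:
                    grid[r][c] += density * dx * dy
    return grid


# One made-up target zone holding 10 people that straddles two pixels.
grid = rasterize([(50.0, 0.0, 150.0, 50.0, 10.0)], n_cols=2, n_rows=2)
total_people = sum(sum(row) for row in grid)
```

With area weighting, a zone that straddles a pixel boundary contributes to each pixel in proportion to the overlapping area, so the grid total matches the apportioned population as long as the grid covers every zone.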

Fig 2. CA-POP’s dasymetric mapping method applied to a single census block.

Example shown is just north of Modesto, CA. Panel (a) shows the block boundary; (b) shows the ancillary residential parcel and building footprint boundaries; (c) shows the polygon boundaries used as the final target zones to assign the block’s population values, retaining the small residential parcel polygons and building footprint polygons within large residential parcels; (d) shows the 100m-resolution grid produced from population apportioned to the final target zones. (Satellite base imagery source: USGS (NAIP) from The National Map).

Fig 3. Process workflow illustrating the identification of population target zones.

Dasymetric mapping process using the residential parcel and building footprint ancillary datasets within each census block and producing the final statewide grids for each population variable considered.

Comparison to other gridded products

Four commonly-used global gridded population products were evaluated against the CA-POP grids: i) Gridded Population of the World v4.11 (GPW), ii) WorldPop (100m, unconstrained) (WPUC), iii) WorldPop (100m, constrained) (WPC), and iv) LandScan (LS). At the time of writing, the population source zones of each of these products were based on the 2010 census at the block level, with populations in later-year grids estimated via different growth forecasting and/or inter-census population estimates to extrapolate 2010 values over time [28, 45, 46]. The latest GPW, WPUC and WPC population grids are for 2020 and LS for 2019, though presumably they will be updated to 2020 census block source zone populations in the near future. We also assessed two sets of population grids produced for the CONUS region: i) those of Huang et al. (2021) [25], which utilized an earlier version of the Microsoft building footprints as ancillary data, and ii) the SocScape grids, which were produced using census block population estimates along with two land use ancillary datasets from 2010–2011 [20].

The GPW product employs the simplest methodology of the grids evaluated, apportioning population from source zones (blocks) to grid pixels through a simple, uniform areal weighting technique that masks out some unpopulated zones, such as water bodies. It is provided at a 1km resolution. WorldPop employs a much more complex approach based on constructing machine learning models using a wide suite of covariates, such as roads, land cover, nighttime lights, infrastructure and protected areas, among others, to predict population distributions within source zones [11, 28]. The WPC grids represent the results of these predictive models, but with population constrained to building footprints as represented in a recent buildings dataset from Maxar/Ecopia (WorldPop.org, [28]). Both WorldPop datasets are provided at 1km and 100m grid resolutions. LandScan employs a “smart interpolation” approach that weights pixels by their likelihood of containing population, based on a large suite of ancillary data, and apportions populations accordingly; it is provided at a 1km resolution [28, 46].

Therefore, one advantage of CA-POP over the GPW and LS grids is its higher resolution (100m compared to 1km), made possible by the fine scale parcel and building footprint ancillary data utilized in its production. This allows for a more granular representation of population distributions, especially in sparsely populated regions (Fig 4). Also, the dasymetric mapping techniques used in CA-POP are a significant improvement over GPW’s simple, areal weighting techniques that assume uniform population distribution throughout census blocks. The CA-POP techniques are simpler than the more heavily-modeled approaches used in LS and the WP grids, require fewer input datasets and are therefore more easily reproducible. The WP and LS products are particularly well-suited for predicting populations in regions of the world where official population estimates are sparse or more coarsely resolved compared to U.S. census blocks.
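
The practical difference between uniform areal weighting and binary dasymetric apportionment shows up when estimating the population inside a query area, such as a buffer around a facility. The sketch below contrasts the two estimates for a single hypothetical rural block; the numbers and function names are made up for illustration and do not come from any of the products discussed.

```python
def areal_weighting_estimate(block_pop, block_area_m2, query_area_in_block_m2):
    """Population in the query area if people are spread over the whole block."""
    return block_pop * query_area_in_block_m2 / block_area_m2


def dasymetric_estimate(block_pop, target_area_total_m2, target_area_in_query_m2):
    """Population in the query area if people are confined to target zones."""
    return block_pop * target_area_in_query_m2 / target_area_total_m2


# Hypothetical rural block: 60 people, 2 km2, all housing on 20,000 m2.
# A buffer covers a quarter of the block but none of the housing.
uniform = areal_weighting_estimate(60, 2_000_000.0, 500_000.0)
dasymetric = dasymetric_estimate(60, 20_000.0, 0.0)
```

In this made-up case, areal weighting places a quarter of the block's residents inside the buffer, while the dasymetric estimate places none there, because the buffer covers only unpopulated land.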

Fig 4. CA-POP compared to the GPW and LS 1km resolution datasets.

Examples are shown for the Fresno area, CA. (Satellite base imagery source: USGS (NAIP) from The National Map).

The CONUS-level grids produced by Huang et al. (2021) [25] represent the closest previously published methodological approach to CA-POP, apportioning population from census tract source zones to Microsoft building footprints (v1) that fall within residential areas identified in a national OpenStreetMap land use database. However, by assigning population solely to Microsoft building footprint boundaries, the approach is vulnerable to the building detection errors associated with that dataset, namely the occurrence of false negatives. Additionally, population values and source zones in the study correspond to 2017 American Community Survey census tracts, a much coarser geographic unit than census blocks. Finally, whereas the CONUS grid produced by Huang et al. (2021) [25] only provides population counts, CA-POP offers grids for multiple sociodemographic variables in addition to total population.

The SocScape grids are the finest resolution population grids we evaluated, covering total population as well as various racial subgroups at a 30-meter resolution for the CONUS region. These grids were produced using census blocks as source zones and a pair of national land use/land cover datasets as the ancillary layers that define the target zones of population apportionment within blocks [20]. These grids are publicly available for download at socscape.edu.pl and appear to have been recently updated to include grids based on 2020 census values. Table 2 summarizes the characteristics of each of the gridded datasets that were compared to CA-POP.

Table 2. Description of gridded datasets assessed.

GPW, WPC, WPUC and LS datasets all currently use 2010 census blocks as their source zones, which will likely be updated to 2020 census blocks in subsequent grids.

| Product | Resolution | Population Apportionment Methods | Gridded Variables |
| --- | --- | --- | --- |
| Gridded Population of the World v4.11 (GPW) | 1km | Areal weighting assuming uniform population distribution using limited land use data (e.g. water bodies) to mask uninhabited areas | Population count; Population density; Population by age and sex (5-year age bins) |
| WorldPop (unconstrained) (WPUC) | 100m | Random forest machine learning algorithm using a wide array of gridded and binary/categorical input covariates (e.g. topography, land cover, nighttime lights, local climate variables etc.) | Population count; Population density; Population by age and sex (5-year age bins) |
| WorldPop (constrained) (WPC) | 100m | Similar modeling approach and input datasets as WPUC but with population limited to Maxar/Ecopia building footprint boundaries | Population count; Population by age and sex (5-year age bins) |
| LandScan (LS) | 1km | “Smart interpolation” modeling approach to weight pixels based on their likelihood of containing population using a large suite of ancillary datasets (e.g. topography, land cover, climate, infrastructure etc.) | Population count |
| Huang et al. 2021 | 100m | Assigns population to Microsoft building footprints (v1) that are masked to residential areas using OpenStreetMap land use data, using census tracts as source zones | Population count |
| SocScape 2010, 2020 | 30m | National Land Cover and Land Use Datasets as ancillary layers and census blocks as source zones | Population count; Six racial subgroups |
| CA-POP | 100m | Uses high resolution statewide tax parcel dataset from LightBox-DMP and Microsoft building footprints (v2) as ancillary datasets to apportion populations from 2020 census block source zones | Population count (total); Hispanic/Latinx population; Non-Hispanic/Latinx population for six racial subgroups and minors |

The incongruity of population source zone vintage between many of these alternative gridded products and CA-POP made direct comparisons to most of these grids challenging, and producing an earlier (i.e. 2010) version of CA-POP would be infeasible given the more recent vintage of its ancillary data. Additionally, once the census block source zone populations underpinning each of the other gridded products are updated to the 2020 census, their block-level errors should in theory be zero, equal to those of the CA-POP grids. Because each of these grids utilizes the highest resolution observed population estimates available, there is no straightforward way to estimate errors in their apportionment of population from source to target zones.

Therefore, we conducted a qualitative assessment of the advantages and disadvantages of CA-POP relative to the four global products based on their underlying data, methods and final grid resolutions, as well as a series of manual accuracy assessments using contemporary satellite base imagery from 2020–2021. In addition, the relative error metrics reported in the SocScape documentation were compared to those of CA-POP, as the accuracy assessment employed by Dmowska and Stepinski (2017) [20] is the most similar to the one we employed for the CA-POP grids.

Accuracy assessment

Census blocks are the finest spatial unit of population estimation tabulated in the census, and therefore represent the highest resolution set of “ground-truth” population estimates available to evaluate population estimation accuracy for different dasymetric modeling exercises. Given the fact that these block-level populations are used to constrain population totals in these grids, there is not an easy way to assess the relative accuracy of the dasymetric mapping approach aside from physically visiting the areas or manual inspection of the final product against recent satellite imagery and with contextual knowledge about likely populated areas. More highly-modeled, machine learning based prediction models of population can perform accuracy assessments using a cross-validation process, whereby certain observed population constraints are withheld from the modeling process and errors at those omitted locations relative to observed values are measured [30, 47]. However, past applications of traditional dasymetric mapping techniques, like those employed in this study, have often evaluated accuracy of their techniques by producing population grids that are constrained by observed populations at a spatial unit that is coarser (e.g. block groups or tracts) than the finest unit available, then evaluating how well the disaggregated population in the resultant grids matches observed population values within the finest spatial unit of ground-truth data (e.g. blocks) [15, 17, 25].

We performed this form of accuracy assessment for our mapping technique by producing a statewide grid using 2020 population estimates at the block-group level (~20x larger than blocks) but using the same ancillary datasets and dasymetric mapping process described above. Block-level errors were then calculated by comparing the values from this grid within each census block to that block’s census population value. Generally, the highest percent errors at the block level in this analysis occurred in blocks with low population totals, due both to the lack of residential parcel data within them and to their small population values (i.e. a small denominator in the percent error calculations) (S3 Fig). However, these errors are not exactly analogous to those that exist in the block-constrained grids comprising the final CA-POP products, which are likely lower in magnitude due to the finer resolution of the input data.
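The block group-to-block check described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes population is spread in proportion to each block's residential target-zone area within its block group, and the function names, block identifiers and numbers are all hypothetical.

```python
def apportion_block_group(bg_population, target_area_by_block):
    """Split one block group's population across its blocks in proportion
    to each block's target-zone (residential) area. Assumed weighting."""
    total_area = sum(target_area_by_block.values())
    if total_area == 0:
        # No target zones in the block group: this sketch assigns nothing;
        # a real workflow would need some fallback rule here.
        return {block: 0.0 for block in target_area_by_block}
    return {
        block: bg_population * area / total_area
        for block, area in target_area_by_block.items()
    }


def percent_error(estimated, observed):
    """Signed percent error relative to the observed block population."""
    return 100.0 * (estimated - observed) / observed


# Hypothetical block group of 1,000 people containing three blocks.
target_area_m2 = {"block_a": 6000.0, "block_b": 3000.0, "block_c": 1000.0}
observed = {"block_a": 550, "block_b": 400, "block_c": 50}

estimated = apportion_block_group(1000, target_area_m2)
errors = {b: percent_error(estimated[b], observed[b]) for b in observed}
```

In this made-up case the smallest block has the largest percent error, even though its absolute error matches that of the largest block, which illustrates the small-denominator effect noted in the text.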

We assessed the relative accuracy of our methods compared to simple uniform, areal population weighting. The errors in each block-level estimate between the modeled (block group-constrained grid) and observed (census block-level data) values are reported in terms of root mean-squared error (RMSE) and the squared Pearson correlation coefficient (R2) across all populated blocks. Median block-wise percent errors were also calculated for both raw and absolute percent error magnitudes, following a similar accuracy analysis approach employed by Dmowska and Stepinski (2017) [20], in which they term the median absolute percent error a measure of ‘relative error’. Given the skewed nature of the percent error distribution across populated blocks, the median was deemed a more informative metric than the mean or standard deviation [20].
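A minimal sketch of these four summary statistics is given below. The sign convention for percent error (estimated minus observed, divided by observed) is an assumption, as are the function name and the example values; the paper does not publish this code.

```python
import math
import statistics


def error_summary(estimated, observed):
    """Return RMSE, squared Pearson r, and median signed/absolute percent
    errors. Assumes every observed value is > 0 (populated blocks only)."""
    n = len(observed)
    rmse = math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)

    # Squared Pearson correlation, computed from sums of deviations.
    mean_e = sum(estimated) / n
    mean_o = sum(observed) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(estimated, observed))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    r_squared = cov ** 2 / (var_e * var_o)

    pct = [100.0 * (e - o) / o for e, o in zip(estimated, observed)]
    return {
        "rmse": rmse,
        "r2": r_squared,
        "median_pct_error": statistics.median(pct),
        "median_abs_pct_error": statistics.median([abs(p) for p in pct]),
    }


# Hypothetical block estimates and census counts.
summary = error_summary([600.0, 300.0, 100.0], [550, 400, 50])
```

The median is taken over blocks rather than people, so a small block counts as much as a large one in the two percent-error statistics.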

The same procedure was carried out for a simple uniform, areal weighting technique, which estimates block-level population values assuming a homogeneous population distribution across the entire spatial area of each block group. Unfortunately, CA-POP grids produced using this ‘second-best’ ground-truth source zone population data (block groups) cannot be compared to other products (e.g. WorldPop, LandScan), because those data providers only release versions of their grids that use the best-available source zone data (blocks) as their ground-truth population constraints. However, SocScape’s documentation reports error estimates for a version of their 2010 population grid constrained by block groups instead of blocks, allowing for a roughly analogous error comparison between CA-POP and SocScape [20].
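For reference, the uniform areal weighting baseline amounts to sharing a block group's population by land area alone. The sketch below uses a hypothetical function name and made-up areas.

```python
def areal_weighting(bg_population, block_area_by_block):
    """Uniform areal weighting: each block receives the block group's
    population in proportion to its share of the block group's total area."""
    bg_area = sum(block_area_by_block.values())
    return {
        block: bg_population * area / bg_area
        for block, area in block_area_by_block.items()
    }


# Hypothetical block group: two small blocks and one large block.
block_area_m2 = {"block_a": 20000.0, "block_b": 15000.0, "block_c": 65000.0}
uniform_estimate = areal_weighting(1000, block_area_m2)
```

Here the largest block receives 650 of the 1,000 people regardless of whether anyone lives there, which is the behavior that dasymetric methods are designed to avoid.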

Results

Statewide, 100-meter resolution raster grids were produced for each of the eight 2020 U.S. Census demographic variables listed in Table 1 for the entire state of California, comprising the CA-POP dataset. Examples of these grids for four demographic variables at different locations across the state are provided in Fig 5 and are publicly available online (see Data Availability).

Fig 5. Final CA-POP grids.

Examples are shown for four demographic population variables at three different locations in California. (Satellite base imagery source: USGS (NAIP) from The National Map, Ocean boundary layer source: Natural Earth).

Comparing the block group-constrained CA-POP grid to the simple uniform areal weighting technique demonstrates the improved accuracy of these dasymetric mapping techniques, in terms of both lower absolute error magnitudes (RMSE) and higher agreement of population distribution (R2) at the block level (Table 3). In terms of median percent errors, the CA-POP method yields much higher accuracy than the simple uniform method both in raw terms (-4.1% compared to -25.9%) and in absolute terms, or ‘relative error’ (30.1% compared to 46.4%). This median relative error of 30.1% is also lower than the 44% relative error reported for SocScape’s 2010 national population grid, which was calculated using the same type of accuracy assessment, comparing block group-constrained grid performance to observed block values [20]. Although SocScape’s relative error is a CONUS-wide value and CA-POP’s value is solely for California, which prevents an exactly analogous comparison, CA-POP’s lower error value suggests that it likely outperforms SocScape on average.

Table 3. Summary error statistics of block group-constrained CA-POP grid.

| Block Population Estimation Method | RMSE (people) | R2 | Median Percent Error | Median Absolute Percent Error (Relative Error) |
| --- | --- | --- | --- | --- |
| Block group-constrained, dasymetric population grid (CA-POP method) | 76.9 | 0.76 | -4.1% | 30.1% |
| Uniform, areal weighting of block group population | 114.6 | 0.54 | -25.9% | 46.4% |

Differences between the block group constrained dasymetric population grids and simple, uniform areal population estimation techniques for populated blocks only (unpopulated blocks excluded).

However, it is important to consider that these accuracy values only reflect the improved accuracy of the block group-constrained grid compared to uniform areal weighting, and not the block-constrained grid, which is how the final grids in this study were produced. By design, the block level errors of the block-constrained grid are zero, and calculating an analogous set of accuracy measures would require ground-truth estimates of population at the sub-block target zones (i.e. residential parcels and buildings), which are not available. Therefore, the accuracy improvements as compared to uniform areal weighting shown in Table 3 are simply to demonstrate the value of utilizing the dasymetric mapping techniques in CA-POP generally, and do not reflect exact accuracy values of the final, block-based grids, which by definition are more accurate across space than block group-based grids.

We also assessed pixel-level differences between the 2020 SocScape and CA-POP total population grids across the state to better evaluate how apportionment of population at the sub-block scale differs between the two methods. Fig 6 displays SocScape pixel values subtracted from CA-POP as a grid, and demonstrates that SocScape distributes low population counts across the majority of open space within blocks, whereas CA-POP more accurately sets these regions to zero. This is evident in the figure’s first two panels, where the light red zones spanning large areas represent regions in SocScape with small population counts across primarily open space, for which CA-POP does not apportion population. In more densely-populated urban zones where blocks are smaller and the two grids are therefore constrained to equal one another at a smaller spatial scale, much of the pixel-level differences emulate random noise, although some patterns appear to suggest that SocScape overly-apportions populations along major streets and roadways where CA-POP does not. The final panel in Fig 6 demonstrates this in South Los Angeles, where red areas (SocScape greater than CA-POP) tend to reflect the pattern of major streets in this neighborhood, a difference that is likely due to CA-POP’s use of ancillary datasets that exclude street surfaces from the population target zones.

Fig 6. Pixel-wise differences between the CA-POP and 2020 SocScape total population grids.

SocScape’s 2020 total population grid was converted to population density, aggregated from 30m to 100m resolution using an average resampling approach and then re-converted to units of people per cell prior to differencing with CA-POP’s total population grid. Blue areas represent regions where CA-POP pixel values are greater than SocScape and red areas are those where SocScape is greater.
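The unit handling described in this caption can be sketched as below. Converting to density first matters because a 30m cell and a 100m cell cover different areas, so raw counts cannot be averaged directly. The sketch assumes that "average resampling" means a weighted mean of the fine-cell densities overlapping each coarse cell; the function name, the weights and all values are hypothetical.

```python
def coarse_cell_count(fine_counts, overlap_weights,
                      fine_res_m=30.0, coarse_res_m=100.0):
    """Convert fine-cell counts to densities, average them using the given
    overlap weights, then rescale to people per coarse cell."""
    densities = [c / fine_res_m ** 2 for c in fine_counts]  # people per m^2
    mean_density = (
        sum(w * d for w, d in zip(overlap_weights, densities))
        / sum(overlap_weights)
    )
    return mean_density * coarse_res_m ** 2  # people per coarse cell


# Hypothetical case: every overlapping 30m cell holds 0.9 people.
socscape_100m = coarse_cell_count([0.9, 0.9, 0.9, 0.9], [1.0, 1.0, 1.0, 1.0])

# Hypothetical CA-POP value for the same 100m cell; a positive difference
# corresponds to the blue areas (CA-POP greater than SocScape).
difference = 12.0 - socscape_100m
```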

In comparing CA-POP to other gridded products, given the high resolution of the population source zones and ancillary datasets used in CA-POP, it is not apparent that the WP and LS approaches produce more accurate grids than CA-POP in a California context. In fact, the global Maxar/Ecopia building footprint dataset used to constrain the population apportionment extents in WPC appears to capture residential structures poorly in many areas of California, especially in medium to low density settings, compared to the Microsoft building footprint data used in CA-POP (Fig 7). Conversely, WPUC, which utilizes nighttime lights and topography covariates in its predictive modeling algorithm, seems to routinely assign population values to pixels based on the pattern of light scatter from street lights or topographic characteristics of the landscape, resulting in the allocation of low population densities across vast swaths of uninhabited space (Fig 7).

Fig 7. Comparison of CA-POP with the unconstrained and constrained WorldPop grids.

Examples are shown for three locations in California. The first two rows represent populated areas and the third row represents largely unpopulated open space. Over-apportionment of population across open space is seen in the unconstrained WorldPop grid, and under-apportionment to building footprints detected by Microsoft (yellow area) in residential parcels is evident in the constrained WorldPop grid, as compared to CA-POP. (Satellite base imagery source: USGS (NAIP) from The National Map).

Discussion

Overall, the CA-POP grids appear to perform well when compared against other available gridded population products (e.g. GPW, WorldPop, LandScan, SocScape) in terms of their high resolution and ability to capture known residential areas in their population apportionment. CA-POP also contains additional demographic variables compared to many alternative products, which can be of use in many research applications concerned with specific demographic subgroups, such as environmental health, equity and justice-oriented research. Gridded raster products allow for easier spatial analysis of values within a given zone of interest compared to vector polygon layers, due to the relative ease of summing or averaging pixels within an area as opposed to intersecting multiple vector polygon layers and conducting some form of subsequent areal weighting within the zone of interest.
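To illustrate why raster summaries are simple, the sketch below sums a small population grid inside a zone of interest. It assumes the zone has already been rasterized to a boolean mask on the same grid; the function name, grid and mask are made up for illustration and no particular GIS library is implied.

```python
def zonal_sum(grid, zone_mask):
    """Sum the pixel values that fall inside the zone (mask is True)."""
    return sum(
        value
        for row, mask_row in zip(grid, zone_mask)
        for value, inside in zip(row, mask_row)
        if inside
    )


# Hypothetical 3x3 population grid (people per pixel) and zone mask.
population = [
    [0.0, 2.5, 3.0],
    [0.0, 4.0, 1.5],
    [0.0, 0.0, 0.5],
]
zone = [
    [False, True, True],
    [False, True, False],
    [False, False, False],
]
people_in_zone = zonal_sum(population, zone)
```

The same mask can be applied to each demographic grid in turn, since the grids are assumed here to share one pixel layout.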

Limitations and potential improvements

Though CA-POP represents a fairly accurate and easily replicable method of gridding different census variables, there are a number of known limitations and improvements that could still be made. For one, the accuracy of the CA-POP grids depends on the accuracy of the 2020 U.S. census values. The 2020 census was unique for a number of reasons, including the COVID-19 pandemic, a potential citizenship question, natural disasters, and various operational changes to the census enumeration process, which may have led to an undercount of particular groups, especially marginalized populations [48]. Recently, The Urban Institute approximated this possible bias by state and urban area through a process of statistically simulating the likely “true” census values, with their results suggesting that the 2020 census in California is biased low by roughly 345,097 people (-0.87%) [48]. Nationally, the study indicates that these undercounts are proportionally higher in certain population subgroups, such as Black and Hispanic communities and young children (-2.45%, -2.17% and -4.86%, respectively) [48]. Therefore, the CA-POP grids based on these census values should be interpreted with the knowledge that statewide totals are likely lower than true populations, especially in certain disadvantaged communities.

Other future improvements to CA-POP could include updating ancillary datasets as they are made available, such as residential tax parcel boundaries of a more recent vintage than 2017–2018, or more accurate building footprint data. In theory, perfectly accurate building footprint data could be used for all final populated target zone boundaries, with tax parcels or other land-use ancillary data layers solely used to identify residential zones within which buildings should be selected, therein avoiding minor inaccuracies associated with regions of open space within small residential parcels currently present in our methodology. Additional types of ancillary data could also be considered to further inform the apportionment of population to eventual target zones within census blocks, such as home address data [17, 23] or mobile phone usage data [26]. The ancillary building footprint data that is utilized can also be potentially further analyzed to infer additional information about likely building types based on patterns and characteristics of building geometry and proximity to one another, analyses for which Jochem and Tatem (2021) [49] constructed the R package foot.

Additionally, some form of multi-class weighting technique could be employed to apportion population between different residential parcel types, as is done in similar studies [14, 18, 22]. This would require estimating the relative population density of each of the 30 residential parcel types and then distributing population within a single source zone according to those weights, rather than distributing population evenly across the target zones within each block. We also plan to produce grids for additional demographic census variables, beyond the eight initial CA-POP grids provided here, as they are released by the U.S. Census.
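One way such a multi-class weighting could look is sketched below. This is not part of CA-POP as published: the parcel types, the relative density weights and the function name are hypothetical, and the weights would in practice need to be estimated from data.

```python
def multiclass_apportion(block_population, parcels, density_weights):
    """Distribute a block's population across parcels in proportion to
    (relative density weight of the parcel type) x (parcel area).

    parcels: list of (parcel_id, parcel_type, area_m2) tuples.
    """
    scores = {
        pid: density_weights[ptype] * area for pid, ptype, area in parcels
    }
    total = sum(scores.values())
    return {pid: block_population * s / total for pid, s in scores.items()}


# Hypothetical weights: multi-family parcels assumed four times as dense.
weights = {"single_family": 1.0, "multi_family": 4.0}
parcels = [
    ("p1", "single_family", 800.0),
    ("p2", "single_family", 800.0),
    ("p3", "multi_family", 400.0),
]
result = multiclass_apportion(120, parcels, weights)
```

Setting every weight to the same value reduces this to the even, area-proportional distribution across target zones, so the block total is preserved in both cases.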

Finally, the methods we utilize here are inherently limited in geographic scope given that only California is represented. The feasibility of constructing a U.S.-wide product using these dasymetric methods, however, is limited by the absence of national, high-quality, publicly-available tax parcel data. Tax parcel data are instead disparately gathered, maintained and provided by different state and local agencies, with a freely-available nationwide product with harmonized land use classifications not currently available for public use [20, 50]. A number of proprietary options are maintained by various data retailers, though licensing fees often make access cost-prohibitive. Tax parcel boundaries are valuable ancillary datasets in many societally-beneficial demographic research contexts and we believe a publicly-funded effort to generate a well-maintained and open-access, national tax parcel dataset should be initiated to help facilitate this work.

Conclusion

In this study, we present a set of high-resolution gridded population products using values from the 2020 U.S. Census for the entire state of California, known as ‘CA-POP’. These grids were produced via dasymetric techniques, using census blocks as the population source zones, with population estimates from the 2020 census redistricting Summary File (P.L. 94–171), and leveraging two high-resolution ancillary datasets (residential parcel boundaries and building footprints), to reapportion the estimated population distributions at the sub-block scale.

Assessing the accuracy of the CA-POP dasymetric mapping methodology for a population grid constrained by block group census observations instead of blocks yielded a block-wise median absolute relative error of approximately 30% for block group-to-block disaggregation. This is lower than the national error rate reported for the CONUS-wide SocScape grids, the product that reports the most analogous form of accuracy assessment for block group-to-block population disaggregation grids derived from U.S. census values. Additionally, given that the final CA-POP grids are not constrained by block groups, but by higher-resolution census block observations, they are likely even more accurate than their block group-constrained counterparts over a given region, though a proper error assessment of these final grids is not possible due to the absence of observational data at the sub-block scale.

The statewide CA-POP population grids are publicly-available at a 100-meter resolution for eight population variables of interest provided by the 2020 census: total population, Hispanic/Latinx population of any race, and non-Hispanic populations of: American Indian/Alaska Native, Asian, Black/African-American, Native Hawaiian and other Pacific Islander, White, other race or multiracial (two or more races), and residents under 18 years old (i.e. minors).

Supporting information

S1 Table. Land use codes from tax parcel dataset identified as residential.

(DOCX)

S1 Fig. False negatives in the Microsoft building footprint data.

Examples shown in urban and rural contexts. Locations were chosen based on the presence of false negatives and do not generally reflect the typical proportion of false negative instances around the state.

(DOCX)

S2 Fig. Examples of 1-acre residential parcels.

1-acre (~4050 m2) was used as the upper area threshold for low-density residential parcels.

(DOCX)

S3 Fig. Percent errors of block group-constrained CA-POP population grid relative to census block population observations.

Values represent the percent difference between block-level estimates from the block group-constrained CA-POP total population grid and census block population values. Note that these errors do not reflect the block-level errors of the final CA-POP grids themselves, as those were constrained by the block-level census observations, which by definition makes block-level errors zero. Errors in the final grids instead occur at the sub-block level of population apportionment, for which there are no ground-truth population observations available for assessing CA-POP’s ability to apportion population within blocks.

(DOCX)

Acknowledgments

We thank the members of UC Berkeley’s Sustainability and Health Equities Lab for their valuable feedback, guidance and initial application of these datasets. Thanks also to Dr. Maggi Kelly for her encouragement and feedback.

Data Availability

Data are available from: https://zenodo.org/badge/latestdoi/434382697, DOI: 10.5281/zenodo.5874927, github.com/njdepsky/CA-POP.

Funding Statement

This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Clark L.P., Millet D.B., Marshall J.D., 2014. National Patterns in Environmental Injustice and Inequality: Outdoor NO2 Air Pollution in the United States. PLoS ONE 9, e94431. doi: 10.1371/journal.pone.0094431
  • 2. Cushing L., Faust J., August L.M., Cendak R., Wieland W., Alexeeff G., 2015. Racial/Ethnic Disparities in Cumulative Environmental Health Impacts in California: Evidence From a Statewide Environmental Justice Screening Tool (CalEnviroScreen 1.1). Am. J. Public Health 105, 2341–2348. doi: 10.2105/AJPH.2015.302643
  • 3. Grineski S.E., Collins T.W., Morales D.X., 2017. Asian Americans and disproportionate exposure to carcinogenic hazardous air pollutants: A national study. Soc. Sci. Med. 185, 71–80. doi: 10.1016/j.socscimed.2017.05.042
  • 4. Morello-Frosch R., Jesdale B.M., 2006. Separate and Unequal: Residential Segregation and Estimated Cancer Risks Associated with Ambient Air Toxics in U.S. Metropolitan Areas. Environ. Health Perspect. 114, 386–393. doi: 10.1289/ehp.8500
  • 5. Cushing L., Blaustein-Rejto D., Wander M., Pastor M., Sadd J., Zhu A., et al., 2018. Carbon trading, co-pollutants, and environmental equity: Evidence from California’s cap-and-trade program (2011–2015). PLOS Med. 15, e1002604. doi: 10.1371/journal.pmed.1002604
  • 6. Morello-Frosch R., Pastor M., Sadd J., 2001. Environmental Justice and Southern California’s “Riskscape”: The Distribution of Air Toxics Exposures and Health Risks among Diverse Communities. Urban Aff. Rev. 36, 551–578. doi: 10.1177/10780870122184993
  • 7. Mohai P., Saha R., 2006. Reassessing racial and socioeconomic disparities in environmental justice research. Demography 43, 383–399. doi: 10.1353/dem.2006.0017
  • 8. Sadd J.L., Pastor M., Morello-Frosch R., Scoggins J., Jesdale B., 2011. Playing It Safe: Assessing Cumulative Impact and Social Vulnerability through an Environmental Justice Screening Method in the South Coast Air Basin, California. Int. J. Environ. Res. Public. Health 8, 1441–1459. doi: 10.3390/ijerph8051441
  • 9. McKenzie L.M., Allshouse W.B., Burke T., Blair B.D., Adgate J.L., 2016. Population Size, Growth, and Environmental Justice Near Oil and Gas Wells in Colorado. Environ. Sci. Technol. 50, 11471–11480. doi: 10.1021/acs.est.6b04391
  • 10. Pastor M., Sadd J., Hipp J., 2001. Which Came First? Toxic Facilities, Minority Move-In, and Environmental Justice. J. Urban Aff. 23, 1–21. doi: 10.1111/0735-2166.00072
  • 11. Leyk S., Gaughan A.E., Adamo S.B., De Sherbinin A., Balk D., Freire S., et al., 2019. The spatial allocation of population: a review of large-scale gridded population data products and their fitness for use. Earth Syst. Sci. Data 11, 1385–1409. doi: 10.5194/essd-11-1385-2019
  • 12. Petrov A., 2012. One hundred years of dasymetric mapping: Back to the origin. Cartogr. J. 49, 256–264. doi: 10.1179/1743277412Y.0000000001
  • 13. Wright J.K., 1936. A method of mapping densities of population: With Cape Cod as an example. Geogr. Rev. 26, 103–110.
  • 14. Eicher C.L., Brewer C.A., 2001. Dasymetric Mapping and Areal Interpolation: Implementation and Evaluation. Cartogr. Geogr. Inf. Sci. 28, 125–138. doi: 10.1559/152304001782173727
  • 15. Mennis J., Hultgren T., 2006. Intelligent dasymetric mapping and its application to areal interpolation. Cartogr. Geogr. Inf. Sci. 33, 179–194. doi: 10.1559/152304006779077309
  • 16. Sleeter R., 2004. Dasymetric mapping techniques for the San Francisco Bay Region, California. Urban Reg. Inf. Syst. Assoc. Annu. Conf. Proc. 1–12.
  • 17. Tapp A.F., 2010. Areal interpolation and dasymetric mapping methods using local ancillary data sources. Cartogr. Geogr. Inf. Sci. 37, 215–228. doi: 10.1559/152304010792194976
  • 18. Su M.D., Lin M.C., Hsieh H.I., Tsai B.W., Lin C.H., 2010. Multi-layer multi-class dasymetric mapping to estimate population distribution. Sci. Total Environ. 408, 4807–4816. doi: 10.1016/j.scitotenv.2010.06.032
  • 19. Zandbergen P.A., Ignizio D.A., 2010. Comparison of dasymetric mapping techniques for small-area population estimates. Cartogr. Geogr. Inf. Sci. 37, 199–214. doi: 10.1559/152304010792194985
  • 20. Dmowska A., Stepinski T.F., 2017. A high resolution population grid for the conterminous United States: The 2010 edition. Comput. Environ. Urban Syst. 61, 13–23. doi: 10.1016/j.compenvurbsys.2016.08.006
  • 21. Mesgar M.A.A., Jalilvand P., 2017. Vulnerability Analysis of the Urban Environments to Different Seismic Scenarios: Residential Buildings and Associated Population Distribution Modelling through Integrating Dasymetric Mapping Method and GIS. Procedia Eng. 198, 454–466. doi: 10.1016/j.proeng.2017.07.100
  • 22. Mitsova D., Esnard A.M., Li Y., 2012. Using enhanced dasymetric mapping techniques to improve the spatial accuracy of sea level rise vulnerability assessments. J. Coast. Conserv. 16, 355–372. doi: 10.1007/s11852-012-0206-3
  • 23. Zandbergen P.A., 2011. Dasymetric Mapping Using High Resolution Address Point Datasets. Trans. GIS 15, 5–27. doi: 10.1111/j.1467-9671.2011.01270.x
  • 24. Wan H., Yoon J., Srikrishnan V., Daniel B., Judi D., 2022. Population downscaling using high-resolution, temporally-rich U.S. property data. Cartogr. Geogr. Inf. Sci. 49, 18–31. doi: 10.1080/15230406.2021.1991479
  • 25. Huang X., Wang C., Li Z., Ning H., 2021. A 100 m population grid in the CONUS by disaggregating census data with open-source Microsoft building footprints. Big Earth Data 5, 112–133. doi: 10.1080/20964471.2020.1776200
  • 26. Liu L., Peng Z., Wu H., Jiao H., Yu Y., 2018. Exploring urban spatial feature with dasymetric mapping based on mobile phone data and LUR-2SFCAe method. Sustain. Switz. 10, 1–15. doi: 10.3390/su10072432
  • 27. Bhaduri B., Bright E., Coleman P., Urban M.L., 2007. LandScan USA: a high-resolution geospatial and temporal modeling approach for population distribution and dynamics. GeoJournal 69, 103–117. doi: 10.1007/s10708-007-9105-9
  • 28. Lloyd C.T., Chamberlain H., Kerr D., Yetman G., Pistolesi L., Stevens F.R., et al., 2019. Global spatio-temporally harmonised datasets for producing high-resolution gridded population distribution datasets. Big Earth Data 3, 108–139. doi: 10.1080/20964471.2019.1625151
  • 29. Lloyd C.T., Sorichetta A., Tatem A.J., 2017. High resolution global gridded data for use in population studies. Sci. Data 4, 170001. doi: 10.1038/sdata.2017.1
  • 30. Qiu G., Bao Y., Yang X., Wang C., Ye T., Stein A., et al., 2020. Local population mapping using a random forest model based on remote and social sensing data: A case study in Zhengzhou, China. Remote Sens. 12. doi: 10.3390/rs12101618
  • 31. Stevens F.R., Gaughan A.E., Linard C., Tatem A.J., 2015. Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLOS ONE 10, e0107042. doi: 10.1371/journal.pone.0107042
  • 32. Tiecke T.G., Liu X., Zhang A., Gros A., Li N., Yetman G., et al., 2017. Mapping the World Population One Building at a Time. Mapp. World Popul. One Build. Time. doi: 10.1596/33700
  • 33. Graetz N., Ummel K., Cohen D.A., 2020. Small-Area Analyses Using Public American Community Survey Data: A Tree-Based Spatial Microsimulation Technique, SSRN Electronic Journal. doi: 10.2139/ssrn.3574679
  • 34. Palacios‐Lopez D., Bachofer F., Esch T., Marconcini M., Macmanus K., Sorichetta A., et al., 2021. High‐resolution gridded population datasets: Exploring the capabilities of the world settlement footprint 2019 imperviousness layer for the african continent. Remote Sens. 13, 1–26. doi: 10.3390/rs13061142
  • 35. Reed F.J., Gaughan A.E., Stevens F.R., Yetman G., Sorichetta A., Tatem A.J., 2018. Gridded population maps informed by different built settlement products. Data 3. doi: 10.3390/data3030033
  • 36. Stevens F.R., Gaughan A.E., Nieves J.J., King A., Sorichetta A., Linard C., et al., 2020. Comparisons of two global built area land cover datasets in methods to disaggregate human population in eleven countries from the global South. Int. J. Digit. Earth 13, 78–100. doi: 10.1080/17538947.2019.1633424
  • 37. Thomson D.R., Rhoda D.A., Tatem A.J., Castro M.C., 2020. Gridded population survey sampling: A systematic scoping review of the field and strategic research agenda. Int. J. Health Geogr. 19, 1–16. doi: 10.1186/s12942-020-00230-4
  • 38.Casey J.A., Cushing L., Depsky N., Morello-Frosch R., 2021. Climate Justice and California’s Methane Superemitters: Environmental Equity Assessment of Community Proximity and Exposure Intensity. Environ. Sci. Technol. 55, 14746–14757. doi: 10.1021/acs.est.1c04328 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Pace C., Balazs C., Bangia K., Depsky N., Renteria A., Morello-Frosch R., et al., 2022. Inequities in Drinking Water Quality Among Domestic Well Communities and Community Water Systems, California, 2011‒2019. Am. J. Public Health 112, 88–97. doi: 10.2105/AJPH.2021.306561 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Harris C., 2021. The 2020 Census. [Google Scholar]
  • 41.Manson S.M., 2020. IPUMS national historical geographic information system: version 15.0. [Google Scholar]
  • 42.Bailey Z.D., Feldman J.M., Bassett M.T., 2021. How Structural Racism Works—Racist Policies as a Root Cause of U.S. Racial Health Inequities. N. Engl. J. Med. 384, 768–773. doi: 10.1056/NEJMms2025396 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Braveman P., Parker Dominguez T., 2021. Abandon “Race.” Focus on Racism. Front. Public Health 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Morello-Frosch R., 2002. The political economy of environmental discrimination. Env. Plann C Gov Policy 20, 477–496. [Google Scholar]
  • 45.Doxsey-Whitfield E., MacManus K., Adamo S.B., Pistolesi L., Squires J., Borkovska O., et al., 2015. Taking Advantage of the Improved Availability of Census Data: A First Look at the Gridded Population of the World, Version 4. Pap. Appl. Geogr. 1, 226–234. 10.1080/23754931.2015.1014272 [DOI] [Google Scholar]
  • 46.Rose A.N., McKee J.J., Sims K.M., Bright E.A., Reith A.E., Urban M.L., 2020. LandScan 2019. [Google Scholar]
  • 47.Gervasoni L., Fenet S., Perrier R., Sturm P., 2019. Convolutional neural networks for disaggregated population mapping using open data. Proc. - 2018 IEEE 5th Int. Conf. Data Sci. Adv. Anal. DSAA 2018 594–603. 10.1109/DSAA.2018.00076 [DOI] [Google Scholar]
  • 48.Elliott D., 2021. Simulating the 2020 Census 55. [Google Scholar]
  • 49.Jochem W.C., Tatem A.J., 2021. Tools for mapping multi-scale settlement patterns of building footprints: An introduction to the R package foot. PLoS ONE 16, 1–19. 10.1371/journal.pone.0247535 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Jia P., Qiu Y., & Gaughan A. E. (2014). A fine-scale spatial population distribution on the High-resolution Gridded Population Surface and application in Alachua County, Florida. Applied Geography, 50, 99–107. 10.1016/J.APGEOG.2014.02.009 [DOI] [Google Scholar]

Decision Letter 0

Krishna Prasad Vadrevu

15 Mar 2022

PONE-D-22-01737

High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California

PLOS ONE

Dear Dr. Depsky,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Specifically, provide additional clarifications on comparison of CA-POP with other gridded products (qualitative versus quantitative comparisons), accuracy assessment, clarity on the presentation, spatial and temporal characteristics of the data, and finally limitations. Please submit your revised manuscript by Apr 29 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Krishna Prasad Vadrevu, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. We note that Figure 1, 2, 4 and 6 in your submission contain map images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

 a. You may seek permission from the original copyright holder of Figure 1, 2, 4 and 6 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

 b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In the manuscript entitled “High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California” the authors present the CA-POP dataset, a series of high-resolution population grids for the state of California based on the 2020 census. The dataset contains eight demographic variables and is publicly and freely available from Github. It constitutes a valuable contribution that can be a basis for future sociodemographic research regarding California.

The dasymetric method used to obtain the CA-POP dataset was carefully and appropriately chosen. This choice as well as the method itself are explained in a very understandable way, and the reader gets a good sense of what has been done in the workflow, what information is contained in the layers, and what potential and limitations are associated with them. The figures are very helpful to understand the method and the relation of the source and ancillary data, as well as to grasp the difference between CA-POP and a similar product, the WorldPop grids.

I do like the manuscript in its present form, but I also have some suggestions for improvement:

- Have the authors considered extending this dataset to the entire US (like the other gridded products mentioned)? It seems that, with the method now in place, this would be a reasonable extension to make and would certainly yield a hugely valuable dataset. If there is an interesting answer to this question, it may be worth adding it to the discussion.

- I think it’s unfortunate that you didn’t include a different method (possibly a machine-learning method) in the accuracy assessment comparison. It’s not very surprising that the dasymetric method used by CA-POP does better than a null model; after all, a lot of informative ancillary data goes into it. So for a fair comparison, and to make the point that CA-POP’s rather parsimonious method is appropriate here, it would have been more convincing to include a different algorithm that works with the same data/information.

- From the comparison of CA-POP with other gridded products I understand that the CONUS grids by Huang et al. 2021 are the most similar to CA-POP. The main benefit of CA-POP that you name in this direct comparison is that it uses the residential parcels to offset some weaknesses of the Microsoft building footprints. However, a more significant advantage that I would see here (taking from Table 3) is that CA-POP offers all this detail on the population composition, whereas Huang et al. 2021 give only population counts?

- At the beginning of the Results section (ll. 351-354), could you make it a little clearer how the raster grids described here differ from the CA-POP grids?

Line comments:

- l. 157: please define the term ‘data vintage’

- l. 253: I think there should be no dash between “1-acre” (as opposed to l. 255 where it’s surely correct)

- l. 269: please identify what the CONUS region is

- l. 356: “these methods” has no clear referent in a preceding sentence; please name them directly. (Actually, the whole sentence sounds like it was moved here from a different context.)

- l. 365, 367: unnecessary repetition of “these dasymetric techniques”

- Figure 6 is included twice

Reviewer #2: The authors present a set of high-resolution gridded population products using values from the 2020 U.S. Census for the entire state of California (CA-POP). This is a very thorough and interesting analysis that makes a nice contribution to the growing literature on high-resolution gridded estimates of population. In particular, some statistical values should be added to the Abstract to illustrate the accuracy and good concordance of CA-POP. In the methods (dasymetric methods, accuracy assessment), the paper is often unclear and hard to assess. Too many details are provided for methods that ought to be summarized more succinctly. The paper needs to clarify the methods and improve the flow overall.

Reviewer #3: The manuscript entitled “High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California” produces population grids by apportioning census block populations to 100-m grids based on California tax parcel data and Microsoft building footprints. The manuscript is well written, but some issues need to be addressed before publication.

1. The authors compare CA-POP accuracy with areal weighting of block group population to demonstrate its superior performance. However, simple areal weighting is known to perform poorly in population downscaling compared with other methods. I’d suggest including more dasymetric mapping methods to strengthen the comparison (e.g., dasymetric mapping based on commonly used ancillary datasets such as imperviousness, roads, etc.). Adding a map showing the percentage overestimation/underestimation in each block for the proposed method, along with related discussion, is recommended.

2. The authors compare the CA-POP grids with other gridded products (e.g., WorldPop, LandScan, GPW), but only qualitatively (a table comparing their spatial resolutions, population apportionment methods, and gridded variables), due to the intrinsic limitations of those products. As the accuracy assessment does not reflect the true accuracy of the CA-POP grids, a quantitative comparison with other gridded products becomes more urgent. SocScape (http://www.socscape.edu.pl/) provides a 30-m resolution population grid (as well as racial diversity grids) for the year 2020 across the United States. Aggregating this dataset to 100 m and then comparing it directly to the CA-POP dataset could greatly enhance the comparison analysis. If this direct comparison is included, then there’s no need to include the supplemental Table A2, which shows a biased accuracy assessment comparison. (Accuracy assessments of these gridded products at the block level are not valid, since they use blocks as the source zones and, as the authors have discussed, the spatial resolution would heavily impact the accuracy at block level.) Also, if this quantitative comparison is included, the qualitative comparison (Table 3) is less important and should be moved to the Introduction, where these products can be introduced in detail.

3. Dasymetric mapping of population often relies on nationwide ancillary datasets with high temporal resolution, making the product available over a greater spatial and temporal extent. Because this study relies heavily on the California tax parcel dataset from 2017/2018, its spatial and temporal applicability is relatively limited. This is a major limitation and should be discussed in the manuscript.

4. The authors describe the process of removing large residential parcels using thresholds; the remaining small residential parcels contain relatively little open space, which should have less impact on the eventual gridded population output. Even if the impact of this open space is minimized by selecting only small residential parcels, it should still be considered a limitation of the proposed method, and I think it is worth mentioning, along with potential solutions, in the “limitations and potential improvements” section. Another limitation is that the thresholds were selected by manual inspection, which is an impediment if this method is applied elsewhere.

5. “Residential parcels tended to be fairly homogenous within census blocks (i.e. a single block rarely contained both highrise apartments and single family homes or rural residences), reducing the need for a multi-class weighting scheme”. Reporting the percentage of blocks in the study area where this rare case occurs would be more persuasive to readers.

6. The conclusion is weak and should be strengthened.
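
To illustrate the distinction raised in point 1, the following is a minimal Python sketch, not the authors' implementation, contrasting simple areal weighting with a dasymetric split weighted by building footprint area. The function name `apportion`, the four-cell block group, and all numbers are hypothetical and chosen only for illustration.

```python
def apportion(total_pop, weights):
    """Split a source-zone population across target cells in proportion to weights."""
    weight_sum = sum(weights)
    if weight_sum == 0:
        # Assumption: with no ancillary signal at all, fall back to an even split.
        return [total_pop / len(weights)] * len(weights)
    return [total_pop * w / weight_sum for w in weights]


# Hypothetical block group with 120 residents covering four equal-area grid cells.
block_group_pop = 120
cell_area = [1.0, 1.0, 1.0, 1.0]            # areal weighting: weight = cell area
footprint_area = [0.0, 300.0, 900.0, 0.0]   # illustrative residential footprint per cell (m^2)

areal = apportion(block_group_pop, cell_area)
dasymetric = apportion(block_group_pop, footprint_area)

print("areal weighting:     ", areal)
print("footprint weighting: ", dasymetric)
```

With these illustrative inputs, areal weighting assigns 30 residents to every cell, while the footprint-weighted split assigns 0, 30, 90 and 0, placing no population in cells without residential buildings. Both splits preserve the block group total of 120.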

There are some minor issues:

1. Page 3, line 68 – 70. The list of additional ancillary datasets is not exhaustive, and I recommend adding a greater variety of ancillary datasets:

Property data (Wan, H., Yoon, J., Srikrishnan, V., Daniel, B., Judi, D., 2021. Population downscaling using high-resolution, temporally-rich US property data. Cartography and Geographic Information Science 1–14);

Building footprint data (Huang, X., Wang, C., Li, Z., Ning, H., 2021. A 100 m population grid in the CONUS by disaggregating census data with open-source Microsoft building footprints. Big Earth Data 5, 112–133).

2. Page 12, line 286. Can the authors explain the sliver-removal algorithm in more detail?

3. Page 15, line 338. The statement for evaluating errors should be clearer. The downscaled grid populations are first aggregated to the finest spatial unit (blocks), and then compared with the ground-truth observations at that spatial unit level.

4. Page 15 – 16, line 356 – 365. These sentences relate more to the accuracy assessment section than to the results section.

5. Page 17, line 387. “The later-year grids are estimated via different growth forecasting assumptions to extrapolate 2010 values”. Are all those datasets extrapolating population for every year from 2010 to 2020? The author should be clearer about this statement.
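
The evaluation order described in minor issue 3 (aggregate the downscaled grid to blocks, then compare with block-level observations) can be sketched as follows. This is a hedged illustration only: the function names, the cell-to-block assignment, and the population values are hypothetical, and skipping blocks with zero observed population is an assumption made here to keep the relative error defined, not something stated in the manuscript.

```python
from collections import defaultdict
from statistics import median


def aggregate_to_blocks(cell_pop, cell_block):
    """Sum gridded population estimates over the block each cell belongs to."""
    totals = defaultdict(float)
    for pop, block in zip(cell_pop, cell_block):
        totals[block] += pop
    return dict(totals)


def median_abs_relative_error(estimated, observed):
    """Median of |estimate - observation| / observation across blocks.

    Assumption: blocks with zero observed population are skipped.
    """
    errors = [
        abs(estimated.get(block, 0.0) - obs) / obs
        for block, obs in observed.items()
        if obs > 0
    ]
    return median(errors)


# Hypothetical downscaled cells and the block that contains each one.
cell_pop = [10.0, 20.0, 5.0, 15.0, 50.0]
cell_block = ["A", "A", "B", "B", "C"]
observed = {"A": 40, "B": 20, "C": 40}   # illustrative block-level census counts

est = aggregate_to_blocks(cell_pop, cell_block)
mare = median_abs_relative_error(est, observed)
print(est, mare)
```

Here the per-block relative errors are 0.25, 0 and 0.25, so the median absolute relative error is 0.25.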

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jul 14;17(7):e0270746. doi: 10.1371/journal.pone.0270746.r002

Author response to Decision Letter 0


13 May 2022

Author responses to requested revisions to

Depsky, Cushing, Morello-Frosch 2022: High-resolution gridded estimates of population sociodemographics from the 2020 census in California

[EDITOR] Journal Requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

We have formatted our manuscript in accordance with the above requirements.

2. Thank you for stating the following financial disclosure:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

We include an amended statement in the cover letter as follows:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This study was funded by the California Air Resources Board (# 18RD018- RM-F and NJD), the Strategic Growth Council (CCRP0022 - RM-F, NJD and LC) and U.S. Environmental Protection Agency (#84003901 LC, RM-F and ND)”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

We have removed all funding-related text from the manuscript and include an amended statement in the cover letter (see response to comment #2).

4. We note that Figure 1, 2, 4 and 6 in your submission contain map images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 1, 2, 4 and 6 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

We have now altered the map backgrounds in all map-based figures (Figs 1, 2, 4, 5, 6 and supplement figures S3_Fig and S4_Fig) to contain imagery from the USGS National Map Viewer and the ocean layer shown in Figures 4 and 6 is from Natural Earth.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Each supplemental table or figure is now provided as a standalone file to be hyperlinked with corresponding captions at the conclusion of the manuscript immediately preceding the Data Availability section.

Reviewer #1

In the manuscript entitled “High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California” the authors present the CA-POP dataset, a series of high-resolution population grids for the state of California based on the 2020 census. The dataset contains eight demographic variables and is publicly and freely available from Github. It constitutes a valuable contribution that can be a basis for future sociodemographic research regarding California.

The dasymetric method used to obtain the CA-POP dataset was carefully and appropriately chosen. This choice as well as the method itself are explained in a very understandable way, and the reader gets a good sense of what has been done in the workflow, what information is contained in the layers, and what potential and limitations are associated with them. The figures are very helpful to understand the method and the relation of the source and ancillary data, as well as to grasp the difference between CA-POP and a similar product, the WorldPop grids.

I do like the manuscript in its present form, but I also have some suggestions for improvement:

Did or do the authors consider to extend this dataset to the entire US (like the other mentioned gridded products)? It seems that, with the method now in place, this would be a reasonable extension to make and would certainly yield a hugely valuable dataset. If there is an interesting answer to this question, it may be worth to add it to the discussion.

Thank you for flagging this. We would have liked to have extended this analysis to cover the entire U.S. but were limited by the lack of contemporary, publicly-available tax parcel boundaries for each state. Given the patchwork manner in which parcels are assessed by state and local governments, there are no freely-available national tax boundary datasets with harmonized land use classes. Some data retailers offer parcel datasets with partial coverage of land use codes nationally (such as LoveLand Inc.), but they are offered as proprietary data products with licensing fees in the tens of thousands of dollars, which was out of the funding scope of this project and would prohibit the replicability and maintenance of the CA-POP grids with future census population values. We added an explanation of this at the end of the Discussion section (lines 591-600) as follows:

“Finally, the methods we utilize here are inherently limited in geographic scope given that only California is represented. The feasibility of constructing a U.S.-wide product using these dasymetric methods, however, is limited by the absence of national, high-quality, publicly-available tax parcel data. Tax parcel data are instead disparately gathered, maintained and provided by different state and local agencies, with a freely-available nationwide product with harmonized land use classifications not currently available for general use (Jia et al. 2014, Dmowska and Stepinski 2017). A number of proprietary options are maintained by various data retailers, though licensing fees often make access cost-prohibitive. Tax parcel boundaries are valuable ancillary datasets in many societally-beneficial demographic research contexts and we believe a publicly-funded effort to generate a well-maintained and open-access, national tax parcel dataset should be initiated to help facilitate this work.”

I think it’s unfortunate that you didn’t include a different method (possibly a machine-learning method) in the accuracy assessment comparison. It’s not very surprising that the dasymetric method used by CA-POP does better than a null model, after all a lot of informative ancillary data goes into it. So for a fair comparison and for making the point that CA-POP’s rather parsimonious method is appropriate here, it would have been more convincing to include a different (ML) algorithm that works with the same data/information.

We agree that comparing the accuracy of the CA-POP products to other population gridding products, such as those based on ML algorithms (e.g. WorldPop), would have been a compelling analysis. However, given the nature of these alternate population grids, such an analysis is not feasible. One reason is that to date most of the alternative gridded population products have not yet released grids based on the 2020 census values. Another primary reason such an analysis is not possible is the fact that most of the alternative major population grid examples (GPW, LandScan, WorldPop) all utilize the finest-scale ground-truthed population values available (census blocks) as their ‘source zones’ of population, just like CA-POP does. This is because when constructing dasymetric population maps, the most accurate end products will be those that utilize the best-available population observations, which in the U.S. are at the census block level from decennial censuses. Therefore, all of these products vary only in how populations are apportioned to subregions within each block, using sub-block ancillary data layers (e.g. building footprints or tax parcels) or via a ML-based population prediction algorithm, also trained using various ancillary data layers (e.g. nightlights, road networks etc.). When the alternate gridded products release their 2020-based population grids, there should essentially be no inaccuracies relative to census values at the block-level, as is the case with CA-POP.

This is a common challenge when assessing the accuracy of dasymetric mapping techniques, and is described in Huang et al. 2021 and Stevens et al. 2015, because the desire to create dasymetric population maps is born of a lack of ground-truthed population observations at the spatial resolution desired. However, that very lack of ground-truthed population data is precisely what prevents the ability to truly assess the accuracy of the method in a comprehensive, quantitative manner across the study space. Were there available ground-truthed data at the sub-block level, then such an assessment would be possible, but would also negate the need for creating a dasymetric map of downscaled population in the first place. Therefore, dasymetric mappers always face the dilemma of either forgoing the use of the highest-resolution ground-truthed population data available in favor of being able to conduct an accuracy assessment (e.g. using block-groups as population source zones of the final grid and then assessing its accuracy at the block-level), or using the highest-resolution ground-truthed data to construct a more accurate grid but therein forfeiting the ability to rigorously assess the accuracy given the lack of finer-scale observations.

Two scenarios would make a more convincing side-by-side accuracy comparison between CA-POP and other products like WorldPop or LandScan possible: i) for each product (including CA-POP) to create grids using block-groups, rather than blocks, as their population source zones and to make these publicly-available, allowing for an estimation of accuracy against ground-truthed block-level values; ii) the presence of some sub-block, ground-truthed population data across some part of California, such that the ability of CA-POP and other block-constrained products to disaggregate population within blocks could be benchmarked. However, to our knowledge neither of these requisite data products exist, making such an assessment infeasible. We do partially carry out the first option above, producing CA-POP grids using block-group source zones rather than blocks, but are only able to compare this to the null model (uniform, areal weighting) as this is all that is available for such a comparison.

The above explanation of these limitations is now summarized in the ‘Accuracy Assessment’ subsection of the Data and Methods section (lines 451-469), which also contains an additional concluding sentence detailing the barriers to conducting side-by-side comparisons with other gridded products:

“We assessed the relative accuracy of our methods compared to simple uniform, areal population weighting. The errors in each block-level estimate between the modeled (block group constrained grid) and observed (census block level data) are reported in terms of root mean-squared errors (RMSE) and the squared Pearson correlation coefficient (R²) across all populated blocks. Median block-wise percent errors were also calculated for both raw and absolute percent error magnitudes, following a similar accuracy analysis approach employed by Dmowska and Stepinski (2017), in which they term the median absolute percent error a measure of ‘relative error’. Given the skewed nature of the percent error distribution across population blocks, the median was deemed a more informative metric as opposed to the mean or standard deviation (Dmowska and Stepinski 2017).

The same procedure was carried out for a simple uniform, areal weighting technique, which estimates block level population values assuming a homogenous population distribution across the entire spatial area of each block group. Unfortunately, comparison of CA-POP grids produced using this ‘second-best’ ground-truth source zone population data (block-groups) to other products (e.g. WorldPop, LandScan) is not possible due to the fact that those data providers do not provide versions of their grids that utilize anything besides the best-available source zone data as their ground-truth population constraints (blocks). However, SocScape’s methods documentation reports error estimates for their 2010 grids using the same type of accuracy assessment we carried out with CA-POP, constructing a block group-based grid and then comparing its ability to match observed population totals within blocks. This allows for a roughly analogous error comparison between CA-POP and SocScape (Dmowska and Stepinski 2017).”
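The metrics named in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `areal_weighting` and `accuracy_metrics` are hypothetical, percent error is assumed to be signed as modeled minus observed (so negative values indicate underestimation), and the toy numbers are invented purely to exercise the functions.

```python
import math
from statistics import median


def areal_weighting(block_group_pop, block_areas):
    """Null model: split a block group's population among its blocks by area share."""
    total_area = sum(block_areas)
    return [block_group_pop * area / total_area for area in block_areas]


def accuracy_metrics(observed, modeled):
    """Block-wise error metrics, computed over populated blocks only (observed > 0)."""
    pairs = [(o, m) for o, m in zip(observed, modeled) if o > 0]
    n = len(pairs)
    rmse = math.sqrt(sum((m - o) ** 2 for o, m in pairs) / n)
    # Squared Pearson correlation coefficient between observed and modeled values.
    mean_o = sum(o for o, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in pairs)
    var_o = sum((o - mean_o) ** 2 for o, _ in pairs)
    var_m = sum((m - mean_m) ** 2 for _, m in pairs)
    r2 = cov ** 2 / (var_o * var_m)
    # Signed percent errors (assumed convention: modeled minus observed).
    pct = [100.0 * (m - o) / o for o, m in pairs]
    return {
        "rmse": rmse,
        "r2": r2,
        "median_pct_error": median(pct),
        "median_abs_pct_error": median(abs(p) for p in pct),  # 'relative error'
    }


# Toy block group of 200 people split across three blocks (invented values).
observed = [100, 50, 50]
modeled = areal_weighting(200, [1.0, 1.0, 2.0])
metrics = accuracy_metrics(observed, modeled)
```

In this toy case the raw median percent error is zero while the median absolute percent error is 50%, which illustrates why the two are reported separately: over- and underestimates can cancel in the raw figure.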

From the comparison of CA-POP with other gridded products I understand that the CONUS grids by Huang et al. 2021 are the most similar to CA-POP. The main benefit of CA-POP that you name in this direct comparison is that it uses the residential parcels to offset some weakness of the Microsoft building footprints. However, a more significant advantage that I would see here (taking from Table 3) is that CA-POP offers all this detail on the population composition, whereas Huang et al. 2021 only give population count?

We agree, thank you for this and added a sentence highlighting this on lines 395-397:

“Also, compared to the CONUS grid produced by Huang et al. (2021), which only provides population counts, CA-POP offers grids for multiple sociodemographic variables in addition to population.”

At the beginning of the Results section (ll. 351-354), could you make it a little clearer how the raster grids described here differ from the CA-POP grids?

These actually are the CA-POP raster grids, but our apologies that this was unclear. We added a “, comprising the CA-POP dataset” clause to the first sentence of the Results section to add clarity (lines 472-473), which now reads:

“Statewide, 100-meter resolution raster grids were produced for each of the eight 2020 U.S. Census demographic variables listed in Table 1 for the entire state of California, comprising the CA-POP dataset.”

Line comments:

157: please define the term ‘data vintage’

253: I think there should be no dash between “1-acre” (as opposed to l. 255 where it’s surely correct)

269: please identify what the CONUS region is

356: “these methods” don’t really have a reference in a preceding sentence; please directly name them (Actually, the whole sentence sounds like it was moved here from a different context.)

365, 367: unnecessary repetition of “these dasymetric techniques”

Figure 6 is included twice

These were all implemented in the locations specified, thank you for the suggestions.

Reviewer #2

The authors present a set of high-resolution gridded population products using values from the 2020 U.S. Census for the entire state of California (CA-POP). This is a very thorough and interesting analysis that makes a nice contribution to the growing literature on high resolution gridded estimates of population. In particular, some statistical value should be added in Abstract to illuminate the accuracy and good concordance of CA-POP.

We thank the reviewer for the suggestion. We have expanded the abstract to contain a brief discussion of our accuracy assessment, and include the 30% median relative error value as discussed in the text (see lines 26-34):

“A general accuracy assessment of the CA-POP dasymetric mapping methodology was conducted by producing a population grid that was constrained by block group census observations instead of blocks, enabling a comparison of this grid’s apportionment of population within census blocks to block-level census values. This accuracy assessment yielded a block-wise median absolute relative error of approximately 30% for block group-to-block disaggregation. However, the final CA-POP grids are constrained by higher-resolution census block-level observations, likely making them even more accurate than these block group-constrained grids over a given region, but for which error assessments of population disaggregation is not possible due to the absence of observational data at the sub-block scale.”

While the methods (dasymetric methods, accuracy assessment), the paper is often unclear and hard to assess. Too many details are provided for methods that ought to be summarized more succinctly. The paper needs to clarify the methods and improve the flow overall.

In response to specific requests for clarifications from other reviewers, we have addressed this in our revised manuscript, particularly in the methods and accuracy assessment sections. Specifically, a number of additional clarifying statements have been added and the accuracy assessment portions of the Discussion and Results sections have been condensed for clarity and to reduce redundancies. We also moved some of the ‘Comparison to Other Gridded Products’ subsection text from the Results section to the Data and Methods section, as it more appropriately fits there and we believe should help with the manuscript’s flow and interpretability.

Reviewer #3

The manuscript entitled “High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California” produced population grids by apportioning census block population to 100-m grids based on California tax parcel data and Microsoft building footprint. This manuscript is well-written, but some issues need to be addressed before publication.

The author tries to compare CA-POP accuracy with the areal weighting of block group population to demonstrate its superior performance. However, simple areal weighting is known for its bad performance in population downscaling when compared with other methods. I’d suggest including more dasymetric mapping methods to enhance the comparison (e.g., dasymetric mapping based on commonly-used ancillary dataset such as imperviousness, road, etc.). Adding a spatial map showing the overestimation/underestimation percentages of each block for the proposed method and related discussions is recommended.

Thank you for this feedback and we agree that simple areal weighting is not the most compelling comparison due to it being the simplest option. Comparing CA-POP to other, more complex, methods would have been desired if such comparisons were feasible. We were not entirely clear if your request for “including more dasymetric mapping methods to enhance the comparison” is asking for us to either i) create various additional statewide dasymetric maps of population using different combinations of ancillary data ourselves, or to ii) compare CA-POP to other such dasymetric mapping products that have already been developed that rely on different ancillary data (e.g. WorldPop, LandScan, SocScape etc.).

If this request is asking for the former, we contend that this would be out of the scope for this paper, as constructing multiple additional statewide grids, each from different sets of ancillary data, would multiply the scope of work for this analysis N-times by however many N-additional grids we create. Additionally, the methods behind the creation of each of these grids, intentionally constructed with lower resolution/precision ancillary data to those layers used in CA-POP, would also have to be fully documented in detail and would greatly expand the scope and length of this manuscript.

If, however, the request is referring to the latter option, such that you would like us to compare CA-POP to other, previously-constructed gridded population products that utilize alternate dasymetric methods and ancillary datasets, this is a similar comment to the second comment from Reviewer 1 and we would refer you to the response we provided above.

Regarding the request for a ‘spatial map showing the overestimation/underestimation percentages of each block for the proposed method’, this would by definition yield a homogenous map of 0% errors at every block with respect to the final CA-POP grids, due to the fact that they are constrained by block-level population census observations. The errors in the final grids only occur at the sub-block apportionment of population to the smaller ancillary data footprints, for which we do not have ground-truthed observations against which to assess accuracy. The only feasible alternative was to produce a map of block-level errors with respect to the block group-constrained population grid that we used to assess the general (upper-bound) errors of the CA-POP method in apportioning from block-groups to blocks. We have therefore produced this map and provided a supplemental figure (S5_Fig) which portrays these block-level error percentages of the block group-constrained CA-POP for three locales.

We did not include this figure in the main manuscript text because we feel it is potentially misleading to readers: at first glance it could be interpreted as showing errors of the final CA-POP grid values associated with each block. Instead, the error map represents block-level errors when applying our dasymetric method to coarser spatial units (block groups) of ground-truthed population observations, which is not in fact how the final grids were produced. Nevertheless, given that this remains the best and only real way to gauge some relative accuracy of our methods (see our response to R1’s second comment), we agree that it provides utility and therefore include it in our supporting information and reference it in the manuscript text in lines 445-447, as follows:

“Generally, the highest percent errors at the block-level in this analysis occurred in blocks with low population totals, both due to the lack of residential parcel data within these and the small populations used as the denominator in these percentage error calculations (Figure in S5 Figure).”

The author tries to compare the CA-POP grids with other gridded products (e.g., WorldPop, LandScan, GPW), but only in a qualitative way (A table comparing their spatial resolutions, population apportionment methods, and gridded variables.) due to the intrinsic limitations of those products. As the accuracy assessment does not reflect the true accuracy of the CA-POP grids, a quantitative comparison with other gridded products becomes more urgent. SocScape (http://www.socscape.edu.pl/) provides 30-m resolution population grid (also racial diversity grids) for year 2020 across the United States. Aggregating this dataset to 100-m and then directly compare to the CA-POP dataset could greatly enhance the comparison analysis. If this direct comparison is included, then there’s no need to include the supplemental Table A2, which shows a biased accuracy assessment comparison. (Conducting accuracy assessments for these gridded products at the block level does not hold true since they are using blocks as the source zones, and as the author has discussed, the spatial resolution would heavily impact the accuracy at block level). Also, if this quantitative comparison is included, the qualitative comparison (Table 3) is less important and should be moved to the introduction part (Introducing in detail about these products there).

Thank you for pointing us towards the SocScape data products, we were not previously aware of them. We have added citations to the underlying methods paper for the SocScape data (Dmowska and Stepinski 2017) to various locations of our manuscript. We have also added two additional error statistics in our comparison of the block group-constrained CA-POP population grid to the uniform areal weighting block group-constrained grid of California. These block-wise accuracy metrics are the median percent error (raw) and median absolute percent error (called ‘relative error’ in Dmowska and Stepinski 2017). We found that this ‘relative error’ metric is roughly 30% for the block group-constrained CA-POP grid, compared to over 46% when using simple uniform, areal weighting. Dmowska and Stepinski report their CONUS-wide ‘relative errors’ based on block group-constrained grids to be around 44%, which suggests that CA-POP may have superior performance, at least when compared to the national average of SocScape (we added explanation of this in lines 483-491):

“In terms of median percent errors, the CA-POP method yields much higher accuracy compared to the simple uniform method both in raw (-4.1% compared to -25.9%) and absolute terms, or ‘relative error’, (30.1% compared to 46.4%). This median relative error is significantly lower than the 44% value of the equivalent metric reported for SocScape’s 2010 national population grid, calculated using the accuracy assessment exercise of comparing block group-constrained grid performance to observed block values (Dmowska and Stepinski 2017). Though the SocScape relative error is a national value and CA-POP’s is solely for California, it suggests that the CA-POP approach likely outperforms SocScape in this context.”

We appreciate your suggestion to compare CA-POP to the 30m SocScape grids. We downloaded the 2020 SocScape population grids and aggregated to 100m as suggested and compared the pixel-wise differences between CA-POP and SocScape across the state. This was helpful to evaluate the differences between the underlying dasymetric methods utilized in each approach. However, we feel it is important to note that the SocScape data are also products of dasymetric mapping techniques that disaggregate population from the block-level to sub-block target zones and are not ground-truthed observation data and therefore cannot be used to evaluate the accuracy of CA-POP in apportioning population within blocks. Based on the limited comparison of ‘relative error’ metrics from the block group-constrained CA-POP and (national) SocScape grids, CA-POP appears to perform better (30% error), at least relative to national performance of SocScape (44% error) reported in Dmowska and Stepinski 2017. The comparison of CA-POP to SocScape, however, was helpful in visualizing the differences between these grids and is described in Figure 6 and in lines 507-520:

“We also assessed pixel-level differences between the 2020 SocScape and CA-POP total population grids across the state to better evaluate how apportionment of population at the sub-block scale differs between the two methods. Figure 6 displays SocScape pixel values subtracted from CA-POP as a grid, and demonstrates that SocScape distributes low population counts across the majority of open space within blocks, whereas CA-POP more accurately sets these regions to zero. This is evident in the figure’s first two panels, where the light red zones spanning large areas represent regions in SocScape with small population counts across primarily open space, for which CA-POP does not apportion population. In more densely-populated urban zones where blocks are smaller and the two grids are therefore constrained to equal one another at a smaller spatial scale, much of the pixel-level differences emulate random noise, although some patterns appear to suggest that SocScape overly-apportions populations along major streets and roadways where CA-POP does not. The final panel in Figure 6 shows an example of this in South Los Angeles, where red areas (SocScape greater than CA-POP) tend to reflect the pattern of major streets in this neighborhood, a difference that is likely due to CA-POP’s use of ancillary datasets that exclude street surfaces from the population target zones.”

Dasymetric mapping of population often choose nation-wide ancillary dataset with high temporal resolution, making the product available to a greater spatial and temporal extent. While this study relies heavily on California tax parcel dataset in 2017/2018, it has a relatively limited spatial and temporal application. This is a major limitation and should be discussed in the manuscript.

Yes, thank you for flagging this. We would have liked to have extended this analysis to cover the entire U.S. but were limited by the lack of contemporary, publicly-available tax parcel boundaries for each state. Given the patchwork manner in which parcels are assessed by state and local governments, there are no freely-available national tax boundary datasets with harmonized land use classes. Some data retailers offer parcel datasets with partial coverage of land use codes nationally (such as LoveLand Inc.), but they are offered as proprietary data products with licensing fees in the tens of thousands of dollars, which was out of the funding scope of this project and would prohibit the replicability and maintenance of the CA-POP grids with future census population values. We have added an explanation of this at the end of the Discussion section (lines 591-600):

“Finally, the methods we utilize here are inherently limited in geographic scope given that only California is represented. The feasibility of constructing a U.S.-wide product using these dasymetric methods, however, is limited by the absence of national, high-quality, publicly-available tax parcel data. Tax parcel data are instead disparately gathered, maintained and provided by different state and local agencies, with a freely-available nationwide product with harmonized land use classifications not currently available for general use (Jia et al. 2014, Dmowska and Stepinski 2017). A number of proprietary options are maintained by various data retailers, though licensing fees often make access cost-prohibitive. Tax parcel boundaries are valuable ancillary datasets in many societally-beneficial demographic research contexts and we believe a publicly-funded effort to generate a well-maintained and open-access, national tax parcel dataset should be initiated to help facilitate this work.”

In terms of temporal limitations of the 2017-2018 parcel data used, we agree that ideally these data would have been more contemporary with the 2020 census data utilized, and they should be updated where possible in future updates to the CA-POP dataset. However, the 2017/2018 data were those to which we were granted access at the time of analysis.

The author has described the process of removing large residential parcels by thresholds, and the remaining small residential parcels have relatively small open space, which should have less impact on the eventual gridded population output. Even if the impact of this open space is minimized by only selecting small residential parcels, it should still be considered as a limitation for the proposed method, and I think it is still worth being mentioned and discussed for its potential solution in the “limitations and potential improvements” part. Another limitation is that the selection of the thresholds is based on manual inspection, which is considered as an impediment if this method is applied elsewhere.

We agree and have added a clarifying sentence about this limitation in lines 572-576:

“In theory, perfectly accurate building footprint data could be used for all final populated target zone boundaries, with tax parcels or other land-use ancillary data layers solely used to identify residential zones within which buildings should be selected, therein avoiding minor inaccuracies associated with regions of open space within small residential parcels currently present in our methodology.”

Regarding the thresholds we utilized for inclusion of small residential parcels as population target zones, it’s true that their selection was guided by manual inspection of the parcel data around the state. However, we believe that the threshold of 1-acre for small residential lots would be suitable for other 100m population gridding exercises in different states or locales in which tax parcels are being utilized without the need for a new iteration of manual inspection. This is due to the fact that the 1-acre threshold for small residential parcels is sufficiently small in the context of a 100m raster grid (~40% of a single pixel’s area) and would still be large enough to encompass single family homes on small to moderate lots in much of the country.
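The "~40% of a single pixel’s area" figure can be checked with a short calculation. The sketch below assumes the international acre (4,046.8564224 m²) and a square 100 m × 100 m pixel; the variable names are illustrative only.

```python
ACRE_M2 = 4046.8564224   # one international acre in square metres
PIXEL_SIDE_M = 100.0     # side length of one grid cell in metres

pixel_area_m2 = PIXEL_SIDE_M ** 2            # 10,000 square metres per pixel
acre_fraction = ACRE_M2 / pixel_area_m2      # share of one pixel covered by 1 acre

print(f"{acre_fraction:.1%}")  # prints "40.5%"
```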

“Residential parcels tended to be fairly homogenous within census blocks (i.e. a single block rarely contained both highrise apartments and single family homes or rural residences), reducing the need for a multi-class weighting scheme”. Reporting a detailed percentage value for this rare occasion in the study area could be more persuasive to the readers.

We agree with this assessment and we have opted to remove this justification argument from the manuscript text and instead highlight the lack of a multi-class weighting scheme as a limitation of the current methodological approach and potential area of future improvement. We made this decision because properly quantifying the degree of homogeneity of parcel types would entail determining mean population densities associated with all 30 residential parcel types across the state, requiring more complex geospatial analysis that is beyond the scope of this work (and would be the bulk of work required to carry out the more complex multi-class weighting scheme itself, since in quantifying this claim we would have to determine the weights of each land use class in order to identify exactly how low/medium/high density each is and compare diversity within all blocks).

The conclusion part is weak and should be enhanced.

Additional summary of the accuracy assessment was added here (lines 609-617) to round out the conclusion and some reorganization was done to improve the flow of the section:

“Assessing the accuracy of the CA-POP dasymetric mapping methodology for a population grid constrained by block group census observations instead of blocks yielded a block-wise median absolute relative error of approximately 30% for block group-to-block disaggregation, which is lower than national error rates reported in the CONUS-wide SocScape grids, the product that reports the most analogous form of accuracy assessment for block group-to-block population disaggregation grids derived from U.S. census values. Given that the final CA-POP grids are constrained by higher-resolution census block-level observations, they are likely even more accurate than their block group-constrained counterparts over a given region, though a proper error assessment of them is not possible due to the absence of observational data at the sub-block scale.”

Line Comments:

Page 3, line 68 – 70. The list of additional ancillary dataset is not exhaustive, and I recommend adding more variety of ancillary dataset:

Property data (Wan, H., Yoon, J., Srikrishnan, V., Daniel, B., Judi, D., 2021. Population downscaling using high-resolution, temporally-rich US property data. Cartography and Geographic Information Science 1–14);

This citation was added as suggested in line 81

Building footprint data (Huang, X., Wang, C., Li, Z., Ning, H., 2021. A 100 m population grid in the CONUS by disaggregating census data with open-source Microsoft building footprints. Big Earth Data 5, 112 – 113).

This citation was added as suggested in line 81

Page 12, line 286. Can the authors explain more in detail about the sliver-removal algorithm?

We provided additional explanation in the form of a footnote linked to line 298 as follows:

“Slivers are defined as any single-part polygon resulting from the block-parcel intersection that is less than:

[original residential parcel area] / [2 × # of polygons descended from a given parcel after intersecting with blocks]

In other words, if a residential parcel with an area of 1 km² is split evenly across two different blocks into two 0.5 km² portions, they will both be preserved since 0.5 km² > [1 km² / (2 × 2) = 0.25 km²]. However, if this same parcel is split across two blocks such that 90% (0.9 km²) of its area is contained in one block and 10% (0.1 km²) in the other, the smaller portion would be considered a sliver and removed since it is less than 0.25 km². This ultimately resulted in the removal of 3.1% of polygons (in terms of count, not area) resulting from the intersection of census blocks with residential parcels.”
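The sliver rule quoted above can be expressed as a short sketch. This is an illustration of the stated threshold only, not the authors' implementation; the function names `sliver_threshold` and `drop_slivers` are hypothetical, and the pieces are represented simply as a list of areas rather than as polygon geometries.

```python
def sliver_threshold(parcel_area, n_pieces):
    """Area below which a piece of a split parcel counts as a sliver.

    Follows the quoted rule: original parcel area / (2 * number of pieces).
    """
    if n_pieces < 1:
        raise ValueError("a parcel must produce at least one piece")
    return parcel_area / (2 * n_pieces)


def drop_slivers(parcel_area, piece_areas):
    """Keep only the pieces whose area is not below the sliver threshold."""
    threshold = sliver_threshold(parcel_area, len(piece_areas))
    return [area for area in piece_areas if area >= threshold]


# The two cases from the footnote, with areas in km^2.
even_split = drop_slivers(1.0, [0.5, 0.5])    # both pieces kept
uneven_split = drop_slivers(1.0, [0.9, 0.1])  # the 0.1 piece is a sliver
print(even_split, uneven_split)
```

Because the threshold shrinks as a parcel is cut into more pieces, a parcel spread across many blocks is judged against a proportionally smaller cutoff.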

Page 15, line 338. The statement for evaluating errors should be clearer. The downscaled grid populations are first aggregated to the finest spatial unit (blocks), and then compared with the ground-truth observations at that spatial unit level.

We have altered this sentence (lines 433-438) to provide more clarity about this process as suggested as follows:

“However, past applications of traditional dasymetric mapping techniques, like those employed in this study, have often evaluated accuracy of their techniques by producing population grids that are constrained by observed populations at a spatial unit that is coarser (e.g. block groups or tracts) than the finest unit available, then evaluating how well the disaggregated population in the resultant grids matches observed population values within the finest spatial unit of ground-truthed data (e.g. blocks)”

Page 15 – 16, line 356 – 365. These sentences are more related to the accuracy assessment part rather than the result part.

We have placed most of this text in the Discussion-Accuracy Assessment section rather than results and consolidated the text to reduce redundancy, as suggested. The Results now simply highlight the accuracy values themselves, as shown in Table 2, rather than discussing methods. (now lines 451-459):

“We assessed the relative accuracy of our methods compared to simple uniform, areal population weighting. The errors in each block-level estimate between the modeled (block group constrained grid) and observed (census block level data) are reported in terms of root mean-squared errors (RMSE) and the squared Pearson correlation coefficient (R²) across all populated blocks. Median block-wise percent errors were also calculated for both raw and absolute percent error magnitudes, following a similar accuracy analysis approach employed by Dmowska and Stepinski (2017), in which they term the median absolute percent error a measure of ‘relative error’. Given the skewed nature of the percent error distribution across population blocks, the median was deemed a more informative metric as opposed to the mean or standard deviation (Dmowska and Stepinski 2017). The same procedure was carried out for a simple uniform, areal weighting technique, which estimates block level population values assuming a homogenous population distribution across the entire spatial area of each block group. Unfortunately, comparison of CA-POP grids produced using this ‘second-best’ ground-truth source zone population data (block-groups) to other products (e.g. WorldPop, LandScan) is not possible due to the fact that those data providers do not provide versions of their grids that utilize anything besides the best-available source zone data as their ground-truth population constraints (blocks). However, SocScape’s methods documentation reports error estimates for their 2010 grids using the same type of accuracy assessment we carried out with CA-POP, constructing a block group-based grid and then comparing its ability to match observed population totals within blocks. This allows for a roughly analogous error comparison between CA-POP and SocScape (Dmowska and Stepinski 2017).”
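The error metrics named in this passage can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the CA-POP analysis code: the function names `areal_weighting` and `accuracy_metrics` are hypothetical, block-level estimates are assumed to already be aggregated from the grid into plain lists, and "populated blocks" is taken to mean blocks with an observed population above zero.

```python
import math
from statistics import median


def areal_weighting(block_group_pop, block_areas):
    """Uniform areal weighting: split a block group's population across its
    blocks in proportion to each block's area."""
    total_area = sum(block_areas)
    return [block_group_pop * area / total_area for area in block_areas]


def accuracy_metrics(estimated, observed):
    """Block-wise RMSE, squared Pearson correlation, and median percent errors."""
    # Keep populated blocks only; percent error is undefined when observed is 0.
    pairs = [(e, o) for e, o in zip(estimated, observed) if o > 0]
    if not pairs:
        raise ValueError("no populated blocks to evaluate")
    n = len(pairs)

    rmse = math.sqrt(sum((e - o) ** 2 for e, o in pairs) / n)

    mean_e = sum(e for e, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    sxy = sum((e - mean_e) * (o - mean_o) for e, o in pairs)
    sxx = sum((e - mean_e) ** 2 for e, _ in pairs)
    syy = sum((o - mean_o) ** 2 for _, o in pairs)
    # R^2 is undefined if either series has zero variance.
    r2 = (sxy * sxy) / (sxx * syy) if sxx > 0 and syy > 0 else float("nan")

    pct_errors = [100.0 * (e - o) / o for e, o in pairs]
    return {
        "rmse": rmse,
        "r2": r2,
        "median_pct_error": median(pct_errors),
        "median_abs_pct_error": median(abs(p) for p in pct_errors),
    }


# Toy example: one block group of 100 people, two blocks with areas 1 and 3.
estimates = areal_weighting(100, [1.0, 3.0])
metrics = accuracy_metrics(estimates, [40, 60])
print(metrics)
```

The same `accuracy_metrics` call could be applied to block totals from a dasymetric grid and from the areal-weighting baseline, so the two methods are scored against identical census block observations.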

Page 17, line 387. “The later-year grids are estimated via different growth forecasting assumptions to extrapolate 2010 values”. Are all those datasets extrapolating population for every year from 2010 to 2020? The author should be clearer about this statement.

Yes, GPW (Doxsey-Whitfield et al. 2015) and WorldPop (Lloyd et al. 2019) detail the population growth extrapolation mechanisms in the non-decennial-year gridded products. The LandScan documentation (Rose et al. 2020) is slightly more opaque in this regard but does make mention of utilizing mid-year population estimates to adjust values. Therefore, we have altered the sentence in our manuscript slightly to more generally describe these varied methods (lines 350-353):

“At the time of writing, each of these products’ population source zones were based on the 2010 census at the block level, with populations in later-year grids estimated via different growth forecasting and/or inter-census population estimates to extrapolate 2010 values forward (Doxsey-Whitfield et al., 2015; Lloyd et al., 2019; Rose et al., 2020).”

However, we do not believe that describing the specific population extrapolation formulas utilized in these other products in finer detail is required for the narrative of our text in this section.

Attachment

Submitted filename: Depsky_etal_2022_ResponsetoReviewers.docx

Decision Letter 1

Krishna Prasad Vadrevu

17 Jun 2022

High-resolution gridded estimates of population sociodemographics from the 2020 census in California

PONE-D-22-01737R1

Dear Dr. DEPSKY,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Krishna Prasad Vadrevu, Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have submitted a revised version of their manuscript entitled “High-Resolution Gridded Estimates of Population Sociodemographics from the 2020 Census in California”. In their rebuttal letter, the authors provide extensive replies to all reviewer comments. In my view, all reviewer comments have been appropriately addressed with changes in the paper, rendering the manuscript acceptable for publication.

Specifically, key sections explaining the methods received clarifications; relevant citations were added; and an additional analysis to assess the results accuracy was conducted. These amendments help the reader to understand the presented dasymetric method and to assess the usefulness of the provided data set, CA-POP.

Reviewer #2: The manuscript has gone through important adaptations, demonstrating more clarity and coherence in the manner of presenting the results. I think this manuscript will be acceptable.

Reviewer #3: The author has addressed all my concerns, and I don’t have further questions. I recommend the publication of this manuscript with minor revisions. The author should carefully proofread the manuscript to avoid any mistakes or typos. Some potential grammatical errors and typos are listed below:

Line 24: Please unify the grammatical tenses (“showed” and “offers”)

Line 300: This sentence is not complete for “population density limit” depiction.

Line 389: “Fresno, CA area” should be “Fresno area, CA”.

Line 424: “Satellite base imagery” should be “satellite-based imagery” or “satellite imagery”.

Line 490 and 510: Replace “table 2” by “table 3”.

Line 622: “…they are likely more even more accurate…”. Please delete the first “more”.

Line 623: Please add the proper preposition before “these final grids”.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

Acceptance letter

Krishna Prasad Vadrevu

21 Jun 2022

PONE-D-22-01737R1

High-resolution gridded estimates of population sociodemographics from the 2020 census in California

Dear Dr. Depsky:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr Krishna Prasad Vadrevu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Land use codes from tax parcel dataset identified as residential.

    (DOCX)

    S1 Fig. False negatives in the Microsoft building footprint data.

    Examples shown in urban and rural contexts. Locations were chosen based on the presence of false negatives and do not generally reflect the typical proportion of false negative instances around the state.

    (DOCX)

    S2 Fig. Examples of 1-acre residential parcels.

    1-acre (~4050 m²) was used as the upper area threshold for low-density residential parcels.

    (DOCX)

    S3 Fig. Percent errors of block group-constrained CA-POP population grid relative to census block population observations.

    Values represent the percent difference between block-level estimates from the block group-constrained CA-POP total population grid and census block population values. Note that these errors do not reflect the block-level errors of the final CA-POP grids themselves, as those were constrained by the block-level census observations, which by definition makes block-level errors zero. Errors in the final grids instead occur at the sub-block level of population apportionment, for which there are no ground-truth population observations available for assessing CA-POP’s ability to apportion population within blocks.

    (DOCX)

    Data Availability Statement

    Data are available from: https://zenodo.org/badge/latestdoi/434382697, DOI: 10.5281/zenodo.5874927, github.com/njdepsky/CA-POP.


    Articles from PLoS ONE are provided here courtesy of PLOS
