Abstract
Background and Objective:
Social and Environmental Determinants of Health (SEDoH) are of increasing interest to researchers in personal and public health. Collecting SEDoH and associating them with patient medical records can be challenging, especially for environmental variables. We announce here the release of SEnDAE, the Social and Environmental Determinants Address Enhancement toolkit, an open-source resource for ingesting a range of environmental variables and measurements from a variety of sources and associating them with arbitrary addresses.
Methods:
SEnDAE includes optional components for geocoding addresses, in case an organization does not have independent capabilities in that area, and recipes for extending the OMOP CDM and the ontology of an i2b2 instance to display and compute over the SEnDAE variables within i2b2.
Results:
On a set of 5000 synthetic addresses, SEnDAE was able to geocode 83%. SEnDAE geocodes addresses to the same Census tract as ESRI 98.1% of the time.
Conclusion:
Development of SEnDAE is ongoing, but we hope that teams will find it useful to increase their usage of environmental variables and increase the field’s general understanding of these important determinants of health.
Keywords: Social and environmental determinants of health (SEDoH), Geocoding, Open-source toolkit
1. Introduction
The health outcomes of individuals and communities are influenced by biological, behavioral, healthcare access, socioeconomic, and environmental factors. The contextual factors are known as social determinants of health (SDoH), defined as the “conditions in which people are born, grow, live, work, and age” (1–5). Many of these determinants are specifically focused on aspects of the natural and built environment, leading some (6–8) to use the term social and environmental determinants of health (SEDoH) in order to draw attention to these aspects. Examples of specifically environmental determinants within the larger set include air quality, water quality, neighborhood diversity, social cohesion and safety, housing quality, traffic density, and park access. Studies (9–16) have attempted to identify relevant determinants and their impact upon personal and population health.
The challenge in using SEDoH lies in collecting the data and associating it with patient medical information. Patients may be able to report on their social determinants such as education level or having insecure employment, but they are unlikely to be able to report accurately on the air quality indices of their home or the safety of their neighborhood, for example. Many environmental variables are collected by local, state, and national government bodies, and organizations such as the US Census Bureau, US Department of Agriculture, and state environmental protection agencies release rich data sets at no cost. The first challenge lies in locating and parsing this data. The second challenge lies in efficiently connecting individual patients to the environmental measurements in these governmental datasets.
To address these challenges, we have developed a toolkit to enable other institutions to enrich clinical datasets with geocoded SEDoH data. Named SEnDAE for Social and Environmental Determinants Address Enhancement and pronounced “sundae,” it ingests a range of social and environmental variables—42 currently—from national and California governmental datasets, handling the parsing problem. It includes an optional geolocation module for turning addresses into the same reference polygons used by the government datasets; this module is intended for institutions that do not have their own geolocation resources. Finally, it quickly enriches a dataset with the social and environmental measurements relevant to the geolocated addresses. The resulting dataset separates the social and environmental measurements from the addresses, allowing researchers to use the environmental variables without exposing personal health information (PHI) in the form of patient addresses.
SEnDAE represents an important step forward by bringing together datasets, a geocoder, and the ontology extension tool. We hope the SEnDAE toolkit enables more institutions to undertake research on the impact of social and environmental variables, increasing the understanding of these important determinants of personal and population health.
2. System description
2.1. SEnDAE comprises three main modules
The Geolocation Module takes an arbitrary (csv) list of addresses and sends them to the US Census Bureau Geocoding service (17). The Geocoding service returns a specific latitude and longitude for each address. The Geolocation Module then assigns each latitude-longitude pair to a specific Census tract using reference files also provided by the Census service.
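The tract-assignment step can be pictured as a point-in-polygon test. The sketch below is illustrative only: it assumes tract boundaries are available as simple lists of (longitude, latitude) vertices, and the tract IDs and coordinates are made up. SEnDAE's actual lookup against the Census reference files may work differently.

```python
from typing import Optional, Sequence

Point = tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a horizontal ray crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(point: Point, tracts: dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the ID of the first tract polygon containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(point, polygon):
            return tract_id
    return None


# Hypothetical square "tracts" for illustration only.
TRACTS = {
    "tract_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
```

A point that falls in no polygon yields `None`, which corresponds to an address that could be geocoded to coordinates but not assigned to a tract.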
A Census tract is a “small, relatively permanent statistical subdivision” defined by local populations or the US Census Bureau, with the intention of being maintained over a long time and generally following visible and identifiable features (18). Because a Census tract is a geographic division smaller than a ZIP code, it is considered an element of PHI and must be protected as such. The input file to the Geolocation Module should contain only one or more addresses and a numerical identifier for each. If any other information is contained in the input file, such as a name or medical record number (MRN), the Module will halt and return a warning about protecting PHI. Interaction with the Census Geocoding service is encrypted in both directions, but it is recommended to run this Module on a server that is not associated with a healthcare provider organization.
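The input check described above might look roughly like the following sketch. The column names `id` and `address` and the `PHIWarning` exception are assumptions made for illustration; they are not taken from the SEnDAE code base.

```python
import csv
import io

# Hypothetical column names; SEnDAE's real input schema may differ.
ALLOWED_COLUMNS = {"id", "address"}


class PHIWarning(Exception):
    """Raised when the input may contain PHI beyond an ID and an address."""


def read_address_file(handle) -> list[dict[str, str]]:
    """Read an address CSV, halting on unexpected columns or non-numeric IDs."""
    reader = csv.DictReader(handle)
    columns = set(reader.fieldnames or [])
    extra = columns - ALLOWED_COLUMNS
    if extra:
        raise PHIWarning(f"Unexpected columns (possible PHI): {sorted(extra)}")
    missing = ALLOWED_COLUMNS - columns
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        if not row["id"].strip().isdigit():
            raise ValueError(f"Identifier is not numeric: {row['id']!r}")
    return rows


# A well-formed file with one address passes the check.
ok_rows = read_address_file(io.StringIO("id,address\n54654,123 Elm St LA CA\n"))
```

Rejecting the whole file, rather than silently dropping the extra columns, keeps the operator aware that PHI was about to leave the organization.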
The Enrichment Module compiles all the data files from the providers such as the US Census Bureau and California Environmental Protection Agency. It then parses these files and extracts all the social and environmental variable measurements. Finally, it takes an input file of geocoded records and appends to each record the social and environmental measurements for each Census tract in that input. The enriched records are written to a csv which can be ingested into a database or statistical application of the user’s choice.
If the Enrichment Module is given an input file containing addresses rather than geocodes, it will call the Geolocation Module in order to translate the addresses to geocodes. Otherwise the Enrichment Module does not transmit any information and can be run entirely within a health provider organization (HPO) firewall.
The Ontology Extension Tool is the third module of SEnDAE. It starts from an awareness that users want to browse (deidentified) patient data, and an awareness that i2b2 (19) has become a de facto standard for this purpose. The Ontology Extension Tool expands an institution’s i2b2 database and user interface to allow for browsing of records that have been enriched with geospatially-linked SEDoH measurements. Out of the box, the tool expands i2b2 to only the social and environmental variables included in the Enrichment Module, but instructions are given for including additional variables an institution may wish to incorporate into their i2b2 application.
The Geolocation and Enrichment Modules are coded in Python 3.10 and call standard libraries including pandas, openpyxl, xlrd, requests, pytest, numpy, and rasterio. The i2b2 expansion module is a Microsoft Excel spreadsheet but will be recoded into Python in a future release.
3. Data sources
3.1. SEnDAE currently includes data from the following sources
- US Census American Community Survey, 5-year detailed tables (B), subject tables (S), and data profiles (DP) (20)
- Availability: public
- Access via API, scopable to particular geographic unit (eg city, state) or entire USA. Large API calls should be cached for efficiency.
- Data is returned as JSON, parsable via Python
- Note: The US Census requires registering a (free) API key if exceeding 500 queries per IP address per day.
- Census Geography Program (21) Gazetteer Files
- Availability: public
- Access as TXT format (tab separated), bundled with SEnDAE download
- Parse with pandas
- Office of Environmental Health Hazard Assessment (OEHHA), on behalf of the California Environmental Protection Agency (CalEPA) (22)
- Availability: public
- Access as CSV file, bundled with SEnDAE download
- Parse with pandas
- Centers for Disease Control and Prevention. Agency for Toxic Substances and Disease Registry (23)
- Availability: public
- Access as CSV file, bundled with SEnDAE download
- Parse with pandas
- Note: when the CDC CSV files are opened in Microsoft Excel, the “FIPS” column, which identifies the Census tract, is truncated (leading zeroes are deleted). Users should be wary of opening this dataset with Excel.
- US Department of Agriculture, Economic Research Service (24)
- Availability: public
- Access as XLSX (Microsoft Excel) format, bundled with SEnDAE download
- Parse with pandas
- Southern California Environmental Health Sciences Center (25)
- Availability: by request
- Access as raster format (GeoTIFF)
- Parse with the open-source Python library rasterio (https://rasterio.readthedocs.io/en/latest/)
- NASA Air Quality files (26,27)
- Availability: public
- Access as raster format (GeoTIFF)
- Parse with the open-source Python library rasterio (https://rasterio.readthedocs.io/en/latest/)
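The leading-zero problem noted for the CDC files is not specific to Excel: any parser that infers a numeric type for the tract identifier will drop the zero. The sketch below shows one way to guard against this with pandas. The two-row CSV and its values are made up, and the 11-digit width assumes the standard state (2) + county (3) + tract (6) identifier layout.

```python
import io

import pandas as pd

# Tiny stand-in for a tract-level CSV; the values are made up for illustration.
RAW_CSV = "FIPS,score\n06037101110,0.42\n06037101122,0.77\n"

# Default parsing infers an integer column, so the leading zero is lost.
as_numbers = pd.read_csv(io.StringIO(RAW_CSV))

# Reading the identifier as text preserves all 11 digits.
as_text = pd.read_csv(io.StringIO(RAW_CSV), dtype={"FIPS": str})


def restore_fips(values: pd.Series, width: int = 11) -> pd.Series:
    """Re-pad identifiers that were already truncated to numbers."""
    return values.astype(str).str.zfill(width)


repaired = restore_fips(as_numbers["FIPS"])
```

Reading the column as text at load time is the safer option; re-padding is a fallback for files that have already been truncated.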
More details on these data sources, including description of the exact measurements provided, are in the Appendix. These data sources were selected due to being publicly accessible (mostly) and of high interest to our Southern California-based initial researcher community. Additional data sources are constantly being added and will be included in subsequent releases.
For data sources that are bundled with the SEnDAE distribution, we compared the performance of formatting the files as .xlsx (Microsoft Excel), CSV, and Parquet (an open source, column-oriented data file format designed for efficient data storage and retrieval). We found that:
- CSV files were much more performant than Excel .xlsx files;
- Parquet files were slightly faster than CSV files;
- CSV format is the most familiar to most researchers and natively supported by many tools including Microsoft Excel.
As a result, bundled data files are parsed into CSV format or left as GeoTIFF.
Each data source uses its own naming conventions, even for Census tract. Users should take care to understand the conventions and correctly choose and parse the columns accordingly. We also found that any algorithm that works to join different data sets should not do so on a row-by-row basis; this is very inefficient. Instead, we used numpy (an open-source library in Python used for scientific computation) to join the data using a join column. This was approximately 50 times faster.
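To illustrate the difference from a row-by-row loop, the sketch below performs a left join on a tract key using whole-array numpy operations. It is one possible way to write such a join, not necessarily how SEnDAE implements it; the tract identifiers and values are made up.

```python
import numpy as np

# Hypothetical lookup table: one measurement per tract (values are made up).
lookup_tracts = np.array(["06037101110", "06037101122", "06037101210"])
lookup_values = np.array([6.9, 7.6, 11.7])

# Tract of each record to enrich; records may repeat tracts or miss the lookup.
record_tracts = np.array(
    ["06037101210", "06037101110", "06037999999", "06037101110"]
)


def join_on_tract(record_keys, lookup_keys, lookup_vals):
    """Vectorized left join: one value per record, NaN where the tract is unknown."""
    order = np.argsort(lookup_keys)
    sorted_keys = lookup_keys[order]
    sorted_vals = lookup_vals[order]
    # Position where each record key would sit among the sorted lookup keys.
    pos = np.searchsorted(sorted_keys, record_keys)
    pos = np.clip(pos, 0, len(sorted_keys) - 1)
    # A key is found only if the lookup key at that position is identical.
    found = sorted_keys[pos] == record_keys
    return np.where(found, sorted_vals[pos], np.nan)


joined = join_on_tract(record_tracts, lookup_tracts, lookup_values)
```

All records are matched in a single sorted search instead of one scan of the lookup table per record.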
4. Compliance
It must be stressed once again that geolocations are an element of PHI when associated to patients or HPOs. As such, the most common expected use case for SEnDAE is by operators of a clinical data warehouse who have access to patient PHI such as address but who wish to enrich the warehouse with the SEDoH variables. These operators should have either an established process to provide geocodes, or institutional approval to turn on geocoding within the toolkit. SEnDAE is designed first to separate addresses from other PHI elements (such as name or MRN) and second to separate the geolocation lookup from the HPO itself. This minimizes (but of course does not completely eliminate) the risk of accidental disclosure of PHI, in line with compliance review at USC as well as published best practices and protocols (28–33).
5. Sample system run
5.1. The seven tables below illustrate the processing flow and accompanying PHI protections
Records from the data warehouse containing patients’ names and addresses are extracted and assigned a temporary random ID per normal practices, as shown in Table 1. From this is extracted a set containing only the temporary IDs and the addresses, as shown in Table 2. This dataset is moved to a server external to the HPO network. The destination server should be covered by a Business Associate Agreement (BAA) with the HPO to authorize the transfer of health information, since this is still a transfer of PHI.
Table 1.
Mock Data Warehouse Initial State.
| Temp. ID | MRN | Name | Address | Other info |
|---|---|---|---|---|
| 54,654 | 123,456 | John Smith | 123 Elm St LA CA | … |
| 68,984 | 789,456 | Mary Brown | 5477 Main St LA CA | … |
| 23,489 | 025,741 | Roberto Alvarez | 23 Oceanview Ave LA CA | … |
Table 2.
Geocoding Dataset Initial State.
| Temp. ID | Address |
|---|---|
| 54,654 | 123 Elm St LA CA |
| 68,984 | 5477 Main St LA CA |
| 23,489 | 23 Oceanview Ave LA CA |
The external dataset is sent to the US Census geocoding service, which returns it with the latitude and longitude appended as shown in Table 3.
Table 3.
Geocoding Dataset, Geolocated.
| Temp. ID | Address | Longitude | Latitude |
|---|---|---|---|
| 54,654 | 123 Elm St LA CA | −111.8747353 | 33.4566050 |
| 68,984 | 5477 Main St LA CA | −111.8886227 | 33.4295194 |
| 23,489 | 23 Oceanview Ave LA CA | −111.8867018 | 33.4290795 |
The latitude-longitude pair is next associated with a specific Census tract as shown in Table 4 using mapping tables also provided by the US Census Bureau.
Table 4.
Geocoding Dataset with Tract ID.
| Temp. ID | Address | Longitude | Latitude | Census Tract |
|---|---|---|---|---|
| 54,654 | 123 Elm St LA CA | −111.8747353 | 33.4566050 | 8095.12 |
| 68,984 | 5477 Main St LA CA | −111.8886227 | 33.4295194 | 8097.55 |
| 23,489 | 23 Oceanview Ave LA CA | −111.8867018 | 33.4290795 | 8091.30 |
A selection of the publicly available datasets noted earlier are then loaded from local disk, parsed, and merged into a large lookup table, as shown in Table 5.
Table 5.
Compiled Public Datasets.
| Census Tract | % Hispanic | % below Fed. Poverty level | Median household income | Unemployment rate | Other measurements |
|---|---|---|---|---|---|
| … | … | … | … | … | … |
| 8095.12 | 6.9 | 28.8 | 40,239 | 10.3 | … |
| … | … | … | … | … | … |
| 8097.55 | 7.6 | 9.8 | 59,881 | 1.0 | … |
| … | … | … | … | … | … |
The geocoding dataset can then be transferred back to the HPO network. Using the Census tract or latitude-longitude pair as the key value, one or more of the measurements from Table 5 are appended to each row of the geocoding dataset, as shown in Table 6.
Table 6.
Geocoding Dataset Enriched.
| Temp. ID | Address | Longitude | Latitude | Census Tract | % Hispanic | Other measurements |
|---|---|---|---|---|---|---|
| 54,654 | 123 Elm St LA CA | −111.8747353 | 33.4566050 | 8095.12 | 6.9 | … |
| 68,984 | 5477 Main St LA CA | −111.8886227 | 33.4295194 | 8097.55 | 7.6 | … |
| 23,489 | 23 Oceanview Ave LA CA | −111.8867018 | 33.4290795 | 8091.30 | 11.7 | … |
Using the temporary ID as the key value, SEDoH measurements are appended to each row of the original dataset, and the latitude, longitude, and Census tract information is deleted, as shown in Table 7. The only output of the toolkit is the SEDoH measurements associated to the record rows; geolocations are deleted.
Table 7.
Data Warehouse Enriched.
| Temp. ID | MRN | Name | Address | Other info | % Hispanic | % below Fed. Poverty level | Other measurements |
|---|---|---|---|---|---|---|---|
| 54,654 | 123,456 | John Smith | 123 Elm St LA CA | … | 6.9 | 28.8 | … |
| 68,984 | 789,456 | Mary Brown | 5477 Main St LA CA | … | 7.6 | 9.8 | … |
| 23,489 | 025,741 | Roberto Alvarez | 23 Oceanview Ave LA CA | … | 11.7 | 5.2 | … |
At this point the SEDoH measurements can be used just like any other fields in the patient record in the data warehouse. Datasets extracted from the data warehouse can include the SEDoH measurements without including any patient PHI.
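The flow through Tables 1 to 7 can be sketched with pandas as follows. The column names are hypothetical, the geocoder output is replaced by a hard-coded stand-in, and only two of the mock records are used; this is an illustration of the sequence of joins, not SEnDAE's implementation.

```python
import pandas as pd

# Mock warehouse extract (Table 1); all values are illustrative.
warehouse = pd.DataFrame({
    "temp_id": [54654, 68984],
    "mrn": ["123456", "789456"],
    "name": ["John Smith", "Mary Brown"],
    "address": ["123 Elm St LA CA", "5477 Main St LA CA"],
})

# Table 2: only the temporary ID and address leave the warehouse.
geocoding = warehouse[["temp_id", "address"]].copy()

# Tables 3-4: hard-coded stand-in for the geocoder and tract assignment.
geocoding["census_tract"] = ["8095.12", "8097.55"]

# Table 5: compiled public measurements keyed on Census tract.
public = pd.DataFrame({
    "census_tract": ["8095.12", "8097.55"],
    "pct_hispanic": [6.9, 7.6],
    "pct_below_poverty": [28.8, 9.8],
})

# Table 6: append measurements using the tract as the key.
enriched = geocoding.merge(public, on="census_tract", how="left")

# Table 7: join back on the temporary ID after dropping geolocation fields.
measurements = enriched.drop(columns=["address", "census_tract"])
result = warehouse.merge(measurements, on="temp_id", how="left")
```

Dropping the tract column before the final merge is what keeps geolocations out of the enriched warehouse table.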
The overall architecture of the system is outlined in Fig. 1.
Fig. 1.

SEnDAE data flow with external geocoding.
Alternatively, if an institution has access to a self-contained GIS lookup application (e.g., Esri’s ArcGIS Business Analyst geocoding solution) that does not need to send data to an external server, then the entire architecture can be housed within the HPO network as outlined in Fig. 2. Some experts (30) argue that in-house geocoding is the only HIPAA-compliant method, to ensure PHI is not compromised by transferring patient addresses to another entity for geolocation.
Fig. 2.

SEnDAE data flow with internal lookup.
Although the example here shows enriching patient records in the data warehouse, which is expected to be the most common use case, SEnDAE can add the SEDoH measurements to any arbitrary list of addresses. For example, a researcher may be interested in recruiting students who attend a school in an area with poor housing conditions, poor air quality, or safety issues; the researcher could feed a list of school addresses into SEnDAE. (This stands in contrast to geocoding capabilities present in major EHR systems such as Epic or Cerner, which are limited to patient addresses.) SEnDAE could also be used by a researcher who already has IRB-approved access to patient addresses, in which case the added SEDoH variables would probably not require any additional IRB approvals.
Geocoding and enrichment must account for temporality; that is, the fact that the association between patient records, Census tracts, and environmental measures changes over time. A patient may change their residence, but care is needed because only a subset of the social and environmental measurements may be updated annually. Census tracts are also revised after each decennial Census to ensure that the jurisdictions delineated for local, state, and federal elections contain roughly equal numbers of eligible voters. SEnDAE accounts for these variations and associates records with the environmental measurements for the appropriate years.
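As a simple illustration of year matching, the sketch below picks, for a given record year, the most recent measurement released at or before that year. This selection rule is an assumption made for the example; the rule SEnDAE actually applies is not spelled out here, and the sketch does not address revisions to tract boundaries. The PM2.5 values are made up.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical measurements: tract -> {release year: value}. Values are made up.
PM25_BY_TRACT = {
    "8095.12": {2014: 12.1, 2018: 11.4, 2021: 10.9},
}


def measurement_for_year(series: dict[int, float], year: int) -> Optional[float]:
    """Return the value from the latest release at or before `year`, else None."""
    years = sorted(series)
    idx = bisect_right(years, year)
    if idx == 0:
        return None  # the record predates every available release
    return series[years[idx - 1]]
```

For example, a 2019 record in this tract would be paired with the 2018 release, while a 2010 record would receive no value.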
5.2. Computational methods: SEDoH ontology
To make SEDoH measurements accessible to researchers, they should be included in a browsable interface along with other patient data fields. At USC the preferred interface is i2b2. After surveying ontology domains for SEDoH concepts, however, we concluded that no existing ontology was sufficient to the task. i2b2 itself did not provide an ontology for all the SEDoH concepts included in SEnDAE. The World Health Organization (WHO)’s International Classification of Diseases (ICD) ontology included many SEDoH concepts but only as applied to an individual patient’s actual experience (e.g., V15.86 “personal history of contact with and (suspected) exposure to lead” or V60.1 “lack of adequate housing”) and not to location-based data. LOINC is a standard for encoding tests and observations; it is planning to add SEDoH concepts but these are not yet finalized. Groups have investigated Fast Healthcare Interoperability Resources (FHIR) (34) or Observational Medical Outcomes Partnership (OMOP) (35) data models to incorporate some SEDoH variables but these are not official models yet. Thus there seemed to be no viable off-the-shelf options for exposing SEDoH data via i2b2.
Accordingly, a novel ontology tree for displaying SEDoH within i2b2 was developed, customized for the 42 concepts included in SEnDAE. For each concept, the appropriate unit was incorporated based upon the units provided by the public dataset. For example, some agencies report air quality measures in parts-per-million (ppm) while others in parts-per-billion (ppb). These values are not normalized in order to keep the data as true to the reporting agencies’ measurements as possible. Users are alerted to changing units. For each concept, filters are required to define how the i2b2 application will process the variables—minimum and maximum values, for example. These are designed to support queries such as “patients who live in a census tract where the median annual income is between $45k and $75k” or “patients who live in a census tract where the average PM2.5 measurement is ≥ 12 micrograms per cubic meter”.
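The two example queries can be mimicked outside i2b2 with a small range filter. The sketch below is a plain-Python analogy, not i2b2 code; the `RangeFilter` class, the field names, and the PM2.5 values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeFilter:
    """Inclusive numeric range; None means the bound is open."""

    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def accepts(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# Hypothetical enriched records; the PM2.5 values are made up.
PATIENTS = [
    {"temp_id": 54654, "median_income": 40239, "pm25": 12.5},
    {"temp_id": 68984, "median_income": 59881, "pm25": 9.1},
]


def query(records, field, range_filter):
    """Return the IDs of records whose `field` value passes the filter."""
    return [r["temp_id"] for r in records if range_filter.accepts(r[field])]


# Median income between $45k and $75k.
income_hits = query(PATIENTS, "median_income", RangeFilter(45_000, 75_000))
# Average PM2.5 of at least 12 micrograms per cubic meter.
pm25_hits = query(PATIENTS, "pm25", RangeFilter(minimum=12))
```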
Adding a concept to the i2b2 ontology entails defining for that concept: the Path, detailing the concepts between the root and the new concept; the Name, which is the human-readable display name for the concept; and the Code, which is a unique identifier. The concept must also be defined as a Folder, if it can contain further concepts, or as a Leaf, if it is the endpoint of a Path. This also defines how the concept will be displayed within the i2b2 browser. For the concept a Filter must be defined, spelling out acceptable values for the concept (Booleans, allowed textual values, or range of numerical values). Finally, a Tooltip may be provided for the concept as explanatory information to help researchers use the concept in their queries.
To facilitate the process of adding concepts to i2b2, an Ontology Extension Tool was developed. Although this was developed specifically to enable adding the SEDoH variables, it can be used for creating or expanding the i2b2 ontology in any topic. It is currently encoded as an Excel workbook. The first three tabs contain prompts for input from the ontology designer. Formulas on those tabs check for and report any anomalies in the input, while formulas on the next four tabs create SQL code and .csv files for creating the data structures within the i2b2 application and server.
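The kind of record the tool collects, and the kind of anomaly check it performs, can be sketched in Python as follows. This is a hypothetical illustration: the field names, the specific checks, the concept codes, and the CSV layout are assumptions and do not reproduce the workbook's formulas or the i2b2 metadata schema.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Concept:
    path: str              # concepts between the root and this concept
    name: str              # human-readable display name
    code: str              # unique identifier
    is_folder: bool        # Folder (may contain concepts) or Leaf (end of a Path)
    filter_spec: str = ""  # acceptable values, e.g. "range:0-100"
    tooltip: str = ""      # optional explanatory text


def find_anomalies(concepts: list[Concept]) -> list[str]:
    """Report empty names, duplicate codes, and leaves without a filter."""
    problems = []
    seen_codes = set()
    for concept in concepts:
        if not concept.name.strip():
            problems.append(f"{concept.code}: empty name")
        if concept.code in seen_codes:
            problems.append(f"{concept.code}: duplicate code")
        seen_codes.add(concept.code)
        if not concept.is_folder and not concept.filter_spec:
            problems.append(f"{concept.code}: leaf has no filter")
    return problems


def to_csv(concepts: list[Concept]) -> str:
    """Serialize concepts to CSV text with an illustrative column layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["path", "name", "code", "type", "filter", "tooltip"])
    for c in concepts:
        kind = "Folder" if c.is_folder else "Leaf"
        writer.writerow([c.path, c.name, c.code, kind, c.filter_spec, c.tooltip])
    return buffer.getvalue()


# Hypothetical concepts and codes, for illustration only.
CONCEPTS = [
    Concept("\\SEDoH\\", "Education", "SEDOH_EDU", True),
    Concept("\\SEDoH\\Education\\", "Pct w/Bachelor's Degree", "SEDOH_EDU_BACH",
            False, "range:0-100", "Illustrative tooltip text"),
]
```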
A portion of the i2b2 ontology extended to SEDoH variables is shown in Fig. 3. Note the tooltip providing more information on the variable “Education – Pct w/Bachelor’s Degree – Age 25 or Over”.
Fig. 3.

Portion of SEDoH Ontology within i2b2.
6. Metrics
We performed an initial validation of the Census Geocoder API implemented in SEnDAE against ESRI (a leading mapping software vendor whose tools are available at many research institutions). Openaddresses.io is a collection of authoritative data for address locations around the world, collected from a variety of public data sources. Addresses are considered well-formed in that they are plausible addresses for the location in question, but they are not guaranteed to be geocodable. To measure accuracy, we created one dataset of 5000 addresses from openaddresses.io, 100 from each US state. Running the geocoding function on this dataset, SEnDAE was able to geocode 83% of the addresses. Considering that these addresses were not guaranteed to be geocodable, we consider this acceptable performance.
There is a GitHub repository (https://github.com/EthanRBrown/rrad) which selects a subset of the openaddresses.io data and confirms that the addresses are geocodable. From this repository we created a second dataset of 106 addresses in California which were thus guaranteed to be geocodable. We performed a spatial join on the 2020 TIGER/Line Shapefiles with the 106 addresses geocoded using the ArcGIS World Geocoding Service. All the data sources used in SEnDAE are keyed on a census tract. So to compare the two geocoding systems, we considered it a match when the census tract returned by the spatial join was identical to the census tract returned by the Census Geocoder API for a particular address.
We found that 104/106 (approximately 98.1%) addresses matched (given the above definition).
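The match definition amounts to comparing, address by address, the tract returned by each system. A minimal sketch of that comparison, with hypothetical function and variable names, together with the arithmetic behind the reported figure:

```python
def tract_match_rate(reference: dict[str, str], candidate: dict[str, str]) -> float:
    """Fraction of reference addresses whose tract is identical in `candidate`."""
    matches = sum(
        1 for address, tract in reference.items() if candidate.get(address) == tract
    )
    return matches / len(reference)


# The reported result: 104 of 106 addresses matched.
reported_rate = 104 / 106
reported_percent = round(reported_rate * 100, 1)
```

Here `reported_percent` evaluates to 98.1, the figure given in the text.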
At the time of writing we did not have approval to test the SEnDAE geocoder on actual addresses.
Runtime metrics for SEnDAE were collected using a Lenovo ThinkPad with Intel® Core™ i5-10210U CPU @ 1.60 GHz and 8.0 GB RAM.
To load the file-based data sets into memory takes about 10s, with an additional 8 min 17s to load ACS data via API calls.
Geocoding using the Census geocoder works in batches of 10,000 addresses at a time and takes an average of 0.27s per address.
Enrichment with all variables takes an average of 0.00989s per address. This assumes that each record (e.g., patient) is associated with only one address, but each address has multiple social and environmental measurements or estimates over the years (e.g., a new air quality measurement every year).
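These figures allow a rough back-of-the-envelope estimate of total runtime. The sketch below assumes that the per-address costs scale linearly and that the loading costs are paid once per run; neither assumption is tested here, and the numbers apply only to the hardware described above.

```python
# Timings reported above, in seconds.
LOAD_FILES_S = 10.0
LOAD_ACS_API_S = 8 * 60 + 17  # 8 min 17 s
GEOCODE_S_PER_ADDRESS = 0.27
ENRICH_S_PER_ADDRESS = 0.00989


def estimated_runtime_seconds(n_addresses: int, geocode: bool = True) -> float:
    """Rough total runtime, assuming linear scaling in the number of addresses."""
    total = LOAD_FILES_S + LOAD_ACS_API_S + n_addresses * ENRICH_S_PER_ADDRESS
    if geocode:
        total += n_addresses * GEOCODE_S_PER_ADDRESS
    return total


with_geocoding = estimated_runtime_seconds(10_000)
without_geocoding = estimated_runtime_seconds(10_000, geocode=False)
```

Under these assumptions, 10,000 addresses would take about 3306 s (roughly 55 minutes) with geocoding and about 606 s (roughly 10 minutes) when geocodes are supplied, so geocoding dominates the runtime.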
7. Availability
SEnDAE is released as open source via https://github.com/scctsi/gis-toolkit.
8. Limitations
SEnDAE is obviously most limited by the availability of source data on SEDoH. Not every variable of possible interest is collected, and not every repository of collected SEDoH variables is easily available to the public. The SEnDAE development team has focused on variables of interest to our research partners and data sources covering our Southern California research populations, but the open source nature of the toolkit allows other teams to expand to datasets localized to their regions and interests. We also continue to add datasets at the request of users, per availability and technical feasibility.
9. Discussion
SEnDAE is intended to provide a relatively simple and open source resource for institutions interested in incorporating SEDoH into their healthcare research, especially those institutions that do not have the resources or expertise to manually geocode and enrich records. Although this toolkit helps address some of the technical barriers to doing this type of research, it does not solve the need for subject matter expertise. There are multiple layers of complexity in using geocoded SEDoH variables, such as understanding geographic granularity and temporality. For some variables, such as those from the ACS, researchers need to know when to use different versions of the same variable estimated over different time windows (e.g., 1-year and 5-year estimates). We encourage all researchers and institutions interested in this work to identify partners with the appropriate expertise. Some additional thought and care may also be needed to handle the ways in which individual and aggregate measures are linked and interpreted. Considerations include decisions associated with the geocoding process (31), the variability of locational accuracy related to the urban or rural status of an address or to geocoding methods (32), and the use of different geographic units when matching addresses (33). There are many subtleties linked to the atomistic and ecological fallacies, such as the modifiable areal unit problem (MAUP) and the uncertain observation point problem (UOPP). Robertson and Feick (36) provide a good starting place for those new to these potential challenges and pitfalls.
SEnDAE also allows some separation between patient PHI, in the form of addresses, geocodes, or Census tracts, and the environmental measurements themselves. By adding environmental measurements as fields in the data warehouse, these measurements can be used while maintaining a fully-deidentified dataset. While it would be technically possible to back-engineer a record’s Census tract information using a large set of environmental measurements, the same reidentification risk applies to almost any large set of patient data. Thus, just as any request for patient information needs to justify which variables are required, IRBs and data provisioning teams should release only the smallest number of SEDoH variables necessary for a given analysis.
Acknowledgments
This work was supported by grants UL1TR001855 and UL1TR000130 from the National Center for Advancing Translational Science (NCATS) of the U.S. National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Appendix: Detailed Data Sources
| Data element | Source | Access as | Years covered | Update Freq. | Description |
|---|---|---|---|---|---|
| Social Vulnerability Index | CDC | File | 2000-present | Approximately every 2 years | Social vulnerability measures community resilience to respond to or recover from threats to public health. The CDC SVI uses Census data for 15 social factors, including poverty, lack of access to transportation, and crowded housing, grouped into four themes. The index ranges from 0 to 1, with higher values indicating more-vulnerable populations. |
| Gini Inequality Coefficient | ACS | API | 2009-present | Every year | Income inequality is the extent to which income is distributed unevenly among a population. The Gini Index is a summary measure where higher values indicate greater inequality; the coefficient ranges from 0 (perfect equality, everyone receives an equal share) to 1 (perfect inequality). |
| Old Age Dependency Ratio | ACS | API | 2009-present | Every year | The Old Age Dependency Ratio compares the size of the senior population (age 65 and over) to the size of the working-age population (18-64) to measure the social and economic impact of age structures. The ratio ranges from zero upwards; the higher the number, the greater the burden of supporting seniors on working people. |
| Child Dependency Ratio | ACS | API | 2009-present | Every year | The Child Dependency Ratio compares the size of the child population (age 17 and under) to the size of the working-age population (18-64) to measure the social and economic impact of age structures. The ratio ranges from zero upwards; the higher the number, the greater the burden of supporting children on working people. |
| Air Quality Indicator - Ozone (O3) | CalEPA_CES | File | 2014 - present | Released every 4 years | Air Quality Indicator - Ozone (O3). Ozone pollution causes numerous adverse health effects, including acute upper and lower respiratory symptoms, reduced lung function, and exacerbation of lung disease. CalEnviroScreen values modeled from summer month maximum-value means for 2012–2014. The values are in parts per million. O3 concentrations in these data range between 0.026–0.068 ppm. |
| Air Quality Indicator - PM2.5 | CalEPA_CES | File | 2014 - present | Released every 4 years | Air Quality Indicator - Fine Particulate Matter <= 2.5 μm. PM2.5 includes extremely small particles and liquid droplets that when inhaled can penetrate deep into the lungs causing serious health problems. CalEnviroScreen values modeled from averages of quarterly means for 2012–2014. The values are in micrograms per cubic meter. PM2.5 values in these data range between 2 and 20 micrograms per cubic meter. |
| Drinking Water Quality Indicator | CalEPA_CES | File | 2014 - present | Released every 4 years | The drinking water contaminant indicator considers measured chemical and bacterial contaminant levels; presence of multiple contaminants; and any past water system water-quality violations. CalEnviroScreen data (2005–2013) were aggregated from census blocks to tracts, weighted by population, and the state percentile value for each contaminant and violation type was summed for the overall score. The index ranges from 0 upwards, with higher values indicating greater concentrations of multiple contaminants. |
| Air Quality Indicator - Asthma ER Visits | CalEPA_CES | File | 2014 - present | Released every 4 years | Air Quality Indicator - Age-adjusted rate of ER visits for asthma per 10,000 persons. |
| Food - Fraction of Population with Low Access | USDA | File | 2010-present | Released every 4 years | USDA Food Access Research Atlas data (2015) measure access to healthy and affordable food based on distance thresholds. This indicator estimates the percentage of the census tract population that live beyond 1 mile from the nearest supermarket for urban areas, or 10 miles for rural areas. |
| Food - Low-Access Tract | USDA | File | 2010-present | Released approximately every 4 years | Low food access tract at 1 mile for urban areas or 10 miles for rural areas. The USDA Food Access Research Atlas considers a census tract to have low food access if a significant number (500) or share (>= 33%) of individuals in the tract lives beyond a specific distance from a supermarket, supercenter, large grocery store, or other source of healthy and affordable food. |
| Housing - Median Year Built | ACS | API | 2009-present | Every year | Median year housing units built. Housing quality is affected by a home’s age, maintenance, structure, and design. Poor air quality; lack of insulation; and potential exposure to lead, asbestos, mold, or carbon monoxide are examples of factors associated with negative health outcomes. |
| Housing - Pct Occupied Units Lacking Plumbing | ACS calculation | API | 2009-present | Every year | Percentage of occupied housing units that lack indoor plumbing. |
| Housing - Pct Occupied Units Lacking Complete Kitchen | ACS | API | 2009-present | Every year | Percentage of occupied housing units that lack complete kitchen facilities. |
| Housing - Pct Occupied Units with No Bedroom | ACS | API | 2009-present | Every year | Percentage of occupied housing units with no bedroom. |
| Housing - Pct Occupied Units with No Vehicle Available | ACS | API | 2009-present | Every year | Percentage of occupied housing units with no vehicle available. |
| Housing - Pct Occupied Units with No Computer [includes Smartphone] | ACS | API | 2009-present | Every year | Percentage of occupied housing units with no computing device available; includes smartphone. |
| Housing - Pct Occupied Units with No Internet Subscription | ACS | API | 2009-present | Every year | Percentage of occupied housing units with no internet subscription; includes cellular data. |
| Population Density | ACS | API | 2009-present | Every year | Population density measures the number of persons per geographical area by dividing the total population by the total land area. It varies considerably across urban and rural areas and also across neighborhoods within cities. It is measured in persons per square kilometer. |
| Pct Hispanic | ACS | API | 2009-present | Every year | Percentage of the population who identify as of Hispanic or Latino origin. |
| Pct Non-Hispanic | ACS | API | 2009-present | Every year | Percentage of the population who do not identify as of Hispanic or Latino origin. |
| Pct American Indian or Alaska Native | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as American Indian or Alaska Native. |
| Pct Asian | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as Asian. |
| Pct Black | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as Black or African American. |
| Pct Native Hawaiian or Other Pacific Islander | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as Native Hawaiian or Other Pacific Islander. |
| Pct Multiple Race | ACS | API | 2009-present | Every year | Percentage of the population who identify as two or more races. |
| Pct White | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as White. |
| Pct Some Other Race | ACS | API | 2009-present | Every year | Percentage of the population who identify their race as Some Other Race. |
| Pct Below 100% of Fed Poverty Level | ACS | API | 2009-present | Every year | Percentage of the population with estimated annual income below the Federal Poverty Level. |
| Pct Below 200% of Fed Poverty Level | ACS | API | 2009-present | Every year | Percentage of the population with estimated annual income below 200% of the Federal Poverty Level. |
| Pct Below 300% of Fed Poverty Level | ACS | API | 2009-present | Every year | Percentage of the population with estimated annual income below 300% of the Federal Poverty Level. |
| Pct HH that receive SNAP | ACS | API | 2009-present | Every year | Percentage of households receiving food stamps/SNAP. |
| Pct HH with limited English | ACS | API | 2009-present | Every year | Percentage of households that speak English less than very well. |
| Pct HS Grad - Age 25 or Over | ACS | API | 2009-present | Every year | Educational attainment; estimated percentage of the population age 25 and over who are high school graduates. |
| Pct Bachelor’s Degree - Age 25 or Over | ACS | API | 2009-present | Every year | Educational attainment; estimated percentage of the population age 25 and over who hold a Bachelor’s degree. |
| Median Household Income | ACS | API | 2009-present | Every year | Estimated median annual household income in US dollars. These data range from $6,875 to $249,597. |
| Unemployment Rate - Age 16 or Over | ACS | API | 2009-present | Every year | Estimated unemployment rate for persons age 16 and older. |
| Annual Pollutant Data from SCEHSC - Ozone | SCEHSC | File | 1998–2009 | 2016 (All SCEHSC data released at once) | Air Quality Indicator - Ozone (O3) |
| Annual Pollutant Data from SCEHSC - Nitrogen Dioxide | SCEHSC | File | 1998–2009 | 2016 (All SCEHSC data released at once) | Air Quality Indicator - Nitrogen Dioxide (NO2) |
| Annual Pollutant Data from SCEHSC - PM2.5 | SCEHSC | File | 1998–2009 | 2016 (All SCEHSC data released at once) | Air Quality Indicator - Fine Particulate Matter <= 2.5 μm. |
| Annual Pollutant Data from SCEHSC - PM10 | SCEHSC | File | 1998–2009 | 2016 (All SCEHSC data released at once) | Air Quality Indicator - Particulate Matter <= 10 μm. |
| Annual Pollutant data from NASA - Ozone | NASA | File | 2000–2016 | Approximately every year until 2016 | |
| Annual Pollutant data from NASA - PM2.5 | NASA | File | 2000–2016 | Approximately every year until 2016 | |
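
Several of the variables described in the table are simple derived quantities. As a minimal illustrative sketch (not SEnDAE's actual implementation), the Python below shows how three of them could be computed from tract-level inputs: population density, the low-food-access tract flag, and a population-weighted aggregation from census blocks to a tract. The `Tract` class, its field names, and all function names are hypothetical, and the sketch assumes that the "significant number (500)" criterion means at least 500 individuals.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Tract:
    """Hypothetical per-tract inputs; field names are illustrative only."""
    population: int        # total population of the tract
    land_area_km2: float   # total land area in square kilometers
    low_access_pop: int    # residents living beyond the distance threshold


def population_density(tract: Tract) -> float:
    """Persons per square kilometer: total population / total land area."""
    if tract.land_area_km2 <= 0:
        raise ValueError("land area must be positive")
    return tract.population / tract.land_area_km2


def low_access_fraction(tract: Tract) -> float:
    """Fraction of the tract population living beyond the distance threshold."""
    if tract.population == 0:
        return 0.0
    return tract.low_access_pop / tract.population


def is_low_access_tract(tract: Tract,
                        min_count: int = 500,
                        min_share: float = 0.33) -> bool:
    """Flag a tract as low access if either criterion from the table is met.

    Assumption: "a significant number (500)" is read as at least 500 people.
    """
    return (tract.low_access_pop >= min_count
            or low_access_fraction(tract) >= min_share)


def population_weighted_mean(values: Sequence[float],
                             populations: Sequence[int]) -> float:
    """Aggregate block-level values to one tract value, weighted by population."""
    if len(values) != len(populations):
        raise ValueError("values and populations must have the same length")
    total = sum(populations)
    if total == 0:
        raise ValueError("total population must be positive")
    return sum(v * p for v, p in zip(values, populations)) / total


if __name__ == "__main__":
    example = Tract(population=4000, land_area_km2=2.0, low_access_pop=600)
    print(population_density(example))
    print(is_low_access_tract(example))
    print(population_weighted_mean([10.0, 20.0], [100, 300]))
```

The two low-access criteria are combined with a logical OR, following the wording of the table: a tract qualifies on either the count or the share.
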
Footnotes
Declaration of Competing Interest
JE is a paid consultant for AI Health. AI Health played no role in the design, execution, analysis, or write-up of this work, played no role in the decision to publish this manuscript, and had no editorial input.
References
- [1] Wilkinson RG, Marmot M, Social determinants of health: the solid facts, World Health Organization, 2003.
- [2] Commission on Social Determinants of Health, Closing the gap in a generation: health equity through action on the social determinants of health: final report of the Commission on Social Determinants of Health, World Health Organization, 2008.
- [3] Raphael D, Social determinants of health: present status, unanswered questions, and future directions, Int. J. Health Serv 36 (4) (2006) 651–677.
- [4] US Department of Health and Human Services, Healthy People 2030. Available from: https://health.gov/healthypeople/objectives-and-data/social-determinants-health
- [5] Centers for Disease Control and Prevention, About Social Determinants of Health (SDOH), 2021. Available from: https://www.cdc.gov/socialdeterminants/about.html
- [6] Lynch JW, Smith GD, Kaplan GA, House JS, Income inequality and mortality: importance to health of individual income, psychosocial environment, or material conditions, BMJ 320 (7243) (2000) 1200–1204.
- [7] Kirtchuk L, Wylie A, Social and environmental determinants of health, in: A Prescription for Healthy Living, Elsevier, 2021, pp. 3–15.
- [8] Cleland VJ, Ball K, Crawford D, Social and environmental determinants of health behaviors, in: Handbook of Behavioral Medicine, Springer, 2010, pp. 3–17.
- [9] Braveman P, Egerter S, Williams DR, The social determinants of health: coming of age, Annu. Rev. Public Health 32 (2011) 381–398.
- [10] Viner RM, Ozer EM, Denny S, Marmot M, Resnick M, Fatusi A, et al., Adolescence and the social determinants of health, Lancet North Am. Ed 379 (9826) (2012) 1641–1652.
- [11] McGovern L, Miller G, Hughes-Cromwick P, The relative contribution of multiple determinants to health, Health Affairs Health Policy Brief, 2014; 10.
- [12] Walker RJ, Smalls BL, Campbell JA, Strom Williams JL, Egede LE, Impact of social determinants of health on outcomes for type 2 diabetes: a systematic review, Endocrine 47 (1) (2014) 29–48.
- [13] Strominger J, Anthopolos R, Miranda ML, Implications of construction method and spatial scale on measures of the built environment, Int. J. Health Geogr 15 (1) (2016) 1–13.
- [14] O’Shea TM, McGrath M, Aschner JL, Lester B, Santos HP, Marsit C, et al., Environmental influences on child health outcomes: cohorts of individuals born very preterm, Pediatr. Res (2022) 1–16.
- [15] Ogunwole SM, Golden SH, Social determinants of health and structural inequities—root causes of diabetes disparities, Diabetes Care 44 (1) (2021) 11–13.
- [16] Jilani MH, Javed Z, Yahya T, Valero-Elizondo J, Khan SU, Kash B, et al., Social determinants of health and cardiovascular disease: current state and future directions towards healthcare equity, Curr. Atheroscler. Rep 23 (9) (2021) 1–11.
- [17] Welcome to the Geocoder!, US Census Bureau, updated 21 June 2022. Available from: https://geocoding.geo.census.gov/geocoder/
- [18] Glossary, U.S. Census Bureau, 11 April 2022. Available from: https://www.census.gov/programs-surveys/geography/about/glossary.html#par_textimage_13
- [19] Partners Healthcare, i2b2: Informatics for Integrating Biology & the Bedside. Available from: https://www.i2b2.org/
- [20] American Community Survey 5-Year Data (2009-2020), US Census Bureau, updated 17 March 2022. Available from: https://www.census.gov/data/developers/data-sets/acs-5year.2018.html
- [21] Geography Program, US Census Bureau, updated 8 June 2022. Available from: https://www.census.gov/programs-surveys/geography.html
- [22] CalEnviroScreen 4.0, Office of Environmental Health Hazard Assessment (OEHHA), 2022, updated 20 October 2021. Available from: https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40
- [23] CDC/ATSDR Social Vulnerability Index, US Department of Health & Human Services, 2022, updated 15 March 2022. Available from: https://www.atsdr.cdc.gov/placeandhealth/svi/
- [24] Food Access Research Atlas, US Department of Agriculture Economic Research Service, 2022, updated 14 March 2022. Available from: https://www.ers.usda.gov/data-products/food-access-research-atlas/
- [25] Southern California Environmental Health Sciences Center (SCEHSC). Available from: https://scehsc.usc.edu/
- [26] Requia WJ, Wei Y, Shtein A, Hultquist C, Xing X, Di Q, et al., Daily 8-Hour Maximum and Annual O3 Concentrations for the Contiguous United States, 1-km Grids, v1 (2000 - 2016), NASA Socioeconomic Data and Applications Center (SEDAC), 2021.
- [27] Di Q, Wei Y, Shtein A, Hultquist C, Xing X, Amini H, et al., Daily and Annual PM2.5 Concentrations for the Contiguous United States, 1-km Grids, v1 (2000 - 2016), NASA Socioeconomic Data and Applications Center (SEDAC), 2021.
- [28] Bader MD, Mooney SJ, Rundle AG, Protecting Personally Identifiable Information When Using Online Geographic Tools for Public Health Research, Am. J. Public Health 106 (2) (2016) 206–208.
- [29] Rivera B, Hoffman M, Technical strategies for real-time geocoding in healthcare, in: Proceedings of the IEEE International Smart Cities Conference (ISC2), IEEE, 2018.
- [30] Rundle AG, Bader MDM, Mooney SJ, The disclosure of personally identifiable information in studies of neighborhood contexts and patient outcomes, J. Med. Internet Res 24 (3) (2022) e30619.
- [31] Goldberg DW, Wilson JP, Knoblock CA, From text to geographic coordinates: the current state of geocoding, URISA J. 19 (1) (2007) 33–46.
- [32] Jones RR, DellaValle CT, Flory AR, Nordan A, Hoppin JA, Hofmann JN, et al., Accuracy of residential geocoding in the agricultural health study, Int. J. Health Geogr 13 (2014) 37.
- [33] Zandbergen PA, A comparison of address point, parcel and street geocoding techniques, Comput. Environ. Urban Syst 32 (3) (2008) 214–232.
- [34] Watkins M, Viernes B, Nguyen V, Mezarina LR, Valencia JS, Borbolla D, Translating social determinants of health into standardized clinical entities, Stud. Health Technol. Inform 270 (2020) 474.
- [35] Phuong J, Zampino E, Dobbins N, Espinoza J, Meeker D, Spratt H, et al., Extracting patient-level social determinants of health into the OMOP common data model, in: AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2021.
- [36] Robertson C, Feick R, Inference and analysis across spatial supports in the big data era: uncertain point observations and geographic contexts, Trans. GIS 22 (2) (2018) 455–476.
