Ann Epidemiol. 2020 Aug 20;51:7–13. doi: 10.1016/j.annepidem.2020.08.012

Black/African American Communities are at highest risk of COVID-19: spatial modeling of New York City ZIP Code–level testing results

Charles DiMaggio 1, Michael Klein 1, Cherisse Berry 1, Spiros Frangos 1
PMCID: PMC7438213  PMID: 32827672

Abstract

Purpose

The population and spatial characteristics of COVID-19 infections are poorly understood, but there is increasing evidence that in addition to individual clinical factors, demographic, socioeconomic, and racial characteristics play an important role.

Methods

We analyzed counts of positive COVID-19 test results within New York City ZIP Code Tabulation Areas with Bayesian hierarchical Poisson spatial models using integrated nested Laplace approximations.

Results

Spatial clustering accounted for approximately 32% of the variation in the data. There was a nearly five-fold increase in the risk of a positive COVID-19 test (incidence density ratio = 4.8, 95% credible interval 2.4, 9.7) associated with the proportion of black/African American residents. Increases in the proportion of residents older than 65 years, housing density, and the proportion of residents with heart disease were each associated with an approximate doubling of risk. In a multivariable model including estimates for age, chronic obstructive pulmonary disease, heart disease, housing density, and black/African American race, the only variables that remained associated with positive COVID-19 testing with a probability greater than chance were the proportion of black/African American residents and proportion of older persons.

Conclusions

Areas with large proportions of black/African American residents are at markedly higher risk that is not fully explained by characteristics of the environment and pre-existing conditions in the population.

Keywords: COVID-19, Disparity, Spatial analysis

Introduction

SARS-CoV-2 poses unprecedented clinical and public health challenges worldwide. Although much of the attention has been rightfully focused on the clinical aspects of the disease, epidemiological studies and prevention research are becoming increasingly important, particularly as no effective therapeutic has yet been identified [1]. Epidemiological and population-based studies can contribute to the identification of patient risk factors for disease severity. Recent studies of observational registry data have found COVID-19 mortality to be independently associated with coronary artery disease (odds ratio [OR] for mortality = 2.7; 95% CI: 2.1, 3.5), chronic obstructive pulmonary disease (COPD) (OR = 3.0; 95% CI: 2.0, 4.4), and age greater than 65 years (OR = 1.9; 95% CI: 1.6, 2.4) [2]. In one case series, 68% of laboratory-confirmed COVID-19 ICU patients had at least one comorbidity, of which hypertension was most common [3].

Not all risks, however, are physiologic. As the COVID-19 pandemic continues to ravage communities across the United States and the world, attention is increasingly turning to population-level demographic, socioeconomic, racial, and environmental risk factors for COVID-19. Blacks/African Americans have been reported to contract and die from COVID-19 at higher rates than others [4]. In Chicago, a large number of COVID-19 deaths are concentrated in five largely black neighborhoods [5]. A similar mortality concentration among black/African American persons has been reported in New Orleans [6]. At the built-environmental level, drivers of disease include population density [7] and housing density, with urban counties in the United States having the highest COVID-19 death rates [8].

Few regions of the United States have been more grievously affected than the five boroughs of New York City. A neighborhood-level analysis of New York City found higher rates of COVID-19 disease in areas with higher population shares of black/African American and Hispanic persons, and in areas with higher population density [9]. Although it certainly is possible that those affected have higher rates of underlying health conditions that may increase their susceptibility to the virus, the authors speculate that “residents of these neighborhoods are less likely to be able to work from home, disproportionately rely on public transit during the crisis, are less likely to have internet access,” and “have higher rates of overcrowding at the household level.”

In this study, we analyze positive COVID-19 testing result counts within New York City ZIP Code Tabulation Areas (ZCTAs) using Bayesian hierarchical Poisson spatial models with integrated nested Laplace approximations. We attempt to quantify the amount of spatial clustering in New York City neighborhoods, and the association of positive test counts in a neighborhood with population-level estimates of demographic, socioeconomic, health, and built environmental variables. The results quantify and provide insights into the complex interplay of individual and ecologic risks for COVID-19 spread and may be helpful in the effective allocation of testing resources and interventions in similar urban settings.

Methods

Data

COVID-19 test result data were obtained from the New York City Department of Health and Mental Hygiene (NYC DOHMH) GitHub Page. Variables consisted of ZCTA designation, total number of positive tests, and total number of tests performed. Files are updated approximately every 2 days. The data in these analyses were current as of 22 April 2020.

ZCTA-level data for total population; proportion of persons older than 65 years; number of persons self-identifying as black/African American, Asian, or Hispanic; number of persons older than 5 years speaking a language other than English; population density, housing density, and school density (number of people, housing structures, and schools per square mile, respectively); and proportion of persons receiving public assistance were obtained or derived from the U.S. Census [10].

We created a social fragmentation index based on the work of Congdon [11] which combines four variables extracted from U.S. Census variables: the proportion of total housing units in a ZCTA that are not owner occupied, the proportion of vacant housing units, the proportion of individuals living alone, and the proportion of units into which someone recently moved. Based on Census definitions, a “recent” move is defined as anytime in the previous 9 years (since the last decennial census). Variables are standardized and added with equal weight. The resulting variable is normally distributed with mean zero and 95% quantiles −2.5 and 2.2.
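The index construction described above (standardize four components, then add them with equal weight) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and input values are hypothetical, and standardizing with the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (assumed convention)."""
    center, spread = mean(values), stdev(values)
    return [(v - center) / spread for v in values]

def fragmentation_index(*components):
    """Add the standardized components with equal weight, one score per area."""
    standardized = [standardize(c) for c in components]
    return [sum(area_values) for area_values in zip(*standardized)]

# Hypothetical proportions for three areas (illustrative values only).
not_owner_occupied = [0.30, 0.55, 0.80]
vacant = [0.05, 0.08, 0.12]
living_alone = [0.20, 0.30, 0.45]
recently_moved = [0.40, 0.50, 0.70]

index = fragmentation_index(not_owner_occupied, vacant, living_alone, recently_moved)
```

Because each standardized component has mean zero, the summed index is also centered on zero, which matches the description of the resulting variable.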

Data on ZCTA health metrics were derived from shapefiles downloaded from the Simply Analytics company [12] and consisted of the number of persons in a ZCTA with heart disease or congestive heart failure (which are combined as a single metric) and the number of persons with COPD. The estimates are based on SimmonsLOCAL data, which are local approximations of national survey results of individual responses to questions regarding recent health events. The methodology has been described in full previously [13].

Spatial shapefiles of New York City ZCTAs were downloaded and derived from the New York City Department of City Planning [14]. The testing and covariate data were merged to the spatial shapefile data and restricted to ZCTAs with valid data entries. An adjacency matrix was created from the map file using the R tool spdep::poly2nb() and manually edited to create adjacencies between New York City boroughs using spdep::edit.nb().
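The two adjacency steps (derive neighbors from the map, then add links by hand) can be illustrated with a toy example. This sketch does not call spdep. It assumes, as a simplification, that two areas are neighbors whenever their polygons share a boundary vertex, and all names and coordinates are hypothetical.

```python
from itertools import combinations

def contiguity_neighbors(polygons):
    """Map each area to the set of areas sharing at least one boundary vertex."""
    nb = {name: set() for name in polygons}
    for a, b in combinations(polygons, 2):
        if set(polygons[a]) & set(polygons[b]):
            nb[a].add(b)
            nb[b].add(a)
    return nb

def add_link(nb, a, b):
    """Manually add a symmetric adjacency, e.g. between areas separated by water."""
    nb[a].add(b)
    nb[b].add(a)

# Hypothetical unit squares: A and B touch; C is separated from both.
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(3, 0), (4, 0), (4, 1), (3, 1)],
}
nb = contiguity_neighbors(polygons)
add_link(nb, "B", "C")  # manual edit so that C is not an isolated area
```

Without the manual link, area C would have no neighbors and would contribute nothing to a spatially structured term, which is the practical reason for editing adjacencies between boroughs.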

Statistical analysis

After merging the testing to the covariate data, descriptive statistics consisted of counts, means and medians, and maps of the number of positive COVID-19 tests per 10,000 total population and 10,000 tests performed in a ZCTA.
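The two descriptive rates differ only in their denominator. A minimal sketch, using hypothetical counts for a single ZCTA:

```python
def rate_per_10k(positives, denominator):
    """Positive tests per 10,000 units of the chosen denominator."""
    return 10_000 * positives / denominator

# Hypothetical counts for one ZCTA (not taken from the study data).
positives, tests, population = 520, 1_000, 30_000

per_10k_tests = rate_per_10k(positives, tests)
per_10k_population = rate_per_10k(positives, population)
```

The tests-based rate is a scaled positivity proportion, while the population-based rate also depends on how many residents were tested.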

Counts of positive COVID-19 test results in New York City ZCTAs were spatially modeled in accordance with Besag-York-Mollié as described by Lawson [15–17].

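$$
\begin{aligned}
y_i &\sim \mathrm{Pois}(\lambda_i = e_i \theta_i) \\
\log(\theta_i) &= \beta x_i + \upsilon_i + \eta_i \\
\upsilon_i &\sim N(0, \tau_\upsilon) \\
\eta_i &\sim N(\bar{\eta}_\delta, \tau_\eta / n_\delta)
\end{aligned}
$$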

where,

  1. The counts $y_i$ in area $i$ are independently and identically Poisson distributed, with an expectation in area $i$ of $e_i$, the expected count, times $\theta_i$, the risk for area $i$.

  2. A logarithmic transformation ($\log(\theta_i)$) allows a linear, additive model of regression terms ($\beta x_i$), along with

  3. a spatially unstructured random effects component ($\upsilon_i$) that is i.i.d. normally distributed with mean zero ($\sim N(0, \tau_\upsilon)$), and

  4. a conditional autoregressive spatially structured component ($\eta_i \sim N(\bar{\eta}_\delta, \tau_\eta / n_\delta)$) in which a “neighborhood” consisting of spatially adjacent shapes is characterized by the normally distributed mean of the spatially structured random effect terms for the spatial shapes that make up the neighborhood ($\bar{\eta}_\delta$), and the standard deviation of that mean divided by the number of spatial shapes in the neighborhood ($\tau_\eta / n_\delta$). This spatially structured conditional autoregression component is also sometimes described as a Gaussian process $\lambda \sim N(W, \tau_\lambda)$, where $W$ represents the matrix of neighbors that defines the neighborhood structure, and the conditional distribution of each $\lambda_i$, given all the other $\lambda_i$, is normal with $\mu$ equal to the average $\lambda$ of its neighbors and a precision ($\tau_\lambda$).
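The two pieces of arithmetic in the list above, the neighborhood mean that centers the structured term and the Poisson expectation built from the log-linear predictor, can be sketched in a few lines. This is an illustration under stated assumptions rather than a model fit: the function names, the neighbor structure, and every numeric value are hypothetical.

```python
import math

def car_conditional(eta, neighbors, area, tau_eta):
    """Return the neighborhood mean of eta and tau_eta / n_delta for one area."""
    nbrs = neighbors[area]
    neighborhood_mean = sum(eta[j] for j in nbrs) / len(nbrs)
    return neighborhood_mean, tau_eta / len(nbrs)

def expected_count(e_i, linear_predictor):
    """lambda_i = e_i * theta_i, where log(theta_i) is the linear predictor."""
    return e_i * math.exp(linear_predictor)

# Hypothetical structured effects and neighbor lists for three areas.
eta = {"A": 0.2, "B": -0.1, "C": 0.5}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

mean_a, scale_a = car_conditional(eta, neighbors, "A", tau_eta=1.0)

# Hypothetical regression term (0.1) and unstructured effect (0.05) for area A.
lam_a = expected_count(e_i=100.0, linear_predictor=0.1 + 0.05 + eta["A"])
```

An area with more neighbors gets a smaller second value, so its structured effect is pulled more tightly toward the neighborhood mean.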

A baseline convolution model that consisted solely of an intercept term with unstructured and spatially structured random effect terms was extended to include univariate association of explanatory variables with the number of positive COVID-19 tests in a ZCTA. Important and likely associations were chosen for inclusion in a multivariable model with the primary exposure variable being the proportion of black/African American residents in an area and additional explanatory variables included as potential confounders.

The final linear model consisted of an intercept ($\beta_0$); a vector of scaled ZCTA-level explanatory variables ($\beta x_i^T$) for the proportion of persons in a ZCTA identifying as black/African American, the proportion with COPD, the proportion with heart disease, the proportion older than 65 years, and a measure of housing density; a spatially unstructured random effect term ($\upsilon_i$); and a spatially structured conditional autoregression term ($\eta_i$). An offset variable for the total number of tests was included in all models. Model selection was based on deviance information criteria and number of effective parameters.

$$\log(\theta_i) = \beta_0 + \beta x_i^T + \upsilon_i + \eta_i + (\text{offset})$$

The spatially unstructured random effect term captures normally distributed or Gaussian random variation around the mean or intercept. The spatially structured conditional autoregression term accounts for local geographic influence. The intercept is interpreted as the average citywide risk on the log scale adjusted for the covariates, random effects and spatial terms. The exponentiated coefficients for the explanatory covariates are interpreted as incidence density ratios. Coefficient results are presented with 95% Bayesian credible intervals (95% Cr I).
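Moving from a log-scale coefficient to an incidence density ratio is a single exponentiation, and an IDR can be restated as a percent change in risk per unit increase in the scaled covariate. The sketch below uses a hypothetical log-scale posterior summary; the only value taken from the paper is the IDR of 0.54 reported for median household income.

```python
import math

def to_idr(log_coefficient):
    """Exponentiate a log-scale coefficient to obtain an incidence density ratio."""
    return math.exp(log_coefficient)

def percent_change(idr):
    """Percent change in risk per one-unit increase in the scaled covariate."""
    return (idr - 1.0) * 100.0

# Hypothetical log-scale summary: (mean, 2.5% quantile, 97.5% quantile).
log_summary = (0.69, 0.18, 1.20)
idr_summary = tuple(round(to_idr(v), 2) for v in log_summary)

# IDR of 0.54 (median household income), restated as a percent change.
income_change = percent_change(0.54)
```

Exponentiating the interval endpoints preserves their order, so a log-scale credible interval that excludes zero corresponds to an IDR interval that excludes one.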

Spatial risk, controlling for or holding the covariates constant, was calculated as $\zeta_i = \upsilon_i + \eta_i$ [18], and is interpreted as the residual spatial risk for each area (compared with all of New York City) after covariates and spatial clustering are taken into account. Finally, the proportion of spatially explained variance was calculated as the proportion of total spatial heterogeneity accounted for by the spatially structured conditional autoregression variance.
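One way to read the proportion of spatially explained variance is as the structured variance divided by the sum of the structured and unstructured variances. The sketch below assumes that reading and uses the empirical (population) variance of a handful of hypothetical structured effects. It does not reproduce the INLA-based calculation, and all values are illustrative.

```python
from statistics import pvariance

def spatial_fraction(structured_effects, var_unstructured):
    """Share of total heterogeneity attributable to the structured component."""
    var_structured = pvariance(structured_effects)
    return var_structured / (var_structured + var_unstructured)

# Hypothetical structured effects (eta_i) and unstructured variance.
eta_values = [-0.4, -0.1, 0.1, 0.4]
fraction = spatial_fraction(eta_values, var_unstructured=0.18)
```

A fraction near one would mean that nearly all residual heterogeneity follows the neighborhood structure; a fraction near zero would mean it is mostly unstructured noise.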

Spatial modeling was conducted using integrated nested Laplace approximations (INLA) with the R INLA package [19] using approaches described by Blangiardo et al. [18]. Code to reproduce the analyses is available at: http://www.injuryepi.org/resources/Misc/covidINLA_onlineCode.html.

The study protocol was exempted as not human research by the New York University School of Medicine Institutional Review Board.

Results

Descriptive statistics

There were 177 ZCTAs in the data set. The mean COVID-19 rate of positive tests per 10,000 ZCTA population was 166.2 (95% CI: 156.7, 175.7). The mean COVID-19 rate of positive tests per 10,000 tests was 5176.0 (95% CI: 5045.9, 5306.1) and appeared skewed and peaked, indicating that a relatively small number of ZCTAs accounted for the highest rates (Fig. 1). The 5 ZCTAs with the highest positive COVID-19 test numbers per 10,000 population were the same as those with the highest proportion per 10,000 tests (10464, 10470, 10455, 10473, 11234, and 11210). The 5 lowest ZCTAs were also the same for both measures (11103, 11102, 11693, 11369, 11363, and 10308). Table 1 presents comparative statistics for the ZCTAs with the highest and lowest quantiles for population-based rates of positive tests. Figure 2 presents a choropleth of positive COVID-19 tests per 10,000 tests.

Fig. 1.

Rate of positive COVID-19 tests per 10,000 tests. New York City, April 3–22, 2020.

Table 1.

Comparative descriptive statistics high versus low quantile COVID-19 ZIP Code Tabulation Areas (New York City, April 3–22, 2020)

| Variable | All (SE) | High (SE) | Low (SE) | P-value for difference |
|---|---|---|---|---|
| Median household income | 57,758.7 (24,986.7) | 55,314.5 (19,700.6) | 82,917 (27,557.0) | .001 |
| School density | 5.1 (4.6) | 2.7 (2.0) | 7.289 (5.4) | .001 |
| Population density | 16,584.9 (11,770.9) | 9486.7 (7238.2) | 26,000.1 (13,418.6) | .001 |
| Housing density | 18,165.2 (19,748.0) | 8784.8 (6788.2) | 37,361.7 (33,665.0) | .001 |
| Congdon index | −0.089 (2.0) | −1.1 (2.0) | 1.603 (2.0) | .001 |
| Proportion black | 0.23 (0.26) | 0.36 (0.31) | 0.070 (0.13) | .001 |
| Proportion Hispanic | 0.12 (0.05) | 0.13 (0.05) | 0.12 (0.05) | .06 |
| Heart disease | 0.11 (0.21) | 0.17 (0.27) | 0.07 (0.16) | .1 |
| Chronic obstructive pulmonary disease | 2.01 (1.93) | 2.23 (2.48) | 1.55 (1.42) | .2 |

Fig. 2.

Choropleth quintiles number of positive COVID-19 tests per 10,000 tests. New York City, April 3–22, 2020.

Spatial models

A frailty model consisting of only a random effect term and no explicit spatial component returned a deviance information criterion of 1831.58, with 174.5 effective parameters. The random effect term was normally distributed around the mean value of 64.9 (SD = 1.1; 95% Cr I: 55.5, 75.6) reflecting the random nature of the distribution of the unstructured heterogeneity or variance.

A convolution model, with a spatially structured conditional autoregression term added to the spatially unstructured heterogeneity random effect term of the frailty model, returned a deviance information criterion of 1807.60 (with 175.98 effective parameters), reflecting an improvement over the baseline unstructured heterogeneity frailty model and indicating that the spatial component added information to the simple unstructured model. In Figure 3, the spatial risk estimate is calculated as the sum of the unstructured and spatially structured variance components ($\zeta = \upsilon + \eta$). Finally, we estimate the proportion of the variance explained by geographic variation or place, which for this model is approximately 32%.
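The model comparison rests on the difference between the two reported deviance information criteria, where a lower value indicates the preferred model. A minimal sketch using the two values reported in the text; the function name is hypothetical.

```python
def dic_improvement(dic_baseline, dic_candidate):
    """Positive values mean the candidate model has the lower (preferred) DIC."""
    return dic_baseline - dic_candidate

# DIC values reported for the frailty model and the convolution model.
frailty_dic = 1831.58
convolution_dic = 1807.60

improvement = dic_improvement(frailty_dic, convolution_dic)
```

The convolution model lowers the DIC by about 24 points while adding fewer than two effective parameters (175.98 versus 174.5).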

Fig. 3.

Choropleth quantiles spatial risk estimates (sum of unstructured and spatially structured variance) positive COVID-19 tests per 10,000 tests. New York City, April 3–22, 2020.

Simple and multivariable models

The convolution model is extended to include a series of simple, single-variable, ecological-level models examining the unadjusted bivariate association of population, housing, income, social fragmentation, population characteristics, and clinical conditions with positive COVID-19 test counts. Table 2 summarizes the results of this series of unadjusted single covariate models of associations with positive COVID-19 test counts. The single strongest unadjusted bivariate association is for the proportion of persons in a ZCTA with COPD, which returned an incidence density ratio (IDR) of 8.2 (95% Cr I: 3.7, 18.3), indicating that for each single unit increase in the standardized proportion of persons in a ZCTA with COPD, there was an eight-fold increased risk of an additional positive COVID-19 test in that ZCTA. The proportion of black/African American residents in a ZCTA was also strongly associated with the risk of positive COVID-19 tests. For every one unit increase in a scaled standardized measure of the proportion of black/African American residents, there was a nearly five-fold increase in the risk of a positive COVID-19 test (IDR = 4.8; 95% Cr I: 2.4, 9.7).

Table 2.

Summary of the series of unadjusted single covariate Bayesian hierarchical Poisson models for association with positive COVID-19 test counts in New York City ZIP Code Tabulation Areas, April 3–22, 2020

| Model | IDR | 2.5% | 97.5% |
|---|---|---|---|
| Population density | 1.5 | 1.1 | 2.2 |
| Median household income | 0.5 | 0.4 | 0.7 |
| School density | 0.8 | 0.6 | 1.2 |
| Older than 65 years | 1.9 | 1.6 | 2.4 |
| Asian | 0.4 | 0.2 | 0.8 |
| Housing density | 2.0 | 1.2 | 3.2 |
| Congdon index | 0.8 | 0.8 | 0.9 |
| Language | 1.3 | 0.9 | 1.8 |
| Black/African American | 4.8 | 2.4 | 9.7 |
| Hispanic | 1.2 | 0.9 | 1.6 |
| Heart disease | 2.1 | 1.5 | 2.9 |
| COPD | 8.2 | 3.7 | 18.3 |

IDR = incidence density ratio for bivariate association of explanatory covariates with positive test counts in ZIP Code Tabulation Area; COPD = chronic obstructive pulmonary disease.

Variables for population density, proportion of residents older than 65 years, housing density, and heart disease were also associated with increased risk of positive COVID-19 testing rates. Median household income in a ZCTA community was inversely related to positive COVID-19 tests. For each unit increase in a standardized measure of median household income in a ZCTA, there is an approximately 46% decrease in the number of positive COVID-19 tests (IDR = 0.54; 95% Cr I: 0.43, 0.69). Other variables that were associated with lower positive tests were proportion of Asian and proportion of Hispanic residents and increased measures of social fragmentation. School density, proportion of persons not speaking English, and the proportion of persons on public assistance were not associated with positive COVID-19 testing rates.

Single-variable models were followed by multivariable models. In a multivariable model including COPD, heart disease, proportion of black/African American residents, housing density, and age greater than 65 years, the only 2 variables that remained associated with positive COVID-19 testing with a probability greater than chance were the proportion of black/African American residents and older persons (Table 3). Proportion of black/African American residents was the strongest predictor of higher positive testing rates in a community regardless of other factors.

Table 3.

Summary of the multivariable Bayesian hierarchical Poisson model for association with positive COVID-19 test counts in New York City ZIP Code Tabulation Areas, April 3–22, 2020

| Variable | IDR | 2.5% | 97.5% |
|---|---|---|---|
| Intercept | 353.82 | 197.66 | 632.23 |
| COPD | 2.32 | 0.92 | 5.85 |
| Heart disease | 1.27 | 0.88 | 1.83 |
| Black/African American | 2.29 | 1.13 | 4.68 |
| Older than 65 years | 1.50 | 1.17 | 1.92 |
| Housing density | 1.08 | 0.65 | 1.78 |

IDR = incidence density ratio for bivariate association of explanatory covariates with positive test counts in ZIP Code Tabulation Area; COPD = chronic obstructive pulmonary disease.

Discussion

Despite the recent onset of the current COVID-19 pandemic, there is already growing evidence about both individual risk factors and population-level drivers of disease and mortality. Our results are consistent with recent reports of a correlation between the percentage of black/African Americans living in a U.S. county and the percentage of confirmed COVID-19 cases and deaths [20], of a nearly threefold greater risk of hospitalization for COVID-19 for black/African Americans in northern California [21], and of an appreciably greater risk of COVID-19 diagnoses in counties with higher proportions of black residents (RR = 1.24, CI 1.17–1.33) [22]. This study adds to a number of very recent similar spatial analyses of ZCTA-level testing data released by the New York City Department of Health and Mental Hygiene [23–25] and illustrates the importance of sharing these kinds of data, as well as the informative nature of spatial epidemiology as the pandemic evolves across the nation and the world. Consistent with prior reports, we find that the clustering of positive COVID-19 testing results in New York City is unlikely to be due to chance [9,24] and is driven in large measure by socioeconomics, age distribution [25], and race [9,24].

Our study adds to this by demonstrating that the proportion of residents self-identifying as black/African American is among the strongest unadjusted bivariate predictors of the proportion of positive tests in a community. The only stronger such predictor is the proportion of residents with COPD, which at 8 times the risk of areas with less COPD is stunning. But perhaps the more unexpected finding is that when black/African American race and COPD are considered jointly, it is race that appears to be the stronger predictor. Unlike a previous New York City–based report [9], we did not find an independent risk associated with the proportion of Hispanic residents. It may be that census estimates of black/African American persons include persons who also identify as Hispanic. Three of the 5 ZCTAs with the highest positive COVID-19 test numbers per 10,000 population were in areas of the Bronx with large proportions of Hispanic and Latino residents. Disparities may also vary depending in part on how well-established Hispanic communities are within cities and states [26].

The question of why COVID-19 affects one community more severely than another may provide clues to crucial questions about who is at risk and why [27]. Our study indicates place is important. We find about a third of the variance in a simple spatial model can be accounted for by place. We found risk to be approximately doubled by environmental characteristics such as population and housing density. This complements a report of a nonspatial, linear multivariable regression model of similar data that reported that 72% of variance could be attributed to individual characteristics such as household size, gender, age, race, and immigration status [23].

If ecologic and spatial analyses can provide clues, those clues cannot on the basis of these analyses necessarily point to individual-level or biological risk factors. While there are preliminary reports that infection with SARS-CoV-2 may be associated with the type A blood group [28], and that severe COVID-19 is driven in part by coagulopathies that may be associated with Factor VIII and von Willebrand factor [29], the relationship of such factors with race is complex [30] and cannot be supported by these results. These results must be interpreted in the context of place. Predominantly nonwhite neighborhoods are likely to be poorer, with less access to routine health services, which can lead to greater risks of many disease outcomes. Largely nonwhite neighborhoods may also have a larger proportion of persons at increased risk of exposure. Black/African Americans make up a large proportion of persons providing direct services to COVID-19 patients in New York City. By one account, 80% of nonmedical staff in New York City's hard-hit public hospitals are black/African American or Hispanic. It would be consistent with our findings that the neighborhoods in which these persons live have higher rates of disease, and this may point toward an increased emphasis on personal protective equipment for essential workers [31].

Interestingly, community-level proportions of comorbidities, which are associated with disease severity, were also associated with disease acquisition in simple bivariate models. However, they generally fell out of significance when race was added to the model. The incidence density ratio for the association of positive tests with an important risk like COPD dropped from eight to nonsignificance when race was included in the model, indicating that race may be more strongly associated with positive tests. Placing risk factors in context, both within and across populations, may be key. New York City and Chicago appear to differ in the factors associated with disease clusters and hot spots. New York City hot spots may be associated with service workers. Chicago hot spots are in neighborhoods with high poverty rates [24]. It will be increasingly important to conduct comparative studies.

Ecological studies can offer a view of disease processes in a community, but it may be a fractured view. Measures such as school density and social fragmentation may not be measuring what we think they are measuring; the number of schools in an area, rather than acting as a disease multiplier, may be a measure of the strength of the tax base. Similarly, the Congdon index treats empty houses as a measure of disorder, which can be correlated with a number of social ills. But empty houses may indicate less dense neighborhoods, which may be associated with less person-to-person disease spread. The proportion of non-English speakers in a given ZCTA may be biased by a lack of self-reporting by undocumented immigrants. And, as in any ecologic study, it is not certain that the persons with the risk factor being studied are those who are developing the outcome.

SARS-CoV-2 testing results are imperfect, with numbers likely to be biased by the availability of testing. But we would expect that bias to be in the direction of increased counts in areas with higher socioeconomic status. Consistent with our findings, a recent geographic analysis reported that persons in poorer New York City neighborhoods were less likely to be tested but, once tested, were more likely to test positive [22]. It is partly for this reason that we chose to base most of our analyses on the proportion of positive tests, rather than the population-based rates of positive tests, an approach taken by others [22].

Despite these caveats, it is difficult to overlook the interplay of race and COVID-19. Race appears to be an indicator of risk independent of social status, income, built environment, or even underlying health. This finding has implications not only for justice and equity, but for an effective response to the pandemic.

Footnotes

Charles DiMaggio: Conceptualization, Methodology, Data curation; Michael Klein: Writing - reviewing & editing, Conceptualization; Cherisse Berry: Conceptualization, Reviewing and Editing; Spiros Frangos: Conceptualization, Reviewing and Editing.

References


Articles from Annals of Epidemiology are provided here courtesy of Elsevier
