Author manuscript; available in PMC: 2021 Dec 3.
Published in final edited form as: Epidemiology. 2021 Mar 1;32(2):157–161. doi: 10.1097/EDE.0000000000001314

Estimating the unknown: greater racial and ethnic disparities in COVID-19 burden after accounting for missing race/ethnicity data

Katie Labgold 1,*, Sarah Hamid 1,*, Sarita Shah 1,2,3, Neel R Gandhi 1,2,3, Allison Chamberlain 1, Fazle Khan 4, Shamimul Khan 4, Sasha Smith 4, Steve Williams 4, Timothy L Lash 1, Lindsay J Collin 1,5
PMCID: PMC8641438  NIHMSID: NIHMS1754433  PMID: 33323745

Abstract

Black, Hispanic, and Indigenous persons in the United States have an increased risk of SARS-CoV-2 infection and death from COVID-19, due to persistent social inequities. Yet the magnitude of the disparity is unclear because race/ethnicity information is often missing in surveillance data. We quantified the burden of SARS-CoV-2 notification, hospitalization, and case fatality rates in an urban county by racial/ethnic group using combined race/ethnicity imputation and quantitative bias analysis for misclassification. The ratio of the absolute racial/ethnic disparity in notification rates after bias adjustment, compared with the complete case analysis, increased 1.3-fold and 1.6-fold for classified Black and Hispanic persons in reference to classified White persons, respectively. These results highlight that complete case analyses may underestimate absolute disparities in notification rates. Complete reporting of race/ethnicity information is necessary for health equity. When data are missing, quantitative bias analysis methods may improve estimates of racial/ethnic disparities in the COVID-19 burden.

Keywords: SARS-CoV-2, COVID-19, missing data, bias analysis, race/ethnicity disparities, surveillance

Introduction

In the United States, early surveillance reports highlight that persons of Hispanic, Black, and American Indigenous race and ethnicity are disproportionately affected by the COVID-19 pandemic.1 These disparities arise from historical and contemporary social and health inequities that result from structural racism, including racial capitalism, the systemic exploitation of Black, Indigenous, and People of Color by predominantly White institutions for social and economic gain.2–5 In the COVID-19 pandemic, racial capitalism produces structurally unequal exposure to (and protection from) SARS-CoV-2 infection.3

The role of systemic racism in the pandemic motivates the need for accurate surveillance of racial/ethnic disparities in SARS-CoV-2 infection and death. However, there are challenges in estimating COVID-19 racial/ethnic disparities.6,7 Although reports highlight the unequal burden across racial/ethnic groups, the magnitude of disparities is uncertain due to missing race/ethnicity information in surveillance data. In recent reports, race/ethnicity was missing in 56% of confirmed infections nationally, and in 36% in Georgia.8,9 Current surveillance estimates are reported as complete case analyses, which exclude cases with missing race/ethnicity.1,6,9,10 Complete case analyses may bias racial/ethnic disparity estimates if race/ethnicity information is not missing completely at random.11

Beginning in August 2020, the Department of Health and Human Services issued COVID-19 reporting guidelines requiring all labs to report race/ethnicity.12 These guidelines seek to address missing data moving forward, but they do not address missing information for case-patients identified before August 2020. Collecting race/ethnicity information at the time of testing is essential for understanding, and ultimately addressing, racial/ethnic health disparities. Until complete reporting becomes routine, imputation of missing race/ethnicity, combined with quantitative bias analysis to account for misclassification of the imputed race/ethnicity, can improve estimates of the COVID-19 burden among racial/ethnic groups.13 In this study, we calculate SARS-CoV-2 notification, hospitalization, and case fatality rates by race/ethnicity group and report the absolute racial/ethnic disparities in SARS-CoV-2 notification rates in Fulton County, Georgia, accounting for missing race/ethnicity information.

Methods

Fulton County, Georgia, includes the city of Atlanta; its residents identify as Black (44%), White (40%), Hispanic (7%), Asian (7%), and Other races/ethnicities (2%).14 Between 2 March 2020 and 18 August 2020, 19,623 cases of SARS-CoV-2 infection were reported among Fulton County residents. Case reports included the case-patient’s residential address, full name, race/ethnicity, hospitalization (yes/no/unknown), and death (yes/no/unknown). We use the term “case-patient” to capture the definitions of both case (an occurrence of a clinical condition) and patient (an individual with a clinical condition); the term is commonly used in surveillance of disease outbreaks by public health organizations.15 Fulton County Board of Health staff geocoded case-patients’ addresses to census block groups. For this analysis, we categorized reported race/ethnicity as Hispanic (any race), and non-Hispanic Black, Asian, White, or Other. The Other race/ethnicity category included Indigenous Americans (1.3%), Native Hawaiian/Other Pacific Islanders (1.5%), and those who reported their race as “Other” (97%).16

We used quantitative bias analysis to account for missing race/ethnicity. Quantitative bias analysis entails imputation of race/ethnicity for case-patients who were missing this information, and then bias-adjusting estimates to account for the imputation algorithm’s misclassification of race/ethnicity. Hereafter, we refer to race/ethnicity as reported when provided in case-patient records, imputed when referring to the imputed case-patient race/ethnicity, and classified when referring to the combined reported and imputed race/ethnicity after bias adjustment.

First, for all case-patients with complete race/ethnicity information (n=12,492, 64%), we imputed race/ethnicity using the Bayesian Improved Surname Geocoding method, both to validate this method in our study population and to generate estimated values for the bias parameters used in the quantitative bias analysis.14 The Bayesian Improved Surname Geocoding method is the current standard method for race/ethnicity prediction.17,18 This method estimates the probability that a person belongs to one of five racial/ethnic groups (Black, Hispanic, Asian, White, or Other) based on the person’s surname and residential census block group, the population distribution of race/ethnicity in the census block group, and race/ethnicity associated with a national list of surnames. The approach was previously validated with data from nearly 2 million individuals, and imputed race/ethnicity was correlated (0.76) with self-reported race/ethnicity.17,18 However, replication has been inconsistent across other studies.19 We addressed imperfect imputation with probabilistic bias analysis. Imputation was performed using the R package “wru,” which includes the 2010 surname census distribution. The geographic distribution of race/ethnicity came from the 2018 5-year American Community Survey.20,21 We calculated predictive values (PV) for each imputed race/ethnic group using reported race/ethnicity as the gold standard. The PV for a group is the probability that a person imputed to that group was also reported as belonging to it.13
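To illustrate how surname and geography information are combined, the following is a minimal Python sketch of the Bayes update that underlies Bayesian Improved Surname Geocoding. It is not the wru implementation used in the analysis. The function name `bisg_posterior` and every probability below are hypothetical values chosen for illustration, and assigning the single most probable group is an assumption of this sketch.

```python
GROUPS = ["Black", "Hispanic", "Asian", "White", "Other"]

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    posterior(r) is proportional to P(r | surname) * P(block group | r).
    """
    unnormalized = {
        g: p_race_given_surname[g] * p_geo_given_race[g] for g in GROUPS
    }
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}

# HYPOTHETICAL inputs (not taken from census files or the study data).
# Probability of each group given the surname alone.
surname_prior = {
    "Black": 0.30, "Hispanic": 0.05, "Asian": 0.02, "White": 0.60, "Other": 0.03,
}
# Share of each group's total population that lives in this block group.
geo_likelihood = {
    "Black": 0.0008, "Hispanic": 0.0002, "Asian": 0.0001,
    "White": 0.0001, "Other": 0.0002,
}

posterior = bisg_posterior(surname_prior, geo_likelihood)
imputed_group = max(posterior, key=posterior.get)  # most probable group
print(imputed_group, round(posterior[imputed_group], 2))
```

In this made-up case the surname alone points to White, but residence in a block group where Black residents are concentrated shifts the highest posterior probability to Black. This is the kind of trade-off that makes a validation step against reported race/ethnicity necessary.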

Second, among case-patients with missing race/ethnicity, we imputed the race/ethnicity category and used the PVs from the validation study to bias-adjust for the expected misclassification of the imputed race/ethnicity groups. We assigned a Dirichlet distribution to the PVs for each imputed race/ethnicity group from the validation study (Table 1).13,22,23 Among those with imputed race/ethnicity, we reclassified individuals over 100,000 iterations using probabilistic bias analysis.13 The approach uses Monte Carlo sampling to generate frequency distributions of the bias-adjusted estimates, which account for inaccurate assignment of case-patients to a race/ethnicity group by the Bayesian Improved Surname Geocoding method. Sampling error was incorporated into the estimates using bootstrap approximation from a standard normal distribution.13
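The reclassification step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code (their example code is in the GitHub repository cited under Data Access). It assumes the Dirichlet parameters are the Table 1 column counts, uses HYPOTHETICAL imputed counts for the 7,131 case-patients with missing data (the source does not report them), runs 200 iterations rather than 100,000, and omits the sampling-error step.

```python
import random
from collections import Counter
from statistics import median

GROUPS = ["Black", "Hispanic", "Asian", "White", "Other"]

# Validation counts from Table 1, keyed by IMPUTED group (Table 1 columns);
# each list gives the REPORTED groups in GROUPS order.
VALIDATION = {
    "Black":    [5106, 77, 16, 192, 135],
    "Hispanic": [68, 1288, 15, 103, 69],
    "Asian":    [13, 16, 145, 28, 12],
    "White":    [1754, 230, 80, 2818, 303],
    "Other":    [11, 6, 4, 2, 1],
}
# Reported counts among the 12,492 complete cases (Table 1 row totals).
REPORTED = {"Black": 6952, "Hispanic": 1617, "Asian": 260,
            "White": 3143, "Other": 520}
# HYPOTHETICAL imputed counts for the 7,131 cases with missing data.
IMPUTED_MISSING = {"Black": 3400, "Hispanic": 1100, "Asian": 200,
                   "White": 2400, "Other": 31}

def draw_dirichlet(alphas, rng):
    """Sample a probability vector from Dirichlet(alphas) via gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def one_iteration(rng):
    """Reclassify the imputed cases once; return classified counts by group."""
    counts = Counter(REPORTED)
    for imputed, n in IMPUTED_MISSING.items():
        probs = draw_dirichlet(VALIDATION[imputed], rng)
        # Multinomial draw: each imputed case is reassigned independently.
        counts.update(rng.choices(GROUPS, weights=probs, k=n))
    return counts

def simulate(iterations=200, seed=2020):
    """Return {group: (median, lower, upper)} over the simulated counts."""
    rng = random.Random(seed)
    runs = [one_iteration(rng) for _ in range(iterations)]
    summary = {}
    for g in GROUPS:
        values = sorted(run[g] for run in runs)
        lower = values[int(0.025 * (len(values) - 1))]  # approximate percentile
        upper = values[int(0.975 * (len(values) - 1))]
        summary[g] = (median(values), lower, upper)
    return summary

summary = simulate()
for g in GROUPS:
    print(g, summary[g])
```

Each iteration preserves the total of 19,623 cases and only moves imputed case-patients between groups, so the spread of the simulated counts reflects uncertainty in the PVs and in the reassignment, not in the number of cases.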

Table 1:

Predictive values (PV) and 95% confidence intervals (CI) of the imputation by race/ethnicity based on residence and surname compared with reported race/ethnic group in the State Electronic Notifiable Disease Surveillance System

| Reported race/ethnicity | Imputed Black | Imputed Hispanic | Imputed Asian | Imputed White | Imputed Other |
|---|---|---|---|---|---|
| Black | 5,106 | 68 | 13 | 1,754 | 11 |
| Hispanic | 77 | 1,288 | 16 | 230 | 6 |
| Asian | 16 | 15 | 145 | 80 | 4 |
| White | 192 | 103 | 28 | 2,818 | 2 |
| Other | 135 | 69 | 12 | 303 | 1 |
| Total | 5,526 | 1,543 | 214 | 5,185 | 24 |
| PV % (95% CI) | 92% (92%, 93%) | 83% (82%, 85%) | 68% (61%, 74%) | 54% (53%, 56%) | 3.0% (0.1%, 15%) |
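As a worked check of how a predictive value is read off Table 1, the short Python sketch below divides the diagonal count of each imputed column by that column's total. The Wilson score interval is an assumption made here for illustration, since the interval method used in the analysis is not stated, so the intervals may differ slightly from the tabulated ones.

```python
import math

GROUPS = ["Black", "Hispanic", "Asian", "White", "Other"]

# Table 1 counts: keys are reported groups, list positions are imputed groups.
TABLE_1 = {
    "Black":    [5106, 68, 13, 1754, 11],
    "Hispanic": [77, 1288, 16, 230, 6],
    "Asian":    [16, 15, 145, 80, 4],
    "White":    [192, 103, 28, 2818, 2],
    "Other":    [135, 69, 12, 303, 1],
}

def predictive_value(imputed_group):
    """Return (PV, number imputed) for one imputed group (a Table 1 column)."""
    col = GROUPS.index(imputed_group)
    column_total = sum(TABLE_1[reported][col] for reported in GROUPS)
    agree = TABLE_1[imputed_group][col]  # imputed and reported groups match
    return agree / column_total, column_total

def wilson_interval(p, n, z=1.96):
    """Wilson score interval (an assumed method, not necessarily the paper's)."""
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for group in GROUPS:
    pv, n = predictive_value(group)
    low, high = wilson_interval(pv, n)
    print(f"{group:9s} PV = {pv:.1%} ({low:.1%}, {high:.1%}), n imputed = {n}")
```

The point estimates for the Black, Hispanic, Asian, and White columns round to the tabulated values. The Other column rests on only 24 imputed case-patients, so its estimate is very imprecise.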

For both the complete case and bias-adjusted analyses, we calculated the SARS-CoV-2 notification rates (per 1,000 persons), hospitalization proportions (hospitalized cases/reported cases), and case fatality rates (deaths/reported cases) by race/ethnicity group. We reported 95% confidence intervals (CI) for the complete case analysis. For the bias-adjusted estimates, we reported the median with 95% simulation intervals (SI), which account for the potential misclassification of imputed race/ethnicity and sampling error. We calculated the differences in SARS-CoV-2 notification rates in each race/ethnicity group compared with persons of White race/ethnicity, among case-patients with reported race/ethnicity information, and among all case-patients after bias adjustment. To estimate the magnitude of the change in the absolute disparity after accounting for missing race/ethnicity information, we computed the relative change by dividing the absolute disparity accounting for missing race/ethnicity by the absolute disparity from the complete case analysis. All analyses used R v3.6 (Vienna, Austria). The Georgia Department of Public Health determined this activity to be consistent with public health surveillance, which does not require informed consent or IRB approval.

Results

Among the 19,623 cases reported in Fulton County from 2 March to 18 August 2020, 7,131 (36%) were missing race/ethnicity information in the case report. Data were more complete among the 1,776 hospitalized case-patients, where only 14 (3.5%) were missing race/ethnicity information. All deceased case-patients (n=456) had complete information on race/ethnicity.

Comparison of reported versus imputed race/ethnicity group showed that the algorithm’s imputation accuracy varied by race/ethnicity group (Table 1). Of the 5,526 persons who were imputed as Black race/ethnicity, 92% (95%CI: 92%, 93%) were reported as Black in case reports. Among persons imputed as Hispanic ethnicity, 83% (95%CI: 82%, 85%) were reported as Hispanic. The algorithm was less accurate for case-patients with race/ethnicity imputed as Asian (PV=68%, 95%CI: 61%, 74%) and as White (PV=54%, 95%CI: 53%, 56%). The PV estimates for racial/ethnic groups changed over time, likely due to changes in the prevalence of demographic groups affected by the pandemic over time (Supplemental Table 1).

In both the complete case and bias-adjusted analyses, the SARS-CoV-2 notification rates were highest among those classified as Other, followed by Hispanic, Black, White, and Asian (Table 2a and 2b). Imputation and bias adjustment yielded higher estimates of notification rates for each racial/ethnic group than complete case analysis because more case-patients were included in the numerator. Estimated notification rates increased 1.8-fold for persons classified as Asian, 1.7-fold for White, 1.7-fold for Hispanic, 1.6-fold for Other, and 1.5-fold for Black. Hospitalization proportions and case fatality rates decreased across all race/ethnicity groups with bias adjustment compared with the complete case analyses, because more cases were included in the denominator. In both the complete case and bias-adjusted analyses, case-patients who were classified as Black race/ethnicity had the highest hospitalization proportions (complete case: 17%, 95%CI: 16%, 18%; bias-adjusted: 12%, 95%SI: 11%, 12%) and case fatality rates (complete case: 4.6%, 95%CI: 4.1%, 5.1%; bias-adjusted: 3.1%, 95%SI: 2.8%, 3.4%).

Table 2a:

Complete case estimates of SARS-CoV-2 notification rates, hospitalization proportions, and case fatality rates by race/ethnic group among 12,492 cases reported to Fulton County Board of Health, 2 March – 18 Aug 2020.

| Race/Ethnicity | Total infections | Hospitalized | Died | At risk^a | Notification rate per 1,000 (95% CI) | Hospitalized proportion, % (95% CI) | Case fatality rate, % (95% CI) |
|---|---|---|---|---|---|---|---|
| Asian | 260 | 25 | 5 | 69,987 | 3.7 (3.3, 4.2) | 9.6 (6.2, 14) | 1.9 (0.4, 3.8) |
| Hispanic | 1,617 | 214 | 15 | 74,328 | 22 (21, 23) | 13 (12, 15) | 0.9 (0.5, 1.4) |
| Black | 6,952 | 1,192 | 320 | 445,992 | 16 (15, 16) | 17 (16, 18) | 4.6 (4.1, 5.1) |
| White | 3,143 | 312 | 112 | 406,755 | 7.7 (7.5, 8.0) | 9.9 (8.9, 11) | 3.6 (2.9, 4.2) |
| Other | 520 | 30 | 4 | 6,056 | 86 (79, 93) | 5.8 (3.8, 7.9) | 0.8 (0.2, 1.5) |

Table 2b:

Bias-adjusted estimates of SARS-CoV-2 notification rates, hospitalization proportions, and case fatality rates including 7,131 cases with imputed race/ethnicity, among 19,623 cases reported to Fulton County Board of Health, 2 March – 18 Aug 2020.

| Race/Ethnicity | Total infections (95% SI) | Hospitalized | Died | At risk^a | Notification rate per 1,000 (95% SI) | Hospitalized proportion, % (95% SI) | Case fatality rate, % (95% SI) |
|---|---|---|---|---|---|---|---|
| Asian | 456 (438, 474) | 25 | 5 | 69,987 | 6.5 (5.9, 7.2) | 5.5 (3.4, 7.6) | 1.1 (0.1, 2.1) |
| Hispanic | 2,687 (2,657, 2,717) | 214 | 15 | 74,328 | 36 (35, 38) | 8.0 (6.9, 9.0) | 0.6 (0.3, 0.8) |
| Black | 10,351 (10,301, 10,402) | 1,195 | 320 | 445,992 | 23 (23, 24) | 12 (11, 12) | 3.1 (2.8, 3.4) |
| White | 5,284 (5,232, 5,337) | 312 | 112 | 406,755 | 13 (13, 13) | 5.9 (5.3, 6.5) | 2.1 (1.7, 2.5) |
| Other | 844 (817, 873) | 30 | 4 | 6,056 | 139 (130, 149) | 3.6 (2.3, 4.8) | 0.5 (0.0, 0.9) |

^a American Community Survey 5-year 2018 estimates.
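The three burden measures in Tables 2a and 2b can be recomputed from the tabulated counts. The Python sketch below is an illustrative recalculation from the published counts, not the authors' code; the function name `burden_measures` is hypothetical. It shows why adding imputed case-patients raises notification rates (larger numerator) while lowering hospitalization proportions and case fatality rates (larger denominator).

```python
# Counts from Tables 2a and 2b: (infections, hospitalized, died, at risk).
COMPLETE_CASE = {
    "Asian":    (260, 25, 5, 69987),
    "Hispanic": (1617, 214, 15, 74328),
    "Black":    (6952, 1192, 320, 445992),
    "White":    (3143, 312, 112, 406755),
    "Other":    (520, 30, 4, 6056),
}
# Infections here are the median classified counts reported in Table 2b.
BIAS_ADJUSTED = {
    "Asian":    (456, 25, 5, 69987),
    "Hispanic": (2687, 214, 15, 74328),
    "Black":    (10351, 1195, 320, 445992),
    "White":    (5284, 312, 112, 406755),
    "Other":    (844, 30, 4, 6056),
}

def burden_measures(infections, hospitalized, died, at_risk):
    """Notification rate per 1,000, hospitalized %, and case fatality %."""
    return {
        "rate_per_1000": 1000 * infections / at_risk,
        "hospitalized_pct": 100 * hospitalized / infections,
        "case_fatality_pct": 100 * died / infections,
    }

for group in COMPLETE_CASE:
    cc = burden_measures(*COMPLETE_CASE[group])
    ba = burden_measures(*BIAS_ADJUSTED[group])
    fold = ba["rate_per_1000"] / cc["rate_per_1000"]
    print(
        f"{group:9s} rate {cc['rate_per_1000']:.1f} -> {ba['rate_per_1000']:.1f} "
        f"({fold:.1f}-fold), hospitalized {cc['hospitalized_pct']:.1f}% -> "
        f"{ba['hospitalized_pct']:.1f}%, case fatality "
        f"{cc['case_fatality_pct']:.1f}% -> {ba['case_fatality_pct']:.1f}%"
    )
```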

The magnitude of the absolute disparity—difference in SARS-CoV-2 notification rates for case-patients classified in each race/ethnicity group compared with case-patients classified as White—increased in the bias-adjusted analysis relative to the complete case analysis for nearly all race/ethnicity groups (Table 3). When comparing bias-adjusted with complete case results, the absolute disparity in notification rates increased 1.3-fold among classified Black and 1.6-fold among classified Hispanic race/ethnicity groups in reference to case-patients classified as White.

Table 3:

Absolute disparity (RD) of SARS-CoV-2 notification rates among minority groups compared with non-Hispanic White persons among cases with complete information and after accounting for missing race/ethnicity data among 19,623 SARS-CoV-2 infected persons reported to Fulton County between 2 March 2020 and 18 August 2020.

| Race/Ethnicity | Complete case: notification rate per 1,000 (95% CI) | Complete case: RD per 1,000 (95% CI) | Bias-adjusted: notification rate per 1,000 (95% SI) | Bias-adjusted: RD per 1,000 (95% SI) | Relative change in magnitude of disparity^a |
|---|---|---|---|---|---|
| Asian | 3.7 (3.3, 4.2) | −4.0 (−4.5, −3.5) | 6.5 (5.9, 7.2) | −6.5 (−6.8, −6.2) | 0.6 |
| Hispanic | 22 (21, 23) | 14 (13, 15) | 36 (35, 38) | 23 (23, 23) | 1.7 |
| Black | 16 (15, 16) | 7.9 (7.4, 8.3) | 23 (23, 24) | 10 (10, 10) | 1.3 |
| White | 7.7 (7.5, 8.0) | Reference | 13 (13, 13) | Reference | |
| Other | 86 (79, 93) | 78 (71, 85) | 139 (130, 149) | 126 (122, 131) | 1.6 |
^a Estimated as the ratio of the bias-adjusted absolute disparity to the complete case absolute disparity.
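The absolute disparity and its relative change can be illustrated with a short Python sketch that recomputes them from the counts in Tables 2a and 2b for the three groups whose notification rates exceed the White reference. The function names are hypothetical, and the bias-adjusted infection counts are the published medians, so this is a recalculation under those assumptions and not the authors' simulation output.

```python
# Infections and population at risk from Tables 2a and 2b.
AT_RISK = {"Hispanic": 74328, "Black": 445992, "White": 406755, "Other": 6056}
COMPLETE_CASE = {"Hispanic": 1617, "Black": 6952, "White": 3143, "Other": 520}
BIAS_ADJUSTED = {"Hispanic": 2687, "Black": 10351, "White": 5284, "Other": 844}

def rate_per_1000(infections, group):
    """Notification rate per 1,000 persons at risk."""
    return 1000 * infections[group] / AT_RISK[group]

def rate_difference(infections, group, reference="White"):
    """Absolute disparity: group rate minus reference rate, per 1,000."""
    return rate_per_1000(infections, group) - rate_per_1000(infections, reference)

def relative_change(group):
    """Bias-adjusted absolute disparity divided by the complete case one."""
    return rate_difference(BIAS_ADJUSTED, group) / rate_difference(COMPLETE_CASE, group)

for group in ("Hispanic", "Black", "Other"):
    print(
        f"{group:9s} RD {rate_difference(COMPLETE_CASE, group):.1f} -> "
        f"{rate_difference(BIAS_ADJUSTED, group):.1f} per 1,000, "
        f"relative change {relative_change(group):.1f}"
    )
```

The ratio is sensitive to rounding. From unrounded rates the Hispanic relative change is about 1.65, whereas dividing the rounded table entries (23/14) gives about 1.6.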

Discussion

In this study, accounting for missing race/ethnicity information revealed greater differences in SARS-CoV-2 notification rates comparing most racial/ethnic groups with case-patients classified as White race. These results suggest that national estimates, which exclude case-patients with missing race/ethnicity information, may underestimate the magnitude of absolute racial/ethnic disparities in COVID-19 morbidity and mortality.7,9

Our results underscore the need for imputation combined with bias adjustment. In our study population, the PV estimates indicated that imputation without bias adjustment overestimated infections among case-patients classified as White and underestimated infections among case-patients classified as Black (Table 1). Since race/ethnicity information is relatively complete for hospitalized and deceased cases, an analysis based on imputed race/ethnicity without bias adjustment would underestimate the hospitalized proportions and case fatality rates in classified White case-patients and overestimate these measures in classified Black case-patients. Our bias-adjusted estimates account for this expected misclassification.

Notably, both the complete case analysis and the bias-adjusted estimates demonstrate important absolute racial/ethnic disparities in the notification rates. The bias-adjusted estimates do not change our understanding of the direction of racial/ethnic disparities in the COVID-19 pandemic; however, the magnitude of racial/ethnic disparities changed meaningfully after bias adjustment. In contrast, the hospitalization proportions and case fatality rates decreased across all classified race/ethnicity groups after accounting for missing race/ethnicity information because few hospitalized or deceased case-patients were missing race/ethnicity information. These results highlight the need for more complete reporting so that health equity and racial justice efforts aimed at addressing these disparities operate on the most accurate data possible.

The imputation of race/ethnicity has limitations. The Bayesian Improved Surname Geocoding algorithm limits the racial/ethnic groups that can be imputed to Black, Hispanic, Asian, White, or Other.16–18 The reliance on categories of ‘other’ is problematic for identifying and addressing disparities in other racial/ethnic populations (e.g. Indigenous populations). Future studies should explore how accounting for missing race/ethnicity affects other disease burden measures. Additionally, we assumed that the Bayesian Improved Surname Geocoding algorithm performs equally well among those with reported race/ethnicity as among those with missing race/ethnicity. Given that the data used to inform the imputed race/ethnicity are external to the study population, this is a reasonable assumption.16–18 Last, our results are conditioned on being tested. Although testing capacity has increased across most states, it was difficult to receive testing at the beginning of the pandemic. Therefore, our estimates of disparities in SARS-CoV-2 notification rates may not fully capture the underlying disparities in SARS-CoV-2 infection rates.

Our findings emphasize the importance of collecting complete race/ethnicity data at the time of testing, for the current pandemic and future outbreaks. When data are missing, Bayesian Improved Surname Geocoding combined with quantitative bias analysis may provide better estimates of the racial/ethnic disparities in SARS-CoV-2 notification rates, hospitalization proportions, and case fatality rates.


Financial Support:

This work was supported in part by the US National Institutes of Health F31CA239566 (PI L. J. Collin), R01LM013049 (PI T. L. Lash), and K24AI114444 (PI N. R. Gandhi). It was also supported by a grant from the Robert W. Woodruff Foundation (PI A. Chamberlain). K. Labgold is supported in part by the Center for Reproductive Health Research in the Southeast (RISE) Doctoral Fellowship and an ARCS Foundation Award. S. Hamid was supported in part by the HAPIN trial, which is funded by the U.S. National Institutes of Health (cooperative agreement 1UM1HL134590) in collaboration with the Bill & Melinda Gates Foundation (OPP1131279). L. J. Collin was also supported in part by TL1TR002540 from the National Center for Advancing Translational Sciences of the National Institutes of Health.

Footnotes

Conflicts of Interest: The authors have no conflicts of interest to declare.

Data Access: Due to patient confidentiality, data are only available upon request from the Fulton County Board of Health and with IRB approval from the Georgia Department of Public Health. Example code used to perform the imputation and bias adjustment is available on GitHub (https://github.com/lcolli5/Adaptive-Validation).

