Public Health Reports. 2023 Mar 24;138(3):428–437. doi: 10.1177/00333549231163531

Tracking COVID-19 in the United States With Surveillance of Aggregate Cases and Deaths

Diba Khan 1, Meeyoung Park 1, Jacqueline Burkholder 1, Sorie Dumbuya 1, Matthew D Ritchey 1,2, Paula Yoon 1, Amanda Galante 3, Joseph L Duva 3, Jeffrey Freeman 3, William Duck 1, Stephen Soroka 1, Lyndsay Bottichio 1, Michael Wellman 1, Samuel Lerma 1, B Casey Lyons 1, Deborah Dee 1,2, Seghen Haile 1, Denise M Gaughan 1, Adam Langer 1, Adi V Gundlapalli 1, Amitabh B Suthar 1, on behalf of the COVID-19 Response
PMCID: PMC10040484  PMID: 36960828

Abstract

Early during the COVID-19 pandemic, the Centers for Disease Control and Prevention (CDC) leveraged an existing surveillance system infrastructure to monitor COVID-19 cases and deaths in the United States. Given the time needed to report individual-level (also called line-level) COVID-19 case and death data containing detailed information from individual case reports, CDC designed and implemented a new aggregate case surveillance system to inform emergency response decisions more efficiently, with timelier indicators of emerging areas of concern. We describe the processes implemented by CDC to operationalize this novel, multifaceted aggregate surveillance system for collecting COVID-19 case and death data to track the spread and impact of the SARS-CoV-2 virus at national, state, and county levels. We also review the processes established to acquire, process, and validate the aggregate number of cases and deaths due to COVID-19 in the United States at the county and jurisdiction levels during the pandemic. These processes include time-saving tools and strategies implemented to collect and validate authoritative COVID-19 case and death data from jurisdictions, such as web scraping to automate data collection and algorithms to identify and correct data anomalies. This topical review highlights the need to prepare for future emergencies, such as novel disease outbreaks, by having an event-agnostic aggregate surveillance system infrastructure in place to supplement line-level case reporting for near–real-time situational awareness and timely data.

Keywords: COVID-19, surveillance, public health, data, CDC


During a public health emergency, decision makers require timely and accurate surveillance data to rapidly implement measures for mitigating public health threats to the nation. The Centers for Disease Control and Prevention (CDC) plays an instrumental role in the national surveillance of public health threats for all-hazards responses. During new and emerging public health threats to the nation, rapid development of data collection processes and tools is required to collect critical data for response decisions. Historically, the collection of critical data has been demonstrated during recent public health threats such as anthrax in 2001 1 and 2015, 2 severe acute respiratory syndrome (SARS) in 2003, 3 Marburg hemorrhagic fever in 2008, 4 influenza A subtype H1N1 in 2009,5,6 Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 7 and 2014, 8 Ebola hemorrhagic fever in 2014 9 and 2019, 10 the Flint water crisis in Michigan in 2016, 11 and Zika in 2016. 12 Following the report of a novel coronavirus infection in Wuhan City, Hubei Province, China, CDC engaged the international public health community to identify this global threat, now known to be caused by the SARS-CoV-2 virus. As global cases ramped up, 13 CDC established an Incident Management Structure on January 7, 2020, to track the emergence of the novel SARS-CoV-2 virus and implement public health interventions.

On January 8, 2020, CDC issued an advisory through the Health Alert Network 14 to inform health care providers about a pneumonia cluster of unknown etiology in Wuhan City. CDC requested that health care providers immediately notify their state, tribal, local, or territorial (STLT) health departments if they encountered patients with severe respiratory symptoms and recent travel to Wuhan City. CDC issued guidance on January 17, 2020, 15 requesting STLT health departments to manually submit a Human Infection With 2019 Novel Coronavirus Case Report Form 16 for each person under investigation for suspected SARS-CoV-2 infection. On January 20, 2020, CDC activated its Emergency Operations Center to coordinate agencywide response efforts and surveillance activities. The first confirmed case of COVID-19 in the United States was identified on January 21, 2020, in a Washington State resident with recent travel to Wuhan City. 17

CDC transitioned from manually reporting people under investigation for COVID-19 to leveraging an existing electronic surveillance system, the Data Collation and Integration for Public Health Event Response (DCIPHER) platform, 18 for collecting, storing, and analyzing data. Individual case information, or “line-level” reports, could be submitted to DCIPHER in 2 ways: (1) using established reporting mechanisms in the National Notifiable Diseases Surveillance System, similar to other reportable conditions 19 or (2) uploading case report data in comma-separated value (CSV) files standardized to align with federal reporting guidelines. Voluntary reporting of COVID-19 case and death data by STLT health departments has played an important role in addressing emerging scientific questions and characterizing epidemiological trends for COVID-19. 20

As the numbers of COVID-19 cases and deaths increased, the reporting lag increased because of the additional workload required by jurisdictions to conduct individual case investigations before processing and reporting detailed line-level records for COVID-19 cases and deaths. A need for timelier COVID-19 surveillance data emerged to inform the dynamic response information requirements. To address this data gap, CDC augmented traditional surveillance for COVID-19 using line-level data by establishing aggregate case and death surveillance (ACS). While line-level COVID-19 case surveillance collects detailed information on demographic characteristics, hospitalization and death status, exposure and clinical history, laboratory results, and vaccination history for each case identified in a jurisdiction, 21 the ACS tracks only the daily number of cumulative COVID-19 cases and deaths from each jurisdiction. 20 This streamlined ACS reporting provides a reliable real-time snapshot of the case and death burden of the pandemic that can be used for tracking trends and decision-making.

Data collection for the line-level and ACS data sources remains separate, because it is not possible to determine which cases reported through the line-level reporting channel have been counted in the ACS total. Initially, ACS data tend to have a higher case count than line-level data because of the lag associated with individual-level case reporting. Over time, line-level data for each jurisdiction tend to catch up to preliminary ACS counts. A review of COVID-19 ACS and line-level surveillance systems, using select CDC criteria for evaluating various public health surveillance systems, 22 demonstrates their complementary roles (eTable 1 in Supplemental Material).

ACS data have proven critical for timely spatial–temporal COVID-19 surveillance and continue to inform CDC’s response operations. We describe the innovative processes and tools CDC has implemented to track, validate, and report the spread and impact of COVID-19 using ACS data. We also discuss the challenges and data modernization efforts in planning response capabilities to support future public health emergencies.

ACS System Design and Implementation

Aggregate Case and Death Reporting

CDC began working with STLT health departments in January 2020 to establish a process for collecting and validating information on COVID-19 cases and deaths from reporting jurisdictions. To maintain situational awareness on the spread of COVID-19, CDC initially summarized information from surveillance reports and case investigations from cruise ships, jurisdictional person-under-investigation reports, and publicly available sources, including public health department press releases, social media announcements, and media reports. CDC began daily correspondence with jurisdictions to ensure the submitted person-under-investigation reports were appropriately matched with confirmatory results for CDC-performed laboratory tests, so that accurate COVID-19 case and death totals could be reported. As national case definitions23,24 and reporting requirements were established and confirmatory testing capacity increased across the country, more jurisdictions began systematically submitting COVID-19 case reports into DCIPHER. To ensure DCIPHER-derived totals matched jurisdictional totals, CDC began sending a daily standardized email to jurisdictions asking them to validate the aggregate COVID-19 case and death counts based on information in DCIPHER.

As the number of COVID-19 cases increased, the number of individual case reports submitted to DCIPHER began lagging the known number of confirmed and probable COVID-19 cases and deaths occurring in jurisdictions. The daily email correspondence evolved from a data quality assurance activity to a stand-alone system for jurisdictions to provide up-to-date aggregate totals serving as the authoritative jurisdictional and national COVID-19 case and death counts. To further streamline the data collection process, in March 2020, CDC developed an Epi-Info web entry form 25 for jurisdictions to systematically provide daily cumulative COVID-19 case and death totals.

When jurisdictions began publishing COVID-19 case and death data online, CDC developed the capacity to collect publicly available information from trusted verifiable sources 26 by web scraping, a programmed process to automate extraction of data from websites. Web scraping supplemented the COVID-19 case and death data for jurisdictions that did not submit data through Epi-Info. CDC used web scraping to collect aggregate COVID-19 case and death data from various direct official sources 26 that public health departments published online. These official sources included websites, data feeds (available through application program interfaces [APIs], spreadsheets, or Github), situation reports, county and jurisdictional dashboards, official press releases and verified statements, state governors’ websites, official department of health Facebook pages and Twitter feeds, and World Health Organization and Pan American Health Organization websites. When data were available from multiple sources, the most current counts were used. These COVID-19 case and death data were verified with jurisdictions daily. Web-scraped data were merged with Epi-Info survey results for daily compilation of COVID-19 case and death counts. By July 2020, CDC was using APIs to further automate data collection and improve data quality at the jurisdiction level. 27
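The daily merge of Epi-Info submissions with web-scraped counts can be sketched as a gap-filling step. This is a minimal illustration, not CDC's implementation: the function name, the record layout, and the jurisdiction labels are hypothetical, and the sketch assumes that an Epi-Info submission, when present, is used in place of the scraped value.

```python
def compile_daily_counts(epi_info, scraped):
    """Merge two {jurisdiction: {"cases", "deaths"}} mappings.

    Assumption: a jurisdiction's Epi-Info submission is used when present;
    web-scraped counts fill in for jurisdictions that did not submit.
    """
    compiled = {}
    for jurisdiction in sorted(set(epi_info) | set(scraped)):
        if jurisdiction in epi_info:
            source, record = "epi_info", epi_info[jurisdiction]
        else:
            source, record = "web_scrape", scraped[jurisdiction]
        # Keep track of where each value came from for later verification.
        compiled[jurisdiction] = {**record, "source": source}
    return compiled


# Hypothetical example data.
epi_info = {"Jurisdiction A": {"cases": 1200, "deaths": 30}}
scraped = {
    "Jurisdiction A": {"cases": 1195, "deaths": 30},
    "Jurisdiction B": {"cases": 560, "deaths": 12},
}
daily = compile_daily_counts(epi_info, scraped)
```

Recording the source alongside each count keeps the compiled table auditable when totals are later verified with jurisdictions.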

CDC also established business rules (eTable 2 in Supplemental Material) to enforce jurisdictional preferences for reporting probable cases and deaths, or cases among nonresidents, and accommodated changes in reporting procedures after updates were made to the COVID-19 case definition.23,24 These rules facilitated automated selection of the most accurate and up-to-date information available from multiple data sources for aggregate daily reporting.
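One way to picture a business rule of this kind is as a per-jurisdiction setting that controls whether probable cases are added to the confirmed total. The sketch below is illustrative only; the actual rules are listed in eTable 2, and the `JurisdictionRule` class and `reported_total` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JurisdictionRule:
    """Hypothetical per-jurisdiction reporting preference."""
    include_probable: bool = True


def reported_total(confirmed: int, probable: Optional[int],
                   rule: JurisdictionRule) -> int:
    """Return the total to report under the jurisdiction's rule.

    Probable counts are added only when the rule allows it and the
    jurisdiction actually reports a probable count.
    """
    if rule.include_probable and probable is not None:
        return confirmed + probable
    return confirmed
```

For example, `reported_total(100, 20, JurisdictionRule(include_probable=True))` returns 120, whereas the same counts under `include_probable=False` return 100.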

While these processes were effective in understanding national and jurisdictional trends in COVID-19 cases and deaths, they did not provide visibility of disease transmission at local levels. By mid-March 2020, multiple third parties, including Johns Hopkins University 28 and USAFacts, 29 began collecting, summarizing, and visualizing county-level aggregate COVID-19 case and death information collected from official public data sources. As CDC developed its internal systems to enhance county-level surveillance, it leveraged data provided by USAFacts, a nonprofit organization dedicated to presenting government data, 29 as an initial verifiable source of official, county-level aggregate COVID-19 case and death data to monitor disease transmission and severity at the local level. By November 2020, CDC had established business rules (eTable 2 in Supplemental Material) and curated official sources of online data for more than 3200 US counties to enhance data quality and timeliness for surveillance of county-level aggregate COVID-19 cases and deaths.

Data Management, Integration, Validation, and Transmission

CDC manages the flow of jurisdiction- and county-level aggregate COVID-19 case and death data through data pipelines in DCIPHER 30 and HHS Protect, systems that provide a secure, common operating platform for storing, aggregating, and sharing public health data. 31 The data pipeline consists of a sequence of consecutive data-processing steps for ingesting raw COVID-19 case and death data from multiple sources to be cleaned, integrated, and stored on a shared platform, creating authoritative, analytically ready datasets. These COVID-19 datasets are used for analysis, visualization, and reporting.

A key difference between jurisdiction- and county-level aggregate COVID-19 case and death datasets is that jurisdictions may report confirmed and probable cases and deaths as separate totals, a breakdown not recorded in county-level data. 26 Quality checks for jurisdictional aggregate COVID-19 case and death data are also more extensive because daily manual review of inconsistencies for more than 3200 counties is not feasible for aggregate county-level data.

COVID-19 Aggregate Case and Death Counts (ACDC) Data Collection—County and Jurisdictions

Web-scraped county- and jurisdiction-level COVID-19 case and death counts, known as the Aggregate Case and Death Counts (ACDC) dataset, are obtained from each jurisdiction’s publicly reported website or data feed through an autonomous data collection pipeline. ACDC data are acquired from regional public health websites through various means. A suite of modular web scrapers (automated tools used for data extraction from websites), each customized for a specific public health website, automatically retrieves data multiple times per day as a parallelized automated data collection process. Web scrapers use various approaches to find and capture data, including searching for regular expressions, using Web Drivers (an open-source tool for testing web applications across various browsers using various programming languages) to autonomously interact with web interfaces, and leveraging available APIs and other data sources for automated downloads. Manual COVID-19 case and death data collection is also performed in situations where automated data retrieval has proven difficult, such as in social media posts and data contained within images.
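The regular-expression approach can be illustrated with a small sketch that pulls cumulative totals out of page text. The page wording ("Total cases:", "Total deaths:"), the patterns, and the function names are assumptions made for illustration; in practice each scraper would be tailored to the layout of a specific health department website.

```python
import re

# Hypothetical patterns for a page that labels its totals in plain text.
CASES_PATTERN = re.compile(r"Total\s+cases:\s*([\d,]+)", re.IGNORECASE)
DEATHS_PATTERN = re.compile(r"Total\s+deaths:\s*([\d,]+)", re.IGNORECASE)


def parse_count(text):
    """Convert a string such as '1,234,567' to an integer."""
    return int(text.replace(",", ""))


def extract_counts(page_text):
    """Return cumulative cases and deaths found in the page, or None if absent."""
    cases = CASES_PATTERN.search(page_text)
    deaths = DEATHS_PATTERN.search(page_text)
    return {
        "cases": parse_count(cases.group(1)) if cases else None,
        "deaths": parse_count(deaths.group(1)) if deaths else None,
    }


# Made-up page fragment used only to exercise the function.
sample = "<p>Total cases: 1,234,567</p><p>Total deaths: 12,345</p>"
counts = extract_counts(sample)
```

Returning `None` when a pattern is not found, rather than raising an error, lets a downstream quality check flag the page for manual collection after a website changes its format.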

After ACDC data are collected, manual and automated data quality checks are performed to ensure data acquisition is working properly and no unexpected anomalies have occurred. If jurisdictions publish COVID-19 case and death data on multiple sources, ACDC data business rules prioritize the most up-to-date or maximum values. A dedicated team maintains the ACDC pipeline to address system and data anomalies as they are identified. If anomalous COVID-19 case and death data are detected, CDC analysts coordinate with state and local health departments to reconcile the affected data. Further details on aggregate COVID-19 case and death data sources, formats, and considerations are included in Supplemental Material.
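A rule that prioritizes the most up-to-date or maximum values across several sources can be sketched as a single selection step. The record fields (`source`, `updated`, `cases`) and the tie-breaking order are assumptions for illustration, not the documented ACDC rules.

```python
from datetime import date


def select_record(records):
    """Pick the most recently updated record; break ties with the larger count."""
    return max(records, key=lambda r: (r["updated"], r["cases"]))


# Hypothetical records for one jurisdiction published on three channels.
records = [
    {"source": "dashboard", "updated": date(2020, 11, 2), "cases": 5010},
    {"source": "press_release", "updated": date(2020, 11, 3), "cases": 5080},
    {"source": "data_feed", "updated": date(2020, 11, 3), "cases": 5095},
]
chosen = select_record(records)
```

Because cumulative counts should not decrease, preferring the larger value among equally recent sources is a reasonable tie-breaker, although any large disagreement between sources would still warrant review.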

COVID-19 Case and Death Data Validation and Processing, by County

County-level ACDC raw time-series data are subject to numerous validation steps and a smoothing algorithm to ensure monotonically increasing cumulative trends before finally integrating into HHS Protect for further distribution to the CDC COVID Data Tracker (Figure 1). The county-level COVID-19 case and death time-series data, once sent to the HHS Protect system, undergo several additional data quality checks and processing steps before final publication. HHS Protect performs an automated review of the COVID-19 case and death data file to ensure no fields are malformed, empty, decreasing, or otherwise anomalous. The approved county-level COVID-19 case and death time-series data are then pushed to additional processing workflows, where they are eventually used to generate daily and weekly numbers to support national response efforts. The county-level COVID-19 case and death time-series data are also delivered to CDC’s DCIPHER system for further processing, epidemiological modeling, and publishing of data products on the CDC COVID Data Tracker and data.cdc.gov.20,32 The final county-level COVID-19 case and death time-series data (Figures 2 and 3) can be viewed on the US County View map on the CDC COVID Data Tracker, 20 at an individual county level using the COVID-19 Integrated County View. 33 Aggregate county-level case data are also used to generate COVID-19 Community Levels32,34 (Figure 4) and COVID-19 Community Profile Reports. 35 CDC transitioned the reporting of COVID-19 aggregate case and death data for counties from daily to weekly cadence beginning the week of October 18, 2022.
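The article does not specify the smoothing algorithm, so the following is only a sketch of one simple way to obtain a monotonically increasing cumulative series, together with a check for empty or decreasing values. It uses a backward running minimum, which treats a later downward revision as a correction of earlier overcounts; both function names are hypothetical.

```python
def enforce_monotonic(cumulative):
    """Return a non-decreasing copy of a cumulative series.

    Working backward, each value is capped at the value that follows it,
    so a later downward revision lowers the earlier overcounted days.
    """
    smoothed = list(cumulative)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = min(smoothed[i], smoothed[i + 1])
    return smoothed


def find_anomalies(cumulative):
    """Return indices of missing, negative, or decreasing values."""
    flagged = []
    previous = None
    for i, value in enumerate(cumulative):
        if value is None or value < 0:
            flagged.append(i)
            continue
        if previous is not None and value < previous:
            flagged.append(i)
        previous = value
    return flagged
```

For a hypothetical series `[10, 12, 11, 15]`, `enforce_monotonic` returns `[10, 11, 11, 15]`, and `find_anomalies([10, None, 12, 11, 15])` returns `[1, 3]`.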

Figure 1. Centers for Disease Control and Prevention (CDC) process for collecting and validating county-level data on COVID-19 cases and deaths, United States. Abbreviation: HHS, US Department of Health and Human Services.

Figure 2. Cumulative number of COVID-19 cases per 100 000 population, by county, United States, January 22, 2020–September 29, 2022.

Data source: Centers for Disease Control and Prevention aggregate case counts. 33

Figure 3. Cumulative number of COVID-19 deaths per 100 000 population, by county, United States, January 22, 2020–September 29, 2022.

Data source: Centers for Disease Control and Prevention aggregate death counts. 33

Figure 4. COVID-19 community levels per 100 000 population, by county, United States, September 23–29, 2022. Data source: Centers for Disease Control and Prevention aggregate case and death counts, hospital admissions, and utilization. 32

Jurisdiction-Level COVID-19 Case and Death Data Validation, Merging, and Processing

The ACDC dataset also provides daily snapshots of aggregate COVID-19 case and death data by jurisdiction. ACDC data are merged with cumulative COVID-19 case and death counts submitted by jurisdictions voluntarily through an Epi-Info survey, 25 open from 4 pm Eastern Standard Time (EST) the previous day through 9 am EST the day of data validation (Figure 5). At 7 am EST, CDC conducts an initial review of ACDC and Epi-Info data to check for outliers, compares preliminary Epi-Info submissions against ACDC data for inconsistencies such as typos, and processes validated data into HHS Protect. CDC reviews STLT health department websites to document relevant caveats, changes to processes, and retroactive corrections of historical time series reported by jurisdictions.
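The comparison of preliminary Epi-Info submissions against ACDC data can be sketched as a relative-difference check. The 5% tolerance, the function name, and the jurisdiction labels are assumptions for illustration; the article does not state the thresholds CDC analysts used.

```python
def flag_discrepancies(epi_info, acdc, tolerance=0.05):
    """Return (jurisdiction, submitted, scraped) tuples whose values disagree.

    A submission is flagged when it differs from the scraped count by more
    than `tolerance` as a fraction of the scraped count.
    """
    flagged = []
    for jurisdiction, submitted in epi_info.items():
        scraped = acdc.get(jurisdiction)
        if not scraped:
            # Nothing to compare against (missing or zero scraped count).
            continue
        relative_gap = abs(submitted - scraped) / scraped
        if relative_gap > tolerance:
            flagged.append((jurisdiction, submitted, scraped))
    return flagged


# Hypothetical submissions; Jurisdiction B contains an extra trailing digit.
epi_info = {"Jurisdiction A": 1000, "Jurisdiction B": 15200, "Jurisdiction C": 300}
acdc = {"Jurisdiction A": 1010, "Jurisdiction B": 1520, "Jurisdiction C": 300}
to_review = flag_discrepancies(epi_info, acdc)
```

A check of this kind catches order-of-magnitude typos while tolerating the small differences expected when two sources are updated at different times of day.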

Figure 5. Centers for Disease Control and Prevention (CDC) process for collecting and validating jurisdiction-level data on COVID-19 cases and deaths, United States. Abbreviation: HHS, US Department of Health and Human Services.

To update historical time series, STLT health departments may provide data either by using a CSV template for manual upload or an API enabling CDC to automate download of data directly into CDC’s data pipeline in HHS Protect.20,27

After Epi-Info submissions are closed at 9 am EST, CDC merges additional COVID-19 case and death data from Epi-Info, bulk historical data uploads, and APIs and conducts further validation of COVID-19 case and death counts. If discrepancies are found, CDC contacts jurisdictions for clarification. CDC then finalizes new and cumulative COVID-19 case and death counts based on any additional communication from jurisdictions, Epi-Info survey results, and validation of ACDC via notes captured from websites and media sources of all 60 jurisdictions. The finalized jurisdictional new and cumulative COVID-19 case and death counts are added to the time series and provided through HHS Protect for distribution to data.cdc.gov and the Epi-Info survey, analysis and visualization on CDC COVID Data Tracker, and subsequent use in analysis and dissemination via multiple CDC reports. Data on data.cdc.gov 36 are used to generate the COVID Data Tracker Weekly Review 37 as well as COVID-19 State Profile Reports. 38
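Because jurisdictions report cumulative totals, new counts for each reporting day follow from the difference between consecutive cumulative values. A minimal sketch with made-up numbers (the function name is hypothetical):

```python
def new_counts(cumulative):
    """Return day-over-day differences of a cumulative series.

    The result has one fewer element than the input because the first
    day has no earlier value to subtract.
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Hypothetical cumulative case totals for four consecutive days.
daily_new = new_counts([100, 130, 130, 175])
```

Here `daily_new` is `[30, 0, 45]`. A negative difference would indicate a retroactive correction in the cumulative series that needs to be reconciled with the jurisdiction before the time series is finalized.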

CDC transitioned reporting of COVID-19 ACDC for jurisdictions from daily to weekly beginning the week of October 18, 2022. CDC has also streamlined data collection, using data from the county-level pipeline to calculate weekly aggregate counts for the jurisdiction-level pipeline. Depending on the jurisdiction, slight differences may exist between COVID-19 cases and deaths recorded from county and state websites because of the reporting lag or data reconciliation. Jurisdictions may still submit weekly cumulative counts via Epi-Info to update data from the county-level pipeline. Analysis of weekly jurisdictional data can be found on the COVID Data Tracker in the weekly trends analysis 39 and on data.cdc.gov. 40
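Deriving jurisdiction-level weekly counts from the county-level pipeline amounts to summing county totals and differencing consecutive weeks. The sketch below assumes counties are keyed by five-digit FIPS codes whose first two digits identify the state; that keying, the function names, and the counts are illustrative assumptions, and the sketch ignores jurisdictions that do not map cleanly onto a state code.

```python
from collections import defaultdict


def jurisdiction_totals(county_cumulative):
    """Sum cumulative county counts by two-digit state FIPS prefix."""
    totals = defaultdict(int)
    for fips, count in county_cumulative.items():
        totals[fips[:2]] += count
    return dict(totals)


def weekly_new(current_week, previous_week):
    """Subtract last week's cumulative totals from this week's."""
    return {
        state: current_week[state] - previous_week.get(state, 0)
        for state in current_week
    }


# Made-up cumulative counts for three counties in two states.
last_week = jurisdiction_totals({"53033": 1000, "53061": 400, "06037": 9000})
this_week = jurisdiction_totals({"53033": 1080, "53061": 430, "06037": 9500})
new_by_state = weekly_new(this_week, last_week)
```

With these inputs, `new_by_state` is `{"53": 110, "06": 500}`. Totals computed this way may differ slightly from those on a state website for the reasons noted above, such as reporting lag or data reconciliation.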

Challenges

Aggregate COVID-19 case and death data collection has improved the timeliness of case counts compared with line-level case reporting; as of September 29, 2022, the line-level dataset had captured 91.0% of the total cases reported in the aggregate dataset (eFigure in Supplemental Material), yet multiple data challenges remain. For example, heterogeneity occurs between jurisdictions in the time needed for a diagnosed case to reach a jurisdiction’s case surveillance system and subsequently be reported to CDC. Irregular batch reporting of historical COVID-19 case and death data requires persistent monitoring and coordination with states to avoid artificial spikes in daily trends analyses. Moreover, exclusive availability of aggregate case and death data on external websites can pose challenges when jurisdictions change the format or location of COVID-19 case and death data on a webpage and/or reduce the frequency at which they are available. Another limitation is the vulnerability of these public data hubs to cyberattacks. For example, data from one jurisdiction were not publicly available for more than 2 weeks in December 2021 after a ransomware attack. 41
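Artificial spikes from batch reporting can be surfaced for follow-up by comparing each day's new count with a recent baseline. This is a sketch under stated assumptions rather than CDC's method: the 7-day window, the factor of 5, and the function name are arbitrary illustrative choices.

```python
from statistics import median


def flag_spikes(daily_new, window=7, factor=5.0):
    """Return indices of days whose count exceeds `factor` times the
    median of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_new)):
        baseline = median(daily_new[i - window:i])
        if baseline > 0 and daily_new[i] > factor * baseline:
            flagged.append(i)
    return flagged


# Hypothetical daily counts with a batch of historical cases on day 7.
daily_new = [100, 110, 95, 105, 98, 102, 101, 2500, 99]
spike_days = flag_spikes(daily_new)
```

Here `spike_days` is `[7]`. Using the median as the baseline keeps a single earlier spike from inflating the threshold for subsequent days.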

Discussion

CDC’s existing ACS capabilities will remain critical to the nation for responding to future public health emergencies. The current infrastructure is designed to enable automated, near–real-time collection of publicly available information; in this instance, jurisdiction- and county-level reporting of COVID-19 cases and deaths. The expanded use of data standards and reporting requirements for ACDC, as well as integration of more APIs and other technical infrastructure, will help further automate data collection and reduce the reporting burden on state and local partners during public health emergencies. Continued application of cutting-edge technology for CDC’s data management will help improve data quality and timeliness. Improved data collection processes, ongoing digitization in the health sector, and expansion of data analytic capabilities, including artificial intelligence and machine learning on structured and unstructured data, will play an increasingly important role in the surveillance of emerging diseases. Use of multiple strategies could promote more efficient and rapid implementation of ACS for future public health emergencies and minimize the reporting burden on STLT health departments. One foundational strategy for enhancing overall case surveillance is to have access to more comprehensive, near–real-time line-level case and death data with sufficient completeness of key demographic characteristics that can be rapidly aggregated at various geographic levels. A recent public health advancement that supports this aim is the rapid adoption of electronic case reporting (eCR) capabilities in health care and public health.42,43

The use of eCR offers more effective automation of line-level surveillance capabilities than existing reporting methods, which could improve timeliness and reduce redundancy in both reporting and the need for ad hoc aggregate surveillance systems once relevant policies, funding, and implementation scales have been achieved.

The collection and use of ACS information, as well as other meaningful aggregate data, will likely remain important tools in support of future public health surveillance efforts. Preplanning and coordination with STLT partners should be explored on reporting methods (eg, public health websites, data repositories), formats (eg, machine-readable tables, API, CSV), content (eg, consistent application of definitions, aggregation of findings), and context (eg, how data will be displayed and used) to facilitate automated surveillance. Agreement on general parameters for electronic publication of these aggregate data by STLT health departments should relieve their workload and improve jurisdictional and national situational awareness going forward.

Public Health Implications

In February 2020, CDC established an aggregate data collection system to capture and validate daily county- and jurisdiction-level COVID-19 case and death counts to serve as the authoritative source for the United States. The current jurisdictional aggregate process combines the use of an Epi-Info survey tool, automated collection of publicly available web-based information, and data transfers via bulk uploads and APIs to collect COVID-19 case and death data across jurisdictions. Furthermore, technologies enabling the automated collection of web-based information have been used to capture aggregate COVID-19 case and death counts down to the county level. Aggregate case surveillance integrated into a common operating platform (HHS Protect) has provided essential COVID-19 data used to (1) describe trends and geographic patterns in COVID-19 cases and deaths, (2) estimate COVID-19 community levels and disease severity, (3) identify geographically localized hotspots, and (4) guide national and local mitigation efforts, such as face mask wearing, screening and testing, and staying up-to-date with vaccinations and boosters. 34 CDC’s COVID-19 ACS infrastructure and capabilities can be applied to achieve robust epidemic intelligence across a broader set of publicly available sources of information and may prove vital in guiding the nation’s response to future public health threats and promoting domestic security.

Supplemental Material

sj-docx-1-phr-10.1177_00333549231163531 – Supplemental material for Tracking COVID-19 in the United States With Surveillance of Aggregate Cases and Deaths

Supplemental material, sj-docx-1-phr-10.1177_00333549231163531 for Tracking COVID-19 in the United States With Surveillance of Aggregate Cases and Deaths by Diba Khan, Meeyoung Park, Jacqueline Burkholder, Sorie Dumbuya, Matthew D. Ritchey, Paula Yoon, Amanda Galante, Joseph L. Duva, Jeffrey Freeman, William Duck, Stephen Soroka, Lyndsay Bottichio, Michael Wellman, Samuel Lerma, B. Casey Lyons, Deborah Dee, Seghen Haile, Denise M. Gaughan, Adam Langer, Adi V. Gundlapalli and Amitabh B. Suthar in Public Health Reports

Acknowledgments

The authors acknowledge the following people for contributing to the design and/or execution of the systems described in this article: Kayla Anderson, Monika Bray, Aaron Curns, Asong Defang, Joseph Duva, Erik Euler, Mary Fukushima, Katie Fullerton, Beatrice Garcia, Curtis Gaye, Peter Grillo, Kayla Janos, Greta Kintzley, Amy Koening, Elena Kuklina, William LaCholter, Florence Lee, Xin Tong, Kelly Myrick, Nathan Drew, Hui Xie, Jon Rees, Aaron Maitland, Thomeka J.N. Oyebade, Timothy Ng, Amanda Okello, Chad Schupbach, Jim Tyson, Emily Ussery, Hilary Whitham, Sarah Witter, Grant Zhao, Katherine Roguski, and Andrea Mansur, all with the Centers for Disease Control and Prevention (CDC) COVID-19 Response. The authors also thank Chandre Chaney, the program deputy for the Data, Analytics and Visualization Taskforce, 2019 Novel Coronavirus (COVID-19) Response, for her steadfast efforts in managing the response. Paula Yoon recently retired from CDC. Samuel Lerma currently works for Google. Jeffrey Freeman currently works for the Biotechnology and Human Systems Division, MIT Lincoln Laboratory.

Footnotes

The findings and conclusions in this article are those of the authors and do not represent the official position of CDC.

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Diba Khan, PhD, MS https://orcid.org/0000-0001-7314-8709

Supplemental Material: Supplemental material for this article is available online. The authors have provided these supplemental materials to give readers additional information about their work. These materials have not been edited or formatted by Public Health Reports’s scientific editors and, thus, may not conform to the guidelines of the AMA Manual of Style, 11th Edition.

Articles from Public Health Reports are provided here courtesy of SAGE Publications
