AMIA Annual Symposium Proceedings. 2023 Apr 29;2022:279–288.

Supporting COVID-19 Disparity Investigations with Dynamically Adjusting Case Reporting Policies

J Thomas Brown 1, Zhiyu Wan 1, Aris Gkoulalas-Divanis 2, Murat Kantarcioglu 3, Bradley A Malin 1
PMCID: PMC10148367  PMID: 37128430

Abstract

Data access limitations have stifled COVID-19 disparity investigations in the United States. Though federal and state legislation permits publicly disseminating de-identified data, methods for de-identification, including a recently proposed dynamic policy approach to pandemic data sharing, remain unproved in their ability to support pandemic disparity studies. Thus, in this paper, we evaluate how such an approach enables timely, accurate, and fair disparity detection, with respect to potential adversaries with varying prior knowledge about the population. We show that, when considering reasonably enabled adversaries, dynamic policies support up to three times earlier disparity detection in partially synthetic data than data sharing policies derived from two current, public datasets. Using real-world COVID-19 data, we also show how granular date information, which dynamic policies were designed to share, improves disparity characterization. Our results highlight the potential of the dynamic policy approach to publish data that supports disparity investigations in current and future pandemics.

Introduction

The novel coronavirus disease 2019 (COVID-19) pandemic has disproportionately affected segments of society in the United States (U.S.). Black, Hispanic/Latino, and Native American communities have suffered higher risks of infection1, hospitalization2, and mortality3 than other racial and ethnic groups, and the infection fatality rate has exhibited a direct correlation with age4. In recognition of these differential outcomes, researchers and policy makers have sought to quickly identify disparities to inform timely interventions amid an evolving pandemic. For instance, after discovering imbalanced infection and mortality rates between racial and ethnic groups, the state of Michigan increased testing resources and access to primary care physicians for minority subpopulations5. Due in part to these and other policy decisions, from April to November 2020, the percentage of COVID-19 cases in Michigan corresponding to Black residents dropped from 41% to 8%6.

Despite such efforts, the differential impact of COVID-19 remains poorly understood. Attempts to determine potential sources of disparities, including socioeconomic factors and the differential incidence of pre-existing conditions, have been hindered by limited data accessibility3. Notwithstanding informaticians’ significant efforts to develop infrastructure and tools to monitor the spread of COVID-197,8, fewer resources have been allocated to publicly disseminate robust person-level information. Much of the publicly available data in the U.S. have not included racial or ethnic information, and data that do include this information are typically limited to aggregated counts at the state level3,9,10. Though several initiatives have formed patient-level COVID-19 data repositories, such as the National COVID Cohort Collaborative (N3C) of the U.S. National Institutes of Health11 and the COVID-19 Case Surveillance datasets from the Centers for Disease Control and Prevention (CDC)12,13, most of these repositories are either not readily open to the public or do not share data in real time8.

Patient privacy is one of the primary factors limiting person-level COVID-19 data sharing14. When publicly disseminating data, many organizations capturing COVID-19 data may be subject to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule15 and related laws. Though HIPAA, and state laws such as the California Consumer Privacy Act16, permit sharing de-identified data, the process of de-identifying pandemic data is nontrivial. It has been shown that a data recipient can exploit prior knowledge to re-identify individual records from the shared quasi-identifying features (i.e., attributes such as age and race that can, in combination, uniquely represent individuals17). As such, HIPAA provides two alternative methods to achieve de-identification and minimize the re-identification risk. The first, Safe Harbor, specifies 18 direct (e.g., name and residential address) and quasi-identifying features (e.g., geocodes corresponding to fewer than 20,000 residents) that must be removed. However, the Safe Harbor method requires historical data to be shared with an uncertainty period of a year – achieved by generalizing date of event to year of event and imposing a delayed publication schedule – rendering it ineffective for detecting disparate trends in a timely manner15. The second method, Expert Determination, allows data to be de-identified by the application of “generally accepted statistical and scientific principles”15 so that “the risk is very small that the information could be used to identify an individual who is a subject of the information.”18

Following Expert Determination, a method was recently proposed to publicly share de-identified patient-level epidemiological data in near-real time19. Relying on a framework to forecast the privacy risk of sharing the data at different levels of granularity, the approach adaptively generates data generalization strategies according to the influx of new records. In comparison to traditional de-identification methods, the dynamic policy approach maintains the re-identification risk below a threshold based on state and federal standards more frequently, under several adversarial scenarios, when sharing granular date information with consistent updates (e.g., daily or weekly). Though the dynamic policy approach was designed to support pandemic data sharing, its ability to support disparity investigations has yet to be systematically evaluated.

In this paper, we determine how well data shared under the dynamic policy approach enables the detection of disproportionately elevated infection rates within a specific subpopulation. Such COVID-19 disparities have fluctuated longitudinally, emerging and dissipating as subpopulation outbreaks2,3. As such, our evaluation applies an outbreak detection algorithm to measure the timeliness and accuracy at which disparities can be detected. We also evaluate the fairness of detection performance, following the definition in the algorithmic fairness literature, in terms of enabling consistent disparity detection times between regions and subpopulations. We compare several versions of the dynamic policy to policies resembling those applied to two current, publicly available COVID-19 datasets: 1) the CDC’s COVID-19 Case Surveillance Public Use Data with Geography12 and 2) the aggregated case counts that have been used in several disparity investigations20.

Methods

Building upon standards in the outbreak detection literature, this study compares the timeliness and accuracy at which COVID-19 infection disparities can be detected from surveillance data shared under several distinct de-identification policies. This section begins with a description of the five de-identification policies considered in our evaluation, including data format, data availability, and the type of adversary each policy protects against. Next, we describe how we simulate disparities in infectious disease surveillance data, to be used in several of our experiments. We then provide details regarding how we detect disparities in the surveillance data with an outbreak detection algorithm. Finally, we review our experimental design and performance evaluation measures.

Data sharing policies and assumptions

Our analysis focuses on five different data sharing policies and the extent to which they support disparity detection within two Tennessee counties with very different demographic compositions: 1) Davidson County, a relatively large metropolitan region, and 2) Perry County, a relatively rural region. Table 1 displays the counties’ population demographics according to recent estimates from the U.S. Census Bureau21.

Table 1.

County demographics*

Davidson County, TN (n = 626,681) Perry County, TN (n = 7,915)
Race White 385,039 (61.4%) 7,584 (95.8%)
Black 173,730 (27.7%) 119 (1.5%)
Asian 19,027 (3.0%) 14 (0.2%)
AIAN 2,091 (0.3%) 48 (0.6%)
NHPI 394 (0.06%) 0 (0%)
Other 30,757 (4.9%) 30 (0.4%)
Mixed 15,643 (2.5%) 120 (1.5%)
Ethnicity Hispanic/Latino 61,086 (9.7%) 117 (1.5%)
Non-Hispanic 565,595 (90.3%) 7,798 (98.5%)
Age group [0, 10) 82,304 (13.1%) 927 (11.7%)
[10, 20) 72,903 (11.6%) 1,041 (13.2%)
[20, 30) 115,876 (18.5%) 819 (10.3%)
[30, 40) 97,154 (15.5%) 887 (11.2%)
[40, 50) 83,472 (13.3%) 980 (12.4%)
[50, 60) 79,768 (12.7%) 1,192 (15.1%)
[60, 70) 49,803 (7.9%) 1,096 (13.8%)
[70, 80) 26,901 (4.3%) 645 (8.1%)
[80, +] 18,500 (3.0%) 328 (4.1%)
Sex Female 323,141 (51.6%) 3,941 (49.8%)
Male 303,540 (48.4%) 3,974 (50.2%)

*Number of individuals (% of population).

AIAN = American Indian / Alaskan Native, NHPI = Native Hawaiian / Pacific Islander.

A data sharing policy describes the format of the shared dataset, including the granularity at which each quasi-identifying feature is transformed, and the schedule by which the dataset is updated. The quasi-identifying features considered in this study are race, ethnicity, age, sex, county of residence (following the format of the CDC’s surveillance datasets), and date of diagnosis. Each policy is designed to minimize the privacy risk against a potential adversary; i.e., a recipient with certain background knowledge who may attempt to re-identify individual records17. To mitigate re-identification risk, the quasi-identifiers can be converted into a more generalized form (e.g., converting year of birth to 5-year age intervals) to increase the number of records that correspond to each unique combination of quasi-identifying values, or equivalence class22.
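As a concrete illustration of generalization, age coarsening can be sketched as a simple binning function (a hypothetical helper, not from the authors' code; interval width is a parameter):

```python
def generalize_age(age, width=5):
    """Coarsen an exact age into a [low, high) interval of the given
    width, enlarging the equivalence class each record falls into."""
    low = (age // width) * width
    return (low, low + width)
```

For example, `generalize_age(37)` yields the 5-year interval (35, 40), while `generalize_age(37, width=10)` yields (30, 40); wider intervals place more records in each equivalence class.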

Adversarial modeling is critical for policy selection. Assuming too strong an adversary could overestimate the privacy risk and unnecessarily coarsen the data, while assuming too weak an adversary could expose patient identities.23 Here, we consider adversaries who vary in terms of their background knowledge and their motivation to attempt re-identification. We define dynamic policies according to one of two standard privacy risk measures, each considering a different type of adversary. The first is the PK11 risk, defined as the proportion of records in the dataset that reside in an equivalence class of size less than 1119. This equivalence class size is typically incorporated into federal24 and state-level16 guidance. The PK11 risk measures the privacy risk against an adversary who knows an individual is in the dataset and a subset of the individual’s quasi-identifier information. In this setting, the adversary attempts re-identification to learn additional sensitive information included in the patient dataset (e.g., comorbidities25) about a target individual whose identity they already know. The second measure is the marketer risk, which measures the average risk of each record in the dataset in the context of the underlying population26. Each record’s risk is computed as one over the size of the corresponding equivalence class in the population. The marketer risk measure considers an adversary who attempts to re-identify each record in the patient dataset by matching the quasi-identifiers in the shared dataset to those in a separate, identified dataset. A common example of the latter is a voter registration list17,26.
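The two risk measures follow directly from their definitions and can be sketched in a few lines (hypothetical helper names, not the authors' implementation; records and population members are represented as tuples of quasi-identifier values):

```python
from collections import Counter

def pk11_risk(records, k=11):
    """Proportion of records whose equivalence class (unique combination
    of quasi-identifier values) holds fewer than k records."""
    sizes = Counter(tuple(r) for r in records)
    small = sum(n for n in sizes.values() if n < k)
    return small / len(records)

def marketer_risk(records, population):
    """Average re-identification risk: each record contributes one over
    the size of its equivalence class in the underlying population."""
    pop_sizes = Counter(tuple(p) for p in population)
    return sum(1.0 / pop_sizes[tuple(r)] for r in records) / len(records)
```

For instance, a dataset with twelve records in one equivalence class and three in another has a PK11 risk of 3/15 = 0.2, exceeding the 0.01 threshold the dynamic policies enforce.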

To evaluate how disparity detection performance varies when protecting against adversaries of differing strength, we develop a distinct dynamic policy for each of three different adversaries. All three dynamic policies include date of diagnosis and county of residence and are updated on a daily basis. The first dynamic policy, hereafter referred to as the Strong Adversary policy, follows the PK11 policy proposed by Brown, et al19. This policy assumes the adversary knows a target individual’s demographic information and COVID-19 date of diagnosis within a five-day period, accounting for the separation between diagnostic test date and date of confirmed diagnosis. Under this assumption, the data sharer adapts the granularity of the quasi-identifying features on a weekly basis while assuring the forecasted PK11 risk remains at or below 0.01. In essence, the Strong Adversary policy fluctuates data granularity with the infection rate while maintaining weekly quasi-identifier representation.

The Reasonable Adversary policy protects against an adversary who also knows a target individual’s demographic information, but not their diagnosis date. This is likely a more reasonable assumption due to uncertainty regarding a patient’s exact date of diagnosis19,23. In this scenario, the data sharer updates the generalization strategy of all records in the dataset at the end of each week, according to the PK11 risk of sharing the cumulative number of records. This method constrains successive generalization strategies to represent demographic quasi-identifiers with granularity equal to or greater than that of previous strategies. Again, the data sharer chooses strategies according to a PK11 risk threshold of 0.01.

The Marketer Adversary policy protects against an adversary with an identified dataset about a population that does not include diagnosis dates. In this setting, the data sharer updates the generalization strategy applied to all records on a weekly basis, similar to the Reasonable Adversary policy, but they choose generalization strategies according to the marketer risk. The data sharer similarly applies a marketer risk threshold of 0.01. We estimate the marketer risk under the assumption the adversary has an identified dataset that covers every population resident, a worst-case scenario.

We note that, when implementing any dynamic policy, the data sharer can choose generalization strategies that prioritize the granularity of particular features19. For Davidson County, we rely upon generalization strategies that prioritize race and ethnicity granularity. Due to the racial and ethnic homogeneity of its population, for Perry County we rely upon strategies that prioritize age and sex granularity.

The k-anonymous policy resembles that applied to the CDC’s COVID-19 Case Surveillance Public Use Data with Geography12. This policy shares age group (0-17, 18-49, 50-64, and 65+), race (Black, White, Asian, American Indian or Alaskan Native (AIAN), Native Hawaiian or Pacific Islander (NHPI), Multiple/Other), ethnicity (Hispanic-Latino and Non-Hispanic), sex (Female and Male), state and county of residency, and month of diagnosis (as date of diagnosis is considered quasi-identifying). Due to the generalized month of diagnosis, we assume the dataset is updated on the first day of each month. For simplicity and to match the dynamic policy implementation, the k-anonymous policy defined for this investigation differs from the CDC’s policy only in that it does not strategically suppress quasi-identifiers to ensure each equivalence class holds at least 11 records (11-anonymity22). Notably, the CDC’s policy suppresses around 3% of each quasi-identifier to achieve 11-anonymity27.
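A minimal sketch of suppression-based 11-anonymity follows; note this is an assumption-laden simplification (the CDC suppresses individual features strategically, whereas this sketch suppresses whole quasi-identifier tuples, and the pooled suppressed class would itself need a size check in practice):

```python
from collections import Counter

def suppress_to_11_anonymity(records, k=11, missing="*"):
    """Replace the quasi-identifier tuple of any record whose equivalence
    class holds fewer than k records with fully suppressed values.
    Coarse sketch only: real policies suppress per-feature, not per-row."""
    sizes = Counter(records)
    return [r if sizes[r] >= k else (missing,) * len(r) for r in records]
```

After suppression, every remaining non-suppressed equivalence class holds at least 11 records, matching the 11-anonymity property described above.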

The Marginal Counts policy resembles the non-person-level data displayed in state COVID-19 dashboards7 that have been used in several disparity investigations20. Though most racial data have been shared at the state level, for consistency with the other policies, we assume it shares county-level marginal counts for each race, ethnicity, age, and sex value, without preserving joint statistics. For example, the marginal counts for Black race would be the daily counts of all cases corresponding to Black individuals, independent of sex, age, and ethnicity. We assume the dataset shared under this policy is updated on a daily basis. Table 2 summarizes the five de-identification policies’ details.
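The loss of joint statistics under this policy can be illustrated with a small sketch (hypothetical helper; one day's records represented as dicts of feature values):

```python
from collections import Counter, defaultdict

def marginal_counts(day_records, features=("race", "ethnicity", "age_group", "sex")):
    """Collapse one day's person-level records into independent per-value
    counts for each feature, discarding all joint statistics."""
    counts = defaultdict(Counter)
    for rec in day_records:
        for f in features:
            counts[f][rec[f]] += 1
    return {f: dict(c) for f, c in counts.items()}
```

From the output, one can no longer tell, for example, how many Black cases were also in a given age group, which is precisely why this format cannot support multi-feature disparity detection.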

Table 2.

Details of the de-identification policies assessed in this study.

Strong Adversary Reasonable Adversary Marketer Adversary k-anonymous Marginal Counts
Diagnosis date granularity Date Date Date Month Date
Publication schedule Daily Daily Daily Monthly Daily
Demographic generalization Varies between time periods Updated over time Updated over time Fixed Fixed
Format Row-level Row-level Row-level Row-level Daily counts by feature value
Includes comorbidity information Yes Yes Optional Yes No
Assumed worst-case adversarial knowledge Target individual’s demographics and date of diagnosis Target individual’s demographics Identified dataset of population residents Target individual’s demographics and date of diagnosis NA

Simulating surveillance data

Labelling real-world surveillance data for disparities can be both time-consuming and arbitrary, such that outbreak detection is normally evaluated on simulated data28. For our evaluation, we generate partially synthetic data through constrained random sampling. It is partially synthetic in that the number of daily case records is informed by the Johns Hopkins University COVID-19 county-level tracking data29, but how the records distribute across demographic subpopulations is simulated. Since a disparity manifests as an anomalous increase in the number of cases corresponding to a specific demographic subpopulation relative to the subpopulation’s size2,3, the baseline distribution is generated by randomly sampling individuals from the population, without replacement. To simulate a disparity, we disproportionately sample from the affected subpopulation.

Figure 1 depicts the complete simulation process. A disparity is defined by a start date, peak date, duration, and subpopulation affected. In the simulation, all records are randomly sampled without replacement from a representative county population generated from U.S. Census population count data21. We generate the baseline demographic distribution by randomly assigning which county residents are infected on each day leading up to (step 1) and throughout the disparity period (step 2). To simulate a disparity in the specified subpopulation, we first calculate the standard deviation of the subpopulation’s baseline infection rate during the disparity period (step 3). We then generate a log-normal shaped epidemic curve30 (step 4), whose values define the additional proportion of daily cases that need to correspond to the disparity subpopulation. For example, if the curve has a value of 0.2 on a given day, then an additional 20% of the day’s records need to correspond to the disparity subpopulation. We rely upon a log-normal shaped curve, following the standard practice in the literature, to approximate real world epidemic curves28,31. The curve reaches its apex on the peak date, at a value set to four times the standard deviation of the baseline infection rate. This induces a disparity proportional to the subpopulations’ baseline rate, peaking at a 99.9% significance level. In scenarios where no baseline cases correspond to the disparity subpopulation, and the standard deviation is zero, the peak value is set to a proportion value of 0.5. We then randomly replace records within the disparity period that do not belong to the disparity subpopulation with those that do, according to the proportion values defined by the epidemic curve (step 5). Finally, we continue baseline sampling for the remainder of the time series (step 6).
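The epidemic-curve portion of this pipeline (step 4) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the log-normal shape parameter `sigma` is assumed (the paper does not report one), and the curve is scaled so its maximum, at the peak day, equals the target peak proportion.

```python
import math

def disparity_curve(duration, peak_day, peak_value, sigma=0.5):
    """Log-normal-shaped epidemic curve: for each day, the additional
    proportion of case records reassigned to the disparity subpopulation.
    Scaled so the curve's maximum (at peak_day) equals peak_value."""
    def lognorm_pdf(x, mu):
        return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
            x * sigma * math.sqrt(2 * math.pi))
    # Mode of a log-normal is exp(mu - sigma^2); choose mu so the mode
    # falls exactly on peak_day.
    mu = math.log(peak_day) + sigma ** 2
    raw = [lognorm_pdf(d, mu) for d in range(1, duration + 1)]
    scale = peak_value / max(raw)
    return [v * scale for v in raw]
```

For a 45-day disparity peaking on day 10 at proportion 0.2, the curve rises quickly to 0.2 and then decays slowly, matching the shape described above.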

Figure 1.

Figure 1.

The pipeline for simulating disparity data in this study.

As the evaluation emphasizes early disparity detection, all simulated disparities are 45 days in duration with an epidemic curve increasing rapidly to a peak on day 10, before decreasing slowly30. The affected subpopulation is defined as a combination of demographic values the Census provides for race, ethnicity, sex, and age. The definition includes up to one value for each of these four features. Since a disparity typically affects a range of ages instead of an exact age, we transform age into age groups ([0, 10), [10, 20), [20, 30), [30, 40), [40, 50), [50, 60), [60, 70), [70, 80), [80, +]) when simulating and detecting disparities.

Disparity detection

We apply the What’s Strange About Recent Events (WSARE)32 algorithm to detect disparate infection rates. We utilize this algorithm because it applies to categorical, person-level data without requiring large amounts of historical data. Moreover, it has been implemented in several real world settings, including American and Israeli outbreak detection monitoring systems32.

On each date in the dataset, the outbreak detection algorithm searches for the most statistically significant increase in case records using a set of rules. Each rule consists of a single value for one or more covariates. For instance, the algorithm may return an alert indicating an unusually high number of records from October 10, 2020 that correspond to 20 to 30-year-old males. The algorithm uses a greedy search to identify the most anomalous rule through a series of Fisher Exact Tests33, comparing the current time period’s records to baseline records at a user-defined statistical significance threshold. False positives due to multiple hypothesis testing are mitigated via randomization tests. Variations of WSARE (namely, 2.0, 2.5, 3.0) apply different methods for defining baseline records32. In this study, we employ version 2.0 for each de-identification policy because it does not require extensive historical data (which are likely unavailable in novel pandemics). This version of the algorithm creates a baseline from dataset records 35, 42, 49, and 56 days prior to the date of evaluation. To further evaluate the Strong Adversary policy, unique in its non-uniform demographic representation, we additionally apply a variation of version 3.0. Our variation generates a baseline by randomly sampling up to 10,000 county residents from the U.S. Census population statistics.
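A single-feature version of this rule search can be sketched with a hand-rolled one-sided Fisher's exact test. The helper names are hypothetical, and this sketch omits WSARE's multi-feature rules and randomization tests for multiple-hypothesis correction:

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: probability of observing a or
    more 'matching' records today under the hypergeometric null."""
    n_today, matches, total = a + b, a + c, a + b + c + d
    p = sum(comb(matches, x) * comb(total - matches, n_today - x)
            for x in range(a, min(n_today, matches) + 1))
    return p / comb(total, n_today)

def most_anomalous_rule(today, baseline, features):
    """Greedily return the single (feature, value) rule with the smallest
    one-sided p-value, comparing today's records to baseline records.
    Records are dicts mapping feature names to values."""
    best_rule, best_p = None, 1.0
    for f in features:
        for v in {r[f] for r in today}:
            a = sum(r[f] == v for r in today)          # today, matching
            c = sum(r[f] == v for r in baseline)       # baseline, matching
            p = fisher_greater(a, len(today) - a, c, len(baseline) - c)
            if p < best_p:
                best_rule, best_p = (f, v), p
    return best_rule, best_p
```

If today's records are dominated by one racial subpopulation that was rare in the baseline, the returned rule flags that value with a very small p-value, which is the kind of alert WSARE raises.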

Particular details for applying the outbreak detection algorithm to the de-identified versions of the simulated datasets include the following. For the application of WSARE 3.0 to the Strong Adversary policy (referred to as Strong 3.0 in the results), the current day’s generalized records are compared to the census-derived baseline. For the application of version 2.0 to the Strong Adversary policy, we generalize the quasi-identifiers in the evaluation date’s records and the baseline dates’ records to the most general version specified by the set of the generalization strategies applied to those records. For both the Reasonable Adversary and Marketer Adversary policies, we transform the records in the full dataset per the evaluation date’s specified generalization strategy prior to applying the detection algorithm. To standardize our comparison between policies, we convert the k-anonymous policy’s month of diagnosis to date of diagnosis by randomly assigning a date within the month to each record. For the Marginal Counts policy, we consider a single covariate that includes all race, ethnicity, age group, and sex values. Finally, for comparison, we apply WSARE 2.0 to the raw data.

Experimental design

We repeat each experiment for both Davidson and Perry County, to evaluate performance in counties of varying size and diversity. The first, which we call the Broad experiment, evaluates how well each of the de-identification policies enables disparity detection at different significance thresholds. We simulate 50 datasets, each with the same two-feature disparity starting on a different day – every 10 days from May 10, 2020 to September 12, 2021. For Davidson County, the two features are Black race and age between 30 and 40 years old. Likely due to the racial and ethnic homogeneity, we were unable to simulate detectable disparities with a racial or ethnic feature in Perry County. Therefore, for Perry County, the disparity features are Female sex and age between 30 and 40 years old. We apply the outbreak detection algorithm at five different statistical significance thresholds (0.1, 0.05, 0.01, 0.005, 0.001) to each de-identified dataset. We then measure the time to detection, defined as the number of days from the start of the simulated disparity to the first date an alarm is raised with matching demographic features. Note that the detection time accounts for the date the data are made available under each de-identification policy. We consider an alert to be a match if the alarm feature value contains the true value. For instance, if the simulated disparity occurs in the 30 to 40-year-old age group and the data are shared under the k-anonymous policy, an alert for the 18 to 50-year-old age group is considered a match. If the disparity is not detected, we assign a detection time of 90 days, or twice the disparity duration. We also count false positives, defined as any alarm raised during the disparity period that includes none of the correct features, as well as any alarm raised outside the period. Since WSARE 2.0 generates a baseline from records occurring up to 56 days prior to the evaluation date, we do not count false positives (for any WSARE implementation) prior to day 56 or during the first 56 days following the simulated disparity, because a representative baseline cannot be acquired.
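The matching rule for generalized alerts can be sketched as an interval-containment check (a hypothetical helper; age groups represented as (low, high) tuples and categorical features as strings):

```python
def alarm_matches(alarm_value, true_value):
    """An alert matches when the (possibly generalized) alarm value
    contains the true value: interval containment for age groups,
    exact equality for categorical features."""
    if isinstance(alarm_value, tuple) and isinstance(true_value, tuple):
        return alarm_value[0] <= true_value[0] and true_value[1] <= alarm_value[1]
    return alarm_value == true_value
```

Under this rule, a k-anonymous alert for ages (18, 50) matches a simulated disparity in the (30, 40) age group, whereas an alert for (0, 18) does not.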

Next, the Fairness experiment evaluates how de-identification policies may bias timely detection between subpopulations. In this context, the more consistent the disparity detection times between subpopulations, the fairer we consider the performance. We simulate 10 datasets with a single-feature disparity for each race, ethnicity, age group, and sex value. There is one dataset for each of 10 dates spread across COVID-19’s multiple waves. We apply the outbreak detection algorithm to search for the most significant single feature increase exceeding a significance threshold of 0.05. We measure bias, or the lack of fairness, between subpopulations by calculating the standard deviation across subpopulations’ average detection times. A smaller standard deviation indicates fairer disparity detection. We calculate standard deviations across all subpopulations, evaluating overall fairness.
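This bias measure can be written directly from its definition (a sketch; we assume the population standard deviation here, as the paper does not specify sample vs. population form):

```python
from statistics import mean, pstdev

def detection_time_bias(times_by_subpop):
    """Standard deviation across subpopulations' average detection times,
    given a mapping from subpopulation to its per-simulation detection
    times; smaller values indicate fairer (more consistent) detection."""
    return pstdev(mean(times) for times in times_by_subpop.values())
```

Two subpopulations detected after 10 days each yield a bias of 0, while averages of 0 and 20 days yield a bias of 10, reflecting less consistent detection.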

Finally, we supplement our evaluation with a case study to detect racial and ethnic disparities in real-world COVID-19 data shared under two different policies. We apply WSARE 2.0 to both the CDC’s COVID-19 Case Surveillance Public Use Data with Geography12 and COVID-19 Case Surveillance Restricted Access Detailed Data13 at a significance threshold of 0.05. The former’s policy is defined above. The latter’s policy shares combined race and ethnicity, 10-year age group, sex, and date of diagnosis. The Restricted Access dataset requires the data user to complete a registration process and sign a data use agreement. The dataset achieves 5-anonymity with respect to race and ethnicity, age group, and sex values and 2-diversity34 with respect to date of diagnosis through suppression. As the two datasets are updated on similar schedules, our evaluation visualizes the dates for which the outbreak detection algorithm raises an alarm instead of measuring detection times. Note, month of diagnosis is converted to date of diagnosis in the Public Use Data in the same manner as for the k-anonymous policy.

Code availability

All experiments are performed using Python (version 3.10). The code for our experiments can be found at https://github.com/vanderbiltheads/PandemicDataPrivacy.

Results

Broad Experiment

Our first experiment broadly evaluates how well the de-identification policies enable disparity detection, for both Davidson and Perry County. We average the detection times and false positives for each policy at each significance threshold to create Activity Monitoring Operating Characteristic (AMOC) curves, in which the optimum is in the bottom left-hand corner. A larger p-value threshold tends to decrease the detection time while increasing the false positive rate; a stricter (smaller) threshold has the opposite effect. Figures 2 and 3 present the AMOC curves.

Figure 2.

Figure 2.

AMOC curves for Davidson County, TN, for detecting at least one of the simulated disparity features (left) and both features (right). Each point is the average of 50 different experiments.

Figure 3.

Figure 3.

AMOC curves for Perry County, TN, for detecting at least one of the simulated disparity features (left) and both features (right). Each point is the average of 50 different experiments.

For Davidson County, the Reasonable and Marketer Adversary dynamic policies enable the shortest times to detect at least one and both simulated disparity features. The Marginal Counts policy provides comparable detection times for detecting only one disparity feature. Neither Strong Adversary policy implementation supports detection of both features. However, both implementations enable, on average, similar detection times to the k-anonymous policy while generating fewer false positives. This is because the k-anonymous policy’s monthly publication schedule delays the time to detection, even though the k-anonymous policy detects the simulated disparities more frequently.

For Perry County, the Marketer Adversary policy enables the earliest detection of at least one disparity feature, followed by the Reasonable Adversary and k-anonymous policies. No policy enabled the detection of both disparity features, producing average detection times of 90 days.

Fairness Experiment

To ensure a de-identification policy fairly supports disparity detection across subpopulations and counties, we evaluate the consistency with which each policy supports early disparity detection. Across all subpopulations, each policy supports earlier disparity detection in Davidson County than Perry County. For both counties, the k-anonymous policy enables the most consistent detection times, with standard deviations in the average detection time across all demographic subpopulations of 14.2 and 11.2 days in Davidson and Perry County, respectively. However, the detection times are longer than those for the Reasonable and Marketer Adversary dynamic policies. These policies enable the algorithm to detect the disparities nearly three times faster than the k-anonymous policy in Davidson County. The Reasonable and Marketer Adversary policies also support relatively consistent detection times across subpopulations, except for AIAN and NHPI, the two smallest subpopulations in Davidson County.

The longer detection times for the NHPI and AIAN subpopulations in Davidson County and the White subpopulation in Perry County suggest it is difficult to detect outbreaks in super-minority and super-majority subpopulations. However, the k-anonymous policy enables better performance than the raw data for the super-minority populations. This is likely because the k-anonymous policy aggregates records by month of diagnosis, providing a higher-level representation of the underlying trend.

Case Study

Using the CDC’s Restricted Access COVID-19 Surveillance dataset, we find the daily proportion of records corresponding to each racial/ethnic subpopulation fluctuates over time. For instance, a disparity in the Hispanic-Latino subpopulation emerges early in the pandemic in Davidson County (Figure 4). Notably, the real-world data contain many unknown and suppressed values. The outbreak detection algorithm raises many alerts for the “Unknown” race subpopulation in both counties. Furthermore, Perry County has so few records that the Public Use dataset did not contain any records explicitly corresponding to it; we assume the records’ county of residence value was suppressed. As highlighted in Davidson County, the availability of date of diagnosis in the Restricted Access dataset enables granular disparity detection and characterization. In contrast, alerts generated from the Public Use dataset are less accurate and less consistent.

Figure 4.

Davidson County, TN case study. (Top) Daily proportion of COVID-19 cases by race/ethnicity, according to CDC’s COVID-19 Case Surveillance Restricted Access Detailed Data13. (Bottom) WSARE 2.0 results from both Restricted Access Data and CDC’s COVID-19 Case Surveillance Public Use Data with Geography12 at the 0.05 significance threshold.

Discussion and Conclusions

To support COVID-19 disparity investigations, we evaluate how accurately, quickly, and fairly disparate subpopulation outbreaks can be detected from data shared under three different dynamic de-identification policies and two policies derived from current public datasets. The results suggest that in larger, more heterogeneous populations like Davidson County, TN, the Reasonable and Marketer Adversary dynamic policies enable better disparity detection performance, for single- and double-feature disparities, than the other policies. The k-anonymous policy’s generalization of date of diagnosis hinders the ability to detect more specific, multi-feature disparities while generating more false positives. Though the policy can support accurate detection of large single-feature disparities, its monthly data publication schedule markedly delays time to detection. The Marginal Counts policy’s lack of joint statistics prevents the detection of more than one demographic feature defining the disparity. In smaller, more homogeneous populations like Perry County, TN, disparity detection is generally more challenging. Though the Reasonable and Marketer Adversary policies enable earlier detection times than the other policies, they do not enable simultaneous detection of multiple disparity features. Finally, even though the WSARE 3.0 implementation of the Strong Adversary policy outperforms the 2.0 implementation, the policy still provides suboptimal detection performance for both counties.

The fairness in detection performance supported by each de-identification policy is more nuanced. In terms of consistently producing early detection times in Davidson County, the Reasonable and Marketer Adversary policies outperform the others. However, these policies, and even the raw data, could not detect disparities in super-minority populations as well as the k-anonymous policy. This finding seems counter-intuitive, but it follows similar findings in previous studies in which generalization dampened the noise in the data to improve downstream application performance35. Our results suggest the aggregation of case counts by month, with subsequent random disaggregation to regain diagnosis date, enables the k-anonymous policy to better detect disparities in super-minority populations. However, this should not be interpreted as discrediting the other policies’ utility in such scenarios. Data with daily temporal resolution could be aggregated downstream by the data user to improve performance. A data sharing policy that preemptively aggregates the data’s temporality, on the other hand, limits the data user’s flexibility. As evidenced in the Davidson County case study, such aggregation can disrupt the granularity and accuracy with which a disparity can be detected and characterized.

We evaluate several dynamic policies, each designed to meet a privacy risk threshold against adversaries with different types of background knowledge. We do not, however, advocate for which policy should be implemented. Rather, our results highlight the importance of adversarial modeling in data sharing policy development and selection. If the adversary does not know (or cannot know) the COVID-19 diagnosis date of a target individual, the data sharer has the potential to share more granular information under the Reasonable or Marketer Adversary policies. If the adversary can reasonably obtain such information, the Strong Adversary and k-anonymous policies provide better privacy protection. The difference in disparity detection performance between these two groups highlights the need to investigate the likelihood that an adversary can know the date of diagnosis, or even the complete demographic information23.

Despite the merits of this investigation, we wish to highlight several limitations that can guide future extensions of our work. First, our evaluation measures the ability to detect a disparity without quantifying how accurately the disparity is represented by the data sharing policy. Though the data representation may be sufficient for accurate detection, it is likely that the data sharing policies distort disparity features (e.g., severity or duration). Moreover, our simulated data does not consider potential simultaneous disparities in multiple subpopulations. Future work should consider more complex disparities and quantify how well data sharing policies preserve their features.

Second, our experiments using simulated data do not consider the effect of suppressing values (to achieve k-anonymity privacy guarantees22) and missing data on disparity detection. We illustrate their potential impact in the case studies, but do not quantify the results. Future work should quantify the robustness of the policies’ performance under suppression and varying levels of missingness.
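Suppression of this kind is typically threshold-based. A hedged sketch, assuming a small-cell suppression rule similar in spirit to common cell-size policies (the threshold of 11 here is illustrative), shows how suppression can erase the very cells where a super-minority disparity would appear:

```python
def suppress(counts, k=11):
    """Replace subpopulation counts below k with None (suppressed)."""
    return {group: (n if n >= k else None) for group, n in counts.items()}

# Hypothetical case counts for one county and reporting period.
counts = {"White": 240, "Black": 35, "AIAN": 4, "NHPI": 2}
print(suppress(counts))  # AIAN and NHPI cells are suppressed
```

A downstream disparity analysis would then see no signal at all for the suppressed subpopulations, regardless of the detection algorithm used.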

Third, our evaluation relies on a single outbreak detection algorithm. It is possible that other outbreak detection algorithms could improve performance and fairness. Moreover, different statistical methods, such as regression3, could be used to identify temporal disparities. Future work should apply alternative algorithms and methods to more broadly evaluate the data sharing policies’ ability to preserve underlying disparities.

Fourth, we focus our evaluation on disparities within counties while only briefly comparing performance between two counties. The difference in performance between Davidson and Perry County suggests all five data sharing policies are inconsistent in terms of providing similar disparity detection performance between counties. Future work should extend our standardized and systematic approach to analyze performance differences between all counties in a state or country.

Finally, we acknowledge our investigation does not adequately address the full impact of de-identification methods on minority subpopulations. As illustrated by the policies’ inability to detect disparities within super-minority populations, our inability to generate detectable racial disparities in Perry County’s Broad experiment, and the absence of explicitly labeled Perry County records in the CDC’s Public Access surveillance dataset, de-identification may be unfair in more ways than one. De-identification methods, by design, target the quasi-identifiers of the most unique individual records, which tends to distort minority subpopulations’ representation in the dataset more than that of majority subpopulations. Consequently, de-identification could mask disparate trends in already marginalized communities, preventing their detection in the disseminated data and further exacerbating differential outcomes. Alternatively, de-identification methods could be tailored to allow fair representation, but this would likely expose minorities disproportionately to re-identification, if they are not already so exposed. This complex tradeoff between fair privacy and fair utility has received limited attention36,37, despite its pervasive influence and significant implications in our work and beyond, and deserves greater attention from the broader community moving forward. After better examining this tradeoff, dynamic policies could be designed to balance generalization strategies’ privacy protection, disparity detection utility, and corresponding fairness measures.

In conclusion, we show that when protecting against a potential adversary of reasonable strength – an adversary who, at most, knows a target individual’s demographic information – dynamic policy de-identification enables timely publication of person-level data that preserves evidence of underlying disparities better than current public datasets. As such, dynamic policy de-identification has the potential to support the detection and characterization of disparities, and the investigation of their sources, in current and future pandemics.

Acknowledgements

This research was sponsored in part by grants from the National Science Foundation (CNS2029651 and CNS2029661) and National Institutes of Health (T15LM007450).

Figures & Table

Figure 5.

Perry County, TN case study. (Top) Daily proportion of COVID-19 cases by race/ethnicity, according to CDC’s COVID-19 Case Surveillance Restricted Access Detailed Data13. (Bottom) WSARE 2.0 results from both Restricted Access Data and CDC’s COVID-19 Case Surveillance Public Use Data with Geography12 at the 0.05 significance threshold.

Table 3.

Average time to detect, in days, disparities in each single-feature subpopulation. Davidson County’s larger and more diverse population enables better disparity detection.


References

  • 1.Webb Hooper M, Nápoles AM, Pérez-Stable EJ. COVID-19 and Racial/Ethnic Disparities. JAMA. 2020 Jun 23;323(24):2466–7. doi: 10.1001/jama.2020.8598. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Romano SD, Blackstock AJ, Taylor EV, El Burai Felix S, Adjei S, Singleton CM, et al. Trends in racial and ethnic disparities in COVID-19 hospitalizations, by region — United States, March–December 2020. MMWR Morb Mortal Wkly Rep. 2021 Apr 16;70(15):560–5. doi: 10.15585/mmwr.mm7015e2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.McLaren J. Racial disparity in COVID-19 deaths: seeking economic roots with census data. The BE Journal of Economic Analysis & Policy. 2021 Jul 1;21(3):897–919. [Google Scholar]
  • 4.Levin AT, Hanage WP, Owusu-Boaitey N, Cochran KB, Walsh SP, Meyerowitz-Katz G. Assessing the age specificity of infection fatality rates for COVID-19: systematic review, meta-analysis, and public policy implications. Eur J Epidemiol. 2020 Dec 1;35(12):1123–38. doi: 10.1007/s10654-020-00698-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Parpia AS, Martinez I, El-Sayed AM, Wells CR, Myers L, Duncan J, et al. Racial disparities in COVID-19 mortality across Michigan, United States. EClinicalMedicine. 2021;33:100761. doi: 10.1016/j.eclinm.2021.100761. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Keating D, Cha AE, Florit G. ‘I just pray God will help me’: Racial, ethnic minorities reel from higher covid-19 death rates. Washington Post. 2020 Nov. p. 20.
  • 7.Dixon BE, Grannis SJ, McAndrews C, Broyles AA, Mikels-Carrasco W, Wiensch A, et al. Leveraging data visualization and a statewide health information exchange to support COVID-19 surveillance and response: Application of public health informatics. JAMIA. 2021 Jul 1;28(7):1363–73. doi: 10.1093/jamia/ocab004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Madhavan S, Bastarache L, Brown JS, Butte AJ, Dorr DA, Embi PJ, et al. Use of electronic health records to support a public health response to the COVID-19 pandemic in the United States: a perspective from 15 academic medical centers. JAMIA. 2021 Feb 1;28(2):393–401. doi: 10.1093/jamia/ocaa287. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Benitez J, Courtemanche C, Yelowitz A. Racial and ethnic disparities in COVID-19: evidence from six large cities. J Econ Race Policy. 2020 Dec 1;3(4):243–61. doi: 10.1007/s41996-020-00068-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Maybank A. Why racial and ethnic data on COVID-19’s impact is badly needed. American Medical Association. 2020 Apr 8.
  • 11.Haendel MA, Chute CG, Bennett TD, Eichmann DA, Guinney J, Kibbe WA, et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. JAMIA. 2020 Aug 17;28(3):427–43. doi: 10.1093/jamia/ocaa196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case Surveillance Public Use Data with Geography (dataset access date: August 1, 2021) Available from: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4.
  • 13.Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case Surveillance Restricted Data Access, Summary, and Limitations (dataset access date: August 1, 2021) Available from: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Restricted-Access-Detai/mbd7-r32t.
  • 14.Maxmen A. Why the United States is having a coronavirus data crisis. Nature. 2020 Aug 25;585(7823):13–4. doi: 10.1038/d41586-020-02478-z. [DOI] [PubMed] [Google Scholar]
  • 15.Office for Civil Rights. Summary of the HIPAA Privacy Rule. HHS.gov
  • 16.California Consumer Privacy Act (CCPA). State of California - Department of Justice - Office of the Attorney General. 2018.
  • 17.Sweeney L. Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. 2000.
  • 18.Office for Civil Rights. The HIPAA Privacy Rule. HHS.gov. 2008.
  • 19.Brown JT, Yan C, Xia W, Yin Z, Wan Z, Gkoulalas-Divanis A, et al. Dynamically adjusting case reporting policy to maximize privacy and public health utility in the face of a pandemic. JAMIA. 2022 Feb 19;29(5):853–63. doi: 10.1093/jamia/ocac011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Gross CP, Essien UR, Pasha S, Gross JR, Wang S yi, Nunez-Smith M. Racial and Ethnic Disparities in Population-Level Covid-19 Mortality. J Gen Intern Med. 2020 Oct;35(10):3097–9. doi: 10.1007/s11606-020-06081-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.The United States Census Bureau. Population Census Tables (dataset access date: July 1, 2020) Available from: https://www.census.gov/data/datasets/2010/dec/summary-file-1.html.
  • 22.Sweeney L. k-anonymity: a model for protecting privacy. Int J Unc Fuzz Knowl Based Syst. 2002 Oct 1;10(05):557–70. [Google Scholar]
  • 23.Xia W, Liu Y, Wan Z, Vorobeychik Y, Kantarcioglu M, Nyemba S, et al. Enabling realistic health data re-identification risk assessment through adversarial modeling. JAMIA. 2021 Jan 15;28(4):744–52. doi: 10.1093/jamia/ocaa327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.ResDAC, Centers for Medicare & Medicaid Services, Cell Size Suppression Policy
  • 25.Sanyaolu A, Okorie C, Marinkovic A, Patidar R, Younis K, Desai P, et al. Comorbidity and its Impact on Patients with COVID-19. SN Compr Clin Med. 2020 Jun 25. pp. 1–8. [DOI] [PMC free article] [PubMed]
  • 26.Dankar FK, El Emam K. Proceedings of the 1st International Workshop on Data Semantics - DataSem ’10. Lausanne, Switzerland: ACM Press; 2010. A method for evaluating marketer re-identification risk; p. 1. [Google Scholar]
  • 27.Lee B, Dupervil B, Deputy NP, Duck W, Soroka S, Bottichio L, et al. Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use. Public Health Rep. 2021 Jun 17;136(5):554–61. doi: 10.1177/00333549211026817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Zhou H, Burkom H, Winston CA, Dey A, Ajani U. Practical comparison of aberration detection algorithms for biosurveillance systems. JBI. 2015 Oct 1;57:446–55. doi: 10.1016/j.jbi.2015.08.023. [DOI] [PubMed] [Google Scholar]
  • 29.Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020 May;20(5):533–4. doi: 10.1016/S1473-3099(20)30120-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Lotze T, Shmueli G, Yahav I. Simulating multivariate syndromic time series and outbreak signatures. Robert H Smith School Research Paper No RHS-06-054. 2007 May 1.
  • 31.Sartwell PE. The distribution of incubation periods of infectious disease. Am J Epidemiol. 1995 Mar 1;141(5):386–94. doi: 10.1093/oxfordjournals.aje.a117440. [DOI] [PubMed] [Google Scholar]
  • 32.Wong WK, Moore A, Cooper G, Wagner M. What’s Strange About Recent Events (WSARE): an algorithm for the early detection of disease outbreaks. JMLR. 2005;6(66):1961–98. [Google Scholar]
  • 33.Good PI. 2nd ed. New York: Springer; 2000. Permutation tests : a practical guide to resampling methods for testing hypotheses. (Springer series in statistics) [Google Scholar]
  • 34.Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. 22nd International Conference on Data Engineering (ICDE’06) 2006. pp. 24–24.
  • 35.Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. JAMIA. 2013;20(1):84–94. doi: 10.1136/amiajnl-2012-001012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Wan Z, Vorobeychik Y, Xia W, Liu Y, Wooders M, Guo J, et al. Using game theory to thwart multistage privacy intrusions when sharing data. Science Advances. 2021;7(50):eabe9986. doi: 10.1126/sciadv.abe9986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Xu H, Zhang N. Implications of data anonymization on the statistical evidence of disparity. Management Science. 2021;68(4):2600–18. [Google Scholar]

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
