Journal of the American Medical Informatics Association: JAMIA. 2022 Feb 19;29(5):853–863. doi: 10.1093/jamia/ocac011

Dynamically adjusting case reporting policy to maximize privacy and public health utility in the face of a pandemic

J Thomas Brown 1, Chao Yan 2,3, Weiyi Xia 4, Zhijun Yin 5,6, Zhiyu Wan 7,8, Aris Gkoulalas-Divanis 9, Murat Kantarcioglu 10, Bradley A Malin 11,12,13
PMCID: PMC9006705  PMID: 35182149

Abstract

Objective

Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real time sharing of person-level surveillance data.

Materials and Methods

The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.01. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.01.

Results

When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%.

Conclusion

Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.

Keywords: data sharing, privacy, forecasting, infectious disease, simulation

INTRODUCTION

The novel coronavirus 2019 (COVID-19) pandemic has put a spotlight on infectious disease surveillance systems1 and the importance of making such information widely accessible.2 Sharing surveillance data in a timely manner can support a wide variety of public health research endeavors (eg, from modeling disease transmissibility to simulating interventions3–6) and provide the public with situational awareness of outbreaks.4,7,8 In recognition of such benefits, over the past year and a half, various organizations have worked to broaden access to large epidemiological datasets. Recent instantiations of COVID-19 initiatives include the National COVID Cohort Collaborative of the U.S. National Institutes of Health,9 the Datavant COVID-19 Research Database,10 the Centers for Disease Control and Prevention’s (CDC) COVID-19 Case Surveillance datasets,11–13 and the Global.health data science initiative,14 among others.

While advances in surveillance have spurred rapid growth in the volume and diversity of epidemiological resources, public data sharing on a wide scale remains limited.15 This is due to numerous social and political factors, but it is evident that privacy is a core driving factor. In the United States, for instance, infectious disease data are captured by a variety of organizations, such as public health authorities, hospitals, and pharmacies. In regard to public data dissemination, such organizations may be subject to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and related laws and policies. Under HIPAA, an organization is permitted to publicly share patient-level data only when it is deidentified, that is, when “there is no reasonable basis to believe that the information can be used to identify an individual.”16 Even when organizations are not covered by HIPAA, they may be permitted to share data in a deidentified form as well. For example, the California Consumer Privacy Act, the Virginia Consumer Data Protection Act, and the Colorado Privacy Act provide exemptions for sharing deidentified data.17–19 However, transforming data into a deidentified form is a nontrivial endeavor. Numerous demonstration attacks have shown that, with the right background knowledge, a data recipient can leverage residual information in the records to reidentify the individuals to whom the data correspond.20–25 Concerns over such intrusions to anonymity have discouraged organizations from sharing data,26,27 which raises the importance of the question: How can organizations best comply with regulatory requirements while making surveillance data publicly available?

Under HIPAA, deidentification can be satisfied through two alternative implementations. The first is Safe Harbor, which requires the suppression of 18 direct (eg, patient name) and quasi-identifying features (eg, geocodes with populations smaller than 20,000 residents). However, Safe Harbor requires hiding epidemiologically critical factors, such as reducing the granularity of dates of events to their year, which renders such a policy useless for characterizing infectious disease transmission. The alternative is Expert Determination, which indicates data are deidentified when “the risk is very small that the information could be used to identify an individual who is a subject of the information.”28 Various methods for risk assessment have been developed, including those previously developed for surveillance data,29 but provide limited guidance on adapting policies to the needs of the moment. Rather, they are retrospective in nature in that they assume data have already been collected and are ready for dissemination. Most methods further assume the number of records in the dataset remains fixed.30 These assumptions differ from the requirements of case reporting while in the face of a pandemic. Moreover, waiting to publish the data will hinder the ability to characterize the current state and evolution of an outbreak.1,2,31,32 The infection rate must also be considered in the deidentification approach as it directly and dynamically influences the number of records in the dataset. Furthermore, several factors affect the privacy risk, including the demographics of the people infected20,22 and the geolocations to which the pandemic spreads.33,34 These requirements motivate the need for methods that forecast surveillance data.

In this paper, we introduce an approach to adaptively generate policies to publicly share deidentified patient-level epidemiological data. The framework simulates disease cases to estimate the longitudinal privacy risk of sharing infected individuals’ quasi-identifier information at different levels of granularity in the absence of actual patient data. Periodically adjusting the policy allows the data sharer to adapt data granularity according to the influx of new patient records, while simultaneously allowing periods of consistent quasi-identifier representation. We specifically apply the framework to illustrate how policies could be developed to share COVID-19 patient health information and compare such policies to a more traditional deidentification approach relying on retrospective risk assessment. Furthermore, to be consistent with the CDC’s current practice of using generalization and suppression for privacy,13 we use the framework to explore a wide range of data generalization policies.

It should be recognized that the framework applies to any type of epidemiological disease spread, adjusts for the demographic diversity of individual US counties, and relies on public data sources. The framework can also be reused to address emerging data sharing needs, such as for vaccine registries.35,36 Dynamically adapting data sharing policies holds the potential to consistently share more data with the public in a timely and privacy-preserving manner, fueling our data-driven response to infectious disease.8

MATERIALS AND METHODS

Due to the challenge of predicting exactly who will be infected, prospectively fixing a data sharing policy requires probabilistic risk assessment. Our framework provides longitudinal privacy risk estimates for a data generalization policy within a specified geographic region. Given the appropriate population statistics, the framework can utilize any geographic level of detail (eg, state, county, or ZIP code). In this research, we apply the framework to simulate disease spread on a county level to match the format of the COVID-19 surveillance data made accessible by the CDC.11,12 In this section, we summarize the framework’s features and its application to contextualize the results. Specific technical details are provided in the Supplementary Information.

Privacy risk estimation framework

Figure 1 summarizes the framework. In the first step, we select a data generalization policy, which defines the generalization of each quasi-identifying feature considered. In this paper, we consider basic demographic features and the date of diagnosis as quasi-identifiers, as they are typical features organizations have been requested to share (Table 1). The second step generates the county-level population across the quasi-identifying features per the selected policy. We use population count data from the U.S. Census Bureau to calculate the number of people in the county that fall into each demographic group,37 where each group is defined by a unique combination of quasi-identifier values, excluding date of diagnosis.

Figure 1.


Privacy risk estimation framework. The curved rectangles represent processes, the cylinders represent data, and the hexagons represent user-defined parameters. The algorithm that performs the processes within the black box is the core of the proposed framework, employs Monte Carlo random sampling, and is presented in greater detail in the Methods section. To obtain the privacy risk distributions, the simulation is repeated n times. The circled numbers denote the framework steps.

Table 1.

The quasi-identifiers considered in this study

| Field | Generalization strategy | Generalization example |
|---|---|---|
| State of residence | None (a) | NA |
| County of residence | None (a) | NA |
| Date of diagnosis | Combine into week ranges (Sunday–Saturday) (b) | 01/05/21 → 01/03/21–01/09/21 |
| Year of birth | Convert to age ranges | 1980 → 40–45 years old |
| Sex | Nullify value | Female → null, Male → null |
| Race | Combine race groups | AIAN → AIAN or PI, PI → AIAN or PI |
| Ethnicity | Nullify value | Hispanic-Latino → null, Non-Hispanic → null |

Note: The middle column describes the generalization strategy for each quasi-identifier. The third column provides an example generalization for each quasi-identifier. In the case of sex and ethnicity, the information is either included or null.

Abbreviations: AIAN: American Indian/Alaskan Native; PI: Pacific Islander.

(a) These values cannot be generalized since we simulate on a county level.

(b) This definition of a week is consistent with the one used by the CDC’s COVID-19 case forecasts.38

The third step applies a Monte Carlo simulation (represented by the black box in Figure 1) to generate synthetic patient datasets using the county-level population distribution and a time series of new disease case counts. The time series’ periodicity defines the frequency at which the updated dataset is released (eg, every day or every week). To simulate the COVID-19 pandemic, we input time series derived from the Johns Hopkins COVID-19 tracking data.39 The simulation algorithm (details of which are in the Supplementary Information) initially assumes that no one in the county is infected. Then, for each time point, we randomly sample the number of disease cases (without replacement) from the uninfected population to form the newly reported patient dataset. The framework assumes individuals are not reinfected (for simplicity, considering a potentially negligible COVID-19 reinfection rate40) and assumes equal weighting across all individuals when sampling (to model the general uncertainty of disease spread, particularly in pandemics41).
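
As a rough illustration of the sampling step described above, the following Python sketch draws each period's new cases uniformly at random, without replacement, from the residents who are not yet infected. The function name `simulate_cases`, the group labels, and the counts are hypothetical and are not taken from the authors' repository; the actual algorithm is specified in the Supplementary Information.

```python
import random
from collections import Counter


def simulate_cases(group_sizes, new_cases_per_step, seed=None):
    """Sample new cases without replacement from the uninfected population.

    group_sizes: dict mapping a demographic group label to its population count.
    new_cases_per_step: list of new case counts, one per release period.
    Returns one Counter per period with the newly infected count per group.
    """
    rng = random.Random(seed)
    # One entry per uninfected resident, labelled by demographic group.
    uninfected = [g for g, n in group_sizes.items() for _ in range(n)]
    rng.shuffle(uninfected)

    releases = []
    for n_cases in new_cases_per_step:
        # Assumption: never infect more people than remain uninfected.
        n_cases = min(n_cases, len(uninfected))
        # Popping from a shuffled list is an equal-weight draw without
        # replacement, so nobody is infected twice.
        drawn = [uninfected.pop() for _ in range(n_cases)]
        releases.append(Counter(drawn))
    return releases
```

Materializing one list entry per resident keeps the sketch simple; for very large populations, a per-group hypergeometric draw would likely be more efficient.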

The algorithm computes the reidentification risk on the patient set at each time point, according to a specified risk measure. There are various methods for measuring privacy risk.30 In this work, we measure risk as the proportion of individuals in the dataset that fall into a group of size less than k, where each group is defined by a unique set of quasi-identifier values.42,43 We refer to this measure as the PK risk and evaluate it given a set of k values (as defined below) consistent with the standard thresholds used by public health authorities.44–48 The PK risk assumes a data recipient knows (1) an individual is a member of the dataset, (2) the individual’s name and quasi-identifying information, and (3) the individual’s relative date of diagnosis for the disease of interest. In this scenario, the data recipient attempts reidentification to learn the target individual’s sensitive information from additional features included in the dataset (eg, comorbidities49,50). The more unique the record’s representation, the more likely the data recipient can reidentify the individual.20,22 In this research, we focus on this risk measure to follow the CDC’s application of k-anonymization.51 The PK risk effectively measures the proportion of records that fail to achieve k-anonymity.
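
The PK risk defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `pk_risk` is a hypothetical name, each record is assumed to be a hashable tuple of already generalized quasi-identifier values, and returning 0.0 for an empty dataset is an assumption made here for convenience.

```python
from collections import Counter


def pk_risk(records, k=11):
    """Proportion of records whose quasi-identifier group has fewer than k members."""
    if not records:
        return 0.0  # assumption: an empty release carries no risk
    group_sizes = Counter(records)
    # Every record in a group smaller than k counts toward the risk.
    at_risk = sum(size for size in group_sizes.values() if size < k)
    return at_risk / len(records)
```

For example, a dataset with 11 records in one group and 3 records in another yields a PK11 of 3/14, since only the 3-record group is smaller than 11.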

In practice, obtaining such patient information is difficult.23,52 Thus, evaluating the PK risk provides an upper bound of reidentification risk for the dataset. To demonstrate the approach’s flexibility and to offer a different perspective on privacy risk, we further analyze the amortized reidentification risk, referred to in the Discussion as the marketer risk,53 in the Supplementary Information. This risk measure relaxes assumptions (1) and (3) and considers the scenario in which the data recipient is motivated to reidentify as many patients as possible to learn who has the infectious disease of interest.

We highlight that, when applying the PK risk measure, we assume the attacker knows the diagnosis occurred within a lagging period of time (eg, within 1, 3, or 5 days prior to the documented date). We allow this flexible assumption as it is unlikely a data recipient knows the targeted individual’s exact diagnosis date,54 particularly when the time from a diagnostic test to case report extends beyond 1 day. The group corresponding to an individual contains all patients in the simulated patient set that match the individual on the demographic features, with a diagnosis date falling within the lagging period.
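
The lagging period changes only how a record's group is formed. The sketch below, under stated assumptions, evaluates the PK risk of the records diagnosed on a given release day. It assumes integer day indices and a window that covers the release day and the preceding `lag_days - 1` days; the name `pk_risk_with_lag` and the data layout are hypothetical.

```python
from collections import Counter


def pk_risk_with_lag(records, release_day, lag_days=5, k=11):
    """PK risk of the records diagnosed on release_day under a lagging period.

    records: list of (day_index, demographics) tuples.
    A record's group holds every record with the same demographics whose
    day falls in [release_day - lag_days + 1, release_day].
    """
    window_start = release_day - lag_days + 1
    in_window = Counter(
        demo for day, demo in records if window_start <= day <= release_day
    )
    todays = [demo for day, demo in records if day == release_day]
    if not todays:
        return 0.0  # assumption: no new records means no risk for this release
    at_risk = sum(1 for demo in todays if in_window[demo] < k)
    return at_risk / len(todays)
```

A longer lagging period enlarges each group, so the same records carry a lower (or equal) PK risk than under a 1-day assumption.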

The final step of the framework uses the privacy risk distributions to estimate when the policy meets a privacy risk threshold. Computing the longitudinal privacy risk estimates under several data sharing policies for the same county identifies which policies likely meet the threshold at each point in the time series. The data sharer can then choose which policy to apply according to information priorities (eg, prioritizing age granularity over sex granularity).

Dynamic policy search

To dynamically adapt policies according to an expected infection rate, we identify policies that are likely to satisfy a specific PK risk threshold at varying volumes of new case records. For this policy search, we choose a k of 11, which is a typical group size incorporated into guidance issued at the state45–48 and federal44 level. It is also the group size applied to CDC’s COVID-19 Public Use Data with Geography.11 We henceforth refer to the PK risk when k is equal to 11 as the PK11 risk. In this paper, we search for policies that meet a PK11 threshold of 0.01; ie, the percentage of records falling into a demographic group of size 10 or smaller should be less than or equal to 1%. Similar investigations for k of 5 and 20 (other common group size thresholds) are provided in the Supplementary Information.

The search uses the privacy risk estimation framework to evaluate 96 alternative data sharing policies for each U.S. county (with available census tract information) across a range of case count values. The policies include 6 potential generalizations of age, 4 generalizations of race, 2 generalizations of sex, and 2 generalizations of ethnicity. The generalization options follow a hierarchical structure (see Figure 2), where moving up the hierarchy generalizes the information to increase privacy at the cost of utility.55 For each policy, county, and case number combination, the framework generates 1,000 PK11 estimates. A policy meets the threshold when the upper bound of the estimates’ 95% quantile range is less than or equal to 0.01. We choose to evaluate a policy in this manner to increase the likelihood that supported policies meet the privacy risk threshold in application. Note that the data sharer can adjust the size of the quantile range to modify the confidence that a policy will meet a specific privacy risk threshold.
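
To make the search space and the acceptance rule concrete, the following sketch enumerates the 6 × 4 × 2 × 2 = 96 policies and checks a set of PK11 estimates against the threshold. The level codes are assumptions modeled on the 4-character codes that appear in the Results (eg, 1C*e, 2Bse); the exact codes for each hierarchy level are defined in the paper's figures and may differ. The upper bound of the 95% quantile range is taken here to be the 97.5th percentile.

```python
from itertools import product
from statistics import quantiles

# Hypothetical level codes, ordered from most granular to fully generalized.
AGE = ["1", "2", "3", "4", "5", "*"]
RACE = ["A", "B", "C", "*"]
SEX = ["s", "*"]
ETHNICITY = ["e", "*"]

# Every combination of one level per quasi-identifier: 6 * 4 * 2 * 2 = 96.
POLICIES = ["".join(levels) for levels in product(AGE, RACE, SEX, ETHNICITY)]


def upper_bound_95(estimates):
    """Upper end of the central 95% quantile range (the 97.5th percentile)."""
    # n=40 yields 39 cut points in 2.5% steps; the last one is the 97.5th.
    return quantiles(estimates, n=40, method="inclusive")[-1]


def meets_threshold(estimates, threshold=0.01):
    """True when the upper bound of the 95% quantile range is within the threshold."""
    return upper_bound_95(estimates) <= threshold
```

Widening or narrowing the quantile range (for instance, using a different `n` and cut point) is one way a data sharer could adjust how conservative the acceptance rule is.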

Figure 2.


The generalization hierarchies for age, race, sex, and ethnicity used in this paper, adapted from those of Wan et al.68 Each horizontal level is a potential generalization state for the data generalization policy. For example, the policy could specify generalizing age to 5-year age intervals, 15-year age intervals, or broader ranges. We represent year of birth as 1-year age at the bottom of the age hierarchy. Moving up the hierarchies, the data become more generalized to increase privacy. An asterisk indicates the feature is generalized to a null value for all individuals, which is equivalent to suppression or nonrelease of the corresponding field.

Dynamic policy evaluation

We use the summarized policy search results and forecasted COVID-19 disease case counts to evaluate dynamic policy selection in the context of the COVID-19 pandemic. In this experiment, we measure the proportion of data releases in which the PK11 likely remains below the policy search threshold of 0.01. The dynamic policy is evaluated for two distinct alternative data sharing scenarios: (1) a daily release schedule with a 1-day lagging period assumption and (2) a weekly release schedule. The daily release schedule shares the actual date of diagnosis, prioritizing date granularity at the potential cost of demographic granularity. The weekly release schedule generalizes the date to week of diagnosis.

For each county, the dynamic policy method selects the generalization policy from the search results at the beginning of each week according to the forecasted COVID-19 case volumes. We use the CDC COVID-19 ensemble model’s county-specific, 1-week forecasts for its superior accuracy over other models.38,56,57 For the evaluation, we collected all model predictions from August 2020 through October 2021. We obtain daily increase predictions by uniformly distributing the weekly increase point estimate. In selecting policies for the daily release schedule, we use the minimum number of predicted cases in the week. This applies the most privacy preserving policy to all new cases reported in the week. For the weekly release schedule, we use the forecasted 1-week increase.
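
The weekly selection step can be sketched as a lookup against the policy search results. In the sketch below, `EXAMPLE_RESULTS` is a made-up table of (minimum case count, policy code) pairs used only for illustration; real thresholds come from the policy search for the county's population category. The helper names are hypothetical.

```python
# Hypothetical search results: (minimum expected cases, policy code).
EXAMPLE_RESULTS = [(11, "****"), (50, "**s*"), (200, "3Ase")]


def daily_forecast(weekly_increase, days=7):
    """Spread a weekly point forecast uniformly over the days of the week."""
    return [weekly_increase / days] * days


def select_policy(expected_cases, search_results):
    """Pick the policy listed under the largest case threshold that
    does not exceed expected_cases.

    Returns None when no policy is supported, which corresponds to not
    sharing record-level data for the period.
    """
    eligible = [pair for pair in search_results if pair[0] <= expected_cases]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0])[1]
```

For a daily release schedule, the value passed in would be the minimum of the daily forecasts for the week, eg, `select_policy(min(daily_forecast(700)), EXAMPLE_RESULTS)`; for a weekly schedule, it would be the forecasted 1-week increase itself.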

After selecting the sequence of policies for each county, we estimate the privacy risk of sharing the actual reported number of records via the privacy risk estimation framework. We define the actual number of disease cases per day or week by the Johns Hopkins COVID-19 tracking data. The PK11 risk value for each time point in each county is calculated as the upper bound of the 95% quantile range of 1,000 simulations. The evaluation measures the proportion of releases for which the upper bound remains at or below 0.01. We additionally evaluate the static application of a policy designed with current, retrospective deidentification techniques, akin to those applied to the CDC’s COVID-19 Public Use Data with Geography.11 The policy, hereafter referred to as the k-anonymous policy, shares age intervals in the form (0–17, 18–49, 50–64, and 65+); nearly fully specified race; fully specified ethnicity, sex, and state and county of residence; and date or week of diagnosis. We note the CDC’s policy, from which the k-anonymous policy derives, was developed to meet regulatory requirements and public health standards under a different release schedule (once every 2 weeks) and in a retrospective manner (the actual patient records are collected, deidentified, and released in a batch). The CDC’s policy is designed to achieve 11-anonymity (ie, PK11 = 0) by generalizing the date of diagnosis to month and by nulling out quasi-identifier information for small groups.11,13,58 Thus, the k-anonymous policy resembles a policy developed with traditional deidentification, but notably differs in its treatment of dates of events and in its assumption of no suppression. We further note this last feature is another factor unique to sharing surveillance data in near-real time. Suppression cannot be applied with confidence because it is almost impossible to forecast exactly which records will fall into small demographic groups.

Case studies

To provide a specific illustration of the dynamic policy approach to releasing updated, record-level disease surveillance data on a daily basis, we consider two Tennessee counties. The first, Davidson County, is a relatively large metropolitan region with a population of over 600,000 residents. The second, Perry County, is a relatively rural area with around 8,000 residents.

In each case study, we select a policy on a weekly basis in the same manner as the evaluation. However, to demonstrate how the framework incorporates the data recipient’s potential knowledge of diagnosis date, and accounting for the general turnaround time of COVID-19 diagnostic test results,59–61 we set a 5-day lagging period. Under these constraints, weekly dynamic policy selection first calculates a 5-day rolling sum of new disease case numbers through the coming week. The minimum value of the rolling sum is used to select the policy. We again estimate the privacy risk of sharing the actual number of records under the sequence of selected policies with the privacy risk estimation framework and the Johns Hopkins COVID-19 tracking data. To evaluate the dynamic policy under optimal case load forecasting, we repeat the process by replacing the forecasted case counts with the actual case numbers in policy selection.
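
The rolling-sum step can be illustrated as follows. This sketch assumes a trailing window that includes the current day and assumes that the days immediately before the week are available as history so that early-week windows are complete; both choices, and the function names, are assumptions rather than details confirmed by the text.

```python
def rolling_sums(daily_cases, window=5):
    """Trailing rolling sums; element i covers days i-window+1 through i."""
    return [
        sum(daily_cases[max(0, i - window + 1): i + 1])
        for i in range(len(daily_cases))
    ]


def weekly_selection_value(history, week_forecast, window=5):
    """Minimum rolling sum across the coming week.

    history: daily counts immediately preceding the week.
    week_forecast: expected daily counts for the coming week.
    """
    series = list(history) + list(week_forecast)
    sums = rolling_sums(series, window)
    # Only the windows that end on a day of the coming week are considered.
    return min(sums[len(history):])
```

Using the weekly minimum means the chosen policy is the one supported on the quietest 5-day window of the week, which is the most privacy-preserving choice among the days covered.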

Code

All experiments are performed using Python (version 3.8). The code, and walkthroughs corresponding to each experiment, can be found at https://github.com/vanderbiltheads/PandemicDataPrivacy

RESULTS

Dynamic policy search

We summarize the policy search results in Figure 3. To aid in readability, we represent the generalization of each quasi-identifier in a policy with a 4-character alphanumeric code. From left to right, the characters represent the age, race, sex, and ethnicity generalizations. We further summarize the results by categorizing US counties by population size.

Figure 3.


Generalization policies with a PK11 upper bound (calculated as the upper bound of the 95% quantile range of 1,000 framework simulations) less than or equal to 0.01 at varying disease case volume thresholds. A 4-character alphanumeric code indicates the policy’s generalization levels. All policies additionally include state and county of residence and some generalization of diagnosis date. A policy is eligible to be listed under the minimum number of new cases (table column) at which it meets the PK11 threshold for every county in the category (table row). A maximum of 2 policies are listed in each cell among the actual number of policies supported. The number in the bottom right-hand corner of each cell indicates how many of the 96 searched policies meet the risk threshold at the case volume.

Once a generalization policy meets the PK11 threshold for a given number of cases, it is unlikely records fall into a demographic group of size 10 or less. Further increasing the case volume increases the number of records in each group and decreases the PK11 value. As such, a policy is listed under the smallest case quantity at which the policy meets the PK11 threshold for every county in the category. It should also be noted there exists a parent–child relationship between policies. For example, policy 2*** is the parent of policy 3***, where the former only differs from the latter by generalizing age to a lesser degree. When a parent policy meets the PK11 threshold, all its child policies also meet the threshold.
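
The parent–child relationship means that one passing policy implies that every policy at least as generalized in each field also passes. A small sketch of that closure is shown below; the ordering of level codes is an assumption consistent with the examples in the text (eg, age level 2 is more granular than age level 3, and an asterisk is the most general level), and the function names are hypothetical.

```python
# Hypothetical level codes per field, ordered from most granular to most general.
LEVELS = [
    ["1", "2", "3", "4", "5", "*"],  # age
    ["A", "B", "C", "*"],            # race
    ["s", "*"],                      # sex
    ["e", "*"],                      # ethnicity
]


def is_at_least_as_general(candidate, reference):
    """True when candidate generalizes every field at least as much as reference."""
    return all(
        levels.index(c) >= levels.index(r)
        for levels, c, r in zip(LEVELS, candidate, reference)
    )


def supported_policies(passing, all_policies):
    """Close a set of passing policies under the parent-child relation."""
    return {
        p for p in all_policies
        if any(is_at_least_as_general(p, q) for q in passing)
    }
```

Under this ordering, 3*** is implied by 2***, but 1C*e and 3Ase do not imply each other because each is more granular than the other in at least one field.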

As Figure 3 displays, the number of acceptable policies increases with the number of new cases. In most cases, larger counties achieve more acceptable policies than smaller counties at a given case quantity. The maximum number of acceptable policies is 73. The most granular policies across all county categories are 1C*e, 2Bse, and 3Ase. Each of these policies prioritizes different types of information. Policy 1C*e offers the most granular age information at the cost of race and sex information, while Policy 3Ase reduces age granularity to increase race and sex specificity.

The case number values are window-size agnostic, such that the policy search results hold regardless of the time period considered. For example, assume a county with fewer than 1000 residents updates its disease surveillance dataset daily. Further, assume the county sets a 3-day lagging period assumption. When the expected number of new cases from the current day and the previous 2 days sum to 50, the current day’s records should be generalized according to either policy **** or **s*. The same policies are supported if, instead, the dataset is updated weekly (and diagnosis date is generalized to week of diagnosis) and 50 new cases are expected for the current week.

Dynamic policy evaluation

We summarize the evaluation results, categorizing counties in the same manner as the policy search, in Table 2. There are several major findings. First, dynamically adapting the generalization policy meets the PK11 threshold more frequently than statically applying the k-anonymous policy. On average, the dynamic policy meets the threshold for at least 92.8% of the 448 daily releases and 96.0% of the 64 weekly releases. The k-anonymous policy meets the threshold for as few as 11.8% of the daily releases and 0.4% of the weekly releases. Second, we find that new cases do not occur every day or every week, particularly in counties with fewer residents. As such, there are fewer days on which the PK11 upper bound can potentially exceed the threshold, inflating proportions in smaller counties.

Table 2.

Average proportion of time periods where the upper bound of the 95% quantile range of the PK11 risk is less than or equal to 0.01 in the COVID-19 pandemic (August 2, 2020 to October 23, 2021). Each cell gives the average proportion of releases that meet the PK11 threshold [95% quantile range].

| County population size | Daily releases (n = 448), k-anonymous policy | Daily releases (n = 448), dynamic policy | Weekly releases (n = 64), k-anonymous policy | Weekly releases (n = 64), dynamic policy |
|---|---|---|---|---|
| < 1000 (n = 35) | 0.900 [0.790, 0.998] | 1 [1, 1] | 0.605 [0.266, 0.987] | 0.999 [0.984, 1] |
| 1000–50 000 (n = 2129) | 0.389 [0.118, 0.815] | 0.971 [0.902, 1] | 0.072 [0, 0.406] | 0.960 [0.906, 1] |
| 50 000–100 000 (n = 398) | 0.181 [0.042, 0.532] | 0.928 [0.868, 0.987] | 0.004 [0, 0.031] | 0.974 [0.922, 1] |
| 100 000–1 000 000 (n = 538) | 0.145 [0.009, 0.521] | 0.947 [0.882, 0.998] | 0.008 [0, 0.026] | 0.982 [0.938, 1] |
| > 1 000 000 (n = 39) | 0.118 [0.007, 0.304] | 0.961 [0.874, 0.998] | 0.057 [0, 0.288] | 0.962 [0.906, 1] |

Case study: Davidson County, TN

Figure 4 shows how the forecasted case volumes do not match the weekly seasonality of the actual reported cases in Davidson County. Consequently, the CDC ensemble model tends to overestimate case loads, leading to the selection of more granular policies. Despite the rippling effects of the overestimation, the 95% quantile range of the forecast-driven PK11 remains below 0.01 throughout most of the time frame. Several days exceed the threshold, most of which occur when the selected policies disagree whether to share record-level data under the **** policy or to not share. When sharing fewer than 11 new case records in a 5-day window under the forecast-driven dynamic policy, all new records fall into a demographic group smaller than size 11, resulting in a PK11 of 1.0. Notably, the PK11 never exceeds the threshold when selecting policies according to the actual case counts. Adapting the policy according to perfect forecasts provides optimal privacy protection.

Figure 4.


Dynamic policy selection applied to Davidson County, TN, in the COVID-19 pandemic (August 2, 2020 to October 23, 2021). (Top) The 5-day rolling sum of the forecasted and actual case counts reported in Davidson County. The forecasted counts are from the CDC’s COVID-19 ensemble model and the actual counts are from the Johns Hopkins surveillance data. The blue triangles and red squares denote the minimum value within each week (defined as Sunday–Saturday per the CDC model’s definition). The minimum values are used to select a policy from policy search results. (Middle) The selected policy at the beginning of each week in the pandemic. Each policy is represented by a 4-character alphanumeric code following the key in Figure 3. The policies are ordered by increasing case count thresholds from bottom to top. Green circles indicate agreement between the policies selected from the forecasted and actual case counts. (Bottom) The PK11 from sharing the actual number of records under the two sequences of policies detailed in the middle graph. The expectation and 95% quantile range are calculated from 1,000 independent framework simulations, while applying a 5-day lagging period assumption. The horizontal dashed line marks the PK11 threshold of 0.01.

Case study: Perry County, TN

Figure 5 shows that case counts remain relatively small before, as well as after, infection spikes in October 2020 and August 2021. Throughout most of these intervals of low-infection rates, the selected policies from each data source indicate that record-level data should not be shared on a daily basis. However, when the 5-day rolling sums oscillate around 11 cases, the forecasted values again overestimate the weekly minimum case loads, resulting in a PK11 of 1.0. Despite the privacy leaks in the forecast-driven dynamic policy, the dynamic policy guided by the actual disease case counts again maintains the PK11 values below the threshold throughout the time frame.

Figure 5.


Dynamic policy selection applied to Perry County, TN, in the COVID-19 pandemic (August 2, 2020 to October 23, 2021). (Top) The 5-day rolling sum of the forecasted and actual case counts reported in Perry County. The forecasted counts are from the CDC’s COVID-19 ensemble model and the actual counts are from the Johns Hopkins surveillance data. The blue triangles and red squares denote the minimum value within each week (defined as Sunday–Saturday per the CDC model’s definition). The minimum values are used to select a policy from policy search results. (Middle) The selected policy at the beginning of each week in the pandemic. Each policy is represented by a 4-character alphanumeric code following the key in Figure 3. The policies are ordered by increasing case count thresholds from bottom to top. Green circles indicate agreement between the policies selected from the forecasted and actual case counts. (Bottom) The PK11 from sharing the actual number of records under the two sequences of policies detailed in the middle graph. The expectation and 95% quantile range are calculated from 1,000 independent framework simulations, while applying a 5-day lagging period assumption. The quantile ranges are too narrow to be seen outside the mean. The horizontal dashed line marks the PK11 threshold of 0.01.

DISCUSSION

This paper introduces a framework to dynamically adjust data sharing policies to publicly share infectious disease surveillance data. The framework forecasts privacy risk according to the expected volume of new cases, enabling data sharers to prospectively adapt policies before seeing case loads. We demonstrate how dynamically changing the policy per the framework’s recommendations maintains the privacy risk below the specified privacy risk threshold more frequently than statically applying a policy developed through retrospective deidentification methods, for both the PK and marketer risk-based approaches. The dynamic policy also enhances surveillance utility by fluctuating data generalization with the infection rate, allowing the data sharer to prioritize sharing certain patient information; bypassing the delay of accumulating patient records before performing a risk assessment; and sharing dates of events. These last two features are crucial for characterizing disease transmission.2,31 Forecasting also enables greater consistency in quasi-identifier representation, as the policy can be maintained throughout the forecasted interval of time. Moreover, predicting which policies provide sufficient privacy protection could potentially automate patient deidentification.

We demonstrate two approaches to dynamic policy adaptation. In the PK risk-based approach, we fix county of residence and date of diagnosis granularity while varying the demographic granularity. We make this tradeoff to support consistent data updates but acknowledge that it may induce certain data utility constraints. For instance, if an application requires uniform demographic granularity, the demographic values may need to be further generalized. An alternative dynamic policy approach could preserve the demographic granularity over time by using the privacy risk estimation framework’s predictions to generalize the date of diagnosis into variably-sized time windows. Still, this would impose a utility constraint on date information and cause the data publication schedule to vary. In the marketer risk-based approach (see the Supplementary Information), we show that when the potential attacker has less background knowledge, the dynamic policy can preserve date of diagnosis granularity while monotonically increasing the demographic granularity of the entire dataset over time.

We do not advocate for which measure provides the best privacy protection, nor do we specify which applications each approach best supports; rather, this investigation shows how the privacy risk estimation framework’s flexibility can inform different approaches to dynamic policy adjustment.

Despite the merits of this work, we wish to highlight several limitations to guide future extensions and transition into application. First, the dynamic, forecast-driven approach did not always meet the privacy risk threshold in the PK risk-based scenario. However, the framework’s policy search results remain relatively robust. Policies chosen from forecasted counts are typically similar or close to those chosen from actual case counts. And when overestimating the number of cases, the privacy risk does not always dramatically exceed the threshold. Furthermore, we selected policies according to a 95% empirical confidence interval, but the policy search can readily incorporate larger confidence intervals as organizations deem desirable. Expanding the intervals further increases the likelihood the dynamic policy will meet the threshold in application. Moreover, when adjusting policies according to the actual case counts, the privacy risk never exceeds the threshold. Thus, the dynamic policy approach can be improved through more accurate forecasts and a model that accounts for potential case load overestimation.
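To make the role of the quantile range concrete, the following Python sketch checks whether a policy would be accepted given a set of simulated privacy risk values. It assumes a simple acceptance rule (the upper bound of the central empirical quantile range must not exceed the threshold) and a conservative index-based quantile estimate; both are illustrative assumptions, and the function names are hypothetical.

```python
import math


def quantile_range(values, coverage=0.95):
    """Central empirical quantile range of simulated risk values.

    Assumption: bounds are taken at conservative order-statistic indices
    (floor for the lower bound, ceil for the upper bound).
    """
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lo_idx = int(math.floor(tail * (n - 1)))
    hi_idx = int(math.ceil((1.0 - tail) * (n - 1)))
    return ordered[lo_idx], ordered[hi_idx]


def policy_meets_threshold(simulated_risks, threshold=0.01, coverage=0.95):
    """Accept a policy only if the upper quantile bound is at or below the threshold."""
    _, upper = quantile_range(simulated_risks, coverage)
    return upper <= threshold
```

Under this rule, widening the coverage can only move the upper bound further into the tail of the simulated distribution, so a policy accepted at a wider range is also accepted at a narrower one, but not the reverse.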

Second, our approach does not incorporate suppression to protect the most unique patient records in the dataset, because it is nearly impossible to accurately forecast exactly which records will fall into small demographic groups. It is possible, however, to suppress records while enforcing a policy selected using the framework: actual patient records that are due for publication but fall into population demographic bins corresponding to very few individuals, such as population uniques or (for PK risk) population groups with fewer than k individuals, can be withheld. Such records are certain not to meet the k-anonymity requirement. Additional risk analysis can estimate the risk that actual records fail to meet the k-anonymity requirement in a data release, and fields in records associated with a high estimated risk can then be suppressed. Still, the framework’s policy search and the policy selection approach depend on many adjustable parameters (eg, the number of performed simulations, the expected number of new disease cases, the specific bins randomly selected to simulate new cases, and the size of the quantile range used for the confidence a policy will meet a given risk threshold), which can be adjusted to mitigate the need for suppression.
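A minimal Python sketch of such an enforcement-time suppression step is given below. It is an illustration under stated assumptions rather than part of the framework: each record is represented only by its generalized demographic bin, `population_counts` is a hypothetical mapping from bin to the number of residents in that bin (eg, derived from census tables), and whole records are withheld rather than individual fields.

```python
def suppress_small_population_bins(records, population_counts, k=11):
    """Split records into those kept and those suppressed.

    records: list of generalized demographic bins (hashable), one per case record.
    population_counts: {bin: number of residents in that bin} (hypothetical input).
    A record is suppressed when its population bin holds fewer than k residents,
    because such a record cannot belong to a group of size k or more.
    """
    kept, suppressed = [], []
    for record in records:
        if population_counts.get(record, 0) < k:
            suppressed.append(record)
        else:
            kept.append(record)
    return kept, suppressed
```

With k = 1 nothing is suppressed as long as every record's bin appears in the population table, and with k = 2 only population uniques are withheld.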

Third, as we aim to generally support public data sharing, we focus on privacy risk without measuring the utility of a data generalization policy. Though we provide the data sharer with policy options, from which they can choose how to prioritize sharing quasi-identifier information, and our approach generally supports surveillance utility in terms of providing granular date information and timely updates, we do not address the more complex problem of policy planning. For instance, maximizing the granularity of one quasi-identifier early in the time series could hinder policy flexibility in the future. In the scenario where another quasi-identifier becomes important to public health research later, the data sharer may want to change the generalization of previously released data to complement the new priority. However, if the earlier policy has already consumed the available privacy risk, the policy may not be altered without potentially exposing patients’ identities. Previously released data may be shared again with more detail, but not less. Future work should quantitatively measure data utility to inform data sharers in policy planning.

Fourth, the privacy risk estimation framework depends on random sampling methods that may not realistically simulate the pandemic spread of disease. We assign an equal likelihood of infection to all uninfected county residents at any given time in the simulations, and do not allow reinfections. In reality, the actual likelihood varies according to contact patterns of infectious individuals (ie, through households or at work),62,63 and reinfections are possible, though not likely in the case of COVID-19.40 Still, we believe that Monte Carlo simulations, constrained to run within the relatively contained geographic region of a county, provide a reasonable range and estimate of infection outcomes, as they have been shown to be adept at simulating complex, high-dimensional patterns.64 Further framework refinement should address the possibility of reinfection for diseases for which reinfection is more likely.
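The sampling assumption described above can be sketched in a few lines of Python. This is a simplified illustration, not the framework itself: it assumes that new cases are drawn uniformly without replacement from the uninfected residents, that each resident is represented only by a generalized demographic bin, and that a record's group is the set of simulated cases sharing its bin. The function name and inputs are hypothetical.

```python
import random
from collections import Counter


def simulate_pk(uninfected_by_bin, n_new_cases, k=11, n_sims=1000, seed=0):
    """Monte Carlo estimate of the PK risk for a given number of new cases.

    uninfected_by_bin: {bin: number of uninfected residents} (hypothetical input).
    Each simulation draws n_new_cases residents uniformly without replacement
    (equal infection likelihood, no reinfection) and records the proportion of
    drawn cases whose bin contains fewer than k drawn cases.
    Raises ValueError if n_new_cases exceeds the number of uninfected residents.
    """
    if n_new_cases == 0:
        return [0.0] * n_sims
    rng = random.Random(seed)
    # One list entry per uninfected resident, labelled by demographic bin.
    pool = [b for b, n in uninfected_by_bin.items() for _ in range(n)]
    risks = []
    for _ in range(n_sims):
        cases = rng.sample(pool, n_new_cases)
        group_sizes = Counter(cases)
        at_risk = sum(1 for c in cases if group_sizes[c] < k)
        risks.append(at_risk / n_new_cases)
    return risks
```

When the whole county is generalized into a single bin, 10 new cases give a risk of 1.0 in every simulation and 11 new cases give 0.0, which mirrors the sensitivity to case loads near 11 seen for low-count counties.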

Fifth, the framework does not compute the reidentification risk of sharing a specific record. Rather, it estimates the range and expectation of privacy risk for a population. Future work should evaluate how well the framework’s estimates compare to the reidentification risk of sharing actual disease surveillance data.

Finally, while this paper focuses on deidentification through generalization, an alternative approach would rely on the principle of differential privacy. Differential privacy offers formal privacy guarantees65; but as has been recently noted,66 realizing this definition in practice requires injecting noise into the data, a strategy that is not appropriate for every data sharing scenario. Moreover, the CDC’s COVID-19 datasets apply generalization and suppression.13 Therefore, to be consistent with the CDC’s current practice, we focused our framework’s application on data generalization policies.

CONCLUSION

Disease surveillance data vary both between geographic areas and over time. As such, data must be consistently updated in a timely manner. To support public health research and the public’s situational awareness during a pandemic, the data must also contain granular date information. The privacy risk estimation framework we propose enables a prospective approach to surveillance data deidentification. In contrast to traditional methods, prospective policy selection offers increased flexibility, with intermittent consistency, to support near-real time data dissemination. Moreover, we show that forecast-driven deidentification offers better privacy protection than static application of a data sharing policy.

FUNDING

This study was supported by the following funding sources: grants CNS-2029651 and CNS-2029661 from the National Science Foundation and training grant T15LM007450 from the National Library of Medicine.

AUTHOR CONTRIBUTIONS

JTB designed the framework and privacy model, wrote the computer code, performed the experiments, analyzed the results, and prepared the manuscript. CY, WX, ZY, and ZW contributed to the conceptual design of the framework and privacy model, analyzed the results, and revised the manuscript. AG-D, MK, and BAM supervised each component of the project.

SUPPLEMENTARY MATERIAL

Supplementary material is available at JAMIA online.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY

All data used herein are publicly available. The datasets include: the United States Census PCT12 Tables,37 the Johns Hopkins COVID-19 tracking data,39 and the CDC COVID-19 Ensemble Forecasts.38,67

Contributor Information

J Thomas Brown, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

Chao Yan, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA.

Weiyi Xia, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

Zhijun Yin, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA.

Zhiyu Wan, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA.

Aris Gkoulalas-Divanis, IBM Watson Health, Cambridge, Massachusetts, USA.

Murat Kantarcioglu, Department of Computer Science, University of Texas at Dallas, Dallas, Texas, USA.

Bradley A Malin, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.

REFERENCES

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
