A standardised differential privacy framework for epidemiological modeling with mobile phone data

Merveille Koissi Savi; Akash Yadav; Wanrong Zhang; Navin Vembar; Andrew Schroeder; Satchit Balsari; Caroline O Buckee; Salil Vadhan; Nishant Kishore

doi:10.1371/journal.pdig.0000233

. 2023 Oct 27;2(10):e0000233. doi: 10.1371/journal.pdig.0000233

Merveille Koissi Savi ¹, Akash Yadav ², Wanrong Zhang ³, Navin Vembar ⁴, Andrew Schroeder ², Satchit Balsari ⁵, Caroline O Buckee ⁶, Salil Vadhan ³, Nishant Kishore ⁶,*

Editor: Michele Tizzoni⁷

¹Department of Medical Oncology, Dana Farber Cancer Institute, Harvard School of Medicine, Boston, Massachusetts, United States of America

²Direct Relief, Santa Barbara, California, United States of America

³Department of Computer Sciences, Harvard John A. Paulson School of Engineering & Applied Sciences, Boston, Massachusetts, United States of America

⁴Camber Systems, Washington, District of Columbia, United States of America

⁵Department of Emergency Medicine, Harvard Medical School, Boston, Massachusetts, United States of America

⁶Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, Massachusetts, United States of America

⁷University of Trento Department of Sociology and Social Research: Universita degli Studi di Trento Dipartimento di Sociologia e Ricerca Sociale, ITALY

The authors have declared that no competing interests exist.

* E-mail: nish.kishore@gmail.com

Roles

Merveille Koissi Savi: Data curation, Formal analysis, Methodology, Resources, Software, Visualization, Writing – original draft, Writing – review & editing

Akash Yadav: Data curation, Formal analysis, Methodology, Software, Writing – review & editing

Wanrong Zhang: Data curation, Formal analysis, Methodology, Writing – review & editing

Navin Vembar: Conceptualization, Data curation, Funding acquisition, Supervision, Visualization, Writing – review & editing

Andrew Schroeder: Funding acquisition, Supervision, Validation, Writing – review & editing

Satchit Balsari: Conceptualization, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing

Caroline O Buckee: Conceptualization, Funding acquisition, Investigation, Validation, Writing – review & editing

Salil Vadhan: Conceptualization, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing

Nishant Kishore: Conceptualization, Data curation, Formal analysis, Methodology, Software, Supervision, Validation, Visualization, Writing – review & editing

Michele Tizzoni: Editor

PMCID: PMC10610440 PMID: 37889905

Abstract

During the COVID-19 pandemic, the use of mobile phone data for monitoring human mobility patterns has become increasingly common, both to study the impact of travel restrictions on population movement and to inform epidemiological modeling. Despite the importance of these data, the use of location information to guide public policy can raise issues of privacy and ethical use. Studies have shown that simple aggregation does not protect the privacy of an individual, and there are no universal standards for aggregation that guarantee anonymity. Newer methods, such as differential privacy, can provide statistically verifiable protection against identifiability but have been largely untested as inputs for compartment models used in infectious disease epidemiology. Our study examines the application of differential privacy as an anonymisation tool in epidemiological models, studying the impact of adding quantifiable statistical noise to mobile phone-based location data on the bias of ten common epidemiological metrics. We find that many epidemiological metrics are preserved and remain close to their non-private values when the true noise state is less than 20, in a count transition matrix, which corresponds to a privacy-loss parameter ϵ = 0.05 per release. We show that differential privacy offers a robust approach to preserving individual privacy in mobility data while providing useful population-level insights for public health. Importantly, we have built a modular software pipeline to facilitate the replication and expansion of our framework.

Author summary

Human mobility data has been used broadly in epidemiological population models to better understand the transmission dynamics of an epidemic, predict its future trajectory, and evaluate potential interventions. The availability and use of these data inherently raises the question of how we can balance individual privacy and the statistical utility of these data. Unfortunately, there are few existing frameworks that allow us to quantify this trade-off. Here, we have developed a framework to implement a differential privacy layer on top of human mobility data which can guarantee a minimum level of privacy protection and evaluate their effects on the statistical utility of model outputs. We show that this set of models and their outputs are resilient to high levels of privacy-preserving noise and suggest a standard privacy threshold with an epsilon of 0.05. Finally, we provide a reproducible framework for public health researchers and data providers to evaluate varying levels of privacy-preserving noise in human mobility data inputs, models, and epidemiological outputs.

Introduction

The use of private mobile phone data for various applications in public health, urban planning, and response to natural disasters has been steadily growing for more than a decade. The COVID-19 pandemic has accelerated this trend, and the use of mobility data has increased, following the need to monitor and make policy decisions related to travel restrictions and lockdowns. These data were incorporated into epidemiological models during the pandemic to monitor or forecast SARS-CoV-2 transmission.

Mobility data from mobile phones allow us to quantify changes in human movement, identify how social contacts cluster, evaluate where cases come into contact with others, and predict the probability of geographic spread [1–4]. Data acquired from cell phone metadata recorded for billing purposes or from digital platforms are aggregated and shared with researchers, who can then get significant information from mobility patterns [5–7]. Such studies have been used to explain the seasonal pattern of dengue in Pakistan and rubella in Kenya, for example [5, 7]. These models are predominantly metapopulation models in which mobility data are used to determine the impact of human migration on the trajectory of infectious diseases. During the COVID-19 pandemic, the use of mobility data increased around the world, and metapopulation models were used to understand the relationship between human mobility and the spread of the epidemic, predict the dynamics of the epidemic, and estimate the effectiveness of nonpharmaceutical interventions such as lockdowns, reopenings, and social distancing, based on other work modeling the spatial dynamics of pathogens [1, 5, 6].

Despite the statistical utility of these datasets, important privacy concerns remain about the sharing of personal data, even if they are deidentified and aggregated. Standardised approaches are currently lacking for data-sharing agreements and guidelines on the appropriate ways to protect individual privacy while using mobility data for public health. As big data, the semantic web, the interconnectedness of digital technology, and the "Internet of Things" (IoT) increase the volume and velocity of data, it becomes easier to re-identify individuals in such aggregated data [8, 9].

Several privacy frameworks have been developed to address the trade-off between privacy and utility for statistical analyses [10–13]. Amongst these frameworks, differential privacy (DP) has become the leading approach to balance this trade-off [14]. DP is a parameterized privacy concept, where the privacy parameter ϵ allows for a smooth trade-off between privacy and utility for statistical analyses [14]. Informally, an algorithm that is ϵ-differentially private ensures that any particular output of the algorithm is at most e^ϵ times more likely when we arbitrarily change one data entry. In DP, observations are perturbed by adding noise drawn from a carefully chosen distribution [14]. A DP mechanism applied to a mobility matrix of travel between different locations will prevent disclosing the exact number of movements and will also keep the private information of the individual (home and work location, etc.) hidden.

DP is considered the gold standard of statistical privacy, as its application can be proven to preserve privacy while quantifying the trade-off between privacy and the utility of the released statistics [15]. The trade-off between privacy and utility is important because the noisier the output, the less useful it may be for inference. Increasingly, DP is used for the public release of data sets by industries and governments such as Google [16], Apple [17], Microsoft [18], Facebook (Meta), Uber [19], and the US Census Bureau [20], but it remains unclear how DP should be used in the context of mobility data for epidemiological frameworks.

In this paper, we examine how differential privacy can be applied to infectious disease modeling and analyse the impact of different levels of privacy on the reconstruction of epidemic features through simulation. Our method is based on a previously validated epidemiological metapopulation model, and we investigate the effect of the addition of privacy-preserving noise on key epidemiological outputs of interest. We used real-world mobility data from New York State during the early stages of the COVID-19 pandemic in the United States and show that the application of differential privacy can bias certain epidemiological metrics. We propose that differential privacy offers a rigorous and quantifiable approach to safely using mobile phone data during epidemics for modeling purposes.

Results

Mobility data

The mobility matrices included data from August 15 to November 15, 2020, and contained a total of 812,587 transitions made between sixty-two counties of New York State, with a mean of 9,029 transitions a day. The observed daily transitions ranged from a minimum of 600, occurring in Hamilton County, to a maximum of 77,131 in Suffolk County. The maximum transition between counties occurred between Queens and Kings counties, with 5,262, whereas we counted 14 combinations of zero transitions during the selected windows. After applying DP, the absolute number of transitions was affected, but the relative rank of the intercounty routes with respect to the volume of travel remained the same. We initiated a variety of common scenarios to assess the effect of added noise on bias and variability in our epidemiological parameters of interest.

Scenarios with initial outbreaks in large and small regions

We first address the impact of starting epidemics in large versus small counties to determine whether DP would have systematic impacts on the dynamics overall. Kings and Queens are the largest counties in New York State with an approximate population of 2 million individuals each [21]. Allegany and Essex are the smallest counties in New York State, with populations of approximately 46,000 and 37,000 individuals, respectively. In each of these counties (first the two largest, and then the two smallest), we seeded 20 infectious individuals to spark an epidemic. In the scenario with large counties, we observed epidemics that started around the 50th day and peaked around the 75th day, reaching approximately 1% of the population living in these areas. In the smaller counties, the epidemic began around the 60th day and peaked on the 150th day, reaching approximately 5% of the population (S1 Fig).

We evaluated the metrics of interest over 1,000 iterations for each combination of scenarios and noise. We observed that when the epidemic is seeded in Queens and Kings, the epidemic size and the proportion of counties with at least one case are higher compared to an outbreak seeded in smaller counties (Fig 1A). As metrics can exist on very different scales, we calculate the normalized distribution of bootstrapped metrics where a minimum amount of noise is added. We then compare this to the median value of bootstrapped values at increasing values of noise to describe the change from expectation. When noise is above 20, the values for the epidemic size for observed, asymptomatic, and symptomatic infected, the size at the peak of the epidemic, and the proportion of counties with one case are lower than those obtained when the mobility matrix is not perturbed. However, the values obtained for the rate of spread, effective reproductive rate, risk of importation, probability of importation, and mean importation rate are higher than those obtained for the non-perturbed dataset (Fig 1).

Fig 1. Metapopulation metric distribution for different values of epsilon. A: location of the first cases; B: change in the mobility network; C: change in the parameters of the metapopulation model. As metrics can exist on very different scales, we calculate the normalized distribution of bootstrapped metrics where a minimum amount of noise is added. We then compare this to the median value of bootstrapped values at increasing values of noise to describe the change from expectation.

Scenarios with Epidemics in Well- and Poorly Connected Regions

To address how the effect of DP on network connectivity would impact predicted disease dynamics, we simulated an outbreak in three pairs of counties with varying levels of connectivity to Kings County. The first simulation in Monroe and Saratoga counties was designed to assess the impact of low connectivity (less than 20% of transitions during the period) on the disease dynamic. The second scenario targeted counties in the median of transitions, such as Putnam and Westchester counties, to assess the dose-response effect of the epidemiological model. The third scenario was simulated in Schoharie and Lewis counties (no transition to Kings County during the period) to assess the impact in places that were isolated in the larger mobility network. When the outbreak is simulated in Monroe and Saratoga (scenario 3), the epidemic begins around the 60th day and the number of infected persons reaches the maximum around the 150th day, with less than 1% of the total population living in this area infected. When the outbreak is seeded in medium connectivity areas such as Putnam and Westchester (scenario 4), less than 0.6% of the population became infected around the 75th day after the epidemic peaks around the 40th day. When the outbreak is seeded in an area with low connectivity to Kings County, i.e., connectivity close to 0 such as Lewis and Schoharie (scenario 5), less than 0.07% of the population is infected around the 200th day since the epidemic only starts around the 90th day (S1 Fig).

We found that regardless of network connectivity, epidemiological metrics degraded as noise increased (Fig 1B). In the three scenarios addressing the change in the mobility network, namely when i) the epidemic is sparked in two random counties having less than 20% of transitions to Kings County, ii) the epidemic is sparked in two random counties with a median level of transitions to Kings County, and iii) the epidemic is sparked in counties with no transitions to Kings County, we observed a pattern in the distribution of the metrics similar to the one observed for an outbreak in small counties (scenario 2). Specifically, the size of the epidemic, the day that the epidemic peaks, the fraction of counties with at least one case, the average exposure time, the maximum exposure time, and the minimum exposure time are smaller than the baseline. The spread rate, the effective reproductive rate, the importation risk, the mean importation risk rate, and the probability of infection are higher than the baseline, especially when the noise is above 33.33 (Fig 1B). We observed a significant change in the epidemiological metrics only at noise values higher than those in the scenarios targeting the location of the first cases (small versus large county) (Fig 1B).

Scenarios with varying epidemiological parameters

To address the nature of the epidemic, we simulated three changes in the trajectory of the epidemic in Kings and Queens counties. Specifically, we simulated i) a faster epidemic through the increase of the transmission rate, ii) a heavy load of asymptomatic individuals, and iii) an absence of asymptomatic individuals in the population. When the transmission rate increases (scenario 6), we observe that the epidemic starts around the 40th day and reaches its peak around the 75th day with almost 3% of the population infected. When the fraction of symptomatic individuals increases, the size of the epidemic also increases and reaches 1.5% of the population around the 75th day since the epidemic starts around the 40th day after the first case (scenario 7). When the fraction of documented infection decreases (scenario 8), there is no declared epidemic, as only asymptomatic people are recorded in the population, reaching a fraction of 0.008% after the 100th day.

When the transmission rate increases, the epidemic spreads quickly (Fig 1C). When the asymptomatic rate increases, the probability of infection will subsequently increase. The trajectory of the epidemic is similar to the non-perturbed dataset. However, above the noise of 33.33, epidemiological metrics are either more conservative (lower than those of the baseline) or more volatile (higher than those of the baseline) (Fig 1C). Furthermore, we found that the fraction of counties with at least one case is not affected by the change in i) the transmission rate and ii) the fraction of symptomatic individuals (Fig 1C).

Discussion

Several metapopulation models were developed throughout the SARS-CoV-2 pandemic to inform decision making, predict the trajectory of the disease and identify weaknesses in the healthcare system [22–25]. The mobility data used to parameterize these models provided information on geographic and behavioral heterogeneity between populations, but these data could theoretically be used to identify individuals or their unique travel behavior, which warrants privacy preservation measures [26]. Our study shows that in metapopulation models that use mobility data, the application of privacy-preserving noise results in unbiased estimates of metrics of interest at a wide range of noise values with an upper limit that allows for a significant privacy-preserving budget.

We found that mobility matrices that are infused with noise values below 20, that is, a privacy loss of at least ϵ = 0.05 per matrix, can help protect the privacy of individuals who contribute their data, while limiting bias in the estimation of public health measures of interest when used for epidemiological modeling. Importantly, as we have already added a minimum amount of noise to preserve privacy, this limit represents a minimum threshold, allowing for the addition of larger amounts of privacy-preserving noise than previous studies have shown. Intuitively, adding noise to these mobility matrices may result in newly created connections between locations that would not otherwise be connected, strengthening connections that would otherwise be weak, or vice versa. In some cases, we may even see the removal of connections on specific days. Predictions of spread in rural areas may be more affected than those in areas connected to urban centers. However, sensitivity analyses could be performed to provide robustness, and the purpose and geographic scope of the model will dictate how important this degradation is.

As noise increases above 20, estimates such as the epidemic size, the day that the epidemic peaks, and the average epidemic size are biased downwards as the mobility matrix decreases connectivity to large population centers and distributes the epidemic into many smaller locations with lower contact rates. Similarly, estimates such as the rate of spread, the risk of importation, and the effective reproduction rate are biased upwards as mobility between smaller and poorly connected locations increases, leading to greater importation into areas with smaller population sizes. Our study demonstrates that for epidemiological metapopulation models using mobility data, metrics estimates are fairly unbiased up to a noise threshold of 20, which provides greater privacy protection than previous studies [25, 27].

Although our pipeline only evaluated a specific combination of mobility data, metapopulation model, and metrics, it provides a “plug-and-play” interface for researchers to assess bias using proprietary models and mobility data [28]. As mobility data sets become increasingly available and used in metapopulation models, we provide a flexible framework to identify the evaluation-specific maximum privacy-preserving noise that can be incorporated into these mobility data before they result in biased outputs. Data providers can interact with researchers in many ways and the goal of this study is not to systematize this relationship. Instead, this “plug-and-play” framework can be used by researchers to simulate the effects of the application of differential privacy methods on their epidemiological parameters of interest. This would allow researchers to have an informed discussion with data providers before the data are sourced to identify an optimal threshold of noise which protects user privacy while also allowing for unbiased estimates of epidemiological parameters to be inferred.

Methods

Ethical statements

This study is non-human subjects research and requires neither IRB approval nor informed consent. Furthermore, all methods were performed in accordance with relevant guidelines and regulations.

Study workflow

The pipeline workflow for the following analysis is represented in a schematic architecture (Fig 2). This flow diagram shows the preprocessing before and after acquisition of the mobility data and, most importantly, how synthetic data were used to parameterize the metapopulation model. Since obtaining non-processed data from third parties was impossible, we overlaid noise on pre-processed mobility data to determine the impact of differential privacy on the metapopulation model.

Mobility data

We obtained mobility data from Camber Systems (the provider), a third-party analytics company that purchased advertising technology (ad tech) data from many data brokers. The data covered 90 days from August 15 to November 15, 2020, representing between 3% and 7% of the American Community Survey (ACS) county-specific population in New York State. The original data consisted of a log of user global positioning system (GPS) coordinates, sorted and grouped by a unique device identification number. These data had all identifying information removed, were cleaned to remove duplicate entries or unrealistic usage, were used to calculate device-specific modal locations, and were aggregated at the county level [29]. The key metric of interest used in these analyses was movement between counties in 8-hour increments. Movement was defined as the change in a device’s location from time period t-1 to the location of the device at time t. To further guarantee anonymity, the provider used a predefined group of devices per area, removed data that represented small numbers of devices, and applied an initial layer of privacy noise to the data set to ensure that the basic privacy preservation mechanisms were in place before providing access to these data to researchers [30]. However, the data provider did not disclose the initial privacy method and the degree of noise applied to maintain individual and group privacy. We then added an additional layer of post-production differential privacy (PPDP) (see next section) and aggregated the data into 24-hour blocks of time with averaged transitions between counties. The process consists of generating an origin/destination matrix normalised to the ACS population for each county. The matrix was then randomly sampled and replicated 500 times to extend the data set time period. In most locations, the simulated outbreak was just beginning or in the exponential growth phase after the 90 days of the primary dataset. Over such a short window it is particularly challenging to assess the impact of DP on epidemiological metrics, which is why we used bootstrapping to form a mobility dataset of 500 days.
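The aggregation and bootstrapping steps described above can be illustrated with a minimal sketch. It assumes, hypothetically, that the 8-hour matrices are stacked in a NumPy array of shape (blocks, counties, counties), that a daily matrix is the mean of its three 8-hour blocks, and that the 500-day series is built by sampling days with replacement. The function names `to_daily` and `extend_by_resampling` and the toy Poisson input are illustrative and are not taken from the study's pipeline.

```python
import numpy as np


def to_daily(blocks_8h):
    """Average each run of three consecutive 8-hour matrices into one daily matrix."""
    blocks = np.asarray(blocks_8h, dtype=float)
    n_days = blocks.shape[0] // 3  # drop any incomplete trailing day
    trimmed = blocks[: n_days * 3]
    return trimmed.reshape(n_days, 3, *blocks.shape[1:]).mean(axis=1)


def extend_by_resampling(daily, n_days, seed=0):
    """Build a longer series by sampling daily matrices with replacement."""
    rng = np.random.default_rng(seed)
    daily = np.asarray(daily)
    picks = rng.integers(0, daily.shape[0], size=n_days)
    return daily[picks], picks


# Toy input (assumption): 90 days x 3 blocks per day for 4 hypothetical counties.
toy_rng = np.random.default_rng(42)
blocks = toy_rng.poisson(lam=30, size=(270, 4, 4))

daily = to_daily(blocks)
extended, picks = extend_by_resampling(daily, n_days=500)
print(daily.shape, extended.shape)
```

Sampling whole days keeps each origin/destination matrix internally consistent; it does not preserve any weekly or seasonal ordering in the original 90 days.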

Application of differential privacy

As background, a mechanism M: H→R taking a database in a domain H and producing outputs in a domain R is (ϵ, δ)-differentially private if and only if for every pair of neighboring databases x, y∈H, such that they differ in at most one entry, and for any subset of possible outputs S⊆R, we have

\Pr[M(x) \in S] \leq e^{ϵ} \Pr[M(y) \in S] + δ,

(1)

where the probability is taken over the randomness of the mechanism M and δ is denoted the “security parameter” [31]. Eq (1) implies that when two databases x, y differ in at most one entry, the outputs of the mechanism are nearly indistinguishable, so it becomes difficult for attackers to uncover private information about the observed individuals. This is achieved by perturbing the true observations by adding noise from a carefully chosen distribution. The privacy-loss parameter ϵ bounds how much an attacker with nearly full information about a database can learn about whether their target is in the database. DP offers a quantifiable tradeoff between accuracy and privacy. Mobility data is aggregated data that could display the transitions of small groups of individuals. Our goal is to preserve the privacy of these groups and hide low transitions by applying differential privacy.

The Laplace mechanism is a common differential privacy mechanism that adds Laplace noise to query values, with a noise scale of Δ/ϵ, where Δ is the query sensitivity. Adaptive composition of DP mechanisms allows us to design a mechanism from several building blocks, with the overall privacy loss bounded by the advanced composition theorem [10].
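As an illustration of the Laplace mechanism, the sketch below perturbs a small count matrix with noise of scale Δ/ϵ. It uses NumPy directly rather than the library used in the study, and the sensitivity of 1, the function name `laplace_mechanism`, and the toy transition counts are assumptions made for the example.

```python
import numpy as np


def laplace_mechanism(counts, epsilon, sensitivity=1.0, rng=None):
    """Return counts plus independent Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    scale = sensitivity / epsilon  # smaller epsilon means larger noise
    return counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)


# Hypothetical daily transitions between three counties (not real data).
transitions = np.array([[0.0, 120.0, 4.0],
                        [95.0, 0.0, 1.0],
                        [6.0, 0.0, 0.0]])

noisy = laplace_mechanism(transitions, epsilon=0.05, rng=np.random.default_rng(1))
print(np.round(noisy, 1))
```

With ϵ = 0.05 and a sensitivity of 1, the noise scale is 20, so small counts such as 1 or 4 are effectively hidden while large flows keep their order of magnitude. Noisy counts can be negative; clipping or rounding them afterwards is post-processing and does not weaken the privacy guarantee.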

For all ϵ, δ, δ′>0, the class of (ϵ, δ)-differentially private mechanisms satisfies (ϵ′, kδ+δ′)-differential privacy under k-fold adaptive composition for (Eq 2):

ϵ' = \sqrt{2 k \log(\frac{1}{δ'})} ϵ + k ϵ (e^{ϵ} - 1)

(2)

where δ′ is a security parameter set to an extremely small value, δ′ = 2⁻³⁰.

To assess the tradeoff between privacy and utility, we further privatize the synthetic data using the composition theorem, with the privacy parameter ϵ ranging from 0.01 to 16, by means of the Laplace mechanism using the ‘smartnoise sdk’ library [32]. The transition data contain the movements for 8-hour time blocks over 90 days, and using the advanced composition theorem with k = 270, the total privacy budget is as follows (Eq 3):

ϵ' = \sqrt{540 \log(\frac{1}{δ'})} ϵ + 270 ϵ (e^{ϵ} - 1) = 84.6 ϵ + 270 ϵ (e^{ϵ} - 1)

(3)

For ϵ = 0.01, δ′ = 2⁻³⁰, we have ϵ′ = 0.8731 used to the existing deployment.
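The budget calculation can be checked numerically. The sketch below gives a general function for Eq 2 and a second function that evaluates Eq 3 with the coefficient exactly as printed. The base of the logarithm is not stated in the text, and the natural logarithm used in `advanced_composition` is an assumption, so its leading coefficient may differ from the 84.6 printed in Eq 3. Both function names are illustrative.

```python
import math


def advanced_composition(eps, k, delta_prime):
    """Total privacy loss under k-fold adaptive composition (Eq 2), natural log assumed."""
    leading = math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
    return leading + k * eps * (math.exp(eps) - 1)


def eq3_as_printed(eps):
    """Eq 3 evaluated with the numeric coefficient printed in the text (k = 270)."""
    return 84.6 * eps + 270 * eps * (math.exp(eps) - 1)


print(round(eq3_as_printed(0.01), 4))
for eps in (0.01, 0.05, 1.0):
    print(eps, advanced_composition(eps, k=270, delta_prime=2 ** -30))
```

Evaluating the printed form at ϵ = 0.01 gives 0.846 + 0.0271, which reproduces the stated ϵ′ = 0.8731. The second term grows quickly with ϵ, so the composed budget is dominated by the square-root term only for small per-release values.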

The rationale for using this range of epsilon lies in the fact that below 0.01 the infused noise is extremely large, compromising the accuracy of the transition matrix, and above 16 the total privacy budget is extremely large, compromising privacy. More specifically, since the transition matrix used already has privacy noise applied, a value of ϵ = 16 means that the synthetic transition matrix obtained is similar to the one received from the provider. However, for ϵ = 0.01, the synthetic data are more protective since low transitions are more hidden due to the large amount of noise added through the Laplace mechanism. To simplify interpretation, from here on we report noise, defined as the inverse of the privacy loss ϵ (noise = 1/ϵ).
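Because results are reported in terms of noise rather than ϵ, a small conversion helper makes the thresholds easier to read. The sketch below assumes only the stated relationship that noise is the inverse of ϵ; the function names are illustrative.

```python
def epsilon_to_noise(eps):
    """Noise level as used in the text: the inverse of the privacy loss."""
    return 1.0 / eps


def noise_to_epsilon(noise):
    """Privacy loss corresponding to a reported noise level."""
    return 1.0 / noise


# Endpoints of the evaluated range and the thresholds quoted in the results.
for eps in (0.01, 0.03, 0.05, 16):
    print(f"epsilon = {eps:<5} noise = {epsilon_to_noise(eps):.2f}")
```

Under this convention the noise threshold of 20 corresponds to ϵ = 0.05, the threshold of 33.33 corresponds to ϵ = 0.03, and the evaluated range of ϵ from 0.01 to 16 spans noise values from 100 down to 0.0625.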

Metapopulation model

The disease dynamic was modeled with a Susceptible-Exposed-Infected Symptomatic-Infected asymptomatic model as follows (Eqs 4–7).

\frac{d S_{i}}{d t} = - \frac{β S_{i} I_{i}^{s}}{N_{i}} - \frac{μ β S_{i} I_{i}^{a}}{N_{i}}

(4)

\frac{d E_{i}}{d t} = \frac{β S_{i} I_{i}^{s}}{N_{i}} + \frac{μ β S_{i} I_{i}^{a}}{N_{i}} - \frac{E_{i}}{Z}

(5)

\frac{d I_{i}^{s}}{d t} = α \frac{E_{i}}{Z} - \frac{I_{i}^{s}}{D}

(6)

\frac{d I_{i}^{a}}{d t} = (1 - α) \frac{E_{i}}{Z} - \frac{I_{i}^{a}}{D}

(7)

where S_i, E_i, I_{i}^{s}, I_{i}^{a}, and N_i are the susceptible, exposed, infected symptomatic, infected asymptomatic, and total population in a county i.
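To make the single-county dynamics concrete, the sketch below integrates Eqs 4–7 with a simple forward Euler scheme, using the parameter values listed in Table 1. It is deterministic and ignores mobility, so it is not the stochastic metapopulation model used in the study. The infectious terms in Eqs 4 and 5 are read as the symptomatic and asymptomatic compartments of Eqs 6 and 7, which is an assumption about notation; the step size, the function name `simulate_seir`, and the population of 2 million with 20 seeded cases are illustrative choices.

```python
def simulate_seir(n_pop, seeded, days, beta=0.8, mu=0.5, alpha=0.65,
                  z=3.6, d=3.14, dt=0.1):
    """Forward Euler integration of Eqs 4-7 for one county; returns daily states."""
    s, e, i_s, i_a = float(n_pop - seeded), 0.0, float(seeded), 0.0
    steps_per_day = round(1 / dt)
    history = []
    for _ in range(days):
        for _ in range(steps_per_day):
            # Force of infection: asymptomatic cases transmit at a reduced rate mu.
            new_exposed = beta * s * i_s / n_pop + mu * beta * s * i_a / n_pop
            d_s = -new_exposed
            d_e = new_exposed - e / z
            d_is = alpha * e / z - i_s / d
            d_ia = (1 - alpha) * e / z - i_a / d
            s += dt * d_s
            e += dt * d_e
            i_s += dt * d_is
            i_a += dt * d_ia
        history.append((s, e, i_s, i_a))
    return history


history = simulate_seir(n_pop=2_000_000, seeded=20, days=200)
peak_day = max(range(len(history)), key=lambda t: history[t][2])
print("day of symptomatic peak:", peak_day + 1)
print("symptomatic at peak:", round(history[peak_day][2]))
```

The equations have no explicit recovered compartment, so individuals leaving the infectious classes simply drop out of the tracked total.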

The synthetic mobility datasets were integrated into the previous system (Eqs 4–7), as documented in [33], giving the following equations (Eqs 8–12),

\frac{d S_{i}}{d t} = - \frac{β S_{i} I_{i}^{r}}{N_{i}} - \frac{μ β S_{i} I_{i}^{u}}{N_{i}} + θ \sum_{j} \frac{M_{i j} S_{j}}{N_{j} - I_{j}^{r}} - θ \sum_{j} \frac{M_{j i} S_{i}}{N_{j} - I_{j}^{r}}

(8)

\frac{d E_{i}}{d t} = \frac{β S_{i} I_{i}^{r}}{N_{i}} + \frac{μ β S_{i} I_{i}^{u}}{N_{i}} - \frac{E_{i}}{Z} + θ \sum_{j} \frac{M_{i j} E_{j}}{N_{j} - I_{j}^{r}} - θ \sum_{j} \frac{M_{j i} E_{i}}{N_{j} - I_{j}^{r}}

(9)

\frac{d I_{i}^{r}}{d t} = α \frac{E_{i}}{Z} - \frac{I_{i}^{r}}{D}

(10)

\frac{d I_{i}^{u}}{d t} = (1 - α) \frac{E_{i}}{Z} - \frac{I_{i}^{u}}{D} + θ \sum_{j} \frac{M_{i j} I_{j}^{u}}{N_{j} - I_{j}^{r}} - θ \sum_{j} \frac{M_{j i} I_{i}^{u}}{N_{j} - I_{j}^{r}}

(11)

N_{i} = N_{i} + θ \sum_{j} M_{i j} - θ \sum_{j} M_{j i}

(12)

where S_i, E_i, I_{i}^{r}, I_{i}^{u}, and N_i are the susceptible, exposed, documented infected, undocumented infected, and total population in a county i.

The system of equations (Eqs 8–12) thus took into account both the mobility and the contagion describing the epidemic’s evolution on the metapopulation network. We assumed that the randomness in the contagion followed a Poisson distribution, as documented elsewhere [33]. More specifically, we seeded cases in a specific location; then, for each time t, the disease spread through the metapopulation network according to the transition matrix as people moved between counties from the first day to the 500th day. Key parameters and sources in the literature are described in Table 1.

Table 1. Parameters of the metapopulation model.

Parameter	Definition	Value	Reference
β	Transmission rate	0.8	Estimated
μ	Factor of reduction of the transmission rate	0.5	Li et al. [33]
α	Fraction of symptomatic	0.65	Li et al. [33]
Z	Average latency period (days)	3.6	Li et al. [33]
D	Average duration of infection (days)	3.14	Li et al. [33]
M_ij	Number of people travelling from county j to county i daily	-	Estimated
M_ji	Number of people travelling from county i to county j daily	-	Estimated
θ	Multiplicative travel factor	1	Li et al. [33]

Epidemiological metrics

In reviewing epidemiological models using mobility data, we identified salient metrics of interest, including:

Probability of infection [34]: let G(V, E) denote the mobility network of unknown topology, where the vertices V are counties or individuals and an edge (u, v) ∈ E represents a contact between u and v that is likely to result in infection. Each individual is assigned one of the four states of the disease dynamics model described above (Eqs 8–12). There is a probability that an epidemic will evolve through a particular sequence of states ϕ₁, ϕ₂,…,ϕ_n and a probability P that it will arrive at a certain state. The probability of a given state ϕ₁ is influenced by its previous state ϕ₀ (Markov property). The probability that u is infected (in state I^u or I^r) by the nth step is given by (Eq 13):

P(u | ϕ_{1}) = \sum_{j = 1}^{n} P(ϕ_{j}(u) \in \{I^{u}, I^{r}\} | ϕ_{1})

(13)

The risk of importation [35]: let F_i be the cumulative distribution function describing the likelihood that the disease is imported to a county j from a county i, T_i the probability associated with travel from i, and n_i the travel flux from i. The daily risk of importation R_{j, t} is given by (Eq 14):

R_{j, t} = \frac{\sum_{i} F_{i} n_{i} T_{i}}{\sum_{j} F_{j} n_{j}}

(14)

Incubation period: the period of time between exposure to the disease-causing agent and the onset of symptoms [36].

Mean importation rate: the average number of infected cases that move to county j during the epidemiological season (Eq 15):

\bar{R_{j, t}} = \frac{1}{N} \sum_{i} R_{j, t}

(15)

The effective reproduction number R_e [33]: a time-dependent metric that measures how quickly a disease spreads. It is the largest eigenvalue of the next-generation matrix and is given by (Eq 16):

R_{e} = α β D + (1 - α) μ β D

(16)

Epidemic peak is the maximum number of infected individuals over the time span of the epidemic [37].

Timing of the peak corresponds to the day on which the epidemic peak is reached [38].

The rate of spread is the ratio of infected to susceptible individuals over a time span [39].

Proportion of counties with at least one case [40].

Epidemic size is the total number of infected individuals in each county divided by that county's population size, multiplied by 100,000 [41].

Size at the epidemic peak is the same ratio evaluated on the day of the epidemic peak [42].
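
Several of the metrics listed above reduce to simple array operations once a simulation has produced infection counts per day and county. The Python sketch below evaluates Eq 16 with the Table 1 values and computes the epidemic peak, its timing, the proportion of counties with at least one case, and the epidemic size per 100,000. It is illustrative only: the function names and toy data are hypothetical, and the input array is assumed to hold daily new infections, so that cumulative sums give the total number infected.

```python
import numpy as np

def effective_reproduction(alpha, beta, mu, D):
    """Eq 16: R_e = alpha * beta * D + (1 - alpha) * mu * beta * D."""
    return alpha * beta * D + (1 - alpha) * mu * beta * D

def summarise_epidemic(new_infections, population):
    """Summary metrics from daily new infections (rows: days, columns: counties)."""
    new_infections = np.asarray(new_infections, dtype=float)
    population = np.asarray(population, dtype=float)
    daily_total = new_infections.sum(axis=1)      # infections per day, all counties
    peak_day = int(daily_total.argmax())          # 0-indexed day of the peak
    cumulative = new_infections.cumsum(axis=0)    # running total per county
    return {
        "epidemic_peak": float(daily_total[peak_day]),
        "peak_day": peak_day,
        "prop_counties_with_case": float((cumulative[-1] > 0).mean()),
        "epidemic_size_per_100k": cumulative[-1] / population * 100_000,
        "size_at_peak_per_100k": cumulative[peak_day] / population * 100_000,
    }

# Baseline parameters from Table 1.
r_e = effective_reproduction(alpha=0.65, beta=0.8, mu=0.5, D=3.14)

# Toy data (hypothetical): 5 days, 3 counties.
toy_cases = [[1, 0, 0], [3, 1, 0], [5, 2, 0], [2, 4, 0], [0, 1, 0]]
toy_population = [1000, 2000, 500]
metrics = summarise_epidemic(toy_cases, toy_population)
print(round(r_e, 4), metrics["epidemic_peak"], metrics["peak_day"])
```

With the Table 1 values, Eq 16 gives R_e ≈ 2.07. In the toy data the peak of 7 infections falls on the third day (index 2), and two of the three counties record at least one case.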

Epidemiological scenarios

To assess the effect of noise on these metrics, we evaluated eight scenarios with three salient characteristics and provided a general formula to incorporate more. We evaluated scenarios where the epidemiological metrics of interest were driven by i) the location of the first case, ii) changes in connectivity, and iii) changes in epidemiological parameters (Table 2).

Table 2. Overview of the scenarios and the investigated question.

	Characterisation of the Scenario	Scenario	Investigated question
1	Location of the first cases	The epidemic started in two large counties, i.e., Kings and Queens.	How does the perturbation of the transition matrix affect the epidemic curve when the epidemic starts in well-visited areas that are in New York City?
2	Location of the first cases	The epidemic started in two small counties, i.e., Allegany and Essex.	How does the perturbation of the transition matrix affect the epidemic curve when the epidemic starts in well-visited areas that are not in New York City?
3	Change in Connectivity	The epidemic started in two counties with 20% of transitions, i.e., Saratoga and Monroe.	How does the perturbation of the transition matrix affect the epidemic curve when the epidemic started in a region with low connectivity to New York City?
4		The epidemic started in two random counties in the median of transitions, i.e., Westchester and Putnam.	Is there a "dose-response" of the interaction between transitions?
5		The epidemic started in the counties of Schoharie and Lewis, which have no transitions to New York City.	How does the perturbation of the transition matrix affect the epidemic curve when the epidemic starts in locations that seem to be isolated from New York City?
6	Change in parameters of the metapopulation model	Increase the contact rate (β = 0.9)	How does the perturbation of the transition matrix affect the epidemic curve when the epidemic is more transmissible?
7		Increase asymptomatic burden (α = 0.75)	How does the perturbation of the transition matrix affect the epidemic curve when the burden of transmission is passed on by asymptomatic individuals?
8		Decrease the asymptomatic burden (α = 0.01)	How does the perturbation of the transition matrix affect the epidemic curve when only symptomatic individuals transmit?


We explored several spatial epidemiological questions (Table 2) with our scenarios, including how the place of outbreak affects the dynamics of the disease, how connectivity networks could affect epidemic dynamics, and how DP might ultimately affect the metrics of interest.

To assess the impact of privacy on the epidemiological metrics, we ran each set of parameters through 1000 Monte Carlo iterations and visualised the results.
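
The Monte Carlo step can be pictured as repeatedly drawing a noisy copy of the mobility matrix, recomputing a metric, and summarising its spread across iterations. The Python sketch below does this for a toy 3 × 3 matrix. It is not the authors' pipeline: the noise mechanism is not specified in this section, so Laplace noise with sensitivity 1 is assumed purely for illustration, and the matrix, the value of ε, the metric (share of trips arriving in county 0), and all function names are hypothetical.

```python
import numpy as np

def perturb_matrix(M, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise (assumed mechanism) to every entry, then clip at zero."""
    noisy = M + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=M.shape)
    return np.clip(noisy, 0.0, None)

def inflow_share(M, county=0):
    """Share of all trips that arrive in `county` (M[i, j]: trips from j to i)."""
    return M[county].sum() / M.sum()

def monte_carlo(M, epsilon, n_iter=1000, seed=0):
    """Recompute the metric on n_iter independently perturbed matrices."""
    rng = np.random.default_rng(seed)
    return np.array(
        [inflow_share(perturb_matrix(M, epsilon, rng)) for _ in range(n_iter)]
    )

# Toy daily mobility matrix (hypothetical counts).
M = np.array([[0.0, 1200.0, 300.0],
              [1500.0, 0.0, 450.0],
              [200.0, 600.0, 0.0]])

draws = monte_carlo(M, epsilon=0.05)
low, median, high = np.percentile(draws, [2.5, 50, 97.5])
print(f"true={inflow_share(M):.4f} median={median:.4f} "
      f"95% interval=({low:.4f}, {high:.4f})")
```

Comparing the width of the interval across values of ε gives a quick view of how much privacy noise a given metric can tolerate before it departs noticeably from its unperturbed value.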

Supporting information

S1 Fig. Simulated scenarios epidemiological curve embedding perturbed mobility matrices.

In Scenarios 1 and 2, the disease starts in two large and two small counties, respectively; in Scenarios 3, 4, and 5, the epidemic starts in counties with no, medium, or high connectivity with neighboring counties; in Scenarios 6, 7, and 8, key parameters such as the asymptomatic burden and the contact rate are varied. I^s, I^a, and Obs denote infected symptomatic, infected asymptomatic, and observed, respectively.

(DOCX)


Data Availability

Data and code are available at https://github.com/crisisready/DP_Metapopulation.

Funding Statement

A portion of this study was generously supported by a Trust in Science Grant awarded by the Harvard Data Science Initiative to A.S., S.B., and C.B. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The other authors received no specific funding for this work.

References

1.Grantz KH, Meredith HR, Cummings DAT, Metcalf CJE, Grenfell BT, Giles JR, et al. The use of mobile phone data to inform analysis of COVID-19 pandemic epidemiology. Nat Commun. 2020;11: 4961. doi: 10.1038/s41467-020-18190-5
2.Oliver N, Lepri B, Sterly H, Lambiotte R, Deletaille S, De Nadai M, et al. Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Sci Adv. 2020;6: eabc0764. doi: 10.1126/sciadv.abc0764
3.Wu W, Niu X. Influence of Built Environment on Urban Vitality: Case Study of Shanghai Using Mobile Phone Location Data. J Urban Plan Dev. 2019;145: 04019007. doi: 10.1061/(ASCE)UP.1943-5444.0000513
4.Yabe T, Jones NKW, Rao PSC, Gonzalez MC, Ukkusuri SV. Mobile phone location data for disasters: A review from natural hazards and epidemics. Comput Environ Urban Syst. 2022;94: 101777. doi: 10.1016/j.compenvurbsys.2022.101777
5.Wesolowski A, Qureshi T, Boni MF, Sundsøy PR, Johansson MA, Rasheed SB, et al. Impact of human mobility on the emergence of dengue epidemics in Pakistan. Proc Natl Acad Sci. 2015;112: 11887–11892. doi: 10.1073/pnas.1504964112
6.Fiadino P, Ponce-Lopez V, Antonio J, Torrent-Moreno M, D’Alconzo A. Call Detail Records for Human Mobility Studies: Taking Stock of the Situation in the “Always Connected Era.” Proceedings of the Workshop on Big Data Analytics and Machine Learning for Data Communication Networks. New York, NY, USA: Association for Computing Machinery; 2017. pp. 43–48. doi: 10.1145/3098593.3098601
7.Wesolowski A, Metcalf CJE, Eagle N, Kombich J, Grenfell BT, Bjørnstad ON, et al. Quantifying seasonal population fluxes driving rubella transmission dynamics using mobile phone data. Proc Natl Acad Sci. 2015;112: 11114–11119. doi: 10.1073/pnas.1423542112
8.Rocher L, Hendrickx JM, de Montjoye Y-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun. 2019;10: 3069. doi: 10.1038/s41467-019-10933-3
9.Pyrgelis A, Troncoso C, De Cristofaro E. Knock Knock, Who’s There? Membership Inference on Aggregate Location Data. arXiv; 2017. Available: http://arxiv.org/abs/1708.06145.
10.Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. Found Trends Theor Comput Sci. 2014;9: 211–407. doi: 10.1561/0400000042
11.Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. L-diversity: privacy beyond k-anonymity. 22nd International Conference on Data Engineering (ICDE’06). 2006. pp. 24–24. doi: 10.1109/ICDE.2006.1
12.El Emam K, Dankar FK. Protecting Privacy Using k-Anonymity. J Am Med Inform Assoc JAMIA. 2008;15: 627–637. doi: 10.1197/jamia.M2716
13.Sweeney L. k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY. Int J Uncertain Fuzziness Knowl-Based Syst. 2002;10: 557–570. doi: 10.1142/S0218488502001648
14.Dwork C, McSherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. In: Halevi S, Rabin T, editors. Theory of Cryptography. Berlin, Heidelberg: Springer; 2006. pp. 265–284. doi: 10.1007/11681878_14
15.Yang X, Fienberg SE, Rinaldo A. Differential Privacy for Protecting Multi-dimensional Contingency Table Data: Extensions and Applications. J Priv Confidentiality. 2012;4: 101–125.
16.Erlingsson Ú, Pihur V, Korolova A. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 2014. pp. 1054–1067. doi: 10.1145/2660267.2660348
17.Learning with Privacy at Scale. In: Apple Machine Learning Research [Internet]. [cited 9 Jun 2023]. Available: https://machinelearning.apple.com/research/learning-with-privacy-at-scale.
18.Ding B, Kulkarni J, Yekhanin S. Collecting Telemetry Data Privately. arXiv; 2017. Available: http://arxiv.org/abs/1712.01524.
19.Desfontaines D. A list of real-world uses of differential privacy. 2021. doi: 10.7910/DVN/TDOAPG/DGSAMS
20.Dajani AN, Lauger AD, Singer PE, Kifer D, Reiter JP, Machanavajjhala A, et al. The modernization of statistical disclosure limitation at the U.S. Census Bureau.
21.U.S. Census Bureau QuickFacts: United States. [cited 11 Mar 2023]. Available: https://www.census.gov/quickfacts/fact/table/US#.
22.Calvetti D, Hoover AP, Rose J, Somersalo E. Metapopulation Network Models for Understanding, Predicting, and Managing the Coronavirus Disease COVID-19. Front Phys. 2020;8. Available: https://www.frontiersin.org/articles/10.3389/fphy.2020.00261.
23.Coletti P, Libin P, Petrof O, Willem L, Abrams S, Herzog SA, et al. A data-driven metapopulation model for the Belgian COVID-19 epidemic: assessing the impact of lockdown and exit strategies. BMC Infect Dis. 2021;21: 503. doi: 10.1186/s12879-021-06092-w
24.Balcan D, Colizza V, Gonçalves B, Hu H, Ramasco JJ, Vespignani A. Multiscale mobility networks and the spatial spreading of infectious diseases. Proc Natl Acad Sci. 2009;106: 21484–21489. doi: 10.1073/pnas.0906910106
25.Houssiau F, Rocher L, de Montjoye Y-A. On the difficulty of achieving Differential Privacy in practice: user-level guarantees in aggregate location data. Nat Commun. 2022;13: 29. doi: 10.1038/s41467-021-27566-0
26.de Montjoye Y-A, Gambs S, Blondel V, Canright G, de Cordes N, Deletaille S, et al. On the privacy-conscientious use of mobile phone data. Sci Data. 2018;5: 180286. doi: 10.1038/sdata.2018.286
27.Bassolas A, Barbosa-Filho H, Dickinson B, Dotiwalla X, Eastham P, Gallotti R, et al. Hierarchical organization of urban mobility and its connection with city livability. Nat Commun. 2019;10: 4817. doi: 10.1038/s41467-019-12809-y
28.Savi MK, Yadav A, Vembar N, Kishore N. A standardized differential privacy framework for epidemiological modeling with mobile phone data. 2022. Available: https://github.com/crisisready/DP_Metapopulation.
29.Kishore N, Kiang MV, Engø-Monsen K, Vembar N, Schroeder A, Balsari S, et al. Measuring mobility to monitor travel and physical distancing interventions: a common framework for mobile phone data analysis. Lancet Digit Health. 2020;2: e622–e628. doi: 10.1016/S2589-7500(20)30193-X
30.Pereira M, Kim A, Allen J, White K, Ferres JL, Dodhia R. U.S. Broadband Coverage Data Set: A Differentially Private Data Release. arXiv; 2021. doi: 10.48550/arXiv.2103.14035
31.Murtagh J, Vadhan S. The Complexity of Computing the Optimal Composition of Differential Privacy. In: Kushilevitz E, Malkin T, editors. Theory of Cryptography. Berlin, Heidelberg: Springer Berlin Heidelberg; 2016. pp. 157–175. doi: 10.1007/978-3-662-49096-9_7
32.OpenDP. SmartNoise—OpenDP SmartNoise. Available: https://docs.smartnoise.org/en/stable/index.html.
33.Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science. 2020;368: 489–493. doi: 10.1126/science.abb3221
34.Shapiro M, Delgado-Eckert E. Finding the probability of infection in an SIR network is NP-Hard. Math Biosci. 2012;240: 77–84. doi: 10.1016/j.mbs.2012.07.002
35.Gilbert M, Pullano G, Pinotti F, Valdano E, Poletto C, Boëlle P-Y, et al. Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study. The Lancet. 2020;395: 871–877. doi: 10.1016/S0140-6736(20)30411-6
36.CDC LC Quick Learn: Using an Epi Curve to Determine Most Likely Period of Exposure. [cited 15 May 2023]. Available: https://www.cdc.gov/training/quicklearns/exposure/.
37.Cadoni M, Gaeta G. Size and timescale of epidemics in the SIR framework. Phys Nonlinear Phenom. 2020;411: 132626. doi: 10.1016/j.physd.2020.132626
38.Balcan D, Hu H, Goncalves B, Bajardi P, Poletto C, Ramasco JJ, et al. Seasonal transmission potential and activity peaks of the new influenza A(H1N1): a Monte Carlo likelihood analysis based on human mobility. BMC Med. 2009;7: 45. doi: 10.1186/1741-7015-7-45
39.Kishore N, Kahn R, Martinez PP, Salazar PMD, Mahmud AS, Buckee CO. Lockdown related travel behavior undermines the containment of SARS-CoV-2. medRxiv. 2020; 2020.10.22.20217752. doi: 10/ghts2k
40.Souch JM, Cossman JS, Hayward MD. Interstates of Infection: Preliminary Investigations of Human Mobility Patterns in the COVID-19 Pandemic. J Rural Health. 2021;37: 266–271. doi: 10.1111/jrh.12558
41.Amhare AF, Tao Y, Li R, Zhang L. Early and Subsequent Epidemic Characteristics of COVID-19 and Their Impact on the Epidemic Size in Ethiopia. Front Public Health. 2022;10. Available: https://www.frontiersin.org/articles/10.3389/fpubh.2022.834592. doi: 10.3389/fpubh.2022.834592
42.Zhou Y, Xu R, Hu D, Yue Y, Li Q, Xia J. Effects of human mobility restrictions on the spread of COVID-19 in Shenzhen, China: a modelling study using mobile phone data. Lancet Digit Health. 2020;2: e417–e424. doi: 10.1016/S2589-7500(20)30165-5

PLOS Digit Health. doi: 10.1371/journal.pdig.0000233.r001

Decision Letter 0

Michele Tizzoni, Gaurav Laroia

9 May 2023

PDIG-D-23-00098

A standardised differential privacy framework for epidemiological modelling with mobile phone data

PLOS Digital Health

Dear Dr. Kishore,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Jul 08 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Michele Tizzoni

Academic Editor

PLOS Digital Health

Journal Requirements:

1. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex.

2. Please provide separate figure files in .tif or .eps format.

For more information about figure files please see our guidelines:

https://journals.plos.org/digitalhealth/s/figures

https://journals.plos.org/digitalhealth/s/figures#loc-file-requirements

Additional Editor Comments (if provided):

Both referees have found the manuscript to be a valuable work that addresses a timely topic of interest.

Please, in your revision, carefully consider the requests of both referees especially regarding the clarity of your methods and the description of the epidemiological scenarios.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I don't know

Reviewer #2: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper addresses an important topic, and presents a timely application of differential privacy. However, I believe that the presentation needs to be improved before publication.

Specifically, while I believe the privacy methodology to be sound, I have concerns about its presentation:

1. The exact mechanism used is never explicitly described, and as such it is hard to ensure that it satisfies differential privacy (especially in regards to the computation of the sensitivity). My understanding is that the original data already comes in aggregated form (transition matrices for 8-hour slots), to which individual users contribute at most once (sensitivity = 1). Noise is then added to these matrices before they are aggregated to form daily transition matrices. Is that accurate?

2. In line 233, you mention that the data already has an "initial layer of privacy noise". This should be explained in more detail: what mechanism was used? Is it DP? If so, what privacy budget was used?

3. The computation of epsilon on line 265 is inaccurate, but I suspect this is due to a typo in the value of delta. It seems you used 1.75e-6, which is reasonable. It would also be good to explain briefly how you chose this value of delta.

4. In lines 31 and 44, the paper mentions that an "epsilon of 0.05" could be used: this is a bit misleading due to the composition, and the total value (around 4 after advanced composition) should be used instead in these segments.

I found the epidemiological analysis interesting (although this is outside my area of expertise), but I think the results could be presented in a clearer way. At the moment, it is sometimes unclear whether the focus is on the results themselves or the impact of the privacy noise on the results. If possible, it would be good to show, e.g., a plot displaying how DP impacts key metrics. Your recommendation in the abstract of using epsilon=0.05 should also be more explicit in the results section.

Minor comments:

- Could you please explain why and how you resample the data (line 238), and what impact (if any) it would have on the analysis?

- Your definition of differential privacy (lines 244-245) should include delta, as you use it later for advanced composition.

- I was a bit surprised by some of the references cited for the privacy side. In line 73, if you wish to say that aggregated data can be re-identified, I would cite https://arxiv.org/abs/1708.06145 rather than [8]. In line 75, I think references [9-15] do not represent "several privacy frameworks", at least as would be understood by the computer science community (i.e., k-anonymity, l-diversity, federated learning). Similarly, reference [18] on line 90 points to the OpenDP website, which does not support the claim. For that sentence, it would also be good to add references for the use of DP by Google etc. (see, e.g., https://desfontain.es/privacy/real-world-differential-privacy.html for a list of up-to-date references)

Reviewer #2: The authors present an interesting paper on the impact of differential privacy techniques applied to mobility data on the output of epidemiological models. I believe this paper is very relevant. Given the central role that mobility data have acquired during the past years in computational epidemiology, similar contributions are needed as we build better data and modeling preparedness for the next global health emergency.

I do not have major concerns with the validity of the work, but I have some remarks mostly aimed at improving the accessibility of the paper:

1) It was not straightforward for me to understand what Fig 1 is showing and connect this with the commentary in the Results section. I would recommend to include a caption to Fig.1 and a brief description of it at the beginning of the results section. For example, apart from the legend in the figure it is never mentioned what the colors of the heatmap in Fig. 1 indicate and how they should be interpreted.

2) The epidemiological metrics need a better explanation. Currently, they are just listed along with references in the methods section. While the interpretation of some metrics is quite self-explanatory, for others it may not be as straightforward especially for readers that are less familiar with epidemiological modeling. Therefore, I recommend to include more details on the definition of these metrics in the Methods section.

3) I have a similar remark for the 8 epidemiological scenarios considered. At the current stage, the methods section has a very brief paragraph related to their description. I believe the paper would benefit from more details about the simulation scenarios.

4) I would like to see clarified a bit more the sampling procedure of the mobility matrix. It is my understanding that the simulations last for 500 days, while the mobility data covers roughly 4 months. It is unclear to me at the moment how we go from a matrix M_t that varies daily for ~4 months to the 500 days one.

5) The providers of mobility data apply privacy filters to ensure that the privacy of users is preserved. Therefore, researchers receive preprocessed data that has undergone privacy-preserving algorithms. This is also the case with the data used in the paper under consideration. As the authors mentioned, the data has undergone privacy-preserving algorithms twice. However, I have two questions concerning this point.

5a) First, I wonder if this can bias the maximum estimated threshold of noise that can be applied without biasing too much the epidemiological metrics. My point is that it may be higher than 20 as the data have already some noise. While I do not think authors should change their analysis, this point should be at least discussed.

5b) My second point concerns the authors' approach to operationalizing the proposed methodology. In its current form, the paper lacks a comprehensive discussion on this topic. Mobility data is typically provided with noise, and researchers have limited knowledge about the methods used by data providers to ensure user privacy. Therefore, it is unclear how the proposed methodologies can overcome these challenges. One potential solution is to collaborate with data providers and encourage them to apply privacy-preserving algorithms that do not compromise the accuracy of epidemic models. Alternatively, privacy-preserving algorithms can be reapplied to the data, but this approach requires more caution as different providers may have applied vastly different preprocessing techniques. Overall, the paper would benefit from a more in-depth exploration of the potential challenges associated with operationalizing the proposed methodology and a thorough discussion of possible solutions.

--------------------

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. 2023 Oct 27;2(10):e0000233. doi: 10.1371/journal.pdig.0000233.r002

Author response to Decision Letter 0

17 Jul 2023

Attachment

Submitted filename: Point to point response to reviewers.pdf


PLOS Digit Health. doi: 10.1371/journal.pdig.0000233.r003

Decision Letter 1

Michele Tizzoni, Gaurav Laroia

3 Sep 2023

A standardised differential privacy framework for epidemiological modelling with mobile phone data

PDIG-D-23-00098R1

Dear Dr. Kishore,

We are pleased to inform you that your manuscript 'A standardised differential privacy framework for epidemiological modelling with mobile phone data' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Michele Tizzoni

Academic Editor

PLOS Digital Health

***********************************************************

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: None

Reviewer #2: None

**********

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Simulated scenarios epidemiological curve embedding perturbed mobility matrices.

(DOCX)

Click here for additional data file.^{(145.5KB, docx)}

Attachment

Submitted filename: Point to point response to reviewers.pdf

Data Availability Statement

Data and codes are available at https://github.com/crisisready/DP_Metapopulation.
