Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

Haoran Li; Xiaoqian Jiang; Li Xiong; Jinfei Liu

doi:10.1145/2806416.2806441

. Author manuscript; available in PMC: 2016 Mar 11.

Published in final edited form as: Proc ACM Int Conf Inf Knowl Manag. 2015 Oct;2015:1001–1010. doi: 10.1145/2806416.2806441

Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

Haoran Li ¹, Xiaoqian Jiang ², Li Xiong ³, Jinfei Liu ⁴

PMCID: PMC4788513 NIHMSID: NIHMS760493 PMID: 26973795

Abstract

Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on “one-time” release of a static dataset and do not adequately address the increasing need of releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composibility of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods.

Keywords: Differential privacy, adaptive sampling, dynamic dataset release

1. Introduction

Sharing dynamic private data while providing privacy guarantee enables many important data mining and knowledge discovery applications. Consider the examples below:

Medical research

A hospital gathers data from individual patients every day. The dynamic datasets, e.g. the daily datasets of individual patients with fevers, coughs, and different demographic attributes can be shared with researchers for cohort discovery, medical research, and seasonal epidemic outbreak monitoring.

Traffic Monitoring

A GPS service provider gathers data from individual users about their locations, speeds, mobility, etc. The dynamic datasets, e.g., the numbers of users at different regions during each time period, can be mined for commercial interest, such as congestion patterns on the roads.

A common scenario of such applications is that a trusted server gathers data from a large number of individual subscribers. The aggregated data can be then continuously shared with other untrusted entities for various purposes. The trusted server, i.e. publisher, therefore must ensure that releasing the data does not compromise the privacy of any individual who contributed data. The goal of our work is to enable publishers to share a series of dynamic private datasets over individual users while guaranteeing their privacy.

The current state-of-the-art standard for privacy preserving data publishing is differential privacy [9, 27], which requires that the output released by a data provider be perturbed by a randomized algorithm 𝒜, so that the output of 𝒜 remains roughly the same even if any individual tuple in the input data is arbitrarily modified. Given the output of 𝒜, an adversary will not be able to infer much about any individual tuple in the input, and thus privacy is protected.

Most existing works on differentially private data release focus on “one-time” release of static data (e.g. [20, 29, 26, 7, 17], etc). In this paper, we study the problem of releasing histograms for dynamic datasets while guaranteeing user-level differential privacy, i.e., protecting the presence of a user in the entire series of dynamic datasets. In the worst case, a user may be present in all datasets in the series. A straight-forward application of the standard differential privacy mechanism or existing histogram method to each snapshot of the dataset will lead to a very high perturbation error O(N) in the order of the number of datasets or snapshots N in the series, due to the composition theorem [22].

A set of related works have studied the problem of releasing aggregate time series and stream statistics. The works in [12, 6] proposed differentially private continual counters over a binary stream. However, both works adopt an event-level differential privacy, which protects the presence of an individual event, i.e. a user's contribution to the data stream at a single time point, rather than her presence or contribution to the entire series. The works in [25, 13, 14] studied the problem of releasing aggregate time-series with user-level differential privacy. Both works consider temporal correlations of the time-series. The paper [25] uses a Discrete Fourier Transform approach and is not applicable to real-time applications when data needs to be released at each time point. Other works [13, 14] take a model based approach which assumes original data is generated by an underlying process and uses the model based prediction to improve the accuracy of the released data. The limitation is that the model needs to be assumed or learned from public data with similar patterns and the method may not be effective when the real data deviates from the model.

The recent work [18] studies the problem similar to ours and represents the state-of-the-art. It proposed a novel w-event privacy framework by combining user-level and event-level privacy, which essentially guarantees user-level privacy within any window of w timestamps. When w is set to the number of time points in the series of data, or infinity for infinite data streams, it converges to user-level privacy. In addition, it proposed a sampling approach with various privacy budget allocation schemes to release data. However, in their schemes, privacy budgets may be exhausted prematurely or not fully utilized, still leading to suboptimal utility of the released data.

Our contributions

In this paper, we present a novel and principled adaptive distance-based sampling approach for releasing multiple histograms for a series of dynamic datasets in real time. We summarize the contributions and features of our approach below.

We propose a distance-based sampling approach to address the dynamics of evolving datasets under user-level differential privacy. Instead of generating a differentially private (DP) histogram at each time stamp, we only compute new histograms when the update is significant, i.e., the distance between the current dataset and the latest released dataset is higher than a threshold. Both the distance computation and threshold comparison are designed to guarantee differential privacy. The key observation is that datasets may be subject to small updates at times. Distance-based sampling allows us to release a new histogram only when the datasets have significant updates, hence saving the privacy budget and reducing the overall error of released histograms. In contrast to [18], we use an explicit threshold to determine the sampling points, inspired by the sparse vector technique [15] originally proposed for releasing DP counts only when the counts are greater than a threshold. The explicit threshold based sampling provides two advantages: 1) we can predefine a threshold based on the expected update rate of the data if there is prior domain knowledge, 2) we can dynamically adjust the threshold in a principled way based on data dynamics. Another important feature of our approach is that it is orthogonal to the histogram method used for each time point, i.e. it can use any of the state-of-the-art static differentially private histogram release method (e.g. [9, 28, 7, 26, 7, 17, 24, 20, 21, 30) as a black box, which is efficient and effective for generating “one time” histograms.
We present two methods for defining the threshold. The first method, DSFT (Distance-based Sampling with Fixed Threshold), uses a predefined threshold T. The second improved method, DSAT (Distance-based Sampling with Adaptive Threshold), applies a feedback control mechanism to adaptively adjust the threshold T. Real world dynamic datasets may exhibit varying update behaviors across different settings. The adaptive threshold mechanism allows us to dynamically adjust the threshold T without having to rely on prior knowledge to tune the threshold. We use a PID (Proportional, Integral, and Derivative) controller [2] to detect the dynamics and adaptively adjust the threshold such that the privacy budget is not depleted prematurely due to high update and sampling rates or insufficiently utilized due to low update and sampling rates.
We present formal analysis of differential privacy guarantees, complexity, and utility for DSFT and DSAT. In our approach, each released DP histogram has either a perturbation error O(C) at sampling points, where C is the maximum number of released DP histograms (C ≪ N) or an update error with an upper bound (see Section 6). We also show a formal analysis of how to select optimal algorithmic parameters given a required utility guarantee.
In addition to standard user-level differential privacy, we further extend our methods under the framework of w-event privacy [18], so it can work with infinite series of evolving datasets.
Finally, we present extensive experiments using both synthetic and real datasets. Experiment results demonstrate that our methods significantly outperform the baseline approaches and existing state-of-the-art techniques [18].

We state the problem setting of releasing dynamic datasets under differential privacy in Section 3 and introduce w-event privacy and existing state-of-the-art solutions. We present our methods DSFT and DSAT while provide formal privacy analysis in Section 4, then follow by the utility analysis in Section 5. We extend our techniques to w-event privacy framework in Section 6. We include detailed experimental evaluation of our algorithms in Section 7 and conclude in Section 8.

2. Related Work

Several mechanisms (e.g. [12, 6], etc) focus on event-level privacy in releasing counters, i.e. in publishing the number of event occurrences at every time point since the commencement of the system. These mechanisms consider the data stream as a bit string and at each time point they release the number of 1's seen so far. A set of related work focus on releasing aggregate time series or stream statistics under differential privacy as we discussed earlier [25, 13, 14]. The work in [25, 13] releases aggregate time-series with user-level differential privacy. Both works have some limitations as we discussed earlier. [31] releases dynamic transaction data under user-level privacy, and set an upper bound to limit the maximum number of updates to handle infinite updates. But it can only handle insertions updates.

The most recent work that is closely related to our work is Kellaris et al. [18] which deals with differentially private release of events or histograms for infinite stream. It proposes a w-event privacy framework by combining user-level and event-level privacy, which protects any event sequence occurring within any window of w timestamps. It is event-privacy with w = 1 and converges to user- level privacy with w = infinity. They also proposed two mechanisms, Budget Distribution (BD) and Budget Absorption (BA), to allocate the budget within one w-timestamp window. The key difference between our work and [18] is that our methods detect the data dynamics and adaptively adjust the distance threshold for sampling such that the privacy budget is not depleted prematurely due to high update and sampling rates or insufficiently utilized due to low update and sampling rates. In [18], privacy budgets may be depleted prematurely, especially when w is very large, or not fully utilized during the w timestamps. In addition, our method is independent of the histogram method used for each time point and can utilize any state-of-the-art histogram methods designed for static data release as a blackbox. In our experiments, we compare our methods with BD and BA in [18], since they represent the state-of-the-art and have been shown to perform better than other existing work.

Our distance threshold based sampling builds on top of the sparse vector technique [15] originally proposed for releasing differentially private counts only when the counts are greater than a threshold. The sparse vector technique has also been used in [19] for releasing top-k frequent itemsets given a static transaction dataset and a threshold derived from the kth frequent itemsets. In our work, we use the sparse vector technique in a novel way to enable differentially private distance based sampling for releasing dynamic datasets while adaptively adjusting the distance threshold.

3. Preliminaries

In this section, we formally define the problem of releasing series of real-time dynamic histograms or datasets and introduce definitions on user-level differential privacy and w-event privacy. We summarize all frequently used notations in Table 1.

Table 1. Frequently used notations.

Notation	Discription
D	A series of original dynamic datasets
D̃	A set of released DP datasets for D
D_i or D̃_i	Snapshot of D or D̃ at time point t_i
H	A series of original dynamic histograms
H̃	A set of released DP histograms for H
H_i or H̃_i	Snapshot of H or H̃ at time point t_i
N	Number of time points
C	Cutoff point (i.e. the upper bound of the number of released DP datasets)
U	Domain universe or number of histogram bins
ε	Overall privacy budget
ε₁	Privacy budget for the decision step
ε₂	Privacy budget for the sampling step
d(D_i, D̃_j)	The distance between D_i and D̃_j
Δ	The sensitivity of L₁ distance

Open in a new tab

3.1 Problem definition

Let N denote the total number of time points. Let D denote a series of original dynamic datasets and D_i be a dataset snapshot at time stamp t_i. We assume all snapshots have the same domain universe U, the product of domains of all attributes. For every t_i, we are to release a private dataset D̃_i. Over the N time stamps, the series of privately released dynamic datasets D̃={D̃_i:1 ≤ i ≤ N} should guarantee user-level ε-differential privacy.

In this paper, we call H as a series of original dynamic histograms (corresponding to D) with H_i as a snapshot at t_i, and H̃ as a series of released private dynamic histograms with H̃_i as a private snapshot at t_i. Since a dataset can be transformed to a histogram, and a synthetic dataset can be constructed from a histogram, D and H are interchangeable in this paper.

3.2 Differential Privacy

Intuitively, a randomized mechanism 𝒜 is differentially private if its outcome is not significantly affected by the removal or addition of any record. ε-differential privacy is formally defined as Pr[𝒜(D) ∈ 𝒪] ≤ e^ε Pr[𝒜(D′) ∈ 𝒪], where 𝒪 is any arbitrary set of possible outputs of 𝒜, D and D′ are two neighbouring datasets differing in at most one record (i.e. D can be obtained from D′ by adding or removing at most one record). In our problem definition, an adversary should learn approximately the same information about any individual user given D̃, irrespective of its presence or absence in D, and one individual can be present in up to N snapshots in D. Two series of dynamic datasets D and D̂ are user-level neighbors if one can be obtained by adding or removing one individual (including all its occurrences in the snapshots) from the other. Then user-level ε-differential privacy is defined as below.

Definition 3.1 (User-Level ε-Differential Privacy). Let 𝒜 be a randomized mechanism over two user-level neighbors D, and D̂ which differ in one user's presence in the entire series, and let 𝒪 be any arbitrary set of possible outputs of 𝒜. Algorithm 𝒜 satisfies ε-differential privacy iff the following holds

Pr [A (D) \in O] \leq e^{ε} Pr [A (\hat{D}) \in O]

Laplace Mechanism

Dwork et al. [11] show that ε-differential privacy can be achieved by adding i.i.d. Laplace noise to query result q(D), where D is a dataset. Formally, q̃(D) = q(D) + (v₁, …, v_M)′, where $v_{i} ~ Lap (0, \frac{GS (q)}{ε})$ , for i = 1, …, M, and M is the dimension of q(D). v_i follows a Laplace distribution with mean zero and scale $\frac{GS (q)}{ε}$ , where GS(q) denotes the global sensitivity [11] of the query q. The global sensitivity is the maximum L₁ distance between the results of q from any two neighbouring datasets D and D′, formally defined as GS(q) = max_D,D′ ‖q(D) − q(D′)‖₁. In our problem setting, the global sensitivity of any two user-level neighbors D and D̂ is formally defined as

GS (q) = {max}_{D, \hat{D}} {‖ q (D) - q (\hat{D}) ‖}_{N} .

For a sequence of DP mechanisms, the sequential composition theorem [22] guarantees its overall privacy as follows:

Theorem 3.1 (Sequential Composition [22]). For a sequence of n mechanisms M₁, …, M_n and each M_i provides ε_i-differential privacy, the sequence of M_i will provide $(\sum_{i = 1}^{n} ε_{i})$ differential privacy.

Hence, one way to achieve epsilon-differential privacy for the entire series of D is to apply Laplace mechanism for each D_i with noise $Lap (\frac{N}{ε})$ , which leads to O(N) noise.

(α, σ)-usefulness

We use a formal utility metric (α, σ)-usefulness [3] to analyze the utility of each snapshot D̃_i in D̃.

Definition 3.2 ((α, σ)-Usefulness). A randomized mechanism A is (α, σ) -useful for queries in class 𝒞 if with probability 1 − σ, for every query Q ∈ 𝒞 and a dataset D, A(D) = D̃, |Q(D) − Q(D̃)| ≤ α.

3.3 w-event privacy

W-event privacy [18] is proposed as an extension of differential privacy to address release of infinite streams. It guarantees user-level ε-differential privacy for every sub sequence of length w (or over w timestamps) anywhere (i.e. it can start from any timestamp) in the original series of dynamic datasets. w-neighboring series of dynamic datasets, D_w, and D̂_w, can be defined as the user-level neighbors under any sub sequence of length w anywhere. w-event privacy can be formally given as below:

Definition 3.3 (w-Event ε-Differential Privacy). Let 𝒜 be a randomized mechanism over two w-neighboring series of dynamic datasets D_w, and D̂_w, and let 𝒪 be any arbitrary set of possible outputs of 𝒜. Algorithm 𝒜 satisfies w-event ε-differential privacy (or, w-event privacy) iff the following holds

Pr [A (D_{w}) \in O] \leq e^{ε} Pr [A ({\tilde{D}}_{w}) \in O]

3.4 Baseline and Existing State-of-the-Art Solutions

Given our problem of releasing dynamic datasets under user-level privacy, we review some baseline and existing state-of-the-art methods which will motivate our approach. We will also compare our approach with these methods in the experiment section.

Baseline method

A baseline method is to apply existing “one time” DP histogram release methods to the dataset at every time point. If each released DP histogram preserves $\frac{ε}{N}$ -differential privacy, the series of N dynamic datasets guarantee ε-differential privacy by sequential composition theorem. This results in an overall noise of O(N) which can be extremely large for large N. In an unbounded setting with N being infinite, this method will not be useful.

Fixed-sampling method

Another potential solution is to release $\frac{N}{I}$ DP histograms periodically given a sampling interval I. Privacy budget $ε / \frac{N}{I}$ is allocated to each dataset at the sampling time point, and the entire private dataset series preserve ε-differential privacy. Unfortunately, the pre-defined sampling interval may not accurately capture the update pattern in the original series of dynamic datasets, leading to either high perturbation errors if sampling too frequently or large update errors if sampling not frequently enough or at wrong time points.

Approaches in w-event privacy

[18] proposes a sampling approach which computes the noisy distance between the dataset at the current time point and the original dataset at the latest sampling point, and then compares the noisy distance with the perturbation noise to be added if current dataset is to be released. If the distance is greater than the perturbation noise, a noisy dataset is released at current time stamp. The perturbation noise is determined by their privacy budget allocation schemes, Budget Distribution (BD) and Budget Absorption (BA), that allocate the budget to different times-tamps in the w-event window. BD allocates the privacy budget in an exponentially decreasing fashion, in which earlier timestamps obtain exponentially more budget than later ones. BA starts by uniformly distributing the budget to all w timestamps, and accumulates the budget of non-sampling timestamps, which can be allocated later to the sampling timestamps. A main drawback of their approach is that the privacy budget may be exhausted prematurely (sampling too frequently in the beginning) or not fully utilized during all w timestamps (sampling not frequent enough), leading to suboptimal utility of the released data.

4. Adaptive Sampling Approach

We propose an adaptive distance-based sampling approach to address the dynamics of evolving datasets under user-level differential privacy. Instead of generating a differentially private histogram at each time stamp, we only compute new histograms when the update is significant, i.e., the distance between the current dataset and the latest released dataset is higher than a threshold. The key observation is that datasets may be subject to small updates at times. Distance-based sampling allows us to release a new histogram only when the datasets have significant updates, hence saving the privacy budget and reducing the overall error of released histograms. In contrast to [18], we use an explicit threshold for distance comparison to determine the sampling points, which provides two advantages: 1) we can predefine a threshold based on the expected update rate of the data if there is prior domain knowledge, 2) we can dynamically adjust the threshold in a principled way based on data dynamics.

In this section, we first present the basic method, DSFT, which uses a predefined fixed threshold. This will allow us to analyze its privacy property which also applies to our adaptive method and facilitate our description of the adaptive method. We then introduce our adaptive method, DSAT, which dynamically adjusts the threshold in a principled way to adapt to the data dynamics.

4.1 DSFT

DSFT (Distance-based Sampling with Fixed Threshold) uses a fixed threshold and is divided into two steps at each time point ti: decision and sampling. The decision step computes a noisy distance between the original dataset H_i at current time stamp and the latest released histogram H̃_j and determines if it is larger than a noisy threshold T̃. If yes, the sampling step generates a new DP histogram H̃_i, otherwise it outputs the previous H̃_j. The overall privacy budget is divided between the decision (ε₁) and sampling (ε₂) steps which are designed to guarantee differential privacy as we will analyze later.

Algorithm 1 presents DSFT. Line 1-4 initializes the privacy budget for the two steps, computes the noisy threshold, and releases a DP histogram at the first time stamp. Line 5-11 carry out the decision (line 7-8) and sampling (line 8-9) steps for each time point t_i if the number of released histograms is below the cutoff point C, and releases the last histogram with all remaining budget. For the distance d(H_i, H̃_j), we use the L₁ distance in our implementation and other distance metrics (e.g. KL divergence) can be also used.


Algorithm 1 Distance-based Sampling with Fixed Threshold Algorithm (DSFT)

Input: D = {D_i\|1 ≤ i ≤ N, i ∈ Z}, T, C and ε.
Output: D̃ = {D̃_i\|1 ≤ i ≤ N, i ∈ Z}
1:	Set ε₁ = kε, ε₂ = ε − ε₁, k is computed due to theorem 5.4;
2:	Set $\tilde{T} = T + Lap (\frac{2 Δ}{ε_{1}})$ , Δ is computed due to lemma 4.1;
3:	For D₁, release a DP dataset D̃₁ with $\frac{ε_{2}}{C}$ privacy budget;
4:	Set count = 1, and j = 1;
5:	for each time point t_i with i ≥ 2 do
6:	if count ≥ C, then set D̃_i = D̃_j continue;
7:	Set $\tilde{d} (D_{i}, {\tilde{D}}_{j}) = d (D_{i}, {\tilde{D}}_{j}) + Lap (\frac{2 C Δ}{ε_{1}})$
8:	if d̃(D_i, D̃j) ≥ T̃, then release D̃_i at t_i with $\frac{ε_{2}}{C}$ budget, and set count = count + 1, and j = i;
9:	else use D̃_j as the release of D_i;
10:	if i == N and count < C, then release D̃_N with all remaining privacy budget;
11:	end for

Open in a new tab

4.2 DSAT

In DSFT, a prior knowledge on D is needed for the user to determine an appropriate value T. Suppose there exists an optimal value of T which can enable the algorithm to exactly generate C DP histograms. If the threshold T is higher than the optimal value, there will be remaining privacy budgets that are not utilized. On the contrary, if T is smaller than the optimal value, the privacy budget will be exhausted prematurely, resulting in update errors for remaining time points. In this section, we present DSAT, Distance-based Sampling with Adaptive Threshold, that releases a series of DP dynamic histograms while adaptively adjusting the threshold T_i for each time point, based on data dynamics. With DSAT, we do not have to find an optimal value of T, which may be difficult in practice.

Figure 1 illustrates the framework of DSAT. Intuitively, we wish to have C sampling points over N time points, hence our target sampling rate is $\frac{C}{N}$ . Suppose we have released C_i histograms at t_i. If $\frac{C_{i}}{i} < \frac{C}{N}$ , we need to decrease the threshold to allow more sampling time points, and vice versa. For each t_i, we adjust the threshold based on the feedback error between the update ratio $\frac{C_{i}}{i}$ at t_i and the target ratio $\frac{C}{N}$ , which is formally defined below.

Definition 4.1 (Feedback Error). We define the feedback error E_i at t_i as follows:

E_{i} = | \frac{C_{i}}{i} - \frac{C}{N} |

(1)

where C_i means the number of sampling time points till t_i, C is the cutoff point, and N is the total number of time points.

DSAT adopts a PID (Proportional-Integral-Derivative) [2], a generic control loop feedback mechanism, to dynamically adjust the threshold T over time. Under our problem setting, we redefine the three correcting terms, Proportional, Integral, and Derivative, with the feedback error defined in Equation (1). These three terms are summed to compute the output u_i of PID controller at t_i. The final PID algorithm is defined as:

u_{i} = \underset{proportional term}{\underset{︸}{θ_{P} \times e_{i}}} + \underset{integral term}{\underset{︸}{θ_{I} \times \sum_{τ = t_{i} - w + 1}^{t_{i}} e_{τ}}} + \underset{derivative term}{\underset{︸}{θ_{D} \times \frac{e_{i} - e_{j}}{t_{i} - t_{j}}}}

(2)

where θ_P, θ_I, θ_D are respectively the proportional gain, the integral gain, and the derivative gain, e_τ is the error at t_τ, t_i is the current time point, t_j is the latest sampling time point.

Proportional term

The first proportional term produces an output value that is proportional to the current error e_i. The proportional term can be amplified by the proportional gain θ_P. In our context, the error e_i at the current time point t_i is calculated by

e_{i} = \frac{| E_{i} - δ |}{δ}

(3)

where E_i is the feedback error defined in equation (1), parameter δ is the set point for E_i. We assume δ is 5% in our empirical studies, i.e. the maximum tolerance for the feedback error is 5%. It can be determined by users according to specific applications. The proportional term is defined as θ_P × e_i.

Integral term

The integral term is to eliminate the cumulated offset through multiplying the sum of the instantaneous error over time by the integral gain. We define the integral term as $θ_{I} \times \sum_{τ = t_{i} - w + 1}^{t_{i}} e_{τ}$ , where θ_I is the integral gain and w represents the integral time window denoting how many recent errors are taken.

Derivative term

The derivative term determines the slope of error over time and changes the PID output in proportion to this rate of change via the derivative gain θ_D. It is defined as $θ_{D} \times \frac{e_{i} - e_{j}}{t_{i} - t_{j}}$ . Given the PID error u_i, a new threshold T_i produced at the current time point t_i can be determined as follows:

T_{i} = T_{i - 1} + sign (\frac{C_{i}}{i} - \frac{C}{N}) \times θ \times u_{i}

(4)

T_i₋₁ is the threshold produced at the previous time point t_i₋₁. Parameter θ determines the magnitude of impact of PID error on the T_i. sign(.) is a sign function, indicating that if the update ratio $\frac{C_{i}}{i}$ is larger than the target ratio $\frac{C}{N}$ , we need to increase T_j to generate less DP histograms and reduce the update ratio, and vice versa. Our DSAT uses only the proportional term in equation 2 in our experiment setting, for simplicity. That means, we set θ_P = 1, θ_I = 0, θ_D = 0, and u_i is the same with e_i as defined in equation (3).


Algorithm 2 Distance-based Sampling with Adaptive Threshold Algorithm (DSAT)

Input: D = {D_i\|1 ≤ i ≤ N, i ∈ Z}, T, C and ε.
Output: D̃ = {D̃_i\|1 ≤ i ≤ N, i ∈ Z}
1:	Run step 1,2,3,4 in Algorithm 1;
2:	Skip the first M timestamps;
3:	for each time point t_i with i > M do
4:	if count ≥ C, then set D̃_i = D̃_j
5:	Set $\tilde{d} (D_{i}, {\tilde{D}}_{j}) = d (D_{i}, {\tilde{D}}_{j}) + Lap (\frac{2 C Δ}{ε_{1}})$
6:	Set $E_{i} = \| \frac{count}{i} - \frac{C}{N} \|, e_{i} = \frac{\| E_{i} - δ \|}{δ}$ , and u_i = θe_i;
7:	if $\frac{count}{t} - \frac{C}{N} \leq 0$ , then set T̃_i = max{0, T̃_i₋₁ − u_i}
8:	else set T̃_i = min{2, T̃_i₋₁ + u_i};
9:	if d̃(D_i, D̃_j) ≥ T̃_i then
10:	release a DP dataset D̃_i at t_i with $\frac{ε_{2}}{C}$ budget, and set count = count + 1, j = i;
11:	else
12:	release D̃_j;
13:	end if
14:	if i == N and count < C then
15:	release D̃_N with all remaining privacy budget;
16:	end if
17:	end for

Open in a new tab

Algorithm 2 presents DSAT. We use T_i to denote the produced threshold at t_i and other notations are the same as Algorithm 1. In Line 1, T₁ is set to be $T + Lap (\frac{Δ}{{\hat{ε}}_{1}})$ . Different from Algorithm 1, ε̃₁ is a tiny privacy budget because the initial value T₁ is not significant in DSAT. We only need to bound it between 0 and 2, which is the domain of the L1 distance. Line 2 uses D̃₁ for the first M time points where M is a small integer number to allow a burn-in period and enough discrepancy to be accumulated, avoiding frequent updating of T_i during the beginning time periods. M can be user-specified and is not a sensitive parameter besides that it is much smaller than N. The algorithm from Line 3 to Line 12 is similar to Algorithm 1 except Line 6 to Line 8 which use the PID control to adaptively adjust and generate a new threshold T_i.

4.3 Privacy Analysis

Sensitivity analysis of L₁ Distance

In the sensitivity analysis, we use n_p (n_q) to denote the sum of all histogram bin counts of the histograms H_p (H_q). U is the number of histogram bins. Since the L₁ distance of Algorithm 1 and Algorithm 2 is computed using one private histogram and one original histogram, we only need to protect privacy for the original histogram.

Lemma 4.1. The sensitivity of L₁ distance d(H̃_p, H_q) is, $Δ = \frac{2}{n_{q} - 1}$ , where H̃_p (H_q) is a DP histogram with the sum of all histogram bin counts as n_p (n_q). (Proof omitted due to space limitation)

Privacy guarantee

Inspired by Hardt et al. [15], we formally provide the proof of privacy guarantee for the decision stage below. The intuition behind theorem 4.1 is that, the noises on both sides of $d (D_{i}, {\tilde{D}}_{j}) + Lap (\frac{2 C Δ}{ε_{1}}) \geq T + Lap (\frac{2 Δ}{ε_{1}})$ are necessary for the decision stage to be differentially private, even though T is publicly known.

Theorem 4.1. In algorithm 1, the decision stage guarantees ε₁-differential privacy.

Proof. D is a series of dynamic datasets with D = (D₁, …, D_N) over N time points. D̂ is the user-level neighbor of D, which is D̂ = (D̂₁, …, D̂_N). We say D̂ is the user-level neighbour of D if we can obtain D̂ by removing or adding only one individual user from D by the definition in section 3.2.

Let d_i denote d(D_i, D̃_j) for every i, j ∈ [N]([N] = {1, …, N} beginning with i = 2, which is the true distance between D_i ad D̃_j, where D̃_j is the private dataset released in the latest sampling time point t_j. Let d̃_i denote d̃(D_i, D̃_j), which is the DP L₁ distance.

For all pairs of user-level neighbours D and D̃, and the corresponding L₁ distance vectors d = (d₁, …, d_N), we need to prove:

log (\frac{{Pr}_{D} [d = \tilde{d}]}{{Pr}_{\tilde{D}} [d = \tilde{d}]}) = \sum_{i = 1}^{N} log (\frac{{Pr}_{D} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}{{Pr}_{\tilde{D}} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}) \leq ε_{1} .

Because d_i is affected only by d_i₋₁ at the previous time point, we have Pr_D[d_i = d̃_i|d̃_i₋₁] = Pr_D̃[d_i = d̃_i|d_i₋₁, …, d̃₁]. Let S = {i : d̃_i ≥ T̃} be the set of indices of d̃_i at all sampling time points, and S^C = {i : d̃_i ≤ T̃} be the set of indices of d̃_i at all non-sampling time points, we have $log (\frac{{Pr}_{D} [d = \tilde{d}]}{{Pr}_{\tilde{D}} [d = \tilde{d}]}) = \sum_{i \in S} log (\frac{{Pr}_{D} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}{{Pr}_{\tilde{D}} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}) + \sum_{i \in S^{C}} log (\frac{{Pr}_{D} [d_{i} = \emptyset | {\tilde{d}}_{i - 1}]}{{Pr}_{\tilde{D}} [d_{i} = \emptyset | {\tilde{d}}_{i - 1}]})$ .

Now we need to bound the two sums respectively. For the first sum, we can see that (1) independent Laplace noise with $Lap (\frac{2 C Δ}{ε_{1}})$ is added to each distance with $\frac{ε_{1}}{2 C}$ differential privacy, (2) the computation of each L₁ distance needs to access the original histogram once, and (3) |S| ≤ C due to the algorithm, so we can obtain the following equation due to sequential composition theorem:

\sum_{i \in S} log (\frac{{Pr}_{D} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}{{Pr}_{\hat{D}} [d_{i} = {\tilde{d}}_{i} | {\tilde{d}}_{i - 1}]}) = \sum_{i \in S} log (\frac{{Pr}_{D} [v_{i} = {\tilde{d}}_{i} - d_{i}]}{{Pr}_{\hat{D}} [v_{i} = {\tilde{d}}_{i} - d_{i}]}) \leq \frac{ε_{1}}{2}

For the second sum, let A_Z(D) be the set of all values of the noise variables (v₁, …, v_N₋₁) that cause d̃_i ≤ T̃ for all i ∈ S^C, when the mechanism runs on D, conditioning on T̃ = Z and d_i = d̃_i for all i ∈ S. Since from D to D̂, all distances may be increased by at most Δ (i.e. $Δ = \frac{2}{n - 1}$ for the L₁ distance due to lemma 4.1), which will cause each distance to remain less than T̃ if we increase T̃ by Δ. But the distances larger than T̃ may become less than T̃ + Δ, so A_T̃_+Δ(D̂) ⊆ A_T̃(D) ⊆ A_T̃_+Δ(D̂). Thus, we have: $\frac{{Pr}_{D} {\tilde{T} = T + v_{1}}}{{Pr}_{\hat{D}} {\tilde{T} = T + Δ + v_{2}}} \leq exp (\frac{ε_{1}}{2})$ . Therefore, let Z₁ = T + v₁ and Z₂ = T + Δ + v₂, we have:

\begin{matrix} \prod_{i \in N^{C}} \underset{D}{Pr} (d_{i} = ∅ | {\tilde{d}}_{i - 1}) \\ = \int_{- \infty}^{\infty} Pr (\hat{d} = Z_{1}) Pr ((v_{1}, …, v_{k}) \in A_{Z} (D)) d Z \\ \leq exp (\frac{ε_{1}}{2}) \int_{- \infty}^{\infty} Pr (\hat{T} = Z_{1}) Pr ((v_{1}, …, v_{k}) \in A_{Z_{1}} (D)) d Z \\ \leq exp (\frac{ε_{1}}{2}) \int_{- \infty}^{\infty} Pr (\hat{T} = Z_{2}) Pr ((v_{1}, …, v_{k}) \in A_{Z_{2}} (D')) d Z \\ = exp (\frac{ε_{1}}{2}) \int_{- \infty}^{\infty} Pr (\hat{T} = Z_{1}) Pr ((v_{1}, …, v_{k}) \in A_{Z_{1}} (D')) d Z \\ = exp (\frac{ε_{1}}{2}) \prod_{i \in N^{C}} \underset{\hat{D}}{Pr} (d_{i} = ∅ | {\tilde{d}}_{i - 1}) \end{matrix}

Therefore, we have $\frac{Π_{i \in N^{C}} {Pr}_{D} (d_{i} = \emptyset | {\tilde{d}}_{i - 1})}{Π_{i \in N^{C}} {Pr}_{\hat{D}} (d_{i} = \emptyset | {\tilde{d}}_{i - 1})} \leq \frac{ε_{1}}{2}$

Theorem 4.2. Algorithm 1 and 2 preserve ε-differential privacy.

Proof. For Algorithm 1, the decision stage preserves ε₁-differential privacy due to theorem 4.1. Since releasing at most C DP histograms guarantees ε₂-differential privacy, algorithm 1 preserves ε₁ + ε₂ = ε-differential privacy due to theorem 3.1. For Algorithm 2, since adaptively adjusting threshold (Line 6 to Line 8) uses no raw data, it does not influence differential privacy guarantee, thus Algorithm 2 guarantees ε-differential privacy.

5. Utility Analysis

We analyze the utility of DSFT and DSAT using (α, σ)-usefulness in definition 3.2 and show the conclusions in theorem 5.1 and 5.2. Since we assume LPA as the DP histogram release method, the conclusions can be heuristically used as the upper bound when new methods better than LPA are employed. Here d(H_i, H_j) denotes L₁ distance between H_i and H_j.

Error quantification of DSFT

The utility of released datasets at sampling time points are analyzed based on lemma 5.1. The error of datasets at non-sampling time points are obtained via the error bound of the decision stage in lemma 5.2.

Lemma 5.1. (Sum of Independent Laplace variables [6]) Suppose that X₁, …, X_n are independent Laplace random variables, with each X_i following a Lap(b_i) distribution. Denote $Z = \sum_{i = 1}^{n} X_{i}$ and b_M = max_ib_i. Then for all $γ \geq \sqrt{\sum_{i = 1}^{n} b_{i}^{2}}$ and $0 < λ < \frac{2 \sqrt{2} γ^{2}}{b_{M}}$ , we have $Pr [Z > λ] \leq exp (- \frac{λ^{2}}{8 γ^{2}})$ .

Lemma 5.2. In Algorithm 1, for any 0 < σ < 1, we can obtain

Pr {d (H_{i}, {\tilde{H}}_{j}) \leq T + \frac{4 \sqrt{2} Δ (1 - log σ)}{ε_{1}}} \geq 1 - σ

(5)

This means, with probability greater than 1 − σ, we can set t_i as non-sampling time points. (Proof omitted due to space limitation.)

Theorem 5.1. For a range count query covering m histogram bins on H̃_k, and 0 < σ < 1, if k is a sampling time point, we have that $Pr {| A_{k} - {\tilde{A}}_{k} | \leq - \frac{2 \sqrt{2} C log (\frac{σ}{2})}{n ε_{2}}} \geq 1 - σ$ , and if k is a non-sampling time point, we have that $Pr {| A_{k} - {\tilde{A}}_{k} | \leq T + \frac{4 \sqrt{2} Δ [1 - log (1 - σ)]}{ε_{1}} - \frac{2 \sqrt{2} C log (\frac{σ}{2})}{ε_{2}}} \geq {(1 - σ)}^{2}$ , where A_k and Ã_k are the query answers on the original histogram H_k and the DP histogram H̃_k. Therefore, each released histogram H̃_k of our algorithms maintains (α, σ)-usefulness for range count queries.

Error quantification of DSAT

We analyze the utility of DSAT based on theorem 5.2, and give the conclusion as below.

Theorem 5.2. For a range count query covering m histogram bins on H̃_k, and 0< σ < 1, if k is a sampling time point, the conclusion is the same as theorem 5.1 and if k is a non-sampling time point, we have $Pr {| A_{k} - {\tilde{A}}_{k} | \leq T_{k} + \sum_{i = 1}^{k} I_{i} u_{i} - \frac{2 \sqrt{2} C log (\frac{σ}{2})}{n ε_{2}}} \geq {(1 - σ)}^{2}$ , where I_i is a value being 1 or -1, and dependent on the data, and u_i is defined in equation (2). Therefore, each released histogram H̃_k of our algorithms maintains (α, σ)-usefulness for range count queries. (The conclusion can be obtained via equation (5) and we omitted the full proof.)

Lower bound of the data cardinality

Since the injected noise in the decision stage is related with data cardinality, we analyze the lower bound of data cardinality to guarantee a relatively small injected noise compared to the true L₁ distance. This lower bound can be used to maintain a high accuracy at the decision stage.

Theorem 5.3. In DSFT, in order to satisfy (α, σ)-usefulness and guarantee the utility of the decision stage, it requires that $n \geq \frac{16 \sqrt{2} α (1 - log (1 - σ))}{T ε_{1}}$ , where σ is defined in lemma 5.2. (Proof can be deducted from lemma 5.2)

Select the value of k in DSFT

Our algorithm requires ε to be divided between ε₁ and ε₂ with ε₁ = kε. We now analyze how to select k. Assume H = (H₁, …, H_N) corresponds to D. For each i, we analyze the incurred noise variance of L₁ distance between H_i and H̃_i when i is (1) a sampling time point and (2) a non-sampling time point.

Lemma 5.3. The noise variance of the L₁ distance between H_j and H̃_j, is ${\hat{σ}}_{1} = O (\frac{U C^{2}}{n^{2} ε_{2}^{2}})$ for a sampling time point, and ${\hat{σ}}_{2} = O (\frac{8 Δ^{2}}{ε_{1}^{2}} + \frac{32 C^{2} Δ^{2}}{ε_{1}^{2}} + \frac{U C^{2}}{n^{2} ε_{2}^{2}})$ for a non-sampling time point.

Proof. We skip this proof due to space limitation.

Theorem 5.4. If we use L₁ distance and LPA, the k value can be obtained as $k = min {\sqrt[3]{\frac{n^{2} (8 Δ^{2} + 32 C^{2} Δ^{2})}{U C^{2}}}, 1 - \frac{C}{N}}$

Proof. Since k is only used when analyzing the distance at non-sampling time points, we can obtain the upper bound of noise variance at a non-sampling time point due to lemma 5.3 with $ε_{1} = \frac{k_{ε}}{k + 1}$ and $ε_{2} = \frac{ε}{k + 1}$ by ${\hat{σ}}_{2} = \frac{8 Δ^{2} + 32 C^{2} Δ^{2}}{ε^{2}} \frac{{(k + 1)}^{2}}{k^{2}} + \frac{U C^{2}}{n^{2} ε^{2}} {(k + 1)}^{2}$ . Let f(k) = σ̂₂, then the first-order derivative of f(k) is as follows: $\nabla_{k} f (k) = - \frac{2 (k + 1)}{k^{3}} \frac{8 Δ^{2} + 32 C^{2} Δ^{2}}{ε^{2}} + 2 (k + 1) \frac{U C^{2}}{n^{2} ε^{2}}$ . By setting ∇_kf(k) = 0, we can obtain the value of k as: $k = \sqrt[3]{\frac{(8 Δ^{2} + + 32 C^{2} Δ^{2}) n^{2}}{U C^{2}}}$ . Since the second-order derivative of f(k) with respect to k is no less than 0, $k = \sqrt[3]{\frac{(8 Δ^{2} + + 32 C^{2} Δ^{2}) n^{2}}{U C^{2}}}$ is the value of k when f(k) arrives at the minimum. Simultaneously, we must require the privacy budget of each sampling time point to be no less than that of each time point in the baseline method, which leads to $\frac{ε_{2}}{C} \geq \frac{ε}{N}$ and $k \leq 1 - \frac{C}{N}$ . Therefore, we can obtain that $k = min {\sqrt[3]{\frac{n^{2} (8 Δ^{2} + 32 C^{2} Δ^{2})}{U C^{2}}}, 1 - \frac{C}{N}}$ .

6. Extensions to Infinite Streams

DSAT under w-event privacy

Algorithm 3 presents DSAT under w-event privacy. For the first w time points, we run DSAT normally and record the privacy budget ε_2,_i for every time point i, i.e. $ε_{2, i} = \frac{ε_{2}}{C}$ if i is a sampling point and ε_2,_i = 0 otherwise. For time points w + 1 to N, if the remaining privacy budget ε_rm for the current w-window is larger than zero, we compare the distance between H_i and H̃_j, modify the threshold and release a private histogram when the private distance is larger than the threshold; if no privacy budget is left, we skip the current time point and go to the next one.

Privacy guarantee

The first w time points guarantees ε-differential privacy. The condition in Line 4 of Algorithm 3 guarantees that if there is no remaining privacy budget (i.e. ε_rm ≤ 0) for the current w window from time point t_i₋₁ to t_i₋_w₊₁, no new private datasets will be released. Therefore, for any w-length window beginning with any time point, at most ε privacy budget will be used. This leads to the conclusion that Algorithm 3 satisfies w-event privacy.

7. Experiment

We implemented our methods on top of two static histogram methods, LPA in Matlab and PSD [7] in Python. All the experiments are performed on a PC with a 2.9GHz CPU and a 8GB memory. Table 2 summarizes the parameters and their default values in the experiments.

Table 2. Experiment Parameters.

Parameter	Description	Default value
N	Number of time points	500
d	Number of data dimensions	6
n	Number of tuples in D_i	500K
ε	Privacy budget	1.0
C	Cutoff point	0.01 × N
τ	Update rate	0.5
δ	Deviation tolerance	0.05
θ	Proportional gain	0.5

Open in a new tab


Algorithm 3 DSAT under w-event privacy

Input:
D = {D_i\|1 ≤ i ≤ N, i ∈ Z}, T, C and ε.
Output:
D̃ = {D̃_i\|1 ≤ i < N, i ∈ Z}
1:	Run DSAT for the first w time points;
2:	for i = (w + 1) to N do
3:	$ε_{r m} = \in_{2} - \sum_{m = i - w + 1}^{m = i - 1} ε_{2, m}$
4:	if ε_rm ≤ 0 then
5:	Set D̃_i := D_j, where j is the time point of last release;
6:	else
7:	Set $count = \frac{\sum_{m = i - w + 1}^{m = i - 1} ε_{2, m}}{ε_{2} / C}$
8:	Compute $\tilde{d} (D_{i}, {\tilde{D}}_{j}) = d (D_{i}, {\tilde{D}}_{j}) + Lap (\frac{2 C Δ}{ε_{1}})$ ;
9:	Compute $E_{i} = \| \frac{count}{i} - \frac{C}{w} \|, e_{i} = \frac{\| E_{i} - δ \|}{δ}$ , and u_i = θ × e_i;
10:	if $\frac{count}{i} - \frac{C}{w} \leq 0$ , then T̃_i = max{0, T_i₋₁ − u_i};
11:	else set T̃_i = min{2, T_i₋₁ + u_i};
12:	if d̃(D_i, D̃_j) ≥ T̃_i, then set ε_2,_i = ε₂/C, D̃_i := D_i+ < Lap(1/ε_2,_i) >^U, and j = i;
13:	else set D̃_i := D_j;
14:	end if
15:	end for

Open in a new tab

7.1 Experiment Setup

Datasets

We conducted our experiments with three datasets: the US census (http://ipums.org), the Taxi-Drive trajectory data (http://research.microsoft.com/apps/) and the Oldenburg traffic data [5].

The US census dataset contains six attributes, Age, Gender, Education, Health insurance, Marital status and Income with 3M tuples and domain sizes of 96, 2, 12, 2, 2, 3. Each tuple represents an individual user. In order to avoid the sparsity of histograms, we convert Income into a categorical attribute: values smaller than 0 (mapped to 1), values between 0 and 28K (mapped to 2), and values larger than 28K (mapped to 3). 28K is a median value. Values smaller than 0 means the tuples have ages smaller than 20. The number of histogram bins are the product of the domain sizes of all attributes.

We generate a series of dynamic datasets as follows. D_i is the original dataset at t_i. D₁ has 500K tuples randomly sampled from the original 3M tuples. A public pool is initiated using the remaining tuples. D_i (i ≥ 2) is obtained by deleting m tuples from D_i₋₁ while inserting m tuples randomly selected from the public pool to simulate the user updates. m is sampled from N(μ, σ²), where μ is $\frac{r \times | D_{i - 1} |}{2}$ , and σ² is set to 100K. Here, r is the update rate, |D_i| is the data cardinality of D_i and datasets at all time points have the same data cardinality. The time points are partitioned into 10 periods with different values of m to simulate varying update patterns. All experiments use US census data by default since we can generate various datasets under different parameter settings.

The Taxi trajectory dataset has a one_week trajectories of 10, 357 taxis during the period of Feb. 2 to Feb. 8, 2008 within Beijing. We transfer the time dimension to 168 time points with 24 × 7. The total number of points in this dataset is about 15 million and the total distance of the trajectories reaches 9 million kilometers. We partition the longitude and latitude into 10 × 10 grids. We amplify the number of taxis to 110, 357 by sampling dummy points on extremely sparse time points and geographical areas while still keeping the patterns of original data.

We generated Oldenburg traffic data with the Brinkhoff generator [5]. The input of the generator is the road map of Oldenburg in Germany, and the output is a set of moving objects on the road network. We created the data set with 1000 discrete timestamps, with 500,000 objects at the beginning. A 2D grid with 1024 × 1024 cells is used to record the locations of the moving objects.

Comparison

We evaluate the utility of the private DP histograms of dynamic datasets by answering random range count queries. The query accuracy of DSAT is compared with three solutions described in Section 3: the baseline Laplace mechanism, the fixed-sampling method, and the state-of-the-art w-event privacy methods. LPA and PSD [7] are used to generate DP histograms at sampling time points. We note that our proposed sampling framework can utilize any state-of-the-art static histogram method at each sampling point. Here we just use, as an example, the standard LPA method as well as the PSD method [7] which is a state-of-the-art static histogram method that uses spatial partitioning. The goal is to compare our proposed methods and the three solutions. We also include the non-private methods to compare the update errors of DSAT and fixed-sampling.

Metrics

For the US census dataset, we generated random range-count queries with random query predicates on each attribute defined in the SQL format as “Select COUNT(*) from D, Where A₁ ∈ I₁ and A₂ ∈ I₂ and…and A_m ∈ I_m”. I_i is a random interval generated from the domain of attribute A_i. For the traffic data, query rectangles with various sizes are randomly generated. In each experiment run, 5000 random queries are generated and the average absolute error over 10 runs is reported, which is defined as $E^{a} = \frac{1}{M \times N} \sum_{k = 1}^{M} \sum_{i = 1}^{N} | {\tilde{A}}_{i}^{k} - A_{i}^{k} |, A_{i}^{k}$ is the true answer and ${\tilde{A}}_{i}^{k}$ is the noisy answer. Here we use the range-count query to measure the utility since it composes data histograms, and the range counts can be used for many significant mining tasks, e.g. dynamic stream clustering, outlier detection of time-series data, etc.

7.2 Results on user level privacy

In all experiments, we compare our methods with the baseline and fixed-sampling methods, which are denoted by “baseline” and “fixed” in figures. Unless specified, we use LPA by default as the underlying histogram method for the sampling point. We also use “DSAT-true” and “fixed-true” to denote the non-private versions of DSAT and fixed-sampling.

Absolute error vs. k

Figure 2 investigates how utility changes with various k values, which specify the budget allocation ratio between ε₁ and ε₂ for the decision and sampling stages respectively. With the value of C being 10, we compute k to be 0.0532 due to theorem 5.4. From Figure 2, we can observe that the empirical result matches the theoretical result well and the utility reaches the optimal value with k between 0.01 and 0.1. The error increases as k becomes larger or smaller than 0.1 or or 0.01, respectively. This is reasonable because larger k may lead to more perturbation error while smaller k values result in more update error.

DSFT and DSAT

In this experiment, we compare our proposed two methods DSFT and DSAT. From figure 3, we can observe that the error of DSFT is very sensitive to the threshold value T. As T initially increases, the error decreases thanks to the decreased perturbation error. As T further increases, the error increases back up due to the increased sampling error which becomes the dominant error. Without prior knowledge, it is difficult to determine the optimal T. However, the average absolute error of DSAT is close to the lowest error of DSFT with the optimal threshold T value being around 0.025. Here the initial value of T for DSAT can be arbitrarily selected. Thus, the DSAT method with the PID control can effectively adjust T to an optimal one. In the remaining experiments, we only use DSAT to compare with other methods.

Absolute error vs. differential privacy

Figure 4 compares DSAT with other methods under various privacy budgets. The larger the privacy budget is, the closer the query accuracy is to non-private versions. Since the baseline performs one order of magnitude worse than other methods in most experiments, we do not include them for better readability of the graphs. The perturbation errors for fixed-sampling and DSAT are almost similar as the number of released DP histograms are the same. DSAT outperforms fixed-sampling because DSAT has much less update error, which can be seen from the comparison of non-private versions. Figure 4(b) uses the taxi trajectory dataset and Figure 4(c) uses PSD to release DP histograms with 3D US data. We can see that by using PSD, errors are generally improved compared to the ones using LPA. This further confirms that our methods can take advantage of any state-of-the-art static histogram methods for each sampling point.

Absolute error vs. update rate

We study the impact of the update rate r (defined in section 7.1) on the query accuracy for different methods, as shown in Figure 5. All methods remain stable for various update rates. The DSAT performs better than both non-private and private fixed-sampling methods. This is because the update error of non-private DSAT is much less than non-private fixed-sampling. This further verifies that our DSAT with PID controller succeeds in adaptively adjusting the threshold and the location of the sampling time point, leading to better performance.

Absolute error vs. dimensionality

Figure 6 examines the absolute error with various numbers of dimensions in the US dataset. DSAT again outperforms both non-private and private fixed-sampling methods with the dimensionality from 3 to 6. One interesting phenomena we observe is that the performances of non-private and private fixed-sampling methods improve sharply after five dimensions. This can be explained by the fact that a higher dimensionality results in a larger number of histogram bins. Given a threshold T, if the L₁ distance between two datasets D_i and D_i₋₁ is below T, the previously released histogram will be used which incurs an update error. Given the same L₁ distance between two histograms, a larger number of bins would result in a smaller measured update error since the average difference for each histogram bin is smaller. Hence the fixed sampling methods show a dramatic drop in the error which is dominated by the update error. The DSAT methods are less sensitive to the number of dimensions because they already mitigate the update error by tuning the threshold adaptively. Hence the non-private DSAT shows a slight drop in the update error while the private DSAT shows a slight increase due to the dominating perturbation error.

Query accuracy vs. query range size

We study the impact of the query range size on the query accuracy for different methods. For each query range size, we randomly generated queries such that the product of the query ranges on each dimension equals the given size. Figure 7 presents the impact of various query range sizes on query accuracy in terms of relative error and absolute error. The relative error is defined as $E^{r} = \frac{1}{M \times N} \sum_{k = 1}^{M} \sum_{i = 1}^{N} \frac{| {\tilde{A}}_{i}^{k} - A_{i}^{k} |}{max (s, A_{i}^{k})}$ , where s is the sanity bound to mitigate the effect for $A_{i}^{k} = 0$ . DSAT outperforms the private fixed-sampling method. The difference of relative errors between all methods is not obvious because of the large data cardinality in the US data. For all methods, the relative error gradually degrades as the query range size increases while the absolute error has the opposite trend. The reason is that when the query size is small, the true answer $A_{i}^{k}$ is also small which may incur a small absolute error but large relative error. In this experiment, the sanity bound s is set to 1.

7.3 Results on w-event privacy

Query accuracy vs. parameter w

We use the Oldenburg traffic data in this experiment, since it contains 1000 timestamps that is sufficient to investigate the impact of w. We compare DSAT with BD and BA in [18] under w-event privacy framework while varying w values. BD and BA are implemented by using column partitioning technique and setting $ε_{1} = \frac{ε}{U}$ as recommended in [18]. From figure 8, we can see that the gap between DSAT and BD or BA expands greatly as w increases. This is because our technique adaptively adjusts the threshold and allocates the privacy budget more appropriately. In contrast, BA and BD may not fully utilize or in advance exhaust most budget during w timestamps.

Query accuracy vs. differential privacy

In this experiment, we set w to be 800 using Oldenburg traffic data with 1000 timestamps. Figure 9 compares DSAT with BA and BD under various privacy budgets. We can see that BA degrades dramatically and the gap between BA and DSAT greatly expands as we reduce the privacy budget ε. This is because BA starts by uniformly distributing the budget to all w timestamps, and more perturbation error will be incurred when e is small and w is large. Our DSAT performs well since the perturbation error of released datasets depends only on C.

8. Conclusions

In this paper, we have proposed an adaptive distance-based sampling approach to address the challenges of releasing a series of differentially private dynamic datasets in real time. With an upper bound to limit the number of DP data releases, our methods incur much smaller errors. We apply an adaptive control mechanism to dynamically adjust the threshold value. We also provide privacy and utility analysis for our method. Experiments on real and synthetic datasets show that our algorithm outperforms the baseline and existing state-of-the-art techniques. As future work, we would like to study update models and incorporate them into our sampling framework. We are also interested in applying the adaptive sampling framework for releasing other types of dynamic data with differential privacy, e.g. frequent patterns for dynamically changing transactional data and dynamic graph patterns in social networks.

Acknowledgments

This work is supported by the National Institute of Health (NIH) under award number R01GM114612, the Patient-Centered Outcomes Research Institute (PCORI) under award number ME-1310-07058, and the National Science Foundation (NSF) under award number 1117763, and partly supported by NLM (R00LM011392), NLM (R21LM012060), and NHLBI (U54HL108460). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Contributor Information

Haoran Li, Email: hli57@emory.edu, Emory University, Atlanta, GA.

Xiaoqian Jiang, Email: x1jiang@ucsd.edu, University of California, San Diego, La Jolla, CA.

Li Xiong, Email: lxiong@emory.edu, Emory University, Atlanta, GA.

Jinfei Liu, Email: jinfei.liu@emory.edu, Emory University, Atlanta, GA.

References

1.Ács G, Castelluccia C, Chen R. Differentially private histogram publishing through lossy compression. ICDM. 2012 [Google Scholar]
2.Ang KH, Chong G, Li Y. Pid control system analysis, design, and technology. IEEE Trans Contr Sys Techn. 2005;13(4):559–576. [Google Scholar]
3.Avrim Blum KL, Roth A. A learning theory approach to non-interactive database privacy. STOC. 2008 [Google Scholar]
4.Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. PODS. 2007 [Google Scholar]
5.Brinkhoff T. A framework for generating network-based moving objects. Geoinformatica. 2002;6(2):153–180. [Google Scholar]
6.Chan THH, Shi E, Song D. Private and continual release of statistics. ACM Trans Inf Syst Secur. 2011;14(3):26. [Google Scholar]
7.Cormode G, Procopiuc CM, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. ICDE. 2012 [Google Scholar]
8.Cormode G, Procopiuc CM, Srivastava D, Tran TTL. Differentially private summaries for sparse data. ICDT. 2012:299–311. [Google Scholar]
9.Dwork C. Differential privacy. Automata. Languages and Programming. 2006;(Pt 2):4052. [Google Scholar]
10.Dwork C. A firm foundation for private data analysis. Commun ACM. 2011 [Google Scholar]
11.Dwork C, Mcsherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. Theory of Cryptography. :1–20. [Google Scholar]
12.Dwork C, Naor M, Pitassi T, Rothblum GN. Differential privacy under continual observation. STOC. 2010:715–724. [Google Scholar]
13.Fan L, Bonomi L, Xiong L, Sunderam V. Monitoring web browsing behaviors with differential privacy; WWW Conference; 2014. [Google Scholar]
14.Fan L, Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE TKDE. 2014;26(9):2094–2106. [Google Scholar]
15.Hardt M, Rothblum GN. A multiplicative weights mechanism for privacy-preserving data analysis. FOCS. 2010:61–70. [Google Scholar]
16.Hayy M, Rastogiz V, Miklauy G, Suciu D. Boosting the accuracy of differentially-private histograms through consistency. VLDB. 2010 [Google Scholar]
17.Kellaris G, Papadopoulos S. Practical differential privacy via grouping and smoothing. VLDB. 2013 [Google Scholar]
18.Kellaris G, Papadopoulos S, Xiao X, Papadias D. Differentially private event sequences over infinite streams. PVLDB. 2014;7(12):1155–1166. [Google Scholar]
19.Lee J, Clifton CW. Top-k frequent itemsets via differentially private fp-trees. SigKDD. 2014;2014:931–940. [Google Scholar]
20.Li H, Xiong L, Jiang X. Differentially private synthesization of multi-dimensional data using copula functions. EDBT. 2014:475–486. doi: 10.5441/002/edbt.2014.43. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Li H, Xiong L, Zhang L, Jiang X. Dpsynthesizer: Differentially private data synthesizer for privacy preserving data sharing. PVLDB. 2014;7(13):1677–1680. doi: 10.14778/2733004.2733059. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.McSherry. SIGMOD. New York, NY, USA: ACM; 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. [Google Scholar]
23.Qardaji WH, Yang W, Li N. Differentially private grids for geospatial data. ICDE. 2013 [Google Scholar]
24.Qardaji WH, Yang W, Li N. Understanding hierarchical methods for differentially private histograms. PVLDB. 2013;6(14):1954–1965. [Google Scholar]
25.Rastogi V, Nath S. Differentially private aggregation of distributed time-series with transformation and encryption. SIGMOD. 2010:735–746. [Google Scholar]
26.Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. ICDE. 2010:225–236. [Google Scholar]
27.Xiao Y, Xiong L. Protecting locations with differential privacy under temporal correlations. CCS. 2015 [Google Scholar]
28.Xiao Y, Xiong L, Fan L, Goryczka S, Li H. Dpcube: Differentially private histogram release through multidimensional partitioning. Transactions on Data Privacy. 2014;7(3):195–222. [Google Scholar]
29.Xu J, Zhang Z, Xiao X, Yang Y, Yu G, Winslett M. Differentially private histogram publication. VLDB J. 2013 [Google Scholar]
30.Zhang J, Cormode G, Procopiuc CM, Srivastava D, Xiao X. Privbayes: private data release via bayesian networks. SIGMOD. 2014:1423–1434. [Google Scholar]
31.Zhang X, Meng X, Chen R. Differentially private set-valued data release against incremental updates. DASFAA (1) 2013 [Google Scholar]

[R1] 1.Ács G, Castelluccia C, Chen R. Differentially private histogram publishing through lossy compression. ICDM. 2012 [Google Scholar]

[R2] 2.Ang KH, Chong G, Li Y. Pid control system analysis, design, and technology. IEEE Trans Contr Sys Techn. 2005;13(4):559–576. [Google Scholar]

[R3] 3.Avrim Blum KL, Roth A. A learning theory approach to non-interactive database privacy. STOC. 2008 [Google Scholar]

[R4] 4.Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. PODS. 2007 [Google Scholar]

[R5] 5.Brinkhoff T. A framework for generating network-based moving objects. Geoinformatica. 2002;6(2):153–180. [Google Scholar]

[R6] 6.Chan THH, Shi E, Song D. Private and continual release of statistics. ACM Trans Inf Syst Secur. 2011;14(3):26. [Google Scholar]

[R7] 7.Cormode G, Procopiuc CM, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. ICDE. 2012 [Google Scholar]

[R8] 8.Cormode G, Procopiuc CM, Srivastava D, Tran TTL. Differentially private summaries for sparse data. ICDT. 2012:299–311. [Google Scholar]

[R9] 9.Dwork C. Differential privacy. Automata. Languages and Programming. 2006;(Pt 2):4052. [Google Scholar]

[R10] 10.Dwork C. A firm foundation for private data analysis. Commun ACM. 2011 [Google Scholar]

[R11] 11.Dwork C, Mcsherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. Theory of Cryptography. :1–20. [Google Scholar]

[R12] 12.Dwork C, Naor M, Pitassi T, Rothblum GN. Differential privacy under continual observation. STOC. 2010:715–724. [Google Scholar]

[R13] 13.Fan L, Bonomi L, Xiong L, Sunderam V. Monitoring web browsing behaviors with differential privacy; WWW Conference; 2014. [Google Scholar]

[R14] 14.Fan L, Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE TKDE. 2014;26(9):2094–2106. [Google Scholar]

[R15] 15.Hardt M, Rothblum GN. A multiplicative weights mechanism for privacy-preserving data analysis. FOCS. 2010:61–70. [Google Scholar]

[R16] 16.Hayy M, Rastogiz V, Miklauy G, Suciu D. Boosting the accuracy of differentially-private histograms through consistency. VLDB. 2010 [Google Scholar]

[R17] 17.Kellaris G, Papadopoulos S. Practical differential privacy via grouping and smoothing. VLDB. 2013 [Google Scholar]

[R18] 18.Kellaris G, Papadopoulos S, Xiao X, Papadias D. Differentially private event sequences over infinite streams. PVLDB. 2014;7(12):1155–1166. [Google Scholar]

[R19] 19.Lee J, Clifton CW. Top-k frequent itemsets via differentially private fp-trees. SigKDD. 2014;2014:931–940. [Google Scholar]

[R20] 20.Li H, Xiong L, Jiang X. Differentially private synthesization of multi-dimensional data using copula functions. EDBT. 2014:475–486. doi: 10.5441/002/edbt.2014.43. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Li H, Xiong L, Zhang L, Jiang X. Dpsynthesizer: Differentially private data synthesizer for privacy preserving data sharing. PVLDB. 2014;7(13):1677–1680. doi: 10.14778/2733004.2733059. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.McSherry. SIGMOD. New York, NY, USA: ACM; 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. [Google Scholar]

[R23] 23.Qardaji WH, Yang W, Li N. Differentially private grids for geospatial data. ICDE. 2013 [Google Scholar]

[R24] 24.Qardaji WH, Yang W, Li N. Understanding hierarchical methods for differentially private histograms. PVLDB. 2013;6(14):1954–1965. [Google Scholar]

[R25] 25.Rastogi V, Nath S. Differentially private aggregation of distributed time-series with transformation and encryption. SIGMOD. 2010:735–746. [Google Scholar]

[R26] 26.Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. ICDE. 2010:225–236. [Google Scholar]

[R27] 27.Xiao Y, Xiong L. Protecting locations with differential privacy under temporal correlations. CCS. 2015 [Google Scholar]

[R28] 28.Xiao Y, Xiong L, Fan L, Goryczka S, Li H. Dpcube: Differentially private histogram release through multidimensional partitioning. Transactions on Data Privacy. 2014;7(3):195–222. [Google Scholar]

[R29] 29.Xu J, Zhang Z, Xiao X, Yang Y, Yu G, Winslett M. Differentially private histogram publication. VLDB J. 2013 [Google Scholar]

[R30] 30.Zhang J, Cormode G, Procopiuc CM, Srivastava D, Xiao X. Privbayes: private data release via bayesian networks. SIGMOD. 2014:1423–1434. [Google Scholar]

[R31] 31.Zhang X, Meng X, Chen R. Differentially private set-valued data release against incremental updates. DASFAA (1) 2013 [Google Scholar]

PERMALINK

Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

Haoran Li

Xiaoqian Jiang

Li Xiong

Jinfei Liu

Abstract

1. Introduction

Medical research

Traffic Monitoring

Our contributions

2. Related Work

3. Preliminaries

Table 1. Frequently used notations.

3.1 Problem definition

3.2 Differential Privacy

Laplace Mechanism

(α, σ)-usefulness

3.3 w-event privacy

3.4 Baseline and Existing State-of-the-Art Solutions

Baseline method

Fixed-sampling method

Approaches in w-event privacy

4. Adaptive Sampling Approach

4.1 DSFT

4.2 DSAT

Figure 1. DSAT Framework.

Proportional term

Integral term

Derivative term

4.3 Privacy Analysis

Sensitivity analysis of L1 Distance

Privacy guarantee

5. Utility Analysis

Error quantification of DSFT

Error quantification of DSAT

Lower bound of the data cardinality

Select the value of k in DSFT

6. Extensions to Infinite Streams

DSAT under w-event privacy

Privacy guarantee

7. Experiment

Table 2. Experiment Parameters.

7.1 Experiment Setup

Datasets

Comparison

Metrics

7.2 Results on user level privacy

Absolute error vs. k

Figure 2. Absolute error vs. k.

DSFT and DSAT

Figure 3. Absolute error vs. threshold value T.

Absolute error vs. differential privacy

Figure 4. Accuracy vs. differential privacy budget.

Absolute error vs. update rate

Figure 5. Absolute error vs. update rate.

Absolute error vs. dimensionality

Figure 6. Absolute error vs. # of dimensions.

Query accuracy vs. query range size

Figure 7. Query accuracy vs. query range size.

7.3 Results on w-event privacy

Query accuracy vs. parameter w

Figure 8. Query accuracy vs. w.

Query accuracy vs. differential privacy

Figure 9. Query accuracy vs. differential privacy.

8. Conclusions

Acknowledgments

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Sensitivity analysis of L₁ Distance