Locally private frequency estimation of physical symptoms for infectious disease analysis in Internet of Medical Things

Xiaotong Wu; Mohammad Reza Khosravi; Lianyong Qi; Genlin Ji; Wanchun Dou; Xiaolong Xu

doi:10.1016/j.comcom.2020.08.015

2020 Aug 27;162:139–151. doi: 10.1016/j.comcom.2020.08.015

PMCID: PMC7450982 PMID: 32873996

Abstract

Frequency estimation of people's physical symptoms is the most direct way to analyze and predict infectious diseases. In the Internet of Medical Things (IoMT), it is efficient and convenient for users to report their physical symptoms to hospitals or disease prevention departments through various mobile devices. Unfortunately, this usually brings a risk of leaking these symptoms, since data receivers may be untrusted. As a strong metric for health privacy, local differential privacy (LDP) requires that users perturb their symptoms to prevent this risk. However, the widely used data structure for frequency estimation, called a sketch, does not satisfy this requirement. In this paper, we first define the problem of frequency estimation of physical symptoms under LDP. Then, we propose four different protocols, i.e., CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP, to solve the above problem. Next, we demonstrate that the designed protocols satisfy LDP and unbiased estimation. We also present two approaches to implement the key component of the protocols (i.e., universal hash functions). Finally, we conduct experiments to evaluate the four protocols on two real-world datasets, representing two different distributions of physical symptoms. The results show that CMS-LDP and CS-LDP achieve comparatively better utility for frequency estimation of physical symptoms in IoMT.

Keywords: Health privacy, Frequency estimation, Local differential privacy, Infectious disease analysis

1. Introduction

With the explosive development of the Internet of Medical Things (IoMT), various medical applications and services have appeared on a large number of mobile devices (e.g., smart phones and wearable devices) [1], [2], [3]. It is efficient and convenient for hospitals or disease prevention departments (i.e., the third party) to estimate the frequency of physical symptoms of mobile users through their devices so as to monitor and predict infectious diseases. For example, if disease prevention departments want to know the potential spread range and speed of coronavirus disease 2019 (COVID-19) [4], it is safe for them to remotely estimate, via smart phones, the frequency of typical symptoms among people, including fever, cough and shortness of breath. This estimation model reduces labor costs and improves detection efficiency. Although people benefit from the estimation model in IoMT, users inevitably face the risk that their disease information leaks, especially when the third party is untrusted [5], [6]. Once disease information is leaked, it may place a heavy psychological burden on both society and individuals.

In fact, health privacy has been one of the biggest concerns of not only individuals but also the whole society (e.g., European GDPR law) [7], [8], [9], [10], [11], [12]. Therefore, it is necessary to protect individuals’ health privacy when collecting and analyzing their symptoms. Differential privacy (DP) [13], [14] is a strong privacy metric in a central setting. It assumes that there is a trusted third party to collect and perturb symptoms from users and then share them with others. However, DP ignores that it is possible for the third party to threaten users’ health privacy, especially in IoMT. To this end, local differential privacy (LDP) [15] is proposed in the local setting. It requires that each mobile user should locally perturb his/her symptoms before sending to the third party and make them indistinguishable.

In recent years, a series of works in both academia and industry have designed protocols for frequency estimation of physical symptoms under LDP [16], [17], [18], [19]. In the existing privacy protocols, a physical symptom is treated as categorical data. A protocol for categorical data consists of three important components, i.e., encoding, perturbation and aggregation [20], [21], [22], [23], [24]. The first two components are operated by the user, while the last one is executed by the third party. Most perturbation mechanisms that guarantee privacy are based on the Randomized Response (RR) technique, in which the true answer to a Boolean query is reported with a certain probability [25]. Wang et al. [24] surveyed the previous protocols and classified encoding methods into four types, namely direct [26], histogram [22], unary [23], [27] and local hashing [20], [21], [24] encoding (i.e., DE, HE, UE and LHE). Protocols based on UE incur higher communication cost, while those based on DE and HE have relatively lower accuracy. As a result, protocols based on LHE are a promising solution to balance computation complexity and data utility.

However, the previous protocols do not make full use of the data structures that are widely applied for frequency estimation of physical symptoms. The sketch is one of the most fundamental and most efficient data structures for estimating the frequency of physical symptoms. In general, it uses a small vector and a certain number of hash functions to index the position of a symptom, which saves space but may suffer from hash collisions. There are some basic and classic sketches, such as Count Sketch (CS) [28], Count-Min Sketch (CMS) [29], Fast-AGMS Sketch (FAS) [30] and Fast-Count Sketch (FCS) [31]. The focus of this paper is on how to design sketch-based protocols for frequency estimation of physical symptoms that satisfy local differential privacy. With such protocols, others can directly apply them to estimate the frequency of physical symptoms.

There are two main challenges to solve the above problem. The first one is that the existing private protocols cannot be directly used to perturb query result by sketches. The reasons are that (i) perturbation and aggregation of symptoms are executed by the user and the third party, respectively. It implies that only the RR technique can be leveraged; and (ii) hash functions used in the sketches need to stay the same implying that the existing encoding methods (e.g., DE, HE and UE) are not applicable. The second one is that the designed private sketches should satisfy two important properties, including unbiased estimation and local differential privacy. In other words, it needs to balance privacy guarantee and utility of perturbed data.

To address the above challenges, we make full use of LHE and RR. In brief, we leverage the sketch vector to implement the encoding of physical symptoms and then perturb it by RR to achieve the privacy guarantee. First, we propose two CMS- and FCS-based protocols, namely CMS-LDP and FCS-LDP. Then, we propose two CS- and FAS-based protocols, namely CS-LDP and FAS-LDP. In particular, the former two use the perturbation mechanism of optimal local hashing (OLH) in [24], while the latter two directly use RR to perturb each entry of the sketch vector. Finally, we add further operations to make sure that the protocols satisfy unbiased estimation. Note that our protocols only increase computation cost rather than space overhead, which is consistent with the initial purpose of sketches, i.e., saving space. To the best of our knowledge, few works use sketches under LDP to estimate the frequency of symptoms. The contributions of this paper can be summarized as follows:

•
We formulate the problem of designing sketches under LDP by mathematical definitions and strong privacy metric to estimate the frequency of physical symptoms in IoMT.
•
We propose four different protocols for frequency estimation of symptoms with privacy protection, including CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP. We also demonstrate that all of the protocols satisfy two important properties, including unbiased estimation and LDP. We also analyze the variance of frequency estimation in four protocols.
•
We present two approaches to implement universal hash functions. We also evaluate and compare the proposed protocols on two real-world datasets, representing two different distributions of physical symptoms.

The rest of this paper is organized as follows. Section 2 surveys related work about privacy protection in IoMT. Section 3 introduces the preliminaries and gives problem definitions. Section 4 proposes four different protocols for sketches under LDP to estimate the frequency of symptoms. Section 5 analyzes unbiased estimation, privacy guarantee and variance. Section 6 presents two methods to implement universal hash functions. Section 7 evaluates proposed protocols on real datasets. Finally, Section 8 concludes this paper.

2. Related work

2.1. Health privacy protection in IoMT

In IoMT, vast amounts of users' health information are generated by various mobile devices every day. However, once the information leaves the mobile users, it may face leakage risk from untrusted data collectors or data adversaries [6]. In order to guarantee privacy, a number of protection techniques have been proposed, including identification, anonymity and differential privacy [5], [7], [14]. In detail, the identification technique applies an authentication mechanism to each mobile user, who must log in to the system with his/her identity. On the other hand, data collectors use perturbation techniques (e.g., $k$ -anonymity [32], $l$ -diversity [33] and DP [13]) to perturb raw data from mobile users. These techniques provide a certain degree of privacy protection in some specified scenarios.

However, these techniques are not fully suited to applications in IoMT due to their respective shortcomings. Identification and anonymity cannot offer enough privacy protection, while DP usually assumes that the third party is trusted, that is, that the third party does not illegally leak users' health information. Unfortunately, this assumption does not hold in some real environments of IoMT. The reason is that users in the local setting cannot know who receives their health information, so they do not trust the receivers. In order to overcome these disadvantages, local differential privacy [15] has been proposed to protect the private information of each user against the threat from the untrusted third party. Both academic researchers and industrial engineers focus on how to achieve local differential privacy in theory and in applications.

2.2. Local differential privacy

Different from DP, LDP imposes a stricter privacy requirement in the local setting: it bounds the probability ratio between any two input values, instead of between any two neighboring datasets as in DP [15]. As a result, the existing mechanisms of DP, including the Laplace mechanism [34] and the exponential mechanism [35], cannot be fully applied to LDP. In order to implement LDP, a number of works have designed proper mechanisms [16], [17], [18], [19]. Most of them leverage the Randomized Response (RR) mechanism [25] or its refinements. For a Boolean question, RR reports the true answer with a certain probability and the opposite answer otherwise. For categorical data, Kairouz et al. [36], [37] proposed an extremal privatization mechanism ( $k$ -RR), which is universally optimal in the low and high privacy regimes. For ordinal data, the value is first encoded as a binary one and then perturbed by RR [17], [38], [39]. In general, the proposed algorithms need to satisfy both unbiased estimation and LDP. In addition, these perturbation mechanisms are widely applied in various scenarios, including heavy hitter identification, itemset mining, marginal release, data mining, medical analysis and infectious disease prediction [40].

In the industry, there have been a lot of applications to satisfy local differential privacy. RAPPOR is the first product from Google to collect users’ data and ensure privacy [27]. The user firstly encodes his/her data by a special data structure called Bloom filter and then perturbs the encoded output by the randomized response mechanism. The third party collects and decodes perturbed data from the users to get statistical information. In 2017, Apple [41] published a white paper to propose efficient and scalable algorithms satisfying local differential privacy. Microsoft [38] also presented new algorithms with local differential privacy to estimate mean and histogram for telemetry data.

2.3. Frequency estimation under LDP for physical symptoms

Frequency estimation under LDP for physical symptoms is an efficient and convenient way to support infectious disease prediction. A few works have designed effective mechanisms for different types of health information, including categorical (e.g., sex and symptoms) and ordinal (e.g., age) data [36], [37], [40], [42]. The main objective of the existing mechanisms is to maximize utility while reducing communication cost and computation complexity.

Bassily et al. [21] proposed efficient mechanisms by utilizing Random Matrix Projection. The running time of the third party and the user is $O (n^{5 ∕ 2})$ and $O (n^{3 ∕ 2})$ , respectively. Meanwhile, the estimation error is $O (\sqrt{log (m) ∕ (ϵ^{2} n)})$ with high probability, where $n$ is the number of users and $m$ is the size of the domain. They also proposed improved algorithms, in which the server time is $O (n)$ and the user time is $O (1)$ [20]. Bun et al. [22] presented an algorithm called PrivateExpanderSketch, which achieves optimal worst-case error as a function of all possible parameters. Wang et al. [24] introduced a framework to analyze all the previous mechanisms for frequency estimation under LDP. They also proposed Optimal Local Hashing (OLH) to estimate the frequency with better utility. Different from the previous setting with a single attribute, Qin et al. [23] estimated frequency over set-valued data under LDP. However, to the best of our knowledge, few works combine sketches with LDP to estimate the frequency of physical symptoms in IoMT. Therefore, the focus of this paper is to solve this problem.

3. Preliminaries and problem definition

As aforementioned, the focus of the paper is on frequency estimation of physical symptoms in IoMT. There are two key roles, i.e., an untrusted third party and a large number of mobile users. The third party collects perturbed health information from the users. In the following section, we formulate the frequency estimation problem in the context of local differential privacy.

3.1. Preliminaries

Suppose that there is a set of $n$ users denoted by $U = \{u_{1}, u_{2}, \dots, u_{n}\}$ . Each user $u_{i}$ has a symptom $d_{i}$ . Then, $I = \{d_{1}, d_{2}, \dots, d_{n}\}$ is defined as the set of symptoms from users. The domain of elements in $I$ is $D$ with size $m$ . An untrusted third party plans to collect statistical information about users' symptoms. Assume that the untrusted third party knows the domain $D$ . Each element in $D$ is mapped to an integer $j \in [m] = \{1, 2, \dots, m\}$ . For example in Fig. 1, the domain $D$ consists of $m$ types of symptoms of COVID-19, such as fever, cough and shortness of breath. Each symptom is labeled as an integer (e.g., fever is labeled by $1$ ). Therefore, in the following sections, we have $d_{i} \in D = [m]$ .

Here, we focus on the frequency estimation of physical symptoms as follows:

Definition 1 Frequency Estimation of Symptoms [39] —

A query $f : D \to Z$ for frequency estimation of a symptom computes the number of users whose symptom equals $d \in D$ . Formally, for any $d \in D$ ,

$f (d) = | \{ d_{i} \mid \exists d_{i} = d \} | .$ (1)

Obviously, the range of $f (d)$ is $[0, n]$ .

However, in infectious disease prediction, it is neither necessary nor feasible to obtain the exact frequency of physical symptoms. For example, if a disease prevention department wants to estimate the speed or range of COVID-19 from the number of patients with specified symptoms (e.g., fever, cough or shortness of breath), there is no need for the department to get the exact count for each symptom. What is more, it is not realistic to spend a large amount of labor cost on exact counting. Thus, an approximate frequency estimation is defined as follows:

Definition 2 ( $ξ, δ$ )-Approximate Frequency Estimation of Symptoms [39] —

A query $\hat{f} : D \to Z$ is to solve the problem of ( $ξ, δ$ )-approximate frequency estimation of symptoms if and only if for any symptom $d \in D$ , $\hat{f} (d)$ satisfies the following requirement:

$Pr [| f (d) - \hat{f} (d) | \leq n ξ] \geq 1 - δ,$ (2)

where $ξ \in [0, 1]$ represents the tolerated relative error and $δ \in [0, 1]$ is the failure probability, so that $1 - δ$ is the confidence probability.
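Definitions 1 and 2 can be illustrated with a minimal sketch. The toy symptom list and the function names below are hypothetical and only serve to show what $f (d)$ and the event inside Eq. (2) mean.

```python
from collections import Counter

def exact_frequency(symptoms, d):
    """f(d) from Eq. (1): number of users whose symptom equals d."""
    return Counter(symptoms)[d]

def within_error(f_true, f_est, n, xi):
    """Event inside Eq. (2): |f(d) - f_hat(d)| <= n * xi."""
    return abs(f_true - f_est) <= n * xi

# Hypothetical labels: 1 = fever, 2 = cough, 3 = shortness of breath.
symptoms = [1, 2, 1, 3, 1, 2, 1, 1, 3, 2]
n = len(symptoms)
print(exact_frequency(symptoms, 1), within_error(5, 6, n, 0.1))
```

An estimator solves the ( $ξ, δ$ )-approximate problem when `within_error` holds with probability at least $1 - δ$ over the randomness of the estimator.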

Unfortunately, the third party may be untrusted because he/she may use the health information of users to gain illegal profit. Therefore, the users have to perturb their information so as to avoid privacy leakage. Differential privacy [13], [14] is the most common and strong privacy protection metric, and it is usually suited to the central setting. Different from DP, local differential privacy is proposed for the local setting, in which the third party is assumed to be untrusted. This means that the perturbation applied before the users send their symptoms should satisfy the following privacy measurement:

Definition 3 Local Differential Privacy [15] —

A sanitized algorithm $M$ satisfies $ϵ$ -local differential privacy if and only if for any pair of input values $d_{i}, d_{j} \in D$ and for any output $y \in Range (M)$ , we have

$Pr [M (d_{i}) = y] \leq exp (ϵ) \cdot Pr [M (d_{j}) = y],$ (3)

where $Range (M)$ denotes the set of all possible outputs of the algorithm $M$ and $ϵ$ is the privacy budget.
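As a small illustration of Definition 3, the sketch below checks Eq. (3) numerically for binary randomized response, assuming the common choice of reporting the true bit with probability $e^{ϵ} ∕ (1 + e^{ϵ})$ . The function names are hypothetical.

```python
import math

def rr_distribution(bit, eps):
    """Output distribution of binary randomized response on an input bit in {0, 1}."""
    p = math.exp(eps) / (1.0 + math.exp(eps))  # probability of reporting the true bit
    return {bit: p, 1 - bit: 1.0 - p}

def max_ratio(eps):
    """Largest Pr[M(d_i) = y] / Pr[M(d_j) = y] over all inputs and outputs."""
    dists = [rr_distribution(b, eps) for b in (0, 1)]
    return max(di[y] / dj[y] for di in dists for dj in dists for y in (0, 1))

print(max_ratio(1.0), math.exp(1.0))  # the two values coincide up to rounding
```

The worst-case ratio equals $exp (ϵ)$ , so this mechanism meets the bound in Eq. (3) with equality.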

$ϵ$ represents the strength of privacy protection. A lower value of $ϵ$ corresponds to stronger privacy protection. Like DP, LDP also has the following composition properties.

Lemma 1 Sequential Composition [43] —

Assume that each privacy algorithm $M_{i}$ satisfies $ϵ_{i}$ -local differential privacy. A group of $M_{i} (\cdot)$ applied to the same dataset satisfy $(\sum_{i} ϵ_{i})$ -local differential privacy.

Lemma 2 Parallel Composition [43] —

Assume that each privacy algorithm $M_{i}$ satisfies $ϵ_{i}$ -local differential privacy. A group of $M_{i} (\cdot)$ applied to a set of disjoint datasets satisfies $max \{ϵ_{i}\}$ -local differential privacy.

In the algorithms designed by the paper, we will make full use of the composition properties of local differential privacy.

3.2. Private framework

In order to solve the frequency estimation problem under LDP for symptoms, the proposed mechanism should satisfy the privacy measurement as follows:

Definition 4 Frequency Estimation of Symptoms under LDP —

A sanitized protocol $M : D^{n} \to Z$ for frequency estimation of symptoms satisfies $ϵ$ -local differential privacy if and only if $M = A (M_{1} (\cdot), M_{2} (\cdot), \dots, M_{n} (\cdot))$ , in which each $M_{i} : D \to Y$ is a local random processor that satisfies $ϵ$ -local differential privacy, and $A : Y^{n} \to Z$ is some post-processing function.

Based on the above definition, the key functions are the local random processor $M_{i} (\cdot)$ and the post-processing function $A (\cdot)$ . Each user executes the local random processor, while the third party executes the post-processing function. Furthermore, Wang et al. [24] surveyed the previous work and concluded that a protocol for frequency estimation under LDP generally consists of the following steps:

•
Encoding. This operation $Encode (\cdot)$ is executed by the user. In detail, input $d$ of the user is generally encoded as a histogram, a vector [27], a single bit [21], or an integer [24] by encoding and decoding algorithms or hash functions.
•
Perturbing. The mechanism $Perturb(⋅)$ is also executed by the user. Meanwhile, the perturbed mechanism should satisfy $ϵ$ -local differential privacy. The input of $Perturb (\cdot)$ is the output of $Encode (\cdot)$ . Therefore, $M_{i} (\cdot)$ in Definition 4 is $Perturb (Encode (\cdot))$ .
•
Aggregating. The operation $Aggregate (\cdot)$ is executed by the third party. At first, the third party receives data from users, which is perturbed with the combination of $Encode (\cdot)$ and $Perturb (\cdot)$ . Then, he/she aggregates the data and extracts approximate frequency estimation.
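The three steps above can be sketched end to end. The example below is not one of the sketch-based protocols of this paper; it assumes direct encoding with $m$ -ary randomized response and the standard debiasing of the raw counts, only to show how $Encode (\cdot)$ , $Perturb (\cdot)$ and $Aggregate (\cdot)$ fit together. All function names are hypothetical.

```python
import math
import random

def encode(d):
    """Direct encoding: the symptom label itself (an integer in 1..m)."""
    return d

def perturb(x, m, eps, rng):
    """m-ary randomized response: keep x with probability p, else a uniform other label."""
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    if rng.random() < p:
        return x
    other = rng.randrange(1, m)            # index among the m - 1 other labels
    return other if other < x else other + 1

def aggregate(reports, m, eps):
    """Debias raw counts so that the expected estimate equals the true frequency."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    q = 1.0 / (math.exp(eps) + m - 1)
    counts = [0] * (m + 1)
    for y in reports:
        counts[y] += 1
    return {d: (counts[d] - n * q) / (p - q) for d in range(1, m + 1)}

rng = random.Random(0)
data = [1] * 600 + [2] * 300 + [3] * 100   # hypothetical symptom labels
reports = [perturb(encode(d), 3, 2.0, rng) for d in data]
print(aggregate(reports, 3, 2.0))
```

The users run `encode` and `perturb` locally, and the third party only ever sees `reports`.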

Fig. 2 shows the concrete operations and interactions between mobile users and the third party. It is obvious to see that users need to encode and perturb their symptoms locally, while the third party aggregates perturbed data. As a result, these operations increase extra overhead for users, including computation, communication and storage. To this end, it is necessary to make full use of the existing data structure named sketch to decrease the overhead.

3.3. Problem formulation

In order to solve the approximate frequency estimation of categorical data (e.g., physical symptoms), Charikar et al. [28] first proposed an algorithm using a special data structure, i.e., Count Sketch, which achieves good space bounds. After that, refined versions were proposed, such as Fast-AGMS Sketch [30], Count-Min Sketch [29] and Fast-Count Sketch [31]. However, to the best of our knowledge, few works design sketches under LDP to solve the approximate frequency estimation of physical symptoms. In the following sections, we propose improved protocols based on Count Sketch, Fast-AGMS Sketch, Count-Min Sketch and Fast-Count Sketch, which satisfy both unbiased estimation, i.e., $\forall d \in D, E [\hat{f} (d)] = f (d)$ , and local differential privacy. In addition, we compare the proposed protocols by experiments.

4. Protocols for frequency estimation of symptoms under LDP

In this section, we propose four novel protocols to solve the approximate frequency estimation of physical symptoms under LDP. These protocols are based on CS, FAS, CMS and FCS. We will analyze unbiased estimation and privacy protection of proposed protocols in the next section.

Under no privacy preservation, the third party receives physical symptoms and uses a small vector to count the frequency of each element in $D$ , while users do nothing. However, in order to protect privacy, the user should send perturbed symptoms rather than raw values. Therefore, the user needs to generate a small vector locally and then send it to the third party. Finally, the third party aggregates these vectors and derives the full sketch.

4.1. Protocol CMS-LDP

CMS is a simple and effective data structure proposed by Cormode et al. [29]. For a frequency estimation query $\hat{f} (d)$ , the value returned by CMS lies in $[f (d), f (d) + ξ {‖ f_{- d} ‖}_{1}]$ with probability $1 - δ$ , in which $f_{- d}$ denotes the real frequencies of all symptoms except $d$ . Here, we propose a corresponding protocol based on optimal local hashing in [24] to implement CMS under local differential privacy.

Algorithm 1 shows protocol CMS-LDP for private frequency estimation of symptoms under LDP. The protocol consists of four key components. The first one is an initial process, in which $b$ and $t$ are computed from the input parameters $ξ$ and $δ$ (lines $1 \sim 3$ ). $b$ is the size of the output range of the universal hash functions, while $t$ is the number of hash functions. Then, each user calls $M_{client-CMS-LDP}$ to report the perturbed record to the server-side (lines $4 \sim 6$ ). Next, line $7$ implements the key component, namely Sketch-CMS-LDP. The last one implements the query on the server-side (lines $8 \sim 10$ ). With these four parts, the algorithm solves the approximate frequency estimation problem of physical symptoms.

Each user executes Algorithm 2 to implement the function $Perturb (Encode (\cdot))$ in Section 3.2. Algorithm 2 uses an existing technique designed by Kairouz et al. [37], namely $b$ -Randomized Response ( $b$ -RR), to ensure $ϵ$ -local differential privacy. In detail, Algorithm 2 first implements $Encode (\cdot)$ by using the universal hash function $h_{i}$ to map the input $d$ into a $b$ -length bit array in line $4$ . Then, the $b$ -RR technique is used to perturb the output of $Encode (\cdot)$ in line $5$ . Finally, the user returns the result to the server-side.
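The client-side step can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 2: it represents the encoded value by its bucket index instead of a $b$ -length bit array, and it assumes that the budget is split evenly as $ϵ^{'} = ϵ ∕ t$ over the $t$ hash functions. The names `b_rr` and `client_cms_ldp` and the toy hash functions are hypothetical.

```python
import math
import random

def b_rr(x, b, eps_prime, rng):
    """b-ary randomized response on a bucket index x in 0..b-1."""
    p = math.exp(eps_prime) / (math.exp(eps_prime) + b - 1)
    if rng.random() < p:
        return x                          # report the true bucket
    other = rng.randrange(b - 1)          # uniform over the b - 1 other buckets
    return other if other < x else other + 1

def client_cms_ldp(d, hashes, b, eps, rng):
    """Encode d with each hash function, then perturb each bucket index."""
    eps_prime = eps / len(hashes)         # assumption: even split of the budget
    return [b_rr(h(d), b, eps_prime, rng) for h in hashes]

# Toy hash functions for illustration only (not universal).
hashes = [lambda d: d % 4, lambda d: (3 * d + 1) % 4]
print(client_cms_ldp(7, hashes, 4, 2.0, random.Random(0)))
```

Each user sends one perturbed index per hash function, which the server can expand into rows of the sketch.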

After the third party receives perturbed symptoms from users, he/she computes a data structure, namely Sketch-CMS-LDP, by Algorithm 3. Compared to CMS, our proposed mechanism $M_{Sketch-CMS-LDP}$ has extra operations, i.e., lines $7 \sim 9$ . Sketch-CMS-LDP first generates an initial sketch, which is a multidimensional array, in line $1$ . Then, all arrays from the $n$ users are added to the sketch in lines $2 \sim 6$ , which is the same as in CMS. Finally, lines $7 \sim 9$ ensure unbiased estimation of CMS-LDP, which will be proved in Section 5.1.
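A possible shape of the sketch construction is given below. The exact correction in lines $7 \sim 9$ of Algorithm 3 is not reproduced in the text, so the sketch assumes the standard debiasing for $b$ -RR with keep probability $p = e^{ϵ^{'}} ∕ (e^{ϵ^{'}} + b - 1)$ and $q = 1 ∕ (e^{ϵ^{'}} + b - 1)$ . The function name and the report format (one bucket index per hash row) are hypothetical.

```python
import math

def build_cms_sketch(reports, t, b, eps_prime):
    """reports[u][i] is user u's perturbed bucket index for hash row i."""
    n = len(reports)
    p = math.exp(eps_prime) / (math.exp(eps_prime) + b - 1)
    q = 1.0 / (math.exp(eps_prime) + b - 1)
    V = [[0.0] * b for _ in range(t)]
    for report in reports:                 # accumulate, as in plain CMS
        for i, y in enumerate(report):
            V[i][y] += 1
    for i in range(t):                     # assumed correction step
        for j in range(b):
            V[i][j] = (V[i][j] - n * q) / (p - q)
    return V

V = build_cms_sketch([[0, 1], [0, 2], [1, 1]], t=2, b=3, eps_prime=1.0)
print(V)
```

Under this assumed correction, each row of the sketch still sums to the number of users $n$ .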

Algorithm 4, which is executed by the third party, returns the approximate estimation for each frequency query $\hat{f}$ . The combination of Algorithms 3 and 4 implements the function $Aggregate(⋅)$ in Section 3.2. For each query $\hat{f} (d)$ , the algorithm chooses the minimum of $t$ values, each of which is taken from an array $V [i], i \in [t]$ . The position of each value in the array is decided by the hash function $h_{i}$ and $d$ . With Algorithms 1 to 4, we solve the problem of frequency estimation under local differential privacy.
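The query step described above amounts to a minimum over the $t$ rows. The following minimal sketch assumes the sketch is stored as a list of rows and that `hashes[i]` plays the role of $h_{i}$ ; the toy values are hypothetical.

```python
def query_cms(V, hashes, d):
    """Return the minimum over rows i of V[i][h_i(d)]."""
    return min(V[i][h(d)] for i, h in enumerate(hashes))

# Toy sketch with t = 2 rows and b = 3 buckets, and toy hash functions.
V = [[5.0, 1.0, 0.0],
     [2.0, 7.0, 3.0]]
hashes = [lambda d: d % 3, lambda d: (2 * d + 1) % 3]
print(query_cms(V, hashes, 4))
```

For symptom label 4, the first row contributes `V[0][1]` and the second row contributes `V[1][0]`, and the smaller of the two is returned.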

For each user in the client-side, the time complexity is $O ({log}_{2} (1 ∕ δ))$ and the space complexity in $Perturb (Encode (\cdot))$ is $O (\frac{1}{ξ} {log}_{2} (1 ∕ δ))$ . For $M_{server-CMS-LDP}$ in the server-side, the time complexity is $O (\frac{1}{ξ} {log}_{2} (1 ∕ δ))$ and the space complexity is $O (\frac{1}{ξ} {log}_{2} (1 ∕ δ))$ . Besides, the communication overhead between each user and the server is $O (\frac{1}{ξ} {log}_{2} (1 ∕ δ))$ . Thus, for the third party, our proposed protocol does not increase the space overhead. Section 5.1 will analyze the proposed algorithms in terms of the error bound, unbiased estimation and local differential privacy.
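To make the cost figures concrete, the sketch below derives the sketch dimensions from $ξ$ and $δ$ . It assumes $b = ⌈ 1 ∕ ξ ⌉$ and $t = ⌈ {log}_{2} (1 ∕ δ) ⌉$ , which match the stated complexities up to constants; the exact constants of Algorithm 1 are not given in the text, and the function name is hypothetical.

```python
import math

def cms_parameters(xi, delta):
    """Return (b, t, cells): sketch width, number of hash rows, total cells."""
    b = math.ceil(1.0 / xi)                 # assumed width
    t = math.ceil(math.log2(1.0 / delta))   # assumed number of hash functions
    return b, t, b * t

print(cms_parameters(0.125, 0.0625))
```

The product $b \cdot t$ corresponds to the $O (\frac{1}{ξ} {log}_{2} (1 ∕ δ))$ space and communication cost stated above.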

4.2. Protocol FCS-LDP

Thorup et al. [31] proposed a refined version of Count-Min Sketch, namely Fast-Count Sketch (FCS). Compared with CMS, FCS uses a family of $4$ -universal hash functions instead of $2$ -universal hash functions.
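One common way to obtain $k$ -universal hash functions is to evaluate a random polynomial of degree $k - 1$ over a prime field and reduce the result to $b$ buckets. The sketch below illustrates this construction for $k = 2$ and $k = 4$ ; it is an assumption for illustration and may differ from the implementations discussed in Section 6. The final reduction modulo $b$ makes the family only approximately universal.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, far larger than any symptom label

def make_k_universal(k, b, rng):
    """Return h(x) = (random degree-(k-1) polynomial in x mod PRIME) mod b."""
    coeffs = [rng.randrange(1, PRIME)] + [rng.randrange(PRIME) for _ in range(k - 1)]
    def h(x):
        acc = 0
        for c in coeffs:                   # Horner evaluation
            acc = (acc * x + c) % PRIME
        return acc % b
    return h

rng = random.Random(1)
h2 = make_k_universal(2, 16, rng)          # as used by CMS
h4 = make_k_universal(4, 16, rng)          # as used by FCS
print(h2(5), h4(5))
```

Switching from CMS-LDP to FCS-LDP then only changes the value of `k` used to draw the hash functions.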

The protocol that implements Fast-Count Sketch under local differential privacy, namely Algorithm $FCS-LDP$ , is similar to Algorithm 1. $FCS-LDP$ also consists of three important components, i.e., $M_{client-FCS-LDP}$ , $M_{Sketch-FCS-LDP}$ and $M_{server-FCS-LDP}$ . Different from $CMS-LDP$ , $FCS-LDP$ replaces the $2$ -universal hash functions with $4$ -universal hash functions in line $3$ of Algorithm 1. Besides, $M_{Sketch-FCS-LDP}$ , $M_{client-FCS-LDP}$ and $M_{server-FCS-LDP}$ perform the same operations as $M_{Sketch-CMS-LDP}$ , $M_{client-CMS-LDP}$ and $M_{server-CMS-LDP}$ , respectively, so we do not repeat these algorithms here. Meanwhile, these algorithms have the same time and space complexity as Algorithms 1 to 4 for each user and the third party.

4.3. Protocol CS-LDP

Among the various forms of the sketch data structure, Count Sketch was the first to be proposed, in 2002 [28]. The data structure is generally used for point queries, range queries, inner product queries and so on. Compared with Count-Min Sketch, CS needs higher space cost, i.e., $O ({log}_{2} (1 ∕ δ) ∕ ξ^{2})$ . Meanwhile, CS also satisfies unbiased estimation, and the value for a frequency estimation query $\hat{f} (d)$ about $d$ lies in $[f (d) - ξ {‖ f_{- d} ‖}_{2}, f (d) + ξ {‖ f_{- d} ‖}_{2}]$ with a confidence probability $1 - δ$ . In the following, we introduce algorithms to implement CS with LDP.

In Algorithm 5, we give an overview of protocol CS-LDP, which implements CS under LDP. It consists of four key parts, i.e., initialization, perturbation on the client-side, construction of the sketch and frequency estimation on the server-side. The initial process (lines $1 \sim 4$ ) requires $t$ hash function pairs $〈 h_{i}, g_{i} 〉$ . Hash function $h_{i}$ maps a symptom into a $⌈ c ∕ ξ^{2} ⌉$ -length bit array, in which $c$ is some constant, while function $g_{i}$ maps an input value into $\{- 1, + 1\}$ . The two classes of hash functions are independent of each other. Compared with the $⌈ 1 ∕ ξ ⌉$ -length bit array of hash function $h_{i}$ in Algorithm 1, CS-LDP needs more space. Then, each user executes $M_{client-CS-LDP}$ to perturb his/her data (lines $5 \sim 7$ ), which implements the function $Perturb(Encode(⋅))$ . Next, the third party constructs a special data structure, namely Sketch-CS-LDP, to collect data from users by $M_{Sketch-CS-LDP}$ (line $8$ ). Finally, for each frequency estimation $\hat{f} (d)$ , the third party calls $M_{server-CS-LDP}$ to compute the result (lines $9 \sim 11$ ).

Before each user sends his/her symptom to the third party, Algorithm 6 is executed. $M_{client-CS-LDP}$ first initializes a multidimensional integer array $V$ , i.e., $\{- 1\}^{t \times b}$ (line $1$ ). Entries of the array take only two values, i.e., $- 1$ and $+ 1$ . Then, because of the sequential composition property of local differential privacy, it computes a smaller privacy parameter $ϵ^{'}$ for each hash function (line 2). Next, the position $h_{i} (d)$ of each row in $V$ is set to $g_{i} (d)$ (lines $3 \sim 5$ ), which implements the function $Encode (\cdot)$ . For each entry in array $V$ , it uses Randomized Response (RR) [25] to ensure $ϵ^{'}$ -local differential privacy (lines $6 \sim 10$ ), which implements the function $Perturb (\cdot)$ . At last, the result is returned to the third party.
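The client-side steps of CS-LDP can be sketched as follows. The formula for $ϵ^{'}$ in line 2 of Algorithm 6 is not reproduced in the text, so the sketch uses $ϵ^{'} = ϵ ∕ t$ purely as a placeholder assumption, together with the usual keep probability $e^{ϵ^{'}} ∕ (1 + e^{ϵ^{'}})$ for binary RR. The function name and the toy hash and sign functions are hypothetical.

```python
import math
import random

def client_cs_ldp(d, hashes, signs, b, eps, rng):
    """Encode d into a t x b array over {-1, +1}, then flip each entry by RR."""
    t = len(hashes)
    V = [[-1] * b for _ in range(t)]       # line 1: all entries start at -1
    eps_prime = eps / t                    # placeholder assumption for line 2
    for i in range(t):                     # Encode: position h_i(d) holds g_i(d)
        V[i][hashes[i](d)] = signs[i](d)
    keep = math.exp(eps_prime) / (1.0 + math.exp(eps_prime))
    for i in range(t):                     # Perturb: RR on every entry
        for j in range(b):
            if rng.random() >= keep:
                V[i][j] = -V[i][j]
    return V

# Toy hash and sign functions for illustration only.
hashes = [lambda d: d % 4, lambda d: (3 * d + 1) % 4]
signs = [lambda d: 1, lambda d: -1]
print(client_cs_ldp(7, hashes, signs, 4, 1.0, random.Random(0)))
```

Because every one of the $t \times b$ entries is sampled, the client cost grows with the full sketch size, in line with the time complexity reported for $M_{client-CS-LDP}$ .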

After the third party receives perturbed symptoms from users, he/she uses Algorithm 7 to build the data structure Sketch-CS-LDP. First, $M_{Sketch-CS-LDP}$ initializes a $t \times b$ integer array, i.e., $\{0\}^{t \times b}$ (line $1$ ). Then, line 2 computes a parameter $c_{ϵ}$ , which is used to ensure unbiased estimation under randomized response. Since the data produced by the perturbation mechanism $M_{client-CS-LDP}$ is biased, it needs to be calibrated. Therefore, Algorithm 7 adjusts the data from each user (lines $3 \sim 5$ ) to get the estimated frequency of each entry in each array $V_{i}$ . Finally, all adjusted arrays are added to the sketch (lines $6 \sim 9$ ). In Section 5.2, we will prove unbiased estimation and local differential privacy of our proposed protocol CS-LDP.
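The calibration step can be illustrated under one assumption: for binary RR with keep probability $p = e^{ϵ^{'}} ∕ (1 + e^{ϵ^{'}})$ , a reported entry $y$ has expectation $(2 p - 1) v$ for true value $v$ , so scaling by $c_{ϵ} = (e^{ϵ^{'}} + 1) ∕ (e^{ϵ^{'}} - 1)$ removes the bias. Whether Algorithm 7 uses exactly this constant is not shown in the text; the function name is hypothetical.

```python
import math

def build_cs_sketch(reports, eps_prime):
    """Sum the calibrated t x b arrays reported by all users."""
    t, b = len(reports[0]), len(reports[0][0])
    c_eps = (math.exp(eps_prime) + 1.0) / (math.exp(eps_prime) - 1.0)  # assumed constant
    S = [[0.0] * b for _ in range(t)]
    for V in reports:
        for i in range(t):
            for j in range(b):
                S[i][j] += c_eps * V[i][j]   # calibrate, then accumulate
    return S

reports = [[[1, -1], [1, 1]],
           [[1, 1], [-1, 1]]]
print(build_cs_sketch(reports, math.log(3.0)))
```

With $ϵ^{'} = \ln 3$ the constant is $2$ , so two agreeing reports contribute $4$ to an entry and two disagreeing reports cancel.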

Algorithm 8 is used by the third party to answer the frequency estimation $\hat{f} (d)$ . The combination of Algorithms 7 and 8 implements the function $Aggregate (\cdot)$ . We use function $median (\cdot)$ , rather than function $mean (\cdot)$ , to choose a proper value from the candidates as the final estimation for $d$ . Charikar et al. [28] have demonstrated that function $mean (\cdot)$ is sensitive to outliers, while $median (\cdot)$ is robust. With Algorithms 5 to 8, we solve the problem of frequency estimation with local differential privacy.
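The robustness of the median can be seen in a small sketch of the standard Count Sketch query, which takes the median over rows of $g_{i} (d)$ times the entry at position $h_{i} (d)$ . Algorithm 8 may contain further correction terms that are not visible in the text, so this is only an illustration with hypothetical names and toy values.

```python
import statistics

def query_cs(S, hashes, signs, d):
    """Median over rows i of g_i(d) * S[i][h_i(d)]."""
    return statistics.median(signs[i](d) * S[i][hashes[i](d)] for i in range(len(S)))

# Toy sketch with t = 3 rows; the third row contains an outlier caused by a collision.
S = [[4.0, 0.0],
     [0.0, -6.0],
     [100.0, 0.0]]
hashes = [lambda d: 0, lambda d: 1, lambda d: 0]
signs = [lambda d: 1, lambda d: -1, lambda d: 1]
candidates = [4.0, 6.0, 100.0]
print(query_cs(S, hashes, signs, 1), statistics.mean(candidates))
```

The median returns 6.0, while the mean of the same three candidates is pulled above 36 by the single outlier.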

Compared with CMS-LDP, CS-LDP needs higher time and space cost, and has a sharper bound for the frequency estimation. For each user in $M_{client-CS-LDP}$ , the time cost is $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ due to sampling each entry of the matrix by Randomized Response, while the space cost is $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ . For the third party, the cost of time and space is $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ and $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ in $M_{Sketch-CS-LDP}$ , respectively. The cost of time and space is $O ({log}_{2} (1 ∕ δ))$ and $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ in $M_{server-CS-LDP}$ , respectively. Therefore, the total time cost for the third party is $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ , while the total space cost is $O (\frac{1}{ξ^{2}} {log}_{2} (1 ∕ δ))$ .

4.4. Protocol FAS-LDP

Fast-AGMS Sketch (FAS) is a refined version of Count Sketch [30], which guarantees logarithmic-time sketch update and tracking costs. The only difference between Count Sketch and Fast-AGMS Sketch is that the latter replaces $2$ -universal hash functions $G$ of the former with $4$ -universal hash functions.

Protocol $FAS-LDP$ , which achieves Fast-AGMS Sketch under local differential privacy, is similar to CS-LDP. The only change is that the set of hash functions $G$ is chosen independently from a $4$ -universal family of hash functions mapping $D$ to $\{- 1, + 1\}$ . The other mechanisms on the client side and server side, i.e., $M_{client-FAS-LDP}$ , $M_{Sketch-FAS-LDP}$ and $M_{server-FAS-LDP}$ , are the same as $M_{client-CS-LDP}$ , $M_{Sketch-CS-LDP}$ and $M_{server-CS-LDP}$ , respectively. As a result, these algorithms have the same time and space costs as Algorithms 5 $\sim$ 8 for each user and the third party.

5. Theory analysis of protocols

The protocols that implement the approximate frequency estimation of physical symptoms under LDP must satisfy two fundamental requirements, i.e., unbiased estimation and local differential privacy. In this section, we demonstrate that our proposed protocols satisfy both requirements. Besides, we discuss the variance of the protocols.

5.1. CMS-LDP and FCS-LDP

For the proposed protocols, the most important property is to satisfy local differential privacy. We prove that protocol CMS-LDP satisfies the requirement as follows:

Theorem 3

Protocol CMS-LDP satisfies $ϵ$ -local differential privacy.

Proof

Each user utilizes mechanism $M_{client−CMS−LDP}$ in CMS-LDP to encode and perturb his/her symptom $d$ . According to Eq. (4), for two different values $d_{i}$ and $d_{j}$ , the following inequality holds:

$\frac{Pr [Perturb (Encode (d_{i})) = y]}{Pr [Perturb (Encode (d_{j})) = y]} \leq \frac{e^{ϵ^{'}}}{e^{ϵ^{'}} + b - 1} ∕ \frac{1}{e^{ϵ^{'}} + b - 1} = e^{ϵ^{'}}$

Therefore, for each hash function, perturbation satisfies $ϵ^{'}$ -local differential privacy. Because there are $t$ hash functions for the same symptom $d$ , the combination of the $t$ hash functions satisfies $ϵ$ -local differential privacy for each user according to Lemma 1. On the other hand, the third party receives perturbed symptoms from all users. Since records of different users are disjoint, all perturbed symptoms together satisfy $ϵ$ -differential privacy according to Lemma 2. ■
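The bound in the proof of Theorem 3 can be checked numerically with a short Python sketch. It assumes that the per-hash budget is $ϵ^{'} = ϵ ∕ t$ , which is what sequential composition over $t$ hash functions requires; the function name `grr_probs` and the concrete values of $ϵ$ , $t$ and $b$ are illustrative only.

```python
import math

def grr_probs(eps_prime, b):
    """Perturbation over b hash values: the true value is kept with
    probability p, and each other value is reported with probability q."""
    denom = math.exp(eps_prime) + b - 1
    return math.exp(eps_prime) / denom, 1.0 / denom

eps, t, b = 3.0, 4, 32      # illustrative values
eps_prime = eps / t         # assumed per-hash budget
p, q = grr_probs(eps_prime, b)

worst_ratio = p / q         # equals e^{eps'}: the per-hash LDP bound
total_budget = t * eps_prime
```

The worst-case ratio of output probabilities for one hash function is exactly $e^{ϵ^{'}}$ , and the $t$ per-hash budgets add up to $ϵ$ .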

Theorem 4

For each $2$ -universal hash function $h_{i}$ in protocol CMS-LDP, it satisfies unbiased estimation for frequency estimation of symptoms after $Encode (\cdot)$ . Meanwhile, its variance is $\frac{(b - 1 - b p)}{b p - 1} \cdot f + \frac{n (b - 1)}{{(b p - 1)}^{2}}$ , in which $p = \frac{e^{ϵ^{'}}}{e^{ϵ^{'}} + b - 1}$ .

Proof

For each hash function $h_{i}$ , the third party receives $n$ reported symptoms from all users. Let $y$ be any possible output and $S$ be the set of symptoms that are hashed to $y$ by $Encode (\cdot)$ . Then, the probability that any symptom in $S$ is perturbed to $y$ is $p$ . That is, for any $s_{i} \in S$ , we have

$p = Pr [Perturb (Encode (s_{i})) = y] = \frac{e^{ϵ^{'}}}{e^{ϵ^{'}} + b - 1},$

and

$q = Pr [Perturb (Encode (s_{i})) \neq y] = \frac{1}{e^{ϵ^{'}} + b - 1} .$

For another input value $d_{j} \notin S$ , it is still possible that it is mapped to $y$ . Let $q^{'}$ be the probability that $d_{j}$ is mapped to $y$ . In line $10$ of mechanism $M_{Sketch-CMS-LDP}$ , the third party performs extra operations on the raw records. Thus, the probability $q^{'}$ is computed as follows:

$q^{'} = \frac{1}{b} \cdot p + \frac{b - 1}{b} \cdot q = \frac{1}{b} \cdot \frac{e^{ϵ^{'}}}{e^{ϵ^{'}} + b - 1} + \frac{b - 1}{b} \cdot \frac{1}{e^{ϵ^{'}} + b - 1} = \frac{1}{b}$

In mechanism $M_{Sketch-CMS-LDP}$ , we get the final sketch by line $11$ . The expected number of elements in $D$ that are mapped to $y$ is

$E [\hat{f}] = E [\frac{f - n \cdot q^{'}}{p - q^{'}}] = \frac{E [f] - n \cdot q^{'}}{p - q^{'}} = \frac{p \cdot f + (n - f) \cdot q^{'} - n \cdot q^{'}}{p - q^{'}} = f$

where $f$ is the number of elements in $S$ . Therefore, for each $2$ -universal hash function, protocol CMS-LDP satisfies unbiased estimation. The variance is

$Var [\hat{f}] = Var [\frac{f - n \cdot q^{'}}{p - q^{'}}] = \frac{Var [f]}{{(p - q^{'})}^{2}} = \frac{f \cdot p (1 - p) + (n - f) \cdot q^{'} (1 - q^{'})}{{(p - q^{'})}^{2}} = \frac{f b^{2} p (1 - p) + (n - f) (b - 1)}{{(b p - 1)}^{2}} = \frac{(b - 1 - b p)}{b p - 1} \cdot f + \frac{n (b - 1)}{{(b p - 1)}^{2}} ■$

By the above proofs, CMS-LDP satisfies unbiased estimation and LDP. Since protocol FCS-LDP is similar to CMS-LDP except for its hash functions, FCS-LDP also satisfies unbiased estimation and local differential privacy. Besides, it has the same variance. The corresponding proof is omitted.
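The expectation and variance in Theorem 4 can be evaluated directly. The Python sketch below plugs numbers into the expressions of the proof and compares the intermediate variance with the closed form stated in the theorem; the function name `cms_ldp_moments` and the sample values are illustrative assumptions, not part of the protocol.

```python
import math

def cms_ldp_moments(f, n, b, eps_prime):
    """Mean and variance of the calibrated estimate for one hash function."""
    p = math.exp(eps_prime) / (math.exp(eps_prime) + b - 1)
    q1 = 1.0 / b                                   # q' from the proof
    expected_count = p * f + (n - f) * q1          # expected raw count at y
    mean_est = (expected_count - n * q1) / (p - q1)
    # Variance before simplification (middle of the derivation).
    var_direct = (f * p * (1 - p) + (n - f) * q1 * (1 - q1)) / (p - q1) ** 2
    # Closed form stated in Theorem 4.
    var_closed = ((b - 1 - b * p) / (b * p - 1) * f
                  + n * (b - 1) / (b * p - 1) ** 2)
    return mean_est, var_direct, var_closed

mean_est, var_direct, var_closed = cms_ldp_moments(f=200, n=1000, b=32, eps_prime=0.75)
```

For these inputs the calibrated mean equals the true count $f$ , and both variance expressions agree up to floating-point error.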

5.2. CS-LDP and FAS-LDP

Here, we firstly prove that protocol CS-LDP satisfies local differential privacy as follows:

Theorem 5

Protocol CS-LDP satisfies $ϵ$ -local differential privacy.

Proof

Assume that there are two users $u_{i}$ and $u_{j}$ . They have different input values $d_{i}$ and $d_{j}$ and the same output $O \in \{- 1, + 1\}^{b}$ . For each pair $(h_{i}, g_{i})$ of hash functions, the third party receives two perturbed arrays $υ$ and $ω$ from these two users. Because all hash functions $h_{i}$ and $g_{i}$ are independent of each other, we have

$\frac{Pr [M_{Client-CS-LDP} (d_{i}) = O]}{Pr [M_{Client-CS-LDP} (d_{j}) = O]} = \frac{\prod_{κ = 1}^{b} Pr [υ_{κ} = O_{κ}]}{\prod_{κ = 1}^{b} Pr [ω_{κ} = O_{κ}]} = \prod_{κ = 1}^{b} \frac{Pr [υ_{κ} = O_{κ}]}{Pr [ω_{κ} = O_{κ}]} ▿ = \frac{Pr [υ_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}] Pr [υ_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]}{Pr [ω_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}] Pr [ω_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]} ▵$

The reason for $▿ ⟹ ▵$ is as follows. In detail, $υ = \{υ_{1}, υ_{2}, \dots, υ_{b}\}$ and $ω = \{ω_{1}, ω_{2}, \dots, ω_{b}\}$ . Each entry $υ_{κ}$ in $υ$ is derived from $\{- 1 \cdot y, + 1 \cdot y\}$ , in which $y \in \{- 1, + 1\}$ is sampled from Eq. , and only one entry is $+ 1$ , at the position decided by hash value $h_{i} (d_{i})$ . $ω$ is similar to $υ$ . In Eq. $▿$ , only the two entries at positions $h_{i} (d_{i})$ and $h_{i} (d_{j})$ can have different probabilities in $υ$ and $ω$ , while all other entries have the same probability and cancel out. Therefore, Eq. $▿$ reduces to Eq. $▵$ .

For Eq. $▵$ , there are two different cases, i.e., $h_{i} (d_{i}) = h_{i} (d_{j})$ and $h_{i} (d_{i}) \neq h_{i} (d_{j})$ . For the first case in which $h_{i} (d_{i}) = h_{i} (d_{j})$ , we have

$▵ = \frac{Pr [υ_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]}{Pr [ω_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]} = \frac{Pr [υ_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]}{Pr [υ_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]} = 1$

For the second case in which $h_{i} (d_{i}) \neq h_{i} (d_{j})$ , we have

$▵ = \frac{Pr [υ_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]}{Pr [ω_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]} \cdot \frac{Pr [υ_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]}{Pr [ω_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]} .$

Assume that $g_{i} (d_{i}) = + 1$ and $g_{i} (d_{j}) = + 1$ . Then, $υ_{h_{i} (d_{i})} = + 1 \cdot y_{1}$ , $ω_{h_{i} (d_{i})} = - 1 \cdot y_{2}$ , $υ_{h_{i} (d_{j})} = - 1 \cdot y_{3}$ and $ω_{h_{i} (d_{j})} = + 1 \cdot y_{4}$ . If $O_{h_{i} (d_{i})} = + 1$ , then $y_{1} = + 1$ , $y_{2} = - 1$ , $y_{3} = - 1$ and $y_{4} = + 1$ . Therefore, according to Eq. , we have

$e^{- 2 ϵ^{'}} \leq \frac{Pr [υ_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]}{Pr [ω_{h_{i} (d_{i})} = O_{h_{i} (d_{i})}]} \cdot \frac{Pr [υ_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]}{Pr [ω_{h_{i} (d_{j})} = O_{h_{i} (d_{j})}]} \leq e^{2 ϵ^{'}},$

in which $ϵ^{'} = ϵ ∕ (2 t)$ . Therefore, for each pair $(h_{i}, g_{i})$ of hash functions in the above cases, the perturbation satisfies $ϵ ∕ t$ -local differential privacy. Since there are $t$ pairs of hash functions to perturb the same symptom, $M_{Client-CS-LDP}$ of each user satisfies $ϵ$ -local differential privacy according to Lemma 1. Since records of different users are disjoint, all perturbed records together satisfy $ϵ$ -differential privacy according to Lemma 2. ■
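As a worked instance of the budget accounting in this proof, take $ϵ = 3$ (the value used in most experiments) and, purely as an assumed example, $t = 4$ pairs of hash functions:

```latex
\epsilon' = \frac{\epsilon}{2t} = \frac{3}{2 \cdot 4} = 0.375,
\qquad
\underbrace{e^{2\epsilon'}}_{\text{bound per pair } (h_i, g_i)} = e^{0.75} = e^{\epsilon / t},
\qquad
\prod_{i=1}^{t} e^{2\epsilon'} = e^{2 t \epsilon'} = e^{3} = e^{\epsilon}.
```

The factor $2$ appears because at most two entries of the perturbed array differ in distribution for each pair of hash functions.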

Theorem 6

Protocol CS-LDP satisfies unbiased estimation. For any value $j^{*} \in [m]$ , the variance of $\hat{f} (j^{*})$ is $(\frac{b - 1}{b} \cdot c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \frac{c_{ϵ}^{2}}{b} {‖ f ‖}_{2}^{2}$ .

Proof

Protocol CS-LDP satisfies unbiased estimation if and only if $E [\hat{f} (d)] = f (d)$ . Assume that there is a fixed constant $j^{*} \in [m]$ . For any $j \in [m]$ , $Y_{j}$ is denoted as follows:

$Y_{j} = \begin{cases} 1 & \text{if } h (j) = h (j^{*}) \\ 0 & \text{otherwise} \end{cases}$

Meanwhile, assume that there is a perturbation function $p (\cdot)$ , namely, Eq. . Then, $\hat{f} (j^{*})$ is equal to

$g (j^{*}) \cdot V [h (j^{*})] = g (j^{*}) \cdot \sum_{i = 1}^{n} g (d_{i}) p c_{ϵ} Y_{d_{i}} = g (j^{*}) \cdot \sum_{j = 1}^{m} f (j) g (j) p c_{ϵ} Y_{j} = g^{2} (j^{*}) f (j^{*}) p c_{ϵ} Y_{j^{*}} + \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j} = f (j^{*}) p c_{ϵ} + \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j}$

For the expected value of $\hat{f} (j^{*})$ , we have

$E [\hat{f} (j^{*})] = E [f (j^{*}) p c_{ϵ} + \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j}] = E [f (j^{*}) p c_{ϵ}] + E [\sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j}] = f (j^{*}) c_{ϵ} E [p] + \sum_{j \neq j^{*}} f (j) c_{ϵ} E [g (j^{*}) g (j) p Y_{j}] .$

$E [p]$ is equal to $(+ 1) \cdot \frac{e^{ϵ^{'}}}{e^{ϵ^{'}} + 1} + (- 1) \cdot \frac{1}{e^{ϵ^{'}} + 1} = \frac{e^{ϵ^{'}} - 1}{e^{ϵ^{'}} + 1}$ . Since hash functions $g_{i}$ are independent of each other, we have $E [g (j^{*}) g (j) Y_{j}] = E [g (j^{*})] E [g (j) Y_{j}] = 0$ for $j \neq j^{*}$ , so the second term vanishes. Because the calibration parameter $c_{ϵ}$ is chosen such that $c_{ϵ} E [p] = 1$ , we obtain $E [\hat{f} (j^{*})] = f (j^{*})$ .

For the variance of $\hat{f} (j^{*})$ , $Var [\hat{f} (j^{*})]$ is derived as follows:

$E [{(\hat{f} (j^{*}) - E [\hat{f} (j^{*})])}^{2}] = E [{(\hat{f} (j^{*}) - f (j^{*}))}^{2}] = E [{(f (j^{*}) (1 - p c_{ϵ}) + \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j})}^{2}] = E [f^{2} (j^{*}) {(1 - p c_{ϵ})}^{2} + {(\sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j})}^{2} + 2 f (j^{*}) (1 - p c_{ϵ}) \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j}] = E [f^{2} (j^{*}) {(1 - p c_{ϵ})}^{2}] + E [{(\sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j})}^{2}] + E [2 f (j^{*}) (1 - p c_{ϵ}) \sum_{j \neq j^{*}} f (j) g (j^{*}) g (j) p c_{ϵ} Y_{j}] = E [f^{2} (j^{*})] \cdot E [1 - 2 p c_{ϵ} + p^{2} c_{ϵ}^{2}] + E [\sum_{i \neq j^{*}} \sum_{j \neq j^{*}} f (i) f (j) g (i) g (j) p^{2} c_{ϵ}^{2} Y_{i} Y_{j}] = (c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \sum_{i \neq j^{*}} \sum_{j \neq j^{*}} f (i) f (j) E [g (i) g (j) p^{2} c_{ϵ}^{2} Y_{i} Y_{j}] = (c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \sum_{j \neq j^{*}} f^{2} (j) E [p^{2} c_{ϵ}^{2} Y_{j}^{2}] = (c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \sum_{j \neq j^{*}} c_{ϵ}^{2} f^{2} (j) E [Y_{j}] = (c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \frac{c_{ϵ}^{2}}{b} \sum_{j \neq j^{*}} f^{2} (j) = (\frac{b - 1}{b} \cdot c_{ϵ}^{2} - 1) f^{2} (j^{*}) + \frac{c_{ϵ}^{2}}{b} {‖ f ‖}_{2}^{2}$

Therefore, we complete the proof. ■
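The role of $c_{ϵ}$ in Theorem 6 can be made concrete with a small Python sketch. It assumes $c_{ϵ} = \frac{e^{ϵ^{'}} + 1}{e^{ϵ^{'}} - 1}$ , i.e., the value that makes $c_{ϵ} E [p] = 1$ as the proof requires; the function names and the toy frequency vector are hypothetical.

```python
import math

def cs_ldp_calibration(eps_prime):
    """Return E[p] for binary randomized response and the constant c_eps
    with c_eps * E[p] = 1 (assumed form of the calibration parameter)."""
    e = math.exp(eps_prime)
    expected_p = (e - 1) / (e + 1)
    return expected_p, 1.0 / expected_p

def cs_ldp_variance(f_target, f_all, b, c_eps):
    """Closed-form variance of Theorem 6; f_all lists f(j) for all j in [m]."""
    l2_sq = sum(x * x for x in f_all)              # squared l2 norm of f
    return (((b - 1) / b) * c_eps ** 2 - 1) * f_target ** 2 + (c_eps ** 2 / b) * l2_sq

freqs = [50, 30, 20]                               # toy frequencies, target is 50
expected_p, c_eps = cs_ldp_calibration(eps_prime=0.375)
var_closed = cs_ldp_variance(50, freqs, b=8, c_eps=c_eps)
# Second-to-last expression of the derivation, for comparison.
var_step = (c_eps ** 2 - 1) * 50 ** 2 + (c_eps ** 2 / 8) * (30 ** 2 + 20 ** 2)
```

Both variance expressions coincide, and the variance shrinks as $ϵ^{'}$ grows because $c_{ϵ}$ approaches $1$ .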

Since protocol FAS-LDP is similar to CS-LDP except for its hash functions, FAS-LDP also satisfies unbiased estimation and local differential privacy. Besides, it has the same variance. The corresponding proof is omitted.

6. Implementation of universal hash functions

In this section, we introduce the implementation of $k$ -universal hash functions. In detail, we present two different methods to implement universal hash functions. That is, CW-Trick hashing is used for the case $k = 2$ , while Tabulation based hashing is used for $k = 4$ .

6.1. Universal hash function

In the protocols proposed in Section 4, the most important component is the $k$ -universal hash functions, which are defined as follows:

Definition 5 $k$ -universal Class of Hash Functions [44] —

A class $H$ of hash functions from $D$ into $[b]$ is a $k$ -universal class of hash functions if for any distinct $d_{0}, d_{1}, \dots, d_{k - 1} \in D$ and any possibly identical $v_{0}, v_{1}, \dots, v_{k - 1} \in [b]$ ,

$\underset{h \in H}{Pr} [h (d_{i}) = v_{i}, \forall i \in [k]] = 1 ∕ b^{k},$ (6)

where $k$ is a positive integer.

According to Definition 5, a large number of values can be hashed to a relatively small set of keys. Meanwhile, there may be collisions, in which two different values are hashed to the same key. Parameter $k$ determines the degree of independence among hash values. For example, as parameter $k$ increases from $2$ to $4$ , the probability in Eq. (6) that $k$ distinct values are hashed to $k$ prescribed keys decreases from $1 ∕ b^{2}$ to $1 ∕ b^{4}$ . Here, we focus on how to design proper hash functions that satisfy Definition 5. In the following sections, we present two different implementations for $k$ -universal hash functions, i.e., CW-Trick and Tabulation based hash functions.
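Definition 5 can be verified exhaustively on a tiny family. The Python sketch below checks the case $k = 2$ for the linear family $h (x) = (a_{1} x + a_{0}) mod p$ over a small prime $p$ , taking $D = [b] = Z_{p}$ ; this toy family and the function name `is_2_universal` are chosen only for illustration.

```python
from itertools import product

def is_2_universal(p):
    """Check Definition 5 with k = 2 for h(x) = (a1*x + a0) mod p.

    For every pair of distinct inputs and every pair of target values,
    exactly one of the p^2 functions must hit both targets, which gives
    probability 1/p^2 under a uniformly random choice of (a0, a1)."""
    for d0, d1 in product(range(p), repeat=2):
        if d0 == d1:
            continue
        for v0, v1 in product(range(p), repeat=2):
            hits = sum(1 for a0, a1 in product(range(p), repeat=2)
                       if (a1 * d0 + a0) % p == v0 and (a1 * d1 + a0) % p == v1)
            if hits != 1:
                return False
    return True

ok = is_2_universal(5)
```

The check succeeds for prime moduli because two distinct points determine the two coefficients uniquely.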

6.2. CW-trick hashing

The first method for $k$ -universal hash functions is implemented by the following equation:

$h (x) = \sum_{i = 0}^{k - 1} a_{i} x^{i} \bmod p,$ (7)

where $p$ is a prime that is greater than value $x$ and $a_{i}$ is picked randomly from $[p]$ . This equation was proposed by Wegman et al. [44]. This method is simple and easy to implement. However, it has a disadvantage when $p$ is very large. That is, if $p$ is an arbitrary prime, this method is fairly slow because the ‘mod $p$ ’ operation is relatively slow. In order to solve the problem, Thorup et al. [31] proposed a simple method, namely CW-trick, in which $p$ is a so-called Mersenne prime of the form $2^{i} - 1$ . In our experiments, we use $p = 2^{61} - 1$ . CW-Trick hashing is used for $k = 2$ .
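A minimal Python sketch of Eq. (7) with the Mersenne prime $p = 2^{61} - 1$ is given below. The reduction uses only shifts, masks and one conditional subtraction instead of a general ‘mod $p$ ’. The final reduction into $[b]$ by `% b` and the helper names `mod_p61` and `make_cw_hash` are assumptions made for illustration; in Python the trick mainly shows the idea rather than a speed-up.

```python
import random

P61 = (1 << 61) - 1            # Mersenne prime 2^61 - 1

def mod_p61(x):
    """Reduce 0 <= x < 2^122 modulo 2^61 - 1 without a division.

    Uses 2^61 = 1 (mod p): the high bits are folded onto the low bits."""
    x = (x & P61) + (x >> 61)
    x = (x & P61) + (x >> 61)
    return x - P61 if x >= P61 else x

def make_cw_hash(k, b, rng):
    """Random polynomial of degree k - 1 over Z_p, mapped into [b]."""
    coeffs = [rng.randrange(P61) for _ in range(k)]    # a_0, ..., a_{k-1}
    def h(x):
        acc = 0
        for a in reversed(coeffs):                     # Horner's rule
            acc = mod_p61(acc * x + a)                 # acc * x + a < 2^122 for x < p
        return acc % b                                 # assumed mapping into [b]
    return h

rng = random.Random(0)
h = make_cw_hash(k=2, b=1024, rng=rng)
bucket = h(12345)
```

Calling `make_cw_hash` repeatedly with the same random generator yields independent hash functions, one per row of the sketch.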

6.3. Tabulation based hashing

If CW-Trick hashing is used for $k = 4$ , it is slow. Thorup et al. [31] proposed another method, namely Tabulation based hashing, to implement $k$ -universal hash functions. They compared their method with CW-trick in terms of running time and showed that the former is faster than the latter by at least a factor of $5$ . However, the weakness of Tabulation based hashing is that it requires large pre-computed tables and thus needs extra storage space.

We first introduce Theorem 2.3 in [31] to compute the hash value of $x$ by $4$ -universal hash functions.

Theorem 7 Theorem 2.3 in [31] —

Let $\vec{x} = (x_{0} x_{1} \dots x_{q - 1})$ be any $q$ characters with $x_{i} \in [2^{c}]$ . Let $G$ be a $q \times r$ generator matrix such that any square sub-matrix has full rank over the prime field $Z_{p}$ , where $p \geq max \{2^{c}, q + r\}$ is an odd prime. Let $\vec{y} = \vec{x} G$ be the $r$ additional characters $(y_{0} y_{1} \dots y_{r - 1})$ with $r = q - 1$ . Then, the hash function of $\vec{x}$ is

$h (\vec{x}) = h_{0} (x_{0}) \oplus \dots \oplus h_{q - 1} (x_{q - 1}) \oplus {\bar{h}}_{0} (y_{0}) \oplus \dots \oplus {\bar{h}}_{r - 1} (y_{r - 1}) .$ (8)

$h (\cdot)$ is a $4$ -universal hashing function if $h_{i}$ and ${\bar{h}}_{j}$ are independent $4$ -universal hash functions into $[2^{l}]$ .

In our experiments, we assume that value $d_{i}$ of each user $u_{i}$ has 32 bits and each character has $16$ bits.^1 Therefore, $q = 2$ and $r = 1$ . In detail, value $d_{i}$ is divided into two 16-bit sub-values, i.e., $d_{i}^{1}$ and $d_{i}^{2}$ , as in the following equation:

$d_{i} (32 bits) ⟹ \underbrace{1010 \dots 1010}_{d_{i}^{1} (16 bits)} \underbrace{0101 \dots 0101}_{d_{i}^{2} (16 bits)}$

Besides, $4$ -universal hash functions $h_{i}$ and ${\bar{h}}_{j}$ are implemented by tabulation based hashing. That is, we use precomputed tables to replace hash functions $h_{i}$ and ${\bar{h}}_{j}$ , so that Eq. (8) becomes

$h (\vec{x}) = T_{0} [x_{0}] \oplus \dots \oplus T_{q - 1} [x_{q - 1}] \oplus T_{q} [y_{0}] \oplus \dots \oplus T_{q + r - 1} [y_{r - 1}],$ (9)

where $T_{i}$ is a fully random table of size $2^{c}$ with values from $R$ . Tabulation based hashing thus takes time $O (q + r)$ and space $O ((q + r) 2^{c})$ .
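The case $q = 2$ , $r = 1$ of Eq. (9) can be sketched in Python as follows. Several choices are assumptions made only for illustration: the generator matrix $G = (1, 1)^{T}$ , the prime $p = 65537$ , a 32-bit output width, and a derived-character table with $p$ entries so that every $y_{0} \in Z_{p}$ has a slot. The helper names are hypothetical.

```python
import random

C = 16                  # bits per character
PRIME = 65537           # odd prime >= max(2^c, q + r); assumed choice
OUT_BITS = 32           # width of the table entries; assumed

def split_key(d):
    """Split a 32-bit value into its high and low 16-bit characters."""
    return d >> C, d & ((1 << C) - 1)

def make_tabulation_hash(rng):
    """Tabulation based hashing with q = 2 input characters and r = 1
    derived character, combined by XOR as in Eq. (9)."""
    t0 = [rng.getrandbits(OUT_BITS) for _ in range(1 << C)]
    t1 = [rng.getrandbits(OUT_BITS) for _ in range(1 << C)]
    t2 = [rng.getrandbits(OUT_BITS) for _ in range(PRIME)]   # indexed by y0
    def h(d):
        x0, x1 = split_key(d)
        y0 = (x0 + x1) % PRIME      # y = x G over Z_p with G = (1, 1)^T
        return t0[x0] ^ t1[x1] ^ t2[y0]
    return h

h = make_tabulation_hash(random.Random(1))
value = h(0xAAAA5555)   # 1010...1010 followed by 0101...0101
```

Each evaluation costs three table lookups and two XORs, which matches the $O (q + r)$ time stated above, at the price of storing the three tables.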

7. Experimental evaluation

In this section, our main goal is to study (1) the impact of different parameters on data utility; and (2) the application of protocols for frequency estimation of physical symptoms. The experiments are performed over two real datasets, representing two different distributions of symptoms.

7.1. Experimental setting

The evaluation is performed on a desktop computer with $4$ GB of RAM and an Intel(R) Pentium(R) CPU P6200, running the Windows $7$ operating system. All protocols are implemented in Python. All testing datasets are real.

Parameter settings. The number of hash function pairs $〈 h_{i}, g_{i} 〉$ is $100$ , and we choose $t$ pairs from them. The implementation of hash functions is based on CW-Trick and Tabulation-based hashing in Section 6. The values of $t$ and $b$ depend on error $ξ$ and confidence probability $δ$ . We mainly evaluate the four protocols on different values of privacy budget $ϵ$ , error $ξ$ and confidence probability $δ$ .

Datasets. We use two real datasets to evaluate our protocols, including WISDM and BMBD. We select an attribute to estimate its frequency in these two datasets, which represent two different distributions of physical symptoms. The datasets are presented as follows:

• WISDM [45]. The dataset is collected from the accelerometer and gyroscope sensors of a smart-phone and a smart-watch from $51$ subjects. It covers diverse activities of daily living and has $15,630,426$ records. We select attribute Activity from seven attributes to estimate its frequency. There are $18$ activities, including walking, jogging, sitting and so on.
• BMBD [46]. The dataset is collected from the public website Bixi,^2 which offers a bike rental service. It contains bike share information for Bixi Montreal from $2016$ to $2019$ , including the start and end station ID of each trip. We select attribute Start_station_code from six attributes to estimate its frequency.

Table 1 shows the statistical information of attribute Activity in WISDM and Start_station_code in BMBD. The distributions of these two attributes are different. Activity is nearly uniformly distributed, while Start_station_code is loose, i.e., its frequencies range from $28$ to $188,306$ . Therefore, we use them to evaluate our protocols for different distributions of symptoms.

Table 1.

Statistical information of datasets.

	WISDM	BMBD
Attribute	Activity	Start_station_code
Num. of distinct values	18	621
Min. ($f$)	833,208	28
Max. ($f$)	901,381	188,306
Avg. ($f$)	868,357	31,501
Med. ($f$)	870,532	25,238
Total records	15,630,426	19,561,549

Evaluation Metrics. We evaluate the top- $k$ frequent elements $C = \{d_{i_{1}}, d_{i_{2}}, \dots, d_{i_{k}}\}$ in $D$ , i.e., the $k$ elements with the highest frequencies, by RE and MSE [47], [48], [49]. The concrete metrics are presented as follows:

• Relative Error (RE). The metric is usually used for SUM/COUNT/AVG queries and measures how large the error is relative to the true answer for each query $\hat{f} (d)$ by the following equation:
$RE = \frac{1}{k} \sum_{j = 1}^{k} \frac{| f (i_{j}) - \hat{f} (i_{j}) |}{f (i_{j})}, i_{j} \in [m]$ (10)
Furthermore, for the top- $k$ elements, we evaluate them by the average and median value of relative errors, i.e., ARE and MRE.
• Mean Square Error (MSE). It is used to measure the estimation accuracy by the average of the squared errors, namely,
$MSE = \frac{1}{k} \sum_{j = 1}^{k} {[\hat{f} (i_{j}) - f (i_{j})]}^{2}, i_{j} \in [m]$ (11)
where $f (i_{j})$ is the true frequency of users taking value $i_{j}$ .
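The metrics of Eqs. (10) and (11) can be computed as in the following Python sketch. The function names and the toy frequencies are illustrative; `true_f` and `est_f` are assumed to hold the true and estimated frequencies of the top- $k$ elements in the same order.

```python
from statistics import median

def relative_errors(true_f, est_f):
    """Per-element relative errors |f - f_hat| / f."""
    return [abs(f - e) / f for f, e in zip(true_f, est_f)]

def are(true_f, est_f):
    """Average relative error over the top-k elements, Eq. (10)."""
    errs = relative_errors(true_f, est_f)
    return sum(errs) / len(errs)

def mre(true_f, est_f):
    """Median relative error over the top-k elements."""
    return median(relative_errors(true_f, est_f))

def mse(true_f, est_f):
    """Mean square error over the top-k elements, Eq. (11)."""
    return sum((e - f) ** 2 for f, e in zip(true_f, est_f)) / len(true_f)

true_f = [100, 200, 400]     # toy true frequencies
est_f = [110, 180, 400]      # toy estimates
scores = (are(true_f, est_f), mre(true_f, est_f), mse(true_f, est_f))
```

ARE and MRE are scale-free, whereas MSE grows with the magnitude of the frequencies, which is why the two datasets are not directly comparable in MSE.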

7.2. Experimental results

In the following sections, we show the impact of three key parameters $ξ$ , $δ$ and $ϵ$ in terms of RE and MSE for different datasets. Then, we show the effectiveness of protocols by estimating top- $k$ in different cases. In order to reduce the error, we repeat each experiment $10$ times. Since the frequencies of different records in BMBD differ, we consider two cases, i.e., top- $10$ and top- $30$ . That is, we estimate the $10$ and $30$ most frequent elements, respectively.

7.2.1. Impact of parameter $ξ$

In order to analyze the impact of parameter $ξ$ , we first set the other parameters. That is, confidence probability $δ$ is 0.1 and privacy budget $ϵ$ is $3$ . $ξ$ takes values in $[0.15, 0.18, 0.2, 0.25]$ for WISDM and in $[0.004, 0.0044, 0.005, 0.00571]$ for BMBD.

Figs. 3(a)–3(d) show RE and MSE of WISDM for the four protocols, i.e., CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP. RE and MSE of CMS-LDP and FCS-LDP increase as $ξ$ increases from 0.15 to 0.25 (Fig. 3). This is consistent with CMS and FCS without LDP, which implies that adding privacy preservation does not change the influence of parameter $ξ$ . However, Fig. 3 also shows that $ξ$ has little impact on CS-LDP and FAS-LDP, since their RE and MSE barely change as $ξ$ increases from 0.15 to 0.25.

Fig. 3 — Results for Top- $18$ of WISDM.

Figs. 4(a)–4(d) and 5(a)–5(d) show RE and MSE of BMBD in the two cases for the four private protocols. Compared to WISDM, protocols CMS-LDP and FCS-LDP show the opposite trend. That is, as $ξ$ increases from 0.004 to 0.00571, RE and MSE decrease. The key factor behind this difference is the size $b$ of the range of the hash functions in $H$ , which is far greater for BMBD than for WISDM. Thus, perturbation for privacy preservation is the dominant source of error for BMBD, while the collision probability matters relatively less. As $b$ decreases, the expected error becomes smaller. Figs. 4 and 5 show that RE and MSE of BMBD under CS-LDP and FAS-LDP do not differ much as $ξ$ increases. That is, $ξ$ has little impact on CS-LDP and FAS-LDP.

Therefore, the impact of $ξ$ on CMS-LDP and FCS-LDP depends on the size of $b$ . In contrast, it has little impact on CS-LDP and FAS-LDP.

7.2.2. Impact of parameter $δ$

Here, we consider the impact of parameter $δ$ for four LDP protocols. The expected error $ξ$ is 0.18 for WISDM, while that is 0.005 for BMBD. Privacy budget $ϵ$ is $3$ . The range of $δ$ is $[0.005, 0.02, 0.1, 0.25]$ .

For WISDM, Figs. 3(e)–3(h) show RE and MSE of the four protocols. RE and MSE increase as $δ$ increases. This implies that increasing the number of hash functions (i.e., decreasing $δ$ ) can effectively improve the accuracy for perturbed data. Meanwhile, it requires more time for the users and the third party. From Figs. 3(e)–3(h), RE and MSE increase quickly once $δ$ is greater than 0.25. Therefore, $δ < 0.25$ is a good choice.

For BMBD, Figs. 4(e)–4(h) and 5(e)–5(h) show RE and MSE of the four protocols. The two cases have similar results. The impact of $δ$ on CMS-LDP and FCS-LDP is opposite to that on CS-LDP and FAS-LDP. In detail, RE and MSE of CMS-LDP and FCS-LDP decrease as $δ$ increases, while those of CS-LDP and FAS-LDP show the opposite trend. This suggests using fewer hash functions (larger $δ$ ) for CMS-LDP and FCS-LDP and more hash functions (smaller $δ$ ) for CS-LDP and FAS-LDP on this dataset.

Thus, the impact of $δ$ on CS-LDP and FAS-LDP is consistent, while that on CMS-LDP and FCS-LDP varies across datasets.

7.2.3. Impact of parameter $ϵ$

In order to analyze the impact of $ϵ$ , $ξ$ for WISDM is 0.18, while that is 0.005 for BMBD. Confidence probability $δ$ is 0.1. The range of privacy budget $ϵ$ is $[1, 3, 5, 7]$ .

For WISDM, Figs. 3(i)–3(l) show RE and MSE of the four protocols. Fig. 3 shows that RE and MSE of CMS-LDP and FCS-LDP do not decrease as $ϵ$ increases. Instead, they fluctuate considerably. Compared to those for BMBD in Figs. 4(i), 4(k) and Fig. 5, the possible reason is the size of $b$ . For CS-LDP and FAS-LDP, the accuracy increases as the privacy budget increases. However, it does not change once the privacy budget exceeds some threshold (e.g., $5$ in Fig. 3).

For BMBD, Figs. 4(i)–4(l) and 5(i)–5(l) show RE and MSE of the four protocols. Top- $10$ and top- $30$ have the same trends in RE and MSE. For all four protocols, both RE and MSE decrease as privacy budget $ϵ$ increases. In detail, the decrease of RE and MSE in CMS-LDP and FCS-LDP is fast, while that in CS-LDP and FAS-LDP is not obvious. Once the privacy budget exceeds some threshold (e.g., $3$ in Fig. 4(j)), RE and MSE change little for CS-LDP and FAS-LDP. The threshold is $5$ for CMS-LDP and FCS-LDP.

According to the above results, privacy budget $ϵ$ has an important impact on the four protocols. Meanwhile, once $ϵ$ exceeds some threshold, the impact can be ignored.

7.2.4. Protocol applications in IoMT

By our experiments, we analyze two different distributions of symptoms for some infectious disease in IoMT. That is, the symptoms are uniformly distributed in case (i), while they are loose in case (ii). Generally speaking, we have two important findings from Figs. 3, 4 and 5. The first one is that the utility of CMS-LDP and FCS-LDP is greater than that of CS-LDP and FAS-LDP. Since the time cost of CMS-LDP and FCS-LDP is also smaller than that of CS-LDP and FAS-LDP, CMS-LDP and FCS-LDP are relatively optimal in terms of utility and computation overhead. The second one is that protocol CMS-LDP is not absolutely better than FCS-LDP, and vice versa. Furthermore, different distributions of symptoms have an important influence on the choice of parameters, including $ξ$ , $δ$ and $σ$ . In case (i), FCS-LDP is more stable than CMS-LDP as the parameters change. In case (ii), CMS-LDP and FCS-LDP change similarly. Parameters $ξ$ , $δ$ and $σ$ should be as small as possible to maximize the utility.

8. Conclusions

The focus of the paper is on how to estimate the frequency of symptoms under local differential privacy for infectious disease analysis in IoMT. Based on basic sketches, this paper has proposed four protocols, including CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP. We have proved that our proposed protocols satisfy two important properties, i.e., unbiased estimation and local differential privacy. Meanwhile, we have presented the variance of four protocols. Through empirical analysis, CMS-LDP and CS-LDP are relatively optimal protocols for frequency estimation of physical symptoms in IoMT. We plan to design better encoding methods to reduce the computation complexity and perturbation techniques to improve the accuracy of our protocols in the future.

CRediT authorship contribution statement

Xiaotong Wu: Conceptualization, Methodology, Formal analysis, Writing - original draft. Mohammad Reza Khosravi: Conceptualization, Methodology, Writing - original draft. Lianyong Qi: Software, Data curation, Validation. Genlin Ji: Writing - review & editing, Computation resources, Project administration. Wanchun Dou: Supervision, Writing - review & editing, Project administration. Xiaolong Xu: Conceptualization, Writing - review & editing, Investigation, Validation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Our thanks to the reviewers for their constructive comments and suggestions to improve the quality of the manuscript. This work was supported in part by the National Natural Science Foundation of China under Grant No. 41971343.

^1 Here, the length of $d_{i}$ is not constant. If there are a large number of elements in $D$ , the length can be 64 or more bits and thus $q = 4$ and $r = 3$ .

^2 https://montreal.bixi.com/en/open-data.

References

1. Xu X., Liu X., Xu Z., Dai F., Zhang X., Qi L. Trust-oriented iot service placement for smart cities in edge computing. IEEE Internet Things J. 2019
2. Zhang Y., Yin C., Wu Q., He Q., Zhu H. Location-aware deep collaborative filtering for service recommendation. IEEE Trans. Syst. Man Cybern.: Syst. 2019:1–12.
3. Zhang Y., Cui G., Deng S., Chen F., Wang Y., He Q. Efficient query of quality correlation for service composition. IEEE Trans. Serv. Comput. 2018:1.
4. Wu F., Zhao S., Yu B., Chen Y.-M., Wang W., Song Z.-G., Hu Y., Tao Z.-W., Tian J.-H., Pei Y.-Y., Yuan M.-L., Zhang Y.-L., Dai F.-H., Liu Y., Wang Q.-M., Zheng J.-J., Xu L., Holmes E.C., Zhang Y.-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.
5. Deep S., Zheng X., Jolfaei A., Yu D., Ostovari P., K.Bashir A. A survey of security and privacy issues in the internet of things from the layered context. Trans. Emerg. Telecommun. Technol. 2020
6. G. Hatzivasilis, O. Soultatos, S. Ioannidis, C. Verikoukis, G. Demetriou, C. Tsatsoulis, Review of security and privacy for the Internet of Medical Things (IoMT), in: Proceedings of 15th International Conference on Distributed Computing in Sensor Systems, DCOSS, 2019, pp. 457–464.
7. Wang T., Zheng Z., Bashir A., Jolfaei A., Xu Y. Finprivacy: A privacy-preserving mechanism for fingerprint identification. ACM Trans. Internet Technol. 2020
8. Mirzamohammadi S., Sani A.A. Viola: Trustworthy sensor notifications for enhanced privacy on mobile systems. IEEE Trans. Mobile Comput. 2018;17(11):2689–2702.
9. Qi L., Wang X., Xu X., Dou W., Li S. Privacy-aware cross-platform service recommendation based on enhanced locality-sensitive hashing. IEEE Trans. Netw. Sci. Eng. 2020
10. Xu X., Liu Q., Zhang X., Zhang J., Qi L., Dou W. A blockchain-powered crowdsourcing method with privacy preservation in mobile environment. IEEE Trans. Comput. Soc. Syst. 2019;6(6):1407–1419.
11. Zuo C., Lin Z., Zhang Y. Proceedings of Symposium on Security and Privacy, SP. IEEE; 2019. Why does your data leak? Uncovering the data leakage in cloud from mobile apps; pp. 1296–1310.
12. Xu X., He C., Xu Z., Qi L., Wan S., Bhuiyan M.Z.A. Joint optimization of offloading utility and privacy for edge computing enabled iot. IEEE Internet Things J. 2019
13. Dwork C. Proceedings of International Conference on Theory and Applications of Models of Computation, TAMC, Vol. 4978. Springer; 2008. Differential privacy: A survey of results; pp. 1–19.
14. Zheng Z., Wang T., Wen J., Mumtaz S., Bashir A.K., Chauhdary S.H. Differentially private high-dimensional data publication in internet of things. IEEE Internet Things J. 2020;7(4):2640–2650.
15. Duchi J.C., Jordan M.I., Wainwright M.J. Proceedings of 54th Annual Symposium on Foundations of Computer Science, FOCS. IEEE; 2013. Local privacy and statistical minimax rates; pp. 429–438.
16. Cormode G., Kulkarni T., Srivastava D. Proceedings of the International Conference on Management of Data, SIGMOD. ACM; 2018. Marginal release under local differential privacy; pp. 131–146.
17. Duchi J.C., Wainwright M.J., Jordan M.I. Minimax optimal procedures for locally private estimation. J. Amer. Statist. Assoc. 2018;113(521):182–201.
18. Wang N., Xiao X., Yang Y., Zhao J., Hui S.C., Shin H., Shin J., Yu G. Proceedings of International Conference on Data Engineering, ICDE. IEEE; 2019. Collecting and analyzing multidimensional data with local differential privacy; pp. 638–649.
19. Ye M., Barg A. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Trans. Inform. Theory. 2018;64(8):5662–5676.
20. R. Bassily, K. Nissim, U. Stemmer, A.G. Thakurta, Practical locally private heavy hitters, in: Proceedings of Annual Conference on Neural Information Processing Systems, 2017, pp. 2288–2296.
21. Bassily R., Smith A.D. Proceedings of the Forty-Seventh Annual on Symposium on Theory of Computing, STOC. ACM; 2015. Local, private, efficient protocols for succinct histograms; pp. 127–135.
22. Bun M., Nelson J., Stemmer U. Heavy hitters and the structure of local privacy. ACM Trans. Algorithms. 2019;15(4):51:1–51:40.
23. Qin Z., Yang Y., Yu T., Khalil I., Xiao X., Ren K. Proceedings of the Conference on Computer and Communications Security, CCS. ACM; 2016. Heavy hitter estimation over set-valued data with local differential privacy; pp. 192–203.
24. T. Wang, J. Blocki, N. Li, S. Jha, Locally differentially private protocols for frequency estimation, in: Proceedings of USENIX Security Symposium, 2017, pp. 729–745.
25. Warner S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 1965;60(309):63–69.
26. Wang S., Huang L., Wang P., Deng H., Xu H., Yang W. Proceedings of International Conference on Wireless Algorithms, Systems, and Applications, WASA, Vol. 9798. Springer; 2016. Private weighted histogram aggregation in crowdsourcing; pp. 250–261.
27. Erlingsson Ú., Pihur V., Korolova A. Proceedings of the Conference on Computer and Communications Security, CCS. ACM; 2014. RAPPOR: randomized aggregatable privacy-preserving ordinal response; pp. 1054–1067.
28. M. Charikar, K.C. Chen, M. Farach-Colton, Finding frequent items in data streams, in: Proceedings of International Colloquium on Automata, Languages and Programming, ICALP, 2002, pp. 693–703.
29. Cormode G., Muthukrishnan S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms. 2005;55(1):58–75.
30. G. Cormode, M.N. Garofalakis, Sketching streams through the net: Distributed approximate query tracking, in: Proceedings of the 31st International Conference on Very Large Data Bases, PVLDB, 2005, pp. 13–24.
31. M. Thorup, Y. Zhang, Tabulation based 4-universal hashing with applications to second moment estimation, in: Proceedings of the Fifteenth Annual Symposium on Discrete Algorithms, SODA, 2004, pp. 615–624.
32. Sweeney L. K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002;10(5):557–570.
33. Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discovery Data. 2007;1(1):3.
34.Dwork C., McSherry F., Nissim K., Smith A.D. Proceedings of Third Theory of Cryptography Conference, TCC, Vol. 3876. Springer; 2006. Calibrating noise to sensitivity in private data analysis; pp. 265–284. [Google Scholar]
35.F. McSherry, K. Talwar, Mechanism design via differential privacy, in: Proceedings of 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), 2007, pp. 94–103.
36.P. Kairouz, S. Oh, P. Viswanath, Extremal mechanisms for local differential privacy, in: Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 2879–2887.
37.Kairouz P., Oh S., Viswanath P. Extremal mechanisms for local differential privacy. J. Mach. Learn. Res. 2016;17:17:1–17:51. [Google Scholar]
38.B. Ding, J. Kulkarni, S. Yekhanin, Collecting telemetry data privately, in: Proceedings of Advances in Neural Information Processing Systems, 2017, pp. 3571–3580.
39.Ye Q., Hu H., Meng X., Zheng H. Proceedings of Symposium on Security and Privacy, SP. IEEE; 2019. Privkv: Key-value data collection with local differential privacy; pp. 317–331. [Google Scholar]
40.N. Li, Q. Ye, Mobile data collection and analysis with local differential privacy, in: Proceedings of IEEE 20th International Conference on Mobile Data Management (MDM), 2019, pp. 4–7.
41.Differential Privacy Team A. Apple; 2017. Learning with Privacy at Scale. [Google Scholar]
42.R. Chen, H. Li, A.K. Qin, S.P. Kasiviswanathan, H. Jin, Private spatial data aggregation in the local setting, in: Proceedings of IEEE 32nd International Conference on Data Engineering (ICDE), 2016, pp. 289–300.
43.McSherry F. Proceedings of International Conference on Management of Data, SIGMOD. ACM; 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis; pp. 19–30. [Google Scholar]
44.Wegman M.N., Carter L. New hash functions and their use in authentication and set equality. J. Comput. System Sci. 1981;22(3):265–279. [Google Scholar]
45.G. Weiss, WISDM smartphone and smartwatch activity and biometrics dataset data set. http://archive.ics.uci.edu/ml/index.php.
46.J. Wang, Bixi montreal bikeshare data - Bikeshare information for bixi montreal. https://www.kaggle.com/jackywang529/bixi-montreal-bikeshare-data.
47.Zhang Y., Wang K., He Q., Chen F., Deng S., Zheng Z., Yang Y. Covering-based web service quality prediction via neighborhood-aware matrix factorization. IEEE Trans. Serv. Comput. 2019:1. [Google Scholar]
48.Xue X., Han H., Wang S., Qin C. Computational experiment-based evaluation on context-aware O2O service recommendation. IEEE Trans. Serv. Comput. 2019;12(6):910–924. [Google Scholar]
49.Xue X., Wang S., Zhang L., Feng Z., Guo Y. Social learning evolution (SLE): computational experiment-based modeling framework of social manufacturing. IEEE Trans. Ind. Inform. 2019;15(6):3343–3355. [Google Scholar]

