Published in final edited form as: Proc ACM Workshop Priv Electron Soc. 2020 Nov 9;2020:163–179. doi: 10.1145/3411497.3420214

Preserving Genomic Privacy via Selective Sharing

Emre Yilmaz 1, Tianxi Ji 2, Erman Ayday 3, Pan Li 4

Abstract

Although genomic data has significant impact and widespread usage in medical research, it puts individuals’ privacy in danger, even if they share their genomic data anonymously or partially. To address this problem, we present a framework, inspired by differential privacy, for sharing individuals’ genomic data while preserving their privacy. We assume an individual with some sensitive portion of her genome (e.g., mutations or single nucleotide polymorphisms - SNPs that reveal sensitive information about the individual) that she does not want to share. The goals of the individual are to (i) preserve the privacy of her sensitive data (considering the correlations between the sensitive and non-sensitive parts), (ii) preserve the privacy of interdependent data (data that belongs to other individuals and is correlated with her data), and (iii) share as much non-sensitive data as possible to maximize the utility of data sharing. As opposed to traditional differential privacy-based data sharing schemes, the proposed scheme does not intentionally add noise to data; it is based on selective sharing of data points. We observe that the traditional differential privacy concept does not capture data sharing in such a setting, and hence we first introduce a privacy notion, ϵ-indirect privacy, that addresses it. We show that the proposed framework does not provide sensitive information to the attacker while it provides high data sharing utility. We also compare the proposed technique with previous ones and show our advantage in terms of both privacy and data sharing utility.

Keywords: Privacy, Differential Privacy, Genomics, Data Sharing

1. INTRODUCTION

Benefiting from low-cost and accessible genome sequencing, nowadays even ordinary individuals can obtain their digital genome sequences in an affordable way via online services such as 23andme [1]. They also share their genomic data with medical institutions, on public repositories (such as OpenSNP [2]), and with other direct-to-consumer service providers. Individuals typically use such services to be informed about their predisposition to certain diseases (e.g., cancer) [3, 26], to find their ancestors, or even to find compatible genomic partners. Moreover, this wide availability of genomes opens a new horizon for research in the medical field (e.g., treatment of genome-related diseases or personalized medicine). Although these direct-to-consumer services and the potential revolution in medicine look appealing, they also raise significant privacy concerns and ramifications. Because genes carry critical information about one’s medical profile and predisposition to sensitive diseases, once the identity of a genome donor is revealed, he or she faces the risk of discrimination by employers or insurance companies. Therefore, almost all public genomic data sharing repositories hide the identities of their donors (or participants). However, it has been shown that anonymization is not an effective technique for privacy-preserving genomic data sharing [19, 37].

Despite such risks, users on some online platforms (such as OpenSNP) share their genomic data with their identities, and some scientists even publish their own genomic data on their personal websites [4]. Such individuals tend to hide sensitive parts of their genomes (e.g., parts that reveal their predisposition to a sensitive disease) while sharing the rest. However, it has been shown that hiding sensitive parts is not sufficient for privacy [7]: it is possible to infer the hidden parts by using the pairwise correlations that exist between single nucleotide polymorphisms (SNPs) in the genome, also referred to as linkage disequilibrium (LD) [35].

Although public availability of genome sequences is a privacy threat, limiting access to public genomic datasets is a barrier to both medical research and all of the aforementioned benefits. In this paper, we build a framework to protect the privacy of individuals’ genomic data while providing high utility for genomic data sharing. Our proposed technique is inspired by the differential privacy concept [16] to control the trade-off between utility and privacy.

We assume an individual (called the donor) with a genomic data sequence that includes some sensitive SNPs (e.g., the ones revealing her predisposition to a sensitive disease). The sensitive part of the genome is not fixed; it may vary among individuals. Our goal is to protect the sensitive part of the sequence from inference attacks while sharing as much as possible of the rest (the non-sensitive part). In other words, we consider the privacy of sensitive data based on the shared non-sensitive part, the correlations in between, and interdependent factors. The attacker may try to infer the individual’s sensitive SNPs by using existing inference attacks (e.g., using kinship information and correlations among the SNPs), and it has access to public genomic datasets (e.g., from [2, 6]) to build its statistical models for the inference attacks. Moreover, rather than adding noise to the shared data to provide privacy (which implies modifying the content of genomic data, and hence is not acceptable among medical researchers), our goal is to selectively decide whether or not to share particular SNPs based on our formulation.

We observed that traditional differential privacy (DP) concepts cannot be utilized in such a setting. Therefore, we first introduce a new privacy definition, ϵ-indirect privacy, inspired by the definition of DP, and propose an algorithm to decide which part of the non-sensitive data can be shared with a service provider in order to minimize the risk of inference of the sensitive data. ϵ-indirect privacy guarantees that sharing a non-sensitive data point (e.g., a SNP) changes the knowledge of the attacker only within a boundary controlled by the privacy parameter ϵ. Our proposed algorithm processes one non-sensitive SNP in each step and decides to share or hide that SNP based on the definition of ϵ-indirect privacy. We also analyze how changing the processing order affects utility. Since selecting the order providing the highest utility is an NP-complete problem, we propose a greedy algorithm to select the order to improve utility. Furthermore, different from previous work on genomic data sharing [23], in our proposed mechanism, hiding a SNP does not provide any information about the value of that SNP (or other sensitive SNPs) to the attacker, because the proposed SNP sharing mechanism does not consider the real values of the sensitive SNPs. We also consider and preserve the privacy of interdependent data.

For the genomic data sharing setting, the proposed definition is more meaningful than traditional DP definitions since it considers (i) correlations between SNPs, (ii) familial correlations, and (iii) the fact that data may have sensitive and non-sensitive parts. Note that this new privacy notion is not specific to genomic data; it can also be used for other data types that are shared in similar settings. Although the proposed privacy definition is more advantageous than traditional DP-based definitions for the problem at hand, it operates under the assumption that the attacker possesses a certain amount of auxiliary information. Therefore, depending on the assumption about the background knowledge of the attacker, the utility provided by the proposed definition may change. In Section 6.3, we discuss the consequences of this attacker-dependency of the proposed scheme.

We evaluate the proposed mechanism on real genomic data belonging to a Central European population [6]. We study the effects of various design parameters, such as the correlation model, order of sharing, and kinship relationships, on privacy and utility. We compare the proposed scheme with the existing work of Humbert et al., which proposes an optimization-based solution for the same problem [23], and with a local differential privacy-based data sharing mechanism. We show that these existing works are vulnerable to inference attacks and that our scheme provides both higher privacy (in terms of entropy and estimation error) and higher utility. Due to its advantages for both genomic data donors (by providing strong privacy guarantees) and researchers (by providing high data utility), we expect the proposed framework to promote genomic data sharing.

The rest of the paper is organized as follows. In Section 2, we summarize the related work in the literature. In Section 3, we give a brief introduction to genomics and technical preliminaries. In Section 4, we explain our proposed framework in detail. In Section 5, we evaluate our framework on real genomic data and compare our method with existing work. In Section 6, we discuss our mechanism further based on our evaluation results. Finally, in Section 7, we conclude our work and discuss future work.

2. RELATED WORK

Genomic privacy has recently been explored by many researchers [31]. Several works have studied various inference attacks against genomic data, including membership inference [21, 34, 39] and attribute inference [13, 22, 33]. To mitigate these threats, some researchers proposed using cryptographic techniques for privacy-preserving processing of genomic data. Jha et al. proposed a method for secure comparison of DNA sequences [24]. Cassa et al. proposed a cryptographic scheme to securely transmit externally generated sequence data that does not require any patient identifiers [11]. Baldi et al. proposed cryptographic techniques for privacy-preserving computations on genomic data using private set intersection [9]. Ayday et al. proposed partially homomorphic encryption for privacy-preserving use of genomic data in clinical settings [8]. Wang et al. proposed private edit distance protocols to find similar patients (across several hospitals) [41]. Deuber et al. proposed a system for computation over encrypted genomic data stored in the cloud [12].

Some researchers proposed using the differential privacy (DP) concept [20] to release summary statistics in a privacy-preserving way (to mitigate membership inference attacks). Fienberg et al. used the DP concept for sharing statistics such as minor allele frequencies, p-values, and chi-square values [18]. Johnson and Shmatikov proposed using the exponential mechanism for the computation and release of (i) the number of SNPs that are associated with a specific phenotype, (ii) the most significant SNPs related to a phenotype, (iii) p-values, and (iv) correlations between pairs of SNPs [25]. Yu et al. extended the work of Fienberg et al. and presented a scalable algorithm for an arbitrary number of SNPs [43]. Different from existing DP-based approaches, our goal is to share the genomic sequence of an individual, not summary statistics.

To share genomic sequences in a privacy-preserving way, Humbert et al. proposed an optimization-based technique that selectively hides portions of shared genomic data by considering the privacy budgets of both the donor and her family members [23]. Another goal of Humbert et al.’s work is to maximize the genomic data sharing utility (by maximizing the number of SNPs shared). This work is the closest in the literature to ours. We compare our proposed mechanism with the work of Humbert et al. and show that our work outperforms [23] in terms of both privacy and utility.

3. TECHNICAL PRELIMINARIES

In this section, we provide brief background on genomics and differential privacy. In Appendix A, we also provide details about the inference attack against kin genomic privacy (from [13]) that we utilize in our proposed mechanism.

3.1. Genomics Background

Approximately 99.9% of all individuals’ DNA is identical, and the remaining 0.1% is responsible for our differences. Single nucleotide polymorphisms (SNPs) are the most common variation in the human genome. A SNP is a point variation of a single nucleotide (A, T, C, or G), and there are around 100 million known SNP positions in the human genome [5]. Every SNP comes in a pair of nucleotides, called alleles. The major allele is the most common nucleotide in the population and the minor allele is the second most common one. The frequency at which the minor allele is observed at a SNP position is called the minor allele frequency (MAF) of that SNP. We can represent each SNP in terms of the number of its minor alleles (i.e., by a number from the set {0, 1, 2}). Using its MAF value, the prior probability distribution of a SNP can be calculated. Recent discoveries show that particular SNPs are associated with some serious genetically related diseases [3]. Since SNPs carry sensitive information, including individuals’ disease risks, most existing works in genomic privacy focus on protecting SNPs to prevent the risk of genetic discrimination.
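To make the last point concrete, the following minimal Python sketch derives a SNP’s prior distribution from its MAF under the standard Hardy-Weinberg assumption (this assumption is ours; the paper does not state which population-genetics model it uses):

```python
# Sketch: prior distribution of a SNP (number of minor alleles) from its
# MAF, assuming Hardy-Weinberg equilibrium (an assumption of this sketch).
def snp_prior(maf: float) -> dict:
    p, q = 1.0 - maf, maf          # major / minor allele frequencies
    return {0: p * p,              # homozygous major
            1: 2.0 * p * q,        # heterozygous
            2: q * q}              # homozygous minor

print(snp_prior(0.3))  # ≈ {0: 0.49, 1: 0.42, 2: 0.09}
```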

3.2. Differential Privacy

Differential privacy (DP) [16] is a concept to preserve the privacy of records in statistical databases while publishing statistical information about the database. DP guarantees that an algorithm behaves approximately the same on two neighboring databases (that differ by a single record) as follows:

$\Pr[K(D_1) \in S] \leq e^{\epsilon} \times \Pr[K(D_2) \in S],$  (1)

where $D_1$ and $D_2$ are neighboring databases, $K$ is a randomized algorithm, and $S$ is any subset of the output space of $K$. Algorithm $K$ is then called ϵ-differentially private if and only if (1) holds for all neighboring databases.

Recently, local differential privacy (LDP) [15, 27] has been defined to formalize privacy during individual data sharing (e.g., between a data owner and an untrusted party). LDP is satisfied when an untrusted data collector cannot determine the original value of a data point (belonging to a data owner) from the reported (perturbed) value. Formally, for any two inputs $x_1$ and $x_2$ in the input space and any output $y$, an algorithm $K$ satisfies ϵ-LDP if

$\Pr[K(x_1) = y] \leq e^{\epsilon} \times \Pr[K(x_2) = y].$  (2)

LDP can be achieved by the randomized response mechanism [17, 40], in which each individual reports her value correctly with probability $\frac{e^\epsilon}{e^\epsilon + n - 1}$ and reports each of the incorrect values with probability $\frac{1}{e^\epsilon + n - 1}$ to satisfy ϵ-LDP [40], where $n$ is the size of the input set.
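As an illustration, here is a minimal Python sketch of the randomized response mechanism described above (the function name and domain encoding are ours):

```python
import math
import random

# Sketch of randomized response: report the true value with probability
# e^eps / (e^eps + n - 1); otherwise report one of the n - 1 incorrect
# values, each with probability 1 / (e^eps + n - 1).
def randomized_response(true_value, domain, epsilon):
    n = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + n - 1)
    if random.random() < p_true:
        return true_value
    return random.choice([v for v in domain if v != true_value])

# e.g., perturbing a SNP value over the domain {0, 1, 2}
print(randomized_response(1, [0, 1, 2], epsilon=0.5))
```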

Differential Privacy under Correlated Data.

Although DP provides strong guarantees for individual privacy, there may be privacy risks for individuals when data is correlated. Several approaches have been proposed to protect the privacy of individuals under correlations. Yang et al. focused on how data correlations and prior knowledge affect privacy [42] and proposed a mechanism satisfying Bayesian DP for a sum query on correlated data. Cao et al. quantified the potential privacy loss of a traditional DP mechanism under temporal correlations in the context of continuous data release [10]. Liu et al. proposed dependent DP, which accounts for the dependence between tuples, together with a dependent perturbation mechanism that introduces dependence coefficients for analyzing the sensitivity of different queries under dependent DP [30]. Song et al. adopted a generalized version of DP, called Pufferfish, to address privacy protection for correlated data [36]. All these works focus on the privacy of aggregate data release and are not suitable for individual data sharing.

We observed that existing DP concepts do not consider data that is composed of sensitive and non-sensitive parts, in which privacy is defined over the sensitive part (that is never shared), while the shared non-sensitive part may be used to infer the sensitive part. Existing works also do not consider interdependent relationships between individuals, such as kinship. Furthermore, existing techniques that are used to satisfy DP do not consider a sharing mechanism that is based on selective sharing rather than noise addition. Therefore, in this work, inspired by the definition of DP, we first develop a new privacy concept, ϵ-indirect privacy, addressing the aforementioned limitations. Then, we use the proposed concept to develop a genomic data sharing mechanism. A similar work by Doudalis et al. also categorized data points as sensitive and non-sensitive [14] and introduced the notion of one-sided differential privacy (OSDP). However, OSDP was also proposed as an extension to traditional DP and was defined over neighboring databases. Moreover, it does not consider data correlations and cannot be used for individual data sharing.

4. PROPOSED PRIVACY-PRESERVING FRAMEWORK

In this section, we first describe the general settings, assumptions, and the attacker model. Then, we provide a mathematical formulation of our solution and explain the general data sharing framework.

4.1. Assumptions and Notations

We have a set of family members denoted as $F$. We represent the set of SNP IDs of an individual $i$ ($i \in F$) as $I_i$. We represent the value of a SNP as the number of minor alleles it carries and denote the value of SNP $j$ for individual $i$ as $x_j^i$ ($j \in I_i$). Thus, $x_j^i$ takes values from the set {0, 1, 2}. Also, we denote a SNP $j$ as $x_j$ for general representation (regardless of its value in a specific individual). We denote the set of sensitive SNPs for individual $i$ as $S_i$. Sensitive SNPs are determined by individuals based on their association with genetically related diseases [3]. The SNPs in the sensitive set are never shared by the corresponding individual. However, as will be discussed later, information about these SNPs can be leaked either by sharing other SNPs that are not in the sensitive set or through SNPs shared by other family members. Also, each family member may have her own sensitive SNP set.

During the SNP sharing procedure, by using our proposed mechanism, an individual decides to hide (or share) each of her SNPs. We denote the set of hidden SNPs of individual $i$ as $H_i$ and her set of shared SNPs as $R_i$, where $S_i \subseteq H_i$, because SNPs in the sensitive set of individual $i$ are always hidden. At the beginning of the sharing procedure (discussed in Section 4.4), all of the SNPs of $i$ are hidden (i.e., $H_i = I_i$ and $R_i = \emptyset$). Then, based on the result of the proposed mechanism on each SNP, we decide whether or not to add that SNP to the set of shared SNPs ($R_i$). In practice, the utility of each shared SNP may differ depending on the goal of the corresponding research (that will use that SNP). To simplify this, we assume that all non-sensitive SNPs can be potentially shared with equal utility after checking the privacy condition of the proposed mechanism. We list the frequently used notations in Table 1.

Table 1:

Frequently used notations.

Definition                                Notation
Set of family members                     $F$
Set of SNPs of individual $i$             $I_i$
Value of SNP $j$ of individual $i$        $x_j^i$
Set of sensitive SNPs of individual $i$   $S_i$
Set of hidden SNPs of individual $i$      $H_i$
Set of shared SNPs of individual $i$      $R_i$

4.2. Attacker Model

As will be discussed in Section 4.3, in the proposed mechanism, the decision of whether or not to share a SNP depends on the auxiliary information of the receiver (attacker). For a given ϵ value, the sharing decision of the algorithm depends on the assumed auxiliary information of the data receiver. Thus, in contrast to differential privacy, this feature makes the proposed mechanism attacker-dependent. The proposed mechanism achieves the optimal utility and privacy if the auxiliary information possessed by the data receiver is known. However, in practice, it is not possible to estimate the exact auxiliary information possessed by each data receiver. Therefore, we apply a hypothetical upper bound on the auxiliary information of the attacker and the proposed mechanism provides the sharing decisions accordingly. In this work, we assume that the attacker does not have SNP data of the donor from other sources. That is, the only SNP data the attacker can use is that provided by the donor as a result of the SNP sharing mechanism. As auxiliary information, we assume that the attacker has background knowledge of public statistics about genomics and of the relationships between the family members in $F$. That is, the attacker has access to public resources including SNP data belonging to different populations [5]. Using such resources, the attacker can calculate the minor allele frequency (MAF) for each SNP (the frequency at which the minor allele is observed in a given population). Using similar resources, the attacker can also compute high-order correlations between the SNPs and use this information for the inference of the SNPs in the sensitive sets of individuals [33]. For this, the attacker exploits the method introduced in [33] as follows:

$P_t(x_j) = \begin{cases} 0 & \text{if } F(x_{j-t,\,j-1}) = 0 \\ \dfrac{F(x_{j-t,\,j})}{F(x_{j-t,\,j-1})} & \text{if } F(x_{j-t,\,j-1}) > 0. \end{cases}$  (3)

Here, $P_t(x_j)$ is the probability distribution of SNP $j$ computed using a Markov chain of order $t$, and $F(x_{u,v})$ is the frequency of the subsequence $x_{u,v}$ that includes all the SNPs between $x_u$ and $x_v$. Furthermore, the attacker knows that the shared SNPs of an individual may threaten the kin genomic privacy of her family members and uses Mendel’s law to utilize this information.
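The following Python sketch shows one way the frequency ratio in (3) could be estimated from a population panel (the function and data layout are our illustration, not the authors’ implementation):

```python
from collections import Counter

# Sketch: estimate the order-t Markov conditional distribution of SNP j
# from a panel of sequences (lists over {0, 1, 2}), following (3):
# the frequency of the length-(t+1) subsequence ending at j, divided by
# the frequency of its length-t prefix (the context).
def markov_conditional(panel, t, j):
    ctx_counts = Counter(tuple(s[j - t:j]) for s in panel)
    full_counts = Counter(tuple(s[j - t:j + 1]) for s in panel)
    return {full: (cnt / ctx_counts[full[:-1]]
                   if ctx_counts[full[:-1]] > 0 else 0.0)
            for full, cnt in full_counts.items()}

# toy panel of four individuals; order t = 1, target SNP index j = 2
panel = [[0, 1, 1], [0, 1, 2], [1, 0, 0], [0, 1, 1]]
print(markov_conditional(panel, t=1, j=2))
# ≈ {(1, 1): 0.667, (1, 2): 0.333, (0, 0): 1.0}
```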

To simulate the attacker’s inference of the sensitive SNPs of individuals, we use the state-of-the-art inference attacks developed so far. The attacker combines all the aforementioned information, as shown in [13, 22], by using a message passing algorithm on a graphical model [29, 32] (as discussed in Appendix A). In the attacker’s favor, we assume that the attacker knows the correlation model used in the SNP sharing mechanism and uses the same model in its attack. It is worth noting that, via new discoveries in genomics, things that are non-sensitive today may turn out to be sensitive in the future. Similarly, new statistics or correlation models may reduce our privacy guarantees. However, this is a general concern for all developing fields.

4.3. Mathematical Formulation

In our work, it is not possible to directly apply the differential privacy or local differential privacy definitions given in Section 3.2 because our data model has sensitive and non-sensitive parts. We want to keep the level of distinguishability between any two states of the sensitive SNPs within certain limits after sharing a non-sensitive SNP. Therefore, inspired by the definition of DP [16], we introduce a new privacy concept (ϵ-indirect privacy) that can provide formal privacy guarantees for data types that (i) have both sensitive and non-sensitive parts and (ii) define privacy over the sensitive part (which is never directly shared), while the shared non-sensitive part may reveal information about the sensitive part via the inherent correlations in the data. As discussed before, existing DP concepts do not consider such data types and existing mechanisms to satisfy DP do not operate based on selective sharing. In the following, we first provide our general definition for this new privacy concept; then we discuss how it is used for genomic data sharing.

Definition 4.1. ϵ-indirect privacy.

Let $I_i$ represent the set of data points belonging to individual $i$ and $S_i$ include the sensitive data points in $I_i$ (which will not be shared). A data sharing mechanism satisfies ϵ-indirect privacy if, for all shared data points $x_j^i$ ($j \in I_i$),

$\frac{P(x_k^i \mid R_i \cup \{x_j^i\},\, A)}{P(\bar{x}_k^i \mid R_i \cup \{x_j^i\},\, A)} \leq e^{\epsilon} \cdot \frac{P(x_k^i \mid A)}{P(\bar{x}_k^i \mid A)} \qquad \forall k \in S_i \text{ and } \forall x_k^i, \bar{x}_k^i \in D,$  (4)

where $D$ is the input domain, $A$ represents the publicly available auxiliary information, and $R_i$ represents the set of data points that have been revealed so far.

Informally, this formulation guarantees that the knowledge of the attacker changes only within a boundary when a data point $x_j^i$ is shared. In other words, it is guaranteed that any two possible values of a sensitive data point (such as $x_k^i$ and $\bar{x}_k^i$) will remain indistinguishable to the attacker after learning a non-sensitive data point $x_j^i$. ϵ-indirect privacy is satisfied if and only if (4) holds for all sensitive data points in $S_i$. Similar to differential privacy [16], the level of privacy is determined by the parameter ϵ. If there are interdependent relations in the data (i.e., if data belonging to different individuals are also correlated), the condition in (4) should also be satisfied for the interdependent data points.

ϵ-indirect Privacy for Genomic Data Sharing.

We use the ϵ-indirect privacy concept in our proposed SNP sharing mechanism. We assume each SNP $j$ takes a scalar value $x_j \in \{0, 1, 2\}$, and hence the L1 norm between any two SNP values is bounded by 2. We assume $A$ includes (i) the minor allele frequency (MAF) values of the SNPs, (ii) the high-order correlations between the SNPs, and (iii) the relationships between the family members in $F$. Our proposed SNP sharing mechanism (described in the next section) decides whether or not to share each hidden SNP of a donor $i$ based on ϵ-indirect privacy. For each SNP $j$ in $H_i$, the mechanism calculates the probability distributions of the SNPs in $S_m$ ($m \in F$) assuming $x_j^i$ is shared, also using all the previously shared SNPs of both the donor and the other family members in $F$. For this, we use the inference attack introduced in [13] (also discussed in Appendix A).

Let $R$ include all the SNPs that have been shared by the individuals in $F$ so far; that is, $R = \bigcup_{i \in F} R_i$. Following (4), for individual $i$ to share a new SNP $j$, we require that for all sensitive SNPs of all family members, the ratio between the probabilities of different states does not exceed the following boundary:

$\frac{P(x_k^m \mid R \cup \{x_j^i\},\, A)}{P(\bar{x}_k^m \mid R \cup \{x_j^i\},\, A)} \leq e^{\epsilon_m} \cdot \frac{P(x_k^m \mid A)}{P(\bar{x}_k^m \mid A)} \qquad \forall m \in F,\ \forall k \in S_m,\ \text{and } \forall x_k^m, \bar{x}_k^m \in \{0, 1, 2\},$  (5)

where $\epsilon_m$ is the privacy parameter of family member $m$. Note that the above condition, and hence the sharing (or hiding) decision on a particular SNP, is independent of the actual values of the sensitive SNPs of the donor and the other family members. We will further discuss the importance of this property in later sections.
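To make the check in (5) concrete, the following Python sketch evaluates it for a single sensitive SNP, assuming the posterior and prior distributions have already been computed (e.g., by the inference attack of Appendix A); the function name and the handling of degenerate ratios are our assumptions:

```python
import itertools
import math

# Sketch: check condition (5) for one sensitive SNP of one family member.
# `posterior` is P(. | R ∪ {x_j}, A) and `prior` is P(. | A), both given
# as dicts mapping each value in {0, 1, 2} to a probability.
def satisfies_indirect_privacy(posterior, prior, eps):
    for u, v in itertools.permutations([0, 1, 2], 2):
        if prior[v] == 0:
            continue              # prior ratio unbounded: trivially met
        if posterior[v] == 0:
            if posterior[u] > 0:
                return False      # posterior ratio unbounded: violated
            continue
        lhs = posterior[u] / posterior[v]
        rhs = math.exp(eps) * prior[u] / prior[v]
        if lhs > rhs:
            return False
    return True

prior = {0: 0.49, 1: 0.42, 2: 0.09}
posterior = {0: 0.55, 1: 0.38, 2: 0.07}
print(satisfies_indirect_privacy(posterior, prior, eps=0.5))  # True
```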

4.4. SNP Sharing Mechanism

Unlike differentially private data sharing mechanisms, which add noise to shared data points in order to protect privacy, we introduce a selective sharing mechanism for the non-sensitive SNPs. That is, rather than sharing noisy SNP values, the proposed mechanism decides whether or not to share each non-sensitive SNP (this is also the preferred methodology for medical data in general). The overview of the proposed mechanism is shown in Fig. 1. Let individual $i$ be the donor who wants to share her SNPs with a service provider. At the beginning of the process, all of the SNPs of individual $i$ are hidden (i.e., $R_i = \emptyset$ and $H_i = I_i$). We first assign the set of sensitive SNPs ($S_i$) and a privacy parameter (i.e., $\epsilon_i$ in (5)) for individual $i$. As discussed, these two parameters can differ for each individual in $F$. Then, we pick a SNP $j$ from $H_i$ and calculate its disclosure effect on the probability distribution of each SNP in the sensitive SNP set of $i$ ($S_i$) and in the sensitive SNP sets of all the other family members in $F$. If (5) holds for all SNPs in $S_m$ ($m \in F$), then we share the corresponding hidden SNP and add it to set $R_i$; otherwise, $x_j^i$ remains in $H_i$. The details of our proposed mechanism are also shown in Algorithm 1.

Figure 1: One instance of sharing the SNP sequence of the donor (individual $i$) with a service provider (data receiver). Here, the proposed mechanism decides whether to share SNP $j$ of individual $i$ (with value $x_j^i$) with the service provider.

ALGORITHM 1:

SNP Sharing Mechanism

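The original pseudocode of Algorithm 1 appears as an image in the manuscript; the following Python sketch reconstructs it from the prose description above (the names and the `infer_posteriors` callback are our assumptions, standing in for the Appendix A inference attack; it reuses `satisfies_indirect_privacy` from the sketch after (5)):

```python
# Sketch of Algorithm 1 (reconstructed from the text, not the authors'
# code). `epsilons[m]` is each family member's privacy parameter;
# `infer_posteriors(shared, m, k)` stands in for the Appendix A attack,
# returning the posterior distribution of sensitive SNP k of member m
# given a candidate set of shared SNPs.
def snp_sharing_mechanism(donor, family, priors, epsilons, infer_posteriors):
    shared = set()                                    # R_i starts empty
    hidden = set(donor.snps) - set(donor.sensitive)   # sensitive SNPs stay hidden
    for j in sorted(hidden):                          # processing order: Section 4.5
        candidate = shared | {j}
        ok = all(
            satisfies_indirect_privacy(
                infer_posteriors(candidate, m, k), priors[m][k], epsilons[m])
            for m in family for k in m.sensitive)
        if ok:                                        # (5) holds for every
            shared.add(j)                             # sensitive SNP of every
            hidden.discard(j)                         # family member
    return shared, hidden
```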

Note that our privacy guarantee is for the sensitive SNPs of the individuals, which the proposed mechanism never reveals. When we reveal the non-sensitive SNPs using (5), the attacker’s knowledge (or view) of the sensitive SNPs changes. This change depends on the amount of non-sensitive SNPs that are publicly shared. For higher values of ϵ, the attacker’s knowledge increases more, and for smaller values of ϵ, it increases less. Thus, the randomness introduced by the proposed mechanism lies in the probability distributions of the sensitive SNPs provided to the attacker (by publishing the non-sensitive SNPs). The change in the probability distribution of the sensitive SNPs due to this mechanism can also be thought of as revealing the values of the sensitive SNPs after adding noise to them. Different from existing DP-based mechanisms, which, given an ϵ value, may share a certain data point with different values (i.e., with different noise amounts) for each new sharing of the data, our proposed mechanism reveals the same set of non-sensitive SNPs at each new sharing of the data.

We propose this selective sharing mechanism due to the nature of the data we consider (since noisy data points are typically not tolerated for genomic data). As we further discuss in Section 6, the ϵ-indirect privacy concept can also be used for other data types. Therefore, depending on the noise tolerance of the data type, ϵ-indirect privacy can also be satisfied by adding noise to the shared non-sensitive data points.

4.5. The Order of Sharing

The proposed privacy-preserving SNP sharing algorithm (Algorithm 1) processes all non-sensitive SNPs one by one and makes a decision (hide or share) for each processed SNP. Even though ϵ-indirect privacy is checked and achieved in each step, the utility (i.e., the number of shared non-sensitive SNPs) changes based on the order of sharing, since the previously shared SNPs affect the value of $P(x_k^m \mid R \cup \{x_j^i\},\, A)$. Here, we first represent the problem of finding the processing order of SNPs that provides the highest utility as a binary integer programming problem and show its NP-completeness. Then, we provide an algorithm that minimizes the attacker’s maximum knowledge gain in each step.

Assume individual $i$ has $n$ non-sensitive SNPs (i.e., $|I_i \setminus S_i| = n$). Let $\mathcal{P}$ be the set of all possible permutations (sharing orders) of these $n$ SNPs ($|\mathcal{P}| = n!$) and $p \in \mathcal{P}$ be one permutation, whose components $p(1), p(2), \ldots, p(n)$ consist of the SNPs $x_j^i \in I_i \setminus S_i$ (i.e., $p$ represents a particular order of sharing). Then, the problem of maximizing the utility can be formulated as

$\begin{aligned} \max_{c_j,\, p}\ \ & \sum_{j=1}^{n} c_j \\ \text{s.t.}\ \ & c_1 \cdot \frac{P(x_k^i \mid R_1 \cup \{p(1)\},\, A)}{P(\bar{x}_k^i \mid R_1 \cup \{p(1)\},\, A)} \leq e^{\epsilon_i} \cdot \frac{P(x_k^i \mid A)}{P(\bar{x}_k^i \mid A)} \\ & \qquad \vdots \\ & c_n \cdot \frac{P(x_k^i \mid R_n \cup \{p(n)\},\, A)}{P(\bar{x}_k^i \mid R_n \cup \{p(n)\},\, A)} \leq e^{\epsilon_i} \cdot \frac{P(x_k^i \mid A)}{P(\bar{x}_k^i \mid A)} \\ & c_j \in \{0, 1\}, \quad p \in \mathcal{P}, \end{aligned}$  (6)

for all $k \in S_i$ and $(x_k^i, \bar{x}_k^i) \in \{0,1,2\} \times \{0,1,2\}$ with $x_k^i \neq \bar{x}_k^i$, where $R_j = R_{j-1} \cup \{p(j-1)\}$. If sharing $p(j)$ would violate ϵ-indirect privacy, $c_j$ must be 0, which means $p(j)$ is not shared. Clearly, (6) is a binary integer programming problem, which is known to be NP-complete [28].

Therefore, instead of directly solving (6), we propose to solve a min-max problem, which shares non-sensitive SNPs in an order such that the attacker’s maximum gain on the sensitive SNPs, i.e.,

$\max_{x_k^i, \bar{x}_k^i \in \{0,1,2\},\, x_k^i \neq \bar{x}_k^i} \frac{P(x_k^i \mid R \cup \{x_j^i = \alpha\},\, A)}{P(\bar{x}_k^i \mid R \cup \{x_j^i = \alpha\},\, A)}, \qquad x_j^i \in I_i \setminus S_i,$

is minimized (here, $\alpha$ denotes the value of the candidate SNP $x_j^i$). The attacker gains the maximum information about a sensitive SNP when the distinguishability between two states of that sensitive SNP is maximized. Specifically, given a $t$-th order Markov chain, we first divide the SNP sequence into disjoint subsequences such that there is only one sensitive SNP at the end of each subsequence. Then, in each subsequence, we process the consecutive non-sensitive SNPs in disjoint groups, each of size $t$, starting from the group closest to the sensitive SNP. Finally, we select a non-sensitive SNP (for processing) by solving

$\hat{x}_j^i = \arg\min_{x_j^i} \left( \max_{x_k^i, \bar{x}_k^i \in \{0,1,2\},\, x_k^i \neq \bar{x}_k^i} \frac{P(x_k^i \mid R \cup \{x_j^i = \alpha\},\, A)}{P(\bar{x}_k^i \mid R \cup \{x_j^i = \alpha\},\, A)} \right).$

The details of this approach and the proposed algorithm for identifying the order of sharing are given in Appendix C.
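A minimal Python sketch of this greedy selection step follows (the `infer_posteriors` callback and the restriction to a single nearby sensitive SNP $k$ are our simplifying assumptions):

```python
# Sketch: one greedy step of the min-max ordering. Among the candidate
# non-sensitive SNPs, pick the one whose release yields the smallest
# worst-case distinguishability ratio for the sensitive SNP k.
def next_snp_min_max(candidates, shared, k, infer_posteriors):
    def max_gain(j):
        post = infer_posteriors(shared | {j}, k)  # posterior of SNP k
        ratios = [post[u] / post[v]
                  for u in (0, 1, 2) for v in (0, 1, 2)
                  if u != v and post[v] > 0]
        return max(ratios) if ratios else float("inf")
    return min(candidates, key=max_gain)
```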

5. EVALUATION

We evaluate our proposed mechanism using a real-life dataset and study the effects of various parameters on both privacy and utility. We also compare the proposed mechanism with the work by Humbert et al. [23], which has a similar goal to ours, and with a local differential privacy (LDP)-based data sharing technique.

We use a dataset that consists of 1000 SNPs belonging to 99 people of Central European ethnicity [6]. Using this dataset, we first simulate the auxiliary information of the attacker. That is, we generate the correlation model on the SNPs of the individuals in the population using (3), and compute the prior probability distributions of the SNPs using their MAF values. In all experiments, (in the attacker’s favor) we assume that the attacker knows and uses the same correlation model that is used in the SNP sharing mechanism. We define the utility as $U = \sum_{j \in I_i} u_j \cdot D_j^i$, where $u_j$ is the utility of SNP $j$ and $D_j^i$ indicates the sharing status of $x_j^i$: $D_j^i = 1$ if $x_j^i$ is correctly shared, $D_j^i = 0$ if $x_j^i$ is hidden, and $D_j^i = -1$ if $x_j^i$ is incorrectly shared (i.e., shared with noise). The proposed mechanism never shares a non-sensitive SNP incorrectly; however, this is possible for LDP-based schemes (as we show in Section 5.7). In the initial experiments, in which we quantify the utility of the proposed algorithm for varying design parameters, we assume the utility of each SNP to be equal (i.e., $u_j = 1$). However, the utility of each SNP can also be determined based on its MAF value. That is, rare SNPs with lower MAF values may be considered to have more utility than the common ones (with higher MAF values). Thus, in Section 5.7, we quantify the utility of a SNP $j$ as $u_j = \frac{1}{\mathrm{MAF}_j}$ when we compare our method with LDP. In addition, in some experiments, we present the normalized utility $U_N = U / (\sum_{j \in I_i} u_j)$; the denominator is equal to $|I_i|$ when all SNPs have equal utility values. We repeat all of the experiments for 10 random individuals and report the average. We study the following parameters that affect privacy and utility; a toy computation of the utility metric is sketched after the list.

  • Correlation model. We study the effect of the correlation model between the SNPs (i.e., Markov chains of different orders) on the inference power of the attacker and on the utility.

  • Privacy parameter. We study the effect of the ϵ parameter in (5) on both privacy and utility.

  • Size of the sensitive SNP set. We study the relationship between the fraction of sensitive SNPs (among all SNPs) and the utility.

  • Attacker’s error and entropy. We study the relationship between the success of the attacker’s inference attack and the utility.

  • Order of sharing. We study the effect of the sharing order of the SNPs on privacy and utility.

  • Kinship relationships. We study the effect of the kinship inference attack [13, 22] on privacy and utility.
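As referenced above, here is a toy Python sketch of the utility metric $U$ and its normalized form $U_N$ (the SNP identifiers are hypothetical):

```python
# Sketch: U = sum_j u_j * D_j with D_j in {1, 0, -1} for correctly
# shared, hidden, and incorrectly shared SNPs; U_N normalizes by the
# total utility weight.
def utility(decisions, weights):
    U = sum(weights[j] * d for j, d in decisions.items())
    return U, U / sum(weights.values())

decisions = {"rs001": 1, "rs002": 0, "rs003": 1}  # shared, hidden, shared
weights   = {"rs001": 1, "rs002": 1, "rs003": 1}  # equal utility values
print(utility(decisions, weights))                # (2, 0.666...)
```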

5.1. Correlation Model and Privacy Parameter

Here, we choose 100 random SNPs as the sensitive ones (out of the 1000 SNPs in the dataset). In Fig. 2, we show the relation between the privacy parameter (ϵ) and the utility for different orders of the Markov chain model (for the correlation between the SNPs). We quantify the utility $U$ using equal weights for the SNPs (i.e., $u_j = 1$); hence, $U$ is equal to the number of shared non-sensitive SNPs. We observe similar patterns when the utility weights are inversely proportional to the MAFs of the SNPs (i.e., $u_j = \frac{1}{\mathrm{MAF}_j}$). As expected, with increasing ϵ value, the average utility increases. Also, with increasing Markov chain order, the utility decreases; in other words, higher-order correlation models improve the inference power of the attacker. We also observe that (i) the results for correlation models with Markov chain orders 3 and 4 overlap and (ii) for correlation models of order higher than 4, the improvement in the inference power of the attacker is negligible (the results in [13] also support this finding).

Figure 2: Relationship between utility ($U$), privacy parameter (ϵ), and the correlation model (i.e., Markov chains of different orders). All SNPs have equal utility values ($u_j = 1$).

5.2. Size of the Sensitive SNP Set

Here, we study the effect of the fraction of sensitive SNPs (among all SNPs) on the utility. For this study, we quantify the normalized utility $U_N$ with equal $u_j$ values for all SNPs; hence, $U_N = U / |I_i|$. In Fig. 3, we show the effect of the sensitive SNP set size on the utility for a Markov chain of order 1 and for different privacy parameters. The x-axis shows the fraction of sensitive SNPs among all SNPs in $I_i$ (i.e., $|S_i|/|I_i|$), varying between 2% and 20%. Although the utility is defined as the fraction of shared SNPs from the non-sensitive SNP set, increasing the fraction of sensitive SNPs decreases the utility. This is because, as the size of the sensitive SNP set increases, more SNPs in the non-sensitive set become correlated with the sensitive SNPs. Also, the improvement in utility gets smaller and the utility converges to a common value as the ϵ value gets closer to 1. We also observe that the decrease in utility is larger for higher-order correlation models, which is consistent with the results in Fig. 2.

Figure 3: Relationship between normalized utility ($U_N$), fraction of sensitive SNPs ($|S_i|/|I_i|$), and privacy parameter (ϵ). A Markov chain of order 1 is used as the correlation model. All SNPs have equal utility values ($u_j = 1$).

5.3. Estimation Error and Entropy

In order to evaluate our proposed SNP sharing mechanism in terms of the attacker’s success in inferring the sensitive SNPs, we use two metrics previously proposed by Humbert et al. [22]: (i) the average distance of the attacker’s estimate from the true values of the sensitive SNPs (i.e., estimation error or incorrectness) and (ii) the entropy (or uncertainty) of the attacker based on the inferred probability distributions of the sensitive SNPs. For the attacker’s incorrectness, we use the following metric:

$E_i = \sum_{x_j^i} P(x_j^i) \cdot \lvert x_j^i - \hat{x}_j^i \rvert, \qquad \forall j \in S_i,\ x_j^i \in \{0, 1, 2\},$  (7)

where $E_i$ is the attacker’s error for individual $i$’s sensitive SNPs. Here, $P(x_j^i)$ is the probability distribution of SNP $j$ of individual $i$ inferred by the attacker as a result of the inference attack (as discussed in Appendix A), and $\hat{x}_j^i$ is the true value of SNP $j$ of individual $i$. For the attacker’s uncertainty, we use the following metric:

$H_i = -\sum_{x_j^i} P(x_j^i) \log\!\left(P(x_j^i)\right), \qquad \forall j \in S_i,\ x_j^i \in \{0, 1, 2\},$  (8)

where $H_i$ is the attacker’s uncertainty (entropy) about individual $i$’s sensitive SNPs.
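The two metrics can be computed per sensitive SNP as in the following Python sketch (the example distribution is hypothetical):

```python
import math

# Sketch: attacker's incorrectness for one sensitive SNP, per (7) --
# the expected L1 distance between candidate values and the truth.
def estimation_error(posterior, true_value):
    return sum(p * abs(x - true_value) for x, p in posterior.items())

# Sketch: attacker's uncertainty for one sensitive SNP, per (8).
def entropy(posterior):
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)

post = {0: 0.2, 1: 0.5, 2: 0.3}              # inferred distribution
print(estimation_error(post, true_value=1))  # 0.2*1 + 0.5*0 + 0.3*1 = 0.5
print(entropy(post))                         # ≈ 1.03 nats
```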

We study the effect of the fraction of sensitive SNPs among all SNPs (i.e., $|S_i|/|I_i|$) on the error and entropy. In Table 2, we show how the attacker’s (average) estimation error changes with different fractions of sensitive SNPs for ϵ = 0.5 (we observed similar patterns for ϵ values between 0.05 and 1). We observe that the attacker’s estimation error generally increases with increasing fractions of sensitive SNPs. The error increases quickly for small fractions of sensitive SNPs and then saturates for larger fractions. Also, the error is higher for higher-order correlation models. This is because we share fewer SNPs for higher-order models (as shown in Fig. 2), and hence higher-order correlation models yield noisier inference results. In Table 3, we show how the attacker’s (average) uncertainty changes with different fractions of sensitive SNPs for ϵ = 0.5. As with the error, we observe that the attacker’s uncertainty generally increases with increasing fractions of sensitive SNPs. Note, however, that as shown in Fig. 3, the utility of the SNP sharing mechanism differs across fractions of sensitive SNPs and correlation models.

Table 2:

Relationship between attacker’s average estimation error and fraction of sensitive SNPs (|Si|/|Ii|) for different correlation models. Privacy parameter (ϵ) is set to 0.5.

                       Fraction of sensitive SNPs
                       5%       10%      15%      20%
Markov chain order 1   0.5646   0.5540   0.5661   0.5774
Markov chain order 4   0.6382   0.6341   0.6526   0.6557

Table 3:

Relationship between attacker’s average uncertainty and fraction of sensitive SNPs (|Si|/|Ii|) for different correlation models. Privacy parameter (ϵ) is set to 0.5.

                       Fraction of sensitive SNPs
                       5%       10%      15%      20%
Markov chain order 1   0.8444   0.8457   0.8492   0.8622
Markov chain order 4   0.9468   0.9460   0.9682   0.9689

5.4. The Order of Sharing

Our proposed algorithm decides whether to share each non-sensitive SNP using the definition of ϵ-indirect privacy. Each non-sensitive SNP is processed by the algorithm sequentially to make the sharing decision. As a result, the sharing (or processing) order of non-sensitive SNPs affects the utility of the algorithm. Here, we show how the order of sharing affects the utility of the shared data. As discussed in Section 4.5, selecting the optimal order (that provides the highest utility) is an NP-complete problem. Thus, in Section 4.5, we proposed an efficient min-max approach that iteratively determines the order by selecting the SNP minimizing the attacker’s maximum gain among $t$ SNPs in each step.

To evaluate the performance of the proposed (min-max) approach, one needs to compare it with the utility resulting from the optimal order of sharing. However, finding the optimal order (that provides the highest utility) requires checking all permutations of non-sensitive SNPs, which is infeasible for hundreds of SNPs. Instead, we implement a suboptimal algorithm that checks all permutations of consecutive non-sensitive SNPs. For instance, we check all permutations of the non-sensitive SNPs before the first sensitive SNP and select the permutation providing the highest utility. Then, all permutations of the non-sensitive SNPs between the first and second sensitive SNPs are checked, and so on. This algorithm provides a utility close to the optimal one because, in our experiments, we observed that the decision to hide a non-sensitive SNP is mostly made because sharing it would increase the attacker’s knowledge about the closest subsequent sensitive SNP. Since this suboptimal algorithm divides the non-sensitive SNPs into disjoint subsequences based on the closest subsequent sensitive SNP (as described in the proposed min-max algorithm in Section 4.5), we expect its utility to be close to the optimal one.

In order to run this suboptimal algorithm in feasible time, we use the first 500 SNPs in the dataset and select 100 sensitive SNPs. We make sure that the sensitive SNPs are equally spaced, and hence we check around 4! permutations of non-sensitive SNPs before each sensitive SNP. Note that the only rationale for this setting is to compare the performance of the proposed (min-max) approach with the optimal order of sharing. In addition to the orders generated by the proposed (min-max) algorithm and the suboptimal algorithm, we also run the SNP sharing algorithm with increasing order of SNP IDs, decreasing order of SNP IDs, and random order. In Table 4, we show our experimental results for Markov chains of orders 1 and 2. In general, we observe that changing the order affects the number of shared SNPs only slightly. Although the proposed (min-max) algorithm provides higher utility than the increasing, decreasing, and random orders, the suboptimal algorithm provides the highest utility in all scenarios. We also observe that the utility difference between the suboptimal algorithm and the proposed min-max algorithm slightly increases for higher Markov chain orders. However, due to its high complexity (checking all permutations of consecutive non-sensitive SNPs), the suboptimal algorithm cannot be used in practical scenarios. For instance, even in this simple experimental setting, the suboptimal algorithm requires checking 4! = 24 permutations for each subgroup, and we observe the run time of the suboptimal algorithm (approximately 4 seconds) to be approximately 24 times that of the min-max approach (approximately 0.2 seconds).

Table 4:

Relationship between the order of sharing, utility, privacy parameter (ϵ), and the correlation model (i.e., Markov chain of different orders (t)). Utility (U) is the number of shared SNPs among 400 non-sensitive SNPs.

                            ϵ
                            0.1     0.2     0.5     1
t = 1, increasing order     247.0   276.1   322.7   347.6
t = 1, random order         247.2   276.5   323.3   347.8
t = 1, decreasing order     247.7   277.3   323.7   347.9
t = 1, proposed order       247.7   277.3   323.7   347.9
t = 1, suboptimal order     247.7   277.3   323.7   347.9

t = 2, increasing order     116.3   164.0   248.3   306.1
t = 2, random order         117.0   166.4   250.4   307.4
t = 2, decreasing order     117.3   168.4   252.5   308.8
t = 2, proposed order       120.2   170.2   256.4   314.0
t = 2, suboptimal order     125.9   175.7   261.3   321.7

5.5. Kinship Relationships

In this section, we evaluate our proposed SNP sharing mechanism by also considering the kin genomic privacy of individuals (as formulated in Section 4.3). Along with kinship relationships and Mendel’s law, we also use higher-order correlations in the DNA, as in [13]. We use the inference attack introduced in Appendix A to compute the posterior probabilities in (5).

For the evaluation, we use a trio (father, mother, and son) from the Manuel Corpas family DNA dataset [4]. We choose 100 neighboring SNPs of the considered family members. We set the size of the sensitive SNP set to 5 for all family members and randomly choose 5 SNPs for each family member to construct their sensitive SNP sets (i.e., the sensitive SNP set of each family member is different). If family members have overlapping sensitive SNPs, those SNPs will not be released, because the donor never shares sensitive SNPs. Since having distinct sensitive SNPs is a more challenging scenario for preserving privacy, we selected the sensitive SNPs randomly. We assume that the son is the donor and use the proposed SNP sharing mechanism to share the non-sensitive SNPs of the son (also considering the genomic privacy of the other family members). We use the same privacy parameter (ϵ) for all family members. We conduct each experiment 100 times. We show the results for utility ($U$) for different ϵ parameters and correlation models in Fig. 4. Since we use equal utility values for each SNP, $U$ is equal to the number of SNPs shared by the donor.

Figure 4: Relationship between the privacy parameter (ϵ) and the utility ($U$) for different correlation models when we also consider kin genomic privacy. All SNPs have equal utility values ($u_j = 1$).

As shown in Fig. 4, the number of SNPs shared by the donor decreases when he also considers his parents’ genomic privacy (i.e., their sensitive SNPs). For instance, when ϵ = 0.1, the donor shares 77 SNPs (averaged over 100 executions) when kinship is considered in the algorithm, whereas he shares 86 SNPs when kinship is not considered. These 9 SNPs are the ones hidden for the genomic privacy of the other family members. In Table 5, we also show the average estimation error of the attacker for estimating the sensitive SNPs of the parents. We observe that considering kinship in the data sharing increases the attacker’s estimation error for all Markov chain orders, because the donor does not share the SNPs that give information about the sensitive SNPs of his family members when he considers kinship. In other words, if kinship is not considered in data sharing, the attacker can estimate the sensitive SNPs of the parents with high probability using Mendel’s law. In this experiment, we assume 5% of the total SNPs to be sensitive for each family member; the utility increases as this fraction decreases.

Table 5:

Relationship between the privacy parameter (ϵ) and the privacy (average estimation error for the sensitive SNPs of the parents) for different correlation models when we also consider kin genomic privacy. Using the shared SNPs of the donor, we compute the attacker’s estimation error for all sensitive SNPs of the donor’s parents. t is the order of the Markov chain.

                         ϵ
                         0.1      0.2      0.5      1
t = 1 with kinship       0.5620   0.5629   0.5605   0.5534
t = 1 without kinship    0.4859   0.4833   0.4798   0.4775
t = 4 with kinship       0.6554   0.6430   0.6296   0.6228
t = 4 without kinship    0.5265   0.5124   0.4964   0.4897

We use first-degree family members in our experiments because correlations become weak beyond immediate family members. To show this, we also conduct the same experiments with another trio (grandfather, grandmother, and son) from the same dataset. We observe that the utility and privacy do not change much when we consider the grandparents’ genomic privacy. For instance, the results for “t = 1 with kinship” are higher than the corresponding results in Fig. 4, merging with the line of “t = 1 without kinship” when ϵ = 1. This means that the child shares more SNPs (close to the case without kinship) when he considers the genomic privacy of his grandparents instead of his parents. Similarly, the estimation error values “with kinship” and “without kinship” are close to each other when the grandparents’ privacy is considered. These results show that considering kinship is important for first-degree family members and that the effect of kinship on utility and privacy decreases significantly beyond first-degree family members.

5.6. Comparison With Previous Work

We compare our proposed mechanism with Humbert et al.’s work [23], which has a similar goal to ours. Humbert et al. propose a SNP sharing mechanism that formulates the problem as an optimization problem, in which the goal is to maximize the utility while considering the privacy constraints of the donor and her family members. We compare our proposed mechanism with [23] (hereafter referred to as the “optimization-based technique”) first without considering the kinship relationships between individuals, and then by also considering kinship.

For the donor $i$, we randomly choose 50 SNPs among the 1000 SNPs ($I_i$) to construct her sensitive SNP set ($S_i$). As discussed before, our mechanism does not share the SNPs of individual $i$ in $S_i$; thus, for sharing, we only consider the SNPs in $I_i \setminus S_i$. To check the privacy constraints, we consider the SNPs in $S_i$. Like [23], we assume all of the SNP utilities ($u_j$) and sensitivities to be equal. Therefore, $U$ is equal to the number of shared SNPs.

We show the results of the comparison in terms of estimation error and utility in Fig. 5, and in terms of entropy and utility in Fig. 8 in Appendix E. To be consistent with [23], we use a Markov chain of order 1 as the correlation model (we obtained similar patterns with other correlation models). We observe that, to achieve the same utility, the estimation error and the entropy provided by our proposed mechanism are significantly higher than those of the optimization-based method. On average, for the same utility, our mechanism provides 16% higher error and 18% higher entropy, which also means higher privacy. Moreover, the optimization-based mechanism always shares the SNPs that increase (or do not change) the estimation error (and entropy) for the sensitive SNPs and hides the ones that decrease the error (and entropy). Thus, when a particular SNP $j$ of the donor is hidden as a result of the optimization-based mechanism, the attacker can infer the value of that hidden SNP, knowing that the actual value of that SNP reduces the error (and entropy) of the SNPs in the sensitive set. On the other hand (as we have also shown via the toy example in Appendix B), when deciding whether or not to share a SNP $j$, our proposed mechanism checks the change of the probability distributions of all the sensitive SNPs (regardless of the actual values of the SNPs), and if any of them violates (5), it does not share the corresponding SNP. Thus, our decision to share (or not share) a SNP does not provide extra information to the attacker. In Appendix D, we also show the robustness of our proposed mechanism against an attack utilizing this auxiliary knowledge.

Figure 5: Error vs. utility for the proposed SNP sharing mechanism and the optimization-based technique when the kinship relationships between the individuals are not considered. The top x-axis shows the privacy parameter used for the proposed SNP sharing mechanism. The privacy tolerance values of individuals in [23] vary between 0 and 20.

Next, to compare the two schemes while also considering kinship relationships, we use a trio (father, mother, and son) from [4] (similar to Section 5.5). We choose 100 neighboring SNPs, set the size of the sensitive SNP set to 20 for all family members, and randomly choose 20 SNPs for each family member to construct their sensitive SNP sets. We use both the proposed mechanism and the optimization-based mechanism to share the non-sensitive SNPs of the son (also considering the genomic privacy of the mother and father). For the proposed mechanism, we use the same privacy parameter (ϵ) for all family members. Overall, for the same utility, we observe similar trends for entropy and closer estimation error values for both schemes (we do not include the results due to space constraints and because the trend is similar to the previous experiment). As before, when we also utilize the additional auxiliary information about the decisions of the donor in the inference attack, we observe that the attacker’s estimation error remains almost the same for the proposed mechanism; however, it decreases to almost 0 for the optimization-based mechanism. This again shows the robustness of the proposed mechanism.

5.7. Comparison With Local Differential Privacy

Since traditional DP protects the privacy of individuals while sharing aggregate statistics about databases, it is not directly comparable with our proposed scheme. As discussed in Section 3.2, local differential privacy (LDP) is a concept to preserve individual privacy during data sharing. The main reasons why LDP is not suitable for genomic data sharing are that (i) mechanisms satisfying LDP are based on adding noise to data points and (ii) LDP does not consider correlations in data, which may cause additional privacy leakage (even if LDP is satisfied). In addition, LDP is mainly used for estimating statistics about a population. Despite these limitations, LDP can be considered an alternative to our proposed mechanism, and hence we compare our approach with an LDP-based mechanism. For that, we implement the randomized response mechanism from Section 3.2. We apply the mechanism in two ways: (LDP1) each individual shares only her non-sensitive SNPs after perturbing them, and (LDP2) each individual shares all of her SNPs (both sensitive and non-sensitive) after perturbing them. The value of each SNP is reported correctly with probability $p = \frac{e^\epsilon}{e^\epsilon + 2}$ and each incorrect value is reported with the same probability $q = \frac{1}{e^\epsilon + 2}$. We use the same evaluation settings as in Section 5.6.

To evaluate the privacy of the LDP-based mechanism, we compute the attacker’s average estimation error for inferring the sensitive SNPs. When individuals share only their non-sensitive data points after perturbation (LDP1), the attacker conducts the inference attack using the perturbed versions of all non-sensitive SNPs. When individuals share all of their SNPs (LDP2), the error is computed based on the perturbed sensitive SNPs. In Fig. 6, we compare the proposed method with the LDP-based mechanism in terms of average estimation error for different values of ϵ (higher error means better privacy). For smaller ϵ values, LDP provides better privacy since the randomness in the reported SNPs is high. However, the utility of LDP is then significantly low, since most of the SNPs are reported incorrectly for smaller ϵ values, as discussed next. Note that the attacker can also partially eliminate the added noise in the LDP-based mechanism, since LDP does not consider correlations. To show this vulnerability of the LDP-based scheme, we perform an attack that exploits the correlations in the data. Briefly, we check the correlation probabilities and MAF values for the shared SNPs. For a shared SNP, if the probability of observing the shared value is less than a threshold (computed based on correlations and MAF), we change the value of the shared SNP and use the new value for the error calculation. As shown in Fig. 6, the attacker’s estimation error decreases after performing this attack.
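The following Python sketch illustrates the flavor of this denoising attack (the `prob_model` callback and the fixed threshold are our simplifications; the paper derives the threshold from correlations and MAF):

```python
# Sketch of the correlation attack on LDP-perturbed SNPs: if the
# reported value of a SNP is too unlikely under the correlation/MAF
# model, replace it with the model's most likely value.
def correlation_attack(reported_seq, prob_model, threshold=0.05):
    cleaned = list(reported_seq)
    for i, v in enumerate(reported_seq):
        dist = prob_model(i, reported_seq)   # distribution over {0, 1, 2}
        if dist[v] < threshold:
            cleaned[i] = max(dist, key=dist.get)
    return cleaned
```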

Figure 6: Comparison of the proposed method and the LDP-based data sharing mechanism in terms of privacy (attacker’s average estimation error) for different values of ϵ (higher error means better privacy). $t$ represents the order of the Markov chain used for the correlation model. In LDP1, only non-sensitive SNPs are shared after perturbation. In LDP2, all SNPs are shared after perturbation. The correlation attack is performed for LDP1 and LDP2.

In the randomized response mechanism, approximately $\frac{e^\epsilon}{e^\epsilon + 2}$ of the SNPs are shared correctly and $\frac{2}{e^\epsilon + 2}$ of them are shared incorrectly. Reporting incorrect (noisy) values is not preferred in genomic data sharing since it may degrade the utility of the data significantly and may lead to incorrect findings. Hence, sharing an incorrect value decreases the utility, as reflected in the definition of $U$ (i.e., $D_j^i = -1$ for incorrectly shared SNPs). As opposed to the proposed algorithm, the LDP-based mechanism (LDP2 in our experiments) enables sharing of sensitive SNPs as well. Thus, to quantify the informativeness and value of the shared information, here we quantify the utility of a SNP $x_j$ as $u_j = \frac{1}{\mathrm{MAF}_j}$ and compute the normalized utility as $U_N = U / (\sum_{j \in I_i} u_j)$. As shown in Fig. 7, our proposed mechanism provides better utility than the LDP-based mechanism for all ϵ values, because the high randomness in LDP causes low utility.

Figure 7:

Comparison of the proposed method and the LDP-based data sharing mechanism in terms of utility $U_N$ for different values of ϵ. The utility of each SNP is $u_j = 1 - \mathrm{MAF}_j$; t is the order of the Markov chain. In LDP1, only non-sensitive SNPs are shared after perturbation. In LDP2, all SNPs are shared after perturbation.

The only advantage of the LDP-based mechanism over our proposed method is that it shares sensitive attributes after perturbation. Our utility definition U is defined over all SNPs because, in medical research, it is not viable to define utility only over the sensitive SNPs: they can be different for every individual. There is no common set of sensitive SNPs spanning all individuals, since each individual may prefer to hide SNPs associated with different diseases. When we quantify the utility based only on sensitive SNPs, the utility of the proposed method and LDP1 is always 0. The utility of LDP2 is similar to its utility shown in Fig. 7; that is, it starts with a negative utility for smaller ϵ values and exceeds 0 when ϵ > 0.7. However, as mentioned before (Fig. 6), the privacy of LDP2 after the correlation attack is much lower than that of the proposed method for almost all ϵ values. In conclusion, we observe that for values of ϵ that provide high data utility, the proposed scheme outperforms the LDP-based mechanism in terms of both privacy and utility.

The LDP-based SNP sharing mechanism is also vulnerable to kinship attacks, since it does not consider familial relationships in data sharing. To show this vulnerability, we implement a kinship attack using the same experimental setting as in Section 5.5. The attacker first performs the correlation attack described in this section, and then, using the result of this attack and Mendel’s law, infers the sensitive SNPs of the donor’s parents. Since we have already shown (in Fig. 6) that the privacy provided by LDP2 is significantly lower, here we only compare the proposed scheme with LDP1 (in which only the non-sensitive SNPs are perturbed and shared). We show our results in Table 6. We observe that LDP is vulnerable to kinship attacks and that our proposed method provides better utility and privacy than LDP (we omit the utility results of this experiment due to space constraints).

Table 6:

Comparison of the proposed method and LDP-based data sharing mechanism in terms of privacy when the kinship relationships are also considered. Using the shared SNPs of the donor and Mendel’s law, we compute the attacker’s estimation error for all sensitive SNPs of the donor’s parents. t is the order of the Markov chain. Only non-sensitive SNPs are shared in LDP1. Kinship attack is performed for LDP1.

                          ϵ = 0.1   ϵ = 0.2   ϵ = 0.5   ϵ = 1
Proposed method, t = 1    0.5620    0.5629    0.5605    0.5534
Proposed method, t = 4    0.6554    0.6430    0.6296    0.6228
LDP1                      0.5805    0.5662    0.5543    0.5468
LDP1 after attack         0.5040    0.5016    0.4944    0.4934

In our experiments, we mainly used the attacker’s average estimation error to quantify privacy. This metric is accepted as one of the most informative for quantifying genomic privacy [38]. The attacker’s success can also be quantified in terms of accuracy in correctly guessing the value (state) of a SNP (assuming the attacker guesses the state with the highest probability). On the other hand, as mentioned before, rare SNPs with lower MAF values may be considered more valuable to the attacker. Hence, we finally compare the proposed mechanism with the LDP-based mechanism in terms of the attacker’s accuracy in correctly guessing the values of the donor’s rare SNPs. We consider all SNPs with minor allele frequency (MAF) less than 0.1 as rare SNPs and quantify the attacker’s accuracy as the ratio of correctly guessed rare SNPs to the total number of rare SNPs. When we select 20% of SNPs as sensitive, we observe an attacker accuracy of 0 for the proposed method for ϵ ≤ 1. For the LDP-based mechanism that only shares non-sensitive SNPs (LDP1), the attacker’s accuracy is 0.0863 for ϵ = 0.1 and 0.1218 for ϵ = 1. Therefore, we conclude that our method provides higher privacy for rare SNPs as well.
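For concreteness, this metric can be computed as follows (a small sketch of the ratio described above; function and argument names are ours):

```python
def rare_snp_accuracy(guesses, truth, maf, rare_cutoff=0.1):
    """Attacker's accuracy restricted to rare SNPs (MAF < rare_cutoff):
    the fraction of rare SNPs whose state is guessed correctly."""
    rare = [j for j, m in enumerate(maf) if m < rare_cutoff]
    if not rare:
        return 0.0
    hits = sum(1 for j in rare if guesses[j] == truth[j])
    return hits / len(rare)
```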

6. DISCUSSION

Here, we discuss the proposed SNP sharing mechanism in terms of its functionality/practicality and robustness. We also discuss our assumptions about the attacker.

6.1. Functionality and Practicality

The proposed mechanism enables privacy-preserving sharing of genomic data in practice: a donor, depending on the entropy of the SNPs in her sensitive SNP set (not the values of those SNPs), may select a privacy parameter (ϵ) and share her non-sensitive SNPs with a service provider accordingly. It is important to note that the actual values of the SNPs in the sensitive set are not required for the sharing process (i.e., to check the condition in (5)). We foresee that a system implementing the proposed algorithm can list diseases, let donors mark the diseases they consider sensitive before data sharing, and automatically flag the SNPs related to the selected diseases as sensitive. Thus, donors are not expected to know which SNPs relate to which diseases.

In this work, we assume the strongest inference attacks available today. It may be argued that with new discoveries in genomics, the parts (non-sensitive SNPs) revealed today may provide more information about the sensitive SNPs in the future. However, this is a common drawback of the public availability of sensitive data in general, and our work is a first step towards addressing it for genomic data. It may also be argued that, when kinship relationships are considered, a sharing decision may require coordination between family members, creating a burden for the practicality of the proposed scheme. However, due to the nature of genomic data, such factors have to be considered when sharing genomic data, and we foresee that such coordination can be automated in the future. Moreover, the proposed mechanism does not need coordination across the whole extended family, because correlation between family members drops significantly beyond immediate family connections.

When we do not consider kinship relationships between individuals, the time complexity to share a donor’s SNP sequence of size n (of which m SNPs are in the sensitive set) is $O((n-m)\,n\,3^t + (n-m)\,m)$, where t is the order of the Markov chain used for the correlation model. When we also consider f of the donor’s family members during this process, the time complexity becomes $O((n-m)\,n^2\,3^t f + (n-m)\,m)$. Thus, the time complexity scales quadratically (or cubically when kinship is considered) with the number of SNPs to be shared by the donor, and exponentially with the order t of the Markov chain. However, as we discussed and showed via simulations (e.g., in Fig. 2), for correlation models of order higher than 4, the improvement in the inference power of the attacker is negligible, and hence we can treat the term $3^t$ as a constant. Furthermore, the complexity of the min-max approach is $O(3^t)$, so the computational complexity to check all non-sensitive SNPs is $O((n-m)\,m\,3^t)$, which is linear in the length of the SNP sequence. Considering that the mechanism does not need to run in real time, these complexities are reasonable for practical use.

6.2. Robustness

Our results illustrate strong scenarios in terms of the attacker’s power. We build a correlation model and compute the prior probability distributions of the SNPs to check the condition in (5) during SNP sharing. We use a population that is consistent with the donor’s to build these models, and we assume the attacker has access to the same population, which also includes the victim (donor). In reality, the attacker may use a similar (but not identical) population for its inference attack, so its estimation error would be higher than what we report in the evaluation.

The attacker knows that when a SNP is not revealed, the condition in (5) is not satisfied for at least one sensitive SNP. Note that (5) is computed regardless of the real values of the sensitive SNPs, but considering the real value of the non-sensitive SNP that the donor is considering to share. Therefore, upon observing a “hide” decision from the donor, the attacker can compute (5) for the different values (0, 1, and 2) of the unrevealed SNP to infer its actual value. We considered this attack, evaluated it empirically, and observed that it is ineffective against the proposed mechanism.

The proposed SNP sharing mechanism considers the fractional change in the probability distribution over all possible states of the sensitive SNPs (not the actual values of the sensitive SNPs of the donor and her family members). Therefore, not sharing a SNP from the non-sensitive SNP set does not mean that sharing that SNP would reduce the estimation error and entropy of the attacker about the SNPs in the sensitive set. In fact, as shown via a toy example in Appendix B, a SNP that the mechanism hides may be one whose disclosure would actually increase the estimation error (and entropy) of the attacker for the sensitive SNPs. Thus, the attacker cannot gain extra information by observing which SNPs are hidden by the mechanism. As another consequence of this property, for the proposed SNP sharing mechanism, the attacker’s estimation error and entropy do not monotonically decrease with an increasing privacy parameter (i.e., increasing ϵ value or increasing privacy budget for the donor). In Fig. 10 (in Appendix E), we show the variation of estimation error and entropy with increasing privacy parameter. On the contrary, in Humbert et al.’s work [23], a SNP is shared only if it does not decrease the estimation error (and entropy) of the attacker. Also, in [23], SNPs shared due to an increase in the privacy budget always cause a monotonic decrease in both the estimation error and the entropy of the attacker. With this knowledge, the attacker can actually infer the values of the SNPs that the mechanism decides to hide. Our sharing mechanism is robust against this attack.

6.3. Attacker Assumptions

In the definition of ϵ-indirect privacy, the correlations between the SNPs (i.e., conditional probabilities) are calculated by assuming the auxiliary information A is publicly known. In our genomic data sharing scenario, we consider only public statistics and information about family bonds as the auxiliary information A. By doing so, we apply a hypothetical upper bound on the attacker’s auxiliary information, and the proposed mechanism makes sharing decisions accordingly. Therefore, the proposed mechanism can be considered attacker-dependent. Here, we discuss the consequences (in terms of anticipated and actual privacy and utility) of the attacker having less or more information than we assume.

If the attacker knows less than we assume (e.g., it does not use the correlations between the SNPs), its attack will be less successful. If we knew that the attacker does not know the correlations, we would set A accordingly and share all the non-sensitive SNPs, because the attacker could not infer sensitive SNPs from non-sensitive ones. If the attacker knows more than we assume (e.g., the states of some sensitive SNPs), its attack will be more successful. On the other hand, if we knew that the attacker knows some sensitive SNPs, we would set A accordingly and share more non-sensitive SNPs, because we would not need to hide non-sensitive SNPs that can only be used to infer sensitive SNPs the attacker already knows. In both cases, due to the attacker-dependency of the proposed mechanism, there is a decrease from the optimal utility that could be achieved if the mechanism knew the attacker’s exact auxiliary information. To show this effect, we also conducted experiments assuming 20% of the sensitive SNPs are known by the attacker. When ϵ = 0.5, we observed that under our default assumption (that the attacker does not know any sensitive SNPs), the proposed mechanism shares 831 non-sensitive SNPs and the estimation error is 0.5540. When we instead include the attacker’s knowledge of 20% of the sensitive SNPs in A, the proposed mechanism shares 841 of the 900 non-sensitive SNPs and the estimation error is 0.4455. These results verify that the attacker-dependency of the proposed mechanism causes a slight decrease from the optimal utility.

7. CONCLUSION AND FUTURE WORK

We have proposed a privacy-preserving genomic data sharing mechanism inspired by differential privacy. Our method keeps an attacker’s knowledge about the sensitive parts of individuals’ genomes within a bound while keeping genomic data publicly available. The proposed mechanism considers both individual and interdependent genomic privacy; that is, when a donor shares her genomic data, both her and her family members’ genomic privacy are protected. One notable feature of the proposed scheme is that it selectively shares the SNPs of a donor without considering the real values of her sensitive SNPs, which prevents the attacker from launching inference attacks based on the sharing decisions themselves. We have studied and discussed the effects of different parameters on both the utility and the privacy of the proposed mechanism, and shown that it outperforms previous work in terms of both privacy and utility. As future work, we will explore more scenarios involving different kinship relationships, such as (i) the situation in which some family members have already revealed some of their SNPs and (ii) the practicality of the proposed mechanism for an extended family (e.g., which family members to consider and how far to navigate in a family tree during the SNP sharing process).

CCS CONCEPTS.

• Applied computing → Genomics; • Security and privacy → Privacy protections

ACKNOWLEDGMENTS

Research reported in this publication was supported by the National Library Of Medicine of the National Institutes of Health under Award Number R01LM013429.

Appendices

A. INFERENCE ATTACK ON KIN GENOMIC PRIVACY

To be robust against the strongest attacks in the literature, we consider an attacker with background knowledge about the correlation model on the DNA and the family relationships between individuals. Here, we briefly describe the inference attack on kin genomic privacy proposed in [22] and later improved in [13]. The attacker has access to the following resources: (i) publicly available genomic datasets belonging to different populations [5], (ii) the family tree and family relationships of a target family, and (iii) genomic data (partial or whole) shared by a subset of the target family members. Beyond these resources, the attacker uses Mendel’s law (of inheritance) and high-order correlations between the SNPs [33].

The goal of the attacker is to infer the missing parts of the genomes of the family members (or a target individual in the family) using a message passing algorithm (belief propagation [29, 32]) on

Table 7:

The example population including 3 SNPs of 6 individuals.


      i1   i2   i3   i4   i5   i6
x1    0    1    2    1    0    2
x2    0    0    0    0    1    1
x3    0    1    0    0    1    2

Table 8:

Prior probability distributions of SNPs (in Table 7) computed using their MAF values.


                       P(x1)   P(x2)   P(x3)
Homozygous major (0)   0.33    0.67    0.50
Heterozygous (1)       0.33    0.33    0.33
Homozygous minor (2)   0.33    0.00    0.17

a graphical model (factor graph). A factor graph is a bipartite graph with two sets of nodes: variable nodes, which represent the SNPs of family members, and factor nodes, which represent the dependencies between the attacker’s resources and the variable nodes. In this setting, the factor nodes represent: (i) familial relationships (Mendel’s law) between family members, (ii) high-order correlations between SNPs in the genome, and (iii) genotype-phenotype relationships between the SNPs and the physical characteristics of individuals. The nodes of the factor graph are connected via edges (depending on the relationships between them), through which they iteratively exchange messages. At the beginning, each variable node has its own belief about the marginal probability distribution of the corresponding SNP (computed using the MAF values). Then, the iterative algorithm starts, and at each round, nodes generate and send messages (in the form of conditional probabilities) to their neighbors until the marginal probability distributions of the variable nodes converge.
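As an illustration of the kinship factor in (i), the following minimal sketch (our own, not part of the attack implementations in [13, 22]) computes P(child genotype | parent genotypes) under Mendel’s law, with genotypes encoded as minor-allele counts:

```python
import itertools

def mendel_prob(child, mother, father):
    """P(child genotype | parent genotypes) under Mendel's law.

    Genotypes are encoded as the number of minor alleles (0, 1, 2).
    Each parent passes a minor allele with probability genotype / 2.
    """
    pm = [1 - mother / 2, mother / 2]  # P(allele from mother is major/minor)
    pf = [1 - father / 2, father / 2]  # P(allele from father is major/minor)
    prob = 0.0
    for am, af in itertools.product((0, 1), repeat=2):
        if am + af == child:
            prob += pm[am] * pf[af]
    return prob

# e.g., both parents heterozygous (1): child is 0/1/2 w.p. 0.25/0.5/0.25
assert abs(mendel_prob(1, 1, 1) - 0.5) < 1e-9
```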

B. A TOY EXAMPLE FOR SNP SHARING

Here, we provide a toy example of the proposed mechanism to discuss some common scenarios that may arise. Notably, we show that the sharing decision for a particular SNP is independent of the actual values of the donor’s (and family members’) sensitive SNPs. Therefore, the reason for not sharing a SNP is not necessarily a decrease in the estimation error of the attacker when that SNP is shared, and hence the attacker cannot infer the actual values of the sensitive SNPs from the donor’s decisions.

Assume we have the population shown in Table 7, which consists of 6 individuals ($i_1, \ldots, i_6$). For simplicity, in this example, we do not consider kinship information between the individuals. The SNP set has three SNPs, $I^i = \{x_1^i, x_2^i, x_3^i\}$, and the sensitive set is $S^i = \{x_3^i\}$ for all individuals.

The attacker’s auxiliary information consists of the MAF values of the SNPs and the correlation model between the SNPs. In Table 8, we show the prior probability distribution of each SNP $x_j$ ($j \in \{1,2,3\}$), computed using its MAF value. We assume that the attacker uses a first-order Markov chain to model the correlations between the SNPs; that is, the attacker computes $P(x_j \mid x_{j-1})$ by using (3). We show these correlation values (computed from the SNP sequences in Table 7) in Table 9.

We set the privacy parameter ϵ = 0.3 for all individuals. We consider the sharing of $x_1^i$ for different individuals in the example

Table 9:

Correlation model between the SNPs for the first-order Markov chain, computed by using (3). The first column shows the different states of sequential SNPs, and the remaining columns show the probabilities. In this example, we do not consider any SNPs before $x_1$, and hence all states of $x_1$ are equally likely in the correlation model.


                     P(x1)   P(x2 | x1)   P(x3 | x2)
xj = 0, xj−1 = 0     0.33    0.50         0.75
xj = 0, xj−1 = 1     0.33    1.00         0.00
xj = 0, xj−1 = 2     0.33    0.50         0.00
xj = 1, xj−1 = 0     0.33    0.50         0.25
xj = 1, xj−1 = 1     0.33    0.00         0.50
xj = 1, xj−1 = 2     0.33    0.50         0.00
xj = 2, xj−1 = 0     0.33    0.00         0.00
xj = 2, xj−1 = 1     0.33    0.00         0.50
xj = 2, xj−1 = 2     0.33    0.00         0.00

population. For that, we compute how this disclosure changes the probability distribution of the sensitive SNP $x_3^i$ and check whether the change violates (5). Three different cases (or a combination of them) may arise for the left-hand side of (5). Sharing $x_1^i$ may change:

  1. the ratio between states zero and one of $x_3^i$.

  2. the ratio between states one and two of $x_3^i$.

  3. the ratio between states two and zero of $x_3^i$.

For cases (1) and (2), consider individual 4 ($i_4$) as the donor. The prior probability distribution of $x_3^4$ is shown in Table 8. We can compute the effect of sharing $x_1^4$ (whose true value is 1, per Table 7) on the posterior probability distribution of $x_3^4$ as follows:

$P(x_3^4 \mid R^4, A) \propto \sum_{x_2^4} P(x_3^4 \mid x_2^4)\, P(x_2^4 \mid x_1^4 = 1)\, P(x_1^4 = 1), \quad x_3^4 \in \{0,1,2\}$
$P(x_3^4 = 0 \mid R^4, A) \propto \sum_{x_2^4} P(x_3^4 = 0 \mid x_2^4)\, P(x_2^4 \mid x_1^4 = 1)\, P(x_1^4 = 1) = \tfrac{1}{4}$
$P(x_3^4 = 1 \mid R^4, A) \propto \sum_{x_2^4} P(x_3^4 = 1 \mid x_2^4)\, P(x_2^4 \mid x_1^4 = 1)\, P(x_1^4 = 1) = \tfrac{1}{12}$
$P(x_3^4 = 2 \mid R^4, A) \propto \sum_{x_2^4} P(x_3^4 = 2 \mid x_2^4)\, P(x_2^4 \mid x_1^4 = 1)\, P(x_1^4 = 1) = 0$

For case (1), comparing states $x_3^4 = 0$ and $x_3^4 = 1$ in (5), we have:

$\frac{P(x_3^4 = 0 \mid R, A)}{P(x_3^4 = 1 \mid R, A)} = 3 \;>\; e^{0.3}\, \frac{P(x_3^4 = 0)}{P(x_3^4 = 1)} = 2.02$

Thus, the condition in (5) is violated, and hence the proposed mechanism does not reveal $x_1^4$. As shown, regardless of the real value of $x_3^4$, the condition in (5) may be violated due to several different cases. Therefore, not sharing $x_1^4$ does not provide the attacker with additional information about the actual value of $x_3^4$.
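This computation can be verified numerically. The sketch below (our own illustration) reproduces the posterior of $x_3^4$ and the check of condition (5) from Tables 8 and 9:

```python
import math

# Prior of the sensitive SNP x3 (Table 8); exact fractions behind the
# rounded values 0.50, 0.33, 0.17.
prior_x3 = [3 / 6, 2 / 6, 1 / 6]

# First-order correlations from Table 9.
p_x2_given_x1_is_1 = [1.00, 0.00, 0.00]  # P(x2 | x1 = 1)
p_x3_given_x2 = [
    [0.75, 0.25, 0.00],  # P(x3 | x2 = 0)
    [0.00, 0.50, 0.50],  # P(x3 | x2 = 1)
    [0.00, 0.00, 0.00],  # P(x3 | x2 = 2): state never observed
]

# Posterior of x3 after donor i4 shares x1 = 1 (up to the constant P(x1 = 1)).
posterior_x3 = [
    sum(p_x3_given_x2[x2][x3] * p_x2_given_x1_is_1[x2] for x2 in range(3))
    for x3 in range(3)
]

eps = 0.3
lhs = posterior_x3[0] / posterior_x3[1]          # posterior ratio = 3
rhs = math.exp(eps) * prior_x3[0] / prior_x3[1]  # e^0.3 * 1.5 ≈ 2.02
print(lhs, rhs, "(5) violated" if lhs > rhs else "(5) holds")
```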

To illustrate case (3), we can pick individual 3 as the donor and repeat similar computations. In this case, comparing states $x_3^3 = 2$ and $x_3^3 = 0$ violates (5). Therefore, even though sharing $x_1^3$ increases the estimation error of the attacker for the value of the sensitive SNP $x_3^3$, due to the violation of (5), our mechanism does not share $x_1^3$. Thus, the attacker cannot infer further information about the value of $x_3^3$ from this decision. From this example, we draw the following conclusions:

  • Since we do not consider the real values of the SNPs in the sensitive set while computing (5), each of the aforementioned cases (or even a combination of them) may occur, and hence the attacker cannot know which violation is the reason for not sharing a particular SNP. In Section 5.6, we show that not considering the real values of the sensitive SNPs also increases the utility of the shared data.

  • Sharing a SNP may decrease or increase the estimation error of the attacker for the sensitive SNPs. Similarly, the decision may either decrease or increase the entropy of the sensitive SNPs. Thus, the attacker cannot infer the real values of the sensitive SNPs from the sharing decisions our mechanism makes. We discuss this further in Section 6.2.

C. THE ALGORITHM FOR SELECTING THE SNP TO BE PROCESSED

We consider sharing the non-sensitive SNPs in an order such that the attacker’s maximum knowledge gain on a sensitive SNP $x_k^i$ (i.e., $\max_{x_k^i \neq x_k^{i\prime}} P(x_k^i \mid R \cup x_j^i, A) / P(x_k^{i\prime} \mid R \cup x_j^i, A)$) is minimized. In general, for a t-th order Markov chain, if we consider a non-sensitive SNP $x_j^i$, the Markov chain is constructed starting from $x_j^i$; thus, we have

$P(x_k^i \mid R \cup x_j^i = \alpha, A) = \sum_{x_p^i \in S} P(x_k^i \mid x_{k-1}^i, \ldots, x_{k-t}^i) \times P(x_{k-1}^i \mid x_{k-2}^i, \ldots, x_{k-t-1}^i) \times \cdots \times P(x_{j+t+1}^i \mid x_{j+t}^i, \ldots, x_{j+1}^i) \times P(x_{j+t}^i \mid x_{j+t-1}^i, \ldots, x_j^i = \alpha),$ (9)

where the true value of $x_j^i$ is $\alpha$, and $S = \{x_p^i \in \{0,1,2\} : x_p^i \notin R,\ j+1 \leq p \leq k-1\}$.

Specifically, we formulate the following min-max problem to determine which non-sensitive SNP to process each time:

$\hat{x}_j^i = \arg\min_{x_j^i} \Big( \max_{x_k^i, x_k^{i\prime} \in \{0,1,2\},\, x_k^i \neq x_k^{i\prime}} \frac{P(x_k^i \mid R \cup x_j^i = \alpha, A)}{P(x_k^{i\prime} \mid R \cup x_j^i = \alpha, A)} \Big).$ (10)

We will show how to solve (10) by reducing S in a two-stage approach. First, we investigate the inner part of (10) and have

$\max_{x_k^i \neq x_k^{i\prime}} \frac{P(x_k^i \mid R \cup x_j^i = \alpha, A)}{P(x_k^{i\prime} \mid R \cup x_j^i = \alpha, A)} \overset{(a)}{=} \max_{x_k^i \neq x_k^{i\prime}} \frac{\sum_{x_p^i \in S} P(x_k^i \mid x_{k-1}^i, \ldots, x_{k-t}^i) \cdots P(x_j^i = \alpha)}{\sum_{x_p^i \in S} P(x_k^{i\prime} \mid x_{k-1}^i, \ldots, x_{k-t}^i) \cdots P(x_j^i = \alpha)} \overset{(b)}{\leq} \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_1} \frac{P(x_k^i \mid x_{k-1}^i, \ldots, x_{k-t}^i)}{P(x_k^{i\prime} \mid x_{k-1}^i, \ldots, x_{k-t}^i)} = G_k(t),$ (11)

where (a) is obtained by plugging in (9), and (b) holds because $\frac{\sum_i a_i}{\sum_i b_i} \leq \max_i \frac{a_i}{b_i}$ for any positive sequences $a_i$ and $b_i$. Also, $S_1 = \{x_p^i \in \{0,1,2\} : x_p^i \notin R,\ k-t \leq p \leq k-1\} \subseteq S$, and $G_k(t)$ is interpreted as the upper bound on the attacker’s knowledge gain on $x_k^i$ when a t-th order Markov chain is considered. Then, we need to select the $x_j^i$ that reduces the value of $G_k(t)$ as much as possible. Obviously, if $j < k-t$, $x_j^i$ has no effect on $G_k(t)$. However, if $j = k-t$, we have

$\max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_1,\, x_{k-t}^i = \alpha} \frac{P(x_k^i \mid x_{k-1}^i, \ldots, x_{k-t}^i = \alpha)}{P(x_k^{i\prime} \mid x_{k-1}^i, \ldots, x_{k-t}^i = \alpha)} \leq G_k(t),$

because the state of $x_{k-t}^i$ that maximizes $G_k(t)$ may not be its true value $\alpha$. Thus, by selecting $x_j^i = x_{k-t}^i$, $G_k(t)$ can potentially be reduced; we formally state this intermediate conclusion in the following proposition.

Proposition C.1.

Given a sensitive SNP $x_k^i$ and a t-th order Markov chain, the upper bound on the attacker’s knowledge gain ($G_k(t)$) obtained by selecting a non-sensitive SNP $x_j^i$ with $j < k-t$ is always higher than or equal to that obtained by selecting a non-sensitive SNP $x_j^i$ with $j = k-t$.

Next, we discuss whether choosing $x_j^i$ to the right of $x_{k-t}^i$, i.e., $k-t < j \leq k-1$, can further reduce the value of $G_k(t)$. First, we rewrite $G_k(t)$ in (11) as

$G_k(t) = \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_1} \frac{P(x_k^i, x_{k-1}^i, \ldots, x_{k-t}^i)}{P(x_k^{i\prime}, x_{k-1}^i, \ldots, x_{k-t}^i)} \overset{*}{=} \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_1} \frac{P(x_k^i \mid x_p^i)}{P(x_k^{i\prime} \mid x_p^i)},$ (12)

where $*$ is due to the factorization rule of the joint probability of a Markov chain with order t. Consider another set $S_2 = \{x_p^i \in \{0,1,2\} : x_p^i \notin R,\ k-m \leq p \leq k-1,\ 1 \leq m \leq t\} \subseteq S_1$, and suppose the true value of $x_{k-m}^i$ is $\beta$. Then we have

$\max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_2,\, x_{k-m}^i = \beta} \frac{P(x_k^i \mid x_p^i)}{P(x_k^{i\prime} \mid x_p^i)} \leq \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_2} \frac{P(x_k^i \mid x_p^i)}{P(x_k^{i\prime} \mid x_p^i)}.$

As a result, we can obtain the solution to (10) via minimizing (12) as

$\hat{x}_j^i = \arg\min_{x_{k-m}^i} \Big( \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_2,\, x_{k-m}^i = \beta} \frac{P(x_k^i \mid x_p^i)}{P(x_k^{i\prime} \mid x_p^i)} \Big),$

which, in practice, can be obtained by Algorithm 2. Once we share a non-sensitive SNP $x_j^i$, we update $R = R \cup \{x_j^i\}$ and run Algorithm 2 to determine the next SNP to share.

As mentioned in Section 4.5, given a t-th order Markov chain, the SNP sequence is first divided into disjoint subsequences, each of which contains only one sensitive SNP at the end. Then, in each subsequence, the consecutive non-sensitive SNPs are further partitioned into disjoint groups, each of size t. These groups are processed starting from the group closest to the sensitive SNP in that subsequence. Above, we discussed in which order to process the non-sensitive SNPs in the group closest to a sensitive SNP $x_k^i$.

ALGORITHM 2:

Selecting xji considering t-th order Markov chain

[Algorithm 2 appears only as an image in the archived manuscript.]
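Since the original figure is unavailable here, the following sketch (ours, reconstructed from the min-max formulation in (10)-(12) rather than from Algorithm 2 itself) illustrates the selection step. The callable `cond_prob(window, s)`, which should return $P(x_k = s \mid \text{the } t \text{ preceding states})$, is an assumption of this sketch:

```python
import itertools

def select_next_snp(candidates, k, t, true_values, cond_prob):
    """Pick the candidate position whose truthful disclosure minimizes
    the attacker's worst-case knowledge-gain ratio on the sensitive SNP
    at position k.

    For brevity, all other predecessors of x_k are treated as unshared
    and enumerated exhaustively.
    """
    def worst_gain(j):
        alpha = true_values[j]  # the candidate is disclosed truthfully
        free = [p for p in range(k - t, k) if p != j]
        worst = 0.0
        for assign in itertools.product((0, 1, 2), repeat=len(free)):
            states = dict(zip(free, assign))
            states[j] = alpha
            window = [states[p] for p in range(k - t, k)]
            probs = [cond_prob(window, s) for s in (0, 1, 2)]
            for a, b in itertools.permutations(probs, 2):
                if b > 0:
                    worst = max(worst, a / b)
        return worst

    return min(candidates, key=worst_gain)
```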

For SNPs in the other groups of the same subsequence, similar algorithms can be developed. For example, if we consider the group that is second closest to $x_k^i$, we can select $x_j^i$ by solving

$\hat{x}_j^i = \arg\min_{x_{k-t-m}^i} \Big( \max_{x_k^i \neq x_k^{i\prime}} \max_{x_p^i \in S_3,\, x_{k-t-m}^i = \beta} \frac{P(x_k^i \mid x_p^i)}{P(x_k^{i\prime} \mid x_p^i)} \Big),$

where $S_3 = \{x_p^i \in \{0,1,2\} : x_p^i \notin R,\ k-t-m \leq p \leq k-t-1,\ 1 \leq m \leq t\} \subseteq S$.


D. COMPARISON WITH OPTIMIZATION-BASED TECHNIQUE

In Section 5.6, we compare our proposed mechanism with existing work [23]. In Figure 8, we show the comparison in terms of entropy and utility. In addition, we discussed that the optimization-based mechanism [23] always shares the SNPs that increase (or do not change) the estimation error (and entropy) for the sensitive SNPs and hides the ones that decrease the error (and entropy). Exploiting this property, we also run the attacker’s inference attack with this additional auxiliary knowledge. That is, we assume that (i) the attacker knows the sensitive SNP set of the donor (just the IDs of the SNPs, not their values) and (ii) the attacker knows that a “not share” decision for a SNP means the actual value of that SNP reduces the entropy of the SNPs in the sensitive set. Note that the attacker cannot compute the estimation error (as it does not know the values of the SNPs in $S^i$), but it can calculate the entropy (as (8) does not require knowledge of the SNP values). In Fig. 9, we show the additional benefit of this attack for both the proposed scheme and the optimization-based mechanism [23]: the amount by which the attacker’s error decreases from the values shown in Fig. 5. For instance, in Fig. 5, the initial error for the optimization-based technique is 0.5. As shown in Fig. 9, after this attack, the error of the attacker decreases by 0.5, and hence

Figure 8:

Entropy vs. utility for the proposed SNP sharing mechanism and the optimization-based technique when the kinship relationships between the individuals are not considered. The top x-axis shows the privacy parameter used for the proposed SNP sharing mechanism. Privacy tolerances of individuals in [23] vary between 0 and 20.

Figure 9:

Decrease in the estimation error of the attacker from the values shown in Fig. 5 when the attacker uses additional auxiliary information about the donor’s decisions.

its overall error drops to almost zero for the optimization-based technique. For the proposed mechanism, on the other hand, the additional decrease in the attacker’s error is around 0.1, and hence the proposed mechanism remains robust against such an attack.

E. CHANGE IN ATTACKER’S ESTIMATION ERROR AND ENTROPY

As discussed in Section 6.2, for the proposed SNP sharing mechanism, the attacker’s estimation error and entropy do not monotonically decrease with an increasing privacy parameter (ϵ). In Fig. 10, we show the variation of estimation error and entropy with increasing privacy parameter. The evaluation settings are similar to those in Section 5.3, with the fraction of sensitive SNPs set to 5%. Due to this behavior, the attacker cannot infer the values of the SNPs the mechanism decides to hide.

Figure 10:

Variation of the attacker’s estimation error and entropy with increasing privacy budget (ϵ value) for the proposed SNP sharing mechanism.

Contributor Information

Emre Yilmaz, Case Western Reserve University.

Tianxi Ji, Case Western Reserve University.

Erman Ayday, Case Western Reserve University.

Pan Li, Case Western Reserve University.

REFERENCES

  • [1]. 2020. https://www.23andme.com/en-int/. [Online; accessed 13-September-2020].
  • [2]. 2020. https://opensnp.org/. [Online; accessed 13-September-2020].
  • [3]. 2020. http://www.eupedia.com/genetics/medical_dna_test.shtml. [Online; accessed 13-September-2020].
  • [4]. 2020. https://medium.com/genomes-web-2-0-and-bioethics/my-personal-exome-analysis-part-i-first-findings-72902e4d42cb. [Online; accessed 13-September-2020].
  • [5]. 2020. https://ghr.nlm.nih.gov/primer/genomicresearch/snp. [Online; accessed 13-September-2020].
  • [6]. 2020. http://mathgen.stats.ox.ac.uk/impute/impute_v2.html. [Online; accessed 13-September-2020].
  • [7]. Nyholt Dale R, Yu Chang-En, and Visscher Peter M. 2009. On Jim Watson’s APOE status: genetic information is hard to hide. European Journal of Human Genetics 17 (2009), 147–149.
  • [8]. Ayday Erman, Raisaro Jean Louis, Hubaux Jean-Pierre, and Rougemont Jacques. 2013. Protecting and evaluating genomic privacy in medical tests and personalized medicine. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society. ACM, 95–106.
  • [9]. Baldi Pierre, Baronio Roberta, De Cristofaro Emiliano, Gasti Paolo, and Tsudik Gene. 2011. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. In Proceedings of the 18th ACM Conference on Computer and Communications Security. ACM, 691–702.
  • [10]. Cao Yang, Yoshikawa Masatoshi, Xiao Yonghui, and Xiong Li. 2017. Quantifying differential privacy under temporal correlations. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 821–832.
  • [11]. Cassa Christopher A, Miller Rachel A, and Mandl Kenneth D. 2013. A novel, privacy-preserving cryptographic approach for sharing sequencing data. Journal of the American Medical Informatics Association 20, 1 (2013), 69–76.
  • [12]. Deuber Dominic, Egger Christoph, Fech Katharina, Malavolta Giulio, Schröder Dominique, Thyagarajan Sri Aravinda Krishnan, Battke Florian, and Durand Claudia. 2019. My Genome Belongs to Me: Controlling Third Party Computation on Genomic Data. Proceedings on Privacy Enhancing Technologies 2019, 1 (2019), 108–132.
  • [13]. Deznabi Iman, Mobayen Mohammad, Jafari Nazanin, Tastan Oznur, and Ayday Erman. 2018. An inference attack on genomic data using kinship, complex correlations, and phenotype information. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 15, 4 (2018), 1333–1343.
  • [14]. Doudalis Stelios, Kotsogiannis Ios, Haney Samuel, Machanavajjhala Ashwin, and Mehrotra Sharad. 2017. One-sided differential privacy. arXiv preprint arXiv:1712.05888 (2017).
  • [15]. Duchi John C, Jordan Michael I, and Wainwright Martin J. 2013. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 429–438.
  • [16]. Dwork Cynthia. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
  • [17]. Erlingsson Úlfar, Pihur Vasyl, and Korolova Aleksandra. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1054–1067.
  • [18]. Fienberg Stephen E, Slavkovic Aleksandra, and Uhler Caroline. 2011. Privacy preserving GWAS data sharing. In 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW). IEEE, 628–635.
  • [19]. Gymrek Melissa, McGuire Amy L, Golan David, Halperin Eran, and Erlich Yaniv. 2013. Identifying personal genomes by surname inference. Science 339, 6117 (2013), 321–324.
  • [20]. Hardt Moritz and Talwar Kunal. 2010. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing. ACM, 705–714.
  • [21]. Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, and Craig David W. 2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008), e1000167.
  • [22]. Humbert Mathias, Ayday Erman, Hubaux Jean-Pierre, and Telenti Amalio. 2013. Addressing the concerns of the Lacks family: quantification of kin genomic privacy. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 1141–1152.
  • [23]. Humbert Mathias, Ayday Erman, Hubaux Jean-Pierre, and Telenti Amalio. 2014. Reconciling utility with privacy in genomics. In Proceedings of the 13th Workshop on Privacy in the Electronic Society. ACM, 11–20.
  • [24]. Jha Somesh, Kruger Louis, and Shmatikov Vitaly. 2008. Towards practical privacy for genomic computation. In 2008 IEEE Symposium on Security and Privacy (SP). IEEE, 216–230.
  • [25]. Johnson Aaron and Shmatikov Vitaly. 2013. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1079–1087.
  • [26]. Jolie Angelina. 2013. My medical choice. The New York Times (14 May 2013).
  • [27]. Kairouz Peter, Oh Sewoong, and Viswanath Pramod. 2014. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems. 2879–2887.
  • [28]. Karp Richard M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.
  • [29]. Kschischang Frank R, Frey Brendan J, and Loeliger H-A. 2001. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47, 2 (2001), 498–519.
  • [30]. Liu Changchang, Chakraborty Supriyo, and Mittal Prateek. 2016. Dependence Makes You Vulnerable: Differential Privacy Under Dependent Tuples. In NDSS, Vol. 16. 21–24.
  • [31]. Naveed Muhammad, Ayday Erman, Clayton Ellen W, Fellay Jacques, Gunter Carl A, Hubaux Jean-Pierre, Malin Bradley A, and Wang XiaoFeng. 2015. Privacy in the genomic era. ACM Computing Surveys (CSUR) 48, 1 (2015), 6.
  • [32]. Pearl Judea. 2014. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier.
  • [33]. Samani Sahel Shariati, Huang Zhicong, Ayday Erman, Elliot Mark, Fellay Jacques, Hubaux Jean-Pierre, and Kutalik Zoltán. 2015. Quantifying genomic privacy via inference attack with high-order SNV correlations. In 2015 IEEE Security and Privacy Workshops (SPW). IEEE, 32–40.
  • [34]. Shringarpure Suyash S and Bustamante Carlos D. 2015. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics 97, 5 (2015), 631–646.
  • [35]. Slatkin Montgomery. 2008. Linkage disequilibrium - understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics 9, 6 (2008), 477–485.
  • [36]. Song Shuang, Wang Yizhen, and Chaudhuri Kamalika. 2017. Pufferfish privacy mechanisms for correlated data. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1291–1306.
  • [37]. Sweeney Latanya, Abu Akua, and Winn Julia. 2013. Identifying participants in the personal genome project by name (a re-identification experiment). arXiv preprint arXiv:1304.7605 (2013).
  • [38]. Wagner Isabel. 2017. Evaluating the strength of genomic privacy metrics. ACM Transactions on Privacy and Security (TOPS) 20, 1 (2017), 2.
  • [39]. Wang Rui, Li Yong Fuga, Wang XiaoFeng, Tang Haixu, and Zhou Xiaoyong. 2009. Learning your identity and disease from research papers: information leaks in genome wide association study. In Proceedings of the 16th ACM Conference on Computer and Communications Security. ACM, 534–544.
  • [40]. Wang Tianhao, Blocki Jeremiah, Li Ninghui, and Jha Somesh. 2017. Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium. 729–745.
  • [41]. Wang Xiao Shaun, Huang Yan, Zhao Yongan, Tang Haixu, Wang XiaoFeng, and Bu Diyue. 2015. Efficient genome-wide, privacy-preserving similar patient query based on private edit distance. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 492–503.
  • [42]. Yang Bin, Sato Issei, and Nakagawa Hiroshi. 2015. Bayesian differential privacy on correlated data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 747–762.
  • [43]. Yu Fei, Fienberg Stephen E, Slavković Aleksandra B, and Uhler Caroline. 2014. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics 50 (2014), 133–141.
