Author manuscript; available in PMC: 2023 Jun 6.
Published in final edited form as: IEEE Trans Inf Theory. 2022 Mar 3;68(6):4090–4105. doi: 10.1109/tit.2022.3156276

Mechanisms for Hiding Sensitive Genotypes with Information-Theoretic Privacy

Fangwei Ye 1, Hyunghoon Cho 2, Salim El Rouayheb 3
PMCID: PMC10243750  NIHMSID: NIHMS1850165  PMID: 37283781

Abstract

Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.

Keywords: Information-theoretic privacy, genomic privacy, genomic data sharing, data sanitization, hidden Markov models

I. Introduction

A. Motivation

The rise of personal genomics, whereby private individuals are exposed to an increasing range of direct-to-consumer services for sequencing, sharing, or analyzing their genomes, is leading to growing concerns for genomic privacy [1]-[3]. A personal genome is a rich trove of information about the underlying individual, including predictors for disease risks and other health-related traits, which holds great potential for improving one’s health, yet may cause harm if used against the individual. Unlike other types of personal data like passwords, one’s genetic data cannot be replaced once leaked, and a data breach may even affect the relatives of the individual whose genome is leaked. In order to facilitate the sharing of genomes to improve public health and advance science, we need principled strategies for controlling the privacy risks associated with genomic data sharing.

A key need in this regard is to selectively limit the leakage of information about biological or health-related traits of an individual that can be inferred from the shared genetic data. For example, one may wish to hide certain genotypes (an individual’s genetic information at specific genomic positions) with well-established disease association before sharing his or her data with others (e.g., analytic service providers or researchers). Such a capability would give the individuals more fine-grained control over their genomic privacy.

A simple approach to privacy protection, whereby specific positions in the genome deemed sensitive by the individual are masked before sharing the data, does not provide sufficient privacy protection. This is because the correlation structure among nearby genomic positions induced by the biological processes of genetic inheritance can be used to reconstruct the masked data as demonstrated in a number of studies [4], [5]. To prevent such an attack, one could alternatively erase all positions that are highly correlated with the sensitive sites [6], which may be achieved by masking the data within a large window around each sensitive position. Unfortunately, depending upon the chosen size of window, these approaches either provide incomplete privacy protection or require an excessive amount of data to be erased in order to achieve strong privacy (as we demonstrate in our results), thus limiting the usefulness of the shared data. Here, we aim to design a principled and effective mechanism for sharing a personal genome that provably hides sensitive positions, while introducing a small amount of erasure. Our techniques build upon the recent work on ON-OFF privacy [31], [32] while extending the theory to general data distributions beyond Markov chains addressed in the previous work.

It is worth noting that information-theoretic approaches are being increasingly explored for a diverse range of applications in genomics, including sequencing [7], genome-wide association study (GWAS) [8], [9], genome assembly [10], [11], regulatory network of gene interactions (RNGI) [12], and DNA-based information storage [13]. There are also recent works addressing the issue of genomic privacy, including a solution for private shotgun sequencing [14] based on the intensively researched private information retrieval (PIR) problems [15]-[20] and differential privacy mechanisms for sharing aggregate genomic data [21]-[23]. Broadly, our work can be viewed as a continuation of these efforts to develop effective genomic data processing tools from an information-theoretic perspective, yet for a novel problem that we introduce, i.e., the design of mechanisms for selectively hiding sensitive positions in genetic sequences.

B. Genetics background

An individual’s genome consists of a pair of sequences, one from each parent, each consisting of around 3 billion nucleotides (A, C, G, and T). Each sequence is referred to as a haplotype. Since most of the genome sequence is identical between different individuals, a common way to compactly represent a personal genome is as a list of positions of variation, paired with the observed nucleotide(s) in the given individual (referred to as a genotype). In this work, we consider the problem of sharing a list of genotypes corresponding to a single haplotype of an individual. Although standard sequencing or genotyping pipelines produce a genotype at each position that convolves the two haplotypes, well-established methods exist [24], [25] for resolving this ambiguity in order to separate the two haplotypes (a process called phasing), after which each haplotype could be individually considered.

In the setting of our work, we consider an adversary whose goal is to infer the target individual’s genotypes at specific positions in the genome, given a partially masked genetic sequence of the individual. In principle, this reconstruction task is equivalent to an extensively studied problem in bioinformatics known as genotype imputation, originally developed for coping with the presence of missing data in the existing experimental pipelines for characterizing personal genomes. If one were to mask only the sensitive positions before sharing the data, existing imputation algorithms are expected to be effective at revealing the hidden genotypes using other genotypes in their respective neighborhoods.

A state-of-the-art algorithm for genotype imputation, Minimac [26], is based on a classical model of genetic sequences introduced by Li and Stephens [27]. In this model, a person’s genetic sequence is modelled as a mosaic of a large group of reference sequences from other individuals. This model intuitively captures the underlying biological process of recombination, which describes the interleaving of two haplotypes of each parent when their genetic material is passed on to the child. Formally, these models are expressed as hidden Markov models (HMMs), where a sequence of genotypes of an individual is generated from a sequence of hidden states indicating which reference haplotype to copy the genotype from, for each corresponding position. The parameters of these models are typically inferred from a large reference panel including tens of thousands of sequenced human genomes [28]. Although alternative approaches to imputation (e.g., based on matrix factorization [29]) exist, in our work we are especially interested in HMMs as the primary means to model the distribution of genotypes, considering the wide adoption of HMMs in genetics not only for imputation, but also for other standard tasks like phasing [24] and simulation [30]. Further details of this model are provided in Section VII-A.

C. Setup and contributions

In this paper, we formulate the genotype hiding problem: We consider a user who wishes to share a partially erased version of their genetic sequence while protecting a list of sensitive positions. Privacy is measured by the mutual information between the sensitive positions and the released sequence, and we adopt a stringent privacy requirement that enforces zero mutual information (i.e., perfect privacy). The goal of the problem is to design a privacy mechanism that satisfies this requirement, while minimizing the number of erasures introduced so as to maximize the utility of the data.

We present such a mechanism with perfect privacy and provide a range of theoretical insights into its performance with respect to its utility, measured by the erasure rate. The proposed mechanism sequentially processes the positions in the sequence in a given ordering and determines a suitable erasure rate at each position based on the previously released positions and the data generating distribution. We prove that our mechanism can be viewed as a locally-optimal, greedy solution for minimizing the erasure rate at each position. Furthermore, we give a lower bound on the number of erasures required for any mechanism satisfying the privacy constraint, and show that our privacy mechanism is in fact (globally) optimal for a class of data generative distributions defined by Markov chains. We also show that finding the optimal ordering for the sequential mechanism is generally intractable (NP-hard), illustrating the limits of current techniques. Lastly, we derive an upper bound on potential privacy leakage due to inaccuracies in the estimation of the data generative model, suggesting that our mechanism is relatively robust to a small amount of noise in the data distribution.

For practical applications, we are particularly interested in data generating distributions induced by hidden Markov models (HMMs), which are broadly adopted in genetics as described in Section VII-A. To this end, we also present a computationally-efficient algorithm to implement the proposed privacy mechanism based on HMMs, and provide an empirical evaluation of its performance on simulated datasets.

The rest of this paper is organized as follows. In Section II, we formalize the genotype-hiding problem. Performance bounds are summarized in Section III. In Section IV, we introduce our privacy mechanism for hiding sensitive genotypes. In Section V, we describe its interpretation as a locally-optimal solution in detail and demonstrate the NP-hardness of finding the optimal ordering in general. The robustness of our privacy mechanism to model mismatch is discussed in Section VI. In Section VII, we propose an efficient implementation of the privacy mechanism for hidden Markov models. Simulation experiments are presented in Section VIII. Finally in Section IX, we conclude the paper and discuss future directions.

II. The Genotype-Hiding Problem

Let X = (X1, …, Xn) be the user’s personal genome sequence of length n, where each Xi takes values in the alphabet 𝒳. The user wishes to share X with others, but is concerned about revealing information about certain positions of X. To hide the values at these sensitive positions, the user generates a masked version of the data Y = (Y1, …, Yn), which only partially reveals X.

The desired properties of Y are given as follows. First, since we expect substitution errors to be considerably more undesirable than erasures in genetic analyses, we impose a constraint that Yi can be either Xi or the erasure symbol *. We refer to this property as the faithfulness condition, i.e.,

$$Y_i = X_i \ \text{or}\ *. \qquad \text{(Faithfulness)} \tag{1}$$

Note that the alphabet of Yi is $\mathcal{X} \cup \{*\}$, which we denote by $\mathcal{Y}$.

Next, let $\mathcal{K} \subseteq [n] \triangleq \{1, \ldots, n\}$ be the user-provided set of indices of X containing sensitive information. We assume that 𝒦 is chosen irrespective of the sequence (i.e., independently from X) based on information such as family history or curated disease associations. We use X𝒦 to denote a collection of random variables, i.e., $X_{\mathcal{K}} \triangleq \{X_i : i \in \mathcal{K}\}$. We require that no information about X𝒦 is revealed when Y is shared. In other words, we require that

$$I(X_{\mathcal{K}}; Y) = 0, \qquad \text{(Privacy)} \tag{2}$$

where I(·) denotes the mutual information. We refer to this requirement as the privacy condition. Note that our notion of privacy is stronger than alternatives such as local differential privacy [33], which allows a small amount of leakage. Our work focuses on maximizing the utility over all mechanisms satisfying the perfect privacy condition.
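To make the privacy condition concrete, the following minimal sketch evaluates I(X𝒦; Y) for two simple masking rules on a toy two-position sequence. The prior (a uniform bit X1 that X2 copies with probability 0.9) and the helper names `mutual_information` and `leakage` are illustrative assumptions, not part of the paper.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """Return I(A; B) in bits for a joint distribution {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical prior: X1 is a uniform bit and X2 equals X1 with probability 0.9.
p_x = {(x1, x2): 0.5 * (0.9 if x1 == x2 else 0.1)
       for x1 in (0, 1) for x2 in (0, 1)}

def leakage(mask):
    """I(X1; Y) when Y deterministically erases the positions listed in `mask`."""
    joint = defaultdict(float)
    for x, p in p_x.items():
        y = tuple("*" if i in mask else xi for i, xi in enumerate(x))
        joint[(x[0], y)] += p          # X1 (index 0) is the sensitive position
    return mutual_information(joint)

leak_naive = leakage({0})      # erase only the sensitive position
leak_all = leakage({0, 1})     # erase both positions
print(f"mask sensitive only: {leak_naive:.3f} bits, mask everything: {leak_all:.3f} bits")
```

In this toy setting, erasing only the sensitive position leaves roughly half a bit of leakage through the correlated neighbor, while erasing everything satisfies (2) at the cost of zero utility.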

We aim to design a privacy mechanism $w(y \mid x)$ to generate Y from given X and 𝒦 such that both the faithfulness and privacy conditions are satisfied. Here, we consider the ideal scenario where the data generating distribution $p(x)$ is known to the mechanism. We discuss the impact of having an inaccurate $p(x)$ in Section VI; even under this challenging scenario, we show that the potential privacy leakage is bounded by the divergence between the given $p(x)$ and the true distribution. Note that we use uppercase symbols to represent random variables and lowercase symbols to denote their realizations.

While satisfying the above two conditions, we wish to share as much of X as possible. More precisely, let e(Y) be the number of erasure symbols in Y. Our goal is to minimize the expected number of erasures E[e(Y)], or equivalently the erasure rate $\frac{1}{n} E[e(Y)]$, where

$$E[e(Y)] = \sum_{i=1}^{n} E\left[\mathbb{1}\{Y_i = *\}\right] = \sum_{i=1}^{n} p(y_i = *), \tag{3}$$

and $\mathbb{1}\{\cdot\}$ denotes the indicator function.

A formal description of the genotype-hiding problem is given below. We start by defining the privacy mechanism for the genotype-hiding problem as follows.

Definition 1. An (n, 𝒦) privacy mechanism for a given data generative distribution $p(x)$ with input alphabet $\mathcal{X}^n$ and output alphabet $\mathcal{Y}^n$ is defined by a probabilistic encoding function

$$\mathrm{Enc}: \mathcal{X}^n \to \mathcal{Y}^n,$$

where Enc satisfies both the faithfulness condition (Yi ∈ {Xi, *}, ∀i) and the privacy condition (I(X𝒦; Y) = 0).

The performance of the privacy mechanism is measured by the expected number of erasures per symbol in an output sequence y. This measure captures the distortion between the input and output sequences induced by a set of single-letter erasures. Following the convention, we define the rate of a privacy mechanism as the fraction of positions that are not erased in the output:

Definition 2. The rate of an (n, 𝒦) privacy mechanism for a given data generative distribution $p(x)$ is defined by $1 - \frac{1}{n} E[e(\mathrm{Enc}(X))]$ per symbol.

Definition 3. For any given data distribution $p(x)$, a rate R is achievable if there exists an (n, 𝒦) privacy mechanism such that

$$1 - \frac{1}{n} E[e(Y)] \ge R, \tag{4}$$

where Y = Enc(X).

Clearly, if R is achievable, then $R - \epsilon$ is also achievable for any ϵ > 0 by definition, so we are interested in finding the maximum achievable rate.

It is worth noting that the encoder Enc(·) can be stochastic, so we may use conditional probabilities $w(y \mid x)$ to represent the encoding function. If we treat the conditional probabilities $w(y \mid x)$, where $x \in \mathcal{X}^n$ and $y \in \mathcal{Y}^n$, as decision variables, the genotype-hiding problem can be defined as the following optimization problem:

$$\begin{aligned}
\underset{w(y \mid x)}{\text{maximize}} \quad & 1 - \frac{1}{n} \sum_{i=1}^{n} p(y_i = *) \\
\text{subject to} \quad & I(X_{\mathcal{K}}; Y) = 0 && \text{(Privacy)} \\
& Y_i \in \{X_i, *\}, \ \forall i && \text{(Faithfulness)}
\end{aligned} \tag{5}$$

Note that this problem maximizes the information rate (utility) under the stringent privacy constraint such that no information about the sensitive positions is leaked.

If we express the objective and the constraints explicitly in terms of the conditional probabilities $w(y \mid x)$, the optimization problem (5) can be viewed as an instance of linear programming (LP). However, the scale of the problem is intractable in practice, given the exponential blowup in the number of variables and constraints as the length of the sequence n grows; the number of decision variables is $|\mathcal{X}|^n |\mathcal{Y}|^n$, and the number of constraints is on the order of $|\mathcal{X}|^{|\mathcal{K}|} |\mathcal{Y}|^n + n |\mathcal{X}| |\mathcal{Y}|$.
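As a quick illustration of this blowup, the sketch below evaluates the two counts stated above for a binary alphabet. The function name `lp_size` is hypothetical, and the constraint count is only the order-of-magnitude expression from the text, not an exact tally.

```python
def lp_size(alphabet_size, n, num_sensitive):
    """Variable count and order-of-magnitude constraint count for the LP form of (5)."""
    x = alphabet_size          # |X|
    y = alphabet_size + 1      # |Y| = |X| + 1, counting the erasure symbol
    variables = x**n * y**n
    constraints = x**num_sensitive * y**n + n * x * y
    return variables, constraints

# Growth with sequence length for a binary alphabet and one sensitive position.
for n in (5, 10, 20):
    print(n, lp_size(alphabet_size=2, n=n, num_sensitive=1))
```

Even for n = 10 binary positions the LP already has tens of millions of decision variables, which is why a direct LP solve is not pursued.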

Therefore, the ultimate goal of this paper is to identify a solution to the genotype-hiding problem in a tractable and computationally-efficient manner. To this end, we first present an achievable privacy mechanism as well as an upper bound on the maximum achievable rate. Then we show that the proposed privacy mechanism is computationally efficient for a particular data generative distribution, namely hidden Markov models, which is of broad interest in our motivating application in genomics.

III. Performance Bounds

In this section, we state the performance bounds on the achievable rate in the following theorems.

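Theorem 1. For a given data distribution $p(x)$, a rate R is achievable if

$$R \le \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i \in \mathcal{X}} E_{Y_{[i-1]}}\left[\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, Y_{[i-1]})\right]. \tag{6}$$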

A detailed description of the achievable scheme will be presented in Section IV. The right-hand side of (6) may appear unconventional, given that conditioning on Y[i−1] for each i makes the probability term generally hard to compute as the sequence length n grows. However, this expression corresponds to a sequential mechanism where the encoder generates Y1, …, Yn one position at a time, and an efficient update exists for incrementally expanding the conditioning set. As an example, in Section VII, we present a concrete implementation of the privacy mechanism for data distributions governed by hidden Markov models, which indeed allows the right-hand side of (6) to be efficiently computed.

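Theorem 2. For a given data distribution $p(x)$, any achievable rate R must satisfy

$$R \le \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u). \tag{7}$$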

It is worth noting that, given a data distribution p (x), each summand in the right-hand side of (7) represents the conditional probability of the observation xi at coordinate i when the sensitive positions x𝒦 take on the least-likely values, which can be determined from the given p (x).
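The bound in (7) can be evaluated by direct enumeration when n is small. The sketch below does so for a hypothetical binary symmetric Markov chain (uniform start, flip probability 0.1, first position sensitive). The function names are illustrative, and the minimum is taken only over values u in the support of X𝒦, which is an assumption made to keep the conditional probabilities well defined.

```python
from collections import defaultdict
from itertools import product

def upper_bound_rate(p_x, sensitive):
    """Evaluate the right-hand side of (7) by enumeration (small n only)."""
    n = len(next(iter(p_x)))
    key = lambda x: tuple(x[k] for k in sensitive)
    p_u = defaultdict(float)     # p(x_K = u)
    p_au = defaultdict(float)    # p(x_i = a, x_K = u), indexed by (i, a, u)
    for x, p in p_x.items():
        p_u[key(x)] += p
        for i in range(n):
            p_au[(i, x[i], key(x))] += p
    total = 0.0
    for i in range(n):
        for a in {x[i] for x in p_x}:
            # Least likely explanation of x_i = a over the sensitive values u.
            total += min(p_au.get((i, a, u), 0.0) / pu
                         for u, pu in p_u.items() if pu > 0)
    return total / n

def symmetric_chain(n, flip):
    """Hypothetical prior: binary Markov chain, uniform start, flip probability `flip`."""
    p_x = {}
    for x in product((0, 1), repeat=n):
        p = 0.5
        for a, b in zip(x, x[1:]):
            p *= flip if a != b else 1 - flip
        p_x[x] = p
    return p_x

bound = upper_bound_rate(symmetric_chain(3, 0.1), sensitive=(0,))
print(f"upper bound on the rate: {bound:.4f}")
```

For this chain the three positions contribute 0, 0.2 and 0.36 to the sum, so the bound is 0.56/3.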

Proof: From (3), we know that to establish (7), it is sufficient to show

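$$p(y_i \ne *) \le \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u) \tag{8}$$

for any mechanism satisfying the privacy and faithfulness conditions. Consider

$$\begin{aligned}
p(y_i \ne *) &= \sum_{y_i \in \mathcal{X}} p(y_i) \\
&\overset{(a)}{=} \sum_{y_i \in \mathcal{X}} \min_{u} p(y_i \mid x_{\mathcal{K}} = u) \\
&\overset{(b)}{=} \sum_{y_i \in \mathcal{X}} \min_{u} p(y_i = x_i \mid x_{\mathcal{K}} = u) \\
&= \sum_{x_i \in \mathcal{X}} \min_{u} p(x_i \mid x_{\mathcal{K}} = u)\, p(y_i = x_i \mid x_i, x_{\mathcal{K}} = u) \\
&\overset{(c)}{\le} \sum_{x_i \in \mathcal{X}} \min_{u} p(x_i \mid x_{\mathcal{K}} = u),
\end{aligned} \tag{9}$$

where (a) is due to the fact that Yi is independent of X𝒦 (privacy condition); (b) follows from the faithfulness condition Yi ∈ {Xi, *}; and (c) follows from the fact that probabilities are bounded above by 1.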

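Although not true in general, the bounds in (6) and (7) match under special circumstances, implying the optimality of the achievable mechanism. That is,

$$\sum_{x_i \in \mathcal{X}} \sum_{y_{[i-1]}} p(y_{[i-1]}) \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}) = \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u). \tag{10}$$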

We observe that a sufficient condition for this equality is given by the following: for any xi, if

$$u^* \in \arg\min_{u} p(x_i \mid x_{\mathcal{K}} = u), \tag{11}$$

then

$$u^* \in \arg\min_{u} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) \tag{12}$$

for all possible y[i−1]. Intuitively, this means that for any given position xi, the least-likely values of the (unobserved) sensitive positions x𝒦 remain the same regardless of the positions that have been previously released in the output y[i−1] during the course of the mechanism.

A special case that satisfies this optimality condition is when random variables X1, …, Xn form a Markov chain (i.e., p (x) is induced by a Markov chain), with a single sensitive position. Without loss of generality, we assume 𝒦={1}.

Corollary 1 (Markov chain). If X1, …, Xn forms a Markov chain and the sensitive position is 𝒦={1}, then a rate R is achievable if and only if

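$$R \le \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u). \tag{13}$$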

It is sufficient to justify the corollary by showing that the aforementioned sufficient condition holds. The proof is included in Appendix A.

IV. Privacy Mechanism

In this section, we present a privacy mechanism for generating Y based on a given p (x), whose performance matches the bound given in (6), while satisfying both faithfulness and privacy conditions.

Let us first recall the genotype-hiding problem introduced in (5), i.e.,

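$$\begin{aligned}
\underset{w(y \mid x)}{\text{maximize}} \quad & 1 - \frac{1}{n} \sum_{i=1}^{n} p(y_i = *) \\
\text{subject to} \quad & I(X_{\mathcal{K}}; Y) = 0 && \text{(Privacy)} \\
& Y_i \in \{X_i, *\}, \ \forall i. && \text{(Faithfulness)}
\end{aligned} \tag{14}$$

This problem is difficult to solve in its general form given the exponentially growing number of decision variables in $w(y \mid x)$ as the sequence length n grows. Instead, we adopt a greedy optimization approach, whereby the erasure probability of yi is locally minimized, one position at a time, from 1 to n. In other words, for each i = 1, …, n, we solve

$$\begin{aligned}
\underset{w(y_i \mid x, y_{[i-1]})}{\text{minimize}} \quad & p(y_i = * \mid y_{[i-1]}) \\
\text{subject to} \quad & I(X_{\mathcal{K}}; Y_i \mid Y_{[i-1]}) = 0 \\
& Y_i \in \{X_i, *\},
\end{aligned} \tag{15}$$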

for any given y[i−1]. Note that

$$I(X_{\mathcal{K}}; Y) = \sum_{i=1}^{n} I(X_{\mathcal{K}}; Y_i \mid Y_{[i-1]}) = 0, \tag{16}$$

by the chain rule, so if the first constraint of (15) is satisfied for all i, then the solution preserves the required privacy constraint I(X𝒦;Y)=0 as defined in (2). The second constraint is inherited directly from the faithfulness condition. In other words, any solution satisfying the constraints of (15) for all i will naturally be a feasible solution to the genotype-hiding problem in (5).

We observe that solving the local optimization problem (15) gives rise to a sequential mechanism for generating Y. That is, we generate Y one position at a time, where the conditional distribution for Yi may depend on the values of Y1, …, Yi−1 that have been previously generated. The following defines our chosen privacy mechanism for any given position i, which is in fact an optimal solution to the local optimization problem (15). A detailed proof of the local optimality of this scheme is deferred to Section V.

Privacy mechanism: Generate each Yi according to the following conditional distribution

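$$w(y_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) = \begin{cases}
\dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})}, & \text{if } y_i = x_i, \\[2ex]
1 - \dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})}, & \text{if } y_i = *, \\[2ex]
0, & \text{otherwise},
\end{cases} \tag{17}$$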

for any xi, x𝒦 and y[i−1], where [i − 1] := {1, …, i − 1}.

The expression for the erasure probability in the above mechanism can be intuitively understood as follows. We first identify the values of the sensitive positions with the smallest likelihood of generating the observed symbol xi at the i-th position (as indicated by the numerator in the fractional term), conditioned on the previously released positions y[i−1]. Note that u is an auxiliary variable denoting the possible values in the alphabet $\mathcal{X}^{|\mathcal{K}|}$, whereas x𝒦 denotes the observed values at the sensitive positions. We then choose the erasure probability such that the probability of releasing the original symbol (without erasure) becomes identical among different hypothetical values of x𝒦, thus ensuring privacy.

It is worth noting that our privacy mechanism satisfies the faithfulness condition (i.e., yi ∈ {xi, *}) by design, so we only need to verify that it satisfies the privacy constraint (2). Before verifying the privacy constraint, we note the following properties of the mechanism.

  1. If $i \in \mathcal{K}$, then
    $$\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) = 0, \tag{18}$$
    which yields
    $$w(y_i = * \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) = 1. \tag{19}$$

    This implies that Xi is always erased if it corresponds to one of the sensitive positions in 𝒦.

  2. We notice from (17) that Xi is released (i.e., not erased) with nonzero probability whenever the minimum in the numerator is positive, so this mechanism is strictly better than the naïve approach of always erasing any position that has a nonzero correlation with the sensitive positions.
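The mechanism in (17) can be sketched by brute-force enumeration for short sequences. The code below is an illustration under stated assumptions rather than the efficient HMM implementation referred to in Section VII: the function names, the toy binary symmetric chain (flip probability 0.1, first position sensitive), and the small tolerance used to absorb floating-point rounding are all hypothetical choices.

```python
import math
from collections import defaultdict
from itertools import product

def run_mechanism(p_x, sensitive):
    """Joint distribution {(x, y): prob} induced by mechanism (17), by enumeration."""
    n = len(next(iter(p_x)))
    key = lambda x: tuple(x[k] for k in sensitive)
    support_u = {key(x) for x, p in p_x.items() if p > 0}
    joint = {(x, ()): p for x, p in p_x.items() if p > 0}
    for i in range(n):
        num = defaultdict(float)   # p(x_i, x_K, y_[i-1])
        den = defaultdict(float)   # p(x_K, y_[i-1])
        for (x, y), p in joint.items():
            num[(x[i], key(x), y)] += p
            den[(key(x), y)] += p
        nxt = defaultdict(float)
        for (x, y), p in joint.items():
            lowest = min(num.get((x[i], u, y), 0.0) / den[(u, y)] for u in support_u)
            actual = num[(x[i], key(x), y)] / den[(key(x), y)]
            release = lowest / actual          # first case of (17)
            if release > 1 - 1e-12:            # absorb rounding noise (assumed tolerance)
                release = 1.0
            if release > 0:
                nxt[(x, y + (x[i],))] += p * release
            if release < 1:
                nxt[(x, y + ("*",))] += p * (1 - release)
        joint = dict(nxt)
    return joint

def leakage_and_rate(joint, sensitive):
    """Return (I(X_K; Y) in bits, rate) for a joint distribution over (x, y)."""
    n = len(next(iter(joint))[0])
    pk, py, pky = defaultdict(float), defaultdict(float), defaultdict(float)
    erasures = 0.0
    for (x, y), p in joint.items():
        u = tuple(x[k] for k in sensitive)
        pk[u] += p
        py[y] += p
        pky[(u, y)] += p
        erasures += p * y.count("*")
    mi = sum(p * math.log2(p / (pk[u] * py[y])) for (u, y), p in pky.items() if p > 0)
    return mi, 1 - erasures / n

# Hypothetical prior: 3-position binary chain, uniform start, flip probability 0.1.
p_x = {}
for x in product((0, 1), repeat=3):
    p = 0.5
    for a, b in zip(x, x[1:]):
        p *= 0.1 if a != b else 0.9
    p_x[x] = p

joint = run_mechanism(p_x, sensitive=(0,))
leak, rate = leakage_and_rate(joint, sensitive=(0,))
print(f"leakage = {leak:.2e} bits, rate = {rate:.4f}")
```

On this toy chain the measured leakage is zero up to rounding, and the rate equals 0.56/3, which coincides with the right-hand side of (7) for the same prior, as Corollary 1 predicts for a Markov chain with a single sensitive position.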

Proof of privacy: To show that the proposed mechanism in (17) satisfies the privacy condition (2), it is sufficient to show

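$$I(Y_i; X_{\mathcal{K}} \mid Y_1, \ldots, Y_{i-1}) = 0, \tag{20}$$

for all i = 1, …, n, since this implies

$$I(X_{\mathcal{K}}; Y) = \sum_{i=1}^{n} I(X_{\mathcal{K}}; Y_i \mid Y_{[i-1]}) = 0 \tag{21}$$

by the chain rule. To establish (20), we will equivalently prove that

$$p(y_i \mid x_{\mathcal{K}}, y_{[i-1]}) = p(y_i \mid y_{[i-1]}) \tag{22}$$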

for any x𝒦, y[i−1] and yi. Since

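$$p(y_i \mid x_{\mathcal{K}}, y_{[i-1]}) = \sum_{x_i \in \mathcal{X}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}), \tag{23}$$

by substituting (17), we have

$$\begin{aligned}
p(y_i = * \mid x_{\mathcal{K}}, y_{[i-1]}) &= \sum_{x_i} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i = * \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) \\
&= 1 - \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}).
\end{aligned} \tag{24}$$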

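Similarly, for $y_i \in \mathcal{X}$, only the term with xi = yi contributes to the sum in (23), because (17) assigns zero probability to releasing any symbol other than xi. Hence

$$\begin{aligned}
p(y_i \mid x_{\mathcal{K}}, y_{[i-1]}) &= \sum_{x_i \in \mathcal{X}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) \\
&= p(x_i = y_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i = x_i \mid x_i = y_i, x_{\mathcal{K}}, y_{[i-1]}) \\
&= \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i = y_i \mid x_{\mathcal{K}} = u, y_{[i-1]}).
\end{aligned} \tag{25}$$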

We can observe that the right-hand sides of both (24) and (25) are independent of x𝒦, and hence by combining (24) and (25), we have

$$p(y_i \mid x_{\mathcal{K}}, y_{[i-1]}) = p(y_i \mid y_{[i-1]}), \tag{26}$$

for any x𝒦, y[i−1] and yi, which finishes the proof of (22).

Finally, we can easily verify that our sequential privacy mechanism (17) achieves the rate

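$$\begin{aligned}
1 - \frac{1}{n} \sum_{i=1}^{n} p(y_i = *) &= 1 - \frac{1}{n} \sum_{i=1}^{n} \sum_{y_{[i-1]}} p(y_i = * \mid y_{[i-1]})\, p(y_{[i-1]}) \\
&\overset{(a)}{=} 1 - \frac{1}{n} \sum_{i=1}^{n} \sum_{y_{[i-1]}} p(y_{[i-1]}) \left(1 - \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{y_{[i-1]}} p(y_{[i-1]}) \sum_{x_i \in \mathcal{X}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i \in \mathcal{X}} \sum_{y_{[i-1]}} p(y_{[i-1]}) \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}),
\end{aligned} \tag{27}$$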

where (a) follows by (24) and (26). The final expression is identical to the right-hand side of (6) as desired.

Example. We present an example to illustrate the operations of the proposed privacy mechanism in a simplified setting. Let us consider a data distribution $p(x)$ where X1, …, Xn form a Markov chain, as in Corollary 1, and a single sensitive position 𝒦={1}.

By inspecting the privacy mechanism in (17), we know that if yi−1 ≠ * for some i > 1, then

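$$p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) = p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}, x_{i-1} = y_{i-1}) = p(x_i \mid x_{i-1} = y_{i-1}), \tag{28}$$

for any xi and y[i−1] by the Markov property and the fact that 𝒦={1}. This implies that

$$w(y_i = x_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) = \frac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})} = \frac{p(x_i \mid x_{i-1} = y_{i-1})}{p(x_i \mid x_{i-1} = y_{i-1})} = 1, \tag{29}$$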

which means that if yi−1 ≠ * then yi ≠ * with probability one.

Thus, when p (x) is specified by a Markov chain, we see that the privacy mechanism erases all positions within a window from the sensitive position and releases the rest without erasure, and the size of the window is stochastically chosen. This observation suggests that, in contrast to the heuristic approach of deterministically choosing a window for erasure, our mechanism introduces additional uncertainty about sensitive data (in fact achieving perfect privacy) by randomizing the choice of the window. Later in Section VIII, we present a simulation experiment comparing our mechanism with the deterministic window-based erasure approach with respect to the privacy-utility trade-off, based on a more realistic data distribution defined by hidden Markov models.
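The random-window behavior can be checked numerically on a small chain. The sketch below enumerates the mechanism's output for a hypothetical 4-position binary symmetric chain (flip probability 0.1, 𝒦 = {1}) and tabulates the number of erased positions. The helper `run_mechanism` is an illustrative brute-force rendering of (17), and the rounding tolerance is an assumption.

```python
from collections import defaultdict
from itertools import product

def run_mechanism(p_x, sensitive):
    """Joint distribution {(x, y): prob} induced by mechanism (17), by enumeration."""
    n = len(next(iter(p_x)))
    key = lambda x: tuple(x[k] for k in sensitive)
    support_u = {key(x) for x in p_x}
    joint = {(x, ()): p for x, p in p_x.items()}
    for i in range(n):
        num, den = defaultdict(float), defaultdict(float)
        for (x, y), p in joint.items():
            num[(x[i], key(x), y)] += p      # p(x_i, x_K, y_[i-1])
            den[(key(x), y)] += p            # p(x_K, y_[i-1])
        nxt = defaultdict(float)
        for (x, y), p in joint.items():
            lowest = min(num.get((x[i], u, y), 0.0) / den[(u, y)] for u in support_u)
            release = lowest / (num[(x[i], key(x), y)] / den[(key(x), y)])
            release = 1.0 if release > 1 - 1e-12 else release   # assumed tolerance
            if release > 0:
                nxt[(x, y + (x[i],))] += p * release
            if release < 1:
                nxt[(x, y + ("*",))] += p * (1 - release)
        joint = dict(nxt)
    return joint

# Hypothetical prior: 4-position binary chain, uniform start, flip probability 0.1.
p_x = {}
for x in product((0, 1), repeat=4):
    p = 0.5
    for a, b in zip(x, x[1:]):
        p *= 0.1 if a != b else 0.9
    p_x[x] = p

joint = run_mechanism(p_x, sensitive=(0,))

window = defaultdict(float)   # window[L] = probability that exactly L positions are erased
contiguous = True             # are erasures always a prefix starting at position 1?
for (x, y), p in joint.items():
    erased = [s == "*" for s in y]
    length = sum(erased)
    contiguous = contiguous and erased == [True] * length + [False] * (len(y) - length)
    window[length] += p

print(dict(window), contiguous)
```

In this toy chain every output consists of an erased prefix followed by fully released positions, and the prefix length follows a truncated geometric pattern (probability 0.2 of stopping after each erased position), which matches the observation that once a position is released, all later positions are released as well.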

V. Local Optimality

In the previous section, we proposed a privacy mechanism for the genotype-hiding problem satisfying both privacy and faithfulness conditions. Here, we provide further insights into the optimality of the proposed mechanism. We first prove that the mechanism is indeed an optimal solution to the local optimization problem in (15) as claimed, and thus can be viewed as a greedy solution to the general genotype-hiding problem in (5) given a fixed variable ordering (i.e., the order in which Yi’s are sampled). We then present a negative result to inform future investigation, showing that finding an optimal variable ordering for the mechanism is intractable (NP-hard) in general, thus illustrating the limits of current techniques in achieving global optimality.

A. Optimality with respect to the local optimization problem

Let us first recall the local optimization problem (15), i.e.,

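$$\begin{aligned}
\underset{w(y_i \mid x, y_{[i-1]})}{\text{minimize}} \quad & p(y_i = * \mid y_{[i-1]}) \\
\text{subject to} \quad & I(X_{\mathcal{K}}; Y_i \mid Y_{[i-1]}) = 0 \\
& Y_i \in \{X_i, *\}.
\end{aligned} \tag{30}$$

As we have shown,

$$I(X_{\mathcal{K}}; Y) = \sum_{i=1}^{n} I(X_{\mathcal{K}}; Y_i \mid Y_{[i-1]}) = 0, \tag{31}$$

by the chain rule, so any solution satisfying the constraints of (15) for all i is a feasible solution to the general genotype-hiding problem in (5).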

We now show that the privacy mechanism in (17) is optimal with respect to the above optimization problem. First, for any given y[i−1], note that

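$$\begin{aligned}
p(y_i = * \mid y_{[i-1]}) &= 1 - \sum_{y_i \in \mathcal{X}} p(y_i \mid y_{[i-1]}) \\
&\overset{(a)}{=} 1 - \sum_{y_i \in \mathcal{X}} \min_{x_{\mathcal{K}}} p(y_i \mid x_{\mathcal{K}}, y_{[i-1]}) \\
&\overset{(b)}{=} 1 - \sum_{y_i \in \mathcal{X}} \min_{x_{\mathcal{K}}} p(y_i = x_i \mid x_{\mathcal{K}}, y_{[i-1]}) \\
&= 1 - \sum_{x_i \in \mathcal{X}} \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i = x_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) \\
&\overset{(c)}{\ge} 1 - \sum_{x_i \in \mathcal{X}} \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}),
\end{aligned} \tag{32}$$

where (a) follows from the privacy condition, (b) follows from the faithfulness condition, and (c) holds because probability values are at most 1. This implies that any feasible solution to the local optimization problem (15) has to satisfy

$$p(y_i = * \mid y_{[i-1]}) \ge 1 - \sum_{x_i \in \mathcal{X}} \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}), \tag{33}$$

and that it is optimal if the last step holds with equality, i.e.,

$$\min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i = x_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) = \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}), \tag{34}$$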

for any xi and y[i−1].

By plugging in the proposed mechanism in (17), we have

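$$\begin{aligned}
\min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, w(y_i = x_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) &= \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})\, \frac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})} \\
&= \min_{x_{\mathcal{K}}} \min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) \\
&= \min_{x_{\mathcal{K}}} p(x_i \mid x_{\mathcal{K}}, y_{[i-1]}),
\end{aligned} \tag{35}$$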

where the last step follows because the two minimizations, both over the alphabet of X𝒦, are equivalent and can be merged. This implies that the mechanism (17) attains the minimum probability of erasing Yi and thus is an optimal solution to the local optimization problem (15). Therefore, our sequential privacy mechanism can be viewed as a locally-optimal algorithm for solving the general genotype-hiding problem (5), given a fixed variable ordering.

B. NP-hardness of finding an optimal variable ordering

So far, we considered the privacy mechanism that generates a masked sequence Y1, …, Yn in a linear order from 1 to n. A natural question is then whether this linear ordering is optimal in terms of the erasure rate that the locally-optimal mechanism achieves. Here, we illustrate the difficulty of determining the optimal variable ordering for the mechanism from a complexity theory perspective, by proving that it is NP-hard in general. This suggests that devising an efficient mechanism with better optimality guarantees in the general setting requires additional assumptions or techniques to circumvent this impossibility result, which is an interesting direction for further research.

To formalize the problem, let (o1, …, on) be any permutation of (1, …, n). We consider generating Y in the order of o1, …, on instead. In this setting, the privacy mechanism (17) is defined by the conditional distribution

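$$w(y_{o_i} \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = \begin{cases}
\dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_{o_i} \mid x_{\mathcal{K}} = u, y_{o_{[i-1]}})}{p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}})}, & \text{if } y_{o_i} = x_{o_i}, \\[2ex]
1 - \dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_{o_i} \mid x_{\mathcal{K}} = u, y_{o_{[i-1]}})}{p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}})}, & \text{if } y_{o_i} = *, \\[2ex]
0, & \text{otherwise},
\end{cases} \tag{36}$$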

for any xoi, x𝒦 and yo[i−1], where o[i−1] := {o1, …, oi−1}. It is easy to see that the faithfulness and privacy conditions are still satisfied regardless of the ordering.

In the following, we show that finding the best ordering (o1, …, on) that minimizes the erasure rate of the mechanism is NP-hard by constructing a polynomial-time reduction of the well-known hitting set problem [34] to our problem. More specifically, given an arbitrary instance of a hitting set problem, we construct an instance of the genotype-hiding problem for which finding the optimal ordering for the privacy mechanism is equivalent to solving the original hitting set problem.

At the core of this reduction is a bipartite graph, illustrated in Figure 2, which we use both to represent an instance of the hitting set problem and to construct a corresponding instance of the genotype-hiding problem, as we explain in detail below. To clarify the dimensions of the problems upfront, note that we represent a hitting set problem for k sets over m elements using a bipartite graph with m left nodes and k right nodes, and the resulting genotype-hiding problem is over a sequence of length n = m + k with k sensitive positions ($|\mathcal{K}| = k$) and a specially constructed $p(x)$.

Fig. 2:

A graphical illustration of the bipartite graph used in our NP-hardness proof, representing an instance of the hitting set problem. The universe U is represented by vertices on the left, sets are represented by vertices on the right, and the edges represent the inclusion of elements in each set. To facilitate reduction to the genotype-hiding problem, we associate each edge with an independent and uniformly random bit bi,j.

We first review the hitting set problem. Consider a universe U = {v1, …, vm} and a collection of non-empty subsets $\mathcal{S} = \{S_1, \ldots, S_k\}$ such that $S_j \subseteq U$ for all j ∈ [k]. Without loss of generality, assume that $U = \bigcup_{j=1}^{k} S_j$, and U = [m]. A universe U and sets {S1, …, Sk} can be represented by a bipartite graph, as depicted in Fig. 2. The goal of the hitting set problem is to find the minimum cardinality h* of a set $V \subseteq U$ that satisfies $V \cap S_j \ne \emptyset$ for all j, that is

$$h^* = \min_{V \subseteq U:\ V \cap S_j \ne \emptyset,\ \forall j \in [k]} |V|. \tag{37}$$
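For small instances, the quantity h* in (37) can be found by exhaustive search, which also makes the exponential cost of the problem tangible. The sketch below is a plain brute-force search under that assumption. The function name `min_hitting_set` and the example instances are illustrative.

```python
from itertools import combinations

def min_hitting_set(universe, sets):
    """Return a smallest subset of `universe` that intersects every set in `sets`."""
    for size in range(len(universe) + 1):
        # Try all candidate subsets in order of increasing cardinality.
        for candidate in combinations(sorted(universe), size):
            if all(set(candidate) & s for s in sets):
                return set(candidate)
    return None   # unreachable when every set is a non-empty subset of the universe

hit_small = min_hitting_set({1, 2, 3}, [{1, 2}, {2, 3}])
hit_larger = min_hitting_set({1, 2, 3, 4}, [{1, 2}, {3, 4}, {2, 3}])
print(hit_small, hit_larger)
```

The search visits up to 2^m candidate subsets, so it is only usable for very small universes.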

Next, we construct the corresponding genotype-hiding problem from the given hitting set problem instance $(U, \mathcal{S})$. For any $i \in [m]$, $j \in [k]$ such that $i \in S_j$, let $b_{i,j}$ be a random variable which is independently and uniformly drawn from $\{0, 1\}$. In other words, each edge in the bipartite graph is associated with a random bit $b_{i,j}$ (see Fig. 2). Then, we define $X$ to be a sequence of length $n = m + k$ as follows. Let $X_i$ for $i \in [m]$ be a tuple of random bits associated with edges connected to node $v_i$, i.e.,

$$X_i = (b_{i,j_1}, \ldots, b_{i,j_r}), \tag{38}$$

where $\{j_1, \ldots, j_r\} = \{j : i \in S_j\}$. Next, let $X_{m+j}$ for $j \in [k]$ be

$$X_{m+j} = \bigoplus_{i \in S_j} b_{i,j}, \tag{39}$$

which can be viewed as a parity check bit over the edges connected to node Sj. In other words, the first m positions of the sequence are uniform and independently distributed symbols (a tuple of random bits), whereas the remaining k positions are parity check bits defined over the first m positions.

Note that the joint distribution $p(x) = p(x_1, \ldots, x_{m+k})$ is succinctly characterized by the random bits $b_{i,j}$ and the associated bipartite graph, and thus the description of the genotype-hiding problem can be generated in polynomial time with respect to $m$ and $k$. In the following, we refer to the above data generating distribution as $p(x; U, \mathcal{S})$, with respect to which the corresponding genotype-hiding problem is defined.
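As an illustration of the construction in (38) and (39), the sketch below draws one sequence from $p(x; U, \mathcal{S})$ with $U = [m]$. The function name `sample_sequence` and the returned dictionary of edge bits are assumptions introduced here for illustration only.

```python
import random

def sample_sequence(m, sets, rng):
    """Draw one X = (X_1, ..., X_{m+k}) from p(x; U, S) with U = [m]."""
    # One independent uniform bit b[i, j] per edge (i, j), i.e. per element i of S_j.
    bits = {(i, j): rng.randint(0, 1)
            for j, s in enumerate(sets, start=1) for i in sorted(s)}
    x = []
    # Positions 1..m: tuple of bits on the edges incident to element i, as in (38).
    for i in range(1, m + 1):
        x.append(tuple(bits[(i, j)]
                       for j, s in enumerate(sets, start=1) if i in s))
    # Positions m+1..m+k: parity of the bits on the edges incident to S_j, as in (39).
    for j, s in enumerate(sets, start=1):
        x.append(sum(bits[(i, j)] for i in s) % 2)
    return x, bits

# Toy instance (hypothetical): m = 4 elements and k = 3 sets.
sets = [{1, 2}, {2, 3}, {3, 4}]
x, bits = sample_sequence(4, sets, random.Random(0))
```

The first $m$ entries of `x` are tuples of independent bits, and the last $k$ entries are the parity bits that play the role of the sensitive positions.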

Theorem 3. Given a data generating distribution $p(x; U, \mathcal{S})$ for a sequence of length $n = m + k$ and sensitive positions $\mathcal{K} = \{m+1, \ldots, m+k\}$, finding the best ordering $(o_1, \ldots, o_{m+k})$ that minimizes the erasure rate of our mechanism (36) is NP-hard.

We provide a sketch of the proof here and defer the details to Appendix C. First, we note the key property of $p(x; U, \mathcal{S})$ that whether or not our mechanism erases the $o_i$-th position is deterministic given the variable ordering, as stated in the following lemma.

Lemma 2. Given a data generating distribution $p(x; U, \mathcal{S})$ for a sequence of length $n = m + k$ and sensitive positions $\mathcal{K} = \{m+1, \ldots, m+k\}$, the conditional sampling distribution of our privacy mechanism satisfies

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) \in \{0, 1\} \tag{40}$$

for all $i$, given any ordering $\pi = (o_1, \ldots, o_{m+k})$.

Proof: See Appendix B.

As a result of Lemma 2, the overall erasure rate of the privacy mechanism can be calculated simply by counting the number of erased positions. Note that, if $o_i \in \mathcal{K}$, then

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = 1, \tag{41}$$

regardless of the ordering, as we have previously shown. Thus, we need to compare only the erased indices in $[m] = [m+k] \setminus \mathcal{K}$ to find the best ordering.

Let $E_\pi$ be the set of erased indices in $[m]$ for a given ordering $\pi = (o_1, \ldots, o_n)$, i.e.,

$$E_\pi = \{i : y_i = *,\ i \in [m]\}, \tag{42}$$

where the distribution over $Y$ is determined by the privacy mechanism. Then, finding the best ordering corresponds to finding $\pi$ that leads to the minimum cardinality $e^*$ of the corresponding $E_\pi$:

$$e^* = \min_{\pi} |E_\pi|. \tag{43}$$

Intuitively, whether a particular index $i \in [m]$ is included in $E_\pi$ can be easily determined based on the bipartite graph representation of the underlying hitting set problem (see Fig. 2) as follows. The ordering $\pi = (o_1, \ldots, o_{m+k})$ specifies the order in which the $m$ nodes on the left-hand side of the graph, each with a corresponding $X_i$, are visited by the mechanism (disregarding the sensitive indices $o_i \notin [m]$, which are always erased). As we show in the proof of Lemma 2, when we visit the node $o_i \in [m]$, $X_{o_i}$ is erased if and only if there exists a node $j \in [k]$ on the right-hand side of the graph that is connected to $o_i$ and otherwise only to nodes (if any) that were previously visited and not erased. The presence of such a node $j$ indicates that the sensitive variable $X_{m+j}$ is directly revealed by $X_{o_i}$ (since the rest of the random bits contributing to $X_{m+j}$ are already released in $Y$ without erasure), while the absence of such $j$ indicates the existence of other positions, either erased or not yet released, that fully mask the correlation between $X_{o_i}$ and the sensitive positions.

Finally, we complete the reduction by showing that solving (43) also produces a solution for the hitting set problem (37), i.e., $e^* = h^*$. This is achieved by showing both that the set of erased indices $E_\pi$ is in fact a valid hitting set ($e^* \ge h^*$), and that, for any given hitting set $V$, there exists an ordering $\pi$ satisfying $|E_\pi| \le |V|$ ($e^* \le h^*$). A detailed proof is included in Appendix C.
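The graph-based erasure rule and the identity $e^* = h^*$ can be checked numerically on small instances. The sketch below is an illustration under stated assumptions: the function names are hypothetical, only the order of the $m$ non-sensitive positions is enumerated (the sensitive positions are always erased and do not affect the rule), and both searches are exhaustive and therefore exponential.

```python
from itertools import combinations, permutations

def erased_set(order, sets):
    """Elements of [m] erased when the non-sensitive positions are visited in `order`."""
    released, erased = set(), set()
    for i in order:
        # Erase i iff some S_j contains i and all of its other elements are
        # already released; then X_i would directly reveal the parity bit X_{m+j}.
        if any(i in s and s - {i} <= released for s in sets):
            erased.add(i)
        else:
            released.add(i)
    return erased

def min_erasures(m, sets):
    """e*: minimum number of erased non-sensitive positions over all orderings."""
    return min(len(erased_set(p, sets)) for p in permutations(range(1, m + 1)))

def min_hitting_set_size(m, sets):
    """h*: minimum cardinality of a hitting set, by exhaustive search."""
    for size in range(m + 1):
        if any(all(set(c) & s for s in sets)
               for c in combinations(range(1, m + 1), size)):
            return size

# Toy instance (hypothetical): m = 4 elements and k = 3 sets.
sets = [{1, 2}, {2, 3}, {3, 4}]
e_star = min_erasures(4, sets)
h_star = min_hitting_set_size(4, sets)
```

For example, visiting the elements in the order 2, 4, 1, 3 releases positions 2 and 4 and erases positions 1 and 3, which form a hitting set of minimum size.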

Since the hitting set problem is equivalent to the set cover problem and is well-known to be NP-hard, our reduction proves that finding the best ordering π for our privacy mechanism given any p (x) and 𝒦 is also NP-hard. We note that this result does not preclude the possibility that for a restricted class of genotype-hiding problems (e.g., with a structured p (x) defined by HMMs), one could still find an efficient polynomial-time algorithm for determining the optimal variable ordering, which remains an interesting open question.

VI. Robustness

In this section, we discuss the robustness of our mechanism with respect to the underlying data distribution. In our formulation of the privacy mechanism, the distribution (or the data generative model) p (x), from which the input genome sequence originated, is assumed to be known. In practice, one can only empirically estimate this distribution based on existing data resources, e.g., by obtaining maximum likelihood estimates of the model parameters based on a large collection of reference genomes in public data repositories. Consequently, the generative model used by the mechanism is bound to have deviations from the true generative process, both in terms of the limitations of the model as well as the noisy estimation of the parameters. These discrepancies can potentially lead to privacy leakage if the adversary has access to a more accurate distribution for the underlying input. Here, we study the potential privacy leakage under the worst-case scenario, where the adversary has access to the true underlying distribution. We bound the potential leakage as a function of the distance between the data distribution used by the mechanism and the true underlying distribution, suggesting that our mechanism is robust to small deviations in the noisy data distribution we expect to encounter in real-world use cases.

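We denote the noisy data distribution used by the mechanism by $q(x)$ and the true distribution by $p(x)$. The privacy mechanism constructs the sampling distribution $w(y \mid x)$ based on the available $q(x)$ such that the output $Y$ is independent of the sensitive genotypes $X_{\mathcal{K}}$ with respect to the joint distribution $q(x, y)$ induced by $q(x)$ and the mechanism $w(y \mid x)$, i.e.,

$$q(x_{\mathcal{K}}, y) = \sum_{x_{[n] \setminus \mathcal{K}}} q(x, y) = \sum_{x_{[n] \setminus \mathcal{K}}} w(y \mid x)\, q(x) = q(x_{\mathcal{K}})\, q(y). \tag{44}$$

Since $X$ is actually generated from $p(x)$ and not $q(x)$, we also define the true joint distribution $p(x, y)$ induced by $p(x)$ and the mechanism $w(y \mid x)$; note that the mechanism is still based on $q(x)$.

Then, we can measure the unforeseen privacy leakage due to the mismatch in data distribution by the mutual information $I(p(x_{\mathcal{K}}); p(y))$ between the sensitive genotypes and the output sequence with respect to $p(x, y)$, as follows:

$$\begin{aligned} I(p(x_{\mathcal{K}}); p(y)) &= \sum_{x_{\mathcal{K}}, y} p(x_{\mathcal{K}}, y) \log \frac{p(x_{\mathcal{K}}, y)}{p(y)\, p(x_{\mathcal{K}})} \\ &= \sum_{x_{\mathcal{K}}, y} p(x_{\mathcal{K}}, y) \log \frac{p(x_{\mathcal{K}}, y)\, q(x_{\mathcal{K}}, y)}{p(y)\, p(x_{\mathcal{K}})\, q(x_{\mathcal{K}}, y)} \\ &\overset{(a)}{=} \sum_{x_{\mathcal{K}}, y} p(x_{\mathcal{K}}, y) \log \frac{p(x_{\mathcal{K}}, y)\, q(x_{\mathcal{K}})\, q(y)}{p(y)\, p(x_{\mathcal{K}})\, q(x_{\mathcal{K}}, y)} \\ &= D(p(x_{\mathcal{K}}, y) \,\|\, q(x_{\mathcal{K}}, y)) - D(p(x_{\mathcal{K}}) \,\|\, q(x_{\mathcal{K}})) - D(p(y) \,\|\, q(y)), \end{aligned} \tag{45}$$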

where D(· ∥ ·) denotes relative entropy or equivalently Kullback-Leibler (KL) divergence, and (a) follows from (44). This leads to the following theorem.

Theorem 4. $I(p(x_{\mathcal{K}}); p(y)) \le D(p(x) \,\|\, q(x))$.

Proof: See Appendix D.

This result implies that the amount of privacy leakage due to the potential mismatch between the data distribution used by the mechanism and the true underlying generative process gracefully scales with the extent to which the two distributions diverge.
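The bound of Theorem 4 can be checked numerically in the smallest non-trivial setting. The sketch below assumes $n = 2$, a binary alphabet, and a single sensitive position (the first one); the mechanism is built from $q$ using the release probability of (46), while the leakage is evaluated under $p$. The function names and the two probability tables are hypothetical values chosen for illustration.

```python
from math import log

def conditional(joint, x1):
    """[P(x2 = 0 | x1), P(x2 = 1 | x1)] for a 2x2 table joint[x1][x2]."""
    total = sum(joint[x1])
    return [v / total for v in joint[x1]]

def leakage_and_divergence(p, q):
    """Return (I(X1; Y) under the true p, D(p || q)) in nats."""
    # Mechanism built from q: Y1 is always erased; Y2 = x2 with probability
    # min_u q(x2 | u) / q(x2 | x1), and Y2 is erased (None) otherwise.
    q_cond = [conditional(q, 0), conditional(q, 1)]
    joint = {}  # true joint distribution of (x1, y2)
    for x1 in (0, 1):
        for x2 in (0, 1):
            release = min(q_cond[0][x2], q_cond[1][x2]) / q_cond[x1][x2]
            joint[(x1, x2)] = joint.get((x1, x2), 0.0) + p[x1][x2] * release
            joint[(x1, None)] = joint.get((x1, None), 0.0) + p[x1][x2] * (1 - release)
    p_x1 = [sum(p[0]), sum(p[1])]
    p_y = {}
    for (x1, y), v in joint.items():
        p_y[y] = p_y.get(y, 0.0) + v
    mi = sum(v * log(v / (p_x1[x1] * p_y[y]))
             for (x1, y), v in joint.items() if v > 0)
    kl = sum(p[a][b] * log(p[a][b] / q[a][b])
             for a in (0, 1) for b in (0, 1) if p[a][b] > 0)
    return mi, kl

# Hypothetical tables: q is the model used by the mechanism, p is the truth.
q = [[0.4, 0.1], [0.2, 0.3]]
p = [[0.35, 0.15], [0.15, 0.35]]
mi, kl = leakage_and_divergence(p, q)
mi_matched, kl_matched = leakage_and_divergence(q, q)
```

When the mechanism's model matches the truth (`p == q`), the leakage vanishes; under the mismatch above, the leakage is positive but stays below the relative entropy.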

VII. Privacy Mechanism for Hidden Markov Models

Thus far, we considered the data generative model p (x) of the privacy mechanism to be an arbitrary distribution. Here, we address a particular form of p (x) of great interest for our application setting in genomics, namely the Li and Stephens model [27], which is based on a hidden Markov model. This model is widely adopted in genetics for a wide range of tasks that require a probabilistic model of the genome [35]. For this class of p (x), we propose an efficient algorithm to implement the privacy mechanism introduced in Section IV.

A. Review of hidden Markov models for genomes

The classical hidden Markov model (HMM) describing the distribution of personal genomes [27] is as follows. First, let $X = (X_1, X_2, \ldots, X_n)$ represent an individual’s (haplotype) genetic sequence of length $n$. Following standard practice in genetics, we adopt a binary alphabet $\mathcal{X} = \{0, 1\}$ for each element $X_i$, representing whether the observed nucleotide is identical to the one in the reference human genome (called the reference allele) or not (the alternative allele). In addition, we are given a reference dataset of $m$ personal genome sequences $\mathcal{H} = \{h_j : j = 1, \ldots, m\}$, where each sequence $h_j$ is of length $n$. The $i$-th coordinate of $h_j$ is denoted by $h_{i,j}$, which also takes a value in $\mathcal{X}$.

In this model, $X$ is viewed as a “mosaic” of reference sequences in $\mathcal{H}$ with potential substitution errors arising from mutations or experimental noise in sequencing. Formally, $X$ depends on a sequence of hidden states $\{S_i\}_{i=1}^{n}$ forming a Markov chain, where each $S_i$ takes an integer value in the range $\{1, \ldots, m\}$, representing an index into $\mathcal{H}$. Without loss of generality, we assume that the initial state $S_1$ is uniformly distributed over $\{1, \ldots, m\}$. The transition probability $\pi_{i,j}$ from state $i$ to state $j$ is set to $\frac{\epsilon}{m-1}$ for $i \neq j$ and to $1 - \epsilon$ for $i = j$. The parameter $\epsilon$ is often called the recombination probability; in the following we also use the term crossover probability to refer to this quantity.

Next, each $X_i$ is sampled based on the hidden state $S_i$ by copying the corresponding symbol in the selected reference sequence with a small probability of error. In other words, $X_i$ is equal to the symbol in the $i$-th position of $h_{S_i}$, flipped with error probability $\theta$. The overall data distribution $p(x)$ is fully specified by the tuple $(\mathcal{H}, \epsilon, \theta)$. We provide a graphical illustration of $p(x)$ in Fig. 3. In our work, we treat the parameters of the above model as given. In practice, these parameters are estimated from a large collection of reference genomes, e.g., including hundreds of thousands of individuals, which are available in public data repositories such as the UK Biobank [36].
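The generative process above can be sketched as a short sampler. This is an illustrative sketch rather than a reference implementation: the function name `sample_haplotype`, the 0-based state indices, and the toy reference panel are assumptions, and the code requires $m \ge 2$ whenever $\epsilon > 0$.

```python
import random

def sample_haplotype(H, eps, theta, rng):
    """Draw X = (X_1, ..., X_n) and the hidden states from the HMM (H, eps, theta).

    H is a list of m reference haplotypes, each a list of n bits.
    """
    m, n = len(H), len(H[0])
    s = rng.randrange(m)  # S_1 is uniform over the m reference sequences
    x, states = [], []
    for i in range(n):
        if i > 0 and rng.random() < eps:
            # Recombination: jump to one of the other m - 1 references uniformly,
            # so each off-diagonal transition has probability eps / (m - 1).
            s = rng.choice([t for t in range(m) if t != s])
        copied = H[s][i]
        # Copying error: flip the copied allele with probability theta.
        x.append(copied ^ 1 if rng.random() < theta else copied)
        states.append(s)
    return x, states

# Toy reference panel (hypothetical): m = 3 haplotypes of length n = 4.
H = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
rng = random.Random(1)
x_clean, states_clean = sample_haplotype(H, 0.0, 0.0, rng)
x_noisy, states_noisy = sample_haplotype(H, 0.3, 0.1, rng)
```

With $\epsilon = 0$ and $\theta = 0$ the sampled sequence is an exact copy of a single reference haplotype; larger values produce the mosaic with substitution errors described in the text.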

Fig. 3:

A graphical illustration of the HMM for genomes. The state space of the hidden states is $\{1, \ldots, m\}$, where each element corresponds to an index into the reference dataset $\{h_1, \ldots, h_m\}$ (each of length $n$). A Markov process $\{S_i\}_{i=1}^{n}$ indicates which reference sequence the user reads the data from at the $i$-th position. For each $i$, $X_i$ differs from the $i$-th position of $h_{S_i}$ with probability $\theta$, representing noise in the data. $\mathrm{BSC}_\theta$: binary symmetric channel with crossover probability $\theta$.

B. An efficient algorithm for HMMs

In this section, we propose an efficient algorithm to implement the privacy mechanism introduced in Section IV for p (x) based on a hidden Markov model (H, ϵ, θ) described in the previous section. The outline of our algorithm is provided in Algorithm 1.

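As seen in (17), the privacy mechanism determines the probability of erasing $x_i$ mainly based on the probability $p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})$. By employing a belief propagation approach akin to the well-known forward-backward algorithm [37], we compute $p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})$ for all $u \in \mathcal{X}^{|\mathcal{K}|}$ efficiently. The novelty of our algorithm is that it incorporates the stochasticity of the privacy mechanism in addition to that of the HMM.

First, note that it is sufficient to describe how to compute $p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})$ for all $u \in \mathcal{X}^{|\mathcal{K}|}$ and $i \in [n]$, which fully determines the distribution of $y_1, \ldots, y_n$ specified by our privacy mechanism, i.e.,

$$p(y_i \mid x_i, x_{\mathcal{K}}, y_{[i-1]}) = \begin{cases} \dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})}, & \text{if } y_i = x_i, \\[2ex] 1 - \dfrac{\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})}{p(x_i \mid x_{\mathcal{K}}, y_{[i-1]})}, & \text{if } y_i = *, \\[2ex] 0, & \text{otherwise.} \end{cases} \tag{46}$$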
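For very short sequences, the sequential rule in (46) can be evaluated exactly by enumeration, which is useful as a reference when testing an efficient implementation. The sketch below is such a brute-force check, not the HMM algorithm of this section: it assumes a strictly positive $p(x)$ over binary sequences, uses 0-based positions, encodes erasures as `None`, and all function names and the toy Markov chain are hypothetical.

```python
from itertools import product
from math import log

def run_mechanism(p, n, K):
    """Exact joint distribution of (x, y) under the sequential rule (46)."""
    joint = {(x, ()): w for x, w in p.items()}  # mass of (x, y_1..y_{i-1})
    all_u = list(product((0, 1), repeat=len(K)))
    for i in range(n):
        # num[(u, y, a)] = P(x_i = a, x_K = u, y); den[(u, y)] = P(x_K = u, y)
        num, den = {}, {}
        for (x, y), w in joint.items():
            u = tuple(x[k] for k in K)
            num[(u, y, x[i])] = num.get((u, y, x[i]), 0.0) + w
            den[(u, y)] = den.get((u, y), 0.0) + w

        def cond(v, y, a):
            d = den.get((v, y), 0.0)
            return num.get((v, y, a), 0.0) / d if d > 0 else 0.0

        nxt = {}
        for (x, y), w in joint.items():
            u, a = tuple(x[k] for k in K), x[i]
            # Release probability: min over u' of p(x_i | u', y) / p(x_i | u, y).
            release = min(cond(v, y, a) for v in all_u) / cond(u, y, a)
            for y_i, mass in ((a, w * release), (None, w * (1 - release))):
                if mass > 0:
                    key = (x, y + (y_i,))
                    nxt[key] = nxt.get(key, 0.0) + mass
        joint = nxt
    return joint

def mutual_information(joint, K):
    """I(X_K; Y) in nats, computed from the joint distribution of (x, y)."""
    pk, py, pky = {}, {}, {}
    for (x, y), w in joint.items():
        u = tuple(x[k] for k in K)
        pk[u] = pk.get(u, 0.0) + w
        py[y] = py.get(y, 0.0) + w
        pky[(u, y)] = pky.get((u, y), 0.0) + w
    return sum(w * log(w / (pk[u] * py[y])) for (u, y), w in pky.items() if w > 0)

# Toy example (hypothetical): binary Markov chain of length 3, sensitive position 0.
stay = 0.8
p = {x: 0.5 * (stay if x[0] == x[1] else 1 - stay)
           * (stay if x[1] == x[2] else 1 - stay)
     for x in product((0, 1), repeat=3)}
joint = run_mechanism(p, 3, (0,))
leak = mutual_information(joint, (0,))
released = sum(w * sum(yi is not None for yi in y) for (x, y), w in joint.items())
```

In this example the sensitive position is always erased, the mutual information `leak` is zero up to floating-point error, and `released` gives the expected number of positions released without erasure.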

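We begin by expressing $p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})$ as

$$\begin{aligned} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) &= \sum_{s_i} p(s_i \mid x_{\mathcal{K}} = u, y_{[i-1]})\, p(x_i \mid s_i, x_{\mathcal{K}} = u, y_{[i-1]}) \\ &= \sum_{s_i, s_{i-1}} p(s_{i-1} \mid x_{\mathcal{K}} = u, y_{[i-1]})\, p(s_i \mid s_{i-1}, x_{\mathcal{K}} = u)\, p(x_i \mid s_i). \end{aligned} \tag{47}$$

Note that

$$\begin{aligned} p(s_i \mid s_{i-1}, x_{\mathcal{K}} = u) &= \frac{p(s_i, s_{i-1}, x_{\mathcal{K}} = u)}{p(s_{i-1}, x_{\mathcal{K}} = u)} \\ &= \frac{p(s_{i-1} \mid x_{\mathcal{K}_i^-} = u^-)\, p(s_i \mid s_{i-1})\, p(x_{\mathcal{K}_i^+} = u^+ \mid s_i)}{p(s_{i-1} \mid x_{\mathcal{K}_i^-} = u^-)\, p(x_{\mathcal{K}_i^+} = u^+ \mid s_{i-1})} \\ &= \frac{p(s_i \mid s_{i-1})\, p(x_{\mathcal{K}_i^+} = u^+ \mid s_i)}{p(x_{\mathcal{K}_i^+} = u^+ \mid s_{i-1})} \\ &= \frac{p(s_i \mid s_{i-1})\, p(x_{\mathcal{K}_i^+} = u^+ \mid s_i)}{\sum_{s_i'} p(s_i' \mid s_{i-1})\, p(x_{\mathcal{K}_i^+} = u^+ \mid s_i')}, \end{aligned} \tag{48}$$

where $\mathcal{K}_i^- \triangleq \mathcal{K} \cap \{1, \ldots, i-1\}$, $\mathcal{K}_i^+ \triangleq \mathcal{K} \cap \{i, \ldots, n\}$, and $u^-$ and $u^+$ are the corresponding values of $x_{\mathcal{K}_i^-}$ and $x_{\mathcal{K}_i^+}$ specified by $u$.

As $p(x_i \mid s_i)$ and $p(s_i \mid s_{i-1})$ are directly given by the HMM, we need only to consider how to compute the two terms $p(s_{i-1} \mid x_{\mathcal{K}} = u, y_{[i-1]})$ and $p(x_{\mathcal{K}_i^+} = u^+ \mid s_i)$. To simplify our notation, we introduce the following variables to represent these terms:

$$\psi^{(i)}(u, s_i) \triangleq p(s_i \mid x_{\mathcal{K}} = u, y_1, \ldots, y_i), \qquad \gamma^{(i)}(u, s_i) \triangleq p(x_{\mathcal{K}_i^+} = u^+ \mid s_i).$$

With $\psi^{(i)}(u, s_i)$ and $\gamma^{(i)}(u, s_i)$ for a given position $i$, we can calculate (47) as

$$p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) = \sum_{s_{i-1}} \sum_{s_i} \psi^{(i-1)}(u, s_{i-1})\, \frac{p(s_i \mid s_{i-1})\, \gamma^{(i)}(u, s_i)\, p(x_i \mid s_i)}{\sum_{s_i'} p(s_i' \mid s_{i-1})\, \gamma^{(i)}(u, s_i')}. \tag{49}$$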

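First, note that $\gamma^{(i)}(u, s_i)$ can be recursively computed in the same manner as calculating the backward probabilities in the forward-backward algorithm, as described below:

Initialization: We initialize $\gamma^{(n)}(u, s_n)$ by

$$\gamma^{(n)}(u, s_n) = \begin{cases} p(x_n = u_n \mid s_n), & n \in \mathcal{K}, \\ 1, & n \notin \mathcal{K}. \end{cases} \tag{50}$$

Iterations: For $i = n - 1, \ldots, 1$, we compute $\gamma^{(i)}(u, s_i)$ as

$$\gamma^{(i)}(u, s_i) = \begin{cases} \sum_{s_{i+1}} p(x_i = u_i \mid s_i)\, p(s_{i+1} \mid s_i)\, \gamma^{(i+1)}(u, s_{i+1}), & i \in \mathcal{K}, \\ \sum_{s_{i+1}} p(s_{i+1} \mid s_i)\, \gamma^{(i+1)}(u, s_{i+1}), & i \notin \mathcal{K}. \end{cases} \tag{51}$$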
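The backward recursion in (50) and (51) can be sketched directly for a fixed $u$. The code below is a minimal illustration under stated assumptions: positions and states are 0-based, `K` is a list of sensitive positions with `u` the matching tuple of target values, the function name `backward_gamma` and the toy panel are hypothetical, and $m \ge 2$ is required. Each call costs $\mathcal{O}(nm^2)$ operations.

```python
from itertools import product

def backward_gamma(H, eps, theta, K, u):
    """gamma[i][s] = P(x_{K and >= i} = u^+ | S_i = s), following (50)-(51)."""
    m, n = len(H), len(H[0])
    target = dict(zip(K, u))

    def emit(i, s):
        # P(x_i = u_i | S_i = s) if position i is sensitive, and 1 otherwise
        # (non-sensitive positions are marginalized out).
        if i not in target:
            return 1.0
        return 1 - theta if H[s][i] == target[i] else theta

    def trans(s, t):
        return 1 - eps if s == t else eps / (m - 1)

    gamma = [[0.0] * m for _ in range(n)]
    for s in range(m):                      # initialization, as in (50)
        gamma[n - 1][s] = emit(n - 1, s)
    for i in range(n - 2, -1, -1):          # iterations, as in (51)
        for s in range(m):
            future = sum(trans(s, t) * gamma[i + 1][t] for t in range(m))
            gamma[i][s] = emit(i, s) * future
    return gamma

# Toy panel (hypothetical): m = 3 references of length n = 3, sensitive positions 0 and 2.
H = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]
K = [0, 2]
# Sanity check: sum over u of P(x_K = u) = sum_u sum_s (1/m) * gamma^{(1)}(u, s) = 1.
total = sum(sum(backward_gamma(H, 0.1, 0.05, K, u)[0]) / len(H)
            for u in product((0, 1), repeat=len(K)))
```

Averaging $\gamma^{(1)}(u, s_1)$ over the uniform initial state gives $p(x_{\mathcal{K}} = u)$, so summing over all $u$ should return one, which serves as a quick consistency check.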

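Next, to efficiently compute $\psi^{(i)}(u, s_i)$ for $i \in [n]$, we analogously adopt the following iterative steps.

Initialization: $\psi^{(1)}(u, s_1)$ is initialized by

$$\psi^{(1)}(u, s_1) \propto p(s_1 \mid x_{\mathcal{K}} = u)\, p(y_1 \mid s_1, x_{\mathcal{K}} = u), \tag{52}$$

where $p(s_1 \mid x_{\mathcal{K}} = u)$ can be calculated by (48) given $\gamma^{(1)}(u, s_1)$, and $p(y_1 \mid s_1, x_{\mathcal{K}} = u)$ is given by our mechanism as shown in (46).

Iterations: Using Bayes’ rule, we can express $\psi^{(i)}(u, s_i)$ as

$$\psi^{(i)}(u, s_i) = p(s_i \mid x_{\mathcal{K}} = u, y_{[i]}) \propto p(s_i \mid x_{\mathcal{K}} = u, y_{[i-1]})\, p(y_i \mid s_i, x_{\mathcal{K}} = u, y_{[i-1]}), \tag{53}$$

where

$$p(s_i \mid x_{\mathcal{K}} = u, y_{[i-1]}) = \sum_{s_{i-1}} \psi^{(i-1)}(u, s_{i-1})\, p(s_i \mid s_{i-1}, x_{\mathcal{K}} = u), \tag{54}$$

and

$$p(y_i \mid s_i, x_{\mathcal{K}} = u, y_{[i-1]}) = \sum_{x_i} p(x_i \mid s_i)\, p(y_i \mid x_i, x_{\mathcal{K}} = u, y_{[i-1]}). \tag{55}$$

Therefore, $\psi^{(i)}(u, s_i)$ can be computed based on $\psi^{(i-1)}(u, s_{i-1})$. We note that the probability $p(s_i \mid s_{i-1}, x_{\mathcal{K}} = u)$ can be calculated using $\gamma^{(i)}(u, s_i)$ as shown in (48), and $p(y_i \mid x_i, x_{\mathcal{K}} = u, y_{[i-1]})$ is given by our mechanism as shown in (46). Using this recurrence relation, $\psi^{(i)}(u, s_i)$ for all $i \in [n]$ can be computed.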

Analogous to the forward-backward algorithm, our algorithm has polynomial computational complexity of $\mathcal{O}(nm^2)$ for a fixed $u$, with respect to the sequence length $n$ and the number of reference sequences $m$. Clearly, $\min_{u \in \mathcal{X}^{|\mathcal{K}|}} p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})$ can be easily obtained once $p(x_i \mid x_{\mathcal{K}} = u, y_{[i-1]})$ has been computed for all $u$. This overhead introduces a factor of $2^{|\mathcal{K}|}$ in the computational complexity, but we expect $|\mathcal{K}|$ to be a small constant in practice (e.g., less than 10); since genotype correlation is predominantly local, the user may apply our mechanism to local regions of the genome of a permissive length, each including only a few sensitive positions.

VIII. Simulations

In this section, we provide insights into the empirical performance of our privacy mechanism for hidden Markov models (HMMs) on simulated datasets. We randomly generated 100 haplotype sequences of length 100, which together with the choices of error probability θ and crossover probability ϵ induce p (x), as described in Section VII-A. For simplicity, we suppose the sensitive position 𝒦={1}.

We first illustrate the privacy-utility trade-off of the heuristic window-based erasure approach described in the Introduction. In particular, this approach erases the first ω positions of the sequence to hide information about the sensitive position (the first position). The results are shown in Figure 4. The erasure rate is defined by the size of the erased window over the sequence length, i.e., ω/n (note n = 100). The privacy leakage is measured by the mutual information between the released positions and the sensitive position X1, normalized by the entropy of X1, i.e., I(X1; X[n]\[ω])/H(X1). We also show the expected erasure rate of our proposed privacy mechanism for comparison, whose privacy leakage is strictly zero by design. We observe that the window-erasure approach requires a high erasure rate (around 0.3) to keep the privacy leakage close to zero, whereas our mechanism achieves a considerably smaller erasure rate (around 0.12) while providing perfect privacy. On the other hand, choosing a window size for the baseline approach to match the erasure rate of our mechanism leads to a considerable privacy leakage.
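The normalized leakage of the window-based baseline, $I(X_1; X_{[n] \setminus [\omega]}) / H(X_1)$, can be computed exactly for short sequences by enumerating all $2^n$ sequences. The sketch below is an illustration only and is not the code used for Figure 4: it assumes a small hypothetical reference panel with $n = 6$, the function names are invented here, and the enumeration is exponential in $n$.

```python
from itertools import product
from math import log

def hmm_prob(x, H, eps, theta):
    """P(X = x) under the HMM (H, eps, theta), via the forward algorithm."""
    m = len(H)
    emit = lambda i, s: 1 - theta if H[s][i] == x[i] else theta
    alpha = [emit(0, s) / m for s in range(m)]  # uniform initial state
    for i in range(1, len(x)):
        total = sum(alpha)
        # Stay with probability 1 - eps, else move to one of the other m - 1 states.
        alpha = [emit(i, t) * ((1 - eps) * alpha[t]
                               + eps / (m - 1) * (total - alpha[t]))
                 for t in range(m)]
    return sum(alpha)

def window_leakage(H, eps, theta, omega):
    """I(X_1; X_{omega+1..n}) / H(X_1) when the first omega positions are erased."""
    n = len(H[0])
    p1, prest, pjoint = {}, {}, {}
    for x in product((0, 1), repeat=n):
        w = hmm_prob(x, H, eps, theta)
        a, rest = x[0], x[omega:]
        p1[a] = p1.get(a, 0.0) + w
        prest[rest] = prest.get(rest, 0.0) + w
        pjoint[(a, rest)] = pjoint.get((a, rest), 0.0) + w
    mi = sum(w * log(w / (p1[a] * prest[r]))
             for (a, r), w in pjoint.items() if w > 0)
    h1 = -sum(w * log(w) for w in p1.values() if w > 0)
    return mi / h1

# Toy panel (hypothetical): m = 3 references of length n = 6.
H = [[0, 0, 1, 0, 1, 1], [1, 0, 0, 1, 0, 1], [1, 1, 0, 0, 1, 0]]
leak = [window_leakage(H, 0.1, 0.01, omega) for omega in range(1, 7)]
```

By the data processing inequality the leakage is non-increasing in the window size $\omega$, and it reaches zero only when the whole sequence is erased or the remaining positions are independent of $X_1$.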

Fig. 4:

Privacy-utility trade-off of the window-based erasure approach on simulated HMM data with m = 100, n = 100, 𝒦 = {1}, crossover probability ϵ = 0.1 and error probability θ = 0.01. Erasure rate denotes the size of window that is erased normalized by the sequence length n. Privacy leakage denotes the mutual information between the released data and the sensitive symbol normalized by the entropy of the sensitive symbol.

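Algorithm 1 Mechanism for hiding sensitive genotypes in $X$
Require: Genome sequence $X = (X_1, \ldots, X_n)$ from an HMM with parameters $(\mathcal{H}, \epsilon, \theta)$, and indices of sensitive positions $\mathcal{K} \subseteq [n]$
Ensure: Masked genome sequence $Y = (Y_1, \ldots, Y_n)$, such that $I(X_{\mathcal{K}}; Y) = 0$ and $Y_i \in \{X_i, *\}$ for all $i \in [n]$
  Initialize $\gamma^{(n)}(u, s_n)$ according to (50)
  for $i = n - 1, \ldots, 1$ do
    for $u \in \mathcal{X}^{|\mathcal{K}|}$ do
      Compute $\gamma^{(i)}(u, s_i)$ according to (51)
      Compute $p(s_i \mid s_{i-1}, x_{\mathcal{K}} = u)$ according to (48)
    end for
  end for
  Initialize $\psi^{(1)}(u, s_1)$ according to (52)
  for $i = 2, \ldots, n$ do
    Calculate the erasure probability for $Y_i$ using (46)
    Generate $Y_i \in \{X_i, *\}$ according to the erasure probability
    for $u \in \mathcal{X}^{|\mathcal{K}|}$ do
      Compute $\psi^{(i)}(u, s_i)$ according to (53)
    end for
  end for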

We next evaluate our privacy mechanism over a range of different parameter settings. We consider θ ∈ {0.01, 0.05} and vary ϵ from 0.01 to 0.5, both of which reflect reasonable ranges of the parameters for the scale of the dataset we simulated. We provide each instance of p(x) to our privacy mechanism with 𝒦 = {1} to calculate its achievable rate R (i.e., one minus the expected erasure rate). Figure 5 shows the comparison between the rate of our mechanism and the upper bound we derived in Section III. The results suggest that the performance of our mechanism shows varying degrees of closeness to the theoretical upper bound depending on the characteristics of the underlying data distribution. In particular, for higher values of ϵ, representing the regime where the hidden Markov model mixes faster and thus the correlation with the sensitive position decays more quickly, the rate of our mechanism is nearly identical to the upper bound. On the other hand, for lower values of ϵ, which lead to stronger correlations in the sequence, we observed that the gap between our mechanism and the upper bound can grow considerably large. Note that this does not necessarily imply that our mechanism achieves a significantly suboptimal performance, given that the upper bound we considered is not tight in general. We also note that the rate of our mechanism is generally higher when the error probability is larger (θ = 0.05 vs 0.01), which agrees with the intuition that higher levels of noise in the data distribution lower the requirement for hiding sensitive information, thus leading to lower erasure probabilities and higher rates as a result.

Fig. 5:

Comparison of our mechanism and the upper bound on simulated HMM data with m = 100, n = 100, 𝒦 = {1} and different choices of crossover probability ϵ and error probability θ.

To gain further insights into the noticeable gap between the upper bound and our mechanism in the small ϵ regime, we additionally implemented a linear programming (LP) approach for directly obtaining the optimal mechanism w (yx). However, since the size of LP grows exponentially with the length of the sequence n, we could only evaluate this approach for small problem instances due to numerical instability. We took the same simulated data as before and truncated each reference haplotype down to the first six positions to obtain a tractable LP instance for this experiment (n = 6).

The rate comparisons of our privacy mechanism, LP-based optimal mechanism, and the upper bound in this setting are shown in Fig. 6. As expected, we observed that the optimal rate lies between the upper bound and the rate of our mechanism, demonstrating that the gap between the optimal rate and the rate of our mechanism is indeed smaller than the ostensible gap suggested by the upper bound.

Fig. 6:

Comparison of our mechanism, the upper bound and the optimal rate based on a linear programming (LP) solution on simulated HMM data with m = 100, n = 6, 𝒦 = {1} based on a truncated version of the dataset used in Fig. 5.

Taken together, these results suggest that, although the performance of our mechanism is often quite close to the upper bound, the difference between the maximum achievable rate and the rate of our mechanism can vary based on the properties of the data distribution. We note that it is yet unknown whether there exists a privacy mechanism that can be as efficiently constructed as our mechanism while achieving performance that is closer to the optimal rate. Closing this performance gap both by devising enhanced privacy mechanisms that achieve higher rates and by developing tighter upper bounds are important directions for future work.

IX. Conclusion and Future Work

In this paper, we introduced the genotype hiding problem and proposed an information-theoretic privacy mechanism as a solution. We analyzed the theoretical properties of the mechanism, and proposed an efficient algorithmic implementation of the mechanism for hidden Markov models, a main model of interest for our application in genomics.

It is worth noting that our mechanism does not rule out the possibility of genotype reconstruction attacks that leverage (i) alternative genetic sequence models and imputation strategies or (ii) a larger reference dataset with which HMM parameters could be more accurately estimated. However, our model based on HMMs is consistent with the state-of-the-art techniques for genotype imputation, which is a relatively mature field. In addition, given the high cost of amassing large-scale genomic data, it would be a significant challenge for an attacker to gain access to a larger dataset than those in the public realm. As such, our mechanism could be thought of as providing privacy protection according to the best knowledge of the field. Our results in Section VI show that any unforeseen privacy leakage arising from the discrepancies in the data distribution scales gracefully with the relative entropy between the true distribution and the one used by the mechanism.

There are several key directions for future work. Our work focused on hiding the content of the sensitive positions, yet a potential concern remains regarding information revealed by the choice of sensitive positions 𝒦. Any approach relying on erasures for privacy protection may inevitably leak information about 𝒦, since preventing such leakage would generally require erasures to be consistently applied throughout the sequence, which is highly costly in terms of utility if only a small fraction is considered sensitive. An interesting extension of our work is then to relax the faithfulness condition when hiding the positions is deemed important. A promising approach is to re-sample the erased positions from the data distribution as a post-processing step to the mechanism presented in this paper. That said, we note that in our application setting, 𝒦 is neither necessarily nor solely decided by the sequence, as it may be determined based on family history of diseases or curated disease associations in public repositories. Thus, we believe the mechanism presented in this work is directly applicable in many practical scenarios.

Next, although we focused on achieving perfect privacy (with respect to the given data distribution), it may be useful in practice to consider a relaxed notion such as local differential privacy [38]. This may give the user the ability to determine a more desirable trade-off between the level of privacy and the amount of data to be erased. From an analytical standpoint, this direction would also lead to useful insights about the achievable points along the privacy-utility trade-off curve defined by the genotype-hiding problem with a relaxed notion of privacy, to complement the results in this work.

Furthermore, it would be interesting to explore the generalization of our efficient implementation strategies to a broader class of data generative models beyond HMMs, which may allow similar mechanisms to be employed to protect sensitive data in other domains.

Lastly, we plan to study the performance of our privacy mechanism on real genetic datasets and release the software implementation of our mechanism for the genetics community in the near future.

Growing threats to genetic privacy are necessitating principled strategies for protecting the privacy of individuals while maintaining the utility of data sharing. Our work illustrates how such a strategy could be designed from an information-theoretic perspective to enable selective disclosure of personal genomic data. Our methodology is broadly applicable to other data sharing scenarios involving sensitive data with complex correlation structure. We hope that our work will help spur the development of a wide range of information-theoretic tools for modelling and preserving private genomic information.

Fig. 1:

An illustration of (n, 𝒦) genotype-hiding privacy mechanism. The mechanism takes as input a genetic sequence along with a set of sensitive positions and outputs a masked sequence with erasures. We require the faithfulness and privacy conditions to be satisfied, and the goal is to minimize the expected number of erasures in the output.

Acknowledgments

A preliminary version of this paper was presented at IEEE International Symposium on Information Theory, Los Angeles, CA, USA, 2020. The work of F. Ye and S. El Rouayheb was supported in part by NSF Grant CCF 1817635. The work of F. Ye and H. Cho was supported in part by NIH DP5 OD029574-01 and by Eric and Wendy Schmidt through the Schmidt Fellows program at Broad Institute.

Biographies

Fangwei Ye (Member, IEEE) received the B.Eng. degree in Information Engineering from Southeast University, in 2013, and the Ph.D. degree from Department of Information Engineering, The Chinese University of Hong Kong, in 2018. From 2018 to 2020, he was a Post-Doctoral Associate with Department of Electrical and Computer Engineering, Rutgers University. He is now with the Broad Institute of MIT and Harvard. His research interests include information theory and its applications to privacy, bioinformatics and coding opportunities in learning.

Hyunghoon Cho (Member, IEEE) received the B.S. and M.S. degrees in computer science from Stanford University, in 2013, and the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, in 2019. He is currently a Schmidt Fellow with the Broad Institute of MIT and Harvard. His research interests include computational biology and biomedical data privacy. He is a recipient of the NIH Director’s Early Independence Award.

Salim El Rouayheb (Senior Member, IEEE) received the Diploma degree in electrical engineering from the Faculty of Engineering, Lebanese University, Roumieh, Lebanon, in 2002, the M.S. degree from the American University of Beirut, Lebanon, in 2004, and the Ph.D. degree in electrical engineering from Texas A&M University, College Station, in 2009. He is currently an Associate Professor with the ECE Department, Rutgers University, New Brunswick, NJ, USA. He was a Postdoctoral Research Fellow with UC Berkeley from 2010 to 2011, and a Research Scholar with Princeton University from 2012 to 2013. He was an Assistant Professor with the ECE Department, Illinois Institute of Technology from 2013 to 2017. His research interests are in the broad area of information theory and coding theory with applications to reliability, security, and privacy in distributed systems. He is a recipient of the NSF Career Award.

Appendix A

Proof of Corollary 1

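We prove that the sufficient condition for optimality holds in the Markov chain case. We give an inductive proof by showing that, for a given $x_i$,

$$u \in \arg\min_{u'} p(x_i \mid x_{\mathcal{K}} = u', y_{[j-1]}) \tag{56}$$

implies

$$u \in \arg\min_{u'} p(x_i \mid x_{\mathcal{K}} = u', y_{[j]}) \tag{57}$$

for $j = 1, \ldots, i - 1$. For each $j$, we consider the following two cases ($y_j \neq *$ and $y_j = *$):

  1. If $y_j \neq *$, then we have
    $$\begin{aligned} p(x_i \mid x_{\mathcal{K}}, y_{[j]}) &= \sum_{x_j} p(x_j \mid x_{\mathcal{K}}, y_{[j]})\, p(x_i \mid x_{\mathcal{K}}, y_{[j]}, x_j) \\ &\overset{(a)}{=} \sum_{x_j} \mathbb{1}\{x_j = y_j\}\, p(x_i \mid x_{\mathcal{K}}, y_{[j]}, x_j) \\ &\overset{(b)}{=} \sum_{x_j} \mathbb{1}\{x_j = y_j\}\, p(x_i \mid x_j), \end{aligned} \tag{58}$$
    where (a) follows because $Y_j$ can either be $X_j$ or $*$, and (b) follows from Markovity. In this case, $p(x_i \mid x_{\mathcal{K}} = u, y_{[j]})$ is indeed independent of $u$, which means
    $$\arg\min_{u} p(x_i \mid x_{\mathcal{K}} = u, y_{[j]}) = \mathcal{X}^{|\mathcal{K}|}, \tag{59}$$
    so the statement is trivially true.
  2. If $y_j = *$, then we have
    $$\begin{aligned} p(x_i \mid x_{\mathcal{K}}, y_{[j]}) &= \sum_{x_j} p(x_j \mid x_{\mathcal{K}}, y_{[j]})\, p(x_i \mid x_{\mathcal{K}}, y_{[j]}, x_j) \\ &\overset{(a)}{=} \sum_{x_j} p(x_j \mid x_{\mathcal{K}}, y_{[j]})\, p(x_i \mid x_j) \\ &\overset{(b)}{\propto} \sum_{x_j} \Big\{ p(x_j \mid x_{\mathcal{K}}, y_{[j-1]}) - \min_{u} p(x_j \mid x_{\mathcal{K}} = u, y_{[j-1]}) \Big\}\, p(x_i \mid x_j) \\ &= \sum_{x_j} p(x_i \mid x_j)\, p(x_j \mid x_{\mathcal{K}}, y_{[j-1]}) - \sum_{x_j} p(x_i \mid x_j) \min_{u} p(x_j \mid x_{\mathcal{K}} = u, y_{[j-1]}) \\ &= p(x_i \mid x_{\mathcal{K}}, y_{[j-1]}) - \sum_{x_j} p(x_i \mid x_j) \min_{u} p(x_j \mid x_{\mathcal{K}} = u, y_{[j-1]}), \end{aligned} \tag{60}$$
    where (a) follows from Markovity, and (b) follows from Bayes’ rule and our privacy mechanism (17). Since the second term on the right-hand side of (60) is independent of $x_{\mathcal{K}}$, we obtain
    $$\arg\min_{u} p(x_i \mid x_{\mathcal{K}} = u, y_{[j]}) = \arg\min_{u} p(x_i \mid x_{\mathcal{K}} = u, y_{[j-1]}). \tag{61}$$

For both cases, we have verified that the sufficient condition holds, which completes the proof.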

Appendix B

Proof of Lemma 2

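We will prove (40) by induction. First, consider the base case:

$$w(y_{o_1} = * \mid x_{o_1}, x_{\mathcal{K}}) = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_1} \mid x_{\mathcal{K}})}{p(x_{o_1} \mid x_{\mathcal{K}})}. \tag{62}$$

From the previous discussion, we know that if $o_i \in \mathcal{K}$, then

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = 1, \tag{63}$$

so without loss of generality, we assume that

$$o_1 \notin \mathcal{K} = \{m+1, \ldots, m+k\}. \tag{64}$$

Since

$$x_{\mathcal{K}} = \Big\{ \bigoplus_{i : i \in S_j} b_{i,j} : j \in [k] \Big\}, \tag{65}$$

and

$$x_{o_1} = \{ b_{o_1,j} : o_1 \in S_j \} \tag{66}$$

by definition, we can see that if there exists some $j$ such that $S_j = \{o_1\}$, then $b_{o_1,j} \in x_{\mathcal{K}}$ and $b_{o_1,j} \in x_{o_1}$. In this case, we can always find some assignments such that

$$\min_{x_{\mathcal{K}}} p(x_{o_1} \mid x_{\mathcal{K}}) = 0, \tag{67}$$

implying that

$$w(y_{o_1} = * \mid x_{o_1}, x_{\mathcal{K}}) = 1. \tag{68}$$

If there is no $j$ such that $S_j = \{o_1\}$, each $\bigoplus_{i : i \in S_j} b_{i,j}$ constituting $x_{\mathcal{K}}$ is a binary summation of some $b_{o_1,j}$ and (independent) random bits $b_{i,j}$ such that $i \neq o_1$, where the latter render the result uniformly random. This means that $X_{\mathcal{K}}$ is independent of $X_{o_1}$, and thus we have

$$w(y_{o_1} = * \mid x_{o_1}, x_{\mathcal{K}}) = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_1} \mid x_{\mathcal{K}})}{p(x_{o_1} \mid x_{\mathcal{K}})} = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_1})}{p(x_{o_1})} = 0, \tag{69}$$

for all $x_{o_1}$ and $x_{\mathcal{K}}$.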

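Assume the statement is true for $o_1, \ldots, o_{i-1}$. Then for $o_i$, note that

$$p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}}) = \frac{p(x_{o_i} \mid x_{\mathcal{K}})\, p(y_{o_{[i-1]}} \mid x_{o_i}, x_{\mathcal{K}})}{p(y_{o_{[i-1]}} \mid x_{\mathcal{K}})}. \tag{70}$$

By letting

$$\tilde{\mathcal{E}}_i = \{ o_j : y_{o_j} \neq *,\ j \le i - 1 \}, \tag{71}$$

(70) can be written as

$$p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}}) = \frac{p(x_{o_i} \mid x_{\mathcal{K}})\, p(x_{\tilde{\mathcal{E}}_i} \mid x_{o_i}, x_{\mathcal{K}})}{p(x_{\tilde{\mathcal{E}}_i} \mid x_{\mathcal{K}})} = p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i}, x_{\mathcal{K}}), \tag{72}$$

because of the inductive assumption that the decisions whether to erase $y_{o_1}, \ldots, y_{o_{i-1}}$ are deterministic.

Hence, we have

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}})}{p(x_{o_i} \mid x_{\mathcal{K}}, y_{o_{[i-1]}})} = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i}, x_{\mathcal{K}})}{p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i}, x_{\mathcal{K}})}. \tag{73}$$

Analogous to our argument for the base case, if there exists some $j$ such that $S_j \subseteq \tilde{\mathcal{E}}_i \cup \{o_i\}$, then one can determine $b_{o_i,j} \in x_{o_i}$ from $x_{\tilde{\mathcal{E}}_i}$ and $x_{\mathcal{K}}$, and thus

$$\min_{x_{\mathcal{K}}} p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i}, x_{\mathcal{K}}) = 0, \tag{74}$$

implying that

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = 1. \tag{75}$$

If there is no such $j$, each $x_j$ for $j \in \mathcal{K}$ is the binary summation of some $b_{o_i,j} \in x_{o_i}$ and some independent random bits $b_{i',j}$ such that $i' \neq o_i$, which again guarantees that $X_{\mathcal{K}}$ is independent of $X_{o_i}$ conditioned on $X_{\tilde{\mathcal{E}}_i}$. Thus, we have

$$w(y_{o_i} = * \mid x_{o_i}, x_{\mathcal{K}}, y_{o_{[i-1]}}) = 1 - \frac{\min_{x_{\mathcal{K}}} p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i})}{p(x_{o_i} \mid x_{\tilde{\mathcal{E}}_i})} = 0, \tag{76}$$

for all $x_{o_i}$, $x_{\mathcal{K}}$ and $y_{o_{[i-1]}}$, which completes the inductive proof.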

Appendix C

Proof of Theorem 3

First, let us show that $e^* \ge h^*$ by showing that $E_\pi$ is a hitting set for any order $\pi$, i.e., $E_\pi \cap S_j \neq \emptyset$ for all $j \in [k]$. We prove it by contradiction. Suppose that there exists some $S_j$ such that $E_\pi \cap S_j = \emptyset$, which implies that $S_j \subseteq [m] \setminus E_\pi$. Assume that $S_j = \{i_1, \ldots, i_t\}$, where $i_t$ is the last index visited under the given order $\pi$. Then, when we run our mechanism for $i_t$, since $i_1, \ldots, i_{t-1}$ are all visited and not erased, by recalling the proof of Lemma 2, we know that $\tilde{\mathcal{E}}_{i_t} \supseteq \{i_1, \ldots, i_{t-1}\}$, so we have $S_j \subseteq \tilde{\mathcal{E}}_{i_t} \cup \{i_t\}$. This means that $y_{i_t}$ is erased, i.e., $i_t \in E_\pi$, which contradicts our assumption that $E_\pi \cap S_j = \emptyset$.

Next, we show that $e^* \le h^*$ by showing that for any given hitting set $V$, there exists an order $\pi$ such that $|E_\pi| \le |V|$. Suppose $V$ is a hitting set and $|V| = h$, i.e., $V \cap S_j \neq \emptyset$ for all $j \in [k]$. Consider an order $\pi$ such that $o_i \notin V \cup [m+1 : m+k]$ for $i \le m - h$ and $o_i \in V$ for $i \in [m-h+1 : m]$, i.e., visiting the indices in the complement of $V$ before reaching $V$. When we visit $o_i$ such that $i \le m - h$ (or $o_i \in [m] \setminus V$), by the assumption that $V \cap S_j \neq \emptyset$ for all $j$, we know that there exists some index $t_j \in S_j \cap V$ for each $j$. By recalling the definition (71), we know that $\tilde{\mathcal{E}}_i \subseteq [m] \setminus V$, so $t_j \notin \tilde{\mathcal{E}}_i$. Note that $t_j \in V$ while $o_i \notin V$, so $t_j \notin \tilde{\mathcal{E}}_i \cup \{o_i\}$. Hence, from the proof of Lemma 2, we know that $y_{o_i}$ is not erased, i.e., $o_i \notin E_\pi$. Since $o_i \notin E_\pi$ for $i \le m - h$ under this particular order $\pi$, we have $|E_\pi| \le h = |V|$, which completes the proof.

Appendix D

Proof of Theorem 4

From (45), we have

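$$I(p(x_{\mathcal{K}}); p(y)) = D(p(x_{\mathcal{K}}, y) \,\|\, q(x_{\mathcal{K}}, y)) - D(p(x_{\mathcal{K}}) \,\|\, q(x_{\mathcal{K}})) - D(p(y) \,\|\, q(y)), \tag{77}$$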

and it remains to show that the right-hand side is bounded above by D(p(x) ∥ q(x)).

By applying the chain rule for relative entropy, we have

$$D(p(x, y) \,\|\, q(x, y)) = D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)), \tag{78}$$

and

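$$D(p(x, y) \,\|\, q(x, y)) = D(p(x_{\mathcal{K}}, y) \,\|\, q(x_{\mathcal{K}}, y)) + D(p(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y) \,\|\, q(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y)). \tag{79}$$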

The definition of conditional relative entropy and the proof of the chain rule for relative entropy can be found in [40, p. 24]. From these equations, we obtain

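$$D(p(x_{\mathcal{K}}, y) \,\|\, q(x_{\mathcal{K}}, y)) = D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)) - D(p(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y) \,\|\, q(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y)). \tag{80}$$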

By substituting (80) in (77), we have

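$$\begin{aligned} I(p(x_{\mathcal{K}}); p(y)) &= D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)) - D(p(x_{\mathcal{K}}) \,\|\, q(x_{\mathcal{K}})) \\ &\quad - D(p(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y) \,\|\, q(x_{[n] \setminus \mathcal{K}} \mid x_{\mathcal{K}}, y)) - D(p(y) \,\|\, q(y)) \\ &\overset{(a)}{\le} D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)) \\ &\overset{(b)}{=} D(p(x) \,\|\, q(x)), \end{aligned} \tag{81}$$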

where (a) follows from the non-negativity of relative entropy, and (b) follows from the assumption q(y∣x) = p(y∣x) = w(y∣x), which makes D(p(y∣x) ∥ q(y∣x)) = 0.
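The bound can be checked numerically on a two-bit toy example. In the sketch below, x consists of one sensitive bit and one non-sensitive bit, the assumed prior q treats the two bits as independent, and the mechanism releases the non-sensitive bit unchanged, so the released symbol is independent of the sensitive bit under q. The distributions, the mechanism, and all function names are assumptions chosen for illustration.

```python
from itertools import product
from math import log2


def kl(p, q):
    """Relative entropy D(p || q) in bits, for dicts over the same keys."""
    return sum(pv * log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def marginal(joint, axis):
    """Marginal of a joint distribution keyed by pairs, along one coordinate."""
    out = {}
    for key, v in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + v
    return out


def mutual_information(joint):
    """I(A; B) in bits, computed as D(joint || product of marginals)."""
    pa, pb = marginal(joint, 0), marginal(joint, 1)
    indep = {(a, b): pa[a] * pb[b] for a, b in product(pa, pb)}
    return kl(joint, indep)


def joint_sensitive_released(px):
    """Joint law of (x_K, y) when the mechanism releases y = x_1 unchanged."""
    out = {}
    for (xk, x1), v in px.items():
        out[(xk, x1)] = out.get((xk, x1), 0.0) + v
    return out


# Keys are (x_K, x_1). Assumed prior q: independent bits, q(x_1 = 0) = 0.7.
q = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.35, (1, 1): 0.15}
# True distribution p: the two bits are positively correlated.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

leak_under_q = mutual_information(joint_sensitive_released(q))
leak_under_p = mutual_information(joint_sensitive_released(p))
bound = kl(p, q)  # D(p(x) || q(x)), the right-hand side of (81)
```

Here the leakage vanishes under q, is strictly positive under p, and stays below D(p(x) ∥ q(x)), in line with the theorem.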

Contributor Information

Fangwei Ye, Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Hyunghoon Cho, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Salim El Rouayheb, Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA.

REFERENCES

  • [1]. Hubaux J-P, Katzenbeisser S, and Malin B, “Genomic data privacy and security: Where we stand and where we are heading,” IEEE Security & Privacy, vol. 15, no. 5, pp. 10–12, 2017.
  • [2]. Grishin D, Obbad K, and Church GM, “Data privacy in the age of personal genomics,” Nature biotechnology, vol. 37, no. 10, pp. 1115–1117, 2019.
  • [3]. Berger B and Cho H, “Emerging technologies towards enhancing privacy in genomic data sharing,” Genome biology, 2019.
  • [4]. Nyholt DR, Yu C-E, and Visscher PM, “On Jim Watson’s APOE status: genetic information is hard to hide,” European Journal of Human Genetics, vol. 17, no. 2, pp. 147–149, 2009.
  • [5]. von Thenen N, Ayday E, and Cicek AE, “Re-identification of individuals in genomic data-sharing beacons via allele inference,” Bioinformatics, vol. 35, no. 3, pp. 365–371, 2018.
  • [6]. Gürsoy G, Emani P, Jolanki OA, Brannon CM, Harmanci A, Strattan JS, Miranker AD, and Gerstein M, “Private information leakage from functional genomics data: Quantification with calibration experiments and reduction via data sanitization protocols,” bioRxiv, 2019.
  • [7]. Motahari AS, Bresler G, and Tse DNC, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
  • [8]. Cho H, Wu DJ, and Berger B, “Secure genome-wide association analysis using multiparty computation,” Nature biotechnology, vol. 36, no. 6, pp. 547–551, 2018.
  • [9]. Tahmasebi B, Maddah-Ali MA, and Motahari SA, “Information theory of mixed population genome-wide association studies,” in 2018 IEEE Information Theory Workshop (ITW), 2018, pp. 1–5.
  • [10]. Shomorony I, Kim SH, Courtade TA, and Tse DNC, “Information-optimal genome assembly via sparse read-overlap graphs,” Bioinformatics, vol. 32, no. 17, pp. i494–i502, Aug. 2016.
  • [11]. Si H, Vikalo H, and Vishwanath S, “Information-theoretic analysis of haplotype assembly,” IEEE Trans. Inf. Theory, vol. 63, no. 6, pp. 3468–3479, 2017.
  • [12]. Milenkovic O and Vasic B, “Information theory and coding problems in genetics,” in Information Theory Workshop, 2004, pp. 60–65.
  • [13]. Kiah HM, Puleo GJ, and Milenkovic O, “Codes for DNA sequence profiles,” IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3125–3146, 2016.
  • [14]. Gholami A, Maddah-Ali MA, and Abolfazl Motahari S, “Private shotgun DNA sequencing,” in 2019 IEEE International Symposium on Information Theory (ISIT), 2019, pp. 171–175.
  • [15]. Sun H and Jafar S, “The Capacity of Private Information Retrieval,” in IEEE Trans. Inf. Theory, vol. 63, no. 7, pp. 4075–4088, Jul. 2017.
  • [16]. Freij-Hollanti R, Gnilke OW, Hollanti C, and Karpuk DA, “Private Information Retrieval from Coded Databases with Colluding Servers,” in SIAM Journal on Applied Algebra and Geometry, vol. 1, no. 1, pp. 647–664, Nov. 2017.
  • [17]. Banawan K and Ulukus S, “The Capacity of Private Information Retrieval from Coded Databases,” in IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1945–1956, Mar. 2018.
  • [18]. Tajeddine R, Gnilke OW, and El Rouayheb S, “Private Information Retrieval from MDS Coded Data in Distributed Storage Systems,” in IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 7081–7093, Nov. 2018.
  • [19]. Li S and Gastpar M, “Single-Server Multi-message Private Information Retrieval with Side Information,” 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2018.
  • [20]. Kadhe S, Garcia B, Heidarzadeh A, El Rouayheb S, and Sprintson A, “Private Information Retrieval with Side Information,” in IEEE Trans. Inf. Theory, vol. 66, no. 4, pp. 2032–2043, Apr. 2020.
  • [21]. Simmons S, Sahinalp C, and Berger B, “Enabling privacy-preserving GWASs in heterogeneous human populations,” Cell Systems, vol. 3, no. 1, pp. 54–61, 2016.
  • [22]. Fienberg SE, Slavkovic A, and Uhler C, “Privacy preserving GWAS data sharing,” in 2011 IEEE 11th International Conference on Data Mining Workshops, 2011, pp. 628–635.
  • [23]. Cho H, Simmons S, Kim R, and Berger B, “Privacy-preserving biomedical database queries with optimal privacy-utility trade-offs,” Cell Systems, 2020.
  • [24]. Browning SR and Browning BL, “Haplotype phasing: existing methods and new developments,” Nature Reviews Genetics, vol. 12, no. 10, pp. 703–714, 2011.
  • [25]. Loh P-R, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR et al., “Reference-based phasing using the haplotype reference consortium panel,” Nature genetics, vol. 48, no. 11, p. 1443, 2016.
  • [26]. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M et al., “Next-generation genotype imputation service and methods,” Nature genetics, vol. 48, no. 10, p. 1284, 2016.
  • [27]. Li N and Stephens M, “Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data,” Genetics, vol. 165, no. 4, pp. 2213–2233, 2003.
  • [28]. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K et al., “A reference panel of 64,976 haplotypes for genotype imputation,” Nature Genetics, vol. 48, no. 10, pp. 1279–1283, 2016.
  • [29]. Chi EC, Zhou H, Chen GK, Del Vecchyo DO, and Lange K, “Genotype imputation via matrix completion,” Genome research, vol. 23, no. 3, pp. 509–518, 2013.
  • [30]. Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, and Schierup MH, “Ancestral population genomics: the coalescent hidden Markov model approach,” Genetics, vol. 183, no. 1, pp. 259–274, 2009.
  • [31]. Ye F, Naim C, and El Rouayheb S, “On-Off privacy in the presence of correlation,” IEEE Trans. Inf. Theory, vol. 67, no. 11, pp. 7438–7457, 2021.
  • [32]. Ye F, Naim C, and El Rouayheb S, “On-Off privacy against correlation over time,” IEEE Trans. Inf. Forensics Secur, vol. 16, pp. 2104–2117, 2021.
  • [33]. Kairouz P, Oh S, and Viswanath P, “Extremal mechanisms for local differential privacy,” in Advances in neural information processing systems, 2014, pp. 2879–2887.
  • [34]. Cormen TH, Leiserson CE, Rivest RL, and Stein C, Introduction to algorithms, MIT press, 2009.
  • [35]. Song YS, “Na Li and Matthew Stephens on modeling linkage disequilibrium,” Genetics, vol. 203, no. 3, p. 1005, 2016.
  • [36]. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B, Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, and Collins R, “UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age,” PLOS Medicine, vol. 12, no. 3, pp. 1–10, 2015.
  • [37]. Rabiner LR, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb 1989.
  • [38]. Dwork C, “Differential privacy: A survey of results,” in International conference on theory and applications of models of computation, Springer, 2008, pp. 1–19.
  • [39]. Harmanci A, Jiang X, and Zhi D, “Haplohide: A data hiding framework for privacy enhanced sharing of personal genetic data,” bioRxiv, 2019.
  • [40]. Cover TM and Thomas JA, Elements of information theory, Wiley-Interscience, 2006.
