Private genome analysis through homomorphic encryption

Miran Kim; Kristin Lauter

doi:10.1186/1472-6947-15-S5-S3

. 2015 Dec 21;15(Suppl 5):S3. doi: 10.1186/1472-6947-15-S5-S3

Private genome analysis through homomorphic encryption

Miran Kim ^1,^✉, Kristin Lauter ²

PMCID: PMC4699052 PMID: 26733152

Abstract

Background

The rapid development of genome sequencing technology allows researchers to access large genome datasets. However, outsourcing the data processing o the cloud poses high risks for personal privacy. The aim of this paper is to give a practical solution for this problem using homomorphic encryption. In our approach, all the computations can be performed in an untrusted cloud without requiring the decryption key or any interaction with the data owner, which preserves the privacy of genome data.

Methods

We present evaluation algorithms for secure computation of the minor allele frequencies and χ²statistic in a genome-wide association studies setting. We also describe how to privately compute the Hamming distance and approximate Edit distance between encrypted DNA sequences. Finally, we compare performance details of using two practical homomorphic encryption schemes - the BGV scheme by Gentry, Halevi and Smart and the YASHE scheme by Bos, Lauter, Loftus and Naehrig.

Results

The approach with the YASHE scheme analyzes data from 400 people within about 2 seconds and picks a variant associated with disease from 311 spots. For another task, using the BGV scheme, it took about 65 seconds to securely compute the approximate Edit distance for DNA sequences of size 5K and figure out the differences between them.

Conclusions

The performance numbers for BGV are better than YASHE when homomorphically evaluating deep circuits (like the Hamming distance algorithm or approximate Edit distance algorithm). On the other hand, it is more efficient to use the YASHE scheme for a low-degree computation, such as minor allele frequencies or χ²test statistic in a case-control study.

Keywords: Homomorphic encryption, Genome-wide association studies, Hamming distance, Approximate Edit distance

Introduction

The rapid development of genome sequencing technology has led to the genome era. We expect that the price of a whole genome sequence will soon be $1K in a day, which enables researchers to access large genome datasets. Moreover, many genome projects like the Personal Genome Project (PGP) [1] and the HapMap Project [2] display genotypic information in public databases, so genomic data has become publicly accessible.

While genome data can be used for a wide range of applications including healthcare, biomedical research, and forensics, it can be misused, violating personal privacy via genetic disease disclosure or genetic discrimination. Even when explicit identifiers (e.g., name, date of birth or address) are removed from genomic data, one can often recover the identity information [3-5]. For these reasons, genomic data should be handled with care.

There have been many attempts to protect genomic privacy using cryptographic methods. In particular, it has been suggested that we can preserve privacy through (partially) homomorphic encryption, which allows computations to be carried out on ciphertexts. Kantarcioglu et al. [6] presented a novel framework that allows organizations to support data mining without violating genomic privacy. Baldi et al. [7] proposed a cryptographic protocol to determine whether there exists a biological parent-child relationship between two individuals. Ayday et al. [8] recently conducted privacy-preserving computation of disease risk based on genomic and non-genomic data. However, these methods used homomorphic computation involving a single operation on ciphertexts (e.g., either additions or multiplications, not both), thus they could support a limited set of genomic queries.

Fully homomorphic encryption (e.g., [9-11]) permits encrypted data to be computed on without decryption, so it allows us to evaluate arbitrary arithmetic circuits over encrypted data. Thus, we can privately perform all types of genome analysis using Homomorphic Encryption (HE) cryptosystems. Moreover, we can delegate intensive computation to a public cloud and store large amounts of data in it.

Recently, many protocols to conduct privacy-preserving computation of genomic tests with fully homomorphic encryption have been introduced. Yasuda et al. [12] gave a practical solution for computation of multiple Hamming distance values using the LNV scheme [13] on encrypted data, so to find the locations where a pattern occurs in a text. Graepel et al. [14] and Bos et al. [15] applied HE to machine learning, and described how to privately conduct predictive analysis based on an encrypted learned model. Lauter et al. [16] gave a solution to privately compute the basic genomic algorithms used in genetic association studies. Cheon et al. [17] described how to calculate edit distance on homomorphically encrypted data.

In this paper, we propose efficient evaluation algorithms to compute genomic tests on encrypted data. We first consider the basic tests which are used in Genome-Wide Association Studies (GWAS). They are conducted to analyze the statistical associations between common genetic variants in different individuals. In particular, we focus on the minor allele frequencies (MAFs) and χ²test statistic between the variants of case and control groups. Secondly, we consider DNA sequence comparison which can be used in sequence alignment and gene finding. We show how to privately compute the Hamming distance and approximate Edit distance on encrypted data. We also adapt these methods to the practical HE schemes − BGV scheme [18] by Gentry, Halevi and Smart and YASHE scheme [19] by Bos, Lauter, Loftus and Naehrig. Finally, we compare the performance of the two encryption schemes in these contexts. In practice, we take advantage of batching techniques to parallelize both space and computation time together.

One possible scenario could be of interest in situations involving patients, a data owner (e.g., a healthcare organization or a medical center) and a public cloud. In our solution, a data owner wants to store large amounts of data in the cloud and many users may interact with the same data over time. The cloud can handle all that interaction through computation on encrypted data, so it does not require further interaction from the data owner. The patients can upload their encrypted data directly to the cloud using the public key. The genomic tests are performed on the cloud and the encrypted results are returned to the data owner. Finally, the data owner decrypts the results using the secret key to share it with the patient. All the computations in the cloud are performed on encrypted data without requiring the decryption key, so the privacy of genomic data can be protected by the semantic security of the underlying HE schemes.

Background

The iDASH (Integrating Data for Analysis, 'anonymization' and SHaring) National Center organized the iDASH Privacy & Security challenge for secure genome analysis. This paper is based on a submission to the iDASH challenge which consisted of two tasks: i) secure outsourcing of GWAS and ii) secure comparison between genomic data.

Two tasks for iDASH challenge

Given the encrypted genotypes of two groups of individuals over many single nucleotide variants (SNVs), the goal of the first task is to privately compute the MAFs in each group and a χ²test statistic between the two groups on each site.

Suppose that A and B are two alleles of the gene, and let n_AA, n_AB, n_BBdenote the numbers of observed individuals for genotypes AA, AB, BB, respectively. The allele counts of A and B are given by $n_{A} \overset{l e t}{=} 2 n_{A A} + n_{A B}$ and $n_{B} \overset{l e t}{=} 2 n_{B B} + n_{A B}$ . Then the MAF of the given alleles is defined by

\frac{min (n_{A}, n_{B})}{n_{A} + n_{B}} .

If we let N be the total number of people in a sample population, the total number of alleles in the sample is n_A+ n_B= 2N, so we compute only one of two allele counts in encrypted form. The minimum can then easily be computed after decryption and we obtain the MAF by one division by 2N .

The χ²test statistic in case-control groups is computed based on the allelic contingency table (Table 1):

Table 1.

Allelic Contingency Table

	Allele type		Total

	A	B
Case	n_A	n_B	R = 2N
Control	$n_{A}^{'}$	$n_{B}^{'}$	S = 2N
Total	$G = n_{A} + n_{A}^{'}$	$K = n_{B} + n_{B}^{'}$	T = 4N

Open in a new tab

\frac{T (n_{A} {n^{'}}_{B} - n_{B} {n^{'}}_{A})}{R \cdot S \cdot G \cdot K} .

Algorithm 1 Hamming Distance Algorithm

1: h ← 0

2: for i ∈ $L$ do

3: if ('x_i.sv' or 'y_i.sv') in {'INS', 'DEL'} then

4: h_i← 0

5: else if ((x_ior y_i) == '∅') or

6: ((x_i.ref == y_i.ref) and (x_i.alt ! = y_i.alt)) then

7: h_i← 1

8: else

9: h_i← 0

10: end if

11: h ← h + h_i

12: end for

13: return h

We observe that the test can be written as a function of n_Aand $n_{A}^{'}$ . More precisely, it is expressed as

\begin{gathered} \frac{4 N {(n_{A} (2 N - {n^{'}}_{A}) - {n^{'}}_{A} (2 N - n_{A}))}^{2}}{2 N \cdot 2 N \cdot G \cdot K} \\ = \frac{4 N {(n_{A} - {n^{'}}_{A})}^{2}}{(n_{A} + {n^{'}}_{A}) \cdot (4 N - (n_{A} + {n^{'}}_{A}))} . \end{gathered}

Let $n_{A}^{(j)}$ and ${n^{'}}_{A}^{(j)}$ denote the allele counts of A at SNV j in the case group and control group, respectively. As discussed above, it suffices to compute $(n_{A}^{(j)} + {n^{'}}_{A}^{(j)})$ and $(n_{A}^{(j)} - {n^{'}}_{A}^{(j)})$ over encrypted data.

The goal of the second task is to privately compute the Hamming distance and approximate Edit distance between the encrypted genome sequences. Suppose that two participants have Variation Call Format (VCF) files which summarize their variants compared with the reference genome (e.g., insertion, deletion, or substitution at a given position of a given chromosome). If there is only one record in the VCF files at a specified location, the other one is considered to be an empty set ('∅'). Let $L$ be a list indexed by the positions of two participants. Then we can define the Hamming distance as described in Algorithm 1, where "x_i.sv" denotes the type of structural variant relative to the reference, "x_i.ref " the reference bases and "x_i.alt" the alternate non-reference alleles.

The standard dynamic programming approach to compute the full Wagner-Fischer Edit distance [20] is computed in a recursive way, so the multiplicative depth of the circuit to be homomorphically evaluated is too large. Recently, Cheon et al. [17] presented an algorithm to compute the WF Edit distance over packed ciphertexts but it took about 27 seconds even on length 8 DNA sequences. On the other hand, in this task we are given the distance to a public human DNA sequence (called the reference genome), which allows us to efficiently approximate the Edit distance using Algorithm 2. It is calculated based on the set difference metric, which enables parallel processing in computation.

Algorithm 2 Approximate Edit Distance Algorithm

1: e ← 0

2: for $i \in L$ do

3: if x_i== '∅' then

4: D(x_i) ← 0

5: else if 'x_i.sv' == 'DEL' then

6: D(x_i) ← len(x_i.ref)

7: else

8: D(x_i) ← len(x_i.alt)

9: end if

10: Define D(y_i) with the same way as D(x_i)

11: if ((x_i.ref == y_i.ref) and (x_i.alt == y_i.alt)) then

12: e_i← 0

13: else

14: e_i← max{D(x_i), D(y_i)}

15: end if

16: e ← e + e_i

17: end for

18: return e

Practical homomorphic encryption

Fully Homomorphic cryptosystems allow us to homomorphically evaluate any arithmetic circuit without decryption. However, the noise of the resulting ciphertext grows during homomorphic evaluations, slightly with addition but substantially with multiplication. For efficiency reasons for tasks which are known in advance, we use a more practical Somewhat Homomorphic Encryption (SHE) scheme, which evaluates functions up to a certain complexity. In particular, two techniques are used for noise management of SHE: one is the modulus-switching technique introduced by Brakerski, Gentry and Vaikuntanathan [21], which scales down a ciphertext during every multiplication operation and reduces the noise by its scaling factor. The other is a scale-invariant technique proposed by Brakerski [22] such that the same modulus is used throughout the evaluation process.

Let us denote by [·]_qthe reduction modulo q into the interval $(- q / 2, q / 2] \cap ℤ$ of the integer or integer polynomial (coefficient-wise). For a security parameter λ, we choose an integer m = m(λ) that defines the m-th cyclotomic polynomial Φ_m(x). For a polynomial ring $R = ℤ [x] / (Φ_{m} (x))$ , set the plaintext space to R_t:= R/tR for some fixed t ≥ 2 and the ciphertext space to R_q:= R/qR for an integer q = q(λ). Let χ = χ(λ) denote a noise distribution over the ring R. We use the standard notation $a \leftarrow D$ to denote that a is chosen from the distribution $D$ . Now, we recall the BGV scheme [18] and the scale-invariant YASHE scheme [19].

The BGV scheme

Gentry, Halevi and Smart [18] constructed an efficient BGV-type SHE scheme. The security of this scheme is based on the (decisional) Ring Learning With Errors (RLWE) assumption, which was first introduced by Lyubashevsky, Peikert and Regev [23]. The assumption is that it is infeasible to distinguish the following two distributions. The first distribution consists of pairs (a_i, u_i), where a_i, u_i← R_quniformly at random. The second distribution consists of pairs of the form (a_i, b_i) = (a_i, a_is + e_i) where a_i← R_qdrawn uniformly and s, e_i← χ . Note that we can generate RLWE samples as (a_i, a_is+te_i) where t and q are relatively prime. To improve efficiency for HE, they use very sparse secret keys s with coefficients sampled from {−1, 0, 1}.

Here is the SHE scheme of [18]:

• ParamsGen: Given the security parameter λ, choose an odd integer m, a chain of moduli q₀<q₁< ⋯ <q_L−1= q, a plaintext modulus t with 1 <t <q₀, and discrete Gaussian distribution χ_err. Output (m, {q_i}, t, χ_err).

• KeyGen: On the input parameters, choose a random s from {0, ± 1}^φ(m)and generate an RLWE instance (a, b) = (a, [as + te]_q) for e ← χ_err. We set the key pair: (pk, sk) = ((a, b), s) with an evaluation key $e v k \in R_{P \cdot q_{L - 2}}^{2}$ for a large integer P.

• Encryption: To encrypt m ∈ R_t, choose a small polynomial v and two Gaussian polynomials e₀, e₁over Rq . Then compute the ciphertext given by Enc(m, pk) = (c₀, c₁) = (m, 0) + (bv + te0, av +te₁) ∈ $R_{q}^{2}$ .

• Decryption: Given a ciphertext ct = (c₀, c₁) at level l, output $Dec(ct,sk) = {[c_{0} - s \cdot c_{1}]}_{q_{l}}$ mod t where the polynomial ${[c_{0} - s \cdot c_{1}]}_{q_{l}}$ is called the noise in the ciphertext ct.

• Homomorphic Evaluation: Given two ciphertexts ct = (c₀, c₁) and $ct' = (c_{0}^{'}, c_{1}^{'})$ at level l, the homomorphic addition is computed by $c t_{add} = ({[c_{0} + c_{0}^{'}]}_{q_{l}}, {[c_{1} + c_{1}^{'}]}_{q_{l}})$ . The homomorphic multiplication is computed by ct_mult= SwitchKey(c₀∗ c₁ , evk) where $c_{0} * c_{1} = ({[c_{0} c_{0}^{'}]}_{q_{l}}, {[c_{0} c_{1}^{'} + c_{1} c_{0}^{'}]}_{q_{l}}, {[c_{1} c_{1}^{'}]}_{q_{l}})$ and the key switching function SwitchKey is used to reduce the size of ciphertexts to two ring elements. We also apply modulus switching from q_ito q_i−1in order to reduce the noise. If we reach the smallest modulus q₀, we can no longer compute on ciphertexts.

Smart and Vercauteren [24] observed that R_tis isomorphic to $\prod_{i = 1}^{ℓ} ℤ_{t} [x] / f_{i} (x)$ if Φ_m(x) factors modulo t into ℓ irreducible factors f_i(x) of the same degree. Namely, a plaintext polynomial m can be considered as a vector of ℓ small polynomials, m mod f_i, called plaintext slots. We can also transform the plain-text vector $(m_{1}, \dots, m_{r}) \in \prod_{i = 1}^{ℓ} ℤ_{t} [x] / f_{i} (x)$ to an element m ∈ R_tusing the polynomial Chinese Remainder Theorem (i.e., m = CRT(m₁, ..., m_r)). In particular, it is possible to add and multiply on the slots: if m, m′ ∈ R_tencode (m₁ , ..., m_ℓ) and $(m_{1}^{'}, \dots, m_{ℓ}^{'})$ respectively, then we see that $m + m^{'} = m_{i} + m_{i}^{'}$ mod f_iand $m \cdot m^{'} = m_{i} \cdot m_{i}^{'}$ mod f_i. This technique was adapted to the BGV scheme.

The YASHE scheme

A practical SHE scheme, YASHE, was proposed in [19] based on combining ideas from [22,25,26]. The security of this scheme is based on the hardness of the RLWE assumption similar to the one for BGV. It also relies on the Decisional Small Polynomial Ratio (DSPR) assumption which was introduced by Lopez-Alt, Tromer, and Vaikuntanathan [26]. Let $t \in R_{q}^{\times}$ be invertible in R_q, y_i∈ R_qand z_i= y_i/t ( mod q) for i = 1, 2. For z ∈ R_q, and, we define χ_z= χ + z to be the distribution shifted by z. The assumption is that it is hard to distinguish elements of the form h = a/b, where a ← y₁+ tχ_z, b ← y₂+ tχ_z, from elements drawn uniformly from R_q. The YASHE scheme consists of the following algorithms.

• ParamsGen: Given the security parameter λ, choose m to be a power of 2 (the m-th cyclotomic polynomial is Φ_m(x) = xⁿ+ 1 (n = φ(m) = m/2), modulus q and t with 1 < t < q, truncated discrete Gaussian distribution χ_erron R such that the coefficients of the polynomial are selected in the range [−B(λ), B(λ)]), and an integer base ω >1. Output (m, q, t, χ_err, ω).

• KeyGen: On the input parameters, sample f′, g ← {0, ± 1}^φ(m)and set f = [tf′ + 1]_q. If f is not invertible modulo q, choose a new f′ and compute the inverse f⁻¹∈ R of f modulo q and set h = [tgf⁻¹]q . Let ℓ_ω,q= [log_ω(q)] + 1 and define $P_{ω, q} (a) = {({[a ω^{i}]}_{q})}_{i = 0}^{ℓ_{ω, q^{- 1}}}$ . Sample $e, s \leftarrow χ_{e r r}^{ℓ_{ω, q}}$ and compute $γ = [P_{ω, q} (f) + e + h s] \in R_{q}^{ℓ_{ω, q}}$ . Then we set the key pair: (pk, sk, evk) = (h, f, γ).

• Encryption: To encrypt m ∈ R_t, choose e, s ← χ_errand then compute the ciphertext $Enc (m, pk) = {[⌊\frac{q}{t}⌋ \cdot {[m]}_{t} + e + h s]}_{q} \in R_{q}$ .

• Decryption: Given a ciphertext ct, output $Dec (ct, sk) = ⌊ \frac{t}{q} \cdot {[f \cdot c t]}_{q} ⌉$ mod t. The inherent noise in the ciphertext is defined as the minimum value of infinite norm ||v||_∞= max_i{|v_i|} such that $f \cdot ct = ⌊ \frac{q}{t} ⌋ \cdot {[m]}_{t} + v {(mod)}_{q}$

• Homomorphic Evaluation: Given two ciphertexts ct and ct′, homomorphic addition is computed as $c t_{add} = {[ct + ct']}_{q}$

• Homomorphic Evaluation: Given two ciphertexts ct and ct′, homomorphic addition is computed as $c t_{add} = {[ct + ct']}_{q}$ . Homomorphic multiplication is computed as $c t_{mult} = SwitchKey ({[⌊\frac{t}{q} ct \cdot ct'⌉]}_{q}, evk)$ where the key switching function SwitchKey is used to transform a ciphertext decryptable under the original secret key f (see [19] for details).

Our methods for private genome analysis

In this section, we describe how to encode and encrypt the genomic data for each task. Based on these methods, we propose the evaluation algorithms to compute the genomic tests on encrypted data.

Encoding genomic data

Lauter et al. [16] presented a method to encode a person's genotype given a candidate allele associated to a specified disease. They used a binary dummy vector representation, which makes the number of ciphertexts too large. In contrast, we encode the genotypes as integers so that one can efficiently compute their sums and differences over the integers. More precisely, for a bi-allelic gene with alleles A and B, there are 3 possible Single Nucleotide Polymorphisms (SNPs) - AA, AB, BB, and they are encoded as follows: AA → 2, AB → 1, BB → 0. Figure 1 shows the file format of the data for Task 1 and its encodings.

**A snapshot of the dataset for Task 1 and its encodings**.

Now, we describe how genomic data can be encoded for DNA comparison. The first step is to curate the data using the positions in the VCF files of two participants. In other words, the server should arrange the information and make the merged list $L$ so that each individual can encode their genotypic information according to the list. Let $ℓ (L)$ denote the length of the list $L$ . Then, for $1 \leq i \leq ℓ (L)$ , we define two values

e_{i} = \{\begin{matrix} 1 & if po s_{i} \in L \\ 0 & o .w, \end{matrix}), f_{i} = \{\begin{matrix} 0 & if s v_{i} \in {INS, DEL} \\ 1 & o .w, . \end{matrix})

The value e_idefines whether the genotype at the specified locus is missing; the value f_ispecifies the variants compared with the reference.

Since both VCF files are aligned with the same reference genome, we don't need to compare the columns of 'REF'. To improve performance, we assume that it suffices to compare 7 SNPs between two non-reference sequences. In the following, we describe how to encode the sequences. Each SNP is represented by two bits as

A \to 00, G \to 01, C \to 10, T \to 11,

and then concatenated with each other. Next we pad with 1 at the end of the bit string so as to distinguish the A-strings. Finally, we pad with zeros to make it a binary string of length 15, denoted by s_i. Let s_i[j] denote j-th bit of s_i. If a person's SNV at the given locus is not known (i.e., e_i= 0), then it is encoded as 0-string. For example, 'GT C' is encoded as a bit string 01||11||10||10 ... 0, of length 15.

Finally, let us consider the i-th genotype lengths D_i, $D_{i}^{'}$ of two participants defined as follows: when it has no variants at the given locus of the sequence, set zero as the length at the locus. If it includes a deletion compared with the reference, use the length of reference. Otherwise, we take the length of the target sequence at the current locus. In Figure 2, we illustrate the file format of the data for Task 2 and its encodings.

**A snapshot of the dataset for Task 2 and its encodings**. (a) hu604D39 and (b) hu661AD0.

Homomorphic computation of the BGV scheme

We describe how to compute the genomic algorithms described above on encrypted genetic data using the BGV scheme.

Task 1: GWAS on encrypted genomic data

Using the encodings that we propose for practical HE, we can homomorphically evaluate any function involving additions and multiplications, but it is not known how to perform homomorphic division of integer values. We obtain the counts using a few homomorphic additions.

Let g_jbe the encoded value of SNV site j based on the encoding method as described above. Then each person packs g_jinto the j-th slot. Let s be the total number of SNVs. Assuming that each ciphertext holds ℓ plaintext slots for s ≤ ℓ, the i-th person encrypts the vector $(g_{i}^{(1)}, \dots, g_{i}^{(s)}, 0 \dots, 0) \in ℤ_{t}^{ℓ}$ using batching as

{ct}_{i} = Enc (CRT ((g_{i}^{(1)}, \dots, g_{i}^{(s)}, 0 \dots, 0), pk) .

Let ct_evalbe a ciphertext given by the homomorphic operation

c t_{e v a l} = \sum_{i = 1}^{N} c t_{i} .

Note that the use of batching technique enables to perform N aggregate operations in parallel. Next, let m = Dec(ct_eval, sk) denote the decryption of the ciphertext ct_evaland decode the s outputs from the output plaintext polynomial as follows: let m_jbe the constant coefficient of m mod f_jfor 1 ≤ j ≤ s. That is, we have

m_{j} \overset{l e t}{=} m mod f_{j} = \sum_{i = 1}^{N} g_{i}^{(j)} .

Thus the MAF of SNV j in the group is computed as

\frac{min {m_{j}, 2 N - m_{j}}}{2 N} .

For the homomorphic evaluation of χ²test, each group performs aggregations over ciphertexts as shown in (1). Let ct_caseand ct_contdenote the ciphertexts by the evaluations in the case and control groups, respectively. Then one can compute two ciphertexts by the homomorphic operations

c t^{+} \overset{l e t}{=} c t_{c a s e} + c t_{c o n t}, c t^{-} \overset{l e t}{=} c t_{c a s e} - c t_{c o n t} .

The plaintext polynomial from ct⁺can be decoded as the plaintext slots which have $(n_{A}^{(j)} + {n^{'}}_{A}^{(j)})$ at the j-th slot. In other words, we have

Dec ({ct}^{+}, sk) \mod f_{j} = \sum_{i = 1}^{N} (g_{i}^{(j)} - {g^{'}}_{i}^{(j)}) = (n_{A}^{(j)} - {n^{'}}_{A}^{(j)}) .

Similarly, the plaintext polynomial from ct⁻is decoded as the plaintext slots which has the value congruent to $(n_{A} - n_{A}^{'})$ in the interval $[0, t) \cap ℤ$ . Thus, if the output value is larger than $\frac{t}{2}$ , then subtract t from it; that is, we have

Dec ({ct}^{+}, sk) \mod f_{j} = \sum_{i = 1}^{N} (g_{i}^{(j)} - {g^{'}}_{i}^{(j)}) = (n_{A}^{(j)} - {n^{'}}_{A}^{(j)}) .

Task 2: secure DNA sequence comparison

We represent sequence comparison algorithms as binary circuits and then evaluate them over encrypted data. We use the native plaintext space of binary polynomials (i.e., $R_{2} = ℤ_{2} [x] / (Φ_{m} (x))$ ), and denote XOR and AND as ⊕ and ∧, respectively. For simplicity, you may consider the plaintext space $ℤ_{2}^{ℓ}$ supporting batching operation with ℓ slots.

For the homomorphic evaluation of Hamming distance, the genomic data of two participants, denoted by $(e_{i}, f_{i}, s_{i})$ and $(e_{i}^{'}, f_{i}^{'}, s_{i}^{'})$ , are encrypted bit-wise. For example, the encryptions of e_i's are in the form of

\begin{gathered} Enc (CRT(e_{1}, \dots, e_{ℓ}), pk), \\ Enc (CRT(e_{ℓ + 1}, \dots, e_{2 ℓ}), pk), \dots, \\ Enc (CRT(e_{⌊ℓ (L) / ℓ⌋ \cdot ℓ + 1}, \dots, e_{ℓ (L)}, 0, \dots 0), pk) . \end{gathered}

This allows to compute the same function on ℓ inputs at the price of one computation. Then one can evaluate the following binary circuit over encryption:

(E (s_{i}, {s'}_{i}) \land (e_{i} \oplus {e^{'}}_{i} \oplus 1) \oplus 1) \land f_{i} \land f_{i}^{'}

where $E(s_{i}, {s'}_{i}) = \land_{j = 1}^{15} (s_{i} [j] \oplus {s'}_{i} [j] \oplus 1)$ has 1 if and only if $s_{i}, {s'}_{i}$ are the same. After homomorphic computations, the output can be decrypted with the secret key. The plaintext polynomial has the Hamming distance result of SNV site i at the i-th slot, so we need only aggregate them.

Now, we consider the comparison binary circuit (described in [17]) for the secure computation of the approximate Edit distance. We express an unsigned μ-bit integer x in its binary representation and denote the j-th coordinate of x by x[j] (i.e., $x = \sum_{j = 1}^{μ} x [j] 2^{j - 1}, x [j] \in {0, 1}$ ). For two μ-bit integers x and y, the comparison circuit is defined by

C (x, y) = \{\begin{matrix} 1 & if x < y, \\ 0 & o . w ., \end{matrix})

and this is written recursively as C(x; y) := c_μwhere

c_{j} = ((x [j] \oplus 1) \land y [j]) \oplus ((x [j] \oplus 1 \oplus y [j]) \land c_{j - 1})

for j ≥ 2 with an initial value $c_{1} = (x [1] \oplus 1) \land y [1]$ . Then the j-th bit of maximum value between two inputs is defined as follows:

\begin{gathered} max {x, y} [j] = ((1 \oplus C(x, y)) \land x [j]) \oplus (c (x, y) \land y [j]) \\ = x [j] \oplus (C(x, y) \land (x [j] \oplus y [j])) . \end{gathered}

For the bit-sliced implementation, all the lengths are also expressed in a binary representation and we denote the maximum length of SNPs by μ. It follows from the primitive circuits that we can evaluate the circuits homomorphically:

(E (s_{i}, {s'}_{i}) \land (f_{i} \oplus {f^{'}}_{i} \oplus 1) \oplus 1) \land max {D_{i}, D_{i}^{'}} [j] .

Finally, one can decrypt the results and decode $ℓ (L)$ values from the output plaintext polynomials. More precisely, let $ℓ_{i, j}$ be the value at i-th slot which corresponds to the j-th bit. We see that $\sum_{j = 1}^{μ} ℓ_{i, j} \cdot 2^{j - 1}$ is the approximate Edit distance of SNV site i, hence we need only perform aggregation operations over them.

Homomorphic computation of the YASHE scheme

We explain how to evaluate the genomic algorithms homomorphically using the YASHE scheme.

Task 1: GWAS on encrypted genomic data

Lauter et al. [13] introduced a method how to pack m bits b₀, ..., b_m−1into a single ciphertext that encodes the polynomial $b (x) = \sum_{i = 0}^{m - 1} b_{i} x^{i}$ . We note that polynomial addition corresponds to simple component-wise addition of the vectors. Since a case-control study requires only additions, this method can be used for our case. When using a ring polynomial xⁿ+1 with a power-of-two n, we can embed data of $n^{'} \overset{l e t}{=} ⌊\frac{n}{s}⌋$ persons into a single plaintext polynomial. Namely, one can encrypt the polynomial

\begin{gathered} pm (g_{1} = (g_{1}^{(1)}, \dots, g_{1}^{(s)}), \dots, g_{n^{'}} = (g_{n^{'}}^{(1)}, \dots, g_{n^{'}}^{(s)})) \\ \overset{l e t}{=} \sum_{i = 1}^{n^{'}} \sum_{j = 0}^{s - 1} g_{i}^{(j)} x^{j + s \cdot (i - 1)} . \end{gathered}

The simple aggregation operations are performed over packed ciphertexts. Now, let

m = \sum_{j = 0}^{n^{'} s - 1} m_{j} x^{j} \in R_{t}

denote the decryption result of the evaluated ciphertext. Then, for 1 ≤ j ≤ s, one can aggregate n^'data from the output plaintext polynomial by computing

m_{j} \leftarrow \sum_{i = 0}^{n^{'} - 1} m_{j + i s},

which is the allele counts of A at the SNV site j. Notice that if $n^{'} = 1$ , then we don't need to do the above operations. Hence, the MAF of the SNV j in the group is computed as

\frac{min {m_{j}, 2 N - m_{j}}}{2 N} .

Similarly, let ct⁺and ct⁻denote the ciphertexts computed by the homomorphic additions and subtractions after simple aggregations. As we have demonstrated, we need additional aggregation processes after decryptions. Let

m^{+} = \sum_{j = 0}^{n^{'} s - 1} m_{j}^{+} x^{_{j}}, m^{-} = \sum_{j = 0}^{n^{'} s - 1} m_{j}^{-} x^{_{j}}

denote the decryption polynomials of ct⁺and ct⁻, respectively. Then, for 1 ≤ j ≤ s, one can obtain the allele counts by computing as

n_{A}^{(j)} + n_{A}^{^{'} (j)} = \sum_{i = 0}^{n^{'} - 1} m_{j + i s}^{+}, n_{A}^{(j)} - n_{A}^{^{'} (j)} = {[\sum_{i = 0}^{n^{'} - 1} m_{j + i s}^{-}]}_{t} .

Task 2: secure DNA sequence comparison

Since polynomial multiplication does not correspond to component-wise multiplication of the vectors, we have to consider another packing method instead of [13]. Let us consider the polynomial-CRT packing method. The m-th cyclotomic polynomial $Φ_{m} (x)$ factors modulo 2 into a product of the same irreducible factors (i.e., $Φ_{m} (x) = x^{n} + 1 = {(x + 1)}^{n}$ mod 2); so we cannot apply batching technique with these parameters. We can instead do that if taking a prime t (not 2) such that the polynomial splits into the distinct factors modulo t, but the use of a different message space leads to change our primitive circuits.

As noted in [27], we see that for $x, y \in {0, 1}$ , the following properties hold: $x \oplus y = {(x - y)}^{2}$ and $x \land y = x \cdot y$ where − and · are arithmetic operations over integers. From these observations, we can amend the evaluation circuit for the Hamming distance as follows:

(E (s_{i}, {s^{'}}_{i}) \cdot ({(e_{i} - {e^{'}}_{i})}^{2} - 1) + 1) \cdot f_{i} \cdot {f^{'}}_{i}

where $E (s_{i}, {s^{'}}_{i}) = \prod_{j = 1}^{15} (1 - {(s_{i} [j] - {s^{'}}_{i} [j])}^{2})$ .

We note that for μ-bit integer x and y, the comparison circuit C(x; y) = c_μcan be expressed as

c_{j} = (1 - x [j]) \cdot y [j] + (1 - {(x [j] - y [j])}^{2}) \cdot c_{j - 1} .

for j ≥ 2 with $c_{1} = (1 - x [1]) \cdot y [1]$ Since it is available to compute on large integer inputs, the maximum value is defined by

\begin{gathered} max \{x, y\} = (1 - C (x, y)) \cdot x + C (x, y) \cdot y \\ = x + C (x, y) \cdot (y - x) . \end{gathered}

Using these circuits, we compute the ciphertext given by the homomorphic operations

(1 + E (s_{i}, {s'}_{i}) \cdot ({(f_{i} - {f^{'}}_{i})}^{2} - 1)) \cdot max \{D_{i}, {D^{'}}_{i}\} .

Then we get the encryptions of the approximate Edit distance result of SNV i.

Results and discussion

In this section, we explain how to set the parameters for homomorphic evaluations and present our experimental results. We used BGV scheme with Shoup-Halevi's HE library [28] (called HELib). HELib is written in C++ and based on the arithmetic library NTL [29] over GMP. Our experiments with BGV were performed on a Linux machine with an Intel Xeon 2.67 GHz processor. We also implemented YASHE scheme with ARITH library in C. The measurements were done in an Intel Core 3.60GHz, running 64-bit Windows 7.

The dataset used for Task 1 consists of 200 case group (constructed from 200 participants from PGP) and 200 control group (simulated based on the haplotypes of 174 participants from CEU population of apMap Project). The dataset for Task 2 consists of two individual genomes randomly selected from PGP.

Theoretical comparison between BGV and YASHE

BGV scheme has a chain of ciphertext moduli by a set of primes of roughly the same size, p₀, ⋯, p_L−1, that is, the i-th modulus q_iis defined as $q_{i} = \prod_{k = 0}^{i} p k$ . For simplicity, assume that p is the approximate size of the p_is. Given the lattice dimension $n = φ (m)$ , the plaintext modulus t, and the Hamming weight h of the secret key, it follows from Theorem 3 in [27] that the depth of a classical homomorphic multiplication is

d_{n, t} \approx ⌈\frac{{log}_{2} (h \cdot n \cdot t^{4})}{2 {log}_{2} (p)}⌉ \approx ⌈\frac{{log}_{2} (h \cdot n \cdot t^{4})}{36}⌉,

so the total number of modulus switching operations during the M-levels of multiplications is about M·d_n,t. Since we first should do one modulus switching to the initial ciphertext before homomorphic computation, we see that L = M · d_n,t+ 2. Thus we can approximate the size of the ciphertext modulus q_BGVin the BGV scheme (from C.3 in [18]) as follows:

\begin{gathered} {log}_{2} q_{BGV} \approx 24 + \frac{3}{2} {log}_{2} n + (L - 2) \cdot (11 + \frac{1}{2} {log}_{2} n) \\ < (L + 1) \cdot (11 + \frac{1}{2} {log}_{2} n) \end{gathered}

Since a fresh ciphertext in BGV consists of a pair of polynomials over R_qL−1, the size of ciphertext from the above inequality is about

| c t_{BGV} | \approx 2 n \cdot {log}_{2} q_{BGV} \approx 2 n (L + 1) \cdot (11 + \frac{1}{2} {log}_{2} n)

Similarly, [19, Lemma 9] provides a theoretical upper bound on the noise growth after M multiplicative levels for YASHE as (nt)^2(M−1)· (12n²tσℓ_ω,qωM) when taking B = 6σ as the coefficient bound of error polynomials. It should be less than the ratio of q_YASHEto t so that the decryption procedure works; we should select a ciphertext modulus q_YASHEso as to satisfy

\begin{gathered} {log}_{2} q_{YASHE} \approx 2 M \cdot {log}_{2} n t + {log}_{2} (12 σ ℓ_{ω, q} ω M) \\ \geq 2 M \cdot ({log}_{2} n t) \end{gathered}

Since a ciphertext consists of only a single ring element, the size is about

| c t_{YASHE} | \approx n \cdot {log}_{2} q_{YASHE} \approx 2 n M \cdot ({log}_{2} n t) .

We summarize the above results in Table 2.

Table 2.

The theoretical sizes of ciphertext modulus and a ciphertext

	BGV	YASHE
Log₂q	$(M \cdot \frac{{log}_{2} (h \cdot n \cdot t^{4})}{36} + 3) \cdot (11 + \frac{1}{2} {log}_{2} n)$	2M · log₂nt

\|ct\|	$2 n (M \cdot \frac{{log}_{2} (h \cdot n \cdot t^{4})}{36} + 3) \cdot (11 + \frac{1}{2} {log}_{2} n)$	2nM · log₂nt

Open in a new tab

Note that it is difficult to compare these two schemes because their parameters depend on at least 4 variables: the plaintext modulus, t, the dimension, n, the Hamming weight, h, and the number of multaplicative levels to be evaluated, M. However we observe that, in the case that log₂n ≈ 14 and h = 64, we have:

\begin{gathered} {log}_{2} q_{YASHE} - {log}_{2} q_{BGV} \\ \approx 2 M \cdot ({log}_{2} n t) - (M \cdot d_{n, t} + 3) \cdot (11 + \frac{1}{2} {log}_{2} n) \\ \approx 2 M \cdot (14 + {log}_{2} t) - (M \cdot d_{n, t} + 3) \cdot 18 \\ = 2 M \cdot (14 + {log}_{2} t - 9 \cdot d_{n, t}) - 54 \\ \approx 2 M (14 + {log}_{2} t - 9 \cdot (\frac{20 + 4 {log}_{2} t}{36} + η)) - 54 \\ = 18 M (1 - η) - 54 for some 0 \leq η < 1 . \end{gathered}

Hence, if M is large, we can use a smaller ciphertext modulus to evaluate M-levels of multiplications with BGV in comparison to YASHE; however, the YASHE scheme has smaller ciphertexts than BGV. This follows from the fact that

\begin{gathered} | c t_{BGV} | - | c t_{YASHE} | \\ \approx 2 (M \cdot d_{n, t} + 3) \cdot (11 + \frac{1}{2} {log}_{2} n) - 2 M \cdot ({log}_{2} n t) \\ \approx 2 M \cdot (18 \cdot d_{n, t} - 14 - {log}_{2} t) + 108 \\ \approx 2 M \cdot ({log}_{2} t + 18 η - 4) + 108 \end{gathered}

For some 0 ≤ η < 1; if log₂t ≥ 4, then log₂t + 18η - 4 ≥ 0; otherwise, we have d_n,t= 1 and so $18 \cdot d_{n, t} - 14 - {log}_{2} t > 0$ .

Let us contrast the complexity of homomorphic multiplication operations for the two schemes. One of the new optimizations for BGV is to convert polynomials between coefficient and evaluation representations. Most of the homomorphic operations are performed in the more efficient evaluation representation, but it sometimes requires coefficient representation. Note that these conversions take the most time in execution. In more detail, at the l-th level of this scheme, the key switching procedure requires $O$ (l) Fast Fourier Transforms (FFTs) and the modulus switching operation requires (l + 1) FFTs. Since HElib uses the Bluestein FFT algorithm [30] (with run-time complexity of $O$ (n log n)), this yields an overall complexity of $O$ (ln log n) for a multiplication of ciphertexts.

For the polynomial multiplication in the base ring R_q= ℤ_q[x]/(xⁿ+ 1), we implemented the FFT algorithm by Nussbaumer [31] based on recursive negacyclic convolutions (with run-time complexity $\frac{9}{2} n log n log log n + O (n log n)$ of arithmetic operations in ℤ_q). The homomorphic multiplication in YASHE includes a costly key switching operation which is an inner product on $R_{q}^{ℓ ω, q}$ , hence we obtain a total cost of $ℓ_{ω, q} \cdot (\frac{9}{2} n log n log log n + O (n log n))$ operations for a ciphertext multiplication. Therefore, BGV is expected to be faster than YASHE for a ciphertext multiplication if we take similar parameters with q and n.

How to set parameters

The security of BGV relies on the hardness of the RLWE assumption. Similarly, YASHE is provably secure in the sense of IND-CPA under the RLWE assumption and DSPR assumption. The main difference between the schemes is that BGV uses an odd integer m while YASHE chooses m to be a power-of-two with a prime integer q such that q ≡ 1 (mod m). In [23], it was shown that the hardness of RLWE with the cyclotomic polynomial Φ_m(x) = x^ϕ(m)+ 1 can be established by a quantum reduction to shortest vector problems in ideal lattices. This means that YASHE is believed to be secure as long as the lattice problems are hard to solve.

Parameters of the BGV scheme

To homomorphically evaluate the algorithms for Task 1, we first choose sufficiently large t so that no reductions modulo t occurs in the plaintext slots. For example, we take t as the smallest power-of-two which satisfies the following inequalities:

n_{A}^{(j)} = \sum_{i = 1}^{200} g_{i}^{(j)} \leq \sum_{i = 1}^{200} 2 = 400 < t

since the total number of people in the same group is N = 200. So it suffices to take t = 2⁹for privately computing the minor allele counts. In the case of χ²test, we have

n_{A}^{(j)} + {n^{'}}_{A}^{(j)} = \sum_{i = 1}^{200} g_{i}^{(j)} + \sum_{i = 1}^{200} {g^{'}}_{i}^{(j)} \leq 2 \sum_{i = 1}^{200} 2 = 800 < t,

thus we set the parameter t = 2¹⁰. For the second task, we used t = 2 to evaluate binary circuits.

Now, we derive a lower-bound on ϕ(m) such that

ϕ (m) \geq \frac{(L (log m + 23) - 8.5) \cdot (λ + 110)}{7.2} .

(2)

from the security analysis of [18] based on Lindner and Peikert's method [32]. For the efficiency of the implementation, we choose the smallest integer m so as to satisfy Inequality (2) and pack the message into plaintext slots as many as possible. Next, we define a ladder of moduli to make the correct decryption after computation with L levels (see [18] for details). Finally, we consider the discrete Gaussian distribution χ_err= D_ℤ,σwith mean 0 and standard deviation σ = 3.2 over the integers to sample random error polynomials.

Parameters of the YASHE scheme

As discussed before, t = 2¹⁰will suffice to compute the MAFs and χ²statistic. For the second task, we look for the parameter t ≠ 2 which maximizes the number of slots we can handle in one go. We fix the word ω = 2¹²⁸for the evaluation key and the standard deviation σ = 8 for the error distribution χ_err.

Since we can estimate the size of noise during homomorphic operations, we get the lower bound on q to ensure the correctness. We also have maximal values of q to ensure the desired security using the results of [33], so that we can have more loose bound than that from LP's method. Then we set m as a power-of-two to get a non-trivial interval for q and then select a smallest q in this interval.

Implementation results

We present the parameter setting and performance results for secure genome analysis in Table 3 and 4. All the parameters provide 80-bit security level. We give the plaintext modulus t, the size of the ciphertext modulus q, the lattice dimension n = ϕ(m), and the number of plaintext slots ℓ. We also give the circuit depth L so that HE scheme can correctly evaluate such a computation on encrypted data. In particular, it can be considered as the number of ciphertext moduli in the BGV scheme. We consider the ciphertext size in kBytes for a set of parameters. The last columns give the timings for the key generation, encryption, evaluation and decryption.

Table 3.

Implementation results of Task 1 using BGV and YASHE

		s	t	log₂q	n	ℓ	L	\|ct\|	KeyGen	Encrypt	Eval	Decrypt
BGV	MAF	311 610	2⁹	60 61	5292 8190	378 630	3	78 kB 122 kB	6.92s 10.28s	11.90s 14.85s	29.99 ms 33.36 ms	290.06 ms 690.23 ms

	χ ²	311 610	2¹⁰	60 61	5292 8190	378 630	3	78 kB 122 kB	6.35s 12.27s	11.61s 15.13s	30.05 ms 38.17 ms	560.10 ms 720.33 ms

YASHE	MAF	311 610	2¹⁰	48	1024	1024	0	6 kB	0.01s 0.04s	1.63s 4.10s	5.74 ms 16.98 ms	33.71 ms 16.78 ms

	χ ²	311 610							0.01s 0.04s	1.61s 4.12s	5.99 ms 17.20 ms	16.73 ms 17.01 ms

Open in a new tab

Table 4.

Implementation results of Task 2 using BGV and YASHE

		Size	t	log₂q	n	ℓ	L	\|ct\|	KeyGen	Encrypt	Eval	Decrypt
BGV	Hamming	5K 10K	2	132	8190	630	7	264 kB	2.53s	12.65s 24.90s	15.39s 29.39s	0.64s 1.29s

	Edit	5K 10K		150			8	300 kB	3.41s	16.98s 33.34s	40.86s 76.08s	2.97s 5.81s

YASHE	Hamming	5K 10K	8191	384	8192	4096	6	384 kB	130.59s	29.70s 58.82s	68.31s 134.87s	2.67s 5.04s

	Edit	5K 10K								58.46s 116.61s	110.18s 245.04s	2.66s 5.07s

Open in a new tab

Performance results of Task 1

In Table 3 the top four rows refer to the results using BGV, and the bottom four rows refer to results using YASHE for computing the MAFs and χ²statistic in case-control groups. Note that the number of slots means that how many messages we can pack into one single ciphertext. When using YASHE, we can evaluate simultaneously by embedding the data into the coefficients of plaintext polynomial; the maximal degree of plaintext polynomial in this case is considered to be the number of slots.

In practice, we need to apply one more modulus-switching during homomorphic additions for the BGV scheme, so the total number of ciphertext moduli is L = 1 + 2 = 3. On the contrary, L means the levels of multiplications in YASHE (without taking into account the additions). In other words, when evaluating a polynomial of degree d on encrypted data, we have L ≈ log d levels of multiplications by computing in a binary tree way. Thus, L = 0 suffices to support such homomorphic additions in Task 1. Thus we don't need to generate the evaluation key, which enables to take less time for key generation than BGV. Moreover, the evaluation performance of YASHE is much better since BGV requires a costly modulus switching operations even for computing simple homomorphic additions.

Performance results of Task 2

Table 4 presents the parameter setting and performance results for secure DNA sequence comparison using BGV and YASHE. We evaluated the performance with the input data of different sizes 5K and 10K. We implemented the comparison circuit with the same method as described in [17, Lemma 1] in order to reduce the circuit depth over encryption.

As discussed before, given the parameter L, we obtain the approximate size of ciphertext modulus as log₂q ≈ 43 + 18 · (L − 2) for BGV when using t = 2 and R = ℤ[x]/(Φ₈₁₉₁(x)). Since it should support L = 7 or 8 to correctly evaluate genomic algorithms of Task 2, we use the modulus q around 130 to 150. On the other hand, the size of the parameter q in YASHE should be strictly larger than 2L log₂(nt) ≈ 52L with t = 2⁹and R = ℤ[x]/(x⁸¹⁹²+ 1). So we used a 384-bit prime q such that q ≡ 1 (mod 2¹⁴).

In the implementation of YASHE scheme, computing the inverse of f modulo q turns out to be the most-time consuming part of the key-generation, which runs in around 128.34 seconds(s). In total, it takes about 130.59s to generate the public key, secret key and evaluation keys, while the key generation of the BGV scheme takes about 3.41s in order to support 8 levels.

There is also quite a big gap between the two schemes in timings for a multiplication of ciphertexts: BGV takes around 0.07s, while YASHE takes around 1.75s (including the key switching step) under the parameter settings used in Task 2. For the efficiency of the YASHE scheme, we might avoid a costly key switching step during the homomorphic multiplication; however, it supports a limited number of homomorphic multiplications without the key switching step. This follows since the noise grows exponentially with the multiplicative depth through such consecutive operations. One alternative is to use a hybrid approach, in which we leave out key switching in certain places but do it in others using the evaluation key with a power of the secret key so that one can keep the ciphertext noise small for correct decryption. As a result, polynomial multiplication modulo xⁿ+ 1 takes about 0.64s, but it is still slower than that in BGV. As expected, BGV is faster than YASHE to evaluate the genomic algorithms for DNA sequence comparison.

Conclusions

In this paper, we discussed how to privately perform genomic tests on encrypted genome data using homomorphic encryption. In addition to the efficient implementations of BGV and YASHE, we compared two schemes both theoretically and practically. We found that there is a trade-off between the security and performance. YASHE uses a power-of-two dimension n which defines the 2n-th cyclotomic polynomial; this is a good choice for providing strong security, but it requires larger parameters to ensure correctness than BGV, and the homomorphic multiplication in YASHE is slower than that in BGV. Therefore, the performance numbers for BGV are better than YASHE when homomorphically evaluating deep circuits (like the Hamming distance algorithm or approximate Edit distance algorithm). On the other hand, it is more efficient to use the YASHE scheme for a low-degree computation, such as minor allele frequencies or χ²test statistic in a case-control study.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MK and KL designed the baseline methods. MK drafted the manuscript and conducted the experiment for the competition. KL guided the experimental design and provided detailed edits.

Acknowledgements

The authors would like to thank Michael Naehrig for extensive assistance with the code for the YASHE-based implementation for the contest. The authors would also like to thank the iDASH Secure Genome Analysis Contest organizers, in particular Xiaoqian Jiang and Shuang Wang, for running the contest and providing the opportunity to submit competing implementations for these important tasks. MK was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2014R1A2A 1A11050917).

This article has been published as part of BMC Medical Informatics and Decision Making Volume 15 Supplement 5, 2015: Proceedings of the 4th iDASH Privacy Workshop: Critical Assessment of Data Privacy and Protection (CADPP) challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1472-6947/15/S5.

Declarations

Publication funding for this supplement was supported by iDASH U54HL108460, iDASH linked R01HG007078 (Indiana University), NHGRI K99HG008175 and NLM R00LM011392.

References

Personal Genome Project. http://www.personalgenomes.org/ http://www.personalgenomes.org/
HapMap Project. http://hapmap.ncbi.nlm.nih.gov/ http://hapmap.ncbi.nlm.nih.gov/
Humbert M, Ayday E, Hubaux J-P, Telenti A. Addressing the concerns of the lacks family: quantification of kin genomic privacy. Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 2013. pp. 1141–1152.
Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics. 2014;15(6):409–421. doi: 10.1038/nrg3723. [DOI] [PMC free article] [PubMed] [Google Scholar]
Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux J-P, Malin BA, Wang X. Privacy in the Genomic Era. arXiv,abs/1405.1891v3. [DOI] [PMC free article] [PubMed]
Kantarcioglu M, Jiang W, Liu Y, Malin B. A cryptographic approach to securely share and query genomic sequences. IEEE Trans on Inf Technol Biomed. 2008;12(5):606–617. doi: 10.1109/TITB.2007.908465. [DOI] [PubMed] [Google Scholar]
Baldi P, Baronio R, Cristofaro ED. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. Proceedings of the 18th ACM Conference on Computer and Communications Security. 2011. pp. 691–702.
Ayday E, Raisaro JL, McLaren PJ, Fellay J, Hubaux J-P. Privacy-preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data. USENIX Workshop on Health Information Technologies. 2013.
Gentry C. Fully homomorphic encryption using ideal lattices. Proceedings of the 40th ACM Symposium on Theory of Computing. 2009. pp. 169–178.
van Dijk M, Gentry C, Halevi S, Vaikuntanathan V. Fully homomorphic encryption over the integers. Proceedings of Advances in Cryptology-Eurocrypt. 2010;6110:24–43. [Google Scholar]
Gentry C, Sahai A, Waters B. Homomorphic encryption from learning with errors: Conceptually-simpler, asymptotically-faster, attribute-based. Proceedings of Advances in Cryptology-Crypto. 2013;8042:75–92. [Google Scholar]
Yasuda M, Shimoyama T, Kogure J, Yokoyama K, Koshiba T. Secure pattern matching using somewhat homomorphic encryption. Proceedings of the 2013 ACM Cloud Computing Security Workshop. pp. 65–76.
Lauter K, Naehrig M, Vaikuntanathan V. Can homomorphic encryption be practical? Proceedings of the 18th ACM Conference on Cloud Computing Security. 2011. pp. 113–124.
Graepel T, Lauter K, Naehrig M. Ml confidential: Machine learning on encrypted data. Proceedings of Information Security and Cryptology-ICISC. 2012;7839:1–21. [Google Scholar]
Bos JW, Lauter K, Naehrig M. Private predictive analysis on encrypted medical data. Journal of Biomedical Informatics. 2014;50:234–243. doi: 10.1016/j.jbi.2014.04.003. [DOI] [PubMed] [Google Scholar]
Lauter K, López-Alt A, Naehrig M. Private computation on encrypted genomic data. Proceedings of Progress in Cryptology - LATINCRYPT. 2014;8895:3–27. [Google Scholar]
Cheon JH, Kim M, Lauter K. In: Proceedings of Financial Cryptography and Data Security - FC International Workshop WAHC. Brenner M, Christin N, Johnson B, Rohloff K, editor. Vol. 8976. 2015. Homomorphic computation of edit distance; pp. 194–212. [Google Scholar]
Gentry C, Halevi S, Smart N. Homomorphic evaluation of the AES circuit. Proceedings of Advances in Cryptology-Crypto. 2012;7417:850–867. [Google Scholar]
Bos JW, Lauter K, Loftus J, Naehrig M. Improved security for a ring-based fully homomorphic encryption scheme. Proceedings of Cryptography and Coding - 14th IMA International Conference. 2013;8308:45–64. [Google Scholar]
Wagner RA, Fischer MJ. The string to string correction problem. Journal of the ACM. 1974;21(1):168–173. [Google Scholar]
Brakerski Z, Gentry C, Vaikuntanathan V. (Leveled) fully homomorphic encryption without bootstrapping. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 2012. pp. 309–325.
Brakerski Z. Fully homomorphic encryption without modulus swithing from classical gapsvp. Proceedings of Advances in Cryptology-Crypto. 2012;7417:868–886. [Google Scholar]
Lyubashevsky V, Peikert C, Regev O. On ideal lattices and learning with errors over rings. Proceedings of Advances in Cryptology-Eurocrypt. 2010;6110:1–23. [Google Scholar]
Smart N, Vercauteren F. Fully homomorphic SIMD operations. Designs, Codes and Cryptography. 2014;71(1):57–81. [Google Scholar]
Stehlé D, Steinfeld R. Making NTRU as secure as worst-case problems over ideal lattices. Proceedings of Advances in Cryptology - EUROCRYPT. 2011;6632:27–47. [Google Scholar]
López-Alt A, Tromer E, Vaikuntanathan V. On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption. Proceedings of the 40th ACM Symposium on Theory of Computing. 2012. pp. 1219–1234.
Cheon JH, Kim M, Kim M. In: Proceedings of Financial Cryptography and Data Security - FC International Workshop WAHC. Brenner M, Christin N, Johnson B, Rohloff K, editor. Vol. 8976. 2015. Search-and-compute on encrypted data; pp. 142–159. [Google Scholar]
Halevi S, Shoup V. Design and implementation of a homomorphic encryption library. Technical report, IBM. 2013.
Shoup V. NTL: A Library for Doing Number Theory. http://www.shoup.net/ntl http://www.shoup.net/ntl
Bluestein LI. A Linear Filtering Approach to the Computation of Discrete Fourier Transform. IEEE Transactions on Audio and Electroacoustics. pp. 451–455.
Nussbaumer HJ. Fast polynomial transform algorithms for digital convolution. IEEE Trans on Acoustics, Speech and Signal Proceesing. 1980;28(2):205–215. [Google Scholar]
Lindner R, Peikert C. Better key sizes (and attacks) for lwe-based encryption. Proceedings of Topics in Cryptology-CT-RSA. 2011;6558:319–339. [Google Scholar]
Lepoint T, Naehrig M. A comparison of the homomorphic encryption schemes fv and yashe. Proceedings of Progress in Cryptology-AFRICACRYPT. 2014;8469:318–335. [Google Scholar]

[B1] Personal Genome Project. http://www.personalgenomes.org/ http://www.personalgenomes.org/

[B2] HapMap Project. http://hapmap.ncbi.nlm.nih.gov/ http://hapmap.ncbi.nlm.nih.gov/

[B3] Humbert M, Ayday E, Hubaux J-P, Telenti A. Addressing the concerns of the lacks family: quantification of kin genomic privacy. Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 2013. pp. 1141–1152.

[B4] Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics. 2014;15(6):409–421. doi: 10.1038/nrg3723. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux J-P, Malin BA, Wang X. Privacy in the Genomic Era. arXiv,abs/1405.1891v3. [DOI] [PMC free article] [PubMed]

[B6] Kantarcioglu M, Jiang W, Liu Y, Malin B. A cryptographic approach to securely share and query genomic sequences. IEEE Trans on Inf Technol Biomed. 2008;12(5):606–617. doi: 10.1109/TITB.2007.908465. [DOI] [PubMed] [Google Scholar]

[B7] Baldi P, Baronio R, Cristofaro ED. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. Proceedings of the 18th ACM Conference on Computer and Communications Security. 2011. pp. 691–702.

[B8] Ayday E, Raisaro JL, McLaren PJ, Fellay J, Hubaux J-P. Privacy-preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data. USENIX Workshop on Health Information Technologies. 2013.

[B9] Gentry C. Fully homomorphic encryption using ideal lattices. Proceedings of the 40th ACM Symposium on Theory of Computing. 2009. pp. 169–178.

[B10] van Dijk M, Gentry C, Halevi S, Vaikuntanathan V. Fully homomorphic encryption over the integers. Proceedings of Advances in Cryptology-Eurocrypt. 2010;6110:24–43. [Google Scholar]

[B11] Gentry C, Sahai A, Waters B. Homomorphic encryption from learning with errors: Conceptually-simpler, asymptotically-faster, attribute-based. Proceedings of Advances in Cryptology-Crypto. 2013;8042:75–92. [Google Scholar]

[B12] Yasuda M, Shimoyama T, Kogure J, Yokoyama K, Koshiba T. Secure pattern matching using somewhat homomorphic encryption. Proceedings of the 2013 ACM Cloud Computing Security Workshop. pp. 65–76.

[B13] Lauter K, Naehrig M, Vaikuntanathan V. Can homomorphic encryption be practical? Proceedings of the 18th ACM Conference on Cloud Computing Security. 2011. pp. 113–124.

[B14] Graepel T, Lauter K, Naehrig M. Ml confidential: Machine learning on encrypted data. Proceedings of Information Security and Cryptology-ICISC. 2012;7839:1–21. [Google Scholar]

[B15] Bos JW, Lauter K, Naehrig M. Private predictive analysis on encrypted medical data. Journal of Biomedical Informatics. 2014;50:234–243. doi: 10.1016/j.jbi.2014.04.003. [DOI] [PubMed] [Google Scholar]

[B16] Lauter K, López-Alt A, Naehrig M. Private computation on encrypted genomic data. Proceedings of Progress in Cryptology - LATINCRYPT. 2014;8895:3–27. [Google Scholar]

[B17] Cheon JH, Kim M, Lauter K. In: Proceedings of Financial Cryptography and Data Security - FC International Workshop WAHC. Brenner M, Christin N, Johnson B, Rohloff K, editor. Vol. 8976. 2015. Homomorphic computation of edit distance; pp. 194–212. [Google Scholar]

[B18] Gentry C, Halevi S, Smart N. Homomorphic evaluation of the AES circuit. Proceedings of Advances in Cryptology-Crypto. 2012;7417:850–867. [Google Scholar]

[B19] Bos JW, Lauter K, Loftus J, Naehrig M. Improved security for a ring-based fully homomorphic encryption scheme. Proceedings of Cryptography and Coding - 14th IMA International Conference. 2013;8308:45–64. [Google Scholar]

[B20] Wagner RA, Fischer MJ. The string to string correction problem. Journal of the ACM. 1974;21(1):168–173. [Google Scholar]

[B21] Brakerski Z, Gentry C, Vaikuntanathan V. (Leveled) fully homomorphic encryption without bootstrapping. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 2012. pp. 309–325.

[B22] Brakerski Z. Fully homomorphic encryption without modulus swithing from classical gapsvp. Proceedings of Advances in Cryptology-Crypto. 2012;7417:868–886. [Google Scholar]

[B23] Lyubashevsky V, Peikert C, Regev O. On ideal lattices and learning with errors over rings. Proceedings of Advances in Cryptology-Eurocrypt. 2010;6110:1–23. [Google Scholar]

[B24] Smart N, Vercauteren F. Fully homomorphic SIMD operations. Designs, Codes and Cryptography. 2014;71(1):57–81. [Google Scholar]

[B25] Stehlé D, Steinfeld R. Making NTRU as secure as worst-case problems over ideal lattices. Proceedings of Advances in Cryptology - EUROCRYPT. 2011;6632:27–47. [Google Scholar]

[B26] López-Alt A, Tromer E, Vaikuntanathan V. On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption. Proceedings of the 40th ACM Symposium on Theory of Computing. 2012. pp. 1219–1234.

[B27] Cheon JH, Kim M, Kim M. In: Proceedings of Financial Cryptography and Data Security - FC International Workshop WAHC. Brenner M, Christin N, Johnson B, Rohloff K, editor. Vol. 8976. 2015. Search-and-compute on encrypted data; pp. 142–159. [Google Scholar]

[B28] Halevi S, Shoup V. Design and implementation of a homomorphic encryption library. Technical report, IBM. 2013.

[B29] Shoup V. NTL: A Library for Doing Number Theory. http://www.shoup.net/ntl http://www.shoup.net/ntl

[B30] Bluestein LI. A Linear Filtering Approach to the Computation of Discrete Fourier Transform. IEEE Transactions on Audio and Electroacoustics. pp. 451–455.

[B31] Nussbaumer HJ. Fast polynomial transform algorithms for digital convolution. IEEE Trans on Acoustics, Speech and Signal Proceesing. 1980;28(2):205–215. [Google Scholar]

[B32] Lindner R, Peikert C. Better key sizes (and attacks) for lwe-based encryption. Proceedings of Topics in Cryptology-CT-RSA. 2011;6558:319–339. [Google Scholar]

[B33] Lepoint T, Naehrig M. A comparison of the homomorphic encryption schemes fv and yashe. Proceedings of Progress in Cryptology-AFRICACRYPT. 2014;8469:318–335. [Google Scholar]

PERMALINK

Private genome analysis through homomorphic encryption

Miran Kim

Kristin Lauter

Supplement

Conference

Abstract

Background

Methods

Results

Conclusions

Introduction

Background

Two tasks for iDASH challenge

Table 1.

Practical homomorphic encryption

The BGV scheme

The YASHE scheme

Our methods for private genome analysis

Encoding genomic data

Figure 1.

Figure 2.

Homomorphic computation of the BGV scheme

Task 1: GWAS on encrypted genomic data

Task 2: secure DNA sequence comparison

Homomorphic computation of the YASHE scheme

Task 1: GWAS on encrypted genomic data

Task 2: secure DNA sequence comparison

Results and discussion

Theoretical comparison between BGV and YASHE

Table 2.

How to set parameters

Parameters of the BGV scheme

Parameters of the YASHE scheme

Implementation results

Table 3.

Table 4.

Performance results of Task 1

Performance results of Task 2

Conclusions

Competing interests

Authors' contributions

Acknowledgements

Declarations

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases