Skip to main content
BMC Medical Genomics logoLink to BMC Medical Genomics
. 2020 Jul 21;13(Suppl 7):77. doi: 10.1186/s12920-020-0722-1

Privacy-preserving approximate GWAS computation based on homomorphic encryption

Duhyeong Kim 1, Yongha Son 1, Dongwoo Kim 1, Andrey Kim 1, Seungwan Hong 1, Jung Hee Cheon 1,
PMCID: PMC7372890  PMID: 32693801

Abstract

Background

One of three tasks in a secure genome analysis competition called iDASH 2018 was to develop a solution for privacy-preserving GWAS computation based on homomorphic encryption. The scenario is that a data holder encrypts a number of individual records, each of which consists of several phenotype and genotype data, and provide the encrypted data to an untrusted server. Then, the server performs a GWAS algorithm based on homomorphic encryption without the decryption key and outputs the result in encrypted state so that there is no information leakage on the sensitive data to the server.

Methods

We develop a privacy-preserving semi-parallel GWAS algorithm by applying an approximate homomorphic encryption scheme HEAAN. Fisher scoring and semi-parallel GWAS algorithms are modified to be efficiently computed over homomorphically encrypted data with several optimization methodologies; substitute matrix inversion by an adjoint matrix, avoid computing a superfluous matrix of super-large size, and transform the algorithm into an approximate version.

Results

Our modified semi-parallel GWAS algorithm based on homomorphic encryption which achieves 128-bit security takes 30–40 minutes for 245 samples containing 10,000–15,000 SNPs. Compared to the true p-value from the original semi-parallel GWAS algorithm, the F1 score of our p-value result is over 0.99.

Conclusions

Privacy-preserving semi-parallel GWAS computation can be efficiently done based on homomorphic encryption with sufficiently high accuracy compared to the semi-parallel GWAS computation in unencrypted state.

Keywords: Homomorphic encryption, Privacy, GWAS, Fisher scoring

Background

After the successful completion of the Human Genome Project in the early 21st century, high throughput technology on genetic variations has been rapidly developed and widely studied. In particular, through the development of microarray chip with rather small computational cost, it became possible to determine the genotype of millions of single nucleotide polymorphism (SNP), a variation in a single nucleotide that occurs at a specific position in the genome, for each individual. With those statistical data of genotypes, many researches are proposed that investigate associations between SNPs and phenotypes like major human disease, and especially Genome-wide association study (GWAS) aims to find top significant SNPs relevant to a certain phenotype.

Motivation

Since genome analysis uses genomic data that are very sensitive and irreplacable, privacy on genomic data has come up to be one of the most important issues in genome analysis including GWAS. The usual privacy-preserving methodology in data analysis is anonymization, perturbation, randomization, and condensation [1]; however, those methods leverage the quality of data with privacy resulting in an inaccurate analysis to some extent. The dilemma regarding the balance between personal privacy and analytical efficiency has been resolved by applying many cryptographic primitives, while homomorphic encryption (HE) is noticed as one of the ultimate cryptographic solutions for privacy-preserving data analysis. Conceptually, HE is an encryption scheme which allows computations over encrypted data without decryption. HE not only fundamentally prevents the leakage of input data during the analysis phase, but also provides an accurate result of analysis since it preserves the original data intactly. However, HE causes a significant blowup of computational cost for analysis, and optimization and modification of algorithm for efficient computation in HE is the main problem of applying HE in data analysis.

Since 2014, there has been an annual biomedical privacy competition hosted by Integrating Data for Analysis, Anonymization and SHaring (iDASH), a national center for biomedical computing in the United States. One of three tasks in iDASH 2018 [2] was to develop a solution for privacy-preserving GWAS computation based on HE, and we participated in this competition with our delicately constructed algorithms.

Summary of results

In this study, we propose approximate HE algorithms for privacy-preserving GWAS computation. To be precise, we transform well-known Fisher scoring and semi-parallel GWAS algorithm into HE-friendly algorithms so that we can efficiently evaluate them in encrypted state. Note that the HE-friendly modified Fisher Scoring algorithm can be generally used for logistic regression, not only for GWAS.

The main challenges in transforming the semi-parallel GWAS algorithm (Algorithm 1) to an HE algorithm are complex matrix operations such as multiplication and inversion. Since matrix inversion in HE is complicated and costly, we substitute it by computation of the adjoint matrices and determinant. With this approach, the original Fisher scoring algorithm can be successfully modified to compute encrypted data efficiently. For the efficient computation of semi-parallel GWAS computation based on HE, moreover, we reduced the number of matrix multiplications as many as possible, and modified the original algorithm into an approximate version which requires much less computational cost. The details of our optimization methodologies are well described in “Our optimization methodology” section.

We exploited an approximate HE scheme HEAAN [3, 4] with a publicly available library [5] for the implementation of our modified semi-parallel GWAS algorithm based on HE. The HE algorithm takes about 40 minutes for 245 samples each containing a binary phenotype, 3 covariates, and 14,841 SNPs on Linux with a 2.10GHz processor using 8 threads (4 cores).

Related works

Some works on HE-based genome analyses has been studied over the past years: Lauter et al. [6] studied application of HE on basic genomic algorithms such as the Pearson Goodness-of-Fit test, and Wang et al. [7] performed an exact logistic regression on small datasets based on HE. More recently, some solutions [811] submitted to task 3 of iDASH 2017 [12] competition dealt with training the logistic regression model of genomic data based on HE. Some works [13, 14] studied the privacy-preserving GWAS based on HE, but they performed rather simple χ2 test on quite small numbers of SNPs, which is quite different from iDASH 2018 task; logistic regression on large numbers of SNPs.

There are some other alternative tools to deal with privacy issues in real-world applications. One of them is a cryptographic tool called secure Multi-Party Computation (MPC). In MPC, computations are done online by multiple parties without revealing any information of the result to each party. There have been several remarkable works on privacy-preserving genome analysis based on MPC. In 2017, Jagadeesh et al. [15] proposed privacy-preserving solutions for several genomic diagnose methods based on MPC with practical implementation over real patient data. Recently, Cho, Wu and Berger [16] successfully constructed a practical MPC-based protocol for privacy-preserving GWAS computation over large-scale genomic data. There have been several previous works [14, 1719] on privacy-preserving GWAS computation based on MPC; however, they had some limitations to be applied in practice since they either require infeasible computational cost or greatly streamlined the task.

Meanwhile, in 2016, Chen et al. [20] proposed a hardware-based solution for privacy-preserving genome analysis with Software Guard Extensions (SGX) [21], the security-related instruction built in Intel CPU which allows secure computations in a private region which cannot be accessed without the private key. They implemented a SGX-based framework for secure transmission disequilibrium test on Kawasaki disease patients with high scalability on the number of SNPs.

Methods

Approximate homomorphic encryption scheme HEAAN

For privacy-preserving GWAS computation, we applied an HE scheme called HEAAN proposed by Cheon et al. [3, 4], which supports approximate computation of real numbers in encrypted state. Efficiency of HEAAN in the real world has been proved by showing its application in various fields including machine learning [8, 22, 23] and cyber physical system [24]. The winning solution of iDASH competition in 2017 also applied HEAAN as an HE scheme for privacy-preserving logistic regression on genomic data.

At high-level, for a HEAAN ciphertext ct of some message polynomial 𝔪, the decryption process with a secret key sk is done as

Decsk(ct)=𝔪+e𝔪

where e is a small error attached to the message polynomial 𝔪. Furthermore, for ciphertexts ct1 and ct2 of message polynomials 𝔪1 and 𝔪2, the homomorphic evaluation algorithms C.Add and C.Mult satisfy

DecskC.Add(ct1,ct2)𝔪1+𝔪2,Decsk(C.Multevkct1,ct2)𝔪1·𝔪2,

i.e., addition and multiplication can be internally done even in encrypted state.

For formal definitions, let L be a level parameter, and q:=2 for 1≤L. Let R:=[X]/XN+1 for a power-of-two N and Rq be a modulo-q quotient ring of R, i.e., Rq:=R/qR. The distribution χkey:=HW(h) over Rq outputs a polynomial of {−1,0,1}-coefficient having h number of non-zero coefficients, and χenc and χerr denote the discrete Gaussian distribution with some prefixed standard deviation. We denote the rounding function by ⌊·⌉, which outputs the nearest integer of a real-number input. For a=i=0N1aiXi[X]/XN+1, then a:=i=0N1aiXiR. Finally, [·]q denotes a component-wise modulo q operation on each element of Rq. Note that those parameters N, L and h satisfying a certain security level can be determined by Albrecht’s security estimator [25, 26]. The scheme description of HEAAN is as following:

  • KeyGen(params)_.
    • Sample sχkey. Set the secret key as sk←(1,s).
    • Sample aURqL and eχerr. Set the public key as pk(b,a)RqL2 where b[a·s+e]qL.
    • Sample aURqL2 and eχerr. Set the evaluation key as evkb,aRqL22 where bas+e+qL·s2qL2.
  • Encpk(𝔪)_. For a message 𝔪R, sample vχenc and e0,e1χerr. Output the ciphertext ct=v·pk+(𝔪+e0,e1)qL.

  • Decsk(ct)_. For a ciphertext ct=(c0,c1)Rq2, output a message 𝔪=[c0+c1·s]q.

  • C.Addct,ct_. For ct,ctRq2, output ctaddct+ctq.

  • C.Subct,ct_. For ct,ctRq2, output ctsubctctq.

  • C.Multevkct,ct_. For ct=(c0,c1),ct=c0,c1Rq2, let (d0,d1,d2)=(c0c0′,c0c1′+c1c0′,c1c1′). Output ctmultd0,d1+qL1·d2·evkq.

For more details of the scheme including the correctness and security analysis, we refer the readers to [3].

The above-mentioned scheme deals with message polynomial 𝔪 in some integer polynomial ring ℝ. To encrypt real (or complex) value, HEAAN use a (field) isomorphism τ:[X]/XN+1N/2 called canonical embedding. A plaintext vector m=(m0,...,mN/21) is first transformed into τ1m[X]/XN+1, and then rounded off to an integer-coefficient polynomial. However, the naive rounding-off τ1m can derive quite large relative error on the plaintext. To control the error, we round it off after scaling up by p bits for some integer p, i.e., 2p·τ1m, so that the relative error is reduced. Clearly, a decoding algorithm for 𝔪 would be 2p·τ(𝔪):

  • Ecdm;p_. For m=(m0,...,mN/21) in N/2 and a precision bit p>0, output a polynomial 𝔪2p·τ1mR where the rounding ⌊·⌉ is done coefficient-wisely.

  • Dcd(𝔪;p)_. For 𝔪R, output a plaintext vector m=2p·τ(𝔪)N/2.

To sum up, to encrypt a plaintext vector of real (complex) numbers m, we first encode m into 𝔪Ecdm;p with a certain precision bit p, and then generate a ciphertext ctEncpk(𝔪) with the public key pk.

Now consider ct1 and ct2 be ciphertexts of m1 and m2 in N/2. Since our encoding method scales each plaintext vector up by 2p, the plaintext vector of a ciphertext ct←C.Multevk(ct1,ct2) is (approximately) 2p·m1m2, not m1m2, which will result in exponential growth of plaintexts. To deal with this problem, we adjust the scaling factor by the following procedure so-called rescaling:

  • RS(ct)_. For a ciphertext ctRq2, output ctq/q·ctq.

After the rescaling procedure ctmult←RS(ct), the plaintext vector of the output ctmult is (approximately) m1m2, and the ciphertext modulus qL is reduced by 2p. As a result, the level parameter L should be carefully chosen according to the multiplicative depth of a target computation. In order to present our algorithm in a simple form, we will not describe these rescaling procedures for the rest of the paper, but we remark that in the actual use of HEAAN, there should be delicate consideration on scaling of message.

To deal with a plaintext vector of the form mk having length KN/2 for some power-of-two divisor K of N/2,HEAAN encrypts m into a ciphertext of an N/2-dimensional vector m||||mN/2, where v||w denotes the concatenation of two vectors v and w. This implies that a ciphertext of mk can be understood as a ciphertext of m||||mK for powers-of-two K and K satisfying KKN/2.

Finally, the HEAAN scheme provides the rotation operation on plaintext slots, i.e., it enables us to securely obtain an encryption of the shifted plaintext vector mr,,mN/21,m0,,mr1 from an encryption of m0,,mN/21. It is necessary to generate an additional public information rk, called the rotation key. We denote the rotation operation as follows.

  • Rotrk(ct;r)_. For the rotation key rk, output a ciphertext ct encrypting the (left) rotated plaintext vector of ct by r(>0) positions as above example. If r<0, it denotes the right rotation by (−r) positions.

We omit a subscript of each algorithm of HEAAN for convenience if it is obvious.

Matrix packing method and rotate function

In this subsection, we describe an encoding method to encrypt a matrix structure in a ciphertext which was also introduced in [8]. Consider an n×m matrix Z

Z=z0,0z0,m1zn1,0zn1,m1.

We first pad zeros to set the number of rows and columns to be powers-of-two, say n¯ and m¯, and assume that logn¯+logm¯log(N/2). Then we pack the whole matrix in a single ciphertext ctZ in a column-by-column manner. As described above, the algorithm Rot(ctZ;r) can shift the encrypted vector by r positions. In particular, we can perform row and column rotations of an encrypted matrix with this operation. When r=n¯·j, and the result will be the (left) column rotation of the encrypted matrix Z by j columns.

For the row rotation of an encrypted matrix, we use so-called masking approach. Consider n×m matrices Mi and Mi¯, where the first ni rows of Mi (resp. Mi¯) are filled with 1 (resp. 0) and the last i rows of Mi (resp. Mi¯) are filled with 0 (resp. 1). Let mski and mski¯ be the ciphertext of Mi and Mi¯, respectively.

Mi=111100,Mi¯=000011

To rotate Z by i rows in encrypted state, we first compute ct1←Rot(ctZ,i) and ct2←Rot(ctZ,in). Then, we mask them as ct1←C.Mult(ct1,mski) and ct2C.Multct2,mski¯. As a result, the output of C.Add(ct1,ct2) is a ciphertext of upper row rotation of Z by i rows.

Those row and column rotations of an encrypted matrix are denoted as follows:

  • C.ColumnRotctZ,j_. For a ciphertext ctZ of a matrix Z and an integer j, output a ciphertext ct of left column rotation of Z by j columns.

  • C.RowRotctZ,i_. For a ciphertext ctZ of a matrix Z and an integer j, output a ciphertext ct of upper row rotation of Z by i rows.

Semi-parallel GWAS algorithm

A naive application of GWAS analysis can be done by running a logistic regression for each SNP, which resulting in high computational cost since the number of SNPs can be usually hundred thousands or more. To overcome this problem, Sikorska et al.[27] proposed a semi-parallel GWAS algorithm which reduces the required computation time from 6 hours to 10-15 minutes using projections.

Let n be the number of samples each of which consists of m (binary) SNP data and k covariate data. Then the whole SNP data and covariate data can be organized as an n×m matrix S and an n×k matrix X0, respectively. For k:=k+1, we define a matrix X as the concatenation of a vector whose components are 1 and X0, denoted by X=1||X0. Let y be a target binary phenotype vector of length n. With these inputs, the semi-parallel GWAS algorithm outputs the m-dimensional vector pval which indicates the p-value of each SNP with respect to the target phenotype. The detail of the algorithm is described in Algorithm 1.

graphic file with name 12920_2020_722_Figa_HTML.jpg

The semi-parallel GWAS algorithm involves logistic regression on X,y in the first step, and Fisher Scoring [28] described in Algorithm 2 is one of the most highly efficient algorithm for logistic regression.

graphic file with name 12920_2020_722_Figb_HTML.jpg

In both algorithms, σ(x):=1/(1+ exp(−x)) is called sigmoid function. For both algorithms, if the input of functions such as logarithm (log), division (/) and sigmoid (σ) is a vector, then it means to apply the function component-wisely resulting in the output vector of the same length. The notation T in the superscript denotes the matrix transpose. The operation ⊙ denotes the Hadamard (component-wise) multiplication of two vectors, and the notation diag(·) with an input of a square matrix means the diagonal vector of the input.

Our optimization methodology

The aim of this study is to construct an HE algorithm for privacy-preserving semi-parallel GWAS computation. Since non-polynomial operations such as matrix inverse or real number inversion is a challenging stuff in HE, we need to modify the original semi-parallel GWAS algorithm into HE-friendly form for efficiency. Moreover, the super-large data size of GWAS requires too much computational cost in encrypted state, and this issue should be resolved. In this regard, we introduce our optimization methodology to the algorithm.

Modification of fisher scoring

The main obstacle of Fisher Scoring (Algorithm 2) is a matrix inversion for U=XTWXk×k. To overcome this problem, we exploit the adjoint matrix adj(U) and the determinant det(U) of U. For (i,j)-minor Mi,j of U, adj(U) and det(U) are obtained from basic linear algebra:

adj(U)=(1)i+j·Mi,jk×k, 1
det(U)=i=0k1ui,0·(1)i·Mi,0. 2

We express the inverse matrix U−1 as U1=1det(U)·adj(U). To be precise, observe that

v(t)=logp(t)1p(t)+yp(t)diagW(t)=Xβ(t)+yp(t)diagW(t),

from which we obtain an iterative updating equation on β(t) as follows:

β(t+1)=U1XTW(t)Xβ(t)+yp(t)diagW(t)=β(t)+U1XTyp(t)=β(t)+1det(U)·adj(U)XTyp(t).

Here one needs to compute the inverse of det(U), but this non-polynomial operation is rather expensive in HE. The key observation on this equation is that the second term U1XTyp(t) essentially converges to 0 as t since β(t) converges to some point. From this, we may expect that the convergence would be still valid even when we neglect the term det(U)−1 and substitute it by some appropriate constant. Namely, we can modify the equation as

β(t+1)=β(t)+α·adj(U)XTyp(t).

for some constant α>0. In practice, this approximate version of the Fisher scoring algorithm works quite well with slightly slower convergence rate.

Computing diag(STWS) without S

The main observation of this subsection is that computation of n×m matrix S in Algorithm 1 is superfluous for obtaining p-values. Rather than computing S, we directly compute det(U)·diag(STWS), which can be obtained without computing (matrix) inversion. To be precise, using the fact that S=SXU−1V for V:=XTWS, we get

STWS=SXU1VTWSXU1V=STWSVTU1V.

Based on this observation, we compute det(U)·diag(STWS) by following:

  1. Compute U=XTWX and V=XTWS.

  2. Compute adj(U) and det(U).

  3. Compute det(U)·diag(STWS)−diag(VTadj(U)V).

Approximate computation of STWv

We also take the main observation of the previous subsection so that we do not compute S. From the definition of S and v, it holds that STWv=STWv. Then we have the following equations:

STWv=STWIXU1XTWv=STWIXU1XTWypdiag(W)=STypSTWXU1XTypSTyp

where the last approximation is valid since the term XTyp is sufficiently close to the zero vector, which is resulted from the Fisher Scoring. For example, each component of XTyp was approximately 10−30 in our experiment with 4 iterations in Fisher scoring. Refer to “Dataset description” section for the specific description of datasets. Therefore, we compute det(U)·STyp which is a reliable approximation of det(U)·STWv, in much less computational costs.

Our modified semi-parallel GWAS algorithm

To sum up all our algorithmic optimization techniques described in above, we have Algorithm 3 and 4, which are modified Fisher Scoring and semi-parallel GWAS algorithms, respectively.

graphic file with name 12920_2020_722_Figc_HTML.jpg

In Algorithm 3, the constant α takes a similar role to the learning rate in gradient descent algorithm [29], which can be adjusted if necessary.

graphic file with name 12920_2020_722_Figd_HTML.jpg

We remark that step 8 of Algorithm 4, a conversion procedure from the squared z-score zi to the p-value pvali, is done in unencrypted state. Namely, we decrypt the ciphertext of zi for 1≤im after step 7 so that zi’s are publicized. We stress that the squared z-score has exactly the same information as the p-value, so publishing squared z-scores does not leak any additional information more than publishing p-values.

Homomorphic evaluation of the modified semi-parallel GWAS algorithm

Upon HE-friendly algorithms discussed in the previous section, there still remain computational issues regarding more fundamental operations. Recall that HEAAN basically supports component-wise addition and multiplication along with data slot rotations. However, we encrypt the data matrix by column-by-column manner, and our algorithms include complex operations such as matrix multiplication, evaluation of the adjoint matrix, a sigmoid function, and so on. In this regard, we specify how we can deal with such operations efficiently, reducing the number of multiplications or the total depth of multiplications required which are the main bottleneck of HE.

Note that this section consists of rather technical contents related to HE, since it includes HE algorithms of all building blocks for Algorithm 3 and Algorithm 4. One can simply embrace the fact that every operation required in Algorithm 3 and Algorithm 4 can be efficiently done based on HE, if not really interested in the details.

Hereafter, [a]k with an integer a denotes a residue number in [0,k−1] modulo k. An n-dimensional vector a=(a1,,an) is simply denoted by (ai)1≤in, and an n×m matrix A having (i,j)-entry ai,j is denoted by [ai,j]1≤in,1≤jm. For both cases, if the size is obvious from context, we simply write a vector by (ai), and a matrix by [ai,j]. Every vector and matrix in this section is assumed to be of size power-of-two, which is in line with our packing method introduced in “Matrix packing method and rotate function” section.

Adjoint matrix and determinant

In step 4 of Algorithm 3 and step 2 of Algorithm 4, we need to compute the adjoint matrix and the determinant of the matrix U=[ui,j]i,j=XTWX.

We recall Eqs. 1 and 2 for adjoint matrix and determinant. Given an encryption of U, denoted by CU, we generate (k−1)2 ciphertexts Ci,j for 1≤i,jk−1 from CU, whose plaintext is an i-row (upper) rotation and j-column (left) rotation of U. We first consider the 0-th plaintext slot, i.e., the (0,0)-position of the plaintext matrix, of the ciphertexts. Since every ua,b for 1≤a,bk−1 is a (0,0)-entry of plaintext matrix for one and only one ciphertext Ca,b, we can compute a ciphertext whose (0,0)-entry of the plaintext matrix is M0,0 from Ca,b’s, by homomorphically evaluating the polynomial f which outputs M0,0 with input ua,b’s.

Now observe that Ma,b and Ma,b have a formula of the same form where the subscript indices of ui,j are shifted by (aa,bb) modulo k. Thanks to this index-shifting property, the homomorphic evaluation of the polynomial f with input Ca,b’s essentially outputs a ciphertext of which the plaintext matrix is [Mi,j].

After computing the ciphertext of [Mi,j] as above, we can obtain the ciphertext Cadj of adj(U) by multiplying a ciphertext Csgn of [(−1)i+j]. Finally, the ciphertext Cdet of determinant det(U) is easily obtained from the homomorphic multiplication of CU and Cadj, followed by logk rotations and summations.

In case of k=4, for example, the polynomial f is defined as f([ui,j]1≤i,j≤3)=u1,1u2,2u3,3u1,1u2,3u3,2u2,1u1,2u3,3+u2,1u1,3u3,2+u3,1u1,2u2,3u3,1u1,3u2,2. Then, the homomorphic evaluation to obtain Cadj is done as Algorithm 5.

graphic file with name 12920_2020_722_Fige_HTML.jpg

Matrix multiplications

Let A=[ai,j] be an n×k matrix with nk and B=[bi,j] be an n×m matrix. We use an Algorithm 6 computing a ciphertext CATB of ATB from ciphertexts CA and CB of A and B, which is inspired from the hybrid method by Juvekar et al. [30].

As the first step, we compute k ciphertexts of

diagt(A)=ai,[it]k0in1

for 0≤tk−1. For this, we use ciphertexts dmskt of n×k masking matrix, of which the (i,j)-entry is δ[i+t]k,j. Here δi,j denotes the Kronecker Delta. By summing column rotations of Admskt, we get ciphertexts of n×m matrices Expdiagt(A) having m identical columns diagt(A). Then, we compute the following matrix M:

M=t=0k1ρtExpdiagt(A)B=t=0k1ρtai,[it]k·bi,j=t=0k1a[i+t]n,[i]k·b[i+t]n,jn×m,

where ρt is an (upward) t-rotation of matrix by row. Thus, by properly summing rows of M, we obtain ATB=t=0n1[at,i·bt,j]k×m. The detail of this algorithm is described in Algorithm 6.

graphic file with name 12920_2020_722_Figf_HTML.jpg

Indeed, our GWAS algorithm contains several matrix multiplications of the form ATDB for the diagonal matrix D. For this, we first compute B=DB by Expdiag0(D)⊙B, and then obtain ATDB by computing ATB with the above method. Note that this requires only one additional Hadamard multiplication.

Matrix-vector multiplications

By understanding a vector by an one column matrix, we can perform all matrix-vector multiplications in our algorithms, except Xβ that appears in step 5 of Algorithm 3.

In fact, we can also compute Xβ by changing Algorithm 6 a little bit. Recall that X=[xi,j] is an n·k matrix and β=(βi) is a k-length vector, and it holds that nk. We now again compute k ciphertexts of diagt(X), and then compute

t=0k1diagt(X)ρtβ||||β=t=0k1xi,[it]kβ[it]k=t=0k1xi,[it]k·β[it]k=Xβ,

whose algorithm is described by Algorithm 7.

graphic file with name 12920_2020_722_Figg_HTML.jpg

We remark that there is another simple method for matrix-vector multiplication that we use for c=STyp. For simplicity, let x=(xi):=yp. Then by rotating and summing all rows of Sx||||x=si,j·xi, we obtain a matrix having the same size with S and consisting of identical rowsc=STx. This requires only one Hadamard multiplication and logn rotations. However, strictly speaking, this resulting ciphertext is not a ciphertext of STx, since it encrypts a matrix having c row-wisely, not column-wisely. Thus we can only use this simple method only for STx, where this row-wise packing does not matter after then. The detail of this algorithm is described in Algorithm 8.

graphic file with name 12920_2020_722_Figh_HTML.jpg

Fast diag(ATBA) computations

To obtain diag(ATBA), one can perform matrix multiplication followed by diagonal extraction, but it is obviously not optimal since this computes unnecessary entries of ATBC other than diagonal entries. Thus we use another method that only compute the diagonal entries.

Let A=[ai,j] be an n×m matrix. As an incremental step, we first consider diag(ATDC) where D=[di,j] is an n×n diagonal matrix and C=[ci,j] is an n×m matrix. Then it holds that diagATDCj=i=0n1di,i·ai,j·ci,j. Now, from an encryption of D, we compute an encryption of n×m matrix Expdiag0(D) and then by rotating and summing

AExpdiag0(D)C=di,i·ai,j·ci,ji,j

through all rows, we obtain a matrix consisting of identical rows diag(ATDC). One can easily check that Algorithm 8 with input CA and C.Mult(Cdiag(D),CC)) exactly performs this computation, and then we omit the explicit algorithm. Note that this can be directly applied for diag(STWS) computation of step 5 of Algorithm 4.

Toward our goal diag(ATBA) with a full matrix B, we exploit the above diagonal-case method after decomposing B into diagonal matrices. Let Bt be a diagonal matrix with the diagonal diagt(B) for 0≤tn−1, then it holds that

diagATBA=t=0n1diagAT·Bt·ρt(A).

Therefore, after obtaining encryptions of Expdiagt(B) and ρt(A) from encryptions of B and A, we can directly apply the diagonal-case method on each diag(AT·Bt·ρt(A)) computation for 1≤tn and finally obtain the encryption of diag(ATBA).

Here we again remark that, since these methods use Algorithm 8, they also ruin the column-wise packing as we already pointed out. Hence after applying these methods, it would be hard to perform another matrix operation. Indeed, one can check that the diagonal extractions are required for step 5 of Algorithm 4, which is the last part of algorithm that uses matrix structure.

Approximate computation of sigmoid

Since the sigmoid function σ(x)=1/(1+ exp(−x)) is not a polynomial, we exploit an approximate polynomial of the function to evaluate based on HE. Following the methodology of [8, 22], we used least square approximation method over the interval [−8,8]. The approximate polynomials g(x) of degree 7 is computed as

0.5+1.735·x84.194·x83+5.434·x852.507·x87.

The maximal error between σ(x) and g(x) is approximately 0.032.

Inverse of real numbers

In step 7 of Algorithm 4, we need to compute the inverse of di for 1≤im. To compute the inverse of real numbers, we exploit the Goldschmidt’s division algorithm [31], which outputs an approximate value of the inverse through iterative polynomial evaluations. Refer to [32] for more details of the algorithm.

Results

In this section, we present the experimental results of our modified semi-parallel GWAS algorithm based on HEAAN with a publicly available library [5]. All experiments were implemented in C++ 11 standard, and performed on Linux with Intel Xeon CPU E5-2620 v4 at 2.10GHz processor with multi-threading (8 threads) turned on.

Dataset description

We used a dataset of 245 samples where each sample contains a binary phenotype, 3 covariates (height, weight, age), and 25,484 SNP data provided by iDASH 2018 competition. The dataset is divided into two sets named by iDash_Test and iDash_Eval each composed of 245 samples containing common phenotype and 3 covariates but different number of SNPs; 10,643 and 14,841 SNPs, respectively. We used iDash_Test to set optimal parameters, and iDash_Eval was used to evaluate our algorithm in the competition. Note that the first column of the covariate matrix Xn×k is a vector of which all the components are 1. Therefore, the parameters are (n,m,k)=(245,10643,4) for iDash_Test and (n,m,k)=(245,14841,4) for iDash_Eval.

Experimental setting and parameter selection

We propose two HEAAN parameter sets achieving 128-bit or higher security for two experiments denoted by Exp I and Exp II in Table 1. The security levels of HEAAN parameter sets were estimated with Albrecht’s security estimator [25, 26] of which inputs are the ring dimension N, the modulus Q, the Hamming weight h of a secret polynomial, and the error distribution χerr. Note that since the modulus of the evaluation key evk is 22L, the security of HEAAN is estimated with input (N,Q=22L,h,χerr).

Table 1.

Parameters for HEAAN, and running time of KeyGen,Enc and Dec

Exp HE params Time (sec)
logN L p h KeyGen Enc Dec
I 17 1300 50 56 157 68 0.28
II 17 1700 50 78 197 90 0.15

Exp I is a streamlined version operating Algorithm 4 only until step 5; that is, it does not perform the last division process. Note that Exp II that includes the division process (step 7 of Algorithm 4) naturally requires larger L than Exp I.

We first set the scaling parameter p to be sufficiently large so that errors derived from HEAAN do not effect on significant bits of plaintexts. The level parameter L is chosen by considering the fact that p levels are consumed for each homomorphic multiplication. See “Approximate homomorphic encryption scheme HEAAN” section for specific definitions of HEAAN parameters. Besides thoese HEAAN parameters, one also needs to select an appropriate constant α>0 in Algorithm 3. After experimenting on several values, we set α=8. Note that the choice of α merely depends on the size of X, but not on S.

Experimental results and evaluation

We demonstrated our modified semi-parallel GWAS algorithm in encrypted state, and evaluated the accuracy of our algorithm comparing it to that of the original algorithm which is performed in unencrypted state. The comparison result of p-value is described as a (log-scale) graph in Fig. 1. We plotted each SNPs according to the p-values computed by original algorithm and ours denoted by True and Enc, respectively. The diagonal line represents the line y=x, and closer distribution of points to this line implies higher accuracy. The Fig. 1a shows that the accuracy of our algorithm increases with the number of iterations for Fisher scoring, where the data set iDash_Test is used. The Fig. 1b shows that the accuracy of Exp II, which includes a division procedure, is comparable to that of Exp I without the division, where the data set iDash_Eval is used. Comparing Exp I and Exp II, there exists a trade-off between computational time and information leakage. The output of Exp I is the vector of squared statistics (zi)1≤im which has exactly same information with the p-value vector pval, but it takes 20 minutes longer than Exp I. On the other hand, since Exp I outputs the numerator det(U)·ci2 and the denominator di (in Algorithm 4) separately, it leaks some information more than p-values. However, it still seems to be very hard to extract any important information of input data from the numerator and denominator.

Fig. 1.

Fig. 1

Comparison of p-values on iDASH datasets: a the left figure represents Exp I for iter=1,2,4 on iDash_Test, b the right figure represents Exp I and Exp II for iter=4 on iDash_Eval

For more concrete evaluation, we classified each SNP as positive or negative depending on whether the corresponding p-value is larger or smaller than the given threshold (e.g. 10−2,10−5, or 10−12). Then, the accuracy of our algorithm compared to the original algorithm can be checked by a well-known statistical measure called F1 score. The F1 score of our algorithm is calculated by regarding the positive SNPs classified by the original GWAS algorithm to be the correct positive samples. For the formal definition of F1 score, we refer readers to [33].

The performance of our algorithm including the computation time and the F1 score on each parameter set is described in Table 2, where iter denotes the number of iterations in Fisher Scoring, Comp. time denotes the running time of our algorithm in encrypted state, and TH denotes the threshold of p-values for classification.

Table 2.

Experimental results for each parameter set

Data Params Comp. time Memory F1 Score
Exp iter (GB) TH: 10−2 TH: 10−5 TH: 10−12
iDash_Test I 1 13 min* 9.3 0.960 0.969 0.243
I 2 27 min 16.7 0.985 0.985 0.955
I 4 32 min 16.7 1.000 0.999 0.997
II 4 52 min 22.0 1.000 0.999 0.998
iDash_Eval I 4 38 min 19.4 0.998 0.995 0.992
II 62 min 25.4 0.998 0.996 0.994

*: We used more streamlined parameter; logN=16,L=950,p=50,h=91

As we have seen in Fig. 1, more iterations of Fisher scoring provides higher accuracy measured by higher F1 score. Note that 4 iterations suffice to provide high F1 score even in a very small threshold such as 10−12. Also, Exp II calculating approximate inverse in encrypted state provides almost similar F1 score to Exp I without such approximation. It implies that the error of inverse approximation does not seriously impact the whole approximation. Furthermore, Exp II shows even higher F1 score than Exp I due to the cancellation of errors from our algorithmic approximation and that from the inverse approximation.

For about 15,000 SNP data, our algorithm works in less than 40 minutes when we exclude step 7 of Algorithm 4 in encrypted state, or in about 60 minutes otherwise. We emphasize that each iteration of Fisher scoring takes about 3 minutes while the Goldschmidt’s division algorithm takes less than 30 seconds. Exp II takes much longer time than Exp I due to the larger level parameter L.

Discussion

Scalability

Our algorithm is executed and evaluated with about hundreds of samples each containing ten thousand SNPs, and 3 covariates which can be seen as a small-size data in usual GWAS analysis. We emphasize that our algorithm is highly scalable in the number of samples or SNPs, since we circumvent the naive execution of large-sized matrix operations through the proper algorithmic modification. To test the scalability of our algorithm in practice, we randomly generated 500 samples each of which consist of 3 covariates and 30,000 SNPs. Each column of the covariate matrix was uniform randomly generated in the interval [150,200],[40,100] and [20,80] considering height, weight and age, respectively. Each element of the SNP matrix was uniform randomly chosen as a binary matrix. The experiment Exp 1 on this random dataset encrypted with properly chosen hyperparameters iter=4 and α=2−7 still showed quite accurate p-value result compared to the result obtained by running Algorithm 2 in unencrypted state within 2 h.

Fisher Scoring

Our HE-friendly modified Fisher scoring (Algorithm 3) works quite well in practice, but there still remains to obtain some theoretical results on the convergence of the algorithm with respect to the new parameter α. Furthermore, we should consider an error in every operation derived from HEAAN when homomorphically evaluate the algorithm. As a result, research on the convergence of the erroneous version of our modified Fisher scoring algorithm should be very interesting topic as a further work. In addition, we note that our modified Fisher scoring algorithm can be generally used for logistic regression, not restricted to GWAS algorithm.

Conclusions

Interest on privacy-preserving genome data analysis based on HE has grown up very rapidly since the annual iDASH competition was launched, and GWAS is one of the most important technologies in this area which was also selected as one of three tasks in iDASH 2018 competition. Our HE-friendly modified semi-parallel GWAS algorithm was successfully implemented based on an approximate HE scheme HEAAN, and we could obtain the p-value result in about 30–40 minutes for 10,000–15,000 SNP data with sufficiently high accuracy compared to the result obtained in unencrypted state.

Acknowledgements

We thank Kyoohyung Han for the help on finding an error point of our code implementation. We also thank Minki Hhan for giving us helpful discussion on designing efficient algorithms.

About this supplement

This article has been published as part of BMC Medical Genomics Volume 13 Supplement 7, 2020: Proceedings of the 7th iDASH Privacy and Security Workshop 2018. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-13-supplement-7.

Abbreviations

HE

Homomorphic encryption

SNP

Single nucleotide polymorphism

GWAS

Genome-wide association study

TH

threshold

Authors’ contributions

DH designed the overall algorithms for this study. YH and DW transformed the algorithms into homomorphic encryption algorithms. AD and SW contributed to the implementation of our HE algorithms. JH contributed in writing the manuscript. DH takes final responsibility for the manuscript. All author(s) have read and approved the final manuscript.

Funding

This work was supported in part by the Institute for Information & Communications Technology Promotion (IITP) Grant through the Korean Government (MSIT), (Development and Library Implementation of Fully Homomorphic Machine Learning Algorithms supporting Neural Network Learning over Encrypted Data), under Grant 2020-0-00840, in part by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIT) (No. 2017R1A5A1015626), and in part by the LG Electronics (LGE) grant.

Availability of data and materials

The dataset was available to participants registered to iDASH 2018 competition.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Malik MB, Ghazi MA, Ali R. Third International Conference on Computer and Communication Technology (ICCCT) Allahabad: IEEE; 2012. Privacy preserving data mining techniques: current scenario and future prospects. [Google Scholar]
  • 2.IDASH 2018. http://www.humangenomeprivacy.org/2018/. Accessed 15 Jan 2019.
  • 3.Cheon JH, Kim A, Kim M, Song Y. Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Application of Cryptology and Information Security. Cham: Springer; 2017. Homomorphic encryption for arithmetic of approximate numbers. [Google Scholar]
  • 4.Cheon JH, Han K, Kim A, Kim M, Song Y. Annual International Conference on the Theory and Applications of Cryptographic Techniques. Cham: Springer; 2018. Bootstrapping for approximate homomorphic encryption. [Google Scholar]
  • 5.Han K, Kim A, Kim M, Song Y. Implementation of HEAAN. https://github.com/snucrypto/HEAAN. Accessed 12 July 2018.
  • 6.Lauter K, López-Alt A, Naehrig M. International Conference on Cryptology and Information Security in Latin America. Cham: Springer; 2014. Private computation on encrypted genomic data. [Google Scholar]
  • 7.Wang S, Zhang Y, Dai W, Lauter K, Kim M, Tang Y, Xiong H, Jiang X. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in GWAS. Bioinformatics. 2015;32(2):211–8. doi: 10.1093/bioinformatics/btv563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Kim A, Song Y, Kim M, Lee K, Cheon JH. Logistic regression model training based on the approximate homomorphic encryption. BMC Med Genet. 2018;11(4):83. doi: 10.1186/s12920-018-0401-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Chen H, Gilad-Bachrach R, Han K, Huang Z, Jalali A, Laine K, Lauter K. Logistic regression over encrypted data from fully homomorphic encryption. BMC Med Genet. 2018;11(4):81. doi: 10.1186/s12920-018-0397-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Crawford JL, Gentry C, Halevi S, Platt D, Shoup V. Proceedings of the 6th Workshop on Encrypted Computing & Applied Homomorphic Cryptography. New York: ACM; 2018. Doing real work with FHE: The case of logistic regression. [Google Scholar]
  • 11.Bonte C, Vercauteren F. Privacy-preserving logistic regression training. BMC Med Genet. 2018;11(4):86. doi: 10.1186/s12920-018-0398-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.IDASH 2017. http://www.humangenomeprivacy.org/2017/. Accessed 15 Jan 2019.
  • 13.Lu W, Yamada Y, Sakuma J. 2015 IEEE Security and Privacy Workshops. USA: IEEE; 2015. Efficient secure outsourcing of genome-wide association studies. [Google Scholar]
  • 14.Bonte C, Makri E, Ardeshirdavani A, Simm J, Moreau Y, Vercauteren F. Towards practical privacy-preserving genome-wide association study. BMC Bioinformatics. 2018;19(1):537. doi: 10.1186/s12859-018-2541-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Jagadeesh KA, Wu DJ, Birgmeier JA, Boneh D, Bejerano G. Deriving genomic diagnoses without revealing patient genomes. Science. 2017;357(6352):692–5. doi: 10.1126/science.aam9710. [DOI] [PubMed] [Google Scholar]
  • 16.Cho H, Wu DJ, Berger B. Secure genome-wide association analysis using multiparty computation. Nat Biotechnol. 2018;36(6):547. doi: 10.1038/nbt.4108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Kamm L, Bogdanov D, Laur S, Vilo J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics. 2013;29(7):886–93. doi: 10.1093/bioinformatics/btt066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Constable SD, Tang Y, Wang S, Jiang X, Chapin S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med Inform Decis Making. 2015;15:2. doi: 10.1186/1472-6947-15-S5-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Bogdanov D, Kamm L, Laur S, Sokk V. Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. IEEE/ACM Trans Comput Biol Bioinforma. 2018;15(5):1427–32. doi: 10.1109/TCBB.2018.2858818. [DOI] [PubMed] [Google Scholar]
  • 20.Chen F, Wang S, Jiang X, Ding S, Lu Y, Kim J, Sahinalp SC, Shimizu C, Burns JC, Wright VJ, et al. Princess: Privacy-protecting rare disease international network collaboration via encryption through software guard extensions. Bioinformatics. 2016;33(6):871–8. doi: 10.1093/bioinformatics/btw758. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Anati I, Gueron S, Johnson S, Scarlata V. Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy vol. 13. New York: ACM; 2013. Innovative technology for cpu based attestation and sealing. [Google Scholar]
  • 22.Kim M, Song Y, Wang S, Xia Y, Jiang X. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Med Inform. 2018;6(2):e19. doi: 10.2196/medinform.8805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Cheon JH, Kim D, Kim Y, Song Y. Ensemble method for privacy-preserving logistic regression based on homomorphic encryption. IEEE Access. 2018;6:46938–48. [Google Scholar]
  • 24.Cheon JH, Han K, Hong SM, Kim HJ, Kim J, Kim S, Seo H, Shim H, Song Y. Toward a secure drone system: Flying with real-time homomorphic authenticated encryption. IEEE Access. 2018;6:24325–39. [Google Scholar]
  • 25.Albrecht MR, Player R, Scott S. On the concrete hardness of learning with errors. J Math Cryptol. 2015;9(3):169–203. [Google Scholar]
  • 26.Albrecht MR. A sage module for estimating the concrete security of learning with errors instances. https://bitbucket.org/malb/lwe-estimator. Accessed 15 July 2018.
  • 27.Sikorska K, Lesaffre E, Groenen PF, Eilers PH. Gwas on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinformatics. 2013;14(1):166. doi: 10.1186/1471-2105-14-166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Longford NT. A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika. 1987;74(4):817–27. [Google Scholar]
  • 29.Ruder S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. 2016.
  • 30.Juvekar C, Vaikuntanathan V, Chandrakasan A. 27th USENIX Security Symposium (USENIX Security 18) Berkeley: USENIX Association; 2018. Gazelle: A low latency framework for secure neural network inference. [Google Scholar]
  • 31.Goldschmidt RE. Applications of division by convergence. PhD thesis, Massachusetts Institute of Technology. 1964.
  • 32.Markstein P. Software division and square root using goldschmidt’s algorithms. Proc 6th Conf Real Numbers Comput (RNC’6) 2004;123:146–57. [Google Scholar]
  • 33.Chinchor N. Proceedings of the 4th Conference on Message Understanding. USA: Association for Computational Linguistics; 1992. Muc-4 evaluation metrics. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The dataset was available to participants registered to iDASH 2018 competition.


Articles from BMC Medical Genomics are provided here courtesy of BMC

RESOURCES