Learning-Augmented Sketching Offers Improved Performance for Privacy Preserving and Secure GWAS

Junyan Xu; Kaiyuan Zhu; Jieling Cai; Can Kockan; Natnatee Dokmai; Hyunghoon Cho; David P Woodruff; S Cenk Sahinalp

doi:10.1101/2024.09.19.613975

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Sep 24:2024.09.19.613975. [Version 1] doi: 10.1101/2024.09.19.613975

Learning-Augmented Sketching Offers Improved Performance for Privacy Preserving and Secure GWAS

Junyan Xu ¹, Kaiyuan Zhu ², Jieling Cai ³, Can Kockan ⁴, Natnatee Dokmai ⁵, Hyunghoon Cho ⁵, David P Woodruff ⁶, S Cenk Sahinalp ¹

PMCID: PMC11451619 PMID: 39372761

Abstract

The introduction of trusted execution environments (TEEs), such as secure enclaves provided by the Intel SGX technology has enabled secure and privacy-preserving computation on the cloud. The stringent resource limitations, such as memory constraints, required by some TEEs necessitates the development of computational approaches with reduced memory usage, such as sketching. One example is the SkSES method for GWAS on a cohort of case and control samples from multiple institutions, which identifies the most significant SNPs in a privacy-preserving manner without disclosing sensitive genotype information to other institutions or the cloud service provider. Here we show how to improve the performance of SkSES on large datasets by augmenting it with a learning-augmented approach. Specifically, we show how individual institutions can perform smaller scale GWAS on their own datasets and identify two sets of variants according to certain criteria, which are then used to guide the sketching process to more accurately identify significant variants over the collective dataset. The new method achieves up to 40% accuracy gain compared to the original SkSES method under the same memory constraints on datasets we tested on. The code is available at https://github.com/alreadydone/sgx-genome-variants-search.

1. Introduction

The ongoing expansion and increasing significance of genomic datasets necessitate the development of privacy-preserving and secure methods to process and analyze them. Genomic data carries sensitive insights into an individual’s predisposition to diseases [1], extending potential privacy implications to their biological relatives [2]. The inherent immutability of genomic information and the potential for unforeseen privacy breaches in the future [3] highlight the criticality of safeguarding this data. As a result, policies or law could severely limit or fully prohibit exchange of genomic data between hospitals, research institutions, government organizations and countries. A number of cryptographic techniques [4, 5, 6, 7, 8, 9, 10, 11] have been proposed to address privacy concerns related to genomic data, but their high computational overhead means they cannot scale up to offer the performance needed for real-life genomic analysis [12, 13, 14, 15].

A genome-wide association study (GWAS) aims to identify genetic variants associated with a particular trait (e.g. disease), and it does so by computing a certain statistic for each variant over a large cohort of cases and controls. Bringing together such a cohort, especially for rare diseases, may necessitate multiple institutions, possibly from several countries, to share data and collectively develop an accurate model. SkSES (sketching algorithms for secure-enclave-based genomic data analysis) [16] is a computational framework to perform secure collaborative GWAS in an untrusted cloud platform with minimal performance overhead, through the use of Intel’s Software Guard Extensions (SGX) [17], a combined hardware and software platform supported by all current-generation Intel processors, that offers users sensitive data analysis within a protected enclave. Each sample’s genotype vector is encrypted and uploaded independently to the secure enclave, and it is decrypted there, so no institution nor the cloud provider could gain access to the sensitive genomic data, alleviating privacy concerns for such large collaborative genomic analysis.

Unlike software-based techniques such as the garbled-circuit-based FlexSC framework [18] or the homomorphic encryption-based HElib framework [19], the SGX platform does not introduce substantial computational overhead or restrictions on basic operations, and thus is likely to make secure genome-scale data analysis feasible. However, it does come with memory limitations, which SkSES addresses by a “sketching” approach to summarize variant frequencies within the limited memory of the secure enclave and identify the most significant variants across case and control samples (i.e., those that best differentiate the case and control samples) according to the $χ^{2}$ statistic.

One key challenge in GWAS is sorting out systematic differences between different human subpopulations [20]: disease-causing genetic changes may get passed down alongside harmless ones, leading to false hits. This confounding effect is known as population stratification and requires mitigation using methods such as EIGENSTRAT [21], which subtracts off principal components identified through PCA. Due to limited memory of the enclave, sketching is also applied by SkSES to perform PCA.

As the size of the dataset increases, the sketches become more crowded, and the estimation error soars. We hypothesize that “collisions”, i.e., multiple significant (in terms of the statistic used by the GWAS or another quantity used as a proxy for the statistic) SNPs being assigned to the same entry (i.e., “bucket”) in the sketches, might be a main contributing factor to the error. In this paper, we address this issue by introducing a learning-augmented approach to identify the most significant SNPs in a training dataset and assign them to “unique buckets” outside of the sketches; this also reduces their impact on the estimated quantity for SNPs that are non-significant. In addition, we show how to identify another set of representative SNPs that are most useful for accurately identifying principal components during PCA for reducing the impact of population stratification via EIGENSTRAT. These SNPs are identified by running GWAS on a much smaller dataset (say 10% of the size of the full dataset), which can be done much less costly and locally by each or any of the data-holding parties involved. According to our tests, by employing these two sets of SNPs, our learning-augmented version of SkSES significantly improves the accuracy of the original version of SkSES without introducing any overhead with respect to memory usage or speed.

2. Methods

2.1. Preliminaries

2.1.1. The SGX platform in a nutshell

Intel SGX (Software Guard Extensions) is a commonly-used Trusted Execution Environment (TEE) technology embedded in modern Intel CPU models, which offers hardware-based protection for applications running inside an SGX enclave - an isolated and protected computing environment against adversaries controlling the host operating system. The security model provided by SGX hardware can prevent adversaries from learning private data or modifying the control flow of a protected application without compromising the hardware itself. As such a common use case of SGX is for secure outsourced computation, in which a user with limited computational capabilities sends private data to a remote SGX enclave, controlled by an untrusted party, for secure data processing. This enticing possibility has been gathering increasing attention in the bioinformatics community in recent years [22, 23, 24, 25, 26, 27, 16, 28, 29, 30] due to its potential to accelerate scientific discoveries by facilitating the sharing of sensitive biomedical data. For security in the outsourcing scenario, SGX usually employs a cryptographic protocol called remote attestation to provide proof to a remote user that both the enclave environment and the application running inside the enclave would not be tampered with; and that the communication channel for the transfer of sensitive data is secure.

In this work, we focus on the algorithmic improvement on the GWAS protocol but do not address the potential side-channel leaks which refer to externally measurable behaviors of the program (e.g. runtime or memory access patterns) that may be used to infer the underlying sensitive data. Note that side-channel attacks can be usually addressed with an oblivious implementation of the algorithm but at the cost of increased running time.

2.1.2. An overview of the SkSES method for identifying the top SNPs in an SGX enclave

We focus on the most general GWAS protocol introduced by SkSES [16] involving three steps: (i) Population stratification correction using PCA; (ii) Top $l$ SNP candidate identification by sketching; and (iii) Top $k$ SNP identification among the $l$ candidates. The computation process involves a server equipped with Intel SGX hardware, and the client(s) that want to perform collaborative GWAS in a privacy-preserving and secure manner. Each of the three steps requires the client(s) to encrypt and send the VCF files they store in a compressed format by only keeping the SNP IDs and their zygosity status. The server, i.e., the SGX enclave memory maintains a distinct data structure for each step, which is updated online after the server receives each compressed and encrypted VCF file from the clients. In each step, once all incoming data have been processed, the server postprocess the data structure and computes all necessary intermediate results required for the next step (or the final output). The size of these intermediate results is usually small (negligible compared to the data structures) and thus are stored independently (from the data structures) in the enclave memory. Finally, after the last step the results consisting of the top $k$ significant SNPs according to the desired association test are sent back to the clients. We describe each of the steps in more detail below.

(i). Population stratification correction using PCA.

In this step the server constructs a sketch genotype matrix $G^{'}$ which corresponds to a random projection (of the columns) of the underlying genotype matrix $G$ given by the input individuals (VCF files). Specifically, $G \in {0, 1, 2}^{m \times n}$ denotes the genotype matrix where $m$ is the number of VCF files and $n$ is the total number of SNPs included in the $m$ samples; $G^{'} = G K$ denotes the sketch matrix of size $m \times n^{'} (n^{'} ≪ n)$ which can fit in the enclave memory, where $K$ is a sparse count-sketch matrix of size $n \times n^{'}$ with exactly one random ±1 entry in each row. A compressed and encrypted VCF file encodes all non-zero entries in a particular row $g_{j} \in {0, 1, 2}^{n}$ in $G$ . The update of $G^{'}$ from the $j$ -th individual simply involves adding a vector $g_{j} \cdot K$ to the $j$ -th row of $G^{'}$ . After all encrypted VCF files are processed, the server first normalize each column of $G^{'}$ by subtracting the mean and optionally normalizing by the standard deviation. Let the normalized matrix be $X^{'}$ . Then SVD is performed on $X^{'}$ to compute the top $y$ left singular vectors $U_{y}$ of size $m \times y$ , which will be stored in the enclave memory and used in the following steps of computation. In addition, the phenotype vector $p \in {0, 1}^{m}$ (or $p_{1} \in {- 1, 1}^{m}$ ) which indicates the underlying phenotype (disease status) for each genotype vector $g_{j}$ also needs to be stored if the input VCF files are sent from the client in a random order.

(ii). Top $l$ SNP candidate identification by sketching.

In this step the server keeps $d$ independent sketches, each of size $w$ , for approximate queries of $\tilde{f} [i]$ , a linear proxy to the $χ^{2}$ statistic $r_{i}^{2}$ defined below. Specifically, let $g_{(i)}$ denote the $i$ -th column of the genotype matrix $G$ , then $\tilde{f} [i]$ is given by the dot product of $g_{(i)}$ and the phenotype vector $p_{1}$ with ±1 entries corrected by the top $y$ eigenvectors $U_{y} : \tilde{f} [i] = g_{(i)} \cdot (I - U_{y} U_{y}^{T}) p_{1}$ . Let $[z] = {1, \cdot \cdot \cdot, z}$ . For updating the sketches the server maintains (i) $d$ independent hash functions $h_{k}$ $(k \in [d])$ , where each hash function $h_{k} : [n] \to [w]$ maps a SNP $i$ to one of the $w$ buckets; and additionally (ii) $d$ sign functions ${s i g n}_{1}, \dots, {s i g n}_{d} : [n] \to {- 1, 1}$ . Let $A_{k}$ denote the array of size $w$ of hash buckets that correspond to hash function $h_{k}$ . Initially all entries in the sketches $A_{1}, \dots, A_{d}$ are set to 0. To update the sketches, the server first precomputes a corrected phenotype vector with ±1 entries ${\tilde{p}}_{1} = (I - U_{y} U_{y}^{T}) p_{1}$ , and then processes each encrypted VCF $g_{j}$ and adds ${s i g n}_{k} (i) \cdot g_{j} [i] \cdot {\tilde{p}}_{1} [j]$ to $A_{k} [h_{k} (i)]$ , for each SNP $i$ and hash function $k \in [d]$ . Once all incoming data has been processed, the client sends the entire list of $n$ SNP IDs to the server. For each SNP ID $i$ , the server queries the $d$ sketches and computes the median $\hat{f} [i] = {m e d i a n}_{k \in [d]} \{{s i g n}_{k} (i) \cdot A_{k} [h_{k} (i)]\}$ . The top $l$ SNP IDs according to the absolute value of $\hat{f} [i]$ are kept for the next step. Ideally, these $l$ SNPs include all top $k$ SNPs according to the actual $r_{i}^{2}$ defined below.

(iii). Top $k$ SNP identification among the $l$ candidates.

In this step we rank the $l$ SNPs according to the Armitage trend $χ^{2}$ statistic and return the $k$ largest SNPs. The test statistic $r_{i}^{2}$ is defined for each SNP $i$ as

r_{i}^{2} = \frac{m - y - 1}{m \sum_{j} \tilde{p} [j]^{2} - {(\sum_{j} \tilde{p} [j])}^{2}} \cdot \frac{{(m \sum_{j} {\tilde{g}}_{j} [i] \tilde{p} [j] - (\sum_{j} {\tilde{g}}_{j} [i]) (\sum_{j} \tilde{p} [j]))}^{2}}{m \sum_{j} {\tilde{g}}_{j} [i]^{2} - {(\sum_{j} {\tilde{g}}_{j} [i])}^{2}}

where ${\tilde{g}}_{j} [i]$ denotes the entries in a population stratification corrected genotype matrix $\tilde{G} = (I - U_{y} U_{y}^{T}) G; \tilde{p} [j]$ denotes the entries in a population stratification corrected phenotype vector $\tilde{p} = (I - U_{y} U_{y}^{T}) p$ . Rather than explicitly keeping the corrected genotype matrix $\tilde{G}$ which requires $O (m l)$ memory, SkSES implemented an algorithm which only maintains in the enclave $y + 3$ floating point numbers for each SNP $i$ with a hash table (so the total memory required is $O (l \cdot y)$ ), and these numbers are updated while processing the next individual $g_{j}$ . Once all $m$ individuals are processed the Armitage trend $χ^{2}$ statistic $r_{i}$ for each SNP $i$ can be computed exactly, and finally the top $k$ SNPs according to $r_{i}$ are returned to the client. We omitted the algorithmic details of how SkSES updates these numbers with $g_{j}$ as this step is not modified with “learning-augmented” sketches described in the following section.

2.2. Our learning-augmented sketching approach for privacy preserving GWAS

We offer two major improvements over the SkSES method for (i) the PCA step and (ii) the top $l$ SNP candidate identification step, by newly using learning-augmented sketches. To achieve this, we introduced a new pre-computing step, for which we assume that there is a small number of publicly available VCF files to be used as a training set; in case such a training set is not publicly available, a “small” proportion of the $m$ VCF files in the input is sampled (with a default of $m / 10$ ) and used as a training set. From the training set, the pre-computing step “learns” two sets of potentially significant SNPs in distinct phases: in the first phase, it computes $S_{1}$ for assisting with SkSES’ step (i) and in the second phase, it computes $S_{2}$ , for assisting with SkSES step (ii). Note that both phases of the pre-computing step can be executed locally and outside a secure enclave. For example, when multiple computation parties are involved in a collaborative GWAS study, the sets $S_{1}$ and $S_{2}$ can be pre-computed in advance and locally by one of the computation parties, using the training set. Next, we modified steps (i) and (ii) of the SkSES’ secure GWAS protocol so that the sets $S_{1}$ and $S_{2}$ of potentially significant SNPs are sent to the server, along with all input data (i.e. preprocessed and compressed VCF files), for PCA and top $l$ SNP candidate identification by sketching.

In the remainder of the paper we denote by $m_{0}$ be the number of the VCF files in the training set and by $n_{0}$ the total number of SNP records available in each one of the VCF files in the training set.

(i). Improved PCA with learned SNPs.

The first phase of our new pre-computing step identifies the set of most significant SNPs $S_{1}$ on the test set, i.e., genotype matrix $G_{0} \in {0, 1, 2}^{m_{0} \times n_{0}}$ . For that, we follow the same normalization process as SkSES to compute a mean-centered and standard deviation corrected genotype matrix $X_{0}$ from $G_{0}$ . Then we compute the first eigenvector $u \in R^{m_{0}}$ of the covariance matrix $X_{0} X_{0}^{T}$ exactly. We pick the top $n^{'}$ SNPs as $S_{1}$ , according to their cosine similarities betwen each column vector in $G_{0}$ and the top eigenvector $u$ (recall that each SNP corresponds to a column vector in $G_{0}$ ).

Following the pre-computing step, the set $S_{1}$ of $n^{'}$ SNP IDs is encrypted and sent to the server. Our newly modified step (i) of SkSES, now computes the top eigenvectors from the entire $m$ samples within the secure enclave. For that, all participating parties send their encrypted VCF files to the server, which filters out all SNPs which are not present in $S_{1}$ , and performs SVD on the resulting genotype (sub)matrix $G$ of dimension $m \times n^{'}$ . Mean centering and normalization by row standard deviations before SVD (to compute $X$ from $G$ ) are now done explicitly and in place.

We note that $n^{'}$ by itself is not sensitive (in fact it is part of the GWAS protocol), and as such, we do not attempt to hide the message length of the encrypted SNP IDs sending to the server.

(ii). Improved top $l$ SNP candidate identification with unique buckets.

In the second phase of our new pre-computing step, we identify the set of SNPs $S_{2}$ by simply picking up $n^{''}$ SNPs with largest $| \tilde{f} [i] |$ on the training set of $m_{0}$ individuals among the $n_{0}$ SNPs. $S_{2}$ is then sent to the server just like $S_{1}$ as described above.

Our newly modified step (ii) of SkSES works as follows. First, the size of $S_{2}, n^{''}$ is determined by the desired size of sketches to be maintained in the enclave memory, namely the product of $w$ , the number of buckets in each sketch and $d$ , the number of independent sketches: we split the $w \cdot d$ buckets originally assigned for sketches into two parts. The first $α \cdot w d$ buckets are allocated to maintain the SNP IDs as well as $\tilde{f} [i]$ for each SNP $i$ (this time, computed from the entire set of $m$ individuals), through a hash table so that $\tilde{f} [i]$ can be queried exactly. As such $n^{''}$ is given by $\frac{α \cdot w d}{2}$ ¹. We refer to these buckets as “unique buckets” in the following and denote them as $B [i] (i \in [n^{''}])$ . The remaining $(1 - α) \cdot w d$ buckets are assigned to the sketches $A_{1}, \dots, A_{d}$ , each of size reduced to $(1 - α) \cdot w$ . By default we set $α = 0.5$ , meaning half of the buckets originally used for maintaining sketches in the enclave memory are assigned as unique buckets $B [i]$ . To process an encrypted VCF file $g_{j}$ that includes a SNP $i$ , the server first checks whether the SNP ID $i$ is maintained in the unique buckets. If yes, then we add $g_{j} [i] {\tilde{p}}_{1} [j]$ to the unique bucket $B [i]$ , otherwise update the sketch entries ${s i g n}_{k} (i) \cdot g_{j} [i] \cdot {\tilde{p}}_{j}$ to $A_{k} [h_{k} (i)]$ for each $k \in [d]$ . After all incoming VCF files has been processed, the client again sends the entire list of $n$ SNP IDs to the server. The server modified its querying process for SNP $i$ by also first checking whether $i$ lies in the unique buckets. If yes, then the query returns the value currently stored in $B [i]$ as the exact $\tilde{f} [i]$ ; otherwise sketch entries are queried and an approximate $\hat{f} [i] = {m e d i a n}_{k \in [d]} \{{s i g n}_{k} (i) \cdot A_{k} [h_{k} (i)]\}$ is returned. Still the $l$ largest SNPs according to the absolute value of $\tilde{f} [i]$ (or $\hat{f} [i]$ ) are stored as candidates for the top $k$ SNP identification in the next step.

Comparison with existing work on learning augmented algorithms.

There are a number of recent works in using machine learning to augment the performance of classical algorithms [31, 32, 33, 34, 35, 36]. Unlike the most relevant Learning-enhanced Frequency Estimation Algorithms [32], where the IP addresses carry meaningful features, we cannot extract meaningful features from the SNP IDs (rsid) alone. We do not have enough memory to store auxiliary data inside the enclave, and querying an external database would leak information. So we resort to memorization rather than generalization, i.e. we learn explicit sets of SNPs rather than an oracle to classify the SNPs.

3. Results

3.1. Dataset

We evaluated our approach on a dataset from the UK Biobank consisting of 7,550 GVCF files, with 3,775 from type 2 diabetes patients² (cases), and the other 3,775 from non-patients (controls). These GVCF files include in total 9.54×10⁵ distinct SNPs, and we performed imputation on each individual with the 1000 Genomes Phase 3 [37] reference panel which includes 7.82 × 10⁷ distinct SNPs in the human genome. We sampled random subsets of 100, 200, 400, 800, 1600 and 3200 imputed GVCF files to use them as training sets for our learning augmented approach, and then applied it to test sets composed of, again randomly sampled 1,000, 2,000, and 4,000 imputed GVCF files (the training and test sets did not overlap and they were both composed 50% cases and 50% controls) to evaluate the performance of our approach.

Note that each of the original 7,550 GVCF files from the UK Biobank contains ~80k variant records. Before using them as inputs to our approach we processed each GVCF file by first phasing its variants using SHAPEIT4 [38] and then imputed the missing allele specific variants using Minimac [39]. After imputation, each resulting VCF ended up containing genotype information of 7.82 × 10⁷ variants as mentioned above. Interestingly only 2.35 × 10⁷ of the variants were present in at least one of the 7,550 individuals, so the remainder of the variants were removed from the VCFs. Therefore, all training and test sets we used were comprised of 2.35 × 10⁷ (n ≤ 2.35 × 10⁷) variant records. Please see the supplementary materials for further details about how we prepared our datasets.

3.2. Tested methods and parameters

We tested our learning-augmented sketching method against the SkSES baseline on the datasets described below. We report the results from our main fully-learned approach, where we utilize both sets of learned SNPs $S_{1}$ (for PCA) and $S_{2}$ (for unique buckets), as well as three different approaches that each uses only part of the full method to study their individual effect: the learned PCA approach where we only utilize $S_{1}$ , the learned unique approach where we only utilize $S_{2}$ , and the no sketch approach, where both $S_{1}$ and $S_{2}$ are used but the sketch is completely removed in favor of the unique buckets.

For all four learning-augmented sketching methods, we ensured that the training set was evenly split between cases and controls just like the full dataset it was sampled from. The original SkSES method does not use a training set. We set the size of the set $S_{1}$ of learned SNPs (for PCA) to be $n^{'} = 4,096$ , consistent with the sketch matrix dimension used in the original SkSES. We used two principal components for population stratification correction ( $y = 2$ ), the default setting of SkSES. We set the number of candidates $l$ to be 2²⁰. The size of training set $m_{0}$ , the size of sketches $w$ , and the number of unique buckets $n^{''}$ vary between different experiment settings.

3.3. Memory and time consumption

We first compared the runtime and memory consumption of the learning-augmented sketching approach (fully-learned) with that of SkSES. We ensured that the fully-learned method uses the same amount of memory as the original SkSES algorithm. Numeric values are stored as 32-bit floating numbers, occupying 4B of memory each. The memory requirements of the three steps described in Section 2.1.2 are as follows. (i) PCA requires $m \cdot n^{'} \cdot 4 B = 62.5 M B$ memory. (ii) Sketching memory requirements depend on other parameters. For SkSES, we chose the parameters $d = 8$ and $w = 2^{21}$ for the sketching arrays, occupying 64MB memory in total. For our learning-augmented sketching method, we used sketching arrays with parameters $d = 8$ and $w = 2^{20}$ (32MB) together with $n^{''} = 2^{22}$ unique buckets (32MB). (iii) Computing the $χ^{2}$ values accurately for the $l$ candidates requires $l (3 + y) \cdot 4 B = 20 M B$ .

Note that, while transitioning from step (ii) to (iii), and before the sketching arrays and unique buckets are cleared, each approach (including SkSES) requires and additional $2 l \cdot 4 B = 8 M B$ memory. Therefore, during all our tests the memory used by each method at any point of execution was at most 72MB, ignoring a small overhead. This means that the parameters $d, w, n^{''}$ , and $l$ could be further increased, but how we should increase them for optimal performance while remaining within 128MB or a higher upper bound would be the subject of future work.

In our tests, our learning augmented method takes 1,978 seconds to complete step (i), 4,419 seconds to complete step (ii) (which are significantly faster than the original SkSES algorithm), and 1,323 seconds to complete step (iii) (see Table 1). These numbers do not include the the time for training and preprocessing (including compression and encryption) the VCF files, which can be done locally by each participating party individually with low cost. The tests were run on a Linux server equipped with an SGX-capable Intel Xeon E3–1280 v5 processor with 4 cores at 3.70 GHz.

Table 1:

Running time performance of learning-augmented sketching method (fully-learned) against SkSES, with identical enclave memory allocated to core data structures. Mean running time ± standard deviation is reported from the results of five runs. In step (i) popstrat correction using PCA, we set $n^{'}$ = 4096 for both methods. In Step (ii) top l SNP candidate identification by sketching, we set d = 8 and w = 2²¹ for the sketch in the SkSES method; d = 8 and w = 2²⁰ for the sketch together with $n^{''}$ = 2²² unique buckets for learning-augmented sketching method, so that both methods use 64MB memory. In step (iii) top-k SNP identification among l we set l = 2²⁰ for both methods. In our implementation of the fully-learned method, we replaced the Robin Hood hash table (used in step (iii)) in the original SkSES by a plain array to fix a bug, which results in a slowdown. If the fix is directly applied to the original SkSES, step (iii) slows down to 1255±13s.

Step/Method	SkSES	fully-learned
(i) Popstrat correction using PCA	3846± 16s	1610± 10s
(ii) Top-l SNP candidate identification by sketching	8767±100s	4294±222s
(iii) Top-k SNP identification among l candidates	32± 0s	1240± 11s

Open in a new tab

We additionally tested the running time performance with varying sketch width $w$ and depth $d$ , and concluded that the time consumed by step (ii) appears to be determined mostly by $w$ ; in fact starting from $w = 2^{17}$ the time roughly doubles as $w$ doubles, even though the amount of read/writes remain the same. Decreasing $w$ by increasing $d$ or introducing unique buckets is therefore beneficial for reducing running times.

3.4. Accuracy of sketches

We performed three sets of experiments to assess our learning augmented sketching approach. In the first set of experiments we assessed the impact of the training set size, in comparison to the test set size. In the second set of experiments we assessed the impact of the size of the sketches and the number of unique buckets. In the third and final set of experiments we assessed how our learning augmented sketching approach compares against the baseline approaches and how individual components of the full approach contribute to its performance for various test set sizes.

The metric we used to evaluate performance of the algorithms is the true positive rate, defined to be the percentage of the top $k$ SNPs identified that are actually among the top $k$ , according to the Cochran–Armitage trend $χ^{2}$ statistic. We report the percentage for both $k = 100$ (with Cochran–Armitage $r^{2} = 19.61 / 20.86 / 23.43$ from 1000/2000/4000 individuals) and $k = 1,000$ (with Cochran–Armitage $r^{2} = 15.15 / 14.77 / 17.91$ from 1000/2000/4000 individuals).

3.4.1. Impact of training set size

In the first set of experiments we assessed the impact of the training set size on accuracy, given the test set size. We used the following settings in these experiments (which, as per Table 3, provide the best results): the number of unique buckets are 2²², the sketch depth is 8 and the sketch width is 2²⁰. Note that we used training sets disjoint from the tests sets.³ As demonstrated in Table 2, the impact of training set size on the accuracy is minimal, as long as the training set size is at least 200. In fact, if the whole test set is used as the training set, we do not get much (or any) improvement: for the 1,000-individual test set we get 1.000/0.899; for the 2,000-individual test set we get 0.990/0.824; for the 4,000-individual test set we get 0.940/0.788.

Table 3:

Accuracy of the fully-learned approach under various configurations for the sketch width (rows) and the number of unique buckets (columns) on a test set with 4000 individuals trained on an independent, randomly selected set with 400 individuals (sketch depth: 8). For each number of unique buckets (column) and each sketch width (row), we give the accuracy values for both the top-100 (left) and top-1000 (right) SNPs. The results in the rightmost column are those with no (learned) unique buckets but with learned PCA. The same amount of memory is used for the unique buckets as the sketch for entries along the main diagonal (except for the rightmost column). The entry marked with

Sketch width	Number of unique buckets ( $n^{''}$ )
(w)	2²³	2²²	2²¹	2²⁰	0

2²¹	0.940/0.791	0.950/0.784	0.790/0.753	0.730/0.697	0.590/0.653

2²⁰	0.940/0.788	0.900/0.770^*	0.650/0.672	0.600/0.635	0.470/0.555

2¹⁹	0.940/0.785	0.570/0.668	0.540/0.611	0.500/0.543	0.400/0.440

2¹⁸	0.920/0.769	0.490/0.580	0.440/0.465	0.350/0.440	0.240/0.318

2¹⁷	0.440/0.602	0.410/0.484	0.260/0.349	0.180/0.288	0.120/0.205

Open in a new tab

is the configuration used in the (Figures of the) remainder of the paper; it is the first diagonal entry that fits within 128MB SGX enclave memory.

Table 2:

Accuracy of the fully-learned method (with 2²² unique buckets and a sketch of depth 8 and width 2²⁰) as a function of training set size. Note that the training sets were sampled from 3550 individuals disjoint from test sets with 1000/2000/4000 individuals. With each training set (column) and each test set (row) size, we give the accuracy values for both the top-100 (left) and top-1000 (right) SNPs. The three entries marked withs

Size of test set	Size of training set
Size of test set	100	200	400	800	1600	3200

1000	0.990/0.768^*	1.000/0.913	1.000/0.902	1.000/0.898	0.990/0.895	0.990/0.883

2000	1.000/0.772	0.990/0.806^*	0.990/0.820	0.990/0.778	0.990/0.796	0.990/0.817

4000	0.910/0.671	0.920/0.782	0.860/0.755^*	0.750/0.771	0.940/0.783	0.910/0.777

Open in a new tab

are the configurations used in the (Figures in the) remainder of the paper.

3.4.2. Impact of sketch size and unique bucket count

In the second set of experiments we assessed the impact of the sketch size vs the number of unique buckets on accuracy (training set size: 400, test set size: 4,000). As demonstrated in Table 3, for a given number of unique buckets, there is a sharp drop in accuracy as the sketch width decreases, probably because when the sketch is small, the values accumulated in the sketch and hence the queried values are systematically larger (in absolute value) than the values in the unique buckets, leading to few or none of the SNPs in unique buckets being selected as the top $l$ candidates. We leave a more comprehensive study of methods to balance the sketch size and the number of unique buckets to future work.

3.4.3. Comparison with baseline and ablation study

In the next set of experiments, we compared the accuracies of the four learning-augmented approaches with the SkSES baseline. Results from five runs of each method are shown as box plots in Figures 1, 2 and and 3, except for the “no sketch” approach, which is only run once because no randomness is involved. All other methods use random hash functions for the sketch, and the “learned unique” and SkSES methods also use a different random seed for each run to generate the sketch matrix $K$ for PCA. The training sets stay the same across five runs.

Figure 1: — Accuracy comparison of the five approaches on a dataset of 1,000 individuals, using a training set of 100 individuals. The percentages of the actual top 100 and top 1,000 SNPs identified are shown. Both the “learned unique” and SkSES methods use a randomly signed sketch matrix $K$ generated by a random seed, and we observe that some random seeds lead to exceptionally good results (for example the outliers near 0.8 in with $k = 1,000$ for both methods); these seeds appear roughly 20% of the time when randomly selected.

Figure 2: — Accuracy results on a dataset of 2,000 individuals using a training set of 200 individuals. The large variations in accuracy of the “learned unique” and SkSES approaches are due to the use of a random seed to generate the sketch matrix $K$ .

Figure 3: — Accuracy results on a dataset of 4,000 individuals using a training set of 400 individuals. The large variations in accuracy of the “learned unique” and SkSES approaches are due to the use of a random seed to generate the sketch matrix $K$ . Notice that the “fully learned” method is not superior to “no sketch” under memory parity ( $n^{''} \times 2$ ) here (while it is superior on the 1,000- and 2,000-individual test sets), but the usefulness of sketching can still be seen when compared with “no sketch” ( $n^{''} \times 1$ ).

For all four learning-augmented methods, the training set size is fixed to be 10% of the test set size, which can be compared against entries marked with * in Table 2. We ensure all methods use the same amount of memory in step (ii) in order to compare them fairly: the sketch depth and width for the main approach (fully-learned) are respectively fixed at 8 and 2²⁰, and the unique bucket count is fixed at 2²², which is marked with * in Table 3. For the other approaches, if sketching is not used (“no sketch”) then the unique bucket count $n^{''}$ is doubled (fixed at 2²³), while if unique buckets are not used (“learned PCA” and SkSES) then the sketch width is doubled (fixed at 2²¹). Since we also report the performance of the “no sketch” approach without doubling the unique bucket count in the same figures - to assess the accuracy gains that can be attributed to sketching, we distinguish the two setups by their labels: “no sketch ( $n^{''} \times 2$ )” v.s. “no sketch ( $n^{''} \times 1$ ) respectively.

Under the same memory constraint, our fully-learned approach scored a 40% absolute improvement (from ~ 50% to > 90%) of the percentage on the 4,000-individual test set when $k = 100$ , and a 20% absolute improvement of the percentage when $k = 1,000$ , compared to the original SkSES algorithm (our baseline); see Figure 3. The improvement on the 2,000- and 1,000-individual test sets are less pronounced but still clearly visible, as shown in Figure 1 and Figure 2.

It is crucial to utilize the small set $S_{1}$ of significant SNPs in the PCA step to create the sketch matrix $K$ for PCA, which leads to major improvement and stabilization of performance. If we used a randomly signed sketch matrix as in the original SkSES (the “learned unique” approach), the performance is very unstable: for about 10% of such randomly chosen matrices, we get performance matching the performance of “fully learned” approach, but most of the time we get very poor results. It is still an open question why using a subset of SNPs that is order-of-magnitudes smaller than $n$ yields more stable eigenvector results than using the whole 23 million SNPs in a sketched form, and we leave this to future work.

3.5. Using a non-representative training set

To show robustness of our method, we tested its performance using training sets that are not representative of the whole test set. Specifically, we selected 400-individual subsets “of different ethnicities” from the 4000-individual test set in the following way: for $k = 0, 1, 2, 3, 4$ , we take the $k$ -th eigenvector $u_{k}$ (where $k = 0$ corresponds to the top eigenvalue) of the covariance matrix $X^{'} X^{' T}$ of the normalized genotype matrix $X^{'}$ , and pick the 400 individuals $j$ with (i) the largest, and then (ii) the smallest ${(u_{k})}_{j}$ , and use them to form the training set.⁴ As shown in the table below, the results are similar to those using a disjoint (§3.4.1) or sampled (§3.4.3) training set. For the 400 individuals with the smallest ${(u_{k})}_{j}$ , the result is quite poor for $k = 0$ (corresponding to the top eigenvector, presumably representing the most common ethnic contribution to this cohort), but remains good for $k = 1, 2, 3, 4$ .⁵

Supplementary Material

Supplement 1

NIHPP2024.09.19.613975v1-supplement-1.pdf^{(220KB, pdf)}

Table 4:

Accuracy of the fully-learned approach using a non-representative training set. For each k, the training set consists of the individuals j in the cohort, with the largest (middle row) or smallest (bottom row) 400 entries (u_k)_j in the k^th eigenvector. For a given k, the accuracy of the fully-learned approach for the top-100/top-1000 SNPs are given.

k	0	1	2	3	4
largest	0.940/0.788	0.910/0.760	0.940/0.787	0.920/0.790	0.940/0.787
smallest	0.150/0.102	0.910/0.771	0.940/0.787	0.880/0.792	0.920/0.788

Open in a new tab

4. Acknowledgements

We thank the Department of Computer Science at Indiana University, especially Haixu Tang and Xiaofeng Wang, for providing remote access to SGX-enabled machines on which we run the experiments. We also thank Hongbo Chen and Rob Henderson for providing technical assistance.

This research was supported in part by the Intramural Research Program of the Center for Cancer Research, National Cancer Institute, NIH.

Footnotes

⁵

Declaration of interests

The authors have no competing interests related to this work to declare.

we ignore the cases where $n^{''} \geq n_{0}$ as the assumption is the total number of SNPs is significantly larger than the size of the sketch, and the $m_{0}$ samples should include sufficiently large number of SNPs to be tested

the entire set of type 2 diabetes patients with relevant information in the UK Biobank

There is some minimal impact if the training sets were sampled from the test sets: a comparison of the three cells marked with * with the corresponding Figures reveal that only the 1,000 individual case has an increase in accuracy when the training set is indeed sampled from the test set.

⁴

Since $u_{k}$ is an eigenvector of the real symmetric matrix $X X^{T}$ , we have $X X^{T} u_{k} = λ_{k} u_{k}$ for some nonnegative number $λ_{k}$ , so ${(u_{k})}_{j}$ is proportional to the dot product of $j$ th individual’s correlation vector with $u_{k}$ .

⁵

Since the dataset is from the UK Biobank, it is possible that there is a dominant ethnicity in the cohort and the selection of a training subcohort with the lowest contribution from this dominant ethnicity significantly impacted these results, especially because the SNP set $S_{1}$ , which we use for performing PCA, consists of those SNPs with the highest cosine similarity with the top eigenvector.

References

[1].McGuire A. L. et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genetics in Medicine 10, 495–499 (2008). [DOI] [PubMed] [Google Scholar]
[2].Bloss C. S. Does family always matter? public genomes and their effect on relatives. Genome Medicine 5, 107 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
[3].Shringarpure S. S. & Bustamante C. D. Privacy risks from genomic data-sharing beacons. American Journal of Human Genetics 97, 631–646 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[4].Ayday E., Raisaro J. L., Hengartner U., Molyneaux A. & Hubaux J.-P. Privacy-preserving processing of raw genomic data. In Data Privacy Management and Autonomous Spontaneous Security, 133–147 (Springer, 2014). [Google Scholar]
[5].He D. et al. Identifying genetic relatives without compromising privacy. Genome Research 24, 664–672 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Kamm L., Bogdanov D., Laur S. & Vilo J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].McLaren P. J. et al. Privacy-preserving genomic testing in the clinic: a model using hiv treatment. Genetics in Medicine 18, 814–822 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Shimizu K., Nuida K. & Rätsch G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics 32, 1652–1661 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
[9].Xie W. et al. Securema: protecting participant privacy in genetic association meta-analysis. Bioinformatics 30, 3334–3341 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Zhao Y., Wang X., Jiang X., Ohno-Machado L. & Tang H. Choosing blindly but wisely: differentially private solicitation of dna datasets for disease marker discovery. Journal of the American Medical Informatics Association 22, 100–108 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[11].Shahbazi A., Bayatbabolghani F. & Blanton M. Private computation with genomic data for genome-wide association and linkage studies. In Proc. 3rd International Workshop Genome Privacy Security (2016). URL https://www.acsu.buffalo.edu/~mblanton/publications/genopri16.pdf. [Google Scholar]
[12].Chen F. et al. Premix: privacy-preserving estimation of individual admixture. AMIA Annual Symposium Proceedings 2016, 1747–1755 (2016). [PMC free article] [PubMed] [Google Scholar]
[13].Lauter K., López-Alt A. & Naehrig M. Private computation on encrypted genomic data. In Aranha D. F. & Menezes A. (eds.) International Conference on Cryptology and Information Security in Latin America, 3–27 (Springer, 2014). [Google Scholar]
[14].Wang S. et al. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in gwas. Bioinformatics 32, 211–218 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[15].Zhang Y., Blanton M. & Almashaqbeh G. Secure distributed genome analysis for gwas & sequence comparison computation. BMC medical informatics and decision making 15 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[16].Kockan C. et al. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nature Methods 17, 295–301 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[17].Anati I., Gueron S., Johnson S. P. & Scarlata V. R. Innovative technology for cpu based attestation and sealing (2013). URL https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing. [Google Scholar]
[18].Wang X. S., Chan T.-H. H. & Shi E. Circuit oram: on tightness of the goldreich-ostrovsky lower bound. In Ray I., Li N. & Kruegel C. (eds.) Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 850–861 (ACM, 2015). [Google Scholar]
[19].Halevi S. & Shoup V. Algorithms in helib. In Garay J. A. & Gennaro R. (eds.) International Cryptology Conference, 554–571 (Springer, 2014). [Google Scholar]
[20].Yang J., Zaitlen N. A., Goddard M. E., Visscher P. M. & Price A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics 46, 100 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[21].Price A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics 38, 904–909 (2006). [DOI] [PubMed] [Google Scholar]
[22].Chen F. et al. Presage: Privacy-preserving genetic testing via software guard extension. BMC medical genomics 10, 77–85 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[23].Chen F. et al. Princess: Privacy-protecting rare disease international network collaboration via encryption through software guard extensions. Bioinformatics 33, 871–878 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[24].Mandal A., Mitchell J. C., Montgomery H. & Roy A. Data oblivious genome variants search on intel sgx. In International Workshop on Data Privacy Management, 296–310 (Springer, 2018). [Google Scholar]
[25].Carpov S. & Tortech T. Secure top most significant genome variants search: idash 2017 competition. BMC medical genomics 11, 47–55 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[26].Lambert C., Fernandes M., Decouchant J. & Esteves-Verissimo P. Maskal: Privacy preserving masked reads alignment using intel sgx . In 2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS), 113–122 (IEEE, 2018). [Google Scholar]
[27].Sadat M. N., Jiang X., Al Aziz M. M., Wang S. & Mohammed N. Secure and efficient regression analysis using a hybrid cryptographic framework: Development and evaluation. JMIR medical informatics 6, e8286 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[28].Mainardi N., Sampietro D., Barenghi A. & Pelosi G. Efficient oblivious substring search via architectural support. In Annual Computer Security Applications Conference, 526–541 (2020). [Google Scholar]
[29].Dokmai N. et al. Privacy-preserving genotype imputation in a trusted execution environment. Cell Systems 12, 983–993.e7 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[30].Widanage C. et al. Hysec-flow: privacy-preserving genomic computing with sgx-based big-data analytics framework. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 733–743 (IEEE, 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[31].Dong Y., Indyk P., Razenshteyn I. & Wagner T. Learning space partitions for nearest neighbor search. arXiv preprint arXiv:1901.08544 (2019). [Google Scholar]
[32].Hsu C.-Y., Indyk P., Katabi D. & Vakilian A. Learning-based frequency estimation algorithms. In International Conference on Learning Representations (2019). [Google Scholar]
[33].Indyk P., Vakilian A. & Yuan Y. Learning-based low-rank approximations. Advances in Neural Information Processing Systems 32 (2019). [Google Scholar]
[34].Eden T. et al. Learning-based support estimation in sublinear time. In International Conference on Learning Representations (2020). [Google Scholar]
[35].Ergun J. C., Feng Z., Silwal S., Woodruff D. & Zhou S. Learning-augmented k-means clustering. In International Conference on Learning Representations (2021). [Google Scholar]
[36].Grigorescu E., Lin Y.-S., Silwal S., Song M. & Zhou S. Learning-augmented algorithms for online linear and semidefinite programming. Advances in Neural Information Processing Systems 35, 38643–38654 (2022). [Google Scholar]
[37].Consortium G. P. et al. A global reference for human genetic variation. Nature 526, 68 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[38].Delaneau O., Zagury J.-F., Robinson M. R., Marchini J. L. & Dermitzakis E. T. Accurate, scalable and integrative haplotype estimation. Nature communications 10, 5436 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
[39].Das S. et al. Next-generation genotype imputation service and methods. Nature genetics 48, 1284–1287 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
[40].Danecek P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–2158 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

NIHPP2024.09.19.613975v1-supplement-1.pdf^{(220KB, pdf)}

[R1] [1].McGuire A. L. et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genetics in Medicine 10, 495–499 (2008). [DOI] [PubMed] [Google Scholar]

[R2] [2].Bloss C. S. Does family always matter? public genomes and their effect on relatives. Genome Medicine 5, 107 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Shringarpure S. S. & Bustamante C. D. Privacy risks from genomic data-sharing beacons. American Journal of Human Genetics 97, 631–646 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Ayday E., Raisaro J. L., Hengartner U., Molyneaux A. & Hubaux J.-P. Privacy-preserving processing of raw genomic data. In Data Privacy Management and Autonomous Spontaneous Security, 133–147 (Springer, 2014). [Google Scholar]

[R5] [5].He D. et al. Identifying genetic relatives without compromising privacy. Genome Research 24, 664–672 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Kamm L., Bogdanov D., Laur S. & Vilo J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].McLaren P. J. et al. Privacy-preserving genomic testing in the clinic: a model using hiv treatment. Genetics in Medicine 18, 814–822 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Shimizu K., Nuida K. & Rätsch G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics 32, 1652–1661 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] [9].Xie W. et al. Securema: protecting participant privacy in genetic association meta-analysis. Bioinformatics 30, 3334–3341 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] [10].Zhao Y., Wang X., Jiang X., Ohno-Machado L. & Tang H. Choosing blindly but wisely: differentially private solicitation of dna datasets for disease marker discovery. Journal of the American Medical Informatics Association 22, 100–108 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] [11].Shahbazi A., Bayatbabolghani F. & Blanton M. Private computation with genomic data for genome-wide association and linkage studies. In Proc. 3rd International Workshop Genome Privacy Security (2016). URL https://www.acsu.buffalo.edu/~mblanton/publications/genopri16.pdf. [Google Scholar]

[R12] [12].Chen F. et al. Premix: privacy-preserving estimation of individual admixture. AMIA Annual Symposium Proceedings 2016, 1747–1755 (2016). [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Lauter K., López-Alt A. & Naehrig M. Private computation on encrypted genomic data. In Aranha D. F. & Menezes A. (eds.) International Conference on Cryptology and Information Security in Latin America, 3–27 (Springer, 2014). [Google Scholar]

[R14] [14].Wang S. et al. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in gwas. Bioinformatics 32, 211–218 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] [15].Zhang Y., Blanton M. & Almashaqbeh G. Secure distributed genome analysis for gwas & sequence comparison computation. BMC medical informatics and decision making 15 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Kockan C. et al. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nature Methods 17, 295–301 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] [17].Anati I., Gueron S., Johnson S. P. & Scarlata V. R. Innovative technology for cpu based attestation and sealing (2013). URL https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing. [Google Scholar]

[R18] [18].Wang X. S., Chan T.-H. H. & Shi E. Circuit oram: on tightness of the goldreich-ostrovsky lower bound. In Ray I., Li N. & Kruegel C. (eds.) Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 850–861 (ACM, 2015). [Google Scholar]

[R19] [19].Halevi S. & Shoup V. Algorithms in helib. In Garay J. A. & Gennaro R. (eds.) International Cryptology Conference, 554–571 (Springer, 2014). [Google Scholar]

[R20] [20].Yang J., Zaitlen N. A., Goddard M. E., Visscher P. M. & Price A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics 46, 100 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].Price A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics 38, 904–909 (2006). [DOI] [PubMed] [Google Scholar]

[R22] [22].Chen F. et al. Presage: Privacy-preserving genetic testing via software guard extension. BMC medical genomics 10, 77–85 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] [23].Chen F. et al. Princess: Privacy-protecting rare disease international network collaboration via encryption through software guard extensions. Bioinformatics 33, 871–878 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] [24].Mandal A., Mitchell J. C., Montgomery H. & Roy A. Data oblivious genome variants search on intel sgx. In International Workshop on Data Privacy Management, 296–310 (Springer, 2018). [Google Scholar]

[R25] [25].Carpov S. & Tortech T. Secure top most significant genome variants search: idash 2017 competition. BMC medical genomics 11, 47–55 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] [26].Lambert C., Fernandes M., Decouchant J. & Esteves-Verissimo P. Maskal: Privacy preserving masked reads alignment using intel sgx . In 2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS), 113–122 (IEEE, 2018). [Google Scholar]

[R27] [27].Sadat M. N., Jiang X., Al Aziz M. M., Wang S. & Mohammed N. Secure and efficient regression analysis using a hybrid cryptographic framework: Development and evaluation. JMIR medical informatics 6, e8286 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] [28].Mainardi N., Sampietro D., Barenghi A. & Pelosi G. Efficient oblivious substring search via architectural support. In Annual Computer Security Applications Conference, 526–541 (2020). [Google Scholar]

[R29] [29].Dokmai N. et al. Privacy-preserving genotype imputation in a trusted execution environment. Cell Systems 12, 983–993.e7 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] [30].Widanage C. et al. Hysec-flow: privacy-preserving genomic computing with sgx-based big-data analytics framework. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 733–743 (IEEE, 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] [31].Dong Y., Indyk P., Razenshteyn I. & Wagner T. Learning space partitions for nearest neighbor search. arXiv preprint arXiv:1901.08544 (2019). [Google Scholar]

[R32] [32].Hsu C.-Y., Indyk P., Katabi D. & Vakilian A. Learning-based frequency estimation algorithms. In International Conference on Learning Representations (2019). [Google Scholar]

[R33] [33].Indyk P., Vakilian A. & Yuan Y. Learning-based low-rank approximations. Advances in Neural Information Processing Systems 32 (2019). [Google Scholar]

[R34] [34].Eden T. et al. Learning-based support estimation in sublinear time. In International Conference on Learning Representations (2020). [Google Scholar]

[R35] [35].Ergun J. C., Feng Z., Silwal S., Woodruff D. & Zhou S. Learning-augmented k-means clustering. In International Conference on Learning Representations (2021). [Google Scholar]

[R36] [36].Grigorescu E., Lin Y.-S., Silwal S., Song M. & Zhou S. Learning-augmented algorithms for online linear and semidefinite programming. Advances in Neural Information Processing Systems 35, 38643–38654 (2022). [Google Scholar]

[R37] [37].Consortium G. P. et al. A global reference for human genetic variation. Nature 526, 68 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] [38].Delaneau O., Zagury J.-F., Robinson M. R., Marchini J. L. & Dermitzakis E. T. Accurate, scalable and integrative haplotype estimation. Nature communications 10, 5436 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R39] [39].Das S. et al. Next-generation genotype imputation service and methods. Nature genetics 48, 1284–1287 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] [40].Danecek P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–2158 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

Learning-Augmented Sketching Offers Improved Performance for Privacy Preserving and Secure GWAS

Junyan Xu

Kaiyuan Zhu

Jieling Cai

Can Kockan

Natnatee Dokmai

Hyunghoon Cho

David P Woodruff

S Cenk Sahinalp

Abstract

1. Introduction

2. Methods

2.1. Preliminaries

2.1.1. The SGX platform in a nutshell

2.1.2. An overview of the SkSES method for identifying the top SNPs in an SGX enclave

(i). Population stratification correction using PCA.

(ii). Top l SNP candidate identification by sketching.

(iii). Top k SNP identification among the l candidates.

2.2. Our learning-augmented sketching approach for privacy preserving GWAS

(i). Improved PCA with learned SNPs.

(ii). Improved top l SNP candidate identification with unique buckets.

Comparison with existing work on learning augmented algorithms.

3. Results

3.1. Dataset

3.2. Tested methods and parameters

3.3. Memory and time consumption

Table 1:

3.4. Accuracy of sketches

3.4.1. Impact of training set size

Table 3:

Table 2:

3.4.2. Impact of sketch size and unique bucket count

3.4.3. Comparison with baseline and ablation study

Figure 1:

Figure 2:

Figure 3:

3.5. Using a non-representative training set

Supplementary Material

Table 4:

4. Acknowledgements

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

(ii). Top $l$ SNP candidate identification by sketching.

(iii). Top $k$ SNP identification among the $l$ candidates.

(ii). Improved top $l$ SNP candidate identification with unique buckets.