Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

Kerem Ayoz; Erman Ayday; A Ercument Cicek

doi:10.2478/popets-2021-0036

Author manuscript; available in PMC: 2021 Nov 5.

Published in final edited form as: Proc Priv Enhanc Technol. 2021 Apr 26;2021(3):28–48. doi: 10.2478/popets-2021-0036

PMCID: PMC8570374 NIHMSID: NIHMS1705341 PMID: 34746296

Abstract

Sharing genome data in a privacy-preserving way remains a major bottleneck for the scientific progress promised by the big data era in genomics. A community-driven protocol called the genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy-to-implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome and clustering techniques to run such an attack in an efficient and accurate way. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim’s genome with high confidence using traits that are easily accessible to the attacker (e.g., eye color or hair type). Moreover, we show how a genome reconstructed using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks against beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and help them (along with the beacon participants) make informed decisions.

Keywords: Privacy, Genome Reconstruction Attack, Genomic Data-Sharing Beacons, Genomics

1. Introduction

With plummeting sequencing costs, we look forward to reaching a capacity of sequencing one billion individuals over the next 15–20 years, resulting in the availability of very large genomic datasets [20, 49, 64]. Although such large datasets promise a revolution in medicine, numerous studies have shown that it is not straightforward to ensure the anonymity of the participants in such datasets [19, 36, 42, 63, 71].

The human genome is the ultimate personal identifier, and sharing genomic data for research while preserving the privacy of individuals has long challenged many different fields (e.g., medicine, bioinformatics, computer science, law, and ethics), due to possibly dire ethical, monetary, and legal consequences. To address this challenge and create frameworks and standards to enable the responsible, voluntary, and secure sharing of genomic data, the Global Alliance for Genomics and Health (GA4GH) was formed by the community [1]. The current genomic data-sharing standard of the GA4GH is the genomic data-sharing beacon. Beacons are the gateways that let users (researchers) and data owners exchange information without, in theory, disclosing any personal information. A user who wants to apply for access to a dataset can learn whether individuals with specific alleles (nucleotides) of interest are present in the beacon through an online interface. That is, a user can submit a query asking whether a genome exists in the beacon with a certain nucleotide at a certain position, and the beacon answers “yes” or “no”. If the dataset does not contain the desired genome, genomic data is not shared and distributed unnecessarily. In addition, researchers do not have to go through the paperwork to obtain a dataset that will not be helpful for their research. The GA4GH provides a shared beacon interface [2] that, as of December 2020, provides access to 81 beacons and acts as a hub where researchers and data owners meet.

Beacons are typically associated with a particular sensitive phenotype (e.g., the SFARI beacon, which hosts individuals with autism). Therefore, the presence of an individual in a particular beacon is considered privacy-sensitive information, and the main aim of the beacons is to protect this information. An attacker, using the responses of a beacon and the genomic data of a victim, may try to infer the membership of the victim in a particular beacon by running a membership inference attack. The beacon framework sets a barrier against membership inference attacks by allowing only presence/absence queries for variants and not tying any response to any specific individual. In that sense, beacons are considered to have stronger privacy measures compared to other statistical genomic databases. Despite these barriers, several works have shown that beacons are not bulletproof and that they are vulnerable to membership inference attacks [59, 65, 73].

However, threats against genomic data-sharing beacons are not limited to membership inference attacks. In this paper, for the first time, we identify and analyze the vulnerability of genomic data-sharing beacons to the “genome reconstruction” attack. We consider a scenario in which the attacker knows the membership of a victim in a beacon that may not be associated with a sensitive phenotype. Therefore, we consider a targeted attack, in which the attacker either (i) knows that the victim donated their genome to take part in a study or (ii) infers the membership of the victim from the beacon’s metadata (as done in [65]). Then, we show how the attacker can accurately infer the genome of the victim by using the beacon responses. Such an attack may result in serious consequences if the attacker uses the reconstructed genome to infer sensitive information (e.g., disease diagnosis) about the victim or to infer the victim’s membership in another statistical genomic database of interest (e.g., another beacon that is associated with a sensitive phenotype). In particular, we show how the attacker can use the inherent correlations in the genome to run such an attack in an efficient and accurate way compared to a baseline approach. We also show how clustering techniques can be used to further improve the accuracy of such an attack.

Previous works in the literature assume beacons are static and do not change over time. However, beacons are dynamic datasets (donors join and leave), and this results in an increased risk of genome reconstruction attacks. An attacker can monitor the number of newly added donors to the beacon and the number of donors leaving the beacon from the meta-information of the beacon. With this information, newly joined donors (or donors leaving the beacon) become more vulnerable to genome reconstruction attacks. Thus, for the first time, we consider beacons as dynamic databases and formulate the genome reconstruction attack accordingly. Privacy vulnerabilities due to dynamic changes in a system have recently been explored in the context of dynamic model changes in machine learning models [61]. It has been shown that different model outputs can constitute a new attack surface for an adversary to infer information about the dataset used to perform a model update [61]. Here, rather than model updates, we focus on the changes in the query responses of a dynamic database.

In a genome reconstruction attack, the attacker reconstructs all or a subset of the genomes in the beacon. Among the reconstructed genomes, it is not trivial to infer which one belongs to the victim. Therefore, we also show how the attacker can identify the victim’s genome among the set of reconstructed genomes using moderate auxiliary information about the victim (i.e., a set of visible physical characteristics of the victim, which is public information). Finally, to show one of the consequences of the identified genome reconstruction attack, we show how the attacker can utilize the outcome of this attack to initiate a membership inference attack against the same victim in another beacon, which can be associated with a sensitive phenotype. To do this, we combine the identified genome reconstruction attack with the membership inference attacks against beacons from the literature.

We implement and evaluate the identified vulnerability using real genome data obtained from OpenSNP [32] and HapMap [21] datasets. We particularly evaluate the success of the attacker to reconstruct a victim’s point mutations that include at least one rare nucleotide (i.e., minor allele) since minor alleles (i) reveal sensitive attributes of individuals (e.g., predispositions to privacy-sensitive diseases); and (ii) provide rich information to the attacker for membership inference attacks [59, 73]. We show that for a beacon with 50 individuals, precision and recall of the reconstruction reach up to 0.9 (each) when 3 individuals are added to the beacon and the victim is one of the newcomers. Even when 10 new participants are added to the beacon (causing a 20% increase in beacon size), we show that the attacker has a precision of 0.7 and a recall of 0.8. Furthermore, our results show that when more than one individual is added to the beacon, the attacker can accurately pinpoint the victim’s reconstructed genome by using moderate (and publicly available) auxiliary information about the victim. For this, we show how the attacker can match the victim’s phenotypical characteristics to the reconstructed genomes using machine learning algorithms. We also show via experiments that the outcome of the genome reconstruction attack can be accurately used for the membership inference attack on another beacon and it helps an attacker infer the membership of a victim only with a few queries.

Overall, we identify an important vulnerability and show how it can be exploited. We notably show how dependencies between point mutations can be used in a clustering algorithm to achieve high accuracy in a genome reconstruction attack. Furthermore, our methodology consists of a complete pipeline, showing how an attacker can use the information it infers in the genome reconstruction attack in a subsequent membership inference attack. Therefore, this study clearly shows that privacy risks for genomic data-sharing beacons are much more severe than perceived. This is particularly important since the number of beacon participants, and hence the privacy risk to individuals, is increasing rapidly.

2. Related Work

Genomic privacy has recently been explored by many studies [11, 27, 56]. In the following subsections, we will summarize existing work on privacy in statistical genomic databases, inference attacks, and privacy of genomic data-sharing beacons.

2.1. Privacy in Statistical Genomic Databases and Inference Attacks on Genomic Privacy

Several works have shown that anonymization does not effectively protect the privacy of genomic data [30, 33, 35, 45, 50, 53, 66]. It has been shown that the identity of a participant of a genomic study can be revealed by using a second sample (e.g., part of the DNA information from the individual) and the results of the clinical study [19, 37, 41, 75, 77]. The concept of differential privacy (DP) [26] has frequently been used to mitigate membership inference attacks when releasing summary statistics from genomic databases [28, 44, 68, 76]. Compared to statistical databases, genomic data-sharing beacons have stronger privacy measures since they only allow presence/absence (or yes/no) queries for variants.

Humbert et al. proposed an inference attack on kin genomic privacy using the family ties between individuals, pairwise correlations between the SNPs, and publicly available statistics about DNA [38]. Then, Deznabi et al. demonstrated that stronger inference techniques can be built by combining high-order correlations and family ties [25]. Furthermore, several studies have examined phenotype prediction from genomic data as a means of tracing identity [10, 18, 39, 46, 51, 52, 54, 58, 74, 78]. To mitigate such attribute inference attacks, cryptographic solutions have been proposed for privacy-preserving processing and sharing of genomic data (e.g., to outsource the computation to a public cloud or to conduct collaborative association studies). Existing cryptographic solutions mainly focus on (i) private pattern-matching and the comparison of genomic sequences [15, 24, 43, 55, 69] and (ii) privacy-preserving personalized medicine [12, 13]. In this work, we identify and analyze a different type of attribute inference attack, specifically against genomic data-sharing beacons.

2.2. Privacy in Genomic Data Sharing Beacons

Researchers showed that the presence (membership) of an individual in a genomic data-sharing beacon can be inferred by repeatedly querying the beacon. Here, the attacker is assumed to be an active (or authorized) user of the beacon. In practice, the attacker can issue as many queries as it wishes (the current beacon protocol imposes no limit or cost on queries), and it can decide which queries to ask. Furthermore, the attacker is assumed to have access to the set of SNPs of the victim. Shringarpure and Bustamante introduced a likelihood-ratio test (LRT) that can predict whether an individual is in the beacon by querying the beacon for multiple SNPs of a victim [65]. Note that inferring the membership of an individual in a beacon that is associated with a sensitive phenotype is equivalent to uncovering the sensitive phenotype of the victim. Then, Raisaro et al. showed that if the attacker first queries the SNPs with low minor allele frequency (MAF) values, it needs fewer queries for a successful attack [59]. In Section 6.5, we use this attack when we show how the proposed genome reconstruction attack can be combined with the membership inference attack. We provide further background information about this attack in Appendix A. Later, von Thenen et al. showed that even if the attacker does not have the victim’s low-MAF SNPs, it is still possible to infer membership by exploiting the correlations in the genome [73]. Furthermore, they showed that beacon responses can also be inferred using such correlations (via a query inference, or QI-attack). In an orthogonal work, Hagestedt et al. hypothesized that while current beacon systems are limited to genomic data, in the near future the community is going to need a similar system for other biomedical data types. They proposed a beacon system for sharing DNA methylation data (an epigenetic mechanism to regulate transcriptional activity) and then showed that it is possible to successfully launch a membership inference attack against this system. They proposed a DP-based solution in their MBeacon [34] system. The approach retains utility by adjusting the noise level for high-risk methylation regions that might leak phenotypic information (i.e., regions that are related to disease).

Contribution of this paper.

In this paper, we identify and analyze a genome reconstruction attack against genomic data-sharing beacons by specifically exploiting the information leaked due to beacon updates and the correlations between the point mutations. So far, all works in the literature have focused on membership inference attacks against genomic data-sharing beacons. To the best of our knowledge, this is the first work that identifies, thoroughly analyzes, and shows the consequences of the genome reconstruction attack against beacons. Furthermore, as opposed to existing work (which only considers a snapshot of the beacon), we show the privacy risk in dynamic beacons, in which new donors may join or existing donors may leave.

3. Genomics Background

Approximately 99.9% of DNA is identical across all individuals, and the remaining 0.1% is responsible for our differences. The single nucleotide polymorphism (SNP) is the most common source of variation in the human genome. A SNP is a point mutation (e.g., the substitution of a single nucleotide in the genome: A, T, C, or G), and there are around 50 million known SNPs in the human genome [3]. The alternative nucleotides for each locus (SNP position) are called alleles, and each allele of a SNP can be either the major or the minor allele for that SNP. The major allele is the most frequently observed nucleotide for a SNP position and the minor allele is the rare nucleotide (i.e., the second most common). The frequency (or probability) of observing the minor allele at a SNP position is called the minor allele frequency (MAF) of that SNP. The human genome has two copies of each locus (one per chromosome), and a SNP can be represented in terms of the number of its minor alleles (i.e., 0 for homozygous major, 1 for heterozygous, or 2 for homozygous minor).
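As a small illustration of the 0/1/2 encoding described above, the following Python sketch estimates the MAF of a single SNP from a sample of genotypes. Since every individual carries two copies of the locus, the estimate is the total number of minor alleles divided by twice the number of individuals. The function name and the example genotypes are hypothetical and only serve the illustration.

```python
def minor_allele_frequency(genotypes):
    """Estimate the MAF of one SNP from genotypes encoded as 0, 1, or 2
    (the number of minor alleles each individual carries)."""
    if not genotypes:
        raise ValueError("need at least one genotype")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be 0, 1, or 2")
    # Each individual contributes two allele copies at the locus.
    return sum(genotypes) / (2 * len(genotypes))


# Hypothetical sample of five individuals: three minor alleles out of ten copies.
example_maf = minor_allele_frequency([0, 1, 0, 2, 0])
```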

Particular SNPs in the human population are inherently correlated, and this correlation model may differ across populations. Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci. If two SNPs are in LD, they are correlated and co-occur more frequently than expected. Some SNPs are pathogenic and cause genetic diseases [6], and hence they may carry sensitive information regarding individuals’ health conditions. As discussed in Section 2, most existing works in the genomic privacy literature focus on the protection of SNPs to prevent the risk of genetic discrimination.

4. System Model

As shown in Figure 1, we consider a system comprising the beacon participants (e.g., donors), the beacon, and the beacon users (which may include the attacker). The donor shares their genome with the beacon. The donor may share their genome with multiple beacons, which may or may not be associated with sensitive traits. The genome donor is not active during the protocol after they share their data with the beacon. Also, the beacon never publicly shares its dataset, but some beacons may share metadata about (i) their content (e.g., size) or (ii) their donors (e.g., their gender, age, or ethnicity). In general, we consider the beacon as a dynamic dataset, in which new donors may join and existing donors may leave over time. Beacon users issue queries to the beacon. As discussed, the beacon user can only ask about the presence of a genome with a particular allele (nucleotide) at a particular position of a given chromosome, and the beacon only responds “yes” or “no”. In this work, we assume the beacon honestly reports the result of each query to the user (e.g., without introducing intentional noise to the query results), and we do not consider a query limit for the users, as it is usually trivial to overcome such limits (e.g., by registering several times with different accounts).
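The system model can be made concrete with a toy, in-memory beacon. The Python sketch below is not the GA4GH beacon API; the class name `ToyBeacon`, its methods, and the variant tuples are hypothetical. It only mirrors the assumptions stated above: donors can join and leave, the size is visible as metadata, and a query returns an honest yes/no answer that is not tied to any individual.

```python
class ToyBeacon:
    """Hypothetical in-memory beacon that answers honest yes/no presence queries."""

    def __init__(self):
        # donor id -> set of (chromosome, position, allele) variants
        self._donors = {}

    def add_donor(self, donor_id, variants):
        self._donors[donor_id] = set(variants)

    def remove_donor(self, donor_id):
        self._donors.pop(donor_id, None)

    def size(self):
        # Metadata that some beacons expose; it reveals how many donors joined or left.
        return len(self._donors)

    def query(self, chromosome, position, allele):
        # "Yes" if at least one donor carries the allele; no donor is identified.
        key = (chromosome, position, allele)
        return any(key in variants for variants in self._donors.values())


beacon = ToyBeacon()
beacon.add_donor("donor-1", [("1", 12345, "T")])
answer_before = beacon.query("2", 6789, "G")   # no donor carries this allele yet
beacon.add_donor("donor-2", [("2", 6789, "G")])
answer_after = beacon.query("2", 6789, "G")    # the response flips after the update
```

The flip from "no" to "yes" between the two queries is exactly the kind of signal a dynamic beacon leaks about newly added donors.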

5. Threat Model

Depending on the attacker’s objective, two attacks can be launched against genomic data-sharing beacons: (i) the membership inference attack and (ii) the genome reconstruction attack. In both attacks (including this work), the attacker is assumed to be a registered beacon user who can send an unlimited number of queries to the beacon. In this work, for the first time, we identify and study the genome reconstruction attack. We assume that the attacker knows the membership of an individual in a beacon. Thus, we consider a targeted attack, in which the attacker knows that the victim donated their genome (to take part in a study). Given the current rise in personal genomics (people uploading their genomes to public sites), this is feasible. Also, beacons that are not associated with a sensitive phenotype report metadata about their donors. For instance, Shringarpure and Bustamante [65] verified that a specific person is in the PGP and Kaviar [31] beacons via metadata, and hence the attacker can also identify the membership of the victim using such metadata. Using the membership information, the goal of the attacker is to reconstruct the victim’s genome by issuing queries to the corresponding beacon.

The genome reconstruction attack can be considered for both static and dynamic beacons. In static beacons, knowing that the victim is a member of the beacon, only the “no” responses provide certain information about the victim’s genome to the attacker. “Yes” responses may be due to any other participant of the beacon, and as the size of the beacon increases, “yes” responses provide less and less information to the attacker. However, in dynamic beacons, when the beacon is updated, the attacker can use the change in the responses of the beacon to learn more about the genomes of new participants. Thus, in this paper, we analyze this vulnerability for dynamic beacons, and we assume that the victim is added to the beacon between times t and t + δ along with (m − 1) other newly added donors. As discussed before, the attacker can monitor the number of newly added donors to the beacon and the number of donors leaving the beacon from the metadata of the beacon.

We assume that, along with the fact that the victim is among the newly joined participants of the beacon, the attacker also knows: (i) the number of other newly joined individuals that are added to the beacon along with the victim; (ii) a snapshot of the beacon before the victim is added (at time t), that is, the responses to all queries before the victim joins the beacon. The beacon protocol does not bar anyone from taking a complete snapshot, so doing this only requires a high-bandwidth internet connection. The economic cost of such an internet service is around USD 79 per month [70], and there is no other economic cost, as the system is publicly available at [2]. Even though the number of SNPs in a complete snapshot is large, typically only low-MAF SNPs are useful for the attacker (as they are typically the sensitive ones); (iii) auxiliary information about the victim to identify the victim’s genome among the reconstructed ones. For this, we assume the attacker has moderate information, such as a set of the victim’s visible characteristics (phenotype); and (iv) publicly available information about genomics, such as minor allele frequencies (MAF values) of SNPs and correlations between the SNPs in the population of interest. Finally, we assume that the attacker does not collude with the beacon.

In the genome reconstruction attack, due to the nature of beacon responses, the attacker can only infer whether the victim has at least one minor allele at each SNP position. This is because the response of the beacon only tells whether there is an individual in the beacon with at least one minor allele at a given SNP position. Thus, for each SNP $j$ of victim $v$ (denoted $S_{j}^{v}$), the goal of the attacker is to infer $\Pr(S_{j}^{v} = 0)$ and $\Pr(S_{j}^{v} \neq 0)$ (i.e., the probability that $S_{j}^{v} = 1$ or $S_{j}^{v} = 2$). For simplicity, we define the indicator ${\hat{S}}_{j}^{v} = \mathbb{1}_{S_{j}^{v} = 1 \lor S_{j}^{v} = 2}$. Thus, ${\hat{S}}_{j}^{v} = 0$ if $S_{j}^{v} = 0$, and ${\hat{S}}_{j}^{v} = 1$ otherwise. Note that inferring this information for a victim results in a serious privacy concern. As we will discuss and show later, using this information, an attacker can associate the genotype of the victim with related phenotypes (e.g., diseases) and initiate a membership inference attack against the victim by targeting another beacon that is associated with a sensitive phenotype (e.g., cancer or HIV+).
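The binary target of the attack can be illustrated with a short Python sketch. It collapses a 0/1/2 genotype into the indicator defined above and shows why a beacon response cannot distinguish heterozygous from homozygous-minor carriers: the response is the logical OR of the indicators of all members. The function names are hypothetical.

```python
def carrier_indicator(genotype):
    """Collapse a 0/1/2 genotype into the binary indicator:
    0 for homozygous major, 1 for at least one minor allele."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    return 1 if genotype > 0 else 0


def beacon_response_for_snp(genotypes):
    """Honest beacon answer for one SNP: 'yes' (True) if any member
    carries at least one minor allele."""
    return any(carrier_indicator(g) == 1 for g in genotypes)


# Genotypes 1 and 2 are indistinguishable from the attacker's point of view.
same_signal = carrier_indicator(1) == carrier_indicator(2)
```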

Our methodology consists of a complete pipeline, showing how an attacker uses the information it infers in the genome reconstruction attack in a subsequent membership inference attack. Therefore, we evaluate the success of the attacker using different metrics in different parts of the pipeline as follows. For genome reconstruction (in Section 6.3), we use precision and recall to quantify the inference power of the attacker. As we will show in Section 7, the success of genome reconstruction mainly depends on the size of the beacon, the number of newly added donors to the beacon between times t and t + δ, and the fraction of the beacon covered by the attacker’s snapshot at time t. In real life, the sizes of beacons vary widely. The size of a beacon can be as small as 100, as for the NBDC Human Database [4], or as large as 100K, as for the Genome Aggregation Database (gnomAD) [5]. As discussed, these numbers can be monitored from the metadata of such beacons. Thus, as we will show, for small beacons, even if the size of the beacon is significantly increased (compared to its original size), the attacker’s success may be high. For large beacons, on the other hand, the number of newly added donors should be a small fraction of the original size for a successful attack. As a result of the genome reconstruction, the attacker potentially reconstructs multiple genomes, and among these, one belongs to the victim. For this part, we show how the attacker can utilize machine learning techniques to identify the victim’s genome among the reconstructed ones (in Section 6.4), and we use the classification accuracy of the attacker as its success metric. Finally, to quantify the success of the membership inference (Section 6.5), we use a power analysis as the success metric. To evaluate the success of the attacker in the membership inference attack, we first let the attacker run the genome reconstruction attack and then use the proposed machine learning technique to identify the victim’s genome among the reconstructed ones. Thus, the success metric for the membership inference considers the attacker’s success in the entire pipeline.

6. Genome Reconstruction Attack on Genomic Data-Sharing Beacons

As discussed, we define the genome reconstruction attack as inferring genomic data of a genome donor (i.e., victim) given their membership information to the beacon. To show the effect of genome reconstruction attack more clearly, we consider dynamic beacons and we assume the victim is among the newly joined donors to the beacon. For clarity of the discussion, we present the identified attack only considering newly joined donors. Considering the donors that leave the beacon is symmetrical and trivial. We discuss this case in Section 8.2.

We consider a scenario in which the attacker has no information about the victim’s genome, but it knows that the victim is added to the beacon between times t and t + δ. Let n and (n + m) represent the number of individuals in the beacon at times t and t + δ, respectively. As discussed, for most real-life beacons, the attacker knows m (by monitoring the changes in the beacon using its metadata). In all attack scenarios, we assume that the attacker reconstructs m′ genomes (m′ can be different from m, and the selection of m′ affects the precision and recall of the attacker). Our goal is to evaluate the performance for different m′ values to show that the attack is robust even if the attacker does not know how many people are added. When the metadata of the beacon, and hence m, is not available, the attacker can determine a potential upper bound (k) for the number of newly added donors (m) by examining the number of flipped responses (from “no” to “yes”). Then, for each i from 1 to k, it can reconstruct i genomes using $R_{N \to Y}$ assuming m = i. Hence, instead of m, the attacker ends up with $\frac{k (k + 1)}{2}$ candidate genomes among which to identify the reconstructed genome that best matches the victim.
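The count of candidate genomes follows from summing the number of genomes reconstructed under each hypothesis m = i for i = 1, …, k. The short Python check below makes this arithmetic explicit; the function name is hypothetical, and the sketch says nothing about how the upper bound k itself is chosen.

```python
def candidate_genome_count(k):
    """Number of genomes reconstructed when the attacker tries every
    hypothesis m = 1, 2, ..., k and builds i genomes under hypothesis m = i."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(i for i in range(1, k + 1))


# The explicit sum agrees with the closed form k(k + 1)/2 for small k.
closed_form_holds = all(
    candidate_genome_count(k) == k * (k + 1) // 2 for k in range(1, 50)
)
```

For instance, an upper bound of k = 4 leads to 1 + 2 + 3 + 4 = 10 candidate genomes.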

Using its auxiliary information (as discussed in Section 5), the attacker can probabilistically infer the genome of the victim by utilizing the changes in the beacon’s responses (at times t and t + δ) as follows: (i) if the previous response (at time t) was “no” and the current response (at time t + δ) is “yes”, the probability that the victim has a minor allele at the corresponding query position increases, depending on how many new individuals are added to the beacon in this time interval; (ii) if the previous response was “yes” and the current response is also “yes”, the attacker cannot infer much about the victim’s genome, especially if the total size of the beacon is large; and (iii) if both the previous and the current responses are “no”, the attacker learns that the victim does not have a minor allele at the corresponding query position.
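The three cases above amount to comparing two snapshots of beacon responses. A minimal Python sketch, assuming each snapshot is a dictionary that maps a SNP identifier to the beacon’s boolean answer (this data layout and the function name are assumptions made for illustration):

```python
def classify_transitions(snapshot_t, snapshot_t_delta):
    """Group SNPs by how the beacon response changed between time t and t + delta."""
    groups = {"no_to_yes": [], "yes_to_yes": [], "no_to_no": [], "yes_to_no": []}
    for snp, before in snapshot_t.items():
        after = snapshot_t_delta[snp]
        if not before and after:
            groups["no_to_yes"].append(snp)    # case (i): some newcomer carries the minor allele
        elif before and after:
            groups["yes_to_yes"].append(snp)   # case (ii): little information about the victim
        elif not before and not after:
            groups["no_to_no"].append(snp)     # case (iii): the victim has no minor allele here
        else:
            groups["yes_to_no"].append(snp)    # only possible when donors leave the beacon
    return groups


# Hypothetical snapshots over four SNPs.
before = {"snp1": False, "snp2": True, "snp3": False, "snp4": True}
after = {"snp1": True, "snp2": True, "snp3": False, "snp4": False}
transitions = classify_transitions(before, after)
```

The list under `no_to_yes` corresponds to the set $R_{N \to Y}$ used in the rest of this section.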

Here, the most important (or the most sensitive) information for the attacker is the set of “no” responses at time t that turn to “yes” at time t + δ, because such responses let the attacker infer, with high probability, the positions at which the victim has at least one minor allele (depending on how many new individuals are added to the beacon in this time interval). Since minor alleles of individuals are typically the indicators of privacy-sensitive information about them, in this work we measure the success of the attacker by its success in inferring the minor alleles of a victim using the beacon responses that turn to “yes”. Exhaustively generating all potential solutions of this problem would result in a total of $2^{\beta \cdot m'}$ genomes, where β is the total number of responses that turn to “yes” at time t + δ (which can be on the order of tens of thousands), and hence it is intractable. In the following, we first describe a baseline method that provides a tractable solution to this problem. Next, we present a greedy approach to run such an attack more accurately, and then we detail a more sophisticated, clustering-based approach for the genome reconstruction attack.

6.1. Baseline Approach for Genome Reconstruction

Here, we describe a baseline approach, in which the attacker, using the responses of the beacon, reconstructs the genomes (of the newly joined donors) by assigning them to m′ bins according to MAF values of the SNPs. Genome reconstruction attack using the baseline algorithm for a particular victim v at time t + δ can be described as follows. The input of the attacker is (i) responses of the beacon to all possible queries at time t (i.e., complete snapshot of the beacon at time t); (ii) the fact that m new donors are added to the beacon between times t and t + δ; (iii) the fact that the victim is among the newly added donors; and (iv) publicly available MAF values of the SNPs.

First, the attacker identifies the set of SNPs for which the response of the beacon was “no” at time t and becomes “yes” at time t + δ. Thus, the attacker constructs a set $R_{N \to Y}$ consisting of these SNPs. Then, the attacker creates m′ empty bins representing the SNP sets of the newcomer donors. For each SNP j in set $R_{N \to Y}$, the attacker retrieves its MAF value, $\mathrm{MAF}_{j}$. Next, the attacker assigns the value of SNP j for each individual i (in each bin) consistent with the SNP’s MAF value as follows: (i) ${\hat{S}}_{j}^{i} = 0$ with probability $(1 - \mathrm{MAF}_{j})^{2}$ and (ii) ${\hat{S}}_{j}^{i} = 1$ with probability $\mathrm{MAF}_{j}^{2} + 2\,\mathrm{MAF}_{j} (1 - \mathrm{MAF}_{j})$. Since the beacon’s response for the SNPs in $R_{N \to Y}$ has flipped from “no” to “yes”, for every SNP in $R_{N \to Y}$ there should be at least one bin (among the m′ bins) with at least one mutation (i.e., a homozygous minor or heterozygous SNP). Thus, once the values of the SNPs in $R_{N \to Y}$ for all m′ bins are determined, the attacker checks whether there is any SNP in set $R_{N \to Y}$ that is not assigned to any bin. If there is such a SNP, the attacker randomly picks a bin and assigns the value of the corresponding SNP as ${\hat{S}}_{j}^{i} = 1$ for that bin. The details of this baseline approach are also shown in Algorithm 2 (in Appendix B).
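The baseline steps described above can be sketched in a few lines of Python. This is an illustrative reading of the prose rather than a transcription of Algorithm 2; the function name, the dictionary-based bins, and the example MAF values are assumptions. The carrier probability $1 - (1 - \mathrm{MAF}_{j})^{2}$ used in the code equals $\mathrm{MAF}_{j}^{2} + 2\,\mathrm{MAF}_{j}(1 - \mathrm{MAF}_{j})$.

```python
import random


def baseline_reconstruct(flipped_snps, maf, m_prime, rng=None):
    """Sketch of the baseline: fill m_prime bins independently per SNP
    according to the SNP's MAF, then repair SNPs that no bin carries."""
    rng = rng or random.Random()
    bins = [dict() for _ in range(m_prime)]  # one dict per reconstructed genome
    for snp in flipped_snps:
        # Probability of carrying at least one minor allele.
        p_carrier = 1.0 - (1.0 - maf[snp]) ** 2
        for genome in bins:
            genome[snp] = 1 if rng.random() < p_carrier else 0
        # The response flipped to "yes", so at least one newcomer must carry the SNP.
        if not any(genome[snp] for genome in bins):
            rng.choice(bins)[snp] = 1
    return bins


# Hypothetical flipped SNPs and MAF values, for illustration only.
example_maf = {"rs_a": 0.01, "rs_b": 0.20, "rs_c": 0.05}
reconstructed = baseline_reconstruct(list(example_maf), example_maf, m_prime=3,
                                     rng=random.Random(7))
```

Because each SNP is drawn independently, the sketch ignores correlations between SNPs, which is the limitation the greedy approach addresses.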

6.2. Greedy Algorithm for Genome Reconstruction

The above-mentioned baseline algorithm assumes every SNP is independent and the correlations among them are disregarded. However, SNPs are inherently correlated and considering such correlations in the genome reconstruction attack may result in significantly more accurate results. In the greedy algorithm discussed here, the attacker forms the bins considering the correlations between the SNPs in set $R_{N \to Y}$ . Using an iterative approach, the attacker assigns each SNP (minor allele) to an individual such that the probability of assignment is proportional to the average correlation of the new SNP with the already assigned SNPs of the individual (i.e., bin i). If no assignment is made this way, a random individual is selected to make sure there is at least one person with the corresponding new SNP.

Genome reconstruction attack using the greedy algorithm for a particular victim v at time t + δ can be described as follows. The input of the attacker includes everything in the baseline approach and also a correlation model between the SNPs that is consistent with the population structure of the beacon (that can be computed using publicly available genomic datasets). Different correlation models have been explored for genomic data before. In [62], authors showed how the correlations in the genome can be modelled using a Markov chain model. We create our correlation model by considering the pairwise correlations between all the SNPs in the beacon (which results in richer information for the attacker). The attacker calculates the likelihood of the victim v having at least one minor allele at a SNP position j as $P_{k} ({\hat{S}}_{j}^{v}) = P ({\hat{S}}_{j}^{v} ∣ {\hat{S}}_{k}^{v})$ where k may be any other position in the genome. We use Sokal-Michener distance to compute correlations between SNPs as follows:

$$A = 2 \left( n_{{\hat{S}}_{j}^{v} = 1, {\hat{S}}_{k}^{v} = 0} + n_{{\hat{S}}_{j}^{v} = 0, {\hat{S}}_{k}^{v} = 1} \right)$$

$$B = n_{{\hat{S}}_{j}^{v} = 1, {\hat{S}}_{k}^{v} = 1} + n_{{\hat{S}}_{j}^{v} = 0, {\hat{S}}_{k}^{v} = 0}$$

$$D_{\text{Sokal-Michener}} ({\hat{S}}_{j}^{v}, {\hat{S}}_{k}^{v}) = \frac{A}{A + B}$$
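As an illustration of the distance above, the following sketch computes A, B, and the resulting distance for two SNPs. It assumes that the counts n are taken over the individuals of a reference panel, with each SNP given as a list of 0/1 states (1 meaning at least one minor allele); the function name and input format are hypothetical.

```python
def sokal_michener_distance(x, y):
    """Distance between two SNPs given their 0/1 states across a panel."""
    if len(x) != len(y):
        raise ValueError("both SNPs must cover the same individuals")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    matches = len(x) - mismatches
    a_term = 2 * mismatches   # A: doubled count of disagreeing individuals
    b_term = matches          # B: count of agreeing individuals
    if a_term + b_term == 0:
        return 0.0            # empty panel: no evidence of disagreement
    return a_term / (a_term + b_term)
```

For example, two SNPs that disagree on 2 of 4 individuals give A = 4 and B = 2, so the distance is 4/6.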

In the greedy approach, the attacker first constructs the set $R_{N \to Y}$ . Then, it creates m′ empty bins (m′ does not have to be equal to m), one for each rare SNP in $R_{N \to Y}$ . We categorize SNPs with an MAF value below a threshold τ as rare SNPs. Since rare SNPs are observed not to be correlated with each other, assigning the rare SNPs in $R_{N \to Y}$ to different bins as seeds is assumed to result in an accurate initial separation of individuals. Next, for each remaining SNP j in $R_{N \to Y}$ and for each bin i, the attacker computes the average correlation between SNP j and all the SNPs previously assigned to bin i using the aforementioned correlation model. Let ${\hat{S}}_{j}^{i}$ be a binary random variable for SNP j and bin i. The attacker assigns ${\hat{S}}_{j}^{c} = 1$ for the bin c that has the highest average correlation value and ${\hat{S}}_{j}^{i} = 0, \forall i \in [1, m^{'}]$ with i ≠ c. Eventually, the attacker constructs m′ potential genomes (in m′ bins) belonging to the m newcomer donors.
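A minimal sketch of this greedy assignment is given below. It assumes a caller-supplied `similarity(j, k)` function standing in for the correlation model, and it processes the non-rare SNPs in ascending MAF order; both the interface and the exact ordering are assumptions made for the example.

```python
def greedy_reconstruction(maf, similarity, tau):
    """Seed one bin per rare SNP, then place each remaining SNP greedily.

    maf: dict SNP -> MAF for the SNPs in R_{N->Y} (hypothetical format).
    similarity: function (j, k) -> float, higher means more correlated.
    tau: MAF threshold below which a SNP is treated as rare.
    """
    rare = [s for s in maf if maf[s] < tau]
    if not rare:
        raise ValueError("no rare SNPs available to seed the bins")
    bins = [[s] for s in rare]  # each rare SNP seeds its own bin
    remaining = sorted((s for s in maf if maf[s] >= tau), key=lambda s: maf[s])
    for j in remaining:
        # Average correlation of SNP j with the SNPs already in each bin.
        averages = [sum(similarity(j, k) for k in b) / len(b) for b in bins]
        best = max(range(len(bins)), key=averages.__getitem__)
        bins[best].append(j)
    return bins
```

Each returned list holds the SNPs for which the corresponding candidate genome is assigned the value 1; all other SNPs in that genome are 0.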

6.3. Clustering-Based Algorithm for Genome Reconstruction

The greedy algorithm (in Section 6.2) reconstructs genomes by following a particular order (determined based on the MAFs of the SNPs), and different orders may provide different solutions. Thus, to consider all query responses together in a collective way, we propose clustering-based approaches for the genome reconstruction attack that cluster the identified minor alleles of the donors newly added to the beacon. The proposed clustering techniques essentially use the correlations between the SNPs (computed using the aforementioned correlation model) to distribute SNPs into different bins. We use two types of clustering techniques: (i) hard clustering to create non-overlapping bins and (ii) soft (fuzzy) clustering to assign a SNP to multiple bins.

For (i), we employ spectral clustering, in which a standard clustering method (such as k-means clustering) is applied on certain eigenvectors of the Laplacian matrix of a graph [57]. In this graph, the SNPs correspond to vertices and the correlations between the SNPs correspond to edge weights. Spectral clustering is our method of choice as it has been shown to provide favorable results in many high-dimensional feature spaces like ours [60]. For (ii), we employ the fuzzy c-means clustering (FCM) algorithm [14], which is a common choice for these types of tasks. The algorithm is similar to k-means clustering, but it also allows probabilistic assignments of samples to multiple clusters. Different from k-means clustering, FCM assigns a membership value $u_{i j} = P ({\hat{S}}_{j}^{i} = 1)$ for each element j and each cluster i. These membership values are used as weights in the objective function. After convergence, they are used as the probabilities of assigning elements to each cluster. The descriptions of the two clustering methods are similar except for the clustering steps. Thus, in the following, we describe both methods together.
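To make the membership values concrete, the sketch below shows the standard FCM membership update for fixed cluster centers. It is not taken from the paper: the distance-matrix input and the `fuzzifier` parameter (the usual FCM exponent, commonly set to 2) are assumptions for the example.

```python
def fcm_memberships(distances, fuzzifier=2.0):
    """Standard fuzzy c-means membership update.

    distances[i][j] is the distance from element j to the center of
    cluster i. Returns u with u[i][j] in [0, 1]; each column sums to 1.
    """
    n_clusters = len(distances)
    n_items = len(distances[0])
    exponent = 2.0 / (fuzzifier - 1.0)
    u = [[0.0] * n_items for _ in range(n_clusters)]
    for j in range(n_items):
        col = [distances[i][j] for i in range(n_clusters)]
        zeros = [i for i, d in enumerate(col) if d == 0.0]
        if zeros:
            # Element sits exactly on one or more centers: split evenly.
            for i in zeros:
                u[i][j] = 1.0 / len(zeros)
            continue
        for i in range(n_clusters):
            u[i][j] = 1.0 / sum((col[i] / col[k]) ** exponent
                                for k in range(n_clusters))
    return u
```

With distances 1 and 3 to two centers and a fuzzifier of 2, the element receives memberships 0.9 and 0.1, so the closer cluster dominates without excluding the other.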

The input of both clustering-based algorithms is the same as the input of the greedy algorithm. First, the attacker identifies the set of SNP positions for which the response of the beacon was “no” at time t and it becomes “yes” at time t + δ and constructs set $R_{N \to Y}$ . Then, the attacker builds a graph of SNPs using the correlation model, in which the vertices are the SNPs in $R_{N \to Y}$ and undirected edges are weighted by the correlation values between these SNPs. This graph represents a pairwise similarity model for the SNPs and is used for a quantitative assessment of the correlation of each SNP pair in $R_{N \to Y}$ .
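The graph construction can be sketched as follows. The example assumes that each edge weight is the similarity 1 − D, where D is the Sokal-Michener distance computed over a reference panel; converting the distance to a similarity in this way, as well as the adjacency-dictionary representation, is an assumption made for illustration.

```python
from itertools import combinations

def build_snp_graph(panel):
    """Build a weighted, undirected SNP graph as nested dicts.

    panel maps each SNP in R_{N->Y} to its list of 0/1 states across
    the reference individuals (hypothetical input format).
    """
    graph = {snp: {} for snp in panel}
    for j, k in combinations(panel, 2):
        mismatches = sum(a != b for a, b in zip(panel[j], panel[k]))
        matches = len(panel[j]) - mismatches
        total = 2 * mismatches + matches
        distance = 2 * mismatches / total if total else 0.0
        weight = 1.0 - distance  # assumed similarity: 1 minus distance
        graph[j][k] = weight
        graph[k][j] = weight     # undirected edge
    return graph
```

The resulting weights can be arranged into an affinity matrix and passed to a clustering routine that accepts precomputed similarities.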

Next, the attacker applies either the spectral or the fuzzy clustering algorithm on the constructed graph. The outcome of spectral clustering is a set of disjoint clusters. Fuzzy clustering results in groups of SNPs that maximize the similarity within a group while allowing a SNP to be shared by multiple individuals. Thus, in fuzzy clustering, each SNP j is assigned to the clusters for which the algorithm returns a relatively high probability of association. After clustering, the attacker obtains m’ different clusters, which correspond to m’ reconstructed genomes. The details are shown in Algorithm 1.
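The step from cluster memberships to reconstructed genomes can be sketched as below. The text does not specify what counts as a "relatively high" probability, so the `threshold` parameter is an assumption; passing `None` gives the hard (disjoint) assignment.

```python
def memberships_to_genomes(u, snps, threshold=None):
    """Turn a membership matrix into one 0/1 genome per cluster.

    u[i][j] is the membership of SNP snps[j] in cluster i.
    threshold=None -> hard assignment (each SNP in exactly one cluster).
    threshold=x    -> fuzzy assignment (every cluster with u >= x).
    """
    genomes = [{s: 0 for s in snps} for _ in u]
    for j, snp in enumerate(snps):
        col = [row[j] for row in u]
        best = max(range(len(u)), key=col.__getitem__)
        genomes[best][snp] = 1  # every flipped SNP needs a carrier
        if threshold is not None:
            for i, value in enumerate(col):
                if value >= threshold:
                    genomes[i][snp] = 1
    return genomes
```

Lowering the threshold lets more SNPs appear in several genomes at once, which tends to raise recall at the cost of additional false positives.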

6.4. Identifying the Victim Using Genotype-Phenotype Associations

In previous sections, for genome reconstruction, we assumed that the attacker can correctly identify the victim’s genome among several reconstructed bins. Here, assuming the attacker has some moderate auxiliary information about the victim, we study how accurately the attacker can identify the victim’s genome among the other candidates. For this, we assume the attacker uses information about some phenotypic characteristics of the victim and relies on the fact that SNPs are intrinsically linked to phenotypic traits (such as eye color, hair color, etc.). This provides a complete methodology for the genome reconstruction attack against beacons in real life. As we will discuss later, the attacker’s success in correctly identifying the victim’s genome among the reconstructed ones increases if the attacker has access to more auxiliary information about the victim.

Assume victim v is among the m new additions to the beacon (it is trivial to extend the methodology if there is more than one victim). The attacker is assumed to have access to two distinct sets: (i) a set $S = \{ {\vec{S}}_{1}, {\vec{S}}_{2}, \dots, {\vec{S}}_{m^{'}} \}$ of m′ reconstructed genotypes as a result of the genome reconstruction attack, where ${\vec{S}}_{i} = ({\hat{S}}_{1}^{i}, \dots, {\hat{S}}_{k}^{i})$ is a vector containing the SNP values of genotype i (or bin i); and (ii) a set $P_{v} = (p_{1}^{v}, \dots, p_{t}^{v})$ containing the values of t phenotypic traits of victim v. Such phenotype information can be obtained from publicly available resources or using the physical traits of the victim. For instance, the attacker can obtain such information from the victim’s social media accounts. The goal of the attacker is to match the victim’s phenotype to the correct reconstructed genome (the one that is the most similar to the victim’s) among all candidate reconstructed genome sequences. In the test phase, the attacker has m newly added donors and m’ reconstructed genomes. The attacker’s task is to match each donor with the best matching reconstructed genome. Thus, for each newly added donor, the attacker calculates the likelihood scores of matching with all m’ reconstructed genomes.

Algorithm 1: Clustering-Based Algorithm for Genome Reconstruction Attack

In [40], Humbert et al. focused on the deanonymization risk and modelled genotype-phenotype association as an assignment problem. They showed this risk by using the Hungarian algorithm [47]. Different from [40], here, we rely on machine learning for maximizing the matching likelihood and genotype-phenotype associations. We observe that such a formulation provides more accurate results. Also, rather than using SNP values (0, 1 or 2), due to the nature of the proposed attack, we represent the state of each SNP j of individual i as ${\hat{S}}_{j}^{i}$ which can be either 0 or 1, as discussed before.

For phenotype inference, we train a separate model for each of the considered phenotypes, where SNPs with flipped responses (from “no” to “yes”) are used as features. Since phenotype datasets are highly imbalanced, we apply the Synthetic Minority Oversampling Technique (SMOTE) [16] to each of these datasets to resolve this problem. In SMOTE, a minority class instance is selected at random and its nearest minority class neighbors are found. Then, a new sample is generated as a convex combination of the original instance and a randomly chosen neighbor. Next, we train a random forest model for each phenotype. We use repeated stratified 5-fold cross validation to tune the hyperparameters. After training the phenotype models, we form the ensemble classifier using the ones that have a better validation F1-macro score than random guessing. We discard the other models.
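The SMOTE sampling step can be illustrated with the short sketch below, which generates one synthetic minority sample. It is a simplified illustration, not the implementation used in the paper; the function name, the default number of neighbors, and the use of Euclidean distance are assumptions.

```python
import random

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic sample from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = rng or random.Random()
    x = rng.choice(minority)
    # Rank the other minority instances by squared Euclidean distance to x.
    others = [p for p in minority if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    neighbor = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]
```

Because the new point is interpolated, binary SNP features take fractional values in the synthetic samples, which a random forest can still split on.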

The ensemble classifier calculates the matching likelihood of a given genome and a set of phenotypic traits. The softmax outputs of the phenotype models corresponding to the given phenotypic traits of the victim (e.g., the probability that a reconstructed genome belongs to an individual with blue eyes) are summed to calculate the matching likelihood. For a single victim, this calculation is done for each reconstructed genome and the victim is matched with the reconstructed genome that has the highest matching likelihood score. Note that this matching does not need to be one-to-one; a single reconstructed genome might match with different sets of phenotypic traits. We discuss the performance of identifying the victim’s reconstructed genome under different settings in Section 7.4.
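The scoring rule of the ensemble can be sketched as follows. The per-phenotype models are represented here by plain callables that return a probability for each trait value; this interface, like the names used, is hypothetical and stands in for the trained random forest models.

```python
def match_victim(genomes, victim_traits, models):
    """Pick the reconstructed genome that best matches the victim's traits.

    genomes: list of reconstructed genomes (any representation the
        models accept).
    victim_traits: dict trait name -> observed value for the victim.
    models: dict trait name -> callable(genome) returning a dict of
        trait value -> probability (hypothetical interface).
    """
    scores = []
    for genome in genomes:
        # Sum, over the retained models, the probability assigned to
        # the trait value actually observed for the victim.
        score = sum(models[trait](genome).get(value, 0.0)
                    for trait, value in victim_traits.items()
                    if trait in models)
        scores.append(score)
    best = max(range(len(genomes)), key=scores.__getitem__)
    return best, scores
```

Traits without a retained model are skipped, which mirrors discarding the phenotype models that do not beat random guessing.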

6.5. Using Genome Reconstruction in Membership Inference Attack

To show one consequence of the proposed genome reconstruction attack, we also model and analyze how the proposed attack can be utilized for a membership inference attack (introduced in Appendix A). We consider a scenario in which the attacker knows the membership of an individual to a beacon that is not associated with a sensitive phenotype (i.e., a phenotype-neutral beacon). The attacker first utilizes the responses of this beacon to infer specific parts of the victim’s genome (i.e., SNPs). Then, it uses these inferred SNPs to infer the membership of the victim to a beacon with a sensitive phenotype. This attack is important and realistic, because knowing the membership of an individual to a phenotype-neutral beacon (e.g., Kaviar Beacon) may not seem to pose a privacy issue. However, using the proposed genome reconstruction attack and the membership information of the victim to the beacon with a non-sensitive phenotype, the attacker can first infer the SNPs of the victim and then infer the membership of the victim to another beacon that is associated with a sensitive phenotype (e.g., the SFARI beacon, which is associated with the autism phenotype).

To show this, first, we run the proposed genome reconstruction attack that is explained in Section 6.3 and infer the SNPs of the victim with at least one minor allele on a beacon B₁. Using these inferred SNPs, we then run the membership inference attack to infer the membership of the victim in another beacon B₂. For membership inference attack, we use the optimal attack in [59] (described in Appendix A), which is shown to be an effective attack for membership inference (for our scenario, optimal attack in [59] and the QI-attack in [73] perform similarly, so we choose to use the optimal attack due to its simplicity). However, in contrast to the original optimal attack, in the null and alternate hypothesis equations in (1) and (2), there is an additional error due to the inference error of the genome reconstruction attack. This is because the attacker queries the alleles of the victim that it infers as a result of the genome reconstruction attack and there is a degree of uncertainty. Thus, we first experimentally compute the error rate of the genome reconstruction attack for a particular scenario (e.g., for particular m and n values). We then include this additional error on the γ parameter in (2), which represents the probability that the attacker’s copy of the victim’s genome does not match the beacon’s copy for a SNP. Furthermore, as opposed to original optimal attack, here the attacker may not have access to the SNPs of the victim with the lowest MAF values; instead the attacker only knows the SNPs that are inferred as a result of the genome reconstruction attack.

We evaluate the success of this attack in terms of the power of the attacker in Section 7.5. Similar to Raisaro et al. and von Thenen et al., we plot the power curve of the membership inference attack at 5% false positive rate. We empirically build the null hypothesis (H₀ in Appendix A). For every query, we determine the distribution of Λ under the null hypothesis using 20 individuals that are not in B₂. In this work, in order to model the uncertainty of correctly matching the victim (using phenotype inference as in Section 6.4), we first experimentally compute the error rate of the overall process. For instance, if the accuracy of correctly matching the phenotype of the victim to their reconstructed genome is p%, then p% of the 20 individuals are selected from correctly identified reconstructions and remaining individuals are selected from other new people added to the beacon along with the victim (incorrect identifications).

The null hypothesis is rejected when Λ is less than a threshold t_α, which we find from the null hypothesis with α = 0.05 (corresponding to a 5% false positive rate). Then, we compute the power as the proportion of the individuals in the alternate hypothesis (including 20 different individuals in B₂) having a Λ value that is less than t_α. As before, p% of the 20 individuals are selected from correctly identified reconstructions and the remaining individuals are selected from the other new people added to the beacon along with the victim.
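The threshold and power computation can be sketched as below. The sketch assumes t_α is taken as an empirical quantile of the null Λ values so that at most a fraction α of them fall strictly below it; the exact quantile rule is an assumption, and the function name is hypothetical.

```python
def power_at_alpha(null_lambdas, case_lambdas, alpha=0.05):
    """Return (t_alpha, power) for a one-sided test rejecting when Λ < t_alpha.

    null_lambdas: Λ values of individuals known not to be in the beacon.
    case_lambdas: Λ values of individuals who are in the beacon.
    """
    ordered = sorted(null_lambdas)
    # Number of null samples allowed to fall below the threshold.
    allowed = int(alpha * len(ordered))
    t_alpha = ordered[allowed]
    rejected = sum(1 for lam in case_lambdas if lam < t_alpha)
    return t_alpha, rejected / len(case_lambdas)
```

With 20 null individuals and α = 0.05, the threshold is the second smallest null value, so exactly one null individual (5%) would be falsely rejected when the values are distinct.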

7. Evaluation

To assess the identified vulnerabilities, we evaluated our methods using real-life genomic datasets. Here, we describe the datasets and present the evaluation results.

7.1. Datasets

We used two different genome datasets for evaluation: (i) genome dataset of CEU population from the HapMap dataset [29] and (ii) OpenSNP genome dataset [7]. Using the HapMap dataset, we created the beacons and victims from CEU population which contains 164 donors and around 4 million SNPs for each donor. We created the correlation model (i.e., SNP-SNP relation network or similarity model) for this beacon using individuals from the same HapMap dataset that are not in the constructed beacon and set of victims. Using the OpenSNP dataset, we created the beacons and victims from a random population which contains 2980 donors and around 2 million SNPs for each donor. We created the correlation model using the rest of the OpenSNP dataset.

For the OpenSNP dataset, we also collected the reported phenotypes of individuals. Since sample sizes are small, we used the reported phenotypes in a binary form. From OpenSNP, we used the following commonly reported phenotypes: (i) eye color, 967 samples, (ii) hair type, 371 samples, (iii) hair color, 468 samples, (iv) tan ability, 287 samples, (v) asthma, 226 samples, (vi) lactose intolerance, 347 samples, (vii) earwax, 244 samples, (viii) tongue rolling, 434 samples, (ix) intolerance to soy, 136 samples, (x) freckling, 277 samples, (xi) ring finger being longer than index finger, 268 samples, (xii) widow peak, 176 samples, (xiii) ADHD, 154 samples, (xiv) acrophobia, 155 samples, (xv) finger hair, 155 samples, (xvi) myopia, 152 samples, (xvii) irritable bowel syndrome, 142 samples, (xviii) index finger being longer than big thumb, 131 samples, (xix) photoptarmis, 133 samples, (xx) migraine, 129 samples, and (xxi) Rh protein, 311 samples. We used 1320 genomes which are associated with at least one of the listed phenotypes while training the models. Newly added donors are chosen from the individuals who have reported at least 10 out of the 21 considered phenotypes. We repeated each experiment 10 times with different sets of newly added donors. For each experiment, the remaining samples (except for the beacon participants and newly added donors) are used to train and validate the phenotype models.

7.2. Evaluation Metrics

We evaluated the precision and recall for the reconstruction of a victim’s SNPs based on the changes in beacon responses. For precision and recall, we defined the success as correctly inferring the SNPs of the victim with at least one minor allele. Thus, for the calculation of precision and recall, we defined (i) true positive as correctly inferring a SNP j of victim v with ${\hat{S}}_{j}^{v} = 1$ (with at least one minor allele); (ii) false positive as incorrectly assigning ${\hat{S}}_{j}^{v} = 1$ for v who is homozygous major at that locus; (iii) true negative as correctly inferring a SNP j of victim v with ${\hat{S}}_{j}^{v} = 0$ (with no minor allele, homozygous major); and (iv) false negative as incorrectly assigning ${\hat{S}}_{j}^{v} = 0$ for v who has at least one minor allele at that locus (i.e., heterozygous or homozygous minor).
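These definitions translate directly into code. The sketch below assumes both genomes are given as dictionaries mapping SNP identifiers to 0/1 states, with SNPs missing from the inferred genome treated as 0; the representation is an assumption for the example.

```python
def precision_recall(true_genome, inferred_genome):
    """Precision and recall of inferring SNPs with at least one minor allele."""
    tp = fp = fn = 0
    for snp, truth in true_genome.items():
        guess = inferred_genome.get(snp, 0)
        if guess == 1 and truth == 1:
            tp += 1   # correctly inferred minor allele
        elif guess == 1 and truth == 0:
            fp += 1   # victim is homozygous major at this locus
        elif guess == 0 and truth == 1:
            fn += 1   # missed minor allele
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

True negatives do not enter either metric, which is why the very large number of homozygous major positions does not inflate the reported scores.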

Furthermore, we quantified the success of identifying the victim’s genome among the reconstructed genomes in terms of the accuracy of the developed genotype-phenotype inference mechanism. We evaluated the accuracy of the ensemble classifier (to identify the victim’s genome from phenotype) using the reconstructed genomes of the newly added donors. Given the ensemble classifier f, the vector of indices $\vec{J} = (j_{1}, \dots, j_{m})$ of the best matching clusters for the newly added donors, the vector containing the SNP values of the i^th cluster ${\vec{S}}_{i} = ({\hat{S}}_{1}^{i}, \dots, {\hat{S}}_{k}^{i})$ , and the set of phenotypic traits of victim v, $P_{v} = (p_{1}^{v}, \dots, p_{t}^{v})$ , we computed the accuracy as $(\sum_{v = 1}^{m} 1_{(\underset{1 \leq i \leq m^{'}}{argmax} f ({\vec{S}}_{i}, P_{v})) = j_{v}}) / m$ . Finally, we used power analysis for the membership inference to show how the outcome of the genome reconstruction attack can be used for a membership inference attack. Power for the i^th query is calculated from a given set of l case individuals as $P^{i} = (\sum_{Λ_{j}^{i} \in Λ_{c a s e}^{i}} 1_{Λ_{j}^{i} < t_{α}^{i}}) / l$ , which is defined as the fraction of the cases that have a $Λ_{j}^{i}$ value less than $t_{α}^{i}$ , as described in Section 6.5. Then, the vector ${\vec{P}}^{n} = (P^{1}, \dots, P^{n})$ is plotted to see how the power changes with respect to a total of n queries. A higher power value represents a more successful attack.

7.3. Evaluation of Genome Reconstruction

First, using both OpenSNP and HapMap beacons and only focusing on genome reconstruction, we evaluated and compared the baseline method (in Section 6.1) and the proposed clustering-based approach (in Section 6.3) when the size of the beacon (n) is 50 and m = m’. Here, we assume that the attacker can identify the victim’s reconstructed genome among the other candidates. Later, we will also show that attacker can indeed identify this genome with high accuracy using public (i.e., not sensitive) phenotype information about the victim.

Figures 2 and 9 (in Appendix C) show the precision and recall of the reconstruction for various numbers of newly added donors (m) for the OpenSNP and HapMap beacons, respectively. Overall, we observed the success of the attack to be higher for the OpenSNP beacon. The reason for this is the limited data we had to build the correlation model for the HapMap dataset (we used 945 donors to build the correlation model for the OpenSNP beacon, while we could only use 110 donors to build it for the HapMap beacon). For both datasets, we used individuals that are not in the beacon to construct the correlation models. When we compared the correlation models that are constructed using the individuals that (i) are not in the beacon and (ii) are in the beacon, we observed that the correlation model constructed for the OpenSNP beacon is significantly more accurate (i.e., it is very close to the correlation model of the individuals that are in the OpenSNP beacon), mainly due to the number of individuals we used to create the model. Therefore, in the following, we mostly discuss the results we obtained from the OpenSNP beacon.

Fig. 2. — Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of newly added donors.

Fig. 9. — Precision and recall for the genome reconstruction of a newly added donor to HapMap beacon with varying number of newly added donors.

The results show that, on average, the identified attack using spectral clustering can reconstruct the victim’s genome with a precision close to 0.9 when the size of the beacon is increased by adding 3 people (i.e., a 6% increase in beacon size). We also obtained more than 0.7 precision and 0.8 recall even when the size of the beacon is increased by adding 10 people (i.e., a 20% increase in beacon size). This indicates a substantial privacy risk, especially if the reconstructed SNPs are tied to sensitive phenotypes. Also, the baseline algorithm (in Section 6.1) performs substantially worse than the proposed clustering-based approach. The results also show that spectral clustering-based genome reconstruction is slightly better than the fuzzy clustering-based approach. We observed that allowing a SNP (that includes at least one minor allele) to be in multiple bins results in a high number of false positives. Therefore, in the remainder of this section, we use spectral clustering-based genome reconstruction for the evaluations.

To show the benefit of utilizing a beacon (and a beacon update) in the genome reconstruction attack, we also computed the reconstruction accuracy of an attacker that only uses publicly available information (e.g., population statistics and the victim’s phenotype). As discussed, each victim we consider has a subset of the 21 phenotypes listed in Section 7.1. Using the associations of the victim’s phenotypes with the corresponding SNPs (extracted from SNPedia [8]), we assigned some SNP values of the victim. We observed that, on average, such a reconstruction achieves a precision of 18% and a recall of 47% on a total of 232 SNPs. Therefore, we conclude that having access to a beacon and knowing the membership of a victim to a beacon significantly increases the success of the genome reconstruction attack.

To show the effect of varying the number of bins (m’) in the genome reconstruction attack, in Figures 3 and 10 (in Appendix C), we show the attacker’s success when the number of newly added donors is m = 5 and the beacon size is n = 50 for the OpenSNP and HapMap beacons, respectively. We observed that for both beacons, precision increases and recall decreases with increasing m’. Also, as expected, precision and recall become balanced when m’ = m.

Fig. 3. — Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of bins/clusters (m’) in the genome reconstruction attack. Number of newly added donors (m) is 5.

Fig. 10. — Precision and recall for the genome reconstruction of a newly added donor to HapMap beacon with varying number of bins/clusters (m’) in the genome reconstruction attack. Number of newly added donors (m) is 5.

Next, in Figures 4 and 11 (in Appendix C), we show the effect of the beacon size (n) at time t when 5 new donors are added between times t and t + δ for the OpenSNP and HapMap beacons, respectively. Here, we assume that the number of bins (m’) is equal to the number of newly added donors (m). We observed that as the size of the beacon increases, both the precision and the recall of the reconstruction attack remain almost the same (for a fixed number of newly added donors).

Fig. 4. — Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of beacon size (n). Number of newly added donors m is 5 and m’ = m for all plots.

Fig. 11. — Precision and recall for the genome reconstruction of a newly added donor to HapMap beacon with varying number of beacon size (n). Number of newly added donors m is 5 and m’ = m for all plots.

Even if the success of the genome reconstruction remains high, the number of flipped responses (from “no” to “yes”) may decrease when the beacon size is increased (as shown in Figure 4). In other words, the number of vulnerable SNPs of a victim (the ones that can be inferred using the change in the beacon responses) decreases, and this might result in lower performance in the phenotype inference and membership inference parts of the attack. However, as the beacon size increases, low-MAF SNPs of the victim (which typically provide the most valuable information for the membership inference attack) still remain vulnerable, since with high probability such SNPs are not observed in other donors in the beacon. For example, in the previous experiment (in Figure 4), when the size of the beacon is increased from 50 to 400, the total number of vulnerable SNPs of a victim reduces by 94%, whereas the number of vulnerable SNPs of a victim with an MAF value smaller than 0.01 reduces by only 52%.

Keeping the ratio of newly added donors fixed (to 5%), we also observed the change in the success of the attack with increasing beacon size when m’ = m in Figure 5 (we did this evaluation only for the OpenSNP beacon since the HapMap beacon did not have more than 100 donors). We observed that, when the beacon size increases beyond 100, although the recall of the attacker still remains high, its precision starts decreasing. This shows that the success of the identified attack mainly relies on the number of clusters the attacker needs to generate (in the proposed clustering-based algorithm). For small or mid-size beacons (e.g., NBDC Human Database [4] with slightly more than 100 individuals), even if the beacon update significantly increases the beacon’s size, the identified attack is still effective. On the other hand, for large beacons (e.g., gnomAD [5], with more than 100K individuals), the update size should be small for the vulnerability to exist.

Fig. 5. — Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of beacon size (n). Number of newly added donors m is always 5% of the beacon size and m’ = m for all plots.

Finally, we explored the scenario in which the attacker only has a partial snapshot of the beacon (instead of a full snapshot). In Figure 6, we show the success of the reconstruction attack when m = 5 donors are added (at time t + δ) to the OpenSNP beacon of size 50, when the attacker knows varying fractions of the beacon’s snapshot at time t, and when m = m’. We observed that the success (precision and recall) of the reconstruction does not change with the fraction of the snapshot that is known. However, the number of inferred SNPs (as a result of the genome reconstruction attack) decreases linearly with the decreasing fraction of the snapshot that is known by the attacker at time t.

Fig. 6. — Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon when the attacker knows varying fractions of beacon’s snapshot. Number of newly added donors m is 5, beacon size n is 50 and m’ = m for all plots.

7.4. Identifying the Victim’s Genome Using Phenotype Inference

Here, we evaluate the success of the attacker in identifying the reconstructed genome of the victim among all reconstructed genomes using the algorithm in Section 6.4. Since HapMap dataset does not include phenotype information about the genome donors, we only use the OpenSNP beacon for this evaluation.

We employed and compared several machine learning models for genotype-phenotype associations, including: Logistic Regression [23], SVM [22], Multi-layer Perceptron [72], Random Forest [67], and XGBoost [17]. Among these, we obtained the highest classifier accuracy with the Random Forest, and hence all reported results are based on this model.

In Figure 7, we show the ensemble classifier accuracy for varying number of newly added donors to the beacon (here, we assumed m’ = m and we observed similar patterns when m’ ≠ m as well). We used the original genomes of individuals in the training dataset when building the model. For test, we used reconstructed genomes of the victims (that may have noise due to reconstruction error). Beacon size is 50 in these experiments (i.e., n = 50).

Fig. 7. — Classification accuracy of genotype inference from phenotype for varying number of newly added donors (m) to the beacon.

We observed that the proposed algorithm provides 70% accuracy when the size of the beacon is increased by adding 2 individuals in the update, and the accuracy slightly decreases with an increasing number of newly added donors. These results show that the attacker can identify the reconstructed genome of the victim among all m’ reconstructed genomes with high accuracy. As discussed before, in this experiment, we assumed the attacker has moderate auxiliary knowledge about the victim (i.e., phenotypic traits, which can be easily learned from the social network profiles of the victim). However, since genotype-phenotype associations are not strong yet, there is an accuracy bottleneck in the overall process due to this step. A stronger attacker (that has access to richer auxiliary information about the victim) may utilize the victim’s known mutations or the genomes of family members. In that case, the phenotype inference part is not required and the accuracy loss would not occur.

7.5. Using Genome Reconstruction in Membership Inference

In Section 7.3, we evaluated the success of the reconstruction and in Section 7.4, we showed that the attacker is able to identify the victim among many added donors with high accuracy. Here, we show a severe consequence of the proposed genome reconstruction attack, in which the outcome of the previous steps can be utilized in a membership inference attack. By doing so, we also explicitly explore the impacts of (i) incorrect inference of some SNPs during reconstruction and (ii) imperfect choice of the reconstructed genotype due to the use of genotype-phenotype associations in terms of the success of this membership inference attack.

We randomly constructed two non-overlapping beacons from the OpenSNP dataset: (i) B₁ includes 50, and (ii) B₂ includes 60 individuals. We assume that B₂ is associated with a privacy-sensitive phenotype and the goal of the attacker is to infer the membership of the victim to B₂. We also assume that m new individuals are added to B₁ at time t + δ and the victim is among these newly joined donors. The attacker only knows that the victim is among these m individuals that are added to B₁ at time t + δ along with a snapshot of B₁ at time t.

First, we applied the spectral clustering-based genome reconstruction (that provides the best performance in Section 7.3) to reconstruct the genomes of newly joined m donors to B₁. Then, we identified the reconstructed genome of the victim using phenotype information about the victim (as in Section 6.4). Finally, using the reconstructed genome of the victim, we conducted the membership inference attack on B₂ using the optimal attack (as described in Appendix A).

We used the identification accuracy of the method in Section 6.4 (reported in Section 7.4) to construct and infer victims’ genomes for the alternate and null hypotheses. For instance, when m = 2 we have 70% identification accuracy. In this scenario, 14 of the 20 genomes are chosen from correctly identified reconstructed genomes, while the remaining 6 genomes are chosen from incorrectly identified reconstructed genomes for the corresponding victims.

In Figure 8, we show the power plots of this attack with a varying number of newly added donors (m) to beacon B₁. As expected, with decreasing values of m, the power increases faster since the accuracy of genome reconstruction increases (and hence the error rate of the membership inference attack decreases). For instance, when the victim is the only newly added donor to beacon B₁ (m = 1), the attacker can reconstruct their genome and then infer the victim’s membership in beacon B₂ with very high confidence (100% power) in just slightly more than 15 queries. We also observed that when m is increased, the power decreases, yet it still reaches 0.8 with approximately 80 queries when 2 individuals are added. These results show that the attacker may confidently conduct membership inference attacks as a result of genome reconstruction even though the input to the membership inference is subject to many sources of uncertainty.

Fig. 8. — Power of membership inference attack on beacon B₂ with varying number of newly added donors (m) to beacon B₁.
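As a rough illustration of how a power value such as those in Figure 8 can be estimated, the sketch below computes empirical power from simulated test statistics. It assumes the usual likelihood-ratio setup of Appendix A, in which the null hypothesis is rejected when the statistic Λ falls below a threshold set from the null distribution. The function name `empirical_power` and the 5% false-positive rate are assumptions for illustration and are not stated in this section.

```python
import math

def empirical_power(null_stats, alt_stats, alpha=0.05):
    """Estimate power at false-positive rate alpha.

    null_stats: statistics for genomes known NOT to be in the beacon.
    alt_stats:  statistics for genomes known to be in the beacon.
    The null is rejected when a statistic is strictly below the
    threshold, so at most a fraction alpha of null cases is rejected
    (assuming no ties).
    """
    ordered = sorted(null_stats)
    k = min(math.floor(alpha * len(ordered)), len(ordered) - 1)
    threshold = ordered[k]
    rejected = sum(1 for s in alt_stats if s < threshold)
    return rejected / len(alt_stats)

# Toy numbers only; real curves repeat this for each number of queries.
null_stats = [float(v) for v in range(20)]
alt_stats = [-5.0, -3.0, 0.5, 10.0]
power = empirical_power(null_stats, alt_stats, alpha=0.05)
```

Repeating this computation after each additional query yields one point of a power curve per query count.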

8. Discussion

This work pinpoints a new information leak and identifies beacon updates as a new risk, which leads to genome reconstruction attacks. We show that an attacker can efficiently and accurately link this new vulnerability to a membership inference attack. Furthermore, we recently observed that some beacons even report the number of occurrences for a “yes” response (e.g., Sinai Health System Beacon in [2]). Using such information in the identified attack would further improve the accuracy of the proposed clustering-based algorithm (in Section 6.3). We will explore this in future work.

8.1. Extension of the Proposed Attack

For the proposed genome reconstruction algorithm in Section 6.3, we only focused on the “no” responses of the beacon at time t that turn to “yes” at time t + δ (i.e., no-yes responses), since such responses reveal the SNPs at which the victim carries minor alleles, and minor alleles are typically the indicators of privacy-sensitive information about individuals. As a result, we also only considered the correlations between such SNPs of a victim. As briefly discussed before, no-no responses also provide deterministic information to the attacker (the victim certainly does not have a minor allele at such SNP positions). Furthermore, using the information from no-no responses, the attacker can utilize the correlations between such SNPs (with no minor alleles) and others. In this work, we did not consider the no-no responses in the attack since (i) there is typically an excessive number of no-no responses in a beacon, which makes computing all the pairwise correlations between such SNPs a computational burden, and (ii) SNPs with no minor alleles (learned from no-no responses) are typically not highly correlated with the SNPs with minor alleles, which are instrumental to the attacker for the membership inference attack. We will further consider the impact of such no-no responses in future work.

8.2. Donors Leaving the Beacon

In Sections 6 and 7, we presented and evaluated the identified vulnerability by only considering donors newly joining the beacon. It is also possible that existing donors leave the beacon. Such a scenario can be addressed using the identified attack mechanism. Considering the donors that leave the beacon brings up two different scenarios: (i) the victim is among the newly joined donors (while there are also donors leaving the beacon between times t and t + δ) and (ii) the victim is among the donors that leave the beacon (while there may be other donors leaving or joining the beacon between times t and t + δ).

Scenario (i) is no different from what we discussed in Section 6. The number of “no” responses at time t that turn to “yes” at time t + δ does not change due to the donors leaving the beacon. On the other hand, some “yes” responses at time t may turn to “no” at time t + δ due to the donors leaving the beacon. However, such responses do not provide information about the minor alleles of the victim, and hence we do not consider such responses in this work. In scenario (ii), “yes” responses at time t that turn to “no” at time t + δ will provide information about the minor alleles of the victim (and other donors that leave the beacon during that time interval). Using such responses, one will need to run the algorithms proposed in Section 6 to reconstruct the genome of the victim.
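The response transitions discussed in Sections 8.1 and 8.2 can be made concrete with a small sketch that compares two beacon snapshots. It is a minimal illustration under stated assumptions: each snapshot is represented as a dictionary from a SNP identifier to the beacon's yes/no answer, and the function name `classify_transitions` and the example SNP identifiers are hypothetical.

```python
def classify_transitions(before, after):
    """Group SNPs by how the beacon's answer changed between time t
    (before) and time t + delta (after). True means "yes"."""
    labels = {
        (False, True): "no-yes",   # minor allele introduced by a joining donor
        (False, False): "no-no",   # no donor, old or new, has the minor allele
        (True, False): "yes-no",   # minor allele carried only by leaving donors
        (True, True): "yes-yes",   # uninformative about joining donors
    }
    groups = {name: [] for name in labels.values()}
    # Only SNPs queried in both snapshots can be compared.
    for snp in sorted(before.keys() & after.keys()):
        groups[labels[(before[snp], after[snp])]].append(snp)
    return groups

before = {"rs1": False, "rs2": False, "rs3": True, "rs4": True}
after = {"rs1": True, "rs2": False, "rs3": False, "rs4": True}
groups = classify_transitions(before, after)
```

In scenario (i) the attacker would feed the `no-yes` group into the reconstruction, whereas in scenario (ii) the `yes-no` group plays that role.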

8.3. Risk Quantification for the Genome Reconstruction Attack

The identified vulnerability and the proposed attack algorithm can be used as a privacy risk quantification tool by the beacon operator. For this, we foresee a simulation-based technique to quantify the risk and show it to the beacon operator. This will be a customized technique for each donor in the beacon, and the following discussion is for one particular donor. Assume that a total of m new donors are gathered by the beacon between times t and t + δ. To quantify the genome reconstruction risk, one may run the attack we introduced in Section 6, pretending the donor is added to the beacon along with the other (m − 1) newcomer donors, and compute the fraction of the SNPs that can be reconstructed. Then, using public sources (such as HapMap), one can gather a small number (e.g., s) of genomes belonging to individuals from the same population as the donor. The same attack can then be run for the selected s people (i.e., adding each random individual along with the other (m − 1) newcomer donors), their reconstruction rates can be set as the baseline, and eventually, a privacy risk percentile can be provided for the donor. Moreover, for all correctly inferred SNPs, one can perform a pathogenic scan on ClinVar [48] to inform the donor about what traits they might be linked to should their genome be put onto the beacon. Using this information and based on the privacy risk of the donor, either the donor or the beacon operator will decide whether or not to add the donor to the beacon at time t + δ. This process can be repeated for all the newcomer donors.
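The simulation-based procedure above can be sketched as follows. This is a hedged sketch rather than a definitive implementation: `run_attack` is a hypothetical callback standing in for the reconstruction attack of Section 6 (it is assumed to return the fraction of SNPs reconstructed for the first genome in the batch), and defining the percentile as the share of baseline individuals with a strictly lower reconstruction rate is also an assumption.

```python
def risk_percentile(donor_rate, baseline_rates):
    """Percentage of baseline individuals whose reconstruction rate is
    strictly lower than the donor's (higher means more at risk)."""
    below = sum(1 for rate in baseline_rates if rate < donor_rate)
    return 100.0 * below / len(baseline_rates)

def quantify_risk(donor, other_newcomers, baseline_genomes, run_attack):
    """Simulate adding the donor, then each of the s baseline genomes,
    together with the same (m - 1) other newcomers."""
    donor_rate = run_attack([donor] + other_newcomers)
    baseline_rates = [run_attack([genome] + other_newcomers)
                      for genome in baseline_genomes]
    return risk_percentile(donor_rate, baseline_rates)

# Stub attack for demonstration only: reads a precomputed rate.
def stub_attack(batch):
    return batch[0]["rate"]

donor = {"rate": 0.6}
others = [{"rate": 0.5}, {"rate": 0.3}]          # the other (m - 1) newcomers
baseline = [{"rate": r} for r in (0.2, 0.4, 0.7, 0.9)]  # s = 4 genomes
percentile = quantify_risk(donor, others, baseline, stub_attack)
```

Here the donor's reconstruction rate exceeds that of two of the four baseline genomes, so the reported percentile is 50.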

We foresee that using such a quantification algorithm, a potential beacon participant can provide informed consent about how (and what portion of) their data can be used by the beacons (e.g., when the beacon can start using their data in its responses or when the beacon should stop using their data). Similarly, such a tool can guide a beacon operator on the number of participants to include in a batch to update the beacon.

8.4. Mitigation Techniques

To mitigate membership inference attacks against beacons, several countermeasures have been proposed [9, 59, 65]. However, most of such techniques directly reduce the utility of the beacon without carefully analyzing the balance between privacy (of beacon participants) and utility (of beacon responses). Thus, we believe that existing countermeasures proposed for membership inference are not directly applicable to mitigating genome reconstruction attacks. To mitigate genome reconstruction, here we suggest three simple methods: (i) updating the beacon content considering the beacon size and the size of the update. For instance, as we showed in Section 7.3, for small and mid-size beacons, even large-sized updates create a vulnerability, while for large-size beacons, only small-sized updates pose a threat; (ii) adding (or removing) donors after quantifying their risks against genome reconstruction (as discussed in Section 8.3); and (iii) adjusting the diversity of the beacon to have beacons with mixed ethnicity genome donors. For beacons with mixed ethnicity donors, it is hard to construct the correlation model (unless the beacon discloses the ethnicities of the donors as metadata), and hence it is hard to conduct the proposed correlation-based genome reconstruction attacks. It is worth noting that the OpenSNP beacon in our evaluations was a diverse one; however, we also created the correlation model from the same diverse population (i.e., in our settings, the attacker had access to a very similar population to the target population). We will further work on more sophisticated countermeasures in future work.

9. Conclusion

Thus far, the only privacy vulnerability that has been identified for beacons was membership inference. We have identified and, via extensive analysis, shown the impact of another serious privacy concern for beacons: genome reconstruction. We showed the practicality of the identified privacy concern in real life by showing the whole attack strategy, including genotype-phenotype inference. Furthermore, we showed how the genome reconstruction attack can be used together with membership inference to identify privacy-sensitive phenotypes of individuals.

10. Acknowledgement

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM013429.

Appendices

A. Membership Inference Attack Against Genomic Data-Sharing Beacons

In [59], Raisaro et al. introduced the “optimal attack” using the same attacker assumptions discussed in Section 2.2. In the optimal attack, the attacker constructs a set of candidate SNPs S to be queried and submits queries starting from the SNP with the lowest MAF. Let the null hypothesis (H₀) refer to the case in which the queried genome is not in the beacon and the alternative hypothesis (H₁) be the case in which the queried genome is a member of the beacon. In [59], the log-likelihoods (L) under the null and alternative hypotheses are given as follows:

$$L_{H_{0}}(R) = \sum_{i=1}^{n} \left[ x_{i} \log\left(1 - D_{N}^{i}\right) + (1 - x_{i}) \log\left(D_{N}^{i}\right) \right] \tag{1}$$

$$L_{H_{1}}(R) = \sum_{i=1}^{n} \left[ x_{i} \log\left(1 - \gamma D_{N-1}^{i}\right) + (1 - x_{i}) \log\left(\gamma D_{N-1}^{i}\right) \right], \tag{2}$$

where R is the response set, x_i is the answer of the beacon to the query at position i (1 for “yes”, 0 for “no”), and γ represents a small probability that the attacker’s copy of the victim’s genome does not match the beacon’s copy for a locus (e.g., due to a difference in the variant calling pipeline). n is the number of posed queries. $D_{N}^{i}$ is the probability that none of the N individuals in the beacon has the queried allele at position i, and $D_{N-1}^{i}$ is the probability that no individual except for the queried person has the queried allele at position i. The computations of $D_{N-1}^{i}$ and $D_{N}^{i}$ depend on the queried position i and they change at each query as follows: $D_{N-1}^{i} = (1 - f_{i})^{2N-2}$ and $D_{N}^{i} = (1 - f_{i})^{2N}$, where f_i represents the MAF of the SNP at position i. The likelihood-ratio test (LRT) statistic, Λ, is then determined as

$$\Lambda = \sum_{i=1}^{n} \left[ \log\left(\frac{D_{N}^{i}}{\gamma D_{N-1}^{i}}\right) + \log\left(\frac{\gamma D_{N-1}^{i} \left(1 - D_{N}^{i}\right)}{D_{N}^{i} \left(1 - \gamma D_{N-1}^{i}\right)}\right) x_{i} \right].$$
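For concreteness, the statistic can be computed directly as the difference of the two log-likelihoods, which is algebraically equal to the expression for Λ above. The following is a minimal sketch: the function name `lrt_statistic`, the default value of the mismatch probability (the small probability that the attacker's copy differs from the beacon's copy) and the example inputs are illustrative assumptions, not values from the paper.

```python
import math

def lrt_statistic(responses, mafs, n_individuals, mismatch=1e-6):
    """Compute Lambda = L_H0(R) - L_H1(R) for a sequence of queries.

    responses:     beacon answers x_i (1 for "yes", 0 for "no")
    mafs:          minor allele frequencies f_i, each strictly in (0, 1)
    n_individuals: beacon size N
    mismatch:      probability that the attacker's copy of the genome
                   differs from the beacon's copy at a locus (assumed value)
    """
    total = 0.0
    for x, f in zip(responses, mafs):
        d_n = (1.0 - f) ** (2 * n_individuals)            # D_N^i
        d_n_minus_1 = (1.0 - f) ** (2 * n_individuals - 2)  # D_{N-1}^i
        l_h0 = x * math.log(1.0 - d_n) + (1 - x) * math.log(d_n)
        l_h1 = (x * math.log(1.0 - mismatch * d_n_minus_1)
                + (1 - x) * math.log(mismatch * d_n_minus_1))
        total += l_h0 - l_h1
    return total

# A "yes" on a rare SNP pushes the statistic down (towards membership);
# a "no" pushes it strongly up (towards non-membership).
stat_yes = lrt_statistic([1], [0.01], n_individuals=60)
stat_no = lrt_statistic([0], [0.01], n_individuals=60)
```

Smaller (more negative) values favor H₁, since each “yes” on a low-MAF SNP is far more likely when the queried genome is in the beacon.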

B. Baseline Approach for Genome Reconstruction

The details of this baseline approach for genome reconstruction (described in Section 6.1) are shown in Algorithm 2.

C. Evaluation of Genome Reconstruction on the HapMap Beacon

In Figure 9, we show the success (precision, recall, and accuracy) of the reconstruction for various numbers of newly added donors (m) in the HapMap beacon. In Figure 10, we show the effect of varying the number of bins (m’) in the genome reconstruction attack when the number of newly added donors (m) is 5 for the HapMap beacon. Next, in Figure 11, we show the effect of the beacon size (n) at time t when 5 new donors are added between times t and t + δ for the HapMap beacon.

Algorithm 2: Baseline Algorithm for Genome Reconstruction Attack (provided as an image in the original; not reproduced here).

Contributor Information

Kerem Ayoz, Bilkent University.

Erman Ayday, Case Western Reserve University.

A. Ercument Cicek, Bilkent University, Carnegie Mellon University.

References

[1] 2020. https://www.ga4gh.org/about-us/. [Online; accessed 10-January-2020].
[2] 2020. http://beacon-network.org. [Online; accessed 10-January-2020].
[3] 2020. https://ghr.nlm.nih.gov/primer/genomicresearch/snp. [Online; accessed 10-January-2020].
[4] 2020. https://humandbs.biosciencedbc.jp/en/hum0029-v1. [Online; accessed 03-December-2020].
[5] 2020. https://gnomad.broadinstitute.org/. [Online; accessed 03-December-2020].
[6] 2020. Risk Disease. http://www.eupedia.com/genetics/medical_dna_test.shtml [Online; accessed 10-January-2020].
[7] 2020. OpenSNP. http://opensnp.org. [Online; accessed 10-January-2020].
[8] 2020. SNPedia. https://www.snpedia.com/. [Online; accessed 10-January-2020].
[9] Momin Al Aziz Md, Ghasemi Reza, Waliullah Md, and Mohammed Noman. 2017. Aftermath of Bustamante attack on genomic beacon service. BMC Medical Genomics 10, 2 (2017), 43.
[10] Lango Allen Hana, Estrada Karol, Lettre Guillaume, Berndt Sonja I, Weedon Michael N, Rivadeneira Fernando, Willer Cristen J, Jackson Anne U, Vedantam Sailaja, Raychaudhuri Soumya, et al. 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 7317 (2010), 832–838.
[11] Ayday Erman, De Cristofaro Emiliano, Hubaux Jean-Pierre, and Tsudik Gene. 2013. The chills and thrills of whole genome sequencing. (2013).
[12] Ayday Erman, Raisaro Jean Louis, Hubaux Jean-Pierre, and Rougemont Jacques. 2013. Protecting and evaluating genomic privacy in medical tests and personalized medicine. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society. 95–106.
[13] Baldi Pierre, Baronio Roberta, De Cristofaro Emiliano, Gasti Paolo, and Tsudik Gene. 2011. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. In Proceedings of the 18th ACM conference on Computer and communications security. 691–702.
[14] Bezdek James C, Ehrlich Robert, and Full William. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2–3 (1984), 191–203.
[15] Blanton Marina, Atallah Mikhail J, Frikken Keith B, and Malluhi Qutaibah. 2012. Secure and efficient outsourcing of sequence comparisons. In Proceedings of European Symposium on Research in Computer Security. 505–522.
[16] Bowyer Kevin W., Chawla Nitesh V., Hall Lawrence O., and Kegelmeyer W. Philip. 2011. SMOTE: Synthetic Minority Over-sampling Technique. CoRR abs/1106.1813 (2011). arXiv:1106.1813 http://arxiv.org/abs/1106.1813
[17] Chen Tianqi and Guestrin Carlos. 2016. XGBoost: A Scalable Tree Boosting System. CoRR abs/1603.02754 (2016). arXiv:1603.02754 http://arxiv.org/abs/1603.02754
[18] Claes Peter, Liberton Denise K, Daniels Katleen, Matthes Rosana Kerri, Quillen Ellen E, Pearson Laurel N, McEvoy Brian, Bauchet Marc, Zaidi Arslan A, Yao Wei, et al. 2014. Modeling 3D facial shape from DNA. PLoS Genetics 10, 3 (2014).
[19] Clayton David. 2010. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11, 4 (2010), 661–673.
[20] Collins Francis S and Varmus Harold. 2015. A new initiative on precision medicine. New England Journal of Medicine 372, 9 (2015), 793–795.
[21] International HapMap Consortium et al. 2003. The international HapMap project. Nature 426, 6968 (2003), 789.
[22] Cortes Corinna and Vapnik Vladimir. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
[23] Cramer JS. 2002. The Origins of Logistic Regression. Tinbergen Institute, Tinbergen Institute Discussion Papers (01 2002). 10.2139/ssrn.360300
[24] De Cristofaro Emiliano, Faber Sky, and Tsudik Gene. 2013. Secure Genomic Testing with Size- and Position-hiding Private Substring Matching. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society.
[25] Deznabi Iman, Mobayen Mohammad, Jafari Nazanin, Tastan Oznur, and Ayday Erman. 2018. An inference attack on genomic data using kinship, complex correlations, and phenotype information. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 15, 4 (2018), 1333–1343.
[26] Dwork Cynthia. 2006. Differential Privacy. Proceedings of the 33rd International Conference on Automata, Languages and Programming (2006).
[27] Erlich Yaniv and Narayanan Arvind. 2014. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics 15, 6 (2014), 409–421.
[28] Fienberg Stephen E, Slavkovic Aleksandra, and Uhler Caroline. 2011. Privacy preserving GWAS data sharing. In IEEE 11th International Conference on Data Mining Workshops (ICDMW). 628–635.
[29] Gibbs Richard A, Belmont John W, Hardenbol Paul, Willis Thomas D, Yu Fuli, Yang Huanming, Ch’ang Lan-Yang, Huang Wei, Liu Bin, Shen Yan, et al. 2003. The international HapMap project. Nature 426, 6968 (2003), 789–796.
[30] Gitschier Jane. 2009. Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project. American Journal of Human Genetics 84, 2 (2009), 251–258.
[31] Glusman Gustavo, Caballero Juan, Mauldin Denise E, Hood Leroy, and Roach Jared C. 2011. Kaviar: an accessible system for testing SNV novelty. Bioinformatics 27, 22 (2011), 3216–3217.
[32] Greshake Bastian, Bayer Philipp E, Rausch Helge, and Reda Julia. 2014. OpenSNP–a crowdsourced web resource for personal genomics. PLoS One 9, 3 (2014), e89204.
[33] Gymrek Melissa, McGuire Amy L, Golan David, Halperin Eran, and Erlich Yaniv. 2013. Identifying personal genomes by surname inference. Science 339, 6117 (2013), 321–324.
[34] Hagestedt Inken, Zhang Yang, Humbert Mathias, Berrang Pascal, Tang Haixu, Wang XiaoFeng, and Backes Michael. 2019. MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24–27, 2019. https://www.ndsssymposium.org/ndss-paper/mbeacon-privacy-preservingbeacons-for-dna-methylation-data/
[35] Hayden Erika Check. 2013. Privacy protections: The genome hacker. Nature 497 (2013), 172–174.
[36] Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, and Craig David W. 2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008).
[37] Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, and Craig David W. 2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, 8 (2008).
[38] Humbert Mathias, Ayday Erman, Hubaux Jean-Pierre, and Telenti Amalio. 2013. Addressing the concerns of the Lacks family: quantification of kin genomic privacy. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1141–1152.
[39] Humbert Mathias, Huguenin Kévin, Hugonot Joachim, Ayday Erman, and Hubaux Jean-Pierre. 2015. De-anonymizing Genomic Databases Using Phenotypic Traits. Proceedings on Privacy Enhancing Technologies 2015 (2015), 99–114.
[40] Humbert Mathias, Huguenin Kévin, Hugonot Joachim, Ayday Erman, and Hubaux Jean-Pierre. 2015. De-anonymizing genomic databases using phenotypic traits. Proceedings on Privacy Enhancing Technologies 2015, 2 (2015), 99–114.
[41] Im Hae Kyung, Gamazon Eric R, Nicolae Dan L, and Cox Nancy J. 2012. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. American Journal of Human Genetics 90, 4 (2012), 591–598.
[42] Jacobs Kevin B, Yeager Meredith, Wacholder Sholom, Craig David, Kraft Peter, Hunter David J, Paschal Justin, Manolio Teri A, Tucker Margaret, Hoover Robert N, et al. 2009. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nature genetics 41, 11 (2009), 1253–1257.
[43] Jha Somesh, Kruger Louis, and Shmatikov Vitaly. 2008. Towards practical privacy for genomic computation. In Proceedings of IEEE Symposium on Security and Privacy. 216–230.
[44] Johnson Aaron and Shmatikov Vitaly. 2013. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1079–1087.
[45] Kale Gulce, Ayday Erman, and Tastan Öznur. 2017. A utility maximizing and privacy preserving approach for protecting kinship in genomic databases. Bioinformatics (2017).
[46] Kayser Manfred and de Knijff Peter. 2011. Improving human forensics through advances in genetics, genomics and molecular biology. Nature Reviews Genetics 12, 3 (2011), 179–192.
[47] Kuhn Harold W. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly 2, 1–2 (1955), 83–97.
[48] Landrum Melissa J, Lee Jennifer M, Benson Mark, Brown Garth R, Chao Chen, Chitipiralla Shanmuga, Gu Baoshan, Hart Jennifer, Hoffman Douglas, Jang Wonhee, et al. 2017. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic acids research 46, D1 (2017), D1062–D1067.
[49] Ledford H. 2016. AstraZeneca launches project to sequence 2 million genomes. Nature 532, 7600 (2016), 427.
[50] Lin Z, Owen AB, and Altman RB. 2004. Genomic research and human subject privacy. Science 305, 5681 (July 2004), 183.
[51] Lippert Christoph, Sabatini Riccardo, Maher M. Cyrus, Kang Eun Yong, Lee Seunghak, Arikan Okan, Harley Alena, Bernal Axel, Garst Peter, Lavrenko Victor, Yocum Ken, Wong Theodore, Zhu Mingfu, Yang Wen-Yun, Chang Chris, Lu Tim, Lee Charlie W. H., Hicks Barry, Ramakrishnan Smriti, Tang Haibao, Xie Chao, Piper Jason, Brewerton Suzanne, Turpaz Yaron, Telenti Amalio, Roby Rhonda K., Och Franz J., and Venter J. Craig. 2017. Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences (2017). 10.1073/pnas.1711125114
[52] Liu Fan, van der Lijn Fedde, Schurmann Claudia, Zhu Gu, Chakravarty M Mallar, Hysi Pirro G, Wollstein Andreas, Lao Oscar, de Bruijne Marleen, Ikram M Arfan, et al. 2012. A genome-wide association study identifies five loci influencing facial morphology in Europeans. PLoS Genetics 8, 9 (2012).
[53] Malin Bradley A. and Sweeney Latanya. 2004. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics 37, 3 (2004), 179–192.
[54] Manning Alisa K, Hivert Marie-France, Scott Robert A, Grimsby Jonna L, Bouatia-Naji Nabila, Chen Han, Rybin Denis, Liu Ching-Ti, Bielak Lawrence F, Prokopenko Inga, et al. 2012. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nature Genetics 44, 6 (2012), 659–669.
[55] Naveed Muhammad, Agrawal Shashank, Prabhakaran Manoj, Wang XiaoFeng, Ayday Erman, Hubaux Jean-Pierre, and Gunter Carl. 2014. Controlled Functional Encryption. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security.
[56] Naveed Muhammad, Ayday Erman, Clayton Ellen W, Fellay Jacques, Gunter Carl A, Hubaux Jean-Pierre, Malin Bradley A, and Wang XiaoFeng. 2015. Privacy in the genomic era. ACM Computing Surveys (CSUR) 48, 1 (2015), 6.
[57] Ng Andrew Y, Jordan Michael I, and Weiss Yair. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849–856.
[58] Ou Xue-ling, Gao Jun, Wang Huan, Wang Hong-sheng, Lu Huiling, and Sun Hong-yu. 2012. Predicting human age with bloodstains by sjTREC quantification. PloS ONE 7, 8 (2012).
[59] Raisaro Jean L, Tramer Florian, Ji Zhanglong, Bu Diyue, Zhao Yongan, Carey Knox, Lloyd David, Sofia Heidi, Baker Dixie, Flicek Paul, Shringarpure Suyash S, Bustamante Carlos D, Wang Shuang, Jiang Xiaoqian, Ohno-Machado Lucila, Tang Haixu, Wang XiaoFeng, and Hubaux Jean-Pierre. 2016. Addressing Beacon Re-Identification Attacks: Quantification and Mitigation of Privacy Risks. The Journal of the American Medical Informatics Association 24, 4 (2016), 799–805.
[60] Rodriguez Mayra Z, Comin Cesar H, Casanova Dalcimar, Bruno Odemir M, Amancio Diego R, Costa Luciano da F, and Rodrigues Francisco A. 2019. Clustering algorithms: A comparative approach. PloS one 14, 1 (2019), e0210236.
[61] Salem A, Bhattacharyya Apratim, Backes M, Fritz M, and Zhang Y. 2020. Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning. ArXiv abs/1904.01067 (2020).
[62] Samani Sahel Shariati, Huang Zhicong, Ayday Erman, Elliot Mark, Fellay Jacques, Hubaux Jean-Pierre, and Kutalik Zoltán. 2015. Quantifying genomic privacy via inference attack with high-order SNV correlations. In Security and Privacy Workshops (SPW), 2015. IEEE. 32–40.
[63] Sankararaman Sriram, Obozinski Guillaume, Jordan Michael I, and Halperin Eran. 2009. Genomic privacy and limits of individual detection in a pool. Nature Genetics 41, 9 (2009), 965–967.
[64] Schatz Michael C. 2015. Biological data sciences in genome research. Genome Research 25, 10 (2015), 1417–1422.
[65] Shringarpure Suyash S and Bustamante Carlos D. 2015. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics 97, 5 (2015), 631–646.
[66] Sweeney Latanya, Abu Akua, and Winn Julia. 2013. Identifying participants in the personal genome project by name. arXiv preprint arXiv:1304.7605 (2013).
[67] Ho Tin Kam. 1995. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, Vol. 1. 278–282.
[68] Tramer Florian, Huang Zhicong, Hubaux Jean-Pierre, and Ayday Erman. 2015. Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies. In Proceedings of ACM Conference on Computer and Communications Security (CCS). 1286–1297.
[69] Troncoso-Pastoriza Juan Ramón, Katzenbeisser Stefan, and Celik Mehmet. 2007. Privacy preserving error resilient DNA searching through oblivious automata. Proceedings of ACM CCS ’07 (2007).
[70] Verizon. 2021. Verizon Fios Home Internet. https://www.verizon.com/home/fios-fastest-internet/
[71] Visscher Peter M and Hill William G. 2009. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet 5, 10 (2009).
[72] von der Malsburg Christoph. 1986. Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Brain Theory (01 1986), 245–248. 10.1007/978-3-642-70911-1_20
[73] von Thenen Nora, Ayday Erman, and Cicek A Ercument. 2018. Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics 35, 3 (2018), 365–371.
[74] Walsh Susan, Liu Fan, Ballantyne Kaye N, van Oven Mannis, Lao Oscar, and Kayser Manfred. 2011. IrisPlex: a sensitive DNA tool for accurate prediction of blue and brown eye colour in the absence of ancestry information. Forensic Science International: Genetics 5, 3 (2011), 170–180.
[75] Wang Rui, Li Yong Fuga, Wang XiaoFeng, Tang Haixu, and Zhou Xiaoyong. 2009. Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS ’09). Association for Computing Machinery, New York, NY, USA, 534–544. 10.1145/1653662.1653726
[76] Yu Fei, Fienberg Stephen E, Slavković Aleksandra B, and Uhler Caroline. 2014. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics 50 (2014), 133–141.
[77] Zhou Xiaoyong, Peng Bo, Li Yong Fuga, Chen Yangyi, Tang Haixu, and Wang XiaoFeng. 2011. To release or not to release: Evaluating information leaks in aggregate human-genome data. ESORICS’11: Proc. of the 16th European Conf. on Research in Computer Security (2011), 607–627.
[78] Zubakov Dmitry, Liu Fan, Van Zelm MC, Vermeulen J, Oostra BA, Van Duijn CM, Driessen GJ, Van Dongen JJM, Kayser Manfred, and Langerak AW. 2010. Estimating human age from T-cell DNA rearrangements. Current Biology 20, 22 (2010), R970–R971.
