Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2025 May 19;41(6):btaf273. doi: 10.1093/bioinformatics/btaf273

Beacon Reconstruction Attack: Reconstruction of genomes in genomic data-sharing beacons using summary statistics

Kousar Saleem 1, A Ercument Cicek 2,, Sinem Sav 3,
Editor: Inanc Birol
PMCID: PMC12133290  PMID: 40388204

Abstract

Motivation

Genomic data-sharing beacon protocol, developed by the Global Alliance for Genomics and Health, offers a privacy-preserving mechanism for querying genomic datasets while restricting direct data access. Despite their design, beacons remain vulnerable to privacy attacks. This study introduces a novel privacy vulnerability of the protocol: one can reconstruct large portions of the genomes of all beacon participants by only using the summary statistics reported by the protocol.

Results

We introduce a novel optimization-based algorithm that leverages beacon responses and SNP correlations for reconstruction. By optimizing for the SNP correlations and allele frequencies, the proposed approach achieves genome reconstruction with a substantially higher F1-score (70%) compared to baseline methods (45%) on beacons generated using individuals from the HapMap and OpenSNP datasets. We show that reconstructed genomes can be used by downstream applications such as in membership inference attacks against other beacons. Our findings reveal that beacons releasing allele frequencies substantially increase the reconstruction risk, underscoring the need for enhanced privacy-preserving mechanisms to protect genomic data.

Availability and implementation

Our implementation is available at https://github.com/ASAP-Bilkent/Beacon-Reconstruction-Attack.

1 Introduction

The rapid advancements in genomic research have led to unprecedented access to vast amounts of genomic data, enabling breakthroughs in precision medicine, disease prediction, and population genetics. However, sharing such sensitive data comes with significant privacy challenges (Jacobs et al. 2009, Clayton 2010). Genomic data beacons, a widely adopted approach for secure genomic data sharing, aim to balance the need for data accessibility with the requirement for individual privacy. Introduced by the Global Alliance for Genomics and Health (GA4GH), beacons provide an interface for researchers to query datasets to ask whether a specific allele at a specific position exists in the dataset. The protocol responds with a simple “yes/no” answer. If the answer is “yes,” it can also report the frequency of the allele in the underlying datasets. Several beacons use this feature and report allele frequencies such as Elixir reference beacon (https://github.com/ga4gh-beacon/beacon-elixir), THL Biobank Beacon (https://thl.fi/en/-/thl-biobank-s-new-beacon-discovery-service-includes-20-million-genomic-variants), and Progenetix Beacon+ (https://beaconplus.progenetix.org/). This mechanism enables collaboration without direct access to the data itself, reducing the risks of exposing sensitive genomic information.

Genomic beacons are not immune to privacy attacks, despite being designed to enable secure querying of genomic databases. One of the most notable threats is the membership inference attacks, in which an adversary attempts to determine whether a particular individual’s genomic data is in the queried dataset. The adversary can accumulate information through subsequent queries, exploiting statistical patterns to make inferences about the composition of the dataset. This presents substantial privacy challenges, particularly because beacons are often linked to specific sensitive phenotypes such as the Autism Speaks MSSNG Dataset (Yuen et al. 2017). Several works have introduced approaches based on likelihood ratio test (LRT) to launch membership inference attacks against beacons (Shringarpure and Bustamante 2015, Raisaro et al. 2017, Von Thenen et al. 2019). Several countermeasures have been proposed against these attacks with techniques including flipping responses (Wan et al. 2017), query budgets for users (Raisaro et al. 2017), including relatives of the participants in the databases (Ayoz et al. 2020), differential privacy (DP) (Cho et al. 2020), game theory (Venkatesaramani et al. 2023a,b, Poorghaffar et al. 2024), and reinforcement learning (Poorghaffar et al. 2024). In addition to membership inference attacks, beacons are also vulnerable to genome reconstruction attacks. Wang et al. (2009) showed years before the introduction of the protocol that releasing allele frequencies can lead to reconstruction of haplotypes using integer programming. Then, Bu et al. (2021) further advanced this and showed that using beacon responses it is possible to reconstruct even novel haplotypes in a population. Finally, Ayoz et al. (2021) leveraged SNP correlations along with clustering techniques to reconstruct large portions of individual genomes using the responses of the beacon system. The attack exploits the dynamic nature of a beacon in which participants are added/removed over time, and the changes in responses must belong to the added/removed beacon participants.

In this study, we show that the summary statistics reported by the beacon reveal more than expected by design about the genomes within the database, and we introduce a new genome reconstruction attack. For the first time, we demonstrate that an adversary can reconstruct the genomes of all individuals in a beacon database by using only the beacon responses. The attack is not restricted to haploblocks and can scale beyond thousands of SNPs in the genome by formulating the problem as an optimization problem. That is, an adversary who takes the snapshot of the beacon (e.g. query for a target SNP subset of interest) can use the correlations among SNPs to learn which participants carry each SNPs. To achieve this, we introduce a two-step optimization-based algorithm that optimizes the following objectives in alternating fashion: (i) matching the publicly available correlations among the SNPs, and (ii) matching the frequencies reported by the beacon.

Previous studies in the literature are based on strong assumptions. For example, membership inference attacks (Shringarpure and Bustamante 2015, Raisaro et al. 2017, Von Thenen et al. 2019) assume that the attacker has access to the victim’s genome. The genome reconstruction attack (Ayoz et al. 2021) assumes that the attacker knows the time that the victim has joined the study. It further assumes that the attacker has the snapshot of the beacon before and after this time. That is, the attacker has queried the beacon for all SNPs available in the human genome, which requires continuous brute-force querying. A haploblock reconstruction attack reconstructs unknown haplotypes for target haploblocks for <100 SNPs (Bu et al. 2021). In contrast, the Beacon Reconstruction Attack presented here does not rely on such strong assumptions. We only rely on the responses of the beacon protocol and show that it is reconstructable for thousands of SNPs without any assumptions on an underlying population/haplotype structure and not for a few victims but for all participants. Thus, we show that the protocol leaks much more information than previously thought.

Our results on beacons generated using HapMap and OpenSNP datasets demonstrate that given a randomly selected set of 1000 SNPs and a beacon with 50 individuals, an adversary can reconstruct all genomes with an F1-score of 70% while the baseline approach can only achieve 45%. We also show that the reconstructed genomes can be successfully used in membership inference attacks against other beacons. Our work shows the need for privacy-preserving mechanisms to protect beacon participants and stresses the increased genome reconstruction risk when beacons release allele frequencies.

2 Related work

Numerous studies have shown that anonymization of genomic samples does not effectively preserve the privacy of individuals. Homer et al. (2008) demonstrated that individuals contributing even trace amounts of DNA to genome-wide association studies (GWAS) with complex mixtures can be reidentified. Wang et al. (2009) demonstrated that information leaks from GWAS can lead to the identification of individuals through linkage to public datasets. Gymrek et al. (2013) further exposed vulnerabilities in anonymized genomic data by demonstrating how surname inference from the Y-chromosome can be combined with publicly available genealogical databases to re-identify participants in genomic studies. Similarly, Erlich and Narayanan (2014) used linkage attacks by integrating anonymized genomic data with auxiliary information from genealogical sources, uncovering new privacy risks. Humbert et al. (2017) and Samani et al. (2015) examined the potential of using high-order SNP correlations and familial ties to infer hidden genomic information.

In response to the limitations of traditional anonymization techniques, DP emerged as an advanced approach to privacy preservation. DP ensures that the inclusion or exclusion of an individual’s data in a dataset minimally affects the results of any analysis, thus safeguarding personal privacy. Initially introduced by Dwork (2006, 2008) through the concept of adding noise proportional to query sensitivity. This framework has been widely applied in GWAS to protect sensitive genetic information (Fienberg et al. 2011, Johnson and Shmatikov 2013, Tramèr et al. 2015). However, while DP offers robust privacy protection, it inherently involves a trade-off between privacy and data utility. To address privacy concerns while retaining data utility, genomic beacons were introduced by GA4GH.

Beacons offer summary statistics to user queries about the existence of certain alleles at certain positions. These summary statistics could be a binary response or the frequency of the allele. Thus, the system does not reveal the entire dataset content. It provides a more controlled and secure method of data sharing as the data are shared only if the allele of interest is present in the dataset. Moreover, the summary statistics in theory do not reveal the genome or the identity of the participants. However, beacons remain vulnerable to certain attacks. Shringarpure and Bustamante (2015) demonstrated that attackers could exploit statistical techniques, such as the LRT, to infer whether an individual is part of a beacon dataset by analyzing the yes/no responses to a query set. This work marked a critical advancement in highlighting privacy risks, especially in scenarios where genomic data is linked to sensitive phenotypic traits. Raisaro et al. (2017) further showed that targeting SNPs with minor allele frequencies (MAFs) can dramatically reduce the number of queries required for re-identification. Finally, Von Thenen et al. (2019) used the correlations among SNPs to infer the answers of the beacon to queries that are not posed which enabled the attack to work with a smaller number of queries and also work even if the victim has not revealed certain high-risk SNPs to the beacon.

Beacon protocol is also vulnerable to the genome reconstruction attacks. However, these are relatively under-explored. In this type, the adversary can infer the genome sequence of the target individual(s) using the responses of the system. Before the introduction of the beacon protocol.

Wang et al. (2009) showed that an adversary can reconstruct haplotypes in an haploblock using the allele frequencies and correlations released by a GWAS. Using a few SNPs in LD, the method uses integer programming to find the exact number of haplotypes in that region. However, the method can work with a few SNPs at a time and cut/link haplotypes as the required computation increases exponentially with the number of SNPs. In their analyses, they reconstruct genomes for 174 closely located SNPs in the FGFR2 gene in a GWAS (Hunter et al. 2007). Bu et al. (2021) propose a more recent and optimized version of this method using beacons. They aim to find the frequencies of known haplotypes in a beacon. They use integer programming to find the exact solution. If no solution is found, they conclude that there must be a novel haplotype(s) on the considered haploblock and aim to reconstruct d novel haplotypes, where d is a hyperparameter of the algorithm. These are different from our model which does not target a haploblock but any subset of SNPs over the full genome. Thus, they do not have to be on the same or close/overlapping haploblocks. While we show our model can scale beyond thousands of SNPs, a typical haploblock contains roughly 10–50 SNPs, which are in linkage disequilibrium. Thus, these works consider a smaller and very strongly correlated set of SNPs, which is not necessarily the case in our model (i.e. SNPs might be weakly correlated as selected randomly and size can be much larger). Our model relies on a two-step gradient descent-based optimization algorithm and is fundamentally different as we are not searching for an exact solution.

Ayoz et al. (2021) showed, if the adversary knows that the victim has joined the study at time t, then they can compare the responses of the system at time t1 and t+1 to conclude that the responses that have been flipped from “no” to “yes” belong to the victim. They further showed that (i) even if more than one person has joined the study, it is possible to determine the victim’s alleles using clustering of the SNPs with respect to linkage disequilibrium and also (ii) to link the public features of the victim (e.g. hair color) to the reconstructed genome using classifiers with high confidence. Although this study showed that beacons are vulnerable to genome reconstruction attacks because of dataset updates, it relies on the strong assumption of knowing the time that the victim joins the study and that the adversary keeps snapshots of the beacon taken at multiple time points.

3 Materials and methods

In this section, we first explain the system model (Section 3.1) and then the threat model (Section 3.2). We formulate the problem in Section 3.3, and finally, we provide the details of our beacon reconstruction attack methods in Section 3.4.

3.1 System model

The beacon protocol processes incoming queries about the presence of specific alleles in its connected database(s), providing summary statistics for each dataset. For simplicity, we assume that the beacon is connected to a single database. The summary statistic can be a binary answer for the existence of the SNP of interest or can be the allele frequency in the underlying database.

However, even these simple “yes” or “no” responses can be exploited in several types of attacks, potentially leading to the exposure of sensitive information. This research aims to develop methods for reconstructing genomic data-sharing beacons based on query responses, with the potential to reveal critical information about individuals within the dataset.

This protocol functions within an authenticated online setting, meaning that the querier must log in before submitting one or more queries. Additionally, the system retains a record of previous queries submitted by the same querier.

3.2 Threat model

In our threat model, we consider a scenario where the querier functions as an attacker, aiming to reconstruct the dataset interfaced through the beacon protocol. We assume the attacker is a registered beacon user with access to publicly available information, such as the number of individuals in the dataset, allele frequencies (AFs) in the population, and the linkage disequilibrium of the SNPs.

As a registered user, an attacker can submit an unlimited number of queries, each disclosing whether specific alleles are present at particular genomic positions and, optionally, their respective frequencies. The attacker queries the beacon including N individuals, for a potentially large set of SNPs of interest (M) and captures a comprehensive snapshot of the dataset. Using the optimization-based approach we explain below, the attacker assigns the SNPs with the answer “Yes” or with allele frequency >0, to N individuals in a one-to-many fashion such that the reported summary statistics and the correlations between the SNPs in the reconstructed genomes are matched.

3.3 Problem formulation

Let B denote the dataset, which is accessed through a beacon. B can be represented as a binary matrix of size N×M where N denotes the number of individuals in the beacon and M denotes the number of cataloged SNPs in the human genome. Assume the attacker is interested in reconstructing a subset of these SNPs: M. The attacker queries the beacon for M and obtains a response vector b of size |M|. b[0,1] if the beacon responds with the AFs, or b{0,1} if the beacon responds with presence/absence information for the SNPs.

The attacker has access to the following publicly available information: (i) AFs in the population, which is represented as a vector m[0,1] of size |M|; and (ii) the linkage disequilibrium of the SNPs, which the attacker uses to construct a correlation matrix of size |M|×|M|: C[0,1].

Following Ayoz et al. (2021), we calculate C based on the Sokal–Michener similarity which is the percentage of matches between two binary vectors of equal size. In this context, we have a binary vector sj, for each SNP j in M of size Np. Np is the number of individuals in the population which we use to construct C. A value of one indicates that the individual is carrying the SNP, and zero otherwise. Then, the correlation between SNP j and SNP k is equal to sj·skNp

We define the problem of beacon reconstruction as an optimization problem. The goal of the attacker is to find a function f(b,N,m,M,C) to obtain a reconstruction B of the beacon B such that the Frobenius norm of their difference is minimized: ||BB||F = i,j(BijBij)2.

3.4 Beacon reconstruction attack

Here, we present the first-of-its-kind beacon reconstruction attack in detail. We begin by describing our greedy (baseline) approach for beacon reconstruction, which serves as a baseline to compare our results in Section 3.4.1. Next, we introduce our novel optimization-based approach for beacon reconstruction in Section 3.4.2. Finally, we extend this optimization approach to scenarios involving known individual subsets. Figure 1 shows the overview of our proposed model.

Figure 1.

Figure 1.

The system model of our approach. The attacker finds out N and decides on the subset of SNPs M to reconstruct. 1. The attacker takes the snapshot of the beacon for M. 2. Beacon can respond with allele frequencies or yes/no for each SNP. In the latter, the attacker estimates allele frequencies from the population. 3. The attacker initializes the matrix using the baseline algorithm, and then 4. updates the assignments such that the SNP correlations are preserved. 5. The attacker optimizes for the original allele frequencies obtained. 6. Steps 4 and 5 are alternated until convergence.

3.4.1 Baseline approach for beacon reconstruction

In this section, we introduce a greedy algorithm in which we call the baseline approach. It is detailed in Algorithm 1.

Algorithm 1:

Baseline Algorithm for Beacon Reconstruction Attack

Require:  b, m, N, M

1: B0N×|M|   Initialize Reconstructed Beacon

2: for each  SNP jM  do

3:  if  bj>0  then

4:   Nbj×N

5:    if  b is binary then

6:    Nmj×N

7:    end if

8:   SRandomly select N individuals from N individuals

9:    for each  iS  do  Bij=1

10:    end for

11:   end if

12:  end for

13:  return  B

First, the attacker initializes the matrix B as a zero matrix (Line 1). For each SNP jM (Line 2), if the beacon responds with the frequency of the allele and the returned value is greater than zero (b[j]>0) (Line 3), the attacker finds the number of individuals with this allele in the beacon simply by calculating N=bj×N (Line 4). If the beacon responses are binary, then the attacker estimates this number using the MAF in the population, which is publicly available (N=mj×N) (Line 6). The attacker then randomly selects a subset S of beacon participants of size N that are assigned to carry the SNP j (Line 8).

The baseline algorithm operates under the assumption that each SNP is independent and neglects the correlation between SNPs. However, SNPs are often correlated as alleles at different loci are inherited together more than would be expected by chance because of linkage disequilibrium. To enable better reconstruction, we introduce a novel approach that uses pairwise SNP correlations in the next section.

3.4.2 Optimization-based approach for beacon reconstruction

The optimization-based method performs a two-objective optimization uses the pairwise SNP correlation matrix (C) and the AFs (b). It iteratively refines B until convergence such that the reconstruction has (i) matching AFs to the original beacon as reported in b or estimated using m, and (ii) matching SNP correlations to the observed SNP correlations in the population as summarized in C. The steps of the optimization-based method are introduced in Algorithm 2, and we detail this algorithm hereafter.

Algorithm 2.

Gradient-Based Optimization Algorithm for Beacon Reconstruction Attack

Require:  b,m,N,M,C,e1,e2,η,max_iter

1: BAlgorithm1(b,m,N,M)   Initialize Reconstructed Beacon

2: do

3:  Correlation Loss Minimization

4:  for  i=1 to e1  do

5:   BAdam(B,C,CB,Lcorr,η)

6:  end for

7:  Frequency Adjustment

8:  fb

9:  if  b is binary then

10:   fm

11:  end if

12:  f0|M|

13:  for each  SNP jM  do

14:   fji=1NBijN

15:  end for

16:  for  i=1 to e2  do

17:   BAdam(B,f,f,LMSE,η)

18:  end for

19: while  B¬converged and ¬max-iter-reached

20: return  B

Step 0: initialization:

We use Algorithm 1 to initialize the reconstruction B based on the frequencies reported by the beacon (Line 1).

Step 1: correlation loss minimization:

In the first step, the goal is to minimize the difference between the computed correlation matrix from B and the actual correlation matrix C, using the Frobenius norm: Lcorr=CBCF where Lcorr is the correlation loss and CB represents the correlation matrix calculated from B.

We optimize SNP assignments using the Adam optimizer (Kingma 2014), which updates B to minimize Lcorr for e1 epochs with leaning rate η (Line 3–6). This step improves the accuracy of the reconstructed beacon by ensuring that the correlations among the SNPs are as close to the expectation. However, this might lead to mismatching AFs between the actual and reconstructed beacon.

Step 2: frequency adjustment:

The second step adjusts the AFs in B to match the original AFs in B. That is, we want to minimize the difference between the AFs obtained from the reconstruction (namely f), and from the actual beacon, namely f which is equal to b or if the beacon responds with binary responses, to m. We use the mean squared error loss: LMSE = j=1|M|(fjfj)2|M|. We again use the Adam optimizer, which updates B to minimize LMSE for e2 epochs with leaning rate η (Line 7–18). We note that this an iterative algorithm and steps 1 and 2 are repeated until convergence. Here, convergence means number of flips in B after the update is less than a small number, which we set as 10 (Lines 2–19).

3.4.3 Reconstruction when a subset of the genomes are known by the attacker

We also consider a variant of the gradient-based optimization approach with the assumption that the attacker has access to the genomes of a percentage p of the beacon participants. The rationale behind this assumption is that genomic data may have already been compromised through genome reconstruction attacks targeting the beacon (Ayoz et al. 2021). Additionally, such data might have been accessed, either legally or illegally, from other DNA repositories, such as genealogy sites like 23andMe or ancestry.com. The attacker could then verify the membership of these individuals in the beacon using membership inference attacks (Shringarpure and Bustamante 2015, Raisaro et al. 2017, Von Thenen et al. 2019).

In this scenario, the attacker aims to reconstruct the genomes of unknown individuals in the beacon, given the known individuals; first, they fix the SNP assignments of the known individuals in B and reconstruct the remaining individuals using Algorithm 2.

4 Results

4.1 Datasets

We use two publicly available genomic datasets to evaluate the algorithms introduced in Section 3.4: the HapMap dataset (Gibbs et al. 2003) and the OpenSNP dataset (http://opensnp.org/). The CEU population in the HapMap dataset contains the genotype of 164 individuals and covers over 4 million SNPs. The HapMap project is a well-established resource for understanding genetic variation, and has been previously used to evaluate attack scenarios against beacons (Shringarpure and Bustamante 2015, Raisaro et al. 2017, Von Thenen et al. 2019, Ayoz et al. 2021). OpenSNP contains genotypes of 2980 individuals who donate their genomes to this community-driven project. Each individual has approximately 2 million SNPs. While the HapMap CEU population represents a beacon with a uniform genetic background, the OpenSNP dataset is diverse, and due to the lack of a standard protocol for genotyping, it represents a relatively noisy and more challenging setting. For the subsection on analyzing the effect of underrepresented populations on the attack, we use the Gujarati Indians from Houston, TX, population from the HapMap dataset, which contains the genotype of 117 individuals.

4.2 Experimental setup

In our experiments, we use subsets of both the HapMap and the OpenSNP datasets. We simulate beacons with N=3,10,25,50, and 100, where N represents the number of individuals in the beacon. We focus on a randomly selected set of 1000 SNPs. We experiment with various subsets M of this set where |M|=30,50,100,500,2000. Note that M denotes the SNP subset of interest. Note that every SNP subset we construct of size |M|, is a superset of the smaller SNP subsets. For example, the SNP subset of size |M| = 100 includes the SNPs in the SNP subset of size |M| = 50 and smaller. We set the learning rate for the Adam optimizer as η=0.001 which is applied for 1000 epochs (e1) to optimize the correlation matrix and 500 epochs (e2) to optimize the frequencies. The maximum number of iterations for the optimization algorithm to converge is set to 3000. To simulate the case where a percentage of the individuals are known by the attacker, we experiment with p=20% and 40%.

We construct the correlation matrix C using 100 left-out samples from the OpenSNP dataset to simulate the case where the publicly available correlations are relatively more noisy to test our approach in a relatively more setting. To obtain the AFs in the population, we use the 64 left-out samples from the HapMap dataset. We use the population AFs as a proxy when the beacon responds in binary instead of the exact AFs in the dataset. We also simulate the scenario where the AFs for the target population are not available. Then, we consider using the AFs from another population. In this case, we use frequencies of the Mexican population, which we obtained from the Merged phase I+II and III HapMap Dataset (https://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2010-08/_phaseII+III/forward/). All experiments are performed on a SuperMicro SuperServer with Intel Xeon CPU E5-2650 v3 2.30 GHz with 40 cores.

4.3 Reconstruction performance

In this subsection, we investigate the performance of the reconstruction method under varying conditions. We first focus on the effect of increasing the number of individuals in the beacon and then increasing the number of targeted SNPs. Next, we investigate the performance when the attacker has already reidentified a percentage of the individuals in the beacon. Finally, we show the effect of the beacon reporting AFs instead of binary “Yes/No” answers.

4.3.1 Effect of beacon size

We compare the reconstruction performances of the baseline approach and the optimization-based approach with respect to the F1-score. Supplementary Fig. S10 shows the results of both algorithms with |M| = 1000, across various beacon sizes for both OpenSNP and HapMap datasets. We observe that with very few samples (N =3), which is not very realistic, the baseline approach yields 64.1% and 61% F1-scores for OpenSNP and HapMap datasets, respectively. The optimization-based approach, on the other hand, attains 81.7% and 79.6% F1-scores for these datasets. This clearly shows the improvement over the baseline and the benefit of the optimization approach. We observe that the precision and the recall of the algorithm are also balanced. Please see the precision and recall values obtained for the OpenSNP beacon in Supplementary Fig. S7. As N is increased both approaches’ performance deteriorates as expected, but the gap between the two approaches stays the same. For N =50, the optimization-based approach achieves an F1-score of 70% and 67%, for OpenSNP and HapMap datasets, respectively. This is a typical beacon size considered in the literature (Shringarpure and Bustamante 2015, Raisaro et al. 2017, Von Thenen et al. 2019, Ayoz et al. 2021). A 70% F1-score represents a strong balance between precision and recall, especially in a realistic, challenging, and imbalanced task such as this. This serves as a meaningful improvement over the baseline performance and shows the feasibility of such a beacon reconstruction attack. Note that the correlation matrix C is estimated from the OpenSNP dataset, and the results obtained on the HapMap dataset with this correlation matrix show that even if the source of the correlations is not from the same population and is obtained from a rather noisy community-driven public dataset, the optimization-based reconstruction method can achieve high performance. On the other hand, performance decreases when N doubles from 50 to 100 is only 4%, and the gap between the optimization-based approach and the baseline is preserved. This indicates that the method can scale well while the problem size is doubled. The same experiment with varying SNP set sizes can be found in the Supplementary Figs S1 and S2, for OpenSNP and HapMap datasets, respectively. We observe the same trend in both figures.

Please see Supplementary Note S1.1 for the performance analysis when the attacker is assumed to have access to the genomes of a portion of the individuals in the beacon. See Supplementary Note S1.2 for the time performance analysis.

4.3.2 Effect of number of SNPs targeted

Here, we evaluate the effect of varying the number of SNPs considered (|M|). First, the results in Supplementary Figs S1 and S2 demonstrate that the optimization-based approach performs substantially better than the baseline method for all |M| values, for both datasets. For example, in Supplementary Fig. S1B where |M| = 50, the optimization-based method achieves an F1-score of 94% for three individuals in the OpenSNP dataset. In comparison, the baseline approach achieves 71%, which assumes independence between SNPs. The performance gap arises due to the baseline’s tendency to make false assignments due to its inability to account for correlations between SNPs causing incorrect predictions of alleles. As |M| increases, both approaches’ performances go down, which is expected due to the increasing search space and complexity of the problem. However, we observe that with even 2000 SNPs considered and 50 individuals in the beacon (Supplementary Fig. S1E), the performance of the optimization-based approach is 67%, which is a strong F1-score considering the size of the problem. The F1-score gap between predicting the assignment of 1000 SNPs (Supplementary Fig. S10A) and 2000 SNPs (Supplementary Fig. S1E) for 50 individuals is just 2.2%. This indicates that the model scales well even if the number of SNPs doubles. We observe a very similar pattern in the results obtained on the HapMap dataset as shown in Supplementary Fig. S2.

4.3.3 Effect of beacons reporting allele frequency

All results reported above, assume that the beacon responds to queries with AFs which are implemented in various beacons such as THL Biobank and Progenetix. However, beacons can also respond by the presence or absence of the alleles instead of the frequency (i.e. binary). Clearly, this returns less information compared to AFs and makes it more difficult for the attacker to reconstruct the dataset. In this case, we assume that the attacker estimates the AFs in the beacon dataset by using background AFs observed in the population or in other public datasets and investigate the performance of the attack in this setting.

We use 64 left-out CEU HapMap samples to estimate the frequencies that would have been reported by the beacon interfacing the HapMap dataset (with varying N and |M|) instead of a “Yes/No” answer. That is, the attacker simulates the beacon and the underlying dataset using the left-out samples and uses the frequencies in the simulated dataset if they receive a “Yes” response from the beacon. If they receive a “No” response, then the attacker sets the frequency for that SNP as zero. To consider the case where the attacker does not have access to the population AFs or the public AFs are not representing the beacon dataset well. For instance, the genetic background in the dataset can be mixed. To simulate this, we estimate the AFs using the Mexican population in the HapMap dataset as well.

The results are shown in Fig. 2. We observe that in the HapMap beacon with N=50 and |M|=1000, the optimization-based algorithm achieves an F1-score of 53.4%. This is a decrease of 14% in the F1-score obtained when the beacon reports actual AFs in the dataset. Clearly exact AFs reported by the beacon result in more precise reconstruction and the algorithm loses power. However, the baseline algorithm performs much worse (37%). This indicates the model still remains effective in less ideal scenarios such as using background frequencies from the same population. For the same experiment with varying SNP set sizes, can be found in Supplementary Fig. S4 for the HapMap dataset.

Figure 2.

Figure 2.

F1-score comparison for |M| = 1000 when frequency is unknown. Plot A represents reconstruction from 64 left-out samples of HapMap dataset, Plot B represents reconstruction using Mexican samples.

With MAFs in the Mexican population, the attacker obtains an F1-score of 32%. For the same experiment with varying SNP set sizes, can be found in Supplementary Fig. S5 for the HapMap dataset. The results show that the use of AFs in the Mexican population substantially decreases performance. Note that this represents an extreme case where we estimate the frequencies using incorrect background information. However, these results show that beacons with genetically uniform content leak more information and are more prone to reconstruction attacks.

4.3.4 Effect of underrepresented populations

While we investigate challenging situations such as (i) inferring the SNP correlations from a noisy resource like OpenSNP, and (ii) considering the case where one does not have access to the population AFs, all of our above results are obtained on datasets that consist of European individuals. To see if our attack is effective on a data set consisting of individuals from an underrepresented population, we obtained the DNA samples of 117 Gujarati Indians from Houston, TX, from the HapMap study and analyze its effect. We replicate the analyses we performed for the CEU individuals. That is, we generate beacons with dataset sizes of 3, 10, 25, 50, and 100 and vary the number of SNPs to be reconstructed as 30, 50, 100, 500, 1000, and 2000. We again use the OpenSNP dataset to obtain the SNP correlations. We observe that the method achieves an F1-score that is only 7% less on average compared to the CEU beacons across all combinations tested. The precision and recall are again balanced. Performance decline is expected, as the dataset participants are from an underrepresented group with potentially few or no individuals in the OpenSNP dataset. The performance decline being limited to only 7% on average, shows that the attack can still work in such a challenging setting and is robust. These results are provided in Supplementary Fig. S9.

4.3.5 Using reconstructed genomes for membership inference attacks

In this subsection, we show that the reconstructed genomes can be exploited in downstream applications and investigate using reconstructions for membership inference attacks. Assume that an attacker launches a Beacon Reconstruction Attack against a beacon with no associated sensitive attribute (Beacon A). They obtain reconstructions for everyone in this beacon. Then, they launch a membership inference attack against a beacon that is associated with a sensitive attribute (Beacon B). Membership inference attacks assume that the attacker has access to the victim’s genome, which is a strong assumption, as it is assumed to be stolen. Using the Beacon Reconstruction Attack, the attacker can reconstruct the victim’s genome from a beacon that is considered safe (Beacon A). The only weak assumption here is that the attacker knows that the victim has donated their genome to Beacon A, which again can easily be public information, and as shown in Ayoz et al. (2021), it is possible to pinpoint the victim’s genome using trained models to predict victim’s public phenotypes, such as eye and hair color. In our experiment, we use the beacon of 50 CEU individuals, which we analyzed above in Section 4.3.1. This is a Beacon A with no associated sensitive attribute. Then, using 49 CEU individuals from the remaining people in the population, we generated another beacon. This is Beacon B, which represents a beacon that is associated with a sensitive attribute. For every individual i in Beacon A, we add their actual genome, one by one, to Beacon B to obtain Beacon Bi. Thus, we obtain 50 beacons Bi with 50 individuals each of which differs by one person. Using the reconstructed genome of every individual i in Beacon A, we launch a membership inference attack against Bi. As for the membership inference attack, we use the optimal attack (Raisaro et al. 2017) where 20 randomly selected individuals from the remaining 64 CEU individuals are used as the control group. We find that in 41 out of 50 attacks we were able to re-identify the victims (at 5% false-positive rate) with the number of maximum queries being only 21. This shows that the reconstructions obtained can be used effectively to further violate the privacy of the beacon participants and recover sensitive phenotypes about them.

5 Discussion and conclusion

In this work, we introduced the first-of-its-kind, beacon reconstruction attack, which can reconstruct large numbers of variants in all beacon participants’ genomes with high performance. We tested the limits of the approach under various factors such as beacon size, number of target variants, percentage of already identified individuals in the beacon, and assumptions on summary statistics. Our results show that the attack is feasible. Required time scales linearly with increasing number of participants and targeted variants. The results decline again linearly with an increasing number of participants but with a shallow slope. We have tested the attack in a challenging setting where the SNP correlations are estimated from the OpenSNP dataset, which does not reflect the actual correlations in the CEU population. Yet, our attack was effective.

The beacon protocol allows AFs to be reported, and many beacons implement this option. We clearly show that if beacons report AFs, the beacon reconstruction is more effective compared to simple “Yes/No” responses. The attacker can estimate the frequencies using background AFs, and the attack is still feasible but is less powerful. Therefore, we suggest administrators implement the beacon protocol to avoid AF reporting options for safety. In this case, we also show that the attack is again more effective if the dataset contains individuals from a single population. Thus, larger number of individuals with diverse genetic backgrounds can help mitigate such beacon reconstruction attacks.

Our suggested approach relies on pairwise correlations among variants. Yet, other approaches suggest that using higher-order correlations can provide more information and can be used to reconstruct other parts of the genomes (Samani et al. 2015, Von Thenen et al. 2019). While we relied on a randomly selected subset of variants to test the limits of the model, a more carefully selected subset of variants can be targeted using our approach to form a baseline. This can be followed by an independent second stage where the missing parts of the genomes of the victims can be further reconstructed using higher-order correlations as done in Samani et al. (2015). This means the beacon reconstruction attack, which focuses on pairwise correlations can be paired with a second attack. This follow-up attack can further reconstruct the remaining missing parts using higher-order correlations independently from the beacon summary statistics to uncover more variants. We will consider this in the future work.

One can also consider selecting a relatively more correlated subset of variants. We have investigated this issue and reconstructed subsets of SNPs with low, moderate, and high average correlations. Yet, the performance of the reconstruction was not substantially different indicating the method is robust against this effect as long as it can rely on confidently estimated correlation matrix. In our experiments, we used a correlation matrix estimated from the OpenSNP dataset to reconstruct the beacon simulated using individuals from the CEU population. More confident estimation of the correlation matrix can yield stronger attacks, indicating that there is still room for improvement.

Supplementary Material

btaf273_Supplementary_Data

Acknowledgements

We thank Göktuğ Gürbüztürk, Efe Erkan, Deniz Aydemir, İrem Aydın, Kerem Ayoz, and Erman Ayday for their contributions to this study.

Contributor Information

Kousar Saleem, Computer Engineering Department, Bilkent University, Üniversiteler Mahallesi, Ankara, 06800, Türkiye.

A Ercument Cicek, Computer Engineering Department, Bilkent University, Üniversiteler Mahallesi, Ankara, 06800, Türkiye.

Sinem Sav, Computer Engineering Department, Bilkent University, Üniversiteler Mahallesi, Ankara, 06800, Türkiye.

Author contributions

Kousar Saleem (Methodology [lead], Software [lead], Writing—original draft [supporting]), A. Ercument Cicek (Conceptualization [support], Project administration [equal], Supervision [equal], Writing—original draft [support], Writing—review & editing [equal]) and Sinem Sav (Conceptualization [lead], Project administration [equal], Supervision [equal], Writing—original draft [equal], Writing—review & editing [lead])

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM013429.

References

  1. Ayoz K, Ayday E, Cicek AE. Genome reconstruction attacks against genomic data-sharing beacons. In: Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium, Vol 2021, Oxford, England, UK: NIH Public Access, 2021, p28. [DOI] [PMC free article] [PubMed]
  2. Ayoz K, Aysen M, Ayday E  et al.  The effect of kinship in re-identification attacks against genomic data sharing beacons. Bioinformatics  2020;36(Supplement_2):i903–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bu D, Wang X, Tang H.  Haplotype-based membership inference from summary genomic data. Bioinformatics  2021;37(Supplement_1):i161–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cho H, Simmons S, Kim R  et al.  Privacy-preserving biomedical database queries with optimal privacy-utility trade-offs. Cell Syst  2020;10:408–16.e9. [DOI] [PubMed] [Google Scholar]
  5. Clayton D.  On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics  2010;11:661–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Dwork C.  Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I (eds), International Colloquium on Automata, Languages, and Programming. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, 1–12. [Google Scholar]
  7. Dwork C.  Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, Li A (eds), International Conference on Theory and Applications of Models of Computation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, 1–19. [Google Scholar]
  8. Erlich Y, Narayanan A.  Routes for breaching and protecting genetic privacy. Nat Rev Genet  2014;15:409–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fienberg SE, Slavkovic A, Uhler C. Privacy preserving GWAS data sharing. In: 2011 IEEE 11th International Conference on Data Mining Workshops, Vancouver, Canada. Washington, DC, United States: IEEE, 2011, 628–35.
  10. Gibbs RA, Belmont JW, Hardenbol P  et al.  The international hapmap project. Nature  2003;426:789–96. [DOI] [PubMed] [Google Scholar]
  11. Gymrek M, McGuire AL, Golan D  et al.  Identifying personal genomes by surname inference. Science  2013;339:321–4. [DOI] [PubMed] [Google Scholar]
  12. Homer N, Szelinger S, Redman M  et al.  Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet  2008;4:e1000167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Humbert M, Ayday E, Hubaux J-P  et al.  Quantifying interdependent risks in genomic privacy. ACM Transactions on Privacy and Security (TOPS)  2017;20:1–31. [Google Scholar]
  14. Hunter DJ, Kraft P, Jacobs KB  et al.  A genome-wide association study identifies alleles in fgfr2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet  2007;39:870–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Jacobs KB, Yeager M, Wacholder S  et al.  A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat Genet  2009;41:1253–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Johnson A, Shmatikov V. Privacy-preserving data exploration in genome-wide association studies. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA. New York, NY, USA: Association for Computing Machinery, 2013, 1079–87. [DOI] [PMC free article] [PubMed]
  17. Kingma DP. Adam: a method for stochastic optimization. arXiv. arXiv:1412.6980, 10.48550/arXiv.1412.6980,  2014, preprint: not peer reviewed. [DOI]
  18. Poorghaffar M, Shukueian Tabrizi S, Ayoz K  et al. A reinforcement learning-based approach for dynamic privacy protection in genomic data sharing beacons. bioRxiv, 2024-10, 10.1101/2024.10.28.620587,  2024, preprint: not peer reviewed. [DOI]
  19. Raisaro JL, Tramèr F, Ji Z  et al.  Addressing beacon re-identification attacks: quantification and mitigation of privacy risks. J Am Med Inf Assoc  2017;24:799–805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Samani SS, Huang Z, Ayday E  et al. Quantifying genomic privacy via inference attack with high-order SNV correlations. In: 2015 IEEE Security and Privacy Workshops, San Jose, California, USA. Washington, DC, USA: IEEE, 2015, 32–40.
  21. Shringarpure SS, Bustamante CD.  Privacy risks from genomic data-sharing beacons. Am J Hum Genet  2015;97:631–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Tramèr F, Huang Z, Hubaux J-P  et al. Differential privacy with bounded priors: reconciling utility and privacy in genome-wide association studies. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. Denver, Colorado, USA, 2015, 1286–97.
  23. Venkatesaramani R, Wan Z, Malin BA  et al.  Defending against membership inference attacks on beacon services. ACM Trans Privacy Security  2023a;26:1–32. [Google Scholar]
  24. Venkatesaramani R, Wan Z, Malin BA  et al.  Enabling tradeoffs in privacy and utility in genomic data beacons and summary statistics. Genome Res  2023b;33:1113–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Von Thenen N, Ayday E, Cicek AE.  Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics  2019;35:365–71. [DOI] [PubMed] [Google Scholar]
  26. Wan Z, Vorobeychik Y, Kantarcioglu M  et al.  Controlling the signal: practical privacy protection of genomic data sharing through beacon services. BMC Medical Genomics  2017;10:87–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Wang R, Li YF, Wang X  et al. Learning your identity and disease from research papers: information leaks in genome wide association study. In: Proceedings of the 16th ACM conference on Computer and Communications Security, Chicago IL, USA. New York, NY, USA: Association for Computing Machinery, 2009, 534–44.
  28. Yuen RKC, Merico D, Bookman M  et al.  Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nature Neurosci  2017;20:602–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btaf273_Supplementary_Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES