Abstract
Some organizations such as 23andMe and the UK Biobank have large genomic databases that they re-use for multiple different genome-wide association studies. Even research studies that compile smaller genomic databases often utilize these databases to investigate many related traits. It is common for the study to report a genetic risk score (GRS) model for each trait within the publication. Here, we show that under some circumstances, these GRS models can be used to recover the genetic variants of individuals in these genomic databases—a reconstruction attack. In particular, if two GRS models are trained by using a largely overlapping set of participants, it is often possible to determine the genotype for each of the individuals who were used to train one GRS model, but not the other. We demonstrate this theoretically and experimentally by analyzing the Cornell Dog Genome database. The accuracy of our reconstruction attack depends on how accurately we can estimate the rate of co-occurrence of pairs of single nucleotide polymorphisms within the private database, so if this aggregate information is ever released, it would drastically reduce the security of a private genomic database. Caution should be applied when using the same database for multiple analysis, especially when a small number of individuals are included or excluded from one part of the study.
Keywords: genetic risk scores, genomic privacy, GWAS, long-term privacy, reconstruction attack
1. Introduction
In a survey of genomic privacy experts, the long-term privacy of genomic information was deemed both the most important and the most challenging problem to overcome (Mittos et al., 2019). If an individual's password or ID number gets leaked, it is always possible to change it. However, it is impossible for a person to change their genetic code and they will pass part of it onto their children, so any information leaks can have long-term impacts on both the individual and their descendants. Although much of the research focus on long-term privacy of genomic databases rests on the longevity of the encryption scheme (Huang et al., 2015), it is also important to remember that these genomic databases are not just sitting on a server somewhere, but are also being continually utilized for making new scientific discoveries. Each time these databases are accessed and the scientific results are published, there is a risk that information will be leaked and that eventually this would enable an attacker to reconstruct private information held in the database.
Genomic researchers are already aware that some forms of aggregate data from their databases should not be released publicly, because there is a risk that an attacker may be able to determine whether a particular individual is a member of the database (a membership inference attack). For instance, such attacks have already been developed for summary statistics about the frequency of single nucleotide polymorphisms (SNPs; Cai et al., 2015; Dwork et al., 2015; Simmons and Berger, 2015). Membership inference attacks have also been developed for the case where a person is allowed to repeatedly query a database to learn whether at least one individual contains a particular SNP (Shringarpure and Bustamante, 2015; Raisaro et al., 2017; von Thenen et al., 2018). These kinds of aggregate statistics about the frequency or presence/absence of a particular SNP in a database might be useful to release to the broader research community, but it is not an essential output of the research process.
However, the main research findings—that is, the SNPs associated with the trait of interest and their strength of association—are essential to publish since the entire purpose of these genomic research projects is to uncover the relationship between genetic variants and phenotypic traits. Moreover, knowledge of these SNPs can lead to new diagnosis procedures or new potential drug targets, so their release is important for the public interest (Visscher et al., 2017). However, even this information can potentially leak private information about individuals in the database. For instance, Im et al. (2012) found that information about individuals in a genomic database is leaked when studies publish whether each SNP is correlated or anti-correlated to the trait of interest. It is important to quantify how much information is leaked by publishing these research findings, so that scientists can make informed decisions about when to publish their results and whether it is worth risking the privacy of the participants.
In this study, we demonstrate that the kind of research output that is published from genome-wide association studies (GWAS) has the potential to leak enough information to recover the SNPs of individuals in the database (a reconstruction attack), under specific circumstances. In particular, we focus on the release of genetic risk scores (GRS), a common research output for finding genetic associations with continuous traits (Qi et al., 2011; Belsky et al., 2013; Zhao et al., 2014; Chouraki et al., 2016; Day et al., 2017; Knowles and Ashley, 2018). We also focus on cases where a database is repeatedly used to perform a GWAS analysis, but not all the individuals are part of all the analyses. This could be the case because some individuals drop out of the study or skip specific survey questions. Alternatively, some databases, such as 23andMe, may grow in size over time and allow several GWAS to be performed within a short period. Under these circumstances, we demonstrate that it is possible to completely reconstruct the SNPs of an individual by using a custom expectation–maximization (EM) algorithm. We also provide suggestions for avoiding this kind of attack.
To be clear, this article focuses on the simpler case in which the exact same trait is investigated in multiple GWAS studies; however, we expect that some version of this attack may be developed in the near future for the case of multiple highly correlated traits.
1.1. Overview of scenarios that will be investigated
We demonstrate a series of reconstruction attacks that enable us to infer the genotypes of individuals in private genomic databases, based on publicly released GRS. These attacks will initially be deployed in a very favorable scenario, but the scope of the attack will be subsequently expanded, building up to the scenario shown in Figure 1. It is worth noting that the reconstruction attacks that we will describe do not depend on (1) how the SNPs were initially filtered or (2) how strongly they associate with the trait of interest.
We will begin by investigating a simple scenario: Two GWAS studies are performed to identify SNPs associated with the same trait, and the two studies use the same set of participants, except that the second study includes one extra individual. In addition, we will assume that we know the frequencies of each SNP and the frequencies that pairs of SNPs co-occur in the same individual. We assume that both studies publish the coefficients associated with the GRS models that they infer as part of the analysis.
Next, we will consider the case in which the second study includes more than one additional participant and we demonstrate that in many circumstances this still allows us to easily reconstruct the individual genotypes of all the individuals that are found in the second study but not the first (see Section 3.2).
Afterward, we will demonstrate that we do not need to know the precise frequencies of SNPs and frequencies of co-occurring SNPs, as long as we have a reasonable estimate of these values from public databases (see Section 3.3). We also briefly discuss how loosening additional restrictions would impact our ability to predict individual genotypes. In particular, we analyze the case where the two sets of SNPs that are used by the two studies are not identical. These results imply that if two sets of GRS are released on two genetic datasets with largely overlapping populations, it may be possible to reconstruct the genotypes of those individuals who participated in one study but not the other (Fig. 1).
2. Methods
GRS models describe the relationship between a particular phenotype of interest and particular SNPs. These models are fit in a two-stage process: First, a reduced set of SNPs is selected from a potentially very large pool of candidates; then, this reduced set is used as the independent variables in a linear regression analysis. The set of SNPs is selected by first filtering for those that significantly correlate to the trait of interest, after controlling for other covariates. These SNPs are then further filtered to ensure that they are far apart from one another, to decrease the correlation between them. In this setting, we suppose that M individuals have taken part in a study, and N SNPs have passed the filtration steps to be used in a linear model. Let yM be the vector of M real-valued phenotypes, and XM be an binary matrix, where if individual i has SNP j. To include an intercept term in the linear model, we define the design matrix to be the matrix
(1) |
The GRS model parameter is just the coefficient vector of the linear model
(2) |
where is independent Gaussian noise. Given and phenotypes yM, the maximum likelihood estimate of this parameter has a closed form
where we have defined the symmetric matrix K as
(4) |
Now, suppose a second study is run, targeting the same phenotype, which adds a single extra individual with SNPs represented by the N length vector x0. This corresponds to adding the row to the design matrix, and extending y with the additional phenotypic value y0 for the new individual. The updated estimator (i.e., the GRS values for the second study) is given by
We assume that both GRS models and are released publicly. An attacker aims at using this knowledge to reconstruct (the genotype of the added individual). Through algebraic re-arrangement (see Section 5.2), we find that:
(6) |
where C is a scalar, specifically . Eq. (6) means that is a scalar multiple of .
Our approach, thus, centers on the use of the vector that we define as d1,
corresponding to a rescaled copy of the input SNP data in the design matrix , which can be easily computed from the two parameter vectors if the matrix K is known. As seen in Section 3.1, we can use d1 to exactly reconstruct the added individual with 100% accuracy.
We additionally consider the case where m additional individuals have been included in the second study, yielding a new GRS model including these participants. The extra rows of the design matrix now form a matrix of size , where each row is an individual that was added to the second study and each column is an SNP (and the last column contains only 1). The corresponding analog to Eq. (7) for multiple individuals, which we derive in Section 5.2, is
where Cm is a vector of length m. For sufficiently small m (relative to N), exact reconstruction of all m added individual genomes is also possible in this setting, following the algorithm we will introduce in Section 3.2.
The previous examples have focused on cases in which the participants in the first study are a subset of the individuals in the second study. In Section 5.7, we consider the case in which the first study has some participants that are not found in the second study and vice versa. We show that the same strategies for reconstructing the genome can be used as in the previous scenario that we discussed, in which multiple participants are added to the second study.
2.1. Estimation of K
As it turns out, the entries of matrix K correspond to simple population-level statistics of the SNPs, which could either be inadvertently released (under the assumption they would be safe to share), or could be estimated from another sample from the same population. In fact, the entries of K depend only on the SNP frequencies and SNP co-occurrence frequencies in the dataset:
– For : estimates the probability that SNP i has value 1 (i.e., the frequency of the SNP in the population).
– For and : estimates the probability that both SNP i and SNP j are 1 simultaneously (i.e., the frequency of SNP i and SNP j co-occurring in the same individual).
– For and : also estimates the probability that SNP i has value 1, that is, .
– Finally, .
Thus, knowledge of SNP frequencies and pairwise co-frequencies from the original study are all that is required to compute K. In the following Sections 3.1 and 3.2, we consider adding one and multiple individuals at once, respectively, in the setting where this matrix K can be estimated exactly. However, although , and M are likely to be published along with the study, an attacker would often need to estimate K from other publicly available data. Most studies will report some information about the study population (such as whether the study focused on individuals from a specific continent), which can help with estimating K. From this information, we can estimate the value of K in similar populations as those used in the study using publicly available data, for example, from the HapMap project. Our additional experiments in Section 3.3 use a custom EM algorithm to find maximum likelihood estimates of when the matrix is estimated from independent public data. The derivation of this EM algorithm is given in Section 5.4.3, and a formal analysis of the reconstruction error of given the error in is found in Section 5.2.1.
3. Results
The key observation from the previous section is that the vectors d1 and dm, derived from the change in parameter vectors from a first study to a second study, take only a finite number of values thanks to the fact that the design matrices contain only zeros and ones. In particular, when m new individuals are added to the second study, each entry of the vector dm can only take at most values, and a zero value corresponds to the setting where all individuals have the most common variant for that SNP.
This section describes algorithmically how these vectors can be used to recover the genomes of the additional individuals, as well as empirical tests that use the Cornell Dog Genome dataset as a case study (Hayward et al., 2016). More details on the experimental setup can be found in Section 5.1.
3.1. Complete reconstruction of one individual's genotype when SNP frequency information is known
The first, most straightforward case is when only one participant is added between the first and second studies, that is, where is the GRS for the first study (containing M participants), and is the GRS for the second study as described in Eqs. (3) and (5). Both of these are vectors of length , where the first N indices correspond to the relationship between each SNP and the trait and the last element is the intercept of the linear model. For now, we also assume we are in the setting where the matrix K is known, for example, because the SNP frequency information has been publicly released.
Given K, , and , we can use d1 (a vector of length ) to precisely determine the genotype of the individual who was added to the database. For each , the entry of d1 is either equal to 0 if contains a 0 (i.e., the individual does not have the SNP at that index) or to C if contains a 1 (i.e., the individual has the SNP at that index). In other words, it is possible to exactly read off the SNPs of the added individual in this setting. Indeed, we tested this strategy on the Cornell Dog Database and found that we were able to reconstruct the genotype of the dog that was added to the second study with 100% accuracy, on both common and uncommon SNPs (Fig. 2A).
3.2. Complete reconstruction of multiple individuals' genotype when SNP frequency information is known
We now consider the case where m additional individuals have been included in the second study, yielding a new GRS model , including these participants.
Consider again Eq. (8) described earlier. The row of is a binary vector that represents the combination of the m individuals who have SNP i. This means that, for a fixed value of Cm, the value of the vector dm at index i is uniquely determined by the combination of individuals who have SNP i (Fig. 2B). In other words, there will be at most unique values taken by entries of dm, each corresponding to a combination of the values in vector Cm (Fig. 2C).
If we were to learn which values of dm are also found in Cm, then we could infer the complete genotypes of all the m individuals added to the second study. We would be able to reconstruct m complete genotype vectors, although it would be impossible to know which of the genotypes corresponded to which of the m individuals. In fact, in many cases it is extremely straightforward to determine which values in dm correspond to values in Cm. Here, we describe a simple algorithm for finding Cm when there are exactly unique values in dm. If this is not the case, please refer to the more complete algorithm in Section 5.3.
-
1.
First, extract all unique, nonzero values from dm.
-
2.
Find the sum of all pairs of values in this vector.
-
3.
Find all values that are in Eq. (1), but not in Eq. (2). The values of Cm appear in this list. There is no way to know which value of Cm corresponds to which index, so for simplicity we can randomly assign them indices.
-
4.
Each value in Cm corresponds to a specific individual who was added to the second study. Each value in dm can be described as a sum of a unique combination of values in Cm. For instance, if , this means that the SNP at position i is found in individual j and k, but no one else.
We tested this approach by using the Cornell Dog Database, in a test scenario where the second study added three different dogs. We were able to uniquely identify the genotypes of all three dogs with 100% accuracy, with both common and uncommon SNPs (Fig. 2C).
3.3. Accurate estimation of an individual's genotype when SNP frequency information is estimated from a public database
Previously, we assumed that the attacker had access to the matrix K, which consists of population-level statistics on frequencies and co-occurrence frequencies of SNPs. Although this could be released voluntarily by organizations that are not aware of the risk, we now consider the case where K is not directly available to the attacker but is instead estimated from a separate public database assumed to correspond to individuals from the same population.
We simulated this scenario by using the Cornell Dog Database by taking one random set of dogs for building the GRS model, and a second non-overlapping set of dogs for estimating . We compared the value of with the known value of . We observe that has significantly different values at indices where and ; examples for the cases where one and three dogs are added can be seen in Figure 3.
The main challenge is that the vector now includes additional noise, so we cannot simply use its entry at index to estimate C, nor do the entries i with also correspond directly to . Instead, we develop a custom EM algorithm to find a maximum likelihood estimate of the constant C and recover , that is, to determine the probability that each or , based on the value of (see Section 5.4.3 for details). We find that this method can successfully reconstruct the correct value of much better than a baseline that uses the public dataset to independently estimate the most common variant for each of the SNP (Fig. 4).
Crucially, we show that our approach is able to reconstruct, with relatively high accuracy, the genotypes of dogs even when they differ significantly from those in the public dataset (Fig. 5). This shows that our attack is able to extract information about the particular individuals that differ across the two studies, not merely about the general population as in the most-common-variant baseline. By definition, dogs that have genotypes that differ significantly from the general population have a higher proportion of uncommon SNPs, and the ability to recover these uncommon SNPs is particularly important from a privacy perspective. Indeed, uncommon SNPs can be used to identify a particular individual and are also more likely to be associated with disease phenotypes, which is sensitive information. In general, we find that the larger the public dataset available, and the more similar the dataset is to the unknown private dataset, the better we are able to reconstruct the genome of the added individual. Full details and description of the experimental setting are given in Section 5.1. We also derive theoretical error bounds for our estimate of based on the error in in Section 5.4.1.
This task becomes more challenging when multiple individuals are added simultaneously and K is unknown; an algorithm for estimating for , along with additional empirical results, is given in Section 5.4.
3.4. Accurate estimation of an individuals' genotype when different SNPs are used in each study
When GRS models are constructed, the first step is to filter the set of SNPs down to a small set of SNPs that are (1) significantly correlated to the trait after covariates are considered and (2) far apart from one another along the genome. If the two studies use two different sets of SNPs to construct the GRS model, it is still possible to recover whether or not each of the SNPs in the overlap is present in the new individual. This process is highly analogous to the previous cases and is detailed in Section 5.6.
4. Discussion
In this study, we demonstrate that private information is leaked when GRS models are published, specifically in the case where two sets of largely overlapping individuals are used for multiple studies. In particular, we show that we can recover SNPs from an individual in a private database—a reconstruction attack. Even though we would not have a name associated with this genotype, it may be possible to identify the individual once the genotypic data are available to the attacker. For instance, the attacker may have access to partial genotypic information of the individual and then be able to identify them. Alternatively, they could use the genotype information to predict ethnicity and other phenotypic traits that could then be used to uniquely identify the individual.
We also note that even an incomplete reconstruction attack (in which only a proportion of the SNPs are correctly identified) is likely to be sufficient to perform a membership inference attack. Investigating the relationship between the reconstruction attack and the membership inference attack will be a subject of future research. Importantly, if the attackers were unable to link the genomic data with a particular individual, the reconstruction attack would still be a breach in privacy that could have serious consequences. For instance, the patient may have only consented to have their genomic data used in particular kinds of research studies, whereas the attacker may use the reconstructed genomic data for a different (potentially unethical) purpose.
4.1. Suggestions for good practice
We provide a number of simple suggestions for good practice that would help limit this attack.
-
1.
Aggregate statistics about the frequency of SNPs in the database or the frequency of co-occurrence of SNPs should never be released. We have shown that this information, combined with GRS, allows to precisely reconstruct individual genomes in various settings. It may be possible to release noisy versions of SNP frequency data, but this would be equivalent to releasing (our estimated K from the public database). With our EM algorithm, we have demonstrated that it is still possible to do some genotypic reconstruction with a noisy , but this becomes harder as the noise in increases. On the other hand, providing a very noisy may be of limited utility to the scientific community.
-
2.
If a genetic dataset is intended to serve for multiple complementary analyses, it is important that all study participants are used in every analysis performed. If there are missing phenotypic data from a few individuals, they should not be included in any of the analyses performed, or their privacy may be compromised.
-
3.
When multiple individuals are added in between two studies, then the ability to reconstruct the genomes depends on the number of SNPs, being large relative to the number of individuals. In particular, if m new dogs are added, exact reconstruction is only possible by using the approach in Section 3.2 if the number of SNPs . Thus, we suggest to avoid releasing multiple studies that differ by fewer than individuals.
4.2. Extensions and future work
Although we have analyzed the case where the genome is represented by binary values of 0 or 1, often studies instead count the number of times each allele is present, which would lead to a design matrix containing values or 2. In this scenario, K no longer contains the frequencies of SNPs and their co-occurrences, but it is something slightly more complicated that we describe in Section 5.8. This does not dramatically change the approach in this study, except in that the vector dm can take possible values, rather than . In practice, then, studies that use allele counts are somewhat more robust to attacks; the multiple dog reconstruction attack would likely be ambiguous if , rather than .
A possible countermeasure to our reconstruction attack could consist of randomly perturbing the GRS models before releasing them, as done in differentially private linear regression (Wang, 2018). However, a naive application of this strategy could destroy the utility of the models. A formal and empirical analysis of the effectiveness of such protection against reconstruction attacks, as well as of the usefulness of the resulting GRS models to genomic researchers, is beyond the scope of this article and left for future work.
Another countermeasure is to refrain from releasing precise information about the population structure of the study population to prevent the attacker from estimating K effectively. This would, however, limit the utility of the research study, because the researchers would not know to what populations the research applies.
Our work has a number of limitations. For instance, we only test our EM algorithm on dog data. Dog populations may have different population structures than human populations due to selective breeding, so in the future we aim at investigating how properties of population structure will impact our ability to estimate K and the accuracy of our reconstruction attack.
It may seem unlikely on the surface that two GWAS analyses will include nearly the same participants. One potentially common setting where this could arise is when a single study collects both genotype and phenotype data from a single set of participants, and it releases multiple models to predict multiple traits. In this case, there may be a small number of individuals who are used in one analysis, but not the other; for instance, there may be a small subset of participants who skip a particular survey question that was used to collect phenotype information, and this is, indeed, evident in a recent study (Jiang et al., 2019). In such settings, it could be very possible for multiple released GRS models to be computed on sets of individuals that differ by only a few participants. In future work, we aim at extending our analysis and attack to settings where multiple GRS models are released, each predicting different but highly correlated traits.
5. Appendix
5.1. Experimental details
5.1.1. Cornell dog database
To experimentally test the reconstruction attacks, we used data from the Cornell Dog Genome Database, which contains data about SNPs from a wide range of dog breeds and a number of associated phenotypic traits. The two traits we focused on were average breed weight and average breed height, because these two phenotypes had the fewest number of missing values. For the initial investigation, we binarized the genotype matrix—considering all heterogenous alleles to have a value of 1. (We also repeated the analysis with the original genotype matrix.) Only common SNPs (i.e., SNPs that were found in 25%–75% of the dogs) were used, leaving 23,497 SNPs. For each linear model built, dogs were randomly sampled as the “private” dataset and SNPs were randomly selected. To ensure that the SNPs that were sampled were spatially distributed, the SNPs were randomly sampled in a stratified way, so one SNP was selected in every -sized bin.
5.1.2. Experiment with imprecise K
First, two linear models were constructed to predict average breed weights: one with the M = 1000 randomly sampled dogs and another that contained one additional randomly sampled dog. This gives and . To mimic the process of estimating K from a public database, we randomly sampled an additional 200, 400, or 800 dogs that were not included as part of the original set and used this to estimate K, which we denote by . Now, we could calculate and compare this with the known for the additional dog from the second study. These additional dogs are taken from a third “test” dataset, disjoint from both the public and private data. The plots in Figures 4 and 5 are produced by re-running the algorithms across 10 random public/private/test splits, where the “test” dataset has 50 dogs that are each individually considered as candidates for the (M + 1)th dog added to the private dataset.
5.2. Adding multiple dogs
Here, we explain Eqs. (7) and (8). Note that the former is a special case of the latter so we will only explain the latter in detail. First note that, by definition,
Substituting these into the left hand side of the following equation gives the right hand side:
(9) |
This equation can be rearranged to give
Defining the length m vector yields the form used in Eq. (8). For the special case of , Cm is a scalar and we recover Eq. (7).
5.3. Algorithm for identifying unique genotypes of multiple dogs when K is known
Although the simple approach described in the main article will work in many cases, there are a few special circumstances where a more complex algorithm may be required. In particular, it would not work if there are combinations of SNPs that are not observed among the individuals added to the database. For instance, if there is not an SNP location where the first individual has an SNP variant and the others do not, then we would miss the corresponding value in Cm. However, it is still possible to identify all the values in Cm through a more complex algorithm:
-
1.
First, extract all unique, nonzero values from dm.
-
2.
Find the sum of all pairs of values in Eq. (1).
-
3.
Find all values that are in Eq. (1), but not in Eq. (2).
-
4.
If there are exactly m values in Eq. (3) and the sum of these values equal the last value of dm (corresponding to the intercept term), then we have found the correct values of Cm.
-
5.
Otherwise, this suggests that there are one or more elements of Cm that are missing from Eq. (3) and possibly a few values in Eq. (3) that are not in Cm.
-
6.
Begin by subtracting every pair of values in Eq. (3). These are now also potential values of Cm
-
7.
Search for a set of m values from Eqs. (3) and (6) that sum to the last element of dm. There may be more than one set of values for which this is true.
-
8.
If this search is unsuccessful, repeat steps 6–7. Eventually, a set of m values summing to dm should be found.
-
9.
If more than one possible set of values is found for Cm in Eq. (7), it is still possible to compare these sets and identify which is the most likely to contain the true values of Cm. For each possible Cm vector, a set of genotypes can be constructed for the m additional individuals. Using the frequencies of each SNP, it is possible to calculate the probability of observing each genotype. The set of values that produces the most likely genotypes for the m individuals is most likely to be the correct one.
In addition, this algorithm depends on the fact that it is extremely unlikely that if someone were to sample three random continuous numbers i, j, and k, it would just so happen that . There is an extremely small chance that a value of Cm would be un-discoverable because of a coincidence of this nature.
5.4. Estimating K
If the true matrix K is unknown, it can be estimated with public data. We denote this estimator by . For to be an accurate estimate, the data that it is generated from must be drawn from the same (or a sufficiently similar) population as that used in the private study. We will model this assuming no discrepancy between population distributions; however, when we discuss how to evaluate whether the estimate is good, that assessment should account for this systematic error as well. Later, we will be primarily concerned with the error due to the subsampling in both the private and public datasets.
Also of note, the same analysis given next applies to the scenario in which the researchers do not release K, but rather release a “noisy” version of K, where the noise is drawn from a normal distribution. They might consider doing this if they feel that releasing information about SNP frequencies is important for the research community, but they do not wish to release the real K because this would allow for an exact reconstruction of genotype. This noisy K could still be used in a reconstruction attack in the same way as an estimate of K from a public database is used.
5.4.1. Analytic bound on
For convenience, we only consider the case of adding a single individual, though the generalization is quite straightforward. If is substituted for K in our reconstruction Eq. (7), we get an approximation of that we denote . We would like to bound the (relative) error between and . Later, we ignore the constant factors C and for simplicity, noting that these scaling factors are estimated from the resulting or anyway. We, thus, consider and . Using on vectors, and also on matrices, we denote the corresponding operator norm. The relative error between and is given by:
Note that and hence
(10) |
This means that we can bound the error by two quantities. The term is bounded earlier by , which is finite as soon as K is nonsingular. This is not a strong requirement, as in the case of linear regression it is required for and to exist. Note that in the case of L2-regularized linear regression (i.e., ridge regression), K is replaced by , where is the regularization parameter, and we can directly bound this term by .
The key term in Eq. (10) is , the error in estimating K by . Let us assume that the public database used to obtain follows the same distribution as the private database used to fit the GRS models. Denote by the number of individuals used to estimate . Then, under classic boundedness assumptions and leveraging matrix concentration inequalities such as matrix Bernstein Tropp (2015), we can show that . This shows that the error in estimating K is small as long as the private and public databases are large enough.
5.4.2. Modeling the error in
In this section, we define a model to capture the error in , which leads to the EM algorithm for estimating , which is used in the experiments. As our estimated drifts from the true K, this expression would produce a wider range of values than just 0 and C.
Let be independent noise, which we assume corrupts each element of ; that is, given the estimated matrix , suppose
(11) |
for some small . This is clearly an oversimplification (as we know K is e.g., bounded and symmetric), but it is a useful starting point that allows derivation of a simple estimation algorithm. For notational brevity, in this and the following section we define the vector
(12) |
which corresponds to the difference between the two GRS model parameter vectors when m additional dogs are added. Given the true value of K, the system of equations
relates the known quantity and the Gaussian-distributed K with the matrix and the unknown values in the vector . This breaks down into a sum across the entries in cm, with
We need to estimate all m constants .
If K is Gaussian [following Eq. (11)], then the linear transformation is Gaussian as well. We denote each of the rows of K as a vector ki, ; then, for each row, the scalar value
meaning overall the vector is distributed .
With some algebraic re-arrangement, and since for the true underlying value of K we have , we can write this as
where and are parameters we need to estimate. The vector is observed “data,” computed from the public SNP database and the two released parameter vectors. We can model each of the entries of , which are zeros and ones, as Bernoulli distributions, whose prior probabilities correspond to the public dataset estimated frequencies. This suggests a model for that is akin to a constrained mixture of Gaussians.
For the special case of , with only a single scalar C and vector , this reduces to
(14) |
5.4.3. Parameter estimation with EM
We now can use this model to derive EM algorithms for finding maximum likelihood estimates of all parameters, and estimate the posterior distribution over SNP variants for the added individuals between the two studies.
For notational convenience in this section, denote the entries of the m new individuals as , for and , and let zj denote the column vector . Denote the prior probabilities for each i as , where are the (public) population frequencies for each SNP, and . Let denote the entries of the fixed (observed) vector , which in this simplified notation is distributed as
Supposing we know values of , to estimate the entries of we want to find ,
An EM algorithm to estimate would proceed by alternately:
-
1.
Given , estimate the posterior distribution ;
-
2.
Given the posterior , maximize with respect to cm and .
For each , we can analytically compute the distribution
(15) |
the conditional probability of each particular entry taking a value of 1, rather than 0, for each zj given the values of the other zk, . Note that each SNP location i can be treated independently; however, each of the individuals individuals must be considered jointly.
5.4.4. Exact EM algorithm when 1 individual is added
For the special case of , this yields a tractable exact EM algorithm. Since there are no other individuals, Eq. (15) reduces to , with
(16) |
the posterior probability of each particular entry taking a value of 1, rather than 0. To maximize with respect to C and , we first compute the derivatives of
This yields
which we set equal to zero and solve to find
(17) |
These updates taken together can be used to define an EM algorithm that optimizes the values of C and , despite the fact that the entries of are unknown; once C and are then known, the vector will give probability estimates for each entry of .
The overall EM algorithm can be summarized by the following iterative updates:
To initialize the algorithm, we can set to some initial probabilities, and find initial values for ; we experimented with setting both to the prior probabilities per SNP estimated from the public data and to the vector of all zeros (corresponding to a “hard” initialization at the value of the baseline estimate), and we found no qualitative difference in performance.
5.4.5. Stochastic EM (SEM) for multiple individuals
For , the exact posterior depends on all individuals and does not have a compact form. However, we can easily approximate the posterior by Gibbs sampling using Eq. (15), which describes the full conditional distribution , iteratively drawing samples for each individual j. We can use this for parameter estimation of and each by using the stochastic EM algorithm Celeux and Diebolt (1985), which differs from a standard EM algorithm in that the expectation step (evaluating the posterior) is replaced by Monte Carlo sampling. In this algorithm, we alternately
-
1.
draw approximate posterior samples of by one or more sweeps of Gibbs sampling, following Eq. (15);
-
2.
conditioned on the current sampled values , find values of and which maximize the likelihood .
Although this does not converge to an exact parameter value, under suitable conditions the algorithm converges in distribution to a Gaussian centered on the maximum likelihood estimate of the parameter. A point estimate can be extracted by averaging across many iterations after convergence.
In contrast to the EM updates, the updates for values of Cj and given actual sampled values of zj are straightforward and do not scale combinatorially in m. Optimizing cm corresponds to solving a least-squares problem, that is,
using the vector notation , has the solution
The maximum likelihood estimate of given this estimated is simply the mean squared error
To address permutation invariance in the entries , we enforce an ordering on the estimated values of Cj, with . This breaks the symmetry across the indices of the m new individuals added in the second study, and it is handled by a projection operation at each iteration, in which the estimated values are sorted in ascending order after each maximization step.
Empirical results quantifying the performance of this algorithm are shown in Figure 6, in an experimental setup similar to that for evaluating EM when a single dog is added to a dataset in the main article, with unknown K. A private dataset is assumed to contain 1000 individuals, whereas a separate public dataset of 800 is available; new individuals are added to the private dataset to produce two parameter vectors and . On average, the SEM algorithm predicts the correct SNP 75.5% of the time, relative to 71.5% for the “most common variant” baseline, a moderate improvement.
5.5. Scaling of EM algorithm with size of private dataset
Figure 7 demonstrates the change in accuracy of the EM algorithm over a range of different private database sizes. For this test, a synthetic dataset with 100 SNPs and 1,000,000 individuals is generated; 10,000 are held out as a public database, and 30 individuals are taken as a fixed test dataset of new dogs to add and are used to estimate EM algorithm accuracy, across increasingly large private database sizes. The algorithm has stable performance for increasingly large private databases.
5.6. Estimating with different SNP sets
Here, we analyze what can still be said in the event that the two studies do not use exactly the same set of SNPs. We will still assume that the sets of SNPs are considered to have a significant overlap.
For this purpose we will need a greater variety of notation. A primed variable denotes that it corresponds to the second set of SNPs, for example, is the co-occurrence matrix from the original M users for the second experiment. If a vector or matrix is surrounded by square brackets, this denotes the same object but with rows and columns corresponding to SNPs not in the overlap removed, for example, denotes the co-occurrence matrix from the first experiment restricted to the overlapping SNPs.
As described earlier, from the first experiment, we have
(21) |
and now, from the second experiment, we have
(22) |
Taking the difference between these expressions, as earlier, gives
(23) |
Restricting to the overlapping set gives that
(24) |
Noting that and that we get that
(25) |
Analogous to the previous cases, is a scalar that we can label C and we get
(26) |
Thus, if K is known, it can be used to deduce whether the additional individual has each of the SNPs in the overlapping set. If K is not known exactly, it can be estimated from public data just as in the same SNP case.
5.7. Case in which each GWAS study adds two new sets of participants
This article mostly explores the case in which one study's participants are a subset of the other study's participants. Here, we demonstrate that this is equivalent to the case where each of the two studies contains a small number of participants that are not found in the other study.
In particular, let us say that the first study has participants and the second study has participants, where the first M participants are shared between the studies, but there are a participants that are found in the first study but not the second, and b participants that are found in the second study but not in the first. Following on from Eq. (9), we see that:
Let us define the following matrix obtained by concatenating the two genotype matrices:
(27) |
and the following length vector:
(28) |
Then, this gives us:
(29) |
This means that having two nonoverlapping participant sets is equivalent to the setting in which the first study is a subset of the second (only m is now ).
5.8. Description of K when the genotypes are non-binary
In many cases, GRS are calculated on genotype matrices that are non-binary. In particular, they may take on three discrete values 0, 1, and 2, where 0 indicates that the most common variant is homozygous, 1 indicates that the individual is heterozygous for the uncommon variant, and 2 indicates that the individual is homozygous for the uncommon variant.
If this is the case, the description of K will change. However, it is still the case that the entries of K depend only on the SNP frequencies and SNP co-occurrence frequencies in the dataset, and that knowledge of SNP frequencies and pairwise co-frequencies from the original study, are all that is required to compute K.
– For : where is the frequency of individuals being heterozygous for the uncommon variant and is the frequency of individuals being homozygous for the uncommon variant.
– For and : where is the frequency that both SNPs are simultaneously heterogygous, is the frequency that one SNP is homozygous for the rare variant and the other is heterogygous simultaneously, and is the frequency that that uncommon variants are found to be homozygous simultaneously.
– For and : .
– Finally, .
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
This project was funded by the Alan Turing Institute Research Fellowship under EPSRC Research grant (TU/A/000017); EPSRC/BBSRC Innovation Fellowship (EP/S001360/1), and under the EPSRC grant EP/N510129/1. It was also partly funded by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015–2020.
References
- Belsky, D.W., Moffitt, T.E., Sugden, K., et al. 2013. Development and evaluation of a genetic risk score for obesity. Biodemography Soc. Biol. 59, 85–100 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cai, R., Hao, Z., Winslett, M., et al. 2015. Deterministic identification of specific individuals from GWAS results. Bioinformatics 31, 1701–1707 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Celeux, G., and Diebolt, J.. 1985. The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Stat. Q. 2, 73–82 [Google Scholar]
- Chouraki, V., Reitz, C., Maury, F., et al. 2016. Evaluation of a genetic risk score to improve risk prediction for Alzheimer's disease. J. Alzheimers Dis. 53, 921–932 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Day, F.R., Thompson, D.J., Helgason, H., et al. 2017. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat. Genet. 49, 834–841 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dwork, C., Smith, A., Steinke, T., et al. 2015. Robust traceability from trace amounts. 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. Berkeley, California. pp. 650–669 [Google Scholar]
- Hayward, J.J., Castelhano, M.G., Oliveira, K.C., et al. 2016. Complex disease and phenotype mapping in the domestic dog. Nat. Commun. 7, 10460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang, Z., Ayday, E., Fellay, J., et al. 2015. Genoguard: Protecting genomic data against brute-force attacks. 2015 IEEE Symposium on Security and Privacy. San Jose, California. pp. 447–462 [Google Scholar]
- Im, H., Gamazon, E., Nicolae, D., et al. 2012. On sharing quantitative trait gwas results in an era of multiple-omics data and the limits of genomic privacy. Am. J. Hum. Genet. 90, 591–598 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang, L., Zheng, Z., Qi, T., et al. 2019. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 [DOI] [PubMed] [Google Scholar]
- Knowles, J.W., and Ashley, E.A.. 2018. Cardiovascular disease: The rise of the genetic risk score. PLoS Med. 15, e1002546. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mittos, A., Malin, B., and Cristofaro, E.D.. 2019. Systematizing genome privacy research: A privacy-enhancing technologies perspective. Proc. Privacy Enhanc. Technol. 2019, 87–107 [Google Scholar]
- Qi, L., Ma, J., Qi, Q., et al. 2011. Genetic risk score and risk of myocardial infarction in hispanics. Circulation 123, 374–380 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raisaro, J.L., Tramèr, F., Ji, Z., et al. 2017. Addressing beacon re-identification attacks: Quantification and mitigation of privacy risks. J. Am. Med. Inform. Assoc. 24, 799–805 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shringarpure, S., and Bustamante, C.. 2015. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Simmons, S., and Berger, B.. 2015. One size doesn't fit all: Measuring individual privacy in aggregate genomic data. 2015 IEEE Security and Privacy Workshops. San Jose, California. pp. 41–49 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tropp, J.A. 2015. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8, 1–230 [Google Scholar]
- Visscher, P.M., Wray, N.R., Zhang, Q., et al. 2017. 10 years of gwas discovery: Biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 [DOI] [PMC free article] [PubMed] [Google Scholar]
- von Thenen, N., Ayday, E., and Cicek, A.E.. 2018. Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics 35, 365–371 [DOI] [PubMed] [Google Scholar]
- Wang, Y.X. 2018. Revisiting differentially private linear regression: Optimal and adaptive prediction & estimation in unbounded domain. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). Monterey, California. pp. 93–103 [Google Scholar]
- Zhao, X., Xi, B., Shen, Y., et al. 2014. An obesity genetic risk score is associated with metabolic syndrome in Chinese children. Gene 535, 299–302 [DOI] [PubMed] [Google Scholar]