Identification of genetic outliers due to sub-structure and cryptic relationships

Daniel Schlauch; Heide Fier; Christoph Lange

doi:10.1093/bioinformatics/btx109

Bioinformatics. 2017 Mar 3;33(13):1972–1979. doi: 10.1093/bioinformatics/btx109

Editor: Oliver Stegle

PMCID: PMC5870703 PMID: 28334167

Abstract

Motivation

In order to minimize the effects of genetic confounding on the analysis of high-throughput genetic association studies, e.g. (whole-genome) sequencing (WGS) studies, genome-wide association studies (GWAS), etc., we propose a general framework to assess and to test formally for genetic heterogeneity among study subjects. As the approach fully utilizes the recent ancestor information captured by rare variants, it is especially powerful in WGS studies. Even for relatively moderate sample sizes, the proposed testing framework is able to identify study subjects that are genetically too similar, e.g. cryptic relationships, or that are genetically too different, e.g. population substructure. The approach is computationally fast, enabling the application to whole-genome sequencing data, and straightforward to implement.

Results

Simulation studies illustrate the overall performance of our approach. In an application to the 1000 Genomes Project, we outline an analysis/cleaning pipeline that utilizes our approach to formally assess whether study subjects are related and whether population substructure is present. In the analysis of the 1000 Genomes Project data, our approach revealed subjects that are most likely related but had previously passed standard QC filters.

Availability and Implementation

An implementation of our method, Similarity Test for Estimating Genetic Outliers (STEGO), is available in the R package stego from GitHub at https://github.com/dschlauch/stego.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The fundamental assumption in standard genetic association analysis is that the study subjects are independent and that, at each locus, the allele frequency is identical across study subjects (Choi et al., 2009; Purcell et al., 2007; Yang et al., 2010). In the presence of population heterogeneity, e.g. population substructure or cryptic relatedness, these assumptions are violated. It can introduce confounding into the analysis and lead to biased results, e.g. false positive findings (Kang et al., 2010; Price et al., 2006; Ptak and Przeworski, 2002; Voight and Pritchard, 2005). Given the generality of the problem, it has been the focus of methodology research for a long time. An early approach, genomic control, was developed for candidate gene and later for genome-wide association studies (GWAS) (Bacanu et al., 2002; Devlin et al., 2001), adjusting the association test statistics at the loci of interest by an inflation factor that is estimated at a set of known null-loci. With the arrival of GWAS data, it became possible to estimate the genetic dependence between study subjects and the overall genetic variation for each study subject by computing the empirical genetic variance-covariance matrix between study subjects at a whole genome level. The genetic variance-covariance matrix can then be utilized in two ways to minimize the effects of population substructure on the association analysis.

The first method is to compute an eigendecomposition of the matrix and to include the eigenvectors that explain the most variation as covariates in the association analysis (Price et al., 2006, 2010). An alternative approach is to incorporate the estimated dependence structure of the study subjects directly into a generalized linear model and account directly for the dependence at the model-level (Lippert et al., 2011; Listgarten et al., 2012; Zhang et al., 2010). Both approaches have proven to work well in numerous applications. While the first approach is computationally fast and easy to implement, the direct modeling of the dependence structure between study subjects can be more efficient (Mathieson and McVean, 2012).

However, both approaches benefit if, prior to the analysis, study subjects whose genetic profile is very different from that of the other study subjects, i.e. ‘genetic outliers’, are removed from the dataset. A common practice is to examine plots of the leading eigenvectors visually and to identify outliers by personal judgment of how far study subjects lie from the ‘clouds’ of study subjects. As typically up to 10 eigenvectors have to be considered, this process of identifying outliers can become complicated and subjective. Alternatively, the software tool SMARTPCA (Patterson et al., 2006) provides a more quantitative utility for removal of outliers by iteratively recomputing PCs in the genetic data. The method assumes a set of unrelated individuals and uses the covariance-based genetic similarity matrix to identify these individuals.

Many methods exist for inferring relatedness which make the strong assumption of population homogeneity (Choi et al., 2009; Kang et al., 2010; Purcell et al., 2007; Yang et al., 2010). These methods have been shown to be biased in the context of population heterogeneity (Manichaikul et al., 2010). More recently, methods have been developed which attempt to estimate relatedness with population structure (Manichaikul et al., 2010; Thornton et al., 2012). These developments improve the ability to detect existing pedigrees, which can aid in the removal of individuals who violate homogeneity. However, there is currently no quantitative measure of homogeneity which can be used to test a dataset prior to the application of GWAS.

In this communication, we propose a formal statistical test that assesses whether two study subjects come from the same population and whether they are unrelated. The test statistic is based on an adaptation of the Jaccard Index which utilizes the idea that variants are differentially informative of relatedness based on their allele frequency. Recent work has shown that the Jaccard Index alone can be used to reveal finer scale population structure compared with existing methods such as EIGENSTRAT (Prokopenko et al., 2016). Furthermore, the distribution of our statistic can be derived under the null hypothesis, which makes it computationally fast, enabling the application to whole-genome sequencing data. Our measure has clearly defined properties which can be used to test for homogeneity in a population and, in particular, to identify individuals who are likely to be related in a study population. Application to the 1000 Genomes Project suggests that our approach is better suited to detecting sub-populations than the genetic variance-covariance approach. This is most likely attributable to the emphasis of our approach on small allele frequencies.

2 Materials and methods

Exploiting the information in rare variants (RVs), such as those with minor allele frequency (MAF) <1%, is fundamental to our method, which utilizes the facts that RVs are typically more recent than common variants and that many of them are population or family specific. Since allele frequencies can differentially confound association studies (Mathieson and McVean, 2012), we developed a method that utilizes the differential informativeness of variants by allele frequency to obtain a high resolution picture of population structure and to protect the association study against bias due to genetic confounding. Our method takes an intuitive, computationally straightforward approach to identifying similarity between two study subjects, and the resulting measure is directly linked to the kinship coefficient.

2.1 Similarity measure among haploid genomes

Consider n individuals (2n haploid genomes) with N independent variants, described by the genotype matrix $G_{2 n \times N}$. G is a binary matrix in which the value 1 indicates the presence of the minor allele and 0 indicates the major allele. We define the similarity index between two haploid genomes i and j, $s_{i, j}$, as

s_{i, j} = \frac{\sum_{k = 1}^{N} w_{k} G_{i, k} G_{j, k}}{\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1]}

(1)

where

w_{k} = \begin{cases} \binom{2 n}{2} / \binom{\sum_{l = 1}^{2 n} G_{l, k}}{2} & \text{if } \sum_{l = 1}^{2 n} G_{l, k} > 1 \\ 0 & \text{if } \sum_{l = 1}^{2 n} G_{l, k} \leq 1 \end{cases}

and $I [\cdot]$ is an indicator function, evaluating to 1 if true and 0 if false.

We can consider the weight variable, $w_{k}$, to be the inverse of the probability that two alleles selected randomly without replacement both belong to the set of minor alleles. Intuitively, for $\sum_{l = 1}^{2 n} G_{l, k} > 1$, $w_{k}$ is monotonically decreasing as the minor allele count increases (see Supplementary Materials).

In the absence of population structure, i.e. in a homogeneous population, we have

E (s_{i, j}) = 1

(2)

It therefore follows from the Central Limit Theorem that, in the absence of population structure, cryptic relatedness and dependence between loci (such as linkage disequilibrium), the distribution of the similarity index $s_{i, j}$ is approximately Gaussian,

s_{i, j} \sim N (1, σ_{i, j}^{2})

where the variance of $s_{i, j}$ can be estimated by

\hat{σ}_{i, j}^{2} = \widehat{\mathrm{Var}} (s_{i, j}) = \frac{\sum_{k = 1}^{N} (w_{k} - 1)}{{(\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1])}^{2}}

(3)

The similarity index $s_{i, j}$ provides an easily interpreted statistical test for evaluating possible relatedness between individuals in a purportedly homogeneous dataset of unrelated individuals. Note that the variance estimate (3) is independent of the samples i, j and depends only on the allele counts for each variant across the study group. See Supplementary Methods S1.2–S1.3.
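As an illustration of Equations (1) and (3), the following minimal Python sketch computes the weights, the pairwise haploid similarity scores and the variance estimate for a toy genotype matrix. The function names and the toy matrix are hypothetical and are not taken from the stego package. The sum in the variance estimate is restricted here to variants with a minor allele count above 1, which is an assumption about how Equation (3) is meant to be read.

```python
from math import comb


def variant_weights(G):
    """Weights w_k for a 0/1 haploid genotype matrix G (rows = 2n genomes)."""
    total_pairs = comb(len(G), 2)  # number of ways to pick 2 of the 2n genomes
    weights = []
    for k in range(len(G[0])):
        count = sum(row[k] for row in G)  # minor allele count at variant k
        weights.append(total_pairs / comb(count, 2) if count > 1 else 0.0)
    return weights


def haploid_similarity(G, i, j, weights):
    """Similarity index s_ij of Equation (1) for haploid genomes i and j."""
    informative = sum(1 for w in weights if w > 0)  # variants with count > 1
    if informative == 0:
        raise ValueError("no variant has a minor allele count above 1")
    shared = sum(w * G[i][k] * G[j][k] for k, w in enumerate(weights))
    return shared / informative


def similarity_variance(weights):
    """Variance estimate of Equation (3), summed over informative variants
    only (an assumption of this sketch)."""
    informative = sum(1 for w in weights if w > 0)
    return sum(w - 1 for w in weights if w > 0) / informative ** 2


# Toy data: 4 haploid genomes, 3 variants (the third is a singleton).
G = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
w = variant_weights(G)
pairs = [(i, j) for i in range(len(G)) for j in range(i + 1, len(G))]
scores = [haploid_similarity(G, i, j, w) for i, j in pairs]
mean_score = sum(scores) / len(scores)
variance = similarity_variance(w)
```

In this toy example the scores averaged over all six pairs of haploid genomes equal 1, in line with Equation (2).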

2.2 Similarity measure among diploid genomes

This approach is easily generalized to the diploid scenario. A diploid similarity score, $s^{(diploid)}$, is obtained by averaging the four pairwise haploid scores, $s^{(haploid)}$, between the two haploid genotypes of each person. For n individuals, with 2n genotypes per locus, the similarity between individuals i and j is defined as

s_{i, j}^{(diploid)} = \frac{\sum_{k = 1}^{N} \sum_{l = 1}^{2} \sum_{m = 1}^{2} [w_{k} G_{i_{l}, k} G_{j_{m}, k}] / 4}{\sum_{k = 1}^{N} I [(\sum_{l = 1}^{n} [G_{l_{1}, k} + G_{l_{2}, k}]) > 1]}

where $G_{i_{2}, k}$ refers to the second haploid genotype of individual i at locus k.

Here it becomes clear that the method can be applied to phased and unphased data alike. For an unphased data matrix $H_{n \times N}$, where $H_{i, k} \in \{0, 1, 2\}$ is the number of minor alleles carried by subject i at variant k, the score becomes

s_{i, j}^{(diploid)} = \frac{\sum_{k = 1}^{N} [w_{k} H_{i, k} H_{j, k}] / 4}{\sum_{k = 1}^{N} I [(\sum_{l = 1}^{n} H_{l, k}) > 1]}

This formulation will have the same mean

E [s_{i, j}^{(d i p l o i d)}] = 1

and assuming independence of each individual’s haploid genomes, such as in the absence of inbreeding,

\widehat{\mathrm{Var}} (s_{i, j}^{(diploid)}) = \frac{\widehat{\mathrm{Var}} (s_{i, j}^{(haploid)})}{4} = {\hat{σ^{2}}}_{i, j}

which yields the asymptotic result

s_{i, j} \sim N (1, {\hat{σ^{2}}}_{i, j})

(4)
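The unphased form of the diploid score can be sketched in a few lines of Python. This is a minimal illustration under the definitions above, not the stego implementation; the function name and the toy matrix H are hypothetical.

```python
from math import comb


def diploid_similarity(H, i, j):
    """Diploid similarity between individuals i and j from an unphased
    minor allele count matrix H (rows = n individuals, entries 0, 1 or 2)."""
    total_pairs = comb(2 * len(H), 2)  # pairs among the 2n haploid genomes
    shared = 0.0
    informative = 0
    for k in range(len(H[0])):
        count = sum(row[k] for row in H)  # minor allele count at variant k
        if count > 1:
            informative += 1
            w_k = total_pairs / comb(count, 2)
            shared += w_k * H[i][k] * H[j][k] / 4
    if informative == 0:
        raise ValueError("no variant has a minor allele count above 1")
    return shared / informative


# Toy data: 2 individuals, 3 variants (the third is a singleton).
H = [
    [2, 1, 0],
    [0, 1, 1],
]
score = diploid_similarity(H, 0, 1)
```

Dividing by 4 in the numerator reflects the averaging over the four haploid pairings, so the same value is obtained from phased data by averaging the four haploid scores.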

2.3 Tests of heterogeneity

We can test the null hypothesis that population structure does not exist and all subjects are unrelated, with respect to the alternative that at least one pair of individuals is related.

\begin{matrix} H_{0} : μ_{i, j} = 1 \forall i, j \in 1 \dots n \\ H_{A} : \exists i, j \in 1 \dots n | μ_{i, j} \neq 1 \end{matrix}

Under the null hypothesis, we have clearly defined the distribution of test statistics. However, violations of homogeneity may come in many forms and thus there is no most powerful test which can be applied to detect all possible alternatives. Below, we examine two separate rejection criteria designed for two types of heterogeneity—population structure and cryptic relatedness.

Given the complex nature with which population structure may manifest itself, we recommend the conservative Kolmogorov–Smirnov test for detection of population structure.

K = \sup_{x} | F_{s} (x) - Φ (x) |

where F_s is the empirical distribution function for s, and $Φ (x)$ is the cumulative distribution function defined in (4). Under the null, K follows the Kolmogorov distribution (Wang et al., 2003), so we define $K_{α}$ as the value such that $P (K > K_{α} | H_{0}) = α$ , and reject homogeneity in the data if $K > K_{α}$ .
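The statistic K can be computed directly from the sorted similarity scores. The sketch below is a minimal Python illustration that assumes a common null standard deviation `sigma` for all pairs (as in Equation (3)); the function name and example values are hypothetical, and the comparison against $K_{α}$ is left to a standard Kolmogorov distribution routine.

```python
from statistics import NormalDist


def ks_statistic(scores, sigma):
    """sup_x |F_s(x) - Phi(x)| against the null N(1, sigma^2)."""
    null = NormalDist(mu=1.0, sigma=sigma)
    ordered = sorted(scores)
    m = len(ordered)
    k_stat = 0.0
    for rank, s in enumerate(ordered, start=1):
        cdf = null.cdf(s)
        # The empirical CDF jumps from (rank - 1) / m to rank / m at s,
        # so both sides of the jump are compared with the null CDF.
        k_stat = max(k_stat, rank / m - cdf, cdf - (rank - 1) / m)
    return k_stat


# Scores placed at evenly spaced null quantiles give the smallest
# attainable statistic for m = 4 observations, which is 0.5 / 4.
sigma = 0.1
scores = [NormalDist(1.0, sigma).inv_cdf((r - 0.5) / 4) for r in range(1, 5)]
k = ks_statistic(scores, sigma)
```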

While this approach is effective for a large number of small effects, as would be expected with subtle population structure, it will not be particularly effective at detecting a small number of large effects, as expected with cryptic relatedness. For this scenario we recommend a test with more power at the far right tail of the distribution.

In a homogeneous dataset lacking relatedness, we consider each of the $(\begin{matrix} n \\ 2 \end{matrix})$ comparisons to be independent. To achieve a family-wise error rate α, we use the Šidák procedure (Šidák, 1967) or the approximately equivalent Bonferroni procedure. We reject the null at the α level when we obtain similarity scores in the rejection region

R : \max_{i, j} (s_{i, j}) > 1 - \sqrt{\hat{σ}_{i, j}^{2}} \, \mathrm{probit} (\frac{α}{\binom{n}{2}})
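A minimal Python sketch of this cutoff follows. It assumes the threshold is expressed on the scale of $s_{i, j}$, i.e. the normal quantile is multiplied by the null standard deviation implied by $s_{i, j} \sim N (1, σ_{i, j}^{2})$; the function name and the example inputs are hypothetical.

```python
from math import comb, expm1, log1p, sqrt
from statistics import NormalDist


def relatedness_cutoff(n, variance, alpha=0.01, method="sidak"):
    """Family-wise cutoff for max(s_ij) over all C(n, 2) pairs."""
    m = comb(n, 2)
    if method == "sidak":
        # 1 - (1 - alpha)^(1/m), written to stay accurate for large m.
        per_test_alpha = -expm1(log1p(-alpha) / m)
    elif method == "bonferroni":
        per_test_alpha = alpha / m
    else:
        raise ValueError("method must be 'sidak' or 'bonferroni'")
    # probit of a small probability is negative, hence the sign change.
    z = -NormalDist().inv_cdf(per_test_alpha)
    return 1 + sqrt(variance) * z


# Hypothetical example: 100 individuals, null variance of 0.01.
cutoff_sidak = relatedness_cutoff(100, 0.01)
cutoff_bonferroni = relatedness_cutoff(100, 0.01, method="bonferroni")
```

The two corrections give nearly identical cutoffs, with the Bonferroni value never below the Šidák value.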

2.4 Estimating cryptic relatedness

The measure described here is particularly powerful for measuring relatedness. Intuitively, we can imagine two subjects with a kinship coefficient, $ϕ$, which is the probability that a randomly chosen allele from each person is identical by descent (IBD). For an allele carried by one person, the probability that it is also carried by a related person with kinship coefficient $ϕ$ is $ϕ + (1 - ϕ) \times p$, where p is the allele frequency in the population. We can clearly see that for rare alleles, such that p is small compared to $ϕ$, there will be a much larger relative difference in the probability of shared alleles among related individuals ($ϕ > 0$) compared to unrelated individuals ($ϕ = 0$). Given that STEGO weights these rarer alleles more highly, sensitivity to relatedness is increased.

Consider a coefficient of kinship between two individuals i, j, $ϕ_{i, j} > 0$, with no other population structure present in the data. For an individual variant, k, with sufficient allele frequency, the expected contribution to the statistic for an allele from each individual, $s_{i_{1}, j_{1}, k}$, is

E (s_{i_{1}, j_{1}, k} | ϕ_{i, j}) = 1 + ϕ_{i, j} [p_{k} \frac{(\begin{matrix} 2 n \\ 2 \end{matrix})}{(\begin{matrix} p_{k} (2 n - 2) + 2 \\ 2 \end{matrix})} - 1]

and the expectation for the similarity score between those haploid genomes is

\begin{matrix} E (s_{i_{1}, j_{1}} | ϕ_{i, j}) = \\ \frac{\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1] [1 + ϕ_{i, j} [p_{k} \frac{2 n (2 n - 1)}{(p_{k} (2 n - 2) + 2) (p_{k} (2 n - 2) + 1)} - 1]]}{\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1]} \end{matrix}

(5)

It can be seen that in the presence of cryptic relatedness, $ϕ_{i, j} > 0$ ,

E (s_{i_{1}, j_{1}} | ϕ_{i, j} > 0) > 1

With $\sum_{i = 1}^{2 n} G_{i, k}$ as the maximum likelihood estimator for $2 n p_{k}$, by the invariance principle, $w_{k}$ is a consistent estimator for $\binom{2 n}{2} / \binom{p_{k} (2 n - 2) + 2}{2}$.

This yields a maximum likelihood estimate of this kinship defined as

{\hat{ϕ}}_{i, j} = \frac{s_{i, j} - 1}{[\frac{\sum_{k = 1}^{N} \hat{p_{k}} w_{k}}{\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1]} - 1]}

(6)

with

\widehat{\mathrm{Var}} ({\hat{ϕ}}_{i, j}) = \frac{\hat{σ_{i, j}^{2}}}{{[\frac{\sum_{k = 1}^{N} \hat{p_{k}} w_{k}}{\sum_{k = 1}^{N} I [\sum_{l = 1}^{2 n} G_{l, k} > 1]} - 1]}^{2}}
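Equation (6) and its variance translate into a short calculation. The Python sketch below is illustrative only: the function name, argument names and example numbers are hypothetical, and the weights and estimated allele frequencies are assumed to have been computed beforehand.

```python
def kinship_estimate(s_ij, weights, freqs, variance):
    """Return (phi_hat, var_phi_hat) following Equation (6).

    weights  : w_k per variant (0 for variants with allele count <= 1)
    freqs    : estimated minor allele frequency p_k per variant
    variance : estimated null variance of s_ij
    """
    informative = sum(1 for w in weights if w > 0)
    # Average of p_k * w_k over informative variants, minus 1.
    scale = sum(p * w for p, w in zip(freqs, weights)) / informative - 1
    return (s_ij - 1) / scale, variance / scale ** 2


# Hypothetical inputs: two informative variants and one singleton.
phi_hat, var_phi_hat = kinship_estimate(
    s_ij=2.625,
    weights=[100.0, 25.0, 0.0],
    freqs=[0.1, 0.2, 0.01],
    variance=0.0169,
)
```

With these made-up inputs the scale factor is 6.5, so a similarity score of 2.625 maps to an estimated kinship of 0.25.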

For example, for a pair of first cousins $(ϕ = .0625)$ in an otherwise homogeneous study group of unrelated individuals, with $\mathrm{MAF} \sim \mathrm{Uniform} (.02, .1)$, we can directly calculate the expectation of their similarity statistic, $s_{i, j}$:

E (s_{i, j} | ϕ = .0625, \text{no other structure}) \approx 2.19
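The value of roughly 2.19 can be checked numerically from Equation (5) by averaging the per-variant expectation over the stated MAF range. The Python sketch below assumes a large sample (the choice of n = 1 000 000 individuals is hypothetical) and uses a simple midpoint rule; for small n the finite-sample weights give a somewhat lower value.

```python
def expected_similarity(phi, n, maf_low, maf_high, steps=10_000):
    """Average of 1 + phi * (p * w(p) - 1) for p uniform on [maf_low, maf_high]."""
    width = (maf_high - maf_low) / steps
    total = 0.0
    for step in range(steps):
        p = maf_low + (step + 0.5) * width  # midpoint of the sub-interval
        count = p * (2 * n - 2) + 2         # expected minor allele count
        w = 2 * n * (2 * n - 1) / (count * (count - 1))
        total += 1 + phi * (p * w - 1)
    return total / steps


expected = expected_similarity(phi=0.0625, n=1_000_000, maf_low=0.02, maf_high=0.1)
```

In the large-sample limit $p_{k} w_{k}$ approaches $1 / p_{k}$, so the result is close to $1 + ϕ (\ln 5 / 0.08 - 1)$.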

2.5 Statistical power to detect outliers

The properties of this similarity measure lend themselves to straightforward power calculations. It is often of interest to consider some coefficient of relatedness, γ, that is the largest acceptable for a study. Setting $ϕ \geq γ$ allows for the calculation of the probability of obtaining a pair of samples inside the rejection region given two unacceptably closely related individuals.

P (\text{Reject } H_{0} | ϕ_{i, j} = γ) = α + (1 - α) (1 - Φ (\frac{μ_{i, j} - 1}{\sqrt{{\hat{σ^{2}}}_{i, j}}}))

(7)

where $Φ (x)$ is the cumulative distribution function for a standard normal random variable. Also note that this power is computed under the assumption of homogeneity among all unrelated individuals, which will yield a conservative estimate of the probability of rejection. The presence of unknown population structure will necessarily increase the power of the test.

Any study seeking to demonstrate the homogeneity of its participants quantitatively can report this statistic, which gives the probability that heterogeneity would have been detected had relatedness of some specified degree, γ, been present.

3 Results

3.1 Identification of relatedness and structure in 1000GP data

We applied our method to data from the 1000 Genomes Project (TGP) (1000 Genomes Project Consortium, 2012, 2015), an international consortium which has sequenced individuals from 26 distinct populations sampled from around the globe.

These populations were not identified by the TGP to have cryptic relatedness or had known cryptic relatedness removed (Nemesh and McCarroll, 2012). However, subsequent analyses have discovered numerous inferred relationships closer than first cousins (Al-Khudhair et al., 2015; Fedorova et al., 2016; Gazal et al., 2015).

Phase 3 of the 1000 Genomes Project contains 2504 individuals with a combined total of over 80 million variants. To test STEGO, we selected a subset of approximately 100 000 variants across each of the 26 populations, which limited the impact of linkage disequilibrium (Price et al., 2008) and increased the independence of consecutive measurements (see Supplementary Methods). These 100 000 variants were each chosen from 100 000 separate blocks based on low minor allele frequency and a quality-control cutoff of 1% (Supplementary Methods S1.4, Supplementary Fig. S5). STEGO was then run on each of these populations separately to test for heterogeneity and relatedness within population groups (Figs 1 and 3A).

Fig. 1. — Distribution of similarity coefficients for each of the 26 populations in the 1000 Genomes Project. Homogeneous populations lacking cryptic relatedness should be expected to exhibit distributions centered around 1 with no outliers. The red dotted vertical line on each plot indicates the family-wise $α = .01$ level cutoff for $(\begin{matrix} n \\ 2 \end{matrix})$ comparisons. The most significant related pair is labeled for each population with the estimated kinship for that pairing indicated in blue. The p-value for the KS test for homogeneity is reported for each population. Many of the population groups do demonstrate the null behavior (e.g. JPT, KHV, FIN), however, a number of populations show the presence of extreme outliers (e.g. STU, PUR) or systematic right skew (e.g. MXL, PEL)

Our investigation revealed a great deal of variation in the presence of cryptic relatedness and population structure across the 26 populations of the study. Under the assumptions that each study contained a homogeneous population of unrelated individuals, only a handful of groups contained neither large outliers nor heavily inflated numbers of significant results.

We defined the presence of population structure as applying to those populations which deviated from the normal distribution defined under the null model. From Equation (3), we have the expected distribution under H₀ which we tested for in each of the populations using a standard Kolmogorov-Smirnov test. Using a significance cutoff of $α = .01$ , 15 of the 26 populations were found to have violated population homogeneity.

In addition to investigating population structure, we examined the presence of cryptic relatedness in the study. We defined related pairs as those which exceed the cutoff for a family-wise error rate of $α = .01$ and were estimated to have a kinship coefficient $\hat{ϕ} > \frac{1}{32}$, which approximately corresponds to half first cousins. By this measure, cryptic relatedness was observed in all but six of the 26 populations. Eleven pairs of first order (parent-offspring or full sibling) relationships were detected among individuals within the same population group, $(.2 < \hat{ϕ_{i, j}} < .3)$, a set of pairings which corresponds identically with the conclusions of Gazal et al. (2015).

Inference on our kinship estimate is made under the assumption of homogeneity of the background study population. Apparently significant relatedness may instead arise because the variance of the similarity score is inflated in the presence of population structure, so cryptic relatedness identified in this manner should be interpreted with caution in populations which contain identified structure. However, in populations which do not exhibit detectable structure, we still find many instances of related individuals in this study. For example, two individuals from the ACB population (African Caribbeans in Barbados) produced an $s_{i, j}$ score of 2.6 $(p < 10^{- 30})$, whereas no other pairing exceeded the family-wise cutoff of 1.3 (Fig. 1). Using Equation (6), the estimated coefficient of kinship is $\hat{ϕ} = .27$, suggesting that those individuals are first degree relatives. Additionally, two pairs of individuals in the STU population (HG03899/HG03733 and HG03754/HG03750) were both estimated to have a kinship coefficient $\hat{ϕ} \approx .25$, similarly indicating a relatedness of the first degree.

Interestingly, not all related pairs belonged to the same population groups. We additionally discovered a pair of individuals, HG03998 from the STU population and HG03873 from the ITU population, which exhibited strikingly high relatedness. The plot below (Fig. 2) was generated by placing HG03998 into the ITU population and running STEGO on that population. An individual who belongs to a separate population from all others in a dataset would be expected to produce similarity scores less than 1. However, the similarity between HG03998 and HG03873 was found to be s = 3.9, significant at $p < 10^{- 30}$ with an estimated relatedness $\hat{ϕ} > .25$ , suggesting that these individuals are first order relatives despite belonging to different population groups. Both populations were sampled from locations in the United Kingdom, further supporting the evidence that these individuals are related.

Fig. 2. — Distribution of all pairwise s statistics for population Indian Telugu from the UK (ITU) with individual HG03998 included. HG03998 is now believed to be related to HG03873, despite being labeled in the Sri Lankan Tamil from the UK (STU) population. The family-wise $α = .01$ cutoff is indicated by the dotted red vertical line and the s statistic for HG03998 and HG03873 is seen as an extreme outlier at 3.97

With strong evidence of relatedness in the data, we sought to test our method on pruned data which contained no known related pairs. Gazal et al. (2015) propose a subset of the TGP which removes 243 individuals such that no two individuals are as related as cousins or closer. These 243 samples include all those with cryptic relatedness inferred by the FSUITE and RELPAIR (Boehnke and Cox, 1997; Epstein et al., 2000) methods. This results in a reduced set of 2261 individuals which are assumed to be no more closely related than half first cousins $(ϕ = .0312)$ (Gazal et al., 2015). We applied this filter and re-analyzed each of the 26 populations to test for heterogeneity and cryptic relatedness (Fig. 3B, Supplementary Table S1, Supplementary Fig. S1).

Fig. 3. — Significance of population heterogeneity in 26 populations of the TGP. Detection of population structure was found at P <.01 in 15 of the 26 populations using the full dataset (A). Upon removal of suspected related individuals, four populations (CLM, PEL, PUR and GIH) violated homogeneity in the relatedness-removed populations (B)

Eleven populations which had been identified as violating homogeneity $(α = .01)$ in the full TGP dataset were no longer identified as violating homogeneity after removal of suspected related pairs. However, four populations, including three of the four Admixed American groups, continued to violate homogeneity even after the attempts to limit the impact of related individuals. The three most significant populations are all part of the Admixed American superpopulation and represent ‘new world’ groups which have undergone extensive admixture in recent centuries: CLM (Colombians from Medellin, Colombia) $(p = 7 \times 10^{- 8})$, PUR (Puerto Ricans from Puerto Rico) $(p = 3 \times 10^{- 31})$ and PEL (Peruvians from Lima, Peru) $(p = 2 \times 10^{- 27})$. It is therefore reassuring that these groups of individuals exhibit the greatest amount of structure among the populations surveyed.

3.2 Population differentiation in the 1000 Genomes Project

There are many methods for detecting population structure. Most commonly, Principal Components Analysis (Price et al., 2006, 2010) is used to identify the vectors of largest variation, which ideally correspond to the population structure. Commonly, this procedure first involves the computation of a genetic similarity matrix (GSM) via the correlation between all samples, followed by an eigendecomposition of that matrix. There are a number of limitations to this straightforward approach, one of which is that the calculation of a variance–covariance matrix weights all loci equally, failing to fully utilize the fact that the allele frequency of each variant determines how informative that variant is. Recently, the Jaccard Index has been used to estimate genetic similarity (Prokopenko et al., 2016). This approach provides a higher resolution picture of the genetic landscape by exploiting the co-occurrence of rare variants in sequencing data. STEGO directly utilizes this differential value of alleles based on minor allele frequency by weighting variants by how unlikely such a co-occurrence would have been in a homogeneous population.

We evaluated the effectiveness of our similarity measure to differentiate populations in the TGP in both global and localized contexts. For the global scenario we used data from all 26 populations in a single analysis. In the localized scenarios, we ran 57 separate analyses corresponding to all possible pairs of populations within each of the five superpopulations. In each analysis, STEGO and covariance (PCA) were used to compute the GSMs containing all pairwise similarity scores. An eigendecomposition of each GSM was performed and each individual in the study was plotted against the top two eigenvectors for each method.

In comparing our results with those of PCA on the global scale, we achieve highly similar results depicting the two dimensional linear migrations of ancient human history. This is notable because despite a focus on separating recently related populations, STEGO remains effective at partitioning samples of more distant common ancestry as well (Supplementary Fig. S2).

Despite no loss of performance on the global scale, STEGO outperforms standard PCA when the task involves classifying individuals of recent ancestry. Focusing only on populations belonging to the same continental superpopulation, every possible pair of populations was merged following the removal of suspected related pairs. This yielded 57 sets of unrelated samples in which a subtle binary population stratification existed. STEGO and standard PCA were then run on each merged dataset and the two methods were compared by computing the ratio of mean within-population variance to total variance across the first three principal components.
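One way to compute such a comparison metric is sketched below in Python. The exact aggregation used in the analysis is not spelled out above, so this sketch makes an assumption: for each of the leading components it divides the mean of the per-population variances by the total variance, and then averages these ratios over the components. The function name and the toy coordinates are hypothetical. A smaller value indicates tighter, better separated population clusters.

```python
from statistics import pvariance


def within_to_total_variance(coords, labels, n_components=3):
    """Mean over components of (mean within-population variance / total variance).

    coords : list of per-sample coordinate lists (one entry per component)
    labels : population label for each sample
    """
    ratios = []
    for c in range(n_components):
        column = [row[c] for row in coords]
        total = pvariance(column)
        groups = {}
        for value, label in zip(column, labels):
            groups.setdefault(label, []).append(value)
        within = sum(pvariance(g) for g in groups.values()) / len(groups)
        ratios.append(within / total)
    return sum(ratios) / len(ratios)


# Toy data on a single component: perfectly separated versus fully mixed labels.
coords = [[0.0], [0.0], [1.0], [1.0]]
separated = within_to_total_variance(coords, ["A", "A", "B", "B"], n_components=1)
mixed = within_to_total_variance(coords, ["A", "B", "A", "B"], n_components=1)
```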

The results show that STEGO outperforms PCA by this measure in 41 of the 57 possible comparisons (binomial test P <.001) (Fig. 4). We chose a pair of closely related populations from the 1000 Genomes Project in order to illustrate this performance. The populations Sri Lankan Tamil (STU) and Indian Telugu (ITU) have relatively small geographical separation and recent common ancestry relative to other populations in the TGP. We demonstrate the clearer separation achieved by our method in comparison with standard Principal Components Analysis (Fig. 5). We additionally explored a case in which the ratio of mean within-population variance to total variance was greater for STEGO than for PCA, a group containing Utah Residents with Northern and Western Ancestry (CEU) and British in England and Scotland (GBR). Despite a clear trend of superior performance with STEGO, CEU-GBR is a notable exception. Closer inspection reveals that the first eigenvector from STEGO clearly isolates 11 samples exclusively from the GBR population. To our knowledge, these individuals have not previously been identified as a distinct subset of the GBR population. It is not readily apparent what features of the data are being captured here or what the relative value of those features is (this may be a result of population structure, relatedness, batch effect, etc.), but it is notable that all 11 samples came from the same population in the TGP. It is reasonable to infer that this subset of samples contains a disproportionate number of co-occurrences of low frequency variants, which were not detected by PCA (Supplementary Fig. S3).
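The binomial test quoted for the 41-of-57 comparison can be reproduced with a short exact calculation. The Python sketch below assumes a one-sided test against a success probability of 0.5, which is an assumption about the test that was run; the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_value = binomial_upper_tail(41, 57)
```

Under this assumption the tail probability is roughly $6 \times 10^{- 4}$, consistent with the reported P <.001; a two-sided version would double this value.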

Fig. 5. — Example: ITU versus STU. Two populations of Southern Asian origin, Indian Telugu from the UK (ITU) and Sri Lankan Tamil from the UK (STU). A genetic similarity matrix was computed using STEGO and standard correlation on the same set of variants (See Supplementary Methods S1.4). An eigendecomposition of each matrix was then performed. These plots show the set of unrelated individuals projected on to the first two eigenvectors. We see clearer clustering by population (colored) in our method (Left) compared to standard PCA (right). This performance boost is attributed to the value added by preferentially considering genetic agreement in less frequent alleles

The superior performance in fine scale population stratification is due to the focus on rarer alleles. Rare alleles tend to be less stable over generations and are lost from the population (fixed at 0%) with high probability. Therefore, rare alleles that are observed are more likely to have arisen recently (see Supplementary Materials S1.6). It stands to reason that these alleles would therefore be the most informative of recently related populations. By appropriately recognizing the increased information contained in the co-occurrences of less frequent alleles, we achieve superior separation of recently related populations.

4 Discussion

The ability to identify genetic outliers has well-established utility in genome-wide association studies. Many existing methods for identification of genetic associations are predicated on the assumption that population homogeneity holds in the study. Checking for violations of this assumption typically involves a qualitative assessment without any specific concern for effect size and power. STEGO provides an analytical approach for quantitatively assessing homogeneity and a formal test for the identification of cryptic relatedness and population stratification.

Moreover, our approach involves the estimation of a GSM which, due to its preferential weighting towards rare variants, provides higher resolution for distinguishing populations which have recently diverged. As sequencing costs have plummeted and our ability to measure rare variants has increased, there will be increased demand for tools which make use of the differential informativeness of variants according to frequency. Recent work (Chen et al., 2013) has already demonstrated the use of pre-calculated SNP weights to infer the ancestry of samples of unknown origin, and STEGO’s GSM in combination with large scale sequencing projects, such as the TGP, promises to further improve the resolution of this approach.

Several limitations exist with our approach. First, the method assumes that the variants are independent. We satisfy this assumption by performing LD sampling, but in doing so limit the number of informative markers to fewer than 100k, potentially omitting much of our data and reducing our power to detect heterogeneity. The choice of LD sampling method will necessarily affect the performance of the method, but because STEGO focuses on rare variants as opposed to SNPs, the impact of LD will be limited. Additionally, with respect to the detection of population structure, we cannot design a uniformly most powerful test for structure because of the complex ways in which structure can exist. Future work will include quantification of the specific gains achieved in controlling type I error and power in the context of rare variant association studies. Higher-resolution population structure is always preferred, though the exact gains achieved in GWAS remain to be quantified.
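
One simple way to thin variants so that the retained markers are approximately independent is to keep a single randomly chosen variant per genomic window. The sketch below illustrates that idea only; it is not necessarily the LD sampling procedure used in this work, and the function name `sample_one_per_window`, the window width and the single-chromosome input are all assumptions made for the example.

```python
import numpy as np


def sample_one_per_window(positions, window_bp, seed=0):
    """Return indices of one randomly chosen variant per non-overlapping window.

    positions : base-pair positions of variants on a single chromosome
    window_bp : window width in base pairs (an assumed tuning parameter)
    """
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions)
    # Variants with the same integer quotient fall in the same window.
    window_ids = positions // window_bp
    chosen = []
    for w in np.unique(window_ids):
        members = np.flatnonzero(window_ids == w)
        chosen.append(int(rng.choice(members)))
    return np.array(sorted(chosen))


if __name__ == "__main__":
    pos = [10, 20, 1500, 1600, 5000]
    print(sample_one_per_window(pos, window_bp=1000))
```

Wider windows reduce residual LD among the retained variants but discard more markers, which is the trade-off between independence and power described above.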

In spite of these limitations, STEGO provides a formal, interpretable tool that is directly linked to the kinship coefficient. It provides a formal statistical test for population substructure, identifying study subjects who are related and subjects who are genetic outliers in their assigned population.

Funding

This work was supported by the National Institutes of Health [Grant numbers 1P01HL105339, T32HL007427] and Cure Alzheimer. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.

Conflict of Interest: none declared.

Supplementary Material

Supplementary data are available at Bioinformatics online.

References

Al-Khudhair A. et al. (2015) Inference of distant genetic relations in humans using ’1000 genomes’. Genome Biol. Evol., 7, 481–492.
Bacanu S.-A. et al. (2002) Association studies for quantitative traits in structured populations. Genet. Epidemiol., 22, 78–93.
Boehnke M., Cox N.J. (1997) Accurate inference of relationships in sib-pair linkage studies. Am. J. Hum. Genet., 61, 423–429.
Chen C.-Y. et al. (2013) Improved ancestry inference using weights from external reference panels. Bioinformatics, btt144.
Choi Y. et al. (2009) Case–control association testing in the presence of unknown relationships. Genet. Epidemiol., 33, 668–678.
Consortium G.P. et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.
Consortium G.P. et al. (2015) A global reference for human genetic variation. Nature, 526, 68–74.
Devlin B. et al. (2001) Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol., 60, 155–166.
Epstein M.P. et al. (2000) Improved inference of relationship for pairs of individuals. Am. J. Hum. Genet., 67, 1219–1231.
Fedorova L. et al. (2016) Atlas of cryptic genetic relatedness among 1000 human genomes. Genome Biol. Evol., 8, 777–790.
Gazal S. et al. (2015) High level of inbreeding in final phase of 1000 genomes project. Scientific Rep., 5, 17453.
Kang H.M. et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42, 348–354.
Lippert C. et al. (2011) Fast linear mixed models for genome-wide association studies. Nat. Methods, 8, 833–835.
Listgarten J. et al. (2012) Improved linear mixed models for genome-wide association studies. Nat. Methods, 9, 525–526.
Manichaikul A. et al. (2010) Robust relationship inference in genome-wide association studies. Bioinformatics, 26, 2867–2873.
Mathieson I., McVean G. (2012) Differential confounding of rare and common variants in spatially structured populations. Nat. Genet., 44, 243–246.
Nemesh J., McCarroll S. (2012) Addressing cryptic relatedness in candidate samples for 1kg. ftp://ftp.1000genomes.ebi.ac.uk/Vol03313/ftp/phase1/analysis_results/supporting/cryptic_relation_analysis. (6 June 2016, date last accessed).
Patterson N. et al. (2006) Population structure and eigenanalysis. PLoS Genet., 2, e190.
Price A.L. et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904–909.
Price A.L. et al. (2008) Long-range LD can confound genome scans in admixed populations. Am. J. Hum. Genet., 83, 132–135.
Price A.L. et al. (2010) New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet., 11, 459–463.
Prokopenko D. et al. (2016) Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 genomes project. Bioinformatics, 32, 1366–1372.
Ptak S.E., Przeworski M. (2002) Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet., 18, 559–563.
Purcell S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
Šidák Z. (1967) Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc., 62, 626–633.
Thornton T. et al. (2012) Estimating kinship in admixed populations. Am. J. Hum. Genet., 91, 122–138.
Voight B.F., Pritchard J.K. (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet., 1, e32.
Wang J. et al. (2003) Evaluating Kolmogorov’s distribution. J. Stat. Softw., 8.
Yang J. et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42, 565–569.
Zhang Z. et al. (2010) Mixed linear model approach adapted for genome-wide association studies. Nat. Genet., 42, 355–360.
