Abstract
For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while non-adjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to over-adjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
Keywords: 1000 Genomes Project, Association tests, Logistic regression, Next-generation sequencing, SNP, SSU test
INTRODUCTION
With the availability of next-generation sequencing data, there has been increasing interest in studying associations between complex traits and low-frequency variants (LFVs, with MAF between 1% and 5%) or rare variants (RVs, with MAF < 1%); see two recent reviews (Asimit and Zeggini 2010; Bansal et al 2010). Due to the low minor allele frequencies (MAFs) of LFVs and RVs, statistical tests developed for common variants (CVs, with MAF > 5%) in genome-wide association studies (GWASs) may no longer be powerful. Accordingly, there have been intensive efforts in developing new statistical tests for LFVs and RVs. Basu and Pan (2011) conducted a comprehensive review and comparison of many existing association tests for LFVs and RVs with unrelated samples. Although there does not exist a uniformly most powerful test, they used simulated data to demonstrate the generally good performance of the sum of squared score (SSU) test, which has been shown (Pan 2011) to be closely related to an empirical Bayes test for high-dimensional data (Goeman et al 2006), kernel machine regression (KMR) (Kwee et al 2008; Wu et al 2010, 2011b), genomic-distance based regression (GDBR) (Wessel and Schork 2006) and the C-alpha test (Neale et al 2011). A limitation of their study is the lack of use of real sequence data. Furthermore, Basu and Pan (2011) also did not consider the small sample size issue and use of covariates, which may include principal components to adjust for population stratification. Here we use a low-coverage whole-genome sequencing dataset generated by the 1000 Genomes Project (1000 Genomes Project Consortium 2010) to address the above issues.
Intuitively, population stratification can arise in association studies of LFVs and RVs, and some existing techniques for CVs, e.g. principal component (PC) analysis, might be applicable to LFVs and RVs (Lin and Tang 2011). However, two recent studies (Baye et al 2011; Siu et al 2012) reached different conclusions on the relative effectiveness of CV- or RV-based PCs in uncovering population structures. More importantly, to our knowledge, the issue has not been experimentally demonstrated in the context of association tests. Among the many existing techniques for CVs, Wu et al (2011) demonstrated that adding a few top PCs as covariates in a regression analysis is a simple and effective approach to adjusting for population stratification for unrelated samples. Hence we adopt this approach throughout. Furthermore, with the availability of sequence data, as pointed out by Price et al (2010), it is not completely clear whether LFVs or RVs can be used to infer genetic ancestry. If so, it is natural to ask whether using LFVs or RVs (or both LFVs/RVs and CVs) can perform better than using CVs alone in adjusting for population stratification. We show that, in agreement with Siu et al (2012), based on the 1000 Genomes Project data for two continental groups, 174 African (AFR) and 283 European (EUR) samples, the top PC based on a large number of LFVs could better separate the two groups than that based on CVs; however, the PCs based on CVs, LFVs or RVs could not separate the underlying subgroups. More interestingly and perhaps surprisingly, although using PCs based on either CVs or LFVs can effectively control inflated Type I error rates in the presence of population stratification, using PCs based on CVs maintained power while using PCs based on some randomly selected LFVs might suffer from substantial power loss in the absence of population stratification, which was likely due to the high linkage disequilibrium (LD) among the randomly selected LFVs.
METHODS
DATA
We downloaded a low-coverage whole genome sequencing dataset released in August 2010 on the 1000 Genomes Project web site. The dataset included 629 individuals: 174 Africans (AFR), 283 Europeans (EUR) and 172 Asians; we only used the data from the first two groups. In the first two continental groups, there were 4 and 6 subgroups respectively (Table 1). Due to the small sample size, we mainly focus on the two continental groups for association testing with chromosome 1 data, though we will also explore the use of PCA in separating the subgroups with the whole genome data. We defined rare variants (RVs) as single nucleotide polymorphisms (SNPs) with minor allele frequencies (MAFs) less than 1%, low-frequency variants (LFVs) as those with MAFs between 1% and 5%, and common variants (CVs) as those with MAFs greater than 5%. On chromosome 1, among the 694231 SNPs common to both groups, there were 478208 CVs, 146353 LFVs and 69670 RVs.
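The three MAF classes defined above can be expressed as a small helper. This is only an illustrative sketch: the function names are hypothetical, and the treatment of a MAF of exactly 1% or exactly 5% (assigned to the LFV class here) is an assumption, since the text does not say which class the boundaries belong to.

```python
def pooled_maf(genotypes):
    """MAF of one SNP from additive codes (0, 1 or 2 copies of one allele)."""
    freq = sum(genotypes) / (2.0 * len(genotypes))
    # The minor allele is whichever allele is rarer in the pooled sample.
    return min(freq, 1.0 - freq)


def classify_variant(maf):
    """Return 'RV', 'LFV' or 'CV' for a minor allele frequency in [0, 0.5]."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf < 0.01:
        return "RV"
    if maf <= 0.05:  # assumption: both boundaries count as low-frequency
        return "LFV"
    return "CV"
```

For instance, a SNP with pooled genotypes `[0, 1, 2, 0]` has a MAF of 3/8 and would be classified as a CV.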
Table 1.
A summary of the two continental groups.
| Group | EUR | EUR | EUR | EUR | EUR | EUR | AFR | AFR | AFR | AFR |
|---|---|---|---|---|---|---|---|---|---|---|
| Subgroup | CEU | FIN | GBR | TSI | MXL | PUR | YRI | LWK | ASW | PUR2 |
| #Samples | 90 | 36 | 43 | 92 | 17 | 5 | 78 | 67 | 24 | 5 |
For the purpose of this project, we selected a few regions of multiple LFVs or RVs associated with the continental group. As pointed out by Price et al (2010), since spurious associations often arise at differentiated variants whose MAFs are unusually different between different ancestral groups, it is crucial to consider these SNPs when correcting for population stratification. We used sliding windows with various sizes on chromosome 1 and tested the association between the continental group and the LFVs or RVs inside each window using a few statistical tests (discussed below). We identified 3 regions, termed R1 to R3, as representatives for unusually differentiated LFVs or RVs with various characteristics.
For each region, based on a statistical model and the selected LFVs or RVs from the sequence data, we generated simulated datasets with a simulated disease status for each subject. We then tested possible association between the generated disease status and the observed SNPs in each region; by comparing the test results with the known truth, we assessed the performance of each test in terms of its statistical power and Type I error. Of particular interest was how the performance of a test depended on whether and how PCs constructed from the genome-wide sequence data were used.
STATISTICAL TESTS
We applied two sets of representative statistical tests for association analysis of LFVs or RVs. The first set includes the score test, the sum of squared score (SSU) test, the weighted sum of squared score (SSUw) test, the Sum test and the univariate minimum p-value (UminP) test (Pan 2009), while the second includes the T1, T5, Fp, VT and EREC tests (Lin and Tang 2011). We will first introduce the first set of the five tests. They were chosen based on the following reasons. The score test is a classical test in general statistical applications, asymptotically equivalent to the Wald test and likelihood ratio test. The UminP test is perhaps most popular in association analysis of CVs, as used in GWASs. The Sum test is a representative of the so-called pooled association tests (Han and Pan 2010), similar to the well-known CAST (Morgenthaler and Thilly 2007) and CMC test (Li and Leal 2008). The SSU test is closely related to GDBR (Wessel and Schork 2006), KMR (Wu et al 2010, 2010b) and C-alpha test (Neale et al 2011); in an extensive simulation study, Basu and Pan (2011) found that the SSU test performed similarly to the KMR and C-alpha, and was an overall winner with the highest or close to the highest power in association analysis of RVs. With either CVs or RVs, the SSUw test often performed similarly to the SSU test; however, with both RVs and CVs, the SSUw test might perform better (Basu and Pan 2011). All the five tests are based on the score vector of a regression model, e.g. a generalized linear model (GLM), hence only a reduced model under the null hypothesis H0 is to be fitted, leading to their being computationally faster and numerically more stable than those based on fitting a full model, e.g. the Wald or likelihood ratio test. Importantly, since the tests are formulated in the general regression framework, it is easy to incorporate covariates or extend them to other more complex studies, e.g. with censored event times as traits, correlated family data or multiple traits.
For a binary trait Yi for subject i with k SNPs Xi = (Xi1, …, Xik)’ and covariates Zi = (Zi1, …, ZiJ)’, all the five tests are based on the null model

$$\text{Logit}\,\Pr(Y_i = 1) = \beta_0 + \sum_{m=1}^{J} \alpha_m Z_{im},$$

which is simpler than the full model:

$$\text{Logit}\,\Pr(Y_i = 1) = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ij} + \sum_{m=1}^{J} \alpha_m Z_{im}.$$

Here αm denotes the regression coefficient of covariate m. We use the additive coding for each SNP; that is, Xij = 0, 1 or 2 is the number of minor alleles in SNP j. Due to the extremely low MAF, it is unlikely to have two copies of the minor allele for a given RV, and thus there is little difference between various coding schemes for RVs. In the current context, for population stratification, we include the top J PCs as covariates.
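All the five tests are global tests with the null hypothesis H0: β = (β1, …, βk)’ = 0; it is global in the sense of not identifying specific zero subcomponents of β. Given a score vector U = (U.1, …, U.k)’ and its covariance estimate V = Cov(U), the five test statistics are respectively:

$$T_{Sco} = U' V^{-1} U, \quad T_{SSU} = U'U, \quad T_{SSUw} = U'\,\mathrm{diag}(V)^{-1} U, \quad T_{Sum} = \mathbf{1}'U \big/ \sqrt{\mathbf{1}'V\mathbf{1}}, \quad T_{UminP} = \max_{1 \le j \le k} U_{.j}^2 / V_{jj},$$

where diag(V) is a diagonal matrix with diagonal elements (Vjj’s) of V.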
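As an illustration of how the five score-based statistics can be computed, the following sketch uses the forms usually given for these tests (score: U'V⁻¹U; SSU: U'U; SSUw: U'diag(V)⁻¹U; Sum: 1'U standardized by its standard error; UminP: the largest Uj²/Vjj). Two simplifications are assumptions of the sketch rather than the procedure used in this study: the null model contains only an intercept (no PCs or other covariates), for which U and V have closed forms, and the function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def score_and_cov(y, x):
    """Score vector U and covariance V under an intercept-only null model."""
    n, k = len(y), len(x[0])
    ybar = sum(y) / n
    xbar = [sum(row[j] for row in x) / n for j in range(k)]
    u = [sum((y[i] - ybar) * x[i][j] for i in range(n)) for j in range(k)]
    v = [[ybar * (1.0 - ybar)
          * sum((x[i][j] - xbar[j]) * (x[i][l] - xbar[l]) for i in range(n))
          for l in range(k)] for j in range(k)]
    return u, v


def five_statistics(u, v):
    """Score, SSU, SSUw, Sum and UminP statistics from U and V."""
    k = len(u)
    vinv_u = solve(v, u)  # fails if V is singular, e.g. a monomorphic SNP
    return {
        "Score": sum(u[j] * vinv_u[j] for j in range(k)),
        "SSU": sum(uj * uj for uj in u),
        "SSUw": sum(u[j] ** 2 / v[j][j] for j in range(k)),
        "Sum": sum(u) / math.sqrt(sum(sum(row) for row in v)),
        "UminP": max(u[j] ** 2 / v[j][j] for j in range(k)),
    }
```

When V is diagonal, the score and SSUw statistics coincide, which gives a quick sanity check of an implementation.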
Under H0, based on the asymptotic Normality of U, U ~ N(0, V), the asymptotic distribution of the first four tests can be easily derived and used to obtain their p-values (Pan 2009), while numerical integrations with a multivariate Normal density can be used for the UminP test (Conneely and Boehnke 2007). For relatively small sample sizes, especially with RVs, the above asymptotics may not be applicable. Alternatively, as suggested by other authors (Lin and Tang 2011; Wu et al 2011), we can apply the parametric bootstrap (Efron and Tibshirani 1993) in the following steps: 1) fit the null model; 2) use the fitted null model to generate a new disease status $Y_i^{(b)}$ for each subject i, forming the bth bootstrap dataset with b = 1, …, B; 3) calculate a test statistic T with the original data (Yi, Xi)’s, and $T_b$ with the bth bootstrap data ($Y_i^{(b)}$, Xi)’s; 4) the p-value is the proportion of the B bootstrap statistics that are at least as large as the observed one, $\sum_{b=1}^{B} I(T_b \ge T)/B$. We used B = 200 throughout (and using B = 1000 gave similar results in all the simulations), though in practice we might need to use a much larger B to achieve a higher level of statistical significance.
As will be shown, for the score test and to a lesser degree for the UminP test, the asymptotics might give inflated Type I error rates, while the bootstrap gave much better results; in contrast, the SSU and SSUw tests are more robust to small samples with Type I error rates always close to the nominal level in all our experiments.
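The four bootstrap steps can be sketched as follows. This is a minimal illustration rather than the code used in this study: it assumes the fitted null probabilities are supplied by the caller, takes the p-value to be the fraction of bootstrap statistics at least as large as the observed one, and uses an intercept-only SSU statistic as the example statistic. All function names are hypothetical.

```python
import random


def ssu_statistic(y, x):
    """SSU statistic U'U with U computed under an intercept-only null model."""
    n = len(y)
    ybar = sum(y) / n  # refitting the intercept-only null is just the mean
    return sum(
        sum((y[i] - ybar) * x[i][j] for i in range(n)) ** 2
        for j in range(len(x[0]))
    )


def bootstrap_p_value(y, x, fitted_probs, statistic, n_boot=200, seed=0):
    """Parametric bootstrap p-value for a statistic that rejects when large."""
    rng = random.Random(seed)
    t_obs = statistic(y, x)
    exceed = 0
    for _ in range(n_boot):
        # Step 2: simulate a disease status from the fitted null model.
        y_b = [1 if rng.random() < p else 0 for p in fitted_probs]
        # Step 3: recompute the statistic on the bootstrap data.
        if statistic(y_b, x) >= t_obs:
            exceed += 1
    # Step 4: proportion of bootstrap statistics at least as large as t_obs.
    return exceed / n_boot
```

With covariates such as PCs, `fitted_probs` would come from a logistic regression of the trait on the covariates, and the statistic would need to refit that null model on each bootstrap dataset. With B resamples the smallest attainable nonzero p-value is 1/B, which is why a much larger B is needed for stringent significance levels.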
For comparison, we also included the T1, T5, Fp, VT and EREC tests (Lin and Tang 2011), all implemented in software SCORE-Seq available at http://www.bios.unc.edu/~dlin/software/SCORE-Seq/. As shown by Lin and Tang (2011), a general class of the score-based association tests can be formulated as

$$T_G = \sum_{j=1}^{k} \zeta_j U_{.j},$$

where ζj is a weight for SNP j. Different choices of the weights lead to a variety of tests: 1) the T1 (or T5) test corresponds to ζj = 1 if the MAF of SNP j is less than 1% (or 5%) and ζj = 0 otherwise; 2) in the Fp test, we have $\zeta_j = 1/\sqrt{\hat{p}_j (1 - \hat{p}_j)}$, where $\hat{p}_j$ is an estimate of the MAF of SNP j with pseudo counts from the pooled sample, giving higher weights to rarer SNPs (Madsen and Browning 2009); 3) the VT test combines multiple tests based on multiple thresholds, and for each threshold, ζj = 1 if the MAF of SNP j is less than the threshold and ζj = 0 otherwise (Price et al 2010b); it is a form of the adaptive Neyman’s test (Pan and Shen 2011); 4) the EREC test uses a data-driven weight constructed from $\hat{\beta}_j$ and a constant c, with $\hat{\beta}_j$ as the (univariate) maximum likelihood estimate of βj and c = 1 for binary traits. Although an asymptotic null distribution is available for each of the first three tests, it is not available for the EREC test. Furthermore, the asymptotic approximations might result in inflated Type I error rates for RVs. Hence we only show the results of the second set of the tests with their p-values calculated by the parametric bootstrap with the minimum allowable B = 10^6 resamples.
We also note that the score, SSU, SSUw and Sum tests are special cases of the general TG test. In particular, the SSUw test uses the weight $\zeta_j = U_{.j}/V_{jj}$ (Pan 2009), suggesting its close connection to the EREC test, whose weight ζj can be regarded as shrinking towards constants c or −c.
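To make the role of the weights concrete, the sketch below computes a weighted sum of score components for the threshold weights (T1/T5) and for inverse-standard-deviation frequency weights of the Madsen and Browning type. It is illustrative only: the pseudo-count form (count + 1)/(2n + 2) is an assumption, the statistic is left unstandardized, and the function names are hypothetical rather than taken from SCORE-Seq.

```python
import math


def threshold_weights(mafs, threshold):
    """Weight 1 for SNPs with MAF below the threshold, 0 otherwise."""
    return [1.0 if maf < threshold else 0.0 for maf in mafs]


def frequency_weights(minor_allele_counts, n_subjects):
    """Weights 1/sqrt(p(1-p)), so that rarer SNPs receive larger weights."""
    weights = []
    for count in minor_allele_counts:
        # Assumed pseudo-count estimate of the MAF from the pooled sample.
        p = (count + 1.0) / (2.0 * n_subjects + 2.0)
        weights.append(1.0 / math.sqrt(p * (1.0 - p)))
    return weights


def weighted_score_statistic(weights, u):
    """Weighted sum of the score components, sum_j zeta_j * U_j."""
    return sum(w * uj for w, uj in zip(weights, u))
```

Setting every weight to 1 gives a Sum-type statistic, and setting the weights equal to the score components themselves gives the SSU statistic U'U. For a region made up entirely of LFVs, `threshold_weights(mafs, 0.01)` returns all zeros, so a T1-type statistic carries no information there.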
RESULTS
DATA DESCRIPTION
As shown in Figure 1, there are clear differences between the MAF distributions of the two continental groups. In particular, the difference seems to be larger for low MAFs than for high MAFs.
Figure 1.

Distributions of MAFs for the EUR and AFR groups.
We selected 3 regions, named R1 to R3: the first two contained 19 and 40 consecutive LFVs (and only LFVs) respectively, while the third one consisted of 40 consecutive RVs (and only RVs); we calculated the MAF of any SNP based on the pooled sample. The LFVs or RVs within each region were associated with the continental group; that is, the MAFs of the SNPs were different between the AFR and EUR groups (Figure 2). These 3 regions also showed different LD patterns (Figure 3): LD was weak in R1, moderate in R3, and strong in R2.
Figure 2.

Comparison of the MAFs between the EUR and AFR groups for the SNPs in regions R1–R3.
Figure 3.
LD plots in r² for the EUR (top row) and AFR (bottom row) groups in regions R1–R3.
We randomly selected a large number of CVs (or LFVs) from chromosome 1 to construct PCs. As shown in Figure 4, the top PC based on CVs could largely separate the AFR and EUR groups; however, perhaps surprisingly, the top PC based on LFVs did better, completely separating the two groups. When using some randomly selected SNPs, including CVs, LFVs and RVs, the results were between those based on either CVs or LFVs alone (not shown). We will present and discuss results based on RVs later. Since the results with 100000 CVs (or LFVs) (not shown) were similar, in the following, we used a few top PCs based on either 10000 CVs or 10000 LFVs.
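A minimal sketch of how a top PC can be extracted from a genotype matrix is given below. It makes assumptions that are not stated in the text: genotypes are only mean-centered per SNP (no variance scaling), and the leading eigenvector of the subject-by-subject Gram matrix is found by plain power iteration, which is adequate for a toy example but not for 10000 SNPs on hundreds of subjects. The function name is hypothetical.

```python
import math


def top_pc(genotypes, n_iter=200):
    """Top principal component scores for subjects (rows) from SNP columns."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Subject-by-subject Gram matrix of the centered genotypes.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[l]))
             for l in range(n)] for i in range(n)]
    vec = [math.sin(i + 1.0) for i in range(n)]  # arbitrary fixed start
    for _ in range(n_iter):
        new = [sum(gram[i][l] * vec[l] for l in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            raise ValueError("no variation left in the genotype matrix")
        vec = [v / norm for v in new]
    return vec  # unit-length vector; the overall sign is arbitrary
```

In a toy matrix where one group carries minor alleles at the first SNPs and another group at the last SNPs, the returned scores take opposite signs in the two groups, which is the kind of separation summarized in Figure 4.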
Figure 4.

The top two PCs constructed with CVs or LFVs.
ASSOCIATION TESTING WITH LFVS: TYPE I ERROR
We first generated simulated data under H0 with population stratification. Specifically, we randomly selected 90% of the EUR samples and 10% of the AFR samples as cases (i.e. Yi = 1), with the remaining ones as controls (i.e. Yi = 0). In this way, none of the SNPs caused the “disease”, and there was a clear association between the continental group and the “disease” (i.e. population stratification). We applied the five tests to the two LFV regions with 1000 simulated datasets for each case. Since the results are similar for PCs based on either CVs or LFVs, we only show those for the former. Table 2 lists the Type I error rates at the nominal level α = 0.05. It is clear that, without adjustment for population stratification, all the tests could have dramatically inflated Type I error rates (except the Sum test for R1), suggesting the necessity of adjusting for population stratification. With PCs, even with just the single top PC (i.e. #PCs=1), the problem with inflated Type I error rates largely disappeared; there was almost no difference between using various numbers of PCs, as long as at least one PC was used. It is noted that the asymptotics-based score test could have severely inflated Type I error rates, even in the presence of PCs for region R2, and that the asymptotics-based UminP test could also have slightly inflated Type I error rates. The bootstrap-based tests all had their Type I error rates better controlled.
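The null simulation with population stratification amounts to assigning case status by group membership only. A small sketch follows; the function name is hypothetical and rounding the number of cases per group to the nearest integer is an assumption.

```python
import random


def assign_stratified_status(groups, case_fraction, seed=0):
    """Randomly label a fixed fraction of each group as cases (1), else 0."""
    rng = random.Random(seed)
    status = [0] * len(groups)
    for group, fraction in case_fraction.items():
        members = [i for i, g in enumerate(groups) if g == group]
        n_cases = round(fraction * len(members))  # assumption: nearest integer
        for i in rng.sample(members, n_cases):
            status[i] = 1
    return status


# Usage with the sample sizes of the two continental groups.
groups = ["EUR"] * 283 + ["AFR"] * 174
status = assign_stratified_status(groups, {"EUR": 0.9, "AFR": 0.1})
```

Because the status depends on the continental group but on no SNP, any SNP whose MAF differs between the groups becomes spuriously associated with the status unless ancestry is adjusted for.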
Table 2.
Type I error rates with population stratification. The PCs were constructed using 10000 CVs.
| Loc | Test | Asymptotics: #PCs=0 | 1 | 5 | 10 | Bootstrap: #PCs=0 | 1 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| R1 | Score | 0.693 | 0.055 | 0.052 | 0.048 | 0.716 | 0.044 | 0.044 | 0.042 |
| R1 | SSU | 0.618 | 0.062 | 0.039 | 0.035 | 0.647 | 0.061 | 0.039 | 0.044 |
| R1 | SSUw | 0.620 | 0.052 | 0.036 | 0.035 | 0.642 | 0.053 | 0.038 | 0.037 |
| R1 | Sum | 0.067 | 0.066 | 0.050 | 0.047 | 0.070 | 0.044 | 0.048 | 0.048 |
| R1 | UminP | 0.201 | 0.084 | 0.065 | 0.067 | 0.232 | 0.068 | 0.039 | 0.047 |
| R2 | Score | 0.155 | 0.180 | 0.192 | 0.171 | 0.624 | 0.062 | 0.071 | 0.061 |
| R2 | SSU | 0.709 | 0.053 | 0.055 | 0.054 | 0.684 | 0.047 | 0.051 | 0.055 |
| R2 | SSUw | 0.700 | 0.052 | 0.055 | 0.054 | 0.669 | 0.049 | 0.052 | 0.055 |
| R2 | Sum | 0.684 | 0.054 | 0.055 | 0.055 | 0.652 | 0.049 | 0.059 | 0.056 |
| R2 | UminP | 0.677 | 0.052 | 0.061 | 0.066 | 0.685 | 0.044 | 0.050 | 0.047 |
ASSOCIATION TESTING WITH LFVS: POWER
We generated a disease status from the following logistic regression model:

$$\text{Logit}\,\Pr(Y_i = 1) = \beta_0 + \sum_{j=1}^{k_1} \beta_j X_{ij},$$
where Xij was the jth SNP of the ith subject (AFR or EUR), β0 = −log 3 was chosen to generate a background disease incidence of 25% (when all Xij = 0), and the causal effect sizes βj were randomly generated from a uniform distribution U(−a, a) or U(0, a) for a constant a > 0. With U(−a, a), some causal effects were deleterious while others were protective against disease; with U(0, a), all causal effects were in the same direction of being deleterious. We used k1 ≤ k: if k1 < k, we randomly selected a subset of the SNPs to be causal while others were neutral or non-causal, but a test was always applied to all the k SNPs; it is important to assess a test’s robustness to the number of non-causal SNPs, since in practice we expect causal SNPs to be mixed with some neighboring non-causal ones. Each subject i’s genotype was input to the above model to generate his/her disease status. In such a way, we generated a dataset of 457 subjects with various numbers of cases and controls.
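The simulation of disease status can be sketched as below. With all Xij = 0 the model gives Pr(Yi = 1) = exp(−log 3)/(1 + exp(−log 3)) = (1/3)/(4/3) = 0.25, matching the stated background incidence. The function names and the use of Python's `random` module are assumptions of this illustration, not the code used in this study.

```python
import math
import random


def disease_probability(genotype, betas, beta0=-math.log(3.0)):
    """Pr(Y = 1) under the logistic model with intercept beta0."""
    eta = beta0 + sum(b * g for b, g in zip(betas, genotype))
    return 1.0 / (1.0 + math.exp(-eta))


def draw_effects(k, k1, a, same_direction, rng):
    """Effects for k SNPs: k1 random causal SNPs, the others set to 0."""
    causal = set(rng.sample(range(k), k1))
    low = 0.0 if same_direction else -a  # U(0, a) or U(-a, a)
    return [rng.uniform(low, a) if j in causal else 0.0 for j in range(k)]


def simulate_status(genotypes, betas, rng):
    """Draw a 0/1 disease status for every subject from the model."""
    return [1 if rng.random() < disease_probability(g, betas) else 0
            for g in genotypes]
```

A test is then applied to all k SNPs of the region, including those whose effect was set to 0, which is how robustness to non-causal SNPs is assessed.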
Since the general conclusions remained the same, we only chose a small subset of results to present in Tables 3 and 4. Recall that there were weak and strong LD, and a small and a large number of LFVs, in the two regions R1 and R2 respectively. The PCs were constructed based on the 10000 randomly selected CVs. First, since all the SNPs had MAF between 1% and 5%, the T1 test was not applicable (with all weights ζj = 0), and the T5 test (with all weights ζj = 1) was essentially the same as the Sum test. In addition, since the causal SNPs were selected randomly and were not correlated with lower or higher MAFs, the Fp and VT tests were not expected to improve over the T5 and Sum tests. Second, since there was no population stratification, it is good to see that using or not using PCs, or using different numbers of PCs, gave similar results for all the tests. We emphasize that this is a desired property. In practice, for a given dataset, population stratification may or may not be present; to be safe in avoiding spurious associations, we might still want to apply an adjustment, e.g. based on PCs. Hence, it would be desirable to have no or minimal power loss when adjusting for population stratification. Third, the identity of the most powerful tests varied with the set-up. For example, in region R1, 1) with all 19 causal SNPs being deleterious, the Sum, T5, Fp and EREC tests performed similarly and were most powerful; 2) with the 19 causal SNPs with opposite association directions, as expected, the Sum, T5 and Fp tests were low powered, while the SSU, SSUw and score tests were most powerful. Although there was no uniformly most powerful test, the SSU and SSUw tests seemed to be the overall winners. Fourth, due to the small sample size, the asymptotics-based score test might lose power as compared to the bootstrap-based score test (not shown). In contrast, other tests seemed to be more robust to small samples: their asymptotics-based version and bootstrap-based version always gave similar results (not shown).
Table 3.
Empirical power of various tests based on the parametric bootstrap for the two regions with k1 causal SNPs. The PCs were constructed using 10000 CVs.
| Loc | Test | #PCs=0 | 1 | 5 | 10 | #PCs=0 | 1 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| R1, k1 = 8 | | βj ~ U(−log 3, log 3) | | | | βj ~ U(0, log 3) | | | |
| | Score | 0.489 | 0.489 | 0.489 | 0.496 | 0.838 | 0.836 | 0.824 | 0.826 |
| | SSU | 0.500 | 0.492 | 0.497 | 0.501 | 0.882 | 0.880 | 0.891 | 0.881 |
| | SSUw | 0.507 | 0.480 | 0.481 | 0.486 | 0.883 | 0.883 | 0.889 | 0.886 |
| | Sum | 0.240 | 0.230 | 0.234 | 0.230 | 0.860 | 0.860 | 0.856 | 0.852 |
| | UminP | 0.401 | 0.397 | 0.393 | 0.383 | 0.813 | 0.820 | 0.813 | 0.802 |
| R1, k1 = 19 | | βj ~ U(−log 2, log 2) | | | | βj ~ U(0, log 1.5) | | | |
| | Score | 0.483 | 0.467 | 0.479 | 0.475 | 0.504 | 0.504 | 0.493 | 0.479 |
| | SSU | 0.507 | 0.493 | 0.511 | 0.492 | 0.773 | 0.771 | 0.758 | 0.738 |
| | SSUw | 0.479 | 0.479 | 0.483 | 0.477 | 0.769 | 0.764 | 0.765 | 0.756 |
| | Sum | 0.207 | 0.202 | 0.201 | 0.204 | 0.842 | 0.839 | 0.839 | 0.832 |
| | UminP | 0.288 | 0.286 | 0.280 | 0.287 | 0.558 | 0.544 | 0.546 | 0.541 |
| R2, k1 = 4 | | βj ~ U(−log 3, log 3) | | | | βj ~ U(0, log 2) | | | |
| | Score | 0.256 | 0.240 | 0.251 | 0.244 | 0.707 | 0.702 | 0.699 | 0.687 |
| | SSU | 0.401 | 0.406 | 0.400 | 0.406 | 0.786 | 0.787 | 0.785 | 0.793 |
| | SSUw | 0.404 | 0.404 | 0.403 | 0.408 | 0.783 | 0.787 | 0.787 | 0.795 |
| | Sum | 0.405 | 0.406 | 0.406 | 0.409 | 0.784 | 0.785 | 0.789 | 0.793 |
| | UminP | 0.360 | 0.344 | 0.347 | 0.349 | 0.761 | 0.763 | 0.758 | 0.756 |
| R2, k1 = 30 | | βj ~ U(−log 2, log 2) | | | | βj ~ U(0, log 1.1) | | | |
| | Score | 0.364 | 0.365 | 0.362 | 0.366 | 0.776 | 0.761 | 0.765 | 0.761 |
| | SSU | 0.601 | 0.602 | 0.607 | 0.606 | 0.871 | 0.868 | 0.871 | 0.869 |
| | SSUw | 0.601 | 0.603 | 0.607 | 0.610 | 0.873 | 0.867 | 0.872 | 0.869 |
| | Sum | 0.599 | 0.601 | 0.602 | 0.605 | 0.874 | 0.866 | 0.869 | 0.869 |
| | UminP | 0.563 | 0.550 | 0.546 | 0.544 | 0.848 | 0.834 | 0.836 | 0.835 |
Table 4.
Empirical power of various tests based on the parametric bootstrap for region R1. The PCs were constructed using either 10000 CVs or 10000 LFVs.
| Loc | Test | 10000 CVs: #PCs=0 | 1 | 5 | 10 | 10000 LFVs: #PCs=0 | 1 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| R1, k1 = 8 | | βj ~ U(−log 3, log 3) | | | | βj ~ U(−log 3, log 3) | | | |
| | T5 | 0.200 | 0.207 | 0.195 | 0.188 | 0.200 | 0.205 | 0.193 | 0.182 |
| | Fp | 0.203 | 0.197 | 0.186 | 0.192 | 0.203 | 0.195 | 0.184 | 0.179 |
| | VT | 0.214 | 0.212 | 0.208 | 0.199 | 0.214 | 0.211 | 0.202 | 0.195 |
| | EREC | 0.401 | 0.399 | 0.381 | 0.375 | 0.401 | 0.401 | 0.373 | 0.368 |
| R1, k1 = 8 | | βj ~ U(0, log 3) | | | | βj ~ U(0, log 3) | | | |
| | T5 | 0.872 | 0.878 | 0.869 | 0.857 | 0.872 | 0.876 | 0.866 | 0.812 |
| | Fp | 0.872 | 0.873 | 0.864 | 0.852 | 0.872 | 0.872 | 0.856 | 0.805 |
| | VT | 0.838 | 0.832 | 0.826 | 0.813 | 0.838 | 0.820 | 0.784 | 0.755 |
| | EREC | 0.916 | 0.914 | 0.903 | 0.891 | 0.916 | 0.912 | 0.873 | 0.850 |
| R1, k1 = 19 | | βj ~ U(−log 2, log 2) | | | | βj ~ U(−log 2, log 2) | | | |
| | T5 | 0.222 | 0.225 | 0.223 | 0.216 | 0.222 | 0.232 | 0.215 | 0.193 |
| | Fp | 0.220 | 0.223 | 0.214 | 0.209 | 0.220 | 0.219 | 0.208 | 0.198 |
| | VT | 0.230 | 0.235 | 0.228 | 0.229 | 0.230 | 0.225 | 0.233 | 0.200 |
| | EREC | 0.395 | 0.396 | 0.398 | 0.382 | 0.395 | 0.390 | 0.392 | 0.375 |
| R1, k1 = 19 | | βj ~ U(0, log 1.5) | | | | βj ~ U(0, log 1.5) | | | |
| | T5 | 0.844 | 0.846 | 0.845 | 0.828 | 0.844 | 0.843 | 0.823 | 0.768 |
| | Fp | 0.849 | 0.852 | 0.837 | 0.827 | 0.849 | 0.852 | 0.806 | 0.762 |
| | VT | 0.779 | 0.779 | 0.752 | 0.747 | 0.779 | 0.760 | 0.729 | 0.685 |
| | EREC | 0.826 | 0.820 | 0.807 | 0.796 | 0.826 | 0.817 | 0.782 | 0.698 |
Since the PCs based on LFVs could better separate the AFR and EUR groups, it would be interesting to see the performance of the tests with PCs constructed from LFVs. In many situations, a test with LFV-based PCs and with CV-based PCs performed similarly; however, as shown in Tables 4 and 5, when all the causal effects were in the same direction, it is clear that adjusting with more than one PC led to power loss, which was often substantial. For example, in the set-up with 30 causal SNPs with positive effects in region R2 (Table 5), the SSU and SSUw tests were most powerful; however, with 1, 5 and 10 PCs, the power of the SSU test monotonically decreased from 0.871 to 0.865, 0.803 and 0.781, respectively. This is a case called over-adjustment in the sense of losing substantial power when adjusting for population stratification (or more generally, confounders). This phenomenon was not specific to the first set of the five tests shown in Table 5; it also appeared for the second set (Table 4): for example, for region R1 with k1 = 8 causal SNPs with the same effect direction, the power of the T5, Fp, VT and EREC tests reduced respectively from 0.872, 0.872, 0.838 and 0.916 with no PC to 0.812, 0.805, 0.755 and 0.850 with 10 PCs constructed from 10000 LFVs. It is interesting to note that, in all our examples, using only the top PC could largely control the Type I error rate while maintaining the power (with no or negligible power loss).
Table 5.
Empirical power of various tests based on the parametric bootstrap for the two regions with k1 causal SNPs. The PCs were constructed using 10000 LFVs.
| Loc | Test | #PCs=0 | 1 | 5 | 10 | #PCs=0 | 1 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| R1, k1 = 8 | | βj ~ U(−log 3, log 3) | | | | βj ~ U(0, log 3) | | | |
| | Score | 0.489 | 0.498 | 0.499 | 0.486 | 0.838 | 0.836 | 0.805 | 0.787 |
| | SSU | 0.500 | 0.502 | 0.495 | 0.506 | 0.882 | 0.881 | 0.880 | 0.851 |
| | SSUw | 0.507 | 0.490 | 0.489 | 0.493 | 0.883 | 0.885 | 0.882 | 0.857 |
| | Sum | 0.240 | 0.223 | 0.226 | 0.219 | 0.860 | 0.857 | 0.850 | 0.821 |
| | UminP | 0.401 | 0.402 | 0.386 | 0.396 | 0.813 | 0.822 | 0.792 | 0.753 |
| R1, k1 = 19 | | βj ~ U(−log 2, log 2) | | | | βj ~ U(0, log 1.5) | | | |
| | Score | 0.483 | 0.474 | 0.481 | 0.490 | 0.504 | 0.497 | 0.465 | 0.452 |
| | SSU | 0.507 | 0.501 | 0.507 | 0.503 | 0.773 | 0.765 | 0.724 | 0.663 |
| | SSUw | 0.479 | 0.476 | 0.484 | 0.488 | 0.769 | 0.763 | 0.727 | 0.681 |
| | Sum | 0.207 | 0.202 | 0.194 | 0.180 | 0.842 | 0.835 | 0.819 | 0.791 |
| | UminP | 0.288 | 0.286 | 0.273 | 0.283 | 0.558 | 0.547 | 0.498 | 0.441 |
| R2, k1 = 4 | | βj ~ U(−log 3, log 3) | | | | βj ~ U(0, log 2) | | | |
| | Score | 0.256 | 0.242 | 0.233 | 0.229 | 0.707 | 0.703 | 0.628 | 0.616 |
| | SSU | 0.401 | 0.410 | 0.386 | 0.382 | 0.786 | 0.790 | 0.734 | 0.713 |
| | SSUw | 0.404 | 0.408 | 0.384 | 0.380 | 0.783 | 0.791 | 0.737 | 0.713 |
| | Sum | 0.405 | 0.407 | 0.384 | 0.380 | 0.784 | 0.786 | 0.737 | 0.712 |
| | UminP | 0.360 | 0.341 | 0.330 | 0.317 | 0.761 | 0.762 | 0.703 | 0.672 |
| R2, k1 = 30 | | βj ~ U(−log 2, log 2) | | | | βj ~ U(0, log 1.1) | | | |
| | Score | 0.364 | 0.361 | 0.347 | 0.337 | 0.776 | 0.761 | 0.657 | 0.634 |
| | SSU | 0.601 | 0.604 | 0.585 | 0.582 | 0.871 | 0.865 | 0.803 | 0.781 |
| | SSUw | 0.601 | 0.600 | 0.583 | 0.580 | 0.873 | 0.866 | 0.805 | 0.782 |
| | Sum | 0.599 | 0.604 | 0.579 | 0.578 | 0.874 | 0.863 | 0.804 | 0.787 |
| | UminP | 0.563 | 0.546 | 0.511 | 0.503 | 0.848 | 0.833 | 0.770 | 0.737 |
We explored the reason for the over-adjustment. We first hypothesized that the LFV-based PCs might reflect some hidden ethnic structure. When adjusting for ancestry using either reported two continental groups or reported ethnic subgroups, the test results were similar to those with no or only 1 PC; in other words, there was no loss of power. Second, we regressed the binary trait on the top 10 PCs and referred to the corresponding linear combination of the top 10 PCs as a PCs-defined group score. We found that in the cases with over-adjustment, the PCs-defined group score was much more significantly associated with the sum of the LFVs to be tested than those in other cases. Although the LFVs were randomly selected to construct PCs, we found that surprisingly many of them were highly correlated, as shown by an exome sequence dataset from the 1000 Genomes Project (Tintle et al 2011). For CVs, it is highly recommended to use only nearly independent SNPs to construct PCs (Patterson 2006; Lee et al 2011) to avoid the resulting PCs’ representing some peculiar features of the data. Hence, we first tried to remove highly correlated LFVs by using PLINK with a threshold of r² ≤ 0.5 or r² ≤ 0.05 respectively. Then using the top 10 PCs constructed with the remaining LFVs, we obtained the results (not shown) similar to those without adjustment (or to using CV-based PCs). In conclusion, the multiple PCs based on the original possibly highly correlated LFVs perhaps represented some unknown and possibly artificial structure in the data.
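The idea of removing highly correlated SNPs before constructing PCs can be illustrated with a simplified greedy filter on the squared genotype correlation. This is not PLINK's algorithm, which prunes within sliding windows; the function names are hypothetical and the sketch compares each SNP with every SNP already kept, which is only feasible for small examples.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a monomorphic SNP is treated as uncorrelated
    return cov * cov / (var_a * var_b)


def greedy_prune(snps, threshold):
    """Indices of SNPs kept so that every kept pair has r^2 <= threshold."""
    kept = []
    for j, snp in enumerate(snps):
        if all(r_squared(snp, snps[i]) <= threshold for i in kept):
            kept.append(j)
    return kept
```

Lowering the threshold from 0.5 to 0.05 keeps fewer SNPs but makes the retained set closer to independent, which is the property recommended for PC construction.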
SUBGROUP ANALYSIS WITH RVS
We used a subset of 786487 non-monomorphic RVs from chromosomes 2 to 22 to construct PCs. We first used PLINK (Purcell et al 2007) to prune correlated SNPs with a sliding window of size 50 (shifted by 5) and a threshold of r² < 0.05; after this thinning process, we had 305036 RVs. We then selected a random set of 10000 RVs to construct PCs. As shown in Figure 5, the first PC could largely separate the two continental groups, but not the 10 subgroups. Several PUR and PUR2 samples appeared to be outliers, which might have unduly influenced the PCA results; on the other hand, it may be argued that the RVs could better separate the PUR and PUR2 samples from other subgroups.
Figure 5.

The top three PCs constructed with 10000 RVs. The red/dark ones are the EUR samples and the green/grey ones are the AFR samples.
We applied the Tracy-Widom (TW) test (implemented in R package EigenCorr, Lee et al 2011), yielding a statistically significant p-value less than 0.05 for each of the top 19 eigenvalues (Table 6). We also applied one-way ANOVA to test the significance of each PC with varying mean values across the two continental groups or the 10 subgroups; in both cases, the most significant PCs were in the top 13.
Table 6.
The p-values of the Tracy-Widom (TW) test and one-way ANOVA applied to the eigenvalues or PCs constructed from 10000 RVs.
| # Eigenvalue or PC | TW | 2-group ANOVA | 10-subgroup ANOVA |
|---|---|---|---|
| 1 | 0.000e+00 | 6.667e-24 | 2.955e-25 |
| 2 | 0.000e+00 | 3.061e-21 | 9.428e-28 |
| 3 | 0.000e+00 | 1.212e-28 | 6.830e-27 |
| 4 | 0.000e+00 | 0.1283 | 0.0393 |
| 5 | 0.000e+00 | 3.165e-33 | 8.635e-34 |
| 6 | 0.000e+00 | 6.745e-10 | 1.593e-27 |
| 7 | 0.000e+00 | 9.098e-08 | 5.310e-18 |
| 8 | 0.000e+00 | 0.0349 | 1.002e-25 |
| 9 | 0.000e+00 | 0.1118 | 1.221e-55 |
| 10 | 0.000e+00 | 0.0002 | 3.654e-09 |
| 11 | 0.000e+00 | 0.2260 | 1.376e-05 |
| 12 | 0.000e+00 | 0.9646 | 0.9393 |
| 13 | 0.000e+00 | 0.0013 | 9.934e-09 |
| 14 | 0.000e+00 | 0.1094 | 0.3517 |
| 15 | 0.000e+00 | 0.2797 | 0.3142 |
| 16 | 0.000e+00 | 0.8881 | 0.1215 |
| 17 | 0.000e+00 | 0.4054 | 0.0243 |
| 18 | 3.800e-03 | 0.9982 | 0.9916 |
| 19 | 4.230e-02 | 0.1583 | 0.7185 |
| 20 | 1.580e-01 | 0.7069 | 0.3861 |
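As a reminder of what the ANOVA columns in Table 6 measure, here is a minimal Python sketch of the one-way ANOVA F statistic for a single PC across group labels. The function name `anova_f` and the toy data in the usage note are illustrative assumptions; the p-value would then be obtained from the F distribution with the returned degrees of freedom (for example with `scipy.stats.f.sf`), a step omitted here to keep the sketch free of external dependencies.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic and degrees of freedom.

    values: PC scores, one per sample.
    labels: group label for each sample (e.g. continental group or subgroup).
    Returns (F, df_between, df_within).
    """
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares: group sizes times squared mean shifts.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    # Within-group sum of squares: deviations from each group's own mean.
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs
    )
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k
```

For example, `anova_f([1, 2, 3, 7, 8, 9], ["A", "A", "A", "B", "B", "B"])` gives an F statistic of 54 with 1 and 4 degrees of freedom. A PC with nearly equal group means gives a small F statistic, which corresponds to the large p-values seen for PCs such as the 12th in Table 6.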
A visual examination of the scatter plots of the top PCs did not reveal that the top PCs could separate the 10 subgroups. Hence, we applied finite Gaussian mixture model-based clustering (implemented in the R package mclust) to the top 10, 20 and 50 PCs. Based on the Rand index (calculated using the R package clue), using the top 20 PCs led to the highest agreement between the resulting 12 clusters and the 10 true subgroups (with a Rand index value of 0.812 and an adjusted Rand index value of 0.408). As shown in Table 7, in agreement with Figure 5, although the two continental groups could be largely but not perfectly separated, the subgroups in the EUR group could not be distinguished: most of them were mixed into two clusters.
Table 7.
The numbers of samples assigned to each of the 12 clusters based on top 20 PCs constructed from 10000 RVs.
| Subgroup/Cluster | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CEU | 76 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FIN | 9 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GBR | 29 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| TSI | 78 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MXL | 0 | 0 | 0 | 0 | 0 | 3 | 14 | 0 | 0 | 0 | 0 | 0 |
| PUR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| YRI | 0 | 0 | 68 | 1 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| LWK | 0 | 0 | 0 | 38 | 24 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| ASW | 0 | 0 | 1 | 0 | 14 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| PUR2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
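The agreement measures quoted above can be computed directly from two label vectors. The following Python sketch uses the standard pair-counting definitions of the Rand index and the adjusted Rand index; it is not the code of the R package clue, and the function name `rand_indices` is an assumption made here.

```python
from collections import Counter
from math import comb


def rand_indices(truth, clusters):
    """Return (Rand index, adjusted Rand index) for two labelings of the same samples."""
    n = len(truth)
    pairs = comb(n, 2)
    # Number of sample pairs placed together in both, in truth only, in clusters only.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, clusters)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_clust = sum(comb(c, 2) for c in Counter(clusters).values())
    # Rand index: pairs treated the same way (together or apart) by both labelings.
    rand = (pairs + 2 * same_both - same_truth - same_clust) / pairs
    # Adjusted Rand index: correct for the agreement expected by chance.
    # Note: undefined (division by zero) if both labelings put all samples in one group.
    expected = same_truth * same_clust / pairs
    max_index = (same_truth + same_clust) / 2
    adjusted = (same_both - expected) / (max_index - expected)
    return rand, adjusted
```

Identical labelings give (1.0, 1.0), whereas `rand_indices([0, 0, 1, 1], [0, 1, 0, 1])` gives a Rand index of 1/3 and an adjusted Rand index of -0.5. The gap between the two indices illustrates why a fairly high Rand index (0.812) can coexist with a modest adjusted value (0.408): the unadjusted index counts the many pairs that are apart in both labelings.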
ASSOCIATION TESTING WITH RVS
We also conducted a simulation study with the 40 RVs in region R3. To assess the Type I error rates, we generated a binary trait as before under population stratification; for power, we randomly selected k1 = 15 RVs as causal ones with their effect sizes βj ~ U(0, log(3)). We used the parametric bootstrap for p-value calculation for each test. Note that since all the 40 SNPs had MAF less than 1%, the results for the T1 and T5 tests were exactly the same. As shown in Table 8, under H0, if no adjustment was made, all the tests resulted in dramatically inflated Type I errors. Since the first PC could not completely separate the two continental groups (Figure 5), using only the top PC still yielded largely inflated Type I error rates. In contrast, using the top 10 or 20 PCs could largely remedy the problem, though there were still some slightly inflated Type I error rates for some tests, which could be due to the fact that even the top PCs could not completely separate the two continental groups (Figure 5 and Table 7). For power, with the exception of the score, SSU and SSUw tests, all other tests seemed to have some power loss with 20 PCs.
Table 8.
Type I error and power for region R3. The PCs were constructed using 10000 RVs.
| Test | Type I error (#PCs=0) | Type I error (#PCs=1) | Type I error (#PCs=10) | Type I error (#PCs=20) | Power (#PCs=0) | Power (#PCs=1) | Power (#PCs=10) | Power (#PCs=20) |
|---|---|---|---|---|---|---|---|---|
| Score | 0.972 | 0.521 | 0.114 | 0.086 | 0.525 | 0.519 | 0.504 | 0.537 |
| SSU | 0.995 | 0.301 | 0.040 | 0.052 | 0.654 | 0.639 | 0.623 | 0.640 |
| SSUw | 0.992 | 0.542 | 0.056 | 0.070 | 0.659 | 0.652 | 0.634 | 0.652 |
| Sum | 0.995 | 0.818 | 0.076 | 0.050 | 0.671 | 0.664 | 0.628 | 0.630 |
| UminP | 0.900 | 0.561 | 0.108 | 0.088 | 0.492 | 0.476 | 0.427 | 0.434 |
| T1, T5 | 0.995 | 0.821 | 0.061 | 0.038 | 0.663 | 0.673 | 0.624 | 0.608 |
| Fp | 0.993 | 0.816 | 0.064 | 0.041 | 0.658 | 0.654 | 0.615 | 0.606 |
| VT | 0.984 | 0.647 | 0.056 | 0.057 | 0.605 | 0.590 | 0.537 | 0.533 |
| EREC | 0.997 | 0.119 | 0.012 | 0.066 | 0.662 | 0.648 | 0.609 | 0.594 |
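The simulation and bootstrap steps used for Table 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this study: it omits covariates (so the fitted null model reduces to the overall case proportion), it takes the SSU statistic to be the sum of squared score components with score vector U = sum over samples of (Y_i - Ybar) X_i, and it uses the (hits + 1)/(B + 1) convention for the bootstrap p-value. The function names and the intercept argument `beta0` are hypothetical.

```python
import math
import random


def simulate_trait(genotypes, causal, beta0, rng):
    """Draw a binary trait from a logistic model with effects beta_j ~ U(0, log 3)."""
    betas = {j: rng.uniform(0.0, math.log(3.0)) for j in causal}
    y = []
    for row in genotypes:
        eta = beta0 + sum(b * row[j] for j, b in betas.items())
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
    return y


def ssu(genotypes, y):
    """Sum of squared score components under an intercept-only null model."""
    ybar = sum(y) / len(y)
    m = len(genotypes[0])
    scores = [
        sum(row[j] * (yi - ybar) for row, yi in zip(genotypes, y)) for j in range(m)
    ]
    return sum(u * u for u in scores)


def bootstrap_pvalue(genotypes, y, n_boot, rng):
    """Parametric bootstrap p-value: resample the trait from the fitted null model."""
    observed = ssu(genotypes, y)
    p_null = sum(y) / len(y)  # fitted null probability when there are no covariates
    hits = 0
    for _ in range(n_boot):
        y_star = [1 if rng.random() < p_null else 0 for _ in y]
        # The null model is refitted inside ssu() for each bootstrap sample.
        if ssu(genotypes, y_star) >= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)
```

With PCs as covariates, the null probabilities would instead come from a logistic regression of the trait on the PCs, and the bootstrap traits would be drawn from those sample-specific fitted probabilities.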
DISCUSSION
We have used a low-coverage whole-genome sequencing dataset generated by the 1000 Genomes Project to empirically investigate some characteristics of LFVs or RVs that are relevant to their association analysis. For example, some might argue that, due to the low MAFs, LFVs and RVs are expected to be independent; we have demonstrated that the neighboring LFVs or RVs in a region may be in low, moderate or high LD, suggesting that future studies on the performance of any association test should consider varying LD as a factor. Furthermore, as a useful complement to the extensive simulation studies of Basu and Pan (2011), we have used real sequence data to demonstrate the power properties of the various tests with or without PCs, though it was not the main aim of the current study. In particular, it is confirmed that the Sum test, a representative of simple pooled association tests (Dering et al 2011), is not powerful in the presence of different association directions or of many non-causal SNPs; in contrast, the SSU and SSUw tests are much more powerful in these situations. It is also shown that the asymptotics of the Sum, SSU and SSUw tests seemed to work well with a reasonable sample size for LFVs and were much more robust than those of the score test. Of course, with small sample sizes or RVs with extremely low MAFs, one has to be cautious in using asymptotics. As shown here and in other places (Lin and Tang 2011; Wu et al 2011b), the parametric bootstrap is a useful alternative. Given the generally good performance of the SSU and SSUw tests, we would recommend their use in practice; if the applicability of the asymptotics is of concern, a two-step procedure can be taken: one could first use the asymptotics-based SSU or SSUw test to quickly scan the genome, then apply the more computing-intensive bootstrap-based SSU or SSUw test to the more significant regions identified in the first step.
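The two-step procedure suggested above can be outlined in a few lines of Python. This is only a schematic sketch: `fast_pvalue` and `slow_pvalue` are hypothetical callables standing in for the asymptotics-based and bootstrap-based versions of a test, and the screening level `screen_alpha` is an illustrative choice, not a recommendation from this study.

```python
def two_step_scan(regions, fast_pvalue, slow_pvalue, screen_alpha=0.01):
    """Screen all regions with a fast test, then re-test promising ones.

    regions: dict mapping a region name to its data (any object the tests accept).
    Returns a dict mapping each region name to (p-value, method used).
    """
    results = {}
    for name, data in regions.items():
        p = fast_pvalue(data)
        if p <= screen_alpha:
            # Only regions passing the screen pay the cost of the bootstrap.
            results[name] = (slow_pvalue(data), "bootstrap")
        else:
            results[name] = (p, "asymptotic")
    return results
```

The design choice is simply to spend the expensive computation where it can change a conclusion: regions with clearly non-significant asymptotic p-values are left as they are.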
Perhaps the most interesting finding of this study is that, in accordance with Siu et al (2012) but differing from Baye et al (2011), PCs constructed with LFVs could potentially separate different continental or ethnic groups better than those constructed with CVs, though either can be used to adjust for population stratification effectively. We note that Siu et al (2012) used a whole-genome sequence dataset similar to ours, while Baye et al (2011) used a smaller subset of the exome sequence dataset with far fewer LFVs or RVs. In addition, differing from Mathieson and McVean (2012), we focused on two relatively well-separated populations, i.e. AFR and EUR samples; further studies are warranted for other more challenging cases. In all our numerical examples, using PCs based on CVs led to no or little power loss in the absence of population stratification; in contrast, and surprisingly, using multiple PCs based on LFVs might result in over-adjustment in the sense of substantial power loss. It is also interesting to note that, in all our examples, using only the top PC based on LFVs could largely control the Type I error rate while maintaining the power (with no or minimal power loss). The over-adjustment with multiple PCs based on LFVs in our experiments was likely due to the use of many LFVs in high LD; once we used LFVs not in high LD, the problem largely disappeared. This is in agreement with two known results: first, it is highly recommended to use only almost independent CVs to construct PCs (Patterson et al 2006; Lee et al 2011); second, for unknown reasons, there seem to exist long-range correlations among LFVs or RVs in real sequence data (Tintle et al 2011). Hence, one has to be careful in selecting LFVs or RVs to construct PCs; in particular, a random subset of far-away LFVs or RVs may not be sufficient. Furthermore, our preliminary analysis also shows that PCA of RVs with MAFs < 1% might not be effective in separating subpopulations.
One possible reason is the sensitivity of PCA to outliers, which are present with some diverse subpopulations and largely varying numbers of subpopulation samples; it would be interesting to apply other more robust methods (e.g. Lee et al 2011). We also emphasize that our conclusions are based on the use of a low-coverage whole-genome sequencing dataset, which may be different from high-coverage sequencing data; for example, high-coverage sequencing tends to uncover more RVs (Tennessen et al 2012). Importantly, we only considered using CVs, LFVs or RVs, but not their combined use; it remains to be investigated how to select and combine CVs, LFVs and RVs to best capture population structures. Finally, since our current study focuses on unrelated samples and the PC-based adjustment for population stratification, it would be interesting to investigate the same issues with other adjustment methods (e.g. Pritchard et al 2000; Zhu et al 2002, 2008; Guan et al 2009; Engelhardt and Stephens 2010; Lee et al 2010) or for family studies (Zhu et al 2010; Feng et al 2011).
R code will be posted on our web site at http://www.biostat.umn.edu/~weip/prog.html.
ACKNOWLEDGMENTS
We thank the reviewers for many helpful and constructive comments and suggestions. YZ and WP were supported by NIH grants R21DK089351, R01HL65462, R01HL105397 and R01GM081535.
REFERENCES
- 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
- Asimit J, Zeggini E. Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010;44:293–308. doi: 10.1146/annurev-genet-102209-163421.
- Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nature Reviews Genetics. 2010;11:773–785. doi: 10.1038/nrg2867.
- Basu S, Pan W. Comparison of Statistical Tests for Association with Rare Variants. Genetic Epidemiology. 2011;35:606–619. doi: 10.1002/gepi.20609.
- Baye TM, He H, Ding L, Kurowski BG, Zhang X, Martin LJ. Population structure analysis using rare and common functional variants. BMC Proceedings. 2011;5(Suppl 9):S8. doi: 10.1186/1753-6561-5-S9-S8.
- Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of p values for multiple correlated tests. Am J Hum Genet. 2007;81:1158–1168. doi: 10.1086/522036.
- Dering C, Hemmelmann C, Pugh E, Ziegler A. Statistical analysis of rare sequence variants: an overview of collapsing methods. Genetic Epidemiology. 2011;35(S1):S12–S17. doi: 10.1002/gepi.20643.
- Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman and Hall/CRC; Boca Raton, FL: 1993.
- Engelhardt BE, Stephens M. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics. 2010;6(9):e1001117. doi: 10.1371/journal.pgen.1001117.
- Feng T, Zhu X. Genome-wide searching of rare genetic variants in WTCCC data. Human Genetics. 2010;128:269–280. doi: 10.1007/s00439-010-0849-9.
- Goeman JJ, van de Geer S, van Houwelingen HC. Testing against a high dimensional alternative. J R Stat Soc B. 2006;68:477–493.
- Guan W, Liang L, Boehnke M, Abecasis GR. Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genet Epidemiol. 2009;33:508–517. doi: 10.1002/gepi.20403.
- Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70:42–54. doi: 10.1159/000288704.
- Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet. 2008;82:386–397. doi: 10.1016/j.ajhg.2007.10.010.
- Lee AB, Luca D, Klei L, Devlin B, Roeder K. Discovering genetic ancestry using spectral graph theory. Genetic Epidemiology. 2010;34:51–59. doi: 10.1002/gepi.20434.
- Lee S, Wright FA, Zou F. Control of population stratification by correlation-selected principal components. Biometrics. 2011;67:967–974. doi: 10.1111/j.1541-0420.2010.01520.x.
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
- Li Y, Byrnes AE, Li M. To identify associations with rare variants, Just WHaIT: weighted haplotype and imputation-based tests. Am J Hum Genet. 2010;87:728–735. doi: 10.1016/j.ajhg.2010.10.014.
- Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015.
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
- Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature Genetics. 2012;44:243–246. doi: 10.1038/ng.1074.
- Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genetics. 2011;7(3):e1001322. doi: 10.1371/journal.pgen.1001322.
- Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genetic Epidemiology. 2009;33:497–507. doi: 10.1002/gepi.20402.
- Pan W. Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing. Genetic Epidemiology. 2011;35:211–216. doi: 10.1002/gepi.20567.
- Pan W, Shen X. Adaptive tests for association analysis of rare variants. Genetic Epidemiology. 2011;35:381–388. doi: 10.1002/gepi.20586.
- Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genetics. 2006;2(12):e190. doi: 10.1371/journal.pgen.0020190.
- Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11:459–463. doi: 10.1038/nrg2813.
- Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei L-J, Sunyaev SR. Pooled association tests for rare variants in exon-resequenced studies. Am J Hum Genet. 2010b;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
- Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575. doi: 10.1086/519795.
- Siu H, Jin L, Xiong M. Manifold Learning for Human Population Structure Studies. PLoS ONE. 2012;7(1):e29901. doi: 10.1371/journal.pone.0029901.
- Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–69. doi: 10.1126/science.1219240.
- Tintle N, Aschard H, Hu I, Nock N, Wang H, Pugh E. Inflated type I error rates when using aggregation methods to analyze rare variants in the 1000 Genomes Project exon sequencing data in unrelated individuals: summary results from Group 7 at Genetic Analysis Workshop 17. Genetic Epidemiology. 2011;35(S1):S56–S60. doi: 10.1002/gepi.20650.
- Wessel J, Schork NJ. Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet. 2006;79:792–806. doi: 10.1086/508346.
- Wu C, DeWan A, Hoh J, Wang Z. A Comparison of Association Methods Correcting for Population Stratification in Case-Control Studies. Annals of Human Genetics. 2011;75:418–427. doi: 10.1111/j.1469-1809.2010.00639.x.
- Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. Am J Hum Genet. 2010;86:929–942. doi: 10.1016/j.ajhg.2010.05.002.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011b;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
- Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002;23:181–196. doi: 10.1002/gepi.210.
- Zhu X, Li S, Cooper RS, Elston RC. A unified association analysis approach for family and unrelated samples. Am J Hum Genet. 2008;82:352–365. doi: 10.1016/j.ajhg.2007.10.009.
- Zhu X, Feng T, Li Y, Lu Q, Elston RC. Detecting rare variants for complex traits using family and unrelated data. Genetic Epidemiology. 2010;34:171–187. doi: 10.1002/gepi.20449.

