Abstract
Recent advancements in next-generation DNA sequencing technologies have made it plausible to study the association of rare variants with complex diseases. Due to the low frequency, rare variants need to be aggregated in association tests to achieve adequate power with reasonable sample sizes. Hierarchical modeling/kernel machine methods have gained popularity among many available methods for testing a set of rare variants collectively. Here, we propose a new score statistic based on a hierarchical model by additionally modeling the distribution of rare variants under the case-control study design. Results from extensive simulation studies show that the proposed method strikes a balance between robustness and power and outperforms several popular rare-variant association tests. We demonstrate the performance of our method using the Dallas Heart Study.
Keywords: case control study, rare variant, gene set, population genetic distribution
Introduction
Genome-wide association studies (GWAS) have successfully identified hundreds of variants associated with complex human traits. Most of the identified variants are common, with minor allele frequencies (MAFs) above 0.05. However, for many complex traits, there is still missing and unexplained heritability [Gibson 2012; Manolio, et al. 2009]. The 1000 Genomes Project revealed that a significant proportion of variants in any individual genome (5–10%) had never been seen before [Altshuler et al., 2012; Keinan and Clark, 2012]. It is now recognized that common and rare variants are equally important for understanding the genetic basis of human traits. Next-generation sequencing technology makes it plausible to identify these rare variants in individual samples [Altshuler et al., 2012].
By convention, SNPs with MAF between 1% and 5% are referred to as low-frequency variants and SNPs with MAF less than 1% as “rare variants” [Lee et al. 2014]. Due to their low frequencies, rare variants are typically found in a small fraction of sequenced samples, which poses a challenge in association testing. The standard method for testing association with individual common SNPs is usually underpowered for rare variants. Some powerful test statistics have recently been proposed, which largely fall into two groups: 1) The “burden” tests [Li and Leal, 2008; Morgenthaler and Thilly, 2007; Pan, 2009], which first collapse information across rare variants (either weighted or unweighted) to create a “super” variant, then test this “super” variant for association with the phenotype of interest. 2) Variance component based tests [Wu et al., 2011; Neale et al., 2011], which assume that the association parameter for each rare variant follows a distribution that has mean 0 and variance σ2, and assess the association by testing the null hypothesis σ2 = 0. The burden tests assume a common effect for all variants and therefore have high power if the effects of different variants are in the same direction or a large proportion of the variants are associated. The variance component tests are more powerful when the directions of association across variants differ or a large proportion of the variants are null. They are not sensitive to the averaged effect of the variants in the study region. Recently, two methods have been proposed to unify the power of these two classes of tests. Lee et al. [Lee et al., 2012] proposed the Sequence Kernel Association Test-Optimum (SKAT-O), a test that is a linear combination of a burden test and the Sequence Kernel Association Test (SKAT) variance component test. Sun et al. [Sun et al., 2013] proposed a method that combines score statistics for the averaged group effect and variance components of individual SNP effects based on a mixed effects model (MiST).
Both methods were designed to test the average and variation in individual effects of rare variants jointly.
In this paper, we propose a novel unified association test that enhances the efficiency of the current rare variant tests. It has been shown that population genetic modeling of the distribution of genetic variants can lead to improved power for assessing genetic association in retrospective case-control studies [Chen and Chatterjee, 2007]. A similar idea has been employed to improve power for assessing gene-environment interaction effects by imposing the constraint of gene-environment independence [Chatterjee and Carroll, 2005]. This motivates us to integrate the population genetic model with the generalized mixed effects model through hierarchical modeling of both common and individual rare variant effects within one unified association test, borrowing strength across all the variants to enhance power. The proposed test is built upon the retrospective likelihood that accounts for ascertainment bias resulting from case-control sampling [Piegorsch et al., 1994].
We conducted extensive simulation studies to compare our newly proposed tests with two unified test competitors: SKAT-O and MiST tests. Together with analysis of data from the Dallas Heart Study, we showed that our test is valid and appealing because it enhances the robustness and power of the two existing unified tests across a range of realistic scenarios.
Methods
Notation and Model
Suppose that n1 cases and n0 controls are sequenced in a region where K variants are genotyped. Let Yi denote case-control status for the ith subject (Yi =1 for cases; Yi =0 for controls), Gi =(Gi1, Gi2,..., GiK )′ denote the numerical values of the K genotypes for the ith subject, and Ḡi denote the averaged genotype value across the K variants for the ith subject. Using a logistic penetrance model, the K variants are related to the case-control status through
| (1) |
In this model, the association of variants with the case-control status is described by two sets of parameters [Sun et al., 2013; Wang et al., 2012]: γ and β =(β1, β2,..., βK)′. γ is the regression coefficient that describes the average effect of all K variants, that is, the fixed burden effect, and βk is the deviation of the kth individual variant from the burden effect γ. The βk’s are assumed to be realizations from a distribution with mean 0 and variance σ2. Following [Wu et al. 2011], we are interested in testing whether any of these variants is associated with phenotype Y, with the null hypothesis H0: γ =0,σ2 =0.
Modeling the distribution of rare variants and the likelihood function
It has been widely recognized that multilevel modeling can improve the estimation efficiency and testing power when the sample size or the number of events is small [Rao, 2003]. This is the case in testing association with a group of rare variants since the number of subjects who carry a particular mutation is small. We propose modeling the joint distribution of the K genotype variables to borrow strength across all variants and thereby increase the power of the overall test of H0. Specifically, we assume that for subject i: (a) Variant Gik follows a Bernoulli distribution with two genotypes AA (Gik =0) and Aa (Gik =1); (b) MAF pik =P(Gik =1) independently follows a Beta distribution Beta (a1, a2), which is a conjugate prior for the Bernoulli distribution; and (c) K variants are independently distributed in controls. Under (a)–(c), the joint distribution of Gi can then be written as
| (2) |
As shown in this equation, the genotype Gik marginally follows a Bernoulli distribution with success probability p = a1/(a1 + a2). Here we can see that a1 and a2 play their roles only through p, the average proportion of the minor allele across all the variants.
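The marginal success probability can be checked directly from assumptions (a) and (b). The short derivation below is a sketch that uses only the Bernoulli and Beta(a1, a2) assumptions stated above; B(·,·) denotes the Beta function, a symbol introduced here for illustration.

```latex
\begin{aligned}
P(G_{ik}=1)
  &= \int_0^1 p \,\frac{p^{a_1-1}(1-p)^{a_2-1}}{B(a_1,a_2)}\,dp
   = \frac{B(a_1+1,\,a_2)}{B(a_1,\,a_2)} \\
  &= \frac{\Gamma(a_1+1)\,\Gamma(a_1+a_2)}{\Gamma(a_1)\,\Gamma(a_1+a_2+1)}
   = \frac{a_1}{a_1+a_2}.
\end{aligned}
```

With a1 = 1, as used in the simulations, this gives 1/(1 + a2), so a larger a2 corresponds to a smaller average minor allele proportion.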
Following Goeman (2006), we assume that β1,...,βK follow a common distribution with expectation zero and variance σ2, which is equivalent to β = σb where b does not depend on σ2 with E(b) = 0 and E(bb′) = I. Under the association model (1), population genetic model (2), and rare disease approximation, it is straightforward to show that the logarithm of the retrospective likelihood function can be written as
| (3) |
Score statistics under the null hypothesis H0: γ =0,σ2 =0
Following Sun et al. (2013), we test H0: γ =0,σ2 =0 using two score tests: the score test for the burden effect γ under the null H0 :γ = 0 with the constraint σ2 =0, and the score test for σ2 under the null H0: σ2 =0 without any restriction on γ. First, we derive the likelihood score function for γ based on (3) under the null H0: γ = 0 given that σ2 =0, as
Plug the maximum likelihood estimator (MLE) of the unknown parameter p under the null, which is given by , into S̃γ we obtain
| (4) |
The score statistics for testing H0: γ =0 given that σ2 =0 can then be written as , where is the estimated variance for Sγ. By the central limit theorem, as n0 and n1 go to infinity, this statistic has an asymptotic standard normal distribution N(0,1) under the null hypothesis. The p-value can be calculated accordingly. We derived that (Appendix) and obtained by plugging p̂ into var(Sγ).
We derived the score statistic for variance component σ2 by applying Lemma 3 in Goeman et al. (2006) as
| (5) |
where and , k=1, …, K. Following Goeman et al. 2006 and Wu et al. 2011, the asymptotic distribution of Sσ2 can be approximated by a mixture of chi-squared distributions, i.e., Sσ2 ~ Σj λj χ²1,j (j = 1, ..., r), where the χ²1,j are independent chi-squared random variables with one degree of freedom and λ1, λ2,..., λr are the non-zero eigenvalues of the estimated covariance matrix of T, . We derived in the Appendix , where 1 is a vector of 1s with dimension K and I is the identity matrix with dimension K × K, and p̂ is the MLE of the nuisance parameter under the null hypothesis. The Davies method [Davies, 1980] is used to approximate this mixture of chi-squared distributions and calculate the p-values. As for all parametric methods, this asymptotic distribution is derived under the specified model and assumptions. When the model assumptions are not satisfied, however, it is not feasible to obtain a valid model-based estimate of the covariance matrix of T, . Ignoring model misspecification leads to inflated type I error rates for our asymptotic method. As an alternative, a valid p-value can be calculated by permutation, where the disease status Yi is shuffled among subjects and the p-value is calculated as the proportion of the permuted score statistics that are greater than the observed score statistic Sσ2.
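The permutation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `permutation_pvalue` and `toy_statistic` are hypothetical names, and `toy_statistic` is a simple placeholder (sum of squared case-control differences in carrier frequency), not the CCS score statistic Sσ2.

```python
import numpy as np


def toy_statistic(y, G):
    """Placeholder statistic (NOT the CCS score statistic): sum over variants
    of the squared difference in carrier frequency between cases and controls."""
    cases = G[y == 1].mean(axis=0)
    controls = G[y == 0].mean(axis=0)
    return float(np.sum((cases - controls) ** 2))


def permutation_pvalue(y, G, statistic, n_perm=1000, seed=0):
    """Shuffle disease status among subjects and return the proportion of
    permuted statistics that are strictly greater than the observed one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    G = np.asarray(G)
    observed = statistic(y, G)
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # genotypes stay fixed, labels are shuffled
        if statistic(y_perm, G) > observed:
            exceed += 1
    return exceed / n_perm
```

Because this proportion can be exactly zero, a common variant in practice is to report (exceed + 1) / (n_perm + 1) instead; that adjustment is not part of the description above.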
Following Sun et al. 2013, we use Tippett’s procedure to combine these two independent score tests to produce the overall p-value for the joint test of H0: γ = 0,σ2 =0. This combining procedure requires that Sγ and Sσ2 be independent, which is proved in the Appendix. Specifically, let pγ and pσ2 denote the p-values from Sγ and Sσ2, respectively. The overall p-value is calculated as 1 − (1 − min(pγ, pσ2))².
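As a small worked illustration of the combination rule, the following Python sketch computes the overall p-value from two component p-values. The function name `tippett_combine` is hypothetical, and the formula is valid only under the independence of the two tests stated above.

```python
def tippett_combine(p_gamma: float, p_sigma2: float) -> float:
    """Overall p-value 1 - (1 - min(p_gamma, p_sigma2))^2 for two
    independent tests (Tippett's minimum-p procedure)."""
    for p in (p_gamma, p_sigma2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    p_min = min(p_gamma, p_sigma2)
    return 1.0 - (1.0 - p_min) ** 2
```

For example, with pγ = 0.03 and pσ2 = 0.20 the combined p-value is 1 − 0.97² = 0.0591, which is larger than the smaller component p-value because two tests were examined.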
Simulation Studies
We carried out extensive simulation studies to compare the performance of our case-control score method (CCS) with SKAT-O [Lee et al., 2012] and MiST [Sun et al., 2013], the two available methods for testing the group common effect and variance components jointly. We used R packages SKAT and MiST for the corresponding methods. We used the equal weight in SKAT-O method to reflect the independence of the MAF and the effect size in the simulation settings. We evaluated type I error rates and power as a function of sample size, number of neutral variants, genotype distribution, and degree of rareness of genotype data. We set the disease prevalence (p) equal to 5% under the various null scenarios. Among K variants in the model, we randomly set eight variants to be associated with the phenotype and 8 or 50 neutral variants that were not associated.
Generate Genotype data
We generated genotype data G using three approaches. Model A) We assumed that all the variants are de novo and that their genotypes were generated from a Bernoulli distribution with an independent mutation rate generated from the Beta distribution Beta (1, a2 ) as in model (2). Model B) To assess the robustness of our proposed test, we also generated the genotype data from a distribution that deviated from model (2). Following the simulation design of Basu and Pan 2011, we first generated two independent haplotypes by dichotomizing two latent components from a multivariate normal distribution. Between any two latent components, a first-order autoregressive covariance structure was assumed. The MAF was restricted to between 0.001 and 0.01. The genotype data were obtained by combining the two haplotypes. Model C) We used the software package Genome, a coalescent-based whole-genome simulator [Liang et al. 2007], to generate a population with effective size N=60,000 for p=1%, and 1000 original variants. We retained variants with MAF less than 0.01 (about 64%) in the analyses. Starting from a randomly selected variant in the retained set, the following K variants were taken as the studied variants. The causal variants were randomly selected from the variant set. The recombination rate between consecutive fragments was taken as the default value 0.0001. For models A and C, we generated n0 = n1 =500 or 1,000 cases and controls. For model B, since the studied methods can have meaningful power with a smaller sample size, we considered n0 = n1 = 200 or 500 cases and controls in the simulations.
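Genotype generation under Model A can be sketched as follows. This is an illustrative reading of the description above, not the authors' simulation code: it assumes one mutation rate per variant drawn from Beta(a1, a2), draws population genotypes only, and does not simulate disease status or case-control sampling. The function name and arguments are hypothetical.

```python
import numpy as np


def simulate_model_a(n_subjects, n_variants, a2, a1=1.0, seed=0):
    """Draw a mutation rate for each variant from Beta(a1, a2), then draw
    independent Bernoulli genotypes (0 = AA, 1 = Aa) for every subject."""
    rng = np.random.default_rng(seed)
    maf = rng.beta(a1, a2, size=n_variants)  # assumption: one rate per variant
    # maf (length n_variants) is broadcast across the rows (subjects)
    G = rng.binomial(1, maf, size=(n_subjects, n_variants))
    return G, maf


G, maf = simulate_model_a(n_subjects=1000, n_variants=16, a2=200)
print(G.shape, round(float(maf.mean()), 4))
```

Larger values of `a2` shift the drawn mutation rates toward zero, so most columns of `G` contain only a handful of carriers.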
Causal Variant Effect Patterns
Throughout the simulations, we set the significance level at 0.05. For type I error evaluation, we set γ =0, and βk =0 for k =1,..., K. For the power evaluation, we considered four odds ratio (OR) parameter patterns for 8 randomly selected disease-associated variants [Wang et al., 2012]. From Model (1), the marginal OR for individual variant is given as . The four patterns were: 1) OR=(2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0), the direction and effect sizes were the same; 2) OR= (2.0, 2.0, 1.8, 1.8, 1.4, 1.4, 1.2, 1.2), the effects were in the same direction but of different magnitudes; 3) OR= (2.0, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5), the effects of the first four variants were in the same direction while those of the other four were in the opposite direction, and the average effect was zero; and 4) OR=(3.0, 3.0, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5), both the direction and strength of association varied but the average effect was positive.
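The statements about the average effect in each pattern can be verified with a few lines of Python. This sketch assumes that the "average effect" is the mean of the log odds ratios over the eight causal variants; the names `PATTERNS` and `mean_log_or` are hypothetical.

```python
import math

# Odds ratios of the eight causal variants in each pattern
PATTERNS = {
    1: [2.0] * 8,
    2: [2.0, 2.0, 1.8, 1.8, 1.4, 1.4, 1.2, 1.2],
    3: [2.0] * 4 + [0.5] * 4,
    4: [3.0, 3.0, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5],
}


def mean_log_or(odds_ratios):
    """Average effect on the log-odds scale (assumed definition)."""
    return sum(math.log(o) for o in odds_ratios) / len(odds_ratios)


for pattern, ors in PATTERNS.items():
    print(pattern, round(mean_log_or(ors), 3))
```

On this scale the positive and negative effects in pattern 3 cancel exactly, while pattern 4 keeps a positive mean because the two variants with OR = 3.0 are not offset.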
Other Scenario
In order to check the robustness of the proposed method on dealing with the confounding covariates, we designed the following simulation scenario. We generated the genotype data from genotype model A with disease prevalence being 5% and the disease status using the logistic model with one extra confounding covariate, i.e.
where X is generated from Uniform(0,1) for each subject and βX varies from 0 to 2. In the analysis, we pretended we did not know the confounding covariate X and ignored it in all three methods.
Results
In the simulation setting, the parameter a2 was set to 200 or 400, where a larger value of a2 leads to rarer variants. Table 1 describes the Beta probability distribution of MAFs for different values of a2. When a2 =200, on average 14% of variants had MAF between 0.01 and 0.05, 68% had MAF between 0.001 and 0.01, and 18% had MAF below 0.001. With a2 equal to 400, 65% of the variants had MAF between 0.001 and 0.01, and about 33% had MAF less than 0.001. Therefore, the considered variants were quite “rare”.
Table 1.
Probability distribution of Beta (a1=1, a2)
| a2 | P(MAF>0.05) | P(0.01<MAF<0.05) | P(0.001<MAF<0.01) | P(MAF<0.001) |
|---|---|---|---|---|
| 200 | 0.000 | 0.135 | 0.683 | 0.181 |
| 400 | 0.000 | 0.018 | 0.652 | 0.330 |
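The entries of Table 1 can be approximated in closed form, because the Beta(1, a2) distribution has cumulative distribution function F(x) = 1 − (1 − x)^a2. The Python sketch below uses that identity; the function names are hypothetical, and the closed-form values may differ from the tabulated ones in the third decimal place.

```python
def beta1_cdf(x, a2):
    """CDF of Beta(1, a2) at x: 1 - (1 - x)^a2."""
    return 1.0 - (1.0 - x) ** a2


def maf_bin_probs(a2):
    """Probability that a MAF drawn from Beta(1, a2) falls in each Table 1 bin."""
    c001 = beta1_cdf(0.001, a2)
    c01 = beta1_cdf(0.01, a2)
    c05 = beta1_cdf(0.05, a2)
    return {
        "MAF>0.05": 1.0 - c05,
        "0.01<MAF<0.05": c05 - c01,
        "0.001<MAF<0.01": c01 - c001,
        "MAF<0.001": c001,
    }


for a2 in (200, 400):
    print(a2, {k: round(v, 3) for k, v in maf_bin_probs(a2).items()})
```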
Tables 2–4 present the empirical type I error rates for all three methods under the three genotype models and the confounding scenario. Our CCS method based on permutation (CCS_perm) and the competing methods all maintained the nominal type I error rates across settings, regardless of the sample size, number of neutral variants and genotype distribution. As expected, under genotype model A the CCS method based on the asymptotic approximation (CCS_asy) maintained the nominal type I error rate. When the model assumption for the asymptotic theory was violated, as under genotype models B and C, the asymptotic method produced inflated type I error rates; those results are omitted here and in the following power section.
Table 2.
Type I Error rates at nominal significance level α=0.05 under genotype model A where all the variants are de novo and their genotypes are independently generated from a Bernoulli distribution with an independent mutation rate generated from Beta(1, a2) distribution as in model (2). p=5%. Based on 5000 simulations.
| Methods | n0=n1 | a2=200, 8 NV | a2=200, 50 NV | a2=400, 8 NV | a2=400, 50 NV |
|---|---|---|---|---|---|
| SKAT-O_perm | 500 | 0.043 | 0.044 | 0.052 | 0.054 |
| SKAT-O_perm | 1000 | 0.056 | 0.048 | 0.041 | 0.045 |
| SKAT-O | 500 | 0.048 | 0.052 | 0.050 | 0.052 |
| SKAT-O | 1000 | 0.047 | 0.041 | 0.038 | 0.039 |
| MiST_perm | 500 | 0.046 | 0.048 | 0.044 | 0.050 |
| MiST_perm | 1000 | 0.053 | 0.048 | 0.051 | 0.052 |
| MiST | 500 | 0.048 | 0.038 | 0.036 | 0.037 |
| MiST | 1000 | 0.044 | 0.047 | 0.041 | 0.045 |
| CCS_asy | 500 | 0.045 | 0.046 | 0.050 | 0.048 |
| CCS_asy | 1000 | 0.050 | 0.049 | 0.048 | 0.052 |
| CCS_perm | 500 | 0.047 | 0.047 | 0.060 | 0.056 |
| CCS_perm | 1000 | 0.056 | 0.052 | 0.055 | 0.055 |
NV: number of neutral variants.
Table 4.
Type I error rates at nominal significance level alpha=0.05 under genotype model A with confounding effect βX. Disease prevalence=5%. n0=n1=500. Number of neutral variants is 8. Based on 5,000 simulations.
| Methods | βX=0 | βX=0.5 | βX=1 | βX=2 |
|---|---|---|---|---|
| SKAT-O_perm | 0.0514 | 0.0506 | 0.0582 | 0.0522 |
| MiST_perm | 0.0554 | 0.052 | 0.0538 | 0.0552 |
| CCS_perm | 0.0494 | 0.0454 | 0.0514 | 0.0472 |
| CCS_asy | 0.048 | 0.0494 | 0.0496 | 0.049 |
Power
Within each genotype generating method, we considered four patterns of causal variant effects as aforementioned. Under patterns 1 and 2, the causal variants all had positive effects, and the effect sizes were constant in pattern 1 and varied in pattern 2. These two patterns were in favor of the burden test. Under patterns 3 and 4, some causal variants had positive and some had negative effects, which were constant in pattern 3 and varied in pattern 4. Therefore, the average effect was zero in pattern 3. Patterns 3 and 4 are in favor of SKAT.
First, when the genotype data were generated from Beta (1, a2 ) as in model (2), with two sample sizes (500 and 1000 cases and controls), the performance of MiST and SKAT-O was comparable, with SKAT-O having slightly higher power than MiST (Figures 1 and 2). Compared with these two tests, the CCS tests always had the highest power across the four patterns and two a2 values. The power advantage of the CCS tests was particularly prominent under patterns 3 and 4. For example, under pattern 4, with a2 equal to 200, n0=n1=1000, and the number of neutral variants equal to 50 (Figure 2), the power of the SKAT-O, MiST, CCS_perm and CCS_asy tests was 0.49, 0.44, 0.74, and 0.84, respectively. The power of our CCS tests was thus about 1.5 to 1.9 times that of the other two methods. For the CCS methods, the p-value computed from the asymptotic theory always produced higher power than that based on permutation. From Figures 1 and 2 we also observe that SKAT-O_perm and SKAT-O based on asymptotic theory had almost equivalent results, and that MiST_perm always had higher power than its asymptotic version.
Figure 1.
Power for genotype model A. (de novo rare variants). Disease prevalence=5%. n0=n1=500. Four causal variant effect patterns: Pattern 1, OR=(2,2,2,2,2,2,2,2); Pattern 2, OR= (2,2,1.8,1.8,1.4,1.4,1.2,1.2); Pattern 3, OR= (2,0.5,2,0.5,2,0.5,2,0.5); and Pattern 4, OR=( 3,3, 2,2,2, 0.5,0.5, 0.5).
Figure 2.
Power for genotype model A. (de novo rare variants). Disease prevalence=5%. n0=n1=1000. Four causal variant effect patterns: Pattern 1, OR=(2,2,2,2,2,2,2,2); Pattern 2, OR= (2,2,1.8,1.8,1.4,1.4,1.2,1.2); Pattern 3, OR= (2,0.5,2,0.5,2,0.5,2,0.5); and Pattern 4, OR=( 3,3, 2,2,2, 0.5,0.5, 0.5).
Under genotype model B and C, we skipped CCS based on asymptotic theory because of the inflated type-I error rate due to violation of the model assumption. We only compared the performance of the SKAT-O, MiST and CCS tests using permutation in those two models.
For genotype model B, when the sample size was small (200 cases and 200 controls; top row of Figure 3), CCS_perm always had higher power than SKAT-O_perm and MiST_perm. For example, in pattern 4, CCS_perm had a 28% power improvement compared with SKAT-O_perm. When the sample size was relatively large (500 cases and 500 controls), the power of all three methods became higher in all patterns (bottom row, Figure 3). However, CCS_perm still had higher power than SKAT-O_perm and MiST_perm, especially when the number of neutral variants was large (50).
Figure 3.
Power for Basu and Pan genotype model B. Disease prevalence=5%. Four causal variant effect patterns: Pattern 1, OR=(2,2,2,2,2,2,2,2); Pattern 2, OR= (2,2,1.8,1.8,1.4,1.4,1.2,1.2); Pattern 3, OR= (2,0.5,2,0.5,2,0.5,2,0.5); and Pattern 4, OR=( 3,3, 2,2,2, 0.5,0.5, 0.5).
Under coalescent genotype model C, we considered two sample sizes: 500 and 1000 cases and controls (Figure 4). The SKAT-O and MiST methods had almost identical power in all scenarios, while CCS was more powerful except when the sample size was 1000 under pattern 3 with 8 neutral variants. The advantage of CCS over the other two methods was more evident in patterns 1 and 2 than in patterns 3 and 4. For example, when n0=1000, under pattern 1 with 50 neutral variants, the power of SKAT-O, MiST and CCS was 0.18, 0.17, and 0.22, respectively, a relative power gain for CCS of about 22% over SKAT-O and 29% over MiST.
Figure 4.
Power for coalescent genotype model C. Disease prevalence=5%. Four causal variant effect patterns: Pattern 1, OR=(2,2,2,2,2,2,2,2); Pattern 2, OR= (2,2,1.8,1.8,1.4,1.4,1.2,1.2); Pattern 3, OR= (2,0.5,2,0.5,2,0.5,2,0.5); and Pattern 4, OR=( 3,3, 2,2,2, 0.5,0.5, 0.5). Based on 5000 simulations
In the last scenario, we generated data using model A with one confounding covariate, which was ignored in the association tests. In this study, CCS_asy tends to have slightly inflated type I error as βX increases; however, CCS_perm always maintains the nominal type I error, as the SKAT-O and MiST methods do (Table 4). Figure 5 presents the power comparison among the three methods. CCS_asy and CCS_perm always have higher power than their competitors. This shows the robustness of our proposed CCS method.
Table 5.
P-values for SKAT-O, MiST and CCS_perm tests for testing associations between dichotomized triglyceride level and three genes in the Dallas Heart Study. Triglyceride level was dichotomized by the 1st and 3rd quartiles.
| Gene | nvar | SKAT-O | MiST | CCS_perm |
|---|---|---|---|---|
| ANGPTL3 | 35 | 0.037 | 0.048 | 0.028 |
| ANGPTL4 | 30 | 0.019 | 0.024 | 0.083 |
| ANGPTL5 | 25 | 0.170 | 0.234 | 0.170 |
Figure 5.
Power for genotype model A with confounding covariate X. Disease prevalence=5%. n0=n1=500. 5000 simulations. Number of neutral variants is 8. Two causal variant effect patterns: Pattern 2, OR= (2,2,1.8,1.8,1.4,1.4,1.2,1.2); and Pattern 4, OR=( 3,3, 2,2,2, 0.5,0.5, 0.5).
Application to Dallas Heart Study Data
We applied CCS_perm, SKAT-O, and MiST to data from the Dallas Heart Study [Pritchard and Cox 2002]. Data are available for three candidate genes, ANGPTL3 (MIM 604774), ANGPTL4 (MIM 605910), and ANGPTL5 (MIM 607666), in 3,476 individuals. We analyzed 35, 30 and 25 variants with MAF<0.05 in the three genes, respectively. We evaluated the association between the dichotomized triglyceride level and each candidate gene. The triglyceride level was dichotomized based on the highest and lowest quartiles of each of the six sex-ethnicity groups [Wu, et al. 2011], giving 867 cases and 873 controls. The p-values are reported in Table 5. All three methods detected a significant association with gene ANGPTL3. The p-value based on CCS_perm is close to those based on SKAT-O and MiST, which supports the validity of our CCS test.
Discussion
In this paper, we proposed a unified score test for assessing association with rare variants under the case-control study design. Our test is novel in two aspects: 1) it integrates the population genetic model with the popular hierarchical model to increase power for testing rare variants as a set, so that it has high power when the MAFs of the tested variants are very low and the sample size is small; 2) it is derived from the retrospective likelihood, which appropriately accounts for the case-control sampling. Similar to other existing unified rare variant association tests, the proposed method jointly tests both the average effect and the variance component of individual variant effects. Our test outperformed the competing methods in all the simulation scenarios we considered when the genotype data arose from the Beta distribution. The power gain was even more pronounced when the sample size was small, the tested variants were very rare and the number of neutral variants was large. When the genotype data arose from a distribution that deviated from the Beta model, such as the more realistic coalescent model and the Basu and Pan genotype model, the advantage of our test was also confirmed.
Unlike most currently available set-based rare variant association tests, which were designed assuming the prospective cohort study design, our method appropriately recognizes the case-control design and derives its statistics through the retrospective likelihood. In our model, the genotype data are modeled directly in the case and control groups and treated as random variables. Therefore, we were able to build the population genetic model into the association between rare variants and case/control status. This turns out to produce a more robust and efficient score test for rare variants in the case-control study.
We derived an analytic asymptotic distribution for the proposed test statistic, and showed in simulation studies that it performed well for p-value calculation when the genotype data arose from the assumed model. However, the asymptotic distribution leads to inflated type I error rates when the model assumption for the genotype data is invalid. But the p-value obtained based on permutation appeared to be robust to all of our model assumptions on the genotype data while retaining the power advantage. Therefore, we suggest that p-value be calculated using the permutation method for the proposed method in practice.
Both SKAT-O and MiST can reduce to the burden test or the SKAT test when certain regression parameters are taken to be zero [Sun, Zheng and Hsu 2013]. The first component of our score statistic (4), for the common group effect, is actually the same as the burden statistic before standardization (see Appendix), while the second component (5), for the variance component, is different from SKAT. The main factor affecting the relative performance of SKAT-O and CCS is the MAF. The core part of SKAT is the sum of the K squared differences between the MAF estimates in cases and controls for each variant. When the study variants are very rare and the sample size is small, the variability of the MAF estimate for each individual variant is large, which results in large variability of the SKAT statistic and leads to power loss. In contrast, our score function for the variance component is the distance between the MAF estimate for each variant and the MAF estimate in the cases averaged across variants. The overall MAF estimate is derived from the assumption that all the variants follow the same population genetic model. As such, we are able to borrow strength across all the variants, which results in smaller variance and increased power in testing the overall effect. A similar technique has been well studied in the small area estimation field (where small means that the sample size or number of events is small) [Rao 2003]. The CCS test statistic is equivalent to the variance of the MAF estimates. As K increases, this variance is estimated more accurately. In contrast, the SKAT test statistic is a summation of K squared terms, so its variance goes up as K increases. This explains why the edge of CCS over SKAT-O is more prominent with a large number of neutral variants.
Our method also has several limitations. First, our model implicitly assumes independence between gene and environment, an assumption that has been used to increase power for testing gene-environment interaction effects. When this assumption is violated, bias and inflation in type I error rates can be severe [Chen and Chatterjee, 2007; Chen et al., 2012; Mukherjee et al., 2012]. Our proposed method can potentially suffer from the same limitation. However, because of the rareness of the mutations, we felt that concerns about confounding would primarily be due to population stratification, not to non-genetic risk factors. When population substrata can be well characterized by a small number of population subgroups, our method can be adapted in a straightforward manner by using separate prior genetic distributions for each distinct covariate value. Secondly, our method cannot adjust for confounding covariates. When covariates do not confound the association, it has been recognized that logistic regression analysis without adjusting for covariates is valid and has higher power than analysis adjusting for covariates [Pirinen et al., 2012]. Along the same line of argument, our proposed method is valid. When covariates confound the association, for example, when covariates include population stratification variables, these covariates must be adjusted for. Our method in general will not be applicable, although it can be generalized when covariates are discrete and have a small number of distinct values (we will then use separate prior genetic distributions corresponding to each distinct covariate value). In a simplified simulation study, we have shown that our proposed CCS method is robust to ignoring a minor confounding covariate.
Thirdly, since we recommend the permutation-based CCS method to avoid possible inflation in type I error rates when p-values are calculated from the asymptotic theory under invalid model assumptions, the proposed method is infeasible for genome-wide association analysis due to the high computational burden. The proposed CCS method is therefore best suited for gene-based, pathway-based or small-scale exome sequencing studies. Fourthly, our method performs better than SKAT-O only in rare variant analysis (i.e., MAF<0.01). When the MAF is large, SKAT-O outperforms CCS, because with a reasonable sample size (more than 500 cases and controls) the MAF estimate for each individual variant in the SKAT-O method is accurate enough. In such scenarios, SKAT does not impose any distribution on the MAF across the variants and is most efficient, whereas CCS models the distribution of the MAF using a Beta prior and will likely deviate from the underlying true model. Thus, when the proportion of variants with large MAF is high, SKAT will perform better than CCS.
Table 3.
Type I Error rates at nominal significance level α=0.05 under genotype model B) Basu and Pan model and model C) coalescent model. p=5%. Based on 5000 simulations.
| Methods | Model B: n0=n1 | Model B: 8 NV | Model B: 50 NV | Model C: n0=n1 | Model C: 8 NV | Model C: 50 NV |
|---|---|---|---|---|---|---|
| SKAT-O_perm | 200 | 0.054 | 0.052 | 500 | 0.058 | 0.041 |
| SKAT-O_perm | 500 | 0.055 | 0.045 | 1000 | 0.040 | 0.040 |
| MiST_perm | 200 | 0.046 | 0.045 | 500 | 0.053 | 0.036 |
| MiST_perm | 500 | 0.053 | 0.045 | 1000 | 0.045 | 0.041 |
| CCS_perm | 200 | 0.052 | 0.050 | 500 | 0.044 | 0.046 |
| CCS_perm | 500 | 0.051 | 0.048 | 1000 | 0.047 | 0.049 |
NV: number of neutral variants.
Acknowledgments
The authors would like to thank the reviewers and editor for their constructive suggestions. The authors also thank Dr. Iuliana Ionita-Laza for sharing the Dallas Heart Study with us. The work is funded in part by NIH U01 CA170948, R21-ES020811 and R01-ES016626.
Appendix
A1. Derivation of the joint distribution of Gi
A2.
A3. The calculation of var(Ḡj. )
A4. Score statistics for H0 : β =(β1,..., βK )′ =0
When the sample size n is small relative to the number of variants K, it is impossible to test H0: β = (β1, ..., βK)′ = 0 directly. Following Goeman et al. (2006), we assume that β1, ..., βK follow a common distribution with expectation zero and variance σ2. Therefore, testing H0: β = 0 is equivalent to testing H̃0: σ2 = 0 against H̃1: σ2 > 0. Write β = σb, where b does not depend on σ2, with E(b) = 0 and E(bb′) = I. The retrospective log-likelihood can be represented in terms of σ2 as
By applying Lemma 3 of Goeman et al. (2006), the score statistic is given by
where
When γ is not equal to zero, it needs to be replaced by γ̂, which is obtained by solving
From (5) we have
Since Gj1, ..., GjK are interchangeable, we have
(8)
Thus
- A4. Independence of Sγ and Tk
Thus Sγ and Tk are independent.
- A5. The covariance matrix of T
- Using the cases only in the second term of T
- A6. Other methods
Under the logistic regression model Logit . Define which is a K by 1 vector with the kth element . where and . For the case-control study, cov(U) simplifies to . The burden test statistic for H0: β = (β1, ..., βK)′ = 0 is given by
(A6.1)
Similarly, under the case-control setting, the SKAT test statistic for H0: β = (β1, ..., βK)′ = 0 is given by TSKAT = (Y − Ȳ)′GWG′(Y − Ȳ), where W = diag(w1, ..., wK) and , MAFj is the sample MAF for the jth variant in the data combining both cases and controls. After simplification, we have
(A6.2)
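The quadratic form TSKAT = (Y − Ȳ)′GWG′(Y − Ȳ) can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' code: the weight choice (squared Beta(1, 25) density at the sample MAF, the default described by Wu et al., 2011) is an assumption, the unstandardized burden score is a hypothetical simplified form rather than (A6.1), and the function names and simulated data are illustrative.

```python
import numpy as np

def skat_statistic(y, G, weights):
    """T_SKAT = (Y - Ybar)' G W G' (Y - Ybar), with W = diag(weights)."""
    resid = y - y.mean()
    score = G.T @ resid                 # per-variant scores, length K
    return float(np.sum(weights * score ** 2))

def burden_score(y, G, weights):
    """Hypothetical unstandardized burden score: (Y - Ybar)' G w."""
    resid = y - y.mean()
    return float(resid @ (G @ weights))

# Illustrative data: n0 = n1 = 200, K = 8 rare variants (assumed values).
rng = np.random.default_rng(0)
n0 = n1 = 200
K = 8
y = np.repeat([0.0, 1.0], [n0, n1])
G = rng.binomial(2, 0.01, size=(n0 + n1, K)).astype(float)

# Sample MAF from cases and controls combined.
maf = G.sum(axis=0) / (2 * G.shape[0])
# Beta(1, 25) density is 25 * (1 - x)^24; squaring it is the assumed weight w_j.
w = (25.0 * (1.0 - maf) ** 24) ** 2

t_skat = skat_statistic(y, G, w)
t_burden = burden_score(y, G, w)

# Cross-check against the explicit quadratic form.
resid = y - y.mean()
t_quad = float(resid @ G @ np.diag(w) @ G.T @ resid)
print(t_skat, t_quad, t_burden)
```

The contrast between the two functions mirrors the two classes of tests: the burden score sums the weighted per-variant scores before any squaring, so effects in opposite directions can cancel, whereas the SKAT statistic squares each score before summing.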
Footnotes
The authors have no conflict of interest to declare.
References
- Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
- Basu S, Pan W. Comparison of Statistical Tests for Disease Association with Rare Variants. Genet Epidemiol. 2011;35(7):606–619. doi: 10.1002/gepi.20609.
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92(2):399–418.
- Chen J, Chatterjee N. Exploiting Hardy-Weinberg Equilibrium for efficient screening of single SNP associations from case-control studies. Human Heredity. 2007;63:196–204. doi: 10.1159/000099996.
- Chen J, Kang G, Vanderweele T, Zhang C, Mukherjee B. Efficient designs of gene-environment interaction studies: implications of Hardy-Weinberg equilibrium and gene-environment independence. Stat Med. 2012 Sep 28;31(22):2516–30. doi: 10.1002/sim.4460.
- Davies R. Algorithm AS 155: The distribution of a linear combination of χ2 random variables. JRSSC. 1980;29:323–333.
- Gibson G. Rare and common variants: Twenty arguments. Nature Reviews Genetics. 2012;13:135–145. doi: 10.1038/nrg3118.
- Goeman JJ, van de Geer S, van Houwelingen HC. Testing against a high dimensional alternative. J Royal Stat Soc B. 2006;68:477–493. doi: 10.1111/j.1467-9868.2006.00551.x.
- Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336:740–743. doi: 10.1126/science.1217283.
- Lee S, et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
- Lee S, Abecasis GR, Boehnke M, Lin X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. Am J Hum Genet. 2014 Jul 3;95(1):5–23. doi: 10.1016/j.ajhg.2014.06.009.
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
- Liang L, Zoellner S, Abecasis GR. GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics. 2007;23:1565–1567. doi: 10.1093/bioinformatics/btm138.
- Lin X. Variance component testing in generalized linear models with random effects. Biometrika. 1997;84:309–326.
- Liu D, Lin X, Ghosh D. Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–88. doi: 10.1111/j.1541-0420.2007.00799.x.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, MacKay TFC, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
- Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1):28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
- Mukherjee B, Ahn J, Gruber SB, Chatterjee N. Testing gene-environment interaction in large-scale case-control association studies: possible choices and comparisons. Am J Epidemiol. 2012 Feb 1;175(3):177–90. doi: 10.1093/aje/kwr367.
- Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322.
- Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009;33(6):497–507. doi: 10.1002/gepi.20402.
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med. 1994;13(2):153–162. doi: 10.1002/sim.4780130206.
- Pirinen M, Donnelly P, Spencer CC. Including known covariates can reduce power to detect genetic effects in case-control studies. Nature Genetics. 2012;44(8):848–851. doi: 10.1038/ng.2346.
- Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant…or not? Human Molecular Genetics. 2002;11:2417–23. doi: 10.1093/hmg/11.20.2417.
- Price A, Kryukov G, de Bakker P, Purcell S, Staples J, Wei L, Sunyaev S. Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
- Rao JNK. Small area estimation. Wiley; 2003.
- Sun J, Zheng Y, Hsu L. A unified mixed-effects model for rare-variant association in sequencing studies. Genetic Epidemiology. 2013;37(4):334–344. doi: 10.1002/gepi.21717.
- Tennessen JA, Bigham AW, O’Connor TD, Fu WQ, Kenny EE, Gravel S, McGee S, Do R, Liu XM, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM, Broad GO, Seattle GO, NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
- Wang YJ, Chen YS, Yong Q. Joint rare variant association test of the average and individual effects for sequencing studies. PLoS One. 2012:e32485. doi: 10.1371/journal.pone.0032485.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, et al. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am J Hum Genet. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.





