A Rare Variant Association Test Based on Combinations of Single-Variant Tests

Qiuying Sha; Shuanglin Zhang

doi:10.1002/gepi.21834

Author manuscript; available in PMC: 2015 Sep 1.

Published in final edited form as: Genet Epidemiol. 2014 Jul 25;38(6):494–501. doi: 10.1002/gepi.21834

PMCID: PMC4127117 NIHMSID: NIHMS615101 PMID: 25065727

Abstract

Next-generation sequencing technologies make it possible to test rare variant associations directly. However, the development of powerful statistical methods for rare variant association studies is still underway. Most existing methods are burden or quadratic tests. Recent studies show that the performance of each burden or quadratic test depends strongly on its underlying assumptions and that no test demonstrates consistently acceptable power. Thus, combined tests that combine information from burden and quadratic tests have been proposed recently. However, results from recent studies (including this study) show that there exist tests that can outperform both burden and quadratic tests. In this article, we propose three classes of tests that include tests outperforming both burden and quadratic tests. Then, we propose the optimal combination of single-variant tests (OCST) by combining information from tests of the three classes. We use extensive simulation studies to compare the performance of OCST with that of burden, quadratic, and optimal single-variant tests. Our results show that OCST either is the most powerful test or has power similar to that of the most powerful test. We also compare the performance of OCST with that of the two existing combined tests. Our results show that OCST has better power than the two combined tests.

Keywords: rare variant, association study, next generation sequencing

Introduction

Recent studies show that complex diseases are caused by both common and rare variants [Pritchard, 2001; Pritchard and Cox, 2002; Walsh and King, 2007; Stratton and Rahman, 2008; Bodmer and Bonilla, 2008; Ng et al., 2009; Teer and Mullikin, 2010]. To detect disease-associated common variants, indirect mapping methods based on tagging SNPs can be used. However, to detect disease-associated rare variants, direct association mapping methods, in which all variants must be identified, should be used because rare variants are essentially independent of other variants. Next-generation sequencing technology allows sequencing of the whole genome of large groups of individuals, and thus makes direct association mapping feasible [Andrés et al., 2007; Metzker, 2010].

Statistical methods for common variant association studies have been well developed. However, the variant by variant methods for common variant association studies may not be optimal for rare variant association studies due to allelic heterogeneity as well as the extreme rarity of individual variants [Li and Leal, 2008]. Recently, statistical methods for rare variant association studies by summarizing genotype information from multiple variants have been developed. These methods can be roughly divided into three groups: burden tests, quadratic tests, and combined tests.

Burden tests include the cohort allelic sums test (CAST) [Morgenthaler and Thilly, 2007], the combined multivariate and collapsing (CMC) method [Li and Leal, 2008], the weighted sum (WS) method [Madsen and Browning, 2009], the variable minor allele frequency threshold (VT) method [Price et al., 2010], and the cumulative minor-allele test (CMAT) [Zawistowski et al., 2010], among others. Burden tests collapse rare variants in a genomic region into a single burden variable and then regress the phenotype on the burden variable to test for the cumulative effects of rare variants in the region [Lee et al., 2012]. Let $x_{im}$ denote the genotype (number of minor alleles) of the i^th individual at the m^th variant. As shown by Sha et al. [2012], the burden variables of the aforementioned methods are all weighted combinations of variants, $\sum_{m} w_{m} x_{im}$, or functions of such a combination, with different ways to model the weights $w_{m}$. Let $s_{m}$ denote the score test statistic from a linear model or a logistic model for the m^th variant. Linear test statistics of the form $\sum_{m} W_{m} s_{m}$ are also based on the burden variable $\sum_{m} w_{m} x_{im}$. Thus, in terms of how genotypes are collapsed, burden tests and linear tests are equivalent, and burden tests are also called linear tests [Derkach et al., 2012].

Quadratic tests, with test statistics of the form $\sum_{m} W_{m} s_{m}^{2}$, include the C-alpha test [Neale et al., 2011], the sequence kernel association test (SKAT) [Wu et al., 2011], and the test for the effects of the optimally weighted combination of variants (TOW) [Sha et al., 2012]. Recently developed adaptive weighting methods for rare variant association studies [Han and Pan, 2010; Hoffmann et al., 2010; Lin and Tang, 2011; Yi and Zhi, 2011; Sha et al., 2013], as pointed out by Derkach et al. [2012], are operationally similar to quadratic tests. Combined tests include the test using Fisher’s method to combine information from the linear and quadratic statistics (Fisher-CT) [Derkach et al., 2012] and the optimal linear combination of the burden test and SKAT (SKAT-O) [Lee et al., 2012].

Burden tests and quadratic tests perform quite differently. Burden tests or linear tests implicitly assume that all the rare variants are causal and that the directions of their effects are all the same. If these assumptions are true, burden tests can outperform quadratic tests; otherwise, burden tests can perform poorly and quadratic tests can outperform burden tests [Wu et al., 2011; Lee et al., 2012; Sha et al., 2012; Derkach et al., 2012]. Ladouceur et al. [2012] showed that the performance of each burden or quadratic test depends strongly on its underlying assumptions and that no test demonstrates consistently acceptable power despite the large sample size. To increase robustness, both SKAT-O and Fisher-CT combine a burden test and a quadratic test, aiming to have the advantages of both. However, burden and quadratic tests cannot cover all situations. Kinnamon et al. [2012] demonstrated that the single-variant test with statistic $\max_{m} s_{m}^{2}$ can outperform both burden and quadratic tests when there are a large number of neutral variants and a small number of causal variants. Results of this study show that the tests with statistics $\sum_{m} |s_{m}|^{p}$ $(p \geq 4)$, $\sum_{m} (s_{m} I_{\{s_{m} \geq 0\}})^{2}$, or $\sum_{m} (-s_{m} I_{\{s_{m} \leq 0\}})^{2}$ can outperform both burden and quadratic tests in some situations.

In this article, through the optimal combination of single-variant tests under different criteria, we first obtain three classes of tests that are well beyond burden and quadratic tests. Then, we propose the optimal combination of single-variant tests (OCST) by combining information from tests of the three classes. Using extensive simulation studies, we compare the performance of OCST with that of the burden, quadratic, and the optimal single-variant tests. Our results show that, in a wide range of scenarios, OCST either is the most powerful test or has similar power with the most powerful test. We also compare power of OCST with that of the two existing combined tests: Fisher-CT and SKAT-O. We are able to demonstrate that OCST has better power than both Fisher-CT and SKAT-O.

Method

Consider a sample of n individuals. Each individual has been genotyped at M variants in a genomic region (a gene or a pathway). Denote y_i as the trait value of the i^th individual for either a quantitative trait or a qualitative trait (1 for cases and 0 for controls for a qualitative trait) and denote x_im as the genotypic score of the i^th individual at the m^th variant, where x_im∈{0,1,2} is the number of minor alleles. If there are no covariates, we use the generalized linear model [Nelder and Wedderburn, 1972]

$$g(E(y_{i} \mid x_{im})) = \beta_{0} + \beta_{1} x_{im}$$

to model the relationship between trait values and genotypes at the m^th variant, where g() is a monotone “link” function. Under the generalized linear model, the score test statistic to test the null hypothesis H₀:β₁ = 0 is given by [Sha et al., 2011]

$$s_{m} = \frac{U_{m}}{\sqrt{V_{m}}}, \tag{1}$$

where $U_{m} = \sum_{i=1}^{n} (y_{i} - \bar{y})(x_{im} - \bar{x}_{m})$ and $V_{m} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \bar{y})^{2} \sum_{i=1}^{n} (x_{im} - \bar{x}_{m})^{2}$. The statistic s_m asymptotically follows the standard normal distribution. If there are covariates, we use the method proposed by Sha et al. [2012] to adjust for the effect of the covariates. Let (z_i1,…,z_ip)^T denote the covariates of the i^th individual. We adjust both the trait value y_i and the genotypic score x_im for the covariates by applying linear regressions. That is,

$$y_{i} = \alpha_{0} + \alpha_{1} z_{i1} + \dots + \alpha_{p} z_{ip} + \varepsilon_{i} \quad \text{and} \quad x_{im} = \alpha_{0m} + \alpha_{1m} z_{i1} + \dots + \alpha_{pm} z_{ip} + \tau_{im}. \tag{2}$$

Let ỹ_i and x̃_im denote the residuals of y_i and x_im, respectively. With covariates, we replace y_i and x_im by ỹ_i and x̃_im in s_m.
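As an illustration, the score statistics of equation (1), with the optional residual adjustment of equation (2), might be computed as in the following minimal sketch. The function names `residualize` and `score_stats` are hypothetical, and returning 0 for a variant with no variation (V_m = 0) is an assumption made here, not something the text specifies.

```python
import numpy as np

def residualize(v, Z):
    """Residuals of v (a vector, or the columns of a matrix) after
    least-squares regression on the covariates Z plus an intercept."""
    D = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(D, v, rcond=None)
    return v - D @ coef

def score_stats(y, X, Z=None):
    """Single-variant score statistics s_m = U_m / sqrt(V_m), equation (1).

    y: trait values, shape (n,); X: genotypes, shape (n, M);
    Z: optional covariates, shape (n, p), handled as in equation (2).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if Z is not None:
        y, X = residualize(y, Z), residualize(X, Z)
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    U = yc @ Xc
    V = (yc @ yc) * (Xc ** 2).sum(axis=0) / len(y)
    # Assumption: a monomorphic variant (V_m = 0) contributes s_m = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(V > 0, U / np.sqrt(V), 0.0)
```

Without covariates, s_m computed this way equals the square root of n times the sample correlation between the trait and the genotype at variant m, which gives a quick sanity check.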

Let $S_{m} = s_{m}^{2}$. Current quadratic tests for rare variant association studies are combinations of the S_m. The statistic of TOW [Sha et al., 2012] is $T_{TOW} = \sum_{m=1}^{M} S_{m}$ and the statistic of SKAT [Wu et al., 2011] is $T_{SKAT} = \sum_{m=1}^{M} w_{m} S_{m}$, where $w_{m} = V_{m} W_{m}$ and $W_{m}$ is the weight used by SKAT. Moreover, $\sum_{m=1}^{M} s_{m} = \sum_{i=1}^{n} (y_{i} - \bar{y}) x_{i}$, where $x_{i} = \sum_{m=1}^{M} w_{m} x_{im}$ and the weight $w_{m} = 1/\sqrt{V_{m}}$ is asymptotically equivalent to the weight used by the weighted sum (WS) method [Madsen and Browning, 2009]. Hence $\sum_{m=1}^{M} s_{m}$ is a burden test and is similar to the WS method. These observations motivate us to consider combinations of S_m and combinations of s_m.

First, we consider the optimal combinations of S₁,…,S_M under different criteria, that is, $T_{a}^{*}(p) = \max_{u_{1}, \dots, u_{M}} \sum_{m=1}^{M} u_{m} S_{m}$ under the condition $\sum_{m=1}^{M} u_{m}^{p} = 1$ for p∈(1,∞). By solving the maximization problem, we have $T_{a}^{*}(p) = \left(\sum_{m=1}^{M} S_{m}^{p/(p-1)}\right)^{(p-1)/p}$. The class of tests $\{T_{a}^{*}(p) : p \in (1, \infty)\}$ is equivalent to {T_a(z):z∈(1,∞)}, where $T_{a}(z) = \left(\sum_{m=1}^{M} S_{m}^{z}\right)^{1/z}$. We further extend {T_a(z):z∈(1,∞)} to A_a = {T_a(z):z∈[1,∞]}, where we define $T_{a}(\infty) = \lim_{z \to \infty} T_{a}(z) = \max_{1 \leq m \leq M} S_{m}$ and $T_{a}(1) = \sum_{m=1}^{M} S_{m}$. Each test in A_a can be more powerful than the other tests in A_a in some scenarios, and no test is consistently more powerful than the others (see Figures S1 and S2). Note that TOW, whose statistic $\sum_{m=1}^{M} S_{m}$ equals $T_{a}(1)$, belongs to A_a. In most cases, there is another test in A_a that is more powerful than TOW (Figures S1 and S2).
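The closed form of the maximum can be checked with Hölder's inequality. The sketch below assumes nonnegative weights $u_{m} \geq 0$, which the text does not state explicitly, and writes $q = p/(p-1)$ for the conjugate exponent of p (a symbol introduced here only for illustration):

$$\sum_{m=1}^{M} u_{m} S_{m} \leq \left(\sum_{m=1}^{M} u_{m}^{p}\right)^{1/p} \left(\sum_{m=1}^{M} S_{m}^{q}\right)^{1/q} = \left(\sum_{m=1}^{M} S_{m}^{p/(p-1)}\right)^{(p-1)/p},$$

with equality at

$$u_{m} = \frac{S_{m}^{1/(p-1)}}{\left(\sum_{j=1}^{M} S_{j}^{p/(p-1)}\right)^{1/p}}, \qquad m = 1, \dots, M.$$

These weights satisfy the constraint because $u_{m}^{p} = S_{m}^{p/(p-1)} / \sum_{j} S_{j}^{p/(p-1)}$ sums to 1. Since q ranges over (1,∞) as p does, indexing the class by z = q gives the form $\left(\sum_{m} S_{m}^{z}\right)^{1/z}$.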

All tests in A_a are robust to the directions of the effects of causal variants. From the literature [Sha et al., 2012; Wu et al., 2011], we learn that tests that are robust to the directions of the effects of causal variants are less powerful than burden tests when the directions of the effects of causal variants are all the same and there are not many neutral variants. This observation leads us to consider the optimal combination of s₁,…,s_M in addition to the class of tests A_a. To consider the optimal combination of s₁,…,s_M, we propose to use either $T_{b}^{*}(p) = \max_{w_{1}, \dots, w_{M}} \sum_{m=1}^{M} w_{m} s_{m}$ under the condition $\sum_{m=1}^{M} w_{m}^{p} = 1$ and w_m≥0 for m = 1,…,M, or $T_{c}^{*}(p) = \max_{w_{1}, \dots, w_{M}} \sum_{m=1}^{M} w_{m} s_{m}$ under the condition $\sum_{m=1}^{M} (-w_{m})^{p} = 1$ and w_m≤0 for m = 1,…,M. Using the same argument as for A_a, $T_{b}^{*}(p)$ leads to the class of tests A_b={T_b(z):z∈[1,∞]}, where $T_{b}(z) = \left(\sum_{m=1}^{M} (s_{m} I_{\{s_{m} \geq 0\}})^{z}\right)^{1/z}$, and $T_{c}^{*}(p)$ leads to the class of tests A_c={T_c(z):z∈[1,∞]}, where $T_{c}(z) = \left(\sum_{m=1}^{M} (-s_{m} I_{\{s_{m} \leq 0\}})^{z}\right)^{1/z}$.
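The three statistics can be evaluated directly from the vector of single-variant statistics. The following is a minimal sketch; the helper names `power_norm` and `class_stats` are hypothetical, and z = ∞ is handled as the maximum, following the limit definition of T_a(∞).

```python
import numpy as np

def power_norm(v, z):
    """(sum_m v_m^z)^(1/z) for nonnegative v; z = inf gives max_m v_m."""
    v = np.asarray(v, dtype=float)
    if np.isinf(z):
        return v.max()
    return (v ** z).sum() ** (1.0 / z)

def class_stats(s, z):
    """Return (T_a(z), T_b(z), T_c(z)) for score statistics s_1, ..., s_M."""
    s = np.asarray(s, dtype=float)
    T_a = power_norm(s ** 2, z)              # combines S_m = s_m^2
    T_b = power_norm(np.maximum(s, 0.0), z)  # s_m * I{s_m >= 0}
    T_c = power_norm(np.maximum(-s, 0.0), z) # -s_m * I{s_m <= 0}
    return T_a, T_b, T_c

# Example with three variants, one of which has a negative score.
print(class_stats([1.0, -2.0, 3.0], 1))       # (14.0, 4.0, 2.0)
print(class_stats([1.0, -2.0, 3.0], np.inf))  # (9.0, 3.0, 2.0)
```

Only the positive scores enter T_b and only the negative scores enter T_c, so the two classes respond to risk-only and protective-only signals, respectively.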

Each of the three test classes A_a, A_b, and A_c has a scenario in which it performs best. A_a is favored when both risk and protective variants are present. A_b is favored when all causal variants are risk variants, while A_c is favored when all causal variants are protective variants (see Figures S3 and S4). Let P_a(z), P_b(z), and P_c(z) denote the p-values of T_a(z), T_b(z), and T_c(z), respectively. Our proposed Optimal Combination of Single-variant Tests (OCST) is defined as

$$T_{OCST} = \min_{z \in [1, \infty]} (P_{a}(z), P_{b}(z), P_{c}(z)).$$

T_OCST can be obtained by a simple grid search across a range of z. For a given grid 1≤z₁<…<z_K≤∞, the test statistic is $T_{OCST} = \min_{1 \leq k \leq K} (P_{a}(z_{k}), P_{b}(z_{k}), P_{c}(z_{k}))$.

We use a permutation test to evaluate the p-value of OCST. In each permutation, we randomly shuffle the trait values. Suppose that we perform B permutations. Let $s_{m}^{(b)}$ denote the value of s_m based on the b^th permuted data set, where b=0 represents the original data. Based on $s_{m}^{(b)}$ $(b = 0, 1, \dots, B)$, we can calculate $T_{s}^{(b)}(z_{k})$ for s = a, b, or c. Then, we transform $T_{s}^{(b)}(z_{k})$ to $P_{s}^{(b)}(z_{k})$ by

$$P_{s}^{(b)}(z_{k}) = \frac{1}{B + 1} \sum_{d=0}^{B} I\left(T_{s}^{(d)}(z_{k}) \geq T_{s}^{(b)}(z_{k})\right), \quad \text{for } s = a, b, \text{ and } c.$$

Let $P^{(b)} = \min_{1 \leq k \leq K} (P_{a}^{(b)} (z_{k}), P_{b}^{(b)} (z_{k}), P_{c}^{(b)} (z_{k}))$ . Then, the p-value of OCST is given by

$$\frac{1}{B} \sum_{b=1}^{B} I\left(P^{(b)} < P^{(0)}\right).$$
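The permutation procedure above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function names, the grid (1, 2, 4, ∞), and the default B = 999 are hypothetical choices, there are no covariates, and every variant is assumed to be polymorphic so that V_m > 0.

```python
import numpy as np

def score_stats(y, X):
    """s_m = U_m / sqrt(V_m) for each column of X (no covariates)."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    V = (yc @ yc) * (Xc ** 2).sum(axis=0) / len(y)
    return (yc @ Xc) / np.sqrt(V)

def class_stats(s, grid):
    """T_a, T_b, T_c at every z in grid, as one flat vector of length 3K."""
    parts = (s ** 2, np.maximum(s, 0.0), np.maximum(-s, 0.0))
    out = []
    for v in parts:
        for z in grid:
            out.append(v.max() if np.isinf(z) else (v ** z).sum() ** (1.0 / z))
    return np.array(out)

def perm_pvalues(T):
    """P[b, j] = fraction of rows d (including b) with T[d, j] >= T[b, j]."""
    n = T.shape[0]
    P = np.empty(T.shape, dtype=float)
    for j in range(T.shape[1]):
        col = np.sort(T[:, j])
        P[:, j] = (n - np.searchsorted(col, T[:, j], side="left")) / n
    return P

def ocst_pvalue(y, X, grid=(1, 2, 4, np.inf), B=999, seed=0):
    """Permutation p-value of OCST; row 0 of T holds the original data."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    T = np.empty((B + 1, 3 * len(grid)))
    T[0] = class_stats(score_stats(y, X), grid)
    for b in range(1, B + 1):
        T[b] = class_stats(score_stats(rng.permutation(y), X), grid)
    P_min = perm_pvalues(T).min(axis=1)   # P^(b), b = 0, ..., B
    # Strict inequality, as in the formula above.
    return np.mean(P_min[1:] < P_min[0])
```

The same B permutations serve both to convert each T_s(z_k) into a p-value and to calibrate the minimum p-value, so no second layer of permutations is needed.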

For a simulation study with R replicates, the above procedure is computationally expensive. In our simulation studies, we use the pooling permutation method proposed by Guo and Lin [2009] to evaluate p-values. In the pooling permutation method, permuted samples from all the replicates are pooled together to form a joint sample from the null distribution. Suppose that we have R replicates and we perform B permutations for each replicate. Let $T_{s}^{(b, r)}(z_{k})$ denote the value of T_s(z_k) based on the b^th permuted data set in the r^th replicate for s=a, b, or c, where b=0 represents the original data. Then, we transform $T_{s}^{(b, r)}(z_{k})$ to the corresponding p-value $P_{s}^{(b, r)}(z_{k})$ by

$$P_{s}^{(b, r)}(z_{k}) = \frac{1}{(B + 1) R} \sum_{l=1}^{R} \sum_{d=0}^{B} I\left(T_{s}^{(d, l)}(z_{k}) \geq T_{s}^{(b, r)}(z_{k})\right), \quad \text{for } s = a, b, \text{ and } c.$$

Let $P^{(b, r)} = \min_{1 \leq k \leq K} (P_{a}^{(b, r)} (z_{k}), P_{b}^{(b, r)} (z_{k}), P_{c}^{(b, r)} (z_{k}))$ . Then, the p-value of OCST in the r^th replicate is given by

$$\frac{1}{B R} \sum_{l=1}^{R} \sum_{b=1}^{B} I\left(P^{(b, l)} < P^{(0, r)}\right).$$

Since the permuted samples are pooled across all replicates to form a sample from the null distribution, B can be much smaller than when only one sample is analyzed.
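Given the statistics for all replicates and permutations, the pooled p-values can be computed in a few vectorized steps. The sketch below follows the two pooling formulas as written; the function name `pooled_ocst_pvalues` and the array layout (R, B + 1, J), with J the number of (class, z_k) combinations, are assumptions made for illustration.

```python
import numpy as np

def pooled_ocst_pvalues(T):
    """Pooled-permutation p-values of OCST, one per replicate.

    T has shape (R, B + 1, J); T[r, 0] holds the statistics of the
    original data in replicate r and T[r, b], b >= 1, those of the
    b-th permuted data set. Column j is one (class, z_k) combination.
    """
    T = np.asarray(T, dtype=float)
    R, B1, J = T.shape
    flat = T.reshape(R * B1, J)
    P = np.empty(flat.shape, dtype=float)
    for j in range(J):
        col = np.sort(flat[:, j])
        # Fraction of all (B + 1) R pooled values that are >= each value.
        P[:, j] = (R * B1 - np.searchsorted(col, flat[:, j], side="left")) / (R * B1)
    P_min = P.min(axis=1).reshape(R, B1)      # P^(b, r)
    null = np.sort(P_min[:, 1:].ravel())      # the B * R permuted minima
    # Count of pooled null minima strictly below each observed minimum.
    return np.searchsorted(null, P_min[:, 0], side="left") / null.size
```

Because the null sample has B × R values, even B = 20 with 1,000 replicates gives 20,000 null draws per replicate.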

Comparison of Tests

We compare the performance of the proposed test with that of (1) the weighted sum (WS) method [Madsen and Browning, 2009], (2) the sequence kernel association test (SKAT) [Wu et al., 2011], (3) $T_{a}(\infty) = \max_{1 \leq m \leq M} S_{m}$, which we call the maximum single-variant test (MAXST), and (4) $T_{a}(1) = \sum_{m=1}^{M} s_{m}^{2}$, which is the same as TOW [Sha et al., 2012]. The rank sum test used by WS is replaced with the score test based on the residuals ỹ_i and x̃_im. We also compare the performance of the proposed method with two combined tests: Fisher-CT and SKAT-O [Derkach et al., 2012; Lee et al., 2012].

Simulation

The empirical Mini-Exome genotype data provided by Genetic Analysis Workshop 17 (GAW17) are used for the simulation studies. This dataset contains genotypes of 697 unrelated individuals on 3205 genes. We choose six genes: AHNAK (gene1), AKAP13 (gene2), COL6A3 (gene3), FREM2 (gene4), MDN1 (gene5), and TG (gene6) with 231, 163, 187, 143, 187, and 146 variants, respectively. We merge the six genes to form a super gene (Sgene) with 1057 variants. We use Sgene because the distribution of the minor allele frequencies (MAFs) of the 1057 variants in Sgene is very similar to that of the 24487 variants in all 3205 genes (Figure S5). In our simulation studies, we generate genotypes based on the genotypes of the 697 individuals in Sgene. The genotypes of the GAW17 data set are extracted from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study (http://www.1000genomes.org). We use the program fastPHASE [Scheet and Stephens, 2006] to infer haplotypic phase for the 697 individuals and calculate haplotype frequencies. To generate the genotype of an individual, we generate two haplotypes according to the haplotype frequencies. To generate a qualitative disease affection status, we use a liability threshold model based on a continuous phenotype (quantitative trait). An individual is defined to be affected if the individual’s phenotype is at least one standard deviation larger than the phenotypic mean. This yields a prevalence of 16% for the simulated disease in the general population. In the following, we describe how to generate a quantitative trait.
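The genotype sampling and the liability threshold can be sketched as follows. The function names are hypothetical, the haplotypes are assumed to be coded as 0/1 rows of an array, and the threshold here uses the sample mean and standard deviation as a stand-in for the population values. The 16% prevalence corresponds to P(Z ≥ 1) for a standard normal Z, which assumes an approximately normal phenotype.

```python
import math
import numpy as np

def simulate_genotypes(haplotypes, freqs, n, rng):
    """Draw two haplotypes per individual according to `freqs` and add them.

    haplotypes: array of shape (H, M) coded 0/1; freqs: length-H
    probabilities summing to 1. Returns genotypes in {0, 1, 2}, shape (n, M).
    """
    haplotypes = np.asarray(haplotypes)
    idx = rng.choice(len(haplotypes), size=(n, 2), p=freqs)
    return haplotypes[idx[:, 0]] + haplotypes[idx[:, 1]]

def affection_status(y):
    """1 if the phenotype is at least one SD above the mean, else 0."""
    y = np.asarray(y, dtype=float)
    return (y >= y.mean() + y.std()).astype(int)

# P(Z >= 1) for a standard normal Z, about 0.1587, i.e. roughly 16%.
prevalence = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
```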

To evaluate type I error, we generate trait values independent of genotypes by using the model:

$$y = 0.5 X_{1} + 0.5 X_{2} + \varepsilon, \tag{3}$$

where X₁ is a continuous covariate generated from a standard normal distribution, X₂ is a binary covariate taking values 0 and 1 with a probability of 0.5, and ε follows a standard normal distribution.
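A minimal sketch of drawing null trait values from model (3) is given below; the function name `null_trait` and the returned covariate matrix layout are illustrative assumptions.

```python
import numpy as np

def null_trait(n, rng):
    """Trait values from model (3), independent of genotypes.

    Returns (y, Z), where Z has columns X1 (standard normal) and
    X2 (0 or 1, each with probability 0.5).
    """
    x1 = rng.standard_normal(n)
    x2 = rng.integers(0, 2, size=n)     # binary covariate, P(X2 = 1) = 0.5
    eps = rng.standard_normal(n)
    y = 0.5 * x1 + 0.5 * x2 + eps
    return y, np.column_stack([x1, x2])

y, Z = null_trait(1000, np.random.default_rng(0))
```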

To evaluate power, we assume that there are M variants in total and there are n_cau causal variants, where M is determined by n_cau and the percentage of neutral variants. When M and n_cau are given, we randomly choose M variants from 1057 variants of Sgene as total variants and randomly choose n_cau rare variants (MAF<0.01) from M variants as causal variants. Denote n_r and n_p as the number of risk variants and protective variants, respectively, where n_r + n_p = n_cau. For an individual, let $x_{i}^{r}$ and $x_{j}^{p}$ denote the genotypic scores of the i^th risk variant and the j^th protective variant, respectively. The disease model is given by

$$y = 0.5 X_{1} + 0.5 X_{2} + \sum_{i=1}^{n_{r}} \beta_{i}^{r} x_{i}^{r} - \sum_{j=1}^{n_{p}} \beta_{j}^{p} x_{j}^{p} + \varepsilon,$$

where X₁, X₂, and ε are the same as those in equation (3); $\beta_{i}^{r}$ and $\beta_{j}^{p}$ are constants whose values depend on the heritability of each causal variant. We use two models to determine the heritability of each causal variant. Let h_i denote the heritability of the i^th causal variant and let $h_{T} = \sum_{i=1}^{n_{cau}} h_{i}$ denote the total heritability. In Model 1, let $r_{1}, \dots, r_{n_{cau}}$ be random numbers between 0 and 1; then $h_{i} = h_{T} r_{i} / \sum_{j=1}^{n_{cau}} r_{j}$ for $i = 1, \dots, n_{cau}$. In Model 2, $h_{1} = 0.5 h_{T}$; let $r_{2}, \dots, r_{n_{cau}}$ be random numbers between 0 and 1, and then $h_{i} = h_{T} r_{i} / (2 \sum_{j=2}^{n_{cau}} r_{j})$ for $i = 2, \dots, n_{cau}$. Under Model 1, all causal variants have the same expected heritability. Under Model 2, the heritability of one of the causal variants is much larger than that of the other causal variants.
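The two heritability models can be written compactly. The sketch below uses a hypothetical function name and assumes the random numbers are uniform on (0, 1). It covers only the split of the total heritability, since the mapping from h_i to the effect sizes is not given in the text.

```python
import numpy as np

def heritabilities(n_cau, h_total, model, rng):
    """Per-variant heritabilities h_1, ..., h_ncau that sum to h_total.

    model=1: h_i proportional to uniform random numbers r_i.
    model=2: h_1 = 0.5 * h_total; the other half is split in
             proportion to r_2, ..., r_ncau (requires n_cau >= 2).
    """
    if model == 1:
        r = rng.uniform(0.0, 1.0, size=n_cau)
        return h_total * r / r.sum()
    r = rng.uniform(0.0, 1.0, size=n_cau - 1)
    rest = h_total * r / (2.0 * r.sum())
    return np.concatenate([[0.5 * h_total], rest])

h1 = heritabilities(20, 0.05, model=1, rng=np.random.default_rng(0))
h2 = heritabilities(20, 0.05, model=2, rng=np.random.default_rng(0))
```

In both models the h_i sum to h_T; in Model 2 the first variant alone carries half of it.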

Results

In simulation studies, p-values are estimated using a pooling permutation method [Guo and Lin, 2009] in which permuted samples from all the replicates are pooled together to form a joint sample from the null distribution. In each replicate, we perform 20 permutations. Type I error rates are evaluated using 10,000 replicated samples, while powers are evaluated using 1,000 replicated samples.

For type I error evaluation, we consider different kinds of traits, different haplotype structures (different genes), and different significance levels. For 10,000 replicated samples, the 95% confidence intervals (CIs) for type I error rates at nominal levels 0.05, 0.01, and 0.001 are (0.046, 0.054), (0.008, 0.012), and (0.0004, 0.0016), respectively. The estimated type I error rates of the five tests are summarized in Table 1. As shown in this table, more than 95% of the estimated type I error rates are within the 95% CIs, which indicates that the estimated type I error rates are not significantly different from the nominal levels. Thus, all five tests are valid.
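The reported intervals are consistent with the usual normal approximation for a binomial proportion, α ± 1.96·sqrt(α(1 − α)/R). The text does not say how the intervals were derived, so the short sketch below is only a check under that assumption; the function name is hypothetical.

```python
import math

def type1_ci(alpha, n_rep, z=1.96):
    """Normal-approximation 95% CI for an estimated type I error rate."""
    half = z * math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - half, alpha + half

for alpha in (0.05, 0.01, 0.001):
    lo, hi = type1_ci(alpha, 10_000)
    print(f"alpha={alpha}: ({lo:.4f}, {hi:.4f})")
```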

Table 1.

Estimated type I error rates (in percentage) of the five tests.

		Quantitative Traits					Qualitative Traits
α	Gene	WS	MAXST	TOW	SKAT	OCST	WS	MAXST	TOW	SKAT	OCST
5%	Gene1	5.21	5.11	4.82	5.18	4.96	5.04	5.26	5.04	5.04	5.20
	Gene3	5.06	5.06	5.12	5.10	5.14	4.72	5.37	4.97	4.93	5.07
	Gene5	4.92	5.37	4.88	4.54	5.08	4.98	5.37	5.14	5.03	4.97
	Sgene	5.23	5.09	4.74	5.18	4.73	4.77	4.98	4.68	4.90	5.03
1%	Gene1	0.96	0.86	0.89	1.08	0.77	1.09	1.05	1.09	0.91	0.97
	Gene3	1.09	1.09	1.09	1.06	1.14	0.99	1.06	0.97	1.06	0.90
	Gene5	0.93	1.12	0.98	0.84	0.93	0.85	1.22	1.01	1.15	1.04
	Sgene	0.89	1.04	1.12	1.02	1.02	0.98	1.08	0.92	0.94	0.99
0.1%	Gene1	0.11	0.06	0.10	0.05	0.13	0.13	0.08	0.09	0.10	0.09
	Gene3	0.07	0.11	0.12	0.08	0.07	0.07	0.08	0.11	0.10	0.08
	Gene5	0.12	0.15	0.06	0.03	0.12	0.10	0.10	0.12	0.14	0.16
	Sgene	0.13	0.13	0.13	0.07	0.18	0.06	0.11	0.11	0.14	0.13

Note: α denotes the significance level. In this set of simulations, sample size is 1000.

For power comparisons, we conduct two sets of simulations. In simulation set 1, we compare the power of OCST with that of burden (WS), quadratic (SKAT and TOW), and optimal single-variant (MAXST) tests. In simulation set 2, we compare the power of OCST with that of two combined tests (SKAT-O and Fisher-CT). For simulation set 1, we compare the power of the five tests as a function of the percentage of neutral variants (Figures 1, 2, S6, S7) and as a function of the percentage of protective variants (Figures 3, S8). The power of TOW and the power of SKAT have similar patterns in all the simulation scenarios, but TOW is consistently more powerful than SKAT. In the following discussion of power comparisons, we therefore omit SKAT.

Figure 1. Power comparisons of the five tests as a function of the percentage of neutral variants under model 1 for quantitative traits. ncau represents the number of causal variants. pp represents the percentage of protective variants. In this set of simulations, sample size is 1000; significance level is 0.001; total heritability is 0.05. Tn = ncau/(1 − pn), where Tn represents the total number of variants and pn represents the percentage of neutral variants, and the Tn variants are randomly chosen from the 1057 variants of Sgene.

Figure 2. Power comparisons of the five tests as a function of the percentage of neutral variants under model 2 for quantitative traits. ncau represents the number of causal variants. pp represents the percentage of protective variants. In this set of simulations, sample size is 1000; significance level is 0.001; total heritability is 0.05. Tn = ncau/(1 − pn), where Tn represents the total number of variants and pn represents the percentage of neutral variants, and the Tn variants are randomly chosen from the 1057 variants of Sgene.

Figure 3. Power comparisons of the five tests as a function of the percentage of protective variants for quantitative traits. The number of causal variants is 20. pn represents the percentage of neutral variants. In this set of simulations, sample size is 1000; significance level is 0.001; total heritability is 0.05. Tn = ncau/(1 − pn), where Tn represents the total number of variants and ncau represents the number of causal variants, and the Tn variants are randomly chosen from the 1057 variants of Sgene.

As shown by the power comparisons as a function of the percentage of neutral variants (Figures 1, 2), in all cases OCST either is the most powerful test or has power similar to that of the most powerful test. WS is the most powerful test, and OCST has power similar to that of WS, when there are no protective variants and the percentage of neutral variants is small. TOW is the most powerful test, and OCST has power similar to that of TOW, under model 1 when both protective and risk variants are present. MAXST is the most powerful test, and OCST has power similar to that of MAXST, under model 2 when both protective and risk variants are present and the percentage of neutral variants is large. OCST is the most powerful test otherwise. As the percentage of neutral variants increases, the power of all the tests decreases; the power of WS decreases the fastest and the power of MAXST decreases the slowest. As the number of causal variants decreases, the power of all the tests increases; the power of WS increases the slowest and the power of MAXST increases the fastest. MAXST behaves this way because it essentially depends only on the variant with the largest heritability. This also explains why the power of MAXST is higher under model 2 than under model 1.

The power comparisons as a function of the percentage of protective variants are given in Figure 3. As shown in Figure 3, OCST again either is the most powerful test or has power similar to that of the most powerful test. TOW is the most powerful test, and OCST has power similar to that of TOW, when the percentage of neutral variants is small. MAXST is the most powerful test, and OCST has power similar to that of MAXST, under model 2 when the percentage of neutral variants is large. OCST is the most powerful test otherwise. When both protective and risk variants are present, the power of WS decreases dramatically, while the power of OCST decreases slightly and the power of TOW and MAXST does not decrease at all.

Power comparisons based on a qualitative trait have patterns similar to those based on a quantitative trait (Figures S6-S8). However, the power of TOW and MAXST decreases in the presence of both risk and protective variants, although not as fast as that of WS (Figure S8). As pointed out by Wu et al. [2011] and Sha et al. [2012], the decrease in power of TOW and MAXST in the presence of both risk and protective variants is due to the fact that protective variants lower MAFs in cases and thus make observing rare variants in cases more difficult.

In simulation set 2, we compare the power of OCST, Fisher-CT, and SKAT-O as a function of the percentage of protective variants. Results are summarized in Figure 4. This figure shows that OCST is consistently more powerful than Fisher-CT and that Fisher-CT is consistently more powerful than SKAT-O. Power simulation results based on a qualitative trait yield the same conclusions, but the differences in power between the three tests are smaller than those based on a quantitative trait (Figure S9).

Figure 4. Power comparisons of three tests (OCST, Fisher-CT, and SKAT-O) as a function of the percentage of protective variants for quantitative traits. The number of causal variants is 20. pn represents the percentage of neutral variants. In this set of simulations, sample size is 1000; significance level is 0.001; total heritability is 0.05. Tn = ncau/(1 − pn), where Tn represents the total number of variants and ncau represents the number of causal variants, and the Tn variants are randomly chosen from the 1057 variants of Sgene.

We also perform simulation studies to compare the power of the proposed test (OCST) with that of the Adaptive Weighting test (AW2) proposed by Sha et al. [2013]. The power comparisons of these two tests as a function of the heritability for quantitative traits are given in Figure 5. This figure shows that OCST is consistently more powerful than AW2.

Figure 5. Power comparisons of the proposed test (OCST) and AW2 proposed by Sha et al. [2013] as a function of the heritability for quantitative traits. The number of causal variants is 30. nprot represents the percentage of protective variants. pcau represents the percentage of causal variants. In this set of simulations, sample size is 1000; significance level is 0.001.

Discussion

There is increasing interest in detecting associations between rare variants and complex traits. The reasons are that (1) the common variants identified through genome-wide association studies (GWAS) account for only a small portion of the presumed phenotypic variation and (2) the development of next-generation sequencing technology has made directly testing all rare variants feasible. Several statistical methods for rare variant association studies have been developed recently. However, recent studies show that the performance of each of these methods depends strongly on its underlying assumptions and that no method demonstrates consistently acceptable power [Ladouceur et al., 2012]. More recently, Derkach et al. [2012] and Lee et al. [2012] proposed combined tests that combine information from burden and quadratic tests. However, results from this study and from Kinnamon et al. [2012] show that there exist tests that can outperform both burden and quadratic tests in some situations. In this article, we propose a novel combined test, OCST, by combining information from tests of three classes that are well beyond burden and quadratic tests. Our results show that, compared with burden and quadratic tests, OCST either is the most powerful test or has power similar to that of the most powerful test. Our results also show that OCST has better power than the two combined tests, Fisher-CT and SKAT-O.

All the existing methods discussed in this article are for unrelated individuals only. Although our proposed method is also described using unrelated individuals, our method can be applied to family-based data as long as there is a single-variant test. As an example, we consider the within-family test T_WFT and admixture between-family test T_adBFT proposed by Fang et al. [2012] for family-based rare variant association studies. We can use either T_WFT or T_adBFT as the single-variant test s_m and then our method can be applied to family-based data through this s_m.

Using formula (2) to adjust for the effect of covariates for binary traits may look strange. However, previous research has shown that this adjustment works well for binary traits. To control for population stratification, Price et al. [2006] used formula (2) to adjust for the effect of covariates (eigenvectors) for binary traits and showed that this method works well. In rare variant association studies, Sha et al. [2012] used formula (2) to adjust for the effect of covariates for binary traits, and their results also showed that this method works very well.

Supplementary Material

Supplementary Figures S1–S9 (NIHMS615101-supplement-Supp_FigureS1-S9.doc; 149.5 KB)

Acknowledgements

Research reported in this article was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R03 HG006155. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The Genetic Analysis workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org).

Footnotes

The authors have no conflict of interests to declare.

References

Andrés AM, Clark AG, Shimmin L, Boerwinkle E, Sing CF, Hixson JE. Understanding the accuracy of statistical haplotype inference with sequence data of known phase. Genet Epidemiol. 2007;31:659–671. doi: 10.1002/gepi.20185.
Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40(6):695–701. doi: 10.1038/ng.f.136.
Derkach A, Lawless J, Sun L. Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol. 2012;37(1):110–121. doi: 10.1002/gepi.21689.
Guo W, Lin S. Generalized linear modeling with regularization for detecting common disease rare haplotype association. Genet Epidemiol. 2009;33:308–316. doi: 10.1002/gepi.20382.
Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70:42–54. doi: 10.1159/000288704.
Hoffmann TJ, Marini NJ, Witte JS. Comprehensive approach to analyzing rare genetic variants. PLoS ONE. 2010;5(11):e13584. doi: 10.1371/journal.pone.0013584.
Kinnamon DD, Hershberger RE, Martin ER. Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS ONE. 2012;7(2):e30238. doi: 10.1371/journal.pone.0030238.
Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. The empirical power of rare variant association methods: results from Sanger sequencing in 1,998 individuals. PLoS Genet. 2012;8:e1002496. doi: 10.1371/journal.pgen.1002496.
Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, NHLBI GO Exome Sequencing Project-ESP Lung Project Team, Christiani DC, Wurfel MM, Lin X. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
Lin D-Y, Tang Z-Z. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015.
Madsen BE, Browning SR. A group-wise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11:31–46. doi: 10.1038/nrg2626.
Morgenthaler S, Thilly WG. A strategy to discover genes that carry multiallelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322.
Nelder J, Wedderburn R. Generalized linear models. J R Stat Soc Ser A. 1972;135:370–384.
Ng SB, Turner EH, Robertson PD. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant...or not? Hum Mol Genet. 2002;11:2417–2423. doi: 10.1093/hmg/11.20.2417.
Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69:124–137. doi: 10.1086/321272.
Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol. 2012;36(6):561–571. doi: 10.1002/gepi.21649.
Sha Q, Wang S, Zhang S. Adaptive clustering and adaptive weighting methods to detect disease associated rare variants. Eur J Hum Genet. 2013;21(3):332–337. doi: 10.1038/ejhg.2012.143.
Sha Q, Zhang Z, Zhang S. An improved score test for genetic association studies. Genet Epidemiol. 2011;35:350–359. doi: 10.1002/gepi.20583.
Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78(4):629–644. doi: 10.1086/502802.
Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nat Genet. 2008;40:17–22. doi: 10.1038/ng.2007.53.
Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet. 2010. doi: 10.1093/hmg/ddq333.
Walsh T, King MC. Ten genes for inherited breast cancer. Cancer Cell. 2007;11:103–105. doi: 10.1016/j.ccr.2007.01.010.
Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am J Hum Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
Yi N, Zhi D. Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol. 2011;35:57–69. doi: 10.1002/gepi.20554.
Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zollner S. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet. 2010;87:604–617. doi: 10.1016/j.ajhg.2010.10.012.
