Author manuscript; available in PMC: 2016 Sep 1.
Published in final edited form as: Genet Epidemiol. 2015 Sep;39(6):399–405. doi: 10.1002/gepi.21913

Sequence kernel association analysis of rare variant set based on the marginal regression model for binary traits

Baolin Wu 1,*, James S Pankow 2, Weihua Guan 1,*
PMCID: PMC4544778  NIHMSID: NIHMS703118  PMID: 26282996

Abstract

Recent sequencing efforts have focused on exploring the influence of rare variants on complex diseases. Gene-level tests that aggregate information across rare variants within a gene have become an attractive way to enrich the rare variant association signal. Among them, the sequence kernel association test has proved to be a very powerful method for jointly testing multiple rare variants within a gene. In this article, we explore an alternative sequence kernel association test. We propose to use the univariate likelihood ratio statistics from the marginal model for individual variants as input into the kernel association test. We show how to compute its p-value efficiently based on the asymptotic chi-square mixture distribution. We demonstrate through extensive numerical studies that the proposed method has competitive performance. Its usefulness is further illustrated with an application to associations between rare exonic variants and type 2 diabetes in the Atherosclerosis Risk in Communities (ARIC) Study. We identified an exome-wide significant rare variant set in the gene ZZZ3 worthy of further investigation.

Keywords: GWAS, SKAT, Score statistic, Sequencing data

Introduction

In GWAS, observed effect sizes for common variants have typically been quite small, and in combination they explain only a small proportion of the phenotypic variance. Manolio et al. (2009) have suggested that rare variants could have substantial effect sizes without demonstrating clear Mendelian segregation, and could contribute substantially to missing heritability. Tests of individual rare variants typically lack power due to low minor allele frequencies, and gene-level association tests that aggregate information across rare variants within a gene have become attractive for enriching the association signal. An intuitive and simple approach to aggregating signals across rare variants collapses the rare variants into a burden score to be linked to the phenotype (Morgenthaler and Thilly, 2007; Madsen and Browning, 2009; Morris and Zeggini, 2010; Price et al., 2010; Lin and Tang, 2011). The combined multivariate and collapsing (CMC) method extends the burden test by collapsing rare variants in a region within subgroups defined according to their minor allele frequencies (MAFs) (Li and Leal, 2008). The variable threshold (VT) method is a data adaptive burden test that chooses an optimal MAF threshold (Price et al., 2010; Lin and Tang, 2011). The burden test works well for variants with similar effects and could lose substantial power with both protective and deleterious variants, or in the presence of many non-causal variants. The sequence kernel association test (SKAT) is based on the variance component score test and works well under various combinations of protective and deleterious variants (Wu et al., 2010; Neale et al., 2011; Wu et al., 2011). A more flexible approach is SKAT-O, which adaptively combines the burden and the SKAT statistics (Lee et al., 2012). The SKAT based approach performs well and is widely used in rare variant association tests.

Rare variants have been postulated to have large effect sizes (Manolio et al., 2009). It is likely that typical GWAS only have sufficient power to detect variants with large effects. This is indeed the case for most rare disease-causing variants identified to date (Bonnefond et al., 2012; Zhan et al., 2013; Steinthorsdottir et al., 2014; Wang et al., 2014; Estrada et al., 2014). SKAT is based on the score test and is thus computationally very efficient. The score test performs well when the parameter is close to the null value, but could have suboptimal performance under large deviations from the null (e.g., when testing rare variants with large effect sizes).

Recently Chen et al. (2014) developed a Cox SKAT for survival outcomes and adopted the likelihood ratio test for its better performance compared to the score test in the Cox proportional hazards model. In this article, we explore an alternative sequence kernel association test for binary traits in the same spirit as Chen et al. (2014). We use the univariate likelihood ratio statistics from the marginal model for individual variants as input into the sequence kernel association test and its adaptive version. Their p-values can be computed efficiently based on the asymptotic chi-square mixture distribution. We demonstrate through extensive numerical studies that the proposed method has competitive performance. We illustrate the usefulness of the proposed method through an application to associations between rare variants and type 2 diabetes in the ARIC Study.

Materials and Methods

Consider a GWAS with genotype scores G, coded as (0,1,2) for the number of copies of the minor allele, disease status indicator Y, and additional covariates X, which could include ancestry covariates (e.g., ancestry indicators or principal components).

Consider n subjects sequenced in a region with m genotyped rare variants. For the i-th subject, let yi denote the case-control status, Gi = (gi1, …, gim) the genotypes for the m variants, Xi = (xi1, …, xip) the covariates to be adjusted. We study the disease association of rare variants based on the following logistic regression model

Pr(yi=1|Xi,Gi)=expit(β0+Xiα+Giβ), (1)

where α and β = (β1, …, βm)′ are the vectors of regression coefficients for the covariates and rare variants, respectively. Here expit(x) = 1/(1 + exp(−x)) is the inverse-logit function. The disease association of the m rare variants can be tested by evaluating the null hypothesis H0 : β = 0.
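As a minimal illustration of model (1), the sketch below evaluates the disease probability for one subject. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not taken from the authors' R programs.

```python
import math

def expit(x):
    """Inverse-logit function, expit(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def disease_probability(beta0, x, alpha, g, beta):
    """Pr(y = 1 | X, G) under model (1) for a single subject."""
    eta = beta0 + sum(xk * ak for xk, ak in zip(x, alpha))
    eta += sum(gj * bj for gj, bj in zip(g, beta))
    return expit(eta)

# Hypothetical subject: two covariates, three rare variants, one minor allele.
x, alpha, g = [1, 0.2], [0.5, 0.5], [0, 1, 0]
p_null = disease_probability(-3.4, x, alpha, g, [0.0, 0.0, 0.0])  # H0: beta = 0
p_alt = disease_probability(-3.4, x, alpha, g, [0.0, 1.2, 0.0])   # one risk variant
```

Under H0 the genotypes drop out of the linear predictor, so p_null depends on the covariates only.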

Sequence kernel association test

The sequence kernel association test (SKAT; Wu et al., 2011) is derived as a variance-component score statistic by assuming that each βj follows an arbitrary zero-mean distribution with variance $w_j^2\psi$, where the weight wj is fixed and typically computed from the MAF, e.g., the Wu weights wj = Beta(fj; 1, 25) (Wu et al., 2011). Here fj is the MAF of Gj and Beta is the beta distribution density function. Under this assumption, the null hypothesis H0 : β = 0 is equivalent to H0 : ψ = 0.
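The Beta-density weights can be sketched in a few lines. The MAF values below are hypothetical; the helper uses only the standard library and is an illustration rather than the authors' implementation.

```python
import math

def beta_density(f, a, b):
    """Beta(a, b) density evaluated at f, for 0 < f < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(f) + (b - 1) * math.log(1 - f))

# Hypothetical MAFs for three rare variants.
mafs = [0.0005, 0.002, 0.01]
weights = [beta_density(f, 1, 25) for f in mafs]  # w_j = Beta(f_j; 1, 25)
```

With (a, b) = (1, 25) the density reduces to 25(1 − f)^24, so rarer variants receive larger weights.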

Let y = (y1, …, yn)′ denote the response vector, X the n × p covariate matrix, $G = (G_1', \ldots, G_n')'$ the n × m genotype matrix, and W = diag(w1, …, wm) the diagonal matrix of weights. The SKAT statistic can be computed as

$$Q = (y - \hat{\pi}_0)' G W W G' (y - \hat{\pi}_0),$$

where π̂0 = (π̂1, …, π̂n)′ with $\hat{\pi}_i = \widehat{\Pr}(y_i = 1 \mid X_i, G_i)$ derived under the null model (β = 0). Let V0 = diag{π̂0(1 − π̂0)} denote the n × n diagonal matrix of marginal variances, and X0 = (1, X) the n × (p + 1) null model design matrix. Define $P = V_0 - V_0 X_0 (X_0' V_0 X_0)^{-1} X_0' V_0$, which is the asymptotic covariance matrix Cov(y − π̂0). Under the null, Q follows a mixture of 1-DF chi-square distributions (Liu et al., 2007; Tzeng and Zhang, 2007), with the mixture coefficients being the eigenvalues of $P^{1/2} G W W G' P^{1/2}$, which is of dimension n × n. The p-value can be obtained by matching moments (Liu et al., 2009) or by inverting the characteristic function (Davies, 1980).
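To make the chi-square mixture concrete, the sketch below estimates the tail probability of a weighted sum of independent 1-DF chi-square variables by plain Monte Carlo. This is a stand-in for illustration only: it is neither the moment-matching method of Liu et al. (2009) nor the Davies (1980) algorithm, and it cannot resolve the very small p-values needed in genome-wide scans. The function name is hypothetical.

```python
import random

def mixture_chisq_pvalue_mc(q, coefs, n_draws=100_000, seed=1):
    """Estimate Pr(sum_j coefs[j] * Z_j^2 > q) for independent N(0, 1) Z_j."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        total = sum(c * rng.gauss(0.0, 1.0) ** 2 for c in coefs)
        if total > q:
            exceed += 1
    return exceed / n_draws

# Sanity check against a known case: a single coefficient of 1 is a plain
# 1-DF chi-square, whose upper 5% point is about 3.841.
p_hat = mixture_chisq_pvalue_mc(3.841, [1.0])
```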

The SKAT statistic can be equivalently derived based on the score vector U for β (Pan, 2009). We can check that U = G′(y − π̂0). Under the null, the score vector U is asymptotically zero-mean multivariate normal with a covariance that can be consistently estimated by (Cox and Hinkley, 1979)

$$\Sigma = G' V_0 G - G' V_0 X_0 (X_0' V_0 X_0)^{-1} X_0' V_0 G = G' P G, \qquad (2)$$

which accounts for the linkage disequilibrium among variants. The SKAT statistic can be equivalently written as Q = U′WWU. Hence the mixture coefficients can be equivalently computed from the eigenvalues of $\Sigma^{1/2} W W \Sigma^{1/2}$, which is an m × m matrix. Note that m is typically much smaller than n, so the eigenvalues can be computed very efficiently.
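The score-vector form of SKAT can be sketched with NumPy as follows. The data are simulated and the weights are hypothetical; for brevity the null model is assumed to contain an intercept only, so the fitted null probability is simply the case fraction (with covariates one would first fit the null logistic regression).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5                                    # hypothetical sample and set sizes
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
y = rng.binomial(1, 0.3, size=n).astype(float)
W = np.diag([1.0, 0.8, 1.2, 0.5, 1.5])           # hypothetical weights

# Intercept-only null model (an assumption of this sketch).
X0 = np.ones((n, 1))
pi0 = np.full(n, y.mean())
V0 = np.diag(pi0 * (1 - pi0))
P = V0 - V0 @ X0 @ np.linalg.inv(X0.T @ V0 @ X0) @ X0.T @ V0

U = G.T @ (y - pi0)                              # score vector
Sigma = G.T @ P @ G                              # equation (2)
Q = U @ W @ W @ U                                # m-dimensional form
Q_n = (y - pi0) @ G @ W @ W @ G.T @ (y - pi0)    # n-dimensional form

# Mixture coefficients from an m x m eigenproblem; W Sigma W has the same
# eigenvalues as Sigma^{1/2} W W Sigma^{1/2}.
coefs = np.linalg.eigvalsh(W @ Sigma @ W)
```

Both forms of Q agree, and only a 5 × 5 eigenproblem is solved instead of a 200 × 200 one.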

Likelihood ratio test based kernel association test

For the score vector U = G′(y − π̂0), consider its j-th element $U_j = G_j'(y - \hat{\pi}_0)$, where Gj = (g1j, …, gnj)′ is the j-th column of G. One can check that Uj equals the score statistic for testing the significance of the j-th SNP based on the following marginal logistic model

Pr(yi=1|Xi,gij)=expit(β0j+Xiαj+gijβ1j). (3)

Alternatively we can employ the likelihood ratio test (LRT) to assess the marginal significance of the j-th SNP. Under null, the score test is asymptotically equivalent to the LRT. However the LRT could be more powerful than the score test if the j-th rare variant has potentially large effect size, when it is either a risk variant or in linkage disequilibrium with other risk variants.

We propose a marginal LRT based sequence kernel association test (denoted as SKATL) as follows. Denote χj as the LRT chi-square statistic for testing β1j under model (3). Let $S_j = \mathrm{sign}(\hat{\beta}_{1j})\sqrt{\chi_j}$ and S = (S1, …, Sm)′, where β̂1j is the maximum likelihood estimator (MLE). Define the SKATL statistic

$$L = S' W W S = \sum_{j=1}^{m} w_j^2 \chi_j.$$

Under the null of no rare variant effects (all βj = 0), we have β1j = 0, and Sj is asymptotically equivalent to the standardized Uj. Let $R = \mathrm{diag}(\Sigma)^{-1/2}\, \Sigma\, \mathrm{diag}(\Sigma)^{-1/2}$, which is the correlation matrix corresponding to Σ in (2). The null distribution of L is a mixture of 1-DF chi-square distributions with mixture coefficients being the eigenvalues of $R^{1/2} W W R^{1/2}$.

Note that the SKATL only depends on the LRT chi-square statistic, and in principle we do not need the MLE β̂1j, which could have convergence issues and aberrant testing behavior (Hauck and Donner, 1977). When computing the SKATL in our numerical studies, we set χj equal to the squared standardized score statistics for extremely rare variants (specifically with minor allele count less than ten).
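The following sketch computes the marginal LRT statistics and the SKATL statistic on a toy data set. It makes several simplifying assumptions that are not part of the paper: no covariates, a bare Newton-Raphson fit without safeguards, hypothetical weights, and no fallback to the standardized score statistic for extremely rare variants. All function names are illustrative.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def loglik(y, g, b0, b1):
    """Bernoulli log-likelihood of the logistic model logit(p) = b0 + b1 * g."""
    ll = 0.0
    for yi, gi in zip(y, g):
        p = expit(b0 + b1 * gi)
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll

def fit_marginal(y, g, n_iter=25):
    """Newton-Raphson MLE of (b0, b1) for one variant."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for yi, gi in zip(y, g):
            p = expit(b0 + b1 * gi)
            v = p * (1.0 - p)
            s0 += yi - p
            s1 += gi * (yi - p)
            h00 += v
            h01 += gi * v
            h11 += gi * gi * v
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

def signed_root_lrt(y, g):
    """Return (chi_j, S_j) with S_j = sign(b1_hat) * sqrt(chi_j)."""
    ybar = sum(y) / len(y)
    ll_null = loglik(y, g, math.log(ybar / (1.0 - ybar)), 0.0)
    b0, b1 = fit_marginal(y, g)
    chi = max(0.0, 2.0 * (loglik(y, g, b0, b1) - ll_null))
    return chi, math.copysign(math.sqrt(chi), b1)

# Toy data: 20 subjects, two variants (the first risk-like, the second protective-like).
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
g1 = [0] * 14 + [1] * 5 + [2]
g2 = [1 if i in (1, 3, 4, 8, 12) else 0 for i in range(20)]
w = [1.0, 1.5]                                   # hypothetical weights
stats = [signed_root_lrt(y, g) for g in (g1, g2)]
L = sum(wj ** 2 * chi for wj, (chi, _) in zip(w, stats))
```

The statistic L uses only the χj values; the signs carried by Sj matter once the burden-type component is introduced.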

Data adaptive kernel association test

An alternative approach to aggregating signals across rare variants is the burden test (Li and Leal, 2008; Madsen and Browning, 2009). The burden test is typically computed as the weighted sum of score statistics. The burden test works well for variants with similar effects and could lose substantial power in the presence of a large number of non-causal variants, or with both protective and deleterious variants. A more flexible approach is to adaptively combine the burden test and the kernel association test following the SKAT-O approach of Lee et al. (2012), which tests the rare variant effects using the minimum p-value of the weighted SKAT statistic, (y − π̂0)′Kρ(y − π̂0), where Kρ = GW[(1 − ρ)I + ρJ]WG′, ρ ∈ [0, 1]. Here I is the m × m identity matrix and J is the m × m matrix with all elements equal to one.

Similarly we consider the following weighted SKATL statistic

$$L_\rho = S' W [(1 - \rho) I + \rho J] W S, \quad \rho \in [0, 1].$$

Given ρ, the p-value of Lρ, P-val(Lρ), can be similarly computed based on the 1-DF chi-square mixture distribution with coefficients being the eigenvalues of $R^{1/2} W [(1 - \rho) I + \rho J] W R^{1/2}$. The data adaptive SKATL statistic (denoted as SKAT-OL) is defined as the minimum p-value, $T = \min_{0 \le \rho \le 1} \text{P-val}(L_\rho)$, where the minimum is often taken over a finite grid of ρ: 0 = ρ1 < … < ρb = 1, and the significance of T can be efficiently computed using a one-dimensional numerical integration (Lee et al., 2012). We discuss computational details in the following section.
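Expanding the quadratic form shows that Lρ is a convex combination of the SKATL statistic and a burden-type statistic, Lρ = (1 − ρ) Σ wj²χj + ρ (Σ wjSj)². A short sketch with hypothetical inputs:

```python
def weighted_skatl(S, w, rho):
    """L_rho = S' W [(1 - rho) I + rho J] W S for signed-root statistics S."""
    ws = [wj * sj for wj, sj in zip(w, S)]
    kernel_part = sum(v * v for v in ws)   # S'WWS, the SKATL statistic
    burden_part = sum(ws) ** 2             # S'WJWS, a burden-type statistic
    return (1.0 - rho) * kernel_part + rho * burden_part

S = [2.1, -1.4, 0.3]                       # hypothetical signed-root LRT statistics
w = [1.0, 0.5, 2.0]                        # hypothetical weights
grid = [0.0, 0.25, 0.5, 0.75, 1.0]         # hypothetical grid of rho values
stats = [weighted_skatl(S, w, rho) for rho in grid]
```

At ρ = 0 the value is the SKATL statistic; at ρ = 1 opposite-signed Sj partly cancel, which is why the burden-type component loses power with mixed effect directions.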

P-value computation for kernel association tests

We offer some insights into the efficient p-value computation for SKAT, SKATL and the data adaptive kernel association tests. First note that the non-zero eigenvalues of AA′ are the same as those of A′A for any matrix A, which can be verified from the singular value decomposition of A: $A = U_A D_A V_A'$, where UA and VA are orthogonal and DA is a diagonal matrix. Therefore $AA' = U_A D_A^2 U_A'$ and $A'A = V_A D_A^2 V_A'$, and hence their non-zero eigenvalues equal the squared singular values of A. So for computing the p-values of the proposed SKATL, the eigenvalues of $R^{1/2} W W R^{1/2}$ can be equivalently computed from WRW. For SKAT, the non-zero eigenvalues of $P^{1/2} G W W G' P^{1/2}$ are the same as those of WG′PGW = WΣW.
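The eigenvalue identity can be checked numerically. The matrix below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))               # hypothetical tall matrix (n = 50, m = 4)

eig_big = np.linalg.eigvalsh(A @ A.T)      # 50 x 50 eigenproblem
eig_small = np.linalg.eigvalsh(A.T @ A)    # 4 x 4 eigenproblem
sing_sq = np.linalg.svd(A, compute_uv=False) ** 2

# The 4 largest eigenvalues of AA' match those of A'A; the other 46 are zero.
top_big = np.sort(eig_big)[-4:]
rest_big = np.sort(eig_big)[:-4]
```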

For the matrix B = (1 − ρ)I + ρJ, ρ ∈ [0, 1], we can check that $B = B_h^2$, where $B_h = \sqrt{1-\rho}\, I + \frac{\sqrt{1 + (m-1)\rho} - \sqrt{1-\rho}}{m} J$. Therefore for computing p-values of the weighted SKATL, the eigenvalues of $R^{1/2} W [(1 - \rho) I + \rho J] W R^{1/2}$ can be equivalently computed from $B_h W R W B_h$.
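The closed-form square root of B can be verified directly. The values of m and ρ below are arbitrary, and the helper names are hypothetical.

```python
import math

def b_half(m, rho):
    """Return (a, b) such that B_h = a*I + b*J satisfies B_h B_h = (1 - rho) I + rho J."""
    a = math.sqrt(1.0 - rho)
    b = (math.sqrt(1.0 + (m - 1) * rho) - a) / m
    return a, b

def matmul(X, Y):
    """Plain-Python product of two square matrices stored as nested lists."""
    k = len(Y)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

m, rho = 4, 0.3
a, b = b_half(m, rho)
Bh = [[a * (i == j) + b for j in range(m)] for i in range(m)]
B = [[(1.0 - rho) * (i == j) + rho for j in range(m)] for i in range(m)]
Bh2 = matmul(Bh, Bh)
max_err = max(abs(Bh2[i][j] - B[i][j]) for i in range(m) for j in range(m))
```

The check works because (aI + bJ)² = a²I + (2ab + mb²)J, and the chosen a and b give a² = 1 − ρ and 2ab + mb² = ρ.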

Null distribution of SKAT-OL

The significance of SKAT-OL can be computed as (see Appendix for technical details)

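$$1 - \int_0^{\tilde{q}_1} M(\delta(x))\, f(x \mid \chi_1^2)\, dx,$$

where

$$\delta(x) = \left( \min_{\upsilon < b} \frac{q_{\rho_\upsilon} - \tau_{\rho_\upsilon} x}{1 - \rho_\upsilon} - \mu \right) \frac{\sigma}{\sigma_0} + \mu, \qquad \tilde{q}_1 = F^{-1}(1 - T \mid \chi_1^2),$$

$f(\cdot \mid \chi_1^2)$ and $F(\cdot \mid \chi_1^2)$ are the 1-DF chi-square density and distribution functions, and M(·) is the distribution function of the 1-DF chi-square mixture with coefficients (λ1, …, λm), which are the eigenvalues of $(I - H_1)\tilde{R}(I - H_1)$, where $\tilde{R} = WRW$. Here $\mu = \sum_{j=1}^m \lambda_j$, $\sigma^2 = 2\sum_{j=1}^m \lambda_j^2$, $\sigma_0^2 = \sigma^2 + 4\,\mathrm{tr}[\tilde{R} H_1 \tilde{R}(I - H_1)]$, $\tau_\rho = \rho \|R_1\|^2 + (1-\rho) R_1' \tilde{R} R_1 / \|R_1\|^2$, and $H_1 = R_h J R_h / (R_1' R_1)$, where $R_h R_h = \tilde{R}$ and $R_1 = R_h (1, \ldots, 1)'$.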

Results

Simulation studies

We conducted extensive simulation studies to evaluate the performance of the proposed and existing methods. Following Lee et al. (2012), we generated 10,000 European-like haplotypes of length 1000 kb under a calibrated coalescent model (Schaffner et al., 2005). We randomly pair the haplotypes to simulate a total population of $10^6$ individuals. We randomly select a gene region of length 10 kb and study those rare variants with MAF ≤ 0.01. We consider two covariates Z = (Z1, Z2)′: Z1 ∈ {0, 1} follows Bernoulli(0.5), and Z2 ~ N(0, 1). We model the disease risk as $\mathrm{expit}(\beta_0 + Z'\beta_Z + \sum_{j=1}^m \beta_j G_j)$. We set β0 = −3.4 and βZ = (0.5, 0.5)′ (corresponding to a 5% population disease rate). We randomly select 2500 cases and 2500 controls from the simulated population of $10^6$ samples. We compared five rare variant set analysis methods: SKAT, SKAT-O, SKATL, SKAT-OL and the burden test. In the burden and SKAT tests, we assign weight Beta(fj; a0, b0) to the j-th variant Gj, and for the proposed method we assign weight Beta(fj; a1, b1). Here fj is the MAF of Gj. For a given variant, the likelihood ratio test statistic is inherently standardized and roughly corresponds to the standardized score statistic, that is, the score statistic used in SKAT divided by its standard error, which is roughly proportional to $\sqrt{f_j(1 - f_j)}$. Therefore for the proposed method, we set a1 = a0 + 0.5 and b1 = b0 + 0.5. Following Wu et al. (2011), we set a0 = 1, b0 = 25 for the following simulation studies. We have investigated three sets of weights for (a0, b0): (0.5, 24.5), (1, 25), and (1.5, 25.5). The overall conclusions remain the same (see supplementary material for complete results). As shown in Ma et al. (2013), the performance of the single rare variant LRT depends on the case-control ratio. We have investigated different case-control ratios for (ne, nc). Here we report the results for ne = nc = 2500, and for ne = 1700, nc = 3300. The supplementary material provides simulation results for more unbalanced case-control ratios (1:6 and 1:10).
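The weight adjustment a1 = a0 + 0.5, b1 = b0 + 0.5 can be checked numerically: Beta(f; a0 + 0.5, b0 + 0.5) equals Beta(f; a0, b0) multiplied by the square root of f(1 − f), up to a constant that does not depend on f. The MAF grid below is hypothetical.

```python
import math

def log_beta_density(f, a, b):
    """Log of the Beta(a, b) density at f, for 0 < f < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(f) + (b - 1) * math.log(1 - f)

a0, b0 = 1.0, 25.0
a1, b1 = a0 + 0.5, b0 + 0.5
mafs = [0.0005, 0.001, 0.005, 0.01]        # hypothetical MAFs

# Ratio of the SKATL weight to (SKAT weight x sqrt(f (1 - f))) for each MAF.
ratios = [
    math.exp(log_beta_density(f, a1, b1) - log_beta_density(f, a0, b0))
    / math.sqrt(f * (1 - f))
    for f in mafs
]
```

The ratios are identical across MAFs, so the two weighting schemes differ only by the standard-error factor and an overall scale.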

We use $2.5 \times 10^6$ experiments to evaluate the type I error at the nominal significance levels α = $10^{-5}$, $10^{-4}$, and $10^{-3}$ by setting all βj = 0. The results are summarized in Tables 1 and 2. All methods appropriately control the type I errors. We also verify that the type I errors are appropriately controlled at the $10^{-6}$ significance level by conducting $10^8$ experiments (please see the supplementary material for detailed results including the QQ plots).

Table 1.

Type I error divided by the nominal significance level for rare variant set analysis: ne = 2500 cases and nc = 2500 controls. The SKAT/SKAT-O and burden tests used (1,25) weight, and the proposed SKATL/SKAT-OL used (1.5,25.5) weight.

α $10^{-5}$ $10^{-4}$ $10^{-3}$
SKAT 0.82 0.85 0.92
SKAT-O 0.91 1.02 1.07
SKATL 0.92 1.10 1.08
SKAT-OL 0.96 1.00 1.11
Burden 0.94 0.98 1.00

Table 2.

Type I error divided by the nominal significance level for rare variant set analysis: ne = 1700 cases and nc = 3300 controls. The SKAT/SKAT-O and burden tests used (1,25) weight, and the proposed SKATL/SKAT-OL used (1.5,25.5) weight.

α $10^{-5}$ $10^{-4}$ $10^{-3}$
SKAT 0.86 0.91 0.93
SKAT-O 0.89 0.98 1.04
SKATL 0.96 1.12 1.11
SKAT-OL 0.88 1.07 1.10
Burden 0.91 0.97 1.01

We use $10^4$ experiments to evaluate the power under various combinations of βj at α = $10^{-6}$, $10^{-5}$, $10^{-4}$, and $10^{-3}$. The rare variant effects βj are set as follows. Each time we randomly select a proportion θ of the rare variants and set their $|\beta_j| = d \log_{10}(f_j)$ with d < 0. The remaining null variants have zero coefficients. We have thus assumed that rarer variants have larger effect sizes. We conducted simulations for (1) θ = 0.05, d = −0.6, (2) θ = 0.1, d = −0.5, (3) θ = 0.2, d = −0.4, (4) θ = 0.5, d = −0.25. They correspond to odds ratios of 3.32, 2.72, 2.23 and 1.65 for MAF = 0.01, respectively. We have investigated two scenarios for the direction of causal variant effects. First, we assume a mix of equal proportions of protective and deleterious variants, which will in general favor the kernel association test. Second, we assume a mix of unequal proportions of protective and deleterious variants. Specifically, we randomly set the signs of βj as negative or positive with probability 0.9 and 0.1, respectively.
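The quoted odds ratios follow directly from the effect-size rule. A small sketch (the function name is illustrative):

```python
import math

def effect_size(maf, d):
    """|beta_j| = d * log10(f_j); d < 0 so that rarer variants get larger effects."""
    return d * math.log10(maf)

settings = [(0.05, -0.6), (0.1, -0.5), (0.2, -0.4), (0.5, -0.25)]  # (theta, d)
odds_ratios = [round(math.exp(effect_size(0.01, d)), 2) for _, d in settings]
```

For MAF = 0.01, log10(fj) = −2, so |βj| = −2d and the odds ratio is exp(−2d).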

Table 3 summarized the results assuming equal proportions of protective and deleterious variants and equal case-control ratio (ne = nc = 2500). Overall the proposed SKATL has the best performance. As expected the burden test suffers a dramatic power loss since the burden sum score cancels out those causal variants, and as a result the adaptive SKAT-O and SKAT-OL have reduced performance compared to the SKAT and SKATL. The proposed SKATL has the largest power gain over SKAT with relatively large rare variant effect sizes.

Table 3.

Power comparison of rare variant set analysis: ne = nc = 2500, equal proportions of protective and deleterious variants. The highest powered tests in each row are bold-faced.

θ = 0.05, d = −0.6
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.1654 0.1373 0.2000 0.1661 0.0028
$10^{-5}$ 0.2279 0.1995 0.2627 0.2336 0.0067
$10^{-4}$ 0.3080 0.2808 0.3521 0.3191 0.0181
$10^{-3}$ 0.4286 0.3994 0.4740 0.4398 0.0451

θ = 0.1, d = −0.5
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.2469 0.2041 0.2940 0.2442 0.0081
$10^{-5}$ 0.3446 0.3030 0.3906 0.3506 0.0164
$10^{-4}$ 0.4612 0.4260 0.5051 0.4691 0.0352
$10^{-3}$ 0.6031 0.5657 0.6453 0.6092 0.0750

θ = 0.2, d = −0.4
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.3481 0.2952 0.3965 0.3381 0.0161
$10^{-5}$ 0.4742 0.4239 0.5189 0.4695 0.0302
$10^{-4}$ 0.6122 0.5698 0.6513 0.6130 0.0634
$10^{-3}$ 0.7577 0.7283 0.7885 0.7613 0.1174

θ = 0.5, d = −0.25
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.2959 0.2409 0.3379 0.2783 0.0139
$10^{-5}$ 0.4306 0.3796 0.4724 0.4196 0.0279
$10^{-4}$ 0.5871 0.5432 0.6272 0.5827 0.0568
$10^{-3}$ 0.7554 0.7234 0.7842 0.7542 0.1125

Table 4 summarized the results assuming unequal proportions of protective and deleterious variants and equal case-control ratio (ne = nc = 2500). The adaptive SKAT-O and SKAT-OL now perform better than the SKAT and SKATL under relatively more causal variants with θ = 0.5. With small proportion of causal variants, the burden test suffered much power loss, and as a result the SKAT and SKATL performed better than the adaptive SKAT-O and SKAT-OL.

Table 4.

Power comparison of rare variant set analysis: ne = nc = 2500, unequal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

θ = 0.05, d = −0.6
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.0858 0.0699 0.1054 0.0842 0.0031
$10^{-5}$ 0.1287 0.1110 0.1508 0.1290 0.0077
$10^{-4}$ 0.1858 0.1659 0.2096 0.1878 0.0162
$10^{-3}$ 0.2781 0.2492 0.3026 0.2785 0.0367

θ = 0.1, d = −0.5
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.1475 0.1257 0.1725 0.1452 0.0112
$10^{-5}$ 0.2114 0.1887 0.2390 0.2128 0.0214
$10^{-4}$ 0.3059 0.2768 0.3373 0.3061 0.0405
$10^{-3}$ 0.4278 0.3988 0.4587 0.4319 0.0820

θ = 0.2, d = −0.4
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.2510 0.2321 0.2796 0.2598 0.0522
$10^{-5}$ 0.3389 0.3200 0.3756 0.3528 0.0820
$10^{-4}$ 0.4587 0.4446 0.4938 0.4768 0.1311
$10^{-3}$ 0.6030 0.5961 0.6349 0.6248 0.2170

θ = 0.5, d = −0.25
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.3220 0.3804 0.3577 0.4076 0.2244
$10^{-5}$ 0.4310 0.5050 0.4675 0.5348 0.3132
$10^{-4}$ 0.5657 0.6483 0.5964 0.6722 0.4256
$10^{-3}$ 0.7177 0.7879 0.7443 0.8063 0.5649

Tables 5 and 6 summarize the corresponding power results under an unequal case-control ratio: ne = 1700, nc = 3300. With equal proportions of protective and deleterious variants, the proposed LRT based SKAT offered more improvement over the score test based SKAT under the equal case-control ratio (Table 3 versus 5). With unequal proportions of protective and deleterious variants, the proposed LRT based SKAT offered more improvement over the score test based SKAT under the unequal case-control ratio (Table 4 versus 6). The results are in agreement with the observations of Ma et al. (2013), who showed that the performance of the single rare variant LRT depends on the case-control ratio.

Table 5.

Power comparison of rare variant set analysis: ne = 1700, nc = 3300, equal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

θ = 0.05, d = −0.6
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.1624 0.1363 0.1655 0.1356 0.0052
$10^{-5}$ 0.2171 0.1913 0.2244 0.1942 0.0095
$10^{-4}$ 0.2933 0.2640 0.3050 0.2728 0.0229
$10^{-3}$ 0.4027 0.3734 0.4191 0.3850 0.0531

θ = 0.1, d = −0.5
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.2364 0.2003 0.2460 0.2036 0.0123
$10^{-5}$ 0.3238 0.2890 0.3337 0.2937 0.0230
$10^{-4}$ 0.4325 0.3896 0.4471 0.4086 0.0488
$10^{-3}$ 0.5700 0.5365 0.5843 0.5488 0.0914

θ = 0.2, d = −0.4
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.3103 0.2631 0.3253 0.2666 0.0211
$10^{-5}$ 0.4249 0.3823 0.4414 0.3896 0.0383
$10^{-4}$ 0.5643 0.5240 0.5840 0.5372 0.0702
$10^{-3}$ 0.7175 0.6831 0.7338 0.6946 0.1281

θ = 0.5, d = −0.25
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.2638 0.2194 0.2786 0.2251 0.0166
$10^{-5}$ 0.3844 0.3380 0.4022 0.3501 0.0335
$10^{-4}$ 0.5355 0.4904 0.5559 0.5093 0.0652
$10^{-3}$ 0.7131 0.6758 0.7303 0.6914 0.1238

Table 6.

Power comparison of rare variant set analysis: ne = 1700, nc = 3300, unequal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

θ = 0.05, d = −0.6
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.0402 0.0314 0.0633 0.0482 0.0007
$10^{-5}$ 0.0710 0.0564 0.1000 0.0842 0.0022
$10^{-4}$ 0.1176 0.1006 0.1548 0.1355 0.0050
$10^{-3}$ 0.1964 0.1743 0.2370 0.2153 0.0177

θ = 0.1, d = −0.5
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.0746 0.0598 0.1093 0.0878 0.0036
$10^{-5}$ 0.1242 0.1027 0.1686 0.1474 0.0078
$10^{-4}$ 0.2009 0.1785 0.2535 0.2306 0.0177
$10^{-3}$ 0.3177 0.2910 0.3784 0.3512 0.0470

θ = 0.2, d = −0.4
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.1288 0.1120 0.1856 0.1675 0.0166
$10^{-5}$ 0.2074 0.1922 0.2762 0.2632 0.0355
$10^{-4}$ 0.3185 0.3103 0.3909 0.3816 0.0663
$10^{-3}$ 0.4728 0.4636 0.5423 0.5367 0.1360

θ = 0.5, d = −0.25
α SKAT SKAT-O SKATL SKAT-OL Burden
$10^{-6}$ 0.1548 0.2098 0.2339 0.2961 0.1181
$10^{-5}$ 0.2510 0.3333 0.3366 0.4270 0.1913
$10^{-4}$ 0.3819 0.4948 0.4700 0.5760 0.3032
$10^{-3}$ 0.5563 0.6708 0.6412 0.7365 0.4561

Diabetes study

The Atherosclerosis Risk in Communities (ARIC) study (The ARIC Investigators, 1989) is a multi-center prospective investigation of atherosclerotic disease in a predominantly bi-racial population. Men and women aged 45–64 years at baseline were recruited from four U.S. communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination in 1987–1989. The vast majority of ARIC participants are of European (73%) or African ancestry (26%).

We applied the proposed SKATL and other competing methods in ARIC to test for association between type 2 diabetes (T2D) and rare variants in each gene. Genotypes were obtained from the Illumina HumanExome BeadChip (Grove et al., 2013), which has information on 247,870 variants. Prevalent T2D was defined as in previous GWAS analyses using phenotypic information collected at the baseline examination (Morris et al., 2012). Exome chip data were analyzed for 1048 white T2D cases and 6598 white non-cases.

We conducted two different analyses of T2D and adjusted for age, gender and center. First, we analyzed the rare variants (with MAF≤ 0.01 and at least five copies in the total sample) in the gene PAM, which has been recently identified to contain a rare missense variant that contributes to the risk of T2D (Steinthorsdottir et al., 2014). Second, we ran a genome-wide scan and tested the association of rare variants located in each gene.

For the eight rare variants located in the gene PAM and available on the exome chip, the proposed SKATL has a p-value of 0.039, and SKAT’s p-value is 0.115. The burden test has a p-value of 0.894. For the data adaptive tests, the proposed SKAT-OL has a p-value of 0.072, and SKAT-O’s p-value is 0.196.

In total we analyzed 11426 rare variant sets in the genome-wide scan for T2D. SKATL identified a significant set with three rare variants in the gene ZZZ3 (p-value = $1.4 \times 10^{-6}$) that passed genome-wide significance after a Bonferroni correction for the total number of sets ($4.4 \times 10^{-6}$). The other tests did not identify any genome-wide significant rare variant set; SKAT reported a p-value of $2.7 \times 10^{-5}$ for the gene ZZZ3. ZZZ3 is a protein-coding gene which is a component of the ATAC complex, a complex with histone acetyltransferase activity on histones H3 and H4. A common variant in ZZZ3 was recently found to be associated with obesity and body mass index in a genome-wide meta-analysis of 263,407 European individuals (Berndt et al., 2013). Obesity is a major risk factor for T2D. This suggests that ZZZ3 is likely involved in T2D. Further research is needed on the possible role of the identified rare variants in the gene ZZZ3.
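The Bonferroni threshold quoted above is consistent with a family-wise level of 0.05 divided by the number of sets. The level 0.05 is an assumption of this sketch, since the text reports only the resulting threshold.

```python
n_sets = 11426
alpha = 0.05                    # assumed family-wise level (not stated in the text)
threshold = alpha / n_sets      # about 4.4e-6

# ZZZ3 p-values reported above.
p_values = {"SKATL": 1.4e-6, "SKAT": 2.7e-5}
significant = {name: p < threshold for name, p in p_values.items()}
```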

Discussion

To enrich association signals for rare variants, it is attractive and often customary to combine multiple rare variants in a gene. The widely used SKAT is powerful and computationally efficient, combining rare variants based on the variance component score test. The proposed SKATL is based on the observation that the score statistics used in SKAT are asymptotically equivalent to the LRT statistics in the marginal regression modeling of individual rare variants, and that the score test performs well when the parameter is close to the null value, but could have suboptimal performance under large deviations from the null (e.g., when testing rare variants with large effect sizes). We developed efficient algorithms to compute p-values based on the asymptotic distribution of the proposed SKATL and SKAT-OL. In our extensive numerical studies, the proposed SKATL and SKAT-OL have well controlled type I errors and have shown very competitive performance.

Our approach is in the same spirit as Xing et al. (2012), Ma et al. (2013) and Chen et al. (2014), who have shown that the likelihood ratio test often has better performance than the score test for either single rare variant or rare variant set analysis. In practice, the score test has the computational advantage that only one null model needs to be fitted. For the ARIC diabetes data, when analyzing 1415 rare variant sets on chromosome 1 on a single Linux workstation, SKAT takes 42 sec CPU time, SKAT-O takes 674 sec, SKATL takes 151 sec, and SKAT-OL takes 630 sec. In the supplementary material, we provide more timing comparisons of the score test versus LRT based SKAT in numerical studies. The proposed approach can be readily extended to handle cross-study meta-analyses of gene-level tests, and the analysis of multiple traits. In summary, we advocate using the proposed method as a complementary approach to enhance the power of detecting associations for rare variants in case-control genome-wide association studies.

We have implemented the proposed methods in R programs posted at http://www.umn.edu/~baolin/research/skatl_Rcode.html

Supplementary Material

Supp Material S1

Acknowledgements

This research was supported in part by NIH grant GM083345 and CA134848. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We want to thank the editor and reviewers for their constructive comments which have greatly improved the presentation of the paper.

The ARIC Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute (NHLBI) contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN2682011000010C, HHSN2682011000011C, and HHSN2682011000012C). The authors thank the staff and participants of the ARIC study for their important contributions. Support for exome chip genotyping in the ARIC Study was provided by the National Institutes of Health (NIH) American Recovery and Reinvestment Act of 2009 (ARRA) (5RC2HL102419).

APPENDIX

Null distribution of SKAT-OL

The significance of SKAT-OL can be computed following the approach of Lee et al. (2012). Denote $\tilde{R} = WRW$. Define a symmetric matrix Rh such that $R_h R_h = \tilde{R}$. Let Z = (z1, …, zm)′ be independent standard normal random variables. Then the null distribution of Lρ is the same as that of $L_\rho = Z' R_h [(1 - \rho) I + \rho J] R_h Z$. Denote $R_1 = R_h \mathbf{1}$, where $\mathbf{1} = (1, \ldots, 1)'$ is a column vector of ones. Note that $H_1 = R_h J R_h / (R_1' R_1)$ is the projection matrix onto the space spanned by R1. Therefore Z1 = H1Z and Z2 = (I − H1)Z are independent. Define $\eta_2 = Z_2' \tilde{R} Z_2$, $\eta_1 = Z_1' \tilde{R} Z_2$, and $\eta_0 = Z_1' Z_1$. Here η2 follows a mixture of 1-DF chi-square distributions with coefficients being the eigenvalues of $(I - H_1)\tilde{R}(I - H_1)$, denoted as (λ1, …, λm). Note Cov(η1, η2) = Cov(η1, η0) = 0, E(η1) = 0, and $\mathrm{Var}(\eta_1) = \mathrm{tr}[H_1 \tilde{R} (I - H_1) \tilde{R}]$. We can check that

$$L_\rho = (1 - \rho)(\eta_2 + 2\eta_1) + \tau_\rho \eta_0, \qquad \tau_\rho = \rho \|R_1\|^2 + (1 - \rho) R_1' \tilde{R} R_1 / \|R_1\|^2.$$
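The decomposition of Lρ can be verified numerically for a single draw of Z. The correlation matrix and weights below are randomly generated or hypothetical, and the symmetric square root Rh is obtained from an eigendecomposition, which is one of several valid choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
A = rng.normal(size=(m, m))
R = A @ A.T
d = np.sqrt(np.diag(R))
R = R / np.outer(d, d)                       # hypothetical correlation matrix
W = np.diag([1.0, 0.7, 1.3, 0.9])            # hypothetical weights
Rt = W @ R @ W                               # R-tilde = W R W

# Symmetric square root R_h with R_h R_h = R-tilde.
vals, vecs = np.linalg.eigh(Rt)
Rh = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

one = np.ones(m)
J = np.outer(one, one)
R1 = Rh @ one
H1 = Rh @ J @ Rh / (R1 @ R1)                 # projection onto span(R1)

Z = rng.normal(size=m)
Z1, Z2 = H1 @ Z, (np.eye(m) - H1) @ Z
eta2, eta1, eta0 = Z2 @ Rt @ Z2, Z1 @ Rt @ Z2, Z1 @ Z1

def L_direct(rho):
    """Quadratic form Z' R_h [(1 - rho) I + rho J] R_h Z."""
    return Z @ Rh @ ((1 - rho) * np.eye(m) + rho * J) @ Rh @ Z

def L_decomposed(rho):
    """(1 - rho)(eta2 + 2 eta1) + tau_rho * eta0."""
    tau = rho * (R1 @ R1) + (1 - rho) * (R1 @ Rt @ R1) / (R1 @ R1)
    return (1 - rho) * (eta2 + 2 * eta1) + tau * eta0
```

Evaluating both functions on a grid of ρ values gives matching results, including the boundary case ρ = 1 where only the η0 term remains.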

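Let Lρ1, …, Lρb be the test statistics computed with 0 = ρ1 < ρ2 < … < ρb = 1. Denote qρ as the (1 − T)-th percentile of the distribution of Lρ, which can be computed based on moment matching (Liu et al., 2009). Let $\tilde{q}_1 = F^{-1}(1 - T \mid \chi_1^2)$, where $F(\cdot \mid \chi_1^2)$ is the distribution function of the 1-DF chi-square distribution. Note that $L_1 = \|R_1\|^2 \eta_0$. Hence $q_1 = \|R_1\|^2 \tilde{q}_1$. The p-value based on the test statistic T is

$$1 - \Pr(L_{\rho_1} < q_{\rho_1}, \ldots, L_{\rho_b} < q_{\rho_b}) = 1 - E\left[ \Pr\left( \eta_2 + 2\eta_1 < \min_{\upsilon < b} \frac{q_{\rho_\upsilon} - \tau_{\rho_\upsilon} \eta_0}{1 - \rho_\upsilon} \,\Big|\, \eta_0 \right) I(\eta_0 < \tilde{q}_1) \right],$$

where η0 follows the 1-DF chi-square distribution, and I(·) is an indicator function. Denote $\mu = \sum_{j=1}^m \lambda_j$, $\sigma^2 = 2\sum_{j=1}^m \lambda_j^2$, and $\sigma_0^2 = \sigma^2 + 4\,\mathrm{tr}[\tilde{R} H_1 \tilde{R}(I - H_1)]$. Let

$$\delta(x) = \left( \min_{\upsilon < b} \frac{q_{\rho_\upsilon} - \tau_{\rho_\upsilon} x}{1 - \rho_\upsilon} - \mu \right) \frac{\sigma}{\sigma_0} + \mu.$$

The p-value is computed as

$$1 - \int_0^{\tilde{q}_1} M(\delta(x))\, f(x \mid \chi_1^2)\, dx,$$

where $f(\cdot \mid \chi_1^2)$ is the density of the 1-DF chi-square distribution, and M(·) is the distribution function of the 1-DF chi-square mixture with coefficients (λ1, …, λm). Here we want to emphasize that special care is needed for ρ = 1. When ρb = 1 is included in the minimum p-value search, we have an indicator $I(\eta_0 < \tilde{q}_1)$ in the expectation, and the integration is over the interval $[0, \tilde{q}_1]$. Otherwise the integration is over $[0, \tilde{q}_1 = \infty)$.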

References

  1. Berndt SI, Gustafsson S, Magi R, Ganna A, Wheeler E, Feitosa MF, et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nature Genetics. 2013;45(5):501–512. doi: 10.1038/ng.2606.
  2. Bonnefond A, Clement N, Fawcett K, Yengo L, Vaillant E, Guillaume JL, et al. Rare MTNR1B variants impairing melatonin receptor 1B function contribute to type 2 diabetes. Nature Genetics. 2012;44(3):297–301. doi: 10.1038/ng.1053.
  3. Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genetic Epidemiology. 2014;38(3):191–197. doi: 10.1002/gepi.21791.
  4. Cox DR, Hinkley DV. Theoretical Statistics. CRC Press; 1979.
  5. Davies RB. Algorithm AS 155: the distribution of a linear combination of χ2 random variables. Applied Statistics. 1980;29(3):323.
  6. Estrada K, Aukrust I, Bjorkhaug L, Burtt NP, Mercader JM, et al. Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino population. JAMA. 2014;311(22):2305–2314. doi: 10.1001/jama.2014.6511.
  7. Grove ML, Yu B, Cochran BJ, Haritunians T, Bis JC, Taylor KD, et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE consortium. PLoS One. 2013;8(7):e68095. doi: 10.1371/journal.pone.0068095.
  8. Hauck WW, Donner A. Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association. 1977;72(360):851.
  9. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
  10. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics. 2008;83(3):311–321. doi: 10.1016/j.ajhg.2008.06.024.
  11. Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. The American Journal of Human Genetics. 2011;89(3):354–367. doi: 10.1016/j.ajhg.2011.07.015.
  12. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
  13. Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis. 2009;53(4):853–856.
  14. Ma C, Blackwell T, Boehnke M, Scott LJ, GoT2D Investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genetic Epidemiology. 2013;37(6):539–550. doi: 10.1002/gepi.21742.
  15. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
  16. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
  17. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007;615(1–2):28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  18. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics. 2012;44(9):981–990. doi: 10.1038/ng.2383.
  19. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology. 2010;34(2):188–193. doi: 10.1002/gepi.20450.
  20. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genetics. 2011;7(3):e1001322. doi: 10.1371/journal.pgen.1001322.
  21. Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genetic Epidemiology. 2009;33(6):497–507. doi: 10.1002/gepi.20402.
  22. Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005.
  23. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795.
  24. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305.
  25. Steinthorsdottir V, Thorleifsson G, Sulem P, Helgason H, Grarup N, et al. Identification of low-frequency and rare sequence variants associated with elevated or reduced risk of type 2 diabetes. Nature Genetics. 2014;46(3):294–298. doi: 10.1038/ng.2882.
  26. The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology. 1989;129(4):687–702.
  27. Tzeng JY, Zhang D. Haplotype-based association analysis via variance-components score test. The American Journal of Human Genetics. 2007;81(5):927–938. doi: 10.1086/521558.
  28. Wang Y, McKay JD, Rafnar T, Wang Z, Timofeeva MN, Broderick P, et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nature Genetics. 2014;46(7):736–741. doi: 10.1038/ng.3002.
  29. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-Set analysis for case-control genome-wide association studies. The American Journal of Human Genetics. 2010;86(6):929–942. doi: 10.1016/j.ajhg.2010.05.002.
  30. Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  31. Xing G, Lin CY, Wooding SP, Xing C. Blindly using Wald’s test can miss rare disease-causal variants in case-control association studies. Annals of Human Genetics. 2012;76(2):168–177. doi: 10.1111/j.1469-1809.2011.00700.x.
  32. Zhan X, Larson DE, Wang C, Koboldt DC, Sergeev YV, Fulton RS, et al. Identification of a rare coding variant in complement 3 associated with age-related macular degeneration. Nature Genetics. 2013;45(11):1375–1379. doi: 10.1038/ng.2758.
