Sequence kernel association analysis of rare variant set based on the marginal regression model for binary traits

Baolin Wu; James S Pankow; Weihua Guan

doi:10.1002/gepi.21913

Author manuscript; available in PMC: 2016 Sep 1.

Published in final edited form as: Genet Epidemiol. 2015 Sep;39(6):399–405. doi: 10.1002/gepi.21913


PMCID: PMC4544778 NIHMSID: NIHMS703118 PMID: 26282996

Abstract

Recent sequencing efforts have focused on exploring the influence of rare variants on complex diseases. Gene-level tests that aggregate information across rare variants within a gene have become attractive for enriching the rare variant association signal. Among them, the sequence kernel association test has proved to be a very powerful method for jointly testing multiple rare variants within a gene. In this article, we explore an alternative sequence kernel association test. We propose to use the univariate likelihood ratio statistics from the marginal model for individual variants as input into the kernel association test. We show how to compute its significance p-value efficiently based on the asymptotic chi-square mixture distribution. We demonstrate through extensive numerical studies that the proposed method has competitive performance. Its usefulness is further illustrated with an application to associations between rare exonic variants and type 2 diabetes in the Atherosclerosis Risk in Communities (ARIC) Study. We identified an exome-wide significant rare variant set in the gene ZZZ3 worthy of further investigation.

Keywords: GWAS, SKAT, Score statistic, Sequencing data

Introduction

In GWAS, observed effect sizes for common variants have typically been quite small. In combination they explain a small proportion of the phenotypic variance. Manolio et al. (2009) have suggested that rare variants could have substantial effect sizes without demonstrating clear Mendelian segregation, and could contribute substantially to missing heritability. Individual rare variant based tests typically lack power due to low minor allele frequencies, and gene-level association tests that aggregate information across rare variants within a gene have become attractive for enriching the association signal. An intuitive and simple approach to aggregating signals across rare variants collapses the rare variants into a burden score to be linked to the phenotype (Morgenthaler and Thilly, 2007; Madsen and Browning, 2009; Morris and Zeggini, 2010; Price et al., 2010; Lin and Tang, 2011). The combined multivariate and collapsing (CMC) method extends the burden test by collapsing rare variants in a region within subgroups defined according to their minor allele frequencies (MAFs) (Li and Leal, 2008). The variable threshold (VT) method is a data adaptive burden test that chooses an optimal MAF threshold (Price et al., 2010; Lin and Tang, 2011). The burden test works well for variants with similar effects and could lose substantial power with both protective and deleterious variants, or in the presence of many non-causal variants. The sequence kernel association test (SKAT) is based on the variance component score test and works well under various combinations of protective and deleterious variants (Wu et al., 2010; Neale et al., 2011; Wu et al., 2011). A more flexible approach is SKAT-O, which adaptively combines the burden and the SKAT statistics (Lee et al., 2012). The SKAT based approach performs well and is widely used in rare variant based association tests.

Rare variants have been postulated to have large effect sizes (Manolio et al., 2009). It is likely that typical GWAS only have sufficient power to detect variants with large effects. This is indeed the case for most rare disease-causing variants identified to date (Bonnefond et al., 2012; Zhan et al., 2013; Steinthorsdottir et al., 2014; Wang et al., 2014; Estrada et al., 2014). The SKAT is based on the score test and is thus computationally very efficient. The score test performs well when the parameter is close to the null value, but could have suboptimal performance under large deviations from the null (e.g., when testing rare variants with large effect sizes).

Recently Chen et al. (2014) developed a Cox SKAT for survival outcomes and adopted the likelihood ratio test for its better performance compared to the score test in the Cox proportional hazard model. In this article, we explore an alternative sequence kernel association test for binary trait in the same spirit as Chen et al. (2014). We use the univariate likelihood ratio statistics from the marginal model for individual variants as input into the sequence kernel association test and its adaptive test. Their significance p-values can be computed efficiently based on the asymptotic chi-square mixture distribution. We demonstrate through extensive numerical studies that the proposed method has competitive performance. We illustrate the usefulness of the proposed method through an application to associations between rare variants and type 2 diabetes in the ARIC Study.

Materials and Methods

Consider a GWAS with genotype scores G, coded as (0,1,2) for the number of copies of the minor allele, disease status indicator Y, and additional covariates X, which could include ancestry covariates (e.g., ancestry indicators or principal components).

Consider n subjects sequenced in a region with m genotyped rare variants. For the i-th subject, let y_i denote the case-control status, G_i = (g_i1, …, g_im) the genotypes for the m variants, X_i = (x_i1, …, x_ip) the covariates to be adjusted. We study the disease association of rare variants based on the following logistic regression model

$$\Pr(y_{i} = 1 \mid X_{i}, G_{i}) = \operatorname{expit}(\beta_{0} + X_{i} \alpha + G_{i} \beta), \tag{1}$$

where α and β = (β₁, …, β_m)′ are the vectors of regression coefficients for the covariates and rare variants, respectively. Here expit(x) = 1/(1 + exp(−x)) is the inverse-logit function. The disease association of the m rare variants can be tested by evaluating the null hypothesis H₀ : β = 0.
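As a small illustration of model (1), the following sketch evaluates the disease probability for one subject. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not taken from the authors' R code.

```python
import math

def expit(x):
    """Inverse-logit function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def disease_probability(beta0, alpha, beta, x, g):
    """Pr(y = 1 | x, g) under logistic model (1) for a single subject."""
    eta = beta0
    eta += sum(a * xi for a, xi in zip(alpha, x))   # covariate part X_i * alpha
    eta += sum(b * gi for b, gi in zip(beta, g))    # genotype part G_i * beta
    return expit(eta)

# Hypothetical subject: two covariates, three rare variants, one minor allele carried.
x_demo = [1, 0.0]
g_demo = [0, 1, 0]
p_null = disease_probability(-3.4, [0.5, 0.5], [0.0, 0.0, 0.0], x_demo, g_demo)
p_alt = disease_probability(-3.4, [0.5, 0.5], [0.0, 1.2, 0.0], x_demo, g_demo)
print(p_null, p_alt)
```

Under H₀ : β = 0 the genotype contributes nothing to the linear predictor, so `p_null` depends on the covariates only.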

Sequence kernel association test

The sequence kernel association test (SKAT; Wu et al., 2011) is derived as a variance-component score statistic by assuming that each β_j follows an arbitrary zero-mean distribution with variance $w_{j}^{2} ψ$, where the weight w_j is fixed and typically computed based on MAF, e.g., the Wu weights w_j = Beta(f_j; 1, 25) (Wu et al., 2011). Here f_j is the MAF of G_j and Beta is the beta distribution density function. Under this assumption, the null hypothesis H₀ : β = 0 is equivalent to H₀ : ψ = 0.

Let y = (y₁, …, y_n)′ denote the response vector, X the n × p covariates matrix, $G = (G_{1}^{'}, \dots, G_{n}^{'})'$ the n × m genotype matrix, W = diag(w₁, …, w_m) the diagonal matrix of weights. The SKAT statistic can be computed as

$$Q = (y - \hat{\pi}_{0})' G W W G' (y - \hat{\pi}_{0}),$$

where π̂₀ = (π̂₁, …, π̂_n)′ with ${\hat{π}}_{i} = \hat{Pr} (y_{i} = 1 | X_{i}, G_{i})$ derived under the null model (β = 0). Let V₀ = diag{π̂₀(1 − π̂₀)} denote the n × n diagonal matrix of marginal variances, and X₀ = (1, X) the n × (p + 1) null model design matrix. Define $P = V_{0} - V_{0} X_{0} {(X_{0}^{'} V_{0} X_{0})}^{- 1} X_{0}^{'} V_{0}$, which is the asymptotic covariance matrix Cov(y − π̂₀). Under the null, Q follows a mixture of 1-DF chi-square distributions (Liu et al., 2007; Tzeng and Zhang, 2007), with the mixture coefficients being the eigenvalues of $P^{1/2} G W W G' P^{1/2}$, which is of dimension n × n. The p-value can be obtained by matching moments (Liu et al., 2009) or by inverting the characteristic function (Davies, 1980).
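To make the chi-square mixture concrete, the sketch below approximates the tail probability Pr(Σ_j λ_j χ²_{1,j} ≥ q) by plain Monte Carlo simulation. This is deliberately not the moment-matching or Davies method cited above; it is a simple stand-in that is only practical for moderate p-values, and the function names and example coefficients are hypothetical.

```python
import math
import random

def mixture_pvalue_mc(q, lambdas, n_draws=100_000, seed=0):
    """Monte Carlo estimate of Pr(sum_j lambda_j * Z_j^2 >= q), Z_j iid N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if draw >= q:
            exceed += 1
    return exceed / n_draws

def chi2_1df_sf(q):
    """Exact upper tail of the 1-DF chi-square: Pr(Z^2 >= q) = erfc(sqrt(q / 2))."""
    return math.erfc(math.sqrt(q / 2.0))

# With a single unit coefficient the mixture is an ordinary 1-DF chi-square,
# which gives a sanity check against the exact tail probability.
p_single = mixture_pvalue_mc(3.841, [1.0])
p_mix = mixture_pvalue_mc(10.0, [2.0, 1.0, 0.5])
print(p_single, chi2_1df_sf(3.841), p_mix)
```

For the genome-wide significance levels considered in this article, simulation of this kind would need far too many draws, which is why analytic approximations are used in practice.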

The SKAT statistic can be equivalently derived based on the score vector U for β (Pan, 2009). We can check that U = G′(y − π̂₀). Under the null, the score vector U is asymptotically zero-mean multivariate normal with a covariance matrix that can be consistently estimated by (Cox and Hinkley, 1979)

$$\Sigma = G' V_{0} G - G' V_{0} X_{0} (X_{0}' V_{0} X_{0})^{-1} X_{0}' V_{0} G = G' P G, \tag{2}$$

which accounts for the linkage disequilibrium among variants. The SKAT statistic can be equivalently written as Q = U′WWU. Hence the mixture coefficients can be equivalently computed based on the eigenvalues of $Σ^{1/2} W W Σ^{1/2}$, which is an m × m matrix. Note that m is typically much smaller than n, so the eigenvalues can be computed very efficiently.
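The score-vector form Q = U′WWU can be sketched in a few lines. To keep the example self-contained it assumes an intercept-only null model (no covariates X), in which case π̂₀ is the sample case fraction and P reduces to v(I − 11′/n) with v = π̂₀(1 − π̂₀). The function names and toy data are hypothetical.

```python
def skat_q(y, G, weights):
    """Return (Q, U) with U = G'(y - pi0_hat) and Q = sum_j w_j^2 U_j^2.

    Assumes an intercept-only null model, so pi0_hat is the case fraction.
    """
    n, m = len(y), len(G[0])
    pi0 = sum(y) / n
    resid = [yi - pi0 for yi in y]
    U = [sum(G[i][j] * resid[i] for i in range(n)) for j in range(m)]
    Q = sum((w * u) ** 2 for w, u in zip(weights, U))
    return Q, U

def score_covariance(y, G):
    """Sigma = G'PG from (2) for the intercept-only null model."""
    n, m = len(y), len(G[0])
    pi0 = sum(y) / n
    v = pi0 * (1.0 - pi0)
    col_mean = [sum(G[i][j] for i in range(n)) / n for j in range(m)]
    return [[v * sum((G[i][j] - col_mean[j]) * (G[i][k] - col_mean[k])
                     for i in range(n))
             for k in range(m)] for j in range(m)]

# Toy data: 8 subjects (3 cases), 2 rare variants.
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
G_demo = [[1, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
Q_demo, U_demo = skat_q(y_demo, G_demo, [1.0, 2.0])
Sigma_demo = score_covariance(y_demo, G_demo)
print(Q_demo, U_demo, Sigma_demo)
```

Because Σ is only m × m, the eigenvalues of WΣW needed for the null distribution are cheap to obtain even when n is in the thousands.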

Likelihood ratio test based kernel association test

For the score vector U = G′(y − π̂₀), consider its j-th element $U_{j} = G_{j}^{'} (y - {\hat{π}}_{0})$, where G_j = (g_1j, …, g_nj)′ is the j-th column of G. One can check that U_j equals the score statistic for testing the significance of the j-th SNP based on the following marginal logistic model

$$\Pr(y_{i} = 1 \mid X_{i}, g_{ij}) = \operatorname{expit}(\beta_{0j} + X_{i} \alpha_{j} + g_{ij} \beta_{1j}). \tag{3}$$

Alternatively, we can employ the likelihood ratio test (LRT) to assess the marginal significance of the j-th SNP. Under the null, the score test is asymptotically equivalent to the LRT. However, the LRT could be more powerful than the score test if the j-th rare variant has a potentially large effect size, which can occur when it is either a risk variant or in linkage disequilibrium with other risk variants.

We propose to develop a marginal LRT based sequence kernel association test (denoted as SKAT_L) as follows. Denote χ_j as the LRT chi-square statistic for testing β_1j under model (3). Let $S_{j} = sign ({\hat{β}}_{1 j}) \sqrt{χ_{j}}$ and S = (S₁, …, S_m)′, where β̂_1j is the maximum likelihood estimator (MLE). Define the SKAT_L statistic

$$L = S' W W S = \sum_{j = 1}^{m} w_{j}^{2} \chi_{j}.$$

Under the null of no rare variant effects (all β_j = 0), we have β_1j = 0, and S_j is asymptotically equivalent to the standardized U_j. Let $R = \mathrm{diag}(Σ)^{-1/2} \, Σ \, \mathrm{diag}(Σ)^{-1/2}$, which is the correlation matrix corresponding to Σ in (2). The null distribution of L is a mixture of 1-DF chi-square distributions with mixture coefficients being the eigenvalues of $R^{1/2} W W R^{1/2}$.

Note that the SKAT_L only depends on the LRT chi-square statistic, and in principle we do not need the MLE β̂_1j, which could have convergence issues and aberrant testing behavior (Hauck and Donner, 1977). When computing the SKAT_L in our numerical studies, we set χ_j equal to the squared standardized score statistics for extremely rare variants (specifically with minor allele count less than ten).
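A minimal sketch of the SKAT_L statistic follows. It assumes no covariates, so that marginal model (3) has only an intercept and one genotype term, and it fits that model by a hand-written Newton-Raphson iteration. The function names and toy data are hypothetical, and the sketch omits the fallback to the squared standardized score statistic that the text describes for extremely rare variants, so it will fail for monomorphic or perfectly separated variants.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def loglik(b0, b1, g, y):
    """Bernoulli log-likelihood of the marginal logistic model."""
    total = 0.0
    for gi, yi in zip(g, y):
        p = expit(b0 + b1 * gi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def marginal_lrt(g, y, n_iter=50):
    """Return (chi, S): LRT chi-square statistic and its signed square root."""
    ybar = sum(y) / len(y)
    b0, b1 = math.log(ybar / (1.0 - ybar)), 0.0   # null MLE as starting value
    ll_null = loglik(b0, 0.0, g, y)
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for gi, yi in zip(g, y):
            p = expit(b0 + b1 * gi)
            v = p * (1.0 - p)
            s0 += yi - p                # score for the intercept
            s1 += gi * (yi - p)         # score for the genotype effect
            h00 += v
            h01 += gi * v
            h11 += gi * gi * v
        det = h00 * h11 - h01 * h01     # determinant of the 2 x 2 information matrix
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    chi = max(0.0, 2.0 * (loglik(b0, b1, g, y) - ll_null))
    return chi, math.copysign(math.sqrt(chi), b1)

def skat_l(G_cols, y, weights):
    """L = sum_j w_j^2 chi_j, together with the signed roots S_j."""
    stats = [marginal_lrt(g, y) for g in G_cols]
    L = sum(w * w * chi for w, (chi, _) in zip(weights, stats))
    return L, [s for _, s in stats]

# Toy data: 12 subjects, two variants given as genotype columns.
y_demo = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
g_first = [0] * 8 + [1] * 4
g_second = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
chi_1, s_1 = marginal_lrt(g_first, y_demo)
L_demo, S_demo = skat_l([g_first, g_second], y_demo, [1.0, 1.0])
print(chi_1, L_demo, S_demo)
```

The sign of S_j is not needed for L itself, but it matters for the weighted statistic L_ρ, where the burden-type component sums the signed roots.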

Data adaptive kernel association test

An alternative approach to aggregating signals across rare variants is the burden test (Li and Leal, 2008; Madsen and Browning, 2009). The burden test is typically computed as the weighted sum of score statistics. The burden test works well for variants with similar effects and could lose substantial power in the presence of a large number of non-causal variants, or with both protective and deleterious variants. A more flexible approach is to data adaptively combine the burden test and the kernel association test following the SKAT-O approach of Lee et al. (2012), which tests the rare variant effects using the minimum p-value of the weighted SKAT statistics (y − π̂₀)′K_ρ(y − π̂₀), where K_ρ = GW[(1 − ρ)I + ρJ]WG′, ρ ∈ [0, 1]. Here I is the m × m identity matrix and J is the m × m matrix with all elements equal to one.

Similarly we consider the following weighted SKAT_L statistic

$$L_{\rho} = S' W [(1 - \rho) I + \rho J] W S, \quad \rho \in [0, 1].$$

Given ρ, the significance p-value of L_ρ, P-val(L_ρ), can be similarly computed based on the 1-DF chi-square mixture distribution with coefficients being the eigenvalues of $R^{1/2} W [(1 - ρ) I + ρ J] W R^{1/2}$. The data adaptive SKAT_L statistic (denoted as SKAT-O_L) is defined as the minimum p-value, $T = \min_{0 \le ρ \le 1}$ P-val(L_ρ), where the minimum is often taken over a finite grid of ρ: 0 = ρ₁ < … < ρ_b = 1, and the significance of T can be efficiently computed using a one-dimensional numerical integration (Lee et al., 2012). We discuss computational details in the following section.
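Because J is the all-ones matrix, the weighted statistic expands to L_ρ = (1 − ρ) Σ_j (w_j S_j)² + ρ (Σ_j w_j S_j)², which the sketch below evaluates directly. The function name and the example signed roots are hypothetical.

```python
def weighted_skat_l(S, weights, rho):
    """L_rho = S' W [(1 - rho) I + rho J] W S, expanded without forming matrices."""
    ws = [w * s for w, s in zip(weights, S)]
    kernel_part = sum(v * v for v in ws)   # rho = 0: SKAT_L-type sum of squares
    burden_part = sum(ws) ** 2             # rho = 1: squared weighted sum
    return (1.0 - rho) * kernel_part + rho * burden_part

# Signed roots with mixed directions: the burden-type part largely cancels.
S_demo = [2.0, -2.0, 1.0]
w_demo = [1.0, 1.0, 1.0]
for rho in (0.0, 0.5, 1.0):
    print(rho, weighted_skat_l(S_demo, w_demo, rho))
```

With effects in opposite directions the ρ = 1 statistic is much smaller than the ρ = 0 statistic, mirroring the power loss of the burden test described above.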

P-value computation for kernel association tests

We offer some insights into the efficient p-value computation for SKAT, SKAT_L and the data adaptive kernel association tests. First note that the non-zero eigenvalues of AA′ are the same as those of A′A for any matrix A, which can be verified from the singular value decomposition of A: $A = U_{A} D_{A} V_{A}^{'}$, where U_A and V_A are orthogonal and D_A is a diagonal matrix. Therefore $A A' = U_{A} D_{A}^{2} U_{A}^{'}$ and $A' A = V_{A} D_{A}^{2} V_{A}^{'}$, and hence their non-zero eigenvalues equal the squared singular values of A. So for computing the p-values of the proposed SKAT_L, the eigenvalues of $R^{1/2} W W R^{1/2}$ can be equivalently computed from WRW. For SKAT, the non-zero eigenvalues of $P^{1/2} G W W G' P^{1/2}$ are the same as those of WG′PGW = WΣW.

For matrix B = (1 − ρ)I + ρJ, ρ ∈ [0, 1], we can check that $B = B_{h}^{2}$, where $B_{h} = \sqrt{1 - ρ} \, I + \frac{\sqrt{1 + (m - 1) ρ} - \sqrt{1 - ρ}}{m} J$. Therefore for computing p-values of the weighted SKAT_L, the eigenvalues of $R^{1/2} W [(1 - ρ) I + ρ J] W R^{1/2}$ can be equivalently computed from $B_{h} W R W B_{h}$.
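The identity B = B_h² can be checked numerically. The sketch below builds B_h from the closed form above and squares it with a plain nested-loop matrix product; the helper names are hypothetical.

```python
import math

def b_half(m, rho):
    """B_h = sqrt(1 - rho) I + [(sqrt(1 + (m - 1) rho) - sqrt(1 - rho)) / m] J."""
    a = math.sqrt(1.0 - rho)
    b = (math.sqrt(1.0 + (m - 1) * rho) - a) / m
    return [[(a if i == j else 0.0) + b for j in range(m)] for i in range(m)]

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

m_demo, rho_demo = 4, 0.3
Bh = b_half(m_demo, rho_demo)
B_from_square = matmul(Bh, Bh)
# Target matrix B = (1 - rho) I + rho J: ones on the diagonal, rho elsewhere.
B_target = [[1.0 if i == j else rho_demo for j in range(m_demo)]
            for i in range(m_demo)]
max_err = max(abs(B_from_square[i][j] - B_target[i][j])
              for i in range(m_demo) for j in range(m_demo))
print(max_err)
```

The closed form works because I and J commute, so B_h only needs the square roots of the two distinct eigenvalues of B, namely 1 − ρ and 1 + (m − 1)ρ.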

Null distribution of SKAT-O_L

The significance of SKAT-O_L can be computed as (see Appendix for technical details)

$$1 - \int_{0}^{\tilde{q}_{1}} M(\delta(x)) \, f(x \mid \chi_{1}^{2}) \, dx,$$

where

$$\delta(x) = \left( \min_{\upsilon < b} \frac{q_{\rho_{\upsilon}} - \tau_{\rho_{\upsilon}} x}{1 - \rho_{\upsilon}} - \mu \right) \frac{\sigma}{\sigma_{0}} + \mu, \qquad \tilde{q}_{1} = F^{-1}(1 - T \mid \chi_{1}^{2}),$$

q_ρ is the (1 − T)-th percentile of the null distribution of L_ρ, $f (\cdot | χ_{1}^{2})$ and $F (\cdot | χ_{1}^{2})$ are the 1-DF chi-square density/distribution functions, and M(·) is the distribution function of the 1-DF chi-square mixture with coefficients (λ₁, …, λ_m), which are the eigenvalues of (I − H₁) R̃(I − H₁), where R̃ = WRW. Here $μ = \sum_{j = 1}^{m} λ_{j}$, $σ^{2} = 2 \sum_{j = 1}^{m} λ_{j}^{2}$, $σ_{0}^{2} = σ^{2} + 4 \, tr [\tilde{R} H_{1} \tilde{R} (I - H_{1})]$, $τ_{ρ} = ρ {‖ R_{1} ‖}^{2} + (1 - ρ) R_{1}^{'} \tilde{R} R_{1} / {‖ R_{1} ‖}^{2}$, and $H_{1} = R_{h} J R_{h} / (R_{1}^{'} R_{1})$, where R_hR_h = R̃ and R₁ = R_h(1, …, 1)′.

Results

Simulation studies

We conducted extensive simulation studies to evaluate the performance of the proposed and existing methods. Following Lee et al. (2012), we generated 10,000 European-like haplotypes of length 1000 kb under a calibrated coalescent model (Schaffner et al., 2005). We randomly paired the haplotypes to simulate a total population of 10⁶ individuals. We randomly selected a gene region of length 10 kb and studied those rare variants with MAF ≤ 0.01. We considered two covariates Z = (Z₁, Z₂)′: Z₁ ∈ {0, 1} follows Bernoulli(0.5), and Z₂ ~ N(0, 1). We modeled the disease risk as $expit (β_{0} + Z' β_{Z} + \sum_{j = 1}^{m} β_{j} G_{j})$. We set β₀ = −3.4 and β_Z = (0.5, 0.5)′ (corresponding to a 5% population disease rate). We randomly selected 2500 cases and 2500 controls from the simulated population of 10⁶ samples. We compared five rare variant set analysis methods: SKAT, SKAT-O, SKAT_L, SKAT-O_L and the burden test. In the burden and SKAT tests, we assigned weight Beta(f_j; a₀, b₀) to the j-th variant G_j, and for the proposed method we assigned weight Beta(f_j; a₁, b₁), where f_j is the MAF of G_j. For a given variant, the likelihood ratio test statistic is inherently standardized and roughly corresponds to the standardized score statistic, that is, the score statistic used in SKAT divided by its standard error, which is roughly proportional to $\sqrt{f_{j} (1 - f_{j})}$. Therefore for the proposed method, we set a₁ = a₀ + 0.5 and b₁ = b₀ + 0.5. Following Wu et al. (2011), we set a₀ = 1, b₀ = 25 for the following simulation studies. We also investigated three sets of weights for (a₀, b₀): (0.5, 24.5), (1, 25), and (1.5, 25.5); the overall conclusions remain the same (see supplementary material for complete results). As shown in Ma et al. (2013), the performance of the single rare variant LRT depends on the case-control ratio. We investigated different case-control ratios for (n_e, n_c). Here we report the results for n_e = n_c = 2500, and for n_e = 1700, n_c = 3300. The supplementary material provides simulation results for more unbalanced case-control ratios (1:6 and 1:10).
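The half-unit shift in the beta parameters can be checked numerically: Beta(f; a₀ + 0.5, b₀ + 0.5) is proportional to Beta(f; a₀, b₀)·√(f(1 − f)), so the ratio of the two is the same constant at every MAF. The sketch below uses hypothetical helper names and only the standard library.

```python
import math

def beta_pdf(f, a, b):
    """Density of the Beta(a, b) distribution at f, computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(f)
                    + (b - 1.0) * math.log(1.0 - f))

def weight_ratio(f, a0=1.0, b0=25.0):
    """Ratio of the shifted weight to the rescaled original weight at MAF f."""
    rescaled = beta_pdf(f, a0, b0) * math.sqrt(f * (1.0 - f))
    shifted = beta_pdf(f, a0 + 0.5, b0 + 0.5)
    return shifted / rescaled

w_wu = beta_pdf(0.01, 1.0, 25.0)       # Wu weight at MAF = 0.01
ratios = [weight_ratio(f) for f in (0.0005, 0.001, 0.005, 0.01)]
print(w_wu, ratios)
```

A constant factor in the weights rescales the statistic and its null mixture coefficients by the same amount, so it leaves the p-value unchanged.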

We used 2.5 × 10⁶ experiments to evaluate the type I error at the nominal significance levels α = 10⁻⁵, 10⁻⁴, and 10⁻³ by setting all β_j = 0. The results are summarized in Tables 1 and 2. All methods appropriately control the type I errors. We also verified that the type I errors are appropriately controlled at the 10⁻⁶ significance level by conducting 10⁸ experiments (please see the supplementary material for detailed results including the QQ plots).
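When reading the ratios of empirical type I error to nominal level, it helps to know their Monte Carlo uncertainty. Assuming independent experiments, so that the rejection count is binomial, the standard error of the ratio is √((1 − α)/(αN)). The sketch below, with a hypothetical function name, evaluates this for the experiment counts stated above.

```python
import math

def relative_se(alpha, n_experiments):
    """Standard error of (empirical type I error / alpha) under binomial sampling."""
    return math.sqrt((1.0 - alpha) / (alpha * n_experiments))

n_sim = 2.5e6
se_by_level = {alpha: relative_se(alpha, n_sim) for alpha in (1e-5, 1e-4, 1e-3)}
for alpha, se in se_by_level.items():
    print(alpha, round(se, 3))
```

At α = 10⁻⁵ only about 25 rejections are expected among 2.5 × 10⁶ experiments, so the ratio has a standard error of roughly 0.2 and moderate departures from 1 at that level are within simulation noise.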

Table 1.

Type I error divided by the nominal significance level for rare variant set analysis: n_e = 2500 cases and n_c = 2500 controls. The SKAT/SKAT-O and burden tests used (1,25) weight, and the proposed SKAT_L/SKAT-O_L used (1.5,25.5) weight.

α	10⁻⁵	10⁻⁴	10⁻³
SKAT	0.82	0.85	0.92
SKAT-O	0.91	1.02	1.07
SKAT_L	0.92	1.10	1.08
SKAT-O_L	0.96	1.00	1.11
Burden	0.94	0.98	1.00


Table 2.

Type I error divided by the nominal significance level for rare variant set analysis: n_e = 1700 cases and n_c = 3300 controls. The SKAT/SKAT-O and burden tests used (1,25) weight, and the proposed SKAT_L/SKAT-O_L used (1.5,25.5) weight.

α	10⁻⁵	10⁻⁴	10⁻³
SKAT	0.86	0.91	0.93
SKAT-O	0.89	0.98	1.04
SKAT_L	0.96	1.12	1.11
SKAT-O_L	0.88	1.07	1.10
Burden	0.91	0.97	1.01


We used 10⁴ experiments to evaluate the power under various combinations of β_j at α = 10⁻⁶, 10⁻⁵, 10⁻⁴, and 10⁻³. The rare variant effects β_j were set as follows. Each time we randomly selected a proportion θ of the rare variants and set their |β_j| = d log₁₀(f_j). The other (null) rare variants have zero coefficients. We thus assumed that rarer variants have larger effect sizes. We conducted simulations for (1) θ = 0.05, d = −0.6, (2) θ = 0.1, d = −0.5, (3) θ = 0.2, d = −0.4, (4) θ = 0.5, d = −0.25. They correspond to odds ratios of 3.32, 2.72, 2.23 and 1.65 for MAF = 0.01, respectively. We investigated two scenarios for the direction of causal variant effects. First, we assumed a mix of equal proportions of protective and deleterious variants, which will in general favor the kernel association test. Second, we assumed a mix of unequal proportions of protective and deleterious variants. Specifically, we randomly set the signs of β_j as negative or positive with probability 0.9 and 0.1, respectively.
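The quoted odds ratios follow directly from |β_j| = d log₁₀(f_j) evaluated at MAF = 0.01. The short sketch below reproduces them; the function name is hypothetical.

```python
import math

def odds_ratio(d, maf):
    """Odds ratio exp(|beta|) for a causal variant with |beta| = d * log10(maf)."""
    return math.exp(abs(d * math.log10(maf)))

# (theta, d) settings from the four simulation scenarios.
settings = [(0.05, -0.6), (0.1, -0.5), (0.2, -0.4), (0.5, -0.25)]
ors_at_1pct = [round(odds_ratio(d, 0.01), 2) for _, d in settings]
print(ors_at_1pct)  # [3.32, 2.72, 2.23, 1.65]

# Rarer variants receive larger effects under the same d.
print(round(odds_ratio(-0.6, 0.001), 2))
```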

Table 3 summarizes the results assuming equal proportions of protective and deleterious variants and an equal case-control ratio (n_e = n_c = 2500). Overall the proposed SKAT_L has the best performance. As expected, the burden test suffers a dramatic power loss since the effects of the causal variants cancel out in the burden sum score, and as a result the adaptive SKAT-O and SKAT-O_L have reduced performance compared to the SKAT and SKAT_L. The proposed SKAT_L has the largest power gain over SKAT with relatively large rare variant effect sizes.

Table 3.

Power comparison of rare variant set analysis: n_e = n_c = 2500, equal proportions of protective and deleterious variants. The highest powered tests in each row are bold-faced.

	θ = 0.05, d = −0.6

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.1654	0.1373	0.2000	0.1661	0.0028
10⁻⁵	0.2279	0.1995	0.2627	0.2336	0.0067
10⁻⁴	0.3080	0.2808	0.3521	0.3191	0.0181
10⁻³	0.4286	0.3994	0.4740	0.4398	0.0451

	θ = 0.1, d = −0.5

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.2469	0.2041	0.2940	0.2442	0.0081
10⁻⁵	0.3446	0.3030	0.3906	0.3506	0.0164
10⁻⁴	0.4612	0.4260	0.5051	0.4691	0.0352
10⁻³	0.6031	0.5657	0.6453	0.6092	0.0750

	θ = 0.2, d = −0.4

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.3481	0.2952	0.3965	0.3381	0.0161
10⁻⁵	0.4742	0.4239	0.5189	0.4695	0.0302
10⁻⁴	0.6122	0.5698	0.6513	0.6130	0.0634
10⁻³	0.7577	0.7283	0.7885	0.7613	0.1174

	θ = 0.5, d = −0.25

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.2959	0.2409	0.3379	0.2783	0.0139
10⁻⁵	0.4306	0.3796	0.4724	0.4196	0.0279
10⁻⁴	0.5871	0.5432	0.6272	0.5827	0.0568
10⁻³	0.7554	0.7234	0.7842	0.7542	0.1125


Table 4 summarizes the results assuming unequal proportions of protective and deleterious variants and an equal case-control ratio (n_e = n_c = 2500). The adaptive SKAT-O and SKAT-O_L now perform better than the SKAT and SKAT_L when there are relatively more causal variants (θ = 0.5). With a small proportion of causal variants, the burden test suffered a large power loss, and as a result the SKAT and SKAT_L performed better than the adaptive SKAT-O and SKAT-O_L.

Table 4.

Power comparison of rare variant set analysis: n_e = n_c = 2500, unequal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

	θ = 0.05, d = −0.6

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.0858	0.0699	0.1054	0.0842	0.0031
10⁻⁵	0.1287	0.1110	0.1508	0.1290	0.0077
10⁻⁴	0.1858	0.1659	0.2096	0.1878	0.0162
10⁻³	0.2781	0.2492	0.3026	0.2785	0.0367

	θ = 0.1, d = −0.5

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.1475	0.1257	0.1725	0.1452	0.0112
10⁻⁵	0.2114	0.1887	0.2390	0.2128	0.0214
10⁻⁴	0.3059	0.2768	0.3373	0.3061	0.0405
10⁻³	0.4278	0.3988	0.4587	0.4319	0.0820

	θ = 0.2, d = −0.4

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.2510	0.2321	0.2796	0.2598	0.0522
10⁻⁵	0.3389	0.3200	0.3756	0.3528	0.0820
10⁻⁴	0.4587	0.4446	0.4938	0.4768	0.1311
10⁻³	0.6030	0.5961	0.6349	0.6248	0.2170

	θ = 0.5, d = −0.25

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.3220	0.3804	0.3577	0.4076	0.2244
10⁻⁵	0.4310	0.5050	0.4675	0.5348	0.3132
10⁻⁴	0.5657	0.6483	0.5964	0.6722	0.4256
10⁻³	0.7177	0.7879	0.7443	0.8063	0.5649


Tables 5 and 6 summarize the corresponding power results under an unequal case-control ratio: n_e = 1700, n_c = 3300. With equal proportions of protective and deleterious variants, the proposed LRT based SKAT offered more improvement over the score test based SKAT under the equal case-control ratio (Table 3 versus 5). With unequal proportions of protective and deleterious variants, the proposed LRT based SKAT offered more improvement over the score test based SKAT under the unequal case-control ratio (Table 4 versus 6). The results are in agreement with the observations of Ma et al. (2013), who showed that the performance of the single rare variant LRT depends on the case-control ratio.

Table 5.

Power comparison of rare variant set analysis: n_e = 1700, n_c = 3300, equal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

	θ = 0.05, d = −0.6

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.1624	0.1363	0.1655	0.1356	0.0052
10⁻⁵	0.2171	0.1913	0.2244	0.1942	0.0095
10⁻⁴	0.2933	0.2640	0.3050	0.2728	0.0229
10⁻³	0.4027	0.3734	0.4191	0.3850	0.0531

	θ = 0.1, d = −0.5

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.2364	0.2003	0.2460	0.2036	0.0123
10⁻⁵	0.3238	0.2890	0.3337	0.2937	0.0230
10⁻⁴	0.4325	0.3896	0.4471	0.4086	0.0488
10⁻³	0.5700	0.5365	0.5843	0.5488	0.0914

	θ = 0.2, d = −0.4

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.3103	0.2631	0.3253	0.2666	0.0211
10⁻⁵	0.4249	0.3823	0.4414	0.3896	0.0383
10⁻⁴	0.5643	0.5240	0.5840	0.5372	0.0702
10⁻³	0.7175	0.6831	0.7338	0.6946	0.1281

	θ = 0.5, d = −0.25

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.2638	0.2194	0.2786	0.2251	0.0166
10⁻⁵	0.3844	0.3380	0.4022	0.3501	0.0335
10⁻⁴	0.5355	0.4904	0.5559	0.5093	0.0652
10⁻³	0.7131	0.6758	0.7303	0.6914	0.1238


Table 6.

Power comparison of rare variant set analysis: n_e = 1700, n_c = 3300, unequal proportions of protective and deleterious variants. The highest powered tests are bold-faced.

	θ = 0.05, d = −0.6

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.0402	0.0314	0.0633	0.0482	0.0007
10⁻⁵	0.0710	0.0564	0.1000	0.0842	0.0022
10⁻⁴	0.1176	0.1006	0.1548	0.1355	0.0050
10⁻³	0.1964	0.1743	0.2370	0.2153	0.0177

	θ = 0.1, d = −0.5

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.0746	0.0598	0.1093	0.0878	0.0036
10⁻⁵	0.1242	0.1027	0.1686	0.1474	0.0078
10⁻⁴	0.2009	0.1785	0.2535	0.2306	0.0177
10⁻³	0.3177	0.2910	0.3784	0.3512	0.0470

	θ = 0.2, d = −0.4

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.1288	0.1120	0.1856	0.1675	0.0166
10⁻⁵	0.2074	0.1922	0.2762	0.2632	0.0355
10⁻⁴	0.3185	0.3103	0.3909	0.3816	0.0663
10⁻³	0.4728	0.4636	0.5423	0.5367	0.1360

	θ = 0.5, d = −0.25

α	SKAT	SKAT-O	SKAT_L	SKAT-O_L	Burden

10⁻⁶	0.1548	0.2098	0.2339	0.2961	0.1181
10⁻⁵	0.2510	0.3333	0.3366	0.4270	0.1913
10⁻⁴	0.3819	0.4948	0.4700	0.5760	0.3032
10⁻³	0.5563	0.6708	0.6412	0.7365	0.4561


Diabetes study

The Atherosclerosis Risk in Communities (ARIC) study (The ARIC Investigators, 1989) is a multi-center prospective investigation of atherosclerotic disease in a predominantly bi-racial population. Men and women aged 45–64 years at baseline were recruited from four U.S. communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination in 1987–1989. The vast majority of ARIC participants are of European (73%) or African ancestry (26%).

We applied the proposed SKAT_L and other competing methods in ARIC to test for association between type 2 diabetes (T2D) and rare variants in each gene. Genotypes were obtained from the Illumina HumanExome BeadChip (Grove et al., 2013), which has information on 247,870 variants. Prevalent T2D was defined as in previous GWAS analyses using phenotypic information collected at the baseline examination (Morris et al., 2012). Exome chip data were analyzed for 1048 white T2D cases and 6598 white non-cases.

We conducted two different analyses of T2D and adjusted for age, gender and center. First, we analyzed the rare variants (with MAF≤ 0.01 and at least five copies in the total sample) in the gene PAM, which has been recently identified to contain a rare missense variant that contributes to the risk of T2D (Steinthorsdottir et al., 2014). Second, we ran a genome-wide scan and tested the association of rare variants located in each gene.

For the eight rare variants located in the gene PAM and available on the exome chip, the proposed SKAT_L has a p-value of 0.039, and SKAT’s p-value is 0.115. The burden test has a p-value of 0.894. For the data adaptive tests, the proposed SKAT-O_L has a p-value of 0.072, and SKAT-O’s p-value is 0.196.

In total we analyzed 11426 rare variant sets in the genome-wide scan for T2D. SKAT_L identified a significant set with three rare variants in the gene ZZZ3 (p-value = 1.4 × 10⁻⁶) that passed genome-wide significance after a Bonferroni correction for the total number of sets (threshold 4.4 × 10⁻⁶). The other tests did not identify any genome-wide significant rare variant set; in particular, SKAT reported a p-value of 2.7 × 10⁻⁵ for the gene ZZZ3. ZZZ3 is a protein-coding gene which is a component of the ATAC complex, a complex with histone acetyltransferase activity on histones H3 and H4. A common variant in ZZZ3 was recently found to be associated with obesity and body mass index in a genome-wide meta-analysis of 263,407 European individuals (Berndt et al., 2013). Obesity is a major risk factor for T2D. This suggests that ZZZ3 is likely involved in T2D. Further research is needed on the possible role of the identified rare variants in the gene ZZZ3.
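The significance threshold quoted above is consistent with a Bonferroni correction at a family-wise level of 0.05, which is an assumption here since the text states only the resulting threshold. A short sketch with a hypothetical function name:

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance threshold; family_alpha = 0.05 is an assumed level."""
    return family_alpha / n_tests

threshold = bonferroni_threshold(11426)
print(f"{threshold:.2e}")  # 4.38e-06

# Reported ZZZ3 p-values for the two tests.
p_values = {"SKAT_L": 1.4e-6, "SKAT": 2.7e-5}
significant = {name: p < threshold for name, p in p_values.items()}
print(significant)
```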

Discussion

To enrich association signals for rare variants, it is attractive and often customary to combine multiple rare variants in a gene. The widely used SKAT is powerful and computationally efficient, combining rare variants based on the variance component score test. The proposed SKAT_L is based on the observation that the score statistics used in SKAT are asymptotically equivalent to the LRT statistics in the marginal regression modeling of individual rare variants, and that the score test performs well when the parameter is close to the null value, but could have suboptimal performance under large deviations from the null (e.g., when testing rare variants with large effect sizes). We developed efficient algorithms to compute p-values based on the asymptotic distribution of the proposed SKAT_L and SKAT-O_L. In our extensive numerical studies, the proposed SKAT_L and SKAT-O_L had well controlled type I errors and showed very competitive performance.

Our approach is in the same spirit as Xing et al. (2012), Ma et al. (2013) and Chen et al. (2014), who have shown that the likelihood ratio test often has better performance than the score test for either single rare variant or rare variant set analysis. In practice, the score test has a computational advantage in that we only need to fit one null model. For the ARIC diabetes data, when analyzing 1415 rare variant sets on chromosome 1 on a single Linux workstation, SKAT takes 42 sec of CPU time, SKAT-O takes 674 sec, SKAT_L takes 151 sec, and SKAT-O_L takes 630 sec. In the supplementary material, we provide more timing comparisons of the score test versus LRT based SKAT in numerical studies. The proposed approach can be readily extended to handle cross-study meta-analyses of gene-level tests, and the analysis of multiple traits. In summary, we advocate using the proposed method as a complementary approach to enhance the power of detecting associations for rare variants in case-control genome-wide association studies.

We have implemented the proposed methods in R programs posted at http://www.umn.edu/~baolin/research/skatl_Rcode.html

Supplementary Material

Supplementary Material S1: NIHMS703118-supplement-Supp_MaterialS1.pdf (266.3 KB, PDF)

Acknowledgements

This research was supported in part by NIH grants GM083345 and CA134848. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We want to thank the editor and reviewers for their constructive comments, which have greatly improved the presentation of the paper.

The ARIC Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute (NHLBI) contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN2682011000010C, HHSN2682011000011C, and HHSN2682011000012C). The authors thank the staff and participants of the ARIC study for their important contributions. Support for exome chip genotyping in the ARIC Study was provided by the National Institutes of Health (NIH) American Recovery and Reinvestment Act of 2009 (ARRA) (5RC2HL102419).

APPENDIX

Null distribution of SKAT-O_L

The significance of SKAT-O_L can be computed following the approach of Lee et al. (2012). Denote R̃ = WRW. Define a symmetric matrix R_h such that R_hR_h = R̃. Let Z = (z₁, …, z_m)′ be independent standard normal random variables. Then the null distribution of L_ρ is the same as that of Z′R_h[(1 − ρ)I + ρJ]R_hZ. Denote R₁ = R_h1, where 1 = (1, …, 1)′ is a column vector of ones. Note that $H_{1} = R_{h} J R_{h} / (R_{1}^{'} R_{1})$ is a projection matrix onto the space spanned by R₁. Therefore Z₁ = H₁Z and Z₂ = (I − H₁)Z are independent. Define $η_{2} = Z_{2}^{'} \tilde{R} Z_{2}$, $η_{1} = Z_{1}^{'} \tilde{R} Z_{2}$, and $η_{0} = Z_{1}^{'} Z_{1}$. Here η₂ follows a mixture of 1-DF chi-square distributions with coefficients being the eigenvalues of (I − H₁) R̃(I − H₁), denoted as (λ₁, …, λ_m). Note Cov(η₁, η₂) = Cov(η₁, η₀) = 0, E(η₁) = 0, and Var(η₁) = tr[R̃H₁ R̃(I − H₁)]. We can check that

$$L_{\rho} = (1 - \rho)(\eta_{2} + 2 \eta_{1}) + \tau_{\rho} \eta_{0}, \qquad \tau_{\rho} = \rho \|R_{1}\|^{2} + (1 - \rho) R_{1}' \tilde{R} R_{1} / \|R_{1}\|^{2}.$$
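The decomposition can be verified in a few steps. The following is a sketch of one way to carry out the check, using only the definitions of R̃, R_h, R₁, H₁, Z₁ and Z₂ given above. Since $R_{h} J R_{h} = R_{1} R_{1}'$, we have $Z_{1} = H_{1} Z = R_{1} (R_{1}' Z) / \|R_{1}\|^{2}$, and therefore

$$\eta_{0} = Z_{1}' Z_{1} = \frac{(R_{1}' Z)^{2}}{\|R_{1}\|^{2}}, \qquad Z_{1}' \tilde{R} Z_{1} = \frac{R_{1}' \tilde{R} R_{1}}{\|R_{1}\|^{2}} \, \eta_{0}.$$

Writing $Z = Z_{1} + Z_{2}$ and expanding the quadratic form gives

$$L_{\rho} = (1 - \rho) Z' \tilde{R} Z + \rho (R_{1}' Z)^{2} = (1 - \rho) \left( \eta_{2} + 2 \eta_{1} + \frac{R_{1}' \tilde{R} R_{1}}{\|R_{1}\|^{2}} \, \eta_{0} \right) + \rho \|R_{1}\|^{2} \eta_{0},$$

and collecting the terms in $\eta_{0}$ yields the coefficient $\tau_{\rho}$. Setting ρ = 1 leaves only $L_{1} = \|R_{1}\|^{2} \eta_{0}$.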

Let L_ρ₁, …, L_{ρ_b} be the weighted SKAT_L statistics computed with 0 = ρ₁ < ρ₂ < … < ρ_b = 1. Denote q_ρ as the (1 − T)-th percentile of the distribution of L_ρ, which can be computed based on moment matching (Liu et al., 2009). Let ${\tilde{q}}_{1} = F^{- 1} (1 - T | χ_{1}^{2})$, where $F (\cdot | χ_{1}^{2})$ is the distribution function of the 1-DF chi-square distribution. Note that L₁ = ‖R₁‖²η₀. Hence q₁ = ‖R₁‖²q̃₁. The significance p-value based on the test statistic T is

$$1 - \Pr(L_{\rho_{1}} < q_{\rho_{1}}, \dots, L_{\rho_{b}} < q_{\rho_{b}}) = 1 - E\left[\Pr\left(\eta_{2} + 2\eta_{1} < \min_{\upsilon < b} \frac{q_{\rho_{\upsilon}} - \tau_{\rho_{\upsilon}}\eta_{0}}{1 - \rho_{\upsilon}} \,\middle|\, \eta_{0}\right) I(\eta_{0} < \tilde{q}_{1})\right],$$

where η₀ follows the 1-DF chi-square distribution, and I(·) is the indicator function. Denote $μ = \sum_{j = 1}^{m} λ_{j}$, $σ^{2} = 2 \sum_{j = 1}^{m} λ_{j}^{2}$, and $σ_{0}^{2} = σ^{2} + 4 tr [\tilde{R} H_{1} \tilde{R} (I - H_{1})]$. Let

$$\delta(x) = \left(\min_{\upsilon < b} \frac{q_{\rho_{\upsilon}} - \tau_{\rho_{\upsilon}} x}{1 - \rho_{\upsilon}} - \mu\right)\frac{\sigma}{\sigma_{0}} + \mu.$$
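The moments μ, σ², and σ₀² that enter δ(x) can be obtained from matrix traces without an explicit eigendecomposition, since Σλ_j = tr(A) and Σλ_j² = tr(A²) for the symmetric matrix A = (I − H₁)R̃(I − H₁). The sketch below illustrates this under stated assumptions: `R_h` is a small symmetric example matrix, and `matmul`, `trace`, and `mixture_moments` are hypothetical helper names, not functions from the paper or from any package.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def trace(A):
    """Sum of the diagonal entries."""
    return sum(A[i][i] for i in range(len(A)))

def mixture_moments(R_h):
    """Return (mu, sigma^2, sigma_0^2) for a symmetric R_h."""
    m = len(R_h)
    R_tilde = matmul(R_h, R_h)
    R1 = [sum(row) for row in R_h]       # R_1 = R_h 1
    norm2 = sum(r * r for r in R1)       # ||R_1||^2
    # H_1 = R_1 R_1' / ||R_1||^2, the projection onto R_1.
    H1 = [[R1[i] * R1[j] / norm2 for j in range(m)] for i in range(m)]
    # P = I - H_1, the projection onto the orthogonal complement.
    P = [[(1.0 if i == j else 0.0) - H1[i][j] for j in range(m)]
         for i in range(m)]
    A = matmul(matmul(P, R_tilde), P)
    mu = trace(A)                        # sum of eigenvalues of A
    sigma2 = 2.0 * trace(matmul(A, A))   # 2 * sum of squared eigenvalues
    var_eta1 = trace(matmul(matmul(matmul(R_tilde, H1), R_tilde), P))
    sigma0_2 = sigma2 + 4.0 * var_eta1
    return mu, sigma2, sigma0_2

R_h = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]]  # illustrative only
print(mixture_moments(R_h))
```

The eigenvalues themselves are still needed to evaluate the mixture distribution function; the traces only supply the moments used in the rescaling σ/σ₀.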

The p-value is computed as

$$1 - \int_{0}^{\tilde{q}_{1}} M(\delta(x))\, f(x \mid \chi_{1}^{2})\, dx,$$

where $f (\cdot | χ_{1}^{2})$ is the density of the 1-DF chi-square distribution, and M(·) is the distribution function of the mixture of 1-DF chi-square distributions with coefficients (λ₁, …, λ_m). We emphasize that special care is needed for ρ = 1. When ρ_b = 1 is included in the minimum p-value search, the indicator I(η₀ < q̃₁) appears in the expectation, and the integration is over the interval [0, q̃₁]. Otherwise we set q̃₁ = ∞ and the integration is over [0, ∞).
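The one-dimensional integral can be evaluated with standard quadrature. The following is a minimal sketch under stated assumptions, not the authors' implementation: the mixture distribution function M is passed in as a callable `mixture_cdf` (for example, one based on the Davies algorithm or on moment matching), `grid` is a hypothetical list of (q_ρ, τ_ρ, ρ) triples with ρ < 1, and the function names are illustrative. It uses the substitution x = t², under which f(x | χ₁²) dx becomes 2φ(t) dt with φ the standard normal density, removing the singularity at x = 0, and it truncates an infinite upper limit at `t_max`.

```python
import math
from statistics import NormalDist

def chi2_1_quantile(p):
    """Quantile of the 1-DF chi-square: the square of a normal quantile."""
    return NormalDist().inv_cdf(0.5 + p / 2.0) ** 2

def delta(x, grid, mu, sigma, sigma0):
    """delta(x) from the appendix; grid holds (q_rho, tau_rho, rho), rho < 1."""
    c = min((q - tau * x) / (1.0 - rho) for q, tau, rho in grid)
    return (c - mu) * sigma / sigma0 + mu

def skato_pvalue(mixture_cdf, grid, mu, sigma, sigma0,
                 q1_tilde=math.inf, n=2000, t_max=10.0):
    """1 - integral of M(delta(x)) f(x | chi2_1) over [0, q1_tilde].

    Pass q1_tilde=math.inf when rho = 1 is not in the search grid.
    Uses composite Simpson's rule with n (even) subintervals in t = sqrt(x).
    """
    upper = min(math.sqrt(q1_tilde), t_max)
    h = upper / n

    def g(t):
        # Integrand after the change of variable x = t^2.
        phi = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        return mixture_cdf(delta(t * t, grid, mu, sigma, sigma0)) * 2.0 * phi

    total = g(0.0) + g(upper)
    total += 4.0 * sum(g((2 * k - 1) * h) for k in range(1, n // 2 + 1))
    total += 2.0 * sum(g(2 * k * h) for k in range(1, n // 2))
    return 1.0 - total * h / 3.0
```

As a sanity check, setting `mixture_cdf` to a function that always returns 1 reduces the result to 1 − F(q̃₁ | χ₁²), the p-value of the ρ = 1 statistic alone.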

References

Berndt SI, Gustafsson S, Magi R, Ganna A, Wheeler E, Feitosa MF, et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nature Genetics. 2013;45(5):501–512. doi: 10.1038/ng.2606.
Bonnefond A, Clement N, Fawcett K, Yengo L, Vaillant E, Guillaume JL, et al. Rare MTNR1B variants impairing melatonin receptor 1B function contribute to type 2 diabetes. Nature Genetics. 2012;44(3):297–301. doi: 10.1038/ng.1053.
Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genetic Epidemiology. 2014;38(3):191–197. doi: 10.1002/gepi.21791.
Cox DR, Hinkley DV. Theoretical Statistics. CRC Press; 1979.
Davies RB. Algorithm AS 155: the distribution of a linear combination of χ² random variables. Applied Statistics. 1980;29(3):323.
Estrada K, Aukrust I, Bjorkhaug L, Burtt NP, Mercader JM, et al. Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino population. JAMA. 2014;311(22):2305–2314. doi: 10.1001/jama.2014.6511.
Grove ML, Yu B, Cochran BJ, Haritunians T, Bis JC, Taylor KD, et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE consortium. PLoS ONE. 2013;8(7):e68095. doi: 10.1371/journal.pone.0068095.
Hauck WW, Donner A. Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association. 1977;72(360):851.
Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics. 2008;83(3):311–321. doi: 10.1016/j.ajhg.2008.06.024.
Lin DY, Tang ZZ. A general framework for detecting disease associations with rare variants in sequencing studies. The American Journal of Human Genetics. 2011;89(3):354–367. doi: 10.1016/j.ajhg.2011.07.015.
Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
Liu H, Tang Y, Zhang HH. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis. 2009;53(4):853–856.
Ma C, Blackwell T, Boehnke M, Scott LJ, GoT2D Investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genetic Epidemiology. 2013;37(6):539–550. doi: 10.1002/gepi.21742.
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research. 2007;615(1–2):28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics. 2012;44(9):981–990. doi: 10.1038/ng.2383.
Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology. 2010;34(2):188–193. doi: 10.1002/gepi.20450.
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genetics. 2011;7(3):e1001322. doi: 10.1371/journal.pgen.1001322.
Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genetic Epidemiology. 2009;33(6):497–507. doi: 10.1002/gepi.20402.
Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305.
Steinthorsdottir V, Thorleifsson G, Sulem P, Helgason H, Grarup N, et al. Identification of low-frequency and rare sequence variants associated with elevated or reduced risk of type 2 diabetes. Nature Genetics. 2014;46(3):294–298. doi: 10.1038/ng.2882.
The ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology. 1989;129(4):687–702.
Tzeng JY, Zhang D. Haplotype-based association analysis via variance-components score test. The American Journal of Human Genetics. 2007;81(5):927–938. doi: 10.1086/521558.
Wang Y, McKay JD, Rafnar T, Wang Z, Timofeeva MN, Broderick P, et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nature Genetics. 2014;46(7):736–741. doi: 10.1038/ng.3002.
Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-Set analysis for case-control genome-wide association studies. The American Journal of Human Genetics. 2010;86(6):929–942. doi: 10.1016/j.ajhg.2010.05.002.
Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
Xing G, Lin CY, Wooding SP, Xing C. Blindly using Wald’s test can miss rare disease-causal variants in case-control association studies. Annals of Human Genetics. 2012;76(2):168–177. doi: 10.1111/j.1469-1809.2011.00700.x.
Zhan X, Larson DE, Wang C, Koboldt DC, Sergeev YV, Fulton RS, et al. Identification of a rare coding variant in complement 3 associated with age-related macular degeneration. Nature Genetics. 2013;45(11):1375–1379. doi: 10.1038/ng.2758.

Supplementary Materials

Supp Material S1: NIHMS703118-supplement-Supp_MaterialS1.pdf (266.3 KB, PDF)
