Bioinformatics. 2017 Aug 14;33(23):3733–3739. doi: 10.1093/bioinformatics/btx511

Conditional asymptotic inference for the kernel association test

Kai Wang
Editor: John Hancock
PMCID: PMC5860324  PMID: 28961861

Abstract

Motivation

The kernel association test (KAT) is popular in biological studies for its ability to combine weak effects potentially of opposite direction. Its P-value is typically assessed via its (unconditional) asymptotic distribution. However, such an asymptotic distribution is known only for continuous traits and for dichotomous traits. Furthermore, the derived P-values are known to be conservative when sample size is small, especially for the important case of dichotomous traits. One alternative is the permutation test, a widely accepted approximation to the exact finite sample conditional inference. But it is time-consuming to use in practice due to stringent significance criteria commonly seen in these analyses.

Results

Based on a previous theoretical result, a conditional asymptotic distribution for the KAT is introduced. This distribution provides an alternative approximation to the exact distribution of the KAT. An explicit expression of this distribution is provided, from which P-values can be easily computed. This method applies to any type of trait. The usefulness of this approach is demonstrated via extensive simulation studies using real genotype data and an analysis of genetic data from the Ocular Hypertension Treatment Study. Numerical results showed that the new method controls the type I error rate and is slightly conservative compared to the permutation method. Nevertheless, the proposed method may be used as a fast screening method, with a time-consuming permutation procedure conducted only at locations that show signals of association.

Availability and implementation

An implementation of the proposed method is provided in the R package iGasso.

1 Introduction

Kernel machine regression (Liu et al., 2007, 2008) is becoming more and more popular in analysis of biological data. It has the ability to accumulate weak signals across biological features even though they are potentially of opposite effects. Initially proposed for pathway analysis, this method has found a plethora of applications in rare variants association analysis with sequence data (Lee et al., 2012b; Wu et al., 2011). It has been extended to numerous situations such as extreme continuous traits (Barnett et al., 2013), survival outcomes (Cai et al., 2011; Chen et al., 2013; Lin et al., 2011), family-based association test (Wang et al., 2013), gene–gene and gene–environmental interactions (Lin et al., 2013, 2016) and microbiome data (Chen et al., 2016; Zhao et al., 2015).

The test statistic for association, denoted by Q, derived from the kernel machine regression is of the following form:

$$Q = (y - \hat{y})^t K (y - \hat{y}). \tag{1}$$

Here $y$ is the vector of responses on $n$ subjects and $\hat{y}$ is the vector of predicted responses using a linear regression (for continuous responses) or logistic regression (for dichotomous responses) with potential confounders as predictors. This regression model is called the null model as it does not involve any of the features to be tested for association. The positive definite $n \times n$ matrix $K$ is called the kernel. It measures the pair-wise similarities between subjects. The choice of kernel depends on the application context. In genetic association studies, it is often taken as $K = GG^t$, where $G$ is an $n \times m$ matrix of genotype scores at $m$ single nucleotide polymorphisms (SNPs) (Wu et al., 2011). The columns of $G$ are typically weighted. In microbiome data analysis, popular kernels include the UniFrac kernel and the Bray-Curtis kernel (Chen and Li, 2013; Chen et al., 2012; Zhao et al., 2015).
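As a concrete illustration of the statistic in Equation (1), the following minimal sketch evaluates Q for the linear kernel K = GG^t. The function name `kat_statistic`, the toy data and the intercept-only null model (so that the fitted values equal the sample mean) are assumptions made for illustration only; they are not taken from any package.

```python
def kat_statistic(y, y_hat, G):
    """Q = (y - y_hat)^t G G^t (y - y_hat) for the linear kernel K = G G^t."""
    n, m = len(G), len(G[0])
    resid = [yi - fi for yi, fi in zip(y, y_hat)]
    # T = G^t (y - y_hat): one score per SNP (column of G)
    T = [sum(G[i][j] * resid[i] for i in range(n)) for j in range(m)]
    # Q is the squared Euclidean length of T
    return sum(t * t for t in T)


# Toy data (hypothetical): 4 subjects, 2 SNPs, intercept-only null model.
y = [1.0, 2.0, 4.0, 5.0]
y_hat = [sum(y) / len(y)] * len(y)
G = [[0, 1], [1, 0], [2, 1], [1, 2]]
Q = kat_statistic(y, y_hat, G)
```

Writing Q as the squared length of G^t(y - ŷ) avoids forming the n × n kernel matrix explicitly, which is convenient when n is much larger than m.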

The exact distribution of the statistic $Q$ is unknown. Asymptotic distribution has been derived for the case of continuous traits and the case of dichotomous traits (Liu et al., 2007, 2008). Let $Z$ denote the design matrix of the covariates (including the intercept) used in the null model and $P = V - VZ(Z^tVZ)^{-1}Z^tV$ with $V = \mathrm{diag}(\hat{y}(1-\hat{y}))$ for dichotomous traits and $V = \hat{\sigma}^2 I$ for continuous traits, where $\hat{\sigma}^2$ is an estimated variance of the residuals. Then the asymptotic distribution of $Q$ is a linear combination of $m$ independent 1-df chi-square distributions: $\sum_j \lambda_j \chi_j^2(\mathrm{df}=1)$, where $\lambda_j$, $j = 1, \ldots, m$, are the eigenvalues of $P^{1/2} K P^{1/2}$.

However, it has been found that the asymptotic distribution for the $Q$ statistic is conservative for small sample sizes, especially in the important case of dichotomous responses. Lee et al. (2012b) used a single chi-square distribution to approximate $\sum_j \lambda_j \chi_j^2(\mathrm{df}=1)$ by matching the first and the second moments. The performance of this method was investigated via simulation studies (Lee et al., 2012b).
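The moment-matching idea can be written out explicitly. The following is a generic Satterthwaite-type sketch that matches the mean and variance of the mixture with those of a scaled chi-square variable $a\chi_d^2$; the symbols $a$ and $d$ are introduced here for illustration, and the exact adjustment used by Lee et al. (2012b) may differ in detail.

```latex
\begin{aligned}
E\Big[\sum_j \lambda_j \chi_j^2\Big] &= \sum_j \lambda_j, &
\mathrm{Var}\Big[\sum_j \lambda_j \chi_j^2\Big] &= 2\sum_j \lambda_j^2, \\
E\big[a\chi_d^2\big] &= a d, &
\mathrm{Var}\big[a\chi_d^2\big] &= 2 a^2 d, \\
a &= \frac{\sum_j \lambda_j^2}{\sum_j \lambda_j}, &
d &= \frac{\big(\sum_j \lambda_j\big)^2}{\sum_j \lambda_j^2}.
\end{aligned}
```

Solving $ad = \sum_j \lambda_j$ and $2a^2 d = 2\sum_j \lambda_j^2$ gives the stated $a$ and $d$; the P-value is then approximated by the upper tail of $\chi_d^2$ evaluated at $Q/a$.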

Chen et al. (2016) found that the small-sample correction method of Lee et al. (2012b) can be anti-conservative. Alternatively, Chen et al. (2016) proposed the following statistic for continuous traits: $(y - \hat{y})^t K (y - \hat{y})/\hat{\sigma}^2$ in order to take into account the uncertainty in the estimated residual variance $\hat{\sigma}^2$. Based on the iteratively reweighted least squares algorithm for generalized linear models, a similar statistic was proposed for dichotomous traits.

The main challenge in approximating the distribution of Q is that it depends on the residual variance of the response, which needs to be estimated. However, this variance would be known in the framework of conditional inference, in which statistical inference is conditional on the observed data. Deriving the exact distribution of Q given the data does not appear to be an easy task. Permutation would be a straightforward way to approximate this distribution, but it may be time-consuming. In this article we study the conditional asymptotic distribution of Q using the theoretical work of Strasser and Weber (1999). Unlike the methods cited above, this approach is not limited to situations with continuous traits or dichotomous traits.

In what follows, we first introduce the main results of Strasser and Weber (1999) on permutation. A small sample counterpart of the statistic Q is proposed such that the results of Strasser and Weber (1999) are applicable. The inference behavior based on the conditional asymptotic distribution of Q is studied using simulated phenotypes with real genotype data from Genetic Analysis Workshop (GAW) 17. We also show an application to a genetic study for The Ocular Hypertension Treatment Study (OHTS) (Gordon and Kass, 1999).

2 Materials and methods

2.1 A theoretical result of Strasser and Weber (1999)

We start by introducing the main result by Strasser and Weber (1999) on permutation. Let $(Y_i, X_i)$, $i = 1, \ldots, n$, denote $n$ independent and identically distributed observations. Both $Y_i$ and $X_i$ can be univariate or multivariate. To test the null hypothesis that the conditional distribution of $Y$ given $X$ is identical to the distribution of $Y$ against arbitrary alternatives, Strasser and Weber (1999) proposed a multivariate linear statistic of the following form (Hothorn et al., 2006):

$$T = \mathrm{vec}\left(\sum_{i=1}^{n} g(X_i)\, h(Y_i)^t\right) \in \mathbb{R}^{pq \times 1},$$

where $g$ is a transformation of $X$ to a vector of length $p$ and $h$ is a transformation of $Y$ to a vector of length $q$. The $h$ transformation may depend on all the observations of $Y$: $h(Y_i) = h(Y_i; (Y_1, \ldots, Y_n))$. However, this dependence must be permutation invariant: for a permutation of $(Y_1, \ldots, Y_n)$, $h(Y_i)$ does not change. One example is that the rank of a value remains the same in any permutation. Other examples of $h$ and $g$ transformations can be found in Hothorn et al. (2006).

Under the null hypothesis of independence between Y and X, Strasser and Weber (1999) were able to derive the mean μ and variance Σ of T over the set S of all possible permutations:

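$$\mu = \mathrm{vec}\left[\left(\sum_{i=1}^{n} g(X_i)\right) E(h(Y) \mid S)^t\right],$$
$$\Sigma = \frac{n}{n-1}\, V(h(Y) \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i) \otimes g(X_i)^t\right) - \frac{1}{n-1}\, V(h(Y) \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i)\right) \otimes \left(\sum_{i=1}^{n} g(X_i)\right)^t.$$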

Here ⊗ represents the Kronecker product and

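$$E(h(Y) \mid S) = \frac{1}{n}\sum_{i=1}^{n} h(Y_i),$$
$$V(h(Y) \mid S) = \frac{1}{n}\sum_{i=1}^{n} \left[h(Y_i) - E(h(Y) \mid S)\right]\left[h(Y_i) - E(h(Y) \mid S)\right]^t.$$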

Strasser and Weber (1999) further showed that T follows an asymptotic multivariate normal distribution with mean μ and variance matrix Σ as sample size n goes to infinity.
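The mean and variance formulas above can be checked numerically on a small example by enumerating all n! permutations. The sketch below assumes a scalar h (q = 1) and a two-dimensional g (p = 2), uses exact rational arithmetic, and compares the brute-force moments with the closed-form expressions; the function names and the toy data are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def enumerated_moments(gx, hy):
    """Exact mean and covariance of T = sum_i g(X_i) h(Y_sigma(i)) over all n! permutations."""
    n, p = len(gx), len(gx[0])
    stats = [[sum(gx[i][k] * perm[i] for i in range(n)) for k in range(p)]
             for perm in permutations(hy)]
    count = len(stats)
    mean = [sum(t[k] for t in stats) / count for k in range(p)]
    cov = [[sum((t[a] - mean[a]) * (t[b] - mean[b]) for t in stats) / count
            for b in range(p)] for a in range(p)]
    return mean, cov


def closed_form_moments(gx, hy):
    """Mean and covariance from the closed-form expressions, specialised to scalar h."""
    n, p = len(gx), len(gx[0])
    e_h = sum(hy) / n                               # E(h(Y) | S)
    v_h = sum((h - e_h) ** 2 for h in hy) / n       # V(h(Y) | S), divisor n
    g_sum = [sum(row[k] for row in gx) for k in range(p)]
    mu = [g_sum[k] * e_h for k in range(p)]
    sigma = [[v_h * (Fraction(n, n - 1) * sum(row[a] * row[b] for row in gx)
                     - Fraction(1, n - 1) * g_sum[a] * g_sum[b])
              for b in range(p)] for a in range(p)]
    return mu, sigma


# Toy data (hypothetical): n = 5 observations, so 120 permutations.
gx = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 0]]
hy = [Fraction(v) for v in (3, 1, 4, 1, 5)]
brute_force = enumerated_moments(gx, hy)
closed_form = closed_form_moments(gx, hy)
```

Enumeration is only feasible for very small n, which is exactly why the closed-form moments and the asymptotic normality result are useful in practice.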

These results on $T$ are very useful for conditional inference when the exact conditional distribution is unknown or permutation is too time-consuming to conduct. This conditional inference framework encompasses many commonly used tests for independence. The test statistics of these tests are either scalars (when $pq = 1$) or assume the quadratic form $(T - \mu)^t \Sigma^{+} (T - \mu)$. Examples of tests that can be put into this framework include the Kruskal-Wallis test, linear-by-linear test, trend test, two-sample test, etc. (Hothorn et al., 2006). The R package coin provides an implementation of this theory. In the next subsection, we show that KAT can be cast into this framework as well, and we introduce its conditional asymptotic distribution.

2.2 Conditional asymptotic distribution of KAT

Since $K$ is positive definite, it can in general be written as $K = GG^t$ by Cholesky decomposition. So we will express the statistic $Q$ as $Q = (y - \hat{y})^t GG^t (y - \hat{y})$.
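For a kernel that is not already given in the form GG^t, the factor G can be obtained numerically. The sketch below is a textbook Cholesky factorisation in plain Python, shown only to make the decomposition step concrete; the function name `cholesky` and the 2 × 2 example kernel are assumptions. Note that the resulting G is n × n and lower triangular rather than an n × m genotype matrix.

```python
import math


def cholesky(K):
    """Return lower-triangular G with K = G G^t for a symmetric positive definite K."""
    n = len(K)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                pivot = K[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("K is not positive definite")
                G[i][j] = math.sqrt(pivot)
            else:
                G[i][j] = (K[i][j] - s) / G[j][j]
    return G


# Hypothetical 2 x 2 kernel used only to illustrate the factorisation.
K = [[4.0, 2.0], [2.0, 3.0]]
G = cholesky(K)
```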

Define $\tilde{y} = y - \hat{y}$ to be the vector of residuals. The $i$th row of $G$, denoted by $G_i$, corresponds to $X_i$ in the general theory presented in the last subsection, and $\tilde{y}_i$, the $i$th component of $\tilde{y}$, corresponds to $Y_i$. Let $g$ and $h$ be identity transformations such that $g(G_i^t) = G_i^t$ and $h(\tilde{y}_i) = \tilde{y}_i$. Clearly, transformation $h$ satisfies the ‘permutation invariant’ requirement. We have

$$T = \sum_{i=1}^{n} G_i^t\, \tilde{y}_i = G^t \tilde{y} \in \mathbb{R}^{m \times 1}.$$

It is straightforward to calculate that

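$$E(\tilde{y} \mid S) = \frac{1}{n}\sum_{i=1}^{n} \tilde{y}_i = 0.$$

Furthermore,

$$V(\tilde{y} \mid S) = \frac{1}{n}\sum_{i=1}^{n} \tilde{y}_i^2 = \frac{n-1}{n}\, s_{\tilde{y}}^2,$$

where $s_{\tilde{y}}^2 = (n-1)^{-1}\sum_i \tilde{y}_i^2$ is the sample variance of $\tilde{y}_i$, $i = 1, \ldots, n$.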

In summary, we have the following result.

Proposition 1. The conditional mean of T over all permutations is μ=0 and its conditional variance matrix is

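$$\Sigma = s_{\tilde{y}}^2\left(G^tG - n^{-1} G^t \mathbf{1}\mathbf{1}^t G\right) = (n-1)\, s_{\tilde{y}}^2 \cdot S_G^2,$$

where $S_G^2 = (n-1)^{-1}\left(G^tG - n^{-1} G^t \mathbf{1}\mathbf{1}^t G\right)$ is the sample covariance matrix of genotype scores at the $m$ SNPs.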
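Proposition 1 translates directly into a few lines of code. The sketch below computes the conditional variance matrix from a residual vector and a genotype matrix, using the fact that the bracketed term is the column-centred cross-product of G. It assumes the residuals sum to zero, as stated above; the function name and the toy data are hypothetical.

```python
def conditional_covariance(resid, G):
    """Sigma = s^2 (G^t G - n^{-1} G^t 1 1^t G), with s^2 the sample variance of the residuals."""
    n, m = len(G), len(G[0])
    # Sample variance of the residuals; their mean is assumed to be zero.
    s2 = sum(r * r for r in resid) / (n - 1)
    col_mean = [sum(G[i][j] for i in range(n)) / n for j in range(m)]
    # Column-centred cross-product equals G^t G - n^{-1} G^t 1 1^t G.
    return [[s2 * sum((G[i][a] - col_mean[a]) * (G[i][b] - col_mean[b])
                      for i in range(n))
             for b in range(m)] for a in range(m)]


# Toy data (hypothetical): residuals from an intercept-only fit, 2 SNPs.
resid = [-2.0, -1.0, 1.0, 2.0]
G = [[0, 1], [1, 0], [2, 1], [1, 2]]
Sigma = conditional_covariance(resid, G)
```

Because the residual variance enters only as a scalar factor, the same centred cross-product can be reused across phenotypes measured on the same subjects.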

Since KAT $= T^tT$, the conditional asymptotic distribution is obtained using the main result of Strasser and Weber (1999).

Proposition 2. The conditional asymptotic distribution of KAT is the same as the distribution of a linear combination of $m$ independent 1-df chi-square random variables: $\sum_{j=1}^{m} \lambda_j \cdot \chi_j^2(\mathrm{df}=1)$, where $\lambda_j$, $j = 1, \ldots, m$, are the eigenvalues of $\Sigma$.

We note that, unlike the asymptotic distribution for $Q$, this asymptotic distribution has the same form regardless of the type of trait. Whether the trait is continuous or dichotomous, the coefficients $\lambda_j$, $j = 1, \ldots, m$, are the eigenvalues of $\Sigma$ and are proportional to the eigenvalues of $S_G^2$.
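To make Proposition 2 concrete, the sketch below extracts the eigenvalues of a symmetric variance matrix with cyclic Jacobi rotations and then approximates the upper-tail probability of the chi-square mixture by Monte Carlo. Both choices are assumptions made to keep the example dependency-free: the text does not say how the tail probability is evaluated numerically, and Monte Carlo is far too coarse for the stringent significance levels used in practice. All names and the 2 × 2 example matrix are hypothetical.

```python
import math
import random


def symmetric_eigenvalues(A, sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    A = [row[:] for row in A]
    m = len(A)
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(m) for j in range(m) if i != j)
        if off < tol:
            break
        for p in range(m - 1):
            for q in range(p + 1, m):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(m):          # A <- A J (update columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(m):          # A <- J^t A (update rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    return sorted((A[i][i] for i in range(m)), reverse=True)


def mixture_chisq_pvalue(q_obs, lambdas, draws=20000, seed=1):
    """Monte Carlo estimate of Pr(sum_j lambda_j * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        x = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if x >= q_obs:
            exceed += 1
    return exceed / draws


# Hypothetical variance matrix and observed statistic.
sigma = [[2.0, 1.0], [1.0, 2.0]]
lambdas = symmetric_eigenvalues(sigma)
p_value = mixture_chisq_pvalue(6.0, lambdas)
```

In a real analysis one would replace the Monte Carlo step with a numerical method designed for small tail probabilities of quadratic forms in normal variables.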

3 Simulation studies

We have performed extensive simulation studies to demonstrate and compare the performance of the proposed conditional inference method, the method of Chen et al. (2016), SKAT (for continuous traits), SKAT with small sample adjustment (for dichotomous traits) and permutation test. While other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010) used simulated sequence data, we use real sequence data that were made available by Genetic Analysis Workshop (GAW) 17 (Almasy et al., 2011).

The GAW 17 mini-exome sequence data were from 697 subjects selected from the 1000 Genomes Project. There were 24 487 SNPs in 3205 genes. The nine largest genes in terms of genotyped SNPs (i.e. ABCC6, ACIN1, AHNAK, AKAP13, ALPK3, ANKRD12, ANKRD15, ATP10A and BAIAP3) were selected. The number of SNPs ranges from 200 to 400. Trait data were simulated in the same way as in the other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010), as described below.

For each gene, 20% randomly selected SNPs are assumed to be causal. Eighty percent of these causal SNPs are assumed to have positive effect on the trait while the remaining 20% are negative.

Similar to Wu et al. (2011), continuous traits were simulated using the following model

$$y = 0.5x_1 + 0.5x_2 + \beta_1 g_1 + \cdots + \beta_s g_s, \tag{2}$$

where $x_1$ follows a standard normal distribution, $x_2$ follows a discrete distribution taking values 0 or 1 with probability 0.5, and $g_1, \ldots, g_s$ were genotype scores at $s$ causal rare variants. Given the minor allele frequency (MAF) of a causal variant, its coefficient $\beta$ is determined by $\beta = 0.5\log(5)\,|\log_{10}(\mathrm{MAF})|$. For the study of type I error rate, $\beta_1, \ldots, \beta_s$ were set at 0. For the study of power, the magnitude of $\beta_j$ was equal to $|0.2\log_{10}(\mathrm{MAF})|$.
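The selection of causal variants and the assignment of their effects can be sketched as follows. The example uses the power-study magnitude |0.2 log10(MAF)| stated above, with 20% of SNPs causal and 80% of those given a positive sign. The function name, the rounding rule for the counts and the constant toy MAF vector are assumptions for illustration.

```python
import math
import random


def causal_effects(mafs, causal_frac=0.2, positive_frac=0.8, seed=0):
    """Return one coefficient per SNP; non-causal SNPs get 0."""
    rng = random.Random(seed)
    m = len(mafs)
    # Randomly pick the causal SNPs (count rounded to the nearest integer: an assumption).
    causal = rng.sample(range(m), round(causal_frac * m))
    n_pos = round(positive_frac * len(causal))
    beta = [0.0] * m
    for rank, j in enumerate(causal):
        sign = 1.0 if rank < n_pos else -1.0
        beta[j] = sign * abs(0.2 * math.log10(mafs[j]))
    return beta


# Toy input (hypothetical): 50 SNPs, all with MAF 0.01.
beta = causal_effects([0.01] * 50)
```

With this magnitude rule, rarer variants receive larger effects; for example, a variant with MAF 0.01 gets a coefficient of magnitude 0.4.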

For the dichotomous trait, its disease status is determined using the following logistic regression model:

$$\mathrm{logit}[\Pr(\text{disease})] = \beta_0 + 0.5x_1 + 0.5x_2 + \beta_1 g_1 + \cdots + \beta_s g_s,$$

where $x_1$ and $x_2$ are the same as those for the continuous traits, $\beta_0 = \log(0.01/0.99)$ corresponds to a background disease probability of 0.01 and $\exp(\beta_1), \ldots, \exp(\beta_s)$ are the odds ratios for the $s$ disease susceptibility variants, respectively. The number of cases is the same as the number of controls. A similar simulation setup has been used elsewhere (Wang and Elston, 2007; Wu et al., 2011). As in the continuous case, all $\beta$s are set to 0 in the analysis of type I error rate.
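The logistic model above can be evaluated directly. The sketch below computes the disease probability for one subject and confirms two properties stated in the text: the baseline probability is 0.01 when all covariates and genotypes are zero, and exp(β) acts as an odds ratio. The function name and the inputs are hypothetical, and the balanced case-control sampling step is not shown.

```python
import math


def disease_probability(x1, x2, genotypes, betas, baseline=0.01):
    """Pr(disease) under logit(p) = beta0 + 0.5*x1 + 0.5*x2 + sum_j beta_j * g_j."""
    beta0 = math.log(baseline / (1.0 - baseline))
    eta = beta0 + 0.5 * x1 + 0.5 * x2 + sum(b * g for b, g in zip(betas, genotypes))
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    return p / (1.0 - p)


# Baseline subject, and a carrier of one copy of a variant with odds ratio 2.
p_null = disease_probability(0.0, 0.0, [0], [math.log(2.0)])
p_carrier = disease_probability(0.0, 0.0, [1], [math.log(2.0)])
odds_ratio = odds(p_carrier) / odds(p_null)
```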

The number of simulation replicates is 100 000 for the analysis of type I error and is 1000 for the analysis of power.

The simulated type I error rates are presented in Table 1 for continuous traits and in Table 2 for dichotomous traits. These type I error rates are under control except that the adjusted SKAT is slightly anti-conservative. The anti-conservativeness of the adjusted SKAT has also been observed in Chen et al. (2016).
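When reading the tables, it helps to know how much an estimated type I error rate can vary by chance. The sketch below computes the binomial Monte Carlo standard error for 100 000 replicates and expresses one tabulated value (the adjusted SKAT rate of 0.01279 for ABCC6 with n = 50 in Table 2) as a number of standard errors above the nominal level. The function names are hypothetical; the formula is the usual standard error of a binomial proportion.

```python
import math


def mc_standard_error(alpha, replicates):
    """Standard error of an estimated rejection rate when the true rate is alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / replicates)


def z_score(observed, alpha, replicates):
    """Distance of an observed rejection rate from alpha, in standard errors."""
    return (observed - alpha) / mc_standard_error(alpha, replicates)


se_01 = mc_standard_error(0.01, 100_000)
se_001 = mc_standard_error(0.001, 100_000)
z_adj = z_score(0.01279, 0.01, 100_000)  # adjusted SKAT, ABCC6, n = 50 (Table 2)
```

Deviations of only one or two standard errors from the nominal level are compatible with simulation noise, whereas much larger deviations indicate genuine conservativeness or anti-conservativeness.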

Table 1.

Simulated type I error rate for continuous traits

Gene  n  Sig. level  Chen  SKAT  Cond. Asy.  Perm.
ABCC6 50 0.01 0.01017 0.00659 0.00666 0.01039
0.001 0.00104 0.00038 0.00039 0.00107
100 0.01 0.01023 0.00904 0.00910 0.01042
0.001 0.00093 0.00058 0.00057 0.00105
200 0.01 0.00980 0.00909 0.00907 0.00976
0.001 0.00085 0.00069 0.00069 0.00091
ACIN1 50 0.01 0.00970 0.00780 0.00798 0.00984
0.001 0.00097 0.00053 0.00054 0.00113
100 0.01 0.01004 0.00890 0.00880 0.01009
0.001 0.00102 0.00079 0.00077 0.00113
200 0.01 0.00969 0.00913 0.00916 0.00981
0.001 0.00108 0.00097 0.00096 0.00119
AHNAK 50 0.01 0.01013 0.00659 0.00687 0.01018
0.001 0.00086 0.00030 0.00030 0.00102
100 0.01 0.00931 0.00681 0.00676 0.00950
0.001 0.00104 0.00065 0.00066 0.00127
200 0.01 0.00986 0.00895 0.00897 0.01001
0.001 0.00125 0.00099 0.00098 0.00125
AKAP13 50 0.01 0.00986 0.00802 0.00799 0.01004
0.001 0.00116 0.00059 0.00062 0.00120
100 0.01 0.00958 0.00871 0.00875 0.00977
0.001 0.00102 0.00080 0.00079 0.00108
200 0.01 0.00991 0.00953 0.00952 0.00984
0.001 0.00090 0.00078 0.00080 0.00093
ALPK3 50 0.01 0.01011 0.00680 0.00695 0.01013
0.001 0.00100 0.00038 0.00039 0.00110
100 0.01 0.00947 0.00802 0.00803 0.00943
0.001 0.00104 0.00073 0.00070 0.00117
200 0.01 0.00972 0.00900 0.00896 0.00982
0.001 0.00103 0.00086 0.00086 0.00109
ANKRD12 50 0.01 0.00983 0.00751 0.00736 0.00987
0.001 0.00082 0.00040 0.00036 0.00098
100 0.01 0.01012 0.00893 0.00899 0.01041
0.001 0.00111 0.00087 0.00086 0.00125
200 0.01 0.01009 0.00929 0.00938 0.01008
0.001 0.00105 0.00085 0.00086 0.00112
ANKRD15 50 0.01 0.00962 0.00663 0.00679 0.00960
0.001 0.00095 0.00026 0.00027 0.00106
100 0.01 0.01027 0.00886 0.00881 0.01033
0.001 0.00097 0.00068 0.00069 0.00109
200 0.01 0.01026 0.00923 0.00915 0.01016
0.001 0.00079 0.00060 0.00059 0.00092
ATP10A 50 0.01 0.01003 0.00594 0.00606 0.01012
0.001 0.00085 0.00025 0.00029 0.00097
100 0.01 0.00993 0.00791 0.00800 0.01014
0.001 0.00095 0.00060 0.00063 0.00103
200 0.01 0.01000 0.00903 0.00909 0.01005
0.001 0.00098 0.00081 0.00078 0.00112
BAIAP3 50 0.01 0.01021 0.00659 0.00667 0.01056
0.001 0.00099 0.00032 0.00029 0.00106
100 0.01 0.00991 0.00878 0.00882 0.00997
0.001 0.00098 0.00070 0.00067 0.00110
200 0.01 0.00972 0.00897 0.00897 0.00981
0.001 0.00093 0.00078 0.00078 0.00110

Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.

Table 2.

Simulated type I error rate for dichotomous traits

Gene  n  Sig. level  Chen  SKAT Adj.  Cond. Asy.  Perm.
ABCC6 50 0.01 0.00962 0.01279 0.00705 0.00998
0.001 0.00092 0.00154 0.00044 0.00102
100 0.01 0.01028 0.01189 0.00874 0.01053
0.001 0.00091 0.00116 0.00061 0.00105
200 0.01 0.01016 0.01064 0.00939 0.01029
0.001 0.00097 0.00119 0.00082 0.00110
ACIN1 50 0.01 0.01094 0.01400 0.00873 0.01090
0.001 0.00132 0.00164 0.00069 0.00126
100 0.01 0.01050 0.01187 0.00952 0.01057
0.001 0.00110 0.00131 0.00078 0.00112
200 0.01 0.01014 0.01066 0.00956 0.01015
0.001 0.00084 0.00094 0.00072 0.00091
AHNAK 50 0.01 0.00365 0.01537 0.00188 0.01020
0.001 0.00018 0.00118 0.00003 0.00106
100 0.01 0.00642 0.01285 0.00465 0.01069
0.001 0.00028 0.00118 0.00013 0.00103
200 0.01 0.00811 0.01110 0.00721 0.01060
0.001 0.00052 0.00133 0.00038 0.00109
AKAP13 50 0.01 0.01116 0.01392 0.00938 0.01102
0.001 0.00132 0.00172 0.00070 0.00130
100 0.01 0.01045 0.01170 0.00959 0.01036
0.001 0.00117 0.00138 0.00093 0.00113
200 0.01 0.00991 0.01068 0.00954 0.00995
0.001 0.00110 0.00120 0.00099 0.00114
ALPK3 50 0.01 0.00925 0.01279 0.00613 0.00947
0.001 0.00086 0.00141 0.00032 0.00104
100 0.01 0.01003 0.01154 0.00826 0.01022
0.001 0.00091 0.00129 0.00061 0.00112
200 0.01 0.00994 0.01025 0.00910 0.01003
0.001 0.00100 0.00121 0.00083 0.00114
ANKRD12 50 0.01 0.00757 0.01362 0.00561 0.01013
0.001 0.00055 0.00150 0.00025 0.00114
100 0.01 0.00829 0.01136 0.00713 0.00951
0.001 0.00045 0.00121 0.00027 0.00098
200 0.01 0.00953 0.01097 0.00888 0.01023
0.001 0.00089 0.00126 0.00078 0.00114
ANKRD15 50 0.01 0.00943 0.01322 0.00615 0.01008
0.001 0.00087 0.00143 0.00028 0.00110
100 0.01 0.00967 0.01149 0.00781 0.00997
0.001 0.00098 0.00134 0.00061 0.00115
200 0.01 0.01035 0.01106 0.00927 0.01046
0.001 0.00105 0.00127 0.00074 0.00120
ATP10A 50 0.01 0.00996 0.01313 0.00640 0.01000
0.001 0.00094 0.00122 0.00030 0.00098
100 0.01 0.01014 0.01156 0.00819 0.01013
0.001 0.00121 0.00148 0.00064 0.00121
200 0.01 0.01011 0.01064 0.00911 0.01018
0.001 0.00107 0.00118 0.00080 0.00112
BAIAP3 50 0.01 0.00850 0.01323 0.00580 0.01005
0.001 0.00064 0.00121 0.00030 0.00097
100 0.01 0.00930 0.01133 0.00789 0.01027
0.001 0.00079 0.00121 0.00055 0.00109
200 0.01 0.00946 0.01029 0.00867 0.00977
0.001 0.00107 0.00128 0.00093 0.00120

Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.

Results of the power analysis are presented in Table 3 for continuous traits and in Table 4 for dichotomous traits. All statistics have similar performance. The conditional inference method using the asymptotic distribution is sometimes slightly less powerful than the permutation method and the Chen method. Note that in some genes, the power is not increasing in sample size. This is because the causal SNPs are selected at random within each scenario.

Table 3.

Simulated power for continuous traits

         Sig. level 0.01                        Sig. level 0.001
Gene  n  Chen  SKAT  Cond. Asy.  Perm.  Chen  SKAT  Cond. Asy.  Perm.
ABCC6 50 0.318 0.279 0.278 0.316 0.083 0.054 0.052 0.082
100 0.695 0.677 0.681 0.694 0.376 0.359 0.359 0.381
200 0.555 0.544 0.545 0.551 0.252 0.243 0.245 0.255
ACIN1 50 0.144 0.139 0.135 0.144 0.039 0.034 0.034 0.041
100 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
200 0.212 0.211 0.206 0.207 0.071 0.070 0.069 0.074
AHNAK 50 0.967 0.955 0.955 0.960 0.767 0.695 0.689 0.684
100 0.999 0.998 0.998 0.999 0.992 0.991 0.989 0.991
200 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
AKAP13 50 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
100 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
200 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
ALPK3 50 0.038 0.031 0.030 0.037 0.008 0.007 0.007 0.008
100 0.324 0.315 0.317 0.325 0.123 0.109 0.105 0.121
200 0.958 0.958 0.956 0.958 0.863 0.857 0.858 0.861
ANKRD12 50 0.033 0.031 0.028 0.032 0.003 0.003 0.003 0.003
100 0.056 0.055 0.054 0.056 0.006 0.006 0.005 0.007
200 0.807 0.801 0.802 0.803 0.580 0.569 0.570 0.576
ANKRD15 50 0.469 0.452 0.454 0.469 0.242 0.196 0.199 0.238
100 0.872 0.868 0.868 0.875 0.636 0.605 0.605 0.639
200 0.307 0.297 0.298 0.308 0.087 0.078 0.078 0.082
ATP10A 50 0.050 0.041 0.040 0.051 0.008 0.004 0.004 0.008
100 0.164 0.153 0.152 0.166 0.043 0.037 0.038 0.042
200 1.000 1.000 1.000 1.000 0.993 0.993 0.993 0.993
BAIAP3 50 0.654 0.621 0.623 0.649 0.381 0.323 0.327 0.383
100 0.137 0.130 0.132 0.134 0.019 0.016 0.017 0.018
200 0.999 0.999 0.999 0.999 0.993 0.993 0.992 0.993

Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.

Table 4.

Simulated power for dichotomous traits

         Sig. level 0.01                             Sig. level 0.001
Gene  n  Chen  SKAT Adj.  Cond. Asy.  Perm.  Chen  SKAT Adj.  Cond. Asy.  Perm.
ABCC6 50 0.142 0.159 0.122 0.146 0.039 0.047 0.018 0.035
100 0.359 0.365 0.349 0.359 0.166 0.177 0.152 0.168
200 0.858 0.856 0.850 0.857 0.603 0.608 0.582 0.609
ACIN1 50 0.111 0.121 0.094 0.107 0.027 0.035 0.020 0.027
100 0.997 0.997 0.996 0.996 0.977 0.979 0.974 0.976
200 0.324 0.327 0.322 0.326 0.128 0.133 0.120 0.132
AHNAK 50 0.795 0.845 0.776 0.811 0.529 0.602 0.482 0.555
100 0.988 0.990 0.988 0.989 0.947 0.962 0.944 0.954
200 0.975 0.977 0.973 0.975 0.895 0.907 0.885 0.902
AKAP13 50 0.989 0.990 0.983 0.984 0.920 0.931 0.903 0.920
100 0.998 0.998 0.998 0.998 0.990 0.990 0.989 0.989
200 1.000 1.000 1.000 1.000 0.999 0.999 0.999 0.999
ALPK3 50 0.087 0.104 0.072 0.090 0.017 0.023 0.009 0.021
100 0.195 0.208 0.188 0.197 0.071 0.076 0.062 0.074
200 0.505 0.503 0.500 0.504 0.214 0.218 0.199 0.210
ANKRD12 50 0.054 0.070 0.050 0.059 0.013 0.018 0.010 0.016
100 0.255 0.282 0.249 0.263 0.072 0.092 0.063 0.086
200 0.447 0.457 0.447 0.451 0.205 0.218 0.202 0.212
ANKRD15 50 0.289 0.317 0.265 0.288 0.113 0.138 0.087 0.115
100 0.536 0.556 0.524 0.538 0.318 0.333 0.296 0.313
200 0.729 0.740 0.724 0.737 0.489 0.502 0.478 0.495
ATP10A 50 0.070 0.078 0.054 0.069 0.014 0.018 0.012 0.014
100 0.244 0.251 0.235 0.242 0.107 0.108 0.094 0.104
200 0.927 0.931 0.924 0.928 0.807 0.814 0.800 0.810
BAIAP3 50 0.284 0.312 0.263 0.286 0.115 0.147 0.094 0.124
100 0.202 0.230 0.191 0.220 0.055 0.070 0.042 0.065
200 0.995 0.996 0.996 0.996 0.968 0.970 0.964 0.967

Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.

4 Application to OHTS

Primary open angle glaucoma (POAG) is a leading cause of irreversible blindness. Although a genetic basis has been established for a substantial fraction of POAG, no risk alleles of major effect have been identified (Fingert, 2011). The etiology of POAG is likely to be complex. Since POAG is assessed through quantitative measures such as central corneal thickness (CCT), intraocular pressure (IOP), and cup-to-disc ratio, one promising research direction is to map genes underlying these quantitative measures. Indeed, large-scale GWAS have identified genes that affect CCT (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011). We report an application of the method of Chen et al. (2016), SKAT with and without small sample correction, and the conditional inference method described in this report to a genome-wide gene-based association study of CCT averaged over both eyes using data from the Ocular Hypertension Treatment Study (OHTS).

OHTS (Gordon and Kass, 1999) is a National Eye Institute-sponsored multi-center, randomized clinical trial. Its goal is to investigate the efficacy of medical treatment in delaying or preventing the onset of POAG in individuals with elevated intraocular pressure. One thousand six hundred and thirty six individuals between 40 and 80 years old were enrolled and 1077 of them were genotyped in a subsequent study. Data for this genetic study are available at the Database of Genotypes and Phenotypes (dbGaP, Study Accession phs000240.v1.p1). It contains 1057 subjects who have both genotype data and baseline phenotype data available. The vast majority of these subjects were non-Hispanic White (752) and Black (252). Unpublished results have identified genetic heterogeneity between Whites and Blacks. Our focus is a genome-wide association study of CCT in the Black subjects. A histogram of CCT is presented in Figure 1. The distribution is approximately symmetric.

Fig. 1. Histogram of the average CCT for the Black subjects in OHTS

There were 1 051 295 genotyped SNPs. Their HGNC gene symbols were obtained using the R/Bioconductor package biomaRt (version 2.26.1). There are 30 562 autosomal genes. Similar to Lee et al. (2012a), genes that contain fewer than 3 SNPs were excluded from further consideration. This reduces the number of genes to 23 778. Age and gender are used as covariates. Genes at which at least one of the four statistics has a P-value less than 0.0001 are listed in Table 5. In this table, the information on biotype and base-pair position is obtained from www.ensemble.org. It is interesting that all of them except one are on chromosome 17. There are three tight gene clusters on chromosome 17 in terms of base-pair location. Cluster 1 consists of genes SENP3, SENP3-EIF4A1, EIF4A1 and SNORA48. Cluster 2 consists of KCNH4, HCRT, GHDC and STAT5B. Cluster 3 consists of BECN1, MIR6781 and PSME3. Cluster 2 is very close to cluster 3. These genes warrant further investigation, although none of them overlaps with ZNF469, COL5A1, COL8A2, AKAP13 and AVGR8, genes for which association with CCT has been reported previously (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011).

Table 5.

A summary of gene-based association P-values ($\times 10^{5}$) for the 252 Black subjects in the OHTS study

| Chr | Biotype | Gene symbol | Start (base-pair position) | Stop (base-pair position) | Chen | SKAT | Conditional inference | Permutation |
|---|---|---|---|---|---|---|---|---|
| 1 | Protein coding | ZBTB40 | 22 451 851 | 22 531 157 | 19.52 | 23.28 | 23.64 | 9.999 |
| 17 | Protein coding | SENP3 | 7 561 875 | 7 571 969 | 1.897 | 2.728 | 2.971 | 9.999 |
| 17 | Protein coding | SENP3-EIF4A1 | 7 563 287 | 7 578 715 | 1.070 | 1.153 | 1.215 | 0.000 |
| 17 | Protein coding | EIF4A1 | 7 572 706 | 7 579 005 | 2.965 | 3.551 | 3.624 | 9.999 |
| 17 | snoRNA | SNORA48 | 7 574 713 | 7 574 847 | 3.376 | 4.479 | 4.733 | 0.000 |
| 17 | miRNA | MIR6779 | 38 914 979 | 38 915 042 | 1.002 | 1.742 | 1.631 | 9.999 |
| 17 | Protein coding | KCNH4 | 42 156 891 | 42 181 278 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | HCRT | 42 184 060 | 42 185 452 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | GHDC | 42 188 799 | 42 194 532 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | STAT5B | 42 199 168 | 42 276 707 | 0.983 | 1.623 | 1.505 | 0.000 |
| 17 | Protein coding | BECN1 | 42 810 134 | 42 833 350 | 3.126 | 3.812 | 3.541 | 0.000 |
| 17 | miRNA | MIR6781 | 42 823 880 | 42 823 943 | 5.550 | 7.268 | 6.837 | 0.000 |
| 17 | Protein coding | PSME3 | 42 824 385 | 42 843 758 | 2.936 | 3.871 | 3.609 | 9.999 |

Note: Genes are selected if any of the four listed statistics is significant at level 0.0001.

One reviewer suggested that an example of data analysis where the phenotype is neither continuous nor binary be given. To address this suggestion, the CCT measurement is discretized according to its quartiles. The discretized CCT measurement takes 4 values: 1, 2, 3 and 4. For this modified data, neither the Chen method nor SKAT applies. However, the proposed method and the permutation method are still applicable. Findings from these two methods are summarized in Table 6. This list of findings is slightly longer than that from Table 5. Most of the additional findings are due to the significance of the permutation test. The proposed method is generally less significant than the permutation test. At the significance level 0.0001 used for this table, both methods are significant at gene ZBTB40 on chromosome 1 and some other genes on chromosome 17 that are seen in Table 5.
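The quartile coding described above can be sketched with the Python standard library. The cut points come from `statistics.quantiles` with its default method, and values equal to a cut point are assigned to the lower class; both conventions are assumptions, since the text does not specify how quartiles or ties were handled. The function name and the toy values are hypothetical.

```python
import bisect
import statistics


def quartile_codes(values):
    """Map each value to 1, 2, 3 or 4 according to the sample quartiles."""
    cuts = statistics.quantiles(values, n=4)  # three cut points
    # Count how many cut points lie strictly below the value, then shift to 1..4.
    return [bisect.bisect_left(cuts, v) + 1 for v in values]


# Toy measurements (hypothetical), not CCT data.
codes = quartile_codes([1, 2, 3, 4, 5, 6, 7, 8])
```

The resulting codes can be passed to the conditional test in place of the residual vector after centring, since the theory places no restriction on the type of response.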

Table 6.

A summary of gene-based association P-values ($\times 10^{5}$) for the 252 Black subjects in the OHTS study

| Chr | Biotype | Gene symbol | Start (base-pair position) | Stop (base-pair position) | Conditional inference | Permutation |
|---|---|---|---|---|---|---|
| 1 | Protein coding | ZBTB40 | 22 451 851 | 22 531 157 | 5.640 | 9.999 |
| 1 | Protein coding | ADIPOR1 | 202 940 823 | 202 958 572 | 21.08 | 0.000 |
| 9 | snRNA | RNU6-1039P | 110 369 592 | 110 369 698 | 27.58 | 9.999 |
| 9 | Protein coding | CNTRL | 121 074 863 | 121 177 610 | 40.45 | 9.999 |
| 13 | Misc RNA | RN7SL597P | 41 121 876 | 41 122 171 | 28.56 | 9.999 |
| 14 | TR V_pseudogene | TRAV32 | 22 185 562 | 22 186 057 | 46.41 | 9.999 |
| 17 | snoRNA | SNORA48 | 7 574 713 | 7 574 847 | 34.56 | 0.000 |
| 17 | Protein coding | SENP3-EIF4A1 | 7 563 287 | 7 578 715 | 25.57 | 9.999 |
| 17 | miRNA | MIR6779 | 38 914 979 | 38 915 042 | 5.532 | 0.000 |
| 17 | Protein coding | ARL5C | 39 156 894 | 39 167 484 | 34.41 | 9.999 |
| 17 | Protein coding | KCNH4 | 42 156 891 | 42 181 278 | 0.000 | 0.000 |
| 17 | Protein coding | HCRT | 42 184 060 | 42 185 452 | 0.000 | 0.000 |
| 17 | Protein coding | GHDC | 42 188 799 | 42 194 532 | 0.000 | 0.000 |
| 17 | Protein coding | STAT5B | 42 199 168 | 42 276 707 | 0.400 | 0.000 |
| 17 | Protein coding | BECN1 | 42 810 134 | 42 833 350 | 25.97 | 9.999 |
| 17 | Protein coding | PSME3 | 42 824 385 | 42 843 758 | 23.85 | 9.999 |
| 18 | LincRNA | LINC01541 | 71 519 962 | 71 578 956 | 17.25 | 9.999 |
| 19 | Protein coding | CACNA1A | 13 206 442 | 13 633 025 | 71.19 | 9.999 |

Note: The phenotype CCT is discretized according to its quartiles. Genes are selected if any of the two listed statistics is significant at level 0.0001. Methods ‘Chen’ and ‘SKAT’ are not applicable in this situation because the phenotype is neither continuous nor binary.

5 Discussion

An approximation method is proposed for the conditional inference of the kernel association test (KAT). This approximation is based on a theoretical result developed by Strasser and Weber (1999) on the asymptotic permutation distribution of a random vector. By properly constructing a vector $T$ and a mapping function $h$, a small sample counterpart of the KAT is introduced. Current applications of this theory focus on test statistics of the form $(T - E(T))^t \Sigma^{-1} (T - E(T))$, where $\Sigma$ is the covariance matrix of $T$, and are not applicable to KAT. The work presented in this report fills this gap.

We note that the proposed method approximates the permutation distribution of KAT where the covariates are permuted together with response y. This is to satisfy the ‘permutation invariant’ requirement.
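The permutation scheme that the proposed method approximates can be sketched as follows: the residuals, which carry both the response and the covariates, are shuffled across subjects while the genotype matrix is held fixed, and the statistic is recomputed each time. The function names, the number of permutations, and the (count + 1)/(B + 1) P-value convention are assumptions; the toy data are hypothetical.

```python
import random


def q_statistic(resid, G):
    """Q = ||G^t resid||^2 for the linear kernel."""
    n, m = len(G), len(G[0])
    return sum(sum(G[i][j] * resid[i] for i in range(n)) ** 2 for j in range(m))


def permutation_pvalue(resid, G, n_perm=999, seed=0):
    """Permutation P-value obtained by shuffling residuals against fixed genotypes."""
    rng = random.Random(seed)
    q_obs = q_statistic(resid, G)
    shuffled = list(resid)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if q_statistic(shuffled, G) >= q_obs:
            count += 1
    # Add-one convention so the estimate is never exactly zero (an assumption).
    return (count + 1) / (n_perm + 1)


# Toy data (hypothetical): 4 subjects, 2 SNPs.
resid = [-2.0, -1.0, 1.0, 2.0]
G = [[0, 1], [1, 0], [2, 1], [1, 2]]
p_perm = permutation_pvalue(resid, G)
```

Reaching a P-value resolution of about 1e-5 requires on the order of 100 000 permutations per gene, which illustrates why a fast asymptotic screen followed by permutation at promising genes is attractive.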

A salient feature of the proposed method is that it places no limitations on the type of response, a benefit of using conditional inference. Regardless of the type of response, be it continuous, dichotomous, or some other type, the conditional variance matrix of the vector T assumes the same form. In practice, one may want to use this method as a screening tool followed by extensive permutations at genes that show signals.

An implementation of the proposed method is provided in the R package iGasso. The function name is KAT.coin.

Funding

Preparation of GAW 17 data was supported, in part, by National Institutes of Health [grant R01 MH059490] and used data from the 1000 Genomes Project (www.1000genomes.org). The GAW 17 was supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. The author thanks Dr Jun Chen and Dr Wenan Chen for sharing the R code for their method proposed in Chen et al. (2016). The author also thanks the Associate Editor, Dr Alison Hutchins, and three anonymous reviewers for their useful comments.

Conflict of Interest: none declared.

References

  1. Almasy L. et al. (2011) Genetic Analysis Workshop 17 mini-exome simulation. BMC Proceedings, 5, S2.
  2. Barnett I.J. et al. (2013) Detecting rare variant effects using extreme phenotype sampling in sequencing association studies. Genet. Epidemiol., 37, 142–151.
  3. Cai T. et al. (2011) Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics, 67, 975–986.
  4. Chen H. et al. (2013) Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol., 37, 196–204.
  5. Chen J., Li H. (2013) Kernel methods for regression analysis of microbiome compositional data. In: Mingxiu Hu, Yi Liu, and Jianchang Lin et al. (eds) Topics in Applied Statistics. Springer, New York, pp. 191–201.
  6. Chen J. et al. (2012) Associating microbiome composition with environmental covariates using generalized unifrac distances. Bioinformatics, 28, 2106–2113.
  7. Chen J. et al. (2016) Small sample kernel association tests for human genetic and microbiome association studies. Genet. Epidemiol., 40, 5–19.
  8. Fingert J. (2011) Primary open-angle glaucoma genes. Eye, 25, 587–595.
  9. Gordon M.O., Kass M.A. (1999) The ocular hypertension treatment study: design and baseline description of the participants. Arch. Ophthalmol., 117, 573–583.
  10. Hothorn T. et al. (2006) A lego system for conditional inference. Am. Stat., 60, 257–263.
  11. Lee S. et al. (2012a) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13, 762–775.
  12. Lee S. et al. (2012b) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet., 91, 224–237.
  13. Lin X. et al. (2011) Kernel machine snp-set analysis for censored survival outcomes in genome-wide association studies. Genet. Epidemiol., 35, 620–631.
  14. Lin X. et al. (2013) Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics, 14, 667–681.
  15. Lin X. et al. (2016) Test for rare variants by environment interactions in sequencing association studies. Biometrics, 72, 156–164.
  16. Liu D. et al. (2007) Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics, 63, 1079–1088.
  17. Liu D. et al. (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform, 9, 292.
  18. Lu Y. et al. (2010) Common genetic variants near the brittle cornea syndrome locus znf469 influence the blinding disease risk factor central corneal thickness. PLoS Genet., 6, e1000947.
  19. Strasser H., Weber C. (1999) On the asymptotic theory of permutation statistics. Math. Methods Stat., 2, 220–250.
  20. Vitart V. et al. (2010) New loci associated with central cornea thickness include col5a1, akap13 and avgr8. Hum. Mol. Genet., 19, 4304–4311.
  21. Vithana E.N. et al. (2011) Collagen-related genes influence the glaucoma risk factor, central corneal thickness. Hum. Mol. Genet., 20, 649–658.
  22. Wang T., Elston R.C. (2007) Improved power by use of a weighted score test for linkage disequilibrium mapping. Am. J. Hum. Genet., 80, 353–360.
  23. Wang X. et al. (2013) GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol., 37, 778–786.
  24. Wu M.C. et al. (2010) Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet., 86, 929–942.
  25. Wu M.C. et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89, 82–93.
  26. Zhao N. et al. (2015) Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. Am. J. Hum. Genet., 96, 797–807.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
