Abstract
Motivation
The kernel association test (KAT) is popular in biological studies for its ability to combine weak effects potentially of opposite direction. Its P-value is typically assessed via its (unconditional) asymptotic distribution. However, such an asymptotic distribution is known only for continuous traits and for dichotomous traits. Furthermore, the derived P-values are known to be conservative when sample size is small, especially for the important case of dichotomous traits. One alternative is the permutation test, a widely accepted approximation to the exact finite sample conditional inference. But it is time-consuming to use in practice due to stringent significance criteria commonly seen in these analyses.
Results
Based on a previous theoretical result, a conditional asymptotic distribution for the KAT is introduced. This distribution provides an alternative approximation to the exact distribution of the KAT. An explicit expression of this distribution is provided, from which P-values can be easily computed. This method applies to any type of trait. The usefulness of this approach is demonstrated via extensive simulation studies using real genotype data and an analysis of genetic data from the Ocular Hypertension Treatment Study. Numerical results showed that the new method controls the type I error rate and is somewhat conservative compared with the permutation method. Nevertheless, the proposed method may be used as a fast screening method, with a time-consuming permutation procedure conducted only at locations that show signals of association.
Availability and implementation
An implementation of the proposed method is provided in the R package iGasso.
1 Introduction
Kernel machine regression (Liu et al., 2007, 2008) is becoming more and more popular in analysis of biological data. It has the ability to accumulate weak signals across biological features even though they are potentially of opposite effects. Initially proposed for pathway analysis, this method has found a plethora of applications in rare variants association analysis with sequence data (Lee et al., 2012b; Wu et al., 2011). It has been extended to numerous situations such as extreme continuous traits (Barnett et al., 2013), survival outcomes (Cai et al., 2011; Chen et al., 2013; Lin et al., 2011), family-based association test (Wang et al., 2013), gene–gene and gene–environmental interactions (Lin et al., 2013, 2016) and microbiome data (Chen et al., 2016; Zhao et al., 2015).
The test statistic for association, denoted by Q, derived from the kernel machine regression is of the following form:
$$Q = (y - \hat{\mu})^{T} K (y - \hat{\mu}) \qquad (1)$$
Here $y$ is the vector of responses on n subjects, and $\hat{\mu}$ is the vector of predicted responses using a linear regression (for continuous responses) or logistic regression (for dichotomous responses) with potential confounders as predictors. This regression model is called the null model as it does not involve any of the features to be tested for association. The positive definite n × n matrix $K$ is called the kernel. It measures the pair-wise similarities between subjects. The choice of kernel depends on the application context. In genetic association studies, it is often taken as $K = GG^{T}$, where $G$ is an n × m matrix of genotype scores at m single nucleotide polymorphisms (SNPs) (Wu et al., 2011). The columns of $G$ are typically weighted. In microbiome data analysis, popular kernels include the UniFrac kernel and the Bray-Curtis kernel (Chen and Li, 2013; Chen et al., 2012; Zhao et al., 2015).
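As a minimal sketch of how the statistic can be evaluated, assuming the quadratic form Q = r'Kr with null-model residuals r and the linear kernel K = GG', the following computes Q without forming the n × n kernel. The function name and the toy data are hypothetical and are not taken from any package.

```python
def kat_statistic(residuals, genotypes):
    """Q = r' G G' r for the linear kernel K = G G'.

    residuals: list of n null-model residuals r_i = y_i - mu_hat_i
    genotypes: n rows, each a list of m (already weighted) genotype scores
    """
    n = len(residuals)
    m = len(genotypes[0])
    # G' r is an m-vector; Q is its squared Euclidean norm, so the
    # n x n kernel matrix never has to be built explicitly.
    gtr = [sum(genotypes[i][j] * residuals[i] for i in range(n)) for j in range(m)]
    return sum(v * v for v in gtr)


# Hypothetical toy data: 4 subjects, 2 SNPs, intercept-only null model.
y = [1.0, 0.0, 1.0, 0.0]
mu_hat = [0.5, 0.5, 0.5, 0.5]
G = [[0, 1], [1, 0], [2, 1], [0, 0]]
r = [yi - mi for yi, mi in zip(y, mu_hat)]
Q = kat_statistic(r, G)
```

Working with G'r rather than K keeps the cost at order nm instead of order n², which matters when the statistic is recomputed many times.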
The exact distribution of the statistic Q is unknown. Its asymptotic distribution has been derived for the case of continuous traits and the case of dichotomous traits (Liu et al., 2007, 2008). Let $X$ denote the design matrix of the covariates (including the intercept) used in the null model and $P_0 = V - VX(X^{T}VX)^{-1}X^{T}V$ with $V = \mathrm{diag}\{\hat{\mu}_i(1-\hat{\mu}_i)\}$ for dichotomous traits and $V = \hat{\sigma}^{2}I$ for continuous traits, where $\hat{\sigma}^{2}$ is an estimated variance of the residuals. Then the asymptotic distribution of Q is a linear combination of m independent 1-df chi-square distributions: $\sum_{i=1}^{m}\lambda_i\chi^{2}_{1,i}$, where $\lambda_i$, $i = 1, \ldots, m$, are eigenvalues of $P_0^{1/2}KP_0^{1/2}$.
However, the asymptotic distribution of the Q statistic has been found to be conservative for small sample sizes, especially in the important case of dichotomous responses. Lee et al. (2012b) used a single chi-square distribution to approximate this mixture by matching the first and second moments. The performance of this method was investigated via simulation studies (Lee et al., 2012b).
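To illustrate what matching the first two moments means, the sketch below applies the classical two-moment (Satterthwaite-type) match: a scaled chi-square a·χ²_d is chosen so that its mean and variance equal those of a weighted sum of independent 1-df chi-squares. This is an illustration of the idea only; the adjustment in the cited work may differ in its details, and the eigenvalues used here are hypothetical.

```python
def match_two_moments(lams):
    """Return (a, d) so that a * chi2_d has the same mean and variance
    as sum_i lam_i * chi2_1 with independent 1-df chi-squares."""
    s1 = sum(lams)                    # mean of the mixture
    s2 = sum(l * l for l in lams)     # half of the variance of the mixture
    a = s2 / s1                       # scale
    d = s1 * s1 / s2                  # (possibly fractional) degrees of freedom
    return a, d


lams = [3.0, 1.0, 0.5]                # hypothetical eigenvalues
a, d = match_two_moments(lams)

mixture_mean = sum(lams)
mixture_var = 2 * sum(l * l for l in lams)
approx_mean = a * d                   # mean of a * chi2_d
approx_var = 2 * a * a * d            # variance of a * chi2_d
```

Only the mean and variance agree by construction; the tails of the two distributions can still differ, which is one reason such approximations may be inaccurate at stringent significance levels.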
Chen et al. (2016) found that the small-sample correction method of Lee et al. (2012b) can be anti-conservative. Alternatively, Chen et al. (2016) proposed the following statistic for continuous traits: in order to take into account the uncertainty in the estimated residual variance . Based on the iteratively reweighted least squares algorithm for generalized linear models, a similar statistic was proposed for dichotomous traits.
The main challenge in approximating the distribution of Q is that it depends on the residual variance of the response, which needs to be estimated. However, this variance is known in the framework of conditional inference, in which statistical inference is conditional on the observed data. Deriving the exact distribution of Q given the data does not seem to be an easy task. Permutation would be a straightforward way to approximate this distribution but it may be time-consuming. In this article we study the conditional asymptotic distribution of Q using the theoretical work by Strasser and Weber (1999). Unlike the methods mentioned above, this approach is not limited to situations with continuous traits or dichotomous traits.
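The permutation approach mentioned above can be sketched as follows, assuming a linear kernel K = GG' and permuting the null-model residuals across subjects. The function names, the toy data and the add-one P-value estimator are assumptions of this sketch, not a description of any particular software.

```python
import random


def q_stat(r, G):
    """Q = r' G G' r, computed as the squared norm of G' r."""
    m = len(G[0])
    gtr = [sum(G[i][j] * r[i] for i in range(len(r))) for j in range(m)]
    return sum(v * v for v in gtr)


def permutation_pvalue(r, G, n_perm=2000, seed=1):
    """Monte Carlo permutation P-value for Q, shuffling residuals over subjects."""
    rng = random.Random(seed)
    q_obs = q_stat(r, G)
    perm = list(r)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if q_stat(perm, G) >= q_obs:
            hits += 1
    # Add-one estimator (an assumption of this sketch): never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 8 subjects, 2 SNPs.
r = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
G = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 1], [0, 0], [1, 2], [0, 1]]
p = permutation_pvalue(r, G)
```

With this estimator the smallest attainable P-value is 1/(n_perm + 1), so resolving a significance level such as 0.0001 requires well over 10 000 permutations per gene. This is the computational burden that motivates an asymptotic alternative.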
In what follows, we first introduce the main results of Strasser and Weber (1999) on permutation. A small sample counterpart of the statistic Q is proposed such that the results of Strasser and Weber (1999) are applicable. The inference behavior based on the conditional asymptotic distribution of Q is studied using simulated phenotypes with real genotype data from Genetic Analysis Workshop (GAW) 17. We also show an application to a genetic study for The Ocular Hypertension Treatment Study (OHTS) (Gordon and Kass, 1999).
2 Materials and methods
2.1 A theoretical result of Strasser and Weber (1999)
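We start by introducing the main result of Strasser and Weber (1999) on permutation. Let $(Y_i, X_i)$, $i = 1, \ldots, n$, denote n independent and identically distributed observations. Both $Y_i$ and $X_i$ can be univariate or multivariate. To test the null hypothesis that the conditional distribution of $Y$ given $X$ is identical to the distribution of $Y$ against arbitrary alternatives, Strasser and Weber (1999) proposed a multivariate linear statistic of the following form (Hothorn et al., 2006):
$$T = \mathrm{vec}\left(\sum_{i=1}^{n} g(X_i)\, h(Y_i, (Y_1, \ldots, Y_n))^{T}\right)$$
where g is a transformation of $X_i$ to a vector of length p and h is a transformation of $Y_i$ to a vector of length q. The h transformation may depend on all the observations of $Y$: $h(Y_i) = h(Y_i, (Y_1, \ldots, Y_n))$. However, this dependence must be permutation invariant: $h(Y_i, (Y_1, \ldots, Y_n))$ does not change under any permutation of $(Y_1, \ldots, Y_n)$. One example is the rank of a value, which remains the same in any permutation. Other examples of h and g transformations can be found in Hothorn et al. (2006).
Under the null hypothesis of independence between $Y$ and $X$, Strasser and Weber (1999) were able to derive the mean and variance of $T$ over the set $S$ of all possible permutations:
$$\mu = E(T \mid S) = \mathrm{vec}\left(\left(\sum_{i=1}^{n} g(X_i)\right) E(h \mid S)^{T}\right)$$
$$\Sigma = V(T \mid S) = \frac{n}{n-1}\, V(h \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i) \otimes g(X_i)^{T}\right) - \frac{1}{n-1}\, V(h \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i)\right) \otimes \left(\sum_{i=1}^{n} g(X_i)\right)^{T}$$
Here ⊗ represents the Kronecker product and
$$E(h \mid S) = \frac{1}{n}\sum_{i=1}^{n} h(Y_i, (Y_1, \ldots, Y_n)), \qquad V(h \mid S) = \frac{1}{n}\sum_{i=1}^{n} \left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S)\right)\left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S)\right)^{T}$$
Strasser and Weber (1999) further showed that $T$ follows an asymptotic multivariate normal distribution with mean $\mu$ and variance matrix $\Sigma$ as sample size n goes to infinity.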
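For intuition, the permutation mean and variance can be checked by brute force in the simplest case p = q = 1 with g and h both the identity, so that T is the sum of the products x_i·y_i. The sketch below compares the closed-form expressions, written in their unweighted form, against an exhaustive enumeration of all permutations. The data are hypothetical, and enumeration is only feasible for very small n.

```python
from itertools import permutations


def sw_moments(x, y):
    """Closed-form permutation mean and variance of T = sum_i x_i * y_i
    (scalar case, g and h both the identity)."""
    n = len(x)
    e_h = sum(y) / n                              # E(h | S)
    v_h = sum((yi - e_h) ** 2 for yi in y) / n    # V(h | S), divisor n
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    mean = sx * e_h
    var = n / (n - 1) * v_h * sxx - v_h * sx * sx / (n - 1)
    return mean, var


def exact_moments(x, y):
    """Mean and variance of T over all n! permutations of y."""
    vals = [sum(xi * yi for xi, yi in zip(x, p)) for p in permutations(y)]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var


x = [0.0, 1.0, 2.0, 1.0, 0.0]      # hypothetical scores
y = [1.2, -0.7, 0.3, 2.1, -1.4]    # hypothetical responses
formula = sw_moments(x, y)
brute = exact_moments(x, y)
```

The two pairs agree up to floating-point error. In this scalar case the variance reduces to the product of the two centred sums of squares divided by n − 1.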
These results on $T$ are very useful for conditional inference when the exact conditional distribution is unknown or permutation is too time-consuming to conduct. This conditional inference framework encompasses many commonly used tests for independence. The test statistics of these tests are either scalars (when pq = 1) or assume the quadratic form $(T - \mu)^{T}\Sigma^{+}(T - \mu)$, where $\mu$ and $\Sigma$ are the conditional mean and variance matrix of $T$ and $\Sigma^{+}$ is the Moore-Penrose inverse of $\Sigma$. Examples of tests that can be put into this framework include the Kruskal-Wallis test, linear-by-linear test, trend test, two-sample test, etc. (Hothorn et al., 2006). The R package coin provides an implementation of this theory. In the next subsection, we show that the KAT can be cast into this framework as well, and we introduce its conditional asymptotic distribution.
2.2 Conditional asymptotic distribution of KAT
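Since $K$ is positive definite, it can in general be written as $K = ZZ^{T}$ by Cholesky decomposition. So we will express the statistic Q as $Q = (y - \hat{\mu})^{T} Z Z^{T} (y - \hat{\mu})$.
Define $r = y - \hat{\mu}$ to be the vector of residuals. Now the ith row of $Z$, denoted by $z_i^{T}$, corresponds to $X_i$ in the general theory presented in the last subsection. Let g and h be identity transformations such that $g(z_i) = z_i$ and $h(r_i) = r_i$, where $r_i$ is the ith component of $r$. Clearly, transformation h satisfies the ‘permutation invariant’ requirement. We have
$$T = \sum_{i=1}^{n} z_i r_i = Z^{T} r$$
It is straightforward to calculate that
$$E(h \mid S) = \frac{1}{n}\sum_{i=1}^{n} r_i = \bar{r}$$
Furthermore,
$$V(h \mid S) = \frac{1}{n}\sum_{i=1}^{n} (r_i - \bar{r})^{2} = \frac{n-1}{n}\, s^{2}$$
where $s^{2}$ is the sample variance of $r_1, \ldots, r_n$.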
In summary, we have the following result.
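Proposition 1. The conditional mean of $T = Z^{T}r$ over all permutations is $\mu = \bar{r}\sum_{i=1}^{n} z_i$ (which is $0$ when the residuals sum to zero, as they do for linear or logistic null models fitted with an intercept) and its conditional variance matrix is
$$\Sigma = (n-1)\, s^{2}\, S$$
where $s^{2}$ is the sample variance of the residuals and $S$ is the sample covariance matrix of genotype scores at the m SNPs.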
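As a sanity check on Proposition 1, the following sketch enumerates every permutation of a small residual vector and compares the resulting covariance matrix of Z'r with (n − 1)·s²·S, the form implied by the Strasser and Weber variance when g and h are identity maps. The matrices and residuals are hypothetical toy values.

```python
from itertools import permutations


def prop1_cov(Z, r):
    """(n - 1) * s^2 * S, with s^2 the sample variance of r and
    S the sample covariance matrix of the rows of Z (divisor n - 1)."""
    n, m = len(Z), len(Z[0])
    r_bar = sum(r) / n
    s2 = sum((ri - r_bar) ** 2 for ri in r) / (n - 1)
    z_bar = [sum(row[j] for row in Z) / n for j in range(m)]
    S = [[sum((row[j] - z_bar[j]) * (row[k] - z_bar[k]) for row in Z) / (n - 1)
          for k in range(m)] for j in range(m)]
    return [[(n - 1) * s2 * S[j][k] for k in range(m)] for j in range(m)]


def brute_cov(Z, r):
    """Covariance matrix of T = Z' r_perm over all n! permutations of r."""
    n, m = len(Z), len(Z[0])
    ts = [[sum(Z[i][j] * p[i] for i in range(n)) for j in range(m)]
          for p in permutations(r)]
    N = len(ts)
    mean = [sum(t[j] for t in ts) / N for j in range(m)]
    return [[sum((t[j] - mean[j]) * (t[k] - mean[k]) for t in ts) / N
             for k in range(m)] for j in range(m)]


Z = [[0, 1], [1, 0], [2, 1], [0, 0], [1, 2]]   # hypothetical genotype scores
r = [0.4, -1.1, 0.9, -0.6, 0.4]                # hypothetical residuals (sum to 0)
closed_form = prop1_cov(Z, r)
enumerated = brute_cov(Z, r)
```

The agreement is exact rather than asymptotic: the mean and variance are finite-sample permutation moments, and only the normal approximation to the distribution of T relies on large n.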
Since KAT $= T^{T}T$, the conditional asymptotic distribution is obtained using the main result of Strasser and Weber (1999).
Proposition 2. The conditional asymptotic distribution of KAT is the same as the distribution of a linear combination of m independent 1-df chi-square random variables: $\sum_{i=1}^{m}\lambda_i\chi^{2}_{1,i}$, where $\lambda_i$, $i = 1, \ldots, m$, are the eigenvalues of the conditional variance matrix $\Sigma$ of $T$.
We note that, unlike the asymptotic distribution for Q, this asymptotic distribution has the same form regardless of the type of traits. Whether the trait is continuous or dichotomous, the coefficients are the eigenvalues of $\Sigma$ and are proportional to the eigenvalues of $S$, the sample covariance matrix of the genotype scores.
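Once the coefficients are available, a P-value is a tail probability of a weighted sum of independent 1-df chi-square variables. The sketch below estimates that tail by plain Monte Carlo, purely for illustration. Exact numerical methods such as Davies' algorithm are commonly used for such mixtures in practice, and nothing here is meant to describe how any particular package computes it. The function name and defaults are hypothetical.

```python
import random


def mixture_pvalue_mc(q_obs, lams, n_draws=100_000, seed=7):
    """Monte Carlo estimate of P(sum_i lam_i * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # A 1-df chi-square draw is the square of a standard normal draw.
        draw = sum(l * rng.gauss(0.0, 1.0) ** 2 for l in lams)
        if draw >= q_obs:
            hits += 1
    return hits / n_draws


# Sanity checks against known chi-square tails:
p_one = mixture_pvalue_mc(3.841459, [1.0])        # chi2_1 upper 5% point
p_two = mixture_pvalue_mc(4.605170, [1.0, 1.0])   # chi2_2 upper 10% point
```

Monte Carlo error makes this unsuitable for the very small P-values needed in genome-wide screening, where an analytic or numerical-inversion method is the natural choice.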
3 Simulation studies
We have performed extensive simulation studies to demonstrate and compare the performance of the proposed conditional inference method, the method of Chen et al. (2016), SKAT (for continuous traits), SKAT with small sample adjustment (for dichotomous traits) and permutation test. While other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010) used simulated sequence data, we use real sequence data that were made available by Genetic Analysis Workshop (GAW) 17 (Almasy et al., 2011).
The GAW 17 mini-exome sequence data were from 697 subjects selected from the 1000 Genomes Project. There were 24 487 SNPs in 3205 genes. The nine largest genes in terms of genotyped SNPs (i.e. ABCC6, ACIN1, AHNAK, AKAP13, ALPK3, ANKRD12, ANKRD15, ATP10A and BAIAP3) were selected. The number of SNPs per gene ranges from 200 to 400. Trait data were simulated in the same way as in the other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010), as described below.
For each gene, 20% of the SNPs, selected at random, are assumed to be causal. Eighty percent of these causal SNPs are assumed to have a positive effect on the trait, while the remaining 20% have a negative effect.
Similar to Wu et al. (2011), continuous traits were simulated using the following model
| (2) |
where x1 follows a standard normal distribution, x2 follows a discrete distribution taking values 0 or 1 with probability 0.5, and were genotype scores at s causal rare variants. Given the minor allele frequency (MAF) of a causal variant, its coefficient β is determined by . For the study of type I error rate, all βs were set to 0. For the study of power, the magnitude of βj was equal to .
For the dichotomous trait, its disease status is determined using the following logistic regression model:
where x1 and x2 are the same as those for the continuous traits, corresponds to background disease probability of 0.01 and are the odds ratios for the s disease susceptibility variants, respectively. The number of cases is the same as the number of controls. A similar simulation setup has been used elsewhere (Wang and Elston, 2007; Wu et al., 2011). As in the case of continuous traits, all βs are set to 0 in the analysis of type I error rate.
The number of simulation replicates is 100 000 for the analysis of type I error and is 1000 for the analysis of power.
The simulated type I error rates are presented in Table 1 for continuous traits and in Table 2 for dichotomous traits. These type I error rates are under control except that the adjusted SKAT is slightly anti-conservative. The anti-conservativeness of the adjusted SKAT has also been observed in Chen et al. (2016).
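When reading the tables, it helps to know how much a simulated rate can deviate from its nominal level by chance alone. Treating each replicate as an independent Bernoulli trial, the Monte Carlo standard error is sqrt(α(1 − α)/N). The short sketch below evaluates it for the two levels and the 100 000 replicates used here; the helper name is hypothetical.

```python
import math


def mc_standard_error(alpha, n_rep):
    """Standard error of a simulated rejection rate under a true level alpha."""
    return math.sqrt(alpha * (1 - alpha) / n_rep)


n_rep = 100_000
bands = {}
for alpha in (0.01, 0.001):
    se = mc_standard_error(alpha, n_rep)
    # Rough two-standard-error band around the nominal level.
    bands[alpha] = (alpha - 2 * se, alpha + 2 * se)
```

The standard error is about 0.0003 at level 0.01 and about 0.0001 at level 0.001. Simulated rates outside roughly two standard errors of the nominal level therefore reflect more than simulation noise.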
Table 1.
Simulated type I error rate for continuous traits
| Gene | n | Sig. level | Chen | SKAT | Cond. Asy. | Perm. |
|---|---|---|---|---|---|---|
| ABCC6 | 50 | 0.01 | 0.01017 | 0.00659 | 0.00666 | 0.01039 |
| | | 0.001 | 0.00104 | 0.00038 | 0.00039 | 0.00107 |
| | 100 | 0.01 | 0.01023 | 0.00904 | 0.00910 | 0.01042 |
| | | 0.001 | 0.00093 | 0.00058 | 0.00057 | 0.00105 |
| | 200 | 0.01 | 0.00980 | 0.00909 | 0.00907 | 0.00976 |
| | | 0.001 | 0.00085 | 0.00069 | 0.00069 | 0.00091 |
| ACIN1 | 50 | 0.01 | 0.00970 | 0.00780 | 0.00798 | 0.00984 |
| | | 0.001 | 0.00097 | 0.00053 | 0.00054 | 0.00113 |
| | 100 | 0.01 | 0.01004 | 0.00890 | 0.00880 | 0.01009 |
| | | 0.001 | 0.00102 | 0.00079 | 0.00077 | 0.00113 |
| | 200 | 0.01 | 0.00969 | 0.00913 | 0.00916 | 0.00981 |
| | | 0.001 | 0.00108 | 0.00097 | 0.00096 | 0.00119 |
| AHNAK | 50 | 0.01 | 0.01013 | 0.00659 | 0.00687 | 0.01018 |
| | | 0.001 | 0.00086 | 0.00030 | 0.00030 | 0.00102 |
| | 100 | 0.01 | 0.00931 | 0.00681 | 0.00676 | 0.00950 |
| | | 0.001 | 0.00104 | 0.00065 | 0.00066 | 0.00127 |
| | 200 | 0.01 | 0.00986 | 0.00895 | 0.00897 | 0.01001 |
| | | 0.001 | 0.00125 | 0.00099 | 0.00098 | 0.00125 |
| AKAP13 | 50 | 0.01 | 0.00986 | 0.00802 | 0.00799 | 0.01004 |
| | | 0.001 | 0.00116 | 0.00059 | 0.00062 | 0.00120 |
| | 100 | 0.01 | 0.00958 | 0.00871 | 0.00875 | 0.00977 |
| | | 0.001 | 0.00102 | 0.00080 | 0.00079 | 0.00108 |
| | 200 | 0.01 | 0.00991 | 0.00953 | 0.00952 | 0.00984 |
| | | 0.001 | 0.00090 | 0.00078 | 0.00080 | 0.00093 |
| ALPK3 | 50 | 0.01 | 0.01011 | 0.00680 | 0.00695 | 0.01013 |
| | | 0.001 | 0.00100 | 0.00038 | 0.00039 | 0.00110 |
| | 100 | 0.01 | 0.00947 | 0.00802 | 0.00803 | 0.00943 |
| | | 0.001 | 0.00104 | 0.00073 | 0.00070 | 0.00117 |
| | 200 | 0.01 | 0.00972 | 0.00900 | 0.00896 | 0.00982 |
| | | 0.001 | 0.00103 | 0.00086 | 0.00086 | 0.00109 |
| ANKRD12 | 50 | 0.01 | 0.00983 | 0.00751 | 0.00736 | 0.00987 |
| | | 0.001 | 0.00082 | 0.00040 | 0.00036 | 0.00098 |
| | 100 | 0.01 | 0.01012 | 0.00893 | 0.00899 | 0.01041 |
| | | 0.001 | 0.00111 | 0.00087 | 0.00086 | 0.00125 |
| | 200 | 0.01 | 0.01009 | 0.00929 | 0.00938 | 0.01008 |
| | | 0.001 | 0.00105 | 0.00085 | 0.00086 | 0.00112 |
| ANKRD15 | 50 | 0.01 | 0.00962 | 0.00663 | 0.00679 | 0.00960 |
| | | 0.001 | 0.00095 | 0.00026 | 0.00027 | 0.00106 |
| | 100 | 0.01 | 0.01027 | 0.00886 | 0.00881 | 0.01033 |
| | | 0.001 | 0.00097 | 0.00068 | 0.00069 | 0.00109 |
| | 200 | 0.01 | 0.01026 | 0.00923 | 0.00915 | 0.01016 |
| | | 0.001 | 0.00079 | 0.00060 | 0.00059 | 0.00092 |
| ATP10A | 50 | 0.01 | 0.01003 | 0.00594 | 0.00606 | 0.01012 |
| | | 0.001 | 0.00085 | 0.00025 | 0.00029 | 0.00097 |
| | 100 | 0.01 | 0.00993 | 0.00791 | 0.00800 | 0.01014 |
| | | 0.001 | 0.00095 | 0.00060 | 0.00063 | 0.00103 |
| | 200 | 0.01 | 0.01000 | 0.00903 | 0.00909 | 0.01005 |
| | | 0.001 | 0.00098 | 0.00081 | 0.00078 | 0.00112 |
| BAIAP3 | 50 | 0.01 | 0.01021 | 0.00659 | 0.00667 | 0.01056 |
| | | 0.001 | 0.00099 | 0.00032 | 0.00029 | 0.00106 |
| | 100 | 0.01 | 0.00991 | 0.00878 | 0.00882 | 0.00997 |
| | | 0.001 | 0.00098 | 0.00070 | 0.00067 | 0.00110 |
| | 200 | 0.01 | 0.00972 | 0.00897 | 0.00897 | 0.00981 |
| | | 0.001 | 0.00093 | 0.00078 | 0.00078 | 0.00110 |
Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.
Table 2.
Simulated type I error rate for dichotomous traits
| Gene | n | Sig. level | Chen | SKAT Adj. | Cond. Asy. | Perm. |
|---|---|---|---|---|---|---|
| ABCC6 | 50 | 0.01 | 0.00962 | 0.01279 | 0.00705 | 0.00998 |
| | | 0.001 | 0.00092 | 0.00154 | 0.00044 | 0.00102 |
| | 100 | 0.01 | 0.01028 | 0.01189 | 0.00874 | 0.01053 |
| | | 0.001 | 0.00091 | 0.00116 | 0.00061 | 0.00105 |
| | 200 | 0.01 | 0.01016 | 0.01064 | 0.00939 | 0.01029 |
| | | 0.001 | 0.00097 | 0.00119 | 0.00082 | 0.00110 |
| ACIN1 | 50 | 0.01 | 0.01094 | 0.01400 | 0.00873 | 0.01090 |
| | | 0.001 | 0.00132 | 0.00164 | 0.00069 | 0.00126 |
| | 100 | 0.01 | 0.01050 | 0.01187 | 0.00952 | 0.01057 |
| | | 0.001 | 0.00110 | 0.00131 | 0.00078 | 0.00112 |
| | 200 | 0.01 | 0.01014 | 0.01066 | 0.00956 | 0.01015 |
| | | 0.001 | 0.00084 | 0.00094 | 0.00072 | 0.00091 |
| AHNAK | 50 | 0.01 | 0.00365 | 0.01537 | 0.00188 | 0.01020 |
| | | 0.001 | 0.00018 | 0.00118 | 0.00003 | 0.00106 |
| | 100 | 0.01 | 0.00642 | 0.01285 | 0.00465 | 0.01069 |
| | | 0.001 | 0.00028 | 0.00118 | 0.00013 | 0.00103 |
| | 200 | 0.01 | 0.00811 | 0.01110 | 0.00721 | 0.01060 |
| | | 0.001 | 0.00052 | 0.00133 | 0.00038 | 0.00109 |
| AKAP13 | 50 | 0.01 | 0.01116 | 0.01392 | 0.00938 | 0.01102 |
| | | 0.001 | 0.00132 | 0.00172 | 0.00070 | 0.00130 |
| | 100 | 0.01 | 0.01045 | 0.01170 | 0.00959 | 0.01036 |
| | | 0.001 | 0.00117 | 0.00138 | 0.00093 | 0.00113 |
| | 200 | 0.01 | 0.00991 | 0.01068 | 0.00954 | 0.00995 |
| | | 0.001 | 0.00110 | 0.00120 | 0.00099 | 0.00114 |
| ALPK3 | 50 | 0.01 | 0.00925 | 0.01279 | 0.00613 | 0.00947 |
| | | 0.001 | 0.00086 | 0.00141 | 0.00032 | 0.00104 |
| | 100 | 0.01 | 0.01003 | 0.01154 | 0.00826 | 0.01022 |
| | | 0.001 | 0.00091 | 0.00129 | 0.00061 | 0.00112 |
| | 200 | 0.01 | 0.00994 | 0.01025 | 0.00910 | 0.01003 |
| | | 0.001 | 0.00100 | 0.00121 | 0.00083 | 0.00114 |
| ANKRD12 | 50 | 0.01 | 0.00757 | 0.01362 | 0.00561 | 0.01013 |
| | | 0.001 | 0.00055 | 0.00150 | 0.00025 | 0.00114 |
| | 100 | 0.01 | 0.00829 | 0.01136 | 0.00713 | 0.00951 |
| | | 0.001 | 0.00045 | 0.00121 | 0.00027 | 0.00098 |
| | 200 | 0.01 | 0.00953 | 0.01097 | 0.00888 | 0.01023 |
| | | 0.001 | 0.00089 | 0.00126 | 0.00078 | 0.00114 |
| ANKRD15 | 50 | 0.01 | 0.00943 | 0.01322 | 0.00615 | 0.01008 |
| | | 0.001 | 0.00087 | 0.00143 | 0.00028 | 0.00110 |
| | 100 | 0.01 | 0.00967 | 0.01149 | 0.00781 | 0.00997 |
| | | 0.001 | 0.00098 | 0.00134 | 0.00061 | 0.00115 |
| | 200 | 0.01 | 0.01035 | 0.01106 | 0.00927 | 0.01046 |
| | | 0.001 | 0.00105 | 0.00127 | 0.00074 | 0.00120 |
| ATP10A | 50 | 0.01 | 0.00996 | 0.01313 | 0.00640 | 0.01000 |
| | | 0.001 | 0.00094 | 0.00122 | 0.00030 | 0.00098 |
| | 100 | 0.01 | 0.01014 | 0.01156 | 0.00819 | 0.01013 |
| | | 0.001 | 0.00121 | 0.00148 | 0.00064 | 0.00121 |
| | 200 | 0.01 | 0.01011 | 0.01064 | 0.00911 | 0.01018 |
| | | 0.001 | 0.00107 | 0.00118 | 0.00080 | 0.00112 |
| BAIAP3 | 50 | 0.01 | 0.00850 | 0.01323 | 0.00580 | 0.01005 |
| | | 0.001 | 0.00064 | 0.00121 | 0.00030 | 0.00097 |
| | 100 | 0.01 | 0.00930 | 0.01133 | 0.00789 | 0.01027 |
| | | 0.001 | 0.00079 | 0.00121 | 0.00055 | 0.00109 |
| | 200 | 0.01 | 0.00946 | 0.01029 | 0.00867 | 0.00977 |
| | | 0.001 | 0.00107 | 0.00128 | 0.00093 | 0.00120 |
Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.
Results of the power analysis are presented in Table 3 for continuous traits and in Table 4 for dichotomous traits. All statistics have similar performance. The conditional inference method using the asymptotic distribution is sometimes slightly less powerful than the permutation method and the Chen method. Note that in some genes, the power is not increasing in sample size. This is because the causal SNPs are selected at random within each scenario.
Table 3.
Simulated power for continuous traits
| Gene | n | Chen (level 0.01) | SKAT (level 0.01) | Cond. Asy. (level 0.01) | Perm. (level 0.01) | Chen (level 0.001) | SKAT (level 0.001) | Cond. Asy. (level 0.001) | Perm. (level 0.001) |
|---|---|---|---|---|---|---|---|---|---|
| ABCC6 | 50 | 0.318 | 0.279 | 0.278 | 0.316 | 0.083 | 0.054 | 0.052 | 0.082 |
| | 100 | 0.695 | 0.677 | 0.681 | 0.694 | 0.376 | 0.359 | 0.359 | 0.381 |
| | 200 | 0.555 | 0.544 | 0.545 | 0.551 | 0.252 | 0.243 | 0.245 | 0.255 |
| ACIN1 | 50 | 0.144 | 0.139 | 0.135 | 0.144 | 0.039 | 0.034 | 0.034 | 0.041 |
| | 100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 200 | 0.212 | 0.211 | 0.206 | 0.207 | 0.071 | 0.070 | 0.069 | 0.074 |
| AHNAK | 50 | 0.967 | 0.955 | 0.955 | 0.960 | 0.767 | 0.695 | 0.689 | 0.684 |
| | 100 | 0.999 | 0.998 | 0.998 | 0.999 | 0.992 | 0.991 | 0.989 | 0.991 |
| | 200 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| AKAP13 | 50 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 200 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| ALPK3 | 50 | 0.038 | 0.031 | 0.030 | 0.037 | 0.008 | 0.007 | 0.007 | 0.008 |
| | 100 | 0.324 | 0.315 | 0.317 | 0.325 | 0.123 | 0.109 | 0.105 | 0.121 |
| | 200 | 0.958 | 0.958 | 0.956 | 0.958 | 0.863 | 0.857 | 0.858 | 0.861 |
| ANKRD12 | 50 | 0.033 | 0.031 | 0.028 | 0.032 | 0.003 | 0.003 | 0.003 | 0.003 |
| | 100 | 0.056 | 0.055 | 0.054 | 0.056 | 0.006 | 0.006 | 0.005 | 0.007 |
| | 200 | 0.807 | 0.801 | 0.802 | 0.803 | 0.580 | 0.569 | 0.570 | 0.576 |
| ANKRD15 | 50 | 0.469 | 0.452 | 0.454 | 0.469 | 0.242 | 0.196 | 0.199 | 0.238 |
| | 100 | 0.872 | 0.868 | 0.868 | 0.875 | 0.636 | 0.605 | 0.605 | 0.639 |
| | 200 | 0.307 | 0.297 | 0.298 | 0.308 | 0.087 | 0.078 | 0.078 | 0.082 |
| ATP10A | 50 | 0.050 | 0.041 | 0.040 | 0.051 | 0.008 | 0.004 | 0.004 | 0.008 |
| | 100 | 0.164 | 0.153 | 0.152 | 0.166 | 0.043 | 0.037 | 0.038 | 0.042 |
| | 200 | 1.000 | 1.000 | 1.000 | 1.000 | 0.993 | 0.993 | 0.993 | 0.993 |
| BAIAP3 | 50 | 0.654 | 0.621 | 0.623 | 0.649 | 0.381 | 0.323 | 0.327 | 0.383 |
| | 100 | 0.137 | 0.130 | 0.132 | 0.134 | 0.019 | 0.016 | 0.017 | 0.018 |
| | 200 | 0.999 | 0.999 | 0.999 | 0.999 | 0.993 | 0.993 | 0.992 | 0.993 |
Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.
Table 4.
Simulated power for dichotomous traits
| Gene | n | Chen (level 0.01) | SKAT Adj. (level 0.01) | Cond. Asy. (level 0.01) | Perm. (level 0.01) | Chen (level 0.001) | SKAT Adj. (level 0.001) | Cond. Asy. (level 0.001) | Perm. (level 0.001) |
|---|---|---|---|---|---|---|---|---|---|
| ABCC6 | 50 | 0.142 | 0.159 | 0.122 | 0.146 | 0.039 | 0.047 | 0.018 | 0.035 |
| | 100 | 0.359 | 0.365 | 0.349 | 0.359 | 0.166 | 0.177 | 0.152 | 0.168 |
| | 200 | 0.858 | 0.856 | 0.850 | 0.857 | 0.603 | 0.608 | 0.582 | 0.609 |
| ACIN1 | 50 | 0.111 | 0.121 | 0.094 | 0.107 | 0.027 | 0.035 | 0.020 | 0.027 |
| | 100 | 0.997 | 0.997 | 0.996 | 0.996 | 0.977 | 0.979 | 0.974 | 0.976 |
| | 200 | 0.324 | 0.327 | 0.322 | 0.326 | 0.128 | 0.133 | 0.120 | 0.132 |
| AHNAK | 50 | 0.795 | 0.845 | 0.776 | 0.811 | 0.529 | 0.602 | 0.482 | 0.555 |
| | 100 | 0.988 | 0.990 | 0.988 | 0.989 | 0.947 | 0.962 | 0.944 | 0.954 |
| | 200 | 0.975 | 0.977 | 0.973 | 0.975 | 0.895 | 0.907 | 0.885 | 0.902 |
| AKAP13 | 50 | 0.989 | 0.990 | 0.983 | 0.984 | 0.920 | 0.931 | 0.903 | 0.920 |
| | 100 | 0.998 | 0.998 | 0.998 | 0.998 | 0.990 | 0.990 | 0.989 | 0.989 |
| | 200 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.999 | 0.999 |
| ALPK3 | 50 | 0.087 | 0.104 | 0.072 | 0.090 | 0.017 | 0.023 | 0.009 | 0.021 |
| | 100 | 0.195 | 0.208 | 0.188 | 0.197 | 0.071 | 0.076 | 0.062 | 0.074 |
| | 200 | 0.505 | 0.503 | 0.500 | 0.504 | 0.214 | 0.218 | 0.199 | 0.210 |
| ANKRD12 | 50 | 0.054 | 0.070 | 0.050 | 0.059 | 0.013 | 0.018 | 0.010 | 0.016 |
| | 100 | 0.255 | 0.282 | 0.249 | 0.263 | 0.072 | 0.092 | 0.063 | 0.086 |
| | 200 | 0.447 | 0.457 | 0.447 | 0.451 | 0.205 | 0.218 | 0.202 | 0.212 |
| ANKRD15 | 50 | 0.289 | 0.317 | 0.265 | 0.288 | 0.113 | 0.138 | 0.087 | 0.115 |
| | 100 | 0.536 | 0.556 | 0.524 | 0.538 | 0.318 | 0.333 | 0.296 | 0.313 |
| | 200 | 0.729 | 0.740 | 0.724 | 0.737 | 0.489 | 0.502 | 0.478 | 0.495 |
| ATP10A | 50 | 0.070 | 0.078 | 0.054 | 0.069 | 0.014 | 0.018 | 0.012 | 0.014 |
| | 100 | 0.244 | 0.251 | 0.235 | 0.242 | 0.107 | 0.108 | 0.094 | 0.104 |
| | 200 | 0.927 | 0.931 | 0.924 | 0.928 | 0.807 | 0.814 | 0.800 | 0.810 |
| BAIAP3 | 50 | 0.284 | 0.312 | 0.263 | 0.286 | 0.115 | 0.147 | 0.094 | 0.124 |
| | 100 | 0.202 | 0.230 | 0.191 | 0.220 | 0.055 | 0.070 | 0.042 | 0.065 |
| | 200 | 0.995 | 0.996 | 0.996 | 0.996 | 0.968 | 0.970 | 0.964 | 0.967 |
Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.
4 Application to OHTS
Primary open angle glaucoma (POAG) is a leading cause of irreversible blindness. Although a genetic basis has been established for a substantial fraction of POAG, no risk alleles of major effect have been identified (Fingert, 2011). The etiology of POAG is likely to be complex. Since POAG is assessed through quantitative measures such as central corneal thickness (CCT), intraocular pressure (IOP), and cup-to-disc ratio, one promising research direction is to map genes underlying these quantitative measures. Indeed, large-scale GWAS have identified genes that affect CCT (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011). We report an application of the method of Chen et al. (2016), SKAT with and without small sample correction, and the conditional inference method described in this report to a genome-wide gene-based association study of CCT averaged over both eyes using data from the Ocular Hypertension Treatment Study (OHTS).
OHTS (Gordon and Kass, 1999) is a National Eye Institute-sponsored multi-center, randomized clinical trial. Its goal is to investigate the efficacy of medical treatment in delaying or preventing the onset of POAG in individuals with elevated intraocular pressure. A total of 1636 individuals between 40 and 80 years old were enrolled and 1077 of them were genotyped in a subsequent study. Data for this genetic study are available at the Database of Genotypes and Phenotypes (dbGaP, Study Accession phs000240.v1.p1). It contains 1057 subjects who have both genotype data and baseline phenotype data available. The vast majority of these subjects were non-Hispanic White (752) or Black (252). Unpublished results have identified genetic heterogeneity between White and Black subjects. Our focus is a genome-wide association study of CCT in the Black subjects. A histogram of CCT is presented in Figure 1; the distribution is fairly symmetric.
Fig. 1.
Histogram of the average CCT for the Black subjects in OHTS
There were 1 051 295 genotyped SNPs. Their HGNC gene symbols were obtained using the R/Bioconductor package biomaRt (version 2.26.1). There are 30 562 autosomal genes. Similar to Lee et al. (2012a), genes that contain fewer than 3 SNPs were excluded from further consideration. This reduces the number of genes to 23 778. Age and gender are used as covariates. Genes at which at least one of the four statistics has a P-value less than 0.0001 are listed in Table 5. In this table, the information on biotype and base-pair position was obtained from www.ensembl.org. Interestingly, all of them except one are on chromosome 17. There are three tight gene clusters on chromosome 17 in terms of base-pair location. Cluster 1 consists of genes SENP3, SENP3-EIF4A1, EIF4A1 and SNORA48. Cluster 2 consists of KCNH4, HCRT, GHDC and STAT5B. Cluster 3 consists of BECN1, MIR6781 and PSME3. Cluster 2 is very close to cluster 3. These genes warrant further investigation, although none of them overlaps with ZNF469, COL5A1, COL8A2, AKAP13 and AVGR8, genes for which association with CCT has been reported previously (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011).
Table 5.
A summary of gene-based association P-values () for the 252 Black subjects in the OHTS study
| Chr | Biotype | Gene symbol | Start (bp) | Stop (bp) | Chen | SKAT | Conditional inference | Permutation |
|---|---|---|---|---|---|---|---|---|
| 1 | Protein coding | ZBTB40 | 22 451 851 | 22 531 157 | 19.52 | 23.28 | 23.64 | 9.999 |
| 17 | Protein coding | SENP3 | 7 561 875 | 7 571 969 | 1.897 | 2.728 | 2.971 | 9.999 |
| 17 | Protein coding | SENP3-EIF4A1 | 7 563 287 | 7 578 715 | 1.070 | 1.153 | 1.215 | 0.000 |
| 17 | Protein coding | EIF4A1 | 7 572 706 | 7 579 005 | 2.965 | 3.551 | 3.624 | 9.999 |
| 17 | snoRNA | SNORA48 | 7 574 713 | 7 574 847 | 3.376 | 4.479 | 4.733 | 0.000 |
| 17 | miRNA | MIR6779 | 38 914 979 | 38 915 042 | 1.002 | 1.742 | 1.631 | 9.999 |
| 17 | Protein coding | KCNH4 | 42 156 891 | 42 181 278 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | HCRT | 42 184 060 | 42 185 452 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | GHDC | 42 188 799 | 42 194 532 | 0.000 | 0.000 | 0.000 | 0.000 |
| 17 | Protein coding | STAT5B | 42 199 168 | 42 276 707 | 0.983 | 1.623 | 1.505 | 0.000 |
| 17 | Protein coding | BECN1 | 42 810 134 | 42 833 350 | 3.126 | 3.812 | 3.541 | 0.000 |
| 17 | miRNA | MIR6781 | 42 823 880 | 42 823 943 | 5.550 | 7.268 | 6.837 | 0.000 |
| 17 | Protein coding | PSME3 | 42 824 385 | 42 843 758 | 2.936 | 3.871 | 3.609 | 9.999 |
Note: Genes are selected if any of the four listed statistics is significant at level 0.0001.
One reviewer suggested that an example of data analysis where the phenotype is neither continuous nor binary be given. To address this suggestion, the CCT measurement is discretized according to its quartiles. The discretized CCT measurement takes four values: 1, 2, 3 and 4. For these modified data, neither the Chen method nor SKAT applies. However, the proposed method and the permutation method are still applicable. Findings from these two methods are summarized in Table 6. This list of findings is slightly longer than that from Table 5. Most of the additional findings are due to the significance of the permutation test. The proposed method is generally less significant than the permutation test. At the significance level 0.0001 used for this table, both methods are significant at gene ZBTB40 on chromosome 1 and some other genes on chromosome 17 that are seen in Table 5.
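The discretization step can be sketched as follows, using the quartile cut points returned by the Python standard library. The measurements are hypothetical, and both the quantile convention (the library default) and the treatment of values equal to a cut point are assumptions of this sketch.

```python
from statistics import quantiles


def discretize_by_quartiles(values):
    """Map each value to 1, 2, 3 or 4 according to the sample quartiles."""
    q1, q2, q3 = quantiles(values, n=4)   # three cut points

    def code(v):
        # Count how many cut points lie strictly below v (booleans add as 0/1).
        return 1 + (v > q1) + (v > q2) + (v > q3)

    return [code(v) for v in values]


# Hypothetical CCT-like measurements (micrometres).
cct = [520.0, 548.5, 561.0, 575.5, 590.0, 602.5, 533.0, 566.0]
codes = discretize_by_quartiles(cct)
```

The resulting four-level score is ordinal rather than continuous or binary, which is the setting where only the conditional approach and the permutation test remain applicable.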
Table 6.
A summary of gene-based association P-values () for the 252 Black subjects in the OHTS study
| Chr | Biotype | Gene symbol | Start (bp) | Stop (bp) | Conditional inference | Permutation |
|---|---|---|---|---|---|---|
| 1 | Protein coding | ZBTB40 | 22 451 851 | 22 531 157 | 5.640 | 9.999 |
| 1 | Protein coding | ADIPOR1 | 202 940 823 | 202 958 572 | 21.08 | 0.000 |
| 9 | snRNA | RNU6-1039P | 110 369 592 | 110 369 698 | 27.58 | 9.999 |
| 9 | Protein coding | CNTRL | 121 074 863 | 121 177 610 | 40.45 | 9.999 |
| 13 | Misc RNA | RN7SL597P | 41 121 876 | 41 122 171 | 28.56 | 9.999 |
| 14 | TR V_pseudogene | TRAV32 | 22 185 562 | 22 186 057 | 46.41 | 9.999 |
| 17 | snoRNA | SNORA48 | 7 574 713 | 7 574 847 | 34.56 | 0.000 |
| 17 | Protein coding | SENP3-EIF4A1 | 7 563 287 | 7 578 715 | 25.57 | 9.999 |
| 17 | miRNA | MIR6779 | 38 914 979 | 38 915 042 | 5.532 | 0.000 |
| 17 | Protein coding | ARL5C | 39 156 894 | 39 167 484 | 34.41 | 9.999 |
| 17 | Protein coding | KCNH4 | 42 156 891 | 42 181 278 | 0.000 | 0.000 |
| 17 | Protein coding | HCRT | 42 184 060 | 42 185 452 | 0.000 | 0.000 |
| 17 | Protein coding | GHDC | 42 188 799 | 42 194 532 | 0.000 | 0.000 |
| 17 | Protein coding | STAT5B | 42 199 168 | 42 276 707 | 0.400 | 0.000 |
| 17 | Protein coding | BECN1 | 42 810 134 | 42 833 350 | 25.97 | 9.999 |
| 17 | Protein coding | PSME3 | 42 824 385 | 42 843 758 | 23.85 | 9.999 |
| 18 | LincRNA | LINC01541 | 71 519 962 | 71 578 956 | 17.25 | 9.999 |
| 19 | Protein coding | CACNA1A | 13 206 442 | 13 633 025 | 71.19 | 9.999 |
Note: The phenotype CCT is discretized according to its quartiles. Genes are selected if either of the two listed statistics is significant at level 0.0001. Methods ‘Chen’ and ‘SKAT’ are not applicable in this situation because the phenotype is neither continuous nor binary.
5 Discussion
An approximation method is proposed for the conditional inference of the kernel association test (KAT). This approximation is based on a theoretical result developed by Strasser and Weber (1999) on the asymptotic permutation distribution of a random vector. By properly constructing a vector and a mapping function h, a small sample counterpart of the KAT is introduced. Existing applications of this theory focus on test statistics of the form $(T - \mu)^{T}\Sigma^{+}(T - \mu)$, where $\Sigma$ is the covariance matrix of $T$ and $\Sigma^{+}$ is its Moore-Penrose inverse, and are not applicable to the KAT. The work presented in this report fills this gap.
We note that the proposed method approximates the permutation distribution of KAT where the covariates are permuted together with response y. This is to satisfy the ‘permutation invariant’ requirement.
A salient feature of the proposed method is that it places no limitation on the type of the response, a benefit of using conditional inference. Regardless of the type of the response, be it continuous, dichotomous, or some other type, the conditional variance matrix of the vector assumes the same form. In practice, one may want to use this method as a screening tool followed by extensive permutations at genes that show signals.
An implementation of the proposed method is provided in the R package iGasso. The function name is KAT.coin.
Funding
Preparation of GAW 17 data was supported, in part, by the National Institutes of Health [grant R01 MH059490] and used data from the 1000 Genomes Project (www.1000genomes.org). The GAW 17 was supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. The author thanks Dr Jun Chen and Dr Wenan Chen for sharing the R code for their method proposed in Chen et al. (2016). The author also thanks the Associate Editor, Dr Alison Hutchins, and three anonymous reviewers for their useful comments.
Conflict of Interest: none declared.
References
- Almasy L. et al. (2011) Genetic Analysis Workshop 17 mini-exome simulation. BMC Proceedings, 5, S2.
- Barnett I.J. et al. (2013) Detecting rare variant effects using extreme phenotype sampling in sequencing association studies. Genet. Epidemiol., 37, 142–151.
- Cai T. et al. (2011) Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics, 67, 975–986.
- Chen H. et al. (2013) Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol., 37, 196–204.
- Chen J., Li H. (2013) Kernel methods for regression analysis of microbiome compositional data. In: Mingxiu Hu, Yi Liu, and Jianchang Lin et al. (eds) Topics in Applied Statistics. Springer, New York, pp. 191–201.
- Chen J. et al. (2012) Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28, 2106–2113.
- Chen J. et al. (2016) Small sample kernel association tests for human genetic and microbiome association studies. Genet. Epidemiol., 40, 5–19.
- Fingert J. (2011) Primary open-angle glaucoma genes. Eye, 25, 587–595.
- Gordon M.O., Kass M.A. (1999) The ocular hypertension treatment study: design and baseline description of the participants. Arch. Ophthalmol., 117, 573–583.
- Hothorn T. et al. (2006) A Lego system for conditional inference. Am. Stat., 60, 257–263.
- Lee S. et al. (2012a) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13, 762–775.
- Lee S. et al. (2012b) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet., 91, 224–237.
- Lin X. et al. (2011) Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies. Genet. Epidemiol., 35, 620–631.
- Lin X. et al. (2013) Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics, 14, 667–681.
- Lin X. et al. (2016) Test for rare variants by environment interactions in sequencing association studies. Biometrics, 72, 156–164.
- Liu D. et al. (2007) Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics, 63, 1079–1088.
- Liu D. et al. (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform., 9, 292.
- Lu Y. et al. (2010) Common genetic variants near the brittle cornea syndrome locus ZNF469 influence the blinding disease risk factor central corneal thickness. PLoS Genet., 6, e1000947.
- Strasser H., Weber C. (1999) On the asymptotic theory of permutation statistics. Math. Methods Stat., 2, 220–250.
- Vitart V. et al. (2010) New loci associated with central cornea thickness include COL5A1, AKAP13 and AVGR8. Hum. Mol. Genet., 19, 4304–4311.
- Vithana E.N. et al. (2011) Collagen-related genes influence the glaucoma risk factor, central corneal thickness. Hum. Mol. Genet., 20, 649–658.
- Wang T., Elston R.C. (2007) Improved power by use of a weighted score test for linkage disequilibrium mapping. Am. J. Hum. Genet., 80, 353–360.
- Wang X. et al. (2013) GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol., 37, 778–786.
- Wu M.C. et al. (2010) Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet., 86, 929–942.
- Wu M.C. et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89, 82–93.
- Zhao N. et al. (2015) Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. Am. J. Hum. Genet., 96, 797–807.

