Summary
Acar and Sun (2013, Biometrics, 69, 427-435) presented a generalized Kruskal-Wallis (GKW) test for genetic association studies that incorporated the genotype uncertainty and showed its robust and competitive performance compared to existing methods. We present another interesting way to derive the GKW test via a rank linear model.
Keywords: Kruskal-Wallis test, Linear model, Score test, Wald test
1 Introduction
Acar and Sun (2013) proposed a generalized Kruskal-Wallis (GKW) statistic for association testing of imputed SNPs that accounts for genotype imputation uncertainty. The GKW test is based on contrasting weighted rank sum statistics across different groups. Its asymptotic chi-square null distribution is derived using a central limit theorem for rank statistics.
In this article, we present a rank linear regression model and derive the GKW statistic as a score test statistic. The linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem. We also present an F-type statistic based on the Wald test and numerically compare its performance to that of the GKW test.
2 Rank based association test
Consider N samples drawn at random from a large population consisting of k ≥ 2 disjoint groups. Denote by G the categorical variable taking values in {1, 2,…, k}. We want to compare the k groups, of sizes n1,…, nk with n1 + ⋯ + nk = N, based on a continuous response variable Y. Denote the rank of Yj among all N samples by rj for j = 1,…, N. Suppose that we observe not G itself but probabilistic data on G: (p1j, p2j,…, pkj), where pij = Pr(Gj = i) for i = 1, 2,…, k and j = 1, 2,…, N, with p1j + ⋯ + pkj = 1.
2.1 The generalized Kruskal-Wallis (GKW) test
Acar and Sun (2013) assumed a location-shift model for the continuous outcome. Formally, letting the distribution function of Y be of the form Fi(y) = F(y − θi) for group i, we wish to test H0 : θ1 = θ2 =⋯= θk against HA: not all θi's are equal.
The GKW test is based on testing the equality of the weighted rank sums, ∑j pij rj, across groups i = 1,…, k. Acar and Sun (2013) showed the asymptotic multivariate normality of these weighted rank sums and constructed the corresponding GKW chi-square statistic.
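As a small illustration of the weighted rank sums that the GKW test contrasts, the following sketch computes ∑j pij rj for each group from simulated data. The function names, the simulated probabilities, and the no-ties assumption are ours and are not taken from Acar and Sun (2013).

```python
import numpy as np

def ranks(y):
    """Ranks 1..N of a continuous response (assumes no ties)."""
    order = np.argsort(y)
    r = np.empty(len(y), dtype=float)
    r[order] = np.arange(1, len(y) + 1)
    return r

def weighted_rank_sums(y, probs):
    """probs is N x k with rows summing to 1; returns the k sums of p_ij * r_j."""
    return probs.T @ ranks(y)

rng = np.random.default_rng(0)
N, k = 12, 3
y = rng.normal(size=N)
probs = rng.dirichlet(np.ones(k), size=N)  # simulated group probabilities (illustrative only)
wrs = weighted_rank_sums(y, probs)
```

Because each row of the probability matrix sums to one, the k weighted rank sums always add up to N(N + 1)/2, and with 0/1 probabilities they reduce to the ordinary group rank sums.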
2.2 Rank linear regression model
Another interesting way to derive the GKW test is via a rank linear model

$$r_j = \mu_1 + \sum_{i=2}^{k} \mu_i p_{ij} + \epsilon_j, \qquad j = 1, \ldots, N. \tag{1}$$
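Here μ1 represents the mean rank of the first group, and μi (i > 1) represents the mean rank difference between the i-th and the first group. The null hypothesis is H0 : μ2 = ⋯ = μk = 0, which amounts to testing H0 : θ1 = ⋯ = θk under the location-shift model of Acar and Sun (2013). For genetic association testing (k = 3), this model treats the three genotypes as categorical without imposing any genetic model. Denote R = (r1,…, rN)T, and let P be the design matrix for the rank linear model (1): Pj1 = 1, Pji = pij, i = 2,…, k. With some algebra, we can show that the Score statistic is

$$S_0 = \frac{(R - \hat\mu_0)^T P (P^T P)^{-1} P^T (R - \hat\mu_0)}{\hat\sigma_0^2},$$

where $\hat\mu_0 = N^{-1}\sum_{j=1}^{N} r_j$, and $\hat\sigma_0^2 = (N-1)^{-1}\sum_{j=1}^{N} (r_j - \hat\mu_0)^2$. In the following we show that S0 equals the GKW statistic.
Note that H = P(PTP)−1PT is a projection matrix onto the column space of P, and H is invariant under centering and scaling of the covariate columns. Let P̄ be the covariate centered and scaled design matrix: P̄j1 = 1, $\bar P_{ji} = (p_{ij} - \bar p_i)/s_i$, ∀i > 1, where $\bar p_i = N^{-1}\sum_{j} p_{ij}$ and $s_i^2 = (N-1)^{-1}\sum_{j} (p_{ij} - \bar p_i)^2$. Therefore we can also write $S_0 = U^T V^{-1} U / \hat\sigma_0^2$, where U = P̄T (R − μ̂0) and V = P̄T P̄. We can check that V11 = N, V1i = Vi1 = 0, ∀i > 1, and

$$\frac{V_{il}}{N-1} = \frac{\sum_{j} (p_{ij} - \bar p_i)(p_{lj} - \bar p_l)}{(N-1)\, s_i s_l} = \rho_{il}, \qquad i, l > 1,$$

which is the sample correlation of Pi and Pl, where Pi = (pi1,…,piN)T, and

$$U_1 = 0, \qquad U_i = \frac{\sum_{j} (p_{ij} - \bar p_i)(r_j - \hat\mu_0)}{s_i} = (N-1)\,\hat\sigma_0\,\gamma_i, \qquad i > 1.$$

Note that $\gamma_i = \sum_{j} (p_{ij} - \bar p_i)(r_j - \hat\mu_0) / \{(N-1)\, s_i \hat\sigma_0\}$ is just the sample correlation of R and Pi. Therefore a general form of the Score statistic (and hence the GKW) is

$$S_0 = (N-1)\,\Gamma^T \Sigma^{-1} \Gamma, \tag{2}$$

where Γ = (γ2,…,γk)T is the sample correlation vector of the response rank and the probabilistic covariates {Pi : i > 1}, and Σ = (ρil) is the sample correlation matrix of the corresponding probabilistic covariates, ρil = corr(Pi, Pl) (i > 1, l > 1). This general form also readily leads to the asymptotic chi-square distribution based on the standard central limit theorem argument. When k = 2, $S_0 = (N-1)\gamma_2^2$. For association testing of imputed SNPs, k = 3, and $S_0 = (N-1)(\gamma_2^2 + \gamma_3^2 - 2\rho_{23}\gamma_2\gamma_3)/(1 - \rho_{23}^2)$. These exactly equal the GKW statistic defined in Acar and Sun (2013, Section 2.2).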
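The equivalence between the projection form and the correlation form of the Score statistic can be checked numerically. The sketch below is ours rather than the authors' code: it assumes untied ranks and a null variance estimate with an N − 1 denominator, a choice under which the statistic reduces to the classical Kruskal-Wallis statistic when group membership is known. All function names and the simulated data are hypothetical.

```python
import numpy as np

def rank_vector(y):
    """Ranks 1..N of a continuous response (assumes no ties)."""
    order = np.argsort(y)
    r = np.empty(len(y))
    r[order] = np.arange(1, len(y) + 1)
    return r

def score_projection(r, probs):
    """Score statistic as a quadratic form in the projection matrix H."""
    N = len(r)
    P = np.column_stack([np.ones(N), probs[:, 1:]])  # P_j1 = 1, P_ji = p_ij for i > 1
    H = P @ np.linalg.solve(P.T @ P, P.T)
    d = r - r.mean()
    sigma0_sq = d @ d / (N - 1)  # assumption: N - 1 denominator under the null
    return d @ H @ d / sigma0_sq

def score_correlation(r, probs):
    """Score statistic from sample correlations of ranks and covariates."""
    N = len(r)
    C = np.corrcoef(np.column_stack([r, probs[:, 1:]]), rowvar=False)
    gamma, sigma = C[0, 1:], C[1:, 1:]
    return (N - 1) * (gamma @ np.linalg.solve(sigma, gamma))

def kruskal_wallis(r, groups, k):
    """Classical Kruskal-Wallis statistic for known groups 0..k-1 (no ties)."""
    N = len(r)
    ssb = sum(np.sum(groups == i) * (r[groups == i].mean() - (N + 1) / 2) ** 2
              for i in range(k))
    return 12.0 / (N * (N + 1)) * ssb

rng = np.random.default_rng(1)
N, k = 60, 3
r = rank_vector(rng.normal(size=N))
probs = rng.dirichlet(np.ones(k), size=N)  # simulated uncertain group data
groups = rng.integers(0, k, size=N)        # simulated known groups
onehot = np.eye(k)[groups]

s_proj = score_projection(r, probs)
s_corr = score_correlation(r, probs)
s_hard = score_projection(r, onehot)
kw = kruskal_wallis(r, groups, k)
```

In this sketch `s_proj` and `s_corr` agree up to floating-point error, and `s_hard` matches `kw`, which illustrates why the N − 1 normalization was assumed.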
The proposed rank linear model (1) also provides a simplified approach to testing more general hypotheses. For example, with a binary environment covariate and an imputed SNP, we can create a group variable with six categories. Detecting gene-environment interaction, or testing the overall genetic effect by leveraging gene-environment interactions (Kraft et al., 2007), using the robust sample rank can both be formulated as testing linear combinations of regression parameters under the rank linear model, and are worthy of further study.
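One possible way to build the six-category probabilistic group data is sketched below. It assumes, as a simplification of ours, that the binary environment covariate is observed without error, so that each subject's genotype probabilities are simply placed in the block of columns matching the observed environment level. The function name and column ordering are hypothetical.

```python
import numpy as np

def gxe_group_probs(env, geno_probs):
    """Combine a 0/1 environment indicator with N x 3 genotype probabilities.

    Returns an N x 6 matrix whose columns are ordered as
    (env=0, g=0), (env=0, g=1), (env=0, g=2), (env=1, g=0), (env=1, g=1), (env=1, g=2).
    """
    env = np.asarray(env)
    env_onehot = np.column_stack([env == 0, env == 1]).astype(float)
    # Outer product per subject: shape (N, 2, 3), then flatten the last two axes.
    joint = env_onehot[:, :, None] * geno_probs[:, None, :]
    return joint.reshape(len(env), 6)

rng = np.random.default_rng(2)
N = 8
env = rng.integers(0, 2, size=N)               # simulated binary environment
geno_probs = rng.dirichlet(np.ones(3), size=N)  # simulated imputation probabilities
group_probs = gxe_group_probs(env, geno_probs)
```

The resulting matrix can be used as the probabilistic group data in the rank linear model, with hypotheses of interest expressed as linear combinations of the six group parameters.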
Having shown the equivalence of the GKW and Score statistics, it is natural to consider the asymptotically equivalent Wald statistic, which is the commonly used F-statistic in linear models. For the linear regression model, the Score and Wald statistics have the same form and differ only in their variance calculation, with the Wald statistic variance computed under the full model. The Wald statistic can be checked to be $W_0 = U^T V^{-1} U / \hat\sigma_1^2$, where $\hat\sigma_1^2$ is the residual variance estimated under the full model (1).
Computationally, these tests can be computed readily and efficiently with existing linear model software by regressing the sample ranks on the probabilistic covariates.
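A minimal sketch of this computation with a generic least-squares routine is given below. It returns a Score-type statistic and the usual linear-model F-statistic; the N − 1 denominator for the null variance and the (k − 1, N − k) degrees of freedom for the F-statistic are standard conventions that we assume here, and the authors' Wald statistic may differ from this F-statistic by a constant factor. The function name and simulated data are hypothetical.

```python
import numpy as np

def score_and_f(r, probs):
    """Regress ranks on probabilistic covariates; return (score, F) statistics."""
    N, k = probs.shape
    P = np.column_stack([np.ones(N), probs[:, 1:]])  # intercept plus groups 2..k
    beta, *_ = np.linalg.lstsq(P, r, rcond=None)
    fitted = P @ beta
    d = r - r.mean()
    ss_model = np.sum((fitted - r.mean()) ** 2)  # explained sum of squares
    ss_resid = np.sum((r - fitted) ** 2)         # residual sum of squares (full model)
    score = ss_model / (d @ d / (N - 1))         # variance estimated under the null
    f_stat = (ss_model / (k - 1)) / (ss_resid / (N - k))  # variance under the full model
    return score, f_stat

rng = np.random.default_rng(3)
N, k = 50, 3
r = rng.permutation(N) + 1.0                 # simulated untied ranks
probs = rng.dirichlet(np.ones(k), size=N)    # simulated group probabilities
score, f_stat = score_and_f(r, probs)
```

Both statistics are functions of the same explained sum of squares: writing R² = score/(N − 1), the F-statistic equals {R²/(k − 1)}/{(1 − R²)/(N − k)}, so the two differ only in how the variance is estimated.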
3 Numerical studies
We follow the simulation settings of Acar and Sun (2013, Section 3.1) to evaluate the performance of the GKW statistic and the rank linear model based Score and Wald statistics for SNP association testing. Numerically, we have checked the exact equivalence of the GKW and Score statistics, and consistently observed that the Wald statistic performs slightly better than the GKW statistic (especially for relatively small sample sizes).
The rank linear model framework (1) and the general formula for the Score/GKW statistic (2) also suggest that it helps to use a normally transformed rank vector. Specifically, we use the normal-score transform of rj instead of rj, j = 1,…, N, in the rank linear model (1). In further simulation studies (not shown) with minor allele frequencies of 0.1 and 0.2 and 20% group uncertainty, we found that for the Score statistic S0, the GKW statistic, and the Wald statistic W0 alike, using the normally transformed ranks gave slightly improved power. All statistics have the correct size.
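For illustration, one common normal-score transform maps rank r to the standard normal quantile at r/(N + 1) (van der Waerden scores). This particular offset is our assumption and may differ from the transform used in the simulations; the function name is hypothetical.

```python
from statistics import NormalDist

def normal_scores(ranks):
    """Map ranks 1..N to standard normal quantiles at r / (N + 1).

    The r / (N + 1) offset is an assumed convention, not taken from the article.
    """
    n = len(ranks)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

scores = normal_scores([1, 2, 3, 4, 5])
```

The transformed scores are symmetric about zero and preserve the ordering of the ranks, so they can replace rj directly in the rank linear model.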
Acknowledgments
This research was supported in part by NIH grant GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the editor and associate editor for their valuable comments, which have greatly improved the presentation of the paper.
References
- Acar EF, Sun L. A generalized Kruskal-Wallis test incorporating group uncertainty with application to genetic association studies. Biometrics. 2013;69:427–435. doi: 10.1111/biom.12006.
- Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Human Heredity. 2007;63:111–119. doi: 10.1159/000099183.
