Author manuscript; available in PMC: 2016 Jun 1.
Published in final edited form as: Biometrics. 2014 Oct 28;71(2):556–557. doi: 10.1111/biom.12260

Reader Reaction: On the generalized Kruskal-Wallis test for genetic association studies incorporating group uncertainty

Baolin Wu 1,*, Weihua Guan 1
PMCID: PMC4454617  NIHMSID: NIHMS694003  PMID: 25351417

Summary

Acar and Sun (2013, Biometrics, 69, 427-435) presented a generalized Kruskal-Wallis (GKW) test for genetic association studies that incorporated the genotype uncertainty and showed its robust and competitive performance compared to existing methods. We present another interesting way to derive the GKW test via a rank linear model.

Keywords: Kruskal-Wallis test, Linear model, Score test, Wald test

1 Introduction

Acar and Sun (2013) proposed a generalized Kruskal-Wallis (GKW) statistic for association testing of imputed SNPs that accounts for genotype imputation uncertainty. The GKW test is based on contrasting weighted rank sum statistics across different groups. Its asymptotic chi-square null distribution is derived using a central limit theorem for rank statistics.

In this article, we present a rank linear regression model and derive the GKW statistic as a score test statistic. The linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem. We also present an F-type statistic based on the Wald test and numerically compare its performance to that of the GKW test.

2 Rank based association test

Consider $N$ samples randomly sampled from a large population consisting of $k \ge 2$ disjoint groups. Denote by $G$ the categorical variable taking values in $\{1, 2, \ldots, k\}$. We want to compare the $k$ groups, of sizes $n_1, \ldots, n_k$ with $\sum_{i=1}^{k} n_i = N$, based on a continuous response variable $Y$. Denote the rank of $Y_j$ among the overall samples as $r_j$ for $j = 1, \ldots, N$. Suppose available to us are not $G$ but probabilistic data of $G$: $(p_{1j}, p_{2j}, \ldots, p_{kj})$, where $p_{ij} = \Pr(G_j = i)$ for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, N$, with $\sum_{i=1}^{k} p_{ij} = 1$.

2.1 The generalized Kruskal-Wallis (GKW) test

Acar and Sun (2013) assumed a location-shift model for the continuous outcome. Formally, letting the distribution function of $Y$ be of the form $F_i(y) = F(y - \theta_i)$ for group $i$, we wish to test $H_0: \theta_1 = \theta_2 = \cdots = \theta_k$ against $H_A$: not all $\theta_i$'s are equal.

The GKW test is based on testing the equality of the weighted rank sums $R_i = \sum_{j=1}^{N} p_{ij} r_j$ across groups $i = 1, \ldots, k$. Acar and Sun (2013) showed the asymptotic multivariate normality of $\{R_i : i = 1, \ldots, k\}$ and constructed the corresponding GKW chi-square statistic.
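The weighted rank sums can be illustrated with a minimal Python sketch. The toy data, the function names `ranks` and `weighted_rank_sums`, and the row-per-sample layout `probs[j][i]` are assumptions made for illustration; they do not come from Acar and Sun (2013). Ties are ignored for simplicity.

```python
def ranks(y):
    """Ranks 1..N of the values in y (assumes no ties, for simplicity)."""
    order = sorted(range(len(y)), key=lambda j: y[j])
    r = [0] * len(y)
    for position, j in enumerate(order, start=1):
        r[j] = position
    return r


def weighted_rank_sums(y, probs):
    """R_i = sum_j p_ij * r_j for each group i.

    probs[j][i] holds Pr(G_j = i + 1); each row should sum to 1.
    """
    r = ranks(y)
    k = len(probs[0])
    return [sum(p[i] * rj for p, rj in zip(probs, r)) for i in range(k)]


# Hypothetical toy data: 5 samples, k = 3 groups.
y = [2.3, 0.7, 1.9, 3.1, 1.2]
probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
    [1.0, 0.0, 0.0],
    [0.1, 0.1, 0.8],
]
R = weighted_rank_sums(y, probs)
```

Because the probabilities of each sample sum to one, the weighted rank sums add up to N(N+1)/2, and with hard group labels (probabilities equal to 0 or 1) they reduce to the ordinary Kruskal-Wallis rank sums.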

2.2 Rank linear regression model

Another interesting way to derive the GKW test is via a rank linear model

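$$r_j = \mu_1 + \sum_{i=2}^{k} p_{ij}\mu_i + \varepsilon_j, \qquad \mathrm{Var}(\varepsilon_j) = \sigma^2, \qquad j = 1, \ldots, N. \tag{1}$$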

Here $\mu_1$ represents the mean rank of the first group, and $\mu_i$ ($i > 1$) represents the mean rank difference between the $i$-th and the first group. The null hypothesis is $H_0: \mu_2 = \cdots = \mu_k = 0$, which amounts to testing $H_0: \theta_1 = \cdots = \theta_k$ under the location-shift model of Acar and Sun (2013). For genetic association testing ($k = 3$), this model treats the three genotypes as categorical without imposing any genetic model. Denote $R = (r_1, \ldots, r_N)^T$, and let $P$ be the design matrix for the rank linear model (1): $P_{j1} = 1$, $P_{ji} = p_{ij}$, $i = 2, \ldots, k$. With some algebra, we can show that the Score statistic is $S_0 = \frac{1}{\hat\sigma_0^2}(R - \hat\mu_0)^T \{P(P^TP)^{-1}P^T\}(R - \hat\mu_0)$, where $\hat\mu_0 = \frac{N+1}{2}$ and $\hat\sigma_0^2 = \frac{\sum_{j=1}^{N}\{j - (N+1)/2\}^2}{N-1} = \frac{N(N+1)}{12}$. In the following we show that $S_0$ equals the GKW statistic.

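Note that $H = P(P^TP)^{-1}P^T$ is the projection matrix onto the column space of $P$, and $H$ is invariant under centering and scaling of the covariate columns. Let $\bar{P}$ be the covariate centered and scaled design matrix: $\bar{P}_{j1} = 1$, $\bar{P}_{ji} = \frac{p_{ij} - \bar{p}_i}{\sqrt{\sum_{j'=1}^{N}(p_{ij'} - \bar{p}_i)^2}}$, $\forall i > 1$, where $\bar{p}_i = \sum_{j=1}^{N} p_{ij}/N$. Therefore we can also write $S_0 = \frac{1}{\hat\sigma_0^2}(U^T V^{-1} U)$, where $U = \bar{P}^T(R - \hat\mu_0)$ and $V = \bar{P}^T\bar{P}$. We can check that $V_{11} = N$, $V_{1i} = V_{i1} = 0$, $\forall i > 1$, and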

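$$V_{il} = \frac{\sum_{j=1}^{N}(p_{ij} - \bar{p}_i)(p_{lj} - \bar{p}_l)}{\sqrt{\sum_{j=1}^{N}(p_{ij} - \bar{p}_i)^2 \sum_{j=1}^{N}(p_{lj} - \bar{p}_l)^2}}, \qquad i > 1,\ l > 1,$$

which is the sample correlation of $P_i$ and $P_l$, where $P_i = (p_{i1}, \ldots, p_{iN})^T$, and

$$U_1 = 0, \qquad U_i = \frac{\sum_{j=1}^{N} r_j p_{ij} - \frac{N+1}{2} N \bar{p}_i}{\sqrt{\sum_{j=1}^{N}(p_{ij} - \bar{p}_i)^2}}, \qquad i > 1.$$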
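The expression for $U_i$ can be checked by expanding the centered product and using $\sum_j r_j = N(N+1)/2$ and $\sum_j p_{ij} = N\bar{p}_i$. In the sketch below, $s_i$ is shorthand introduced here for the scaling constant; it is not notation from the original.

```latex
\begin{align*}
U_i &= \frac{1}{s_i}\sum_{j=1}^{N}(p_{ij}-\bar p_i)\Bigl(r_j-\tfrac{N+1}{2}\Bigr),
\qquad s_i=\sqrt{\sum_{j=1}^{N}(p_{ij}-\bar p_i)^2},\\
&= \frac{1}{s_i}\Bigl(\sum_{j=1}^{N} r_j p_{ij}
   - \tfrac{N+1}{2}\,N\bar p_i
   - \bar p_i\,\tfrac{N(N+1)}{2}
   + N\bar p_i\,\tfrac{N+1}{2}\Bigr)\\
&= \frac{1}{s_i}\Bigl(\sum_{j=1}^{N} r_j p_{ij} - \tfrac{N+1}{2}\,N\bar p_i\Bigr).
\end{align*}
```

The last two terms in the second line cancel, which leaves the stated form.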

Note that $U_i/(\sqrt{N-1}\,\hat\sigma_0)$ is just the sample correlation of $R$ and $P_i$. Therefore a general form of the Score statistic (and hence the GKW statistic) is

$$S_0 = (N-1)\,\Gamma^T \Sigma^{-1} \Gamma, \tag{2}$$

where $\Gamma = (\gamma_2, \ldots, \gamma_k)^T$ is the sample correlation vector of the response ranks and the probabilistic covariates $\{P_i : i > 1\}$, and $\Sigma = (\rho_{il})$ is the sample correlation matrix of the corresponding probabilistic covariates, $\rho_{il} = \mathrm{corr}(P_i, P_l)$ ($i > 1$, $l > 1$). This general form also readily leads to the asymptotic chi-square distribution based on the standard central limit theorem argument. When $k = 2$, $S_0 = (N-1)\gamma_2^2$. For association testing of imputed SNPs, $k = 3$, and $S_0 = \frac{N-1}{1-\rho_{23}^2}(\gamma_2^2 + \gamma_3^2 - 2\rho_{23}\gamma_2\gamma_3)$. Both expressions are exactly equal to the GKW statistic defined in Acar and Sun (2013, Section 2.2).
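The agreement between the projection form of the Score statistic and the correlation form (2) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the simulated data, the mixing weights used to mimic imputation uncertainty, and the function names are all assumptions made for illustration. The array `probs` has one row per sample and one column per group.

```python
import numpy as np


def score_stat_projection(r, probs):
    """S0 = (R - mu0)^T H (R - mu0) / sigma0^2 with H = P (P^T P)^{-1} P^T."""
    r = np.asarray(r, dtype=float)
    N = r.size
    # Design matrix of model (1): intercept plus probabilities of groups 2..k.
    P = np.column_stack([np.ones(N), probs[:, 1:]])
    centered = r - (N + 1) / 2
    # The least-squares fit equals H @ centered, without forming H explicitly.
    coef, *_ = np.linalg.lstsq(P, centered, rcond=None)
    sigma0_sq = N * (N + 1) / 12  # sample variance of the ranks 1..N (no ties)
    return float(centered @ (P @ coef) / sigma0_sq)


def score_stat_correlation(r, probs):
    """S0 = (N - 1) Gamma^T Sigma^{-1} Gamma, the form in equation (2)."""
    r = np.asarray(r, dtype=float)
    N = r.size
    # Rows are variables: the rank vector, then the covariates of groups 2..k.
    corr = np.corrcoef(np.vstack([r, probs[:, 1:].T]))
    gamma = corr[0, 1:]
    sigma = corr[1:, 1:]
    return float((N - 1) * gamma @ np.linalg.solve(sigma, gamma))


# Hypothetical simulated data: N samples, k = 3 genotype groups.
rng = np.random.default_rng(0)
N = 200
true_group = rng.choice(3, size=N, p=[0.49, 0.42, 0.09])
one_hot = np.eye(3)[true_group]
# Blend the true genotype with Dirichlet noise to mimic imputation uncertainty.
probs = 0.8 * one_hot + 0.2 * rng.dirichlet(np.ones(3), size=N)
y = 0.3 * true_group + rng.normal(size=N)
r = np.argsort(np.argsort(y)) + 1  # ranks 1..N (continuous y, so no ties)

s_proj = score_stat_projection(r, probs)
s_corr = score_stat_correlation(r, probs)
```

Under these assumptions `s_proj` and `s_corr` agree up to floating-point error, and for k = 3 both match the closed-form expression in terms of the two rank correlations and the correlation between the two covariates.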

The proposed rank linear model (1) also provides a simplified approach to testing more general hypotheses. For example, with a binary environment covariate and an imputed SNP, we can create a group variable with six categories. Detecting gene-environment interaction, or testing the overall genetic effect by leveraging gene-environment interactions (Kraft et al., 2007), using the robust sample rank can both be formulated as testing linear combinations of regression parameters under the rank linear model, and both are worthy of further study.

Having shown the equivalence of the GKW and Score statistics, it is natural to consider the asymptotically equivalent Wald statistic, which is the commonly used F-statistic in the linear model. For the linear regression model, the Score and Wald statistics have the same form and differ only in their variance calculation, with the Wald statistic variance computed under the full model. The Wald statistic can be checked to be $W_0 = \frac{1}{\hat\sigma_1^2}(R - \hat\mu_0)^T\{P(P^TP)^{-1}P^T\}(R - \hat\mu_0)$, where $\hat\sigma_1^2 = \frac{1}{N-k}\hat\sigma_0^2(N - 1 - S_0)$.
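The relation between the two statistics can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the authors' implementation: the simulated data and the function name `score_and_wald` are hypothetical, and `probs` again has one row per sample and one column per group.

```python
import numpy as np


def score_and_wald(r, probs):
    """Return (S0, W0) for the rank linear model (1)."""
    r = np.asarray(r, dtype=float)
    N, k = probs.shape
    P = np.column_stack([np.ones(N), probs[:, 1:]])
    centered = r - (N + 1) / 2
    coef, *_ = np.linalg.lstsq(P, centered, rcond=None)
    explained = float(centered @ (P @ coef))  # (R - mu0)^T H (R - mu0)
    sigma0_sq = N * (N + 1) / 12              # variance under the null model
    s0 = explained / sigma0_sq
    # Residual variance under the full model, with N - k degrees of freedom.
    sigma1_sq = sigma0_sq * (N - 1 - s0) / (N - k)
    w0 = explained / sigma1_sq
    return s0, w0


# Hypothetical simulated data: N samples, k = 3 genotype groups.
rng = np.random.default_rng(1)
N, k = 150, 3
true_group = rng.choice(k, size=N, p=[0.64, 0.32, 0.04])
probs = 0.8 * np.eye(k)[true_group] + 0.2 * rng.dirichlet(np.ones(k), size=N)
y = 0.4 * true_group + rng.normal(size=N)
r = np.argsort(np.argsort(y)) + 1  # ranks 1..N (continuous y, so no ties)

s0, w0 = score_and_wald(r, probs)
```

Two consequences of the formulas are worth noting. First, $W_0 = (N-k)S_0/(N-1-S_0)$, so the Wald statistic is a monotone function of the Score statistic and the two differ only in how they are calibrated. Second, $W_0/(k-1)$ coincides with the usual overall F-statistic from regressing the ranks on the probabilistic covariates, which is why standard linear model software can be used.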

Computationally, these tests can be computed very efficiently with existing linear model software by regressing the sample ranks on the probabilistic covariates.

3 Numerical studies

We follow the simulation settings of Acar and Sun (2013, Section 3.1) to evaluate the performance of the GKW statistic and the rank linear model based Score and Wald statistics for SNP association testing. Numerically, we have checked the exact equivalence of the GKW and Score statistics, and consistently observed that the Wald statistic performs slightly better than the GKW statistic (especially for relatively small sample sizes).

The rank linear model framework (1) and the general formula for the Score/GKW statistic (2) also suggest that it helps to use a normally transformed rank vector. Specifically, we use $\Phi^{-1}\left(\frac{r_j}{N+1}\right)$ instead of $r_j$, $j = 1, \ldots, N$, in the rank linear model (1). In further simulation studies (not shown) with minor allele frequencies of 0.1 and 0.2 and 20% group uncertainty, we found that for all of the Score statistic $S_0$, the GKW statistic, and the Wald statistic $W_0$, using the normally transformed ranks gave slightly improved power. All statistics have the right size.
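The normal transformation of the ranks can be sketched with the Python standard library alone. The function name `normal_scores` and the example ranks are hypothetical. How the centering and variance constants are handled after the transformation is not spelled out in the text; presumably they would be recomputed from the transformed scores, which is an assumption here.

```python
from statistics import NormalDist


def normal_scores(r):
    """Map ranks r_j in 1..N to Phi^{-1}(r_j / (N + 1))."""
    N = len(r)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(rj / (N + 1)) for rj in r]


# Hypothetical ranks of N = 5 samples.
ranks_example = [4, 1, 3, 5, 2]
z = normal_scores(ranks_example)
```

Dividing by N + 1 rather than N keeps every argument strictly inside (0, 1), so the largest rank maps to a finite score. The scores are symmetric about zero and preserve the ordering of the ranks.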

Acknowledgments

This research was supported in part by NIH grant GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the editor and associate editor for their valuable comments, which have greatly improved the presentation of the paper.

References

  1. Acar EF, Sun L. A generalized Kruskal-Wallis test incorporating group uncertainty with application to genetic association studies. Biometrics. 2013;69:427–435. doi: 10.1111/biom.12006.
  2. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Human Heredity. 2007;63:111–119. doi: 10.1159/000099183.
