Reader Reaction On the generalized Kruskal-Wallis test for genetic association studies incorporating group uncertainty

Baolin Wu; Weihua Guan

doi:10.1111/biom.12260

Author manuscript; available in PMC: 2016 Jun 1.

Published in final edited form as: Biometrics. 2014 Oct 28;71(2):556–557. doi: 10.1111/biom.12260


PMCID: PMC4454617 NIHMSID: NIHMS694003 PMID: 25351417

Summary

Acar and Sun (2013, Biometrics, 69, 427-435) presented a generalized Kruskal-Wallis (GKW) test for genetic association studies that incorporated the genotype uncertainty and showed its robust and competitive performance compared to existing methods. We present another interesting way to derive the GKW test via a rank linear model.

Keywords: Kruskal-Wallis test, Linear model, Score test, Wald test

1 Introduction

Acar and Sun (2013) proposed a generalized Kruskal-Wallis (GKW) statistic for association test of imputed SNPs accounting for the genotype imputation uncertainty. The GKW test is based on contrasting weighted rank sum statistics across different groups. Its asymptotic chi-square null distribution is derived using the rank statistics based central limit theorem.

In this article, we present a rank linear regression model and derive the GKW statistic as a score test statistic. The linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem. We also present an F-type statistic based on the Wald test and numerically compare its performance to the GKW test.

2 Rank based association test

Consider N samples randomly sampled from a large population consisting of k ≥ 2 disjoint groups. Denote by G the categorical variable taking values on {1, 2,…, k}. We want to compare the k groups, of sizes n₁,…, n_k with $\sum_{i = 1}^{k} n_{i} = N$ , based on a continuous response variable Y. Denote the rank of Y_j among the overall samples as r_j for j = 1,…, N. Suppose available to us are not G but probabilistic data of G: (p_1j, p_2j,…, p_kj), where p_ij = Pr(G_j = i) for i = 1, 2,⋯, k and j = 1, 2,…, N, with $\sum_{i = 1}^{k} p_{i j} = 1$ .

2.1 The generalized Kruskal-Wallis (GKW) test

Acar and Sun (2013) assumed a location-shift model for the continuous outcome. Formally, letting the distribution function of Y be of the form F_i(y) = F(y − θ_i) for group i, we wish to test H₀ : θ₁ = θ₂ =⋯= θ_k against H_A: not all θ_i's are equal.

The GKW test is based on testing the equality of the weighted rank sums $R_{i}^{*} = \sum_{j = 1}^{N} p_{i j} r_{j}$ across groups i = 1,⋯, k. Acar and Sun (2013) showed the asymptotic multivariate normality of $\{R_{i}^{*} : i = 1, \dots, k\}$ and constructed the corresponding GKW chi-square statistic.
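As a small illustration of the weighted rank sums, the following sketch computes $R_{i}^{*}$ for a toy data set. The function name `weighted_rank_sums` and the toy values are hypothetical and chosen only for illustration; the sketch assumes the responses have no ties.

```python
import numpy as np

def weighted_rank_sums(y, p):
    """Return R*_i = sum_j p_ij * r_j for each group i (hypothetical helper).

    y : (N,) continuous responses, assumed free of ties.
    p : (k, N) group probabilities; each column sums to 1.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    # Ranks 1..N of y among all samples (argsort of argsort is valid without ties).
    r = np.argsort(np.argsort(y)) + 1
    return p @ r

# Toy example with N = 4 samples and k = 2 groups.
y = np.array([2.3, 0.7, 1.9, 3.1])
p = np.array([[1.0, 0.5, 0.0, 0.2],
              [0.0, 0.5, 1.0, 0.8]])
r_star = weighted_rank_sums(y, p)
print(r_star)
```

Because each column of `p` sums to one, the weighted rank sums add up to N(N + 1)/2, which gives a quick sanity check on the input probabilities.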

2.2 Rank linear regression model

Another interesting way to derive the GKW test is via a rank linear model

$$r_{j} = \mu_{1} + \sum_{i = 2}^{k} p_{i j} \mu_{i} + \varepsilon_{j}, \quad \mathrm{Var}(\varepsilon_{j}) = \sigma^{2}, \quad j = 1, \dots, N. \qquad (1)$$

Here μ₁ represents the mean rank of the first group, and μ_i (i > 1) represents the mean rank difference between the i-th and the first group. The null hypothesis is H₀ : μ₂ = ⋯ = μ_k = 0, which amounts to testing H₀ : θ₁ = ⋯ = θ_k under the location-shift model of Acar and Sun (2013). For genetic association testing (k = 3), this model treats the three genotypes as categorical without imposing any genetic model. Denote R = (r₁,…, r_N)^T, and let P be the design matrix for the rank linear model (1): P_j1 = 1, P_ji = p_ij, i = 2,…, k. With some algebra, we can show that the Score statistic is $S_{0} = \frac{1}{{\hat{σ}}_{0}^{2}} {(R - {\hat{μ}}_{0})}^{T} {P {(P^{T} P)}^{- 1} P^{T}} (R - {\hat{μ}}_{0})$ , where ${\hat{μ}}_{0} = \frac{N + 1}{2}$ , and ${\hat{σ}}_{0}^{2} = \frac{\sum_{j = 1}^{N} \{j - (N + 1) / 2\}^{2}}{N - 1} = \frac{N (N + 1)}{12}$ . In the following we show that S₀ equals the GKW statistic.

Note that H = P(P^TP)⁻¹P^T is the projection matrix onto the column space of P, and H is invariant under centering and scaling of the non-intercept columns of P. Let P̄ be the covariate centered and scaled design matrix: P̄_j1 = 1, ${\bar{P}}_{j i} = \frac{p_{i j} - {\bar{p}}_{i}}{\sqrt{\sum_{j = 1}^{N} {(p_{i j} - {\bar{p}}_{i})}^{2}}}$ , ∀i > 1, where ${\bar{p}}_{i} = \sum_{j = 1}^{N} p_{i j} / N$ . Therefore we can also write $S_{0} = \frac{1}{{\hat{σ}}_{0}^{2}} (U^{T} V^{- 1} U)$ , where U = P̄^T (R − μ̂₀) and V = P̄^T P̄. We can check that V₁₁ = N, V_1i = V_i1 = 0, ∀i > 1, and

$$V_{i l} = \frac{\sum_{j = 1}^{N} (p_{i j} - {\bar{p}}_{i}) (p_{l j} - {\bar{p}}_{l})}{\sqrt{\sum_{j = 1}^{N} {(p_{i j} - {\bar{p}}_{i})}^{2}} \sqrt{\sum_{j = 1}^{N} {(p_{l j} - {\bar{p}}_{l})}^{2}}}, \quad \forall i > 1, l > 1,$$

which is the sample correlation of P_i and P_l, where P_i = (p_i1,…,p_iN)^T, and

$$U_{1} = 0, \quad U_{i} = \frac{\sum_{j = 1}^{N} r_{j} p_{i j} - \frac{N + 1}{2} N {\bar{p}}_{i}}{\sqrt{\sum_{j = 1}^{N} {(p_{i j} - {\bar{p}}_{i})}^{2}}}, \quad \forall i > 1.$$

Note that $U_{i} / (\sqrt{N - 1} \, {\hat{σ}}_{0})$ is just the sample correlation of R and P_i. Therefore a general form of the Score statistic (and hence the GKW) is

$$S_{0} = (N - 1) \Gamma^{T} \Sigma^{- 1} \Gamma, \qquad (2)$$

where Γ = (γ₂,…,γ_k)^T is the vector of sample correlations between the response ranks and the probabilistic covariates {P_i : i > 1}, and Σ = (ρ_il) is the sample correlation matrix of the corresponding probabilistic covariates, ρ_il = corr(P_i, P_l) (i > 1, l > 1). This general form also readily leads to the asymptotic chi-square distribution based on the standard central limit theorem argument. When k = 2, $S_{0} = (N - 1) γ_{2}^{2}$ . For association testing of imputed SNPs, k = 3, and $S_{0} = \frac{N - 1}{1 - ρ_{23}^{2}} (γ_{2}^{2} + γ_{3}^{2} - 2 ρ_{23} γ_{2} γ_{3})$ . These expressions are exactly equal to the GKW statistic defined in Acar and Sun (2013, Section 2.2).
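The agreement between the projection form of S₀ and the correlation form (2) can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the group probabilities are drawn from a Dirichlet distribution purely for convenience (this is not the simulation setting of Acar and Sun), and the responses are assumed to have no ties.

```python
import numpy as np

def score_stat_projection(r, p):
    """S0 = (R - mu0)^T H (R - mu0) / sigma0^2, with H the hat matrix of P."""
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    n = r.size
    # Design matrix with columns 1, p_2j, ..., p_kj.
    design = np.column_stack([np.ones(n), p[1:].T])
    centered = r - r.mean()            # r.mean() is (N + 1) / 2 for untied ranks
    sigma0_sq = centered @ centered / (n - 1)
    # The least-squares fitted values equal H @ centered.
    coef, *_ = np.linalg.lstsq(design, centered, rcond=None)
    fitted = design @ coef
    return (centered @ fitted) / sigma0_sq

def score_stat_correlation(r, p):
    """S0 = (N - 1) * Gamma^T Sigma^{-1} Gamma, as in formula (2)."""
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    n = r.size
    corr = np.corrcoef(np.vstack([r, p[1:]]))  # rows: ranks, P_2, ..., P_k
    gamma = corr[0, 1:]
    sigma = corr[1:, 1:]
    return (n - 1) * gamma @ np.linalg.solve(sigma, gamma)

# Illustrative data: N = 60 samples, k = 3 groups (assumed values).
rng = np.random.default_rng(0)
N = 60
p = rng.dirichlet([1.0, 1.0, 1.0], size=N).T   # shape (3, N), columns sum to 1
y = rng.normal(size=N)
r = np.argsort(np.argsort(y)) + 1              # ranks 1..N

s_proj = score_stat_projection(r, p)
s_corr = score_stat_correlation(r, p)
print(s_proj, s_corr)
```

The two values should agree up to floating-point error, and for k = 3 both should match the closed-form expression in terms of γ₂, γ₃, and ρ₂₃.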

The proposed rank linear model (1) also provides a simplified approach to testing more general hypotheses. For example, with a binary environment covariate and an imputed SNP, we can create a group variable with six categories. Detecting gene-environment interaction or testing the overall genetic effect by leveraging gene-environment interactions (Kraft et al., 2007) using the robust sample rank can both be formulated as testing the linear combination of regression parameters under the linear rank model, and are worthy of further study.

Having shown the equivalence of the GKW and Score statistics, it is natural to consider the asymptotically equivalent Wald statistic, which is the commonly used F-statistic in the linear model. For the linear regression model, the Score and Wald statistics have the same form and differ only in their variance calculation, with the Wald statistic variance computed under the full model. The Wald statistic can be checked to be $W_{0} = \frac{1}{{\hat{σ}}_{1}^{2}} {(R - {\hat{μ}}_{0})}^{T} {P {(P^{T} P)}^{- 1} P^{T}} (R - {\hat{μ}}_{0})$ , where ${\hat{σ}}_{1}^{2} = \frac{1}{N - k} {\hat{σ}}_{0}^{2} (N - 1 - S_{0})$ .
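The relation between the two variance estimates implies W₀ = S₀ (N − k) / (N − 1 − S₀), which can be verified with an ordinary least-squares fit. The following is a minimal sketch under assumptions: the helper name `score_and_wald` is hypothetical, the Dirichlet probabilities are illustrative only, and the ranks are assumed untied.

```python
import numpy as np

def score_and_wald(r, p):
    """Return (S0, W0) from regressing the ranks r on the probabilities p."""
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    n, k = r.size, p.shape[0]
    design = np.column_stack([np.ones(n), p[1:].T])
    centered = r - r.mean()
    coef, *_ = np.linalg.lstsq(design, centered, rcond=None)
    fitted = design @ coef
    ss_total = centered @ centered
    ss_reg = centered @ fitted          # (R - mu0)^T H (R - mu0)
    ss_res = ss_total - ss_reg          # residual sum of squares of the full model
    sigma0_sq = ss_total / (n - 1)      # variance estimate under the null model
    sigma1_sq = ss_res / (n - k)        # variance estimate under the full model
    return ss_reg / sigma0_sq, ss_reg / sigma1_sq

# Illustrative data (assumed values, not the setting of Acar and Sun).
rng = np.random.default_rng(1)
N, k = 40, 3
p = rng.dirichlet([1.0, 1.0, 1.0], size=N).T
y = rng.normal(size=N)
r = np.argsort(np.argsort(y)) + 1

s0, w0 = score_and_wald(r, p)
print(s0, w0, s0 * (N - k) / (N - 1 - s0))
```

Dividing W₀ by k − 1 gives the usual overall F-statistic reported by linear model software for this regression.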

Computationally, these tests can be computed readily and very efficiently with existing linear model software by regressing the sample ranks on the probabilistic covariates.

3 Numerical studies

We follow the same simulation settings as Acar and Sun (2013, Section 3.1) to evaluate the performance of the GKW statistic and the rank linear model based Score and Wald statistics for SNP association testing. Numerically, we have checked the exact equivalence of the GKW and Score statistics, and consistently observed that the Wald statistic performs slightly better than the GKW statistic (especially for relatively small sample sizes).

The rank linear model framework (1) and the general formula (2) for the Score/GKW statistic also suggest that it helps to use a normally transformed rank vector. Specifically, we use $Φ^{- 1} (\frac{r_{j}}{N + 1})$ instead of r_j, j = 1,…, N, in the rank linear model (1). In further simulation studies (not shown) with minor allele frequencies of 0.1 and 0.2 and 20% group uncertainty, we found that using the normally transformed ranks gave slightly improved power for all of the Score statistic S₀, the GKW statistic, and the Wald statistic W₀. All statistics had the correct size.
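The normal transformation of the ranks can be sketched as follows. The helper name `normal_scores` is hypothetical, the standard-library `statistics.NormalDist` is used for Φ⁻¹, and the responses are assumed to have no ties.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(y):
    """Return Phi^{-1}(r_j / (N + 1)) for the ranks r_j of y (no ties assumed)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    r = np.argsort(np.argsort(y)) + 1          # ranks 1..N
    inv_cdf = NormalDist().inv_cdf             # standard normal quantile function
    return np.array([inv_cdf(float(rj) / (n + 1)) for rj in r])

# Toy example with N = 3: ranks are (3, 1, 2).
y = np.array([2.3, 0.7, 1.9])
z = normal_scores(y)
print(z)
```

The transformed scores keep the ordering of the responses and are symmetric about zero, so they can replace r_j as the response in the rank linear model without changing the design matrix.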

Acknowledgments

This research was supported in part by NIH grant GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the editor and associate editor for their valuable comments, which have greatly improved the presentation of the paper.

References

Acar EF, Sun L. A generalized Kruskal-Wallis test incorporating group uncertainty with application to genetic association studies. Biometrics. 2013;69:427–435. doi: 10.1111/biom.12006.
Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Human Heredity. 2007;63:111–119. doi: 10.1159/000099183.

