A Note on Testing and Estimation in Marker-Set Association Study Using Semiparametric Quantile Regression Kernel Machine

Xiang Zhan; Michael C Wu

doi:10.1111/biom.12785

Author manuscript; available in PMC: 2018 Jun 27.

Published in final edited form as: Biometrics. 2017 Nov 2;74(2):764–766. doi: 10.1111/biom.12785

PMCID: PMC5932287 NIHMSID: NIHMS907051 PMID: 29096048

Summary

Kong et al. (2016, Biometrics 72, 364–371) presented a quantile regression kernel machine (QRKM) test for robust analysis of genetic marker-set association studies. A potential limitation of QRKM is that its permutation-based test design may not scale to the massive sizes of modern datasets. In this article, we present an alternative strategy for p-value calculation in QRKM, which speeds up the QRKM testing procedure dramatically while maintaining the same testing performance as QRKM. The effectiveness of our approach is demonstrated via simulation studies.

Keywords: Fast permutation test, Genetic marker-set association, Kernel machines, Quantile regression

1. Introduction

In an excellent article on testing of marker-set association analysis (Kong et al., 2016), the authors developed a semiparametric quantile regression kernel machine (QRKM) test to examine the association between the conditional quantiles of a response variable and a group of genetic markers. In particular, the QRKM framework models the effect of covariates on a given conditional quantile of the response parametrically and the genetic marker-set effect on the conditional quantile of the response nonparametrically. Then a score-type test statistic is derived from the subgradient of the check function, and the null distribution of this statistic is obtained via permutations. The QRKM test has some potential computational limitations. First, many permutations are needed when a stringent p-value threshold is required, as in many genome-wide association studies. Second, the permutation-based test design does not scale to big datasets with sample sizes in the thousands, which are frequently encountered in modern statistical sciences. The purpose of this article is to present a computationally motivated fast QRKM testing strategy that addresses these computational limitations.

2. Methods

2.1 Review of QRKM

Let (y, X, Z) denote the triplet of response, genetic markers and covariates. The following quantile regression model was considered to relate the response to genetic markers and covariates:

Q_{y}(τ | X, Z) = f(X) + Zβ + β_{0} + Q_{ε}(τ),    (1)

where f(·) is some centered unknown function quantifying the effect of the marker-set, β is the vector of regression coefficients for the covariates, and Q_ε(τ) is the τ-th quantile of ε for τ ∈ (0, 1). By assuming that f lies in some function space H_K spanned by a certain kernel machine, the QRKM fits the quantile regression model (1) via the minimization problem

min_{β_{0}, β, f \in H_{K}} \sum_{i = 1}^{n} ρ_{τ}(y_{i} - f(X_{i}) - Z_{i}^{'} β - β_{0}) + P_{λ}(f),    (2)

where ρ_τ(t) = t(τ − I(t ≤ 0)) is the check function, I(·) is the indicator function, and P_λ(f) is a penalty function which regulates the smoothness of f(·). Both an estimation and a testing procedure for β and f(·) were proposed (Kong et al., 2016); we focus on the testing part in this article.
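As a small illustration of the check function in (2), the sketch below evaluates ρ_τ and confirms on toy data that minimizing the summed check loss over a constant recovers a sample quantile. The function names `check_loss` and `empirical_risk` and the toy data are illustrative assumptions, not part of QRKM.

```python
def check_loss(t, tau):
    """Check function rho_tau(t) = t * (tau - I(t <= 0))."""
    indicator = 1.0 if t <= 0 else 0.0
    return t * (tau - indicator)


def empirical_risk(q, ys, tau):
    """Sum of check losses when every response is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in ys)


# Toy data (assumption): the integers 1..10 as responses.
ys = [float(v) for v in range(1, 11)]
tau = 0.75

# Search over the observed values for the constant that minimizes the risk.
best_q = min(ys, key=lambda q: empirical_risk(q, ys, tau))
print(best_q)
```

Positive residuals are weighted by τ and non-positive residuals by 1 − τ, which is what pulls the minimizer toward the τ-th quantile rather than the mean.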

The joint effect of the marker-set is tested as H₀ : f(·) = 0 via a score-type statistic T = w′Kw, where w = (w₁, …, w_n)′ is a vector of binary random variables taking values τ or τ − 1 depending on whether the residuals of the null quantile regression model Q_y(τ | Z) = Zβ + β₀ + Q_ε(τ) are positive or not, and K is an n × n kernel matrix calculated from the marker-set. Due to the non-smooth check loss, the null distribution of T = w′Kw is obtained via permutation. In particular, QRKM first fits model (2) to the full data {(y_i, X_i, Z_i), i = 1, …, n} and obtains the full model residuals. Next, a permutation of the full model residuals is used to generate pseudo responses $y_{i}^{*}$. Using the pseudo responses, the null quantile regression model is fitted to ($y_{i}^{*}$, Z_i) and used to calculate w* and T* = w*′Kw*. The permutation procedure is repeated N times to calculate $\{T_{1}^{*}, \dots, T_{N}^{*}\}$. Finally, the p-value is obtained by comparing the observed QRKM statistic T = w′Kw to the permuted statistics $\{T_{1}^{*}, \dots, T_{N}^{*}\}$.
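The construction of w and T can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not in the original: there are no covariates Z, so the null fit reduces to the sample τ-quantile of y, and K is a linear kernel on toy genotype codes. The helper names `score_weights` and `qrkm_statistic` are hypothetical, and the residual-permutation loop of QRKM is not reproduced.

```python
import numpy as np


def score_weights(residuals, tau):
    """w_i = tau if the null-model residual is positive, tau - 1 otherwise."""
    residuals = np.asarray(residuals, dtype=float)
    return np.where(residuals > 0, tau, tau - 1.0)


def qrkm_statistic(w, K):
    """Score-type statistic T = w' K w."""
    return float(w @ K @ w)


rng = np.random.default_rng(0)
n, p, tau = 50, 5, 0.5
X = rng.integers(0, 3, size=(n, p)).astype(float)  # toy genotype codes 0/1/2
y = rng.normal(size=n)

# Assumption: intercept-only null model, fitted by the sample tau-quantile.
residuals = y - np.quantile(y, tau)
w = score_weights(residuals, tau)

K = X @ X.T  # linear kernel, one possible choice of kernel matrix
T = qrkm_statistic(w, K)
print(T)
```

With covariates, the residuals would instead come from a fitted linear quantile regression of y on Z; the rest of the computation is unchanged.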

2.2 Fast Permutation

The computational burden of the original QRKM test comes from three sources: 1) fitting the full QRKM model once; 2) fitting the null quantile regression model to ($y_{i}^{*}$, Z_i) N times; and 3) calculating the permuted statistic T* N times. To reduce the computational cost by removing the need for explicit permutations, we present the fast QRKM testing procedure below.

According to the dual representation of Kong et al. (2016), $f(X) = λ^{-1} \sum_{i = 1}^{n} θ_{i} K(X, X_{i})$, where K(·, ·) represents the kernel function. Since f(·) is assumed to be a centered function in QRKM, K(·, ·) should be centered as well. In matrix terms, we assume K to be a centered kernel matrix. In other words, we consider a slightly different QRKM statistic T = w′HKHw = tr(HKHww′), where H = I − 11′/n is the n-th order centering matrix and tr(·) denotes the trace of a matrix. With this centered statistic, we are able to apply fast permutation testing strategies (Josse, Pagès, and Husson, 2008) developed for RV-type statistics of the form tr(HAA′HBB′), where A and B are two arbitrary input matrices. To see this, decompose the kernel matrix as K = ΦΦ′ (such a Φ exists since K is symmetric and positive semi-definite). If we set A = Φ and B = w, then the centered QRKM statistic T = tr(HKHww′) reduces to an RV-type statistic, and thus the existing fast permutation RV test designs (Josse et al., 2008) can be directly applied to our new QRKM statistic T = tr(HKHww′).
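The identities in this paragraph can be checked numerically. The sketch below builds a toy positive semi-definite K, factors it as K = ΦΦ′ through an eigendecomposition (one of several possible factorizations, chosen here for illustration), and confirms that the quadratic form, the trace form, and the RV-type form agree. The data are simulated and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, tau = 30, 4, 0.8
X = rng.normal(size=(n, p))
K = X @ X.T                                    # symmetric PSD kernel matrix
w = np.where(rng.normal(size=n) > 0, tau, tau - 1.0)

H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - 11'/n

# Factor K = Phi Phi' from the eigendecomposition K = V diag(d) V'.
eigval, eigvec = np.linalg.eigh(K)
eigval = np.clip(eigval, 0.0, None)            # remove tiny negative round-off
Phi = eigvec * np.sqrt(eigval)                 # scales column j by sqrt(d_j)

A = Phi                                        # first RV input
B = w.reshape(-1, 1)                           # second RV input, n x 1

T_quadratic = float(w @ H @ K @ H @ w)                     # w'HKHw
T_trace = float(np.trace(H @ K @ H @ np.outer(w, w)))      # tr(HKHww')
T_rv = float(np.trace(H @ A @ A.T @ H @ B @ B.T))          # tr(HAA'HBB')
print(T_quadratic, T_trace, T_rv)
```

The three printed values agree up to floating-point error, which is the reduction to an RV-type statistic described above.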

Specifically, instead of explicitly drawing permutations and recalculating the test statistic, the fast RV permutation strategies approximate the null permutation distribution by a simpler, known distribution. Let $K_{b}^{*}$ denote the b-th permuted kernel matrix, obtained by shuffling rows and columns of K simultaneously. Then the null empirical distribution over all permutations, $\{T_{b}^{**} = tr(H K_{b}^{*} H w w'), b = 1, \dots, n!\}$, is approximated by a simpler distribution via moment matching. Since w is calculated under the null model without X, shuffling rows and columns of K (which depends only on X) does not affect the null distribution of w. In other words, the $T_{b}^{**}$'s have the same null distribution as the test statistic T, and therefore it is reasonable to calculate the p-value using the empirical distribution of $\{T_{b}^{**}, b = 1, \dots, n!\}$. To approximate this empirical permutation distribution, several simpler distributions (normal, log-normal, Edgeworth and Pearson type III) have been used (Josse et al., 2008). Among these approximations, the Pearson type III distribution provides the most accurate and efficient performance and hence is adopted to approximate the empirical distribution of the $T_{b}^{**}$'s. Since analytical forms of the moments of $\{T_{b}^{**}, b = 1, \dots, n!\}$ are available (see Josse et al. (2008) and references therein), a closed-form Pearson type III density can be calculated. Finally, this closed-form density allows fast, analytical p-value calculation in our new fast QRKM method.
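To make the permutation distribution concrete, the sketch below enumerates all n! simultaneous row and column shuffles of K for a very small toy example (n = 6) and compares the empirical mean of the permuted statistics with the closed-form first moment tr(HKH) tr(Hww′H)/(n − 1), a standard result for statistics of this form. Only the first moment is checked; the variance and skewness formulas used for the Pearson type III fit are given in Josse et al. (2008) and are not reproduced here. The data and names are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, tau = 6, 0.8
X = rng.normal(size=(n, 3))
K = X @ X.T

# Assumed residual signs from a null fit (toy values).
signs = np.array([1, -1, 1, 1, -1, 1])
w = np.where(signs > 0, tau, tau - 1.0)

H = np.eye(n) - np.ones((n, n)) / n
W = np.outer(w, w)


def permuted_stat(perm):
    """T** = tr(H K* H ww'), with K* obtained by shuffling rows and columns of K."""
    idx = np.array(perm)
    K_star = K[np.ix_(idx, idx)]
    return float(np.trace(H @ K_star @ H @ W))


# Exhaustive enumeration is feasible only for tiny n (here 6! = 720).
stats = np.array([permuted_stat(p) for p in itertools.permutations(range(n))])

# Closed-form permutation mean of the centered statistic.
exact_mean = np.trace(H @ K @ H) * np.trace(H @ W @ H) / (n - 1)

T_obs = permuted_stat(tuple(range(n)))          # identity permutation
# Small tolerance so that permutations tied with T_obs are counted.
p_value = float(np.mean(stats >= T_obs - 1e-9))
print(stats.mean(), exact_mean, p_value)
```

For realistic n the enumeration is impossible, which is why the analytical moments and the Pearson type III approximation are used in place of the empirical distribution.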

3. Numerical Studies

We follow the same simulation setup and set the number of permutations to N = 50,000 for the QRKM test, as in Kong et al. (2016). For ease of presentation, we refer to our new testing strategy as fast QRKM (FQRKM) in this section. The type I error is presented in Table 1. Based on the table, the type I error of FQRKM with τ = 0.1 is slightly inflated when n = 200. Additional numerical studies (see Web Appendix A) show that the FQRKM τ = 0.1 test tends to have the correct type I error with a larger sample size. On the other hand, both FQRKM τ = 0.5 and FQRKM τ = 0.8 have reasonable type I errors. Finally, the average computing time reported in Table 1 shows that FQRKM is about 6,000 times faster than QRKM when n = 200 (the computing time of QRKM and FQRKM was counted after both w and K were calculated from the observed samples).

Table 1.

Type I errors of the QRKM test and the FQRKM test with n = 200. The simulation is based on 10,000 Monte Carlo replicates. The empirical type I error was evaluated under nominal levels α = 0.001, 0.01 and 0.05, respectively. Type I errors outside the 95% CI $[α - 1.96 \sqrt{α (1 - α) / 10000}, α + 1.96 \sqrt{α (1 - α) / 10000}]$ are typed in bold. The average computing time (ACT) is in seconds.

| Distribution | Method | α = 0.001 | α = 0.01 | α = 0.05 | ACT (s) |
|---|---|---|---|---|---|
| Normal | QRKM τ = 0.1 | 0.0012 | 0.0095 | 0.0482 | 293.6 |
| | QRKM τ = 0.5 | 0.0009 | 0.0113 | 0.0489 | 257.2 |
| | QRKM τ = 0.8 | 0.0006 | 0.0084 | 0.0467 | 280.5 |
| | FQRKM τ = 0.1 | 0.0015 | **0.0132** | 0.0501 | 0.051 |
| | FQRKM τ = 0.5 | 0.0009 | 0.0100 | 0.0489 | 0.051 |
| | FQRKM τ = 0.8 | 0.0008 | 0.0095 | 0.0470 | 0.051 |
| t | QRKM τ = 0.1 | 0.0012 | 0.0092 | 0.0483 | 297.4 |
| | QRKM τ = 0.5 | 0.0007 | 0.0085 | 0.0500 | 262.8 |
| | QRKM τ = 0.8 | 0.0010 | 0.0099 | 0.0497 | 284.4 |
| | FQRKM τ = 0.1 | 0.0016 | **0.0138** | 0.0537 | 0.052 |
| | FQRKM τ = 0.5 | 0.0011 | 0.0092 | 0.0499 | 0.052 |
| | FQRKM τ = 0.8 | 0.0008 | 0.0110 | 0.0524 | 0.051 |
| $χ_{1}^{2} - 1$ | QRKM τ = 0.1 | 0.0009 | 0.0082 | 0.0461 | 298.7 |
| | QRKM τ = 0.5 | 0.0012 | 0.0106 | 0.0532 | 258.3 |
| | QRKM τ = 0.8 | 0.0012 | 0.0107 | 0.0501 | 280.7 |
| | FQRKM τ = 0.1 | 0.0012 | **0.0143** | **0.0549** | 0.051 |
| | FQRKM τ = 0.5 | 0.0011 | 0.0112 | 0.0531 | 0.051 |
| | FQRKM τ = 0.8 | 0.0011 | 0.0115 | 0.0500 | 0.051 |
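The confidence-interval rule in the caption of Table 1 can be reproduced with a few lines of arithmetic. The sketch below computes the 95% interval for each nominal level with 10,000 replicates and checks the FQRKM τ = 0.1 rows of Table 1 against it. The helper name `type1_ci` is hypothetical, and the label "t" for the second error distribution is inferred from the Figure 1 caption.

```python
import math


def type1_ci(alpha, n_rep=10_000, z=1.96):
    """95% interval alpha +/- z * sqrt(alpha * (1 - alpha) / n_rep)."""
    half = z * math.sqrt(alpha * (1 - alpha) / n_rep)
    return alpha - half, alpha + half


levels = [0.001, 0.01, 0.05]

# FQRKM tau = 0.1 type I errors copied from Table 1, one list per error distribution.
fqrkm_tau01 = {
    "Normal": [0.0015, 0.0132, 0.0501],
    "t": [0.0016, 0.0138, 0.0537],          # label inferred from the Figure 1 caption
    "chi-square": [0.0012, 0.0143, 0.0549],
}

flagged = []
for dist, rates in fqrkm_tau01.items():
    for alpha, rate in zip(levels, rates):
        lo, hi = type1_ci(alpha)
        if not lo <= rate <= hi:
            flagged.append((dist, alpha, rate))

for alpha in levels:
    lo, hi = type1_ci(alpha)
    print(f"alpha={alpha}: [{lo:.5f}, {hi:.5f}]")
print(flagged)
```

Under this rule the FQRKM τ = 0.1 entries fall outside the interval at α = 0.01 for all three error distributions, and at α = 0.05 only for the chi-square errors.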

The power of QRKM and FQRKM at the significance level α = 0.05 is presented in Figure 1. Because the type I error of FQRKM τ = 0.1 might be slightly inflated under α = 0.05, and also for ease of presentation, we omit the power curves for τ = 0.1. Based on Figure 1, the power difference between the QRKM test and the FQRKM test at the same quantile is negligible under each scenario. To conclude, FQRKM can serve as a useful QRKM-equivalent test when computational expedience is desired.

Figure 1. Power curves of QRKM and FQRKM for quantiles τ = 0.5 and 0.8 based on 500 Monte Carlo replications. The left, middle and right columns correspond to normal, t and chi-square error distributions, respectively. The first row corresponds to function f₁ and the second row to f₂, where f₁ and f₂ are as specified in Kong et al. (2016). The solid and dashed lines correspond to τ = 0.5 and 0.8, respectively. ○ represents QRKM and Δ represents FQRKM.

Supplementary Material

Supp info

NIHMS907051-supplement-Supp_info.pdf^{(112.6KB, pdf)}

Acknowledgments

This research was supported by NIH Grants U10 CA180819 and R01 HG007508 and the Hope Foundation. The authors thank the associate editor and reviewers for thoughtful and constructive comments that have helped improve the article.

Footnotes

Supplementary Materials

The Web Appendix referenced in Section 3, which includes additional numerical results, and R code implementing FQRKM are available with this article at the Biometrics website on Wiley Online Library.

References

Josse J, Pagès J, Husson F. Testing the significance of the RV coefficient. Computational Statistics & Data Analysis. 2008;53:82–91.
Kong D, Maity A, Hsu FC, Tzeng JY. Testing and estimation in marker-set association study using semiparametric quantile regression kernel machine. Biometrics. 2016;72:364–371. doi: 10.1111/biom.12438.
