Abstract
The quantile varying coefficient (VC) model can flexibly capture dynamical patterns of regression coefficients. In addition, due to the quantile check loss function, it is robust against outliers and heavy-tailed distributions of the response variable, and can provide a more comprehensive picture of modeling via exploring the conditional quantiles of the response variable. Although extensive studies have been conducted to examine variable selection for the high-dimensional quantile varying coefficient models, the Bayesian analysis has been rarely developed. The Bayesian regularized quantile varying coefficient model has been proposed to incorporate robustness against data heterogeneity while accommodating the non-linear interactions between the effect modifier and predictors. Selecting important varying coefficients can be achieved through Bayesian variable selection. Incorporating the multivariate spike-and-slab priors further improves performance by inducing exact sparsity. The Gibbs sampler has been derived to conduct efficient posterior inference of the sparse Bayesian quantile VC model through Markov chain Monte Carlo (MCMC). The merit of the proposed model in selection and estimation accuracy over the alternatives has been systematically investigated in simulation under specific quantile levels and multiple heavy-tailed model errors. In the case study, the proposed model leads to identification of biologically sensible markers in a non-linear gene-environment interaction study using the NHS data.
Keywords: Bayesian variable selection, Quantile regression, Markov Chain Monte Carlo, Robustness, Varying coefficient model
1. Introduction
The quantile varying coefficient model (Kim (2007)) has two defining characteristics. First, it can safeguard against heavy-tailed distributions and outliers due to the robustness of the check loss function in quantile regression. Compared to modeling based on conditional means, the check loss also makes a more comprehensive modeling of the data feasible. Second, the quantile varying coefficient model can account for the dynamic effects of predictors on the response variable. As in the varying coefficient model (Hastie and Tibshirani (1993)) from which it is inherited, its regression coefficients are nonparametric functions of other variables, or effect modifiers, so the dynamic influences of the predictors can be properly captured through the varying coefficients. Therefore, the quantile varying coefficient model enjoys wide popularity and application in a broad spectrum of scientific research areas due to its robustness, superior flexibility and interpretability. For example, in the gene-environment interaction analysis (Zhou et al. (2021)) of the Nurses' Health Study (NHS) data conducted in Section 5 of this paper, we aim at addressing the scientific question of how the effects of genetic factors, which are single nucleotide polymorphisms or SNPs, on the change in body mass index (BMI) are influenced by age. The exploratory data analysis in Figure 1 clearly shows the skewness in the response variable BMI, and nonlinear interactions between SNP rs13001304 and age (the effect modifier), which justifies the use of the quantile VC model.
Figure 1:
Distribution of the BMI (left) and non-linear interaction effect of SNP rs13001304 (right) from the NHS data. The blue dashed lines denote the 95% credible interval.
With a large number of genetic factors, identification of important gene-environment interactions naturally leads to a sparse high-dimensional problem. Regularized variable selection has been extensively studied for quantile varying coefficient models. For example, Noh et al. (2012) have developed a regularization procedure based on second order cone programming, in which the selection of important varying coefficients amounts to group level selection of the spline coefficients with a group SCAD penalty. In longitudinal studies, Tang et al. (2013) have developed an adaptive LASSO based variable selection method for quantile varying coefficient models, where the group level spline coefficients are penalized via shrinkage of their norm. Tang et al. (2012) have further examined structural identification of varying coefficients by separating the varying, nonzero constant and zero effects in quantile regression. All these studies have established the asymptotic properties of the corresponding regularized estimators in terms of (1) consistency in variable selection; that is, the proposed methods can identify nonzero quantile varying coefficient functions with probability approaching 1, and (2) the rate of convergence of the nonzero quantile varying coefficient functions. However, they have not developed the asymptotic distributions of the regularized estimators. On the other hand, Dai and Kolar (2021) have established asymptotic normality and estimation consistency for a sparse kernel estimator that approximates quantile VC functions, but consistency in variable selection has not been established for this estimator.
From the Bayesian perspective, variable selection for quantile varying coefficient models has not been well developed yet. One advantage of fully Bayesian methods is that exact posterior inference can be conducted through MCMC algorithms, even under small sample sizes. Therefore, the Bayesian analysis can provide additional insight over existing frequentist approaches, including statistical inference based on credible intervals of the quantile varying coefficient functions. As the general framework for penalized (robust) variable selection can be formulated as “(robust) loss function + penalty function” (Wu and Ma (2015); Wu et al. (2019)), choosing the appropriate likelihood function and sparsity inducing priors, which correspond to the (robust) loss function and penalty terms respectively, has been shown to be an effective way to develop Bayesian hierarchical models (Casella et al. (2010); Park and Casella (2008)). For Bayesian quantile regression, Yu and Moyeed (2001) have proposed using the asymmetric Laplace distribution (ALD) as the likelihood function. Li et al. (2010) have further developed the Bayesian regularized quantile regression by adopting the univariate and multivariate conditional Laplace priors as sparse priors. A major limitation of the conditional Laplace prior is that it does not shrink coefficients to exactly 0, which has motivated Ren et al. (2022) to consider incorporating the spike-and-slab priors in bi-level selection for the Bayesian least absolute deviation (LAD) regression, a special case of the Bayesian penalized quantile regression at the 50% quantile level. These methods are of a parametric nature, and cannot be adopted for analyzing the quantile varying coefficient models.
In the literature, nonparametric Bayesian variable selection has been examined in varying coefficient models. Li et al. (2015) have developed the Bayesian group LASSO for varying coefficient models in longitudinal studies. In gene-environment interaction studies, Ren et al. (2019) have examined sparse structure identification for Bayesian partially linear varying coefficient models. Both works have developed Gibbs samplers for posterior sampling and inference. As their likelihood functions are based on the normal distribution, neither is robust to long-tailed distributions and outliers in the response variable.
To the best of our knowledge, Bayesian regularized variable selection in quantile regression models with varying coefficients has not been well studied. As the quantile VC model can be further extended to a large family of non-/semi-parametric models (Lv and Li (2020); Ma and Song (2015); Wang et al. (2009)), it is not feasible to investigate these models within the Bayesian framework if the cornerstone model in this family has not been fully understood from the Bayesian perspective. Therefore, to fill this gap, we have developed a novel regularized Bayesian quantile varying coefficient model. The proposed model shares the two aforementioned defining characteristics of the quantile varying coefficient model within the Bayesian framework by accommodating the heavy-tailed errors and outlying observations in the response while flexibly modeling the nonlinear interactions between the predictor and the effect modifying variable. Selection of important varying coefficients can be efficiently conducted through group level Bayesian variable selection. Incorporation of the multivariate spike-and-slab priors in our model promotes identification of important effects with exact sparsity, thus further improving the performance in identification and estimation. The Bayesian hierarchical model leads to a Gibbs sampler which facilitates fast posterior inference based on MCMC algorithms. We have implemented the proposed and alternative methods in the R package pqrBayes, available on the corresponding author’s GitHub page (https://github.com/cenwu/pqrBayes). The core modules of the R package have been developed in C++. The package will be available on CRAN shortly.
2. Statistical Methods
2.1. The Quantile Varying Coefficient Model
Let , be independent and identically distributed random vectors, where is the response, is the univariate index variable, denotes the -dimensional design vector with the first element being 1, and is the -dimensional design vector. In particular, is of high dimensionality (e.g., denoting gene expressions), and represents low dimensional clinical factors. At a given quantile level , we consider the following quantile varying coefficient model:
| (1) |
where is the component of , is the component of , and are unknown smooth varying-coefficient functions. The quantile of random error equals 0. The quantile varying coefficient model enjoys the flexibility in that the high dimensional predictors are linearly associated with the response, but the corresponding regression coefficients vary with the univariate index variable . It frequently rises in many applications that only a subset of predictors among are relevant to the response variable in model (1), motivating the variable selection for quantile varying coefficient models. Here, stands for the low dimensional clinical and environmental factors that are pre-determined as important covariates and not subject to selection. Without loss of generality, we assume that the index variable . Besides, we omit the subscript hereafter for simplicity of notation.
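As an illustration, a model of the kind described in this subsection can be written as below. The symbol names ($Y_i$, $V_i$, $X_{ij}$, $E_{ik}$, $\beta_k$, $\gamma_j$, $\varepsilon_i$, $p$, $q$) are assumptions introduced for this sketch and need not match the notation of model (1); only the structure (constant effects for the clinical factors, varying effects for the high-dimensional predictors, and a zero $\tau$th error quantile) is taken from the text.

```latex
% Sketch of a quantile varying coefficient model; all symbol names are assumed.
% X_{i0} = 1, so \gamma_0(\cdot) plays the role of a varying intercept.
Y_i = \sum_{k=1}^{q} E_{ik}\,\beta_k
      + \sum_{j=0}^{p} \gamma_j(V_i)\,X_{ij}
      + \varepsilon_i ,
\qquad
P(\varepsilon_i \le 0 \mid V_i, X_i, E_i) = \tau ,
\qquad i = 1, \dots, n .
```

Under this form, selecting important predictors means deciding which of the functions $\gamma_j(\cdot)$ are identically zero.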
2.2. The Bayesian formulation of the Quantile Varying Coefficient Model
To formulate the Bayesian quantile varying coefficient model, we begin with approximating the varying coefficient function in model (1) through basis expansion using polynomial splines. Denote as the number of uniform interior knots, and as the degree of the polynomial. Then and 2 correspond to the linear and quadratic splines respectively, and so on. Let be a set of normalized B-spline basis with (Schumaker (2007)). Then for , we have the following approximations
where is the spline coefficient vector. Subsequently, model (1) becomes
| (2) |
where .
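The basis expansion step can be sketched in plain Python using the Cox-de Boor recursion. This is a minimal illustration rather than the package implementation: the function name `bspline_basis` and its arguments are hypothetical, the knots are assumed to be uniform on [0, 1] with clamped boundaries, and the number of basis functions is the standard count of interior knots plus degree plus one.

```python
def bspline_basis(v, n_interior, degree):
    """Evaluate all B-spline basis functions at a point v in [0, 1].

    Assumes uniform interior knots and clamped (repeated) boundary knots.
    Returns a list of length n_interior + degree + 1 that sums to 1.
    """
    interior = [(i + 1) / (n_interior + 1) for i in range(n_interior)]
    knots = [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)
    n_spans = len(knots) - 1

    # Degree-0 basis: indicator of the knot span containing v.
    # The last non-empty span is closed on the right so that v = 1 is covered.
    basis = [0.0] * n_spans
    for i in range(n_spans):
        left, right = knots[i], knots[i + 1]
        if left < right and (left <= v < right or (v == 1.0 and right == 1.0)):
            basis[i] = 1.0

    # Cox-de Boor recursion: raise the degree one step at a time.
    for k in range(1, degree + 1):
        new = [0.0] * (n_spans - k)
        for i in range(n_spans - k):
            term = 0.0
            denom_left = knots[i + k] - knots[i]
            if denom_left > 0:
                term += (v - knots[i]) / denom_left * basis[i]
            denom_right = knots[i + k + 1] - knots[i + 1]
            if denom_right > 0:
                term += (knots[i + k + 1] - v) / denom_right * basis[i + 1]
            new[i] = term
        basis = new
    return basis


def expanded_row(v, x_row, n_interior, degree):
    """Expand one observation: each predictor is multiplied by the basis at v."""
    basis = bspline_basis(v, n_interior, degree)
    return [x * b for x in x_row for b in basis]
```

With 100 predictors plus an intercept and 5 basis functions per varying coefficient, `expanded_row` returns 101 × 5 = 505 columns, which matches the dimension quoted in the simulation section.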
Given the above basis expansion, the regression coefficients and can be estimated by solving the following minimization problem:
| (3) |
where $\rho_\tau(u) = u\{\tau - I(u < 0)\}$ is the check loss function for quantile regression.
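To make the role of the check loss concrete, the following small sketch evaluates it and confirms on a toy sample that minimizing the summed check loss over a constant recovers a sample quantile. The helper names `check_loss` and `empirical_risk` and the toy data are illustrative assumptions.

```python
def check_loss(u, tau):
    """Quantile check loss: u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def empirical_risk(c, ys, tau):
    """Summed check loss of the residuals y - c for a constant fit c."""
    return sum(check_loss(y - c, tau) for y in ys)


# Toy sample with one gross outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# At tau = 0.5 the minimizer over the observed values is the sample median,
# which is unaffected by the size of the outlier.
best = min(ys, key=lambda c: empirical_risk(c, ys, 0.5))
print(best)
```

Positive residuals are weighted by tau and negative residuals by 1 - tau, which is why changing tau moves the fit toward a different conditional quantile rather than the mean.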
Given a quantile level , we assume that the random errors from model (2) follow an i.i.d. skewed (or asymmetric) Laplace distribution with density shown below (Yu and Moyeed (2001); Yu and Zhang (2005)):
where is a scale parameter determining the skewness of the distribution. Then the joint distribution of given and can be expressed as:
It is worth pointing out that the asymmetric Laplace likelihood is essentially a working likelihood. It has been adopted merely for the purpose to ensure that the minimization problem specified in (3) is equivalent to maximizing the above likelihood (Yang et al. (2016)), which allows us to work with the usual likelihood function. Because of its connection to the check loss function in quantile regression, the asymmetric Laplace distribution has been widely adopted to specify the likelihood function for Bayesian quantile regression, which sheds additional insight over the frequentist-based approaches to quantile regression.
Kozumi and Kobayashi (2011) have shown that the skewed Laplace distribution can be equivalently represented as a mixture of an exponential distribution and a scaled normal distribution. To be more specific, let the random variables and be standard exponential distribution, Exp(1), and standard normal distribution, , respectively. Define and for . Then we have the following representation based on a location–scale mixture of normals as
where follows a skewed Laplace distribution with a scale parameter . Consequently, model (2) becomes
where and . Let and , Therefore, we have the following hierarchical model:
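The mixture representation can be checked numerically. The sketch below uses the standard Kozumi and Kobayashi (2011) constants for a unit scale parameter, which is an assumption made here for simplicity; the names `xi1`, `xi2` and `draw_ald_mixture` are hypothetical. If the representation holds, the draws should have their tau-th quantile at 0.

```python
import math
import random


def draw_ald_mixture(tau, rng):
    """One draw from the asymmetric Laplace law via the normal-exponential mixture.

    Assumes unit scale: eps = xi1 * z + xi2 * sqrt(z) * u,
    with z ~ Exp(1) and u ~ N(0, 1) independent.
    """
    xi1 = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    xi2 = math.sqrt(2.0 / (tau * (1.0 - tau)))
    z = rng.expovariate(1.0)   # latent exponential variable
    u = rng.gauss(0.0, 1.0)    # latent standard normal variable
    return xi1 * z + xi2 * math.sqrt(z) * u


def fraction_nonpositive(tau, n_draws, seed):
    """Monte Carlo estimate of P(eps <= 0), which should be close to tau."""
    rng = random.Random(seed)
    hits = sum(draw_ald_mixture(tau, rng) <= 0.0 for _ in range(n_draws))
    return hits / n_draws
```

Conditional on the latent exponential variable, the error is normal, which is what makes the full conditionals of the regression coefficients tractable in a Gibbs sampler.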
2.3. The Bayesian Regularized Quantile Varying Coefficient Model
In the literature, penalized variable selection for quantile varying coefficient models has been examined with different group level penalty functions. For example, Noh et al. (2012) have developed a group SCAD to select important groups of spline coefficients after basis expansion. Tang et al. (2013) have proposed the adaptive group LASSO for quantile varying coefficient models, where the group level shrinkage on spline coefficients is imposed through a norm of each group. From the Bayesian perspective, the group LASSO estimator can be viewed as the posterior mode estimate when independent and identical multivariate Laplace priors are assumed for groups of regression coefficients. Such a connection has motivated us to consider the following regularized quantile varying coefficient model with group LASSO penalty:
| (4) |
where , and is the tuning parameter. We first set the independent and identical multivariate Laplace prior on as , where is the group size (i.e. the length of ). The resulting posterior distribution of is
With the reparametrization , the multivariate Laplace prior can be rewritten as a scale mixture of multivariate normal distribution using Gamma mixing density, that is,
| (5) |
where the multivariate normal (MVN) distribution has zero mean vector and a d-by-d diagonal matrix as the covariance matrix, and the Gamma distribution is defined with the shape parameter and the rate parameter . By integrating out , the conditional prior on has the multivariate Laplace distribution defined in (5). Therefore, the prior can be expressed as a gamma mixture of normal distributions in a Bayesian hierarchical model:
| (6) |
A major limitation of the above Laplacian shrinkage based formulation of hierarchical model is that the posterior estimates for regression coefficients cannot be shrunk to 0 exactly. In general, a 95% credible interval needs to be constructed to determine the sparsity, which suffers from inaccuracy as shown in many published studies. Here, we consider incorporating multivariate spike-and-slab priors to achieve direct identification of sparsity, i.e.,
| (7) |
where the spike is defined as , a point mass at , and the slab component is . The parameter . For , we introduce a latent binary indicator variable corresponding to each group to conduct the selection of spline coefficients on the group level. When , the spline coefficient vector has a point mass density at zero, suggesting that is estimated as a zero vector and the varying coefficient corresponding to the predictor in is 0, i.e., the predictor is not associated with the response. Besides, if , the slab part, or the normal distribution, is in action, and the spike-and-slab prior reduces to the hierarchical priors in (6), leading to a Bayesian quantile group LASSO. Therefore, and the group of spline coefficients is selected in final model. By integrating out and in (7), we have the marginal prior on as a mixture of a multivariate Laplace distribution and a point mass at :
| (8) |
which borrows strength from both the Laplacian shrinkage and spike-and-slab priors. The multivariate Laplacian in the slab component plays the role as a diffuse density to model the large effects, and is a point mass at zero to achieve variable selection via shrinking negligible group of spline coefficients to 0. Note that (8) reduces to (5) when . We assign a conjugate beta prior as with fixed parameters and , which accounts for the uncertainty in choosing .
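A draw from a prior of this spike-and-slab type can be sketched as follows. This is an illustration under stated assumptions, not the exact hierarchy of the paper: the slab uses the familiar Bayesian group LASSO scale mixture with Gamma shape (d + 1) / 2 and rate eta^2 / 2, the scale parameter of the likelihood is ignored, and the names `draw_group_prior`, `pi0` and `eta` are hypothetical.

```python
import math
import random


def draw_group_prior(d, pi0, eta, rng):
    """Draw one group of d spline coefficients from a spike-and-slab prior.

    With probability 1 - pi0 the whole group is exactly zero (spike).
    Otherwise the group is multivariate normal with a common variance s,
    where s ~ Gamma(shape=(d + 1) / 2, rate=eta**2 / 2) (slab, assumed form).
    """
    if rng.random() >= pi0:
        return [0.0] * d
    shape = (d + 1) / 2.0
    rate = eta ** 2 / 2.0
    # random.gammavariate is parameterized by shape and SCALE, so pass 1 / rate.
    s = rng.gammavariate(shape, 1.0 / rate)
    sd = math.sqrt(s)
    return [rng.gauss(0.0, sd) for _ in range(d)]


def prior_inclusion_rate(d, pi0, eta, n_draws, seed):
    """Fraction of prior draws in which the group is not the zero vector."""
    rng = random.Random(seed)
    nonzero = 0
    for _ in range(n_draws):
        if any(val != 0.0 for val in draw_group_prior(d, pi0, eta, rng)):
            nonzero += 1
    return nonzero / n_draws
```

Because the spike is an exact point mass, a group is either entirely zero or entirely active, which is what allows selection at the level of a whole varying coefficient function.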
Besides, for computational convenience, we assign conjugate Gamma priors to and as follows:
where , , and are constants. The multivariate normal prior has been placed on the -dimensional coefficient vector as:
where denotes the covariance matrix. Similarly, for the coefficients corresponding to the varying intercept, we assign the following prior:
3. The Gibbs Sampler
The joint likelihood of the unknown parameters conditional on data will be given as
The full conditional distributions can be derived as follows. We provide all the details in the Appendix.
- The full conditional distribution of is:
where “rest” denotes the data and all the other model parameters sampled in the MCMC.
- Let , then the conditional posterior distribution of is a multivariate spike-and-slab distribution given as:
where
and
Therefore, the posterior distribution of is a mixture of a multivariate normal distribution and a point mass at 0. At each iteration of MCMC, is drawn from with probability and is set to 0 with probability .
- The full conditional distribution of is
- The full conditional distribution of is
- The full conditional distribution of , is
- The full conditional distribution of
where
- The full conditional distribution of is multivariate normal:
with covariance
and mean
- Similarly the full conditional distribution of can be obtained as
where
and
4. Simulation
We conduct a comprehensive evaluation of the proposed method, the Bayesian regularized quantile varying coefficient model with spike-and-slab priors (BQRVCSS), against three alternative Bayesian methods: BQRVC, BVCSS and BVC. BQRVC differs from BQRVCSS only in that the spike-and-slab prior is not incorporated. BVCSS and BVC are the non-robust counterparts of BQRVCSS and BQRVC, respectively. Details of the hierarchical model formulation and derivation of the corresponding Gibbs samplers are provided in Appendix C. In addition, two frequentist methods from Tang et al. (2013), the regularized varying coefficient model with adaptive group LASSO under the quantile check loss (QRVC-adp) and under the least squares loss (VC-adp), are also included.
The response variable is generated according to model (1) with a sample size of 200 and the dimensionality of the high-dimensional predictors being 100. Without loss of generality, the low dimensional clinical covariates in model (1) are omitted, which facilitates a fair comparison as such a component is not included in QRVC-adp and VC-adp (Tang et al. (2013)). The total dimension of regression coefficients after basis expansion is larger than the sample size. For instance, if the number of basis functions is set to 5, the actual dimension is 505, including the varying intercept. Four varying coefficient functions are nonzero, and the rest of the coefficients are 0. We simulate two types of predictors separately. First, the predictors are simulated from a multivariate normal distribution with marginal mean 0 and an AR(1) covariance matrix with correlation coefficient 0.5, which represents continuous gene expression data. Second, we generate the predictors as categorical single nucleotide polymorphism (SNP) data by discretizing the aforementioned gene expression values of each predictor at the 1st and 3rd quartiles, leading to the 3-level categories (0, 1, 2) for genotypes (aa, Aa, AA).
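The two predictor types can be simulated along the following lines. This is a sketch under assumptions: unit marginal variance is assumed for the AR(1) structure, empirical quartiles are computed per predictor, and the function names `simulate_expression` and `to_snp` are hypothetical.

```python
import math
import random
import statistics


def simulate_expression(n, p, rho, rng):
    """Simulate n rows of p predictors with AR(1) correlation rho ** |j - k|.

    Uses the recursion x_j = rho * x_{j-1} + sqrt(1 - rho^2) * e_j, which gives
    mean 0 and (assumed) unit marginal variance for every predictor.
    """
    innov_sd = math.sqrt(1.0 - rho ** 2)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            row.append(rho * row[-1] + innov_sd * rng.gauss(0.0, 1.0))
        rows.append(row)
    return rows


def to_snp(rows):
    """Code each predictor as 0, 1 or 2 by cutting at its 1st and 3rd quartiles."""
    n, p = len(rows), len(rows[0])
    coded = [[0] * p for _ in range(n)]
    for j in range(p):
        col = [rows[i][j] for i in range(n)]
        q1, _, q3 = statistics.quantiles(col, n=4)  # three quartile cut points
        for i in range(n):
            if col[i] < q1:
                coded[i][j] = 0
            elif col[i] > q3:
                coded[i][j] = 2
            else:
                coded[i][j] = 1
    return coded


rng = random.Random(2023)
expression = simulate_expression(200, 100, 0.5, rng)
snp = to_snp(expression)
```

Cutting at the quartiles yields roughly a 25/50/25 split across the three genotype codes for every predictor.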
We consider five error distributions for the random error in model (1): the standard normal distribution (Error 1), a normal mixture (Error 2), a Laplace distribution (Error 3), a lognormal distribution (Error 4) and a $t$ distribution with 2 degrees of freedom (Error 5). Errors 2–5 are heavy-tailed distributions. For each error distribution, the $\tau$th quantile is set to 0. We also consider the case of non-i.i.d. random errors by using the following data generating model:
where , the i.i.d. errors in model (1) are replaced by , and the regression coefficients are the same as in the model under i.i.d. random errors.
The proportions of correct fitting (C), over-fitting (O), and under-fitting (U) are used to evaluate identification performance. In addition, the integrated mean squared error (IMSE) is adopted to assess the estimation accuracy of varying coefficients. The posterior median is used as the estimate of each varying coefficient function, and both the estimate and the true function are evaluated on a grid of equally spaced points on [0,1]. The IMSE of an estimated varying coefficient is the average of the squared differences between the estimate and the true function over the grid points, and it reduces to 0 if the estimate coincides with the true function. The total integrated mean squared error (TIMSE), i.e., the sum of the IMSEs of all the estimated varying coefficients, measures the overall estimation accuracy.
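The estimation criteria can be computed as in the sketch below. The grid size of 200 and the function names `imse` and `timse` are assumptions made for illustration; the example functions are hypothetical and are not the coefficient functions used in the simulation.

```python
import math


def imse(estimate, truth, n_grid=200):
    """Average squared difference between two functions on an equally spaced grid of [0, 1]."""
    grid = [m / (n_grid - 1) for m in range(n_grid)]
    return sum((estimate(v) - truth(v)) ** 2 for v in grid) / n_grid


def timse(estimates, truths, n_grid=200):
    """Total IMSE: sum of the IMSEs over all varying coefficient functions."""
    return sum(imse(e, t, n_grid) for e, t in zip(estimates, truths))


# Hypothetical example: one function estimated with a constant bias of 0.5,
# one true zero function estimated exactly as zero.
def true_f(v):
    return 2.0 * math.sin(2.0 * math.pi * v)


def biased_f(v):
    return true_f(v) + 0.5


def zero(v):
    return 0.0


total = timse([biased_f, zero], [true_f, zero])
```

A coefficient that is truly zero and correctly excluded contributes nothing to the total, so the TIMSE rewards accurate selection as well as accurate estimation.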
We have drawn the posterior samples from the Gibbs sampler. For Bayesian methods that are based on the spike-and-slab priors, the median probability model (MPM) is adopted to identify important predictors. For each predictor and each MCMC iteration, define a binary indicator that equals 1 if the predictor is included in the regression model at that iteration, i.e., the corresponding varying coefficient is nonzero, and 0 otherwise. Then, based on the posterior samples drawn from the MCMC after excluding burn-ins, the posterior probability of including the predictor in the final model can be calculated as the proportion of retained iterations in which its indicator equals 1.
A larger posterior inclusion probability suggests stronger evidence for the importance of the corresponding varying coefficient. The MPM model consists of predictors with posterior inclusion probability no less than 0.5. It has been recommended due to its optimal prediction performance when selecting a single model is of interest (Barbieri and Berger (2004)). For methods without spike-and-slab priors, we use the 95% credible interval (95%CI) to conduct identification. In simulation, the Gibbs sampler runs 10,000 MCMC iterations, of which the first 5,000 are discarded as burn-ins.
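The median probability model rule can be sketched as follows, assuming the sampler stores one 0/1 inclusion indicator per predictor at each retained iteration. The function names and the toy draws are illustrative assumptions.

```python
def inclusion_probabilities(indicator_draws):
    """Posterior inclusion probability of each predictor.

    indicator_draws: list of retained MCMC draws, each a list of 0/1 indicators.
    """
    n_draws = len(indicator_draws)
    n_predictors = len(indicator_draws[0])
    return [
        sum(draw[j] for draw in indicator_draws) / n_draws
        for j in range(n_predictors)
    ]


def median_probability_model(indicator_draws, threshold=0.5):
    """Indices of predictors whose inclusion probability is no less than the threshold."""
    probs = inclusion_probabilities(indicator_draws)
    return [j for j, prob in enumerate(probs) if prob >= threshold]


# Toy example: four retained draws for three predictors.
draws = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
selected = median_probability_model(draws)
print(selected)
```

With 10,000 iterations and 5,000 burn-ins, `indicator_draws` would hold the last 5,000 draws, for example `all_draws[5000:]`.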
For the 4 data generating scenarios, i.e., (1) gene expression with i.i.d. errors, (2) gene expression with non-i.i.d. errors, (3) SNPs with i.i.d. errors and (4) SNPs with non-i.i.d. errors, all 6 methods have been compared across 5 error distributions and 3 quantile levels (0.3, 0.5 and 0.7). The identification results for the first scenario are shown in Figure 2. We can observe that under the standard normal error, BQRVCSS and BVCSS, the two Bayesian methods with the spike-and-slab priors, as well as the two frequentist methods (QRVC-adp and VC-adp), have comparable performance in correctly identifying the true model. When the random errors are heavy-tailed, Figure 2 clearly shows the advantage of BQRVCSS over the non-robust alternatives. In addition, BQRVCSS is clearly superior to BQRVC and BVC, yielding a much larger percentage of correctly fitted models. In fact, the two Bayesian approaches without spike-and-slab priors consistently lead to the two lowest proportions of identifying the true model. A comparison between BQRVCSS and QRVC-adp indicates that the two are comparable in general, with the proposed method appearing slightly better. Among the 12 sub-panels in Figure 2, robust methods tend to perform the worst at quantile level 0.7 under the lognormal error (Error 4), since the lognormal distribution is right skewed. Such a phenomenon has not been observed under the other 4 symmetric errors.
Figure 2:
Identification results for simulated gene expression data with i.i.d. errors based on 100 replicates. C: correct-fitting proportion; O: overfitting proportion; U: underfitting proportion.
Figure 3 shows the identification results under the 2nd setting, where the response variable is generated based on gene expression data with non-i.i.d. errors. The advantage of BQRVCSS is again evident. Furthermore, the estimation results in terms of the total integrated mean squared error (TIMSE) for scenarios 1 and 2 are provided in Tables 1 and 2, respectively. Under the heavy-tailed errors, BQRVCSS leads to the smallest estimation error in most settings. For example, in Table 1, at quantile 0.5 with the t(2) error distribution, BQRVCSS has a TIMSE of 0.33 (sd 0.23), less than that of BQRVC (4.35 (sd 0.78)) and QRVC-adp (0.76 (sd 0.99)), as well as the non-robust alternatives. The advantage of the proposed method over the rest is due to its robustness and its incorporation of the spike-and-slab prior. We also observe similar patterns in the 3rd and 4th settings from Figure 5, Figure 6, Table 3 and Table 4 in the Appendix.
Figure 3:
Identification results for simulated gene expression data with heterogeneous errors based on 100 replicates. C: correct-fitting proportion; O: overfitting proportion; U: underfitting proportion.
Table 1:
Estimation results in terms of total integrated mean square error (TIMSE) for simulated gene expression data with i.i.d. errors based on 100 replicates.
| Quantile | Error | BQRVCSS | BQRVC | BVCSS | BVC | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|---|
| 0.3 | Normal | 0.23(0.10) | 2.28(0.35) | 0.45(0.09) | 1.56(0.16) | 0.25(0.10) | 0.70(0.09) |
| | NormalMix | 0.34(0.19) | 3.90(0.62) | 0.76(0.27) | 3.04(0.43) | 0.45(0.23) | 0.92(0.16) |
| | Laplace | 0.27(0.13) | 2.97(0.45) | 0.47(0.15) | 2.12(0.31) | 0.26(0.11) | 0.71(0.11) |
| | Lognormal | 0.11(0.05) | 3.38(0.55) | 1.14(0.85) | 5.84(1.92) | 0.18(0.41) | 1.22(2.45) |
| | t(2) | 0.44(0.24) | 5.01(1.16) | 2.63(5.24) | 8.35(9.76) | 0.84(0.98) | 2.58(3.22) |
| 0.5 | Normal | 0.21(0.06) | 2.42(0.36) | 0.40(0.06) | 1.57(0.16) | 0.21(0.07) | 0.62(0.11) |
| | NormalMix | 0.31(0.17) | 3.75(0.60) | 0.74(0.24) | 2.71(0.49) | 0.35(0.16) | 0.92(0.11) |
| | Laplace | 0.22(0.06) | 3.07(0.48) | 0.46(0.08) | 1.83(0.28) | 0.22(0.09) | 0.70(0.08) |
| | Lognormal | 0.25(0.19) | 4.59(0.94) | 1.18(1.69) | 5.09(2.28) | 0.40(0.56) | 1.26(0.68) |
| | t(2) | 0.33(0.23) | 4.35(0.78) | 2.04(1.48) | 6.82(6.51) | 0.76(0.99) | 2.05(4.32) |
| 0.7 | Normal | 0.21(0.08) | 2.53(0.41) | 0.41(0.08) | 1.58(0.18) | 0.23(0.10) | 0.71(0.10) |
| | NormalMix | 0.33(0.14) | 3.84(0.58) | 0.78(0.30) | 3.03(0.53) | 0.45(0.26) | 0.92(0.18) |
| | Laplace | 0.29(0.11) | 3.22(0.49) | 0.49(0.16) | 2.18(0.34) | 0.30(0.17) | 0.73(0.12) |
| | Lognormal | 0.71(0.45) | 5.44(1.52) | 0.99(0.90) | 4.19(2.07) | 0.96(0.95) | 1.35(3.65) |
| | t(2) | 0.42(0.35) | 5.07(1.21) | 2.65(3.35) | 9.10(11.24) | 0.97(1.42) | 2.02(1.75) |
Table 2:
Estimation results in terms of total integrated mean square error (TIMSE) for simulated gene expression data with heterogeneous errors based on 100 replicates.
| Quantile | Error | BQRVCSS | BQRVC | BVCSS | BVC | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|---|
| 0.3 | Normal | 0.35(0.15) | 3.44(0.54) | 0.94(0.30) | 2.82(0.37) | 0.37(0.20) | 0.95(0.17) |
| | NormalMix | 0.50(0.24) | 5.05(0.99) | 1.04(1.20) | 5.79(1.70) | 0.45(0.23) | 1.62(0.61) |
| | Laplace | 0.35(0.15) | 4.04(0.79) | 1.03(0.67) | 3.57(0.90) | 0.41(0.21) | 0.94(0.27) |
| | Lognormal | 0.20(0.09) | 4.18(0.93) | 2.55(2.57) | 9.84(4.87) | 0.37(0.54) | 3.59(2.03) |
| | t(2) | 0.64(0.39) | 5.87(1.29) | 2.99(2.83) | 10.94(6.72) | 1.37(1.59) | 3.27(1.27) |
| 0.5 | Normal | 0.27(0.21) | 3.38(0.53) | 0.93(0.17) | 2.21(0.36) | 0.28(0.16) | 0.96(0.16) |
| | NormalMix | 0.29(0.12) | 4.61(0.82) | 1.12(0.94) | 5.20(1.48) | 0.35(0.16) | 1.62(0.61) |
| | Laplace | 0.21(0.10) | 3.84(0.67) | 0.98(0.41) | 3.18(0.72) | 0.21(0.12) | 1.06(0.33) |
| | Lognormal | 0.29(0.16) | 4.36(0.95) | 2.09(2.13) | 8.26(3.61) | 0.40(0.48) | 2.45(2.17) |
| | t(2) | 0.38(0.22) | 5.31(1.12) | 3.33(3.15) | 11.94(15.06) | 1.16(2.20) | 3.92(5.56) |
| 0.7 | Normal | 0.33(0.11) | 3.65(0.59) | 0.85(0.25) | 2.71(0.47) | 0.38(0.16) | 1.06(0.27) |
| | NormalMix | 0.51(0.22) | 5.32(0.89) | 1.22(1.04) | 5.91(1.57) | 0.78(0.56) | 1.65(0.61) |
| | Laplace | 0.42(0.22) | 4.25(0.73) | 0.93(0.42) | 3.37(0.72) | 0.42(0.24) | 1.10(0.39) |
| | Lognormal | 0.80(0.58) | 6.85(1.71) | 2.47(8.41) | 7.98(6.94) | 2.72(6.07) | 2.54(3.39) |
| | t(2) | 0.62(0.29) | 6.44(1.31) | 5.37(4.67) | 13.41(12.08) | 1.27(1.13) | 3.32(3.06) |
We have also shown the estimated varying coefficients of the proposed method (BQRVCSS) for the gene expression data with i.i.d. errors at the 50% quantile level in the first setting in Figure 7, which is generated as follows. At each replicate, a new dataset is simulated from the aforementioned data generating model, and the posterior median estimates and 95% credible intervals are obtained after fitting the proposed method to that dataset. The median estimates, as well as the lower and upper bounds of the credible intervals, are then averaged across the 100 replicates to yield the estimated varying coefficients and corresponding 95% credible intervals shown in Figure 7. In addition, we have evaluated the empirical 95% coverage probabilities of the four Bayesian methods using their pointwise 95% credible intervals over the 200 grid points. Table 5 in the Appendix shows the 95% coverage probabilities for the four varying coefficient functions under simulated gene expression data with i.i.d. errors. Overall, the proposed BQRVCSS outperforms all the alternatives. Specifically, BQRVC and BVC, the two methods not incorporating the spike-and-slab priors, can barely cover two of the varying coefficient functions. The nonrobust counterpart BVCSS is inferior, particularly at quantile levels 0.3 and 0.7. The results also suggest that the performance may depend on the form of the varying coefficient under estimation. The quadratic function corresponds to better coverage probabilities in general, and none of the methods has completely missed its coverage, in contrast to the non-polynomial functions and the polynomial functions of a higher order. Yang et al. (2016) have proposed a posterior variance adjustment procedure to improve the validity of credible intervals from Bayesian quantile regression with the asymmetric Laplace likelihood. While their method has been developed for a low dimensional parametric regression setting, how to adjust the posterior variance to improve coverage probabilities in a high-dimensional nonparametric setting when more complicated sparsity priors (i.e. the spike-and-slab prior) are involved is worth further exploration beyond our study.
So far, the asymptotic distribution of the spline-based regularized quantile varying coefficient estimators has not been developed (Noh et al. (2012); Tang et al. (2013, 2012)). Without the asymptotic variance, it is not feasible to construct the corresponding pointwise asymptotic confidence intervals for the varying coefficients. Therefore, the counterpart of Figure 7 for frequentist spline-based quantile VC models is not available. In the literature, Dai and Kolar (2021) have developed a kernel-based inference procedure for estimators that approximate quantile VC functions in the high-dimensional setting. They did not show any plots of pointwise confidence intervals for nonparametric functions, and it is not immediately evident to us whether or how their methods can be used to generate confidence intervals for varying coefficient functions without the relevant specifics. Therefore, we have not pursued a direct comparison to the frequentist coverage of confidence intervals using their methods. For the frequentist methods VC-adp and QRVC-adp, we have selected tuning parameters through the Schwarz-type Information Criterion (SIC), which has been widely adopted in the published literature for choosing tuning parameters in regularized (quantile) varying coefficient models (Noh et al. (2012); Tang et al. (2013, 2012); Wang and Xia (2009)). Please refer to the Appendix for more details.
The convergence of the MCMC chains is examined using the potential scale reduction factor (PSRF) (Gelman and Rubin (1992); Brooks and Gelman (1998)). Convergence is achieved if the PSRF values are close to 1. Following Gelman et al. (2013), we use 1.1 as the cutoff (i.e. PSRF ≤ 1.1) to determine convergence. The PSRF has been computed for each parameter, indicating convergence of all chains after burn-ins. Figure 8 shows the PSRF of the estimated spline coefficients of each varying coefficient function in Figure 7. The convergence is satisfactorily achieved.
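For a single scalar parameter, the diagnostic can be sketched as below. This is the basic Gelman and Rubin form computed from several post-burn-in chains of equal length, without the sampling-variability correction of Brooks and Gelman (1998); the function name `psrf` and the simulated chains are illustrative assumptions.

```python
import math
import random
import statistics


def psrf(chains):
    """Basic potential scale reduction factor for one scalar parameter.

    chains: list of m post-burn-in chains, each a list of n draws.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(7)
# Three chains sampling the same target: the PSRF should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values: the PSRF should be well above 1.1.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 5.0, 10.0)]
print(psrf(mixed) <= 1.1, psrf(stuck) <= 1.1)
```

The factor compares the pooled variance estimate with the average within-chain variance, so it exceeds 1 noticeably only when the chains disagree about where the posterior mass lies.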
We demonstrate the sensitivity of the proposed method BQRVCSS for variable selection to the choice of the hyperparameters for and in the Appendix and tabulate the results in Tables 6 to 9. These results suggest that the MPM model is insensitive to different choices of the hyperparameters. We also conduct a sensitivity analysis on whether the smoothness specification of the parameters in the B spline impacts the variable selection. The sensitivity analysis results are shown in Tables 12 to 15 in the Appendix. It is evident that the proposed method is insensitive to the number of spline basis , which is equivalent to , in smoothness specification. We provide a heuristic justification as follows. In the nonparametric literature, has been established as the optimal order of the number of interior spline knots under certain regularity conditions (Xue and Yang (2006)). Other orders, such as , have also been commonly assumed (Wang and Yang (2009)). Therefore, if the number of interior knots is chosen within the range of , where denotes the integer part of , the optimal order can be achieved. In practice, to avoid overfitting, cubic splines and splines with a smaller degree have been extensively used. With quadratic and cubic splines, where the spline order corresponds to 2 and 3 respectively, the aforementioned range results in 1 to 3 interior knots under the sample size 200 adopted in simulation. Therefore, the proposed method is insensitive to the above specifications of and , i.e. the number of spline basis. Nevertheless, a rigorous justification of the optimal order of the number of interior knots in high-dimensional quantile varying coefficient models remains an open question. Based on this finding, we set the degree and the number of interior knots for the B spline basis, which leads to basis functions.
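The relation between the spline degree, the number of interior knots and the number of basis functions can be illustrated with a short sketch. It assumes a clamped knot vector on [0, 1] and the standard Cox-de Boor recursion. The function name `bspline_basis` and the degree and knot values used below are illustrative and are not taken from the settings of this study.

```python
def bspline_basis(u, degree, interior_knots):
    """Evaluate all B-spline basis functions on [0, 1] at the point u."""
    # Clamped knot vector: each boundary knot is repeated degree + 1 times.
    knots = [0.0] * (degree + 1) + list(interior_knots) + [1.0] * (degree + 1)

    # Degree-0 indicator functions; the last nonempty interval is closed at 1.
    last = max(i for i in range(len(knots) - 1) if knots[i] < knots[i + 1])
    basis = [
        1.0 if (knots[i] <= u < knots[i + 1]) or (u == 1.0 and i == last) else 0.0
        for i in range(len(knots) - 1)
    ]

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        new = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (u - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (knots[i + d + 1] - u) / right_den * basis[i + 1] if right_den > 0 else 0.0
            new.append(left + right)
        basis = new
    return basis


# Illustrative values only: a cubic spline with one interior knot.
values = bspline_basis(0.3, degree=3, interior_knots=[0.5])
n_basis = len(values)   # degree + number of interior knots + 1
```

Under this construction the basis functions sum to one at every point of [0, 1]. Whether the constant term is counted separately depends on how the varying coefficient is parameterized, which this sketch does not address.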
The varying coefficient functions in the simulation study have been widely adopted in the published nonparametric literature (Noh et al. (2012); Tang et al. (2013, 2012); Wang and Yang (2009); Xue and Yang (2006)). Functions with more complex structures may not lead to the same satisfactory performance as shown here. For example, a sine function with more oscillations in [0,1] is not a polynomial function in nature, and thus cannot be well approximated by spline-based methods with the established optimal order of the number of interior knots. We run additional simulations under setting 1, where gene expression data are generated with i.i.d. errors, by only changing to a more complicated sine function . Table 11 in the Appendix shows that the estimation accuracy has significantly decreased for all the methods, compared to the estimation results in Table 1. In the Appendix, we have also provided the estimation plots of more complicated varying coefficient functions using BQRVCSS and its frequentist counterpart QRVC-adp. Figure 9 and Figure 10 show that cannot be well modeled by either method, which has also been observed with all the other methods (BQRVC, BVCSS, BVC and VC-adp) under comparison.
In the simulation, the figures of estimated curves are obtained based on averages over multiple replicates. To further explore the estimation performance when the proposed method has been applied to single datasets, we have also shown the figure beyond the “average case” scenario. Specifically, at each simulation run, we compute the IMSE of posterior median estimates of the curves. Then the curves at the 25th, 50th and 75th percentile of IMSEs across all the replicates have been overlaid with the true curve in Figure 11 in the Appendix. We can observe that all of them are close to the true curves, although the curves at the 75th percentile of IMSEs are slightly worse than those at the other two percentiles.
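The selection of representative replicates described above can be sketched as follows. The sketch assumes that the IMSE is approximated by the average squared error over a common evaluation grid and that the percentile is located by rounding to the nearest rank; both choices, and the function names, are illustrative assumptions rather than the procedure used in the study.

```python
def imse(estimate, truth):
    """Grid approximation of the integrated squared error between two curves."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


def replicate_at_percentile(imse_values, pct):
    """Index of the replicate whose IMSE sits at the given percentile."""
    order = sorted(range(len(imse_values)), key=lambda i: imse_values[i])
    rank = round(pct / 100 * (len(order) - 1))
    return order[rank]


# Hypothetical IMSE values from five simulation replicates.
imse_values = [0.5, 0.1, 0.9, 0.3, 0.7]
picked = {pct: replicate_at_percentile(imse_values, pct) for pct in (25, 50, 75)}
```

The curves estimated in the picked replicates would then be overlaid with the true curve.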
5. Real Data Analysis
We analyze the Nurses’ Health Study (NHS) data from the Gene, Environment Association Studies Consortium (GENEVA) (Cornelis et al. (2010)). The NHS aims at assessing a series of hypotheses on disease susceptibility in women based on genetic factors, i.e. single nucleotide polymorphisms (SNPs), and environmental/clinical factors in gene-environment interaction studies. The body mass index (BMI), which quantifies the obesity level, is set as the response. We focus on SNPs on chromosome 2. We consider age as the environmental factor since it has been shown to be associated with variations in obesity level. In addition, three clinical covariates are included: total physical activity, trans fat intake and cereal fiber intake. The healthy subjects in the NHS are selected for the case study. We clean the data by keeping subjects with matched phenotypes and genotypes, removing SNPs with minor allele frequency (MAF) less than 0.05 or deviation from Hardy-Weinberg equilibrium, and imputing the missing values. The final working dataset contains 1716 subjects with 53,408 SNPs.
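The MAF filter in the cleaning step can be sketched as below. The sketch assumes genotypes coded as 0, 1 or 2 copies of a counted allele with `None` marking missing calls; the SNP names and data are hypothetical, and the Hardy-Weinberg and imputation steps are not shown.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP coded as 0/1/2 allele counts; None marks a missing call."""
    observed = [g for g in genotypes if g is not None]
    freq = sum(observed) / (2 * len(observed))   # frequency of the counted allele
    return min(freq, 1.0 - freq)


def filter_by_maf(snps, threshold=0.05):
    """Keep the SNPs whose MAF is at least `threshold`."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) >= threshold}


# Hypothetical genotype vectors for two SNPs.
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, None, 2, 0, 1],
    "snp_rare": [0] * 39 + [1],
}
kept = filter_by_maf(snps)
```

Here `snp_rare` has a MAF of 1/80 and is dropped, while `snp_common` is retained.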
A common practice in variable selection for ultra-high dimensional omics data is to first conduct marginal screening and reduce the number of features to a reasonable scale so that (Bayesian) regularized variable selection can be applied (Li et al. (2015); Wu et al. (2014, 2018)). Here, we screen the SNPs using the testing procedure for non-linear gene-environment interaction studies proposed by Ma et al. (2011) and Wu and Cui (2013). In particular, three statistical tests have been performed to assess the effect of a genetic factor under the environmental influences and to dissect whether the interaction effects are nonlinear, linear, constant, or zero. We keep the SNPs with p-values less than a cutoff of 0.005 from any of the tests under the response BMI. In total, 300 SNPs pass the screening.
We analyze the screened data using the proposed method BQRVCSS at the median and the alternative BVCSS. Other methods, such as BQRVC and BVC, are not considered since they have inferior performance in the simulation studies. The eleven SNPs identified by BQRVCSS and the corresponding estimated varying coefficients are displayed in Figure 4. BVCSS identifies nine SNPs, which are rs17533992, rs16864365, rs6719951, rs7585571, rs752833, rs4894108, rs16867269, rs2675102 and rs13418054. Six SNPs are selected by both methods. In addition, the proposed method uniquely identifies five SNPs that are located within genes that have been reported to be associated with body weight change. For example, BQRVCSS identifies the SNP rs17783776, which is located in the gene ALK. ALK (anaplastic lymphoma kinase) has been identified as a thinness gene, which suggests it could be a target gene for obesity treatment (Orthofer et al. (2020)). By comparison, the alternative method BVCSS misses this important gene. The proposed method also identifies rs41349646, a SNP that is mapped to the gene NPAS2. NPAS2 has been found to play an essential role in the regulation of peripheral circadian response and hepatic metabolism, and therefore affects weight change (O’Neil et al. (2013)). The SNP rs10933420 is also uniquely identified by our proposed method and is located in the gene NGEF. Kim et al. (2015) have found NGEF to be associated with intra-abdominal fat accumulation. Furthermore, our proposed method BQRVCSS identifies rs4854071. The SNP rs4854071 is located within the gene NDUFA10 (NADH:Ubiquinone Oxidoreductase Subunit A10), which has been found to be involved in the NAFLD pathway regulating weight loss together with ten other genes (Mirhashemi et al. (2021)).
Figure 4:
Real data analysis using the proposed method (BQRVCSS). Black line: median estimates of varying coefficients for BQRVCSS. Blue dashed lines: 95% credible intervals for the estimated varying coefficients.
It is difficult to objectively evaluate the selection accuracy with real data. We assess the prediction performance as it may provide additional information on the performance of different methods. We refit the selected models of BQRVCSS and BVCSS by Bayesian quantile LASSO and Bayesian LASSO, respectively, following the refitting procedure in Li et al. (2015). The prediction mean squared errors (PMSEs) and prediction mean absolute deviations (PMADs) are computed based on the posterior median estimates. The proposed method BQRVCSS has a PMSE and PMAD of 13.13 and 1.34, respectively, while the PMSE and PMAD for BVCSS are 15.04 and 3.05, both larger than their BQRVCSS counterparts.
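For concreteness, the two prediction criteria can be sketched as below, assuming that PMSE is the mean squared prediction error and PMAD is the mean absolute prediction error over the subjects. The toy BMI values and predictions are hypothetical and unrelated to the NHS data.

```python
from statistics import mean


def pmse(observed, predicted):
    """Prediction mean squared error."""
    return mean((y - f) ** 2 for y, f in zip(observed, predicted))


def pmad(observed, predicted):
    """Prediction mean absolute deviation."""
    return mean(abs(y - f) for y, f in zip(observed, predicted))


# Hypothetical responses and predictions based on posterior median estimates.
observed = [24.0, 31.5, 27.0, 22.5]
predicted = [25.0, 30.0, 27.5, 22.0]
toy_pmse = pmse(observed, predicted)   # 0.9375
toy_pmad = pmad(observed, predicted)   # 0.875
```

Because PMAD weights all deviations linearly, it is less affected by a few large prediction errors than PMSE, which is why reporting both is informative for a skewed response.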
6. Discussion
Within a broader scope, the regularized quantile varying coefficient model can be regarded as a robust variable selection problem in the form of “robust loss function + penalty function” (Wu and Ma (2015)), which consists of a quantile check loss and a group-level penalty function. Although other robust loss functions, including the rank-based loss (Wu et al. (2015)), can also be considered for robust high-dimensional varying coefficient models, the regularized quantile VC model naturally leads to a Bayesian formulation if the likelihood function of the Bayesian hierarchical model is specified based on the asymmetric Laplace distribution (ALD) (Yu and Moyeed (2001)). The modeling of the spline basis in the proposed study has connections to the development of semiparametric Bayesian regressions for the “large p, small n” settings (Huang et al. (2015)). As the high-dimensional Bayesian quantile VC model is underdeveloped, examining the Bayesian counterpart complements and further advances the existing studies on the quantile VC model in the frequentist framework.
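The link between the check loss and the ALD likelihood can be made concrete with a short sketch. It uses the standard ALD density of Yu and Moyeed (2001), proportional to exp(-ρ_τ((y - μ)/σ)), with a fixed scale; the toy data and function names are illustrative assumptions.

```python
import math


def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def ald_logpdf(y, mu, sigma, tau):
    """Log-density of the asymmetric Laplace distribution ALD(mu, sigma, tau)."""
    return math.log(tau * (1.0 - tau) / sigma) - check_loss((y - mu) / sigma, tau)


# Toy data with one outlier; candidate locations are the data points themselves.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
tau = 0.5
best_by_loss = min(data, key=lambda m: sum(check_loss(y - m, tau) for y in data))
best_by_lik = max(data, key=lambda m: sum(ald_logpdf(y, m, 1.0, tau) for y in data))
```

Both criteria pick the sample median 3.0, and the outlier at 100 does not move it, which illustrates why maximizing the ALD likelihood is equivalent to minimizing the check loss and inherits its robustness.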
Nevertheless, our limited literature search shows that high-dimensional Bayesian quantile varying coefficient models have not been well examined so far. In this article, we have developed a Bayesian regularized quantile varying coefficient model. The robust asymmetric Laplace likelihood and sparsity-inducing priors lead to full conditional distributions of the model parameters. Therefore, posterior inference can be efficiently conducted through Gibbs sampling. The varying coefficient model is a special case of the varying index coefficient model (VICM) when the effect modifying variable is univariate with the loading weight being 1 (Ma and Song (2015)). Ma and Song (2015) have further shown that the new class of VICM gives rise to a broad spectrum of semi- and non-parametric models. Our study has laid a solid foundation for initiating Bayesian analyses of these models in the high-dimensional setting. Investigations of these extensions within the Bayesian framework are left for future work.
Acknowledgements
We thank the editor, associate editor and reviewers for their careful review and insightful comments, which led to a significant improvement of this article. We also thank Yuwen Liu for help with conducting the additional simulation studies during the revision. This work was partially supported by an Innovative Research Award from the Johnson Cancer Research Center at Kansas State University and the National Institutes of Health (NIH) grant R01 CA204120.
Appendix
A. Additional Simulation Results
A.1. Additional Identification Results
Figure 5:
Identification results for simulated SNP data with i.i.d. errors based on 100 replicates. C: correct-fitting proportion; O: overfitting proportion; U: underfitting proportion.
Figure 6:
Identification results for simulated SNP data with heterogeneous errors based on 100 replicates. C: correct-fitting proportion; O: overfitting proportion; U: underfitting proportion.
A.2. Additional Estimation Results
Table 3:
Estimation results in terms of total integrated mean square error (TIMSE) for simulated SNPs with i.i.d. errors based on 100 replicates.
| Error | BQRVCSS | BQRVC | BVCSS | BVC | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|
| Normal | 0.23(0.10) | 2.32(0.40) | 0.45(0.12) | 1.51(0.19) | 0.28(0.11) | 0.79(0.14) |
| NormalMix | 0.34(0.17) | 3.47(0.59) | 0.76(0.23) | 2.92(0.46) | 0.53(0.35) | 0.98(0.27) |
| Laplace | 0.26(0.10) | 2.91(0.53) | 0.45(0.12) | 2.06(0.32) | 0.34(0.15) | 0.80(0.11) |
| Lognormal | 0.11(0.07) | 3.23(0.61) | 1.76(0.64) | 4.7(1.38) | 0.28(0.51) | 1.45(0.76) |
| | 0.38(0.17) | 4.70(1.07) | 1.99(1.66) | 7.91(9.55) | 1.30(1.30) | 1.54(1.52) |
| Normal | 0.19(0.07) | 2.14(0.38) | 0.41(0.09) | 1.21(0.14) | 0.28(0.12) | 0.76(0.10) |
| NormalMix | 0.27(0.12) | 3.67(0.58) | 0.73(0.16) | 2.65(0.43) | 0.49(0.37) | 1.03(0.32) |
| Laplace | 0.16(0.05) | 2.88(0.43) | 0.45(0.09) | 1.87(0.35) | 0.28(0.19) | 0.78(0.23) |
| Lognormal | 0.23(0.13) | 4.16(0.83) | 1.55(1.14) | 5.3(2.43) | 0.44(0.45) | 1.43(0.66) |
| | 0.31(0.18) | 4.17(0.83) | 1.94(1.63) | 7.49(7.61) | 1.25(1.23) | 2.14(1.90) |
| Normal | 0.19(0.07) | 2.37(0.46) | 0.41(0.10) | 1.50(0.18) | 0.30(0.16) | 0.78(0.12) |
| NormalMix | 0.35(0.15) | 3.49(0.53) | 0.7(0.19) | 2.94(0.45) | 0.52(0.30) | 1.11(0.39) |
| Laplace | 0.25(0.13) | 2.76(0.46) | 0.46(0.13) | 1.99(0.27) | 0.36(0.16) | 0.86(0.19) |
| Lognormal | 0.78(0.79) | 5.24(1.38) | 1.06(1.07) | 4.21(1.91) | 1.05(0.88) | 0.49(0.77) |
| | 0.46(0.38) | 4.83(1.35) | 1.9(1.67) | 7.59(7.54) | 1.13(1.01) | 1.77(1.00) |
Table 4:
Estimation results in terms of total integrated mean square error (TIMSE) for simulated SNPs with heterogeneous errors based on 100 replicates.
| Error | BQRVCSS | BQRVC | BVCSS | BVC | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|
| Normal | 0.26(0.11) | 3.17(0.57) | 0.83(0.24) | 2.71(0.44) | 0.35(0.24) | 1.13(0.30) |
| NormalMix | 0.40(0.20) | 4.59(0.79) | 1.72(0.86) | 5.66(1.35) | 0.63(0.54) | 1.63(0.57) |
| Laplace | 0.30(0.14) | 3.72(0.68) | 0.90(0.36) | 3.74(0.80) | 0.42(0.45) | 1.17(0.49) |
| Lognormal | 0.17(0.08) | 3.54(0.67) | 3.70(2.01) | 8.86(3.66) | 0.72(0.98) | 4.32(3.00) |
| | 0.66(0.65) | 5.92(1.54) | 4.64(5.71) | 16.16(23.82) | 2.09(4.68) | 3.78(4.41) |
| Normal | 0.17(0.08) | 3.11(0.48) | 0.82(0.21) | 2.10(0.29) | 0.25(0.18) | 1.09(0.31) |
| NormalMix | 0.25(0.12) | 4.2(0.71) | 1.66(0.63) | 4.56(1.02) | 0.68(0.73) | 1.74(0.71) |
| Laplace | 0.18(0.12) | 3.78(0.63) | 0.46(0.26) | 3.18(0.55) | 0.23(0.16) | 0.85(0.45) |
| Lognormal | 0.17(0.08) | 4.20(0.74) | 2.88(4.66) | 9.86(9.94) | 0.7(1.14) | 2.79(3.26) |
| | 0.3(0.16) | 4.75(0.66) | 3.19(4.68) | 12.78(12.71) | 1.55(1.46) | 3.78(3.64) |
| Normal | 0.25(0.11) | 3.40(0.59) | 0.80(0.22) | 2.63(0.39) | 0.30(0.13) | 1.12(0.29) |
| NormalMix | 0.39(0.17) | 4.77(0.76) | 1.35(0.49) | 4.76(0.97) | 0.94(1.08) | 1.85(0.77) |
| Laplace | 0.25(0.11) | 4.14(0.70) | 0.88(0.25) | 3.57(0.64) | 0.43(0.53) | 1.30(0.50) |
| Lognormal | 0.58(0.23) | 6.55(1.35) | 5.32(22.78) | 9.11(11.68) | 1.26(1.18) | 2.15(2.68) |
| | 0.49(0.25) | 6.08(0.99) | 5.98(9.18) | 18.73(22.33) | 3.2(3.84) | 4.94(5.77) |
Table 5:
Empirical 95% coverage probabilities under simulated gene expression data with i.i.d. error based on 200 replicates.
| error | BQRVCSS | BQRVC | BVCSS | BVC |
|---|---|---|---|---|
| | 0.800 | 0.875 | 0.570 | 0.630 |
| | 0.865 | 0.020 | 0.815 | 0.055 |
| | 0.950 | 0.780 | 0.745 | 0.825 |
| | 0.860 | 0.000 | 0.760 | 0.055 |
| | 0.875 | 0.935 | 0.885 | 0.865 |
| | 0.930 | 0.020 | 0.850 | 0.050 |
| | 0.960 | 0.845 | 0.810 | 0.835 |
| | 0.905 | 0.015 | 0.790 | 0.050 |
| | 0.820 | 0.905 | 0.665 | 0.710 |
| | 0.930 | 0.045 | 0.870 | 0.080 |
| | 0.940 | 0.830 | 0.735 | 0.850 |
| | 0.910 | 0.020 | 0.815 | 0.070 |
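An empirical coverage probability of this kind can be sketched for a scalar quantity as follows. The sketch assumes equal-tailed 95% credible intervals taken from the posterior draws of each replicate and counts how often the interval contains the true value; how coverage is aggregated over a whole varying coefficient curve is not specified here, and the simulated draws are purely illustrative.

```python
import random
from statistics import quantiles


def interval_95(draws):
    """Equal-tailed 95% interval from posterior draws."""
    cuts = quantiles(draws, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]


def empirical_coverage(draws_per_replicate, truth):
    """Proportion of replicates whose 95% interval contains the true value."""
    hits = 0
    for draws in draws_per_replicate:
        lower, upper = interval_95(draws)
        hits += lower <= truth <= upper
    return hits / len(draws_per_replicate)


rng = random.Random(1)
truth, se = 2.0, 0.5
replicates = []
for _ in range(200):                      # 200 hypothetical replicates
    centre = rng.gauss(truth, se)         # replicate-specific point estimate
    replicates.append([rng.gauss(centre, se) for _ in range(500)])
coverage = empirical_coverage(replicates, truth)
```

With well-calibrated intervals the result should be close to the nominal 0.95; values far below it indicate intervals that are too narrow or badly centred.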
A.3. The estimated quantile varying coefficient functions
Figure 7:
Estimation of non-zero varying coefficients under the normal mixture error (Error 2) for the proposed method (BQRVCSS) at 50% quantile level. Red line: true varying coefficients. Black line: posterior median estimates of varying coefficients from BQRVCSS. Blue lines: 95% credible intervals for the estimated varying coefficients.
A.4. Evaluation on the convergence of MCMC chains
Figure 8:
Potential scale reduction factor (PSRF) versus iterations for the varying functions in Figure 7. Black line: PSRF. Red line: the threshold of 1.1. to denote the five estimated spline coefficients for the varying coefficient function , respectively.
A.5. Hyper-parameters sensitivity analysis
Table 6:
Sensitivity analysis on the choice of the hyperparameter for using different Beta priors for the Laplace error distribution for the 30% quantile. TIMSE: total integrated mean square error.
| Prior | C | O | U | TIMSE |
|---|---|---|---|---|
| Beta(0.5,0.5) | 0.90 | 0.10 | 0.00 | 0.27(0.12) |
| Beta(1,1) | 0.90 | 0.10 | 0.00 | 0.28(0.12) |
| Beta(2,2) | 0.90 | 0.10 | 0.00 | 0.28(0.11) |
| Beta(1,5) | 0.90 | 0.10 | 0.00 | 0.27(0.11) |
| Beta(5,1) | 0.90 | 0.10 | 0.00 | 0.27(0.11) |
Table 7:
Sensitivity analysis on the choice of the hyperparameter for using different Gamma priors for the Laplace error distribution for the 30% quantile. TIMSE: total integrated mean square error.
| Prior | C | O | U | TIMSE |
|---|---|---|---|---|
| Gamma(0.1,1) | 0.90 | 0.10 | 0.00 | 0.29(0.17) |
| Gamma(1,1) | 0.90 | 0.10 | 0.00 | 0.29(0.16) |
| Gamma(1,5) | 0.90 | 0.10 | 0.00 | 0.30(0.16) |
| Gamma(2,5) | 0.88 | 0.12 | 0.00 | 0.30(0.16) |
| Gamma(5,1) | 0.90 | 0.10 | 0.00 | 0.29(0.16) |
Table 8:
Sensitivity analysis on the choice of the hyperparameter for using different Beta priors for the Laplace error distribution for the 50% quantile. TIMSE: total integrated mean square error.
| Prior | C | O | U | TIMSE |
|---|---|---|---|---|
| Beta(0.5,0.5) | 0.92 | 0.08 | 0.00 | 0.22(0.05) |
| Beta(1,1) | 0.94 | 0.06 | 0.00 | 0.22(0.06) |
| Beta(2,2) | 0.94 | 0.06 | 0.00 | 0.22(0.06) |
| Beta(1,5) | 0.94 | 0.06 | 0.00 | 0.22(0.06) |
| Beta(5,1) | 0.92 | 0.08 | 0.00 | 0.22(0.06) |
Table 9:
Sensitivity analysis on the choice of the hyperparameter for using different Gamma priors for the Laplace error distribution for the 50% quantile. TIMSE: total integrated mean square error.
| Prior | C | O | U | TIMSE |
|---|---|---|---|---|
| Gamma(0.1,1) | 0.96 | 0.04 | 0.00 | 0.22(0.05) |
| Gamma(1,1) | 0.94 | 0.06 | 0.00 | 0.22(0.05) |
| Gamma(1,5) | 0.94 | 0.06 | 0.00 | 0.23(0.05) |
| Gamma(2,5) | 0.94 | 0.06 | 0.00 | 0.22(0.06) |
| Gamma(5,1) | 0.94 | 0.06 | 0.00 | 0.22(0.05) |
A.6. Selection of tuning parameters for frequentist methods
We have chosen the tuning parameters for VC-adp and QRVC-adp in terms of the Schwarz-type Information Criterion (SIC):
where edf is the effective degree of freedom. For QRVC-adp, is the quantile check loss function, and edf is the number of zero residuals, which has been extensively used as a metric indicating the effective dimension of fitted quantile regression models. Such a SIC criterion has been commonly adopted in published work on regularized quantile varying coefficient models (Noh et al. (2012); Tang et al. (2013, 2012)). For VC-adp, is the least square loss function, and edf is the total number of nonzero varying coefficients (Tang et al. (2012); Wang and Xia (2009)). The R code for VC-adp and QRVC-adp can be obtained through minor modifications to the R code for the methods proposed in Tang et al. (2012), available at Dr. Huixia Wang’s website (https://blogs.gwu.edu/judywang/software/).
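A hedged sketch of a criterion of this type for the quantile case is given below. The displayed formula is not reproduced in this text, so the sketch assumes one commonly used form, log of the average check loss plus (log n / 2n) times edf, with edf counted as the number of (numerically) zero residuals; the function names, the tolerance and the toy residuals are assumptions.

```python
import math


def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sic_quantile(residuals, tau, tol=1e-8):
    """Assumed Schwarz-type criterion for a fitted quantile model."""
    n = len(residuals)
    avg_loss = sum(check_loss(r, tau) for r in residuals) / n
    edf = sum(1 for r in residuals if abs(r) < tol)   # number of zero residuals
    return math.log(avg_loss) + math.log(n) / (2 * n) * edf


# Hypothetical residuals from one fit along the tuning sequence.
residuals = [0.0, 0.5, -0.5, 1.0]
sic_value = sic_quantile(residuals, tau=0.5)
```

In use, the criterion would be evaluated for each candidate tuning parameter and the value with the smallest SIC retained.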
We have also examined the estimation performance of the two frequentist methods when the tuning parameters are selected using validation. Specifically, after the regularized estimates have been obtained using the training data, the prediction error, in terms of the check loss for QRVC-adp and the least square loss for VC-adp, is assessed on an independently generated testing dataset. For each tuning parameter across the sequence, the prediction performance is assessed on the same testing data. The optimal tuning parameter is the one with the smallest testing error. Such a method of choosing the tuning parameters is feasible in simulation as the data generating model is available, and it is computationally less intensive than cross-validation. For illustration purposes, we have conducted the simulation under the first setting, where gene expression data have been generated with i.i.d. errors. The estimation results in Table 10 below are very close to the ones in Table 1 of the main text.
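The validation-based selection can be sketched as follows for the quantile case. The dictionary `fits`, which maps each candidate tuning parameter to its predictions on the testing data, and all numerical values are hypothetical; fitting the regularized model itself is not shown.

```python
def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def select_by_validation(fits, y_test, tau):
    """Return the tuning parameter with the smallest average test check loss."""
    def test_error(predictions):
        losses = (check_loss(y - f, tau) for y, f in zip(y_test, predictions))
        return sum(losses) / len(y_test)
    return min(fits, key=lambda lam: test_error(fits[lam]))


# Hypothetical test responses and predictions for three tuning parameters.
y_test = [1.0, 2.0, 3.0]
fits = {
    0.1: [1.1, 2.1, 2.9],
    1.0: [2.0, 2.0, 2.0],
    10.0: [0.0, 0.0, 0.0],
}
best_lambda = select_by_validation(fits, y_test, tau=0.5)
```

For VC-adp the same scheme would apply with the squared error in place of the check loss.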
Table 10:
Selecting tuning parameters based on validation: estimation results in terms of total integrated mean square error (TIMSE) for simulated gene expression data with i.i.d. errors based on 100 replicates.
| Error | QRVC-adp | VC-adp | QRVC-adp | VC-adp | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|
| Normal | 0.29(0.09) | 0.84(0.26) | 0.28(0.13) | 0.94(0.22) | 0.31(0.09) | 1.02(0.33) |
| NormalMix | 0.63(0.52) | 1.31(0.52) | 0.45(0.24) | 1.05(0.28) | 0.51(0.23) | 1.49(0.26) |
| Laplace | 0.37(0.21) | 0.96(0.17) | 0.30(0.13) | 1.00(0.31) | 0.35(0.16) | 1.23(0.25) |
| Lognormal | 0.28(0.48) | 2.63(0.77) | 0.51(0.65) | 2.13(0.90) | 0.98(0.57) | 1.77(0.71) |
| | 1.19(0.91) | 2.61(1.32) | 0.82(0.68) | 2.23(1.45) | 1.18(1.13) | 2.56(1.27) |
A.7. Additional simulation under more complicated varying coefficient functions
Table 11:
Additional simulation under more complicated varying coefficient functions : estimation results in terms of total integrated mean square error (TIMSE) for simulated gene expression data with i.i.d. errors based on 100 replicates.
| Error | BQRVCSS | BQRVC | BVCSS | BVC | QRVC-adp | VC-adp |
|---|---|---|---|---|---|---|
| Normal | 2.22(0.28) | 4.91(0.60) | 2.14(0.29) | 4.20(0.47) | 2.52(0.43) | 2.58(0.33) |
| NormalMix | 2.49(0.52) | 5.71(0.92) | 2.49(0.47) | 5.25(0.72) | 2.89(0.71) | 2.80(0.44) |
| Laplace | 2.42(0.56) | 5.41(0.76) | 2.18(0.38) | 4.47(0.58) | 2.90(0.70) | 2.57(0.37) |
| Lognormal | 2.17(0.41) | 5.34(0.87) | 3.79(1.17) | 7.45(2.38) | 2.85(0.79) | 3.80(0.77) |
| | 2.74(0.61) | 6.67(1.69) | 4.41(3.97) | 9.76(5.29) | 4.73(3.27) | 4.32(2.60) |
| Normal | 2.02(0.34) | 4.99(0.65) | 1.81(0.21) | 3.83(0.39) | 2.31(0.37) | 2.44(0.85) |
| NormalMix | 2.26(0.48) | 5.76(0.83) | 2.11(0.36) | 4.87(0.61) | 2.81(0.70) | 2.76(1.08) |
| Laplace | 2.21(0.50) | 5.36(0.59) | 1.96(0.43) | 4.40(0.60) | 2.68(0.72) | 2.55(0.88) |
| Lognormal | 2.34(0.48) | 6.08(0.77) | 3.23(1.21) | 7.32(3.39) | 3.07(0.87) | 3.40(1.14) |
| | 2.53(0.74) | 6.16(0.89) | 5.12(9.20) | 9.88(7.15) | 4.03(2.94) | 4.82(2.91) |
| Normal | 2.20(0.32) | 5.03(0.58) | 2.28(0.22) | 4.05(0.41) | 2.67(0.57) | 3.39(1.60) |
| NormalMix | 2.44(0.52) | 5.75(0.74) | 2.57(0.50) | 5.34(0.82) | 2.93(0.77) | 4.51(2.00) |
| Laplace | 2.42(0.39) | 5.43(0.70) | 2.14(0.40) | 4.44(0.61) | 2.85(0.52) | 3.86(1.90) |
| Lognormal | 3.07(0.92) | 7.30(1.45) | 3.02(1.66) | 6.73(3.14) | 4.22(1.56) | 3.58(1.72) |
| | 2.73(0.73) | 6.68(1.11) | 4.92(4.32) | 9.14(3.79) | 4.09(2.67) | 6.16(3.66) |
Figure 9:
Estimation of more complicated non-zero varying coefficients under the normal mixture error (Error 2) for the proposed method (BQRVCSS) at 50% quantile level. Red line: true varying coefficients. Black line: posterior median estimates of varying coefficients from BQRVCSS. Blue lines: 95% credible intervals for the estimated varying coefficients.
Figure 10:
Estimation of more complicated non-zero varying coefficients under the normal mixture error (Error 2) for the QRVC-adp at 50% quantile level. Red line: true varying coefficients. Black line: estimated varying coefficients from QRVC-adp. The confidence intervals are not available for frequentist regularized quantile varying coefficients.
Figure 11:
Estimation of non-zero varying coefficients under the normal mixture error (Error 2) for the proposed method (BQRVCSS) at 50% quantile level. Red line: true varying coefficients. Black, Blue and Green lines: posterior median estimates of varying coefficients from BQRVCSS under 25%, 50% and 75% IMSE respectively.
B. Sensitivity analysis on smoothness specification
Let denote the degree of B spline basis and denote the number of interior knots. For quadratic and cubic splines corresponding to and respectively, we conduct a sensitivity analysis for the proposed model.
Table 12:
Sensitivity analysis on smoothness specification for the Laplace error distribution for the 30% quantile. TIMSE: total integrated mean square error.
| | | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Laplace | C | 0.88 | 0.90 | 0.92 | 0.89 | 0.91 |
| | O | 0.12 | 0.10 | 0.08 | 0.11 | 0.09 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.33(0.19) | 0.28(0.12) | 0.31(0.14) | 0.24(0.12) | 0.25(0.15) |
| | | 1 | 2 | 3 | 4 | 5 |
| Laplace | C | 0.89 | 0.90 | 0.92 | 0.86 | 0.88 |
| | O | 0.11 | 0.10 | 0.08 | 0.14 | 0.12 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.25(0.11) | 0.28(0.12) | 0.28(0.15) | 0.26(0.19) | 0.25(0.16) |
Table 13:
Sensitivity analysis on smoothness specification for the Normal error distribution for the 30% quantile. TIMSE: total integrated mean square error.
| | | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Normal | C | 0.97 | 0.96 | 0.98 | 0.95 | 0.94 |
| | O | 0.03 | 0.04 | 0.04 | 0.05 | 0.06 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.26(0.12) | 0.22(0.09) | 0.29(0.16) | 0.23(0.12) | 0.22(0.18) |
| | | 1 | 2 | 3 | 4 | 5 |
| Normal | C | 0.96 | 0.94 | 0.97 | 0.94 | 0.95 |
| | O | 0.04 | 0.06 | 0.03 | 0.06 | 0.05 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.24(0.09) | 0.26(0.14) | 0.21(0.10) | 0.25(0.19) | 0.24(0.12) |
Table 14:
Sensitivity analysis on smoothness specification for the Laplace error distribution for the 50% quantile. TIMSE: total integrated mean square error.
| | | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Laplace | C | 0.96 | 0.94 | 0.92 | 0.95 | 0.96 |
| | O | 0.04 | 0.06 | 0.08 | 0.05 | 0.04 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.25(0.11) | 0.21(0.09) | 0.29(0.16) | 0.28(0.11) | 0.25(0.19) |
| | | 1 | 2 | 3 | 4 | 5 |
| Laplace | C | 0.95 | 0.93 | 0.94 | 0.96 | 0.93 |
| | O | 0.05 | 0.07 | 0.06 | 0.04 | 0.07 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.24(0.07) | 0.31(0.14) | 0.26(0.12) | 0.22(0.16) | 0.26(0.13) |
Table 15:
Sensitivity analysis on smoothness specification for the Normal error distribution for the 50% quantile. TIMSE: total integrated mean square error.
| | | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Normal | C | 0.97 | 0.98 | 0.96 | 0.99 | 0.98 |
| | O | 0.03 | 0.02 | 0.04 | 0.01 | 0.02 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.21(0.06) | 0.23(0.13) | 0.22(0.07) | 0.24(0.14) | 0.22(0.09) |
| | | 1 | 2 | 3 | 4 | 5 |
| Normal | C | 0.98 | 0.96 | 0.98 | 0.98 | 0.97 |
| | O | 0.02 | 0.04 | 0.02 | 0.02 | 0.03 |
| | U | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | TIMSE | 0.19(0.07) | 0.29(0.11) | 0.25(0.07) | 0.24(0.14) | 0.23(0.08) |
C. Posterior inference
C.1. Posterior inference for BQRVCSS
C.1.1. Bayesian hierarchical model
C.1.2. Gibbs Sampler
- The full conditional distribution of ,
  Hence, it follows that
- The full conditional distribution of ,
  Let , then the full conditional posterior distribution of is given as:
  where
  and
  Hence, the posterior distribution of is a mixture of a multivariate normal distribution and a point mass at 0. At each iteration of MCMC, is drawn from with probability and is set to 0 with probability .
- The full conditional distribution of is
  Therefore,
- The full conditional distribution of is
  It follows that
- The full conditional distribution of is
  Then,
- The full conditional distribution of is
  Let
  consequently,
- The full conditional distribution of is
  therefore, we have
  with
  and
- Similarly the full conditional distribution of is derived as
  with
  and
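The spike-and-slab update described in the sampler above, a draw from a multivariate normal slab with some probability and an exact zero vector otherwise, can be sketched as below. The slab mean, slab covariance and mixing probability are treated as given inputs because their expressions are not reproduced in this text; the helper names and the toy values are assumptions.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_spike_and_slab(slab_mean, slab_cov, slab_prob, rng):
    """One draw: N(slab_mean, slab_cov) with probability slab_prob, else exactly 0."""
    dim = len(slab_mean)
    if rng.random() >= slab_prob:
        return [0.0] * dim                      # spike: the whole group is zero
    low = cholesky(slab_cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [slab_mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(dim)]


rng = random.Random(2)
mean_vec = [1.0, -0.5]                          # hypothetical slab mean
cov_mat = [[1.0, 0.3], [0.3, 0.5]]              # hypothetical slab covariance
draws = [draw_spike_and_slab(mean_vec, cov_mat, 0.7, rng) for _ in range(5000)]
zero_fraction = sum(all(v == 0.0 for v in d) for d in draws) / len(draws)
```

Setting the whole coefficient vector of a predictor to zero at once is what produces exact sparsity at the group level; the fraction of zero draws across iterations estimates one minus the posterior inclusion probability.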
C.2. Posterior inference for BQRVC
C.2.1. Bayesian hierarchical model
C.2.2. Gibbs Sampler
- The full conditional distribution of ,
  Then, the full conditional distribution of is
- The full conditional distribution of is
  It follows that
- The full conditional distribution of ,
  Denote the covariance
  and the mean
  then we have
- The full conditional distribution of is
  Therefore,
- The full conditional distribution of is
  It follows that
- The full conditional distribution of is
  therefore, we have
  with mean
  and covariance
- The full conditional distribution of is derived as
  where
  and
C.3. Posterior inference for BVCSS
C.3.1. Bayesian hierarchical model
C.3.2. Gibbs Sampler
- The full conditional distribution of ,
  Let , then the conditional posterior distribution of is a multivariate spike-and-slab distribution given as:
  where , , and
  Hence, the posterior distribution of is a mixture of a multivariate normal distribution and a point mass at 0.
- The full conditional distribution of
  Let
  then the posterior distribution of becomes
  Therefore,
- The full conditional distribution of ,
  Then we have
- The full conditional distribution of
  and we have
- The full conditional distribution of
  hence
- The full conditional distribution of
  and
  where and .
- The full conditional distribution of is
  with and .
C.4. Posterior inference for BVC
C.4.1. Bayesian hierarchical model
C.4.2. Gibbs Sampler
- The full conditional distribution of
  Denote and , then the posterior distribution of is
- The full conditional distribution of
  and we have
  which is a multivariate normal distribution, with mean
  and covariance
- The full conditional distribution of
  therefore .
- The full conditional distribution of
  then,
- The full conditional distribution of
  Therefore, the posterior distribution of is
- The full conditional distribution of is derived as
  where and .
References
- Barbieri MM and Berger JO (2004). Optimal predictive model selection. The Annals of Statistics 32(3), 870–897. [Google Scholar]
- Brooks SP and Gelman A. (1998, December). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7(4), 434–455. [Google Scholar]
- Casella G, Ghosh M, Gill J, and Kyung M. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5(2), 369–411. [Google Scholar]
- Cornelis MC, Agrawal A, Cole JW, Hansel NN, Barnes KC, Beaty TH, Bennett SN, Bierut LJ, Boerwinkle E, Doheny KF, et al. (2010). The gene, environment association studies consortium (geneva): maximizing the knowledge obtained from gwas by collaboration across studies of multiple conditions. Genetic Epidemiology 34(4), 364–372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dai R. and Kolar M. (2021). Inference for high-dimensional varying-coefficient quantile regression. Electronic Journal of Statistics 15(2), 5696–5757. [Google Scholar]
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, and Rubin DB (2013). Bayesian data analysis. CRC press. [Google Scholar]
- Gelman A. and Rubin DB (1992, November). Inference from iterative simulation using multiple sequences. Statistical Science 7(4). [Google Scholar]
- Hastie T. and Tibshirani R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological) 55(4), 757–779. [Google Scholar]
- Huang Z, Li J, Nott D, Feng L, Ng T-P, and Wong T-Y (2015). Bayesian estimation of varying-coefficient models with missing data, with application to the singapore longitudinal aging study. Journal of Statistical Computation and Simulation 85(12), 2364–2377. [Google Scholar]
- Kim H-J, Park J-H, Lee S, Son H-Y, Hwang J, Chae J, Yun JM, Kwon H, Kim J-I, and Cho B. (2015, 09). A common variant of ngef is associated with abdominal visceral fat in Korean men. PLOS ONE 10(9), 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim M-O (2007). Quantile regression with varying coefficients. The Annals of Statistics, 92–108.
- Kozumi H. and Kobayashi G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation 81(11), 1565–1578.
- Li J, Wang Z, Li R, and Wu R. (2015). Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. Annals of Applied Statistics 9(2), 640–664.
- Li Q, Xi R, and Lin N. (2010). Bayesian regularized quantile regression. Bayesian Analysis 5(3), 533–556.
- Lv J. and Li J. (2020). High-dimensional varying index coefficient quantile regression model. Statistica Sinica.
- Ma S. and Song PX-K (2015). Varying index coefficient models. Journal of the American Statistical Association 110(509), 341–356.
- Ma S, Yang L, Romero R, and Cui Y. (2011). Varying coefficient model for gene–environment interaction: a non-linear look. Bioinformatics 27(15), 2119–2126.
- Mirhashemi ME, Shah RV, Kitchen RR, Rong J, Spahillari A, Pico AR, Vitseva O, Levy D, Demarco D, Shah S, Iafrati MD, Larson MG, Tanriverdi K, and Freedman JE (2021, February). The dynamic platelet transcriptome in obesity and weight loss. Arteriosclerosis, Thrombosis, and Vascular Biology 41(2), 854–864.
- Noh H, Chung K, and Van Keilegom I (2012). Variable selection of varying coefficient models in quantile regression. Electronic Journal of Statistics 6, 1220–1238.
- O’Neil D, Mendez-Figueroa H, Mistretta T-A, Su C, Lane RH, and Aagaard KM (2013). Dysregulation of Npas2 leads to altered metabolic pathways in a murine knockout model. Molecular Genetics and Metabolism 110(3), 378–387.
- Orthofer M, Valsesia A, Mägi R, Wang Q-P, Kaczanowska J, Kozieradzki I, Leopoldi A, Cikes D, Zopf LM, Tretiakov EO, et al. (2020). Identification of ALK in thinness. Cell 181(6), 1246–1262.
- Park T. and Casella G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.
- Ren J, Zhou F, Li X, Chen Q, Zhang H, Ma S, Jiang Y, and Wu C. (2019). Semiparametric Bayesian variable selection for gene-environment interactions. Statistics in Medicine 39(5), 617–638.
- Ren J, Zhou F, Li X, Ma S, Jiang Y, and Wu C. (2022). Robust Bayesian variable selection for gene–environment interactions. Biometrics.
- Schumaker L. (2007). Spline functions: basic theory. Cambridge University Press.
- Tang Y, Wang HJ, and Zhu Z. (2013, January). Variable selection in quantile varying coefficient models with longitudinal data. Computational Statistics and Data Analysis 57(1), 435–449.
- Tang Y, Wang HJ, Zhu Z, and Song X. (2012, April). A unified variable selection approach for varying coefficient models. Statistica Sinica 22(2).
- Wang H. and Xia Y. (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104(486), 747–757.
- Wang HJ, Zhu Z, and Zhou J. (2009). Quantile regression in partially linear varying coefficient models. The Annals of Statistics, 3841–3866.
- Wang J. and Yang L. (2009). Polynomial spline confidence bands for regression curves. Statistica Sinica, 325–342.
- Wu C. and Cui Y. (2013). A novel method for identifying nonlinear gene–environment interactions in case–control association studies. Human Genetics 132(12), 1413–1425.
- Wu C, Cui Y, and Ma S. (2014). Integrative analysis of gene–environment interactions under a multi-response partially linear varying coefficient model. Statistics in Medicine 33(28), 4988–4998.
- Wu C. and Ma S. (2015). A selective review of robust variable selection with applications in bioinformatics. Briefings in Bioinformatics 16(5), 873–883.
- Wu C, Shi X, Cui Y, and Ma S. (2015). A penalized robust semiparametric approach for gene–environment interactions. Statistics in Medicine 34(30), 4016–4030.
- Wu C, Zhong P-S, and Cui Y. (2018). Additive varying-coefficient model for nonlinear gene-environment interactions. Statistical Applications in Genetics and Molecular Biology 17(2).
- Wu C, Zhou F, Ren J, Li X, Jiang Y, and Ma S. (2019). A selective review of multi-level omics data integration using variable selection. High-throughput 8(1), 4.
- Xue L. and Yang L. (2006). Additive coefficient modeling via polynomial spline. Statistica Sinica, 1423–1446.
- Yang Y, Wang HJ, and He X. (2016). Posterior inference in Bayesian quantile regression with asymmetric Laplace likelihood. International Statistical Review 84(3), 327–344.
- Yu K. and Moyeed RA (2001). Bayesian quantile regression. Statistics & Probability Letters 54(4), 437–447.
- Yu K. and Zhang J. (2005). A three-parameter asymmetric Laplace distribution and its extension. Communications in Statistics - Theory and Methods 34(9–10), 1867–1879.
- Zhou F, Ren J, Lu X, Ma S, and Wu C. (2021). Gene–environment interaction: A variable selection perspective. Epistasis, 191–223.











