Abstract
With the availability of high dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients’ survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. To our knowledge, there is little work available to draw inference on the effects of high dimensional predictors for censored quantile regression. This paper proposes a novel procedure to draw inference on all predictors within the framework of global censored quantile regression, which investigates covariate-response associations over an interval of quantile levels, instead of a few discrete values. The proposed estimator combines a sequence of low dimensional model estimates that are based on multi-sample splittings and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level. Simulation studies indicate that our procedure can properly quantify the uncertainty of the estimates in high dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in lung cancer pathways on patients’ survival, using the Boston Lung Cancer Survivor Cohort, a cancer epidemiology study on the molecular mechanism of lung cancer.
Keywords: Conditional Quantiles, Fused-HDCQR, High Dimensional Predictors, Statistical Inference, Survival Analysis
1. Introduction
Lung cancer presents much heterogeneity in etiology (McKay et al., 2017; Dong et al., 2012; Huang et al., 2009), and some genetic variants may exert different impacts on different quantile levels of survival time. For example, in the Boston Lung Cancer Survivor Cohort (Christiani, 2017), a cancer epidemiology cohort of over 11,000 lung cancer cases enrolled in the Boston area since 1992, it was found that SNP AX.37793583 (rs115952579), along with age, gender, cancer stage and smoking status, had heterogeneous effects on different quantiles of survival time. A total of 674 patients in the study were genotyped, with the goal of identifying lung cancer survival-predictive SNPs. Target gene approaches, which focus on SNPs residing in cancer-related gene pathways, are appealing for increased statistical power in detecting significant SNPs (Moon et al., 2003; Risch and Plass, 2008; Ho et al., 2019), and the investigators have identified SNPs residing in 14 well-known lung cancer-related genes (Zhu et al., 2017; Korpanty et al., 2014; Yamamoto et al., 2008; Kelley et al., 2001). One goal was to investigate whether and how each SNP might play a different role among the high-risk (i.e. lower quantiles of overall survival) and low-risk (i.e. higher quantiles of overall survival) cancer survivors.
Quantile regression (QR) (Koenker and Bassett Jr, 1978) is a significant extension of classic linear regression. By permitting the effects of active variables to vary across quantile levels, quantile regression can naturally accommodate and examine the heterogeneous impacts of biomarkers on different segments of the response variable’s conditional distribution. As survival data are subject to censoring and may be incomplete, QR methods developed for complete data may be unsuitable. Efforts have been devoted to developing censored quantile regression (CQR) (Powell, 1986; Portnoy, 2003; Peng and Huang, 2008, among others), which has become a useful alternative strategy to traditional survival models, such as the Cox model and accelerated failure time model. QR has also been widely studied to accommodate high dimensional predictors. For example, Wang et al. (2012) dealt with variable selection using non-convex penalization; Zheng et al. (2013) proposed an adaptive penalized quantile regression estimator that can select the true sparse model with high probability; and Fan et al. (2014) studied the penalized quantile regression with a weighted L1 penalty in an ultra-high dimensional setting. As to high dimensional CQR (HDCQR), He et al. (2013) provided a model-free variable screening procedure for ultra-high dimensional covariates, and Zheng et al. (2018) proposed a penalized HDCQR built upon a stochastic integral based estimating equation. However, most of the existing works in HDCQR were designed to select a subset of predictors and estimate the effects of the selected variables, instead of drawing inference on high dimensional predictors.
Progress in high dimensional inference has been made for linear and non-linear models (Zhang and Zhang, 2014; Bühlmann et al., 2014; Javanmard and Montanari, 2014; Ning and Liu, 2017; Fei et al., 2019, among others). For example, Meinshausen et al. (2009) proposed to aggregate p-values from multi-sample splittings for high dimensional linear regression. Another line of work, referred to as post-selection inference, includes Berk et al. (2013), Lee et al. (2016), and Belloni et al. (2019), the last of which recently provided post-selection inference at fixed quantiles for complete data. However, these methods may not handle censored outcomes. For censored median regression, Shows et al. (2010) provided sparse estimation and inference, but their method cannot handle high dimensional data.
We propose to draw inference for HDCQR based on a splitting and fusing scheme, termed Fused-HDCQR. Utilizing a variable selection procedure for HDCQR such as Zheng et al. (2018), our method performs partial regression followed by smoothing. Specifically, partial regression allows us to estimate the effect of each predictor, regardless of whether it is chosen by variable selection or not. The fused estimator aggregates the estimates based on multiple data-splittings and variable selection, with a variance estimator derived by the functional delta method (Efron, 2014; Wager and Athey, 2018). To comprehensively assess the covariate effects on the survival distribution, we adopt a “global” quantile model (Zheng et al., 2015) with the quantile level ranging over an interval, instead of the local CQR that focuses only on a few pre-specified quantile levels. The global quantile model is well suited to the hypothesized molecular mechanism of lung cancer, our motivating disease, under which some genetic variants may have heterogeneous impacts on different but unspecified segments of the survival distribution (McKay et al., 2017; Dong et al., 2012; Huang et al., 2009).
Our work presents several advantages. First, compared to high dimensional Cox models (Zhao and Li, 2012; Fang et al., 2017; Kong et al., 2018), the employed HDCQR stems from the accelerated failure time model (Wei, 1992) and offers straightforward interpretations (Hong et al., 2019). Second, utilizing the global conditional quantile regression, it uses various segments of the conditional survival distribution to improve the robustness of variable selection and capture global sparsity. Third, our splitting-and-averaging scheme avoids the technicalities of estimating the precision matrix by inverting the p × p Hessian matrix of the log likelihood, which is a major challenge for debiased-LASSO type methods (Zhang and Zhang, 2014; Van de Geer et al., 2014) and is even more so if we apply debiased-LASSO to the CQR setting. Finally, as opposed to post-selection inferences (Belloni et al., 2019, among others), Fused-HDCQR accounts for variations in model selection and draws inference for all of the predictors.
The rest of the paper is organized as follows. Section 2 introduces the method, and Section 3 details the asymptotic properties. Section 4 derives a non-parametric variance estimator, Section 5 conducts simulation studies, and Section 6 applies the proposed method to analyze the Boston Lung Cancer Survivor Cohort (BLCSC) data. The technical details, such as proofs and additional lemmas, are relegated to the online Supplementary Material.
2. Model and Method
2.1. High dimensional censored quantile regression
Let T and C denote the survival outcome and censoring time, respectively. We assume that C is independent of T given $\tilde{Z}$, a (p − 1)-dimensional vector of covariates (p > 1). Let X = min{T, C}, Δ = 1{T ≤ C}, and $Z = (1, \tilde{Z}^{T})^{T}$, where 1{·} is the binary indicator function. The observed data, D(n) = {(Xi,Δi,Zi), i = 1, . . . , n}, are n i.i.d. copies of (X,Δ,Z). With Y = log T, let QY (τ|Z) = inf{t : P(Y ≤ t|Z) ≥ τ} be the τ-th conditional quantile of Y given Z. A global censored quantile regression model stipulates
$$Q_Y(\tau \mid Z) = Z^{T}\beta^{*}(\tau), \tag{1}$$
where β*(τ) is a p-dimensional vector of coefficients at τ. We aim to draw inference on $\beta_j^{*}(\tau)$ for each τ ∈ (0, τU] and for all j ∈ {1, . . . , p}, where 0 < τU < 1 is an upper bound for estimable quantiles subject to the identifiability constraint caused by censoring (Peng and Huang, 2008).
Let N(t) = 1{logX ≤ t, Δ = 1}, ΛT (t|Z) = −log(1 − P(log T ≤ t|Z)), and H(u) = −log(1 − u). Then, M(t) = N(t) − ΛT (t ∧ logX|Z) is a martingale process under model (1) (Fleming and Harrington, 2011) and hence E(M(t)|Z) = 0. We use Ni(t) and Mi(t), i = 1, . . . , n, to denote the sample analogs of N(t) and M(t). Let
$$U_n(\beta, \tau) = \frac{1}{n}\sum_{i=1}^{n} Z_i\left\{N_i\big(Z_i^{T}\beta(\tau)\big) - \int_0^{\tau} 1\{\log X_i \ge Z_i^{T}\beta(u)\}\,dH(u)\right\}.$$
We denote the expectation of Un(β, τ) by u(β, τ).
The martingale property implies u(β*, τ) = 0 with τ ∈ [0, τU], entailing the estimating equation with τ ∈ (0, τU]:
$$n^{1/2}\,U_n(\beta, \tau) = 0. \tag{2}$$
The stochastic integral in (2) naturally suggests sequential estimation with respect to τ. We define a grid of quantile values Γm = {τ0, τ1, . . . , τm} to cover the interval [ν, τU], where τ0 = ν and τm = τU. The assumption on the lower bound ν > 0 is made to circumvent the singularity problem with CQR at τ = 0, as detailed in assumption (A1). In practice, ν is chosen such that only a small proportion of observations are censored below the ν-th quantile.
Then the estimates of the β(τk)’s at the grid points τk ∈ Γm can be sequentially obtained by solving
where . Due to the monotonicity of θi(τ) in τ, can be solved efficiently via L1-minimization. And , τ ∈ [ν, τU] is defined as a right-continuous piece-wise constant function that only jumps at the grid points. It can be shown that is uniformly consistent and converges weakly to a mean zero Gaussian process for τ ∈ [ν, τU] when p = o(n). More importantly, provides a comprehensive understanding of the covariate effects on the conditional survival distribution over the quantile interval [ν, τU]. We incorporate this sequential estimating procedure for low dimensional CQR estimation in our proposed method.
In addition, our method requires dimension reduction, which can be accomplished by existing methods, including the screening method proposed by He et al. (2013) and the penalized estimation and selection procedure developed by Zheng et al. (2018). Specifically, Zheng et al. (2018) incorporated an L1 penalty into the stochastic integral based estimating equation in (2) to obtain an L-HDCQR estimator, which achieves a uniform convergence rate of , and results in “sure screening” variable selection with high probability, where q is defined in condition (A4). Zheng et al. (2018) also proposed an AL-HDCQR estimator by employing the Adaptive Lasso penalties, which attains a uniform convergence rate of and selection consistency.
2.2. Fused-HDCQR estimator
Our proposed Fused-HDCQR procedure consists of multiple data splitting, selecting variables, fitting low dimensional CQRs with partitioned data, applying append-and-estimate to all predictors, and aggregating those estimates.
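- With the full data D(n), determine via cross-validation the tuning parameter(s) λn of $\mathcal{S}_{\lambda}$, an HDCQR variable selection method.
- Let B be a large positive number. For each b = 1, . . . , B,
  - randomly split the data into equal halves $D_1^{b}$ and $D_2^{b}$;
  - on $D_2^{b}$, apply the selection procedure $\mathcal{S}_{\lambda_n}$ on [ν, τU] to select a subset of predictors, denoted by $\hat{S}^{b}_{\lambda_n}$, or $\hat{S}^{b}$ for short;
  - on $D_1^{b}$, for each j = 1, . . . , p, fit the partial CQR using the subset of covariates $Z_{\hat{S}^{b} \cup \{j\}}$, and denote the estimator by $\tilde{\beta}_{\hat{S}^{b} \cup \{j\}}(\tau)$, τ ∈ [ν, τU], which is a right-continuous piecewise-constant function that jumps only at the grid points τk ∈ Γm;
  - denote the entry in $\tilde{\beta}_{\hat{S}^{b} \cup \{j\}}(\tau)$ corresponding to Zj by $\tilde{\beta}^{b}_j(\tau)$.
- Fusing: the final estimator of $\beta_j^{*}(\tau)$, τ ∈ [ν, τU], j = 1, . . . , p, is

$$\hat{\beta}_j(\tau) = \frac{1}{B}\sum_{b=1}^{B} \tilde{\beta}^{b}_j(\tau). \tag{3}$$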
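The split, select, fit, and fuse steps above can be sketched in Python as follows. This is only a minimal skeleton under stated assumptions: `select_fn` and `fit_fn` are hypothetical user-supplied callables standing in for an HDCQR selection method and a low dimensional CQR fit over the τ-grid, neither of which is implemented here, and the special handling of the intercept is omitted.

```python
import numpy as np

def fused_estimates(X, delta, Z, select_fn, fit_fn, B=50, seed=0):
    """Sketch of the split / select / partial-fit / average scheme.

    select_fn(X, delta, Z) -> iterable of selected column indices (hypothetical).
    fit_fn(X, delta, Z_sub) -> coefficients, shape (n_cols,) or (n_cols, n_grid)
                               (hypothetical stand-in for a low dimensional CQR fit).
    Returns the fused estimates (p, n_grid), the per-split estimates
    (B, p, n_grid), and the inclusion indicators J (B, n) of the fitting half.
    """
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    n1 = n // 2
    J = np.zeros((B, n), dtype=int)
    per_split = None
    for b in range(B):
        perm = rng.permutation(n)
        fit_idx, sel_idx = perm[:n1], perm[n1:]
        J[b, fit_idx] = 1
        # Variable selection on one half of the data.
        selected = {int(k) for k in select_fn(X[sel_idx], delta[sel_idx], Z[sel_idx])}
        # Partial regression on the other half: append predictor j and refit.
        for j in range(p):
            cols = sorted(selected | {j})
            coef = np.asarray(fit_fn(X[fit_idx], delta[fit_idx], Z[fit_idx][:, cols]))
            coef = coef.reshape(len(cols), -1)
            if per_split is None:
                per_split = np.empty((B, p, coef.shape[1]))
            per_split[b, j] = coef[cols.index(j)]
    # Fusing: average the split-specific estimates over the B splits.
    return per_split.mean(axis=0), per_split, J
```

Returning the inclusion indicators alongside the per-split estimates is a design choice made here so that a resampling-based variance estimator can reuse them without refitting.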
Remark 1.
We could select different tuning parameters for the selection method in each data split, but with much added computation. Our numerical evidence suggests that a globally chosen λn works well.
Remark 2.
Our procedure needs a variable selection procedure to reduce dimension. For example, L-HDCQR selects the subset , where ‘s are the L-HDCQR estimates, and a0 > 0 is a predetermined threshold. We start j with 2 as the intercept term (corresponding to j = 1) is always included in the model. In regards to the choice of variable selection methods, based on our experience, we can adopt the screening method in He et al. (2013) for fast computation, use L-HDCQR for detecting any non-zero effects in the quantile interval [ν, τU], and choose AL-HDCQR if we opt to select fewer predictors.
Remark 3.
With the censored outcomes, we have used the deviance residual to define the K-fold cross-validation criterion as in Zheng et al. (2018) and selected λn by minimizing it. Specifically, we partition the data to K folds, and let be the penalized estimate of β(τ) using all of the data excluding the k-th fold with a tuning parameter λ and τ ∈ [ν, τU], where k = 1, . . . ,K. Under the global CQR model (1), we define the cross-validation error as
| (4) |
where
with . Here, H(u) = −log(1 − u), Ni(·) is the counting process, and Mi(β(τ)) is the Martingale residual under model (1) (Zheng et al., 2018).
3. Theoretical Studies
3.1. Notation and regularity conditions
For any vector δ ∈ Rp and a subset S ⊂ {1, . . . , p}, denote by SC the complementary set and define ‖δ‖r,S = ‖δS‖r, the lr-norm of the sub-vector δS, in which δjS = δj if j ∈ S and δjS = 0 if j ∉ S. We set the following conditions.
(A1) There exists a quantile ν and some constant c > 0 such that
holds for sufficiently large n.
(A2) (Bounded observations) ‖Z‖∞ ≤ C0. Without loss of generality, we assume C0 = 1. In addition, E|logX| < ∞.
(A3) (Bounded densities) Let FT (t|Z) = P(log T ≤ t|Z), ΛT (t|Z) = −log(1 − FT (t|Z)), F(t|Z) = P(logX ≤ t|Z), and G(t|Z) = P(logX ≤ t,Δ = 1|Z). Also, define f(t|Z) = dF(t|Z)/dt, and g(t|Z) = dG(t|Z)/dt.
-
a
There exist constants , , and such that
-
b
There exist constant κ > 0 and A such that ∀|t| ≤ κ,
(A4) (Sparsity) Assume log p = o(n^{1/2}), let
Let be the index set of covariates selected by with a tuning parameter λn. There exist constants 0 ≤ c1 < 1/3, c2, K1, K2 > 0 such that , , and
(A5) Let . There exists a positive constant L, such that and , for all τ1, τ2 ∈ (ν, τU] and 1 ≤ j ≤ p.
(A6) (Eigenvalues) is bounded below and above by λmin and λmax, respectively, over , δ ≠ 0, where 0 < λmin < λmax. (nonlinear impact).
(A7) Γm is equally gridded with τk − τk−1 = ϵn for τk ∈ Γm, k = 1, . . . ,m. The grid size satisfies ϵn = c0 n^{−1} for some constant c0.
Assumption (A1) requires that the number of censored observations below the ν-th quantile does not exceed c n^{1/2}, which is satisfied if the lower bound of the censoring time C’s support is greater than 0 and seems reasonable in real applications. As recommended in Zheng et al. (2018), ν is chosen such that only a small proportion of the observed survival times below the ν-th quantile are censored. (A2) assumes that the covariates are uniformly bounded. As pointed out by Zheng et al. (2015), the global linear quantile regression model is most meaningful when the covariates are confined to a compact set to avoid crossing of the quantile functions. (A3) ensures the positiveness of f(t|Z) between ZTβ*(ν) and ZTβ*(τU), which is essential for the identifiability of β*(τ) for τ < τU. (A4) restricts the order of data dimensions, as well as the sparsity of β*(τ), which is necessary for the convergence of the low dimensional estimator in (2) (Condition C4 in Wang et al. (2012)). (A4) also characterizes the “sure screening” property by S. The asymptotic property does not assess the variability of selection with a finite sample. For high dimensional inference, it is crucial to account for such variability (Fei et al., 2019). Specifically, several variable selection methods for high dimensional CQR satisfy the sure screening property in (A4) with additional mild conditions.
L-HDCQR: by Corollary 4.1 of Zheng et al. (2018), a Beta-min condition is required in addition to the set of conditions in this paper. Explicitly, there exist constants C1, C2 > 0, such that
AL-HDCQR: by Corollary 4.2 of Zheng et al. (2018), AL-HDCQR achieves the stronger selection consistency property, which implies the sure screening property.
Quantile-adaptive Screening: by Theorem 3.3 of He et al. (2013), with a proper threshold value in their technical conditions, the screening procedure achieves the sure screening property.
(A5) characterizes the smoothness of β*(τ). (A6) is analogous to the assumptions on the covariance structure in the high dimensional literature (Zhao and Yu, 2006; Belloni and Chernozhukov, 2011; Fan et al., 2014; Van de Geer et al., 2014). As an extension to Condition C4 in Peng and Huang (2008), it ensures the convergence of low dimensional CQR but with a diverging number of covariates. (A7) details the fineness of Γm, which renders an adequate approximation to the stochastic integration in (2).
3.2. Theoretical properties of Fused-HDCQR
We first extend the results in Peng and Huang (2008) from a fixed p to a p-diverges-but-less-than-n case. They are novel and critical extensions since we allow the true model size q = |S*| to increase with n, while the selected subsets in the fused procedure vary around S*. Specifically, we assume a subset S ⊂ {1, . . . , p} in Theorems 1 and 2, where $|S| \le K_1 n^{c_1}$, 0 ≤ c1 < 1/3 and K1 > 0. Let $\hat{\beta}_S(\tau)$, τ ∈ [ν, τU], be the estimator from Peng and Huang (2008) of fitting the CQR with ZS over the τ-grid Γm.
Theorem 1.
(Consistency with a diverging number of predictors) Under Conditions (A1) – (A7) and given a subset S ⊂ {1, · · · , p} such that S* ⊆ S and , there exist positive constants ζ1 and ζ2 such that
with probability at least .
Remark 4.
From the proofs of Propositions 1 and 2, it can be seen that ζ1 and ζ2 do not depend on the choice of S or n. Thus, ζ1 and ζ2 are universal for all possible S satisfying S* ⊆ S and .
Next, we derive the weak convergence of for any j ∈ S.
Theorem 2.
(Weak convergence with a diverging number of covariates) Suppose Conditions (A1) – (A7) hold. Given a S ⊂ {1, · · · , p} such that S* ⊆ S and , for any j ∈ S,
converges weakly to a mean zero Gaussian process for τ ∈ [ν, τU].
In high dimensional settings, the next theorem shows that the fused estimator enjoys desirable theoretical properties.
Theorem 3.
Consider the Fused-HDCQR estimator in (3). Under assumptions (A1) – (A7), for any j ∈ {1, . . . , p},
converges weakly to a mean zero Gaussian process for τ ∈ [ν, τU].
Our framework enables us to obtain the joint distribution of K-dimensional estimated coefficients, where K is a finite number. Let be the collection of the indices of K covariates of interest. We can show that the weak convergence result of , a K-dimensional subvector of the oracle estimator, still holds for τ ∈ [ν, τU], that is, , τ ∈ [ν, τU] converges to a K-dimensional Gaussian distribution at any τ ∈ [ν, τU]. We only need to replace by in the proof of Theorem 2 in the Appendix and slightly modify the arguments accordingly. Consequently, the term I in the proof of Theorem 3 still converges weakly to a mean zero Gaussian distribution, while the norms of items II and III are still op(1). Therefore, Theorem 3 still holds for any K-dimensional subvector of , i.e., converges to a mean zero K-dimensional Gaussian distribution at any τ ∈ [ν, τU].
As shown in the proof, the covariance function of depends on the unknown active set S*, the unknown conditional density functions f(t|Z) and g(t|Z), and other unknown quantities. Thus, it is not calculable. The next section proposes an alternative model-free variance estimator based on functional delta method and multi-sample splitting properties (Efron, 2014; Fei et al., 2019).
4. A Variance Estimator via the Functional Delta Method
Let Jbi ∈ {0, 1} be the indicator of whether ith observation is in the bth sub-sample , and . We define the re-sampling covariances between Jbi and at τk ∈ Γm for each i = 1, . . . , n as
Let . The covariance between and is estimated by
where the multiplier n(n − 1)/(n − n1)^2 is a finite-sample correction for the sub-sampling (Wager and Athey, 2018). Thus a variance estimator for the fused estimator $\hat{\beta}_j(\tau_k)$ is
| (5) |
It is shown in Wager and Athey (2018) that (5) is consistent, i.e., as n,B → ∞. Furthermore, for a finite B, we propose a bias corrected version of (5):
| (6) |
The correction term in (6) is a suitable multiplier of the re-sampling variance of the split-specific estimates, which converges to zero as n → ∞ and n1 = O(n), and the two variance estimators in (5) and (6) are asymptotically equivalent. However, the estimator in (5) requires B to be of order n^{3/2} to reduce the Monte Carlo noise below the sampling noise, while the estimator in (6) only requires B to be of order n to achieve the same (Wager et al., 2014).
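As a rough illustration of the resampling variance idea, the sketch below computes an infinitesimal-jackknife style variance at a single grid point from the inclusion indicators and the split-specific estimates. It uses the multiplier n(n − 1)/(n − n1)^2 stated in the text. The particular bias-correction term is an assumption made here in the spirit of Wager et al. (2014), and it may differ from the exact form of (6); the function name and argument layout are likewise illustrative.

```python
import numpy as np

def ij_variance(J, est, n1):
    """Infinitesimal-jackknife style variance for a subsample-averaged estimate.

    J   : (B, n) array of 0/1 indicators, J[b, i] = 1 if observation i is in
          the b-th fitting sub-sample.
    est : (B,) array of split-specific estimates at one grid point.
    n1  : size of each fitting sub-sample.
    Returns (raw, corrected) variance estimates.
    """
    J = np.asarray(J, dtype=float)
    est = np.asarray(est, dtype=float)
    B, n = J.shape
    J_centered = J - J.mean(axis=0)
    est_centered = est - est.mean()
    # Resampling covariance between inclusion of observation i and the estimate.
    cov = J_centered.T @ est_centered / B
    raw = n * (n - 1) / (n - n1) ** 2 * np.sum(cov ** 2)
    # Assumed finite-B correction: a multiple of the resampling variance.
    correction = n * n1 / (B ** 2 * (n - n1)) * np.sum(est_centered ** 2)
    return raw, raw - correction
```

With a small number of splits the corrected value can be negative, so in practice one might truncate it at zero or increase B; that choice is left to the user.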
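Since the fused estimator, suitably normalized, converges weakly to a Gaussian process by Theorem 3, and our variance estimators are consistent on the grid points, we define the asymptotic 100(1 − α)% local confidence intervals for $\beta_j^{*}(\tau_k)$ at any τk ∈ Γm as
$$\left(\hat{\beta}_j(\tau_k) - \Phi^{-1}(1-\alpha/2)\sqrt{\hat{V}_j^{B}(\tau_k)},\ \ \hat{\beta}_j(\tau_k) + \Phi^{-1}(1-\alpha/2)\sqrt{\hat{V}_j^{B}(\tau_k)}\right),$$
where $\hat{V}_j^{B}(\tau_k)$ is the variance estimator in (6), and Φ is the standard normal cumulative distribution function. The p-value of testing $H_0: \beta_j^{*}(\tau_k) = 0$ for each τk ∈ Γm is
$$2\left\{1 - \Phi\left(|\hat{\beta}_j(\tau_k)| \big/ \sqrt{\hat{V}_j^{B}(\tau_k)}\right)\right\}.$$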
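The normal-approximation interval and two-sided p-value at a single grid point can be computed with the Python standard library alone. The sketch below assumes a point estimate and a variance estimate are already available; the function name and the numbers in the usage line are purely illustrative.

```python
from statistics import NormalDist

def wald_ci_pvalue(beta_hat, var_hat, alpha=0.05):
    """Two-sided normal-approximation CI and p-value for H0: beta = 0."""
    se = var_hat ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (beta_hat - z * se, beta_hat + z * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(beta_hat) / se))
    return ci, p_value

# Illustrative inputs only: an estimate of 0.5 with standard error 0.13.
ci, p_value = wald_ci_pvalue(0.5, 0.13 ** 2)
```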
5. Simulation Studies
In various settings, we have compared the proposed method, Fused-HDCQR (referred to as “Fused” in the tables and figures hereafter), with some competing methods in quantile regression or high dimensional inference. These methods include Wang et al. (2012) (“W12”) and Fan et al. (2014) (“F14”) for quantile regression; Zheng et al. (2018) (“Z18”) for censored quantile regression; and Meinshausen et al. (2009) (“M09”) for inference with aggregated p-values from multi-sample splittings.
In the simulations and the data analysis, we choose L-HDCQR described in Section 3 as the variable selection tool for Fused-HDCQR. We also explore the feasibility of using other alternatives for variable selection, such as Fan et al. (2009) (“F09”) and M09.
When implementing Fused-HDCQR, we specify the number of splits as B = 300, the quantile interval as [ν, τU] = [0.1, 0.8], and the grid length as m = n/log p. The tuning parameter is chosen by minimizing the 5-fold cross-validation error as in (4). We study the following examples with sparse non-zero effects, some of which are heterogeneous.
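For concreteness, the quantile grid implied by these settings can be built as below. This is a small sketch that assumes equal spacing over [ν, τU] and rounds n/log p to the nearest integer; the rounding rule is an assumption, since the text does not state how a non-integer m is handled.

```python
import math
import numpy as np

def quantile_grid(n, p, nu=0.1, tau_u=0.8):
    """Equally spaced grid tau_0 = nu < ... < tau_m = tau_u with m ~ n / log p."""
    m = max(1, round(n / math.log(p)))  # rounding is an assumption
    return np.linspace(nu, tau_u, m + 1)

grid = quantile_grid(300, 1000)  # m = 43 intervals, 44 grid points
```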
Example 1.
The event times are generated by
$$\log T_i = \sum_{j=1}^{p} Z_{ij} b_j + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the coefficient vector b is sparse with b20 = 0.5, b40 = 1, b60 = 1.5, bj = 0 for all other j’s, and εi ~ N(0, 1). Therefore, the true coefficients are β*(τ) = (Qε(τ), bT)T for all τ ∈ (0, 1), where Qε(τ), the τ-th quantile of the distribution of ε, is the intercept. The covariates $Z_{ij}$’s are i.i.d. Unif(−1, 1) for j ∈ {1, . . . , p}. The censoring time is generated independently as log Ci = N(0, 16) + N(−5, 1) + N(8, 0.25), which gives a censoring rate around 25%.
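A data-generating sketch for Example 1 is given below. It assumes a linear model on the log scale, log T = Zb + ε, which is consistent with the stated true coefficients, and it reads the censoring specification as a sum of three independent normal draws with the second argument of N(·, ·) being a variance. Both readings are assumptions; under them the empirical censoring rate comes out near the stated 25%.

```python
import numpy as np

def simulate_example1(n=300, p=1000, seed=0):
    """Simulate (log X, delta, Z, b) following the description of Example 1."""
    rng = np.random.default_rng(seed)
    b = np.zeros(p)
    b[[19, 39, 59]] = [0.5, 1.0, 1.5]           # b20, b40, b60 (1-based)
    Z = rng.uniform(-1.0, 1.0, size=(n, p))
    log_T = Z @ b + rng.standard_normal(n)       # assumed log-linear model
    # Sum of three independent normals; N(mean, variance) is assumed.
    log_C = (rng.normal(0.0, 4.0, n)
             + rng.normal(-5.0, 1.0, n)
             + rng.normal(8.0, 0.5, n))
    log_X = np.minimum(log_T, log_C)
    delta = (log_T <= log_C).astype(int)         # 1 = event observed
    return log_X, delta, Z, b

log_X, delta, Z, b = simulate_example1()
censoring_rate = 1.0 - delta.mean()
```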
Example 2.
The event times follow
| (7) |
where b20 = 1, b40 = 1.5, b60 = 2, bj = 0 for all other j’s, and εi ~ N(0, 1). We first generate with Σ = (σkℓ)p×p, σkℓ = 0.3|k−ℓ| the AR(1) correlation structure, and then let , except for the third covariate . Therefore , , and , for all other j’s. The censoring time is generated independently as log Ci = N(0, 16) + N(−4, 1) + N(8, 0.25), which gives a censoring rate around 23%.
Example 3.
The event times follow
where b20 = 1, b40 = 1.5, b60 = 2, bj = 0 for all other j’s, ξi ~ N(0, 1), and ϕ1, ϕ10 are monotone functions as the dashed lines in Figure 1, both are continuous with zero and non-zero pieces over τ. We first generate as in Example 2, and then let , except and . Therefore , , , and , for all other j’s. The censoring time is generated independently as log Ci = N(0, 16) + N(−4, 1) + N(10, 0.25), which gives a censoring rate around 20%.
Figure 1:

Estimated heterogeneous effects and confidence intervals of Fused-HDCQR using Example 3: $\beta_2(\tau)$ (left panel) and $\beta_{11}(\tau)$ (right panel). From the top to the bottom are the plots for (n, p) = (300, 1000), (700, 1000) and (700, 2000), respectively.
For each of these examples, we set (n, p)=(300, 1000) and (700, 1000) to study the impacts of the sample size and the number of variables and how the methods fare when p > n. In Example 3, which mimics the real data example in Section 6 most closely, we have also explored (n, p) = (700, 2000), which is roughly equal to the dimension of the real dataset. For every parameter configuration, a total of 100 independent datasets are generated, and we report the averaged results from these replications, unless specified otherwise. We use 100 replications because the penalized methods for high dimensional CQR are in general computationally intensive and require substantial computing time even for one simulated dataset (Table 5).
Table 5:
Comparisons of average computing time (in seconds) when performing Example 1.
| | Fused | Z18 | W12 | F14 | M09 |
|---|---|---|---|---|---|
| (n,p) = (300,1000) | 888 | 853 | 509 | 390 | 170 |
| (n,p) = (700,1000) | 3,108 | 1,812 | 2,230 | 1,231 | 440 |
Note: see the footnote of Table 2.
We first evaluate the feasibility of using various variable selection tools for our proposed method. Comparisons of true positives and false positives among F09, M09, and L-HDCQR under Examples 1–3 are reported in Table 1. F09 presents a subpar performance because, by taking intersections of variables selected from different partitions of data, it tends to miss some true signals and thus has fewer true positives. In contrast, L-HDCQR retains more true positives than both F09 and M09, while incurring more false positives. Because our method requires the variable selection step to include the true signals with high probability, even at the cost of some false positives, we have opted to use L-HDCQR as the screening tool for our method.
Table 1:
Summary of variable selection results based on the simulated datasets.
| Example | (n,p) | CR | q | TP: L-HDCQR | TP: M09 | TP: F09 | FP: L-HDCQR | FP: M09 | FP: F09 |
|---|---|---|---|---|---|---|---|---|---|
| Example 1 | (300,1000) | 0.25 | 3 | 2.67 | 2.12 | 1.64 | 7.95 | 0.00 | 0.19 |
| | (700,1000) | 0.25 | 3 | 2.98 | 2.78 | 2.27 | 13.08 | 0.01 | 0.34 |
| Example 2 | (300,1000) | 0.22 | 4 | 3.60 | 3.58 | 2.22 | 12.45 | 0.00 | 0.22 |
| | (700,1000) | 0.23 | 4 | 3.99 | 3.99 | 3.54 | 11.29 | 0.00 | 0.64 |
| Example 3 | (300,1000) | 0.20 | 5 | 3.82 | 3.63 | 1.91 | 10.00 | 0.00 | 0.17 |
| | (700,1000) | 0.20 | 5 | 4.81 | 4.77 | 4.35 | 11.73 | 0.01 | 0.54 |
| | (700,2000) | 0.19 | 5 | 4.78 | 4.76 | 4.17 | 16.34 | 0.00 | 0.47 |
Note: CR, average censoring rate; q = |S*|; TP, average true positives; FP, average false positives; M09, Meinshausen et al. (2009); F09, Fan et al. (2009); L-HDCQR, Zheng et al. (2018).
We next compare the performance of Fused-HDCQR with other high dimensional quantile regression methods at τ = .25, .5, .75 under Example 1. As a benchmark for comparisons, we also compute the oracle estimates based on the true model (with S* known). As W12, F14, and Z18 provide coefficient estimates without standard errors (SEs), only the estimation biases are reported for them, while the average SEs, empirical standard deviations (SDs) and coverage probabilities of the confidence intervals are reported for our method. Table 2 shows that Fused-HDCQR presents the smallest biases, which are comparable to those of the oracle estimates. In contrast, Z18 has smaller biases when the sample size is large, and larger biases otherwise, while W12 and F14 incur substantial biases since they are not designed for censored data. Moreover, the SEs based on Fused-HDCQR agree with the empirical SDs of the estimates. The consistent estimates of coefficients and SEs obtained by Fused-HDCQR lead to proper coverage probabilities around the 0.95 nominal level. In addition, the coverage probabilities improve as n increases.
Table 2:
Results of Example 1 based on the simulated datasets.
| (n, p) | Parameter | τ | Bias: Oracle | Bias: Fused | Bias: Z18 | Bias: F14 | Bias: W12 | EmpSD | SE | Cov | Power: Fused | Power: M09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n = 300, p = 1000 | β21 = 0.5 | .25 | 0.02 | 0.02 | −0.38 | −0.50 | −0.48 | 0.14 | 0.13 | 0.93 | 0.97 | 0.06 |
| | | .5 | 0.02 | 0.01 | −0.24 | −0.49 | −0.48 | 0.12 | 0.13 | 0.95 | 0.98 | 0.04 |
| | | .75 | 0.01 | 0.01 | −0.13 | −0.50 | −0.48 | 0.12 | 0.13 | 0.96 | 1.00 | 0.02 |
| | β41 = 1 | .25 | −0.01 | −0.01 | −0.02 | −0.91 | −0.33 | 0.14 | 0.13 | 0.92 | 1.00 | 0.99 |
| | | .5 | −0.00 | −0.00 | −0.03 | −0.68 | −0.32 | 0.14 | 0.12 | 0.92 | 1.00 | 0.98 |
| | | .75 | 0.02 | 0.01 | −0.01 | −0.70 | −0.30 | 0.17 | 0.14 | 0.93 | 1.00 | 0.94 |
| | β61 = 1.5 | .25 | −0.00 | 0.01 | 0.00 | −0.92 | −0.24 | 0.12 | 0.13 | 0.92 | 1.00 | 1.00 |
| | | .5 | 0.00 | 0.01 | 0.01 | −0.64 | −0.25 | 0.11 | 0.13 | 0.97 | 1.00 | 1.00 |
| | | .75 | 0.02 | 0.01 | 0.02 | −0.70 | −0.25 | 0.13 | 0.14 | 0.95 | 1.00 | 1.00 |
| n = 700, p = 1000 | β21 = 0.5 | .25 | −0.02 | −0.01 | −0.01 | −0.47 | −0.23 | 0.09 | 0.08 | 0.92 | 1.00 | 0.56 |
| | | .5 | −0.01 | −0.01 | −0.01 | −0.39 | −0.22 | 0.08 | 0.08 | 0.89 | 1.00 | 0.65 |
| | | .75 | −0.01 | −0.01 | −0.01 | −0.40 | −0.23 | 0.10 | 0.09 | 0.89 | 1.00 | 0.44 |
| | β41 = 1 | .25 | 0.00 | 0.00 | 0.04 | −0.53 | −0.17 | 0.09 | 0.08 | 0.91 | 1.00 | 1.00 |
| | | .5 | −0.00 | 0.00 | 0.03 | −0.49 | −0.19 | 0.09 | 0.08 | 0.90 | 1.00 | 1.00 |
| | | .75 | −0.01 | −0.01 | 0.01 | −0.53 | −0.18 | 0.08 | 0.10 | 0.87 | 1.00 | 1.00 |
| | β61 = 1.5 | .25 | 0.01 | 0.01 | 0.06 | −0.54 | −0.21 | 0.10 | 0.09 | 0.93 | 1.00 | 1.00 |
| | | .5 | 0.01 | 0.01 | 0.03 | −0.62 | −0.21 | 0.08 | 0.08 | 0.93 | 1.00 | 1.00 |
| | | .75 | −0.00 | 0.00 | 0.03 | −0.71 | −0.21 | 0.07 | 0.09 | 0.94 | 1.00 | 1.00 |
Note: Each β has three rows corresponding to τ = .25, .5, .75 from the top to bottom; EmpSD, empirical standard deviation; SE, average standard error; Cov, coverage probability; Oracle, Oracle estimator; Z18, Zheng et al. (2018); F14, Fan et al. (2014); W12, Wang et al. (2012); M09, Meinshausen et al. (2009).
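The summary measures defined in the note can be computed from replicated estimates along the following lines. This is a sketch only: it assumes a nominal level of 0.05 and defines power as the rejection rate of a two-sided Wald test of a zero effect, neither of which is stated explicitly in the text, and the function name is illustrative.

```python
import numpy as np
from statistics import NormalDist

def summarize_replications(est, se, truth, alpha=0.05):
    """Bias, EmpSD, average SE, coverage, and power over simulation replicates."""
    est = np.asarray(est, dtype=float)
    se = np.asarray(se, dtype=float)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    covered = np.abs(est - truth) <= z * se   # CI contains the true value
    rejected = np.abs(est) > z * se           # Wald test of a zero effect
    return {
        "Bias": est.mean() - truth,
        "EmpSD": est.std(ddof=1),
        "SE": se.mean(),
        "Cov": covered.mean(),
        "Power": rejected.mean(),
    }
```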
Table 2 also concerns the power for detection of signals. Since W12, F14, and Z18 cannot draw inference and, in general, there is lack of literature that deals with inference for HDCQR, we compare our method with the aggregated p-value approach (M09) in the quantile setting, though M09 originated from linear regression. The results indicate that Fused-HDCQR outperforms M09, and presents adequate testing power when the effect size is moderate or large.
Table 3 summarizes the results from Example 2 with the heterogeneous effect β4 varying with τ. We compare the estimation accuracy between Fused-HDCQR and Z18, as well as the statistical power between Fused-HDCQR and M09. Again, Fused-HDCQR presents smaller biases than Z18 and a higher power than M09. To assess whether the tuning parameters selected as in Remark 3 help the variable selection method (L-HDCQR) used by Fused-HDCQR satisfy assumption (A4) in Section 3, we report the selection frequency of each signal variable in Table 3 (and also in Table 4), and observe that the selection frequency increases as the sample size increases, hinting that assumption (A4) may be satisfied with these selected tuning parameters.
Table 3:
Results of Example 2 based on the simulated datasets.
| (n, p) | Parameter | τ | Bias: Oracle | Bias: Fused | Bias: Z18 | EmpSD | SE | Cov | Freq | Power: Fused | Power: M09 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n = 300, p = 1000 | β4 = 1.5Qε(τ) | .25 | 0.01 | 0.13 | 0.29 | 0.32 | 0.31 | 0.88 | 0.73 | 0.82 | 0.16 |
| | | .5 | −0.05 | −0.07 | 0.06 | 0.33 | 0.29 | 0.90 | | 0.11 | 0.00 |
| | | .75 | 0.01 | −0.14 | −0.05 | 0.31 | 0.34 | 0.82 | | 0.62 | 0.10 |
| | β21 = 1 | .25 | −0.01 | −0.01 | −0.01 | 0.14 | 0.13 | 0.90 | 0.69 | 1.00 | 0.88 |
| | | .5 | −0.03 | −0.01 | −0.05 | 0.12 | 0.12 | 0.91 | | 1.00 | 0.92 |
| | | .75 | −0.01 | −0.00 | −0.02 | 0.14 | 0.13 | 0.92 | | 1.00 | 0.84 |
| | β41 = 1.5 | .25 | 0.01 | 0.01 | 0.03 | 0.13 | 0.13 | 0.90 | 0.99 | 1.00 | 1.00 |
| | | .5 | −0.01 | 0.01 | 0.03 | 0.12 | 0.13 | 0.93 | | 1.00 | 1.00 |
| | | .75 | −0.00 | 0.02 | −0.02 | 0.13 | 0.14 | 0.93 | | 1.00 | 1.00 |
| | β61 = 2 | .25 | −0.03 | −0.03 | 0.04 | 0.13 | 0.13 | 0.91 | 1.00 | 1.00 | 1.00 |
| | | .5 | −0.03 | −0.02 | 0.03 | 0.11 | 0.13 | 0.92 | | 1.00 | 1.00 |
| | | .75 | −0.01 | −0.01 | −0.00 | 0.12 | 0.15 | 0.95 | | 1.00 | 1.00 |
| n = 700, p = 1000 | β4 = 1.5Qε(τ) | .25 | 0.03 | 0.08 | 0.19 | 0.19 | 0.21 | 0.92 | 0.89 | 0.99 | 0.61 |
| | | .5 | 0.02 | 0.03 | 0.14 | 0.18 | 0.19 | 0.89 | | 0.11 | 0.00 |
| | | .75 | 0.04 | −0.03 | −0.01 | 0.21 | 0.23 | 0.92 | | 0.97 | 0.56 |
| | β21 = 1 | .25 | 0.01 | 0.01 | 0.05 | 0.09 | 0.08 | 0.94 | 0.99 | 1.00 | 1.00 |
| | | .5 | 0.01 | 0.01 | 0.01 | 0.08 | 0.08 | 0.87 | | 1.00 | 1.00 |
| | | .75 | 0.01 | 0.01 | 0.05 | 0.10 | 0.09 | 0.89 | | 1.00 | 1.00 |
| | β41 = 1.5 | .25 | −0.01 | 0.00 | 0.08 | 0.08 | 0.08 | 0.94 | 1.00 | 1.00 | 1.00 |
| | | .5 | −0.00 | 0.00 | 0.05 | 0.09 | 0.08 | 0.92 | | 1.00 | 1.00 |
| | | .75 | 0.00 | 0.01 | 0.04 | 0.09 | 0.09 | 0.95 | | 1.00 | 1.00 |
| | β61 = 2 | .25 | −0.01 | −0.01 | 0.10 | 0.08 | 0.09 | 0.93 | 1.00 | 1.00 | 1.00 |
| | | .5 | −0.01 | −0.01 | 0.06 | 0.08 | 0.09 | 0.91 | | 1.00 | 1.00 |
| | | .75 | −0.00 | −0.00 | 0.07 | 0.09 | 0.10 | 0.90 | | 1.00 | 1.00 |
Note: See the footnote of Table 2; Freq, average selection frequency in B splits.
Table 4:
Results of Example 3 based on the simulated datasets.
| | Bias (Oracle) | Bias (Fused) | Bias (Z18) | EmpSD (Fused) | SE (Fused) | Cov (Fused) | Freq | Power (Fused) | Power (M09) |
|---|---|---|---|---|---|---|---|---|---|
| n = 300, p = 1000 | | | | | | | | | |
| | −0.05 | 0.06 | 0.59 | 0.34 | 0.36 | 0.94 | | 0.06 | 0.00 |
| β2 = φ1(τ) | 0.11 | 0.37 | 1.01 | 0.52 | 0.51 | 0.89 | 0.71 | 0.20 | 0.00 |
| | 0.04 | −0.20 | −0.05 | 0.80 | 0.72 | 0.89 | | 0.87 | 0.06 |
| | 0.08 | 0.14 | 0.27 | 0.65 | 0.50 | 0.90 | | 0.77 | 0.36 |
| β11 = φ10(τ) | 0.10 | −0.20 | −0.36 | 0.62 | 0.51 | 0.91 | 0.67 | 0.19 | 0.00 |
| | 0.16 | 0.06 | −0.03 | 0.56 | 0.52 | 0.90 | | 0.10 | 0.00 |
| β21 = 1.5 | 0.03 | 0.03 | 0.04 | 0.25 | 0.23 | 0.95 | 0.65 | 1.00 | 0.77 |
| β41 = 2 | 0.00 | −0.00 | 0.02 | 0.23 | 0.25 | 0.93 | 0.93 | 1.00 | 0.99 |
| β61 = 2.5 | 0.09 | 0.07 | 0.19 | 0.21 | 0.26 | 0.94 | 0.99 | 1.00 | 1.00 |
| n = 700, p = 1000 | | | | | | | | | |
| | 0.02 | 0.04 | 0.27 | 0.21 | 0.23 | 0.94 | | 0.06 | 0.00 |
| β2 = φ1(τ) | 0.17 | 0.30 | 0.79 | 0.37 | 0.40 | 0.88 | 0.96 | 0.27 | 0.01 |
| | 0.15 | 0.08 | 0.35 | 0.51 | 0.51 | 0.90 | | 1.00 | 0.77 |
| | 0.07 | 0.09 | 0.18 | 0.33 | 0.33 | 0.91 | | 0.99 | 0.92 |
| β11 = φ10(τ) | −0.01 | −0.19 | −0.23 | 0.35 | 0.34 | 0.85 | 0.92 | 0.21 | 0.00 |
| | −0.00 | −0.04 | −0.08 | 0.37 | 0.31 | 0.94 | | 0.06 | 0.00 |
| β21 = 1.5 | −0.00 | 0.00 | 0.04 | 0.16 | 0.17 | 0.97 | 0.98 | 1.00 | 1.00 |
| β41 = 2 | −0.03 | −0.02 | −0.01 | 0.15 | 0.18 | 0.95 | 1.00 | 1.00 | 1.00 |
| β61 = 2.5 | 0.00 | 0.00 | 0.07 | 0.18 | 0.18 | 0.94 | 1.00 | 1.00 | 1.00 |
| n = 700, p = 2000 | | | | | | | | | |
| | 0.05 | 0.11 | 0.13 | 0.32 | 0.32 | 0.93 | | 0.07 | 0.00 |
| β2 = φ1(τ) | 0.09 | 0.34 | 0.87 | 0.46 | 0.44 | 0.91 | 0.93 | 0.09 | 0.02 |
| | 0.25 | 0.36 | 1.77 | 0.53 | 0.46 | 0.87 | | 0.74 | 0.58 |
| | 0.13 | 0.25 | 0.73 | 0.45 | 0.35 | 0.84 | | 1.00 | 0.83 |
| β11 = φ10(τ) | 0.09 | −0.02 | 0.56 | 0.41 | 0.36 | 0.89 | 0.90 | 0.76 | 0.01 |
| | −0.04 | −0.30 | −0.13 | 0.36 | 0.34 | 0.85 | | 0.15 | 0.00 |
| β21 = 1.5 | 0.01 | 0.01 | 0.03 | 0.18 | 0.21 | 0.98 | 0.98 | 1.00 | 1.00 |
| β41 = 2 | 0.01 | 0.03 | −0.07 | 0.22 | 0.20 | 0.91 | 0.99 | 1.00 | 0.98 |
| β61 = 2.5 | −0.02 | −0.01 | −0.05 | 0.25 | 0.20 | 0.94 | 1.00 | 1.00 | 0.98 |
Table 4 summarizes the results based on Example 3. For the two heterogeneous effects β2 and β11 that vary with τ, the estimation biases of Fused-HDCQR generally become smaller and the estimated SEs move closer to the empirical ones as n increases. Figure 1 shows that the Fused-HDCQR estimates agree with the oracle estimates and the truth, except at the change points, and have narrower confidence intervals with a larger n.
Finally, we compare the computational intensity of Z18, M09, W12, F14, and Fused-HDCQR under Example 1 and report in Table 5 the average computing time per dataset. Our method is the most computationally intensive, because it involves multiple data-splittings and draws inferences on all of the p coefficients. However, by utilizing parallel computing, we have managed to reduce the computational time to the same order as that of Z18, W12, and F14, which are based on penalized regression.
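The reason parallel computing helps is that the B sample splits are mutually independent, so each can be processed by a separate worker and the per-split estimates averaged afterwards. The sketch below illustrates only this structure. It is not the authors' implementation: ordinary least squares stands in for the low dimensional censored quantile regression fit, a simple marginal-covariance ranking stands in for L-HDCQR, censoring is ignored, and variance estimation is omitted. All function and parameter names (`one_split`, `fused_estimate`, `k`, `workers`) are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def one_split(X, y, k, seed):
    """One random half-split: select on one half, refit on the other."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.permutation(n)
    sel, est = idx[: n // 2], idx[n // 2:]

    # Stand-in selection step (assumption): keep the k predictors with the
    # largest absolute marginal covariance with y on the selection half.
    Xs = X[sel] - X[sel].mean(axis=0)
    score = np.abs(Xs.T @ (y[sel] - y[sel].mean()))
    S = np.argsort(score)[-k:]

    beta = np.empty(p)
    for j in range(p):
        # Low dimensional refit on the other half using S together with j.
        cols = np.union1d(S, [j])  # sorted, unique column indices
        design = np.column_stack([np.ones(est.size), X[est][:, cols]])
        coef, *_ = np.linalg.lstsq(design, y[est], rcond=None)
        # Position 0 is the intercept, so shift by one to locate predictor j.
        beta[j] = coef[1 + np.searchsorted(cols, j)]
    return beta

def fused_estimate(X, y, B=20, k=3, workers=4, seed=0):
    """Average the per-split estimates; the B splits run concurrently."""
    seeds = np.random.SeedSequence(seed).spawn(B)  # independent streams
    with ThreadPoolExecutor(max_workers=workers) as pool:
        betas = list(pool.map(lambda s: one_split(X, y, k, s), seeds))
    return np.mean(betas, axis=0)
```

Because each split draws from its own seeded random stream, the averaged estimate does not depend on the number of workers or on the order in which splits finish.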
6. Application to the Boston Lung Cancer Survivor Cohort (BLCSC)
Detection of molecular profiles related to cancer patients’ survival can aid personalized treatment, leading to prolonged survival and improved quality of life. In a subset of BLCSC samples, 674 lung cancer patients were measured with survival times, along with 40,000 SNPs and clinical indicators, such as lung cancer subtypes (adenocarcinoma, squamous cell carcinoma, or others), cancer stages (1–4), age, gender, education level (≤ high school or > high school) and smoking status (active or non-active smokers); see Table 6 for patients’ characteristics. The censoring rate was 23% and a total of 518 deaths were observed during the follow-up period, with the observed follow-up time varying from 13 to 8,584 days.
Table 6:
Patients’ characteristics in the BLCSC samples.
| | | (n = 674) |
|---|---|---|
| Mean (SD) | | |
| Age | | 60 (10.8) |
| Count (%) | | |
| Female | | 259 (38) |
| Education level | ≤ High school | 264 (39) |
| | > High school | 410 (61) |
| Smoking | Non-active | 418 (62) |
| | Active | 256 (38) |
| Cancer type | Adenocarcinoma | 283 (42) |
| | Squamous cell | 110 (16) |
| | Other | 281 (42) |
| Cancer stage | 1 | 283 (42) |
| | 2 | 110 (16) |
| | 3 | 256 (38) |
| | 4 | 25 (4) |
We could have included all 40,000 SNPs in our analysis. However, to gain statistical power, we opt for the targeted gene approach by focusing on 2,002 SNPs residing in 14 genes identified to be cancer related, namely, ALK, BRAF, BRCA1, EGFR, ERBB2, ERCC1, KRAS, MET, PIK3CA, RET, ROS1, RRM1, TP53, and TYMS (Brose et al., 2002; Toyooka et al., 2003; Paez et al., 2004; Soda et al., 2007). Pinpointing the effects of individual loci within the targeted genes is helpful for understanding disease mechanisms (Evans et al., 2011; D’Antonio et al., 2019) and designing gene therapies (Pâques and Duchateau, 2007; Hanawa et al., 2004). We also adjust for patients’ clinical and environmental characteristics listed in Table 6, which gives a total of p = 2,011 predictors.
We apply Fused-HDCQR to compute the coefficient estimates (3) and variance estimates (6). We set the quantile interval to be [0.2, 0.7], which is wide enough to cover both the high risk and low risk groups while ensuring that the quantile parameters are estimable in the presence of censoring (Zheng et al., 2015). We choose the lower bound τ0 = ν = 0.1 to circumvent the singularity problem with CQR at τ = 0, because few (< 2%) observations are censored below the ν-th quantile. With ϵn = 0.01, we form the τ-grid Γm of length m = 61. We set B = 750 as the number of re-samples, which is sufficiently large based on our numerical experience. To determine the tuning parameter λn in L-HDCQR for selection, we use 5-fold cross-validation as specified in Remark 3.
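The grid length can be checked with a short calculation. The sketch below assumes that the grid is equally spaced with step 0.01 and runs from τ0 = 0.1 to the upper end of the quantile interval, 0.7; the upper endpoint and the helper name `tau_grid` are assumptions made for this example.

```python
import numpy as np

def tau_grid(tau_0, tau_u, step):
    """Equally spaced quantile grid from tau_0 to tau_u, both included."""
    # Round before converting so floating-point error cannot drop a point.
    m = int(round((tau_u - tau_0) / step)) + 1
    return np.linspace(tau_0, tau_u, m)

grid = tau_grid(0.1, 0.7, 0.01)
print(len(grid))  # 61

# Under these assumptions the six reported levels 0.2, 0.3, ..., 0.7 sit at
# every tenth grid point starting from index 10.
reported = grid[10::10]
```

Under these assumptions the first ten grid points (τ from 0.10 to 0.19) lie below the interval of interest and are used only to start the estimation away from τ = 0.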
For ease of presentation, we summarize the results evaluated at 6 quantile levels, τ = 0.2, 0.3, …, 0.7, instead of the whole grid Γm. To highlight the findings for the high risk group, we rank all SNPs based on their p-values at τ = 0.2. After Bonferroni correction for multiple testing, there are 83 significant SNPs at an overall type I error of α = 0.05. Our method estimates the coefficients and the p-values for all predictors, but we only present the results for the patient characteristics, the top 10 significant SNPs, and the 3 least significant SNPs in Figure 2 and Table 7. The estimated coefficient of active smoking drops from −0.42 (p = 0.0011) to −0.53 (p = 0.0005) as τ changes from 0.2 to 0.5, and then increases to −0.31 (p = 0.038) as τ changes to 0.7, suggesting that active smoking might be more harmful to the high or median risk groups than to the low risk group of patients. The most significant SNP at τ = 0.2 is AX.37793583 T, which remains significant throughout τ = 0.2 to τ = 0.7. However, its estimated coefficient decreases from 2.75 (τ = 0.2) to 1.39 (τ = 0.7), indicating a heterogeneous impact on survival, i.e. a stronger protective effect at lower quantiles and a weaker one at higher quantiles.
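To make the testing step concrete, the following sketch computes a two-sided p-value from an estimate and its standard error using the normal approximation, and then applies a Bonferroni threshold. It assumes that the reported p-values are Wald-type p-values based on the asymptotic normality of the estimator and that the Bonferroni divisor is the number of p-values passed in; neither detail is spelled out in this section, and the function names are hypothetical.

```python
import math

def wald_p_value(estimate, se):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = abs(estimate / se)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Active smoking at tau = 0.2, using the rounded entries from Table 7.
p_smoke = wald_p_value(-0.42, 0.13)
```

With the rounded table entries this gives a p-value of roughly 0.0012, close to the reported 0.0011; the small gap is consistent with rounding of the estimate and standard error.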
Figure 2:

Estimated quantile-specific coefficients of the predictors in Table 7.
Table 7:
Analysis of the BLCSC data with Fused-HDCQR. The SNPs are sorted by their p-values at τ = 0.2, corresponding to the high risk group. Results for the top 10 and the bottom 3 are presented.
| | Estimator | SE | p-value | Estimator | SE | p-value | Estimator | SE | p-value |
|---|---|---|---|---|---|---|---|---|---|
| τ | 0.2 | | | 0.3 | | | 0.4 | | |
| Int | 6.90 | 0.25 | 1.4E–165 | 7.48 | 0.28 | 4.3E–157 | 7.94 | 0.24 | 3.2E–241 |
| Adeno | 0.20 | 0.16 | 2.1E–01 | 0.14 | 0.18 | 4.5E–01 | 0.02 | 0.13 | 8.7E–01 |
| Squamous | −0.16 | 0.16 | 3.0E–01 | −0.20 | 0.16 | 2.1E–01 | −0.34 | 0.13 | 1.0E–02 |
| Stage2 | −0.82 | 0.24 | 6.3E–04 | −0.99 | 0.25 | 6.0E–05 | −0.98 | 0.24 | 3.2E–05 |
| Stage3 | −0.97 | 0.17 | 1.6E–08 | −1.04 | 0.20 | 2.0E–07 | −1.13 | 0.14 | 2.0E–15 |
| Stage4 | −1.54 | 0.17 | 3.0E–20 | −1.77 | 0.20 | 1.7E–19 | −1.86 | 0.14 | 2.2E–42 |
| Age | −0.01 | 0.01 | 1.5E–02 | −0.01 | 0.01 | 3.0E–02 | −0.02 | 0.01 | 1.0E–02 |
| Edu | 0.08 | 0.14 | 6.0E–01 | 0.06 | 0.15 | 6.9E–01 | 0.07 | 0.13 | 5.8E–01 |
| Female | −0.30 | 0.13 | 2.2E–02 | −0.35 | 0.14 | 1.0E–02 | −0.37 | 0.12 | 1.6E–03 |
| Smoke | −0.42 | 0.13 | 1.1E–03 | −0.48 | 0.14 | 5.0E–04 | −0.52 | 0.11 | 3.4E–06 |
| AX.37793583 T | 2.75 | 0.22 | 3.0E–36 | 2.61 | 0.20 | 4.6E–39 | 2.39 | 0.20 | 3.7E–33 |
| AX.83104700 A | 2.32 | 0.20 | 4.0E–31 | 1.91 | 0.19 | 6.3E–24 | 1.54 | 0.19 | 1.5E–15 |
| AX.15207405 G | 2.03 | 0.20 | 1.0E–24 | 1.59 | 0.22 | 9.8E–13 | 1.17 | 0.21 | 3.7E–08 |
| AX.16619495 T | 1.79 | 0.20 | 3.3E–19 | 1.36 | 0.20 | 1.3E–11 | 0.97 | 0.20 | 1.2E–06 |
| AX.13920550 G | 1.93 | 0.23 | 2.5E–17 | 1.41 | 0.28 | 5.3E–07 | 0.87 | 0.27 | 1.6E–03 |
| AX.83444620 C | 1.39 | 0.17 | 7.4E–17 | 1.05 | 0.19 | 6.6E–08 | 0.71 | 0.21 | 8.8E–04 |
| AX.82902859 T | 1.58 | 0.20 | 8.7E–16 | 1.19 | 0.18 | 2.0E–11 | 0.90 | 0.12 | 3.4E–14 |
| AX.40182999 A | 1.50 | 0.21 | 9.6E–13 | 1.01 | 0.25 | 3.9E–05 | 0.64 | 0.14 | 6.5E–06 |
| AX.82976133 A | 2.32 | 0.33 | 3.8E–12 | 2.02 | 0.35 | 6.7E–09 | 1.58 | 0.35 | 6.1E–06 |
| AX.82900605 G | 2.21 | 0.35 | 1.6E–10 | 1.91 | 0.29 | 9.1E–11 | 1.54 | 0.33 | 2.9E–06 |
| ... | | | | | | | | | |
| AX.41828883 G | 1.4E–03 | 0.34 | 1.00 | −3.2E–02 | 0.42 | 0.94 | −5.7E–02 | 0.54 | 0.92 |
| AX.11293250 T | −3.6E–04 | 0.14 | 1.00 | 6.2E–02 | 0.15 | 0.67 | 5.0E–02 | 0.12 | 0.68 |
| AX.37863475 C | −3.1E–04 | 0.26 | 1.00 | −1.1E–01 | 0.25 | 0.68 | −1.8E–01 | 0.24 | 0.46 |
| τ | 0.5 | | | 0.6 | | | 0.7 | | |
| Int | 8.30 | 0.27 | 4.8E–214 | 8.55 | 0.30 | 4.9E–180 | 8.69 | 0.35 | 2.8E–132 |
| Adeno | −0.09 | 0.15 | 5.3E–01 | −0.09 | 0.13 | 4.8E–01 | −0.09 | 0.13 | 5.1E–01 |
| Squamous | −0.50 | 0.15 | 1.0E–03 | −0.60 | 0.16 | 2.1E–04 | −0.50 | 0.19 | 7.1E–03 |
| Stage2 | −0.88 | 0.25 | 5.0E–04 | −0.73 | 0.24 | 2.1E–03 | −0.57 | 0.19 | 2.8E–03 |
| Stage3 | −1.08 | 0.17 | 1.7E–10 | −0.91 | 0.15 | 6.4E–10 | −0.68 | 0.16 | 2.0E–05 |
| Stage4 | −1.91 | 0.15 | 7.0E–38 | −1.93 | 0.14 | 1.7E–44 | −1.69 | 0.16 | 2.1E–27 |
| Age | −0.02 | 0.00 | 3.3E–05 | −0.02 | 0.01 | 1.3E–03 | −0.02 | 0.01 | 1.9E–03 |
| Edu | 0.15 | 0.14 | 2.7E–01 | 0.16 | 0.13 | 2.2E–01 | 0.11 | 0.13 | 4.0E–01 |
| Female | −0.44 | 0.11 | 6.4E–05 | −0.47 | 0.12 | 1.6E–04 | −0.38 | 0.15 | 1.3E–02 |
| Smoke | −0.53 | 0.15 | 4.9E–04 | −0.36 | 0.16 | 2.4E–02 | −0.31 | 0.15 | 3.8E–02 |
| AX.37793583 T | 2.16 | 0.20 | 4.1E–28 | 1.84 | 0.28 | 2.8E–11 | 1.39 | 0.25 | 4.2E–08 |
| AX.83104700 A | 1.15 | 0.27 | 1.6E–05 | 0.58 | 0.27 | 3.5E–02 | 0.13 | 0.25 | 6.0E–01 |
| AX.15207405 G | 0.75 | 0.25 | 2.3E–03 | 0.34 | 0.37 | 3.5E–01 | −0.05 | 0.48 | 9.2E–01 |
| AX.16619495 T | 0.66 | 0.22 | 3.1E–03 | 0.44 | 0.31 | 1.5E–01 | 0.18 | 0.35 | 6.1E–01 |
| AX.13920550 G | 0.54 | 0.27 | 4.3E–02 | 0.26 | 0.60 | 6.7E–01 | 0.11 | 0.60 | 8.6E–01 |
| AX.83444620 C | 0.55 | 0.23 | 2.0E–02 | 0.29 | 0.22 | 1.8E–01 | 0.01 | 0.18 | 9.7E–01 |
| AX.82902859 T | 0.73 | 0.13 | 4.2E–08 | 0.51 | 0.32 | 1.1E–01 | 0.22 | 0.46 | 6.3E–01 |
| AX.40182999 A | 0.41 | 0.18 | 2.6E–02 | 0.22 | 0.27 | 4.1E–01 | −0.01 | 0.30 | 9.6E–01 |
| AX.82976133 A | 1.17 | 0.42 | 5.4E–03 | 0.61 | 0.52 | 2.4E–01 | 0.24 | 0.46 | 6.0E–01 |
| AX.82900605 G | 1.22 | 0.35 | 4.5E–04 | 0.86 | 0.34 | 1.1E–02 | 0.50 | 0.31 | 1.0E–01 |
| ... | | | | | | | | | |
| AX.41828883 G | 0.26 | 0.60 | 0.66 | 0.32 | 0.52 | 0.54 | 0.12 | 0.68 | 0.86 |
| AX.11293250 T | −0.00 | 0.12 | 1.00 | −0.09 | 0.12 | 0.44 | −0.09 | 0.15 | 0.56 |
| AX.37863475 C | −0.24 | 0.20 | 0.23 | −0.37 | 0.17 | 0.03 | −0.57 | 0.32 | 0.08 |
The effects of some SNPs are nearly zero at higher quantiles. For example, the estimated coefficient of AX.15207405 G decreases from 2.03 (τ = 0.2; p = 1.0 × 10⁻²⁴) to −0.05 (τ = 0.7; p = 0.92), with the estimated standard error increasing from 0.20 to 0.48. Similarly, the estimated coefficient of AX.40182999 A decreases from 1.50 (τ = 0.2; p = 9.6 × 10⁻¹³) to −0.01 (τ = 0.7; p = 0.96). The results again hint at heterogeneous SNP effects in various risk groups, which cannot be detected using traditional Cox models.
Finally, our results shed light on the roles of SNPs in the high risk group (i.e. lower quantiles). Specifically, we map the 83 SNPs with significant effects at the 0.2-th quantile by Fused-HDCQR to the corresponding genes and rank the genes by the number of significant SNPs (with the total number of SNPs for each gene as the denominator in parentheses): TP53 (14/321), RRM1 (14/174), ERCC1 (10/167), BRCA1 (10/114), ALK (8/163), ROS1 (5/294), EGFR (5/261), ERBB2 (4/167), and 6 other genes with fewer than 4 significant SNPs each. While these genes were reported to be associated with lung cancer (Toyooka et al., 2003; Takeuchi et al., 2012; Rosell et al., 2011; Lord et al., 2002; Zheng et al., 2007; Sasaki et al., 2006; Brose et al., 2002), our analysis provides more detailed information as to which SNPs and locations of the genes are jointly associated with lung cancer survival, as well as the estimated effects and uncertainties. Analysis of heterogeneous SNP effects has been gaining attention in lung cancer research (McKay et al., 2017; Dong et al., 2012; Huang et al., 2009) and beyond (Garcia-Closas et al., 2008; Cheng et al., 2010; Gulati et al., 2014).
7. Conclusions
Our proposed procedure involves repeated estimates from low dimensional CQRs, which are computationally straightforward and can be efficiently implemented with parallel computing. We require the variable selection to possess a sure screening property as in condition (A4). This seems to be supported by our simulations, which find that our procedure works well when the variable selection method can select a superset of the true model with high probability. Our condition is much weaker than the stringent condition of selection consistency specified in Fei et al. (2019).
With regard to the selection of B, we recommend that B be of the same order as the sample size n. A smaller B might not affect coefficient estimation much, but it would yield biased standard errors for inference. In addition, we opt to define Γm by setting the grid as n/log p equally spaced points between τ0 and τU. This may cover the quantile interval well, with reasonable computational efficiency.
There are open questions left to be addressed. First, substantial work is needed when predictors are highly correlated as the performance of our method, like the other competing methods, deteriorates when correlations among predictors become stronger. Second, it is of interest to investigate an alternative method when the sparsity condition fails. For example, it is challenging to find an effective strategy to draw inference when a non-negligible portion of predictors have small but non-zero effects. We will pursue them elsewhere.
Acknowledgements
We are deeply grateful to the Editor, the AE and the two referees for their constructive comments and suggestions that have helped improve the manuscript substantially. We thank our long-time collaborator, David Christiani, Harvard Medical School, for providing the Boston Lung Cancer Survivor Cohort data. The work is partially supported by a grant from NIH and NSF.
References
- Belloni A and Chernozhukov V (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39(1), 82–130.
- Belloni A, Chernozhukov V, and Kato K (2019). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association 114(526), 749–758.
- Berk R, Brown L, Buja A, Zhang K, and Zhao L (2013). Valid post-selection inference. The Annals of Statistics 41(2), 802–837.
- Brose MS, Volpe P, Feldman M, Kumar M, Rishi I, Gerrero R, et al. (2002). BRAF and RAS mutations in human lung cancer and melanoma. Cancer research 62(23), 6997–7000.
- Bühlmann P, Kalisch M, and Meier L (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application 1, 255–278.
- Cheng I, Plummer SJ, Neslund-Dudas C, Klein EA, Casey G, Rybicki BA, and Witte JS (2010). Prostate cancer susceptibility variants confer increased risk of disease progression. Cancer Epidemiology and Prevention Biomarkers 19(9), 2124–2132.
- Christiani DC (2017). The Boston lung cancer survival cohort. http://grantome.com/grant/NIH/U01-CA209414-01A1. [Online; accessed November 27, 2018].
- D’Antonio M, Reyna J, Jakubosky D, Donovan MK, Bonder M-J, et al. (2019). Systematic genetic analysis of the MHC region reveals mechanistic underpinnings of HLA type associations with disease. eLife 8, e48476.
- Dong J, Hu Z, Shu Y, Pan S, Chen W, Wang Y, et al. (2012). Potentially functional polymorphisms in DNA repair genes and non-small-cell lung cancer survival: A pathway-based analysis. Molecular carcinogenesis 51(7), 546–552.
- Efron B (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association 109(507), 991–1007.
- Evans DM, Spencer CC, Pointon JJ, Su Z, Harvey D, Kochan G, et al. (2011). Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nature genetics 43(8), 761–767.
- Fan J, Fan Y, and Barut E (2014). Adaptive robust variable selection. The Annals of Statistics 42(1), 324–351.
- Fan J, Samworth R, and Wu Y (2009). Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research 10, 2013–2038.
- Fang EX, Ning Y, and Liu H (2017). Testing and confidence intervals for high dimensional proportional hazards models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(5), 1415–1437.
- Fei Z, Zhu J, Banerjee M, and Li Y (2019). Drawing inferences for high-dimensional linear models: A selection-assisted partial regression and smoothing approach. Biometrics 75(2), 551–561.
- Fleming TR and Harrington DP (2011). Counting Processes and Survival Analysis, Volume 169. John Wiley & Sons.
- Garcia-Closas M, Hall P, Nevanlinna H, Pooley K, Morrison J, Richesson DA, et al. (2008). Heterogeneity of breast cancer associations with five susceptibility loci by clinical and pathological characteristics. PLoS genetics 4(4), e1000054.
- Gulati S, Martinez P, Joshi T, Birkbak NJ, Santos CR, Rowan AJ, et al. (2014). Systematic evaluation of the prognostic impact and intratumour heterogeneity of clear cell renal cell carcinoma biomarkers. European urology 66(5), 936–948.
- Hanawa H, Hargrove PW, Kepes S, Srivastava DK, Nienhuis AW, and Persons DA (2004). Extended β-globin locus control region elements promote consistent therapeutic expression of a γ-globin lentiviral vector in murine β-thalassemia. Blood 104(8), 2281–2290.
- He X, Wang L, and Hong HG (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics 41(1), 342–369.
- Ho DSW, Schierding W, Wake M, Saffery R, and O’Sullivan J (2019). Machine learning SNP based prediction for precision medicine. Frontiers in Genetics 10, 267.
- Hong HG, Christiani DC, and Li Y (2019). Quantile regression for survival data in modern cancer research: expanding statistical tools for precision medicine. Precision clinical medicine 2(2), 90–99.
- Huang Y-T, Heist RS, Chirieac LR, Lin X, Skaug V, Zienolddiny S, et al. (2009). Genome-wide analysis of survival in early-stage non–small-cell lung cancer. Journal of clinical oncology 27(16), 2660–2667.
- Javanmard A and Montanari A (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15(1), 2869–2909.
- Kelley MJ, Li S, and Harpole DH (2001). Genetic analysis of the β-tubulin gene, TUBB, in non-small-cell lung cancer. Journal of the National Cancer Institute 93(24), 1886–1888.
- Koenker R and Bassett G Jr (1978). Regression quantiles. Econometrica: Journal of the Econometric Society 46(1), 33–50.
- Kong S, Yu Z, Zhang X, and Cheng G (2018). High dimensional robust inference for Cox regression models. arXiv preprint arXiv:1811.00535.
- Korpanty GJ, Graham DM, Vincent MD, and Leighl NB (2014). Biomarkers that currently affect clinical practice in lung cancer: EGFR, ALK, MET, ROS-1, and KRAS. Frontiers in oncology 4, 204.
- Lee JD, Sun DL, Sun Y, and Taylor JE (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44(3), 907–927.
- Lord RV, Brabender J, Gandara D, Alberola V, Camps C, Domine M, et al. (2002). Low ERCC1 expression correlates with prolonged survival after cisplatin plus gemcitabine chemotherapy in non-small cell lung cancer. Clinical Cancer Research 8(7), 2286–2291.
- McKay JD, Hung RJ, Han Y, Zong X, Carreras-Torres R, Christiani DC, et al. (2017). Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nature genetics 49(7), 1126–1132.
- Meinshausen N, Meier L, and Bühlmann P (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104(488), 1671–1681.
- Moon C, Oh Y, and Roth JA (2003). Current status of gene therapy for lung cancer and head and neck cancer. Clinical cancer research 9(14), 5055–5067.
- Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics 45(1), 158–195.
- Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, et al. (2004). EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 304(5676), 1497–1500.
- Pâques F and Duchateau P (2007). Meganucleases and DNA double-strand break-induced recombination: perspectives for gene therapy. Current gene therapy 7(1), 49–66.
- Peng L and Huang Y (2008). Survival analysis with quantile regression models. Journal of the American Statistical Association 103(482), 637–649.
- Portnoy S (2003). Censored regression quantiles. Journal of the American Statistical Association 98(464), 1001–1012.
- Powell JL (1986). Censored regression quantiles. Journal of econometrics 32(1), 143–155.
- Risch A and Plass C (2008). Lung cancer epigenetics and genetics. International Journal of Cancer 123(1), 1–7.
- Rosell R, Molina MA, Costa C, Simonetti S, Gimenez-Capitan A, Bertran-Alamillo J, et al. (2011). Pretreatment EGFR T790M mutation and BRCA1 mRNA expression in erlotinib-treated advanced non–small-cell lung cancer patients with EGFR mutations. Clinical Cancer Research 17(5), 1160–1168.
- Sasaki H, Shimizu S, Endo K, Takada M, Kawahara M, Tanaka H, et al. (2006). EGFR and erbB2 mutation status in Japanese lung cancer patients. International Journal of Cancer 118(1), 180–184.
- Shows JH, Lu W, and Zhang HH (2010). Sparse estimation and inference for censored median regression. Journal of Statistical Planning and Inference 140(7), 1903–1917.
- Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, et al. (2007). Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature 448(7153), 561–566.
- Takeuchi K, Soda M, Togashi Y, Suzuki R, Sakata S, Hatano S, et al. (2012). RET, ROS1 and ALK fusions in lung cancer. Nature Medicine 18(3), 378–381.
- Toyooka S, Tsuda T, and Gazdar AF (2003). The TP53 gene, tobacco exposure, and lung cancer. Human Mutation 21(3), 229–239.
- Van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42(3), 1166–1202.
- Wager S and Athey S (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
- Wager S, Hastie T, and Efron B (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research 15(1), 1625–1651.
- Wang L, Wu Y, and Li R (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association 107(497), 214–222.
- Wei L-J (1992). The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in medicine 11(14–15), 1871–1879.
- Yamamoto H, Shigematsu H, Nomura M, Lockwood WW, Sato M, Okumura N, et al. (2008). PIK3CA mutations and copy number gains in human lung cancers. Cancer research 68(17), 6913–6921.
- Zhang C-H and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 217–242.
- Zhao P and Yu B (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7(Nov), 2541–2563.
- Zhao SD and Li Y (2012). Principled sure independence screening for Cox models with ultra-high-dimensional covariates. Journal of Multivariate Analysis 105(1), 397–411.
- Zheng Q, Gallagher C, and Kulasekera K (2013). Adaptive penalized quantile regression for high dimensional data. Journal of Statistical Planning and Inference 143(6), 1029–1038.
- Zheng Q, Peng L, and He X (2015). Globally adaptive quantile regression with ultra-high dimensional data. The Annals of Statistics 43(5), 2225–2258.
- Zheng Q, Peng L, and He X (2018). High dimensional censored quantile regression. The Annals of Statistics 46(1), 308–343.
- Zheng Z, Chen T, Li X, Haura E, Sharma A, and Bepler G (2007). DNA synthesis and repair genes RRM1 and ERCC1 in lung cancer. New England Journal of Medicine 356(8), 800–808.
- Zhu Q-G, Zhang S-M, Ding X-X, He B, and Zhang H-Q (2017). Driver genes in non-small cell lung cancer: Characteristics, detection methods, and targeted therapies. Oncotarget 8(34), 57680–57692.