Abstract
Quantile regression has become a valuable tool for analyzing heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on the examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties and, more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal.
Keywords and phrases: Ultra-high dimensional data, Varying covariate effects, Adaptive penalized quantile regression, Model selection oracle property
1. Introduction
We consider the problem of analyzing high and ultra-high dimensional data, which have become widely available in a large variety of scientific fields, such as biomedical imaging, signal processing, machine learning, and finance. In ultra-high dimensional data sets, the number of candidate covariates p is allowed to increase at an exponential rate in the number of observations n, but only a relatively small number s of them have real impact on the response variable. In such a situation, it becomes very useful but challenging to identify the relevant variables and measure their influences.
Effort has been made to address this unprecedented challenge in the context of linear regression (Meinshausen and Buhlmann, 2006; Zhang and Huang, 2008; Huang et al., 2008; Kim et al., 2008; Lv and Fan, 2009; Fan and Lv, 2011, among others). Quantile regression (Koenker and Bassett (1978)) has emerged as a flexible tool to model the effects of covariates on the conditional quantiles, and it permits investigation of heterogeneity across quantiles. For example, meteorologists typically focus on the extreme temperatures in climate studies. Gaussian model based procedures would be inadequate for addressing scientific questions of this kind, and quantile models have a natural role to play. Most of the current literature on quantile regression for high dimensional data examines covariate effects at a single or multiple prespecified quantile levels, which we shall refer to as locally concerned quantile regression. A number of authors, for example, Knight and Fu (2000), Li et al. (2007), Zou and Yuan (2008), Wu and Liu (2009), Rocha et al. (2009), considered locally concerned quantile regression using penalization to achieve sparsity. Several authors, such as Wang et al. (2012), Zheng et al. (2013) and Fan et al. (2014), investigated cases with ultra-high dimensional covariates.
There are subtle and yet important issues with the practical use of locally concerned quantile regression. For example, when interest lies in identifying variables that impact the upper quantiles, should one consider just a single τ-th quantile at τ = 0.9, or several quantile levels? There is usually no clear scientific support for choosing one τ over another nearby value. With a limited sample size, there is variability in the set of selected variables as τ changes, even if just slightly. Such variability is clearly undesirable for interpretation. More importantly, some important variables are likely to be missed, simply due to chance, if we perform variable selection at any given τ.
To address the limitations of locally concerned quantile regression, we propose an alternative model selection strategy, called globally concerned quantile regression, that examines regression quantiles over a set of quantile levels, denoted by Δ ⊂ (0, 1). Typically, Δ is selected as an interval of quantile levels that captures the part of the conditional distributions of interest. For example, Δ may be chosen as [0.4, 0.6] if we would like to identify variables that impact the center of the conditional distributions, or [0.75, 0.9] if we are interested in the upper tails. If we are interested in identifying variables that have impact on any quantile of the conditional distributions, we may choose Δ = [0.1, 0.9]. If Δ is a singleton set or a finite set, globally concerned quantile regression reduces to locally concerned quantile regression. Therefore, we can take the view that globally concerned quantile regression extends locally concerned quantile regression by allowing for contemporaneous evaluation of the covariate effects at a continuum of quantile levels. This additional flexibility offered by globally concerned quantile regression can enhance high-dimensional sparse modeling. Specifically, a globally concerned quantile regression approach can take advantage of all useful information across quantiles to improve the stability of variable selection. Even if an active variable is missed by locally concerned quantile regression at the targeted quantile level, its trail may still be captured within the neighborhood of the quantile level.
A naive pointwise approach for globally concerned quantile regression in the ultra-high dimensional setting would perform an existing locally concerned penalization method separately at each τ ∈ Δ (including tuning parameter selection), and then take the union of active variable sets identified at each τ. Such an approach tends to result in considerably overfitted models, as clearly demonstrated by simulation studies in Section 4. The model overfitting phenomenon is analogous to the inflated type-I error in multiple comparisons. Similar to the strategy of controlling the familywise error rates by adjusting critical values of individual tests in multiple comparison problems, we consider appropriate selection of the tuning parameter in the penalization.
Belloni and Chernozhukov (2011) investigated the L1-penalized quantile regression (L1-QR) for the ultra-high dimensional data setting and established some useful results that hold uniformly across Δ. They established a near oracle consistency rate for p > n, and showed that the model identified by L1-QR contains the true model as a submodel. Nevertheless, there remain some important open questions. First, in the ultra-high dimensional setting, log p could be o(nb) for some b > 0. Thus the convergence rate is less satisfactory. Notice that AR-Lasso proposed by Fan et al. (2014) enjoys a convergence rate at a given quantile level. We ask if we can achieve the same convergence rate for the penalized regression quantiles uniformly in Δ. Secondly, as commented by Fan et al. (2014) and Wang et al. (2012), L1-QR typically does not possess the desired model selection oracle property. Can this deficiency be corrected by adopting an adaptively weighted L1-penalty instead of the L1-penalty under globally concerned quantile regression? Thirdly, L1-QR requires an assumption that the active covariate effects do not cross 0 as τ varies in Δ. We hope that this restriction can be removed.
Motivated by the precursor work by Peng et al. (2013) on variable selection under globally concerned quantile regression with a fixed covariate dimension, we study the penalization strategy of adaptively weighting the L1-penalty in the ultra-high dimensional setting, in combination with a GIC-type uniform selector of the tuning parameter. The theoretical development for the high dimensional case cannot rely on the traditional empirical process arguments used in Peng et al. (2013); instead we modify a chaining idea from Talagrand (2005) to circumvent the difficulty with high dimensions. We are able to show that with probability tending to one, the proposed estimator can successfully identify the set of relevant covariates, including those having effects on some or all quantile levels in Δ. The model selection oracle property can be established, implying the same convergence rate for the proposed estimator and the oracle estimator. We employ empirical process techniques to derive the convergence rate of the oracle estimator and thus that of the proposed estimator. We demonstrate that the adaptively weighted penalties can reduce the bias induced by the L1 penalties. Compared to L1-QR, which may select a model of size as large as O(n/log p), the proposed method can achieve consistent model selection, and hence an improved estimation convergence rate, which we show holds uniformly in τ ∈ Δ. When Δ is a singleton set, so that our estimator becomes one for penalized locally concerned quantile regression, the convergence rate can be further improved to the oracle rate. Our theoretical investigations do not preclude the cases where the covariate effects equal zero at some quantile.
In addition, we show that any linear combination of the proposed estimator converges weakly to a Gaussian process as a function of τ ∈ Δ. Such weak convergence results have been lacking in the high dimensional setting, possibly because the increasing dimensionality makes the classical approaches to weak convergence inapplicable. With a mild constraint on the model size, we show, for the first time, that the proposed GIC-type uniform tuning parameter selector can ensure consistent model selection in the ultra-high dimensional setting.
The rest of the article is organized as follows. In Section 2, we introduce a globally concerned quantile regression framework and propose an adaptively weighted L1-penalization procedure. In Section 3, we investigate the theoretical properties of the proposed estimator. We conduct Monte Carlo studies to evaluate its finite sample performance in Section 4. In Section 5, we demonstrate the proposed method with a real data example. Further discussions are contained in Section 6. We defer all technical proofs to Section 7.
2. Adaptively weighted L1-penalized quantile regression
2.1. A globally concerned quantile regression framework
We consider a globally concerned quantile regression model, which takes the form,
QY(τ|X) = XTβ0(τ),  τ ∈ Δ, | (2.1) |
where QY(τ|X) := inf{y : Pr(Y ≤ y|X) ≥ τ} denotes the τth conditional quantile of a response variable Y given X, X := (1, ZT)T is a p × 1 vector of covariates, β0(τ) := (α0(τ), β0(2)(τ), ···, β0(p)(τ))T is a p × 1 vector of unknown coefficient functions of τ, and Δ ⊂ (0, 1) is a prespecified set of interest. Here, Δ can take a general form as the union of multiple disjoint intervals. In contrast, a locally concerned quantile regression model can be expressed as (2.1) but with Δ being a singleton set or a countable set. The coefficients α0(τ) and β0(j)(τ), j = 2, ···, p, represent the intercept and covariate effects on the τ-th conditional quantile of Y given X, respectively, and are allowed to vary over τ ∈ Δ. It is worth pointing out that the globally concerned quantile regression model (2.1) bears a meaningful distinction from a location-shift linear model only when the covariates are confined to a compact set (Koenker, 2005). By this consideration, a bounded covariate space is assumed throughout the presentation of theoretical results in Section 3. Nevertheless, our technical proofs in Section 7 adopt a weaker assumption on covariates, which allows the support of covariates to expand with n.
We consider the ultra-high dimensional setting, where log p = o(nb) for some b > 0. Let u(j) denote the jth component of the vector u. The set of relevant/active covariates under the globally concerned quantile regression model (2.1) is defined as
SΔ := {j ∈ {2, ···, p} : β0(j)(τ) ≠ 0 for some τ ∈ Δ}.
The implication from this definition is that we intend to identify all variables that impact the selected segment of the conditional distribution of the response (reflected by the choice of Δ). A relevant variable may influence all or some of the quantiles of interest. To ensure the model identifiability of (2.1), we impose a global sparsity condition, which assumes that the number of relevant covariates is small relative to the sample size, that is, |SΔ| = s = o(n), where |·| denotes the cardinality. It is easy to see that this global sparsity condition implies the local sparsity condition for a τ-th quantile regression model, |Sτ| = o(n), where Sτ := {j ∈ {2, ···, p} : β0(j)(τ) ≠ 0}.
It is important to note that the stronger global sparsity condition is indispensable for globally concerned quantile regression. Suppose we only assume the local sparsity for each τ ∈ Δ. Although |Sτ| = o(n) for all τ ∈ Δ, |SΔ| = |⋃τ∈Δ Sτ| could still be greater than n when β0(τ) is allowed to change with τ and so is Sτ. On the other hand, Belloni and Chernozhukov (2011) studied the globally concerned L1-QR under the local sparsity condition, |Sτ| < s ≤ n/log(n ∨ p) for all τ ∈ Δ. This condition, when coupled with their constraint that the active covariate effects do not cross 0, implies |SΔ| = o(n), the global sparsity condition assumed in this paper.
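To make the distinction between local and global sparsity concrete, the following Python sketch computes the local active sets Sτ on a τ-grid and their union SΔ for a toy coefficient function. All names (active_sets, beta0), the grid discretization, and the zero tolerance are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def active_sets(beta0, tau_grid, tol=1e-12):
    """Return the local active sets S_tau and their union S_Delta."""
    S_tau = {tau: set(np.flatnonzero(np.abs(beta0(tau)) > tol))
             for tau in tau_grid}
    S_Delta = set().union(*S_tau.values())
    return S_tau, S_Delta

# Toy example: covariate 0 acts only on upper quantiles, covariate 1 on all.
beta0 = lambda tau: np.array([max(tau - 0.5, 0.0), 1.0, 0.0])
S_tau, S_Delta = active_sets(beta0, tau_grid=[0.3, 0.5, 0.7])
# S_tau at 0.3 is {1}, at 0.7 it is {0, 1}; the union S_Delta = {0, 1} is
# the globally relevant set targeted by globally concerned quantile regression.
```

The example illustrates why a variable with effects only on part of Δ still belongs to SΔ, even though it is absent from some local sets Sτ.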
Without loss of generality, we assume that SΔ = {2, ···, s}, i.e. the first s variables have non-vanishing effects on the conditional distribution of interest. We use SΔc := {s + 1, ···, p} to denote the collection of all irrelevant variables. We allow the number of covariates p = pn and the true model size s = sn to increase with the sample size n. To ease the presentation, we often omit the subscript n when it is clear from the context.
2.2. Adaptively weighted L1-penalized quantile regression
Given a random sample consisting of n independent and identically distributed observations, denoted by {(Yi, Zi), i = 1, ···, n}, we use the adaptively weighted L1-penalized quantile regression estimator β̂λn(τ), which is the minimizer of the following objective function:
Σi=1n ρτ(Yi − XiTβ(τ)) + λn Σj=2p ωj(τ)|β(j)(τ)|, | (2.2) |
where ρτ(t) := t(τ − 1{t ≤ 0}) is the τ-th quantile loss function, λn is a tuning parameter controlling the overall model complexity over Δ, and ωj(τ), j = 2, ···, p, are nonnegative weight functions assigned to the β(j)(τ)’s. Consequently, the estimated global model,
ŜΔ := {j ∈ {2, ···, p} : β̂λn(j)(τ) ≠ 0 for some τ ∈ Δ},
is the collection of all active variables identified by β̂λn(τ) over Δ. The general requirements on ωj(τ) and λn are expounded in the theoretical studies presented in Section 3. Note that adopting adaptive weights distinguishes the proposed estimation method from L1-QR, in which the penalties assigned to covariates are non-adaptive.
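For concreteness, the objective in (2.2) at a fixed τ can be sketched as a check loss plus a weighted L1 penalty on the non-intercept coefficients. The Python rendering below is purely illustrative; the 1/n averaging of the loss and the function names are assumptions, and the paper's actual computation proceeds through the linear programming formulation of Section 2.3.

```python
import numpy as np

def check_loss(t, tau):
    # rho_tau(t) = t * (tau - 1{t <= 0})
    return t * (tau - (t <= 0))

def penalized_objective(beta, tau, Y, X, lam, omega):
    """Check loss plus adaptively weighted L1 penalty at quantile level tau.
    The intercept (first coefficient) is unpenalized, matching the paper's
    indexing j = 2, ..., p; the 1/n averaging is an assumed normalization."""
    fit = np.mean(check_loss(Y - X @ beta, tau))
    penalty = lam * np.sum(omega * np.abs(beta[1:]))
    return fit + penalty
```

Evaluating this objective over candidate coefficient vectors makes explicit how λn trades goodness of fit against the weighted model complexity.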
The proposed globally concerned quantile regression framework allows for more flexibility in selecting the form of ωj(τ), as compared to locally concerned quantile regression. For example, we can choose ωj(τ) to be one of the following functions:
(w1) ωj(τ) = 1/|β̃(j)(τ)|;
(w2) ωj(τ) = 1/supτ′∈Δ |β̃(j)(τ′)|;
(w3) ωj(τ) = 1/∫Δ |β̃(j)(τ′)| dτ′;
where β̃(τ) is any uniformly consistent initial estimate of β0(τ). The design of these adaptive weight functions essentially follows the same idea behind the classical adaptive-Lasso (Zou, 2006) to assign different penalty levels according to the conjectured variable importance. The choice (w1) corresponds to a traditional choice of adaptive weights as seen in adaptive-Lasso (Zou, 2006), while the other two examples reflect a global use of the information on β0(τ), because ωj(τ) is the same for all τ ∈ Δ. By using (w2) or (w3), we take the view that the variable importance is respectively captured by the maximum signal strength or the cumulative signal strength over τ ∈ Δ, corresponding to the L∞ and L1 norm of coefficient functions. Clearly other function norms may also be used to define adaptive weights. Compared to (w2) and (w3), the adaptive weight (w1) is less tailored to the objective of identifying the set of globally relevant variables; it changes with τ and reflects the local relevance rather than the global importance of variables. As elaborated in Section 3, the proposed adaptively weighted L1-penalized quantile regression can achieve the model selection oracle property with all the weights listed above. However, (w1) is generally subject to stronger signal conditions compared to (w2) and (w3). Our numerical studies suggest that adopting (w2) or (w3) may lead to more favorable finite sample performance as compared to using the traditional adaptive weight, (w1).
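The three weight choices can be sketched in Python as follows, operating on an initial estimate β̃ stored on a τ-grid (rows indexing grid points, columns indexing covariates). The matrix representation, the small eps guard against division by zero, and the trapezoidal approximation of the L1-norm integral are all assumptions of this illustration.

```python
import numpy as np

def weights_w1(beta_init, eps=1e-8):
    # (w1): local weights 1/|beta_tilde_j(tau)|, one row per grid point
    return 1.0 / (np.abs(beta_init) + eps)

def weights_w2(beta_init, eps=1e-8):
    # (w2): global weights 1/sup_tau |beta_tilde_j(tau)|  (L-infinity norm)
    return 1.0 / (np.abs(beta_init).max(axis=0) + eps)

def weights_w3(beta_init, tau_grid, eps=1e-8):
    # (w3): global weights 1/integral of |beta_tilde_j(tau)| over Delta
    # (L1 norm), approximated by the trapezoidal rule on the grid
    a = np.abs(beta_init)
    integral = ((a[:-1] + a[1:]) / 2 * np.diff(tau_grid)[:, None]).sum(axis=0)
    return 1.0 / (integral + eps)
```

Note that (w2) and (w3) return one weight per covariate, shared by all τ ∈ Δ, whereas (w1) returns a different weight at each grid point, mirroring the local versus global distinction discussed above.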
For the initial estimator β̃(τ), we recommend using the estimator from L1-QR, which is a minimizer of (2.2) obtained by setting ωj(τ) = 1, j = 2, ···, p. Therefore, the proposed adaptively weighted L1-penalized quantile regression consists of two stages in the implementation. In the first stage, we employ the L1-QR to generate β̃(τ) and then construct the adaptive weight functions ωj(τ), which are used in the second stage to produce the function estimator, β̂λn(·).
Another important component of the proposed penalized estimation is to adopt a common tuning parameter λn for all τ ∈ Δ. This is key to achieving the oracle model selection property for globally concerned quantile regression. If we were to tune λn(τ) separately at each τ for local sparsity, the union of the selected models would include globally irrelevant covariates too often, resulting in inconsistent selection of SΔ. This phenomenon shares a similar spirit with the issue of inflated type I error in the multiple comparison setting, and is clearly demonstrated via some simulations reported in Section 4.
To select λn, we propose a uniform selector of tuning parameter by using a GIC criterion adapted to the globally concerned quantile regression model (2.1):
| (2.3) |
where φn is a sequence converging to 0 as n → ∞. We show in Theorem 3.4 that the proposed GIC criterion ensures the oracle model selection property when a reasonable upper bound on the model size is imposed to eliminate clearly oversized models during model selection. The simulation studies in Section 4 demonstrate the satisfactory behavior of the proposed tuning parameter selection.
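Since the exact form of (2.3) depends on quantities defined above, we only sketch the generic shape of a GIC-type uniform selector: a logarithmic goodness-of-fit term plus the model size times a vanishing sequence φn, minimized over a single λ shared by all τ ∈ Δ. The specific choice φn = log(log n) log(p)/n below and the function names are assumptions of this sketch, not the paper's exact criterion.

```python
import numpy as np

def gic(fit_value, model_size, n, p):
    """Generic GIC-type score: log(fit) + |model| * phi_n with phi_n -> 0.
    The choice phi_n = log(log n) * log(p) / n is a common one in the GIC
    literature and is an assumption here, not the paper's exact (2.3)."""
    phi_n = np.log(np.log(n)) * np.log(p) / n
    return np.log(fit_value) + model_size * phi_n

def select_lambda(candidates, fit_fn, size_fn, n, p):
    # Uniform selection: one lambda shared by all tau in Delta.
    scores = [gic(fit_fn(lam), size_fn(lam), n, p) for lam in candidates]
    return candidates[int(np.argmin(scores))]
```

A small λ yields a good fit but a large model; a large λ shrinks the model but degrades the fit; the criterion balances the two through the single penalty rate φn.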
2.3. Computation Algorithm for β̂λn(τ)
Here we assume that the adaptive weight ωj(τ) is a function or functional of the initial estimator β̃(·).
To begin, we review some important results about standard quantile regression to help understand the proposed computation algorithm. As elaborated in Koenker and Bassett (1978); Bassett and Koenker (1982); Portnoy (1991), the sample regression quantile function is piecewise constant in τ ∈ (0, 1), due to the parametric programming nature of the optimization problem. By Corollary 3.1 of Portnoy (1991), the size of the breakpoint set for a regression quantile function is roughly O(m log m), where m is the number of data points involved in the quantile regression minimization problem. Koenker and D’Orey (1987) provided a detailed account of how to compute a regression quantile function for all τ ∈ (0, 1) via parametric linear programming, including the procedure to determine all the breakpoints. An implementation of Koenker and D’Orey (1987)’s algorithm is available in the rq() function in the R package quantreg.
Below we provide a detailed description of the proposed computation algorithm for β̂λn(·). Our basic strategy for determining the full function path of β̂λn(·) is to formulate β̂λn(τ) as a regression quantile of some augmented datasets. By the general property of regression quantile functions discussed above, we can conclude that β̂λn(τ) is a piecewise constant function changing at only a finite number of breakpoints. Our algorithm delineates how the breakpoint set of β̂λn(·) can be obtained using the rq() function after appropriate data manipulations. We also provide sample code in Section 4 of the supplemental article [Zheng et al. (2015)].
Step 1: Follow Belloni and Chernozhukov (2011) to obtain the tuning parameter λ̃n for computing the L1-QR estimator β̃(τ).
Step 2: Fit the rq() function to the augmented dataset, { ( ), 1 ≤ i ≤ n+2p − 2}, where for 1 ≤ i ≤ n, , and for 1 ≤ j ≤ p − 1. Here ej is a p-dimensional vector with the jth component equal to 1 and all the others equal to 0. The results would include the set of all breakpoints of β̃(·), denoted by R1, and also the value of β̃(τ) at each breakpoint. The function estimator β̃(·) is a piecewise constant function that jumps only at breakpoints in R1.
Step 3: Compute the adaptive weights ωj(τ) (j = 2, ···, p) based on β̃(·). Let
{Ik(R1) : k = 1, ···, |R1|}
be the set of τ-intervals on each of which the weights, ω2(τ), ···, ωp(τ), are all constant. Here |R1| denotes the size of the breakpoint set R1.
Step 4: Calculate GIC(λ) on a sequence of tuning parameter candidates. More specifically, given any fixed tuning parameter λ,
Step 4.1: Set k = 1.
Step 4.2: Fit the rq() function to the augmented dataset: { , 1 ≤ i ≤ n + 2p − 2}, where for 1 ≤ i ≤ n, , and for 1 ≤ j ≤ p − 1. Denote the resulting breakpoint set by R2,k(R1). For each breakpoint τ ∈ R2,k(R1) ∩ Ik(R1), let β̂λ(τ) be the corresponding rq() coefficient estimate.
Step 4.3: Increase k by 1 and go back to Step 4.2 unless k > |R1|.
Step 4.4: Calculate GIC(λ) based on β̂λ(·), which is now fully determined, with the breakpoint set given by ⋃k=1|R1| (R2,k(R1) ∩ Ik(R1)).
Step 5: Find λ̂n that minimizes GIC(λ).
Step 6: Obtain β̂λn(·) over τ ∈ Δ corresponding to λ = λ̂n.
It is worth pointing out that Belloni and Chernozhukov (2011) proposed to choose their tuning parameter, λ̃n, based on simulations of a pivot quantity, and thus Step 1 can be carried out without iterating with other steps. The construction of the augmented dataset in Step 2 is motivated by the fact that the definition of ( ) and ( ) makes . Consequently,
This finding indicates that β̃(τ) can be equivalently formulated as the τth regression quantile of the augmented dataset, {( ), 1 ≤ i ≤ n + 2p − 2}. Thus β̃(τ) is piecewise constant over τ and so is ωj(τ). It allows us to use existing software, such as rq(), to solve the L1-QR as a standard quantile regression problem and to obtain the full function path of the initial estimator, β̃(·). The same idea is applied in Step 4.2, with additional attention paid to incorporating the adaptive weights, which may change with τ. More specifically, given that ωj(τ) is piecewise constant over τ, we can separately handle β̂λn(·) in each τ-interval where the adaptive weights are constant over τ. With data manipulations similar to those for β̃(τ), we can formulate β̂λn(τ) as the τth regression quantile of an augmented dataset, as shown in Step 4.2. Therefore, the breakpoint set of β̂λn(·) in the given τ-interval can be obtained using the rq() function. The union of all such breakpoint sets then gives the breakpoint set of β̂λn(·) (see Step 4.4). In the special case where the weight function is chosen as (w2) or (w3), |R1| would be 1 and the loop between Steps 4.2 and 4.3 would not be needed.
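The correctness of such augmentation constructions rests on the elementary identity ρτ(b) + ρτ(−b) = |b| for every τ ∈ (0, 1), which converts an absolute-value penalty term into a pair of check-loss terms on pseudo-observations with zero responses. A quick numerical verification of the identity in Python (an illustration only; the paper's implementation is in R):

```python
import numpy as np

def rho(t, tau):
    # quantile check loss rho_tau(t) = t * (tau - 1{t <= 0})
    return t * (tau - (t <= 0))

# rho_tau(b) + rho_tau(-b) = |b| for all tau, so each weighted penalty term
# can be written as check losses evaluated at two pseudo-observations with
# response 0 and design rows along +/- e_j; verify over a range of values:
for tau in (0.1, 0.5, 0.9):
    for b in (-2.0, -0.3, 0.0, 0.7, 5.0):
        assert np.isclose(rho(b, tau) + rho(-b, tau), abs(b))
```

This identity is what allows the penalized problem to be solved by an unmodified quantile regression routine on the augmented data.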
Note that the above algorithm enables us to determine β̂λn(·) for all τ ∈ Δ. However, the computational burden could be heavy in ultra-high dimensional settings, since the size of the breakpoint set is roughly O((n + 2p) log(n + 2p)) (Portnoy, 1991). To alleviate the computational burden, one can approximate β̂λn(·) by a cadlag step function that jumps only at a sufficiently fine pre-specified grid in Δ. Specifically, let Tn = {τ0, τ1, …, τM(n)} denote a τ-grid in Δ with inf{Δ} = τ0 < τ1 < … < τM(n) = sup{Δ}, and define its mesh size by ||Tn|| = max{τk − τk−1 : k = 1, …, M(n)}. We propose to approximate β̂λn(·) by the step function β̂λnG(τ) := β̂λn(τk) for τ ∈ [τk, τk+1), k = 0, …, M(n) − 1. With the smoothness of β0(·) assumed in condition (C3) in Section 3, we can show that β̂λnG(·) has the same uniform convergence rate and asymptotic distribution as β̂λn(·) when (ns)1/2 ||Tn|| = o(1). The proof is provided in Section 3 of the supplemental article [Zheng et al. (2015)].
In our simulation studies, we chose a grid of size 2n/5. The estimator based on grid approximation performs well with realistic sample sizes. In our real data analysis, the β̂λn(τ)’s obtained from the exact and approximate calculations select the same set of variables.
Clearly, adopting the grid-based approximation is computationally appealing because it only requires calculating β̂λn(·) at M(n) grid points. Given that the only technical constraint on M(n) is limn→∞ M(n)/(ns)1/2 = ∞, M(n) can be chosen to be much smaller than the total number of exact breakpoints. For example, under our simulation set-up (II) with n = 200 and p = 400, the average number of breakpoints (with (w2) adopted) from 50 simulations is 4130, while the number of grid points that we used in simulations is 2n/5 = 80. Such a dramatic difference between the number of exact breakpoints and M(n) indicates a huge saving in computation from adopting a grid-based approximation instead of the exact calculation of β̂λn(τ) when n and p are large.
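The grid-based approximation amounts to evaluating the fitted process as a cadlag step function over the τ-grid. A minimal Python sketch, with function names and array layout assumed for illustration:

```python
import numpy as np

def step_approx(tau, tau_grid, beta_grid):
    """Cadlag step-function approximation: return the coefficient estimate
    attached to the largest grid point not exceeding tau. Rows of beta_grid
    correspond to the entries of tau_grid (assumed sorted increasing)."""
    k = np.searchsorted(tau_grid, tau, side="right") - 1
    return beta_grid[max(k, 0)]

tau_grid = np.array([0.1, 0.5, 0.9])
beta_grid = np.array([[1.0], [2.0], [3.0]])
# step_approx(0.3, ...) returns the estimate stored at tau = 0.1
```

Only M(n) fits are stored, and any τ ∈ Δ is served by a constant-time lookup, which is the source of the computational saving described above.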
The interpretation of the final model after variable selection may be enhanced by imposing a reasonable semiparametric structure on the coefficient functions in β0(·). For example, constant effects may be assumed for a subset of selected variables according to preliminary scientific information or exploratory analysis. Taking into account such additional information in the post-variable-selection estimation can lead to efficiency gains in addition to simplified interpretations. Such model polishing may follow similar lines of existing work, such as Qian and Peng (2010) and Yang and He (2012), but will not be further pursued in this paper.
3. Theoretical Properties
In this section, we investigate the theoretical properties of the adaptively weighted L1-penalized quantile regression. To ease the presentation, we decompose Xi, i = 1, ···, n, into Xi = (XiaT, XibT)T, where Xia and Xib correspond to the relevant and irrelevant covariates, respectively. Similarly, we write β as β = (βaT, βbT)T. In particular, the true coefficients satisfy β0(τ) = (β0aT(τ), 0T)T.
3.1. Preliminaries
We impose the following regularity conditions to facilitate our technical derivations.
-
(C1) [Condition on the conditional density] Let fτ(·|x) denote the probability density function of Yi given Xi = x. For each x in the support of X, supτ∈Δ fτ(·|x) < f̄ and infτ∈Δ fτ(0|x) ≥ f for some constants f̄, f > 0. Moreover, there exists a constant A0 > 0, such that for all u,
-
(C2)[Conditions on covariates] The covariates are normalized such that , for j = 2, ···, p, and obeys
-
(C3)[Condition on the true regression coefficients β0(τ)] There exists a positive constant L, such that
where ||·|| denotes the l2-norm, and s is the size of SΔ.
-
(C4) [Conditions on the identifiability of the true model] The eigenvalues of E(XXT) are bounded from below and above by some constants λmin > 0 and λmax, respectively. Moreover,
where Rs = {δ ∈ ℝp: δb = 0}.
Conditions (C1)–(C4) basically follow the assumptions imposed in Belloni and Chernozhukov (2011), since we adopt L1-QR to construct the consistent initial estimates. Those conditions are also needed to derive the convergence rate of our proposed estimator. Condition (C1) only imposes mild conditions on the conditional density of the response given covariates, and does not assume homoscedasticity. Condition (C2) requires the covariates to be well-behaved with finite second moments. Condition (C3) implies that the true regression coefficient function is Lipschitz continuous with respect to τ. The eigenvalue conditions in (C4) are widely assumed in the literature on high dimensional models. They are critical to model identifiability. The restricted nonlinear impact (RNI) coefficient q controls the quality of minoration of the quantile loss function by a quadratic function over the true global model.
3.2. Main Results
We start with a hypothetical scenario assuming that all relevant covariates are known in advance, and establish the asymptotic properties of the oracle regularized estimator obtained from it. Then we will show that our proposed estimator can enjoy the same properties despite the lack of true model information. The oracle regularized estimator is defined as the minimizer of (2.2) over Rs. The following theorem shows that the oracle regularized estimator is uniformly consistent over Δ, with the described convergence rate.
Theorem 3.1
Under the regularity conditions (C1)–(C4), if , then the oracle regularized estimator satisfies
Theorem 3.1 shows that if λn and the ωj(τ)’s are appropriately chosen, the bias introduced by the penalty terms can be well controlled. The uniform upper bound indicates the uniform consistency of the oracle regularized estimator, with a convergence rate that is within a logarithmic factor of the optimal rate shown by He and Shao (2000). If we only consider a finite number of τ’s, we obtain the following corollary.
Corollary 3.1
Suppose the regularity conditions (C1)–(C4) hold. Given a fixed τ0, if , then the oracle regularized estimator satisfies
According to Corollary 3.1, the locally concerned oracle regularized estimator can achieve the optimal convergence rate. This suggests that the additional logarithmic factor in Theorem 3.1 is needed to guarantee the uniform convergence.
Next, we establish the weak convergence of the penalized oracle estimator. We define two p × p matrices,
Theorem 3.2
Suppose the regularity conditions (C1)–(C4) hold, s = o(n1/3), and . Given any α ∈ ℝp such that ||α|| = 1, we have
-
if , then we have
converges weakly to a mean zero Gaussian process with covariance . Here ∘ denotes the Hadamard product, ∧ is the minimum operator, and ω(τ) = (ω1(τ), ···, ωp(τ))T.
-
if supτ∈Δ n−1/2λn||ωa(τ)|| = op(1), then
converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′).
Theorem 3.2 states two weak convergence results under two different conditions. Part (a) requires that the minimum signal strength of all relevant covariates be beyond a certain level; this would exclude the cases where some active variables have coefficient functions crossing 0 or partially equal to 0. Part (b) assumes supτ∈Δ n−1/2λn||ωa(τ)|| = op(1). If we choose adaptive weight (w2), with β̃(τ) obtained from Belloni and Chernozhukov (2011), then the condition in (b) is satisfied provided the maximum impact of each relevant variable, j = 2, ···, s, is beyond a certain level. Such a requirement is typical in high dimensional cases, and the varying covariate effects are allowed to cross zero or be partially equal to zero in our framework. Using similar arguments, with adaptive weight (w3), condition (b) is satisfied if the cumulative signal strength of each relevant variable is beyond a corresponding level. Thus, condition (b) is neither weaker nor stronger than condition (a). If we choose (w1) as the adaptive weight, then condition (b) can be satisfied under a signal assumption on the minimum local signal strength. Therefore, Theorem 3.2 entails weaker signal conditions from using the global adaptive weights, (w2) and (w3), than from adopting the local adaptive weight, (w1).
For the oracle regularized estimator to have asymptotic normality, the true model size is allowed to be o(n1/3), which matches the fastest model growth rate allowed in Welsh (1989) and He and Shao (2000) for the unpenalized quantile regression estimator to be asymptotically normal.
If we can show that the oracle regularized estimator is also the minimizer of (2.2) over ℝp with probability tending to 1, then our proposed estimator enjoys the same properties established in Theorem 3.1 and Theorem 3.2. The following theorem shows that this holds with probability approaching 1.
Theorem 3.3
Suppose the regularity conditions (C1)–(C4) hold. Furthermore, we assume the following conditions hold:
and n/(s2 log p) → ∞. If and , then we have
Theorem 3.3 not only establishes the model selection oracle property of β̂λn(τ), but also indicates that β̂λn(τ) enjoys weak convergence under the mild conditions stated in Theorem 3.2. Moreover, Theorem 3.3, coupled with Theorem 3.1, indicates that the bias of the proposed estimator can be bounded at the same rate achieved by Fan et al. (2014) for AR-Lasso at a single quantile level. In high-dimensional settings, this bound can be smaller than the bias bound established by Belloni and Chernozhukov (2011), which applies to the globally concerned quantile regression method with the nonadaptive L1 penalty.
The first condition essentially asks for weak correlations between relevant and irrelevant covariates, so that the two groups can be well separated. It can be seen that the model growth rate condition, n/(s2 log p) → ∞, is satisfied by the conditions for Theorem 3.2. The remaining conditions indicate different signal strength requirements under different choices of adaptive weights. With adaptive weight (w1), the signal condition,
| (3.1) |
is necessary for one to apply Theorem 3.3. When (w2) or (w3) is used, the signal condition can be relaxed to , or respectively. These weaker signal conditions can well accommodate scenarios with covariate effects crossing zero at some τ ∈ Δ.
In the presence of heavy-tailed distributions, the signal condition (3.1), coupled with the estimation error bound established in Theorem 3.1 for the oracle estimator, implies uniform sign consistency across τ ∈ Δ, namely . Consider the special case with finite s and p = O(n1/2), which Fan et al. (2014) (see Proposition 1) used to investigate the suboptimality of Lasso under heavy-tailed error distributions. Our signal requirement for globally concerned quantile regression to achieve uniform sign consistency can be weaker than what is required by the least squares-based Lasso. This finding provides further evidence for the suboptimality of Lasso in the presence of heavy-tailed distributions.
3.3. GIC procedure for λn
The performance of the proposed adaptively weighted L1-penalized quantile regression hinges on the choice of the tuning parameter λn. To achieve the model selection oracle property, the tuning parameter λn and the adaptive weights ωj(τ)’s are required to satisfy the following conditions: , and . However, the theoretically optimal λn is not practically attainable, because it depends on the unknown true model size s. Although many existing criteria, including AIC and cross-validation, could potentially be employed to select the tuning parameter, Wang et al. (2007) and Zhang et al. (2010) showed that the tuning parameter selected by AIC or cross-validation may fail to consistently identify the true model. Wang et al. (2009) considered tuning parameter selection with a modified BIC in the setting of high dimensional linear regression with p < n, and Wang and Zhu (2011) extended the modified BIC to the situation where p can be larger than n. They demonstrated that the modified BIC can identify the true model consistently for high dimensional linear regression. The modified BIC can be viewed as a generalized information criterion (GIC) in the sense of Nishii (1984). Motivated by these results, we select the practically optimal λ̂n by minimizing the GIC.
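To make the object being tuned concrete, the weighted-L1 penalized quantile regression at a fixed τ can be solved exactly as a linear program. The sketch below is our own illustration (the helper name `weighted_l1_qr` and the constant weights are assumptions, not the paper's algorithm or its weight functions (w1)–(w3)):

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_qr(X, y, tau, lam, w):
    """Solve min_b sum_i rho_tau(y_i - x_i'b) + lam * sum_j w_j |b_j|
    as a linear program, splitting residuals and coefficients into
    nonnegative positive/negative parts."""
    n, p = X.shape
    # Decision vector: [b_plus (p), b_minus (p), u_plus (n), u_minus (n)], all >= 0.
    c = np.concatenate([lam * w, lam * w, tau * np.ones(n), (1 - tau) * np.ones(n)])
    # Equality constraints encode y - X b = u_plus - u_minus.
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

Setting w_j = 0 leaves a coefficient (e.g. an intercept) unpenalized; in the proposed procedure the weights would additionally vary with τ.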
Fan and Tang (2013) recently considered GIC tuning parameter selection in high dimensional penalized generalized linear regression. In addition to the challenging aspects noted by Fan and Tang (2013), our problem involves an additional difficulty due to the varying τ ∈ Δ. As a result, we introduce ∫Δlog σ̂λ(τ)dτ to measure the cumulative model fit of varying effects over Δ.
We define the class of over-fitted models as OF := {S : S ⫌ SΔ}, and the class of under-fitted models as UF := {S : S ∉ OF, S ≠ SΔ}. To study the properties of the proposed GIC in (3.2), we first introduce an auxiliary GIC of model S, which is defined as
| (3.2) |
where measures the fitness of model S at the τ-th quantile, and β̂S(τ) is obtained by unpenalized quantile regression. We also define , which serves as a baseline of model fitness.
In addition, we set a model size upper bound, denoted by Cm(n), with s < Cm(n) < p. That is, we restrict the model search to submodels of size no more than Cm(n). Such an assumption has been widely adopted in studies of tuning parameter selection for high dimensional models [see e.g. Chen and Chen (2008), Wang and Zhu (2011), Fan and Tang (2013)]. On one hand, it reduces the number of candidate models from 2p to 2Cm(n) and greatly reduces the computational burden. On the other hand, it is needed so that the fitness of a model can be correctly measured.
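On a finite τ-grid, a criterion of this kind can be approximated directly from fitted residuals. The sketch below assumes the criterion takes the form ∫Δ log σ̂(τ)dτ + |S|·φn with σ̂(τ) the average check loss; the actual displays (2.3) and (3.2) are not reproduced in this excerpt, so this is an illustration under that assumption rather than the paper's formula:

```python
import numpy as np

def gic(resid_by_tau, taus, model_size, n, p):
    """Grid approximation of a GIC of the assumed form
    integral over Delta of log sigma_hat(tau) d tau + |S| * phi_n,
    where sigma_hat(tau) is the average check loss at tau.

    resid_by_tau: array of shape (len(taus), n), residuals from the fit at each tau.
    """
    taus = np.asarray(taus, dtype=float)
    phi_n = np.log(np.log(n)) * np.log(p) / n  # the phi_n used in Section 4
    # sigma_hat(tau): average check loss rho_tau at each grid point
    losses = np.array([np.mean(r * (t - (r < 0))) for r, t in zip(resid_by_tau, taus)])
    log_losses = np.log(losses)
    # trapezoid rule for the integral over the tau-grid
    cumulative_fit = np.sum((log_losses[1:] + log_losses[:-1]) / 2 * np.diff(taus))
    return cumulative_fit + model_size * phi_n
```

One would evaluate this for each candidate λ (with `model_size` = |Ŝλ|) and keep the minimizer.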
To establish the asymptotic properties of the GIC tuning parameter selector, we assume the following condition (C4+), which is an enhanced version of (C4):
-
(C4+)
where ||·||0 denotes the L0 norm.
Recall that φn is defined in (2.3) as a sequence converging to 0. Let ξn denote , which measures the minimal overall effect of the intercept and the relevant variables on the conditional distribution. We characterize the behavior of the GIC for any model S of size no more than Cm(n) by the following theorem:
Theorem 3.4
Suppose conditions (C1), (C2), (C3) and (C4+) hold. If log p/n = o(φn), , and , then we have
Theorem 3.4 indicates that the true model SΔ has the smallest GIC value among the models of size no more than Cm(n). Let λ̂n denote the minimizer of (2.3). The following corollary shows that the proposed GIC tuning parameter selector can consistently identify the true model with probability approaching 1.
Corollary 3.2
Under the same conditions as in Theorem 3.4, if , and , then we have
where Ŝλ is defined in (2.3).
4. Simulation studies
We conduct simulation studies to evaluate the finite sample performance of the proposed method. We consider the adaptively weighted L1-penalized quantile regression with three weight functions, (w1), (w2) and (w3), denoted by AW1, AW2 and AW3, respectively. We compare them with several other methods: (i) the adaptive-LASSO quantile regression at a single predetermined quantile level τ, denoted by SS(τ); (ii) the adaptive-LASSO with the least squares objective function, denoted by LS; (iii) the L1-QR over Δ, which is the initial estimator (Step 1) of our proposed algorithm; (iv) the pointwise approach collecting estimates from SS(τ) over τ ∈ Δ, denoted by PS1; and (v) the one-step estimate from PS1, obtained by unpenalized quantile regression using the variables selected by PS1 over τ ∈ Δ, denoted by PS2. Note that PS1 and PS2 have the same variable selection results but differ in their coefficient estimates. Except for L1-QR, the tuning parameter λn for the other methods is selected by the GIC criterion with φn = log(log n) log p/n. The candidate values for λn are the n/4 equally-spaced grid points between 0 and n/10. We use the sample size n = 200, and generate Y based on model (2.1) with p = 400 covariates. We consider four setups, denoted by (I), (II), (III) and (IV).
Setup (I)
The intercept coefficient function is , where Φ−1 is the quantile function of the standard normal distribution. The covariates Zi are generated from the multivariate normal distribution Np(0, Σ) with Σ = (σjk)p×p and σjk = 0.5|j−k|. The coefficients for Z(1), Z(2), Z(5), Z(8), Z(12) and Z(16) are set to be 2, 1.5, 3, 1, 0.9, and 1 constantly over τ, and the other coefficients are zero. It is clear that setup (I) is a standard linear regression model with normal random errors.
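A data-generating sketch for setup (I) follows. Since the setup is described as a standard linear regression model with normal random errors, we assume the elided intercept function corresponds to additive i.i.d. N(0, 1) errors; the helper name `simulate_setup1` is ours:

```python
import numpy as np

def simulate_setup1(n=200, p=400, seed=0):
    """Generate data for setup (I): AR(0.5)-correlated normal covariates and a
    homoscedastic linear model, so the tau-th conditional quantile of Y is
    z'beta plus the tau-quantile of the error distribution."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # sigma_jk = 0.5^{|j-k|}
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    # Nonzero effects for Z(1), Z(2), Z(5), Z(8), Z(12), Z(16), constant over tau.
    beta[[0, 1, 4, 7, 11, 15]] = [2, 1.5, 3, 1, 0.9, 1]
    y = Z @ beta + rng.standard_normal(n)  # N(0, 1) errors (our assumption)
    return Z, y, beta
```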
Setup (II)& (III)
The intercept coefficient function is set to be 0, and only Z(1), Z(2) and Z(8) have nonzero coefficients. We plot the coefficient functions of Z(1), Z(2) and Z(8) for these two setups in Figure 1(a) and (b), respectively. We first generate Z̃i from Np(0, Σ), where Σ is the same as in setup (I), and then set Zi = Φ(Z̃i) (componentwise), where Φ(·) is the standard normal distribution function. Setups (II) and (III) are designed to assess the performance of the proposed estimator in models with varying covariate effects.
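The covariate construction for setups (II) and (III) is a componentwise probability-integral transform, which can be sketched as follows (the helper name is ours):

```python
import numpy as np
from scipy.stats import norm

def uniformize(Z_tilde):
    """Componentwise transform Z = Phi(Z_tilde): each marginal becomes
    Uniform(0, 1) while the Gaussian copula dependence induced by Sigma
    is preserved."""
    return norm.cdf(Z_tilde)
```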
Fig. 1.
The coefficient functions in simulation set-ups (II) & (III)
In these setups, we set Δ = [0.1, 0.9] and choose τ = 0.25, 0.50, and 0.75 for SS(τ). We compare the performance of the different methods described above in terms of the following criteria.
NCN : mean number of correctly identified variables (with nonzero coefficient functions)
NIN : mean number of incorrectly selected variables
PUF : percentage of under-fitted models
PCF : percentage of correctly fitted models
POF : percentage of over-fitted models
RAEEΔ : median of relative absolute estimation errors with respect to the oracle unpenalized quantile estimator β̂o(τ) over Δ (under the correctly specified sparse model). It is defined as
The first five criteria are used to assess the model selection performance. We would like NCN to be close to the true number of signals (i.e., 6 in setup (I) and 3 in setups (II) & (III)). The ideal PCF is 100%, while the other measures (NIN, PUF and POF) should be as close to 0 as possible. The measure RAEEΔ aims to evaluate the global estimation accuracy for τ ∈ Δ; we would like it to be as small as possible. For SS(τ) and LS, we calculate RAEEΔ by extrapolating the coefficient function estimate as a constant valued function over τ ∈ Δ.
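Since the display defining RAEEΔ is not reproduced in this excerpt, the sketch below follows the verbal definition: the L1 estimation error of a method, accumulated over a τ-grid, relative to that of the oracle estimator (this computes one replication; the median over replications is then reported):

```python
import numpy as np

def raee(beta_hat, beta_oracle, beta_true, taus):
    """Relative absolute estimation error over a tau-grid (an assumed form of
    the elided RAEE display: integrated L1 error of the estimator divided by
    that of the oracle estimator).

    beta_hat, beta_oracle, beta_true: arrays of shape (len(taus), p).
    """
    taus = np.asarray(taus, dtype=float)
    err = np.abs(beta_hat - beta_true).sum(axis=1)      # L1 error at each tau
    err_o = np.abs(beta_oracle - beta_true).sum(axis=1)  # oracle L1 error
    # trapezoid rule over the tau-grid for both integrals
    num = np.sum((err[1:] + err[:-1]) / 2 * np.diff(taus))
    den = np.sum((err_o[1:] + err_o[:-1]) / 2 * np.diff(taus))
    return num / den
```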
In Table 1, we present simulation results for setups (I)–(III) based on 400 simulations. Under setup (I), where the true model is a standard linear model with normal random errors, we find that AW1, AW2, AW3, SS(0.25), SS(0.50), SS(0.75) and LS all have very similar performance in model selection and estimation accuracy. Since the oracle estimator adopted here is based on quantile regression and the random errors follow a normal distribution, it is reasonable to observe that LS has the best performance in estimation accuracy. Nevertheless, this advantage over the other approaches is only moderate. We also observe that applying the naive strategy of merging variable selection results from all SS(τ), as in PS1 and PS2, tends to produce an overfitted version of model (2.1), with POF = 94.0%. The large RAEEΔ of PS2 suggests that such an overfitted model tends to have considerably deteriorated estimation accuracy.
Table 1.
Simulation results for setup (I)–(III)
| set-up | Method | NCN | NIN | PUF(%) | PCF(%) | POF(%) | RAEEΔ |
|---|---|---|---|---|---|---|---|
| (I) | AW1 | 6.00 | .09 | .0 | 90.7 | 9.3 | 1.54 |
| | AW2 | 6.00 | .07 | .0 | 93.3 | 6.7 | 1.39 |
| | AW3 | 6.00 | .06 | .0 | 93.7 | 6.3 | 1.36 |
| | SS(0.25) | 5.99 | .16 | 1.3 | 84.5 | 14.2 | 1.22 |
| | SS(0.50) | 6.00 | .07 | .0 | 93.8 | 6.2 | 1.06 |
| | SS(0.75) | 5.99 | .16 | .8 | 85.7 | 13.5 | 1.19 |
| | LS | 6.00 | .28 | .0 | 79.9 | 20.1 | .77 |
| | L1-QR | 6.00 | 109.90 | .0 | .0 | 100.0 | 3.89 |
| | PS1 | 6.00 | 2.71 | .0 | 6.0 | 94.0 | 1.28 |
| | PS2 | 6.00 | 2.71 | .0 | 6.0 | 94.0 | 13.45 |
| (II) | AW1 | 2.84 | .04 | 14.8 | 80.5 | 4.7 | 3.62 |
| | AW2 | 2.91 | .05 | 7.7 | 87.8 | 4.5 | 1.83 |
| | AW3 | 2.87 | .04 | 11.5 | 84.8 | 4.5 | 1.92 |
| | SS(0.25) | 2.42 | .10 | 39.9 | 55.7 | 4.4 | 4.48 |
| | SS(0.50) | 1.99 | .00 | 100.0 | .0 | .0 | 3.07 |
| | SS(0.75) | 2.51 | .05 | 38.8 | 59.5 | 1.7 | 4.26 |
| | LS | 1.95 | .18 | 100.0 | .0 | .0 | 3.85 |
| | L1-QR | 3.00 | 26.30 | .0 | .0 | 100.0 | 4.90 |
| | PS1 | 3.00 | 1.77 | .5 | 21.7 | 78.3 | 2.58 |
| | PS2 | 3.00 | 1.77 | .5 | 21.7 | 78.3 | 12.04 |
| (III) | AW1 | 2.83 | .19 | 16.0 | 70.3 | 13.7 | 2.05 |
| | AW2 | 2.85 | .21 | 14.2 | 71.0 | 14.8 | 1.34 |
| | AW3 | 2.80 | .09 | 19.5 | 73.8 | 6.7 | 1.38 |
| | SS(0.25) | 1.79 | .12 | 100.0 | .0 | .0 | 3.64 |
| | SS(0.50) | 1.01 | .01 | 100.0 | .0 | .0 | 3.29 |
| | SS(0.75) | 1.76 | .08 | 100.0 | .0 | .0 | 2.70 |
| | LS | 2.10 | .18 | 99.5 | 0.2 | .3 | 3.44 |
| | L1-QR | 2.98 | 26.78 | 1.8 | .0 | 98.2 | 3.67 |
| | PS1 | 2.94 | 1.65 | 5.5 | 24.2 | 70.3 | 1.48 |
| | PS2 | 2.94 | 1.65 | 5.5 | 24.2 | 70.3 | 8.78 |
In setups (II) and (III), we observe that LS performs very poorly in identifying the true model. For example, in setup (II), although Z(8) has a strong effect on the response at all quantile levels except τ = 0.5, its symmetric effect pattern neutralizes its mean effect on Y. Thus, LS frequently fails to select Z(8), yielding NCN = 1.95 and PUF = 100%. Yet, if one simply applies the locally concerned penalized median regression, Z(8) is characterized as an inactive variable as well, and its effects on other segments of the conditional distribution of Y given Z are overlooked. The unsatisfactory results of SS(0.50) support the use of our globally concerned quantile regression, which enables a more thorough assessment of covariates impacting various quantiles. Compared to SS(0.25) and SS(0.75), where Z(8) is an active variable, the proposed method for globally concerned quantile regression considerably enhances the power for selecting the correct set of relevant variables. This is clearly indicated by the PCF improvement ranging between 20% and 30%. As in setup (I), PS1 and PS2, the naive “global” methods, suffer from the overfitting problem, with POF = 78.3%. The superiority of the proposed method is more evident in setup (III), where Z(1) and Z(2) have partial effects on two disjoint quantile ranges. In this case, the PCFs of AW1, AW2 and AW3 are still over 70%, far above the PCFs achieved by the other methods.
By comparing RAEEΔ in Table 1, we observe that the proposed globally adaptive estimators also outperform the other methods in estimation accuracy for non-i.i.d. error models. In addition, Table 1 suggests that adopting the weight functions (w2) and (w3), which are designed to capture the global signal strength, generally yields better finite-sample performance than the traditional weight (w1). We also note that AW2 and AW3 have similar or even smaller RAEEΔ compared to PS1, an approach that penalizes for local sparsity separately at each quantile index τ. This shows that the proposed globally adaptive method does not sacrifice estimation accuracy while avoiding the overfitting issue associated with the naive pointwise approach. This observation is in line with our theoretical results in Theorem 3.1 and Corollary 3.1.
We also examine the local estimation accuracy of different methods. Specifically, we compute the L1 loss of each estimator of β0(0.5) and then calculate its ratio to that obtained from SS(0.50). The results in Table 2 suggest that our globally concerned penalization method may lose a small proportion of local estimation efficiency compared to the locally concerned SS(0.5), a reasonable price to pay for achieving better global model selection and estimation results. On the other hand, PS2 has much larger local estimation errors compared to the other methods. This may be viewed as a natural consequence of the poor global sparsity control by the naive pointwise approach.
Table 2.
Local estimation error ratio over SS(0.50) at τ = 0.5 for setup (I)–(III)
| set-up | AW1 | AW2 | AW3 | PS1 | PS2 | LS |
|---|---|---|---|---|---|---|
| (I) | 1.16 | 1.13 | 1.14 | 1.00 | 11.01 | 0.75 |
| (II) | 1.19 | 1.61 | 1.63 | 1.00 | 18.52 | 2.35 |
| (III) | 1.01 | 1.17 | 1.18 | 1.00 | 6.21 | 1.33 |
Setup (IV)
We intend to assess the utility of the proposed method in selecting active variables at tail quantiles. We generate data from a linear regression model, where the covariates Zi are generated from the multivariate normal distribution Np(0, Σ) with the same covariance matrix used in setup (I). The effects of Z(1), Z(2), Z(5), Z(12), Z(16) and Z(25) are set to be 1.5, 1.25, 2, 4/3, 2 and 3 constantly over τ, and all other covariates have constantly zero coefficients across τ. For each simulated dataset, to identify active variables at quantile level τ = 0.05, we implement SS(0.05) as well as AW1, AW2, AW3, L1-QR, PS1 and PS2 with Δ = [0.03, 0.07]. The results, based on 400 simulated data sets, are reported in Table 3. Comparing SS(0.05) with SS(0.25), SS(0.50), and SS(0.75) in setup (I), we observe larger variability in model selection for quantile regression at the tail quantile τ = 0.05, reflected by a relatively small PCF of 51.0%. On the other hand, the proposed methods, AW1, AW2 and AW3, still maintain satisfactory model selection performance with PCFs above 80%. This demonstrates that, by accumulating information in the neighborhood of a tail quantile, the proposed method may boost the power of correct model selection at tail quantiles. We also conduct simulations with Δ chosen as [0.025, 0.075]. Such a choice of Δ may represent the same interest in the response distribution as Δ = [0.03, 0.07], so it would be desirable for the variable selection results under these two choices of Δ to be in close proximity. The results in Table A.1 (see the supplemental article [Zheng et al. (2015)]) are very similar to those in Table 3, and the results in Table A.2 further demonstrate good agreement in the selected variables between the two choices of Δ, [0.03, 0.07] and [0.025, 0.075]. These results suggest that the proposed estimators are quite robust to reasonable variations in the choice of Δ.
Table 3.
Simulation results over Δ = [0.03, 0.07] for setup (IV)
| set-up | Method | NCN | NIN | PUF(%) | PCF(%) | POF(%) | RAEEΔ |
|---|---|---|---|---|---|---|---|
| (IV) | AW1 | 5.83 | .01 | 11.8 | 87.2 | 1.0 | 2.15 |
| | AW2 | 5.87 | .03 | 11.8 | 86.2 | 2.0 | 1.63 |
| | AW3 | 5.80 | .03 | 17.8 | 80.5 | 1.8 | 1.62 |
| | SS(0.05) | 5.99 | .85 | 1.0 | 51.0 | 48.0 | 1.28 |
| | L1-QR | 6.00 | 35.53 | .0 | .0 | 100.0 | 3.85 |
| | PS1 | 6.00 | 3.18 | .2 | 13.8 | 87.0 | 1.29 |
| | PS2 | 6.00 | 3.18 | .2 | 13.8 | 87.0 | 10.55 |
5. A real data example
We now illustrate the application of the proposed method by analyzing a microarray data set reported by Scheetz et al. (2006). This data set contains expression values of 31,042 probe sets on 120 twelve-week-old male offspring of rats. As in Huang et al. (2008), Kim et al. (2008) and Wang et al. (2012), we are interested in identifying genes whose expressions are predictive of the expression of gene TRIM32, which is associated with a genetically heterogeneous disease of multiple organ systems. The probe corresponding to gene TRIM32 is 1389163_at. Our approach to finding relevant genes is to apply high dimensional quantile regression analyses to the remaining probe sets. In our analysis, we further narrow down the problem to selecting variables that impact ordinary expression levels of probe 1389163_at, which may be roughly captured by the middle part of the response distribution. Therefore, we formulate the target of our analysis as
, considering two reasonable choices of Δ, (0.2, 0.8) and (0.25, 0.75).
We use the data set available in the R package flare, which has been processed to exclude probes that are not expressed or lack variation. There remain 200 probe sets serving as covariates. We choose AW2 as the representative of the proposed globally concerned method. In addition, we also implement the adaptive-LASSO method for locally concerned quantile regression, SS(τ), at τ = 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, as well as the pointwise approaches, PS1 and PS2, described in Section 4.
To evaluate each method, we compute its prediction error as follows. First, we randomly split the 120 rats into a training set and a test set, each consisting of 60 subjects. We apply the method to the training data set and obtain the estimator of β0(τ), denoted by β̂train(τ). Next, we calculate the prediction error over the test set as
For SS(τ), we calculate PE(Δ) by treating the coefficient estimate as a constant valued function over τ ∈ Δ.
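Since the display defining PE(Δ) is not reproduced in this excerpt, the sketch below uses a plausible grid approximation: average the check loss over the test subjects at each τ and integrate over Δ. The |Δ| normalization is our assumption, added so that values are comparable across choices of Δ:

```python
import numpy as np

def prediction_error(X_test, y_test, beta_by_tau, taus):
    """Test-set prediction error accumulated over Delta (assumed form:
    average check loss at each tau, integrated over the tau-grid and
    normalized by the grid length).

    beta_by_tau: array of shape (len(taus), p), rows are beta_hat_train(tau)."""
    taus = np.asarray(taus, dtype=float)
    pe_tau = np.empty(len(taus))
    for k, t in enumerate(taus):
        r = y_test - X_test @ beta_by_tau[k]
        pe_tau[k] = np.mean(r * (t - (r < 0)))  # check loss rho_t
    # trapezoid rule over the grid, then normalize by |Delta|
    integral = np.sum((pe_tau[1:] + pe_tau[:-1]) / 2 * np.diff(taus))
    return integral / (taus[-1] - taus[0])
```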
Table 4 lists the probe sets selected by each method using all 120 observations. In addition, we present the averages of PE(Δ) along with the corresponding standard deviations (within parentheses). All calculations are based on 400 replications of random splitting into training and test sets.
Table 4.
Probe sets identified by various methods
| Δ (or τ for SS) | Method | Probes | PE(Δ), Δ = (0.2, 0.8) | PE(Δ), Δ = (0.25, 0.75) |
|---|---|---|---|---|
| (0.2, 0.8) | AW2 | “24565” “25141” “25367” “29045” | .146 (.016) | — |
| (0.25, 0.75) | AW2 | “24565” “25141” “25367” “29045” | — | .124 (.011) |
| (0.2, 0.8) | PS1 | 17 probes | .151 (.016) | — |
| (0.25, 0.75) | PS1 | 13 probes | — | .129 (.012) |
| (0.2, 0.8) | PS2 | 17 probes | .143 (.013) | — |
| (0.25, 0.75) | PS2 | 13 probes | — | .124 (.011) |
| 0.25 | SS | “6222”, “15863”, “22140”, “25141”, “25439”, “29045” | .191 (.024) | .151 (.018) |
| 0.30 | SS | “25141”, “29045” | .181 (.023) | .146 (.017) |
| 0.35 | SS | “14949”, “24245”, “25141”, “29045” | .174 (.022) | .141 (.016) |
| 0.40 | SS | “25141”, “29045” | .170 (.020) | .138 (.015) |
| 0.45 | SS | “25141” | .167 (.020) | .135 (.015) |
| 0.50 | SS | “25141” | .165 (.020) | .135 (.015) |
| 0.55 | SS | “25141” | .164 (.020) | .136 (.015) |
| 0.60 | SS | “21092”, “24565”, “25141” | .167 (.021) | .137 (.017) |
| 0.65 | SS | “24565”, “25141” | .169 (.023) | .140 (.018) |
| 0.70 | SS | “11711”, “24565”, “25141” | .176 (.025) | .146 (.019) |
| 0.75 | SS | “24565”, “25141”, “25367” | .186 (.028) | .155 (.023) |
From Table 4, it is clear that the set of probes selected by the locally concerned quantile method varies considerably with τ. Only probe “25141” is selected by all SS(τ)’s considered. Probe “24565”, which is consistently selected by SS(τ) for τ = 0.60, 0.65, 0.70, 0.75, does not show up in the estimated active variable set at lower τ’s. These results may suggest a heterogeneous relationship across different segments of the response distribution, but the variations are partly due to the sensitivity of the variable selection methods as τ varies. In fact, there is quite large variability in the selected probes at low quantile indices. For example, SS(0.25) selects 6 probes, but SS(0.30) selects only 2. It would be difficult to interpret such variable selection results at two close quantile indices.
To identify , PS1 and PS2, which take the union of variables selected by the locally concerned SS(τ), select a total of 13 probes for Δ = (0.25, 0.75) and 17 probes for Δ = (0.2, 0.8). In contrast, the proposed globally concerned quantile regression AW2 selects only 4 probes but renders very competitive prediction errors. It has the smallest PE(Δ) when Δ = (0.25, 0.75) and the second smallest PE(Δ) when Δ = (0.2, 0.8). These results show that the globally concerned quantile regression approach can mitigate the over-fitting problem of the alternative methods based on the local quantile regression approach. Furthermore, AW2 selects the same set of probes under the two slightly different choices of Δ. This suggests the robustness of our method to small changes of Δ, a desirable feature from the perspective of model selection.
6. Remarks
We aim to provide a rigorous statistical approach to identifying variables that are predictive of some or all conditional quantiles with levels in a predetermined set Δ. Our work does not directly address how Δ should be chosen, because the choice of Δ should align with the scientific problem at hand. In practice, Δ can be chosen as a large interval, say [0.1, 0.9], to reflect an interest in normal outcomes, or as a smaller interval, say [0.75, 0.9], to reflect an interest in the upper tail of the response distribution. Even when one is interested in a single quantile level τ, our work suggests that applying the proposed method with Δ chosen as a small interval containing τ can lead to more stable variable selection results when the sample size n is limited (relative to p).
In applications, it is likely that we have more than one reasonable choice of Δ. For example, either (0.7, 0.9) or (0.75, 0.9) can be a reasonable choice of Δ when we are interested in identifying variables that impact the upper tail of the response variable. This reinforces our belief that the variable selection results should be quite insensitive to small changes in the choice of Δ. Our simulation studies and data example suggest that our approach is robust to small variations of Δ. These empirical results further endorse the practical value of the proposed method.
7. Technical proofs
We present the proofs of our main results in this section. We start by introducing some notation. Given a random sample Z1, ···, Zn, we adopt the following empirical process notations. Let and . Let ψτ(u) = τ − 1{u < 0} denote the score function of ρτ(u) and . We define (t) = {β ∈ ℝp : βb = 0, infτ∈Δ ||β − β0(τ)|| ≤ t} and Rs(B) = {δ ∈ Rs : ||δ|| ≤ (s log2 n/n)1/2B} for some B > 0. We use τ* to denote min{Δ ∪ (1 − Δ)} and ∂A to denote the boundary of a set A.
As mentioned in Section 2.1, it is more meaningful to investigate the globally concerned quantile regression model (2.1) when the support of X is compact. However, our theoretical work can be established under more general conditions, where the magnitude of the covariates is allowed to increase at a suitable rate with n. In particular, the following condition is assumed for the proof:
-
(C5) [Condition on the largest covariates] We assume
for some M1n = o(n1/2/(s log n)3/2) and .
We denote the event that max2≤j≤p |σ̂j − 1| ≤ 1/2, the event that , and the event that by Ω0, Ω1, and Ω2, respectively. According to conditions (C2) and (C5), we have , and .
Next, we present the technical lemmas used in the proofs of our theorems and corollaries. The proofs of lemmas and Corollaries 1–2 are relegated to Section 2 of the supplemental article [Zheng et al. (2015)].
Lemma 7.1
Under conditions (C1)-(C4), if , then for any δ ∈ Rs with ||δ|| ≤ t, we have
| (7.1) |
Lemma 7.2
Under conditions (C2)-(C4), with probability at least 1 − 16/n3 − γ0, we have
| (7.2) |
Lemma 7.3
Under condition (C4), given a fixed τ0, with probability at least 1 − 8/a, we have
Lemma 7.4
Under conditions (C1)-(C5), given any α ∈ Rs such that ||α|| = 1, if s = o(n1/3), then we have
Lemma 7.5
Under conditions (C1), (C3) and (C4), given an α ∈ Rs such that ||α|| = 1, αTMn(τ, 0) converges weakly to a mean zero Gaussian process with variance .
Lemma 7.6
Under condition (C1), if exists for some τ0 ∈ Δ, then we have
almost surely.
Lemma 7.7
Under conditions (C1), (C2), (C3), and (C4+), for some k ≤ Cm(n) = o(n/log p), with probability at least 1 − 16 exp(−(A2 − 2) log p) − γ0, we have
for some constant A > 2.
Lemma 7.8
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
Lemma 7.9
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
Lemma 7.10
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
with probability at least 1 − 16 exp(−(A2 − 2) log p) − 12/p − γ0n.
Lemma 7.11
Under the same conditions as in Theorem 3.3, with probability at least 1 − 4 exp(C3s log p/2 + 4s log n) − γ2, for some constant C3 > 1, we have
In the following, we provide the proofs for the theoretical results in Section 3.2.
Proof of Theorem 3.1
We consider Qn(β; τ) − Qn(β0(τ); τ) over {β ∈ ℝp : infτ∈Δ ||β − β0(τ)|| ≤ t}
According to Lemma 7.1, if , then we obtain
| (7.3) |
By Lemma 7.2, with probability at least 1 − 16/n3 − γ0n,
| (7.4) |
For I3, since , we have
| (7.5) |
(7.3), (7.4), and (7.5) together imply that with probability at least 1 − 16/n3 − γ0n,
where , with a sufficiently large C1.
Since Qn(β; τ) is convex in β, , the minimizer of (2.2) over Rs, must lie within { }. This completes the proof of Theorem 3.1.
Proof of Theorem 3.2
According to Theorem 3.1, we know that
for some B > 0. Therefore, minimizing the objective function Qn(β; τ) from (2.2) over Rs is equivalent to minimizing n−1/2[Qn(β0(τ) + δ; τ) − Qn(β0(τ); τ)] over Rs(B) with probability approaching 1.
Given any δ ∈ Rs(B), we have
We consider II2 first. By Lemma 7.4, we have uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.6) |
Next we deal with II4. From the proof of Lemma 7.1, it is easy to see that, uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.7) |
Now we deal with II1. Note that,
then we obtain that
is a bounded function. Applying arguments similar to those in Lemma 7.4, we have, uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.8) |
For II3, since E[Mn(τ, 0)] = 0, we have
| (7.9) |
Now we consider II5. On one hand, if , then
| (7.10) |
Combining (7.6), (7.7), (7.8), (7.9) and (7.10) together, we have
Let δ̂ be the minimizer of L(δ; τ) := n1/2δTHτδ−δTMn,τ(0)+n−1/2λn[ω(τ)∘sgn(β0(τ))]Tδ+op (||δa||) over Rs(B). Then we have
Given any α ∈ ℝp, ||α|| = 1, applying Lemma 7.5 yields
converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′).
On the other hand, if n−1/2λn||ωa|| = op(1), then
| (7.11) |
Consequently,
With similar arguments, we can show that converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′). This completes the proof of Theorem 3.2.
Proof of Theorem 3.3
According to the KKT conditions, to show that is a global minimizer of Qn(β; τ) over ℝp, we only need to check the following condition:
| (7.12) |
By Theorem 3.1, we know that Pr (supτ∈Δ ||β̂o(τ) − β0(τ)|| ∈ Rs(B)) → 1. Therefore, if we can show that
| (7.13) |
then (7.12) follows immediately, and the probability that is also the global minimizer of (2.2) over ℝp approaches 1.
We first state some results which will be used in the following proof. Under the conditions stated in the theorem, for all δ ∈ Rs(B), we have
| (7.14) |
where the first inequality follows from taking the conditional expectation given Xi and applying condition (C1), and the second inequality follows from the Cauchy–Schwarz inequality. Moreover, we have
| (7.15) |
In the following, we restrict our attention to Ω0. Let αj denote the vector whose jth component is 1 and all other components are 0. Then .
First, we evaluate III2. By (7.14), we have
| (7.16) |
Next, we consider III3.
Applying Lemma 2.3.7 in Van Der Vaart and Wellner (2000) yields
We have
| (7.17) |
where the first two inequalities are elementary, the third inequality follows from Lemma 1.5 in Ledoux and Talagrand (1991), and the last inequality follows from condition (C2).
Conditional on (Xi, Yi), i = 1, ···, n, we can partition Δ with a grid {τ0, τ1, τ2, ···, τn} such that for each 1 ≤ i ≤ n, there is exactly one observation (X(i), Y(i)) satisfying . Here, τ0 = 0. Then we have
As τ moves from τ0 to τn, we observe
where . Since the V(i)’s are independent and the ’s are symmetric about 0, Lévy’s inequality gives P(max1≤k≤n |Sk| ≥ u) ≤ 2P(|Sn| ≥ u). Therefore, we have
Choosing , we obtain
where the inequality follows from the arguments in (7.17). The above inequality and (7.17) together imply
| (7.18) |
Now we consider III1. By Lemma 7.11, we can show
| (7.19) |
Let . Then (7.16), (7.18), and (7.19) together yield
which implies
| (7.20) |
, coupled with , implies that, with probability approaching 1, λnωj(τ) > K̃ for all j > s. This, (7.20), and condition (C5) together imply that (7.13) holds with probability tending to 1, and so does (7.12). Therefore, is a global minimizer of (2.2) with probability approaching 1. This completes the proof of Theorem 3.3.
Proof of Theorem 3.4
Theorem 3.4 is a direct implication of Lemmas 7.8 and 7.9.
Supplementary Material
Acknowledgments
Qi Zheng was supported by NIH grant R01HL113548. Limin Peng was supported by NSF Award DMS-1007660 and NIH grant R01HL113548. Xuming He was supported by NSF Awards DMS-1307566 and DMS-1237234 and National Natural Science Foundation of China Grant 11129101.
Footnotes
Supplement Material: Additional simulation results, Proofs of technical lemmas and corollaries, Justification for the proposed grid approximation, and Sample Code (doi: COMPLETED BY THE TYPESETTER; Supplement rev 042215.pdf).
References
- Bassett G, Koenker R. An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association. 1982;77:407–415.
- Belloni A, Chernozhukov V. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics. 2011;39:82–130.
- Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:759–771.
- Fan J, Fan Y, Barut E. Adaptive robust variable selection. The Annals of Statistics. 2014;42:324–351. doi: 10.1214/13-AOS1191.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
- Fan Y, Tang C. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society, Series B. 2013;75:531–552.
- He X, Shao Q. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;73:120–135.
- Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Kim Y, Choi H, Oh HS. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
- Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
- Koenker R. Quantile Regression. Cambridge University Press; 2005.
- Koenker R, Bassett GW. Regression quantiles. Econometrica. 1978;46:33–50.
- Koenker RW, D'Orey V. Algorithm AS 229: Computing regression quantiles. Journal of the Royal Statistical Society, Series C. 1987;36:383–393.
- Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol. 23. Springer; Berlin: 1991.
- Li Y, Liu Y, Zhu J. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association. 2007;102:255–268.
- Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics. 2009;37:3498–3528.
- Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Nishii R. Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics. 1984;12:758–765.
- Peng L, Xu J, Kutner N. Shrinkage estimation of varying covariate effects based on quantile regression. Statistics and Computing. 2013. doi: 10.1007/s11222-013-9406-4.
- Portnoy S. Asymptotic behavior of the number of regression quantile breakpoints. SIAM Journal on Scientific and Statistical Computing. 1991;12:867–883.
- Qian J, Peng L. Censored quantile regression with partially functional effects. Biometrika. 2010;97:839–850.
- Rocha G, Wang X, Yu B. Asymptotic distribution and sparsistency for ℓ1-penalized parametric M-estimators with applications to linear SVM and logistic regression. 2009. http://arxiv.org/abs/0908.1940.
- Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
- Talagrand M. The Generic Chaining. Springer; Berlin: 2005.
- van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 2000.
- Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
- Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association. 2012;107:214–222. doi: 10.1080/01621459.2012.656014.
- Wang T, Zhu L. Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis. 2011;102:1141–1151.
- Welsh AH. On M-processes and M-estimation. The Annals of Statistics. 1989;17:337–361.
- Wu Y, Liu Y. Variable selection in quantile regression. Statistica Sinica. 2009;19:801–817.
- Yang Y, He X. Bayesian empirical likelihood for quantile regression. The Annals of Statistics. 2012;40:1102–1131.
- Zhang CH, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36:1567–1594.
- Zhang Y, Li R, Tsai CL. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
- Zheng Q, Gallagher C, Kulasekera KB. Adaptive penalized quantile regression for high dimensional data. Journal of Statistical Planning and Inference. 2013;143:1029–1038.
- Zheng Q, Peng L, He X. Supplement to "Globally adaptive quantile regression with ultra-high dimensional data". 2015. doi: 10.1214/15-AOS1340.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. The Annals of Statistics. 2008;36:1108–1126.