Abstract
Quantile regression has become a valuable tool for analyzing heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on the examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties and, more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal.
Keywords and phrases: Ultra-high dimensional data, Varying covariate effects, Adaptive penalized quantile regression, Model selection oracle property
1. Introduction
We consider the problem of analyzing high and ultra-high dimensional data, which have become widely available in a large variety of scientific fields, such as biomedical imaging, signal processing, machine learning, and finance. In ultra-high dimensional data sets, the number of candidate covariates p is allowed to increase at an exponential rate in the number of observations n, but only a relatively small number s of them have real impact on the response variable. In such a situation, it becomes very useful but challenging to identify the relevant variables and measure their influences.
Effort has been made to address this unprecedented challenge in the context of linear regression (Meinshausen and Buhlmann, 2006; Zhang and Huang, 2008; Huang et al., 2008; Kim et al., 2008; Lv and Fan, 2009; Fan and Lv, 2011, among others). Quantile regression (Koenker and Bassett (1978)) has emerged as a flexible tool to model the effects of covariates on the conditional quantiles, and it permits investigation of heterogeneity across quantiles. For example, meteorologists typically focus on the extreme temperatures in climate studies. Gaussian model based procedures would be inadequate for addressing scientific questions of this kind, and quantile models have a natural role to play. Most of the current literature on quantile regression for high dimensional data examines covariate effects at a single or multiple prespecified quantile levels, which we shall refer to as locally concerned quantile regression. A number of authors, for example, Knight and Fu (2000), Li et al. (2007), Zou and Yuan (2008), Wu and Liu (2009), Rocha et al. (2009), considered locally concerned quantile regression using penalization to achieve sparsity. Several authors, such as Wang et al. (2012), Zheng et al. (2013) and Fan et al. (2014), investigated cases with ultra-high dimensional covariates.
There are subtle and yet important issues with the practical use of locally concerned quantile regression. For example, when interest lies in identifying variables that impact the upper quantiles, should one consider just a single τ-th quantile at τ = 0.9, or several quantile levels? There is usually no clear scientific support for choosing one τ over another nearby value. With a limited sample size, there is variability in the set of selected variables as τ changes, even if just slightly. Such variability is clearly undesirable for interpretation. More importantly, some important variables are likely to be missed, simply due to chance, if we perform variable selection at any given τ.
To address the limitations of locally concerned quantile regression, we propose an alternative model selection strategy, called globally concerned quantile regression, that examines regression quantiles over a set of quantile levels, denoted by Δ ⊂ (0, 1). Typically, Δ is selected as an interval of quantile levels that captures the part of the conditional distributions of interest. For example, Δ may be chosen as [0.4, 0.6] if we would like to identify variables that impact the center of the conditional distributions, or [0.75, 0.9] if we are interested in the upper tails. If we are interested in identifying variables that have impact on any quantile of the conditional distributions, we may choose Δ = [0.1, 0.9]. If Δ is a singleton set or a finite set, globally concerned quantile regression reduces to locally concerned quantile regression. Therefore, we can take the view that globally concerned quantile regression extends locally concerned quantile regression by allowing for contemporaneous evaluation of the covariate effects at a continuum of quantile levels. This additional flexibility offered by globally concerned quantile regression can enhance high-dimensional sparse modeling. Specifically, a globally concerned quantile regression approach can take advantage of all useful information across quantiles to improve the stability of variable selection. Even if an active variable is missed by locally concerned quantile regression at the targeted quantile level, its trail may still be captured within the neighborhood of the quantile level.
A naive pointwise approach for globally concerned quantile regression in the ultra-high dimensional setting would perform an existing locally concerned penalization method separately at each τ ∈ Δ (including tuning parameter selection), and then take the union of active variable sets identified at each τ. Such an approach tends to result in considerably overfitted models, as clearly demonstrated by simulation studies in Section 4. The model overfitting phenomenon is analogous to the inflated type-I error in multiple comparisons. Similar to the strategy of controlling the familywise error rates by adjusting critical values of individual tests in multiple comparison problems, we consider appropriate selection of the tuning parameter in the penalization.
Belloni and Chernozhukov (2011) investigated the L1-penalized quantile regression (L1-QR) for the ultra-high dimensional data setting and established some useful results that hold uniformly across Δ. They established a near oracle consistency rate for p > n, and showed that the model identified by L1-QR contains the true model as a submodel. Nevertheless, there remain some important open questions. First, in the ultra-high dimensional setting, log p could be o(nb) for some b > 0. Thus the convergence rate is less satisfactory. Notice that AR-Lasso proposed by Fan et al. (2014) enjoys a convergence rate at a given quantile level. We ask if we can achieve the same convergence rate for the penalized regression quantiles uniformly in Δ. Secondly, as commented by Fan et al. (2014) and Wang et al. (2012), L1-QR typically does not possess the desired model selection oracle property. Can this deficiency be corrected by adopting an adaptively weighted L1-penalty instead of the L1-penalty under globally concerned quantile regression? Thirdly, L1-QR requires an assumption that the active covariate effects do not cross 0 as τ varies in Δ. We hope that this restriction can be removed.
Motivated by the precursor work by Peng et al. (2013) on variable selection under globally concerned quantile regression with a fixed covariate dimension, we study the penalization strategy of adaptively weighting the L1-penalty in the ultra-high dimensional setting, in combination with a GIC-type uniform selector of the tuning parameter. The theoretical development for the high dimensional case cannot rely on the traditional empirical process arguments used in Peng et al. (2013); instead we modify a chaining idea from Talagrand (2005) to circumvent the difficulty with high dimensions. We are able to show that with probability tending to one, the proposed estimator can successfully identify the set of relevant covariates, including those having effects on some or all quantile levels in Δ. The model selection oracle property can be established, implying the same convergence rate for the proposed estimator and the oracle estimator. We employ empirical process techniques to derive the convergence rate of the oracle estimator and thus that of the proposed estimator. We demonstrate that the adaptively weighted penalties can reduce the bias induced by the L1 penalties. Compared to L1-QR, which may select a model of size as large as O(n/log p), the proposed method can achieve consistent model selection, and hence an improved estimation convergence rate, which we show holds uniformly in τ ∈ Δ. When Δ is a singleton set, so that our estimator becomes one for penalized locally concerned quantile regression, the convergence rate can be further improved to the oracle rate. Our theoretical investigations do not preclude the cases where the covariate effects equal zero at some quantile.
In addition, we show that any linear combination of the proposed estimator converges weakly to a Gaussian process as a function of τ ∈ Δ. Such weak convergence results have been lacking in the high dimensional setting, possibly because the increasing dimensionality makes the classical approaches to weak convergence inapplicable. With a mild constraint on the model size, we show, for the first time, that the proposed GIC-type uniform tuning parameter selector can ensure consistent model selection in the ultra-high dimensional setting.
The rest of the article is organized as follows. In Section 2, we introduce a globally concerned quantile regression framework and propose an adaptively weighted L1-penalization procedure. In Section 3, we investigate the theoretical properties of the proposed estimator. We conduct Monte Carlo studies to evaluate its finite sample performance in Section 4. In Section 5, we demonstrate the proposed method with a real data example. Further discussions are contained in Section 6. We defer all technical proofs to Section 7.
2. Adaptively weighted L1-penalized quantile regression
2.1. A globally concerned quantile regression framework
We consider a globally concerned quantile regression model, which takes the form,
QY(τ|X) = XTβ0(τ),  τ ∈ Δ, | (2.1) |
where QY(τ|X) := inf{y : Pr(Y ≤ y|X) ≥ τ} denotes the τth conditional quantile of a response variable Y given X, X := (1, ZT)T is a p × 1 vector of covariates, β0(τ) := (α0(τ), β0(2)(τ), ···, β0(p)(τ))T is a p × 1 vector of unknown coefficient functions of τ, and Δ ⊂ (0, 1) is a prespecified set of interest. Here, Δ can take a general form as the union of multiple disjoint intervals. In contrast, a locally concerned quantile regression model can be expressed as (2.1) but with Δ being a singleton set or a countable set. The coefficients α0(τ) and β0(j)(τ), j = 2, ···, p, represent the intercept and covariate effects on the τ-th conditional quantile of Y given X, respectively, and are allowed to vary over τ ∈ Δ. It is worth pointing out that the globally concerned quantile regression model (2.1) bears a meaningful distinction from a location-shift linear model only when the covariates are confined to a compact set (Koenker, 2005). By this consideration, a bounded covariate space is assumed throughout the presentation of theoretical results in Section 3. Nevertheless, our technical proofs in Section 7 adopt a weaker assumption on covariates, which allows the support of covariates to expand with n.
We consider the ultra-high dimensional setting, where log p = o(nb) for some b > 0. Let u(j) denote the jth component of the vector u. The set of relevant/active covariates under the globally concerned quantile regression model (2.1) is defined as
SΔ := {j ∈ {2, ···, p} : β0(j)(τ) ≠ 0 for some τ ∈ Δ}.
The implication from this definition is that we intend to identify all variables that impact the selected segment of the conditional distribution of the response (reflected by the choice of Δ). A relevant variable may influence all or some of the quantiles of interest. To ensure the model identifiability of (2.1), we impose a global sparsity condition, which assumes that the number of relevant covariates is small relative to the sample size, that is, |SΔ| = s = o(n), where |·| denotes the cardinality. It is easy to see that this global sparsity condition implies the local sparsity condition for a τ-th quantile regression model, |Sτ| = o(n), where Sτ := {j ∈ {2, ···, p} : β0(j)(τ) ≠ 0}.
It is important to note that the stronger global sparsity condition is indispensable for globally concerned quantile regression. Suppose we only assume the local sparsity for each τ ∈ Δ. Although |Sτ| = o(n) for all τ ∈ Δ, |SΔ| = |⋃τ∈Δ Sτ| could still be greater than n when β0(τ) is allowed to change with τ and so is Sτ. On the other hand, Belloni and Chernozhukov (2011) studied the globally concerned L1-QR under the local sparsity condition, |Sτ| < s ≤ n/log(n ∨ p) for all τ ∈ Δ. This condition, when coupled with their constraint that the active covariate effects do not cross 0, implies |SΔ| = o(n), the global sparsity condition assumed in this paper.
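To make the distinction between local and global sparsity concrete, the following Python sketch computes the local active sets Sτ on a τ-grid and their union SΔ for a toy coefficient function. All names (active_sets, beta0), the grid discretization, and the zero tolerance are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def active_sets(beta0, tau_grid, tol=1e-12):
    """Return the local active sets S_tau and their union S_Delta."""
    S_tau = {tau: set(np.flatnonzero(np.abs(beta0(tau)) > tol))
             for tau in tau_grid}
    S_Delta = set().union(*S_tau.values())
    return S_tau, S_Delta

# Toy example: covariate 0 acts only on upper quantiles, covariate 1 on all.
beta0 = lambda tau: np.array([max(tau - 0.5, 0.0), 1.0, 0.0])
S_tau, S_Delta = active_sets(beta0, tau_grid=[0.3, 0.5, 0.7])
# S_tau at 0.3 is {1}, at 0.7 it is {0, 1}; the union S_Delta = {0, 1} is
# the globally relevant set targeted by globally concerned quantile regression.
```

The example illustrates why a variable with effects only on part of Δ still belongs to SΔ, even though it is absent from some local sets Sτ.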
Without loss of generality, we assume that SΔ = {2, ···, s}, i.e. the first s variables have non-vanishing effects on the conditional distribution of interest. We use SΔc := {s + 1, ···, p} to denote the collection of all irrelevant variables. We allow the number of covariates p = pn and the true model size s = sn to increase with the sample size n. To ease the presentation, we often omit the subscript n when it is clear from the context.
2.2. Adaptively weighted L1-penalized quantile regression
Given a random sample consisting of n independent and identically distributed observations, denoted by {(Yi, Zi), i = 1, ···, n}, we use the adaptively weighted L1-penalized quantile regression estimator β̂λn(τ), which is the minimizer of the following objective function:
Σi=1n ρτ(Yi − XiTβ(τ)) + λn Σj=2p ωj(τ)|β(j)(τ)|, | (2.2) |
where ρτ(t) := t(τ − 1{t ≤ 0}) is the τ-th quantile loss function, λn is a tuning parameter controlling the overall model complexity over Δ, and ωj(τ), j = 2, ···, p, are nonnegative weight functions assigned to the β(j)(τ)’s. Consequently, the estimated global model,
ŜΔ := {j ∈ {2, ···, p} : β̂λn(j)(τ) ≠ 0 for some τ ∈ Δ},
is the collection of all active variables identified by β̂λn(τ) over Δ. The general requirements on ωj(τ) and λn are expounded in the theoretical studies presented in Section 3. Note that adopting adaptive weights distinguishes the proposed estimation method from L1-QR, in which the penalties assigned to covariates are non-adaptive.
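For concreteness, the objective in (2.2) at a fixed τ can be sketched as a check loss plus a weighted L1 penalty on the non-intercept coefficients. The Python rendering below is purely illustrative; the 1/n averaging of the loss and the function names are assumptions, and the paper's actual computation proceeds through the linear programming formulation of Section 2.3.

```python
import numpy as np

def check_loss(t, tau):
    # rho_tau(t) = t * (tau - 1{t <= 0})
    return t * (tau - (t <= 0))

def penalized_objective(beta, tau, Y, X, lam, omega):
    """Check loss plus adaptively weighted L1 penalty at quantile level tau.
    The intercept (first coefficient) is unpenalized, matching the paper's
    indexing j = 2, ..., p; the 1/n averaging is an assumed normalization."""
    fit = np.mean(check_loss(Y - X @ beta, tau))
    penalty = lam * np.sum(omega * np.abs(beta[1:]))
    return fit + penalty
```

Evaluating this objective over candidate coefficient vectors makes explicit how λn trades goodness of fit against the weighted model complexity.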
The proposed globally concerned quantile regression framework allows for more flexibility in selecting the form of ωj(τ), as compared to locally concerned quantile regression. For example, we can choose ωj(τ) to be one of the following functions:
(w1) ωj(τ) = 1/|β̃(j)(τ)|;
(w2) ωj(τ) = 1/supτ′∈Δ |β̃(j)(τ′)|;
(w3) ωj(τ) = 1/∫Δ |β̃(j)(τ′)| dτ′;
where β̃(τ) is any uniformly consistent initial estimate of β0(τ). The design of these adaptive weight functions essentially follows the same idea behind the classical adaptive-Lasso (Zou, 2006) to assign different penalty levels according to the conjectured variable importance. The choice (w1) corresponds to a traditional choice of adaptive weights as seen in adaptive-Lasso (Zou, 2006), while the other two examples reflect a global use of the information on β0(τ), because ωj(τ) is the same for all τ ∈ Δ. By using (w2) or (w3), we take the view that the variable importance is respectively captured by the maximum signal strength or the cumulative signal strength over τ ∈ Δ, corresponding to the L∞ and L1 norm of coefficient functions. Clearly other function norms may also be used to define adaptive weights. Compared to (w2) and (w3), the adaptive weight (w1) is less tailored to the objective of identifying the set of globally relevant variables; it changes with τ and reflects the local relevance rather than the global importance of variables. As elaborated in Section 3, the proposed adaptively weighted L1-penalized quantile regression can achieve the model selection oracle property with all the weights listed above. However, (w1) is generally subject to stronger signal conditions compared to (w2) and (w3). Our numerical studies suggest that adopting (w2) or (w3) may lead to more favorable finite sample performance as compared to using the traditional adaptive weight, (w1).
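The three weight choices can be sketched in Python as follows, operating on an initial estimate β̃ stored on a τ-grid (rows indexing grid points, columns indexing covariates). The matrix representation, the small eps guard against division by zero, and the trapezoidal approximation of the L1-norm integral are all assumptions of this illustration.

```python
import numpy as np

def weights_w1(beta_init, eps=1e-8):
    # (w1): local weights 1/|beta_tilde_j(tau)|, one row per grid point
    return 1.0 / (np.abs(beta_init) + eps)

def weights_w2(beta_init, eps=1e-8):
    # (w2): global weights 1/sup_tau |beta_tilde_j(tau)|  (L-infinity norm)
    return 1.0 / (np.abs(beta_init).max(axis=0) + eps)

def weights_w3(beta_init, tau_grid, eps=1e-8):
    # (w3): global weights 1/integral of |beta_tilde_j(tau)| over Delta
    # (L1 norm), approximated by the trapezoidal rule on the grid
    a = np.abs(beta_init)
    integral = ((a[:-1] + a[1:]) / 2 * np.diff(tau_grid)[:, None]).sum(axis=0)
    return 1.0 / (integral + eps)
```

Note that (w2) and (w3) return one weight per covariate, shared by all τ ∈ Δ, whereas (w1) returns a different weight at each grid point, mirroring the local versus global distinction discussed above.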
For the initial estimator β̃(τ), we recommend using the estimator from L1-QR, which is a minimizer of (2.2) obtained by setting ωj(τ) = 1, j = 2, ···, p. Therefore, the proposed adaptively weighted L1-penalized quantile regression consists of two stages in the implementation. In the first stage, we employ the L1-QR to generate β̃(τ) and then construct the adaptive weight functions ωj(τ), which are used in the second stage to produce the function estimator, β̂λn(·).
Another important component of the proposed penalized estimation is to adopt a common tuning parameter λn for all τ ∈ Δ. This is key to achieving the oracle model selection property for globally concerned quantile regression. If we were to tune λn(τ) separately at each τ for local sparsity, the union of the selected models would include globally irrelevant covariates too often, resulting in inconsistent selection of SΔ. This phenomenon shares a similar spirit with the issue of inflated type I error in the multiple comparison setting, and is clearly demonstrated via some simulations reported in Section 4.
To select λn, we propose a uniform selector of tuning parameter by using a GIC criterion adapted to the globally concerned quantile regression model (2.1):
| (2.3) |
where φn is a sequence converging to 0 as n → ∞. We show in Theorem 3.4 that the proposed GIC criterion ensures the oracle model selection property when a reasonable upper bound on the model size is imposed to eliminate clearly oversized models during model selection. The simulation studies in Section 4 demonstrate the satisfactory behavior of the proposed tuning parameter selection.
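Since the exact form of (2.3) depends on quantities defined above, we only sketch the generic shape of a GIC-type uniform selector: a logarithmic goodness-of-fit term plus the model size times a vanishing sequence φn, minimized over a single λ shared by all τ ∈ Δ. The specific choice φn = log(log n) log(p)/n below and the function names are assumptions of this sketch, not the paper's exact criterion.

```python
import numpy as np

def gic(fit_value, model_size, n, p):
    """Generic GIC-type score: log(fit) + |model| * phi_n with phi_n -> 0.
    The choice phi_n = log(log n) * log(p) / n is a common one in the GIC
    literature and is an assumption here, not the paper's exact (2.3)."""
    phi_n = np.log(np.log(n)) * np.log(p) / n
    return np.log(fit_value) + model_size * phi_n

def select_lambda(candidates, fit_fn, size_fn, n, p):
    # Uniform selection: one lambda shared by all tau in Delta.
    scores = [gic(fit_fn(lam), size_fn(lam), n, p) for lam in candidates]
    return candidates[int(np.argmin(scores))]
```

A small λ yields a good fit but a large model; a large λ shrinks the model but degrades the fit; the criterion balances the two through the single penalty rate φn.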
2.3. Computation Algorithm for β̂λn(τ)
Here we assume that the adaptive weight ωj(τ) is a function or functional of the initial estimator β̃(·).
To begin, we review some important results about standard quantile regression to help understand the proposed computation algorithm. As elaborated in Koenker and Bassett (1978); Bassett and Koenker (1982); Portnoy (1991), the sample regression quantile function is piecewise constant in τ ∈ (0, 1), due to the parametric programming nature of the optimization problem. By Corollary 3.1 of Portnoy (1991), the size of the breakpoint set for a regression quantile function is roughly O(m log m), where m is the number of data points involved in the quantile regression minimization problem. Koenker and D’Orey (1987) provided a detailed account of how to compute a regression quantile function for all τ ∈ (0, 1) via parametric linear programming, including the procedure to determine all the breakpoints. An implementation of Koenker and D’Orey (1987)’s algorithm is available in the rq() function in the R package quantreg.
Below we provide a detailed description of the proposed computation algorithm for β̂λn(·). Our basic strategy for determining the full function path of β̂λn(·) is to formulate β̂λn(τ) as a regression quantile of some augmented datasets. By the general property of regression quantile functions discussed above, we can conclude that β̂λn(τ) is a piecewise constant function changing at only a finite number of breakpoints. Our algorithm delineates how the breakpoint set of β̂λn(·) can be obtained using the rq() function after appropriate data manipulations. We also provide sample code in Section 4 of the supplemental article [Zheng et al. (2015)].
Step 1: Follow Belloni and Chernozhukov (2011) to obtain the tuning parameter λ̃n for computing the L1-QR estimator β̃(τ).
Step 2: Fit the rq() function to the augmented dataset, { ( ), 1 ≤ i ≤ n+2p − 2}, where for 1 ≤ i ≤ n, , and for 1 ≤ j ≤ p − 1. Here ej is a p-dimensional vector with the jth component equal to 1 and all the others equal to 0. The results would include the set of all breakpoints of β̃(·), denoted by R1, and also the value of β̃(τ) at each breakpoint. The function estimator β̃(·) is a piecewise constant function that jumps only at breakpoints in R1.
Step 3: Compute the adaptive weights ωj(τ) (j = 2, ···, p) based on β̃(·). Let
{Ik(R1) : k = 1, ···, |R1|}
be the set of τ-intervals on each of which the weights, ω2(τ), ···, ωp(τ), are all constant. Here |R1| denotes the size of the breakpoint set R1.
Step 4: Calculate GIC(λ) on a sequence of tuning parameter candidates. More specifically, given any fixed tuning parameter λ,
Step 4.1: Set k = 1.
Step 4.2: Fit the rq() function to the augmented dataset: { , 1 ≤ i ≤ n + 2p − 2}, where for 1 ≤ i ≤ n, , and for 1 ≤ j ≤ p − 1. Denote the resulting breakpoint set by R2,k(R1). For each breakpoint τ ∈ R2,k(R1) ∩ Ik(R1), let β̂λ(τ) be the corresponding rq() coefficient estimate.
Step 4.3: Increase k by 1 and go back to Step 4.2 unless k > |R1|.
Step 4.4: Calculate GIC(λ) based on β̂λ(·), which is now fully determined, with the breakpoint set given by ⋃k=1|R1| (R2,k(R1) ∩ Ik(R1)).
Step 5: Find λ̂n that minimizes GIC(λ).
Step 6: Obtain β̂λn(·) over τ ∈ Δ corresponding to λ = λ̂n.
It is worth pointing out that Belloni and Chernozhukov (2011) proposed to choose their tuning parameter, λ̃n, based on simulations of a pivot quantity, and thus Step 1 can be carried out without iterating with other steps. The construction of the augmented dataset in Step 2 is motivated by the fact that the definition of ( ) and ( ) makes . Consequently,
This finding indicates that β̃(τ) can be equivalently formulated as the τth regression quantile of the augmented dataset, {( ), 1 ≤ i ≤ n + 2p − 2}. Thus β̃(τ) is piecewise constant over τ and so is ωj(τ). It allows us to use existing software, such as rq(), to solve the L1-QR as a standard quantile regression problem and to obtain the full function path of the initial estimator, β̃(·). The same idea is applied in Step 4.2, with additional attention paid to incorporating the adaptive weights, which may change with τ. More specifically, given that ωj(τ) is piecewise constant over τ, we can separately handle β̂λn(·) in each τ-interval where the adaptive weights are constant over τ. With data manipulations similar to those for β̃(τ), we can formulate β̂λn(τ) as the τth regression quantile of an augmented dataset, as shown in Step 4.2. Therefore, the breakpoint set of β̂λn(·) in the given τ-interval can be obtained using the rq() function. The union of all such breakpoint sets then gives the breakpoint set of β̂λn(·) (see Step 4.4). In the special case where the weight function is chosen as (w2) or (w3), |R1| would be 1 and the loop between Steps 4.2 and 4.3 would not be needed.
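The correctness of such augmentation constructions rests on the elementary identity ρτ(b) + ρτ(−b) = |b| for every τ ∈ (0, 1), which converts an absolute-value penalty term into a pair of check-loss terms on pseudo-observations with zero responses. A quick numerical verification of the identity in Python (an illustration only; the paper's implementation is in R):

```python
import numpy as np

def rho(t, tau):
    # quantile check loss rho_tau(t) = t * (tau - 1{t <= 0})
    return t * (tau - (t <= 0))

# rho_tau(b) + rho_tau(-b) = |b| for all tau, so each weighted penalty term
# can be written as check losses evaluated at two pseudo-observations with
# response 0 and design rows along +/- e_j; verify over a range of values:
for tau in (0.1, 0.5, 0.9):
    for b in (-2.0, -0.3, 0.0, 0.7, 5.0):
        assert np.isclose(rho(b, tau) + rho(-b, tau), abs(b))
```

This identity is what allows the penalized problem to be solved by an unmodified quantile regression routine on the augmented data.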
Note that the above algorithm enables us to determine β̂λn(·) for all τ ∈ Δ. However, the computational burden could be heavy in ultra-high dimensional settings, since the size of the breakpoint set is roughly O((n + 2p) log(n + 2p)) (Portnoy, 1991). To alleviate the computational burden, one can approximate β̂λn(·) by a cadlag step function that jumps only at a sufficiently fine pre-specified grid in Δ. Specifically, let Tn = {τ0, τ1, …, τM(n)} denote a τ-grid in Δ with inf{Δ} = τ0 < τ1 < … < τM(n) = sup{Δ}, and define its mesh size by ||Tn|| = max{τk − τk−1 : k = 1, …, M(n)}. We propose to approximate β̂λn(·) by the step function β̂λnG(τ) := β̂λn(τk) for τ ∈ [τk, τk+1), k = 0, …, M(n) − 1. With the smoothness of β0(·) assumed in condition (C3) in Section 3, we can show that β̂λnG(·) has the same uniform convergence rate and asymptotic distribution as β̂λn(·) when (ns)1/2 ||Tn|| = o(1). The proof is provided in Section 3 of the supplemental article [Zheng et al. (2015)].
In our simulation studies, we chose a grid of size 2n/5. The estimator based on grid approximation performs well with realistic sample sizes. In our real data analysis, the β̂λn(τ)’s obtained from the exact and approximate calculations select the same set of variables.
Clearly, adopting the grid-based approximation is computationally appealing because it only requires calculating β̂λn(·) at M(n) grid points. Given that the only technical constraint on M(n) is limn→∞ M(n)/(ns)1/2 = ∞, M(n) can be chosen to be much smaller than the total number of exact breakpoints. For example, under our simulation set-up (II) with n = 200 and p = 400, the average number of breakpoints (with (w2) adopted) from 50 simulations is 4130, while the number of grid points that we used in simulations is 2n/5 = 80. Such a dramatic difference between the number of exact breakpoints and M(n) indicates a huge saving in computation from adopting a grid-based approximation instead of the exact calculation of β̂λn(τ) when n and p are large.
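The grid-based approximation amounts to evaluating the fitted process as a cadlag step function over the τ-grid. A minimal Python sketch, with function names and array layout assumed for illustration:

```python
import numpy as np

def step_approx(tau, tau_grid, beta_grid):
    """Cadlag step-function approximation: return the coefficient estimate
    attached to the largest grid point not exceeding tau. Rows of beta_grid
    correspond to the entries of tau_grid (assumed sorted increasing)."""
    k = np.searchsorted(tau_grid, tau, side="right") - 1
    return beta_grid[max(k, 0)]

tau_grid = np.array([0.1, 0.5, 0.9])
beta_grid = np.array([[1.0], [2.0], [3.0]])
# step_approx(0.3, ...) returns the estimate stored at tau = 0.1
```

Only M(n) fits are stored, and any τ ∈ Δ is served by a constant-time lookup, which is the source of the computational saving described above.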
The interpretation of the final model after variable selection may be enhanced by imposing a reasonable semiparametric structure on the coefficient functions in β0(·). For example, constant effects may be assumed for a subset of selected variables according to preliminary scientific information or exploratory analysis. Taking into account such additional information in the post-variable-selection estimation can lead to efficiency gains in addition to simplified interpretations. Such model polishing may follow similar lines of existing work, such as Qian and Peng (2010) and Yang and He (2012), but will not be further pursued in this paper.
3. Theoretical Properties
In this section, we investigate the theoretical properties of the adaptively weighted L1-penalized quantile regression. To ease the presentation, we decompose Xi, i = 1, ···, n, into Xi = (XiaT, XibT)T, where Xia and Xib correspond to the relevant and irrelevant covariates, respectively. Similarly, we write β as β = (βaT, βbT)T. In particular, the true coefficients satisfy β0(τ) = (β0aT(τ), 0T)T.
3.1. Preliminaries
We impose the following regularity conditions to facilitate our technical derivations.
-
(C1) [Condition on the conditional density] Let fτ(·|x) denote the probability density function of Yi given Xi = x. For each x in the support of X, supτ∈Δ fτ(·|x) < f̄ and infτ∈Δ fτ(0|x) ≥ f for some constants f̄, f > 0. Moreover, there exists a constant A0 > 0, such that for all u,
-
(C2)[Conditions on covariates] The covariates are normalized such that , for j = 2, ···, p, and obeys
-
(C3)[Condition on the true regression coefficients β0(τ)] There exists a positive constant L, such that
where ||·|| denotes the l2-norm, and s is the size of SΔ.
-
(C4) [Conditions on the identifiability of the true model] The eigenvalues of E(XXT) are bounded from below and above by some constants λmin > 0 and λmax, respectively. Moreover,
where Rs = {δ ∈ ℝp: δb = 0}.
Conditions (C1)–(C4) basically follow the assumptions imposed in Belloni and Chernozhukov (2011), since we adopt L1-QR to construct the consistent initial estimates. Those conditions are also needed to derive the convergence rate of our proposed estimator. Condition (C1) only imposes mild conditions on the conditional density of the response given covariates, and does not assume homoscedasticity. Condition (C2) requires the covariates to be well-behaved with finite second moments. Condition (C3) implies that the true regression coefficient function is Lipschitz continuous with respect to τ. The eigenvalue conditions in (C4) are widely assumed in the literature on high dimensional models. They are critical to model identifiability. The restricted nonlinear impact (RNI) coefficient q controls the quality of minoration of the quantile loss function by a quadratic function over the true global model.
3.2. Main Results
We start with a hypothetical scenario assuming that all relevant covariates are known in advance, and establish the asymptotic properties of the oracle regularized estimator obtained from it. Then we will show that our proposed estimator can enjoy the same properties despite the lack of true model information. The oracle regularized estimator is defined as the minimizer of (2.2) over Rs. The following theorem shows that the oracle regularized estimator is uniformly consistent over Δ, with the described convergence rate.
Theorem 3.1
Under the regularity conditions (C1)–(C4), if , then the oracle regularized estimator satisfies
Theorem 3.1 shows that if λn and the ωj(τ)’s are appropriately chosen, the bias introduced by the penalty terms can be well controlled. The uniform upper bound indicates the uniform consistency of the oracle regularized estimator, with a convergence rate that is within a logarithmic factor of the optimal rate shown by He and Shao (2000). If we only consider a finite number of τ’s, we obtain the following corollary.
Corollary 3.1
Suppose the regularity conditions (C1)–(C4) hold. Given a fixed τ0, if , then the oracle regularized estimator satisfies
According to Corollary 3.1, the locally concerned oracle regularized estimator can achieve the optimal convergence rate. This suggests that the additional logarithmic factor in Theorem 3.1 is needed to guarantee the uniform convergence.
Next, we establish the weak convergence of the penalized oracle estimator. We define two p × p matrices,
Theorem 3.2
Suppose the regularity conditions (C1)–(C4) hold, s = o(n1/3), and . Given any α ∈ ℝp such that ||α|| = 1, we have
-
if , then we have
converges weakly to a mean zero Gaussian process with covariance . Here ∘ denotes the Hadamard product, ∧ is the minimum operator, and ω(τ) = (ω1(τ), ···, ωp(τ))T.
-
if supτ∈Δ n−1/2λn||ωa(τ)|| = op(1), then
converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′).
Theorem 3.2 states two weak convergence results under two different conditions. Part (a) requires that the minimum signal strength of all relevant covariates be beyond a certain level; this would exclude the cases where some active variables have coefficient functions crossing 0 or partially equal to 0. Part (b) assumes supτ∈Δ n−1/2λn||ωa(τ)|| = op(1). If we choose adaptive weight (w2), with β̃(τ) obtained from Belloni and Chernozhukov (2011), then the condition in (b) is satisfied provided the maximum impact of each relevant variable, j = 2, ···, s, is beyond a certain level. Such a requirement is typical in high dimensional cases, and the varying covariate effects are allowed to cross zero or be partially equal to zero in our framework. Using similar arguments, with adaptive weight (w3), condition (b) is satisfied if the cumulative signal strength of each relevant variable is beyond a corresponding level. Thus, condition (b) is neither weaker nor stronger than condition (a). If we choose (w1) as the adaptive weight, then condition (b) can be satisfied under a signal assumption on the minimum local signal strength. Therefore, Theorem 3.2 entails weaker signal conditions from using the global adaptive weights, (w2) and (w3), than from adopting the local adaptive weight, (w1).
For the oracle regularized estimator to have asymptotic normality, the true model size is allowed to be o(n1/3), which matches the fastest model growth rate allowed in Welsh (1989) and He and Shao (2000) for the unpenalized quantile regression estimator to be asymptotically normal.
If we can show that the oracle regularized estimator is also the minimizer of (2.2) over ℝp with probability tending to 1, then our proposed estimator enjoys the same properties established in Theorem 3.1 and Theorem 3.2. The following theorem shows that this holds with probability approaching 1.
Theorem 3.3
Suppose the regularity conditions (C1)–(C4) hold. Furthermore, we assume the following conditions hold:
and n/(s2 log p) → ∞. If and , then we have
Theorem 3.3 not only establishes the model selection oracle property of β̂λn(τ), but also indicates that β̂λn(τ) enjoys weak convergence under the mild conditions stated in Theorem 3.2. Moreover, Theorem 3.3, coupled with Theorem 3.1, indicates that the bias of the proposed estimator can be bounded at the same rate achieved by Fan et al. (2014) for AR-Lasso at a single quantile level. In high-dimensional settings, this bound can be smaller than the bias bound established by Belloni and Chernozhukov (2011), which applies to the globally concerned quantile regression method with the nonadaptive L1 penalty.
The first condition essentially asks for weak correlations between relevant and irrelevant covariates, so that the two groups can be well separated. It can be seen that the model growth rate condition, n/(s2 log p) → ∞, is satisfied by the conditions for Theorem 3.2. The remaining conditions indicate different signal strength requirements under different choices of adaptive weights. With adaptive weight (w1), the signal condition,
| (3.1) |
is necessary for one to apply Theorem 3.3. When (w2) or (w3) is used, the signal condition can be relaxed to , or respectively. These weaker signal conditions can well accommodate scenarios with covariate effects crossing zero at some τ ∈ Δ.
In the presence of heavy-tailed distributions, the signal condition (3.1), coupled with the estimation error bound established in Theorem 3.1 for the oracle estimator, implies uniform sign consistency across τ ∈ Δ, namely . Consider the special case with finite s and p = O(n1/2), which Fan et al. (2014) (see Proposition 1) used to investigate the suboptimality of Lasso under heavy-tailed error distributions. Our signal requirement for globally concerned quantile regression to achieve uniform sign consistency can be weaker than what is required by the least squares-based Lasso. This finding provides further evidence for the suboptimality of Lasso in the presence of heavy-tailed distributions.
3.3. GIC procedure for λn
The performance of the proposed adaptively weighted L1-penalized quantile regression hinges on the choice of the tuning parameter λn. To achieve the model selection oracle property, the tuning parameter λn and the adaptive weights ωj(τ)’s are required to satisfy the following conditions: , and . However, the theoretically optimal λn is not practically attainable, because it depends on the unknown true model size s. Although many existing criteria, including AIC and cross-validation, could potentially be employed to select the tuning parameter, Wang et al. (2007) and Zhang et al. (2010) showed that the tuning parameter selected by AIC or cross-validation may fail to consistently identify the true model. Wang et al. (2009) considered tuning parameter selection with a modified BIC in the setting of high dimensional linear regression with p < n, and Wang and Zhu (2011) extended the modified BIC to the situation where p can be larger than n. They demonstrated that the modified BIC can identify the true model consistently for high dimensional linear regression. The modified BIC can be viewed as a generalized information criterion (GIC) in the sense of Nishii (1984). Motivated by these results, we select the practically optimal λ̂n by minimizing the GIC.
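To make the object being tuned concrete, the weighted-L1 penalized quantile regression at a fixed τ can be solved exactly as a linear program. The sketch below is our own illustration (the helper name `weighted_l1_qr` and the constant weights are assumptions, not the paper's algorithm or its weight functions (w1)–(w3)):

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_qr(X, y, tau, lam, w):
    """Solve min_b sum_i rho_tau(y_i - x_i'b) + lam * sum_j w_j |b_j|
    as a linear program, splitting residuals and coefficients into
    nonnegative positive/negative parts."""
    n, p = X.shape
    # Decision vector: [b_plus (p), b_minus (p), u_plus (n), u_minus (n)], all >= 0.
    c = np.concatenate([lam * w, lam * w, tau * np.ones(n), (1 - tau) * np.ones(n)])
    # Equality constraints encode y - X b = u_plus - u_minus.
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

Setting w_j = 0 leaves a coefficient (e.g. an intercept) unpenalized; in the proposed procedure the weights would additionally vary with τ.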
Fan and Tang (2013) recently considered GIC tuning parameter selection in high dimensional penalized generalized linear regression. In addition to the challenging aspects noted by Fan and Tang (2013), our problem involves an additional difficulty due to the varying τ ∈ Δ. As a result, we introduce ∫Δlog σ̂λ(τ)dτ to measure the cumulative model fit of varying effects over Δ.
We define the class of over-fitted models as OF := {S : S ⫌ SΔ}, and the class of under-fitted models as UF := {S : S ∉ OF, S ≠ SΔ}. To study the properties of the proposed GIC in (3.2), we first introduce an auxiliary GIC of model S, which is defined as
| (3.2) |
where measures the fitness of model S at the τ-th quantile, and β̂S(τ) is obtained by unpenalized quantile regression. We also define , which serves as a baseline of model fitness.
In addition, we set a model size upper bound, denoted by Cm(n), with s < Cm(n) < p. That is, we restrict the model search to submodels of size no more than Cm(n). Such an assumption has been widely adopted in studies of tuning parameter selection for high dimensional models [see e.g. Chen and Chen (2008), Wang and Zhu (2011), Fan and Tang (2013)]. On one hand, it reduces the number of candidate models from 2p to 2Cm(n) and greatly reduces the computational burden. On the other hand, it is needed so that the fitness of a model can be correctly measured.
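On a finite τ-grid, a criterion of this kind can be approximated directly from fitted residuals. The sketch below assumes the criterion takes the form ∫Δ log σ̂(τ)dτ + |S|·φn with σ̂(τ) the average check loss; the actual displays (2.3) and (3.2) are not reproduced in this excerpt, so this is an illustration under that assumption rather than the paper's formula:

```python
import numpy as np

def gic(resid_by_tau, taus, model_size, n, p):
    """Grid approximation of a GIC of the assumed form
    integral over Delta of log sigma_hat(tau) d tau + |S| * phi_n,
    where sigma_hat(tau) is the average check loss at tau.

    resid_by_tau: array of shape (len(taus), n), residuals from the fit at each tau.
    """
    taus = np.asarray(taus, dtype=float)
    phi_n = np.log(np.log(n)) * np.log(p) / n  # the phi_n used in Section 4
    # sigma_hat(tau): average check loss rho_tau at each grid point
    losses = np.array([np.mean(r * (t - (r < 0))) for r, t in zip(resid_by_tau, taus)])
    log_losses = np.log(losses)
    # trapezoid rule for the integral over the tau-grid
    cumulative_fit = np.sum((log_losses[1:] + log_losses[:-1]) / 2 * np.diff(taus))
    return cumulative_fit + model_size * phi_n
```

One would evaluate this for each candidate λ (with `model_size` = |Ŝλ|) and keep the minimizer.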
To establish the asymptotic properties of the GIC tuning parameter selector, we assume the following condition (C4+), which is an enhanced version of (C4):
-
(C4+)
where ||·||0 denotes the L0 norm.
Recall that φn is defined in (2.3) as a sequence converging to 0. Let ξn denote , which measures the minimal overall effect of the intercept and the relevant variables on the conditional distribution. We characterize the behavior of the GIC for any model S of size no more than Cm(n) by the following theorem:
Theorem 3.4
Suppose conditions (C1), (C2), (C3) and (C4+) hold. If log p/n = o(φn), , and , then we have
Theorem 3.4 indicates that the true model SΔ has the smallest GIC value among the models of size no more than Cm(n). Let λ̂n denote the minimizer of (2.3). The following corollary shows that the proposed GIC tuning parameter selector can consistently identify the true model with probability approaching 1.
Corollary 3.2
Under the same conditions as in Theorem 3.4, if , and , then we have
where Ŝλ is defined in (2.3).
4. Simulation studies
We conduct simulation studies to evaluate the finite sample performance of the proposed method. We consider the adaptively weighted L1-penalized quantile regression with three weight functions, (w1), (w2) and (w3), denoted by AW1, AW2 and AW3, respectively. We compare them with several other methods: (i) the adaptive-LASSO quantile regression at a single predetermined quantile level τ, denoted by SS(τ); (ii) the adaptive-LASSO with the least squares objective function, denoted by LS; (iii) the L1-QR over Δ, which is the initial estimator (Step 1) of our proposed algorithm; (iv) the pointwise approach collecting estimates from SS(τ) over τ ∈ Δ, denoted by PS1; and (v) the one-step estimate from PS1, obtained by unpenalized quantile regression using the variables selected by PS1 over τ ∈ Δ, denoted by PS2. Note that PS1 and PS2 have the same variable selection results but differ in their coefficient estimates. Except for L1-QR, the tuning parameter λn for the other methods is selected by the GIC criterion with φn = log(log n) log p/n. The candidate values for λn are the n/4 equally-spaced grid points between 0 and n/10. We use the sample size n = 200, and generate Y based on model (2.1) with p = 400 covariates. We consider four setups, denoted by (I), (II), (III) and (IV).
Setup (I)
The intercept coefficient function is , where Φ−1 is the quantile function of the standard normal distribution. The covariates Zi are generated from the multivariate normal distribution Np(0, Σ) with Σ = (σjk)p×p and σjk = 0.5|j−k|. The coefficients for Z(1), Z(2), Z(5), Z(8), Z(12) and Z(16) are set to be 2, 1.5, 3, 1, 0.9, and 1 constantly over τ, and the other coefficients are zero. It is clear that setup (I) is a standard linear regression model with normal random errors.
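A data-generating sketch for setup (I) follows. Since the setup is described as a standard linear regression model with normal random errors, we assume the elided intercept function corresponds to additive i.i.d. N(0, 1) errors; the helper name `simulate_setup1` is ours:

```python
import numpy as np

def simulate_setup1(n=200, p=400, seed=0):
    """Generate data for setup (I): AR(0.5)-correlated normal covariates and a
    homoscedastic linear model, so the tau-th conditional quantile of Y is
    z'beta plus the tau-quantile of the error distribution."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # sigma_jk = 0.5^{|j-k|}
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    # Nonzero effects for Z(1), Z(2), Z(5), Z(8), Z(12), Z(16), constant over tau.
    beta[[0, 1, 4, 7, 11, 15]] = [2, 1.5, 3, 1, 0.9, 1]
    y = Z @ beta + rng.standard_normal(n)  # N(0, 1) errors (our assumption)
    return Z, y, beta
```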
Setup (II)& (III)
The intercept coefficient function is set to be 0, and only Z(1), Z(2) and Z(8) have nonzero coefficients. We plot the coefficient functions of Z(1), Z(2) and Z(8) for these two setups in Figure 1(a) and (b), respectively. We first generate Z̃i from Np(0, Σ), where Σ is the same as in setup (I), and then set Zi = Φ(Z̃i) (componentwise), where Φ(·) is the standard normal distribution function. Setups (II) and (III) are designed to assess the performance of the proposed estimator in models with varying covariate effects.
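The covariate construction for setups (II) and (III) is a componentwise probability-integral transform, which can be sketched as follows (the helper name is ours):

```python
import numpy as np
from scipy.stats import norm

def uniformize(Z_tilde):
    """Componentwise transform Z = Phi(Z_tilde): each marginal becomes
    Uniform(0, 1) while the Gaussian copula dependence induced by Sigma
    is preserved."""
    return norm.cdf(Z_tilde)
```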
Fig. 1.
The coefficient functions in simulation set-ups (II) & (III)
In these setups, we set Δ = [0.1, 0.9] and choose τ = 0.25, 0.50, and 0.75 for SS(τ). We compare the performance of the different methods described above in terms of the following criteria.
NCN : mean number of correctly identified variables (with nonzero coefficient functions)
NIN : mean number of incorrectly selected variables
PUF : percentage of under-fitted models
PCF : percentage of correctly fitted models
POF : percentage of over-fitted models
RAEEΔ : median of relative absolute estimation errors with respect to the oracle unpenalized quantile estimator β̂o(τ) over Δ (under the correctly specified sparse model). It is defined as
The first five criteria are used to assess the model selection performance. We would like NCN to be close to the true number of signals (i.e., 6 in setup (I) and 3 in setups (II) & (III)). The ideal PCF is 100%, while the other measures (NIN, PUF and POF) should be as close to 0 as possible. The measure RAEEΔ aims to evaluate the global estimation accuracy for τ ∈ Δ; we would like it to be as small as possible. For SS(τ) and LS, we calculate RAEEΔ by extrapolating the coefficient function estimate as a constant valued function over τ ∈ Δ.
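Since the display defining RAEEΔ is not reproduced in this excerpt, the sketch below follows the verbal definition: the L1 estimation error of a method, accumulated over a τ-grid, relative to that of the oracle estimator (this computes one replication; the median over replications is then reported):

```python
import numpy as np

def raee(beta_hat, beta_oracle, beta_true, taus):
    """Relative absolute estimation error over a tau-grid (an assumed form of
    the elided RAEE display: integrated L1 error of the estimator divided by
    that of the oracle estimator).

    beta_hat, beta_oracle, beta_true: arrays of shape (len(taus), p).
    """
    taus = np.asarray(taus, dtype=float)
    err = np.abs(beta_hat - beta_true).sum(axis=1)      # L1 error at each tau
    err_o = np.abs(beta_oracle - beta_true).sum(axis=1)  # oracle L1 error
    # trapezoid rule over the tau-grid for both integrals
    num = np.sum((err[1:] + err[:-1]) / 2 * np.diff(taus))
    den = np.sum((err_o[1:] + err_o[:-1]) / 2 * np.diff(taus))
    return num / den
```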
In Table 1, we present simulation results for setups (I)–(III) based on 400 simulations. Under setup (I), where the true model is a standard linear model with normal random errors, we find that AW1, AW2, AW3, SS(0.25), SS(0.50), SS(0.75) and LS all have very similar performance in model selection and estimation accuracy. Since the oracle estimator adopted here is based on quantile regression and the random errors follow a normal distribution, it is reasonable to observe that LS has the best performance in estimation accuracy. Nevertheless, this advantage over the other approaches is only moderate. We also observe that applying the naive strategy of merging variable selection results from all SS(τ), as in PS1 and PS2, tends to produce an overfitted version of model (2.1), with POF = 94.0%. The large RAEEΔ of PS2 suggests that such an overfitted model tends to have considerably deteriorated estimation accuracy.
Table 1.
Simulation results for setup (I)–(III)
| set-up | Method | NCN | NIN | PUF(%) | PCF(%) | POF(%) | RAEEΔ |
|---|---|---|---|---|---|---|---|
| (I) | AW1 | 6.00 | .09 | .0 | 90.7 | 9.3 | 1.54 |
| | AW2 | 6.00 | .07 | .0 | 93.3 | 6.7 | 1.39 |
| | AW3 | 6.00 | .06 | .0 | 93.7 | 6.3 | 1.36 |
| | SS(0.25) | 5.99 | .16 | 1.3 | 84.5 | 14.2 | 1.22 |
| | SS(0.50) | 6.00 | .07 | .0 | 93.8 | 6.2 | 1.06 |
| | SS(0.75) | 5.99 | .16 | .8 | 85.7 | 13.5 | 1.19 |
| | LS | 6.00 | .28 | .0 | 79.9 | 20.1 | .77 |
| | L1-QR | 6.00 | 109.90 | .0 | .0 | 100.0 | 3.89 |
| | PS1 | 6.00 | 2.71 | .0 | 6.0 | 94.0 | 1.28 |
| | PS2 | 6.00 | 2.71 | .0 | 6.0 | 94.0 | 13.45 |
| (II) | AW1 | 2.84 | .04 | 14.8 | 80.5 | 4.7 | 3.62 |
| | AW2 | 2.91 | .05 | 7.7 | 87.8 | 4.5 | 1.83 |
| | AW3 | 2.87 | .04 | 11.5 | 84.8 | 4.5 | 1.92 |
| | SS(0.25) | 2.42 | .10 | 39.9 | 55.7 | 4.4 | 4.48 |
| | SS(0.50) | 1.99 | .00 | 100.0 | .0 | .0 | 3.07 |
| | SS(0.75) | 2.51 | .05 | 38.8 | 59.5 | 1.7 | 4.26 |
| | LS | 1.95 | .18 | 100.0 | .0 | .0 | 3.85 |
| | L1-QR | 3.00 | 26.30 | .0 | .0 | 100.0 | 4.90 |
| | PS1 | 3.00 | 1.77 | .5 | 21.7 | 78.3 | 2.58 |
| | PS2 | 3.00 | 1.77 | .5 | 21.7 | 78.3 | 12.04 |
| (III) | AW1 | 2.83 | .19 | 16.0 | 70.3 | 13.7 | 2.05 |
| | AW2 | 2.85 | .21 | 14.2 | 71.0 | 14.8 | 1.34 |
| | AW3 | 2.80 | .09 | 19.5 | 73.8 | 6.7 | 1.38 |
| | SS(0.25) | 1.79 | .12 | 100.0 | .0 | .0 | 3.64 |
| | SS(0.50) | 1.01 | .01 | 100.0 | .0 | .0 | 3.29 |
| | SS(0.75) | 1.76 | .08 | 100.0 | .0 | .0 | 2.70 |
| | LS | 2.10 | .18 | 99.5 | 0.2 | .3 | 3.44 |
| | L1-QR | 2.98 | 26.78 | 1.8 | .0 | 98.2 | 3.67 |
| | PS1 | 2.94 | 1.65 | 5.5 | 24.2 | 70.3 | 1.48 |
| | PS2 | 2.94 | 1.65 | 5.5 | 24.2 | 70.3 | 8.78 |
In setups (II) and (III), we observe that LS performs very poorly in identifying the true model. For example, in setup (II), although Z(8) has a strong effect on the response at all quantile levels except τ = 0.5, its symmetric effect pattern neutralizes its mean effect on Y. Thus, LS frequently fails to select Z(8), yielding NCN = 1.95 and PUF = 100%. Yet, if one simply applies the locally concerned penalized median regression, Z(8) is characterized as an inactive variable as well, and its effects on other segments of the conditional distribution of Y given Z are overlooked. The unsatisfactory results of SS(0.50) support the use of our globally concerned quantile regression, which enables a more thorough assessment of covariates impacting various quantiles. Compared to SS(0.25) and SS(0.75), where Z(8) is an active variable, the proposed method for globally concerned quantile regression considerably enhances the power for selecting the correct set of relevant variables. This is clearly indicated by the PCF improvement ranging between 20% and 30%. As in setup (I), PS1 and PS2, the naive “global” methods, suffer from the overfitting problem, with POF = 78.3%. The superiority of the proposed method is more evident in setup (III), where Z(1) and Z(2) have partial effects on two disjoint quantile ranges. In this case, the PCFs of AW1, AW2 and AW3 are still over 70%, far above the PCFs achieved by the other methods.
By comparing RAEEΔ in Table 1, we observe that the proposed globally adaptive estimators also outperform the other methods in estimation accuracy for non-i.i.d. error models. In addition, Table 1 suggests that adopting the weight functions (w2) and (w3), which are designed to capture the global signal strength, generally yields better finite-sample performance than the traditional weight (w1). We also note that AW2 and AW3 have similar or even smaller RAEEΔ compared to PS1, an approach that penalizes for local sparsity separately at each quantile index τ. This shows that the proposed globally adaptive method does not sacrifice estimation accuracy while avoiding the overfitting issue associated with the naive pointwise approach. This observation is in line with our theoretical results in Theorem 3.1 and Corollary 3.1.
We also examine the local estimation accuracy of different methods. Specifically, we compute the L1 loss of each estimator of β0(0.5) and then calculate its ratio to that obtained from SS(0.50). The results in Table 2 suggest that our globally concerned penalization method may lose a small proportion of local estimation efficiency compared to the locally concerned SS(0.5), a reasonable price to pay for achieving better global model selection and estimation results. On the other hand, PS2 has much larger local estimation errors compared to the other methods. This may be viewed as a natural consequence of the poor global sparsity control by the naive pointwise approach.
Table 2.
Local estimation error ratio over SS(0.50) at τ = 0.5 for setup (I)–(III)
| set-up | AW1 | AW2 | AW3 | PS1 | PS2 | LS |
|---|---|---|---|---|---|---|
| (I) | 1.16 | 1.13 | 1.14 | 1.00 | 11.01 | 0.75 |
| (II) | 1.19 | 1.61 | 1.63 | 1.00 | 18.52 | 2.35 |
| (III) | 1.01 | 1.17 | 1.18 | 1.00 | 6.21 | 1.33 |
Setup (IV)
We intend to assess the utility of the proposed method in selecting active variables at tail quantiles. We generate data from a linear regression model, where the covariates Zi are generated from the multivariate normal distribution Np(0, Σ) with the same covariance matrix used in setup (I). The effects of Z(1), Z(2), Z(5), Z(12), Z(16) and Z(25) are set to be 1.5, 1.25, 2, 4/3, 2 and 3 constantly over τ, and all other covariates have constantly zero coefficients across τ. For each simulated dataset, to identify active variables at quantile level τ = 0.05, we implement SS(0.05) as well as AW1, AW2, AW3, L1-QR, PS1 and PS2 with Δ = [0.03, 0.07]. The results, based on 400 simulated data sets, are reported in Table 3. Comparing SS(0.05) with SS(0.25), SS(0.50), and SS(0.75) in setup (I), we observe larger variability in model selection for quantile regression at the tail quantile τ = 0.05, reflected by a relatively small PCF of 51.0%. On the other hand, the proposed methods, AW1, AW2 and AW3, still maintain satisfactory model selection performance with PCFs above 80%. This demonstrates that, by accumulating information in the neighborhood of a tail quantile, the proposed method may boost the power of correct model selection at tail quantiles. We also conduct simulations with Δ chosen as [0.025, 0.075]. Such a choice of Δ may represent the same interest in the response distribution as Δ = [0.03, 0.07], so it would be desirable for the variable selection results under these two choices of Δ to be in close proximity. The results in Table A.1 (see the supplemental article [Zheng et al. (2015)]) are very similar to those in Table 3, and the results in Table A.2 further demonstrate good agreement in the selected variables between the two choices of Δ, [0.03, 0.07] and [0.025, 0.075]. These results suggest that the proposed estimators are quite robust to reasonable variations in the choice of Δ.
Table 3.
Simulation results over Δ = [0.03, 0.07] for setup (IV)
| set-up | Method | NCN | NIN | PUF(%) | PCF(%) | POF(%) | RAEEΔ |
|---|---|---|---|---|---|---|---|
| (IV) | AW1 | 5.83 | .01 | 11.8 | 87.2 | 1.0 | 2.15 |
| | AW2 | 5.87 | .03 | 11.8 | 86.2 | 2.0 | 1.63 |
| | AW3 | 5.80 | .03 | 17.8 | 80.5 | 1.8 | 1.62 |
| | SS(0.05) | 5.99 | .85 | 1.0 | 51.0 | 48.0 | 1.28 |
| | L1-QR | 6.00 | 35.53 | .0 | .0 | 100.0 | 3.85 |
| | PS1 | 6.00 | 3.18 | .2 | 13.8 | 87.0 | 1.29 |
| | PS2 | 6.00 | 3.18 | .2 | 13.8 | 87.0 | 10.55 |
5. A real data example
We now illustrate the application of the proposed method by analyzing a microarray data set reported by Scheetz et al. (2006). This data set contains expression values of 31,042 probe sets on 120 twelve-week-old male offspring of rats. As in Huang et al. (2008), Kim et al. (2008) and Wang et al. (2012), we are interested in identifying genes whose expressions are predictive of the expression of gene TRIM32, which is associated with a genetically heterogeneous disease of multiple organ systems. The probe corresponding to gene TRIM32 is 1389163_at. Our approach to finding relevant genes is to apply high dimensional quantile regression analyses to the remaining probe sets. In our analysis, we further narrow down the problem to selecting variables that impact ordinary expression levels of probe 1389163_at, which may be roughly captured by the middle part of the response distribution. Therefore, we formulate the target of our analysis as
, considering two reasonable choices of Δ, (0.2, 0.8) and (0.25, 0.75).
We use the data set available in the R package flare, which has been processed to exclude probes that are not expressed or lack variation. There remain 200 probe sets serving as covariates. We choose AW2 as the representative of the proposed globally concerned method. In addition, we also implement the adaptive-LASSO method for locally concerned quantile regression, SS(τ), at τ = 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, as well as the pointwise approaches, PS1 and PS2, described in Section 4.
To evaluate each method, we compute its prediction error as follows. First, we randomly split the 120 rats into a training set and a test set, each consisting of 60 subjects. We apply the method to the training data set and obtain the estimator of β0(τ), denoted by β̂train(τ). Next, we calculate the prediction error over the test set as
For SS(τ), we calculate PE(Δ) by treating the coefficient estimate as a constant valued function over τ ∈ Δ.
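Since the display defining PE(Δ) is not reproduced in this excerpt, the sketch below uses a plausible grid approximation: average the check loss over the test subjects at each τ and integrate over Δ. The |Δ| normalization is our assumption, added so that values are comparable across choices of Δ:

```python
import numpy as np

def prediction_error(X_test, y_test, beta_by_tau, taus):
    """Test-set prediction error accumulated over Delta (assumed form:
    average check loss at each tau, integrated over the tau-grid and
    normalized by the grid length).

    beta_by_tau: array of shape (len(taus), p), rows are beta_hat_train(tau)."""
    taus = np.asarray(taus, dtype=float)
    pe_tau = np.empty(len(taus))
    for k, t in enumerate(taus):
        r = y_test - X_test @ beta_by_tau[k]
        pe_tau[k] = np.mean(r * (t - (r < 0)))  # check loss rho_t
    # trapezoid rule over the grid, then normalize by |Delta|
    integral = np.sum((pe_tau[1:] + pe_tau[:-1]) / 2 * np.diff(taus))
    return integral / (taus[-1] - taus[0])
```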
Table 4 lists the probe sets selected by each method using all 120 observations. In addition, we present the averages of PE(Δ) along with the corresponding standard deviations (within parentheses). All calculations are based on 400 replications of random splitting into training and test sets.
Table 4.
Probe sets identified by various methods
| Δ (or τ for SS) | Method | Probes | PE(Δ), Δ = (0.2, 0.8) | PE(Δ), Δ = (0.25, 0.75) |
|---|---|---|---|---|
| (0.2, 0.8) | AW2 | “24565” “25141” “25367” “29045” | .146 (.016) | — |
| (0.25, 0.75) | AW2 | “24565” “25141” “25367” “29045” | — | .124 (.011) |
| (0.2, 0.8) | PS1 | 17 probes | .151 (.016) | — |
| (0.25, 0.75) | PS1 | 13 probes | — | .129 (.012) |
| (0.2, 0.8) | PS2 | 17 probes | .143 (.013) | — |
| (0.25, 0.75) | PS2 | 13 probes | — | .124 (.011) |
| 0.25 | SS | “6222”, “15863”, “22140”, “25141”, “25439”, “29045” | .191 (.024) | .151 (.018) |
| 0.30 | SS | “25141”, “29045” | .181 (.023) | .146 (.017) |
| 0.35 | SS | “14949”, “24245”, “25141”, “29045” | .174 (.022) | .141 (.016) |
| 0.40 | SS | “25141”, “29045” | .170 (.020) | .138 (.015) |
| 0.45 | SS | “25141” | .167 (.020) | .135 (.015) |
| 0.50 | SS | “25141” | .165 (.020) | .135 (.015) |
| 0.55 | SS | “25141” | .164 (.020) | .136 (.015) |
| 0.60 | SS | “21092”, “24565”, “25141” | .167 (.021) | .137 (.017) |
| 0.65 | SS | “24565”, “25141” | .169 (.023) | .140 (.018) |
| 0.70 | SS | “11711”, “24565”, “25141” | .176 (.025) | .146 (.019) |
| 0.75 | SS | “24565”, “25141”, “25367” | .186 (.028) | .155 (.023) |
From Table 4, it is clear that the set of probes selected by the locally concerned quantile method varies considerably with τ. Only probe “25141” is selected by all SS(τ)’s considered. Probe “24565”, which is consistently selected by SS(τ) for τ = 0.60, 0.65, 0.70, 0.75, does not show up in the estimated active variable set at lower τ’s. These results may suggest a heterogeneous relationship across different segments of the response distribution, but the variations are partly due to the sensitivity of the variable selection methods as τ varies. In fact, there is quite large variability in the selected probes at low quantile indices. For example, SS(0.25) selects 6 probes, but SS(0.30) selects only 2. It would be difficult to interpret such variable selection results at two close quantile indices.
To identify , PS1 and PS2, which take the union of variables selected by the locally concerned SS(τ), select a total of 13 probes for Δ = (0.25, 0.75) and 17 probes for Δ = (0.2, 0.8). In contrast, the proposed globally concerned quantile regression AW2 selects only 4 probes but renders very competitive prediction errors. It has the smallest PE(Δ) when Δ = (0.25, 0.75) and the second smallest PE(Δ) when Δ = (0.2, 0.8). These results show that the globally concerned quantile regression approach can mitigate the over-fitting problem of the alternative methods based on the local quantile regression approach. Furthermore, AW2 selects the same set of probes under the two slightly different choices of Δ. This suggests the robustness of our method to small changes of Δ, a desirable feature from the perspective of model selection.
6. Remarks
We aim to provide a rigorous statistical approach to identifying variables that are predictive of some or all conditional quantiles with levels in a predetermined set Δ. Our work does not directly address how Δ should be chosen, because the choice of Δ should align with the scientific problem at hand. In practice, Δ can be chosen as a large interval, say [0.1, 0.9], to reflect an interest in normal outcomes, or as a smaller interval, say [0.75, 0.9], to reflect an interest in the upper tail of the response distribution. Even when one is interested in a single quantile level τ, our work suggests that applying the proposed method with Δ chosen as a small interval containing τ can lead to more stable variable selection results when the sample size n is limited (relative to p).
In applications, it is likely that we have more than one reasonable choice of Δ. For example, either (0.7, 0.9) or (0.75, 0.9) can be a reasonable choice of Δ when we are interested in identifying variables that impact the upper tail of the response variable. This reinforces our belief that the variable selection results should be quite insensitive to small changes in the choice of Δ. Our simulation studies and data example suggest that our approach is robust to small variations of Δ. These empirical results further endorse the practical value of the proposed method.
7. Technical proofs
We present the proofs of our main results in this section. We start by introducing some notation. Given a random sample Z1, ···, Zn, we adopt the following empirical process notations. Let and . Let ψτ(u) = τ − 1{u < 0} denote the score function of ρτ(u) and . We define (t) = {β ∈ ℝp : βb = 0, infτ∈Δ ||β − β0(τ)|| ≤ t} and Rs(B) = {δ ∈ Rs : ||δ|| ≤ (s log2 n/n)1/2B} for some B > 0. We use τ* to denote min{Δ ∪ (1 − Δ)} and ∂A to denote the boundary of a set A.
As mentioned in Section 2.1, it is more meaningful to investigate the globally concerned quantile regression model (2.1) when the support of X is compact. However, our theoretical work can be established under more general conditions, where the magnitude of the covariates is allowed to increase at a suitable rate with n. In particular, the following condition is assumed for the proof:
-
(C5) [Condition on the largest covariates] We assume
for some M1n = o(n1/2/(s log n)3/2) and .
We denote the event that max2≤j≤p |σ̂j − 1| ≤ 1/2, the event that , and the event that by Ω0, Ω1, and Ω2, respectively. According to conditions (C2) and (C5), we have , and .
Next, we present the technical lemmas used in the proofs of our theorems and corollaries. The proofs of lemmas and Corollaries 1–2 are relegated to Section 2 of the supplemental article [Zheng et al. (2015)].
Lemma 7.1
Under conditions (C1)-(C4), if , then for any δ ∈ Rs with ||δ|| ≤ t, we have
| (7.1) |
Lemma 7.2
Under conditions (C2)-(C4), with probability at least 1 − 16/n3 − γ0, we have
| (7.2) |
Lemma 7.3
Under condition (C4), given a fixed τ0, with probability at least 1 − 8/a, we have
Lemma 7.4
Under conditions (C1)-(C5), given any α ∈ Rs such that ||α|| = 1, if s = o(n1/3), then we have
Lemma 7.5
Under conditions (C1), (C3) and (C4), given an α ∈ Rs such that ||α|| = 1, αTMn(τ, 0) converges weakly to a mean zero Gaussian process with variance .
Lemma 7.6
Under condition (C1), if exists for some τ0 ∈ Δ, then we have
almost surely.
Lemma 7.7
Under conditions (C1), (C2), (C3), and (C4+), for some k ≤ Cm(n) = o(n/log p), with probability at least 1 − 16 exp(−(A2 − 2) log p) − γ0, we have
for some constant A > 2.
Lemma 7.8
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
Lemma 7.9
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
Lemma 7.10
Under the same conditions as in Theorem 3.4, for some constant A > 2, we have
with probability at least 1 − 16 exp(−(A2 − 2) log p) − 12/p − γ0n.
Lemma 7.11
Under the same conditions as in Theorem 3.3, with probability at least 1 − 4 exp(C3s log p/2 + 4s log n) − γ2, for some constant C3 > 1, we have
In the following, we provide the proofs for the theoretical results in Section 3.2.
Proof of Theorem 3.1
We consider Qn(β; τ) − Qn(β0(τ); τ) over {β ∈ ℝp : infτ∈Δ ||β − β0(τ)|| ≤ t}
According to Lemma 7.1, if , then we obtain
| (7.3) |
By Lemma 7.2, with probability at least 1 − 16/n3 − γ0n,
| (7.4) |
For I3, since , we have
| (7.5) |
(7.3), (7.4), and (7.5) together imply that with probability at least 1 − 16/n3 − γ0n,
where , with a sufficiently large C1.
Since Qn(β; τ) is convex in β, , the minimizer of (2.2) over Rs, must lie within { }. This completes the proof of Theorem 3.1.
Proof of Theorem 3.2
According to Theorem 3.1, we know that
for some B > 0. Therefore, minimizing the objective function Qn(β; τ) from (2.2) over Rs is equivalent to minimizing n−1/2[Qn(β0(τ) + δ; τ) − Qn(β0(τ); τ)] over Rs(B) with probability approaching 1.
Given any δ ∈ Rs(B), we have
We consider II2 first. By Lemma 7.4, we have uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.6) |
Next we deal with II4. From the proof of Lemma 7.1, it is easy to see that, uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.7) |
Now we deal with II1. Note that,
then we obtain that
is a bounded function. Applying arguments similar to those in Lemma 7.4, we have, uniformly for δ ∈ Rs(B) and τ ∈ Δ,
| (7.8) |
For II3, since E[Mn(τ, 0)] = 0, we have
| (7.9) |
Now we consider II5. On one hand, if , then
| (7.10) |
Combining (7.6), (7.7), (7.8), (7.9) and (7.10) together, we have
Let δ̂ be the minimizer of L(δ; τ) := n1/2δTHτδ−δTMn,τ(0)+n−1/2λn[ω(τ)∘sgn(β0(τ))]Tδ+op (||δa||) over Rs(B). Then we have
Given any α ∈ ℝp, ||α|| = 1, applying Lemma 7.5 yields
converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′).
On the other hand, if n−1/2λn||ωa|| = op(1), then
| (7.11) |
Consequently,
With similar arguments, we can show that converges weakly to a mean zero Gaussian process with covariance Σ(τ, τ′). This completes the proof of Theorem 3.2.
Proof of Theorem 3.3
According to the KKT conditions, to show that is a global minimizer of Qn(β; τ) over ℝp, we only need to check the following condition:
| (7.12) |
By Theorem 3.1, we know that Pr (supτ∈Δ ||β̂o(τ) − β0(τ)|| ∈ Rs(B)) → 1. Therefore, if we can show that
| (7.13) |
then (7.12) follows immediately, and the probability that is also the global minimizer of (2.2) over ℝp approaches 1.
We first state some results which will be used in the following proof. Under the conditions stated in the theorem, for all δ ∈ Rs(B), we have
| (7.14) |
where the first inequality follows from taking the conditional expectation given Xi and applying condition (C1), and the second inequality follows from the Cauchy–Schwarz inequality. Moreover, we have
| (7.15) |
In the following, we restrict our attention to Ω0. Let αj denote the vector whose jth component is 1 and all other components are 0. Then .
First, we evaluate III2. By (7.14), we have
| (7.16) |
Next, we consider III3.
Applying Lemma 2.3.7 in Van Der Vaart and Wellner (2000) yields
We have
| (7.17) |
where the first two inequalities are elementary, the third inequality follows from Lemma 1.5 in Ledoux and Talagrand (1991), and the last inequality follows from condition (C2).
Conditional on (Xi, Yi), i = 1, ···, n, we can partition Δ with a grid {τ0, τ1, τ2, ···, τn} such that for each 1 ≤ i ≤ n, there is exactly one observation (X(i), Y(i)) satisfying . Here, τ0 = 0. Then we have
As τ moves from τ0 to τn, we observe
where . Since the V(i)’s are independent and the ’s are symmetric about 0, Lévy’s inequality gives P(max1≤k≤n |Sk| ≥ u) ≤ 2P(|Sn| ≥ u). Therefore, we have
Choosing , we obtain
where the inequality follows from the arguments in (7.17). The above inequality and (7.17) together imply
| (7.18) |
Now we consider III1. By Lemma 7.11, we can show
| (7.19) |
Let . Then (7.16), (7.18), and (7.19) together yield
which implies
| (7.20) |
, coupled with , implies that, with probability approaching 1, λnωj(τ) > K̃ for all j > s. This, (7.20), and condition (C5) together imply that (7.13) holds with probability tending to 1, and so does (7.12). Therefore, is a global minimizer of (2.2) with probability approaching 1. This completes the proof of Theorem 3.3.
Proof of Theorem 3.4
Theorem 3.4 is a direct implication of Lemmas 7.8 and 7.9.
Supplementary Material
Acknowledgments
Qi Zheng was supported by NIH grant R01HL113548. Limin Peng was supported by NSF Award DMS-1007660 and NIH grant R01HL113548. Xuming He was supported by NSF Awards DMS-1307566 and DMS-1237234 and National Natural Science Foundation of China Grant 11129101.
Footnotes
Supplement Material: Additional simulation results, Proofs of technical lemmas and corollaries, Justification for the proposed grid approximation, and Sample Code (doi: COMPLETED BY THE TYPESETTER; Supplement rev 042215.pdf).
References
- Bassett G, Koenker R. An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association. 1982;77:407–415.
- Belloni A, Chernozhukov V. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics. 2011;39:82–130.
- Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:759–771.
- Fan J, Fan Y, Barut E. Adaptive robust variable selection. The Annals of Statistics. 2014;42:324–351. doi: 10.1214/13-AOS1191.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
- Fan Y, Tang C. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society, Series B. 2013;75:531–552.
- He X, Shao Q. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;73:120–135.
- Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Kim Y, Choi H, Oh HS. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
- Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
- Koenker R. Quantile Regression. Cambridge University Press; 2005.
- Koenker R, Bassett GW. Regression quantiles. Econometrica. 1978;46:33–50.
- Koenker RW, D'Orey V. Algorithm AS 229: Computing regression quantiles. Journal of the Royal Statistical Society, Series C. 1987;36:383–393.
- Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol. 23. Springer; Berlin: 1991.
- Li Y, Liu Y, Zhu J. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association. 2007;102:255–268.
- Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics. 2009;37:3498–3528.
- Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Nishii R. Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics. 1984;12:758–765.
- Peng L, Xu J, Kutner N. Shrinkage estimation of varying covariate effects based on quantile regression. Statistics and Computing. 2013. doi: 10.1007/s11222-013-9406-4.
- Portnoy S. Asymptotic behavior of the number of regression quantile breakpoints. SIAM Journal on Scientific and Statistical Computing. 1991;12:867–883.
- Qian J, Peng L. Censored quantile regression with partially functional effects. Biometrika. 2010;97:839–850.
- Rocha G, Wang X, Yu B. Asymptotic distribution and sparsistency for ℓ1-penalized parametric M-estimators with applications to linear SVM and logistic regression. 2009. http://arxiv.org/abs/0908.1940.
- Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
- Talagrand M. The Generic Chaining. Springer; Berlin: 2005.
- van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 2000.
- Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
- Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association. 2012;107:214–222. doi: 10.1080/01621459.2012.656014.
- Wang T, Zhu L. Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis. 2011;102:1141–1151.
- Welsh AH. On M-processes and M-estimation. The Annals of Statistics. 1989;17:337–361.
- Wu Y, Liu Y. Variable selection in quantile regression. Statistica Sinica. 2009;19:801–817.
- Yang Y, He X. Bayesian empirical likelihood for quantile regression. The Annals of Statistics. 2012;40:1102–1131.
- Zhang CH, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36:1567–1594.
- Zhang Y, Li R, Tsai CL. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
- Zheng Q, Gallagher C, Kulasekera KB. Adaptive penalized quantile regression for high dimensional data. Journal of Statistical Planning and Inference. 2013;143:1029–1038.
- Zheng Q, Peng L, He X. Supplement to "Globally adaptive quantile regression with ultra-high dimensional data". 2015. doi: 10.1214/15-AOS1340.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. The Annals of Statistics. 2008;36:1108–1126.