Published in final edited form as: Stat Sin. 2016 Jan;26(1):69–95. doi: 10.5705/ss.2014.049

Regularized Quantile Regression and Robust Feature Screening for Single Index Models

Wei Zhong 1, Liping Zhu 2, Runze Li 3, Hengjian Cui 4
PMCID: PMC4771381  NIHMSID: NIHMS726602  PMID: 26941542

Abstract

We propose both a penalized quantile regression and an independence screening procedure to identify important covariates and to exclude unimportant ones for a general class of ultrahigh dimensional single-index models, in which the conditional distribution of the response depends on the covariates via a single-index structure. We observe that linear quantile regression yields a consistent estimator of the direction of the index parameter in the single-index model. This observation dramatically reduces the computational complexity of selecting important covariates in the single-index model. We establish an oracle property for the penalized quantile regression estimator when the covariate dimension increases at an exponential rate of the sample size. From a practical perspective, however, when the covariate dimension is extremely large, the penalized quantile regression may suffer from at least two drawbacks: computational inexpediency and algorithmic instability. To address these issues, we propose an independence screening procedure which is robust to model misspecification and has reliable performance when the distribution of the response variable is heavy-tailed or response realizations contain extreme values. The new independence screening procedure offers a useful complement to the penalized quantile regression since it helps to reduce the covariate dimension from ultrahigh dimensionality to a moderate scale. Based on the reduced model, the penalized linear quantile regression further refines the selection of important covariates at different quantile levels. We examine the finite sample performance of the newly proposed procedure by Monte Carlo simulations and demonstrate the proposed methodology by an empirical analysis of a real data set.

Keywords: Distance correlation, penalized quantile regression, single-index models, sure screening property, ultrahigh dimensionality

1. Introduction

Single-index regression models are widely used to avoid the “curse of dimensionality”. Let Y be a response variable and x the associated covariate vector. The traditional single-index regression model is

Y = m(x^T β_0) + ε, (1.1)

where m(·) is an unknown regression function, β_0 consists of unknown index parameters, and ε is a random error with E(ε | x) = 0 and var(ε | x) = σ^2. Model (1.1) has been well studied in the literature. See, for example, Powell, Stock and Stoker (1989) and Härdle, Hall and Ichimura (1993). Zhu, Huang and Li (2012) studied the following heteroscedastic single-index regression model

Y = m(x^T β_0) + σ(x^T β_0) ε, (1.2)

for unknown functions m(·) and σ(·), where ε has mean zero and is assumed to be independent of x. Zhu, Huang and Li (2012) developed an estimation procedure for β0 and m(·) under a quantile loss function when the dimension of x is finite.

The goal of regression analysis is to characterize how the conditional distribution of the response variable Y varies with the realizations of the covariate vector x = (X_1, …, X_{p_n})^T. In this paper, we focus on the ultrahigh dimensional situation and thus denote by p_n the dimension of x to emphasize its dependence on the sample size n. Denote by F(y | x) the conditional distribution of Y given x. We study a general class of single-index models that includes models (1.1) and (1.2) as special cases. Specifically, we assume that there exists β_0 ∈ ℝ^{p_n} such that

F(y | x) = F(y | x^T β_0),  for all y. (1.3)

That is, the conditional distribution of (Y | x) is fully characterized through a single linear combination of predictors, x^T β_0. Consequently, the “curse of dimensionality” is avoided and the model interpretability is maintained via the single-index structure. Because the conditional distribution function F(· | ·) is unknown, the index parameter β_0 is not identifiable. The direction of β_0, instead of its true value, is of our primary interest. We refer to model (1.3) as a conditional distribution based single-index model (CDSIM for short) in order to distinguish it from models (1.1) and (1.2).

When the covariate dimension is high, it is natural to assume that some covariates are irrelevant. The presence of irrelevant covariates may substantially deteriorate the precision of parameter estimation and the accuracy of response prediction (Altham, 1984). In the context of linear regression or generalized linear regression, many regularization methods, such as the LASSO (Tibshirani, 1996), the SCAD (Fan and Li, 2001; Zou and Li, 2008), the adaptive LASSO (Zou, 2006), the MCP (Zhang, 2010), the hard thresholding penalty (Zheng, Fan and Lv, 2014) and general penalty functions (Fan and Lv, 2013) have been proposed to remove those irrelevant covariates and simultaneously estimate the nonzero coefficients. Naik and Tsai (2001), Kong and Xia (2007), Zhu, Qian and Lin (2011) and Liang et al. (2010) developed some regularization methods for single-index regression. Recently, Wang, Wu and Li (2012) investigated the nonconvex penalized quantile regression for analyzing heterogeneity in the ultrahigh dimensional setting. Fan, Fan and Barut (2014) proposed two-step adaptive robust LASSO based on weighted L1-penalized quantile regression to deal with heavy-tailed high dimensional data.

In this paper, we consider variable selection and feature screening for model (1.3) when the covariate dimension pn is ultrahigh. We further assume β0 is sparse. Denote by A the active index set, βA the nonzero entries of β0 and xA the collection of all active covariates. When β0 is sparse, model (1.3) reduces to

F(y | x) = F(y | x_A^T β_A),  for all y. (1.4)

Our goal is to identify A and, if possible, to estimate β_A. To the best of our knowledge, there are few variable selection methods designed for model (1.3) or (1.4) with ultrahigh-dimensional covariates.

In this paper we introduce two approaches to accomplish our goal: a penalized linear quantile regression and an independence screening procedure. When model (1.3) is true, the quantile functions of (Y | x) at all levels vary with the realizations of x^T β_0 only. In other words, the quantile functions admit a single-index structure. This motivates us to implement a penalized quantile regression to exclude irrelevant covariates and simultaneously estimate the direction of β_0. The advantage of using quantile regression is that the quantile functions equivalently characterize the conditional distribution function in (1.3), and quantile regression is resilient to outliers and extreme values in the response. We show that, although the true quantile functions of (Y | x) are possibly nonlinear, the resulting estimator obtained from penalized linear quantile regression remains consistent up to a proportionality constant. This strategy substantially reduces the computational complexity of estimating (1.3) in that the linear quantile regression procedure avoids estimating nonlinear quantile functions. Due to its computational efficiency, it is appealing for ultrahigh dimensional data analysis. We show that the penalized linear quantile regression estimator has the oracle property under mild regularity conditions even when p_n tends to ∞ at an exponential rate of n.

From a practical perspective, when the covariate dimension is extremely large, even the penalized linear quantile regression may suffer from at least two serious drawbacks: computational inexpediency and algorithmic instability (Fan, Samworth and Wu, 2009). To further reduce the computational complexity of selecting important covariates from ultrahigh dimensional candidates, we introduce an independence screening procedure which ranks the importance of each covariate through its distance correlation with the marginal distribution function of the response in model (1.3) and the implicit model (1.4). Since the distribution function is bounded and monotone, we can reasonably expect that this new independence screening procedure still works in the presence of outliers or extreme values in the response variable. In addition, it is computationally efficient and hence offers a useful complement, rather than an alternative, to the penalized quantile regression approach, since the proposed independence screening can precede the penalized quantile regression when the latter fails to produce a reliable solution within a tolerable time. Based on the reduced model, the penalized quantile regression may further refine the selection of important covariates at different quantile levels. We show that this new independence screening procedure has the sure screening property even when p_n is ultrahigh.

This paper is organized as follows. In Section 2, we propose the penalized linear quantile regression and study the consistency and the oracle property of the resulting estimator. We propose a robust independence screening procedure and establish its sure screening property in Section 3. We compare the finite sample performance of our proposals with several competitors in Section 4. All technical proofs are given in the Appendix.

2. Penalized Linear Quantile Regression

In this section, we will construct an estimate for the direction of β0 in model (1.3) via the penalized linear quantile regression.

2.1 The Methodology

Model (1.3) and its sparse structure (1.4) indicate that the quantile functions of (Y | x) at different quantile levels are all functions of x^T β_0, and of x_A^T β_A if the sparsity principle applies. This motivates us to estimate β_0 through the quantile functions at different levels. Similar to Zhu, Huang and Li (2012), we first show that linear quantile regression can be used to estimate the direction of β_0 in model (1.3). To be specific, we denote by ρ_τ(r) = τr − rI(r < 0) the check loss function at the τth quantile, for τ ∈ (0, 1). Define

ℒ_τ(u, b) = E{ρ_τ(Y − u − x^T b)}  and  (u_τ, β_τ) = argmin_{u,b} ℒ_τ(u, b), (2.1)

where b = (b_1, …, b_{p_n})^T ∈ ℝ^{p_n}.

Lemma 1

If x satisfies the linearity condition that

E{x − E(x) | x^T β_0} = var(x) β_0 {β_0^T var(x) β_0}^{−1} β_0^T {x − E(x)},

then βτ is proportional to β0 in model (1.3).

The linearity condition is satisfied when x follows an elliptically contoured distribution (Li, 1991). Hall and Li (1993) demonstrated that, regardless of the covariate distribution, the linearity condition offers a good approximation to reality as long as p_n is sufficiently large. Therefore, the linearity condition is typically regarded as mild in an ultrahigh dimensional setting. Lemma 1 implies that the indices of the zero entries of β_0 and β_τ coincide. Estimating the direction of β_0 in model (1.3) thus amounts to estimating β_τ defined in (2.1). This lemma can be proved by arguments similar to those used in Zhu, Huang and Li (2012), so we omit its proof to save space.

When the covariate dimension is large, it is desirable to exclude irrelevant covariates and simultaneously estimate β_τ in (2.1). Note that β_τ is identifiable because the linear quantile loss function ℒ_τ(u, b) is convex. Suppose that {(x_i, Y_i), i = 1, 2, …, n} is a random sample from (1.3). We consider the following penalized linear quantile regression to produce a sparse estimator of β_τ:

Q(u, b) = n^{−1} Σ_{i=1}^n ρ_τ(Y_i − u − x_i^T b) + Σ_{j=1}^{p_n} p_λ(|b_j|), (2.2)

where pλ(·) is a penalty function with a regularization parameter λ. In this paper, we use the SCAD penalty (Fan and Li, 2001) and the MCP penalty (Zhang, 2010). The MCP function is defined as

p_λ(|b|) = λ{|b| − b^2/(2aλ)} I(0 ≤ |b| < aλ) + (aλ^2/2) I(|b| ≥ aλ),

where a > 1. The SCAD penalty is defined as follows,

p_λ(|b|) = λ|b| I(0 ≤ |b| < λ) + [{aλ|b| − (b^2 + λ^2)/2}/(a − 1)] I(λ ≤ |b| ≤ aλ) + {(a + 1)λ^2/2} I(|b| > aλ),

where a = 3.7 is suggested by Fan and Li (2001). By minimizing the objective function Q(u, b), we obtain the estimators (û_τ, β̂_τ) at the τth quantile. In symbols, the resulting estimators are given by

(û_τ, β̂_τ) = argmin_{u,b} Q(u, b). (2.3)
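To make the criterion concrete, the following Python sketch (with our own function names) evaluates the check loss, the SCAD penalty and the penalized objective Q(u, b) in (2.2) for a candidate pair (u, b); it only illustrates the objective, and minimizing it in practice requires a dedicated nonconvex solver such as the one discussed below.

```python
import numpy as np

def check_loss(r, tau):
    """Quantile check loss rho_tau(r) = tau*r - r*I(r < 0)."""
    return r * (tau - (r < 0))

def scad_penalty(b, lam, a=3.7):
    """SCAD penalty p_lambda(|b|) with a = 3.7 as suggested by Fan and Li (2001)."""
    b = np.abs(b)
    piece1 = lam * b * (b < lam)
    piece2 = ((a * lam * b - 0.5 * (b ** 2 + lam ** 2)) / (a - 1.0)) * ((b >= lam) & (b <= a * lam))
    piece3 = 0.5 * (a + 1.0) * lam ** 2 * (b > a * lam)
    return piece1 + piece2 + piece3

def penalized_objective(u, b, X, y, tau, lam):
    """Objective Q(u, b) in (2.2): average check loss plus SCAD penalty on the coefficients."""
    loss = np.mean(check_loss(y - u - X @ b, tau))
    return loss + scad_penalty(b, lam).sum()
```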

2.2 The Oracle Property

In this section we study the oracle property of the estimators obtained from the penalized linear quantile regression. Without loss of generality, we assume the first q_n components of x are active and the rest are inactive, where q_n (≪ p_n) is a small positive integer representing the sparsity level. In other words, A = {1, 2, …, q_n}. We define the oracle estimator at the population level by

ℒ_τ(u, b_1) = E{ρ_τ(Y − u − x_A^T b_1)}  and  (u_τ^o, β_{τ1}^o) = argmin_{u,b_1} ℒ_τ(u, b_1), (2.4)

where b_1 = (b_1, …, b_{q_n})^T ∈ ℝ^{q_n}. We further write β_τ^o = (β_{τ1}^{oT}, 0^T)^T, where β_{τ1}^o represents the q_n-dimensional vector of nonzero components associated with the active covariates and 0 denotes a (p_n − q_n)-dimensional vector of zeros. Accordingly, we define the oracle estimator β̂_τ^o = (β̂_{τ1}^{oT}, 0^T)^T at the sample level by

ℒ_{τn}(u, b_1) = n^{−1} Σ_{i=1}^n ρ_τ(Y_i − u − x_{i,A}^T b_1)  and  (û_τ^o, β̂_{τ1}^o) = argmin_{u,b_1} ℒ_{τn}(u, b_1). (2.5)

We assume the following regularity conditions to investigate the consistency of the oracle estimator defined in (2.5) and the oracle property of the resulting estimator obtained from the penalized linear quantile regression (2.3).

(C1) The covariates x satisfy the sub-exponential tail probability condition uniformly in p_n. That is, there exist positive constants t_0 and C such that

max_{1≤k≤p_n} E{exp(t|X_k|)} ≤ C < ∞,  for 0 < t ≤ t_0. (2.6)

(C2) There exist positive constants 0 < C_1 ≤ C_2 < ∞ such that

C_1 ≤ λ_min{E(x_A x_A^T)} ≤ λ_max{E(x_A x_A^T)} ≤ C_2,

where λmin and λmax represent the smallest and largest eigenvalues, respectively. Assume further that {(xi,A, Yi), i = 1, …, n} are in general positions (Koenker, 2005, Section 2.2).

(C3) The probability density function of Y − x^T β_τ conditional on x, denoted by f(· | x), is uniformly bounded away from 0 and ∞ in a neighborhood of u_τ^o.

(C4) The true model size q_n satisfies q_n = O(n^{c_1}) for some 0 ≤ c_1 < 1/2.

(C5) For β_{τ1}^o = (β_{τ,1}^o, β_{τ,2}^o, …, β_{τ,q_n}^o)^T, there exist positive constants c_2 and C such that 2c_1 < c_2 ≤ 1 and

min_{1≤j≤q_n} |β_{τ,j}^o| ≥ C n^{−(1−c_2)/2}.

Condition (C1) concerns the moments of the covariates and holds immediately when the covariates are bounded or when x has a multivariate normal distribution. This condition is widely assumed in high dimensional inference; see, for instance, Bickel and Levina (2008). Condition (C2) requires that the design matrix of the true model at the population level be well behaved. Condition (C3) is a common assumption on the conditional distribution of Y − x^T β_τ given x. Condition (C4) allows the sparsity size q_n to diverge as the sample size n goes to infinity. Condition (C5) requires that the smallest true signal decay to zero at a slow rate.

Lemma 2 below states the consistency of the oracle estimators û_τ^o and β̂_{τ1}^o.

Lemma 2

Under Conditions (C1)-(C4), the oracle estimators û_τ^o and β̂_{τ1}^o satisfy

‖β̂_{τ1}^o − β_{τ1}^o‖ = O_p(√(q_n/n))  and  |û_τ^o − u_τ^o| = O_p(√(q_n/n)). (2.7)

Next we study the theoretical property of the oracle estimator β̂_τ^o = (β̂_{τ1}^{oT}, 0^T)^T.

Theorem 1

(The Oracle Property) Suppose Conditions (C1)-(C5) hold, log p_n = o(n^{min{c_2−2θ, θ}}) with 0 < θ < (c_2 − c_1)/2, and λ = o{n^{−(1−c_2)/2}}. Let B_n(λ) be the set of local minima β̂_τ of the objective function Q(u, b) defined in (2.2) with the SCAD or the MCP penalty and the tuning parameter λ. The oracle estimator β̂_τ^o = (β̂_{τ1}^{oT}, 0^T)^T satisfies

Pr{β̂_τ^o ∈ B_n(λ)} → 1,  as n → ∞.

Theorem 1 implies that the oracle estimator β̂_τ^o is a local minimizer of the objective function (2.2) with probability approaching one as n → ∞. This result extends Theorem 2.4 of Wang, Wu and Li (2012) from the linear quantile regression model to model (1.3). The results in Lemmas 1 and 2 and Theorem 1 imply that β̂_τ from the penalized linear quantile regression is a consistent estimator of the direction of β_0 in model (1.3). It can detect the nonzero components of β_0 and simultaneously estimate its direction. From a technical perspective, Wang, Wu and Li (2012) assumed that all covariates are uniformly bounded, while we relax their assumption to Condition (C1), which only requires the distributions of the covariates to have sub-exponential tails. In practice, the linear quantile regression estimator obtained with the LASSO penalty can serve as an initial value in our algorithm to minimize the objective function Q(u, b).
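In practice, an L1-penalized linear quantile fit can serve as such an initial value and is available in standard software. The sketch below is one possible way to obtain it with scikit-learn's QuantileRegressor; the penalty weight alpha and the solver choice are our own illustrative assumptions, not part of the paper.

```python
from sklearn.linear_model import QuantileRegressor

def lasso_quantile_init(X, y, tau=0.5, alpha=0.1):
    """L1-penalized linear quantile regression, usable as a starting point
    for minimizing the nonconvex SCAD/MCP objective Q(u, b)."""
    model = QuantileRegressor(quantile=tau, alpha=alpha, solver="highs")
    model.fit(X, y)
    return model.intercept_, model.coef_  # initial (u, b)
```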

3. Robust SIS based on Distance Correlation

Next we propose a new robust feature screening procedure for model (1.3) based on the distance correlation.

3.1 The Methodology

Theorem 1 indicates that the oracle property of the penalized quantile regression holds asymptotically for log p_n = o(n^δ) with some δ > 0. This suffices for many problems from a theoretical perspective. From a practical perspective, however, when the covariate dimension is extremely large, even the penalized linear quantile regression suffers from at least two serious drawbacks: computational inexpediency and algorithmic instability (Fan, Samworth and Wu, 2009). When the penalized linear quantile regression fails to produce a reliable solution within a tolerable time, it is better to let an independence screening procedure precede it. This strategy helps to reduce the ultrahigh dimensionality down to a relatively moderate scale. To this end, we propose an independence screening procedure which excludes irrelevant covariates and hence reduces the computational complexity of subsequent penalized quantile regressions at all quantile levels. In other words, the independence screening procedure is expected to have the sure screening property and to be independent of the quantile levels. In addition, it is expected to behave well when extreme values and/or outliers are present in the observed response values, because the subsequent quantile regression automatically has such a robustness property.

We first briefly review the definition of distance correlation (Szekely, Rizzo and Bakirov, 2007). The distance covariance between two random variables X and Y is defined by

dcov^2(X, Y) = S_1 + S_2 − 2S_3, (3.1)

where S_1 = E(|X − X̃| |Y − Ỹ|), S_2 = E(|X − X̃|) E(|Y − Ỹ|), S_3 = E{E(|X − X̃| | X) E(|Y − Ỹ| | Y)}, and (X̃, Ỹ) is an independent copy of (X, Y). Then, the distance correlation between X and Y is defined by

dcorr(X, Y) = dcov(X, Y) / √{dcov(X, X) dcov(Y, Y)}. (3.2)

Szekely, Rizzo and Bakirov (2007) pointed out that dcorr(X, Y) = 0 if and only if X and Y are independent, and that dcorr(X, Y) is strictly increasing in the absolute value of the Pearson correlation between X and Y when (X, Y) is bivariate normal. Motivated by these properties, Li, Zhong and Zhu (2012) proposed a sure independence screening procedure, called DC-SIS, which ranks all predictors using their distance correlations with the response variable, and proved its sure screening property for ultrahigh dimensional data.

Next, we denote by Xk the kth predictor with k = 1, …, pn and propose to quantify the importance of Xk through its distance correlation with the marginal distribution function of Y, denoted by F (Y). That is,

ω_k = dcorr{X_k, F(Y)}, (3.3)

where F(y) = E{1(Y ≤ y)} and 1(·) denotes the indicator function. This is a modification of the marginal utility in Li, Zhong and Zhu (2012) in that we use F(Y) instead of Y. The marginal utility defined in (3.3) has several distinctive and appealing advantages compared with existing measures.

  1. It is obvious that dcorr{Xk, F (Y)} = 0 if and only if Xk and Y are independent. Following similar arguments in Li, Zhong and Zhu (2012), we can see that this new independence screening procedure based on (3.3) is model-free and hence is applicable to model (1.3) and its sparse model structure (1.4).

  2. Since F(Y) is a bounded function for all types of Y, we can naturally expect that the independence screening procedure using (3.3) has reliable performance when the response is heavy-tailed and when extreme values are present in the response values.

  3. If one suspects that the covariates also contain some extreme values, then one can use the utility dcorr{F_k(X_k), F(Y)} to rank the importance of X_k, where F_k(x) = E{1(X_k ≤ x)}.

In the sequel we describe how to implement the marginal utility (3.3) in the screening procedure. Let {(x_i, Y_i), i = 1, …, n} be a random sample from the population (x, Y). We first estimate the distance covariance between X_k and F(Y) through the method of moments,

d̂cov^2{X_k, F(Y)} = Ŝ_{k,1} + Ŝ_{k,2} − 2Ŝ_{k,3}, (3.4)

where

Ŝ_{k,1} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| |F_n(Y_i) − F_n(Y_j)|,
Ŝ_{k,2} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| × n^{−2} Σ_{i=1}^n Σ_{j=1}^n |F_n(Y_i) − F_n(Y_j)|,
Ŝ_{k,3} = n^{−3} Σ_{i=1}^n Σ_{j=1}^n Σ_{l=1}^n |X_{ik} − X_{lk}| |F_n(Y_j) − F_n(Y_l)|

are the corresponding estimators of S_{k,1}, S_{k,2} and S_{k,3}, and F_n(y) = n^{−1} Σ_{i=1}^n 1(Y_i ≤ y). We estimate ω_k with

ω̂_k = d̂corr{X_k, F(Y)} = d̂cov{X_k, F(Y)} / √[d̂cov(X_k, X_k) · d̂cov{F(Y), F(Y)}]. (3.5)

Our proposed independence screening procedure retains the covariates with the ω^k values larger than a user-specified threshold. Denote

Â = {k : ω̂_k ≥ c n^{−κ}, for 1 ≤ k ≤ p_n}

for some pre-specified thresholds c > 0 and 0 ≤ κ < 1/2. The constants c and κ control the signal strength and will be defined in Condition (C6) below. We refer to this approach as the distance correlation based robust independence screening procedure (DC-RoSIS).
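A minimal sketch of the DC-RoSIS utility follows; it computes ω̂_k from the pairwise sums in (3.4)-(3.5) with the empirical distribution function F_n(Y) in place of F(Y) and keeps the d top-ranked covariates. The implementation is O(n^2 p), ignores ties in Y, and its function names are ours.

```python
import numpy as np

def dcov2(a, b):
    """Sample distance covariance squared between two univariate samples, as in (3.4)."""
    A = np.abs(a[:, None] - a[None, :])            # |a_i - a_j|
    B = np.abs(b[:, None] - b[None, :])            # |b_i - b_j|
    S1 = (A * B).mean()
    S2 = A.mean() * B.mean()
    S3 = (A.mean(axis=1) * B.mean(axis=1)).mean()  # triple-sum term
    return S1 + S2 - 2.0 * S3

def dc_rosis(X, y, d):
    """Rank covariates by omega_hat_k = dcorr{X_k, F_n(Y)} and return the top-d indices."""
    n, p = X.shape
    Fn = (np.argsort(np.argsort(y)) + 1.0) / n     # F_n(Y_i), assuming no ties
    omega = np.zeros(p)
    for k in range(p):
        xk = X[:, k]
        denom = np.sqrt(dcov2(xk, xk) * dcov2(Fn, Fn))
        if denom > 0:
            omega[k] = np.sqrt(max(dcov2(xk, Fn), 0.0) / denom)
    return np.argsort(omega)[::-1][:d], omega
```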

3.2 Sure Screening Property

We first state the consistency of ω̂_k defined in (3.5), which paves the way for proving the sure screening property of the DC-RoSIS procedure.

Theorem 2

Under Condition (C1), for any 0 < γ < 1/2−κ, there exist positive constants c1 and c2 such that

Pr(max_{1≤k≤p_n} |ω̂_k − ω_k| ≥ c n^{−κ}) ≤ O(p_n [exp{−c_1 n^{1−2(κ+γ)}} + n exp(−c_2 n^γ)]). (3.6)

We remark here that to derive the consistency of the estimated marginal utility, we do not assume any moment condition on the response. To prove the sure screening property, we further assume the following condition.

(C6) The marginal utility defined in (3.3) satisfies min_{k∈A} ω_k ≥ 2c n^{−κ} for some constants c > 0 and 0 ≤ κ < 1/2.

Condition (C6) allows the minimal signal of the active covariates to converge to zero as the sample size diverges, yet it requires that the minimum signal of the active covariates not be too small. The sure screening property is stated below.

Theorem 3

(Sure Screening Property) Under condition (C6) and the conditions in Theorem 2, it follows that

Pr(A ⊆ Â) ≥ 1 − O(s_n [exp{−c_1 n^{1−2(κ+γ)}} + n exp(−c_2 n^γ)]), (3.7)

where s_n is the cardinality of A. Thus, Pr(A ⊆ Â) → 1 as n → ∞.

4. Numerical Studies

In this section, we first conduct simulations to demonstrate the finite sample performance of our proposals. We further illustrate the proposed methodology through an empirical analysis of a real data example.

4.1 Simulations

In Example 1 we compare the performance of several independence screening procedures, and in Example 2 we assess the performance of penalized linear quantile regressions with different penalties and at different quantiles. Throughout the simulations we generate x = (X_1, X_2, …, X_p)^T from N(0, Σ), where Σ = (σ_{ij})_{p×p} with σ_{ij} = 0.5^{|i−j|}. The dimensionality is p = 1,000 and the sample size is n = 200.

Example 1

This example is designed to compare the finite sample performance of our proposal DC-RoSIS with existing procedures including SIS (Fan and Lv, 2008), SIRS (Zhu, Li, Li and Zhu, 2011), RRCS (Li, Peng, Zhang and Zhu, 2012) and DC-SIS (Li, Zhong and Zhu, 2012). We repeat each experiment 500 times and evaluate the performance with the following three criteria.

  1. S: The minimum model size to include all active covariates. We summarize the median of S with its associated robust estimate of the standard deviation (RSD = IQR/1.34). A smaller S value indicates a better performance.

  2. Psj : The empirical probability that the active covariate Xj is selected for a given model size d. We set d = 2[n/ log n] throughout.

  3. Pa: The empirical probability that all active covariates are selected for the given model size d = 2[n/ log n]. If the sure screening property holds true, both Psj and Pa values are close to one when the estimated model size d is reasonably large.

We consider the following four models:

Model (1): H(Y) = x^T β + ε,

Model (2): Y = exp(2 − x^T β/2) + (2 − x^T β/2)^2 + exp(x^T β/2) ε,

Model (3): Y = {1 + exp(−3x^T β)}^{−1} ε,

Model (4): Y = β_1 X_1 + β_2 X_2 + β_7 X_7^2 + ε,

where β = (3, 1.5, 0, 0, 0, 0, 2, 0, …, 0)^T. In the above four models, only X_1, X_2 and X_7 are truly important. The random error ε is independently generated from either the standard normal or the standard Cauchy distribution. In model (1), H(Y) = |Y|^λ sgn(Y) − 1 is the Box-Cox transformation; this model was used in Li, Peng, Zhang and Zhu (2012), and we set λ = 1 and λ = 0.25. Both models (2) and (3) are heteroscedastic single-index models. The single index x^T β contributes to both the conditional mean and variance of the response in model (2), and is totally irrelevant to the mean regression function in model (3). The active covariate X_7 in model (4) is quadratically related to the response. Though model (4) is not a special case of model (1.3) or (1.4), we use it here to show that our independence screening procedure works quite well for a variety of regressions even when the model assumptions are violated.
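As an illustration, data from model (2) with the AR-type covariance Σ and Cauchy errors can be generated as in the following sketch (our own code, not the authors' simulation scripts).

```python
import numpy as np

def generate_model2(n=200, p=1000, cauchy=False, seed=0):
    """Simulate model (2): Y = exp(2 - x'b/2) + (2 - x'b/2)^2 + exp(x'b/2)*eps,
    with x ~ N(0, Sigma), Sigma_ij = 0.5^|i-j|, and beta = (3, 1.5, 0, ..., 0, 2, 0, ..., 0)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[[0, 1, 6]] = [3.0, 1.5, 2.0]                      # X1, X2 and X7 are active
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.standard_cauchy(n) if cauchy else rng.standard_normal(n)
    index = X @ beta
    Y = np.exp(2.0 - index / 2.0) + (2.0 - index / 2.0) ** 2 + np.exp(index / 2.0) * eps
    return X, Y
```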

The results are summarized in Table 4.1. It can be seen that SIS does not perform well when ε follows the Cauchy distribution. Even when ε follows the standard normal distribution, SIS still fails to behave well in the nonlinear models (3) and (4). SIRS performs very well for all single-index models. However, SIRS fails to identify X_7 as an important covariate in model (4) because it is not capable of detecting symmetric patterns. The performance of RRCS is generally favorable for models (1) and (2). However, RRCS hardly detects the active covariates that are only relevant to the conditional variance of the response in model (3), or X_7, which exhibits a symmetric pattern with Y in model (4). DC-RoSIS and DC-SIS have similar performance when ε follows the standard normal distribution. When ε follows the Cauchy distribution, DC-RoSIS significantly improves upon DC-SIS. For example, in model (1) with λ = 0.25, DC-SIS fails to detect the true relationship between the two random variables when very extreme values are present.

Table 4.1.

Performance comparison among different independence screening methods for four regression models with two different random errors.

                         ε ~ N(0, 1)                              ε ~ Cauchy distribution
Method                   S             Ps1  Ps2  Ps7  Pa          S             Ps1  Ps2  Ps7  Pa

Model (1) (λ = 1)
SIS                      3.0(0.0)      1.00 1.00 1.00 1.00        220.0(483.4)  0.67 0.62 0.49 0.39
DC-SIS                   3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.7)      0.98 0.98 0.95 0.95
SIRS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00
RRCS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00
DC-RoSIS                 3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00

Model (1) (λ = 0.25)
SIS                      3.0(0.7)      1.00 1.00 0.99 0.99        794.5(210.5)  0.10 0.07 0.09 0.00
DC-SIS                   3.0(0.0)      1.00 1.00 1.00 1.00        702.5(246.6)  0.17 0.14 0.13 0.05
SIRS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00
RRCS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00
DC-RoSIS                 3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.0)      1.00 1.00 1.00 1.00

Model (2)
SIS                      5.0(14.2)     0.99 0.99 0.90 0.90        29.0(122.4)   0.89 0.83 0.69 0.63
DC-SIS                   3.0(0.7)      1.00 1.00 0.99 0.99        3.0(2.9)      0.99 0.99 0.94 0.94
SIRS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.7)      1.00 1.00 1.00 1.00
RRCS                     3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.7)      1.00 1.00 1.00 1.00
DC-RoSIS                 3.0(0.0)      1.00 1.00 1.00 1.00        3.0(0.7)      1.00 1.00 1.00 1.00

Model (3)
SIS                      786.5(217.5)  0.08 0.06 0.07 0.00        791.0(213.2)  0.06 0.07 0.07 0.00
DC-SIS                   4.0(4.5)      1.00 1.00 0.97 0.97        70.0(130.8)   0.92 0.83 0.57 0.52
SIRS                     7.0(8.2)      1.00 1.00 0.99 0.99        8.0(8.9)      1.00 1.00 0.98 0.98
RRCS                     796.0(222.8)  0.10 0.10 0.09 0.00        782.0(253.3)  0.12 0.09 0.06 0.00
DC-RoSIS                 8.0(9.7)      1.00 1.00 0.96 0.96        9.0(11.9)     1.00 1.00 0.96 0.96

Model (4)
SIS                      270.5(400.6)  1.00 1.00 0.25 0.25        594.0(358.0)  0.72 0.62 0.07 0.05
DC-SIS                   4.0(0.7)      1.00 1.00 1.00 1.00        8.0(18.7)     0.99 0.98 0.86 0.86
SIRS                     427.0(419.6)  1.00 1.00 0.11 0.11        493.5(387.7)  1.00 1.00 0.09 0.09
RRCS                     434.0(391.1)  1.00 1.00 0.13 0.13        477.5(394.0)  1.00 1.00 0.10 0.10
DC-RoSIS                 4.0(1.5)      1.00 1.00 1.00 1.00        6.0(5.2)      1.00 1.00 0.99 0.99

Example 2

In this example, we examine the finite sample performance of the penalized linear quantile regression with different penalties, including the LASSO (Tibshirani, 1996), the SCAD (Fan and Li, 2001) and the MCP (Zhang, 2010). We first utilize our proposed screening procedure to select the d = 2[n/log(n)] top-ranked covariates and then apply the penalized linear quantile regression to estimate the direction of β. For the conditional quantile regression, we consider three different quantiles, τ = 0.25, 0.50 and 0.75, which correspond to the first quartile, the median and the third quartile of the response conditional on the covariates. Following Wang, Wu and Li (2012), an additional independent data set of size 10n is generated to select the tuning parameter λ by minimizing the estimated prediction error based on the quantile check loss function.
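The tuning step can be sketched as follows: for each candidate λ, fit the penalized quantile regression on the screened training design and keep the value minimizing the check-loss prediction error on the independent validation sample. For illustration we use an L1-penalized fit (scikit-learn's QuantileRegressor) as a stand-in for the SCAD/MCP fits used in the paper; the grid and function names are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def check_loss(r, tau):
    return r * (tau - (r < 0))

def select_lambda(X_tr, y_tr, X_val, y_val, tau, grid=None):
    """Pick the penalty level minimizing the validation check-loss prediction error."""
    grid = np.logspace(-3, 0, 20) if grid is None else grid
    best_lam, best_err = None, np.inf
    for lam in grid:
        fit = QuantileRegressor(quantile=tau, alpha=lam, solver="highs").fit(X_tr, y_tr)
        err = check_loss(y_val - fit.predict(X_val), tau).mean()
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```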

We denote the final estimator by β̂_τ = (β̂_1, β̂_2, …, β̂_p)^T. Note that the coefficients of covariates removed by the screening procedure are set directly to zero in the final estimator. Based on 100 repetitions, we evaluate the performance in terms of the following criteria.

Size: The average number of nonzero estimated regression coefficients, i.e., the number of β̂_j ≠ 0 for 1 ≤ j ≤ p;

C: The average number of truly non-zero coefficients correctly estimated to be non-zero;

IC: The average number of truly zero coefficients incorrectly estimated to be non-zero;

AE: The average absolute estimation error of β̂_τ, defined by AE = Σ_{j=1}^p |β̂_j sign(β̂_1)/‖β̂_τ‖ − β_{0j} sign(β_{01})/‖β_0‖|, which compares the estimated and true directions since β_0 is identified only up to scale and sign; see the sketch below.
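Under our reading of the AE formula above (rescaling both the estimate and the truth to unit norm, with the sign fixed by the first coordinate), the criterion can be computed as in this sketch; the helper name is hypothetical.

```python
import numpy as np

def absolute_estimation_error(beta_hat, beta0):
    """AE: sum_j |bhat_j*sign(bhat_1)/||bhat|| - b0_j*sign(b0_1)/||b0|||,
    a direction-wise comparison of the estimated and true index vectors."""
    bh = beta_hat * np.sign(beta_hat[0]) / np.linalg.norm(beta_hat)
    b0 = beta0 * np.sign(beta0[0]) / np.linalg.norm(beta0)
    return np.abs(bh - b0).sum()
```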

We only report the results for model (2) in Example 1, which is a heteroscedastic single-index model, as the results for the other models lead to similar conclusions. The simulation results are summarized in Table 4.2. In each column, the value represents the mean over 100 replicates, with the sample standard deviation in parentheses. For the two different random errors and the different quantiles, the first three columns demonstrate that the LASSO is relatively conservative and tends to select larger models, while the SCAD and the MCP consistently select the true model. The relatively small values in the column labeled “AE” show that the proposed penalized linear quantile regression procedure produces consistent estimators and support the theoretical findings in Theorem 1. In conclusion, the satisfactory simulation results demonstrate that the proposed two-stage procedure is indeed robust to the presence of heteroscedasticity and extreme values in the response.

Table 4.2.

Simulation results for the penalized linear quantile regression at different quantile levels (25%, 50% and 75%) and with different penalties (LASSO, SCAD, MCP).

ε ~ N(0, 1)

Method Size C IC AE

LASSO(τ = 0.25) 18.16(6.28) 3.00(0.00) 15.16(6.28) 0.47(0.22)
LASSO(τ = 0.50) 18.14(6.33) 3.00(0.00) 15.14(6.33) 0.93(0.36)
LASSO(τ = 0.75) 13.97(6.16) 2.96(0.20) 11.01(6.15) 1.33(0.57)
SCAD(τ = 0.25) 3.46(0.86) 3.00(0.00) 0.46(0.86) 0.11(0.07)
SCAD(τ = 0.50) 3.68(1.58) 2.96(0.20) 0.72(1.56) 0.28(0.23)
SCAD(τ = 0.75) 3.47(1.58) 2.68(0.55) 0.79(1.52) 0.62(0.36)
MCP(τ = 0.25) 3.36(0.73) 3.00(0.00) 0.36(0.73) 0.11(0.07)
MCP(τ = 0.50) 3.53(1.23) 2.96(0.20) 0.57(1.21) 0.28(0.20)
MCP(τ = 0.75) 3.50(1.68) 2.68(0.55) 0.82(1.62) 0.63(0.36)

ε ~ Cauchy Distribution

Method Size C IC AE

LASSO(τ = 0.25) 23.75(6.63) 3.00(0.00) 20.75(6.63) 0.66(0.25)
LASSO(τ = 0.50) 19.29(7.66) 3.00(0.00) 16.29(7.66) 0.97(0.42)
LASSO(τ = 0.75) 14.01(6.89) 2.88(0.33) 11.13(6.82) 1.34(0.63)
SCAD(τ = 0.25) 3.56(1.19) 3.00(0.00) 0.56(1.19) 0.12(0.08)
SCAD(τ = 0.50) 3.66(1.36) 2.94(0.24) 0.72(1.36) 0.27(0.22)
SCAD(τ = 0.75) 3.33(1.91) 2.60(0.61) 0.73(1.84) 0.63(0.38)
MCP(τ = 0.25) 3.43(0.83) 3.00(0.00) 0.43(0.83) 0.11(0.07)
MCP(τ = 0.50) 3.70(1.67) 2.94(0.24) 0.76(1.66) 0.28(0.24)
MCP(τ = 0.75) 3.57(2.23) 2.64(0.53) 0.93(2.18) 0.65(0.42)

4.2 An Application

In this section we conduct an empirical study of the Cardiomyopathy microarray dataset. This dataset was analyzed by Segal, Dahlquist and Conklin (2003), Hall and Miller (2009) and Li, Zhong and Zhu (2012). The response variable is the genetic overexpression level of a G protein-coupled receptor (Ro1) in mice, which can sense molecules outside the cell and activate internal signal transduction pathways and cellular responses. The covariates are 6,319 gene expression levels. Only 30 specimens are observed. The main goal of this analysis is to determine the genes most influential for the response.

We display both the boxplot and the histogram of Y in Figure 4.1. Both indicate that the response distribution may be heavy-tailed and contain outliers. We first implement independence screening procedures to reduce the covariate dimension to 2[n/log n] = 16. The performances of SIS and DC-SIS are similar to that of DC-RoSIS in this real data analysis; thus, we only present the results of DC-RoSIS followed by regularized quantile regression with different penalties in this example. DC-RoSIS ranks the two genes labeled Msa.2877.0 and Msa.2134.0 at the top, the same as DC-SIS (Li, Zhong and Zhu, 2012). The gene Msa.1166.0, identified by generalized correlation ranking (Hall and Miller, 2009), is also ranked in the top 10 by our screening procedure.

Figure 4.1. Exploratory Data Analysis: Histogram and Boxplot of Ro1.

We further apply our proposed penalized linear quantile regression to the reduced model to estimate the direction of the index parameter and to simultaneously select important variables at different quantiles of the response. We choose the quantile levels τ = 0.25, 0.50 and 0.75, and three different penalties, LASSO, SCAD and MCP. We use BIC to select the tuning parameters for each method. With the estimated single index, denoted x^T β̂_τ, we apply cubic splines to estimate the quantile functions q̂_τ(·) of model (1.3), or equivalently, model (1.4). Figure 4.2 depicts the estimated curves of q̂_τ(x^T β̂_τ) at different quantiles and for different penalties, which demonstrates the computational effectiveness of our proposals.
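One simple way to carry out the spline step is to expand the fitted index in a cubic spline basis and run a nearly unpenalized linear quantile regression on that basis; the sketch below uses scikit-learn's SplineTransformer and QuantileRegressor and is only one possible implementation, not the authors' code.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import QuantileRegressor

def fit_quantile_curve(index, y, tau, n_knots=5):
    """Cubic-spline estimate of q_tau(.) from the fitted single index x'beta_hat."""
    model = make_pipeline(
        SplineTransformer(degree=3, n_knots=n_knots, include_bias=False),
        QuantileRegressor(quantile=tau, alpha=1e-6, solver="highs"),
    )
    model.fit(np.asarray(index).reshape(-1, 1), y)
    return model  # model.predict(new_index.reshape(-1, 1)) gives q_hat_tau
```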

Figure 4.2. The estimated curves of q̂_τ(x^T β̂_τ) (vertical axis) versus x^T β̂_τ (horizontal axis) at different quantiles for different penalties. From left to right, τ = 0.25, 0.50 and 0.75; from top to bottom, LASSO, SCAD and MCP.

To compare the finite sample performances of the different methods at different quantiles, we report the number of nonzero coefficients selected by each method, denoted by “Size” in Table 4.3. In addition, to evaluate the goodness of fit of each model, we follow the idea of R^2 for the linear model and define the quantile-adjusted R^2 (“Q-R^2”) as follows,

Q-R^2 = [1 − Σ_{i=1}^n ρ_τ{Y_i − q̂_τ(x_i^T β̂_τ)} / Σ_{i=1}^n ρ_τ(Y_i − Ŷ_τ)] × 100%, (4.1)

where ρ_τ(·) is the τth quantile check loss function, q̂_τ(·) is the cubic-spline estimate of q_τ(·), q̂_τ(x_i^T β̂_τ) is the estimated τth quantile of Y_i, and Ŷ_τ is the sample τth quantile of Y. The larger Q-R^2 is, the better the model fit. For example, for τ = 0.75, SCAD selected 3 covariates, which explain 93.0% of the variation in the response in terms of the defined Q-R^2. As a benchmark, we also report the model with all 16 genes selected by our screening procedure, denoted by SCREEN in Table 4.3. In addition, we conduct 100 random partitions to examine the prediction performance. For each partition, we randomly select 90% of the data (27 observations) as the training set and the remaining 10% (3 observations) as the test set. The average of the model sizes selected by each method across the 100 partitions, with its standard error in parentheses, is reported in the third column (“Ave Size”) of Table 4.3. In this table, we also report the average quantile-adjusted R^2 for each method on the training set and its associated standard error, denoted by “Ave Q-R^2”. The column labeled “PE” reports the median of the prediction errors based on the quantile check loss function, with its associated robust estimate of the standard deviation (i.e., interquartile range/1.34) in parentheses. In conclusion, the penalized linear quantile regression improves both the model interpretability in terms of model size and the model predictability in terms of prediction error.
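The quantile-adjusted R^2 in (4.1) compares the check loss of the fitted quantile curve with that of the constant fit at the sample τth quantile, so it can be computed directly as in the sketch below (function names ours).

```python
import numpy as np

def check_loss(r, tau):
    return r * (tau - (r < 0))

def quantile_adjusted_r2(y, q_hat, tau):
    """Q-R^2 in (4.1): 1 minus the ratio of fitted to baseline check loss, in percent."""
    baseline = np.quantile(y, tau)        # sample tau-th quantile of Y
    num = check_loss(y - q_hat, tau).sum()
    den = check_loss(y - baseline, tau).sum()
    return (1.0 - num / den) * 100.0
```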

Table 4.3.

Empirical analysis of Cardiomyopathy microarray dataset.

All Data Partitioned Data
Method Size Q-R2 Ave Size Ave Q-R2 PE
SCREEN(τ = 0.25) 16 97.6 16.00(0.00) 94.41(3.66) 0.48(0.21)
SCREEN(τ = 0.50) 16 93.1 16.00(0.00) 92.96(2.67) 0.67(0.27)
SCREEN(τ = 0.75) 16 94.4 16.00(0.00) 94.99(1.31) 0.59(0.33)

LASSO(τ = 0.25) 12 87.8 8.56(1.55) 78.59(10.75) 0.44(0.20)
LASSO(τ = 0.50) 8 89.1 7.21(1.54) 90.71(3.49) 0.55(0.17)
LASSO(τ = 0.75) 5 91.2 5.64(1.25) 94.54(1.71) 0.44(0.22)

SCAD(τ = 0.25) 10 96.9 8.29(2.81) 90.88(6.46) 0.44(0.18)
SCAD(τ = 0.50) 6 92.3 6.52(3.13) 92.33(3.24) 0.58(0.20)
SCAD(τ = 0.75) 3 93.0 3.82(1.46) 94.04(1.45) 0.50(0.25)

MCP(τ = 0.25) 10 96.9 8.69(2.64) 91.71(5.92) 0.43(0.14)
MCP(τ = 0.50) 5 89.3 6.91(2.81) 92.58(3.09) 0.55(0.26)
MCP(τ = 0.75) 4 92.8 4.13(1.64) 94.15(1.46) 0.51(0.28)

5. Discussions

In this paper, we first study the regularized quantile regression for ultrahigh dimensional single-index models, in which the conditional distribution of the response depends on the covariates via a single-index structure. The consistency and the oracle property of the penalized linear quantile regression estimator have been established under the sparsity condition. Then, we propose a robust independence screening procedure based on the distance correlation between the distribution function of the response variable and each covariate, called DC-RoSIS. The new DC-RoSIS enjoys the sure screening property under even milder conditions than the existing alternative methods. It can be applied before the regularized quantile regression to reduce the covariate dimension from ultrahigh dimensionality to a moderate scale. The numerical studies show that the proposed methodology has excellent finite-sample performance compared with other methods.

The proposed method has reliable performance when the distribution of the response variable is heavy-tailed or the response realizations contain extreme values. An interesting point raised by a referee concerns the performance of the proposed procedure in the presence of heavy-tailed predictors or extreme outliers contained in the predictors. In this case, Condition (C1) will be violated and the proposed method may fail. However, we may simply use F_k(X_k), the distribution function of X_k, in place of X_k in the proposed screening procedure. This replacement helps us remove Condition (C1) and achieve robustness in the x-direction. Please see Appendix A in the Supplement for more details. However, implementing the penalized linear quantile regression when x contains outliers is not straightforward. How to remove Condition (C1) in the penalized linear quantile regression would be an interesting topic for future research.

Theorem 1 implies that the oracle estimator is a local minimizer of the objective function (2.2) with probability approaching one as n → ∞. Thus, the proposed method can detect the nonzero components of the true coefficient vector and simultaneously estimate its direction. For statistical inference, if one is interested in the asymptotic distribution of the regularized quantile estimator, one can adapt the idea of Theorem 2 in Wu and Liu (2009), who proved that the SCAD and adaptive-LASSO penalized linear quantile estimators are asymptotically normal when the number of important covariates is fixed. If the number of important covariates diverges to infinity, it becomes much more challenging to derive the asymptotic normality, but it is an interesting direction for future research.


Acknowledgment

Zhong’s research was supported by National Natural Science Foundation of China (NNSFC) 11301435, 71131008 and the Fundamental Research Funds for the Central Universities. Zhu’s research was supported by NNSFC 11371236 and 11422107, Innovation Program of Shanghai Municipal Education Commission 13ZZ055, Pujiang Project of Science and Technology Commission of Shanghai Municipality 12PJ1403200 and Program for New Century Excellent Talents, Ministry of Education of China NCET-12-0901. Runze Li is the corresponding author and his research was supported by National Institute of Health (NIH) grants P50-DA10075, P50 DA036107, R01 CA168676, R01 MH096711 and NNSFC 11028103. Cui’s research was supported by NNSFC 11071022, 11028103 and 11231010, Key project of Beijing Municipal Educational Commission and Beijing Center for Mathematics and Information Interdisciplinary Sciences. The authors thank the Editor, the Associate Editor and three anonymous referees for their constructive comments, which have led to a significant improvement of the earlier version of this paper. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NNSFC.

Appendix. Proof of Theorems

Appendix A: Proof of Lemma 2

We require the following lemma to prove Lemma 2.

Lemma 3

According to (2.4), E{I(Y − x_A^T β_{τ1}^o ≤ u_τ^o) | x_A} = τ. That is, the oracle estimator u_τ^o of u is the τth quantile of Y − x_A^T β_{τ1}^o conditional on x_A.

Proof of Lemma 3

Let ξ_τ be the τth quantile of Y − x_A^T β_{τ1}^o conditional on x_A. By definition, we have E{I(Y − x_A^T β_{τ1}^o ≤ ξ_τ) | x_A} = τ. It suffices to show that ℒ_τ(ξ_τ, β_{τ1}^o) ≤ ℒ_τ(u, β_{τ1}^o) holds for any u. To be specific,

ℒ_τ(u, β_{τ1}^o) − ℒ_τ(ξ_τ, β_{τ1}^o) = E{ρ_τ(Y − u − x_A^T β_{τ1}^o)} − E{ρ_τ(Y − ξ_τ − x_A^T β_{τ1}^o)}
  = E[(u − ξ_τ){I(Y − ξ_τ − x_A^T β_{τ1}^o ≤ 0) − τ}] + E[∫_0^{u−ξ_τ} {I(Y − ξ_τ − x_A^T β_{τ1}^o ≤ t) − I(Y − ξ_τ − x_A^T β_{τ1}^o ≤ 0)} dt] ≥ 0,

where the second equality follows from Knight (1998). In the second equality, the first term is zero and the second is nonnegative. Thus ξ_τ = u_τ^o and the desired conclusion follows.

Proof of Lemma 2

To prove Lemma 2, we borrow the idea of He and Shao (2000) on M-estimation. It suffices to show that for any fixed η > 0, there exist two constants Δ_1 and Δ_2 such that for all sufficiently large n,

Pr{inf_{|u|=Δ_2, ‖γ‖=Δ_1} ℒ_{τn}(u_τ^o + n^{−1/2} q_n^{1/2} u, β_{τ1}^o + n^{−1/2} q_n^{1/2} γ) > ℒ_{τn}(u_τ^o, β_{τ1}^o)} ≥ 1 − η.

We define that

G_n(u, γ) ≡ n q_n^{−1} {ℒ_{τn}(u_τ^o + n^{−1/2} q_n^{1/2} u, β_{τ1}^o + n^{−1/2} q_n^{1/2} γ) − ℒ_{τn}(u_τ^o, β_{τ1}^o)}
  = q_n^{−1} Σ_{i=1}^n n^{−1/2} q_n^{1/2} (u + x_{i,A}^T γ) {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ u_τ^o) − τ}
  + q_n^{−1} Σ_{i=1}^n ∫_0^{n^{−1/2} q_n^{1/2} (u + x_{i,A}^T γ)} {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ u_τ^o + s) − I(Y_i − x_{i,A}^T β_{τ1}^o ≤ u_τ^o)} ds
  ≡ I_{n1} + I_{n2},

where the second equality follows from Knight (1998)’s identity.
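For reference, the identity of Knight (1998) invoked here is the standard one (stated in our notation): for any u and v,

ρ_τ(u − v) − ρ_τ(u) = −v{τ − I(u < 0)} + ∫_0^v {I(u ≤ s) − I(u ≤ 0)} ds,

applied above with the residual Y_i − x_{i,A}^T β_{τ1}^o − u_τ^o in the role of u and the local perturbation in the role of v.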

Note that E{I(Y − x_A^T β_{τ1}^o ≤ u_τ^o) | x_A} = τ by Lemma 3, and hence E(I_{n1}) = 0. Moreover,

var(I_{n1}) = E{var(I_{n1} | x_A)} + var{E(I_{n1} | x_A)} = τ(1 − τ) q_n^{−1} E{n^{−1} Σ_{i=1}^n (u + x_{i,A}^T γ)^2} ≤ 2τ(1 − τ) q_n^{−1} [u^2 + λ_max{E(x_A x_A^T)} ‖γ‖^2] ≤ C q_n^{−1} (Δ_1^2 + Δ_2^2),

where the last inequality follows from Condition (C2). Thus, I_{n1} = O_p(q_n^{−1/2}) (Δ_1^2 + Δ_2^2)^{1/2}.

Next we evaluate I_{n2}. Denote by F(· | x_A) and f(· | x_A) the conditional distribution and density of Y − x_A^T β_{τ1}^o given x_A, respectively. Then

E(I_{n2}) = q_n^{−1} E[Σ_{i=1}^n ∫_0^{n^{−1/2} q_n^{1/2} (u + x_{i,A}^T γ)} {F(u_τ^o + s | x_{i,A}) − F(u_τ^o | x_{i,A})} ds]
  = q_n^{−1} E[Σ_{i=1}^n ∫_0^{n^{−1/2} q_n^{1/2} (u + x_{i,A}^T γ)} f(u_τ^o + s | x_{i,A}) s ds]
  ≤ C q_n^{−1} E[Σ_{i=1}^n {n^{−1/2} q_n^{1/2} (u + x_{i,A}^T γ)}^2] = C E(u + x_A^T γ)^2 ≤ C [1 + λ_max{E(x_A x_A^T)}] (u^2 + ‖γ‖^2) ≤ C (Δ_1^2 + Δ_2^2),

where the first inequality follows from Condition (C3) and the last inequality follows from Condition (C2). Therefore, E(I_{n2}) = O(1)(Δ_1^2 + Δ_2^2). Next we consider the variance of I_{n2}:

var(I_{n2}) ≤ n q_n^{−2} E[∫_0^{n^{−1/2} q_n^{1/2} (u + x_A^T γ)} {I(Y − x_A^T β_{τ1}^o ≤ u_τ^o + s) − I(Y − x_A^T β_{τ1}^o ≤ u_τ^o)} ds]^2 ≤ n q_n^{−2} E{n^{−1/2} q_n^{1/2} (u + x_A^T γ)}^2 ≤ q_n^{−1} [1 + λ_max{E(x_A x_A^T)}] (u^2 + ‖γ‖^2) = O(q_n^{−1} (Δ_1^2 + Δ_2^2)),

which converges to zero as n → ∞ because q_n = O(n^{c_1}). This indicates that |I_{n2} − E(I_{n2})| = o_p(1) by Chebyshev's inequality. Since I_{n2} is always nonnegative,

I_{n2} = E(I_{n2}) + o_p(1) ≥ C(Δ_1^2 + Δ_2^2) + o_p(1).

For sufficiently large Δ_1 and Δ_2, I_{n2} dominates I_{n1} asymptotically as n → ∞. Therefore, for any fixed η > 0, there exist two constants Δ_1 and Δ_2 such that for all sufficiently large n, we have G_n(u, γ) > 0 with probability at least 1 − η.

Appendix B: Proof of Theorem 1

We follow the idea of the proof of Theorem 2.4 in Wang, Wu and Li (2012) to prove Theorem 1; note that their moment conditions on x are different. With slight abuse of notation, we absorb the intercept into the design and write x_A = (1, x_A^T)^T, β_τ^o = (u_τ^o, β_τ^{oT})^T as defined in (2.4), β̂_τ = (û_τ, β̂_τ^T)^T and β̂_τ^o = (û_τ^o, β̂_τ^{oT})^T, where β̂_τ denotes the penalized linear quantile estimator defined in (2.3) and β̂_τ^o = (β̂_{τ1}^{oT}, 0^T)^T is the oracle estimator defined in (2.5). Accordingly, we write β_{τ1}^o = (u_τ^o, β_{τ1}^{oT})^T and β̂_{τ1}^o = (û_τ^o, β̂_{τ1}^{oT})^T.

We first write the objective function (2.2) of the penalized linear quantile regression as the difference of two convex functions in β. Here, we only consider the proof for the SCAD penalty; the proof for the MCP penalty follows from similar arguments. To be precise, Q(β) = g(β) − h(β), where g(β) = n^{−1} Σ_{i=1}^n ρ_τ(Y_i − x_i^T β) + λ Σ_{j=1}^{p_n} |β_j| and h(β) = Σ_{j=1}^{p_n} H_λ(|β_j|), with

H_λ(|β_j|) = 0 if 0 ≤ |β_j| < λ;  (β_j^2 − 2λ|β_j| + λ^2)/{2(a − 1)} if λ ≤ |β_j| ≤ aλ;  λ|β_j| − (a + 1)λ^2/2 if |β_j| > aλ.

Thus, the subdifferential of h(β) at any β is

∂h(β) = {μ = (μ_0, μ_1, …, μ_{p_n})^T ∈ ℝ^{p_n+1} : μ_0 = 0, μ_j = ∂h(β)/∂β_j, j = 1, 2, …, p_n}.

The subdifferential of g(β) at any β is

∂g(β) = {ξ = (ξ_0, ξ_1, …, ξ_{p_n})^T ∈ ℝ^{p_n+1} : ξ_j = (1 − τ) n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β < 0) − τ n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β > 0) − n^{−1} Σ_{i=1}^n X_{ij} v_i + λ l_j},

where v_i = 0 if Y_i − x_i^T β ≠ 0 and v_i ∈ [τ − 1, τ] otherwise; l_0 = 0; l_j = sgn(β_j) if β_j ≠ 0 and l_j ∈ [−1, 1] otherwise, for 1 ≤ j ≤ p_n.

Let s(β̂) = {s_0(β̂), s_1(β̂), …, s_{p_n}(β̂)}^T be the vector of subgradient functions for the unpenalized quantile regression, where

s_j(β) = (1 − τ) n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β < 0) − τ n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β > 0) − n^{−1} Σ_{i=1}^n X_{ij} v_i,

where v_i = 0 if Y_i − x_i^T β̂ ≠ 0 and v_i ∈ [τ − 1, τ] otherwise.

Next we present Lemmas 4, 5 and 6 to facilitate the proof of Theorem 1. Tao and An (1997) proposed a numerical algorithm based on the convex difference representation, which is stated in Lemma 4. Lemmas 5 and 6 characterize the properties of the oracle estimator β̂_τ^o and the associated subgradient functions s(β̂_τ^o), respectively.

Lemma 4

(Difference Convex Program) Let g(x) and h(x) be two convex functions. Let x* be a point that admits a neighborhood U such that ∂h(x) ∩ ∂g(x*) ≠ ∅ for all x ∈ U ∩ dom(g). Then x* is a local minimizer of g(x) − h(x).

Lemma 5

Assume Conditions (C4)-(C5) hold and λ = o{n^{−(1−c_2)/2}}. For the oracle estimator β̂_τ^o, there exist v_i^* satisfying v_i^* = 0 if Y_i − x_i^T β̂_τ^o ≠ 0 and v_i^* ∈ [τ − 1, τ] otherwise, such that, with probability approaching one,

s_j(β̂_τ^o) = 0, j = 0, 1, …, q_n,  and  |β̂_j^o| ≥ (a + 1/2)λ, j = 1, …, q_n.
Proof of Lemma 5

This lemma is parallel to Lemma 2.2 in Wang, Wu and Li (2012). The unpenalized quantile loss objective function is convex. By convex optimization theory, 0 ∈ ∂ Σ_{i=1}^n ρ_τ(Y_i − x_i^T β̂_τ^o). Therefore, there exist v_i^* such that s_j(β̂_τ^o) = 0 with v_i = v_i^* for j = 0, 1, …, q_n. On the other hand,

min_{1≤j≤q_n} |β̂_j^o| ≥ min_{1≤j≤q_n} |β_{τ,j}^o| − max_{1≤j≤q_n} |β̂_j^o − β_{τ,j}^o|.

Condition (C5) requires that min_{1≤j≤q_n} |β_{τ,j}^o| ≥ C n^{−(1−c_2)/2}. In addition, max_{1≤j≤q_n} |β̂_j^o − β_{τ,j}^o| ≤ ‖β̂_{τ1}^o − β_{τ1}^o‖ = O_p(√(q_n/n)) = O_p(n^{−(1−c_1)/2}) = o_p(n^{−(1−c_2)/2}). Therefore, min_{1≤j≤q_n} |β̂_j^o| ≥ C n^{−(1−c_2)/2} − o_p(n^{−(1−c_2)/2}), where c_1 and c_2 are defined in Conditions (C4) and (C5), respectively. For λ = o{n^{−(1−c_2)/2}}, we have that, with probability approaching one, |β̂_j^o| ≥ (a + 1/2)λ, j = 1, …, q_n, which completes the proof.

Lemma 6

Assume Conditions (C1)-(C5) hold, λ = o{n^{−(1−c_2)/2}}, and log p_n = o(n^{min{c_2−2θ, θ}}) for some constant 0 < θ < (c_2 − c_1)/2. For the oracle estimator β̂_τ^o and the subgradient functions s_j(β̂_τ^o), with probability approaching one,

|s_j(β̂_τ^o)| ≤ λ  and  β̂_j^o = 0, j = q_n + 1, …, p_n.
Proof of Lemma 6

This lemma is parallel to Lemma 2.3 in Wang, Wu and Li (2012). Since β̂_τ^o is the oracle estimator, β̂_j^o = 0 for j = q_n + 1, …, p_n. It remains to show that

Pr(|s_j(β̂_τ^o)| > λ, for some j = q_n + 1, …, p_n) → 0,  as n → ∞.

Let D = {i : Y_i − x_i^T β̂_τ^o = 0} = {i : Y_i − x_{i,A}^T β̂_{τ1}^o = 0}. Then, for j = q_n + 1, …, p_n,

s_j(β̂_τ^o) = (1 − τ) n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β̂_τ^o < 0) − τ n^{−1} Σ_{i=1}^n X_{ij} I(Y_i − x_i^T β̂_τ^o > 0) − n^{−1} Σ_{i=1}^n X_{ij} v_i^*
  = n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_i^T β̂_τ^o ≤ 0) − τ} − n^{−1} Σ_{i=1}^n X_{ij} {v_i^* + (1 − τ) I(Y_i − x_i^T β̂_τ^o = 0)}
  = n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_{i,A}^T β̂_{τ1}^o ≤ 0) − τ} − n^{−1} Σ_{i∈D} X_{ij} {v_i^* + (1 − τ)},

where v_i^* ∈ [τ − 1, τ] for i ∈ D satisfies s_j(β̂_τ^o) = 0 with v_i = v_i^*, for j = 1, …, q_n, by Lemma 5. Therefore,

Pr(|s_j(β̂_τ^o)| > 2λ, for some j = q_n + 1, …, p_n)
  ≤ Pr(|n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_{i,A}^T β̂_{τ1}^o ≤ 0) − τ}| > λ, for some j = q_n + 1, …, p_n)
  + Pr(|n^{−1} Σ_{i∈D} X_{ij} {v_i^* + (1 − τ)}| > λ, for some j = q_n + 1, …, p_n)
  ≡ T_{n1} + T_{n2}.

First, we deal with T_{n2}. Let M = O(n^θ) with some constant 0 < θ < (c_2 − c_1)/2. We have

T_{n2} ≤ Pr(max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| ≤ M} {v_i^* + (1 − τ)}| > λ/2) + Pr(max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| > M} {v_i^* + (1 − τ)}| > λ/2) ≡ T_{n21} + T_{n22}.

Since (x_{i,A}, Y_i) are in general positions (Koenker, 2005, Section 2.2), with probability tending to one there are exactly q_n + 1 elements in D. Thus, with probability tending to one,

max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| ≤ M} {v_i^* + (1 − τ)}| ≤ M(q_n + 1) n^{−1} = O(n^{θ+c_1−1}) = o(λ),

where the last equality holds for λ = o{n^{−(1−c_2)/2}} and 0 < θ < (c_2 − c_1)/2. Therefore, T_{n21} → 0 as n → ∞. Next, we deal with T_{n22}. Note that

{|n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| > M} {v_i^* + (1 − τ)}| > λ/2} ⊆ {|X_{ij}| > M, for some i ∈ D},

because if |X_{ij}| ≤ M for all i ∈ D, then n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| > M} = 0.

Therefore,

T_{n22} ≤ p_n max_{j=q_n+1,…,p_n} Pr(|n^{−1} Σ_{i∈D} X_{ij} 1{|X_{ij}| > M} {v_i^* + (1 − τ)}| > λ/2) ≤ p_n (q_n + 1) max_{i∈D, q_n+1≤j≤p_n} Pr(|X_{ij}| > M) ≤ p_n (q_n + 1) exp(−tM) E{exp(t|X_{ij}|)} ≤ C p_n (q_n + 1) exp(−tM) = C p_n O(n^{c_1}) exp(−t n^θ) → 0,

as n → ∞, where log p_n = o(n^{min{c_2−2θ, θ}}) with some constant 0 < θ < (c_2 − c_1)/2 and 0 < t ≤ t_0; the third inequality holds by Markov's inequality and the fourth inequality follows from Condition (C1). Therefore, T_{n2} = T_{n21} + T_{n22} → 0 as n → ∞.

It remains to show that

Pr(|n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_i^T β̂_τ^o < 0) − τ}| > λ, for some j = q_n + 1, …, p_n) → 0,

as n → ∞. We consider

T_{n1} ≤ Pr(max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0) − τ}| > λ/2)
  + Pr(max_{j=q_n+1,…,p_n} sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} |n^{−1} Σ_{i=1}^n X_{ij} [I(Y_i − x_{i,A}^T β_1 ≤ 0) − I(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0) − {Pr(Y_i − x_{i,A}^T β_1 ≤ 0) − Pr(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0)}]| > λ/4)
  + Pr(max_{j=q_n+1,…,p_n} sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} |n^{−1} Σ_{i=1}^n X_{ij} {Pr(Y_i − x_{i,A}^T β_1 ≤ 0) − Pr(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0)}| > λ/4)
  =: J_{n1} + J_{n2} + J_{n3}.

First, let us consider J_{n1}. We choose M = O(n^θ) with 0 < θ < (c_2 − c_1)/2; then

J_{n1} ≤ Pr(max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| ≤ M} {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0) − τ}| > λ/4) + Pr(max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| > M} {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0) − τ}| > λ/4) ≡ J_{n11} + J_{n12}.

By Hoeffding’s inequality, we have that

Pr(|n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| ≤ M} {I(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0) − τ}| > λ/4) ≤ 2 exp{−nλ^2/(8M^2)}.

Thus, J_{n11} ≤ 2p_n exp{−nλ^2/(8M^2)} = 2p_n exp(−n^{1−2θ} λ^2/8) → 0 as n → ∞, because log p_n = o(n^{min{c_2−2θ, θ}}) with some constant 0 < θ < (c_2 − c_1)/2 and λ = o{n^{−(1−c_2)/2}}. On the other hand, following the arguments that deal with T_{n22}, we have

J_{n12} ≤ p_n max_{j=q_n+1,…,p_n} Pr(n^{−1} Σ_{i=1}^n |X_{ij}| 1{|X_{ij}| > M} > λ/4) ≤ p_n n max_{1≤i≤n, q_n+1≤j≤p_n} Pr(|X_{ij}| > M) = O(p_n n) exp(−t n^θ) → 0,

as n → ∞, because log p_n = o(n^{min{c_2−2θ, θ}}). Therefore, J_{n1} = J_{n11} + J_{n12} = o(1).

Following arguments similar to those in the proof of Lemma 4.3 of Wang, Wu and Li (2012) and the arguments that deal with T_{n22} and J_{n12}, we can show that J_{n2} = o(1). It remains to deal with J_{n3}. For a fixed M = O(n^θ) with 0 < θ < (c_2 − c_1)/2,

J_{n3} ≤ Pr(max_{j=q_n+1,…,p_n} sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} |n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| ≤ M} {Pr(Y_i − x_{i,A}^T β_1 ≤ 0) − Pr(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0)}| > λ/8)
  + Pr(max_{j=q_n+1,…,p_n} sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} |n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| > M} {Pr(Y_i − x_{i,A}^T β_1 ≤ 0) − Pr(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0)}| > λ/8)
  ≡ J_{n31} + J_{n32}.

To handle Jn31, we observe that

max_{j=q_n+1,…,p_n} sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} |n^{−1} Σ_{i=1}^n X_{ij} 1{|X_{ij}| ≤ M} {Pr(Y_i − x_{i,A}^T β_1 ≤ 0) − Pr(Y_i − x_{i,A}^T β_{τ1}^o ≤ 0)}|
  ≤ M sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} E{f(ζ | x_A) |x_A^T(β_1 − β_{τ1}^o)|}
  ≤ C M sup_{‖β_1 − β_{τ1}^o‖ ≤ Δ√(q_n/n)} λ_max^{1/2}{E(x_A x_A^T)} ‖β_1 − β_{τ1}^o‖ ≤ O{n^θ (q_n/n)^{1/2}} = O{n^{−(1−c_1−2θ)/2}},

where f(· | x_A) is defined in Condition (C3) and ζ lies between u_τ^o + x_A^T(β_1 − β_{τ1}^o) and u_τ^o; thus the second inequality follows from Condition (C3) and the Cauchy-Schwarz inequality, and the third inequality follows from Condition (C2). Consequently, together with λ = o{n^{−(1−c_2)/2}}, we have J_{n31} ≤ Pr{O(n^{−(1−c_1−2θ)/2}) > λ/8} = o(1) if 0 < θ < (c_2 − c_1)/2. Following arguments similar to those for J_{n12}, we obtain J_{n32} = o(1). Therefore, J_{n3} = J_{n31} + J_{n32} = o(1). Consequently,

Pr{max_{j=q_n+1,…,p_n} |n^{−1} Σ_{i=1}^n X_{ij} {I(Y_i − x_{i,A}^T β̂_{τ1}^o < 0) − τ}| > λ} ≤ J_{n1} + J_{n2} + J_{n3} = o(1),

which implies that Pr{|s_j(β̂_τ^o)| > λ for some j = q_n + 1, …, p_n} → 0. This completes the proof of Lemma 6.

With Lemmas 5 and 6 for the random x with sub-exponential tail probability Condition (C1), we can follow the technical proof of Theorem 2.4 of Wang, Wu and Li (2012) to obtain the oracle property and complete the proof.

Proof of Theorem 2

For notational clarity, we use c_1 and c_2 to denote generic positive constants. First we assume F(y) is known, and estimate dcov^2{X_k, F(Y)} by S̃_{k,1} + S̃_{k,2} − 2S̃_{k,3}, where

S̃_{k,1} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| |F(Y_i) − F(Y_j)|,
S̃_{k,2} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| × n^{−2} Σ_{i=1}^n Σ_{j=1}^n |F(Y_i) − F(Y_j)|,
S̃_{k,3} = n^{−3} Σ_{i=1}^n Σ_{j=1}^n Σ_{l=1}^n |X_{ik} − X_{lk}| |F(Y_j) − F(Y_l)|.

Similarly, we define ω̃_k as in (3.5) with Ŝ_{k,m} replaced by S̃_{k,m}, m = 1, 2, 3. Theorem 1 of Li, Zhong and Zhu (2012) states that, for any 0 < γ < 1/2 − κ, there exist positive constants c_1 > 0 and c_2 > 0 such that

Pr(max_{1≤k≤p_n} |ω̃_k − ω_k| ≥ c n^{−κ}) ≤ O(p_n [exp{−c_1 n^{1−2(κ+γ)}} + n exp(−c_2 n^γ)]). (C.1)

To prove Theorem 2, it thus suffices to show that the difference between ω̃_k and ω̂_k defined in (3.5) is negligible when the sample size n is large enough, which amounts to studying the differences between S̃_{k,m} and Ŝ_{k,m} for m = 1, 2, 3. We sketch the proof for the case m = 1 only, because the proofs for the other two cases are essentially the same. Recall that S̃_{k,1} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| |F(Y_i) − F(Y_j)| and Ŝ_{k,1} = n^{−2} Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| |F_n(Y_i) − F_n(Y_j)|. Then

Pr(max_{1≤k≤p_n} |S̃_{k,1} − Ŝ_{k,1}| ≥ ε) = Pr(max_{1≤k≤p_n} n^{−2} |Σ_{i=1}^n Σ_{j=1}^n |X_{ik} − X_{jk}| {|F(Y_i) − F(Y_j)| − |F_n(Y_i) − F_n(Y_j)|}| ≥ ε)
  ≤ Pr(max_{1≤k≤p_n} (A_n B_n)^{1/2} ≥ ε)
  ≤ Pr(max_{1≤k≤p_n} (A_n B_n)^{1/2} ≥ ε, |X_k| ≤ M) + Pr(max_{1≤k≤p_n} (A_n B_n)^{1/2} ≥ ε, |X_k| > M) ≡ T_1 + T_2,

where M is a positive constant specified later, A_n = n^{−2} Σ_{i=1}^n Σ_{j=1}^n (X_{ik} − X_{jk})^2, and B_n = n^{−2} Σ_{i=1}^n Σ_{j=1}^n {|F(Y_i) − F(Y_j)| − |F_n(Y_i) − F_n(Y_j)|}^2.

Using the fact that ||x| − |y|| ≤ |x − y| ≤ |x| + |y|, we obtain

||F_n(Y_i) − F_n(Y_j)| − |F(Y_i) − F(Y_j)|| ≤ |F_n(Y_i) − F(Y_i)| + |F_n(Y_j) − F(Y_j)| ≤ 2 max_{1≤i≤n} |F_n(Y_i) − F(Y_i)|.

Also, because max_{1≤k≤p_n} A_n ≤ max_{1≤k≤p_n} n^{−2} Σ_{i=1}^n Σ_{j=1}^n 2(X_{ik}^2 + X_{jk}^2) ≤ 4M^2 on the event {|X_k| ≤ M}, we have

T_1 ≤ Pr[max_{1≤k≤p_n} 2M n^{−1} {Σ_{i=1}^n Σ_{j=1}^n (|F(Y_i) − F(Y_j)| − |F_n(Y_i) − F_n(Y_j)|)^2}^{1/2} ≥ ε]
  ≤ Pr{4M max_{1≤i≤n} |F_n(Y_i) − F(Y_i)| ≥ ε}
  ≤ Pr{sup_{y∈ℝ} |F_n(y) − F(y)| ≥ ε/(4M)}
  ≤ 2 exp{−2n(ε/(4M))^2} = 2 exp{−nε^2/(8M^2)}, (C.2)

where the last inequality follows by Dvoretzky-Kiefer-Wolfowitz inequality.
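For completeness, the Dvoretzky-Kiefer-Wolfowitz inequality used here (in the form with Massart's tight constant) states that for any ε > 0,

Pr{sup_{y∈ℝ} |F_n(y) − F(y)| > ε} ≤ 2 exp(−2nε^2),

which yields the displayed bound with ε replaced by ε/(4M).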

For the second term, for all 0 < s ≤ t_0, where t_0 is defined in Condition (C1),

T_2 ≤ Pr(max_{1≤k≤p_n} |X_k| > M) = Pr(max_{1≤k≤p_n} exp(s|X_k|) > exp(sM)) ≤ max_{1≤k≤p_n} E{exp(s|X_k|)} exp(−sM) ≤ C exp(−sM), (C.3)

where C is a positive constant; the second inequality follows from Markov's inequality and the last inequality from Condition (C1).

Then, by choosing M = O(n^γ) for 0 < γ < 1/2 − κ, (C.2) and (C.3) together imply that, for some positive constants c_1 and c_2,

Pr(max_{1≤k≤p_n} |S̃_{k,1} − Ŝ_{k,1}| ≥ ε) ≤ 2 exp{−nε^2/(8M^2)} + C exp(−sM) ≤ 2 exp(−c_1 ε^2 n^{1−2γ}) + C exp(−c_2 n^γ). (C.4)

Thus, it is not difficult to show that

Pr(max_{1≤k≤p_n} |ω̂_k − ω̃_k| ≥ c n^{−κ}) ≤ O(exp{−c_1 n^{1−2(κ+γ)}} + exp(−c_2 n^γ)). (C.5)

Therefore, (C.1) and (C.5) together complete the proof of Theorem 2.

Footnotes

Supplementary Materials

The supplementary material of this paper consists of Appendices S.A and S.B. In Appendix S.A, we propose to quantify the importance of X_k through the distance correlation between the respective marginal distribution functions of X_k and Y. In Appendix S.B, we provide extra simulations for Example 1 with more active covariates.

Contributor Information

Wei Zhong, Wang Yanan Institute for Studies in Economics, Department of Statistics and Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China. wzhong@xmu.edu.cn.

Liping Zhu, School of Statistics and Management and Key Laboratory of Mathematical Economics, Ministry of Education, Shanghai University of Finance and Economics, Shanghai 200433, China. zhu.liping@mail.shufe.edu.cn.

Runze Li, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802. rzli@psu.edu.

Hengjian Cui, School of Mathematical Science, Capital Normal University, Beijing 100048, China. hjcui@bnu.edu.cn.

References

  1. Altham PME. Improving the precision of estimation by fitting a generalized linear model and quasi-likelihood. Journal of the Royal Statistical Society, Series B. 1984;46:118–119. [Google Scholar]
  2. Bickel P, Levina E. Regularized estimation of large covariance matrices. Annals of Statistics. 2008;36:199–227. [Google Scholar]
  3. Dvoretzky A, Kiefer J, Wolfowitz J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics. 1956;27:642–669. [Google Scholar]
  4. Fan Y, Fan J, Barut E. Adaptive Robust Variable Selection. Annals of Statistics. 2014;42:324–351. doi: 10.1214/13-AOS1191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]
  6. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion) Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Fan Y, Lv J. Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association. 2013;108:1044–1061. [Google Scholar]
  8. Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research. 2009;10:1829–1853. [PMC free article] [PubMed] [Google Scholar]
  9. Hall P, Li KC. On almost linearity of low dimensional projection from high dimensional data. Annals of Statistics. 1993;21:867–889. [Google Scholar]
  10. Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics. 2009;18:533–550. [Google Scholar]
  11. Härdle W, Hall P, Ichimura H. Optimal smoothing in single-index models. Annals of Statistics. 1993;21:157–178. [Google Scholar]
  12. He X, Shao Q. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;73:120–135. [Google Scholar]
  13. Knight K. Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics. 1998;26:755–770. [Google Scholar]
  14. Koenker R. Quantile Regression. Cambridge University Press; 2005. [Google Scholar]
  15. Kong E, Xia Y. Variable Selection for the single-index model. Biometrika. 2007;94:217–229. [Google Scholar]
  16. Li G, Peng H, Zhang J, Zhu L. Robust rank correlation based screening. The Annals of Statistics. 2012;40:1846–1877. [Google Scholar]
  17. Li K-C. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association. 1991;86:316–327. [Google Scholar]
  18. Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Liang H, Liu X, Li R, Tsai C. Estimation and testing for partially linear single-index models. Annals of Statistics. 2010;38:3811–3836. doi: 10.1214/10-AOS835. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Naik PA, Tsai C-L. Single-index model selections. Biometrika. 2001;88:821–832. [Google Scholar]
  21. Powell JL, Stock JH, Stoker TM. Semiparametric estimation of index coefficients. Econometrica. 1989;57:1403–1430. [Google Scholar]
  22. Segal MR, Dahlquist KD, Conklin BR. Regression approach for microarray data analysis. Journal of Computational Biology. 2003;10:961–980. doi: 10.1089/106652703322756177. [DOI] [PubMed] [Google Scholar]
  23. Szekely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Annals of Statistics. 2007;35:2769–2794. [Google Scholar]
  24. Tao PD, An LTH. Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Mathematica Vietnamica. 1997;22:289–355. [Google Scholar]
  25. Tibshirani R. Regression shrinkage and selection via LASSO. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. [Google Scholar]
  26. Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association. 2012;107:214–222. doi: 10.1080/01621459.2012.656014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Wu Y, Liu Y. Variable selection in quantile regression. Statistica Sinica. 2009;19:801–817. [Google Scholar]
  28. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38:894–942. [Google Scholar]
  29. Zheng Z, Fan Y, Lv J. High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society Series B. 2014;76:627–649. [Google Scholar]
  30. Zhu L, Huang M, Li R. Semiparametric quantile regression with high-dimensional covariates. Statistica Sinica. 2012;22:1379–1401. doi: 10.5705/ss.2010.199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Zhu LP, Li L, Li R, Zhu LX. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association. 2011;106:1464–1475. doi: 10.1198/jasa.2011.tm10563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Zhu L, Qian L, Lin J. Variable selection in a class of single-index models. Annals of the Institute of Statistical Mathematics. 2011;63:1277–1293. [Google Scholar]
  33. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]
  34. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802. [DOI] [PMC free article] [PubMed] [Google Scholar]
