Author manuscript; available in PMC: 2015 Dec 1.
Published in final edited form as: Econometric Theory. 2014 Jun 6;30(6):1272–1314. doi: 10.1017/S0266466614000176

Efficient Regressions via Optimally Combining Quantile Information*

Zhibiao Zhao 1, Zhijie Xiao 2
PMCID: PMC4251566  NIHMSID: NIHMS605447  PMID: 25484481

Abstract

We develop a generally applicable framework for constructing efficient estimators of regression models via quantile regressions. The proposed method is based on optimally combining information over multiple quantiles and can be applied to a broad range of parametric and nonparametric settings. When combining information over a fixed number of quantiles, we derive an upper bound on the distance between the efficiency of the proposed estimator and the Fisher information. As the number of quantiles increases, this upper bound decreases and the asymptotic variance of the proposed estimator approaches the Cramér-Rao lower bound under appropriate conditions. In the case of non-regular statistical estimation, the proposed estimator leads to super-efficient estimation. We illustrate the proposed method for several widely used regression models. Both asymptotic theory and Monte Carlo experiments demonstrate its superior performance over existing methods.

Keywords: Asymptotic normality, Bahadur representation, Efficiency, Fisher information, Quantile regression, Super-efficiency

1 Introduction

For regression estimation, the most widely used approach is the least squares (LS) method (for finite-dimensional models) or the local LS method (in infinite-dimensional settings). If the data are normally distributed, the LS estimator has a likelihood interpretation and is the most efficient estimator. In the absence of Gaussianity, the LS estimator is usually less efficient than methods that exploit the distributional information, although it may still be consistent under appropriate regularity conditions. Without these regularity conditions, the LS estimator may not even be consistent, for example, when the data have a heavy-tailed distribution such as the Cauchy distribution. Monte Carlo evidence also indicates that the LS estimator can be quite sensitive to outliers. In empirical analysis, many applications (e.g., in finance and economics) involve data with heavy-tailed or skewed distributions, and the LS estimator may perform poorly in these cases. It is therefore important to develop robust and efficient estimation procedures for general innovation distributions.

If the underlying distribution were known, the Maximum Likelihood Estimator (MLE) could be constructed. Under regularity conditions, the MLE is asymptotically normal and asymptotically efficient in the sense that its limiting covariance matrix attains the Cramér-Rao lower bound. In practice, the true density function is generally unknown, so the MLE is infeasible. Nevertheless, the MLE and the Cramér-Rao bound serve as a standard against which we should measure our estimators.

For this reason, statisticians have devoted a great deal of research effort to the construction of estimation procedures that can extract distributional information from the data and thus deliver more efficient estimators than the conventional LS method. For the location model Y = α + ε, where ε has a symmetric density, adaptive likelihood or score function based estimators of α were constructed in Beran (1974) and Stone (1975). Bickel (1982) further extended the idea to slope estimation in classical linear models. For nonlinear models, adaptive likelihood-based estimation is usually technically challenging.

We believe that the quantile regression technique [Koenker and Bassett (1978); Koenker (2005)] provides a useful route to efficient statistical estimation. Intuitively, an estimation method that exploits the distributional information can potentially deliver more efficient estimators. Since quantile regression provides a way of estimating the whole conditional distribution, appropriately using quantile regressions may improve estimation efficiency. Under regularity assumptions, the least-absolute-deviation (LAD) regression (i.e., quantile regression at the median) can provide better estimators than the LS regression in the presence of heavy-tailed distributions. In addition, for certain distributions, a quantile regression at a non-median quantile may deliver a more efficient estimator than the LAD method. More importantly, additional efficiency gains can be achieved by combining information over multiple quantiles.

Although combining quantile regressions over multiple quantiles can potentially improve estimation efficiency, this is often much easier said than done in a satisfactory way. To combine information from quantile regressions, one may combine information over different quantiles via the criterion or loss function. For example, Zou and Yuan (2008) and Bradic, Fan and Wang (2011) proposed the composite quantile regression (CQR) for parameter estimation and variable selection in classical linear regression models. For nonparametric regression models, Kai, Li and Zou (2010) proposed a local CQR estimation procedure, which is asymptotically equivalent to the local LS estimator as the number of quantiles increases. Alternatively, one may combine estimators obtained at different quantiles. Along this direction, Portnoy and Koenker (1989) studied asymptotically efficient estimation for the simple linear regression model. Although their estimator is asymptotically efficient, it is not the best estimator for a fixed set of quantiles. See also Chamberlain (1994), Xiao and Koenker (2009), and Chen, Linton and Jacho-Chavez (2011) for related work on combining estimators.

In this paper we consider regression estimation by combining information across k quantiles τj = j/(k + 1), j = 1, …, k. We show that for a wide range of parametric and nonparametric regression models, more efficient estimators can be constructed via optimally combining quantile regressions. We argue that it is essential to combine quantile information appropriately to achieve an efficiency gain. In particular, simple averaging of multiple quantile regression estimators is asymptotically equivalent to the LS method. We show that, by optimally combining information across quantiles τ1, …, τk, the efficiency of the proposed optimal weighted quantile average estimator is at most Φk away from the Fisher information, where Φk is defined in (43). As the number of quantiles k → ∞, under appropriate regularity conditions, we have Φk → 0 and the estimator is asymptotically efficient. Interestingly, in the case of non-regular statistical estimation when these regularity conditions do not hold, the proposed estimators may lead to super-efficient estimation.

The proposed methodology provides a generally applicable framework for constructing more efficient estimators in a broad variety of settings. For finite-dimensional parametric estimation, the method can be applied to construct efficient estimators of parameters in both linear and nonlinear regression models with homoscedastic errors, and of parameters in location-scale models with conditional heteroscedasticity. We show that, in the presence of conditional heteroscedasticity, an appropriate preliminary quantile regression is needed to improve efficiency and to facilitate the quantile combination. Different restrictions (and thus optimal weights) are needed for the estimation of the location parameters and the scale parameters. For nonparametric function estimation, the asymptotic bias of the proposed estimator is the same as that of conventional nonparametric estimators (such as the local LS and local LAD estimators), while the inverse of its asymptotic variance is at most Φk away from the optimal Fisher information. Our extensive simulation studies show that the proposed method significantly outperforms the widely used LS, LAD, and CQR methods [Zou and Yuan (2008); Kai, Li and Zou (2010)] for both parametric and nonparametric models.

The rest of this paper is organized as follows: we provide a general discussion of the framework and assumptions for constructing efficient estimators based on quantile regressions in Section 2. Three leading cases of regression are then investigated in Sections 3–5. In particular, we study parametric regression models with homoscedasticity in Section 3. In Section 4, we study parametric models with heteroscedasticity. Nonparametric models are investigated in Section 5. We focus on methodology, and our discussions in Sections 3–5 consider the case with finite k. The case with k increasing to infinity is discussed in Section 6, and estimation of the optimal weights is discussed in Section 7. Simulation studies are contained in Section 8, and an application to financial data is given in Section 9 to highlight the proposed method. Proofs are given in the Appendix.

2 Model Setup, Assumptions, and the Weighted Quantile Average Estimator

We consider regression models of the following form:

$$Y = m(X) + \sigma(X)\varepsilon, \tag{1}$$

where (X, Y, ε) is the triplet of covariate, response, and noise, with ε independent of X. Here m(·) and σ(·) are two functions that depend on unknown parameters θ, where θ may be finite-dimensional (parametric case) or infinite-dimensional (nonparametric case).

Denote by Qε(τ) the τ-th quantile of ε for τ ∈ (0, 1). Then the τ-th conditional quantile of Y given X, denoted by QY (τ|X), is given by

$$Q_Y(\tau|X) = m(X) + \sigma(X)\, Q_\varepsilon(\tau). \tag{2}$$

As the inverse of the conditional distribution function, QY (τ|X) fully captures the distributional relationship between Y and X. Intuitively, different distributional information may be obtained from different quantiles, and an appropriate combination of multiple quantiles may be more informative about the distribution than the conditional mean in the LS methods.

Throughout this paper we consider combining information over k equally spaced quantiles τj = j/(k + 1), j = 1, …, k. The discussion of this paper focuses on the case where k is a given finite number; the case where k increases with n is considered in Section 6.3.

We briefly introduce the idea of our proposed estimator. Let θ be a parameter of interest. From the conditional quantile in (2), we can usually identify some perturbed version of θ, denoted by θ(τ). Suppose there exists a class 𝒲 of weights such that

$$\sum_{j=1}^k \omega_j \theta(\tau_j) = \theta, \qquad \omega = [\omega_1, \ldots, \omega_k]^T \in \mathcal{W}. \tag{3}$$

Given data on (X, Y), we can use a quantile regression based on (2) to obtain a consistent estimate, denoted by θ̂(τ), of θ(τ). In light of (3), we propose the following estimator of θ:

$$\hat\theta_{\mathrm{WQAE}}(\omega) = \sum_{j=1}^k \omega_j \hat\theta(\tau_j), \qquad \omega \in \mathcal{W}. \tag{4}$$

We call θ̂WQAE(ω) the weighted quantile average estimator (WQAE) of θ. Since θ̂WQAE(ω) is a consistent estimate of θ for each ω ∈ 𝒲, we propose in this paper estimation of θ based on the ω ∈ 𝒲 that minimizes the limiting variance of θ̂WQAE(ω) in parametric settings, or the asymptotic mean squared error in nonparametric settings. If ω is chosen in such a way, we call the corresponding estimator θ̂WQAE(ω*) the optimal WQAE (OWQAE).
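To fix ideas, the combination step in (4) is just a weighted sum; a minimal Python sketch (the helper name `wqae` is ours), assuming the per-quantile estimates θ̂(τj) have already been obtained from k separate quantile regressions:

```python
import numpy as np

def wqae(theta_hat, weights):
    """Weighted quantile average estimator in (4): combine per-quantile
    estimates theta_hat (shape (k,) or (k, p), row j holding theta_hat(tau_j))
    using weights omega from the constraint set W."""
    return np.asarray(weights, dtype=float) @ np.asarray(theta_hat, dtype=float)
```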

The proposed estimation can be applied to a wide range of regression models. In this paper, we focus on the following three leading cases of regression (1):

  • Case 1: Parametric regression models with homoscedastic errors.

  • Case 2: Location-scale models with conditional heteroscedasticity in σ(·).

  • Case 3: Nonparametric regressions.

We study each of these cases in Sections 3–5 below.

Suppose we have samples $\{(X_t, Y_t)\}_{t=1}^n$ from (1) with corresponding noises $\{\varepsilon_t\}_{t=1}^n$. To facilitate the study of asymptotic theory, we impose the following assumptions.

Assumption 1 (i) {(Xt, εt)}t∈ℤ is strictly stationary; for each t, εt is independent of {Xt, Xt−1, …; εt−1, εt−2, …}. (ii) {Xt}t∈ℤ is an ergodic process.

Assumption 2 Denote by fε(·) and Fε(·) the density and distribution functions of ε. fε is positive, twice differentiable, and bounded on {u : 0 < Fε(u) < 1}.

Assumption 1 provides a convenient framework for studying asymptotic theory. First, the strong mixing condition implies ergodicity, and thus Assumption 1(ii) is weaker than the widely used strong mixing conditions in time series analysis. Next, we illustrate two useful properties below.

  • (P1)

    Martingale structure. Let ℱt be the σ-algebra generated by {Xt+1, Xt, …; εt, εt−1, …}. By Assumption 1(i), εt is independent of ℱt−1. For any functions h1(·) and h2(·) such that 𝔼[|h1(Xt)h2(εt)|] < ∞ and 𝔼[h2(εt)] = 0, we have 𝔼[h1(Xt)h2(εt)|ℱt−1] = h1(Xt)𝔼[h2(εt)] = 0. Therefore, {h1(Xt)h2(εt)}t∈ℤ are martingale differences with respect to the filtration {ℱt}t∈ℤ, and consequently $\sum_{t=1}^n h_1(X_t)h_2(\varepsilon_t)$ is a martingale.

  • (P2)
    Law of large numbers. By ergodic theorem [Theorem 3.5.7, Stout (1974)], for any function ℓ(·) such that 𝔼[|ℓ(Xt)|] < ∞, the ergodicity in Assumption 1(ii) implies
    $$\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^n \ell(X_t) = \mathbb{E}[\ell(X_0)], \quad \text{in probability.} \tag{5}$$

In subsequent sections we adopt the following notation. For a random vector Z, we use ‖Z‖ to denote the Euclidean norm of Z, and write Z ∈ ℒq, q > 0, if $\mathbb{E}(\|Z\|^q) < \infty$.

3 Homoscedastic Parametric Regression Models

In this section we study the parametric regression model (Case 1 in Section 2). Corresponding to the general representation (1), let σ(X) ≡ 1 and m(X) = α + m(X, β), where β ∈ ℝp is the vector of unknown parameters and the intercept α is added to ensure the identifiability of β. We then have

$$Y = \alpha + m(X, \beta) + \varepsilon. \tag{6}$$

We are interested in the estimation of β.

By (2), we have the conditional quantile representation

$$Q_Y(\tau|X) = \alpha(\tau) + m(X, \beta), \qquad \text{where } \alpha(\tau) = \alpha + Q_\varepsilon(\tau). \tag{7}$$

Given samples $\{(X_t, Y_t)\}_{t=1}^n$ from (6), let (α̂(τ), β̂(τ)) be an estimator of (α(τ), β) from the quantile regression based on (7):

$$(\hat\alpha(\tau), \hat\beta(\tau)) = \underset{(a,b)}{\arg\min} \sum_{t=1}^n \rho_\tau\{Y_t - a - m(X_t, b)\}, \tag{8}$$

where $\rho_\tau(z) = z(\tau - \mathbf{1}_{z\le 0})$ and 1 is the indicator function. Denote by ṁ(x, β) the partial derivative vector of m(x, β) with respect to β. Define

$$D_n = \begin{bmatrix} 1 & \dot m(X_1, \beta)^T \\ \vdots & \vdots \\ 1 & \dot m(X_n, \beta)^T \end{bmatrix} \in \mathbb{R}^{n\times(p+1)} \qquad\text{and}\qquad Z_t = \begin{bmatrix} 1 \\ \dot m(X_t, \beta) \end{bmatrix} \in \mathbb{R}^{p+1}.$$

Similar to linear regression, Dn serves as the design matrix and Zt is the equivalent covariate corresponding to observation t for the quantile regression (8). A leading example is the classical linear regression model [see, e.g., Koenker (1984)], corresponding to m(X, β) = XTβ. In this case, QY(τ|X) = [α + Qε(τ)] + XTβ and ṁ(X, β) = X.

Assumption 3 The quantile regression estimator has the Bahadur representation

$$\begin{bmatrix} \hat\alpha(\tau) \\ \hat\beta(\tau) \end{bmatrix} = \begin{bmatrix} \alpha(\tau) \\ \beta \end{bmatrix} + \frac{(D_n^T D_n)^{-1}}{f_\varepsilon(Q_\varepsilon(\tau))} \sum_{t=1}^n Z_t\big[\tau - \mathbf{1}_{\varepsilon_t < Q_\varepsilon(\tau)}\big] + o_p(n^{-1/2}), \tag{9}$$

uniformly in τ ∈ 𝒯 ≔ [δ, 1 − δ] with some small constant δ > 0.

Assumption 3 is an asymptotic representation of the quantile regression estimator. Under regularity conditions on the regression function m(·), error density, and the parameter space, a Bahadur representation can be obtained over τ on a subset of [0,1]. See, e.g., Portnoy and Koenker (1989), Jurečková and Procházka (1994), and He and Shao (1996) for related study. Also see Section 4 for discussions on the conditional heteroscedastic parametric models.

Since β does not depend on τ, we can use β̂(τ) to estimate β for any choice of τ. By Theorem 1 below (see also the definition of Σβ there),

$$\sqrt{n}\,[\hat\beta(\tau) - \beta] \Rightarrow N\!\left(0,\; \Sigma_\beta^{-1}\,\frac{\tau(1-\tau)}{f_\varepsilon^2(Q_\varepsilon(\tau))}\right). \tag{10}$$

For example, the case with τ = 0.5 corresponds to the median quantile regression or LAD estimation of β.

As discussed in Section 2, we want to combine information over the k quantiles τj = j/(k + 1), j = 1, …, k, where k is a given finite number such that τj ∈ 𝒯. Since β̂(τ) is a consistent estimate of β, following (3)–(4) we consider the WQAE of β:

$$\hat\beta_{\mathrm{WQAE}}(\omega) = \sum_{j=1}^k \omega_j \hat\beta(\tau_j), \qquad \text{where } \sum_{j=1}^k \omega_j = 1. \tag{11}$$

Theorem 1 Suppose Assumptions 1–3 hold and ṁ(X, β) ∈ ℒ2, where X is distributed as Xt. Then

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1} S(\omega)), \tag{12}$$

with $\Sigma_\beta = \mathbb{E}[\dot m(X, \beta)\dot m(X, \beta)^T] - \mathbb{E}[\dot m(X, \beta)]\,\mathbb{E}[\dot m(X, \beta)^T]$ assumed to be non-singular, and

$$S(\omega) = \omega^T H \omega \qquad\text{with}\qquad H = \left\{\frac{\min(\tau_j, \tau_{j'}) - \tau_j\tau_{j'}}{f_\varepsilon(Q_\varepsilon(\tau_j))\, f_\varepsilon(Q_\varepsilon(\tau_{j'}))}\right\}_{1\le j, j' \le k} \in \mathbb{R}^{k\times k}. \tag{13}$$

The proposed estimator, the OWQAE, of β is obtained by choosing ω to minimize the asymptotic variance of β̂WQAE(ω).

Theorem 2 Under the assumptions of Theorem 1, the optimal weight is

$$\omega^* = \underset{\omega_1 + \cdots + \omega_k = 1}{\arg\min}\, S(\omega) = \frac{H^{-1} e_k}{e_k^T H^{-1} e_k}, \qquad \text{where } e_k = (1, \ldots, 1)^T. \tag{14}$$

With ω* in (14), the OWQAE of β has the following limiting distribution:

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega^*) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1}\Omega_k^{-1}), \qquad \text{where } \Omega_k = e_k^T H^{-1} e_k. \tag{15}$$
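For readers who wish to experiment, the following Python sketch implements (13)–(15) directly; `sparsity_matrix` and `optimal_weights` are hypothetical helper names, and the standard normal is used only because its sparsity function fε(Qε(τ)) is available in closed form. For N(0,1) the resulting weights should agree with Example 1 below up to rounding.

```python
import numpy as np
from scipy.stats import norm

def sparsity_matrix(taus, fQ):
    """H in (13): H[j,j'] = [min(t_j,t_j') - t_j t_j'] / (fQ[j] * fQ[j']),
    where fQ[j] = f_eps(Q_eps(tau_j)) is the sparsity at tau_j."""
    t = np.asarray(taus, dtype=float)
    fQ = np.asarray(fQ, dtype=float)
    return (np.minimum.outer(t, t) - np.outer(t, t)) / np.outer(fQ, fQ)

def optimal_weights(taus, fQ):
    """omega* = H^{-1} e_k / (e_k' H^{-1} e_k) in (14); also returns the
    k-quantile optimal efficiency Omega_k = e_k' H^{-1} e_k in (15)."""
    H = sparsity_matrix(taus, fQ)
    e = np.ones(len(taus))
    Hinv_e = np.linalg.solve(H, e)
    Omega_k = e @ Hinv_e
    return Hinv_e / Omega_k, Omega_k

k = 9
taus = np.arange(1, k + 1) / (k + 1)     # tau_j = j/(k+1) = 0.1, ..., 0.9
fQ = norm.pdf(norm.ppf(taus))            # f_eps(Q_eps(tau_j)) for N(0,1) errors
omega_star, Omega_k = optimal_weights(taus, fQ)
```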

Remark 1: A quick way of combining quantile information is to take a simple average of the quantile regression estimators. This is easy to implement and has been used in the literature [see, e.g., Kai, Li and Zou (2010) for nonparametric estimation] as a method of combining quantile information. If we use ω = [1/k, …, 1/k]T in (11), the resulting unweighted estimator has the asymptotic normality in Theorem 1 with S(ω) replaced by

$$R_k \equiv \frac{1}{k^2} \sum_{j=1}^k \sum_{j'=1}^k \frac{\min(\tau_j, \tau_{j'}) - \tau_j\tau_{j'}}{f_\varepsilon(Q_\varepsilon(\tau_j))\, f_\varepsilon(Q_\varepsilon(\tau_{j'}))}. \tag{16}$$

Clearly, $R_k \ge S(\omega^*)$. See Section 6.2 for more discussion of the properties of Rk.

We compute ω* for some examples below using k = 9 quantiles 0.1, 0.2, …, 0.9.

Example 1. Let ε be Student-t distributed. For t1 (the Cauchy distribution), the optimal weight is ω* = {−0.03, −0.04, 0.08, 0.29, 0.40, 0.29, 0.08, −0.04, −0.03}: the quantiles τ = 0.4, 0.5, 0.6 contribute almost all the information, whereas the quantiles τ = 0.1, 0.2, 0.8, 0.9 receive negative weights, so the unweighted quantile average estimator would not perform well. For N(0,1), however, ω* = {0.13, 0.11, 0.11, 0.10, 0.10, 0.10, 0.11, 0.11, 0.13} is close to the uniform weights, and thus the OWQAE, the unweighted quantile average estimator, and the LS estimator have comparable performance.

Example 2. Let ε have a normal mixture distribution. For Mixture 1: 0.5N(0,1)+0.5N(0,0.56) (different variances), ω* = {−0.002, −0.102, 0.183, 0.277, 0.287, 0.277, 0.183, −0.102, −0.002}: the quantiles 0.3, …, 0.7 contain substantial information, whereas the quantiles 0.2 and 0.8 receive negative weights. For Mixture 2: 0.5N(−2,1)+0.5N(2,1) (different means), ω* = {0.185, 0.156, 0.153, 0.078, −0.144, 0.078, 0.153, 0.156, 0.185}: the quantiles τ = 0.1, 0.2, 0.3, 0.7, 0.8, 0.9 contribute comparably, while the median performs the worst.

Example 3. Let ε be a Gamma random variable with shape parameter d > 0. For d = 1 (the exponential distribution), $\omega_1^* = 1.152$, $\omega_2^* = -0.124$, and $\omega_i^* \approx 0$ for i = 3, …, 9. The quantiles 0.1 and 0.2 contain almost all the information.

As shown in Examples 1–3 above, different quantiles may carry substantially different amounts of information, and utilizing this information inappropriately may result in a significant loss of efficiency. This phenomenon provides strong evidence in favor of our proposed optimally weighted quantile-based estimators.

In practice, the optimal weight ω* in (14), which depends on the sparsity or quantile-density function fε(Qε(τ)), needs to be estimated. We make the following assumption on the estimate, denoted by f̂ε(Q̂ε(τ)), of fε(Qε(τ)).

Assumption 4 $\sup_{\tau\in\mathcal{T}} |\hat f_\varepsilon(\hat Q_\varepsilon(\tau)) - f_\varepsilon(Q_\varepsilon(\tau))| = o_p(1)$ for 𝒯 in Assumption 3.

Plugging the consistent estimate f̂ε(Q̂ε(τ)) of fε(Qε(τ)) into the matrix H in (14), we obtain the following consistent estimate of the optimal weight ω*:

$$\hat\omega^* = \frac{\hat H^{-1} e_k}{e_k^T \hat H^{-1} e_k}, \qquad\text{where}\qquad \hat H = \left\{\frac{\min(\tau_j, \tau_{j'}) - \tau_j\tau_{j'}}{\hat f_\varepsilon(\hat Q_\varepsilon(\tau_j))\, \hat f_\varepsilon(\hat Q_\varepsilon(\tau_{j'}))}\right\}_{1\le j, j' \le k}. \tag{17}$$

Theorem 3 below asserts that β̂WQAE(ω̂*), with the estimated weight ω̂*, achieves the same efficiency as β̂WQAE(ω*).

Theorem 3 Under the assumptions of Theorem 1 and Assumption 4, we have

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\hat\omega^*) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1}\Omega_k^{-1}). \tag{18}$$
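In code, the plug-in step of Theorem 3 amounts to feeding estimated sparsity values into the same routine; the sketch below reuses `taus` and `optimal_weights` from the earlier snippet, with scipy's Gaussian KDE standing in for the density estimator (the paper's own procedure, with a Silverman bandwidth, is spelled out in Section 7), and simulated placeholders for the residuals:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
taus = np.arange(1, 10) / 10.0
eps_hat = rng.standard_normal(500)       # placeholder for regression residuals
Q_hat = np.quantile(eps_hat, taus)       # sample quantiles Q_hat_eps(tau_j)
fQ_hat = gaussian_kde(eps_hat)(Q_hat)    # f_hat(Q_hat(tau_j)), cf. Assumption 4
omega_hat, _ = optimal_weights(taus, fQ_hat)   # omega_hat* as in (17)
```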

4 The Location-Scale Models

Another class of widely used regression models is the location-scale model (Case 2 in Section 2), which allows for conditional heteroscedasticity. There is a large literature in econometrics and statistics on location-scale models. Koenker and Zhao (1994) studied L-estimation of a location-scale model of the form

$$Y_t = X_t^T\beta + \sigma_t\varepsilon_t, \qquad\text{where } \sigma_t = X_t^T\gamma, \tag{19}$$

under the condition $X_t^T\gamma > 0$. Zhao (2001) studied asymptotically efficient median regression using the k-nearest neighbors method. In this section we study location-scale models via optimal quantile combination.

In the model (19), the positivity constraint $X_t^T\gamma > 0$ is somewhat restrictive for flexible applications. For example, it is violated for normally distributed covariates X. For this reason, many researchers consider an alternative form of σt, expressed as a linear function of absolute values of the regressors and other variables:

$$Y_t = X_t^T\beta + \sigma_t\varepsilon_t, \qquad\text{where } \sigma_t = U_t^T\gamma \equiv \sigma(U_t), \tag{20}$$

where Ut is a vector of absolute values of the regressors and other covariates [see, e.g., Koenker and Zhao (1996) for studies of related models]. For example, with $X_t^T = (x_{t1}, \ldots, x_{tp})$, one may consider a location-scale model with $\sigma_t = \sigma(U_t) = U_t^T\gamma = \gamma_0 + \sum_{j=1}^p \gamma_j |x_{tj}|$, where $U_t^T = (1, |x_{t1}|, \ldots, |x_{tp}|)$, γ = (γ0, γ1, ⋯, γp), γ0 > 0, γ1 ≥ 0, ⋯, γp ≥ 0.

In this section we consider the location-scale regression model (20).1 We are interested in the estimation of β and γ. By (2), we have the conditional quantile representation

$$Q_Y(\tau|X) = X^T\beta + U^T\gamma(\tau) \qquad\text{with}\qquad \gamma(\tau) = \gamma\, Q_\varepsilon(\tau). \tag{21}$$

Given a sample of size n, we may estimate (β, γ(τ)) using a quantile regression similar to (8). However, in the presence of conditional heteroscedasticity, it is more efficient to use a weighted quantile regression with weights reflecting the conditional heteroscedasticity. In addition, the weighted quantile regression estimates have nice properties that help in combining quantile information.

Thus, following the idea of Koenker and Zhao (1994), we consider the weighted quantile regression:

$$(\hat\beta(\tau), \hat\gamma(\tau)) = \underset{(b,r)}{\arg\min} \sum_{t=1}^n \frac{1}{\tilde\sigma_t}\,\rho_\tau\{Y_t - X_t^T b - U_t^T r\}, \tag{22}$$

where $\tilde\sigma_t = U_t^T\tilde\gamma$ and γ̃ is a consistent estimate of γ.
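Since ρτ(cz) = cρτ(z) for c > 0, the weighted problem (22) is equivalent to an ordinary quantile regression after dividing Yt, Xt, and Ut by σ̃t. A sketch exploiting that observation, using statsmodels (the function name `weighted_qr` is ours; U is assumed to already contain the intercept column as in (20)):

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def weighted_qr(tau, Y, X, U, gamma_tilde):
    """Weighted quantile regression (22): because rho_tau is positively
    homogeneous, dividing through by sigma_tilde_t = U_t' gamma_tilde
    reduces (22) to an unweighted quantile regression on rescaled data."""
    sigma = U @ gamma_tilde                        # sigma_tilde_t, length n
    Z = np.hstack([X, U]) / sigma[:, None]         # rescaled (X_t, U_t)
    fit = QuantReg(Y / sigma, Z).fit(q=tau)
    p = X.shape[1]
    return fit.params[:p], fit.params[p:]          # beta_hat(tau), gamma_hat(tau)
```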

Assumption 5 (i) {(Xt, Ut, εt)}t∈ℤ is strictly stationary; for each t, εt is independent of {(Xt, Ut), (Xt−1,Ut−1), …; εt−1, εt−2, …}. (ii) {(Xt, Ut)}t∈ℤ is an ergodic process.

Assumption 6 (i) ‖Xt‖ + ‖Ut‖ ≤ c1 and $U_t^T\gamma \ge c_2$ for some constants c1, c2 > 0. (ii) γ̃ − γ = op(n−1/4). (iii) Let (X, U) be distributed as (Xt, Ut). Define the matrices

$$M_1 = \mathbb{E}\!\left[\frac{XX^T}{\sigma(U)^2}\right], \qquad M_2 = \mathbb{E}\!\left[\frac{XU^T}{\sigma(U)^2}\right], \qquad M_3 = \mathbb{E}\!\left[\frac{UU^T}{\sigma(U)^2}\right],$$

and

$$M_\beta = M_1 - M_2 M_3^{-1} M_2^T, \qquad M_\gamma = M_3 - M_2^T M_1^{-1} M_2.$$

The matrices M1, M3, Mβ and Mγ are non-singular.

Assumption 5 modifies Assumption 1 to allow for the additional covariates (Xt, Ut). In Assumption 6, (i) is imposed simply for technical convenience and can be replaced by suitable finite moment conditions, (ii) requires that γ̃ be reasonably close to γ, and (iii) rules out a singular design matrix.

Theorem 4 Suppose Assumptions 2, 5, and 6 hold. Then we have

$$\begin{bmatrix} \hat\beta(\tau) \\ \hat\gamma(\tau) \end{bmatrix} = \begin{bmatrix} \beta \\ \gamma(\tau) \end{bmatrix} + \frac{1}{n f_\varepsilon(Q_\varepsilon(\tau))} \sum_{t=1}^n Z_t\big[\tau - \mathbf{1}_{\varepsilon_t < Q_\varepsilon(\tau)}\big] + o_p(n^{-1/2}), \tag{23}$$

with

$$Z_t = \begin{bmatrix} M_\beta^{-1}\,(X_t - M_2 M_3^{-1} U_t)/\sigma(U_t) \\ M_\gamma^{-1}\,(U_t - M_2^T M_1^{-1} X_t)/\sigma(U_t) \end{bmatrix}.$$

We now construct estimators of β and γ by optimally combining information over quantiles τ1, …, τk.

First, we consider estimation of β. As in Section 3, we consider the WQAE β̂WQAE(ω) given by (11). Using the Bahadur representation (23) from Theorem 4, the same argument as in Theorem 1 yields

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega) - \beta] \Rightarrow N(0, M_\beta^{-1} S(\omega)),$$

with S(ω) given in Theorem 1. Therefore, the optimal weight can be constructed in the same way as described in Theorem 2, and ω* is given by (14). The OWQAE has the following limiting distribution

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega^*) - \beta] \Rightarrow N(0, M_\beta^{-1}\Omega_k^{-1}),$$

with Ωk given in (15). If we use the estimated optimal weight ω̂* in (17), under the additional Assumption 4, the conclusion in Theorem 3 also holds here.

Next, we consider estimation of the scale parameter γ via quantile combination. As will be clear from the analysis below, the construction of the WQAE and the choice of optimal weights for the scale parameter differ from those for β. For this reason, we denote the weights used in the estimation of γ by π = [π1, …, πk]T. From (21)–(22), γ̂(τ) is an estimate of γ(τ) = γQε(τ). Then, for any π satisfying $\sum_{j=1}^k \pi_j Q_\varepsilon(\tau_j) = 1$, we have $\sum_{j=1}^k \pi_j \gamma(\tau_j) = \gamma$. Therefore, we propose the following WQAE of γ:

$$\hat\gamma_{\mathrm{WQAE}}(\pi) = \sum_{j=1}^k \pi_j \hat\gamma(\tau_j), \qquad \text{where } \sum_{j=1}^k \pi_j Q_\varepsilon(\tau_j) = 1. \tag{24}$$

Theorem 5 Under the assumptions in Theorem 4, we have the asymptotic normality:

$$\sqrt{n}\,[\hat\gamma_{\mathrm{WQAE}}(\pi) - \gamma] \Rightarrow N(0, M_\gamma^{-1} S(\pi)),$$

where S(π) = πTHπ with H defined in (13). Furthermore, the optimal weight is

$$\pi^* = \underset{\pi_1 Q_\varepsilon(\tau_1) + \cdots + \pi_k Q_\varepsilon(\tau_k) = 1}{\arg\min}\, S(\pi) = \frac{H^{-1} q}{q^T H^{-1} q}, \qquad\text{where } q = [Q_\varepsilon(\tau_1), \ldots, Q_\varepsilon(\tau_k)]^T. \tag{25}$$

With π* in (25), the OWQAE of γ has the following limiting distribution:

$$\sqrt{n}\,[\hat\gamma_{\mathrm{WQAE}}(\pi^*) - \gamma] \Rightarrow N(0, M_\gamma^{-1}\Lambda_k^{-1}), \qquad\text{where } \Lambda_k = q^T H^{-1} q. \tag{26}$$
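The optimal scale weights in (25)–(26) require only the same linear algebra as ω*; a short sketch reusing `sparsity_matrix` from the Section 3 snippet (the name `optimal_scale_weights` is ours):

```python
import numpy as np

def optimal_scale_weights(taus, fQ, Q):
    """pi* = H^{-1} q / (q' H^{-1} q) in (25), with q_j = Q_eps(tau_j);
    also returns Lambda_k = q' H^{-1} q in (26)."""
    H = sparsity_matrix(taus, fQ)     # H from (13), as in the earlier sketch
    q = np.asarray(Q, dtype=float)
    Hinv_q = np.linalg.solve(H, q)
    Lambda_k = q @ Hinv_q
    return Hinv_q / Lambda_k, Lambda_k
```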

Therefore, the optimal weights for the OWQAE of β and γ are different, and the corresponding OWQAEs have different efficiencies. This is due to the structure of the conditional quantile representation (21): β does not depend on the quantile τ, whereas γ enters through the coefficient Qε(τ).

Similar to the case of β, the conclusion in Theorem 3 also holds for γ̂WQAE(π̂*) when we use the estimated optimal weight π̂* obtained by plugging consistent estimates of (q, H) into (25).

To implement the weighted quantile regression (22), we need to find a consistent estimate γ̃ of γ. We propose the following procedure:

  1. For each quantile τ = τ1, …, τk, fit the unweighted quantile regression
    $$(\tilde\beta(\tau), \tilde\gamma(\tau)) = \underset{(b,r)}{\arg\min} \sum_{t=1}^n \rho_\tau\{Y_t - X_t^T b - U_t^T r\}. \tag{27}$$
    By the same argument as in Theorem 4, $(\tilde\beta(\tau), \tilde\gamma(\tau)) = (\beta, \gamma(\tau)) + O_p(n^{-1/2})$.
  2. Let
    $$\tilde\gamma = \sum_{j=1}^k |\tilde\gamma(\tau_j)|. \tag{28}$$
    Then $\tilde\gamma = \gamma \sum_{j=1}^k |Q_\varepsilon(\tau_j)| + O_p(n^{-1/2})$. Note that, in (22), it suffices for γ̃ to estimate γ up to a multiplicative factor. Thus, γ̃ in (28) satisfies Assumption 6(ii); a code sketch of this two-step construction follows below.
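A sketch of steps 1–2, reusing `QuantReg` from the earlier snippet (the name `preliminary_gamma` is ours); its output can then be passed to `weighted_qr` at each τj:

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def preliminary_gamma(taus, Y, X, U):
    """Steps 1-2 above: unweighted quantile regressions (27) at each tau_j,
    then gamma_tilde = sum_j |gamma_tilde(tau_j)| as in (28)."""
    p, Z = X.shape[1], np.hstack([X, U])
    gammas = [QuantReg(Y, Z).fit(q=t).params[p:] for t in taus]
    return np.sum(np.abs(gammas), axis=0)
```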

Finally, we point out an identifiability issue for the optimal weight π* in (25). Qε(τ) is identifiable only up to a scale factor: if we multiply Qε(τ) by a constant c, then π*, and hence γ̂WQAE(π*) in (24), is multiplied by the factor 1/c. This reflects the non-identifiability of the parameter γ in (20). To ensure identifiability, we may impose a constraint on ε; see Section 7 for more discussion on estimating π*.

5 Nonparametric Regressions

In this section we study the nonparametric regression (Case 3 in Section 2). We assume that both m(·) and σ(·) in (1) are nonparametric functions, and we are interested in the estimation of m(·). Although our theory is also applicable to the multivariate case, to avoid the "curse of dimensionality" we consider the univariate case X ∈ ℝ.

Recall the conditional quantile QY (τ|X) in (2). Without further assumptions, we cannot identify m(X) from QY(τ|X) at a single quantile. To ensure identifiability, we assume that ε has a symmetric density, which is satisfied for many commonly used distributions, such as normal distribution, Student-t distribution, Cauchy distribution, uniform distribution on a symmetric interval, Laplace distribution, symmetric stable distribution, many normal mixture distributions, and their truncated versions on symmetric intervals.

Consider weights ω1, …, ωk satisfying the constraints

$$\sum_{j=1}^k \omega_j = 1 \qquad\text{and}\qquad \omega_j = \omega_{k+1-j}, \quad j = 1, \ldots, k. \tag{29}$$

Under the symmetric density assumption above, Qε(τ) + Qε(1 − τ) = 0. Therefore, with quantiles τj = j/(k + 1) and using (2) and (29), we have

$$\sum_{j=1}^k \omega_j Q_Y(\tau_j|X) = \sum_{j=1}^k \omega_j\big[m(X) + \sigma(X) Q_\varepsilon(\tau_j)\big] = m(X). \tag{30}$$

This identity suggests estimating m(·) by plugging in consistent estimates of QY(τj|X).

Given samples $\{(X_t, Y_t)\}_{t=1}^n$, we can estimate QY(τ|x) by local linear quantile regression [Yu and Jones (1998)]:

$$(\hat Q_Y(\tau|x), \cdot) = \underset{(a,b)}{\arg\min} \sum_{t=1}^n \rho_\tau\{Y_t - a - b(X_t - x)\}\, K\!\left(\frac{X_t - x}{h}\right), \tag{31}$$

for a kernel function K(·) and bandwidth h. From (30), we propose the WQAE of m(x):

$$\hat m_{\mathrm{WQAE}}(x|\omega) = \sum_{j=1}^k \omega_j \hat Q_Y(\tau_j|x). \tag{32}$$
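A sketch of (31)–(32), with the check loss minimized by a derivative-free optimizer for transparency (a linear-programming formulation is faster in practice); the helper names and the Epanechnikov kernel choice are ours:

```python
import numpy as np
from scipy.optimize import minimize

def rho(tau, z):
    """Check loss rho_tau(z) = z (tau - 1{z <= 0})."""
    return z * (tau - (z <= 0))

def local_linear_qr(tau, x, X, Y, h):
    """Local linear quantile regression (31); returns Q_hat_Y(tau|x)."""
    u = (X - x) / h
    K = np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov
    loss = lambda ab: np.sum(rho(tau, Y - ab[0] - ab[1] * (X - x)) * K)
    ab0 = np.array([np.quantile(Y, tau), 0.0])
    return minimize(loss, ab0, method="Nelder-Mead").x[0]

def m_wqae(x, X, Y, h, taus, omega):
    """WQAE of m(x) in (32), with omega satisfying the constraints (29)."""
    return sum(w * local_linear_qr(t, x, X, Y, h) for t, w in zip(taus, omega))
```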

Assumption 7 (i) fε is symmetric, positive, and twice continuously differentiable on its support; the density function pX(·) > 0 of X is differentiable, m(·) is three times differentiable, and σ(·) > 0 is differentiable, in a neighborhood of x. (ii) nh → ∞ and nh⁹ → 0. (iii) K(·) integrates to one, is symmetric, and has bounded support. Write

$$\mu_K = \int u^2 K(u)\, du, \qquad \varphi_K = \int K^2(u)\, du.$$

Theorem 6 Suppose Assumptions 1 and 7 hold. Let S(ω) be defined in (13). Then

$$\sqrt{nh}\left\{\hat m_{\mathrm{WQAE}}(x|\omega) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, S(\omega)\right). \tag{33}$$

Furthermore, ω* in (14) minimizes S(ω) subject to the constraints (29), and

$$\sqrt{nh}\left\{\hat m_{\mathrm{WQAE}}(x|\omega^*) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, \Omega_k^{-1}\right), \tag{34}$$

where Ωk is defined in (15).

For comparison, we briefly review some alternative nonparametric estimation methods. The widely used local linear LS regression estimator, denoted by m̂LS(x), is obtained by replacing the quantile loss ρτ(·) in (31) with the squared loss. If 𝔼(εt) = 0 and var(εt) < ∞, under some regularity conditions [Fan and Gijbels (1996)],

$$\sqrt{nh}\left\{\hat m_{\mathrm{LS}}(x) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, \mathrm{var}(\varepsilon)\right). \tag{35}$$

When the εt's are Gaussian, the local LS estimation corresponds to the local likelihood criterion. In the absence of Gaussianity, the asymptotic results for m̂LS(x) generally still hold, but this estimator is less efficient in terms of mean-squared error than estimators that exploit the distributional information. For heavy-tailed data, local quantile regression is a robust estimation method; see, e.g., Yu and Jones (1998). The local median regression estimator, denoted by m̂LAD(x), corresponds to τ = 0.5 in (31). By Theorem 6,

$$\sqrt{nh}\left\{\hat m_{\mathrm{LAD}}(x) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\cdot\frac{1}{4 f_\varepsilon^2(0)}\right). \tag{36}$$

Recently, Kai, Li and Zou (2010) proposed a local composite quantile regression (CQR) estimator which takes a simple average over multiple quantile estimates. The local CQR estimator, denoted by m̂CQR(x), has the asymptotic normality

$$\sqrt{nh}\left\{\hat m_{\mathrm{CQR}}(x) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, R_k\right), \tag{37}$$

where Rk is defined in (16). Intuitively, m̂LS(x) uses information from the local sample average, m̂LAD(x) uses information from the local sample median, m̂CQR(x) uses information from multiple quantiles with uniform weights, and the proposed OWQAE m̂WQAE(x|ω*) combines information from multiple quantiles optimally.

If the error density fε were known, we could replace the quantile loss ρτ(·) in (31) by the log-likelihood log fε(Yt − a − b(Xt − x)) and obtain a likelihood-based estimator, denoted by m̂MLE(x); see, e.g., Fan, Farmen and Gijbels (1998). Under appropriate conditions,

$$\sqrt{nh}\left\{\hat m_{\mathrm{MLE}}(x) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, \mathcal{I}(f_\varepsilon)^{-1}\right), \tag{38}$$

where ℐ(fε) is the Fisher information of fε. Under some regularity conditions, the local likelihood estimator is the most efficient estimator. In practice, fε is unknown and m̂MLE(x) is infeasible. In Section 6.2, it is shown that Ωk → ℐ(fε) as k → ∞, and therefore the optimal WQAE m̂WQAE(x|ω*) achieves the same asymptotic efficiency as the infeasible estimator m̂MLE(x).

We now compare the efficiency of m̂WQAE(x|ω*) to m̂LS(x), m̂LAD(x), and m̂CQR(x). From (34)–(37), all four estimators have asymptotic normality of the same form with different values of s²:

$$\sqrt{nh}\left\{\hat m(x) - m(x) - \frac{1}{2} m''(x)\mu_K h^2\right\} \Rightarrow N\!\left(0,\; \frac{\varphi_K \sigma^2(x)}{p_X(x)}\, s^2\right).$$

Define the asymptotic mean-squared error (AMSE) as $\mathrm{AMSE}\{\hat m(x)|h\} = [m''(x)\mu_K h^2/2]^2 + \varphi_K \sigma^2(x) s^2/[nh\, p_X(x)]$. Minimizing the AMSE, we obtain the optimal bandwidth:

$$h^* = \underset{h}{\arg\min}\, \mathrm{AMSE}\{\hat m(x)|h\} = \{\mu_K m''(x)\}^{-2/5}\left\{\frac{\varphi_K \sigma^2(x)}{n\, p_X(x)}\right\}^{1/5} (s^2)^{1/5} \propto (s^2)^{1/5}, \tag{39}$$

and the associated optimal AMSE, evaluated at the optimal bandwidth h*, is

$$\mathrm{AMSE}\{\hat m(x)|h^*\} = \frac{5}{4}\{\mu_K m''(x)\}^{2/5}\left\{\frac{\varphi_K \sigma^2(x)}{n\, p_X(x)}\right\}^{4/5} (s^2)^{4/5} \propto (s^2)^{4/5}. \tag{40}$$

In Section 6.4, we tabulate s² for different distributions.

Theorem 7 studies m̂WQAE(x|ω̂*), the estimator that uses the estimated optimal weight ω̂* in (17).

Theorem 7 Under the assumptions of Theorem 6 and Assumption 4, when we use the estimated weight ω̂* in (17), m̂WQAE(x|ω̂*) has the same asymptotic normality as m̂WQAE(x|ω*).

The discussion of the selection of the bandwidth h is deferred to Section 8.4.

6 Efficiency Comparison

6.1 The k-quantile optimal efficiency Ωk and Λk

The parameters in Sections 3–5 can be classified into two types: location-type and scale-type parameters. The parameter β in (6) and (20) and the nonparametric function m(·) in Section 5 do not directly interact with the error ε, and we call them location-type parameters. The parameter γ in (20) is directly related to ε, and we call it a scale-type parameter.

Our discussion in the previous sections considers combining information over a fixed number of quantiles. From the results in Sections 3–5, the OWQAE of the location-type parameters above has asymptotic variance proportional to $\Omega_k^{-1}$ with Ωk defined in (15); the OWQAE of the scale-type parameter γ in (20) has asymptotic variance proportional to $\Lambda_k^{-1}$ with Λk defined in (26). Since the efficiency of an estimator is inversely proportional to its variance, we call Ωk and Λk the k-quantile optimal efficiency of the location-type and scale-type parameters, respectively. The larger Ωk and Λk, the better the performance of the corresponding estimators.

It is well known that, under appropriate conditions, the variance of any unbiased parameter estimator is bounded below by the Cramér-Rao lower bound: the inverse of the Fisher information of the underlying distribution. To illustrate the Fisher information for the location-type and scale-type parameters, consider the simple location-scale model Y = β + γε with location parameter β and scale parameter γ. Note that Y has the density fY(y; β, γ) = fε((y − β)/γ)/γ. Under the specification (β, γ) = (0, 1), we can show that the Fisher information for the location parameter β is

$$\mathcal{I}(f_\varepsilon) = \int \frac{[f_\varepsilon'(u)]^2}{f_\varepsilon(u)}\, du = \int_0^1 \left\{\frac{\partial f_\varepsilon(Q_\varepsilon(\tau))}{\partial \tau}\right\}^2 d\tau, \tag{41}$$

and the Fisher information for the scale parameter γ is

$$\mathcal{J}(f_\varepsilon) = \int \frac{[f_\varepsilon(u) + u f_\varepsilon'(u)]^2}{f_\varepsilon(u)}\, du = \int_0^1 \left\{\frac{\partial [Q_\varepsilon(\tau) f_\varepsilon(Q_\varepsilon(\tau))]}{\partial \tau}\right\}^2 d\tau. \tag{42}$$

We assume ℐ(fε) < ∞ and 𝒥(fε) < ∞. The Fisher information ℐ(fε) and 𝒥(fε) serve as a natural standard when we measure the efficiency of our estimators in the case of regular estimation.
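As a quick check of the second equality in (41) (the same argument gives (42)), use the change of variables τ = Fε(u) together with $Q_\varepsilon'(\tau) = 1/f_\varepsilon(Q_\varepsilon(\tau))$:

$$\frac{\partial f_\varepsilon(Q_\varepsilon(\tau))}{\partial \tau} = \frac{f_\varepsilon'(Q_\varepsilon(\tau))}{f_\varepsilon(Q_\varepsilon(\tau))} \quad\Longrightarrow\quad \int_0^1 \left\{\frac{\partial f_\varepsilon(Q_\varepsilon(\tau))}{\partial \tau}\right\}^2 d\tau = \int \frac{[f_\varepsilon'(u)]^2}{f_\varepsilon(u)}\, du = \mathcal{I}(f_\varepsilon).$$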

Theorem 8 Suppose Assumption 2 holds. Let Δ = 1/(k + 1).

  1. For Ωk in (15), we have $|\Omega_k - \mathcal{I}(f_\varepsilon)| \le \Phi_k$, where g(t) = fε(Qε(t)), and
    $$\Phi_k = \frac{g^2(\Delta) + g^2(1-\Delta)}{\Delta} + \frac{\Delta^2}{2}\int_\Delta^{1-\Delta} [g''(t)]^2\, dt + \int_0^\Delta \big\{[g'(t)]^2 + [g'(1-t)]^2\big\}\, dt. \tag{43}$$
  2. For Λk in (26), we have $|\Lambda_k - \mathcal{J}(f_\varepsilon)| \le \Psi_k$, where h(t) = Qε(t)fε(Qε(t)), and
    $$\Psi_k = \frac{h^2(\Delta) + h^2(1-\Delta)}{\Delta} + \frac{\Delta^2}{2}\int_\Delta^{1-\Delta} [h''(t)]^2\, dt + \int_0^\Delta \big\{[h'(t)]^2 + [h'(1-t)]^2\big\}\, dt. \tag{44}$$

Theorem 8 indicates that, by optimally combining k quantiles τ1, …, τk, the k-quantile optimal efficiency Ωk (resp. Λk) for the OWQAE of the location-type (resp. scale-type) parameters is at most Φk (resp. Ψk) away from the corresponding Fisher information ℐ(fε) (resp. 𝒥(fε)). This result holds for any fixed k.

6.2 Asymptotic behavior of Ωk and Λk

In all previous sections, k is assumed to be a given finite number. In the following few sections, we discuss the behavior of the proposed estimators as k increases with n. In this section, we consider the asymptotic behavior of Ωk and Λk as k → ∞. For regular estimation, it is shown that Ωk and Λk approach the corresponding Cramér-Rao efficiency bound. In Section 6.3, we discuss the OWQAE as k → ∞ and when we use the true optimal weight or the estimated optimal weight. In Section 6.4, we discuss the asymptotic relative efficiency of OWQAE compared to some existing methods. Finally, Section 6.5 briefly considers some non-regular estimation.

The Cramér-Rao efficiency analysis is based on the basic assumption of finite Fisher information ℐ(fε) < ∞ and 𝒥(fε) < ∞. From (41) and (42), this implies that $\int_0^\tau \{[g'(t)]^2 + [g'(1-t)]^2\}\, dt \to 0$ and $\int_0^\tau \{[h'(t)]^2 + [h'(1-t)]^2\}\, dt \to 0$ as τ → 0, where g(·) and h(·) are defined in Theorem 8. Thus, by Theorem 8, we have the following result.

Theorem 9 Suppose Assumption 2 holds. Let g(·) and h(·) be defined in Theorem 8.

  1. If
    $$\lim_{\tau\to 0} \frac{g^2(\tau) + g^2(1-\tau)}{\tau} = 0 \qquad\text{and}\qquad \lim_{\tau\to 0}\, \tau^2 \int_\tau^{1-\tau} [g''(t)]^2\, dt = 0, \tag{45}$$
    then, for Φk in (43), limk→∞ Φk = 0, and
    $$\lim_{k\to\infty} \Omega_k = \mathcal{I}(f_\varepsilon).$$
  2. If (45) holds with g(·) replaced by h(·), then, for Ψk in (44), limk→∞ Ψk = 0, and
    $$\lim_{k\to\infty} \Lambda_k = \mathcal{J}(f_\varepsilon).$$

The condition (45) is conventionally imposed in the study of efficient estimation. Basically, it requires that the error density decay sufficiently fast at the boundary (the corresponding estimation is sometimes called regular estimation); otherwise one may estimate the parameters at a faster rate. See, e.g., Akahira and Takeuchi (1995) for a discussion of this issue, and Section 6.5 below for related discussion. By Theorem 9, as k → ∞, the k-quantile optimal efficiencies Ωk and Λk attain the corresponding Fisher information.

From Theorem 9, Ωk and Λk have different limits as k → ∞. As discussed in Section 4, this is due to the extra dependence of γ(τ) = γQε(τ) on Qε(τ) for the scale-type parameter.

Proposition 1 below presents an alternative sufficient condition for (45).

Proposition 1 Suppose fε has support on ℝ and Assumption 2 holds. Then (45) holds if

$$\lim_{u\to\pm\infty} \left\{ \frac{f_\varepsilon^2(u)}{\min\{1-F_\varepsilon(u),\, F_\varepsilon(u)\}} + \left|\frac{\partial^2 [\log f_\varepsilon(u)]}{\partial u^2}\right| \frac{\min\{1-F_\varepsilon(u),\, F_\varepsilon(u)\}^{3/2}}{f_\varepsilon(u)} \right\} = 0. \tag{46}$$

Write x ∝ y if x/y is bounded away from 0 and ∞. If $f_\varepsilon(u) \propto |u|^{-a}$ as |u| → ∞ for some a > 1, then $1 - F_\varepsilon(u) \propto u^{1-a}$ as u → ∞ and $F_\varepsilon(u) \propto |u|^{1-a}$ as u → −∞. Thus, by Proposition 1, we have the following result.

Corollary 1 Suppose that there exist a > 1 and b > 0 such that $f_\varepsilon(u) \propto |u|^{-a}$ and $\partial^2[\log f_\varepsilon(u)]/\partial u^2 \propto |u|^{-b}$ as |u| → ∞. Then (45) is satisfied if b + 3(a − 1)/2 > 1.

Many commonly used distributions with support on ℝ satisfy (45). (i) For the standard normal density fε, ∂²[log fε(u)]/∂u² = −1 and 1 − Fε(u) = [1 + o(1)]fε(u)/u as u → ∞, so we can verify (46). (ii) For the Laplace distribution with density fε(u) = 0.5 exp(−|u|), u ∈ ℝ, ∂²[log fε(u)]/∂u² = 0 for u ≠ 0, so (46) can be easily verified. (iii) For the logistic distribution, g(τ) = c(τ − τ²) for some constant c > 0, so (45) holds. (iv) For the Student-t distribution with d > 0 degrees of freedom, Corollary 1 applies with a = d + 1 and b = 2. (v) For the normal mixture $\theta N(\mu_1, \sigma_1^2) + (1-\theta)N(\mu_2, \sigma_2^2)$ with $\mu_1, \mu_2 \in \mathbb{R}$, $\sigma_1^2, \sigma_2^2 > 0$, and θ ∈ [0, 1], we can verify (46).

Recall that the unweighted quantile average estimator has asymptotic variance proportional to Rk defined in (16). The following result shows that this simple averaging estimator is asymptotically equivalent to the LS estimator as k → ∞. This indicates that, with simple averaging over quantiles, there is no efficiency gain from combining quantile information even as we use more and more quantiles. Thus, proper weighting across different quantiles is crucial.

Theorem 10 (i) $R_k \ge \Omega_k^{-1}$; as k → ∞, the equality holds if and only if ε is normally distributed. (ii) If var(ε) < ∞, then limk→∞ Rk = var(ε).

6.3 Behavior of the OWQAE as k → ∞

In Sections 3–5, the asymptotic normality of the OWQAE is established for a fixed number k of quantiles. In this section we consider the case where k increases with n. To conserve space, we consider only β̂WQAE(ω*) for the parametric regression case in Section 3.

Since the uniform Bahadur representation holds on a subinterval of [0,1], we modify Assumption 3 so that the Bahadur representation holds uniformly over expanding subintervals of [0,1] when the number of quantiles increases with n.

Assumption 8 The asymptotic Bahadur representation (9) holds over τ ∈ 𝒯n = [δn, 1 − δn] with δn = (log n)−ε for some ε > 0. Let the number of quantiles be $k = k_n = \delta_n^{-1} - 1$.

First, we consider the OWQAE with the theoretical optimal weight ω* in (14).

Corollary 2 Consider β̂WQAE(ω) in (11) and ω* in (14). Suppose Assumptions 1, 2, and 8 and (45) hold. Further assume ṁ(Xt, β) ∈ ℒq for some q > 2. Then

$$n^{1/2}[\hat\beta_{\mathrm{WQAE}}(\omega^*) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1}\mathcal{I}(f_\varepsilon)^{-1}).$$

Thus, if we use more and more quantiles as n → ∞, the efficiency of the OWQAE with the theoretical optimal weight ω* approaches the Fisher information. The same conclusion also holds for the estimators in Sections 4–5 provided that, as in Assumption 8, appropriate Bahadur representations hold uniformly on 𝒯n.

We next briefly discuss the limiting behavior of the OWQAE with estimated weights when kn is chosen as in Assumption 8. Again, we discuss the behavior for the parametric model in Section 3. As kn → ∞, the asymptotic analysis of the proposed estimator is complicated and depends on the behavior of quantile regression estimators and quantile-density estimators at extreme quantiles.

Let $\hat\beta_{\mathrm{WQAE}}(\omega^*) = \sum_{j=1}^{k_n} \omega_j^* \hat\beta(\tau_j)$ be the OWQAE with the optimal weight ω* in (14), and let $\hat\beta_{\mathrm{WQAE}}(\hat\omega^*) = \sum_{j=1}^{k_n} \hat\omega_j^* \hat\beta(\tau_j)$ be the OWQAE with the estimated weight ω̂* [see, e.g., (17)]. Then, using $\sum_{j=1}^{k_n} \omega_j^* = 1$ and $\sum_{j=1}^{k_n} \hat\omega_j^* = 1$, we have

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\hat\omega^*) - \beta] = \sum_{j=1}^{k_n} (\hat\omega_j^* - \omega_j^*)\,\sqrt{n}\,[\hat\beta(\tau_j) - \beta] + \sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega^*) - \beta]. \tag{47}$$

From Corollary 2, in order to prove

$$\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\hat\omega^*) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1}\mathcal{I}(f_\varepsilon)^{-1}), \tag{48}$$

it suffices to prove

$$\sum_{j=1}^{k_n} (\hat\omega_j^* - \omega_j^*)\,\sqrt{n}\,[\hat\beta(\tau_j) - \beta] = o_p(1). \tag{49}$$

To establish (49), we need additional regularity conditions on the behavior of the density fε(Qε(τ)) as τ approaches the boundary, as well as conditions on the density estimators.

Assumption 9 Let kn and 𝒯n be chosen as in Assumption 8. There exists some constant η > 0 such that $\inf_{\tau\in\mathcal{T}_n} f_\varepsilon(Q_\varepsilon(\tau)) \ge c(\log n)^{-\eta}$ and $\sum_{j=1}^{k_n} |\hat\omega_j^* - \omega_j^*| = o_p[(\log n)^{-(\eta + \varepsilon/2)}]$, where ε is the constant in Assumption 8.

Under Assumption 8 and the condition $\sum_{j=1}^{k_n} |\hat\omega_j^* - \omega_j^*| = o_p[(\log n)^{-(\eta+\varepsilon/2)}]$ in Assumption 9, by (71) in the proof of Theorem 1, we have

$$\sum_{j=1}^{k_n} (\hat\omega_j^* - \omega_j^*)\,\sqrt{n}\,[\hat\beta(\tau_j) - \beta] = [\Sigma_\beta^{-1} + o_p(1)] \sum_{j=1}^{k_n} \frac{\hat\omega_j^* - \omega_j^*}{f_\varepsilon(Q_\varepsilon(\tau_j))}\, N_j + o_p(1), \tag{50}$$

where

$$N_j = \frac{1}{\sqrt{n}} \sum_{t=1}^n \big\{\dot m(X_t, \beta) - \mathbb{E}[\dot m(X, \beta)]\big\}\big[\tau_j - \mathbf{1}_{\varepsilon_t < Q_\varepsilon(\tau_j)}\big].$$

Assume without loss of generality that ṁ(Xt, β) is scalar-valued. By property (P1) in Section 2, the summands of Nj are martingale differences. By the condition ṁ(Xt, β) ∈ ℒ2 and the orthogonality of martingale differences, $\mathbb{E}(N_j^2) = O(1)$ uniformly in j. Thus,

$$\mathbb{E}\Big(\max_{1\le j\le k_n} N_j^2\Big) \le \sum_{j=1}^{k_n} \mathbb{E}(N_j^2) = O(k_n),$$

and thus $\max_{1\le j\le k_n} |N_j| = O_p(\sqrt{k_n})$. Recall kn in Assumption 8. Under Assumption 9,

$$\left|\sum_{j=1}^{k_n} \frac{\hat\omega_j^* - \omega_j^*}{f_\varepsilon(Q_\varepsilon(\tau_j))}\, N_j\right| \le \frac{\max_{1\le j\le k_n} |N_j|}{\inf_{\tau\in\mathcal{T}_n} f_\varepsilon(Q_\varepsilon(\tau))} \sum_{j=1}^{k_n} |\hat\omega_j^* - \omega_j^*| = o_p(1).$$

Thus, (49) follows from (50), and we conclude that (48) holds.

6.4 Comparison of asymptotic relative efficiency

We now compare the efficiency of the proposed OWQAE to some existing methods.

First, we consider the parametric case in Section 3. Theorem 2 gives $\sqrt{n}\,[\hat\beta_{\mathrm{WQAE}}(\omega^*) - \beta] \Rightarrow N(0, \Sigma_\beta^{-1} S(\omega^*))$. For parameter estimation, the most widely used method is the ordinary LS estimator, denoted by β̂LS, which minimizes the sum of squared errors. Assuming var(ε) < ∞ and other appropriate conditions, we have the asymptotic normality $\sqrt{n}\,(\hat\beta_{\mathrm{LS}} - \beta) \Rightarrow N(0, \Sigma_\beta^{-1}\mathrm{var}(\varepsilon))$. For the quantile regression based estimator β̂(τ) with a single quantile τ, the asymptotic normality in (10) holds. All three estimators β̂WQAE(ω*), β̂LS, and β̂(τ) have asymptotic normality of the form $\sqrt{n}\,(\hat\beta - \beta) \Rightarrow N(0, \Sigma_\beta^{-1} s^2)$, where s² = var(ε) (assumed finite) for β̂LS, $s^2 = \tau(1-\tau)/f_\varepsilon^2(Q_\varepsilon(\tau))$ for β̂(τ), and s² = S(ω*) for β̂WQAE(ω*). For comparison, we use β̂WQAE(ω*) as the benchmark and define its asymptotic relative efficiency (ARE) relative to β̂LS and β̂(τ) as

$$\mathrm{ARE}(\hat\beta_{\mathrm{LS}}) = \frac{\mathrm{var}(\varepsilon)}{S(\omega^*)} \qquad\text{and}\qquad \mathrm{ARE}(\hat\beta(\tau)) = \frac{\tau(1-\tau)}{f_\varepsilon^2(Q_\varepsilon(\tau))\, S(\omega^*)}. \tag{51}$$

A value of ARE ≥ 1 indicates better performance of β̂WQAE(ω*). Clearly, ARE(β̂(τj)) ≥ 1, j = 1, …, k. Under the conditions of Theorem 9, limk→∞ S(ω*) = 1/ℐ(fε) ≤ var(ε) (by the Cramér-Rao inequality), so that limk→∞ ARE(β̂LS) ≥ 1. Intuitively, both β̂LS and β̂(τ) use only partial information: the sample average and the sample τ-th quantile, respectively. By contrast, β̂WQAE(ω*) combines strength across quantiles and thus can be more efficient.

Using k = 9 quantiles, Table 1 tabulates ARE(β̂LS) and ARE(β̂(τ)), τ = 0.1, …, 0.9, for some commonly used distributions. For all non-normal distributions considered, β̂WQAE(ω*) significantly outperforms β̂LS and β̂(τ). For N(0,1), β̂WQAE(ω*) and β̂LS are comparable, and both are about 50% more efficient than β̂(0.5). For Student-t with one (t1) or two (t2) degrees of freedom, LS is not applicable due to infinite variance; β̂WQAE(ω*) is about 20% more efficient than β̂(0.5) and even substantially more efficient than β̂(τ) for other choices of τ. Thus, potentially much improved efficiency and robustness can be achieved by using the proposed estimator β̂WQAE(ω*). For linear models, Zou and Yuan (2008) studied composite quantile regression (CQR) method, and we include the efficiency of their method for comparison purpose. Clearly, the OWQAE is significantly more efficient than the CQR.

Table 1.

Theoretical ARE [see (51)] of β̂WQAE(ω*) compared to β̂LS, Zou and Yuan (2008)'s CQR estimator β̂CQR, and β̂(τ), using 9 quantiles τj = j/10, j = 1, …, 9. Mixture 1: 0.5N(0,1)+0.5N(0,0.56); Mixture 2: 0.5N(−2,1)+0.5N(2,1). For Student-t1, t2, LS is not applicable. [Numbers ≥ 1 indicate better performance of β̂WQAE(ω*)].

Distribution   β̂LS   β̂CQR   β̂(0.1)  β̂(0.2)  β̂(0.3)  β̂(0.4)  β̂(0.5)  β̂(0.6)  β̂(0.7)  β̂(0.8)  β̂(0.9)

Student-t1 NA 1.58 47.12 6.40 2.34 1.40 1.19 1.40 2.34 6.40 47.12
Student-t2 NA 1.12 9.01 2.85 1.65 1.27 1.17 1.27 1.65 2.85 9.01
N(0,1) 0.96 1.03 2.80 1.96 1.67 1.54 1.51 1.54 1.67 1.96 2.80
Mixture 1 10.17 2.43 91.95 32.64 3.26 1.80 1.55 1.80 3.26 32.64 91.95
Mixture 2 3.28 2.80 3.01 2.81 3.68 7.84 56.24 7.84 3.68 2.81 3.01
Laplace 2.00 1.32 9.00 4.00 2.33 1.50 1.00 1.50 2.33 4.00 9.00
Gamma(1) 9.00 3.67 1.00 2.25 3.86 6.00 9.00 13.50 21.00 36.00 81.00
Gamma(2) 2.44 1.73 1.12 1.49 1.91 2.42 3.10 4.08 5.65 8.68 17.34
Gamma(3) 1.71 1.41 1.26 1.42 1.64 1.94 2.35 2.94 3.88 5.68 10.75
Beta(1,1) 1.67 2.04 1.80 3.20 4.20 4.80 5.00 4.80 4.20 3.20 1.80
Beta(1,2) 2.35 2.33 1.05 2.11 3.16 4.22 5.27 6.33 7.38 8.44 9.49
Beta(1,3) 3.31 2.71 1.02 2.12 3.32 4.66 6.19 8.00 10.27 13.33 19.04

We briefly mention the efficiency comparison of the nonparametric estimator m̂WQAE(x|ω*) relative to the local LS, local LAD, and Kai, Li and Zou (2010)'s local CQR estimators in Section 5. By (40),

  • The nonparametric relative efficiency = (the parametric relative efficiency)^{4/5}.

Thus, the same efficiency comparison as in Table 1 (up to the exponent 4/5) also holds for the nonparametric estimator m̂WQAE(x|ω*) of Section 5.

6.5 Asymptotic super-efficiency

By Corollary 2, under (45), β̂WQAE(ω*) is an asymptotically efficient estimator of β, with limiting covariance matrix approaching the Fisher information bound. The conditions in (45) are not mathematical trivialities but genuinely restrictive conditions needed for the efficiency results. When those "usual" regularity conditions do not hold, the previously discussed efficiency result may fail, and we may obtain results different from likelihood-based estimation. For example, we may have a different rate of convergence, and, in general, asymptotically efficient estimators do not exist. These "unusual" cases are sometimes called "non-regular" statistical estimation. In this section, we briefly discuss the case of non-regular estimation. In this case, Theorem 11 below shows that, by using quantile regression with optimal weighting, super-efficient estimators may be obtained in the sense that the efficiency exceeds the Fisher information ℐ(fε).

Theorem 11 Recall Ωk in (15). Let g(τ) be defined in Theorem 8. Assume

$$\lim_{\tau\to 0} \frac{g^2(\tau) + g^2(1-\tau)}{\tau} = c. \tag{52}$$
  1. If 0 < c < ∞ and $\lim_{\tau\to 0} \tau^2\int_\tau^{1-\tau} [g''(t)]^2\, dt = 0$, then limk→∞ Ωk = c + ℐ(fε).

  2. If c = ∞, then limk→∞ Ωk = ∞.

Condition (45) covers the regular case c = 0 in (52). Theorem 11 indicates that, for the non-regular case c > 0 in (52), under appropriate conditions, for large k, the variance of the (standardized) optimally weighted quantile regression based estimator β̂WQAE(ω*) is smaller than the Cramér-Rao bound. In particular, if c = ∞, as the number of quantiles k increases, the asymptotic variance approaches zero. In this sense, the estimator β̂WQAE(ω*) is asymptotically super-efficient.

Corollary 3 below concerns a special case of super-efficiency when the density fε is positive at the boundary.

Corollary 3 Denote the support of fε by 𝒟, then limk→∞ Ωk = ∞ in any of the following three cases: (i) 𝒟 = [D1, D2] with fε(D1) + fε(D2) > 0; (ii) 𝒟 = [D1, ∞) with fε(D1) > 0; or (iii) 𝒟 = (−∞, D2] with fε(D2) > 0.

For truncated versions of the distributions in Section 6.2, we have limk→∞ Ωk = ∞. For example, for the normal distribution truncated to [−1, 1], Corollary 3(i) applies. For the uniform distribution on [0, 1], we can show that Ωk = 2k + 2 → ∞; a quick numerical check follows below.
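The uniform case is easy to verify numerically: with fε(Qε(τ)) ≡ 1, the matrix H in (13) reduces to the Brownian-bridge covariance, and Ωk can be evaluated directly (a small sketch, helper name ours):

```python
import numpy as np

def omega_k_uniform(k):
    """Omega_k for U(0,1) errors: f_eps(Q_eps(tau)) = 1, so H in (13) is
    min(t_j, t_j') - t_j t_j' with t_j = j/(k+1)."""
    t = np.arange(1, k + 1) / (k + 1)
    H = np.minimum.outer(t, t) - np.outer(t, t)
    return np.ones(k) @ np.linalg.solve(H, np.ones(k))

# omega_k_uniform(9) evaluates to 20.0, i.e. 2k + 2 with k = 9,
# and the value grows without bound as k increases.
```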

Similar results can also be established for Λk. We omit the details.

7 Estimation of the Optimal Weight

To construct the proposed OWQAEs in Sections 3–5, we need estimates of the optimal weights ω* in (14) and π* in (25). It suffices to estimate Qε(τ) and fε(Qε(τ)). We can accomplish this through a two-step procedure: first, use a preliminary estimator to obtain residuals; second, estimate Qε(τ) and fε(Qε(τ)) based on the residuals. We illustrate the idea using the models in Sections 3–5.

(Case 1: Parametric model in Section 3.) Since fε(Qε(τ)) remains the same if we change ε to c + ε for any c, the intercept α in (6) can be absorbed into ε. We propose the following procedure:

  1. Use the uniform weight ω = [1/k, …, 1/k]T to obtain the preliminary estimator β̂, and compute the “residuals” (a combination of both α and ε) as ε̂t = Ytm(Xt, β̂).

  2. To estimate fε(u), use the nonparametric density estimate $\hat f_\varepsilon(u) = (nb)^{-1}\sum_{t=1}^n K\{(u - \hat\varepsilon_t)/b\}$, where we follow Silverman (1986) and choose the rule-of-thumb bandwidth b:
    $$b = 0.9\, n^{-1/5} \min\left\{\mathrm{sd}(\hat\varepsilon_1, \ldots, \hat\varepsilon_n),\; \frac{\mathrm{IQR}(\hat\varepsilon_1, \ldots, \hat\varepsilon_n)}{1.34}\right\}.$$
    Here, "sd" and "IQR" denote the sample standard deviation and the sample interquartile range.
  3. Estimate fε(Qε(τ)) by f̂ε(Q̂ε(τ)), where Q̂ε(τ) is the sample τ-th quantile of ε̂1, …, ε̂n.

  4. Plug f̂ε(Q̂ε(τj)) into (14) to obtain the estimated optimal weight ω̂*; the sketch below assembles steps 2–4.
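Steps 2–4 translate almost line by line into code; a sketch under the stated choices (Gaussian kernel K, Silverman bandwidth), with `sparsity_estimates` a hypothetical name and `optimal_weights` as in the Section 3 snippet:

```python
import numpy as np
from scipy.stats import norm

def silverman_bandwidth(e):
    """Rule-of-thumb bandwidth b = 0.9 n^{-1/5} min(sd, IQR/1.34) (step 2)."""
    iqr = np.subtract(*np.percentile(e, [75, 25]))
    return 0.9 * len(e) ** (-1 / 5) * min(np.std(e, ddof=1), iqr / 1.34)

def sparsity_estimates(resid, taus):
    """Steps 2-3: kernel density estimate f_hat evaluated at the sample
    quantiles Q_hat_eps(tau_j) of the residuals."""
    b = silverman_bandwidth(resid)
    Q = np.quantile(resid, taus)
    # f_hat(u) = (nb)^{-1} sum_t K((u - e_t)/b) with Gaussian kernel K
    fQ_hat = norm.pdf((Q[:, None] - resid[None, :]) / b).mean(axis=1) / b
    return fQ_hat, Q

# Step 4: fQ_hat, _ = sparsity_estimates(resid, taus)
#         omega_hat, _ = optimal_weights(taus, fQ_hat)
```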

(Case 2: Location-scale model in Section 4.) To ensure identifiability, we assume without loss of generality that $\sum_{j=1}^k |Q_\varepsilon(\tau_j)| = 1$; otherwise we can consider the reparametrized model Y = XTβ + (UTγ*)ε* with γ* = cγ, ε* = ε/c, and $c = \sum_{j=1}^k |Q_\varepsilon(\tau_j)|$. Note that this assumption has no effect on the optimal weight ω*, since fε(Qε(τ)) is invariant under the transformation cε for any c > 0. For each quantile τ = τ1, …, τk, we fit the quantile regression (27) to obtain (β̃(τ), γ̃(τ)). Define the preliminary estimator

$$\tilde\beta = \frac{1}{k}\sum_{j=1}^k \tilde\beta(\tau_j) \qquad\text{and}\qquad \tilde\gamma = \sum_{j=1}^k |\tilde\gamma(\tau_j)|. \tag{53}$$

Then (β̃, γ̃) consistently estimates (β, γ). We use the following procedure to compute ω̂* and π̂*:

  1. Use β̃ and γ̃ in (53) to compute the errors $\tilde\varepsilon_t = (Y_t - X_t^T\tilde\beta)/(U_t^T\tilde\gamma)$, t = 1, …, n. To better mimic the constraint $\sum_{j=1}^k |Q_\varepsilon(\tau_j)| = 1$, consider the transformed errors
    $$\hat\varepsilon_t = \frac{\tilde\varepsilon_t}{\sum_{j=1}^k |\tilde Q_\varepsilon(\tau_j)|}, \qquad t = 1, \ldots, n,$$
    where Q̃ε(τ) is the sample τ-quantile of ε̃1, …, ε̃n.
  2. Use the same steps 2–3 of Case 1 above to obtain the estimates f̂ε(Q̂ε(τ)) and Q̂ε(τ).

  3. Use (14) to compute ω̂* and use (25) to compute π̂*.

(Case 3: Nonparametric regression model in Section 5.) As in Case 2 above, ω* is invariant under the transformation cε, c > 0. Assume without loss of generality that |ε| has median one. Then the conditional median of |Y − m(X)| given X is σ(X), and we can apply local median quantile regression to estimate σ(·). We propose the following procedure:

  1. Use (32) with the uniform weights to obtain the preliminary estimator m̃(·).

  2. Compute Yt − m̃(Xt) and estimate σ(·) by local linear median quantile regression:
    $$(\hat\sigma(x), \cdot) = \underset{(\sigma, b)}{\arg\min} \sum_{t=1}^n \rho_{0.5}\big\{|Y_t - \tilde m(X_t)| - \sigma - b(X_t - x)\big\}\, K\!\left(\frac{X_t - x}{\ell}\right). \tag{54}$$
    For the bandwidth ℓ, following Yu and Jones (1998), we use $\ell = \ell_{\mathrm{LS}}(\pi/2)^{1/5}$, where ℓLS is the plug-in bandwidth [Ruppert, Sheather and Wand (1995)] for local linear LS regression based on the data (Xi, |Yi − m̃(Xi)|²), i = 1, …, n.
  3. Compute the errors ε̂t = [Yt − m̃(Xt)]/σ̂(Xt) and obtain the estimate f̂ε(Q̂ε(τ)) as in the parametric Case 1 above.

  4. Use (14) to obtain ω̂1, …, ω̂k and symmetrize them: $\hat\omega_j^* = (\hat\omega_j + \hat\omega_{k+1-j})/2$, j = 1, …, k.

8 Monte Carlo Studies

We conduct Monte Carlo studies to investigate the sampling performance of the proposed procedures in a variety of regression models. In all settings below, we use 1000 realizations to evaluate the performance of various methods.

8.1 Linear models with homoscedastic errors

For linear models, we compare six estimation methods. OLS: ordinary LS estimator; LAD: the median quantile estimator with τ = 0.5 in (8); QAU, QAO, QAE: the WQAE in (11) with the uniform weights, theoretical optimal weight ω* in (14), and estimated optimal weight ω̂* (cf. Section 7), respectively; CQR: Zou and Yuan (2008)’s CQR estimator. For QAU, QAO, QAE, and CQR, we use k = 9 quantiles 0.1, 0.2, …, 0.9. With 1000 realizations, we use QAE as the benchmark to which the other five methods are compared based on the empirical relative efficiency:

$$\mathrm{RE(Method)} = \frac{\mathrm{MSE(Method)}}{\mathrm{MSE(QAE)}} \qquad\text{and}\qquad \mathrm{MSE} = \frac{1}{1000}\sum_{j=1}^{1000} \big[\hat\beta^{(j)} - \beta\big]^2, \tag{55}$$

where "Method" stands for OLS, LAD, QAU, QAO, or CQR, and $\hat\beta^{(j)}$ is the estimator of β in the j-th realization. A value of RE ≥ 1 indicates better performance of QAE.

We consider both independent data and time series data:

$$\text{Model 1:}\quad Y_t = \alpha + X_t\beta + \varepsilon_t, \quad X_t \sim N(0,1), \quad (\alpha, \beta) = (0, 1), \tag{56}$$
$$\text{Model 2:}\quad Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 |Y_{t-2}| + 0.2\varepsilon_t, \quad (\alpha, \beta_1, \beta_2) = (0, 0.3, 0.5). \tag{57}$$

Model 2 is a variant of the threshold autoregressive model with a linear component Yt−1. For the innovation εt, we consider 12 distributions: Normal distribution N(0,1), Student-t distribution with one (t1) and two (t2) degrees of freedom, the two normal mixture distributions in Example 2, Laplace distribution, Beta distributions Beta(1,1), Beta(1,2), Beta(1,3), and Gamma distributions Gamma(1), Gamma(2), Gamma(3).
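For concreteness, a condensed version of this experiment for Model 1 with N(0,1) errors is sketched below; to keep it short we use 200 replications and combine the quantile estimates with uniform weights (i.e., QAU), so plugging in ω̂* from Section 7 in place of the plain mean would give the QAE:

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(1)
n, reps, beta0 = 100, 200, 1.0
taus = np.arange(1, 10) / 10.0

def one_replication():
    X = rng.standard_normal(n)
    Y = 0.0 + beta0 * X + rng.standard_normal(n)   # Model 1 in (56)
    Z = np.column_stack([np.ones(n), X])
    b_ols = np.linalg.lstsq(Z, Y, rcond=None)[0][1]
    b_qau = np.mean([QuantReg(Y, Z).fit(q=t).params[1] for t in taus])
    return b_ols, b_qau

est = np.array([one_replication() for _ in range(reps)])
mse_ols, mse_qau = ((est - beta0) ** 2).mean(axis=0)
print("RE(OLS relative to QAU):", mse_ols / mse_qau)   # cf. the RE in (55)
```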

The results are summarized in Table 2(a) for Model 1 with sample sizes n = 100, 300, and in Table 2(b) for Model 2 with n = 300. For the N(0,1), Student-t2, and Laplace distributions, QAE and CQR are comparable; for all other distributions, QAE significantly outperforms CQR. Also, QAE outperforms OLS for all non-normal distributions, whereas the two are comparable for N(0,1). For n = 300, the superior performance of QAE is even more remarkable, which agrees with our asymptotic theory. For almost all cases considered, QAE substantially outperforms the LAD estimator, and the relative efficiency can be as high as almost 2000%. It is worth pointing out that, for the Beta and Gamma distributions, the relative efficiencies are much higher than for the other distributions considered, owing to the super-efficiency phenomenon in Section 6.5. We also note that QAE with the estimated optimal weight has performance comparable to QAO with the theoretical optimal weight. We conclude that the proposed OWQAE offers a more efficient alternative to existing methods.

Table 2.

(Linear regression model) Empirical relative efficiency of OWQAE with estimated optimal weight compared to five methods: OLS, LAD, QAU, QAO, CQR. Mixture 1: 0.5N(0,1)+0.5N(0,0.56); Mixture 2: 0.5N(−2,1)+0.5N(2,1). For Student-t1, t2, due to infinite variance, the OLS is not stable and varies significantly in simulations. [Numbers ≥ 1 indicate better performance of OWQAE].

(a): Model 1 in (56)

εt n = 100 n = 300
Distribution OLS LAD QAU QAO CQR OLS LAD QAU QAO CQR
Student-t1 NA 0.86 2.55 0.74 1.11 NA 1.04 3.16 0.89 1.37
Student-t2 NA 0.95 1.14 0.81 0.89 NA 1.06 1.30 0.99 1.02
N(0,1) 0.86 1.33 0.90 0.90 0.91 0.89 1.43 0.95 0.95 0.97
Mixture 1 5.10 0.89 3.70 0.77 1.45 7.79 1.27 6.12 0.95 2.00
Mixture 2 2.22 10.85 2.31 1.18 2.03 2.65 19.88 2.84 1.09 2.30
Laplace 1.24 0.85 1.06 0.85 0.89 1.48 0.87 1.23 0.87 1.01
Gamma(1) 5.69 5.46 4.81 0.67 2.51 7.09 6.98 6.12 0.76 2.91
Gamma(2) 2.08 2.60 1.96 0.88 1.48 2.23 2.85 2.14 0.94 1.61
Gamma(3) 1.49 2.02 1.48 0.92 1.25 1.64 2.21 1.61 0.99 1.35
Beta(1,1) 1.49 4.20 1.73 0.91 1.77 1.58 4.58 1.88 0.95 1.90
Beta(1,2) 1.71 3.65 1.93 1.09 1.71 2.02 4.47 2.34 1.97 1.99
Beta(1,3) 2.39 4.30 2.59 1.36 2.01 2.78 5.12 3.06 3.03 2.25
(b): Model 2 in (57), n = 300

εt β1 β2
Distribution OLS LAD QAU QAO CQR OLS LAD QAU QAO CQR
Student-t1 NA 1.00 1.85 0.89 1.28 NA 0.85 2.74 0.80 1.07
Student-t2 NA 1.10 1.19 1.04 0.99 NA 1.04 1.23 1.13 1.00
N(0,1) 0.91 1.45 0.94 0.94 0.98 0.90 1.36 0.94 0.94 0.95
Mixture 1 6.85 1.14 4.84 0.92 1.78 6.82 1.16 5.20 0.88 1.77
Mixture 2 2.70 19.79 2.86 1.11 2.35 2.80 19.51 2.99 1.18 2.43
Laplace 1.36 0.89 1.17 0.89 0.99 1.37 0.91 1.21 0.91 1.00
Gamma(1) 6.70 6.23 5.38 0.79 2.83 6.24 6.22 5.35 0.75 2.80
Gamma(2) 2.39 2.94 2.22 0.95 1.70 2.42 2.80 2.23 0.93 1.68
Gamma(3) 1.53 1.95 1.50 0.97 1.27 1.68 2.07 1.59 0.91 1.36
Beta(1,1) 1.50 4.30 1.79 0.93 1.80 1.50 4.15 1.77 0.94 1.81
Beta(1,2) 1.92 4.24 2.21 0.86 1.95 2.23 4.89 2.52 0.85 2.21
Beta(1,3) 2.78 4.71 3.04 0.82 2.29 2.77 4.60 3.01 0.80 2.28

8.2 Nonlinear models with homoscedastic errors

We consider two nonlinear models (one with independent data and the other with time series data):

$$\text{Model 3:}\quad Y_t = \alpha + \exp(\beta X_t) + 0.5\varepsilon_t, \quad X_t \sim N(0,1), \quad (\alpha, \beta) = (0, 0.6), \tag{58}$$
$$\text{Model 4:}\quad Y_t = \alpha + \sqrt{0.5 + \beta_1 Y_{t-1}^2 + \beta_2 Y_{t-2}^2} + 0.2\varepsilon_t, \quad (\alpha, \beta_1, \beta_2) = (0, 0.3, 0.5). \tag{59}$$

Model 4 is Engle (1982)’s ARCH model. Again, we consider the 12 distributions in Section 8.1 for εt. Table 3 summarizes the empirical relative efficiency [cf. (55)] of the proposed OWQAE compared to the other methods OLS, LAD, QAU, and QAO (see Section 8.1). The proposed OWQAE is significantly superior to the OLS, LAD and QAU, and comparable to the QAO with theoretical optimal weight.

Table 3.

(Nonlinear regression model) Empirical relative efficiency of OWQAE with estimated weights compared to four methods: OLS, LAD, QAU, QAO. Mixture 1: 0.5N(0,1)+0.5N(0,0.56); Mixture 2: 0.5N(−2,1)+0.5N(2,1). For Student-t1, t2, due to infinite variance, the OLS is not stable and varies significantly in simulations. [Numbers ≥ 1 indicate better performance of OWQAE].

Model 3 in (58) Model 4 in (59)
εt β = 0.6 β1 = 0.3 β2 = 0.5
Distribution OLS LAD QAU QAO OLS LAD QAU QAO OLS LAD QAU QAO

Student-t1 NA 0.48 1.61 0.55 NA 0.75 2.06 0.74 NA 0.73 2.38 0.75
Student-t2 NA 1.00 1.37 1.55 NA 1.02 1.23 0.91 NA 0.99 1.21 1.06
N(0,1) 0.79 1.81 0.96 0.95 0.97 1.45 0.99 0.99 0.92 1.38 0.97 0.97
Mixture 1 3.75 1.01 3.16 0.86 6.45 1.12 4.62 0.89 6.99 1.09 5.08 0.91
Mixture 2 1.81 9.54 1.90 1.02 2.56 16.83 2.72 1.15 2.75 18.08 2.85 1.15
Laplace 1.36 1.19 1.15 1.19 1.52 0.93 1.28 0.93 1.42 0.89 1.21 0.89
Gamma(1) 4.21 6.58 3.94 0.75 6.74 6.42 5.78 0.77 6.67 6.44 5.56 0.73
Gamma(2) 1.71 2.78 1.69 0.97 2.23 2.59 2.07 0.97 2.39 2.90 2.21 0.92
Gamma(3) 1.31 1.89 1.28 1.14 1.59 2.04 1.53 0.96 1.64 2.05 1.58 0.92
Beta(1,1) 1.15 4.61 1.98 0.94 1.67 4.48 1.95 0.93 1.57 4.02 1.82 0.94
Beta(1,2) 1.08 3.72 1.97 2.50 1.98 3.98 2.24 0.89 2.06 4.21 2.30 0.85
Beta(1,3) 1.58 4.38 2.93 5.27 2.84 4.85 3.02 0.81 2.96 4.94 3.09 0.89

8.3 Location-scale models with conditional heteroscedasticity

Consider two location-scale models (one for independent data and the other for time series data):

Model 5: Y_t = \beta X_t + (\gamma_0 + \gamma_1|X_t|)\varepsilon_t, \quad X_t \sim N(0,1), \quad (\beta, \gamma_0, \gamma_1) = (0.6, 0.5, 1.0),   (60)
Model 6: Y_t = \beta Y_{t-1} + (\gamma_0 + \gamma_1|Y_{t-1}|)\varepsilon_t, \quad (\beta, \gamma_0, \gamma_1) = (0.4, 0.5, 0.5).   (61)

Model 6 is the ARCH model of Koenker and Zhao (1996); it differs from Engle's ARCH model, whose conditional heteroscedasticity takes the form appearing in (59). Due to the conditional heteroscedasticity, the parameters are slightly more difficult to estimate. We use Model 5 to illustrate five estimation methods; an R sketch of the first two is given after the list.

  1. (LS method) If ε_t has zero mean and unit variance, the Gaussian-likelihood based estimation method minimizes the loss function
    \sum_{t=1}^{n}\Big\{\frac{(Y_t - bX_t)^2}{(r_0 + r_1|X_t|)^2} + \log[(r_0 + r_1|X_t|)^2]\Big\}. (62)
    This is essentially an LS type estimation, and Gaussianity is not necessary for consistency. In general, if ε_t has variance σ², then this LS method produces consistent estimators of β and of σ·(γ_0, γ_1).
  2. (LAD method) First, apply (27) with τ = 0.5 to obtain the LAD estimator β̂_LAD of β. Second, apply median quantile regression to the absolute residuals by minimizing
    \sum_{t=1}^{n}\rho_{0.5}\big(|Y_t - \hat\beta_{LAD}X_t| - r_0 - r_1|X_t|\big). (63)
    This LAD regression produces estimators of Q_{|ε|}(0.5)·(γ_0, γ_1), where Q_{|ε|}(0.5) is the median of |ε_t|.
  3. (OWQAE with theoretical optimal weights). As in Section 8.1, we denote this method by QAO.

  4. (OWQAE with estimated optimal weights). As in Section 8.1, we denote this method by QAE. As discussed in Section 7, under the constraint \sum_{j=1}^{k}\omega_j|Q_\varepsilon(\tau_j)| = 1, QAE produces consistent estimators of (β, γ_0, γ_1). Without this constraint, QAE produces consistent estimators of β and of \sum_{j=1}^{k}\omega_j|Q_\varepsilon(\tau_j)|\cdot(γ_0, γ_1).

  5. (OWQAE based on the unweighted quantile regression (27)). This method works in the same way as the OWQAE above; the only difference is that β̃(τ) and γ̃(τ) from (27) are used to form the OWQAE. Denote this method by QAEU. Again, QAEU produces estimators of β and of \sum_{j=1}^{k}\omega_j|Q_\varepsilon(\tau_j)|\cdot(γ_0, γ_1). We include this method to evaluate the performance of the OWQAE when it is based on the unweighted quantile regression (27).
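
The following R sketch illustrates the first two methods for Model 5, assuming the quantreg package is available and that X and Y hold a simulated sample; the optim starting values are arbitrary choices.

    library(quantreg)

    # Method 1 (LS): minimize the Gaussian-likelihood loss (62) over (b, r0, r1)
    loss62 <- function(par, Y, X) {
      b <- par[1]; r0 <- par[2]; r1 <- par[3]
      s2 <- (r0 + r1 * abs(X))^2
      sum((Y - b * X)^2 / s2 + log(s2))
    }
    ls_est <- optim(c(0.5, 0.4, 0.8), loss62, Y = Y, X = X)$par

    # Method 2 (LAD): median regression for beta, then median regression of the
    # absolute residuals on (1, |X|) for the scale coefficients, cf. (63)
    beta_lad <- coef(rq(Y ~ X - 1, tau = 0.5))
    scale_lad <- coef(rq(abs(Y - beta_lad * X) ~ abs(X), tau = 0.5))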

As discussed above, the five estimation methods produce consistent estimators of β and of λ·(γ_0, γ_1) for some constant λ depending on the distribution of ε_t. To make a sensible comparison, we divide the corresponding estimators by λ to obtain consistent estimators of (γ_0, γ_1). Furthermore, to ensure the consistency of the LS method, we properly center ε_t for the 12 distributions in Section 8.1; that is, if ε_t has a finite mean, we center it so that 𝔼(ε_t) = 0.

The results are summarized in Table 4(a) for Model 5 and in Table 4(b) for Model 6; in both models, the sample size is n = 300. We make three observations. First, the OWQAE delivers markedly better overall performance than OLS and LAD. Second, in all cases considered, the OWQAE using the heteroscedasticity-weighted quantile regression (22) clearly outperforms the OWQAE using the unweighted quantile regression (27). Third, the OWQAE with estimated weights is comparable to the QAO with the theoretical optimal weights.

Table 4.

(Location-scale model) Empirical relative efficiency of OWQAE with estimated weights compared to four methods: OLS, LAD, QAO, OWQAE using unweighted quantile regression in (27). Mixture 1: 0.5N(0,1)+0.5N(0,0.56); Mixture 2: 0.5N(−2,1)+0.5N(2,1). For Student-t1, t2, due to infinite variance, the OLS is not stable and varies significantly in simulations. [Numbers ≥ 1 indicate better performance of OWQAE].

(a): Model 5 in (60)

location parameter scale parameter
εt β = 0.6 γ0 = 0.5 γ1 = 1.0
Distribution OLS LAD QAO QAEU OLS LAD QAO QAEU OLS LAD QAO QAEU
Student-t1 NA 0.96 0.85 1.17 NA 1.17 0.88 1.20 NA 1.22 0.92 1.25
Student-t2 NA 1.05 0.91 1.13 NA 1.66 0.94 1.28 NA 1.70 0.96 1.27
N(0,1) 0.89 1.43 0.93 1.12 0.68 2.29 0.92 1.28 0.68 2.37 0.94 1.26
Mixture 1 7.05 1.21 0.91 1.18 0.55 1.88 0.85 1.23 0.55 2.05 0.89 1.27
Mixture 2 2.57 23.03 1.03 1.23 0.66 2.07 0.86 1.28 0.69 2.58 0.86 1.20
Laplace 1.53 0.80 0.80 1.11 0.93 1.89 0.96 1.25 0.96 1.81 0.96 1.21
Gamma(1) 7.38 6.40 0.79 1.17 7.31 3.72 0.50 1.13 6.38 3.18 0.44 1.14
Gamma(2) 2.32 2.89 0.92 1.13 2.70 2.61 0.79 1.23 2.60 2.76 0.81 1.21
Gamma(3) 1.67 2.37 0.92 1.13 1.79 2.63 0.83 1.15 1.76 2.84 0.88 1.20
Beta(1,1) 1.48 4.39 0.95 1.09 0.66 3.87 0.83 1.19 0.65 3.56 0.85 1.17
Beta(1,2) 1.95 4.48 0.87 1.12 1.04 3.41 0.71 1.10 1.07 3.75 0.73 1.21
Beta(1,3) 2.75 4.97 0.84 1.11 1.71 3.64 0.69 1.14 1.63 3.30 0.65 1.10
(b): Model 6 in (61)

location parameter scale parameter
εt β = 0.4 γ0 = 0.5 γ1 = 0.5
Distribution OLS LAD QAO QAEU OLS LAD QAO QAEU OLS LAD QAO QAEU
Student-t1 NA 0.65 0.55 NA NA 16.84 0.37 NA NA 3.85 0.30 NA
Student-t2 NA 1.03 0.89 6.76 NA 3.69 0.88 3.38 NA 4.01 0.96 6.02
N(0,1) 0.89 1.45 0.97 1.12 0.62 1.94 0.92 1.21 0.67 2.10 0.97 1.19
Mixture 1 6.47 1.12 0.85 1.22 0.51 1.76 0.77 1.07 0.60 1.92 0.99 1.10
Mixture 2 2.40 22.77 1.00 6.59 0.70 20.72 0.91 34.50 0.64 9.51 0.84 5.82
Laplace 1.48 0.80 0.81 1.53 0.92 2.31 0.92 1.50 0.91 2.35 0.97 1.44
Gamma(1) 5.71 6.16 0.78 1.10 5.13 3.65 0.40 1.12 7.79 5.61 0.62 1.21
Gamma(2) 2.13 2.86 0.93 1.22 2.63 4.37 0.69 1.58 2.73 4.64 0.80 1.53
Gamma(3) 1.44 2.04 0.95 1.57 1.89 4.55 0.84 2.12 1.77 3.85 0.84 1.76
Beta(1,1) 1.59 4.30 0.92 1.01 0.67 3.14 0.71 1.00 0.83 3.79 0.98 1.01
Beta(1,2) 1.73 4.27 0.86 0.98 1.00 2.44 0.65 0.99 1.15 3.21 0.89 0.99
Beta(1,3) 2.48 5.17 0.81 1.01 1.50 2.45 0.54 0.98 1.48 3.79 0.87 1.00

8.4 Nonparametric regression models

In our data analysis, we use the standard Gaussian kernel for K(·). We now address the bandwidth selection issue. By (39), the optimal bandwidth h* is proportional to (s²)^{1/5}. Denote by h*_LS and h*_OWQAE the bandwidths for the least squares estimator and the proposed OWQAE. Then h*_OWQAE = h*_LS·[S(ω*)/var(ε)]^{1/5}, where S(ω*) is defined in (13). In practice, to select h*_LS, we can use the plug-in bandwidth selector of Ruppert, Sheather and Wand (1995), implemented by the command dpill in the R package KernSmooth. We then select h*_OWQAE by plugging in estimates of S(ω*) and var(ε) from the two-step procedure in Section 7; for the purpose of comparison, however, we use their true values in our simulation studies. Similarly, we can choose the optimal bandwidths for the other two estimators. Kai, Li and Zou (2010) adopted the same strategy. For the preliminary estimator in step (i) of Section 7, we use the plug-in bandwidth h*_LS.
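
A minimal sketch of this bandwidth rule in R, assuming KernSmooth is installed and that S_hat and var_eps hold estimates (or, in the simulations, the true values) of S(ω*) and var(ε):

    library(KernSmooth)

    h_ls <- dpill(X, Y)                         # Ruppert-Sheather-Wand plug-in bandwidth
    h_owqae <- h_ls * (S_hat / var_eps)^(1/5)   # rescale by the asymptotic variance ratio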

We compare the empirical performance of the four methods (LS, LAD, CQR, OWQAE) in Section 5. Based on 1000 realizations, we use the least squares estimator LS as the benchmark against which the other three methods are compared via the relative efficiency:

\mathrm{RE}(\hat m) = \frac{\mathrm{MISE}(\hat m_{LS})}{\mathrm{MISE}(\hat m)} \quad\text{and}\quad \mathrm{MISE}(\hat m) = \frac{1}{1000}\sum_{j=1}^{1000}\int_{\ell_1}^{\ell_2}[\hat m_j(x) - m(x)]^2\,dx,

where m̂_j is the estimator in the j-th realization, and [ℓ_1, ℓ_2] is the interval over which m is estimated. A value RE(m̂) ≥ 1 indicates that m̂ outperforms the LS estimator. To facilitate computation, the integral is approximated using 20 grid points.
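
A sketch of this Monte Carlo computation for the model (64) introduced below; simulate_model7() and estimate_m() are hypothetical helpers standing in for the data generator and for any one of the four estimators evaluated on the grid.

    x_grid <- seq(-1.5, 1.5, length.out = 20)   # 20 grid points on [l1, l2]
    m_true <- sin(2 * x_grid) + 2 * exp(-16 * x_grid^2)
    ise <- replicate(1000, {
      d <- simulate_model7(n = 200)             # hypothetical data generator for (64)
      m_hat <- estimate_m(d, x_grid)            # hypothetical estimator, e.g. the OWQAE
      mean((m_hat - m_true)^2) * 3              # Riemann approximation over [-1.5, 1.5]
    })
    mise <- mean(ise)                           # RE = mise_LS / mise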

Consider n = 200 samples from the model

Model 7: Y = \sin(2X) + 2\exp(-16X^2) + 0.5\varepsilon, \quad X \sim \mathrm{Unif}[-1.6, 1.6].   (64)

The same model was used in Kai, Li and Zou (2010) with the normal design X ~ N(0,1); here we use the uniform design to avoid some computational issues. For ε, we consider nine symmetric distributions: N(0,1), truncated normal on [−1, 1], truncated Cauchy on [−10, 10], truncated Cauchy on [−1, 1], Student-t with 3 degrees of freedom (t3), the standard Laplace distribution, the uniform distribution on [−0.5, 0.5], and two normal mixture distributions, 0.5N(2,1)+0.5N(−2,1) and 0.95N(0,1)+0.05N(0,9). The first normal mixture can be used to model a two-cluster population, whereas the second can be viewed as a noise contamination model. Let [ℓ_1, ℓ_2] = [−1.5, 1.5].

The relative efficiencies of the four methods, with the local LS estimator as the benchmark, are summarized in Table 5(a). Overall, OWQAE either significantly outperforms or is comparable to the other three methods. For N(0,1) and 0.95N(0,1)+0.05N(0,9), OWQAE is comparable to LS; for most of the other distributions, OWQAE has about a 20% efficiency gain over LS, and more than a 60% gain for 0.5N(2,1)+0.5N(−2,1). When compared with CQR, OWQAE outperforms CQR for all but four distributions (N(0,1), Student-t3, Laplace, and 0.95N(0,1)+0.05N(0,9)), for which the two are comparable. While OWQAE underperforms LAD for the truncated Cauchy on [−10, 10], it has substantial efficiency gains for N(0,1), truncated N(0,1) on [−1, 1], truncated Cauchy on [−1, 1], uniform on [−0.5, 0.5], and 0.5N(2,1)+0.5N(−2,1).

Table 5.

(Nonparametric regression model) Empirical relative efficiency of the local least-absolute-deviation estimator (LAD), Kai, Li and Zou (2010)’s local CQR estimator, and the proposed OWQAE, relative to the benchmark local LS estimator. For CQR and OWQAE: τj = j/(k + 1), j = 1, …, k, k = 9, 19, 29.

(a): Model 7 in (64)

CQR, k = OWQAE, k =
Distribution of ε LS LAD 9 19 29 9 19 29
N(0,1) 1 0.73 0.99 1.00 0.99 0.96 0.95 0.95
Truncated N(0,1) on [−1, 1] 1 0.54 0.93 0.98 0.99 1.14 1.25 1.26
Truncated Cauchy on [−10, 10] 1 1.67 1.16 1.07 1.05 1.40 1.27 1.21
Truncated Cauchy on [−1, 1] 1 0.53 0.93 0.98 0.99 1.07 1.20 1.20
Student-t with 3 d.f.’s 1 1.44 1.39 1.21 1.16 1.42 1.28 1.19
Standard Laplace 1 1.26 1.11 1.06 1.05 1.16 1.10 1.06
Uniform on [−0.5, 0.5] 1 0.51 0.94 0.98 0.99 1.23 1.24 1.21
0.5N(−2,1)+0.5N(2,1) 1 0.31 0.92 0.97 0.99 1.57 1.60 1.62
0.95N(0,1)+0.05N(0,9) 1 0.88 1.13 1.08 1.06 1.05 0.98 0.95
(b): Model 8 in (65)

CQR, k = OWQAE, k =
Distribution of ε LS LAD 9 19 29 9 19 29
N(0,1) 1 0.66 0.96 0.98 0.98 0.94 0.95 0.95
Truncated N(0,1) on [−1, 1] 1 0.43 0.84 0.91 0.93 1.14 1.56 1.80
Truncated Cauchy on [−10, 10] 1 2.39 1.35 1.17 1.12 1.97 1.78 1.61
Truncated Cauchy on [−1, 1] 1 0.46 0.84 0.91 0.94 1.02 1.35 1.55
Student-t with 3 d.f.’s 1 1.73 1.75 1.52 1.41 1.72 1.51 1.44
Standard Laplace 1 1.48 1.18 1.11 1.09 1.31 1.24 1.16
Uniform on [−0.5, 0.5] 1 0.36 0.84 0.91 0.94 1.46 2.21 2.70
0.5N(−2,1)+0.5N(2,1) 1 0.20 0.88 0.95 0.97 2.12 2.18 2.17
0.95N(0,1)+0.05N(0,9) 1 0.86 1.18 1.15 1.12 1.10 1.02 0.95

The empirical performance of the proposed method in Table 5(a) is not as impressive as its theoretical performance in (40). For example, for truncated N(0,1) on [−1, 1], the theoretical AREs according to (40) are 1, 0.48, 0.86, 0.93, 0.95, 1.13, 1.69, 2.17, compared with 1, 0.54, 0.93, 0.98, 0.99, 1.14, 1.25, 1.26 in Table 5(a). To explain this phenomenon, note that the function m(x) = sin(2x) + 2 exp(−16x²) exhibits large curvature and sharp changes on [−0.5, 0.5] (plot not included here), so a large estimation bias can easily offset the asymptotic efficiency improvements, especially for a moderate sample size. To appreciate this, use the same X and ε as in (64) and consider the model

Model 8: Y = 1.8X + 0.5\varepsilon.   (65)

Then the bias term \frac{1}{2}m''(x)\mu_K h^2 vanishes since m''(x) = 0, and the variance plays the dominating role. For all four estimation methods, we use the same bandwidth: the plug-in bandwidth selector for local linear regression. We summarize the relative efficiencies in Table 5(b). The overall pattern of the empirical relative efficiencies is consistent with that of the theoretical ones in (40), and the proposed OWQAE significantly outperforms the other methods for almost all distributions considered. Also, using more quantiles (k = 29) significantly improves the performance of OWQAE for truncated N(0,1) on [−1, 1], truncated Cauchy on [−1, 1], and uniform on [−0.5, 0.5]; this property is not shared by the CQR method.

In summary, for most non-normal distributions considered, the proposed method can have substantial efficiency improvements over other methods, and the empirical performance is consistent with our asymptotic theory.

9 An Empirical Application

To highlight the proposed approach, we consider a simple application of this method to the widely studied cross-section of stock returns. The Capital Asset Pricing Model [CAPM, see Sharpe (1964) and Black (1972)] has long served as the backbone of both theoretical and empirical finance. It is generally agreed that investors demand a higher expected return for investment in riskier securities. Over the past three decades, a number of studies have empirically examined the performance of the CAPM in the cross-section of returns, and it is also well documented that the rate of return to holding common stocks is to some extent predictable over time. A large number of papers have studied the appropriateness of the CAPM in explaining how investors assess risk and determine what risk premium to demand, and several alternative models have also been proposed in the literature. However, the empirical evidence is ambiguous, and the support for other asset-pricing models is no better. In addition, the theory behind the CAPM has an intuitive appeal that other models lack. For these (and other) reasons, in spite of the controversy in empirical studies, the CAPM is still widely used in financial applications and remains the preferred model in MBA and other managerial finance courses.

The focus of this section is not on the choice among alternative models. Rather, we apply the methods discussed in the previous sections to the traditional and widely used CAPM cross-sectional regression (similar to that of Fama and MacBeth), which can be used to study the predictability of returns. The cross-sectional regression equation at time t is

R_{i,t} = \lambda_0 + \lambda_1\beta_{im,t-1} + \varepsilon_{i,t},   (66)

where λ_0 is the intercept term, λ_1 is the slope coefficient, and β_{im,t−1} is the conditional beta of the excess return for asset i in month t. The dating convention indicates that the conditional beta is formed using only information available at time t − 1. This regression model decomposes each excess return in each period into two components: the first component, λ_1β_{im,t−1}, represents the part of the return of asset i that is related to the cross-sectional structure of risk, as measured by the betas. The remaining component of the return is uncorrelated with the measures of risk. Thus, the asset pricing model implies that the predictability of returns should be related to risk.

We consider a population of stocks traded on the New York Stock Exchange (NYSE) from January 2009 to December 2010, and we study monthly stock returns. These data are available from CRSP (the Center for Research in Security Prices) as well as many other data resources. Following many empirical studies, a stock is included if its returns in the current month and the previous 60 months are available, and we exclude firms with negative book-to-market equity (using information from Compustat). In practice, the cross-sectional regression model (66) is usually estimated by the least squares method. On the other hand, accumulated empirical evidence in finance indicates that stock returns are not normally distributed; in fact, it is well known that the distributions of returns are heavy-tailed. It is therefore important to consider estimation procedures that have good properties in the absence of Gaussianity.

We estimate the cross-sectional regression model (66) using four methods: the traditional OLS regression, the LAD estimation, a simple equally weighted quantile averaging estimation (denoted by QAU), and the optimally weighted quantile averaging estimation (denoted by OWQAE). We use the k = 9 quantiles 0.1, 0.2, …, 0.9 for quantile combination. For the purpose of comparison, we evaluate the performance of these estimators based on their out-of-sample predictions. In particular, we estimate the cross-sectional regression model (66) on the cross-sectional data for each month of 2009, and then use the estimated coefficients λ̂_0 and λ̂_1 to construct forecasts of returns for the corresponding month of 2010. We compare both the mean squared prediction error (MSE) and the mean absolute deviation (MAD) of the predictions. In particular, we calculate the mean squared prediction error and the mean absolute prediction error by

\mathrm{MSE} = \sum_i (R_{i,t+1} - \hat R_{i,t+1})^2, \qquad \mathrm{MAD} = \sum_i |R_{i,t+1} - \hat R_{i,t+1}|,

for each month, and then average these prediction errors over all months. Table A1 below summarizes the results.

Table A1.

OLS LAD QAU OWQAE
MAD 51.39 46.94 48.35 44.78
MSE 10.50 8.89 9.45 7.95

Numbers in this table are multiplied by 500 for convenience.
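
A compact sketch of this rolling evaluation, assuming panel is a list of 24 monthly cross-sections (the 12 months of 2009 followed by those of 2010) with columns ret and beta; OLS via lm() is shown, and a hypothetical fit_owqae() would be a drop-in replacement returning (λ̂_0, λ̂_1). Per-month means rather than sums are used here, which only rescales the reported numbers.

    errs <- sapply(1:12, function(m) {
      est <- panel[[m]]                            # estimation month in 2009
      val <- panel[[m + 12]]                       # corresponding month in 2010
      lam <- coef(lm(ret ~ beta, data = est))      # OLS; swap in fit_owqae(est), etc.
      pred <- lam[1] + lam[2] * val$beta
      c(MSE = mean((val$ret - pred)^2),
        MAD = mean(abs(val$ret - pred)))
    })
    rowMeans(errs)                                 # average over the 12 forecast months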

Model (66) is the basic regression model that characterizes the risk premiums. We next consider an extension of model (66) that adds a conditional heteroscedastic effect of capitalization (the "size" effect). We consider an analogue of (20),

R_{i,t} = \lambda_0 + \lambda_1\beta_{im,t-1} + \sigma_{i,t}\varepsilon_{i,t},   (67)

where σ_{i,t} = γ·Cap_{i,t} and Cap_{i,t} is the market capitalization. Again, we estimate the cross-sectional regression model (67) using four estimation methods: the two-stage weighted least squares (WLS), the two-stage weighted least absolute deviation (WLAD), and the QAU and OWQAE based on the quantiles 0.1, 0.2, …, 0.9. Table A2 below reports the mean squared prediction errors (MSE) and the mean absolute prediction errors (MAD), calculated in the same way as in Table A1.

Table A2.

WLS WLAD QAU OWQAE
MAD 51.01 46.80 48.30 44.09
MSE 10.40 8.86 9.41 7.90

Numbers in this table are multiplied by 500 for convenience.

The empirical results in Tables A1 and A2 indicate that least-squares-based estimation is less efficient than the other methods. In particular, the proposed OWQAE estimator performs better than the other methods.

10 Further Discussions

We propose a general method of combining quantile regression information to improve the efficiency of regression estimators. The proposed method is simple, and more efficient regression estimators can be constructed from a relatively small number of quantiles.

The proposed method is widely applicable and can potentially be extended to many other models. We briefly discuss a few interesting directions for applying our approach, without giving full details.

The first direction is efficient estimation for the varying-coefficient model:

Y = \alpha(U) + X^T\beta(U) + \varepsilon,   (68)

where α(·) is the functional intercept and β(·) is the p-dimensional column vector of functional coefficients. Then the conditional τ-th quantile of Y given (X, U) is

Q_Y(\tau|X,U) = \alpha_\tau(U) + X^T\beta(U) \quad\text{with}\quad \alpha_\tau(U) = \alpha(U) + Q_\varepsilon(\tau).

A useful application is the varying-coefficient longitudinal model, which arises when we have longitudinal measurements from multiple subjects. Wang, Zhu and Zhou (2009) studied quantile regression for a partially linear varying-coefficient longitudinal model. In their work, the coefficients depend on the quantile, and they estimated the coefficients for each quantile separately, without combining information across quantiles. We will explore this further in a future paper.

A second direction is volatility estimation in time series. In financial econometrics, volatility plays an important role in asset pricing and risk management. Here we briefly discuss volatility estimation for both parametric and nonparametric ARCH models.

(Nonparametric volatility) Consider the nonparametric ARCH(1) model X_t = σ(X_{t−1})ε_t. Let Q_{ε²}(τ) be the τ-th quantile of ε_t², and Q_{X_t²|X_{t−1}}(τ) the conditional τ-th quantile of X_t² given X_{t−1}. Then Q_{X_t²|X_{t−1}=x}(τ) = σ²(x)Q_{ε²}(τ), and hence Q_{X_t²|X_{t−1}=x}(τ)/Q_{X_t²|X_{t−1}=0}(τ) = σ²(x)/σ²(0) for all τ. Given estimates Q̂_{X_t²|X_{t−1}=x}(τ) of Q_{X_t²|X_{t−1}=x}(τ), we can construct efficient estimators of σ²(x)/σ²(0) by combining the ratios Q̂_{X_t²|X_{t−1}=x}(τ_j)/Q̂_{X_t²|X_{t−1}=0}(τ_j), j = 1, …, k.
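
A sketch of this ratio estimator in R, assuming the quantreg package and an observed series X; the kernel-weighted local linear quantile regression, the bandwidth, the evaluation point, and the equal-weight combination are all illustrative choices (the optimal weights would replace the simple average).

    library(quantreg)

    # Local tau-th quantile of X_t^2 given X_{t-1} = x0, via kernel-weighted rq()
    local_q <- function(x0, xlag, y2, tau, h) {
      w <- dnorm((xlag - x0) / h)                             # Gaussian kernel weights
      coef(rq(y2 ~ I(xlag - x0), tau = tau, weights = w))[1]  # local linear intercept
    }

    xlag <- head(X, -1); y2 <- tail(X, -1)^2
    taus <- (1:9) / 10
    qx <- sapply(taus, function(tau) local_q(0.8, xlag, y2, tau, h = 0.3))
    q0 <- sapply(taus, function(tau) local_q(0.0, xlag, y2, tau, h = 0.3))
    mean(qx / q0)   # estimate of sigma^2(0.8)/sigma^2(0); equal weights for simplicity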

(Parametric volatility) Consider the parametric ARCH(p) model:

X_t = \sigma_t\varepsilon_t \quad\text{and}\quad \sigma_t^2 = \beta_0 + \beta_1X_{t-1}^2 + \cdots + \beta_pX_{t-p}^2, \qquad \beta_0 > 0, \ \beta_1, \ldots, \beta_p \ge 0.

Let ℐ_{t−1} be the information up to time t − 1. Denote by Q_{ε²}(τ) the τ-th quantile of ε_t², and by Q_{X_t²|ℐ_{t−1}}(τ) the conditional τ-th quantile of X_t² given ℐ_{t−1}. Then

Q_{X_t^2|\mathcal{I}_{t-1}}(\tau) = \beta_0(\tau) + \sum_{j=1}^{p}\beta_j(\tau)X_{t-j}^2 \quad\text{and}\quad \beta_j(\tau) = \beta_jQ_{\varepsilon^2}(\tau), \quad j = 0, \ldots, p.

Therefore, we can apply quantile regression at quantile τ to obtain consistent estimates β̂_j(τ) of β_j(τ). Note that β_j(τ)/β_0(τ) = β_j/β_0 for all τ. Hence,

\frac{\hat\beta_0(\tau) + \sum_{j=1}^{p}\hat\beta_j(\tau)X_{t-j}^2}{\hat\beta_0(\tau)} \ \to_p\ \frac{\beta_0 + \sum_{j=1}^{p}\beta_jX_{t-j}^2}{\beta_0} = \sigma_t^2/\beta_0, \quad\text{for all } \tau \in (0,1).

We can construct efficient estimators of σ_t²/β_0 by combining these estimates across the quantiles τ_1, …, τ_k.
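
A minimal sketch of the quantile-regression step for an ARCH(2) fit, assuming quantreg and an observed series X; equal weights across nine quantiles stand in for the optimal combination.

    library(quantreg)

    p <- 2
    d <- embed(X^2, p + 1)          # columns: X_t^2, X_{t-1}^2, X_{t-2}^2
    taus <- (1:9) / 10
    sig2 <- sapply(taus, function(tau) {
      b <- coef(rq(d[, 1] ~ d[, -1], tau = tau))   # estimates of beta_j(tau), j = 0,...,p
      drop(b[1] + d[, -1] %*% b[-1]) / b[1]        # sigma_t^2/beta_0, cf. the display above
    })
    sig2_hat <- rowMeans(sig2)      # combine across quantiles with equal weights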

Similar ideas also apply to generalized ARCH models. Since substantial work is needed here, we will explore further in a separate project.

Appendix: Proofs

Proof of Theorem 1. By the ergodicity in Assumption 1(ii) and (5),

\frac{D_n^TD_n}{n} \to \begin{bmatrix} 1 & \mathbb{E}[\dot f(X,\beta)^T] \\ \mathbb{E}[\dot f(X,\beta)] & \mathbb{E}[\dot f(X,\beta)\dot f(X,\beta)^T] \end{bmatrix} \equiv \Sigma, \quad\text{in probability.}   (69)

Recall the definition of Σ_β in Theorem 1. Then one can easily verify that

\Sigma^{-1} = \begin{bmatrix} 1 + \mathbb{E}[\dot f(X,\beta)^T]\Sigma_\beta^{-1}\mathbb{E}[\dot f(X,\beta)] & -\mathbb{E}[\dot f(X,\beta)^T]\Sigma_\beta^{-1} \\ -\Sigma_\beta^{-1}\mathbb{E}[\dot f(X,\beta)] & \Sigma_\beta^{-1} \end{bmatrix}.   (70)

By Assumption 3, (69), and (70),

\hat\beta(\tau) = \beta + \frac{\Sigma_\beta^{-1} + o_p(1)}{nf_\varepsilon(Q_\varepsilon(\tau))}\sum_{t=1}^{n}\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}[\tau - 1_{\varepsilon_t < Q_\varepsilon(\tau)}] + o_p(n^{-1/2}).   (71)

Therefore,

\sqrt{n}\Big[\sum_{j=1}^{k}\omega_j\hat\beta(\tau_j) - \beta\Big] = \frac{\Sigma_\beta^{-1} + o_p(1)}{\sqrt{n}}\sum_{t=1}^{n}\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}d_t + o_p(1),   (72)

where

d_t = \sum_{j=1}^{k}\frac{\omega_j}{f_\varepsilon(Q_\varepsilon(\tau_j))}[\tau_j - 1_{\varepsilon_t < Q_\varepsilon(\tau_j)}].   (73)

By the Cramér-Wold device, it suffices to consider the case where ḟ(X_t, β) is scalar-valued. Let ℱ_t be the σ-algebra generated by {X_{t+1}, X_t, …; ε_t, ε_{t−1}, …}. By property (P1) in Section 2, {ḟ(X_t, β) − 𝔼[ḟ(X, β)]}d_t, t ∈ ℤ, form martingale differences with respect to {ℱ_t}_{t∈ℤ}. Using cov(τ − 1_{ε_t ≤ Q_ε(τ)}, τ′ − 1_{ε_t ≤ Q_ε(τ′)}) = min(τ, τ′) − ττ′, we have 𝔼(d_t²) = S(ω), with S(ω) defined in (13). Since ε_t is independent of ℱ_{t−1}, by (5),

\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\big[\big(\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}d_t\big)^2\,\big|\,\mathcal{F}_{t-1}\big] = \frac{S(\omega)}{n}\sum_{t=1}^{n}\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}^2 \to \Sigma_\beta S(\omega).   (74)

Since k is fixed, by Assumption 2, d_t is bounded. Thus, the assumption ḟ(X_t, β) ∈ ℒ² ensures the Lindeberg condition. The result then follows from the martingale CLT.

Proof of Theorem 2. The optimal weight follows from the Lagrange multiplier method. The asymptotic normality follows from Theorem 1 and S(ω*) = 1/(e_k^TH^{−1}e_k).

Proof of Theorem 3. Recall d_t in (72) and (73). Since k is fixed, by Slutsky's theorem, it is easy to see that \sum_{t=1}^{n}\{ḟ(X_t, β) − 𝔼[ḟ(X, β)]\}d_t has the same asymptotic distribution if we replace each ω_j therein by any ω̃_j such that ω̃_j = ω_j + o_p(1). Thus, the result follows.

Proof of Theorem 4. Define the vectors

V_t = \begin{bmatrix} X_t \\ U_t \end{bmatrix}, \quad \theta = \begin{bmatrix} b \\ r \end{bmatrix}, \quad \theta(\tau) = \begin{bmatrix} \beta \\ \gamma(\tau) \end{bmatrix}, \quad \hat\theta(\tau) = \begin{bmatrix} \hat\beta(\tau) \\ \hat\gamma(\tau) \end{bmatrix}, \quad \delta = \sqrt{n}[\theta - \theta(\tau)].

Then

Y_t - X_t^Tb - U_t^Tr = U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)] - V_t^T\delta/\sqrt{n}.

Since θ̂(τ) minimizes the criterion function in (22), the reparametrized estimator δ̂ = \sqrt{n}[θ̂(τ) − θ(τ)] minimizes the loss function

L(\delta) = \sum_{t=1}^{n}\frac{1}{U_t^T\tilde\gamma}\Big[\rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)] - V_t^T\delta/\sqrt{n}\big\} - \rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)]\big\}\Big].

Suppose we can establish the quadratic approximation

L(δ)=L*(δ)+op(1), (75)

where

L^*(\delta) = -\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\tau - 1_{\varepsilon_t < Q_\varepsilon(\tau)}]\frac{V_t^T\delta}{U_t^T\gamma} + \frac{f_\varepsilon(Q_\varepsilon(\tau))}{2}\delta^T\mathbb{E}\Big[\frac{V_0V_0^T}{(U_0^T\gamma)^2}\Big]\delta.   (76)

Then the convexity lemma in Pollard (1991) gives δ̂ = δ̂* + o_p(1), where

\hat\delta^* = \arg\min_{\delta}L^*(\delta) = \frac{1}{f_\varepsilon(Q_\varepsilon(\tau))}\Big\{\mathbb{E}\Big[\frac{V_0V_0^T}{(U_0^T\gamma)^2}\Big]\Big\}^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{V_t}{U_t^T\gamma}[\tau - 1_{\varepsilon_t < Q_\varepsilon(\tau)}].

The desired result then follows by applying the block matrix inverse to \{\mathbb{E}[V_0V_0^T/(U_0^T\gamma)^2]\}^{-1}.

It remains to prove (75). In view of L(δ), define

\tilde L(\delta) = \sum_{t=1}^{n}\frac{1}{U_t^T\gamma}\Big[\rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)] - V_t^T\delta/\sqrt{n}\big\} - \rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)]\big\}\Big].

It suffices to prove L̃(δ) = L*(δ) + o_p(1) and L(δ) = L̃(δ) + o_p(1).

First, we prove L̃(δ) = L*(δ) + o_p(1). Using ρ_τ(cz) = cρ_τ(z) for c > 0, we can rewrite

\tilde L(\delta) = \sum_{t=1}^{n}\Big[\rho_\tau\Big\{\varepsilon_t - Q_\varepsilon(\tau) - \frac{V_t^T\delta}{\sqrt{n}\,U_t^T\gamma}\Big\} - \rho_\tau\{\varepsilon_t - Q_\varepsilon(\tau)\}\Big].

Applying Knight (1998)’s identity

\rho_\tau(u - \upsilon) - \rho_\tau(u) = -\upsilon(\tau - 1_{u<0}) + \int_0^{\upsilon}(1_{u \le s} - 1_{u \le 0})\,ds,   (77)

we can obtain

\tilde L(\delta) = -\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\tau - 1_{\varepsilon_t < Q_\varepsilon(\tau)}]\frac{V_t^T\delta}{U_t^T\gamma} + \sum_{t=1}^{n}\mathbb{E}(\xi_t|\mathcal{G}_{t-1}) + \sum_{t=1}^{n}[\xi_t - \mathbb{E}(\xi_t|\mathcal{G}_{t-1})],   (78)

where 𝒢t is the σ-algebra generated by {(Xt+1,Ut+1), (Xt,Ut), …; εt, εt−1, …}, and

\xi_t = \int_0^{\frac{V_t^T\delta}{\sqrt{n}\,U_t^T\gamma}}\big[1_{\varepsilon_t \le Q_\varepsilon(\tau)+s} - 1_{\varepsilon_t \le Q_\varepsilon(\tau)}\big]\,ds.

By Assumption 5, εt is independent of 𝒢t−1. Thus,

\mathbb{E}(\xi_t|\mathcal{G}_{t-1}) = \int_0^{\frac{V_t^T\delta}{\sqrt{n}\,U_t^T\gamma}}\big[F_\varepsilon(s + Q_\varepsilon(\tau)) - F_\varepsilon(Q_\varepsilon(\tau))\big]\,ds.   (79)

By Assumption 6(i), there exists some constant c such that

\frac{|V_t^T\delta|}{U_t^T\gamma} \le c.   (80)

Thus, from (79) and Taylor’s expansion Fε(s + Qε(τ)) − Fε(Qε(τ)) = sfε(Qε(τ)) + o(s),

\sum_{t=1}^{n}\mathbb{E}(\xi_t|\mathcal{G}_{t-1}) = \frac{f_\varepsilon(Q_\varepsilon(\tau))}{2}\delta^T\Big[\frac{1}{n}\sum_{t=1}^{n}\frac{V_tV_t^T}{(U_t^T\gamma)^2}\Big]\delta + o(1) \to \frac{f_\varepsilon(Q_\varepsilon(\tau))}{2}\delta^T\mathbb{E}\Big[\frac{V_0V_0^T}{(U_0^T\gamma)^2}\Big]\delta, \quad\text{in probability},   (81)

where the convergence follows from the ergodicity and (5). Since {ξt − 𝔼(ξt|𝒢t−1)}t∈ℤ are martingale differences with respect to {𝒢t−1}t∈ℤ, by their orthogonality,

\mathbb{E}\Big(\Big\{\sum_{t=1}^{n}[\xi_t - \mathbb{E}(\xi_t|\mathcal{G}_{t-1})]\Big\}^2\Big) = \sum_{t=1}^{n}\mathbb{E}\big\{[\xi_t - \mathbb{E}(\xi_t|\mathcal{G}_{t-1})]^2\big\} \le \mathbb{E}[(\sqrt{n}\,\xi_0)^2].   (82)

From (80), we have |\sqrt{n}\,ξ_0| ≤ c·1_{|ε_0 − Q_ε(τ)| ≤ c/\sqrt{n}}, which combined with (82) gives \sum_{t=1}^{n}[ξ_t − 𝔼(ξ_t|𝒢_{t−1})] = o_p(1). Thus, by (78) and (81), we have L̃(δ) = L*(δ) + o_p(1).

Next, we prove the approximation L(δ) = L̃(δ) + o_p(1). Let

\eta_t = \rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)] - V_t^T\delta/\sqrt{n}\big\} - \rho_\tau\big\{U_t^T\gamma[\varepsilon_t - Q_\varepsilon(\tau)]\big\}.

Then it is easy to see that

L(\delta) - \tilde L(\delta) = \sum_{t=1}^{n}\frac{\eta_tU_t^T(\gamma - \tilde\gamma)}{(U_t^T\gamma)^2} + \sum_{t=1}^{n}\frac{\eta_t(U_t^T\gamma - U_t^T\tilde\gamma)^2}{(U_t^T\gamma)^2\,U_t^T\tilde\gamma} \equiv N_1 + N_2.   (83)

By the same argument leading to the quadratic approximation L̃(δ) = L*(δ) + o_p(1) above, we can show that each element of \sum_{t=1}^{n}η_tU_t^T/(U_t^Tγ)^2 has a quadratic approximation of order O_p(1). Thus, N_1 = O_p(‖γ̃ − γ‖) = o_p(1). For N_2, by Assumption 6(i)–(ii), U_t^Tγ − U_t^Tγ̃ = o_p(n^{−1/4}) uniformly in t and |η_t| ≤ |V_t^Tδ|/\sqrt{n} = O(n^{−1/2}), which give N_2 = o_p(1). Thus, we conclude that L(δ) = L̃(δ) + o_p(1), completing the proof.

Proof of Theorem 5. The asymptotic normality follows from the Bahadur representation in Theorem 4 and the same martingale CLT argument as in Theorem 1. The optimal weight follows from the Lagrange multiplier method.

Proof of Theorem 6. Write K_t = K{(X_t − x)/h}. For IID data, Kai, Li and Zou (2010) have shown the following asymptotic representation:

\hat Q_Y(\tau|x) - Q_Y(\tau|x) = \frac{1}{2}m''(x)\mu_Kh^2 + \frac{\sigma(x)}{p_X(x)f_\varepsilon(Q_\varepsilon(\tau))}\frac{1}{nh}\sum_{t=1}^{n}[\tau - 1_{\varepsilon_t < Q_\varepsilon(\tau)}]K_t + o_p\Big(\frac{1}{\sqrt{nh}}\Big).

Examining their argument and using properties (P1) and (P2) in Section 2, we see that the asymptotic representation also holds under Assumption 1. Therefore, by (30),

\hat m_{WQAE}(x|\omega) = m(x) + \frac{1}{2}m''(x)\mu_Kh^2 + \frac{\sigma(x)}{nh\,p_X(x)}\sum_{t=1}^{n}d_tK_t + o_p\Big(\frac{1}{\sqrt{nh}}\Big),

where dt is defined in (73). The desired asymptotic normality then follows from the same martingale CLT argument in Theorem 1.

It is easy to see that, under the symmetric density assumption, the optimal weight ω* in (14) automatically satisfies the symmetric weight constraint in (29).

Proof of Theorem 7. This follows from the same argument of Theorem 3.

Proof of Theorem 8. Recall τj = j/(k + 1). Define k × k matrices Γ and P:

\Gamma = \big[\min(\tau_j, \tau_{j'}) - \tau_j\tau_{j'}\big]_{1 \le j,\,j' \le k}, \qquad P = \mathrm{diag}\{f_\varepsilon(Q_\varepsilon(\tau_1)), \ldots, f_\varepsilon(Q_\varepsilon(\tau_k))\}.   (84)

Here “diag” stands for the diagonal matrix. By direct matrix multiplications, we can verify

\Gamma^{-1} = (k+1)\begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix},   (85)

with 2(k + 1) on the diagonal, −(k + 1) on the super-/sub-diagonals, and 0 elsewhere.
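
The tridiagonal form in (85) is easy to confirm numerically; a quick R check:

    # Verify (85): Gamma^{-1} = (k+1) * tridiag(-1, 2, -1)
    k <- 5; tau <- (1:k) / (k + 1)
    Gamma <- outer(tau, tau, pmin) - outer(tau, tau)   # min(tau_j, tau_j') - tau_j*tau_j'
    round(solve(Gamma) / (k + 1), 8)                   # 2 on the diagonal, -1 off, 0 elsewhere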

  1. Recall g(τ) = f_ε(Q_ε(τ)) and Δ = 1/(k + 1). By H = P^{−1}ΓP^{−1} and (85),
    \Omega_k = e_k^TP\Gamma^{-1}Pe_k = (k+1)\Big\{g^2(\tau_1) + g^2(\tau_k) + \sum_{j=2}^{k}[g(\tau_j) - g(\tau_{j-1})]^2\Big\} = (k+1)[g^2(\tau_1) + g^2(\tau_k)] + W_k + \int_{\Delta}^{1-\Delta}[g'(t)]^2\,dt,   (86)
    where
    W_k = (k+1)\sum_{j=2}^{k}[g(\tau_j) - g(\tau_{j-1})]^2 - \int_{\Delta}^{1-\Delta}[g'(t)]^2\,dt.
    We can rewrite W_k as
    W_k = (k+1)\sum_{j=2}^{k}\Big\{\Big[\int_{\tau_{j-1}}^{\tau_j}g'(t)\,dt\Big]^2 - (\tau_j - \tau_{j-1})\int_{\tau_{j-1}}^{\tau_j}[g'(t)]^2\,dt\Big\} = -\frac{k+1}{2}\sum_{j=2}^{k}\int_{\tau_{j-1}}^{\tau_j}\int_{\tau_{j-1}}^{\tau_j}[g'(t) - g'(s)]^2\,dt\,ds.
    For t, s ∈ [τ_{j−1}, τ_j], we have |g'(t) - g'(s)| = |\int_s^tg''(\upsilon)\,d\upsilon| \le \int_{\tau_{j-1}}^{\tau_j}|g''(\upsilon)|\,d\upsilon uniformly. Thus, by the Cauchy-Schwarz inequality,
    \max_{t,s \in [\tau_{j-1},\tau_j]}|g'(t) - g'(s)|^2 \le \Big[\int_{\tau_{j-1}}^{\tau_j}|g''(\upsilon)|\,d\upsilon\Big]^2 \le \Delta\int_{\tau_{j-1}}^{\tau_j}[g''(\upsilon)]^2\,d\upsilon.
    Applying the above inequality, we obtain
    |W_k| \le \frac{k+1}{2}\sum_{j=2}^{k}(\tau_j - \tau_{j-1})^2\max_{t,s \in [\tau_{j-1},\tau_j]}|g'(t) - g'(s)|^2 \le \frac{\Delta^2}{2}\int_{\Delta}^{1-\Delta}[g''(t)]^2\,dt.

    The result then follows from (86) and the identity \int_0^1[g'(\tau)]^2\,d\tau = \mathcal{I}(f_\varepsilon) in (41).

  2. Using H = P^{−1}ΓP^{−1}, we can write Λ_k = q^TH^{−1}q = (Pq)^TΓ^{−1}(Pq). Note that Pq = [h(τ_1), …, h(τ_k)]^T with h(t) = Q_ε(t)f_ε(Q_ε(t)). Using (85) and the argument in (i) above, we can easily obtain the desired result.

Proof of Proposition 1. Let u = Q_ε(τ) so that τ = F_ε(u). Since f_ε has support ℝ, u → −∞ as τ → 0. Recall g(τ) = f_ε(Q_ε(τ)). By the chain rule, g″(τ) = [∂² log f_ε(u)/∂u²]/f_ε(u). Then one can easily show that (46) is equivalent to

\lim_{\tau \to 0}\Big\{\frac{g^2(\tau) + g^2(1-\tau)}{\tau} + \big[|g''(\tau)| + |g''(1-\tau)|\big]\tau^{3/2}\Big\} = 0.   (87)

For example, lim_{τ→0} g²(τ)/τ = 0 if and only if lim_{u→−∞} f_ε²(u)/F_ε(u) = 0, and lim_{τ→0} g²(1 − τ)/τ = 0 if and only if lim_{u→∞} f_ε²(u)/[1 − F_ε(u)] = 0. It remains to show that (87) implies (45).

Let ε > 0 be any given number. By (87), there exists 0 < τ_0^* < 1/2 such that |g″(t)| < εt^{−3/2} and |g″(1 − t)| < εt^{−3/2} for all t ∈ (0, τ_0^*). Fix τ_0^*. By Assumption 2, there exists c < ∞ such that |g″(τ)|² ≤ c for τ ∈ [τ_0^*, 1 − τ_0^*]. Let τ* = min{τ_0^*, ε[c(1 − 2τ_0^*)]^{−1/2}}.

Then τ*²c(1 − 2τ_0^*) ≤ ε². For τ < τ*, splitting \int_\tau^{1-\tau} = \int_\tau^{\tau_0^*} + \int_{\tau_0^*}^{1-\tau_0^*} + \int_{1-\tau_0^*}^{1-\tau}, we have

\tau^2\int_{\tau}^{1-\tau}|g''(t)|^2\,dt \le \tau^2\Big(\int_{\tau}^{\tau_0^*}\varepsilon^2t^{-3}\,dt + \int_{\tau_0^*}^{1-\tau_0^*}c\,dt + \int_{\tau}^{\tau_0^*}\varepsilon^2t^{-3}\,dt\Big) \le \tau^2\Big[\frac{\varepsilon^2}{2\tau^2} + c(1 - 2\tau_0^*) + \frac{\varepsilon^2}{2\tau^2}\Big] \le 2\varepsilon^2,

completing the proof.

Proof of Theorem 10. (i) For S(ω) in (13), R_k = S((1/k, …, 1/k)^T). By the uniqueness of the minimizer ω* of S(ω) [see (14)], R_k ≥ Ω_k^{−1}, with equality if and only if ω* = (1/k, …, 1/k)^T. Let g(τ) = f_ε(Q_ε(τ)), and for convenience write g(τ_0) = g(τ_{k+1}) = 0. For ω* = (ω_1^*, …, ω_k^*)^T in (14), by H^{−1} = PΓ^{−1}P and (85), we can show

\omega_j^* = \frac{(k+1)[2g(\tau_j) - g(\tau_{j-1}) - g(\tau_{j+1})]g(\tau_j)}{\Omega_k},   (88)

where Ω_k is defined in (15). Note that, for j = ⌊(k + 1)τ⌋ with τ ∈ (0, 1), lim_{k→∞}(k + 1)²[2g(τ_j) − g(τ_{j−1}) − g(τ_{j+1})]g(τ_j) = −g″(τ)g(τ). Thus, as k → ∞, ω_j^* = 1/k for all j implies g″(τ)g(τ) = −c, τ ∈ (0, 1), for some c > 0. Define the transformation u = Q_ε(τ). By the chain rule, g″(τ)g(τ) = −c is equivalent to [f_ε(u)f_ε″(u) − f_ε′(u)²]/f_ε²(u) = −c, or {log f_ε(u)}″ = −c. Thus, f_ε must be a normal density.

(ii) See the proof in Kai, Li and Zou (2010).
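
As a numerical illustration of part (i), the optimal weights (88) computed for the standard normal density are roughly uniform already at k = 9; exact uniformity holds only in the k → ∞ limit, with slightly larger mass at the extreme quantiles for finite k. A quick R check, using the fact that the numerators of (88) sum to Ω_k:

    k <- 9; tau <- (1:k) / (k + 1)
    g <- dnorm(qnorm(tau))               # g(tau) = f(Q(tau)) for the N(0,1) density
    gpad <- c(0, g, 0)                   # boundary convention g(tau_0) = g(tau_{k+1}) = 0
    num <- (k + 1) * (2 * g - gpad[1:k] - gpad[3:(k + 2)]) * g
    round(num / sum(num), 3)             # roughly uniform; slightly larger at the extremes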

Proof of Corollary 2. As in the proof of Theorem 1, we use the martingale CLT and consider scalar-valued ḟ(X_t, β). Similar to d_t in (73), with the optimal weight ω*, define

d_t^* = \sum_{j=1}^{k_n}\frac{\omega_j^*}{f_\varepsilon(Q_\varepsilon(\tau_j))}[\tau_j - 1_{\varepsilon_t < Q_\varepsilon(\tau_j)}].

By Theorem 9, S(ω*) = 1/Ω_{k_n} → 1/ℐ(f_ε). Thus, the convergence of the conditional variances follows from the argument in (74). It remains to verify the Lindeberg condition.

By Assumption 2, g(τ) = fε(Qε(τ)) is bounded on τ ∈ (0, 1). Thus, from (88),

|d_t^*| \le \frac{k_n + 1}{\Omega_{k_n}}\sum_{j=1}^{k_n}\big|2g(\tau_j) - g(\tau_{j-1}) - g(\tau_{j+1})\big| \le c_1k_n^2,

for some constant c1. For any given c2 > 0,

\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\big[\big(\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}d_t^*\big)^2 1_{|\{\dot f(X_t,\beta) - \mathbb{E}[\dot f(X,\beta)]\}d_t^*| \ge c_2\sqrt{n}}\big] \le c_1^2k_n^4\,\mathbb{E}\big(\{\dot f(X,\beta) - \mathbb{E}[\dot f(X,\beta)]\}^2 1_{|\dot f(X,\beta) - \mathbb{E}[\dot f(X,\beta)]| \ge c_2\sqrt{n}/(c_1k_n^2)}\big).   (89)

Note that, for any random variable U ∈ ℒ^q, q > 2, and any constant c > 0,

\mathbb{E}\big(U^21_{|U| \ge c}\big) \le \mathbb{E}\Big(\frac{|U|^q}{c^{q-2}}1_{|U| \ge c}\Big) \le \frac{\mathbb{E}(|U|^q)}{c^{q-2}}.   (90)

Applying (90) to (89) and using k_n = O[(log n)^ε], we verify the Lindeberg condition.

Proof of Theorem 11. (i) This follows from (86) and the proof of Theorem 8. (ii) From lim_{τ→0}[g²(τ) + g²(1 − τ)]/τ = ∞ and (86), 1/S(ω*) ≥ (k + 1)[g²(τ_1) + g²(τ_k)] → ∞.

Footnotes

*

We thank Roger Koenker, Steve Portnoy, Victor Chernozhukov, two referees, and seminar participants at the University of Illinois for their very helpful comments. Zhao’s research was partially supported by NIDA grant P50-DA10075-15. Xiao thanks Boston College for research support. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.

1

Note that different forms of the location-scale model can be studied similarly to our analysis in this section, and optimally weighted quantile averaging estimators can be constructed (but the construction of optimal weights will be different, depending on the specific form of the regression model).

Contributor Information

Zhibiao Zhao, Email: zuz13@stat.psu.edu, Department of Statistics, Penn State University, University Park, PA 16802.

Zhijie Xiao, Email: zhijie.xiao@bc.edu, Department of Economics, Boston College, Chestnut Hill, MA 02467.

References

  1. Akahira M, Takeuchi K. Non-Regular Statistical Estimation. New York: Springer; 1995. [Google Scholar]
  2. Beran R. Asymptotically efficient adaptive rank estimates in location models. Annals of Statistics. 1974;2:248–266. [Google Scholar]
  3. Bickel P. On adaptive estimation. Annals of Statistics. 1982;10:647–671. [Google Scholar]
  4. Black F. Capital market equilibrium with restricted borrowing. Journal of Business. 1972;45:444–454. [Google Scholar]
  5. Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society, Series B. 2011;73:325–349. doi: 10.1111/j.1467-9868.2010.00764.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chen X, Linton O, Jacho-Chavez D. Averaging of moment condition estimators. Working paper. 2011 [Google Scholar]
  7. Chamberlain G. Quantile regression, censoring and the structure of wages. Advances in Econometrics, sixth World Congress. 1994:171–210. [Google Scholar]
  8. Engle RF. Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica. 1982;50:987–1007. [Google Scholar]
  9. Fan J, Farmen M, Gijbels I. Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society, Series B. 1998;60:591–608. [Google Scholar]
  10. Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. London: Chapman and Hall; 1996. [Google Scholar]
  11. He X, Shao QM. A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. Annals of Statistics. 1996;24:2608–2630. [Google Scholar]
  12. Jurečková J, Procházka B. Regression quantiles and trimmed least squares estimator in nonlinear regression model. Journal of Nonparametric Statistics. 1994;3:201–222. [Google Scholar]
  13. Kai B, Li R, Zou H. Local composite quantile regression smoothing: an efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B. 2010;72:49–69. doi: 10.1111/j.1467-9868.2009.00725.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Knight K. Limiting distributions for L1 regression estimators under general conditions. Annals of Statistics. 1998;26:755–770. [Google Scholar]
  15. Koenker R. A note on L-estimates for linear models. Statistics & Probability Letters. 1984;2:323–325. [Google Scholar]
  16. Koenker R. Quantile Regression. New York: Cambridge University Press; 2005. [Google Scholar]
  17. Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46:33–49. [Google Scholar]
  18. Koenker R, Zhao Q. L-estimation for linear heteroscedastic models. Journal of Nonparametric Statistics. 1994;3:223–235. [Google Scholar]
  19. Koenker R, Zhao Q. Conditional quantile estimation and inference for ARCH models. Econometric Theory. 1996;12:793–813. [Google Scholar]
  20. Pollard D. Asymptotics for least absolute deviation regression estimators. Econometric Theory. 1991;7:186–199. [Google Scholar]
  21. Portnoy S, Koenker R. Adaptive L-estimation of linear models. Annals of Statistics. 1989;17:362–381. [Google Scholar]
  22. Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association. 1995;90:1257–1270. [Google Scholar]
  23. Sharpe WF. Capital asset prices: a theory of market equilibrium under conditions of risk. Journal of Finance. 1964;19:425–442. [Google Scholar]
  24. Silverman BW. Density Estimation. London: Chapman and Hall; 1986. [Google Scholar]
  25. Stone C. Adaptive maximum likelihood estimation of a location parameter. Annals of Statistics. 1975;3:267–284. [Google Scholar]
  26. Stout WF. Almost Sure Convergence. New York: Academic Press; 1974. [Google Scholar]
  27. Wang H, Zhu Z, Zhou J. Quantile regression in partially linear varying coefficient models. Annals of Statistics. 2009;37:3841–3866. doi: 10.1214/07-AOS561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Xiao Z, Koenker R. Conditional quantile estimation and inference for GARCH models. Journal of the American Statistical Association. 2009;104:1696–1712. [Google Scholar]
  29. Yu K, Jones MC. Local linear quantile regression. Journal of the American Statistical Association. 1998;93:228–237. [Google Scholar]
  30. Zhao Q. Asymptotically efficient median regression in the presence of heteroskedasticity of unknown form. Econometric Theory. 2001;17:765–784. [Google Scholar]
  31. Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. Annals of Statistics. 2008;36:1108–1126. [Google Scholar]
