Abstract
We develop a generally applicable framework for constructing efficient estimators of regression models via quantile regressions. The proposed method is based on optimally combining information over multiple quantiles and can be applied to a broad range of parametric and nonparametric settings. When combining information over a fixed number of quantiles, we derive an upper bound on the distance between the efficiency of the proposed estimator and the Fisher information. As the number of quantiles increases, this upper bound decreases and the asymptotic variance of the proposed estimator approaches the Cramér-Rao lower bound under appropriate conditions. In the case of non-regular statistical estimation, the proposed estimator leads to super-efficient estimation. We illustrate the proposed method for several widely used regression models. Both asymptotic theory and Monte Carlo experiments demonstrate its superior performance over existing methods.
Keywords: Asymptotic normality, Bahadur representation, Efficiency, Fisher information, Quantile regression, Super-efficiency
1 Introduction
For regression estimation, the most widely used approach is the least squares (LS) method (for finite-dimensional models) or the local LS method (in infinite-dimensional settings). If the data are normally distributed, the LS estimator has a likelihood interpretation and is the most efficient estimator. In the absence of Gaussianity, the LS estimator is usually less efficient than methods that exploit the distributional information, although it may still be consistent under appropriate regularity conditions. Without these regularity conditions, the LS estimator may not even be consistent, for example, when the data have a heavy-tailed distribution such as the Cauchy distribution. Monte Carlo evidence indicates that the LS estimator can be quite sensitive to outliers. In empirical analysis, many applications (such as finance and economics) involve data with heavy-tailed or skewed distributions, and the LS estimators may perform poorly in these cases. It is therefore important to develop robust and efficient estimation procedures for general innovation distributions.
If the underlying distribution were known, the Maximum Likelihood Estimator (MLE) could be constructed. Under regularity conditions, the MLE is asymptotically normal and asymptotically efficient in the sense that the limiting covariance matrix attains the Cramér-Rao lower bound. In practice, the true density function is generally unknown and so the MLE is not feasible. Nevertheless, the MLE and the Cramér-Rao bound serve as a benchmark against which we can measure our estimators.
For this reason, statisticians have devoted a great deal of research effort to the construction of estimation procedures that can extract distributional information from the data and thus deliver more efficient estimators than the conventional LS method. For the location model Y = α + ε, where ε has a symmetric density, adaptive likelihood or score function based estimators of α were constructed in Beran (1974) and Stone (1975). Bickel (1982) further extended the idea to slope estimation in classical linear models. For nonlinear models, adaptive likelihood-based estimation is usually technically challenging.
We believe that the quantile regression technique [Koenker and Bassett (1978); Koenker (2005)] provides a useful approach to efficient statistical estimation. Intuitively, an estimation method that exploits the distributional information can potentially provide more efficient estimators. Since quantile regression provides a way of estimating the whole conditional distribution, appropriately using quantile regressions may improve estimation efficiency. Under regularity assumptions, the least-absolute-deviation (LAD) regression (i.e., quantile regression at the median) can provide better estimators than the LS regression in the presence of heavy-tailed distributions. In addition, for certain distributions, a quantile regression at a non-median quantile may deliver a more efficient estimator than the LAD method. More importantly, additional efficiency gains can be achieved by combining information over multiple quantiles.
Although combining quantile regressions over multiple quantiles can potentially improve estimation efficiency, this is much easier said than done in a satisfactory way. To combine information from quantile regressions, one may consider combining information over different quantiles via the criterion or loss function. For example, Zou and Yuan (2008) and Bradic, Fan and Wang (2011) proposed the composite quantile regression (CQR) for parameter estimation and variable selection in classical linear regression models. For nonparametric regression models, Kai, Li and Zou (2010) proposed a local CQR estimation procedure, which is asymptotically equivalent to the local LS estimator as the number of quantiles increases. Alternatively, one may combine information based on estimators at different quantiles. Along this direction, Portnoy and Koenker (1989) studied asymptotically efficient estimation for the simple linear regression model. Although their estimator is asymptotically efficient, it is not the best estimator for a fixed set of quantiles. Also see Chamberlain (1994), Xiao and Koenker (2009), and Chen, Linton and Jacho-Chavez (2011) for related work on the combination of estimators.
In this paper we consider regression estimation by combining information across k quantiles τj = j/(k + 1), j = 1, …, k. We show that for a wide range of parametric and nonparametric regression models, more efficient estimators can be constructed via optimally combining quantile regressions. We argue that it is essential to combine quantile information appropriately to achieve an efficiency gain. In particular, simply averaging multiple quantile regression estimators is asymptotically equivalent to the LS method. We show that, by optimally combining information across quantiles τ1, …, τk, the efficiency of the proposed optimal weighted quantile average estimator is at most Φk away from the Fisher information, where Φk is defined in (43). As the number of quantiles k → ∞, under appropriate regularity conditions, we have Φk → 0 and the estimator is asymptotically efficient. Interestingly, in the case of non-regular statistical estimation when these regularity conditions do not hold, the proposed estimators may lead to super-efficient estimation.
The proposed methodology provides a generally applicable framework for constructing more efficient estimators under a broad variety of settings. For finite-dimensional parametric estimation, the method can be applied to construct efficient estimators for parameters in both linear and nonlinear regression models with homoscedastic errors, and for parameters in location-scale models with conditional heteroscedasticity. We show that, in the presence of conditional heteroscedasticity, an appropriate preliminary quantile regression is needed to improve the efficiency and to facilitate the quantile combination. Different restrictions (and thus optimal weights) are needed for estimation of the location parameters and the scale parameters. For nonparametric function estimation, the asymptotic bias of the proposed estimator is the same as that of conventional nonparametric estimators (such as the local LS and the local LAD estimators), while the inverse of the asymptotic variance is at most Φk away from the optimal Fisher information. Our extensive simulation studies show that the proposed method significantly outperforms the widely used LS, LAD, and CQR methods [Zou and Yuan (2008); Kai, Li and Zou (2010)] for both parametric and nonparametric models.
The rest of this paper is organized as follows: we provide a general discussion on the framework and assumptions for constructing efficient estimators based on quantile regressions in Section 2. Three leading cases of regressions are then investigated in Sections 3–5. In particular, we study parametric regression models with homoscedasticity in Section 3. In Section 4, we study parametric models with heteroscedasticity. Nonparametric models are investigated in Section 5. We focus on methodology, and our discussions in Sections 3–5 consider the case with finite k. The case with k increasing to infinity is discussed in Section 6. Estimation of the optimal weights is discussed in Section 7. Simulation studies are contained in Section 8, and an application to financial data is given in Section 9 to highlight the proposed method. Proofs are given in the Appendix.
2 Model Setup, Assumptions, and the Weighted Quantile Average Estimator
We consider regression models of the following form:
Y = m(X) + σ(X)ε,  (1)
where (X, Y, ε) is the triplet of covariate, response, and noise, with ε independent of X. Here m(·) and σ(·) are two functions that depend on unknown parameters θ, where θ may be of finite dimension (parametric case) or infinite dimension (nonparametric case).
Denote by Qε(τ) the τ-th quantile of ε for τ ∈ (0, 1). Then the τ-th conditional quantile of Y given X, denoted by QY (τ|X), is given by
QY(τ|X) = m(X) + σ(X)Qε(τ).  (2)
As the inverse of the conditional distribution function, QY(τ|X) fully captures the distributional relationship between Y and X. Intuitively, different distributional information may be obtained from different quantiles, and an appropriate combination of multiple quantiles may be more informative about the distribution than the conditional mean used by the LS method.
Throughout this paper we consider combining information over k equally spaced quantiles τj = j/(k + 1), j = 1, …, k. The discussion of this paper focuses on the case where k is a given finite number. The case where k increases with n is considered in Section 6.3.
We briefly introduce the idea of our proposed estimator. Let θ be a parameter of interest. From the conditional quantile in (2), we can usually identify some perturbed version of θ, denoted by θ(τ). Suppose there exists a class 𝒲 of weights such that
θ = ω1θ(τ1) + ⋯ + ωkθ(τk) for each ω = [ω1, …, ωk]T ∈ 𝒲.  (3)
Given data on (X, Y), we can use a quantile regression based on (2) to obtain some consistent estimate, denoted by θ̂(τ), of θ(τ). In light of (3), we propose the following estimate of θ:
θ̂WQAE(ω) = Σj=1k ωj θ̂(τj), ω ∈ 𝒲.  (4)
We term θ̂WQAE(ω) the weighted quantile average estimator (WQAE) of θ. Since θ̂WQAE(ω) is a consistent estimate of θ for each ω ∈ 𝒲, we propose estimating θ using the ω ∈ 𝒲 that minimizes the limiting variance of θ̂WQAE(ω) in parametric settings, or the asymptotic mean squared error in nonparametric settings. We call the resulting estimator θ̂WQAE(ω*) the optimal WQAE (OWQAE).
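As an illustration, here is a minimal R sketch of the combination step (4); the quantile-regression step producing the θ̂(τj) is assumed to have been run already, and all function and variable names are ours, not the paper's.

```r
## A minimal sketch of the combination step (4): given quantile-regression
## estimates theta_hat(tau_j) and weights omega, form the weighted average.
## Names are illustrative only.
wqae <- function(theta_hat, omega) {
  stopifnot(length(theta_hat) == length(omega))
  sum(omega * theta_hat)   # theta_WQAE(omega) = sum_j omega_j * theta_hat(tau_j)
}

k <- 9
omega <- rep(1 / k, k)                       # uniform weights as a baseline
theta_hat <- rnorm(k, mean = 2, sd = 0.1)    # stand-in for quantile estimates
wqae(theta_hat, omega)
```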
The proposed estimation can be applied to a wide range of regression models. In this paper, we focus on the following three leading cases of regression (1):
Case 1: Parametric regression models with homoscedastic errors.
Case 2: Location-scale models with conditional heteroscedasticity in σ(·).
Case 3: Nonparametric regressions.
We study each of these cases in Sections 3–5, respectively.
Suppose we have samples {(Xt, Yt)}t=1n from (1) with corresponding noises {εt}t=1n. To facilitate the study of asymptotic theory, we impose the following assumptions.
Assumption 1 (i) {(Xt, εt)}t∈ℤ is strictly stationary; for each t, εt is independent of {Xt, Xt−1, …; εt−1, εt−2, …}. (ii) {Xt}t∈ℤ is an ergodic process.
Assumption 2 Denote by fε(·) and Fε(·) the density and distribution functions of ε. fε is positive, twice differentiable, and bounded on {u : 0 < Fε(u) < 1}.
Assumption 1 provides a convenient framework for studying asymptotic theory. First, the strong mixing condition implies ergodicity, and thus Assumption 1(ii) is weaker than the widely used strong mixing conditions in time series analysis. Next, we illustrate two useful properties below.
- (P1) Martingale structure. Let ℱt be the σ-algebra generated by {Xt+1, Xt, …; εt, εt−1, …}. By Assumption 1(i), εt is independent of ℱt−1. For any functions h1(·) and h2(·) such that 𝔼[|h1(Xt)h2(εt)|] < ∞ and 𝔼[h2(εt)] = 0, we have 𝔼[h1(Xt)h2(εt)|ℱt−1] = h1(Xt)𝔼[h2(εt)] = 0. Therefore, {h1(Xt)h2(εt)}t∈ℤ are martingale differences with respect to the filtration {ℱt}t∈ℤ, and consequently the partial sum Σt=1n h1(Xt)h2(εt) is a martingale.
- (P2) Law of large numbers. By the ergodic theorem [Theorem 3.5.7, Stout (1974)], for any function ℓ(·) such that 𝔼[|ℓ(Xt)|] < ∞, the ergodicity in Assumption 1(ii) implies
(1/n) Σt=1n ℓ(Xt) → 𝔼[ℓ(X1)] almost surely.  (5)
In subsequent sections we adopt the following notation. For a random vector Z, we use ‖Z‖ to signify the Euclidean norm of Z, and write Z ∈ ℒq, q > 0, if 𝔼(‖Z‖q) < ∞.
3 Homoscedastic Parametric Regression Models
In this section we study the parametric regression model (Case 1 in Section 2). Corresponding to the general representation (1), let σ(X) ≡ 1 and m(X) = α + m(X, β), where β ∈ ℝp is the vector of unknown parameters and the intercept α is added to ensure the identifiability of β. We then have
Y = α + m(X, β) + ε.  (6)
We are interested in the estimation of β.
By (2), we have the conditional quantile representation
QY(τ|X) = α(τ) + m(X, β), where α(τ) = α + Qε(τ).  (7)
Given samples {(Xt, Yt)}t=1n from (6), let (α̂(τ), β̂(τ)) be an estimator of (α(τ), β) from a quantile regression based on (7):
(α̂(τ), β̂(τ)) = argmin(a,b) Σt=1n ρτ(Yt − a − m(Xt, b)),  (8)
where ρτ(z) = z(τ − 1z≤0) and 1 is the indicator function. Denote by ṁ(x, β) the partial derivative vector of m(x, β) with respect to β. Define Zt = [1, ṁ(Xt, β)T]T and Dn = (1/n) Σt=1n ZtZtT.
Similar to the linear regression, Dn serves as the design matrix and Zt is the equivalent covariate corresponding to observation t for the quantile regression (8). A leading example is the classical linear regression model [see, e.g., Koenker (1984)], corresponding to m(X, β) = XT β. In this case, QY(τ|X) = [α + Qε(τ)] + XT β and ṁ(X, β) = X.
Assumption 3 The quantile regression estimator has the Bahadur representation
[α̂(τ) − α(τ), (β̂(τ) − β)T]T = [fε(Qε(τ))]−1 Dn−1 (1/n) Σt=1n Zt[τ − 1εt≤Qε(τ)] + op(n−1/2),  (9)
uniformly in τ ∈ 𝒯 ≔ [δ, 1 − δ] with some small constant δ > 0.
Assumption 3 is an asymptotic representation of the quantile regression estimator. Under regularity conditions on the regression function m(·), error density, and the parameter space, a Bahadur representation can be obtained over τ on a subset of [0,1]. See, e.g., Portnoy and Koenker (1989), Jurečková and Procházka (1994), and He and Shao (1996) for related study. Also see Section 4 for discussions on the conditional heteroscedastic parametric models.
Since β does not depend on τ, we can use β̂(τ) to estimate β with any choice of τ. By Theorem 1 below (also, see the definition of Σβ there),
√n[β̂(τ) − β] ⇒ N(0, τ(1 − τ)fε(Qε(τ))−2 Σβ−1).  (10)
For example, the case with τ = 0.5 corresponds to the median quantile regression or LAD estimation of β.
As discussed in Section 2, we want to combine information over the k quantiles τj = j/(k + 1), j = 1, …, k, where k is assumed to be a given finite number such that τj ∈ 𝒯. Since β̂(τ) is a consistent estimate of β, from (3)–(4), we consider the WQAE of β:
β̂WQAE(ω) = Σj=1k ωj β̂(τj), where ω = [ω1, …, ωk]T satisfies Σj=1k ωj = 1.  (11)
Theorem 1 Suppose Assumptions 1–3 hold and ṁ(X, β) ∈ ℒ2. Then
√n[β̂WQAE(ω) − β] ⇒ N(0, S(ω)Σβ−1),  (12)
with Σβ = 𝔼[ṁ(X, β)ṁ(X, β)T] − 𝔼[ṁ(X, β)]𝔼 [ṁ(X, β)T] assumed to be non-singular, and
S(ω) = ωTHω, where H = (Hij)1≤i,j≤k with Hij = [min(τi, τj) − τiτj] / [fε(Qε(τi)) fε(Qε(τj))].  (13)
The proposed estimator, the OWQAE, of β is obtained by choosing ω to minimize the asymptotic variance of β̂WQAE(ω).
Theorem 2 Under the assumptions of Theorem 1, the optimal weight is
ω* = H−11k / (1kTH−11k), where 1k = [1, …, 1]T.  (14)
With ω* in (14), the OWQAE of β has the following limiting distribution:
√n[β̂WQAE(ω*) − β] ⇒ N(0, Ωk−1Σβ−1), where Ωk = 1kTH−11k = 1/S(ω*).  (15)
Remark 1: A quick way of combining quantile information is to take a simple average of the quantile regression estimators. This is easy to implement and has been used in the literature [see, e.g., Kai, Li and Zou (2010) for nonparametric estimation] as a method of combining quantile information. If we use ω = [1/k, …, 1/k]T in (11), the resulting unweighted estimator has the asymptotic normality in Theorem 1 with S(ω) replaced by
Rk = (1/k2) Σi=1k Σj=1k [min(τi, τj) − τiτj] / [fε(Qε(τi)) fε(Qε(τj))].  (16)
Clearly, Rk ≥ S(ω*). See Section 6.2 for more discussions on the property of Rk.
We compute ω* for some examples below using k = 9 quantiles 0.1, 0.2, …, 0.9.
Example 1. Let ε be Student-t distributed. For t1 (the Cauchy distribution), the optimal weight is ω* = {−0.03, −0.04, 0.08, 0.29, 0.40, 0.29, 0.08, −0.04, −0.03}: quantiles τ = 0.4, 0.5, 0.6 contribute almost all of the information, whereas quantiles τ = 0.1, 0.2, 0.8, 0.9 receive negative weights, so the unweighted quantile average estimator would not perform well. For N(0,1), by contrast, ω* = {0.13, 0.11, 0.11, 0.10, 0.10, 0.10, 0.11, 0.11, 0.13} is close to the uniform weights, and thus the OWQAE, the unweighted quantile average estimator, and the LS estimator have comparable performance.
Example 2. Let ε have a normal mixture distribution. For Mixture 1: 0.5N(0,1) + 0.5N(0,0.56) (different variances), ω* = {−0.002, −0.102, 0.183, 0.277, 0.287, 0.277, 0.183, −0.102, −0.002}: quantiles 0.3, …, 0.7 contain substantial information, whereas quantiles 0.2 and 0.8 receive negative weights. For Mixture 2: 0.5N(−2,1) + 0.5N(2,1) (different means), ω* = {0.185, 0.156, 0.153, 0.078, −0.144, 0.078, 0.153, 0.156, 0.185}: quantiles τ = 0.1, 0.2, 0.3, 0.7, 0.8, 0.9 contribute comparably, while the median performs the worst.
Example 3. Let ε be a Gamma random variable with shape parameter d > 0. For d = 1 (the exponential distribution), almost all of the weight is placed on the low quantiles: the optimal weights on τ1 = 0.1 and τ2 = 0.2 dominate, and the weights ωi* for i = 3, …, 9 are negligible. Quantiles 0.1 and 0.2 contain almost all information.
As shown in Examples 1–3 above, different quantiles may carry substantially different amounts of information, and utilizing such information inappropriately may result in a significant loss of efficiency. This phenomenon provides strong evidence in favor of our proposed optimally weighted quantile based estimators.
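To illustrate how such weights arise, the following R sketch computes ω* in (14) for a known error distribution using the matrix H from (13); the function names are ours, and the printed weights should roughly match Examples 1–2 up to rounding.

```r
## Sketch: the optimal weights omega* in (14) for a known error distribution,
## with H_{ij} = [min(tau_i, tau_j) - tau_i*tau_j] / [g(tau_i) * g(tau_j)]
## and g(t) = f_eps(Q_eps(t)) as in (13). Function names are ours.
optimal_weights <- function(taus, g) {
  gt <- g(taus)
  H <- (outer(taus, taus, pmin) - outer(taus, taus)) / outer(gt, gt)
  w <- solve(H, rep(1, length(taus)))   # H^{-1} 1_k
  w / sum(w)                            # omega* = H^{-1} 1_k / (1_k' H^{-1} 1_k)
}

taus <- (1:9) / 10
round(optimal_weights(taus, function(t) dt(qt(t, df = 1), df = 1)), 2)  # Cauchy (t_1)
round(optimal_weights(taus, function(t) dnorm(qnorm(t))), 2)            # N(0,1)
```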
In practice, the optimal weight ω* in (14), which depends on the sparsity or quantile-density function fε(Qε(τ)), needs to be estimated. We make the following assumption on the estimate, denoted by f̂ε(Q̂ε(τ)), of fε(Qε(τ)).
Assumption 4 supτ∈𝒯 |f̂ε(Q̂ε(τ)) − fε(Qε(τ))| = op(1) for 𝒯 in Assumption 3.
Plugging the consistent estimate f̂ε(Q̂ε(τ)) of fε(Qε(τ)) into the matrix H in (14), we can obtain the following consistent estimate of the optimal weight ω*:
ω̂* = Ĥ−11k / (1kTĤ−11k), where Ĥ is the matrix H in (13) with fε(Qε(τj)) replaced by f̂ε(Q̂ε(τj)).  (17)
Theorem 3 below asserts that β̂WQAE(ω̂*) with the estimated weight ω̂* achieves the same efficiency as β̂WQAE(ω*).
Theorem 3 Under the assumptions of Theorem 1 and Assumption 4, we have
√n[β̂WQAE(ω̂*) − β] ⇒ N(0, Ωk−1Σβ−1).  (18)
4 The Location-Scale Models
Another class of widely used regression models is the location-scale model (Case 2 in Section 2), which allows for conditional heteroscedasticity. There is a large literature in econometrics and statistics on location-scale models. Koenker and Zhao (1994) studied L-estimation of a location-scale model of the following form:
Yt = XtTβ + (XtTγ)εt,  (19)
under the condition XtTγ > 0. Zhao (2001) studied asymptotically efficient median regression using the k-nearest neighbors method. In this section we study the location-scale models via optimal quantile combination.
In the model (19), the positivity constraint XtTγ > 0 is somewhat restrictive for flexible applications. For example, it is violated for normally distributed covariates X. For this reason, many researchers consider an alternative form of σt that can be expressed as a linear function of absolute values of the regressors and other variables:
Yt = XtTβ + (UtTγ)εt,  (20)
where Ut is a vector of absolute values of the regressors and other covariates [see, e.g., Koenker and Zhao (1996) for studies on related models]. For example, with Xt = (Xt1, …, Xtp)T, one may consider a location-scale model with σt = γ0 + γ1|Xt1| + ⋯ + γp|Xtp|, where Ut = (1, |Xt1|, …, |Xtp|)T, γ = (γ0, γ1, ⋯, γp)T, γ0 > 0, γ1 ≥ 0, ⋯, γp ≥ 0.
In this section we consider the location-scale regression model (20). We are interested in the estimation of β and γ. By (2), we have the conditional quantile representation
QY(τ|X, U) = XTβ + UTγ(τ), where γ(τ) = γQε(τ).  (21)
Given a sample of size n, we may estimate (β, γ(τ)) using a quantile regression similar to (8). However, in the presence of conditional heteroscedasticity, it is more efficient to use a weighted quantile regression with weights reflecting the conditional heteroscedasticity. In addition, the weighted quantile regression estimates have nice properties that help combine quantile information.
Thus, following the idea of Koenker and Zhao (1994), we consider the weighted quantile regression:
(β̂(τ), γ̂(τ)) = argmin(b,g) Σt=1n (1/σ̃t) ρτ(Yt − XtTb − UtTg),  (22)
where σ̃t = UtTγ̃ and γ̃ is a consistent estimate of γ.
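A hedged R sketch of the weighted quantile regression (22) using quantreg::rq follows, for an illustrative scalar-covariate model; the preliminary scale estimate is formed from absolute LAD residuals in the spirit of the LAD method (63) in Section 8.3, and all simulated values and variable names are ours.

```r
## Illustrative sketch of the weighted quantile regression (22) with
## quantreg::rq for the model Y = X*beta + (gamma0 + gamma1*|X|)*eps.
library(quantreg)

set.seed(1)
n <- 500
x <- rnorm(n)
y <- 0.6 * x + (0.5 + 1.0 * abs(x)) * rnorm(n)

## Preliminary scale estimate: median regression of absolute LAD residuals
## on (1, |x|); this recovers sigma_t up to a positive scale factor, which
## is all the weighting needs.
fit_lad <- rq(y ~ x, tau = 0.5)
fit_scale <- rq(abs(resid(fit_lad)) ~ abs(x), tau = 0.5)
sigma_tilde <- pmax(fitted(fit_scale), 1e-3)   # keep weights well-defined

## Weighted quantile regression (22): observation weights 1/sigma_tilde.
## "x" estimates beta; the intercept and "abs(x)" terms estimate gamma(tau).
fit_w <- rq(y ~ x + abs(x), tau = 0.3, weights = 1 / sigma_tilde)
coef(fit_w)
```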
Assumption 5 (i) {(Xt, Ut, εt)}t∈ℤ is strictly stationary; for each t, εt is independent of {(Xt, Ut), (Xt−1,Ut−1), …; εt−1, εt−2, …}. (ii) {(Xt, Ut)}t∈ℤ is an ergodic process.
Assumption 6 (i) ‖Xt‖ + ‖Ut‖ ≤ c1 and UtTγ ≥ c2 for some constants c1, c2 > 0. (ii) γ̃ − γ = op(n−1/4). (iii) Let (X, U) be distributed as (Xt, Ut), and define the moment matrices M1, M3, Mβ and Mγ accordingly. These matrices are assumed to be non-singular.
Assumption 5 is a modification of Assumption 1 that allows for more covariates (Xt, Ut). In Assumption 6, (i) is imposed for technical convenience and can be replaced by suitable finite moment conditions; (ii) requires that γ̃ be reasonably close to γ; and (iii) rules out singular design matrices.
Theorem 4 Suppose Assumptions 2, 5, and 6 hold. Then (β̂(τ), γ̂(τ)) is root-n consistent and admits the Bahadur representation (23).
We now construct estimators of β and γ by optimally combining information over quantiles τ1, …, τk.
First, we consider estimation of β. As in Section 3, we consider the WQAE β̂WQAE(ω) given by (11). Using the Bahadur representation (23) from Theorem 4, the same argument as in Theorem 1 shows that β̂WQAE(ω) is asymptotically normal with asymptotic variance determined by S(ω) given in Theorem 1. Therefore, the optimal weight can be constructed in the same way as in Theorem 2, and ω* is given by (14). The resulting OWQAE has a limiting normal distribution whose asymptotic variance is proportional to Ωk−1, with Ωk given in (15). If we use the estimated optimal weight ω̂* in (17), then under the additional Assumption 4, the conclusion in Theorem 3 also holds here.
Next, we consider estimation of the scale parameter γ via quantile combination. As will be clear from the later analysis, the construction of the WQAE and the choice of optimal weights for the scale parameter differ from those for β. For this reason, we denote the weights used in the γ estimation by π = [π1, …, πk]T. From (21)–(22), γ̂(τ) is an estimate of γ(τ) = γQε(τ). Then, for any π satisfying Σj=1k πjQε(τj) = 1, we have Σj=1k πjγ(τj) = γ. Therefore, we propose the following WQAE of γ:
γ̂WQAE(π) = Σj=1k πj γ̂(τj).  (24)
Theorem 5 Under the assumptions in Theorem 4, γ̂WQAE(π) is asymptotically normal with asymptotic variance determined by S(π) = πTHπ, with H defined in (13). Furthermore, the optimal weight is
π* = H−1q / (qTH−1q), where q = [Qε(τ1), …, Qε(τk)]T.  (25)
With π* in (25), the OWQAE of γ has a limiting normal distribution with asymptotic variance proportional to Λk−1, where
Λk = qTH−1q.  (26)
Therefore, the optimal weights for the OWQAE of β and γ are different, and their corresponding OWQAEs have different efficiency. This is due to the structure of the conditional quantile representation (21): β does not depend on the quantile τ whereas γ relies on τ through the coefficient Qε(τ).
Similar to the case of β, the conclusion in Theorem 3 also holds for γ̂WQAE(π̂*) when we use the estimated optimal weight π̂* obtained by plugging consistent estimates of (q, H) into (25).
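For concreteness, a small R sketch computing π* in (25) for a standard normal error; the q vector and H matrix follow (25) and (13), and the names are ours.

```r
## Sketch: the optimal scale weights pi* in (25) for eps ~ N(0,1), using
## q = (Q_eps(tau_1), ..., Q_eps(tau_k))' and H from (13). Names are ours.
taus <- (1:9) / 10
gt <- dnorm(qnorm(taus))                 # g(tau) = f_eps(Q_eps(tau))
H  <- (outer(taus, taus, pmin) - outer(taus, taus)) / outer(gt, gt)
q  <- qnorm(taus)                        # error quantiles Q_eps(tau_j)
pi_star <- solve(H, q) / drop(crossprod(q, solve(H, q)))  # H^{-1}q / (q'H^{-1}q)
round(pi_star, 3)   # unlike omega*, these weights need not sum to one
```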
To implement the weighted quantile regression (22), we need to find a consistent estimate γ̃ of γ. We propose the following procedure:
- For each quantile τ = τ1, …, τk, fit the unweighted quantile regression
(β̃(τ), γ̃(τ)) = argmin(b,g) Σt=1n ρτ(Yt − XtTb − UtTg).  (27)
By the same argument as in Theorem 4, (β̃(τ), γ̃(τ)) = (β, γ(τ)) + Op(n−1/2).
- Combine the fits {(β̃(τj), γ̃(τj))}j=1k across quantiles to obtain the preliminary estimator γ̃; see (53) in Section 7.
Finally, we point out an identifiability issue of the optimal weight π* in (25). Since Qε(τ) is identifiable up to a scale factor, if we multiply Qε(τ) by a constant c, π* and hence γ̂WQAE(π*) in (24) will be multiplied by a factor 1/c. This is due to the non-identifiability of the parameter γ in (20). To ensure identifiability, we may impose some constraint on ε; see Section 7 for more discussions on estimating π*.
5 Nonparametric Regressions
In this section we study the nonparametric regression (Case 3 in Section 2). We assume that both m(·) and σ(·) in (1) are nonparametric functions, and we are interested in the estimation of m(·). Although our theory is also applicable in the multivariate case, to avoid the "curse of dimensionality" we consider the univariate case X ∈ ℝ.
Recall the conditional quantile QY (τ|X) in (2). Without further assumptions, we cannot identify m(X) from QY(τ|X) at a single quantile. To ensure identifiability, we assume that ε has a symmetric density, which is satisfied for many commonly used distributions, such as normal distribution, Student-t distribution, Cauchy distribution, uniform distribution on a symmetric interval, Laplace distribution, symmetric stable distribution, many normal mixture distributions, and their truncated versions on symmetric intervals.
Consider weights ω1, …, ωk satisfying the constraints
Σj=1k ωj = 1 and ωj = ωk+1−j, j = 1, …, k.  (29)
Under the symmetric density assumption above, Qε(τ) + Qε(1 − τ) = 0. Therefore, with quantiles τj = j/(k + 1) and using (2) and (29), we have
m(X) = Σj=1k ωj QY(τj|X).  (30)
This identity suggests estimation of m(·) by plugging in consistent estimation of QY (τj|X).
Given samples {(Xt, Yt)}t=1n, we can estimate QY(τ|x) by the local linear quantile regression [Yu and Jones (1998)]:
Q̂Y(τ|x) = â, where (â, b̂) = argmin(a,b) Σt=1n ρτ(Yt − a − b(Xt − x)) K((Xt − x)/h),  (31)
for a kernel function K(·) and bandwidth h. From (30), we propose the WQAE of m(x):
m̂WQAE(x|ω) = Σj=1k ωj Q̂Y(τj|x).  (32)
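A minimal R sketch of (31)–(32) using quantreg::rq with kernel weights is given below. We use a Gaussian kernel purely for simplicity (Assumption 7 below assumes a bounded-support kernel), and all names are ours.

```r
## Sketch of the local linear quantile regression (31) and the WQAE (32)
## at a point x0, via quantreg::rq with kernel weights.
library(quantreg)

llqr <- function(x, y, x0, tau, h) {
  w <- dnorm((x - x0) / h)                    # kernel weights K((X_t - x0)/h)
  fit <- rq(y ~ I(x - x0), tau = tau, weights = w)
  unname(coef(fit)[1])                        # a_hat = estimated Q_Y(tau | x0)
}

m_wqae <- function(x, y, x0, taus, omega, h) {
  qhat <- sapply(taus, function(tau) llqr(x, y, x0, tau, h))
  sum(omega * qhat)                           # (32): sum_j omega_j Q_hat(tau_j|x0)
}
```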
Assumption 7 (i) fε is symmetric, positive, and twice continuously differentiable on its support; the density function pX(·) > 0 of X is differentiable, m(·) is three times differentiable, and σ(·) > 0 is differentiable, in the neighborhood of x. (ii) nh → ∞ and nh9 → 0. (iii) K(·) integrates to one, is symmetric, and has bounded support. Write μK = ∫ u2K(u)du and φK = ∫ K(u)2du.
Theorem 6 Suppose Assumptions 1 and 7 hold. Let S(ω) be defined in (13). Then
√(nh)[m̂WQAE(x|ω) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)S(ω)/pX(x)).  (33)
Furthermore, ω* in (14) minimizes S(ω) subject to the constraints (29), and
√(nh)[m̂WQAE(x|ω*) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)/[pX(x)Ωk]),  (34)
where Ωk is defined in (15).
For comparison, we briefly review some alternative nonparametric estimation methods. The widely used local linear LS regression estimator, denoted by m̂LS(x), is obtained by replacing the quantile loss ρτ(·) in (31) with the square loss. If 𝔼(εt) = 0 and var(εt) < ∞, under some regularity conditions [Fan and Gijbels (1996)],
√(nh)[m̂LS(x) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)var(ε)/pX(x)).  (35)
When εt’s are Gaussian, the local LS estimation corresponds to the local likelihood criterion. In the absence of Gaussianity, asymptotic results of m̂LS(x) generally still hold but this estimator is less efficient in terms of mean-squared error than estimators that exploit the distributional information. For heavy-tailed data, local quantile regression is a robust estimation method; see, e.g., Yu and Jones (1998). The local median regression estimator, denoted by m̂LAD(x), corresponds to τ = 0.5 in (31). By Theorem 6,
√(nh)[m̂LAD(x) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)/[4fε(0)2pX(x)]).  (36)
Recently, Kai, Li and Zou (2010) proposed a local composite quantile regression (CQR) estimator which takes a simple average of multiple quantile estimations. The local CQR estimator, denoted by m̂CQR(x), has the asymptotic normality
√(nh)[m̂CQR(x) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)Rk/pX(x)),  (37)
where Rk is defined in (16). Intuitively, m̂LS(x) uses information from the local sample average, m̂LAD(x) uses information from the local sample median, m̂CQR(x) uses information from multiple quantiles with uniform weight, and the proposed OWQAE m̂WQAE(x|ω*) combines information from multiple quantiles optimally.
If the error density fε were known, we could replace the quantile loss ρτ(·) in (31) by the log-likelihood log fε(Yt − a − b(Xt − x)) and obtain a likelihood-based estimator, denoted by m̂MLE(x); see, e.g., Fan, Farmen and Gijbels (1998). Under appropriate conditions,
√(nh)[m̂MLE(x) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)/[pX(x)ℐ(fε)]),  (38)
where ℐ(fε) is the Fisher information of fε. Under some regularity conditions, the local likelihood estimator is the most efficient estimator. In practice, fε is unknown and m̂MLE(x) is infeasible. In Section 6.2, it is shown that Ωk → ℐ(fε), and therefore the optimal WQAE m̂WQAE(x|ω*) achieves the same asymptotic efficiency as the infeasible estimator m̂MLE(x).
We now compare the efficiency of m̂WQAE(x|ω*) to m̂LS(x), m̂LAD(x), and m̂CQR(x). From (34)–(37), all four estimators have the asymptotic normality √(nh)[m̂(x) − m(x) − m″(x)μKh2/2] ⇒ N(0, φKσ2(x)s2/pX(x)) with different s2: s2 = Ωk−1 for m̂WQAE(x|ω*), s2 = var(ε) for m̂LS(x), s2 = 1/[4fε(0)2] for m̂LAD(x), and s2 = Rk for m̂CQR(x).
Define the asymptotic mean-squared error (AMSE) as AMSE{m̂(x)|h} = [m″(x)μKh2/2]2 + φKσ2(x)s2/[nhpX(x)]. Minimizing the AMSE, we obtain the optimal bandwidth:
h* = {φKσ2(x)s2 / ([m″(x)μK]2 pX(x) n)}1/5,  (39)
and the associated optimal AMSE evaluated at the optimal bandwidth h*
AMSE{m̂(x)|h*} = (5/44/5) [m″(x)μK/2]2/5 [φKσ2(x)s2/(npX(x))]4/5.  (40)
In Section 6.4, we tabulate s2 for different distributions.
Theorem 7 studies m̂WQAE(x|ω*) when we use the estimated optimal weight ω̂* in (17).
Theorem 7 Under the assumptions of Theorem 1 and Assumption 4, when we use the estimated weight ω̂* in (17), m̂WQAE(x|ω̂*) has the same asymptotic normality as m̂WQAE(x|ω*).
The discussion of the selection of the bandwidth h is deferred to Section 8.4.
6 Efficiency Comparison
6.1 The k-quantile optimal efficiency Ωk and Λk
The parameters in Sections 3–5 can be classified into two types: location-type and scale-type parameters. For β in (6) and (20) and the nonparametric function m(·) in Section 5, these parameters do not directly interact with the error ε, and we call them location-type parameters. For γ in (20), it is directly related to ε, and we call it a scale-type parameter.
Our discussion in the previous sections considers combining information over a fixed number of quantiles. From the results in Sections 3–5, for the location-type parameters mentioned above, the OWQAE has asymptotic variance proportional to Ωk−1, with Ωk defined in (15); for the scale-type parameter γ in (20), the OWQAE has asymptotic variance proportional to Λk−1, with Λk defined in (26). Since the efficiency of an estimator is inversely proportional to its variance, we call Ωk and Λk the k-quantile optimal efficiency of the location-type and scale-type parameters, respectively. The larger Ωk and Λk, the better the performance of the corresponding estimators.
It is well-known that, under appropriate conditions, the variance of any unbiased parameter estimator has the Cramér-Rao lower bound: the inverse of the Fisher information of the underlying distribution. To illustrate the Fisher information for the location-type and scale-type parameters, consider the simple location-scale model Y = β + γε with location parameter β and scale parameter γ. Note that Y has the density fY(y; β, γ) = fε((y − β)/γ)/γ. Under the specification (β, γ) = (0, 1), we can show that, the Fisher information for the location parameter β is
ℐ(fε) = ∫ [fε′(u)]2/fε(u) du,  (41)
and the Fisher information for the scale parameter γ is
𝒥(fε) = ∫ [1 + u fε′(u)/fε(u)]2 fε(u) du.  (42)
We assume ℐ(fε) < ∞ and 𝒥(fε) < ∞. The Fisher information ℐ(fε) and 𝒥(fε) serve as a natural standard when we measure the efficiency of our estimators in the case of regular estimation.
Theorem 8 Suppose Assumption 2 holds. Let Δ = 1/(k + 1), g(t) = fε(Qε(t)), and h(t) = Qε(t)fε(Qε(t)).
(i) For Ωk in (15), we have |Ωk − ℐ(fε)| ≤ Φk, where Φk is defined in (43).
(ii) For Λk in (26), we have |Λk − 𝒥(fε)| ≤ Ψk, where Ψk is defined in (44).
Theorem 8 indicates that, by optimally combining k quantiles τ1, …, τk, the k-quantile optimal efficiency Ωk (resp. Λk) for the OWQAE of the location-type (resp. scale-type) parameters is at most Φk (resp. Ψk) away from the corresponding Fisher information ℐ(fε) (resp. 𝒥(fε)). This result holds for any fixed k.
6.2 Asymptotic behavior of Ωk and Λk
In all previous sections, k is assumed to be a given finite number. In the following few sections, we discuss the behavior of the proposed estimators as k increases with n. In this section, we consider the asymptotic behavior of Ωk and Λk as k → ∞. For regular estimation, it is shown that Ωk and Λk approach the corresponding Cramér-Rao efficiency bound. In Section 6.3, we discuss the OWQAE as k → ∞ and when we use the true optimal weight or the estimated optimal weight. In Section 6.4, we discuss the asymptotic relative efficiency of OWQAE compared to some existing methods. Finally, Section 6.5 briefly considers some non-regular estimation.
The Cramér-Rao efficiency analysis is based on the basic assumption of finite Fisher information, ℐ(fε) < ∞ and 𝒥(fε) < ∞. From (41) and (42), this imposes conditions on the behavior of g(τ) and h(τ) as τ → 0 and τ → 1, where g(·) and h(·) are defined in Theorem 8. Thus, by Theorem 8, we have the following result.
Theorem 9 Suppose Assumption 2 holds. Let g(·) and h(·) be defined in Theorem 8. Then, under the boundary condition (45) on g(·) and h(·), we have limk→∞ Ωk = ℐ(fε) and limk→∞ Λk = 𝒥(fε).
The condition (45) is conventionally imposed in the study of efficient estimation. Basically, it requires that the error density decays sufficiently fast at the boundary (the corresponding estimation is sometimes called regular estimation); otherwise one may estimate the parameters at a faster rate. See, e.g., Akahira and Takeuchi (1995) for a discussion of this issue, and Section 6.5 below for related issues. By Theorem 9, as k → ∞, the k-quantile optimal efficiencies Ωk and Λk attain the corresponding Fisher information.
From Theorem 9, Ωk and Λk have different limits as k → ∞. As discussed in Section 4, this is due to the extra dependence of γQε(τ) on Qε(τ) in the scale-type parameter.
Proposition 1 below presents an alternative sufficient condition for (45).
Proposition 1 Suppose fε has support on ℝ and Assumption 2 holds. Then (45) holds if
(46)
Write x ∝ y if x/y is bounded away from 0 and ∞. If fε(u) ∝ |u|−a as |u| → ∞ for some a > 1, then 1 − Fε(u) ∝ |u|1−a as u → ∞ and Fε(u) ∝ |u|1−a as u → −∞. Thus, by Proposition 1, we have the following result.
Corollary 1 Suppose that there exist a > 1 and b > 0 such that fε(u) ∝ |u|−a and ∂2[log fε(u)]/∂u2 ∝ |u|−b as |u| → ∞. Then (45) is satisfied if b + 3(a − 1)/2 > 1.
Many commonly used distributions with support on ℝ satisfy (45). (i) For the standard normal density fε, ∂2log fε(u)/∂u2 = −1 and 1 − Fε(u) = [1 + o(1)]fε(u)/u as u → ∞, so we can verify (46). (ii) For the Laplace distribution with density fε(u) = 0.5 exp(−|u|), u ∈ ℝ, ∂2[log fε(u)]/∂u2 = 0 for u ≠ 0, so (46) can be easily verified. (iii) For the logistic distribution, g(τ) = c(τ − τ2) for some constant c > 0, so (45) holds. (iv) For the Student-t distribution with d > 0 degrees of freedom, Corollary 1 holds with a = d + 1 and b = 2. (v) For normal mixtures, we can verify (46).
Recall that the unweighted quantile average estimator has asymptotic variance proportional to Rk defined in (16). The following result shows that this simple averaging estimator is asymptotically equivalent to the LS estimator as k → ∞: even as we use more and more quantiles, a simple average over quantiles yields no efficiency gain from combining quantile information. Thus, proper weighting over different quantiles is crucial.
Theorem 10 (i) Rk ≥ 1/ℐ(fε); as k → ∞, the equality holds if and only if ε is normally distributed. (ii) If var(ε) < ∞, then limk→∞ Rk = var(ε).
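A quick numeric check of Theorem 10(ii) in R for ε ~ N(0,1), where var(ε) = 1, computing Rk directly from (16):

```r
## Numeric check of Theorem 10(ii) for eps ~ N(0,1), where var(eps) = 1:
## R_k from (16) with taus = j/(k+1) and g(t) = dnorm(qnorm(t)).
Rk <- function(k) {
  taus <- (1:k) / (k + 1)
  gt <- dnorm(qnorm(taus))
  mean((outer(taus, taus, pmin) - outer(taus, taus)) / outer(gt, gt))  # k^{-2} * sum
}
sapply(c(9, 99, 999), Rk)   # should approach var(eps) = 1 as k grows
```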
6.3 Behavior of the OWQAE as k → ∞
In Sections 3–5, the asymptotic normality of the OWQAE is established for a fixed number k of quantiles. In this section we consider the case where k increases with n. To save space, we consider only β̂WQAE(ω*) for the parametric regression case in Section 3.
Since the uniform Bahadur representation holds on a subinterval of [0,1], we modify Assumption 3 so that the Bahadur representation holds uniformly over expanding subintervals of [0,1] when the number of quantiles increases with n.
Assumption 8 The Bahadur representation (9) holds uniformly over τ ∈ 𝒯n = [δn, 1 − δn] with δn = (log n)−ε for some ε > 0. Let kn denote the number of quantiles.
First, we consider the OWQAE with the theoretical optimal weight ω* in (14).
Corollary 2 Consider β̂WQAE(ω) in (11) and ω* in (14). Suppose Assumptions 1, 2, and 8 and (45) hold. Further assume ṁ(Xt, β) ∈ ℒq for some q > 2. Then
√n[β̂WQAE(ω*) − β] ⇒ N(0, Σβ−1/ℐ(fε)).
Thus, if we use more and more quantiles as n → ∞, the efficiency of the OWQAE with the theoretical optimal weight ω* approaches the Fisher information. The same conclusion also holds for the estimators in Sections 4–5 provided that, as in Assumption 8, appropriate Bahadur representations hold uniformly on 𝒯n.
We next briefly discuss the limiting behavior of the OWQAE with estimated weight when kn is chosen as in Assumption 8. Again, we discuss the behavior of the parametric model in Section 3. As kn → ∞, the asymptotic analysis of the proposed estimator is complicated and depends on the behavior of quantile regression estimators and quantile-density estimators at the extremes.
Let β̂WQAE(ω*) be the OWQAE with the optimal weight ω* in (14), and let β̂WQAE(ω̂*) be the OWQAE with estimated weight ω̂* [see, e.g., (17)]. Then, using Σj=1kn ωj* = 1 and Σj=1kn ω̂j* = 1, we have
β̂WQAE(ω̂*) − β̂WQAE(ω*) = Σj=1kn (ω̂j* − ωj*)[β̂(τj) − β].  (47)
From Corollary 2, in order to prove
√n[β̂WQAE(ω̂*) − β] ⇒ N(0, Σβ−1/ℐ(fε)),  (48)
it suffices to prove
√n[β̂WQAE(ω̂*) − β̂WQAE(ω*)] = op(1).  (49)
For this, we need additional regularity conditions regarding the behavior of the density fε(Qε(τ)) when τ approaches the boundary, and conditions on the density estimators.
Assumption 9 Let kn and 𝒯n be chosen as in Assumption 8. There exist constants c, η > 0 such that infτ∈𝒯n fε(Qε(τ)) ≥ c(log n)−η, together with a uniform rate condition on the weight estimates involving the constant ε in Assumption 8.
Under Assumption 8 and the conditions in Assumption 9, by (71) in the proof of Theorem 1, we have the representation (50) of β̂WQAE(ω̂*) − β̂WQAE(ω*) in terms of
Nj = (1/n) Σt=1n ṁ(Xt, β)[τj − 1εt≤Qε(τj)], j = 1, …, kn.
Assume without loss of generality that ṁ(Xt, β) is scalar-valued. By property (P1) in Section 2, the summands of Nj are martingale differences. By the condition ṁ(Xt, β) ∈ ℒ2 and the orthogonality of martingale differences, 𝔼(Nj2) = O(n−1) uniformly in j, and thus max1≤j≤kn |Nj| = Op(√(kn/n)). Recall kn in Assumption 8. Under Assumption 9, the right-hand side of (50) is op(n−1/2).
Thus, (49) follows from (50), and we conclude that (48) holds.
6.4 Comparison of asymptotic relative efficiency
We now compare the efficiency of the proposed OWQAE to some existing methods.
First, we consider the parametric case in Section 3. Theorem 2 gives √n[β̂WQAE(ω*) − β] ⇒ N(0, S(ω*)Σβ−1) with S(ω*) = Ωk−1. For parameter estimation, the most widely used method is the ordinary LS estimator, denoted by β̂LS, which minimizes the squared errors. Assuming var(ε) < ∞ and other appropriate conditions, we have the asymptotic normality √n[β̂LS − β] ⇒ N(0, var(ε)Σβ−1). For the quantile regression based estimator β̂(τ) with a single quantile τ, the asymptotic normality in (10) holds. All three estimators β̂WQAE(ω*), β̂LS, and β̂(τ) have asymptotic normality of the form √n[β̂ − β] ⇒ N(0, s2Σβ−1), where s2 = var(ε) (assuming finite) for β̂LS, s2 = τ(1 − τ)/fε(Qε(τ))2 for β̂(τ), and s2 = S(ω*) for β̂WQAE(ω*). For comparison, we use β̂WQAE(ω*) as the benchmark and define its asymptotic relative efficiency (ARE) to β̂LS and β̂(τ) as
ARE(β̂LS) = var(ε)/S(ω*), ARE(β̂(τ)) = τ(1 − τ)/[fε(Qε(τ))2 S(ω*)].  (51)
A value of ARE ≥ 1 indicates better performance of β̂WQAE(ω*). Clearly, ARE(β̂(τj)) ≥ 1, j = 1, …, k. Under the conditions in Theorem 9, limk→∞ S(ω*) = 1/ℐ(fε) ≤ var(ε) (the Cramér-Rao inequality) so that limk→∞ ARE(β̂LS) ≥ 1. Intuitively, both β̂LS and β̂(τ) use only partial information: sample average and sample τ-th quantile, respectively. By contrast, β̂WQAE(ω*) combines strength across quantiles and thus can be more efficient.
Using k = 9 quantiles, Table 1 tabulates ARE(β̂LS) and ARE(β̂(τ)), τ = 0.1, …, 0.9, for some commonly used distributions. For all non-normal distributions considered, β̂WQAE(ω*) significantly outperforms β̂LS and β̂(τ). For N(0,1), β̂WQAE(ω*) and β̂LS are comparable, and both are about 50% more efficient than β̂(0.5). For Student-t with one (t1) or two (t2) degrees of freedom, LS is not applicable due to infinite variance; β̂WQAE(ω*) is about 20% more efficient than β̂(0.5) and substantially more efficient than β̂(τ) for other choices of τ. Thus, potentially much improved efficiency and robustness can be achieved by using the proposed estimator β̂WQAE(ω*). For linear models, Zou and Yuan (2008) studied the composite quantile regression (CQR) method, and we include the efficiency of their method for comparison purposes. Clearly, the OWQAE is significantly more efficient than the CQR.
Table 1. Asymptotic relative efficiency of β̂WQAE(ω*) relative to β̂LS, the CQR estimator β̂CQR, and β̂(τ) at τ = 0.1, …, 0.9 (k = 9 quantiles).

ε distribution | β̂LS | β̂CQR | τ = 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
---|---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 1.58 | 47.12 | 6.40 | 2.34 | 1.40 | 1.19 | 1.40 | 2.34 | 6.40 | 47.12 |
Student-t2 | NA | 1.12 | 9.01 | 2.85 | 1.65 | 1.27 | 1.17 | 1.27 | 1.65 | 2.85 | 9.01 |
N(0,1) | 0.96 | 1.03 | 2.80 | 1.96 | 1.67 | 1.54 | 1.51 | 1.54 | 1.67 | 1.96 | 2.80 |
Mixture 1 | 10.17 | 2.43 | 91.95 | 32.64 | 3.26 | 1.80 | 1.55 | 1.80 | 3.26 | 32.64 | 91.95 |
Mixture 2 | 3.28 | 2.80 | 3.01 | 2.81 | 3.68 | 7.84 | 56.24 | 7.84 | 3.68 | 2.81 | 3.01 |
Laplace | 2.00 | 1.32 | 9.00 | 4.00 | 2.33 | 1.50 | 1.00 | 1.50 | 2.33 | 4.00 | 9.00 |
Gamma(1) | 9.00 | 3.67 | 1.00 | 2.25 | 3.86 | 6.00 | 9.00 | 13.50 | 21.00 | 36.00 | 81.00 |
Gamma(2) | 2.44 | 1.73 | 1.12 | 1.49 | 1.91 | 2.42 | 3.10 | 4.08 | 5.65 | 8.68 | 17.34 |
Gamma(3) | 1.71 | 1.41 | 1.26 | 1.42 | 1.64 | 1.94 | 2.35 | 2.94 | 3.88 | 5.68 | 10.75 |
Beta(1,1) | 1.67 | 2.04 | 1.80 | 3.20 | 4.20 | 4.80 | 5.00 | 4.80 | 4.20 | 3.20 | 1.80 |
Beta(1,2) | 2.35 | 2.33 | 1.05 | 2.11 | 3.16 | 4.22 | 5.27 | 6.33 | 7.38 | 8.44 | 9.49 |
Beta(1,3) | 3.31 | 2.71 | 1.02 | 2.12 | 3.32 | 4.66 | 6.19 | 8.00 | 10.27 | 13.33 | 19.04 |
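For instance, the N(0,1) entries of Table 1 can be reproduced approximately with a few lines of R, computing S(ω*) = 1/(1kTH−11k) and the AREs in (51); the names are ours.

```r
## Illustrative computation of the AREs in (51) for eps ~ N(0,1) with k = 9;
## the results should be close to the N(0,1) row of Table 1.
taus <- (1:9) / 10
gt <- dnorm(qnorm(taus))
H <- (outer(taus, taus, pmin) - outer(taus, taus)) / outer(gt, gt)
S_star <- 1 / sum(solve(H, rep(1, 9)))          # S(omega*) = 1 / (1' H^{-1} 1)
c(ARE_LS  = 1 / S_star,                         # var(eps) = 1 for N(0,1)
  ARE_LAD = (0.25 / dnorm(0)^2) / S_star)       # tau = 0.5: tau(1-tau)/g(tau)^2
```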
We briefly mention the efficiency comparison of the nonparametric estimator m̂WQAE(x|ω*) relative to the local LS, local LAD, and Kai, Li and Zou (2010)'s local CQR estimators in Section 5. By (40),
the nonparametric relative efficiency = (the parametric relative efficiency)4/5.
Thus, the same efficiency comparison result (up to an exponent 4/5) in Table 1 also holds for the nonparametric estimator m̂WQAE(x|ω*) in Section 5.
6.5 Asymptotic super-efficiency
By Corollary 2, under (45), β̂WQAE(ω*) is an asymptotically efficient estimator of β, with limiting covariance matrix attaining the Fisher information bound. The conditions in (45) are not mathematical trivialities but genuinely restrictive conditions needed for the efficiency results. When these "usual" regularity conditions do not hold, the efficiency result above may fail, and we may obtain results different from likelihood-based estimation. For example, we may have a different rate of convergence, and, in general, asymptotically efficient estimators do not exist. These "unusual" cases are sometimes called "non-regular" statistical estimation. In this section, we briefly discuss the case of non-regular estimation. In this case, Theorem 11 below shows that, by using quantile regression with optimal weighting, super-efficient estimators may be obtained in the sense that the efficiency exceeds the Fisher information ℐ(fε).
Theorem 11 Recall Ωk in (15). Let g(τ) be defined in Theorem 8, and let c ∈ [0, ∞] be the boundary limit of g defined in (52).
(i) If 0 < c < ∞ and ℐ(fε) < ∞, then limk→∞ Ωk = c + ℐ(fε).
(ii) If c = ∞, then limk→∞ Ωk = ∞.
Condition (45) covers the regular case c = 0 in (52). Theorem 11 indicates that, for the non-regular case c > 0 in (52), under appropriate conditions, for large k, the variance of the (standardized) optimally weighted quantile regression based estimator β̂WQAE(ω*) is smaller than the Cramér-Rao bound. In particular, if c = ∞, as the number of quantiles k increases, the asymptotic variance approaches zero. In this sense, the estimator β̂WQAE(ω*) is asymptotically super-efficient.
Corollary 3 below concerns a special case of super-efficiency when the density fε is positive at the boundary.
Corollary 3 Denote the support of fε by 𝒟, then limk→∞ Ωk = ∞ in any of the following three cases: (i) 𝒟 = [D1, D2] with fε(D1) + fε(D2) > 0; (ii) 𝒟 = [D1, ∞) with fε(D1) > 0; or (iii) 𝒟 = (−∞, D2] with fε(D2) > 0.
For the truncated versions of the distributions in Section 6.2, we have limk→∞ Ωk = ∞. For example, for the truncated normal on [−1, 1], Corollary 3(i) applies. For the uniform distribution on [0, 1], we can show Ωk = 2k + 2 → ∞.
Similar results can also be established for Λk. We omit the details.
7 Estimation of the Optimal Weight
To construct the proposed OWQAE β̂WQAE(ω*) in Sections 3–5, we need to estimate the optimal weights ω* in (14) and π* in (25). It suffices to estimate Qε(τ) and fε(Qε(τ)). We can accomplish this through a two-step procedure: first, use a preliminary estimator to obtain residuals; second, estimate Qε(τ) and fε(Qε(τ)) based on the residuals. Here we illustrate the idea using the models in Sections 3–5.
(Case 1: Parametric model in Section 3.) Since fε (Qε (τ)) remains the same if we change ε to c + ε for any c, α in (6) can be absorbed into ε. We propose the procedure:
Use the uniform weight ω = [1/k, …, 1/k]T to obtain the preliminary estimator β̂, and compute the “residuals” (a combination of both α and ε) as ε̂t = Yt − m(Xt, β̂).
- To estimate fε(u), use the nonparametric density estimate f̂ε(u) = (nb)−1 Σt=1n K((u − ε̂t)/b), where, following Silverman (1986), we choose the rule-of-thumb bandwidth
b = 0.9 min{sd(ε̂1, …, ε̂n), IQR(ε̂1, …, ε̂n)/1.34} n−1/5.
Here, "sd" and "IQR" are the sample standard deviation and sample interquartile range. Estimate fε(Qε(τ)) by f̂ε(Q̂ε(τ)), where Q̂ε(τ) is the sample τ-th quantile of ε̂1, …, ε̂n.
Plug f̂ε(Q̂ε(τ)) into (14) to obtain the estimated optimal weight ω̂*.
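A sketch of steps (ii)–(iii) in R, assuming the residual vector from step (i) is available; the Gaussian kernel and the exact Silverman constants are our illustrative choices, not necessarily the paper's exact rule.

```r
## Sketch of steps (ii)-(iii): Gaussian-kernel density estimate of f_eps at
## the residual quantiles, then the plug-in weights (17). The residual
## vector `res` from step (i) is assumed given.
estimate_weights <- function(res, taus) {
  n <- length(res)
  b <- 0.9 * min(sd(res), IQR(res) / 1.34) * n^(-1/5)   # rule-of-thumb bandwidth
  qhat <- quantile(res, taus)                           # Q_hat_eps(tau_j)
  fhat <- sapply(qhat, function(q) mean(dnorm((res - q) / b)) / b)
  H <- (outer(taus, taus, pmin) - outer(taus, taus)) / outer(fhat, fhat)
  w <- solve(H, rep(1, length(taus)))
  w / sum(w)                                            # omega_hat* as in (17)
}
```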
(Case 2: Location-scale model in Section 4.) To ensure identifiability, we assume without loss of generality that ε satisfies a scale normalization; otherwise we can consider the reparametrized model Y = XTβ + (UTγ*)ε* with γ* = cγ and ε* = ε/c for a suitable constant c > 0. Note that this normalization has no effect on the optimal weight ω* since fε(Qε(τ)) is invariant under the transformation cε for any c > 0. For each quantile τ = τ1, …, τk, we fit the quantile regression (27) to obtain (β̃(τ), γ̃(τ)). Define the preliminary estimator
(53)
Then (β̃, γ̃) consistently estimates (β, γ). We use the following procedure to compute ω̂* and π̂*:
- Use β̃ and γ̃ in (53) to compute the errors ε̃t = (Yt − XtTβ̃)/(UtTγ̃), t = 1, …, n. To better mimic the scale normalization on ε, consider the transformed errors obtained by rescaling ε̃1, …, ε̃n with their sample quantiles, where Q̃ε(τ) is the sample τ-quantile of ε̃1, …, ε̃n. Use the same steps (ii)–(iii) in Case 1 above to obtain the estimates f̂ε(Q̂ε(τ)) and Q̂ε(τ).
(Case 3: Nonparametric regression model in Section 5.) As in case 2 above, ω* is invariant under the transformation cε, c > 0. Assume without loss of generality that |ε| has median one. Then the conditional median of |Y − m(X)| given X is σ(X), and we can apply local median quantile regression to estimate σ(·). We propose the procedure:
Use (32) with the uniform weight to obtain the preliminary estimator m̂(·).
- Compute |Yt − m̂(Xt)| and estimate σ(·) by the local linear median quantile regression
σ̂(x) = ĉ, where (ĉ, d̂) = argmin(c,d) Σt=1n ρ0.5(|Yt − m̂(Xt)| − c − d(Xt − x)) K((Xt − x)/ℓ).  (54)
For the bandwidth ℓ, following Yu and Jones (1998), we use ℓ = ℓLS(π/2)1/5, where ℓLS is the plug-in bandwidth [Ruppert, Sheather and Wand (1995)] for local linear LS regression based on the data (Xi, |Yi − m̂(Xi)|2), i = 1, …, n.
- Compute the errors ε̂t = [Yt − m̂(Xt)]/σ̂(Xt) and obtain the estimator f̂ε(Q̂ε(τ)) as in the parametric regression Case 1 above.
Use (14) to obtain ω̂1, …, ω̂k and symmetrize them: ω̂j* = (ω̂j + ω̂k+1−j)/2, j = 1, …, k.
8 Monte Carlo Studies
We conduct Monte Carlo studies to investigate the sampling performance of the proposed procedures in a variety of regression models. In all settings below, we use 1000 realizations to evaluate the performance of various methods.
8.1 Linear models with homoscedastic errors
For linear models, we compare six estimation methods. OLS: ordinary LS estimator; LAD: the median quantile estimator with τ = 0.5 in (8); QAU, QAO, QAE: the WQAE in (11) with the uniform weights, theoretical optimal weight ω* in (14), and estimated optimal weight ω̂* (cf. Section 7), respectively; CQR: Zou and Yuan (2008)’s CQR estimator. For QAU, QAO, QAE, and CQR, we use k = 9 quantiles 0.1, 0.2, …, 0.9. With 1000 realizations, we use QAE as the benchmark to which the other five methods are compared based on the empirical relative efficiency:
RE(Method) = Σj=11000 ‖β̂Method(j) − β‖2 / Σj=11000 ‖β̂QAE(j) − β‖2,  (55)
where “Method” stands for OLS, LAD, QAU, QAO, CQR, and β̂(j) is the estimator of β in the j-th realization. A value of RE ≥ 1 indicates better performance of QAE.
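In R, the empirical relative efficiency (55) can be computed from the stored estimates as follows (the matrix layout and names are illustrative):

```r
## Sketch of the empirical relative efficiency (55). B_method and B_qae are
## 1000 x p matrices of estimates across realizations; beta0 is the true
## parameter. All names are ours.
emp_re <- function(B_method, B_qae, beta0) {
  sse <- function(B) sum(rowSums(sweep(B, 2, beta0)^2))  # sum_j ||b_j - beta0||^2
  sse(B_method) / sse(B_qae)                             # RE >= 1 favors QAE
}
```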
We consider both independent data and time series data:
(56)
(57)
Model 2 is a variant of the threshold autoregressive model with a linear component Yt−1. For the innovation εt, we consider 12 distributions: Normal distribution N(0,1), Student-t distribution with one (t1) and two (t2) degrees of freedom, the two normal mixture distributions in Example 2, Laplace distribution, Beta distributions Beta(1,1), Beta(1,2), Beta(1,3), and Gamma distributions Gamma(1), Gamma(2), Gamma(3).
The results are summarized in Table 2(a) for Model 1 with sample sizes n = 100, 300, and in Table 2(b) for Model 2 with n = 300. For N(0,1), Student-t2, and Laplace distributions, QAE and CQR are comparable; for all other distributions, QAE significantly outperforms CQR. Also, QAE outperforms OLS for all non-normal distributions whereas they are comparable for N(0,1). For n = 300, the superior performance of QAE is even more remarkable, which agrees with our asymptotic theory. For almost all cases considered, QAE substantially outperforms the LAD estimator and the relative efficiency can be as high as almost 2000%. It is worth pointing out that, for Beta and Gamma distributions, the relative efficiencies are much higher than the other distributions considered, owing to the super-efficiency phenomenon in Section 6.5. We also note that QAE with estimated optimal weight has comparable performance to QAO with theoretical optimal weight. We conclude that the proposed OWQAE offers a more efficient alternative to existing methods.
Table 2.
(a) Model 1 in (56); columns 2–6 report n = 100 and columns 7–11 report n = 300.

εt distribution | OLS | LAD | QAU | QAO | CQR | OLS | LAD | QAU | QAO | CQR
---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 0.86 | 2.55 | 0.74 | 1.11 | NA | 1.04 | 3.16 | 0.89 | 1.37 |
Student-t2 | NA | 0.95 | 1.14 | 0.81 | 0.89 | NA | 1.06 | 1.30 | 0.99 | 1.02 |
N(0,1) | 0.86 | 1.33 | 0.90 | 0.90 | 0.91 | 0.89 | 1.43 | 0.95 | 0.95 | 0.97 |
Mixture 1 | 5.10 | 0.89 | 3.70 | 0.77 | 1.45 | 7.79 | 1.27 | 6.12 | 0.95 | 2.00 |
Mixture 2 | 2.22 | 10.85 | 2.31 | 1.18 | 2.03 | 2.65 | 19.88 | 2.84 | 1.09 | 2.30 |
Laplace | 1.24 | 0.85 | 1.06 | 0.85 | 0.89 | 1.48 | 0.87 | 1.23 | 0.87 | 1.01 |
Gamma(1) | 5.69 | 5.46 | 4.81 | 0.67 | 2.51 | 7.09 | 6.98 | 6.12 | 0.76 | 2.91 |
Gamma(2) | 2.08 | 2.60 | 1.96 | 0.88 | 1.48 | 2.23 | 2.85 | 2.14 | 0.94 | 1.61 |
Gamma(3) | 1.49 | 2.02 | 1.48 | 0.92 | 1.25 | 1.64 | 2.21 | 1.61 | 0.99 | 1.35 |
Beta(1,1) | 1.49 | 4.20 | 1.73 | 0.91 | 1.77 | 1.58 | 4.58 | 1.88 | 0.95 | 1.90 |
Beta(1,2) | 1.71 | 3.65 | 1.93 | 1.09 | 1.71 | 2.02 | 4.47 | 2.34 | 1.97 | 1.99 |
Beta(1,3) | 2.39 | 4.30 | 2.59 | 1.36 | 2.01 | 2.78 | 5.12 | 3.06 | 3.03 | 2.25 |
(b) Model 2 in (57), n = 300; columns 2–6 report β1 and columns 7–11 report β2.

εt distribution | OLS | LAD | QAU | QAO | CQR | OLS | LAD | QAU | QAO | CQR
---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 1.00 | 1.85 | 0.89 | 1.28 | NA | 0.85 | 2.74 | 0.80 | 1.07 |
Student-t2 | NA | 1.10 | 1.19 | 1.04 | 0.99 | NA | 1.04 | 1.23 | 1.13 | 1.00 |
N(0,1) | 0.91 | 1.45 | 0.94 | 0.94 | 0.98 | 0.90 | 1.36 | 0.94 | 0.94 | 0.95 |
Mixture 1 | 6.85 | 1.14 | 4.84 | 0.92 | 1.78 | 6.82 | 1.16 | 5.20 | 0.88 | 1.77 |
Mixture 2 | 2.70 | 19.79 | 2.86 | 1.11 | 2.35 | 2.80 | 19.51 | 2.99 | 1.18 | 2.43 |
Laplace | 1.36 | 0.89 | 1.17 | 0.89 | 0.99 | 1.37 | 0.91 | 1.21 | 0.91 | 1.00 |
Gamma(1) | 6.70 | 6.23 | 5.38 | 0.79 | 2.83 | 6.24 | 6.22 | 5.35 | 0.75 | 2.80 |
Gamma(2) | 2.39 | 2.94 | 2.22 | 0.95 | 1.70 | 2.42 | 2.80 | 2.23 | 0.93 | 1.68 |
Gamma(3) | 1.53 | 1.95 | 1.50 | 0.97 | 1.27 | 1.68 | 2.07 | 1.59 | 0.91 | 1.36 |
Beta(1,1) | 1.50 | 4.30 | 1.79 | 0.93 | 1.80 | 1.50 | 4.15 | 1.77 | 0.94 | 1.81 |
Beta(1,2) | 1.92 | 4.24 | 2.21 | 0.86 | 1.95 | 2.23 | 4.89 | 2.52 | 0.85 | 2.21 |
Beta(1,3) | 2.78 | 4.71 | 3.04 | 0.82 | 2.29 | 2.77 | 4.60 | 3.01 | 0.80 | 2.28 |
8.2 Nonlinear models with homoscedastic errors
We consider two nonlinear models (one independent data and the other time series data):
(58)
(59)
Model 4 is Engle (1982)’s ARCH model. Again, we consider the 12 distributions in Section 8.1 for εt. Table 3 summarizes the empirical relative efficiency [cf. (55)] of the proposed OWQAE compared to the other methods OLS, LAD, QAU, and QAO (see Section 8.1). The proposed OWQAE is significantly superior to the OLS, LAD and QAU, and comparable to the QAO with theoretical optimal weight.
Table 3.
Model 3 in (58): β = 0.6 (columns 2–5). Model 4 in (59): β1 = 0.3 (columns 6–9) and β2 = 0.5 (columns 10–13).

εt distribution | OLS | LAD | QAU | QAO | OLS | LAD | QAU | QAO | OLS | LAD | QAU | QAO
---|---|---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 0.48 | 1.61 | 0.55 | NA | 0.75 | 2.06 | 0.74 | NA | 0.73 | 2.38 | 0.75 |
Student-t2 | NA | 1.00 | 1.37 | 1.55 | NA | 1.02 | 1.23 | 0.91 | NA | 0.99 | 1.21 | 1.06 |
N(0,1) | 0.79 | 1.81 | 0.96 | 0.95 | 0.97 | 1.45 | 0.99 | 0.99 | 0.92 | 1.38 | 0.97 | 0.97 |
Mixture 1 | 3.75 | 1.01 | 3.16 | 0.86 | 6.45 | 1.12 | 4.62 | 0.89 | 6.99 | 1.09 | 5.08 | 0.91 |
Mixture 2 | 1.81 | 9.54 | 1.90 | 1.02 | 2.56 | 16.83 | 2.72 | 1.15 | 2.75 | 18.08 | 2.85 | 1.15 |
Laplace | 1.36 | 1.19 | 1.15 | 1.19 | 1.52 | 0.93 | 1.28 | 0.93 | 1.42 | 0.89 | 1.21 | 0.89 |
Gamma(1) | 4.21 | 6.58 | 3.94 | 0.75 | 6.74 | 6.42 | 5.78 | 0.77 | 6.67 | 6.44 | 5.56 | 0.73 |
Gamma(2) | 1.71 | 2.78 | 1.69 | 0.97 | 2.23 | 2.59 | 2.07 | 0.97 | 2.39 | 2.90 | 2.21 | 0.92 |
Gamma(3) | 1.31 | 1.89 | 1.28 | 1.14 | 1.59 | 2.04 | 1.53 | 0.96 | 1.64 | 2.05 | 1.58 | 0.92 |
Beta(1,1) | 1.15 | 4.61 | 1.98 | 0.94 | 1.67 | 4.48 | 1.95 | 0.93 | 1.57 | 4.02 | 1.82 | 0.94 |
Beta(1,2) | 1.08 | 3.72 | 1.97 | 2.50 | 1.98 | 3.98 | 2.24 | 0.89 | 2.06 | 4.21 | 2.30 | 0.85 |
Beta(1,3) | 1.58 | 4.38 | 2.93 | 5.27 | 2.84 | 4.85 | 3.02 | 0.81 | 2.96 | 4.94 | 3.09 | 0.89 |
8.3 Location-scale models with conditional heteroscedasticity
Consider two location-scale models (one independent data and the other time series data):
(60)
(61)
Model 6 is the ARCH model in Koenker and Zhao (1996); it differs from Engle's ARCH model, where the conditional heteroscedasticity takes the form in (59). Due to the conditional heteroscedasticity, the parameters are slightly more difficult to estimate. We use Model 5 to illustrate five estimation methods.
- (LS method) If εt has zero mean and unit variance, the Gaussian-likelihood based estimation method is to minimize the loss function in (62). This is essentially an LS type estimation, and Gaussianity is not necessary for consistency. In general, if εt has variance σ2, then this LS method produces consistent estimators of β and σ(γ0, γ1).
This is essentially an LS type estimation and the Gaussianity is not necessary for the consistency. In general, if εt has variance σ2, then this LS method produces consistent estimators of β and σ(γ0, γ1).(62) - (LAD method) First, apply (27) with τ = 0.5 to obtain the LAD estimator β̂LAD of β. Second, apply median quantile regression based on absolute residuals:
This LAD regression produces estimators of Q|ε|(0.5)(γ0, γ1), where Q|ε|(0.5) is the median quantile of |εt|.(63) (OWQAE with theoretical optimal weights). As in Section 8.1, we denote this method by QAO.
- (OWQAE with estimated optimal weights) As in Section 8.1, we denote this method by QAE. As discussed in Section 7, under the scale normalization on ε, QAE produces consistent estimators of (β, γ0, γ1). Without the latter normalization, QAE produces consistent estimators of β and λ(γ0, γ1) for some constant λ > 0.
- (OWQAE based on the unweighted quantile regression (27)) This method works the same as the OWQAE above; the only difference is that β̃(τ) and γ̃(τ) in (27) are used to form the OWQAE. Denote this method by QAEU. Again, QAEU produces estimators of β and λ(γ0, γ1). We include this method to evaluate the performance of the OWQAE based on the unweighted quantile regression (27).
As discussed above, the five estimation methods produce consistent estimators of β and λ(γ0, γ1) for some constant λ depending on the distribution of εt. To make sensible comparison, we divide the corresponding estimators by λ to obtain consistent estimators of (γ0, γ1). Furthermore, to ensure the consistency of the LS method, we consider the properly centered εt for the 12 distributions in Section 8.1. That is, if εt has finite mean, then we center it so that 𝔼(εt) = 0.
The results are summarized in Table 4(a) for Model 5 and in Table 4(b) for Model 6. In both models, the sample size is n = 300. We make three observations. First, the OWQAE delivers much better overall performance than OLS and LAD. Second, in all cases considered, the OWQAE using the heteroscedasticity-weighted quantile regression (22) clearly outperforms the OWQAE using the unweighted quantile regression (27). Third, the OWQAE with estimated weights is comparable to the QAO with theoretical optimal weights.
Table 4.
(a) Model 5 in (60). Location parameter β = 0.6 (columns 2–5); scale parameters γ0 = 0.5 (columns 6–9) and γ1 = 1.0 (columns 10–13).

εt distribution | OLS | LAD | QAO | QAEU | OLS | LAD | QAO | QAEU | OLS | LAD | QAO | QAEU
---|---|---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 0.96 | 0.85 | 1.17 | NA | 1.17 | 0.88 | 1.20 | NA | 1.22 | 0.92 | 1.25 |
Student-t2 | NA | 1.05 | 0.91 | 1.13 | NA | 1.66 | 0.94 | 1.28 | NA | 1.70 | 0.96 | 1.27 |
N(0,1) | 0.89 | 1.43 | 0.93 | 1.12 | 0.68 | 2.29 | 0.92 | 1.28 | 0.68 | 2.37 | 0.94 | 1.26 |
Mixture 1 | 7.05 | 1.21 | 0.91 | 1.18 | 0.55 | 1.88 | 0.85 | 1.23 | 0.55 | 2.05 | 0.89 | 1.27 |
Mixture 2 | 2.57 | 23.03 | 1.03 | 1.23 | 0.66 | 2.07 | 0.86 | 1.28 | 0.69 | 2.58 | 0.86 | 1.20 |
Laplace | 1.53 | 0.80 | 0.80 | 1.11 | 0.93 | 1.89 | 0.96 | 1.25 | 0.96 | 1.81 | 0.96 | 1.21 |
Gamma(1) | 7.38 | 6.40 | 0.79 | 1.17 | 7.31 | 3.72 | 0.50 | 1.13 | 6.38 | 3.18 | 0.44 | 1.14 |
Gamma(2) | 2.32 | 2.89 | 0.92 | 1.13 | 2.70 | 2.61 | 0.79 | 1.23 | 2.60 | 2.76 | 0.81 | 1.21 |
Gamma(3) | 1.67 | 2.37 | 0.92 | 1.13 | 1.79 | 2.63 | 0.83 | 1.15 | 1.76 | 2.84 | 0.88 | 1.20 |
Beta(1,1) | 1.48 | 4.39 | 0.95 | 1.09 | 0.66 | 3.87 | 0.83 | 1.19 | 0.65 | 3.56 | 0.85 | 1.17 |
Beta(1,2) | 1.95 | 4.48 | 0.87 | 1.12 | 1.04 | 3.41 | 0.71 | 1.10 | 1.07 | 3.75 | 0.73 | 1.21 |
Beta(1,3) | 2.75 | 4.97 | 0.84 | 1.11 | 1.71 | 3.64 | 0.69 | 1.14 | 1.63 | 3.30 | 0.65 | 1.10 |
(b) Model 6 in (61). Location parameter β = 0.4 (columns 2–5); scale parameters γ0 = 0.5 (columns 6–9) and γ1 = 0.5 (columns 10–13).

εt distribution | OLS | LAD | QAO | QAEU | OLS | LAD | QAO | QAEU | OLS | LAD | QAO | QAEU
---|---|---|---|---|---|---|---|---|---|---|---|---
Student-t1 | NA | 0.65 | 0.55 | NA | NA | 16.84 | 0.37 | NA | NA | 3.85 | 0.30 | NA |
Student-t2 | NA | 1.03 | 0.89 | 6.76 | NA | 3.69 | 0.88 | 3.38 | NA | 4.01 | 0.96 | 6.02 |
N(0,1) | 0.89 | 1.45 | 0.97 | 1.12 | 0.62 | 1.94 | 0.92 | 1.21 | 0.67 | 2.10 | 0.97 | 1.19 |
Mixture 1 | 6.47 | 1.12 | 0.85 | 1.22 | 0.51 | 1.76 | 0.77 | 1.07 | 0.60 | 1.92 | 0.99 | 1.10 |
Mixture 2 | 2.40 | 22.77 | 1.00 | 6.59 | 0.70 | 20.72 | 0.91 | 34.50 | 0.64 | 9.51 | 0.84 | 5.82 |
Laplace | 1.48 | 0.80 | 0.81 | 1.53 | 0.92 | 2.31 | 0.92 | 1.50 | 0.91 | 2.35 | 0.97 | 1.44 |
Gamma(1) | 5.71 | 6.16 | 0.78 | 1.10 | 5.13 | 3.65 | 0.40 | 1.12 | 7.79 | 5.61 | 0.62 | 1.21 |
Gamma(2) | 2.13 | 2.86 | 0.93 | 1.22 | 2.63 | 4.37 | 0.69 | 1.58 | 2.73 | 4.64 | 0.80 | 1.53 |
Gamma(3) | 1.44 | 2.04 | 0.95 | 1.57 | 1.89 | 4.55 | 0.84 | 2.12 | 1.77 | 3.85 | 0.84 | 1.76 |
Beta(1,1) | 1.59 | 4.30 | 0.92 | 1.01 | 0.67 | 3.14 | 0.71 | 1.00 | 0.83 | 3.79 | 0.98 | 1.01 |
Beta(1,2) | 1.73 | 4.27 | 0.86 | 0.98 | 1.00 | 2.44 | 0.65 | 0.99 | 1.15 | 3.21 | 0.89 | 0.99 |
Beta(1,3) | 2.48 | 5.17 | 0.81 | 1.01 | 1.50 | 2.45 | 0.54 | 0.98 | 1.48 | 3.79 | 0.87 | 1.00 |
8.4 Nonparametric regression models
In our data analysis, we use the standard Gaussian kernel for K(·). We now address the bandwidth selection issue. By (39), the optimal bandwidth h* is proportional to (s²)^{1/5}, where s² is the asymptotic variance factor of the corresponding estimator. Denote by ĥLS and ĥOWQAE the bandwidths for the least squares estimator and the proposed OWQAE, respectively. Then ĥOWQAE = ĥLS{S(ω*)/var(ε)}^{1/5}, where S(ω*) is defined in (13). In practice, to select ĥLS, we can use the plug-in bandwidth selector of Ruppert, Sheather and Wand (1995), implemented via the command dpill in the R package KernSmooth. We then select ĥOWQAE by plugging in estimates of S(ω*) and var(ε) using the two-step procedure in Section 7; for the purpose of comparison, we shall use their true values in our simulation studies. Similarly, we can choose the optimal bandwidths for the other two estimators. Kai, Li and Zou (2010) adopted the same strategy. For the preliminary estimator m̂(·) in step (i) of Section 7, we use the plug-in bandwidth selector ĥLS.
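To make the bandwidth adjustment concrete, here is a minimal sketch (Python; the function name and arguments are ours, written under the ratio relation stated above):

```python
def owqae_bandwidth(h_ls, S_opt, var_eps):
    """Adjust an LS plug-in bandwidth for the OWQAE.

    h_ls    : plug-in bandwidth for the local LS fit (e.g., obtained
              from KernSmooth::dpill in R).
    S_opt   : S(omega*), the combined-quantile variance factor.
    var_eps : var(epsilon), the error variance.
    """
    # The optimal bandwidth scales as the 1/5 power of the asymptotic
    # variance factor, so the two bandwidths differ by the factor
    # {S(omega*)/var(epsilon)}^(1/5).
    return h_ls * (S_opt / var_eps) ** 0.2
```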
We compare the empirical performance of the four methods (LS, LAD, CQR, OWQAE) in Section 5. Based on 1000 realizations, we use the least squares estimator m̂LS as the benchmark against which the other three methods are compared via the relative efficiency

RE(m̂) = [Σ_{j=1}^{1000} ∫_{ℓ1}^{ℓ2} {m̂LS,j(x) − m(x)}² dx] / [Σ_{j=1}^{1000} ∫_{ℓ1}^{ℓ2} {m̂j(x) − m(x)}² dx],

where m̂j and m̂LS,j are the estimators in the j-th realization, and [ℓ1, ℓ2] is the interval over which m is estimated. A value of RE ≥ 1 indicates that m̂ outperforms the LS estimator. To facilitate computation, each integral is approximated using 20 grid points.
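The relative-efficiency computation itself is short; the sketch below (Python; the helper and its arguments are hypothetical) approximates each integrated squared error on a 20-point grid as described above:

```python
import numpy as np

def relative_efficiency(fits_ls, fits, m_true, l1=-1.5, l2=1.5, n_grid=20):
    """Monte Carlo relative efficiency of an estimator against the LS benchmark.

    fits_ls, fits : lists of fitted functions, one per realization.
    m_true        : the true regression function m.
    Each integral over [l1, l2] is approximated on n_grid points; the
    common grid spacing cancels in the ratio.
    """
    x = np.linspace(l1, l2, n_grid)
    ise_ls = sum(np.mean((f(x) - m_true(x)) ** 2) for f in fits_ls)
    ise = sum(np.mean((f(x) - m_true(x)) ** 2) for f in fits)
    return ise_ls / ise  # RE >= 1: the estimator outperforms LS
```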
Consider n = 200 samples from the model

Y = sin(2X) + 2 exp(−16X²) + 0.5ε.  (64)
The same model was used in Kai, Li and Zou (2010) with the normal design X ~ N(0,1); here we use a uniform design to avoid some computational issues. For ε, we consider nine symmetric distributions: N(0,1); truncated N(0,1) on [−1, 1]; truncated Cauchy on [−10, 10]; truncated Cauchy on [−1, 1]; Student-t with 3 degrees of freedom (t3); standard Laplace; uniform on [−0.5, 0.5]; and two normal mixtures, 0.5N(2,1)+0.5N(−2,1) and 0.95N(0,1)+0.05N(0,9). The first normal mixture can be used to model a two-cluster population, whereas the second can be viewed as a noise-contamination model. Let [ℓ1, ℓ2] = [−1.5, 1.5].
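For concreteness, here is a minimal data-generating sketch for model (64) (Python; the uniform design support is our assumption, since the paper only specifies a uniform design, and the noise scale 0.5 follows the form of (64) above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

def m(x):
    # The regression function of model (64).
    return np.sin(2 * x) + 2 * np.exp(-16 * x ** 2)

X = rng.uniform(-1.5, 1.5, n)  # uniform design (support assumed)

# Two of the nine error distributions:
eps_trunc = stats.truncnorm.rvs(-1, 1, size=n, random_state=rng)  # N(0,1) truncated to [-1, 1]
comp = rng.integers(0, 2, n)                                      # mixture component labels
eps_mix = rng.normal(np.where(comp == 0, -2.0, 2.0), 1.0)         # 0.5N(-2,1) + 0.5N(2,1)

Y = m(X) + 0.5 * eps_trunc
```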
The relative efficiencies of the four methods, with m̂LS(x) as the benchmark, are summarized in Table 5(a). Overall, the OWQAE either significantly outperforms or is comparable to the other three methods. For N(0,1) and 0.95N(0,1)+0.05N(0,9), the OWQAE is comparable to LS; for most of the other distributions it has about a 20% efficiency gain over LS, and more than a 60% gain for 0.5N(2,1)+0.5N(−2,1). Compared with CQR, the OWQAE outperforms it for all but four distributions (N(0,1), Student-t3, Laplace, and 0.95N(0,1)+0.05N(0,9)), for which the two are comparable. While the OWQAE underperforms LAD for the truncated Cauchy on [−10, 10], it has substantial efficiency gains for N(0,1), truncated N(0,1) on [−1, 1], truncated Cauchy on [−1, 1], uniform on [−0.5, 0.5], and 0.5N(2,1)+0.5N(−2,1).
Table 5.
(a): Model 7 in (64). Entries are relative efficiencies with the LS estimator as benchmark.

Distribution of ε | LS | LAD | CQR (k = 9) | CQR (k = 19) | CQR (k = 29) | OWQAE (k = 9) | OWQAE (k = 19) | OWQAE (k = 29)
---|---|---|---|---|---|---|---|---
N(0,1) | 1 | 0.73 | 0.99 | 1.00 | 0.99 | 0.96 | 0.95 | 0.95 |
Truncated N(0,1) on [−1, 1] | 1 | 0.54 | 0.93 | 0.98 | 0.99 | 1.14 | 1.25 | 1.26 |
Truncated Cauchy on [−10, 10] | 1 | 1.67 | 1.16 | 1.07 | 1.05 | 1.40 | 1.27 | 1.21 |
Truncated Cauchy on [−1, 1] | 1 | 0.53 | 0.93 | 0.98 | 0.99 | 1.07 | 1.20 | 1.20 |
Student-t with 3 d.f.’s | 1 | 1.44 | 1.39 | 1.21 | 1.16 | 1.42 | 1.28 | 1.19 |
Standard Laplace | 1 | 1.26 | 1.11 | 1.06 | 1.05 | 1.16 | 1.10 | 1.06 |
Uniform on [−0.5, 0.5] | 1 | 0.51 | 0.94 | 0.98 | 0.99 | 1.23 | 1.24 | 1.21 |
0.5N(−2,1)+0.5N(2,1) | 1 | 0.31 | 0.92 | 0.97 | 0.99 | 1.57 | 1.60 | 1.62 |
0.95N(0,1)+0.05N(0,9) | 1 | 0.88 | 1.13 | 1.08 | 1.06 | 1.05 | 0.98 | 0.95 |
(b): Model 8 in (65). Entries are relative efficiencies with the LS estimator as benchmark.

Distribution of ε | LS | LAD | CQR (k = 9) | CQR (k = 19) | CQR (k = 29) | OWQAE (k = 9) | OWQAE (k = 19) | OWQAE (k = 29)
---|---|---|---|---|---|---|---|---
N(0,1) | 1 | 0.66 | 0.96 | 0.98 | 0.98 | 0.94 | 0.95 | 0.95 |
Truncated N(0,1) on [−1, 1] | 1 | 0.43 | 0.84 | 0.91 | 0.93 | 1.14 | 1.56 | 1.80 |
Truncated Cauchy on [−10, 10] | 1 | 2.39 | 1.35 | 1.17 | 1.12 | 1.97 | 1.78 | 1.61 |
Truncated Cauchy on [−1, 1] | 1 | 0.46 | 0.84 | 0.91 | 0.94 | 1.02 | 1.35 | 1.55 |
Student-t with 3 d.f.’s | 1 | 1.73 | 1.75 | 1.52 | 1.41 | 1.72 | 1.51 | 1.44 |
Standard Laplace | 1 | 1.48 | 1.18 | 1.11 | 1.09 | 1.31 | 1.24 | 1.16 |
Uniform on [−0.5, 0.5] | 1 | 0.36 | 0.84 | 0.91 | 0.94 | 1.46 | 2.21 | 2.70 |
0.5N(−2,1)+0.5N(2,1) | 1 | 0.20 | 0.88 | 0.95 | 0.97 | 2.12 | 2.18 | 2.17 |
0.95N(0,1)+0.05N(0,9) | 1 | 0.86 | 1.18 | 1.15 | 1.12 | 1.10 | 1.02 | 0.95 |
The empirical performance of the proposed method in Table 5(a) is not as impressive as its theoretical performance in (40). For example, for the truncated N(0,1) on [−1, 1], the theoretical AREs according to (40) are 1, 0.48, 0.86, 0.93, 0.95, 1.13, 1.69, 2.17, compared with the empirical values 1, 0.54, 0.93, 0.98, 0.99, 1.14, 1.25, 1.26 in Table 5(a). A plot (not included here) of m(x) = sin(2x) + 2 exp(−16x²) explains this phenomenon: the function exhibits large curvature and sharp changes on [−0.5, 0.5], so a large estimation bias can easily offset the asymptotic efficiency gains, especially at a moderate sample size. To appreciate this, we use the same X and ε as in (64) and consider the model
(65)
Then the bias term h²μKm″(x) vanishes because m″(x) = 0, and the variance plays the dominating role. For all four estimation methods, we use the same bandwidth: the plug-in bandwidth selector for local linear regression. The relative efficiencies are summarized in Table 5(b). The overall pattern of the empirical relative efficiencies is consistent with that of the theoretical ones in (40), and the proposed OWQAE significantly outperforms the other methods for almost all distributions considered. Also, using more quantiles (k = 29) significantly improves the performance of the OWQAE for the truncated N(0,1) on [−1, 1], the truncated Cauchy on [−1, 1], and the uniform on [−0.5, 0.5]; this property is not shared by the CQR method.
In summary, for most non-normal distributions considered, the proposed method can have substantial efficiency improvements over other methods, and the empirical performance is consistent with our asymptotic theory.
9 An Empirical Application
To highlight the proposed approach, we consider a simple application of this method to the widely studied cross-section of stock returns. The Capital Asset Pricing Model [CAPM; see Sharpe (1964) and Black (1972)] has long served as the backbone of both theoretical and empirical finance. It is generally agreed that investors demand a higher expected return for investing in riskier securities. Over the past three decades, a number of studies have empirically examined the performance of the CAPM in the cross-section of returns, and it is well documented that the rate of return to holding common stocks is to some extent predictable over time. A large number of papers have studied the appropriateness of the CAPM in explaining how investors assess risk and determine what risk premium to demand, and several alternative models have been proposed in the literature. However, the empirical evidence is ambiguous, the support for other asset-pricing models is no better, and the theory behind the CAPM has an intuitive appeal that other models lack. For these (and other) reasons, in spite of the controversy in empirical studies, the CAPM is still widely used in financial applications and remains the preferred model in MBA and other managerial finance courses.
The focus of this section is not the choice among alternative models. Rather, we apply the methods discussed in the previous sections to the traditional, widely used CAPM cross-sectional regression (similar to the Fama-MacBeth regressions, which can be used to study the predictability of returns). The cross-sectional regression equation at time t is
Ri,t = λ0 + λ1βim,t−1 + ui,t,  (66)

where Ri,t is the excess return for asset i in month t, λ0 is the intercept term, λ1 is the slope coefficient, βim,t−1 is the conditional beta of the excess return for asset i in month t, and ui,t is an error term. The dating convention indicates that the conditional beta is formed using only information available at time t − 1. This regression provides a decomposition of each excess return in each period into two components: the first component, λ1βim,t−1, represents the part of the return of asset i that is related to the cross-sectional structure of risk, as measured by the betas; the remaining component is uncorrelated with the measures of risk. Thus, the asset pricing model implies that the predictability of returns should be related to risk.
We consider a population of stocks traded on the New York Stock Exchange (NYSE) from January 2009 to December 2010 and study monthly stock returns. These data are available from CRSP (the Center for Research in Security Prices) as well as many other data sources. Following many empirical studies, a stock is included if its returns in the current month and the previous 60 months are available, and we exclude firms with negative book-to-market equity (using information from Compustat). In practice, the cross-sectional regression model (66) is usually estimated by the least squares method. On the other hand, accumulated empirical evidence in finance indicates that stock returns are not normally distributed; in fact, it is well known that return distributions are heavy-tailed. It is therefore important to consider estimation procedures with good properties in the absence of Gaussianity.
We estimate the cross-sectional regression model (66) using four methods: the traditional OLS regression, the LAD estimation, a simple equally weighted quantile averaging estimation (denoted by QAU), and the optimally weighted quantile averaging estimation (denoted by OWQAE). We use k = 9 quantiles, 0.1, 0.2, …, 0.9, for the quantile combination. For the purpose of comparison, we evaluate these estimators through their out-of-sample predictions: we estimate model (66) on the cross-sectional data for each month of 2009, and then use the estimated coefficients λ̂0 and λ̂1 to forecast returns for the corresponding month of 2010. We compare both the mean squared prediction error (MSE) and the mean absolute prediction error (MAD). In particular, for each month we calculate

MSEt = (1/Nt) Σi (Ri,t − R̂i,t)²,  MADt = (1/Nt) Σi |Ri,t − R̂i,t|,

where Nt is the number of stocks in month t, and then average these monthly errors over all months (a computational sketch is given after Table A1). Table A1 below summarizes the results.
Table A1.
 | OLS | LAD | QAU | OWQAE
---|---|---|---|---
MAD | 51.39 | 46.94 | 48.35 | 44.78 |
MSE | 10.50 | 8.89 | 9.45 | 7.95 |
Numbers in this table are multiplied by 500 for convenience.
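A minimal sketch of this out-of-sample evaluation (Python; the array names and helper are ours, not from the paper):

```python
import numpy as np

def prediction_errors(actual_by_month, predicted_by_month):
    """Average the monthly out-of-sample MSE and MAD over all months.

    actual_by_month, predicted_by_month: lists with one array per month
    of 2010, holding the actual and predicted (excess) returns.
    """
    mse_t = [np.mean((a - p) ** 2) for a, p in zip(actual_by_month, predicted_by_month)]
    mad_t = [np.mean(np.abs(a - p)) for a, p in zip(actual_by_month, predicted_by_month)]
    return np.mean(mse_t), np.mean(mad_t)
```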
Model (66) is the basic regression model that characterizes the risk premiums. We next consider an extension of model (66) which adds a conditional heteroscedastic effect of capitalization (the “size” effect). In particular, we consider the following analogue of (20):

Ri,t = λ0 + λ1βim,t−1 + σi,tui,t,  (67)

where σi,t = γCapi,t and Capi,t is the market capitalization of firm i. Again, we estimate the cross-sectional regression model (67) using the four estimation methods mentioned before: the two-stage weighted least squares (WLS), the two-stage weighted least absolute deviation (WLAD), and the QAU and OWQAE based on quantiles 0.1, 0.2, …, 0.9 (a sketch of the WLS weighting step is given after Table A2). Table A2 below reports the mean squared prediction errors (MSE) and mean absolute prediction errors (MAD), calculated in the same way as in Table A1.
Table A2.
 | WLS | WLAD | QAU | OWQAE
---|---|---|---|---
MAD | 51.01 | 46.80 | 48.30 | 44.09 |
MSE | 10.40 | 8.86 | 9.41 | 7.90 |
Numbers in this table are multiplied by 500 for convenience.
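Since σi,t = γCapi,t is known up to the constant γ, weighting by the inverse capitalization restores homoscedasticity; a minimal sketch of this WLS step (Python; hypothetical names, not the authors' code):

```python
import numpy as np

def weighted_ls(y, x, cap):
    """Two-stage WLS for model (67) with sigma_{i,t} = gamma * Cap_{i,t}.

    Dividing the regression through by Cap_{i,t} makes the errors
    homoscedastic (the unknown constant gamma does not affect the
    coefficients), after which ordinary LS applies.
    """
    w = 1.0 / np.asarray(cap)
    X = np.column_stack([np.ones_like(x), x])          # intercept and beta
    coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return coef  # (lambda0_hat, lambda1_hat)
```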
The empirical results in Tables A1–A2 indicate that the least squares based estimations are less efficient than the other methods; in particular, the proposed OWQAE delivers the smallest prediction errors among the four methods.
10 Further Discussions
We propose a general method of combining quantile regression information to improve the efficiency of regression estimators. The proposed method is simple, and more efficient regression estimators can be constructed from a relatively small number of quantiles. The method has a wide range of applicability and can potentially be applied to many other models. We briefly discuss a few interesting directions for applying our approach, without giving full details.
The first direction is efficient estimation for the varying-coefficient model:

Y = α(U) + XTβ(U) + ε,  (68)

where α(·) is the functional intercept and β(·) is the p-dimensional column vector of functional coefficients. If ε is independent of (X, U), the conditional τ-th quantile of Y given (X, U) is

QY(τ|X, U) = ατ(U) + XTβ(U), where ατ(U) = α(U) + Qε(τ).
A useful application is the varying-coefficient longitudinal model, where we have longitudinal measurements from multiple subjects. Wang, Zhu and Zhou (2009) studied quantile regression for a partially linear varying-coefficient longitudinal model. In their work, the coefficients depend on the quantile, and they estimated the coefficients at each quantile without combining information across quantiles. We will explore this further in a future paper.
A second direction is volatility estimation in time series. In financial econometrics, volatility plays an important role in asset pricing and risk management. Here we briefly discuss volatility estimation for both parametric and nonparametric ARCH models.
(Nonparametric volatility) Consider the nonparametric ARCH(1) model Xt = σ(Xt−1)εt. Let Qε²(τ) be the τ-th quantile of ε²t, and QX²(τ|Xt−1) the conditional τ-th quantile of X²t given Xt−1. Then QX²(τ|Xt−1) = σ²(Xt−1)Qε²(τ), and hence QX²(τ|x)/QX²(τ|0) = σ²(x)/σ²(0) for all τ. Given estimates of QX²(τj|x), we can construct efficient estimators of σ²(x)/σ²(0) by combining the ratios Q̂X²(τj|x)/Q̂X²(τj|0), j = 1, …, k.
(Parametric volatility) Consider the parametric ARCH(p) model: Xt = σtεt with σ²t = β0 + β1X²t−1 + ⋯ + βpX²t−p. Let ℐt−1 be the information up to time t − 1. Denote by Qε²(τ) the τ-th quantile of ε²t, and by QX²(τ|ℐt−1) the conditional τ-th quantile of X²t given ℐt−1. Then

QX²(τ|ℐt−1) = σ²tQε²(τ) = β0(τ) + β1(τ)X²t−1 + ⋯ + βp(τ)X²t−p, where βj(τ) = βjQε²(τ).

Therefore, we can apply quantile regression at quantile τ to obtain consistent estimates β̂j(τ) of βj(τ). Note that βj(τ)/β0(τ) = βj/β0 for all τ. Therefore, β̂j(τ)/β̂0(τ) consistently estimates βj/β0 for each j = 1, …, p, and we can construct efficient estimators of (β1/β0, …, βp/β0) by combining quantiles τ1, …, τk (a computational sketch is given at the end of this section).
Similar ideas also apply to generalized ARCH models. Since substantial work is needed here, we will explore further in a separate project.
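As a concrete illustration of the parametric strategy above, the following sketch (Python, using the QuantReg class from statsmodels; the function and the equal weighting are ours, with equal weights standing in for the optimal weights) averages the quantile-ratio estimates of βj/β0:

```python
import numpy as np
import statsmodels.api as sm

def arch_ratio_estimates(x, p=1, taus=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Estimate beta_j/beta_0 in an ARCH(p) model by combining quantiles.

    At each quantile tau, regress X_t^2 on its lagged squares; since
    beta_j(tau)/beta_0(tau) = beta_j/beta_0 is free of tau, the ratios
    can be combined across quantiles (equal weights here).
    """
    x2 = np.asarray(x) ** 2
    y = x2[p:]
    lags = np.column_stack([x2[p - j:-j] for j in range(1, p + 1)])
    Z = sm.add_constant(lags)                  # intercept column is beta_0(tau)
    params = [sm.QuantReg(y, Z).fit(q=tau).params for tau in taus]
    ratios = [b[1:] / b[0] for b in params]    # beta_j(tau)/beta_0(tau)
    return np.mean(ratios, axis=0)
```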
Appendix: Proofs
Proof of Theorem 1. By the ergodicity in Assumption 1(ii) and (5),
(69)
Recall the definition of Σβ in Theorem 1. Then it can easily be verified that
(70)
By Assumption 3 and (69)–(70),
(71)
Therefore,
(72)
where
(73)
By the Cramér-Wold device, it suffices to consider the case that ṁ(Xt, β) is scalar-valued. Let ℱt be the σ-algebra generated by {Xt+1, Xt, …; εt, εt−1, …}. By property (P1) in Section 2, {{ṁ(Xt, β) − 𝔼[ṁ(X, β)]}dt}t∈ℤ form martingale differences with respect to {ℱt}t∈ℤ. Using cov(τ − 1{εt ≤ Qε(τ)}, τ′ − 1{εt ≤ Qε(τ′)}) = min(τ, τ′) − ττ′, we have 𝔼(d²t) = S(ω), with S(ω) defined in (13). Since εt is independent of ℱt−1, by (5),
(74)
Since k is fixed, by Assumption 2, dt is bounded. Thus, the assumption ṁ(Xt, β) ∈ ℒ2 ensures the Lindeberg condition. The result then follows from the martingale CLT.
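For completeness, the covariance identity invoked in this proof can be verified directly (a short calculation in our notation, using the continuity of Fε):

```latex
\operatorname{cov}\bigl(\tau - \mathbf{1}\{\varepsilon_t \le Q_\varepsilon(\tau)\},\,
  \tau' - \mathbf{1}\{\varepsilon_t \le Q_\varepsilon(\tau')\}\bigr)
= \mathbb{E}\bigl[\mathbf{1}\{\varepsilon_t \le Q_\varepsilon(\tau)\}\,
  \mathbf{1}\{\varepsilon_t \le Q_\varepsilon(\tau')\}\bigr] - \tau\tau'
= F_\varepsilon\bigl(Q_\varepsilon(\min(\tau, \tau'))\bigr) - \tau\tau'
= \min(\tau, \tau') - \tau\tau'.
```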
Proof of Theorem 2. The optimal weight follows from the Lagrange multiplier method. The asymptotic normality follows from Theorem 1 and .
Proof of Theorem 3. Recall dt in (72)–(73). Since k is fixed, by Slutsky’s theorem, the asymptotic distribution is unchanged if we replace ωj therein by any ω̃j such that ω̃j = ωj + op(1). Thus, the result follows.
Proof of Theorem 4. Define the vectors
Then
Since θ̂(τ) minimizes the criterion function in (22), the reparametrized parameter minimizes the loss function
Suppose we can establish the quadratic approximation
(75)
where
(76)
Then the convexity lemma in Pollard (1991) gives δ̂ = δ* + op(1), where
The desired result then follows by using block matrix inverse of .
It remains to prove (75). In view of L(δ), define
It suffices to prove L̃(δ) = L*(δ) + op(1) and L(δ) = L̃(δ) + op(1).
First, we prove L̃(δ) = L*(δ) + op(1). Using ρτ (cz) = cρτ (z) for c > 0, we can rewrite
Applying Knight (1998)’s identity

ρτ(u − v) − ρτ(u) = −v(τ − 1{u < 0}) + ∫_0^v (1{u ≤ s} − 1{u ≤ 0}) ds,  (77)
we can obtain
(78)
where 𝒢t is the σ-algebra generated by {(Xt+1,Ut+1), (Xt,Ut), …; εt, εt−1, …}, and
By Assumption 5, εt is independent of 𝒢t−1. Thus,
(79)
By Assumption 6(i), there exists some constant c such that
(80)
Thus, from (79) and Taylor’s expansion Fε(s + Qε(τ)) − Fε(Qε(τ)) = sfε(Qε(τ)) + o(s),
(81)
where the convergence follows from the ergodicity and (5). Since {ξt − 𝔼(ξt|𝒢t−1)}t∈ℤ are martingale differences with respect to {𝒢t−1}t∈ℤ, by their orthogonality,
(82)
From (80), we have , which combined with (82) gives . Thus, by (78) and (81), we have L̃(δ) = L*(δ) + op(1).
Next, we prove the approximation L(δ) = L̃(δ) + op(1). Let
Then it is easy to see that
(83)
By the same argument leading to the quadratic approximation L̃(δ) = L*(δ)+op(1) above, we can show that each element of has a quadratic approximation of the order Op(1). Thus, N1 = Op(‖γ̃ − γ‖) = op(1). For N2, by Assumption 6(i)–(ii), and , which give N2 = op(1). Thus, we conclude that L(δ) = L̃(δ) + op(1), completing the proof.
Proof of Theorem 5. The asymptotic normality follows from the Bahadur representation in Theorem 4 and the same martingale CLT argument as in Theorem 1. The optimal weight follows from the Lagrange multiplier method.
Proof of Theorem 6. Write Kt = K{(Xt − x)/h}. For IID data, Kai, Li and Zou (2010) have shown the following asymptotic representation:
Examining their argument and using properties (P1) and (P2) in Section 2, we see that the asymptotic representation also holds under Assumption 1. Therefore, by (30),
where dt is defined in (73). The desired asymptotic normality then follows from the same martingale CLT argument in Theorem 1.
It is easy to see that, under the symmetric density assumption, the optimal weight ω* in (14) automatically satisfies the symmetric weight constraint in (29).
Proof of Theorem 7. This follows from the same argument as in the proof of Theorem 3.
Proof of Theorem 8. Recall τj = j/(k + 1). Define k × k matrices Γ and P:
(84)
Here “diag” denotes a diagonal matrix. By direct matrix multiplication, we can verify
(85)
with 2(k + 1) on the diagonal, −(k + 1) on the super-/sub-diagonals, and 0 elsewhere.
(i) We can rewrite Wk as a sum of integrals over the intervals [τj−1, τj]. For t, s ∈ [τj−1, τj], the relevant increments are uniformly of order 1/(k + 1); thus, applying the Cauchy–Schwarz inequality on each interval, we can obtain the desired bound on Wk.

(ii) Using H = P−1ΓP−1, we can write Λk = qTH−1q = (Pq)TΓ−1(Pq). Note that Pq = [h(τ1), …, h(τk)]T with h(t) = Qε(t)fε(Qε(t)). Using (85) and the argument in (i) above, we can easily obtain the desired result.
Proof of Proposition 1. Let u = Qε(τ) so that τ = Fε(u). Since fε has support ℝ, u → −∞ as τ → 0. Recall g(τ) = fε(Qε(τ)). By the chain rule, we can show g′′(τ) = [∂² log fε(u)/∂u²]/fε(u). Then one can easily show that (46) is equivalent to
(87)
For example, limτ→0 g²(τ)/τ = 0 if and only if , and limτ→0 g²(1 − τ)/τ = 0 if and only if . It remains to show that (87) implies (45).
Let ε > 0 be any given number. By (87), there exists such that and for all . Fix . By Assumption 2, there exists c < ∞ such that |g′′(τ)| < c for . Let .
Then . For τ < τ*, applying , we have
completing the proof.
Proof of Theorem 10. (i) For S(ω) in (13), Rk = S((1/k, …, 1/k)T). By the uniqueness of the minimizer ω* of S(ω) [see (14)], S(ω*) ≤ Rk, with equality if and only if ω* = (1/k, …, 1/k)T. Let g(τ) = fε(Qε(τ)), and for convenience write g(τ0) = g(τk+1) = 0. For the optimal weight ω* in (14), by H−1 = PΓ−1P and (85), we can show
(88)
where Ωk is defined in (15). Note that, for j = ⌊(k + 1)τ⌋ with τ ∈ (0, 1), limk→∞(k + 1)²[2g(τj) − g(τj−1) − g(τj+1)]g(τj) = −g′′(τ)g(τ). Thus, as k → ∞, ω*j = 1/k for all j implies g′′(τ)g(τ) = −c, τ ∈ (0, 1), for some c > 0. Define the transformation u = Qε(τ). By the chain rule, g′′(τ)g(τ) = −c is equivalent to {log fε(u)}′′ = −c. Thus, fε must be a normal density.
(ii) See the proof in Kai, Li and Zou (2010).
Proof of Corollary 2. As in the proof of Theorem 1, we use the martingale CLT and consider scalar-valued ṁ(Xt, β). Similar to dt in (73), with the optimal weight ω*, define
By Theorem 9, S(ω*) = 1/Ωkn → 1/ℐ(fε). Thus, the convergence of the conditional variances follows from the argument in (74). It remains to verify the Lindeberg condition.
By Assumption 2, g(τ) = fε(Qε(τ)) is bounded on τ ∈ (0, 1). Thus, from (88),
for some constant c1. For any given c2 > 0,
(89)
Note that, for any random variable U ∈ ℒq, q > 2, and constant c > 0,
(90)
Applying (90) to (89) and using kn = O[(log n)ε], we have the Lindeberg condition.
Proof of Theorem 11. (i) It follows from (86) and the proof of Theorem 8. (ii) From limτ→0[g²(τ) + g²(1 − τ)]/τ = ∞ and (86), 1/S(ω*) ≥ (k + 1)[g²(τ1) + g²(τk)] → ∞.
Footnotes
We thank Roger Koenker, Steve Portnoy, Victor Chernozhukov, two referees, and seminar participants at the University of Illinois for their very helpful comments. Zhao’s research was partially supported by NIDA grant P50-DA10075-15. Xiao thanks Boston College for research support. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIDA or the NIH.
Note that different forms of the location-scale model can be studied similarly to our analysis in this section, and optimally weighted quantile averaging estimators can be constructed (but the construction of optimal weights will be different, depending on the specific form of the regression model).
Contributor Information
Zhibiao Zhao, Email: zuz13@stat.psu.edu, Department of Statistics, Penn State University, University Park, PA 16802.
Zhijie Xiao, Email: zhijie.xiao@bc.edu, Department of Economics, Boston College, Chestnut Hill, MA 02467.
References
- Akahira M, Takeuchi K. Non-Regular Statistical Estimation. New York: Springer; 1995.
- Beran R. Asymptotically efficient adaptive rank estimates in location models. Annals of Statistics. 1974;2:248–266.
- Bickel P. On adaptive estimation. Annals of Statistics. 1982;10:647–671.
- Black F. Capital market equilibrium with restricted borrowing. Journal of Business. 1972;45:444–454.
- Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society, Series B. 2011;73:325–349.
- Chamberlain G. Quantile regression, censoring and the structure of wages. Advances in Econometrics, Sixth World Congress. 1994:171–210.
- Chen X, Linton O, Jacho-Chavez D. Averaging of moment condition estimators. Working paper. 2011.
- Engle RF. Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica. 1982;50:987–1007.
- Fan J, Farmen M, Gijbels I. Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society, Series B. 1998;60:591–608.
- Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. London: Chapman and Hall; 1996.
- He X, Shao QM. A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. Annals of Statistics. 1996;24:2608–2630.
- Jurečková J, Procházka B. Regression quantiles and trimmed least squares estimator in nonlinear regression model. Journal of Nonparametric Statistics. 1994;3:201–222.
- Kai B, Li R, Zou H. Local composite quantile regression smoothing: an efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B. 2010;72:49–69.
- Knight K. Limiting distributions for L1 regression estimators under general conditions. Annals of Statistics. 1998;26:755–770.
- Koenker R. A note on L-estimates for linear models. Statistics & Probability Letters. 1984;2:323–325.
- Koenker R. Quantile Regression. New York: Cambridge University Press; 2005.
- Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46:33–49.
- Koenker R, Zhao Q. L-estimation for linear heteroscedastic models. Journal of Nonparametric Statistics. 1994;3:223–235.
- Koenker R, Zhao Q. Conditional quantile estimation and inference for ARCH models. Econometric Theory. 1996;12:793–813.
- Pollard D. Asymptotics for least absolute deviation regression estimators. Econometric Theory. 1991;7:186–199.
- Portnoy S, Koenker R. Adaptive L-estimation of linear models. Annals of Statistics. 1989;17:362–381.
- Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association. 1995;90:1257–1270.
- Sharpe WF. Capital asset prices: a theory of market equilibrium under conditions of risk. Journal of Finance. 1964;19:425–442.
- Silverman BW. Density Estimation. London: Chapman and Hall; 1986.
- Stone C. Adaptive maximum likelihood estimation of a location parameter. Annals of Statistics. 1975;3:267–284.
- Stout WF. Almost Sure Convergence. New York: Academic Press; 1974.
- Wang H, Zhu Z, Zhou J. Quantile regression in partially linear varying coefficient models. Annals of Statistics. 2009;37:3841–3866.
- Xiao Z, Koenker R. Conditional quantile estimation and inference for GARCH models. Journal of the American Statistical Association. 2009;104:1696–1712.
- Yu K, Jones MC. Local linear quantile regression. Journal of the American Statistical Association. 1998;93:228–237.
- Zhao Q. Asymptotically efficient median regression in the presence of heteroskedasticity of unknown form. Econometric Theory. 2001;17:765–784.
- Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. Annals of Statistics. 2008;36:1108–1126.