Published in final edited form as: J Econom. 2023 Mar 24;249(Pt A):105426. doi: 10.1016/j.jeconom.2023.01.028

Feature-splitting Algorithms for Ultrahigh Dimensional Quantile Regression*

Jiawei Wen, Songshan Yang, Christina Dan Wang, Yifan Jiang, Runze Li

Abstract

This paper is concerned with computational issues related to penalized quantile regression (PQR) with ultrahigh dimensional predictors. Various algorithms have been developed for PQR, but they become ineffective and/or infeasible in the presence of ultrahigh dimensional predictors due to storage and scalability limitations. The variable updating scheme of the feature-splitting algorithm that directly applies the ordinary alternating direction method of multipliers (ADMM) to ultrahigh dimensional PQR may make the algorithm fail to converge. To tackle this hurdle, we propose an efficient and parallelizable algorithm for ultrahigh dimensional PQR based on the three-block ADMM. The compatibility of the proposed algorithm with parallel computing alleviates the storage and scalability limitations of a single machine in large-scale data processing. We establish the rate of convergence of the newly proposed algorithm. In addition, Monte Carlo simulations are conducted to compare the finite sample performance of the proposed algorithm with that of other existing algorithms. The numerical comparison implies that the proposed algorithm significantly outperforms the existing ones. We further illustrate the proposed algorithm via an empirical analysis of a real-world data set.

Keywords: ADMM, Penalized quantile regression, Parallel computing, Sample-splitting algorithm

1. Introduction

Quantile regression (QR) is well acknowledged as a powerful tool for analyzing data with heterogeneous effects. Since the seminal work of Koenker and Bassett (1978), QR has been extensively applied in many research fields, in particular in econometrics. For a complete review of QR, refer to Koenker (2017) and Koenker et al. (2017). Many recent advances and achievements of QR can be found in the literature. Wang and He (2022) provided a unified theory for high-dimensional quantile regression with both convex and nonconvex penalties. Gimenes and Guerre (2022) proposed a QR inference framework for first-price auctions, and Cai et al. (2022) reexamined the heterogeneous predictability of US stock returns at different quantile levels. Other recent studies of QR include, but are not limited to, D’Haultfœuille et al. (2018), Altunbaş and Thornton (2019), Giessing and He (2019), Gu and Volgushev (2019), Firpo et al. (2022), He et al. (2022), and Narisetty and Koenker (2022).

For variable selection in QR, penalized quantile regression (PQR) has been developed with fixed and finite dimensional predictors in Li and Zhu (2008) and Wu and Liu (2009). Furthermore, PQR with high-dimensional predictors has also been studied in the statistical literature, since, with the advent of data science, high-dimensional data analysis has become one of the most important research topics of the last decade. Belloni and Chernozhukov (2011) derived an error bound for PQR with the Lasso penalty ($\ell_1$-QR for short). Wang et al. (2012) studied PQR with folded concave penalties such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), and further established the oracle property for PQR in the ultrahigh dimension setting under mild conditions. In summary, estimation and theory of PQR are well studied and understood in the literature.

The numerical minimization problem for searching solutions to PQR, however, is challenging due to the nonsmooth objective function and the possible nonconvexity of the folded concave penalty. Sherwood and Maidman (2017) developed an R package rqPen for $\ell_1$-QR, and the algorithm is similar to that introduced in Koenker and Mizera (2014). Peng and Wang (2015) developed an iterative coordinate descent algorithm (QICD) for solving PQR with nonconvex penalties. Gu et al. (2018) introduced a fast alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) for PQR in high dimensions.

With the advent of big data, it is of crucial importance to study numerical algorithms for PQR in ultrahigh dimension and/or with a large data size. The ADMM (Boyd et al., 2011) has been introduced to cope with PQR with a large data size. Yu et al. (2017) and Fan et al. (2021) developed parallel algorithms for PQR based on sample-splitting ADMM. Sample-splitting means, as the name suggests, that the algorithm partitions the data across samples. Ultrahigh dimensionality adds another challenge in minimizing the objective function of ultrahigh dimensional PQR. This work aims to tackle the simultaneous challenges of nonsmoothness, nonconvexity and ultrahigh dimensionality by developing feature-splitting algorithms for PQR.

In this paper, we propose an efficient and parallelizable algorithm for PQR in ultrahigh dimension based on the three-block ADMM. It is noteworthy that Yu and Lin (2017) briefly mentioned one direct extension of the feature-splitting ADMM for PQR without theoretical justification or numerical studies. The variable update scheme in Yu and Lin (2017) makes the convergence of the algorithm uncertain. Chen et al. (2016) showed that Gauss-Seidel multi-block ADMM is not necessarily convergent. For more detailed discussion on this, see Section 2. The uncertain convergence motivates us to avoid the direct extension of the feature-splitting ADMM, and instead to develop a three-block ADMM algorithm for ultrahigh dimensional PQR. Using techniques related to Sun et al. (2015), we establish the rate of convergence of the proposed algorithm and its theoretical convergence guarantee, thereby addressing the convergence uncertainty. The compatibility of the proposed three-block ADMM algorithm with parallel computing alleviates the storage and scalability limitations of a single machine in large-scale data processing. The proposed three-block ADMM algorithms also enjoy numerical efficiency over the directly extended two-block ADMM. It is worth noting that the newly proposed algorithms can be directly applied to PQR with various penalties, including the $\ell_1$, SCAD and MCP penalties, via local linear approximation of the penalties (Zou and Li, 2008). Based on theories developed in Wang et al. (2013) and Fan et al. (2014), the proposed algorithms are able to obtain a PQR estimate with the strong oracle property in ultrahigh dimensions.

The rest of this article is organized as follows. In Section 2, we present the computational framework based on the three-block ADMM for PQR and establish the linear rate of convergence of the algorithm. In Section 3, we demonstrate the numerical and statistical efficiency of the proposed framework in high and ultra-high dimensional settings through Monte Carlo simulation, and illustrate the proposed algorithm via an empirical analysis of a Chinese supermarket data set. Technical proofs are given in the Appendix.

Throughout the paper, we adopt the following notation. For a matrix $M = (m_{ij})_{s\times t}$, denote $\|M\|_{\max} = \max_{(i,j)}|m_{ij}|$, $\|M\|_{\min} = \min_{(i,j)}|m_{ij}|$, and let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ be the smallest and largest eigenvalues of $M$, respectively. $X_{\mathcal{A}}$ denotes the sub-matrix of $X$ with columns indexed by $\mathcal{A}$. $M \succ 0$ indicates that $M$ is positive definite. For a positive semidefinite operator or matrix $M$, $\|x\|_M^2 = x^T M x$.

2. Feature-splitting Algorithms for PQR

Suppose that $\{x_i, y_i\}$, $i = 1, \ldots, n$, is a random sample from the linear regression model

\[
y_i = x_i^T\beta + \varepsilon_i,
\]

where $\beta$ is a $p$-dimensional vector of regression coefficients, and $\varepsilon_i$ is a random error with $E(\varepsilon_i \mid x_i) = 0$. In this paper, we are interested in solving QR in the ultrahigh dimensional regime, in which $p \gg n$. Define $y = (y_1, \ldots, y_n)^T$ as the response vector, and $X = (x_1, \ldots, x_n)^T$ as the corresponding design matrix. For a given $\tau \in (0,1)$, the quantile level of interest, define the loss function $\rho_\tau(z) = z[\tau - I(z < 0)] = \tau(z)_+ + (1-\tau)(z)_-$, where $I(\cdot)$ is the indicator function, $(z)_+ = \max\{0, z\}$, and $(z)_- = (-z)_+$. QR minimizes the objective function

\[
L(y - X\beta) = \frac{1}{n}\sum_{i=1}^{n}\rho_\tau(y_i - x_i^T\beta) \tag{1}
\]

with respect to β, and this leads to the QR estimate of β. The minimization problem in QR can be reformulated as a linear programming problem. The Frisch-Newton algorithm can be applied to solve the minimization problem with computational complexity growing as a cubic function of p when p < n.
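As a quick illustration (not part of the original paper), the check loss admits two equivalent forms, which can be verified directly in R:

```r
# quantile check loss: rho_tau(z) = z * (tau - I(z < 0))
rho_tau  <- function(z, tau) z * (tau - (z < 0))
# equivalent form: tau * (z)_+ + (1 - tau) * (-z)_+
rho_tau2 <- function(z, tau) tau * pmax(z, 0) + (1 - tau) * pmax(-z, 0)

z <- c(-2, -0.5, 0, 1, 3)
all.equal(rho_tau(z, 0.3), rho_tau2(z, 0.3))  # TRUE
```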

2.1. Penalized quantile regression

In the presence of ultrahigh dimensional predictors, it is common to impose a sparsity assumption on $\beta$. That is, only a small portion of the elements in $\beta$ are nonzero. This implies that only a small portion of the predictors are significant in the model. Thus, it is critical to identify the significant predictors in QR in ultrahigh dimension. Variable selection in QR is similar to that in linear regression, for which penalized least squares methods have been proposed, so it is natural to adopt a penalization approach for variable selection in QR. PQR minimizes the penalized quantile loss function

\[
Q(\beta) = L(y - X\beta) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \tag{2}
\]

where $p_{\lambda_j}(\cdot)$ is a penalty function with a regularization parameter $\lambda_j$ that controls model complexity. The algorithms to be developed in this paper allow different regression coefficients to have different penalties, although it is common to take all $p_{\lambda_j}(\cdot)$ to be the same, denoted by $p_\lambda(\cdot)$. This paper concentrates on the two most commonly used penalties: the Lasso (i.e., $\ell_1$) penalty $p_\lambda(|\beta|) = \lambda|\beta|$ and the SCAD penalty, whose first derivative is defined as

\[
p'_\lambda(|\beta|) = \lambda\left\{ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda}\, I(|\beta| > \lambda) \right\} \tag{3}
\]

with $p'_\lambda(0) := p'_\lambda(0+) = \lambda$ and $a = 3.7$ as suggested in Fan and Li (2001). The proposed algorithms are directly applicable to other folded concave penalties (Fan et al., 2020).

Minimizing the objective function of PQR in (2) is challenging since both the loss function and the penalty function are nonsmooth. When a folded concave penalty such as the SCAD penalty is used in ultrahigh dimensional PQR, the minimization problem becomes even more challenging due to its nonconvexity and ultrahigh dimensionality. It is noteworthy that PQR with the $\ell_1$ penalty is a convex minimization problem and, when $p \le n$, it has a unique minimizer. For PQR with a folded concave penalty, minimizing the objective function in (2) may be achieved by iteratively minimizing PQR with a reweighted $\ell_1$ penalty with the aid of the local linear approximation (LLA) to the penalty function. Specifically, given $\beta^k = (\beta_1^k, \ldots, \beta_p^k)^T$ updated from the $k$-th step in the course of the iterations, we first approximate

\[
p_\lambda(|\beta_j|) \approx q_\lambda(|\beta_j|; |\beta_j^k|) = p_\lambda(|\beta_j^k|) + p'_\lambda(|\beta_j^k|)\big(|\beta_j| - |\beta_j^k|\big), \tag{4}
\]

which is referred to as the LLA. Then at the (k+1)-th step we minimize

\[
Q^{k+1}(\beta) = L(y - X\beta) + \lambda\sum_{j=1}^{p}\lambda^{-1}p'_\lambda(|\beta_j^k|)\,|\beta_j| = L(y - X\beta) + \lambda\sum_{j=1}^{p}\alpha_j|\beta_j|, \tag{5}
\]

where $\alpha_j = \lambda^{-1}p'_\lambda(|\beta_j^k|) \ge 0$. The function in (5) is the objective function of PQR with a reweighted $\ell_1$ penalty, with weights $\alpha_j$ updated at every step.

The LLA was first proposed in Zou and Li (2008) for penalized likelihood with finite dimensional predictors, and further adopted in Wang et al. (2013) and Fan et al. (2014) for penalized least squares in ultrahigh dimensional linear regression models. Note that if we set the initial value $\beta^0 = 0$, then $\beta^1$ is the PQR-Lasso estimator, defined as the PQR estimator with the $\ell_1$ penalty. Then $\beta^2$ can be regarded as the one-step sparse estimator with the PQR-Lasso estimator as the initial value. With properly chosen tuning parameters, Wang et al. (2013) and Fan et al. (2014) showed that, under some regularity conditions on high dimensional linear models, the corresponding penalized least squares estimator $\beta^2$ enjoys the strong oracle property with probability tending to one. This motivates us to focus on developing feature-splitting algorithms for PQR with a weighted $\ell_1$ penalty.
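As an illustration (not code from the paper), the SCAD derivative in (3) and the LLA weights $\alpha_j = \lambda^{-1}p'_\lambda(|\beta_j^k|)$ used in (5) and (6) take only a few lines of R:

```r
# SCAD first derivative p'_lambda(t) from (3), for t >= 0, with a = 3.7
scad_deriv <- function(t, lambda, a = 3.7) {
  lambda * ifelse(t <= lambda, 1, pmax(a * lambda - t, 0) / ((a - 1) * lambda))
}

# LLA weights alpha_j = p'_lambda(|beta_j^k|) / lambda;
# with beta^k = 0 all weights equal 1, i.e., the PQR-Lasso problem
lla_weights <- function(beta_k, lambda, a = 3.7) {
  scad_deriv(abs(beta_k), lambda, a) / lambda
}
```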

2.2. Three-block ADMM

Define the PQR estimator with the weighted $\ell_1$ penalty as

\[
\hat\beta = \arg\min_\beta L(y - X\beta) + \lambda\|\alpha\circ\beta\|_1, \tag{6}
\]

where $\|\alpha\circ\beta\|_1 = \sum_{j=1}^p|\alpha_j\beta_j|$, with $\alpha$ being the weight vector and $\circ$ the elementwise product.

The non-smoothness of the objective function in (6) hinders an efficient application of gradient-based methods. To decouple the non-smooth parts in computation, we recast problem (6) as the following constrained optimization problem,

\[
\min_{\beta, z}\ L(z) + \lambda\|\alpha\circ\beta\|_1, \quad \text{s.t.}\quad z + X\beta = y. \tag{7}
\]

Problem (7) is a natural candidate for the classical two-block ADMM algorithm. Define the augmented Lagrangian function as

\[
\Phi(\beta, z; \gamma) = L(z) + \lambda\|\alpha\circ\beta\|_1 + \langle\gamma,\ z + X\beta - y\rangle + \frac{\phi}{2}\|z + X\beta - y\|_2^2, \tag{8}
\]

where $\gamma \in \mathbb{R}^n$ is the Lagrangian multiplier, and $\phi > 0$ is the parameter associated with the quadratic term. The classic iterative scheme at iteration $k$ for the two-block ADMM is

\[
\begin{aligned}
\beta^{k+1} &= \arg\min_\beta \Phi(\beta, z^k; \gamma^k),\\
z^{k+1} &= \arg\min_z \Phi(\beta^{k+1}, z; \gamma^k),\\
\gamma^{k+1} &= \gamma^k + \theta\phi\,(z^{k+1} + X\beta^{k+1} - y),
\end{aligned}
\]

where $\theta$ is a tuning parameter controlling the step size. The effect of the tuning parameter $\theta$ on the convergence of the algorithm has been discussed in the literature (Fortin and Glowinski, 2000; Fazel et al., 2013), where convergence is established when $\theta$ is constrained to $(0, (1+\sqrt5)/2)$. In our numerical experiments, we set $\theta = 1.618$, which is slightly less than $(1+\sqrt5)/2$, for faster convergence. Gu et al. (2018) proposed an efficient algorithm (qradmm) to solve PQR based on the two-block ADMM algorithm. While qradmm performs very well for moderate dimensions, we found that it can still run out of memory for larger $p$ in our numerical study. This motivates us to split the high dimensional variable into smaller blocks and speed up the updates through parallelization.

We next propose a new three-block semi-proximal ADMM framework that enables a parallel update of $\beta$ to cope with the ultrahigh dimensionality. The major computational cost of the two-block ADMM for solving (7) comes from the $\beta$ update, which takes up to $O(np)$ operations and may impede an efficient execution of the algorithm with ultrahigh dimension $p$. This calls for a feature-splitting algorithm for PQR in ultrahigh dimension. For a pre-specified $G$, let us partition $X$ and $\beta$ as follows,

\[
X = (X_1, \ldots, X_G), \qquad \beta = (\beta_1^T, \beta_2^T, \ldots, \beta_G^T)^T, \qquad X\beta = \sum_{g=1}^G X_g\beta_g.
\]

Then problem (7) can be rewritten as a three-block optimization problem

\[
\min_{\beta, z, \omega}\ L(z) + \sum_{g=1}^G\lambda\|\alpha_g\circ\beta_g\|_1, \quad \text{s.t.}\quad X_1\beta_1 + z + \omega_2 + \cdots + \omega_G = y,\quad X_g\beta_g = \omega_g,\ g = 2, \ldots, G. \tag{9}
\]

Intuitively, slack variables ωg,g=2,,G store information of each local update βg. Each βg is updated independently and we view β=(β1T,β2T,,βGT)T as a single variable block in the algorithm. Likewise, all ωg together make up the third variable block. There may exist multiple ways to transform a problem into a form that ADMM can handle. For example, in formulation (9), the role of X1β1 is not special and Xgβg,g=1,,G are exchangeable. In this paper, we use formulation (9) to illustrate the computational framework.

The augmented Lagrangian function for (9) is given by

\[
\begin{aligned}
\Phi(\beta, z, \omega; \gamma) ={}& \frac{1}{n}\big[\tau\mathbf{1}^T(z)_+ + (1-\tau)\mathbf{1}^T(z)_-\big] + \lambda\sum_{g=1}^G\|\alpha_g\circ\beta_g\|_1\\
&+ \gamma_1^T\big(X_1\beta_1 + z + \omega_2 + \cdots + \omega_G - y\big) + \frac{\phi}{2}\big\|X_1\beta_1 + z + \omega_2 + \cdots + \omega_G - y\big\|_2^2\\
&+ \sum_{g=2}^G\gamma_g^T\big(X_g\beta_g - \omega_g\big) + \frac{\phi}{2}\sum_{g=2}^G\big\|X_g\beta_g - \omega_g\big\|_2^2. \tag{10}
\end{aligned}
\]

As seen from (10), each βg is decoupled in the quadratic term, which allows a natural parallelization for β updates. Two-block ADMM can be directly extended to solve (9), and the corresponding algorithm is referred to as Gauss-Seidel multi-block ADMM. At the kth iteration, it updates each variable with

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k)\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^k; \gamma^k)\\
\omega^{k+1} = \arg\min_\omega \Phi(\beta^{k+1}, z^{k+1}, \omega; \gamma^k)\\
\gamma_1^{k+1} = \gamma_1^k + \theta\phi\big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\big)\\
\gamma_g^{k+1} = \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G.
\end{cases}
\tag{11}
\]

Procedure (11) may perform well in practice. However, its theoretical convergence has remained unclear until the work by Chen et al. (2016), in which the authors showed that Gauss-Seidel multi-block ADMM is not necessarily convergent. To address the convergence uncertainty, Sun et al. (2015) proposed a symmetric Gauss-Seidel based semi-proximal ADMM (sGS-sPADMM) for convex programming problems, which enjoys both theoretical convergence guarantee and numerical efficiency over the directly extended multi-block ADMM. This convergent semi-proximal ADMM has three separable blocks in the objective function with the third part being linear and updates ω twice to improve convergence, but the extra step may incur additional computational cost.

Inspired by Sun et al. (2015), we now propose the three-block ADMM algorithm for solving PQR with the weighted $\ell_1$ penalty, using the following special iterative cycle ($\beta \to \omega \to z \to \omega$):

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{\mathcal{T}_f}^2\\
\omega^{k+\frac12} = \arg\min_\omega \Phi(\beta^{k+1}, z^k, \omega; \gamma^k)\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{\mathcal{T}_g}^2\\
\omega^{k+1} = \arg\min_\omega \Phi(\beta^{k+1}, z^{k+1}, \omega; \gamma^k)\\
\gamma_1^{k+1} = \gamma_1^k + \theta\phi\big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\big)\\
\gamma_g^{k+1} = \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G,
\end{cases}
\]

where 𝒯f and 𝒯g are some positive semidefinite matrices.

Given the augmented Lagrangian function defined in (10),

\[
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k)
\]

becomes

\[
\begin{aligned}
\beta_1^{k+1} &= \arg\min_{\beta_1\in\mathbb{R}^{p_1}}\ \lambda\|\alpha_1\circ\beta_1\|_1 + \frac{\phi}{2}\Big\|X_1\beta_1 + \sum_{g=2}^G\omega_g^k + z^k - y + \frac{\gamma_1^k}{\phi}\Big\|_2^2,\\
\beta_g^{k+1} &= \arg\min_{\beta_g\in\mathbb{R}^{p_g}}\ \lambda\|\alpha_g\circ\beta_g\|_1 + \frac{\phi}{2}\Big\|X_g\beta_g - \omega_g^k + \frac{\gamma_g^k}{\phi}\Big\|_2^2,\quad g = 2, \ldots, G. \tag{12}
\end{aligned}
\]

It can be seen that the $\beta$ subproblems are a series of weighted $\ell_1$-penalized least squares problems. If $p_g$ is too large, $X_g \in \mathbb{R}^{n\times p_g}$ may not have full column rank, and thus the generated sequences may not be well defined. This concern can be addressed with an additional general position condition (Koenker, 2017), which guarantees the existence of a unique QR solution under rather general conditions. Standard quadratic solvers can be applied to solve (12) efficiently. In our numerical studies, we use the R solver 'glmnet' to compute $\beta$ through the coordinate descent (CD) algorithm.
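For concreteness, a minimal hand-rolled cyclic coordinate descent for one $\beta_g$-subproblem in (12) is sketched below; the paper itself calls glmnet for this step, so the function and argument names here are illustrative only.

```r
# minimize lambda * sum(alpha * |b|) + (phi / 2) * ||A %*% b - r||_2^2 over b,
# which is the form of each beta_g-subproblem in (12) with a suitable residual
# vector r (e.g., r = y - sum_g omega_g^k - z^k - gamma_1^k / phi for g = 1)
cd_weighted_lasso <- function(A, r, alpha, lambda, phi,
                              b = rep(0, ncol(A)), n_sweeps = 50) {
  soft   <- function(x, t) sign(x) * pmax(abs(x) - t, 0)
  col_ss <- colSums(A^2)               # ||A_j||^2, precomputed once
  resid  <- r - A %*% b
  for (s in seq_len(n_sweeps)) {
    for (j in seq_along(b)) {
      resid <- resid + A[, j] * b[j]   # partial residual excluding feature j
      zj    <- phi * sum(A[, j] * resid)
      b[j]  <- soft(zj, lambda * alpha[j]) / (phi * col_ss[j])
      resid <- resid - A[, j] * b[j]
    }
  }
  b
}
```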

ωg,g=2,,G, and z are updated in the following cycle:

\[
\begin{aligned}
\omega_g^{k+\frac12} &= \frac{1}{G}\Big(y - z^k + GX_g\beta_g^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big),\\
z^{k+1} &= \Big(y - X_1\beta_1^{k+1} - \sum_{g=2}^G\omega_g^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi}\Big)_+ - \Big(-y + X_1\beta_1^{k+1} + \sum_{g=2}^G\omega_g^{k+\frac12} + \frac{\gamma_1^k}{\phi} + \frac{\tau - 1}{n\phi}\Big)_+,\\
\omega_g^{k+1} &= \frac{1}{G}\Big(y - z^{k+1} + GX_g\beta_g^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big), \tag{13}
\end{aligned}
\]

in which we perform an extra intermediate step to compute $\omega^{k+\frac12}$ before computing $z^{k+1}$. As seen from (13), the extra cost of updating $\omega$ is negligible. The derivations of the updates are given in Appendix A.1.
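The closed-form updates in (13) translate directly into vectorized R code; the sketch below is illustrative rather than the authors' implementation and assumes the fitted pieces $X_g\beta_g^{k+1}$ are supplied as a list.

```r
# omega_g half-step in (13); Xbeta is a list of X_g %*% beta_g, g = 1..G;
# element (g - 1) of the returned list is omega_g, g = 2,...,G
update_omega <- function(Xbeta, z, y) {
  G <- length(Xbeta)
  total <- Reduce(`+`, Xbeta)
  lapply(2:G, function(g) (y - z + G * Xbeta[[g]] - total) / G)
}

# z update in (13): elementwise prox of (1/n) * rho_tau at the working residual
update_z <- function(Xbeta1, omega, y, gamma1, phi, tau, n) {
  r <- y - Xbeta1 - Reduce(`+`, omega) - gamma1 / phi
  pmax(r - tau / (n * phi), 0) - pmax(-r + (tau - 1) / (n * phi), 0)
}
```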

Finally, we update γ1 and γg via gradient ascent,

\[
\begin{aligned}
\gamma_1^{k+1} &= \gamma_1^k + \theta\phi\Big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\Big),\\
\gamma_g^{k+1} &= \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G. \tag{14}
\end{aligned}
\]

We call this algorithm FS-QRADMM-CD, and summarize it in Algorithm 1. From our numerical studies, we observe that FS-QRADMM-CD has favorable practical performances.

Algorithm 1.

FS-QRADMM-CD for weighted $\ell_1$-penalized QR

Initialization: $\beta^0, \omega^0, z^0, \gamma^0$, and $\phi > 0$, $\theta > 0$ are given.
while the stopping criterion is not satisfied, do
  Compute βk+1 by (12) using CD algorithm.
  Compute $\omega^{k+\frac12}$, $z^{k+1}$ and $\omega^{k+1}$ by (13).
  Update γk+1 by (14).
end while

Besides using the coordinate descent algorithm to update $\beta$, we have another solution for the $\beta$ update. To ensure that solutions from (12) are well defined, we add $G$ self-adjoint positive semidefinite matrices, denoted $\mathcal{T}_g$, $g = 1, \ldots, G$, to (12). A general principle is that $\mathcal{T}_g$ should be as small as possible while keeping the optimization problems easy to compute. Here we add proximal terms $\frac{1}{2}\|\beta_g - \beta_g^k\|_{\mathcal{T}_g}^2$, $g = 1, \ldots, G$, to each of the $\beta$-subproblems, where the proximal operator $\mathcal{T}_g$ is positive definite. The positive definiteness of $\mathcal{T}_g$ makes $\{\beta^k\}$ automatically well defined. In this paper, we take $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$ with $\eta_g > \phi\lambda_{\max}(X_g^TX_g)$. This is essentially a linearization step of the $\beta$ update, as it uses $\eta_g I_{p_g}$ to approximate the Hessian matrix $X_g^TX_g$. The modified minimization problem admits a closed-form solution, which can be carried out componentwise,

\[
\begin{aligned}
\beta_1^{k+1} &= S\Big(\beta_1^k - \frac{\phi}{\eta_1}X_1^T\big(X_1\beta_1^k + \sum_{g=2}^G\omega_g^k + z^k - y + \frac{\gamma_1^k}{\phi}\big),\ \frac{\alpha_1\lambda}{\eta_1}\Big),\\
\beta_g^{k+1} &= S\Big(\beta_g^k - \frac{\phi}{\eta_g}X_g^T\big(X_g\beta_g^k - \omega_g^k + \frac{\gamma_g^k}{\phi}\big),\ \frac{\alpha_g\lambda}{\eta_g}\Big),\quad g = 2, \ldots, G, \tag{15}
\end{aligned}
\]

where $S(x, t) = \mathrm{sign}(x)(|x| - t)\,I(|x| > t)$ is the soft-thresholding function.

The updates in (15) manifest one advantage of splitting the feature space into lower dimensions. The $\beta$ update can be regarded as a one-step proximal gradient iteration. After feature-splitting, the $\eta_g$'s are relatively small compared with the "un-split" $\eta$, as $\eta$ needs to be larger than $\phi\lambda_{\max}(X^TX)$. Since $\eta$ increases significantly with $p$ for high dimensional data, the step size for the update (i.e., $1/\eta$) can be rather small and slow down the convergence of the algorithm. The updates for $\omega$, $z$ and $\gamma$ in this algorithm are exactly the same as those in Algorithm 1. We use FS-QRADMM-prox to denote this algorithm and summarize it in Algorithm 2. Note that $\mathcal{T}_g \succeq 0$ is also required in the proof of the convergence of $\{\beta^k\}$.

Thus, we compute $\beta_2, \ldots, \beta_G$ on separate processors/cores in the manner of parallel computing, and then aggregate the updated information to compute the other variables.
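A compact R sketch of the linearized update (15) (again an illustration, not the authors' code) makes the per-block parallel structure explicit: each block needs only $X_g$, its own dual variables, and the scalar $\eta_g$.

```r
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

# one proximal-linearized step (15) for block g >= 2;
# eta_g should exceed phi * lambda_max(t(Xg) %*% Xg)
update_beta_prox <- function(Xg, beta_g, omega_g, gamma_g, alpha_g,
                             lambda, phi, eta_g) {
  grad <- crossprod(Xg, Xg %*% beta_g - omega_g + gamma_g / phi)
  drop(soft_threshold(beta_g - (phi / eta_g) * grad, alpha_g * lambda / eta_g))
}
```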

We establish the linear rate of convergence for Algorithm 2 in Theorem 1, in which the proximal term is necessary for establishing the theory; its proof is given in Appendix A.

Theorem 1. For $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by Algorithm 2 converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$, where $(\bar\beta, \bar z, \bar\omega)$ is primal optimal and $\bar\gamma$ is dual optimal. Furthermore, there exists a constant $\mu \in (0, 1)$ such that $\mathrm{Dist}_{k+1} \le \mu\,\mathrm{Dist}_k$, where $\mathrm{Dist}_k$ at the $k$-th iteration is defined as

\[
\begin{aligned}
\mathrm{Dist}_k ={}& \|z^k - \bar z\|_2^2 + \frac{G-1}{G}\|z^k - z^{k-1}\|_2^2 + \sum_{g=1}^G\Big\|X_g(\beta_g^k - \bar\beta_g) - \frac{1}{G}(X\beta^k - X\bar\beta)\Big\|_2^2\\
&+ \frac{m_1}{G}\Big\|\sum_{g=1}^G X_g(\beta_g^k - \bar\beta_g) + (z^k - \bar z)\Big\|_2^2 + \sum_{g=1}^G\|\beta_g^k - \bar\beta_g\|_{\mathcal{T}_g}^2, \tag{16}
\end{aligned}
\]

where $m_1 = 1 + d_1 - d_1\theta - (1 - d_1)\min\{\theta, 1-\theta\}$ and $d_1 \in (0, \tfrac12)$.

Algorithm 2.

FS-QRADMM-prox for weighted $\ell_1$-penalized QR

Initialization: $\beta^0, \omega^0, z^0, \gamma^0$, and $\phi > 0$, $\theta > 0$ are given; $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$
 with $\eta_g > \phi\lambda_{\max}(X_g^TX_g)$, $g = 1, \ldots, G$.
while the stopping criterion is not satisfied, do
  Compute βk+1 by (15).
  Compute $\omega^{k+\frac12}$, $z^{k+1}$ and $\omega^{k+1}$ by (13).
  Update γk+1 by (14).
end while

Remark. The minimization problem for searching the solution of the penalized quantile regression can be written as a linear programming problem. Both the primal and dual problems are feasible. By strong duality, the optimal value of the dual problem equals the optimal value of the linear programming (primal) problem. Thus both the optimal values of the primal and dual problems equal the optimal value of the penalized quantile regression problem.

The effect of G on the convergence is twofold. On the one hand, increasing G reduces the dimension of subproblems and the value of η, and thus it accelerates the computation of each sub-problem. On the other hand, increasing G leads to an increased number of sub-problems and may raise the value of μ. This slows down the convergence both practically and theoretically. In our numerical experiments, it seems that choosing G from 5 to 10 works well for p ranging from thousands to tens of thousands.

2.3. PQR-Lasso and PQR-SCAD

In this paper, PQR-Lasso refers to the PQR in (2) with the $\ell_1$ penalty, $p_\lambda(|\beta|) = \lambda|\beta|$. Thus, the PQR-Lasso can be solved by Algorithms 1 and 2 directly with all weights $\alpha_j = 1$, $j = 1, \ldots, p$, in (6). The resulting solutions from Algorithms 1 and 2 for the PQR-Lasso are denoted by FS-QRADMM-CD(Lasso) and FS-QRADMM-prox(Lasso) in Section 3, respectively.

Parallel to the PQR-Lasso, PQR-SCAD refers to the PQR in (2) with the SCAD penalty, whose first-order derivative is defined in (3). Since the SCAD penalty is folded concave, the objective function of PQR-SCAD may have multiple local minimizers. To avoid this issue, we recommend (a) using the proposed algorithm to obtain the PQR-Lasso estimate $\hat\beta_L = (\hat\beta_{L,1}, \ldots, \hat\beta_{L,p})^T$, and then (b) solving the PQR with the weighted $\ell_1$ penalty, in which the weight $\alpha_j$ is $\lambda^{-1}p'_\lambda(|\hat\beta_{L,j}|)$ with $p'_\lambda(|\beta|)$ being the first-order derivative of the SCAD penalty. We refer to the resulting estimate as the two-step PQR-SCAD estimate. Note that both the $\ell_1$ penalty and the SCAD-based weighted $\ell_1$ penalty are convex. The two-step SCAD estimate is well defined when $L(y - X\beta)$ is strictly convex with respect to $\beta$. Denote by FS-QRADMM-CD(TS-SCAD) and FS-QRADMM-prox(TS-SCAD) the resulting solutions of Algorithms 1 and 2 for the two-step PQR-SCAD. The corresponding versions of FS-QRADMM-CD and FS-QRADMM-prox for the two-step PQR-SCAD are presented in Algorithms 3 and 4 in Section A.3 in the Appendix.

The two-step PQR-SCAD shares the same spirit as the one-step sparse maximum likelihood estimation proposed in Zou and Li (2008) for folded concave penalization problems. The second step in the two-step PQR-SCAD corrects the bias inherent in the $\ell_1$ penalty, which is known to over-penalize large coefficients and introduce bias into the resulting model. As shown in Corollary 8 in Fan et al. (2014), the two-step PQR-SCAD can find the oracle estimator among multiple local minima with overwhelming probability, under certain regularity conditions. This provides theoretical justification for the two-step SCAD. In other words, the resulting solutions of Algorithms 3 and 4 for the two-step PQR-SCAD enjoy the strong oracle property in the terminology of Fan et al. (2014).

The two-step PQR-SCAD procedure can be extended to two-step PQR with a general folded concave penalty characterized by the following conditions: (a) $p_\lambda(t)$ is nondecreasing and concave for $t \in [0, \infty)$ with $p_\lambda(0) = 0$; (b) $p_\lambda(t)$ is differentiable in $(0, \infty)$; (c) for some positive constants $a_1$ and $a_2$, $p'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda]$; and (d) $p'_\lambda(t) = 0$ for $t \in [a\lambda, \infty)$ with $a > 1$. As shown in Fan et al. (2014), the two-step PQR with a general folded concave penalty also enjoys the strong oracle property under certain regularity conditions.

It is desirable to have a data-driven method to select the regularization parameters in PQR-Lasso and PQR-SCAD. In our numerical study, we set the same penalty and tuning parameter for all coefficients, and $\lambda$ is chosen by the HBIC criterion proposed in Lee et al. (2014),

\[
\mathrm{HBIC}(\lambda) = \log\Big\{\sum_{i=1}^n\rho_\tau(y_i - x_i^T\hat\beta)\Big\} + |\mathcal{A}|\,\frac{\log(\log n)\log(p)}{n}, \tag{17}
\]

where $|\mathcal{A}|$ is the cardinality of the active set. We select the $\lambda$ that minimizes HBIC.
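In R, the HBIC in (17) for a candidate $\lambda$ reduces to a few lines; this is a sketch under the assumption that the fitted coefficient vector at that $\lambda$ is available, not the authors' code.

```r
# HBIC criterion (17) evaluated at a fitted coefficient vector beta_hat
hbic <- function(y, X, beta_hat, tau) {
  n <- length(y); p <- ncol(X)
  u <- y - X %*% beta_hat
  fit  <- sum(u * (tau - (u < 0)))   # sum of check losses
  size <- sum(beta_hat != 0)         # |A|, the active-set size
  log(fit) + size * log(log(n)) * log(p) / n
}
# lambda is then chosen as the minimizer of hbic() over a grid of candidates
```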

Wang et al. (2013) recommends using different $\lambda$'s in the first and second steps in the penalized least squares setting to ensure that the resulting Lasso estimate satisfies a certain rate of convergence. Denote by $\lambda_1$ and $\lambda$ the regularization parameters used in the first and second steps, respectively. Following the recommendation in Wang et al. (2013), we choose $\lambda_1 = \upsilon\lambda$, where $\upsilon > 0$ tends to 0 as $n \to \infty$. We set $\upsilon = \lambda$, as suggested by Wang et al. (2013), in our numerical studies in Section 3.

3. Numerical Studies

In this section, we assess the performance of the proposed algorithms via simulation studies and illustrate the application of the newly proposed procedure via an empirical analysis. For all ADMM-based methods, we implement the warm-start technique introduced in Friedman et al. (2007) and Friedman et al. (2010), which uses the solution from the previous $\lambda$ to initialize computation at the current $\lambda$. The way of splitting the features has no influence on the convergence property of the algorithm. We equally distribute the features into $G$ groups without adjusting their order in our numerical studies. The stopping criterion of the ADMM-based algorithms is provided in the Appendix.
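Because the ordering of the features does not matter for convergence, the splitting just described can be mimicked with a one-liner in R (an illustrative sketch, not the study's code).

```r
# distribute the p feature indices into G contiguous groups of roughly equal size
split_features <- function(p, G) {
  split(seq_len(p), cut(seq_len(p), breaks = G, labels = FALSE))
}
# e.g., split_features(10, 3) gives contiguous blocks of sizes 4, 3 and 3
```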

3.1. Simulation study

In this simulation, we compare the performance of Algorithms 1 and 2 with the R packages rqPen (Sherwood and Maidman, 2017), qradmm (Gu et al., 2018), hqReg (Yi and Huang, 2017) and Conquer (Tan et al., 2022). Since the qradmm package is accelerated by FORTRAN, we re-implement its core algorithm, i.e., a two-block proximal ADMM, in R code for a relatively fair comparison. The R package rqPen implements an iterative coordinate descent algorithm (QICD) proposed in Peng and Wang (2015) to solve sparse quantile regression. QICD applies a convex majorization function to the concave penalty term, and solves the majorized objective function by coordinate descent. The R package qradmm implements a two-block proximal ADMM for PQR with the $\ell_1$ penalty proposed in Gu et al. (2018). We use the R packages hqreg and conquer to implement the methods proposed by Yi and Huang (2017) and Tan et al. (2022), respectively. The regularization parameter $\lambda$ in all algorithms to be compared is selected by the HBIC criterion defined in (17).

We take a simulation setting similar to that of Peng and Wang (2015). We generate $\tilde z = (Z_1, Z_2, \ldots, Z_p)^T$ from $N_p(0, \Sigma)$, where $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = 0.5^{|i-j|}$. Then set $X_1 = \Phi(Z_1)$ and $X_j = Z_j$ for $j = 2, \ldots, p$, where $\Phi(\cdot)$ is the cumulative distribution function of $N(0, 1)$. The response variable $Y$ is generated from the following heteroscedastic regression model,

\[
Y = X_6 + X_{100} + X_{500} + X_{1000} + 0.7X_1\varepsilon, \tag{18}
\]

where $\varepsilon \sim N(0, 1)$. We consider three different quantile levels, $\tau = 0.3, 0.5$ and $0.7$. Note that $X_1$ does not affect the center of the conditional distribution of $Y$ given $x$, but it affects the conditional distribution when $\tau = 0.3$ or $0.7$. In our simulation, we set $n = 400$, and $p = 1000$ and $50000$. For each case, we conduct 500 replications.
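The AR(1) structure of $\Sigma$ makes it easy to generate such data without forming the $p \times p$ covariance matrix; the R sketch below (our own illustration of the stated design, not the authors' script) exploits $Z_j = 0.5Z_{j-1} + \sqrt{0.75}\,e_j$.

```r
# one replication from the heteroscedastic model (18) with AR(1) predictors
simulate_pqr_data <- function(n = 400, p = 1000) {
  Z <- matrix(0, n, p)
  Z[, 1] <- rnorm(n)
  for (j in 2:p) Z[, j] <- 0.5 * Z[, j - 1] + sqrt(0.75) * rnorm(n)  # corr 0.5^|i-j|
  X <- Z
  X[, 1] <- pnorm(Z[, 1])                                            # X1 = Phi(Z1)
  y <- X[, 6] + X[, 100] + X[, 500] + X[, 1000] + 0.7 * X[, 1] * rnorm(n)
  list(X = X, y = y)
}
```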

The following criteria are used to compare the performance of different algorithms.

  1. Average absolute error: the average and standard deviation of $\|\hat\beta - \beta\|_1 = \sum_{j=1}^p|\hat\beta_j - \beta_j|$ over 500 replications.

  2. Size: the average number of nonzero β^j’s over 500 replications.

  3. P1: the proportion of models that select all active features except for X1 over 500 replications

  4. P2: the proportion of models that select X1 over 500 replications.

The proportion P2 is expected to be close to 0 when τ=0.5, and be close to 1 when τ=0.3 and 0.7.

The simulation results over 500 replications are summarized in Tables 1 and 2. Compared to the PQR-Lasso, the two-step PQR-SCAD produces models with significantly smaller absolute error and better selection accuracy in general. FS-QRADMM-CD(TS-SCAD) and FS-QRADMM-prox(TS-SCAD) have the best performance with respect to estimation and variable selection accuracy. When p = 1000, the three ADMM-based methods perform comparably well and outperform rqPen, hqReg and Conquer by a significant margin. rqPen, hqReg and Conquer obtain relatively larger estimation errors and are more likely to miss X1 when τ = 0.3 and 0.7. The current version of rqPen runs out of memory when solving the two-step PQR-SCAD, as noted in the table. Moreover, when p = 50000, both qradmm and rqPen fail due to their demanding memory usage. In fact, we notice that the efficiency of qradmm deteriorates sharply as p increases. hqReg and Conquer are able to finish the job when p = 50000, but the proposed methods still outperform hqReg and Conquer.

Table 1:

Comparison of algorithms for PQR when p = 1000 and n = 400.

n = 400, p = 1000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.295 (0.003) 100% 100% 5.56 (0.03)
0.5 0.210 (0.003) 100% 5.4% 4.36 (0.03)
0.7 0.281 (0.003) 100% 100% 5.56 (0.03)

FS-QRADMM-prox (Lasso) 0.3 0.295 (0.003) 100% 100% 5.62 (0.03)
0.5 0.198 (0.003) 100% 4.6% 4.34 (0.02)
0.7 0.301 (0.003) 100% 100% 5.56 (0.03)

qradmm(Lasso) 0.3 0.310 (0.003) 100% 100% 5.68 (0.03)
0.5 0.230 (0.003) 100% 9% 5.32 (0.06)
0.7 0.327 (0.005) 100% 100% 6.73 (0.08)

rqPen(Lasso) 0.3 0.598 (0.004) 100% 61.2% 5.10 (0.04)
0.5 0.267 (0.003) 100% 0% 4.23 (0.02)
0.7 0.601 (0.004) 100% 56.6% 5.04 (0.04)

hqReg(Lasso) 0.3 0.593 (0.006) 100% 50% 4.95 (0.04)
0.5 0.235 (0.003) 100% 0% 4.31 (0.03)
0.7 0.589 (0.006) 100% 51.6% 4.97 (0.04)

Conquer(Lasso) 0.3 0.590 (0.005) 100% 45% 4.73 (0.03)
0.5 0.231 (0.002) 100% 0% 4.27 (0.02)
0.7 0.586 (0.005) 100% 45% 4.72 (0.03)

FS-QRADMM-CD (TS-SCAD) 0.3 0.119 (0.002) 100% 100% 5.00 (0.00)
0.5 0.035 (0.001) 100% 0.2% 4.00 (0.00)
0.7 0.125 (0.002) 100% 100% 5.00 (0.00)

FS-QRADMM-prox (TS-SCAD) 0.3 0.115 (0.002) 100% 100% 5.00 (0.00)
0.5 0.040 (0.001) 100% 0.2% 4.00 (0.00)
0.7 0.123 (0.001) 100% 100% 5.00 (0.00)

qradmm (TS-SCAD) 0.3 0.122 (0.002) 100% 100% 5.00 (0.00)
0.5 0.038 (0.001) 100% 0.4% 4.00 (0.00)
0.7 0.129 (0.002) 100% 100% 5.00 (0.00)

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.339 (0.004) 100% 63.4% 4.67 (0.02)
0.5 0.049 (0.001) 100% 0% 4.07 (0.01)
0.7 0.350 (0.004) 100% 57% 4.60 (0.02)

Table 2:

Comparison of algorithms for PQR when p = 50000 and n = 400.

n = 400, p = 50000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.320 (0.003) 100% 98.2% 5.34 (0.02)
0.5 0.250 (0.003) 100% 2% 4.25 (0.03)
0.7 0.349 (0.003) 100% 100% 5.15 (0.03)

FS-QRADMM-prox (Lasso) 0.3 0.326 (0.003) 100% 92.4% 4.93 (0.01)
0.5 0.121 (0.001) 100% 0% 4.01 (0.00)
0.7 0.394 (0.002) 100% 95.6% 5.01 (0.11)

qradmm(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

rqPen(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

hqReg(Lasso) 0.3 0.812 (0.004) 100% 2.2% 4.23 (0.02)
0.5 0.365 (0.003) 100% 0% 4.20 (0.02)
0.7 0.808 (0.004) 100% 4.4% 4.26 (0.02)

Conquer(Lasso) 0.3 0.717 (0.003) 100% 18% 5.88 (0.07)
0.5 0.303 (0.002) 100% 0% 8.63 (0.11)
0.7 0.705 (0.003) 100% 26.8% 6.13 (0.07)

FS-QRADMM-CD (TS-SCAD) 0.3 0.180 (0.003) 100% 98.8% 4.99 (0.00)
0.5 0.047 (0.001) 100% 0% 4.00 (0.00)
0.7 0.172 (0.003) 100% 99.6% 5.00 (0.03)

FS-QRADMM-prox (TS-SCAD) 0.3 0.158 (0.002) 100% 100% 5.00 (0.00)
0.5 0.069 (0.005) 100% 2.2% 7.31 (0.47)
0.7 0.244 (0.007) 100% 99.2% 6.64 (0.14)

qradmm (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.396 (0.002) 100% 48.4% 5.04 (0.04)
0.5 0.058 (0.001) 100% 0% 5.78 (0.07)
0.7 0.390 (0.003) 100% 49.6% 5.12 (0.05)

Figure 1 plots the curves of $\|\hat\beta - \beta\|_1$ with respect to the iteration steps, averaged over 500 replications, when n = 400, p = 1000, and τ = 0.3, 0.5, 0.7. We can see that Algorithms 1 and 2 converge to the true β within approximately 20 iterations.

Figure 1: Convergence curves of $\|\hat\beta - \beta\|_1$ for FS-QRADMM-CD(Lasso) (left panel) and FS-QRADMM-prox(Lasso) (right panel) over 500 replications.

We next examine the performance of the proposed algorithms when the sample size is large. To this end, we conduct a simulation with n = 30000 and p = 1000. The simulation results are summarized in Table 3, from which it can be seen that the two proposed algorithms and qradmm have essentially the same performance and perform better than the conquer algorithm.

Table 3:

Performance of proposed algorithms for PQR when p = 1000 and n = 30000.

n = 30000, p = 1000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.031 (0.0004) 100% 100% 5.06 (0.003)
0.5 0.020 (0.0004) 100% 0.4% 4.04 (0.003)
0.7 0.029 (0.0004) 100% 100% 5.06 (0.003)

FS-QRADMM-prox (Lasso) 0.3 0.029 (0.0003) 100% 100% 5.05 (0.003)
0.5 0.020 (0.0003) 100% 0.5% 4.03 (0.002)
0.7 0.029 (0.0003) 100% 100% 5.05 (0.003)

qradmm(Lasso) 0.3 0.030 (0.0004) 100% 100% 5.08 (0.003)
0.5 0.023 (0.0003) 100% 0.6% 4.05 (0.004)
0.7 0.029 (0.0004) 100% 100% 5.04 (0.004)

rqPen(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

hqReg(Lasso) 0.3 0.040 (0.0008) 100% 100% 6.06 (0.09)
0.5 0.020 (0.0002) 100% 1% 4.60 (0.07)
0.7 0.040 (0.0008) 100% 51.6% 5.90 (0.09)

Conquer(Lasso) 0.3 0.066 (0.0009) 100% 100% 5.23 (0.04)
0.5 0.020 (0.0003) 100% 0% 4.11 (0.04)
0.7 0.065 (0.0009) 100% 100% 5.25 (0.05)

FS-QRADMM-CD (TS-SCAD) 0.3 0.012 (0.0005) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0002) 100% 0.2% 4.00 (0.00)
0.7 0.013 (0.0004) 100% 100% 5.00 (0.00)

FS-QRADMM-prox (TS-SCAD) 0.3 0.011 (0.0003) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0002) 100% 0% 4.00 (0.00)
0.7 0.012 (0.0003) 100% 100% 5.00 (0.00)

qradmm (TS-SCAD) 0.3 0.012 (0.0004) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0003) 100% 0% 4.00 (0.00)
0.7 0.011 (0.0005) 100% 100% 5.00 (0.00)

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.028 (0.0008) 100% 100% 5.15 (0.039)
0.5 0.005 (0.0002) 100% 1% 4.45 (0.061)
0.7 0.029 (0.0008) 100% 100% 5.14 (0.037)

3.2. A real data example

The QR model is widely adopted in the analysis of consumer markets due to its robustness against outliers. In this section, we apply the proposed algorithms to an empirical analysis of a supermarket data set studied in Wang (2009) and compare them with other existing algorithms. This data set contains the daily number of customers and the daily sale volumes of 6398 products from a supermarket in China over 464 days. Following Wang (2009), we set the response to be the daily number of customers, and the predictors to be the daily sale volumes of the products. Since the sample size n = 464 is much less than the dimension p = 6398, it is reasonable to assume that only a small proportion of the predictors have significant effects on the response. The distribution of the number of customers is highly skewed. This motivates us to consider PQR with the proposed algorithm in this example. We standardize the response and the predictors for our analysis.

We randomly split the observations into training and testing datasets of sizes 300 and 164, respectively, and fit PQR-Lasso and two-step PQR-SCAD on the training data with τ = 0.3, 0.5 and 0.7. The regularization parameter λ is chosen by the HBIC criterion. We report the averaged predictive error and its standard deviation on the testing data over 100 replications in Table 4. The predictive error is measured by the loss function $\frac{1}{n}\sum_{i=1}^n\rho_\tau(y_i - \hat y_i)$. We also report the average model sizes and their corresponding standard deviations to evaluate the interpretability of the models selected by the different methods. For PQR-Lasso, we observe that the ADMM-based algorithms have similar performance to that of rqPen and hqReg in terms of prediction error. The average values and standard deviations of the loss function are very close among those methods. In general, all methods perform best when τ = 0.5. The proposed method selects fewer products than qradmm, rqPen and hqReg do in most scenarios, which indicates better model interpretability.
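The predictive error reported in Table 4 is the average check loss on the test set, which can be computed as follows (a small sketch, not the analysis script).

```r
# average quantile check loss on held-out data, as used for Table 4
quantile_pred_error <- function(y_test, X_test, beta_hat, tau) {
  u <- y_test - drop(X_test %*% beta_hat)
  mean(u * (tau - (u < 0)))
}
```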

Table 4:

Performance of the compared algorithms for sparse quantile regression on the Chinese supermarket data.

τ $\frac{1}{n}\sum_{i=1}^n\rho_\tau(y_i - \hat y_i)$ Size
FS-QRADMM-CD (Lasso) 0.3 0.118 (0.001) 97.35 (0.62)
0.5 0.116 (0.001) 100.56 (0.53)
0.7 0.127 (0.001) 97.37 (0.71)

FS-QRADMM-prox (Lasso) 0.3 0.117 (0.001) 103.23 (1.13)
0.5 0.113 (0.001) 118.61 (0.93)
0.7 0.127 (0.001) 96.02 (1.06)

qradmm(Lasso) 0.3 0.115 (0.000) 119.01 (0.49)
0.5 0.116 (0.001) 121.26 (0.37)
0.7 0.130 (0.000) 127.35 (0.58)

rqPen(Lasso) 0.3 0.117 (0.001) 113.62 (0.73)
0.5 0.115 (0.001) 117.17 (0.56)
0.7 0.128 (0.001) 120.11 (0.62)

hqReg(Lasso) 0.3 0.117 (0.001) 49.31 (0.46)
0.5 0.116 (0.001) 90.85 (0.65)
0.7 0.127 (0.001) 43.42 (0.42)

Conquer(Lasso) 0.3 0.118 (0.001) 80.8 (1.47)
0.5 0.114 (0.001) 42.9 (0.58)
0.7 0.125 (0.001) 39.6 (0.49)

FS-QRADMM-CD (TS-SCAD) 0.3 0.112 (0.000) 63.86 (0.39)
0.5 0.111 (0.001) 69.77 (0.47)
0.7 0.116 (0.001) 72.71 (0.61)

FS-QRADMM-prox (TS-SCAD) 0.3 0.116 (0.000) 97.03 (0.96)
0.5 0.110 (0.000) 100.33 (1.02)
0.7 0.113 (0.000) 95.66 (0.79)

qradmm (TS-SCAD) 0.3 0.113 (0.001) 469.88 (2.56)
0.5 0.114 (0.001) 477.72 (2.33)
0.7 0.120 (0.001) 521.33 (1.99)

rqPen(TS-SCAD) The algorithm runs out of memory for the three τs

Conquer(SCAD) 0.3 0.115 (0.001) 65.7 (1.95)
0.5 0.113 (0.001) 65.7(0.82)
0.7 0.120 (0.001) 36.0 (0.48)

Similar results are also observed for the two-step PQR-SCAD. However, rqPen for the two-step SCAD fails in this example due to memory limitations. The proposed algorithms have prediction errors similar to those of qradmm, but the model sizes are much smaller. When τ = 0.7, the proposed methods outperform qradmm, with fewer products included in the QR model. Conquer with the SCAD penalty has similar performance to the proposed method under this scenario. We also notice that PQR-SCAD yields a smaller loss than PQR-Lasso does, and the two-step PQR-SCAD procedures select fewer products when the proposed algorithms are implemented.

4. Conclusion

The QR model is a powerful data analytic tool in econometrics. To promote the application of QR in high/ultrahigh dimensions, in this paper we propose efficient and parallelizable algorithms for PQR based on a three-block ADMM algorithm with feature-splitting, and further establish the convergence of the proposed algorithms. Owing to the nature of the feature-splitting algorithm, the proposed algorithms can be used to minimize the objective function of PQR in ultrahigh dimension. Our numerical study implies that the proposed algorithms outperform existing ones for PQR. To illustrate the performance of the proposed methods, we conduct a comprehensive simulation study. The numerical experiments suggest that the proposed method is stable when the dimension of the data is huge, while existing algorithms run out of memory and fail to accomplish the tasks. The proposed algorithms may be extended to other statistical models such as the support vector machine, whose loss function is similar to that of QR. This is an interesting topic for future research.

Acknowledgment

Christina Dan Wang is supported in part by National Natural Science Foundation of China (NNSFC) grant 11901395 and 12271363. Li’s research was supported by National Science Foundation DMS-1820702 and NIAID/NIH grants R01-AI136664 and R01AI170249.

Appendix: Technical Details and Proofs

In this appendix, we first provide details of how to update each variable in Algorithm 2, and then provide technical proofs of Theorem 1.

A.1. Sub-problems in Algorithm 2

In this subsection, we derive the updates for β,z and ω in Algorithm 2. For ease of notation, define a set of functions f, g, h.

\[
f(\beta) = n\lambda\sum_{g=1}^G\|\alpha_g\circ\beta_g\|_1, \qquad h(\omega) = 0, \qquad g(z) = \tau\mathbf{1}^T(z)_+ + (1-\tau)\mathbf{1}^T(z)_-. \tag{A.1}
\]

Thus, f, g, h are closed proper convex functions. Further define matrices F, G, H

\[
F = \mathrm{Diag}(X_1, X_2, \ldots, X_G), \qquad G = (I_n, 0, \ldots, 0)^T, \qquad
H = \begin{pmatrix} I_n & I_n & \cdots & I_n \\ -I_n & 0 & \cdots & 0 \\ 0 & -I_n & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & -I_n \end{pmatrix}. \tag{A.2}
\]

Then Problem (9) can be expressed as a general three-block constrained optimization problem,

\[
\min_{\beta, z, \omega}\big\{f(\beta) + g(z) + h(\omega) \mid F\beta + Gz + H\omega = c\big\}, \tag{A.3}
\]

where, by the definitions of $F$, $G$ and $H$, $c = (y^T, 0, \ldots, 0)^T$ and

\[
F\beta = \begin{pmatrix} X_1\beta_1 \\ X_2\beta_2 \\ \vdots \\ X_G\beta_G \end{pmatrix}, \qquad
Gz = \begin{pmatrix} z \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
H\omega = \begin{pmatrix} \omega_2 + \cdots + \omega_G \\ -\omega_2 \\ \vdots \\ -\omega_G \end{pmatrix}. \tag{A.4}
\]

As in sGS-sPADMM proposed by Sun et al. (2015), we update the three-block variables using a special cycle,

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{\mathcal{T}_f}^2,\\
\omega^{k+\frac12} = (H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^k),\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{\mathcal{T}_h}^2,\\
\omega^{k+1} = (H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^{k+1}),\\
\gamma^{k+1} = \gamma^k + \theta\phi\big(F\beta^{k+1} + Gz^{k+1} + H\omega^{k+1} - c\big),
\end{cases}
\tag{A.5}
\]

where $\mathcal{T}_f$ and $\mathcal{T}_h$ are optionally added self-adjoint positive semidefinite operators. To update $\omega$, we need to compute $(H^TH)^{-1}$. Since

\[
H^TH = \begin{pmatrix} I_n & & \\ & \ddots & \\ & & I_n \end{pmatrix} + \begin{pmatrix} I_n \\ \vdots \\ I_n \end{pmatrix}\begin{pmatrix} I_n \\ \vdots \\ I_n \end{pmatrix}^T,
\]

we apply the Sherman–Morrison–Woodbury formula to compute $(H^TH)^{-1}$, and it follows that

\[
\omega_i^{k+\frac12} = \big[(H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^k)\big]_i = \frac{1}{G}\Big(y - z^k + GX_i\beta_i^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big), \quad i = 2, \ldots, G. \tag{A.6}
\]

In the $z$-subproblem, we set $\mathcal{T}_h = 0$, and then we have

\[
\begin{aligned}
z^{k+1} &= \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k)\\
&= \arg\min_z \Big\{\frac{1}{n}\sum_{i=1}^n\rho_\tau(z_i) + (\gamma_1^k)^T\Big(X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y\Big) + \frac{\phi}{2}\Big\|X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y\Big\|_2^2\Big\}\\
&= \arg\min_z\ \frac{1}{n}\sum_{i=1}^n\rho_\tau(z_i) + \frac{\phi}{2}\Big\|X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y + \frac{\gamma_1^k}{\phi}\Big\|_2^2. \tag{A.7}
\end{aligned}
\]

The closed-form solution of the z-subproblem can be easily derived as

\[
z^{k+1} = \max\Big(y - X_1\beta_1^{k+1} - \sum_{i=2}^G\omega_i^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi},\ 0\Big) - \max\Big(-\Big(y - X_1\beta_1^{k+1} - \sum_{i=2}^G\omega_i^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi} + \frac{1}{n\phi}\Big),\ 0\Big). \tag{A.8}
\]

A.2. Proof of Theorem 1

We first show Lemmas A.1, A.2 and A.3, which are used in the proof of Theorem 1. From (A.2), we have Fact 1 below.

Fact 1. $H^TH$ is positive definite.

Assumption 1 below is imposed to obtain theoretical guarantees on the feasibility and convergence of the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$.

Assumption 1. There exists $(\hat\beta, \hat z, \hat\omega) \in \mathbb{R}^{p_1}\times\mathbb{R}^{p_2}\times\mathbb{R}^{p_3}$ such that $F\hat\beta + G\hat z + H\hat\omega = c$.

For algorithm (A.5), the projection matrix $\mathcal{P} = H(H^TH)^{-1}H^T$ plays an important role in the convergence analysis. Let $\mathcal{Q} = I - \mathcal{P}$. Since $\omega$ can be expressed as $\omega(\beta, z) = (H^TH)^{-1}H^T(c - F\beta - Gz)$, it follows that $H\omega = \mathcal{P}(c - F\beta - Gz)$. Given that $h(\omega) = 0$ in our case, we can now rewrite (A.3) as

\[
\min_{\beta, z}\big\{f(\beta) + g(z) \mid \mathcal{Q}(F\beta + Gz - c) = 0\big\}. \tag{A.9}
\]

Stopping Criterion. In the implementation of Algorithm 2, we use the same stopping criterion as that introduced in Boyd et al. (2011). The primal and dual residuals are often used to characterize the convergence stage. Define $r^{k+1} = \big(\|X_1\beta_1^{k+1} + z^{k+1} + \omega_2^{k+1} + \cdots + \omega_G^{k+1} - y\|_2^2 + \sum_{g=2}^G\|X_g\beta_g^{k+1} - \omega_g^{k+1}\|_2^2\big)^{0.5}$ as the primal residual and $s^{k+1} = (\phi/G)\,\|(X_1^T, \ldots, X_G^T)^T(z^{k+1} - z^k)\|_2$ as the dual residual at the $(k+1)$th iteration. The termination criterion is

\[
\|r^k\|_2 \le \epsilon^{\mathrm{pri}} \quad\text{and}\quad \|s^k\|_2 \le \epsilon^{\mathrm{dual}}, \tag{A.10}
\]

where $\epsilon^{\mathrm{pri}} > 0$ and $\epsilon^{\mathrm{dual}} > 0$ are feasibility tolerances chosen as $\epsilon^{\mathrm{pri}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\sqrt{G}\max\big(\|X\beta^k\|_2, \|z^k\|_2, \|c\|_2\big)$ and $\epsilon^{\mathrm{dual}} = \sqrt{p}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\sqrt{G}\,\|F^T\mathcal{Q}\gamma^k\|_2$. A common choice is $\epsilon^{\mathrm{abs}} = 0.001$ and $\epsilon^{\mathrm{rel}} = 0.001$.

The augmented Lagrangian function for (A.9) is given by

\[
\Phi(\beta, z; \gamma) = f(\beta) + g(z) + \langle\gamma,\ \mathcal{Q}(F\beta + Gz - c)\rangle + \frac{\phi}{2}\|\mathcal{Q}(F\beta + Gz - c)\|_2^2.
\]

Using arguments similar to those in Sun et al. (2015), it follows that applying the updates in (A.5) to problem (A.3) is equivalent to applying the following two-block semi-proximal ADMM to (A.9):

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{F^T\mathcal{P}F + \mathcal{T}_f}^2,\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2,\\
\gamma^{k+1} = \gamma^k + \theta\phi\,\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c).
\end{cases}
\tag{A.11}
\]

The Karush–Kuhn–Tucker (KKT) optimality condition of (A.9) is

\[
0 \in (\mathcal{Q}F)^T\gamma + \partial f(\beta), \qquad 0 \in (\mathcal{Q}G)^T\gamma + \partial g(z), \qquad \mathcal{Q}(c - F\beta - Gz) = 0. \tag{A.12}
\]

Denote the solution set to (A.12) as $\bar\Omega$; then we can replace Assumption 1 by assuming that $\bar\Omega$ is non-empty. Let $\bar u = (\bar\beta, \bar z, \bar\gamma)$ be an optimal solution to (A.9). We have the following lemma on the convergence of the proposed algorithm by utilizing its equivalence to the updates in (A.11).

Lemma A.1. Suppose Assumption 1 holds, and $\mathcal{T}_f$ and $\mathcal{T}_h$ are chosen such that $\mathcal{T}_f + F^TF$ and $\mathcal{T}_h + G^TG$ are positive definite. Then, under the condition $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by (A.5) converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$ with $(\bar\beta, \bar z, \bar\omega)$ solving (A.3) and $\bar\gamma$ being dual optimal.

Lemma A.1 follows by a direct application of Theorem 3.2 in Han et al. (2018). Based on (A.2), we have the following fact.

Fact 2. Suppose $u^k$ converges to $\bar u \in \bar\Omega$. There exists a positive constant $q$ such that

\[
\|u^k - \bar u\|_2^2 \le q^2\Big(\|\beta^k - \mathrm{prox}_f\big(\beta^k - (\mathcal{Q}F)^T\gamma^k\big)\|_2^2 + \|z^k - \mathrm{prox}_g\big(z^k - (\mathcal{Q}G)^T\gamma^k\big)\|_2^2 + \|\mathcal{Q}(c - F\beta^k - Gz^k)\|_2^2\Big), \tag{A.13}
\]

for a sufficiently large k.

For any convex function $P$, $\mathrm{prox}_P(\cdot)$ denotes the proximal mapping associated with $P$. That is,

\[
\mathrm{prox}_P(x) = \arg\min_y\Big\{\frac{1}{2}\|x - y\|_2^2 + P(y)\Big\}. \tag{A.14}
\]

Denote $\mathcal{M} = C\times\mathrm{Diag}\big(F^T\mathcal{P}F + \mathcal{T}_f,\ G^TG + \mathcal{T}_h,\ \theta^{-2}\phi^{-1}I\big)$, where $C = \max\big\{3\phi^2\|F^T\mathcal{P}F + \mathcal{T}_f\|_2,\ 3\phi^2\lambda_{\max}(FF^T),\ 2\phi^2\|G^T\mathcal{P}G + \mathcal{T}_h\|_2,\ 3(1 - \tfrac{1}{\theta})^2\phi\lambda_{\max}(\mathcal{Q}FF^T\mathcal{Q}) + 2(1 - \tfrac{1}{\theta})^2\phi\lambda_{\max}(\mathcal{Q}GG^T\mathcal{Q}) + \tfrac{1}{\phi}\big\}$; then we have the following relationship.

Lemma A.2. Suppose the sequence $u^k = (\beta^k, z^k, \gamma^k)$ is generated by algorithm (A.5); then for any $k \ge 0$,

\[
\|u^{k+1} - \bar u\|_2^2 \le q^2\|u^{k+1} - u^k\|_{\mathcal{M}}^2. \tag{A.15}
\]

Proof. Considering the optimality conditions of the subproblems in (A.11), we have

\[
\begin{aligned}
0 &\in \partial f(\beta^{k+1}) + (\mathcal{Q}F)^T\gamma^k + \phi(\mathcal{Q}F)^T\mathcal{Q}(F\beta^{k+1} + Gz^k - c) + \phi(F^T\mathcal{P}F + \mathcal{T}_f)(\beta^{k+1} - \beta^k),\\
0 &\in \partial g(z^{k+1}) + (\mathcal{Q}G)^T\gamma^k + \phi(\mathcal{Q}G)^T\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c) + \phi(G^T\mathcal{P}G + \mathcal{T}_h)(z^{k+1} - z^k),\\
0 &= (\theta\phi)^{-1}(\gamma^{k+1} - \gamma^k) - \mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c). \tag{A.16}
\end{aligned}
\]

Then we have $\mathcal{Q}(F\beta^{k+1} + Gz^k - c) = (\theta\phi)^{-1}(\gamma^{k+1} - \gamma^k) - \mathcal{Q}G(z^{k+1} - z^k)$, and it follows that

\[
\begin{aligned}
\beta^{k+1} &= \mathrm{prox}_f\Big(\beta^{k+1} - (\mathcal{Q}F)^T\big(\gamma^k + \theta^{-1}(\gamma^{k+1} - \gamma^k) - \phi\,\mathcal{Q}G(z^{k+1} - z^k)\big) - \phi(F^T\mathcal{P}F + \mathcal{T}_f)(\beta^{k+1} - \beta^k)\Big),\\
z^{k+1} &= \mathrm{prox}_g\Big(z^{k+1} - (\mathcal{Q}G)^T\big(\gamma^k + \theta^{-1}(\gamma^{k+1} - \gamma^k)\big) - \phi(G^T\mathcal{P}G + \mathcal{T}_h)(z^{k+1} - z^k)\Big),\\
\gamma^{k+1} &= \gamma^k + \theta\phi\,\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c),
\end{aligned}
\]

and we have

\[
\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2 + \|z^{k+1} - \mathrm{prox}_g(z^{k+1} - (\mathcal{Q}G)^T\gamma^{k+1})\|_2^2 + \|\mathcal{Q}(c - F\beta^{k+1} - Gz^{k+1})\|_2^2 \le \|u^{k+1} - u^k\|_{\mathcal{M}}^2. \tag{A.17}
\]

We first bound the term $\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2$. By the fact that the proximal mapping is Lipschitz continuous with constant 1, i.e., $\|\mathrm{prox}_h(x) - \mathrm{prox}_h(y)\|_2 \le \|x - y\|_2$ for any convex function $h$,

βk+1proxf(βk+1(𝒬F)Tγk+1)22βk+1(𝒬F)T(γk+θ1(γk+1γk)ϕ𝒬G(zk+1zk))+ϕ(FT𝒫F+𝒯f)(βk+1βk)βk+1+(QF)Tγk+122=ϕ(FT𝒫F+𝒯f)(βk+1βk)+ϕFT𝒬G(zk+1zk)+(11θ)(QF)T(γk+1γk)22=ϕ(FT𝒫F+𝒯f)(βk+1βk)22+ϕFT𝒬G(zk+1zk)22+(11θ)2(QF)T(γk+1γk)22+2ϕ2(βk+1βk)(FT𝒫F+𝒯f)TFT𝒬G(zk+1zk)+2(11θ)ϕ(βk+1βk)(FT𝒫F+𝒯f)T(QF)T(γk+1γk)+2(11θ)ϕ(zk+1zk)TGT𝒬F(QF)T(γk+1γk) (A.18)

By taking into account the fact that

2ϕ2(βk+1βk)(FT𝒫F+𝒯f)TFT𝒬G(zk+1zk)ϕ(FT𝒫F+𝒯f)(βk+1βk)22+ϕFT𝒬G(zk+1zk)222(11θ)ϕ(βk+1βk)(FT𝒫F+𝒯f)T(QF)T(γk+1γk)ϕ(FT𝒫F+𝒯f)(βk+1βk)22+(11θ)2(QF)T(γk+1γk)22,

and

2(11θ)ϕ(zk+1zk)TGT𝒬F(QF)T(γk+1γk)(11θ)2(QF)T(γk+1γk)22+ϕFT𝒬G(zk+1zk)22,

and the inequality that

FT𝒬G(zk+1zk)22=(𝒬G(zk+1zk))TFFT(𝒬G(zk+1zk))Tλmax(FFT)𝒬G(zk+1zk)22,

where $\lambda_{\max}(FF^T)$ is the largest eigenvalue of $FF^T$, (A.18) can be reduced to

\[
\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2 \le 3\phi^2\|F^T\mathcal{P}F + \mathcal{T}_f\|_2\,\|\beta^{k+1} - \beta^k\|_{F^T\mathcal{P}F + \mathcal{T}_f}^2 + 3\phi^2\lambda_{\max}(FF^T)\,\|z^{k+1} - z^k\|_{G^T\mathcal{Q}G}^2 + 3\Big(1 - \frac{1}{\theta}\Big)^2\|(\mathcal{Q}F)^T(\gamma^{k+1} - \gamma^k)\|_2^2. \tag{A.19}
\]

Similarly we can bound the term zk+1proxg(zk+1(𝒬G)Tγk+1)22,

\[
\|z^{k+1} - \mathrm{prox}_g(z^{k+1} - (\mathcal{Q}G)^T\gamma^{k+1})\|_2^2 \le 2\phi^2\|G^T\mathcal{P}G + \mathcal{T}_h\|_2\,\|z^{k+1} - z^k\|_{G^TG + \mathcal{T}_h}^2 + 2\Big(1 - \frac{1}{\theta}\Big)^2\|(\mathcal{Q}G)^T(\gamma^{k+1} - \gamma^k)\|_2^2. \tag{A.20}
\]

From the update of γ, we have

\[
\|\mathcal{Q}(c - F\beta^{k+1} - Gz^{k+1})\|_2^2 = (\theta\phi)^{-2}\|\gamma^{k+1} - \gamma^k\|_2^2. \tag{A.21}
\]

Combining (A.19), (A.20) and (A.21), we can obtain that

βk+1proxf(βk+1(𝒬F)Tγk+1)22+zk+1proxh(zk+1Gγk+1)22+𝒬(cFβk+1Gzk+1)223ϕ2FT𝒫F+𝒯f2βk+1βkFT𝒫F+𝒯f+3ϕ2λmax(FFT)zk+1zkGT𝒬G2+(θϕ)2γk+1γk22+3(11θ)2(𝒬F)T(γk+1γk)22+2ϕ2GT𝒫G+𝒯h2zk+1zkGT𝒫G+𝒯h2+2(11θ)2(𝒬G)T(γk+1γk)22C×(βk+1βkFT𝒫+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ1γk+1γk22) (A.22)

Lemma A.3. Suppose that Assumption 1 holds, and assume that both $F^TF + \mathcal{T}_f$ and $G^TG + \mathcal{T}_h$ are positive definite. Then, for all sufficiently large $k$ and $\theta \in (0, \tfrac{1+\sqrt5}{2})$, there exists $\mu \in (0, 1)$ such that

\[
\|u^{k+1} - \bar u\|_{\mathcal{M}_1}^2 + \|z^{k+1} - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2 \le \mu\big(\|u^k - \bar u\|_{\mathcal{M}_1}^2 + \|z^k - z^{k-1}\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2\big), \tag{A.23}
\]

where

\[
\mathcal{M}_1 = \begin{pmatrix} F^T(m_1\mathcal{Q} + \mathcal{P})F + \mathcal{T}_f & m_1F^T\mathcal{Q}G & 0 \\ m_1G^T\mathcal{Q}F & G^T(\mathcal{P} + (m_1 + 1)\mathcal{Q})G & 0 \\ 0 & 0 & \theta^{-1}\phi^{-2}I \end{pmatrix} \tag{A.24}
\]

with $m_1 \in (0, 1)$.

Proof. From Theorem 1 in Han et al. (2018), we can derive the following results.

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1min{θ,1θ})𝒬(Fβk+Gzkc)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1min{θ,1θ})𝒬(Fβk+1+Gzk+1c)22+θ1ϕ2γk+1γ¯22}zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))G𝒬TG2+βk+1βkFT𝒫F+𝒯f2+(1θ+min{θ,θ1})𝒬(Fβk+1+Gzk+1c)22 (A.25)

When θ(0,1+52), it is ensured that (1θ+ϕmin{θ,θ1})>0. Let d1(0,12), then we have

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬(Fβk+Gzkc)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬(Fβk+1+Gzk+1c)22+θ1ϕ2γk+1γ¯22} (A.26)
zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(1d1)(1θ+min{θ,θ1})𝒬(Fβk+1+Gzk+1c)22+d1(1θ+min{θ,θ1})𝒬(Fβk+Gzkc)22=zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1))(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+d1(1θ+min{θ,θ1})(𝒬(Fβk+Gzkc)22+𝒬(Fβk+1+Gzk+1c)22)zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1)(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+12d1(1θ+min{θ,θ1})𝒬F(βk+1βk)+𝒬G(zk+1zk)22 (A.27)

Note that $\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c) = \mathcal{Q}F(\beta^{k+1} - \bar\beta) + \mathcal{Q}G(z^{k+1} - \bar z)$, and we have

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬F(βkβ¯)+𝒬G(zkz¯)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬F(βk+1β¯)+𝒬G(zk+1z¯))22+θ1ϕ2γk+1γ¯22}zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1)(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+12d1(1θ+min{θ,θ1})𝒬F(βk+1βk)+𝒬G(zk+1zk)22(1θ+min{θ,θ1})min{12d1,12d1,θ}(βk+1βkFT𝒫F+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ2γk+1γk22) (A.28)

Let $m_1 = 1 + d_1 - d_1\theta - (1 - d_1)\min\{\theta, 1-\theta\}$ in $\mathcal{M}_1$ defined in (A.24), and $m_2 = (1 - \theta + \min\{\theta, \theta^{-1}\})\min\{\tfrac{d_1}{2}, 1 - 2d_1, \theta\}$. Note that when $\theta \in (0, \tfrac{1+\sqrt5}{2})$, the following relationship holds.

\[
F^TF + \mathcal{T}_f \succ 0 \quad\text{and}\quad G^TG + \mathcal{T}_h \succ 0 \ \Longrightarrow\ \mathcal{M}_1 \succ 0.
\]

Combining with Lemma A.2, we have

uku¯12+zkzk1GTG+𝒯h2(uk+1u¯1+zk+1zkGTG+𝒯h2)m2C(C×(βk+1βkFT𝒫F+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ2γk+1γk22))=m2Cuk+1uk2m2d2Cq2uk+1u¯22+m2(1d2)Cq2zk+1zkGTG+𝒯h2m2d2Cq2λmax(1)uk+1u¯12+m2(1d2)Cq2zk+1zkGTG+𝒯h2. (A.29)

Take $d_2 = \frac{\lambda_{\max}(\mathcal{M}_1)}{1 + \lambda_{\max}(\mathcal{M}_1)}$; then we can obtain (A.23) with $\mu = \Big[1 + \frac{m_2}{Cq^2(1 + \lambda_{\max}(\mathcal{M}_1))}\Big]^{-1}$. □

Proof of Theorem 1. Since $f(\cdot)$ is a weighted $\ell_1$ norm and $g(\cdot)$ is a nonnegative combination of $\mathbf{1}^T(\cdot)_+$ and $\mathbf{1}^T(\cdot)_-$, both are piecewise linear-quadratic functions; thus both $\mathrm{prox}_f(\cdot)$ and $\mathrm{prox}_g(\cdot)$ are piecewise polyhedral (Poliquin and Rockafellar, 1993), which implies Fact 2 (Han et al., 2018). Since we take $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$, $g = 1, \ldots, G$, we have $\mathcal{T}_f + \phi F^TF = \mathrm{Diag}(\eta_1 I_{p_1}, \ldots, \eta_G I_{p_G})$, which is positive definite; this, together with the fact that $G^TG = I_n \succ 0$, implies that the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ is automatically well defined. By Lemma A.1, under the condition $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by algorithm (A.5) converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$ with $(\bar\beta, \bar z, \bar\omega)$ solving (9) and $\bar\gamma$ being dual optimal.

To derive the rate of convergence, we first compute the quantities appearing in $\mathcal{M}_1$. By definition,

\[
\mathcal{P} = H(H^TH)^{-1}H^T = \frac{1}{G}\begin{pmatrix} (G-1)I & -I & \cdots & -I \\ -I & (G-1)I & \cdots & -I \\ \vdots & & \ddots & \vdots \\ -I & -I & \cdots & (G-1)I \end{pmatrix}.
\]

It follows that

\[
\begin{aligned}
\|\beta^{k+1} - \bar\beta\|_{F^T\mathcal{P}F}^2 &= \sum_{i=1}^G\|X_i(\beta_i^{k+1} - \bar\beta_i)\|_2^2 - \frac{1}{G}\Big\|\sum_{i=1}^G X_i(\beta_i^{k+1} - \bar\beta_i)\Big\|_2^2,\\
m_1\|\mathcal{Q}F(\beta^{k+1} - \bar\beta) + \mathcal{Q}G(z^{k+1} - \bar z)\|_2^2 &= \frac{m_1}{G}\Big\|\sum_{i=1}^G X_i(\beta_i^{k+1} - \bar\beta_i) + (z^{k+1} - \bar z)\Big\|_2^2,\\
\|z^{k+1} - \bar z\|_{G^TG + \mathcal{T}_h}^2 &= \|z^{k+1} - \bar z\|_2^2,\\
\|z^{k+1} - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2 &= \frac{G-1}{G}\|z^{k+1} - z^k\|_2^2. \tag{A.30}
\end{aligned}
\]

Plugging equations (A.30) back into (A.29), we derive the results in Theorem 1 easily.

A.3. Algorithms for Two-Step PQR-SCAD

This section presents two three-block ADMM algorithms for PQR-SCAD proposed in Section 2.3.

Algorithm 3.

FS-QRADMM-CD for Two-Step PQR-SCAD

Initialization: $\tilde\beta^0, \lambda, \upsilon, \tilde z^0, \tilde\gamma^0, \tilde\omega_i^0$, and $\phi > 0$, $\theta = 1.618$, $k = 0$.
while the stopping criterion is not satisfied, do
  Update $\tilde\beta^{k+1}$ by
     $\tilde\beta_1^{k+1} = \arg\min_{\beta_1\in\mathbb{R}^{p_1}}\ n\upsilon\lambda\|\beta_1\|_1 + \frac{\phi}{2}\big\|X_1\beta_1 + \sum_{g=2}^G\tilde\omega_g^k + \tilde z^k - y + \frac{\tilde\gamma_1^k}{\phi}\big\|_2^2,$
     $\tilde\beta_g^{k+1} = \arg\min_{\beta_g\in\mathbb{R}^{p_g}}\ n\upsilon\lambda\|\beta_g\|_1 + \frac{\phi}{2}\big\|X_g\beta_g - \tilde\omega_g^k + \frac{\tilde\gamma_g^k}{\phi}\big\|_2^2,\quad g = 2, \ldots, G.$
  Compute $\tilde\omega^{k+\frac12}$, $\tilde z^{k+1}$ and $\tilde\omega^{k+1}$ by (13).
  Update $\tilde\gamma^{k+1}$ by (14).
end while Denote the solution as $\hat\beta^1, \hat z^1, \hat\omega^1$.
Initialization: $\hat\beta^0 = \hat\beta^1$, $\hat z^0 = \hat z^1$, $\hat\omega^0 = \hat\omega^1$, and $\phi > 0$, $\theta = 1.618$, $k = 0$. Compute
$\alpha_j = \lambda^{-1}p'_\lambda(|\hat\beta_j^0|)$ for $j = 1, \ldots, p$.
while the stopping criterion is not satisfied, do
  Update β^k+1 by (12).
  Compute ω^k+12,z^k+1 and ω^k+1 by (13).
  Update γ^k+1 by (14).
end while

Algorithm 4.

FS-QRADMM-prox for Two-Step PQR-SCAD

Initialization: $\tilde\beta^0, \lambda, \upsilon, \tilde z^0, \tilde\gamma^0, \tilde\omega_i^0$, and $\phi > 0$, $\theta = 1.618$, $k = 0$.
while the stopping criterion is not satisfied, do
  Update $\tilde\beta^{k+1}$ by
    $\tilde\beta_{1j}^{k+1} = \mathrm{Shrink}\Big(\tilde\beta_{1j}^k - \frac{\phi}{\eta_1}X_{1j}^T\big(X_1\tilde\beta_1^k + \sum_{g=2}^G\tilde\omega_g^k + \tilde z^k - y + \frac{\tilde\gamma_1^k}{\phi}\big),\ \frac{n\upsilon\lambda}{\eta_1}\Big),\quad j = 1, \ldots, p_1,$
    $\tilde\beta_{gj}^{k+1} = \mathrm{Shrink}\Big(\tilde\beta_{gj}^k - \frac{\phi}{\eta_g}X_{gj}^T\big(X_g\tilde\beta_g^k - \tilde\omega_g^k + \frac{\tilde\gamma_g^k}{\phi}\big),\ \frac{n\upsilon\lambda}{\eta_g}\Big),\quad j = 1, \ldots, p_g,\ g = 2, \ldots, G.$
  Compute $\tilde\omega^{k+\frac12}$, $\tilde z^{k+1}$ and $\tilde\omega^{k+1}$ by (13).
  Update $\tilde\gamma^{k+1}$ by (14).
end while Denote the solution as $\hat\beta^1, \hat z^1, \hat\omega^1$.
Initialization: $\hat\beta^0 = \hat\beta^1$, $\hat z^0 = \hat z^1$, $\hat\omega^0 = \hat\omega^1$, and $\phi > 0$, $\theta = 1.618$, $k = 0$. Compute
$\alpha_j = \lambda^{-1}p'_\lambda(|\hat\beta_j^0|)$ for $j = 1, \ldots, p$.
while the stopping criterion is not satisfied, do
  Update β^k+1 by (15).
  Compute ω^k+12,z^k+1 and ω^k+1 by (13).
  Update γ^k+1 by (14).
end while


References

  1. Altunbaş Y and Thornton J (2019). The impact of financial development on income inequality: a quantile regression approach. Economics Letters, 175:51–56.
  2. Belloni A and Chernozhukov V (2011). L1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130.
  3. Boyd S, Parikh N, Chu E, Peleato B, and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  4. Cai Z, Chen H, and Liao X (2022). A new robust inference for predictive quantile regression. Journal of Econometrics. In press.
  5. Chen C, He B, Ye Y, and Yuan X (2016). The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1):57–79.
  6. D’Haultfœuille X, Maurel A, and Zhang Y (2018). Extremal quantile regressions for selection models and the black–white wage gap. Journal of Econometrics, 203(1):129–142.
  7. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  8. Fan J, Li R, Zhang C-H, and Zou H (2020). Statistical Foundations of Data Science. Chapman and Hall/CRC.
  9. Fan J, Xue L, and Zou H (2014). Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics, 42(3):819–849.
  10. Fan Y, Lin N, and Yin X (2021). Penalized quantile regression for distributed big data using the slack variable representation. Journal of Computational and Graphical Statistics, 30(3):557–565.
  11. Fazel M, Pong TK, Sun D, and Tseng P (2013). Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977.
  12. Firpo S, Galvao AF, Pinto C, Poirier A, and Sanroman G (2022). GMM quantile regression. Journal of Econometrics. In press.
  13. Fortin M and Glowinski R (2000). Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, volume 15. Elsevier.
  14. Friedman J, Hastie T, Höfling H, and Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
  15. Friedman J, Hastie T, and Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
  16. Giessing A and He X (2019). On the predictive risk in misspecified quantile regression. Journal of Econometrics, 213(1):235–260.
  17. Gimenes N and Guerre E (2022). Quantile regression methods for first-price auctions. Journal of Econometrics, 226(2):224–247.
  18. Gu J and Volgushev S (2019). Panel data quantile regression with grouped fixed effects. Journal of Econometrics, 213(1):68–91.
  19. Gu Y, Fan J, Kong L, Ma S, and Zou H (2018). ADMM for high-dimensional sparse penalized quantile regression. Technometrics, 60(3):319–331.
  20. Han D, Sun D, and Zhang L (2018). Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Mathematics of Operations Research, 43(2):622–637.
  21. He X, Pan X, Tan KM, and Zhou W-X (2022). Smoothed quantile regression with large-scale inference. Journal of Econometrics. In press.
  22. Koenker R (2017). Quantile regression: 40 years on. Annual Review of Economics, 9:155–176.
  23. Koenker R and Bassett G (1978). Regression quantiles. Econometrica, 46(1):33–50.
  24. Koenker R, Chernozhukov V, He X, and Peng L (2017). Handbook of Quantile Regression. CRC Press.
  25. Koenker R and Mizera I (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685.
  26. Lee ER, Noh H, and Park BU (2014). Model selection via Bayesian information criterion for quantile regression models. Journal of the American Statistical Association, 109(505):216–229.
  27. Li Y and Zhu J (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics, 17(1):163–185.
  28. Narisetty N and Koenker R (2022). Censored quantile regression survival models with a cure proportion. Journal of Econometrics, 226(1):192–203.
  29. Peng B and Wang L (2015). An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3):676–694.
  30. Poliquin R and Rockafellar R (1993). A calculus of epi-derivatives applicable to optimization. Canadian Journal of Mathematics, 45(4):879–896.
  31. Sherwood B and Maidman A (2017). rqPen: Penalized Quantile Regression. R package version 2.0.
  32. Sun D, Toh K-C, and Yang L (2015). A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM Journal on Optimization, 25(2):882–915.
  33. Tan KM, Wang L, and Zhou W-X (2022). High-dimensional quantile regression: Convolution smoothing and concave regularization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):205–233.
  34. Wang H (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524.
  35. Wang L and He X (2022). Analysis of global and local optima of regularized quantile regression in high dimensions: A subgradient approach. Econometric Theory.
  36. Wang L, Kim Y, and Li R (2013). Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
  37. Wang L, Wu Y, and Li R (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214–222.
  38. Wu Y and Liu Y (2009). Variable selection in quantile regression. Statistica Sinica, 19(2):801–817.
  39. Yi C and Huang J (2017). Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.
  40. Yu L and Lin N (2017). ADMM for penalized quantile regression in big data. International Statistical Review, 85(3):494–518.
  41. Yu L, Lin N, and Wang L (2017). A parallel algorithm for large-scale nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 26(4):935–939.
  42. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
  43. Zou H and Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533.
