Published in final edited form as: J Econom. 2023 Mar 24;249(Pt A):105426. doi: 10.1016/j.jeconom.2023.01.028

Feature-splitting Algorithms for Ultrahigh Dimensional Quantile Regression*

Jiawei Wen, Songshan Yang, Christina Dan Wang, Yifan Jiang, Runze Li

Abstract

This paper is concerned with computational issues related to penalized quantile regression (PQR) with ultrahigh dimensional predictors. Various algorithms have been developed for PQR, but they become ineffective and/or infeasible in the presence of ultrahigh dimensional predictors due to storage and scalability limitations. The variable updating scheme of the feature-splitting algorithm that directly applies the ordinary alternating direction method of multipliers (ADMM) to ultrahigh dimensional PQR may make the algorithm fail to converge. To tackle this hurdle, we propose an efficient and parallelizable algorithm for ultrahigh dimensional PQR based on the three-block ADMM. The compatibility of the proposed algorithm with parallel computing alleviates the storage and scalability limitations of a single machine in large-scale data processing. We establish the rate of convergence of the newly proposed algorithm. In addition, Monte Carlo simulations are conducted to compare the finite sample performance of the proposed algorithm with that of other existing algorithms. The numerical comparison implies that the proposed algorithm significantly outperforms the existing ones. We further illustrate the proposed algorithm via an empirical analysis of a real-world data set.

Keywords: ADMM, Penalized quantile regression, Parallel computing, Sample-splitting algorithm

1. Introduction

Quantile regression (QR) is well acknowledged as a powerful tool for analyzing data with heterogeneous effects. Since the seminal work of Koenker and Bassett (1978), QR has been extensively applied in many research fields, in particular in econometrics. For a complete review of QR, refer to Koenker (2017) and Koenker et al. (2017). Many recent advances and achievements of QR can be found in the literature. Wang and He (2022) provided a unified theory for high-dimensional quantile regression with both convex and nonconvex penalties. Gimenes and Guerre (2022) proposed a QR inference framework for first-price auctions, and Cai et al. (2022) reexamined the heterogeneous predictability of US stock returns at different quantile levels. Other recent studies of QR include, but are not limited to, D’Haultfœuille et al. (2018), Altunbaş and Thornton (2019), Giessing and He (2019), Gu and Volgushev (2019), Firpo et al. (2022), He et al. (2022), and Narisetty and Koenker (2022).

For variable selection in QR, penalized quantile regression (PQR) has been developed with fixed and finite dimensional predictors in Li and Zhu (2008) and Wu and Liu (2009). Furthermore, PQR with high-dimensional predictors has also been studied in the statistical literature, since, with the advent of data science, high-dimensional data analysis has become one of the most important research topics of the last decade. Belloni and Chernozhukov (2011) derived an error bound for PQR with the Lasso penalty ($\ell_1$-QR for short). Wang et al. (2012) studied PQR with folded concave penalties such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), and further established the oracle property for PQR in the ultrahigh dimension setting under mild conditions. In summary, estimation and theory of PQR are well studied and understood in the literature.

The numerical minimization problem for searching solutions to PQR, however, is challenging due to the nonsmooth objective function and the possible nonconvexity of the folded concave penalty. Sherwood and Maidman (2017) developed an R package rqPen for $\ell_1$-QR, and the algorithm is similar to that introduced in Koenker and Mizera (2014). Peng and Wang (2015) developed an iterative coordinate descent algorithm (QICD) for solving PQR with nonconvex penalties. Gu et al. (2018) introduced a fast alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) for PQR in high dimensions.

With the advent of big data, it is of crucial importance to study numerical algorithms for PQR in ultrahigh dimension and/or with a large data size. The ADMM (Boyd et al., 2011) has been introduced to cope with PQR with a large data size. Yu et al. (2017) and Fan et al. (2021) developed parallel algorithms for PQR based on sample-splitting ADMM. Sample-splitting means, as the name suggests, that the algorithm partitions the data across samples. Ultrahigh dimensionality adds another challenge in minimizing the objective function of ultrahigh dimensional PQR. This work aims to tackle the simultaneous challenges of nonsmoothness, nonconvexity and ultrahigh dimensionality by developing feature-splitting algorithms for PQR.

In this paper, we propose an efficient and parallelizable algorithm for PQR in ultrahigh dimension based on the three-block ADMM. It is noteworthy that Yu and Lin (2017) briefly mentioned one direct extension of the feature-splitting ADMM for PQR without theoretical justification or numerical studies. The variable update scheme in Yu and Lin (2017) makes the convergence of the algorithm uncertain. Chen et al. (2016) showed that Gauss-Seidel multi-block ADMM is not necessarily convergent. For more detailed discussion on this, see Section 2. The uncertain convergence motivates us to avoid the direct extension of the feature-splitting ADMM, and instead to develop a three-block ADMM algorithm for ultrahigh dimensional PQR. Using techniques related to Sun et al. (2015), we establish the rate of convergence of the proposed algorithm and its theoretical convergence guarantee, thereby addressing the convergence uncertainty. The compatibility of the proposed three-block ADMM algorithm with parallel computing alleviates the storage and scalability limitations of a single machine in large-scale data processing. The proposed three-block ADMM algorithms also enjoy numerical efficiency over the directly extended two-block ADMM. It is worth noting that the newly proposed algorithms can be directly applied to PQR with various penalties, including the $\ell_1$, SCAD and MCP penalties, via local linear approximation of the penalties (Zou and Li, 2008). Based on theories developed in Wang et al. (2013) and Fan et al. (2014), the proposed algorithms are able to obtain a PQR estimate with the strong oracle property in ultrahigh dimensions.

The rest of this article is organized as follows. In Section 2, we present the computational framework based on the three-block ADMM for PQR and establish the linear rate of convergence of the algorithm. In Section 3, we demonstrate the numerical and statistical efficiency of the proposed framework in high and ultra-high dimensional settings through Monte Carlo simulation, and illustrate the proposed algorithm via an empirical analysis of a Chinese supermarket data set. Technical proofs are given in the Appendix.

Throughout the paper, we adopt the following notation. For a matrix $M = (m_{ij})_{s\times t}$, denote $\|M\|_{\max} = \max_{(i,j)}|m_{ij}|$, $\|M\|_{\min} = \min_{(i,j)}|m_{ij}|$, and let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ be the smallest and largest eigenvalues of $M$, respectively. $X_{\mathcal{A}}$ denotes the sub-matrix of $X$ with columns indexed by $\mathcal{A}$. $M \succ 0$ indicates that $M$ is positive definite. For a positive semidefinite operator or matrix $M$, $\|x\|_M^2 = x^T M x$.

2. Feature-splitting Algorithms for PQR

Suppose that $\{x_i, y_i\}$, $i = 1, \ldots, n$, is a random sample from the linear regression model

\[
y_i = x_i^T\beta + \varepsilon_i,
\]

where $\beta$ is a $p$-dimensional vector of regression coefficients, and $\varepsilon_i$ is a random error with $E(\varepsilon_i \mid x_i) = 0$. In this paper, we are interested in solving QR in the ultrahigh dimensional regime, in which $p \gg n$. Define $y = (y_1, \ldots, y_n)^T$ as the response vector, and $X = (x_1, \ldots, x_n)^T$ as the corresponding design matrix. For a given $\tau \in (0,1)$, the quantile level of interest, define the loss function $\rho_\tau(z) = z[\tau - I(z < 0)] = \tau(z)_+ + (1-\tau)(z)_-$, where $I(\cdot)$ is the indicator function, $(z)_+ = \max\{0, z\}$, and $(z)_- = (-z)_+$. QR minimizes the objective function

\[
L(y - X\beta) = \frac{1}{n}\sum_{i=1}^{n}\rho_\tau(y_i - x_i^T\beta) \tag{1}
\]

with respect to β, and this leads to the QR estimate of β. The minimization problem in QR can be reformulated as a linear programming problem. The Frisch-Newton algorithm can be applied to solve the minimization problem with computational complexity growing as a cubic function of p when p < n.
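As a quick illustration (not part of the original paper), the check loss admits two equivalent forms, which can be verified directly in R:

```r
# quantile check loss: rho_tau(z) = z * (tau - I(z < 0))
rho_tau  <- function(z, tau) z * (tau - (z < 0))
# equivalent form: tau * (z)_+ + (1 - tau) * (-z)_+
rho_tau2 <- function(z, tau) tau * pmax(z, 0) + (1 - tau) * pmax(-z, 0)

z <- c(-2, -0.5, 0, 1, 3)
all.equal(rho_tau(z, 0.3), rho_tau2(z, 0.3))  # TRUE
```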

2.1. Penalized quantile regression

In the presence of ultrahigh dimensional predictors, it is common to impose a sparsity assumption on $\beta$. That is, only a small portion of the elements in $\beta$ are nonzero. This implies that only a small portion of the predictors are significant in the model. Thus, it is critical to identify the significant predictors in QR in ultrahigh dimension. Variable selection in QR is similar to that in linear regression, for which penalized least squares methods have been proposed, so it is natural to adopt a penalization approach for variable selection in QR. PQR minimizes the penalized quantile loss function

\[
Q(\beta) = L(y - X\beta) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \tag{2}
\]

where $p_{\lambda_j}(\cdot)$ is a penalty function with a regularization parameter $\lambda_j$ that controls model complexity. The algorithms to be developed in this paper allow different regression coefficients to have different penalties, although it is common to take all $p_{\lambda_j}(\cdot)$ to be the same, denoted by $p_\lambda(\cdot)$. This paper concentrates on the two most commonly used penalties: the Lasso (i.e., $\ell_1$) penalty $p_\lambda(|\beta|) = \lambda|\beta|$ and the SCAD penalty, whose first derivative is defined as

\[
p'_\lambda(|\beta|) = \lambda\left\{ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda}\, I(|\beta| > \lambda) \right\} \tag{3}
\]

with $p'_\lambda(0) := p'_\lambda(0+) = \lambda$ and $a = 3.7$ as suggested in Fan and Li (2001). The proposed algorithms are directly applicable to other folded concave penalties (Fan et al., 2020).

Minimizing the objective function of PQR in (2) is challenging since both the loss function and the penalty function are nonsmooth. When a folded concave penalty such as the SCAD penalty is used in ultrahigh dimensional PQR, the minimization problem becomes even more challenging due to its nonconvexity and ultrahigh dimensionality. It is noteworthy that PQR with the $\ell_1$ penalty is a convex minimization problem and, when $p \le n$, it has a unique minimizer. For PQR with a folded concave penalty, minimizing the objective function in (2) may be achieved by iteratively minimizing PQR with a reweighted $\ell_1$ penalty with the aid of the local linear approximation (LLA) to the penalty function. Specifically, given $\beta^k = (\beta_1^k, \ldots, \beta_p^k)^T$ updated from the $k$-th step in the course of the iterations, we first approximate

\[
p_\lambda(|\beta_j|) \approx q_\lambda(|\beta_j|; |\beta_j^k|) = p_\lambda(|\beta_j^k|) + p'_\lambda(|\beta_j^k|)\big(|\beta_j| - |\beta_j^k|\big), \tag{4}
\]

which is referred to as the LLA. Then at the (k+1)-th step we minimize

\[
Q^{k+1}(\beta) = L(y - X\beta) + \lambda\sum_{j=1}^{p}\lambda^{-1}p'_\lambda(|\beta_j^k|)\,|\beta_j| = L(y - X\beta) + \lambda\sum_{j=1}^{p}\alpha_j|\beta_j|, \tag{5}
\]

where $\alpha_j = \lambda^{-1}p'_\lambda(|\beta_j^k|) \ge 0$. The function in (5) is the objective function of PQR with a reweighted $\ell_1$ penalty, with weights $\alpha_j$ updated at every step.

The LLA was first proposed in Zou and Li (2008) for penalized likelihood with finite dimensional predictors, and further adopted in Wang et al. (2013) and Fan et al. (2014) for penalized least squares in ultrahigh dimensional linear regression models. Note that if we set the initial value $\beta^0 = 0$, then $\beta^1$ is the PQR-Lasso estimator, defined as the PQR estimator with the $\ell_1$ penalty. Then $\beta^2$ can be regarded as the one-step sparse estimator with the PQR-Lasso estimator as the initial value. With properly chosen tuning parameters, Wang et al. (2013) and Fan et al. (2014) showed that, under some regularity conditions on high dimensional linear models, the corresponding penalized least squares estimator $\beta^2$ enjoys the strong oracle property with probability tending to one. This motivates us to focus on developing feature-splitting algorithms for PQR with a weighted $\ell_1$ penalty.
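As an illustration (not code from the paper), the SCAD derivative in (3) and the LLA weights $\alpha_j = \lambda^{-1}p'_\lambda(|\beta_j^k|)$ used in (5) and (6) take only a few lines of R:

```r
# SCAD first derivative p'_lambda(t) from (3), for t >= 0, with a = 3.7
scad_deriv <- function(t, lambda, a = 3.7) {
  lambda * ifelse(t <= lambda, 1, pmax(a * lambda - t, 0) / ((a - 1) * lambda))
}

# LLA weights alpha_j = p'_lambda(|beta_j^k|) / lambda;
# with beta^k = 0 all weights equal 1, i.e., the PQR-Lasso problem
lla_weights <- function(beta_k, lambda, a = 3.7) {
  scad_deriv(abs(beta_k), lambda, a) / lambda
}
```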

2.2. Three-block ADMM

Define the PQR estimator with the weighted $\ell_1$ penalty as

\[
\hat\beta = \arg\min_\beta L(y - X\beta) + \lambda\|\alpha\circ\beta\|_1, \tag{6}
\]

where $\|\alpha\circ\beta\|_1 = \sum_{j=1}^p|\alpha_j\beta_j|$, with $\alpha$ being the weight vector and $\circ$ the elementwise product.

The non-smoothness of the objective function in (6) hinders an efficient application of gradient-based methods. To decouple the non-smooth parts in computation, we recast problem (6) as the following constrained optimization problem,

\[
\min_{\beta, z}\ L(z) + \lambda\|\alpha\circ\beta\|_1, \quad \text{s.t.}\quad z + X\beta = y. \tag{7}
\]

Problem (7) is a natural candidate for the classical two-block ADMM algorithm. Define the augmented Lagrangian function as

\[
\Phi(\beta, z; \gamma) = L(z) + \lambda\|\alpha\circ\beta\|_1 + \langle\gamma,\ z + X\beta - y\rangle + \frac{\phi}{2}\|z + X\beta - y\|_2^2, \tag{8}
\]

where $\gamma \in \mathbb{R}^n$ is the Lagrangian multiplier, and $\phi > 0$ is the parameter associated with the quadratic term. The classic iterative scheme at iteration $k$ for the two-block ADMM is

\[
\begin{aligned}
\beta^{k+1} &= \arg\min_\beta \Phi(\beta, z^k; \gamma^k),\\
z^{k+1} &= \arg\min_z \Phi(\beta^{k+1}, z; \gamma^k),\\
\gamma^{k+1} &= \gamma^k + \theta\phi\,(z^{k+1} + X\beta^{k+1} - y),
\end{aligned}
\]

where $\theta$ is a tuning parameter controlling the step size. The effect of the tuning parameter $\theta$ on the convergence of the algorithm has been discussed in the literature (Fortin and Glowinski, 2000; Fazel et al., 2013), where convergence is established when $\theta$ is constrained to $(0, (1+\sqrt5)/2)$. In our numerical experiments, we set $\theta = 1.618$, which is slightly less than $(1+\sqrt5)/2$, for faster convergence. Gu et al. (2018) proposed an efficient algorithm (qradmm) to solve PQR based on the two-block ADMM algorithm. While qradmm performs very well for moderate dimensions, we found that it can still run out of memory for larger $p$ in our numerical study. This motivates us to split the high dimensional variable into smaller blocks and speed up the updates through parallelization.

We next propose a new three-block semi-proximal ADMM framework that enables a parallel update of $\beta$ to cope with the ultrahigh dimensionality. The major computational cost of the two-block ADMM for solving (7) comes from the $\beta$ update, which takes up to $O(np)$ operations and may impede an efficient execution of the algorithm with ultrahigh dimension $p$. This calls for a feature-splitting algorithm for PQR in ultrahigh dimension. For a pre-specified $G$, let us partition $X$ and $\beta$ as follows,

\[
X = (X_1, \ldots, X_G), \qquad \beta = (\beta_1^T, \beta_2^T, \ldots, \beta_G^T)^T, \qquad X\beta = \sum_{g=1}^G X_g\beta_g.
\]

Then problem (7) can be rewritten as a three-block optimization problem

\[
\min_{\beta, z, \omega}\ L(z) + \sum_{g=1}^G\lambda\|\alpha_g\circ\beta_g\|_1, \quad \text{s.t.}\quad X_1\beta_1 + z + \omega_2 + \cdots + \omega_G = y,\quad X_g\beta_g = \omega_g,\ g = 2, \ldots, G. \tag{9}
\]

Intuitively, slack variables ωg,g=2,,G store information of each local update βg. Each βg is updated independently and we view β=(β1T,β2T,,βGT)T as a single variable block in the algorithm. Likewise, all ωg together make up the third variable block. There may exist multiple ways to transform a problem into a form that ADMM can handle. For example, in formulation (9), the role of X1β1 is not special and Xgβg,g=1,,G are exchangeable. In this paper, we use formulation (9) to illustrate the computational framework.

The augmented Lagrangian function for (9) is given by

\[
\begin{aligned}
\Phi(\beta, z, \omega; \gamma) ={}& \frac{1}{n}\big[\tau\mathbf{1}^T(z)_+ + (1-\tau)\mathbf{1}^T(z)_-\big] + \lambda\sum_{g=1}^G\|\alpha_g\circ\beta_g\|_1\\
&+ \gamma_1^T\big(X_1\beta_1 + z + \omega_2 + \cdots + \omega_G - y\big) + \frac{\phi}{2}\big\|X_1\beta_1 + z + \omega_2 + \cdots + \omega_G - y\big\|_2^2\\
&+ \sum_{g=2}^G\gamma_g^T\big(X_g\beta_g - \omega_g\big) + \frac{\phi}{2}\sum_{g=2}^G\big\|X_g\beta_g - \omega_g\big\|_2^2. \tag{10}
\end{aligned}
\]

As seen from (10), each βg is decoupled in the quadratic term, which allows a natural parallelization for β updates. Two-block ADMM can be directly extended to solve (9), and the corresponding algorithm is referred to as Gauss-Seidel multi-block ADMM. At the kth iteration, it updates each variable with

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k)\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^k; \gamma^k)\\
\omega^{k+1} = \arg\min_\omega \Phi(\beta^{k+1}, z^{k+1}, \omega; \gamma^k)\\
\gamma_1^{k+1} = \gamma_1^k + \theta\phi\big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\big)\\
\gamma_g^{k+1} = \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G.
\end{cases}
\tag{11}
\]

Procedure (11) may perform well in practice. However, its theoretical convergence has remained unclear until the work by Chen et al. (2016), in which the authors showed that Gauss-Seidel multi-block ADMM is not necessarily convergent. To address the convergence uncertainty, Sun et al. (2015) proposed a symmetric Gauss-Seidel based semi-proximal ADMM (sGS-sPADMM) for convex programming problems, which enjoys both theoretical convergence guarantee and numerical efficiency over the directly extended multi-block ADMM. This convergent semi-proximal ADMM has three separable blocks in the objective function with the third part being linear and updates ω twice to improve convergence, but the extra step may incur additional computational cost.

Inspired by Sun et al. (2015), we now propose the three-block ADMM algorithm for solving PQR with the weighted $\ell_1$ penalty, using the following special iterative cycle ($\beta \to \omega \to z \to \omega$):

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{\mathcal{T}_f}^2\\
\omega^{k+\frac12} = \arg\min_\omega \Phi(\beta^{k+1}, z^k, \omega; \gamma^k)\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{\mathcal{T}_g}^2\\
\omega^{k+1} = \arg\min_\omega \Phi(\beta^{k+1}, z^{k+1}, \omega; \gamma^k)\\
\gamma_1^{k+1} = \gamma_1^k + \theta\phi\big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\big)\\
\gamma_g^{k+1} = \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G,
\end{cases}
\]

where 𝒯f and 𝒯g are some positive semidefinite matrices.

Given the augmented Lagrangian function defined in (10),

\[
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k)
\]

becomes

\[
\begin{aligned}
\beta_1^{k+1} &= \arg\min_{\beta_1\in\mathbb{R}^{p_1}}\ \lambda\|\alpha_1\circ\beta_1\|_1 + \frac{\phi}{2}\Big\|X_1\beta_1 + \sum_{g=2}^G\omega_g^k + z^k - y + \frac{\gamma_1^k}{\phi}\Big\|_2^2,\\
\beta_g^{k+1} &= \arg\min_{\beta_g\in\mathbb{R}^{p_g}}\ \lambda\|\alpha_g\circ\beta_g\|_1 + \frac{\phi}{2}\Big\|X_g\beta_g - \omega_g^k + \frac{\gamma_g^k}{\phi}\Big\|_2^2,\quad g = 2, \ldots, G. \tag{12}
\end{aligned}
\]

It can be seen that the $\beta$ subproblems are a series of weighted $\ell_1$-penalized least squares problems. If $p_g$ is too large, $X_g \in \mathbb{R}^{n\times p_g}$ may not have full column rank, and thus the generated sequences may not be well defined. This concern can be addressed with an additional general position condition (Koenker, 2017), which guarantees the existence of a unique QR solution under rather general conditions. Standard quadratic solvers can be applied to solve (12) efficiently. In our numerical studies, we use the R solver 'glmnet' to compute $\beta$ through the coordinate descent (CD) algorithm.
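For concreteness, a minimal hand-rolled cyclic coordinate descent for one $\beta_g$-subproblem in (12) is sketched below; the paper itself calls glmnet for this step, so the function and argument names here are illustrative only.

```r
# minimize lambda * sum(alpha * |b|) + (phi / 2) * ||A %*% b - r||_2^2 over b,
# which is the form of each beta_g-subproblem in (12) with a suitable residual
# vector r (e.g., r = y - sum_g omega_g^k - z^k - gamma_1^k / phi for g = 1)
cd_weighted_lasso <- function(A, r, alpha, lambda, phi,
                              b = rep(0, ncol(A)), n_sweeps = 50) {
  soft   <- function(x, t) sign(x) * pmax(abs(x) - t, 0)
  col_ss <- colSums(A^2)               # ||A_j||^2, precomputed once
  resid  <- r - A %*% b
  for (s in seq_len(n_sweeps)) {
    for (j in seq_along(b)) {
      resid <- resid + A[, j] * b[j]   # partial residual excluding feature j
      zj    <- phi * sum(A[, j] * resid)
      b[j]  <- soft(zj, lambda * alpha[j]) / (phi * col_ss[j])
      resid <- resid - A[, j] * b[j]
    }
  }
  b
}
```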

ωg,g=2,,G, and z are updated in the following cycle:

\[
\begin{aligned}
\omega_g^{k+\frac12} &= \frac{1}{G}\Big(y - z^k + GX_g\beta_g^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big),\\
z^{k+1} &= \Big(y - X_1\beta_1^{k+1} - \sum_{g=2}^G\omega_g^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi}\Big)_+ - \Big(-y + X_1\beta_1^{k+1} + \sum_{g=2}^G\omega_g^{k+\frac12} + \frac{\gamma_1^k}{\phi} + \frac{\tau - 1}{n\phi}\Big)_+,\\
\omega_g^{k+1} &= \frac{1}{G}\Big(y - z^{k+1} + GX_g\beta_g^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big), \tag{13}
\end{aligned}
\]

in which we perform an extra intermediate step to compute $\omega^{k+\frac12}$ before computing $z^{k+1}$. As seen from (13), the extra cost of updating $\omega$ is negligible. The derivations of the updates are given in Appendix A.1.
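The closed-form updates in (13) translate directly into vectorized R code; the sketch below is illustrative rather than the authors' implementation and assumes the fitted pieces $X_g\beta_g^{k+1}$ are supplied as a list.

```r
# omega_g half-step in (13); Xbeta is a list of X_g %*% beta_g, g = 1..G;
# element (g - 1) of the returned list is omega_g, g = 2,...,G
update_omega <- function(Xbeta, z, y) {
  G <- length(Xbeta)
  total <- Reduce(`+`, Xbeta)
  lapply(2:G, function(g) (y - z + G * Xbeta[[g]] - total) / G)
}

# z update in (13): elementwise prox of (1/n) * rho_tau at the working residual
update_z <- function(Xbeta1, omega, y, gamma1, phi, tau, n) {
  r <- y - Xbeta1 - Reduce(`+`, omega) - gamma1 / phi
  pmax(r - tau / (n * phi), 0) - pmax(-r + (tau - 1) / (n * phi), 0)
}
```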

Finally, we update γ1 and γg via gradient ascent,

\[
\begin{aligned}
\gamma_1^{k+1} &= \gamma_1^k + \theta\phi\Big(X_1\beta_1^{k+1} + z^{k+1} + \sum_{g=2}^G\omega_g^{k+1} - y\Big),\\
\gamma_g^{k+1} &= \gamma_g^k + \theta\phi\big(X_g\beta_g^{k+1} - \omega_g^{k+1}\big),\quad g = 2, \ldots, G. \tag{14}
\end{aligned}
\]

We call this algorithm FS-QRADMM-CD, and summarize it in Algorithm 1. From our numerical studies, we observe that FS-QRADMM-CD has favorable practical performances.

Algorithm 1.

FS-QRADMM-CD for weighted $\ell_1$-penalized QR

Initialization: $\beta^0, \omega^0, z^0, \gamma^0$, and $\phi > 0$, $\theta > 0$ are given.
while the stopping criterion is not satisfied, do
  Compute βk+1 by (12) using CD algorithm.
  Compute $\omega^{k+\frac12}$, $z^{k+1}$ and $\omega^{k+1}$ by (13).
  Update γk+1 by (14).
end while

Besides using the coordinate descent algorithm to update $\beta$, we have another solution for the $\beta$ update. To ensure that solutions from (12) are well defined, we add $G$ self-adjoint positive semidefinite matrices, denoted $\mathcal{T}_g$, $g = 1, \ldots, G$, to (12). A general principle is that $\mathcal{T}_g$ should be as small as possible while keeping the optimization problems easy to compute. Here we add proximal terms $\frac{1}{2}\|\beta_g - \beta_g^k\|_{\mathcal{T}_g}^2$, $g = 1, \ldots, G$, to each of the $\beta$-subproblems, where the proximal operator $\mathcal{T}_g$ is positive definite. The positive definiteness of $\mathcal{T}_g$ makes $\{\beta^k\}$ automatically well defined. In this paper, we take $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$ with $\eta_g > \phi\lambda_{\max}(X_g^TX_g)$. This is essentially a linearization step of the $\beta$ update, as it uses $\eta_g I_{p_g}$ to approximate the Hessian matrix $X_g^TX_g$. The modified minimization problem admits a closed-form solution, which can be carried out componentwise,

\[
\begin{aligned}
\beta_1^{k+1} &= S\Big(\beta_1^k - \frac{\phi}{\eta_1}X_1^T\big(X_1\beta_1^k + \sum_{g=2}^G\omega_g^k + z^k - y + \frac{\gamma_1^k}{\phi}\big),\ \frac{\alpha_1\lambda}{\eta_1}\Big),\\
\beta_g^{k+1} &= S\Big(\beta_g^k - \frac{\phi}{\eta_g}X_g^T\big(X_g\beta_g^k - \omega_g^k + \frac{\gamma_g^k}{\phi}\big),\ \frac{\alpha_g\lambda}{\eta_g}\Big),\quad g = 2, \ldots, G, \tag{15}
\end{aligned}
\]

where $S(x, t) = \mathrm{sign}(x)(|x| - t)\,I(|x| > t)$ is the soft-thresholding function.

The updates in (15) manifest one advantage of splitting the feature space into lower dimensions. The $\beta$ update can be regarded as a one-step proximal gradient iteration. After feature-splitting, the $\eta_g$'s are relatively small compared with the "un-split" $\eta$, as $\eta$ needs to be larger than $\phi\lambda_{\max}(X^TX)$. Since $\eta$ increases significantly with $p$ for high dimensional data, the step size for the update (i.e., $1/\eta$) can be rather small and slow down the convergence of the algorithm. The updates for $\omega$, $z$ and $\gamma$ in this algorithm are exactly the same as those in Algorithm 1. We use FS-QRADMM-prox to denote this algorithm and summarize it in Algorithm 2. Note that $\mathcal{T}_g \succeq 0$ is also required in the proof of the convergence of $\{\beta^k\}$.

Thus, we compute $\beta_2, \ldots, \beta_G$ on separate processors/cores in the manner of parallel computing, and then aggregate the updated information to compute the other variables.
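A compact R sketch of the linearized update (15) (again an illustration, not the authors' code) makes the per-block parallel structure explicit: each block needs only $X_g$, its own dual variables, and the scalar $\eta_g$.

```r
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

# one proximal-linearized step (15) for block g >= 2;
# eta_g should exceed phi * lambda_max(t(Xg) %*% Xg)
update_beta_prox <- function(Xg, beta_g, omega_g, gamma_g, alpha_g,
                             lambda, phi, eta_g) {
  grad <- crossprod(Xg, Xg %*% beta_g - omega_g + gamma_g / phi)
  drop(soft_threshold(beta_g - (phi / eta_g) * grad, alpha_g * lambda / eta_g))
}
```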

We establish the linear rate of convergence for Algorithm 2 in Theorem 1, in which the proximal term is necessary for establishing the theory; its proof is given in Appendix A.

Theorem 1. For $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by Algorithm 2 converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$, where $(\bar\beta, \bar z, \bar\omega)$ is primal optimal and $\bar\gamma$ is dual optimal. Furthermore, there exists a constant $\mu \in (0, 1)$ such that $\mathrm{Dist}_{k+1} \le \mu\,\mathrm{Dist}_k$, where $\mathrm{Dist}_k$ at the $k$-th iteration is defined as

\[
\begin{aligned}
\mathrm{Dist}_k ={}& \|z^k - \bar z\|_2^2 + \frac{G-1}{G}\|z^k - z^{k-1}\|_2^2 + \sum_{g=1}^G\Big\|X_g(\beta_g^k - \bar\beta_g) - \frac{1}{G}(X\beta^k - X\bar\beta)\Big\|_2^2\\
&+ \frac{m_1}{G}\Big\|\sum_{g=1}^G X_g(\beta_g^k - \bar\beta_g) + (z^k - \bar z)\Big\|_2^2 + \sum_{g=1}^G\|\beta_g^k - \bar\beta_g\|_{\mathcal{T}_g}^2, \tag{16}
\end{aligned}
\]

where $m_1 = 1 + d_1 - d_1\theta - (1 - d_1)\min\{\theta, 1-\theta\}$ and $d_1 \in (0, \tfrac12)$.

Algorithm 2.

FS-QRADMM-prox for weighted $\ell_1$-penalized QR

Initialization: $\beta^0, \omega^0, z^0, \gamma^0$, and $\phi > 0$, $\theta > 0$ are given; $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$
 with $\eta_g > \phi\lambda_{\max}(X_g^TX_g)$, $g = 1, \ldots, G$.
while the stopping criterion is not satisfied, do
  Compute βk+1 by (15).
  Compute $\omega^{k+\frac12}$, $z^{k+1}$ and $\omega^{k+1}$ by (13).
  Update γk+1 by (14).
end while

Remark. The minimization problem for searching the solution of the penalized quantile regression can be written as a linear programming problem. Both the primal and dual problems are feasible. By strong duality, the optimal value of the dual problem equals the optimal value of the linear programming (primal) problem. Thus both the optimal values of the primal and dual problems equal the optimal value of the penalized quantile regression problem.

The effect of G on the convergence is twofold. On the one hand, increasing G reduces the dimension of subproblems and the value of η, and thus it accelerates the computation of each sub-problem. On the other hand, increasing G leads to an increased number of sub-problems and may raise the value of μ. This slows down the convergence both practically and theoretically. In our numerical experiments, it seems that choosing G from 5 to 10 works well for p ranging from thousands to tens of thousands.

2.3. PQR-Lasso and PQR-SCAD

In this paper, PQR-Lasso refers to the PQR in (2) with the $\ell_1$ penalty, $p_\lambda(|\beta|) = \lambda|\beta|$. Thus, the PQR-Lasso can be solved by Algorithms 1 and 2 directly with all weights $\alpha_j = 1$, $j = 1, \ldots, p$, in (6). The resulting solutions from Algorithms 1 and 2 for the PQR-Lasso are denoted by FS-QRADMM-CD(Lasso) and FS-QRADMM-prox(Lasso) in Section 3, respectively.

Parallel to the PQR-Lasso, PQR-SCAD refers to the PQR in (2) with the SCAD penalty, whose first-order derivative is defined in (3). Since the SCAD penalty is folded concave, the objective function of PQR-SCAD may have multiple local minimizers. To avoid this issue, we recommend (a) using the proposed algorithm to obtain the PQR-Lasso estimate $\hat\beta_L = (\hat\beta_{L,1}, \ldots, \hat\beta_{L,p})^T$, and then (b) solving the PQR with the weighted $\ell_1$ penalty, in which the weight $\alpha_j$ is $\lambda^{-1}p'_\lambda(|\hat\beta_{L,j}|)$ with $p'_\lambda(|\beta|)$ being the first-order derivative of the SCAD penalty. We refer to the resulting estimate as the two-step PQR-SCAD estimate. Note that both the $\ell_1$ penalty and the SCAD-based weighted $\ell_1$ penalty are convex. The two-step SCAD estimate is well defined when $L(y - X\beta)$ is strictly convex with respect to $\beta$. Denote by FS-QRADMM-CD(TS-SCAD) and FS-QRADMM-prox(TS-SCAD) the resulting solutions of Algorithms 1 and 2 for the two-step PQR-SCAD. The corresponding versions of FS-QRADMM-CD and FS-QRADMM-prox for the two-step PQR-SCAD are presented in Algorithms 3 and 4 in Section A.3 in the Appendix.

The two-step PQR-SCAD shares the same spirit as the one-step sparse maximum likelihood estimation proposed in Zou and Li (2008) for folded concave penalization problems. The second step in the two-step PQR-SCAD corrects the bias inherent in the $\ell_1$ penalty, which is known to over-penalize large coefficients and introduce bias into the resulting model. As shown in Corollary 8 in Fan et al. (2014), the two-step PQR-SCAD can find the oracle estimator among multiple local minima with overwhelming probability, under certain regularity conditions. This provides theoretical justification for the two-step SCAD. In other words, the resulting solutions of Algorithms 3 and 4 for the two-step PQR-SCAD enjoy the strong oracle property in the terminology of Fan et al. (2014).

The two-step PQR-SCAD procedure can be extended to two-step PQR with a general folded concave penalty characterized by the following conditions: (a) $p_\lambda(t)$ is nondecreasing and concave for $t \in [0, \infty)$ with $p_\lambda(0) = 0$; (b) $p_\lambda(t)$ is differentiable in $(0, \infty)$; (c) for some positive constants $a_1$ and $a_2$, $p'_\lambda(t) \ge a_1\lambda$ for $t \in (0, a_2\lambda]$; and (d) $p'_\lambda(t) = 0$ for $t \in [a\lambda, \infty)$ with $a > 1$. As shown in Fan et al. (2014), the two-step PQR with a general folded concave penalty also enjoys the strong oracle property under certain regularity conditions.

It is desirable to have a data-driven method to select the regularization parameters in PQR-Lasso and PQR-SCAD. In our numerical study, we set the same penalty and tuning parameter for all coefficients, and $\lambda$ is chosen by the HBIC criterion proposed in Lee et al. (2014),

\[
\mathrm{HBIC}(\lambda) = \log\Big\{\sum_{i=1}^n\rho_\tau(y_i - x_i^T\hat\beta)\Big\} + |\mathcal{A}|\,\frac{\log(\log n)\log(p)}{n}, \tag{17}
\]

where $|\mathcal{A}|$ is the cardinality of the active set. We select the $\lambda$ that minimizes HBIC.
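In R, the HBIC in (17) for a candidate $\lambda$ reduces to a few lines; this is a sketch under the assumption that the fitted coefficient vector at that $\lambda$ is available, not the authors' code.

```r
# HBIC criterion (17) evaluated at a fitted coefficient vector beta_hat
hbic <- function(y, X, beta_hat, tau) {
  n <- length(y); p <- ncol(X)
  u <- y - X %*% beta_hat
  fit  <- sum(u * (tau - (u < 0)))   # sum of check losses
  size <- sum(beta_hat != 0)         # |A|, the active-set size
  log(fit) + size * log(log(n)) * log(p) / n
}
# lambda is then chosen as the minimizer of hbic() over a grid of candidates
```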

Wang et al. (2013) recommends using different $\lambda$'s in the first and second steps in the penalized least squares setting to ensure that the resulting Lasso estimate satisfies a certain rate of convergence. Denote by $\lambda_1$ and $\lambda$ the regularization parameters used in the first and second steps, respectively. Following the recommendation in Wang et al. (2013), we choose $\lambda_1 = \upsilon\lambda$, where $\upsilon > 0$ tends to 0 as $n \to \infty$. We set $\upsilon = \lambda$, as suggested by Wang et al. (2013), in our numerical studies in Section 3.

3. Numerical Studies

In this section, we assess the performance of the proposed algorithms via simulation studies and illustrate the application of the newly proposed procedure via an empirical analysis. For all ADMM-based methods, we implement the warm-start technique introduced in Friedman et al. (2007) and Friedman et al. (2010), which uses the solution from the previous $\lambda$ to initialize computation at the current $\lambda$. The way of splitting the features has no influence on the convergence property of the algorithm. We equally distribute the features into $G$ groups without adjusting their order in our numerical studies. The stopping criterion of the ADMM-based algorithms is provided in the Appendix.
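Because the ordering of the features does not matter for convergence, the splitting just described can be mimicked with a one-liner in R (an illustrative sketch, not the study's code).

```r
# distribute the p feature indices into G contiguous groups of roughly equal size
split_features <- function(p, G) {
  split(seq_len(p), cut(seq_len(p), breaks = G, labels = FALSE))
}
# e.g., split_features(10, 3) gives contiguous blocks of sizes 4, 3 and 3
```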

3.1. Simulation study

In this simulation, we compare the performance of Algorithms 1 and 2 with the R packages rqPen (Sherwood and Maidman, 2017), qradmm (Gu et al., 2018), hqReg (Yi and Huang, 2017) and Conquer (Tan et al., 2022). Since the qradmm package is accelerated by FORTRAN, we re-implement its core algorithm, i.e., a two-block proximal ADMM, in R code for a relatively fair comparison. The R package rqPen implements an iterative coordinate descent algorithm (QICD) proposed in Peng and Wang (2015) to solve sparse quantile regression. QICD applies a convex majorization function to the concave penalty term, and solves the majorized objective function by coordinate descent. The R package qradmm implements a two-block proximal ADMM for PQR with the $\ell_1$ penalty proposed in Gu et al. (2018). We use the R packages hqreg and conquer to implement the methods proposed by Yi and Huang (2017) and Tan et al. (2022), respectively. The regularization parameter $\lambda$ in all algorithms to be compared is selected by the HBIC criterion defined in (17).

We take a simulation setting similar to that of Peng and Wang (2015). We generate $\tilde z = (Z_1, Z_2, \ldots, Z_p)^T$ from $N_p(0, \Sigma)$, where $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = 0.5^{|i-j|}$. Then set $X_1 = \Phi(Z_1)$ and $X_j = Z_j$ for $j = 2, \ldots, p$, where $\Phi(\cdot)$ is the cumulative distribution function of $N(0, 1)$. The response variable $Y$ is generated from the following heteroscedastic regression model,

\[
Y = X_6 + X_{100} + X_{500} + X_{1000} + 0.7X_1\varepsilon, \tag{18}
\]

where $\varepsilon \sim N(0, 1)$. We consider three different quantile levels, $\tau = 0.3, 0.5$ and $0.7$. Note that $X_1$ does not affect the center of the conditional distribution of $Y$ given $x$, but it affects the conditional distribution when $\tau = 0.3$ or $0.7$. In our simulation, we set $n = 400$, and $p = 1000$ and $50000$. For each case, we conduct 500 replications.
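The AR(1) structure of $\Sigma$ makes it easy to generate such data without forming the $p \times p$ covariance matrix; the R sketch below (our own illustration of the stated design, not the authors' script) exploits $Z_j = 0.5Z_{j-1} + \sqrt{0.75}\,e_j$.

```r
# one replication from the heteroscedastic model (18) with AR(1) predictors
simulate_pqr_data <- function(n = 400, p = 1000) {
  Z <- matrix(0, n, p)
  Z[, 1] <- rnorm(n)
  for (j in 2:p) Z[, j] <- 0.5 * Z[, j - 1] + sqrt(0.75) * rnorm(n)  # corr 0.5^|i-j|
  X <- Z
  X[, 1] <- pnorm(Z[, 1])                                            # X1 = Phi(Z1)
  y <- X[, 6] + X[, 100] + X[, 500] + X[, 1000] + 0.7 * X[, 1] * rnorm(n)
  list(X = X, y = y)
}
```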

The following criteria are used to compare the performance of different algorithms.

  1. Average absolute error: the average and standard deviation of $\|\hat\beta - \beta\|_1 = \sum_{j=1}^p|\hat\beta_j - \beta_j|$ over 500 replications.

  2. Size: the average number of nonzero β^j’s over 500 replications.

  3. P1: the proportion of models that select all active features except for X1 over 500 replications

  4. P2: the proportion of models that select X1 over 500 replications.

The proportion P2 is expected to be close to 0 when τ=0.5, and be close to 1 when τ=0.3 and 0.7.

The simulation results over 500 replications are summarized in Tables 1 and 2. Compared to the PQR-Lasso, the two-step PQR-SCAD produces models with significantly smaller absolute error and better selection accuracy in general. FS-QRADMM-CD(TS-SCAD) and FS-QRADMM-prox(TS-SCAD) have the best performance with respect to estimation and variable selection accuracy. When p = 1000, the three ADMM-based methods perform comparably well and outperform rqPen, hqReg and Conquer by a significant margin. rqPen, hqReg and Conquer obtain relatively larger estimation errors and are more likely to miss X1 when τ = 0.3 and 0.7. The current version of rqPen runs out of memory when solving the two-step PQR-SCAD, as noted in the table. Moreover, when p = 50000, both qradmm and rqPen fail due to their demanding memory usage. In fact, we notice that the efficiency of qradmm deteriorates sharply as p increases. hqReg and Conquer are able to finish the job when p = 50000, but the proposed methods still outperform hqReg and Conquer.

Table 1:

Comparison of algorithms for PQR when p = 1000 and n = 400.

n = 400, p = 1000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.295 (0.003) 100% 100% 5.56 (0.03)
0.5 0.210 (0.003) 100% 5.4% 4.36 (0.03)
0.7 0.281 (0.003) 100% 100% 5.56 (0.03)

FS-QRADMM-prox (Lasso) 0.3 0.295 (0.003) 100% 100% 5.62 (0.03)
0.5 0.198 (0.003) 100% 4.6% 4.34 (0.02)
0.7 0.301 (0.003) 100% 100% 5.56 (0.03)

qradmm(Lasso) 0.3 0.310 (0.003) 100% 100% 5.68 (0.03)
0.5 0.230 (0.003) 100% 9% 5.32 (0.06)
0.7 0.327 (0.005) 100% 100% 6.73 (0.08)

rqPen(Lasso) 0.3 0.598 (0.004) 100% 61.2% 5.10 (0.04)
0.5 0.267 (0.003) 100% 0% 4.23 (0.02)
0.7 0.601 (0.004) 100% 56.6% 5.04 (0.04)

hqReg(Lasso) 0.3 0.593 (0.006) 100% 50% 4.95 (0.04)
0.5 0.235 (0.003) 100% 0% 4.31 (0.03)
0.7 0.589 (0.006) 100% 51.6% 4.97 (0.04)

Conquer(Lasso) 0.3 0.590 (0.005) 100% 45% 4.73 (0.03)
0.5 0.231 (0.002) 100% 0% 4.27 (0.02)
0.7 0.586 (0.005) 100% 45% 4.72 (0.03)

FS-QRADMM-CD (TS-SCAD) 0.3 0.119 (0.002) 100% 100% 5.00 (0.00)
0.5 0.035 (0.001) 100% 0.2% 4.00 (0.00)
0.7 0.125 (0.002) 100% 100% 5.00 (0.00)

FS-QRADMM-prox (TS-SCAD) 0.3 0.115 (0.002) 100% 100% 5.00 (0.00)
0.5 0.040 (0.001) 100% 0.2% 4.00 (0.00)
0.7 0.123 (0.001) 100% 100% 5.00 (0.00)

qradmm (TS-SCAD) 0.3 0.122 (0.002) 100% 100% 5.00 (0.00)
0.5 0.038 (0.001) 100% 0.4% 4.00 (0.00)
0.7 0.129 (0.002) 100% 100% 5.00 (0.00)

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.339 (0.004) 100% 63.4% 4.67 (0.02)
0.5 0.049 (0.001) 100% 0% 4.07 (0.01)
0.7 0.350 (0.004) 100% 57% 4.60 (0.02)

Table 2:

Comparison of algorithms for PQR when p = 50000 and n = 400.

n = 400, p = 50000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.320 (0.003) 100% 98.2% 5.34 (0.02)
0.5 0.250 (0.003) 100% 2% 4.25 (0.03)
0.7 0.349 (0.003) 100% 100% 5.15 (0.03)

FS-QRADMM-prox (Lasso) 0.3 0.326 (0.003) 100% 92.4% 4.93 (0.01)
0.5 0.121 (0.001) 100% 0% 4.01 (0.00)
0.7 0.394 (0.002) 100% 95.6% 5.01 (0.11)

qradmm(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

rqPen(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

hqReg(Lasso) 0.3 0.812 (0.004) 100% 2.2% 4.23 (0.02)
0.5 0.365 (0.003) 100% 0% 4.20 (0.02)
0.7 0.808 (0.004) 100% 4.4% 4.26 (0.02)

Conquer(Lasso) 0.3 0.717 (0.003) 100% 18% 5.88 (0.07)
0.5 0.303 (0.002) 100% 0% 8.63 (0.11)
0.7 0.705 (0.003) 100% 26.8% 6.13 (0.07)

FS-QRADMM-CD (TS-SCAD) 0.3 0.180 (0.003) 100% 98.8% 4.99 (0.00)
0.5 0.047 (0.001) 100% 0% 4.00 (0.00)
0.7 0.172 (0.003) 100% 99.6% 5.00 (0.03)

FS-QRADMM-prox (TS-SCAD) 0.3 0.158 (0.002) 100% 100% 5.00 (0.00)
0.5 0.069 (0.005) 100% 2.2% 7.31 (0.47)
0.7 0.244 (0.007) 100% 99.2% 6.64 (0.14)

qradmm (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.396 (0.002) 100% 48.4% 5.04 (0.04)
0.5 0.058 (0.001) 100% 0% 5.78 (0.07)
0.7 0.390 (0.003) 100% 49.6% 5.12 (0.05)

Figure 1 plots the curves of $\|\hat\beta - \beta\|_1$ with respect to the iteration steps, averaged over 500 replications, when n = 400, p = 1000, and τ = 0.3, 0.5, 0.7. We can see that Algorithms 1 and 2 converge to the true β within approximately 20 iterations.

Figure 1: Convergence curves of $\|\hat\beta - \beta\|_1$ for FS-QRADMM-CD(Lasso) (left panel) and FS-QRADMM-prox(Lasso) (right panel) over 500 replications.

We next examine the performance of the proposed algorithms when the sample size is large. To this end, we conduct a simulation with n = 30000 and p = 1000. The simulation results are summarized in Table 3, from which it can be seen that the two proposed algorithms and qradmm have essentially the same performance and perform better than the conquer algorithm.

Table 3:

Performance of proposed algorithms for PQR when p = 1000 and n = 30000.

n = 30000, p = 1000 τ $\|\hat\beta - \beta\|_1$ P1 P2 Size
FS-QRADMM-CD (Lasso) 0.3 0.031 (0.0004) 100% 100% 5.06 (0.003)
0.5 0.020 (0.0004) 100% 0.4% 4.04 (0.003)
0.7 0.029 (0.0004) 100% 100% 5.06 (0.003)

FS-QRADMM-prox (Lasso) 0.3 0.029 (0.0003) 100% 100% 5.05 (0.003)
0.5 0.020 (0.0003) 100% 0.5% 4.03 (0.002)
0.7 0.029 (0.0003) 100% 100% 5.05 (0.003)

qradmm(Lasso) 0.3 0.030 (0.0004) 100% 100% 5.08 (0.003)
0.5 0.023 (0.0003) 100% 0.6% 4.05 (0.004)
0.7 0.029 (0.0004) 100% 100% 5.04 (0.004)

rqPen(Lasso) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

hqReg(Lasso) 0.3 0.040 (0.0008) 100% 100% 6.06 (0.09)
0.5 0.020 (0.0002) 100% 1% 4.60 (0.07)
0.7 0.040 (0.0008) 100% 51.6% 5.90 (0.09)

Conquer(Lasso) 0.3 0.066 (0.0009) 100% 100% 5.23 (0.04)
0.5 0.020 (0.0003) 100% 0% 4.11 (0.04)
0.7 0.065 (0.0009) 100% 100% 5.25 (0.05)

FS-QRADMM-CD (TS-SCAD) 0.3 0.012 (0.0005) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0002) 100% 0.2% 4.00 (0.00)
0.7 0.013 (0.0004) 100% 100% 5.00 (0.00)

FS-QRADMM-prox (TS-SCAD) 0.3 0.011 (0.0003) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0002) 100% 0% 4.00 (0.00)
0.7 0.012 (0.0003) 100% 100% 5.00 (0.00)

qradmm (TS-SCAD) 0.3 0.012 (0.0004) 100% 100% 5.00 (0.00)
0.5 0.004 (0.0003) 100% 0% 4.00 (0.00)
0.7 0.011 (0.0005) 100% 100% 5.00 (0.00)

rqPen (TS-SCAD) The algorithm runs out of memory for τ = 0.3, 0.5, 0.7

Conquer(SCAD) 0.3 0.028 (0.0008) 100% 100% 5.15 (0.039)
0.5 0.005 (0.0002) 100% 1% 4.45 (0.061)
0.7 0.029 (0.0008) 100% 100% 5.14 (0.037)

3.2. A real data example

The QR model is widely adopted in the analysis of consumer markets due to its robustness against outliers. In this section, we apply the proposed algorithms to an empirical analysis of a supermarket data set studied in Wang (2009) and compare them with other existing algorithms. This data set contains the daily number of customers and the daily sale volumes of 6398 products from a supermarket in China over 464 days. Following Wang (2009), we set the response to be the daily number of customers, and the predictors to be the daily sale volumes of the products. Since the sample size n = 464 is much less than the dimension p = 6398, it is reasonable to assume that only a small proportion of the predictors have significant effects on the response. The distribution of the number of customers is highly skewed. This motivates us to consider PQR with the proposed algorithm in this example. We standardize the response and the predictors for our analysis.

We randomly split the observations into training and testing datasets of sizes 300 and 164, respectively, and fit PQR-Lasso and two-step PQR-SCAD on the training data with τ = 0.3, 0.5 and 0.7. The regularization parameter λ is chosen by the HBIC criterion. We report the averaged predictive error and its standard deviation on the testing data over 100 replications in Table 4. The predictive error is measured by the loss function $\frac{1}{n}\sum_{i=1}^n\rho_\tau(y_i - \hat y_i)$. We also report the average model sizes and their corresponding standard deviations to evaluate the interpretability of the models selected by the different methods. For PQR-Lasso, we observe that the ADMM-based algorithms have similar performance to that of rqPen and hqReg in terms of prediction error. The average values and standard deviations of the loss function are very close among those methods. In general, all methods perform best when τ = 0.5. The proposed method selects fewer products than qradmm, rqPen and hqReg do in most scenarios, which indicates better model interpretability.
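The predictive error reported in Table 4 is the average check loss on the test set, which can be computed as follows (a small sketch, not the analysis script).

```r
# average quantile check loss on held-out data, as used for Table 4
quantile_pred_error <- function(y_test, X_test, beta_hat, tau) {
  u <- y_test - drop(X_test %*% beta_hat)
  mean(u * (tau - (u < 0)))
}
```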

Table 4:

Performance of the compared algorithms for sparse quantile regression on the Chinese supermarket data.

τ $\frac{1}{n}\sum_{i=1}^n\rho_\tau(y_i - \hat y_i)$ Size
FS-QRADMM-CD (Lasso) 0.3 0.118 (0.001) 97.35 (0.62)
0.5 0.116 (0.001) 100.56 (0.53)
0.7 0.127 (0.001) 97.37 (0.71)

FS-QRADMM-prox (Lasso) 0.3 0.117 (0.001) 103.23 (1.13)
0.5 0.113 (0.001) 118.61 (0.93)
0.7 0.127 (0.001) 96.02 (1.06)

qradmm(Lasso) 0.3 0.115 (0.000) 119.01 (0.49)
0.5 0.116 (0.001) 121.26 (0.37)
0.7 0.130 (0.000) 127.35 (0.58)

rqPen(Lasso) 0.3 0.117 (0.001) 113.62 (0.73)
0.5 0.115 (0.001) 117.17 (0.56)
0.7 0.128 (0.001) 120.11 (0.62)

hqReg(Lasso) 0.3 0.117 (0.001) 49.31 (0.46)
0.5 0.116 (0.001) 90.85 (0.65)
0.7 0.127 (0.001) 43.42 (0.42)

Conquer(Lasso) 0.3 0.118 (0.001) 80.8 (1.47)
0.5 0.114 (0.001) 42.9 (0.58)
0.7 0.125 (0.001) 39.6 (0.49)

FS-QRADMM-CD (TS-SCAD) 0.3 0.112 (0.000) 63.86 (0.39)
0.5 0.111 (0.001) 69.77 (0.47)
0.7 0.116 (0.001) 72.71 (0.61)

FS-QRADMM-prox (TS-SCAD) 0.3 0.116 (0.000) 97.03 (0.96)
0.5 0.110 (0.000) 100.33 (1.02)
0.7 0.113 (0.000) 95.66 (0.79)

qradmm (TS-SCAD) 0.3 0.113 (0.001) 469.88 (2.56)
0.5 0.114 (0.001) 477.72 (2.33)
0.7 0.120 (0.001) 521.33 (1.99)

rqPen(TS-SCAD) The algorithm runs out of memory for the three τs

Conquer(SCAD) 0.3 0.115 (0.001) 65.7 (1.95)
0.5 0.113 (0.001) 65.7(0.82)
0.7 0.120 (0.001) 36.0 (0.48)

Similar results are also observed for the two-step PQR-SCAD. However, rqPen for the two-step SCAD fails in this example due to memory limitations. The proposed algorithms have prediction errors similar to those of qradmm, but the model sizes are much smaller. When τ = 0.7, the proposed methods outperform qradmm, with fewer products included in the QR model. Conquer with the SCAD penalty has similar performance to the proposed method under this scenario. We also notice that PQR-SCAD yields a smaller loss than PQR-Lasso does, and the two-step PQR-SCAD procedures select fewer products when the proposed algorithms are implemented.

4. Conclusion

The QR model is a powerful data analytic tool in econometrics. To promote the application of QR in high/ultrahigh dimensions, in this paper we propose efficient and parallelizable algorithms for PQR based on a three-block ADMM algorithm with feature-splitting, and further establish the convergence of the proposed algorithms. Owing to the nature of the feature-splitting algorithm, the proposed algorithms can be used to minimize the objective function of PQR in ultrahigh dimension. Our numerical study implies that the proposed algorithms outperform existing ones for PQR. To illustrate the performance of the proposed methods, we conduct a comprehensive simulation study. The numerical experiments suggest that the proposed method is stable when the dimension of the data is huge, while existing algorithms run out of memory and fail to accomplish the tasks. The proposed algorithms may be extended to other statistical models such as the support vector machine, whose loss function is similar to that of QR. This is an interesting topic for future research.

Acknowledgment

Christina Dan Wang is supported in part by National Natural Science Foundation of China (NNSFC) grant 11901395 and 12271363. Li’s research was supported by National Science Foundation DMS-1820702 and NIAID/NIH grants R01-AI136664 and R01AI170249.

Appendix: Technical Details and Proofs

In this appendix, we first provide details of how to update each variable in Algorithm 2, and then provide technical proofs of Theorem 1.

A.1. Sub-problems in Algorithm 2

In this subsection, we derive the updates for β,z and ω in Algorithm 2. For ease of notation, define a set of functions f, g, h.

\[
f(\beta) = n\lambda\sum_{g=1}^G\|\alpha_g\circ\beta_g\|_1, \qquad h(\omega) = 0, \qquad g(z) = \tau\mathbf{1}^T(z)_+ + (1-\tau)\mathbf{1}^T(z)_-. \tag{A.1}
\]

Thus, f, g, h are closed proper convex functions. Further define matrices F, G, H

\[
F = \mathrm{Diag}(X_1, X_2, \ldots, X_G), \qquad G = (I_n, 0, \ldots, 0)^T, \qquad
H = \begin{pmatrix} I_n & I_n & \cdots & I_n \\ -I_n & 0 & \cdots & 0 \\ 0 & -I_n & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & -I_n \end{pmatrix}. \tag{A.2}
\]

Then Problem (9) can be expressed as a general three-block constrained optimization problem,

\[
\min_{\beta, z, \omega}\big\{f(\beta) + g(z) + h(\omega) \mid F\beta + Gz + H\omega = c\big\}, \tag{A.3}
\]

where, by the definitions of $F$, $G$ and $H$, $c = (y^T, 0, \ldots, 0)^T$ and

\[
F\beta = \begin{pmatrix} X_1\beta_1 \\ X_2\beta_2 \\ \vdots \\ X_G\beta_G \end{pmatrix}, \qquad
Gz = \begin{pmatrix} z \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
H\omega = \begin{pmatrix} \omega_2 + \cdots + \omega_G \\ -\omega_2 \\ \vdots \\ -\omega_G \end{pmatrix}. \tag{A.4}
\]

As in sGS-sPADMM proposed by Sun et al. (2015), we update the three-block variables using a special cycle,

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k, \omega^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{\mathcal{T}_f}^2,\\
\omega^{k+\frac12} = (H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^k),\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{\mathcal{T}_h}^2,\\
\omega^{k+1} = (H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^{k+1}),\\
\gamma^{k+1} = \gamma^k + \theta\phi\big(F\beta^{k+1} + Gz^{k+1} + H\omega^{k+1} - c\big),
\end{cases}
\tag{A.5}
\]

where $\mathcal{T}_f$ and $\mathcal{T}_h$ are optionally added self-adjoint positive semidefinite operators. To update $\omega$, we need to compute $(H^TH)^{-1}$. Since

\[
H^TH = \begin{pmatrix} I_n & & \\ & \ddots & \\ & & I_n \end{pmatrix} + \begin{pmatrix} I_n \\ \vdots \\ I_n \end{pmatrix}\begin{pmatrix} I_n \\ \vdots \\ I_n \end{pmatrix}^T,
\]

we apply the Sherman–Morrison–Woodbury formula to compute $(H^TH)^{-1}$, and it follows that

\[
\omega_i^{k+\frac12} = \big[(H^TH)^{-1}H^T(c - F\beta^{k+1} - Gz^k)\big]_i = \frac{1}{G}\Big(y - z^k + GX_i\beta_i^{k+1} - \sum_{j=1}^G X_j\beta_j^{k+1}\Big), \quad i = 2, \ldots, G. \tag{A.6}
\]

In the $z$-subproblem, we set $\mathcal{T}_h = 0$, and then we have

\[
\begin{aligned}
z^{k+1} &= \arg\min_z \Phi(\beta^{k+1}, z, \omega^{k+\frac12}; \gamma^k)\\
&= \arg\min_z \Big\{\frac{1}{n}\sum_{i=1}^n\rho_\tau(z_i) + (\gamma_1^k)^T\Big(X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y\Big) + \frac{\phi}{2}\Big\|X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y\Big\|_2^2\Big\}\\
&= \arg\min_z\ \frac{1}{n}\sum_{i=1}^n\rho_\tau(z_i) + \frac{\phi}{2}\Big\|X_1\beta_1^{k+1} + z + \sum_{i=2}^G\omega_i^{k+\frac12} - y + \frac{\gamma_1^k}{\phi}\Big\|_2^2. \tag{A.7}
\end{aligned}
\]

The closed-form solution of the z-subproblem can be easily derived as

\[
z^{k+1} = \max\Big(y - X_1\beta_1^{k+1} - \sum_{i=2}^G\omega_i^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi},\ 0\Big) - \max\Big(-\Big(y - X_1\beta_1^{k+1} - \sum_{i=2}^G\omega_i^{k+\frac12} - \frac{\gamma_1^k}{\phi} - \frac{\tau}{n\phi} + \frac{1}{n\phi}\Big),\ 0\Big). \tag{A.8}
\]

A.2. Proof of Theorem 1

We first show Lemmas A.1, A.2 and A.3, which are used in the proof of Theorem 1. From (A.2), we have Fact 1 below.

Fact 1. $H^TH$ is positive definite.

Assumption 1 below is imposed to obtain theoretical guarantees on the feasibility and convergence of the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$.

Assumption 1. There exists $(\hat\beta, \hat z, \hat\omega) \in \mathbb{R}^{p_1}\times\mathbb{R}^{p_2}\times\mathbb{R}^{p_3}$ such that $F\hat\beta + G\hat z + H\hat\omega = c$.

For algorithm (A.5), the projection matrix $\mathcal{P} = H(H^TH)^{-1}H^T$ plays an important role in the convergence analysis. Let $\mathcal{Q} = I - \mathcal{P}$. Since $\omega$ can be expressed as $\omega(\beta, z) = (H^TH)^{-1}H^T(c - F\beta - Gz)$, it follows that $H\omega = \mathcal{P}(c - F\beta - Gz)$. Given that $h(\omega) = 0$ in our case, we can now rewrite (A.3) as

\[
\min_{\beta, z}\big\{f(\beta) + g(z) \mid \mathcal{Q}(F\beta + Gz - c) = 0\big\}. \tag{A.9}
\]

Stopping Criterion. In the implementation of Algorithm 2, we use the same stopping criterion as that introduced in Boyd et al. (2011). The primal and dual residuals are often used to characterize the convergence stage. Define $r^{k+1} = \big(\|X_1\beta_1^{k+1} + z^{k+1} + \omega_2^{k+1} + \cdots + \omega_G^{k+1} - y\|_2^2 + \sum_{g=2}^G\|X_g\beta_g^{k+1} - \omega_g^{k+1}\|_2^2\big)^{0.5}$ as the primal residual and $s^{k+1} = (\phi/G)\,\|(X_1^T, \ldots, X_G^T)^T(z^{k+1} - z^k)\|_2$ as the dual residual at the $(k+1)$th iteration. The termination criterion is

\[
\|r^k\|_2 \le \epsilon^{\mathrm{pri}} \quad\text{and}\quad \|s^k\|_2 \le \epsilon^{\mathrm{dual}}, \tag{A.10}
\]

where $\epsilon^{\mathrm{pri}} > 0$ and $\epsilon^{\mathrm{dual}} > 0$ are feasibility tolerances chosen as $\epsilon^{\mathrm{pri}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\sqrt{G}\max\big(\|X\beta^k\|_2, \|z^k\|_2, \|c\|_2\big)$ and $\epsilon^{\mathrm{dual}} = \sqrt{p}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\sqrt{G}\,\|F^T\mathcal{Q}\gamma^k\|_2$. A common choice is $\epsilon^{\mathrm{abs}} = 0.001$ and $\epsilon^{\mathrm{rel}} = 0.001$.

The augmented Lagrangian function for (A.9) is given by

\[
\Phi(\beta, z; \gamma) = f(\beta) + g(z) + \langle\gamma,\ \mathcal{Q}(F\beta + Gz - c)\rangle + \frac{\phi}{2}\|\mathcal{Q}(F\beta + Gz - c)\|_2^2.
\]

Using arguments similar to those in Sun et al. (2015), it follows that applying the updates in (A.5) to problem (A.3) is equivalent to applying the following two-block semi-proximal ADMM to (A.9):

\[
\begin{cases}
\beta^{k+1} = \arg\min_\beta \Phi(\beta, z^k; \gamma^k) + \frac{\phi}{2}\|\beta - \beta^k\|_{F^T\mathcal{P}F + \mathcal{T}_f}^2,\\
z^{k+1} = \arg\min_z \Phi(\beta^{k+1}, z; \gamma^k) + \frac{\phi}{2}\|z - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2,\\
\gamma^{k+1} = \gamma^k + \theta\phi\,\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c).
\end{cases}
\tag{A.11}
\]

The Karush–Kuhn–Tucker (KKT) optimality condition of (A.9) is

\[
0 \in (\mathcal{Q}F)^T\gamma + \partial f(\beta), \qquad 0 \in (\mathcal{Q}G)^T\gamma + \partial g(z), \qquad \mathcal{Q}(c - F\beta - Gz) = 0. \tag{A.12}
\]

Denote the solution set to (A.12) as $\bar\Omega$; then we can replace Assumption 1 by assuming that $\bar\Omega$ is non-empty. Let $\bar u = (\bar\beta, \bar z, \bar\gamma)$ be an optimal solution to (A.9). We have the following lemma on the convergence of the proposed algorithm by utilizing its equivalence to the updates in (A.11).

Lemma A.1. Suppose Assumption 1 holds, and $\mathcal{T}_f$ and $\mathcal{T}_h$ are chosen such that $\mathcal{T}_f + F^TF$ and $\mathcal{T}_h + G^TG$ are positive definite. Then, under the condition $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by (A.5) converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$ with $(\bar\beta, \bar z, \bar\omega)$ solving (A.3) and $\bar\gamma$ being dual optimal.

Lemma A.1 follows by a direct application of Theorem 3.2 in Han et al. (2018). Based on (A.2), we have the following fact.

Fact 2. Suppose $u^k$ converges to $\bar u \in \bar\Omega$. There exists a positive constant $q$ such that

\[
\|u^k - \bar u\|_2^2 \le q^2\Big(\|\beta^k - \mathrm{prox}_f\big(\beta^k - (\mathcal{Q}F)^T\gamma^k\big)\|_2^2 + \|z^k - \mathrm{prox}_g\big(z^k - (\mathcal{Q}G)^T\gamma^k\big)\|_2^2 + \|\mathcal{Q}(c - F\beta^k - Gz^k)\|_2^2\Big), \tag{A.13}
\]

for a sufficiently large k.

For any convex function $P$, $\mathrm{prox}_P(\cdot)$ denotes the proximal mapping associated with $P$. That is,

\[
\mathrm{prox}_P(x) = \arg\min_y\Big\{\frac{1}{2}\|x - y\|_2^2 + P(y)\Big\}. \tag{A.14}
\]

Denote $\mathcal{M} = C\times\mathrm{Diag}\big(F^T\mathcal{P}F + \mathcal{T}_f,\ G^TG + \mathcal{T}_h,\ \theta^{-2}\phi^{-1}I\big)$, where $C = \max\big\{3\phi^2\|F^T\mathcal{P}F + \mathcal{T}_f\|_2,\ 3\phi^2\lambda_{\max}(FF^T),\ 2\phi^2\|G^T\mathcal{P}G + \mathcal{T}_h\|_2,\ 3(1 - \tfrac{1}{\theta})^2\phi\lambda_{\max}(\mathcal{Q}FF^T\mathcal{Q}) + 2(1 - \tfrac{1}{\theta})^2\phi\lambda_{\max}(\mathcal{Q}GG^T\mathcal{Q}) + \tfrac{1}{\phi}\big\}$; then we have the following relationship.

Lemma A.2. Suppose the sequence $u^k = (\beta^k, z^k, \gamma^k)$ is generated by algorithm (A.5); then for any $k \ge 0$,

\[
\|u^{k+1} - \bar u\|_2^2 \le q^2\|u^{k+1} - u^k\|_{\mathcal{M}}^2. \tag{A.15}
\]

Proof. Considering the optimality conditions of the subproblems in (A.11), we have

\[
\begin{aligned}
0 &\in \partial f(\beta^{k+1}) + (\mathcal{Q}F)^T\gamma^k + \phi(\mathcal{Q}F)^T\mathcal{Q}(F\beta^{k+1} + Gz^k - c) + \phi(F^T\mathcal{P}F + \mathcal{T}_f)(\beta^{k+1} - \beta^k),\\
0 &\in \partial g(z^{k+1}) + (\mathcal{Q}G)^T\gamma^k + \phi(\mathcal{Q}G)^T\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c) + \phi(G^T\mathcal{P}G + \mathcal{T}_h)(z^{k+1} - z^k),\\
0 &= (\theta\phi)^{-1}(\gamma^{k+1} - \gamma^k) - \mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c). \tag{A.16}
\end{aligned}
\]

Then we have $\mathcal{Q}(F\beta^{k+1} + Gz^k - c) = (\theta\phi)^{-1}(\gamma^{k+1} - \gamma^k) - \mathcal{Q}G(z^{k+1} - z^k)$, and it follows that

\[
\begin{aligned}
\beta^{k+1} &= \mathrm{prox}_f\Big(\beta^{k+1} - (\mathcal{Q}F)^T\big(\gamma^k + \theta^{-1}(\gamma^{k+1} - \gamma^k) - \phi\,\mathcal{Q}G(z^{k+1} - z^k)\big) - \phi(F^T\mathcal{P}F + \mathcal{T}_f)(\beta^{k+1} - \beta^k)\Big),\\
z^{k+1} &= \mathrm{prox}_g\Big(z^{k+1} - (\mathcal{Q}G)^T\big(\gamma^k + \theta^{-1}(\gamma^{k+1} - \gamma^k)\big) - \phi(G^T\mathcal{P}G + \mathcal{T}_h)(z^{k+1} - z^k)\Big),\\
\gamma^{k+1} &= \gamma^k + \theta\phi\,\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c),
\end{aligned}
\]

and we have

\[
\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2 + \|z^{k+1} - \mathrm{prox}_g(z^{k+1} - (\mathcal{Q}G)^T\gamma^{k+1})\|_2^2 + \|\mathcal{Q}(c - F\beta^{k+1} - Gz^{k+1})\|_2^2 \le \|u^{k+1} - u^k\|_{\mathcal{M}}^2. \tag{A.17}
\]

We first bound the term $\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2$. By the fact that the proximal mapping is Lipschitz continuous with constant 1, i.e., $\|\mathrm{prox}_h(x) - \mathrm{prox}_h(y)\|_2 \le \|x - y\|_2$ for any convex function $h$,

βk+1proxf(βk+1(𝒬F)Tγk+1)22βk+1(𝒬F)T(γk+θ1(γk+1γk)ϕ𝒬G(zk+1zk))+ϕ(FT𝒫F+𝒯f)(βk+1βk)βk+1+(QF)Tγk+122=ϕ(FT𝒫F+𝒯f)(βk+1βk)+ϕFT𝒬G(zk+1zk)+(11θ)(QF)T(γk+1γk)22=ϕ(FT𝒫F+𝒯f)(βk+1βk)22+ϕFT𝒬G(zk+1zk)22+(11θ)2(QF)T(γk+1γk)22+2ϕ2(βk+1βk)(FT𝒫F+𝒯f)TFT𝒬G(zk+1zk)+2(11θ)ϕ(βk+1βk)(FT𝒫F+𝒯f)T(QF)T(γk+1γk)+2(11θ)ϕ(zk+1zk)TGT𝒬F(QF)T(γk+1γk) (A.18)

By taking into account the fact that

2ϕ2(βk+1βk)(FT𝒫F+𝒯f)TFT𝒬G(zk+1zk)ϕ(FT𝒫F+𝒯f)(βk+1βk)22+ϕFT𝒬G(zk+1zk)222(11θ)ϕ(βk+1βk)(FT𝒫F+𝒯f)T(QF)T(γk+1γk)ϕ(FT𝒫F+𝒯f)(βk+1βk)22+(11θ)2(QF)T(γk+1γk)22,

and

2(11θ)ϕ(zk+1zk)TGT𝒬F(QF)T(γk+1γk)(11θ)2(QF)T(γk+1γk)22+ϕFT𝒬G(zk+1zk)22,

and the inequality that

FT𝒬G(zk+1zk)22=(𝒬G(zk+1zk))TFFT(𝒬G(zk+1zk))Tλmax(FFT)𝒬G(zk+1zk)22,

where $\lambda_{\max}(FF^T)$ is the largest eigenvalue of $FF^T$, (A.18) can be reduced to

\[
\|\beta^{k+1} - \mathrm{prox}_f(\beta^{k+1} - (\mathcal{Q}F)^T\gamma^{k+1})\|_2^2 \le 3\phi^2\|F^T\mathcal{P}F + \mathcal{T}_f\|_2\,\|\beta^{k+1} - \beta^k\|_{F^T\mathcal{P}F + \mathcal{T}_f}^2 + 3\phi^2\lambda_{\max}(FF^T)\,\|z^{k+1} - z^k\|_{G^T\mathcal{Q}G}^2 + 3\Big(1 - \frac{1}{\theta}\Big)^2\|(\mathcal{Q}F)^T(\gamma^{k+1} - \gamma^k)\|_2^2. \tag{A.19}
\]

Similarly we can bound the term zk+1proxg(zk+1(𝒬G)Tγk+1)22,

\[
\|z^{k+1} - \mathrm{prox}_g(z^{k+1} - (\mathcal{Q}G)^T\gamma^{k+1})\|_2^2 \le 2\phi^2\|G^T\mathcal{P}G + \mathcal{T}_h\|_2\,\|z^{k+1} - z^k\|_{G^TG + \mathcal{T}_h}^2 + 2\Big(1 - \frac{1}{\theta}\Big)^2\|(\mathcal{Q}G)^T(\gamma^{k+1} - \gamma^k)\|_2^2. \tag{A.20}
\]

From the update of γ, we have

\[
\|\mathcal{Q}(c - F\beta^{k+1} - Gz^{k+1})\|_2^2 = (\theta\phi)^{-2}\|\gamma^{k+1} - \gamma^k\|_2^2. \tag{A.21}
\]

Combining (A.19), (A.20) and (A.21), we can obtain that

βk+1proxf(βk+1(𝒬F)Tγk+1)22+zk+1proxh(zk+1Gγk+1)22+𝒬(cFβk+1Gzk+1)223ϕ2FT𝒫F+𝒯f2βk+1βkFT𝒫F+𝒯f+3ϕ2λmax(FFT)zk+1zkGT𝒬G2+(θϕ)2γk+1γk22+3(11θ)2(𝒬F)T(γk+1γk)22+2ϕ2GT𝒫G+𝒯h2zk+1zkGT𝒫G+𝒯h2+2(11θ)2(𝒬G)T(γk+1γk)22C×(βk+1βkFT𝒫+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ1γk+1γk22) (A.22)

Lemma A.3. Suppose that Assumption 1 holds, and assume that both $F^TF + \mathcal{T}_f$ and $G^TG + \mathcal{T}_h$ are positive definite. Then, for all sufficiently large $k$ and $\theta \in (0, \tfrac{1+\sqrt5}{2})$, there exists $\mu \in (0, 1)$ such that

\[
\|u^{k+1} - \bar u\|_{\mathcal{M}_1}^2 + \|z^{k+1} - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2 \le \mu\big(\|u^k - \bar u\|_{\mathcal{M}_1}^2 + \|z^k - z^{k-1}\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2\big), \tag{A.23}
\]

where

\[
\mathcal{M}_1 = \begin{pmatrix} F^T(m_1\mathcal{Q} + \mathcal{P})F + \mathcal{T}_f & m_1F^T\mathcal{Q}G & 0 \\ m_1G^T\mathcal{Q}F & G^T(\mathcal{P} + (m_1 + 1)\mathcal{Q})G & 0 \\ 0 & 0 & \theta^{-1}\phi^{-2}I \end{pmatrix} \tag{A.24}
\]

with $m_1 \in (0, 1)$.

Proof. From Theorem 1 in Han et al. (2018), we can derive the following results.

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1min{θ,1θ})𝒬(Fβk+Gzkc)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1min{θ,1θ})𝒬(Fβk+1+Gzk+1c)22+θ1ϕ2γk+1γ¯22}zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))G𝒬TG2+βk+1βkFT𝒫F+𝒯f2+(1θ+min{θ,θ1})𝒬(Fβk+1+Gzk+1c)22 (A.25)

When θ(0,1+52), it is ensured that (1θ+ϕmin{θ,θ1})>0. Let d1(0,12), then we have

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬(Fβk+Gzkc)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬(Fβk+1+Gzk+1c)22+θ1ϕ2γk+1γ¯22} (A.26)
zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(1d1)(1θ+min{θ,θ1})𝒬(Fβk+1+Gzk+1c)22+d1(1θ+min{θ,θ1})𝒬(Fβk+Gzkc)22=zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1))(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+d1(1θ+min{θ,θ1})(𝒬(Fβk+Gzkc)22+𝒬(Fβk+1+Gzk+1c)22)zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1)(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+12d1(1θ+min{θ,θ1})𝒬F(βk+1βk)+𝒬G(zk+1zk)22 (A.27)

Note that $\mathcal{Q}(F\beta^{k+1} + Gz^{k+1} - c) = \mathcal{Q}F(\beta^{k+1} - \bar\beta) + \mathcal{Q}G(z^{k+1} - \bar z)$, and we have

{βkβ¯FT𝒫F+𝒯f2+zkz¯GTG+𝒯h2+zkzk1GT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬F(βkβ¯)+𝒬G(zkz¯)22+θ1ϕ2γkγ¯22}{βk+1β¯FT𝒫F+𝒯f2+zk+1z¯GTG+𝒯h2+zk+1zkGT𝒫G+𝒯h2+(1+d1d1θ(1d1)min{θ,1θ})𝒬F(βk+1β¯)+𝒬G(zk+1z¯))22+θ1ϕ2γk+1γ¯22}zk+1zkGT𝒫G+𝒯h+(θθ2+min(θ2,1))GT𝒬G2+βk+1βkFT𝒫F+𝒯f2+(12d1)(1θ+min{θ,θ1})θ2ϕ2γk+1γk22+12d1(1θ+min{θ,θ1})𝒬F(βk+1βk)+𝒬G(zk+1zk)22(1θ+min{θ,θ1})min{12d1,12d1,θ}(βk+1βkFT𝒫F+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ2γk+1γk22) (A.28)

Let $m_1 = 1 + d_1 - d_1\theta - (1 - d_1)\min\{\theta, 1-\theta\}$ in $\mathcal{M}_1$ defined in (A.24), and $m_2 = (1 - \theta + \min\{\theta, \theta^{-1}\})\min\{\tfrac{d_1}{2}, 1 - 2d_1, \theta\}$. Note that when $\theta \in (0, \tfrac{1+\sqrt5}{2})$, the following relationship holds.

\[
F^TF + \mathcal{T}_f \succ 0 \quad\text{and}\quad G^TG + \mathcal{T}_h \succ 0 \ \Longrightarrow\ \mathcal{M}_1 \succ 0.
\]

Combining with Lemma A.2, we have

uku¯12+zkzk1GTG+𝒯h2(uk+1u¯1+zk+1zkGTG+𝒯h2)m2C(C×(βk+1βkFT𝒫F+𝒯f2+zk+1zkGTG+𝒯h2+θ2ϕ2γk+1γk22))=m2Cuk+1uk2m2d2Cq2uk+1u¯22+m2(1d2)Cq2zk+1zkGTG+𝒯h2m2d2Cq2λmax(1)uk+1u¯12+m2(1d2)Cq2zk+1zkGTG+𝒯h2. (A.29)

Take $d_2 = \frac{\lambda_{\max}(\mathcal{M}_1)}{1 + \lambda_{\max}(\mathcal{M}_1)}$; then we can obtain (A.23) with $\mu = \Big[1 + \frac{m_2}{Cq^2(1 + \lambda_{\max}(\mathcal{M}_1))}\Big]^{-1}$. □

Proof of Theorem 1. Since $f(\cdot)$ is a weighted $\ell_1$ norm and $g(\cdot)$ is a nonnegative combination of $\mathbf{1}^T(\cdot)_+$ and $\mathbf{1}^T(\cdot)_-$, both are piecewise linear-quadratic functions; thus both $\mathrm{prox}_f(\cdot)$ and $\mathrm{prox}_g(\cdot)$ are piecewise polyhedral (Poliquin and Rockafellar, 1993), which implies Fact 2 (Han et al., 2018). Since we take $\mathcal{T}_g = \eta_g I_{p_g} - \phi X_g^TX_g$, $g = 1, \ldots, G$, we have $\mathcal{T}_f + \phi F^TF = \mathrm{Diag}(\eta_1 I_{p_1}, \ldots, \eta_G I_{p_G})$, which is positive definite; this, together with the fact that $G^TG = I_n \succ 0$, implies that the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ is automatically well defined. By Lemma A.1, under the condition $\theta \in (0, (1+\sqrt5)/2)$, the sequence $(\beta^k, z^k, \omega^k, \gamma^k)$ generated by algorithm (A.5) converges to a limit point $(\bar\beta, \bar z, \bar\omega, \bar\gamma)$ with $(\bar\beta, \bar z, \bar\omega)$ solving (9) and $\bar\gamma$ being dual optimal.

To derive the rate of convergence, we first compute the quantities appearing in $\mathcal{M}_1$. By definition,

\[
\mathcal{P} = H(H^TH)^{-1}H^T = \frac{1}{G}\begin{pmatrix} (G-1)I & -I & \cdots & -I \\ -I & (G-1)I & \cdots & -I \\ \vdots & & \ddots & \vdots \\ -I & -I & \cdots & (G-1)I \end{pmatrix}.
\]

It follows that

\[
\begin{aligned}
\|\beta^{k+1} - \bar\beta\|_{F^T\mathcal{P}F}^2 &= \sum_{i=1}^G\|X_i(\beta_i^{k+1} - \bar\beta_i)\|_2^2 - \frac{1}{G}\Big\|\sum_{i=1}^G X_i(\beta_i^{k+1} - \bar\beta_i)\Big\|_2^2,\\
m_1\|\mathcal{Q}F(\beta^{k+1} - \bar\beta) + \mathcal{Q}G(z^{k+1} - \bar z)\|_2^2 &= \frac{m_1}{G}\Big\|\sum_{i=1}^G X_i(\beta_i^{k+1} - \bar\beta_i) + (z^{k+1} - \bar z)\Big\|_2^2,\\
\|z^{k+1} - \bar z\|_{G^TG + \mathcal{T}_h}^2 &= \|z^{k+1} - \bar z\|_2^2,\\
\|z^{k+1} - z^k\|_{G^T\mathcal{P}G + \mathcal{T}_h}^2 &= \frac{G-1}{G}\|z^{k+1} - z^k\|_2^2. \tag{A.30}
\end{aligned}
\]

Plugging equations (A.30) back into (A.29), we derive the results in Theorem 1 easily.

A.3. Algorithms for Two-Step PQR-SCAD

This section presents two three-block ADMM algorithms for PQR-SCAD proposed in Section 2.3.

Algorithm 3.

FS-QRADMM-CD for Two-Step PQR-SCAD

Initialization: $\tilde\beta^0, \lambda, \upsilon, \tilde z^0, \tilde\gamma^0, \tilde\omega_i^0$, and $\phi > 0$, $\theta = 1.618$, $k = 0$.
while the stopping criterion is not satisfied, do
  Update $\tilde\beta^{k+1}$ by
     $\tilde\beta_1^{k+1} = \arg\min_{\beta_1\in\mathbb{R}^{p_1}}\ n\upsilon\lambda\|\beta_1\|_1 + \frac{\phi}{2}\big\|X_1\beta_1 + \sum_{g=2}^G\tilde\omega_g^k + \tilde z^k - y + \frac{\tilde\gamma_1^k}{\phi}\big\|_2^2,$
     $\tilde\beta_g^{k+1} = \arg\min_{\beta_g\in\mathbb{R}^{p_g}}\ n\upsilon\lambda\|\beta_g\|_1 + \frac{\phi}{2}\big\|X_g\beta_g - \tilde\omega_g^k + \frac{\tilde\gamma_g^k}{\phi}\big\|_2^2,\quad g = 2, \ldots, G.$
  Compute $\tilde\omega^{k+\frac12}$, $\tilde z^{k+1}$ and $\tilde\omega^{k+1}$ by (13).
  Update $\tilde\gamma^{k+1}$ by (14).
end while Denote the solution as $\hat\beta^1, \hat z^1, \hat\omega^1$.
Initialization: $\hat\beta^0 = \hat\beta^1$, $\hat z^0 = \hat z^1$, $\hat\omega^0 = \hat\omega^1$, and $\phi > 0$, $\theta = 1.618$, $k = 0$. Compute
$\alpha_j = \lambda^{-1}p'_\lambda(|\hat\beta_j^0|)$ for $j = 1, \ldots, p$.
while the stopping criterion is not satisfied, do
  Update β^k+1 by (12).
  Compute ω^k+12,z^k+1 and ω^k+1 by (13).
  Update γ^k+1 by (14).
end while

Algorithm 4.

FS-QRADMM-prox for Two-Step PQR-SCAD

Initialization: $\tilde\beta^0, \lambda, \upsilon, \tilde z^0, \tilde\gamma^0, \tilde\omega_i^0$, and $\phi > 0$, $\theta = 1.618$, $k = 0$.
while the stopping criterion is not satisfied, do
  Update $\tilde\beta^{k+1}$ by
    $\tilde\beta_{1j}^{k+1} = \mathrm{Shrink}\Big(\tilde\beta_{1j}^k - \frac{\phi}{\eta_1}X_{1j}^T\big(X_1\tilde\beta_1^k + \sum_{g=2}^G\tilde\omega_g^k + \tilde z^k - y + \frac{\tilde\gamma_1^k}{\phi}\big),\ \frac{n\upsilon\lambda}{\eta_1}\Big),\quad j = 1, \ldots, p_1,$
    $\tilde\beta_{gj}^{k+1} = \mathrm{Shrink}\Big(\tilde\beta_{gj}^k - \frac{\phi}{\eta_g}X_{gj}^T\big(X_g\tilde\beta_g^k - \tilde\omega_g^k + \frac{\tilde\gamma_g^k}{\phi}\big),\ \frac{n\upsilon\lambda}{\eta_g}\Big),\quad j = 1, \ldots, p_g,\ g = 2, \ldots, G.$
  Compute $\tilde\omega^{k+\frac12}$, $\tilde z^{k+1}$ and $\tilde\omega^{k+1}$ by (13).
  Update $\tilde\gamma^{k+1}$ by (14).
end while Denote the solution as $\hat\beta^1, \hat z^1, \hat\omega^1$.
Initialization: $\hat\beta^0 = \hat\beta^1$, $\hat z^0 = \hat z^1$, $\hat\omega^0 = \hat\omega^1$, and $\phi > 0$, $\theta = 1.618$, $k = 0$. Compute
$\alpha_j = \lambda^{-1}p'_\lambda(|\hat\beta_j^0|)$ for $j = 1, \ldots, p$.
while the stopping criterion is not satisfied, do
  Update β^k+1 by (15).
  Compute ω^k+12,z^k+1 and ω^k+1 by (13).
  Update γ^k+1 by (14).
end while


References

  1. Altunbaş Y and Thornton J (2019). The impact of financial development on income inequality: a quantile regression approach. Economics Letters, 175:51–56.
  2. Belloni A and Chernozhukov V (2011). L1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130.
  3. Boyd S, Parikh N, Chu E, Peleato B, and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  4. Cai Z, Chen H, and Liao X (2022). A new robust inference for predictive quantile regression. Journal of Econometrics. In press.
  5. Chen C, He B, Ye Y, and Yuan X (2016). The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1):57–79.
  6. D’Haultfœuille X, Maurel A, and Zhang Y (2018). Extremal quantile regressions for selection models and the black–white wage gap. Journal of Econometrics, 203(1):129–142.
  7. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  8. Fan J, Li R, Zhang C-H, and Zou H (2020). Statistical Foundations of Data Science. Chapman and Hall/CRC.
  9. Fan J, Xue L, and Zou H (2014). Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics, 42(3):819–849.
  10. Fan Y, Lin N, and Yin X (2021). Penalized quantile regression for distributed big data using the slack variable representation. Journal of Computational and Graphical Statistics, 30(3):557–565.
  11. Fazel M, Pong TK, Sun D, and Tseng P (2013). Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977.
  12. Firpo S, Galvao AF, Pinto C, Poirier A, and Sanroman G (2022). GMM quantile regression. Journal of Econometrics. In press.
  13. Fortin M and Glowinski R (2000). Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, volume 15. Elsevier.
  14. Friedman J, Hastie T, Höfling H, and Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
  15. Friedman J, Hastie T, and Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
  16. Giessing A and He X (2019). On the predictive risk in misspecified quantile regression. Journal of Econometrics, 213(1):235–260.
  17. Gimenes N and Guerre E (2022). Quantile regression methods for first-price auctions. Journal of Econometrics, 226(2):224–247.
  18. Gu J and Volgushev S (2019). Panel data quantile regression with grouped fixed effects. Journal of Econometrics, 213(1):68–91.
  19. Gu Y, Fan J, Kong L, Ma S, and Zou H (2018). ADMM for high-dimensional sparse penalized quantile regression. Technometrics, 60(3):319–331.
  20. Han D, Sun D, and Zhang L (2018). Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Mathematics of Operations Research, 43(2):622–637.
  21. He X, Pan X, Tan KM, and Zhou W-X (2022). Smoothed quantile regression with large-scale inference. Journal of Econometrics. In press.
  22. Koenker R (2017). Quantile regression: 40 years on. Annual Review of Economics, 9:155–176.
  23. Koenker R and Bassett G (1978). Regression quantiles. Econometrica, 46(1):33–50.
  24. Koenker R, Chernozhukov V, He X, and Peng L (2017). Handbook of Quantile Regression. CRC Press.
  25. Koenker R and Mizera I (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685.
  26. Lee ER, Noh H, and Park BU (2014). Model selection via Bayesian information criterion for quantile regression models. Journal of the American Statistical Association, 109(505):216–229.
  27. Li Y and Zhu J (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics, 17(1):163–185.
  28. Narisetty N and Koenker R (2022). Censored quantile regression survival models with a cure proportion. Journal of Econometrics, 226(1):192–203.
  29. Peng B and Wang L (2015). An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3):676–694.
  30. Poliquin R and Rockafellar R (1993). A calculus of epi-derivatives applicable to optimization. Canadian Journal of Mathematics, 45(4):879–896.
  31. Sherwood B and Maidman A (2017). rqPen: Penalized Quantile Regression. R package version 2.0.
  32. Sun D, Toh K-C, and Yang L (2015). A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM Journal on Optimization, 25(2):882–915.
  33. Tan KM, Wang L, and Zhou W-X (2022). High-dimensional quantile regression: Convolution smoothing and concave regularization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):205–233.
  34. Wang H (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524.
  35. Wang L and He X (2022). Analysis of global and local optima of regularized quantile regression in high dimensions: A subgradient approach. Econometric Theory.
  36. Wang L, Kim Y, and Li R (2013). Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics, 41(5):2505–2536.
  37. Wang L, Wu Y, and Li R (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214–222.
  38. Wu Y and Liu Y (2009). Variable selection in quantile regression. Statistica Sinica, 19(2):801–817.
  39. Yi C and Huang J (2017). Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557.
  40. Yu L and Lin N (2017). ADMM for penalized quantile regression in big data. International Statistical Review, 85(3):494–518.
  41. Yu L, Lin N, and Wang L (2017). A parallel algorithm for large-scale nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 26(4):935–939.
  42. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
  43. Zou H and Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533.
