Majorization Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models

Dingfeng Jiang; Jian Huang

doi:10.1007/s11222-013-9407-3

Author manuscript; available in PMC: 2014 Oct 9.

Published in final edited form as: Stat Comput. 2013 Jun 6;24(5):871–883. doi: 10.1007/s11222-013-9407-3


PMCID: PMC4191872 NIHMSID: NIHMS532755 PMID: 25309048

Abstract

Recent studies have demonstrated theoretical attractiveness of a class of concave penalties in variable selection, including the smoothly clipped absolute deviation and minimax concave penalties. The computation of the concave penalized solutions in high-dimensional models, however, is a difficult task. We propose a majorization minimization by coordinate descent (MMCD) algorithm for computing the concave penalized solutions in generalized linear models. In contrast to the existing algorithms that use local quadratic or local linear approximation to the penalty function, the MMCD seeks to majorize the negative log-likelihood by a quadratic loss, but does not use any approximation to the penalty. This strategy makes it possible to avoid the computation of a scaling factor in each update of the solutions, which improves the efficiency of coordinate descent. Under certain regularity conditions, we establish theoretical convergence property of the MMCD. We implement this algorithm for a penalized logistic regression model using the SCAD and MCP penalties. Simulation studies and a data example demonstrate that the MMCD works sufficiently fast for the penalized logistic regression in high-dimensional settings where the number of covariates is much larger than the sample size.

Keywords: logistic regression, p ≫ n models, smoothly clipped absolute deviation penalty, minimax concave penalty, variable selection

1 Introduction

Variable selection is a fundamental problem in statistics. Penalized methods have been shown to have attractive theoretical properties for variable selection in p ≫ n models, with n being the sample size and p the number of variables. Several important penalties have been proposed. Examples include the ℓ₁ penalty or the least absolute shrinkage and selection operator (Lasso) (Tibshirani (1996)), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li (2001)) and the minimax concave penalty (MCP) (Zhang (2010)). The SCAD and MCP are concave penalties that possess the oracle properties, meaning that they can correctly select important variables and estimate variable coefficients with high probabilities as if the model were known in advance under certain sparsity conditions and other appropriate regularity conditions.

Considerable progress has been made on the computational algorithms for penalized regression models. Efron et al (2004) showed that a modified version of their LARS algorithm can efficiently compute the entire Lasso solution path in a linear model. This modified LARS algorithm is the same as the homotopy algorithm proposed earlier by Osborne, Presnell and Turlach (2000). For the SCAD penalty, Fan and Li (2001) proposed a local quadratic approximation (LQA) algorithm. A drawback of the LQA algorithm is that once a coefficient is set to zero at any iteration step, it permanently stays at zero and the corresponding variable is removed from the final model. Hunter and Li (2005) used the majorization-minimization (MM) approach to optimize a perturbed version of LQA by bounding the denominator away from zero. How to choose the size of perturbation and how the perturbation affects the sparsity need to be determined in specific models. Zou and Li (2008) proposed a local linear approximation (LLA) algorithm. The LLA algorithm approximates the concave penalized solutions by repeatedly using the algorithms for the Lasso penalty. Schifano, Strawderman and Wells (2010) generalized the idea of the LLA algorithm via the MM approach to multiple penalties and proved the convergence properties of their minimization by iterated soft thresholding (MIST) algorithm. Zhang (2010) developed the PLUS algorithm for computing the concave penalized least squares solutions, including the MCP solutions, in linear regression models.

In the last few years, it has been recognized that the coordinate descent algorithm (CDA) can efficiently compute the Lasso solutions in p ≫ n models. This algorithm has a long history in applied mathematics and has its roots in the Gauss-Seidel method for solving linear systems (Warge (1963); Ortega and Rheinbold (1970); Tseng (2001)). The CDA optimizes an objective function by working on one coordinate (or a block of coordinates) at a time, iteratively cycling through all coordinates until convergence is reached. It is particularly suitable for problems that have a simple solution for each coordinate but lack one in higher dimensions. The CDA for a Lasso penalized linear model has been shown to be very competitive with LARS, especially in high-dimensional cases (Friedman, Hastie, Höfling and Tibshirani (2007); Wu and Lange (2008); Friedman, Hastie and Tibshirani (2010)).

Coordinate descent has also been used in computing the concave penalized solution paths. Breheny and Huang (2011) compared the CDA and LLA for various combinations of (n, p) and various designs of covariate matrices. Their results showed that the CDA converges much faster than the LLA-LARS algorithm in the settings they considered. Mazumder, Friedman and Hastie (2011) demonstrated that the CDA has better convergence properties than the LLA. Breheny and Huang (2011) also proposed an adaptive rescaling technique to overcome the difficulty due to the constantly changing scaling factors in computing the solution for the MCP penalized generalized linear models (GLM). However, the adaptive rescaling approach cannot be applied to the SCAD penalty. Furthermore, it is not clear whether its solution reaches a local optimum of the original objective function.

We propose a majorization minimization by coordinate descent (MMCD) algorithm for computing the solutions of a concave penalized GLM, with emphasis on logistic regression. The MMCD seeks a closed form solution for each coordinate and avoids the computation of scaling factors by majorizing the loss function. Under reasonable regularity conditions, we establish the convergence property of the MMCD. This algorithm is particularly suitable for the logistic regression model because a simple and effective majorization can be found. This paper is organized as follows. Section 2 defines the concave penalized solutions in a GLM. Section 3 describes the proposed MMCD algorithm, explains the benefits of majorization and studies its convergence property. Section 4 implements the MMCD algorithm in a concave penalized logistic model. Simulation studies are performed to compare the MMCD algorithm and its competitors in terms of computational efficiency and selection performance. Concluding remarks are given in Section 5.

2 Concave Penalized Solutions for GLMs

Let $\{(y_{i}, x_{i})\}_{i = 1}^{n}$ be the observed data, where y_i is a response variable and x_i is a (p + 1)-dimensional vector of predictors, with the first element being 1 corresponding to the intercept. We consider a GLM with the canonical link function, in which y_i relates to x_i through a linear combination $η_{i} = x_{i}^{T} β$ , with β = (β₀, β₁, …, β_p)^T ∈ ℝ^{p+1}. Here β₀ is the intercept. The conditional density function of y_i given x_i is f_i(y_i) = exp{(y_iθ_i − ψ(θ_i))/ϕ_i + c(y_i, ϕ_i)}. Here ϕ_i > 0 is a dispersion parameter. The form of ψ(θ) depends on the specified model. For example, ψ(θ) = log(1 + exp(θ)) in a logistic model. The (scaled) negative log-likelihood function is

ℓ (β) \propto \frac{1}{n} \sum_{i = 1}^{n} \{ψ (x_{i}^{T} β) - y_{i} x_{i}^{T} β\} .

(1)

Here x_i0 = 1, 1 ≤ i ≤ n. For the other p variables, we assume they are standardized, that is, ${‖ x^{j} ‖}_{2}^{2} / n = 1$ with x^j = (x_1j, …, x_nj)^T, 1 ≤ j ≤ p. Here ‖v‖₂ is the ℓ₂ norm of an n-dimensional vector v. The standardization allows the penalization to be applied evenly to each variable.

We consider the concave penalized GLM criterion

Q (β; λ, γ) = ℓ (β) + \sum_{j = 1}^{p} ρ (| β_{j} |; λ, γ),

(2)

where ρ is a penalty function with a penalty parameter λ ≥ 0 and a regularization parameter γ that controls the shape of ρ. Note that in (2) the intercept β₀ is not penalized. We focus on two concave penalties, SCAD and MCP. The SCAD penalty (Fan and Li (2001)) is defined as

ρ (t; λ, γ) = λ \int_{0}^{| t |} \left\{ 1_{\{x \leq λ\}} + \frac{{(γ λ - x)}_{+}}{(γ - 1) λ} 1_{\{x > λ\}} \right\} d x,

(3)

with λ ≥ 0 and γ > 2. Here 1_{x∈A} is the indicator function and x₊ = x1_{x≥0} denotes the non-negative part of x. The MCP penalty (Zhang (2010)) is defined as

ρ (t; λ, γ) = λ \int_{0}^{| t |} {(1 - \frac{x}{γ λ})}_{+} d x

(4)

with λ ≥ 0 and γ > 1.

For the SCAD and MCP penalties, the regularization parameter γ controls the degree of concavity: a smaller γ corresponds to a penalty that is more concave. The two penalties begin by applying the same degree of penalization as the Lasso, and then gradually reduce the penalization to zero as |t| increases. When γ → ∞, the SCAD and MCP penalties converge to the ℓ₁ penalty.
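Evaluating the integrals in (3) and (4) gives piecewise closed forms, which make the behaviour just described easy to inspect numerically. The following Python sketch is illustrative only; the function names `scad_penalty` and `mcp_penalty` are hypothetical and the closed forms are obtained by direct integration of (3) and (4).

```python
def scad_penalty(t, lam, gamma):
    """SCAD penalty (3) after integration; assumes lam >= 0 and gamma > 2."""
    a = abs(t)
    if a <= lam:
        # Lasso-like region: constant penalization rate lam.
        return lam * a
    if a <= gamma * lam:
        # Region where the penalization rate decays linearly to zero.
        return (2.0 * gamma * lam * a - a * a - lam * lam) / (2.0 * (gamma - 1.0))
    # Flat region: no further penalization.
    return lam * lam * (gamma + 1.0) / 2.0


def mcp_penalty(t, lam, gamma):
    """MCP penalty (4) after integration; assumes lam >= 0 and gamma > 1."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0
```

For a very large γ both functions are numerically close to λ|t|, in line with the ℓ₁ limit stated above.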

To have a basic understanding of these penalties, consider a thresholding operator defined as the solution to a penalized univariate linear regression,

θ̂ (λ, γ) = \underset{θ}{argmin} \left\{ \frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - x_{i} θ)}^{2} + ρ (θ; λ, γ) \right\} .

Let ${θ̂}_{L S} = \sum_{i = 1}^{n} x_{i} y_{i} / \sum_{i = 1}^{n} x_{i}^{2}$ be the least squares solution. Denote the soft-thresholding operator by S(t, λ) = sign(t) (|t| − λ)₊ for λ ≥ 0 (Donoho and Johnstone (1994)), where sign(t) = 1,0, −1 if t > 0, = 0, < 0, respectively. Then the SCAD and MCP penalties have closed form solutions for θ̂ (λ, γ) as follows,

{θ̂}_{S C A D} (λ, γ) = S ({θ̂}_{L S}, λ) 1_{{| {θ̂}_{L S} | \leq 2 λ}} + \frac{γ - 1}{γ - 2} S ({θ̂}_{L S}, λ γ / (γ - 1)) 1_{{2 λ < | {θ̂}_{L S} | \leq γ λ}} + {θ̂}_{L S} 1_{{| {θ̂}_{L S} | > λ γ}},

(5)

{θ̂}_{M C P} (λ, γ) = \frac{γ}{γ - 1} S ({θ̂}_{L S}, λ) 1_{{| {θ̂}_{L S} | \leq λ γ}} + {θ̂}_{L S} 1_{{| {θ̂}_{L S} | > λ γ}} .

(6)

Observe that both the SCAD and MCP use the LS solution if |θ̂_LS| > λγ; the MCP only applies a scaled soft-thresholding operation if |θ̂_LS| ≤ λγ, while the SCAD applies a soft-thresholding operation if |θ̂_LS| ≤ 2λ and a scaled soft-thresholding operation if 2λ < |θ̂_LS| ≤ λγ. These thresholding operators are the basic building blocks of the proposed MMCD algorithm described below.
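The operators (5) and (6) translate directly into code. The sketch below is a hedged illustration in Python; the function names are hypothetical, and it assumes a standardized predictor (so that the quadratic term has unit curvature), γ > 2 for the SCAD and γ > 1 for the MCP.

```python
def soft_threshold(t, lam):
    """S(t, lam) = sign(t) * (|t| - lam)_+ for lam >= 0."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def scad_threshold(theta_ls, lam, gamma):
    """SCAD thresholding operator (5)."""
    a = abs(theta_ls)
    if a <= 2.0 * lam:
        return soft_threshold(theta_ls, lam)
    if a <= gamma * lam:
        scale = (gamma - 1.0) / (gamma - 2.0)
        return scale * soft_threshold(theta_ls, gamma * lam / (gamma - 1.0))
    return theta_ls


def mcp_threshold(theta_ls, lam, gamma):
    """MCP thresholding operator (6)."""
    if abs(theta_ls) <= gamma * lam:
        return gamma / (gamma - 1.0) * soft_threshold(theta_ls, lam)
    return theta_ls
```

Both operators return zero whenever |θ̂_LS| ≤ λ and return θ̂_LS unchanged once |θ̂_LS| exceeds λγ.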

3 Majorization Minimization by Coordinate Descent

For a GLM, a quadratic approximation to ℓ(β) in a neighborhood of a given estimate β̃ leads to a weighted least squares loss,

ℓ^{s} (β | β̃) = \frac{1}{2 n} \sum_{i = 1}^{n} w_{i} {(z_{i} - x_{i}^{T} β)}^{2},

(7)

with $w_{i} (β̃) = ψ̈ (x_{i}^{T} β̃)$ and $z_{i} (β̃) = ψ̈ {(x_{i}^{T} β̃)}^{- 1} \{y_{i} - ψ̇ (x_{i}^{T} β̃)\} + x_{i}^{T} β̃$ , where ψ̇(θ) and ψ̈(θ) are the first and second derivatives of ψ(θ) with respect to (w.r.t.) θ. Using ℓ^s(β|β̃) in the criterion function, the CDA updates the jth coordinate while holding the remaining coordinates k ≠ j fixed. Let ${β̂}_{j}^{m} = {({β̂}_{0}^{m + 1}, \dots, {β̂}_{j}^{m + 1}, {β̂}_{j + 1}^{m}, \dots, {β̂}_{p}^{m})}^{T}$ denote the coefficient vector in which coordinates 0 through j have already been updated in the current cycle. The CDA updates the vector ${β̂}_{j - 1}^{m}$ to ${β̂}_{j}^{m}$ by minimizing the criterion

{β̂}_{j}^{m + 1} = \underset{β_{j}}{argmin} Q^{s} (β_{j} | {β̂}_{j - 1}^{m}) = \underset{β_{j}}{argmin} \frac{1}{2 n} \sum_{i = 1}^{n} w_{i} {(z_{i} - \sum_{s < j} x_{i s} {β̂}_{s}^{m + 1} - x_{i j} β_{j} - \sum_{s > j} x_{i s} {β̂}_{s}^{m})}^{2} + ρ (| β_{j} |; λ, γ),

(8)

where w_i and z_i depend on $({β̂}_{j - 1}^{m}, x_{i}, y_{i})$ . The jth coordinate-wise minimizer is then obtained by solving the equation,

\frac{1}{n} \sum_{i = 1}^{n} w_{i} x_{i j}^{2} β_{j} + ρ' (| β_{j} |) sgn (β_{j}) - \frac{1}{n} \sum_{i = 1}^{n} w_{i} x_{i j} (z_{i} - x_{i}^{T} {β̂}_{j - 1}^{m}) - \frac{1}{n} \sum_{i = 1}^{n} w_{i} x_{i j}^{2} {β̂}_{j}^{m} = 0,

(9)

with ρ′(|t|) the first derivative of ρ(|t|) w.r.t. |t| for t ≠ 0. Note that for t = 0, we have ρ′(0+) = λ. The notation sgn(t) is the subdifferential of |t|, that is,

sgn (t) = \begin{cases} 1, & if t > 0; \\ - 1, & if t < 0; \\ \in [- 1, 1], & if t = 0 . \end{cases}

We refer to Lange (2004) for a detailed description of sub-differentials.

Define the scaling factor by $δ_{j} ≜ n^{- 1} \sum_{i = 1}^{n} w_{i} x_{i j}^{2}$ . Directly solving (9) with the MCP penalty gives the jth coordinate-wise solution as,

{β̂}_{j}^{m + 1} = \frac{S (τ_{j}, λ)}{δ_{j} - 1 / γ} 1_{{| τ_{j} | \leq δ_{j} γ λ}} + \frac{τ_{j}}{δ_{j}} 1_{{| τ_{j} | > δ_{j} γ λ}},

(10)

with $τ_{j} = n^{- 1} \sum_{i = 1}^{n} w_{i} x_{i j} (z_{i} - x_{i}^{T} {β̂}_{j - 1}^{m}) + δ_{j} {β̂}_{j}^{m}$ . Observe that when |τ_j| ≤ λ, we have ${β̂}_{j}^{m + 1} = 0$ . In a linear model, w_i = 1, i = 1, …, n. Thus the scaling factor δ_j = 1 for standardized predictors. In a GLM, however, the dependence of w_i on $({β̂}_{j - 1}^{m}, x_{i}, y_{i})$ causes the scaling factor δ_j to change from iteration to iteration. This is problematic because δ_j − 1/γ can be very small and is not guaranteed to be positive. Thus direct application of CDA may not be numerically stable and can lead to unreasonable solutions.

To overcome this difficulty, Breheny and Huang (2011) proposed an adaptive rescaling approach, which uses

{β̂}_{j}^{m + 1} = \frac{S (τ_{j}, λ)}{δ_{j} (1 - 1 / γ)} 1_{{| τ_{j} | \leq γ λ}} + \frac{τ_{j}}{δ_{j}} 1_{{| τ_{j} | > γ λ}},

(11)

for the jth coordinate-wise solution. This is equivalent to applying a new regularization parameter γ* = γ/δ_j at each coordinate-wise iteration. Hence, the effective regularization parameters are not the same for the penalized variables and are not known until the algorithm reaches convergence. Numerically, the scaling factor δ_j requires extra computation, which is not desirable for large p. In addition, a small δ_j could cause convergence issues. The adaptive rescaling approach cannot be applied to the SCAD penalty because the scaled soft-thresholding operation only applies to the middle clause of the SCAD thresholding operator as shown in (5).
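The relation between (10) and (11) can be checked numerically. The Python sketch below is an illustration under our own naming (`mcp_direct_update` and `mcp_rescaled_update` are hypothetical names); it implements both updates for a single coordinate and lets the direct update fail loudly when δ_j − 1/γ is not positive.

```python
def soft_threshold(t, lam):
    """S(t, lam) = sign(t) * (|t| - lam)_+."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def mcp_direct_update(tau, delta, lam, gamma):
    """Direct coordinate-wise MCP solution (10) with scaling factor delta."""
    denom = delta - 1.0 / gamma
    if denom <= 0.0:
        # This is the instability discussed in the text.
        raise ValueError("delta_j - 1/gamma must be positive for update (10)")
    if abs(tau) <= delta * gamma * lam:
        return soft_threshold(tau, lam) / denom
    return tau / delta


def mcp_rescaled_update(tau, delta, lam, gamma):
    """Adaptive rescaling update (11)."""
    if abs(tau) <= gamma * lam:
        return soft_threshold(tau, lam) / (delta * (1.0 - 1.0 / gamma))
    return tau / delta
```

With δ_j = 0.2 and γ = 3, for instance, the direct update is undefined because δ_j − 1/γ < 0, whereas the rescaled update returns the value that (10) would give with γ* = γ/δ_j = 15.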

Observe that in a GLM, the scaling factor δ_j is equal to the second partial derivative of the loss function, that is, $\nabla_{j}^{2} ℓ (β) = \sum ψ̈ (x_{i}^{T} β) x_{i j}^{2} / n = \sum w_{i} x_{i j}^{2} / n$ . The MMCD algorithm seeks an upper bound of $\nabla_{j}^{2} ℓ (β)$ . Thus we need to find an M such that δ_j ≤ M for each coordinate. This idea of finding a uniform bound on the second derivative was also proposed by Böhning and Lindsay (1988).

For the MM approach, the use of the upper bound of δ_j is equivalent to finding a surrogate function $ℓ^{M M} (β_{j} | {β̂}_{j - 1}^{m})$ that majorizes $ℓ^{s} (β_{j} | {β̂}_{j - 1}^{m})$ , where

ℓ^{M M} (β_{j} | {β̂}_{j - 1}^{m}) = ℓ ({β̂}_{j - 1}^{m}) + \nabla_{j} ℓ ({β̂}_{j - 1}^{m}) (β_{j} - {β̂}_{j}^{m}) + \frac{1}{2} M {(β_{j} - {β̂}_{j}^{m})}^{2},

(12)

and

ℓ^{s} (β_{j} | {β̂}_{j - 1}^{m}) = ℓ ({β̂}_{j - 1}^{m}) + \nabla_{j} ℓ ({β̂}_{j - 1}^{m}) (β_{j} - {β̂}_{j}^{m}) + \frac{1}{2} \nabla_{j}^{2} ℓ ({β̂}_{j - 1}^{m}) {(β_{j} - {β̂}_{j}^{m})}^{2},

(13)

with the second partial derivative $\nabla_{j}^{2} ℓ ({β̂}_{j - 1}^{m})$ in the Taylor expansion replaced by its upper bound M. Note that for a given ${β̂}_{j - 1}^{m}$ , $\nabla_{j}^{2} ℓ ({β̂}_{j - 1}^{m})$ is a constant; however, its value changes from iteration to iteration. The majorization uses the upper bound to avoid these changes, and it is valid regardless of the current estimate ${β̂}_{j - 1}^{m}$ . This is different from the common implementation of the MM approach, where the majorization is constructed at the current estimation point. Also note that the majorization is applied coordinate-wise to better fit the CDA. The descent property of the MM approach ensures that the iterative minimization of $ℓ^{M M} (β_{j} | {β̂}_{j - 1}^{m})$ leads to a descent sequence of the original objective function. For more details about the MM algorithm, we refer to Lange, Hunter, and Yang (2000); Hunter and Lange (2004).

Given the majorization of δ_j, some algebra shows that the jth coordinate-wise solutions of the SCAD and MCP are

S C A D : {β̂}_{j}^{m + 1} = \frac{1}{M} S (τ_{j}, λ) 1_{{| τ_{j} | \leq (1 + M) λ}} + \frac{S (τ_{j}, γ λ / (γ - 1))}{M - 1 / (γ - 1)} 1_{{(1 + M) λ < | τ_{j} | \leq M γ λ}} + \frac{1}{M} τ_{j} 1_{{| τ_{j} | > M γ λ}},

(14)

M C P : {β̂}_{j}^{m + 1} = \frac{S (τ_{j}, λ)}{M - 1 / γ} 1_{{| τ_{j} | \leq M γ λ}} + \frac{1}{M} τ_{j} 1_{{| τ_{j} | > M γ λ}},

(15)

with $τ_{j} = M {β̂}_{j}^{m} + n^{- 1} \sum_{i = 1}^{n} x_{i j} (y_{i} - ψ̇ (x_{i}^{T} {β̂}_{j - 1}^{m})), j = 1, \dots, p$ . The solution of the intercept is

{β̂}_{0}^{m + 1} = τ_{0} / M,

(16)

with $τ_{0} = M {β̂}_{0}^{m} + n^{- 1} \sum_{i = 1}^{n} x_{i 0} (y_{i} - ψ̇ (x_{i}^{T} {β̂}^{m}))$ , where ${β̂}^{m} = {({β̂}_{0}^{m}, {β̂}_{1}^{m}, \dots, {β̂}_{p}^{m})}^{T}$ . In (14) and (15), we want to ensure that the denominators are positive, i.e. M − 1/(γ − 1) > 0 and M − 1/γ > 0. This naturally leads to the constraint on the penalty, inf_t ρ″(|t|;λ, γ) > −M, where ρ″(|t|;λ, γ) is the second derivative of ρ(|t|;λ, γ) w.r.t. |t|. For the SCAD and MCP, the condition is satisfied by choosing a proper γ. Since inf_t ρ″(|t|;λ, γ) = −1/(γ − 1) for the SCAD penalty, and inf_t ρ″(|t|;λ, γ) = −1/γ for the MCP, we require γ > 1 + 1/M for the SCAD and γ > 1/M for the MCP.
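A minimal Python sketch of the coordinate-wise solutions (14) and (15) is given below for illustration. The function names are hypothetical; the checks on γ encode the requirements γ > 1 + 1/M (SCAD) and γ > 1/M (MCP) derived above.

```python
def soft_threshold(t, lam):
    """S(t, lam) = sign(t) * (|t| - lam)_+."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def mmcd_update_scad(tau, M, lam, gamma):
    """Coordinate-wise SCAD solution (14) under the majorization bound M."""
    if gamma <= 1.0 + 1.0 / M:
        raise ValueError("SCAD requires gamma > 1 + 1/M")
    a = abs(tau)
    if a <= (1.0 + M) * lam:
        return soft_threshold(tau, lam) / M
    if a <= M * gamma * lam:
        shifted = soft_threshold(tau, gamma * lam / (gamma - 1.0))
        return shifted / (M - 1.0 / (gamma - 1.0))
    return tau / M


def mmcd_update_mcp(tau, M, lam, gamma):
    """Coordinate-wise MCP solution (15) under the majorization bound M."""
    if gamma <= 1.0 / M:
        raise ValueError("MCP requires gamma > 1/M")
    if abs(tau) <= M * gamma * lam:
        return soft_threshold(tau, lam) / (M - 1.0 / gamma)
    return tau / M
```

Because M is fixed, the denominators are constants that can be computed once, and both solutions are continuous in τ_j at the boundaries between the clauses.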

The MMCD algorithm can gain further efficiency by updating the linear component incrementally. Let η = (η₁, …, η_n)^T and $X = {(x_{1}^{T}, \dots, x_{n}^{T})}^{T}$ , and let ${η̂}_{j}^{m} = X {β̂}_{j}^{m}$ be the linear component corresponding to the coefficient vector ${β̂}_{j}^{m}$ . Since the coefficient vectors ${β̂}_{j + 1}^{m}$ and ${β̂}_{j}^{m}$ differ only in coordinate j + 1, whose value changes from ${β̂}_{j + 1}^{m}$ to ${β̂}_{j + 1}^{m + 1}$ , we have

{η̂}_{j + 1}^{m} = {η̂}_{j}^{m} + X ({β̂}_{j + 1}^{m} - {β̂}_{j}^{m}) = {η̂}_{j}^{m} + x^{j + 1} ({β̂}_{j + 1}^{m + 1} - {β̂}_{j + 1}^{m}) .

(17)

This equation turns an O(np) operation into an O(n) one. Since this step is involved in each iteration for each coordinate, it turns out to be significant in reducing the computational cost.
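The saving can be seen in a small Python sketch. It is an illustration only, with hypothetical helper names: `full_eta` recomputes Xβ from scratch, while `update_eta_inplace` touches a single column of X.

```python
def full_eta(X, beta):
    """Recompute the linear component X @ beta from scratch: O(n p)."""
    return [sum(x_ik * b_k for x_ik, b_k in zip(row, beta)) for row in X]


def update_eta_inplace(eta, X, j, change):
    """Apply (17): add column j of X times the change in coordinate j, O(n)."""
    for i, row in enumerate(X):
        eta[i] += row[j] * change


# Usage: after changing one coefficient, both routes give the same eta.
X = [[1.0, 0.5, -1.0],
     [1.0, -0.5, 2.0],
     [1.0, 1.5, 0.0]]
beta = [0.2, 1.0, -0.3]
eta = full_eta(X, beta)

new_value = 0.4                      # updated value of coordinate j = 2
update_eta_inplace(eta, X, 2, new_value - beta[2])
beta[2] = new_value
```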

We can also majorize the penalty in each iteration using LQA (Fan and Li (2001)), perturbed LQA (Hunter and Li (2005)) or LLA (Zou and Li (2008)). However, this does not take full advantage of CDA. Indeed, the approximation of the penalty requires additional iterations for convergence and is not necessary, since an exact coordinate-wise solution exists. Thus the MMCD uses the exact form of the penalties and only majorizes the loss.

We now summarize the MMCD for a given (λ, γ). Assume the following conditions hold:

(a) The second partial derivative of ℓ(β) w.r.t. β_j is uniformly bounded for standardized X, i.e. there exists a real number M > 0 such that $\nabla_{j}^{2} ℓ (β) \leq M$ for all β and j = 0, …, p.
(b) inf_t ρ″(|t|;λ, γ) > −M, with ρ″(|t|;λ, γ) being the second derivative of ρ(|t|;λ, γ) w.r.t. |t|.

The MMCD then proceeds as follows,

1. Given an initial value β̂⁰, the MMCD computes the corresponding linear component η̂⁰.
2. For m = 0, 1, …, the MMCD updates the intercept by (16), and uses (14) or (15) to update ${β̂}_{j}^{m}$ to ${β̂}_{j + 1}^{m}$ for the penalized variables. After each coordinate update, it also computes the corresponding linear component ${η̂}_{j + 1}^{m}$ using (17). The MMCD cycles through all the coordinates in this way, so that β̂^m is updated to β̂^{m+1}.
3. The MMCD checks the convergence criterion. If the criterion is met, the algorithm stops; otherwise it repeats step 2 until convergence.
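The steps above can be sketched in Python for the MCP penalty. This is a minimal illustration under stated assumptions and not the FORTRAN implementation used in the paper: the name `mmcd_mcp` is hypothetical, the initial value is taken to be zero, the mean function ψ̇ and the bound M are passed in by the caller, and the relative-change stopping rule with constants `tol` and `delta` is one possible choice of convergence criterion.

```python
import math


def soft_threshold(t, lam):
    """S(t, lam) = sign(t) * (|t| - lam)_+."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def mcp_update(tau, M, lam, gamma):
    """Coordinate-wise MCP solution (15)."""
    if abs(tau) <= M * gamma * lam:
        return soft_threshold(tau, lam) / (M - 1.0 / gamma)
    return tau / M


def mmcd_mcp(X, y, lam, gamma, psi_dot, M, tol=1e-3, delta=0.01, max_iter=1000):
    """MMCD sketch. X is a list of rows [1, x_i1, ..., x_ip] with
    standardized predictor columns; psi_dot is the GLM mean function."""
    if gamma <= 1.0 / M:
        raise ValueError("condition (b) requires gamma > 1/M for the MCP")
    n, q = len(X), len(X[0])
    beta = [0.0] * q                  # step 1: initial value (assumed zero)
    eta = [0.0] * n                   # linear component for beta = 0
    for _ in range(max_iter):         # step 2: cycle over coordinates
        old = list(beta)
        for j in range(q):
            grad = sum(X[i][j] * (y[i] - psi_dot(eta[i])) for i in range(n)) / n
            tau = M * beta[j] + grad
            # Intercept (j = 0) is unpenalized, see (16).
            new = tau / M if j == 0 else mcp_update(tau, M, lam, gamma)
            change = new - beta[j]
            if change != 0.0:
                for i in range(n):    # incremental update (17)
                    eta[i] += X[i][j] * change
                beta[j] = new
        # Step 3: relative change in the coefficient vector.
        num = math.sqrt(sum((b - o) ** 2 for b, o in zip(beta, old)))
        den = math.sqrt(sum(o * o for o in old)) + delta
        if num / den < tol:
            break
    return beta
```

Note that no scaling factor is computed inside the loop: the only data-dependent quantity per coordinate is the gradient term in τ_j.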

3.1 Convergence analysis of MMCD

Theorem 1 establishes that under certain conditions, the MMCD solution converges to a minimum of the objective function.

Theorem 1 Consider the objective function (2), where the given data (y, X) lies in a compact set and no two columns of X are identical. For a given (λ, γ), suppose the penalty ρ(|t|;λ, γ) ≡ ρ(t) satisfies ρ(t) = ρ(−t) and ρ′(|t|) is a non-negative, uniformly bounded function, with ρ′(|t|) being the first derivative (assuming existence) of ρ(|t|) w.r.t. |t|. Also assume that the two conditions (a) and (b) stated in the MMCD algorithm hold.

Then the sequence generated by the MMCD algorithm {β^m} converges to a minimum of the function Q(β;λ, γ).

The condition on (y, X) is mild. The standardization of the columns of X can be performed as long as no columns are zero. The proof of Theorem 1 is given in the Appendix. It extends the convergence results of Mazumder, Friedman and Hastie (2011) for their SparseNet algorithm with a least squares loss. Theorem 1 covers more general loss functions.

4 The MMCD for Penalized Logistic Regression

As mentioned in the introduction, the MMCD algorithm is particularly suitable for logistic regression, which is one of the most widely used models in biostatistical applications. For a logistic regression model, the response y is a vector of 0s and 1s, with 1 indicating the event of interest. The first and second derivatives of the loss function are ∇_jℓ(β̂) = −(x^j)^T (y − π̂)/n and $\nabla_{j}^{2} ℓ (β̂) = n^{- 1} \sum w_{i} x_{i j}^{2}$ , with w_i = π̂_i(1 − π̂_i) and π̂_i being the estimated probability of the ith observation given the current estimate β̂, i.e. ${π̂}_{i} = 1 / (1 + exp (- x_{i}^{T} β̂))$ . Since π(1 − π) ≤ 1/4 for any 0 ≤ π ≤ 1 (Böhning and Lindsay (1988)), an upper bound of $\nabla_{j}^{2} ℓ (β̂)$ is M = 1/4 for standardized x^j. Correspondingly, τ_j = 4⁻¹ β̂_j + n⁻¹ (x^j)^T (y − π̂) for j = 0, …, p. By condition (b), we require γ > 5 for the SCAD penalty and γ > 4 for the MCP penalty.
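The bound M = 1/4 and the resulting constraints on γ can be verified numerically. The Python sketch below is illustrative; `scaling_factor` and `min_gamma` are hypothetical helper names, and the predictor column is assumed to be standardized so that its mean square equals 1.

```python
import math


def logistic_weights(eta):
    """w_i = pi_i * (1 - pi_i) for the logistic model."""
    probs = [1.0 / (1.0 + math.exp(-e)) for e in eta]
    return [p * (1.0 - p) for p in probs]


def scaling_factor(xj, eta):
    """delta_j = n^{-1} sum_i w_i x_ij^2, the second partial derivative."""
    w = logistic_weights(eta)
    return sum(wi * x * x for wi, x in zip(w, xj)) / len(xj)


def min_gamma(M, penalty):
    """Lower limit (exclusive) on gamma implied by condition (b)."""
    if penalty == "scad":
        return 1.0 + 1.0 / M
    if penalty == "mcp":
        return 1.0 / M
    raise ValueError("penalty must be 'scad' or 'mcp'")


# Usage: a standardized column and two different linear components.
xj = [1.0, -1.0, 1.0, -1.0]
delta_at_zero = scaling_factor(xj, [0.0, 0.0, 0.0, 0.0])   # attains the bound
delta_elsewhere = scaling_factor(xj, [2.0, -1.0, 0.5, 3.0])
```

The scaling factor attains 1/4 only when every fitted probability equals 1/2 and is strictly smaller otherwise, which is why a single constant M can replace δ_j in every update.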

4.1 Computation of Solution Surface

A common practice in applying the SCAD and MCP penalties is to calculate the solution path in λ for a fixed value of γ. For example, for linear regression models with standardized variables, it has been suggested one uses γ ≈ 3.7 for the SCAD (Fan and Li (2001)) and γ ≈ 2.7 (Zhang (2010)) for the MCP. However, in a GLM model including the logistic regression, these values may not be appropriate. Therefore, we use a data driven procedure to choose γ together with λ. This requires the computation of solution surface over a two-dimensional grid of (λ, γ). We re-parameterize κ = 1/γ to facilitate the description of the approach for computing the solution surface.

Define the grid values in [0, κ_max) × [λ_min, λ_max] to be 0 = κ₁ ≤ κ₂ ≤ ⋯ ≤ κ_K < κ_max and λ_max = λ₁ ≥ λ₂ ≥ … ≥ λ_V = λ_min. The numbers of grid points K and V are pre-specified. In our implementation, the κ-grid points are uniform in the standard scale while those for λ are uniform in the log scale. The κ_max is the largest value of κ such that condition (b) of the MMCD algorithm is valid. We have κ_max = 1/5 for the SCAD and κ_max = 1/4 for the MCP. Note that when κ = 0, the SCAD and MCP penalties both become the Lasso penalty. The λ_max is the smallest value of λ such that β̂_j = 0, j = 1, …, p. For the logistic regression model, λ_max = n⁻¹ max_j|(x^j)^T (y − π̂)| for every κ_k with π̂ = ȳJ and J being a vector whose elements are all equal to 1. We set λ_min = ελ_max, with ε = 0.0001 if n > p and ε = 0.01 otherwise. The solution surface is then calculated over the rectangle [0, κ_max) × [λ_min, λ_max]. Denote the MMCD solution for a given (κ_k, λ_v) by β̂_{κ_k,λ_v}.
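A possible construction of the two grids is sketched below in Python. It is an illustration under assumptions: the helper names are hypothetical, and the exact spacing of the κ-grid (here κ_k = (k − 1)κ_max/K, which excludes κ_max itself) is our choice, since the text only states that the grid is uniform.

```python
import math


def kappa_grid(kappa_max, K):
    """K uniformly spaced points in [0, kappa_max), starting at 0 (Lasso)."""
    return [kappa_max * k / K for k in range(K)]


def lambda_grid(lam_max, eps, V):
    """V points from lam_max down to eps * lam_max, uniform on the log scale."""
    if V == 1:
        return [lam_max]
    step = math.log(eps) / (V - 1)
    return [lam_max * math.exp(step * v) for v in range(V)]


def logistic_lambda_max(columns, y):
    """lambda_max = n^{-1} max_j |(x^j)^T (y - ybar)| over penalized columns."""
    n = len(y)
    ybar = sum(y) / n
    inner = [abs(sum(x * (yi - ybar) for x, yi in zip(col, y))) for col in columns]
    return max(inner) / n


# Usage with the MCP limit kappa_max = 1/4 and the p > n choice eps = 0.01.
kappas = kappa_grid(0.25, 10)
lambdas = lambda_grid(2.0, 0.01, 100)
```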

We follow the approach of Mazumder, Friedman and Hastie (2011) to compute the solution surface by initializing the MMCD algorithm at the Lasso solutions on a grid of λ values. The Lasso solutions correspond to κ = 0. Then for each point in the grid of λ values, we compute the solutions on a grid of κ values starting from κ = 0, using the solution at the previous point as the initial value for the current point. The details of the approach are as follows.

1. The approach first computes the Lasso solution along λ. When computing β̂_{κ₁, λ_{v+1}}, it uses β̂_{κ₁, λ_v} as the initial value in the MMCD algorithm.
2. For a given λ_v, the approach computes the solution along κ. That is, β̂_{κ_k, λ_v} is used as the initial value in computing β̂_{κ_{k+1}, λ_v}.
3. The approach then cycles through v = 1, …, V for step 2 to complete the solution surface.
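The warm-start order described above can be sketched as follows. The Python code is an illustration only: `solution_surface` is a hypothetical name, and `solve(lam, kappa, init)` stands for any routine (such as an MMCD solver) that returns the solution at (κ, λ) started from `init`.

```python
def solution_surface(lambdas, kappas, solve, beta_start):
    """Fill a dict keyed by (k, v) with solutions over the (kappa, lambda) grid.

    lambdas is decreasing, kappas is increasing with kappas[0] == 0 (Lasso).
    """
    surface = {}

    # Step 1: Lasso path along lambda, each point warm-started at the previous one.
    init = beta_start
    for v, lam in enumerate(lambdas):
        init = solve(lam, kappas[0], init)
        surface[(0, v)] = init

    # Steps 2 and 3: for every lambda, move along kappa starting from the Lasso fit.
    for v, lam in enumerate(lambdas):
        init = surface[(0, v)]
        for k in range(1, len(kappas)):
            init = solve(lam, kappas[k], init)
            surface[(k, v)] = init
    return surface
```

Passing the solver as an argument keeps the traversal logic separate from the coordinate updates, so the same loop could drive either penalty.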

Define a variable to be a causal one if its coefficient β ≠ 0; otherwise call it a null variable. Figure 1 presents the solution paths of a causal variable with β = 2 along κ using the MCP penalty. Observe that the estimates could change substantially when κ crosses certain values. This justifies our treating κ as a tuning parameter, since a pre-specified κ might not give the optimal result. This is the reason why we prefer a data-driven procedure to choose both κ and λ.

Fig. 1. Plots of solution paths along κ. It shows the paths for a causal variable with β = 2. Observe that the estimates could change substantially when κ crosses certain threshold values.

4.2 Design of simulation study

Let Z be the design matrix of the covariates, that is, it is a sub-matrix of X with its first column removed. Let A₀ ≡ {1 ≤ j ≤ p : β_j ≠ 0} be the set of causal variables with dimension p₀. We fix p₀ = 10, β₀ = 0.0 and the coefficients of A₀ to be (±0.6, ±1.2, 2.4, ∓0.6, ∓1.2, ±2.4)^T such that the signal-to-noise ratio (SNR), defined as $S N R = \sqrt{β^{T} X^{T} X β / n}$ , is approximately in the range of (3, 4). The covariates are generated from a multivariate normal distribution with zero means and covariance matrix Σ, with Σ being a positive-definite matrix of dimension p × p. The values of the response variable are generated as y_i ~ Bernoulli(p_i), with p_i = exp(β^T x_i)/(1 + exp(β^T x_i)) for i = 1, …, n.

We consider five types of correlation structures for Σ.

(a)
Independent structure (IN) for the p penalized variables. Here Σ = I_p, with I_p being the identity matrix of dimension p × p.
(b)
Separate structure (SP). The causal and null variables are independent. Let Σ₀ and Σ₁ be the covariance matrix for the causal and the null variables, respectively, then Σ = block diagonal(Σ₀, Σ₁). Within each set of variables, we assume a compound symmetry structure, that is, ρ(x_ij, x_ik) = ρ for j ≠ k.
(c)
Partial correlated structure (PC), i.e. part of the causal variables are correlated with part of the null variables. Specifically, Σ = block diagonal(Σ_a, Σ_b, Σ_c), with Σ_a being the covariance matrix for the first 5 causal variables; Σ_b being the covariance matrix for the remaining 5 causal variables and 5 null variables; Σ_c being the covariance matrix for the remaining null variables. We also assume a compound symmetry structure within Σ_a, Σ_b, Σ_c.
(d)
First-order auto-regressive (AR) structure, i.e. ρ(x_ij, x_ik) = ρ^(|j−k|), for j ≠ k.
(e)
Compound symmetry (CS) structure for p variables.
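The five structures can be built from two primitives and a block-diagonal helper. The Python sketch below is illustrative and rests on assumptions not stated in the text: all variances are set to 1, the causal variables occupy the first p₀ = 10 positions, and the function names are hypothetical.

```python
def compound_symmetry(size, rho):
    """Unit diagonal with a common off-diagonal correlation rho."""
    return [[1.0 if j == k else rho for k in range(size)] for j in range(size)]


def ar1(size, rho):
    """First-order auto-regressive structure: entry (j, k) is rho^|j - k|."""
    return [[rho ** abs(j - k) for k in range(size)] for j in range(size)]


def block_diagonal(blocks):
    """Place square blocks along the diagonal; zeros elsewhere."""
    total = sum(len(b) for b in blocks)
    out = [[0.0] * total for _ in range(total)]
    offset = 0
    for b in blocks:
        for j, row in enumerate(b):
            for k, val in enumerate(row):
                out[offset + j][offset + k] = val
        offset += len(b)
    return out


def make_sigma(structure, p, rho=0.5, p0=10):
    """Covariance matrix for the structures (a) to (e); assumes p > 15."""
    if structure == "IN":
        return compound_symmetry(p, 0.0)
    if structure == "SP":
        return block_diagonal([compound_symmetry(p0, rho),
                               compound_symmetry(p - p0, rho)])
    if structure == "PC":
        # 5 causal | 5 causal + 5 null | remaining null variables.
        return block_diagonal([compound_symmetry(5, rho),
                               compound_symmetry(10, rho),
                               compound_symmetry(p - 15, rho)])
    if structure == "AR":
        return ar1(p, rho)
    if structure == "CS":
        return compound_symmetry(p, rho)
    raise ValueError("unknown structure")
```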

4.3 Numerical implementation of the LLA algorithm

The basic idea of the LLA is to approximate a concave penalty ρ(|β_j|; γ, λ) by $ρ̇ (| {β̂}_{j}^{m} |; γ, λ) | β_{j} |$ based on the current estimate β̂^m. For the logistic regression, we also use a quadratic approximation (7) to the loss based on β̂^m. To compute β̂^{m+1} using the LLA, we minimize a Lasso-type criterion

ℓ^{s} (β | {β̂}^{m}) + \sum_{j = 1}^{p} ρ̇ (| {β̂}_{j}^{m} |; γ, λ) | β_{j} | .

(18)

To compare the MMCD with the LLA, we implement the LLA algorithm in two ways. The first implementation strictly follows the description in Zou and Li (2008). This uses working data based on the current estimate and separates the design matrix into two parts, $U = \{j : ρ̇ (| {β̂}_{j}^{m} |; γ, λ) = 0\}$ and $V = \{j : ρ̇ (| {β̂}_{j}^{m} |; γ, λ) \neq 0\}$ for a current estimate β̂^m, with ρ̇(t) being the derivative of ρ(·). The computation of β̂^{m+1} involves ${(X_{U}^{* T} X_{U}^{*})}^{- 1}$ , with $X_{U}^{*} = (X_{j} : j \in U)$ being the design matrix of variables in U. Hence, the solution could be non-unique if n < p_U, with p_U being the number of variables in U. Therefore, this approach generally only works in the settings with n > p.

In the second implementation, we use the CDA to minimize (18). This implementation can handle data sets with p ≫ n. We call this implementation LLA-CD algorithm below.
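The weights in (18) are the penalty derivatives evaluated at the current estimate. The Python sketch below illustrates them for the MCP and SCAD, together with the split into the sets U and V; the function names are hypothetical and the derivative formulas follow from differentiating (3) and (4).

```python
def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP w.r.t. |t|: lam * (1 - |t| / (gamma * lam))_+."""
    return max(lam - abs(t) / gamma, 0.0)


def scad_derivative(t, lam, gamma):
    """Derivative of the SCAD penalty w.r.t. |t|."""
    a = abs(t)
    if a <= lam:
        return lam
    return max(gamma * lam - a, 0.0) / (gamma - 1.0)


def lla_weights(beta, lam, gamma, derivative):
    """Lasso weights for the penalized coefficients (intercept excluded)."""
    return [derivative(b, lam, gamma) for b in beta]


def split_by_weight(weights):
    """Index sets U (zero weight, unpenalized) and V (positive weight)."""
    U = [j for j, w in enumerate(weights) if w == 0.0]
    V = [j for j, w in enumerate(weights) if w != 0.0]
    return U, V


# Usage: a zero coefficient gets the full Lasso weight lam,
# a large coefficient gets weight zero and moves to U.
weights = lla_weights([0.0, 0.3, -5.0], 1.0, 3.0, mcp_derivative)
U, V = split_by_weight(weights)
```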

Since both implementations require an initial estimate β̂ to approximate the penalty, we use the Lasso solutions to initiate the computation along κ for the LLA and LLA-CD algorithms. The LLA, adaptive rescaling, LLA-CD and MMCD algorithms are programmed in FORTRAN with similar programming structures for fair comparison. We observed that the adaptive rescaling algorithm does not converge within 1,000 iterations if κ_max is large. Hence, we set κ_max = 0.25 for the adaptive rescaling algorithm in our computation. We use the convergence criterion ‖β̂^{m+1} − β̂^m‖₂/(‖β̂^m‖₂ + δ) < ε. We choose δ = 0.01, and ε = 0.001 if n > p and ε = 0.01 if n < p. We set the correlation coefficient ρ = 0.5 and the numbers of grid points K = 10 and V = 100.
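The convergence criterion is a relative change in the ℓ₂ norm, with δ guarding against division by zero when the current estimate is the zero vector. A short Python illustration (the function name `has_converged` is hypothetical):

```python
import math


def has_converged(beta_new, beta_old, eps=0.001, delta=0.01):
    """Check ||beta_new - beta_old||_2 / (||beta_old||_2 + delta) < eps."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_new, beta_old)))
    den = math.sqrt(sum(b * b for b in beta_old)) + delta
    return num / den < eps
```

The defaults correspond to the n > p setting; for n < p the text uses ε = 0.01.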

4.4 Comparison of computational efficiency

We first provide a brief description of the available algorithms, namely the LQA, perturbed LQA, LLA and MIST. These four algorithms share the same spirit in the sense that they all use a surrogate function to majorize the penalty ρ(|t|; λ, γ). The LQA uses the following approximation,

ρ (| t |; λ, γ) \approx ρ (| t_{0} |; λ, γ) + \frac{ρ' (| t_{0} |; λ, γ)}{2 | t_{0} |} (t^{2} - t_{0}^{2}), t \approx t_{0} .

(19)

Then a Newton-Raphson type iteration is employed to minimize the penalized criterion with the surrogate penalty function. When t₀ is close to zero, the algorithm is unstable. Fan and Li (2001) suggested that if β̂_j is small enough, say |β̂_j| < ε (a pre-specified value), one sets β̂_j = 0 and removes the jth variable from iterations. Hence, a drawback of the LQA algorithm is that once a variable is removed in any iteration, it is necessarily excluded from the final model. Hunter and Li (2005) proposed a perturbed version of LQA to majorize the LQA,

ρ (| t |; λ, γ) \approx ρ (| t_{0} |; λ, γ) + \frac{ρ' (| t_{0} |; λ, γ)}{2 | t_{0} + τ_{0} |} (t^{2} - t_{0}^{2}), t \approx t_{0} .

(20)

In practice, determining the size of τ₀ is not easy, since τ₀ could affect both the speed of convergence and the sparsity of the solution. The LLA algorithm of Zou and Li (2008) approximates the penalty by

ρ (| t |; λ, γ) \approx ρ (| t_{0} |; λ, γ) + ρ' (| t_{0} |; λ, γ) (| t | - | t_{0} |), t \approx t_{0} .

(21)

The MIST (Schifano, Strawderman and Wells (2010)) algorithm extends the LLA to several other penalties.
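The surrogates (19) and (21) can be compared numerically for the MCP. The Python sketch below is an illustration with hypothetical function names; it uses the closed form of (4) and its derivative, and requires t₀ ≠ 0 for the LQA surrogate.

```python
def mcp_penalty(t, lam, gamma):
    """MCP penalty (4) in closed form."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0


def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP w.r.t. |t|."""
    return max(lam - abs(t) / gamma, 0.0)


def lqa_surrogate(t, t0, lam, gamma):
    """Local quadratic approximation (19); undefined at t0 = 0."""
    slope = mcp_derivative(t0, lam, gamma)
    return mcp_penalty(t0, lam, gamma) + slope / (2.0 * abs(t0)) * (t * t - t0 * t0)


def lla_surrogate(t, t0, lam, gamma):
    """Local linear approximation (21)."""
    slope = mcp_derivative(t0, lam, gamma)
    return mcp_penalty(t0, lam, gamma) + slope * (abs(t) - abs(t0))


# Usage: evaluate the penalty and both surrogates on a grid around t0 = 0.5.
grid = [i / 10.0 for i in range(-30, 31)]
rows = [(t, mcp_penalty(t, 1.0, 3.0),
         lla_surrogate(t, 0.5, 1.0, 3.0),
         lqa_surrogate(t, 0.5, 1.0, 3.0)) for t in grid]
```

On such a grid the LQA surrogate lies on or above the LLA surrogate, which in turn lies on or above the penalty, and all three coincide at t = t₀.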

We focus on the comparison of computational efficiency of the LLA, adaptive rescaling, LLA-CD and MMCD for the MCP penalized logistic regression models, since the adaptive rescaling approach can only be applied to the MCP penalty. The computation is done on Intel Xeon (W3540 @ 2.93 GHz) machines with the Ubuntu 10.04 operating system (kernel version 2.6). We consider two settings, n > p and n < p.

Figure 2 shows the average elapsed time measured in seconds based on 100 replications for p = 100, 200 and 500 with a fixed sample size n = 1,000. Observe that the time for the adaptive rescaling algorithm increases dramatically when n = 1,000 and p = 500. This suggests that the ratio p/n has a greater impact on the efficiency of the adaptive rescaling algorithm. The speed of the LLA-CD is also affected by the p/n ratio, although to a lesser degree. The MMCD and LLA are fairly stable with respect to changes in the p/n ratio. Overall, the MMCD is the fastest one. It is worth noting that in the setting with n = 1,000, p = 500, the adaptive rescaling and LLA-CD did not converge within 1,000 iterations for a convergence criterion of ε = 0.001 in some replications.

Fig. 2. Computational efficiency of the LLA, adaptive rescaling, LLA-CD and MMCD algorithms with fixed sample size (n = 1,000). The solid, dashed, dotted lines and dashed line with dark circles represent the average elapsed times of the MMCD, LLA-CD, adaptive rescaling and LLA, respectively. Here, IN, SP, PC, AR and CS refer to the five correlation structures of covariates described in Subsection 4.2.

For high-dimensional data with p ≫ n, we focus on the comparison among the adaptive rescaling, LLA-CD and MMCD. We use two sample sizes, n = 100 and n = 300. The number of variables p is set to 500, 1,000, 2,000, 5,000 and 10,000. Figure 3 presents the average elapsed times in seconds based on 100 replications for n = 300. It shows that as p increases, the advantage of the MMCD becomes more apparent. For a fixed p, the MMCD gains more efficiency when the predictors are correlated and n is large. In addition, the MMCD has the smallest standard error of computation time, followed by the LLA-CD and adaptive rescaling. This suggests that the MMCD is the most stable of the three algorithms considered here in high-dimensional settings.

Fig. 3. Computational efficiency of the adaptive rescaling, LLA-CD and MMCD algorithms for p ≫ n models. The sample size is n = 300. The solid, dashed and dotted lines represent the average elapsed times of the MMCD, LLA-CD and adaptive rescaling, respectively. Here, IN, SP, PC, AR and CS refer to the five correlation structures of covariates described in Subsection 4.2.

4.5 Comparison of selection performance

We further compare the selection performance of the LLA, adaptive rescaling, LLA-CD and MMCD for the MCP penalized logistic models. Since we are not addressing the issue of tuning parameter selection in this article, the algorithms are compared based on the models with the best predictive performance rather than the models chosen by a particular tuning parameter selection approach. This is done as follows. We first compute the solution surface over [0, κ_max) × [λ_min, λ_max] by each algorithm based on training datasets. Given the solution β̂_{κ_k,λ_v}, we compute the predictive area under the ROC curve (PAUC) AUC_{(κ_k,λ_v)} for each β̂_{κ_k,λ_v} based on a validation set with n* = 3,000. The model corresponding to the maximum predictive AUC_{(κ_k,λ_v)} is selected as the final model for comparison. We compare the results in terms of three quantities: model size (MS), defined as the total number of selected variables; false discovery rate (FDR), defined as the proportion of false positive variables among the selected variables; and the maximum predictive area under the ROC curve (PAUC) on the validation dataset. The results reported below are based on 1,000 replicates.
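The three comparison criteria and the choice of the final model can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the code used for the simulations; the function names, the dictionary layout of `fits`, and the convention that an empty model has FDR 0 are assumptions made here.

```python
def model_size(beta):
    """Number of selected variables (nonzero slopes, intercept excluded)."""
    return sum(1 for b in beta if b != 0.0)


def false_discovery_rate(beta, true_support):
    """Proportion of selected variables that are not in the true model."""
    selected = {j for j, b in enumerate(beta) if b != 0.0}
    if not selected:
        return 0.0  # assumed convention: an empty model has FDR 0
    return len(selected - set(true_support)) / len(selected)


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_by_pauc(fits, x_valid, y_valid):
    """Return the grid point whose fit maximises the validation AUC.

    `fits` maps a (kappa, lambda) pair to (intercept, list of slopes).
    """
    def pauc(fit):
        b0, beta = fit
        scores = [b0 + sum(b * v for b, v in zip(beta, row)) for row in x_valid]
        return auc(scores, y_valid)

    key = max(fits, key=lambda k: pauc(fits[k]))
    return key, pauc(fits[key])
```

The pairwise AUC is quadratic in the validation sample size; a rank-based formula would be preferable for n* = 3,000, but the pairwise form keeps the definition visible.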

Table 1 presents the comparison among the four algorithms in the n > p setting with n = 1,000 and p = 100. The results show that the models selected by the four approaches have similar PAUC. In terms of model size and FDR, the MMCD and LLA-CD have similar performance, both having smaller model size and FDR than the adaptive rescaling and LLA. Table 2 presents the comparison among the adaptive rescaling, LLA-CD and MMCD in the high-dimensional setting with n = 100 and p = 2,000. Similar to the low-dimensional case, the PAUCs of the three methods are almost identical. In terms of model size and FDR, the MMCD and LLA-CD have similar results.

Table 1.

Comparison of selection performance among four algorithms with n = 1,000 and p = 100. PAUC refers to the maximum predictive area under ROC curve (PAUC) of the validation dataset. MS is model size. FDR is false discovery rate. SE is the standard errors based on 1,000 replications. Here, IN, SP, PC, AR and CS refer to the five correlation structures of covariates described in Subsection 4.2.

Σ (SNR)	Algorithm	PAUC (SE × 10⁵)	MS (SE × 10¹)	FDR (SE × 10³)
IN (4.32)	LLA	0.947 (4.58)	10.96 (0.55)	0.07 (3.32)
	Adp res	0.948 (4.35)	16.28 (1.21)	0.36 (4.23)
	LLA-CD	0.948 (3.45)	10.79 (0.57)	0.06 (3.09)
	MMCD	0.948 (3.37)	10.90 (0.56)	0.07 (3.31)
SP (3.05)	LLA	0.915 (7.74)	11.39 (0.77)	0.10 (3.96)
	Adp res	0.916 (6.93)	14.14 (0.87)	0.27 (3.90)
	LLA-CD	0.917 (7.24)	11.35 (0.84)	0.10 (4.10)
	MMCD	0.917 (6.67)	11.27 (0.64)	0.10 (3.57)
PC (3.89)	LLA	0.945 (5.95)	14.25 (1.50)	0.24 (5.72)
	Adp res	0.947 (5.25)	15.55 (1.09)	0.33 (3.97)
	LLA-CD	0.947 (5.61)	11.61 (1.07)	0.11 (4.46)
	MMCD	0.947 (5.07)	11.41 (0.79)	0.10 (3.93)
AR (3.20)	LLA	0.921 (8.83)	13.83 (1.28)	0.24 (5.34)
	Adp res	0.924 (6.73)	18.76 (1.34)	0.44 (3.94)
	LLA-CD	0.924 (7.88)	11.29 (0.76)	0.10 (3.49)
	MMCD	0.925 (5.98)	12.11 (0.82)	0.15 (4.45)
CS (3.06)	LLA	0.919 (8.13)	12.42 (1.07)	0.16 (4.70)
	Adp res	0.921 (7.02)	14.15 (0.90)	0.27 (3.98)
	LLA-CD	0.922 (6.61)	10.64 (0.54)	0.05 (2.80)
	MMCD	0.922 (6.60)	10.94 (0.58)	0.07 (3.32)


Table 2.

Comparison of selection performance among the adaptive rescaling, LLA-CD and MMCD with n = 100 and p = 2,000. PAUC refers to the maximum predictive area under the ROC curve of the validation dataset. MS is model size. FDR is false discovery rate. SE is the standard error based on 1,000 replications. Here, IN, SP, PC, AR and CS refer to the five correlation structures of covariates described in Subsection 4.2.

Σ (SNR)	Algorithm	PAUC (SE × 10⁵)	MS (SE × 10¹)	FDR (SE × 10³)
IN (4.33)	Adp res	0.828 (1.30)	12.25 (3.01)	0.60 (6.95)
	LLA-CD	0.842 (1.28)	5.56 (2.02)	0.25 (8.42)
	MMCD	0.844 (1.27)	6.41 (2.09)	0.28 (9.06)
SP (3.05)	Adp res	0.778 (1.96)	12.06 (3.77)	0.62 (6.05)
	LLA-CD	0.795 (1.74)	5.25 (2.15)	0.26 (8.16)
	MMCD	0.797 (1.76)	5.75 (2.25)	0.28 (8.34)
PC (3.87)	Adp res	0.872 (0.64)	7.12 (1.37)	0.42 (6.43)
	LLA-CD	0.877 (0.54)	5.19 (1.29)	0.24 (7.48)
	MMCD	0.877 (0.54)	5.37 (1.27)	0.26 (7.46)
AR (3.19)	Adp res	0.812 (1.21)	6.21 (1.69)	0.49 (8.53)
	LLA-CD	0.830 (1.10)	3.02 (0.71)	0.17 (7.73)
	MMCD	0.831 (1.07)	3.21 (0.94)	0.18 (8.10)
CS (3.04)	Adp res	0.770 (1.79)	11.89 (3.58)	0.64 (6.32)
	LLA-CD	0.776 (1.80)	6.99 (3.09)	0.37 (9.27)
	MMCD	0.781 (1.80)	7.39 (2.99)	0.39 (9.49)


4.6 Application to a Cancer Gene Expression Dataset

The purpose of this study is to discover the biomarkers associated with the prognosis of breast cancer (van’t Veer et al. (2002); van de Vijver et al. (2002)). Approximately 25,000 genes were scanned using microarrays for n = 295 patients. Metastasis within five years is modeled as the outcome. A subset of 1,000 genes with the highest Spearman correlations with the outcome is used in the penalized models to stabilize the computation. For the same reason as in the simulation study, we do not resort to any tuning parameter selection procedure to choose models for comparison. Instead, we randomly partition the whole dataset (n = 295) into a training dataset (approximately 1/3 of the observations) and a validation dataset (approximately 2/3 of the observations). The model fitting is based solely on the training dataset; the solution corresponding to the maximum predictive AUC on the validation dataset is chosen as the final model for comparison. We repeat this random partition 900 times.

Table 3 shows the results for the SCAD penalty using the MMCD, and the MCP penalty using the adaptive rescaling, LLA-CD and MMCD. The PAUCs from different approaches are close to each other. The model size of the MCP penalty by the LLA-CD algorithm happens to be the largest.

Table 3.

Application of the SCAD and MCP to a microarray dataset. The averages and standard errors are computed over the 900 random splits. The predictive AUC is the maximum predictive AUC on the validation dataset created by each random split. In each split, approximately 100 samples are assigned to the training dataset and approximately 200 samples to the validation dataset.

Solution surface	PAUC(SE*10³)	MS(SE)
SCAD (MMCD)	0.7567 (0.99)	35.50 (0.47)
MCP(Adap res)	0.7565 (1.15)	39.06 (0.68)
MCP(LLA-CD)	0.7537 (0.99)	43.07 (0.63)
MCP(MMCD)	0.7570 (0.99)	35.66 (0.49)


Finally, we present the results for the breast cancer study using the cross-validated area under ROC curve (CV-AUC) as a tuning parameter selection method. This method uses a combination of cross validation and ROC methodology. For details of using the CV-AUC for tuning parameter selection in penalized logistic regression, we refer to Jiang, Huang, and Zhang (2011). We use 5-fold cross validation to compute the CV-AUC. For this dataset, the SCAD using the MMCD, the MCP using the adaptive rescaling and the LLA-CD select the same model with 67 variables and CV-AUC = 0.7808. The MCP using the MMCD selects 16 variables with CV-AUC = 0.8024.

5 Discussion

In this article, we propose the MMCD algorithm for computing the concave penalized solutions. Our simulation studies and data example demonstrate that this algorithm is efficient in concave penalized logistic regression models with p ≫ n. Unlike the existing algorithms, such as the LQA, LLA and MIST that approximate the penalty function, the MMCD seeks a closed form solution for each coordinate using the exact penalty term. The majorization is only applied to the loss function. This approach increases the efficiency of CDA in high-dimensional settings. The convergence of the MMCD is proved under reasonable conditions.
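As a rough illustration of the closed-form coordinate update described above, the following Python sketch applies the MMCD idea to MCP penalized logistic regression. It is not the authors' implementation (the cvplogistic package is written for R). The function names are hypothetical, and the sketch assumes the standard MCP form with parameters λ (`lam`) and γ (`gamma`), an average negative log-likelihood loss, covariates standardized so that each column has mean square 1, and M = 1/4 with γ > 1/M.

```python
import math


def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def mcp(t, lam, gamma):
    """Standard MCP value (assumed form, Zhang (2010))."""
    a = abs(t)
    return lam * a - a * a / (2 * gamma) if a <= gamma * lam else 0.5 * gamma * lam ** 2


def objective(x, y, b0, beta, lam, gamma):
    """Q(beta): average negative log-likelihood plus the MCP on the slopes."""
    loss = 0.0
    for row, yi in zip(x, y):
        eta = b0 + sum(b * v for b, v in zip(beta, row))
        loss += math.log1p(math.exp(eta)) - yi * eta
    return loss / len(y) + sum(mcp(b, lam, gamma) for b in beta)


def mmcd_sweep(x, y, b0, beta, lam, gamma, M=0.25):
    """One sequential pass over the intercept and all slopes (beta is updated in place)."""
    n = len(y)
    eta = [b0 + sum(b * v for b, v in zip(beta, row)) for row in x]

    def grad(col):
        # derivative of the average negative log-likelihood along one column,
        # evaluated at the current linear predictor eta
        return -sum(c * (yi - 1 / (1 + math.exp(-e))) for c, yi, e in zip(col, y, eta)) / n

    # intercept: unpenalised, so the surrogate minimiser is a plain gradient step
    step = -grad([1.0] * n) / M
    b0 += step
    eta = [e + step for e in eta]

    for j in range(len(beta)):
        col = [row[j] for row in x]
        tau = M * beta[j] - grad(col)
        # exact minimiser of 0.5*M*u^2 - tau*u + rho(|u|); the penalty is not approximated
        if abs(tau) <= gamma * lam * M:
            new = soft_threshold(tau, lam) / (M - 1 / gamma)
        else:
            new = tau / M
        eta = [e + (new - beta[j]) * c for e, c in zip(eta, col)]
        beta[j] = new
    return b0, beta
```

Because the curvature M is a fixed constant, no per-coordinate scaling factor has to be recomputed inside the loop; only the gradient changes from one update to the next. The sketch omits safeguards against overflow in `math.exp` for very large linear predictors.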

The comparison among the LLA, adaptive rescaling, LLA-CD and MMCD indicates that the MMCD is the most efficient approach, especially for large p and correlated covariates, although the LLA-CD is competitive in some cases. The LLA-CD implements the adjacent initiation idea to reduce the computational cost; that is, it uses β̂_{κ_k,λ_v} as the initial value for computing β̂_{κ_{k+1},λ_v}. Within the CDA component, the solutions are updated sequentially, i.e. using $β̂_{j}^{m + 1}$ to compute $β̂_{j + 1}^{m + 1}$, rather than in vector form, which uses β̂^m to compute β̂^{m+1}. This is different from the LLA-LARS implementation by Breheny and Huang (2011). The adjacent initiation and the sequential updating scheme may be the main reasons why the two implementations of the LLA perform so differently.
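The difference between sequential and vector-form updating can be seen on a toy problem. The Python sketch below minimizes a small quadratic 0.5·xᵀAx − bᵀx coordinate by coordinate; it is a generic illustration with hypothetical names and is not taken from the LLA-CD or LLA-LARS code.

```python
def sweep(a, b, x, sequential=True):
    """One pass of coordinate-wise minimisation of 0.5*x'Ax - b'x.

    sequential=True  : coordinate j+1 sees the freshly updated coordinate j.
    sequential=False : every coordinate is computed from the previous iterate.
    """
    source = x if sequential else list(x)  # frozen copy for the vector form
    for j in range(len(x)):
        off_diag = sum(a[j][k] * source[k] for k in range(len(x)) if k != j)
        x[j] = (b[j] - off_diag) / a[j][j]
    return x
```

With A = [[2, 1], [1, 2]], b = [1, 1] and a zero start, one sequential pass gives [0.5, 0.25] while one vector-form pass gives [0.5, 0.5]; the minimizer is [1/3, 1/3], so the sequential pass is already closer.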

The application of the MMCD to logistic regression is facilitated by the fact that a simple and effective majorization function can be constructed for the logistic likelihood. Majorization functions can also be found for other GLM models, for example the baseline category logit model. In such models, the MMCD can be implemented in a similar way. However, in some other important models in the GLM family such as the log-linear model, it appears that no simple majorization function exists. One possible approach is to design a sequence of majorization functions according to the solutions at each iteration. This is an interesting problem that requires further investigation.
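The contrast between the logistic and log-linear cases can be seen from the coordinate-wise second derivatives. The worked equations below are a sketch under assumptions not restated in this section: ℓ is the average negative log-likelihood, η_i = β₀ + x_iᵀβ is the linear predictor, π_i is the fitted probability in the logistic model, and each covariate is standardized so that (1/n) Σ_i x_ij² = 1.

```latex
% Logistic model: \pi_i (1 - \pi_i) \le 1/4 for every \beta, so M = 1/4 works
\nabla_j^2 \ell(\beta) = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 \, \pi_i (1 - \pi_i)
  \le \frac{1}{4} \cdot \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 = \frac{1}{4}.

% Log-linear (Poisson) model with log link: no bound that is uniform in \beta
\nabla_j^2 \ell(\beta) = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 \exp(\eta_i),
  \qquad \sup_{\beta} \nabla_j^2 \ell(\beta) = \infty
  \ \text{unless } x_{ij} = 0 \text{ for all } i.
```

In the second case the curvature grows without bound as the linear predictors increase, which is why a single constant M cannot majorize the loss everywhere.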

Acknowledgements

The authors thank the reviewers and the associate editor for their helpful comments that led to considerable improvements in the paper. The research of Huang is supported in part by NIH grants R01CA120988, R01CA142774 and NSF grant DMS 1208225.

Appendix

In the Appendix, we prove Theorem 1. The proof follows the basic idea of Mazumder, Friedman and Hastie (2011). However, there are also some important differences. In particular, we need to take care of the intercept in Lemma 1 and Theorem 1, the quadratic approximation to the loss function and the coordinate-wise majorization in Theorem 1.

Lemma 1 Suppose the data (y, X) lies on a compact set and the following conditions hold:

(1) The loss function ℓ(β) is (totally) differentiable w.r.t. β for any β ∈ ℝ^{p+1}.
(2) The penalty function ρ(t) is symmetric around 0 and is differentiable on t ≥ 0; ρ′(|t|) is non-negative, continuous and uniformly bounded, where ρ′(|t|) is the derivative of ρ(|t|) w.r.t. |t|.
(3) The sequence {β^k} is bounded.
(4) For every convergent subsequence {β^{n_k}} ⊂ {β^n}, the successive differences converge to zero: β^{n_k} − β^{n_k − 1} → 0.

Then any limit point β^∞ of the sequence {β^k} is a minimum of the function Q(β); i.e.

\underset{α ↓ 0 +}{lim inf} {\frac{Q (β^{\infty} + α δ) - Q (β^{\infty})}{α}} \geq 0,

(22)

for any δ = (δ₀, …, δ_p) ∈ ℝ^{p+1}.

Proof For any β = (β₀, …, β_p)^T and δ_j = (0, …, δ_j, …, 0) ∈ ℝ^{p+1}, we have

\underset{α ↓ 0 +}{lim inf} {\frac{Q (β + α δ_{j}) - Q (β)}{α}} = \nabla_{j} ℓ (β) δ_{j} + \underset{α ↓ 0 +}{lim inf} {\frac{ρ (| β_{j} + α δ_{j} |) - ρ (| β_{j} |)}{α}} = \nabla_{j} ℓ (β) δ_{j} + \partial ρ (β_{j}; δ_{j}),

(23)

for j ∈ {1, …, p}, with

\partial ρ (β_{j}; δ_{j}) = \begin{cases} ρ' (| β_{j} |) sgn (β_{j}) δ_{j}, & | β_{j} | > 0; \\ ρ' (0) | δ_{j} |, & | β_{j} | = 0. \end{cases}

(24)

Assume $β^{n_{k}} \to β^{\infty} = (β_{0}^{\infty}, \dots, β_{p}^{\infty})$ , and by assumption 4, as k → ∞

β_{j}^{n_{k} - 1} = (β_{0}^{n_{k}}, \dots, β_{j - 1}^{n_{k}}, β_{j}^{n_{k}}, β_{j + 1}^{n_{k} - 1}, \dots, β_{p}^{n_{k} - 1}) \to (β_{0}^{\infty}, \dots, β_{j - 1}^{\infty}, β_{j}^{\infty}, β_{j + 1}^{\infty}, \dots, β_{p}^{\infty})

(25)

By (24) and (25), we have the results below for j ∈ {1, …, p}.

\partial ρ (β_{j}^{n_{k}}; δ_{j}) \to \partial ρ (β_{j}^{\infty}; δ_{j}) if β_{j}^{\infty} \neq 0; \partial ρ (β_{j}^{\infty}; δ_{j}) \geq \underset{k}{lim inf} \partial ρ (β_{j}^{n_{k}}; δ_{j}) if β_{j}^{\infty} = 0 .

(26)

By the coordinate-wise minimum of jth coordinate j ∈ {1, …, p}, we have

\nabla_{j} ℓ (β_{j}^{n_{k} - 1}) δ_{j} + \partial ρ (β_{j}^{n_{k}}; δ_{j}) \geq 0, for all k .

(27)

Thus (26, 27) implies that for all j ∈ {1, …, p},

\nabla_{j} ℓ (β^{\infty}) δ_{j} + \partial ρ (β_{j}^{\infty}; δ_{j}) \geq \underset{k}{lim inf} {\nabla_{j} ℓ (β_{j}^{n_{k} - 1}) δ_{j} + \partial ρ (β_{j}^{n_{k}}; δ_{j})} \geq 0 .

(28)

By (23,28), for j ∈ {1, …, p}, we have

\underset{α ↓ 0 +}{lim inf} {\frac{Q (β^{\infty} + α δ_{j}) - Q (β^{\infty})}{α}} \geq 0 .

(29)

Following the above arguments, it is easy to see that for j = 0

\nabla_{0} ℓ (β^{\infty}) δ_{0} \geq 0 .

(30)

Hence for δ = (δ₀, …, δ_p) ∈ ℝ^{p+1}, by the differentiability of ℓ(β), we have

\underset{α ↓ 0 +}{lim inf} {\frac{Q (β^{\infty} + α δ) - Q (β^{\infty})}{α}} = \nabla_{0} ℓ (β^{\infty}) δ_{0} + \sum_{j = 1}^{p} [\nabla_{j} ℓ (β^{\infty}) δ_{j} + \underset{α ↓ 0 +}{lim inf} {\frac{ρ (| β_{j}^{\infty} + α δ_{j} |) - ρ (| β_{j}^{\infty} |)}{α}}] = \nabla_{0} ℓ (β^{\infty}) δ_{0} + \sum_{j = 1}^{p} \underset{α ↓ 0 +}{lim inf} {\frac{Q (β^{\infty} + α δ_{j}) - Q (β^{\infty})}{α}} \geq 0,

(31)

by (29, 30). This completes the proof.

Proof of Theorem 1

Proof To ease notation, write $χ_{β_{0}, \dots, β_{j - 1}, β_{j + 1}, \dots, β_{p}}^{j} \equiv χ (u)$ for Q(β) as a function of the jth coordinate with β_l, l ≠ j being fixed. We first deal with the j ∈ {1, …, p} coordinates, then the intercept (0th coordinate) in the following arguments.

For the jth coordinate, j ∈ {1, …, p}, observe that

χ (u + δ) - χ (u) = ℓ (β_{0}, \dots, β_{j - 1}, u + δ, β_{j + 1}, \dots, β_{p}) - ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p}) + ρ (| u + δ |) - ρ (| u |)

(32)

= \nabla_{j} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p}) δ + \frac{1}{2} \nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p}) δ^{2} + o (δ^{2}) + ρ' (| u |) (| u + δ | - | u |) + \frac{1}{2} ρ ″ (| u^{*} |) {(| u + δ | - | u |)}^{2},

(33)

with |u*| being some number between |u + δ| and |u|. The notation ∇_jℓ(β₀, …, β_{j−1}, u, β_{j+1}, …, β_p) and $\nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p})$ denotes the first and second derivatives of the function ℓ w.r.t. the jth coordinate (assumed to exist by condition (1)).

We re-write the RHS of (33) as follows:

R H S (of 33) = \nabla_{j} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p}) δ + (\nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p}) - M) δ^{2} + ρ' (| u |) sgn (u) δ + ρ' (| u |) (| u + δ | - | u |) - ρ' (| u |) sgn (u) δ + \frac{1}{2} ρ ″ (| u^{*} |) {(| u + δ | - | u |)}^{2} + (M - \frac{1}{2} \nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u, β_{j + 1}, \dots, β_{p})) δ^{2} + o (δ^{2}) .

(34)

On the other hand, the solution of the jth coordinate (j ∈ {1, …, p}) is to minimize the following function,

Q_{j} (u | β) = ℓ (β) + \nabla_{j} ℓ (β) (u - β_{j}) + \frac{1}{2} \nabla_{j}^{2} ℓ (β) {(u - β_{j})}^{2} + ρ (| u |),

(35)

By majorization, we bound $\nabla_{j}^{2} ℓ (β)$ by a constant M for standardized variables. So the actual function being minimized is

{Q̃}_{j} (u | β) = ℓ (β) + \nabla_{j} ℓ (β) (u - β_{j}) + \frac{1}{2} M {(u - β_{j})}^{2} + ρ (| u |) .

(36)

Since u is to minimize (36), we have, for the jth (j ∈ {1, …, p}) coordinate,

\nabla_{j} ℓ (β) + M (u - β_{j}) + ρ' (| u |) sgn (u) = 0,

(37)

Because χ(u) is minimized at u₀, by (37), we have

0 = \nabla_{j} ℓ (β_{0}, \dots, β_{j - 1}, u_{0} + δ, β_{j + 1}, \dots, β_{p}) + M (u_{0} - u_{0} - δ) + ρ' (| u_{0} |) sgn (u_{0}) = \nabla_{j} ℓ (β_{0}, \dots, β_{j - 1}, u_{0}, β_{j + 1}, \dots, β_{p}) + \nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u_{0}, β_{j + 1}, \dots, β_{p}) δ + o (δ) - M δ + ρ' (| u_{0} |) sgn (u_{0}),

(38)

if u₀ = 0 then the above holds true for some value of sgn(u₀) ∈ [−1, 1].

Observe that ρ′(|x|) ≥ 0, then

ρ' (| u |) (| u + δ | - | u |) - ρ' (| u |) sgn (u) δ = ρ' (| u |) [(| u + δ | - | u |) - sgn (u) δ] \geq 0

(39)

Therefore using (38, 39) in (34) at u₀, we have, for j ∈ {1, …, p},

χ (u_{0} + δ) - χ (u_{0}) \geq \frac{1}{2} ρ ″ (| u^{*} |) {(| u_{0} + δ | - | u_{0} |)}^{2} + δ^{2} (M - \frac{1}{2} \nabla_{j}^{2} ℓ (β_{0}, \dots, β_{j - 1}, u_{0}, β_{j + 1}, \dots, β_{p})) + o (δ^{2}) \geq \frac{1}{2} M δ^{2} + \frac{1}{2} ρ ″ (| u^{*} |) {(| u_{0} + δ | - | u_{0} |)}^{2} + o (δ^{2}) .

(40)

By condition (b) of the MMCD algorithm, inf_t ρ″(|t|; λ, γ) > −M, and (|u₀ + δ| − |u₀|)² ≤ δ². Hence there exists $θ_{2} = \frac{1}{2} (M + {inf}_{x} ρ ″ (| x |) + o (1)) > 0$ such that, for the jth coordinate, j ∈ {1, …, p},

χ (u_{0} + δ) - χ (u_{0}) \geq θ_{2} δ^{2} .

(41)

Now consider β₀, observe that

χ (u + δ) - χ (u) = ℓ (u + δ, β_{1}, \dots, β_{p}) - ℓ (u, β_{1}, \dots, β_{p}) = \nabla_{1} ℓ (u, β_{1}, \dots, β_{p}) δ + \frac{1}{2} \nabla_{1}^{2} ℓ (u, β_{1}, \dots, β_{p}) δ^{2} + o (δ^{2}) = \nabla_{1} ℓ (u, β_{1}, \dots, β_{p}) δ + (\nabla_{1}^{2} ℓ (u, β_{1}, \dots, β_{p}) - M) δ^{2} + (M - \frac{1}{2} \nabla_{1}^{2} ℓ (u, β_{1}, \dots, β_{p})) δ^{2} + o (δ^{2}) .

(42)

By similar arguments to (38), we have

0 = \nabla_{1} ℓ (u_{0} + δ, β_{1}, \dots, β_{p}) + M (u_{0} - u_{0} - δ) = \nabla_{1} ℓ (u_{0}, β_{1}, \dots, β_{p}) + \nabla_{1}^{2} ℓ (u_{0}, β_{1}, \dots, β_{p}) δ + o (δ) - M δ .

(43)

Therefore, by (42, 43), for the first coordinate of β

χ (u_{0} + δ) - χ (u_{0}) = (M - \frac{1}{2} \nabla_{1}^{2} ℓ (u_{0}, β_{1}, \dots, β_{p})) δ^{2} + o (δ^{2}) = \frac{1}{2} M δ^{2} + \frac{1}{2} (M - \nabla_{1}^{2} ℓ (u_{0}, β_{1}, \dots, β_{p})) δ^{2} + o (δ^{2}) \geq \frac{1}{2} δ^{2} (M + o (1)) .

(44)

Hence there exists a $θ_{1} = \frac{1}{2} (M + o (1)) > 0$ , such that for the first coordinate of β

χ (u_{0} + δ) - χ (u_{0}) \geq θ_{1} δ^{2} .

(45)

Let θ = min(θ₁, θ₂), using (41,45), we have for all the coordinates of β,

χ (u_{0} + δ) - χ (u_{0}) \geq θ δ^{2},

(46)

By (46) we have

Q (β_{j}^{m - 1}) - Q (β_{j + 1}^{m - 1}) \geq θ {(β_{j + 1}^{m} - β_{j + 1}^{m - 1})}^{2} = θ {‖ β_{j}^{m - 1} - β_{j + 1}^{m - 1} ‖}_{2}^{2},

(47)

where $β_{j}^{m - 1} = (β_{1}^{m}, \dots, β_{j}^{m}, β_{j + 1}^{m - 1}, \dots, β_{p}^{m - 1})$ . Inequality (47) establishes the boundedness of the sequence {β^m} for every m > 1, since the starting point β¹ ∈ ℝ^{p+1}.

Applying (47) over all the coordinates, we have for all m

Q (β^{m}) - Q (β^{m + 1}) \geq θ {‖ β^{m + 1} - β^{m} ‖}_{2}^{2} .

(48)

Since the (decreasing) sequence Q(β^m) converges, (48) shows that the sequence {β^k} has a unique limit point. This completes the proof of the convergence of {β^k}.
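The way (48) controls the successive differences can be spelled out as a telescoping sum. The display below is a sketch that additionally assumes Q is bounded below, which holds for instance when both the loss and the penalty are non-negative.

```latex
\theta \sum_{m=1}^{N} \| \beta^{m+1} - \beta^{m} \|_2^2
  \le \sum_{m=1}^{N} \left[ Q(\beta^{m}) - Q(\beta^{m+1}) \right]
  = Q(\beta^{1}) - Q(\beta^{N+1})
  \le Q(\beta^{1}) - \inf_{\beta} Q(\beta) < \infty,
\quad \text{hence} \quad
\| \beta^{m+1} - \beta^{m} \|_2 \to 0 \ \text{as } m \to \infty.
```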

Assumptions (3) and (4) in Lemma 1 hold by (48). Hence, the limit point of {β^k} is a minimum of Q(β) by Lemma 1. This completes the proof of the theorem.

Footnotes

SUPPLEMENTAL MATERIALS

R package for the MMCD algorithm: the R package ‘cvplogistic’ is available at www.r-project.org (R Development Core Team (2011)). It implements the adaptive rescaling, LLA-CD and MMCD algorithms for logistic regression with concave penalties.

Contributor Information

Dingfeng Jiang, Exploratory Statistics, Data and Statistical Science, AbbVie Inc. dingfengjiang@gmail.com.

Jian Huang, Department of Statistics and Actuarial Science, and Department of Biostatistics, University of Iowa.

References

Böhning D, Lindsay B. Monotonicity of quadratic-approximation algorithms. Ann. Inst. Stat. Math. 1988;40(4):641–663.
Breheny P, Huang J. Coordinate descent algorithms for non-convex penalized regression, with application to biological feature selection. Ann. Appl. Stat. 2011;5(1):232–253. doi: 10.1214/10-AOAS388.
Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425–455.
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann. Stat. 2004;32(2):407–451.
Fan J, Li R. Variable selection via non-concave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001;96(456):1348–1360.
Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;1(2):302–332.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010;33(1):1–22.
Hunter DR, Lange K. A tutorial on MM algorithms. Am. Stat. 2004;58(1):30–37.
Hunter DR, Li R. Variable selection using MM algorithms. Ann. Stat. 2005;33(4):1617–1642. doi: 10.1214/009053605000000200.
Jiang D, Huang J, Zhang Y. The cross-validated AUC for MCP-Logistic regression with high-dimensional data. Stat. Methods. Med. Res. 2011. doi: 10.1177/0962280211428385. Accepted.
Lange K. Optimization. New York: Springer; 2004.
Lange K, Hunter D, Yang I. Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Stat. 2000;9(1):1–59.
Mazumder R, Friedman J, Hastie T. SparseNet: Coordinate descent with non-convex penalties. J. Am. Stat. Assoc. 2011;106(495):1125–1138. doi: 10.1198/jasa.2011.tm09738.
Ortega JM, Rheinboldt WC. Iterative solution of nonlinear equations in several variables. New York, NY: Academic Press; 1970.
Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least square problems. IMA. J. Numer. Anal. 2000;20(3):389–403.
Schifano ED, Strawderman RL, Wells MT. Majorization-minimization algorithms for non-smoothly penalized objective functions. Electronic Journal of Statistics. 2010;4:1258–1299.
Tibshirani R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B. 1996;58(1):267–288.
Tseng P. Convergence of a Block Coordinate Descent Method for Non-differentiable Minimization. J. Optimiz. Theory. App. 2001;109(3):475–494.
van’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(31):530–536. doi: 10.1038/415530a.
van de Vijver MJ, He YD, van’t Veer LJ, et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 2002;347(25):1999–2009. doi: 10.1056/NEJMoa021967.
Warga J. Minimizing certain convex functions. SIAM J. Appl. Math. 1963;11(3):588–593.
Wu TT, Lange K. Coordinate descent algorithms for Lasso penalized regression. Ann. Appl. Stat. 2008;2(1):224–244. doi: 10.1214/07-AOAS147.
Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010;38(2):894–942.
Zou H, Li R. One-step sparse estimates in non-concave penalized likelihood models. Ann. Stat. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. ISBN 3-900051-07-0, http://www.R-project.org.
