Variable Selection for Support Vector Machines in Moderately High Dimensions

Xiang Zhang; Yichao Wu; Lan Wang; Runze Li

doi:10.1111/rssb.12100

Author manuscript; available in PMC: 2017 Jan 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2015 Jan 5;78(1):53–76. doi: 10.1111/rssb.12100

PMCID: PMC4709852 NIHMSID: NIHMS725692 PMID: 26778916

Summary

The support vector machine (SVM) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved great success, but its performance can be seriously impaired if many redundant covariates are included. Some efforts have been devoted to studying variable selection for SVMs, but asymptotic properties, such as variable selection consistency, are largely unknown when the number of predictors diverges to infinity. In this work, we establish a unified theory for a general class of nonconvex penalized SVMs. We first prove that in ultra-high dimensions, there exists one local minimizer of the objective function of nonconvex penalized SVMs possessing the desired oracle property. We further address the problem of nonunique local minimizers by showing that the local linear approximation algorithm is guaranteed to converge to the oracle estimator even in the ultra-high dimensional setting if an appropriate initial estimator is available. This condition on the initial estimator is verified to be automatically valid as long as the dimensions are moderately high. Numerical examples provide supportive evidence.

Keywords: Local linear approximation, nonconvex penalty, oracle property, support vector machines, ultra-high dimensions, variable selection

1. Introduction

Due to the recent advent of new technologies for data acquisition and storage, we have seen an explosive growth of data complexity in a variety of research areas such as genomics, imaging and finance. As a result, the number of predictors becomes huge, while only a moderate number of instances are available for study (Donoho et al., 2000). For example, in tumor classification using genomic data, expression values of tens of thousands of genes are available, but the number of arrays is typically on the order of tens. Classification of high dimensional data poses many statistical challenges and calls for new methods and theories. In this article we consider high dimensional classification where the number of covariates diverges with the sample size and can be potentially much larger than the sample size.

The support vector machine (SVM; Vapnik, 1996) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved success in many applications. However, one serious drawback of the standard SVM is that its performance can be adversely affected if many redundant variables are included in building the decision rule (Friedman et al., 2001); see the evidence in the numerical results of Section 5.1. Classification using all features has been shown to be as poor as random guessing due to noise accumulation in high dimensional space (Fan and Fan, 2008). Many methods have been proposed to remedy this problem, such as the recursive feature elimination suggested by Guyon et al. (2002). In particular, superior performance can be achieved with a unified method, namely achieving variable selection and prediction simultaneously (Fan and Li, 2001) by using an appropriate sparsity penalty. It is well known that the standard SVM can fit in the regularization framework of loss + penalty using the hinge loss and L₂ penalty. Based on this, several attempts have been made to achieve variable selection for the SVM by replacing the L₂ penalty with other forms of penalty. Bradley and Mangasarian (1998), Zhu et al. (2004), and Wegkamp and Yuan (2011) considered the L₁-penalized SVM; Zou and Yuan (2008) proposed to use the F_∞-norm SVM to select groups of predictors; Wang et al. (2006) and Wang et al. (2007) suggested the elastic net penalty for the SVM; Zou (2007) proposed to penalize the SVM with the adaptive LASSO penalty; Zhang et al. (2006), Becker et al. (2011) and Park et al. (2012) studied the smoothly clipped absolute deviation (SCAD, Fan and Li, 2001)-penalized SVM. Recently Park et al. (2012) studied the oracle property of the SCAD-penalized SVM with a fixed number of predictors. Yet, to the best of our knowledge, the theory of variable selection consistency of sparse SVMs in high dimensions or ultra-high dimensions (Fan and Lv, 2008) has not been studied so far.

In this article, we study the variable selection consistency of sparse SVMs. Instead of using the L₂ penalty, we consider the penalized SVM with a general class of nonconvex penalties, such as the SCAD penalty or the minimax concave penalty (MCP, Zhang, 2010). Though the convex L₁ penalty can also induce sparsity, it is well known that its variable selection consistency in linear regression relies on the stringent “irrepresentable condition” on the design matrix. This condition, however, can easily be violated in practice; see the examples in Zou (2006) and Meinshausen and Yu (2009). Moreover, the regularization parameter for model selection consistency in this case is not optimal for prediction accuracy (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007). For the nonconvex penalty, Kim et al. (2008) investigated the oracle property of the SCAD-penalized least squares regression in high dimensions. However, a different set of proof techniques is needed for the nonconvex penalized SVMs because the hinge loss in the SVM is not a smooth function. The Karush-Kuhn-Tucker local optimality condition is generally not sufficient for the setup of a nonsmooth loss plus a nonconvex penalty. A new sufficient optimality condition based on subgradient calculation is used in the technical proof in this paper. We prove that under some general conditions, with probability tending to one, the oracle estimator is a local minimizer of the nonconvex penalized SVM objective function where the number of variables may grow exponentially with the sample size. By oracle estimator, we mean an estimator obtained by minimizing the empirical hinge loss with only relevant covariates. As one referee pointed out, with a finite sample, the empirical hinge loss may have multiple minimizers because the objective function is piecewise linear. This issue will vanish asymptotically because we assume that the population hinge loss has a unique minimizer. Such an assumption on the population hinge loss has been made in the existing literature (Koo et al., 2008).

Even though the nonconvex penalized SVMs are shown to enjoy the aforementioned local oracle property, it is largely unknown whether numerical algorithms can identify this local minimizer, since the objective function involved is nonconvex and typically has multiple local minimizers. Existing methods rely heavily on conditions that guarantee the local minimizer to be unique. In general, when the convexity of the hinge loss function dominates the concavity of the penalty, the nonconvex penalized SVMs actually have a unique minimizer due to global convexity. Recently Kim and Kwon (2012) gave sufficient conditions for a unique minimizer of the nonconvex penalized least squares regression when global convexity is not satisfied. However, for ultra-high dimensional cases, it would be unrealistic to assume the existence of a unique local minimizer. See Wang et al. (2013) for relevant discussion and a possible solution to nonconvex penalized regression.

In this article, we further address the nonuniqueness issue of local minimizers by verifying that with probability tending to one, the local linear approximation (LLA) algorithm (Zou and Li, 2008) is guaranteed to yield an estimator with the desired oracle property in merely two iterations under the localizability condition (Fan et al., 2014). This convergence result extends the work of Fan et al. (2014) by relaxing the differentiability assumption of the loss function and holds in the ultra-high dimensional setting with p = o(exp(n^δ)) for some positive constant δ. We further show that the localizability condition is automatically valid for the moderately high dimensional setting with $p = o (\sqrt{n})$ . To the best of our knowledge, this is the first result on the convergence of LLA algorithm in the setup of a nonsmooth loss function with a nonconvex penalty.

The rest of this paper is organized as follows. Section 2 introduces the methodology of nonconvex penalized SVMs. Section 3 contains the main results of the properties of nonconvex penalized SVMs. The implementation procedure is summarized in Section 4. Simulation studies and a real data example are provided in Section 5, followed by a discussion in Section 6. Technical proofs are presented in Section 7. A zip file containing R demonstration codes for one simulation example and the real data example is available at http://www4.stat.ncsu.edu/~wu/soft/VarSelforSVMbyZhangWuWangLi.zip.

2. Nonconvex penalized support vector machines

We begin with the basic setup and notations. In binary classification, we are typically given a random sample $\{(Y_{i}, X_{i})\}_{i = 1}^{n}$ from an unknown population distribution P(X, Y). Here $Y_{i} \in \{1, -1\}$ denotes the categorical label and $X_{i} = (X_{i0}, X_{i1}, \dots, X_{ip})^{T} = (X_{i0}, (X_{i}^{*})^{T})^{T}$ denotes the input covariates with $X_{i0} = 1$ corresponding to the intercept term. The goal is to estimate a classification rule that can be used to predict output labels for future observations with input covariates only. With potentially varying misclassification cost specified by weight W_i = w if Y_i = 1 and W_i = 1 − w if Y_i = −1 for some 0 < w < 1, the linear weighted support vector machine (WSVM, Lin et al., 2002) estimates the classification boundary by solving

$$\min_{β} \; n^{-1} \sum_{i = 1}^{n} W_{i} (1 - Y_{i} X_{i}^{T} β)_{+} + λ (β^{*})^{T} β^{*},$$

where (1 − u)₊ = max{1 − u, 0} denotes the hinge loss, λ > 0 is a regularization parameter, and β = (β₀, (β^*)^T)^T with β^* = (β₁, β₂, · · ·, β_p)^T. The standard SVM is a special case of the WSVM with weight parameter w = 0.5. In this paper, we consider the WSVM for more generality. In general, the corresponding decision rule, sign(X^T β), uses all covariates and is not capable of selecting relevant covariates.
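As a minimal sketch of the WSVM objective and decision rule just described, the following plain-Python functions evaluate the weighted hinge loss plus the L₂ penalty for a given β. The function names, the list-of-rows data layout, and the convention that a zero score is classified as +1 are illustrative assumptions and are not taken from the paper's R code.

```python
def wsvm_objective(X, y, beta, lam, w=0.5):
    """Weighted hinge loss plus L2 penalty on the non-intercept coefficients.

    X    : list of rows; each row starts with 1.0 for the intercept term.
    y    : labels in {+1, -1}.
    beta : coefficients; beta[0] is the intercept and is not penalized.
    lam  : regularization parameter (lambda > 0).
    w    : weight for the class Y = +1; the class Y = -1 gets 1 - w.
    """
    n = len(y)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        weight = w if y_i == 1 else 1.0 - w
        margin = y_i * sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        loss += weight * max(1.0 - margin, 0.0)  # hinge loss (1 - u)_+
    penalty = lam * sum(b * b for b in beta[1:])
    return loss / n + penalty


def decision(x, beta):
    """Decision rule sign(x^T beta); a score of exactly zero maps to +1 (assumption)."""
    score = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    return 1 if score >= 0 else -1
```

With w = 0.5 the function evaluates the standard SVM objective up to the constant factor 0.5 on the loss term. Every nonzero coordinate of β enters `decision`, which is the sense in which the L₂-penalized rule does not select variables.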

Towards variable selection for the linear WSVM, we consider the population linear weighted hinge loss $E\{W (1 - Y X^{T} β)_{+}\}$. Let $β_{0} = (β_{00}, β_{01}, \dots, β_{0p})^{T} = (β_{00}, (β_{0}^{*})^{T})^{T}$ denote the true parameter value, which is defined as the minimizer of the population weighted hinge loss. Namely

$$β_{0} = \arg\min_{β} E\{W (1 - Y X^{T} β)_{+}\}. \tag{1}$$

The number of covariates p = p_n is allowed to increase with the sample size n. It is even possible that p_n is much larger than n. In this paper we assume the true parameter β₀ to be sparse. Let $A = \{1 \leq j \leq p_{n} : β_{0j} \neq 0\}$ be the index set of the nonzero coefficients. Let q = q_n = |A| be the cardinality of set A, which is also allowed to increase with n. Without loss of generality, we assume that the last p_n − q_n components of β₀ are zero. That is, $β_{0}^{T} = (β_{01}^{T}, 0^{T})$. Correspondingly, we write $X_{i}^{T} = (Z_{i}^{T}, R_{i}^{T})$, where $Z_{i} = (X_{i0}, X_{i1}, \dots, X_{iq})^{T} = (1, (Z_{i}^{*})^{T})^{T}$ and $R_{i} = (X_{i[q+1]}, \dots, X_{ip})^{T}$. Further we denote π₊ (resp. π₋) to be the marginal probability of the label Y = +1 (resp. −1).

To facilitate our theoretical analysis, we introduce the gradient vector and Hessian matrix of the population linear weighted hinge loss. Let $L(β_{1}) = E\{W (1 - Y Z^{T} β_{1})_{+}\}$ be the population linear weighted hinge loss using only relevant covariates. Define S(β₁) = (S(β₁)_j) to be the (q_n + 1)-dimensional vector given by

$$S(β_{1}) = - E\{I(1 - Y Z^{T} β_{1} \geq 0) \, W Y Z\},$$

where I(·) denotes the indicator function. Also define H(β₁) = (H(β₁)_jk) to be the (q_n + 1) × (q_n + 1) matrix given by

$$H(β_{1}) = E\{δ(1 - Y Z^{T} β_{1}) \, W Z Z^{T}\},$$

where δ(·) denotes the Dirac delta function. It can be shown that if well-defined, S(β₁) and H(β₁) can be considered to be the gradient vector and Hessian matrix of L(β₁), respectively. See Lemma 2 of Koo et al. (2008) for details.

2.1. Nonconvex penalized support vector machines

By acting as if the true sparsity structure is known in advance, the oracle estimator is defined as $\hat{β} = {({\hat{β}}_{1}^{T}, 0^{T})}^{T}$ , where

$$\hat{β}_{1} = \arg\min_{β_{1}} n^{-1} \sum_{i = 1}^{n} W_{i} (1 - Y_{i} Z_{i}^{T} β_{1})_{+}. \tag{2}$$

Here the objective function is piecewise linear. With a finite sample, it may have multiple minimizers. In that case, β̂₁ can be chosen to be any minimizer, and our forthcoming theoretical results still hold. In the limit as n → ∞, β̂₁ minimizes the population version of the objective function $E\{W (1 - Y Z^{T} β_{1})_{+}\}$. Lin (2002) showed that when the misclassification costs are equal, the minimizer of $E\{(1 - Y f(Z))_{+}\}$ over measurable f(Z) is the Bayes rule sgn(p(Z) − 1/2), where p(z) = P(Y = +1 | Z = z). This suggests that the oracle estimator is aiming at approximating the Bayes rule. In practice, achieving an estimator with the desired oracle property is very challenging, because the sparsity structure of the true parameter β₀ is largely unknown. Later we will show that under some regularity conditions, our proposed algorithm can find an estimator with the oracle property and claim convergence with high probability. Indeed, the numerical examples in Section 5.1 demonstrate that the estimator selected by our proposed algorithm has performance close to that of the Bayes rule. Note that the Bayes rule is unattainable here because we assume no knowledge of the high dimensional conditional density P(X|Y).

In this paper, we consider the nonconvex penalized hinge loss objective function:

$$Q(β) = n^{-1} \sum_{i = 1}^{n} W_{i} (1 - Y_{i} X_{i}^{T} β)_{+} + \sum_{j = 1}^{p_{n}} p_{λ_{n}}(|β_{j}|), \tag{3}$$

where p_{λ_n}(·) is a symmetric penalty function with tuning parameter λ_n. Let $p_{λ_{n}}^{'} (t)$ be the derivative of p_{λ_n}(t) with respect to t. We consider a general class of nonconvex penalties that satisfy the following conditions.

(Condition 1) The symmetric penalty p_{λ_n}(t) is assumed to be nondecreasing and concave for t ∈ [0, +∞), with a continuous derivative $p_{λ_{n}}^{'} (t)$ on (0, +∞) and p_{λ_n}(0) = 0.
(Condition 2) There exists a > 1 such that ${lim}_{t \to 0 +} p_{λ_{n}}^{'} (t) = λ_{n}, p_{λ_{n}}^{'} (t) \geq λ_{n} - t / a$ for 0 < t < aλ and $p_{λ_{n}}^{'} (t) = 0$ for t ≥ aλ.

The motivation for such a nonconvex penalty is that the convex L₁ penalty lacks the oracle property due to the overpenalization of large coefficients in the selected model. Consequently it is undesirable to use the L₁ penalty when the purpose of the data analysis is to select the relevant covariates among potentially high dimensional candidates in classification. Note that p, q, λ and other related quantities are allowed to depend on n, and we suppress the subscript n whenever there is no confusion.

Two commonly used nonconvex penalties that satisfy Conditions 1 and 2 are the SCAD and MCP penalties. The SCAD penalty (Fan and Li, 2001) is defined by

$$p_{λ}(|β|) = λ |β| \, I(0 \leq |β| < λ) + \frac{a λ |β| - (β^{2} + λ^{2}) / 2}{a - 1} \, I(λ \leq |β| \leq a λ) + \frac{(a + 1) λ^{2}}{2} \, I(|β| > a λ)$$

for some a > 2. The MCP (Zhang, 2010) is defined by

$$p_{λ}(|β|) = λ \left(|β| - \frac{β^{2}}{2 a λ}\right) I(0 \leq |β| < a λ) + \frac{a λ^{2}}{2} \, I(|β| \geq a λ)$$

for some a > 1.
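The two penalties and their derivatives on (0, ∞) can be written out directly from the definitions above. The sketch below does this in plain Python and checks Condition 2 numerically on a grid. The default values a = 3.7 for SCAD and a = 3 for MCP, the function names, and the grid size are assumptions made for illustration only.

```python
def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|); requires a > 2."""
    t = abs(t)
    if t < lam:
        return lam * t
    if t <= a * lam:
        return (a * lam * t - (t * t + lam * lam) / 2.0) / (a - 1.0)
    return (a + 1.0) * lam * lam / 2.0


def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty for t >= 0 (right limit at t = 0)."""
    if t <= lam:
        return lam
    if t < a * lam:
        return (a * lam - t) / (a - 1.0)
    return 0.0


def mcp(t, lam, a=3.0):
    """Minimax concave penalty p_lambda(|t|); requires a > 1."""
    t = abs(t)
    if t < a * lam:
        return lam * (t - t * t / (2.0 * a * lam))
    return a * lam * lam / 2.0


def mcp_deriv(t, lam, a=3.0):
    """Derivative of the MCP for t >= 0 (right limit at t = 0)."""
    if t < a * lam:
        return lam - t / a
    return 0.0


def satisfies_condition_2(deriv, lam, a, steps=1000):
    """Grid check of Condition 2: p'(t) >= lam - t/a on (0, a*lam), p'(t) = 0 beyond."""
    for k in range(1, steps):
        t = a * lam * k / steps  # grid point with 0 < t < a * lam
        if deriv(t, lam, a) < lam - t / a - 1e-12:
            return False
    return deriv(a * lam, lam, a) == 0.0 and deriv(2.0 * a * lam, lam, a) == 0.0
```

For example, `satisfies_condition_2(scad_deriv, 1.0, 3.7)` and `satisfies_condition_2(mcp_deriv, 1.0, 3.0)` both pass on the grid. For the MCP the lower bound λ − t/a in Condition 2 is attained with equality on (0, aλ).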

3. Oracle property

3.1. Regularity conditions

To facilitate our technical proofs, we impose the following regularity conditions:

(A1)
The densities of Z^* given Y = +1 and Y = −1 are continuous and have common support in $\mathbb{R}^{q}$.
(A2)
$E [X_{j}^{2}] < \infty$ for 1 ≤ j ≤ q.
(A3)
The true parameter β₀ is unique and a nonzero vector.
(A4)
$q_{n} = O(n^{c_{1}})$, namely $\lim_{n \to \infty} q_{n} / n^{c_{1}} < \infty$, for some 0 ≤ c₁ ≤ 1/2.
(A5)
There exists a constant M₁ > 0 such that $λ_{\max} (n^{- 1} X_{A}^{T} X_{A}) \leq M_{1}$ , where X_A is the first q_n+1 columns of the design matrix and λ_max denotes the largest eigenvalue. It is further assumed that ${max}_{1 \leq i \leq n} ‖ Z_{i} ‖ = O_{p} (\sqrt{q_{n}} log (n))$ , (Z_i, Y_i) are in general position (Koenker, 2005, sect. 2.2), X_ij are sub-Gaussian random variables for 1 ≤ i ≤ n, q_n + 1 ≤ j ≤ p_n.
(A6)
λ_min(H(β₀₁)) ≥ M₂ for some constant M₂ > 0, where λ_min denotes the smallest eigenvalue.
(A7)
$n^{(1 - c_{2})/2} \min_{1 \leq j \leq q_{n}} |β_{0j}| \geq M_{3}$ for some constant M₃ > 0 and 2c₁ < c₂ ≤ 1.
(A8)
Denote the conditional density of Z^T β₀₁ given Y = +1 and Y = −1 as f and g, respectively. It is assumed that f is uniformly bounded away from 0 and ∞ in a neighborhood of 1 and g is uniformly bounded away from 0 and ∞ in a neighborhood of −1.

Remark 3.1

Conditions (A1)–(A3) and (A6) are also assumed for fixed p in Koo et al. (2008). We need these assumptions to ensure that the oracle estimator is consistent in the scenario of diverging p. Condition (A3) states that the optimal classification decision function is not constant, which is required to ensure that S(β) and H(β) are well-defined gradient vector and Hessian matrix of the hinge loss; see Lemma 2 and Lemma 3 of Koo et al. (2008). Conditions (A4) and (A7) are common in the literature of high dimensional inference (Kim et al., 2008). More specifically, (A4) states that the divergence rate of the number of nonzero coefficients cannot be faster than root-n and (A7) simply states that the signals cannot decay too quickly. The condition on the largest eigenvalue of the design matrix in (A5) is similar to the sparse Riesz condition and is also assumed in Zhang and Huang (2008), Yuan (2010) and Zhang (2010). Note that the bound on the smallest eigenvalue is not specified. The condition on the maximum norm in (A5) holds when Z^* given Y follows a multivariate normal distribution. (Z_i, Y_i) are in general position if with probability one there are exactly (q_n + 1) elements in $D = \{i : 1 - Y_{i} Z_{i}^{T} \hat{β}_{1} = 0\}$ (Koenker, 2005, sect. 2.2). The condition for general position is true with probability one w.r.t. Lebesgue measure. Condition (A8) requires that there is enough information around the nondifferentiable point of the hinge loss, similar to Condition (C5) in Wang et al. (2012) for quantile regression.

For illustrative examples that satisfy all the above conditions, assume 0 < π₊ = 1 − π₋ < 1 and let the number of signals be fixed. The first example is that the conditional distributions of X^* given Y have unbounded support $\mathbb{R}^{p}$ with sub-Gaussian tails. It can be easily seen that Fisher's discriminant analysis is one special case when X^* given Y are Gaussian. Conditions (A1)–(A4) and (A7) are trivial. Condition (A5) holds by the properties of sub-Gaussian random variables. Koo et al. (2008) showed that Condition (A6) holds if the supports of the conditional densities of Z^* given Y are convex, which is naturally satisfied for $\mathbb{R}^{q}$. Condition (A8) is trivially satisfied by the unbounded support of the conditional distribution of Z^* given Y. Another example is the probit model in which X^* has unbounded support with sub-Gaussian tails and Pr(Y = +1|X^*) = Φ(X^Tβ) for some β ≠ 0. It can be easily checked that the conditional distributions of X^* given Y also have unbounded support $\mathbb{R}^{p}$ and hence all the conditions are satisfied.

3.2. Local oracle property

In this subsection, we establish the theory of the local oracle property for the nonconvex penalized SVMs; namely, the oracle estimator is one of the local minimizers of the objective function Q(β) defined in (3). We start with the following lemma on the consistency of the oracle estimator, which can be viewed as an extension of the consistency result in Koo et al. (2008) to the diverging p scenario.

Lemma 3.1

Assume that Conditions (A1)–(A7) are satisfied. The oracle estimator $\hat{β} = {({\hat{β}}_{1}^{T}, 0^{T})}^{T}$ satisfies $‖ {\hat{β}}_{1} - β_{01} ‖ = O_{p} (\sqrt{q_{n} / n})$ when n → ∞.

Though the convexity of the nonconvex penalized hinge loss objective function Q(β) is not guaranteed, it can be written as the difference of two convex functions:

$$Q(β) = g(β) - h(β), \tag{4}$$

where $g (β) = n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + λ_{n} \sum_{j = 1}^{p} ∣ β_{j} ∣$ and $h (β) = λ_{n} \sum_{j = 1}^{p} ∣ β_{j} ∣ - \sum_{j = 1}^{p} p_{λ_{n}} (∣ β_{j} ∣) = \sum_{j = 1}^{p} H_{λ_{n}} (β_{j})$ . The form of H_λ(β_j) depends on the penalty function. For the SCAD penalty, we have

$$H_{λ}(β_{j}) = \frac{β_{j}^{2} - 2 λ |β_{j}| + λ^{2}}{2 (a - 1)} \, I(λ \leq |β_{j}| \leq a λ) + \left\{λ |β_{j}| - \frac{(a + 1) λ^{2}}{2}\right\} I(|β_{j}| > a λ),$$

while for MCP, we have $H_{λ} (β_{j}) = {β_{j}^{2} / (2 a)} I (0 \leq ∣ β_{j} ∣ < a λ) + (λ ∣ β_{j} ∣ - a λ^{2} / 2) I (∣ β_{j} ∣ \geq a λ)$ . This decomposition is useful, as it naturally satisfies the form of the difference of convex functions (DC) algorithm (An and Tao, 2005).
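The closed forms of H_λ can be checked against the identity H_λ(β_j) = λ|β_j| − p_λ(|β_j|) implied by decomposition (4). The following sketch evaluates both sides on a grid for the SCAD and MCP penalties. The grid, the values λ = 1, a = 3.7 (SCAD) and a = 3 (MCP), and all function names are illustrative assumptions.

```python
def scad(t, lam, a):
    """SCAD penalty p_lambda(|t|)."""
    t = abs(t)
    if t < lam:
        return lam * t
    if t <= a * lam:
        return (a * lam * t - (t * t + lam * lam) / 2.0) / (a - 1.0)
    return (a + 1.0) * lam * lam / 2.0


def mcp(t, lam, a):
    """MCP p_lambda(|t|)."""
    t = abs(t)
    if t < a * lam:
        return lam * (t - t * t / (2.0 * a * lam))
    return a * lam * lam / 2.0


def h_scad(b, lam, a):
    """Closed form of H_lambda for the SCAD penalty."""
    t = abs(b)
    if t < lam:
        return 0.0
    if t <= a * lam:
        return (t * t - 2.0 * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return lam * t - (a + 1.0) * lam * lam / 2.0


def h_mcp(b, lam, a):
    """Closed form of H_lambda for the MCP."""
    t = abs(b)
    if t < a * lam:
        return t * t / (2.0 * a)
    return lam * t - a * lam * lam / 2.0


def max_gap(penalty, h, lam, a, grid):
    """Largest value of |lam*|b| - p_lambda(|b|) - H_lambda(b)| over the grid."""
    return max(abs(lam * abs(b) - penalty(b, lam, a) - h(b, lam, a)) for b in grid)


grid = [k / 100.0 for k in range(-800, 801)]
gap_scad = max_gap(scad, h_scad, 1.0, 3.7, grid)
gap_mcp = max_gap(mcp, h_mcp, 1.0, 3.0, grid)
```

Both gaps are zero up to floating-point rounding, which is consistent with writing Q(β) as g(β) − h(β) with h(β) = Σ_j H_λ(β_j).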

To prove the oracle property of the nonconvex penalized SVMs, we will use a sufficient local optimality condition for difference convex programming first presented in Tao and An (1997). This sufficient condition is based on subgradient calculus. The subgradient can be viewed as an extension of the gradient of a smooth convex function to a nonsmooth convex function. Let dom(g) = {x : g(x) < ∞} be the effective domain of a convex function g. The subgradient of g(x) at a point x₀ is defined as ∂g(x₀) = {t : g(x) ≥ g(x₀) + (x − x₀)^T t for all x}. Note that at a nondifferentiable point, the subgradient contains a collection of vectors. One can easily check that the subgradient of the hinge loss function at the oracle estimator is the collection of vectors s(β̂) = (s₀(β̂), …, s_p(β̂))^T with

$$s_{j}(\hat{β}) = - n^{-1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{ij} \, I(1 - Y_{i} X_{i}^{T} \hat{β} > 0) - n^{-1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{ij} v_{i}, \tag{5}$$

where −1 ≤ v_i ≤ 0 if $1 - Y_{i} X_{i}^{T} \hat{β} = 0$ and v_i = 0 otherwise, j = 0, …, p. Under some regularity conditions, we can study the asymptotic behaviors of the subgradient at the oracle estimator. The results are summarized in the following Theorem.

Theorem 3.1

Suppose that Conditions (A1)–(A8) hold, and the tuning parameter satisfies λ = o(n^{−(1−c₂)/2}) and $log (p) q log (n) n^{- \frac{1}{2}} = o (λ)$ . For the oracle estimator β̂, there exists $v_{i}^{*}$ which satisfies $v_{i}^{*} = 0$ if $1 - Y_{i} X_{i}^{T} \hat{β} \neq 0$ and $v_{i}^{*} \in [- 1, 0]$ if $1 - Y_{i} X_{i}^{T} \hat{β} = 0$ , such that for s_j(β̂) with $v_{i} = v_{i}^{*}$ , with probability approaching one, we have

$$\begin{array}{ll} s_{j}(\hat{β}) = 0, & j = 0, 1, \dots, q, \\ |\hat{β}_{j}| \geq (a + \frac{1}{2}) λ, & j = 1, \dots, q, \\ |s_{j}(\hat{β})| \leq λ \text{ and } |\hat{β}_{j}| = 0, & j = q + 1, \dots, p. \end{array}$$

Theorem 3.1 characterizes the subgradients of the hinge loss at the oracle estimator. It basically says that in a regular setting, with probability arbitrarily close to one, those components of the subgradients corresponding to the relevant covariates are exactly zero and those corresponding to irrelevant covariates are not far from zero.

We now present the sufficient optimality condition based on subgradient calculation. Corollary 1 of Tao and An (1997) states that if there exists a neighborhood U around the point x^* such that ∂h(x) ∩ ∂g(x^*) ≠ ∅, ∀x ∈ U ∩ dom(g), then x^* is a local minimizer of g(x) − h(x). To verify this local sufficient condition, we study the asymptotic behaviors of subgradients of the two convex functions in the aforementioned decomposition (4) of Q(β). Note that, based on (5), the subgradient function of g(β) at β can be shown to be the following collection of vectors:

$$\begin{aligned} \partial g(β) = \Big\{ & ξ = (ξ_{0}, \dots, ξ_{p})^{T} \in \mathbb{R}^{p + 1} : \\ & ξ_{j} = - n^{-1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{ij} \, I(1 - Y_{i} X_{i}^{T} \hat{β} > 0) - n^{-1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{ij} v_{i} + λ l_{j}, \ j = 0, \dots, p \Big\}, \end{aligned}$$

where l₀ = 0, l_j = sgn(β_j) if β_j ≠ 0 and l_j ∈ [−1, 1] otherwise for 1 ≤ j ≤ p, and −1 ≤ v_i ≤ 0 if $1 - Y_{i} X_{i}^{T} \hat{β} = 0$ and v_i = 0 otherwise for 1 ≤ i ≤ n. Furthermore, by Condition 2 of the class of nonconvex penalty functions, ${lim}_{t \to 0 +} H_{λ}^{'} (t) = {lim}_{t \to 0 -} H_{λ}^{'} (t) = λ sgn (t) - λ sgn (t) = 0$ . Thus h(β) is differentiable everywhere. Consequently the subgradient of h(β) at point β is a singleton:

$$\partial h(β) = \left\{μ = (μ_{0}, \dots, μ_{p})^{T} \in \mathbb{R}^{p + 1} : μ_{j} = \frac{\partial h(β)}{\partial β_{j}}, \ j = 0, \dots, p\right\}.$$

For the class of nonconvex penalty functions under consideration, $\frac{\partial h (β)}{\partial β_{j}} = 0$ for j = 0. For 1 ≤ j ≤ p,

$$\frac{\partial h(β)}{\partial β_{j}} = \frac{β_{j} - λ \, \mathrm{sgn}(β_{j})}{a - 1} \, I(λ \leq |β_{j}| \leq a λ) + λ \, \mathrm{sgn}(β_{j}) \, I(|β_{j}| > a λ)$$

for the SCAD penalty, and

$$\frac{\partial h(β)}{\partial β_{j}} = \frac{β_{j}}{a} \, I(0 \leq |β_{j}| < a λ) + λ \, \mathrm{sgn}(β_{j}) \, I(|β_{j}| \geq a λ)$$

for the MCP.

Combining this with Theorem 3.1, we will prove that with probability tending to one, for any β in a ball in $\mathbb{R}^{p + 1}$ with center β̂ and radius $\frac{λ}{2}$, there exists a subgradient ξ = (ξ₀, …, ξ_p)^T ∈ ∂g(β̂) such that $\frac{\partial h(β)}{\partial β_{j}} = ξ_{j}$, j = 0, 1, …, p. Consequently the oracle estimator β̂ is itself a local minimizer of (3). This is summarized in the following theorem.

Theorem 3.2

Assume that Conditions (A1)-(A8) hold. Let B_n(λ) be the set of local minimizers of the objective function Q(β) with regularization parameter λ. The oracle estimator $\hat{β} = {({\hat{β}}_{1}^{T}, 0^{T})}^{T}$ satisfies

$$\Pr\{\hat{β} \in B_{n}(λ)\} \to 1$$

as n → ∞, if λ = o(n^{−(1−c₂)/2}), and $log (p) q log (n) n^{- \frac{1}{2}} = o (λ)$ .

It can be shown that if we take $λ = n^{-1/2 + δ}$ for some c₁ < δ < c₂/2, then the oracle property holds even for p = o(exp(n^{(δ−c₁)/2})). Therefore, the local oracle property holds for the nonconvex penalized SVM even when the number of covariates grows exponentially with the sample size.
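A short check of this claim against the two rate requirements of Theorem 3.2 may be helpful. It is a sketch that uses only $q_{n} = O(n^{c_{1}})$ from (A4) and the fact that $p = o(\exp(n^{(δ - c_{1})/2}))$ implies $\log p \leq n^{(δ - c_{1})/2}$ for all large n. For the upper requirement $λ = o(n^{-(1 - c_{2})/2})$,

$$\frac{λ}{n^{-(1 - c_{2})/2}} = \frac{n^{-1/2 + δ}}{n^{-1/2 + c_{2}/2}} = n^{δ - c_{2}/2} \to 0 \quad \text{since } δ < c_{2}/2 .$$

For the lower requirement $\log(p) \, q \log(n) \, n^{-1/2} = o(λ)$,

$$\frac{\log(p) \, q \log(n) \, n^{-1/2}}{λ} \leq n^{(δ - c_{1})/2} \cdot O(n^{c_{1}}) \cdot \log(n) \cdot n^{-δ} = O\big(n^{-(δ - c_{1})/2} \log n\big) \to 0 \quad \text{since } δ > c_{1} .$$

Both conditions on λ therefore hold for this choice.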

3.3. An algorithm with provable convergence to the oracle estimator

Note that Theorem 3.2 indicates that one of the local minimizers possesses the oracle property. However, there can potentially be multiple local minimizers and it remains challenging to identify the oracle estimator. In the high dimensional setting, assuming that the local minimizer is unique would not be realistic.

In this article, instead of assuming the uniqueness of solutions, we work directly on the conditions under which the oracle estimator can be identified by some numerical algorithms that solve the nonconvex penalized SVM objective function. One possible algorithm is the local linear approximation (LLA) algorithm proposed by Zou and Li (2008). We focus on theoretical development first in this section and delay the detailed LLA algorithm for the nonconvex penalized SVMs to Section 4. Recently LLA has been shown to be capable of identifying the oracle estimator in the setup of folded concave penalized estimation with a differentiable loss function (Wang et al., 2013; Fan et al., 2014). We generalize their results to nondifferentiable loss functions, so that it can fit in the framework of the nonconvex penalized SVMs. Similar to their work, the main condition required is the existence of an appropriate initial estimator supplied to the iterations of the LLA algorithm. Denote the initial estimator as β̃⁽⁰⁾. Intuitively, if the initial estimator β̃⁽⁰⁾ lies in a small neighborhood of the true value β₀, the algorithm should converge to the good local minimizer around β₀. This localizability will be formalized in terms of L_∞ distance later. With such an appropriate initial estimator, under the aforementioned regularity conditions, one can prove that the LLA algorithm converges to the oracle estimator with probability tending to one even in ultra-high dimensions.

Let ${\tilde{β}}^{(0)} = {({\tilde{β}}_{0}^{(0)}, \dots, {\tilde{β}}_{p}^{(0)})}^{T}$ . Consider the following events:

$F_{n1} = \{|\tilde{β}_{j}^{(0)} - β_{0j}| > λ \text{ for some } 1 \leq j \leq p\}$,
$F_{n2} = \{|β_{0j}| < (a + 1) λ \text{ for some } 1 \leq j \leq q\}$,
$F_{n3} = \{\text{for all subgradients } s(\hat{β}), \ |s_{j}(\hat{β})| > (1 - \frac{1}{a}) λ \text{ for some } q + 1 \leq j \leq p \text{ or } |s_{j}(\hat{β})| \neq 0 \text{ for some } 0 \leq j \leq q\}$,
$F_{n4} = \{|\hat{β}_{j}| < a λ \text{ for some } 1 \leq j \leq q\}$.

Denote the corresponding probabilities as $P_{ni} = \Pr(F_{ni})$, i = 1, 2, 3, 4. $P_{n1}$ represents the localizability of the problem. When we have an appropriate initial estimator, we expect $P_{n1}$ to converge to 0 as n → ∞. $P_{n2}$ is the probability that the true signal is too small to be detected by any method. $P_{n3}$ describes the behavior of the subgradients at the oracle estimator. As stated in Theorem 3.1, there exists a subgradient such that its components corresponding to irrelevant variables are near 0 and those corresponding to relevant variables are exactly 0, so $P_{n3}$ cannot be too large. $P_{n4}$ has to do with the magnitude of the oracle estimator on relevant variables. Under regularity conditions, the oracle estimator will detect the true signals and hence $P_{n4}$ will be very small.

Now we provide conditions for the LLA algorithm to find the oracle estimator β̂ in the nonconvex penalized SVMs based on P_n₁, P_n₂, P_n₃ and P_n₄.

Theorem 3.3

With probability at least $1 - P_{n1} - P_{n2} - P_{n3} - P_{n4}$, the LLA algorithm initiated by β̃⁽⁰⁾ finds the oracle estimator β̂ after two iterations. Furthermore, if (A1)–(A8) hold, λ = o(n^{−(1−c₂)/2}) and $\log(p) \, q \log(n) \, n^{-\frac{1}{2}} = o(λ)$, then $P_{n2} \to 0$, $P_{n3} \to 0$ and $P_{n4} \to 0$ as n → ∞.

The first part of Theorem 3.3 provides a nonasymptotic lower bound on the probability that the LLA algorithm converges to the oracle estimator. As we will show in the appendix, if none of the events $F_{ni}$ happens, the LLA algorithm initiated with β̃⁽⁰⁾ will find the oracle estimator in the first iteration, and in the second iteration it will find the oracle estimator again and thus claim convergence. Note that only a single correction is required in the first iteration and the second iteration is needed to stop the algorithm. Therefore, the LLA algorithm can identify the oracle estimator after two iterations and this result holds generally without Conditions (A1)–(A8).
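To see why the constants (a + 1)λ and (1 − 1/a)λ appear in the events above, it helps to look at the weights used by the first LLA step. The sketch below assumes the usual LLA form of Zou and Li (2008), in which each step solves a weighted L₁-penalized problem with weights p′_λ(|β̃_j⁽ᵗ⁻¹⁾|), and uses the SCAD derivative with the illustrative value a = 3.7. The function names and the numerical example are hypothetical.

```python
def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty for t >= 0 (right limit at t = 0)."""
    if t <= lam:
        return lam
    if t < a * lam:
        return (a * lam - t) / (a - 1.0)
    return 0.0


def lla_weights(beta_init, lam, a=3.7):
    """L1 weights for beta_1, ..., beta_p in one LLA step.

    beta_init[0] is the intercept, which is not penalized and gets no weight.
    """
    return [scad_deriv(abs(b), lam, a) for b in beta_init[1:]]


# Hypothetical example with lam = 1 and a = 3.7, so (a + 1) * lam = 4.7.
# True coefficients: intercept 0.5, signals 6.0 and -5.5, then two zeros.
# The initial estimator is within lam of the truth in every coordinate,
# so neither F_n1 nor F_n2 occurs.
beta_init = [0.4, 5.2, -4.9, 0.6, -0.9]
weights = lla_weights(beta_init, lam=1.0)
```

Here `weights` is `[0.0, 0.0, 1.0, 1.0]`. When neither F_n1 nor F_n2 occurs, each signal coordinate of the initial estimator has magnitude at least aλ, so its weight is zero and it is left unpenalized, while each noise coordinate has magnitude at most λ, so by Condition 2 its weight is at least (1 − 1/a)λ. This is the threshold against which the subgradient components are compared in F_n3.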

The second part of Theorem 3.3 indicates that under Conditions (A1)–(A8), the lower bound is determined only by the limiting behavior of the initial estimator. As long as an appropriate initial estimator is available, the problem of selecting the oracle estimator from potential multiple local minimizers is addressed. Let β̂^L₁ be the solution to the L₁-penalized SVM. When the initial estimator β̃⁽⁰⁾ is taken to be β̂^L₁ and the following Condition (A9) holds, by Theorem 3.3 the oracle estimator can be identified even in the ultra-high dimensional setting. The result is summarized in the following Corollary.

(A9)
$\Pr(|\hat{β}_{j}^{L_{1}} - β_{0j}| > λ \text{ for some } 1 \leq j \leq p) \to 0$ as n → ∞.

Corollary 3.1

Let β̂(λ) be the solution found by the LLA algorithm initiated by β̂^L₁ after two iterations. Assume the same conditions in Theorem 3.3 and (A9) hold, then

$$\Pr\{\hat{β}(λ) = \hat{β}\} \to 1 \quad \text{as } n \to \infty .$$

In the ultra-high dimensional case, one may require more stringent conditions to guarantee (A9). For the nonconvex penalized least squares regression, one can use the LASSO solution (Tibshirani, 1996) as the initial estimator and (A9) holds if one can further assume the restricted eigenvalue condition of the design matrix (Bickel et al., 2009). However, it is still largely unknown whether this conclusion also applies to the setting where both the loss and the penalty are nondifferentiable. Without imposing any new regularity conditions, we next prove that in moderately high dimensions with $p = o (\sqrt{n})$, the solution to the L₁-penalized SVM satisfies (A9) under conditions quite similar to (A1)–(A8).

The following regularity Conditions are modified from (A1)–(A8). Conditions (A3), (A7)–(A8) are the same as aforementioned.

(A1*)
The densities of X^* given Y = +1 and Y = −1 are continuous and have a common support in ℝ^p.
(A2*)
$E [X_{j}^{2}] < \infty$ for 1 ≤ j ≤ p.
(A4*)
p_n = O(n^c₁) for some 0 ≤ c₁ < 1/2.
(A5*)
There exists a constant M₁ > 0 such that λ_max(n⁻¹X^T X) ≤ M₁. It is further assumed that ${max}_{1 \leq i \leq n} ‖ X_{i} ‖ = O_{p} (\sqrt{p_{n}} log n)$ , (X_i, Y_i) are in general position (Koenker, 2005, sect. 2.2), X_ij are sub-Gaussian random variables for 1 ≤ i ≤ n, q_n + 1 ≤ j ≤ p_n.
(A6*)
λ_min(H(β₀)) ≥ M₃ for some constant M₃ > 0.

Under the new regularity conditions, we can conclude that the solution to the L₁-penalized SVM is an appropriate initial estimator. Combined with Theorem 3.3, the LLA algorithm initiated with a zero vector can identify the oracle estimator with one more iteration. The results are summarized in the following Theorem.

Theorem 3.4

Assume β̂^L₁ is the solution to the L₁-penalized SVM with tuning parameter c_n. If the modified conditions hold, λ = o(n^{−(1−c₂)/2}), p log(n) n^{−1/2} = o(λ) and c_n = o(n^{−1/2}), then we have $Pr (∣ {\hat{β}}_{j}^{L_{1}} - β_{0 j} ∣ > λ \text{ for some } 1 \leq j \leq p) \to 0$ as n → ∞. Further, the LLA algorithm initiated by β̂^L₁ finds the oracle estimator in two iterations with probability tending to one. That is, Pr{β̂(λ) = β̂} → 1 as n → ∞.

Note that Theorem 3.4 can guarantee that the LLA algorithm initialized by the β̂^L₁ identifies the oracle estimator with high probability only when $p = o (\sqrt{n})$ . However, our empirical studies suggest that even for cases with p much larger than n, the LLA algorithm initiated by β̂^L₁ usually converges within two iterations and the identified local minimizer has acceptable performance.

4. Implementation and tuning

To solve the nonconvex penalized SVMs, we use the LLA algorithm. More explicitly, we start with an initial value { ${\tilde{β}}^{(0)} : {\tilde{β}}_{j}^{(0)} = 0$ , j = 1, 2, …, p}. At each step t ≥ 1, we update by solving

min_{β} {n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{j = 1}^{p} p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(t - 1)} ∣) ∣ β_{j} ∣},

(6)

where $p_{λ}^{'} (\cdot)$ denotes the derivative of p_λ(·). Following the literature, when ${\tilde{β}}_{j}^{(t - 1)} = 0$ , we take $p_{λ}^{'} (0)$ as $p_{λ}^{'} (0 +) = λ$ . The LLA algorithm is an instance of the majorize-minimize (MM) algorithm and converges to a local minimizer of the nonconvex objective function.

With slack variables, the convex optimization problem in (6) can be easily recast as a linear programming (LP) problem

min_{ξ, η, β} {n^{- 1} \sum_{i = 1}^{n} W_{i} ξ_{i} + \sum_{j = 1}^{p} p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(t - 1)} ∣) η_{j}}

subject to

\begin{array}{l} ξ_{i} \geq 0; i = 1, 2, \dots, n, \\ ξ_{i} \geq 1 - Y_{i} X_{i}^{T} β; i = 1, 2, \dots, n, \\ η_{j} \geq β_{j}, η_{j} \geq - β_{j}; j = 1, 2, \dots, p . \end{array}

We propose using the stopping rule that $p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(t - 1)} ∣)$ stabilizes for j = 1, 2, ···, p, namely, when $\sum_{j = 1}^{p} {(p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(t - 1)} ∣) - p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(t)} ∣))}^{2}$ is sufficiently small.
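The LLA iteration and its LP reformulation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes SciPy's `linprog` with the HiGHS solver, the standard SCAD derivative of Fan and Li (2001), a design matrix `X` whose first column is the intercept (left unpenalized), and equal weights W_i = 0.5 by default. The function names `scad_derivative`, `weighted_l1_svm_lp` and `lla_scad_svm` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0; equals lam at t = 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))


def weighted_l1_svm_lp(X, y, W, pen):
    """Solve the weighted-L1 hinge problem (6) as an LP.

    Variables are stacked as [xi (n), eta (d), beta (d)];
    pen[j] is the L1 weight on beta_j (pen[0] = 0 for the intercept).
    """
    n, d = X.shape
    c = np.concatenate([W / n, pen, np.zeros(d)])
    # xi_i >= 1 - y_i x_i^T beta   <=>   -xi_i - y_i x_i^T beta <= -1
    A_hinge = np.hstack([-np.eye(n), np.zeros((n, d)), -(y[:, None] * X)])
    # eta_j >= beta_j and eta_j >= -beta_j
    A_pos = np.hstack([np.zeros((d, n)), -np.eye(d), np.eye(d)])
    A_neg = np.hstack([np.zeros((d, n)), -np.eye(d), -np.eye(d)])
    A_ub = np.vstack([A_hinge, A_pos, A_neg])
    b_ub = np.concatenate([-np.ones(n), np.zeros(2 * d)])
    bounds = [(0, None)] * (n + d) + [(None, None)] * d
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[n + d:]


def lla_scad_svm(X, y, lam, W=None, a=3.7, tol=1e-8, max_iter=20):
    """LLA for the SCAD-penalized SVM, started from the zero vector."""
    n, d = X.shape
    W = np.full(n, 0.5) if W is None else np.asarray(W, dtype=float)

    def penalty_weights(beta):
        # Intercept (index 0) is not penalized.
        return np.concatenate([[0.0], scad_derivative(np.abs(beta[1:]), lam, a)])

    beta = np.zeros(d)
    pen = penalty_weights(beta)
    for _ in range(max_iter):
        beta = weighted_l1_svm_lp(X, y, W, pen)
        new_pen = penalty_weights(beta)
        # Stopping rule: the penalty derivatives have stabilized.
        if np.sum((new_pen - pen) ** 2) < tol:
            break
        pen = new_pen
    return beta
```

Starting from the zero vector, the first LP solved is the L₁-penalized SVM with tuning parameter λ, so later iterations only reweight the penalty. A dense constraint matrix is used for readability; for large p a sparse matrix would be preferable.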

For the choice of tuning parameter λ, Claeskens et al. (2008) suggested the SVM information criterion (SVMIC). For a subset S of {1, 2, …, p}, the SVMIC is defined as

SVMIC (S) = \sum_{i = 1}^{n} ξ_{i} + log (n) ∣ S ∣,

where |S| is the cardinality of S and ξ_i, i = 1, 2, ···, n denote the corresponding optimal slack variables. This criterion directly follows the spirit of the Bayesian information criterion (BIC) by Schwarz (1978). Chen and Chen (2008) showed that BIC can be too liberal when the model space is large and proposed the extended BIC (EBIC):

{EBIC}_{γ} (S) = - 2 \log \text{Likelihood} + \log (n) ∣ S ∣ + 2 γ \log \binom{p}{∣ S ∣}, \quad 0 \leq γ \leq 1.

By combining these ideas, we suggest the SVM-extended BIC (SVMIC_γ)

SVMIC_{γ} (S) = \sum_{i = 1}^{n} 2 W_{i} ξ_{i} + \log (n) ∣ S ∣ + 2 γ \log \binom{p}{∣ S ∣}, \quad 0 \leq γ \leq 1.

Note that SVMIC_γ reduces to SVMIC when γ = 0 and w = 0.5. We use γ = 0.5 as suggested by Chen and Chen (2008) and choose the λ that minimizes SVMIC_γ.
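As a rough sketch of how the criterion might be evaluated for a fitted coefficient vector, the snippet below computes SVMIC_γ in Python. It rests on several assumptions: the model-space term is taken to be the log binomial coefficient, following the EBIC of Chen and Chen (2008); the slack variables are computed as the hinge residuals at the fitted β; the first column of `X` is the intercept and is not counted in |S|; and the function names and the zero threshold `tol` are hypothetical.

```python
import math

import numpy as np


def log_binom(p, k):
    """Natural log of the binomial coefficient C(p, k), via log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)


def svmic_gamma(X, y, beta, W, gamma=0.5, tol=1e-8):
    """SVMIC_gamma for a fitted beta (beta[0] is the intercept)."""
    n, d = X.shape
    p = d - 1
    # Slack variables: hinge residuals at the fitted coefficients.
    xi = np.maximum(1.0 - y * (X @ beta), 0.0)
    # Model size: number of nonzero non-intercept coefficients.
    size = int(np.sum(np.abs(beta[1:]) > tol))
    fit = float(np.sum(2.0 * W * xi))
    return fit + math.log(n) * size + 2.0 * gamma * log_binom(p, size)
```

In practice one would evaluate this quantity for the fit at each λ on the grid and keep the λ with the smallest value.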

5. Simulation and real data examples

We carry out Monte Carlo studies to evaluate the finite-sample performance of the non-convex penalized SVMs. We compare the performance of SCAD-penalized SVM, MCP-penalized SVM, standard L₂ SVM, L₁-penalized SVM, adaptively weighted L₁-penalized SVM (Zou, 2007) and hybrid Huberized SVM (Wang et al., 2007) (denoted by SCAD-svm, MCP-svm, L₂-svm, L₁-svm, Adap L₁-svm, and Hybrid-svm, respectively) with weight parameter w = 0.5. The main interest here is the ability to identify the relevant covariates and the control of test error when p > n.

5.1. Simulation study

We consider two data generation processes. The first, adapted from Park et al. (2012), is essentially a standard linear discriminant analysis (LDA) setting. The second is related to probit regression.

Model 1: Pr(Y = 1) = Pr(Y = −1) = 0.5, X^*|(Y = 1) ~ MN(μ, Σ), X^*|(Y = −1) ~ MN(−μ, Σ), q = 5, μ = (0.1, 0.2, 0.3, 0.4, 0.5, 0, …, 0)^T ∈ R^p, Σ = (σ_ij) with nonzero elements σ_ii = 1 for i = 1, 2, ···, p and σ_ij = ρ = −0.2 for 1 ≤ i ≠ j ≤ q. The Bayes rule is sign(2.67X₁+2.83X₂+3X₃+3.17X₄+3.33X₅) with Bayes error: 6.3%.
Model 2: X^* ~ MN(0_p, Σ), Σ = (σ_ij) with nonzero elements σ_ii = 1 for i = 1, 2, ···, p and σ_ij = 0.4^|ⁱ⁻^j^| for 1 ≤ i ≠ j ≤ p, Pr(Y = 1|X^*) = Φ((X^*)^Tβ^*) where Φ(·) is the CDF of the standard normal distribution, β^* = (1.1, 1.1, 1.1, 1.1, 0, …, 0)^T, q = 4. The Bayes rule is sign (X₁+X₂+X₃+X₄) with Bayes error 10.4%.
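The two data generation processes can be simulated along the following lines. This is only a sketch under stated assumptions: it uses NumPy with a Cholesky factor of Σ, the function names are hypothetical, and the probit labels in Model 2 are drawn through the latent-variable representation Y = sign((X^*)^Tβ^* + ε) with ε ~ N(0, 1). The helper `model1_bayes` evaluates the Bayes direction 2Σ⁻¹μ and the Bayes error Φ(−(μ^TΣ⁻¹μ)^{1/2}) of Model 1, which can be compared with the values quoted above.

```python
import math

import numpy as np


def model1_params(p, q=5, rho=-0.2):
    """Mean vector and covariance matrix of Model 1."""
    mu = np.zeros(p)
    mu[:q] = [0.1, 0.2, 0.3, 0.4, 0.5]
    Sigma = np.eye(p)
    Sigma[:q, :q] = rho            # correlation among the q signal covariates
    np.fill_diagonal(Sigma, 1.0)   # unit variances
    return mu, Sigma


def sample_model1(n, p, rng):
    """Draw (X*, Y) from Model 1: X* | Y ~ MN(Y * mu, Sigma)."""
    mu, Sigma = model1_params(p)
    y = rng.choice([-1, 1], size=n)
    L = np.linalg.cholesky(Sigma)
    X = y[:, None] * mu + rng.standard_normal((n, p)) @ L.T
    return X, y


def sample_model2(n, p, rng, r=0.4):
    """Draw (X*, Y) from Model 2 (probit link, AR-type correlation)."""
    idx = np.arange(p)
    Sigma = r ** np.abs(idx[:, None] - idx[None, :])
    beta = np.zeros(p)
    beta[:4] = 1.1
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    # Pr(Y = 1 | X*) = Phi(X*^T beta) via a standard normal latent error.
    y = np.where(X @ beta + rng.standard_normal(n) > 0, 1, -1)
    return X, y


def model1_bayes(p):
    """Bayes direction 2 * Sigma^{-1} mu and Bayes error for Model 1."""
    mu, Sigma = model1_params(p)
    direction = 2.0 * np.linalg.solve(Sigma, mu)
    s = math.sqrt(float(mu @ np.linalg.solve(Sigma, mu)))
    error = 0.5 * math.erfc(s / math.sqrt(2.0))  # Phi(-s)
    return direction, error
```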

We consider different (n, p) settings for each data generation process with p much larger than n. Similarly to Mazumder et al. (2011), an independent tuning dataset of size 10n is generated to tune any regularization parameter for all methods by minimizing the estimated prediction error calculated over the tuning dataset. We also report the performance of the SCAD- and MCP-penalized SVMs using SVMIC_γ to select the tuning parameter λ. Notice that tuning with a large independent dataset of size 10n approximates the ideal “population tuning”, which is usually not available in practice. By giving all the other methods the best possible tuning, we control for the effect of tuning parameter selection and are conservative about the performance of the nonconvex penalized SVMs tuned by SVMIC_γ. As we will see later, the results of SCAD- and MCP-penalized SVMs using the independent tuning dataset are slightly better than the corresponding results using SVMIC_γ tuning, and all other methods show little ability to select the correct model exactly, even with an unrealistically good tuning parameter. The range of λ is {2⁻⁶, …, 2³}. We use a = 3.7 for the SCAD penalty and a = 3 for the MCP as suggested in the literature. We generate an independent test dataset of size n to report the estimated test error. The columns “Signal” and “Noise” summarize the average number of selected relevant and irrelevant covariates, respectively. The numbers in the “Correct” column give the percentage of replications in which exactly the true model is selected.

Table 1 shows the results of Model 1 for different (n, p) settings. The numbers in parentheses are the corresponding standard errors based on 100 replications. When tuned by using an independent tuning set of size 10n, both SCAD- and MCP-penalized SVMs identify more relevant variables than any other method and they also reduce the number of falsely selected variables dramatically. When tuned by SVMIC_γ, SCAD- and MCP-penalized SVMs select slightly fewer signals when n = 100, but this should be read in light of the fact that the other methods select a much larger model without proper control of noise. A large proportion of the missed relevant covariates are from X₁ as it has the weakest signal. Notice that SVMIC_γ performs almost the same as “population tuning” when n is relatively large. In general, the nonconvex penalized SVMs have an overwhelmingly high probability of selecting the exact true model as n and p increase, while other methods show very weak, if any, ability to recover the exact true model. This is consistent with our theory of the asymptotic oracle property of nonconvex penalized SVMs. The test errors of SCAD- and MCP-penalized SVMs are uniformly smaller than those of any other method in all settings, even in the settings with a small sample size n = 100 and tuned by SVMIC_γ, where they select slightly fewer signals. This is due to the fact that in high dimensional classification problems, a large number of falsely selected variables will greatly blur the prediction power of the relevant variables.

Table 1.

Simulation results for Model 1

Method	n	p	Signal	Noise	Correct	Test Error
SCAD-svm	100	400	4.94(0.03)	0.89(0.19)	64%	8.71%(0.4%)
	100	800	4.93(0.03)	0.93(0.14)	51%	9.39%(0.4%)
	200	800	5.00(0.00)	0.09(0.05)	96%	7.20%(0.2%)
	200	1600	5.00(0.00)	0.07(0.04)	96%	7.24%(0.2%)
MCP-svm	100	400	4.90(0.04)	0.88(0.17)	53%	8.96%(0.4%)
	100	800	4.92(0.03)	1.37(0.20)	40%	10.59%(0.5%)
	200	800	5.00(0.00)	0.06(0.04)	97%	7.30%(0.2%)
	200	1600	5.00(0.00)	0.09(0.03)	92%	6.79%(0.2%)
SCAD-svm^(SVMIC_γ)	100	400	4.64(0.08)	0.48(0.11)	64%	10.32%(0.6%)
	100	800	4.63(0.09)	0.57(0.09)	52%	11.68%(0.7%)
	200	800	5.00(0.00)	0.03(0.02)	97%	7.24%(0.2%)
	200	1600	4.99(0.01)	0.05(0.03)	95%	7.23%(0.2%)
MCP-svm^(SVMIC_γ)	100	400	4.46(0.10)	0.44(0.08)	45%	11.81%(0.6%)
	100	800	4.34(0.11)	0.68(0.11)	38%	13.13%(0.7%)
	200	800	5.00(0.00)	0.09(0.03)	92%	7.34%(0.2%)
	200	1600	5.00(0.00)	0.06(0.03)	95%	7.19%(0.2%)
L₁-svm	100	400	4.87(0.05)	32.97(1.47)	0%	16.08%(0.5%)
	100	800	4.63(0.07)	44.34(2.18)	0%	19.71%(0.6%)
	200	800	5.00(0.00)	21.33(1.70)	0%	9.59%(0.3%)
	200	1600	4.99(0.01)	33.37(0.96)	0%	10.88%(0.3%)
Hybrid-svm	100	400	4.78(0.05)	24.74(1.37)	0%	16.34%(0.5%)
	100	800	4.62(0.06)	27.16(1.30)	0%	19.93%(0.6%)
	200	800	5.00(0.00)	12.86(0.99)	0%	9.93%(0.2%)
	200	1600	4.99(0.01)	10.85(0.98)	0%	10.53%(0.3%)
Adap L₁-svm	100	400	4.39(0.08)	13.14(0.90)	0%	16.76%(0.5%)
	100	800	3.99(0.08)	12.50(0.69)	0%	20.19%(0.6%)
	200	800	4.86(0.04)	3.93(0.25)	1%	10.04%(0.3%)
	200	1600	4.49(0.06)	1.01(0.09)	4%	13.43%(0.4%)
L₂-svm	100	400	5.00(0.00)	395.00(0.00)	0%	39.23%(0.5%)
	100	800	5.00(0.00)	795.00(0.00)	0%	42.99%(0.5%)
	200	800	5.00(0.00)	795.00(0.00)	0%	39.22%(0.3%)
	200	1600	5.00(0.00)	1595.00(0.00)	0%	42.50%(0.4%)


Table 2 shows the results of Model 2 for n = 250 and p = 800. The numbers in parentheses are the corresponding standard errors based on 200 replications. We observe similar performance patterns in terms of both variable selection and prediction error. Due to the higher correlation between signal and noise, in Model 2 it is generally more difficult to select the relevant covariates. Both SCAD- and MCP-penalized SVMs still perform reasonably well in identifying the underlying true model and yield more accurate prediction. Note that under this data generation process the adaptively weighted L₁-penalized SVM behaves similarly to the nonconvex penalized SVMs, though its oracle property is largely unknown.

Table 2.

Simulation results for Model 2 with n = 250 and p = 800

Method	Signal	Noise	Correct	Test Error
SCAD-svm	3.99(0.01)	0.26(0.08)	92.5%	11.4%(0.1%)
MCP-svm	3.99(0.01)	0.17(0.07)	93.5%	11.3%(0.1%)
SCAD-svm^(SVMIC_γ)	3.96(0.02)	0.05(0.02)	94%	11.5%(0.1%)
MCP-svm^(SVMIC_γ)	3.98(0.01)	0.07(0.02)	92.5%	11.4%(0.1%)
L₁-svm	4.00(0.00)	6.84(0.42)	7.5%	12.4%(0.1%)
Hybrid-svm	4.00(0.00)	4.03(0.41)	10.5%	11.9%(0.1%)
Adap L₁-svm	4.00(0.00)	2.90(0.28)	38%	11.8%(0.1%)
L₂-svm	4.00(0.00)	796.00(0.00)	0%	32.5%(0.2%)


5.2. Real data application

We next use a real dataset to illustrate the performance of the nonconvex penalized SVM. This dataset is part of the MicroArray Quality Control (MAQC)-II project, available at the GEO database with accession number GSE20194. It contains 278 patient samples from two classes: 164 with positive estrogen receptor (ER) status and 114 with negative ER status. Each sample is described by 22283 genes.

The original data have been standardized for each predictor. To reduce the computational burden, only the 3000 genes with the largest absolute values of the two-sample t-statistics are used. Such simplification has been considered in Cai and Liu (2011). Though only 3000 genes are used, the classification result is satisfactory. We randomly split the data into a balanced training set with 50 samples with positive ER status and 50 samples with negative ER status, and the remaining samples are designated as the test set. As in the simulation study, we use a = 3.7 for the SCAD penalty and a = 3 for the MCP penalty. To get a fair comparison, 5-fold cross validation is implemented on the training set to select a tuning parameter by a grid search over {2⁻¹⁵, …, 2³} for all methods and the test error is calculated on the test data. The above procedure is repeated 100 times.
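The screening and splitting steps described above might be sketched as follows. This is an illustration on generic arrays rather than the authors' pipeline: it assumes SciPy's `ttest_ind` with its default pooled-variance statistic (the text does not say which two-sample t-statistic is used), labels coded as ±1, and hypothetical function names.

```python
import numpy as np
from scipy import stats


def screen_by_t_statistic(X, y, k):
    """Indices of the k columns with the largest |two-sample t-statistic|."""
    t, _ = stats.ttest_ind(X[y == 1], X[y == -1], axis=0)
    return np.argsort(-np.abs(t))[:k]


def balanced_split(y, n_per_class, rng):
    """Random split with n_per_class training samples from each class."""
    train = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_per_class, replace=False)
        for label in (1, -1)
    ])
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test
```

With the dataset described here, one would call `screen_by_t_statistic(X, y, 3000)` and `balanced_split(y, 50, rng)`, repeating the split 100 times with fresh random draws.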

Table 3 summarizes the average classification error and number of selected genes. The numbers in parentheses are the corresponding standard errors based on 100 replications. Nonconvex penalized SVMs achieve significantly lower test error than all the other methods except for the doubly penalized hybrid SVM. Although the doubly penalized hybrid SVM performs similarly to SCAD- and MCP-penalized SVMs in terms of test error, it selects a much more complex model in general. In addition, the number of genes selected by nonconvex penalized SVMs is stable, while the model size selected by the hybrid SVM ranges from 102 genes to 2576 genes across the 100 replications. Such stability is desirable, because it means the procedure is robust to the random partition of the data. The numerical results confirm that SCAD- and MCP-penalized SVMs can achieve both promising prediction power and excellent gene selection ability.

Table 3.

Classification error of MAQC-II dataset

Method	Test error	Genes
SCAD-svm	9.8%(0.2%)	2.06(0.43)
MCP-svm	9.6%(0.2%)	1.04(0.02)
L₁-svm	10.9%(0.2%)	28.74(1.36)
Adap L₁-svm	13.1%(0.2%)	34.30(1.03)
Hybrid-svm	10.0%(0.1%)	1391.60(94.86)
L₂-svm	10.8%(0.2%)	3000.00(0.00)


6. Discussion

In this article we study the nonconvex penalized SVMs with a diverging number of covariates in terms of variable selection. When the true model is sparse, under some regularity conditions, we prove that it enjoys the oracle property. That is, one of the local minimizers of the nonconvex penalized SVM behaves like the oracle estimator as if the true sparsity is known in advance and only the relevant variables are used to form the decision boundary. We also show that as long as we have an appropriate initial estimator, we can identify the oracle estimator with probability tending to one.

6.1. Connection to Bayes rule

In this paper, the true model and the oracle property are built on β₀, which is the minimizer of the population version of the hinge loss. This definition has a strong connection to the Bayes rule, which is theoretically optimal if the underlying distribution is known. In the equal-weight case (w = 1/2), the Bayes rule is given by sign(X^Tβ_Bayes) with β_Bayes = arg min_β E{I(sign(X^Tβ) ≠ Y)}. To appreciate the connection, we first note that β_Bayes and β₀ are equivalent to each other in the important special case of Fisher linear discriminant analysis. Indeed, consider an informative example setting with π₊ = π₋ = 1/2, X^*|(Y = +1) ~ N(μ₊, Σ) and X^*|(Y = −1) ~ N(μ₋, Σ), where μ₊ and μ₋ denote the different mean vectors of the two classes and Σ is their common covariance matrix. It is known that in this case the Bayes rule boundary is given by

{(μ_{+} - μ_{-})}^{T} Σ^{- 1} \{x^{*} - \frac{1}{2} (μ_{+} + μ_{-})\} = 0.

Note that β₀ as the minimizer of the population hinge loss satisfies the gradient condition

S (β_{0}) = - E {I (1 - Y X^{T} β_{0} \geq 0) Y X} = 0,

which is equivalent to following equations:

\begin{array}{rl} Pr (1 - X^{T} β_{0} \geq 0 ∣ Y = + 1) & = Pr (1 + X^{T} β_{0} \geq 0 ∣ Y = - 1), \\ E \{I (1 - X^{T} β_{0} \geq 0) X^{*} ∣ Y = + 1\} & = E \{I (1 + X^{T} β_{0} \geq 0) X^{*} ∣ Y = - 1\} . \end{array}

(7)

For any $β_{0, ⊥}^{*}$ that satisfies ${(β_{0}^{*})}^{T} Σ β_{0, ⊥}^{*} = 0$ , ${(X^{*})}^{T} β_{0}^{*}$ and ${(X^{*})}^{T} β_{0, ⊥}^{*}$ are conditionally independent given Y and thus we can decompose the conditional expectation in (7) into two parts. It can be seen from (7) that

\begin{array}{rl} β_{00} & = - \frac{1}{2} {(β_{0}^{*})}^{T} (μ_{+} + μ_{-}), \\ {(μ_{+} - μ_{-})}^{T} β_{0, ⊥}^{*} & = 0, \quad \forall β_{0, ⊥}^{*} \text{ satisfying } {(β_{0}^{*})}^{T} Σ β_{0, ⊥}^{*} = 0. \end{array}

That is, (μ₊ − μ₋) lies in the space spanned by $Σ β_{0}^{*}$ . The decision boundary defined by the true value is then

x^{T} β_{0} \equiv C {(μ_{+} - μ_{-})}^{T} Σ^{- 1} \{x^{*} - \frac{1}{2} (μ_{+} + μ_{-})\} = 0

for some constant C. Therefore, the decision rule defined by β₀ is equivalent to the Bayes rule.

In more general settings, β_Bayes and β₀ may not be the same. However, Lin (2000) showed that the nonlinear SVM approaches the Bayes rule in a direct fashion, and its expected misclassification rate quickly converges to that of the Bayes rule, although the extension of this result to the linear SVM is largely unknown. Furthermore, let R(f) and R₀(f) denote the risks in terms of the 0–1 loss and the hinge loss, respectively, for any measurable f; that is, R(f) = E{I(sign(f(X)) ≠ Y)} and R₀(f) = E{(1 − Y f(X))₊}. It is known that minimizing R(f) directly is very difficult because minimizing the empirical 0–1 loss is infeasible in practice (Bartlett et al., 2006). Instead, we can always shift the target from the 0–1 loss to a convex surrogate such as the hinge loss. Assume that the minimizers of R(f) and R₀(f) are both linear functions, and by definition they are X^Tβ_Bayes and X^Tβ₀, respectively. By Theorem 1 of Bartlett et al. (2006), we have the optimal excess risk upper bound

R (X^{T} β) - R (X^{T} β_{Bayes}) \leq R_{0} (X^{T} β) - R_{0} (X^{T} β_{0})

for any β. Hence pursuing oracle property on β₀ has the potential to efficiently control the excess risk. As can be seen in this paper, the main advantages of working with the hinge loss instead of the 0–1 loss are the theoretical tractability and convenience in practical implementation.
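The excess risk bound can be checked numerically on a toy distribution. The sketch below is an illustration under assumptions that differ slightly from the text: the covariate takes finitely many values with probabilities `prob`, `eta` holds Pr(Y = +1 | x) at each value, and the two risks are minimized over all measurable score functions rather than linear ones, so that the minimal 0–1 and hinge risks are E{min(η, 1 − η)} and 2E{min(η, 1 − η)}, respectively. The function name is hypothetical.

```python
import numpy as np


def excess_risks(eta, f, prob):
    """Excess 0-1 risk and excess hinge risk of the score f.

    eta[k]  = Pr(Y = +1 | x_k), prob[k] = Pr(X = x_k), f[k] = score at x_k.
    The classifier predicts +1 when f > 0 and -1 otherwise.
    """
    eta, f, prob = (np.asarray(v, dtype=float) for v in (eta, f, prob))
    # Conditional 0-1 risk: probability of the class not predicted.
    zero_one = np.where(f > 0, 1.0 - eta, eta)
    # Conditional hinge risk: eta * (1 - f)_+ + (1 - eta) * (1 + f)_+.
    hinge = eta * np.maximum(1.0 - f, 0.0) + (1.0 - eta) * np.maximum(1.0 + f, 0.0)
    bayes = np.minimum(eta, 1.0 - eta)
    excess_01 = float(np.sum(prob * (zero_one - bayes)))
    excess_hinge = float(np.sum(prob * (hinge - 2.0 * bayes)))
    return excess_01, excess_hinge


# Example: one covariate value with eta = 0.7 and a wrongly signed score.
e01, ehinge = excess_risks([0.7], [-0.5], [1.0])
print(e01, ehinge)  # excess 0-1 risk is about 0.4, excess hinge risk about 0.6
```

In this example the score has the wrong sign, so the classifier pays the full 0–1 excess |2η − 1| = 0.4, while the hinge excess is larger, in line with the inequality.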

6.2. Other issues

As one referee pointed out, the objective function (2) in the definition of our oracle estimator is piecewise linear and may have multiple minimizers. The same issue applies to the L₁-penalized SVM and the nonconvex penalized SVM. Based on our theoretical development, non-uniqueness of the minimizer of (2) is not essential. When the minimizer is not unique, our theoretical results still hold for any particular minimizer. In this sense, we can first use the nonconvex penalized SVM to identify important predictors. In the next step, to obtain a unique classifier, a refitting can be applied by using the standard L₂ penalized SVM on those identified important predictors. For Model 1 in Section 5.1, we considered this refitting. This additional refitting step does not lead to much improvement: it reduces the average test errors in some settings but not in others. Thus the refitting result is not reported here.

An alternative approach to deal with this non-uniqueness is to consider a joint penalty by using both a nonconvex penalty and a standard L₂ penalty. The objective function then becomes

n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{j = 1}^{p_{n}} p_{λ_{1 n}} (∣ β_{j} ∣) + \sum_{j = 1}^{p_{n}} λ_{2 n} β_{j}^{2}

for two different tuning parameters λ₁_n and λ₂_n. The corresponding oracle estimator is then defined as the minimizer of the objective function for the standard L₂ SVM using only the relevant covariates. One advantage of this joint penalty formulation over the method proposed in this paper is that the uniqueness of the oracle estimator is guaranteed in the finite sample case. However, it involves simultaneously selecting two tuning parameters, and this may not be convenient in practice. We conduct a simple numerical experiment using Model 1 in Section 5.1 with n = 200 and p = 600 or 800. The simulation results are summarized in Table 4. As shown in Table 4, our numerical example suggests that the performance of this joint penalty method is similar to the approach proposed in this paper.

Table 4.

Comparison between SCAD and joint penalized SVMs using Model 1

Method	p	Signal	Noise	Correct	Test Error
SCAD-svm	600	5.00(0.00)	0.17(0.07)	93%	7.04%(0.2%)
SCAD-svm	800	5.00(0.00)	0.13(0.06)	93%	7.25%(0.2%)
Joint SCAD+L₂-svm	600	5.00(0.00)	1.22(0.28)	65%	7.12%(0.2%)
Joint SCAD+L₂-svm	800	5.00(0.00)	2.64(0.62)	50%	7.10%(0.2%)


Several issues remain unsolved. In this article we only study the SVMs in nonseparable cases in the limit. Although the nonseparable cases are important in practical applications, it would be interesting to show similar results for separable cases. The asymptotic analysis of separable cases requires the positiveness of the limit of the regularization term, which is different from the analysis in this article. Another issue is the availability of an appropriate initial estimator in ultra-high dimensions. Our empirical studies suggest that the L₁-penalized SVM provides a reasonable initial estimator and the LLA algorithm converges very quickly even for cases with p ≫ n. However it still lacks theoretical justification since our Theorem 3.4 only provides theoretical support in moderately high dimensions with $p = o (\sqrt{n})$ . One could try to extend the work of Bickel et al. (2009) by assuming similar types of restricted eigenvalues conditions. This extension would require new techniques because both the loss function and the penalty are nondifferentiable and the nonsmooth locations are different in L₁-penalized SVM, whereas the setup in Bickel et al. (2009) is a smooth loss function with a nonsmooth penalty.

Acknowledgments

We thank the co-Editors Professor Gareth Roberts and Professor Piotr Fryzlewicz, the Associate Editor and three referees for very constructive comments and suggestions which have improved the presentation of the paper. We also thank Amanda Applegate for her help on the presentation of this paper. The research is partially supported by National Science Foundation grants DMS-1055210 (Wu) and DMS-1308960 (Wang) and National Institutes of Health grants R01-CA149569 (Zhang and Wu), P01-CA142538 (Wu), P50-DA10075 (Li) and P50-DA036107 (Li).

7. Appendix

We first prove Lemma 3.1.

Proof (Proof of Lemma 3.1)

Let $l (β_{1}) = n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} Z_{i}^{T} β_{1})}_{+}$ . Note that β̂₁ = arg min_β₁ l(β₁). We will show that for any η > 0, there exists a constant Δ such that for all n sufficiently large, $Pr \{{\inf}_{‖ u ‖ = Δ} l (β_{01} + \sqrt{q / n} u) > l (β_{01})\} \geq 1 - η$ . Because l(β₁) is convex, this implies that with probability at least 1 − η, β̂₁ is in the ball { $β_{1} : ‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}$ }. Denote $Λ_{n} (u) = n q^{- 1} \{l (β_{01} + \sqrt{q / n} u) - l (β_{01})\}$ . Observe that $E \{Λ_{n} (u)\} = n q^{- 1} \{L (β_{01} + \sqrt{q / n} u) - L (β_{01})\}$ . Recall also that β₀ = arg min_β E{W(1 − Y X^Tβ)₊}. If we restrict the last p − q elements to be 0, it can be easily seen that β₀₁ = arg min_β₁ E{W(1 − Y Z^Tβ₁)₊} = arg min_β₁ L(β₁), thus S(β₀₁) = 0. By Taylor series expansion of L(β₁) around β₀₁, we have $E \{Λ_{n} (u)\} = \frac{1}{2} u^{T} H (\tilde{β}) u + o_{p} (1)$ , where $\tilde{β} = β_{01} + \sqrt{q / n} t u$ for some 0 < t < 1. As shown in Koo et al. (2008), for 0 ≤ j, k ≤ q, the (j, k)-th element of the Hessian matrix H(β) is continuous given (A1) and (A2); thus H(β) is continuous. By continuity of H(β) at β₀₁, $\frac{1}{2} u^{T} H (\tilde{β}) u = \frac{1}{2} u^{T} H (β_{01}) u + o (1)$ as n → ∞. Define $W_{n} = - \sum_{i = 1}^{n} ζ_{i} W_{i} Y_{i} Z_{i}$ where $ζ_{i} = I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)$ . Recall that S(β₀₁) = −E[ζ_iW_iY_iZ_i] = 0. If we define

R_{i, n} (u) = W_{i} {(1 - Y_{i} Z_{i}^{T} (β_{01} + \frac{\sqrt{q}}{\sqrt{n}} u))}_{+} - W_{i} {(1 - Y_{i} Z_{i}^{T} β_{01})}_{+} + ζ_{i} W_{i} Y_{i} Z_{i}^{T} \sqrt{q / n} u

then we have

Λ_{n} (u) = E {Λ_{n} (u)} + W_{n}^{T} u / \sqrt{q n} + q^{- 1} \sum_{i = 1}^{n} [R_{i, n} (u) - E {R_{i, n} (u)}] .

(8)

Then similar to Equation (28) in Koo et al. (2008) we have

q^{- 2} \sum_{i = 1}^{n} E [{∣ R_{i, n} (u) - E {R_{i, n} (u)} ∣}^{2}] \leq C Δ^{2} E {q^{- 1} (1 + {‖ Z ‖}^{2}) U (\sqrt{1 + {‖ Z ‖}^{2}} Δ \sqrt{q / n})},

where $U (t) = I (∣ 1 - Y_{i} Z_{i}^{T} β_{01} ∣ < t)$ . (A2) implies that E{q⁻¹(1 + ||Z||²)} < ∞. Hence, for any ε > 0, we can choose a positive constant C such that E[q⁻¹(1+||Z||²)I{q⁻¹(1+||Z||²) > C}] < ε/2, then

E \{q^{- 1} (1 + {‖ Z ‖}^{2}) U (\sqrt{1 + {‖ Z ‖}^{2}} Δ \sqrt{q / n})\} \leq E [q^{- 1} (1 + {‖ Z ‖}^{2}) I \{q^{- 1} (1 + {‖ Z ‖}^{2}) > C\}] + C Pr (∣ 1 - Y_{i} Z_{i}^{T} β_{01} ∣ < C Δ \sqrt{q / n}) .

We can take a large N such that $Pr (∣ 1 - Y_{i} Z_{i}^{T} β_{01} ∣ < C Δ \sqrt{q / n}) < \frac{ε}{2 C}$ for all n > N by (A4). This proves that $q^{- 2} \sum_{i = 1}^{n} E {{∣ R_{i, n} (u) - E [R_{i, n} (u)] ∣}^{2}} \to 0$ as n → ∞. Observe that $E (W_{n}^{T} u / \sqrt{q n}) = 0$ , and

Var (W_{n}^{T} u / \sqrt{q n}) \leq C n^{- 1} q^{- 1} \sum_{i = 1}^{n} {(Z_{i}^{T} u)}^{2} \leq C q^{- 1} λ_{\max} (n^{- 1} X_{A}^{T} X_{A}) {‖ u ‖}^{2} \to 0

as n → ∞. Therefore, the first term of (8) will dominate the other terms as n → ∞. By (A6) we have $\frac{1}{2} u^{T} H (β_{01}) u > 0$ . Thus we can choose a sufficiently large Δ such that Λ_n(u) > 0 with probability at least 1 − η for ||u|| = Δ and all sufficiently large n.

The proof of Theorem 3.1 relies on the following Lemmas.

Lemma 7.1

Pr \{\max_{q + 1 \leq j \leq p} n^{- 1} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ > λ / 2\} \to 0 \text{ as } n \to \infty .

Proof

Recall that $E {W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} = 0$ . By (A5) and Lemma 14.9 of Bühlmann and Van De Geer (2011), we have $Pr {n^{- 1} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ > λ / 2} \leq exp (- C n λ^{2})$ . Note that

\begin{array}{l} Pr {max_{q + 1 \leq j \leq p} n^{- 1} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ > λ / 2} \\ = Pr {\cup_{q + 1 \leq j \leq p} {n^{- 1} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ > λ / 2}} \leq p exp (- C n λ^{2}) \to 0 \end{array}

as n → ∞ by the fact that log(p) = o(nλ²).

Lemma 7.2

For any Δ > 0,

\begin{array}{l} Pr {max_{q + 1 \leq j \leq p} sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) - I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) \\ - Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) + Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)] ∣ > n λ} \to 0 \end{array}

as n → ∞.

Proof

We generalize an approach by Welsh (1989). We cover the ball { $β_{1} : ‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}$ } with a net of balls with radius $Δ \sqrt{q / n^{5}}$ . It can be shown that this net can be constructed with cardinality $N \leq d n^{4 q}$ for some d > 0. Denote the N balls by B(t₁), …, B(t_N), where t_k, k = 1, …, N are the centers. Denote $κ_{i} (β_{1}) = 1 - Y_{i} Z_{i}^{T} β_{1}$ , and

\begin{array}{l} J_{n j 1} = \sum_{k = 1}^{N} Pr (∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I {κ_{i} (t_{k}) \geq 0} - I {κ_{i} (β_{01}) \geq 0} - Pr {κ_{i} (t_{k}) \geq 0} + Pr {κ_{i} (β_{01}) \geq 0}] ∣ > n λ / 2), \\ J_{n j 2} = \sum_{k = 1}^{N} Pr (sup_{{\tilde{β}}_{1} \in B (t_{k})} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I {κ_{i} ({\tilde{β}}_{1}) \geq 0} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} ({\tilde{β}}_{1}) \geq 0} + Pr {κ_{i} (t_{k}) \geq 0}] ∣ > n λ / 2) . \end{array}

Then by (A5),

Pr (sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I {κ_{i} (β_{1}) \geq 0} - I {κ_{i} (β_{01}) \geq 0} - Pr {κ_{i} (β_{1}) \geq 0} + Pr {κ_{i} (β_{01}) \geq 0}] ∣ > n λ) \leq J_{n j 1} + J_{n j 2} .

To evaluate J_nj₁, let U_i = W_iY_iX_ij [I{κ_i(t_k) ≥ 0} − I{κ_i(β₀₁) ≥ 0} − Pr{κ_i(t_k) ≥ 0} + Pr{κ_i(β₀₁) ≥ 0}]. The U_i are independent mean-zero random variables, and $Var (U_{i}) = E (U_{i}^{2}) = E (U_{i}^{2} ∣ Y_{i} = 1) Pr (Y_{i} = 1) + E (U_{i}^{2} ∣ Y_{i} = - 1) Pr (Y_{i} = - 1)$ . Denote by F and G the CDFs of the conditional distributions of Z^Tβ₀₁ given Y = +1 and Y = −1, respectively. Observe that

\begin{array}{l} E (U_{i}^{2} ∣ Y_{i} = 1) \leq C {F_{i} (1 + Z_{i}^{T} (β_{01} - t_{k})) (1 - F_{i} (1 + Z_{i}^{T} (β_{01} - t_{k}))) + F_{i} (1) (1 - F_{i} (1)) - 2 F_{i} (min (1 + Z_{i}^{T} (β_{01} - t_{k}), 1)) + 2 F_{i} (1) F_{i} (1 + Z_{i}^{T} (β_{01} - t_{k}))} \\ \leq C ∣ Z_{i}^{T} (t_{k} - β_{01}) ∣, \end{array}

and it follows by (A8) that

\begin{array}{l} E (U_{i}^{2} ∣ Y_{i} = - 1) \\ \leq C {G_{i} (- 1 + Z_{i}^{T} (β_{01} - t_{k})) (1 - G_{i} (- 1 + Z_{i}^{T} (β_{01} - t_{k}))) + G_{i} (- 1) (1 - G_{i} (- 1)) \\ - 2 (1 - G_{i} (max (- 1 + Z_{i}^{T} (β_{01} - t_{k}), - 1))) + 2 (1 - G_{i} (- 1)) (1 - G_{i} (- 1 + Z_{i}^{T} (β_{01} - t_{k})))} \\ \leq C ∣ Z_{i}^{T} (t_{k} - β_{01}) ∣ . \end{array}

Thus we have

\sum_{i = 1}^{n} Var (U_{i}) \leq n C max_{i} ‖ Z_{i} ‖ ‖ t_{k} - β_{01} ‖ = n O (\sqrt{q} log (n)) O (\sqrt{q / n}) = O (\sqrt{n} q log (n)) .

Applying Lemma 14.9 of Bühlmann and Van De Geer (2011), for some positive constants C₁ and C₂ under the assumptions on the rate of λ,

J_{n j 1} \leq 2 N exp (- \frac{n^{2} λ^{2} / 4}{C_{1} \sqrt{n} q log (n) + C_{2} n λ}) \leq C exp {4 q log (n) - C n λ} .

(9)

To evaluate J_nj₂, note that I(x ≥ s) is decreasing in s. Denote

V_{i} = [I {κ_{i} ({\tilde{β}}_{1}) \geq 0} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} ({\tilde{β}}_{1}) \geq 0} + Pr {κ_{i} (t_{k}) \geq 0}] .

We have −B_i ≤ V_i ≤ A_i for any β̃₁ ∈ B(t_k), where

\begin{array}{l} A_{i} = [I {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} (t_{k}) \geq Δ \sqrt{q / n^{5}}} + Pr {κ_{i} (t_{k}) \geq 0}], \\ B_{i} = [I {κ_{i} (t_{k}) \geq 0} - I {κ_{i} (t_{k}) \geq Δ \sqrt{q / n^{5}}} - Pr {κ_{i} (t_{k}) \geq 0} + Pr {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}}] . \end{array}

Therefore, we have

\begin{array}{l} Pr (sup_{{\tilde{β}}_{1} \in B (t_{k})} ∣ \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I {κ_{i} ({\tilde{β}}_{1}) \geq 0} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} ({\tilde{β}}_{1}) \geq 0} + Pr {κ_{i} (t_{k}) \geq 0}] ∣ > n λ / 2) \\ \leq Pr (C max_{i} ∣ X_{i j} ∣ sup_{{\tilde{β}}_{1} \in B (t_{k})} ∣ \sum_{i = 1}^{n} V_{i} ∣ > n λ / 2) \leq Pr {C max_{i} ∣ X_{i j} ∣ max (\sum_{i = 1}^{n} A_{i}, \sum_{i = 1}^{n} B_{i}) > n λ / 2} \end{array}

by the fact that A_i > 0, B_i > 0. Note that

\begin{array}{l} \sum_{i = 1}^{n} A_{i} = \sum_{i = 1}^{n} [I {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} + Pr {κ_{i} (t_{k}) \geq 0}] \\ + \sum_{i = 1}^{n} [Pr {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - Pr {κ_{i} (t_{k}) \geq Δ \sqrt{q / n^{5}}}] \end{array}

and

\begin{array}{l} \sum_{i = 1}^{n} [Pr {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - Pr {κ_{i} (t_{k}) \geq Δ \sqrt{q / n^{5}}}] \\ = [F_{i} (1 + Δ \sqrt{q / n^{5}} - Z_{i}^{T} (β_{01} - t_{k})) - F_{i} (1 - Δ \sqrt{q / n^{5}} - Z_{i}^{T} (β_{01} - t_{k}))] Pr (Y_{i} = 1) \\ + [G_{i} (- 1 + Δ \sqrt{q / n^{5}} - Z_{i}^{T} (β_{01} - t_{k})) - G_{i} (- 1 - Δ \sqrt{q / n^{5}} - Z_{i}^{T} (β_{01} - t_{k}))] Pr (Y_{i} = - 1) \\ \leq C n \log (q) \sqrt{q / n^{5}} \sqrt{q} = C \log (q) q n^{- 3 / 2} \end{array}

by (A8). Denote

O_{i} = [I {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - I {κ_{i} (t_{k}) \geq 0} - Pr {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} + Pr {κ_{i} (t_{k}) \geq 0}] .

Thus for sufficiently large n by λ = o(n^{−(1−c₂)/2}) and (A7), we have

\sum_{k = 1}^{N} Pr (C \sum_{i = 1}^{n} A_{i} > n λ / 2) \leq \sum_{k = 1}^{N} Pr (C \sum_{i = 1}^{n} O_{i} > n λ / 2 - C \log (q) q n^{- 3 / 2}) \leq \sum_{k = 1}^{N} Pr (C \sum_{i = 1}^{n} O_{i} > n λ / 4) .

Notice that O_i are independent mean-zero random variables, and

E (O_{i}^{2}) = E {[I {κ_{i} (t_{k}) \geq - Δ \sqrt{q / n^{5}}} - I {κ_{i} (t_{k}) > 0}]}^{2} \leq \sqrt{q / n^{5}} max_{i} ‖ Z_{i} ‖ = Cq log (n) n^{- 5 / 2},

by an argument similar to the one used to derive the upper bound of $E (U_{i}^{2})$ . Applying Bernstein’s inequality and the fact that ${max}_{i} ∣ X_{i j} ∣ = O_{p} (\sqrt{log (n)})$ for sub-Gaussian random variables, we have, for some positive constants C₁ and C₂,

\sum_{k = 1}^{N} Pr (C max_{i} ∣ X_{i j} ∣ \sum_{i = 1}^{n} A_{i} > n λ / 2) \leq N exp (- \frac{n^{2} λ^{2} / 4}{C_{1} q n^{- 3 / 2} {log (n)}^{3 / 2} + C_{2} n λ}) \leq C exp {4 q log (n) - C n λ} .

Similarly, we can prove that $\sum_{k = 1}^{N} Pr (C {max}_{i} ∣ X_{i j} ∣ \sum_{i = 1}^{n} B_{i} > n λ / 2) \leq C exp {4 q log (n) - C n λ}$ . Therefore, we have

J_{n j 2} \leq C exp {4 q log (n) - C n λ} .

(10)

Using (9) and (10), the probability in Lemma 7.2 is bounded by

\sum_{j = q + 1}^{p} (J_{n j 1} + J_{n j 2}) \leq C exp {log (p) + 4 q log (n) - C n λ} \to 0

(11)

which completes the proof.

Now we prove Theorem 3.1.

Proof (Proof of Theorem 3.1)

The unpenalized hinge loss objective function is convex. By convex optimization theory, there exists $v_{i}^{*}$ such that s_j(β̂) = 0, j = 0, 1, …, q, with $v_{i} = v_{i}^{*}$ .

Note that ${min}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} ∣ \geq {min}_{1 \leq j \leq q} ∣ β_{0 j} ∣ - {max}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} - β_{0 j} ∣$ . By (A7) we have $n^{(1 - c_{2}) / 2} {min}_{1 \leq j \leq q_{n}} ∣ β_{0 j} ∣ \geq M_{1}$ , and ${max}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} - β_{0 j} ∣ = O_{p} (\sqrt{q / n})$ by Lemma 3.1. Thus we have ${min}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} ∣ = O_{p} (n^{- (1 - c_{2}) / 2})$ . By λ = o(n^{−(1−c₂)/2}), we have $Pr (∣ {\hat{β}}_{j} ∣ \geq (a + \frac{1}{2}) λ) \to 1$ , for j = 0, 1, …, q.

By the definition of the oracle estimator, we have |β̂_j| = 0, j = q + 1, …, p. It suffices to show that Pr{|s_j(β̂)| > λ, for some j = q+1, …, p} → 0. Let $D = {i : 1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} = 0}$ ; then for j = q + 1, …, p, we have

s_{j} (\hat{β}) = - n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} \geq 0) - n^{- 1} \sum_{i \in D} W_{i} Y_{i} X_{i j} (v_{i} - 1),

where −1 ≤ v_i ≤ 0 if i ∈ D and v_i = 0 otherwise. By (A5), (Z_i, Y_i) are in general position, so with probability one there are exactly (q+1) elements in D. Then by (A4), with probability one $∣ n^{- 1} \sum_{i \in D} W_{i} Y_{i} X_{i j} (v_{i} - 1) ∣ = O (q n^{- 1} log (q)) = o (λ)$ . Thus we only need to show that $Pr {{max}_{q + 1 \leq j \leq p} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} \geq 0) ∣ > λ} \to 0$ . Observe that

\begin{array}{l} Pr {max_{q + 1 \leq j \leq p} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} \geq 0) ∣ > λ} \\ \leq Pr {max_{q + 1 \leq j \leq p} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} [I (1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} \geq 0) - I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)] ∣ > \frac{λ}{2}} \\ + Pr {max_{q + 1 \leq j \leq p} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ > \frac{λ}{2}} . \end{array}

(12)

By Lemma 7.1 the second term of (12) is o_p(1). Notice that from Lemma 3.1, the first term of (12) is bounded by

\begin{array}{l} Pr [max_{q + 1 \leq j \leq p} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} {I (1 - Y_{i} Z_{i}^{T} {\hat{β}}_{1} \geq 0) - I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} ∣ > \frac{λ}{2}] \\ \leq Pr [max_{q + 1 \leq j \leq p} sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} {I (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) - I (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) \\ - Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) + Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} ∣ > \frac{λ}{4}] \\ + Pr [max_{q + 1 \leq j \leq p} sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} {Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) \\ - Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} ∣ > \frac{λ}{4}] . \end{array}

(13)

By Lemma 7.2, the first term of (13) is o_p(1). Thus we only need to bound the second term of (13). Notice that

∣ Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) - Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0) ∣ \leq ∣ F_{i} (1 + Z_{i}^{T} (β_{1} - β_{01})) - F_{i} (1) ∣ Pr (Y_{i} = 1) + ∣ G_{i} (- 1 + Z_{i}^{T} (β_{1} - β_{01})) - G_{i} (- 1) ∣ Pr (Y_{i} = - 1) .

Then we have

\begin{array}{l} max_{q + 1 \leq j \leq p} sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} {Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) - Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} ∣ \\ \leq C max_{i, j} ∣ X_{i j} ∣ sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} n^{- 1} \sum_{i = 1}^{n} ‖ Z_{i} ‖ ‖ β_{1} - β_{01} ‖ = O_{p} (\sqrt{log pn}) O (\sqrt{q / n}) O_{p} (\sqrt{q} log (n)) \\ = o_{p} (λ) . \end{array}

Thus

Pr [max_{q + 1 \leq j \leq p} sup_{‖ β_{1} - β_{01} ‖ \leq Δ \sqrt{q / n}} ∣ n^{- 1} \sum_{i = 1}^{n} W_{i} Y_{i} X_{i j} {Pr (1 - Y_{i} Z_{i}^{T} β_{1} \geq 0) - Pr (1 - Y_{i} Z_{i}^{T} β_{01} \geq 0)} ∣ > \frac{λ}{4}] = o_{p} (1),

which completes the proof.
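The subgradient s_j(β̂) of the weighted hinge loss that drives the argument above can be illustrated numerically. The following is a minimal sketch, not the authors' code: the toy data, the function names, and the parameter `t` in [0, 1] (which selects the subdifferential element for observations lying exactly on the margin) are all assumptions made for illustration, and `t` may follow a different sign convention from the v_i used in the proof. The sketch checks the defining subgradient inequality L(β) ≥ L(β̂) + s(β̂)ᵀ(β − β̂) at randomly drawn β.

```python
import random


def hinge_loss(beta, X, Y, W):
    """Weighted hinge loss n^{-1} * sum_i W_i * (1 - Y_i * X_i^T beta)_+."""
    total = 0.0
    for x, y, w in zip(X, Y, W):
        slack = 1.0 - y * sum(b * xj for b, xj in zip(beta, x))
        total += w * max(slack, 0.0)
    return total / len(Y)


def hinge_subgradient(beta, X, Y, W, t=0.5):
    """Return one subgradient of the weighted hinge loss at beta.

    t in [0, 1] picks the subdifferential element for observations with
    slack exactly zero (hypothetical parametrization for this sketch).
    """
    n = len(Y)
    s = [0.0] * len(beta)
    for x, y, w in zip(X, Y, W):
        slack = 1.0 - y * sum(b * xj for b, xj in zip(beta, x))
        coef = 1.0 if slack > 0.0 else (t if slack == 0.0 else 0.0)
        for j, xj in enumerate(x):
            s[j] -= w * y * xj * coef / n
    return s


def subgradient_inequality_holds(beta_hat, X, Y, W, t, trials=200, seed=0):
    """Check L(beta) >= L(beta_hat) + s^T (beta - beta_hat) at random beta."""
    rng = random.Random(seed)
    s = hinge_subgradient(beta_hat, X, Y, W, t)
    base = hinge_loss(beta_hat, X, Y, W)
    for _ in range(trials):
        beta = [b + rng.uniform(-2.0, 2.0) for b in beta_hat]
        linear = sum(sj * (b - bh) for sj, b, bh in zip(s, beta, beta_hat))
        if hinge_loss(beta, X, Y, W) < base + linear - 1e-9:
            return False
    return True


# Hypothetical toy data; the first column plays the role of the intercept.
X = [[1.0, 2.0, 0.0], [1.0, -1.0, 1.0], [1.0, 0.5, -2.0], [1.0, -3.0, 0.5]]
Y = [1, -1, 1, -1]
W = [1.0, 1.0, 1.0, 1.0]
beta_hat = [0.5, 0.25, 0.0]  # the first observation sits exactly on the margin
```

Calling `subgradient_inequality_holds(beta_hat, X, Y, W, t)` for any `t` in [0, 1] should return `True`, because every such choice yields a valid element of the subdifferential of a convex function. Note that `beta_hat` here is an arbitrary point chosen to place one observation on the margin, not an actual oracle estimator.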

Now we prove Theorem 3.2.

Proof (Proof of Theorem 3.2)

We will show β̂ is a local minimizer of Q(β) by writing Q(β) as g(β) − h(β).

By Theorem 3.1, we have Pr{G ⊆ ∂g(β̂)} → 1, where

G = {ξ = (ξ_{0}, \dots, ξ_{p}) : ξ_{0} = 0; ξ_{j} = λ sgn ({\hat{β}}_{j}), j = 1, \dots, q; ξ_{j} = s_{j} (\hat{β}) + λ l_{j}, j = q + 1, \dots, p},

where l_j ∈ [−1,+1], j = q + 1, …, p.

Consider any β in the ball in R^{p+1} with center β̂ and radius $\frac{λ}{2}$ . It suffices to show that there exists ξ^* ∈ G such that $Pr {ξ_{j}^{*} = \frac{\partial h (β)}{\partial β_{j}}} \to 1$ as n → ∞.

Since $\frac{\partial h (β)}{\partial β_{0}} = 0$ , we have $ξ_{0}^{*} = \frac{\partial h (β)}{\partial β_{0}}$ .

For j = 1, …, q, we have ${min}_{1 \leq j \leq q} ∣ β_{j} ∣ \geq {min}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} ∣ - {max}_{1 \leq j \leq q} ∣ {\hat{β}}_{j} - β_{j} ∣ \geq (a + \frac{1}{2}) λ - \frac{λ}{2} = a λ$ with probability tending to one by Theorem 3.1. Therefore by Condition 2 of the class of penalties $Pr {\frac{\partial h (β)}{\partial β_{j}} = λ sgn (β_{j})} \to 1$ for j = 1, …, q. For sufficiently large n, sgn(β_j) = sgn(β̂_j). Thus we have $Pr {ξ_{j}^{*} = \frac{\partial h (β)}{\partial β_{j}}} \to 1$ as n → ∞ for j = 1, …, q.

For j = q+1, …, p, we have Pr{|β_j| ≤ |β̂_j|+|β_j− β̂_j | ≤ λ} → 1 by Theorem 3.1. Therefore we have $Pr {\frac{\partial h (β)}{\partial β_{j}} = 0} \to 1$ for SCAD and $Pr {\frac{\partial h (β)}{\partial β_{j}} = - \frac{β_{j}}{a}} \to 1$ for MCP. Observe that by Condition 2 we have $Pr {∣ \frac{\partial h (β)}{\partial β_{j}} ∣ \leq λ} \to 1$ for the class of penalties. By Lemma 1 we have Pr{|s_j(β̂)| ≤ λ} → 1 for j = q + 1, …, p. We can always find l_j ∈ [−1,+1] such that $Pr {ξ_{j}^{*} = s_{j} (\hat{β}) + λ l_{j} = \frac{\partial h (β)}{\partial β_{j}}} \to 1$ for j = q + 1, …, p, for both penalties. This completes the proof.

The proof of Theorem 3.3 consists of two parts. First we will show that the LLA algorithm initiated by β̃⁽⁰⁾ gives the oracle estimator after one iteration. Then we will show that once the LLA algorithm finds the oracle estimator β̂, it will find it again in the next iteration; that is, the LLA algorithm will converge.

Proof (Proof of Theorem 3.3)

Assume that none of the events F_ni is true, for i = 1, …, 4. The probability that none of these events is true is at least 1−P_n₁−P_n₂−P_n₃−P_n₄. Then we have

\begin{array}{l} ∣ {\tilde{β}}_{j}^{(0)} ∣ = ∣ {\tilde{β}}_{j}^{(0)} - β_{0 j} ∣ \leq λ, q + 1 \leq j \leq p, \\ ∣ {\tilde{β}}_{j}^{(0)} ∣ \geq ∣ β_{0 j} ∣ - ∣ {\tilde{β}}_{j}^{(0)} - β_{0 j} ∣ \geq a λ, 1 \leq j \leq q . \end{array}

By Condition 2 of the class of nonconvex penalties, we have $p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(0)} ∣) = 0$ for 1 ≤ j ≤ q. Therefore the next iterate β̃⁽¹⁾ is the solution to the convex optimization problem

{\tilde{β}}^{(1)} = arg min_{β} n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{q + 1 \leq j \leq p} p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(0)} ∣) \cdot ∣ β_{j} ∣ .

(14)

Because F_n₃ is not true, there exists a subgradient s(β̂) at the oracle estimator such that s_j(β̂) = 0 for 0 ≤ j ≤ q and $∣ s_{j} (\hat{β}) ∣ < (1 - \frac{1}{a}) λ$ for q + 1 ≤ j ≤ p. Note that by the definition of subgradient, we have

\begin{array}{l} n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} \geq n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} \hat{β})}_{+} + \sum_{0 \leq j \leq p} s_{j} (\hat{β}) (β_{j} - {\hat{β}}_{j}) \\ = n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} \hat{β})}_{+} + \sum_{q + 1 \leq j \leq p} s_{j} (\hat{β}) (β_{j} - {\hat{β}}_{j}) . \end{array}

Then we have for any β

\begin{array}{l} {n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{q + 1 \leq j \leq p} p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(0)} ∣) ∣ β_{j} ∣} - {n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} \hat{β})}_{+} + \sum_{q + 1 \leq j \leq p} p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(0)} ∣) ∣ {\hat{β}}_{j} ∣} \\ \geq \sum_{q + 1 \leq j \leq p} {p_{λ}^{'} (∣ {\tilde{β}}_{j}^{(0)} ∣) - s_{j} (\hat{β}) \cdot sgn (β_{j})} \cdot ∣ β_{j} ∣ \geq \sum_{q + 1 \leq j \leq p} {(1 - \frac{1}{a}) λ - s_{j} (\hat{β}) \cdot sgn (β_{j})} \cdot ∣ β_{j} ∣ \geq 0. \end{array}

The strict inequality holds unless β_j = 0 for all q + 1 ≤ j ≤ p. Since we consider the non-separable case that the oracle estimator is unique, we know the oracle estimator is the unique minimizer of (14) and hence β̃⁽¹⁾ = β̂. This proves that the LLA algorithm finds the oracle estimator after one iteration.
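The role of the penalty derivative in this one-step argument can be made concrete with a small sketch. It assumes the standard SCAD (a > 2) and MCP (a > 1) derivative formulas, which are taken here to match the class of penalties in the paper; the function names, the initial estimate `beta_init`, and the values of `lam` and `a` are hypothetical. The sketch computes the LLA weights p'_λ(|β̃_j⁽⁰⁾|): they vanish for coefficients of magnitude at least aλ and are at least (1 − 1/a)λ for coefficients of magnitude at most λ, which are the two facts used above.

```python
def scad_derivative(t, lam, a=3.7):
    """Derivative p'_lambda(t) of the SCAD penalty for t >= 0 (requires a > 2)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def mcp_derivative(t, lam, a=3.0):
    """Derivative p'_lambda(t) of the MCP penalty for t >= 0 (requires a > 1)."""
    return max(lam - t / a, 0.0)


def lla_weights(beta_init, lam, a, derivative):
    """LLA weights for the penalized coordinates.

    beta_init[0] is treated as the unpenalized intercept and skipped.
    """
    return [derivative(abs(b), lam, a) for b in beta_init[1:]]


# Hypothetical initial estimate: two strong signals, three near-zero noise terms.
beta_init = [0.1, 2.0, -1.9, 0.3, -0.05, 0.0]
lam, a = 0.5, 3.7  # a * lam = 1.85

scad_w = lla_weights(beta_init, lam, a, scad_derivative)
mcp_w = lla_weights(beta_init, lam, a, mcp_derivative)
# scad_w == [0.0, 0.0, 0.5, 0.5, 0.5]
```

With these weights the next LLA step is a weighted L₁-penalized SVM in which the two signal coordinates are unpenalized, mirroring problem (14). Solving that convex problem is omitted from the sketch.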

In the case that F_n₂ is not true, we have |β̂_j| > aλ for all 1 ≤ j ≤ q. Hence by Condition 2 of the class of penalties $p_{λ}^{'} (∣ {\hat{β}}_{j} ∣) = 0$ for all 1 ≤ j ≤ q and $p_{λ}^{'} (∣ {\hat{β}}_{j} ∣) = p_{λ}^{'} (0) = λ$ for all q+1 ≤ j ≤ p. Once the LLA algorithm finds β̂, the solution to the next LLA iteration β̃⁽²⁾ is the minimizer of the convex optimization problem

{\tilde{β}}^{(2)} = arg min_{β} n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{q + 1 \leq j \leq p} λ ∣ β_{j} ∣ .

(15)

Then we have for any β

{n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + \sum_{q + 1 \leq j \leq p} λ ∣ β_{j} ∣} - {n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} \hat{β})}_{+} + \sum_{q + 1 \leq j \leq p} λ ∣ {\hat{β}}_{j} ∣} \geq \sum_{q + 1 \leq j \leq p} {λ - s_{j} (\hat{β}) \cdot sgn (β_{j})} \cdot ∣ β_{j} ∣ \geq 0.

and hence β̃⁽²⁾= β̂ is the unique minimizer of (15). That is, the LLA algorithm finds the oracle estimator again and stops.

As n → ∞, by Theorem 3.1 we have P_n₂ → 0 and P_n₄ → 0. The proof for P_n₃ → 0 is similar to the proof for Theorem 3.1 by changing the constant to be ( $1 - \frac{1}{a}$ ).

Now we prove Theorem 3.4.

Proof (Proof of Theorem 3.4)

Let || · ||₁ be the L₁ norm of a vector. Denote $l_{n} (β) = n^{- 1} \sum_{i = 1}^{n} W_{i} {(1 - Y_{i} X_{i}^{T} β)}_{+} + c_{n} {‖ β ‖}_{1}$ . Note that

E [n p^{- 1} {l_{n} (β_{0} + \sqrt{p / n} u) - l_{n} (β_{0})}] = E [n p^{- 1} {W {(1 - Y X^{T} (β_{0} + \sqrt{p / n} u))}_{+} - W {(1 - Y X^{T} β_{0})}_{+}}] + n p^{- 1} c_{n} ({‖ β_{0} + \sqrt{p / n} u ‖}_{1} - {‖ β_{0} ‖}_{1})

for some constant Δ such that ||u|| = Δ. Observe that ${‖ β_{0} + \sqrt{p / n} u ‖}_{1} - {‖ β_{0} ‖}_{1} \leq {‖ \sqrt{p / n} u ‖}_{1} = \sqrt{p / n} {‖ u ‖}_{1}$ . By the fact that c_n = o(n^{−1/2}), we have $n p^{- 1} c_{n} ({‖ β_{0} + \sqrt{p / n} u ‖}_{1} - {‖ β_{0} ‖}_{1}) \to 0$ as n → ∞. Then, as in the proof of Lemma 3.1, we can show that the expectation is dominated by $\frac{1}{2} u^{T} H (β_{0}) u > 0$ and $Pr {{inf}_{‖ u ‖ = Δ} l_{n} (β_{0} + \sqrt{p / n} u) > l_{n} (β_{0})} \geq 1 - η$ . Hence $‖ {\hat{β}}^{L_{1}} - β_{0} ‖ = O_{p} (\sqrt{p / n})$ . Because $p n^{- \frac{1}{2}} = o (λ)$ , we have $Pr (∣ {\hat{β}}_{j}^{L_{1}} - β_{0 j} ∣ > λ \text{ for some } 1 \leq j \leq p) \to 0$ as n → ∞. Then using Theorem 3.1 and Corollary 3.1 we have Pr{β̂(λ) = β̂} → 1, which completes the proof.
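The step showing that the L₁ penalty contribution vanishes can be spelled out as follows. This is a sketch under two assumptions not stated explicitly in the proof: that u ∈ R^{p+1} (so that ‖u‖₁ ≤ √(p+1) ‖u‖ by the Cauchy–Schwarz inequality) and that c_n ≥ 0.

```latex
\begin{aligned}
\left| n p^{-1} c_n \left( \| \beta_0 + \sqrt{p/n}\, u \|_1 - \| \beta_0 \|_1 \right) \right|
  &\le n p^{-1} c_n \sqrt{p/n}\, \| u \|_1 \\
  &\le n p^{-1} c_n \sqrt{p/n}\, \sqrt{p+1}\, \| u \| \\
  &= \sqrt{n}\, c_n\, \Delta \sqrt{(p+1)/p} \\
  &\le \sqrt{2}\, \Delta\, \sqrt{n}\, c_n \;\to\; 0 ,
\end{aligned}
```

where the last line uses (p+1)/p ≤ 2 for p ≥ 1 and c_n = o(n^{−1/2}).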

Contributor Information

Yichao Wu, North Carolina State University, Raleigh, NC, USA.

Lan Wang, The University of Minnesota, Minneapolis, MN, USA.

Runze Li, The Pennsylvania State University, University Park, PA, USA.

References

An LTH, Tao PD. The dc (difference of convex functions) programming and dca revisited with dc models of real world nonconvex optimization problems. Annals of Operations Research. 2005;133(1–4):23–46.
Bartlett PL, Jordan MI, McAuliffe JD. Convexity, classification, and risk bounds. Journal of the American Statistical Association. 2006;101(473):138–156.
Becker N, Toedt G, Lichter P, Benner A. Elastic scad as a novel penalization method for svm classification tasks in high-dimensional data. BMC Bioinformatics. 2011;12(1):138. doi: 10.1186/1471-2105-12-138.
Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics. 2009;37(4):1705–1732.
Bradley P, Mangasarian O. Feature selection via concave minimization and support vector machines. Machine Learning Proceedings of the Fifteenth International Conference (ICML98); 1998. pp. 82–90.
Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer; 2011.
Cai T, Liu W. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association. 2011;106(496):1566–1577.
Chen J, Chen Z. Extended bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95(3):759–771.
Claeskens G, Croux C, Van Kerckhoven J. An information criterion for variable selection in support vector machines. The Journal of Machine Learning Research. 2008;9:541–558.
Donoho D, et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture. 2000:1–32.
Fan J, Fan Y. High dimensional classification using features annealed independence rules. Annals of Statistics. 2008;36(6):2605–2637. doi: 10.1214/07-AOS504.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society. Series B (Methodological) 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics. 2014. doi: 10.1214/13-aos1198. To appear.
Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. Vol. 1. Springer; 2001. Series in Statistics.
Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine learning. 2002;46(1):389–422.
Kim Y, Choi H, Oh H. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103(484):1665–1673.
Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99(2):315–325.
Koenker R. Quantile regression. Vol. 38. Cambridge University Press; 2005.
Koo J, Lee Y, Kim Y, Park C. A Bahadur representation of the linear support vector machine. The Journal of Machine Learning Research. 2008;9:1343–1368.
Lin Y. Some asymptotic properties of the support vector machine. Technical report 1029. Department of Statistics, University of Wisconsin; Madison: 2000.
Lin Y. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
Lin Y, Lee Y, Wahba G. Support vector machines for classification in non-standard situations. Machine Learning. 2002;46(1–3):191–202.
Mazumder R, Friedman J, Hastie T. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106(495):1125–1138. doi: 10.1198/jasa.2011.tm09738.
Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34(3):1436–1462.
Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009:246–270.
Park C, Kim KR, Myung R, Koo JY. Oracle properties of scad-penalized support vector machine. Journal of Statistical Planning and Inference. 2012;142(8):2257–2270.
Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–464.
Tao P, An L. Convex analysis approach to dc programming: Theory, algorithms and applications. Acta Mathematica Vietnamica. 1997;22(1):289–355.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological) 1996;58(1):267–288.
Vapnik V. The nature of statistical learning theory. Springer; New York: 1996.
Wang L, Kim Y, Li R. Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics. 2013;41:2505–2536. doi: 10.1214/13-AOS1159.
Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association. 2012;107(497):214–222. doi: 10.1080/01621459.2012.656014.
Wang L, Zhu J, Zou H. The doubly regularized support vector machine. Statistica Sinica. 2006;16(2):589–615.
Wang L, Zhu J, Zou H. Hybrid huberized support vector machines for microarray classification. Proceedings of the 24th International Conference on Machine Learning. 2007:983–990.
Wegkamp M, Yuan M. Support vector machines with a reject option. Bernoulli. 2011;17:1368–1385.
Welsh A. On m-processes and m-estimation. The Annals of Statistics. 1989;17(1):337–361.
Yuan M. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research. 2010;99:2261–2286.
Zhang C. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942.
Zhang CH, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36(4):1567–1594.
Zhang H, Ahn J, Lin X, Park C. Gene selection using support vector machines with non-convex penalty. Bioinformatics. 2006;22(1):88–95. doi: 10.1093/bioinformatics/bti736.
Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2007;7(2):2541–2563.
Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Advances in Neural Information Processing Systems. 2004;16(1):49–56.
Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006;101(476):1418–1429.
Zou H. An improved 1-norm svm for simultaneous classification and variable selection. Eleventh International Conference on Artificial Intelligence and Statistics. 2007.
Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
Zou H, Yuan M. The f-infinity norm support vector machine. Statistica Sinica. 2008;18:379–398.
