Author manuscript; available in PMC: 2017 Jan 1.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2015 Jan 5;78(1):53–76. doi: 10.1111/rssb.12100

Variable Selection for Support Vector Machines in Moderately High Dimensions

Xiang Zhang, Yichao Wu 1, Lan Wang 2, Runze Li 3
PMCID: PMC4709852  NIHMSID: NIHMS725692  PMID: 26778916

Summary

The support vector machine (SVM) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved great success, but its performance can be seriously impaired if many redundant covariates are included. Some efforts have been devoted to studying variable selection for SVMs, but asymptotic properties, such as variable selection consistency, are largely unknown when the number of predictors diverges to infinity. In this work, we establish a unified theory for a general class of nonconvex penalized SVMs. We first prove that in ultra-high dimensions, there exists one local minimizer to the objective function of nonconvex penalized SVMs possessing the desired oracle property. We further address the problem of nonunique local minimizers by showing that the local linear approximation algorithm is guaranteed to converge to the oracle estimator even in the ultra-high dimensional setting if an appropriate initial estimator is available. This condition on initial estimator is verified to be automatically valid as long as the dimensions are moderately high. Numerical examples provide supportive evidence.

Keywords: Local linear approximation, nonconvex penalty, oracle property, support vector machines, ultra-high dimensions, variable selection

1. Introduction

Due to the recent advent of new technologies for data acquisition and storage, we have seen an explosive growth of data complexity in a variety of research areas such as genomics, imaging and finance. As a result, the number of predictors becomes huge. However there are only a moderate number of instances available for study (Donoho et al., 2000). For example, in tumor classification using genomic data, expression values of tens of thousands of genes are available, but the number of arrays is typically at the order of tens. Classification of high dimensional data poses many statistical challenges and calls for new methods and theories. In this article we consider high dimensional classification where the number of covariates diverges with the sample size and can be potentially much larger than the sample size.

Support vector machine (SVM, Vapnik, 1996) is a powerful binary classification tool with high accuracy and great flexibility. It has achieved success in many applications. However, one serious drawback of the standard SVM is that its performance can be adversely affected if many redundant variables are included in building the decision rule (Friedman et al., 2001), see the evidence in the numerical results of Section 5.1. Classification using all features has been shown to be as poor as random guessing due to noise accumulation in high dimensional space (Fan and Fan, 2008). Many methods have been proposed to remedy this problem, such as the recursive feature elimination suggested by Guyon et al. (2002). In particular, superior performance can be achieved with a unified method, namely achieving variable selection and prediction simultaneously (Fan and Li, 2001) by using an appropriate sparsity penalty. It is well known that the standard SVM can fit in the regularization framework of loss + penalty using the hinge loss and L2 penalty. Based on this, several attempts have been made to achieve variable selection for the SVM by replacing the L2 penalty with other forms of penalty. Bradley and Mangasarian (1998), Zhu et al. (2004), and Wegkamp and Yuan (2011) considered the L1-penalized SVM; Zou and Yuan (2008) proposed to use the F-norm SVM to select groups of predictors; Wang et al. (2006) and Wang et al. (2007) suggested the elastic net penalty for the SVM; Zou (2007) proposed to penalize the SVM with the adaptive LASSO penalty; Zhang et al. (2006), Becker et al. (2011) and Park et al. (2012) studied the smoothly clipped absolute deviation (SCAD, Fan and Li, 2001)-penalized SVM. Recently Park et al. (2012) studied the oracle property of the SCAD-penalized SVM with a fixed number of predictors. 
Yet, to the best of our knowledge, the theory of variable selection consistency of sparse SVMs in high dimensions or ultra-high dimensions (Fan and Lv, 2008) has not been studied so far.

In this article, we study the variable selection consistency of sparse SVMs. Instead of using the L2 penalty, we consider the penalized SVM with a general class of nonconvex penalties, such as the SCAD penalty or the minimax concave penalty (MCP, Zhang, 2010). Though the convex L1 penalty can also induce sparsity, it is well known that its variable selection consistency in linear regression relies on the stringent “irrepresentable condition” on the design matrix. This condition, however, can easily be violated in practice, see the examples in Zou (2006) and Meinshausen and Yu (2009). Moreover, the regularization parameter for model selection consistency in this case is not optimal for prediction accuracy (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007). For the nonconvex penalty, Kim et al. (2008) investigated the oracle property of the SCAD-penalized least squares regression in the high dimensions. However, a different set of proving techniques are needed for the nonconvex penalized SVMs because the hinge loss in the SVM is not a smooth function. The Karush-Kuhn-Tucker local optimality condition is generally not sufficient for the setup of a nonsmooth loss plus a nonconvex penalty. A new sufficient optimality condition based on subgradient calculation is used in the technical proof in this paper. We prove that under some general conditions, with probability tending to one, the oracle estimator is a local minimizer of the nonconvex penalized SVM objective function where the number of variables may grow exponentially with the sample size. By oracle estimator, we mean an estimator obtained by minimizing the empirical hinge loss with only relevant covariates. As one referee pointed out, with a finite sample, the empirical hinge loss may have multiple minimizers because the objective function is piecewise linear. This issue will vanish asymptotically because we assume that the population hinge loss has a unique minimizer. 
Such an assumption on the population hinge loss has been made in the existing literature (Koo et al., 2008).

Even though the nonconvex penalized SVMs are shown to enjoy the aforementioned local oracle property, it is largely unknown whether numerical algorithms can identify this local minimizer since the involved objective function is nonconvex and typically multiple local minimizers exist. Existing methods rely heavily on conditions that guarantee the local minimizer to be unique. In general, when the convexity of the hinge loss function dominates the concavity of the penalty, the nonconvex penalized SVMs actually have a unique minimizer due to global convexity. Recently Kim and Kwon (2012) gave sufficient conditions for a unique minimizer of the nonconvex penalized least square regression when global convexity is not satisfied. However, for ultra-high dimensional cases, it would be unrealistic to assume the existence of a unique local minimizer. See Wang et al. (2013) for relevant discussion and a possible solution to nonconvex penalized regression.

In this article, we further address the nonuniqueness issue of local minimizers by verifying that, with probability tending to one, the local linear approximation (LLA) algorithm (Zou and Li, 2008) is guaranteed to yield an estimator with the desired oracle property in merely two iterations under the localizability condition (Fan et al., 2014). This convergence result extends the work of Fan et al. (2014) by relaxing the differentiability assumption on the loss function, and it holds in the ultra-high dimensional setting with $p = o(\exp(n^{\delta}))$ for some positive constant $\delta$. We further show that the localizability condition is automatically valid in the moderately high dimensional setting with $p = o(\sqrt{n})$. To the best of our knowledge, this is the first result on the convergence of the LLA algorithm in the setup of a nonsmooth loss function with a nonconvex penalty.

The rest of this paper is organized as follows. Section 2 introduces the methodology of nonconvex penalized SVMs. Section 3 contains the main results of the properties of nonconvex penalized SVMs. The implementation procedure is summarized in Section 4. Simulation studies and a real data example are provided in Section 5, followed by a discussion in Section 6. Technical proofs are presented in Section 7. A zip file containing R demonstration codes for one simulation example and the real data example is available at http://www4.stat.ncsu.edu/~wu/soft/VarSelforSVMbyZhangWuWangLi.zip.

2. Nonconvex penalized support vector machines

We begin with the basic setup and notation. In binary classification, we are typically given a random sample $\{(Y_i, X_i)\}_{i=1}^{n}$ from an unknown population distribution $P(X, Y)$. Here $Y_i \in \{1, -1\}$ denotes the categorical label and $X_i = (X_{i0}, X_{i1}, \ldots, X_{ip})^T = (X_{i0}, (X_i^*)^T)^T$ denotes the input covariates with $X_{i0} = 1$ corresponding to the intercept term. The goal is to estimate a classification rule that can be used to predict output labels for future observations with input covariates only. With potentially varying misclassification cost specified by weight $W_i = w$ if $Y_i = 1$ and $W_i = 1 - w$ if $Y_i = -1$ for some $0 < w < 1$, the linear weighted support vector machine (WSVM, Lin et al., 2002) estimates the classification boundary by solving

$$\min_{\beta}\; n^{-1}\sum_{i=1}^{n} W_i\,(1 - Y_i X_i^T\beta)_+ + \lambda\,(\beta^*)^T\beta^*,$$

where (1 − u)+ = max{1 − u, 0} denotes the hinge loss, λ > 0 is a regularization parameter, and β = (β0, (β*)T)T with β* = (β1, β2, · · ·, βp)T. The standard SVM is a special case of the WSVM with weight parameter w = 0.5. In this paper, we consider the WSVM for more generality. In general, the corresponding decision rule, sign(XT β), uses all covariates and is not capable of selecting relevant covariates.
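To make the displayed objective concrete, the following is a minimal Python sketch that evaluates the weighted hinge loss plus the ridge penalty on the non-intercept coefficients. The function name, argument layout, and default values of `w` and `lam` are illustrative assumptions, not part of the original paper or its R code.

```python
def wsvm_objective(beta, X, y, w=0.5, lam=0.1):
    """Evaluate n^{-1} sum_i W_i (1 - Y_i X_i' beta)_+ + lam * (beta*)' beta*.

    beta : list [beta_0, beta_1, ..., beta_p] (beta_0 is the intercept)
    X    : list of rows [1, x_1, ..., x_p] (leading 1 for the intercept)
    y    : list of labels in {+1, -1}
    w    : weight W_i for observations with Y_i = +1 (1 - w otherwise)
    """
    n = len(y)
    loss = 0.0
    for xi, yi in zip(X, y):
        weight = w if yi == 1 else 1.0 - w
        margin = yi * sum(b * x for b, x in zip(beta, xi))
        loss += weight * max(1.0 - margin, 0.0)  # hinge loss (1 - u)_+
    ridge = lam * sum(b * b for b in beta[1:])   # intercept is not penalized
    return loss / n + ridge
```

With `w = 0.5` this is the standard SVM objective up to the constant factor 0.5 on the loss term.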

Towards variable selection for the linear WSVM, we consider the population linear weighted hinge loss $E\{W(1 - Y X^T\beta)_+\}$. Let $\beta_0 = (\beta_{00}, \beta_{01}, \ldots, \beta_{0p})^T = (\beta_{00}, (\beta_0^*)^T)^T$ denote the true parameter value, which is defined as the minimizer of the population weighted hinge loss. Namely

$$\beta_0 = \arg\min_{\beta} E\{W(1 - Y X^T\beta)_+\}. \tag{1}$$

The number of covariates $p = p_n$ is allowed to increase with the sample size $n$. It is even possible that $p_n$ is much larger than $n$. In this paper we assume the true parameter $\beta_0$ to be sparse. Let $A = \{1 \le j \le p_n : \beta_{0j} \ne 0\}$ be the index set of the nonzero coefficients. Let $q = q_n = |A|$ be the cardinality of set $A$, which is also allowed to increase with $n$. Without loss of generality, we assume that the last $p_n - q_n$ components of $\beta_0$ are zero. That is, $\beta_0^T = (\beta_{01}^T, 0^T)$. Correspondingly, we write $X_i^T = (Z_i^T, R_i^T)$, where $Z_i = (X_{i0}, X_{i1}, \ldots, X_{iq})^T = (1, (Z_i^*)^T)^T$ and $R_i = (X_{i[q+1]}, \ldots, X_{ip})^T$. Further we denote $\pi_+$ (resp. $\pi_-$) to be the marginal probability of the label $Y = +1$ (resp. $Y = -1$).

To facilitate our theoretical analysis, we introduce the gradient vector and Hessian matrix of the population linear weighted hinge loss. Let $L(\beta_1) = E\{W(1 - Y Z^T\beta_1)_+\}$ be the population linear weighted hinge loss using only relevant covariates. Define $S(\beta_1) = (S(\beta_1)_j)$ to be the $(q_n + 1)$-dimensional vector given by

$$S(\beta_1) = -E\{I(1 - Y Z^T\beta_1 \ge 0)\, W Y Z\},$$

where $I(\cdot)$ denotes the indicator function. Also define $H(\beta_1) = (H(\beta_1)_{jk})$ to be the $(q_n + 1) \times (q_n + 1)$ matrix given by

$$H(\beta_1) = E\{\delta(1 - Y Z^T\beta_1)\, W Z Z^T\},$$

where δ(·) denotes the Dirac delta function. It can be shown that if well-defined, S(β1) and H(β1) can be considered to be the gradient vector and Hessian matrix of L(β1), respectively. See Lemma 2 of Koo et al. (2008) for details.

2.1. Nonconvex penalized support vector machines

By acting as if the true sparsity structure is known in advance, the oracle estimator is defined as $\hat\beta = (\hat\beta_1^T, 0^T)^T$, where

$$\hat\beta_1 = \arg\min_{\beta_1}\; n^{-1}\sum_{i=1}^{n} W_i\,(1 - Y_i Z_i^T\beta_1)_+. \tag{2}$$

Here the objective function is piecewise linear. With a finite sample, it may have multiple minimizers. In that case, $\hat\beta_1$ can be chosen to be any minimizer, and our forthcoming theoretical results still hold. In the limit as $n \to \infty$, $\hat\beta_1$ minimizes the population version of the objective function, $E\{W(1 - Y Z^T\beta_1)_+\}$. Lin (2002) showed that when the misclassification costs are equal, the minimizer of $E\{(1 - Y f(Z))_+\}$ over measurable $f(Z)$ is the Bayes rule $\mathrm{sgn}(p(z) - 1/2)$, where $p(z) = P(Y = +1 \mid Z = z)$. This suggests that the oracle estimator is aiming at approximating the Bayes rule. In practice, achieving an estimator with the desired oracle property is very challenging, because the sparsity structure of the true parameter $\beta_0$ is largely unknown. Later we will show that under some regularity conditions, our proposed algorithm can find an estimator with the oracle property and claim convergence with high probability. Indeed, the numerical examples in Section 5.1 demonstrate that the estimator selected by our proposed algorithm has performance close to that of the Bayes rule. Note that the Bayes rule is unattainable here because we assume no knowledge of the high dimensional conditional density $P(X \mid Y)$.

In this paper, we consider the nonconvex penalized hinge loss objective function:

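$$Q(\beta) = n^{-1}\sum_{i=1}^{n} W_i\,(1 - Y_i X_i^T\beta)_+ + \sum_{j=1}^{p_n} p_{\lambda_n}(|\beta_j|), \tag{3}$$

where $p_{\lambda_n}(\cdot)$ is a symmetric penalty function with tuning parameter $\lambda_n$. Let $p'_{\lambda_n}(t)$ be the derivative of $p_{\lambda_n}(t)$ with respect to $t$. We consider a general class of nonconvex penalties that satisfy the following conditions.

  • (Condition 1) The symmetric penalty $p_{\lambda_n}(t)$ is assumed to be nondecreasing and concave for $t \in [0, +\infty)$, with a continuous derivative $p'_{\lambda_n}(t)$ on $(0, +\infty)$ and $p_{\lambda_n}(0) = 0$.

  • (Condition 2) There exists $a > 1$ such that $\lim_{t \to 0+} p'_{\lambda_n}(t) = \lambda_n$, $p'_{\lambda_n}(t) \ge \lambda_n - t/a$ for $0 < t < a\lambda$, and $p'_{\lambda_n}(t) = 0$ for $t \ge a\lambda$.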

The motivation for such a nonconvex penalty is that the convex L1 penalty lacks the oracle property due to the overpenalization of large coefficients in the selected model. Consequently it is undesirable to use the L1 penalty when the purpose of the data analysis is to select the relevant covariates among potentially high dimensional candidates in classification. Note that p, q, λ and other related quantities are allowed to depend on n, and we suppress the subscript n whenever there is no confusion.

Two commonly used nonconvex penalties that satisfy Conditions 1 and 2 are the SCAD and MCP penalties. The SCAD penalty (Fan and Li, 2001) is defined by

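$$p_\lambda(|\beta|) = \lambda|\beta|\, I(0 \le |\beta| < \lambda) + \frac{a\lambda|\beta| - (\beta^2 + \lambda^2)/2}{a - 1}\, I(\lambda \le |\beta| \le a\lambda) + \frac{(a+1)\lambda^2}{2}\, I(|\beta| > a\lambda)$$

for some $a > 2$. The MCP (Zhang, 2010) is defined by

$$p_\lambda(|\beta|) = \lambda\Bigl(|\beta| - \frac{\beta^2}{2a\lambda}\Bigr) I(0 \le |\beta| < a\lambda) + \frac{a\lambda^2}{2}\, I(|\beta| \ge a\lambda) \quad \text{for some } a > 1.$$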
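As a minimal illustration, the following Python sketch evaluates the SCAD and MCP penalties and their derivatives as defined above. The function names are ours; the default a = 3.7 for SCAD is the value commonly attributed to Fan and Li (2001), and a = 3 for MCP is an arbitrary illustrative choice. Neither default is prescribed by this section.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|), a > 2."""
    t = abs(t)
    if t < lam:
        return lam * t
    if t <= a * lam:
        return (a * lam * t - (t * t + lam * lam) / 2.0) / (a - 1.0)
    return (a + 1.0) * lam * lam / 2.0


def scad_derivative(t, lam, a=3.7):
    """p'_lambda(|t|); the value at 0 is taken as the right limit, lambda."""
    t = abs(t)
    if t <= lam:
        return lam
    if t < a * lam:
        return (a * lam - t) / (a - 1.0)
    return 0.0


def mcp_penalty(t, lam, a=3.0):
    """MCP penalty p_lambda(|t|), a > 1."""
    t = abs(t)
    if t < a * lam:
        return lam * (t - t * t / (2.0 * a * lam))
    return a * lam * lam / 2.0


def mcp_derivative(t, lam, a=3.0):
    """p'_lambda(|t|) = (lambda - |t|/a)_+, which meets Condition 2 with equality."""
    return max(lam - abs(t) / a, 0.0)
```

Both derivatives start at λ near zero and vanish beyond aλ, so large coefficients are left unpenalized, in contrast to the L1 penalty whose derivative stays at λ.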

3. Oracle property

3.1. Regularity conditions

To facilitate our technical proofs, we impose the following regularity conditions:

  • (A1)

    The densities of $Z^*$ given $Y = +1$ and $Y = -1$ are continuous and have common support in $\mathbb{R}^{q}$.

  • (A2)

    $E[X_j^2] < \infty$ for $1 \le j \le q$.

  • (A3)

    The true parameter β0 is unique and a nonzero vector.

  • (A4)

    $q_n = O(n^{c_1})$, namely $\lim_{n \to \infty} q_n / n^{c_1} < \infty$, for some $0 \le c_1 \le 1/2$.

  • (A5)

    There exists a constant $M_1 > 0$ such that $\lambda_{\max}(n^{-1} X_A^T X_A) \le M_1$, where $X_A$ consists of the first $q_n + 1$ columns of the design matrix and $\lambda_{\max}$ denotes the largest eigenvalue. It is further assumed that $\max_{1 \le i \le n} \|Z_i\| = O_p(\sqrt{q_n \log(n)})$, that $(Z_i, Y_i)$ are in general position (Koenker, 2005, sect. 2.2), and that $X_{ij}$ are sub-Gaussian random variables for $1 \le i \le n$, $q_n + 1 \le j \le p_n$.

  • (A6)

    λmin(H(β01)) ≥ M2 for some constant M2 > 0, where λmin denotes the smallest eigenvalue.

  • (A7)

    $n^{(1 - c_2)/2} \min_{1 \le j \le q_n} |\beta_{0j}| \ge M_3$ for some constant $M_3 > 0$ and $2c_1 < c_2 \le 1$.

  • (A8)

    Denote the conditional density of $Z^T\beta_{01}$ given $Y = +1$ and $Y = -1$ as $f$ and $g$, respectively. It is assumed that $f$ is uniformly bounded away from 0 and ∞ in a neighborhood of 1 and $g$ is uniformly bounded away from 0 and ∞ in a neighborhood of −1.

Remark 3.1

Conditions (A1)–(A3) and (A6) are also assumed for fixed $p$ in Koo et al. (2008). We need these assumptions to ensure that the oracle estimator is consistent in the scenario of diverging $p$. Condition (A3) states that the optimal classification decision function is not constant, which is required to ensure that $S(\beta)$ and $H(\beta)$ are the well-defined gradient vector and Hessian matrix of the hinge loss; see Lemma 2 and Lemma 3 of Koo et al. (2008). Conditions (A4) and (A7) are common in the literature on high dimensional inference (Kim et al., 2008). More specifically, (A4) states that the divergence rate of the number of nonzero coefficients cannot be faster than root-$n$, and (A7) simply states that the signals cannot decay too quickly. The condition on the largest eigenvalue of the design matrix in (A5) is similar to the sparse Riesz condition and is also assumed in Zhang and Huang (2008), Yuan (2010) and Zhang (2010). Note that the bound on the smallest eigenvalue is not specified. The condition on the maximum norm in (A5) holds when $Z^*$ given $Y$ follows a multivariate normal distribution. $(Z_i, Y_i)$ are in general position if with probability one there are exactly $(q_n + 1)$ elements in $D = \{i : 1 - Y_i Z_i^T\hat\beta_1 = 0\}$ (Koenker, 2005, sect. 2.2). The condition for general position is true with probability one w.r.t. Lebesgue measure. Condition (A8) requires that there is enough information around the nondifferentiable point of the hinge loss, similar to Condition (C5) in Wang et al. (2012) for quantile regression.

For illustrative examples that satisfy all the above conditions, assume $0 < \pi_+ = 1 - \pi_- < 1$ and let the number of signals be fixed. The first example is that the conditional distributions of $X^*$ given $Y$ have unbounded support $\mathbb{R}^{p}$ with sub-Gaussian tails. It can be easily seen that Fisher's discriminant analysis is one special case, when $X^*$ given $Y$ is Gaussian. Conditions (A1)–(A4) and (A7) are trivial. Condition (A5) holds by the properties of sub-Gaussian random variables. Koo et al. (2008) showed that Condition (A6) holds if the supports of the conditional densities of $Z^*$ given $Y$ are convex, which is naturally satisfied for $\mathbb{R}^{q}$. Condition (A8) is trivially satisfied by the unbounded support of the conditional distribution of $Z^*$ given $Y$. Another example is the probit model in which $X^*$ has unbounded support $\mathbb{R}^{p}$ with sub-Gaussian tails and $\Pr(Y = +1 \mid X^*) = \Phi(X^T\beta)$ for some $\beta \ne 0$. It can be easily checked that the conditional distributions of $X^*$ given $Y$ also have unbounded support $\mathbb{R}^{p}$ and hence all the conditions are satisfied.

3.2. Local oracle property

In this subsection, we establish the theory of the local oracle property for the nonconvex penalized SVMs; namely, the oracle estimator is one of the local minimizers of the objective function Q(β) defined in (3). We start with the following lemma on the consistency of the oracle estimator, which can be viewed as an extension of the consistency result in Koo et al. (2008) to the diverging p scenario.

Lemma 3.1

Assume that Conditions (A1)–(A7) are satisfied. The oracle estimator $\hat\beta = (\hat\beta_1^T, 0^T)^T$ satisfies $\|\hat\beta_1 - \beta_{01}\| = O_p(\sqrt{q_n/n})$ when $n \to \infty$.

Though the convexity of the nonconvex penalized hinge loss objective function Q(β) is not guaranteed, it can be written as the difference of two convex functions:

Q(β)=g(β)-h(β), (4)

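where $g(\beta) = n^{-1}\sum_{i=1}^{n} W_i\,(1 - Y_i X_i^T\beta)_+ + \lambda_n\sum_{j=1}^{p}|\beta_j|$ and $h(\beta) = \lambda_n\sum_{j=1}^{p}|\beta_j| - \sum_{j=1}^{p} p_{\lambda_n}(|\beta_j|) = \sum_{j=1}^{p} H_{\lambda_n}(\beta_j)$. The form of $H_\lambda(\beta_j)$ depends on the penalty function. For the SCAD penalty, we have

$$H_\lambda(\beta_j) = \bigl[(\beta_j^2 - 2\lambda|\beta_j| + \lambda^2)/\{2(a-1)\}\bigr]\, I(\lambda \le |\beta_j| \le a\lambda) + \{\lambda|\beta_j| - (a+1)\lambda^2/2\}\, I(|\beta_j| > a\lambda),$$

while for MCP, we have $H_\lambda(\beta_j) = \{\beta_j^2/(2a)\}\, I(0 \le |\beta_j| < a\lambda) + (\lambda|\beta_j| - a\lambda^2/2)\, I(|\beta_j| \ge a\lambda)$. This decomposition is useful, as it naturally satisfies the form of the difference of convex functions (DC) algorithm (An and Tao, 2005).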
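The closed forms of $H_\lambda$ can be checked numerically against the defining identity $H_\lambda(\beta_j) = \lambda|\beta_j| - p_\lambda(|\beta_j|)$. The Python sketch below does this on a grid; all function names and the grid itself are illustrative assumptions.

```python
def scad(t, lam, a):
    """SCAD penalty p_lambda(|t|)."""
    t = abs(t)
    if t < lam:
        return lam * t
    if t <= a * lam:
        return (a * lam * t - (t * t + lam * lam) / 2.0) / (a - 1.0)
    return (a + 1.0) * lam * lam / 2.0


def mcp(t, lam, a):
    """MCP penalty p_lambda(|t|)."""
    t = abs(t)
    if t < a * lam:
        return lam * (t - t * t / (2.0 * a * lam))
    return a * lam * lam / 2.0


def h_scad(b, lam, a):
    """Closed form of H_lambda for SCAD, as displayed in the text."""
    t = abs(b)
    if t < lam:
        return 0.0
    if t <= a * lam:
        return (t * t - 2.0 * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return lam * t - (a + 1.0) * lam * lam / 2.0


def h_mcp(b, lam, a):
    """Closed form of H_lambda for MCP, as displayed in the text."""
    t = abs(b)
    if t < a * lam:
        return t * t / (2.0 * a)
    return lam * t - a * lam * lam / 2.0


def max_gap(penalty, h, lam, a, grid):
    """Largest absolute difference between H and lam*|b| - p(|b|) over the grid."""
    return max(abs(h(b, lam, a) - (lam * abs(b) - penalty(b, lam, a))) for b in grid)
```

For example, `max_gap(scad, h_scad, 0.5, 3.7, [k * 0.01 - 5 for k in range(1001)])` should be zero up to floating-point rounding.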

To prove the oracle property of the nonconvex penalized SVMs, we will use a sufficient local optimality condition for difference convex programming first presented in Tao and An (1997). This sufficient condition is based on subgradient calculus. The subgradient can be viewed as an extension of the gradient of a smooth convex function to a nonsmooth convex function. Let $\mathrm{dom}(g) = \{x : g(x) < \infty\}$ be the effective domain of a convex function $g$. The subgradient of $g(x)$ at a point $x_0$ is defined as $\partial g(x_0) = \{t : g(x) \ge g(x_0) + (x - x_0)^T t \text{ for all } x\}$. Note that at a nondifferentiable point, the subgradient contains a collection of vectors. One can easily check that the subgradient of the hinge loss function at the oracle estimator is the collection of vectors $s(\hat\beta) = (s_0(\hat\beta), \ldots, s_p(\hat\beta))^T$ with

$$s_j(\hat\beta) = -n^{-1}\sum_{i=1}^{n} W_i Y_i X_{ij}\, I(1 - Y_i X_i^T\hat\beta > 0) - n^{-1}\sum_{i=1}^{n} W_i Y_i X_{ij}\, v_i, \tag{5}$$

where $-1 \le v_i \le 0$ if $1 - Y_i X_i^T\hat\beta = 0$ and $v_i = 0$ otherwise, $j = 0, \ldots, p$. Under some regularity conditions, we can study the asymptotic behavior of the subgradient at the oracle estimator. The results are summarized in the following theorem.

Theorem 3.1

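Suppose that Conditions (A1)–(A8) hold, and the tuning parameter satisfies $\lambda = o(n^{-(1 - c_2)/2})$ and $\log(p)\, q \log(n)\, n^{-1/2} = o(\lambda)$. For the oracle estimator $\hat\beta$, there exists $v_i^*$ which satisfies $v_i^* = 0$ if $1 - Y_i X_i^T\hat\beta \ne 0$ and $v_i^* \in [-1, 0]$ if $1 - Y_i X_i^T\hat\beta = 0$, such that for $s_j(\hat\beta)$ with $v_i = v_i^*$, with probability approaching one, we have

$$s_j(\hat\beta) = 0,\ j = 0, 1, \ldots, q; \qquad |\hat\beta_j| \ge (a + \tfrac{1}{2})\lambda,\ j = 1, \ldots, q; \qquad |s_j(\hat\beta)| \le \lambda \ \text{and}\ |\hat\beta_j| = 0,\ j = q+1, \ldots, p.$$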

Theorem 3.1 characterizes the subgradients of the hinge loss at the oracle estimator. It basically says that in a regular setting, with probability arbitrarily close to one, those components of the subgradients corresponding to the relevant covariates are exactly zero and those corresponding to irrelevant covariates are not far from zero.

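We now present the sufficient optimality condition based on subgradient calculation. Corollary 1 of Tao and An (1997) states that if there exists a neighborhood $U$ around the point $x^*$ such that $\partial h(x) \cap \partial g(x^*) \ne \emptyset$ for all $x \in U \cap \mathrm{dom}(g)$, then $x^*$ is a local minimizer of $g(x) - h(x)$. To verify this local sufficient condition, we study the asymptotic behavior of the subgradients of the two convex functions in the aforementioned decomposition (4) of $Q(\beta)$. Note that, based on (5), the subgradient function of $g(\beta)$ at $\beta$ can be shown to be the following collection of vectors:

$$\partial g(\beta) = \Bigl\{\xi = (\xi_0, \ldots, \xi_p)^T \in \mathbb{R}^{p+1} : \xi_j = -n^{-1}\sum_{i=1}^{n} W_i Y_i X_{ij}\, I(1 - Y_i X_i^T\hat\beta > 0) - n^{-1}\sum_{i=1}^{n} W_i Y_i X_{ij}\, v_i + \lambda l_j,\ j = 0, \ldots, p\Bigr\},$$

where $l_0 = 0$, $l_j = \mathrm{sgn}(\beta_j)$ if $\beta_j \ne 0$ and $l_j \in [-1, 1]$ otherwise for $1 \le j \le p$, and $-1 \le v_i \le 0$ if $1 - Y_i X_i^T\hat\beta = 0$ and $v_i = 0$ otherwise for $1 \le i \le n$. Furthermore, by Condition 2 of the class of nonconvex penalty functions, $\lim_{t \to 0+} H'_\lambda(t) = \lim_{t \to 0-} H'_\lambda(t) = \lambda\,\mathrm{sgn}(t) - \lambda\,\mathrm{sgn}(t) = 0$. Thus $h(\beta)$ is differentiable everywhere. Consequently the subgradient of $h(\beta)$ at point $\beta$ is a singleton:

$$\partial h(\beta) = \Bigl\{\mu = (\mu_0, \ldots, \mu_p)^T \in \mathbb{R}^{p+1} : \mu_j = \frac{\partial h(\beta)}{\partial \beta_j},\ j = 0, \ldots, p\Bigr\}.$$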

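For the class of nonconvex penalty functions under consideration, $\partial h(\beta)/\partial\beta_j = 0$ for $j = 0$. For $1 \le j \le p$,

$$\frac{\partial h(\beta)}{\partial \beta_j} = \bigl[\{\beta_j - \lambda\,\mathrm{sgn}(\beta_j)\}/(a - 1)\bigr]\, I(\lambda \le |\beta_j| \le a\lambda) + \lambda\,\mathrm{sgn}(\beta_j)\, I(|\beta_j| > a\lambda)$$

for the SCAD penalty, and

$$\frac{\partial h(\beta)}{\partial \beta_j} = (\beta_j/a)\, I(0 \le |\beta_j| < a\lambda) + \lambda\,\mathrm{sgn}(\beta_j)\, I(|\beta_j| \ge a\lambda)$$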

for the MCP.

Combining this with Theorem 3.1, we will prove that with probability tending to one, for any $\beta$ in a ball in $\mathbb{R}^{p+1}$ with center $\hat\beta$ and radius $\lambda/2$, there exists a subgradient $\xi = (\xi_0, \ldots, \xi_p)^T \in \partial g(\hat\beta)$ such that $\partial h(\beta)/\partial\beta_j = \xi_j$, $j = 0, 1, \ldots, p$. Consequently the oracle estimator $\hat\beta$ is itself a local minimizer of (3). This is summarized in the following theorem.

Theorem 3.2

Assume that Conditions (A1)–(A8) hold. Let $B_n(\lambda)$ be the set of local minimizers of the objective function $Q(\beta)$ with regularization parameter $\lambda$. The oracle estimator $\hat\beta = (\hat\beta_1^T, 0^T)^T$ satisfies

$$\Pr\{\hat\beta \in B_n(\lambda)\} \to 1$$

as $n \to \infty$, if $\lambda = o(n^{-(1 - c_2)/2})$ and $\log(p)\, q \log(n)\, n^{-1/2} = o(\lambda)$.

It can be shown that if we take $\lambda = n^{-1/2 + \delta}$ for some $c_1 < \delta < c_2/2$, then the oracle property holds even for $p = o(\exp(n^{(\delta - c_1)/2}))$. Therefore, the local oracle property holds for the nonconvex penalized SVM even when the number of covariates grows exponentially with the sample size.
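The rate claim can be verified directly from the two conditions on $\lambda$ in Theorem 3.2. The following is a sketch of that check, using $q = O(n^{c_1})$ from (A4) and the fact that $p = o(\exp(n^{(\delta - c_1)/2}))$ gives $\log p \le n^{(\delta - c_1)/2}$ for all large $n$; constants are treated loosely.

```latex
\begin{align*}
\lambda = n^{-1/2+\delta},\ \delta < c_2/2
  &\;\Longrightarrow\; \lambda\, n^{(1-c_2)/2} = n^{\delta - c_2/2} \to 0
  \;\Longrightarrow\; \lambda = o\bigl(n^{-(1-c_2)/2}\bigr), \\
\frac{\log(p)\, q \log(n)\, n^{-1/2}}{\lambda}
  &= O\bigl(\log(p)\, n^{c_1 - \delta} \log(n)\bigr), \\
\log p \le n^{(\delta - c_1)/2}
  &\;\Longrightarrow\; \log(p)\, n^{c_1 - \delta} \log(n)
     \le n^{-(\delta - c_1)/2} \log(n) \to 0 .
\end{align*}
```

The requirement $\delta > c_1$ is what makes the last exponent negative, so both conditions on $\lambda$ hold simultaneously.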

3.3. An algorithm with provable convergence to the oracle estimator

Note that Theorem 3.2 indicates that one of the local minimizers possesses the oracle property. However, there can potentially be multiple local minimizers and it remains challenging to identify the oracle estimator. In the high dimensional setting, assuming that the local minimizer is unique would not be realistic.

In this article, instead of assuming the uniqueness of solutions, we work directly on the conditions under which the oracle estimator can be identified by some numerical algorithms that solve the nonconvex penalized SVM objective function. One possible algorithm is the local linear approximation (LLA) algorithm proposed by Zou and Li (2008). We focus on theoretical development first in this section and defer the detailed LLA algorithm for the nonconvex penalized SVMs to Section 4. Recently LLA has been shown to be capable of identifying the oracle estimator in the setup of folded concave penalized estimation with a differentiable loss function (Wang et al., 2013; Fan et al., 2014). We generalize their results to nondifferentiable loss functions, so that they fit in the framework of the nonconvex penalized SVMs. Similar to their work, the main condition required is the existence of an appropriate initial estimator supplied to the iterations of the LLA algorithm. Denote the initial estimator as $\hat\beta^{(0)}$. Intuitively, if the initial estimator $\hat\beta^{(0)}$ lies in a small neighborhood of the true value $\beta_0$, the algorithm should converge to the good local minimizer around $\beta_0$. This localizability will be formalized in terms of $L_\infty$ distance later. With such an appropriate initial estimator, under the aforementioned regularity conditions, one can prove that the LLA algorithm converges to the oracle estimator with probability tending to one even in ultra-high dimensions.

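Let $\hat\beta^{(0)} = (\hat\beta_0^{(0)}, \ldots, \hat\beta_p^{(0)})^T$. Consider the following events:

  • $F_{n1} = \{|\hat\beta_j^{(0)} - \beta_{0j}| > \lambda \text{ for some } 1 \le j \le p\}$,

  • $F_{n2} = \{|\beta_{0j}| < (a + 1)\lambda \text{ for some } 1 \le j \le q\}$,

  • $F_{n3} = \{\text{for all subgradients } s(\hat\beta),\ |s_j(\hat\beta)| > (1 - \tfrac{1}{a})\lambda \text{ for some } q + 1 \le j \le p \text{ or } |s_j(\hat\beta)| \ne 0 \text{ for some } 0 \le j \le q\}$,

  • $F_{n4} = \{|\hat\beta_j| < a\lambda \text{ for some } 1 \le j \le q\}$.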

Denote the corresponding probabilities as Pni = Pr(Fni), i = 1, 2, 3, 4. Pn1 represents the localizability of the problem. When we have an appropriate initial estimator, we expect Pn1 to converge to 0 as n → ∞. Pn2 is the probability that the true signal is too small to be detected by any method. Pn3 describes the behavior of the subgradients at the oracle estimator. As stated in Theorem 3.1, there exists a subgradient such that its components corresponding to irrelevant variables are near 0 and those corresponding to relevant variables are exactly 0, so Pn3 cannot be too large. Pn4 has to do with the magnitude of the oracle estimator on relevant variables. Under regularity conditions, the oracle estimator will detect the true signals and hence Pn4 will be very small.

Now we provide conditions for the LLA algorithm to find the oracle estimator β̂ in the nonconvex penalized SVMs based on Pn1, Pn2, Pn3 and Pn4.

Theorem 3.3

With probability at least $1 - P_{n1} - P_{n2} - P_{n3} - P_{n4}$, the LLA algorithm initiated by $\hat\beta^{(0)}$ finds the oracle estimator $\hat\beta$ after two iterations. Furthermore, if (A1)–(A8) hold, $\lambda = o(n^{-(1 - c_2)/2})$ and $\log(p)\, q \log(n)\, n^{-1/2} = o(\lambda)$, then $P_{n2} \to 0$, $P_{n3} \to 0$ and $P_{n4} \to 0$ as $n \to \infty$.

The first part of Theorem 3.3 provides a nonasymptotic lower bound on the probability that the LLA algorithm converges to the oracle estimator. As we will show in the appendix, if none of the events Fni happen, the LLA algorithm initiated with β̃(0) will find the oracle estimator in the first iteration, and in the second iteration it will find the oracle estimator again and thus claim convergence. Note that only a single correction is required in the first iteration and the second iteration is needed to stop the algorithm. Therefore, the LLA algorithm can identify the oracle estimator after two iterations and this result holds generally without the Conditions (A1)–(A8).

The second part of Theorem 3.3 indicates that under Conditions (A1)–(A8), the lower bound is determined only by the limiting behavior of the initial estimator. As long as an appropriate initial estimator is available, the problem of selecting the oracle estimator from potential multiple local minimizers is addressed. Let β̂L1 be the solution to the L1-penalized SVM. When the initial estimator β̃(0) is taken to be β̂L1 and the following Condition (A9) holds, by Theorem 3.3 the oracle estimator can be identified even in the ultra-high dimensional setting. The result is summarized in the following Corollary.

  • (A9)

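    $\Pr(|\hat\beta_j^{L_1} - \beta_{0j}| > \lambda \text{ for some } 1 \le j \le p) \to 0$ as $n \to \infty$.

Corollary 3.1

Let $\hat\beta(\lambda)$ be the solution found by the LLA algorithm initiated by $\hat\beta^{L_1}$ after two iterations. Assume that the conditions of Theorem 3.3 and (A9) hold. Then

$$\Pr\{\hat\beta(\lambda) = \hat\beta\} \to 1 \quad \text{as } n \to \infty.$$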

In the ultra-high dimensional case, one may require more stringent conditions to guarantee (A9). For the nonconvex penalized least squares regression, one can use the LASSO solution (Tibshirani, 1996) as the initial estimator, and (A9) holds if one can further assume the restricted eigenvalue condition on the design matrix (Bickel et al., 2009). However, it is still largely unknown whether this conclusion also applies to the setting where both the loss and the penalty are nondifferentiable. Without imposing any new regularity conditions, we next prove that in moderately high dimensions with $p = o(\sqrt{n})$, the solution to the L1-penalized SVM satisfies (A9) under conditions quite similar to (A1)–(A8).

The following regularity Conditions are modified from (A1)–(A8). Conditions (A3), (A7)–(A8) are the same as aforementioned.

  • (A1*)

    The densities of $X^*$ given $Y = +1$ and $Y = -1$ are continuous and have a common support in $\mathbb{R}^{p}$.

  • (A2*)

    $E[X_j^2] < \infty$ for $1 \le j \le p$.

  • (A4*)

    $p_n = O(n^{c_1})$ for some $0 \le c_1 < 1/2$.

  • (A5*)

    There exists a constant $M_1 > 0$ such that $\lambda_{\max}(n^{-1} X^T X) \le M_1$. It is further assumed that $\max_{1 \le i \le n} \|X_i\| = O_p(\sqrt{p_n \log n})$, that $(X_i, Y_i)$ are in general position (Koenker, 2005, sect. 2.2), and that $X_{ij}$ are sub-Gaussian random variables for $1 \le i \le n$, $q_n + 1 \le j \le p_n$.

  • (A6*)

    λmin(H(β0)) ≥ M3 for some constant M3 > 0.

Under the new regularity conditions, we can conclude that the solution to the L1-penalized SVM is an appropriate initial estimator. Combined with Theorem 3.3, the LLA algorithm initiated with a zero vector can identify the oracle estimator with one more iteration. The results are summarized in the following Theorem.

Theorem 3.4

Assume $\hat\beta^{L_1}$ is the solution to the L1-penalized SVM with tuning parameter $c_n$. If the modified conditions hold, $\lambda = o(n^{-(1 - c_2)/2})$, $p \log(n)\, n^{-1/2} = o(\lambda)$ and $c_n = o(n^{-1/2})$, then we have $\Pr(|\hat\beta_j^{L_1} - \beta_{0j}| > \lambda \text{ for some } 1 \le j \le p) \to 0$ as $n \to \infty$. Further, the LLA algorithm initiated by $\hat\beta^{L_1}$ finds the oracle estimator in two iterations with probability tending to one. That is, $\Pr\{\hat\beta(\lambda) = \hat\beta\} \to 1$ as $n \to \infty$.

Note that Theorem 3.4 guarantees that the LLA algorithm initialized by $\hat\beta^{L_1}$ identifies the oracle estimator with high probability only when $p = o(\sqrt{n})$. However, our empirical studies suggest that even for cases with $p$ much larger than $n$, the LLA algorithm initiated by $\hat\beta^{L_1}$ usually converges within two iterations and the identified local minimizer has acceptable performance.

4. Implementation and tuning

To solve the nonconvex penalized SVMs, we use the LLA algorithm. More explicitly, we start with an initial value $\{\beta^{(0)} : \beta_j^{(0)} = 0,\ j = 1, 2, \ldots, p\}$. At each step $t \ge 1$, we update by solving

$$\min_{\beta}\Bigl\{n^{-1}\sum_{i=1}^{n} W_i\,(1 - Y_i X_i^T\beta)_+ + \sum_{j=1}^{p} p'_\lambda(|\beta_j^{(t-1)}|)\,|\beta_j|\Bigr\}, \tag{6}$$

where $p'_\lambda(\cdot)$ denotes the derivative of $p_\lambda(\cdot)$. Following the literature, when $\beta_j^{(t-1)} = 0$, we take $p'_\lambda(0)$ as $p'_\lambda(0+) = \lambda$. The LLA algorithm is an instance of the majorize-minimize (MM) algorithm and converges to a local minimizer of the nonconvex objective function.
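The weights that enter the update (6) can be sketched as follows for the SCAD penalty. The function name `lla_weights` and the default a = 3.7 are illustrative assumptions; the convention at zero follows the text, taking the derivative at 0 to be λ.

```python
def lla_weights(beta_prev, lam, a=3.7):
    """SCAD-based LLA weights p'_lambda(|beta_j^{(t-1)}|) for j = 1, ..., p.

    beta_prev : list [beta_0, beta_1, ..., beta_p] from the previous step;
                the intercept beta_0 is not penalized and gets no weight.
    """
    weights = []
    for b in beta_prev[1:]:
        t = abs(b)
        if t <= lam:
            weights.append(lam)                        # includes t = 0: p'(0+) = lambda
        elif t < a * lam:
            weights.append((a * lam - t) / (a - 1.0))  # tapering region
        else:
            weights.append(0.0)                        # large coefficients: no penalty
    return weights
```

Starting from the zero vector, every weight equals λ, so the first LLA step is exactly an L1-penalized SVM with tuning parameter λ; later steps remove the penalty on coefficients whose magnitude exceeds aλ.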

With slack variables, the convex optimization problem in (6) can be easily recast as a linear programming (LP) problem

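$$\min_{\xi, \eta, \beta}\Bigl\{n^{-1}\sum_{i=1}^{n} W_i\,\xi_i + \sum_{j=1}^{p} p'_\lambda(|\beta_j^{(t-1)}|)\,\eta_j\Bigr\}$$

subject to

$$\xi_i \ge 0,\ i = 1, 2, \ldots, n; \qquad \xi_i \ge 1 - Y_i X_i^T\beta,\ i = 1, 2, \ldots, n; \qquad \eta_j \ge \beta_j,\ \eta_j \ge -\beta_j,\ j = 1, 2, \ldots, p.$$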
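One way to hand this LP to a generic solver is to stack the variables as $z = (\xi_1, \ldots, \xi_n, \eta_1, \ldots, \eta_p, \beta_0, \ldots, \beta_p)$ and write the constraints in the form $A z \le b$. The Python sketch below only assembles the arrays; the function name, variable ordering, and return layout are our assumptions, chosen to match the common `c, A_ub, b_ub, bounds` convention (for example, that of `scipy.optimize.linprog`), and the paper's own R code may be organized differently.

```python
def build_lla_lp(X, y, penalty_weights, w=0.5):
    """Assemble the LP for one LLA step: minimize c'z subject to A_ub z <= b_ub.

    X               : n rows [1, x_1, ..., x_p] (leading 1 for the intercept)
    y               : n labels in {+1, -1}
    penalty_weights : p weights p'_lambda(|beta_j^{(t-1)}|), j = 1, ..., p
    Variable order  : z = (xi_1..xi_n, eta_1..eta_p, beta_0..beta_p)
    """
    n, p = len(y), len(penalty_weights)
    n_var = n + p + (p + 1)
    beta_start = n + p

    # Objective: n^{-1} sum_i W_i xi_i + sum_j weight_j eta_j (beta has zero cost).
    c = [(w if yi == 1 else 1.0 - w) / n for yi in y]
    c += list(penalty_weights) + [0.0] * (p + 1)

    A_ub, b_ub = [], []
    # xi_i >= 1 - y_i x_i' beta   <=>   -xi_i - y_i x_i' beta <= -1
    for i in range(n):
        row = [0.0] * n_var
        row[i] = -1.0
        for j in range(p + 1):
            row[beta_start + j] = -y[i] * X[i][j]
        A_ub.append(row)
        b_ub.append(-1.0)
    # eta_j >= beta_j and eta_j >= -beta_j   <=>   +/- beta_j - eta_j <= 0
    for j in range(1, p + 1):
        for sign in (1.0, -1.0):
            row = [0.0] * n_var
            row[beta_start + j] = sign
            row[n + j - 1] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)

    # xi_i >= 0; eta and beta are free (eta_j >= 0 is implied by the constraints).
    bounds = [(0.0, None)] * n + [(None, None)] * (2 * p + 1)
    return c, A_ub, b_ub, bounds
```

The problem has $n + 2p + 1$ variables and $n + 2p$ inequality rows, so its size grows linearly in both the sample size and the number of covariates.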

We propose using the stopping rule that pλ(βj(t-1)) stabilizes for j = 1, 2, ···, p, namely, when j=1p(pλ(βj(t-1))-pλ(βj(t)))2 is sufficiently small.
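The LP formulation and the stopping rule can be sketched in a few lines. The code below only assembles the LP data in the generic form "minimize c'z subject to A_ub z <= b_ub" with z = (ξ, η, β0, β1, …, βp); it assumes an unpenalized intercept β0, and the function names, variable ordering and tolerance are illustrative choices rather than the authors' implementation.

```python
def build_lla_lp(X, y, W, pen_weights):
    """Assemble one LLA step as an LP over z = (xi_1..xi_n, eta_1..eta_p, b0, b_1..b_p).

    X: n rows of p covariates (no intercept column), y: labels in {-1, +1},
    W: observation weights W_i, pen_weights: p'_lambda(|beta_j^(t-1)|), j = 1..p.
    """
    n, p = len(X), len(X[0])
    n_var = n + p + (p + 1)
    # Objective: n^{-1} sum_i W_i xi_i + sum_j weight_j eta_j (beta has zero cost).
    c = [W[i] / n for i in range(n)] + list(pen_weights) + [0.0] * (p + 1)
    A_ub, b_ub = [], []
    # xi_i >= 1 - y_i (b0 + x_i' b)   <=>   -xi_i - y_i b0 - y_i x_i' b <= -1
    for i in range(n):
        row = [0.0] * n_var
        row[i] = -1.0
        row[n + p] = -float(y[i])
        for j in range(p):
            row[n + p + 1 + j] = -y[i] * X[i][j]
        A_ub.append(row)
        b_ub.append(-1.0)
    # eta_j >= b_j and eta_j >= -b_j   <=>   +/- b_j - eta_j <= 0
    for j in range(p):
        for sign in (1.0, -1.0):
            row = [0.0] * n_var
            row[n + p + 1 + j] = sign
            row[n + j] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    # xi_i >= 0; eta and beta are left free (eta >= |beta| is enforced above).
    bounds = [(0.0, None)] * n + [(None, None)] * (p + p + 1)
    return c, A_ub, b_ub, bounds


def lla_converged(d_prev, d_curr, tol=1e-8):
    """Stopping rule: squared change in the penalty derivatives is below tol."""
    return sum((u - v) ** 2 for u, v in zip(d_prev, d_curr)) < tol
```

The returned tuple is in the shape accepted by common LP solvers, for example `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)`; the last p entries of the solution are then the updated slope coefficients.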

For the choice of tuning parameter λ, Claeskens et al. (2008) suggested the SVM information criterion (SVMIC). For a subset S of {1, 2, …, p}, the SVMIC is defined as

$$\mathrm{SVMIC}(S) = \sum_{i=1}^{n}\xi_i + \log(n)\,|S|,$$

where |S| is the cardinality of S and ξi, i = 1, 2, …, n, denote the corresponding optimal slack variables. This criterion directly follows the spirit of the Bayesian information criterion (BIC) by Schwarz (1978). Chen and Chen (2008) showed that BIC can be too liberal when the model space is large and proposed the extended BIC (EBIC):

$$\mathrm{EBIC}_\gamma(S) = -2\log\mathrm{Likelihood} + \log(n)\,|S| + 2\gamma\binom{p}{|S|}, \qquad 0 \le \gamma \le 1.$$

By combining these ideas, we suggest the SVM-extended BIC (SVMICγ)

$$\mathrm{SVMIC}_\gamma(S) = \sum_{i=1}^{n} 2W_i\xi_i + \log(n)\,|S| + 2\gamma\binom{p}{|S|}, \qquad 0 \le \gamma \le 1.$$

Note that SVMICγ reduces to SVMIC when γ = 0 and w = 0.5. We use γ = 0.5 as suggested by Chen and Chen (2008) and choose the λ that minimizes SVMICγ.
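As a hedged sketch, the criterion can be computed from the optimal slack variables as follows. The function name and arguments are illustrative. One assumption deserves emphasis: the model-space term is taken here on the log scale, log C(p, |S|), as in the EBIC of Chen and Chen (2008), whereas the criterion as printed above shows the binomial term without a logarithm.

```python
import math


def svmic_gamma(slacks, weights, support_size, p, gamma=0.5):
    """SVMIC_gamma for one candidate model.

    slacks: optimal slack variables xi_i, weights: W_i, support_size: |S|.
    ASSUMPTION: the last term uses log(binomial(p, |S|)), following Chen and
    Chen (2008); this is one reading of the criterion, not a confirmed one.
    """
    n = len(slacks)
    fit = sum(2.0 * w * xi for w, xi in zip(weights, slacks))
    size_penalty = math.log(n) * support_size
    model_space_penalty = 2.0 * gamma * math.log(math.comb(p, support_size))
    return fit + size_penalty + model_space_penalty
```

With γ = 0 and all weights equal to 0.5, the value reduces to the SVMIC of Claeskens et al. (2008). In use, the criterion would be evaluated for the model fitted at each candidate λ and the minimizing λ retained.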

5. Simulation and real data examples

We carry out Monte Carlo studies to evaluate the finite-sample performance of the non-convex penalized SVMs. We compare the performance of SCAD-penalized SVM, MCP-penalized SVM, standard L2 SVM, L1-penalized SVM, adaptively weighted L1-penalized SVM (Zou, 2007) and hybrid Huberized SVM (Wang et al., 2007) (denoted by SCAD-svm, MCP-svm, L2-svm, L1-svm, Adap L1-svm, and Hybrid-svm, respectively) with weight parameter w = 0.5. The main interest here is the ability to identify the relevant covariates and the control of test error when p > n.

5.1. Simulation study

We consider two data generation processes. The first, adapted from Park et al. (2012), is essentially a standard linear discriminant analysis (LDA) setting. The second is related to probit regression.

  • Model 1: Pr(Y = 1) = Pr(Y = −1) = 0.5, X*|(Y = 1) ~ MN(μ, Σ), X*|(Y = −1) ~ MN(−μ, Σ), q = 5, μ = (0.1, 0.2, 0.3, 0.4, 0.5, 0, …, 0)^T ∈ R^p, Σ = (σij) with nonzero elements σii = 1 for i = 1, 2, …, p and σij = ρ = −0.2 for 1 ≤ i ≠ j ≤ q. The Bayes rule is sign(2.67X1 + 2.83X2 + 3X3 + 3.17X4 + 3.33X5) with Bayes error 6.3%.

  • Model 2: X* ~ MN(0p, Σ), Σ = (σij) with nonzero elements σii = 1 for i = 1, 2, …, p and σij = 0.4^|i−j| for 1 ≤ i ≠ j ≤ p, Pr(Y = 1|X*) = Φ((X*)^Tβ*) where Φ(·) is the CDF of the standard normal distribution, β* = (1.1, 1.1, 1.1, 1.1, 0, …, 0)^T, q = 4. The Bayes rule is sign(X1 + X2 + X3 + X4) with Bayes error 10.4%.
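For readers who want to reproduce the second design, the sketch below draws data from Model 2 using only the Python standard library. It relies on two standard facts: a stationary AR(1) recursion with coefficient 0.4 has covariance 0.4^|i−j|, and Pr(Y = 1|X*) = Φ(s) is equivalent to Y = sign(s + ε) with ε standard normal. The function name, the seed handling and the return format are illustrative choices.

```python
import math
import random


def simulate_model2(n, p, rho=0.4, signal=1.1, q=4, seed=0):
    """Draw n observations (X, y) from Model 2 with y in {-1, +1}."""
    rng = random.Random(seed)
    beta = [signal] * q + [0.0] * (p - q)
    innovation_sd = math.sqrt(1.0 - rho ** 2)
    X, y = [], []
    for _ in range(n):
        # AR(1) recursion: unit variances and Cov(X_i, X_j) = rho^|i - j|.
        row = [rng.gauss(0.0, 1.0)]
        for _j in range(1, p):
            row.append(rho * row[-1] + innovation_sd * rng.gauss(0.0, 1.0))
        score = sum(b * x for b, x in zip(beta, row))
        # Probit link: Pr(Y = 1 | X) = Phi(score).
        label = 1 if score + rng.gauss(0.0, 1.0) > 0 else -1
        X.append(row)
        y.append(label)
    return X, y
```

On a large simulated sample, the error rate of the rule sign(X1 + X2 + X3 + X4) should be close to the Bayes error quoted for Model 2.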

We consider different (n, p) settings for each data generation process with p much larger than n. Similarly to Mazumder et al. (2011), an independent tuning dataset of size 10n is generated to tune any regularization parameter for all methods by minimizing the estimated prediction error calculated over the tuning dataset. We also report the performance of the SCAD- and MCP-penalized SVMs using SVMICγ to select the tuning parameter λ. Notice that tuning by a large independent tuning dataset of size 10n approximates the ideal “population tuning”, which is usually not available in practice. By giving all the other methods the best possible tuning, we are controlling the effect of tuning parameter selection and are conservative about the performance of the nonconvex penalized SVMs tuned by SVMICγ. As we will see later, the results of SCAD- and MCP-penalized SVMs using the independent tuning dataset are slightly better than the corresponding results using SVMICγ tuning; and all other methods have no ability to select the correct model exactly, even with an unrealistically good tuning parameter. The range of λ is {2^−6, …, 2^3}. We use a = 3.7 for the SCAD penalty and a = 3 for the MCP as suggested in the literature. We generate an independent test dataset of size n to report the estimated test error. The columns “Signal” and “Noise” summarize the average number of selected relevant and irrelevant covariates, respectively. The numbers in the “Correct” column summarize the percentages of selecting the exactly true model over replications.

Table 1 shows the results of Model 1 for different (n, p) settings. The numbers in parentheses are the corresponding standard errors based on 100 replications. When tuned by using an independent tuning set of size 10n, both SCAD- and MCP-penalized SVMs identify more relevant variables than any other method and they also reduce the number of falsely selected variables dramatically. When tuned by SVMICγ, SCAD- and MCP-penalized SVMs select slightly fewer signals when n = 100, but this is based on the fact that other methods select a much larger model without proper control of noise. A large proportion of the missed relevant covariates are from X1 as it has the weakest signal. Notice that SVMICγ performs almost the same as “population tuning” when n is relatively large. In general, the nonconvex penalized SVMs have an overwhelmingly high probability to select the exact true model as n and p increase, while other methods show very weak, if any at all, ability to recover the exact true model. This is consistent with our theory of asymptotic oracle property of nonconvex penalized SVMs. The test errors of SCAD- and MCP-penalized SVMs are uniformly smaller than those of any other method in all settings, even in the settings with a small sample size n = 100 and tuned by SVMICγ, where they select slightly fewer signals. This is due to the fact that in high dimensional classification problems, a large number of falsely selected variables will greatly blur the prediction power of the relevant variables.

Table 1.

Simulation results for Model 1

Method n p Signal Noise Correct Test Error
SCAD-svm 100 400 4.94(0.03) 0.89(0.19) 64% 8.71%(0.4%)
SCAD-svm 100 800 4.93(0.03) 0.93(0.14) 51% 9.39%(0.4%)
SCAD-svm 200 800 5.00(0.00) 0.09(0.05) 96% 7.20%(0.2%)
SCAD-svm 200 1600 5.00(0.00) 0.07(0.04) 96% 7.24%(0.2%)
MCP-svm 100 400 4.90(0.04) 0.88(0.17) 53% 8.96%(0.4%)
MCP-svm 100 800 4.92(0.03) 1.37(0.20) 40% 10.59%(0.5%)
MCP-svm 200 800 5.00(0.00) 0.06(0.04) 97% 7.30%(0.2%)
MCP-svm 200 1600 5.00(0.00) 0.09(0.03) 92% 6.79%(0.2%)
SCAD-svm(SVMICγ) 100 400 4.64(0.08) 0.48(0.11) 64% 10.32%(0.6%)
SCAD-svm(SVMICγ) 100 800 4.63(0.09) 0.57(0.09) 52% 11.68%(0.7%)
SCAD-svm(SVMICγ) 200 800 5.00(0.00) 0.03(0.02) 97% 7.24%(0.2%)
SCAD-svm(SVMICγ) 200 1600 4.99(0.01) 0.05(0.03) 95% 7.23%(0.2%)
MCP-svm(SVMICγ) 100 400 4.46(0.10) 0.44(0.08) 45% 11.81%(0.6%)
MCP-svm(SVMICγ) 100 800 4.34(0.11) 0.68(0.11) 38% 13.13%(0.7%)
MCP-svm(SVMICγ) 200 800 5.00(0.00) 0.09(0.03) 92% 7.34%(0.2%)
MCP-svm(SVMICγ) 200 1600 5.00(0.00) 0.06(0.03) 95% 7.19%(0.2%)
L1-svm 100 400 4.87(0.05) 32.97(1.47) 0% 16.08%(0.5%)
L1-svm 100 800 4.63(0.07) 44.34(2.18) 0% 19.71%(0.6%)
L1-svm 200 800 5.00(0.00) 21.33(1.70) 0% 9.59%(0.3%)
L1-svm 200 1600 4.99(0.01) 33.37(0.96) 0% 10.88%(0.3%)
Hybrid-svm 100 400 4.78(0.05) 24.74(1.37) 0% 16.34%(0.5%)
Hybrid-svm 100 800 4.62(0.06) 27.16(1.30) 0% 19.93%(0.6%)
Hybrid-svm 200 800 5.00(0.00) 12.86(0.99) 0% 9.93%(0.2%)
Hybrid-svm 200 1600 4.99(0.01) 10.85(0.98) 0% 10.53%(0.3%)
Adap L1-svm 100 400 4.39(0.08) 13.14(0.90) 0% 16.76%(0.5%)
Adap L1-svm 100 800 3.99(0.08) 12.50(0.69) 0% 20.19%(0.6%)
Adap L1-svm 200 800 4.86(0.04) 3.93(0.25) 1% 10.04%(0.3%)
Adap L1-svm 200 1600 4.49(0.06) 1.01(0.09) 4% 13.43%(0.4%)
L2-svm 100 400 5.00(0.00) 395.00(0.00) 0% 39.23%(0.5%)
L2-svm 100 800 5.00(0.00) 795.00(0.00) 0% 42.99%(0.5%)
L2-svm 200 800 5.00(0.00) 795.00(0.00) 0% 39.22%(0.3%)
L2-svm 200 1600 5.00(0.00) 1595.00(0.00) 0% 42.50%(0.4%)

Table 2 shows the results of Model 2 for n = 250 and p = 800. The numbers in parentheses are the corresponding standard errors based on 200 replications. We observe similar performance patterns in terms of both variable selection and prediction error. Due to the higher correlation between signal and noise, in Model 2 it is generally more difficult to select the relevant covariates. Both SCAD- and MCP-penalized SVMs still have reasonable performance in identifying the underlying true model and result in more accurate prediction. Note that under this data generation process the adaptively weighted L1-penalized SVM behaves similarly to nonconvex penalized SVMs, though its oracle property is largely unknown.

Table 2.

Simulation results for Model 2 with n = 250 and p = 800

Method Signal Noise Correct Test Error
SCAD-svm 3.99(0.01) 0.26(0.08) 92.5% 11.4%(0.1%)
MCP-svm 3.99(0.01) 0.17(0.07) 93.5% 11.3%(0.1%)
SCAD-svm(SVMICγ) 3.96(0.02) 0.05(0.02) 94% 11.5%(0.1%)
MCP-svm(SVMICγ) 3.98(0.01) 0.07(0.02) 92.5% 11.4%(0.1%)
L1-svm 4.00(0.00) 6.84(0.42) 7.5% 12.4%(0.1%)
Hybrid-svm 4.00(0.00) 4.03(0.41) 10.5% 11.9%(0.1%)
Adap L1-svm 4.00(0.00) 2.90(0.28) 38% 11.8%(0.1%)
L2-svm 4.00(0.00) 796.00(0.00) 0% 32.5%(0.2%)

5.2. Real data application

We next use a real dataset to illustrate the performance of the nonconvex penalized SVM. This dataset is part of the MicroArray Quality Control (MAQC)-II project, available at the GEO database with accession number GSE20194. It contains 278 patient samples from two classes: 164 with positive estrogen receptor (ER) status and 114 with negative ER status. Each sample is described by 22283 genes.

The original data have been standardized for each predictor. To reduce the computational burden, only the 3000 genes with the largest absolute values of the two-sample t-statistics are used. Such simplification has been considered in Cai and Liu (2011). Though only 3000 genes are used, the classification result is satisfactory. We randomly split the data into an equally balanced training set with 50 samples with positive ER status and 50 samples with negative ER status, and the rest are designated as the test set. As in the simulation study, we use a = 3.7 for the SCAD penalty and a = 3 for the MCP penalty. To get a fair comparison, a 5-fold cross validation is implemented on the training set to select a tuning parameter by a grid search over {2^−15, …, 2^3} for all methods and the test error is calculated on the test data. The above procedure is repeated 100 times.
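The screening step can be sketched as follows. The text does not say which form of the two-sample t-statistic is used, so the unequal-variance (Welch) form below is an assumption, and the function names are illustrative.

```python
import math


def two_sample_t(x_pos, x_neg):
    """Two-sample t-statistic (ASSUMPTION: Welch form with unequal variances)."""
    n1, n2 = len(x_pos), len(x_neg)
    m1, m2 = sum(x_pos) / n1, sum(x_neg) / n2
    v1 = sum((v - m1) ** 2 for v in x_pos) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x_neg) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)


def screen_top_k(X, y, k):
    """Indices of the k columns of X with the largest |t|; y has labels in {-1, +1}."""
    p = len(X[0])
    scores = []
    for j in range(p):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == -1]
        scores.append(abs(two_sample_t(pos, neg)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]
```

In the setting described above, `k` would be 3000 and the columns of `X` would be the 22283 standardized gene expression values.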

Table 3 summarizes the average classification error and number of selected genes. The numbers in parentheses are the corresponding standard errors based on 100 replications. Nonconvex penalized SVMs achieve significantly lower test error than all the other methods except for the doubly penalized hybrid SVM. Although the doubly penalized hybrid SVM performs similarly to SCAD- and MCP-penalized SVMs in terms of test error, it selects a much more complex model in general. In addition, the number of genes selected by nonconvex penalized SVMs is stable, while the model size selected by hybrid SVM ranges from 102 genes to 2576 genes across the 100 replications. Such stability is desirable, so that the procedure is robust to the random partition of the data. The numerical results confirm that SCAD- and MCP-penalized SVMs can achieve both promising prediction power and excellent gene selection ability.

Table 3.

Classification error of MAQC-II dataset

Method Test error Genes
SCAD-svm 9.8%(0.2%) 2.06(0.43)
MCP-svm 9.6%(0.2%) 1.04(0.02)
L1-svm 10.9%(0.2%) 28.74(1.36)
Adap L1-svm 13.1%(0.2%) 34.30(1.03)
Hybrid-svm 10.0%(0.1%) 1391.60(94.86)
L2-svm 10.8%(0.2%) 3000.00(0.00)

6. Discussion

In this article we study the nonconvex penalized SVMs with a diverging number of covariates in terms of variable selection. When the true model is sparse, under some regularity conditions, we prove that it enjoys the oracle property. That is, one of the local minimizers of the nonconvex penalized SVM behaves like the oracle estimator as if the true sparsity is known in advance and only the relevant variables are used to form the decision boundary. We also show that as long as we have an appropriate initial estimator, we can identify the oracle estimator with probability tending to one.

6.1. Connection to Bayes rule

In this paper, the true model and the oracle property are built on β0, which is the minimizer of the population version of the hinge loss. This definition has a strong connection to the Bayes rule, which is theoretically optimal if the underlying distribution is known. In the equal-weight case (w = 1/2), the Bayes rule is given by sign(X^TβBayes) with βBayes = arg minβ E{I(sign(X^Tβ) ≠ Y)}. To appreciate the connection, we first note that βBayes and β0 are equivalent to each other in the important special case of Fisher linear discriminant analysis. Indeed, consider an informative example setting with π+ = π− = 1/2, X*|(Y = +1) ~ N(μ+, Σ) and X*|(Y = −1) ~ N(μ−, Σ), where μ+ and μ− denote different mean vectors for the two classes and Σ a common variance-covariance matrix. It is known that in this case the Bayes rule boundary is given by

$$(\mu_+ - \mu_-)^T\Sigma^{-1}\{x - \tfrac{1}{2}(\mu_+ + \mu_-)\} = 0.$$

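Note that β0 as the minimizer of the population hinge loss satisfies the gradient condition

$$S(\beta_0) = -E\{I(1 - YX^T\beta_0 \ge 0)\,YX\} = 0,$$

which is equivalent to the following equations:

$$\Pr(1 - X^T\beta_0 \ge 0 \mid Y = +1) = \Pr(1 + X^T\beta_0 \ge 0 \mid Y = -1), \qquad E\{I(1 - X^T\beta_0 \ge 0)\,X^* \mid Y = +1\} = E\{I(1 + X^T\beta_0 \ge 0)\,X^* \mid Y = -1\}. \qquad (7)$$

For any $\beta^*_{0,\perp}$ that satisfies $(\beta^*_0)^T\Sigma\beta^*_{0,\perp} = 0$, $(X^*)^T\beta^*_0$ and $(X^*)^T\beta^*_{0,\perp}$ are conditionally independent given Y and thus we can decompose the conditional expectation in (7) into two parts. It can be seen from (7) that

$$\beta_{00} = -\tfrac{1}{2}(\beta^*_0)^T(\mu_+ + \mu_-), \qquad (\mu_+ - \mu_-)^T\beta^*_{0,\perp} = 0 \quad \forall\, \beta^*_{0,\perp} \text{ satisfying } (\beta^*_0)^T\Sigma\beta^*_{0,\perp} = 0.$$

That is, $(\mu_+ - \mu_-)$ lies in the space spanned by $\Sigma\beta^*_0$. The decision boundary defined by the true value is then

$$x^T\beta_0 = C(\mu_+ - \mu_-)^T\Sigma^{-1}\{x - \tfrac{1}{2}(\mu_+ + \mu_-)\} = 0$$

for some constant C. Therefore, the Bayes rule is equivalent to β0.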
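As a small numerical illustration of the linear discriminant boundary, the sketch below evaluates Σ^{-1}(μ+ − μ−) for the five relevant covariates of Model 1 in Section 5.1 (μ+ = μ, μ− = −μ, off-diagonal correlation −0.2). It also computes the corresponding Bayes error using the standard result Φ(−√(μ^TΣ^{-1}μ)) for two equiprobable Gaussian classes with common covariance. The helper `solve` is a plain Gauss-Jordan routine written for this example only.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


mu = [0.1, 0.2, 0.3, 0.4, 0.5]
Sigma = [[1.0 if i == j else -0.2 for j in range(5)] for i in range(5)]

# Direction of the Bayes rule: Sigma^{-1} (mu_plus - mu_minus) with mu_minus = -mu.
direction = solve(Sigma, [2.0 * m for m in mu])

# mu' Sigma^{-1} mu = mu' direction / 2; Bayes error = Phi(-sqrt(mu' Sigma^{-1} mu)).
mahalanobis = math.sqrt(sum(m * d for m, d in zip(mu, direction)) / 2.0)
bayes_error = 0.5 * math.erfc(mahalanobis / math.sqrt(2.0))
```

Because μ+ + μ− = 0 in Model 1, the boundary passes through the origin, so `direction` alone determines the rule; its entries and `bayes_error` can be compared with the coefficients and the Bayes error quoted for Model 1.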

In more general settings, βBayes and β0 may not be the same. However, Lin (2000) showed that the nonlinear SVM approaches the Bayes rule in a direct fashion, and its expected misclassification rate quickly converges to that of the Bayes rule even though its extension to linear SVM is largely unknown. Furthermore, denote R(f) and R0(f) to be the risk in terms of the 0–1 loss and hinge loss, respectively, for any measurable f; that is, R(f) = E{I(sign(f(X)) ≠ Y)} and R0(f) = E{(1 − Y f(X))+}. It is known that minimizing R(f) directly is very difficult because minimizing the empirical 0–1 loss is infeasible in practice (Bartlett et al., 2006). Instead, we can always shift the target from the 0–1 loss to a convex surrogate such as the hinge loss. Assume that the minimizers of R(f) and R0(f) are both linear functions, and by definitions they are X^TβBayes and X^Tβ0, respectively. By Theorem 1 of Bartlett et al. (2006), we have the optimal excess risk upper bound

$$R(X^T\beta) - R(X^T\beta_{\mathrm{Bayes}}) \le R_0(X^T\beta) - R_0(X^T\beta_0)$$

for any β. Hence pursuing oracle property on β0 has the potential to efficiently control the excess risk. As can be seen in this paper, the main advantages of working with the hinge loss instead of the 0–1 loss are the theoretical tractability and convenience in practical implementation.
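The excess risk inequality can be checked numerically on a toy problem. The sketch below is not the linear setting of the paper: it assumes a covariate taking finitely many values with known η(x) = Pr(Y = 1|x) and lets f be an arbitrary score at each value, in which case the minimal conditional 0–1 and hinge risks are min(η, 1 − η) and 2 min(η, 1 − η). All names are illustrative.

```python
import random


def excess_risks(px, eta, f):
    """Return (excess 0-1 risk, excess hinge risk) of scores f on a finite X.

    px[k]: Pr(X = x_k), eta[k]: Pr(Y = 1 | X = x_k), f[k]: score at x_k.
    """
    ex01 = 0.0
    ex_hinge = 0.0
    for p, e, s in zip(px, eta, f):
        pred = 1 if s >= 0 else -1
        risk01 = (1.0 - e) if pred == 1 else e
        hinge = e * max(1.0 - s, 0.0) + (1.0 - e) * max(1.0 + s, 0.0)
        ex01 += p * (risk01 - min(e, 1.0 - e))
        ex_hinge += p * (hinge - 2.0 * min(e, 1.0 - e))
    return ex01, ex_hinge


rng = random.Random(0)
px = [0.2, 0.3, 0.5]
eta = [0.9, 0.4, 0.5]
# The excess 0-1 risk never exceeds the excess hinge risk for any score vector.
violations = 0
for _ in range(1000):
    f = [rng.uniform(-3.0, 3.0) for _k in px]
    r01, rh = excess_risks(px, eta, f)
    if r01 > rh + 1e-12:
        violations += 1
```

Under these assumptions `violations` stays at zero, which is the finite-support analogue of the bound displayed above.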

6.2. Other issues

As one referee pointed out, the objective function (2) in the definition of our oracle estimator is piecewise linear and may have multiple minimizers. The same issue applies to the L1-penalized SVM and the nonconvex penalized SVM. Based on our theoretical development, non-uniqueness of the minimizer of (2) is not essential. When the minimizer is not unique, our theoretical results still hold for any particular minimizer. In this sense, we can first use the nonconvex penalized SVM to identify important predictors. In the next step, to obtain a unique classifier, a refitting can be applied by using the standard L2 penalized SVM on those identified important predictors. For Model 1 in Section 5.1, we considered this refitting. This additional refitting step does not lead to much improvement: it reduces the average test errors in some settings but not in others. Thus the refitting result is not reported here.

An alternative approach to deal with this non-uniqueness is to consider a joint penalty by using both a nonconvex penalty and a standard L2 penalty. The objective function then becomes

$$n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + \sum_{j=1}^{p_n} p_{\lambda_{1n}}(|\beta_j|) + \sum_{j=1}^{p_n}\lambda_{2n}\beta_j^2$$

for two different tuning parameters λ1n and λ2n. The corresponding oracle estimator is then defined as the minimizer of the objective function for the standard L2 SVM using only the relevant covariates. One advantage of this joint penalty formulation over the method proposed in this paper is that the uniqueness of the oracle estimator is guaranteed in the finite sample case. However, it involves simultaneously selecting two tuning parameters, and this may not be convenient in practice. We conduct a simple numerical experiment using Model 1 in Section 5.1 with n = 200 and p = 600 or 800. The simulation results are summarized in Table 4. As shown in Table 4, our numerical example suggests that the performance of this joint penalty method is similar to the approach proposed in this paper.

Table 4.

Comparison between SCAD and joint penalized SVMs using Model 1

Method p Signal Noise Correct Test Error
SCAD-svm 600 5.00(0.00) 0.17(0.07) 93% 7.04%(0.2%)
SCAD-svm 800 5.00(0.00) 0.13(0.06) 93% 7.25%(0.2%)
Joint SCAD+L2-svm 600 5.00(0.00) 1.22(0.28) 65% 7.12%(0.2%)
Joint SCAD+L2-svm 800 5.00(0.00) 2.64(0.62) 50% 7.10%(0.2%)

Several issues remain unsolved. In this article we only study the SVMs in nonseparable cases in the limit. Although the nonseparable cases are important in practical applications, it would be interesting to show similar results for separable cases. The asymptotic analysis of separable cases requires the positiveness of the limit of the regularization term, which is different from the analysis in this article. Another issue is the availability of an appropriate initial estimator in ultra-high dimensions. Our empirical studies suggest that the L1-penalized SVM provides a reasonable initial estimator and the LLA algorithm converges very quickly even for cases with p ≫ n. However, this still lacks theoretical justification since our Theorem 3.4 only provides theoretical support in moderately high dimensions with p = o(√n). One could try to extend the work of Bickel et al. (2009) by assuming similar types of restricted eigenvalue conditions. This extension would require new techniques because both the loss function and the penalty are nondifferentiable and the nonsmooth locations are different in L1-penalized SVM, whereas the setup in Bickel et al. (2009) is a smooth loss function with a nonsmooth penalty.

Acknowledgments

We thank the co-Editors Professor Gareth Roberts and Professor Piotr Fryzlewicz, the Associate Editor and three referees for very constructive comments and suggestions which have improved the presentation of the paper. We also thank Amanda Applegate for her help on the presentation of this paper. The research is partially supported by National Science Foundation grants DMS-1055210 (Wu) and DMS-1308960 (Wang) and National Institutes of Health grants R01-CA149569 (Zhang and Wu), P01-CA142538 (Wu), P50-DA10075 (Li) and P50-DA036107 (Li).

7. Appendix

We first prove Lemma 3.1.

Proof (Proof of Lemma 3.1)

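Let $l(\beta_1) = n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iZ_i^T\beta_1)_+$. Note that $\hat\beta_1 = \arg\min_{\beta_1} l(\beta_1)$. We will show that for any $\eta > 0$, there exists a constant $\Delta$ such that for all n sufficiently large, $\Pr\{\inf_{\|u\| = \Delta} l(\beta_{01} + \sqrt{q/n}\,u) > l(\beta_{01})\} \ge 1 - \eta$. Because $l(\beta_1)$ is convex, with probability at least $1 - \eta$, $\hat\beta_1$ is in the ball $\{\beta_1 : \|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}\}$. Denote $\Lambda_n(u) = nq^{-1}\{l(\beta_{01} + \sqrt{q/n}\,u) - l(\beta_{01})\}$. Observe that $E\{\Lambda_n(u)\} = nq^{-1}\{L(\beta_{01} + \sqrt{q/n}\,u) - L(\beta_{01})\}$. Recall also that $\beta_0 = \arg\min_{\beta} E\{W(1 - YX^T\beta)_+\}$. If we restrict the last $p - q$ elements to be 0, it can be easily seen that $\beta_{01} = \arg\min_{\beta_1} E\{W(1 - YZ^T\beta_1)_+\} = \arg\min_{\beta_1} L(\beta_1)$, thus $S(\beta_{01}) = 0$. By Taylor series expansion of $L(\beta_1)$ around $\beta_{01}$, we have $E\{\Lambda_n(u)\} = \tfrac{1}{2}u^TH(\tilde\beta)u + o_p(1)$, where $\tilde\beta = \beta_{01} + \sqrt{q/n}\,tu$ for some $0 < t < 1$. As shown in Koo et al. (2008), for $0 \le j, k \le q$, the (j, k)-th element of the Hessian matrix $H(\beta_{01})$ is continuous given (A1) and (A2); thus $H(\beta)$ is continuous. By continuity of $H(\beta)$ at $\beta_{01}$, $\tfrac{1}{2}u^TH(\tilde\beta)u = \tfrac{1}{2}u^TH(\beta_{01})u + o(1)$ as $n \to \infty$. Define $W_n = -\sum_{i=1}^{n}\zeta_iW_iY_iZ_i$ where $\zeta_i = I(1 - Y_iZ_i^T\beta_{01} \ge 0)$. Recall that $S(\beta_{01}) = -E[\zeta_iW_iY_iZ_i] = 0$. If we define

$$R_{i,n}(u) = W_i\{1 - Y_iZ_i^T(\beta_{01} + \sqrt{q/n}\,u)\}_+ - W_i(1 - Y_iZ_i^T\beta_{01})_+ + \zeta_iW_iY_iZ_i^T\sqrt{q/n}\,u$$

then we have

$$\Lambda_n(u) = E\{\Lambda_n(u)\} + W_n^Tu/\sqrt{qn} + q^{-1}\sum_{i=1}^{n}\big[R_{i,n}(u) - E\{R_{i,n}(u)\}\big]. \qquad (8)$$

Then similar to Equation (28) in Koo et al. (2008) we have

$$q^{-2}\sum_{i=1}^{n} E\big[|R_{i,n}(u) - E\{R_{i,n}(u)\}|^2\big] \le C\Delta^2 E\Big\{q^{-1}(1 + \|Z\|^2)\,U\big(\sqrt{1 + \|Z\|^2}\,\Delta\sqrt{q/n}\big)\Big\},$$

where $U(t) = I(|1 - Y_iZ_i^T\beta_{01}| < t)$. (A2) implies that $E\{q^{-1}(1 + \|Z\|^2)\} < \infty$. Hence, for any $\varepsilon > 0$, we can choose a positive constant C such that $E[q^{-1}(1 + \|Z\|^2)I\{q^{-1}(1 + \|Z\|^2) > C\}] < \varepsilon/2$, then

$$E\Big\{q^{-1}(1 + \|Z\|^2)\,U\big(\sqrt{1 + \|Z\|^2}\,\Delta\sqrt{q/n}\big)\Big\} \le E\big[q^{-1}(1 + \|Z\|^2)I\{q^{-1}(1 + \|Z\|^2) > C\}\big] + C\Pr\big(|1 - Y_iZ_i^T\beta_{01}| < \sqrt{C}\,\Delta q/\sqrt{n}\big).$$

We can take a large N such that $\Pr(|1 - Y_iZ_i^T\beta_{01}| < \sqrt{C}\,\Delta q/\sqrt{n}) < \varepsilon/(2C)$ for all $n > N$ by (A4). This proves that $q^{-2}\sum_{i=1}^{n} E\{|R_{i,n}(u) - E[R_{i,n}(u)]|^2\} \to 0$ as $n \to \infty$. Observe that $E(W_n^Tu/\sqrt{qn}) = 0$, and

$$\mathrm{Var}(W_n^Tu/\sqrt{qn}) \le Cn^{-1}q^{-1}\sum_{i=1}^{n}(Z_i^Tu)^2 \le Cq^{-1}\lambda_{\max}(n^{-1}X_A^TX_A)\|u\|^2 \to 0$$

as $n \to \infty$. Therefore, the first term of (8) will dominate other terms as $n \to \infty$. By (A6) we have $\tfrac{1}{2}u^TH(\beta_{01})u > 0$. Thus we can choose a sufficiently large $\Delta$ such that $\Lambda_n(u) > 0$ with probability $1 - \eta$ for $\|u\| = \Delta$ and all sufficiently large n.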

The proof of Theorem 3.1 relies on the following Lemmas.

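Lemma 7.1

$$\Pr\Big\{\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)\Big| > \lambda/2\Big\} \to 0 \quad \text{as } n \to \infty.$$

Proof

Recall that $E\{W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)\} = 0$. By (A5) and Lemma 14.9 of Bühlmann and Van De Geer (2011), we have $\Pr\{|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)| > \lambda/2\} \le \exp(-Cn\lambda^2)$. Note that

$$\Pr\Big\{\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)\Big| > \lambda/2\Big\} = \Pr\Big\{\bigcup_{q+1 \le j \le p}\Big\{\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)\Big| > \lambda/2\Big\}\Big\} \le p\exp(-Cn\lambda^2) \to 0$$

as $n \to \infty$ by the fact that $\log(p) = o(n\lambda^2)$.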

Lemma 7.2

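For any Δ > 0,

$$\Pr\Big\{\max_{q+1 \le j \le p}\ \sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|\sum_{i=1}^{n} W_iY_iX_{ij}\big[I(1 - Y_iZ_i^T\beta_1 \ge 0) - I(1 - Y_iZ_i^T\beta_{01} \ge 0) - \Pr(1 - Y_iZ_i^T\beta_1 \ge 0) + \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big]\Big| > n\lambda\Big\} \to 0$$

as n → ∞.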

Proof

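We generalize an approach by Welsh (1989). We cover the ball $\{\beta_1 : \|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}\}$ with a net of balls with radius $\Delta\sqrt{q/n^5}$. It can be shown that this net can be constructed with cardinality $N \le d\,n^{4q}$ for some $d > 0$. Denote the N balls by $B(t_1), \ldots, B(t_N)$, where $t_k$, $k = 1, \ldots, N$, are the centers. Denote $\kappa_i(\beta_1) = 1 - Y_iZ_i^T\beta_1$, and

$$J_{nj1} = \sum_{k=1}^{N}\Pr\Big(\Big|\sum_{i=1}^{n} W_iY_iX_{ij}\big[I\{\kappa_i(t_k) \ge 0\} - I\{\kappa_i(\beta_{01}) \ge 0\} - \Pr\{\kappa_i(t_k) \ge 0\} + \Pr\{\kappa_i(\beta_{01}) \ge 0\}\big]\Big| > n\lambda/2\Big),$$

$$J_{nj2} = \sum_{k=1}^{N}\Pr\Big(\sup_{\tilde\beta_1 \in B(t_k)}\Big|\sum_{i=1}^{n} W_iY_iX_{ij}\big[I\{\kappa_i(\tilde\beta_1) \ge 0\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(\tilde\beta_1) \ge 0\} + \Pr\{\kappa_i(t_k) \ge 0\}\big]\Big| > n\lambda/2\Big).$$

Then by (A5),

$$\Pr\Big(\sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|\sum_{i=1}^{n} W_iY_iX_{ij}\big[I\{\kappa_i(\beta_1) \ge 0\} - I\{\kappa_i(\beta_{01}) \ge 0\} - \Pr\{\kappa_i(\beta_1) \ge 0\} + \Pr\{\kappa_i(\beta_{01}) \ge 0\}\big]\Big| > n\lambda\Big) \le J_{nj1} + J_{nj2}.$$

To evaluate $J_{nj1}$, let $U_i = W_iY_iX_{ij}[I\{\kappa_i(t_k) \ge 0\} - I\{\kappa_i(\beta_{01}) \ge 0\} - \Pr\{\kappa_i(t_k) \ge 0\} + \Pr\{\kappa_i(\beta_{01}) \ge 0\}]$. The $U_i$ are independent mean-zero random variables, and $\mathrm{Var}(U_i) = E(U_i^2) = E(U_i^2 \mid Y_i = 1)\Pr(Y_i = 1) + E(U_i^2 \mid Y_i = -1)\Pr(Y_i = -1)$. Denote by F and G the CDFs of the conditional distribution of $Z^T\beta_{01}$ given $Y = +1$ and $Y = -1$, respectively. Observe that

$$E(U_i^2 \mid Y_i = 1) \le C\big\{F_i(1 + Z_i^T(\beta_{01} - t_k))\big(1 - F_i(1 + Z_i^T(\beta_{01} - t_k))\big) + F_i(1)(1 - F_i(1)) - 2F_i\big(\min(1 + Z_i^T(\beta_{01} - t_k), 1)\big) + 2F_i(1)F_i(1 + Z_i^T(\beta_{01} - t_k))\big\} \le C|Z_i^T(t_k - \beta_{01})|,$$

and it follows by (A8) that

$$E(U_i^2 \mid Y_i = -1) \le C\big\{G_i(-1 + Z_i^T(\beta_{01} - t_k))\big(1 - G_i(-1 + Z_i^T(\beta_{01} - t_k))\big) + G_i(-1)(1 - G_i(-1)) - 2\big(1 - G_i(\max(-1 + Z_i^T(\beta_{01} - t_k), -1))\big) + 2(1 - G_i(-1))\big(1 - G_i(-1 + Z_i^T(\beta_{01} - t_k))\big)\big\} \le C|Z_i^T(t_k - \beta_{01})|.$$

Thus we have

$$\sum_{i=1}^{n}\mathrm{Var}(U_i) \le nC\max_i\|Z_i\|\,\|t_k - \beta_{01}\| = n\,O\big(\sqrt{q\log(n)}\big)\,O\big(\sqrt{q/n}\big) = O\big(q\sqrt{n\log(n)}\big).$$

Applying Lemma 14.9 of Bühlmann and Van De Geer (2011), for some positive constants $C_1$ and $C_2$ under the assumptions on the rate of λ,

$$J_{nj1} \le 2N\exp\Big(-\frac{n^2\lambda^2/4}{C_1 q\sqrt{n\log(n)} + C_2 n\lambda}\Big) \le C\exp\{4q\log(n) - Cn\lambda\}. \qquad (9)$$

To evaluate $J_{nj2}$, note that $I(x \ge s)$ is decreasing in s. Denote

$$V_i = \big[I\{\kappa_i(\tilde\beta_1) \ge 0\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(\tilde\beta_1) \ge 0\} + \Pr\{\kappa_i(t_k) \ge 0\}\big].$$

We have $-B_i \le V_i \le A_i$ for any $\tilde\beta_1 \in B(t_k)$, where

$$A_i = \big[I\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(t_k) \ge \Delta\sqrt{q/n^5}\} + \Pr\{\kappa_i(t_k) \ge 0\}\big],$$

$$B_i = \big[I\{\kappa_i(t_k) \ge 0\} - I\{\kappa_i(t_k) \ge \Delta\sqrt{q/n^5}\} - \Pr\{\kappa_i(t_k) \ge 0\} + \Pr\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\}\big].$$

Therefore, we have

$$\Pr\Big(\sup_{\tilde\beta_1 \in B(t_k)}\Big|\sum_{i=1}^{n} W_iY_iX_{ij}\big[I\{\kappa_i(\tilde\beta_1) \ge 0\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(\tilde\beta_1) \ge 0\} + \Pr\{\kappa_i(t_k) \ge 0\}\big]\Big| > n\lambda/2\Big) \le \Pr\Big(C\max_i|X_{ij}|\sup_{\tilde\beta_1 \in B(t_k)}\Big|\sum_{i=1}^{n} V_i\Big| > n\lambda/2\Big) \le \Pr\Big\{C\max_i|X_{ij}|\max\Big(\sum_{i=1}^{n} A_i, \sum_{i=1}^{n} B_i\Big) > n\lambda/2\Big\}$$

by the fact that $A_i > 0$, $B_i > 0$. Note that

$$\sum_{i=1}^{n} A_i = \sum_{i=1}^{n}\big[I\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} + \Pr\{\kappa_i(t_k) \ge 0\}\big] + \sum_{i=1}^{n}\big[\Pr\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - \Pr\{\kappa_i(t_k) \ge \Delta\sqrt{q/n^5}\}\big]$$

and

$$\sum_{i=1}^{n}\big[\Pr\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - \Pr\{\kappa_i(t_k) \ge \Delta\sqrt{q/n^5}\}\big] = \sum_{i=1}^{n}\Big\{\big[F_i(1 + \Delta\sqrt{q/n^5} - Z_i^T(\beta_{01} - t_k)) - F_i(1 - \Delta\sqrt{q/n^5} - Z_i^T(\beta_{01} - t_k))\big]\Pr(Y_i = 1) + \big[G_i(-1 + \Delta\sqrt{q/n^5} - Z_i^T(\beta_{01} - t_k)) - G_i(-1 - \Delta\sqrt{q/n^5} - Z_i^T(\beta_{01} - t_k))\big]\Pr(Y_i = -1)\Big\} \le Cn\log(q)\sqrt{q/n^5}\sqrt{q} = C\log(q)\,q\,n^{-3/2}$$

by (A8). Denote

$$O_i = \big[I\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - I\{\kappa_i(t_k) \ge 0\} - \Pr\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} + \Pr\{\kappa_i(t_k) \ge 0\}\big].$$

Thus for sufficiently large n by $\lambda = o(n^{-(1-c_2)/2})$ and (A7), we have

$$\sum_{k=1}^{N}\Pr\Big(C\sum_{i=1}^{n} A_i > n\lambda/2\Big) \le \sum_{k=1}^{N}\Pr\Big(C\sum_{i=1}^{n} O_i > n\lambda/2 - C\log(q)\,q\,n^{-3/2}\Big) \le \sum_{k=1}^{N}\Pr\Big(C\sum_{i=1}^{n} O_i > n\lambda/4\Big).$$

Notice that the $O_i$ are independent mean-zero random variables, and

$$E(O_i^2) = E\big[I\{\kappa_i(t_k) \ge -\Delta\sqrt{q/n^5}\} - I\{\kappa_i(t_k) > 0\}\big]^2 \le C\sqrt{q/n^5}\max_i\|Z_i\| = Cq\sqrt{\log(n)}\,n^{-5/2},$$

using a similar idea to deriving the upper bound of $E(U_i^2)$. Applying Bernstein’s inequality and the fact that $\max_i|X_{ij}| = O_p(\sqrt{\log(n)})$ for sub-Gaussian random variables, for some positive constants $C_1$ and $C_2$,

$$\sum_{k=1}^{N}\Pr\Big(C\max_i|X_{ij}|\sum_{i=1}^{n} A_i > n\lambda/2\Big) \le N\exp\Big(-\frac{n^2\lambda^2/4}{C_1 q\,n^{-3/2}\log(n)^{3/2} + C_2 n\lambda}\Big) \le C\exp\{4q\log(n) - Cn\lambda\}.$$

Similarly, we can prove that $\sum_{k=1}^{N}\Pr(C\max_i|X_{ij}|\sum_{i=1}^{n} B_i > n\lambda/2) \le C\exp\{4q\log(n) - Cn\lambda\}$. Therefore, we have

$$J_{nj2} \le C\exp\{4q\log(n) - Cn\lambda\}. \qquad (10)$$

Using (9) and (10), the probability in Lemma 7.2 is bounded by

$$\sum_{j=q+1}^{p}(J_{nj1} + J_{nj2}) \le C\exp\{\log(p) + 4q\log(n) - Cn\lambda\} \to 0 \qquad (11)$$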

which completes the proof.

Now we prove Theorem 3.1.

Proof (Proof of Theorem 3.1)

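The unpenalized hinge loss objective function is convex. By convex optimization theory, there exist $v_i^*$ such that $s_j(\hat\beta) = 0$, $j = 0, 1, \ldots, q$, with $v_i = v_i^*$.

Note that $\min_{1 \le j \le q}|\hat\beta_j| \ge \min_{1 \le j \le q}|\beta_{0j}| - \max_{1 \le j \le q}|\hat\beta_j - \beta_{0j}|$. By (A7) we have $n^{(1-c_2)/2}\min_{1 \le j \le q_n}|\beta_{0j}| \ge M_1$, and $\max_{1 \le j \le q}|\hat\beta_j - \beta_{0j}| = O_p(\sqrt{q/n})$ by Lemma 3.1. Thus we have $\min_{1 \le j \le q}|\hat\beta_j| = O_p(n^{-(1-c_2)/2})$. By $\lambda = o(n^{-(1-c_2)/2})$, we have $\Pr(|\hat\beta_j| \ge (a + \tfrac{1}{2})\lambda) \to 1$, for $j = 0, 1, \ldots, q$.

By the definition of the oracle estimator, we have $|\hat\beta_j| = 0$, $j = q + 1, \ldots, p$. It suffices to show that $\Pr\{|s_j(\hat\beta)| > \lambda \text{ for some } j = q + 1, \ldots, p\} \to 0$. Let $D = \{i : 1 - Y_iZ_i^T\hat\beta_1 = 0\}$; then for $j = q + 1, \ldots, p$, we have

$$s_j(\hat\beta) = -n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\hat\beta_1 \ge 0) - n^{-1}\sum_{i \in D} W_iY_iX_{ij}(v_i^* - 1),$$

where $-1 \le v_i \le 0$ if $i \in D$ and $v_i = 0$ otherwise. By (A5), $(Z_i, Y_i)$ are in general position, so with probability one there are exactly $(q + 1)$ elements in D. Then by (A4), with probability one $|n^{-1}\sum_{i \in D} W_iY_iX_{ij}(v_i^* - 1)| = O(qn^{-1}\log(q)) = o(\lambda)$. Thus we only need to show that $\Pr\{\max_{q+1 \le j \le p}|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\hat\beta_1 \ge 0)| > \lambda\} \to 0$. Observe that

$$\Pr\Big\{\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\hat\beta_1 \ge 0)\Big| > \lambda\Big\} \le \Pr\Big\{\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big[I(1 - Y_iZ_i^T\hat\beta_1 \ge 0) - I(1 - Y_iZ_i^T\beta_{01} \ge 0)\big]\Big| > \frac{\lambda}{2}\Big\} + \Pr\Big\{\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}I(1 - Y_iZ_i^T\beta_{01} \ge 0)\Big| > \frac{\lambda}{2}\Big\}. \qquad (12)$$

By Lemma 7.1 the second term of (12) is $o_p(1)$. Notice that from Lemma 3.1, the first term of (12) is bounded by

$$\Pr\Big[\max_{q+1 \le j \le p}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big\{I(1 - Y_iZ_i^T\hat\beta_1 \ge 0) - I(1 - Y_iZ_i^T\beta_{01} \ge 0)\big\}\Big| > \frac{\lambda}{2}\Big] \le \Pr\Big[\max_{q+1 \le j \le p}\ \sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big\{I(1 - Y_iZ_i^T\beta_1 \ge 0) - I(1 - Y_iZ_i^T\beta_{01} \ge 0) - \Pr(1 - Y_iZ_i^T\beta_1 \ge 0) + \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big\}\Big| > \frac{\lambda}{4}\Big] + \Pr\Big[\max_{q+1 \le j \le p}\ \sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big\{\Pr(1 - Y_iZ_i^T\beta_1 \ge 0) - \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big\}\Big| > \frac{\lambda}{4}\Big]. \qquad (13)$$

By Lemma 7.2, the first term of (13) is $o_p(1)$. Thus we only need to bound the second term of (13). Notice that

$$\big|\Pr(1 - Y_iZ_i^T\beta_1 \ge 0) - \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big| \le \big|F_i(1 + Z_i^T(\beta_1 - \beta_{01})) - F_i(1)\big|\Pr(Y_i = 1) + \big|G_i(-1 + Z_i^T(\beta_1 - \beta_{01})) - G_i(-1)\big|\Pr(Y_i = -1).$$

Then we have

$$\max_{q+1 \le j \le p}\ \sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big\{\Pr(1 - Y_iZ_i^T\beta_1 \ge 0) - \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big\}\Big| \le C\max_{i,j}|X_{ij}|\sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}} n^{-1}\sum_{i=1}^{n}\|Z_i\|\,\|\beta_1 - \beta_{01}\| = O_p\big(\sqrt{\log(pn)}\big)\,O\big(\sqrt{q/n}\big)\,O_p\big(\sqrt{q\log(n)}\big) = o_p(\lambda).$$

Thus

$$\Pr\Big[\max_{q+1 \le j \le p}\ \sup_{\|\beta_1 - \beta_{01}\| \le \Delta\sqrt{q/n}}\Big|n^{-1}\sum_{i=1}^{n} W_iY_iX_{ij}\big\{\Pr(1 - Y_iZ_i^T\beta_1 \ge 0) - \Pr(1 - Y_iZ_i^T\beta_{01} \ge 0)\big\}\Big| > \frac{\lambda}{4}\Big] = o_p(1),$$

which completes the proof.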

Now we prove Theorem 3.2.

Proof (Proof of Theorem 3.2)

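We will show $\hat\beta$ is a local minimizer of $Q(\beta)$ by writing $Q(\beta)$ as $g(\beta) - h(\beta)$.

By Theorem 3.1, we have $\Pr\{\mathcal{G} \subseteq \partial g(\hat\beta)\} \to 1$, where

$$\mathcal{G} = \big\{\xi = (\xi_0, \ldots, \xi_p) : \xi_0 = 0;\ \xi_j = \lambda\,\mathrm{sgn}(\hat\beta_j),\ j = 1, \ldots, q;\ \xi_j = s_j(\hat\beta) + \lambda l_j,\ j = q + 1, \ldots, p\big\},$$

where $l_j \in [-1, +1]$, $j = q + 1, \ldots, p$.

Consider any $\beta$ in a ball in $\mathbb{R}^{p+1}$ with center $\hat\beta$ and radius $\lambda/2$. It suffices to show that there exists $\xi^* \in \mathcal{G}$ such that $\Pr\{\xi_j^* = \partial h(\beta)/\partial\beta_j\} \to 1$ as $n \to \infty$.

Since $\partial h(\beta)/\partial\beta_0 = 0$, we have $\xi_0^* = \partial h(\beta)/\partial\beta_0$.

For $j = 1, \ldots, q$, we have $\min_{1 \le j \le q}|\beta_j| \ge \min_{1 \le j \le q}|\hat\beta_j| - \max_{1 \le j \le q}|\hat\beta_j - \beta_j| \ge (a + \tfrac{1}{2})\lambda - \lambda/2 = a\lambda$ with probability tending to one by Theorem 3.1. Therefore by Condition 2 of the class of penalties $\Pr\{\partial h(\beta)/\partial\beta_j = \lambda\,\mathrm{sgn}(\beta_j)\} \to 1$ for $j = 1, \ldots, q$. For sufficiently large n, $\mathrm{sgn}(\beta_j) = \mathrm{sgn}(\hat\beta_j)$. Thus we have $\Pr\{\xi_j^* = \partial h(\beta)/\partial\beta_j\} \to 1$ as $n \to \infty$ for $j = 1, \ldots, q$.

For $j = q + 1, \ldots, p$, we have $\Pr\{|\beta_j| \le |\hat\beta_j| + |\beta_j - \hat\beta_j| \le \lambda\} \to 1$ by Theorem 3.1. Therefore we have $\Pr\{\partial h(\beta)/\partial\beta_j = 0\} \to 1$ for SCAD and $\Pr\{\partial h(\beta)/\partial\beta_j = -\beta_j/a\} \to 1$ for MCP. Observe that by Condition 2 we have $\Pr\{|\partial h(\beta)/\partial\beta_j| \le \lambda\} \to 1$ for the class of penalties. By Lemma 1 we have $\Pr\{|s_j(\hat\beta)| \le \lambda\} \to 1$ for $j = q + 1, \ldots, p$. We can always find $l_j \in [-1, +1]$ such that $\Pr\{\xi_j^* = s_j(\hat\beta) + \lambda l_j = \partial h(\beta)/\partial\beta_j\} \to 1$ for $j = q + 1, \ldots, p$, for both penalties. This completes the proof.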

The proof of Theorem 3.3 consists of two parts. First we will show that the LLA algorithm initiated by β̃(0) gives the oracle estimator after one iteration. Then we will show that once the LLA algorithm finds the oracle estimator β̂, it will find it again in the next iteration, that is, the LLA algorithm will converge.

Proof (Proof of Theorem 3.3)

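Assume that none of the events $F_{ni}$ is true, for $i = 1, \ldots, 4$. The probability that none of these events is true is at least $1 - P_{n1} - P_{n2} - P_{n3} - P_{n4}$. Then we have

$$|\tilde\beta_j^{(0)}| = |\tilde\beta_j^{(0)} - \beta_{0j}| \le \lambda,\quad q + 1 \le j \le p; \qquad |\tilde\beta_j^{(0)}| \ge |\beta_{0j}| - |\tilde\beta_j^{(0)} - \beta_{0j}| \ge a\lambda,\quad 1 \le j \le q.$$

By Condition 2 of the class of nonconvex penalties, we have $p'_\lambda(|\tilde\beta_j^{(0)}|) = 0$ for $1 \le j \le q$. Therefore the next iterate $\tilde\beta^{(1)}$ is the solution to the convex optimization problem

$$\tilde\beta^{(1)} = \arg\min_{\beta}\ n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + \sum_{q+1 \le j \le p} p'_\lambda(|\tilde\beta_j^{(0)}|)\cdot|\beta_j|. \qquad (14)$$

By the fact that $F_{n3}$ is not true, there exist some subgradients of the oracle estimator $s(\hat\beta)$ such that $s_j(\hat\beta) = 0$ for $0 \le j \le q$ and $|s_j(\hat\beta)| < (1 - \tfrac{1}{a})\lambda$ for $q + 1 \le j \le p$. Note that by the definition of subgradient, we have

$$n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ \ge n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\hat\beta)_+ + \sum_{0 \le j \le p} s_j(\hat\beta)(\beta_j - \hat\beta_j) = n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\hat\beta)_+ + \sum_{q+1 \le j \le p} s_j(\hat\beta)(\beta_j - \hat\beta_j).$$

Then we have for any $\beta$

$$\Big\{n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + \sum_{q+1 \le j \le p} p'_\lambda(|\tilde\beta_j^{(0)}|)|\beta_j|\Big\} - \Big\{n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\hat\beta)_+ + \sum_{q+1 \le j \le p} p'_\lambda(|\tilde\beta_j^{(0)}|)|\hat\beta_j|\Big\} \ge \sum_{q+1 \le j \le p}\big\{p'_\lambda(|\tilde\beta_j^{(0)}|) - s_j(\hat\beta)\cdot\mathrm{sgn}(\beta_j)\big\}\cdot|\beta_j| \ge \sum_{q+1 \le j \le p}\big\{(1 - \tfrac{1}{a})\lambda - s_j(\hat\beta)\cdot\mathrm{sgn}(\beta_j)\big\}\cdot|\beta_j| \ge 0.$$

The strict inequality holds unless $\beta_j = 0$ for all $q + 1 \le j \le p$. Since we consider the non-separable case in which the oracle estimator is unique, we know the oracle estimator is the unique minimizer of (14) and hence $\tilde\beta^{(1)} = \hat\beta$. This proves that the LLA algorithm finds the oracle estimator after one iteration.

In the case that $F_{n2}$ is not true, we have $|\hat\beta_j| > a\lambda$ for all $1 \le j \le q$. Hence by Condition 2 of the class of penalties $p'_\lambda(|\hat\beta_j|) = 0$ for all $1 \le j \le q$ and $p'_\lambda(|\hat\beta_j|) = p'_\lambda(0) = \lambda$ for all $q + 1 \le j \le p$. Once the LLA algorithm finds $\hat\beta$, the solution to the next LLA iteration $\tilde\beta^{(2)}$ is the minimizer of the convex optimization problem

$$\tilde\beta^{(2)} = \arg\min_{\beta}\ n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + \sum_{q+1 \le j \le p}\lambda|\beta_j|. \qquad (15)$$

Then we have for any $\beta$

$$\Big\{n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + \sum_{q+1 \le j \le p}\lambda|\beta_j|\Big\} - \Big\{n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\hat\beta)_+ + \sum_{q+1 \le j \le p}\lambda|\hat\beta_j|\Big\} \ge \sum_{q+1 \le j \le p}\big\{\lambda - s_j(\hat\beta)\cdot\mathrm{sgn}(\beta_j)\big\}\cdot|\beta_j| \ge 0,$$

and hence $\tilde\beta^{(2)} = \hat\beta$ is the unique minimizer of (15). That is, the LLA algorithm finds the oracle estimator again and stops.

As $n \to \infty$, by Theorem 3.1 we have $P_{n2} \to 0$ and $P_{n4} \to 0$. The proof for $P_{n3} \to 0$ is similar to the proof for Theorem 3.1 by changing the constant to be $(1 - \tfrac{1}{a})$.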

Now we prove Theorem 3.4.

Proof (Proof of Theorem 3.4)

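Let $\|\cdot\|_1$ be the $L_1$ norm of a vector. Denote $l_n(\beta) = n^{-1}\sum_{i=1}^{n} W_i(1 - Y_iX_i^T\beta)_+ + c_n\|\beta\|_1$. Note that

$$E\big[np^{-1}\{l_n(\beta_0 + \sqrt{p/n}\,u) - l_n(\beta_0)\}\big] = E\big[np^{-1}\{W(1 - YX^T(\beta_0 + \sqrt{p/n}\,u))_+ - W(1 - YX^T\beta_0)_+\}\big] + np^{-1}c_n\big(\|\beta_0 + \sqrt{p/n}\,u\|_1 - \|\beta_0\|_1\big)$$

for some constant $\Delta$ such that $\|u\| = \Delta$. Observe that $\big|\|\beta_0 + \sqrt{p/n}\,u\|_1 - \|\beta_0\|_1\big| \le \|\sqrt{p/n}\,u\|_1 = \sqrt{p/n}\,\|u\|_1$. By the fact that $c_n = o(n^{-1/2})$, we have $np^{-1}c_n(\|\beta_0 + \sqrt{p/n}\,u\|_1 - \|\beta_0\|_1) \to 0$ as $n \to \infty$. Then similar to the proof of Lemma 3.1, we can show that the expectation is dominated by $\tfrac{1}{2}u^TH(\beta_0)u > 0$ and $\Pr\{\inf_{\|u\| = \Delta} l_n(\beta_0 + \sqrt{p/n}\,u) > l_n(\beta_0)\} \ge 1 - \eta$. Hence $\|\hat\beta^{L_1} - \beta_0\| = O_p(\sqrt{p/n})$. Because $\sqrt{p}\,n^{-1/2} = o(\lambda)$, $\Pr(|\hat\beta_j^{L_1} - \beta_{0j}| > \lambda \text{ for some } 1 \le j \le p) \to 0$ as $n \to \infty$. Then using Theorem 3.1 and Corollary 3.1 we have $\Pr\{\hat\beta(\lambda) = \hat\beta\} \to 1$, which completes the proof.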

Contributor Information

Yichao Wu, North Carolina State University, Raleigh, NC, USA.

Lan Wang, The University of Minnesota, Minneapolis, MN, USA.

Runze Li, The Pennsylvania State University, University Park, PA, USA.

References

  1. An LTH, Tao PD. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research. 2005;133(1–4):23–46.
  2. Bartlett PL, Jordan MI, McAuliffe JD. Convexity, classification, and risk bounds. Journal of the American Statistical Association. 2006;101(473):138–156.
  3. Becker N, Toedt G, Lichter P, Benner A. Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data. BMC Bioinformatics. 2011;12(1):138. doi: 10.1186/1471-2105-12-138.
  4. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics. 2009;37(4):1705–1732.
  5. Bradley P, Mangasarian O. Feature selection via concave minimization and support vector machines. Machine Learning Proceedings of the Fifteenth International Conference (ICML98); 1998. pp. 82–90.
  6. Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer; 2011.
  7. Cai T, Liu W. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association. 2011;106(496):1566–1577.
  8. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95(3):759–771.
  9. Claeskens G, Croux C, Van Kerckhoven J. An information criterion for variable selection in support vector machines. The Journal of Machine Learning Research. 2008;9:541–558.
  10. Donoho D, et al. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture. 2000:1–32.
  11. Fan J, Fan Y. High dimensional classification using features annealed independence rules. The Annals of Statistics. 2008;36(6):2605–2637. doi: 10.1214/07-AOS504.
  12. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
  13. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society. Series B (Methodological). 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  14. Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics. 2014; to appear. doi: 10.1214/13-aos1198.
  15. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. Vol. 1. Springer Series in Statistics. Springer; 2001.
  16. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1):389–422.
  17. Kim Y, Choi H, Oh H. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103(484):1665–1673.
  18. Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99(2):315–325.
  19. Koenker R. Quantile regression. Vol. 38. Cambridge University Press; 2005.
  20. Koo J, Lee Y, Kim Y, Park C. A Bahadur representation of the linear support vector machine. The Journal of Machine Learning Research. 2008;9:1343–1368.
  21. Lin Y. Some asymptotic properties of the support vector machine. Technical report 1029. Department of Statistics, University of Wisconsin, Madison; 2000.
  22. Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
  23. Lin Y, Lee Y, Wahba G. Support vector machines for classification in non-standard situations. Machine Learning. 2002;46(1–3):191–202.
  24. Mazumder R, Friedman J, Hastie T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106(495):1125–1138. doi: 10.1198/jasa.2011.tm09738.
  25. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34(3):1436–1462.
  26. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009:246–270.
  27. Park C, Kim KR, Myung R, Koo JY. Oracle properties of SCAD-penalized support vector machine. Journal of Statistical Planning and Inference. 2012;142(8):2257–2270.
  28. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–464.
  29. Tao P, An L. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica. 1997;22(1):289–355.
  30. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58(1):267–288.
  31. Vapnik V. The nature of statistical learning theory. New York: Springer; 1996.
  32. Wang L, Kim Y, Li R. Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics. 2013;41:2505–2536. doi: 10.1214/13-AOS1159.
  33. Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association. 2012;107(497):214–222. doi: 10.1080/01621459.2012.656014.
  34. Wang L, Zhu J, Zou H. The doubly regularized support vector machine. Statistica Sinica. 2006;16(2):589–615.
  35. Wang L, Zhu J, Zou H. Hybrid huberized support vector machines for microarray classification. Proceedings of the 24th International Conference on Machine Learning. 2007:983–990.
  36. Wegkamp M, Yuan M. Support vector machines with a reject option. Bernoulli. 2011;17:1368–1385.
  37. Welsh A. On M-processes and M-estimation. The Annals of Statistics. 1989;17(1):337–361.
  38. Yuan M. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research. 2010;99:2261–2286.
  39. Zhang C. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942.
  40. Zhang CH, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36(4):1567–1594.
  41. Zhang H, Ahn J, Lin X, Park C. Gene selection using support vector machines with non-convex penalty. Bioinformatics. 2006;22(1):88–95. doi: 10.1093/bioinformatics/bti736.
  42. Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2007;7(2):2541–2563.
  43. Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Advances in Neural Information Processing Systems. 2004;16(1):49–56.
  44. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429.
  45. Zou H. An improved 1-norm SVM for simultaneous classification and variable selection. Eleventh International Conference on Artificial Intelligence and Statistics. 2007.
  46. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
  47. Zou H, Yuan M. The F∞-norm support vector machine. Statistica Sinica. 2008;18:379–398.
