Journal of Applied Statistics. 2020 Nov 18;49(5):1105–1120. doi: 10.1080/02664763.2020.1849053

Structured sparse support vector machine with ordered features

Kuangnan Fang a,b, Peng Wang a, Xiaochen Zhang a, Qingzhao Zhang a,b,c
PMCID: PMC9041777  PMID: 35707509

ABSTRACT

In the application of high-dimensional data classification, several attempts have been made to achieve variable selection by replacing the $\ell_2$-penalty with other penalties for the support vector machine (SVM). However, these high-dimensional SVM methods usually do not take into account the special structure among covariates (features). In this article, we consider a classification problem in which the covariates are ordered in some meaningful way and the number of covariates p can be much larger than the sample size n. We propose a structured sparse SVM to tackle this type of problem, which combines a non-convex penalty and a cubic-spline-type estimation procedure (i.e. penalizing the second-order derivatives of the coefficients) with the SVM. From a theoretical point of view, the proposed method satisfies the local oracle property. Simulations show that the method works effectively in both feature selection and classification accuracy. A real application is conducted to illustrate the benefits of the method.

Keywords: Structured sparse, support vector machine, variable selection, local oracle property

1. Introduction

Classification is one of the most important research fields in statistics and machine learning, and it is also a common practical problem. The support vector machine (SVM) [19] is a powerful classification tool with high accuracy and great flexibility. In this article, we will focus on a classification problem with n cases having class labels $\{y_i \in \{-1, 1\};\ i=1,\ldots,n\}$ and features $\{x_{ij};\ i=1,2,\ldots,n,\ j=1,2,\ldots,p\}$. The SVM has an equivalent formulation as the $\ell_2$-penalized hinge loss [11]:

$$\min_{(\beta_0,\beta)}\ \frac{1}{n}\sum_{i=1}^{n}\ell\{y_i(\beta_0 + x_i^\top\beta)\} + \frac{\lambda}{2}\|\beta\|^2, \qquad (1)$$

where the loss $\ell(t) = [1-t]_+$ is called the hinge loss, and $\|\cdot\|$ is the $\ell_2$-norm. $\lambda$ is the tuning parameter, which is used to control the tradeoff between loss and penalty.

In the application of high-dimensional data classification, several attempts have been made to achieve variable selection by replacing the $\ell_2$-penalty with other penalties for the SVM, such as the $\ell_1$-SVM [1,28], $\ell_0$-SVM and $\ell_\infty$-SVM [10], $\ell_p$-SVM [4], SCAD-SVM [25,27], hybrid Huberized SVM [20,21], and MCP-SVM [27]. Since the hinge loss is not differentiable everywhere, it causes some computational difficulties. [6,15] considered the squared hinge loss in the SVM; [17,20–22] suggested using the Huberized hinge loss in the SVM. [17] points out that the Huberized regularized solution paths are less affected by outliers than those of the non-Huberized squared loss.

This paper concerns a class of structured sparse classification problems with ordered features, i.e. the covariates can be ordered as $x_1, x_2, \ldots, x_p$ in some sense. A motivating example comes from protein mass spectroscopy. For each blood serum sample i, we observe the intensity $x_{ij}$ for many time-of-flight values $t_j$. Time of flight is related to the mass-over-charge ratio m/z of the constituent proteins in the blood. Figure 1 shows an example of protein mass spectroscopy data taken from [16]. We plot the intensity $x_{ij}$ on the vertical axis against m/z on the horizontal axis. The features are ordered in a meaningful way, i.e. the $x_{ij}$ are ordered by m/z, which may lead to high correlation among closely located variables. The SVM methods mentioned above for high-dimensional data classification do not consider the structure in which the variables are arranged in order. Our goal is to predict the label from the ordered features, especially for $p \gg n$.

Figure 1. Protein mass spectroscopy data: average profiles from control (–) and ovarian cancer patients (–).

Besides the above example, many other data types share this structure, such as gene expression data in microarray studies, single nucleotide polymorphism (SNP) data in genome-wide association studies (GWAS), and graph and image data [9]. Such special structures among variables may cause successive coefficients to vary slowly. The fused lasso [18] encourages flatness of the coefficients by penalizing the $\ell_1$-norm of the successive differences of the coefficients. However, it may not perform well when the coefficients vary smoothly rather than being step-like. To capture smooth features within a group, the smooth-lasso [12] replaces the $\ell_1$-penalty on the differences of adjacent coefficients in the fused lasso by an $\ell_2$-penalty. Recently, [9] proposed the spline-lasso and spline-MCP, in which an $\ell_2$-penalty is imposed on the discrete version of the second derivatives of the coefficients. However, the methods mentioned above are mainly used in regression.

To the best of our knowledge, the present article is the first to develop theory and methodology for the SVM that incorporate the ordered structure among predictors. This study advances the existing literature in the following aspects. First, the structured sparse SVM can achieve variable selection as well as capture the ordered structure of features. The subsequent numerical analysis shows that ignoring this data structure harms both classification accuracy and variable selection. Second, we theoretically prove that our method has the local oracle property. Even when the number of covariates grows exponentially with the sample size, the local oracle property still holds for the structured sparse SVM. Finally, the algorithm and asymptotic properties are established in a general form, and many kinds of loss functions can be accommodated in the formulation of the structured sparse SVM, such as the Huberized hinge loss and the squared hinge loss.

The rest of the article is organized as follows. In Section 2, we describe the model, an efficient algorithm, and local oracle properties for structured sparse SVM. Simulation results and an application of the proposed method to a protein mass spectroscopy dataset are presented in Sections 3 and 4. Discussions of the proposed method and results are given in Section 5. Proofs for the oracle properties of structured sparse SVM are provided in the Appendix.

2. Structured sparse support vector machine

2.1. Methodology

In this paper, we allow the number of covariates p to increase with the sample size n; it is even possible that p is much larger than n. We assume that the true parameter is sparse and that the features are ordered in some meaningful way. Thus, we seek an estimator that enjoys the structured sparsity property.

The structured sparse SVM is formulated in terms of a loss function that is regularized by penalty terms. Our proposed minimization objective function is

$$\min_{(\beta_0,\beta)}\ \frac{1}{n}\sum_{i=1}^{n}\ell\{y_i(\beta_0 + x_i^\top\beta)\} + \sum_{j=1}^{p} p_{\lambda_1}(|\beta_j|) + \lambda_2\sum_{j=2}^{p-1}\big(\Delta_j^{(2)}\beta\big)^2. \qquad (2)$$

In (2), the first part is a convex loss function. Many kinds of loss functions can be accommodated. The Huberized hinge loss,

$$\ell_\delta(t) = \begin{cases} 0, & t > 1,\\ (1-t)^2/(2\delta), & 1-\delta < t \le 1,\\ 1 - t - \delta/2, & t \le 1-\delta, \end{cases}$$

is adopted in this paper. We fix the pre-specified constant $\delta = 2$ following [21]. The results with other losses are provided in the supplemental materials.
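
As a concrete illustration, the following is a minimal NumPy sketch of the Huberized hinge loss and its first derivative (the derivative is what the algorithm of Section 2.2 uses); the function names are ours and are not part of the paper.

```python
import numpy as np

def huberized_hinge(t, delta=2.0):
    """Huberized hinge loss l_delta(t), evaluated elementwise."""
    t = np.asarray(t, dtype=float)
    loss = np.zeros_like(t)
    quad = (t <= 1.0) & (t > 1.0 - delta)        # quadratic middle piece
    lin = t <= 1.0 - delta                       # linear left piece
    loss[quad] = (1.0 - t[quad]) ** 2 / (2.0 * delta)
    loss[lin] = 1.0 - t[lin] - delta / 2.0
    return loss

def huberized_hinge_deriv(t, delta=2.0):
    """First derivative l'_delta(t), needed for the GCD updates."""
    t = np.asarray(t, dtype=float)
    grad = np.zeros_like(t)
    quad = (t <= 1.0) & (t > 1.0 - delta)
    lin = t <= 1.0 - delta
    grad[quad] = -(1.0 - t[quad]) / delta
    grad[lin] = -1.0
    return grad
```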

The second part is used to achieve variable selection. We consider the penalized SVM with a general class of non-convex penalties, such as the smoothly clipped absolute deviation (SCAD) penalty [7] and the minimax concave penalty (MCP) [24]. The SCAD penalty is defined by $p_\lambda(x) = \lambda\int_0^{|x|}\min\{1, (a - t/\lambda)_+/(a-1)\}\,dt$ for some $a > 2$. The MCP is defined by $p_\lambda(x) = \lambda\int_0^{|x|}\{1 - t/(a\lambda)\}_+\,dt$ for some $a > 1$. The experiments with different a values are presented in the supplemental materials. We find our results to be insensitive to these choices, and for brevity, we fixed a = 3.7 for the SCAD penalty and a = 3 for the MCP as suggested in the literature [3,7,27,29].
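
Below is a minimal sketch of the SCAD and MCP penalties and their derivatives (the derivatives enter the local linear approximation used in Section 2.2), assuming the closed forms implied by the integral definitions above; the function names are ours.

```python
import numpy as np

def scad_deriv(x, lam, a=3.7):
    """Derivative p'_lambda(|x|) of the SCAD penalty (a > 2)."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.minimum(1.0, np.maximum(a - x / lam, 0.0) / (a - 1.0))

def mcp_deriv(x, lam, a=3.0):
    """Derivative p'_lambda(|x|) of the MCP (a > 1)."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.maximum(1.0 - x / (a * lam), 0.0)

def scad_penalty(x, lam, a=3.7):
    """SCAD penalty p_lambda(|x|): the integral of scad_deriv from 0 to |x|."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(
        x <= lam,
        lam * x,
        np.where(x <= a * lam,
                 (2.0 * a * lam * x - x ** 2 - lam ** 2) / (2.0 * (a - 1.0)),
                 lam ** 2 * (a + 1.0) / 2.0),
    )

def mcp_penalty(x, lam, a=3.0):
    """MCP p_lambda(|x|): the integral of mcp_deriv from 0 to |x|."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= a * lam, lam * x - x ** 2 / (2.0 * a), a * lam ** 2 / 2.0)
```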

The third part mimics a cubic spline to encourage smoothness of the coefficients. As the coefficients of the variables may vary smoothly with position, we encourage smoothness by penalizing the $\ell_2$ norm of the discrete version of the second-order derivatives of the coefficients. Denote the first- and second-order differences (or discrete versions of derivatives) of the coefficients by $\Delta_j\beta := \beta_{j+1} - \beta_j$ and $\Delta_j^{(2)}\beta = \Delta_j\beta - \Delta_{j-1}\beta = \beta_{j+1} - 2\beta_j + \beta_{j-1}$. Then the penalty on the discrete second-order derivatives of the coefficients is $\sum_{j=2}^{p-1}(\Delta_j^{(2)}\beta)^2$. The estimator obtained by minimizing (2) enjoys the structured sparsity property.
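
To make the smoothness term concrete, here is a small sketch that builds the second-difference matrix (written D in Section 2.2) and evaluates the third term of (2); the function names and the example vectors are ours.

```python
import numpy as np

def second_difference_matrix(p):
    """(p-2) x p matrix D with rows (..., 1, -2, 1, ...), so that
    (D @ beta)[j] is the discrete second-order derivative of beta at j."""
    D = np.zeros((p - 2, p))
    for i in range(p - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    return D

def smoothness_penalty(beta, lam2):
    """lambda_2 * sum_j (Delta_j^{(2)} beta)^2, the third term of (2)."""
    D = second_difference_matrix(len(beta))
    return lam2 * np.sum((D @ beta) ** 2)

# A smoothly varying coefficient vector is penalized far less than a rough one.
smooth = np.sin(np.linspace(0.0, np.pi, 50))
rough = np.random.default_rng(0).normal(size=50)
print(smoothness_penalty(smooth, 1.0), smoothness_penalty(rough, 1.0))
```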

Remark 2.1

The idea of the third part in (2) is similar to the spline-lasso [9], which is used in regression. The computation of the spline-lasso can be converted to a lasso problem through a suitable transformation. However, this conversion is no longer applicable when we solve the SVM problem, and the computation is more complicated for the SVM than for regression. The details of the algorithm are shown in Section 2.2.

2.2. Algorithm

We now give the algorithm to solve this problem. Without loss of generality, we assume that the input data are standardized: $\sum_{i=1}^n x_{ij}/n = 0$, $\sum_{i=1}^n x_{ij}^2/n = 1$, $j = 1, 2, \ldots, p$. We use the generalized coordinate descent (GCD) algorithm [22] to solve the structured sparse SVM problem.

When using the GCD algorithm, the loss function $\ell(\cdot)$ should satisfy the following condition:

$$\ell(t + a) \le \ell(a) + \ell'(a)\,t + \frac{M}{2}t^2 \qquad \forall\, a, t, \qquad (3)$$

where M is a constant greater than 0. The corresponding value of M is $2/\delta$ for the Huberized hinge loss. It can be shown that common loss functions such as the Huberized hinge loss, the logistic loss, and the squared hinge loss satisfy the above condition. Although the hinge loss, the loss of the standard SVM, does not satisfy (3), [21] showed that the Huberized hinge loss with parameter $\delta = 0.01$ is nearly identical to the hinge loss.

Let D be a $(p-2)\times p$ matrix with $D_{ii} = D_{i,i+2} = 1$, $D_{i,i+1} = -2$, and $D_{ij} = 0$ otherwise. Given the current estimate $\{\tilde\beta_0, \tilde\beta\}$, define the current margin $r_i = y_i(\tilde\beta_0 + x_i^\top\tilde\beta)$. The coordinate descent algorithm cyclically minimizes

$$F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell\{r_i + y_i x_{ij}(\beta_j - \tilde\beta_j)\} + p_{\lambda_1}(|\beta_j|) + \lambda_2 (D^\top D)_{jj}\beta_j^2 + 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l\beta_j \qquad (4)$$

with respect to $\beta_j$. According to the local linear approximation (LLA) [29], we have $p_{\lambda_1}(|\beta_j|) \approx p_{\lambda_1}(|\tilde\beta_j|) + p'_{\lambda_1}(|\tilde\beta_j|)(|\beta_j| - |\tilde\beta_j|)$ for $\beta_j \approx \tilde\beta_j$. As pointed out by a referee, the CCCP (constrained concave–convex procedure) algorithm is also an efficient algorithm for solving this problem, which is worth investigating as future work. When $\ell(\cdot)$ satisfies (3), we have $F(\beta_j \mid \tilde\beta_0, \tilde\beta) \le \hat F(\beta_j \mid \tilde\beta_0, \tilde\beta)$, where

$$\hat F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell(r_i) + \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i x_{ij}(\beta_j - \tilde\beta_j) + \frac{M}{2}(\beta_j - \tilde\beta_j)^2 + p'_{\lambda_1}(|\tilde\beta_j|)\,|\beta_j| + \lambda_2 (D^\top D)_{jj}\beta_j^2 + 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l\beta_j. \qquad (5)$$

Since $\hat F$ is a quadratic majorization of F, we obtain the new update by minimizing $\hat F$:

$$\hat\beta_j^{\mathrm{new}} = \arg\min_{\beta_j}\hat F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{S\big(z,\ p'_{\lambda_1}(|\tilde\beta_j|)\big)}{M + 2\lambda_2 (D^\top D)_{jj}}, \qquad (6)$$

where $S(z, t) = (|z| - t)_+\,\mathrm{sign}(z)$ and $z = M\tilde\beta_j - \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i x_{ij} - 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l$.

Likewise, we can update the intercept by minimizing

$$\hat F(\beta_0 \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell(r_i) + \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i(\beta_0 - \tilde\beta_0) + \frac{M}{2}(\beta_0 - \tilde\beta_0)^2. \qquad (7)$$

Then the intercept is updated by

$$\hat\beta_0^{\mathrm{new}} = \arg\min_{\beta_0}\hat F(\beta_0 \mid \tilde\beta_0, \tilde\beta) = \tilde\beta_0 - \frac{\sum_{i=1}^{n}\ell'(r_i)\,y_i}{Mn}. \qquad (8)$$

Then we can iterate (4)–(8) until convergence.
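
The following is one possible end-to-end sketch of the GCD sweep described by (4)–(8), written for the Huberized hinge loss and the SCAD penalty via its LLA derivative. It reuses the helper functions sketched earlier (huberized_hinge_deriv, scad_deriv, second_difference_matrix). Details such as the convergence check, the full recomputation of the margins at each coordinate, and the zero initialization are our simplifications, not the authors' implementation (the paper initializes with the $\ell_1$-penalized SVM, see Remark 2.2).

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = (|z| - t)_+ sign(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def gcd_structured_svm(X, y, lam1, lam2, delta=2.0, a=3.7, max_sweeps=100, tol=1e-5):
    """Cyclic GCD updates (6) and (8) for objective (2), as a rough sketch."""
    n, p = X.shape
    M = 2.0 / delta                                   # majorization constant from the text
    D = second_difference_matrix(p)
    DtD = D.T @ D
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            r = y * (beta0 + X @ beta)                # current margins r_i
            lp = huberized_hinge_deriv(r, delta)
            z = (M * beta[j]
                 - np.mean(lp * y * X[:, j])
                 - 2.0 * lam2 * (DtD[:, j] @ beta - DtD[j, j] * beta[j]))
            beta[j] = soft_threshold(z, scad_deriv(beta[j], lam1, a)) / (M + 2.0 * lam2 * DtD[j, j])
        r = y * (beta0 + X @ beta)
        beta0 -= np.mean(huberized_hinge_deriv(r, delta) * y) / M      # intercept update (8)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta0, beta
```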

Remark 2.2

In this paper, we specify the initial value using the $\ell_1$-penalized SVM, following [27], which leads to satisfactory results.

Remark 2.3

The above algorithm satisfies the majorization–minimization (MM) principle [5,13,14], and the MM principle ensures the descent property of the GCD algorithm. The proof is similar to [21,22] and is omitted here.

2.3. Asymptotic properties

In this subsection, we establish the theory of the local oracle property for the structured sparse SVM, namely the oracle estimator is one of the local minimizers of (2).

Since $\beta_0$ does not affect variable selection, we set $\beta_0 = 0$ in this section without loss of generality. Let $\beta^* = (\beta_1^*, \beta_2^*, \ldots, \beta_p^*)^\top$ denote the true parameter value, which is defined as the minimizer of the population loss: $\beta^* = \arg\min_\beta L(\beta) = \arg\min_\beta E\{\ell(y\,x^\top\beta)\}$.

We use $p_n$ in this section to denote the number of features. Let $A = \{j : \beta_j^* \ne 0,\ 1 \le j \le p_n\}$ be the index set of the non-zero coefficients and $A^c = \{j : \beta_j^* = 0,\ 1 \le j \le p_n\}$ be the index set of the zero coefficients. $q_n = |A|$ is the cardinality of the set A. $D_A$ is the submatrix formed by removing from D the columns corresponding to the elements of $A^c$, and $\beta_{A^c} = (\beta_j)_{j\in A^c}$ is the vector composed of the components of $\beta$ corresponding to the elements of $A^c$. Then the oracle estimator $\hat\beta$ is defined as $\hat\beta = \arg\min_{\beta_{A^c}=0}\{\frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A\}$.

Theorem 2.1

Assume that Conditions 1–5 listed in the Appendix hold. Let $B_n(\lambda_{1n}, \lambda_{2n})$ be the set of local minimizers of the objective function

$$Q_n(\beta) = L_n(\beta) + \sum_{j=1}^{p_n} p_{\lambda_{1n}}(|\beta_j|) + \lambda_{2n}\beta^\top D^\top D\beta$$

with regularization parameters $\lambda_{1n}$, $\lambda_{2n}$. The oracle estimator $\hat\beta = (\hat\beta_A^\top, 0^\top)^\top$ satisfies

$$\Pr\{\hat\beta \in B_n(\lambda_{1n}, \lambda_{2n})\} \to 1$$

as $n \to \infty$, if $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$, $\lambda_{2n} q_n^{1/2} n^{-1/2} = o(\lambda_{1n})$, and $\lambda_{1n} = o(n^{-(1-c_3)/2})$.

From Theorem 2.1, we can see that if we take $\lambda_{1n} = n^{-1/2+\tau}$ for some $c_1 < \tau < c_3/2$, then the oracle property holds even for $p = o\{\exp(n^{(\tau - c_1)/2})\}$. Thus, even when the number of covariates grows exponentially with the sample size, the local oracle property still holds for the structured sparse SVM.

3. Simulations

In this section, numerical experiments are conducted to study the performance of our proposed method. We use Spline-penalty-HSVM, where the penalty is SCAD or MCP, to denote our proposed method (i.e. Spline-SCAD-HSVM and Spline-MCP-HSVM). To investigate the performance, we compare the proposed method with alternatives that do not consider structured sparsity: SCAD-HSVM and MCP-HSVM.

Three data generation processes are considered in this paper. We set n = 100 and p = 1000. In Example 3.1, the non-zero coefficients of the variables are completely smooth in position. Partial non-zero coefficients are smooth in Example 3.2. In Example 3.3, the non-zero coefficients are not smooth. Within each example, our simulated data consist of a training set and a testing set. Models are fitted on the training data only, and the testing set with sample size 500 is used to evaluate the predictions of each method. The optimal regularization parameters $\lambda_1$ and $\lambda_2$ are selected on a 15-by-20 mesh grid through 5-fold cross validation. Possible values of $\lambda_2$ are $\{0.1, 0.2, \ldots, 1.9, 2\}$. For each fixed $\lambda_2$, we compute the solutions for a fine grid of $\lambda_1$ values. Following [21], we start with $\lambda_1^{(\max)}$, the smallest $\lambda_1$ that sets all $\beta_j$ to zero, and set $\lambda_1^{(\min)} = 0.01\lambda_1^{(\max)}$. Between $\lambda_1^{(\min)}$ and $\lambda_1^{(\max)}$, 15 points are placed uniformly on the log scale. We then select the optimal regularization parameters that achieve the maximum classification accuracy rate. A small sketch of this tuning grid is given below, followed by the details of the three scenarios.
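
A minimal sketch of the grid construction just described; the value of lambda_1_max below is only a placeholder, since in practice it is computed from the data as the smallest $\lambda_1$ that zeroes out all coefficients.

```python
import numpy as np

# 20 values of lambda_2 and, for each, 15 values of lambda_1 placed
# log-uniformly between 0.01 * lambda_1_max and lambda_1_max.
lam2_grid = np.round(np.arange(0.1, 2.01, 0.1), 1)
lam1_max = 1.0  # placeholder; data-dependent in practice
lam1_grid = np.exp(np.linspace(np.log(0.01 * lam1_max), np.log(lam1_max), 15))
```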

Example 3.1

Consider $x \sim N(0, \Sigma)$ with $\Sigma = (0.5^{|i-j|})_{p\times p}$, $\beta_j = j/40$ for $j = 1, 2, \ldots, 20$; $\beta_j = 1 - j/40$ for $j = 21, \ldots, 40$; $\beta_j = \sin(\pi j/40)$ for $j = 81, \ldots, 120$; $\beta_j = 0.5\{1 - \cos(\pi j/20)\}$ for $j = 161, \ldots, 200$; and $\beta_j = 0$ otherwise. $\Pr(y = 1 \mid x) = 1/\{1 + \exp(-x^\top\beta)\}$ and $\Pr(y = -1 \mid x) = 1/\{1 + \exp(x^\top\beta)\}$. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 5.2%.
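
For concreteness, the following is a small sketch of how one could simulate a data set under our reading of Example 3.1; the function name and the random seed are ours, and the label mechanism assumes the logistic form stated above.

```python
import numpy as np

def make_example_3_1(n=100, p=1000, rho=0.5, seed=0):
    """Generate (X, y, beta) following the layout described in Example 3.1."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_{ij} = 0.5^{|i-j|}
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    j = np.arange(1, p + 1, dtype=float)
    beta = np.zeros(p)
    beta[0:20] = j[0:20] / 40.0                                        # j = 1,...,20
    beta[20:40] = 1.0 - j[20:40] / 40.0                                # j = 21,...,40
    beta[80:120] = np.sin(np.pi * j[80:120] / 40.0)                    # j = 81,...,120
    beta[160:200] = 0.5 * (1.0 - np.cos(np.pi * j[160:200] / 20.0))    # j = 161,...,200
    prob = 1.0 / (1.0 + np.exp(-X @ beta))                             # Pr(y = 1 | x)
    y = np.where(rng.uniform(size=n) < prob, 1, -1)
    return X, y, beta
```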

Example 3.2

The setting of x is the same as in Example 3.1. $\beta_j$ takes the same value as in Example 3.1 for $j = 1, 2, \ldots, 120$; $\beta_j \sim \mathrm{Uniform}(-0.5, 0.5)$ for $j = 160, \ldots, 200$; and $\beta_j = 0$ otherwise. $\Pr(y = 1 \mid x) = \Phi(x^\top\beta)$, where $\Phi(\cdot)$ is the distribution function of the standard normal distribution. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 9.1%.

Example 3.3

The generation of x and y is the same as in Example 3.1. $\beta_j \sim \mathrm{Uniform}(0, 1)$ for $j = 1, 2, \ldots, 40$; and $\beta_j = 0$ otherwise. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 8.3%.

The performance of different methods will be examined in two aspects: classification prediction and feature selection. In the evaluation of classification prediction, the classification accuracy rate (ACC), area under curve (AUC), true positive rate (TPR) and false positive rate (FPR) are adopted. As for feature selection, we compare TPR and FPR for different methods.

The results of the simulations are shown in Table 1. When the coefficients vary smoothly, the SVM with the spline penalty is significantly better in classification prediction and variable selection than the SVM without the spline penalty (Examples 3.1 and 3.2). The proposed method performs better especially when the non-zero coefficients are completely smooth in position, in terms of classification prediction. In Figure 2, we present the estimation results for the correlated features in Example 3.1 by the four different methods. From the figure, we can conclude that both Spline-SCAD-HSVM and Spline-MCP-HSVM give good estimates of the coefficients, while SCAD-HSVM and MCP-HSVM do not clean out the noisy signals very well. The improvement is not surprising, since Spline-SCAD-HSVM and Spline-MCP-HSVM can capture the smooth changes in the coefficients. When the non-zero coefficients are not smooth, which means the structural information described in this paper does not exist, the performances of the methods are similar (Example 3.3). This observation indicates that the proposed models are also applicable even when the coefficients do not vary smoothly.

Table 1.

The simulation results obtained from 100 Monte Carlo repetitions (with standard errors in parentheses).

  Classification prediction Variable selection
Method ACC AUC TPR FPR TPR FPR
Example 3.1
SCAD-HSVM 0.782 (0.021) 0.874 (0.018) 0.769 (0.046) 0.205 (0.040) 0.732 (0.038) 0.309 (0.029)
MCP-HSVM 0.775 (0.025) 0.867 ( 0.024) 0.789 (0.031) 0.237 (0.047) 0.761 (0.028) 0.304 (0.036)
Spline-SCAD-HSVM 0.922 (0.022) 0.983 (0.011) 0.911 (0.034) 0.067 (0.030) 0.874 (0.019) 0.254 (0.021)
Spline-MCP-HSVM 0.932 (0.010) 0.987 (0.010) 0.920 (0.033) 0.055 (0.028) 0.868 (0.024) 0.143 (0.019)
Example 3.2
SCAD-HSVM 0.756 (0.011) 0.845 (0.008) 0.723 (0.010) 0.258 (0.008) 0.611 (0.019) 0.253 (0.008)
MCP-HSVM 0.738 (0.015) 0.834 (0.007) 0.715 (0.009) 0.263 (0.008) 0.606 (0.020) 0.265 (0.009)
Spline-SCAD-HSVM 0.798 (0.013) 0.853 (0.011) 0.729 (0.009) 0.259 (0.008) 0.799 (0.019) 0.223 (0.010)
Spline-MCP-HSVM 0.804 (0.012) 0.854 (0.012) 0.731 (0.015) 0.257 (0.008) 0.799 (0.020) 0.225 (0.011)
Example 3.3
SCAD-HSVM 0.788 (0.009) 0.880 (0.012) 0.799 (0.010) 0.244 (0.002) 0.782 (0.018) 0.215 (0.005)
MCP-HSVM 0.788 (0.009) 0.880 (0.012) 0.799 (0.010) 0.244 (0.002) 0.781 (0.020) 0.215 (0.005)
Spline-SCAD-HSVM 0.801 (0.006) 0.878 (0.011) 0.791 (0.010) 0.290 (0.003) 0.790 (0.019) 0.229 (0.011)
Spline-MCP-HSVM 0.808 (0.006) 0.878 (0.011) 0.795 (0.010) 0.281 (0.003) 0.790 (0.015) 0.229 (0.012)

Figure 2. The average estimation results for the correlated features in Example 3.1 over 100 Monte Carlo repetitions, by four different methods: SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM and Spline-MCP-HSVM. The solid curve is the true $\beta$, and the scatter dots represent the estimates for each method.

4. Real data analysis

In this section, we apply our methods to the Ovarian Dataset 8-7-02. The dataset is provided by the US Food and Drug Administration (FDA) and the National Cancer Institute (NCI) and can be downloaded from http://home.ccr.cancer.gov/. The data were collected as serum samples from normal subjects and cancer patients, and the mass spectrometry technique was combined with the WCX2 protein chip and SELDI-TOF. The sample set includes 91 controls and 162 ovarian cancers, which were not randomized. Each mass spectrometry sample contains a 15,154-dimensional mass-to-charge ratio (m/z)/intensity profile. As mentioned in Section 1, the features are ordered in a meaningful way. Following the original researchers, we ignored m/z sites below 100, where chemical artifacts can occur [16].

We randomly choose 173 samples from the data as the training set, and the remaining 80 samples are used as the testing set. Four methods with the Huberized hinge loss, i.e. SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM, and Spline-MCP-HSVM, are fitted using the training set. Additional results with other losses are provided in the supplemental materials. Tuning parameters are chosen by 5-fold cross validation based on the training set. We select the optimal regularization parameters that achieve the maximum classification accuracy rate among the grid points using a two-dimensional grid search. We run the sample-splitting procedure 100 times, and the results are summarized in Table 2. We can see that the performance of Spline-SCAD-HSVM and Spline-MCP-HSVM is slightly better than that of SCAD-HSVM and MCP-HSVM. The ACC, AUC and TPR are slightly higher when we consider the model that explicitly incorporates the special structure among the features.

Table 2.

Results of 100 random splits of the ovarian cancer dataset (with standard errors in parentheses).

Method ACC AUC TPR FPR
SCAD-HSVM 0.919 (0.025) 0.968 (0.010) 0.930 (0.021) 0.061 (0.012)
MCP-HSVM 0.921 (0.026) 0.971 (0.008) 0.932 (0.022) 0.060 (0.015)
Spline-SCAD-HSVM 0.947 (0.018) 0.989 (0.004) 0.972 (0.009) 0.043 (0.011)
Spline-MCP-HSVM 0.947 (0.019) 0.992 (0.003) 0.975 (0.009) 0.041 (0.010)

To complement the estimation and identification analysis, we also evaluate the stability of the analysis by computing the observed occurrence index (OOI). For each feature identified using the training data, we compute its probability of being identified out of the 100 resamplings; this probability has been referred to as the OOI (a small sketch of this computation is given below). The median OOI values of SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM, and Spline-MCP-HSVM are 0.736, 0.739, 0.857, and 0.862, respectively. Figure 3 shows the number of selected proteins versus the selection frequency of the four different methods over the 100 random splits. We can conclude that the OOI values of the models with the spline term are significantly higher, which indicates that Spline-SCAD-HSVM and Spline-MCP-HSVM are more stable than SCAD-HSVM and MCP-HSVM.
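
Under our reading of the OOI definition above (and not the authors' code), the computation could be sketched as follows; the function name and argument layout are ours.

```python
import numpy as np

def observed_occurrence_index(selected, identified):
    """selected: (B, p) boolean matrix; selected[b, j] is True if feature j is
    chosen on resample b. identified: indices of features identified on the
    full training data. Returns each identified feature's selection frequency
    over the B resamples (its OOI) and the median OOI reported in the text."""
    freq = np.asarray(selected, dtype=float).mean(axis=0)
    ooi = freq[np.asarray(identified)]
    return ooi, float(np.median(ooi))
```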

Figure 3. The number of selected proteins versus the selection frequency for four different methods: (a) SCAD-HSVM, (b) MCP-HSVM, (c) Spline-SCAD-HSVM, and (d) Spline-MCP-HSVM.

5. Discussion

In this article, we consider a high-dimensional data classification problem in which the features are ordered in some meaningful way. When the coefficients are sparse and change smoothly, we propose a structured sparse SVM, which combines a non-convex penalty and a cubic-spline-type estimation procedure (i.e. penalizing the second-order derivatives of the coefficients) with the SVM, and we prove that it satisfies the local oracle property under some conditions. The simulation and empirical results show that the proposed method achieves higher classification and prediction accuracy than the existing methods.

In the future, we plan to extend this work to more complex data, for example, high-dimensional data whose variables have group structure and whose intra-group features are ordered in some meaningful way. Moreover, our approach could also be extended to the frameworks of semi-supervised learning and multi-class classification.

Supplementary Material

Supplementary_Data

Acknowledgments

We would like to thank the editor, associate editor and two reviewers for their constructive comments that have led to a significant improvement of the manuscript.

Appendices.

Appendix 1. Regularity conditions.

To facilitate our technical proofs, we impose the following regularity conditions.

Condition A.1

The loss function $\ell(\cdot)$ is convex and has a continuous first-order derivative. There exist constants $M_1$ and $M_2$ such that $|\ell'(t)| \le M_1(|t| + 1)$ and $|\partial(\ell'(t))| \le M_2$ for all t, where $\partial(\cdot)$ represents the subgradient.

Condition A.2

$q_n = O(n^{c_1})$, $0 \le c_1 < 1/2$; $\lambda_{2n}\|D\beta^*\| = O(n^{-c_2})$, $(1 - c_1)/2 < c_2 \le 1/2$.

Condition A.3

The Hessian matrix $H(\beta_A^*) = E[\nabla^2\ell(y\,x_A^\top\beta_A^*)]$ satisfies

$$0 < M_3 < \lambda_{\min}\{H(\beta_A^*)\} \le \lambda_{\max}\{H(\beta_A^*)\} < M_4 < \infty,$$

where $x_A$ is the subvector of x formed by removing the components corresponding to the elements of $A^c$, and $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest eigenvalues, respectively.

Condition A.4

There is a constant $M_5 > 0$ such that $\lambda_{\max}(n^{-1}X_A^\top X_A) \le M_5$, where $X_A$ denotes the $n\times q_n$ design submatrix with rows $x_{iA}^\top$. It is further assumed that the $x_{ij}$ are sub-Gaussian random variables for $1 \le i \le n$, $j \in A^c$.

Condition A.5 Condition on the true model dimension —

There exist positive constants $c_3$ and $M_6$ such that $1 - c_1 \le c_3 \le 1$ and $n^{(1-c_3)/2}\min_{j\in A}|\beta_j^*| \ge M_6$.

Remark A.1

Condition 1 requires that the loss function be smooth and that its derivative change gently, which is satisfied by common SVM loss functions such as the Huberized hinge loss and the squared hinge loss. Condition 2 states that the number of non-zero coefficients cannot diverge faster than $n^{1/2}$ and that the coefficients change slowly in position, which supports our introduction of the spline penalty. Under Condition 3, the Hessian matrix of the loss function is assumed to be positive definite with uniformly bounded eigenvalues. The condition on the largest eigenvalue of the design matrix, which is assumed in Condition 4, is similar to that of [23,26,27]. Condition 5 simply states that the signals cannot decay too quickly.

Appendix 2. Some lemmas.

The proof of Theorem 2.1 relies on the following lemmas.

Lemma A.1

Assume that Conditions 1–5 are satisfied. Then the oracle estimator satisfies $\|\hat\beta_A - \beta_A^*\| = O_p(\sqrt{q_n/n})$.

Proof.

Let $\alpha_n = \sqrt{q_n/n}$ and $Q_n(\beta_A) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A$. We want to show that for any given $\epsilon > 0$, there exists a constant $C > 0$ such that

$$\Pr\Big\{\inf_{\|u\| = C} Q_n(\beta_A^* + \alpha_n u) > Q_n(\beta_A^*)\Big\} \ge 1 - \epsilon. \qquad (\mathrm{A.1})$$

This implies that there exists a local minimum in the ball $\{\beta_A^* + \alpha_n u : \|u\| \le C\}$ with probability at least $1 - \epsilon$. Hence, there exists a local minimizer such that $\|\hat\beta_A - \beta_A^*\| = O_p(\alpha_n)$.

Let

$$\Lambda_n(u) = Q_n(\beta_A^* + \alpha_n u) - Q_n(\beta_A^*) = \frac{1}{n}\sum_{i=1}^{n}\big[\ell\{y_i x_{iA}^\top(\beta_A^* + \alpha_n u)\} - \ell(y_i x_{iA}^\top\beta_A^*)\big] + \lambda_{2n}\big\{(\beta_A^* + \alpha_n u)^\top D_A^\top D_A(\beta_A^* + \alpha_n u) - \beta_A^{*\top} D_A^\top D_A\beta_A^*\big\}.$$

By applying a Taylor series expansion around $\beta_A^*$, we have

$$\begin{aligned}
\Lambda_n(u) &= \frac{1}{n}\sum_{i=1}^{n}\Big[\alpha_n\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u + \frac{\alpha_n^2}{2}u^\top\nabla^2\ell(y_i x_{iA}^\top\tilde\beta_A)u\Big] + 2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u + \lambda_{2n}\alpha_n^2 u^\top D_A^\top D_A u\\
&\ge \frac{\alpha_n}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u + \frac{\alpha_n^2}{2n}\,u^\top\Big\{\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\tilde\beta_A)\Big\}u + 2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u\\
&=: I_1 + I_2 + I_3, \qquad (\mathrm{A.2})
\end{aligned}$$

where $\tilde\beta_A = \beta_A^* + \alpha_n t u$, $0 < t < 1$. By Conditions 1–3, we have

$$|I_1| = \Big|\frac{\alpha_n}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u\Big| \le \alpha_n\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)\Big\|\,\|u\| = \alpha_n\,O_p(\sqrt{q_n/n})\,\|u\| = O_p(\alpha_n^2)\,\|u\|.$$

With Conditions 1 and 4, using the Chebyshev inequality similarly as in [8], we have, when $q_n = O(n^{c_1})$, $0 \le c_1 < 1/2$,

$$\Pr\Big\{\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\beta_A^*) - H(\beta_A^*)\Big\| \ge \epsilon\,q_n^{-1}\Big\} \le \frac{q_n^2}{n\epsilon^2} = o(1).$$

Thus

$$\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\beta_A^*) - H(\beta_A^*)\Big\| = o_p(q_n^{-1}).$$

Then

$$I_2 = \frac{1}{2}\alpha_n^2\,u^\top H(\beta_A^*)\,u\,\{1 + o_p(1)\}.$$

By choosing a sufficiently large C, the second term $I_2$ dominates the first term $I_1$ uniformly in $\|u\| = C$. By the Cauchy–Schwarz inequality and Condition 2, we have

$$|I_3| = |2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u| \le 2\lambda_{2n}\alpha_n\|D_A\beta_A^*\|\,\|D_A u\| = 2\lambda_{2n}\alpha_n\|D\beta^*\|\sqrt{u^\top D_A^\top D_A u} \le 2\lambda_{2n}\alpha_n\|D\beta^*\|\,\|u\|\sqrt{\lambda_{\max}(D_A^\top D_A)} = o(\alpha_n^2)\,\|u\|.$$

This is also dominated by the second term of (A.2). Hence, by choosing a sufficiently large C, (A.1) holds. This completes the proof of the lemma.

Lemma A.2

Assume that Conditions 1–5 hold and that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$. Then

$$\Pr\Big\{\max_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} \to 0,$$

as $n \to \infty$.

Proof.

Recall that $E[n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)] = 0$, and that $\max_i|x_{ij}| = O_p(\log n)$ for sub-Gaussian random variables. For some positive constant C, we have

$$|y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)| \le M_1|x_{ij}|\,(|x_{iA}^\top\beta_A^*| + 1) \le M_1|x_{ij}|\,(\|x_{iA}\|\,\|\beta_A^*\| + 1) \le C q_n\log n.$$

By Lemma 14.11 of [2], we have

$$\Pr\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} \le 2\exp\Big(-\frac{n\lambda_{1n}^2}{8C^2 q_n^2\log^2 n}\Big).$$

Then

$$\begin{aligned}
\Pr\Big\{\max_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} &= \Pr\Big\{\bigcup_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\}\\
&\le 2p_n\exp\Big(-\frac{n\lambda_{1n}^2}{8C^2 q_n^2\log^2 n}\Big) = 2\exp\Big\{\log p_n\Big(1 - \frac{n\lambda_{1n}^2}{8C^2 q_n^2\log p_n\log^2 n}\Big)\Big\} \to 0
\end{aligned}$$

as $n \to \infty$, by the fact that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$.

Lemma A.3

Suppose that Conditions 1–5 hold, $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$, $\lambda_{2n} q_n^{1/2} n^{-1/2} = o(\lambda_{1n})$, and $\lambda_{1n} = o(n^{-(1-c_3)/2})$. For $j = 1, 2, \ldots, p_n$, denote

$$s_j(\hat\beta) = \frac{\partial\big[L_n(\beta) + \lambda_{2n}\beta^\top D^\top D\beta\big]}{\partial\beta_j}\bigg|_{\beta = \hat\beta}.$$

For the oracle estimator $\hat\beta$ and $s_j(\hat\beta)$, with probability approaching 1, we have

$$s_j(\hat\beta) = 0,\quad |\hat\beta_j| \ge \Big(a + \frac{1}{2}\Big)\lambda_{1n},\qquad j\in A,$$
$$|s_j(\hat\beta)| \le \lambda_{1n},\quad |\hat\beta_j| = 0,\qquad j\in A^c.$$

Proof.

The objective function $L_n(\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A$ is convex and differentiable. By convex optimization theory, we have $s_j(\hat\beta) = 0$ for $j\in A$.

Note that $\min_{j\in A}|\hat\beta_j| \ge \min_{j\in A}|\beta_j^*| - \max_{j\in A}|\hat\beta_j - \beta_j^*|$. Furthermore, we have $\min_{j\in A}|\beta_j^*| \ge M_6 n^{-(1-c_3)/2}$ by Condition 5, and $\max_{j\in A}|\hat\beta_j - \beta_j^*| \le \|\hat\beta_A - \beta_A^*\| = O_p(\sqrt{q_n/n}) = O_p(n^{-(1-c_1)/2}) = o_p(n^{-(1-c_3)/2})$. Then, since $\lambda_{1n} = o(n^{-(1-c_3)/2})$, we have

$$\Pr\Big\{|\hat\beta_j| \ge \Big(a + \frac{1}{2}\Big)\lambda_{1n}\Big\} \to 1,\qquad \text{for } j\in A.$$

For $j\in A^c$, we have

$$s_j(\hat\beta) = \frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A) + 2\lambda_{2n}\sum_{l=1}^{p_n}(D^\top D)_{lj}\hat\beta_l. \qquad (\mathrm{A.3})$$

We observe that

$$\begin{aligned}
\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A)\Big| > \lambda_{1n}\Big\} &\le \Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\}\\
&\quad + \Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\hat\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\}. \qquad (\mathrm{A.4})
\end{aligned}$$

By Lemma A.2, the first term of inequality (A.4) is $o_p(1)$. From Lemma A.1, the second term of inequality (A.4) is bounded by

$$\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\hat\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\} \le \Pr\Big\{\max_{j\in A^c}\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\}. \qquad (\mathrm{A.5})$$

Together with Conditions 1 and 4, we have

$$\begin{aligned}
&\max_{j\in A^c}\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big|\\
&\quad\le M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\sqrt{n^{-1}\sum_{i=1}^{n}(\beta_A - \beta_A^*)^\top x_{iA}x_{iA}^\top(\beta_A - \beta_A^*)}\\
&\quad= M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\sqrt{(\beta_A - \beta_A^*)^\top(n^{-1}X_A^\top X_A)(\beta_A - \beta_A^*)}\\
&\quad\le M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\,\|\beta_A - \beta_A^*\|\sqrt{\lambda_{\max}(n^{-1}X_A^\top X_A)}\\
&\quad= O_p\{\log(p_n n)\}\sqrt{q_n/n} = o_p(\lambda_{1n}), \qquad (\mathrm{A.6})
\end{aligned}$$

as $n \to \infty$, by the fact that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$.

By (A.4)–(A.6), as $n \to \infty$, we have

$$\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A)\Big| > \lambda_{1n}\Big\} \to 0. \qquad (\mathrm{A.7})$$

Then, by the structure of the matrix $D^\top D$,

$$2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}\hat\beta_l\Big| \le 2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}(\hat\beta_l - \beta_l^*)\Big| + 2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}\beta_l^*\Big| \le 2\sqrt{70}\,\lambda_{2n}\|\hat\beta_A - \beta_A^*\| + 2\sqrt{6}\,\lambda_{2n}\|D\beta^*\| = o_p(\lambda_{1n}). \qquad (\mathrm{A.8})$$

By (A.3), (A.7) and (A.8), we have $|s_j(\hat\beta)| \le \lambda_{1n}$ for $j\in A^c$.

As the oracle estimator $\hat\beta$ is defined as

$$\hat\beta = \arg\min_{\beta_{A^c} = 0}\Big\{\frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A\Big\},$$

$|\hat\beta_j| = 0$ for $j\in A^c$ holds naturally.

Appendix 3. Proof of Theorem 2.1.

Proof.

Let

$$Q_n(\beta) = L_n(\beta) + \sum_{j=1}^{p_n}p_{\lambda_{1n}}(|\beta_j|) + \lambda_{2n}\beta^\top D^\top D\beta \overset{\Delta}{=} g(\beta) - h(\beta),$$

where

$$g(\beta) = L_n(\beta) + \lambda_{1n}\sum_{j=1}^{p_n}|\beta_j| + \lambda_{2n}\beta^\top D^\top D\beta,\qquad h(\beta) = \lambda_{1n}\sum_{j=1}^{p_n}|\beta_j| - \sum_{j=1}^{p_n}p_{\lambda_{1n}}(|\beta_j|).$$

With $Q_n(\beta)$ written as $g(\beta) - h(\beta)$, we need to show that $\hat\beta$ is a local minimizer of $Q_n(\beta)$. Based on Lemma A.3, the proof is similar to that of Theorem 3.2 in [27], and we omit it here.

Funding Statement

This study was supported by the National Natural Science Foundation of China [grant number 11971404], [grant number 71471152], Humanity and Social Science Youth Foundation of Ministry of Education of China [grant number 19YJC910010], [grant number 20YJC910004], the 111 Project (B13028) and Fundamental Research Funds for the Central Universities [grant number 20720181003].

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Bradley P.S. and Mangasarian O.L., Feature selection via concave minimization and support vector machines, ICML 98 (1998), pp. 82–90.
2. Bühlmann P. and van de Geer S., Statistics for High-dimensional Data: Methods, Theory and Applications, Springer, New York, 2011.
3. Chan W.H., Mohamad M.S., Deris S., Corchado J.M., and Kasim S., An improved gSVM-SCADL2 with firefly algorithm for identification of informative genes and pathways, Int. J. Bioinf. Res. Appl. 12 (2016), pp. 72–93.
4. Chen W.J. and Tian Y.J., $\ell_p$-norm proximal support vector machine and its applications, Procedia Computer Sci. 1 (2010), pp. 2417–2423.
5. De Leeuw J. and Heiser W.J., Convergence of correction matrix algorithms for multidimensional scaling, in Geometric Representations of Relational Data, Vol. 36, Mathesis Press, Ann Arbor, 1977, pp. 735–752.
6. Fan J. and Fan Y., High dimensional classification using features annealed independence rules, Ann. Stat. 36 (2008), pp. 2605–2637.
7. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
8. Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Stat. 32 (2004), pp. 928–961.
9. Guo J., Hu J., Jing B.Y., and Zhang Z., Spline-lasso in high-dimensional linear regression, J. Am. Stat. Assoc. 111 (2016), pp. 288–297.
10. Guyon I., Weston J., Barnhill S., and Vapnik V., Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002), pp. 389–422.
11. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
12. Hebiri M. and van de Geer S., The smooth-lasso and other $\ell_1+\ell_2$-penalized methods, Electron. J. Stat. 5 (2011), pp. 1184–1226.
13. Hunter D.R. and Lange K., A tutorial on MM algorithms, Am. Stat. 58 (2004), pp. 30–37.
14. Lange K., Hunter D.R., and Yang I., Optimization transfer using surrogate objective functions, J. Comput. Graph. Stat. 9 (2000), pp. 1–20.
15. Mangasarian O.L., A finite Newton method for classification, Optim. Methods Softw. 17 (2002), pp. 913–929.
16. Petricoin III E.F., Ardekani A.M., Hitt B.A., Levine P.J., Fusaro V.A., Steinberg S.M., Mills G.B., Simone C., Fishman D.A., Kohn E.C., and Liotta L.A., Use of proteomic patterns in serum to identify ovarian cancer, The Lancet 359 (2002), pp. 572–577.
17. Rosset S. and Zhu J., Piecewise linear regularized solution paths, Ann. Stat. 35 (2007), pp. 1012–1030.
18. Tibshirani R., Saunders M., Rosset S., Zhu J., and Knight K., Sparsity and smoothness via the fused lasso, J. R. Stat. Soc. Ser. B (Methodological) 67 (2005), pp. 91–108.
19. Vapnik V., The Nature of Statistical Learning Theory, Springer, New York, 1995.
20. Wang L., Zhu J., and Zou H., Hybrid huberized support vector machines for microarray classification and gene selection, Bioinformatics 24 (2008), pp. 412–419.
21. Yang Y. and Zou H., An efficient algorithm for computing the HHSVM and its generalizations, J. Comput. Graph. Stat. 22 (2013), pp. 396–415.
22. Yang Y. and Zou H., A fast unified algorithm for solving group-lasso penalized learning problems, Stat. Comput. 25 (2015), pp. 1129–1141.
23. Yuan M., High dimensional inverse covariance matrix estimation via linear programming, J. Mach. Learn. Res. 11 (2010), pp. 2261–2286.
24. Zhang C.H., Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2010), pp. 894–942.
25. Zhang H.H., Ahn J., Lin X., and Park C., Gene selection using support vector machines with non-convex penalty, Bioinformatics 22 (2005), pp. 88–95.
26. Zhang C.H. and Huang J., The sparsity and bias of the lasso selection in high-dimensional linear regression, Ann. Stat. 36 (2008), pp. 1567–1594.
27. Zhang X., Wu Y., Wang L., and Li R., Variable selection for support vector machines in moderately high dimensions, J. R. Stat. Soc. Ser. B (Methodological) 78 (2016), pp. 53–76.
28. Zhu J., Rosset S., Hastie T., and Tibshirani R., 1-norm support vector machines, Adv. Neural Inf. Process. Syst. 16 (2004), pp. 49–56.
29. Zou H. and Li R., One-step sparse estimates in nonconcave penalized likelihood models, Ann. Stat. 36 (2008), pp. 1509–1533.
